Bayesian adaptive design for device surveillance

Abstract

Background

Postmarket device surveillance studies often have important primary objectives tied to estimating a survival function at some future time $T$ with a certain amount of precision.

Purpose

This article presents the details and various operating characteristics of a Bayesian adaptive design for device surveillance, as well as a method for estimating a sample size vector (determined by the maximum sample size and a preset number of interim looks) that will deliver the desired power.

Methods

We adopt a Bayesian adaptive framework, which recognizes the fact that persons enrolled in a study report their results over time, not all at once. At each interim look, we assess whether we expect to achieve our goals with only the current group or whether the achievement of such goals is extremely unlikely even for the maximum sample size.

Results

Our Bayesian adaptive design can outperform two nonadaptive frequentist methods currently recommended by Food and Drug Administration (FDA) guidance documents in many settings.

Limitations

Our method’s performance can be sensitive to model misspecification and changes in the trial’s enrollment rate.

Conclusions

The proposed design provides a more efficient framework for conducting postmarket surveillance of medical devices.

Introduction

Current Bayesian adaptive trial design methods have largely focused on the comparison of two or more samples, typically in the realm of hypothesis testing where the research question generally concerns whether group A differs from group B. By contrast, the one-sample problem is largely an estimation problem where the research question concerns the value of a population parameter. At the outset of a one-sample problem, a researcher sets a goal of estimating a parameter with a certain precision, sometimes with an associated acceptable threshold for efficacy. For example, we may wish to measure the percentile of medical device survival at 5 years with a 95% confidence interval (CI) having half-width no larger than .03. If the survival rate is significantly below 95%, the device may need to be recalled. Here, the precision is the half-width of the interval, which improves (shrinks) as the sample size increases, and the efficacy threshold is 95% device survival at 5 years.

The reason this is relevant in surveillance settings stems from the fact that the device should already have been established as having reasonable assurance of safety and efficacy prior to receiving market approval. After receiving approval, the relevant scientific question for the sponsor is often one of estimation (‘What is the percentile of freedom from stroke at five years post-implant for device A?’), as opposed to one of testing (‘Is the percentile of freedom from stroke at five years post-implant the same for device A and device B?’).

In such scenarios, studies may have important primary objectives tied to estimating quantities with a certain amount of precision, even if they do not necessarily have a hypothesis that is intended to be tested for the primary objective. Sample sizes in such studies should be statistically justified, even if they are not obtained through the traditional routes of power calculations performed under alternative hypotheses. The estimation approach is in agreement with 21 CFR 822, which states that a surveillance plan must include a discussion of the plan objective addressing the surveillance question(s) identified in the order; no mention of a hypothesis test approach is explicitly made in 21 CFR 822. In contrast, the current draft guidance document from the Food and Drug Administration (FDA) covering postmarket surveillance [1] calls for both study objectives and hypotheses to be included in a postmarket surveillance study plan using a standard (nonadaptive) frequentist sample size calculation. This article will elucidate the manner in which statistical sample size computation may be performed in the situation where a primary objective exists, but a hypothesis does not.

The frequentist approach can provide an estimated sample size to achieve the precision-based goal mentioned earlier under certain other constraints/assumptions (e.g., dropout rate and expected device survival percentile at the time of interest). One frequentist variant starts with the estimation of the Kaplan–Meier survival curve, applies the alternative expression for the standard error of the survival estimate by Peto et al. [2] (useful when the estimate of the survival function is close to 0 or 1), and performs an ad hoc scaling incorporating a censoring rate out to a given estimation time. The resulting expression was described in a guidance document from the FDA [3]. However, if the survival curve is assumed to arise from a Weibull distribution, then the methods described in Meeker and Nelson [4] can be applied.

This general problem of estimating a single survival curve with a certain amount of precision lends itself well to a Bayesian approach. Even under the frequentist approach to sample size estimation, we require a prior guess as to the device survival percentile at the time of interest. Instead of treating this guess as fixed, we can place a prior distribution on this parameter and then attack the problem as a Bayesian. For a general overview of Bayesian trial design, see Berry [5] as well as the most recent FDA guidance document on Bayesian methods [6]. Work summarized in Berry et al. [7] has approached similar problems of sample size in the two-sample setting by using interim looks and predictive probabilities of success. That is, one assesses the probability of achieving the given characteristics with the current sample size given the information already obtained from these observations.

Here, we too adopt a Bayesian interim look framework, which recognizes the fact that persons enrolled in a study report their results over time, not all at once. Our problem consists of predicting whether we will have enough information from the presently enrolled sample to achieve the desired characteristics after follow-up on the current group, or if given the current information, it seems futile that we will ever achieve the desired characteristics even if we continue enrollment to the prespecified maximum sample size. Simply put, at each interim look, we assess whether we expect to achieve our goals with only the current group or whether the achievement of such goals is extremely unlikely even for the maximum sample size. If either of these two situations is present, we stop enrollment at the current sample size. If not, we continue enrollment until the next interim look or the maximum sample size, whichever is dictated by the design. Once we have halted enrollment, we will continue to follow the enrolled cohort to the end of the study (e.g., 5 years minimum follow-up), at which point we evaluate the results.

Because this is a Bayesian adaptive design, each decision is based on a posterior distribution. In this design, we are making decisions at several points in time. We are deciding whether to halt enrollment at each interim sample size using a predictive calculation driven by the interim posterior, and we are making a final decision about the result of the trial based on the posterior distribution at the very end. The interim looks are set to take place when enrollment hits a prespecified number of persons. Thus, this design can be summarized as a sample size vector whose components detail the interim look sample sizes and the maximum sample size. The intent of this article is to present the general details of this design and its operating characteristics in some settings, as well as a method for estimating a sample size vector that delivers a trial with the desired power.

Methods

Before presenting our Bayesian adaptive design, we briefly review two standard frequentist designs. We continue in the device surveillance framework, where our aim is to estimate an interval for the percentile of device survival $(L)$ at a time point of interest $(T)$ . A trial will be defined as a failure if either the precision, defined as the half-width of the interval and denoted by $δ$ , is at least as large as a prespecified bound $ε$ , or the upper limit of the estimated interval, denoted by $U_{CI}$ , is less than the efficacy threshold $(L_{ET})$ . Otherwise, the trial is defined as a success. Thus, given $δ < ε$ , a trial is considered successful when at least part of the interval lies above $L_{ET}$ , and is considered a failure only when $U_{CI} < L_{ET}$ . A trial is always considered a failure when $δ \geq ε$ . The details of the two frequentist designs, nonparametric Kaplan–Meier and parametric Weibull, and the Bayesian adaptive design are presented in the following two sections. Then, section ‘Bayesian adaptive design sample size estimation technique’ describes methods to estimate a sample size vector for the Bayesian adaptive design that will deliver any prespecified power.
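As a minimal sketch of the success rule just described, the function below classifies a trial from the two limits of its interval for $L$. The function name and the returned labels are illustrative assumptions and do not appear in the original design.

```python
def trial_result(lower, upper, eps, l_et):
    """Classify a trial from its interval (lower, upper) for L = S(T).

    eps  : bound on the half-width (epsilon in the text)
    l_et : efficacy threshold (L_ET in the text)
    Hypothetical helper; the labels are illustrative only.
    """
    half_width = (upper - lower) / 2.0
    if half_width >= eps:
        # Insufficient precision always counts as a failure.
        return "failure: imprecise"
    if upper < l_et:
        # Precise enough, but the whole interval lies below the threshold.
        return "failure: below efficacy threshold"
    return "success"


print(trial_result(0.93, 0.98, eps=0.03, l_et=0.95))
```

The same rule applies whether the interval is a frequentist CI or a posterior credible interval.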

Frequentist models

We compare our Bayesian results to those of two frequentist models: nonparametric Kaplan–Meier estimation and parametric Weibull estimation, both used to obtain a $100 (1 - α) \%$ CI for $L$ at a time $T$ . Beginning with the Kaplan–Meier model, the log-transformed point-wise CI is used to decide the result of the trial. Details of the implementation and calculation of the CI for the survival percentile at a particular time for the Weibull model are described in Section 12.2 of Klein and Moeschberger [8]. Throughout this article, for computational convenience, the Weibull shape parameter $(β)$ is fixed, often with a value of 1, implying device survival is exponential.

Under the nonparametric Kaplan–Meier design, the researcher calculates the sample size by plugging hypothesized parameter values and desired characteristics into an equation (shown in Appendix 1) described by an FDA guidance document [3]. Alternatively, a parametric Weibull approach discussed by Meeker and Nelson [4] can be used as guidance, though a direct sample size calculation does not exist in this setting. Meeker and Nelson describe techniques for sample size calculation when the researcher desires an estimate of survival times at a particular percentile. However, here we are concerned with the inverse problem of estimating the survival percentile at a particular time; their methods can be adapted to this setting to a certain extent. The incongruence of the two settings arises in the desired precision: for Meeker and Nelson, the researcher specifies the acceptable degree of error for the desired percentile (e.g., if the 95th percentile is expected to be 5 years, we may willingly accept estimates between 4 and 6 years, or within 20%). In our setting, we are estimating the survival percentile $(L)$ at a specific time $(T)$ and we are willing to accept $δ < ε$ .

This inverse problem does not translate directly from the methods of Meeker and Nelson. Based on simulations (not shown here), the nonparametric sample size calculation appears to offer a conservative estimate for the parametric setting. When the parametric assumption is approximately correct, the estimated sample size will provide a CI with a smaller half-width relative to the nonparametric interval. Additionally, these existing frequentist methods were developed in the absence of an efficacy threshold, so no previous sample size estimation methods appear to exist for exactly the setting of our concern.

Bayesian adaptive model details

In the Bayesian setting, we no longer estimate CIs, but rather posterior credible intervals. Thus, we determine our Bayesian trial result using the half-width and upper limit of the posterior credible interval. Recall that we also have interim stopping rules that are assessed when enrollment has reached the interim sample size specified by the investigator: we stop enrollment for expected success when the predictive probability of success for the current sample is above a large cutoff value $(C_{S})$ , and we stop enrollment for futility if the predictive probability of success for the maximum sample size $(M)$ is below a small cutoff value $(C_{F})$ .

During each interim look, we must evaluate the predictive distribution arising from the interim posterior, whereas for the final analysis, we evaluate the posterior distribution. For flexibility, we assume that every device has a survival time that is Weibull with a fixed shape parameter $(β)$ and each enrollee has a dropout/censoring time that is exponential with known rate $λ$ . The Weibull parameterization in R [9] has density $f (x) = (β / η) {(x / η)}^{β - 1} \exp (- {(x / η)}^{β})$ with scale parameter $η$ . Zhang and Meeker [10] show that the conjugate prior for a transformed scale parameter $(θ = η^{β})$ of the Weibull distribution with a fixed $β$ is the inverse gamma distribution, which we notate $θ \sim IG (a, b)$ , and which has density function $π (θ | a, b) = (b^{a} / Γ (a)) θ^{- a - 1} \exp (- b / θ)$ .

Let $x_{1}, \dots, x_{M}$ be the set of observation times, $R$ denote the set of observations with an observed device failure, and $C$ the set of right-censored observations. If we assume $X_{i} | θ \overset{iid}{\sim} Weibull (β, θ)$ , the likelihood given $x = (x_{1}, \dots, x_{M})$ is given by

\begin{matrix} L (θ; x) = \prod_{x_{i} \in R} f (x_{i} | β, θ) \prod_{x_{i} \in C} (1 - F (x_{i} | β, θ)) \\ \propto \prod_{x_{i} \in R} θ^{- 1} \exp (- \frac{x_{i}^{β}}{θ}) \prod_{x_{i} \in C} \exp (- \frac{x_{i}^{β}}{θ}) = θ^{- r} \exp (- \frac{\sum_{i = 1}^{M} x_{i}^{β}}{θ}) \end{matrix}

where $r$ is the number of device failures. The conditional posterior for $θ$ emerges as

\begin{matrix} π (θ | x, a, b, β) \propto θ^{- a - r - 1} \exp (- \frac{b + \sum_{i = 1}^{M} x_{i}^{β}}{θ}) \\ \equiv IG (a + r, b + \sum_{i = 1}^{M} x_{i}^{β}) \end{matrix}
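The conjugate update above amounts to adding the number of failures to the first hyperparameter and the sum of the transformed follow-up times to the second. A minimal sketch follows; the function name and argument layout are assumptions made for illustration.

```python
def ig_posterior(times, failed, beta, a, b):
    """Return the posterior (shape, scale) of theta under an IG(a, b) prior.

    times  : observed follow-up times x_i (failure or right-censoring time)
    failed : booleans, True where a device failure was observed
    beta   : fixed Weibull shape parameter
    Hypothetical helper illustrating IG(a + r, b + sum(x_i ** beta)).
    """
    r = sum(1 for f in failed if f)               # number of observed failures
    total = sum(t ** beta for t in times)         # transformed follow-up time
    return a + r, b + total


# Three devices, one observed failure, exponential survival (beta = 1).
print(ig_posterior([1.0, 2.0, 5.0], [True, False, False], beta=1.0, a=1.0, b=5.0))
```

Censored observations contribute to the second hyperparameter only, which mirrors the likelihood factorization above.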

The last design element we must specify is the prior distribution on $θ$ ; however, we are primarily concerned with the quantity $L = S (T)$ . Since $L = \exp (- (T / η)^{β}) = \exp (- T^{β} / θ)$ , or equivalently, $θ = g (L) = - T^{β} / \log (L)$ , placing a prior distribution on either parameter implicitly defines a prior on the other. For instance, if we choose $L \sim p_{L} (L)$ , because $g (L)$ is monotonically increasing on $L \in (0, 1)$ , the derivation of $p_{θ} (θ)$ is a straightforward transformation of random variables calculation. Namely, $p_{θ} (θ) = p_{L} (g^{- 1} (θ)) | d g^{- 1} (θ) / d θ |$ , where $d g^{- 1} (θ) / d θ = \exp (- T^{β} / θ) T^{β} / θ^{2}$ . Taking $L \sim Beta (γ, ξ)$ , we have

\begin{matrix} p_{θ} (θ) \propto θ^{- 2} \exp (\frac{- T^{β}}{θ}) \exp (\frac{- (γ - 1) T^{β}}{θ}) {(1 - \exp (\frac{- T^{β}}{θ}))}^{ξ - 1} \\ \propto θ^{- 1 - 1} \exp (\frac{- γ T^{β}}{θ}) {(1 - \exp (\frac{- T^{β}}{θ}))}^{ξ - 1} \end{matrix}

We would like a noninformative prior distribution, which for a proportion often manifests as $U (0, 1) \equiv Beta (1, 1)$ . Plugging in these noninformative hyperparameter values, the term raised to the $(ξ - 1)$ power equals 1 and drops out, and we find that $p_{θ} (θ) \equiv IG (1, T^{β})$ . Thus, the conditional posterior for $θ$ emerges as $IG (1 + r, T^{β} + \sum_{i = 1}^{M} x_{i}^{β})$ , with a prior mean corresponding to 50% device survival at time $T$ and a 95% prior credible interval of $(2.5 \%, 97.5 \%)$ survival at time $T$ .
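The claim that a uniform prior on $L$ induces an $IG (1, T^{β})$ prior on $θ$ can be checked numerically. The sketch below, which is only an illustration with hypothetical function names, draws $L$ uniformly, transforms to $θ = - T^{β} / \log (L)$ , and compares the empirical distribution function with the $IG (1, T^{β})$ distribution function $\exp (- T^{β} / t)$ .

```python
import math
import random


def induced_theta_draws(T, beta, n_draws, seed=0):
    """Draw L ~ U(0, 1) and return theta = -T**beta / log(L) for each draw."""
    rng = random.Random(seed)
    draws = []
    while len(draws) < n_draws:
        L = rng.random()
        if L > 0.0:                       # guard against log(0)
            draws.append(-T ** beta / math.log(L))
    return draws


def ig1_cdf(t, T, beta):
    """Distribution function of IG(1, T**beta) evaluated at t > 0."""
    return math.exp(-T ** beta / t)


draws = induced_theta_draws(T=5.0, beta=1.0, n_draws=20000)
for t in (2.0, 5.0, 20.0):
    empirical = sum(d <= t for d in draws) / len(draws)
    print(t, round(empirical, 3), round(ig1_cdf(t, 5.0, 1.0), 3))
```

The two columns should agree up to Monte Carlo error, which is on the order of 0.004 for this number of draws.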

To make use of all the information available when making decisions about enrollment, we update the interim posterior when the next group is ready to enroll. So, when we check the trial at the first interim look of size $m_{1}$ enrollees, we will have accumulated some amount of follow-up in the most recent group enrolled, some amount more for the previously enrolled group, and so on back to the greatest accumulation of follow-up for the first group enrolled. For an interim look of size $m_{k}$ enrollees ( $k = 1, \dots, K$ , where $K$ is the potential number of interim looks), we can use this partial information to get the number of events already unmasked $(r_{k})$ and the sum of the adjusted follow-up times $(\sum_{i = 1}^{m_{k}} x_{ki}^{β})$ . Then, the interim posterior at interim look $k$ under our noninformative prior is

P (θ | x_{k}) = IG (1 + r_{k}, T^{β} + \sum_{i = 1}^{m_{k}} x_{ki}^{β}) \qquad (1)
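Because $L = \exp (- T^{β} / θ)$ , a posterior of this form for $θ$ yields a credible interval for $L$ through the quantiles of $τ = T^{β} / θ$ , which is gamma distributed with integer shape $1 + r$ and rate $1 + \sum {(x_{i} / T)}^{β}$ under the noninformative prior. The sketch below assumes an equal-tailed interval (the text does not say whether equal-tailed or highest posterior density intervals are used) and inverts the closed-form Erlang distribution function by bisection; all function names are hypothetical.

```python
import math


def erlang_cdf(t, shape, rate):
    """CDF of a gamma distribution with integer shape (Erlang)."""
    x, term, total = rate * t, 1.0, 1.0
    for j in range(1, shape):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total


def erlang_quantile(p, shape, rate):
    """Invert erlang_cdf by bisection."""
    lo, hi = 0.0, shape / rate
    while erlang_cdf(hi, shape, rate) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if erlang_cdf(mid, shape, rate) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def survival_interval(times, failed, T, beta, alpha=0.05):
    """Equal-tailed 100(1 - alpha)% credible interval for L = S(T)."""
    shape = 1 + sum(1 for f in failed if f)
    rate = 1.0 + sum((t / T) ** beta for t in times)
    # L = exp(-tau) is decreasing in tau, so the quantiles swap.
    lower = math.exp(-erlang_quantile(1 - alpha / 2, shape, rate))
    upper = math.exp(-erlang_quantile(alpha / 2, shape, rate))
    return lower, upper


# With no data, the interval reduces to the uniform prior interval.
print(survival_interval([], [], T=5.0, beta=1.0))   # approximately (0.025, 0.975)
```

With no observations, the result reproduces the prior interval for a $U (0, 1)$ prior on $L$ , which offers a quick sanity check on the parameterization.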

Working in the other direction, since $(θ | x) \sim IG (κ, ζ)$ , by taking $τ = - \log (L)$ so that $τ = h (θ) = T^{β} / θ$ , we see $p_{τ} (τ) \propto τ^{κ - 1} \exp (- (ζ / T^{β}) τ) \equiv Gamma (κ, (ζ / T^{β}))$ . Plugging in the relevant parameter values reveals that $(τ | x) \sim Gamma (r + 1, 1 + \sum_{i = 1}^{m} {(x_{i} / T)}^{β})$ . Since $L = \exp (- τ)$ , the distribution of $(L | x)$ is not common, but an application of the delta method provides an approximation of $Var [L | x]$ , and $E [L | x]$ can be calculated directly. Having the first two posterior moments will provide insight into the distribution of $(L | x)$ . Now, letting $κ = r + 1$ and $ζ = 1 + \sum {(x_{i} / T)}^{β}$ denote the shape and rate of this gamma distribution,

\begin{matrix} E [L | x] = \int_{0}^{\infty} \exp (- τ) \frac{ζ^{κ}}{Γ (κ)} τ^{κ - 1} \exp (- ζ τ) \, d τ \\ = \frac{ζ^{κ}}{{(ζ + 1)}^{κ}} \int_{0}^{\infty} \frac{{(ζ + 1)}^{κ}}{Γ (κ)} τ^{κ - 1} \exp (- (ζ + 1) τ) \, d τ \\ = {(\frac{ζ}{ζ + 1})}^{κ} = {(\frac{1 + \sum {(\frac{x_{i}}{T})}^{β}}{2 + \sum {(\frac{x_{i}}{T})}^{β}})}^{r + 1} \end{matrix}

Applying the delta method with $g (τ) = \exp (- τ)$

\begin{matrix} Var [L | x] \approx {(g' (E [τ | x]))}^{2} Var [τ | x] = {(\exp (- \frac{κ}{ζ}))}^{2} \frac{κ}{ζ^{2}} \\ = \exp (\frac{- 2 (r + 1)}{1 + \sum {(\frac{x_{i}}{T})}^{β}}) \frac{r + 1}{{(1 + \sum {(\frac{x_{i}}{T})}^{β})}^{2}} \end{matrix}

Note that $Var [L | x]$ depends upon the fixed shape parameter $β$ as well as the total transformed follow-up time $(\sum {(x_{i} / T)}^{β})$ and number of failures $(r)$ . Holding all else constant, $Var [L | x]$ increases with $β$ when follow-up times are predominantly greater than $T$ (i.e., $S (T) > 0.5$ ); otherwise, the relationship flips, and $Var [L | x]$ decreases as follow-up increases. The relationship between the number of failures and $Var [L | x]$ is more complex, although, in our setting, $Var [L | x]$ increases as the number of failures observed increases.
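The two posterior moments can be evaluated directly from the number of failures and the transformed follow-up. The sketch below also reports the exact variance, which follows from the gamma moment generating function as $E [L^{2} | x] = {(ζ / (ζ + 2))}^{κ}$ with $ζ = 1 + \sum {(x_{i} / T)}^{β}$ ; this exact form is not part of the original derivation and is included only to gauge the delta-method approximation. The function name is hypothetical.

```python
import math


def posterior_moments(times, failed, T, beta):
    """Return (mean, delta-method variance, exact variance) of L | x.

    Uses tau | x ~ Gamma(kappa, rate zeta) with kappa = r + 1 and
    zeta = 1 + sum((x_i / T) ** beta), and L = exp(-tau).
    """
    kappa = 1 + sum(1 for f in failed if f)
    zeta = 1.0 + sum((t / T) ** beta for t in times)
    mean = (zeta / (zeta + 1.0)) ** kappa
    var_delta = math.exp(-2.0 * kappa / zeta) * kappa / zeta ** 2
    var_exact = (zeta / (zeta + 2.0)) ** kappa - mean ** 2
    return mean, var_delta, var_exact


# 200 devices followed to T = 5 years, 10 of which are recorded as failures.
times = [5.0] * 200
failed = [True] * 10 + [False] * 190
print(posterior_moments(times, failed, T=5.0, beta=1.0))
```

In this illustration the delta-method and exact variances differ by roughly 1%, whereas with no data at all (a uniform distribution on $L$ ) the approximation is poor, as expected for a first-order expansion.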

The complexity of this problem lies primarily in the computation of the expected success and futility at each interim look via Bayesian predictive probability distributions. At each interim look, we have partial information about the current group of enrollees and no information from potential future enrollees. We use the partial information at hand to update the prior to the interim posterior and then calculate the two predictive distributions. Rather than calculating the exact probabilities of expected success and futility from their corresponding predictive distributions, we estimate them using Monte Carlo sampling. First, we draw a large sample $(θ_{k}^{(1)}, \dots, θ_{k}^{(g)}, \dots, θ_{k}^{(G)})$ from the interim posterior described in equation (1). With this sample on hand, we can simulate the remaining follow-up for the currently enrolled individuals and potential future enrollees.

For current enrollees, we only need to simulate remaining follow-up for those who have neither experienced device failure nor dropped out (i.e., those remaining in the risk set at the time of the interim look) via composition from a left-truncated Weibull distribution with the sampled scale parameter $θ_{k}^{(g)}$ and left truncation time corresponding to each individual’s accumulated follow-up at the time of interim look $k$ . For potential future enrollees, we simulate survival times from a Weibull distribution with sampled scale parameter $θ_{k}^{(g)}$ . For each sample $g$ , we also simulate dropout/censoring times from an exponential distribution with known rate $λ$ . Finally, if we assume a constant enrollment rate (e.g., 10 persons every 3 months) and carry out each interim look when the next group/person is ready to enroll, the follow-up accumulated for each enrollee is a straightforward calculation.
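One way to draw from the left-truncated Weibull distribution is by inversion: under $S (x) = \exp (- x^{β} / θ)$ , the conditional survival function given $X > c$ is $\exp (- (x^{β} - c^{β}) / θ)$ , so $(X^{β} - c^{β}) / θ$ is a standard exponential variate. The sketch below uses this identity; the function names are hypothetical, and inversion is an implementation choice made here rather than a detail given in the text.

```python
import random


def remaining_failure_time(c, theta, beta, rng):
    """Draw a failure time X | X > c when S(x) = exp(-x**beta / theta).

    c = 0 gives an untruncated draw, as needed for future enrollees.
    """
    e = rng.expovariate(1.0)                      # standard exponential
    return (c ** beta + theta * e) ** (1.0 / beta)


def dropout_time(c, lam, rng):
    """Draw a dropout time beyond accumulated follow-up c.

    The exponential distribution is memoryless, so the residual time to
    dropout is again exponential with rate lam.
    """
    return c + rng.expovariate(lam)


rng = random.Random(1)
draws = [remaining_failure_time(2.0, 10.0, 1.0, rng) for _ in range(5)]
print(draws)   # every draw is at least the truncation time 2.0
```

Each draw of $θ_{k}^{(g)}$ from the interim posterior would be passed as `theta`, so that the simulated follow-up reflects posterior uncertainty about device survival.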

Expected success at interim look $k$ $({\hat{E}}_{S_{k}})$ is calculated with respect to the currently enrolled cohort. Assuming no more persons will be enrolled, the study will end when the most recent enrollee has accumulated follow-up $T$ . So, for each simulated trial, every current enrollee’s status and follow-up can be determined at study end. Armed with $G$ trials containing simulated full information, we update each predictive posterior distribution and record each result. We take ${\hat{E}}_{S_{k}}$ to be the proportion of predicted trials for currently enrolled cohorts that had a successful outcome. Now, futility at interim look $k$ $(1 - {\hat{E}}_{F_{k}})$ is calculated relative to the prespecified maximum sample size $M$ . Assuming that enrollment continues to the maximum sample size $M$ , the study will end when the last potential future enrollee has accumulated $T$ follow-up. So, for each simulated trial of both current and potential future enrollees, the simulated status and follow-up can be determined at the study end for each enrollee. We take ${\hat{E}}_{F_{k}}$ as the proportion of predicted trials for maximum-sized cohorts that meet the final criterion for success, so $1 - {\hat{E}}_{F_{k}}$ is the proportion of predicted trials that fail.

The expected success and futility calculations outlined above provide us with a one-off method of calculating ${\hat{E}}_{S}$ and ${\hat{E}}_{F}$ at each interim look, and thus making a sensible decision about halting enrollment. The calculated values ${\hat{E}}_{S}$ and ${\hat{E}}_{F}$ dictate whether to halt enrollment or continue to the next interim look. If ${\hat{E}}_{S}$ is larger than $C_{S}$ , we halt enrollment, because given the current information, it is very likely that we will achieve the desired characteristics with the currently enrolled cohort. If instead, ${\hat{E}}_{F}$ is below $C_{F}$ , the trial is defined as futile, because given the current information, it is very unlikely that we will achieve the desired characteristics when enrolling $M$ persons. Once enrollment has finished, we follow the cohort to the study end, update the posterior, and record the final trial result. When an actual trial operating under this design has enrolled $m_{k}$ persons and the next group is ready to be enrolled, the calculation of expected success and futility and subsequently the enrollment decision can be carried out using Algorithm 1.
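The interim decision described above can be sketched as follows. This is not a reproduction of Algorithm 1 but a simplified illustration under stated assumptions: equal-tailed credible intervals, the noninformative prior, groups of $n$ enrolled every `lag` years with the next group enrolling at the time of the look, and a positive dropout rate. All function and argument names are hypothetical.

```python
import math
import random


def erlang_cdf(t, shape, rate):
    x, term, total = rate * t, 1.0, 1.0
    for j in range(1, shape):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total


def erlang_quantile(p, shape, rate):
    lo, hi = 0.0, shape / rate
    while erlang_cdf(hi, shape, rate) < p:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if erlang_cdf(mid, shape, rate) < p else (lo, mid)
    return 0.5 * (lo + hi)


def is_success(obs, T, beta, eps, l_et, alpha):
    """obs: list of (follow_up, failed). Half-width < eps and upper >= l_et."""
    shape = 1 + sum(1 for _, f in obs if f)
    rate = 1.0 + sum((x / T) ** beta for x, _ in obs)
    lower = math.exp(-erlang_quantile(1 - alpha / 2, shape, rate))
    upper = math.exp(-erlang_quantile(alpha / 2, shape, rate))
    return (upper - lower) / 2 < eps and upper >= l_et


def complete(x, admin, theta, beta, lam, rng):
    """Simulate final (follow_up, failed) for a person at risk with follow-up x."""
    fail = (x ** beta + theta * rng.expovariate(1.0)) ** (1.0 / beta)
    stop = min(x + rng.expovariate(lam), admin)   # dropout or study end
    return (fail, True) if fail <= stop else (stop, False)


def interim_decision(cohort, now, M, n, lag, T, beta, lam, eps, l_et,
                     c_s, c_f, alpha=0.05, G=1000, seed=0):
    """cohort: list of (entry_time, follow_up, state) with state in
    {'risk', 'failed', 'dropped'}. Returns (decision, E_S_hat, E_F_hat)."""
    rng = random.Random(seed)
    kappa = 1 + sum(1 for _, _, s in cohort if s == "failed")
    zeta = T ** beta + sum(x ** beta for _, x, _ in cohort)
    # Entry times of potential future enrollees if enrollment runs to M.
    future = [now + lag * (i // n) for i in range(M - len(cohort))]
    end_cur = max(e for e, _, _ in cohort) + T
    end_max = future[-1] + T if future else end_cur

    def predict(end, extra):
        wins = 0
        for _ in range(G):
            theta = zeta / rng.gammavariate(kappa, 1.0)   # theta ~ IG(kappa, zeta)
            obs = [(x, s == "failed") if s != "risk"
                   else complete(x, end - e, theta, beta, lam, rng)
                   for e, x, s in cohort]
            obs += [complete(0.0, end - e, theta, beta, lam, rng) for e in extra]
            wins += is_success(obs, T, beta, eps, l_et, alpha)
        return wins / G

    e_s = predict(end_cur, [])        # current cohort only
    e_f = predict(end_max, future)    # enrollment continued to M
    if e_s > c_s:
        return "halt: expected success", e_s, e_f
    if e_f < c_f:
        return "halt: futility", e_s, e_f
    return "continue", e_s, e_f
```

A call supplies the cohort as it stands at the look, for example 140 enrollees entered in groups of 10 per quarter year, together with the design constants; the returned proportions are the Monte Carlo estimates of ${\hat{E}}_{S}$ and ${\hat{E}}_{F}$ .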

To implement this design, there are a number of parameters that need to be specified. The investigator needs to define a maximally acceptable half-width $(ε)$ , credible interval coverage level $(100 (1 - α) %)$ , efficacy threshold $(L_{ET})$ , time point of interest $(T)$ , dropout/censoring rate $(λ)$ , enrollment rate information, cutoffs for expected success $(C_{S})$ and futility $(1 - C_{F})$ , and in our current Weibull model, the shape parameter $β$ . The enrollment information is broken down into the size of each enrollment group (n) and the time lag between groups (e.g., 10 persons are enrolled each quarter year). The investigator will also need to decide the number of Monte Carlo iterations (G) used to calculate the expected success $({\hat{E}}_{S})$ and futility $(1 - {\hat{E}}_{F})$ at each interim look. Finally, the investigator needs a sample size vector, determined by the maximum sample size (M) and a predetermined number of interim looks (K). The sample size vector is denoted by $(m_{1}, \dots, m_{k}, \dots, m_{K}, M)$ , where $m_{k}$ is the cumulative sample size at the kth interim look.

To assess the operating characteristics of a particular design and situation, the investigator will further need to specify the true device survival percentile $L$ at time $T$ (or equivalently $η = T / (- \log L)^{1 / β}$ ) and the number of trials to simulate (J). With these additional specifications, an assessment of the operating characteristics for a particular setting and design can be carried out using Algorithm 2.
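The conversion between the true survival percentile $L$ at time $T$ and the Weibull scale $η$ is a one-line calculation. The helper names below are hypothetical and serve only to illustrate the identity $η = T / (- \log L)^{1 / β}$ .

```python
import math


def eta_from_survival(L, T, beta):
    """Weibull scale eta such that S(T) = exp(-(T / eta) ** beta) equals L."""
    return T / (-math.log(L)) ** (1.0 / beta)


def survival_at(t, eta, beta):
    """Weibull survival function S(t) = exp(-(t / eta) ** beta)."""
    return math.exp(-((t / eta) ** beta))


eta = eta_from_survival(0.95, T=5.0, beta=1.0)
print(eta)                          # about 97.5 years when beta = 1
print(survival_at(5.0, eta, 1.0))   # recovers 0.95 up to rounding
```

Simulated failure times for an operating-characteristics study can then be drawn from a Weibull distribution with this scale and the fixed shape $β$ .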

Bayesian adaptive design sample size estimation technique

To implement this Bayesian adaptive design in practice, a common question would involve the maximum sample size (M) to be used to obtain a desired power (B) for a given number of interim looks (K). Power in this setting might be the proportion of trials that will result in a sufficiently precise credible interval that contains or is completely above the efficacy threshold. This design allows for the specification of both a maximum sample size and interim sample sizes that allow for early stopping of enrollment. A sample size calculation should thus return a sample size vector, $V = (m_{1}, \dots, m_{K}, M)$ , as opposed to a single number, that will deliver the desired power for a particular prior. In this section, by specifying enrollment rate information, an efficacy threshold, and the same parameter inputs as the nonparametric sample size equation, we indicate how a sample size vector for this design can be selected via simulation.

The vector should have certain properties that motivate the sample size calculation methodology. The maximum sample size should deliver the desired power if prior expectations come to fruition. Early interim looks provide a safeguard for uncertainty about $L$ ; thus, these looks will ideally halt enrollment when $L$ is very favorable. It would be inefficient to conduct an interim look with so little information that no matter the actual value of $L$ , enrollment will always continue. Hence, the first interim look should be related to the ‘largest plausible’ value of $L$ , which will often halt enrollment when $L$ is in fact large. Finally, the interim looks $(m_{1}, m_{2}, \dots, m_{K})$ should not be tightly clustered, but rather reasonably spread between $m_{1}$ and $M$ , with each $m_{k}$ a multiple of the enrollment group size $(n)$ .

This design provides significant flexibility regarding the placement of interim looks, but using the above principles as guidance, we suggest the following method for determining a sample size vector that will deliver the desired power for a particular prior. Here, we simulate trials under a point-mass prior on the investigator’s expected value of $L$ to mirror frequentist methods. The investigator needs to specify two values of $L$ , the ‘expected’ value and the ‘largest plausible’ value (e.g., the design has 80% power for $L = .95$ , and the investigator believes that the largest plausible value of $L$ is .98). The ‘largest plausible’ value will dictate the placement of $m_{1}$ , with the goal of halting enrollment when $L$ is in fact extreme.

Once the investigator has provided these two hypothesized values of $L$ , $m_{1}$ is set to the smallest sample size that is a multiple of $n$ and delivers the desired power for this Bayesian design when no interim looks are conducted and device survival is fixed at the ‘largest plausible’ $L$ . Similarly, the initial maximum sample size $M_{1}$ is set to the smallest sample size that is a multiple of $n$ and delivers the desired power for this Bayesian design when no interim looks are conducted and device survival is fixed at the expected $L$ . Finally, the initial sample size vector $V_{1}$ is constructed by spacing the remaining $K - 1$ interim looks evenly between $m_{1}$ and $M_{1}$ at multiples of $n$ . Finding initial values for $m_{1}$ and $M_{1}$ can be done quickly by simulating a large number of trials $(J^{*})$ with the appropriate fixed value of $L$ and an array of sample sizes. These calculations are fast because there are no interim decisions and hence no additional Monte Carlo draws are required. In each case, the smallest sample size that is a multiple of $n$ with estimated power $(\tilde{B})$ of at least $B$ is used as the initial value.
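The search for initial values can be sketched as a simulation of nonadaptive trials over a grid of sample sizes. The code below is an illustration under simplifying assumptions (equal-tailed intervals, the noninformative prior, groups of $n$ enrolled every `lag` years, follow-up until the last enrollee reaches $T$ , and a positive dropout rate); the function names are hypothetical.

```python
import math
import random


def erlang_cdf(t, shape, rate):
    x, term, total = rate * t, 1.0, 1.0
    for j in range(1, shape):
        term *= x / j
        total += term
    return 1.0 - math.exp(-x) * total


def erlang_quantile(p, shape, rate):
    lo, hi = 0.0, shape / rate
    while erlang_cdf(hi, shape, rate) < p:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if erlang_cdf(mid, shape, rate) < p else (lo, mid)
    return 0.5 * (lo + hi)


def is_success(obs, T, beta, eps, l_et, alpha):
    shape = 1 + sum(1 for _, f in obs if f)
    rate = 1.0 + sum((x / T) ** beta for x, _ in obs)
    lower = math.exp(-erlang_quantile(1 - alpha / 2, shape, rate))
    upper = math.exp(-erlang_quantile(alpha / 2, shape, rate))
    return (upper - lower) / 2 < eps and upper >= l_et


def simulate_fixed_trial(N, n, lag, L_true, T, beta, lam, rng):
    """Simulate (follow_up, failed) pairs for a trial of fixed size N."""
    theta = T ** beta / -math.log(L_true)         # theta = eta ** beta
    entries = [lag * (i // n) for i in range(N)]
    end = entries[-1] + T                         # last enrollee reaches T
    obs = []
    for e in entries:
        fail = (theta * rng.expovariate(1.0)) ** (1.0 / beta)
        stop = min(rng.expovariate(lam), end - e)  # dropout or study end
        obs.append((fail, True) if fail <= stop else (stop, False))
    return obs


def estimate_power(N, J, n, lag, L_true, T, beta, lam, eps, l_et, alpha, rng):
    """Proportion of J simulated nonadaptive trials that succeed."""
    wins = 0
    for _ in range(J):
        obs = simulate_fixed_trial(N, n, lag, L_true, T, beta, lam, rng)
        wins += is_success(obs, T, beta, eps, l_et, alpha)
    return wins / J


def smallest_powered_size(target, candidates, **design):
    """First candidate (increasing multiples of n) with estimated power >= target."""
    for N in candidates:
        if estimate_power(N, **design) >= target:
            return N
    return None
```

Calling `smallest_powered_size` once with `L_true` set to the ‘largest plausible’ value and once with the expected value would give starting values for $m_{1}$ and $M_{1}$ , respectively, subject to Monte Carlo error in the power estimates.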

With the introduction of interim looks, some of the trials will halt early and finish with a trial result that is discordant with the counterfactual trial in which enrollment had not been halted. Because we are concerned with settings in which the expected $L$ is greater than or equal to the efficacy threshold, and the cutoff for expected success is less stringent than the cutoff for futility, we will see trials halting enrollment for expected success more often than for futility. When a trial stops for expected success, the predictive probability of success is high, at least $C_{S} < 1$ , which means that the predictive probability of failure is low, no more than $1 - C_{S} > 0$ , but not 0. Some of the trials that halt enrollment may finish with a credible interval half-width slightly larger than $ε$ , whereas had these failed trials continued enrollment to the maximum size, some would have been sufficiently precise. Following this counterfactual argument, a nonadaptive trial with sample size $M$ will necessarily have greater power than an adaptive design with interim looks and maximum sample size $M$ . This does not mean that an adaptive trial reduces power, but rather that a nonadaptive trial with a sample size set to the maximum sample size of the adaptive trial will have greater power (i.e., $\tilde{B} > B_{1}$ , where $B_{1}$ is the power of the adaptive trial). In fact, in a fair comparison of power, meaning a nonadaptive trial with a sample size equal to the average sample size of the adaptive trial, the nonadaptive trial will have much less power than the adaptive trial.

By design, $B_{1} < \tilde{B}$ , but we are limited to working with a crude estimate ${\hat{B}}_{1}$ from $J = 250$ simulated trials. Furthermore, due to forcing sample sizes to be a multiple of $n$ , $\tilde{B}$ may be quite a bit larger than $B$ , and introducing adaptivity to the design may result in either $B \leq B_{1} < \tilde{B}$ or $B_{1} < B < \tilde{B}$ . With these limitations in mind, we devise the following strategy for obtaining the appropriately powered sample size vector $V_{i}$ . We initialize the algorithm with the sample size vector $V_{1}$ , implement Algorithm 2 for $J$ Bayesian adaptive trials with $L$ fixed at the expected value, and take the proportion of trials that are successful as ${\hat{B}}_{1}$ . If ${\hat{B}}_{1} \geq B$ , we assume power was not exceedingly reduced by the introduction of interim looks so that $B \leq B_{1} < \tilde{B}$ , and we select $V_{1}$ for our design. However, if ${\hat{B}}_{1} < B$ , we assume the introduction of interim looks resulted in $B_{1} < B < \tilde{B}$ , and we increase $M_{1}$ by $n$ to recover the lost power. With the now larger maximum sample size $M_{2} = M_{1} + n$ , we reconstruct the sample size vector by spacing the remaining interim looks between $m_{1}$ and $M_{2}$ to obtain $V_{2}$ . We continue this process until ${\hat{B}}_{i} \geq B$ ; at this point, we assume $B_{i}$ is at least $B$ and we select the sample size vector $V_{i}$ for our design. This process leaves $m_{1}$ fixed at the initial value but allows flexibility in the value of $M$ and the remaining interim looks.
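The iterative strategy above can be sketched in a few lines once a power estimator for the adaptive design is available. In the sketch below, `adaptive_power` is a hypothetical callable standing in for a run of Algorithm 2 over $J$ simulated trials, and the even spacing rule is an assumption chosen to agree with the example vector $(140, 190, 240)$ used later; neither name comes from the original.

```python
def spaced_vector(m1, M, K, n):
    """K interim looks starting at m1, evenly spaced toward M and rounded
    to multiples of n, followed by the maximum sample size M."""
    step = (M - m1) / K
    looks = [n * round((m1 + k * step) / n) for k in range(K)]
    return looks + [M]


def find_sample_size_vector(m1, M1, K, n, target, adaptive_power, max_iter=20):
    """Increase M by n until the estimated adaptive power reaches target.

    adaptive_power : callable mapping a sample size vector to an estimated
                     power (hypothetical stand-in for Algorithm 2).
    Returns the selected vector, or None if max_iter is exhausted.
    """
    M = M1
    for _ in range(max_iter):
        V = spaced_vector(m1, M, K, n)
        if adaptive_power(V) >= target:
            return V
        M += n                      # recover power lost to early stopping
    return None


print(spaced_vector(140, 240, K=2, n=10))   # [140, 190, 240]

# Toy power function for illustration only: not derived from any simulation.
toy_power = lambda V: 0.75 if V[-1] < 260 else 0.85
print(find_sample_size_vector(140, 240, K=2, n=10, target=0.8,
                              adaptive_power=toy_power))   # [140, 200, 260]
```

Because each evaluation of the power estimator is itself a simulation, the cap on iterations plays the role of the maximum number of algorithm iterations.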

Here, the enrollment information is again broken down into the size of each enrollment group (n) and the time lag between groups (say, 10 persons every quarter year). Given the number of trials to simulate at each iteration (J), the number of Monte Carlo samples (G) used to calculate the expected success and futility at each interim look, the cutoff values for expected success ( $C_{S}$ ) and futility ( $C_{F}$ ), the total number of interim looks (K), and a maximum number of algorithm iterations (I), the sample size vector estimation can be summarized by Algorithm 3.

Results

Operating characteristics

We now report the operating characteristics of our design with two interim looks and sample size vector $(140, 190, 240)$ and compare them with those of the two frequentist designs, using $J = 1000$ trials. Though we defer the calculation until later, this sample size vector corresponds to a design with 80% power when device survival is expected to be 95% at 5 years and the largest plausible device survival at 5 years is 98%. Our main investigation below looks at the characteristics when device survival varies between 92% and 98% at 5 years. The parametric models require a fixed value for the Weibull shape parameter $β$ ; for now, we defer investigation of performance when the shape parameter is misspecified. Finally, the enrollment rate is an important factor, since the faster persons are enrolled, the less information the data will contain during the interim looks. The degenerate case is when every person is enrolled at once, where no adaptation is possible. As such, we investigate the effects of three enrollment rates: 10 persons every tenth of a year, 10 persons every quarter year, and 10 persons every half year.

The dropout/censoring rate $(λ)$ is .08, the expected success cutoff is .8, the futility cutoff is .05, the efficacy threshold is 95%, and the acceptable precision $(ε)$ is .03, throughout. For each simulated trial, $G = 1000$ Monte Carlo samples were used to estimate the interim expected success and futility. With both frequentist designs (denoted ‘K–M’ and ‘Weibull’) having sample sizes fixed at $N = 206$ persons throughout, Table 1 provides a comparison of the three designs when 5-year device survival takes a range of values. Persons were enrolled at a rate of 10 per quarter year and the shape parameter was fixed at 1. The mean sample size decreases as device survival deviates from the expected value; if device survival were even lower than 92%, the mean sample size would again decrease as more trials would stop enrollment for futility. Table 1 also reports the percentages of trials under each design that terminate successfully, that achieve the desired half-width $(δ < 0.03)$ , and that have an upper interval estimate above the efficacy threshold $(U_{CI} > 95\%)$ . We see that the Bayesian adaptive design makes large gains in achievement of the desired half-width, which is driving the increase in power. We mention that, though rare, it is possible for a trial to result in $δ < ε$ and $U_{CI} < L_{ET}$ , making all three ‘% of trials’ columns in Table 1 different. As device survival decreases to 92%, 3% below the efficacy threshold, both frequentist designs become underpowered, achieving the desired half-width less than 5% of the time while correctly excluding the efficacy threshold no more than about half the time; in contrast, the Bayesian adaptive design achieves the desired half-width nearly 20% of the time while correctly excluding the efficacy threshold nearly 60% of the time. When device survival is at the efficacy threshold of 95%, the Bayesian design enjoys greater success probability of 80.3%, versus below 60% for the two frequentist designs.
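As a rough illustration of how the ‘% of trials’ columns relate to one another, the following Python sketch tabulates a set of final interval estimates. It assumes, as the text suggests but does not state outright in this section, that a trial is successful when both $δ < ε$ and $U_{CI} > L_{ET}$ hold; the function name and the toy intervals are hypothetical.

```python
from typing import Dict, List, Tuple


def summarize_trials(
    intervals: List[Tuple[float, float]],
    eps: float = 0.03,
    l_et: float = 0.95,
) -> Dict[str, float]:
    """Tabulate final (lower, upper) interval estimates from simulated trials.

    Assumed success rule: half-width below eps AND upper bound above the
    efficacy threshold l_et.
    """
    J = len(intervals)
    half_widths = [(upper - lower) / 2 for lower, upper in intervals]
    precise = [d < eps for d in half_widths]
    above = [upper > l_et for _, upper in intervals]
    success = [p and a for p, a in zip(precise, above)]
    return {
        "success_pct": 100 * sum(success) / J,
        "upper_above_threshold_pct": 100 * sum(above) / J,
        "precise_pct": 100 * sum(precise) / J,
        "mean_half_width": sum(half_widths) / J,
    }


# Four toy trials: two successes, one too wide, one precise but with
# an upper bound below the efficacy threshold.
toy = [(0.92, 0.97), (0.90, 0.98), (0.89, 0.94), (0.93, 0.98)]
summary = summarize_trials(toy)
print(summary)
```

The third toy interval is the rare case mentioned above: the half-width requirement is met, yet the trial is not counted as a success.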

Table 1.

Operating characteristics for various device survival values at 5 years (counts are out of J = 1000 trials)

Actual device survival	Design	Sample size, mean (SD)	% of trials			Bounds
			Success	$U_{C I} > 95 %$	$δ < 0.03$	Mean $δ$	Lower	Upper
92%	K–M	206 (—)	2.3	78.4	2.3	0.041	0.880	0.961
	Weibull	206 (—)	4.7	48.7	4.7	0.036	0.877	0.949
	Adaptive	214.1 (40.1)	18.7	41.1	18.7	0.036	0.874	0.946
95%	K–M	206 (—)	32.2	99.2	32.2	0.032	0.919	0.983
	Weibull	206 (—)	57.2	97.5	57.2	0.029	0.913	0.971
	Adaptive	220.8 (34.2)	80.3	96.3	80.3	0.028	0.916	0.971
98%	K–M	206 (—)	95.1	100.0	95.1	0.019	0.960	0.998
	Weibull^a	206 (—)	99.4	100.0	99.4	0.022	0.947	0.992
	Adaptive	180.8 (38.6)	97.9	100.0	97.9	0.021	0.951	0.992

SD: standard deviation.

^a Weibull model failed to converge four times for this setup.

Table 1 also reports the average CI/credible interval bounds, which verify that all the models are estimating intervals centered about the actual device survival at time $T$ . Though the Bayesian design does have a larger mean sample size in two of the three settings, the design could be tweaked by adjusting the prior distributions on $θ$ , $C_{S}$ , and $C_{F}$ or increasing/decreasing $K$ and $M$ to deliver an average sample size of roughly 206 while still maintaining greater power (e.g., changing $C_{S} = 0.6$ and $C_{F} = 0.2$ reduces the average sample size for the 95% device survival setting to 204 while maintaining 67.4% power). When device survival is 98%, the present Bayesian design enjoys both a similar power and an average sample size 25 persons smaller than the frequentist designs. The frequentist Weibull model fails when there are no device failures observed; this happens because the maximum likelihood estimate of $θ$ , $\hat{θ} = \sum x_{i}^{β} / r$ , does not exist when $r = 0$ .
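The failure mode with no observed device failures is easy to see in code. The sketch below is a minimal illustration, assuming the Weibull parameterization $S (t) = \exp (- t^{β} / θ)$ implied by the hazard function given later in this section; the function name and the toy data are hypothetical.

```python
import math
from typing import Sequence


def weibull_mle_survival(
    times: Sequence[float], n_failures: int, beta: float, t: float
) -> float:
    """Plug-in estimate of S(t) with the shape beta held fixed.

    times holds every person's observed time (failure or censoring) and
    n_failures is r, the number of observed failures.
    """
    if n_failures == 0:
        # theta_hat = sum(x_i ** beta) / r is undefined when r = 0.
        raise ValueError("no failures observed: the MLE of theta does not exist")
    theta_hat = sum(x ** beta for x in times) / n_failures
    return math.exp(-(t ** beta) / theta_hat)


# Toy data: 100 persons each followed for 5 years, 5 observed failures.
estimate = weibull_mle_survival([5.0] * 100, n_failures=5, beta=1.0, t=5.0)
print(round(estimate, 4))
```

With high device survival and a modest sample size, a trial with $r = 0$ is not implausible, which is consistent with the handful of convergence failures noted under Table 1.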

Table 2 contains a breakdown of the reasons for halting enrollment in the Bayesian adaptive trials. That is, reported are the number of times the trial was stopped for futility $({\hat{E}}_{F} < C_{F})$ and expected success $({\hat{E}}_{S} > C_{S})$ , as well as the percentage of times these halted trials were ultimately successful (Success %). When a trial stops for expected success, we expect these trials to succeed at a high frequency, whereas when a trial stops for futility, we expect the trial to succeed only rarely. Table 2 shows that when stopping for futility, a trial never succeeds, and when stopping for expected success, a trial succeeds between 15% and 100% of the time depending on actual underlying device survival. For lower rates of survival, trials halt for futility more often; at 92% device survival, we see 287 of 1000 trials halt for futility. However, the number of trials halting for futility drops to 43 and then to 0 as actual device survival increases to 95% and 98%, respectively. As device survival increases, the frequency of trials stopping for expected success increases as well, with actual success occurring in all but 1 of the trials halted at the second interim look for 98% device survival. Adjusting the cutoff for expected success and futility to reflect the proportion of information collected would help stabilize the percent of trials halted early that are ultimately successful. For example, at 95% device survival, we see only 43.7% of the trials halted for expected success at the first interim look having a successful final result. Making the halting condition more stringent for earlier interim looks would not only increase this percentage but also increase the average sample size.

Table 2.

Composition of early stopping reason and final result by interim sample sizes of the adaptive design for various device survival values at 5 years (counts are out of J = 1000 trials)

Actual device survival at 5 years	Stopping reason	Statistic	Halted at interim			Maximum sample size reached
			140	190	Overall	
92%	${\hat{E}}_{F} < C_{F}$	Count	179	108	287	678
		Success %	0.0	0.0	0.0	26.4
	${\hat{E}}_{S} > C_{S}$	Count	18	17	35
		Success %	16.7	29.4	22.9
95%	${\hat{E}}_{F} < C_{F}$	Count	28	15	43	731
		Success %	0.0	0.0	0.0	88.8
	${\hat{E}}_{S} > C_{S}$	Count	87	139	226
		Success %	43.7	82.7	67.7
98%	${\hat{E}}_{F} < C_{F}$	Count	0	0	0	121
		Success %	—	—	—	100.0
	${\hat{E}}_{S} > C_{S}$	Count	406	371	777
		Success %	95.1	99.7	97.3

Next, we investigate the robustness of the three designs to shape parameter misspecification, where we expect device survival to follow $Weibull (\tilde{β}, θ)$ for a range of $\tilde{β}$ values, while in reality device survival is distributed $Weibull (β, θ)$ with true shape parameter $β$ . The adaptive design has been calibrated to deliver 80% power in the absence of misspecification, and the Weibull design is given a sample size equal to the average sample size from the adaptive design. Recall that the frequentist Weibull design and Bayesian adaptive design are both operating with a fixed shape parameter. If the assumed shape parameter, $\tilde{β}$ , is incorrect (i.e., $\tilde{β} \neq β$ ), there may arise a bias in estimates for $S (T)$ . Table 3 explores the effect of various misspecifications with $S (T) = 0.95$ . The average bounds of the intervals show that a misspecification can result in underestimation or overestimation of device survival at the time point of interest. The direction of the bias in the resulting intervals has much to do with the hazard function, which for the Weibull model is given by $λ (t) = β t^{β - 1} / θ$ . Table 3 shows that a misspecification can cause a particular design to be either conservative or anticonservative; therefore, an investigation into the effect of misspecification should be conducted before a particular design is implemented. It also shows that the adaptive design is more robust to misspecification than the frequentist Weibull design. For instance, when $β = 0.75$ , the interval limits and the power shift less in the presence of misspecification for the adaptive design than for the frequentist Weibull design. The performance of the nonparametric Kaplan–Meier design, while poor, is of course unaffected and therefore omitted from Table 3.
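To make the role of the hazard concrete, the following Python sketch evaluates $λ (t) = β t^{β - 1} / θ$ for three shape values, with $θ$ chosen in each case so that $S (5) = 0.95$ . It is an illustration only: the survival function $S (t) = \exp (- t^{β} / θ)$ is obtained by integrating the stated hazard, and the helper names are hypothetical.

```python
import math


def theta_for_survival(L: float, T: float, beta: float) -> float:
    """Scale theta such that S(T) = exp(-T**beta / theta) equals L."""
    return -(T ** beta) / math.log(L)


def hazard(t: float, beta: float, theta: float) -> float:
    """Weibull hazard beta * t**(beta - 1) / theta."""
    return beta * t ** (beta - 1) / theta


def survival(t: float, beta: float, theta: float) -> float:
    """Weibull survival function implied by the hazard above."""
    return math.exp(-(t ** beta) / theta)


T, L = 5.0, 0.95
for beta in (0.75, 1.0, 1.25):
    theta = theta_for_survival(L, T, beta)
    # Every curve passes through S(5) = 0.95, but the hazard falls,
    # stays flat, or rises over time depending on beta.
    print(beta, round(hazard(1.0, beta, theta), 5),
          round(hazard(5.0, beta, theta), 5),
          round(survival(T, beta, theta), 4))
```

All three curves agree at $T = 5$ but distribute the risk differently over time, which is why fitting with the wrong shape to data concentrated at other follow-up times shifts the estimate of $S (T)$ .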

Table 3.

Operating characteristics of the designs in the presence of shape parameter misspecification (counts are out of J = 1000 trials)

$\tilde{β}$	$β$	Design	Average sample size	% of trials			Bounds
				Success	$U_{C I} > 95 %$	$δ < 0.03$	Mean $δ$	Lower	Upper
1.4	1.25	Weibull	217	87.0	98.2	87.0	0.026	0.919	0.971
		Adaptive	216.6	88.8	97.2	88.8	0.026	0.919	0.971
1.25	1.25	Weibull	215	77.7	96.6	77.7	0.027	0.915	0.970
		Adaptive	214.3	79.7	95.7	79.7	0.027	0.915	0.970
1.1	1.25	Weibull	212	64.1	94.8	64.1	0.029	0.912	0.969
		Adaptive	211.1	71.3	95.4	71.3	0.028	0.913	0.970
1.25	1	Weibull	223	91.3	97.8	91.3	0.026	0.921	0.972
		Adaptive	222.9	90.7	97.3	90.8	0.026	0.921	0.973
1.1	1	Weibull	223	81.5	97.2	81.5	0.027	0.917	0.971
		Adaptive	222.2	85.6	97.0	85.6	0.027	0.918	0.971
1	1	Weibull	221	71.3	96.2	71.3	0.028	0.915	0.970
		Adaptive	220.8	80.3	96.3	80.3	0.028	0.916	0.971
0.9	1	Weibull	222	67.7	96.0	67.7	0.028	0.913	0.970
		Adaptive	221.7	74.5	95.4	74.5	0.028	0.913	0.970
0.75	1	Weibull	218	49.4	95.7	49.4	0.030	0.909	0.969
		Adaptive	217.9	67.2	95.6	67.2	0.029	0.913	0.970
0.9	0.75	Weibull	231	81.7	97.1	81.7	0.027	0.917	0.971
		Adaptive	230.6	86.6	96.5	86.6	0.027	0.918	0.972
0.75	0.75	Weibull	229	67.4	96.8	67.4	0.028	0.914	0.971
		Adaptive	228.8	80.9	95.8	80.9	0.028	0.916	0.971
0.6	0.75	Weibull	222	54.7	97.1	54.7	0.029	0.911	0.970
		Adaptive	221.7	69.1	95.7	69.1	0.029	0.914	0.971

Figure 1 shows the fitted survival curves for three $\tilde{β}$ values from a Weibull regression model on one simulated trial where $β = 1$ and $S (5) = 0.95$ . As illustrated in Figure 1, a misspecified $β$ implying a hazard increasing faster (or decreasing slower) across time than in reality (i.e., $\tilde{β} > β$ ) will underestimate survival at the later survival times. Substituting $\hat{θ} = \sum x_{i}^{\tilde{β}} / r$ into the Weibull survival function, the maximum likelihood estimate under a misspecified model is $\tilde{S} (t) = \exp (- r t^{\tilde{β}} / \sum x_{i}^{\tilde{β}})$ ; thus, the direction and magnitude of the bias in the resulting survival estimates depend upon the direction and magnitude of the misspecification, as well as the survival time of interest relative to the follow-up time covered by the data. For a thorough discussion of shape parameter misspecification in Weibull regression, see Xie et al. [11].

Figure 1.

Parametric Weibull regression survival curves under shape parameter misspecification $(β = 1)$ .

Table 4 assesses the effect of enrollment rates. As discussed previously, the adaptive design relies on interim information to predict whether more persons should be enrolled. Faster enrollment means less information will be available at each interim look to make predictions, since information here is essentially follow-up time. The Kaplan–Meier design is again unaffected by this change as the estimate for 5-year survival is independent of anything that happens beyond 5 years. Since interim looks are driven by enrollment count and not calendar dates, the frequentist Weibull design and the Bayesian adaptive design see increased power as the enrollment rate slows, implying longer average follow-up and more information. By contrast, faster enrollment rates result in less power for these parametric designs. The Bayesian adaptive design also shows a shrinking mean sample size as enrollment rates slow. In fact, when the enrollment rate is 10 persons per half year, the Bayesian adaptive design has a mean sample size similar to that of both frequentist designs and still enjoys greater power.
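A back-of-the-envelope calculation shows how strongly the enrollment rate affects the follow-up available at an interim look. The Python sketch below makes simplifying assumptions that are not taken from the article: the look is taken at the moment the last group of the interim sample enrolls, and failures and dropout are ignored, so the result is an upper bound on accrued person-years. The function name is hypothetical.

```python
def follow_up_at_look(m: int, n: int, lag: float) -> float:
    """Person-years accrued when the m-th person enrolls, with n persons
    enrolled every `lag` years (no failures or dropout, an assumption)."""
    groups = m // n
    t_now = (groups - 1) * lag  # calendar time when the last group enrolls
    return sum(n * (t_now - g * lag) for g in range(groups))


# First interim look of the (140, 190, 240) design under three enrollment rates.
for lag in (0.1, 0.25, 0.5):
    print(lag, round(follow_up_at_look(140, 10, lag), 1))
```

Under these assumptions the person-years at the first look scale linearly with the time lag between groups, so enrolling 10 persons per half year yields five times the follow-up of enrolling 10 persons per tenth of a year.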

Table 4.

Operating characteristics of the designs with various enrollment rates (counts are out of J = 1000 trials)

Enroll 10 persons per	Design	Sample size, mean (SD)	% of trials			Bounds
			Success	$U_{C I} > 95 %$	$δ < 0.03$	Mean $δ$	Lower	Upper
0.1 years	K–M	206 (—)	32.2	99.2	32.2	0.032	0.919	0.983
	Weibull	206 (—)	31.6	97.0	31.6	0.032	0.909	0.973
	Adaptive	222.6 (30.4)	57.6	96.6	57.6	0.030	0.911	0.971
0.5 years	K–M	206 (—)	32.2	99.2	32.2	0.032	0.919	0.983
	Weibull	206 (—)	84.6	97.5	84.6	0.027	0.916	0.970
	Adaptive	208.6 (34.3)	91.2	97.0	92.4	0.026	0.919	0.971

SD: standard deviation.

Sample size vector estimation

Here, we address the sample size vector estimation problem in a few different settings. Each implementation of Algorithm 3 returns a sample size vector that will deliver the desired power under the design constraints and hypothesized parameter values specified by the investigator. Recall that the primary objective is to estimate device survival at $T = 5$ years using a 95% credible interval with a half-width less than $ε = 0.03$ and an efficacy threshold $L_{ET} = 0.95$ . This subsection reports the sample size vectors estimated and average trial duration for combinations of enrollment rates, device survival percentiles, and numbers of interim looks. For each expected device survival value, a corresponding ‘largest plausible’ value is specified to obtain $m_{1}$ .

For each sample size vector estimation in Table 5, the desired power is 80% with a dropout/censoring rate of $λ = 0.08$ and Weibull shape parameter fixed at $β = 1$ . We used $J^{*} = 3000$ trials to obtain initial values and used $J = 250$ trials for each algorithm iteration with $G = 1000$ Monte Carlo samples to estimate ${\hat{E}}_{S}$ and ${\hat{E}}_{F}$ . Recall that each element of an estimated sample size vector is constrained to be a multiple of enrollment group size $n = 10$ . Power in each setting is calculated with respect to a point-mass prior on the expected value of $L$ specified in the first column. The ‘largest plausible’ value of $L$ corresponding to an expected device survival of $0.95$ was set to $0.98$ ; for expected device survival of $0.97$ , the ‘largest plausible’ value of $L$ was set to $0.99$ . Both of these ‘largest plausible’ values dictate the placement of $m_{1}$ , which is held constant for each unique combination of $L$ and enrollment rate.

Table 5.

Estimated sample size vectors that provide 80% power

$L$ (expected/largest plausible)	10 persons enrolled each	Number of potential interim looks
		2	Mean duration (year)	3	Mean duration (year)	4	Mean duration (year)
.95/.98	.1 year	(160, 220, 270)	7.49	(160, 200, 230, 270)	7.40	(160, 190, 220, 260, 290)	7.47
	.25 year	(140, 190, 240)	10.37	(140, 170, 210, 240)	10.19	(140, 170, 200, 220, 250)	10.19
	.5 year	(130, 170, 210)	14.16	(130, 160, 180, 210)	14.01	(130, 150, 170, 190, 210)	13.81
.97/.99	.1 year	(120, 160, 200)	6.83	(120, 150, 170, 200)	6.84	(120, 140, 160, 180, 200)	6.82
	.25 year	(110, 150, 190)	9.24	(110, 130, 160, 180)	8.89	(110, 130, 140, 160, 180)	8.82
	.5 year	(100, 130, 160)	11.65	(100, 120, 150, 170)	11.86	(100, 120, 130, 140, 160)	11.48

One clear pattern in Table 5 is that both slowing enrollment and greater device survival at time $T$ shift the sample size vector downward. Slower enrollment means longer observed survival times, but also increases average trial duration. An obvious effect of increasing $K$ is that interim looks become more frequent. Recall the approximation for $Var [L | x]$ and its dependence upon $β$ ; though not illustrated in Table 5, as $β$ increases, power increases in this setting as follow-up times are predominantly greater than $T$ . With the same specifications as in Table 5 and device survival of .95/.98, shifting $β$ to 1.5 and 0.5 requires maximum sample sizes of 200 and 260 to maintain 80% power, respectively. Changing the number of potential interim looks appears to have relatively little effect on the range of the sample size vectors, as the maximum sample sizes are nearly all the same for each particular enrollment rate and device survival percentile combination. An exception occurs for a time lag of .1 between enrollment groups and device survival of .95/.98, where the maximum sample size jumps from 270 with 2 or 3 interim looks to 290 with 4. Trial duration is reduced slightly when the number of interim looks increases and the maximum sample size stays the same. Clear patterns from increasing the number of interim looks are hard to discern with the admittedly small number of MCMC draws $(J = 250)$ and lack of granularity due to the large enrollment group size $(n = 10)$ .

Further investigation into the effect of the number of interim looks was conducted by increasing the number of trials used in each iteration of Algorithm 3 to $J = 750$ and adding granularity to the sample sizes by assuming an enrollment rate of $n = 3$ persons every .05 years. The latter change allows each element of the sample size vector to be a multiple of 3, whereas increasing $J$ reduces the Monte Carlo error in the power estimate at each iteration. With these specifications, Algorithm 3 was run for 2, 6, and 10 potential interim looks. Resulting sample size vectors are omitted, but the maximum sample size needed for 80% power increases from 183 to 186 to 189, respectively, while the average trial duration remained nearly constant across the three choices of $K$ . This supports the observation that a larger maximum sample size becomes necessary to maintain power as the number of interim looks increases. Recall that this pattern is due to a higher frequency of trials halting enrollment for expected success with a growing number of interim looks; once a trial has halted enrollment, it may finish with a half-width larger than $ε$ due to a random change in device failure following the interim decision.
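The size of the Monte Carlo error in a simulated power estimate can be gauged with the usual binomial standard error. The Python sketch below assumes the $J$ simulated trials are independent, each succeeding with the same probability; the function name is hypothetical.

```python
import math


def mc_standard_error(power: float, J: int) -> float:
    """Binomial standard error of a power estimate based on J simulated
    trials, assuming independent trials with common success probability."""
    return math.sqrt(power * (1 - power) / J)


for J in (250, 750, 1000):
    print(J, round(mc_standard_error(0.80, J), 4))
```

At a true power of 80%, the standard error is roughly 2.5 percentage points with $J = 250$ and about 1.5 points with $J = 750$ , so differences of a single enrollment group in the maximum sample size can plausibly arise from simulation noise alone.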

Discussion

The Bayesian adaptive approach provides us with a more flexible design, which results in greater power and potentially smaller mean sample sizes when actual survival at time $T$ is different from the assumed value used to power the design. The adaptive design becomes more powerful as the enrollment rate slows and, even when our prior expectations are exactly correct, can offer a smaller average sample size while maintaining greater power. Any $Beta (γ, 1)$ prior could be employed on $L$ while retaining conjugacy; a noninformative choice was made here. The Bayesian adaptive design could be improved in settings where we have strong opinions about device survival through a more informative prior, perhaps informed by commensurate historical information [6,7]. The R code to reproduce the results in this article or to investigate design characteristics in other settings is available on the second author’s (B.P.C.) software page (http://www.biostat.umn.edu/~brad/software.html).

The cost of adding more interim looks while maintaining the same power is the need for a larger maximum sample size. This arises because most of the halted trials do so for expected success with promising outlooks in the first enrollees. These trials with early promise regress to the true device survival rate, which occasionally results in a half-width slightly larger than $ε$ at study end. Employing cutoff values for expected success and futility that reflect the amount of information collected at each interim look would reduce the proportion of times a halted trial finishes with an unacceptably large half-width (see e.g., Section 6.6 in Yin [12]). Though the maximum sample size may increase with more interim looks, the average trial duration is left largely unaffected. The adaptivity increases power relative to a nonadaptive trial with the same average sample size by reducing waste. The adaptive trial reduces average sample size and trial duration by balancing the FDA’s demand for estimate precision and the company’s desire for efficiency. A beauty of this design is that it provides a safeguard against errant prior beliefs about device survival. That is, if the maximum sample size is chosen in a conservative manner, we rarely suffer from insufficient power and still avoid wasting resources via automatic calibration to a smaller sample size when interim results are persuasive.

The current frequentist sample size calculation does not incorporate the idea of an efficacy threshold, whereas the Bayesian framework allows us to calculate the sample size vector for a trial that delivers the desired power in the presence of an efficacy threshold as well as a credible interval half-width requirement. Our simulations show that the frequentist sample size formula often returns a sample size with insufficient power in this setting. Admittedly, there are frequentist approaches, not discussed here, that use interim looks to reestimate sample size (see e.g., Section 6.7 in Yin [12] or Jennison and Turnbull [13]). For the Bayesian design, we have shown that a sample size vector can be reliably estimated to deliver a desired power in the presence of an efficacy threshold and a precision requirement; related applications appear in Berry et al. [7]. This provides an appropriate and detailed statistical justification for the sample sizes of device surveillance studies that are estimation driven, instead of hypothesis driven. The gains in lowered sample size and/or increased power indicate that the approach outlined here would be least burdensome for many practical scenarios.

An important limitation of both the Bayesian adaptive design and the frequentist Weibull design studied above is the fixed shape parameter. As seen in Table 3, misspecification of this parameter causes these designs to give biased results, in either a conservative or anticonservative direction. The Bayesian adaptive design shows greater robustness to misspecification than the frequentist Weibull design. Yet, in the fixed sample frequentist setting, the Weibull design can be extended to allow for simultaneous estimation of the shape and scale parameters; see Meeker and Nelson [4] or Klein and Moeschberger [8]. A corresponding Bayesian design that remains computationally feasible is a direction for future work. In our current setting, the predictive calculations can be more easily (and speedily) conducted using a one-off sampling technique. In the two-parameter nonconjugate setting, however, these draws must be made in a vastly more computationally intensive manner (e.g., by iteratively calling an MCMC sampler) since the two-parameter Weibull distribution does not admit a closed-form posterior. In principle, trial operating characteristics could still be simulated by calling BUGS or JAGS repeatedly from R using the BRugs or rjags packages, respectively, and perhaps the multicore package as well if using a multiple core processor. Simulating operating characteristics in this way for a nonconjugate model might take over 500 h with the current computing capacities and numbers of iterations similar to those used in section ‘Results’. One potential alternate solution is to use a bivariate normal approximation to the joint posterior of $β$ and $θ$ (see e.g., Section 3.1 in Carlin and Louis [14]). Another route would be to abandon the Weibull for a piecewise exponential model of device survival (see e.g., Section 3.1 in Ibrahim et al. [15]). Regardless, as computational advances inevitably arrive, a more general adaptive model in settings where both parameters are unknown will become feasible.

Footnotes

Appendix 1

Frequentist sample size calculation for one-sample survival curve [11]:

$α$ : 100(1 −α)% confidence interval (CI)

$δ$ : half-width of interval

$T$ : time of survival estimate

$L$ : expected device survival percentile at time $T$

$λ$ : censoring rate

Then the required sample size $N$ is given by

$$N = \frac{L^{2} (1 - L) z_{1 - α}^{2}}{δ^{2} {(1 - λ)}^{T}}$$
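The formula is simple to evaluate numerically. The Python sketch below is a minimal illustration in which two choices are assumptions rather than statements from the article: the result is rounded up to the next whole person, and $α = 0.05$ is used in the example call. With $L = 0.95$ , $δ = 0.03$ , $T = 5$ , and $λ = 0.08$ , these choices give 206, which matches the fixed sample size $N = 206$ used for the frequentist designs in section ‘Results’.

```python
import math
from statistics import NormalDist


def frequentist_sample_size(
    L: float, delta: float, T: float, lam: float, alpha: float
) -> int:
    """Evaluate N = L^2 (1 - L) z_{1-alpha}^2 / (delta^2 (1 - lam)^T),
    rounded up to a whole person (the rounding is an assumption)."""
    z = NormalDist().inv_cdf(1 - alpha)  # standard normal quantile z_{1-alpha}
    n_raw = L ** 2 * (1 - L) * z ** 2 / (delta ** 2 * (1 - lam) ** T)
    return math.ceil(n_raw)


N = frequentist_sample_size(L=0.95, delta=0.03, T=5, lam=0.08, alpha=0.05)
print(N)
```

The factor ${(1 - λ)}^{T}$ in the denominator inflates the sample size to compensate for the persons expected to be censored before time $T$ .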

Funding

The work of the first two authors (T.A.M. and B.P.C.) was supported by a grant from the Medtronic Corporation.

Conflict of interest

None declared.

References

1. Center for Devices & Radiological Health. Draft Guidance for Industry and Food and Drug Administration Staff: Procedures for Handling Section 522 Postmarket Surveillance Studies. Rockville, MD: Food and Drug Administration, 2001.

2. Peto, Pike, Armitage. Design and analysis of randomized clinical trials requiring prolonged observation of each patient. Part II. Analysis and examples. Br J Cancer 1977; 35(1): 1–39.

3. Center for Devices & Radiological Health: Scientific & Technical Review Committee for Cardiac Pacemaker Leads. Guidance to Sponsors on the Development of a Discretionary Postmarket Surveillance Study for Permanent Implantable Cardiac Pacemaker Electrodes. Rockville, MD: Food and Drug Administration, 1992.

4. Meeker, Nelson. Weibull percentile estimates and confidence limits from singly censored data by maximum likelihood. IEEE T Reliab 1976; R-25(1): 20–24.

5. Berry. Bayesian clinical trials. Nat Rev Drug Discov 2006; 5(1): 27–36.

6. Center for Devices & Radiological Health. Guidance for the Use of Bayesian Statistics in Medical Device Clinical Trials. Rockville, MD: Food and Drug Administration, 2010.

7. Berry, Carlin, Lee, Müller. Bayesian Adaptive Methods for Clinical Trials. Chapman & Hall, Boca Raton, FL, 2010.

8. Klein, Moeschberger. Survival Analysis: Techniques for Censored and Truncated Data. Springer, New York, 2003.

9. R Development Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, 2011.

10. Zhang, Meeker. Bayesian life test planning for the Weibull distribution with given shape parameter. Metrika 2005; 61: 237–49.

11. Xie, Yang, Gaudoin. More on the mis-specification of the shape parameter with Weibull-to-exponential transformation. Qual Reliab Eng Int 2000; 16(4): 281–90.

12. Yin. Clinical Trial Design: Bayesian and Frequentist Adaptive Methods. John Wiley & Sons Inc, Hoboken, NJ, 2012.

13. Jennison, Turnbull. Group Sequential Methods with Applications to Clinical Trials. Chapman & Hall/CRC, Boca Raton, FL, 2000.

14. Carlin, Louis. Bayesian Methods for Data Analysis (3rd edn). Chapman & Hall, Boca Raton, FL, 2009.

15. Ibrahim, Chen M-H, Sinha. Bayesian Survival Analysis. Springer, New York, 2001.