Abstract
A new clinical trial design, designated the two-way enriched design (TED), is introduced, which augments the standard randomized placebo-controlled trial with second-stage enrichment designs in placebo non-responders and drug responders. The trial is run in two stages. In the first stage, patients are randomized between drug and placebo. In the second stage, placebo non-responders are re-randomized between drug and placebo, as are drug responders. All first-stage data, and second-stage data from first-stage placebo non-responders and first-stage drug responders, are utilized in the efficacy analysis. The authors develop one, two and three degrees of freedom score tests for treatment effect in the TED and give formulae for asymptotic power and for sample size computations. The authors compute the optimal allocation ratio between drug and placebo in the first stage for the TED and compare the operating characteristics of the design to those of the standard parallel clinical trial, placebo lead-in and randomized withdrawal designs. Two motivating examples from different disease areas are presented to illustrate the possible design considerations.
1 Introduction and motivating examples
The placebo-controlled trial is a gold standard in the clinical assessment of drugs intended for symptomatic improvement of conditions such as pain, anxiety, depression and sleep disorders. However, in many of these disease areas, there is a belief that the standard trial has not adequately discriminated active compounds. One significant problem in many of these disease areas is a high rate of placebo response.1,2 A common practice in such trials is to use a placebo lead-in design (also referred to as placebo run-in), where all patients receive placebo first, after which patients who have not responded to placebo are randomized to placebo versus drug; patients who respond to placebo in this phase are usually excluded from further trial participation. The main goal of using a placebo lead-in is to reduce the placebo response rate, and thereby hopefully increase the treatment effect.
A second proposal for improving the standard parallel trial is to conduct a randomized withdrawal design (also referred to as randomized discontinuation).3 In the randomized withdrawal design, patients are treated with the drug of interest to select those patients who respond to the drug. These patients are subsequently randomized to either stay on drug or switch to placebo. The outcome variable is some measure of maintenance of response. The logic behind the design is that a patient who has shown symptomatic improvement on an active drug is at greater risk of losing that benefit when switched to placebo than when remaining on drug. Randomized withdrawal designs are often suggested when the disease is considered very heterogeneous and the patient population which can benefit from the drug is only a subset of the entire patient population.
Both the placebo lead-in design and the randomized withdrawal design are enrichment designs4 in that the population randomized in the trial is a subset of the population recruited into the trial. In the placebo lead-in design, the subset is the population of placebo non-responders (and hence, ideally, placebo responders are eliminated from the design). In the randomized withdrawal design, the subset is the subset of drug responders (and hence, ideally, drug non-responders are eliminated from the design). The goal in both designs is to create subsets of the recruited population which will have a larger drug effect than the overall population.
The benefit of the placebo lead-in is debatable. A meta-analysis of 101 studies in depression showed that the placebo lead-in did not lower the placebo response rate, nor increase the drug–placebo difference.5 The failure of the standard placebo lead-in may have been due to the short duration of the period or the fact that investigators and patients may not have been blinded to this design feature. Because the placebo lead-in has not produced noticeable improvement, a new design called the sequential parallel comparison design (SPCD)6,7 was proposed that assigns patients to both drug and placebo in an initial stage, reassigns placebo non-responders in a second stage and observes the outcome in each stage. The data utilized in the efficacy analysis in this design consist of all first-stage data, and second-stage data only from patients who fail to respond to placebo in the first stage. Patients who receive drug in the first stage often participate in Stage 2 to maintain blinding; however, they are not included in the efficacy analysis.7
We propose a design that is an extension of the basic SPCD that, in addition, randomizes drug responders between drug and placebo and utilizes information on this comparison in the overall drug–placebo comparison. The patients who respond to drug are examined to see whether the maintenance of response differs between those patients switched to placebo and those who remain on drug. We refer to the new design as the two-way enriched design (TED). The new TED is depicted in Figure 1. The premise for our TED design is that a drug which is significantly superior to placebo in achieving short-term efficacy will also be superior to placebo in the maintenance of efficacy in those patients who respond to drug. This premise, of course, needs to be examined on a case-by-case basis and may be influenced by considerations such as chronicity of the disease and the time period associated with achieving response. We now give two motivating examples in which we believe our new design would be very useful.
Figure 1. The two-way enriched design.
Motivating Example 1: An early-phase clinical development team at a pharmaceutical company is charged with assessing efficacy of a new drug for generalized anxiety disorder (GAD). The new drug candidate has a different mechanism of action than benzodiazepines (currently used in the treatment of GAD) and is hypothesized to have a similar efficacy profile to benzodiazepines but significantly less abuse potential. The team needs to design an early Phase 2 study with as small a total sample size as possible to determine if there is sufficient evidence of efficacy to continue development of the compound. Abuse potential is to be explored in separate studies. High placebo response is an issue in GAD.1 GAD is a chronic disease and it is believed that if an effective drug is withdrawn from a patient, symptoms of the disease quickly return.
Motivating Example 2: Ulcerative colitis is a form of inflammatory bowel disease. The disease, which is characterized by inflammation of parts or even the whole of the colon, usually requires medical treatment to go into remission. In clinical trials investigating induction and maintenance regimens to induce remission in patients with active ulcerative colitis, considerable placebo rates of remission of up to 40% were observed.8 This could be due to the fact that some patients additionally suffer from irritable bowel syndrome, which is a non-inflammatory condition of the intestinal tract. These patients have high bowel frequency, which responds well to placebo. Both single-stage induction of remission trials and randomized withdrawal trials9 are trial designs used for the evaluation of the clinical efficacy of novel drug regimens in ulcerative colitis. The proposed TED might be a more effective alternative to the randomized withdrawal or parallel single-stage design.
We will return to these motivating examples later in Section 3. The remainder of the manuscript is organized in the following fashion. In Section 2, we develop a variety of score tests for the TED. The different score tests arise because of different possible assumptions one is willing to make about the underlying parameters. The asymptotic distribution of the score tests under the null and under local alternatives is derived. In Section 3, in addition to examining the motivating examples, we also examine the robustness of the various tests when the assumptions being made are not correct and compare various designs in terms of required sample size. In Section 4, we discuss testing of the separate null hypotheses underlying the TED.
2 Score tests for the TED
2.1 Description of the design
The trial is conducted in two stages. Patients are randomized to one of four sequences: placebo–placebo, placebo–drug, drug–placebo and drug–drug. In the TED, first-stage placebo responders and first-stage drug non-responders are not included in the efficacy analysis in Stage 2 (although these patients might remain in the trial for blinding purposes) because a treatment effect is unlikely to be observed in these patients. The time point of switching from placebo to drug and from drug to placebo does not need to be the same for patients initially treated with placebo as for patients initially randomized to drug. Another way to implement the design is to randomize patients between drug and placebo in the first stage and then re-randomize placebo non-responders to receive drug or placebo and re-randomize drug responders to receive drug or placebo. Placebo responders and drug non-responders can also be re-randomized to drug or placebo but their data are not included in the primary efficacy analysis. The two approaches are equivalent if randomization is via independent Bernoulli random variables within each stage. Simple randomization to four groups at the beginning of the trial is logistically easier and we will assume this approach in the remainder of the article.
The two-way enriched design a
Note. p1 = Pr (drug response in Stage 1), q1 = Pr (placebo response in Stage 1), p2 = Pr (drug response in Stage 2 | placebo non-responder in Stage 1), q2 = Pr (placebo response in Stage 2 | placebo non-responder in Stage 1), p3 = Pr (drug response in Stage 2 | drug responder in Stage 1), q3 = Pr (placebo response in Stage 2 | drug responder in Stage 1), s2 is the proportion of placebo non-responders in Stage 1 who participate in Stage 2, s3 is the proportion of drug responders in Stage 1 who participate in Stage 2.
Responses denoted ‘•’ are not included in the analysis by design; n14 and n24 are placebo non-responders and n34 and n44 are drug responders who drop out and do not participate in Stage 2.
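The data-generating process implied by this notation can be sketched in a few lines of simulation code. This is a minimal illustration and not part of the derivation: the function name `simulate_ted`, the equal allocation to the four sequences and the category labels are assumptions made for the example, and dropout is modelled as an independent coin flip with retention probabilities s2 and s3.

```python
import random
from collections import Counter

def simulate_ted(n, p1, q1, p2, q2, p3, q3, s2=1.0, s3=1.0, seed=None):
    """Simulate one TED trial; return counts keyed by
    (sequence, stage-1 responder?, stage-2 status)."""
    rng = random.Random(seed)
    sequences = ["PP", "PD", "DP", "DD"]  # stage-1 then stage-2 treatment
    counts = Counter()
    for _ in range(n):
        seq = rng.choice(sequences)  # assumption: equal allocation to sequences
        first, second = seq[0], seq[1]
        resp1 = rng.random() < (p1 if first == "D" else q1)
        # Enriched subsets: placebo non-responders and drug responders.
        enriched = (first == "P" and not resp1) or (first == "D" and resp1)
        if not enriched:
            counts[(seq, resp1, "excluded")] += 1  # not analysed in Stage 2
            continue
        retention = s2 if first == "P" else s3
        if rng.random() >= retention:
            counts[(seq, resp1, "dropout")] += 1
            continue
        if first == "P":
            prob2 = p2 if second == "D" else q2
        else:
            prob2 = p3 if second == "D" else q3
        resp2 = rng.random() < prob2
        counts[(seq, resp1, "responder" if resp2 else "non-responder")] += 1
    return counts

if __name__ == "__main__":
    for key, value in sorted(simulate_ted(200, 0.6, 0.4, 0.4, 0.2, 0.9, 0.7, seed=1).items()):
        print(key, value)
```

Only the rows labelled "responder" or "non-responder", together with the stage-1 outcomes of all patients, would enter the efficacy analysis.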
The dropout process is assumed random and independent of future outcomes. The joint likelihood for (p1, q1, p2, q2, p3, q3, s2, s3) is then:
2.2 Score test with one degree of freedom
Suppose that investigators have confidence in their knowledge of treatment effects in the enriched portion of the design relative to the treatment effect in the overall population. Define the treatment effects Δ1 = p1 – q1, Δ2 = p2 – q2 and Δ3 = p3 – q3. Define ρ2 and ρ3 such that Δ1ρ2 = Δ2 and Δ1ρ3 = Δ3. The test uses two test parameters r2 and r3 that are known constants. We restrict ri to be 0 ≤ ri < +∞, i = 2, 3. The test is derived under the assumption that r2 = ρ2 and r3 = ρ3. The parameters (p1, q1, p2, q2, p3, q3, s2, s3) are transformed to (Δ1, q1, q2, q3, s2, s3). Then p1 = Δ1 + q1, p2 = r2Δ1 + q2 and p3 = r3Δ1 + q3 and the re-parametrized likelihood is:
Maximum likelihood estimates under H0 obtained by setting Δ1 = 0 and solving the likelihood equations for q1, q2, q3, s2 and s3 are:
When deriving the test statistic, one can use expected or observed information to estimate the variance. Previous researchers10,11 have used expected information. We derived both versions of the tests, one based on expected and one based on observed information. The tests based on expected information (Appendix 1) have more compact formulae than their observed information counterparts (Appendix 2). However, asymptotic power formulae derived from tests based on observed information are usually more accurate. This may be due to the fact that the expected information is averaged over the random sample size, while the observed information essentially conditions on its observed value.12
The derivation of the score test statistic with observed information is given in Appendix 2. The test statistic is:
If r2 = r3 = 0, the test is equivalent to the score test that uses data from Stage 1 only. If r3 = 0, data from the randomized discontinuation part are not used in the analyses and the test is the score test for the SPCD. If both r2 and r3 are large, T1 is close to the score test statistic that uses data from Stage 2 only. If r2 is large compared to r3, the test statistic addresses the comparison in placebo non-responders only; if r3 is large compared to r2, the test statistic addresses the comparison in the randomized withdrawal group only.
Under a local alternative with true parameters (p1, q1, p2, q2, p3, q3, s2, s3), the limiting distribution of the test statistic is non-central chi-square with df = 1 and non-centrality parameter is nγ1 where:
2.3 Score test with two degrees of freedom
Suppose that the investigator is willing to make assumptions regarding the treatment effect in the enriched population of placebo non-responders but is unwilling to make a similar assumption regarding the treatment effect in the enriched population of drug responders. This corresponds mathematically to removing the constraint regarding relationship between treatment effects Δ1 and Δ3. Changing parameters from (p1, q1, p2, q2, p3, q3, s2, s3) to (Δ1, Δ3, q1, q2, q3, s2, s3) by setting p1 = Δ1 + q1, p2 = r2Δ1 + q2 and p3 = Δ3 + q3 the likelihood is:
The distribution of T2 under H0 is chi-square with two degrees of freedom (df). Under local alternatives, the limiting distribution of the test statistic is non-central chi-square with df = 2 and non-centrality parameter is nγ2 where:
Define g as the 1 – α quantile of a chi-square random variable with df = 2, for example, if α = 0.05, g = 5.99. Let λ be the non-centrality parameter such that
A score test for the situation where the constraint is removed from Δ1 and Δ2 but the constraint between Δ1 and Δ3 remains can be constructed similarly.
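The step from a target power to a sample size can be sketched numerically. The block below is a minimal sketch under two assumptions that are our reading of the text rather than formulae quoted from it: that λ solves Pr(χ²_df(λ) > g) = 1 − β, and that, since the non-centrality parameter is nγ, the required total sample size is the smallest integer n with nγ ≥ λ. The helper names are hypothetical, and the value of γ must come from the formulae for γ1, γ2 or γ3.

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a central chi-square X with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        sf, k = math.exp(-half), 2             # df = 2: exp(-x/2)
    else:
        sf, k = math.erfc(math.sqrt(half)), 1  # df = 1: 2 * (1 - Phi(sqrt(x)))
    # Recurrence SF_{k+2}(x) = SF_k(x) + (x/2)^(k/2) exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        sf += math.exp((k / 2) * math.log(half) - half - math.lgamma(k / 2 + 1))
        k += 2
    return sf

def ncx2_sf(x, df, lam, terms=200):
    """P(X > x) for a non-central chi-square, as a Poisson mixture of central ones."""
    if lam == 0:
        return chi2_sf(x, df)
    total = 0.0
    for j in range(terms):
        log_w = -lam / 2 + j * math.log(lam / 2) - math.lgamma(j + 1)
        total += math.exp(log_w) * chi2_sf(x, df + 2 * j)
    return total

def bisect(f, lo, hi, tol=1e-8):
    """Root of a monotone function f on [lo, hi]; f must change sign."""
    f_lo_positive = f(lo) > 0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (f(mid) > 0) == f_lo_positive:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def required_lambda(df, alpha=0.05, power=0.80):
    """Return (g, lambda): critical value and non-centrality giving the target power."""
    g = bisect(lambda x: chi2_sf(x, df) - alpha, 0.0, 100.0)
    lam = bisect(lambda l: ncx2_sf(g, df, l) - power, 0.0, 100.0)
    return g, lam

def total_sample_size(gamma, df, alpha=0.05, power=0.80):
    """Smallest n with n * gamma >= lambda (assumed relation, see lead-in)."""
    _, lam = required_lambda(df, alpha, power)
    return math.ceil(lam / gamma)

if __name__ == "__main__":
    for df in (1, 2, 3):
        g, lam = required_lambda(df)
        print(f"df={df}: g={g:.2f}, lambda={lam:.2f}")
```

For α = 0.05 and 80% power the computed λ is roughly 7.85, 9.63 and 10.90 for df = 1, 2 and 3, which shows the price paid in sample size for each constraint that is dropped when γ is held fixed.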
2.4 The score test with three df
The last scenario we consider is the situation in which the investigator is unwilling to make any assumptions regarding the treatment effect in the enriched populations compared to the treatment effect in the overall population. In this case, no relationship is assumed between Δ1, Δ2 and Δ3. The likelihood is:
Under local alternatives, the limiting distribution of the test statistic is non-central chi-square with df = 3 and non-centrality parameter is nγ3 where:
Let g be the 1 – α quantile of a chi-square random variable with df = 3, for example, if α = 0.05, g = 7.81. Let λ be the non-centrality parameter such that
3 Comparison of test statistics and designs
Total sample size required to achieve 80% power when the two-way enriched design is used with one, two and three df tests with two-sided type I error rate of 0.05 a
Note. df: degrees of freedom.
Values indicated by * are the optimal values for the underlying parameter set, other sets are recommended values, if no information regarding ρ2 and ρ3 is available. If optimal r2 is infinite, r2 is set to 100; similarly for r3. Retention rates are assumed to be equal to 1.
Relationship among the two-way enriched design and its special cases a
By setting the test parameters r2 and r3 in the one degree of freedom test for the two-way enriched design to the values listed below, the corresponding one df tests for the special case designs can be obtained.
Total sample size required to achieve 80% power with one df test with two-sided type I error rate of 0.05 a
Note. df: degrees of freedom.
The designs are: the two-way enriched design (TED); the sequential parallel comparison design (SPCD); a single-stage design with equal allocation (Equal), that is, first-stage data only are used; the placebo lead-in design (Lead-in), that is, Stage 2 data from placebo non-responders only are used and randomized withdrawal (RWithdr), that is, Stage 2 data from drug responders are used. Values indicated by * are the optimal values for the underlying parameter set, other sets are recommended values for one df test, if no information regarding ρ2 and ρ3 is available. If optimal r2 is infinite, r2 is set to 100; similarly for r3. Retention rates are assumed to be equal to 1.
Motivating Example 1: For GAD, it is feasible to run either a 4-week traditional design or a 6-week TED or SPCD design with each stage being 3 weeks. Investigation of the literature suggests that the placebo response rate in the overall population will be around 0.4 with Δ1 = 0.2 yielding p1 = 0.6 and q1 = 0.4. For the placebo non-responders, p2 is assumed to be 0.4 and q2 = 0.2 and for the randomized withdrawal enriched population, p3 = 0.9 and q3 = 0.7. We feel that r2 = r3 = 1 will be good estimates of the treatment effects in the enriched population and are comfortable using these values with a single degree of freedom test. The retention rates are assumed to be 1 for both placebo non-responders and drug responders. Under these scenarios, the sample sizes for TED with equal allocation to placebo and drug in Stage 1, for the SPCD with allocation ratio of 4:3 to placebo versus drug and for the traditional parallel trial are 104, 144 and 194, respectively, for α = 0.05 and β = 0.20. In this example, there is a clear reduction in sample size for the TED over both the SPCD and the traditional parallel trial. If there is no early dropout in the trials, the total patient exposure time for the TED, SPCD and the traditional parallel trial are 624 weeks, 864 weeks and 776 weeks, respectively. Since in most two-stage trials, such as SPCD trials, all patients usually continue in the second stage irrespective of whether or not their data are used in the efficacy analysis, we report total exposure time rather than exposure time in the efficacy analyzable set.
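The figure for the traditional parallel trial and the exposure totals in this example can be checked with a short calculation. The sketch below uses the standard textbook sample size formula for comparing two proportions (pooled variance under the null, unpooled under the alternative); this formula is an assumption made for the check and is not stated in the text to be the one used for the tables, although it reproduces the 194 quoted above. The function names are hypothetical.

```python
import math
from statistics import NormalDist

def parallel_total_n(p, q, alpha=0.05, power=0.80):
    """Total sample size for a two-arm parallel trial with equal allocation,
    using the usual normal-approximation formula for two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p + q) / 2
    numerator = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * math.sqrt(p * (1 - p) + q * (1 - q)))
    per_arm = math.ceil((numerator / (p - q)) ** 2)
    return 2 * per_arm

def exposure_weeks(n_patients, weeks_per_patient):
    """Total patient exposure, assuming no early dropout."""
    return n_patients * weeks_per_patient

if __name__ == "__main__":
    print("parallel trial total n:", parallel_total_n(0.6, 0.4))  # 194
    print("TED exposure:", exposure_weeks(104, 6))       # 624 weeks
    print("SPCD exposure:", exposure_weeks(144, 6))      # 864 weeks
    print("parallel exposure:", exposure_weeks(194, 4))  # 776 weeks
```

The exposure totals are simply the sample size multiplied by the planned weeks per patient (6 weeks for the two-stage designs, 4 weeks for the parallel trial), which is why the SPCD has the largest total exposure here despite needing fewer patients than the parallel trial.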
Motivating Example 2: A randomized withdrawal trial in ulcerative colitis usually has a 6- to 12-week induction of remission stage followed by a maintenance stage that is 24 weeks or longer. A possible modification of this design is to assign some patients to placebo in Stage 1 and then re-randomize placebo non-responders to drug and placebo, subsequently adding this treatment comparison to the efficacy analysis, that is, the TED without first-stage comparison. Such a trial might be more efficient than a randomized withdrawal design. Another possibility is to increase the length of the first stage to 24 weeks to allow for comparison of placebo and drug in Stage 1, followed by the comparisons of placebo and drug in placebo non-responders and drug responders. Assume that the response rates to drug and placebo are p1 = p2 = p3 = 0.6 and q1 = q2 = q3 = 0.4 and that patients are allocated equally to drug and placebo in the first stage. The sample sizes for the TED, the TED without first-stage comparison and the randomized withdrawal design are 124, 328 and 328, respectively, for α = 0.05 and β = 0.20. The sample sizes are based on the one df test with r2 = r3 = 1 used in the TED and the one df test with r2 = r3 = 1000 used in the TED without first-stage comparison, and assume retention rates of 1. Though treatment differences are the same in each sub-population, the sample size for the TED is much smaller than for the other two designs. The total exposure times for the TED (24 weeks in both stages), the TED without first-stage comparison (Stage 1: 12 weeks; Stage 2: 24 weeks) and the randomized withdrawal design (Stage 1: 12 weeks; Stage 2: 24 weeks) are 5952 weeks, 11808 weeks and 6288 weeks, respectively.
We performed simulations of the one, two and three df tests (results are available from the authors) and concluded that the type I error rate for the tests is generally well preserved even for small sample sizes. Our simulation study also showed that the asymptotic power formulae for tests based on observed information provide a good approximation of power and required sample size. The asymptotic power formulae for tests based on expected information might underestimate the required sample size, especially if the required sample size is not large. For example, in the scenario with response rates p1 = 0.5, q1 = 0.3, p2 = 0.4, q2 = 0.1, p3 = 0.9 and q3 = 0.7, according to the asymptotic formula, the test based on observed information requires 80 subjects to achieve 80% power with the one df test with r2 = r3 = 1 and b = 0.5 (Table 1). The simulated power is 0.82. At the same time, the asymptotic power formulae based on expected information call for 76 total subjects. The test based on expected information yields 0.78 power with 76 subjects. When used with the same number of subjects, the power of the test based on observed information is slightly higher on average than the power of the test based on expected information.
4 Testing individual hypotheses about treatment differences
The traditional single-stage randomized study of a drug versus placebo tests the hypothesis H01: Δ1 = 0. Enriched designs test hypotheses that are different from H01. The placebo lead-in tests the hypothesis H02: Δ2 = 0, the randomized withdrawal tests H03: Δ3 = 0. The TED tests H0: {Δ1 = 0}∩{Δ2 = 0}∩{Δ3 = 0} which is a subset of H01. The null hypothesis H0 represents a drug that is ineffective not only in Stage 1 but also in patients who did not improve on placebo in Stage 1, and also ineffective in maintaining response in patients who initially responded to drug. This null hypothesis is appropriate when the investigator has strong belief that an active drug in the overall population implies that the drug is superior to placebo in the subset of placebo non-responders and also superior to placebo in maintenance of response.
Power for testing H0: {Δ1 = 0}∩{Δ2 = 0}∩{Δ3 = 0} and individual hypotheses H01: Δ1 = 0, H02: Δ2 = 0 and H03: Δ3 = 0 using closed testing principle a
The H0 is tested using the one df test with recommended parameters r2 = r3 = 1, H01 is tested with r2 = r3 = 0, H02 is tested with r2 = 100, r3 = 0, and H03 with r2 = 0, r3 = 100. Numbers in parentheses are the power for the Bonferroni procedure where each p-value for testing H01, H02 and H03 is compared with α/3 and H0 is rejected if any of H01, H02 and H03 is rejected.
5 Discussion
Clearly, the TED is particularly well suited to chronic conditions in which achieving response does not imply that the response will be maintained without treatment. For conditions in which response is terminal (e.g. death or pregnancy), designs for absorbing binary endpoints14,15 are more appropriate. Two-stage designs like the SPCD and the TED require longer exposure for patients participating in both stages than the traditional parallel trial. However, since the SPCD or TED requires a smaller sample size than the parallel trial, the total length of the trial in calendar time would typically be shorter. Thus, in considering the TED, it is important to try to minimize the duration of the two stages in order to limit dropout of patients. Typically, the costs in clinical trials depend more on the number of enrolled subjects than on the duration of the trial; therefore, it is often true that a smaller trial with longer duration is less costly than a larger trial with shorter duration. In trials like the TED, blinding of design aspects such as the definition of response and the changeover point or points to both investigators and patients is important. The definition of response needs to be clinically relevant if the ultimate outcome, for example, remission, is not feasible in short-term studies. Such details could be provided to institutional review boards but withheld from the site investigators.
Enriched trials have an important role in early phase studies described as proof of concept intended to detect any positive signal of a drug. In these early phases, minimizing sample size is important in order to be able to control costs. In later-phase studies, enriched trials may be more controversial because they may raise questions regarding the extrapolation of the results of such trials. However, we believe that the use of enriched trials in the confirmatory setting needs to be evaluated on a case-by-case basis, and that there will be situations where enriched trials are warranted. As one example, consider the situation of paediatric depression. Paediatric clinical trials of depression are required for any approved antidepressant and regulatory authorities require such trials to be both placebo and active controlled. However, these trials are difficult to enrol and are known to have a high placebo response rate with a correspondingly large number of failed trials.16 Because of this, there is only one drug, fluoxetine, which is indicated for depression in children under the age of 12 in the United States. The lack of a larger number of approved antidepressants for children is not an ideal situation because not all children will respond to fluoxetine. In this disease, it would seem reasonable to design either an SPCD- or TED-enriched trial as opposed to a larger placebo- and active-controlled traditional trial. The use of the TED might improve enrolment as a larger proportion of patients, 1 – b/2, in a TED trial will receive experimental drug during the trial compared to a typical parallel trial. Ultimately, for diseases in which a drug will likely help only a small proportion of the population, regulators and sponsors need to decide whether an enriched trial will yield more information than the traditional trial.
We believe that our TED is a new useful tool for enriched clinical trials and its application will hopefully allow medical researchers to better understand its actual potential.
Acknowledgements
Ivanova’s work was supported in part by the National Institutes of Health [RO1 CA120082-01A1]. The authors thank Hans Herfarth, Charles Beasley, David DeBrota and anonymous reviewers for helpful comments.
