Abstract
Background:
For a carefully planned and well-designed Phase 3 confirmatory trial, there is still a potential risk of failing to meet the study objective due to possible differences between Phase 2 and Phase 3 studies. As illustrated by the ENGAGE trial, a potential sample size increase at an interim analysis can mitigate the risk for an otherwise underpowered study. Many approaches for sample size adjustment (SSA) require certain modifications to the conventional statistical method, such as changing critical values or using a weighted Z-statistic for final hypothesis testing. Without modification, the type I error rate can be inflated, primarily because of sample size increases after a nonpromising interim observation that is close to the null of no treatment effect. As illustrated by the TOPICAL trial, increasing the sample size after a nonpromising interim result could waste limited resources on an ineffective treatment. The modifications in these approaches therefore sacrifice flexibility and interpretability to accommodate sample size increases that are themselves unnecessary.
Purpose:
To discuss and illustrate the appropriateness of SSA based on promising interim results, that is, conditional power being greater than 50% (the Chen, DeMets, and Lan (CDL) approach), in a carefully planned and well-designed Phase 3 confirmatory trial.
Methods:
Two clinical trials are used to illustrate the clinical setting for the CDL approach and appropriateness of its application. Operating characteristics are assessed and compared to other methods using numeric computation. Hypothetical trials based on real clinical data are used to illustrate the approach.
Results:
The CDL approach for SSA leads to a small increase in expected sample size, resulting in a small power gain versus the fixed design. This indicates that adding SSA will not, on average, substantially affect the budget at the portfolio level. However, when the interim result is promising, the CDL approach can dramatically increase the conditional power, thereby mitigating the risk of an otherwise underpowered study.
Limitations:
Implementation challenges of the SSA methods are not in the scope of this paper. SSA is not intended to replace careful design of a confirmatory trial; instead, it can mitigate the risk for a well-designed trial.
Conclusions:
The CDL approach for SSA based on promising interim results, that is, conditional power being greater than 50%, is particularly useful in mitigating the risk for a carefully planned and well-designed Phase 3 confirmatory trial. No modification to the conventional statistical procedure is necessary while the type I error rate is controlled. Such a feature of “no interference,” or no change to the conventional statistical procedure with or without sample size adjustment, is important for the interpretation of a confirmatory trial. Similar to the fixed design, carefully planned and well-designed group sequential studies can also benefit from SSA to mitigate the risk of failing to meet the study objective.
Introduction
The success of a pivotal Phase 3 study to confirm the efficacy and safety of an experimental drug is critical for regulatory registration and post-approval commercial promotion. Because of its importance, a confirmatory trial should be carefully planned and well-designed. However, even designed with high confidence, a pivotal Phase 3 study can still miss statistical significance due to unexpected reasons such as differences in patient populations between Phases 2 and 3 including inclusion/exclusion criteria, concomitant medications, surrogate versus clinical endpoints, and participating sites. In some cases, the final test yields a p value that is close to the nominal level but fails to reach the required statistical significance, as illustrated in the following example.
Example 1 (Effective aNticoaGulation with factor xA next GEneration in Atrial Fibrillation–Thrombolysis in Myocardial Infarction 48 (ENGAGE AF-TIMI 48) Trial): This was a randomized, double-blind trial including more than 21,000 patients with atrial fibrillation. 1 Patients were randomized to high dose edoxaban, low dose edoxaban, or warfarin. The primary efficacy endpoint was stroke or systemic embolic event and the study planned to accrue 672 events. For our purpose, we consider the comparison between high dose edoxaban and warfarin. The primary objective was to demonstrate noninferiority of edoxaban versus warfarin. If noninferiority was established, superiority would also be tested. As reported by Giugliano et al., 2 the high dose edoxaban met the noninferiority criteria, but in the subsequent superiority test the study failed to show superiority (p = 0.08). While the study was a success in showing noninferiority for the high dose edoxaban, the failure to further establish superiority over standard of care could be a disadvantage in the marketplace post approval when competing with other products with a superiority claim. In fact, after the study results were announced, a competitor issued a press release stating that its product continued to be the only oral anticoagulant that showed superior ischaemic stroke reduction versus warfarin. 3
To reduce the risk, a potential sample size increase at an interim analysis, if accumulating data show a lower than expected treatment effect, has been discussed.4–9 Many approaches require certain modifications to the conventional statistical method, such as changing critical values4–6 or using a weighted Z-statistic for final hypothesis testing.8,9 These approaches consider the general scenario where a sample size increase can occur regardless of the actual interim results. Without modification to the conventional method, the type I error rate will be inflated, 4 primarily because of sample size increases after a nonpromising interim observation that is close to the null of no treatment effect. The modifications in the above approaches, therefore, are intended to account for this scenario at the cost of flexibility and/or interpretability. Increasing the sample size when interim results indicate no or very small effect may not be necessary and could waste limited resources. As in Example 2, if the sample size had been increased based on a low 32% conditional power at an interim analysis with 50% of the data, the additional resources would have been wasted because there was no evidence of benefit. We argue against increasing the sample size when interim results indicate no or very small effect; modifications in the above approaches that are intended to account for such a scenario are therefore not necessary.
Example 2 (TOPICAL Trial): TOPICAL was a double-blind, randomized, placebo-controlled, phase 3 trial to compare erlotinib and placebo in patients with advanced non-small-cell lung cancer. 10 The primary efficacy endpoint was overall survival. Assuming a hazard ratio of 0.75, the study planned to enroll 664 patients (or 550 events) to achieve 90% power with two-sided significance level of 0.05. When 50% of the target number of events were available, the observed hazard ratio was 0.87 with 95% confidence interval (CI) of (0.68, 1.10) and the conditional power under the current trend was 32%. 11 At the end of the study, the hazard ratio was 0.94 with 95% CI of (0.81, 1.10) which failed to show superior efficacy compared to placebo. 10
Sample size adjustment (SSA) based on the promising interim result principle 12 will consider increasing the sample size only if the interim results are “promising,” defined as a conditional power under the current trend of greater than 50%. The type I error rate will not be inflated using the conventional statistical procedure. The definition of “promising” as conditional power being greater than 50% is generally practical and sufficient for a well-designed confirmatory trial. In recent Phase 3 studies with planned SSA at an interim analysis, the maximum sample size increase was generally moderate, between 40% and 60%.13–15 Such an increase, if the intention is to maintain the target power, requires interim results to at least meet the definition of a promising conditional power of >50%. This 50% conditional power principle is further extended by Gao et al. 16 and Mehta and Pocock. 17 The 50% conditional power approach and its extensions, together called promising zone approaches, are compared and evaluated by Wang et al. 18 and Menon et al. 19 A hybrid approach incorporating this “promising” idea but using the weighted statistic has also been proposed. 20 Despite the advancement of methodology research, use of SSA in clinical trial practice remains controversial. In the recent Food and Drug Administration (FDA) draft guidance on adaptive design, 21 sample size re-estimation based on unblinded interim analysis is considered less well-understood. The guidance recommends further research and applications to gain better knowledge and more experience. Subsequently, Wang et al. 18 introduce the concept of “twilight zone,” which refers to the high uncertainty when designing the confirmatory trial based on limited Phase 2 data in a learn-and-confirm type of design. They advise against use of sample size adaptation in such a setting. Instead, they endorse sample size re-estimation in exploratory trials such as Phase 2 studies where a substantial increase in sample size is reasonable.
Instead of new methodological developments, the main objective of this article is to identify a meaningful clinical setting for the applications of the SSA methods: carefully planned and well-designed Phase 3 confirmatory trials. Examples of clinical trials are then used to illustrate the appropriateness of such applications. The clinical setting is introduced in the next section. SSA methods using conventional unweighted statistics 12 and weighted statistics8,9 will be briefly reviewed in “Review of methods” section. We further characterize the SSA methods in “Comparison of methods” section. Hypothetical trials based on real clinical data are used to illustrate the methods in “Applications” section and recommendations are provided in “Conclusion and discussion” section.
Clinical scenario for SSA application
We consider application of SSA methods in Phase 3 confirmatory trials. Instead of looking at the “twilight zone” 18 or a wide range of uncertain design parameters, we consider confirmatory trials that are carefully planned and well-designed. This is in fact the typical clinical setting in the traditional paradigm of drug development.
Prior to committing a big investment in a large Phase 3 trial, the sponsor will usually spend time and resources in understanding the disease, characterizing the compound in development, predicting the safety and efficacy profile based on pre-clinical data, assessing how the investigational compound may affect and be affected by the human body, establishing proof of concept, obtaining a preliminary but reasonably robust estimate of efficacy in relevant patient populations, and determining an appropriate dose for confirmatory trials. There is increasing interest in having an early read of which compounds are likely to succeed, and making the hard decisions about which compounds to terminate. 22 Therefore, collecting robust data to aid in early decisions before Phase 3 is important and will continue to be important for sponsors. For the design of confirmatory Phase 3 studies, additional precautions are usually taken to ensure high confidence in study success. For example, a conservative treatment effect and/or high control rate may be assumed for Phase 3, and a high statistical power such as 80% or 90% is usually used. In the ISENTRESS program, 23 the decision of Go/No-Go to Phase 3 was based on extensive knowledge about the disease, well-understood animal models using in vitro and in vivo data, established proof-of-concept in patients, and favorable efficacy and safety data from two Phase 2 dose ranging trials. In the Phase 2 dose ranging trial in the same patient population, a treatment difference of 55 percentage points (70% vs 15% for active vs placebo, respectively) in response rate was observed across all dose groups. 24 In anticipation of the use of newly approved drugs in the background therapy in Phase 3, the placebo rate was assumed to be 50% and the treatment effect was assumed to be 20 percentage points. 25 Each of the two Phase 3 studies was then powered at 90%.
These precautionary steps may provide some cushion in case the design assumptions are wrong. However, they could be arbitrary and not necessarily adequate. SSA based on promising interim results could be particularly useful in mitigating the unexpected risk.
Review of methods
For our purpose, we consider one sample normal response
Cui, Hung, and Wang
The weighted statistic
Chen, DeMets, and Lan
At the interim analysis, the conditional power under the current trend is defined as
The normalized statistic
Comparison of methods
Consider a simple SSA plan with the sample size increment
Based on equation (8) of Mehta and Pocock, 17
This implies that if the sample size increment is determined by the above formula, the conditional power evaluated at
The following comparisons are based on numerical computations which involve normal integration. R code can be provided per request.
Expected sample size
Denote the true treatment effect

Expected sample size (ESS) as a ratio to original sample size.
Unconditional statistical power
We compare the unconditional power of four approaches: fixed design with original sample size; fixed design with ESS; SSA using Chen, DeMets, and Lan (CDL) unweighted approach; and SSA using CHW weighted Z approach. The statistical power is computed for a true treatment effect

Power comparisons between SSA methods and fixed design.
Figure 2 shows that when we increase the sample size to achieve the target power only if the interim results are promising (i.e. conditional power between 50% and the 90% target power), the expected sample size increase is small relative to the original size; furthermore, this increase is even smaller when we conduct the interim analysis at a late stage (e.g. t = 0.8). In practice, we usually limit the maximum increase to 100%, and if we apply the sample size adjustment rules proposed above, the expected sample size increase would be less than 8% for t = 0.8 and less than 13% for t = 0.5.
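Operating characteristics of this kind can be obtained by one-dimensional normal integration over the interim statistic. The following Python sketch is not the authors' R code; it is a minimal illustration under stated assumptions: a one-sided level of 0.025, a 90% target power, a cap of 100% on the sample size increase, sample sizes expressed as fractions of the original total, a drift parameter `theta` equal to the expected final z-statistic of the original design, and a second-stage size found by bisection so that the current-trend conditional power of the unweighted test reaches the target (which may differ from the exact increment formula used in the paper). All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()  # standard normal distribution


def cp_trend(z1, t, m, z_a):
    """Current-trend conditional power of the unweighted test with second-stage size m."""
    return 1 - N.cdf((z_a * sqrt(t + m) - sqrt(t) * z1) / sqrt(m) - z1 * sqrt(m / t))


def second_stage_size(z1, t, z_a, target, max_total):
    """Keep the planned size unless CP is in [50%, target); then raise it toward the target."""
    m0, m_cap = 1 - t, max_total - t
    if not (0.5 <= cp_trend(z1, t, m0, z_a) < target):
        return m0  # unfavorable or already favorable: no adjustment
    if cp_trend(z1, t, m_cap, z_a) < target:
        return m_cap  # target not reachable within the cap
    lo, hi = m0, m_cap
    for _ in range(40):  # bisection: cp(lo) < target <= cp(hi)
        mid = (lo + hi) / 2
        if cp_trend(z1, t, mid, z_a) < target:
            lo = mid
        else:
            hi = mid
    return hi


def operating_characteristics(theta, t, alpha=0.025, target=0.9, max_total=2.0, step=0.005):
    """Return (rejection probability, expected total size / original size)."""
    z_a = N.inv_cdf(1 - alpha)
    mean1 = theta * sqrt(t)  # mean of the interim z-statistic
    reject = ess = 0.0
    k = int(8 / step)
    for i in range(-k, k + 1):  # Riemann sum over the interim z-statistic
        z1 = mean1 + i * step
        w = N.pdf(z1 - mean1) * step
        m = second_stage_size(z1, t, z_a, target, max_total)
        # Second-stage z needed for the unweighted final statistic to reach z_a.
        crit2 = (z_a * sqrt(t + m) - sqrt(t) * z1) / sqrt(m)
        reject += w * (1 - N.cdf(crit2 - theta * sqrt(m)))
        ess += w * (t + m)
    return reject, ess


theta_design = N.inv_cdf(0.975) + N.inv_cdf(0.9)  # drift giving 90% power originally
type1, _ = operating_characteristics(0.0, t=0.5)
power, ess = operating_characteristics(theta_design, t=0.5)
print(f"type I error = {type1:.4f}, power = {power:.3f}, ESS ratio = {ess:.3f}")
```

Setting `max_total=1.0` disables the adjustment and recovers the fixed design, which gives a convenient baseline for comparing power and expected sample size.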
Conditional operating characteristics
In this section, we study the operating characteristics of SSA using CDL unweighted statistics in terms of ESS and power, conditional on the interim result being promising (i.e.

Expected sample size (ESS), conditional on promising interim results.

Power comparison with fixed design, conditional on promising interim results.
Applications
We first extend the notation to a survival endpoint. Under the proportional hazards assumption,
In the following hypothetical examples motivated by Examples 1 and 2, testing procedures may be different from those used in the examples. The joint normal distribution holds asymptotically, which is generally true in a large confirmatory trial. In the examples below, the sample size increment is determined by the observed treatment effect in order to maintain a target power.
SSA for effective therapy
In Example 1, the hazard ratio was 0.87 with 95% CI of (0.73, 1.04) based on 633 events. 2
Using these data, we consider a hypothetical study with a target number of
We consider SSA at an interim analysis with
If the conditional power at the interim analysis is less than 50% or greater than 80%, the study will continue with the original
Hypothetical trial 1: SSA versus No SSA (assume observed HR = 0.87 at the end of study).
SSA: sample size adjustment; HR: hazard ratio; CDL: Chen, DeMets, and Lan.
No SSA for ineffective therapy
In Example 2, the hazard ratio was 0.94 with a 95% CI of (0.81, 1.10) based on 657 deaths. 10
At the interim analysis with
Let us now assume an observed hazard ratio of 0.87 at the interim analysis with 50% of the target number of events. As shown in “Promising zone of CP between (50%, 80%)” subsection of Appendix 1, the conditional power is 40% in this hypothetical trial. The CDL approach would recommend not increasing the sample size as it is below 50%. This would be a correct decision as the additional investment would have been wasted. A more aggressive decision would be stopping the trial for futility. Note that if a decision is made to increase the sample size, the increase in sample size to ensure that the conditional power with the new sample size
Conclusion and discussion
We consider the application of SSA methods in carefully planned and well-designed Phase 3 confirmatory trials. There is still a potential risk of failing to meet the study objective due to possible differences between Phase 2 and Phase 3 studies. The CDL approach, which allows a sample size increase when the interim results are promising, is particularly useful in mitigating the risk. The requirement of conditional power of >50% for possible SSA is not a binding rule in the way that a binding futility boundary is. There is no modification of the efficacy boundary or other parameters in the CDL approach, whereas for a binding futility boundary the critical value for the efficacy test is lowered to compensate for the early futility stopping. The type I error rate is strictly controlled without modification to the conventional statistical procedure. The conventional unweighted test statistics and critical values can be used without any change. Such a feature of “no interference,” or no change to the conventional statistical procedure with or without SSA, is important for confirmatory Phase 3 trials. Clinical interpretation is easier under the “one patient one vote” principle, and statistical inference (estimation and CI) based on maximum likelihood estimation will be consistent with the unweighted test statistics. The SSA methods do not require lowering the final statistical significance level in order to control the type I error rate (i.e. no change to the final critical value), which also means not lowering the criteria for success, therefore maintaining a consistent threshold for required evidence in establishing efficacy for experimental drugs. By requiring promising interim results in order to increase the sample size, the CDL approach can reduce the chance of mistakenly increasing the sample size for an ineffective treatment.
While the unconditional power is not the primary focus of the CDL method, it provides a small power gain versus a fixed design with the original sample size, based on a small increase in ESS. The small increase in ESS indicates that adding SSA based on promising interim results will not, on average, substantially affect the budget at the portfolio level with multiple programs. Given the potential benefit of saving a pivotal trial, we recommend that SSA based on interim results always be considered and, if appropriate, be used in the design of the confirmatory trial. In addition, as SSA based on promising interim results is not expected to substantially improve unconditional power, the SSA methods, especially the CDL approach, are not intended to re-design or dramatically change the ongoing trial. The new sample size is usually calculated in such a way that after the sample size increase the conditional power is maintained at a high level. Its goal is therefore to ensure that the ongoing study can meet the study objective at the end of the study if in fact the treatment is efficacious. Achieving a high unconditional power is thus not the primary focus of the SSA methods; instead, maintaining a high conditional power under the current trend is the primary goal. Both SSA methods aim to maintain the conditional power at the target level. Additional characteristics based on conditional probabilities for the SSA methods have been discussed. 18 SSA and group sequential designs are intended to address different issues in large Phase 3 trials. Similar to fixed designs, carefully planned and well-designed group sequential studies can also benefit from SSA to mitigate the risk of failing to meet the study objective. The CDL approach can be extended to group sequential studies without changing the conventional group sequential approaches. 12
Some authors caution about early trends with small numbers of patients and events, since such results are generally unstable and any decisions based on them could be incorrect. 28 To help obtain a robust decision on SSA, we recommend an interim analysis with at least 50% of the originally planned data available to reduce the chance of action based on data noise. Some authors argue that this decision can be delayed until more data are available and the estimate is more robust. 16 However, an interim analysis very close to the planned end of the study may also limit the potential value of the method as the conditional power would tend to be close to either 1 or 0. 29 Our numerical study shows similar characteristics up to an information time of 0.8. However, the choice of interim analysis timing should also take other factors such as operational feasibility into consideration.
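The effect of interim timing on the promising zone can be illustrated numerically. Under the standard current-trend conditional power formula, the interim z-values with conditional power in [50%, target) form the interval from z_α√t up to (z_α + z_β√(1 − t))√t, whose width z_β√(t(1 − t)) shrinks as t approaches 1. The sketch below assumes a one-sided level of 0.025 and a 90% target power; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

N = NormalDist()  # standard normal distribution


def promising_zone(t, alpha=0.025, target=0.9):
    """Interim z-values whose current-trend conditional power lies in [50%, target)."""
    z_a, z_b = N.inv_cdf(1 - alpha), N.inv_cdf(target)
    lower = z_a * sqrt(t)                        # conditional power equals 50%
    upper = (z_a + z_b * sqrt(1 - t)) * sqrt(t)  # conditional power equals target
    return lower, upper


for t in (0.5, 0.8, 0.95):
    lo, hi = promising_zone(t)
    print(f"t = {t:.2f}: promising zone for interim z = [{lo:.3f}, {hi:.3f}), width {hi - lo:.3f}")
```

A narrower zone at late information times means that a small change in the interim statistic moves the conditional power from below 50% to above the target, which is one way to read the remark that conditional power tends toward 0 or 1 near the end of the study.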
Footnotes
Appendix 1
Acknowledgements
The authors are full time employees of their corresponding affiliations. They are grateful to the Associate Editor and the two reviewers for their helpful comments which greatly improved the quality of this article.
Declaration of conflicting interests
The authors declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
