In multi-arm adaptive trials, several treatments are assessed simultaneously and accumulating data are used to inform decisions about the trial, such as whether treatments are dropped or continued. Different methodological approaches have been developed for such trials and research has compared the performance of different subsets of these. One particular approach, for which we use the acronym MAMS(R), has generally not been included in these comparisons because control of the family-wise error rate (FWER) could not be guaranteed. Recently, the MAMS(R) approach has been extended to facilitate the generation of efficient designs which strongly control the FWER. We consider multi-arm two-stage trials with binary outcomes and propose parameterising treatment effects using the log odds ratio. We conduct a simulation study comparing the extended MAMS(R) framework with the well-established combination method both for trials where a different outcome is used for mid-trial analysis and for trials where the same outcome is used throughout. We show how the MAMS(R) framework compares favourably only in scenarios where the same outcome is used. We propose a hybrid selection rule within MAMS(R) methodology and demonstrate that this makes it possible to use the MAMS(R) framework in trials incorporating comparative treatment selection.
When several treatments are proposed as candidates for a particular medical condition at the same time, the length of time and total sample size required to evaluate each one in a separate conventional clinical trial may be unacceptable. Multi-arm adaptive trials have been developed to offer a more timely and efficient evaluation. In a multi-arm adaptive trial, several new therapies may be assessed alongside a single control group; this can speed up the process of evaluation and substantially reduce sample size requirements compared with conducting separate trials. Furthermore, a multi-arm adaptive trial is conducted in stages allowing interim analysis of accumulating data to inform how the trial should progress, for example poorly performing arms may be dropped. A useful application of these methods has been the facility to merge a Phase II with a Phase III trial. These so-called ‘seamless trials’ may substantially reduce sample size requirements and also reduce the potentially lengthy ‘white space’ between Phase II and Phase III.1,2
A key challenge in multi-arm adaptive trial methodology is strong control of the family-wise error rate (FWER) so that the probability of recommending an ineffective treatment is not inflated by multiplicity or selection. Several methodological approaches which address this issue have been developed and compared.3,4 A key feature separating these approaches into two main types concerns the manner in which stage-wise test statistics are obtained. These may either be calculated based on data from each stage separately and then combined at the end of the trial, or alternatively may be calculated cumulatively as the trial progresses. The combination method5,6 is a well-established method of the first type which uses a closed testing procedure (CTP)7 to control the FWER strongly. The group sequential method8–10 is of the second type and uses cumulative test statistics, calculated at each stage, which are compared against boundaries defined by critical values. Different types of boundaries can be specified depending on the requirements of a particular trial, for example to allow early stopping for efficacy or futility. The boundaries are obtained using numerical integration such that the Type I error is controlled. A related approach based on cumulative test statistics and stage-wise critical values is that proposed by Royston et al.,11 which we will refer to as the MAMS(R) framework. This method allows different outcomes to be used for the intermediate and final analyses, a useful feature in trials where the primary outcome is observable only after a long time period. Critical values are specified which determine the early dropping of poorly performing treatments, and the Type I error can be calculated for any set of critical values if the correlation between the intermediate and final outcomes is known.
A number of studies have compared the performance of the combination method with the group sequential approach. In the single experimental arm setting, Jennison and Turnbull12 describe how using the combination method allows greater flexibility regarding stage-wise sample sizes but that unplanned changes reduce efficiency because the final test for the treatment difference is not based on a sufficient statistic. Tsiatis and Mehta13 show that for trials where such unplanned changes are made, it is always possible to find a group sequential design which has the same sample size and is more powerful. Kelly et al.14 investigated two-stage and five-stage designs in a practical setting and found the group sequential approach yielded similar or slightly greater power compared with the combination method. However, they confirm the greater flexibility of the combination method by showing that changes to sample sizes made on the basis of interim data analysis may result in a breach of the Type I error in the group sequential approach, but not in the combination method. Comparisons have also been drawn between different approaches in the multi-arm setting where interim data analysis is used to inform treatment selection.15 Friede and Stallard16 compared a number of adaptive trial designs including the group sequential approach and the combination method. They did not find any method to be consistently more powerful than another. Instead, factors such as the size of the treatment effect and the process chosen for selecting treatments determined which approach performed best. Kunz et al.17 considered multi-arm trials where data regarding an early outcome measure are incorporated in the process of treatment selection. They conducted a comparison study and again found that there was no overall advantage for the group sequential approach or the combination method, but that the preferred method depended on treatment effects and correlations between early and final outcomes.
Studies comparing different approaches in multi-arm adaptive design methodology have generally not incorporated the MAMS(R) framework, largely because control of the FWER could not be guaranteed and also because the MAMS(R) framework was developed specifically for trials with survival outcomes. However, MAMS(R) methodology has recently been extended such that binary outcomes may now be accommodated.18 Strong control of the FWER can also be guaranteed and a systematic search procedure has been developed which can produce efficient designs for trials with multiple stages and multiple treatment arms.19
In view of these developments, we consider there is good reason to explore and evaluate the performance of MAMS(R) designs. We propose utilising and further developing the MAMS(R) framework to obtain boundary-based trial designs and then using these designs as the basis for a study of the performance of MAMS(R) compared to the combination method in the setting of two-stage trials. Both approaches are relatively easy for clinicians to understand and implement in the multi-arm context and neither method requires the number of treatments selected at an interim analysis to be specified in advance. Furthermore, each of the approaches can be used in trials where treatment selection is based on the definitive outcome or on an early outcome measure only, without there being any restriction to carry forward only one treatment.
Our motivation is the evaluation of treatments for chronic diseases such as tuberculosis (TB), multiple sclerosis (MS) and osteoporosis. For these conditions, binary outcomes, representing a success or failure recorded for an individual patient, are commonly encountered. A binary outcome in TB could be whether or not a patient converted to a negative sputum culture, in MS whether or not a patient’s disability rating has increased by a given number of units, and in osteoporosis whether or not a patient has suffered a fracture during a certain time period.
We consider two types of multi-arm trials: firstly, trials where the intermediate outcome (I) is different from the definitive outcome (D), perhaps because data regarding the definitive outcome would not be available at an early stage in the trial, which we refer to as I ≠ D trials; and secondly, trials where the same outcome is used throughout, which we refer to as I = D trials. For I ≠ D trials, we use as a basis the Phase II/III seamless trial described by Bratton et al.18 in which several treatment regimens for TB were evaluated. The intermediate outcome is whether conversion to negative culture status has occurred after eight weeks of treatment, and the definitive outcome is whether a patient has relapsed or not during an 18-month period. For I = D trials, we use a two-stage version of the Phase II trial, where the outcome related to relapse is used throughout the trial.
In Section 2, we describe how treatment effects for binary outcomes may be parameterised in terms of a probability difference or a log odds ratio (LOR). We then briefly describe methodology for the combination method and the MAMS(R) framework. In Section 3, we show how MAMS(R) methodology based on the parameterisation of probability difference can be adapted so that efficient two-stage designs based on the LOR may be readily obtained. We discuss the selection rules which are currently implemented in the combination method and the MAMS(R) framework and propose a new hybrid selection rule. We then conduct a simulation study comparing the performance of the MAMS(R) and combination methods for both I ≠ D and I = D trials with binary outcomes, investigating a number of scenarios and a range of treatment effects.
2 Background
2.1 Choice of treatment difference for trials with binary outcomes
For a clinical trial with a binary outcome, let the proportion of patients who have a positive response regarding a chosen outcome be denoted $p_E$ under the experimental treatment and $p_C$ under the control treatment. To compare the new therapy with the control, the difference in proportions, $p_E - p_C$, may be used as a measure of the treatment effect. This option has the merit of simplicity. However, an alternative measure of treatment difference for binary outcomes is the log odds ratio (LOR), defined as $\log\{p_E(1 - p_C) / [p_C(1 - p_E)]\}$. Unlike the measure ‘difference in proportions’, the LOR is asymptotically normally distributed and this may be an advantage when significance tests are based on assumptions of normality. Also, the LOR is closely linked to the logit, the natural parameter used in logistic modelling. There may be times when it is desirable to express a clinical outcome using a modelling approach, perhaps to allow inclusion of relevant covariates. Using the LOR makes this transition more straightforward. In this paper, we have chosen to consider the LOR parameterisation and describe its implementation in the MAMS(R) framework and the combination method.
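To make the LOR parameterisation concrete, the following minimal Python sketch computes the LOR from two response probabilities, recovers the experimental response probability implied by a control rate and a LOR, and gives the usual large-sample standard error of an estimated LOR. The function names are illustrative and are not taken from any of the software discussed in this paper; equal per-arm sample sizes are assumed.

```python
import math


def log_odds_ratio(p_e: float, p_c: float) -> float:
    """LOR comparing experimental (p_e) with control (p_c) response probabilities."""
    return math.log(p_e * (1.0 - p_c) / (p_c * (1.0 - p_e)))


def experimental_rate(p_c: float, lor: float) -> float:
    """Experimental response probability implied by a control rate and a LOR."""
    logit_e = math.log(p_c / (1.0 - p_c)) + lor
    return 1.0 / (1.0 + math.exp(-logit_e))


def lor_standard_error(p_e: float, p_c: float, n_per_arm: int) -> float:
    """Large-sample standard error of the estimated LOR (equal allocation assumed)."""
    return math.sqrt(
        1.0 / (n_per_arm * p_e * (1.0 - p_e))
        + 1.0 / (n_per_arm * p_c * (1.0 - p_c))
    )
```

For example, `experimental_rate(0.75, log_odds_ratio(0.90, 0.75))` recovers a response probability of 0.90, showing that the two functions are inverses of one another for a fixed control rate.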
2.2 MAMS(R) framework
MAMS(R) trials were initially proposed by Royston et al.11 to address the need for increased efficiency in evaluating treatments for diseases where the main outcome of interest is a survival time response. Their approach was developed for the I ≠ D case, but can easily be applied in the I = D case. Recently, MAMS(R) methodology has been extended to accommodate binary outcomes18 with the difference in success rate between the control and the experimental treatments being used to parameterise treatment effects. Briefly, the fundamental elements of the MAMS(R) framework, irrespective of outcome type, are as follows. To obtain a design for a trial with K experimental treatments and a single control, we assume the same null and alternative hypotheses for all treatment arms. We denote the treatment effect, comparing experimental treatment Ti with the control treatment, by $\theta_{ij}$, where i denotes the treatment arm and j denotes the intermediate or definitive outcome. The hypotheses of interest are then
$$H_0: \theta_{ij} \le \theta_j^0 \quad \text{versus} \quad H_1: \theta_{ij} > \theta_j^0, \qquad i = 1, \dots, K, \quad j \in \{I, D\}.$$
For the I ≠ D case, $\theta_I^0$ and $\theta_D^0$ represent the treatment effects under the null hypotheses for the early and definitive outcome, respectively, whereas for the I = D case, $\theta_I^0 = \theta_D^0$. For a superiority trial, $\theta_j^0$ is usually equal to 0, whereas for a non-inferiority trial $\theta_j^0$ generally takes a small but negative value.
In a two-stage MAMS(R) trial, a test of $H_0$ against $H_1$ is conducted for each treatment arm at the end of each stage. Cumulative test statistics calculated for each treatment are compared against predetermined critical values. At the end of stage one, a treatment is dropped if the test statistic falls below the stage one critical value $C_1$. At the end of the second stage, any remaining treatment is declared beneficial if the stage two critical value $C_2$ is exceeded. A key issue in MAMS(R) methodology is how to determine $C_1$ and $C_2$ so that the Type I error is controlled at some specified value. Originally, although designs included several experimental treatment arms, the pair-wise error rate (PWER) was controlled rather than the FWER. Assuming the null hypothesis is true, let standardised test statistics obtained for a given treatment arm at stage one and stage two be denoted $Z_1$ and $Z_2$, respectively. Then
$$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim \mathrm{BVN}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right),$$
where BVN denotes the bivariate normal distribution and ρ is the correlation between the test statistics obtained at the interim and final analyses. For I = D, ρ is a function of the stage-wise sample sizes, whereas for I ≠ D, ρ is a function of the stage-wise sample sizes and also the correlation between the intermediate and definitive outcomes. The probability of a given treatment passing both stages and thereby being declared effective is given by $P(Z_1 > C_1, Z_2 > C_2)$. The PWER is calculated by integration of the tail areas of the joint distribution. Similar expressions for pair-wise power can be obtained by considering the probability of a treatment passing both stages when the alternative hypothesis is true. The critical values $C_1$ and $C_2$ can be chosen on a trial and error basis such that the PWER is no greater than some α and the pair-wise power no less than some ω. This approach has been used for designing both I ≠ D and I = D trials. However, Bratton19 suggests that although this method is appropriate when I = D, it may not be suitable when I ≠ D, since in this case the maximum PWER would in fact occur when a treatment is ineffective on the definitive outcome but is fully effective on the intermediate outcome, so that $C_2$ should be determined solely by the target α as in a single stage trial.
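The probability of passing both stages can be evaluated numerically. The sketch below is one way to do this using only the Python standard library: it integrates over the stage one statistic with Simpson's rule, using the conditional normal distribution of the stage two statistic. It is an illustration under stated assumptions rather than the routine used in the MAMS(R) software; the function names, the truncation of the integration range at ±10 and the optional mean shifts (for power calculations) are choices made for this sketch.

```python
import math


def normal_pdf(z: float) -> float:
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_upper_tail(z: float) -> float:
    """P(Z > z) for a standard normal Z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def prob_pass_both_stages(c1, c2, rho, mean1=0.0, mean2=0.0, steps=4000):
    """P(Z1 > c1, Z2 > c2) for bivariate normal (Z1, Z2) with unit variances.

    rho is the between-stage correlation; mean1 and mean2 are the means of
    Z1 and Z2 (zero under the null hypothesis). steps must be even.
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    lower = max(c1 - mean1, -10.0)  # integrate the centred Z1 from here
    upper = 10.0                    # density is negligible beyond this point
    if lower >= upper:
        return 0.0
    cond_sd = math.sqrt(1.0 - rho * rho)

    def integrand(u):
        # density of centred Z1 times P(Z2 > c2 | centred Z1 = u)
        return normal_pdf(u) * normal_upper_tail((c2 - mean2 - rho * u) / cond_sd)

    h = (upper - lower) / steps
    total = integrand(lower) + integrand(upper)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lower + k * h)
    return total * h / 3.0
```

With both means set to zero the function returns a PWER for given critical values and correlation; supplying the standardised effects under the alternative hypothesis as `mean1` and `mean2` gives the corresponding pair-wise power.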
Bratton19 proposes a method for obtaining a set of critical values for a MAMS(R) trial which ensures that the FWER is controlled at a specified level. Following the approach to trial design introduced by Simon20 and then developed by Jung et al.21 and Mander et al.,22 Bratton developed software in which a systematic search procedure is used to generate a set of designs which achieve a specified FWER and pair-wise power; such designs are termed ‘feasible’. The overall expected sample size of each feasible trial, denoted N, is then calculated under two scenarios: firstly under the global null hypothesis and secondly under the situation where all arms have treatment effects on I and D equal to some reference values, denoted $\theta_I^1$ and $\theta_D^1$, which are specified by the user. We denote these two scenarios using $H_0$ and $H_1$, respectively. Designs which minimise a weighted sum of these two measures are identified as ‘admissible’. To implement this approach for a trial with K experimental treatment arms and a target FWER of α, the PWER for each treatment arm is first set to $1 - \Phi(z_\alpha)$, where $z_\alpha$ satisfies the Dunnett probability $\Phi_K(z_\alpha, \dots, z_\alpha; C) = 1 - \alpha$, where $\Phi_K$ is the K-dimensional multivariate normal distribution function and C is the between-arm correlation matrix. A search is then carried out over many possible combinations of values for $C_1$ and $C_2$, and designs which are feasible are retained. Identifying admissible designs requires calculation of $E(N \mid H_0)$ and $E(N \mid H_1)$. Obtaining these measures requires calculation of the per-treatment stage-wise sample sizes and also the numerical evaluation of the probability that k out of K treatments will reach stage two of the trial under each hypothesis. This probability may be obtained using a simulation approach somewhat similar to the method described by Wason and Jaki.23 Test statistics with the appropriate correlation structure are generated for each treatment at each stage of a MAMS(R) trial in accordance with Equation 5.1 described by Bratton,19 which in the notation of this paper and for equal allocation to experimental and control treatment arms can be expressed as
$$Z_{is} = X_{is} + \frac{\theta_{is} - \theta_s^0}{\sigma_s},$$
where $X_{is}$ are standard normally distributed random variables generated for $i = 1, \dots, K$ with the appropriate between-stage correlation of test statistics, $\theta_{is}$ is the true treatment effect for treatment i on the outcome of interest at stage s, $\theta_s^0$ is the treatment effect at stage s under the null hypothesis and $\sigma_s$ is the standard deviation of the observed treatment effects under $H_0$. By simulating test statistics for a large number of trials and observing the proportion of trials where k out of K treatments pass stage one, the required probabilities can be estimated and then used to determine $E(N \mid H_0)$ and $E(N \mid H_1)$. A loss function denoted L, similar to that proposed by Mander et al.,22 is then specified. L is a weighted sum of $E(N \mid H_0)$ and $E(N \mid H_1)$, and admissible designs are defined as those which minimise the loss function for a chosen weight q, such that $L = q\,E(N \mid H_0) + (1 - q)\,E(N \mid H_1)$, where $0 \le q \le 1$. Using this extended methodology, designs which strongly control the FWER can be readily produced for both I ≠ D and I = D trials with multiple treatment arms and multiple stages. In principle, the methods could be extended to accommodate any outcome measure which has an asymptotically normally distributed test statistic, provided the between-stage correlation structure is known.
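The simulation step and the loss function described above can be sketched in a few lines of Python. This is a simplified illustration, not the algorithm implemented in the Stata routines: it assumes equal allocation (so that stage one statistics share a between-arm correlation of 0.5 through the common control arm), a common standardised mean shift for all arms, and a trial that stops after stage one if no arm passes. All function names are hypothetical.

```python
import random


def stage_one_pass_distribution(k_arms, c1, shift=0.0, n_sims=20000, seed=1):
    """Estimate P(k of K arms pass stage one) for k = 0..K by simulation.

    shift is the standardised mean of each stage one statistic (0 under the
    global null). Sharing one control draw gives between-arm correlation 0.5.
    """
    rng = random.Random(seed)
    counts = [0] * (k_arms + 1)
    for _ in range(n_sims):
        control = rng.gauss(0.0, 1.0)
        passed = 0
        for _ in range(k_arms):
            z = (rng.gauss(0.0, 1.0) + control) / 2 ** 0.5 + shift
            if z > c1:
                passed += 1
        counts[passed] += 1
    return [count / n_sims for count in counts]


def expected_total_sample_size(pass_probs, n1, n2):
    """Expected total N given cumulative per-arm sample sizes n1 and n2.

    Assumes (for this sketch) that all K + 1 arms recruit n1 patients and
    that the control arm continues only if at least one treatment passes.
    """
    k_arms = len(pass_probs) - 1
    total = 0.0
    for k, prob in enumerate(pass_probs):
        stage_two = (k + 1) * (n2 - n1) if k > 0 else 0
        total += prob * ((k_arms + 1) * n1 + stage_two)
    return total


def loss(q, en_null, en_alt):
    """Weighted sum L = q E(N | H0) + (1 - q) E(N | H1), with 0 <= q <= 1."""
    return q * en_null + (1.0 - q) * en_alt
```

A design search would evaluate `loss` for every feasible pair of critical values and retain, for each q, the design with the smallest value.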
2.3 Combination method
Combination test methodology can accommodate a variety of outcome types, and the test statistics used for treatment selection at stage one can relate either to the definitive outcome D or to a suitable early outcome I. The fundamental elements of the combination test are as follows. Consider again a two-stage trial where there are K experimental treatment arms and a single control arm. Taking first the case when I = D, the treatment effect calculated at the end of each stage is denoted $\theta_i$ and the hypotheses of interest are then
$$H_{0i}: \theta_i \le \theta^0 \quad \text{versus} \quad H_{1i}: \theta_i > \theta^0, \qquad i = 1, \dots, K.$$
At the end of the first stage, test statistics are calculated to test $H_{0i}$ against $H_{1i}$ for each treatment arm. These test statistics are used initially to make a decision concerning which treatments should be continued into the second stage of the trial; for example, the treatment arm associated with the largest test statistic may be selected. At the end of the second stage, test statistics relating to each treatment arm still in the trial are calculated as before, using data from the second stage only.
At the end of the trial, the test statistics arising from each stage are used in a CTP7 to produce a set of stage one and stage two p-values. In a CTP, p-values must be obtained for all possible composite or intersection null hypotheses as well as each individual null hypothesis. For example, in a trial with three experimental treatment arms, stage one and stage two p-values are obtained for individual null hypotheses $H_{01}$, $H_{02}$ and $H_{03}$ and for the intersection null hypotheses $H_{0,12}$, $H_{0,13}$, $H_{0,23}$ and $H_{0,123}$. For the intersection hypotheses, the methods of Dunnett24 can be applied such that for $H_{0,123}$, the p-value will equate to the Dunnett-adjusted p-value relating to the largest of the three observed test statistics. For the final analysis of treatment effectiveness, the stage-wise p-values for each individual and intersection null hypothesis are combined, producing an overall p-value for each one, the only requirement being that the distribution of $p_2$ conditional on $p_1$ should be stochastically no smaller than the uniform distribution.25 One approach is to use the weighted inverse normal method proposed by Lehmacher and Wassmer,26 which calculates the final p-value using $p = 1 - \Phi\{w_1 \Phi^{-1}(1 - p_1) + w_2 \Phi^{-1}(1 - p_2)\}$, where Φ denotes the normal distribution function and $w_1$ and $w_2$ are predetermined weights specified for each stage of the trial such that $0 \le w_s \le 1$ and $w_1^2 + w_2^2 = 1$, the weights being determined by the stage-wise sample sizes. An intersection hypothesis is rejected at level α if $p \le \alpha$. An experimental treatment is declared superior to the control at level α only if the individual null hypothesis and all relevant intersection hypotheses are rejected. For example, at the end of the trial T2 is declared beneficial only if $H_{02}$, $H_{0,12}$, $H_{0,23}$ and $H_{0,123}$ are all rejected at level α. Using a CTP in this way ensures strong control of the FWER when multiple hypotheses are being tested.
In the second stage, a subset defining an intersection hypothesis may contain a dropped treatment. In this instance, following the methods adopted by Posch et al.27 and Friede et al.,28 the second stage p-value for this intersection hypothesis is set as the p-value for the group of treatments contained in the original subset and selected for the second stage. If the set is empty, then the second stage p-value is set to 1. For the case where I ≠ D, the same procedure is used except that the test statistics initially obtained at the end of stage one relate to an early outcome. These test statistics are used to inform treatment selection but are not used in the final analysis of treatment efficacy. Once data regarding the definitive outcome become available, these are used to obtain the test statistics for the stage one group of patients, which are then used exactly as for the I = D case.
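The weighted inverse normal combination and the closed testing decision described in this section can be sketched as follows. This is a minimal illustration using the Python standard library, not the implementation in the ‘asd’ package; the function names are hypothetical, the weights are assumed to be the square roots of the stage-wise sample size fractions, and stage-wise p-values are assumed to lie strictly between 0 and 1.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def inverse_normal_combination(p1, p2, n_stage1, n_stage2):
    """Combine stage-wise p-values using sample-size-based weights.

    n_stage1 and n_stage2 are stage-wise (not cumulative) sample sizes, so
    that w1**2 + w2**2 = 1. Requires 0 < p1 < 1 and 0 < p2 < 1.
    """
    w1 = math.sqrt(n_stage1 / (n_stage1 + n_stage2))
    w2 = math.sqrt(n_stage2 / (n_stage1 + n_stage2))
    z = w1 * _STD_NORMAL.inv_cdf(1.0 - p1) + w2 * _STD_NORMAL.inv_cdf(1.0 - p2)
    return 1.0 - _STD_NORMAL.cdf(z)


def rejected_by_closed_test(arm, combined_p, alpha):
    """True if every hypothesis whose index set contains `arm` is rejected.

    combined_p maps a frozenset of arm indices (an individual or intersection
    hypothesis) to its combined p-value.
    """
    return all(p <= alpha for subset, p in combined_p.items() if arm in subset)
```

In a trial with three experimental arms, `combined_p` would hold seven entries, and treatment 2 would be declared beneficial only if the four entries whose index sets contain 2 all have combined p-values no greater than α.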
3 Methods
For the investigations detailed in this paper, we modified routines available in Stata, as described in Section 3.1. We used these modified routines to obtain designs for trials when I ≠ D and when I = D. An integral part of any multi-arm adaptive trial is the selection rule, and in Section 3.2 we consider this in detail and also suggest a hybrid rule for MAMS(R) trials when I = D. Based on the trial designs obtained, we conducted a simulation study comparing the MAMS(R) framework with the combination method, as described in Section 3.3.
3.1 Adapting trial designs in the MAMS(R) framework for the LOR
Feasible and admissible designs for trials with binary outcomes, where the treatment difference is parameterised as ‘difference in proportions’, can be readily generated according to the approach described in Section 2.2 by using the nstagebin and nstagebinopt MAMS(R) programs for Stata.19 We adapted these programs to produce designs for two-stage MAMS(R) trials with a binary outcome and the treatment effects parameterised as a LOR. Formulae used in the routines for calculating suggested sample sizes and the variance of the treatment effect were modified to reflect the LOR parameterisation. Sample sizes suggested by the LOR formulae are approximate and may be over-estimated under the LOR,29 so we incorporated a new routine to refine stage-wise sample sizes so that the Type I error is as close to the target value as possible. Further details are given in Appendix 1. Bratton19 derived expressions based on the parameterisation ‘difference in proportions’ for the between-stage correlation of test statistics. For trials when I = D, these expressions remain the same under the LOR. However, for trials when I ≠ D, we were unable to obtain an analytical expression based on the LOR. Therefore, we adapted for the binary context a simulation-based approach described by Bratton et al.30 for approximating between-stage correlations of early and definitive test statistics in trials with a survival outcome. Again, further details are given in Appendix 1.
3.2 Selection rules
There are a number of different selection rules which may be used in a multi-arm adaptive trial; for example, a rule may specify that the k best performing treatments are continued in the trial or alternatively that all treatments meeting a certain standard are continued. A particularly flexible selection rule which encompasses many different selection options is the ‘epsilon’ rule,31 whereby the treatment associated with the largest test statistic is selected to continue along with all others whose test statistic is within a specified range ε of the largest. Note that when ε = 0, only the best treatment is selected, and when ε = ∞, all treatments are selected to continue.
The MAMS(R) framework uses thresholds for treatment selection as well as in the final analysis of treatment efficacy. When I ≠ D, the threshold for the early outcome is not binding (see Section 2.2), and therefore an epsilon rule may be used in place of the threshold without inflating the Type I error rate. However, when I = D, all thresholds, including those which determine the treatments selected to continue, are binding, and therefore control of the FWER is not guaranteed if an epsilon rule is used. For trials where a more comparative selection rule is required, we propose implementing a ‘hybrid’ rule in the MAMS(R) framework, where the selection process occurs in two steps. Firstly, the interim test statistics associated with each treatment group are compared to the threshold $C_1$ and only those meeting this standard are retained. Then, an epsilon selection rule is implemented, so that the best performing of the retained treatments is selected along with any other retained treatment whose test statistic is within ε of the largest.
The combination method can accommodate a variety of selection rules and the user may choose a rule which facilitates the aims of the particular trial. For example, if the objective is for the early dropping of poorly performing arms, then a threshold rule may be chosen. Alternatively, if the aim is for a more comparative approach such that only the best performing treatments are selected, then an epsilon rule may be implemented.
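The three selection rules discussed in this section can be written compactly. The following Python sketch is illustrative only; the function names are hypothetical, treatments are identified by their position in the list of interim test statistics, and a treatment is taken to meet the threshold when its statistic is at least the stage one critical value.

```python
import math


def threshold_rule(z_stats, c1):
    """Indices of treatments whose interim statistic is not below c1."""
    return [i for i, z in enumerate(z_stats) if z >= c1]


def epsilon_rule(z_stats, epsilon, candidates=None):
    """Best candidate plus all candidates within epsilon of the best."""
    if candidates is None:
        candidates = list(range(len(z_stats)))
    if not candidates:
        return []
    best = max(z_stats[i] for i in candidates)
    return [i for i in candidates if best - z_stats[i] <= epsilon]


def hybrid_rule(z_stats, c1, epsilon):
    """Apply the threshold first, then the epsilon rule among survivors."""
    return epsilon_rule(z_stats, epsilon, threshold_rule(z_stats, c1))


# Illustrative interim statistics for four experimental treatments.
z = [2.1, 1.9, 0.3, 1.0]
selected_best_only = epsilon_rule(z, 0.0)       # epsilon = 0 keeps only the best
selected_all = epsilon_rule(z, math.inf)        # infinite epsilon keeps all
selected_hybrid = hybrid_rule(z, 0.739, 0.5)    # threshold, then epsilon = 0.5
```

Because the hybrid rule can only remove treatments that the threshold rule would have retained, it never carries forward a treatment that fails to meet the binding stage one threshold.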
3.3 Simulation study
We conducted a simulation study to compare the performance of the MAMS(R) framework and the combination method for conducting two-stage trials with a binary outcome, using the LOR parameterisation and a variety of selection rules. We considered first the case when I ≠ D and then the case when I = D. As highlighted above, modified Stata routines were used to obtain MAMS(R) trial designs. When implementing the combination method, we used a number of routines from the R package ‘asd’ by Parsons et al.32
3.3.1 Trials when I ≠ D
The trial which motivates the simulation study is a Phase II/III trial described by Bratton et al.18 in which a Phase II superiority trial and a Phase III non-inferiority trial were combined to create a seamless trial. We specify a one-sided FWER of 0.025 (to match a conventional two-sided error rate of 0.05), a pair-wise power of 0.9 and a 1:1 allocation ratio. Control arm event rates for the I and D outcomes are 0.75 and 0.90, respectively. Treatment effects for the I outcome are set at and and for the D outcome at and We used our revised routines based on the LOR to produce feasible and admissible MAMS(R) designs for two-stage three-arm and six-arm trials where I ≠ D, choosing the design which is admissible across the widest range of q (see Section 2.2). Details of the chosen MAMS(R) designs are given in Table 1.
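Table 1. Summary of two-stage designs used in the simulation study for trials where I ≠ D.

| Design | Stage | α (critical value) | ω | Cumulative per-arm sample size |
|---|---|---|---|---|
| Two experimental treatment arms | Stage 1 | 0.0700 (1.476) | 0.97 | 207 |
| Two experimental treatment arms | Stage 2 | 0.0135 (2.212) | 0.82 | 743 |
| Five experimental treatment arms | Stage 1 | 0.0400 (1.751) | 0.97 | 244 |
| Five experimental treatment arms | Stage 2 | 0.0060 (2.511) | 0.82 | 895 |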
Using the designs, we evaluated performance across a range of values for the underlying treatment effect of T1 on the definitive outcome, denoted $\theta_{1D}$. The effect of T1 on the early outcome was held constant. For each value of $\theta_{1D}$, we calculated the percentage of trials in which any non-null treatment was declared beneficial at the end of the trial. Based on the sample sizes specified for the chosen three-arm and six-arm designs, we simulated individual patient data for 100,000 trials for each value of $\theta_{1D}$ under two different scenarios. In the first scenario, all experimental treatments other than T1 were ineffective on both early and definitive outcomes. In the second scenario, the other experimental treatments were partially effective on the definitive outcome, with their effect on the early outcome held constant. Using a threshold rule initially, we compared the performance of the MAMS(R) framework and the combination method. We then implemented an epsilon rule, setting ε to emulate a moderately stringent rule, partway between selecting one treatment and selecting all treatments. Again, we compared the performance of the MAMS(R) framework and the combination method.
3.3.2 Trials when I = D
The trial motivating the simulation study is a two-stage Phase II superiority trial as described by Bratton et al.18 A one-sided FWER of 0.025, a pair-wise power of 0.8 and a 1:1 allocation ratio are specified. Control arm event rates and treatment effects are the same for both stages of the trial and are as described for the D outcome in Section 3.3.1. Using the approach described for I ≠ D, we obtained MAMS(R) designs for two-stage three-arm and six-arm trials where I = D. The chosen MAMS(R) designs are described in Table 2.
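Table 2. Summary of two-stage designs used in the simulation study for trials where I = D.

| Design | Stage | α (critical value) | ω | Cumulative per-arm sample size |
|---|---|---|---|---|
| Two experimental treatment arms | Stage 1 | 0.2300 (0.739) | 0.94 | 92 |
| Two experimental treatment arms | Stage 2 | 0.0160 (2.144) | 0.94 | 250 |
| Five experimental treatment arms | Stage 1 | 0.1900 (0.878) | 0.95 | 113 |
| Five experimental treatment arms | Stage 2 | 0.0070 (2.457) | 0.93 | 286 |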
We compared the performance of the MAMS(R) framework and the combination method in the same manner as for I ≠ D. Since for I = D the intermediate and definitive outcomes are the same, we do not use the subscript D for θ, the underlying treatment effect for T1 being simply denoted $\theta_1$. As before, we simulated individual patient data for 100,000 trials for each value of $\theta_1$ under two different scenarios, such that in the first, all experimental treatments other than T1 were ineffective and in the second, the other experimental treatments were partially effective. As for I ≠ D, the performance of the MAMS(R) framework and the combination method were compared when a threshold rule is used. We then implemented an epsilon rule for the combination method and used the new hybrid rule for the MAMS(R) framework.
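For the I = D case, the generation of individual patient data and cumulative LOR test statistics for a single experimental arm can be sketched as below. This is an assumption-laden illustration rather than the simulation code used in the study: the function names are hypothetical, 0.5 is added to each cell of the 2×2 table to avoid division by zero, and the control arm is simulated afresh for the one comparison shown (in a multi-arm trial the same control data would be shared across comparisons).

```python
import math
import random


def simulate_successes(rng, n, p):
    """Number of successes among n patients, each with success probability p."""
    return sum(1 for _ in range(n) if rng.random() < p)


def lor_z_statistic(succ_e, n_e, succ_c, n_c, theta0=0.0):
    """Standardised LOR statistic for testing the null value theta0.

    Adding 0.5 to every cell is an assumption of this sketch.
    """
    a, b = succ_e + 0.5, n_e - succ_e + 0.5
    c, d = succ_c + 0.5, n_c - succ_c + 0.5
    lor = math.log(a * d / (b * c))
    se = math.sqrt(1.0 / a + 1.0 / b + 1.0 / c + 1.0 / d)
    return (lor - theta0) / se


def simulate_two_stage_arm(rng, p_e, p_c, n1, n2):
    """Cumulative stage one and stage two statistics for one arm vs control.

    n1 and n2 are cumulative per-arm sample sizes at the two analyses.
    """
    e1 = simulate_successes(rng, n1, p_e)
    c1 = simulate_successes(rng, n1, p_c)
    e2 = e1 + simulate_successes(rng, n2 - n1, p_e)
    c2 = c1 + simulate_successes(rng, n2 - n1, p_c)
    return lor_z_statistic(e1, n1, c1, n1), lor_z_statistic(e2, n2, c2, n2)
```

Repeating `simulate_two_stage_arm` over many seeds and applying a selection rule and a final test to the resulting statistics gives Monte Carlo estimates of the kind reported in Section 4.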
4 Results
4.1 Trials when I ≠ D
In this section, two sets of results are presented relating to the case where I ≠ D. The first gives a direct comparison of performance between the MAMS(R) framework and the combination method when both implement a threshold selection rule; this reflects the usual mode of operation for the MAMS(R) framework. The second set gives a further comparison of performance to show the effect of implementing a different selection rule.
4.1.1 Comparison of the MAMS(R) framework and the combination method using a threshold selection rule
Table 3 presents estimated probabilities of declaring effectiveness on the final outcome across a range of values for $\theta_{1D}$, firstly for any non-null treatments and secondly for null or partially effective treatments only. Results for the three-arm design are presented in the upper section of the table and for the six-arm design in the lower section. On the left-hand side of the table, results are presented for scenarios where treatments other than T1 are ineffective on both the early and the final outcome, while results for scenarios where treatments other than T1 are partially effective on both the early and final outcome are given on the right-hand side. The rows of the table refer to the different values of $\theta_{1D}$ investigated. Results in bold show the percentage of trials in which any non-null treatment is declared beneficial, for different values of $\theta_{1D}$ (the effect of T1 on the early outcome being held constant). The results in parentheses give the percentage of trials in which at least one of the null or partially effective treatments is declared beneficial.
Table 3. Comparison of power for the MAMS(R) framework and the combination method under a threshold selection rule for trials where I ≠ D.
Left-hand columns: % trials treatment 1 declared beneficial (% trials where one or more null treatment(s) declared beneficial).
Right-hand columns: % trials any non-null treatment declared beneficial (% trials where one or more partially effective treatment(s) declared beneficial).
Note: Scenarios where no treatments which are partially effective on the final outcome are present.
In Table 3, the results in bold show that under a threshold selection rule, the combination method results in marginally greater power than the MAMS(R) framework. This general finding is observed for the three-arm and the six-arm design and across all scenarios and treatment effects investigated. The slight power advantage of the combination method over the MAMS(R) framework is larger for the six-arm design than for the three-arm design. However, the advantage is somewhat smaller for scenarios where partially effective treatments are present than for scenarios where treatments other than T1 are ineffective. The results in parentheses on the left-hand side of Table 3 show that when treatments other than T1 are ineffective, the percentage of trials in which null treatments are declared effective is very low for both methods, as expected. As the effect of T1 on the final outcome increases, this percentage increases slightly for the combination method because, for any given trial, the presence of the more effective treatment makes rejection of any intersection hypothesis which encompasses the null hypothesis for this treatment more likely. This increase does not occur for the MAMS(R) framework, where the progress of individual treatment arms is not affected by the performance of other treatments. The percentages are substantially higher when T1 is itself ineffective on the final outcome because T1 is then more likely than the other treatments to progress to the second stage and be declared effective on the final outcome, its early outcome effect being held constant across all values of its final outcome effect. Even so, the percentage is much lower than 2.5% because the trials are designed such that the target FWER is 2.5% when all treatments are fully effective on the early outcome but ineffective on the final outcome (see Section 2.2).
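The dependence of one treatment's result on another, which arises through intersection hypotheses in the combination method, can be illustrated with a short sketch. It assumes an inverse normal combination function with equal pre-specified stage weights and Bonferroni-adjusted intersection p-values; these are illustrative choices and may differ from the tests used in the study. The arm labels and p-values are hypothetical.

```python
from itertools import combinations
from statistics import NormalDist

_N = NormalDist()

def _clip(p, eps=1e-12):
    """Keep p strictly inside (0, 1) so the normal quantile is defined."""
    return min(max(p, eps), 1 - eps)

def inverse_normal(p1, p2, w1=0.5):
    """Combine one-sided stage-wise p-values; w1 is the pre-specified stage-one weight."""
    z = (w1 ** 0.5) * _N.inv_cdf(1 - _clip(p1)) + ((1 - w1) ** 0.5) * _N.inv_cdf(1 - _clip(p2))
    return 1 - _N.cdf(z)

def bonferroni(pvalues):
    """Bonferroni-adjusted p-value for an intersection hypothesis (1.0 if no arms remain)."""
    return min(1.0, len(pvalues) * min(pvalues)) if pvalues else 1.0

def closed_test(p1, p2, alpha=0.025, w1=0.5):
    """Closed testing over all intersection hypotheses.

    p1: dict arm -> stage-one p-value (all arms)
    p2: dict arm -> stage-two p-value (selected arms only)
    Returns the set of arms whose null hypothesis is rejected at level alpha.
    """
    arms = sorted(p1)
    rejected = set()
    for k in p2:
        ok = True
        for r in range(1, len(arms) + 1):
            for subset in combinations(arms, r):
                if k not in subset:
                    continue
                q1 = bonferroni([p1[j] for j in subset])
                q2 = bonferroni([p2[j] for j in subset if j in p2])
                if inverse_normal(q1, q2, w1) > alpha:
                    ok = False   # an intersection containing k is not rejected
        if ok:
            rejected.add(k)
    return rejected
```

With these assumptions, `closed_test({"T1": 0.0005, "T2": 0.08}, {"T1": 0.0005, "T2": 0.03})` rejects the null hypothesis for both arms, whereas replacing both T1 p-values with 0.4 leaves T2 unrejected even though its own data are unchanged: the intersection hypothesis is then no longer rejected.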
As the effect of T1 on the final outcome increases, there is a sharp increase in the percentage of trials in which partially effective treatments are declared effective, shown by the results in parentheses on the right-hand side of Table 3. This is an expected finding when selection is determined by a threshold. The rate tends to be slightly lower for MAMS(R) than for the combination method.
4.1.2 Performance of the MAMS(R) framework and the combination method under the epsilon selection rule
In Figure 1, power curves are presented showing the performance of the MAMS(R) framework and the combination method under both the threshold and the epsilon selection rule. The upper sets of four lines are obtained by plotting the percentage of trials where any non-null treatment is declared effective on the final outcome, for different values of the effect of T1 on the final outcome. The lower sets of four lines show the percentage of trials where at least one null or partially effective treatment is declared beneficial on the final outcome. Panels (i) and (ii) show results for the three-arm design and panels (iii) and (iv) for the six-arm design. In panels (i) and (iii), results are presented for scenarios where treatments other than T1 are ineffective on both the early and the final outcome. Results for scenarios where treatments other than T1 are partially effective on both the early and final outcome are shown in panels (ii) and (iv).
Figure 1. Comparison of the MAMS(R) framework and combination method under threshold and epsilon selection rules for trials where I ≠ D. Upper lines are estimated power to declare any non-null treatment beneficial and lower lines show the percentage of trials where at least one null or partially effective treatment is declared beneficial.
Considering the upper sets of lines in Figure 1, the percentage of trials where a non-null treatment is declared effective is consistently greater when an epsilon rule is used in place of the threshold rule. This is true for both the MAMS(R) framework and the combination method and reflects the operation of the epsilon selection rule at the interim analysis, which allows the most effective treatment through to the second stage even when the threshold required under the threshold rule has not been met. The separation resulting from the change in selection rule is larger for the combination method than for the MAMS(R) framework; this is most obvious at the larger treatment effects investigated and for the scenarios where partially effective treatments are present (panels (ii) and (iv)). As discussed in Section 4.1.1, under a threshold rule, the combination method is marginally more powerful than the MAMS(R) framework across all the scenarios investigated, although there is less difference between the two methods when partially effective treatments are present. Under an epsilon rule, the combination method is again more powerful than the MAMS(R) framework, but the advantage tends to be larger and is not reduced when partially effective treatments are present. For the six-arm design where partially effective treatments are present (panel (iv)), the combination method with the epsilon rule clearly provides the greatest power across all treatment effects.
Considering the lower sets of lines in Figure 1, it is clear that, compared with the threshold rule, implementing an epsilon selection rule substantially reduces the rate at which partially effective treatments are declared effective at the final analysis. In some settings, this may be viewed as desirable. In the MAMS(R) framework, the usual use of a threshold rule facilitates the objective of declaring any non-null treatment(s) effective whereas moving away from the threshold towards an epsilon selection rule results in a more directed result, with greater power to select the best treatment and a reduced probability of declaring inferior treatments beneficial.
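The contrast between the two selection rules can be made concrete with a small sketch. It assumes the common form of the epsilon rule, in which every treatment whose interim estimate lies within ε of the best estimate continues; the function names, estimates, threshold and ε below are hypothetical.

```python
def threshold_rule(estimates, threshold):
    """Indices of arms whose interim estimate exceeds a fixed threshold."""
    return [k for k, est in enumerate(estimates) if est > threshold]

def epsilon_rule(estimates, epsilon):
    """Indices of arms whose interim estimate is within epsilon of the best arm.

    The best arm always continues, whatever the size of its estimate.
    """
    best = max(estimates)
    return [k for k, est in enumerate(estimates) if est >= best - epsilon]
```

For interim estimates `[0.6, 0.35, 0.1]`, a threshold of 0.3 lets the first two arms continue, whereas an epsilon rule with ε = 0.1 keeps only the best arm. For weaker estimates `[0.2, 0.15, -0.1]`, the same threshold drops every arm, while the epsilon rule still takes the first two arms forward.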
4.2 Trials when I = D
In this section, results for the case where I = D are considered. As before, two sets of results are presented, the first set relating to a direct comparison under a threshold selection rule and the second set showing the effect of implementing different selection rules; results are given for the combination method under the threshold and the epsilon rule and for the MAMS(R) framework under the threshold and the hybrid rule (see Section 3.2).
4.2.1 Comparison of the MAMS(R) framework and the combination method using a threshold selection rule
Table 4 presents estimated probabilities of declaring effectiveness, firstly for any non-null treatment and secondly for any null or partially effective treatment(s). The structure of the table is as for Table 3. Note that on the left-hand side of the table results are presented for scenarios where treatments other than T1 are ineffective, while results for scenarios where treatments other than T1 are partially effective are given on the right-hand side.
Table 4. Comparison of power for the MAMS(R) framework and the combination method under a threshold selection rule for trials where I = D.
Left-hand columns: % trials treatment 1 declared beneficial (% trials where one or more null treatment(s) declared beneficial).
Right-hand columns: % trials any non-null treatment declared beneficial (% trials where one or more partially effective treatment(s) declared beneficial).
Scenarios where no treatments which are partially effective on the final outcome are present.
In contrast to the I ≠ D case, the results in Table 4 show that under a threshold rule the MAMS(R) framework results in slightly greater power than the combination method. This opposite finding may arise because, when I = D, there is a binding threshold at stage one, which allows for a more liberal critical value at stage two than in the I ≠ D case. This general finding is observed for both the three-arm and the six-arm design and across all scenarios and treatment effects investigated. It was also verified for an alternative trial scenario which had different treatment effects and stage-wise sample sizes (results not shown). The power advantage of the MAMS(R) framework over the combination method is marginal, but is greater for the scenarios where a large number of partially effective treatments are present. The results in parentheses on the left-hand side of Table 4 show the percentage of trials in which null treatments are declared effective. Under the global null hypothesis, the estimated FWER is larger for the MAMS(R) framework than for the combination method. However, at most of the other treatment effects investigated, null treatments are declared beneficial at a similar or lower rate for the MAMS(R) framework compared with the combination method. For the reasons described in the context of Table 3, as the effect of T1 increases this rate rises slightly for the combination method, but not for the MAMS(R) framework. As the effect of T1 increases, there is a substantial increase in the percentage of trials in which partially effective treatments are declared effective, shown by the results in parentheses on the right-hand side of Table 4. For the three-arm design, the rate tends to be lower for MAMS(R) than for the combination method, whereas for the six-arm design, it is slightly greater for MAMS(R) across all values of the T1 effect.
4.2.2 Performance of the MAMS(R) framework and the combination method under different selection rules
In Figure 2, power curves are presented for four different schemes: the MAMS(R) framework and the combination method under the threshold rule, the combination method under the epsilon rule and the MAMS(R) framework under the hybrid rule. The layout of the figure is as described for Figure 1. Note that in panels (i) and (iii) results are presented for scenarios where treatments other than T1 are ineffective, while results for scenarios where treatments other than T1 are partially effective are shown in panels (ii) and (iv).
Figure 2. Comparison of the MAMS(R) framework and combination method under threshold and epsilon selection rules for trials where I = D. Upper lines are estimated power to declare any non-null treatment beneficial and lower lines show the percentage of trials where at least one null or partially effective treatment is declared beneficial.
Looking at the upper sets of lines, for the combination method power is consistently greater when an epsilon rule rather than a threshold rule is implemented. The differences become larger as the effect of T1 increases, reflecting the operation of the epsilon selection rule as discussed in Section 4.1.2. The separation resulting from the change in selection rule is most obvious at larger T1 effects because, at smaller effects, even if T1 is selected at the interim analysis it would be unlikely to be declared effective on the final outcome at the end of stage two. However, in the MAMS(R) framework, when the hybrid selection rule replaces the threshold rule, the percentage of trials where T1 is declared effective is slightly reduced because the hybrid rule is a more stringent selection rule than the threshold. As discussed in Section 4.2.1, under the threshold rule the MAMS(R) framework is more powerful than the combination method across all the scenarios investigated, particularly when a large number of partially effective treatments are present. Moving away from a threshold rule to the epsilon rule for the combination method or the hybrid rule for MAMS(R), this advantage reverses, at least for the majority of scenarios. For the three-arm trial, the combination method under the epsilon rule gives greater power than the other schemes, particularly at larger treatment effects. However, for the six-arm trial when partially effective treatments are present, there is no clear advantage. The MAMS(R) framework under the threshold or hybrid rule results in similar power at higher treatment effects and better power at lower treatment effects compared with the combination method under the epsilon rule (see panel (iv)).
Looking at the lower sets of lines, implementing the epsilon or hybrid rule substantially reduces the rate at which null and partially effective treatments are declared beneficial at the final analysis. It can be clearly seen in Figure 2 that as the effect of T1 increases, there is no steep rise in the proportion of partially effective treatments which are declared beneficial, such as is observed under the threshold rule (see panels (ii) and (iv)). This is because, as the effect of T1 increases, the numerical distance between it and the treatment effect of the partially effective treatments also increases, and this will tend to reduce the number of trials where these arms progress even though the absolute value of the effect in these arms is increasing. Across all the scenarios we investigated, the MAMS(R) framework under the hybrid selection rule achieved consistently lower rates for recommending null or partially effective treatments than any other scheme. This result can be seen clearly by noting the relative position of the lines in the lower section of each panel in Figure 2. The black dashed line showing the results for the MAMS(R) framework under the hybrid rule consistently occupies a lower position than the other lines.
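The effect of the widening gap between T1 and the partially effective arms can be sketched as follows. The hybrid rule itself is defined in Section 3.2; the version below is only an assumed form, in which an arm continues if it both exceeds the threshold and lies within ε of the best arm. That form is chosen because it is consistent with the rule being more stringent than the threshold alone. All numerical values are hypothetical.

```python
def hybrid_rule(estimates, threshold, epsilon):
    """Assumed hybrid rule: an arm continues only if its interim estimate
    exceeds the threshold AND is within epsilon of the best estimate."""
    best = max(estimates)
    return [
        k for k, est in enumerate(estimates)
        if est > threshold and est >= best - epsilon
    ]
```

Suppose the partially effective arm has half the effect of T1. With a threshold of 0.05 and ε = 0.15, estimates of `[0.2, 0.1]` let both arms continue, whereas estimates of `[0.6, 0.3]` let only T1 continue: the partially effective arm now clears the threshold comfortably but is too far behind the best arm.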
5 Discussion
By adapting and implementing recent developments in methodology, we have used the MAMS(R) framework to obtain efficient boundary-based trial designs for multi-stage adaptive trials where the outcomes are binary and where treatment effects are parameterised as the LOR. Since methodology now allows the FWER to be controlled in MAMS(R) trials, we were able to carry out a simulation study to make an in-depth comparison of MAMS(R) trials with the well-established combination method in multi-arm multi-stage trials incorporating treatment selection, both for trials when I ≠ D and for trials when I = D.
For trials when I ≠ D, we found that the combination method achieves greater power than the MAMS(R) framework across all scenarios investigated. This is the case both under a threshold selection rule and an epsilon rule. The advantage of the combination method over MAMS(R) is most clearly seen for the six-arm design and when an epsilon rule is implemented. The reason why the combination method is more powerful may be that MAMS(R) designs for trials where I ≠ D tend to be inherently conservative. The conservatism occurs because, to ensure the FWER is strongly controlled, the critical value for the final stage is determined assuming that treatments are fully effective on the I outcome, as explained in Section 2.2. For both the MAMS(R) framework and the combination method, power is greater if an epsilon rule rather than a threshold rule is used.
In contrast, we found that for I = D trials, where this conservative approach is not required, the MAMS(R) framework achieves slightly greater power than the combination method when a threshold selection rule is used. This finding is observed across all scenarios, irrespective of the size of the treatment effect or whether partially effective treatments are present. Generally, the differences are slightly greater for the six-arm design and when partially effective treatments are present. One possible reason for the combination method having less power is that the combining of evidence from the two stages of the trial means that final comparisons of treatments may not be based on a sufficient statistic for the treatment difference; this has been suggested for the single arm setting by authors such as Jennison and Turnbull12 and Kelly et al.14 We also showed that a hybrid selection rule can be implemented in the MAMS(R) framework to facilitate a more comparative selection procedure. However, when comparing the combination method under the epsilon rule with the MAMS(R) framework under the hybrid rule, we found that MAMS(R) no longer has a consistent advantage, the combination method achieving similar or greater power in some scenarios. We found that the rate at which partially effective treatments are recommended is lower for MAMS(R) under the hybrid rule than for any other scheme we investigated, including the combination method under the epsilon rule.
In this paper, we have explored the use of the MAMS(R) framework to obtain boundary-based trial designs. This approach has the advantage of being relatively simple to understand and implement and of accommodating treatment selection based either on the definitive outcome or purely on an early outcome measure. We acknowledge that the MAMS(R) framework is mainly appropriate for trials where no early stopping for efficacy is envisaged. In contrast, the multi-arm group sequential designs developed by Magirr et al.10 specify both efficacy and futility boundaries so that trial designs which incorporate early stopping for efficacy may be obtained.
Based on our findings, we suggest that for multi-arm two-stage trials with binary outcomes where I ≠ D, the combination method may be a more suitable choice than MAMS(R), particularly for trials with many treatment arms. For either method, the selection rule which best meets the objectives of the trial can be chosen. Since the stage one critical value is not binding, an epsilon rule may be implemented in the MAMS(R) context without inflating the FWER. This rule was shown to increase power compared with the threshold rule. By contrast, for trials where I = D, if the objectives of the trial are best met by using a threshold selection rule, the MAMS(R) framework may be a more suitable option than the combination method, particularly for trials with a substantial number of experimental arms and where partially effective treatments are likely to be present. Our results suggest that by implementing the hybrid rule, the MAMS(R) framework may also be successfully used for trials where the aim is to recommend the best treatments and that this may provide an effective way to minimise the probability of inferior but partially effective treatments being declared effective at the end of the trial. However, the more stringent hybrid rule does mean that some of the power advantage of MAMS(R) over the combination method seen under the threshold rule is lost. Where the main treatment effect is likely to be large and other treatments likely to be ineffective, the combination method under the epsilon rule may be a better choice since we found it achieves greater power in these scenarios. However, for a proposed trial with many treatment arms where some are likely to be partially effective and it is desirable to minimise the rate at which these are recommended, we suggest that MAMS(R) under the hybrid rule should be considered since it provides comparable power to the combination method while keeping the rate for inferior treatments substantially lower.
Since no method consistently outperforms the others, the choice of which method to adopt for a given trial is best considered on an individual trial basis. We recommend that simulations based on the specific context and objectives of a particular trial should be conducted at the outset and the results used to determine which approach is the most suitable.
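When planning such simulations, the number of simulated trials determines how precisely rates such as the FWER or power are estimated. The following sketch uses the standard binomial standard error for an estimated proportion; the function names are hypothetical and the target precision is an illustrative choice.

```python
import math

def mc_standard_error(rate, n_sims):
    """Monte Carlo standard error of a proportion estimated from n_sims simulated trials."""
    return math.sqrt(rate * (1 - rate) / n_sims)

def sims_needed(rate, target_se):
    """Smallest number of simulated trials giving a standard error of at most target_se."""
    return math.ceil(rate * (1 - rate) / target_se ** 2)
```

For example, with 100,000 simulated trials per scenario, as used in this study, a true rate of 2.5% is estimated with a standard error of roughly 0.05 percentage points (`mc_standard_error(0.025, 100_000)`), which is small relative to the differences between methods that such a study aims to detect.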
Finally, in this study, only two-stage trials were considered. Both the MAMS(R) and combination methodologies described in this paper can readily be extended to include more than two stages;19,33 this is a possible area for future work. Similarly, now that methodology exists for calculating FWER in the context of trials with survival outcomes,30 it would be useful to develop methodology for feasible and admissible designs for this context such that further comparisons between MAMS(R) and the combination method may be conducted.
Footnotes
Acknowledgements
We thank the two referees for their comments which greatly improved this manuscript.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the EPSRC [PhD studentship to JEA].
References
1. Bretz F, et al. Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: general concepts. Biometr J 2006; 48: 623–634.
2. Schmidli H, et al. Confirmatory seamless phase II/III clinical trials with hypotheses selection at interim: applications and practical considerations. Biometr J 2006; 48: 635–643.
3. Friede T, Stallard N. A comparison of methods for adaptive treatment selection. Biometr J 2008; 50: 767–781.
4. Koenig F, et al. Adaptive Dunnett tests for treatment selection. Stat Med 2008; 27: 1612–1625.
5. Bauer P, Kohne K. Evaluation of experiments with adaptive interim analyses. Biometrics 1994; 50: 1029–1041.
6. Bauer P, Kieser M. Combining different phases in the development of medical treatments within a single trial. Stat Med 1999; 18: 1833–1848.
7. Marcus R, Peritz E, Gabriel K. On closed testing procedures with special reference to ordered analysis of variance. Biometrika 1976; 63: 655–660.
8. Stallard N, Todd S. Sequential designs for phase III clinical trials incorporating treatment selection. Stat Med 2003; 22: 689–703.
9. Stallard N, Friede T. A group-sequential design for clinical trials with treatment selection. Stat Med 2008; 27: 6209–6227.
10. Magirr D, Jaki T, Whitehead J. A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection. Biometrika 2012; 99: 494–501.
11. Royston, et al. Novel designs for multi-arm clinical trials with survival outcomes with an application in ovarian cancer. Stat Med 2003; 22: 2239–2256.
12. Jennison C, Turnbull B. Mid-course sample size modification in clinical trials based on the observed treatment effect. Stat Med 2003; 22: 971–973.
13. Tsiatis A, Mehta C. On the inefficiency of the adaptive design for monitoring clinical trials. Biometrika 2003; 90: 367–378.
14. Kelly P, Sooriyarachchi R, et al. A practical comparison of group-sequential and adaptive methods. J Biopharma Stat 2005; 15: 719–738.
15. Stallard N, Todd S. Seamless phase II/III designs. Stat Meth Med Res 2010; 20: 623–634.
16. Friede T, Stallard N. A comparison of methods for adaptive treatment selection. Biometr J 2008; 50: 767–781.
17. Kunz C, et al. A comparison of methods for treatment selection in seamless phase II/III clinical trials incorporating information on short-term endpoints. J Biopharma Stat 2015; 25: 170–189.
18. Bratton D, Phillips P, Parmar M. A multi-arm multi-stage clinical trial design for binary outcomes with application to tuberculosis. BMC Med Res Methodol 2013; 13: 139–153.
19. Bratton D. Design issues and extensions of multi-arm multi-stage clinical trials. PhD Thesis, University College London, UK, 2015.
20. Simon R. Optimal two-stage designs for phase II clinical trials. Control Clin Trials 1989; 10: 1–10.
21. Jung SH, et al. Admissible two-stage designs for phase II cancer clinical trials. Stat Med 2004; 23: 561–569.
22. Mander A, et al. Admissible two-stage designs for phase II cancer clinical trials that incorporate the expected sample size under the alternative hypothesis. Pharma Stat 2012; 11: 91–96.
23. Wason J, Jaki T. Optimal design of multi-arm multi-stage trials. Stat Med 2012; 31: 4269–4279.
24. Dunnett C. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 1955; 50: 1096–1121.
25. Brannath W, Posch M, Bauer P. Recursive combination tests. J Am Stat Assoc 2002; 97: 236–244.
26. Lehmacher W, Wassmer G. Adaptive sample size calculations in group sequential trials. Biometrics 1999; 55: 1286–1290.
27. Posch M, et al. Testing and estimation in flexible group sequential designs with adaptive treatment selection. Stat Med 2005; 24: 3697–3714.
28. Friede T, et al. Designing a seamless phase II/III clinical trial using early outcomes for treatment selection: an application in multiple sclerosis. Stat Med 2011; 30: 1528–1540.
29. Siqueira A, Todd S, Whitehead A. Sample size considerations in active-control non-inferiority trials with binary data based on the odds ratio. Stat Meth Med Res 2015; 24: 453–461.
30. Bratton D, Choodari-Oskooei B, Royston P. A menu-driven facility for sample-size calculation in multiarm, multistage randomised controlled trials with time-to-event outcomes: update. Stata J 2015; 15: 350–368.
31. Kelly P, Stallard N, Todd S. An adaptive group sequential design for phase II/III clinical trials that select a single treatment from several. J Pharma Stat 2005; 15: 641–658.
32. Parsons, et al. An R package for implementing simulations for seamless phase II/III clinical trials using early outcomes for treatment selection. Comput Stat Data Anal 2012; 56: 1150–1160.
33. Wassmer G, Eisebitt R, Coburger S. Flexible interim analyses in clinical trials using multistage adaptive test designs. Drugs Inform J 2010; 35: 1131–1146.
34. Royston P, et al. Designs for clinical trials with time-to-event outcomes based on stopping guidelines for lack of benefit. Trials 2011; 12: 81–81.