A statistical evaluation of decision-making methods and the efficiency of Bayesian multi-arm multi-stage trials

Abstract

Background/Aims: Multi-arm multi-stage trials benefit patients by providing a flexible and versatile clinical trial design compared with standard randomized controlled trials. Multi-arm multi-stage trials can evaluate multiple interventions for a single disease, which avoids the need to run multiple trials. Multi-arm multi-stage trials incorporate decision-making at interim analyses, which enables the trial to stop early for futility or efficacy of a treatment. As a result, multi-arm multi-stage trials reach conclusions faster, requiring less time and fewer resources. Despite their growing popularity, limited research has examined how the decision-making methods used in Bayesian multi-arm multi-stage trials affect the efficiency of the trial.

Methods: This study examines how decision-making strategies influence the efficiency of Bayesian multi-arm multi-stage trials, including approaches for setting thresholds to declare superiority or futility and evaluating multiple treatments simultaneously. We apply the Nelder–Mead optimization algorithm to determine the decision thresholds that maximize statistical power while maintaining family-wise type I error rate below 5%. We conduct a simulation study to compare the conventional method to evaluate multiple treatments in a Bayesian multi-arm multi-stage trial to three alternatives. At each interim analysis, posterior probabilities are derived from a Normal–Gamma conjugate model, and trial decisions are made by comparing decision criteria derived from the posterior probabilities to the optimized decision thresholds. Simulation scenarios vary by treatment effect size and number of treatment arms to assess the robustness of each decision-making strategy.

Results: All treatment comparison methods achieve similar power across simulation scenarios. However, the optimal decision thresholds vary substantially among methods. These thresholds are also lower than those currently used in Bayesian multi-arm multi-stage trials, which are often conservative and lead to reduced power. Thus, adjusting decision thresholds to the optimized values can improve trial efficiency.

Conclusion: This study provides an exploration of alternative decision-making methods in Bayesian multi-arm multi-stage trials. Initial findings show that optimizing decision-making thresholds can improve the power of the trial without inflating the family-wise type I error rate, thus improving the efficiency of the trial. Further research should include the implementation of complex trial designs and non-normal outcomes to confirm that the results apply to adaptive platform trials.

Keywords

adaptive platform trial, Bayesian statistics, simulation study, decision thresholds

Introduction

Adaptive platform trials (APTs), which are trials that compare multiple treatments under a single protocol, have become popular because their efficiency helps evaluate treatments faster.^1,2 APTs gained popularity by rapidly testing numerous treatments during the COVID-19 pandemic,³ with large APTs such as I-SPY2,⁴ REMAP-CAP,⁵ GBM AGILE,⁶ and RECOVERY⁷ recruiting thousands of patients. With this increased use, it is important to evaluate the statistical methods that drive APTs to ensure efficient study designs.

Efficient trials obtain conclusive results while minimizing resources, patients, and time. APTs improve efficiency by making decisions during the trial through interim analyses. Decision-making methods used at interim analyses are generally standard across APTs. Research on decision-making in frequentist trials^8–10 and Bayesian enrichment trials¹¹ exists; however, minimal research has examined how these methods affect Bayesian APT performance. This article explores decision-making in Bayesian multi-arm multi-stage (MAMS) trials, which are a subset of APTs that do not add new treatments over time.¹² We evaluate MAMS trials to reduce computational complexity and focus on decision-making methodology.

In Bayesian MAMS trials, decisions about treatment efficacy are typically made by comparing posterior probabilities to prespecified decision thresholds. This determines whether the trial can be stopped for superiority or futility, allowing it to efficiently reach conclusions.¹³ When multiple treatments are compared, efficacy is often evaluated by pairwise comparisons to a common control,¹⁴ as in the PRINCIPLE trial and the HEALEY ALS platform trial.^15,16 However, multiple treatments can also be compared by calculating the posterior probability that each treatment is best. This probability is then compared with decision thresholds to determine treatment superiority or futility.^5,17

In the REMAP-CAP trial,⁵ which uses the posterior probability that each treatment is best, the superiority threshold is 0.99 and the futility threshold is 0.01. This means that when the posterior probability that the treatment is best exceeds 0.99, the treatment is declared superior. Conversely, when this posterior probability falls below 0.01, the treatment is declared futile. These superiority and futility thresholds influence trial power and type I error. However, no research has focused on identifying optimal thresholds. In addition, while the posterior probability that each treatment is best is intuitive,¹⁸ it has been criticized for ignoring available information about the treatment effect.^19,20

This article focuses on two components of decision-making in Bayesian MAMS trials that directly compare multiple treatments: decision thresholds and treatment comparison methods. We first introduce an optimization approach using the Nelder–Mead algorithm to determine the decision thresholds that maximize the power of the trial while controlling the family-wise type I error rate (FWER). We then identify three alternative treatment comparison methods, which are compared with the posterior probability that each treatment is best. Bayesian MAMS trials are simulated to evaluate the performance of each comparison method using FWER, power, and expected sample size per arm. Simulation scenarios vary by treatment effect size and number of arms. A key result is that the optimal decision thresholds differ from those currently used in Bayesian MAMS trials. We find that the thresholds used in practice are often overly conservative and could be adjusted to increase power while maintaining control of the FWER.

Decision-making is central to the structure and efficiency of Bayesian MAMS trials. Thus, understanding how these methods influence trial characteristics is essential. This study provides the first comprehensive look at how decision thresholds and treatment comparison methods can increase trial efficiency. An efficient trial will be cheaper with reduced patient recruitment, allowing researchers to allot more resources to new research questions.

Methods

A key advantage of APTs and MAMS trials is that they evaluate multiple treatments in a single trial.²¹ Consider a MAMS trial evaluating $T$ treatments at a given time. The effect of each treatment on an outcome $X$ is assessed, with $μ_{t}$ denoting a summary measure for $X$ (e.g. the mean). In Bayesian MAMS trials, the focus of this article, the posterior distribution of $μ_{t}$ is computed. Bayesian MAMS trials broadly use one of two approaches to compare the $T$ treatments.

In the first approach, each treatment is compared with a control (or standard of care), with $μ_{1}$ set as the summary measure for the control. The posterior distribution of a contrast between $μ_{t}$ and $μ_{1}$ , for example, $μ_{t} - μ_{1}$ , is calculated to determine the efficacy of treatment $t$ . For example, in the I-SPY trial studying therapies for high-risk breast cancer, neratinib was compared with the standard of care. The posterior probability that neratinib was superior to the standard of care was 95%, so the treatment graduated to the next trial phase.²² This approach is effective at identifying treatments for further investigation, but is less suitable when the goal is to identify the best treatment.²³

The second approach to decision-making in Bayesian MAMS trials is to compute the posterior probability that a treatment is best ( $P_{best}$ ).²⁴ $P_{best}$ for treatment $t$ is

$$θ_{t}^{Pbest} = P (μ_{t} > μ_{t'} \ \ \forall t' \neq t) \tag{1}$$

where we define a higher summary measure to be better. This can be interpreted as the probability that, if all $μ_{t}$ for $t = 1, \dots, T$ were ranked, $μ_{t}$ would rank first. $P_{best}$ ranges between 0 and 1, with 0 indicating the treatment is never better than all other treatments and 1 indicating it is always better. To make decisions in the trial, $θ_{t}^{P_{best}}$ is compared with a predefined superiority threshold $δ_{\sup}^{Pbest}$ and futility threshold $δ_{fut}^{Pbest}$ .²⁵ Revisiting the REMAP-CAP example, we specify that $δ_{\sup}^{Pbest}$ = 0.99 and $δ_{fut}^{Pbest}$ = 0.01. Thus, if $θ_{t}^{P_{best}} >$ 0.99, treatment $t$ is declared superior, and if $θ_{t}^{P_{best}} <$ 0.01, treatment $t$ is declared futile and dropped from the trial.⁵ This article focuses on the second class of MAMS trials that aim to compare all interventions against each other.
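As a minimal sketch of equation (1), assuming joint posterior draws of the $μ_{t}$ are already available, $θ_{t}^{Pbest}$ can be estimated as the share of draws in which treatment $t$ has the largest value. The function names `p_best` and `classify` and the toy draws are illustrative assumptions and are not taken from any trial software.

```python
from typing import List, Sequence


def p_best(draws: Sequence[Sequence[float]]) -> List[float]:
    """Estimate P_best per arm; draws[s][t] is the s-th joint draw of mu_t."""
    n_arms = len(draws[0])
    wins = [0] * n_arms
    for draw in draws:
        # A higher summary measure is better, as in equation (1).
        wins[max(range(n_arms), key=lambda t: draw[t])] += 1
    return [w / len(draws) for w in wins]


def classify(theta: Sequence[float], delta_sup: float = 0.99,
             delta_fut: float = 0.01) -> List[str]:
    """Compare each arm's P_best with the superiority and futility thresholds."""
    labels = []
    for value in theta:
        if value > delta_sup:
            labels.append("superior")
        elif value < delta_fut:
            labels.append("futile")
        else:
            labels.append("continue")
    return labels


# Toy example: four joint draws for three arms (hypothetical numbers).
toy_draws = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [0.0, 1.0, 5.0]]
theta = p_best(toy_draws)
print(theta, classify(theta))
```

The default thresholds mirror the REMAP-CAP values quoted above; in practice many more draws would be needed for a stable estimate near 0.99 or 0.01.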

Trial design

This MAMS design is illustrated in Figure 1. Patients are randomized into one of $T$ treatment arms and recruited until a prespecified time-point or number of patients. At the first interim analysis, the posterior distribution of $μ_{t}$ is determined for each treatment and $θ_{t}^{Pbest}$ is calculated for each treatment $t = 1, \dots, T$ . The metric $θ_{t}^{Pbest}$ represents the relative treatment efficacy for each treatment. Decisions about the treatments are made by comparing $θ_{t}^{Pbest}$ to the superiority and futility thresholds, $δ_{\sup}^{Pbest}$ and $δ_{fut}^{Pbest}$ , respectively. Superiority is declared when enough evidence indicates a treatment is better than all comparators. When this occurs, the trial concludes. Futility is declared when a treatment is not efficacious, and patients are no longer randomized to that treatment. If enough treatments remain in the trial, recruitment and interim analyses continue, either perpetually or until a maximum sample size is reached.

Figure 1.

A flowchart of a multi-arm multi-stage trial that compares multiple treatments using decision thresholds and $P_{best}$ .

Trial operating characteristics summarize the statistical properties of a trial design. In general, three operating characteristics are critical to MAMS trial efficiency: power, FWER, and expected sample size per arm. Power is the probability of correctly identifying the optimal treatment by the end of the trial, which we aim to maximize to improve the detection of real effects. FWER is the probability of declaring any treatment superior when it is not. The expected sample size per arm is the average sample size in each treatment arm in a MAMS trial. The goal in trial design is to maximize power, minimize expected sample size per arm, and restrict FWER below a threshold, usually 5%.

Formally, this article uses FWER in the weak sense,²⁶ defined as the probability of detecting at least one superior treatment effect, given that the prior for $μ_{t}$ is equal for all $t = 1, \dots, T$

$$\hat{α} = P (θ_{t} > δ_{\sup} \text{ for at least one } t \mid μ_{0, t} = a \ \ \forall t) \tag{2}$$

where $a$ is the prior mean for $μ_{t}$ when we calculate the FWER, and $θ_{t}$ is the outcome of the decision-making method for treatment $t$ . Power is then defined as the probability of correctly identifying the superior treatment, given that the treatment is assigned a prior centred around a larger value

$$1 - \hat{β} = P (θ_{t} > δ_{\sup} \mid μ_{0, t} = b) \tag{3}$$

where $b$ is the value of the prior mean $μ_{0, t}$ used when we calculate power, and $a < b$ .

We also define the probability of correct dropping as the probability of correctly dropping the futile treatment, given that treatment is assigned a prior centred around a lower value

$$1 - \hat{β}_{fut} = P (θ_{t} < δ_{fut} \mid μ_{0, t} = c) \tag{4}$$

where $c$ is the value of the prior mean $μ_{0, t}$ used when we calculate the probability of correct dropping, and $c < a$ .

Finally, the expected sample size per arm is estimated as the average number of patients enrolled in the trial at the time the trial is stopped, divided by the initial number of treatment arms

$$\widehat{ESS} = \frac{\sum_{t = 1}^{T} n_{t}}{T} \tag{5}$$

where $n_{t}$ represents the number of patients in the treatment arm $t$ . This article explores how decision thresholds and treatment comparison methods impact the operating characteristics of an MAMS trial.
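To make these definitions concrete, the following sketch estimates the three operating characteristics from a set of already simulated trials. It assumes each simulated trial is summarized by the index of the arm declared superior (or `None` if no arm was) and by the final number of patients in each arm; this data layout and the function names are hypothetical.

```python
from typing import Optional, Sequence


def fwer(null_results: Sequence[Optional[int]]) -> float:
    """Share of null-scenario trials in which any arm was declared superior."""
    return sum(r is not None for r in null_results) / len(null_results)


def power(results: Sequence[Optional[int]], best_arm: int) -> float:
    """Share of superiority-scenario trials declaring the truly best arm superior."""
    return sum(r == best_arm for r in results) / len(results)


def expected_sample_size_per_arm(n_by_arm: Sequence[Sequence[int]]) -> float:
    """Average over trials of sum_t n_t / T, with T the initial number of arms."""
    per_trial = [sum(n) / len(n) for n in n_by_arm]
    return sum(per_trial) / len(per_trial)


# Toy summaries of four null trials, four superiority trials, and two trials' sizes.
null_results = [None, None, 2, None]
sup_results = [2, 2, None, 0]
sizes = [[50, 50, 50], [200, 100, 150]]
print(fwer(null_results), power(sup_results, best_arm=2),
      expected_sample_size_per_arm(sizes))
```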

Decision thresholds

The first aspect of decision-making in Bayesian MAMS trials that we explore is decision thresholds for declaring superiority or futility, that is, $δ_{\sup}^{Pbest}$ and $δ_{fut}^{Pbest}$ . Decision thresholds are directly related to the power and FWER of the trial. Low superiority thresholds increase trial power because they make it easier to detect a superior treatment. However, this is accompanied by an increase in FWER. In practice, thresholds are often chosen to restrict the FWER, but it is unclear whether these thresholds are optimal. As such, we introduce an optimization method to select decision thresholds that increase trial power while constraining the FWER. Figure 2 illustrates this optimization process with one threshold ( $δ_{\sup}$ ).

Figure 2.

An illustration of our proposed optimization approach with one threshold. As the superiority threshold increases, the power of the trial and the family-wise error rate (FWER) decrease because it becomes more challenging to declare a successful treatment. The black star represents the superiority threshold that results in the highest possible power when the FWER reaches 5%. This is the superiority threshold found through optimization.

However, we aim to optimize two decision thresholds, $δ_{\sup}$ and $δ_{fut}$ . The goal is to determine the decision thresholds that produce the highest power while the FWER is constrained to 5%. Thus, we define a function of $δ_{\sup}$ and $δ_{fut}$ to be optimized. This function computes the FWER and power of a MAMS trial with thresholds $δ_{\sup}$ and $δ_{fut}$ . The function is equal to the power of the MAMS trial unless the simulated FWER exceeds 5%, in which case the function equals zero. This ensures that only threshold combinations with FWER control are included. Since this function has no closed-form expression, with both power and FWER estimated by simulation, we use the Nelder–Mead algorithm to optimize the function using the simulation output.²⁷ We initialize the Nelder–Mead algorithm with values from a grid search over candidate thresholds to improve its performance.
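One way to write the objective described above, in the notation of equations (2) and (3), is given below. The symbol $f$ and the piecewise form are shorthand introduced here for illustration rather than notation from the original analysis.

```latex
f(\delta_{\sup}, \delta_{fut}) =
\begin{cases}
1 - \hat{\beta}(\delta_{\sup}, \delta_{fut}), & \text{if } \hat{\alpha}(\delta_{\sup}, \delta_{fut}) \leq 0.05, \\
0, & \text{otherwise,}
\end{cases}
\qquad
(\delta_{\sup}^{*}, \delta_{fut}^{*}) = \arg\max_{\delta_{\sup},\, \delta_{fut}} f(\delta_{\sup}, \delta_{fut})
```

Written this way, $f$ jumps to zero at the FWER boundary and is only observed through simulation noise, so a derivative-free search such as Nelder–Mead is a natural fit, and a grid-based starting point helps avoid beginning in a region where $f$ is identically zero.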

Treatment comparison approaches

The second aspect of decision-making in Bayesian MAMS trials that we explore is the treatment comparison approach. As stated above, $P_{best}$ is used in Bayesian MAMS trials to compare multiple treatments. However, it has drawn criticism because it ignores the full ranking distribution of the treatment, that is, it incorporates the probability that a treatment ranks first but ignores whether it ranks second or third. This can lead to information loss and biased conclusions.²⁸ Motivated by these concerns, we identify three alternative treatment comparison approaches from the network meta-analysis literature, where the objective of comparing multiple treatments is closely aligned with the objective of a MAMS trial.

Surface Under the Cumulative Ranking Curve

The surface under the cumulative ranking curve (SUCRA) quantifies the average rank of $μ_{t}$ relative to all other treatments by summarizing the cumulative ranking probabilities.²⁸ This is appealing because it considers all possible ranking probabilities for the treatment, not just the probability of being the best. Treatments consistently ranking first or second receive higher values than those that occasionally rank last.

To calculate the SUCRA, $μ_{t}$ is ranked from most efficacious (rank = 1) to least efficacious (rank = $T$ ), where $T$ is equal to the number of treatments in the trial. The posterior probability that treatment $t$ has rank $k$ is denoted $P (t, k)$ , which is generated for all $t = 1, \dots, T$ and $k = 1, \dots, T$ . This is used to formulate the cumulative distribution function (CDF) of treatment rankings, denoted $F (t, r)$

$$F (t, r) = \sum_{k = 1}^{r} P (t, k) \tag{6}$$

The CDF represents the probability that $μ_{t}$ has rank $r$ or better, that is, a rank of at most $r$ . Figure 3 gives a graphical illustration of these quantities.

Figure 3.

Illustration of ranking distributions used to compute the surface under the cumulative ranking curve. Left: probability ranking distribution for treatment $t$ , $P (t, k)$ , when there are four treatments in the trial. Each bar in the graph represents the probability that treatment $t$ takes on rank $k$ . Right: cumulative ranking distribution for treatment $t$ , $F (t, k)$ . Each bar represents the probability that treatment $t$ has a rank less than or equal to $k$ . For example, $F (t, 3)$ is the probability that $t$ ranks first, second, or third.

Once $F (t, r)$ is obtained, the SUCRA is calculated as

$$θ_{t}^{SUCRA} = \frac{1}{T - 1} \sum_{r = 1}^{T - 1} F (t, r), \tag{7}$$

which is the mean of the cumulative ranking probabilities. When used in a MAMS trial for decision-making, we compare $θ_{t}^{SUCRA}$ to decision thresholds $δ_{\sup}^{SUCRA}$ and $δ_{fut}^{SUCRA}$ . Note that these decision thresholds are used in the same way as $δ_{\sup}$ and $δ_{fut}$ but may have different numerical values.
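The SUCRA calculation can be sketched from joint posterior draws as follows. This sketch takes $F (t, r)$ to be the cumulative probability that treatment $t$ ranks $r$-th or better, assumes higher values of $μ_{t}$ are better, and assumes no ties among draws; the function names and toy draws are illustrative.

```python
from typing import List, Sequence


def rank_probabilities(draws: Sequence[Sequence[float]]) -> List[List[float]]:
    """Return P[t][k-1], the share of draws in which arm t has rank k (1 = best)."""
    n_arms = len(draws[0])
    counts = [[0] * n_arms for _ in range(n_arms)]
    for draw in draws:
        # Sort arm indices from largest to smallest drawn value.
        order = sorted(range(n_arms), key=lambda t: draw[t], reverse=True)
        for rank_index, arm in enumerate(order):
            counts[arm][rank_index] += 1
    return [[c / len(draws) for c in row] for row in counts]


def sucra(rank_probs: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the cumulative ranking probabilities F(t, r) for r = 1..T-1."""
    n_arms = len(rank_probs)
    values = []
    for row in rank_probs:
        cumulative, total = 0.0, 0.0
        for r in range(n_arms - 1):
            cumulative += row[r]  # running value of F(t, r + 1)
            total += cumulative
        values.append(total / (n_arms - 1))
    return values


# Toy example: four joint draws for three arms (hypothetical numbers).
toy_draws = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [0.0, 1.0, 5.0]]
rank_probs = rank_probabilities(toy_draws)
print(rank_probs, sucra(rank_probs))
```

In this toy example the second and third arms receive the same SUCRA even though the third arm ranks first more often, which illustrates how the SUCRA weighs the whole ranking distribution rather than only the top rank.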

Mean posterior rank

In network meta-analysis, the mean posterior rank is another summary of the ranking distribution that incorporates all possible ranks. The mean posterior rank simply averages the ranks for each treatment effect

$$θ_{t}^{M} = \sum_{k = 1}^{T} k P (t, k) \tag{8}$$

where $P (t, k)$ is the posterior probability that treatment $t$ has rank $k$ , as defined above. Possible outcomes range from the lowest mean rank of $1$ to the highest of $T$ .²⁹ To make decisions in MAMS trials, $θ_{t}^{M}$ is compared with the thresholds $δ_{\sup}^{M}$ and $δ_{fut}^{M}$ . Note that $δ_{fut}^{M}$ will depend on $T$ , making it difficult to compare results across trials with different $T$ .
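The mean posterior rank and the SUCRA are closely linked. As a short worked derivation, take $F (t, r)$ to be the cumulative probability that treatment $t$ has rank $r$ or better, that is, $F (t, r) = \sum_{k \leq r} P (t, k)$ . Swapping the order of summation and using $\sum_{k} P (t, k) = 1$ then gives:

```latex
\sum_{r=1}^{T-1} F(t, r)
  = \sum_{r=1}^{T-1} \sum_{k=1}^{r} P(t, k)
  = \sum_{k=1}^{T} (T - k)\, P(t, k)
  = T - \theta_{t}^{M}
\quad\Longrightarrow\quad
\theta_{t}^{SUCRA} = \frac{T - \theta_{t}^{M}}{T - 1}
```

For a fixed $T$ , the two summaries are therefore linear transformations of each other, with a low mean rank corresponding to a high SUCRA; for example, a mean rank of 1.5 with $T = 4$ corresponds to a SUCRA of $(4 - 1.5)/3 \approx 0.83$ . This identity is a known property of the two ranking summaries rather than a result of this study.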

Pairwise comparison

Finally, clinical trials often compare treatments by assessing how much they improve outcomes, for example, by using the minimal clinically important difference.³¹ Motivated by this, we developed a pairwise method that compares interventions using the magnitude of difference between treatment effects, a concept that the network meta-analysis ranking methods do not consider. This method begins by computing the posterior probability that treatment $t$ exceeds treatment $t'$ by at least $Δ \geq 0$ :

$$θ_{t, t'}^{P} = P (μ_{t} > μ_{t'} + Δ), \quad \forall t \neq t'. \tag{9}$$

These probabilities can be organized in a $T \times T$ matrix

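$$\begin{bmatrix} 0 & θ_{1, 2}^{P} & θ_{1, 3}^{P} & \dots & θ_{1, T}^{P} \\ θ_{2, 1}^{P} & 0 & θ_{2, 3}^{P} & \dots & θ_{2, T}^{P} \\ θ_{3, 1}^{P} & θ_{3, 2}^{P} & 0 & \dots & θ_{3, T}^{P} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ θ_{T, 1}^{P} & θ_{T, 2}^{P} & θ_{T, 3}^{P} & \dots & 0 \end{bmatrix}$$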

where the entry in row $t$ and column $t'$ is the probability that $μ_{t}$ exceeds $μ_{t'} + Δ$ . Each entry is compared with the superiority threshold, $δ_{\sup}^{P}$ , to form a new matrix, assigning 1 if $θ_{t, t'}^{P} > δ_{\sup}^{P}$ , and 0 otherwise. Thus, the sum of the $t^{th}$ row equals the number of treatments to which treatment $t$ is superior. If the sum equals $T - 1$ , treatment $t$ is declared superior.

For futility, each off-diagonal entry is compared with a futility threshold, $δ_{fut}^{P}$ , forming a new matrix that assigns 1 if $θ_{t, t'}^{P} < δ_{fut}^{P}$ , and 0 otherwise. The sum of the $t^{th}$ row equals the number of treatments to which treatment $t$ is inferior. If the sum equals $T - 1$ , treatment $t$ is dropped from the trial. This method is beneficial because $Δ$ can be set to ensure that a treatment is meaningfully different from the others considered in the trial.
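The pairwise rule can be sketched from joint posterior draws as below. Requiring a row sum of $T - 1$ is implemented as requiring every off-diagonal entry in the row to pass the threshold; the diagonal is skipped explicitly. The function names, toy draws, and toy thresholds are assumptions made for illustration.

```python
from typing import List, Sequence


def pairwise_probabilities(draws: Sequence[Sequence[float]],
                           delta: float = 0.0) -> List[List[float]]:
    """theta[t][u] estimates P(mu_t > mu_u + delta); the diagonal is left at 0."""
    n_arms = len(draws[0])
    theta = [[0.0] * n_arms for _ in range(n_arms)]
    for t in range(n_arms):
        for u in range(n_arms):
            if t != u:
                hits = sum(draw[t] > draw[u] + delta for draw in draws)
                theta[t][u] = hits / len(draws)
    return theta


def pairwise_decisions(theta: Sequence[Sequence[float]], delta_sup: float,
                       delta_fut: float) -> List[str]:
    """Declare an arm superior (futile) if it beats (trails) every other arm."""
    n_arms = len(theta)
    labels = []
    for t in range(n_arms):
        others = [theta[t][u] for u in range(n_arms) if u != t]
        if all(p > delta_sup for p in others):
            labels.append("superior")
        elif all(p < delta_fut for p in others):
            labels.append("futile")
        else:
            labels.append("continue")
    return labels


# Toy example: four joint draws for three arms (hypothetical numbers and thresholds).
toy_draws = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0], [1.0, 3.0, 2.0], [0.0, 1.0, 5.0]]
theta = pairwise_probabilities(toy_draws, delta=0.0)
print(theta, pairwise_decisions(theta, delta_sup=0.7, delta_fut=0.3))
```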

Simulation study

To investigate the combined impact of optimized decision thresholds and treatment comparison methods on the efficiency of Bayesian MAMS trials, we perform a simulation study. We simulate MAMS trials to compare the performance of $P_{best}$ , the SUCRA, mean rank, and the pairwise method based on trial power, expected sample size per arm, and FWER. For each method, decision thresholds are optimized for each trial scenario and held fixed throughout the simulations. Thresholds are based on the initial number of treatments, which may differ from the number of treatments at the end of the trial if treatments are dropped.

We use a Normal–Gamma conjugate model to simulate data from a potential MAMS trial. Let $x_{i, t}$ be the outcome for patient $i$ $(i = 1, \dots, n_{t})$ receiving treatment $t$ $(t = 1, \dots, T)$ , where $n_{t}$ is the number of patients in the treatment arm, and $T$ is the number of treatments in the trial. The outcome is modelled as

$$x_{i, t} \sim \text{Normal}\left(μ_{t}, \frac{1}{τ_{t}}\right) \tag{10}$$

where $μ_{t}$ is the mean and $τ_{t}$ is the precision of $x_{i, t}$ . The priors for the mean and precision are as follows:

$$\begin{aligned} μ_{t} \mid τ_{t} &\sim \text{Normal}\left(μ_{0, t}, \frac{1}{n_{0} τ_{t}}\right), \\ τ_{t} &\sim \text{Gamma}(α_{0}, β_{0}), \quad \forall t \end{aligned} \tag{11}$$

where $μ_{0, t}$ is the prior mean for $μ_{t}$ , $n_{0}$ is the effective sample size for the prior, and $α_{0}$ and $β_{0}$ are the prior shape and rate parameters for $τ_{t}$ . These assumptions define an analytic posterior distribution for $μ_{t}$ , which is used to compare treatments.³¹
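For reference, the standard conjugate update for this Normal–Gamma model can be sketched as follows, with the Gamma distribution parameterized by shape and rate as in equation (11). The function names are illustrative, and the toy call uses the analysis-prior values from Table 1 only as example inputs; the original simulation code may be organized differently.

```python
import random
from typing import List, Sequence, Tuple

Params = Tuple[float, float, float, float]


def normal_gamma_update(x: Sequence[float], mu0: float, n0: float,
                        alpha0: float, beta0: float) -> Params:
    """Return the posterior parameters (mu_n, n_n, alpha_n, beta_n) for one arm."""
    n = len(x)
    xbar = sum(x) / n
    ss = sum((xi - xbar) ** 2 for xi in x)  # within-sample sum of squares
    n_n = n0 + n
    mu_n = (n0 * mu0 + n * xbar) / n_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + n0 * n * (xbar - mu0) ** 2 / (2 * n_n)
    return mu_n, n_n, alpha_n, beta_n


def sample_mu(params: Params, n_draws: int, rng: random.Random) -> List[float]:
    """Draw mu_t from its posterior by first drawing the precision tau_t."""
    mu_n, n_n, alpha_n, beta_n = params
    draws = []
    for _ in range(n_draws):
        # random.gammavariate takes (shape, scale), so the rate is inverted.
        tau = rng.gammavariate(alpha_n, 1.0 / beta_n)
        draws.append(rng.gauss(mu_n, (1.0 / (n_n * tau)) ** 0.5))
    return draws


# Toy data for one arm, combined with the Table 1 analysis prior.
posterior = normal_gamma_update([4.0, 6.0], mu0=5.0, n0=5.0, alpha0=25.5, beta0=22.5)
mu_draws = sample_mu(posterior, n_draws=2000, rng=random.Random(2024))
print(posterior, sum(mu_draws) / len(mu_draws))
```

Draws produced this way for each arm can be combined row by row to form the joint draws needed by the treatment comparison methods, since the arms are modelled independently.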

Table 1 shows the priors used in this simulation study, where $n_{0}$ , $α_{0}$ , and $β_{0}$ are fixed across treatments for simplicity, but could be varied in practice to reflect differences in available information. The design priors reflect beliefs before data collection, and the analysis priors are used in the statistical analysis once data are collected. Our prior selection approach is described in the Supplemental Material.

Table 1.

Design and analysis priors for $n_{0}$ , $α_{0}$ , and $β_{0}$ used in the simulation study. Design priors govern data generation, while analysis priors are used for posterior inference.

	$n_{0}$	$α_{0}$	$β_{0}$
Design prior	50	25	225
Analysis prior	5	25.5	22.5

The simulation study consists of three scenarios:

1. Null Scenario

To estimate FWER, all treatments are assumed to have the same prior mean, $μ_{0, t}$ = 5 $\forall t$ .

2. Superiority Scenario

To estimate power, $T - 1$ treatments maintain a prior mean $μ_{0, t} = 5$ for $t = 1, \dots, T - 1$ , and the $T^{th}$ treatment is superior with a prior mean $μ_{0, T} = 6$ .

3. Futility Scenario

To evaluate the probability of correct dropping, $T - 1$ treatments maintain the prior mean $μ_{0, t} = 5$ for $t = 1, \dots, T - 1$ , and the $T^{th}$ treatment is inferior with a prior mean $μ_{0, T} = 4$ .

Once patient data are generated, the posterior distributions for $μ_{t}$ , $t = 1, \dots, T$ are calculated via Monte Carlo simulation. Separate simulations are performed for the null, superiority, and futility scenarios with $T$ = 3, 4, 5, 6, 8, and 10.
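The data-generating step for the three scenarios might look like the sketch below. It reads "design priors govern data generation" as drawing each arm's $μ_{t}$ and $τ_{t}$ once per simulated trial from the design prior in Table 1 and then drawing patient outcomes from equation (10); this reading, the function names, and the seed are assumptions.

```python
import random
from typing import List, Tuple


def scenario_prior_means(scenario: str, n_arms: int) -> List[float]:
    """Prior means mu_{0,t} for the null, superiority, and futility scenarios."""
    means = [5.0] * n_arms
    if scenario == "superiority":
        means[-1] = 6.0
    elif scenario == "futility":
        means[-1] = 4.0
    elif scenario != "null":
        raise ValueError(f"unknown scenario: {scenario}")
    return means


def draw_arm_parameters(mu0: float, rng: random.Random, n0: float = 50.0,
                        alpha0: float = 25.0, beta0: float = 225.0
                        ) -> Tuple[float, float]:
    """Draw (mu_t, tau_t) from the design prior; Gamma uses shape and rate."""
    tau = rng.gammavariate(alpha0, 1.0 / beta0)  # gammavariate expects a scale
    mu = rng.gauss(mu0, (1.0 / (n0 * tau)) ** 0.5)
    return mu, tau


def simulate_patients(mu: float, tau: float, n: int,
                      rng: random.Random) -> List[float]:
    """Draw n outcomes from Normal(mu, 1 / tau)."""
    sd = (1.0 / tau) ** 0.5
    return [rng.gauss(mu, sd) for _ in range(n)]


# One stage of 50 patients per arm in a four-arm superiority scenario.
rng = random.Random(7)
arms = [draw_arm_parameters(m, rng) for m in scenario_prior_means("superiority", 4)]
stage_data = [simulate_patients(mu, tau, 50, rng) for mu, tau in arms]
print([round(sum(x) / len(x), 2) for x in stage_data])
```

Keeping the parameter draw separate from the patient draw lets later stages reuse the same $μ_{t}$ and $τ_{t}$ when more patients are recruited.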

Two trial designs (Table 2) are simulated to test the impact of differing numbers of interim analyses. In Design 1, interim analyses occur at 50, 100, and 150 patients per arm. In Design 2, interim analyses occur at 100 and 150 patients per arm. Both designs have a maximum sample size of 200 patients per arm. At each interim analysis, superiority or futility is assessed. If no treatment is deemed superior, the trial continues to the next interim analysis. The trial concludes superiority if (1) a superior treatment is identified or (2) all but one treatment is declared futile. If the maximum sample size is reached, the trial is declared inconclusive. We run 10,000 Monte Carlo simulations per scenario, which limits simulation error to below 0.01.
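The stated bound on simulation error can be checked with the usual binomial standard error for a proportion estimated from $N = 10{,}000$ independent simulations. The worked arithmetic below assumes this is the error measure intended.

```latex
\mathrm{SE}(\hat{p}) = \sqrt{\frac{p(1-p)}{N}}
\;\leq\; \sqrt{\frac{0.25}{10\,000}} = 0.005,
\qquad
\mathrm{SE}(\hat{p})\Big|_{p = 0.05} = \sqrt{\frac{0.05 \times 0.95}{10\,000}} \approx 0.0022
```

On this reading, the half-width of an approximate 95% interval is at most $1.96 \times 0.005 \approx 0.0098$ , which is consistent with a bound of 0.01.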

Table 2.

Simulated adaptive platform trial designs, differing in minimum sample size and number of interim analyses.

	Design 1	Design 2
Min. sample size per arm	50	100
Max. sample size per arm	200	200
Number of interim analyses	3	2

Results

Decision thresholds

Table 3 lists the superiority and futility thresholds for each simulation scenario in Design 1. Thresholds in Design 1 are slightly higher than in Design 2, reflecting the need to control the FWER given the greater number of interim analyses. Design 2 results are in the Supplemental Material.

Table 3.

Optimized superiority and futility thresholds for $P_{best}$ , the surface under the cumulative ranking curve (SUCRA), mean rank, and pairwise under Design 1, selected to maximize power subject to a 5% family-wise type I error constraint.

Number of treatments	Ranking method	Superiority threshold	Futility threshold
3	Mean rank	1.05	2.94
3	Pbest	0.97	0.02
3	Pairwise	0.97	0.07
3	SUCRA	0.98	0.00
4	Mean rank	1.08	3.92
4	Pbest	0.95	0.03
4	Pairwise	0.95	0.05
4	SUCRA	0.97	0.01
5	Mean rank	1.13	4.91
5	Pbest	0.90	0.01
5	Pairwise	0.94	0.08
5	SUCRA	0.97	0.00
6	Mean rank	1.17	5.78
6	Pbest	0.88	0.01
6	Pairwise	0.93	0.08
6	SUCRA	0.97	0.00
8	Mean rank	1.24	7.92
8	Pbest	0.90	0.04
8	Pairwise	0.91	0.11
8	SUCRA	0.97	0.00
10	Mean rank	1.30	9.65
10	Pbest	0.82	0.01
10	Pairwise	0.90	0.00
10	SUCRA	0.97	0.00

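For $P_{best}$ , optimal superiority thresholds generally decrease as the number of treatments increases, from 0.97 with three treatments to 0.82 with ten. Mean rank and pairwise thresholds also become less stringent, but less drastically (for mean rank, where lower values are better, the threshold rises from 1.05 to 1.30), whereas the SUCRA thresholds remain high and nearly constant at 0.97 to 0.98. The variation within methods, specifically $P_{best}$ , highlights the importance of selecting the optimal threshold for each trial.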

Treatment comparison methods

Figure 4 illustrates the power across superiority scenarios. All methods achieve roughly 71% power, which decreases as the number of treatments rises. Nonlinearities in the power reflect imprecision in the optimization of the decision thresholds rather than simulation error.

Figure 4.

Power across superiority scenarios for Design 1 as a function of the number of treatments.

Figure 5 illustrates the expected sample size per arm in the superiority scenarios. As the number of treatments increases, the expected sample size per arm drops for $P_{best}$ and increases for the SUCRA, pairwise, and mean rank. This is expected because $P_{best}$ drops treatments solely on the basis of the probability that they are the best, resulting in faster decision-making.

Figure 5.

Expected sample size per arm under superiority scenarios for Design 1.

Figure 6 illustrates the probability of correct dropping in the futility scenarios. The error bars are not displayed due to the small error and the large y-axis range. $P_{best}$ has the highest probability of correct dropping since treatments are easily dropped if they do not rank first. Low probability of correct dropping in the remaining methods reflects futility thresholds optimized for the superiority scenario. Future simulations could incorporate the futility scenario into the threshold optimization to fully compare the decision-making methods; however, $P_{best}$ would likely still perform best.

Figure 6.

Power to correctly drop futile treatments under Design 1.

Figure 7 shows the expected sample size per arm in futility scenarios, mirroring trends found in the superiority scenario. Minimal differences between Designs 1 and 2 indicate that the number of interim analyses has little effect on the probability of correct dropping and expected sample size per arm, once thresholds have been optimized.

Figure 7.

Expected sample size per arm under futility scenarios for Design 1.

Discussion

Bayesian MAMS trials offer flexible and efficient designs by combining interim decision-making with the comparison of multiple treatments in a single trial. MAMS trials incorporate superiority and futility thresholds to guide early stopping, which are typically chosen to control FWER.²⁵ To our knowledge, no prior work has examined how decision thresholds should be determined, despite their central role in trial decision-making. In addition, the common method to compare multiple treatments in a Bayesian MAMS trial is $P_{best}$ , which has been criticized for potentially introducing bias into the analysis. Few studies have evaluated whether $P_{best}$ is optimal for MAMS trials or if alternative methods perform better.

This study explores four treatment comparison methods using optimized decision thresholds. $P_{best}$ , the SUCRA, mean rank, and pairwise perform similarly in power; however, $P_{best}$ results in the lowest expected sample size per arm. $P_{best}$ ignores the full treatment efficacy ranking distribution and drops treatments that do not rank first early in the trial. The remaining methods also reduce expected sample size per arm below the trial’s maximum sample size. These results suggest that threshold selection is more critical to trial efficiency than the treatment comparison method itself. Current MAMS trials use a high superiority threshold, such as 0.99, and a low futility threshold, such as 0.01, to control FWER.^17,5,32–35 These thresholds may not be optimal and could be reduced to allow for greater power, while still controlling the FWER.

In addition, we observe that the optimal superiority threshold for $P_{best}$ is highly variable and decreases substantially as the number of treatments in the trial increases. Since decision thresholds are typically kept constant during the trial, it is possible that power and the FWER can become volatile when treatments are added or removed. In contrast, the optimal decision thresholds for the SUCRA remain nearly constant regardless of the number of treatments, potentially making it a better alternative to $P_{best}$ for controlling the FWER throughout the trial. As thresholds have not been extensively studied in the literature, there is an urgent need to explore their impact further.

In conclusion, we make the following recommendations:

1. Optimization should be used to determine decision thresholds. This is a simple and effective method to increase the power of the trial while controlling the FWER. We use the Nelder–Mead algorithm, which optimizes to a local maximum, meaning that we may not have obtained the global maximum every time. Future research should explore alternative optimization algorithms and objective functions, such as optimizing expected sample size per arm rather than power.

2. If decision thresholds are optimized, the SUCRA may provide a better alternative to $P_{best}$ . It is unknown how the volatility in optimal decision thresholds for $P_{best}$ impacts the FWER when treatments are dropped from a trial. Alternatively, the SUCRA maintains almost constant optimal decision thresholds, providing a more consistent method.

These recommendations may be specific to our simulation study. Thus, future research should extend this study to full APTs to improve the generalizability of the results. We also simulated the simplest setting where one treatment was effective. Future work could examine how decision-making methods capture disjunctive and conjunctive power,^36,37 when multiple treatments are effective.

Supplemental Material

Supplemental material, sj-pdf-1-ctj-10.1177_17407745261453566 for A statistical evaluation of decision-making methods and the efficiency of Bayesian multi-arm multi-stage trials by Abigail McGrory, Haolun Shi and Anna Heath in Clinical Trials

Supplemental material, sj-png-2-ctj-10.1177_17407745261453566 for A statistical evaluation of decision-making methods and the efficiency of Bayesian multi-arm multi-stage trials by Abigail McGrory, Haolun Shi and Anna Heath in Clinical Trials

Supplemental material, sj-png-3-ctj-10.1177_17407745261453566 for A statistical evaluation of decision-making methods and the efficiency of Bayesian multi-arm multi-stage trials by Abigail McGrory, Haolun Shi and Anna Heath in Clinical Trials

Footnotes

Acknowledgements

The authors would like to acknowledge Michael Escobar for his guidance and support throughout the development of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Anna Heath was funded by a Canada Research Chair in Statistical Trial Design and the Natural Sciences and Engineering Research Council of Canada (Award Number RGPIN-2021-03366). Abigail McGrory was funded by CAN-TAP-TALENT (CIHR Grant #184898).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

References

1. Garralda, Dienstmann, Piris-Giménez, et al. New clinical trial designs in the era of precision medicine. Mol Oncol 2019; 13(3): 549–557.

2. Woodcock, LaVange. Master protocols to study multiple therapies, multiple diseases, or both. N Engl J Med 2017; 377(1): 62–70.

3. PRACTICAL, PANTHER, TRAITS, INCEPT, and REMAP-CAP Investigators. The rise of adaptive platform trials in critical care. Am J Respir Crit Care Med 2024; 209(5): 491–496.

4. Barker, Sigman, Kelloff, et al. I-SPY 2: an adaptive breast cancer trial design in the setting of neoadjuvant chemotherapy. Clin Pharmacol Ther 2009; 86(1): 97–100.

5. Angus, Berry, Lewis, et al. The REMAP-CAP (Randomized Embedded Multifactorial Adaptive Platform for Community-Acquired Pneumonia) study. Rationale and design. Ann Am Thorac Soc 2020; 17(7): 879–891.

6. Alexander, Berger, et al. Adaptive global innovative learning environment for glioblastoma: GBM AGILE. Clin Cancer Res 2018; 24(4): 737–743.

7. Abani, Abbas, et al. Tocilizumab in patients admitted to hospital with COVID-19 (RECOVERY): a randomised, controlled, open-label, platform trial. Lancet 2021; 397(10285): 1637–1645.

8. Wason, Jaki. Optimal design of multi-arm multi-stage trials. Stat Med 2012; 31(30): 4269–4279.

9. Magirr, Jaki, Whitehead. A generalized Dunnett test for multi-arm multi-stage clinical studies with treatment selection. Biometrika 2012; 99(2): 494–501.

10. Greenstreet, Jaki, Bedding, et al. Design of platform trials with a change in the control treatment arm. Biometrics 2025; 81(2): ujaf073.

11. Burnett, Jennison. Adaptive enrichment trials: what are the benefits? Stat Med 2021; 40(3): 690–711.

12. Ghosh, Liu, Mehta. Adaptive multiarm multistage clinical trials. Stat Med 2020; 39(8): 1084–1102.

13. Gold, Bofill Roig, Miranda, et al. Platform trials and the future of evaluating therapeutic behavioural interventions. Nat Rev Psychol 2022; 1(1): 7–8.

14. Streiner. Alternatives to placebo-controlled trials. Can J Neurol Sci 2007; 34(S1): S37–S41.

15. Hobbs, Dorward, Hayward, et al. The PRINCIPLE randomised controlled open label platform trial of hydroxychloroquine for treating COVID-19 in community based patients at high risk. Sci Rep 2025; 15(1): 23850.

16. Lai, Donahue, Chen, et al. Verdiperstat in amyotrophic lateral sclerosis: results from the randomized HEALEY ALS platform trial. JAMA Neurol 2025; 82(4): 333–343.

17. Kapur, Elm, Chamberlain, et al. Randomized trial of three anticonvulsant medications for status epilepticus. N Engl J Med 2019; 381(22): 2103–2113.

18. Berry, Carlin, Lee, et al. Bayesian adaptive methods for clinical trials. CRC Press, 2010.

19. Mills, Thorlund, Ioannidis JPA. Demystifying trial networks and network meta-analysis. BMJ 2013; 346: f2914.

20. Salanti, Ades, Ioannidis. Graphical methods and numerical summaries for presenting results from multiple-treatment meta-analysis: an overview and tutorial. J Clin Epidemiol 2011; 64(2): 163–171.

21. Lawler, Hochman, Zarychanski. What are adaptive platform clinical trials and what role may they have in cardiovascular medicine? Circulation 2022; 145(9): 629–632.

22. Wang, Yee. I-SPY 2: a neoadjuvant adaptive clinical trial designed to improve outcomes in high-risk breast cancer. Curr Breast Cancer Rep 2019; 11(4): 303–310.

23. Greenstreet, Jaki, Bedding, et al. A multi-arm multi-stage design for trials with all pairwise testing. arXiv preprint arXiv:2502.07013, 2025.

24. Salanti, Nikolakopoulou, Efthimiou, et al. Introducing the treatment hierarchy question in network meta-analysis. Am J Epidemiol 2022; 191(5): 930–938.

25. Coalition TAT. Adaptive platform trials: definition, design, conduct and reporting considerations 2019. Nat Rev Drug Discov 2019; 18(10): 797–807.

26. Wason, Stecher, Mander. Correcting for multiple-testing in multi-arm trials: is it necessary and is it done? Trials 2014; 15(1): 364.

27. Gao, Han. Implementing the Nelder-Mead simplex algorithm with adaptive parameters. Comput Optim Appl 2012; 51(1): 259–277.

28. Rücker, Schwarzer. Ranking treatments in frequentist network meta-analysis works without resampling methods. BMC Med Res Methodol 2015; 15(1): 1–9.

29. Noma, Matsui, Omori, et al. Bayesian ranking and selection methods using hierarchical mixture models in microarray studies. Biostatistics 2010; 11(2): 281–289.

30. Umscheid, Margolis, Grossman. Key concepts of clinical trials: a narrative review. Postgrad Med 2011; 123(5): 194–204.

31. Gelman, Carlin, Stern, et al. Bayesian data analysis. Chapman and Hall/CRC, 1995.

32. Pericàs, Tacke, Anstee, et al. Platform trials to overcome major shortcomings of traditional clinical trials in non-alcoholic steatohepatitis? Pros and cons. J Hepatol 2023; 78(2): 442–447.

33. Bafadhel, Dorward, et al. Inhaled budesonide for COVID-19 in people at high risk of complications in the community in the UK (PRINCIPLE): a randomised, controlled, open-label, adaptive platform trial. Lancet 2021; 398(10303): 843–855.

34. Mahar, McGlothlin, Dymock, et al. A blueprint for a multi-disease, multi-domain Bayesian adaptive platform trial incorporating adult and paediatric subgroups: the Staphylococcus aureus network adaptive platform trial. Trials 2023; 24(1): 795.

35. Butler, Hobbs, Gbinigie, et al. Molnupiravir plus usual care versus usual care alone as early treatment for adults with COVID-19 at increased risk of adverse outcomes (PANORAMIC): an open-label, platform-adaptive randomised controlled trial. Lancet 2023; 401(10373): 281–293.

36. Urach, Posch. Multi-arm group sequential designs with a simultaneous stopping rule. Stat Med 2016; 35(30): 5536–5550.

37. Choodari-Oskooei, Bratton, Gannon, et al. Adding new experimental arms to randomised clinical trials: impact on error rates. Clin Trials 2020; 17(3): 273–284.
