Abstract
We investigate the optimal allocation design for response adaptive clinical trials under the average reward criterion. The treatment randomization process is formulated as a Markov decision process and the Bayesian method is used to summarize the information on treatment effects. A span-contraction operator is introduced and the average reward generated by the policy identified by the operator is shown to converge to the optimal value. We propose an algorithm to approximate the optimal treatment allocation using Thompson sampling and the contraction operator. For the scenario of two treatments with binary responses and a sample size of 200 patients, simulation results demonstrate efficient learning features of the proposed method. It allocates a high proportion of patients to the better treatment while retaining good statistical power and keeping the probability of a trial going in the undesired direction small. When the difference in success probability to detect is 0.2, the probability of a trial going in the unfavorable direction is < 1.5%, which decreases further to < 0.9% when the difference to detect is 0.3. For normally distributed responses, with a sample size of 100 patients, the proposed method assigns 13% more patients to the better treatment than the traditional complete randomization when detecting an effect size of 0.8, with good statistical power and a < 0.7% probability of the trial going in the undesired direction.
Keywords
Introduction
Response adaptive designs (RADs) sequentially modify the treatment randomization probability based on the information accumulated during a randomized trial, with the intention of allocating more patients to the best treatment, and thus have ethical advantages over the traditional complete randomization (CR). However, the unbalanced group sizes in RADs may lead to a loss of statistical power. 1–3 The tension between ethical benefits and loss of statistical power has been studied by many authors, and some optimal treatment allocations have been proposed that consider both objectives.
Constrained optimization and compound optimization have been used to study the optimal allocation for RADs while retaining certain statistical power at the conclusion of clinical trials. Rosenberger et al. 4 derived an optimal allocation for binary responses to minimize the expected number of treatment failures while keeping the conditional variance of the Wald test statistic at a fixed level. This allocation proportion was shown to assign more patients to the better treatment while retaining statistical power, in comparison with the traditional CR. The optimal allocation proportion was generalized to normally distributed responses by Biswas and Bhattacharya. 5 Tymofyeyev et al. 6 considered the criterion of minimizing the weighted sum of sample sizes while keeping the value of the non-centrality parameter at or above a given level, for the purpose of retaining statistical power to test homogeneity of treatment effects. The compound optimization criterion, which combines the inferential properties of an experiment with the proportion of patients allocated to the better treatment, was used to find the optimal allocation for RADs. 7–10 Reviews of RADs and optimal RADs may be found in Robertson et al. 11 and Sverdlov and Rosenberger. 12 From the inferential point of view, the imbalance in group sizes due to the adaptation of treatment allocation contributes to the ethical benefits for trial patients, but affects the statistical power to detect the difference in treatment effects. Recently, the optimization problem was explored using dynamic decision processes.
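As a concrete illustration of the kind of target allocation discussed above, the following minimal sketch computes the allocation proportion commonly attributed to Rosenberger et al. for two treatments with binary responses, namely sqrt(p_A) / (sqrt(p_A) + sqrt(p_B)). The function name and argument names are ours and purely illustrative; in practice the success probabilities are unknown and would be replaced by estimates.

```python
import math


def rsihr_allocation(p_a: float, p_b: float) -> float:
    """Target proportion of patients assigned to treatment A.

    p_a and p_b are the (assumed known) success probabilities of
    treatments A and B. The names are illustrative, not from the paper.
    """
    if not (0.0 < p_a <= 1.0 and 0.0 < p_b <= 1.0):
        raise ValueError("success probabilities must lie in (0, 1]")
    root_a, root_b = math.sqrt(p_a), math.sqrt(p_b)
    return root_a / (root_a + root_b)


# Equal success probabilities give balanced allocation.
print(rsihr_allocation(0.5, 0.5))            # 0.5
# A better treatment A receives more than half of the patients.
print(round(rsihr_allocation(0.8, 0.2), 4))  # 0.6667
```

The proportion moves away from 1/2 only slowly as the treatments diverge, which is consistent with the remark above that this allocation retains statistical power.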
In a dynamic decision process, decisions are made sequentially based on the current state and the history of past states of the decision process. Villar et al. 13 studied the optimization problem of maximizing the total discounted posterior means of response probability and derived the Gittins index for the optimal allocation, where the information on treatment effect was summarized using a beta prior for binary responses. Williamson and Villar 14 used the Gittins index method to study RADs with normally distributed responses. They derived the Gittins index and sequentially optimized the treatment allocation using the information summarized in the Bayesian posterior distributions. Their results showed a large gain in ethical benefits from using the Gittins index for adaptive randomization, but with a large loss of statistical power. Integrating blocking with the Gittins index sequentially was proposed to improve the optimal allocation in terms of achieving a balance between patients’ benefit and the loss of statistical power. 13,14 Russo 15 proposed algorithms based on Bayesian posteriors to adaptively collect information on treatment effects in order to identify the best treatment arm. Asymptotically, Russo’s algorithms allocate about half of the total patients to the best treatment. Wang 16 and Wang and Tiwari 17 used a monotone function as weighting to improve Russo’s algorithms to identify the best treatment. Those methods focused on the identification of the best arm without considering statistical power at the conclusion of a trial. 15–17 Recently, Yi and Wang 18 formulated the optimal allocation for RADs as a Markov decision process (MDP) using sufficient statistics to summarize the information on treatment effects. However, the algorithm proposed by Yi and Wang 18 depended on two tuning parameters, the values of which were determined from extensive simulations, so generalization beyond the simulated scenarios is very limited. Baas et al. 19 used restrictive MDPs to maximize patients’ benefit with constraints on type I error rate and statistical power using Bayesian priors, but their method works only for binary responses.
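To make the phrase "summarized using a beta prior for binary responses" concrete, the sketch below shows the standard conjugate update for one treatment arm: a Beta(a, b) prior combined with observed successes and failures gives a Beta(a + successes, b + failures) posterior. The function names and the uniform Beta(1, 1) default prior are assumptions chosen for illustration, not choices taken from the cited papers.

```python
def beta_posterior(successes, failures, prior_a=1.0, prior_b=1.0):
    """Posterior Beta parameters after observing binary responses.

    The Beta(1, 1) default prior is an illustrative assumption.
    """
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


def beta_variance(a, b):
    """Variance of a Beta(a, b) distribution."""
    total = a + b
    return (a * b) / (total * total * (total + 1.0))


# Seven successes and three failures on one arm, uniform prior.
a, b = beta_posterior(successes=7, failures=3)
print(a, b)                         # 8.0 4.0
print(round(beta_mean(a, b), 4))    # 0.6667
```

The posterior parameters are all that needs to be carried from one patient to the next, which is what makes a Bayesian summary convenient as the state of a sequential decision process.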
In this article, we extend the sequential model of RADs by Yi and Wang 18 to multiple treatments under the average reward criterion and propose an easily implementable algorithm for both binary and continuous responses. The reward is defined as a function of a patient’s response, and the optimization criterion is to maximize the average reward under a constraint that limits the randomization probabilities, for the purpose of preventing extremely unbalanced group sizes and retaining statistical power. We formulate the response adaptive randomization problem as an MDP and use Bayesian prior and posterior distributions to summarize the information on treatment effects, while the MDPs by Yi and Wang, 18 Ondra, 20 and Baas et al. 19 use sufficient statistics for the unknown parameters. Bayesian methods have been used to study the optimal allocation problem for RADs 21–23 and to summarize information on treatment effects for adaptive randomization. 13–17 Concerns were raised about the probability that a trial goes in the unfavorable direction with adaptive randomization. 22 Wathen and Thall 23 studied modified Bayesian designs of response allocation probability for multi-arm trials to reduce undesired characteristics of the designs. Yi and Wang 18 used tuning parameters in the algorithm to control undesirable effects. In this article, we propose an algorithm to approximate the optimal allocation by integrating Thompson sampling with the average reward criterion using a span contraction operator. Thompson sampling has the natural advantages of exploring for the best treatment and of computational efficiency. 24 The performance of the proposed method is examined using three metrics: the proportions of patients allocated to the alternative treatments, the statistical power to detect the difference in hypothesis testing, and the probability that a trial goes in the undesired direction.
The remainder of this article is organized as follows. Section 2 introduces the sequential decision model for RADs with
The method to approximate the optimal policy
We extend the sequential decision model for RADs by Yi and Wang 18 to multi-treatment clinical trials and introduce an algorithm based on Thompson sampling and the span contraction operator to sequentially learn treatment effects and optimally allocate treatments.
Suppose that patients’ responses are collected sequentially based only on the treatment received. Let
A response adaptive design is characterized by a sequence of treatment allocation probabilities
We establish the probability space for adaptive treatment allocation processes for RADs by extending the sequential decision model proposed by Yi and Wang. 18
The decision model is described as a tuple
For this decision model with the measurable spaces
Our objective is to maximize the average reward
For any bounded measurable function
For functions
For any function
For the policy
We propose an algorithm to approximate the optimal value
Set up the priors Sample For Determine
For the obtained treatment allocation Replace
We denote this algorithm as the MDP with Thompson sampling (MDP-TS) procedure. Unlike the MDP algorithm by Yi and Wang, 18 this algorithm does not require tuning parameters; in that earlier algorithm, the values of the tuning parameters were determined through extensive simulation to reduce the possibility of trials going in the unfavorable direction and to reduce the loss of statistical power. We will examine the performance of the proposed algorithm in the next section.
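For orientation only, the sketch below shows a plain Thompson-type response adaptive randomization for two treatments with binary responses. It is not the MDP-TS procedure: the span contraction operator is omitted entirely, and the constraint on randomization probabilities is imitated by clipping to hypothetical bounds of 0.1 and 0.9. The function names, the Beta(1, 1) priors, the number of posterior draws and the bounds are all assumptions made for this illustration.

```python
import random


def allocation_probability(post_a, post_b, rng, draws=200, lower=0.1, upper=0.9):
    """Monte Carlo estimate of P(p_A > p_B | data), clipped to [lower, upper].

    post_a and post_b hold the (alpha, beta) parameters of Beta posteriors.
    The clipping bounds are hypothetical stand-ins for a constraint on the
    randomization probabilities.
    """
    wins = sum(
        rng.betavariate(post_a[0], post_a[1]) > rng.betavariate(post_b[0], post_b[1])
        for _ in range(draws)
    )
    return min(upper, max(lower, wins / draws))


def run_trial(p_a, p_b, n_patients=200, seed=0):
    """Simulate one two-arm trial with binary responses."""
    rng = random.Random(seed)
    posterior = {"A": [1.0, 1.0], "B": [1.0, 1.0]}  # uniform Beta(1, 1) priors
    truth = {"A": p_a, "B": p_b}
    counts = {"A": 0, "B": 0}
    for _ in range(n_patients):
        prob_a = allocation_probability(posterior["A"], posterior["B"], rng)
        arm = "A" if rng.random() < prob_a else "B"
        success = rng.random() < truth[arm]
        # Conjugate update: a success increments alpha, a failure increments beta.
        posterior[arm][0 if success else 1] += 1.0
        counts[arm] += 1
    return counts, posterior


counts, posterior = run_trial(p_a=0.7, p_b=0.5)
print(counts, posterior)
```

Clipping keeps every patient's randomization probability away from 0 and 1, which is one simple way to prevent extremely unbalanced group sizes; the paper's own mechanism for this may differ.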

Distribution of
We conduct simulation studies to demonstrate the performance of the proposed algorithm for both binary and continuous responses. For simplicity, we consider two-treatment comparisons and assume that Treatment
For binary outcomes, the total number of patients is
Table 1 summarizes the results for various scenarios with success probabilities
Simulated statistical power and average allocation proportions
Simulated statistical power and average allocation proportions
MDP-TS: Markov decision process with Thompson sampling; CR: complete randomization.
Simulated results for exponentially distributed responses (
MDP-TS: Markov decision process with Thompson sampling; CR: complete randomization.
For continuous outcomes, we assume normally distributed responses and a sample size of
Table 2 reveals that the MDP-TS procedure with
Overall, the MDP-TS algorithm works well for detecting large treatment differences or effect sizes, and its statistical power is close to that under CR for binary or continuous responses, but the loss of statistical power is substantial when the differences or effect sizes to detect are small, especially under the MDP-TS procedure with
The proposed procedure is built on MDPs for RADs, and the information on treatment effects collected during a trial is summarized using the Bayesian method. This design is shown to have ethical advantages and good characteristics in retaining the statistical power of hypothesis testing and controlling the probability of a trial going in the undesired direction. The proposed algorithm, based on Thompson sampling and the span contraction operator, works iteratively to approximate the optimal allocation for both binary and continuous responses. Moreover, it is easy to implement in practice. Simulation studies demonstrate that, with a sample size of 200 for binary responses, the probability of a trial going in the unfavorable direction is < 1.5% when detecting a difference of 0.2 in success probabilities, and it decreases to < 0.9% when the difference to detect is 0.3. For normally distributed responses, with a sample size of 100, the proposed method assigns 13% more patients to the better treatment than the traditional CR when detecting an effect size of 0.8, with good statistical power, while the probability of a trial going in the undesired direction is < 0.7%. The proposed method is recommended for detecting large differences or effect sizes.
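The comparisons with CR above rest on simulated statistical power. As a rough sketch of how such a CR baseline could be estimated for binary responses, the code below simulates two-arm trials under complete randomization and counts how often a two-sided pooled two-proportion z-test rejects at the 5% level. The choice of test, the number of simulated trials and all function names are assumptions for illustration; the text shown here does not state which test the authors used.

```python
import math
import random


def two_proportion_z(x_a, n_a, x_b, n_b):
    """Pooled z statistic for H0: p_A = p_B; returns 0.0 if undefined."""
    if n_a == 0 or n_b == 0:
        return 0.0
    pooled = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        return 0.0
    return (x_a / n_a - x_b / n_b) / se


def simulate_cr_power(p_a, p_b, n_patients=200, n_trials=2000, z_crit=1.959964, seed=0):
    """Proportion of simulated CR trials in which the z-test rejects H0."""
    rng = random.Random(seed)
    truth = {"A": p_a, "B": p_b}
    rejections = 0
    for _ in range(n_trials):
        n = {"A": 0, "B": 0}
        x = {"A": 0, "B": 0}
        for _ in range(n_patients):
            arm = "A" if rng.random() < 0.5 else "B"  # complete randomization
            n[arm] += 1
            x[arm] += rng.random() < truth[arm]
        if abs(two_proportion_z(x["A"], n["A"], x["B"], n["B"])) > z_crit:
            rejections += 1
    return rejections / n_trials


print(simulate_cr_power(0.7, 0.5))  # estimated power for a difference of 0.2
print(simulate_cr_power(0.5, 0.5))  # estimated type I error rate
```

Replacing the fixed probability 0.5 with an adaptive randomization probability would give the corresponding estimate for a response adaptive design, at the cost of the group-size imbalance discussed in the introduction.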
Supplemental Material
Supplemental material, sj-pdf-1-smm-10.1177_09622802241293750 for Approximation to the optimal allocation for response adaptive designs by Yanqing Yi and Xikui Wang in Statistical Methods in Medical Research
Footnotes
Acknowledgements
Both authors acknowledge research support from the Natural Sciences and Engineering Research Council (NSERC) of Canada.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
References
