Approximation to the optimal allocation for response adaptive designs

Abstract

We investigate the optimal allocation design for response adaptive clinical trials under the average reward criterion. The treatment randomization process is formulated as a Markov decision process, and the Bayesian method is used to summarize the information on treatment effects. A span-contraction operator is introduced, and the average reward generated by the policy identified by the operator is shown to converge to the optimal value. We propose an algorithm to approximate the optimal treatment allocation using Thompson sampling and the contraction operator. For the scenario of two treatments with binary responses and a sample size of 200 patients, simulation results demonstrate efficient learning features of the proposed method. It allocates a high proportion of patients to the better treatment while retaining good statistical power and a small probability of a trial going in the undesired direction. When the difference in success probability to detect is 0.2, the probability of a trial going in the unfavorable direction is < 1.5%, which decreases further to < 0.9% when the difference to detect is 0.3. For normally distributed responses, with a sample size of 100 patients, the proposed method assigns 13% more patients to the better treatment than traditional complete randomization when detecting an effect size of 0.8, with good statistical power and a < 0.7% probability of the trial going in the undesired direction.

Keywords

Adaptive randomization, average reward criterion, Markov decision process, statistical power, Thompson sampling

1. Introduction

Response adaptive designs (RADs) sequentially modify the treatment randomization probability based on information accumulated during a randomized trial, with the intention of allocating more patients to the best treatment, and thus have ethical advantages over traditional complete randomization (CR). However, the unbalanced group sizes in RADs may lead to a loss of statistical power.^1–3 The tension between ethical benefits and loss of statistical power has been studied by many authors, and some optimal treatment allocations have been proposed that consider both objectives.

Constrained optimization and compound optimization have been used to study the optimal allocation for RADs while retaining certain statistical power at the conclusion of clinical trials. Rosenberger et al.⁴ derived an optimal allocation for binary responses to minimize the expected number of treatment failures while keeping the conditional variance of the Wald test statistic at a fixed level. This allocation proportion was shown to assign more patients to the better treatment while retaining statistical power, in comparison with the traditional CR. The optimal allocation proportion was generalized to normally distributed responses by Biswas and Bhattacharya.⁵ Tymofyeyev et al.⁶ considered the criterion of minimizing the weighted sum of sample sizes while keeping the value of the non-centrality parameter at or above a given level, for the purpose of retaining statistical power to test homogeneity of treatment effects. The compound optimization criterion, which combines the inferential properties of an experiment with the proportion of patients allocated to the better treatment, was used to find the optimal allocation for RADs.^7–10 Reviews of RADs and optimal RADs may be found in Robertson et al.¹¹ and Sverdlov and Rosenberger.¹² From the inferential theory of analysis, the imbalance in group sizes due to the adaptation of treatment allocation contributes to the ethical benefits for trial patients, but affects the statistical power to detect the difference in treatment effects. Recently, the optimization problem was explored using dynamic decision processes.

In a dynamic decision process, decisions are made sequentially based on the current state and the history of past states of the decision process. Villar et al.¹³ studied the optimization problem of maximizing the total discounted posterior means of response probability and derived the Gittins index for the optimal allocation, where the information on treatment effect was summarized using a beta prior for binary responses. Williamson and Villar¹⁴ used the Gittins index method to study RADs with normally distributed responses. They derived the Gittins index and sequentially optimized the treatment allocation using the information summarized in the Bayesian posterior distributions. Their results showed a great gain in ethical benefits from using the Gittins index for adaptive randomization, but with a large loss of statistical power. Integrating blocking with the Gittins index sequentially was proposed to improve the optimal allocation in terms of achieving a balance between patients’ benefit and the loss of statistical power.^13,14 Russo¹⁵ proposed algorithms based on Bayesian posteriors to adaptively collect information on treatment effects, in order to identify the best treatment arm. Asymptotically, Russo’s algorithms allocate about half of the total patients to the best treatment. Wang¹⁶ and Wang and Tiwari¹⁷ used a monotone function as weighting to improve Russo’s algorithms to identify the best treatment. Those methods focused on the identification of the best arm without considering statistical power at the conclusion of a trial.^15–17 Recently, Yi and Wang¹⁸ formulated the optimal allocation for RADs as a Markov decision process (MDP) using sufficient statistics to summarize the information on treatment effects. However, the algorithm proposed by Yi and Wang¹⁸ depended on two tuning parameters, the values of which were determined from extensive simulations, so generalization beyond the simulated scenarios is very limited. Baas et al.¹⁹ used restrictive MDPs to maximize patients’ benefit with constraints on type I error rate and statistical power using Bayesian priors, but their method works only for binary responses.

In this article, we extend the sequential model of RADs by Yi and Wang¹⁸ to multiple treatments under the average reward criterion and propose an easily implementable algorithm for both binary and continuous responses. The reward is defined as a function of a patient’s response, and the optimization criterion is to maximize the average reward under a constraint that bounds the randomization probabilities, for the purpose of preventing extremely unbalanced group sizes and retaining statistical power. We formulate the response adaptive randomization problem as an MDP and use Bayesian prior and posterior distributions to summarize the information on treatment effects, whereas the MDPs of Yi and Wang,¹⁸ Ondra,²⁰ and Baas et al.¹⁹ use sufficient statistics for the unknown parameters. Bayesian methods have been used to study the optimal allocation problem for RADs^21–23 and to summarize information on treatment effects for adaptive randomization.^13–17 Concerns were raised about the probability that a trial goes in the unfavorable direction with adaptive randomization.²² Wathen and Thall²³ studied modified Bayesian designs of response allocation probability for multi-arm trials to reduce undesired characteristics of the designs. Yi and Wang¹⁸ used tuning parameters in the algorithm to control undesirable effects. In this article, we propose an algorithm to approximate the optimal allocation by integrating Thompson sampling with the average reward criterion using a span contraction operator. Thompson sampling has the natural advantages of exploring for the best treatment and of computational efficiency.²⁴ The performance of the proposed method is examined using three metrics: the proportion of patients allocated to the alternative treatments, the statistical power to detect the difference in hypothesis testing, and the probability that a trial goes in the undesired direction.

The remainder of this article is organized as follows. Section 2 introduces the sequential decision model for RADs with $k$ treatments and the algorithm based on the contraction operator. Simulation studies are presented in Section 3 to demonstrate the performance of the proposed method. Section 4 concludes the article.

2. The method to approximate the optimal policy

We extend the sequential decision model for RADs by Yi and Wang¹⁸ to multi-treatment clinical trials and introduce an algorithm based on the Thompson sampling and the span contraction operator to sequentially learn treatment effects and optimally allocate treatments.

Suppose that patients’ responses are collected sequentially based only on the treatment received. Let $δ_{i} = (δ_{i 1}, δ_{i 2}, \dots, δ_{i k})$, where $δ_{i j} = 1$, $j = 1, 2, \dots, k$, if the $i$th patient receives treatment $j$ and $δ_{i j} = 0$ otherwise. Denote the response from patient $i$ who is treated with treatment $j$ as $Y_{i j}$ and let $y_{i} = (Y_{i 1} δ_{i 1}, Y_{i 2} δ_{i 2}, \dots, Y_{i k} δ_{i k})$. Responses from treatment $j$ are independent and identically distributed with a density function $f_{j} (y, θ_{j})$, where $θ_{j} \in Θ_{j}$, $1 \leq j \leq k$, is an unknown parameter, $y \in \mathbb{R} = (- \infty, \infty)$, and $\mathcal{R}$ is the $σ$-algebra over $\mathbb{R}$.

A response adaptive design is characterized by a sequence of treatment allocation probabilities $π = \{π_{i}, i = 1, 2, \dots\}$ such that $π_{i} = (π_{i 1}, π_{i 2}, \dots, π_{i k})$, $i \geq 2$, is a set of $\mathcal{F}_{i - 1}$-measurable functions on $A$, where $\sum_{j = 1}^{k} π_{i j} = 1$; the $σ$-algebra $\mathcal{F}_{i - 1}$ is generated by $\{(δ_{1}, y_{1}), \dots, (δ_{i - 1}, y_{i - 1})\}$; and $A = \{(t_{1}, t_{2}, \dots, t_{k}) : \sum_{j = 1}^{k} t_{j} = 1, t_{j} \in \{0, 1\}\}$ is equipped with the $σ$-algebra $\mathcal{A}$ given by the power set of $A$. For the first subject, we fix $π_{1 j} = P (δ_{1 j} = 1)$ in advance, for example at $1 / k$, $1 \leq j \leq k$.

We establish the probability space for adaptive treatment allocation processes for RADs by extending the sequential decision model proposed by Yi and Wang.¹⁸ The decision model is described as a tuple $((S, \mathcal{S}), (A, \mathcal{A}), π, (q_{n}), (y_{n}))$ as follows:

(a) $(S, \mathcal{S})$ is the state space, where $S = \mathbb{R}^{k}$ and $\mathcal{S}$ is the product of the $σ$-algebra $\mathcal{R}$ on $\mathbb{R}$;
(b) $(A, \mathcal{A})$ is the action space;
(c) $π = \{π_{i}, i = 1, 2, \dots\}$ is the treatment allocation rule corresponding to the RAD, in which $π_{i}$ is a stochastic kernel from $(H_{i - 1}, \mathcal{F}_{i - 1})$ to $(A, \mathcal{A})$ for $i \geq 2$ and $π_{1}$ is pre-defined, where $H_{n} = \{(h, δ, s) : h \in H_{n - 1}, δ \in A, s \in S\}$, $n \geq 2$, and $H_{1} = A \times S$;
(d) $q_{n}$ is a transition probability law from $(H_{n - 1} \times A, \mathcal{F}_{n - 1} \times \mathcal{A})$ to $(S, \mathcal{S})$ and is given by
$q_{n} (y_{n} \in B | h, δ_{n}) = \prod_{j = 1}^{k} \left(\int_{B_{j}} f_{j} (y) d y\right)^{δ_{n j}}$
where $h \in H_{n - 1}$, $δ_{n} \in A$, and $B_{j}$ is the projection of $B$ on the $j$th component;
(e) $r (y_{n}, δ_{n})$ is a real-valued reward function on $S \times A$.

For this decision model with the measurable spaces $(A, \mathcal{A})$ and $(S, \mathcal{S})$ and the transition probabilities, there is a unique probability measure $P_{π}$ on $H = A \times S \times A \times S \times \dots$, according to C. Ionescu Tulcea’s theorem, where the corresponding $σ$-algebra $\mathcal{H}$ is a product $σ$-algebra. This defines the probability space $((H, \mathcal{H}), P_{π})$ for the adaptive randomization process. We consider the randomization policies $Π = \{(π_{i 1}, π_{i 2}, \dots, π_{i k}) : γ \leq π_{i j} \leq 1 - γ, j = 1, 2, \dots, k\}$, where $γ < 1 / k$ is a pre-specified value. The role of $γ$ is to allocate a certain proportion of patients to each of the $k$ treatments to learn its treatment effect and to avoid extremely unbalanced sizes of treatment groups, in order to retain reasonable statistical power at the conclusion of the clinical trial. This type of policy has been considered by Biswas and Bhattacharya,⁵ Russo,¹⁵ Cheng and Berry,²¹ Wathen and Thall,²³ and Baldi Antognini et al.²⁵

Our objective is to maximize the average reward $V (π) = \liminf_{n \to \infty} \frac{1}{n} E_{π} \left(\sum_{i = 1}^{n} r (y_{i}, δ_{i})\right)$ over $π \in Π$, assuming that a larger reward is better. From Yi and Wang,²⁶ we know that $V (π) = \lim_{n \to \infty} E_{π} \left(\sum_{j = 1}^{k} \frac{N_{j}}{n} r (j)\right)$ and $\frac{N_{j}}{n} - \frac{1}{n} \sum_{i = 1}^{n} π_{i j} \to 0$ almost surely, where $r (j) = E (r (Y_{i}, δ_{i}) | δ_{i j} = 1)$ is the mean reward from treatment $j$ and $N_{j}$ is the number of patients allocated to treatment $j$ when a total of $n$ patients are randomized. As a special case, $r (j)$ is the mean response when $r (Y_{i}, δ_{i}) = \sum_{l = 1}^{k} Y_{i l} δ_{i l}$. Let $V^{*} = \max_{π \in Π} V (π)$ be the optimal value of the response adaptive design under the average reward criterion. Denote $J^{*} = \{j^{*} : j^{*} = \arg \max_{j} r (j)\}$ and let $| J^{*} |$ be the number of treatments in $J^{*}$. Since $\frac{N_{j}}{n} - \frac{1}{n} \sum_{i = 1}^{n} π_{i j} \to 0$ almost surely and $γ \leq π_{i j} \leq 1 - γ$, we have $V^{*} = (1 - (k - | J^{*} |) γ) \max_{j} r (j) + γ \sum_{j \notin J^{*}} r (j)$. Because the parameters in the mean rewards $r (j)$, $j = 1, 2, \dots, k$, are unknown, RADs learn treatment effects from the collected information and change the allocation probabilities sequentially to assign more patients to the best treatment. We propose an algorithm to identify the optimal allocations and to approximate $V^{*}$.
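As a small numerical illustration of the closed form for the optimal value, the following Python sketch evaluates $(1 - (k - |J^{*}|) γ) \max_{j} r (j) + γ \sum_{j \notin J^{*}} r (j)$ for known mean rewards. The function name `optimal_average_reward` and the example reward values are assumptions chosen for illustration; in a real trial the mean rewards are unknown and must be learned.

```python
def optimal_average_reward(mean_rewards, gamma):
    """Evaluate V* for known mean rewards r(1), ..., r(k).

    Each non-optimal treatment receives allocation probability gamma,
    and the remaining mass is placed on the set of best treatments.
    """
    k = len(mean_rewards)
    if not 0 < gamma < 1 / k:
        raise ValueError("gamma must satisfy 0 < gamma < 1/k")
    best = max(mean_rewards)
    # Rewards of treatments outside J* (strictly worse than the best).
    suboptimal = [r for r in mean_rewards if r < best]
    return (1 - len(suboptimal) * gamma) * best + gamma * sum(suboptimal)


# Hypothetical two-treatment example with r(A) = 0.6 and r(B) = 0.4.
for gamma in (0.25, 0.10):
    print(gamma, round(optimal_average_reward([0.6, 0.4], gamma), 4))
```

With these illustrative values the optimal average reward is 0.55 for $γ = 0.25$ and 0.58 for $γ = 0.1$, which shows how a smaller $γ$ raises the attainable average reward by allowing more weight on the best treatment.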

For any bounded measurable function $u (x)$, define
$T u (x) = \max_{γ \leq π \leq 1 - γ} \sum_{j = 1}^{k} π (δ = j | x) \left[r (j) + \int u (s) q (d s | x, j)\right]$
Let $q (δ, C | h) = π (δ | h) P (Y_{δ} \in C | δ)$. Since $γ \leq π (δ | h) \leq 1 - γ$ for any $h \in H_{n}$, it can be easily shown that $\sup_{h, h^{'}} \| q (\cdot | h) - q (\cdot | h^{'}) \| \leq 2 (1 - 2 γ)$, where $\| \cdot \|$ is the variation norm. Therefore, $T$ is a span contraction operator. The following results can be easily generalized from the two-treatment setting in Yi and Wang¹⁸ using the Jordan-Hahn decomposition theorem.

Lemma 2.1
For functions $u_{1} (x)$ and $u_{2} (x)$,
$sp (T u_{1} - T u_{2}) \leq (1 - 2 γ) sp (u_{1} - u_{2})$
where $sp (u)$ is the span semi-norm of $u (x)$ defined by $sp (u) = \sup_{x} u (x) - \inf_{x} u (x)$.
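To give a feel for the contraction rate in Lemma 2.1, the Python sketch below computes the span of a function tabulated on finitely many points and the number of applications of $T$ after which the bound $(1 - 2 γ)^{n} sp (u_{1} - u_{0})$, obtained by applying the lemma repeatedly, falls below a tolerance. The helper names `span` and `iterations_for_span` and the tolerance are hypothetical and serve only as an illustration of the geometric rate.

```python
import math


def span(values):
    """Span semi-norm of a function given by its values on a finite set."""
    return max(values) - min(values)


def iterations_for_span(tol, initial_span, gamma):
    """Smallest n with (1 - 2*gamma)**n * initial_span <= tol."""
    rate = 1 - 2 * gamma
    if not 0 < rate < 1:
        raise ValueError("need 0 < gamma < 1/2 for a strict contraction")
    if initial_span <= tol:
        return 0
    return math.ceil(math.log(tol / initial_span) / math.log(rate))


print(span([0.2, 0.9, 0.5]))
for gamma in (0.25, 0.10):
    print(gamma, iterations_for_span(1e-3, 1.0, gamma))
```

Under this bound, a larger $γ$ contracts faster (factor 0.5 for $γ = 0.25$ versus 0.8 for $γ = 0.1$), which mirrors the trade-off between balance and exploitation discussed in the text.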

For any function $u_{0} (x)$, let $u_{n} (x) = T u_{n - 1} (x)$, $n = 1, 2, \dots$. Define the policy $π^{'} = \{(π_{n}^{'}) : π_{n}^{'} = \arg \max T u_{n - 1}, n = 1, 2, \dots\}$. That is,
$u_{n} (x) = \sum_{j = 1}^{k} π_{n}^{'} (δ_{n j} = 1 | x) \left[r (j) + \int u_{n - 1} (s) q (d s | x, j)\right]$
Using Lemma 2.1 and the span fixed point theorem,²⁷ we have the following result. For simplicity, assume that $J^{*}$ contains a unique treatment, denoted as $j^{*}$. Let $N_{j^{*}}$ be the number of patients allocated to the optimal treatment $j^{*}$ when a total of $n$ patients are allocated.

Theorem 2.2
For the policy $π^{'}$ defined as above, we have
(1) $π^{'}$ is optimal, that is, $V (π^{'}) = \max_{π} V (π)$;
(2) $\lim_{n \to \infty} \frac{N_{j^{*}}}{n} = ρ$, $P_{π^{'}}$-a.s., where $ρ = 1 - (k - 1) γ$.

We propose an algorithm to approximate the optimal value $V^{*}$ and to identify the optimal treatment based on the sequential decision model. Instead of using sufficient statistics, we utilize the Bayesian prior and posterior distributions to summarize the information contained in $h_{n}$ in the following algorithm. The treatment allocation process becomes an MDP. Our algorithm is built on the MDP using the contraction operator $T$ and Thompson sampling to calculate the integral for $u_{n}$ in $T$. RADs with Bayesian methods have been studied by Williamson and Villar,¹⁴ Russo,¹⁵ Cheng and Berry,²¹ Thall et al.,²² and Wathen and Thall.²³ While Russo¹⁵ used a utility function to build algorithms to approximate the best treatment, and the approach by Villar et al.¹³ and Williamson and Villar¹⁴ was based on the Gittins index for optimal allocation, our method is different in that we use the contraction nature of the operator $T$ to approximate the optimal allocation by integrating learning on the rewards $r (j)$, $j = 1, 2, \dots, k$, and the iterated values $u_{n}$ based on Thompson sampling. Due to the cumulative nature of the iteration, we use ${\hat{u}}_{n} = n \{(1 - (k - 1) γ) r (θ_{{\tilde{j}}_{n}}) + γ \sum_{j \neq {\tilde{j}}_{n}} r (θ_{j})\}$ to approximate $u_{n}$, where ${\tilde{j}}_{n}$ is the best treatment at iteration $n$ from Thompson sampling. This exploration and exploitation works iteratively to approximate the optimal value $V^{*}$. The algorithm is as follows:

Step 1. Set up the priors $p_{0 j}$, $j = 1, 2, \dots, k$. Randomize the first $k$ patients, one to each of the $k$ treatments, with $π_{1 j} = 1 / k$. Observe the responses $\{y_{1 j}, j = 1, 2, \dots, k\}$ and obtain the posterior distributions $p_{1 j}$. Define $v = (0, 0, \dots, 0)_{k + 1}$, a zero vector of length $k + 1$.
Step 2. Sample ${\tilde{θ}}_{n j} \sim p_{n j}$, $j = 1, 2, \dots, k$. Let ${\tilde{j}}_{n} = \arg \max_{j} r ({\tilde{θ}}_{n j})$. If two or more treatments achieve the maximum, break the tie at random to determine ${\tilde{j}}_{n}$. Calculate ${\tilde{\hat{u}}}_{n} = n \{(1 - (k - 1) γ) r ({\hat{θ}}_{{\tilde{j}}_{n}}) + γ \sum_{j \neq {\tilde{j}}_{n}} r ({\hat{θ}}_{j})\}$, where ${\hat{θ}}_{j}$ is the estimate of $θ_{j}$. Update $v$ by setting $v_{{\tilde{j}}_{n}} = {\tilde{\hat{u}}}_{n}$. If $r ({\tilde{θ}}_{n j}) = c$ for all $j = 1, 2, \dots, k$, set $v_{k + 1} = n c$.
Step 3. For each $j = 1, \dots, k$, sample ${\tilde{θ}}_{j}^{t} \sim p_{n j}$. Denote $j^{t} = \arg \max \{r ({\tilde{θ}}_{j}^{t}), \max_{l \neq j} r ({\tilde{θ}}_{n l})\}$. If $j^{t} = {\tilde{j}}_{n}$, let $u_{n j}^{t} = {\tilde{\hat{u}}}_{n}$. Otherwise, let $u_{n j}^{t} = v_{j}$. Let $j_{n + 1}^{*} = \arg \max_{j} \{r ({\hat{θ}}_{n j}) + u_{n j}^{t}\}$.
Step 4. Determine
$π_{(n + 1) j} = P (δ_{(n + 1) j} = 1) = \begin{cases} 1 - (k - 1) γ & \text{if } j = j_{n + 1}^{*} \\ γ & \text{if } j \neq j_{n + 1}^{*} \\ 1 / k & \text{if all treatments are equivalent} \end{cases}$
and obtain the treatment allocations $δ_{(n + 1) j}$ based on $π_{(n + 1) j}$.
Step 5. For the obtained treatment allocation $δ_{(n + 1) j}$, $n \geq 1$, $j = 1, 2, \dots, k$, sample $y_{(n + 1) j} \sim f (θ_{j} | δ_{(n + 1) j} = 1)$. Update the posterior distributions $p_{(n + 1) j}$ and update ${\hat{θ}}_{(n + 1) j}$ using $y_{(n + 1) j}$.
Step 6. Replace $n$ by $n + 1$ and go back to Step 2 until all subjects are treated.

We denote this algorithm as the MDP with Thompson sampling (MDP-TS) procedure. Unlike the MDP algorithm by Yi and Wang,¹⁸ this algorithm does not require tuning parameters, whose values were determined through extensive simulation to reduce the possibility of trials going in the unfavorable direction and to reduce the loss of statistical power. We will examine the performance of the proposed algorithm in the next section.
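To make Steps 1 to 6 concrete, the following Python sketch shows one possible reading of the MDP-TS procedure for binary responses with $Beta (1, 1)$ priors and reward $r (θ) = θ$. It is not the authors' R code from the Supplemental Appendix. The function name `mdp_ts_binary`, the use of the running patient count as $n$, the random tie-breaking, and the rule that treats all arms as equivalent only when their Step 3 scores coincide are assumptions made for illustration.

```python
import random


def mdp_ts_binary(true_probs, n_total, gamma, seed=0):
    """Simulate one trial; returns the number of patients per treatment."""
    rng = random.Random(seed)
    k = len(true_probs)
    succ = [0] * k  # successes observed on each treatment
    cnt = [0] * k   # patients allocated to each treatment

    def treat(j):
        # Step 5: observe a Bernoulli response and update the counts.
        cnt[j] += 1
        succ[j] += rng.random() < true_probs[j]

    def draw(j):
        # Posterior draw for treatment j under a Beta(1, 1) prior.
        return rng.betavariate(1 + succ[j], 1 + cnt[j] - succ[j])

    def argmax_random_ties(xs):
        top = max(xs)
        return rng.choice([i for i, x in enumerate(xs) if x == top])

    # Step 1: one patient on each treatment, then initialise v.
    for j in range(k):
        treat(j)
    v = [0.0] * (k + 1)

    for n in range(k, n_total):  # assumption: n = patients treated so far
        theta_hat = [(succ[j] + 1) / (cnt[j] + 2) for j in range(k)]
        # Step 2: Thompson draw, provisional best arm, approximate u_n.
        theta_tilde = [draw(j) for j in range(k)]
        j_tilde = argmax_random_ties(theta_tilde)
        u_hat = n * ((1 - (k - 1) * gamma) * theta_hat[j_tilde]
                     + gamma * sum(theta_hat[j] for j in range(k) if j != j_tilde))
        v[j_tilde] = u_hat
        if len(set(theta_tilde)) == 1:
            v[k] = n * theta_tilde[0]
        # Step 3: redraw each arm in turn and score it.
        scores = []
        for j in range(k):
            trial = list(theta_tilde)
            trial[j] = draw(j)
            j_t = argmax_random_ties(trial)
            u_t = u_hat if j_t == j_tilde else v[j]
            scores.append(theta_hat[j] + u_t)
        # Step 4: allocation probabilities bounded by gamma.
        if len(set(scores)) == 1:
            probs = [1 / k] * k
        else:
            j_star = argmax_random_ties(scores)
            probs = [gamma] * k
            probs[j_star] = 1 - (k - 1) * gamma
        treat(rng.choices(range(k), weights=probs)[0])
    return cnt


counts = mdp_ts_binary([0.6, 0.4], n_total=200, gamma=0.25, seed=1)
print(counts, counts[0] / sum(counts))
```

A single run like this only illustrates the mechanics. Reproducing the operating characteristics reported in the simulation section would require many replications and the exact implementation choices of the authors' code.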

Figure 1.
Distribution of $N_{A} / n$ ( $θ_{A} = 0.6$, $θ_{B} = 0.4$, $n = 200$ ) under MDP-TS and MDP. MDP-TS: Markov decision process with Thompson sampling; MDP: Markov decision process.

3. Simulation studies

We conduct simulation studies to demonstrate the performance of the proposed algorithm for both binary and continuous responses. For simplicity, we consider two-treatment comparisons and assume that Treatment $A$ is better than Treatment $B$. The simulation programs can be extended to three or more treatments. In all simulations, the reward function is $r (Y, δ) = δ Y_{A} + (1 - δ) Y_{B}$ and the objective is to maximize the average response. For each simulated scenario, we produce results for $γ = 0.25$ and $γ = 0.1$ to examine the influence of the magnitude of $γ$ on statistical power and the proportion of patients allocated to Treatment $A$. We assess the performance of the algorithm using the average proportion of patients allocated to Treatment $A$, that is, $E (N_{A} / n)$, its standard deviation, the statistical power, and the probability $P (N_{A} / n < 0.5)$ that the design goes in the undesired direction. Statistical power is computed using the Wald statistic for the one-sided hypothesis that Treatment $A$ is better than Treatment $B$. For each simulated scenario, the first two patients are randomized using CR. All results are based on $10^{6}$ runs. The R code for the proposed method is in the Supplemental Appendix.

For binary outcomes, the total number of patients is $200$ in all simulations. We use $Beta (1, 1)$ as the prior distribution for the unknown success probabilities $θ_{A}$ and $θ_{B}$. The estimates for the iteration in Step 3 of the algorithm are ${\hat{θ}}_{j} = \frac{s_{j} + 1}{n_{j} + 2}$, $j = A, B$, using the method proposed by Agresti and Caffo,²⁸ where $s_{j}$ and $n_{j}$ are the number of successes and the number of patients, respectively, for treatment $j$. Figure 1 presents the distribution of the proportion of patients allocated to Treatment $A$ for the scenario of $θ_{A} = 0.6$ and $θ_{B} = 0.4$ for MDP-TS ( $γ = 0.25$ ), MDP-TS ( $γ = 0.1$ ), the MDP procedure by Yi and Wang¹⁸ with $γ = 0.25$, $η = 0.1$, and $ζ = 0.05$, and the optimal proportion proposed by Rosenberger et al.⁴ (RSIHR), which is targeted by the doubly biased coin method with the allocation function $g (x, ρ)$ ²⁹ and $α = 100$. The figure reveals that RSIHR has a smaller variation in allocation proportions and that the procedures based on the MDP allocate more patients to the better treatment, although Yi and Wang¹⁸ reported that RSIHR produces good statistical power that is very close to that under CR. The figure also shows that the MDP-TS procedures have thinner left tails than the MDP procedure, indicating that the probability of a trial going in the unfavorable direction is smaller. This is even more obvious for MDP-TS with $γ = 0.1$, which allocates more patients to the better treatment than the other two procedures.
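The Bayesian bookkeeping for binary responses is simple enough to sketch in a few lines of Python. The function names `beta_posterior` and `agresti_caffo_estimate` and the counts in the example are hypothetical. The sketch also shows that, under the $Beta (1, 1)$ prior, the estimate $\frac{s_{j} + 1}{n_{j} + 2}$ coincides with the posterior mean.

```python
def beta_posterior(successes, patients, prior=(1.0, 1.0)):
    """Posterior Beta parameters after observing binary responses."""
    a, b = prior
    return a + successes, b + patients - successes


def agresti_caffo_estimate(successes, patients):
    """Estimate (s + 1) / (n + 2) used in the algorithm's iteration."""
    return (successes + 1) / (patients + 2)


# Hypothetical interim data: 12 successes among 20 patients on one arm.
a, b = beta_posterior(12, 20)
posterior_mean = a / (a + b)
print(a, b, posterior_mean, agresti_caffo_estimate(12, 20))
```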

Table 1 summarizes the results for various scenarios with success probabilities $θ_{B} = 0.4$ and $0.6$ for Treatment B. The asymptotic statistical power under CR is listed in the last column of the table. The results demonstrate a trade-off between ethical benefits and loss of statistical power for the MDP-TS procedures with $γ = 0.25$ and $γ = 0.1$. The procedure with $γ = 0.1$ assigns a higher proportion of patients to the better treatment but with a loss in statistical power. When the difference in success probability is 0.2 or higher, the MDP-TS procedure with $γ = 0.1$ allocates $10 %$ or more additional patients to the better treatment, but there is a loss of about $8 %$ or more in statistical power when compared with MDP-TS with $γ = 0.25$. The probability of a trial going in the undesired direction decreases as the difference $(θ_{A} - θ_{B})$ increases. Under MDP-TS with $γ = 0.1$, when $θ_{A} - θ_{B} = 0.2$, $P (N_{A} / n < 0.5)$ is 1.5% and 0.9% for $θ_{B} = 0.4$ and $θ_{B} = 0.6$, respectively. It decreases to $0.07 %$ and $0.011 %$ for the two scenarios when $θ_{A} - θ_{B} = 0.3$. Moreover, $P (N_{A} / n < 0.5)$ is larger under MDP-TS with $γ = 0.1$ than with $γ = 0.25$. When the difference in success probability is 0.3, the statistical power under the MDP-TS procedures is close to that under CR, while on average $73 %$ and $86 %$ of patients are allocated to the better treatment with $γ = 0.25$ and $γ = 0.1$, respectively. The statistical power of MDP-TS is similar to that of the constrained MDP method proposed by Baas et al.,¹⁹ who reported a roughly attained power of $80 %$ to detect a difference of $0.25$ under the model with constraints on type I error rate and statistical power.
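The CR reference column can be approximated with the usual normal approximation to the one-sided Wald test. The Python sketch below assumes equal allocation ( $n / 2$ patients per arm) and a one-sided significance level of 0.05; the level is an assumption inferred from the power of 0.05 reported under equal success probabilities, and the function name `cr_power_binary` is hypothetical.

```python
from statistics import NormalDist


def cr_power_binary(theta_a, theta_b, n_total, alpha=0.05):
    """Asymptotic power of the one-sided Wald test under equal allocation."""
    n_arm = n_total / 2
    # Standard error of the difference in sample proportions.
    se = (theta_a * (1 - theta_a) / n_arm
          + theta_b * (1 - theta_b) / n_arm) ** 0.5
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf((theta_a - theta_b) / se - z_alpha)


for theta_a in (0.4, 0.5, 0.6, 0.7):
    print(theta_a, round(cr_power_binary(theta_a, 0.4, 200), 3))
```

Under these assumptions the approximation gives about 0.414 for $θ_{A} = 0.5$ and about 0.893 for $θ_{A} = 0.6$ with $θ_{B} = 0.4$ and $n = 200$, in line with the CR column of Table 1.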

Table 1.

Simulated statistical power and average allocation proportions ( $n = 200$ ).

		MDP-TS ( $γ = 0.25$ )			MDP-TS ( $γ = 0.1$ )			CR
$θ_{A}$	$θ_{B}$	$E (\frac{N_{A}}{n})$ (s.d.)	$P (\frac{N_{A}}{n} < 0.5)$	Power	$E (\frac{N_{A}}{n})$ (s.d.)	$P (\frac{N_{A}}{n} < 0.5)$	Power	Power
0.4	0.4	0.503 (0.138)	0.495	0.050	0.503 (0.219)	0.497	0.050	0.05
0.5	0.4	0.632 (0.106)	0.125	0.372	0.702 (0.168)	0.131	0.306	0.414
0.6	0.4	0.700 (0.060)	0.012	0.842	0.810 (0.090)	0.015	0.728	0.893
0.7	0.4	0.727 (0.039)	$4 \times 10^{- 4}$	0.991	0.856 (0.048)	$7 \times 10^{- 4}$	0.952	0.998
0.6	0.6	0.503 (0.137)	0.496	0.049	0.503 (0.221)	0.497	0.050	0.05
0.7	0.6	0.635 (0.103)	0.116	0.395	0.708 (0.166)	0.123	0.328	0.439
0.8	0.6	0.706 (0.055)	0.007	0.896	0.822 (0.080)	0.009	0.792	0.935
0.9	0.6	0.732 (0.036)	$4.3 \times 10^{- 5}$	0.999	0.867 (0.039)	$1.1 \times 10^{- 4}$	0.982	1.00

MDP-TS: Markov decision process with Thompson sampling; CR: complete randomization.

Table 2.

Simulated results for normally distributed responses ( $n = 100$ ).

		MDP-TS ( $γ = 0.25$ )			MDP-TS ( $γ = 0.10$ )			CR
$μ_{A}$	$μ_{B}$	$E (\frac{N_{A}}{n})$ (s.d.)	$P (\frac{N_{A}}{n} < 0.5)$	Power	$E (\frac{N_{A}}{n})$ (s.d.)	$P (\frac{N_{A}}{n} < 0.5)$	Power	Power
0.15	0.15	0.505 (0.106)	0.488	0.050	0.505 (0.160)	0.493	0.050	0.050
0.47	0.15	0.619 (0.063)	0.041	0.763	0.684 (0.081)	0.034	0.722	0.971
0.66	0.15	0.641 (0.053)	0.008	0.982	0.722 (0.059)	0.004	0.967	1
5.30	5.30	0.505 (0.103)	0.486	0.050	0.505 (0.154)	0.492	0.050	0.050
4.20	5.30	0.615 (0.059)	0.038	0.765	0.678 (0.075)	0.031	0.728	0.971
3.54	5.30	0.634 (0.050)	0.007	0.983	0.707 (0.052)	0.003	0.971	1

MDP-TS: Markov decision process with Thompson sampling; CR: complete randomization.

For continuous outcomes, we assume normally distributed responses and a sample size of $100$. The program can be modified for other distributions. We consider two situations for the distribution of Treatment B so as to detect effect sizes of 0.5 and 0.8 for Treatment A, assuming that the standard deviation is equal for both treatments. One scenario is $N (0.15, {0.64}^{2})$ for Treatment B, the distribution of the negative logarithm of tumor size reduction for the control group,³⁰ where Treatment A as the intervention is anticipated to have larger responses. The other scenario is $N (5.3, {2.2}^{2})$ for Treatment B, the distribution of average pain scores over a week for the placebo group,³¹ where Treatment A is the intervention and is anticipated to have a lower mean response than Treatment B. In this scenario, we use the negative score as the response in the algorithm but report results on the original pain scale in Table 2. The mean $μ_{A}$ for Treatment A in both scenarios is set so that the effect size relative to Treatment B is 0.5 or 0.8. In the adaptive randomization with the MDP-TS procedure, the standard deviations are assumed unknown and are estimated using the responses collected. We use the normal-inverse-gamma prior $NIG (0, 2, 1 / 2, 1 / 2)$ for $(μ_{j}, σ_{j}^{2})$, $j = A, B$, as suggested by Williamson and Villar.¹⁴ The mean $μ$ in $r (μ)$ in the Step 3 iteration is estimated by the sample mean ${\hat{μ}}_{j} = \frac{\sum_{i = 1}^{n_{j}} x_{i j}}{n_{j}}$. The results are summarized in Table 2, where the last column lists the nominal statistical power under CR.
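The Treatment A means implied by these effect sizes follow from $μ_{A} = μ_{B} \pm d σ$, where $d$ is the effect size and $σ$ the common standard deviation. The short Python sketch below reproduces that arithmetic; the function name `treatment_mean` and its `larger_is_better` flag are hypothetical conveniences, not part of the authors' program.

```python
def treatment_mean(control_mean, sd, effect_size, larger_is_better=True):
    """Mean of Treatment A for a given standardized effect size."""
    shift = effect_size * sd
    return control_mean + shift if larger_is_better else control_mean - shift


# Tumor size scenario: larger responses are better.
print([round(treatment_mean(0.15, 0.64, d), 3) for d in (0.5, 0.8)])
# Pain score scenario: lower scores are better.
print([round(treatment_mean(5.3, 2.2, d, larger_is_better=False), 3)
       for d in (0.5, 0.8)])
```

These values (0.47 and 0.662 for the first scenario, 4.2 and 3.54 for the second) agree with the $μ_{A}$ column of Table 2 up to rounding.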

Table 2 reveals that the MDP-TS procedure with $γ = 0.1$ allocates more patients to the better treatment than the one with $γ = 0.25$, but at the cost of losing some statistical power. The procedure with $γ = 0.1$ assigns about 6.5% more patients to the better treatment than MDP-TS with $γ = 0.25$, but has a 4% loss of statistical power when detecting an effect size of 0.5, and the loss is reduced to $1.5 %$ when detecting an effect size of 0.8. The probability $P (\frac{N_{A}}{n} < 0.5)$ that a trial goes in the undesired direction under the procedure with $γ = 0.1$ is often smaller. When detecting an effect size of 0.8, $P (\frac{N_{A}}{n} < 0.5)$ under the procedure with $γ = 0.1$ is about 0.4%, which is half of that under $γ = 0.25$. Although the MDP-TS procedures have apparent ethical benefits over CR, the loss in statistical power to detect an effect size of 0.5 is substantial when compared with the nominal statistical power under CR, which assumes a known standard deviation. The statistical power under the MDP-TS procedure becomes close to the nominal statistical power under CR when detecting an effect size of 0.8, while the procedure allocates $13 %$ or more additional patients to the better treatment than CR.

Overall, the MDP-TS algorithm works well for detecting large treatment differences or effect sizes, where the statistical power is close to that under CR for binary or continuous responses, but the loss of statistical power is substantial when the differences or effect sizes to detect are small, especially under the MDP-TS procedure with $γ = 0.1$. The loss of statistical power can be recovered by increasing the sample size for RADs to make the power equivalent to that under CR. See the results by Korn and Freidlin² on the sample sizes and the number of non-responders required for a Bayesian adaptive design to achieve the same statistical power as a non-RAD. In future work, we will provide sample size guidance for the proposed algorithm to reach statistical power equivalent to that of CR and report the number of patients on the inferior treatment for general situations.

4. Conclusion

The proposed procedure is built on MDPs for RADs, and the information on treatment effects collected during a trial is summarized using the Bayesian method. This design is shown to have ethical advantages and good characteristics in retaining the statistical power of hypothesis testing and controlling the probability of a trial going in the undesired direction. The proposed algorithm, based on Thompson sampling and the span contraction operator, works iteratively to approximate the optimal allocation for both binary and continuous responses. Moreover, it is easy to implement in practice. Simulation studies demonstrate that with a sample size of 200 for binary responses, the probability of a trial going in the unfavorable direction is < 1.5% when detecting a difference of 0.2 in success probability, and it decreases to < 0.9% when the difference to detect is 0.3. For normally distributed responses, with a sample size of 100, the proposed method assigns 13% more patients to the better treatment than traditional CR when detecting an effect size of 0.8, with good statistical power, while the probability of a trial going in the undesired direction is < 0.7%. The proposed method is recommended for detecting large differences or effect sizes.

Supplemental Material

Supplemental material, sj-pdf-1-smm-10.1177_09622802241293750 for Approximation to the optimal allocation for response adaptive designs by Yanqing Yi and Xikui Wang in Statistical Methods in Medical Research

Footnotes

Acknowledgements

Both authors acknowledge research support from the Natural Sciences and Engineering Research Council (NSERC) of Canada.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

		MDP-TS ($γ = 0.25$)			MDP-TS ($γ = 0.1$)			CR
$θ_{A}$	$θ_{B}$	$E(N_{A}/n)$ (s.d.)	$P(N_{A}/n < 0.5)$	Power	$E(N_{A}/n)$ (s.d.)	$P(N_{A}/n < 0.5)$	Power	Power
0.4	0.4	0.503 (0.138)	0.495	0.050	0.503 (0.219)	0.497	0.050	0.05
0.5	0.4	0.632 (0.106)	0.125	0.372	0.702 (0.168)	0.131	0.306	0.414
0.6	0.4	0.700 (0.060)	0.012	0.842	0.810 (0.090)	0.015	0.728	0.893
0.7	0.4	0.727 (0.039)	$4 \times 10^{-4}$	0.991	0.856 (0.048)	$7 \times 10^{-4}$	0.952	0.998
0.6	0.6	0.503 (0.137)	0.496	0.049	0.503 (0.221)	0.497	0.050	0.05
0.7	0.6	0.635 (0.103)	0.116	0.395	0.708 (0.166)	0.123	0.328	0.439
0.8	0.6	0.706 (0.055)	0.007	0.896	0.822 (0.080)	0.009	0.792	0.935
0.9	0.6	0.732 (0.036)	$4.3 \times 10^{-5}$	0.999	0.867 (0.039)	$1.1 \times 10^{-4}$	0.982	1.00
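
As a rough check on the CR column of the table above, the power of complete randomization can be approximated in closed form. The sketch below assumes 100 patients per arm (200 in total), a one-sided Wald test of $θ_{A} > θ_{B}$ at the 5% level with unpooled variance, and a normal approximation. These assumptions are ours and are not stated in this section, and the helper name `cr_power` is hypothetical. Under them, the computed values appear to agree with the tabulated CR powers to about three decimals.

```python
from math import sqrt
from statistics import NormalDist


def cr_power(theta_a, theta_b, n_per_arm=100, alpha=0.05):
    """Approximate power of a one-sided Wald test of theta_a > theta_b under equal allocation."""
    # Unpooled standard error of the difference in sample proportions.
    se = sqrt(theta_a * (1 - theta_a) / n_per_arm + theta_b * (1 - theta_b) / n_per_arm)
    # Critical value of the one-sided test at level alpha (assumed, not from the paper).
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    # Probability that the standardized difference exceeds the critical value.
    return NormalDist().cdf((theta_a - theta_b) / se - z_alpha)


for theta_a, theta_b in [(0.4, 0.4), (0.5, 0.4), (0.6, 0.4), (0.7, 0.4)]:
    print(theta_a, theta_b, round(cr_power(theta_a, theta_b), 3))
```

The same function can be evaluated at the remaining rows, for example `cr_power(0.8, 0.6)`, to compare against the corresponding CR entry.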