Abstract
The aim of this paper is to analyze the impact of response-adaptive randomization rules for normal response trials intended to test the superiority of one of two available treatments. Taking into account the classical Wald test, we show how response-adaptive methodology could induce a consistent loss of inferential precision. Then, we suggest a modified version of the Wald test which, by using the current allocation proportion to the treatments as a consistent estimator of the target, avoids some degenerate scenarios and so it should be preferable to the classical test. Furthermore, we show both analytically and via simulations how some target allocations may induce a locally decreasing power function. Thus, we derive the conditions on the target guaranteeing its monotonicity and we show how a correct choice of the initial sample size allows one to overcome this drawback regardless of the adopted target.
1 Introduction
Adaptive experiments are sequential procedures where the decision about how to proceed next is made according to a pre-established rule that makes use of the information accrued along the way. Even if their use remains controversial due to some inferential problems that could arise,1,2 adaptive designs are widely used in different experimental fields and they are nowadays considered a panacea for the ethical issues posed by randomized clinical trials. This is especially true for phase III trials, where patients are enrolled step-by-step and are assigned to one of two or more available treatments to be compared. In this context, randomization is regarded as a must and, when it is combined with the adaptive nature of the experiment, the treatments are assigned to the next unit by allocation probabilities that make use of the past information. However, the updating process cannot take place in a haphazard manner, which could undermine the validity and integrity of the ensuing statistical analysis. Thus, the design of these experiments requires special care and it is not surprising that statistical research on this topic has become very popular over the past two decades, also due to the strong encouragement from US Government agencies and health authorities.3,4
Due to the peculiarity of the clinical context, there are often several competing goals related to the ethical demand of maximizing the subjects' care and to the statistical aim of drawing correct inferential conclusions with high precision. By formalizing these goals into suitable optimization problems, several authors provided target allocations of the treatments that could represent a valid trade-off between ethics and inference.5–12 In general, these targets depend on the unknown model parameters and they can be approached asymptotically by using suitable response-adaptive (RA) randomization procedures, such as the doubly adaptive biased coin design 13 and the efficient randomized-adaptive design (ERADE), 14 converging to them.
RA designs are a class of sequential allocation rules where the probabilities of treatment assignments change at each step on the basis of earlier responses and past allocations. Starting from an initial sample of observations on each treatment (usually based on restricted randomization) to derive a non-trivial estimation, at each step these designs estimate the unknown parameters as well as the target and then force the next allocation to converge to the target.
Under these procedures the resulting statistical analysis requires refined tools able to allow for the complex dependence structure, since (i) the assignments are a stochastic process, making the resulting responses dependent, and (ii) inference must be unconditional on the design, because the allocations are themselves informative on the parameters of the model.15,16 Although the asymptotic properties of both (i) the usual maximum likelihood estimators (MLEs) and (ii) the allocation process are well-established, the large majority of the literature9,13,17–25 is focused on the implications of the RA methodology in terms of estimation of the treatment effects, while little attention is devoted to hypotheses testing,7,12,26–28 almost exclusively for binary data.
The aim of this paper is to analyze the impact of RA designs on hypothesis testing in the case of normal response trials for checking the superiority of one of two available treatments. Taking into account the classical Wald test, we first show how the RA methodology could induce an anomalous behavior of the power function. Then, we suggest a modified version of the Wald test which, by using the current allocation proportion to the treatments as a consistent estimator of the target, avoids some degenerate scenarios and so it should be preferable to the classical test. Furthermore, we show both analytically and via simulations how some target allocations may induce an additional anomalous behavior of the power function, which could be locally decreasing. Thus, we derive the conditions on the target guaranteeing the monotonicity of the ensuing power, showing also how a correct choice of the initial sample size allows one to overcome this drawback regardless of the adopted target.
The paper is structured as follows. Starting from the notation and some preliminaries in Section 2, Sections 3 and 4 deal with the asymptotic power of the Wald-type Z-tests under RA randomization procedures, highlighting their drawbacks. Section 5 describes some practical implications via a simulation study, while Section 6 deals with some general conclusions about the applicability of RA randomization procedures for hypothesis testing.
2 Preliminaries
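Suppose that patients come to the trial sequentially and are assigned to one of two competing treatments, say A and B. At each step i ≥ 1, let δ_i denote the allocation of the ith subject, with δ_i = 1 if the subject is assigned to A and 0 otherwise, and let Y_i be the corresponding outcome, which is assumed to be normally distributed with mean μ_A or μ_B, according to the assigned treatment, and common variance σ^2. In what follows, μ = μ_A − μ_B denotes the difference between the treatment effects.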
Several proposals have been made in the literature in order to derive suitable target allocations (ρ; 1 − ρ) to A and B, respectively (either as finite sample allocations or as asymptotic proportions to be approximated in a large sample set-up) that achieve a good trade-off between ethical concerns and inferential precision. One of the main proposals consists in formalizing these objectives into a combined/constrained optimization problem and finding the targets that are optimal with respect to the chosen approach (see Chapter 5 of Baldi Antognini and Giovagnoli 16 and the paper by Biswas and Bhattacharya 11 for a recent review). In general, the ensuing target depends on the unknown model parameters, i.e. ρ = ρ(μ), and assuming without loss of generality “the-larger-the-better” scenario (namely, treatment A is better than B if and only if μ_A > μ_B) it should satisfy the following conditions:
T1 ρ: ℝ → (0; 1) is a symmetric function with ρ(−x) = 1 − ρ(x), ensuring that both treatments are treated likewise;
T2 ρ(x) is increasing in x, meaning that any gain in terms of the relative superiority of a given treatment should skew the assignments by increasing its desirability;
T3 ρ(·) is twice continuously differentiable with bounded derivatives.
From T1, ρ(x) ∉ {0; 1} guarantees that the comparative experiments do not collapse into the observation of just one treatment; moreover, ρ(0) = 1/2 and therefore, due to the symmetric structure of ρ(·) around the point (0; 1/2), we could simply model the target function for x > 0. Ethical requirement T2 ensures that the superior treatment should be favored and, combined with T1, guarantees that the desirability of either treatment is the same if and only if the two treatment arms perform equally. The target ρ(·) may depend on the nuisance parameter too and in this case conditions T1–T3 should be satisfied for any given value of the nuisance. For example, for μ_A, μ_B > 0, Zhang and Rosenberger 29 suggested the target
Note the following.
Although not necessary from a mathematical perspective, an additional ethical requirement that is almost always satisfied by the targets suggested in the literature9,11,12,23,29,30 is
T4 lim_{x → ∞} ρ(x) = 1, namely the target function has to approach 1 as A performs infinitely better than B (analogously, from T1, lim_{x → −∞} ρ(x) = 0).
In such a case, the behavior of the target could be represented by the cumulative distribution function (cdf) of a continuous symmetric random variable centered at 0 with support ℝ, like e.g. the normal target11,23,30
Alternatively, by using the symmetric property T1, ρ(x) could be modeled for x ∈ ℝ+ as a suitably re-scaled cdf of a positive random variable, like e.g. the exponential target
Figure 1. Target functions ρ_N, ρ_C, ρ_L and ρ_E (with T = 1), ρ_R and ρ_Z (with μ_B = 1).
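Conditions T1, T2 and T4 are easy to check numerically for a candidate target. The sketch below assumes, purely for illustration, that the normal, Cauchy and logistic targets are the cdfs of the corresponding symmetric distributions with scale T, and that the exponential target is a re-scaled exponential cdf for x ≥ 0 extended to x < 0 through T1. These closed forms and all function names are assumptions made for this example; they are not taken from the displayed equations.

```python
import math

def rho_normal(x, T=1.0):
    """Assumed normal target: standard normal cdf evaluated at x / T."""
    return 0.5 * math.erfc(-x / (T * math.sqrt(2.0)))

def rho_cauchy(x, T=1.0):
    """Assumed Cauchy target: Cauchy cdf with scale T."""
    return 0.5 + math.atan(x / T) / math.pi

def rho_logistic(x, T=1.0):
    """Assumed logistic target: logistic cdf with scale T."""
    return 1.0 / (1.0 + math.exp(-x / T))

def rho_exponential(x, T=1.0):
    """Assumed exponential target: re-scaled exponential cdf, mirrored via T1."""
    if x >= 0:
        return 1.0 - 0.5 * math.exp(-x / T)
    return 0.5 * math.exp(x / T)

def check_conditions(rho, grid, tol=1e-12):
    """Check T1 (symmetry) and T2 (strict monotonicity) on a finite grid."""
    t1 = all(abs(rho(-x) - (1.0 - rho(x))) < tol for x in grid)
    values = [rho(x) for x in sorted(grid)]
    t2 = all(a < b for a, b in zip(values, values[1:]))
    return t1, t2

targets = (rho_normal, rho_cauchy, rho_logistic, rho_exponential)
grid = [i / 10.0 for i in range(-50, 51)]
results = {f.__name__: check_conditions(f, grid) for f in targets}

# T4 is a limiting statement; evaluating far in the tail only gives an indication.
far_tail = {f.__name__: 1.0 - f(50.0) for f in targets}
```

A grid check of this kind can only falsify a condition, not prove it; T3 (smoothness) is not examined here.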
Remark 1
Every above-mentioned target (namely, ρ_N(·), ρ_C(·), ρ_L(·), ρ_Z(·), ρ_R(·) and ρ_E(·)) satisfies T4. However, this condition could be relaxed by assuming a re-scaled target function
3 The Wald-type Z-tests under RA randomization procedures
3.1 RA designs and asymptotic inference
Several RA designs have been suggested in the literature with the aim of converging to a desired target ρ(μ) depending on the unknown model parameters. After the starting sample of n0 observations assigned to each treatment, at each step n > 2n0 these designs estimate the difference μ between the treatment effects by
An example is the ERADE 14 defined by
In general, even if the MLEs coincide with those of the non-sequential setting, their distribution under RA designs is not the same as when the observations are independent and identically distributed (i.i.d.), due to the dependence structure induced by the adaptation process. However, given a target ρ(μ) satisfying T1–T3, consistency and asymptotic normality of the MLEs are ensured provided that the RA design is chosen such that lim_{n → ∞} π_n = ρ(μ) almost surely. 16 Indeed, as n tends to infinity
Under the same hypotheses, we suggest an alternative version of the Wald test which can be constructed by using π_n, instead of
Remark 2
When the common variance σ^2 is unknown, it can be estimated at each step n by the usual pooled sample variance
3.2 Asymptotic approximation, target function and starting sample size
The performances of the Wald tests described above, as well as the quality of the CLT approximation of the power function in (11), are strictly related to the chosen target. Test W_n is based on the asymptotic approximation
As an example, Figure 2 shows the behavior of the Wald test W_n in a simulated trial where the chosen target functions are ρ_N in (4), ρ_L in (6) and ρ_E in (7) (colors are yellow, orange and red for T = 0.5, 1 and 2, respectively) and ρ_R in (3) (with μ_B = 1), where the ERADE is employed with γ = 0.5. The results come from 5000 simulations with sample sizes n = 75, 150, 250 and a starting sample of n0 = 2 observations on each treatment, where the responses are generated following a Gaussian distribution with σ^2 = 1, μ_B = 1 and μ_A = μ_B + k, where k ≥ 0. Under targets ρ_N, ρ_L and ρ_E, the ensuing power tends to be quite poor: it is not monotonically increasing in μ and tends to zero as μ grows (under ρ_L and ρ_E with T = 2 the non-monotonicity is not noticeable in the plots, but the power function is still decreasing for larger values of μ). Moreover, this anomalous behavior is accentuated as the ethical component of the target grows (i.e. small values of T), since in such a case the ethical skew tends to assign all subjects to the superior treatment even for small values of μ. Therefore, the consistency and asymptotic normality of the MLEs are strongly compromised, as well as the quality of the approximation of power (11). This is particularly true for small sample sizes, where the type I errors also become slightly inflated, as shown in Table 1. This is due to the fact that, given the RA nature of the procedure, when the sample size is small and the chosen target is characterized by a strong ethical impact, π_n tends to be slightly more unstable as an estimator of ρ(μ).
Figure 2. Power of the Wald test W_n under ρ_N, ρ_L and ρ_E (T = 0.5, 1 and 2) and ρ_R (μ_B = 1) with n = 75, 150, 250 and starting sample size n0 = 2.
Table 1. Type I errors of the classical Wald test W_n and of the modified Wald test.
By contrast, the choice of ρ_R (which satisfies T4, but with a lower ethical improvement than ρ_N, ρ_L and ρ_E) always guarantees a suitable behavior of the power of the test, which goes to one as μ grows, while also preserving a correct type I error (see Table 1).
As can be easily seen from the power function in (11), a crucial condition for the applicability of the Wald test W_n is that the chosen target ρ should satisfy lim_{x → ∞} x^2[1 − ρ(x)] = ∞. This condition characterizes the ethical improvement of the target and prescribes that 1 − ρ should tend to zero more slowly than x^{−2}, in order to avoid the degenerate scenarios discussed previously. For instance, as also shown in Figure 2, under ρ_R we have lim_{x → ∞} x^2[1 − ρ_R(x)] = ∞ (which holds for ρ_C and ρ_Z too), while under ρ_N (and also for ρ_E and ρ_L) this limit is zero.
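The tail condition can be explored numerically by tabulating x^2[1 − ρ(x)] along an increasing sequence of x. The sketch below does this for three targets whose closed forms are assumed here for illustration (normal, logistic and Cauchy cdfs with scale T); the function names are hypothetical and the upper tails are coded directly to avoid cancellation in 1 − ρ(x).

```python
import math

def tail_normal(x, T=1.0):
    """1 - rho(x) for the assumed normal target rho(x) = Phi(x / T)."""
    return 0.5 * math.erfc(x / (T * math.sqrt(2.0)))

def tail_logistic(x, T=1.0):
    """1 - rho(x) for the assumed logistic target."""
    return 1.0 / (1.0 + math.exp(x / T))

def tail_cauchy(x, T=1.0):
    """1 - rho(x) for the assumed Cauchy target, valid for x > 0."""
    return math.atan(T / x) / math.pi

def scaled_tail(tail, x):
    """The quantity x^2 [1 - rho(x)] whose limit drives the condition."""
    return x * x * tail(x)

xs = [5.0, 10.0, 20.0, 40.0]
table = {
    name: [scaled_tail(tail, x) for x in xs]
    for name, tail in (("normal", tail_normal),
                       ("logistic", tail_logistic),
                       ("cauchy", tail_cauchy))
}
```

Under these assumed forms the Cauchy row grows roughly linearly in x, whereas the normal and logistic rows shrink quickly, which is the qualitative contrast described in the text.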
Now taking into account
Thus, to take into account the initial samples, a more suitable asymptotic approximation of the allocation proportion satisfying condition (15) is π_n ≈ ρ(μ)(1 − 2τ_n) + τ_n, then the resulting power function of
Figure 3 shows the performance of the modified Wald test.
If compared with the classical Wald test,
Clearly, from power functions (11) and (16), the performance of the classical Wald test W_n (for any chosen starting sample size n0) is substantially the same as that of
However, does every choice of the target guarantee suitable properties of the power function? And how strong should the ethical skew be in order to avoid an anomalous behavior of the power?
4 Properties of the power function
Assuming without loss of generality σ = 1, if we let for any x > 0
Moreover, for any fixed n (sufficiently large for the CLT approximation), additional fundamental requirements are:
C1 the power should reach 1 as μ tends to infinity;
C2 the power should be increasing in μ.
Provided that τ_n ≠ 0, condition C1 is always satisfied since lim_{x → ∞} g_n(x) = ∞. If instead τ_n = 0 (i.e. n0 = 0), C1 is fulfilled only when lim_{x → ∞} x^2[1 − ρ(x)] = ∞, namely for targets with a low ethical improvement (as discussed in Section 3 for W_n).
Condition C2 means that, for n sufficiently large (and for any fixed n0), g_n(x) should be increasing in x. This crucial property is not generally guaranteed for any chosen target allocation (even if n0 ≠ 0), as the following theorem shows.
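Conditions C1 and C2 can be inspected on a grid once a working form of g_n is fixed. The sketch below assumes σ = 1, τ_n = n0/n, the approximation π_n ≈ ρ(x)(1 − 2τ_n) + τ_n quoted in Section 3.2, and takes g_n(x) = x^2 π_n(1 − π_n) with approximate one-sided power Φ(√(n g_n(x)) − z_{1−α}). These expressions, the normal target ρ(x) = Φ(x/T) and all function names are assumptions made for illustration; they are not copied from the displayed equations.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()

def tail_normal_target(x, T=1.0):
    """1 - rho(x) for the assumed normal target rho(x) = Phi(x / T)."""
    return 0.5 * math.erfc(x / (T * math.sqrt(2.0)))

def g_n(x, n, n0, tail=tail_normal_target):
    """Assumed g_n(x) = x^2 * pi * (1 - pi), pi = rho(x)(1 - 2 tau) + tau."""
    tau = n0 / n
    # 1 - pi written through the upper tail to keep precision when rho is near 1.
    one_minus_pi = tail(x) * (1.0 - 2.0 * tau) + tau
    return x * x * (1.0 - one_minus_pi) * one_minus_pi

def approx_power(x, n, n0, tail=tail_normal_target, alpha=0.05):
    """Assumed CLT approximation of the one-sided power at treatment difference x."""
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    return STD_NORMAL.cdf(math.sqrt(n * g_n(x, n, n0, tail)) - z)

def is_increasing(values):
    """True when the sequence is strictly increasing."""
    return all(a < b for a, b in zip(values, values[1:]))

grid = [i / 10.0 for i in range(0, 81)]
g_without_start = [g_n(x, 250, 0) for x in grid]   # tau_n = 0
g_with_start = [g_n(x, 250, 8) for x in grid]      # n0 = 8 on each arm
```

Since the approximate power is an increasing transformation of g_n, checking `is_increasing` on the g_n values is equivalent to checking C2 on the grid, and it avoids spurious ties when the power saturates near one.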
Theorem 1
A target ρ induces a monotonically increasing power function of Wald test
Proof
See the Appendix.□
Example 1
Taking into account ρ_Z in (2), condition (18) becomes
To obtain suitable classes of targets satisfying Theorem 1, it could be useful to take into account the hazard function (widely used in the survival analysis literature) associated with a given target ρ, by letting
Corollary 1
Given a target allocation ρ, if the corresponding hazard satisfies the condition lim_{x → ∞} x h_ρ(x) = K > 2, then the power of test
Proof
See the Appendix.□
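The quantity x h_ρ(x) appearing in Corollary 1 can be tabulated for specific targets. The sketch below assumes the usual definition of the hazard of a cdf, h_ρ(x) = ρ'(x)/[1 − ρ(x)], and uses assumed closed forms for the Cauchy, logistic and exponential targets with scale T; both the definition and the target formulas are assumptions for this example, and the function names are hypothetical.

```python
import math

def hazard_cauchy(x, T=1.0):
    """Hazard of the assumed Cauchy target, for x > 0."""
    density = T / (math.pi * (T * T + x * x))
    tail = math.atan(T / x) / math.pi
    return density / tail

def hazard_logistic(x, T=1.0):
    """Hazard of the assumed logistic target; rho' / (1 - rho) reduces to rho / T."""
    return 1.0 / (T * (1.0 + math.exp(-x / T)))

def hazard_exponential(x, T=1.0):
    """Hazard of the assumed exponential target 1 - exp(-x / T) / 2, constant in x."""
    return 1.0 / T

xs = [10.0, 100.0, 1000.0]
x_times_hazard = {
    name: [x * h(x) for x in xs]
    for name, h in (("cauchy", hazard_cauchy),
                    ("logistic", hazard_logistic),
                    ("exponential", hazard_exponential))
}
```

Under these assumptions x h_ρ(x) stays close to 1 for the Cauchy form, while it grows without bound for the logistic and exponential forms, so only the latter two exceed the threshold K > 2.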
Example 2
From Corollary 1, every target with constant or monotonically increasing hazard leads to an asymptotic power which is not monotonically increasing. For instance, the exponential target ρ_E has a constant hazard
An additional characterization of target functions inducing a locally decreasing power can be derived through differential inequalities.
Corollary 2
Given a target ρ, if there exists η > 0 such that
Proof
See the Appendix.□
Example 3
Taking into account the logistic target ρ_L with T = 1, then for any x ≥ 3
Since the power function of W_n in (11) could be regarded as a special case of (16) with τ_n = 0, all of the previous results about monotonicity also hold for the classical Wald test W_n. However, even if some target allocations do not guarantee that the ensuing power is monotonically increasing, the following result shows how
Theorem 2
For any chosen target ρ, letting
Proof
See the Appendix.□
Remark 3
Theorem 2 derives the minimum ratio
Computations of n* and
In general, for every target the (minimum) sample size requested is n ≥ 3, which is always satisfied in practice. As regards the choice of the starting samples, instead, the ensuing condition is not trivially fulfilled, especially in the large sample framework of asymptotic inference. Indeed, for a sample size of n = 250, if we choose ρ_N then n0 = 8 starting allocations on each treatment guarantee a monotonically increasing power (i.e. only the remaining 234 assignments will be allocated in the RA way).
5 A simulation study
Section 3 collects the theoretical results allowing the applicability of Wald-type Z-tests under RA designs, while in Section 4 we analyze the corresponding power function from a theoretical point of view. In particular, we show that, for certain classes of targets, the ensuing power is locally decreasing in the difference between the treatment effects and may fail to tend to one as μ grows, stressing also how a suitable choice of the starting sample overcomes this drawback.
In this section we focus on the practical implications in terms of loss of power by means of a simulation study, where the chosen RA procedure is the ERADE with γ = 0.5. The results come from 5000 simulations with sample sizes n = 75, 150 and 250, where the responses are generated following a Gaussian distribution with σ^2 = 1, μ_B = 1 and μ_A = μ_B + k, with k ≥ 0. The considered targets are ρ_N, ρ_L and ρ_E with several values of the tuning parameter T; for each scenario, several values of the initial sample size n0 are considered, starting from n0 = 1.
Figure 6. Power of the Wald test W_n under ρ_N: colors go from yellow to red as n0 increases from 1.
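A simulation of this kind can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes that the ERADE assigns the next subject to A with probability γρ̂ when π_n exceeds the estimated target ρ̂, with probability 1 − γ(1 − ρ̂) when π_n is below it, and with probability ρ̂ otherwise; that the normal target is ρ(x) = Φ(x/T); that σ is known; and that the classical and modified statistics are √n μ̂_n √(ρ̂(1 − ρ̂))/σ and √n μ̂_n √(π_n(1 − π_n))/σ, respectively. All function names are hypothetical, and the number of replications is reduced to keep the run short.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def rho_normal(x, T=1.0):
    """Assumed normal target: standard normal cdf evaluated at x / T."""
    return STD_NORMAL.cdf(x / T)

def erade_prob(pi_n, rho_hat, gamma=0.5):
    """Assumed ERADE rule: probability of assigning the next subject to A."""
    if pi_n > rho_hat:
        return gamma * rho_hat
    if pi_n < rho_hat:
        return 1.0 - gamma * (1.0 - rho_hat)
    return rho_hat

def simulate_trial(n, n0, mu_a, mu_b, rho, rng, sigma=1.0, gamma=0.5):
    """Simulate one trial; return the (classical, modified) Wald statistics."""
    # Starting sample: n0 responses on each arm (n0 >= 1 is required).
    sum_a = sum(rng.gauss(mu_a, sigma) for _ in range(n0))
    sum_b = sum(rng.gauss(mu_b, sigma) for _ in range(n0))
    n_a = n_b = n0
    for _ in range(n - 2 * n0):
        mu_hat = sum_a / n_a - sum_b / n_b
        pi_n = n_a / (n_a + n_b)
        if rng.random() < erade_prob(pi_n, rho(mu_hat), gamma):
            sum_a += rng.gauss(mu_a, sigma)
            n_a += 1
        else:
            sum_b += rng.gauss(mu_b, sigma)
            n_b += 1
    mu_hat = sum_a / n_a - sum_b / n_b
    pi_n = n_a / n
    rho_hat = rho(mu_hat)
    scale = math.sqrt(n) * mu_hat / sigma
    classical = scale * math.sqrt(rho_hat * (1.0 - rho_hat))
    modified = scale * math.sqrt(pi_n * (1.0 - pi_n))
    return classical, modified

def rejection_rates(k, n=75, n0=2, reps=500, alpha=0.05, seed=1):
    """Monte Carlo rejection rates (classical, modified) for mu_A = mu_B + k."""
    rng = random.Random(seed)
    z = STD_NORMAL.inv_cdf(1.0 - alpha)
    rejections = [0, 0]
    for _ in range(reps):
        classical, modified = simulate_trial(n, n0, 1.0 + k, 1.0, rho_normal, rng)
        rejections[0] += classical > z
        rejections[1] += modified > z
    return rejections[0] / reps, rejections[1] / reps
```

Calling `rejection_rates(0.0)` gives Monte Carlo estimates of the type I errors, while `rejection_rates(k)` for a grid of k > 0 traces the two empirical power curves; raising `reps` towards 5000 reduces the Monte Carlo error.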


Figure 4 shows the behavior of the power function of Wald test
As regards the starting sample size, the plots (see Figures 4 and 5) are very similar for all of the considered targets: for low values of n0 the power function is locally decreasing, while as the sample dimension increases, so should n0 in order to ensure a monotonically increasing power. Finally, note that, for small values of T, the gain in terms of power highly increases even for small increments of n0 as discussed in Remark 3.
In order to explain the inadequacy of the classical Wald test, Figure 6 shows the power of W_n when target ρ_N is employed under the same simulation scenarios described previously. As can be seen, the anomalous behavior of the power function is strongly accentuated for small sample sizes and small values of T; moreover, no choice of the starting sample size avoids the degenerate situations in which the power vanishes as μ grows (the same clearly holds for ρ_E and ρ_L).
6 Discussion
The choice of the target function plays a crucial role in RA methodology, since it incorporates ethical requirements with inferential goals. In general, targets should skew the assignments towards the best treatment and a fundamental question is how strong the ethical improvement should be to obtain a suitable trade-off between ethical aims and inferential precision.
Our paper is focused on this problem by taking into account hypothesis testing instead of the classical estimation. Even though we do not suggest a specific target, we show the inadequacy of classes of target functions for hypothesis testing in comparative clinical trials, stressing also the crucial role of the initial sample size.
Under RA randomization procedures, the classical Wald test W_n could be applied only when the desired target ρ has a low ethical skew, namely when lim_{μ → ∞} μ^2[1 − ρ(μ)] = ∞ (e.g. under re-scaled targets not satisfying T4), while
Moreover, for testing hypotheses, RA randomization procedures should not be applied without a starting sample. Indeed, for certain targets the choice n0 = 0 may induce an accentuated anomalous behavior of the power, which becomes strongly decreasing even for small differences between the treatment effects, while for high values of μ the RA rule tends to allocate every subject to the best treatment, inducing a null power. On the other hand, any choice of the starting sample size n0 ≥ 1 could be suitable if the chosen target satisfies condition (18): in this case, small values of n0 improve the ethical goals of the RA design. When instead the desired ρ induces a locally decreasing power, the starting sample size should be chosen carefully, as shown in Theorem 2, which clearly conflicts with the general suggestion n0 = 2 given by Hu et al. 14
Finally, we wish to stress that our results still hold even for an alternative hypothesis H1: μ ≠ 0, where all of the previous conclusions about the monotonicity of the power function could be interpreted in terms of monotonicity of the non-centrality parameter φ of a non-central chi-square distribution with one degree of freedom. Indeed, taking into account the classical Wald test, from the CLT in (8) under H0 the statistic
Acknowledgements
We are grateful to the referees and the associate editor for their comments and suggestions, which led to a substantially improved version of the paper.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
