Noninferiority studies with multiple reference treatments

Abstract

The increasing popularity of noninferiority trials reflects the ongoing efforts to replace existing treatments (reference treatments) with new treatments (experimental treatments) that retain a substantial fraction of the effect of the reference treatments. The adoption of any new treatment has to be vindicated by a demonstration of benefits that outweigh a possible clinically insignificant reduction in the reference treatment efficacy. Statistical methods have been developed to analyze data collected from noninferiority trials. However, these methods focus on cases with only one reference treatment. In this paper, we provide the statistical inferential procedures for situations with multiple reference treatments. The computation of the corresponding critical values for simultaneous testing of noninferiority of several new treatments to multiple reference treatments in the presence of a placebo is provided. Furthermore, for a prespecified level of test power, a technique to determine the optimal sample size before the onset of a noninferiority trial is derived. A clinical example is given to illustrate our proposed procedure.

Keywords

Noninferiority trial, multiple active controls, placebo, familywise error rate, assay sensitivity, noninferiority margin

1 Introduction

Noninferiority (NI) trials are becoming more common, especially in cancer and cardiovascular research.^1,2 They are instruments to assert the effectiveness of a new treatment compared to a reference treatment (also called a standard treatment or active control). NI trials verify that the former retains a substantial fraction of the effect of the latter. The adoption of a new treatment has to be justified by the demonstration of benefits that outweigh a possible clinically insignificant reduction in treatment efficacy.³ Typical benefits associated with the adoption of a viable alternative to a reference treatment include the alleviation of side effects, the lowering of costs, and/or the introduction of less complicated treatment regimens. For instance, Beck et al.⁴ conducted an NI trial for patients with opioid use disorder and found that slow-release oral morphine may be used as a substitute for methadone, which has side effects that influence compliance, resulting in inadequate treatment retention. Economic constraints and affordability may also be legitimate reasons for an NI trial. An excellent example is the exploration of less effective, but relatively inexpensive treatments for mother-to-child transmission of HIV in developing countries.⁵ Another justification for conducting an NI trial is the simplification of a complicated treatment regimen. For example, Burger et al.⁶ pointed out in their report of an NI study that patients suffering from colorectal cancer are relieved of the requirement for central access if they are given oral therapy instead of an infusional regimen.

Given the controversies surrounding NI trials, extreme caution is advocated for the planning and implementation of these studies. The purpose of each NI trial should be clearly identified, to avoid committing excessive tests of “me too” drugs.² A typical concern in an NI study is the specification of the NI margin, that is the maximum tolerable reduction in efficacy compared to the reference treatment. The NI margin normally represents a small proportion of the effect size of the reference treatment and its formulation depends heavily on previous clinical studies of the efficacy of the reference treatment.^7–9 As argued by Fleming et al.,⁹ with evidence provided by numerous examples, there are various reasons for a biased estimation of the effect size of the reference treatment. Hence, the task of formulating the NI margin should be done with extreme caution.

To ensure the validity of NI trials, both the US Food and Drug Administration (FDA) and the European Medicines Agency (EMA) have published technical guidelines for the design and implementation of clinical studies.^10,11 According to these two sets of guidelines, to secure validity, the design of an NI trial is required to have assay sensitivity, which refers to the ability of a trial to differentiate between an effective and an ineffective drug. To establish assay sensitivity, it is necessary to verify that the reference treatment has the expected effect of a size similar to those reported in previous placebo-controlled studies. This effect size is also useful as a yardstick for the formulation of the NI margin (e.g. 10% of the effect size of the reference treatment).

Even though the effect size of the reference treatment can be estimated from previous studies, Fleming et al.⁹ have pointed out that for various reasons the estimation may be biased. In fact, Vieta and Cruz¹² noticed that the placebo effect may not be stable over time in some depression, mania, and schizophrenia studies. Hence, it is suggested that the placebo be included in an NI trial provided that there is a lack of serious harmful consequences for patients.^11,13 In such cases, the resulting NI trial has three arms, consisting of an experimental treatment (new treatment), a reference treatment, and a placebo, denoted by E, R, and P, respectively, hereafter. For a three-arm trial, with the inclusion of the placebo, assay sensitivity can be established in a more direct fashion, increasing the validity of an NI trial.

There has been a lot of research on the statistical methods of a three-arm NI trial (see, for example, Röhmel and Pigeot¹⁴ and Kwong et al.¹⁵ and references therein). These methods can be grouped into two families. The first is the fraction method family, which formulates the NI margin as a fraction of the trial sensitivity (see, for example Pigeot et al.,¹⁶ Hasler et al.,¹⁷ and Hasler¹⁸). The second family of methods expresses the NI margin in terms of the difference of the effects of E and R (see, for example, Hida and Tango,¹⁹ Kwong et al.,¹⁵ and Stucke and Kieser²⁰). According to Stucke and Kieser,²⁰ the difference method is adopted more frequently in NI trials and hence we mainly discuss this approach in this paper.

In some NI trials, multiple new treatments are compared to a reference treatment, especially when the efficacy of different dosage levels of a new drug or different combinations of several new drugs is being examined.^21,22 A single-step testing method is given in Kwong et al.¹⁵ However, their procedure is limited to the case with only one reference treatment. For NI trials, it is not uncommon for experimental new treatments to be compared to multiple reference treatments. For instance, in response to the gradual phasing out of chlorofluorocarbon (CFC) inhalers, an NI trial was conducted to compare a hydrofluoroalkane propellant (HFA-134a) to two reference treatments for asthma (formoterol CFC aerosol spray and formoterol dry powder inhaler).²³ Another example is an NI trial that compared two experimental treatments for hypertension (40 mg azilsartan medoxomil and 80 mg azilsartan medoxomil) to two reference treatments (320 mg of valsartan and 40 mg olmesartan).²⁴ The objective of this paper is to extend the method of Kwong et al.¹⁵ to the case of multiple reference treatments while the familywise error rate (FWE) is controlled at a designated level, denoted as α.

The remainder of this paper proceeds as follows. In Section 2, we provide an overview of the procedure by Kwong et al.¹⁵ for NI trials with a single reference treatment. In Section 3, a detailed description of the testing procedures for multiple reference treatments, focusing on the popular case of two reference treatments, is presented. In addition, selected critical values of the two reference treatment case are tabulated for practical users. Then, the proposed method is demonstrated with a clinical example in Section 4. The test power and the determination of sample size are discussed in Section 5. Finally, we make concluding remarks in Section 6.

2 NI trials with one reference treatment

We first overview the statistical methods given by Kwong et al.¹⁵; their procedure is denoted by KCHW hereafter. Consider a one-way fixed effect model in an NI trial with k new treatments (E₁, … , E_k), a reference treatment (R), and a placebo (P). The primary endpoints are

X_{ij} = μ_{i} + ɛ_{ij}, i = E_{1}, \dots, E_{k}, R, P, j = 1, \dots, n_{i}

(1)

where X_ij represents the jth response on the ith treatment, μ_i is the ith treatment mean, and n_i is the sample size for the ith treatment. The error component ɛ_ij is assumed to have a $N (0, σ^{2})$ distribution, where $σ^{2}$ is the unknown common variance. To align with the notation of previous papers, without loss of generality, we assume that a higher value of μ_i represents a better treatment. Denote the total sample size by N, so that $N = n_{E_{1}} + \dots + n_{E_{k}} + n_{R} + n_{P}$. Further, ${\bar{X}}_{i}$ is the sample mean of the ith treatment and ${\hat{σ}}^{2}$ is the pooled sample variance, which is an unbiased estimator of $σ^{2}$, independent of ${\bar{X}}_{i}$.

Altogether, there are k + 1 null hypotheses to be tested, with k NI tests and one hypothesis for testing assay sensitivity. To test the null hypotheses

H_{i} : μ_{E_{i}} \leq μ_{R} - M_{2}, \quad i = 1, \dots, k

K : μ_{R} \leq μ_{P} + M_{1}

against one-sided alternatives

H_{i}' : μ_{E_{i}} > μ_{R} - M_{2}, \quad i = 1, \dots, k

K' : μ_{R} > μ_{P} + M_{1}

the corresponding test statistics are

T_{E_{i}} = \frac{{\bar{X}}_{E_{i}} - {\bar{X}}_{R} + M_{2}}{\hat{σ} \sqrt{1 / n_{E_{i}} + 1 / n_{R}}}, \quad i = 1, \dots, k

T_{P} = \frac{{\bar{X}}_{R} - {\bar{X}}_{P} - M_{1}}{\hat{σ} \sqrt{1 / n_{R} + 1 / n_{P}}}

Here, M₂ is the NI margin and M₁ is the effect size of R. These values must be clearly specified in the design protocol before the start of the clinical study. If M₁ = M₂, and k = 1, the above hypotheses will be reduced to those given in Hida and Tango.¹⁹ Nevertheless, following the FDA guidelines,¹⁰ M₂ should be much smaller than M₁ in general. For the choice of the NI margin M₂, thorough discussion can be found in Ng,²⁵ and the FDA and EMA guidelines.^10,11 The NI of a particular new treatment, say E_g, to R with assay sensitivity can be claimed if and only if both null hypotheses, H_g and K, are rejected, which implies that the three treatment means satisfy the following inequality

μ_{P} + M_{1} < μ_{R} < μ_{E_{g}} + M_{2}

Since M₂ is much less than M₁, the above inequality implies that $μ_{E_{g}}$ is larger than μ_P. That is, the experimental treatment E_g is better than the placebo.

Under H_i for i = 1,…,k and K, the k + 1 variates $T_{E_{1}}, \dots, T_{E_{k}}$ , T_P have a multivariate t-distribution with degrees of freedom $ν = N - k - 2$ and the following (k + 1) × (k + 1) correlation matrix V

V = \begin{bmatrix}
1 & ρ_{E_{1}} ρ_{E_{2}} & ρ_{E_{1}} ρ_{E_{3}} & \dots & ρ_{E_{1}} ρ_{E_{k}} & - ρ_{E_{1}} ρ_{P} \\
ρ_{E_{2}} ρ_{E_{1}} & 1 & ρ_{E_{2}} ρ_{E_{3}} & \dots & ρ_{E_{2}} ρ_{E_{k}} & - ρ_{E_{2}} ρ_{P} \\
ρ_{E_{3}} ρ_{E_{1}} & ρ_{E_{3}} ρ_{E_{2}} & 1 & \dots & ρ_{E_{3}} ρ_{E_{k}} & - ρ_{E_{3}} ρ_{P} \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
ρ_{E_{k}} ρ_{E_{1}} & ρ_{E_{k}} ρ_{E_{2}} & ρ_{E_{k}} ρ_{E_{3}} & \dots & 1 & - ρ_{E_{k}} ρ_{P} \\
- ρ_{E_{1}} ρ_{P} & - ρ_{E_{2}} ρ_{P} & - ρ_{E_{3}} ρ_{P} & \dots & - ρ_{E_{k}} ρ_{P} & 1
\end{bmatrix}

where $ρ_{i} = \sqrt{n_{i} / (n_{i} + n_{R})}$ for i = E₁, … , E_k, P. Testing the null hypotheses while simultaneously controlling the FWE at a predetermined level α in the strong sense (that is, the probability of rejecting at least one true null hypothesis is at most α whenever the set of true hypotheses is any subset of {H₁, … , H_k, K}) requires the search for a critical value d that satisfies the following equation

P \left( \bigcap_{i = 1}^{k} \{ T_{E_{i}} < d \} \cap \{ T_{P} < d \} \mid V \right) = 1 - α

(2)

where $μ_{E_{i}} = μ_{R} - M_{2}$, i = 1,…,k, and $μ_{R} = μ_{P} + M_{1}$. The computational details of the critical value in equation (2) can be found in Kwong et al.¹⁵ For the balanced case where $n_{E_{1}} = \dots = n_{E_{k}} = n_{P}$ with k ≥ 2, selected critical values are also tabulated in that paper.

3 NI trials with multiple reference treatments

The generalization of the KCHW procedure to the case with multiple reference treatments in an NI trial is discussed in this section. Instead of having only one reference treatment, assume the existence of s ≥ 1 reference treatments; then model (1) is reformulated as follows

X_{ij} = μ_{i} + ɛ_{ij}, i = E_{1}, \dots, E_{k}, R_{1}, \dots, R_{s}, P, j = 1, \dots, n_{i}

where $ɛ_{ij} \sim N (0, σ^{2})$ and the total sample size N is now $n_{E_{1}} + \dots + n_{E_{k}} + n_{R_{1}} + \dots + n_{R_{s}} + n_{P}$. With a prespecified NI margin M₂ and expected efficacy estimate M₁ for all of the reference treatments, test the null hypotheses

H_{ig} : μ_{E_{i}} \leq μ_{R_{g}} - M_{2}, \quad i = 1, \dots, k, \quad g = 1, \dots, s

K_{g} : μ_{R_{g}} \leq μ_{P} + M_{1}, \quad g = 1, \dots, s

against the one-sided alternatives

H_{ig}^{*} : μ_{E_{i}} > μ_{R_{g}} - M_{2}, \quad i = 1, \dots, k, \quad g = 1, \dots, s

K_{g}^{*} : μ_{R_{g}} > μ_{P} + M_{1}, \quad g = 1, \dots, s

and the corresponding test statistics are

T_{ig} = \frac{{\bar{X}}_{E_{i}} - {\bar{X}}_{R_{g}} + M_{2}}{\hat{σ} \sqrt{1 / n_{E_{i}} + 1 / n_{R_{g}}}}, \quad i = 1, \dots, k, \quad g = 1, \dots, s

T_{g}^{*} = \frac{{\bar{X}}_{R_{g}} - {\bar{X}}_{P} - M_{1}}{\hat{σ} \sqrt{1 / n_{R_{g}} + 1 / n_{P}}}, \quad g = 1, \dots, s

As the efficacy of all of the reference treatments compared to the placebo should be similar in practice, it is reasonable to use a single value of M₁ for simplicity. Under H_ig (i = 1,…,k; g = 1,…,s) and K_g (g = 1,…,s), the ks + s variates $T_{11}, \dots, T_{ks}, T_{1}^{*}, \dots, T_{s}^{*}$ have a multivariate t-distribution with degrees of freedom $ν = N - (k + s + 1)$ and the $(ks + s) \times (ks + s)$ correlation matrix

Σ = \begin{bmatrix}
A_{11} & A_{12} & \dots & A_{1 s} & B_{1} \\
A_{21} & A_{22} & \dots & A_{2 s} & B_{2} \\
\vdots & \vdots & & \vdots & \vdots \\
A_{s 1} & A_{s 2} & \dots & A_{ss} & B_{s} \\
B_{1}' & B_{2}' & \dots & B_{s}' & S
\end{bmatrix}

where $A_{ig}$ is a k × k matrix for i, g = 1,…,s, $B_{g}$ is a k × s matrix for g = 1,…,s, and $S$ is an s × s matrix. As each element in Σ can be expressed in terms of the sample sizes, we define $ρ_{ij} = \sqrt{n_{E_{i}} / (n_{E_{i}} + n_{R_{j}})}$ and $ρ_{ij}^{*} = \sqrt{n_{R_{j}} / (n_{E_{i}} + n_{R_{j}})}$ for i = 1,…,k and j = 1,…,s, and $γ_{i} = \sqrt{n_{R_{i}} / (n_{R_{i}} + n_{P})}$ and $γ_{i}^{*} = \sqrt{n_{P} / (n_{R_{i}} + n_{P})}$ for i = 1,…,s. It is straightforward to show that

A_{ig} = \{a_{jl}\} = \begin{cases}
1 & \text{if } i = g, \ j = l \\
ρ_{ji} ρ_{li} & \text{if } i = g, \ j \neq l \\
ρ_{ji}^{*} ρ_{jg}^{*} & \text{if } i \neq g, \ j = l \\
0 & \text{if } i \neq g, \ j \neq l
\end{cases}

B_{g} = \{b_{jl}\} = \begin{cases}
- ρ_{jg} γ_{g}^{*} & \text{if } l = g \\
0 & \text{if } l \neq g
\end{cases}

and

S = \{s_{jl}\} = \begin{cases}
1 & \text{if } j = l \\
γ_{j} γ_{l} & \text{if } j \neq l
\end{cases}
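The closed-form entries of Σ can be cross-checked numerically. The following is a minimal sketch, not the authors' subroutine: it writes every test statistic as a contrast of group means and derives the correlations from the variances 1/n of those means. The function name `correlation_matrix` and the list-based layout are assumptions made for this illustration. The statistics are ordered as in the block structure of Σ: all experimental treatments against R₁, then against R₂, and so on, followed by the assay sensitivity statistics.

```python
from math import sqrt


def correlation_matrix(n_E, n_R, n_P):
    """Correlation matrix of (T_11..T_k1, ..., T_1s..T_ks, T*_1..T*_s).

    n_E and n_R are lists of sample sizes, n_P is the placebo sample size.
    Hypothetical helper for illustration only.
    """
    k, s = len(n_E), len(n_R)
    groups = k + s + 1  # column order: E_1..E_k, R_1..R_s, P
    inv_n = [1 / n for n in n_E] + [1 / n for n in n_R] + [1 / n_P]

    contrasts = []
    for g in range(s):  # block g holds T_1g, ..., T_kg (E_i minus R_g)
        for i in range(k):
            row = [0.0] * groups
            row[i], row[k + g] = 1.0, -1.0
            contrasts.append(row)
    for g in range(s):  # assay sensitivity contrasts (R_g minus P)
        row = [0.0] * groups
        row[k + g], row[-1] = 1.0, -1.0
        contrasts.append(row)

    def cov(a, b):
        # Covariance of two contrasts of independent means, in units of sigma^2.
        return sum(x * y * v for x, y, v in zip(a, b, inv_n))

    return [[cov(a, b) / sqrt(cov(a, a) * cov(b, b)) for b in contrasts]
            for a in contrasts]


# Usage with the sample sizes of the clinical example in Section 4 (k = s = 2).
sigma = correlation_matrix(n_E=[237, 229], n_R=[234, 254], n_P=134)
for row in sigma:
    print("  ".join(f"{value: .3f}" for value in row))
```

Because the matrix is derived from contrasts rather than from the case-by-case formulas, agreement between the two is a useful check when implementing A_ig, B_g, and S directly.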

To control the FWE at level α, the critical value c for testing all of the hypotheses can be determined by solving the equation

P (⋂_{i = 1}^{k} ⋂_{g = 1}^{s} T_{ig} < c, ⋂_{g = 1}^{s} T_{g}^{*} < c | Σ) = 1 - α

(3)

where the probability is evaluated under all of the null hypotheses, i.e. $μ_{E_{i}} = μ_{R_{g}} - M_{2}$ for i = 1,…,k, g = 1,…,s, and $μ_{R_{g}} = μ_{P} + M_{1}$ for g = 1,…,s, so that the means of all of the test statistics are equal to 0. By conditioning on W = w, where $W = \hat{σ} / σ \sim \sqrt{χ_{ν}^{2} / ν}$ and $ν = N - (k + s + 1)$, the probability on the left-hand side of equation (3) can be expressed as

\int_{0}^{\infty} P \left( \bigcap_{i = 1}^{k} \bigcap_{g = 1}^{s} \left\{ \frac{{\bar{X}}_{E_{i}} - {\bar{X}}_{R_{g}} + M_{2}}{σ \sqrt{1 / n_{E_{i}} + 1 / n_{R_{g}}}} < wc \right\}, \ \bigcap_{g = 1}^{s} \left\{ \frac{{\bar{X}}_{R_{g}} - {\bar{X}}_{P} - M_{1}}{σ \sqrt{1 / n_{R_{g}} + 1 / n_{P}}} < wc \right\} \right) h (w) \, dw

where h(w) is the pdf of W. Then, conditioning on $Y_{1} = y_{1}, \dots, Y_{s} = y_{s}$, where

Y_{g} = \frac{{\bar{X}}_{R_{g}} - μ_{R_{g}}}{σ / \sqrt{n_{R_{g}}}}, \quad g = 1, \dots, s

have independent standard normal distributions, the probability in equation (3) can be rewritten as

\int_{0}^{\infty} \int_{- \infty}^{\infty} \dots \int_{- \infty}^{\infty} \left\{ \prod_{i = 1}^{k} Φ \left( \min_{g \in \{1, \dots, s\}} \frac{w c + y_{g} ρ_{ig}}{ρ_{ig}^{*}} \right) \right\} \left[ 1 - Φ \left( \max_{g \in \{1, \dots, s\}} \frac{y_{g} γ_{g}^{*} - w c}{γ_{g}} \right) \right] φ (y_{1}) \dots φ (y_{s}) \, h (w) \, d y_{1} \dots d y_{s} \, dw

where $Φ (\cdot)$ and $φ (\cdot)$ are, respectively, the cdf and pdf of a standard normal variate. As a result, the determination of the critical value c involves the evaluation of an (s + 1)-dimensional integral. The subroutine to compute c for s ≥ 2 is available online (www.stat.ncku.edu.tw/faculty_private/mjwen/MjWenTR.htm).
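As a rough independent check on equation (3), the critical value can also be approximated by simulation. The sketch below swaps the numerical integration described above for plain Monte Carlo: it draws the group means under the null configuration, forms all ks + s statistics, and takes the empirical (1 − α) quantile of their maximum. The function name `mc_critical_value`, its arguments, and the default number of draws are assumptions made for this illustration, and the result carries simulation error in roughly the second decimal place.

```python
import random
from math import sqrt


def mc_critical_value(n_E, n_R, n_P, alpha=0.05, nu=None, draws=200_000, seed=1):
    """Monte Carlo approximation of c in equation (3).

    nu=None treats the variance as known (nu = infinity); otherwise the
    statistics are divided by W = sqrt(chi2_nu / nu). Hypothetical helper.
    """
    rng = random.Random(seed)
    k, s = len(n_E), len(n_R)
    se_ni = [[sqrt(1 / n_E[i] + 1 / n_R[g]) for g in range(s)] for i in range(k)]
    se_as = [sqrt(1 / n_R[g] + 1 / n_P) for g in range(s)]

    maxima = []
    for _ in range(draws):
        # Centered group means under the null configuration, with sigma = 1.
        e = [rng.gauss(0, 1) / sqrt(n) for n in n_E]
        r = [rng.gauss(0, 1) / sqrt(n) for n in n_R]
        p = rng.gauss(0, 1) / sqrt(n_P)
        t_max = max(
            max((e[i] - r[g]) / se_ni[i][g] for i in range(k) for g in range(s)),
            max((r[g] - p) / se_as[g] for g in range(s)),
        )
        if nu is not None:
            # A chi-square(nu) variate is Gamma(shape = nu / 2, scale = 2).
            t_max /= sqrt(rng.gammavariate(nu / 2, 2) / nu)
        maxima.append(t_max)

    maxima.sort()
    return maxima[min(draws - 1, int((1 - alpha) * draws))]


# Balanced design with k = 1, s = 2 and equal group sizes, variance known.
print(round(mc_critical_value([100], [100, 100], 100), 2))
```

Simulation is slower and less precise than the (s + 1)-dimensional integral, but it needs no special-purpose quadrature and makes the joint distribution of the statistics explicit.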

For ease of illustration, let the NI trial have two experimental treatments, two reference treatments, and a placebo, i.e. k = s = 2. The test statistics $T_{11}, T_{21}, T_{12}, T_{22}, T_{1}^{*}, T_{2}^{*}$ have a multivariate t-distribution with degrees of freedom N – 5 and a 6 × 6 correlation matrix

Σ = \begin{bmatrix}
1 & ρ_{11} ρ_{21} & ρ_{11}^{*} ρ_{12}^{*} & 0 & - ρ_{11} γ_{1}^{*} & 0 \\
ρ_{11} ρ_{21} & 1 & 0 & ρ_{21}^{*} ρ_{22}^{*} & - ρ_{21} γ_{1}^{*} & 0 \\
ρ_{11}^{*} ρ_{12}^{*} & 0 & 1 & ρ_{12} ρ_{22} & 0 & - ρ_{12} γ_{2}^{*} \\
0 & ρ_{21}^{*} ρ_{22}^{*} & ρ_{12} ρ_{22} & 1 & 0 & - ρ_{22} γ_{2}^{*} \\
- ρ_{11} γ_{1}^{*} & - ρ_{21} γ_{1}^{*} & 0 & 0 & 1 & γ_{1} γ_{2} \\
0 & 0 & - ρ_{12} γ_{2}^{*} & - ρ_{22} γ_{2}^{*} & γ_{1} γ_{2} & 1
\end{bmatrix}

where $ρ_{ij} = \sqrt{n_{E_{i}} / (n_{E_{i}} + n_{R_{j}})}$ and $ρ_{ij}^{*} = \sqrt{n_{R_{j}} / (n_{E_{i}} + n_{R_{j}})}$ for i = 1, 2; j = 1, 2, and $γ_{i} = \sqrt{n_{R_{i}} / (n_{R_{i}} + n_{P})}$ and $γ_{i}^{*} = \sqrt{n_{P} / (n_{R_{i}} + n_{P})}$ for i = 1, 2.

For NI trials with multiple reference treatments, the case s = 2 is rather common. Hence, to help practitioners conduct the relevant testing procedures, selected critical values for this popular case (s = 2) are tabulated for the balanced design where $n_{E_{1}} = \dots = n_{E_{k}} = n_{P}$ (denoted by n) and $n_{R_{1}} = n_{R_{2}}$ (denoted by n₀). In Table 1, critical values for $ρ^{*} = 0.1, 0.2, \dots, 0.5$ with s = 2 and k = 1,…,8 are given, where $ρ^{*} = n / (n + n_{0})$. For other unbalanced designs of NI trials with two reference treatments, if the sample size configuration is close to the balanced case, one could either use the numerical method outlined in this paper or approximate the exact critical value by using Table 1 with linear interpolation, where the average sample sizes $\bar{n} = (n_{E_{1}} + \dots + n_{E_{k}} + n_{P}) / (k + 1)$ and ${\bar{n}}_{0} = (n_{R_{1}} + n_{R_{2}}) / 2$ are used to replace n and n₀, respectively.

Table 1.

Critical values c (where $n_{E_{1}} = \dots = n_{E_{k}} = n_{P}$ and $n_{R_{1}} = n_{R_{2}}$ ).

			k
α	$ρ^{*}$	ν	1	2	3	4	5	6	7	8
0.05	0.1	20	2.2437	2.4366	2.5692	2.6699	2.7509	2.8186	2.8767	2.9275
		30	2.1950	2.3775	2.5023	2.5967	2.6725	2.7357	2.7897	2.8370
		40	2.1714	2.3488	2.4698	2.5612	2.6344	2.6954	2.7476	2.7931
		50	2.1574	2.3319	2.4506	2.5402	2.6119	2.6716	2.7226	2.7671
		100	2.1299	2.2986	2.4129	2.4990	2.5678	2.6249	2.6737	2.7162
		∞	2.1030	2.2660	2.3762	2.4588	2.5247	2.5793	2.6259	2.6664
	0.2	20	2.2961	2.4848	2.6136	2.7112	2.7895	2.8549	2.9108	2.9597
		30	2.2448	2.4231	2.5444	2.6360	2.7093	2.7704	2.8227	2.8683
		40	2.2199	2.3932	2.5108	2.5995	2.6704	2.7295	2.7799	2.8239
		50	2.2052	2.3755	2.4909	2.5779	2.6474	2.7052	2.7546	2.7977
		100	2.1762	2.3407	2.4519	2.5355	2.6023	2.6577	2.7050	2.7462
		∞	2.1479	2.3068	2.4139	2.4942	2.5582	2.6113	2.6565	2.6958
	0.3	20	2.3297	2.5128	2.6369	2.7304	2.8053	2.8676	2.9209	2.9674
		30	2.2764	2.4494	2.5663	2.6543	2.7246	2.7830	2.8329	2.8765
		40	2.2505	2.4187	2.5321	2.6173	2.6854	2.7420	2.7902	2.8323
		50	2.2352	2.4005	2.5119	2.5955	2.6623	2.7177	2.7650	2.8062
		100	2.2051	2.3648	2.4722	2.5527	2.6168	2.6701	2.7155	2.7550
		∞	2.1758	2.3300	2.4335	2.5109	2.5725	2.6236	2.6671	2.7049
	0.4	20	2.3534	2.5296	2.6477	2.7363	2.8069	2.8655	2.9155	2.9591
		30	2.2986	2.4651	2.5765	2.6600	2.7264	2.7816	2.8286	2.8695
		40	2.2719	2.4338	2.5420	2.6229	2.6874	2.7408	2.7864	2.8261
		50	2.2561	2.4153	2.5216	2.6011	2.6644	2.7168	2.7615	2.8004
		100	2.2252	2.3790	2.4816	2.5582	2.6191	2.6696	2.7126	2.7499
		∞	2.1950	2.3436	2.4425	2.5163	2.5749	2.6235	2.6648	2.7007
	0.5	20	2.3710	2.5384	2.6491	2.7315	2.7969	2.8511	2.8972	2.9373
		30	2.3147	2.4732	2.5778	2.6556	2.7174	2.7685	2.8119	2.8497
		40	2.2874	2.4415	2.5432	2.6188	2.6788	2.7284	2.7706	2.8073
		50	2.2713	2.4228	2.5228	2.5971	2.6560	2.7047	2.7462	2.7822
		100	2.2396	2.3861	2.4827	2.5544	2.6113	2.6583	2.6982	2.7330
		∞	2.2087	2.3503	2.4435	2.5128	2.5676	2.6129	2.6514	2.6849
0.01	0.1	20	3.0136	3.1930	3.3183	3.4145	3.4925	3.5582	3.6147	3.6644
		30	2.9078	3.0710	3.1844	3.2711	3.3412	3.3999	3.4505	3.4948
		40	2.8573	3.0129	3.1206	3.2028	3.2692	3.3247	3.3725	3.4143
		50	2.8276	2.9788	3.0833	3.1629	3.2271	3.2807	3.3268	3.3672
		100	2.7699	2.9126	3.0109	3.0855	3.1455	3.1956	3.2385	3.2761
		∞	2.7142	2.8490	2.9413	3.0112	3.0672	3.1139	3.1538	3.1887
	0.2	20	3.0622	3.2392	3.3623	3.4567	3.5331	3.5972	3.6524	3.7008
		30	2.9522	3.1131	3.2245	3.3096	3.3783	3.4359	3.4854	3.5288
		40	2.8996	3.0530	3.1589	3.2396	3.3047	3.3592	3.4060	3.4469
		50	2.8688	3.0177	3.1205	3.1987	3.2617	3.3143	3.3595	3.3991
		100	2.8089	2.9494	3.0460	3.1193	3.1782	3.2274	3.2696	3.3064
		∞	2.7511	2.8837	2.9744	3.0430	3.0981	3.1440	3.1832	3.2175
	0.3	20	3.0912	3.2648	3.3849	3.4765	3.5506	3.6125	3.6658	3.7125
		30	2.9782	3.1361	3.2450	3.3278	3.3946	3.4505	3.4984	3.5404
		40	2.9243	3.0748	3.1783	3.2570	3.3204	3.3734	3.4188	3.4585
		50	2.8926	3.0389	3.1393	3.2156	3.2770	3.3283	3.3722	3.4106
		100	2.8312	2.9692	3.0637	3.1353	3.1928	3.2408	3.2819	3.3178
		∞	2.7720	2.9021	2.9910	3.0581	3.1120	3.1569	3.1952	3.2287
	0.4	20	3.1104	3.2795	3.3954	3.4834	3.5542	3.6133	3.6640	3.7084
		30	2.9952	3.1491	3.2544	3.3343	3.3984	3.4520	3.4978	3.5380
		40	2.9402	3.0870	3.1873	3.2633	3.3243	3.3752	3.4188	3.4568
		50	2.9080	3.0506	3.1480	3.2217	3.2809	3.3302	3.3725	3.4094
		100	2.8454	2.9801	3.0718	3.1412	3.1968	3.2431	3.2828	3.3174
		∞	2.7851	2.9122	2.9986	3.0638	3.1160	3.1595	3.1966	3.2290
	0.5	20	3.1237	3.2865	3.3968	3.4799	3.5465	3.6020	3.6494	3.6908
		30	3.0067	3.1552	3.2557	3.3315	3.3921	3.4426	3.4857	3.5234
		40	2.9509	3.0927	3.1886	3.2608	3.3186	3.3667	3.4078	3.4436
		50	2.9182	3.0560	3.1492	3.2194	3.2756	3.3223	3.3622	3.3970
		100	2.8547	2.9850	3.0730	3.1392	3.1922	3.2362	3.2738	3.3066
		∞	2.7935	2.9167	2.9998	3.0622	3.1121	3.1535	3.1889	3.2197
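The table lookup with linear interpolation on $ρ^{*}$ can be sketched as follows. The grid below copies the Table 1 entries for α = 0.05, ν = ∞, and k = 2. The helper names `interpolate_c` and `rho_star_from_sizes`, and the value $ρ^{*} = 0.45$ used in the usage line, are assumptions chosen for illustration.

```python
from bisect import bisect_right

# Table 1 entries for alpha = 0.05, nu = infinity, k = 2 (one value per rho*).
RHO_GRID = [0.1, 0.2, 0.3, 0.4, 0.5]
C_GRID = [2.2660, 2.3068, 2.3300, 2.3436, 2.3503]


def rho_star_from_sizes(n_E, n_R, n_P):
    """rho* = n_bar / (n_bar + n0_bar), using the averaged sample sizes."""
    n_bar = (sum(n_E) + n_P) / (len(n_E) + 1)
    n0_bar = sum(n_R) / len(n_R)
    return n_bar / (n_bar + n0_bar)


def interpolate_c(rho_star, rho_grid=RHO_GRID, c_grid=C_GRID):
    """Linear interpolation of the tabulated critical value on rho*."""
    if not rho_grid[0] <= rho_star <= rho_grid[-1]:
        raise ValueError("rho_star lies outside the tabulated range")
    # Index of the right-hand neighbour, clamped so rho* = 0.5 stays in range.
    j = min(bisect_right(rho_grid, rho_star), len(rho_grid) - 1)
    x0, x1 = rho_grid[j - 1], rho_grid[j]
    weight = (rho_star - x0) / (x1 - x0)
    return (1 - weight) * c_grid[j - 1] + weight * c_grid[j]


# Illustrative value halfway between two tabulated columns.
print(round(interpolate_c(0.45), 5))
```

Interpolating on ν as well would need a second grid dimension. For large ν the ∞ row changes slowly, so a one-dimensional lookup is often adequate as a first approximation.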

The NI null hypothesis H_ig is rejected if in the corresponding test T_ig > c, whereas the assay sensitivity hypothesis K_g is rejected when in the corresponding test $T_{g}^{*} > c$ . As explained in Section 1, a cautious approach should be adopted for establishing NI trial conclusions. Hence, we propose a conservative approach such that the NI of a particular new treatment to a reference treatment, say E_r to R_w, with assay sensitivity would be established if and only if the NI null hypothesis H_rw and all of the assay sensitivity hypotheses K₁,…,K_s are rejected. The rejection of all of the sensitivity hypotheses provides a strong and necessary support for the integrity of the clinical study.

A less stringent approach, of course, is to declare NI of E_r to R_w when both H_rw and K_w are rejected, without the requirement for the other sensitivity hypotheses to be rejected at the same time. Nevertheless, we prefer the more conservative approach, as it strengthens our confidence in the validity of the NI trial. However, it is worth pointing out that if both H_rw and K_w are rejected, and some of the remaining sensitivity hypotheses yield insignificant results, we should scrutinize the reasons underlying the inability of the trial to claim significance for any of the sensitivity hypotheses. For instance, if one of the reference treatments has a very small sample size, the test power may be too small to declare a significant result. In such cases, further investigation of possible NI of E_r to R_w should be conducted.
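The two decision rules can be written down compactly. The sketch below is an illustration under assumed names (`ni_claims`, dictionaries keyed by treatment indices) and uses made-up statistics, not trial data. With `conservative=True` a claim for (E_r, R_w) requires H_rw and every K_g to be rejected; with `conservative=False` it requires only H_rw and K_w.

```python
def ni_claims(t_ni, t_assay, c, conservative=True):
    """Return the (i, g) pairs for which NI of E_i to R_g is declared.

    t_ni maps (i, g) to T_ig and t_assay maps g to T*_g.
    Hypothetical helper for illustration only.
    """
    rejected_K = {g for g, t in t_assay.items() if t > c}
    all_sensitive = len(rejected_K) == len(t_assay)

    claims = []
    for (i, g), t in sorted(t_ni.items()):
        if t <= c:
            continue  # H_ig is not rejected
        sensitivity_ok = all_sensitive if conservative else g in rejected_K
        if sensitivity_ok:
            claims.append((i, g))
    return claims


# Made-up statistics: assay sensitivity fails for the second reference.
t_ni = {(1, 1): 3.0, (1, 2): 1.0, (2, 1): 2.5, (2, 2): 2.6}
t_assay = {1: 4.0, 2: 2.0}
print(ni_claims(t_ni, t_assay, c=2.3, conservative=True))
print(ni_claims(t_ni, t_assay, c=2.3, conservative=False))
```

In this made-up case the conservative rule declares nothing, because K₂ is not rejected, while the less stringent rule still declares NI against the first reference only.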

The purpose of the proposed method is to identify those experimental treatments that are NI to a particular reference treatment. Once these NI treatments have been found, additional hypothesis testing could be conducted to compare their efficacy.

4 Example

One quarter of the adults in the world are affected by hypertension.²⁶ The drug class of angiotensin II receptor blockers (ARBs) has long been identified as an effective treatment for hypertension.²⁷ Two reference drugs, olmesartan and valsartan, are among the most commonly used ARBs. Another newly developed drug, azilsartan, is a potent and highly selective ARB with estimated bioavailability of 60% and an elimination half-life of 12 h.²⁴ Azilsartan is in general well tolerated and patients are likely to persist with long-term treatment as there are few adverse events associated with it.²⁸ In addition, there are concerns that olmesartan may increase cardiovascular risk.²⁹

To compare the new drug azilsartan to reference treatments, White et al.²⁴ conducted an NI trial with two new treatments (azilsartan 40 mg and azilsartan 80 mg), two reference treatments (olmesartan 40 mg and valsartan 320 mg) and a placebo. The primary efficacy endpoint is the change from baseline in 24-h mean systolic blood pressure (BP). The data from the clinical study are given in Table 2. As proposed in White et al.,²⁴ the values of the NI margin M₂ and effect size M₁ are 1.5 and 4.5 mmHg, respectively. The statistics, according to the aforementioned formula, are computed and tabulated in Table 3. Note that for hypertension studies, a lower BP implies a better treatment. Hence, to be consistent with the models in Sections 2 and 3, we simply change the signs of all of the sample means.

Table 2.

Changes from baseline in 24-h mean ambulatory systolic BP (mm Hg).

Treatment	− (Mean change from baseline)	Standard deviation	Sample size
Azilsartan 40 mg (E₁)	13.4	10.78	237
Azilsartan 80 mg (E₂)	14.5	10.59	229
Valsartan 320 mg (R₁)	10.2	10.71	234
Olmesartan 40 mg (R₂)	12.0	11.16	254
Placebo (P)	0.3	10.42	134

Table 3.

Test statistics with $M_{2} = 1.5$ and $M_{1} = 4.5$ .

Treatment contrast	− Mean difference ± SE	Test statistics
$E_{1} - R_{1}$	3.2 ± 0.99	4.735^a
$E_{2} - R_{1}$	4.3 ± 1.00	5.793^a
$E_{1} - R_{2}$	1.4 ± 0.97	2.981^a
$E_{2} - R_{2}$	2.5 ± 0.98	4.076^a
$R_{1} - P$	9.9 ± 1.17	4.628^a
$R_{2} - P$	11.7 ± 1.15	6.261^a

^a Significance at $α = 0.05$ with critical value c = 2.347.

With the given sample sizes and FWE α = 0.05, the exact value of the critical value c computed using our algorithm is 2.347. As the sample sizes are not very different, the approximation method suggested in Section 3 can be used. As $\bar{n} = (237 + 229 + 134) / 3 = 200$ and ${\bar{n}}_{0} = (234 + 254) / 2 = 244$ , then $ρ^{*} \approx 0.45$ . Based on Table 1 with $ν \approx \infty$ and a linear interpolation on $ρ^{*}$ , the approximated value of c is (2.3436 + 2.3503)/2 = 2.34695, which is almost identical to the exact value. With reference to Table 3, all of the test statistics are larger than the critical value, hence we conclude that there is evidence that both experimental treatments are noninferior to both reference treatments with the assay sensitivity established for this study. In other words, azilsartan, in both 40 and 80 mg dosages, is found to be a viable alternative to olmesartan and valsartan.
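The entries of Table 3 can be recomputed from the summary data in Table 2. The sketch below assumes that the pooled variance is the degrees-of-freedom weighted average of the five reported sample variances; the variable and function names are chosen for this illustration. Because the published means and standard deviations are rounded, the recomputed statistics agree with Table 3 only to about two decimal places.

```python
from math import sqrt

# Table 2: (sign-flipped mean change, standard deviation, sample size).
data = {
    "E1": (13.4, 10.78, 237),
    "E2": (14.5, 10.59, 229),
    "R1": (10.2, 10.71, 234),
    "R2": (12.0, 11.16, 254),
    "P": (0.3, 10.42, 134),
}
M1, M2, c = 4.5, 1.5, 2.347  # effect size, NI margin, critical value (mmHg)

# Pooled variance with nu = N - (k + s + 1) degrees of freedom.
nu = sum(n for _, _, n in data.values()) - len(data)
pooled_var = sum((n - 1) * sd ** 2 for _, sd, n in data.values()) / nu
sigma_hat = sqrt(pooled_var)


def statistic(a, b, shift):
    """(mean_a - mean_b + shift) divided by its estimated standard error."""
    (mean_a, _, n_a), (mean_b, _, n_b) = data[a], data[b]
    se = sigma_hat * sqrt(1 / n_a + 1 / n_b)
    return (mean_a - mean_b + shift) / se


# NI statistics T_ig use +M2; assay sensitivity statistics T*_g use -M1.
stats = {f"{e}-{r}": statistic(e, r, M2) for r in ("R1", "R2") for e in ("E1", "E2")}
stats.update({f"{r}-P": statistic(r, "P", -M1) for r in ("R1", "R2")})

for name, t in stats.items():
    print(f"{name}: T = {t:.3f}, reject = {t > c}")
```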

5 Sample size determination

In designing a clinical trial, it is crucial to determine the minimum number of patients required for each treatment group to achieve a certain level of test power. As explained in the earlier sections, the objective of an NI study is to identify at least one experimental treatment that is noninferior to one of the reference treatments and at the same time, to establish assay sensitivity of all of the existing reference treatments. To evaluate the sample size requirement, the modified version of the any-pair power (APP) defined in Kwong et al.¹⁵ for one reference treatment is now extended to NI trials with multiple reference treatments as follows

APP = \min_{i \in \{1, \dots, k\}, \ g \in \{1, \dots, s\}} P (\text{reject } H_{ig}, K_{1}, K_{2}, \dots, K_{s} \mid H_{ig}, K_{1}, K_{2}, \dots, K_{s} \in F)

where F denotes the set of all of the false null hypotheses.

For i = 1,2,…,k and g = 1,2,…,s, assume that an NI trial is designed to detect the differences $δ_{1} = (μ_{E_{i}} - μ_{R_{g}} + M_{2}) / σ > 0$ for any given H_ig ∈ F and $δ_{2} = (μ_{R_{g}} - μ_{P} - M_{1}) / σ > 0$ for all K_g ∈ F. In addition, σ is assumed to be known. If there is no particular reason to assign more patients to one particular new experimental treatment or reference treatment, we usually determine an optimal design configuration among balanced designs, i.e. $n_{E_{1}} = \dots = n_{E_{k}}$ (denoted by n_E) and $n_{R_{1}} = \dots = n_{R_{s}}$ (denoted by n_R). Define the ratio $n_{R} : n_{E} : n_{P} = 1 : π_{E} : π_{P}$ , such that the total sample size $N = (s + k π_{E} + π_{P}) n_{R}$ .

Without loss of generality, we assume that only H₁₁ and K₁,…,K_s are false, because under a balanced design the test statistics T_ig (i = 1,…,k; g = 1,…,s) all have the same marginal distribution. For a given design configuration (n_E, n_R, n_P), with given values of k and s, critical value c, and specified δ₁ and δ₂, the power of the test is

APP = P (T_{11} > c, T_{1}^{*} > c, \dots, T_{s}^{*} > c \mid δ_{1}, δ_{2}, n_{E}, n_{R}, n_{P})

Define $θ_{i} = \sqrt{π_{i} / (1 + π_{i})}$ and $θ_{i}^{*} = \sqrt{1 / (1 + π_{i})}$ for i = E, P. Under $H_{11}^{*}$ and $K_{1}^{*}, \dots, K_{s}^{*}$, the variates $T_{11}, T_{1}^{*}, \dots, T_{s}^{*}$ can be shown to have a multivariate normal distribution with the mean vector

μ = \left( δ_{1} \sqrt{\frac{π_{E} n_{R}}{1 + π_{E}}}, δ_{2} \sqrt{\frac{π_{P} n_{R}}{1 + π_{P}}}, \dots, δ_{2} \sqrt{\frac{π_{P} n_{R}}{1 + π_{P}}} \right)' = (δ_{1} θ_{E} \sqrt{n_{R}}, δ_{2} θ_{P} \sqrt{n_{R}}, \dots, δ_{2} θ_{P} \sqrt{n_{R}})'

and the (s + 1) × (s + 1) variance–covariance matrix

Ω = \{ω_{ij}\} = \begin{cases}
1 & \text{if } i = j \\
- θ_{E} θ_{P} & \text{if } 1 \leq i \neq j \leq 2 \\
0 & \text{if } i = 1, \ j \geq 3 \text{ or } i \geq 3, \ j = 1 \\
θ_{P}^{*} θ_{P}^{*} & \text{if } 2 \leq i \neq j \leq s + 1
\end{cases}

Then, transform $T_{11}, T_{1}^{*}, \dots, T_{s}^{*}$ as follows

T_{11} = θ_{E}^{*} Y - θ_{E} Z_{1} + δ_{1} θ_{E} \sqrt{n_{R}}

T_{g}^{*} = θ_{P} Z_{g} - θ_{P}^{*} Z_{0} + δ_{2} θ_{P} \sqrt{n_{R}}, \quad g = 1, \dots, s

where Y, Z₀, … , Z_s are independent and identically distributed standard normal variates. By conditioning on Z₀ = z₀ and Y = y, we have

P (T_{11} > c, T_{1}^{*} > c, \dots, T_{s}^{*} > c \mid δ_{1}, δ_{2}, n_{E}, n_{R}, n_{P}) = \int_{- \infty}^{\infty} \int_{- \infty}^{\infty} I \left[ Φ \left( \frac{θ_{E}^{*} y + δ_{1} θ_{E} \sqrt{n_{R}} - c}{θ_{E}} \right) - Φ \left( \frac{c + θ_{P}^{*} z_{0} - δ_{2} θ_{P} \sqrt{n_{R}}}{θ_{P}} \right) \right] \times \left[ 1 - Φ \left( \frac{c + θ_{P}^{*} z_{0} - δ_{2} θ_{P} \sqrt{n_{R}}}{θ_{P}} \right) \right]^{s - 1} φ (y) φ (z_{0}) \, dy \, d z_{0}

where I(γ) is a function such that I(γ) = γ for γ > 0, and 0 otherwise. It is also noted that the computation of the above power only involves a two-dimensional integration.
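The two-dimensional power integral is easy to evaluate on a grid. The sketch below uses a plain midpoint rule on [−8, 8]² rather than any particular quadrature from the original work. The function name `any_pair_power`, the grid settings, and the critical value c = 2.2 in the usage line are assumptions made for illustration; in practice c would be computed for the design at hand.

```python
from math import sqrt
from statistics import NormalDist

STD = NormalDist()  # standard normal distribution


def any_pair_power(delta1, delta2, n_E, n_R, n_P, s, c, half_width=8.0, step=0.05):
    """Midpoint-rule evaluation of P(T_11 > c, T*_1 > c, ..., T*_s > c)."""
    pi_E, pi_P = n_E / n_R, n_P / n_R
    th_E, th_E_star = sqrt(pi_E / (1 + pi_E)), sqrt(1 / (1 + pi_E))
    th_P, th_P_star = sqrt(pi_P / (1 + pi_P)), sqrt(1 / (1 + pi_P))
    root = sqrt(n_R)

    m = int(round(2 * half_width / step))
    grid = [-half_width + (j + 0.5) * step for j in range(m)]  # cell midpoints
    weight = [STD.pdf(x) * step for x in grid]

    # Upper limit for Z_1 (from T_11 > c) as a function of y.
    upper = [STD.cdf((th_E_star * y + delta1 * th_E * root - c) / th_E) for y in grid]
    # Lower limit for every Z_g (from T*_g > c) as a function of z_0.
    lower = [STD.cdf((c + th_P_star * z - delta2 * th_P * root) / th_P) for z in grid]

    total = 0.0
    for lo, w_z in zip(lower, weight):
        tail = (1 - lo) ** (s - 1) * w_z
        if tail == 0.0:
            continue
        inner = sum(max(up - lo, 0.0) * w_y for up, w_y in zip(upper, weight))
        total += inner * tail
    return total


# One design from Table 4 (k = 1, s = 2) with an assumed critical value.
print(round(any_pair_power(0.5, 1.0, n_E=77, n_R=77, n_P=28, s=2, c=2.2), 3))
```

A useful sanity check is the limit of a very large δ₂, where assay sensitivity is essentially certain and the power reduces to $Φ (δ_{1} θ_{E} \sqrt{n_{R}} - c)$.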

Let the optimal sample size allocation for a balanced design have a ratio $n_{R} : n_{E} : n_{P} = 1 : π_{E}^{*} : π_{P}^{*}$ and $N^{*} = (s + k π_{E}^{*} + π_{P}^{*}) n_{R}$ , where $N^{*}$ is the smallest total sample size with modified $APP \geq 1 - β$ , a specified power level. The optimal sample size configuration $(N^{*}, π_{E}^{*}, π_{P}^{*})$ can then be determined by solving the following inequality

{min}_{N} {max}_{π_{E}, π_{P}} P (T_{11} > c^{*}, T_{1}^{*} > c^{*}, \dots, T_{s}^{*} > c^{*} | δ_{1}, δ_{2}, Ω) \geq 1 - β

where $c^{*}$ is the critical value under the optimal design configuration.

As the analytical determination of the optimal design configuration $(N^{*}, π_{E}^{*}, π_{P}^{*})$ seems to be infeasible, we propose the following algorithm to search for the optimal design.

(a) Obtain the initial conservative total sample size, $N'$, under the assumption that $π_{E} = π_{P} = 1 / \sqrt{ks + s}$.

(b) For a given $N'$, search numerically for $n_{E}'$ and $n_{P}'$ within the domain $0 < n_{P} \leq n_{E} \leq n_{R} < N$ such that the maximum power, $P'$, is achieved.

(c) Set $N' = N' - 1$ and repeat step (b) if $P' > 1 - β$. Otherwise, the optimal design configuration is $(N^{*}, π_{E}^{*}, π_{P}^{*}) = (N', π_{E}', π_{P}')$.
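The three steps above can be sketched as a downward search over the total sample size. In this Python sketch the helper names `best_allocation` and `smallest_design`, the exhaustive integer search used for step (b), and the user-supplied callable `power_of(n_r, n_e, n_p)` are all assumptions for illustration; a real power function would use the critical value and θ quantities defined earlier in the paper. The total is taken as N = s·n_R + k·n_E + n_P, which follows from the definition of $N^{*}$, and the sketch returns the last total whose maximum power still meets the target, which is one reading of the stopping rule in step (c).

```python
from typing import Callable, Optional, Tuple

Design = Tuple[int, int, int]  # (n_R, n_E, n_P): per-arm sample sizes


def best_allocation(total_n: int, s: int, k: int,
                    power_of: Callable[[int, int, int], float]
                    ) -> Optional[Tuple[float, Design]]:
    """Step (b): try every integer allocation with
    s*n_R + k*n_E + n_P == total_n and 0 < n_P <= n_E <= n_R,
    and keep the one with the largest power."""
    best: Optional[Tuple[float, Design]] = None
    for n_r in range(1, total_n // s + 1):
        for n_e in range(1, n_r + 1):
            n_p = total_n - s * n_r - k * n_e
            if not 0 < n_p <= n_e:
                continue  # allocation violates the ordering constraint
            power = power_of(n_r, n_e, n_p)
            if best is None or power > best[0]:
                best = (power, (n_r, n_e, n_p))
    return best


def smallest_design(start_n: int, s: int, k: int, target: float,
                    power_of: Callable[[int, int, int], float]
                    ) -> Optional[Tuple[int, Design, float]]:
    """Steps (a) and (c): start from a conservative total start_n and
    decrease it until the best achievable power drops below target.
    Returns (N, (n_R, n_E, n_P), power), or None if start_n itself fails."""
    found: Optional[Tuple[int, Design, float]] = None
    n = start_n
    while n > 0:
        best = best_allocation(n, s, k, power_of)
        if best is None or best[0] < target:
            break  # first total that misses the target: stop searching
        found = (n, best[1], best[0])
        n -= 1
    return found
```

To use the sketch, pass a function that computes the critical value and the power for a given allocation as `power_of`, and choose `start_n` from the conservative starting value in step (a).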

Based on the above algorithm, selected sample size requirements for the optimal designs at power levels $1 - β = 0.8, 0.9$ are tabulated in Table 4 for s = 2 and $(δ_{1}, δ_{2}) = (0.5, 1), (1, 2)$. As indicated in the table, the required sample sizes for the placebo group are much smaller than those for the other treatment groups, in line with the general ethical principle of assigning fewer patients to the placebo.

Table 4. Optimal design configurations $(N, n_{R}, n_{E}, n_{P})$ with s = 2.

		$α = 0.05$		$α = 0.01$
$1 - β$	k	$δ_{1} = 0.5, δ_{2} = 1$	$δ_{1} = 1, δ_{2} = 2$	$δ_{1} = 0.5, δ_{2} = 1$	$δ_{1} = 1, δ_{2} = 2$
0.8	1	(259, 77, 77, 28)	(67, 20, 20, 7)	(362, 109, 109, 35)	(93, 28, 28, 9)
	2	(366, 86, 81, 32)	(94, 22, 21, 8)	(501, 118, 113, 39)	(128, 30, 29, 10)
	3	(469, 99, 79, 34)	(119, 25, 20, 9)	(635, 135, 108, 41)	(162, 34, 28, 10)
	4	(569, 113, 77, 35)	(145, 28, 20, 9)	(764, 151, 105, 42)	(194, 38, 27, 10)
0.9	1	(328, 99, 99, 31)	(84, 25, 25, 9)	(444, 135, 135, 39)	(114, 35, 35, 9)
	2	(462, 109, 105, 34)	(117, 27, 27, 9)	(614, 145, 141, 42)	(156, 37, 36, 10)
	3	(591, 126, 101, 36)	(150, 33, 25, 9)	(775, 168, 132, 43)	(197, 42, 34, 11)
	4	(715, 141, 99, 37)	(181, 36, 25, 9)	(934, 187, 129, 44)	(236, 46, 33, 12)
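As a worked reading of one entry of Table 4, take $1 - β = 0.8$, k = 2, $α = 0.05$ and $(δ_{1}, δ_{2}) = (0.5, 1)$, for which the table gives (366, 86, 81, 32). With s = 2 and the relation $N = (s + k π_{E} + π_{P}) n_{R}$, the total and the allocation ratios work out as follows; the two-decimal rounding of the ratios is ours.

$$
\begin{aligned}
N &= s\, n_{R} + k\, n_{E} + n_{P} = 2 (86) + 2 (81) + 32 = 366, \\
π_{E} &= n_{E} / n_{R} = 81 / 86 \approx 0.94, \qquad π_{P} = n_{P} / n_{R} = 32 / 86 \approx 0.37
\end{aligned}
$$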

6 Conclusion

The number of NI trials has increased rapidly in the past decade, and more sophisticated statistical methods are emerging to deal with various complex setups. In this paper, we develop a testing procedure for three-arm NI trials that have multiple reference treatments. Given that both researchers and government regulators continue to urge caution in the implementation of NI trials, we adopt a conservative approach that requires the NI declaration of a new treatment to be accompanied by the establishment of assay sensitivity for all of the reference treatments. The implementation of our proposed method is facilitated by the inclusion of critical values for the popular case of two reference treatments in the trial. The computation of power is also given to enable the determination of sample size before the onset of the trial. As pointed out by Kwong et al.,¹⁵ if ethical considerations militate against the allocation of many patients to a placebo, it is straightforward to incorporate this constraint into the proposed algorithm when computing the optimal sample size. This paper concentrates on developing a single-step procedure for NI trials with multiple reference treatments. A stepwise testing procedure could be explored in the future, although the framework and procedures would be far more complex. In addition, the model presented in this paper assumes that the treatments have homogeneous variances. Though more complicated, an extension to models with heterogeneous variances could be developed using the idea given in Huang et al.³⁰

Footnotes

Acknowledgements

We are grateful to the referee for many valuable suggestions.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research of Huang and Wen was supported by the Ministry of Science and Technology of Taiwan (MOST 102-2118-M-006-003-MY2). Cheung's research was funded by the Research Grants Council of the Hong Kong Special Administrative Region (CUHK14300814).

References

1. Head, Kaul, Bogers AJJC. Non-inferiority study design: lessons to be learned from cardiovascular trials. Eur Heart J 2012; 33: 1318–1324.

2. Riechelmann, Alex, Cruz. Non-inferiority cancer clinical trials: scope and purposes underlying their design. Ann Oncol 2013; 24: 1942–1947.

3. Fleming. Current issues in non-inferiority trials. Stat Med 2008; 27: 317–332.

4. Beck, Haasen, Verthein. Maintenance treatment for opioid dependence with slow-release oral morphine: a randomized cross-over, non-inferiority study versus methadone. Addiction 2014; 109: 617–626.

5. Fleming, Powers. Issues in noninferiority trials: the evidence in community-acquired pneumonia. Clin Infect Dis 2008; 47: S108–S120.

6. Burger, Beyer, Abt. Issues in the assessment of non-inferiority: perspectives drawn from case studies. Pharm Stat 2011; 10: 433–439.

7. DeMets, Friedman. Some thoughts on challenges for noninferiority study designs. Drug Inform J 2012; 46: 420–427.

8. Snapinn, Jiang. Remaining challenges in assessing non-inferiority. Ther Innovat Regulat Sci 2014; 48: 62–67.

9. Fleming, Odem-Davis, Rothmann. Some essential considerations in the design and conduct of non-inferiority trials. Clin Trials 2011; 8: 432–439.

10. United States Food and Drug Administration. Guidance for industry non-inferiority clinical trials: draft guidance. Rockville: FDA, 2010. http://www.fda.gov/downloads/Drugs/GuidanceComplianceRegulatoryInformation/Guidances/UCM202140.pdf.

11. European Medicines Agency. Guideline on the choice of the non-inferiority margin. London: EMA, 2006. http://www.ema.europa.eu/docs/en_GB/document_library/Scientific_guideline/2009/09/WC500003636.pdf.

12. Vieta, Cruz. Head to head comparisons as an alternative to placebo-controlled trials. Eur Neuropsychopharmacol 2012; 22: 800–803.

13. Temple, Ellenberg. Placebo-controlled trials and active-control trials in the evaluation of new treatments. Ann Intern Med 2000; 133: 455–463.

14. Röhmel, Pigeot. A comparison of multiple testing procedures for the gold standard non-inferiority trial. J Biopharm Stat 2010; 20: 911–926.

15. Kwong, Cheung, Hayter. Extension of three-arm non-inferiority studies to trials with multiple new treatments. Stat Med 2012; 31: 2833–2843.

16. Pigeot, Schäfer, Röhmel. Assessing non-inferiority of a new treatment in a three-arm clinical trial including a placebo. Stat Med 2003; 22: 883–899.

17. Hasler, Vonk, Hothorn. Assessing non-inferiority of a new treatment in a three-arm trial in the presence of heteroscedasticity. Stat Med 2008; 27: 490–503.

18. Hasler. Multiple comparisons to both a negative and a positive control. Pharm Stat 2012; 11: 74–81.

19. Hida, Tango. On the three-arm non-inferiority trial including a placebo with a prespecified margin. Stat Med 2011; 30: 224–231.

20. Stucke, Kieser. A general approach for sample size calculation for the three-arm ‘gold standard’ non-inferiority design. Stat Med 2012; 31: 3579–3596.

21. Donohue, Fogarty, Lötvall. Once-daily bronchodilators for chronic obstructive pulmonary disease. Am J Respir Crit Care Med 2010; 182: 155–162.

22. Sundar, Sinha, Rai. Comparison of short-course multidrug treatment with standard therapy for visceral leishmaniasis in India: an open-label, non-inferiority, randomised controlled trial. Lancet 2011; 377: 477–486.

23. Houghton, Langley, Singh. Comparison of bronchoprotective and bronchodilator effects of a single dose of formoterol delivered by hydrofluoroalkane and chlorofluorocarbon aerosols and dry powder in a double blind, placebo-controlled, crossover study. Br J Clin Pharmacol 2004; 58: 359–366.

24. White, Weber, Sica. Effects of the angiotensin receptor blocker azilsartan medoxomil versus olmesartan and valsartan on ambulatory and clinic blood pressure in patients with stages 1 and 2 hypertension. Hypertension 2011; 57: 413–420.

25. Noninferiority hypotheses and choice of noninferiority margin. Stat Med 2008; 27: 5392–5406.

26. Kearney, Whelton, Reynolds. Global burden of hypertension: analysis of worldwide data. Lancet 2005; 365: 217–223.

27. Sica, White, Weber. Comparison of the novel angiotensin II receptor blocker azilsartan medoxomil vs valsartan by ambulatory blood pressure monitoring. J Clin Hypertens 2011; 13: 467–472.

28. Perry. Azilsartan medoxomil: a review of its use in hypertension. Clin Drug Invest 2012; 32: 621–639.

29. Lin, Chang, Caffrey. Examining the association of olmesartan and other angiotensin receptor blockers with overall and cause-specific mortality. Hypertension 2014; 63: 968–976.

30. Huang, Wen, Cheung. Non-inferiority studies with multiple new treatments and heterogeneous variances. J Biopharm Stat 2015; 25: 958–971.