Non-factorial analyses of two-by-two factorial trial designs

Abstract

Background/aims

Factorial analyses of 2 × 2 trial designs are known to be problematic unless one can be sure that there is no interaction between the treatments (A and B). Instead, we consider non-factorial analyses of a factorial trial design that address clinically relevant questions of interest without any assumptions about the interaction. The primary questions of interest are as follows: (1) is A better than the control treatment C, (2) is B better than C, (3) is the combination of A and B (AB) better than C, and (4) is AB better than A, B, and C?

Methods

A simple three-step procedure that tests the first three primary questions of interest using a Bonferroni adjustment at the first step is proposed. A Hochberg procedure on the four primary questions is also considered. The two procedures are evaluated and compared in limited simulations. Published results from three completed trials with factorial designs are re-evaluated using the two procedures.

Results

Both suggested procedures (that answer multiple questions) require a 30%–40% increase in per arm sample size over a two-arm design asking a single question. The simulations suggest a slight advantage to the three-step procedure in terms of power (for the primary and secondary questions). The proposed procedures would have formally addressed the questions arising in the highlighted published trials arguably more simply than the pre-specified factorial analyses used.

Conclusion

Factorial trial designs are an efficient way to evaluate two treatments, alone and in combination. In situations where a statistical interaction between the treatment effects cannot be assumed to be 0, simple non-factorial analyses are possible that directly assess the questions of interest without the zero interaction assumption.

Keywords

Factorial design, factorial analysis, Hochberg procedure, interaction, multiple comparisons, 2 × 2 trial design

Introduction

In a two-by-two factorial trial design, patients are randomly assigned one of four treatments: a control treatment (C), treatment A, treatment B, or a combination treatment of A and B (AB). A factorial analysis of a factorial design estimates the effect of each individual treatment by pooling the results over the other treatment (with a stratified analysis). This makes it possible to answer two questions (Does A work? Does B work?) with a sample size that is one-half of what would be required for two separate trials (A vs C and B vs C). Unfortunately, as is well known,^1,2 if there is any statistical interaction between the treatments on the outcome, the analysis can be misleading. For example, if there is a negative interaction so that A works better than C by a much larger amount than AB works as compared to B, then the factorial analyses may suggest A does not work sufficiently well to be recommended.³ There is the possibility of using an estimate of the interaction to decide whether to abandon the factorial analysis with a completed trial,^4,5 and this strategy is not uncommonly used in practice (formally or informally). However, the statistical properties of this type of approach are poor: the statistical power to detect interactions of a size that would complicate the interpretation of a factorial analysis is low.^2,6,7 Finally, it should be noted that the notion of statistical interaction depends on the (somewhat arbitrary) scale on which the outcome is measured.^2,4 Therefore, unless one is hypothesizing that one or both of the treatments are likely to be totally ineffective (as might be reasonable to assume in a treatment screening trial of disease prevention that is screening multiple treatments³), it can be difficult to argue before seeing any data that one knows scientifically that an interaction should be close to 0.

In this article, we consider situations where one is unable to assume a lack of interaction between the treatments and so a factorial analysis is not being considered. Instead of only asking whether A or B works, we are also interested in how well AB works in comparison to A, B, and C. One possibility is to perform a pairwise analysis of all six pairs of treatment comparisons. However, as noted by Chen and Simon,⁸ there is a natural preference in treatment choices: with approximately equal efficacy outcomes between A (or B) and C, one would choose to treat with C; and with equal efficacy between AB and A (or B), one would choose to treat with A (or B). Chen and Simon⁸ develop procedures for selecting the best treatment that use these preferences. We use the natural preference in treatment choices, but rather than focusing on identifying the best treatment, our approach is structured to assess the individual contributions of the two treatments, by themselves and in combination. As described in the next section, we divide the questions into primary and secondary based on where we want to focus the ability to reject null hypotheses.

In the next section, we describe two analysis strategies, along with type 1 and type 2 errors which we desire to control. This is followed by some limited simulations that investigate the power of the proposed strategies. The sample sizes required for the different approaches are considered next, followed by a re-analysis of three published trials that used factorial trial designs. We end with a discussion including some alternative approaches to the problem.

Proposed analysis strategies

Letting $μ_{X}$ be the mean outcome for patients treated with treatment X, the primary and secondary questions we are focusing on can be expressed in terms of hypothesis tests of null and alternative hypotheses (Table 1). The following, which can be viewed in the context of multiple comparisons as a parallel gatekeeping procedure,⁹ is the proposed three-step analysis strategy for addressing these questions:

Step 1 (testing A, B, and AB vs C): Test H_o1: $μ_{A} ⩽ μ_{C}$ , H_o2: $μ_{B} ⩽ μ_{C}$ , and H_o3: $μ_{A B} ⩽ μ_{C}$ , each with a nominal significance level of $α / 3$ . If H_o3 is rejected, then go to Step 2, otherwise go to Step 3a.

Step 2 (testing AB vs max(C, A, B)): Test H_o4: $μ_{A B} ⩽ \max (μ_{C}, μ_{A}, μ_{B})$ with a nominal significance level of $α$ if both H_o1 and H_o2 were rejected in Step 1 or $(2 / 3) α$ otherwise. If at least one of H_o1 or H_o2 was not rejected in Step 1 and H_o4 is not rejected in this step, then go to Step 3b.

Step 3a (testing A vs B): If H_o1 is rejected, then test H_o5: $μ_{A} = μ_{B}$ versus H_a5: $μ_{A} > μ_{B}$ with a nominal significance level of $α / 3$ . If H_o2 is rejected, then test H_o5: $μ_{A} = μ_{B}$ versus H_a6: $μ_{B} > μ_{A}$ with a nominal significance level of $α / 3$ .

Step 3b (testing A vs B): If H_o1 is rejected, then test H_o5: $μ_{A} = μ_{B}$ versus H_a5: $μ_{A} > μ_{B}$ with a nominal significance level of $α / 6$ . If H_o2 is rejected, then test H_o5: $μ_{A} = μ_{B}$ versus H_a6: $μ_{B} > μ_{A}$ with a nominal significance level of $α / 6$ .
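The flow through Steps 1 to 3b can be sketched in Python as follows. This is only an illustrative sketch, not the authors' software: the function name `three_step`, its argument names, and the choice to count a p value exactly equal to the nominal level as a rejection are assumptions made here. The inputs are one-sided p values from whatever pairwise tests are used.

```python
def three_step(p1, p2, p3, p4, p_a_gt_b, p_b_gt_a, alpha=0.025):
    """Apply the three-step procedure to one-sided p values.

    p1, p2, p3: tests of A vs C, B vs C, and AB vs C (H_o1 to H_o3).
    p4: test of AB vs max(C, A, B) (H_o4).
    p_a_gt_b, p_b_gt_a: tests of A = B against A > B and against B > A.
    Returns a dict mapping each conclusion to True if it can be declared.
    """
    rej = {
        "A>C": p1 <= alpha / 3,   # Step 1: Bonferroni over three hypotheses
        "B>C": p2 <= alpha / 3,
        "AB>C": p3 <= alpha / 3,
        "AB>max": False,
        "A>B": False,
        "B>A": False,
    }
    if rej["AB>C"]:
        # Step 2: full alpha only if both single-agent nulls were rejected.
        both = rej["A>C"] and rej["B>C"]
        rej["AB>max"] = p4 <= (alpha if both else 2 * alpha / 3)
        if both or rej["AB>max"]:
            return rej            # procedure stops; A vs B is not tested
        level = alpha / 6         # Step 3b
    else:
        level = alpha / 3         # Step 3a
    # Step 3a/3b: A vs B is tested only in the direction of a rejected null.
    if rej["A>C"]:
        rej["A>B"] = p_a_gt_b <= level
    if rej["B>C"]:
        rej["B>A"] = p_b_gt_a <= level
    return rej


# One-sided p values reported for Example 3 (Table 5); the two A-versus-B
# p values are placeholders because they are never reached in this case.
result = three_step(0.46, 0.42, 0.0007, 0.0013, p_a_gt_b=1.0, p_b_gt_a=1.0)
print(result)
```

With these inputs the sketch rejects H_o3 in Step 1 and then H_o4 in Step 2 at the (2/3)α level, so Step 3 is not entered.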

Table 1.

Null and alternative hypotheses for addressing primary and secondary questions of interest.

Null hypothesis	Corresponding alternative hypothesis
Primary questions
H_o1: $μ_{A} ⩽ μ_{C}$	H_a1: $μ_{A} > μ_{C}$
H_o2: $μ_{B} ⩽ μ_{C}$	H_a2: $μ_{B} > μ_{C}$
H_o3: $μ_{A B} ⩽ μ_{C}$	H_a3: $μ_{A B} > μ_{C}$
H_o4: $μ_{A B} ⩽ \max (μ_{C}, μ_{A}, μ_{B})$	H_a4: $μ_{A B} > \max (μ_{C}, μ_{A}, μ_{B})$
Secondary questions
H_o5: $μ_{A} = μ_{B}$	H_a5: $μ_{A} > μ_{B}$
H_o5: $μ_{A} = μ_{B}$	H_a6: $μ_{B} > μ_{A}$

A factorial analysis would not formally address a comparison of A versus B (H_o5), and one could consider not including this comparison in our non-factorial analyses here. However, we include this comparison because it does not affect the other comparisons with this analysis strategy.

Another strategy is to use the Hochberg¹⁰ step-up procedure on the four primary hypotheses: let $p_{i}$ be the one-sided p value associated with testing H_oi, i = 1, …, 4, let $p_{(1)} ⩽ p_{(2)} ⩽ p_{(3)} ⩽ p_{(4)}$ be these p values in order, and let H_o(i) be the hypothesis associated with $p_{(i)}$. At the first step, all four hypotheses are rejected if $p_{(4)} < α$; otherwise go to Step 2. In Step 2, if $p_{(3)} < α / 2$, then H_o(i), i = 1, …, 3, are rejected; otherwise go to Step 3. In Step 3, if $p_{(2)} < α / 3$, then H_o(1) and H_o(2) are rejected; otherwise go to the last step. In the last step, H_o(1) is rejected if $p_{(1)} < α / 4$.
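The step-up rule just described can be written compactly for any number of hypotheses. The following is a minimal sketch; the function name `hochberg` and the list-based interface are assumptions made for illustration.

```python
def hochberg(pvalues, alpha=0.025):
    """Hochberg step-up procedure.

    pvalues: one-sided p values, one per null hypothesis.
    Returns a list of booleans in the same order (True = rejected).
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    rejected = [False] * m
    # Examine the largest p value first, then the next largest, and so on.
    for k in range(m, 0, -1):
        if pvalues[order[k - 1]] < alpha / (m - k + 1):
            # Reject the hypotheses with the k smallest p values and stop.
            for i in order[:k]:
                rejected[i] = True
            break
    return rejected


# One-sided p values for H_o1 to H_o4 as reported for Example 3 (Table 5).
print(hochberg([0.46, 0.42, 0.0007, 0.0013]))
```

Here the two largest p values fail their thresholds (α and α/2), the third passes α/3, and so the two hypotheses with the smallest p values are rejected.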

We are interested in controlling the familywise type 1 error at the $α$ level (in a strong sense). For the three-step procedure, that means for any subset of the five null hypotheses that are true, the probability of rejecting any of these true null hypotheses is less than $α$ . For the Hochberg procedure, it means that for any subset of the first four null hypotheses that are true, the probability of rejecting any of these true null hypotheses is less than $α$ .

The power of a testing strategy can be represented in terms of the minimally clinically interesting difference $Δ$ between means and type 2 errors. Letting X > Y denote rejecting H_o: $μ_{X} ⩽ μ_{Y}$ , priority is given for power: (1) for showing A > C or B > C when AB is not better than A and B, (2) for showing AB > C when both A and B are not better than C, (3) for showing AB > max(C, B, A) when at least one of A or B is better than C, and (4) for showing A > B or B > A when AB is not better than the max(C, B, A). These power considerations incorporate the partial ordering of the preferences among the treatment arms and can formally be described by

P1: P(A > C|µ_A ⩾ µ_C + Δ and µ_AB < max(µ_C, µ_A, µ_B) + Δ) ⩾ 1 − β₁

P(B > C|µ_B ⩾ µ_C + Δ and µ_AB < max(µ_C, µ_A, µ_B) + Δ) ⩾ 1 − β₁

P2: P(AB > C|[µ_B < µ_C + Δ and µ_A < µ_C + Δ] and µ_AB ⩾ µ_C + Δ) ⩾ 1 − β₂

P3: P(AB > max(C, A, B)|[µ_B ⩾ µ_C + Δ or µ_A ⩾ µ_C + Δ] and µ_AB ⩾ max(µ_C, µ_A, µ_B) + Δ) ⩾ 1 − β₃

P4: P(A > B|µ_A ⩾ µ_B + Δ and µ_A ⩾ µ_C + Δ and µ_AB ⩽ µ_A + Δ) ⩾ 1 − β₄

P(B > A|µ_B ⩾ µ_A + Δ and µ_B ⩾ µ_C + Δ and µ_AB ⩽ µ_B + Δ) ⩾ 1 − β₄

The different type 2 errors listed (β’s) reflect that we could desire more power for different hypotheses. For example, as noted above, β₄ could be very large as testing A = B is of low priority in this setting.

We show in Appendix 1 that the three-step procedure has type 1 error less than $α$ . For the Hochberg procedure, there is the theoretical possibility of a very slight inflation of type 1 error because of the negative correlation between some of the test statistics.¹¹ However, for this application, type 1 error appears to be controlled. Although we have specified the nominal significance levels at each step, we have not specified what testing procedures should be used for each of the null hypotheses. We recommend using whatever test would be used if there were only two treatments being considered, for example, a one-sided t-test or a one-sided log-rank test for survival data (in which cases the null and alternative hypotheses would be defined in terms of hazard ratios). For Step 2, we recommend using the largest of the (one-sided) p values from the pairwise tests AB versus C, AB versus A, and AB versus B.¹²

Adjusted p values for the individual hypothesis tests can be calculated by finding the smallest $α$ for which the procedure rejects that hypothesis.¹³ For example, if the nominal (one-sided) p value for testing H_o1: $μ_{A} ⩽ μ_{C}$ versus H_a1: $μ_{A} > μ_{C}$ is p = 0.02, then the adjusted p value for this test would be p = 0.06 for the three-step procedure. Simultaneous confidence intervals that are consistent with stepwise multiple comparisons procedures are generally problematic.¹⁴ For the three-step procedure, one can provide Bonferroni-type simultaneous confidence intervals associated with the first three primary hypotheses using nominal 1 − α/3 one-sided confidence intervals; simultaneous two-sided 1 − 2α confidence intervals can be obtained if desired using nominal 1 − (2/3)α two-sided confidence intervals.
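For the Step 1 hypotheses of the three-step procedure and for the Hochberg procedure, adjusted p values follow directly from this definition. The sketch below uses function names chosen here for illustration, and it does not cover the adjusted p values for Steps 2 and 3 of the three-step procedure.

```python
def step1_adjusted(p):
    """Adjusted p value for a Step 1 hypothesis (Bonferroni over three tests)."""
    return min(1.0, 3 * p)


def hochberg_adjusted(pvalues):
    """Hochberg adjusted p values, returned in the input order.

    The i-th smallest p value is adjusted to the minimum of
    (m - j + 1) * p_(j) over all ranks j >= i.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * m
    running_min = 1.0
    for multiplier, i in enumerate(order, start=1):
        running_min = min(running_min, multiplier * pvalues[i])
        adjusted[i] = running_min
    return adjusted


print(round(step1_adjusted(0.02), 4))   # the p = 0.02 example in the text
print([round(a, 4) for a in hochberg_adjusted([0.46, 0.42, 0.0007, 0.0013])])
```

Each returned value is the smallest α at which the corresponding hypothesis would be rejected, up to the strict-versus-non-strict inequality at the boundary.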

Simulations

Table 2 displays the results of simulations of the rejection probabilities (type 1 errors and powers) of the two proposed procedures with α = 0.025 under various scenarios. The numbers in bold correspond to the powers of special interest noted above (P1 − P4). The simulation was conducted using normally distributed data with known standard deviation, with the alternative $Δ$ scaled so that $Δ = 1$ corresponds to 90% power for a one-sided 0.025 level Z-test comparing two means. Considering the three-step procedure first, we note that the powers of interest for the first three primary comparisons (A > C?, B > C?, AB > C?) are 80.2%, reduced from the 90% one would have if only a single comparison were evaluated; this is the price for controlling the familywise error with multiple comparisons. The comparison of AB versus the max(C, A, B) has only 68.1% power when AB is the only effective treatment (case 3) (not a priority for power considerations), but larger powers of 86.8% and 82.0% when more treatments are effective (cases 7 and 9). By design, the comparison of A versus B is not of primary interest and has only 18.2% power when all the treatments are effective (case 10), although more power when fewer treatments are effective (cases 2 and 5).
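The simulation setup described above can be reproduced approximately with the following sketch. It is not the authors' code: the function names, the number of replications, the seed, and the parameterization (each arm's estimated mean is drawn with variance 1/2, so that a difference of two arms is a unit-variance Z statistic) are assumptions made here. Only the primary hypotheses of the three-step procedure are tracked.

```python
import random
from statistics import NormalDist

_nd = NormalDist()
# One unit of Delta gives 90% power for a one-sided 0.025-level Z-test.
UNIT = _nd.inv_cdf(0.975) + _nd.inv_cdf(0.90)


def step1_power(delta, alpha=0.025):
    """Closed-form power of a single Step 1 comparison tested at alpha / 3."""
    return _nd.cdf(delta * UNIT - _nd.inv_cdf(1 - alpha / 3))


def simulate_three_step(mu_a, mu_b, mu_ab, alpha=0.025, n_sims=200_000, seed=1):
    """Monte Carlo rejection rates for H_o1 to H_o4 (control mean fixed at 0)."""
    z_step1 = _nd.inv_cdf(1 - alpha / 3)
    z_full = _nd.inv_cdf(1 - alpha)
    z_two_thirds = _nd.inv_cdf(1 - 2 * alpha / 3)
    sd = 0.5 ** 0.5
    rng = random.Random(seed)
    counts = {"A>C": 0, "B>C": 0, "AB>C": 0, "AB>max": 0}
    for _ in range(n_sims):
        c = rng.gauss(0.0, sd)
        a = rng.gauss(mu_a * UNIT, sd)
        b = rng.gauss(mu_b * UNIT, sd)
        ab = rng.gauss(mu_ab * UNIT, sd)
        r1, r2, r3 = a - c > z_step1, b - c > z_step1, ab - c > z_step1
        counts["A>C"] += r1
        counts["B>C"] += r2
        counts["AB>C"] += r3
        if r3:
            crit = z_full if (r1 and r2) else z_two_thirds
            # The largest pairwise p value corresponds to the smallest Z statistic.
            if min(ab - c, ab - a, ab - b) > crit:
                counts["AB>max"] += 1
    return {name: count / n_sims for name, count in counts.items()}


print(round(step1_power(1.0), 3))
case3 = simulate_three_step(0, 0, 1)   # means as in case 3 of Table 2
print({name: round(rate, 3) for name, rate in case3.items()})
```

The closed-form function gives the power of a single Step 1 comparison, and the simulated rates for case 3 can be compared with the corresponding three-step entries of Table 2, allowing for Monte Carlo error.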

Table 2.

Simulated rejection probabilities for the proposed three-step analysis and Hochberg procedure with normally distributed data (standard deviations known) with α = 0.025 (10⁷ simulations).

Case	Treatment-arm means^a			Three-step analysis						Hochberg procedure				
	$μ_{A}$	$μ_{B}$	$μ_{A B}$	Power (%): A > C	Power (%): B > C	Power (%): AB > C	Power (%): AB > max^b	Power (%): A > B	Type 1 error (%)	Power (%): A > C	Power (%): B > C	Power (%): AB > C	Power (%): AB > max^b	Type 1 error (%)
1	0	0	0	0.8	0.8	0.8	0.1	0.1	2.23	0.7	0.7	0.7	0.0	1.71
2	1	0	0	80.2 ^c	0.8	0.8	0.0	68.9	1.57	77.2	0.9	0.9	0.0	1.57
3	0	0	1	0.8	0.8	80.2	68.1	0.1	1.57	1.1	1.2	78.7	61.6	1.94
4	1	1	0	80.2	80.2	0.8	0.0	0.8	2.47	79.2	79.2	1.2	0.0	1.25
5	1	0	1	80.2	0.8	80.2	1.7	63.8	2.47	79.4	1.3	79.3	1.0	2.26
6	1	1	1	80.2	80.2	80.2	0.4	0.4	1.15	82.1	82.1	82.1	0.3	0.34
7	1	0	2	80.2	0.8	>99.9	86.8	12.1	0.83	84.0	2.4	>99.9	84.1	2.37
8	0.5	0	1	22.0	0.8	80.2	28.2	7.5	0.83	22.6	1.4	77.7	21.0	1.45
9	1	1	2	80.2	80.2	>99.9	82.0	0.2	0.39	88.1	88.1	>99.9	82.0	NA^d
10	2	1	2	>99.9	80.2	>99.9	2.3	18.2	2.33	>99.9	84.3	>99.9	2.4	2.38

^a The mean for the control arm is always 0, that is, $μ_{C} = 0$. Means are scaled so that one unit corresponds to a difference in two means that has 90% power to reject the null hypothesis of equal means at the 0.025 significance level.

^b Max is the maximum of C, A, and B.

^c Numbers in bold correspond to powers of special interest (see text).

^d Not applicable because none of the four tested null hypotheses is true.

For the Hochberg procedure, the powers of interest are generally lower than with the three-step procedure (cases 1–3, 7, and 8) except when all the treatments are effective where the Hochberg procedure has higher power (cases 6 and 10). Note that as utilized for the primary questions, the Hochberg procedure does not formally address the two secondary questions. One could apply Hochberg’s procedure to all six hypothesis tests in Table 1, but this would reduce the power to address the primary questions. For example, the power to reject the null hypothesis H_o1: $μ_{A} ⩽ μ_{C}$ in case 2 would be 73.6% if a six-hypothesis Hochberg procedure was used (see Online Supplementary Material).

Sample size considerations

If one knew that there was no interaction and used a factorial analysis (without control for the two multiple comparisons), then one would be able to answer two questions for the price of one. However, because of the possibility of interactions, advocacy of this position is considered “tantamount to selling snake oil” by some.¹⁵ We are not interested here in trying to increase the efficiency of asking a question about treatments A or B given alone, but about how well A, B, and AB work in comparison with C and each other. Therefore, we consider the increase in sample size per arm required for a non-factorial analysis of a factorial trial design as compared to a two-group comparison with one-sided α = 0.025, which we take as the reference sample size per arm. The proposed three-step strategy uses a nominal type 1 error of 0.0083 (α/3), which would result in an increase in sample size per arm of 29% if one was focusing on the first three primary hypotheses; a larger sample size would be required for ensuring power to test AB being better than max(A, B, C). For example, if 100 patients per arm were required to have 90% power to detect a specified treatment effect in a two-armed trial of A versus C (with one-sided α = 0.025), then approximately 130 patients per arm would be required in the proposed analysis of a factorial design with C, A, B, and AB to be able to detect with 90% power the same treatment effect comparing A versus C, B versus C, and AB versus C. The power to detect AB being better than the max(C, A, B) will be less than 90% and depend on the configuration of the true means.

Using the Hochberg procedure with four null hypotheses, one would size the trial using a nominal type 1 error of 0.0063 (α/4) to protect against the situation when only one null hypothesis was false; this would result in an increase in sample size per arm of 36%. (Using a Bayesian analysis, Simon and Freedman¹⁶ suggest increasing the sample size by 30% per arm to account for a possible nonzero interaction. Using a factorial analysis, a formal test of the interaction parameter (of the same size as the main effects we have been considering) would require a sample size per arm twice as large as in a single two-arm trial, leading to an overall trial size that is four times as large as that of a two-armed trial.¹⁷)
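The per-arm inflation factors discussed in this section can be checked with the usual normal-approximation sample size formula, in which the required sample size is proportional to the squared sum of the critical value and the power quantile. The sketch below is a rough check under that approximation; the function name `inflation` is an assumption made here.

```python
from statistics import NormalDist


def inflation(nominal_alpha, reference_alpha=0.025, power=0.90):
    """Ratio of per-arm sample sizes: nominal level versus reference level.

    Both levels are one-sided; the target power and effect size are the same
    for the two designs, so everything except the critical values cancels.
    """
    nd = NormalDist()
    z_power = nd.inv_cdf(power)
    z_nominal = nd.inv_cdf(1 - nominal_alpha)
    z_reference = nd.inv_cdf(1 - reference_alpha)
    return ((z_nominal + z_power) / (z_reference + z_power)) ** 2


alpha = 0.025
print(round(inflation(alpha / 3), 3))   # three-step procedure, Step 1 level
print(round(inflation(alpha / 4), 3))   # Hochberg procedure, most stringent level
```

Subtracting 1 from each ratio gives the percentage increase in per-arm sample size relative to the two-arm reference design.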

Examples

We present three examples of completed trials that used a factorial trial design. These examples are not meant to suggest that the trial investigators used inappropriate analyses, but solely to demonstrate how our proposed strategy would work on some real trial data.

Example 1: E1199 for the adjuvant treatment of breast cancer conducted by the Eastern Cooperative Oncology Group

E1199 used a factorial trial design that compared the disease-free survival of docetaxel versus paclitaxel and a weekly schedule versus an every-3-weeks schedule; the control arm was paclitaxel given every 3 weeks (treatment C).¹⁸ In addition to C, the treatment arms were weekly paclitaxel (treatment A), docetaxel every 3 weeks (treatment B), and weekly docetaxel (treatment AB). The design specified a factorial analysis, using a 0.05 two-sided significance level for each of the two primary factorial comparisons. If either of the primary comparisons was statistically significant, the design also specified a comparison of each of the three experimental arms with the control arm using a 0.017 two-sided significance level. The 5-year disease-free survival rates for the four treatment groups were 76.9% (C), 81.5% (A), 81.2% (B), and 77.6% (AB). Neither of the primary factorial comparisons was statistically significant. Despite this, the investigators (correctly, in our view) proceeded to compare the individual arms and concluded that weekly paclitaxel improves disease-free survival in this setting.

The results of the three-step approach and Hochberg’s procedure applied to this trial data are given in Table 3. Using the three-step approach (with α = 0.025), one can conclude that weekly paclitaxel (A) is superior to paclitaxel every 3 weeks (C), and go to Step 3 where we cannot reject A = B in favor of A being better than B. Using the Hochberg procedure, one can also conclude that weekly paclitaxel is superior to paclitaxel every 3 weeks. A follow-up of this trial confirmed this result.¹⁹ The proposed procedures allow one to formally address a primary clinical question concerning weekly paclitaxel whereas the factorial analysis can only address this question in an informal ad hoc manner.

Table 3.

Disease-free survival comparisons with the control arm C (paclitaxel every 3 weeks) in the E1199 trial.

Treatment comparison	Hazard ratio ± SE^a	One-sided p value^b	Three-step procedure		Hochberg procedure
			Adjusted p value	Simultaneous 95% CI for hazard ratios^a	Adjusted p value
Weekly paclitaxel (A) versus C	0.79 ± 0.07	0.003	0.009	(0.64, 0.97)	0.012
Docetaxel every 3 weeks (B) versus C	0.81 ± 0.07	0.01	0.03	(0.66, 1.00)	0.03
Weekly docetaxel (AB) versus C	0.92 ± 0.08	0.145	0.435	(0.75, 1.12)	0.29

CI: confidence interval.

^a Hazard ratio of experimental treatment over the control treatment. Hazard ratios and confidence intervals are taken to be the inverses of those reported in Figure 1B of Sparano et al.¹⁸ for hazard ratios of control treatment over experimental treatment and were estimated stratified by number of positive nodes and estrogen-receptor status. Standard errors were derived from confidence intervals in Figure 1B of Sparano et al.¹⁸

^b p values are one-half of the two-sided p values given in Figure 1B of Sparano et al.¹⁸ and were based on log-rank tests stratified by the number of positive nodes and estrogen-receptor status.

Example 2: Trial to Assess Chelation Therapy

In this trial, a factorial design was used to test chelation therapy and a multivitamin supplement compared to placebos (treatment C) on the primary endpoint (a composite of death, myocardial infarction, stroke, coronary revascularization, or hospitalization for angina) for post-myocardial infarction patients; a factorial analysis was specified.²⁰ In addition to C, the treatment arms were chelation therapy (treatment A), a multivitamin supplement (treatment B), and the combination therapy (treatment AB). The 5-year event rates (±standard error (SE)) for the four treatment groups were 31.8% ± 2.2% (C), 27.3% ± 2.2% (A), 28.2% ± 2.2% (B), and 25.7% ± 2.1% (AB);²¹ using the factorial analysis, the chelation therapy was found to be useful with (two-sided) p = 0.035,²² but the multivitamins were not;²³ no adjustments for interim monitoring are made for the results presented here.

The results of the three-step approach and Hochberg’s procedure applied to this trial data are given in Table 4. With α = 0.025, using the three-step approach one can conclude that the combination treatment (AB) is superior to placebos; in Step 2, one cannot conclude that the combination therapy is better than the single-treatment arms. Using the Hochberg procedure, no null hypothesis is rejected at the α = 0.025 level. The three-step procedure formally identifies the combination therapy as better than the placebos whereas the Hochberg procedure slightly misses statistical significance for this comparison.

Table 4.

Time-to-event comparisons with the control arm C (placebos) in the Trial to Assess Chelation Therapy.

Treatment comparison	Hazard ratio ± SE^a	One-sided p value^b	Three-step procedure		Hochberg procedure
			Adjusted p value	Simultaneous 95% CI for hazard ratios^c	Adjusted p value
Chelation therapy (A) versus C	0.83 ± 0.10	0.064	0.192	(0.62, 1.12)	0.192
Multivitamin supplement (B) versus C	0.90 ± 0.11	0.177	0.531	(0.66, 1.21)	0.354
Chelation + vitamins (AB) versus C	0.74 ± 0.10	0.008	0.024^d	(0.54, 1.01^d)	0.032

SE: standard error; CI: confidence interval.

^a Hazard ratios are from Figure 1A of Lamas et al.²¹ Standard errors are derived from nominal two-sided 95% confidence intervals given in Figure 1A of Lamas et al.²¹

^b p values are one-half of the two-sided p values given in Figure 1A of Lamas et al.²¹ and were based on log-rank tests.

^c Nominal two-sided 98.33% confidence intervals are calculated from nominal two-sided 95% confidence intervals given in Figure 1A of Lamas et al.²¹

^d The apparent contradiction of having the (one-sided) adjusted p value < 0.025 but the upper limit of the (two-sided) simultaneous 95% confidence interval being greater than 1 arises because the p values are based on log-rank statistics whereas the confidence intervals are derived from proportional hazards models.
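The conversion of a nominal two-sided 95% confidence interval into a nominal two-sided 98.33% interval, as used for the simultaneous intervals in Table 4, can be sketched as follows. The sketch assumes the interval is symmetric on the log hazard ratio scale; the function name `widen_ci` is an assumption, and the input interval (0.57, 0.95) is an illustrative value, not one quoted in this article.

```python
import math
from statistics import NormalDist


def widen_ci(lower, upper, old_level=0.95, new_level=1 - 2 * 0.025 / 3):
    """Rescale a two-sided CI for a ratio from old_level to new_level.

    Works on the log scale: recovers the log-scale standard error from the
    old interval and rebuilds the interval with the new critical value.
    """
    nd = NormalDist()
    z_old = nd.inv_cdf(0.5 + old_level / 2)
    z_new = nd.inv_cdf(0.5 + new_level / 2)
    center = (math.log(lower) + math.log(upper)) / 2
    se_log = (math.log(upper) - math.log(lower)) / (2 * z_old)
    return math.exp(center - z_new * se_log), math.exp(center + z_new * se_log)


# Illustrative (hypothetical) nominal 95% interval for a hazard ratio.
low, high = widen_ci(0.57, 0.95)
print(round(low, 2), round(high, 2))
```

The default `new_level` is 1 − (2/3)α with α = 0.025, matching the nominal level described in the section on proposed analysis strategies.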

Example 3: Two decontamination regimens for the prevention of acquired infections

In this factorial trial design, polymyxin/tobramycin (P/T) and mupirocin/chlorhexidine (M/C) were evaluated for the prevention of acquired infections in intubated patients in intensive care units.²⁴ In addition to the placebo arm, the other arms were M/C, P/T, or the combination M/C + P/T. The primary endpoint was the number of acquired infections grouped into three categories (0, 1, or ⩾2) and was analyzed using a proportional odds cumulative logit model. The specified primary analysis depended on whether there was a statistically significant interaction in the factorial analysis: if there was not a statistically significant interaction, M/C + P/T would be compared with M/C, P/T, and placebo (three comparisons with a Bonferroni adjustment). If there was a statistically significant interaction, M/C and P/T would additionally be compared with placebo (five comparisons with a Bonferroni adjustment). The results showed that the proportions of acquired infections for the combination M/C + P/T were 77% (zero infections), 20% (one infection), and 8% (two or more infections) and that for the other three arms the treatments were less effective and almost identical: 58% (zero infections), 26% (one infection), and 20% (two or more infections).²⁴ There was a statistically significant interaction, and the investigators’ analysis demonstrated that the combination was better than either regimen alone and better than neither regimen.

The results of the three-step approach and Hochberg’s procedure applied to this trial data are given in Table 5. With α = 0.025, using the three-step approach one can conclude that the combination treatment (M/C + P/T) is superior to placebo and then in Step 2 conclude that the combination therapy is better than the single-treatment arms. Using the Hochberg procedure, the same conclusions can be made, which were also the conclusions made by the investigators.

Table 5.

Odds ratio comparisons in the trial of mupirocin/chlorhexidine (M/C) and polymyxin/tobramycin (P/T) to prevent acquired infections.^a

Treatment comparison	Odds ratio (±SE)^b	One-sided p value	Three-step procedure		Hochberg procedure
			Adjusted p value	Simultaneous 95% CI for odds ratios	Adjusted p value
M/C versus neither regimen	0.975 ± 0.239	0.46	1.00	(0.579, 1.643)	0.46
P/T versus neither regimen	0.951 ± 0.234	0.42	1.00	(0.564, 1.604)	0.46
M/C + P/T versus neither regimen	0.421 ± 0.114	0.0007	0.0021	(0.236, 0.750)	0.0029
M/C + P/T versus Max (M/C, P/T, neither)	0.443 ± 0.120	0.0013	0.0021	−	0.0039

SE: standard error; CI: confidence interval; M/C: mupirocin/chlorhexidine; P/T: polymyxin/tobramycin.

^a Results are calculated from raw count data given in Table 3A of Camus et al.²⁴

^b The odds ratios are identical to those given in Table 3B of Camus et al.²⁴

Discussion

Factorial trial designs, like other multi-arm trial designs, are efficient because they share a common control arm and make it possible to answer multiple questions in one trial.²⁵ A valid factorial analysis, however, requires assumptions that may typically be too strong to be satisfied. Comparing all four treatment arms pairwise is a possibility, but as this approach does not use the natural preference ordering of the treatment arms in a factorial design, it will have reduced power for the primary questions in which we are interested (see Online Supplementary Material). Using the preference order, if the aim is to select the best treatment arm, then using one of the approaches of Chen and Simon⁸ is appropriate. If, on the other hand, one wants to obtain a statistically rigorous inference for all clinically relevant between-arm comparisons, for example, in order to inform clinical development of the treatments, then the two analysis strategies proposed here allow one to answer these questions.

There is a plethora of possible multiple-comparison testing techniques⁹ that could be applied within our framework for the non-factorial analysis of a factorial trial design. For example, one could use the more complex Hommel²⁶ procedure rather than the Hochberg procedure for testing the questions of interest; the improvement in power is minor (results not shown). In the context of a parallel gatekeeping procedure like the three-step procedure, instead of Bonferroni in Step 1, one could use a truncated Hochberg procedure⁹ or alternatively a parametric procedure (such as Dunnett’s²⁷ method using the asymptotic normality and correlation structure of the test statistics). This would result in an increase in power for testing some of the first-step hypotheses at the cost of less power for testing the combination treatment AB in Step 2. We believe the two procedures we have chosen to highlight offer simplicity and reasonable power for testing the various hypotheses of interest.

Comparing the Hochberg and three-step procedures for non-factorial analysis of a two-by-two factorial trial design, we would generally recommend the three-step procedure. Although Hochberg’s procedure has slightly more power when more of the null hypotheses are false, this would seem to be more advantageous in other multiple-comparison settings. For example, if one was testing multiple related clinical endpoints in a randomized trial, then the Hochberg procedure would be advantageous in that the inference for a treatment benefit for an endpoint would be stronger when related clinical endpoints are also showing a treatment benefit.⁹ In the present context, however, it is unclear why treatment A working well should make it easier to be convinced that treatment B is working well. Perhaps the biggest advantage of the three-step procedure over Hochberg’s procedure is the ability to use Bonferroni-adjusted simultaneous confidence intervals for the comparisons of the experimental treatments with the control treatment. Whatever procedure is used, it is important not to depend on a factorial analysis when there is the possibility of a statistical interaction between the treatments and to use an appropriate sample size to assess the clinical questions of interest.

Footnotes

Appendix 1

In Appendix 1, we demonstrate that our proposed procedure has type 1 error less than $α$ . To do this, we consider all 31 possible combinations of the five null hypotheses given in the text being satisfied (Table 6). In all, 16 of the 31 combinations are logically impossible (bottom panel of Table 6). For the remaining 15 cases (top panel of Table 6), we consider the probability of making an error at Step 1 $(α_{S 1})$ , the probability of not making an error at Step 1 and making an error at Step 2 $(α_{S 2})$ , and the probability of not making an error at Step 1 or Step 2 and making an error at Step 3 $(α_{S 3})$ . The overall type 1 error is $α_{S 1} + α_{S 2} + α_{S 3}$ and can be seen in Table 6 to be less than $α$ .
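The count of logically possible combinations can be checked by brute force. The sketch below evaluates the five null hypotheses on a small grid of treatment-arm means and collects the distinct patterns of true null hypotheses. The grid values and the names `null_pattern` and `feasible` are assumptions made here for illustration; the sketch only counts combinations and does not reproduce the error bounds of Table 6.

```python
from itertools import product


def null_pattern(mu_c, mu_a, mu_b, mu_ab):
    """Truth values of H_o1, H_o2, H_o3, H_o4, and H_o5 for given means."""
    return (
        mu_a <= mu_c,                    # H_o1
        mu_b <= mu_c,                    # H_o2
        mu_ab <= mu_c,                   # H_o3
        mu_ab <= max(mu_c, mu_a, mu_b),  # H_o4
        mu_a == mu_b,                    # H_o5
    )


# Control mean fixed at 0; a coarse grid is enough to realize every pattern
# that is logically possible.
grid = (-1, 0, 1, 2)
feasible = {null_pattern(0, a, b, ab) for a, b, ab in product(grid, repeat=3)}
feasible.discard((False,) * 5)   # keep only cases with at least one true null

n_feasible = len(feasible)
n_impossible = (2 ** 5 - 1) - n_feasible
print(n_feasible, n_impossible)
```

A pattern is impossible when, for example, H_o3 is true but H_o4 is false, since a combination no better than the control cannot exceed the maximum of C, A, and B.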

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

1. T-H. The impact of a preliminary test for interaction in a 2 × 2 factorial trial. Commun Stat: Theor M 1994; 23: 435–450.

2. Green, Liu, O’Sullivan. Factorial design considerations. J Clin Oncol 2002; 20: 3424–3430.

3. Brittain, Wittes. Factorial designs in clinical trials: the effects of non-compliance and subadditivity. Stat Med 1989; 8: 161–171.

4. Byar, Piantadosi. Factorial designs for randomized clinical trials. Cancer Treat Rep 1985; 69: 1055–1063.

5. McAlister, Straus, Sackett. Analysis and reporting of factorial trials: a systematic review. JAMA 2003; 289: 2545–2553.

6. Hung. Two-stage tests for studying monotherapy and combination therapy in two-by-two factorial trials. Stat Med 1993; 12: 645–660.

7. Kahan. Bias in randomised factorial trials. Stat Med 2013; 32: 4540–4549.

8. Chen, Simon. Two multi-step selection procedures with possible selection of two treatments of equal prior preference. Commun Stat: Theor M 1994; 23: 781–801.

9. Dmitrienko, D’Agostino RB Sr, Huque. Key multiplicity issues in clinical drug development. Stat Med 2013; 32: 1079–1111.

10. Hochberg. A sharper Bonferroni procedure for multiple tests of significance. Biometrika 1988; 75: 800–802.

11. Sarkar, Chang C-K. The Simes method for multiple hypothesis testing with positively dependent test statistics. J Am Stat Assoc 1997; 92: 1601–1608.

12. Laska, Meisner. Testing whether an identified treatment is best. Biometrics 1989; 45: 1139–1151.

13. Westfall, Young. Resampling-based multiple testing. New York: John Wiley & Sons, 1993.

14. Guilbaud. Simultaneous confidence regions for closed tests, including Holm-, Hochberg-, and Hommel-related procedures. Biom J 2012; 54: 317–342.

15. Green, Benedetti, Smith. Clinical trials in oncology. 3rd ed. Boca Raton, FL: CRC Press, 2012, p. 112.

16. Simon, Freedman. Bayesian design and analysis of two × two factorial clinical trials. Biometrics 1997; 53: 456–464.

17. Peterson, George. Sample size requirements and length of study for testing interaction in a 2 × k factorial design when time-to-failure is the outcome. Control Clin Trials 1993; 14: 511–522.

18. Sparano, Wang, Martino. Weekly paclitaxel in the adjuvant treatment of breast cancer. N Engl J Med 2008; 358: 1663–1671.

19. Sparano, Zhao, Martino. Long-term follow-up of the E1199 phase III trial evaluating the role of taxane and schedule in operable breast cancer. J Clin Oncol 2015; 21: 2353–2360.

20. Lamas, Goertz, Boineau. Design of the Trial to Assess Chelation Therapy (TACT). Am Heart J 2012; 163: 7–12.

21. Lamas, Boineau, Goertz. EDTA chelation therapy alone and in combination with oral high-dose multivitamins and minerals for coronary disease: the factorial group results of the Trial to Assess Chelation Therapy. Am Heart J 2014; 168: 37–44.

22. Lamas, Goertz, Boineau. Effect of disodium EDTA chelation regimen on cardiovascular events in patients with previous myocardial infarction: the TACT randomized trial. JAMA 2013; 309: 1241–1250.

23. Lamas, Boineau, Goertz. Oral high-dose multivitamins and minerals after myocardial infarction: a randomized trial. Ann Intern Med 2013; 159: 797–805.

24. Camus, Bellissant, Sebille. Prevention of acquired infections in intubated patients with the combination of two decontamination regimens. Crit Care Med 2005; 33: 307–314.

25. Freidlin, Korn, Gray. Multi-arm clinical trials of new agents: some design considerations. Clin Cancer Res 2008; 14: 4368–4371.

26. Hommel. A stagewise rejective multiple test procedure based on a modified Bonferroni test. Biometrika 1988; 75: 383–386.

27. Dunnett. A multiple comparison procedure for comparing several treatments with a control. J Am Stat Assoc 1955; 50: 1096–1121.
