Empirical power comparison of statistical tests in contemporary phase III randomized controlled trials with time-to-event outcomes in oncology

Abstract

Background:

More than 95% of recent cancer randomized controlled trials used the log-rank test to detect a treatment difference, making it the predominant tool for comparing two survival functions. As with other tests, the log-rank test has both advantages and disadvantages. One advantage is that it offers the highest power against proportional hazards differences, which may be a major reason why alternative methods have rarely been employed in practice. The performance of statistical tests has traditionally been investigated both theoretically and numerically for several patterns of difference between two survival functions. However, to the best of our knowledge, there has been no attempt to compare the performance of various statistical tests using empirical data from past oncology randomized controlled trials. Thus, it is unknown whether the log-rank test offers a meaningful power advantage over alternative testing methods in contemporary cancer randomized controlled trials. Focusing on recently reported phase III cancer randomized controlled trials, we assessed whether the log-rank test gave meaningfully greater power when compared with five alternative testing methods: the generalized Wilcoxon test, a test based on the maximum of test statistics from multiple weighted log-rank tests, the difference in t-year event rate, and the difference in restricted mean survival time with fixed and adaptive $τ$.

Methods:

Using manuscripts from cancer randomized controlled trials recently published in high-tier clinical journals, we reconstructed patient-level data for overall survival (69 trials) and progression-free survival (54 trials). For each trial endpoint, we estimated the empirical power of each test. Empirical power was measured as the proportion of trials for which a test would have identified a significant result (p value < .05).

Results:

For overall survival, the t-year event rate difference test offered the lowest empirical power (30.4%) and the restricted mean survival time test with fixed $τ$ offered the highest (43.5%). The empirical power of the other types of tests was almost identical (36.2%–37.7%). For progression-free survival, the tests we investigated offered numerically similar empirical power (55.6%–61.1%). No single test consistently outperformed any other test.

Conclusion:

The empirical power assessment with the past cancer randomized controlled trials provided new insights on the performance of statistical tests. Although the log-rank test has been used in almost all trials, our study suggests that the log-rank test is not the only option from an empirical power perspective. Near universal use of the log-rank test is not supported by a meaningful difference in empirical power. Clinical trial investigators could consider alternative methods, beyond the log-rank test, for their primary analysis when designing a cancer randomized controlled trial. Factors other than power (e.g. interpretability of the estimated treatment effect) should garner greater consideration when selecting statistical tests for cancer randomized controlled trials.

Keywords

Hazard ratio log-rank test restricted mean survival time survival data analysis weighted log-rank test

Introduction

The randomized controlled trial (RCT) is the key clinical research method used to rigorously evaluate the efficacy and safety of a new therapy compared with a control therapy. Typically, regulatory approval of newly developed cancer drugs is based on statistically significant differences in time-to-event outcomes from phase III RCTs. Results from RCTs also play a key role in coverage determinations by payers and treatment decisions by practitioners.

While there are many methods to compare the survival time distribution between two randomized groups, our recent study¹ revealed that more than 95% of contemporary cancer RCTs used the log-rank test, or an asymptotically equivalent hazard ratio (HR)-based test (e.g. the score test or Wald test based on the partial likelihood via Cox’s² proportional hazards (PH) model) to detect a treatment difference. Although log-rank/HR-based tests offer maximal power when the ratio of the hazard functions from two groups is constant over time (i.e. PH), the routine use of the log-rank/HR-based test has one notable shortcoming. Using a log-rank/HR-based test as the primary analysis leads investigators to automatically choose the HR to report the magnitude of the treatment effect, yet the HR does not provide an easily interpretable summary of the treatment effect magnitude.^3–8 Specifically, because the HR is not a ratio of two numbers but a ratio of two functions of time, there is no absolute number from the control group that can be a reference to assess if the reported HR indicates a clinically meaningful magnitude of the treatment effect. This is true regardless of whether or not the PH assumption is correct.^3,4 When the PH assumption is violated, the interpretation is rather difficult because the HR derived from the standard Cox’s method is affected by the study-specific censoring time distribution.⁹

For those non-PH cases, instead of using the standard Cox’s method, one may use a different approach to estimate the HR (e.g. Kalbfleisch and Prentice,¹⁰ Xu and O’Quigley,¹¹ and Schemper et al.¹²). Since HRs estimated through those procedures will not depend on the study-specific censoring time distribution, the resulting HR can be interpreted as an “average” HR under non-PH scenarios. It can also be interpreted as an approximation of “odds of concordance” with a particular weight function, as proposed by Schemper et al.¹² However, these approaches also have some drawbacks regarding interpretation of the treatment effect magnitude. The average HR is essentially a weighted average of the time-specific HR over a time range, so the weight function and the time range should be clearly reported together with the average HR to aid interpretation. However, if the weight function is not intuitive, the interpretation of the HR could be challenging for clinicians/patients. The time range would also need to be clinically relevant to permit clinical interpretation of the treatment effect, unless the average HR is independent of the length of follow-up time. Moreover, the lack of a reference number from the control group will still be an issue for the average HR, as it is for the standard HR. A nice illustration of the difficulties that arise when using the HR for shared decision-making is shown in the paper by McCaw et al.¹³ Also, a recent survey found that 47% of respondents misinterpreted the HR,¹⁴ supporting the concern that the HR can be a challenging way to summarize the treatment effect magnitude.

To address the shortcomings of the log-rank/HR-based approach, several alternative approaches have been introduced.³ For example, restricted mean survival time (RMST)-based analysis^{3,4,7,15–17} is one alternative that has recently gained attention. As with all methods, the RMST-based analysis has pros and cons. Robustness is one notable advantage over the HR, as inference on RMST-based metrics (e.g. RMST difference) does not require strong model assumptions. Another major advantage is the existence of a reference value from the control group, which can help clinicians/patients interpret the magnitude of the treatment effect summarized by the difference or ratio of RMSTs. Also, a recent theoretical study found that the RMST-based test has similar power compared with the log-rank test for PH scenarios, and it can offer greater power than the log-rank test for non-PH scenarios except for delayed difference patterns.¹⁸ On the other hand, one disadvantage of the RMST-based approach is that interpretation of RMST depends on the time-window. When the time ranges from 0 to $τ$ years, RMST can be interpreted as “$τ$-year life expectancy.”¹⁶ But for RMST to have clinical impact, the $τ$ must be clinically meaningful. Another disadvantage is that the clinical research community has limited experience with RMST-based design and analyses. While the methods and software needed to conduct clinical trials (e.g. interim analysis, sample size calculations) using the traditional log-rank/HR approach are well established and widely accessible, the same may not be true for the RMST-based approach.

As mentioned earlier, more than 95% of contemporary cancer RCTs use the traditional log-rank/HR approach.¹ This indicates that although there are many alternative statistical methods, most of them are only found in statistical journals. When a new alternative statistical method is proposed, Monte Carlo simulation studies are conducted to assess the performance of the new method. These numerical studies rely on artificial data and configurations/settings that are quite limited. More convincing real-world evidence would be necessary for an alternative statistical method to be realistically utilized in practice. A major factor impeding adoption of non-log-rank/HR methods may be simply the impression that the log-rank test offers the greatest power in real-life settings since statistical theory has demonstrated that log-rank/HR-based tests offer the greatest power at least under PH scenarios.¹⁹ However, it is unknown whether the log-rank/HR-based approach has a practically meaningful power advantage against alternative testing procedures in the real-world setting. Focusing on cancer RCTs, we take an evidence-based approach for comparing the empirical power of statistical tests.

Materials and methods

We identified phase III cancer RCTs recently reported in major medical journals, reconstructed patient-level data²⁰ and analyzed them to determine the empirical power of different statistical testing procedures.

Data sources and searches

Using PubMed, we searched for papers that reported overall survival and/or progression-free survival results from phase III RCTs and were published in one of seven journals: JAMA, JAMA Oncology, Journal of Clinical Oncology, Journal of the National Cancer Institute, Lancet, Lancet Oncology, and New England Journal of Medicine. The registration date on PubMed had to be between 1 July 2016 and 30 June 2017. Two authors (M. Horiguchi and H. Uno) independently examined the papers to confirm eligibility. The criteria for inclusion in the analysis were as follows: (a) two-arm phase III RCT that reported overall survival or progression-free survival, (b) comparative groups were randomized, (c) study not primarily designed to show non-inferiority, (d) reported results for the primary analysis of the study, and (e) had sufficient information to reconstruct patient-level data (i.e. Kaplan–Meier curve with number at risk at several time points). For each eligible study, we reconstructed the patient-level data for overall survival, progression-free survival, or both, using the algorithm proposed by Guyot et al.²⁰ We confirmed that our reconstructed subject-level data reproduced results practically identical to those reported in the original papers (Supplementary Material S1).

Statistical testing procedures for comparing treatment groups

We analyzed six statistical methods for testing the treatment effect.

Log-rank test

The log-rank test is asymptotically equivalent to testing if HR equals 1 in Cox’s² PH model. This test offers the highest power under PH scenarios. In this study, we used the log-rank test to represent all log-rank/HR-based tests.

Generalized Wilcoxon test

This is a test in the class of weighted log-rank tests,²¹ which includes the log-rank test as a special case. Compared with the log-rank test, the generalized Wilcoxon test places relatively more weight on early study time points. Thus, this test offers higher power for early difference patterns compared to the log-rank test. While there are several testing procedures called the generalized Wilcoxon test, we specifically used the Peto–Prentice Wilcoxon test.^22,23

Max-Combo test

The Max-Combo test is based on the maximum of several weighted log-rank test statistics. The non-PH working group, a collaboration of the US Food and Drug Administration and the pharmaceutical industry, highlighted the Max-Combo test as a way to address non-PH issues.²⁴ The test statistic of their Max-Combo test is the maximum of test statistics from multiple weighted log-rank tests in the $G^{ρ, γ}$ class¹⁹ to capture various patterns of between-group differences. Specifically, it comprises the log-rank test ($G^{0, 0}$), the generalized Wilcoxon test ($G^{1, 0}$), and two other tests ($G^{1, 1}$ and $G^{0, 1}$), aiming to detect PH difference, early effect, middle effect, and late effect, respectively. Such a versatile test is attractive because it can capture various patterns of difference between two survival curves. However, Freidlin and Korn²⁵ raised a notable concern. Specifically, if a one-sided test is applied in both directions, a test using the maximum of multiple test statistics can be significant in favor of the treatment and in favor of the control at the same time.²⁶ This indicates that this type of test can provide a significant p value even when there is no clear overall advantage for either arm. Since judgment based solely on a p value may lead to a misleading conclusion about the clinical significance of a treatment, it is always important to examine the entire survival profile.²⁷
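The four component statistics described above can be sketched as follows. This is a minimal illustration, not the implementation used in the study (the packages actually used are listed in Supplementary Material S2). The function name `weighted_logrank_z` and the toy data are hypothetical. The weights use the pooled Kaplan–Meier estimate just before each event time, which is the usual form of the $G^{ρ, γ}$ class; the Peto–Prentice Wilcoxon test used as a standalone test in this study relies on a slightly different survival estimate. A valid Max-Combo p value also requires the joint distribution of the four correlated statistics, which this sketch does not compute.

```python
import math

def weighted_logrank_z(times, events, groups, rho=0.0, gamma=0.0):
    """Standardized G^{rho,gamma} weighted log-rank statistic (hypothetical helper).

    times: follow-up times; events: 1 = event, 0 = censored;
    groups: 1 = treatment, 0 = control.
    The weight at event time t is S(t-)^rho * (1 - S(t-))^gamma, where S is
    the pooled Kaplan-Meier estimate just before t.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    surv_left = 1.0  # pooled Kaplan-Meier estimate S(t-)
    num, var = 0.0, 0.0
    for t in event_times:
        n = sum(1 for x in times if x >= t)                      # at risk, pooled
        n1 = sum(1 for x, g in zip(times, groups) if x >= t and g == 1)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        d1 = sum(1 for x, e, g in zip(times, events, groups)
                 if x == t and e == 1 and g == 1)
        w = (surv_left ** rho) * ((1.0 - surv_left) ** gamma)
        num += w * (d1 - d * n1 / n)                             # observed - expected
        if n > 1:                                                # hypergeometric variance
            var += w * w * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        surv_left *= 1.0 - d / n
    return num / math.sqrt(var)

# Toy data (illustrative only): six control patients, then six treated patients.
times = [2, 3, 5, 7, 8, 11, 4, 6, 9, 12, 14, 15]
events = [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
groups = [0] * 6 + [1] * 6

combos = {"G(0,0)": (0, 0), "G(1,0)": (1, 0), "G(1,1)": (1, 1), "G(0,1)": (0, 1)}
z_stats = {name: weighted_logrank_z(times, events, groups, r, g)
           for name, (r, g) in combos.items()}
max_abs_z = max(abs(z) for z in z_stats.values())  # Max-Combo-type statistic
```

With this sign convention, a negative statistic means fewer events than expected in the treatment arm.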

RMST with a fixed $τ$ test

RMST for each group can be estimated non-parametrically as the area under the Kaplan–Meier curve. For implementation of this test, either a specific time point or a specific rule to determine the truncation time point $τ$ needs to be pre-specified. In this study, we considered two choices of $τ$: (1) the minimum of the maximum observed times from two groups²⁸ and (2) the minimum of the maximum observed event times from two groups. We will call these $τ_{1}$ and $τ_{2}$, respectively, throughout the study.
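As a rough illustration of the estimator and the two truncation rules, the sketch below computes the Kaplan–Meier curve, integrates it up to $τ$, and derives $τ_{1}$ and $τ_{2}$ from toy data. The function names `kaplan_meier` and `rmst` and the data are hypothetical, and the variance needed for the actual test is omitted.

```python
def kaplan_meier(times, events):
    """Return [(t_j, S(t_j))] at each distinct event time, in increasing order."""
    surv, steps = 1.0, []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        surv *= 1.0 - deaths / at_risk
        steps.append((t, surv))
    return steps

def rmst(times, events, tau):
    """Area under the Kaplan-Meier step function from 0 to tau."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in kaplan_meier(times, events):
        if t >= tau:
            break
        area += prev_s * (t - prev_t)   # rectangle up to the next drop
        prev_t, prev_s = t, s
    return area + prev_s * (tau - prev_t)

# Toy data (illustrative only); event indicator 1 = event, 0 = censored.
ctrl_times, ctrl_events = [2, 3, 5, 7, 8, 11], [1, 1, 1, 1, 0, 0]
trt_times, trt_events = [4, 6, 9, 12, 14, 15], [1, 0, 1, 1, 0, 1]

# tau_1: minimum over arms of the largest observed time (event or censoring).
tau1 = min(max(ctrl_times), max(trt_times))
# tau_2: minimum over arms of the largest observed event time.
tau2 = min(max(t for t, e in zip(ctrl_times, ctrl_events) if e == 1),
           max(t for t, e in zip(trt_times, trt_events) if e == 1))

diff_tau1 = rmst(trt_times, trt_events, tau1) - rmst(ctrl_times, ctrl_events, tau1)
diff_tau2 = rmst(trt_times, trt_events, tau2) - rmst(ctrl_times, ctrl_events, tau2)
```

Because $τ_{2}$ ignores censored observations, it can never exceed $τ_{1}$; in the toy data the two rules give different windows, so the two RMST differences are not the same quantity.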

RMST with an adaptive $τ$ test

This approach allows the investigator to specify a set of candidate $τ$ values, and then relies on the analytic method to choose one of the candidates in a data-dependent manner.²⁹ Specifically, for each $τ$ in the candidate set, we calculate the difference in RMST and the standardized test statistic corresponding to the difference. We will have $k$ test statistics if the candidate set has $k$ elements. Similar to the Max-Combo test, we choose the $τ$ that gives the largest of the $k$ test statistics. Thus, the test statistic of this test is the maximum of the $k$ test statistics. The distribution of this test statistic under the null hypothesis is derived using a resampling technique, adjusting for the multiplicity in choosing $τ$.²⁹ Compared with the approach using a fixed $τ$ for RMST, this approach gives more flexibility in the selection of $τ$. In this study, we considered two candidate sets, each of which has three elements. The first set is $\{τ_{1}/3, 2τ_{1}/3, τ_{1}\}$ and the second set is $\{τ_{2}/3, 2τ_{2}/3, τ_{2}\}$, where $τ_{1}$ and $τ_{2}$ are the same as those described for RMST with a fixed $τ$ above. We call these “adaptive $τ_{1}$” and “adaptive $τ_{2}$” in our study, respectively.
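The idea of taking the maximum standardized statistic over a candidate set and calibrating it by resampling can be sketched as below. Note the substitution: this sketch uses a simple permutation of group labels to approximate the null distribution, which is not the resampling procedure of the cited method²⁹ and which additionally assumes similar censoring patterns in both arms. All function names and the toy data are hypothetical.

```python
import math
import random

def rmst_and_var(times, events, tau):
    """RMST up to tau (area under Kaplan-Meier) and its plug-in variance."""
    ev_times = sorted({t for t, e in zip(times, events) if e == 1 and t < tau})
    surv, prev_t, area, steps = 1.0, 0.0, 0.0, []
    for t in ev_times:
        n = sum(1 for x in times if x >= t)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        area += surv * (t - prev_t)
        surv *= 1.0 - d / n
        steps.append((t, n, d, surv))
        prev_t = t
    area += surv * (tau - prev_t)
    # Variance: sum over event times of (area from t_j to tau)^2 * d / (n (n - d)).
    var, tail = 0.0, 0.0
    bounds = [s[0] for s in steps] + [tau]
    for j in range(len(steps) - 1, -1, -1):
        t, n, d, s = steps[j]
        tail += s * (bounds[j + 1] - t)
        if n > d:
            var += tail * tail * d / (n * (n - d))
    return area, var

def adaptive_stat(times, events, groups):
    """Max |Z| for the RMST difference over {tau1/3, 2*tau1/3, tau1}."""
    arms = [([t for t, g in zip(times, groups) if g == k],
             [e for e, g in zip(events, groups) if g == k]) for k in (0, 1)]
    tau1 = min(max(arms[0][0]), max(arms[1][0]))
    best = 0.0
    for tau in (tau1 / 3, 2 * tau1 / 3, tau1):
        (m0, v0), (m1, v1) = (rmst_and_var(ts, es, tau) for ts, es in arms)
        if v0 + v1 > 0:
            best = max(best, abs(m1 - m0) / math.sqrt(v0 + v1))
    return best

# Toy data (illustrative only): six control patients, then six treated patients.
times = [2, 3, 5, 7, 8, 11, 4, 6, 9, 12, 14, 15]
events = [1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
groups = [0] * 6 + [1] * 6

rng = random.Random(0)          # fixed seed so the sketch is reproducible
observed = adaptive_stat(times, events, groups)
n_perm, exceed = 500, 0
for _ in range(n_perm):
    shuffled = groups[:]
    rng.shuffle(shuffled)       # relabel arms under the null hypothesis
    if adaptive_stat(times, events, shuffled) >= observed:
        exceed += 1
p_value = (exceed + 1) / (n_perm + 1)
```

Because the maximum is recomputed inside every permutation, the resulting p value accounts for the data-dependent choice of $τ$, which is the point of the multiplicity adjustment described above.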

t-year event rate difference test

This test compares survival probabilities at a specific time point $t_{0}$. Survival probabilities can be easily estimated using Kaplan–Meier curves. The time point, $t_{0}$, needs to be specified at the design stage. Since the studies do not share a common follow-up time or pre-specified $t_{0}$, we considered two cases. One case sets $t_{0}$ to be 50% of the study follow-up time, and the other case sets it to be 90%.
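A minimal sketch of such a test is given below, assuming a normal approximation with Greenwood's variance formula for each Kaplan–Meier estimate. The function names and toy data are hypothetical. The definition of "study follow-up time" used here (the smaller of the two arms' largest observed times) is an assumption made only so that both curves are estimable at $t_{0}$; the paper does not spell out its definition in this section.

```python
import math

def km_at(times, events, t0):
    """Kaplan-Meier survival estimate at t0 and its Greenwood variance."""
    surv, gw = 1.0, 0.0
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        if t > t0:
            break
        n = sum(1 for x in times if x >= t)
        d = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        surv *= 1.0 - d / n
        if n > d:
            gw += d / (n * (n - d))
    return surv, surv * surv * gw

def t_year_test(ctrl, trt, t0):
    """Difference in survival at t0 (treatment minus control), Z, two-sided p."""
    s0, v0 = km_at(ctrl[0], ctrl[1], t0)
    s1, v1 = km_at(trt[0], trt[1], t0)
    z = (s1 - s0) / math.sqrt(v0 + v1)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return s1 - s0, z, p

# Toy data (illustrative only): (times, events) per arm.
ctrl = ([2, 3, 5, 7, 8, 11], [1, 1, 1, 1, 0, 0])
trt = ([4, 6, 9, 12, 14, 15], [1, 0, 1, 1, 0, 1])

follow_up = min(max(ctrl[0]), max(trt[0]))   # assumed definition, see lead-in
diff_50, z_50, p_50 = t_year_test(ctrl, trt, 0.5 * follow_up)
diff_90, z_90, p_90 = t_year_test(ctrl, trt, 0.9 * follow_up)
```

The test uses only the heights of the two curves at a single time point, so differences before or after $t_{0}$ do not contribute to the statistic.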

All analyses were performed using R version 3.5.1. The program packages used to implement the above six tests are listed in Supplementary Material S2. Prior to conducting the analyses, we performed a numerical study and confirmed that the type I error rates of the six tests were controlled at a two-sided alpha level of .05 even with the smallest sample sizes in our set of papers (Supplementary Material S3). This confirms that we can perform a fair power comparison of these testing procedures, because their false-positive rates are controlled at the same level. We did not include median difference testing in this comparison, because it is not applicable for all studies.

Statistical analysis

We considered a total of nine tests: the log-rank, generalized Wilcoxon, and Max-Combo tests, plus two variants each of the RMST test with a fixed $τ$, the RMST test with an adaptive $τ$, and the t-year event rate difference test. We applied each of the nine tests to the reconstructed data of each trial and calculated the p value. The empirical power was calculated as the proportion of trials where the test result gave a significant p value (<.05). We estimated the difference of the empirical power for test-pairs and the corresponding 95% two-sided confidence interval (CI).³⁰ As a subgroup analysis, we conducted the same analysis limited to the studies where the violation of the PH assumption was not detected. The determination of the violation of the PH assumption was based on a significant p value (<.05) for the Grambsch and Therneau³¹ test. Analyses for overall survival and progression-free survival were performed separately.
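The empirical power calculation and a paired comparison between two tests can be sketched as follows. Because both tests are applied to the same trials, the comparison depends only on the discordant trials. The interval below is a Wald-type interval for a difference of paired proportions; treating it as the method of reference 30 is an assumption. The p values are invented for illustration and are not taken from any trial.

```python
import math

def empirical_power(p_values, alpha=0.05):
    """Proportion of trials in which the test gave p < alpha."""
    return sum(p < alpha for p in p_values) / len(p_values)

def paired_power_difference(p_test, p_ref, alpha=0.05, z_crit=1.959964):
    """Difference in empirical power (test minus reference) on the same trials.

    Returns (difference, (lower, upper), p value) using a Wald-type
    interval for paired proportions (assumed method, see lead-in).
    """
    n = len(p_ref)
    b = sum(pt < alpha <= pr for pt, pr in zip(p_test, p_ref))  # only test significant
    c = sum(pr < alpha <= pt for pt, pr in zip(p_test, p_ref))  # only reference significant
    diff = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    ci = (diff - z_crit * se, diff + z_crit * se)
    p = math.erfc(abs(diff) / (se * math.sqrt(2))) if se > 0 else 1.0
    return diff, ci, p

# Hypothetical p values from two tests applied to the same ten trials.
p_logrank = [0.01, 0.20, 0.03, 0.60, 0.04, 0.08, 0.30, 0.001, 0.45, 0.07]
p_rmst = [0.02, 0.15, 0.04, 0.55, 0.06, 0.03, 0.25, 0.002, 0.50, 0.04]

power_logrank = empirical_power(p_logrank)
power_rmst = empirical_power(p_rmst)
diff, ci, p_diff = paired_power_difference(p_rmst, p_logrank)
```

Trials on which both tests agree (both significant or both non-significant) cancel out of the difference, which is why the interval width is driven by the number of discordant trials rather than by the total number of trials alone.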

Results

A total of 150 articles were identified by PubMed search. Of these, 69 and 54 papers satisfied eligibility criteria for the analysis of overall survival and progression-free survival, respectively (Figure 1). Table 1 summarizes the characteristics of the eligible studies by outcome variable. The primary endpoint was overall survival in 49% of the 69 studies that were eligible for the analysis of overall survival, and the primary endpoint was progression-free survival in 63% of the 54 studies that were eligible for the analysis of progression-free survival. The log-rank test was used in almost all circumstances: 97% of studies for overall survival and 100% of studies for progression-free survival. Violation of the PH assumption was suggested in 7% of overall survival and 33% of progression-free survival analyses. About 80% of studies had sample sizes of 300 or more for overall survival and progression-free survival. For overall survival, median follow-up time was 3 years or longer in 39% of studies, whereas for progression-free survival, median follow-up time was less than 1 year in 48% of studies. Other characteristics (e.g. journal, cancer type) of the eligible studies are listed in Supplementary Material S4.

Figure 1.

Flow diagram of study selection.

Table 1.

Study characteristics.

Characteristic	OS (N = 69)		PFS (N = 54)
	Count	(%)	Count	(%)
Primary endpoint
OS^a	30	(43)	19	(35)
PFS	22	(32)	32	(59)
OS and PFS	4	(6)	2	(4)
Other	13	(19)	1	(2)
Primary test procedure
Log-rank test	67	(97)	54	(100)
Other	2	(3)	0
Results of the PH assumption test
Significant (p value < .05)	5	(7)	18	(33)
Not significant (p value ≥ .05)	64	(93)	36	(67)
Journal
JAMA	1	(1)	0
JAMA Oncology	3	(4)	1	(2)
Journal of Clinical Oncology	24	(35)	17	(31)
Journal of the National Cancer Institute	1	(1)	0
Lancet	6	(9)	5	(9)
Lancet Oncology	20	(29)	16	(30)
New England Journal of Medicine	14	(20)	15	(28)
Sample size
<300	15	(22)	9	(17)
≥300, <500	21	(30)	23	(43)
≥500, <800	15	(22)	16	(30)
≥800	18	(26)	6	(11)
Median follow-up time
<1 year	14	(20)	26	(48)
≥1 year, <2 years	18	(26)	14	(26)
≥2 years, <3 years	10	(14)	3	(6)
≥3 years	27	(39)	11	(20)

OS: overall survival; PFS: progression-free survival; PH: proportional hazards.

^a Includes co-primary endpoints of overall survival and disease-free survival.

We applied a total of nine tests to the data from each study and calculated p values. The results are described as scatter plots to contrast the log-rank test and the others. Figure 2 shows the p values, for overall survival (N = 69), from RMST-based test with a fixed $τ_{1}$ (Figure 2(a)) and $τ_{2}$ (Figure 2(b)), in contrast to the p values from the log-rank test. Figure 3(a) and (b) show the corresponding figures for progression-free survival outcome (N = 54). For the other tests, similar scatter plots are presented in Supplementary Materials S5 (overall survival) and S6 (progression-free survival).

Figure 2.

Distribution of p values from tests for difference in RMST with fixed (a) $τ_{1}$ and (b) $τ_{2}$ , in contrast to the p values from the log-rank test for the overall survival (N = 69 studies). Fixed $τ_{1}$ : minimum of the maximum observed times from two groups. Fixed $τ_{2}$ : minimum of the maximum observed event times from two groups.

Figure 3.

Distribution of p values from tests for difference in RMST with fixed (a) $τ_{1}$ and (b) $τ_{2}$ , in contrast to the p values from the log-rank test for the progression-free survival (N = 54 studies). Fixed $τ_{1}$ : minimum of the maximum observed times from two groups. Fixed $τ_{2}$ : minimum of the maximum observed event times from two groups.

For the overall survival outcome, the empirical power estimates were 37.7% for the log-rank test and 36.2% for both the generalized Wilcoxon test and the Max-Combo test. They were 43.5% and 40.6% for the RMST test with a fixed $τ_{1}$ and $τ_{2}$, respectively. For the adaptive version of the RMST test with $τ_{1}$ and $τ_{2}$, they were both 37.7%. For the t-year event rate difference test, they were 30.4% and 31.9%, where the time points were chosen at 50% and 90% of follow-up time, respectively (Table 2). Interestingly, the test that showed the highest empirical power was the RMST-based test with a fixed $τ_{1}$, not the log-rank test. The difference in empirical power between the RMST-based test with a fixed $τ_{1}$ and the log-rank test was 5.8 percentage points (95% CI: −1.0 to 12.6; p = .096). No remarkable differences were seen among the log-rank, generalized Wilcoxon, and Max-Combo tests and the other RMST-based tests. The empirical power of the t-year event rate difference tests was slightly lower than the empirical power of all other testing strategies for overall survival. When the analysis was limited to the subgroup of 67 studies where the log-rank test had been used as the primary test, similar results were observed (data not shown). Furthermore, when the analysis was limited to the subgroup of 64 studies for which there was no suggested violation of the PH assumption, the empirical power of the log-rank test was not the highest. The empirical power of the RMST-based test with a fixed $τ_{1}$ was still numerically higher than that of the log-rank test (43.8% vs 37.5%; difference 6.2 percentage points, 95% CI: −1.1 to 13.6, p = .095). This subgroup analysis allowed us to evaluate the performance of the log-rank test when it had its greatest potential advantage.

Table 2.

Empirical power of selected statistical tests for overall survival.

Testing procedure	All studies (N = 69)		Studies where non-PH was not suggested^a (N = 64)
	Empirical power (%)	Difference from the log-rank test (95% CI; p value)	Empirical power (%)	Difference from the log-rank test (95% CI; p value)
Log-rank	37.7	Reference	37.5	Reference
Generalized Wilcoxon	36.2	−1.4 (−6.4 to 3.5; p = .563)	37.5	0 (−4.3 to 4.3; p = 1.000)
Max-Combo	36.2	−1.4 (−6.4 to 3.5; p = .563)	35.9	−1.6 (−6.9 to 3.7; p = .563)
RMST-based (fixed $τ_{1}$ )	43.5	5.8 (−1.0 to 12.6; p = .096)	43.8	6.2 (−1.1 to 13.6; p = .095)
RMST-based (fixed $τ_{2}$ )	40.6	2.9 (−1.1 to 6.9; p = .151)	40.6	3.1 (−1.1 to 7.4; p = .151)
RMST-based (adaptive $τ_{1}$ )	37.7	0 (−5.7 to 5.7; p = 1.000)	35.9	−1.6 (−6.9 to 3.7; p = .563)
RMST-based (adaptive $τ_{2}$ )	37.7	0 (−5.7 to 5.7; p = 1.000)	35.9	−1.6 (−6.9 to 3.7; p = .563)
t-year event rate difference
$t_{0}$ : 50% of follow-up time	30.4	−7.2 (−15.6 to 1.1; p = .089)	31.2	−6.2 (−14.8 to 2.3; p = .151)
$t_{0}$ : 90% of follow-up time	31.9	−5.8 (−13.7 to 2.1; p = .151)	29.7	−7.8 (−15.7 to 0.1; p = .052)

PH: proportional hazards; CI: confidence interval; Max-Combo: maximum of the four weighted log-rank tests; RMST: restricted mean survival time.

Fixed $τ_{1}$: minimum of the maximum observed times from two groups.

Fixed $τ_{2}$: minimum of the maximum observed event times from two groups.

Adaptive $τ_{1}$: a set of candidates based on the fixed $τ_{1}$, $\{τ_{1}/3, 2τ_{1}/3, τ_{1}\}$.

Adaptive $τ_{2}$: a set of candidates based on the fixed $τ_{2}$, $\{τ_{2}/3, 2τ_{2}/3, τ_{2}\}$.

^a Based on the result of the PH test by Grambsch and Therneau’s³¹ method. We considered a p value < .05 to suggest non-PH.

For progression-free survival, the log-rank test was used in all 54 studies (Table 1). The empirical power estimates were 59.3% for the log-rank test, and 61.1% for the generalized Wilcoxon and the Max-Combo tests. They were 57.4% and 55.6% for the RMST test with a fixed $τ_{1}$ and $τ_{2}$, respectively. The RMST test with an adaptive $τ_{1}$ and $τ_{2}$ gave the same empirical power estimates (61.1%). For the t-year event rate difference tests, they were 61.1% and 55.6% at 50% and 90% of follow-up time points, respectively (Table 3). The log-rank test was the most frequently used test across these studies, but it did not show the highest empirical power numerically. The RMST test with a fixed $τ_{2}$ and the t-year event rate difference test at 90% of follow-up time had the lowest empirical power for the progression-free survival outcome, but the difference from the log-rank test was within the sampling variability. In the subgroup of 36 studies where there was no suggested violation of the PH assumption, the empirical power of the log-rank test was 52.8%, the highest of all these tests. However, even in this subgroup the empirical power estimates of the RMST test with a fixed $τ_{1}$ and the t-year event rate difference test at 90% of follow-up time were still comparable (50.0%).

Table 3.

Empirical power of selected statistical tests for progression-free survival.

Testing procedure	All studies (N = 54)		Studies where non-PH was not suggested^a (N = 36)
	Empirical power (%)	Difference from the log-rank test (95% CI; p value)	Empirical power (%)	Difference from the log-rank test (95% CI; p value)
Log-rank	59.3	Reference	52.8	Reference
Generalized Wilcoxon	61.1	1.9 (−7.7 to 11.4; p = .705)	44.4	−8.3 (−17.4 to 0.7; p = .070)
Max-Combo	61.1	1.9 (−6.2 to 10.0; p = .654)	47.2	−5.6 (−13.0 to 1.9; p = .146)
RMST-based (fixed $τ_{1}$ )	57.4	−1.9 (−5.4 to 1.7; p = .313)	50.0	−2.8 (−8.1 to 2.6; p = .310)
RMST-based (fixed $τ_{2}$ )	55.6	−3.7 (−8.7 to 1.3; p = .150)	47.2	−5.6 (−13.0 to 1.9; p = .146)
RMST-based (adaptive $τ_{1}$ )	61.1	1.9 (−6.2 to 10.0; p = .654)	47.2	−5.6 (−13.0 to 1.9; p = .146)
RMST-based (adaptive $τ_{2}$ )	61.1	1.9 (−7.7 to 11.4; p = .705)	44.4	−8.3 (−17.4 to 0.7; p = .070)
t-year event rate difference
$t_{0}$ : 50% of follow-up time	61.1	1.9 (−9.0 to 12.7; p = .739)	44.4	−8.3 (−20.2 to 3.5; p = .169)
$t_{0}$ : 90% of follow-up time	55.6	−3.7 (−13.9 to 6.5; p = .477)	50.0	−2.8 (−12.2 to 6.6; p = .562)

PH: proportional hazards; CI: confidence interval; Max-Combo: maximum of the four weighted log-rank tests; RMST: restricted mean survival time.

Fixed $τ_{1}$: minimum of the maximum observed times from two groups.

Fixed $τ_{2}$: minimum of the maximum observed event times from two groups.

Adaptive $τ_{1}$: a set of candidates based on the fixed $τ_{1}$, $\{τ_{1}/3, 2τ_{1}/3, τ_{1}\}$.

Adaptive $τ_{2}$: a set of candidates based on the fixed $τ_{2}$, $\{τ_{2}/3, 2τ_{2}/3, τ_{2}\}$.

^a Based on the result of the PH test by Grambsch and Therneau’s³¹ method. We considered a p value < .05 to suggest non-PH.

As a sensitivity analysis, we conducted the same analysis for each outcome, leaving only the studies where the primary endpoint corresponds to each outcome of our analysis (N = 34 for each outcome). The empirical power estimates with this subgroup of studies were generally higher (Supplementary Material S7) than our primary results (Tables 2 and 3) for both outcomes. However, we did not see notable differences among the tests in the subgroup analysis.

Discussion

While many statistical methods are available for conducting these analyses, one specific approach—log-rank/HR-based testing—has been used in almost all circumstances.¹ We gathered real-world data from 80 published cancer RCTs to compare the empirical power of nine tests from the six distinct analytic methods. We found no evidence to support the concern that using methods other than the log-rank test would result in a meaningful power loss. No statistically significant difference was seen in empirical power among the nine tests. Although the sample size of this study was relatively modest, the results of the 95% CIs for the difference in empirical power suggested that the potential power loss associated with the RMST-based test with a fixed $τ$ was at most 1% for overall survival-based analysis.

Demonstrating that methods other than the log-rank test offer reasonable empirical power is important. Perhaps the most notable shortcoming of using the traditional log-rank/HR method is the limited interpretability of the HR. Since the generalized Wilcoxon test and each component of the Max-Combo test are weighted versions of the log-rank test, the corresponding summary measures of those tests will also be HR-type measures. Specifically, the estimation procedures will use the partial likelihood with the weight that corresponds to the one used for the weighted log-rank test to estimate the HR.^32,33 However, as we discussed in the “Introduction” section, the HR that corresponds to the weighted log-rank test will share the same interpretation issues as the standard HR calculated from the Cox PH model or the average HR. Although each summary measure has its pros and cons, the measures based on the RMST and t-year event rate tests, which represent the difference or ratio of the tests’ respective metrics, may be preferable to the HR-type measures because they provide more robust and more intuitive quantitative information about the magnitude of the treatment effect.

A potential barrier to selecting an RMST-based test at the design stage would be the need to prespecify $τ$.²⁵ Regarding the choice of $τ$, valuable guidance appears in recent publications.^34,35 As we discussed in the “Introduction” section, choosing a clinically relevant value for $τ$ aids clinical interpretation. Although the choice of $τ$ is indeed a challenge at the design stage, the fact that the RMST always comes with this explicit $τ$ may help readers better interpret the reported treatment effect. Usually, such a time-window is not reported together with the HR. The rationale for this tradition is the assumption that the ratio of two hazard functions is constant forever. Unfortunately, this assumption, which is needed to justify the extrapolation, is not verifiable with the observed data. We recommend explicitly reporting the time-window over which study findings are generalizable, regardless of whether the RMST difference, HR, or average HR is used.

Given the results regarding empirical power, coupled with the interpretability of the summary measures of treatment effect, RMST-based analyses (fixed $τ$ or adaptive $τ$ ) may be a good alternative to log-rank/HR-based analyses. This position is further supported by the observation that the log-rank test did not show a remarkable advantage even when the analysis focused on the subgroup of RCTs where the PH violation was not suggested. Alternatively, these results might be demonstrating that the PH assumption test does not rule out non-PH scenarios. In other words, a non-significant p value (≥.05) from the PH assumption test does not imply that the pattern of the difference is indeed a PH, which would be a notable limitation to consider when applying a PH assumption test.^31,36–39

There is one notable limitation in our study. Because we collected the studies from published papers, our results could have been affected by publication bias. Negative trials or trials suspended early may have been more likely to go unpublished.⁴⁰ However, given that the log-rank test has been routinely used for almost all trials, the direction of the bias of our analysis favors the log-rank test and suggests our results are robust.

Conclusion

Among recently published cancer RCTs, we found that six methods for testing the significance of a treatment difference provided similar empirical power. While the log-rank test has been used in almost all recent cancer RCTs, our empirical analysis could not confirm the power superiority of this test in past cancer trials. Alternative testing strategies appeared to offer similar empirical power in both PH and non-PH scenarios. As with all statistical methods, the log-rank test also has some disadvantages. Given these findings, factors other than power to detect a between-group difference should receive greater consideration when selecting statistical tests for cancer RCTs. For example, the RMST-based approach might be employed to provide a clinically interpretable summary of treatment effect, without concern about losing power by forgoing the log-rank test. In summary, our results suggest that trial investigators have more options than the log-rank/HR approach for the design and analysis of cancer RCTs to accomplish the objectives of their studies.

Supplemental Material

Supplemental material, Supplementary_Material for Empirical power comparison of statistical tests in contemporary phase III randomized controlled trials with time-to-event outcomes in oncology by Miki Horiguchi, Michael J Hassett and Hajime Uno in Clinical Trials

Footnotes

Acknowledgements

The authors greatly appreciate the insightful comments and suggestions from two referees and the editors.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The work was partially supported by institutional funds of Dana-Farber Cancer Institute.

ORCID iD

Hajime Uno

Supplemental material

Supplemental material for this article is available online.

References

1. Uno, Horiguchi, Hassett. Statistical test/estimation methods used in contemporary phase III cancer randomized controlled trials with time-to-event outcomes. Oncologist 2020; 25(2): 91–93.

2. Cox. Regression models and life-tables. J R Stat Soc Series B Stat Methodol 1972; 34(2): 187–202.

3. Uno, Claggett, Tian, et al. Moving beyond the hazard ratio in quantifying the between-group difference in survival analysis. J Clin Oncol 2014; 32(22): 2380–2385.

4. Uno, Wittes, et al. Alternatives to hazard ratios for comparing the efficacy or safety of therapies in noninferiority studies. Ann Intern Med 2015; 163(2): 127–134.

5. Péron, Roy, Ozenne, et al. The net chance of a longer survival as a patient-oriented measure of treatment benefit in randomized clinical trials. JAMA Oncol 2016; 2(7): 901–905.

6. Chappell, Zhu. Describing differences in survival curves. JAMA Oncol 2016; 2(7): 906–907.

7. A’Hern. Restricted mean survival time: an obligatory end point for time-to-event analysis in cancer trials? J Clin Oncol 2016; 34(28): 3474–3476.

8. A’Hern. Cancer biology and survival analysis in cancer trials: restricted mean survival time analysis versus hazard ratios. Clin Oncol (R Coll Radiol) 2018; 30(9): e75–e80.

9. Horiguchi, Hassett, Uno. How do the accrual pattern and follow-up duration affect the hazard ratio estimate when the proportional hazards assumption is violated? Oncologist 2019; 24(7): 867–871.

10. Kalbfleisch, Prentice. Estimation of the average hazard ratio. Biometrika 1981; 68(1): 105–112.

11. O’Quigley. Estimating average regression effect under non-proportional hazards. Biostatistics 2000; 1(4): 423–439.

12. Schemper, Wakounig, Heinze. The estimation of average hazard ratios by weighted Cox regression. Stat Med 2009; 28(19): 2473–2489.

13. McCaw, Orkaby, Wei, et al. Applying evidence-based medicine to shared decision making: value of restricted mean survival time. Am J Med 2019; 132(1): 13–15.

14. Weir, Marshall, Schneider, et al. Interpretation of time-to-event outcomes in randomized trials: an online randomized experiment. Ann Oncol 2019; 30(1): 96–102.

15. Royston, Parmar MKB. The use of restricted mean survival time to estimate the treatment effect in randomized clinical trials when the proportional hazards assumption is in doubt. Stat Med 2011; 30(19): 2409–2421.

16. Royston, Parmar MKB. Restricted mean survival time: an alternative to the hazard ratio for the design and analysis of randomized trials with a time-to-event outcome. BMC Med Res Methodol 2013; 13: 152.

17. Trinquart, Jacot, Conner, et al. Comparison of treatment effects measured by the hazard ratio and by the ratio of restricted mean survival times in oncology randomized controlled trials. J Clin Oncol 2016; 34(15): 1813–1819.

18. Tian, Ruberg, et al. Efficiency of two sample tests via the restricted mean survival time for analyzing event time observations. Biometrics 2018; 74(2): 694–702.

19. Fleming, Harrington. Counting processes and survival analysis. New York: John Wiley & Sons, 1991.

20. Guyot, Ades, Ouwens, et al. Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Med Res Methodol 2012; 12(1): 9.

21. Gill. Censoring and stochastic integrals. Stat Neerl 1980; 34(2): 124.

22. Peto. Asymptotically efficient rank invariant test procedures. J R Stat Soc Ser A 1972; 135(2): 185.

23. Prentice. Linear rank tests with right censored data. Biometrika 1978; 65(1): 167.

24. Anderson. Design and analysis of clinical trials in the presence of non-proportional hazards. In: ASA Biopharmaceutical Section Regulatory-industry Statistics Workshop, Washington, DC, 12–14 September 2018.

25. Freidlin, Korn. Methods for accommodating nonproportional hazards in clinical trials: ready for the primary analysis? J Clin Oncol 2019; 37(35): 3455–3459.

26. Karrison. Versatile tests for comparing survival curves based on weighted log-rank statistics. Stata J 2016; 16(3): 678–690.

27. Uno, Tian. Is the log-rank and hazard ratio test/estimation the best approach for primary analysis for all trials? J Clin Oncol 2020; 38: 2000–2001.

28. Tian, Jin, Uno, et al. On the empirical choice of the time window for restricted mean survival time. Biometrics. Epub ahead of print 15 February 2020. DOI: 10.1111/biom.13237.

29. Horiguchi, Cronin, Takeuchi, et al. A flexible and coherent test/estimation procedure based on restricted mean survival times for censored time-to-event data in randomized clinical trials. Stat Med 2018; 37(15): 2307–2320.

30. Liu, Hsueh, Hsieh, et al. Tests for equivalence or non-inferiority for paired binary data. Stat Med 2002; 21(2): 231–245.

31. Grambsch, Therneau. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika 1994; 81(3): 515.

32. Lin. Goodness-of-fit analysis for the Cox regression model based on a class of parameter estimators. J Am Stat Assoc 1991; 86(415): 725–728.

33. Sasieni. Maximum weighted partial likelihood estimators for the Cox model. J Am Stat Assoc 1993; 88(421): 144–152.

34. Eaton, Therneau, Le-Rademacher. Designing clinical trials with (restricted) mean survival time endpoint: practical considerations. Clin Trials 2020; 17(3): 285–294.

35. Hasegawa, Misawa, Nakagawa, et al. Restricted mean survival time as a summary measure of time-to-event outcome. Pharm Stat. Epub ahead of print 18 February 2020. DOI: 10.1002/pst.2004.

36. Lin, Wei, Ying. Checking the Cox model with cumulative sums of martingale-based residuals. Biometrika 1993; 80(3): 557–572.

37. Wei. Testing goodness of fit for proportional hazards model with censored observations. J Am Stat Assoc 1984; 79(387): 649.

38. Schoenfeld. Chi-squared goodness-of-fit tests for the proportional hazards regression model. Biometrika 1980; 67(1): 145.

39. Marubini, Valsecchi. Analysing survival data from clinical trials and observational studies. Chichester: J Wiley, 1995.

40. Dwan, Altman, Arnaiz, et al. Systematic review of the empirical evidence of study publication bias and outcome reporting bias. PLoS One 2008; 3(8): e3081.

