Tests for equivalence of two survival functions: Alternative to the tests under proportional hazards

Abstract

For either the equivalence trial or the non-inferiority trial with survivor outcomes from two treatment groups, the most popular testing procedure is the extension (e.g., Wellek, A log-rank test for equivalence of two survivor functions, Biometrics, 1993; 49: 877–881) of the log-rank based test under the proportional hazards model. We show that the actual type I error rate for the popular procedure of Wellek is higher than the intended nominal rate when survival responses from two treatment arms satisfy the proportional odds survival model. When the true model is the proportional odds survival model, we show that the hypothesis of equivalence of two survival functions can be formulated as a statistical hypothesis involving only the survival odds ratio parameter. We further show that our new equivalence test, formulation, and related procedures are applicable even in the presence of additional covariates beyond treatment arms, and the associated equivalence test procedures have correct type I error rates under the proportional hazards model as well as the proportional odds survival model. These results show that use of our test will be a safer statistical practice for equivalence trials of survival responses than the commonly used log-rank based tests.

Keywords

clinical importance, Cox’s model, critical region, proportional odds survival model

1 Introduction

Clinical trials for determining equivalence of a new treatment with a standard treatment of proven efficacy have become increasingly commonplace in recent years. With growing financial and ethical pressures¹ to switch from an expensive and invasive standard treatment/procedure to a cheaper and less-invasive treatment, we can expect an increasing number of equivalence trials to be conducted in future years. Our paper deals with the important concern about the validity of the conclusions from equivalence studies when the key modeling assumptions of the test are violated. Statistical methods used for equivalence trials with survival responses are often based on the methods of Wellek² using the proportional hazards model (PHM) of Cox.³ The reason behind the popularity of this method for equivalence trials is given below. One main challenge in developing a convenient hypothesis testing method for an equivalence trial is the formulation of the statistical hypothesis using only the parameter of the treatment effect. For a two-arm (placebo vs. treatment) superiority trial under any semi-parametric model (e.g., the PHM of Cox³), it is straightforward to make a statistical/mathematical formulation of the alternative hypothesis H_a (clinically important difference) of scientific interest. Any difference in the regression parameter η of the treatment arm implies some difference $S_1(t) \neq S_0(t)$ in the survival curves S₁ and S₀ of the two arms at least at one time point t, and the converse is also true. For example, when two treatment arms follow the PHM of Cox³ with hazard ratio η, the alternative hypothesis H_a: $S_1(t) \neq S_0(t)$ for some t implies $H_a^*$: $\eta \neq 1$, and vice versa.
However, for an equivalence trial, when the alternative H_a is that $|S_1(t) - S_0(t)|$ is within the prespecified range of equivalence for every time point t (to be explained later), it is not straightforward to express this H_a as a statistical hypothesis $H_a^*$ involving only the regression parameter η (which is free of time t). For example, in Cox’s PHM, it is not obvious that $|S_1(t) - S_0(t)|$ being less than a small known constant for all t implies that η is within a known interval. Wellek² paved the way for a convenient log-rank based equivalence test by deriving this result for the PHM, and only for the case of no covariates beyond treatment arms. Our result, an extension of the result of Wellek² to the case of the proportional odds survival model (POSM), allows us to formulate an equivalence test for the POSM based on a rejection region which only involves the estimate and the corresponding standard error of the treatment effect parameter. Please see Wellek⁴ (section 6.7) for a thorough review of the justifications behind formulating the statistical hypothesis of equivalence based on the treatment effect parameter.

Owing to the results of Wellek,² the existing literature on equivalence trials for survival responses is dominated by the log-rank test based on the assumption of a PHM for the two treatment arms, without any consideration for alternative semi-parametric models and the presence of other covariates. The non-parametric procedures of Com-Nougue et al.⁵ and others often require much higher sample sizes than tests based on semi-parametric models. In practice, the hazard functions of two treatment arms are often not proportional over time and there may be other covariates in addition to treatment arms. We show that a log-rank based test of equivalence has a higher than intended type I error rate when treatment arms do not follow the PHM. This points to the practical need to consider new equivalence tests based on other semi-parametric models. For example, the ratio of two hazards may converge towards one over time when the initial benefit of one treatment arm over the other diminishes over time. In this situation, the POSM of Bennett⁶ will be more appropriate than a PHM. In this paper, we also show that a POSM-based equivalence test has the correct type I error rate when the true model is either POSM or PHM. This shows that the POSM-based equivalence test is a safer option in practice compared to the log-rank based test, especially when the underlying modeling assumption is under suspicion.

We place high emphasis on controlling the type I error rate for an equivalence trial because, unlike a superiority trial, an effective standard treatment already exists for an equivalence trial. Wrongly accepting the alternative H_a of equivalence can potentially replace an effective standard treatment with an ineffective treatment in the market. In contrast, even if we wrongly accept the null of non-equivalence, i.e., do not accept the new treatment as equivalent, we will still have the effective standard treatment available in the market. For an equivalence trial, wrongly rejecting the null is therefore a more serious mistake than wrongly accepting the null. However, we first deal with a major impediment to developing an equivalence test for a POSM. Clinicians and other non-statisticians have understandable difficulty in defining the clinically important difference between the two treatment arms in terms of the ratio of two survival odds. In contrast, most clinical experts and researchers are more at ease expressing the clinical equivalence of two treatment arms in terms of a clinically important difference between two survival functions. The development of equivalence trial methodology for POSM depends on whether the alternative hypothesis of the equivalence of two survival curves (or two hazard curves) can be properly expressed as an alternative hypothesis in the regression parameter of the POSM.

In Section 2, we first derive the formulation of the alternative statistical hypothesis, $H_a^*$, that only uses the odds ratio of the POSM, such that $H_a^*$ also corresponds to the scientific (clinical) hypothesis related to the “equivalence” of the survival functions of two treatment arms. In Section 3, we describe the statistical methods including rejection regions for two-sample and one-sample equivalence studies under POSM. In Section 4, we show that even in the presence of additional covariates, testing equivalence of the survival functions for two treatment arms is the same as statistically testing the survival odds ratio parameter to be within a small interval. This result allows us to develop the statistical test of equivalence of two treatments under POSM, even in the presence of additional covariates. In Section 5, we study the relationship between sample size and intended type I error rates with tests based on Cox’s model and our new POSM-based tests. Our theoretical and simulation studies show that when the POSM assumption is true for the trial in question, the log-rank based equivalence test of Wellek² tends to reject the correct null hypothesis more often than the desired level of significance. In contrast, our POSM-based equivalence tests achieve the desired type I error rates and power when the true model is either POSM or Cox’s model.

2 Formulation of hypothesis under POSM

For the time being, we consider no covariate other than treatment arm. We later extend our methods to include other covariates. The POSM of Bennett⁶ assumes

$$\frac{1 - S_1(t)}{S_1(t)} = \theta \left[ \frac{1 - S_0(t)}{S_0(t)} \right] \qquad (2.1)$$

for all time points $t > 0$, where θ is the time-constant survival odds ratio between new treatment and standard treatment, with corresponding survival functions $S_1(t)$ and $S_0(t)$, respectively. For example, one may consider two treatments are clinically equivalent if $|S_1(t) - S_0(t)|$, the difference between two survival functions, is smaller than a predetermined equivalence level δ over time. Thus, two treatment arms are equivalent only when $|S_1(t) - S_0(t)| < \delta$ for all t. Here, the additional quantity $\delta > 0$ indicates the maximum clinical difference allowed between the standard therapy and a therapeutically equivalent experimental therapy. The value of δ is usually determined by clinical experts and regulatory agencies involved in determining the practical definition of the equivalence of two treatments under consideration. However, in order to implement a statistical test for the equivalence of two treatments under POSM of (2.1), the alternative statistical hypothesis $H_a^*$ must be based on a range (interval) of θ, where the interval depends on the practical (clinical) meaning of the equivalence of two survival curves $S_1(t)$ and $S_0(t)$. Furthermore, it is difficult for clinicians and non-statisticians to express the therapeutic equivalence in terms of a prespecified range of θ, because θ is a ratio of odds, unlike difference in probabilities of any observable event under two treatment arms. To facilitate the formulation of a statistical hypothesis testing procedure for evaluating the clinical (scientific) alternative hypothesis H_a: $|S_1(t) - S_0(t)| < \delta$ for all t, under POSM of (2.1), we develop the following theorem.

Theorem 1

Under POSM of (2.1) with continuous $S_0(t)$, testing H_a: $|S_1(t) - S_0(t)| < \delta$ for all $t > 0$ is the same as testing $H_a^*$: $(1 + \varepsilon)^{-1} < \theta < 1 + \varepsilon$, where $\varepsilon = (4\delta)/(1 - \delta)^2$ is a known function of δ.

Theorem 1 (proof in Appendix 1) shows that under the POSM of (2.1), if the clinicians and practitioners can specify the maximum allowable difference δ between two survival functions $S_1(t)$ and $S_0(t)$ of two equivalent treatment arms, we can derive the corresponding statistical alternative hypothesis $H_a^*$ based on the time-constant survival odds ratio θ. This $H_a^*$ can now be tested using statistical hypothesis testing tools.
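As a quick numerical check of Theorem 1, the following Python sketch converts a margin δ into ε and confirms on a grid that, when θ sits at either boundary of $H_a^*$, the largest gap between the two survival curves equals δ. The function names and the grid-based maximisation are illustrative assumptions and are not the authors' code; the sketch also assumes that the continuous $S_0(t)$ decreases from 1 to 0, so that a grid over $S_0$ stands in for all $t > 0$.

```python
def epsilon_from_delta(delta):
    """Margin on the odds-ratio scale: eps = 4*delta / (1 - delta)^2 (Theorem 1)."""
    return 4.0 * delta / (1.0 - delta) ** 2


def max_survival_gap(theta, grid_size=100001):
    """Largest |S1 - S0| under the POSM (2.1), maximised over S0 in (0, 1).

    Solving (2.1) for S1 gives S1 = S0 / (S0 + theta * (1 - S0)).
    """
    gap = 0.0
    for k in range(1, grid_size):
        s0 = k / grid_size
        s1 = s0 / (s0 + theta * (1.0 - s0))
        gap = max(gap, abs(s1 - s0))
    return gap


delta = 0.15
eps = epsilon_from_delta(delta)
# Both boundary values of theta should give a maximum gap close to delta.
print(eps, max_survival_gap(1.0 + eps), max_survival_gap(1.0 / (1.0 + eps)))
```

With $\delta = 0.15$ the first printed value agrees with the $\varepsilon = 0.8304$ used in Section 5.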

Many authors including Rothmann et al.⁷ advocated testing the equivalence of two treatments using the hazard ratio, because the hazard ratio of Cox’s model does not depend on the baseline population. The hazard ratio is also the popular parameter for comparing treatments in efficacy trials (at least in the field of oncology). One may specify the alternative (scientific) hypothesis H_a of equivalence of the two treatment arms via H_a: $|h_1(t)/h_0(t)| < \rho$ for all time points $t > 0$, where $h_1(t)$ and $h_0(t)$ are hazard functions for new and standard treatments, respectively. Similar to δ for Theorem 1, the maximum allowable hazard ratio $\rho > 1$ for two clinically equivalent treatments is determined from a clinical perspective. To expedite the equivalence trial under POSM of (2.1) for H_a based on the hazards ratio, we have the following theorem (proof in Appendix 2).

Theorem 2

Under the POSM assumption of (2.1), the alternative hypothesis of interest H_a: $|h_1(t)/h_0(t)| < \rho$ for all t, is the same as testing $H_a^*$: $\rho^{-1} < \theta < \rho$.

We note that the $H_a^*$ here is identical to the $H_a^*$ of Theorem 1 with $(1 + \varepsilon)$ replaced by ρ. This indicates that for POSM of (2.1), the formulation of the statistical hypothesis $H_a^*$ is the same while testing the equivalence of two treatment arms based on either the maximum hazards ratio over time or the maximum difference of the survival functions over time. Both of these alternative hypotheses can be reduced to testing the statistical hypothesis $H_a^*$ involving only the time-constant parameter θ in (2.1). In the next section, we present the statistical tests and corresponding critical regions for this hypothesis $H_a^*$ for two cases: the two-sample case when the baseline survival function $S_0(t)$ of standard treatment is unknown, and the one-sample case when $S_0(t)$ is known from historical data.
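To see why a bound on θ controls the hazard ratio at every time point, one can differentiate $-\log S_1(t)$ under (2.1), which gives $h_1(t)/h_0(t) = \theta / \{S_0(t) + \theta (1 - S_0(t))\}$. This closed form is our own working from (2.1) rather than a formula quoted from the appendices, so the Python sketch below checks it against a finite-difference hazard. The function names are hypothetical, the value $\theta = 1.8$ is arbitrary, and the log-normal baseline is borrowed from the simulation study of Section 5 purely for illustration.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def s0(t):
    """Illustrative baseline survival: log-normal, S0(t) = Phi(2 - log t)."""
    return STD_NORMAL.cdf(2.0 - math.log(t))


def s1(t, theta):
    """Survival of the new arm obtained by solving (2.1) for S1."""
    b = s0(t)
    return b / (b + theta * (1.0 - b))


def hazard(surv, t, step=1e-5):
    """Central finite-difference approximation to -d/dt log S(t)."""
    return -(math.log(surv(t + step)) - math.log(surv(t - step))) / (2.0 * step)


def hazard_ratio_closed_form(t, theta):
    """h1(t)/h0(t) implied by (2.1); lies between 1 and theta."""
    b = s0(t)
    return theta / (b + theta * (1.0 - b))


theta = 1.8
for t in (1.0, 5.0, 20.0, 80.0):
    numeric = hazard(lambda u: s1(u, theta), t) / hazard(s0, t)
    print(t, round(numeric, 4), round(hazard_ratio_closed_form(t, theta), 4))
```

In this sketch the ratio is close to θ at early times, when $S_0(t)$ is near one, and drifts towards one as $S_0(t)$ decreases, which is the converging-hazards behaviour described in the Introduction.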

3 Implementation of equivalence tests

First we discuss the statistical tests for the equivalence of two treatment arms under the POSM of (2.1) when n patients are randomized to two treatment arms with $z_i = 1$ when patient i receives the new treatment, and $z_i = 0$ when she/he receives the standard treatment. We denote the observed right-censored data as $(\underset{\sim}{Y}, \underset{\sim}{d}, \underset{\sim}{z})$, where $\underset{\sim}{Y} = (Y_1, \dots, Y_n)$ and the observed censoring indicators are $\underset{\sim}{d} = (d_1, \dots, d_n)$, with Y_i the observed survival time when $d_i = 1$ and the right-censoring time when $d_i = 0$. The survival time T_i is subject to non-informative random right censoring. In practice, the decision about therapeutic equivalence of two treatment arms will be based on testing H₀: $|S_1(t) - S_0(t)| \geq \delta$ for some time point t, versus H_a: $|S_1(t) - S_0(t)| < \delta$ for all $t > 0$. From Theorem 1 and (2.1), we know that testing this hypothesis is equivalent to testing

$$H_0^*: |\beta| \geq \log(1 + \varepsilon) \quad \text{versus} \quad H_a^*: |\beta| < \log(1 + \varepsilon) \qquad (3.1)$$

where $\varepsilon = (4\delta)/(1 - \delta)^2$ and $\beta = \log(\theta)$ in (2.1). Due to the formulation of this statistical equivalence test based solely on parameter β of POSM, we can use the test statistic of a superiority test under POSM as the test statistic for testing (3.1). However, the new test for (3.1) has a different rejection region.

One option is to use the semi-parametric maximum likelihood estimator (SPMLE) $(\hat{\beta}, \hat{B})$ of Murphy et al.,⁸ obtained via maximizing the following semi-parametric likelihood

$$L(\beta, B_0 \mid \underset{\sim}{Y}, \underset{\sim}{d}) \propto \prod_{i=1}^{n} \left( \frac{\exp(z_i \beta)}{B_0(Y_i) + \exp(z_i \beta)} \right) \left( \frac{\Delta B_0(Y_i)}{B_0(Y_i-) + \exp(z_i \beta)} \right)^{d_i}$$

where the baseline odds function $B_0(t) = S_0(t)/\{1 - S_0(t)\}$ is a non-decreasing, right continuous function with jumps $\Delta B_0(t) = B_0(t) - B_0(t-)$ at the observed failure times. The rejection region of the large sample based asymptotically most powerful test for (3.1) is given as

$$\left\{ |\hat{\beta}| \sqrt{\hat{I}_\beta} < C_\alpha \left( \sqrt{\hat{I}_\beta} \, \log(1 + \varepsilon) \right) \right\} \qquad (3.2)$$

where $C_\alpha^2(\psi)$ is the αth quantile of a $\chi^2$ distribution with $\mathrm{df} = 1$ and non-centrality parameter $\psi^2$. Numerical differentiation of the profile likelihood $\mathrm{prlik}_n = \log\{L(\beta, \hat{B}_0 \mid Y, \beta)\}$ is used to obtain

$$\hat{I}_\beta \approx -\frac{1}{n h^2} \left\{ \mathrm{prlik}_n(\hat{\beta} + h) - 2\, \mathrm{prlik}_n(\hat{\beta}) + \mathrm{prlik}_n(\hat{\beta} - h) \right\}$$

for some small enough h given in Murphy et al.⁸
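The second-difference approximation to the information can be sketched in a few lines of Python. The profile log-likelihood is passed in as a function because evaluating it requires the SPMLE machinery of Murphy et al.,⁸ which is not reproduced here. The name `observed_information` and the toy quadratic used to exercise it are illustrative assumptions.

```python
def observed_information(prlik, beta_hat, n, h):
    """Second-difference estimate: -(1 / (n h^2)) * {l(b + h) - 2 l(b) + l(b - h)}."""
    second_diff = prlik(beta_hat + h) - 2.0 * prlik(beta_hat) + prlik(beta_hat - h)
    return -second_diff / (n * h ** 2)


# Toy check with a quadratic log-likelihood whose per-observation curvature is
# known; this stands in for the real profile likelihood, which is not computed.
n, true_info, beta_hat = 200, 0.25, 0.3


def toy_prlik(b):
    return -0.5 * n * true_info * (b - beta_hat) ** 2


print(observed_information(toy_prlik, beta_hat, n, h=0.01))
```

For a quadratic the second difference is exact, so the toy call recovers the curvature 0.25 up to rounding; with a real profile likelihood the result depends on the choice of h.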

An alternative semi-parametric approach for testing (3.1) is to use the test statistic of Chen et al.,⁹ based on the estimator $\tilde{\beta}$ obtained via iteratively solving a set of estimating equations. The iterative steps are outlined in Appendix 3. Using the test statistic of Chen et al.,⁹ we can similarly derive the rejection region for testing (3.1) as

$$\left\{ \frac{|\tilde{\beta}|}{\nu(\tilde{\beta})} < C_\alpha \left( \frac{\log(1 + \varepsilon)}{\nu(\tilde{\beta})} \right) \right\} \qquad (3.3)$$

where $C_\alpha^2(\psi)$ is the αth quantile of a $\chi^2$ with $\mathrm{df} = 1$ and non-centrality parameter $\psi^2$. This approach avoids the high-dimensional numerical maximization and the estimator $\nu^2(\tilde{\beta})$ of the asymptotic variance of $\tilde{\beta}$ has a closed-form expression. We omit the closed form expression of $\nu(\tilde{\beta})$ (given in Chen et al.⁹) for the sake of brevity. Although the estimator $\tilde{\beta}$ is not the most efficient estimator, the efficiency loss is typically small.
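The rejection regions (3.2) and (3.3) share one structure: a standardised estimate is compared with $C_\alpha$ evaluated at the standardised margin. The Python sketch below computes $C_\alpha(\psi)$ without a statistics library by using the standard fact that the square of a $N(\psi, 1)$ variable follows a non-central $\chi^2$ distribution with one degree of freedom and non-centrality $\psi^2$, so that $C_\alpha(\psi)$ is the value c solving $\Phi(c - \psi) - \Phi(-c - \psi) = \alpha$. The function names, the bisection bracket, and the example inputs are assumptions made for illustration; the estimate and its standard error would come from either estimation method above.

```python
import math


def norm_cdf(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def critical_value(alpha, psi):
    """C_alpha(psi): square root of the alpha-quantile of chi-square(df=1, ncp=psi^2)."""
    lo, hi = 0.0, abs(psi) + 10.0
    for _ in range(200):  # bisection on an increasing function of c
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid - psi) - norm_cdf(-mid - psi) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def declare_equivalence(beta_hat, se, delta, alpha=0.05):
    """True when |beta_hat| / se falls inside the rejection region for (3.1).

    `se` plays the role of 1 / sqrt(I_beta) in (3.2) or of nu(beta) in (3.3).
    """
    eps = 4.0 * delta / (1.0 - delta) ** 2
    margin = math.log(1.0 + eps)
    return abs(beta_hat) / se < critical_value(alpha, margin / se)


# Hypothetical estimates, not taken from any data set in the paper.
print(declare_equivalence(beta_hat=0.0, se=0.2, delta=0.15))
print(declare_equivalence(beta_hat=0.6, se=0.2, delta=0.15))
```

When the standardised margin is large, $C_\alpha(\psi)$ is approximately $\psi$ minus the upper α point of the standard normal, so the rule behaves like two one-sided tests; for small standardised margins the two differ, which is why the non-central quantile is used here.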

In many equivalence trials, particularly in oncology, for all practical purposes, we may know the baseline survival $S_0(t)$ of the standard treatment. In particular, there often exists a considerable amount of historical data on the survival function $S_0(t)$ of the standard treatment because its efficacy has been already studied. In this situation, every patient with observed survival data Y_i and censoring indicator d_i for $i = 1, \dots, n$ receives the new treatment. We note that the logic and the result of Theorem 1 still apply here, and the hypothesis of equivalence of two treatments is again reduced to the hypothesis of (3.1). Since $S_0(t)$ is known, we can find the MLE $\hat{\beta}$ of β by solving the score equation

$$n - \sum_{i=1}^{n} (1 + d_i) \frac{\exp(\beta)}{\exp\{-B(y_i)\} + \exp(\beta)} = 0$$

where $B(t) = S_0(t)/\{1 - S_0(t)\}$ is known. Using the usual asymptotic theory, the large-sample rejection region is

$$\frac{|\hat{\beta}|}{\nu(\hat{\beta})} < C_\alpha \left( \frac{\log(1 + \varepsilon)}{\nu(\hat{\beta})} \right),$$

where $C_\alpha(\psi)$ is the same as in (3.3) and the estimated variance $\nu^2(\hat{\beta})$ has the closed form expression

$$\nu^2(\hat{\beta}) = \sum_{i=1}^{n} (1 + d_i) \frac{B_0(y_i) \exp(\hat{\beta})}{n \{B_0(y_i) + \exp(\hat{\beta})\}^2}$$

The computer codes for computing the test statistic of (3.2) and corresponding critical region of (3.3) are available from the authors upon request. The authors also have codes for the competing test statistic and critical region of Wellek.²

4 Extension to include other covariates

We now extend our previously described equivalence testing procedure to accommodate other covariates $x_i$, in addition to the treatment arm indicator z_i. Even though it is very much conceivable to have additional covariates in practice, we have not yet come across any previous research on equivalence tests that accommodate additional covariates. We assume that the underlying model with additional covariate $x$ is a natural extension of the POSM of (2.1) with

$$\frac{1 - S_1(t \mid x)}{S_1(t \mid x)} = \theta \left[ \frac{1 - S_0(t \mid x)}{S_0(t \mid x)} \right] = \theta e^{\gamma x} \left[ \frac{1 - S_0(t)}{S_0(t)} \right] \qquad (4.1)$$

where $\gamma$ is the regression parameter of $x$ and θ is again the treatment effect of interest. For this situation, the relevant clinical hypothesis of interest is $H_a: |S_1(t \mid x) - S_0(t \mid x)| < \delta$ for all covariates $x$ and for all $t > 0$. Similar to the statement of Theorem 1, we can show that H_a for this case is equivalent to testing the statistical hypothesis $H_a^*: (1 + \varepsilon)^{-1} < \theta < (1 + \varepsilon)$, where $\varepsilon = (4\delta)/(1 - \delta)^2$ (proof omitted). It is important to note that $H_a^*$ does not depend on either γ or $x$. This result shows that for survival response with the POSM assumption, the hypothesis of equivalence of two patients with the same covariate $x$ but from different treatment arms is the same as testing the statistical hypothesis $H_a^*$. This result allows us to extend the formulation of the statistical hypothesis of equivalence in Theorem 1 to the equivalence studies under POSM with additional covariates $x$. However, the test statistic and corresponding critical region are now different from those used for equivalence tests with no covariates. The new test statistic, its corresponding critical region, and associated computational steps are given in Appendix 4.

5 Error rates of tests

Since the properties of our equivalence testing procedures do not depend on additional covariates $x$, for the sake of simplicity, we do not include covariate $x$ in our theoretical and simulation studies comparing the error rates of competing procedures. In this section, we first show theoretically that the type I error rate of the PHM-based test is inflated when the true model is POSM. After that, we perform simulation studies to examine the finite sample properties (type I error and power) of both the POSM-based tests and the log-rank based tests under correctly and incorrectly specified models.

In practice, the most frequently used semi-parametric procedure for testing equivalence (e.g., Wellek²) is via a log-rank based statistic under the assumption of the PHM of Cox³

$$h_1(t)/h_0(t) = \exp(\eta) \qquad (5.1)$$

where $h_0(t)$ is the baseline hazard and $\exp(\eta)$ is the hazards ratio of the two treatment arms under the PHM. In spite of substantial literature on the robustness of a log-rank statistic based on the PHM of (5.1) for superiority tests, there is not much research studying the effect of wrongly using a log-rank based test statistic for an equivalence hypothesis when the true underlying model is not that of (5.1). We examine the type I error rate for wrongly using a log-rank based equivalence test when the true underlying model is the POSM of (2.1) with true value of β as $\beta_0 = 2 \log\{(1 + \delta)/(1 - \delta)\}$. This implies that two treatment arms following the POSM of (2.1) have the maximum difference of δ between their survival curves. If we wrongly use a log-rank based equivalence test with the same δ, we actually use a test based on the partial likelihood estimate $\hat{\eta}$ of Cox.³ In this case, the asymptotic density of $\hat{\eta}$ is not centered around the true parameter value $\beta_0$ of model (2.1). Instead, Lin and Wei¹⁰ showed that $n^{1/2}(\hat{\eta} - \eta^*)$ follows an asymptotic normal distribution with mean 0 and variance $v^2(\eta)$, where $\eta^*$ is the unique solution of the equation

$$n_1 - \int_0^{+\infty} \frac{n_1 e^{\eta} S_0(t)}{n_1 e^{\eta} S_0(t) + n_0 S_1(t)} \, dt = 0 \qquad (5.2)$$

and where n₀ and n₁ are the sample sizes for the standard treatment and new treatment respectively. Here, $v(\eta)$ is the estimated standard error of $\hat{\eta}$ obtained from Cox.³ When the sample sizes n₀ and n₁ in the two treatment arms increase to $+\infty$, we can show that the center $\eta^*$ of the asymptotic distribution of $\hat{\eta}$ satisfies $|\eta^*| < \log(1 + \varepsilon_h)$, where $\varepsilon_h$ satisfies $(1 + \varepsilon_h)^{-1/\varepsilon_h} - (1 + \varepsilon_h)^{-(1 + \varepsilon_h)/\varepsilon_h} = \delta$ (the proof is in Appendix 5). Since the rejection region for the log-rank based test is

$$\left\{ \frac{|\hat{\eta}|}{v(\eta)} < C_\alpha \left( \frac{\log(1 + \varepsilon_h)}{v(\eta)} \right) \right\}$$

the necessary condition for controlling the type I error rate within 0.05 for large sample size is $|\eta^*| = \log(1 + \varepsilon_h)$. Under the null hypothesis H₀, as sample sizes become sufficiently large and $|\eta^*|$ goes below $\log(1 + \varepsilon_h)$, the type I error rate for a log-rank based test becomes greater than 0.05, the intended type I error rate of the test. Below, we also show, via simulation studies, the approximate levels of inflation of the type I error rate for finite sample sizes if we wrongly use a log-rank based test when the true model is POSM of (2.1) with true regression parameter $\beta_0$.
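The margin $\varepsilon_h$ has no closed form, but its defining equation is easy to solve numerically, and the large-sample argument above can be illustrated by treating $\hat{\eta}$ as exactly normal with a known standard error. The Python sketch below makes both simplifications explicit. All function names are hypothetical, the bisection bracket is an assumption, and the standard errors and the factor 0.8 (a center lying strictly inside the margin) are arbitrary illustrations rather than values derived from (5.2).

```python
import math


def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def critical_value(alpha, psi):
    """c solving Phi(c - psi) - Phi(-c - psi) = alpha, i.e. C_alpha(psi)."""
    lo, hi = 0.0, abs(psi) + 10.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if norm_cdf(mid - psi) - norm_cdf(-mid - psi) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def max_gap_ph(eps):
    """(1 + eps)^(-1/eps) - (1 + eps)^(-(1 + eps)/eps), the left side for eps_h."""
    u = math.exp(-math.log1p(eps) / eps)
    return u - u ** (1.0 + eps)


def solve_eps_h(delta):
    """Bisection for eps_h, assuming the left side increases with eps."""
    lo, hi = 1e-6, 1e3
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if max_gap_ph(mid) < delta:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def rejection_probability(center, margin, se, alpha=0.05):
    """P(|eta_hat| / se < C_alpha(margin / se)) when eta_hat ~ N(center, se^2)."""
    c = critical_value(alpha, margin / se)
    return norm_cdf(c - center / se) - norm_cdf(-c - center / se)


eps_h = solve_eps_h(0.15)
margin = math.log1p(eps_h)
for se in (0.4, 0.2, 0.1):  # a smaller se mimics a larger sample
    print(se,
          rejection_probability(margin, margin, se),        # center on the boundary
          rejection_probability(0.8 * margin, margin, se))  # center inside the margin
```

In this idealised setting the rejection probability stays at the nominal level when the center sits exactly on the boundary, and it grows as the standard error shrinks when the center lies inside the margin, which mirrors the inflation mechanism described above.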

Our simulation studies with underlying POSM use a log-normal baseline survival function $S_0(t) = \Phi(2 - \log(t))$, so that $\log t$ has mean 2 and variance 1, and an exponential censoring distribution with mean 50. The test statistics for the log-rank and POSM-based tests were calculated in Matlab. We take the maximum allowable difference in survival curves between two equivalent treatments as $\delta = 0.15$, the same value used by Wellek.² Using Theorem 1, we get the corresponding $\varepsilon = 0.8304$, the cut-off for the equivalence test based on POSM. Each entry of Table 1 gives the fraction, out of 1000 replications of simulated data sets, for which the test statistic falls in the critical region of (3.2) with $\delta = 0.15$ (that is, $\varepsilon = 0.8304$). The columns for $m = \max |S_1(t) - S_0(t)| = 0$ and 0.10 represent the approximate powers of the tests. The rest of the columns represent the type I error rates (sizes) of the tests at different $m \geq 0.15$. Table 1 shows the approximate powers and sizes of the POSM-based test; the sizes appear to be close to or below the nominal significance level of 0.05.

Table 1.

For different values of maximum difference in survival curves $m = \max |S_1(t) - S_0(t)|$, the Pr(Rejecting H₀) for using POSM-based test when the true model is POSM (sample size $= n_1 + n_2$ for $n_1 = n_2$).

Sample size	Power (m = 0)	Power (m = 0.10)	Type I error rate (m = 0.15)	Type I error rate (m = 0.20)	Type I error rate (m = 0.30)
50	0.114	0.072	0.049	0.030	0.006
100	0.210	0.115	0.050	0.010	0.000
150	0.378	0.154	0.050	0.012	0.000
200	0.598	0.200	0.055	0.007	0.000
400	0.930	0.308	0.044	0.004	0.000

POSM: proportional odds survival model.
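The data-generating design behind these simulations can be sketched as follows. The authors computed their test statistics in Matlab and their simulation code is not shown, so this Python version is only an assumed reconstruction: it draws survival times by inverse-transform sampling from the POSM (2.1) with the log-normal baseline $S_0(t) = \Phi(2 - \log(t))$ and applies independent exponential censoring with mean 50. The function names, the seed, and the small offsets that keep the uniform draws away from 0 and 1 are our own choices.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def posm_survival_time(u, theta):
    """Time t at which the arm-specific survival function equals u.

    Under (2.1), S1 = S0 / (S0 + theta * (1 - S0)); solving S1 = u for S0 gives
    S0 = u * theta / (1 - u + u * theta), and S0(t) = Phi(2 - log t) is inverted.
    Setting theta = 1 reproduces the standard arm.
    """
    s0 = u * theta / (1.0 - u + u * theta)
    return math.exp(2.0 - STD_NORMAL.inv_cdf(s0))


def simulate_arm(n, theta, rng, censor_mean=50.0):
    """Return a list of (observed time, event indicator) pairs for one arm."""
    data = []
    for _ in range(n):
        u = rng.uniform(1e-12, 1.0 - 1e-12)  # avoid the end points 0 and 1
        event_time = posm_survival_time(u, theta)
        censor_time = rng.expovariate(1.0 / censor_mean)
        data.append((min(event_time, censor_time),
                     1 if event_time <= censor_time else 0))
    return data


rng = random.Random(2024)
delta = 0.15
theta_boundary = 1.0 + 4.0 * delta / (1.0 - delta) ** 2  # m = delta, boundary of H0
standard_arm = simulate_arm(50, 1.0, rng)
new_arm = simulate_arm(50, theta_boundary, rng)
print(sum(d for _, d in standard_arm), sum(d for _, d in new_arm))
```

Repeating this generation 1000 times and applying a test to each replicate would mimic the structure of one table cell; the test statistics themselves are not implemented here.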

Table 2 summarizes the approximate powers and sizes for (wrongly) using the log-rank test proposed by Com-Nougue et al.⁵ and Wellek² for equivalence using the same 1000 replicate data sets simulated from the POSM models. We use the rejection region of Wellek,² with intended test size 0.05 and the maximum difference in survival curves $\delta = 0.15$ as the margin of equivalence (same as Table 1). Each entry gives the fraction of replications for which the test statistic falls in the critical region of Wellek² for $\delta = 0.15$. The simulation results show that the type I error rates of the log-rank based tests at the boundary of the null H₀ are greater than 0.05 when the true model is POSM. The difference between the actual (estimated) type I error rate and the intended probability of type I error (5%) increases as the sample size increases. This indicates that when we wrongly use a log-rank based test, the probability of accepting the alternative that the two treatments are equivalent even when they are actually different from each other (null is true) is higher than the intended level of significance of the test.

Table 2.

For different values of maximum difference in survival curves $m = \max |S_1(t) - S_0(t)|$, Pr(Rejecting H₀) for using log-rank based test when the true model is POSM (sample size $= n_1 + n_2$ for $n_1 = n_2$).

Sample size	Power (m = 0)	Power (m = 0.10)	Type I error rate (m = 0.15)	Type I error rate (m = 0.20)	Type I error rate (m = 0.30)
50	0.120	0.085	0.069	0.032	0.007
100	0.286	0.155	0.085	0.038	0.006
150	0.497	0.235	0.111	0.034	0.004
200	0.685	0.335	0.130	0.033	0.000
400	0.964	0.539	0.180	0.033	0.000

POSM: proportional odds survival model.

Our next simulation study uses data sets generated from the PHM with baseline survival $S_0(t) = \Phi(2 - \log(t))$ and a random censoring density of exponential form with mean 50. Both $S_0(t)$ and the censoring density are the same as those used in the previous simulation study. We again compare the type I error rates as well as the power of the log-rank based test with those of our POSM-based tests. The rejection regions for both tests are determined using the equivalence margin $\delta = 0.15$ and an intended level of significance of 0.05. The values in Table 3 are the results using the log-rank based test under PHM. Table 4 values represent the POSM-based test when the simulation model is PHM. Wrongly assuming the POSM leads to only a limited loss of power compared to the log-rank based test, and the type I error rates (sizes) of the POSM-based test remain below and close to the intended 5% level. This shows that the test based on a POSM assumption is a more conservative and robust approach, when compared to the log-rank based test, even when the true underlying model has proportional hazards.

Table 3.

For different values of maximum difference in survival curves $m = \max |S_1(t) - S_0(t)|$, Pr(Rejecting H₀) for using log-rank based test when the true model is PHM (sample size $= n_1 + n_2$ for $n_1 = n_2$).

Sample size	Power (m = 0)	Power (m = 0.10)	Type I error rate (m = 0.15)	Type I error rate (m = 0.20)	Type I error rate (m = 0.30)
50	0.127	0.087	0.045	0.012	0.005
100	0.268	0.131	0.052	0.016	0.000
150	0.510	0.156	0.047	0.010	0.000
200	0.676	0.201	0.049	0.009	0.000
400	0.966	0.355	0.052	0.002	0.000

PHM: proportional hazards model.

Table 4.

For different values of maximum difference in survival curves $m = \max |S_1(t) - S_0(t)|$, Pr(Rejecting H₀) for using POSM-based test when the true model is PHM (sample size $= n_1 + n_2$ for $n_1 = n_2$).

Sample size	Power (m = 0)	Power (m = 0.10)	Type I error rate (m = 0.15)	Type I error rate (m = 0.20)	Type I error rate (m = 0.30)
50	0.104	0.076	0.042	0.026	0.008
100	0.211	0.116	0.046	0.015	0.001
150	0.381	0.143	0.044	0.009	0.000
200	0.594	0.207	0.049	0.003	0.000
400	0.920	0.302	0.044	0.003	0.000

POSM: proportional odds survival model; PHM: proportional hazards model.

For the sake of brevity, we skip the results of the simulation study of the one-sample case comparing the PHM-based test and the POSM-based test. Similar to the two-sample case, the type I error rate of our POSM-based one-sample test is lower than the intended significance level even when the sample size is small. The power of the one-sample test is almost double the corresponding power of the two-sample test, indicating that we need a smaller number of patients than in the two-sample case when $S_0(t)$ is known.

6 Data example and conclusion

The main goal of the pediatric oncology trial of Nam et al.¹¹ was to evaluate whether a seven-month long maintenance treatment (the standard) is equivalent to a shorter (less toxic) four-month long treatment (new treatment) for non-Hodgkin’s malignant type B lymphoma. It is necessary to accept a “small” decrease in survival rate as a trade-off for the better tolerability and lower toxicity of a new shorter treatment. For this study, the investigators decided that $\delta = 0.09$ would be the “threshold of equivalence region” from Nam et al.,¹¹ that is, the maximum difference between the survival rates of two equivalent treatment arms. Using this $\delta = 0.09$, the p-value of the log-rank based equivalence test was 0.024 based on the observed data with 11 and nine failures out of sample sizes of 84 and 82, respectively, from the standard long and new shorter treatment regimens. This p-value ($0.024 < 0.05$) is highly significant evidence in favor of the alternative hypothesis of equivalence of two treatments. We cannot re-analyze the clinical trial because the original data are proprietary. Instead, we use a simulation study to compare a log-rank based test and a POSM-based test when the design and the censoring mechanism are similar to this trial and the true difference between treatment arms is at the margin of equivalence ($m = \max |S_1(t) - S_0(t)| = 0.09$). We simulate 1000 data sets with 11 failures out of $n_1 = 84$ and nine failures out of $n_2 = 82$ for two treatment arms following POSM. We use the censoring scheme and monitoring length (18 months) similar to Nam et al.¹¹ The proportion of log-rank based tests with p-values below 0.024 is 0.038. When using the POSM test on the same set of data, we find the corresponding proportion to be 0.012.
This shows that when the true model is POSM, the probability that a log-rank test will have a p-value less than 0.024 (same or more significant than the p-value obtained by Nam et al.¹¹) is roughly three times the probability of having such a significance level with the POSM-based test. This further demonstrates that a log-rank based test has a high probability of giving a highly significant (very small) p-value, even though we do not want to reject the null hypothesis H₀ when the actual maximum difference equals the margin $\delta = 0.09$.

Unlike Cox’s model where one hazard function dominates the other over time, a POSM can allow two hazard functions to merge over time. This may be a possible explanation for the POSM-based equivalence test being more conservative than the log-rank based test. Li et al.¹² and Betensky et al.¹³ (among others) have argued that even for superiority trials, the efficiency and validity of a log-rank based test are likely questionable for solid tumor oncology studies in which there is heterogeneity of tumors due to the existence of various unidentified genetic subtypes. For these studies, the hazard functions from two treatment arms may merge over time. A POSM assumption with POSM-based equivalence tests would be a wise choice in this situation.

In this paper, we have presented the test statistics, critical region, and robustness and other related properties of POSM-based tests restricted only to right-censored survival data. We would like to point out that the statements of Theorem 1 and Theorem 2 are valid irrespective of the censoring mechanism. Arguably, for any type of censoring, it is possible to develop an equivalence test based on POSM if one can determine the appropriate statistic (preferably based on the SPML estimate of β of POSM from interval-censored data) and the corresponding critical region. However, equivalence trials with different types of censoring such as interval censoring are beyond the scope of this paper. These are important topics for future research.

Acknowledgements

We would like to thank the editor and two referees for their valuable suggestions, which led to a great improvement of our article.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors are grateful for the support provided by NIH grant R01CA69222 for this research.

References

1. World Medical Association Declaration of Helsinki. Ethical principles for medical research involving human subjects. JAMA 2000; 284: 3043–3045.

2. Wellek. A log-rank test for equivalence of two survivor functions. Biometrics 1993; 49: 877–881.

3. Cox. Regression models and life-tables (with discussion). J R Stat Soc Ser B 1972; 30: 248–275.

4. Wellek. Testing statistical hypothesis of equivalence and non inferiority, 2nd ed. Florida: Chapman & Hall, 2010.

5. Com-Nougue, Rodary, Patte. How to establish equivalence when data are censored: a randomized trial of treatments for B non Hodgkin lymphoma. Stat Med 1993; 12: 1353–1364.

6. Bennett. Analysis of survival data by the proportional odds model. Stat Med 1983; 2: 273–277.

7. Rothmann, Chen. Design and analysis of non inferiority mortality trials in oncology. Stat Med 2003; 22: 239–264.

8. Murphy, Rossini, Van Der Vaart. Maximum likelihood estimation in the proportional odds model. J Am Stat Assoc 1997; 92: 968–976.

9. Chen, Jin, Ying. Semi-parametric analysis of transformation models with censored data. Biometrika 2002; 89: 659–668.

10. Lin, Wei. The robust inference for the Cox proportional Hazards model. J Am Stat Assoc 1989; 89: 659–668.

11. Nam, Kim, Seungyeoun. Equivalence of two treatments and sample size determination under exponential survival model with censoring. Comput Stat Data Anal 2005; 49: 217–226.

12. Adelstein, Adams. An intergroup phase III comparison of standard radiation therapy and two schedules of concurrent chemoradiotherapy in patients with unresectable squamous cell head and neck cancer. J Clin Oncol 2003; 21: 92–98.

13. Betensky, Louis, Cairncross. Influence of unrecognized molecular heterogeneity on randomized clinical trials. J Clin Oncol 2003; 20: 2495–2499.