Abstract
Background
Among various design aspects, the choice of randomization procedure has to be agreed on when planning a clinical trial stratified by center. The aim of this paper is to present a methodological approach to evaluate whether a randomization procedure mitigates the impact of bias on the test decision in clinical trials stratified by center.
Methods
We use the weighted t test to analyze the data from a clinical trial stratified by center with a two-arm parallel group design and an intended 1:1 allocation ratio, aiming to prove a superiority hypothesis with a continuous normal endpoint, with no interim analysis and no adaptation in the randomization process. The derivation is based on the weighted t test under misclassification, i.e. ignoring bias. An additive bias model combining selection bias and time-trend bias is linked to different stratified randomization procedures.
Results
Various aspects of formulating stratified versions of randomization procedures are discussed. A formula for sample size calculation of the weighted t test is derived and used to specify the tolerated imbalance allowed by some randomization procedures. The distribution of the weighted t test under misclassification is deduced, taking the sequence of patient allocation to treatment, i.e. the randomization sequence, into account. An additive bias model combining selection bias and time-trend bias at strata level, linked to the applied randomization sequence, is proposed. With these components, the potential impact of bias on the type I error probability, depending on the selected randomization sequence and thus on the randomization procedure, is formally derived and calculated by way of example in a numerical evaluation study.
Conclusion
The proposed biasing policy and test distribution are necessary to evaluate the comparative performance of (stratified) randomization procedures in multi-center clinical trials with a two-arm parallel group design. This evaluation enables the choice of the best practice procedure and stimulates the discussion about the level of evidence resulting from this kind of clinical trial.
Keywords
1 Introduction
Large clinical trials often stratify the randomization on a small collection of covariates that may introduce heterogeneity into the patient stream. An important covariable in multi-center trials is often the clinical center, as different study personnel, clinical settings, and patient populations may result in differential study outcomes. 1 A stratified population-based analysis can be performed with or without stratification in the design. Less is known about the impact of stratification when there is a bias in the clinical trial. In this paper, we explore this issue both for selection bias and chronological bias, and we demonstrate the impact of these analyses on a weighted stratified analysis. In so doing, we explore the role of specific stratified randomization procedures (RPs) and how certain procedures may mitigate the effects of bias. The recognition of the role of RPs in mitigating bias has been explored in prior research for unstratified trials.2–6 But because stratification into K strata creates K different independent randomized clinical trials, and a stratified test combines K independent tests, the impact of bias can be more pronounced.
The paper is organized as follows. In Section 2, we describe different stratified RPs and discuss aspects of formulating stratified versions of RPs. In Section 3, we derive a formulation of Fleiss's1 stratified test statistic that preserves the allocation sequence, derive the distribution of the test statistic when bias is present but ignored in the analysis, and mention some sample size considerations for the stratified test. In Section 4, we specify the bias model as an additive combination of strata-specific selection bias and strata-specific time-trend bias linked to the stratified allocation sequence. The criterion introduced in Section 5 is used to summarize the impact of the allocation sequence-specific bias on the type I error probability over the range of all sequences induced by a specific RP. Consequently, an assessment of different RPs is enabled, which guides the choice of an RP for application in a particular clinical trial setting. The methodology is applied to some specific scenarios in Section 6 to illustrate the effects. We discuss the findings in Section 7 and draw conclusions in Section 8.
2 Stratified RPs
RPs for clinical trials with two treatments are well described in the literature.2 In principle, any RP used for two-treatment clinical trials can be employed within strata in a stratified randomization. A comprehensive review is given in Rosenberger.2
Complete randomization, in which patients are assigned to treatments with probability 1/2, is rarely used in stratified clinical trials. Rather, some form of restricted randomization is employed in an effort to balance treatments within strata. Hilgers6 categorized restricted RPs as those that force balance in probability, force balance using a maximal tolerated imbalance, or force terminal balance. A selective list of restricted RPs is given as follows2:
Efron's biased coin design (EBC(p)), which consists of flipping a biased coin with probability p in favor of the treatment allocated less frequently so far,
Big stick design (BSD(a)), which can be implemented via complete randomization with a forced deterministic assignment when a maximal tolerated imbalance a is reached during enrollment,8
Random allocation rule (RAR), which assigns half the patients to E and half to C at random,9
Permuted block randomization (PBR(b)) with block size b, which uses RAR within blocks of b patients, for b even,10
Maximal procedure (MP(a)), which uses the allocation sequences of RAR while additionally imposing a maximal tolerated imbalance a and assigning equal probability to all such sequences.11
Note that EBC(p) may be classified as a restricted RP forcing balance in probability. BSD(a) forces balance by maximal tolerated imbalance a during the allocation process but does not force terminal balance. Restricted RPs with a maximal tolerable imbalance and terminal balance are PBR(b) and MP(a).
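The procedures listed above can be sketched for a single stratum as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, n is assumed to be even for RAR, the fair coin used by EBC(p) when the arms are balanced is the usual convention and is assumed here, and MP(a) is omitted because sampling uniformly from its reference set needs more machinery than fits in a short example.

```python
import random

def rar(n, rng):
    """Random allocation rule: n/2 patients to E and n/2 to C, in random order."""
    seq = ["E"] * (n // 2) + ["C"] * (n // 2)
    rng.shuffle(seq)
    return seq

def pbr(n, b, rng):
    """Permuted block randomization: RAR within consecutive blocks of size b."""
    seq = []
    while len(seq) < n:
        seq.extend(rar(b, rng))
    return seq[:n]

def ebc(n, p, rng):
    """Efron's biased coin: the under-represented arm is chosen with probability p."""
    seq, d = [], 0  # d = (#E - #C) allocated so far
    for _ in range(n):
        prob_e = 0.5 if d == 0 else (p if d < 0 else 1 - p)
        arm = "E" if rng.random() < prob_e else "C"
        d += 1 if arm == "E" else -1
        seq.append(arm)
    return seq

def bsd(n, a, rng):
    """Big stick design: fair coin, forced assignment once |imbalance| reaches a."""
    seq, d = [], 0
    for _ in range(n):
        if d >= a:
            arm = "C"
        elif d <= -a:
            arm = "E"
        else:
            arm = "E" if rng.random() < 0.5 else "C"
        d += 1 if arm == "E" else -1
        seq.append(arm)
    return seq

def max_imbalance(seq):
    """Largest absolute difference |#E - #C| observed along the sequence."""
    d, worst = 0, 0
    for arm in seq:
        d += 1 if arm == "E" else -1
        worst = max(worst, abs(d))
    return worst
```

By construction, `max_imbalance` never exceeds a for BSD(a) and never exceeds b/2 for PBR(b), while RAR and PBR with complete blocks end in exact balance.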
The International Council for Harmonisation states in the E9 guideline (ICH E9): “It is advisable to have a separate random scheme for each centre, i.e., to stratify by centre or to allocate several whole blocks to each centre.”
The European Medicines Agency “Guideline on Clinical Trials in Small Populations” recommends stratified randomization to improve power. Using permuted blocks within each stratum is the most popular method of stratified randomization, and this is often called the stratified block design. Blocks can be selected with a fixed size or with variable sizes. However, blocking is not the only method to use within strata. The ICH E9 guidelines12 also state that “different trial designs will require different procedures for generating randomization schedules.” We now define stratified randomization more formally.
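Before the formal definition, the stratified block design with fixed or variable block sizes can be sketched as below. The function names, the center labels and the rule of drawing each block size uniformly from a list are illustrative assumptions; the text only states that block sizes may be fixed or variable.

```python
import random

def permuted_blocks(n, block_sizes, rng):
    """Allocation list of length n built from balanced blocks.

    Each block size is drawn at random from block_sizes (all assumed even).
    """
    seq = []
    while len(seq) < n:
        b = rng.choice(block_sizes)
        block = ["E"] * (b // 2) + ["C"] * (b // 2)
        rng.shuffle(block)
        seq.extend(block)
    return seq[:n]  # the last block may be left incomplete

def stratified_schedule(center_sizes, block_sizes, seed=None):
    """One separate randomization list per center (stratum)."""
    rng = random.Random(seed)
    return {center: permuted_blocks(n, block_sizes, rng)
            for center, n in center_sizes.items()}

def final_imbalance(seq):
    """Difference #E - #C at the end of the list."""
    return seq.count("E") - seq.count("C")

# Two centers with 40 patients each and a fixed block size of 4.
schedule = stratified_schedule({"center_1": 40, "center_2": 40}, [4], seed=2024)
```

With a fixed block size that divides each center's sample size, every center ends in exact balance. With variable block sizes an incomplete final block can leave a within-center imbalance of up to half the largest block size.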
Consider the allocation
However, when implementing a stratified restricted RP, this observation generally does not hold and some further definitions are necessary. Even in the very simple RAR, the set of possible randomization sequences is reduced considerably and the probability for the stratified allocation sequence becomes
Another important aspect concerns the “balancing behavior” of restricted RPs. The term restricted refers to the fact that conditions on the randomization process are introduced to control the potential imbalance in the frequency of treatment allocations. Let
Three definitions of imbalance are used in the following:
An RP shows overall final balance, if
An RP controls the final balance within strata, if
An RP controls the maximal tolerated imbalance, if
Of course, controlling the overall final balance does not imply control of the final balance within each stratum, i.e.
It should also be noted that the stratified RAR procedure in general cannot be considered as an unstratified PBR with block sizes
Similar problems arise when controlling the maximal tolerated imbalance with margin a, which results in an upper overall bound of
3 Stratified analysis
As mentioned in Section 1, a stratified randomization requires a stratified analysis, although a stratified analysis can be performed whether or not the randomization was stratified. In this section, we examine the distributional properties of a test statistic introduced by Fleiss1 (page 268, formulas 1 and 2) based on a weighted t statistic for the analysis of stratified clinical trials. While we do not consider randomization tests in this paper, randomization-based inference is clearly an attractive alternative, see Rosenberger.2 The reason for using a parametric t test is that it facilitates our goal of determining the effect of bias on inference, since we can derive the distribution of the test statistic under various forms of bias. In particular, in this section, we derive the non-centrality parameter for the distribution of the test statistic under alternative hypotheses and comment on how it can be used for sample size considerations. In the sequel, we are interested in the role of the RP in the analysis of stratified trials. Because Fleiss wrote specifically about centers rather than strata, we use both terms interchangeably; it should be clear, however, that stratification can be done on variables other than center.
We will consider a two-arm parallel group clinical trial stratified by K centers with no interim analysis. The response to the treatments E and C respectively is measured with the continuous normally distributed endpoint
We use the allocation sequence notation of the statistical model assuming no treatment by center interaction by
Fleiss's statistic to test the hypothesis
Here
Next, we calculate the distribution of the denominator
Note that
Applying that the sum of independent non-central
For enabling the matrix notation of the above expressions, a usual design matrix
In the case sampling is “stratified” by center and the objective is to estimate the overall treatment effect accounting for center, Fleiss1 proposed the weights
Of course, equation (5) implies that
In the case sampling is “stratified” by center and the objective is to estimate the overall treatment effect, Fleiss1 proposed the weights $w_j = 1$ so that equation (3) becomes
Weighting centers in the absence and presence of center-by-treatment interaction has been discussed in detail by other authors.15
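A weighted stratified t statistic of this kind can be computed as in the following sketch. The exact form is an assumption here: the numerator is the weighted sum of within-center mean differences, the variance is pooled over all center-by-treatment cells with N − 2K degrees of freedom, and the "optimal" weights are taken to be n_Ej n_Cj / (n_Ej + n_Cj). The function name and the data layout are hypothetical.

```python
import math

def weighted_t(data, weights="optimal"):
    """Weighted stratified t statistic (assumed form, see lead-in).

    data: list of (responses_E, responses_C) pairs, one pair per center.
    weights: "optimal" uses n_E * n_C / (n_E + n_C); anything else uses w_j = 1.
    Returns (t, degrees_of_freedom).
    """
    num, var_factor, ss, df = 0.0, 0.0, 0.0, 0
    for y_e, y_c in data:
        n_e, n_c = len(y_e), len(y_c)
        m_e, m_c = sum(y_e) / n_e, sum(y_c) / n_c
        w = n_e * n_c / (n_e + n_c) if weights == "optimal" else 1.0
        num += w * (m_e - m_c)                       # weighted mean difference
        var_factor += w ** 2 * (1 / n_e + 1 / n_c)   # variance of the numerator / sigma^2
        ss += sum((y - m_e) ** 2 for y in y_e) + sum((y - m_c) ** 2 for y in y_c)
        df += n_e + n_c - 2
    s2 = ss / df                                     # pooled within-cell variance
    return num / math.sqrt(s2 * var_factor), df
```

With a single center, both weighting schemes reduce to the ordinary pooled two-sample t statistic, which gives a quick sanity check of the sketch.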
3.1 Sample size considerations
We now briefly discuss the aspects of the sample size and power calculation using the weighted t test statistic. Details can be found in the Supplementary Material Section S1. The results will be used in our numerical evaluation study.
Assuming no bias
The derivation assumed homogeneous variances in all groups and centers. Using the optimal weights of Fleiss,1 i.e.
This formula, derived under the assumption of homogeneous variances using the optimal weights and the allocation ratio r, can be evaluated from various perspectives. One can determine the sample size necessary to detect a certain treatment effect in a clinical trial, or determine the power for various settings of the allocation ratio. Of course, the relationship of the sample size to the RP is obvious in the case of RPs forcing terminal balance. The power can also be related to RPs with the maximal tolerated imbalance margin a: the margin can be justified on the basis of the tolerable loss in power resulting from unbalanced allocation. In this case, equation (9) can be used to describe the relationship between r and the power. Both aspects are addressed in the numerical evaluation study below. Using the weights
4 Stratification in the presence of bias
We now turn to the question of bias. Two common forms of bias encountered in clinical trials are chronological bias due to time trends in patient outcomes,16 and selection bias, which can result in covariate imbalances and inflation of type I error rates.3 By definition, selection bias arises from the conscious or unconscious guessing of treatment assignments so that patients have a higher chance of assignment to the investigator's treatment of choice for those patients. While double-blinded studies and multi-center studies with a central randomization unit mitigate the possibility of selection bias, Berger17 gives numerous examples of when selection bias has arisen in practice. As the ICH E9 Guidelines note,12 “It is important to identify potential sources of bias as completely as possible so that attempts to limit such bias may be made…. The treatment effect and treatment comparisons should involve consideration of the potential contribution of bias to the p-value.”
We first specify a compound bias vector $\tau_{ji}$ for stratum j and patient i that is a linear combination of a metric of chronological bias and selection bias. Taking into account the stratified randomization, we explore a linear time-trend16 model per stratum similar to Hilgers6 given by
Hereby, the magnitude $\theta_j$ of the linear time trend varies between centers. Note that Hilgers6 proposed to formulate $\theta_j$ as fraction of the variance
5 Evaluation criterion
In our numerical evaluation study, we enumerate all possible randomization sequences for four different procedures and assess the impact of bias on the type I error rate by computing the proportion of sequences that preserve the type I error rate at the nominal (0.05) level. If there is no bias (e.g.
6 Numerical evaluation study
The objective of the following numerical evaluation study is to illustrate the effects of stratification in both the randomization and the test statistic. It is not intended to be a comprehensive simulation study, recognizing that the specification of the sample size as well as of $\theta_j$ and $\eta_j$ depends on the practical situation. To be more specific, we start with a K = 2 center clinical trial and use a total sample size of 80 patients with common $\theta_j$ and $\eta_j$ in all centers. The following reasoning leads to the specification of $\theta_j$ and $\eta_j$. Concerning the linear time trend $\theta_j$, it should be noted that although the $\theta_j$ are defined within each center, the maximal extent of the time trend should not exceed σ. In contrast, although the magnitude of the selection bias effect $\eta_j$ may vary between centers, it acts like a population effect within a center, and no restriction on its maximal extent may apply. To relate the total sample size of 80 in a K = 2 center clinical trial to the effect size, formula (9) is used. The hypothesis
Probability of stratified and unstratified randomization procedures to keep the 5% level for BSD(9), CR, EBC(0.67) and PBR(4) depending on the amount of selection
BSD: big stick design; EBC: Efron's biased coin design; PBR: permuted block randomization; CR: complete randomization.
In the case where both biases are present, the stratified randomization with stratified analysis performs worse than unstratified analysis scenarios. The magnitude does not depend on the balancing of sample sizes between centers (
7 Discussion
The approach presented in this paper for multi-center trials follows the ideas of the evaluation of randomization procedures for design optimization (ERDO)6 framework. However, as outlined, many aspects need to be addressed to demonstrate the contribution of randomization in mitigating bias during the planning phase of a multi-center trial.
Although Kraemer 18 discussed various RPs in clinical trials including stratification, the most common choice of stratified randomization is PBR with common block size.19–22 We have presented new aspects to formulate RPs, whether unrestricted or restricted, in order to induce the final balance or maximal tolerated imbalance including PBR in a stratified form. We have discussed the formulation of stratified unrestricted and restricted procedures forcing balance in probability, forcing balance by maximal tolerable imbalance, and forcing terminal balance as three subclassifications of restricted RPs.
There are several limitations of this study. First, our compound criterion for selection bias and chronological bias imposes similar scaling, but it is difficult or impossible to scale them identically. Second, the weighting of the two criteria is subjective and may be adjusted to account for the different scaling. Third, although our statistical test assumes homogeneous variances across centers, the methodology can be used with standardized observations in the case of known heterogeneous variances across centers.
Our proposed approach is demonstrated in a numerical evaluation study. Here, we use very specific settings, e.g. common selection bias and time-trend effects across centers, and limited sample sizes corresponding to a particular effect size. We are aware that this evaluation study does not mirror all practical situations. However, the specific practical situation of the multi-center clinical trial to be planned can easily be embedded into the evaluation study to demonstrate the corresponding effects. Moreover, the corresponding results for different evaluation metrics, e.g. mean type I error probability, are provided in supplementary tables. We used the R code provided in the supplement for all computations.
We have chosen to use a parametric t test as our evaluation statistic rather than the more natural randomization test. 23 Randomization tests can be computed easily through the Monte Carlo re-randomization methodology, although power considerations are computationally intensive. They tend to preserve type I error rates under time trends and have no distributional assumptions. 2 Randomization tests can be formulated easily incorporating stratification, but the theoretical results we have derived herein would be impossible for exact randomization tests or Monte Carlo re-randomization tests.
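For comparison, a Monte Carlo re-randomization test that respects stratification can be sketched as follows. Several choices are assumptions made for illustration only: the test statistic is the unweighted sum of within-stratum mean differences, the re-randomization draws from the RAR reference set within each stratum (label shuffling that keeps the arm totals per stratum), and all names are hypothetical. In practice, sequences would be redrawn from the RP actually used in the trial.

```python
import random

def mean_diff(y, seq):
    """Difference of arm means (E minus C) for one stratum."""
    e = [v for v, a in zip(y, seq) if a == "E"]
    c = [v for v, a in zip(y, seq) if a == "C"]
    return sum(e) / len(e) - sum(c) / len(c)

def stratified_statistic(strata):
    """Unweighted sum of within-stratum mean differences."""
    return sum(mean_diff(y, seq) for y, seq in strata)

def rerandomization_test(strata, n_rerand=2000, seed=0):
    """Two-sided Monte Carlo p-value; strata is a list of (responses, sequence)."""
    rng = random.Random(seed)
    observed = abs(stratified_statistic(strata))
    hits = 0
    for _ in range(n_rerand):
        redrawn = []
        for y, seq in strata:
            new_seq = list(seq)
            rng.shuffle(new_seq)  # RAR reference set: same arm totals per stratum
            redrawn.append((y, new_seq))
        # Small tolerance guards against floating-point ties.
        if abs(stratified_statistic(redrawn)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_rerand + 1)
```

The responses stay fixed and only the allocation is redrawn, so no distributional assumption is needed. This is also why the closed-form results derived for the t test have no direct counterpart here.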
Our theoretical derivation could be applied to a general class of weights $w_j$ including, in particular, the inverse variance approach, although we focus our numerical evaluation study on the weights $w_j = 1$ or
Sample size considerations are presented by various authors. Whereas Ruvuna24 and Vierron and Giraudeau25 used the normal approximation formula, Lin's15 approach is based on the t statistic. We presented a general sample size formula for the weighted t test with K centers which generalizes Lin's approach for the two center case and the weighted (
Although some authors mention that randomization is used to avoid bias, bias is quite likely to occur when the PBR is used, particularly when the block size is small. We present a general formal analytical approach to show how RPs are able to limit the impact of selection and chronological bias on the test decision.
The selection bias model used here originates from a natural preference for one of the treatments. Furthermore, assuming that the allocation process tends to produce a balanced allocation ratio, at least at the end of the trial, it seems very common that investigators believe the treatment used most frequently thus far is less likely to appear next. Combining these two arguments, it is plausible that an investigator who knows, or makes a best guess at, the next allocation chooses the next patient according to the expected next treatment. This is also in line with the patient's hope to be assigned to the better treatment. In summary, this process is unconscious or subconscious. The question is not whether selection bias occurs, but rather how much impact of bias one is willing to accept. This can be investigated with the proposed sensitivity analysis approach even in the planning phase. With this consideration, a unique approach is presented to link the randomization process of unrestricted or restricted procedures with the trial outcome.
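The guessing behavior described above can be turned into a per-patient shift of the expected response, as in the sketch below. The sign convention is an assumption: E is taken to be the preferred treatment and larger responses are taken to be better, so a patient enrolled when E is expected next is shifted by +η and a patient enrolled when C is expected next by −η. Function names are hypothetical.

```python
def selection_bias(seq, eta):
    """Expected-response shift per patient under the guessing strategy.

    The investigator expects the arm allocated less often so far; when the
    arms are tied, no guess is made and the shift is zero.
    """
    shifts, d = [], 0  # d = (#E - #C) before the current patient
    for arm in seq:
        if d < 0:        # E allocated less often so far: E is expected next
            shifts.append(eta)
        elif d > 0:      # C is expected next
            shifts.append(-eta)
        else:
            shifts.append(0.0)
        d += 1 if arm == "E" else -1
    return shifts

def stratified_selection_bias(sequences, etas):
    """Apply the policy separately within each center with its own eta_j."""
    return [selection_bias(seq, eta) for seq, eta in zip(sequences, etas)]

# One permuted block of size 4: the guess is only informative after an imbalance.
print(selection_bias(["E", "C", "C", "E"], 0.5))  # [0.0, -0.5, 0.0, 0.5]
```

In this block the last allocation is fully predictable, which illustrates why procedures that force terminal balance are exposed to this kind of bias.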
Of course, other forms of time trend, e.g. log-time trend and step time trend,16 or attrition bias could easily be implemented in the modeling and then used in a numerical evaluation study. For instance, attrition bias could be modeled by a variable taking the value 0 or 1 depending on missingness, which offers opportunities to study mechanisms such as missing at random.
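The three time-trend shapes mentioned above can be sketched for one stratum of n patients as follows. The scalings are assumptions chosen for illustration (patient index divided by n, and a step that starts at a given patient number); they are not taken from the cited reference.

```python
import math

def linear_trend(n, theta):
    """Shift grows linearly with the enrollment index and reaches theta at patient n."""
    return [theta * i / n for i in range(1, n + 1)]

def log_trend(n, theta):
    """Shift follows theta * log(i / n); it is zero for the last patient."""
    return [theta * math.log(i / n) for i in range(1, n + 1)]

def step_trend(n, theta, start):
    """Shift jumps from 0 to theta at patient number `start`."""
    return [theta if i >= start else 0.0 for i in range(1, n + 1)]

print(linear_trend(4, 1.0))   # [0.25, 0.5, 0.75, 1.0]
print(step_trend(4, 2.0, 3))  # [0.0, 0.0, 2.0, 2.0]
```

Any of these vectors can be added to a selection bias vector of the same length to form a compound bias per stratum.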
Within this paper, we formulate a biasing policy for selection and chronological bias for a two-arm, parallel group, multi-center trial, according to the weighted stratified t test procedure proposed by Fleiss. 1 We further derive the distribution of the stratified weighted test statistic to calculate the impact on the type I error rate. Finally, the impact of the combined additive bias in multi-center trials using the unstratified t test compared to the weighted stratified t test is demonstrated in a simulation study.
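The effect of a compound bias on the rejection rate for one given allocation sequence can also be approximated by simulation. The sketch below is a Monte Carlo stand-in, not the exact distributional result derived in this paper, and it makes several assumptions: a single stratum, the unstratified pooled two-sided t test, unit variance, the guessing-based selection bias with E preferred, and a linear time trend scaled by the patient index. The critical value 2.024 is the two-sided 5% quantile of the t distribution with 38 degrees of freedom, matching the 40-patient example sequence.

```python
import math
import random

def pooled_t(y, seq):
    """Ordinary two-sample t statistic with pooled variance."""
    e = [v for v, a in zip(y, seq) if a == "E"]
    c = [v for v, a in zip(y, seq) if a == "C"]
    m_e, m_c = sum(e) / len(e), sum(c) / len(c)
    ss = sum((v - m_e) ** 2 for v in e) + sum((v - m_c) ** 2 for v in c)
    s2 = ss / (len(y) - 2)
    return (m_e - m_c) / math.sqrt(s2 * (1 / len(e) + 1 / len(c)))

def bias_vector(seq, eta, theta):
    """Additive bias: guessing-based selection bias plus linear time trend."""
    n, d, out = len(seq), 0, []
    for i, arm in enumerate(seq, start=1):
        guess = eta if d < 0 else (-eta if d > 0 else 0.0)
        out.append(guess + theta * i / n)
        d += 1 if arm == "E" else -1
    return out

def rejection_rate(seq, eta, theta, crit, n_sim=2000, seed=0):
    """Share of simulated trials under H0 in which |t| exceeds crit."""
    rng = random.Random(seed)
    bias = bias_vector(seq, eta, theta)
    rejections = 0
    for _ in range(n_sim):
        y = [b + rng.gauss(0.0, 1.0) for b in bias]  # no true treatment effect
        if abs(pooled_t(y, seq)) > crit:
            rejections += 1
    return rejections / n_sim

seq = ["E", "C", "C", "E"] * 10  # one possible PBR(4) sequence, 40 patients
CRIT = 2.024                     # two-sided 5% critical value, t with 38 df
unbiased = rejection_rate(seq, eta=0.0, theta=0.0, crit=CRIT)
biased = rejection_rate(seq, eta=1.0, theta=0.0, crit=CRIT)
```

Repeating this over sequences drawn from an RP and recording the share of sequences whose rejection rate stays at or below 5% mimics the evaluation criterion, at the cost of simulation error that the analytical approach avoids.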
8 Conclusion
Stratification in the randomization process makes the analysis sensitive to bias, i.e. results in type I error inflation. Procedures forcing terminal balance perform worse when the study is prone to selection bias, irrespective of whether a time trend is additionally present. Unbalanced sample sizes between centers do not affect the results. This leads to the conclusion that stratification in the randomization should be considered carefully if bias is suspected to be present. In summary, the presented approach contributes to optimizing the design of clinical trials stratified by center with respect to improving the derived level of evidence.
Supplemental Material
Supplemental Material1 to Supplemental Material6 - Supplemental material for Design and analysis of stratified clinical trials in the presence of bias by Ralf-Dieter Hilgers, Martin Manolov, Nicole Heussen and William F Rosenberger in Statistical Methods in Medical Research
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the IDeAl project funded from the European Union Seventh Framework Programme (FP7 2007-2013) under grant agreement No. 602552. RDH received funding from the European Joint Programme on Rare Diseases within the European Union’s Horizon 2020 research and innovation program under grant agreement No. 825575. Part of the work was done while RDH attended the 2018 workshop on Design of Experiments: New Challenges at CIRM Luminy, France. RDH was granted computing resources for simulations by RWTH Aachen University under project rwth0334.
Supplemental material
Supplemental material is available for this article online.
References
