Cluster Randomized Trials Designed to Support Generalizable Inferences

Abstract

When planning a cluster randomized trial, evaluators often have access to an enumerated cohort representing the target population of clusters. Practicalities of conducting the trial, such as the need to oversample clusters with certain characteristics in order to improve trial economy or support inferences about subgroups of clusters, may preclude simple random sampling from the cohort into the trial, and thus interfere with the goal of producing generalizable inferences about the target population. We describe a nested trial design where the randomized clusters are embedded within a cohort of trial-eligible clusters from the target population and where clusters are selected for inclusion in the trial with known sampling probabilities that may depend on cluster characteristics (e.g., allowing clusters to be chosen to facilitate trial conduct or to examine hypotheses related to their characteristics). We develop and evaluate methods for analyzing data from this design to generalize causal inferences to the target population underlying the cohort. We present identification and estimation results for the expectation of the average potential outcome and for the average treatment effect, in the entire target population of clusters and in its non-randomized subset. In simulation studies, we show that all the estimators have low bias but markedly different precision. Cluster randomized trials where clusters are selected for inclusion with known sampling probabilities that depend on cluster characteristics, combined with efficient estimation methods, can precisely quantify treatment effects in the target population, while addressing objectives of trial conduct that require oversampling clusters on the basis of their characteristics.

Keywords

design, generalizability, transportability, cluster randomized trials, causal inference, interference

Introduction

When conducting cluster randomized trials evaluators often have access to an enumerated cohort representing the target population of clusters. For example, in the applications motivating our work—cluster randomized trials of vaccine effectiveness in U.S. nursing homes (National Library of Medicine (U.S.), 2013, 2018, 2019)—evaluators can often identify a roster of trial-eligible nursing homes using routinely collected data. Similarly, when conducting educational experiments, administrative data can be used to compile a list of all trial-eligible schools in a state (Tipton, 2013b). When clusters can be selected for participation from an enumerated cohort, the inferential goals of the trial and practicalities related to research economy may conflict with the goal of producing generalizable inferences for the target population underlying the enumerated cohort. For example, evaluators may be interested in oversampling certain groups of clusters to increase the trial’s ability to test hypotheses about heterogeneity of treatment effects (i.e., effect modification or moderation) or to ensure that the trial can produce reasonably precise estimates in cluster subgroups defined by baseline characteristics. Furthermore, some clusters may be oversampled if they have attributes that facilitate the conduct of the trial (e.g., have infrastructure that facilitates data collection), particularly when resources are constrained. In such cases, it is possible to select clusters for participation in the trial using sampling probabilities that depend on baseline characteristics and are under the control of evaluators—and thus, known by design. Such sampling can help achieve the goals of the trial while supporting the generalizability of inferences to the target population.

Most recent work on generalizability methods has focused on individually randomized trials where participants are not representative of the target population, and where evaluators do not have control over an individual’s decision to participate in the trial or access to an enumerated cohort of individuals is not common (Cole & Stuart, 2010; Dahabreh et al., 2019b, 2020; Rudolph & van der Laan, 2017; Westreich et al., 2017). Some work in educational and welfare policy research has discussed generalizability analyses with cluster randomized trial data in settings where cluster participation in the trial was not under evaluator control (e.g., O’Muircheartaigh and Hedges (2014), Tipton (2013a), and Tipton et al. (2017)). Recent work in educational research has mainly considered trials in which clusters are selected for participation by sampling within strata defined by effect modifiers, such that the average treatment effect in the trial may directly generalize to the target population (Tipton, 2013b; Tipton et al., 2014; Tipton & Peck, 2017). This work has mentioned, without providing details, the possibility of using simple weighting (Stuart et al., 2011) or stratification (Tipton, 2013a) estimators to adjust for imbalances between the sampled clusters and the target population (e.g., for effect modifiers that were not stratified on) or when clusters are sampled with unequal probabilities across strata [e.g., to optimally estimate stratum-specific treatment effects (Tipton et al., 2019)].

Prior generalizability work using cluster randomized trial data, whether participation of the clusters in the trial was under the control of evaluators or not, aggregated individual-level information to the cluster level and used weighting or stratification methods to generalize inferences to the target population by estimating average treatment effects, in the target population of clusters (O’Muircheartaigh & Hedges, 2014; Tipton, 2013a, 2013b; Tipton et al., 2017; Tipton et al., 2014; Tipton & Olsen, 2018; Tipton & Peck, 2017; Tipton et al., 2019). These approaches may be inefficient because they ignore information on the relationship of covariates and treatment with the outcome, available from randomized clusters at the end of the trial (but often unavailable from non-randomized clusters). Instead, these approaches use cluster-level baseline covariate information from only non-randomized clusters via the sampling probability (or the probability of trial participation when not under the evaluators’ control). By modeling the relationship of covariates and treatment with the outcome using either individual-level or cluster-level data, efficiency can be improved using an augmented inverse probability weighting estimator, without introducing bias even if the outcome model is misspecified, as long as the sampling probability is correctly specified (when the sampling probability is under investigator control, a model for it can always be correctly specified). Of note, generalizability analyses that use individual-level data require accounting for various forms of within-cluster dependence (Balzer et al., 2019), including causal interference [e.g., due to herd immunity effects (Halloran & Struchiner, 1995)], even if the clusters are assumed to be independent—an issue that is well-appreciated for analyses of cluster randomized trials (Donner & Klar, 2000; Murray et al., 1998).

Here, we combine recent advances in the analysis of cluster randomized trials (Balzer et al., 2019; Benitez et al., 2022) and prior work on generalizability analyses for individually randomized trials (Dahabreh et al., 2019b, 2020) to propose augmented weighting estimators for analyzing cluster randomized trials where the evaluators have sampled participating clusters using known sampling probabilities that depend on baseline characteristics. The methods we describe can be used to estimate causal effects in the target population from which participating clusters are sampled, while exploiting individual-level information on the relationship of covariates and treatment with the outcome and allowing for arbitrary within-cluster dependence. We show that knowledge of the sampling probabilities leads to augmented weighting estimators that are robust to misspecification of the outcome model. In addition to robustness, we show that knowledge of the sampling probabilities can be used to develop more efficient estimators of causal estimands that pertain to the non-randomized subset of the target population, but not those that pertain to the entire target population. We evaluate the finite-sample performance of the methods in a simulation study motivated by a cluster randomized trial of vaccine effectiveness in U.S. nursing homes. Our simulations show that the precision of non-augmented weighting estimators can be improved substantially by estimating the sampling and treatment probabilities (even when they are known), rather than using the known probabilities. 
In contrast, the precision of the augmented weighting estimators is fairly similar whether using the true or estimated probabilities, and superior to non-augmented weighting estimators when the models for estimating the probabilities include covariates in addition to those used in the design (e.g., as may be necessary in trials with modest sample sizes that have baseline imbalances or are not representative of the target population with respect to variables not used in the design).

Study Design, Data, and Causal Quantities of Interest

Study Design and Data

Consider the cluster version of the nested trial design for analyses extending inferences from an individually randomized trial to a target population (Dahabreh et al., 2021): among a cohort of clusters sampled from the target population (e.g., a cohort of trial-eligible clusters), a subset is chosen to participate in a randomized trial with cluster-level treatment assignment. We assume that participation of clusters in the trial is fully under the control of the investigators (e.g., as might be the case for interventions with favorable risk-benefit profiles when all clusters in the enumerated cohort can be potentially included in the trial).

We index clusters in the cohort of trial-eligible clusters by j ∈ {1, …, m}; the jth cluster has sample size N_j, and we allow the sample size to vary across clusters. Individuals in cluster j are indexed by i ∈ {1, …, N_j}. We use S_j for the cluster-level indicator of selection into the trial; S_j = 1 for randomized clusters and S_j = 0 for non-randomized clusters. For all clusters, both randomized and non-randomized, we have data on cluster-level covariates, X_j (i.e., covariates that are constant for all individuals within a given cluster), and a matrix of individual-level covariates, W _j, for all individuals in the cluster. For example, if p baseline covariates are collected from each individual in cluster j, then W _j has dimension N_j × p. The (row) components of W _j are vectors W_j,i, with dimension 1 × p, that contain information on individual-level covariates. Of note, various covariate “aggregates” across individuals in the same cluster can be included as cluster-level covariates (i.e., as elements of X_j), including cluster-level averages of individual-level covariates. For example, the proportion of residents in a nursing home with a certain comorbidity can be calculated using individual-level information and the resulting derived variable can be treated as a cluster-level covariate. Throughout, we assume that the covariates (X_j, W _j) are measured at baseline, so that they cannot be affected by trial participation or treatment.

Selection into the trial depends on sampling probabilities that are chosen by the evaluators and are allowed to depend on covariates (X_j, W _j), using a Bernoulli-type sampling scheme at the cluster level (Breslow & Wellner, 2007; Dahabreh et al., 2019a; Saegusa & Wellner, 2013). We use A_j to denote the cluster-level treatment assignment; we only consider finite sets of possible treatments, which we denote as $A$ . We use Y _j to denote the vector of individual-level outcomes, such that Y _j = (Y_j,i : i ∈ {1, …, N_j}) for each cluster j. We define the cluster-level average observed outcome in cluster j, ${\bar{Y}}_{j}$ , as ${\bar{Y}}_{j} = (1 / N_{j}) \sum_{i = 1}^{N_{j}} Y_{j, i}$ .

We assume independence across clusters, but we allow for arbitrary dependence among individuals within each cluster [sometimes referred to as a partial interference assumption (Hudgens & Halloran, 2008)]. Such dependence can occur via multiple mechanisms, including (1) shared exposures: individuals share measured and unmeasured cluster-level factors that affect the outcome and response to treatment; (2) contagion: occurrence of the outcome in one individual might affect the outcome of another individual in the same cluster; (3) covariate interference: one individual’s covariates may affect the outcomes of other individuals in the same cluster; or (4) treatment-outcome interference: one individual’s treatment assignment may affect the outcome of other individuals in the same cluster.

We collect data on baseline covariates from the cohort of trial-eligible clusters and view the cohort as a random sample from that target population of clusters. Treatment and outcome data are only needed from clusters participating in the trial; the observed data are independent and identically distributed realizations of the random tuple O _j = (X_j, W _j, S_j, S_j × A_j, S_j × Y _j), j ∈ {1, …, m}.

Causal Estimands

Let $Y_{j, i}^{a}$ be the potential (counterfactual) outcome for individual i in cluster j under intervention to assign treatment $a \in A$ (Robins & Greenland, 2000; Rubin, 1974) and let $Y_{j}^{a}$ denote the vector of potential outcomes in cluster j under intervention to assign treatment $a \in A$, such that $Y_{j}^{a} = (Y_{j, i}^{a} : i \in \{1, \dots, N_{j}\})$. We define the average potential outcome in cluster j, ${\bar{Y}}_{j}^{a}$, as ${\bar{Y}}_{j}^{a} = (1 / N_{j}) \sum_{i = 1}^{N_{j}} Y_{j, i}^{a}$.

Following Balzer et al. (2019) and Benitez et al. (2022), our causal quantity of interest is the (cluster-level) expectation of the average potential outcome in the target population of clusters, $E [{\bar{Y}}^{a}] .$ This expectation, over the entire target population of clusters, will be different from the expectation of the average potential outcome in the randomized subset of the target population, that is, $E [{\bar{Y}}^{a}] \neq E [{\bar{Y}}^{a} | S = 1]$ , when factors which affect the outcome are differentially distributed between clusters that are selected into the trial and those that are not. The average treatment effect in the target population comparing treatments $a \in A$ and $a^{'} \in A$ is a contrast of the corresponding expectations of the average potential outcomes $E [{\bar{Y}}^{a} - {\bar{Y}}^{a^{'}}] = E [{\bar{Y}}^{a}] - E [{\bar{Y}}^{a^{'}}]$ . Also of interest are causal quantities in the non-randomized subset of the target population. Specifically, the expectation of the average potential outcome in the non-randomized subset of the target population, $E [{\bar{Y}}^{a} | S = 0]$ , and the average treatment effect comparing treatments $a \in A$ and $a^{'} \in A$ , $E [{\bar{Y}}^{a} - {\bar{Y}}^{a^{'}} | S = 0] = E [{\bar{Y}}^{a} | S = 0] - E [{\bar{Y}}^{a^{'}} | S = 0]$ .

Identification

For the Target Population

Identifiability Conditions

The following conditions are sufficient to identify the expectation of the average potential outcome in the target population:

A1. Consistency of cluster-level average potential outcomes: if A_j = a, then ${\bar{Y}}_{j}^{a} = {\bar{Y}}_{j}$ for every j ∈ {1, …, m} and every $a \in A$.

A2. Conditional exchangeability over A in the cluster randomized trial: ${\bar{Y}}^{a} ⫫ A | X, W, S = 1$ for every a.

A3. Positivity of treatment assignment probability in the trial: Pr [A = a|X = x, W = w , S = 1] > 0 for every a, and every x and w with positive density in the trial.

A4. Conditional exchangeability over S: ${\bar{Y}}^{a} ⫫ S | X, W$ for every a.

A5. Positivity of trial participation: Pr [S = 1|X = x, W = w ] > 0 for every x and w with positive density in the target population.

Note that conditions A2 through A5 are supported by study design when clusters are selected using sampling probabilities known to the evaluators and treatment is randomly assigned among randomized clusters. Condition A1 should be judged on the basis of substantive knowledge, but can be rendered more plausible by study design (e.g., using appropriate definitions of clusters to ensure the partial interference assumption is plausible).

Of note, if evaluators are only interested in learning about average treatment effects (but not about the expectation of the average potential outcome under each treatment), identification is possible under weaker assumptions (see Dahabreh et al., 2019b, 2020, for analogous results in the case of individually randomized trials). We focus on identification of the expectation of the average potential outcome under each treatment because outcomes under different treatments are of inherent scientific and policy interest, and necessary for contextualizing treatment effects.

Identification

As shown in Online Appendix A, and similar to work on individually randomized trials (Dahabreh et al., 2019b), under the above conditions, the expectation of the average potential outcome in the target population, $E [{\bar{Y}}^{a}]$ , is identified by

ψ (a) \equiv E [E [\bar{Y} | X, W, S = 1, A = a]] .

where the outer expectation is over the distribution of the target population of clusters. The average treatment effect in the target population can be identified by taking differences between expectations of the average potential outcomes under different treatments.
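For intuition, the argument behind this identification result can be sketched informally as a chain of equalities, with each step annotated by the conditions it relies on. This is only an outline under conditions A1 through A5 as stated above; the formal proof is the one in Online Appendix A.

```latex
\begin{aligned}
E [{\bar{Y}}^{a}]
  &= E \big[ E [{\bar{Y}}^{a} \mid X, W] \big]
     && \text{(law of iterated expectations)} \\
  &= E \big[ E [{\bar{Y}}^{a} \mid X, W, S = 1] \big]
     && \text{(A4, with A5 so the conditioning event is well defined)} \\
  &= E \big[ E [{\bar{Y}}^{a} \mid X, W, S = 1, A = a] \big]
     && \text{(A2, with A3)} \\
  &= E \big[ E [\bar{Y} \mid X, W, S = 1, A = a] \big]
     && \text{(A1)} \\
  &\equiv ψ (a) .
\end{aligned}
```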

For the Non-randomized Subset of the Target Population

Identifiability Conditions

To identify the expectation of the average potential outcome in the non-randomized subset of the target population, we retain conditions A1 through A4 and replace condition A5 by the following, slightly weaker, condition:

A5*. Positivity of trial participation: Pr [S = 1|X = x, W = w ] > 0 for every x and w with positive density among the non-randomized subset of the target population. Condition A5* is supported by the study design when the sampling probabilities are under the control of the evaluators.

Identification

As shown in Online Appendix A, and similar to work on individually randomized trials (Dahabreh et al., 2020), under identifiability conditions A1 through A4 and condition A5*, the expectation of the average potential outcome in the non-randomized subset of the target population, $E [{\bar{Y}}^{a} | S = 0]$ , is identified by

ϕ (a) \equiv E [E [\bar{Y} | X, W, S = 1, A = a] | S = 0] .

The average treatment effect in the non-randomized subset of the target population can be identified by taking differences between the expectations of the average potential outcomes in the non-randomized subset of the target population under different treatments.

Estimation and Inference

For the Target Population

Estimation

We propose the following augmented inverse probability of selection weighting estimator for ψ(a):

\hat{ψ} (a) = \frac{1}{m} \sum_{j = 1}^{m} \left\{ \frac{I (S_{j} = 1, A_{j} = a)}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} \{{\bar{Y}}_{j} - {\hat{g}}_{a} (X_{j}, W_{j})\} + {\hat{g}}_{a} (X_{j}, W_{j}) \right\},

(1)

where $\hat{p} (X, W)$ is an estimator for Pr [S = 1|X, W ]; ${\hat{e}}_{a} (X, W)$ is an estimator for Pr[A = a|X, W , S = 1] (the known-by-design sampling and treatment assignment probabilities can be used instead); and ${\hat{g}}_{a} (X, W)$ is an estimator for $E [\bar{Y} | X, W, S = 1, A = a]$. In Online Appendix B, we show that this estimator is the efficient one whether the functions Pr [S = 1|X, W ] and Pr [A = a|X, W , S] are known to the evaluators or have to be estimated. In Online Appendix C, we show that $\hat{ψ} (a)$ is robust in the sense that it converges to ψ(a) regardless of whether the estimator ${\hat{g}}_{a} (X, W)$ is consistent for $E [\bar{Y} | X, W, S = 1, A = a]$.
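To make the structure of the estimator in equation (1) concrete, the following is a minimal Python sketch, not the implementation used for the analyses in this paper. It assumes that the three nuisance quantities have already been evaluated for every cluster in the cohort and are passed in as arrays of length m; the function name `aipw_target` and all argument names are hypothetical.

```python
import numpy as np

def aipw_target(s, a_obs, ybar, p_hat, e_hat, g_hat, a=1):
    """Sketch of equation (1) from cluster-level arrays of length m.

    s      : trial indicator S_j (1 = randomized, 0 = non-randomized)
    a_obs  : assigned treatment A_j (ignored where s == 0; may be NaN there)
    ybar   : cluster-average outcome (ignored where s == 0; may be NaN there)
    p_hat  : sampling probability Pr[S = 1 | X, W], known or estimated
    e_hat  : treatment probability Pr[A = a | X, W, S = 1], known or estimated
    g_hat  : outcome-model prediction of E[Ybar | X, W, S = 1, A = a]
    """
    s, a_obs, ybar = (np.asarray(v, dtype=float) for v in (s, a_obs, ybar))
    p_hat, e_hat, g_hat = (np.asarray(v, dtype=float) for v in (p_hat, e_hat, g_hat))

    # Indicator I(S_j = 1, A_j = a): only these clusters contribute a residual.
    in_arm = (s == 1) & (a_obs == a)
    resid = np.where(in_arm, ybar - g_hat, 0.0)

    # Weighted residual plus outcome-model prediction, averaged over all m clusters.
    return float(np.mean(resid / (p_hat * e_hat) + g_hat))
```

Setting `g_hat` to zero for every cluster reduces the sketch to a non-augmented inverse probability weighting estimator, which is one way to check an implementation on a toy data set.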

Inference

We estimate the sampling variance of $\hat{ψ} (a)$ as

{\hat{σ}}_{\hat{ψ} (a)}^{2} = \frac{1}{m} \widehat{\mathrm{Var}} [{\hat{Ψ}}_{j}^{1} (a)],

(2)

where $\widehat{\mathrm{Var}} [{\hat{Ψ}}_{j}^{1} (a)]$ is the sample variance of the influence curve ${\hat{Ψ}}_{j}^{1} (a)$ (the “sample analog” of the influence function (Tsiatis, 2007) we give in Online Appendix B):

{\hat{Ψ}}_{j}^{1} (a) = \frac{I (S_{j} = 1, A_{j} = a)}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} \{{\bar{Y}}_{j} - {\hat{g}}_{a} (X_{j}, W_{j})\} + {\hat{g}}_{a} (X_{j}, W_{j}) - \hat{ψ} (a) .

The sampling variance can be used to obtain a 100(1 − α)% confidence interval as $(\hat{ψ} (a) \pm z_{1 - α / 2} \times {\hat{σ}}_{\hat{ψ} (a)})$, where $z_{1 - α / 2}$ is the (1 − α/2) quantile of the standard normal distribution. Alternatively, inference may be obtained via the non-parametric bootstrap (Efron & Tibshirani, 1994), accounting for clustering (Balzer et al., 2019).
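The influence-curve-based variance in equation (2) and the resulting Wald-type interval can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the function name `aipw_with_ci` is hypothetical, the nuisance quantities are assumed to be supplied as cluster-level arrays, and the sample variance is computed with the n − 1 denominator (the text does not specify which denominator is used).

```python
import numpy as np
from statistics import NormalDist

def aipw_with_ci(s, a_obs, ybar, p_hat, e_hat, g_hat, a=1, alpha=0.05):
    """Point estimate, standard error, and Wald interval for psi(a)."""
    s, a_obs, ybar = (np.asarray(v, dtype=float) for v in (s, a_obs, ybar))
    p_hat, e_hat, g_hat = (np.asarray(v, dtype=float) for v in (p_hat, e_hat, g_hat))
    m = s.size

    in_arm = (s == 1) & (a_obs == a)
    terms = np.where(in_arm, ybar - g_hat, 0.0) / (p_hat * e_hat) + g_hat
    psi = terms.mean()

    # Estimated influence-curve values: per-cluster terms centered at psi.
    infl = terms - psi
    # Equation (2): sample variance of the influence curve divided by m.
    # ddof=1 is an assumption of this sketch.
    se = np.sqrt(infl.var(ddof=1) / m)

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return float(psi), float(se), (float(psi - z * se), float(psi + z * se))
```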

For the Non-randomized Subset of the Target Population

Estimation

We propose the following augmented inverse odds of selection weighting estimator for ϕ(a):

\begin{array}{rl} \hat{ϕ} (a) = & \left\{\sum_{j = 1}^{m} I (S_{j} = 0)\right\}^{- 1} \sum_{j = 1}^{m} \frac{I (S_{j} = 1, A_{j} = a) \{1 - \hat{p} (X_{j}, W_{j})\}}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} \{{\bar{Y}}_{j} - {\hat{g}}_{a} (X_{j}, W_{j})\} \\ & + \left\{\sum_{j = 1}^{m} I (S_{j} = 0)\right\}^{- 1} \sum_{j = 1}^{m} \{1 - \hat{p} (X_{j}, W_{j})\} {\hat{g}}_{a} (X_{j}, W_{j}) . \end{array}

(3)

In Online Appendix B, we show that $\hat{ϕ} (a)$ is different from the efficient estimator when the function Pr [S = 1|X, W ] is not known. Specifically, when Pr [S = 1|X, W ] is not known the efficient estimator would have the term $\{1 - \hat{p} (X_{j}, W_{j})\}$ replaced by the indicator for an observation belonging to the non-randomized clusters, I(S_j = 0) (see Dahabreh et al., 2020). Thus, at least in principle, knowledge of the sampling probability can lead to efficiency improvements. This phenomenon is analogous to well-known observations about the estimation of treatment effects in observational studies of point treatments with no unmeasured confounding. In that context, knowledge of the probability of treatment can be used to improve efficiency when estimating the average treatment effect on the treated, but not for the average treatment effect in the entire population underlying the observational study (Hahn, 1998). In Online Appendix C, we also show that $\hat{ϕ} (a)$ is robust in the sense that it converges to ϕ(a) whether or not the estimator ${\hat{g}}_{a} (X, W)$ is consistent for $E [\bar{Y} | X, W, S = 1, A = a]$.
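A minimal Python sketch of the augmented inverse odds weighting estimator in equation (3) is given below. It is an illustration only: the function name `aiow_nonrandomized` and its arguments are hypothetical, and the nuisance quantities are assumed to be available as cluster-level arrays of length m.

```python
import numpy as np

def aiow_nonrandomized(s, a_obs, ybar, p_hat, e_hat, g_hat, a=1):
    """Sketch of equation (3): estimate of phi(a) for the non-randomized subset.

    Arguments are cluster-level arrays of length m; a_obs and ybar are
    ignored (and may be NaN) for clusters with s == 0.
    """
    s, a_obs, ybar = (np.asarray(v, dtype=float) for v in (s, a_obs, ybar))
    p_hat, e_hat, g_hat = (np.asarray(v, dtype=float) for v in (p_hat, e_hat, g_hat))

    n0 = np.sum(s == 0)                      # number of non-randomized clusters
    in_arm = (s == 1) & (a_obs == a)
    resid = np.where(in_arm, ybar - g_hat, 0.0)

    # Inverse-odds-weighted residuals from trial clusters assigned to a.
    weighted_resid = (1 - p_hat) / (p_hat * e_hat) * resid
    # Augmentation term uses {1 - p_hat} for every cluster in the cohort,
    # which is where knowledge of the sampling probability enters.
    augmentation = (1 - p_hat) * g_hat

    return float((weighted_resid.sum() + augmentation.sum()) / n0)
```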

Inference

We estimate the sampling variance of $\hat{ϕ} (a)$ as

{\hat{σ}}_{\hat{ϕ} (a)}^{2} = \frac{1}{m} \widehat{\mathrm{Var}} [{\hat{Φ}}_{j}^{1} (a)],

(4)

where $\widehat{\mathrm{Var}} [{\hat{Φ}}_{j}^{1} (a)]$ is the sample variance of the influence curve ${\hat{Φ}}_{j}^{1} (a)$ (the “sample analog” of the influence function we give in Online Appendix B):

{\hat{Φ}}_{j}^{1} (a) = \frac{1}{\hat{π}} \left[ \frac{I (S_{j} = 1, A_{j} = a) \{1 - \hat{p} (X_{j}, W_{j})\}}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} \{{\bar{Y}}_{j} - {\hat{g}}_{a} (X_{j}, W_{j})\} + \{1 - \hat{p} (X_{j}, W_{j})\} \{{\hat{g}}_{a} (X_{j}, W_{j}) - \hat{ϕ} (a)\} \right],

where $\hat{π}$ is an estimator for Pr [S = 0], that is, $\hat{π} = (1 / m) \sum_{j = 1}^{m} I (S_{j} = 0)$. The sampling variance can be used to obtain a 100(1 − α)% confidence interval as $(\hat{ϕ} (a) \pm z_{1 - α / 2} \times {\hat{σ}}_{\hat{ϕ} (a)})$, where $z_{1 - α / 2}$ is the (1 − α/2) quantile of the standard normal distribution. Inference may also be obtained via the non-parametric bootstrap (Efron & Tibshirani, 1994), accounting for clustering (Balzer et al., 2019).
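The variance calculation in equation (4) can be sketched in the same way. As before, this is a hedged illustration: the function name `aiow_with_ci` is hypothetical, inputs are assumed to be cluster-level arrays, and the n − 1 denominator for the sample variance is an assumption of the sketch.

```python
import numpy as np
from statistics import NormalDist

def aiow_with_ci(s, a_obs, ybar, p_hat, e_hat, g_hat, a=1, alpha=0.05):
    """Point estimate, standard error, and Wald interval for phi(a)."""
    s, a_obs, ybar = (np.asarray(v, dtype=float) for v in (s, a_obs, ybar))
    p_hat, e_hat, g_hat = (np.asarray(v, dtype=float) for v in (p_hat, e_hat, g_hat))
    m = s.size

    pi_hat = np.mean(s == 0)                 # estimator of Pr[S = 0]
    in_arm = (s == 1) & (a_obs == a)
    weighted_resid = (1 - p_hat) / (p_hat * e_hat) * np.where(in_arm, ybar - g_hat, 0.0)

    # Equation (3), written with pi_hat * m = number of non-randomized clusters.
    phi = (weighted_resid.sum() + ((1 - p_hat) * g_hat).sum()) / (pi_hat * m)

    # Estimated influence-curve values for each cluster.
    infl = (weighted_resid + (1 - p_hat) * (g_hat - phi)) / pi_hat
    se = np.sqrt(infl.var(ddof=1) / m)       # equation (4); ddof=1 is assumed

    z = NormalDist().inv_cdf(1 - alpha / 2)
    return float(phi), float(se), (float(phi - z * se), float(phi + z * se))
```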

Average Treatment Effects

The expectation of the average treatment effects can be estimated by taking differences between pairs of the estimators of the expectation of the average potential outcome described above. For example, the expectation of the average treatment effect in the entire target population, comparing treatments a and a′, using the augmented weighting estimator in equation (1), can be estimated as $\hat{ψ} (a) - \hat{ψ} (a^{'})$ . Analogous treatment effect estimators can be obtained for the non-randomized subset of the target population.

Modeling Participation, Treatment, and Outcomes

As noted above, the sampling probability and the probability of treatment in the trial are both known by design and can be used to estimate the expectation of the average potential outcomes and average treatment effects. Nevertheless, estimating these probabilities using simple parametric models (at the cluster level) can result in more precise estimates (Lunceford & Davidian, 2004; Williamson et al., 2014). We illustrate this behavior of the estimators in the simulation studies presented in the next section.

In contrast, the expectation of the average observed outcome conditional on baseline covariates and treatment in the trial, $E [\bar{Y} | X, W, S = 1, A = a]$ , is not known and has to be estimated. To use individual-level information when estimating this expectation, we modify the strategy of Balzer et al. (2019) for use in the context of analyses extending causal inferences to a target population. To begin, we specify and fit a working regression model for the conditional expectation of the individual-level outcome, Y_j,i, given cluster-level covariates, X_j, and the individual’s covariates, W_j,i (not the entire matrix W _j); the regression can be fit separately by treatment arm, to allow for heterogeneity of treatment effects. Next, we obtain estimated values from the fitted model on all individuals in the data, regardless of trial selection status. We denote these predictions as ${\hat{h}}_{a} (X_{j}, W_{j, i})$ . Last, we obtain estimates ${\hat{g}}_{a} (X_{j}, W_{j})$ by averaging the predictions over individuals in each cluster, ${\hat{g}}_{a} (X_{j}, W_{j}) = (1 / N_{j}) \sum_{i = 1}^{N_{j}} {\hat{h}}_{a} (X_{j}, W_{j, i})$ . We note that it is also possible (and may be necessary in some cases when data have already been aggregated to the cluster level) to estimate $E [\bar{Y} | X, W, S = 1, A = a]$ using only cluster-level data (e.g., by regressing cluster-level averages of the observed outcomes on cluster characteristics and cluster-level averages, or other summaries, of individual characteristics).
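The three steps of the individual-level strategy (fit a working model in one trial arm, predict for every individual in the cohort, average within cluster) can be sketched as follows. To keep the example dependency-free, the working model here is an ordinary least squares fit, which is a stand-in chosen for this sketch and not necessarily the model used in our analyses; a logistic or other regression could be substituted. All function and argument names are hypothetical.

```python
import numpy as np

def g_hat_from_individuals(x_cluster, w_ind, cluster_id, y_ind, s, a_obs, a):
    """Cluster-level predictions g_hat_a(X_j, W_j) from an individual-level model.

    x_cluster  : (m,) cluster-level covariate X_j
    w_ind      : (n, p) individual-level covariates W_{j,i}
    cluster_id : (n,) integer cluster labels in 0..m-1
    y_ind      : (n,) individual outcomes (only used in trial arm a)
    s, a_obs   : (m,) cluster-level trial indicator and assigned treatment
    """
    x_cluster, s, a_obs = (np.asarray(v, dtype=float) for v in (x_cluster, s, a_obs))
    w_ind = np.asarray(w_ind, dtype=float).reshape(len(cluster_id), -1)
    y_ind = np.asarray(y_ind, dtype=float)
    cluster_id = np.asarray(cluster_id)
    m = x_cluster.size

    # Design matrix: intercept, the cluster covariate, the individual's covariates.
    design = np.column_stack([np.ones(len(cluster_id)), x_cluster[cluster_id], w_ind])

    # Step 1: fit the working model among individuals in trial clusters assigned a.
    in_arm = (s[cluster_id] == 1) & (a_obs[cluster_id] == a)
    coef, *_ = np.linalg.lstsq(design[in_arm], y_ind[in_arm], rcond=None)

    # Step 2: predictions h_hat_a for all individuals, regardless of selection.
    h_hat = design @ coef

    # Step 3: average the predictions within each cluster.
    sums = np.bincount(cluster_id, weights=h_hat, minlength=m)
    counts = np.bincount(cluster_id, minlength=m)
    return sums / counts
```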

When individual-level data are available, the choice of whether to model the outcome at the individual level or the cluster level is not obvious. In general, the model of the outcome conditional on covariates and treatment at the individual level will be different from the model of the cluster-level average of the outcome. Furthermore, our ability to specify and estimate either of these models will depend on background knowledge, properties of the data generating mechanism, and data attributes (e.g., sample size and data availability). That said, we note that by using equation (1) or (3), the estimator remains consistent regardless of the specification of the outcome model because the sampling probability is known by design and investigators can always correctly specify a model for it.

Simulation Studies

We conducted a simulation study to verify the performance of the proposed augmented weighting estimators and to compare them against non-augmented weighting estimators. The choices in the simulation study, such as cluster size and the number of clusters in the sample from the target population, were informed by recently completed and ongoing trials of vaccine effectiveness in U.S. nursing homes (Gravenstein et al., 2016; Gravenstein et al., 2017; Gravenstein et al., 2021; National Library of Medicine (U.S.), 2013, 2018, 2019). We considered scenarios with different treatment effects (including a scenario under the sharp null hypothesis), different magnitudes of heterogeneity (weaker or stronger), presence or absence of interference, and with or without outcome model misspecification (for models fit at the individual level). Table 1 summarizes the scenarios we considered; in the remainder of this section, we focus on describing Scenario 1 that has strong heterogeneity, presence of interference, and a correctly specified outcome model (at the individual level).

Table 1.

Simulation Study Scenarios for the Outcome Data Generating Mechanism and Outcome Model Specification.

Scenario	Treatment Effect and Heterogeneity	Interference	Linear	Outcome Model
1	Stronger	Yes	Yes	Correctly specified
2	Stronger	No	Yes	Correctly specified
3	Weaker	Yes	Yes	Correctly specified
4	None (sharp null)	Yes	Yes	Correctly specified
5	Stronger	Yes	Yes	Misspecified
6	Stronger	Yes	No	Misspecified

Scenario 1 is the scenario presented in the main text; Scenario 2 does not have interference; Scenario 3 considers a weaker treatment effect and weaker heterogeneity; Scenario 4 has no treatment effect (and no heterogeneity); Scenarios 5 and 6 examine the impact of outcome model misspecification.

Treatment effect and heterogeneity = magnitude of the treatment effect and of its heterogeneity, where none (sharp null) means no treatment effect or heterogeneity in the outcome generating mechanism; Interference = indicates whether covariate interference is present in the outcome data generating mechanism; Linear = indicates whether the outcome data generating mechanism is free of nonlinear terms (No means nonlinear terms are present); Outcome model = indicates whether the individual-level outcome model is correctly specified.

Baseline Data Generation

We generated a sample of m = 5000 trial-eligible clusters from the target population. Each cluster had a sample size N_j, j = 1, …, m, randomly drawn from a Poisson distribution with mean parameter of 100. Thus, the number of individuals in each cluster varied, but on average, there were approximately 100 individuals per cluster, which is similar to the sample sizes in the trials we used to motivate the simulation.

We generated a binary cluster-level covariate, X_j, from a Bernoulli distribution with parameter Pr [X_j = 1] = 0.05. In each cluster, we generated individual-level covariates by generating two column vectors of W _j, that is, W _1,j and W _2,j. For each cluster j, we denote the elements of W _1,j and W _2,j as W_1,j,i and W_2,j,i, for i = 1, …, N_j; we generated these elements using draws from two independent cluster-specific normal distributions, each with its own cluster-specific mean and a variance of 1. We independently drew the cluster-specific mean for each of these individual-level covariates from a continuous uniform distribution from −1 to 1. We define the cluster-level average of W₁ as ${\bar{W}}_{1, j} = (1 / N_{j}) \sum_{i = 1}^{N_{j}} W_{1, j, i}$ and the cluster-level average of W₂ as ${\bar{W}}_{2, j} = (1 / N_{j}) \sum_{i = 1}^{N_{j}} W_{2, j, i}$.
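The baseline data generation described above can be sketched in Python as follows. This is an illustrative reconstruction, not the simulation code used for the results reported here; the function name `generate_baseline`, the dictionary keys, and the guard against empty clusters are assumptions of the sketch.

```python
import numpy as np

def generate_baseline(m=5000, mean_size=100, prob_x=0.05, seed=0):
    """Simulate cluster sizes, X_j, and individual covariates W_1, W_2."""
    rng = np.random.default_rng(seed)

    # Cluster sizes N_j ~ Poisson(100). The floor at 1 is an assumption of
    # this sketch to avoid empty clusters (a zero draw is essentially impossible).
    n = np.maximum(rng.poisson(mean_size, size=m), 1)

    # Binary cluster-level covariate with Pr[X_j = 1] = 0.05.
    x = rng.binomial(1, prob_x, size=m)

    # Cluster-specific means, drawn independently from Uniform(-1, 1).
    mu1 = rng.uniform(-1, 1, size=m)
    mu2 = rng.uniform(-1, 1, size=m)

    # One row per individual; cluster_id maps individuals to clusters.
    cluster_id = np.repeat(np.arange(m), n)
    w1 = rng.normal(mu1[cluster_id], 1.0)
    w2 = rng.normal(mu2[cluster_id], 1.0)

    # Cluster-level averages of the individual-level covariates.
    w1_bar = np.bincount(cluster_id, weights=w1, minlength=m) / n
    w2_bar = np.bincount(cluster_id, weights=w2, minlength=m) / n

    return {"n": n, "x": x, "cluster_id": cluster_id,
            "w1": w1, "w2": w2, "w1_bar": w1_bar, "w2_bar": w2_bar}
```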

Selecting the Clusters in the Randomized Trial With Known Sampling Probabilities

We simulated trials with different cluster sample sizes: 50, 100, or 200 clusters. We sampled clusters from a cohort of 5000 trial-eligible clusters into the randomized trial so that Pr [X_j = 1|S_j = 1] = 0.5; that is to say, we wanted the clusters enrolled in the trial to be (approximately) equally split between the two possible levels of the cluster-level covariate X_j. To accomplish this, we had to oversample clusters with X_j = 1 and undersample clusters with X_j = 0.

For Bernoulli-type sampling of clusters from the target population sample, we used the sampling probability given by

\Pr [S_{j} = 1 | X_{j} = x] = \frac{\Pr [S_{j} = 1]}{\Pr [X_{j} = x]} \Pr [X_{j} = x | S_{j} = 1], \quad \text{for } x = 0, 1 .

For example, suppose that the targeted trial sample size was 50 clusters, the target population sample was 5000 clusters, and the desired proportion of clusters with X_j = 1 in the trial was 0.5. Then, among clusters with X_j = 1, we set the known-by-design sampling probability to $\Pr [S_{j} = 1 | X_{j} = 1] = \frac{50 / 5000}{0.05} \times 0.5 = 0.1$; similarly, among clusters with X_j = 0 we set the sampling probability to $\Pr [S_{j} = 1 | X_{j} = 0] = \frac{50 / 5000}{0.95} \times 0.5 \approx 0.005$. Note that when designing cluster randomized trials, the quantities Pr [S_j = 1] and Pr [X_j = x|S_j = 1] would reflect the choice of the evaluators for the design of the trial and Pr [X_j = x] would be chosen on the basis of background knowledge about the target population or empirically estimated in the target population sample (in the simulation, we used the estimated Pr [X_j = x] value in each run of the simulation).
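The sampling-probability calculation and the Bernoulli-type selection step can be sketched as follows. The function names `sampling_probabilities` and `select_clusters` are hypothetical, and the sketch takes Pr [X_j = 1] as an input rather than estimating it from the cohort.

```python
import numpy as np

def sampling_probabilities(n_trial, m, prob_x1, target_share_x1=0.5):
    """Known-by-design Pr[S = 1 | X = x] for x = 0 and x = 1."""
    pr_s = n_trial / m                                   # Pr[S = 1]
    p1 = pr_s / prob_x1 * target_share_x1                # Pr[S = 1 | X = 1]
    p0 = pr_s / (1 - prob_x1) * (1 - target_share_x1)    # Pr[S = 1 | X = 0]
    return p0, p1

def select_clusters(x, p0, p1, rng):
    """Bernoulli-type sampling: each cluster is selected independently."""
    x = np.asarray(x)
    return rng.binomial(1, np.where(x == 1, p1, p0))

# Worked example from the text: 50 trial clusters out of 5000, Pr[X = 1] = 0.05.
p0, p1 = sampling_probabilities(n_trial=50, m=5000, prob_x1=0.05)
```

Because selection is Bernoulli-type, the realized number of trial clusters varies around the targeted size from one draw to the next.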

Treatment and Outcome Generation

Treatment A_j was randomized at the cluster level, following a Bernoulli distribution with parameter Pr[A_j = 1|S_j = 1] = 0.5. For each individual i in cluster j, we calculated the linear predictor $L_{j, i} = (2 A_{j} - 1) X_{j} + 0.5 (2 A_{j} - 1) W_{1, j, i} + 0.5 (2 A_{j} - 1) W_{2, j, i} + 0.5 (2 A_{j} - 1) {\bar{W}}_{1, j} + 0.5 (2 A_{j} - 1) {\bar{W}}_{2, j}$ . We then simulated binary individual-level outcomes from a Bernoulli distribution with parameter Pr [Y_j,i = 1|X_j, W_1,j,i, W_2,j,i, S_j = 1, A_j] = exp (L_j,i)/{1+ exp (L_j,i)}. Note that including ${\bar{W}}_{1, j}$ and ${\bar{W}}_{2, j}$ in the linear predictor of the probability of the outcome induces covariate interference. Also note that the product terms between the covariates and treatment in the generative model for the outcome induce heterogeneity of treatment effects.
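As an illustration of this outcome model, the sketch below computes the linear predictor and draws binary outcomes for one cluster. It uses only the Python standard library; the function names, the toy covariate values, and the seed are assumptions made for the example, and this is not the code used in the simulation study.

```python
import math
import random

def linear_predictor(a, x, w1, w2, w1_bar, w2_bar):
    """L_{j,i} from the text; t = 2a - 1 is +1 under treatment, -1 under control."""
    t = 2 * a - 1
    return t * x + 0.5 * t * (w1 + w2 + w1_bar + w2_bar)

def simulate_cluster_outcomes(rng, a, x, w1, w2):
    """Draw binary outcomes for every individual in one cluster."""
    n_j = len(w1)
    # Cluster-level averages enter each individual's predictor,
    # which is what induces covariate interference.
    w1_bar = sum(w1) / n_j
    w2_bar = sum(w2) / n_j
    outcomes = []
    for w1_i, w2_i in zip(w1, w2):
        lp = linear_predictor(a, x, w1_i, w2_i, w1_bar, w2_bar)
        prob = 1.0 / (1.0 + math.exp(-lp))  # equals exp(lp) / (1 + exp(lp))
        outcomes.append(1 if rng.random() < prob else 0)
    return outcomes

rng = random.Random(7)  # arbitrary seed
a_j = 1 if rng.random() < 0.5 else 0  # cluster-level randomization
y = simulate_cluster_outcomes(rng, a_j, x=1, w1=[0.2, -0.4, 1.1], w2=[0.0, 0.5, -0.3])
y_bar = sum(y) / len(y)  # cluster-average outcome
```

Because every term is multiplied by 2A_j − 1, the linear predictor under control is the negative of the predictor under treatment for the same covariate values.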

Estimators

We considered estimation of the expectation of average potential outcomes and the average treatment effects in the entire target population and its non-randomized subset. When estimating quantities in the entire target population, we applied the augmented inverse probability weighting estimator in equation (1) with the outcome-model fit using cluster-level information (AIPW1) or individual-level information (AIPW2). We also considered the following non-augmented inverse probability weighting estimator (IPW):

\begin{array}{c} {\hat{ψ}}_{w} (a) = \frac{1}{m} \sum_{j = 1}^{m} \frac{I (S_{j} = 1, A_{j} = a) {\bar{Y}}_{j}}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} . \end{array}

(5)

This estimator can be viewed as a special case of $\hat{ψ} (a)$ with the outcome model terms ${\hat{g}}_{a} (X, W)$ set identically to 0.
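A minimal sketch of the non-augmented inverse probability weighting estimator in equation (5) follows. The dictionary keys, the function name `ipw_estimate`, and the four-cluster toy data are hypothetical and chosen only to make the arithmetic easy to follow; the probabilities passed in could be either known by design or estimated.

```python
def ipw_estimate(clusters, a):
    """Non-augmented IPW estimate for treatment level a, following equation (5).

    Each cluster is a dict with hypothetical keys:
      S     - 1 if sampled into the trial, else 0
      A     - assigned treatment (None if S == 0)
      Y_bar - cluster-average outcome (None if S == 0)
      p     - sampling probability for the cluster
      e1    - probability of A = 1 in the trial
    """
    m = len(clusters)  # all clusters in the cohort, sampled or not
    total = 0.0
    for c in clusters:
        if c["S"] == 1 and c["A"] == a:
            e_a = c["e1"] if a == 1 else 1.0 - c["e1"]
            total += c["Y_bar"] / (c["p"] * e_a)
    return total / m

toy = [
    {"S": 1, "A": 1, "Y_bar": 0.6, "p": 0.5, "e1": 0.5},
    {"S": 1, "A": 0, "Y_bar": 0.2, "p": 0.5, "e1": 0.5},
    {"S": 0, "A": None, "Y_bar": None, "p": 0.5, "e1": 0.5},
    {"S": 0, "A": None, "Y_bar": None, "p": 0.5, "e1": 0.5},
]
psi1 = ipw_estimate(toy, a=1)
psi0 = ipw_estimate(toy, a=0)
ate = psi1 - psi0
```

In this toy cohort each treated or control trial cluster carries a weight of 1/(0.5 × 0.5) = 4, so `psi1` is 0.6, `psi0` is 0.2, and their difference is 0.4 (up to floating-point error).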

When estimating quantities in the non-randomized subset of the target population, we used the augmented inverse odds weighting estimator in equation (3) with the outcome-model fit using only cluster-level information (AIOW1) or both cluster- and individual-level information (AIOW2). We compared these estimators against the following non-augmented inverse odds weighting estimator (IOW):

\begin{array}{c} {\hat{ϕ}}_{w} (a) = \left\{ \sum_{j = 1}^{m} I (S_{j} = 0) \right\}^{- 1} \sum_{j = 1}^{m} \frac{I (S_{j} = 1, A_{j} = a) \left\{ 1 - \hat{p} (X_{j}, W_{j}) \right\} {\bar{Y}}_{j}}{\hat{p} (X_{j}, W_{j}) {\hat{e}}_{a} (X_{j}, W_{j})} . \end{array}

(6)

Similar to the inverse probability weighting estimator above, ${\hat{ϕ}}_{w} (a)$ can be viewed as a special case of $\hat{ϕ} (a)$ with the outcome model terms ${\hat{g}}_{a} (X, W)$ set identically to 0.
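The non-augmented inverse odds weighting estimator in equation (6) can be sketched in the same way. As before, the dictionary keys, the function name `iow_estimate`, and the toy data are hypothetical.

```python
def iow_estimate(clusters, a):
    """Non-augmented inverse odds weighting estimate, following equation (6).

    Each cluster is a dict with hypothetical keys S, A, Y_bar, p, e1
    (sampling indicator, treatment, cluster-average outcome,
    sampling probability, probability of A = 1 in the trial).
    """
    # Normalize by the number of non-randomized clusters, not by m.
    n_nonrandomized = sum(1 for c in clusters if c["S"] == 0)
    total = 0.0
    for c in clusters:
        if c["S"] == 1 and c["A"] == a:
            e_a = c["e1"] if a == 1 else 1.0 - c["e1"]
            # Inverse odds of sampling, (1 - p) / p, times 1 / e_a.
            total += (1.0 - c["p"]) * c["Y_bar"] / (c["p"] * e_a)
    return total / n_nonrandomized

toy = [
    {"S": 1, "A": 1, "Y_bar": 0.6, "p": 0.5, "e1": 0.5},
    {"S": 1, "A": 0, "Y_bar": 0.2, "p": 0.5, "e1": 0.5},
    {"S": 0, "A": None, "Y_bar": None, "p": 0.5, "e1": 0.5},
    {"S": 0, "A": None, "Y_bar": None, "p": 0.5, "e1": 0.5},
]
phi1 = iow_estimate(toy, a=1)
phi0 = iow_estimate(toy, a=0)
```

The only differences from the inverse probability weighting sketch are the extra factor 1 − p in the numerator and the normalization by the count of clusters with S_j = 0, which together target the non-randomized subset of the cohort.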

We note in passing that the augmented inverse probability weighting estimator (using the known-by-design or estimated probabilities) is asymptotically at least as efficient as the non-augmented inverse probability weighting estimator using the known-by-design probabilities, when the outcome model is correctly specified (see Online Appendix D). Finally, we compared all these estimators against a trial-only estimator, which is estimated by averaging the individual-level outcomes in each cluster and then taking the average of these averages over the clusters participating in the trial.

Model Specification

For all estimators, we considered three possible versions for the sampling probability and probability of treatment in the trial: one where the known-by-design probabilities were used, one where both probabilities were estimated with a simple logistic regression model at the cluster level (conditional only on the cluster-level variable, X_j, that determined the sampling probabilities), and one where both probabilities were estimated with a more complex logistic regression model at the cluster level (conditional on the cluster-level variable, X_j, that determined the sampling probabilities and on the cluster-level averages of the individual-level covariates, W _j = ( W _1,j, W _2,j)).

For estimators that involve outcome modeling (i.e., AIPW and AIOW), we either modeled the outcome using cluster-level data (AIPW1 and AIOW1) or individual-level data (AIPW2 and AIOW2). When modeling using cluster-level data, we used a linear regression model for the cluster-specific average outcome, conditional on X_j and the cluster-level averages of W _1,j and W _2,j. When modeling using individual-level data, we used a logistic regression model for the indicator of the outcome, conditional on X_j and the elements of W _1,j and W _2,j corresponding to each observation, along with the cluster-level averages of W _1,j and W _2,j, separately in each treatment arm. The outcome model was correctly specified when using individual-level data; the outcome model was misspecified when using cluster-level data (as would be the case in most practical applications, when the true underlying individual-level model is complex), and thus, the regression of the cluster-level average of individual observed outcomes on cluster-level covariates is best viewed as an attempt to approximate the underlying true function.

Performance Assessment

We evaluated the performance of the estimators over 2000 simulation runs, in terms of bias and average standard deviation. We compared the estimated average standard deviation of the estimators (over the simulation runs) against (1) the average of the influence curve-based standard deviations, and (2) the average of a standard deviation estimated using a clustered bootstrap procedure that resamples with replacement from all the clusters (Field & Welsh, 2007), with 500 bootstrap samples in each of the simulation runs. We also compared the coverage of the augmented weighting estimators when using the influence curve-based standard deviation versus the standard deviation estimated using the clustered bootstrap procedure. To facilitate numerical comparisons, we multiplied the simulation estimates of the bias and average standard deviation by the square root of the number of clusters in the target population sample, $\sqrt{5000}$ . Because of the complexity of our data generating model, we obtained estimates of the “true” values for the expectations of the average potential outcomes and the average treatment effects using numerical methods (i.e., by generating both potential outcomes under the two levels of treatment for each observation over the simulation runs).
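The two computational ingredients of this assessment, a bootstrap that resamples whole clusters and the scaling of bias and standard deviation by $\sqrt{5000}$, can be sketched as follows. This is a generic illustration under stated assumptions, not the specific procedure of Field and Welsh (2007) or the code used in the study: the function names are hypothetical, `estimator` stands for any function that maps a list of clusters to a number, and the toy data are arbitrary.

```python
import math
import random
import statistics

def cluster_bootstrap_se(clusters, estimator, n_boot=500, seed=0):
    """Standard error from resampling whole clusters with replacement."""
    rng = random.Random(seed)
    m = len(clusters)
    estimates = []
    for _ in range(n_boot):
        # Each bootstrap sample draws m clusters; individuals stay
        # inside their cluster, preserving within-cluster dependence.
        resample = [clusters[rng.randrange(m)] for _ in range(m)]
        estimates.append(estimator(resample))
    return statistics.stdev(estimates)

def scaled_performance(estimates, truth, m=5000):
    """Bias and standard deviation over runs, each multiplied by sqrt(m)."""
    bias = statistics.mean(estimates) - truth
    sd = statistics.stdev(estimates)
    return math.sqrt(m) * bias, math.sqrt(m) * sd

# Toy check: clusters summarized by a single number, estimator = mean.
toy_clusters = list(range(1, 101))
se = cluster_bootstrap_se(toy_clusters, statistics.mean)

# Toy simulation output: three estimates of a quantity whose true value is 0.1.
scaled_bias, scaled_sd = scaled_performance([0.1, 0.2, 0.3], truth=0.1)
```

For an estimator such as the weighting estimators, a resample could in principle contain no trial clusters in one arm; a fuller implementation would need to decide how to handle that case, which this sketch does not address.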

Additional Simulation Scenarios (Scenarios 2 Through 6)

Here, we briefly summarize the additional scenarios we examined in the simulation study (see also Table 1 for a summary of the different scenarios and Online Appendix F for additional details regarding the specification of models for data generation and for analyzing the data as needed for different estimators). Briefly, in Scenario 2, we modified Scenario 1 to generate data in the absence of interference; in Scenario 3, we considered weaker treatment effects and heterogeneity; and in Scenario 4, we generated data under the sharp null hypothesis of no treatment effect (and no effect heterogeneity). In Scenarios 5 and 6, we examined the impact of different kinds of misspecification of the outcome model. In Scenario 5, we generated data using the same approach as in Scenario 1, but when modeling the outcome at the individual level, we omitted W _2,j and its cluster-level average ${\bar{W}}_{2, j}$, and when modeling the outcome at the cluster level, we omitted ${\bar{W}}_{2, j}$. In Scenario 6, we generated data with a nonlinear relationship (on the logit scale) between the outcome and W _1,j, W _2,j, and their corresponding cluster-level averages but analyzed the data using models that only had linear terms.

Results for Scenario 1

Here, we present results from simulation Scenario 1 for estimands pertaining to the entire target population in the main text; results for estimands pertaining to the non-randomized subset of the target population were similar and are presented in Online Appendix E. Table 2 shows the scaled bias (i.e., bias $\times \sqrt{5000}$) of each estimator across the simulation runs. The trial-only estimator, as expected, is biased for estimating the expectation of the average potential outcomes or the average treatment effect in the entire target population because sampling into the cluster randomized trial depends on the effect modifier, X_j. All estimators that account for the sampling of clusters for participation in the trial (IPW, AIPW1, and AIPW2) show negligible bias, even in small cluster randomized trials.

Table 2.

Scaled Bias of Estimators for Quantities in the Target Population.

Estimand	n	Values for Probabilities	Trial-Only	IPW	AIPW1	AIPW2
ATE	50	True	12.286	−.045	.025	.025
	50	Estimated (simple)	12.286	.015	.025	.025
	50	Estimated (complex)	12.286	.001	−.019	.025
	100	True	12.149	−.232	.010	.014
	100	Estimated (simple)	12.149	−.048	.010	.013
	100	Estimated (complex)	12.149	.047	−.005	.015
	200	True	12.296	.127	.000	.000
	200	Estimated (simple)	12.296	.012	.000	.000
	200	Estimated (complex)	12.296	−.026	−.012	.000
$E [{\bar{Y}}^{a = 1}]$	50	True	6.099	−.144	.013	.014
	50	Estimated (simple)	6.099	−.024	.013	.014
	50	Estimated (complex)	6.099	.040	−.005	.011
	100	True	6.062	−.204	.001	.007
	100	Estimated (simple)	6.062	−.060	.001	.006
	100	Estimated (complex)	6.062	−.010	−.007	.008
	200	True	6.142	.012	−.005	−.006
	200	Estimated (simple)	6.142	−.014	−.005	−.006
	200	Estimated (complex)	6.142	−.035	−.010	−.004
$E [{\bar{Y}}^{a = 0}]$	50	True	−6.187	−.100	−.011	−.011
	50	Estimated (simple)	−6.187	−.039	−.011	−.012
	50	Estimated (complex)	−6.187	.039	.014	−.014
	100	True	−6.088	.027	−.009	−.007
	100	Estimated (simple)	−6.088	−.011	−.009	−.007
	100	Estimated (complex)	−6.088	−.057	−.001	−.008
	200	True	−6.153	−.114	−.005	−.006
	200	Estimated (simple)	−6.153	−.026	−.005	−.006
	200	Estimated (complex)	−6.153	−.009	.002	−.004

Results are scaled by $\sqrt{m}$ (i.e., multiplied by $\sqrt{5000} \approx 70.7$ ). ATE is defined as $E [{\bar{Y}}^{a = 1}] - E [{\bar{Y}}^{a = 0}]$ ; n = number of clusters in the trial; Values for probabilities = how the treatment and sampling probabilities are obtained for the estimators (True = use the known-by-design sampling probabilities and probabilities of treatment in the trial; Estimated (simple) = estimate the sampling and treatment probabilities conditional on X_j only (the variable used to determine the sampling probabilities); Estimated (complex) = estimate the sampling and treatment probabilities conditional on X_j and the cluster-level averages of W _1,j and W _2,j); Trial-only = average the individual-level outcomes in each cluster and then take the average of these averages over the clusters participating in the trial; IPW = non-augmented inverse probability weighting estimator; AIPW1 = augmented inverse probability weighting estimator, with the outcome model fit only at the cluster level; AIPW2 = augmented inverse probability weighting estimator, with the outcome model fit at the individual level.

In Table 3, we present the scaled standard deviation (i.e., the standard deviation over the simulation runs $\times \sqrt{5000}$) for estimators that account for the sampling of clusters for participation in the trial (IPW, AIPW1, AIPW2). When using the true sampling and treatment probabilities, IPW had a substantially higher standard deviation than both AIPW1 and AIPW2. Estimating the sampling and treatment probabilities, using either a simple model that only included X_j (i.e., the variable used to determine the cluster sampling probability) or a more complex model that included X_j and cluster-level averages of W _1,j and W _2,j, reduced the standard deviation of IPW, but not enough to reach the standard deviation of AIPW1 or AIPW2. In contrast, AIPW1 and AIPW2 had similar performance when using the true sampling and treatment probabilities, when estimating these probabilities conditional on the variable X_j (i.e., the variable used to determine the cluster sampling probability), or when estimating these probabilities using a more complex model that included X_j and cluster-level averages of the elements of W _1,j and W _2,j.

Table 3.

Scaled Standard Deviation of Estimators for Quantities in the Target Population.

Estimand	n	Values for Probabilities	IPW	AIPW1	AIPW2
ATE	50	True	14.133	1.453	1.384
	50	Estimated (simple)	4.847	1.453	1.385
	50	Estimated (complex)	5.123	1.484	1.427
	100	True	10.144	.989	.944
	100	Estimated (simple)	3.395	.989	.944
	100	Estimated (complex)	2.812	.992	.955
	200	True	7.197	.738	.710
	200	Estimated (simple)	2.308	.738	.710
	200	Estimated (complex)	1.649	.733	.714
$E [{\bar{Y}}^{a = 1}]$	50	True	9.885	1.050	.988
	50	Estimated (simple)	3.464	1.050	.989
	50	Estimated (complex)	3.783	1.072	1.019
	100	True	7.073	.671	.637
	100	Estimated (simple)	2.424	.671	.637
	100	Estimated (complex)	2.081	.675	.647
	200	True	5.094	.504	.479
	200	Estimated (simple)	1.688	.504	.479
	200	Estimated (complex)	1.236	.497	.481
$E [{\bar{Y}}^{a = 0}]$	50	True	10.088	.984	.937
	50	Estimated (simple)	3.444	.984	.938
	50	Estimated (complex)	3.803	.992	.960
	100	True	7.077	.693	.650
	100	Estimated (simple)	2.479	.693	.650
	100	Estimated (complex)	2.061	.689	.657
	200	True	5.019	.486	.467
	200	Estimated (simple)	1.650	.486	.467
	200	Estimated (complex)	1.238	.482	.468

Results are scaled by $\sqrt{m}$ (i.e., multiplied by $\sqrt{5000} \approx 70.7$ ). ATE is defined as $E [{\bar{Y}}^{a = 1}] - E [{\bar{Y}}^{a = 0}]$ ; n = number of clusters in the trial; Values for probabilities = how the treatment and sampling probabilities are obtained for the estimators; True = use the known-by-design sampling probabilities and probabilities of treatment in the trial; Estimated (simple) = estimate the sampling and treatment probabilities conditional on X_j only (the variable used to determine the sampling probabilities); Estimated (complex) = estimate the sampling and treatment probabilities conditional on X_j and the cluster-level averages of W _1,j and W _2,j; IPW = non-augmented inverse probability weighting estimator; AIPW1 = augmented inverse probability weighting estimator, with the outcome model fit only at the cluster level; AIPW2 = augmented inverse probability weighting estimator, with the outcome model fit at the individual level.

To examine methods for statistical inference, we focus on the augmented weighting estimators (AIPW1 and AIPW2) because they were nearly unbiased and had a substantially lower standard deviation compared with IPW. Table 4 presents the average of the standard errors using the influence curve-based approach (IC) and the cluster bootstrap (BS), along with the corresponding coverage of Wald-style 95% confidence intervals obtained using these standard errors for the augmented weighting estimators (AIPW1 and AIPW2). Ideally, the scaled average standard error should equal the scaled standard deviation in Table 3. In general, the average influence curve-based standard error was smaller than the standard deviation of the estimators, especially in smaller cluster trials. Using the influence curve-based standard errors resulted in undercoverage in the smaller cluster trials of 50 or 100 clusters, but nearly nominal coverage in larger cluster trials of 200 clusters. The average standard error based on the cluster bootstrap was similar to the standard deviation of the estimators. Using the bootstrap-based standard errors resulted in near-nominal coverage for all trial sizes we examined. Online Appendix E summarizes results for Scenario 1 for estimators of average treatment effects and expectations of the average potential outcome in the non-randomized subset of the target population; the results were similar to the results reported here for the entire target population.
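The coverage figures discussed above are proportions of simulation runs in which a Wald-style interval contains the true value. A minimal sketch of that calculation is given below; the function names and the three-run toy input are hypothetical, and the standard errors passed in could come from either the influence curve-based approach or the cluster bootstrap.

```python
from statistics import NormalDist

def wald_interval(estimate, se, level=0.95):
    """Wald-style interval: estimate plus or minus z times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return estimate - z * se, estimate + z * se

def coverage(estimates, ses, truth, level=0.95):
    """Proportion of runs whose interval contains the true value."""
    hits = 0
    for est, se in zip(estimates, ses):
        lower, upper = wald_interval(est, se, level)
        if lower <= truth <= upper:
            hits += 1
    return hits / len(estimates)

# Toy input: three simulation runs, true value 0.1.
cov = coverage([0.0, 0.5, 1.0], [0.1, 0.1, 0.1], truth=0.1)
```

In the toy input only the first interval, roughly (−0.196, 0.196), contains 0.1, so `cov` is one third. The same logic explains the pattern in Table 4: when standard errors are too small on average, intervals are too narrow and coverage falls below the nominal level.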

Table 4.

Coverage and Scaled Average of Standard Errors of the Augmented Weighting Estimators for Quantities in the Target Population.

Estimand	n	Values for Probabilities	AIPW1 ASE (IC)	AIPW1 ASE (BS)	AIPW1 Coverage (IC)	AIPW1 Coverage (BS)	AIPW2 ASE (IC)	AIPW2 ASE (BS)	AIPW2 Coverage (IC)	AIPW2 Coverage (BS)
ATE	50	True	1.205	1.778	.878	.969	1.144	1.518	.884	.959
	50	Estimated (simple)	1.246	1.778	.907	.969	1.177	1.518	.902	.959
	50	Estimated (complex)	1.292	1.911	.904	.963	1.210	1.625	.899	.960
	100	True	.927	1.022	.926	.951	.885	.965	.931	.950
	100	Estimated (simple)	.944	1.022	.936	.951	.899	.964	.936	.952
	100	Estimated (complex)	.964	1.022	.938	.954	.915	.976	.936	.956
	200	True	.713	.742	.933	.945	.686	.709	.933	.947
	200	Estimated (simple)	.720	.742	.942	.945	.691	.709	.941	.946
	200	Estimated (complex)	.727	.734	.942	.941	.697	.712	.944	.948
$E [{\bar{Y}}^{a = 1}]$	50	True	.822	1.189	.851	.951	.777	1.034	.853	.944
	50	Estimated (simple)	.853	1.189	.879	.951	.803	1.034	.885	.944
	50	Estimated (complex)	.881	1.262	.882	.959	.824	1.095	.879	.946
	100	True	.627	.697	.913	.952	.595	.655	.915	.949
	100	Estimated (simple)	.642	.697	.937	.952	.608	.655	.929	.948
	100	Estimated (complex)	.656	.699	.934	.949	.620	.663	.927	.946
	200	True	.473	.493	.924	.939	.452	.468	.928	.944
	200	Estimated (simple)	.476	.493	.934	.939	.455	.468	.935	.945
	200	Estimated (complex)	.482	.489	.938	.938	.459	.471	.937	.942
$E [{\bar{Y}}^{a = 0}]$	50	True	.819	1.186	.866	.950	.774	1.033	.861	.951
	50	Estimated (simple)	.848	1.186	.891	.950	.799	1.033	.892	.951
	50	Estimated (complex)	.877	1.276	.893	.955	.820	1.105	.889	.952
	100	True	.629	.695	.902	.943	.598	.654	.913	.943
	100	Estimated (simple)	.639	.695	.919	.943	.607	.654	.923	.944
	100	Estimated (complex)	.652	.694	.927	.941	.617	.661	.929	.942
	200	True	.472	.494	.932	.944	.451	.469	.930	.944
	200	Estimated (simple)	.478	.494	.934	.944	.456	.469	.939	.944
	200	Estimated (complex)	.483	.489	.937	.938	.460	.471	.938	.942

ASE = average (over the simulations) of standard errors, scaled by $\sqrt{m}$ (i.e., multiplied by $\sqrt{5000} \approx 70.7$ ); IC = influence curve based; BS = bootstrap; Coverage = coverage, using 95% normal confidence intervals. ATE is defined as $E [{\bar{Y}}^{a = 1}] - E [{\bar{Y}}^{a = 0}]$ ; n = number of clusters in the trial; Values for probabilities = how the treatment and sampling probabilities are obtained for the estimators; True = use the known-by-design sampling probabilities and probabilities of treatment in the trial; Estimated (simple) = estimate the sampling and treatment probabilities conditional on X_j only (the variable used to determine the sampling probabilities); Estimated (complex) = estimate the sampling and treatment probabilities conditional on X_j and the cluster-level averages of W _1,j and W _2,j; AIPW1 = augmented inverse probability weighting estimator, with the outcome model fit only at the cluster level; AIPW2 = augmented inverse probability weighting estimator, with the outcome model fit at the individual level.

Results for Additional Simulation Scenarios (2 Through 6)

We report detailed results for these scenarios in Online Appendices G through K. Regardless of the magnitude of the treatment effect and the amount of heterogeneity, in Scenarios 2 and 3, the trial-only estimator was biased (except when there was no heterogeneity of treatment effects), while the other estimators (IPW, AIPW1, and AIPW2) remained nearly unbiased. Scenario 4 produced similar results for the IPW, AIPW1, and AIPW2 estimators (in this scenario, the trial-only estimator was also unbiased for the average treatment effect in the target population, but remained biased for expectations of the average potential outcome). In these scenarios, as in Scenario 1, the AIPW estimators had smaller standard deviation than IPW. In Scenarios 5 and 6, where we examined the impact of misspecifying the outcome model, the AIPW estimators had little bias because, regardless of the specification of the outcome model, the probability of participation was either the true one (known by design) or estimated using a correctly specified model (i.e., the simulations reflect the robustness property of the AIPW estimator).

Discussion

We described a nested cluster randomized trial design where clusters are selected for inclusion in the trial with known sampling probabilities that may depend on baseline covariates, and proposed robust augmented weighting estimators for this design. The robustness of the proposed estimators stems from the fact that the sampling probability and the probability of treatment in the trial are known by design, and thus, models for them can always be correctly specified. Our estimators give evaluators the option of exploiting individual-level data on the relationship between covariates, treatment and outcomes, to further increase efficiency, while accounting for within-cluster dependence, including various forms of interference. We showed that, for causal estimands that pertain to the non-randomized subset of the target population, knowledge of the sampling probabilities can be used to develop augmented weighting estimators that are more efficient compared to augmented weighting estimators when the sampling probabilities are unknown. This improvement is not available for estimands that pertain to the entire target population because their efficient influence function is the same, whether the sampling probabilities are known or unknown.

Our proof-of-concept simulations, motivated by large cluster randomized trials of vaccine effectiveness (National Library of Medicine (U.S.), 2013, 2018, 2019), show that the augmented weighting estimators perform well in finite samples and better than previously described non-augmented weighting estimators. The augmented weighting estimators had about the same performance whether the true or estimated sampling and treatment probabilities were used, even in small trials. In contrast, our simulation results suggest that estimating the known-by-design sampling and treatment probabilities when using the non-augmented weighting estimators can substantially improve precision, but the improvement is often not enough to reach the precision of the augmented weighting estimators. In the simulation, the standard deviation estimated using a clustered bootstrap procedure worked well for inference with the augmented weighting estimators and the influence curve-based standard deviation (which is computationally faster) also performed well in larger cluster trial sizes. Even though the clustered bootstrap procedure we used worked well in our simulations, there are many options for bootstrapping clustered data (Davison & Hinkley, 1997; Field & Welsh, 2007) and comparisons among them might be useful, particularly for studies with a smaller number of clusters.

Prior work on designing a cluster randomized trial to support generalizable inferences has focused on sampling clusters such that crude (unadjusted) analyses of the trial data can estimate treatment effects in the target population (Tipton, 2013b; Tipton et al., 2014; Tipton & Peck, 2017). Such “representative” sampling using a constant sampling probability across strata defined by effect modifiers puts a premium on being able to use relatively simple statistical analyses but cannot accommodate other practical aspects of trial conduct, such as the need for rapid recruitment of clusters, the recruitment of clusters with established research infrastructure, or the desire for efficient estimation within subgroups of clusters defined by covariates. When representative sampling does not result in good balance between the trial and the target population, this prior work has mentioned the possibility of using simple weighting or stratification methods, without providing evidence of good performance. Our proposed design essentially works in the opposite direction, by acknowledging that evaluators often have to select clusters for participation conditional on baseline covariates, while still wanting to draw generalizable inferences—an instance of experimental design with multiple objectives (Sverdlov & Rosenberger, 2013; Sverdlov et al., 2020; Woodcock & LaVange, 2017). For example, our approach can support trials designed to test hypotheses in subgroups of clusters, by oversampling clusters with certain characteristics, and uses the known-by-design sampling probabilities to produce inferences that apply to the target population.

Our proposed design and analysis methods should be useful when practicalities of trial conduct (e.g., efficient recruitment) or the trial’s inferential goals require oversampling clusters with certain characteristics. They can form the basis for explicit approaches to the planning of future cluster randomized trials (Copas & Hooper, 2021; Raudenbush, 1997) via formal optimization procedures for trading off competing research objectives. Such optimization efforts are motivated by the desire to use the most efficient experimental designs that are feasible; thus, they should routinely be paired with efficient estimation approaches, such as those that we have proposed.

In most cases, evaluators will choose sampling probabilities that depend on a low dimensional set of discrete covariates (e.g., those deemed as the most likely and strong effect modifiers for the treatment effects of interest). That said, our methods can also accommodate more complex sampling schemes. For example, when evaluators would like to sample clusters based on multiple covariates [as may be the case in large cluster randomized trials that motivate our work (National Library of Medicine (U.S.), 2013, 2018, 2019)], a risk or effect score, or other dimensionality reduction approach, could be used to create a lower dimensional variable that can then be used to determine the sampling probabilities.

Throughout, we assumed that the evaluators have complete control over cluster participation in the trial. Nevertheless, the methods can be easily extended to allow for the possibility that selected clusters may decline participation in the trial. When sampled clusters can decline participation, additional causal assumptions regarding the exchangeability of clusters that agree to participate with those that do not, among the sampled clusters, will be needed (this is analogous to recent results for individually randomized trials; Dahabreh et al., 2019c). Furthermore, the model for the probability of participation among sampled clusters will need to be correctly specified on the basis of background knowledge, because participation among the sampled clusters will not be under the control of the evaluators.

In summary, cluster randomized trials where clusters are selected for inclusion with known sampling probabilities that may depend on cluster characteristics, combined with efficient estimation methods, can lead to substantial improvements in the precision of the estimated effect in the target population, while also addressing competing objectives of trial conduct.

Supplemental Material

Supplemental Material for Cluster Randomized Trials Designed to Support Generalizable Inferences Sarah E. Robertson, Jon A. Steingrimsson, and Issa J. Dahabreh in Evaluation Review.

Footnotes

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Dr Dahabreh is the principal investigator of a research agreement between Harvard University and Sanofi on transportability methods for individually randomized trials, unrelated to this manuscript; Dr Dahabreh also reports consulting fees from Moderna for work unrelated to this manuscript. The other authors report no potential conflicts of interest.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported in part by National Library of Medicine (NLM) Award R01LM013616 and Patient-Centered Outcomes Research Institute (PCORI) awards ME-1502-27794 and ME-2019C3-17875. The content of this paper is solely the responsibility of the authors and does not necessarily represent the official views of the NLM, PCORI, the PCORI Board of Governors, or the PCORI Methodology Committee.

Supplemental Material

Supplemental material for this article is available online.

References

Balzer

L. B.

Zheng

van der Laan

M. J.

Petersen

M. L.

(2019). A new approach to hierarchical data analysis: Targeted maximum likelihood estimation for the causal effect of a cluster-level exposure. Statistical Methods in Medical Research, 28(6), 1761–1780. https://doi.org/10.1177/0962280218774936

Benitez

Petersen

M. L.

van der Laan

M. J.

Santos

Butrick

Walker

Ghosh

Otieno

Waiswa

Balzer

L. B.

(2022). Defining and estimating effects in cluster randomized trials: A methods comparison [preprint]. arXiv preprint, (arXiv:2110.09633).

Breslow

N. E.

Wellner

J. A.

(2007). Weighted likelihood for semiparametric models and two-phase stratified samples, with application to Cox regression. Scandinavian Journal of Statistics, 34(1), 86–102. https://doi.org/10.1111/j.1467-9469.2006.00523.x

Cole

S. R.

Stuart

E. A.

(2010). Generalizing evidence from randomized clinical trials to target populations: The ACTG 320 trial. American Journal of Epidemiology, 172 (1), 107–115. https://doi.org/10.1093/aje/kwq084

Copas

A. J.

Hooper

(2021). Optimal design of cluster randomized trials allowing unequal allocation of clusters and unequal cluster size between arms. Statistics in Medicine, 40(25), 5474–5486. https://doi.org/10.1002/sim.9135

Dahabreh

I. J.

Haneuse

S. J. P. A.

Robins

J. M.

Robertson

S. E.

Buchanan

A. L.

Stuart

E. A.

Hernán

M. A.

(2021). Study designs for extending causal inferences from a randomized trial to a target population. American Journal of Epidemiology, 190(8), 1632–1642. https://doi.org/10.1093/aje/kwaa270

Dahabreh

I. J.

Hernán

M. A.

Robertson

S. E.

Buchanan

Steingrimsson

J. A.

(2019a). Generalizing trial findings in nested trial designs with sub-sampling of non-randomized individuals [preprint]. arXiv preprint, (arXiv:1902.06080).

Dahabreh

I. J.

Robertson

S. E.

Tchetgen

E. J.

Stuart

E. A.

Hernán

M. A.

(2019b). Generalizing causal inferences from individuals in randomized trials to all trial-eligible individuals. Biometrics, 75(2), 685–694. https://doi.org/10.1111/biom.13009

Dahabreh

I. J.

Robins

J. M.

Haneuse

S. J.-P.

Hernán

M. A.

(2019c). Generalizing causal inferences from randomized trials: Counterfactual and graphical identification [preprint]. arXiv preprint, (arXiv:1906.10792).

10.

Dahabreh

I. J.

Robertson

S. E.

Steingrimsson

J. A.

Stuart

E. A.

Hernán

M. A.

(2020). Extending inferences from a randomized trial to a new target population. Statistics in Medicine, 39(14), 1999–2014. https://doi.org/10.1002/sim.8426

11.

Davison

A. C.

Hinkley

D. V.

(1997). Bootstrap methods and their application. Cambridge university press.

12.

Donner

Klar

(2000). Design and analysis of cluster randomization trials in health research. Wiley.

13.

Efron

Tibshirani

R. J.

(1994). An introduction to the bootstrap (Vol. 57). Chapman and Hall/CRC.

14.

Field

C. A.

Welsh

A. H.

(2007). Bootstrapping clustered data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(3), 369–390. https://doi.org/10.1111/j.1467-9868.2007.00593.x

15.

Gravenstein

Dahal

Gozalo

P. L.

Davidson

H. E.

Han

L. F.

Taljaard

Mor

(2016). A cluster randomized controlled trial comparing relative effectiveness of two licensed influenza vaccines in US nursing homes: Design and rationale. Clinical Trials, 13(3), 264–274. https://doi.org/10.1177/1740774515625976

16.

Gravenstein

Davidson

H. E.

Taljaard

Ogarek

Gozalo

Han

Mor

(2017). Comparative effectiveness of high-dose versus standard-dose influenza vaccination on numbers of US nursing home residents admitted to hospital: A cluster-randomised trial. The Lancet. Respiratory Medicine, 5(9), 738–746. https://doi.org/10.1016/S2213-2600(17)30235-7

Gravenstein, McConeghy K. W., Saade, Davidson H. E., Canaday D. H., Han, Rudolph, Joyce, Dahabreh I. J., & Mor (2021). Adjuvanted influenza vaccine and influenza outbreaks in US nursing homes: Results from a pragmatic cluster-randomized clinical trial. Clinical Infectious Diseases, 73(11), e4229–e4236. https://doi.org/10.1093/cid/ciaa1916

Hahn (1998). On the role of the propensity score in efficient semiparametric estimation of average treatment effects. Econometrica, 66(2), 315–331. https://doi.org/10.2307/2998560

Halloran M. E., & Struchiner C. J. (1995). Causal inference in infectious diseases. Epidemiology, 6(2), 142–151. https://doi.org/10.1097/00001648-199503000-00010

Hudgens M. G., & Halloran M. E. (2008). Toward causal inference with interference. Journal of the American Statistical Association, 103(482), 832–842. https://doi.org/10.1198/016214508000000292

Lunceford J. K., & Davidian (2004). Stratification and weighting via the propensity score in estimation of causal treatment effects: A comparative study. Statistics in Medicine, 23(19), 2937–2960. https://doi.org/10.1002/sim.1903

Murray D. M. et al. (1998). Design and analysis of group-randomized trials (Vol. 29). Oxford University Press.

National Library of Medicine (U.S.) (2013). High dose influenza vaccination and morbidity and mortality in U.S. nursing homes, identifier: NCT01815268. https://clinicaltrials.gov/ct2/show/NCT01815268

National Library of Medicine (U.S.) (2018). Adjuvanted influenza vaccination and morbidity and mortality in U.S. nursing homes, identifier: NCT02882100. https://clinicaltrials.gov/ct2/show/NCT02882100

National Library of Medicine (U.S.) (2019). Comparative effectiveness of recombinant versus standard dose quadrivalent influenza vaccine in U.S. nursing homes, identifier: NCT03965195. https://clinicaltrials.gov/ct2/show/study/NCT03965195

O’Muircheartaigh & Hedges L. V. (2014). Generalizing from unrepresentative experiments: A stratified propensity score approach. Journal of the Royal Statistical Society Series C: Applied Statistics, 63(2), 195–210. https://doi.org/10.1111/rssc.12037

Raudenbush S. W. (1997). Statistical analysis and optimal design for cluster randomized trials. Psychological Methods, 2(2), 173–185. https://doi.org/10.1037/1082-989x.2.2.173

Robins J. M., & Greenland (2000). Causal inference without counterfactuals: Comment. Journal of the American Statistical Association, 95(450), 431–435. https://doi.org/10.2307/2669381

Rubin D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5), 688–701. https://doi.org/10.1037/h0037350

Rudolph K. E., & van der Laan M. J. (2017). Robust estimation of encouragement design intervention effects transported across sites. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 79(5), 1509–1525. https://doi.org/10.1111/rssb.12213

Saegusa & Wellner J. A. (2013). Weighted likelihood estimation under two-phase sampling. Annals of Statistics, 41(1), 269–295. https://doi.org/10.1214/12-AOS1073

Stuart E. A., Cole S. R., Bradshaw C. P., & Leaf P. J. (2001). The use of propensity scores to assess the generalizability of results from randomized trials. Journal of the Royal Statistical Society. Series A (Statistics in Society), 174(2), 369–386. https://doi.org/10.1111/j.1467-985X.2010.00673.x

Sverdlov & Rosenberger W. F. (2013). On recent advances in optimal allocation designs in clinical trials. Journal of Statistical Theory and Practice, 7(4), 753–773. https://doi.org/10.1080/15598608.2013.783726

Sverdlov, Ryeznik, & Wong W. K. (2020). On optimal designs for clinical trials: An updated review. Journal of Statistical Theory and Practice, 14(1), 1–29.

Tipton (2013a). Improving generalizations from experiments using propensity score subclassification: Assumptions, properties, and contexts. Journal of Educational and Behavioral Statistics, 38(3), 239–266. https://doi.org/10.3102/1076998612441947

Tipton (2013b). Stratified sampling using cluster analysis: A sample selection strategy for improved generalizations from experiments. Evaluation Review, 37(2), 109–139. https://doi.org/10.1177/0193841X13516324

Tipton, Hallberg, Hedges L. V., & Chan (2017). Implications of small samples for generalization: Adjustments and rules of thumb. Evaluation Review, 41(5), 472–505. https://doi.org/10.1177/0193841X16655665

Tipton, Hedges, Vaden-Kiernan, Borman, Sullivan, & Caverly (2014). Sample selection in randomized experiments: A new method using propensity score stratified sampling. Journal of Research on Educational Effectiveness, 7(1), 114–135. https://doi.org/10.1080/19345747.2013.831154

Tipton & Olsen R. B. (2018). A review of statistical methods for generalizing from evaluations of educational interventions. Educational Researcher, 47(8), 516–524. https://doi.org/10.3102/0013189x18781522

Tipton & Peck L. R. (2017). A design-based approach to improve external validity in welfare policy evaluations. Evaluation Review, 41(4), 326–356. https://doi.org/10.1177/0193841X16655656

Tipton, Yeager D. S., Iachan, & Schneider (2019). Designing probability samples to study treatment effect heterogeneity. In Lavrakas P. J., Traugott M. W., Kennedy, Holbrook A. L., de Leeuw E. D., & West B. T. (Eds.), Experimental methods in survey research: Techniques that combine random sampling with random assignment (pp. 435–456). John Wiley and Sons.

Tsiatis (2007). Semiparametric theory and missing data. Springer Science and Business Media.

Westreich, Edwards J. K., Lesko C. R., Stuart, & Cole S. R. (2017). Transportability of trial results using inverse odds of sampling weights. American Journal of Epidemiology, 186(8), 1010–1014. https://doi.org/10.1093/aje/kwx164

Williamson E. J., Forbes, & White I. R. (2014). Variance reduction in randomised trials by inverse probability weighting using the propensity score. Statistics in Medicine, 33(5), 721–737. https://doi.org/10.1002/sim.5991

Woodcock & LaVange L. M. (2017). Master protocols to study multiple therapies, multiple diseases, or both. New England Journal of Medicine, 377(1), 62–70. https://doi.org/10.1056/NEJMra1510062
