Practical issues in operationalizing the design and outcome evaluation of cluster randomized trials

Abstract

Trial designs using cluster-level randomization are necessary when interventions have intended effects that cannot be measured with individual randomization. When an intervention is intrinsically only able to be delivered to a cluster or when implementation of an individual level intervention is only feasibly implemented at a cluster level, cluster-level randomization is required. In designing the strategy for evaluation of the primary outcome of a cluster randomized trial, there are a multitude of important decisions to consider. While these decisions are guided primarily by the intervention—who benefits, what is the intended effect and when will it be achieved—there are important detailed choices that affect potential bias and statistical power, and implementation considerations that require compromise for considerations of feasibility and practicality. Through the lens of three large completed cluster randomized trials in HIV prevention, we present specific choices made for the overall evaluation plan, together with some of the detailed considerations, compromises and modifications that occurred during trial implementation.

Keywords

Cluster randomized trials, HIV prevention, outcome evaluation, Hawthorne

Background

Trial designs using cluster-level randomization are necessary when interventions have intended impact that cannot be measured with individual randomization. When an intervention is intrinsically only able to be delivered to a cluster (e.g. HIV testing using mobile vans within a community), or when implementation of an individual level intervention is only feasibly implemented at a cluster level (e.g. implementation of financial incentives in a medical facility), cluster-level randomization is a requirement of the design. Cluster randomized trials (CRTs) have a plethora of important operational, logistical and measurement details to be decided in the design and implementation of the evaluation strategy. Both practical and statistical considerations must be weighed in these decisions.

Cluster randomized designs require evaluation of intervention impact for the cluster, but the strategy for that evaluation has myriad possibilities depending on whether the intended intervention impact is in the same or a different population from the intervention delivery population. Sometimes the intervention has intended direct impact on the delivery population; sometimes indirect impact on a population different from the delivery population; commonly in CRTs, intended impact is both direct and indirect. Identifying the group targeted by the intervention will guide the strategy for outcome evaluation and have important implications for the cost and resources needed. Often, practical and budget considerations require compromises in the evaluation strategy; early deliberations by the study team are important in assessing the risks and limitations of these choices, and discussion of potential mitigation strategies. Unwelcome surprises are almost a certainty as implementation proceeds, even with the best laid plans; ongoing review of any adaptations needs the same attention to risks and limitations for the evaluation.

This manuscript describes some of the practical considerations in the design and implementation of the evaluation of outcomes in a CRT, through illustration with three diverse case studies of CRTs in HIV prevention. Our goal is to assist those contemplating the logistics of future CRTs to design and implement evaluation strategies that will increase the chance of successful execution of a high-quality evaluation for these large, expensive and important studies.

Methods

The following three case studies are used to illustrate five key decision areas in designing and implementing CRT evaluations.

Three case studies

NIMH Project ACCEPT (HPTN 043):^1,2 Project ACCEPT sought to evaluate whether increased access to HIV testing and post-test support could decrease HIV incidence in a community. Increased testing access was implemented through HIV testing in mobile vans deployed throughout the community; peer-based support groups addressed stigma and disclosure to modify norms for HIV testing. The trial was conceived, designed and conducted before 2009, in an era with no proven effective biomedical prevention (ClinicalTrials.gov number NCT00203749).

Financial Incentives for viral suppression (HPTN 065):^3,4 HPTN 065 was a trial with five components: for the case study we use the component that was a CRT of financial incentives in health clinics in Washington, DC and the Bronx, NY (HPTN 065-FI). The trial evaluated whether clinics implementing a quarterly financial incentive given to persons living with HIV on antiretroviral therapy who maintained viral suppression would increase adherence to antiretroviral therapy and lead to higher rates of viral suppression (ClinicalTrials.gov number NCT01152918).

HPTN 071 (PopART):^5,6 The study of the PopART intervention evaluated whether universal HIV testing and treatment throughout a community would decrease the rate of new infections in the community. Treatment as prevention has been shown to be 95% effective in preventing HIV transmission; this CRT was designed to measure the impact of community-level implementation of household-based HIV testing and referral to treatment on reducing risk of HIV infection throughout the community (ClinicalTrials.gov number NCT01900977).

The design and outcome evaluation strategies for these three CRTs in HIV prevention are detailed in Table 1.

Table 1.

Detailed comparison of design and evaluation strategies of three case studies.

	Project ACCEPT (HPTN 043)	Test and Treat-Plus (HPTN 065-FI)	PopART (HPTN 071)
Hypothesis	Increased access to HIV testing and post-test support across a community will decrease HIV incidence in the community	Clinic implementation of financial incentives given to persons living with HIV on antiretroviral therapy upon achieving and sustaining undetectable viral load will increase viral suppression in the clinic	Household-based delivery of HIV testing, linkage to HIV prevention and care services, and universal antiretroviral therapy will decrease HIV incidence in the community
Intervention	HIV testing in mobile vans, post-test support groups addressing disclosure and stigma reduction	Payment of $70 every quarter for all clinic persons living with HIV on antiretroviral therapy if they were virally suppressed (VL < 400 copies/mL)	Annual household visits by community health workers to offer HIV testing, referrals to care and treatment. Arm A community clinics offered universal antiretroviral therapy; Arm B clinics offered standard of care antiretroviral therapy
Intervention delivery population	All adults in the community	All persons living with HIV on antiretroviral therapy established in care at the clinic	All adults in the community
Intervention impact population (primary)	All HIV-uninfected persons in the community	All persons living with HIV on antiretroviral therapy established in care at the clinic	All HIV-uninfected persons in the community
Standard of care	HIV testing as currently available in the community	Adherence support as currently available at the clinic	HIV testing and treatment as currently available in the community
Time frame	Three years of intervention: 2006–2009	Clinic implementation of FI for 2 years: 2011–2013	Three annual rounds of intervention 2013–2018
Setting	Tanzania, Zimbabwe, South Africa (Vulindlela and Soweto), Thailand	Bronx, NYC; Washington DC, United States	Zambia and South Africa
Design	Two arm pair-matched cluster randomized design; 8–10 paired communities (clusters) per setting; community size 3.4–23.9 K	Two arm cluster randomized trial of 39 HIV clinics (clusters); patients per clinic 50–1000+	Three arm triplet-matched cluster randomized design: 21 communities (clusters); community size 21–167 K
Outcome measure	HIV incidence	Viral load suppression	HIV incidence
Who:	Adults aged 18–35 living in randomly selected households, N = 500–1400 per community, total enrolled 56,683	All persons living with HIV with two lab assessments at the clinic in the past year (via eHARS surveillance system)	Population cohort of adults aged 18–44 from randomly selected households (one per household). 2000 per community. Total enrolled = 48,301
What:	HIV-negative or HIV-positive with recent HIV infection (assessed with a lab-based recency assay)	Viral load measured in past 6 months undetectable (<400 copies/mL); missing viral load assumed detectable	HIV incident cases occurring in HIV-uninfected cohort participants, assessed in annual follow-up visits
When:	Post intervention survey conducted after end of intervention in each matched pair of communities	Assessment period 15–24 months after initiation of financial incentives	HIV incidence assessed in Year 2 and 3 of follow-up
Results: Intervention uptake	Community-based voluntary counseling and testing increased testing rates by 25% overall, by 45% in men and 15% in women	96% of eligible care visits received gift cards of $70 per quarter	After 3 years of PopART intervention, HIV status was known by 90% of men and women in Zambia and 78% of men and 85% of women in SA; for persons living with HIV, an estimated 95% of women and 85% of men knew their HIV status
Results: Primary outcome	Incidence of HIV in the intervention group was 1.52% versus 1.81% in the control group, a reduction in HIV incidence of 13.9% (p = 0.082)	Financial incentives at sites increased viral suppression by 3.8% (p = 0.01) overall compared with standard of care sites	PopART intervention in Arm A resulted in 7% lower HIV incidence compared to standard of care (p = 0.51); Arm B resulted in 30% lower HIV incidence (p = 0.006)

Results

Five operational areas of the CRT evaluation are considered: randomization, design of the outcome evaluation, risk of Hawthorne effect, representativeness, and size and weighting of the evaluation clusters.

Randomization

The goal of randomization is to reduce risk of confounding by achieving balance in measured and unmeasured variables associated with the outcome. However, CRTs can have a relatively small number of randomization units, increasing the risk of chance imbalance. Operationally, most CRTs mitigate this risk by ensuring good balance through restrictions on the randomization, specifically to ensure balance on community characteristics known to be related to the (1) uptake of the intervention and (2) outcome, pre-intervention. Commonly CRTs use some combination of stratification, matching, restricted randomization and statistical adjustment. Matching and restricted randomization clearly require community-level information prior to randomization; planning to obtain this information is an important effort in the implementation timeline, since randomization typically needs to occur early (6–12 months prior to intervention start), for planning intervention implementation.

In both Project ACCEPT and PopART, stratified, matched designs were selected. Matching of communities, stratified by country, was decided on the basis of geographic, sociodemographic, cultural and infrastructure characteristics established by formative⁷ or prior⁸ research. A well-known bias-variance trade-off of a matched design is the reduction in the number of independent assessments of intervention effect: for example, matched pairs in Project ACCEPT reduced the assessments from the 34 communities to 17 and, in PopART, matched triplets reduced them from 21 to 7.

Restricted randomization selects from all possible randomized allocations a restricted subset with an acceptable level of balance in baseline (pre-existing) data.⁹ Restricted randomization is intuitive and simple, since with a small number of clusters complete enumeration (and ordering) of randomizations is feasible. Restricted randomization provides a natural permutation test framework, allowing for a non-parametric assessment of the statistical significance of the trial result. Note that the size of the restricted pool of randomizations determines the number of realizations in the permutation testing, hence pools of at least 5000 are recommended. However, there are risks. Restricted randomization may rely on pre-existing data (not collected by the study) that are old and/or not purpose-built. Efforts made to select a balanced randomization may come to naught if the data used for the randomization are inaccurate.
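The enumeration, restriction and permutation-test steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in any of the three trials: the function names, the eight hypothetical clusters, the single balance covariate, the 2-point balance threshold and the difference-in-means test statistic are all introduced here for illustration.

```python
import itertools
import random


def arm_mean_difference(values, treated):
    """Mean of `values` in the treated clusters minus the mean in the remaining clusters."""
    treated_set = set(treated)
    arm1 = [v for i, v in enumerate(values) if i in treated_set]
    arm0 = [v for i, v in enumerate(values) if i not in treated_set]
    return sum(arm1) / len(arm1) - sum(arm0) / len(arm0)


def restricted_pool(baseline, n_treated, max_diff):
    """Enumerate every allocation and keep those balanced on the baseline covariate."""
    return [
        treated
        for treated in itertools.combinations(range(len(baseline)), n_treated)
        if abs(arm_mean_difference(baseline, treated)) <= max_diff
    ]


def permutation_p_value(outcome, chosen, pool):
    """Two-sided p-value: share of pool allocations at least as extreme as the one used."""
    observed = abs(arm_mean_difference(outcome, chosen))
    extreme = sum(abs(arm_mean_difference(outcome, a)) >= observed for a in pool)
    return extreme / len(pool)


# Hypothetical baseline HIV prevalence (%) for 8 clusters; illustrative values only.
baseline_prevalence = [8.0, 12.0, 15.0, 9.0, 20.0, 11.0, 14.0, 18.0]
pool = restricted_pool(baseline_prevalence, n_treated=4, max_diff=2.0)

# Pick the allocation actually used, at random, from the restricted pool.
chosen = random.Random(2024).choice(pool)

# Hypothetical end-of-trial cluster outcomes (e.g. incidence per 100 person-years).
outcome = [1.1, 1.6, 1.9, 1.2, 2.4, 1.5, 1.8, 2.2]
p_value = permutation_p_value(outcome, chosen, pool)
```

Because the p-value is a proportion of the pool, it can never be smaller than one over the pool size, which is why a small restricted pool limits the resolution of the permutation test.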

Both HPTN 065-FI and PopART used stratified, restricted randomization to ensure balance in specific factors. In HPTN 065-FI, critical factors to balance between arms within each city were distribution of clinic sizes and baseline viral suppression. Using baseline data from eHARS¹⁰ (the US National HIV Surveillance System that was used to assess outcome), we found the 10,000 randomizations with the smallest sum of squared t-statistics comparing between arms, and selected one at random for each city. PopART (in addition to matching) used restricted randomization to achieve baseline balance in community size (intervention coverage), HIV prevalence (correlate of incidence) and uptake of antiretroviral therapy (intervention uptake). Data for some communities were available from a prior trial;⁸ extension to new communities was imputed based on available data in that health district. Operationally, randomization choice was restricted to the 5000 randomizations where (1) average HIV prevalence was within 2%, (2) average antiretroviral therapy uptake among HIV-positive individuals was within 4% and (3) average community population size was within 5000, for all trial arm comparisons. Randomization occurred in a public ceremony where communities (1) selected from the restricted randomizations by generating a random number between 0 and 4999 and (2) assigned arm to the three symbols in the selected randomization. In both PopART and HPTN 065-FI, pre-existing data were used: for PopART, baseline balance on restricted factors was ultimately good; for HPTN 065-FI, baseline balance of the restricted factors was outside the specified limits, as a result of (later discovered) quality issues in the data.
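The ranking approach described for HPTN 065-FI (score each candidate allocation by the sum of squared t-statistics, keep the best-balanced set, then draw one at random) might look roughly like the sketch below. Several details are assumptions made for illustration: the pooled-variance form of the t-statistic, the eight hypothetical clinics and their values, and a retained set of 10 standing in for the 10,000 used in the trial. In the trial this was done separately within each city.

```python
import itertools
import math
import random
import statistics


def pooled_t(x, y):
    """Two-sample t-statistic using the pooled (equal-variance) standard error."""
    nx, ny = len(x), len(y)
    pooled_var = (
        (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
    ) / (nx + ny - 2)
    return (statistics.mean(x) - statistics.mean(y)) / math.sqrt(
        pooled_var * (1 / nx + 1 / ny)
    )


def imbalance_score(treated, covariates):
    """Sum over covariates of the squared t-statistic comparing the two arms."""
    treated_set = set(treated)
    score = 0.0
    for values in covariates:
        arm1 = [v for i, v in enumerate(values) if i in treated_set]
        arm0 = [v for i, v in enumerate(values) if i not in treated_set]
        score += pooled_t(arm1, arm0) ** 2
    return score


# Hypothetical clinics in one city; values are illustrative only.
clinic_size = [60, 120, 250, 400, 90, 800, 150, 300]
baseline_suppression = [0.55, 0.62, 0.70, 0.66, 0.58, 0.74, 0.61, 0.68]
covariates = [clinic_size, baseline_suppression]

# Enumerate all ways to assign 4 of the 8 clinics to the incentive arm.
allocations = list(itertools.combinations(range(len(clinic_size)), 4))

# Rank from best to worst balance and keep the best `keep` allocations.
ranked = sorted(allocations, key=lambda a: imbalance_score(a, covariates))
keep = 10  # assumption: stand-in for the 10,000 retained in the trial
best = ranked[:keep]

# Final allocation is drawn at random from the retained set.
selected = random.Random(65).choice(best)
```

Unlike a threshold rule, ranking always yields a pool of the requested size, but it gives no guarantee that any individual covariate falls within a pre-specified limit.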

Baseline imbalance can be mitigated by statistical methods in the primary analysis, typically by pre-specified primary outcome models that adjust for baseline measures known to be predictors of outcome. This necessitates collection of baseline data across all communities and can represent a substantial additional cost and/or delay. All three trials had baseline assessments: Project ACCEPT completed a baseline survey prior to randomization, with randomization occurring 1 year later, and the intervention starting 6 months after that.

Both PopART and HPTN 065-FI included adjustment for baseline community measures predictive of the primary outcome; this was an important adjustment for HPTN 065-FI, given a baseline imbalance in viral suppression. This approach was not available for the primary outcome in Project ACCEPT, because the intervention (uptake of HIV testing) meant the team could not measure HIV at baseline or during the intervention without interfering with the study intervention. Therefore, no adjustment for baseline HIV prevalence was used.

Measuring the outcome in each cluster

There are three key considerations in designing the outcome evaluation strategy. These essentially amount to understanding the principles that guide the choice of who to measure, what to measure and when to measure. The decision of who to measure derives from the intended impact population of the intervention delivery, that is, who it is intended to benefit (see Figure 1 for the case studies). The first task is to design a sampling strategy for this ideal outcome (impact) population. In the simplest case, the impact and delivery populations are the same. The outcome evaluation could then occur as part of the intervention delivery (in the intervention communities), unless (as is often the case) coverage of the intervention is an integral part of the investigation. In that case, evaluation strategies that define a group representative of the entire population, irrespective of receipt of the intervention, are necessary. Outcome measurements must be implemented in the non-intervention communities, with minimal disturbance to the outcome. It is common for cluster randomized designs to have both direct and indirect mechanisms of action (i.e. benefits accrue not only to those receiving the intervention, but also to partners/children/neighbors), in which case the best practice for outcome assessment includes representation from these other groups. It is also possible that an intervention impact is wholly indirect—for example, in HPTN 071, increased antiretroviral therapy for persons living with HIV had an intended prevention effect for HIV-uninfected.

Figure 1.

The design of the evaluation strategy is determined by who benefits (intervention impact population) from the delivery of the intervention (intervention delivery population). In Project ACCEPT, any member of the community could benefit from those who accepted HIV testing. In HPTN 065, only those who received the financial incentive could benefit from the intervention. For the primary endpoint of HIV infections in HPTN 071, the primary intervention mechanism for impact in HIV-uninfected persons was viral suppression resulting from delivery of the test and treat intervention to HIV-infected persons.

In a study with indirect effects, the outcome assessment usually requires a dedicated research activity, for example, recruiting and following an evaluation cohort, entirely separate from the intervention work. An evaluation cohort often uses a sampling design to recruit a representative group from each cluster. Unbiased estimates will require definition and elucidation of the sampling frame and a well-defined approach to random sampling within each cluster. Alternatively, pre-existing surveillance data systems may allow comprehensive evaluation of the outcome in the community, potentially reducing the cost of the trial, but at the risk of repurposing data systems designed for other use.

In individually randomized trials, a common design choice is to assess the outcome in a subset of the community at higher risk of disease; for example, where HIV infection risk is higher in younger ages, HIV prevention trials typically restrict enrollment by age. Similar strategies can be applied to recruitment of the evaluation cohort, that is, evaluating the impact in a subgroup of the community with higher risk of disease. The trade-off between generalizability¹¹ (effects assessed in the subgroup may not generalize to the entire target population) and efficiency (higher disease rates typically result in more precision in estimated effect) is an important consideration in this choice. Transportability of results to other settings may also be challenging if the targeted subgroup is setting-specific or narrowly defined.^12,13 While generalizability and transportability are not primary drivers of the design decisions, the underlying importance of these epidemiologic principles often does influence the outcome definition.

Project ACCEPT and PopART interventions both had intended indirect impacts, and both implemented an evaluation cohort independent of the intervention. Project ACCEPT and PopART sampled households at random from a sampling frame of all households elucidated in the community to recruit a target number of individuals; Project ACCEPT recruited all eligible members of the household, PopART only a single member (see section on Hawthorne effect). Both PopART (recruited ages 18–44) and Project ACCEPT (recruited ages 18–35) placed an age restriction on the evaluation cohort.

Financial incentives in HPTN 065-FI invoked only direct effects: only those receiving the intervention could benefit. Yet because financial incentives were implemented as a clinic-wide program, rather than implementing individual data collection, HPTN 065-FI used eHARS Surveillance data which collated viral load data from all persons with confirmed diagnosis of HIV. Working with the Centers for Disease Control and Prevention and local departments of health, the study funded the development of clinic-specific evaluations of aggregate viral suppression for persons living with HIV at a clinic. A significant compromise in using eHARS was not being able to restrict the evaluation to persons living with HIV who were on antiretroviral agents (an item not collected in surveillance); the outcome assessed viral load in all persons living with HIV at a clinic, which would have included some not on antiretroviral agents and not eligible to receive the intervention.

The second decision is what outcome to measure. Considerations of power, effect size and interpretability are critical. As with any trial, there is a gradation of quality of evidence, with objective evidence of clinical outcomes being more impactful than self-reported measures or measurements of service uptake. The strength of prior evidence of efficacy, the components of the intervention and the complexity of implementation all play a part in determining the appropriate outcome. Fundamentally, the outcome should reflect as closely as possible the intended effect of the intervention, where that effect has clear clinical or public health benefit. However, even then, it can be difficult to determine the best outcome measure to select. For example, achieving viral suppression in HIV-infected persons is strongly predictive of reduced risk of HIV infection; however, the pathway to viral suppression includes (1) knowledge of HIV infection, (2) initiation of antiretroviral therapy and (3) sustained viral suppression. Is it sufficient to measure uptake of HIV testing to show benefit? Or sufficient to show increase in sustained viral suppression over the entire population? Or is it necessary to directly measure reduction in new HIV infections? An outcome defined for the whole evaluation cohort, rather than only a subset, typically has more reliable power and interpretability. For example, rather than using detectable viral load as an outcome restricted to the HIV-infected population (measuring transmission risk from a subset of the community), a more directly interpretable measure of transmission risk is detectable viral load in the entire population (with HIV-uninfected persons assigned undetectable viral load).
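As a worked illustration of the population-level measure, the convention that HIV-uninfected persons are assigned an undetectable viral load makes the proportion of the whole cohort with detectable viral load a simple product. The symbols (π for HIV prevalence in the evaluation cohort, s for the proportion of HIV-infected persons who are virally suppressed) and the numerical values are introduced here for illustration only.

```latex
\[
\Pr(\text{detectable viral load in the cohort}) = \pi\,(1 - s),
\qquad
\text{e.g. } \pi = 0.20,\ s = 0.70
\;\Rightarrow\; 0.20 \times 0.30 = 0.06 .
\]
```

Under this definition two communities with the same suppression among HIV-infected persons but different prevalence have different values of the outcome, which is what ties the measure to transmission risk in the community as a whole.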

In all three of the case studies, the outcome selected was a biologic outcome of known public health significance: HIV incidence and viral suppression. A consequence of this rigorous assessment was large assessment cohorts—HIV incidence outcomes required cohorts of ∼50,000.

The third decision is when to measure and depends on the expected time to achieve benefit: is it immediate or delayed impact, or are benefits not expected until sufficient uptake has occurred? In the case where benefits are theorized to increase over time, the trial will need to balance evaluation time frames with intervention activities: it will be rare that study logistics permit study-measured outcomes to be assessed much after the intervention is complete. A common solution is to time the assessment toward the end or immediately after conclusion of the intervention. An advantage of external (non-study) evaluation is the ability to continue to assess outcomes after the intervention has completed.

Project ACCEPT evaluated HIV incidence after the completion of 3 years of intervention, using a cross-sectional incidence assessment. In parallel with the 3 years of intervention delivery, PopART followed a cohort with annual visits for 3 years, using longitudinal assessment of incident HIV infection. The initial plan included all 3 years in the outcome assessment; however, based on the pace of intervention rollout, this was changed to HIV incidence in the second and third year, anticipating that impact in the first year was minimal. A delay was built in for HPTN 065-FI—financial incentives could only have an impact after all patients qualified, thus the outcome assessment occurred in months 15–24 of the 2-year program. HPTN 065-FI also completed a long-term assessment of durability of impact.¹⁴

The risk of Hawthorne effect in outcome evaluation

Hawthorne effect in the context of a CRT refers to the risk of changing the estimate of intervention effect through the outcome measurement process. This is a common issue when the communities in the standard of care condition have no intervention component, and the research evaluation is the only study activity in the standard of care communities. This potential is clearly illustrated in Project ACCEPT, where the crux of the proposed intervention pathway was people learning their HIV status through increased ease of HIV testing. If the study measured HIV at baseline, or repeated HIV testing to assess HIV incidence, all persons in the outcome cohort would have received the intervention through measuring the outcome, invalidating measurement of intervention effect. This realization led to the decision to use a cross-sectional incidence assessment as the primary outcome, an assessment occurring only after the completion of intervention delivery. A similar issue existed for PopART, where the backbone of the intervention was household-based HIV testing. Participants in the outcome cohort were offered a rapid HIV test in the household (similar to intervention), raising the risk of a Hawthorne effect, as those in the outcome cohort had more ready access to an HIV test. The mitigations implemented in the outcome evaluation were (1) only one adult was enrolled per household, to limit the potential effect of research participation in the household; (2) rapid HIV tests were offered but not required; the research evaluation used blood collected from all participants. Considering the HIV status of the person enrolled, however, reveals a more subtle consideration of Hawthorne effect in PopART. The hypothesized primary mechanism for decreased incidence was increased antiretroviral therapy uptake in HIV-infected persons. For an HIV-infected participant, testing itself was unlikely to affect the risk of HIV infection in another household’s cohort participant, so the risk of Hawthorne effect on HIV incidence was likely to be minimal. However, a Hawthorne effect on the secondary outcome of viral suppression in HIV-infected participants was quite likely: research participation could increase knowledge of HIV infection and increase viral suppression, compared to those not in the outcome cohort. Table 2 compares the uptake of rapid testing and knowledge of HIV status in intervention versus standard of care arms. Higher uptake in the standard of care arm suggests that participation in the evaluation in standard of care communities resulted in a similar knowledge of HIV infection in all arms of the evaluation cohort over the course of the study, potentially decreasing the observed intervention effect on viral suppression in the evaluation cohort, that is, not accurately reflecting differences in viral suppression between intervention and standard of care arms.

Table 2.

Assessing the potential Hawthorne effect of rapid HIV testing in HIV-infected participants of the Evaluation Cohort of PopART (HPTN 071).

	Intervention arm A	Intervention arm B	Standard of care arm
Uptake of rapid test: all HIV infected
Baseline	16%	13%	19%
1 year	9%	7%	14%
2 years	5%	4%	12%
3 years	3%	2%	5%
Uptake of rapid test: HIV infected who did not self-report HIV+
Baseline	25%	21%	32%
1 year	19%	13%	22%
2 years	13%	10%	34%
3 years	8%	7%	16%
Knowledge of HIV+ (self-report of HIV+ at current or prior visits): all HIV infected
Baseline	64%	65%	65%
1 year	75%	73%	77%
2 years	78%	82%	86%
3 years	82%	86%	89%

The assessment approach in HPTN 065-FI completely avoids the risk of Hawthorne effect, using an existing data surveillance system, rather than implementing a research evaluation. The advantages of existing data systems include (1) ability to evaluate pre-intervention measures to balance randomization and facilitate statistical adjustment, (2) no risk of Hawthorne effect, (3) consistent data across all clusters and (4) potential for post-intervention evaluation. Typical challenges encountered are compromises in the data available and lack of control of the quality, completeness and timing of access to data. Collaboration with surveillance personnel is essential, as additional quality control will need to be implemented to ensure accuracy of fields critical to the study (e.g. cluster identification).

Challenges with representativeness: recruitment and retention

Unbiased evaluation of cluster outcomes commonly requires representative sampling from persons in the cluster, meaning the study team is tasked with recruiting randomly selected persons, leading to trial participation that is more passive. Unlike a cohort recruited and consented in a clinic, household-based or community recruitment of participants does not typically achieve high participation or retention rates. Almost inevitably, this leads to difficulty in attaining the ideal of a representative sample. The resulting potential for bias is difficult to remediate. This is a difficulty well known in survey sampling, and the same weighting methods used to correct for selection probability and non-response can be incorporated in the statistical analysis of a CRT, albeit with a cost to statistical efficiency. Ongoing monitoring of field work is important for detecting and correcting trends in quality and representation of the assessment cohort. For example, adjusting hours of operation in PopART resulted in increased participation in the evaluation cohort from working age members of the community. One approach is to plan and conduct sensitivity analyses to assess the potential impact any bias in evaluation might have on the assessment of effect: these might include evaluation of representation (e.g. age, gender) between arms, and assessment of potential effect moderators (e.g. different effects in men and women). Most CRTs acknowledge as a limitation the evaluation of effect in a cohort imperfectly representative of the communities.
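The survey-style weighting mentioned above, and its cost in efficiency, can be illustrated with a small sketch. It assumes inverse-probability weights formed as one over the product of selection probability and response rate, and uses Kish's approximation for effective sample size. The strata, probabilities and outcomes are hypothetical and chosen so that men respond at half the rate of women; none of this is taken from the case-study analyses.

```python
def inverse_probability_weight(selection_prob, response_rate):
    """Weight for a respondent: inverse of the chance of being selected and responding."""
    return 1.0 / (selection_prob * response_rate)


def weighted_proportion(outcomes, weights):
    """Weighted proportion of a 0/1 outcome."""
    return sum(w * y for y, w in zip(outcomes, weights)) / sum(weights)


def kish_effective_sample_size(weights):
    """Kish approximation: (sum of weights)^2 / (sum of squared weights)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# Hypothetical cluster: equal numbers of men and women in the population,
# same selection probability, but men respond at half the rate of women.
men = {"selection_prob": 0.1, "response_rate": 0.4, "outcomes": [1, 1, 0, 0]}
women = {"selection_prob": 0.1, "response_rate": 0.8,
         "outcomes": [1, 0, 0, 0, 0, 0, 0, 0]}

outcomes, weights = [], []
for stratum in (men, women):
    w = inverse_probability_weight(stratum["selection_prob"], stratum["response_rate"])
    outcomes.extend(stratum["outcomes"])
    weights.extend([w] * len(stratum["outcomes"]))

unweighted = sum(outcomes) / len(outcomes)        # dominated by women respondents
weighted = weighted_proportion(outcomes, weights)  # restores the population sex balance
n_effective = kish_effective_sample_size(weights)  # smaller than the 12 respondents
```

In this example the weighted estimate moves toward the higher outcome rate among the under-represented men, while the effective sample size falls below the number of respondents, which is the efficiency cost the text refers to.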

Both Project ACCEPT and PopART experienced higher participation by females than males in the evaluation cohort (Table 3), likely a consequence of household-based recruitment. The cross-sectional sample of Project ACCEPT achieved more balanced representation; in PopART both enrollment and retention were higher in females, resulting in an effect evaluation with majority contribution by females (71% female, 29% male).

Table 3.

Representation in evaluation cohorts of CRT: distribution of sex in Project ACCEPT and PopART.

	Male	Female
Project ACCEPT
Post-intervention assessment	10,587 (46%)	12,408 (54%)
PopART population cohort
Entire cohort	11,202 (29%)	27,139 (71%)
HIV-uninfected subgroup	9543 (32%)	19,494 (68%)
Total person years in HIV-uninfected (contributing to endpoint)	10,781 (29%)	26,438 (71%)

Cohort size and weighting in evaluation

The size of the intervention clusters in a CRT is typically not fully controlled by design, and it is not uncommon for intervention clusters to be very different in size. Provided cluster size itself is not informative about intervention impact, size of the cohort for each cluster is a design choice in the research evaluation strategy. For a community intervention using a separate research evaluation, there are two common choices. The simplest is to choose the same evaluation cohort size for all communities. This is relatively robust statistically and operationally and is a safe choice when outcome effect across clusters is expected to be similar. The second choice is to select evaluation cohort size to have similar outcome precision in each (pair of) clusters. This typically results in greater efficiency and power for the evaluation, provided that prior knowledge about outcome precision is accurate.
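The contrast between the two choices can be made concrete with a small calculation. The sketch below assumes that, for an incidence outcome, precision in a cluster is driven mainly by the expected number of events, so that "similar precision" is approximated by "similar expected events". The function name, the target of 40 events, the 2-year follow-up and the incidence rates are all hypothetical.

```python
import math


def cohort_size_for_target_events(target_events, incidence_per_100py, follow_up_years):
    """Cohort size needed so that the expected number of events reaches the target."""
    return math.ceil(target_events * 100 / (incidence_per_100py * follow_up_years))


# Hypothetical expected incidence (per 100 person-years) in three communities.
expected_incidence = [1.0, 2.0, 4.0]
follow_up_years = 2

# Choice 2: size each cohort to yield about the same expected number of events.
precision_based_sizes = [
    cohort_size_for_target_events(40, rate, follow_up_years)
    for rate in expected_incidence
]

# Choice 1: equal cohorts of 2000; expected events then differ across communities.
equal_size_events = [
    2000 * rate / 100 * follow_up_years for rate in expected_incidence
]
```

With these inputs the precision-based sizes are 2000, 1000 and 500, whereas equal cohorts of 2000 would be expected to yield 40, 80 and 160 events. The precision-based plan depends entirely on the assumed incidence, which is why it only pays off when that prior information is accurate.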

Equal size was the strategy used in PopART: while the size of the 21 communities ranged from 21,000 to 167,000, the evaluation strategy was to enroll 2000 participants in each community. In the primary analysis, each community was given equal weight in the two-way analysis of variance (ANOVA) used for this three-arm trial, resulting in a simple interpretation of the average intervention effect across the seven triplets. Project ACCEPT based the recruitment strategy on expected incidence: communities expected to have lower HIV incidence had higher planned recruitment. In addition, the analysis weighted pairs of clusters based on outcome variance: a weighted paired t-test was used, where community pairs were assigned weights based on the harmonic mean of the number of events contributed in that pair. The weighted analysis means that communities with a greater number of infections (more “information”) made a greater contribution to the estimate of intervention effect. The intervention effect was a weighted mean of the (log) relative risk across the intervention versus standard of care pairs. In HPTN 065-FI, cluster size corresponded to the number of HIV-infected patients in a clinic (varying from 50 to 2000). The clinic-implemented individual-level intervention was assessed using an aggregate clinic outcome, the proportion of clinic patients virally suppressed. In analysis, clusters were weighted by size to reflect the desired estimand of change in viral suppression for an individual attending a clinic with or without financial incentives.
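
A weighted paired analysis of the kind described for Project ACCEPT can be sketched in Python as follows. The harmonic-mean weights and the weighted mean of log relative risks follow the description above. The variance estimator is an assumption, chosen because it reduces to the ordinary paired t-test when all weights are equal, and it is not necessarily the one used in the trial. The event counts and person-years are hypothetical.

```python
import math


def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


def weighted_t(diffs, weights):
    """Weighted mean of pair-level differences with a t statistic.

    ASSUMPTION: var(mean) = n/(n-1) * sum(w^2 (d - mean)^2) / (sum w)^2,
    which equals s^2/n when all weights are equal.
    """
    n = len(diffs)
    w_sum = sum(weights)
    mean = sum(w * d for w, d in zip(weights, diffs)) / w_sum
    var = (n / (n - 1)) * sum((w * (d - mean)) ** 2 for w, d in zip(weights, diffs)) / w_sum ** 2
    return {"mean": mean, "t": mean / math.sqrt(var), "df": n - 1}


def weighted_paired_analysis(pairs):
    """pairs: (events_int, person_years_int, events_ctl, person_years_ctl).

    Requires at least one event in every community so the log is defined.
    """
    diffs, weights = [], []
    for e1, t1, e0, t0 in pairs:
        diffs.append(math.log((e1 / t1) / (e0 / t0)))   # log relative risk
        weights.append(harmonic_mean(e1, e0))           # events-based weight
    result = weighted_t(diffs, weights)
    result["relative_risk"] = math.exp(result["mean"])
    return result


# ASSUMPTION: hypothetical matched community pairs, not trial data.
pairs = [
    (30, 2000, 40, 2000),
    (12, 1500, 15, 1500),
    (50, 2500, 55, 2500),
    (8, 1000, 14, 1000),
]
print(weighted_paired_analysis(pairs))
```

Pairs with more events receive larger weights, so the pooled relative risk is pulled toward the estimates from the most informative pairs. Setting all weights equal recovers an analysis in which each pair contributes equally, analogous to the equal weighting of communities in PopART.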

Conclusion

In designing the strategy for evaluation of the primary outcome of a CRT, there are a multitude of important decisions to consider. These decisions are guided primarily by the intervention (who benefits, what is the intended effect and when will it be achieved), but there are also important detailed choices that affect potential bias and statistical power, and implementation considerations that require compromise for the sake of feasibility and practicality. Through the lens of three large completed CRTs in HIV prevention, we have presented specific choices made for the overall evaluation plan, together with some of the detailed considerations, compromises and modifications that occurred during trial implementation.

Commonalities in the outcome evaluations of these three exemplary CRTs include (1) careful attention to ensuring balance between arms in the evaluation cohort through the use of matched clusters and/or restricted randomization; (2) evaluation in a broadly representative, most affected cohort that is not linked to receiving the intervention; and (3) objective endpoints of clear public health importance.

Differences in intervention mechanisms, target effects and choice of data source for research evaluation led to different decisions for defining the cohort for research evaluation, the size of the evaluation cohort and the primary analysis approach, particularly with respect to weighting in the analysis.

In the implementation of each of these CRTs, the overall strategy of the evaluation was decided during the design phase. However, in all cases, unanticipated challenges in implementation, as the intervention and evaluation unfolded over the years of the trial, led to modifications and refinements that were necessary to achieve a high-quality evaluation.

Any CRT is an ambitious, resource-intensive effort involving a large group with diverse research expertise. The outcome evaluation strategy is clearly a critical component of the CRT, and careful attention to the multiple decisions, both big and small, is necessary to ensure a high-quality trial.

Footnotes

Author’s note

ClinicalTrials.gov numbers: NCT00203749, NCT01152918, NCT01900977.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship and/or publication of this article: Research is supported by NIH grant UM1AI068617.

ORCID iD

Deborah Donnell
