Matched Comparison Group Design Standards in Systematic Reviews of Early Childhood Interventions

Abstract

Background:

Systematic reviews assess the quality of research on program effectiveness to help decision makers faced with many intervention options. Study quality standards specify criteria that studies must meet, including accounting for baseline differences between intervention and comparison groups. We explore two issues related to systematic review standards: covariate choice and choice of estimation method.

Objective:

To help systematic reviews develop/refine quality standards and support researchers in using nonexperimental designs to estimate program effects, we address two questions: (1) How well do variables that systematic reviews typically require studies to account for explain variation in key child and family outcomes? (2) What methods should studies use to account for preexisting differences between intervention and comparison groups?

Methods:

We examined correlations between baseline characteristics and key outcomes using Early Childhood Longitudinal Study—Birth Cohort data to address Question 1. For Question 2, we used simulations to compare two methods—matching and regression adjustment—to account for preexisting differences between intervention and comparison groups.

Results:

A broad range of potential baseline variables explained relatively little of the variation in child and family outcomes. This suggests the potential for bias even after accounting for these variables, highlighting the need for systematic reviews to provide appropriate cautions about interpreting the results of moderately rated, nonexperimental studies. Our simulations showed that regression adjustment can yield unbiased estimates if all relevant covariates are used, even when the model is misspecified and preexisting differences between the intervention and the comparison groups exist.

Keywords

systematic review, nonexperimental research methods, early childhood, home visiting

Systematic reviews that assess the quality of research on program effectiveness can help decision makers faced with many intervention options. Many systematic reviews, such as the What Works Clearinghouse (WWC), the Home Visiting Evidence of Effectiveness (HomVEE) review, the Teen Pregnancy Prevention Evidence Review (TPP), and the Clearinghouse of Labor Evaluation and Research (CLEAR), assess and rate a study’s internal validity—that is, its ability to isolate the effects of a program or intervention from other factors that might influence participants’ outcomes.

A challenge when conducting systematic reviews is that their objectives can conflict. The aim is to develop standards that identify studies with credible results (i.e., strong internal validity), but the reviews must also provide useful evidence for the fields they cover. Making standards so strict as to accept only the most rigorous study designs would maximize the credibility of results, but few studies would meet such standards, and some imperfect studies that nonetheless produce useful evidence would be discounted. Allowing less rigorous designs to meet standards would let reviewers identify a larger pool of studies to provide evidence on a particular research question, but it would increase the risk that some of those studies are subject to important biases and, hence, produce misleading evidence.

Systematic reviews address this challenge by assigning different quality ratings to studies based on their internal validity. Different systematic reviews tend to agree on the broad elements to consider when assessing study quality, such as research design, attrition, and baseline equivalence (Westbrook, Avellar, & Seftor, 2016). The highest ratings are reserved for studies that provide the strongest evidence that differences in the outcomes between the intervention and the comparison groups at follow-up can be attributed to the intervention rather than to preexisting differences between the groups. The lowest rating identifies studies in which results are not credible due to flaws in the study’s design or its execution. In between these two extremes are “moderate” ratings that are intended to identify potentially useful studies that do not meet the highest standard of evidence.

Because randomization, if performed correctly, ensures that treatment and control groups are similar at baseline in both observable and unobservable characteristics, well-executed randomized controlled trials (RCTs) have high internal validity and typically receive the highest possible study quality rating. For example, RCTs are the only study designs eligible for the highest study quality rating in HomVEE.¹ However, they are not always feasible or practical to implement. RCTs that involve primary data collection, for example, can be expensive and out of reach of research budgets. In addition, stakeholders might not allow an RCT to be conducted in their settings, sometimes because they are resistant to intentionally withholding the treatment from people who could benefit.

Studies using nonexperimental designs, such as matched comparison group design (MCGD) studies, are typically not eligible for the highest study quality rating in systematic reviews. These studies feature intervention and comparison groups formed nonrandomly and can be easier to implement than RCTs. They are more likely to involve secondary data analysis, which can reduce costs and avoid the issue of stakeholder buy-in because consenting respondents have already provided the data. However, these studies have less internal validity and provide weaker evidence that differences in outcomes can be attributed to the intervention because they can account for only observable baseline characteristics and leave open the possibility of important unobserved baseline differences between the groups. Using the logic that nonexperimental studies using appropriately rigorous methods and strong covariates have the potential for producing useful information, systematic reviews have a different set of standards for nonexperimental studies. These standards correspond to a lower rating than the standards for experimental studies. For example, MCGDs can receive a “moderate” but not a “high” study quality rating in HomVEE.

Studies that receive moderate ratings can provide suggestive but not conclusive evidence that an intervention is effective. Because these studies cannot rule out bias caused by preexisting differences between the intervention and the comparison groups, systematic review users should keep in mind, when examining the results of a moderately rated study, that although effects may be due to the intervention under study, other factors might also have contributed.²

Systematic reviews’ study quality ratings center on how convincingly a study establishes that its treatment and control groups (in the case of studies with experimental designs) or intervention and comparison groups (in the case of nonexperimental studies) were equivalent in observable and unobservable ways at baseline (prior to the onset of the intervention) or that any differences could be accounted for in the analysis.³ If groups were equivalent at baseline on observable and unobservable characteristics that are correlated with the outcome (or if the study is able to credibly account for these baseline differences), any subsequent differences can be attributed to the effects of the intervention. If the groups differed at baseline in ways related to study outcomes (and the study did not account for these differences), the resulting evidence would potentially be misleading.

Although many reviews (e.g., the WWC, TPP, and HomVEE) recognize the importance of accounting for key baseline differences in order to credibly estimate intervention effects, there is less consensus on how the concept should be operationalized and assessed. First, reviews differ on the specific characteristics they require nonexperimental studies to take into account. Baseline measures of the outcomes of interest are often the strongest predictors of subsequent outcomes (Cook, Shadish, & Wong, 2008). In the WWC, for example, pretests are a required baseline measure even for studies of early childhood education interventions. Although the bulk of studies the WWC reviews examine K–12 interventions, the WWC also reviews studies of early childhood education interventions targeting children aged 3 and older in center-based preschool programs. WWC standards require baseline equivalence on pretests (or close proxies) for nonexperimental studies in this topic area (WWC, 2014b). In studies of interventions for infants and toddlers, however, baseline outcome measures might not be available. For example, a study of an intervention aiming to improve child cognitive development in which families enroll during pregnancy could not directly assess baseline child cognitive development. Moreover, there is less evidence than there is for older children about how strongly baseline outcomes such as pretests are related to subsequent outcomes.

Second, systematic reviews vary in how they require studies to account for baseline differences in nonexperimental studies. Reviews such as the WWC and HomVEE require study authors to establish baseline equivalence by showing that intervention and comparison groups are similar on key characteristics at baseline—essentially demonstrating that participants and nonparticipants were successfully matched (at least in terms of the variables required by the systematic review, although unobservable differences may remain).⁴ Other reviews, such as CLEAR, allow studies to demonstrate equivalence by controlling for key characteristics in a regression.

With this article, we aim to provide information to stakeholders of systematic reviews who are developing or refining standards for nonexperimental studies and trying to balance rigor against feasibility as well as to researchers conducting nonexperimental studies to estimate intervention effects. We seek to identify areas in which systematic review standards for nonexperimental studies could be improved or modified to better differentiate studies with credible results. We focus on baseline characteristics and outcomes relevant to interventions that target young children and pregnant women. As a case study, we use HomVEE, which assesses the research on home visiting program models geared toward pregnant women and families with children from birth to age 5. However, our findings can apply more broadly to evidence reviews of early childhood interventions.

Our main research questions relate to two issues central to the internal validity of MCGDs:

Covariate choice: How well do variables that systematic reviews typically require studies to account for explain variation in outcomes, and what variables should systematic reviews of early childhood interventions require studies to account for to credibly estimate program effects?

Choice of estimation method: What methods should systematic reviews require studies to use to account for preexisting differences between intervention and comparison groups? Specifically, in what circumstances and to what extent does controlling for baseline differences in a regression model, compared to explicitly establishing baseline equivalence, produce valid impact estimates?

To provide insight into the first research question, we use nationally representative data in a descriptive, empirical analysis to explore the relationships between sets of baseline child and family characteristics and commonly examined subsequent outcomes for these children. This analysis provides information about the baseline characteristics nonexperimental studies should be required to account for to credibly estimate effects of interventions targeting pregnant women and young children. In addition, the analysis sheds light on the potential for unobserved characteristics—those that might not be measured in the data used for analysis—to affect common outcomes and thus lead to bias in impact estimates from nonexperimental studies. We found that the explanatory power of a large set of baseline sociodemographic and outcome measures was low in predicting a range of child and family outcomes relevant for early childhood interventions. In other words, much of the variation in these outcomes seems to be driven by factors or characteristics other than the ones we examined. This leaves a substantial possibility for bias in nonexperimental studies that account for only the characteristics we examined, if the unobserved factors driving variation in the outcomes are also related to participation in the intervention.

For our second research question, we use simulations to compare two methods of accounting for baseline characteristics: matching and regression adjustment. Results of our limited simulation study suggest that regression adjustment and matching can yield similar impact estimates even in the presence of a regression model that is misspecified in the sense that it does not appropriately account for nonlinear relationships between a model’s covariates and outcomes.

We begin with an explicit discussion of the theory underlying the problem of setting standards that are appropriately rigorous but feasible for studies to achieve. We catalog the classes of solutions to this problem and how our approach contributes to a solution. We then discuss our methods and results in detail. Finally, we conclude and suggest directions for future research.

A Discussion of the Problem

We can use Rubin’s causal model (Imbens & Wooldridge, 2007; Rubin, 1974, 1977, 1978) to illustrate the problem faced by program evaluation in general and systematic reviews in particular. According to the model, we observe N individuals drawn randomly from a large population, indexed by i = 1,…, N. Each individual has two potential outcomes, denoted by Y_i(0) and Y_i(1). Y_i(1) indicates the outcome for individual i in the treatment condition, and Y_i(0) indicates the outcome for individual i in the control condition. If we could observe the same individual i in both treatment and control conditions, we could measure the individual-level causal effect, Y_i(1) − Y_i(0). In reality, we observe only one of these potential outcomes for individual i because an individual cannot be in both conditions at once.⁵

One goal of program evaluation is to measure the average treatment effect (ATE) in a population (Equation 1). (In Equation 1, E[⋅] is the mathematical expectation operator, so the ATE is the average treatment-control difference for the population.⁶)

ATE = E [Y_{i} (1) - Y_{i} (0)] .

Inasmuch as researchers cannot observe both the treatment and the control potential outcomes for each individual, they must estimate the ATE by making assumptions about the counterfactual condition for individual i; that is, if i is treated, what would i’s outcome have been if i were not treated (or vice versa).

Classes of Solutions

RCTs

Randomly assigning individuals to the treatment or control condition provides the counterfactual researchers require to estimate unbiased treatment effects. Because randomization ensures that treatment and control groups are only randomly different from one another prior to treatment (i.e., on all baseline factors, observed and unobserved), the control group provides a valid counterfactual for what would have happened to the treatment group in the absence of treatment. Formally, RCTs satisfy the strong ignorability assumption: Under random assignment, treatment assignment (T) is independent of the potential outcomes (Y):

(Y_{i} (0), Y_{i} (1)) ⊥ T_{i},

where T_i = 1 if individual i is randomly assigned to the treatment condition and 0 otherwise. We can estimate the ATE by calculating the difference in the mean outcomes of the treatment and control groups.⁷ Accounting for baseline characteristics—through regression adjustment, for example—is not necessary to reduce bias when estimating impacts in an RCT but can improve the precision of the estimates.
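The link between the ATE definition and random assignment can be illustrated with a small simulation. The sketch below is not part of the original analysis; the sample size, effect size, outcome distribution, and the function name `simulate_rct` are illustrative assumptions. It generates both potential outcomes for every individual (possible only in simulated data), assigns treatment by a coin flip, and compares the difference in observed group means with the true ATE.

```python
import random
import statistics


def simulate_rct(n=20000, effect=2.0, seed=0):
    """Return (true ATE, difference-in-means estimate) for one simulated RCT.

    All numbers here are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)

    # Potential outcomes: in a simulation, Y_i(0) and Y_i(1) exist for everyone.
    y0 = [rng.gauss(10.0, 3.0) for _ in range(n)]
    y1 = [y + effect + rng.gauss(0.0, 1.0) for y in y0]  # effects vary by person

    # True ATE: the average of the individual effects Y_i(1) - Y_i(0).
    true_ate = statistics.fmean([b - a for a, b in zip(y0, y1)])

    # Random assignment: T_i is independent of (Y_i(0), Y_i(1)).
    t = [rng.random() < 0.5 for _ in range(n)]

    # Only one potential outcome is observed for each individual.
    treated = [y1[i] for i in range(n) if t[i]]
    control = [y0[i] for i in range(n) if not t[i]]

    estimate = statistics.fmean(treated) - statistics.fmean(control)
    return true_ate, estimate


if __name__ == "__main__":
    ate, est = simulate_rct()
    print(f"true ATE = {ate:.3f}, difference in means = {est:.3f}")
```

Under these assumptions the two quantities differ only by sampling error, which shrinks as `n` grows.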

Nonexperimental studies

In a nonexperimental study, individuals or groups are not randomly assigned to the intervention but end up there through some nonrandom mechanism, such as personal choice or selection by a caseworker. Thus, simply comparing outcomes of the intervention group and a comparison group of nonparticipants might not yield a valid estimate of the ATE. This strategy assumes that the outcomes of nonparticipants tell us what would have happened to participants had they not participated. However, there are many reasons that participants might have chosen (or been chosen) to participate in the program, so it is unlikely that the experiences of nonparticipants provide a valid counterfactual for the experiences of participants. More specifically, unlike an experimental study in which a simple comparison of average outcomes between the treatment and the control groups yields an unbiased estimate of the treatment effect, such a comparison in a nonexperimental study would be unlikely to result in an unbiased estimate due to observable and unobservable differences between the intervention and the comparison groups.

In order for researchers to estimate unbiased treatment effects in observational studies, these studies must satisfy the conditional ignorability assumption.⁸ In particular, researchers must account for any characteristics or factors, X, that are related to both treatment assignment and the outcome (Stuart, 2010). After accounting for all such factors, treatment assignment is independent of the potential outcomes:

(Y_{i} (0), Y_{i} (1)) ⊥ T_{i} | X_{i} .

In other words, conditional on the confounding covariates used in the analysis, the assignment of individuals to treatment conditions is essentially random (Gelman & Hill, 2007).
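A toy example may help make conditional ignorability concrete. The sketch below is an illustration under assumed values, not an analysis from this article: a single binary baseline characteristic X (hypothetical) raises both the probability of participating and the outcome. The naive comparison of group means absorbs the baseline difference, whereas comparing participants and nonparticipants within levels of X, and then averaging over the distribution of X, recovers the assumed effect.

```python
import random
import statistics


def simulate_confounded_study(n=20000, effect=1.0, seed=1):
    """Return (naive estimate, X-stratified estimate) of a treatment effect.

    Hypothetical setup: a binary baseline characteristic X raises both the
    chance of participating and the outcome, so it confounds the comparison.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = 1 if rng.random() < 0.5 else 0
        p_treat = 0.8 if x == 1 else 0.2      # selection depends on X only
        t = 1 if rng.random() < p_treat else 0
        y = effect * t + 3.0 * x + rng.gauss(0.0, 1.0)
        rows.append((x, t, y))

    def mean_y(x_val, t_val):
        # Mean outcome for one arm, optionally restricted to one level of X.
        return statistics.fmean(
            [y for x, t, y in rows
             if t == t_val and (x_val is None or x == x_val)]
        )

    # Naive comparison ignores X and absorbs the baseline difference.
    naive = mean_y(None, 1) - mean_y(None, 0)

    # Conditional on X, assignment is as good as random: compare within
    # strata of X, then weight each stratum by its share of the sample.
    adjusted = 0.0
    for x_val in (0, 1):
        share = sum(1 for x, _, _ in rows if x == x_val) / n
        adjusted += share * (mean_y(x_val, 1) - mean_y(x_val, 0))
    return naive, adjusted


if __name__ == "__main__":
    naive, adjusted = simulate_confounded_study()
    print(f"naive = {naive:.2f}, stratified on X = {adjusted:.2f}")
```

The adjustment works here only because X is, by construction, the sole driver of selection; if an unobserved factor also influenced participation and the outcome, the stratified estimate would remain biased.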

Angrist (1998) provides a convincing example of a nonexperimental study that satisfies the conditional ignorability assumption. He compared employment outcomes of veterans to those of military applicants who did not serve, arguing that the pool of military applicants (including those who were selected and served) all indicated a strong interest in military service by undergoing a physical examination and completing the Armed Services Vocational Aptitude Battery (ASVAB). Thus, the main sources of bias in comparisons of veterans to nonveterans were the variables that the military used to screen and select applicants. In Angrist’s (1998) analysis, these variables—age, schooling, and ASVAB scores—were observable. By including these variables in regression and matching analyses, Angrist arguably satisfies the conditional ignorability assumption. He bolsters this conclusion by comparing results from the regression and matching analyses to results from an instrumental variables analysis and finding that the results from all analyses were broadly similar.

In general, however, the conditional ignorability assumption is difficult for nonexperimental studies to satisfy. The full set of factors related to participation in an intervention is typically not known with certainty. Moreover, it is common that the factors hypothesized to be related to participation include characteristics not directly observed or included in available data sets.

Contribution of Our Article

To provide useful information to users about intervention effectiveness, systematic reviewers must be able to distinguish between studies that produce credible estimates of an intervention’s effects and those that do not. Consequently, reviews set study quality standards so that study designs with the least potential for bias in impact estimates (RCTs, for example) receive the highest ratings. Nonexperimental studies that use appropriately rigorous methods and include sufficiently strong covariates do not meet this highest standard but may still produce useful evidence. For these studies, systematic reviews therefore set a threshold of rigor lower than that of well-implemented RCTs and give a lower quality rating to studies that meet it.

The challenge is to ensure that standards for nonexperimental studies are set high enough that results from studies meeting these standards are still credible as providing some evidence of an intervention’s effects. In particular, systematic reviews must determine how to define two sets of requirements for nonexperimental studies to meet even moderate quality ratings. First, what baseline characteristics must these studies account for in the analysis to give the greatest level of confidence that unobserved baseline differences between intervention and comparison groups are not leading to meaningful biases in the impact estimates? Second, what methodological approach should the studies use to account for these baseline characteristics?

In this article, we seek to provide information relevant for both questions, using the case of studies of early childhood interventions. We first address the question of which baseline characteristics or covariates are important for nonexperimental studies to account for in estimating impacts. As described earlier, the most important characteristics are those that are both strongly related to key outcomes for this population and that may differ between intervention and comparison groups, that is, are related to selection into the program. Because the factors related to participation differ from intervention to intervention, we focus on identifying characteristics most strongly related to outcomes. We examine these relationships using nationally representative data on pregnant women and young children.

We argue that researchers conducting nonexperimental studies of early childhood interventions should account for characteristics identified as strongly related to child outcomes, and we focus on the extent to which the covariates we examine collectively explain variation in key outcomes. It is important to note that accounting for characteristics identified as strongly related to child outcomes might not be enough to credibly estimate the causal effect of an intervention, and, for early childhood outcomes in particular, it can often be the case that strong predictors of outcomes do not exist. To the extent that the identified covariates explain a large portion of variation in outcomes, however, we argue that there is less scope for other, unobserved characteristics to lead to bias and that confidence in the impact estimates from studies that account for these observed characteristics should be higher than confidence in estimates from studies that do not.

To identify characteristics strongly related to child outcomes and to assess the extent to which these characteristics explain variation in outcomes, we use estimated R² values from regressions of key early childhood outcomes on varying sets of covariates. Inasmuch as R² values indicate the proportion of outcome variance explained by model covariates, we argue that higher R² values suggest that the covariates are more likely to account for factors that—if unaccounted for—would lead to biased impact estimates.

Although we believe that these R² values provide useful evidence for which characteristics should be accounted for in comparison group designs, we acknowledge a key limitation in this logic. This limitation is based on the premise that variation in outcomes can arise from multiple sources, including (1) idiosyncratic or random factors such as measurement error, (2) systematic factors unrelated to individuals’ program participation, and (3) systematic factors related to individuals’ program participation. Strong comparison group designs should account for the factors in this third group.⁹ The R² value for a set of covariates is useful to the extent that higher values are correlated with the covariates accounting for a larger share of this third source of variation. But its limitation is that it may also capture variation in the outcome that is random or unrelated to program participation. Thus, in some contexts, R² values for a set of covariates could be high even if they do not fully account for the key third source of outcome variation; by contrast, low R² values could be sufficient in some contexts in which there is substantial measurement error in the outcome or variation from factors unrelated to program participation (see, e.g., Wooldridge, 2015, p. 181).
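The limitation described above can be made concrete with two contrasting toy scenarios. This sketch uses assumed, hypothetical data-generating processes (they are not drawn from the ECLS-B or from our simulations). In the first, an observed covariate explains most of the outcome variance, yet selection is driven by an unobserved factor, so the regression-adjusted estimate is biased. In the second, a noisy outcome keeps R² low, yet the observed covariate fully captures selection, so the adjusted estimate is close to the assumed effect of 1.0.

```python
import random
import statistics


def _residuals(x, y):
    """Residuals from a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def r2_and_adjusted_effect(x, t, y):
    """Return (R^2 of y on x, coefficient on t from regressing y on t and x).

    The coefficient uses the Frisch-Waugh-Lovell result: residualize both
    y and t on x, then regress one set of residuals on the other.
    """
    ry, rt = _residuals(x, y), _residuals(x, t)
    my = statistics.fmean(y)
    sst = sum((b - my) ** 2 for b in y)
    r2 = 1.0 - sum(e * e for e in ry) / sst
    effect = sum(a * b for a, b in zip(rt, ry)) / sum(a * a for a in rt)
    return r2, effect


def scenario(selection_on_unobserved, n=40000, seed=2):
    """Simulate one hypothetical study; the true effect of t is always 1.0."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]  # observed covariate
    u = [rng.gauss(0, 1) for _ in range(n)]  # unobserved factor
    if selection_on_unobserved:
        # X predicts the outcome well, but participation is driven by U.
        t = [1.0 if u[i] + rng.gauss(0, 1) > 0 else 0.0 for i in range(n)]
        y = [t[i] + 3.0 * x[i] + u[i] + rng.gauss(0, 0.5) for i in range(n)]
    else:
        # X predicts the noisy outcome poorly but fully drives selection.
        t = [1.0 if x[i] + rng.gauss(0, 1) > 0 else 0.0 for i in range(n)]
        y = [t[i] + x[i] + rng.gauss(0, 4.0) for i in range(n)]
    return r2_and_adjusted_effect(x, t, y)


if __name__ == "__main__":
    for flag in (True, False):
        r2, effect = scenario(flag)
        print(f"selection on unobserved={flag}: R^2={r2:.2f}, estimate={effect:.2f}")
```

Both scenarios are deliberately extreme; they show only that R² and bias can diverge, not how often they do in practice.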

To provide insight on what methods systematic reviews should require studies to use to account for baseline differences (the second research question), we conduct a simulation exercise. In particular, we examine the extent to which there is an advantage to matching (i.e., explicitly establishing baseline equivalence) rather than including covariates in a regression model as controls when the regression model is misspecified. More specifically, we examine a situation in which a regression model accounts for some or all of the relevant covariates but assumes a linear functional form when the true relationship is nonlinear. To what extent does the regression approach produce biased impact estimates in these situations? This analysis sheds light on the extent to which a matching approach should be preferred to regression, at least in terms of the conditions covered by these simulations.

Simulations are useful for this purpose because they enable us to choose both the specification for the underlying data-generating process (DGP) and the specification for the regression model we use to estimate impacts. A researcher using nonsimulated, observational data specifies only the regression model and not the DGP. At the same time, the assumed relationships among covariates and between covariates and the outcome upon which the simulation is based come from real data (the same nationally representative data set used to address the first question). This provides some level of realism to the simulations, despite their inherently artificial and limited nature.
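The general shape of such an exercise can be sketched as follows. This is a minimal illustration with a single covariate and an assumed, hypothetical DGP; it is not the simulation design used in this article, which draws its covariate relationships from the ECLS-B. The analyst fixes a DGP in which both selection and the outcome are nonlinear in X, then compares (a) regression adjustment that wrongly assumes linearity with (b) one-to-one nearest-neighbor matching on X with replacement.

```python
import bisect
import random
import statistics


def _residuals(x, y):
    """Residuals from a simple least-squares regression of y on x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return [(b - my) - slope * (a - mx) for a, b in zip(x, y)]


def simulate_methods(n=10000, effect=1.0, seed=3):
    """Return (linear regression estimate, matching estimate); truth = effect."""
    rng = random.Random(seed)
    x = [rng.uniform(0.0, 2.0) for _ in range(n)]
    # Assumed DGP: both selection and the outcome are nonlinear in X.
    t = [1.0 if rng.random() < 0.1 + 0.2 * xi ** 2 else 0.0 for xi in x]
    y = [effect * ti + 2.0 * xi ** 2 + rng.gauss(0.0, 1.0)
         for xi, ti in zip(x, t)]

    # (a) Regression adjustment with a misspecified (linear-in-X) model,
    # computed by residualizing y and t on X (Frisch-Waugh-Lovell).
    ry, rt = _residuals(x, y), _residuals(x, t)
    regression = sum(a * b for a, b in zip(rt, ry)) / sum(a * a for a in rt)

    # (b) Nearest-neighbor matching on X, with replacement. Each treated
    # unit is paired with the control whose X is closest. The effect is
    # constant in this DGP, so the effect on the treated equals the ATE.
    controls = sorted((xi, yi) for xi, ti, yi in zip(x, t, y) if ti == 0.0)
    control_x = [c[0] for c in controls]
    diffs = []
    for xi, ti, yi in zip(x, t, y):
        if ti == 1.0:
            j = bisect.bisect_left(control_x, xi)
            candidates = [k for k in (j - 1, j) if 0 <= k < len(controls)]
            best = min(candidates, key=lambda k: abs(control_x[k] - xi))
            diffs.append(yi - controls[best][1])
    matching = statistics.fmean(diffs)
    return regression, matching


if __name__ == "__main__":
    reg, match = simulate_methods()
    print(f"linear regression = {reg:.2f}, matching = {match:.2f}")
```

How far the two estimates diverge depends entirely on the assumed DGP, in particular on how nonlinear the selection and outcome relationships are, so a single toy run should not be read as evidence for either method.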

HomVEE as a case study

We use HomVEE as the frame to select baseline characteristics and outcomes of interest for our analyses. The Department of Health and Human Services launched HomVEE in 2009 to thoroughly and transparently review the home visiting research literature and assess the evidence of effectiveness for home visiting program models that target families with pregnant women and children from birth to age 5.¹⁰ HomVEE is an apt case study because it shares similarities with other systematic reviews and examines outcome domains relevant to studies of early childhood interventions in general, not only studies of home visiting programs.

As with reviews such as the WWC, TPP, and CLEAR, HomVEE rates studies on the strength of their internal validity and assigns ratings of high, moderate, or low.¹¹ RCTs are eligible for a high rating because of their strong internal validity. Because MCGDs cannot rule out unobservable baseline differences, the highest rating an MCGD can receive is moderate. HomVEE requires baseline equivalence on race and ethnicity; socioeconomic status (SES); and when feasible, baseline measures of outcomes of interest.¹² Table 1 summarizes the HomVEE MCGD standards.

Table 1.

Summary of MCGD Study Rating Criteria for HomVEE.

HomVEE Study Rating	Criteria for MCGD Studies
High	n.a.^a
Moderate	Baseline equivalence established on race/ethnicity, SES, and baseline outcome measures (if applicable)^b; baseline outcome measures used as controls in the analysis (if applicable)^b; no confounding factors^c; must have at least two participants in each study arm; no systematic differences in data collection methods across the intervention and comparison groups
Low	Studies that do not meet the requirement for a moderate rating

Source. “Reviewing Eligible Studies,” available at the Department of Health and Human Services HomVEE website: http://homvee.acf.hhs.gov/Review-Process/4/Producing-Study-Ratings/19/5

Note. HomVEE = Home Visiting Evidence of Effectiveness; MCGD = matched comparison group design; n.a. = not applicable; SES = socioeconomic status.

^a Only randomized controlled trials with low attrition, regression discontinuity designs, and single case designs are eligible for a high rating.

^b If it is not feasible to collect measures of outcomes of interest at baseline (e.g., when the outcome is a child cognitive development measure and the intervention began prenatally), baseline equivalence must be established on race/ethnicity and SES. If it is feasible to collect the baseline outcome measure, baseline equivalence should be established on this measure and used as a control in the analysis.

^c A confounding factor occurs when a component of the research design or methods lines up exactly with the intervention being tested, making it impossible to attribute an observed effect solely to the intervention.

HomVEE reviews studies with outcomes from domains that are relevant for a variety of early childhood interventions and programs, including out-of-home infant and toddler child care, maternal and child health interventions, and child welfare programs. HomVEE’s eight domains are (1) child development and school readiness (including cognitive and socioemotional development); (2) child health; (3) family economic self-sufficiency; (4) linkages and referrals (e.g., to social support services); (5) maternal health; (6) positive parenting practices; (7) reductions in child maltreatment; and (8) reductions in juvenile delinquency, family violence, and crime.

Choosing the Right Covariates

A key question for any researcher who conducts a nonexperimental study is whether available control variables truly capture baseline differences between participants and nonparticipants that might be related to program participation and later outcomes. The WWC, HomVEE, TPP, and other systematic reviews require baseline equivalence for at least variables such as demographic characteristics and SES, which are available in many data sets and are thought to be related to child, family, education, and other outcomes. However, these variables might not be sufficient. Cook, Shadish, and Wong (2008) argue that when used on their own, such “predictors of convenience” are often ineffective in reducing bias in observational studies, and Hallberg, Steiner, and Cook (2011) argue that researchers conducting nonexperimental studies should account for a large and heterogeneous set of characteristics rather than a smaller and more homogeneous set. Even with a large and heterogeneous set of characteristics, the question remains as to whether other important factors not accounted for lead to meaningful bias in estimates of intervention impacts.

Baseline measures of the outcomes of interest are often the strongest predictors of subsequent outcomes (Cook et al., 2008). The baseline outcome or pretest can serve as a proxy for many of the hard-to-measure factors that influence later outcomes, such as an individual’s motivation or support from an extended family. In reanalyzing several within-study comparisons that compared experimental and nonexperimental causal estimates, Cook and Steiner (2010) found that, of several factors such as covariate measurement reliability and the outcome data analysis method chosen, covariate choice mattered most for bias reduction, with the pretest being the single most important covariate. On the other hand, Steiner, Cook, Shadish, and Clark (2010) found that controlling for pretests is not essential in all contexts. In a study of the effects of college student training in vocabulary or math, they found that accounting for students’ a priori topical preferences (without pretests) eliminated nearly all bias in a comparison group design using propensity score matching (PSM). The key feature of this topical preference covariate was its relationship with selection into the program being studied.

More recent within-study comparisons have found that, in the case of examining the impacts of interventions on student achievement scores for middle and high school students, nonexperimental designs that account for students’ baseline or prior test scores can produce impact estimates that match those of experimental designs (Bifulco, 2012; Fortson, Gleason, Kopa, & Verbitsky-Savitz, 2015). However, these and other within-study comparisons examine achievement outcomes for older children. Much less is known about the ability of nonexperimental designs to produce impact estimates that match those of experimental designs for early childhood interventions. One exception is Hill, Gormley, and Adelstein (2015), which found good correspondence between the estimated impacts of Head Start programs in Oklahoma using a comparison group design versus a regression discontinuity design with arguably stronger causal validity.¹³ This was the case although the comparison group model did not account for children’s pretests and relied instead on demographic characteristics and a rich set of socioeconomic indicators (including age, gender, race/ethnicity, free or reduced-price lunch status, maternal education, and family structure).

Method

To explore the plausibility of various covariate sets as controls in nonexperimental studies, we conducted an empirical, descriptive analysis of a nationally representative data set in which we identified a set of common baseline characteristics and examined their correlations with a set of outcomes that early childhood interventions could target. The goal of this analysis was to show how much variation in young children’s outcomes could be explained by variables typically required by systematic reviews and how much more variation could be explained by including additional variables. We used the R² statistics from different regression models as measures of the percentage of variation in the outcomes explained by the covariates. Although a high R² does not imply unbiasedness, it does indicate a greater likelihood that the covariates accounted for in the design are explaining key variation in the outcome (i.e., variation caused by factors associated with selection into the program).
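The comparison of R² across successively larger covariate sets can be illustrated with a minimal sketch. The data below are synthetic stand-ins, not the ECLS-B, and the variable names, coefficients, and the helper `r_squared` are assumptions made only for illustration.

```python
import numpy as np

def r_squared(X, y):
    """R^2 from an OLS regression of y on X (an intercept is added here)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return 1.0 - resid.var() / y.var()

# Synthetic stand-in data (NOT the ECLS-B); coefficients are hypothetical.
rng = np.random.default_rng(0)
n = 5000
ses = rng.normal(size=n)
age = rng.normal(size=n)
pretest = rng.normal(size=n)
outcome = 0.3 * ses + 0.4 * age + 0.2 * pretest + rng.normal(size=n)

covariates = {"ses": ses, "age": age, "pretest": pretest}
# Each set contains all covariates from the preceding set plus one more.
nested_sets = [["ses"], ["ses", "age"], ["ses", "age", "pretest"]]

r2_by_set = []
for names in nested_sets:
    X = np.column_stack([covariates[k] for k in names])
    r2_by_set.append(r_squared(X, outcome))
    print(names, round(r2_by_set[-1], 3))
```

Because the sets are nested and fit on the same sample, the in-sample R² cannot decrease from one set to the next; the quantity of interest is how much it rises.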

To determine which variables explained the most variation in outcomes commonly seen in studies of programs targeting pregnant women and young children, we analyzed data from the Early Childhood Longitudinal Study—Birth Cohort (ECLS-B). The ECLS-B provides detailed information on children’s early life experiences, focusing on their health, development, care, and education from birth to entry into kindergarten.¹⁴ It contains an abundance of demographic information about study children and families. In addition to race, ethnicity, and SES, the data set has information on maternal education, receipt of government assistance, household structure, and food security. The ECLS-B also contains baseline outcome measures in the domains of cognitive and socioemotional development, child and maternal health, and parenting practices.

We began with covariates mentioned in current HomVEE standards—race/ethnicity and SES—and successively added more sets of covariates. The most comprehensive set includes baseline outcome measures, which (as stated previously) are often unavailable in studies of interventions for pregnant women, infants, and toddlers. Table 2 lists the baseline covariates we considered, organized into five sets. Set 1 contains race/ethnicity and SES. (The ECLS-B’s SES index is a composite measure of parental education, occupation, and household income.) Set 2 includes additional measures of SES: household income, poverty status, and other variables. Set 3 adds variables measuring family economic condition: household structure, a food security index, and receipt of government housing assistance. Set 4 adds child characteristics: age at assessment and gender. Set 5 adds baseline outcome measures to the variables in all preceding models.¹⁵

Table 2.

Five Sets of Baseline Covariates.

Variable Description	Set 1	Set 2	Set 3	Set 4	Set 5
Minimum required for establishing baseline equivalence under current HomVEE standards^a
Mother’s race/ethnicity	X	X	X	X	X
SES index	X	X	X	X	X
Additional SES measures
Maternal education		X	X	X	X
Maternal employment		X	X	X	X
Food stamp receipt		X	X	X	X
Welfare receipt		X	X	X	X
Medicaid receipt		X	X	X	X
WIC receipt		X	X	X	X
Household income		X	X	X	X
Household income below 185% of federal poverty level		X	X	X	X
Other family economic condition variables
Household structure			X	X	X
Food security index			X	X	X
Government housing assistance			X	X	X
Child characteristics not required under current HomVEE standards
Child’s age at assessment				X	X
Child’s gender				X	X
Baseline outcome measures
Baseline cognitive development					X
Baseline behavior index					X
Baseline child health					X
Baseline maternal depressive symptoms					X
Baseline NCATS total parent score					X
Baseline NCATS total child score					X

Source. Early Childhood Longitudinal Study, Birth Cohort, 9-month data collection.

Note. HomVEE = Home Visiting Evidence of Effectiveness; NCATS = Nursing Child Assessment Teaching Scale; SES = socioeconomic status; WIC = Special Supplemental Nutrition Program for Women, Infants, and Children.

^aFrom the HomVEE website (http://homvee.acf.hhs.gov/Review-Process/4/Producing-Study-Ratings/19/5): For the HomVEE review, we prefer to see SES equivalence on specific economic well-being measures: income, earnings, or poverty levels according to federal thresholds. We also accept means-tested assistance (such as Aid to Families with Dependent Children/Temporary Assistance to Needy Families or food stamps/Supplemental Nutrition Assistance Program receipt), maternal education, and employment of at least one member in the household if at least two such alternative measures of SES are provided.

We analyzed at least one outcome in each of the eight HomVEE domains. Table 3 lists these outcome domains and the outcome measures within each domain used in this analysis, focusing on outcomes commonly used in studies reviewed by HomVEE. To assess the extent to which each set of covariates is associated with these outcomes, we examined the R² from regressions of each outcome on each covariate set.

Table 3.

Outcome Measures for MCGD Covariate Choice Analysis.

Outcome	HomVEE Domain
BSF-R Mental Scale score, age 2	Child development and school readiness (cognitive development)
Behavior Rating Scale Index, age 2	Child development and school readiness (socioemotional development)
Two Bags parental supportiveness, age 2	Positive parenting practices
Child health, age 2	Child health
Child birth weight status	Child health
Premature birth status	Child health
Maternal depressive symptoms, age 4	Maternal health
Number of community services accessed, age 2	Linkages and referrals
Child witnessed violence in the home, age 4	Reductions in juvenile delinquency, family violence, and crime
Child victim of violence in the home, age 4	Reductions in child maltreatment
SES, age 4	Family economic self-sufficiency

Source. Early Childhood Longitudinal Study, Birth Cohort, 9-month, 2-year, and preschool data collections.

Note. BSF-R = Bayley Short Form–Research Edition; HomVEE = Home Visiting Evidence of Effectiveness; SES = socioeconomic status.

Results

Using data from ECLS-B, we assessed the extent to which each of five sets of covariates is associated with the outcomes commonly used in studies of early childhood interventions by examining the R² from regressions of each outcome on each covariate set. Table 4 contains the R² from each of the five regressions for each of the outcomes we considered. For most outcomes, the R² is relatively low (ranging from .01 to .24 for all but one outcome), even for the most comprehensive set of covariates.¹⁶ In other words, even the most exhaustive set of covariates examined here (Set 5) explains only a small proportion of the variation in these outcomes; unobserved baseline characteristics not included in the analysis may explain most of the variation. This does not necessarily imply that the included covariates are not effective in reducing bias, but it does mean that substantial potential for bias remains even after accounting for these covariates. A low R² is particularly worrisome when it arises from the inability of the included covariates to capture variation arising from systematic, unobserved factors related to individuals’ program participation (the third source of variation discussed previously).

Table 4.

Goodness-of-Fit Measures in Five Linear Regression Models.

		R²
Outcome	Domain	Set 1	Set 2	Set 3	Set 4	Set 5
BSF-R Mental Scale score, age 2	Cognitive development	.09	.10	.10	.22	.24
Behavior Rating Scale Index, age 2	Socioemotional development	.02	.03	.03	.07	.10
Two Bags parental supportiveness, age 2	Parenting	.16	.17	.18	.18	.21
Child health, age 2	Child health	.03	.03	.03	.04	.17
Child birth weight status^a	Child health	.05	.06	.06	NA	NA
Premature birth status^a	Child health	.04	.05	.06	NA	NA
Maternal depressive symptoms, age 4	Maternal health	.04	.05	.05	.05	.12
Number of community services accessed, age 2	Linkages and referrals	.06	.11	.14	.14	.14
Child witnessed violence in the home, age 4	Family violence	.01	.02	.02	.02	.03
Child victim of violence in the home, age 4	Child maltreatment	.01	.01	.02	.02	.02
Socioeconomic status, age 4	Family economic self-sufficiency	.71	.74	.74	.74	.73
Variables included in models
Minimum required		X	X	X	X	X
Additional SES measures			X	X	X	X
Other family economic condition variables				X	X	X
Child assessment age and gender					X	X
Baseline outcome measures						X

Source. Early Childhood Longitudinal Study, Birth Cohort, 9-month, 2-year, and preschool data collections.

Note. BSF-R = Bayley Short Form-Research Edition; NA = not available; SES = socioeconomic status.

^aBecause this outcome is measured at birth, baseline outcome measures and child characteristics cannot be used as covariates in regressions with this as the outcome. Thus, we exclude Models 4 and 5 in the analyses of this outcome.

Although the overall percentage of variation explained is low in general, certain groups of covariates stood out as being important in explaining the variance of the outcomes (Table 4). For the cognitive and socioemotional development outcomes, adding assessment age and child gender (i.e., moving from covariate Set 3 to Set 4) more than doubled the R² (from .10 to .22 for the cognitive development outcome and .03 to .07 for the socioemotional development outcome). The magnitudes of these increases in R² were small in absolute terms, but this finding reflects the close relationship between age and development among very young children and reinforces the importance of accounting for age in models examining program impacts on measures of children’s development. By contrast, age and gender offered little additional explanatory power for the other outcomes examined in Table 4. We also observed that including additional measures of SES (i.e., moving from Set 1 to Set 2) almost doubled the R² for the linkages and referrals outcome (from .06 to .11).

In addition, we found that baseline outcome measures were important in explaining variance in child and maternal health outcomes but not in other domains. Adding baseline child health (along with the other baseline outcome measures) more than quadrupled the R² of the child health outcome regression (from .04 to .17). For the maternal depressive symptoms outcome, adding baseline outcome measures (including baseline maternal depressive symptoms) led to an increase in R² from .05 to .12.

Overall, however, baseline outcomes had relatively little explanatory power for the domains examined here. Aside from the family economic self-sufficiency domain, the overall R² value for the set of explanatory variables that includes baseline outcomes was .24 or less. This level of explanatory power was much lower than that of a similar set of covariates (including baseline outcomes) in explaining the academic achievement of older children, where the R² values often exceeded .50 (Bloom, Bos, & Lee, 1999; Schochet, 2008). An implication of this finding is that baseline outcomes are more important in models that explain achievement outcomes for older children than for younger children. More specifically, Cook et al. (2008) argue that pretests will often be sufficient for older children. For younger children, however, pretests (when they exist at all) have much lower predictive power, so they are unlikely to sufficiently reduce bias in analyses involving young children. In other words, for some young children’s outcomes, including baseline outcomes in addition to the other covariates included in Sets 1–4 might do little to reduce bias. The low explanatory power of these variables leaves considerable potential for bias and calls into question the ability of nonexperimental, observational studies to provide unbiased estimates of effects of interventions targeting young children.

Choosing an Estimation Method

Once a researcher has decided upon the right covariates, he or she must choose an estimation method that accounts for differences between the intervention and the comparison groups on those characteristics. We focus in this article on two types of methods commonly used in studies reviewed by HomVEE: regression and matching. Regression and matching methods share the goal of eliminating bias due to preexisting differences between intervention and comparison groups that are related to outcomes. However, the methods differ in how they account for such differences. Regression methods use data from intervention and comparison groups that may have different characteristics, but they attempt to control for these differences by modeling relationships between observed covariates and the outcome. Matching methods attempt to form a comparison group with baseline characteristics that are similar to those of the intervention group. As noted earlier, a weakness common to both estimation methods is that they can account for only observed baseline characteristics. If there are unobserved characteristics related to the outcomes that differ between groups, neither approach will produce unbiased estimates.

Regression Adjustment

Regression methods involve estimating a model in which the dependent variable is the outcome of interest, and the independent variables include an indicator for participation in the intervention and any baseline covariates the researcher has selected to account for underlying differences between the intervention and comparison groups that would otherwise confound estimates of the intervention’s impact. The researcher must also select a functional form for the regression—that is, a model that describes the relationship between the covariates and the outcome of interest.

Researchers who rely solely on regression adjustment do not restrict the comparison group in any way prior to impact estimation to make it more similar to the intervention group; instead, they rely on regression adjustment to account for any differences in the process of estimating impacts. Thus, the two groups might have different baseline characteristics, on average. For example, the intervention group might be more disadvantaged, on average, with a larger proportion in poverty.

When a particular set of assumptions holds, this type of nonexperimental design can produce unbiased impact estimates. In particular, the regression model must account for all relevant (i.e., related to program participation and the outcome) baseline differences between the intervention and the comparison groups. In addition, the functional form of the regression model must be correctly specified—that is, the model must accurately capture the nature of the relationship between the covariates and the outcome.

Regression methods have advantages and disadvantages. An advantage is that researchers can use a broad comparison group, which provides a larger sample than a smaller, more closely matched comparison group would. Therefore, in some cases, regression adjustment can produce more precise impact estimates than a matching approach. A weakness of regression methods is that they must rely heavily on the appropriateness of the model’s functional form assumptions in cases in which the baseline characteristics of the intervention and comparison groups differ substantially (e.g., when there is little overlap between the groups in the distribution of the covariates, as shown in Figure 1). If a researcher assumes an important covariate is linearly related to the outcome but the true relationship is nonlinear, this misspecification could lead to substantial bias. Moreover, researchers who conduct nonexperimental regression-based studies are sometimes unaware of the extent to which the intervention and comparison groups differ in their baseline characteristics, and therefore of the extent to which they are relying on their functional form assumptions.

Figure 1.

Small degree of overlap between study groups.

In cases in which the regression-based approach is highly dependent on the functional form assumption, impact estimates rely heavily on extrapolation (Stuart, 2010), which can be quite sensitive to functional form (Foster, 2003). As an example of the dangers of extrapolation, consider a linear model relating weight to age specifically for prepubescent children. Equation 4 (Luscombe, Owens, & Burke, 2011) presents an easy-to-calculate relationship used in emergency medicine:

\text{Weight (kg)} = 3 \times \text{Age} + 7 .

According to this equation, a reasonable weight for a 6-year-old child would be 25 kg (55 pounds). Applied to a 60-year-old adult, this formula yields the unreasonable weight of 187 kg or 412 pounds. The linear model, which applies roughly to prepubescent children, does not apply to adults; in fact, the relationship between weight and age for adults is concave (weight tends to be lower for younger adults, increases until about age 50, then slowly decreases).
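The arithmetic behind this example can be checked with a short sketch. The function name and the pound conversion constant are ours; the formula itself is the one quoted above.

```python
KG_TO_LB = 2.20462  # pounds per kilogram

def estimated_weight_kg(age_years):
    """Linear rule of thumb quoted in the text; intended for prepubescent children."""
    return 3 * age_years + 7

# Age 6 lies inside the range the rule was built for; age 60 is an extrapolation.
for age in (6, 60):
    kg = estimated_weight_kg(age)
    print(f"age {age}: {kg} kg ({kg * KG_TO_LB:.0f} lb)")
```

The same coefficients give a plausible value inside the range of the data used to build the rule and an implausible one far outside it, which is the risk a regression runs when the intervention and comparison groups barely overlap.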

Matching

Another way researchers can account for observed baseline differences is by establishing a matched comparison group that has key baseline characteristics similar to the intervention group. After gathering information on program participants and potential comparison group members and selecting the matching variables, the researcher must choose from among many methods to form the matches. PSM has emerged as the recommended approach for nonexperimental program evaluation (Peikes, Moreno, & Orzol, 2008). It involves two steps: (1) estimating the propensity score (the estimation step) and (2) matching intervention group members to comparison individuals (the application step; Harder, Stuart, & Anthony, 2010). The estimation step involves estimating the probability of being in the intervention group (the propensity score). The application step involves selecting comparison group members with propensity scores that are very close to those of intervention group members.
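The two steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any study cited here: the data are synthetic, the true impact is set to zero, the propensity model is a logistic regression fit by Newton's method, and matching is one-to-one nearest neighbor with replacement. All function names are hypothetical. Because each intervention group member is matched to a comparison unit, the sketch estimates the effect for the intervention group.

```python
import numpy as np

def fit_logit(X, t, n_iter=25):
    """Estimation step: logistic regression by Newton's method (intercept first)."""
    design = np.column_stack([np.ones(len(t)), X])
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (t - p)
        hess = design.T @ (design * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta

def propensity(X, beta):
    """Predicted probability of being in the intervention group."""
    design = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-design @ beta))

def match_nearest(ps, t):
    """Application step: closest comparison unit for each intervention unit."""
    treated = np.flatnonzero(t == 1)
    control = np.flatnonzero(t == 0)
    dist = np.abs(ps[treated][:, None] - ps[control][None, :])
    return treated, control[dist.argmin(axis=1)]

# Synthetic data: participation depends on x, and so does the outcome.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=(n, 2))
true_p = 1.0 / (1.0 + np.exp(-(0.8 * x[:, 0] + 0.5 * x[:, 1])))
t = (rng.random(n) < true_p).astype(int)
y = x[:, 0] + x[:, 1] + rng.normal(size=n)  # no program term: true impact is 0

ps = propensity(x, fit_logit(x, t))
treated, matched = match_nearest(ps, t)

naive = y[t == 1].mean() - y[t == 0].mean()
matched_est = y[treated].mean() - y[matched].mean()
print(f"unadjusted difference: {naive:.2f}; matched difference: {matched_est:.2f}")
```

In practice a researcher would also check covariate balance in the matched sample before estimating impacts; that step is omitted here for brevity.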

PSM can produce valid (unbiased) estimates of the ATE if there are no unobserved characteristics related to the outcome that differ between the intervention and the comparison groups and if the researcher correctly models the relationship between the covariates and the treatment status.¹⁷ A strength of PSM is that a standard part of the approach involves assessing whether the intervention and comparison groups overlap enough on baseline characteristics to estimate the ATE (Rubin, 1997). In other words, the researcher will explicitly compare the baseline characteristics of the two groups and proceed with estimation only if the groups match closely. In addition, if such an approach is well implemented and the groups do match closely on relevant baseline characteristics, the estimation of impacts relies much less strongly (or not at all) on assumptions about the functional form of the relationship between covariates and the outcome. Thus, research suggests that using propensity score methods is better than using regression adjustment alone when the distributions of covariates between the intervention and the comparison groups are very different (Schafer & Kang, 2008). As noted, however, matching approaches may result in lower levels of statistical power, and so there are circumstances in which regression methods are preferred.

We have thus far presented regression adjustment and matching as two extremes of a continuum. Regression adjustment with no attempt to balance the distributions of the intervention and comparison groups lies at one end. In this simplified conceptualization, purely regression-based studies rely entirely on functional form to estimate the relationship between an intervention and its outcomes. At the other end of the continuum lie PSM and other methods that form matched intervention and comparison groups. At this extreme, matching methods do not rely at all on the functional form of the relationship between the intervention and its outcomes—a researcher could simply compare the means of the matched intervention and comparison groups to estimate a treatment effect.

It is more realistic to consider situations that lie between these extremes. In practice, checking for balance is, of course, possible in regression-based studies. In other words, a researcher could assess the extent to which the distributions of covariates are similar between intervention and comparison groups and, prior to estimating a regression model, take steps to make the distributions more similar (e.g., by trimming outliers). Furthermore, many matching estimators are infeasible when covariates are not balanced across intervention and comparison groups—that is, the models will not converge and they will fail to produce estimates. In this case, regression adjustment may be a researcher’s only option, although, as stated previously, estimates from such an analysis would rely heavily on the regression model’s functional form.

Hybrid methods

In fact, researchers need not choose between matching and regression approaches; they can combine the two. One method is to include in the outcome regression the right-hand-side variables from the propensity score model. Many researchers have discussed the benefits of this method (Harder et al., 2010). Another method is doubly robust estimation, which combines regression and weighting by a function of the propensity score (Bang & Robins, 2005). Doubly robust methods involve specifying both a propensity model and an outcome regression model; double robustness means that, in large samples, the treatment effect estimate remains unbiased as long as at least one of the two models is correctly specified. Schafer and Kang (2008) provide an overview of these methods.
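One common doubly robust estimator is augmented inverse probability weighting (AIPW). The sketch below is an illustration under our own assumptions, not the estimator of any particular cited study: the data are synthetic with a true impact of zero, the propensity model is correctly specified, and the outcome model is deliberately misspecified (linear, although the true relationship includes a squared term). Function names are hypothetical.

```python
import numpy as np

def add_const(X):
    return np.column_stack([np.ones(len(X)), X])

def fit_logit(X, t, n_iter=25):
    """Propensity model: logistic regression by Newton's method."""
    design = add_const(X)
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        hess = design.T @ (design * (p * (1 - p))[:, None])
        beta += np.linalg.solve(hess, design.T @ (t - p))
    return beta

def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(add_const(X), y, rcond=None)
    return coef

def aipw_ate(X, t, y):
    """AIPW estimate of the average treatment effect."""
    e = 1.0 / (1.0 + np.exp(-add_const(X) @ fit_logit(X, t)))
    # Outcome regressions fit separately by group, predicted for everyone.
    mu1 = add_const(X) @ fit_ols(X[t == 1], y[t == 1])
    mu0 = add_const(X) @ fit_ols(X[t == 0], y[t == 0])
    return np.mean(mu1 - mu0
                   + t * (y - mu1) / e
                   - (1 - t) * (y - mu0) / (1 - e))

rng = np.random.default_rng(2)
n = 4000
x = rng.normal(size=n)
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-(-0.3 + 0.8 * x)))).astype(int)
y = x + 0.5 * x**2 + rng.normal(size=n)  # true impact is 0; x**2 omitted from the outcome model

naive = y[t == 1].mean() - y[t == 0].mean()
dr = aipw_ate(x.reshape(-1, 1), t, y)
print(f"unadjusted difference: {naive:.2f}; doubly robust estimate: {dr:.2f}")
```

Here the weighting terms correct the residual error left by the misspecified outcome regressions, because the propensity model is correct.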

Method

Researchers who are attempting to estimate program impacts face several issues surrounding the impact estimation method. One issue concerns the relative effectiveness of establishing baseline equivalence through matching versus accounting for baseline differences using regression adjustment. As described previously, the regression adjustment approach is questionable when intervention and comparison groups differ on key baseline characteristics because it relies on assumptions about the functional form of the regression model. Our simulation exercise examines the extent to which there is an advantage to establishing baseline equivalence through matching rather than just including the covariates in a regression model as controls when the functional form is misspecified.

A second issue involves the scenario in which it is difficult to account for a key baseline characteristic that might differ between participants and nonparticipants. For example, of particular concern in HomVEE is a situation in which a characteristic such as a child’s cognitive ability influences the likelihood that a family would participate in a home visiting program but is not feasible to collect at baseline (this could occur when a family enrolls while the child is an infant, but the outcome of interest is measured when the child is a toddler or preschooler). In this case, baseline measures of the outcome are not available, but proxy measures, such as parental education or home environment (e.g., number of books in the home), can be used. A secondary goal of our simulation analysis is to provide some evidence on whether regression adjustment produces valid impact estimates in such a case; specifically, on whether controlling for race/ethnicity and SES adequately accounts for baseline differences on a potentially unobservable characteristic, such as baseline child cognitive ability.

We simulated a situation in which program participants and nonparticipants differed on a small, limited set of baseline characteristics, and a researcher attempted to estimate program impacts using two approaches: (1) matching to establish intervention and comparison groups that were similar in terms of key characteristics at baseline and (2) regression adjustment to control statistically for differences in baseline covariates. Within each approach, we examined the levels of bias that occurred when the researcher lacked data on one or more of the key baseline covariates.

The simulation exercise involved two main steps: (1) simulating data similar to actual data from the ECLS-B (including a program participation indicator and an outcome) and (2) estimating the effect of program participation. The goal of this simulation exercise was to shed light on what happens when we know the true relationship between covariates and program participation and estimate an impact model using a regression model that may or may not have the right covariates and does not use the true functional form.

Step 1: Simulate data similar to actual data from the ECLS-B

We created simulated data sets containing three types of variables: baseline characteristics, a program participation variable, and an outcome measure.

Baseline characteristics

We created simulated data sets containing race/ethnicity, an SES index, and baseline cognitive development (as measured by the Bayley Short Form–Research edition [BSF-R] Mental Scale score), so that the simulated variables had the same means, variances, and correlations as the actual ECLS-B variables. We generated multiple binary and normally distributed variables simultaneously using the correlational structure of the actual variables from the ECLS-B with the “jointly.generate.binary.normal” command in the R software package BinNor (Demirtas, Amatya, & Doganay, 2014; Demirtas & Doganay, 2012). All variables were measured when children were 9 months old. To simplify the analysis, we recoded the race/ethnicity variable so that 0 = non-White and 1 = White, and we recoded the SES variable so that 0 = low SES and 1 = high SES. In addition, we standardized the cognitive development measure so that it had a mean of 0 and a standard deviation (SD) of 1. Tables 5 and 6 present the means, variances, and correlations of these variables from the ECLS-B.
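For readers working outside R, the idea of jointly generating binary and normal variables can be sketched in Python. This is not the BinNor algorithm: it thresholds latent normal variables, sets the two near-zero correlations with baseline cognitive development in Table 6 to exactly zero, and picks the latent correlation with the approximation sin(π × φ / 2), which is only exact for median splits. The means and the target correlation of .19 come from Tables 5 and 6; everything else is an assumption.

```python
import numpy as np
from statistics import NormalDist

P_WHITE, P_HIGH_SES = 0.46, 0.41   # means from Table 5
TARGET_PHI = 0.19                  # race/ethnicity-SES correlation from Table 6

# Thresholding attenuates correlation, so the latent correlation must exceed
# the target. This closed form is an approximation for near-median splits.
latent_rho = np.sin(np.pi * TARGET_PHI / 2)
LATENT_CORR = np.array([[1.0, latent_rho, 0.0],
                        [latent_rho, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])

def simulate_baseline(n, rng):
    """Return (race, ses, cog): two binary variables and one standard normal."""
    z = rng.multivariate_normal(np.zeros(3), LATENT_CORR, size=n)
    race = (z[:, 0] < NormalDist().inv_cdf(P_WHITE)).astype(int)
    ses = (z[:, 1] < NormalDist().inv_cdf(P_HIGH_SES)).astype(int)
    cog = z[:, 2]  # already mean 0, SD 1
    return race, ses, cog

rng = np.random.default_rng(42)
race, ses, cog = simulate_baseline(20_000, rng)
print("means:", race.mean().round(2), ses.mean().round(2), cog.mean().round(2))
print("race-SES correlation:", np.corrcoef(race, ses)[0, 1].round(2))
```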

Table 5.

Means and Variances of Baseline Variables Used in Simulation.

	Mean	Variance
Race/ethnicity	0.46	0.25
SES	0.41	0.24
Baseline cognitive development	0.00	1.00

Source. Early Childhood Longitudinal Study, Birth Cohort, 9-month data collection.

Note. SES = socioeconomic status.

Table 6.

Correlation Matrix of Baseline Variables Used in Simulation.

	Race/Ethnicity	SES	Baseline Cognitive Development
Race/ethnicity	1.00
SES	.19***	1.00
Baseline cognitive development	−.01	.00	1.00

Source. Early Childhood Longitudinal Study, Birth Cohort, 9-month data collection.

Note. SES = socioeconomic status.

***Significantly different from 0 at the .01 level, two-tailed test.

Any sample includes only some members of a population, so sample characteristics can differ from another sample drawn from the same population. Simulating one data set is akin to collecting data on one sample—a sample that may not be representative of the population as a whole. To incorporate variation that would naturally occur across samples within a population, multiple data sets can be created to capture some random fluctuations or “noise” across different samples. These fluctuations make the simulated data more realistic and give the outcomes the same kind of variation we would likely observe in a real data set. We created 19,992 data sets with 100 observations in each. We analyzed the data sets separately, and, within each simulation scenario described subsequently, we averaged the impact estimates across simulated data sets.

Program participation variable

Next, we used these variables in each simulated data set to generate the probability of participating in a hypothetical program. By manipulating the DGP, we varied the degree to which the baseline variables influenced the probability of participation. If participation depended heavily on the baseline variables, the intervention and comparison groups would be dissimilar. For example, if high-SES families were more likely to participate, the mean of the SES measure would be higher for the intervention group than for the comparison group. In this case, there would be high dissimilarity. Conversely, if participation were unrelated to the baseline variables, the intervention and comparison groups would be similar on these measures. In this case, there would be no dissimilarity. Varying the degree to which intervention and comparison groups were similar in terms of key baseline characteristics enabled us to investigate the amount of bias in impact estimates when we exclude one or more of these characteristics.

By manipulating the DGP, we constructed data sets in which each had a specified relationship between the baseline covariates and the program participation. Specifically:

In the high dissimilarity case, the average baseline cognitive development score for the intervention group was 0.9 SD higher than the average score for the comparison group, the percentage of White participants in the intervention group was 51% higher than in the comparison group, and the percentage of high-SES individuals in the intervention group was also 51% higher than in the comparison group.

In the moderate dissimilarity case, the average baseline cognitive development score for the intervention group was 0.2 SD higher than the average score for the comparison group, the percentage of White participants in the intervention group was 9% higher than in the comparison group, and the percentage of high-SES individuals in the intervention group was also 9% higher than in the comparison group.

In the no dissimilarity case, intervention and comparison groups had the same average value on these variables.
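The link between the participation model and group dissimilarity can be illustrated with a small sketch. The logistic form, the single `strength` parameter, and its values are our assumptions; they are not the parameters used in the simulations reported here and do not reproduce the specific gaps listed above. For simplicity the baseline variables are drawn independently.

```python
import numpy as np

def assign_participation(cog, race, ses, strength, rng):
    """Draw participation with probability driven by baseline variables.

    strength = 0 makes participation unrelated to the baseline variables;
    larger values make the groups more dissimilar.
    """
    index = strength * (cog + (race - race.mean()) + (ses - ses.mean()))
    prob = 1.0 / (1.0 + np.exp(-index))
    return (rng.random(len(cog)) < prob).astype(int)

def baseline_gaps(cog, race, ses, t):
    """Intervention-minus-comparison difference in means for each variable."""
    return {name: float(v[t == 1].mean() - v[t == 0].mean())
            for name, v in (("cog", cog), ("race", race), ("ses", ses))}

rng = np.random.default_rng(7)
n = 20_000
cog = rng.normal(size=n)
race = (rng.random(n) < 0.46).astype(int)
ses = (rng.random(n) < 0.41).astype(int)

gaps = {}
for label, strength in (("none", 0.0), ("moderate", 0.3), ("high", 1.5)):
    t = assign_participation(cog, race, ses, strength, rng)
    gaps[label] = baseline_gaps(cog, race, ses, t)
    print(label, {k: round(v, 2) for k, v in gaps[label].items()})
```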

Outcome measure

Finally, we used the simulated baseline covariates to generate an outcome measure for cognitive development. We wanted the relationship between baseline measures and the outcome to be realistic, so we used actual data from the ECLS-B to obtain the regression coefficients we would use to generate the simulated outcome measure. Specifically, we used the BSF-R Mental Scale score measured at age 2 as the cognitive development outcome (the dependent variable) and regressed it on the three baseline (independent) variables, both the square and the cube of the baseline outcome measure, the interaction of the race/ethnicity and SES dichotomous variables, and a random error term. We chose to include the squared and cubed terms and the interaction term because doing so provided the best fit to the actual data, given the variables we had selected. Table 7 displays the results of this regression using actual data from the ECLS-B. All the variables were significant predictors of age 2 cognitive development with p values <.01, except for the race/ethnicity × SES interaction, which was not significant. The R² from the regression was .14, implying that the variance in the error term represented most (86%) of the variance in the outcome.

Table 7.

Regression Results and Parameter Values From ECLS-B Data.

Outcome: Cognitive Development at Age 2	Coefficient	SE
Baseline covariates
Baseline cognitive development	.25***	.01
Race/ethnicity	.32***	.03
SES	.41***	.03
Square of baseline cognitive development	−.05***	.01
Cube of baseline cognitive development	.01***	.00
Race/Ethnicity × SES interaction	−.06	.04
Constant	−.27***	.02
Regression statistics
R²	.14
Number of observations	8,900

Source. Early Childhood Longitudinal Study, Birth Cohort (ECLS-B), 9-month and 2-year data collections.

Note. SE = standard error; SES = socioeconomic status. Sample sizes rounded to the nearest 10 to comply with the Institute of Education Sciences’ restricted-use data policy.

***Significantly different from 0 at the .01 level, two-tailed test.

We then used the coefficients obtained from the regression using actual ECLS-B data and the simulated values of the baseline covariates to generate values of the outcome variable. Specifically, in each simulated data set, we created a cognitive development outcome using the parameter values in Equation 5.

\begin{aligned} \text{Cognitive development outcome} = {} & -0.27 + 0.25 \times \text{Baseline cognitive development} \\ & + 0.32 \times \text{Race/ethnicity} + 0.41 \times \text{SES} \\ & - 0.05 \times \text{Baseline cognitive development}^{2} \\ & + 0.01 \times \text{Baseline cognitive development}^{3} \\ & - 0.06 \times \text{Race/ethnicity} \times \text{SES} + \text{error}. \end{aligned}

Because we created the outcome via simulation, we understood exactly how the cognitive development outcome related to all the other variables. As mentioned previously, it was not related to program participation by design: Equation 5 contains no program participation indicator. Therefore, the true impact was zero, and any estimates of the program impact that differed from zero were biased.
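The outcome-generation step can be illustrated with a minimal Python sketch. This is not the code used for the study. The coefficients come from Table 7 and Equation 5; everything else is an assumption made for illustration: the function name `simulate_outcome`, independent covariates (the study simulated correlated covariates), a standard-normal baseline score, 50/50 binary indicators, and normally distributed errors with an SD of √.86 (which presumes an outcome standardized to unit variance, consistent with the error term carrying 86% of the variance).

```python
import numpy as np

# Coefficients copied from Table 7 / Equation 5.
COEF = {
    "const": -0.27,
    "base": 0.25,
    "race": 0.32,
    "ses": 0.41,
    "base_sq": -0.05,
    "base_cu": 0.01,
    "race_x_ses": -0.06,
}


def simulate_outcome(base, race, ses, rng, error_sd=np.sqrt(0.86)):
    """Generate the outcome from Equation 5; no program participation term appears."""
    systematic = (
        COEF["const"]
        + COEF["base"] * base
        + COEF["race"] * race
        + COEF["ses"] * ses
        + COEF["base_sq"] * base**2
        + COEF["base_cu"] * base**3
        + COEF["race_x_ses"] * race * ses
    )
    # ASSUMPTION: normal errors; the SD presumes a unit-variance outcome.
    error = rng.normal(0.0, error_sd, size=np.shape(base))
    return systematic + error


rng = np.random.default_rng(0)
n = 10_000
# ASSUMPTION: independent covariates with illustrative distributions.
base = rng.normal(size=n)          # baseline cognitive development
race = rng.integers(0, 2, size=n)  # dichotomous race/ethnicity indicator
ses = rng.integers(0, 2, size=n)   # dichotomous SES indicator
y = simulate_outcome(base, race, ses, rng)
```

Because the function takes no participation argument, any nonzero impact estimate computed from `y` reflects bias rather than a true effect.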

Step 2: Estimate the effect of program participation

To assess how well matching and linear regression provided unbiased estimates of the effect of program participation, we ran three linear regressions within each simulated data set, each with different covariates, varying whether the model included all relevant baseline variables. If baseline variables are highly correlated with one another (and are similarly correlated with the outcome), misspecifying the model by leaving out one or more of these variables may result in low levels of bias. However, if baseline variables are not highly correlated with one another (but are correlated with the outcome), failing to include a relevant baseline variable could result in substantial bias. (As Table 6 shows, correlations between the baseline variables we considered were generally low.)

We included a program participation indicator in each of the three models and altered the other covariates in the model in the following ways:

The first model included only program participation and no covariates. An estimate of program impact based on a regression with only a program participation indicator was equivalent to the difference in the mean outcome of the intervention group versus the comparison group.

The second model added race/ethnicity and SES, analogous to a situation in which the researcher made a regression adjustment for some but not all baseline covariates that influenced program participation and outcomes. (With actual data, if the characteristic was not measured, a researcher might not know how similar or dissimilar intervention and comparison groups were on that characteristic.)

The third model added the baseline outcome measure. By construction, this regression included all covariates that influenced program participation and outcomes. However, although the third regression included all the relevant covariates, it was misspecified because it did not include any nonlinear terms or interactions. This condition enabled us to assess the extent to which even a misspecified regression model reduced bias in impact estimates when all the relevant baseline covariates were included.¹⁸

By design, these regression models did not include all the terms from Equation 5, which is the true DGP. We used only linear terms, which did not account for the nonlinearities and interactions present in our simulated data. We intentionally used the wrong functional form to reflect the real-world situation in which a researcher working with actual (nonsimulated) data does not know all the relevant variables that affect the outcome or how the variables interact to affect the outcome and therefore may develop a misspecified model. (Recall from Equation 5 that the correct model had squared and cubed terms for the baseline measure of the outcome.) One of the dangers of using a misspecified model is that the greater the dissimilarity in the distributions of baseline covariates between intervention and comparison groups, the more regression estimates depend on functional form assumptions. If using the correct functional form is important for reducing bias, regression adjustment should perform poorly when distributions are dissimilar. If, however, using the correct functional form is less important for reducing bias, regression adjustment, even with a misspecified model, should still reduce bias when distributions are dissimilar.
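The three regression models described above can be sketched as follows. This is an illustrative reconstruction, not the study's code. The outcome follows Equation 5, but the selection rule (a logistic function with intercept 1.5 and slopes of −1 on each covariate), the independent covariates, the error SD of √.86, the sample size, and the helper name `impact_estimate` are all assumptions chosen only to create dissimilar groups; they are not the parameters behind Table 8.

```python
import numpy as np


def impact_estimate(y, participation, covariates=()):
    """OLS coefficient on the participation indicator, using linear terms only."""
    X = np.column_stack([np.ones_like(y), participation, *covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


rng = np.random.default_rng(42)
n = 200_000

# ASSUMPTION: independent covariates with illustrative distributions.
base = rng.normal(size=n)
race = rng.integers(0, 2, size=n).astype(float)
ses = rng.integers(0, 2, size=n).astype(float)

# ASSUMPTION: hypothetical selection rule; participation depends on all
# three covariates, so the groups differ at baseline.
logit = 1.5 - base - race - ses
participation = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

# Outcome from Equation 5: the true program impact is zero.
y = (
    -0.27 + 0.25 * base + 0.32 * race + 0.41 * ses
    - 0.05 * base**2 + 0.01 * base**3 - 0.06 * race * ses
    + rng.normal(0.0, np.sqrt(0.86), size=n)
)

# Because the true impact is zero, each estimate is itself the bias.
# Dividing by the outcome SD expresses it in effect size units.
sd_y = np.std(y)
bias = {
    "participation_only": impact_estimate(y, participation) / sd_y,
    "plus_race_ses": impact_estimate(y, participation, (race, ses)) / sd_y,
    "plus_baseline_outcome": impact_estimate(y, participation, (race, ses, base)) / sd_y,
}
for model, value in bias.items():
    print(f"{model}: {value:+.3f}")
```

None of the three design matrices contains the squared, cubed, or interaction terms, so all three models are misspecified in the sense used in the text; only the set of linear covariates changes.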

Results

We conducted a simulation exercise to address two issues: (1) the relative effectiveness of establishing baseline equivalence through matching versus accounting for baseline differences using regression adjustment and (2) the consequences of omitting relevant baseline variables that a researcher cannot observe. We created data similar to actual data from the ECLS-B and estimated the effect of program participation using three regression models in three scenarios: high, moderate, and no dissimilarity between intervention and comparison groups in terms of average values of the covariates in the DGP.

Crossing the three levels of dissimilarity with three misspecified regression models resulted in nine estimates, summarized in Table 8. The first column shows the level of dissimilarity between intervention and comparison groups on baseline race/ethnicity, SES, and cognitive development. The next three columns show the covariates that were included in the outcome regression model.

Table 8.

Bias in Program Impact Estimates Resulting From Different Regression Models and Levels of Covariate Imbalance.

	Amount of Bias (Measured in Effect Size Units) Based on Regression Models With the Following Covariates
Level of Average Covariate Dissimilarity Between Intervention and Comparison Groups	Participation Status Only	Participation, Race/Ethnicity, and SES	Participation, Race/Ethnicity, SES, and Baseline Outcome Measure
High dissimilarity	.58	.44	.01
Moderate dissimilarity	.12	.06	.00
No dissimilarity	.00	.00	.00

Source. Simulated data.

Note. SES = socioeconomic status. Parameters were obtained from Early Childhood Longitudinal Study, Birth Cohort data.

The number in the upper-left cell of results in Table 8 represents the amount of bias in estimates of program impacts (in effect size units) when there was a high degree of dissimilarity in baseline covariates, and estimates were produced by a regression model that included none of these covariates. As might be expected, this worst-case scenario produced a large bias of 0.58 SD. An evaluation of an intervention that produced impact estimates with bias this large would be misleading.

We explored what would happen if the researcher did not use matching to achieve covariate balance but relied only on regression adjustment. The effect of regression adjustment can be seen by moving across the first row of the table. If the intervention and comparison group covariates are highly dissimilar and the researcher has data for some relevant baseline covariates (race/ethnicity and SES) but not others (the baseline outcome), the partial regression adjustment reduces bias only modestly, from .58 to .44. This finding suggests that race/ethnicity and SES are not good proxies for baseline outcome measures, at least in the context of measuring impacts on infant/toddler development, given the correlations we observed in the ECLS-B. However, if the researcher includes all relevant baseline covariates in the regression model, the bias is nearly eliminated (bias in the top right cell of the table is .01). This finding holds even though the regression model is misspecified; that is, it lacks the quadratic and cubic terms and interactions that were part of the DGP.

The bottom row of Table 8 indicates that when the baseline covariates have the same average values for intervention and comparison groups, there is no bias even in a model with the participation indicator as the only covariate, so regression adjustment does not affect bias. However, regression adjustment can increase the precision of impact estimates, which is why impact regression models that include baseline covariates are often used even in experimental research designs.
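As a rough illustration of the size of this precision gain (a back-of-the-envelope calculation, not a result from the simulations), one can use the standard approximation that, when groups are balanced by design, adjusting for covariates that explain a share R² of the outcome variance shrinks the standard error of the impact estimate by a factor of about √(1 − R²). Taking the R² of .14 from Table 7 as an upper bound for the linear-only models is an assumption made here for illustration:

\begin{aligned} \mathrm{SE}_{\text{adjusted}} \approx \mathrm{SE}_{\text{unadjusted}} \times \sqrt{1 - R^{2}} = \mathrm{SE}_{\text{unadjusted}} \times \sqrt{1 - .14} \approx 0.93 \times \mathrm{SE}_{\text{unadjusted}} . \end{aligned}

Under these assumptions, the standard error would fall by at most about 7%, a modest gain that reflects the low explanatory power of the baseline covariates.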

To gauge the benefit of creating matched intervention and comparison groups with similar distributions of baseline variables (using a technique such as PSM), one can trace the decline in bias down the left column of results in Table 8. Bias in program impact estimates fell from 0.58 in the high dissimilarity case to 0.12 in the moderate dissimilarity case. Bias was zero in the no dissimilarity case because, in this scenario, program participation was uncorrelated with the baseline covariates by design.

Looking at Table 8, we draw two main conclusions from this limited simulation exercise, which considered only one outcome and a small set of covariates from the ECLS-B. First, using all relevant covariates in a regression model almost entirely eliminated bias (to .01 or less), even when the functional form was incorrect. Second, including race/ethnicity and SES indicators halved the bias (from .12 to .06) when there was moderate dissimilarity between intervention and comparison groups in terms of all three baseline measures. However, substantial bias remained (.44) when the distributions of baseline covariates were highly dissimilar, unless the model also included the baseline outcome measure.
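The comparisons drawn from Table 8 can be made concrete by expressing each change as a percentage of the starting bias. The short Python sketch below does only that arithmetic; the bias values are copied from Table 8, while the dictionary keys and the helper name `pct_reduction` are labels introduced here for illustration.

```python
# Bias values copied from Table 8 (effect size units).
TABLE_8 = {
    "high": {"none": 0.58, "race_ses": 0.44, "all": 0.01},
    "moderate": {"none": 0.12, "race_ses": 0.06, "all": 0.00},
    "no": {"none": 0.00, "race_ses": 0.00, "all": 0.00},
}


def pct_reduction(before, after):
    """Percentage of the starting bias removed; None when there is no bias to remove."""
    if before == 0:
        return None
    return 100.0 * (before - after) / before


# Moving across a row: effect of regression adjustment at a fixed dissimilarity level.
for level, row in TABLE_8.items():
    partial = pct_reduction(row["none"], row["race_ses"])
    full = pct_reduction(row["none"], row["all"])
    print(level, "partial adjustment:", partial, "full adjustment:", full)

# Moving down the first column: effect of greater similarity with no adjustment.
print("high -> moderate:", pct_reduction(TABLE_8["high"]["none"], TABLE_8["moderate"]["none"]))
```

Read this way, partial adjustment removes roughly a quarter of the bias under high dissimilarity and half under moderate dissimilarity, whereas adjusting for all three covariates removes nearly all of it in both cases.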

Conclusions

Our exploratory analyses examined two issues: (1) covariate choice—identifying the right set of variables to account for to credibly estimate early childhood intervention effects and (2) choice of estimation method—investigating the relative importance of establishing baseline equivalence on key characteristics (creating matched intervention and comparison groups) versus including the key characteristics as controls in a regression model.

Covariate Choice: Findings and Recommendations

Overall, the explanatory power of a large set of baseline sociodemographic and outcome measures from the ECLS-B was low in predicting a range of child and family outcomes relevant for early childhood interventions. This finding suggests potential for bias even after accounting for these variables. In addition, the specific set of baseline covariates that were most strongly correlated with outcomes differed from outcome to outcome. Given these findings, using a large and heterogeneous set of baseline covariates that are plausibly related to the outcome of interest is a useful strategy (as recommended by Hallberg, Steiner, & Cook, 2011). Based on the groups of covariates that stood out in our analyses as being important in predicting some outcomes, future research should consider the following:

including child age and gender when estimating program effects on cognitive and social–emotional outcomes;

including a variety of SES measures, particularly when estimating effects on linkages and referrals outcomes; and

using baseline outcome measures (or strong proxies for unmeasured baseline outcomes) whenever possible, especially for child and maternal health outcomes.

Including these as covariates in nonexperimental studies could potentially reduce bias, although a large potential for bias would remain because the overall explanatory power of these variables was low.

Even if systematic reviews expand the set of control variables they require studies to use to estimate program effects, the low explanatory power of these covariates means there is a large potential for bias from omitted variables. Thus, in addition to encouraging the use of a large and heterogeneous set of baseline variables, it is important to conduct further theoretical and empirical research to identify and measure additional constructs that could influence participation in interventions that target young children and their families. Systematic reviews of early childhood interventions, in particular, should clearly explain what a moderate rating means so that users are not misled into believing that moderate-rated studies provide a level of evidence comparable to that of high-rated studies. The discrepancy between high- and moderate-rated studies of early childhood interventions in the credibility of estimated program effects may be larger than for studies of older children and other populations, given the lower prevalence and predictive power of baseline outcome measures in studies of young children.

Choice of Estimation Method: Findings and Recommendations

Our simulation exercise used a limited set of baseline covariates and involved just one set of parameters. If we had simulated relationships among other sets of variables from the ECLS-B with different covariance structures, our results might have differed. With the set of parameters we used, however, the results of our simulations suggest that if the intervention and comparison groups’ distributions of baseline covariates are similar (either by matching or by chance), functional form bias is not a concern. Furthermore, even when the distributions were dissimilar, regression adjustment for all relevant baseline covariates almost entirely eliminated bias, even though the model was misspecified. Our findings also showed that if the groups’ distributions of baseline covariates are dissimilar and regression adjustment is not used, program impact estimates are substantially biased.

In real-world applications, a researcher might not be able to measure all relevant variables, and omitted variable bias remains a concern. In addition, researchers who use real-world data do not know the DGP and might estimate impacts using a misspecified regression model. Thus, the findings from our simulation exercise reinforce the importance of using a large, heterogeneous set of baseline control variables. Doing so can reduce the potential for omitted variable bias and help to reduce bias when the regression model is misspecified.

Based on these conclusions, researchers should also consider the following points:

Regression adjustment and matching can be complementary design tools when estimating impacts. If regression alone is used, but there is little overlap between the participants and the nonparticipants, the results are an extrapolation beyond the available data and can be highly dependent on the functional form of the regression model. Conversely, if matching alone is used, some degree of dissimilarity in the distributions of key baseline characteristics might remain, which could be controlled with regression adjustment.

Regardless of the approach, a good practice is to report information about the distributions of key baseline covariates for intervention and comparison groups. Researchers may want to consider reporting more information than just means and SDs by study group—for example, overlapping histograms or density plots like Figure 1—so readers can better assess the degree of similarity between the groups.
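As one possible way to summarize distributional similarity numerically alongside such plots, the Python sketch below computes three common diagnostics for a single continuous covariate: the standardized mean difference, the variance ratio, and a histogram-based overlap coefficient. These diagnostics and function names are illustrative choices, not measures reported in this article, and the example data are simulated under assumed distributions.

```python
import numpy as np


def standardized_mean_difference(treated, comparison):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = np.sqrt((np.var(treated, ddof=1) + np.var(comparison, ddof=1)) / 2.0)
    return (np.mean(treated) - np.mean(comparison)) / pooled_sd


def variance_ratio(treated, comparison):
    """Ratio of group variances; values near 1 indicate similar spread."""
    return np.var(treated, ddof=1) / np.var(comparison, ddof=1)


def overlap_coefficient(treated, comparison, bins=30):
    """Shared area of two histograms on common bins (1 = identical, 0 = disjoint)."""
    lo = min(np.min(treated), np.min(comparison))
    hi = max(np.max(treated), np.max(comparison))
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(treated, bins=edges)
    q, _ = np.histogram(comparison, bins=edges)
    return float(np.minimum(p / p.sum(), q / q.sum()).sum())


# ASSUMPTION: simulated baseline scores for two groups whose means differ by 1 SD.
rng = np.random.default_rng(7)
treated = rng.normal(-0.5, 1.0, size=5_000)
comparison = rng.normal(0.5, 1.0, size=5_000)

print("SMD:", standardized_mean_difference(treated, comparison))
print("Variance ratio:", variance_ratio(treated, comparison))
print("Overlap:", overlap_coefficient(treated, comparison))
```

The overlap coefficient depends on the number of bins and the sample size, so it is best read as a rough companion to a plot rather than as a substitute for one.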

Directions for Future Research

To assess the quality of MCGDs and the potential for bias, given the low predictive power of commonly available early childhood covariates, it is necessary for the field of program evaluation and systematic evidence reviews to have a better understanding of why individuals choose, or are chosen, to participate in an intervention. Past work affirms the importance of understanding this selection mechanism (Steiner, Cook, Shadish, & Clark, 2010). Furthermore, even if we understood the selection mechanism, to account for it in estimations of program effects, we would need to measure it by capturing the crucial variables leading to selection into the program. Designing and administering a detailed survey instrument to capture individuals’ motivation to participate in a program or a program administrator’s rationale for selecting participants is a difficult task.

In certain designs, such as RCTs and regression discontinuity designs, the selection mechanism is known and can be accounted for; these designs are, therefore, considered the most rigorous for estimating program impacts. In most MCGDs, the selection mechanism cannot be fully known because numerous factors influence whether individuals or families participate in a program. For example, if the intervention group includes those who chose to participate and the comparison group includes those who chose not to, it can be difficult to understand the motivation behind these choices. As a first step, future research on early childhood interventions should focus on better understanding what attracts certain individuals or families to these programs, the constraints that might prevent them from participating, and programs’ considerations in targeting potential participants.

Even if a researcher thoroughly explores the selection mechanism, there is an additional complication with systematic reviews. Their developers must decide what constitutes a good understanding of the selection mechanism. Put differently, reviewers must assess how well the researcher understood the sorting of sample members into the intervention and comparison groups. How to create rigorous, objective standards instead of using a reviewer’s discretion or intuition about the quality of the study remains an open question.

In addition to better understanding selection mechanisms, early childhood researchers should endeavor to find additional constructs that predict program participation and outcomes for young children and ways to measure them. In our analyses, baseline measures of outcomes were not as strongly correlated with subsequent outcomes as they have been shown to be in studies of older children (Bloom et al., 1999; Schochet, 2008). Furthermore, other variables, including race/ethnicity and SES, did not explain much of the variation in outcomes. Thus, even if matching or regression adjustment is used in studies of early childhood interventions, the potential for bias remains when available control variables are weakly correlated with outcomes.

Although RCTs produce the most valid impact estimates because the selection mechanism is known and because randomization ensures that study groups are equivalent at baseline in observable and unobservable ways, they can be difficult to conduct for many reasons. When researchers cannot carry out RCTs and choose to conduct nonexperimental studies instead, they should pay careful attention to identifying a comprehensive set of variables related to program participation and outcomes, should provide information on the distributions of intervention and comparison groups on these variables, and should discuss the mechanism of selection into program participation, regardless of the method they choose to estimate impacts. While acknowledging that nonexperimental studies can provide suggestive, if not definitive, evidence of program effectiveness, systematic reviewers should be careful to avoid misleading users about the quality of nonexperimental studies, relative to that of RCTs. Specifically, they should clearly identify the weaknesses of moderate-rated studies and avoid implying that they provide a level of evidence comparable to that of high-rated studies.

Systematic reviews have the potential to improve the quality of research in the fields they cover (Seftor, 2016). By setting stricter standards of quality for nonexperimental studies, systematic reviews can contribute to improvements in nonexperimental research methods and better help decision makers choose effective programs.

Footnotes

Authors’ Note

The authors of this article are or have been involved in Home Visiting Evidence of Effectiveness (HomVEE).

Acknowledgments

The authors would like to acknowledge Lauren Supplee, T’Pring Westbrook, Seth Chamberlain and other OPRE staff, and our colleagues at Mathematica Policy Research who provided thorough reviews and thoughtful comments on this article.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the U.S. Department of Health and Human Services, Office of Planning, Research and Evaluation, Administration for Children and Families (contract numbers HSP23320095642WC/HHSP23337025T and GS-10F-0050L/HHSP233201500115G).

Notes

References

Angrist, J. D. (1998). Estimating the labor market impact of voluntary military service using social security data on military applicants. Econometrica, 66, 249–288.

Bang, & Robins, J. M. (2005). Doubly robust estimation in missing data and causal inference models. Biometrics, 61, 962–973.

Bifulco (2012). Can nonexperimental estimates replicate estimates based on random assignment in evaluations of school choice? A within-study comparison. Journal of Policy Analysis and Management, 31, 729–751.

Bloom, H. J., Bos, & Lee (1999). Using cluster random assignment to measure program impacts: Statistical implications for evaluation of education programs. Evaluation Review, 23, 445–469.

Cook, T. D., Shadish, W. R., & Wong, V. C. (2008). Three conditions under which experiments and observational studies produce comparable causal estimates: New findings from within-study comparisons. Journal of Policy Analysis and Management, 27, 724–750.

Cook, T. D., & Steiner, P. M. (2010). Case matching and the reduction of selection bias in quasi-experiments: The relative importance of pretest measures of outcome, of unreliable measurement, and of mode of data analysis. Psychological Methods, 15, 56.

Clearinghouse of Labor Evaluation and Research. (2017). Study profile: The impact of earnings disregards on the behavior of low-income families. Retrieved January 31, 2017, from http://clear.dol.gov/study/impact-earnings-disregards-behavior-low%E2%80%90income-families-matsudaira-blank-2014

Deke, & Chiang (2017). The WWC attrition standard: Sensitivity to assumptions and opportunities for refining and adapting to new contexts. Evaluation Review, 41, 130–154.

Demirtas, Amatya, & Doganay (2014). BinNor: An R package for concurrent generation of binary and normal data. Communications in Statistics-Simulation and Computation, 43, 569–579.

Demirtas, & Doganay (2012). Simultaneous generation of binary and normal data with specified marginal and association structures. Journal of Biopharmaceutical Statistics, 22, 223–236.

Fortson, Gleason, Kopa, & Verbitsky-Savitz (2015). Horseshoes, hand grenades, and treatment effects? Reassessing whether nonexperimental estimators are biased. Economics of Education Review, 44, 100–113.

Foster, E. M. (2003). Propensity score matching: An illustrative analysis of dose response. Medical Care, 41, 1183–1192.

Gelman, & Hill (2007). Data analysis using regression and multilevel/hierarchical models (Vol. 1). New York, NY: Cambridge University Press.

Hallberg, Steiner, P. M., & Cook, T. D. (2011). The role of pretest and proxy pretest measures of the outcome in removing selection bias in observational studies. Paper presented at the Society for Research on Educational Effectiveness Annual Conference, Washington, DC. Retrieved May 16, 2014, from https://www.sree.org/conferences/2011/program/downloads/slides/167-2.pdf

Harder, V. S., Stuart, E. A., & Anthony, J. C. (2010). Propensity score techniques and the assessment of measured covariate balance to test causal associations in psychological research. Psychological Methods, 15, 234.

Hill, C. J., Gormley, W. T., & Adelstein (2015). Do the short-term effects of a high-quality preschool program persist? Early Childhood Research Quarterly, 32, 60–79.

Imbens, & Wooldridge (2007). What’s new in econometrics? NBER Summer Institute. Retrieved June 20, 2016, from http://www.nber.org/minicourse3.html

Kang, J. D., & Schafer, J. L. (2007). Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science, 22, 523–539.

Kazdin, A. E. (2011). Single-case research designs: Methods for clinical and applied settings. New York, NY: Oxford University Press.

Luscombe, M. D., Owens, B. D., & Burke (2011). Weight estimation in paediatrics: A comparison of the APLS formula and the formula “weight = 3 (age) + 7.” Emergency Medicine Journal, 28, 590–593.

Peikes, D. N., Moreno, & Orzol, S. M. (2008). Propensity score matching. The American Statistician, 62, 222–231.

Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688.

Rubin, D. B. (1977). Assignment to treatment group on the basis of a covariate. Journal of Educational and Behavioral Statistics, 2, 1–26.

Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. The Annals of Statistics, 66, 34–58.

Rubin, D. B. (1980). Bias reduction using Mahalanobis-metric matching. Biometrics, 36, 293–298.

Rubin, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Annals of Internal Medicine, 127, 757–763.

Schafer, J. L., & Kang (2008). Average causal effects from nonrandomized studies: A practical guide and simulated example. Psychological Methods, 13, 279–313.

Schochet, P. Z. (2008). Statistical power for random assignment evaluations of education programs. Journal of Educational and Behavioral Statistics, 33, 62–87.

Seftor (2016). Raising the bar. Evaluation Review. doi:10.1177/0193841X16665023

Steiner, P. M., Cook, T. D., Shadish, W. R., & Clark, M. H. (2010). The importance of covariate selection in controlling for selection bias in observational studies. Psychological Methods, 15, 250–267.

Stuart, E. A. (2010). Matching methods for causal inference: A review and a look forward. Statistical Science, 25, 1–21.

Westbrook, T. P., Avellar, S. A., & Seftor (2016). Reviewing the reviews: Examining similarities and differences between federally funded evidence reviews. Evaluation Review. doi:10.1177/0193841X16666463

What Works Clearinghouse. (2014a). Procedures and standards handbook (Version 3.0).

What Works Clearinghouse. (2014b). Review protocol for early childhood education interventions (Version 3.0).

Wooldridge, J. M. (2015). Introductory econometrics: A modern approach. Boston, MA: Cengage Learning.

Zhao (2008). Sensitivity of propensity score methods to the specifications. Economics Letters, 98, 309–319.