Flaws in Evaluations of Social Programs

Abstract

Background:

This article describes eight flaws that occur in impact evaluations.

Method:

The eight flaws are grouped into four categories according to how they affect impact estimates: statistical imprecision; biases; failure of impact estimates to measure effects of the planned treatment; and flaws that result from weakening an evaluation design. Each flaw is illustrated with examples from social experiments. Although these illustrations are from randomized controlled trials (RCTs), the flaws can occur in any type of evaluation; we use RCTs to illustrate because people sometimes assume that RCTs might be immune to such problems. A summary table lists the flaws, indicates circumstances under which they occur, notes their potential seriousness, and suggests approaches for minimizing them.

Results:

Some of the flaws result in minor hurdles, while others cause evaluations to fail—that is, the evaluation is unable to provide a valid test of the hypothesis of interest. The flaws that appear to occur most frequently are response bias resulting from attrition, failure to adequately implement the treatment as designed, and too small a sample to detect impacts. The third of these can result from insufficient marketing, too small an initial target group, disinterest on the part of the target group in participating (if the treatment is voluntary), or attrition.

Conclusion:

To a considerable degree, the flaws we discuss can be minimized. For instance, implementation failures and too small a sample can usually be avoided with sufficient planning, and response bias can often be mitigated—for example, through increased follow-up efforts in conducting surveys.

Keywords

social experiments, experimental flaws, response bias, implementation, crossover

Introduction

In this article, we consider eight, somewhat overlapping, types of flaws that have occurred in evaluating the effects or impacts of social programs, some more frequently than others, and some more serious than others. As shown subsequently, we loosely group these flaws into four categories on the basis of how each affects estimates of the impacts of the evaluated programs:

Statistical Imprecision in the Resulting Impact Estimates

1. Few show up

2. Inadequate sample size

Bias in the Impact Estimates

3. Failure to properly randomize

4. Control crossover

5. Sample attrition

Failure of the Impact Estimates to Measure the Effects of the Planned Treatment (even though they may accurately measure the effects of the treatment that is actually run)

6. Failure to implement the treatment properly

7. Failure to adequately communicate the treatment

A Weakening of the Originally Planned Experimental Design

8. Stakeholder resistance

Assignment of some types of flaws to a particular category is imprecise because they can potentially affect the impact estimates in more than one way.

Many of the flaws we consider result from how the programs being evaluated are designed or implemented, rather than how the evaluations themselves are conducted. (As we discuss subsequently, some analysts consider these program flaws rather than evaluation flaws.) Still, as suggested by the four categories listed previously, they affect findings from the evaluation, which, as a result, sometimes are less useful than anticipated, or even misleading. Thus, this article is intended to serve those conducting and underwriting evaluations as a warning of things to worry about in designing and implementing programs that are to be evaluated, as well as a caution of things to be concerned about in designing the evaluation itself. We illustrate each of the eight types of flaws with examples from evaluations of previous social experiments.¹ Although we illustrate the flaws with examples from randomized controlled trials (RCTs), the problems can occur in any type of evaluation; we use RCTs to illustrate because people sometimes assume that since RCTs provide the strongest evidence of impacts, they might be immune to such problems.²

It is widely agreed among social scientists that social experiments, where units are randomly assigned to treatment and control status, provide the best opportunity for learning about the effectiveness of a social intervention. For this reason, as noted by Greenberg and Shroder (2004), the number of social experiments has grown almost exponentially since the 1960s. Social experiments are popular because they usually produce estimates of the effects or impacts of a policy intervention that are internally valid—that is, unbiased for the sample included in the study. The reason is that the randomly assigned treatment and control groups can be validly compared because they differ only by chance and the fact that the former is subject to the policy intervention and the latter is not.

Although social experiments are rightly viewed as the “gold standard” in evaluations of the impacts of policy interventions, almost all of them confront problems of some sort in their implementation or operation. Some of these problems result in minor hurdles, while others cause experiments to fail; that is, the experiment is unable to provide a valid test of the hypothesis of interest.³ While the former merely requires a caveat and perhaps some statistical adjustments, the latter renders experimental impact estimates, if they can be obtained at all, useless. There is obviously a continuum between a minor flaw and a fatal problem. In this article, we examine serious experimental flaws, but not all of these necessarily result in complete failure. As will be seen, in some circumstances, it is possible for analysts to overcome the flaw, at least in part; even in the case of a fatal flaw, an experiment may sometimes still provide useful information. In all the situations we describe, however, the flaw causes findings from comparisons between treatment and control groups to be subject to considerable uncertainty or unable to provide the information desired, and users of the experimental findings must exercise great caution. As Epstein and Klerman (2012) have recently pointed out, some social interventions are not yet ready for a rigorous impact evaluation, such as through random assignment, and may never be. As a result, they and Wholey (2004) suggest that programs be subject to an “evaluability assessment” to determine if and when a random assignment impact evaluation or some other rigorous form of impact evaluation is appropriate. We return to this point in the conclusions.

Imprecision in the Impact Estimates

Few show up

This problem, which is a major concern, occurs when some potential participants in the program being tested are available for random assignment, but far fewer than were planned for in the research design, resulting in an insufficient sample for hypothesis testing.⁴,⁵ This can happen when participating in the treatment being tested is voluntary, but is unlikely when participation is mandatory. It may be due to insufficient outreach and marketing, because the target group of those who qualify to participate in the experiment is too small, or because few persons in the target population think it is potentially beneficial for them to participate.⁶ It is obviously important to determine the actual source of the problem. A lack of outreach and marketing can be surmounted; and, as suggested by Glennerster and Takavarasha (2013, p. 303), under some circumstances, small financial incentives for participation can be used to address disinterest. However, too small a target group probably cannot be overcome. Too small a target group generally implies too little research and planning prior to undertaking the experiment; in the case of disinterest, however, the experiment has provided important information.⁷

An interesting example of insufficient outreach occurred in implementing Great Britain’s Employment Retention and Advancement (ERA) demonstration. One of the three target groups of the intervention, single parents who were working part time and receiving Working Tax Credits,⁸ could receive financial incentive bonuses under ERA by working full-time (at least 30 hr per week), as well as have access to caseworker services. During the initial recruitment effort, these mothers were not told about the incentive payments, clearly an important selling point of ERA, because of concern over their disappointment if they were randomly assigned to the control group and thus would be ineligible for the payments. Partially as a result, very few working single mothers volunteered to be randomly assigned. As a consequence, the policy of not mentioning the incentive payments was reversed, and considerable additional effort was put into recruiting these persons. Although this recruitment effort was in large part successful, the size of the ultimate research sample was still well under what was planned (Walker, Hoggart, & Hamilton, 2006).

Although the ERA demonstration recovered from its initial recruitment problems, other social experiments have not. For example, the Madison and Racine Quality Employment Experiment, which was targeted at women in the Work Incentive program (the forerunner of today’s welfare-to-work programs), was ultimately aborted because of a combination of the small size of its registrant pool, which meant that the potential sample that could be recruited was inadequate, and its slowness in getting the program it was testing underway (Leiman, 1982). Epstein and Klerman (2012) mention the more recent Portland Career Builders ERA demonstration, which was cancelled when only one third of the target enrollment was recruited.

Inadequate sample size

This flaw is similar to the first flaw in that the sample available for the evaluation is smaller than required to detect statistically significant program impacts; the difference between these two flaws is that in this case, the evaluation design itself calls for too few sample members. Although they usually receive less publicity than large experiments, many social experiments rely on small research samples. For example, drawing on a database containing information on 143 social experiments that were completed between 1962 and 1996, Greenberg, Shroder, and Onstott (1999)⁹ found that 18.6% had samples of fewer than 200,¹⁰ 16.4% had samples ranging from 200 to 499, and 15.0% had samples ranging from 500 to 999. Because the impacts of the treatments tested in social interventions are often modest in magnitude, small experiments, especially those with samples of only a few hundred, tend to be underpowered. Consequently, even if the treatments produce true impacts, the estimated impacts are unlikely to be statistically significant.¹¹ For example, evaluators of the random assignment Tulsa Individual Development Account (IDA) demonstration conducted a 10-year follow-up survey in which they located 855 of the households that were originally randomly assigned (Grinstein-Weiss et al., 2012). Almost all the impacts estimated with the resulting data were not statistically significant. However, problems resulting from small sample size can be mitigated to some extent by using a regression framework to account for baseline covariates correlated with the outcome of interest.
For example, in their analysis of a recently fielded experiment with 106 persons in the treatment group and 130 individuals in the control group, Cook, O’Brien, Braga, and Ludwig (2012) note that in the absence of accounting for covariates, their minimum detectable effect size would be .37 standard deviations; but if the covariates can account for one third of the variance in the outcome, this would fall to .30 standard deviations.
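The arithmetic behind minimum detectable effect sizes of this kind can be sketched as follows. The sketch assumes a two-sided test at the 5% level, 80% power, and a normal approximation; these conventions are assumptions made here for illustration rather than details reported above, and the function name and parameters are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n_treatment, n_control, r_squared=0.0,
                              alpha=0.05, power=0.80):
    """Minimum detectable effect in standard deviation units.

    Assumes a two-sided test and a normal approximation (illustrative
    conventions, not taken from the article). r_squared is the share of
    outcome variance explained by baseline covariates.
    """
    z = NormalDist()
    # Sum of the critical value for the test and the quantile for the power.
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Standard error of a difference in means for an outcome with unit
    # variance, shrunk by the variance the covariates explain.
    standard_error = sqrt((1 - r_squared) * (1 / n_treatment + 1 / n_control))
    return multiplier * standard_error


# Sample sizes from the experiment described above: 106 treatment, 130 control.
print(round(minimum_detectable_effect(106, 130), 2))
print(round(minimum_detectable_effect(106, 130, r_squared=1 / 3), 2))
```

Under these assumptions, the two calls give roughly .37 and .30 standard deviations, in line with the figures cited above. Covariates help only through the factor (1 - R²) under the square root, so even covariates that explain one third of the variance reduce the minimum detectable effect by less than one fifth.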

There are several reasons why sample sizes are small in some social experiments, and, as a result, detecting the impacts of the tested treatment may be unsuccessful. First, as previously discussed, when participation in an experiment is voluntary, relatively few persons may volunteer. Second, as discussed further subsequently, some of the original sample may be lost when follow-up data are collected.¹² Third, and perhaps most importantly, sample size may be constrained because of budgetary considerations.¹³ The costs of both administering treatments and collecting follow-up data often increase as sample size increases. To minimize these costs, experiments with very large samples (14.3% of the 143 experiments mentioned in the previous paragraph had samples of over 10,000) typically rely on existing automated administrative databases and often test modest incremental changes in existing programs. If budget constraints cannot be overcome, serious consideration should be given to not undertaking the experiment.

If social experiments were solely intended to provide information on program impacts, it would not be obvious why most small, underpowered experiments have been undertaken. After all, power tests can be, and often are, conducted before experiments are initiated. However, sometimes there may be political reasons for undertaking experiments with small samples—for example, delaying a decision on a policy change or responding to pressure to give the impression of scientifically testing a change that has already been decided upon. Or perhaps those designing some small experiments are overly optimistic about the size of the effect the treatment will produce.¹⁴

In principle, one way to address the limitations of several small experiments that test similar policies and estimate similar outcomes is through pooling information provided by the experiments and applying multilevel analysis or meta-analysis to determine whether the overall effect of the tested policies is statistically significant.¹⁵ This has often been done with small experiments in the health area, but to the best of our knowledge, it has not been done with small social experiments.¹⁶ Another approach, one that has occasionally been used when sample size is small, is to replicate an experiment.¹⁷ For example, a number of years ago, several small job club experiments were conducted. Each of these experiments targeted a different demographic group, but all of them tested a similar treatment, supervised job-search activities in a group. They all resulted in large impacts and, as a consequence, job clubs were widely adopted nationally. Because the various replications of the first job club experiment consistently produced positive and statistically significant impact estimates, there was reason to have some confidence in the findings.¹⁸
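The gain in precision from pooling can be illustrated with a minimal sketch of the simplest form of meta-analysis, a fixed-effect (inverse-variance) pooled estimate. This assumes that the experiments estimate a common true effect and are independent; a multilevel or random-effects analysis would relax the first assumption. The function name and the input values are hypothetical.

```python
from math import sqrt


def pool_fixed_effect(estimates, standard_errors):
    """Inverse-variance (fixed-effect) pooled estimate and its standard error."""
    # More precise experiments (smaller standard errors) get more weight.
    weights = [1 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * est for w, est in zip(weights, estimates)) / total_weight
    pooled_se = sqrt(1 / total_weight)
    return pooled, pooled_se


# Hypothetical inputs: four small experiments, each with an estimated
# impact of 2.0 and a standard error of 1.5 (a z-ratio of about 1.33,
# which is not significant at conventional levels).
pooled, pooled_se = pool_fixed_effect([2.0] * 4, [1.5] * 4)
print(pooled, pooled_se, pooled / pooled_se)
```

In this hypothetical case, the pooled estimate is still about 2.0, but its standard error is halved to about 0.75, so the z-ratio rises to roughly 2.67 even though no single experiment is statistically significant on its own.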

Biases in the Impact Estimates

Failure to properly randomize

By its very essence, social experimentation depends on the characteristics of the treatment group and the control group being similar, differing only as a result of the treatment being tested or by chance. One reason this may not occur is because of improper randomization, the one flaw we discuss in this article that is exclusive to social experimentation. A failure to properly randomize most often happens when those administering the treatment or those who are supposed to be randomly assigned have some control over the randomization process, rather than complete control residing in persons with no interest in who is assigned to the treatment or control group. Ideally, the randomization process should be controlled by the evaluators.¹⁹ Even seemingly foolproof methods, such as assigning every other person who walks through the door or every person whose social security number ends in an odd number to the treatment group, can be manipulated. However, improper random assignment may also occur through inadvertent administrative errors. When those administering the treatment do have some control over random assignment, it is obviously important for those evaluating an experiment to interview these persons to determine if the assignment was not entirely random. Randomization error can also sometimes be detected by comparisons of the observed characteristics of the treatment group with those of the control group at the time of random assignment. Such a comparison should always be made.
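One simple way to carry out such a comparison is to compute, for each baseline characteristic, the difference between the treatment and control group means expressed in standard deviation units. The following is a minimal sketch; the function name and the data are hypothetical, and the sketch assumes the characteristic varies within at least one of the groups.

```python
from math import sqrt
from statistics import mean, variance


def standardized_difference(treatment_values, control_values):
    """Difference in group means divided by the pooled standard deviation."""
    pooled_sd = sqrt((variance(treatment_values) + variance(control_values)) / 2)
    return (mean(treatment_values) - mean(control_values)) / pooled_sd


# Hypothetical baseline ages for a handful of sample members.
treatment_ages = [22, 24, 26]
control_ages = [21, 23, 25]
print(standardized_difference(treatment_ages, control_ages))
```

With proper randomization, differences of this kind should be small and unsystematic across characteristics. What counts as too large is a matter of judgment, and a formal test of the joint difference across all baseline characteristics is a natural complement.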

The New Orleans Homeless Substance Abusers Project provides an interesting example of staff subversion of the random assignment process. Only those substance abusers considered sufficiently motivated were placed on the selection list; those who did not appear sufficiently motivated were assigned to the control group. As a result, under one third of those entering the two treatment groups were actually randomized. Consequently, the analysis was conducted using nonexperimental selection bias correction techniques. Unexpectedly, these corrections actually increased the estimated impact of the treatment (Devine, Wright, & Brody, 1997).

A failure to properly randomize is not always purposeful. In the United Kingdom’s Supportive Caseloading experiment, unemployment benefit claimants who were initially eligible for the treatments were randomly assigned by using the last two digits of their National Insurance numbers. A second set of claimants who were initially ineligible for the treatment but later became eligible was also included in the study population. At one study site, these initially ineligible claimants were randomly assigned on the same basis as the initially eligible claimants; but in the other study site, those initially ineligible claimants who later became eligible were all assigned to the treatment group, an inadvertent situation that went unrecognized until after the fieldwork was finished. These persons probably accounted for over half of the treatment group in the site; and, once they were assigned to the treatment group, they apparently could not be separately identified and therefore were left out of the statistical analysis (Birtwhistle, Barnes, & Looby, 1994).

The Harbinger Mental Health Project provides a somewhat less grievous example of failure to randomize: The 100 members of the research sample who approached the hospital for treatment were randomly assigned, but the 21 members of the sample who were long-term residents of the hospital were assigned to the treatment and control groups on a nonrandom “matching” basis (Mowbray, Collins, Plum, Masterton, & Mulder, 1997, p. 110).²⁰ Both sets of individuals were included in the impact analysis, thereby undercutting the randomization of the research sample. A better choice would probably have been to drop the small subsample of long-term residents from the analysis.

A rather rare, but instructive, instance of an unintentional failure to properly randomize in an experiment that relied on administrative data occurred in the Wisconsin Self-Sufficiency First/Pay for Performance Program (SSF/PFP). In this test of a mandatory welfare-to-work program, Aid to Families with Dependent Children (AFDC) applicants who were assigned to the treatment group were required to participate in both the SSF and PFP components of the tested program, while persons who were already in the AFDC system at the beginning of the experiment were required to participate in only the PFP component. A data system that was being developed at the time of random assignment allowed the staff administering the AFDC (now Temporary Assistance for Needy Families) program to exempt some AFDC applicants who were assigned to the treatment group from SSF. Unfortunately, those who were exempted disappeared from the data available for analysis. Their equivalents, however, remained in the control group. Thus, the evaluators had confidence that the noncomparability of the treatment and control groups was minimal for active AFDC recipients, but not for AFDC applicants.²¹

Control crossover

Sometimes called “control contamination,” control crossover occurs when some members of the control group receive services from the program being tested that they are supposed to be denied. This obviously diminishes the estimated impact of the treatment. However, unless the crossover is rampant, it is unlikely to result in the complete failure of the experiment. Moreover, Orr (1999) has suggested a simple correction that can be used when the proportion of the control group that crossed over is known to the evaluators. To use the approach suggested by Orr, and for general purposes, it is clearly important that records be kept for both the treatment and control groups of who receives what program-related services.
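A widely used adjustment of this general type, which may differ in detail from the one Orr presents, rescales the unadjusted treatment-control difference by the difference between the two groups in the share actually receiving the services. The sketch below rests on strong assumptions that are stated here rather than drawn from the text: assignment affects outcomes only through receipt of the services, and crossovers are affected as they would have been had they been assigned to the treatment group. The function name, parameters, and numbers are hypothetical.

```python
def crossover_adjusted_impact(unadjusted_impact, crossover_rate, treated_rate=1.0):
    """Rescale a treatment-control difference for control crossover.

    Divides by the difference between the share of the treatment group
    and the share of the control group that received the services.
    """
    contrast = treated_rate - crossover_rate
    if contrast <= 0:
        raise ValueError("no positive treatment-control contrast in services received")
    return unadjusted_impact / contrast


# Hypothetical numbers: an unadjusted impact of 100 with 20% crossover
# and full receipt of services in the treatment group.
print(crossover_adjusted_impact(100.0, 0.20))
```

In this hypothetical case, the adjusted impact is about 125. The adjustment grows quickly as the crossover rate rises, which is one way of seeing why rampant crossover can leave too little treatment-control contrast for a useful test.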

We briefly describe three examples of crossover. In the alternative schools demonstration, 13% of the controls in one site (Wichita, Kansas) and 39% of the controls in another site (Stockton, California) attended the alternative high schools that only members of the treatment group of high-risk youths were supposed to attend; however, the evaluators corrected for the resulting crossover in estimating the impacts of the alternative schools (Dynarski & Wood, 1997). In Bolivia’s School Facility Improvements experiment, an experiment in which schools, rather than individuals, were randomized, only treatment schools were supposed to receive funds intended to improve their physical facilities; nonetheless, some control schools learned by word of mouth about the funds, applied for them, and subsequently received them. The evaluation attempted to address this crossover issue by using an approach developed by Manski (1990) that puts bounds on the impact estimates (Newman, Pradham, & Rawlings, 2002). In Denmark’s Job Training Demonstration, almost one quarter of the control group received job training that they were supposed to be denied. An econometric model was used to attempt to take account of possible resulting biases (Rosholm & Skipper, 2009).

Sample attrition

Sample attrition occurs when some of those who are randomly assigned are unavailable when follow-up data are collected. This is most likely to occur when the follow-up data are obtained through surveys, rather than through administrative records. To take a rather extreme, but important, example, in the Food Stamp Employment and Training Program demonstration, 12-month survey data were successfully collected for only 50% of the research sample of 13,086. This was attributable to difficulties in obtaining addresses from local Food Stamp offices, high mobility among the sample population, and the large number of homeless persons who were randomly assigned (Puma, Burstein, Merrell, & Silverstein, 1990).

The key problem resulting from sample attrition is that it is unlikely to be random. This can make the remaining sample unrepresentative of the group originally included in the research sample. Moreover, those who attrite from the treatment group may differ from those who attrite from the control group, often in ways that are not observable yet are associated with the outcomes of interest. This causes response bias in the impact estimates.²² In addition, sample attrition reduces the size of the sample and thereby reduces statistical power, as was discussed for Flaws 1 and 2 previously.

Several approaches have previously been used to attempt to minimize attrition in surveys. One is to provide small monetary incentives for participation in surveys. Another is to send Christmas cards to sample members each year and then track down the individuals whose cards come back “return to sender” before these individuals are permanently lost to follow up. A third is to provide sample members with self-addressed cards they can mail when they move. A fourth is to obtain contact information during baseline surveys for neighbors or relatives who would likely know the location of sample members who have moved. In addition, Glennerster and Takavarasha (2013, p. 310) suggest that if everyone in the research sample is made eligible for a program innovation, and the innovation is perceived as positive (e.g., financial incentives), but is phased in over time, the first group that is assigned to the program can serve as the treatment group and those phased in later can serve as the control group. Under these circumstances, the expectation of participation may reduce attrition among controls, although there is a danger that this expectation could change their behavior.

As mentioned previously, attrition reduces sample size, sometimes greatly reducing the power of the data to detect impacts resulting from the treatment. For example, the Project Hope demonstration, which was conducted in Columbus, Ohio, began with a small sample of 140 parents of Head Start children. In an attempted follow-up survey of 116 of these individuals, only 24 were ultimately interviewed, partially as a result of 55 telephone numbers having been disconnected (U.S. Department of Health and Human Services, 1992). Similarly, in the Partnership for Hope demonstration in Whatcom County, Washington, only 58 of the 109 individuals randomly assigned returned a questionnaire that was mailed out at the end of the experimental treatment (U.S. Department of Health and Human Services, 1994).

As suggested previously, sample attrition can cause response bias in the impact estimates. One recent example of response bias resulting from attrition is the United Kingdom’s ERA demonstration, where both survey data and administrative data were collected 5 years after random assignment for two of the program’s three target groups. The survey response rate was 62% for one of these groups and 69% for the other. The sample size was sufficiently large for both groups that lack of power to detect outcomes was not a serious problem even after attrition. However, the possibility of response bias was a concern. Fortunately, it was possible to examine the survey data for possible response bias by using the administrative data to compare the earnings impacts for those who responded to the survey with the earnings impacts for the full sample. It turned out that the earnings impacts were markedly larger for the survey respondents than for the full sample, strongly suggesting the presence of response bias (Hendra et al., 2011). Because some of the key outcomes, such as earnings, employment status, and some government benefit payments, were available from the administrative data, as well as from the survey data, the estimates of impacts from the former could be emphasized in reporting findings from the experiment. However, data on other outcomes, such as health status, wage rates, and hours, were only available from the survey data. Hence, although impacts on these outcomes were considered important, they could only be estimated with the survey data. Moreover, it was not possible to determine whether estimates of these impacts were subject to response bias.
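The logic of this check can be sketched in a few lines: using an outcome that is available from administrative data for everyone, estimate the impact once for the full sample and once for survey respondents only, and compare the two. The records, field names, and function name below are hypothetical, and the simple difference in means stands in for whatever impact estimator an evaluation actually uses.

```python
from statistics import mean


def impact(records):
    """Treatment-control difference in mean earnings."""
    treated = [r["earnings"] for r in records if r["treated"]]
    controls = [r["earnings"] for r in records if not r["treated"]]
    return mean(treated) - mean(controls)


# Hypothetical administrative records; "responded" marks survey respondents.
records = [
    {"treated": True, "responded": True, "earnings": 120},
    {"treated": True, "responded": True, "earnings": 130},
    {"treated": True, "responded": False, "earnings": 80},
    {"treated": True, "responded": False, "earnings": 90},
    {"treated": False, "responded": True, "earnings": 100},
    {"treated": False, "responded": True, "earnings": 100},
    {"treated": False, "responded": False, "earnings": 90},
    {"treated": False, "responded": False, "earnings": 90},
]

full_sample_impact = impact(records)
respondent_impact = impact([r for r in records if r["responded"]])
print(full_sample_impact, respondent_impact)
```

In these made-up data, the impact is 10 for the full sample but 25 among respondents. A gap of this kind on an administrative outcome is a warning that survey-based impact estimates for other outcomes may be biased in the same direction, although it cannot establish by how much.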

The method used in the U.K. experiment to uncover the existence of response bias is, of course, limited to those fairly rare instances in which evaluations have access to administrative data as well as survey data. An alternative method, one that is more generally applicable, is to compare the baseline characteristics of the treatment group available in the follow-up period with those of the control group. This was done, for instance, by the evaluators of the Tulsa IDA demonstration, who found some indication of possible attrition bias (Grinstein-Weiss et al., 2012). This approach, as well as the one used in the U.K. experiment, when feasible, could be usefully considered in other evaluations.

Although sample attrition is a problem usually associated with obtaining follow-up data through surveys, administrative data are not completely immune to sample attrition. For example, many social experiments involving the welfare population have relied on the records of state welfare agencies and earnings data reported by employers to state agencies administering the unemployment insurance system. The unemployment insurance records do not include individuals who move out of the state in which the experiment was conducted and individuals who are self-employed.

Failure to Measure the Impacts of the Planned Treatment

Failure to implement the treatment properly

This is a potentially serious problem that has occurred in a number of social experiments. In such instances, the experiment does not test what it was designed to test. Implementation (or process) analysis that involves observation of the program being tested and interviews with staff administering the treatment and members of treatment groups can be critical to detecting whether the treatment was implemented as planned.²³ Ideally, as mentioned in the Introduction section and discussed further in the Conclusions, this analysis will take place before evaluation begins to determine if a random assignment evaluation is appropriate.

The failure to implement the designed treatment is well illustrated by the five-site Quantum Opportunity Program Pilot (Milwaukee; Oklahoma City; Philadelphia; Saginaw, Michigan; and San Antonio) and the seven-site Quantum Opportunity Program Demonstration (Cleveland; Fort Worth; Houston; Memphis; Philadelphia; Washington, DC; and Yakima, Washington). These were sequentially run experiments, which were intended to test the effects of comprehensive services for high school students with a high probability of dropping out. Neither of the experiments implemented the full complement of planned services, although the extent of the deviations from the planned treatment varied among the sites. Indeed, one of the sites in the first experiment (Milwaukee) completely failed to implement the planned program and subsequently was dropped from the evaluation analysis (Hahn, 1994). In the later experiment, according to the detailed implementation study conducted by the evaluators (Maxfield, Castner, Maralani, & Vencill, 2003), two sites implemented a version of the treatment that differed “substantially from the program model,” while the other five “deviated moderately from the model.” Specifically, no site implemented the education or the community service components of the tested program as prescribed. The degree of deviation seemed to vary with whether the local organization operating the program was already running a program similar to the program model “in philosophy and structure” (Maxfield et al., 2003, pp. xi–xii). Maxfield et al. (2003, pp. 99–100) suggest several reasons for the deviation from the program model: the Department of Labor, which funded five of the sites, “did not require the sites to adopt all the elements of the QOP model”; some sites were more in agreement with the program’s philosophy than others; and, most importantly, the program model was complex and, as a result, difficult to implement.
One possible lesson from the last point is that if program designers set the bar too high—for example, by developing an overly complex program—those attempting to implement a program may fail.

Whether the failure to properly implement the Quantum Opportunity Program model as originally designed in the second experiment is viewed as a program error or as an evaluation flaw depends on the intent of the replication.²⁴ If the purpose of the evaluation was to assess the original model, then there was an evaluation flaw: in effect, the evaluation provided the right answer to the wrong question, sometimes referred to as a type III error (Basch, Sliepcevich, Gold, Duncan, & Kolbe, 1985). If, however, the prior study was viewed as an efficacy trial and changes were expected, then the failure to maintain fidelity was not a flaw at all.²⁵

Another example of an implementation problem is a recent random assignment demonstration program run in a district in India. In this experiment, families were to bring their grain to local millers who fortified the resulting flour with iron at no additional cost to the families. This was intended to offset iron deficiency anemia that causes low productivity and health problems in much of the developing world. Although the millers did this in the early days of the experiment, they soon stopped, perhaps, in part, because of a misunderstanding on their part. As a result, while the anemia rate fell during the first part of the experiment, there was no difference in the rate by the end of the experiment (Banerjee, Duflo, & Glennerster, 2011).

Sometimes poor implementation leads to a small treatment-control contrast, resulting in a less than useful test of the planned treatment. For instance, one goal of the San Diego Homeless Research demonstration was to compare traditional case management with comprehensive case management, which was supposed to have smaller caseloads and provide additional services. In practice, this plan was not followed, and the differences between the two types of case management were minimal. Perhaps for this reason, no statistically significant differences in outcomes were found (Greenberg & Shroder, 2004, pp. 335–336). Something similar occurred in Britain’s Intensive Gateway Trailblazers demonstration, which targeted young adults who had been unemployed for at least 6 months and who were receiving benefit payments. As designed, the mandatory tested program was supposed to require individuals assigned to the treatment group to participate in a course and to receive more intensive training and counseling than controls. In practice, the treatment was only weakly implemented: the services actually received by the treatment and control groups were similar, and, because the attendance requirement was poorly enforced, turnout for the mandatory course was low (Davies & Irving, 2000).

Failure to adequately communicate the treatment

If members of a treatment group are to respond to the treatment appropriately, they presumably must understand what the treatment is. This is particularly a problem in a demonstration program if it is anticipated that knowledge of the program would be greater should the program be adopted as a regular policy. In such a case, an evaluation of the demonstration cannot accurately measure the effects of the future ongoing program.

As is the case for implementation fidelity, failure to adequately communicate the treatment can be either an evaluation flaw or a program design flaw. In a long-standing program, failure of the participants to understand the treatment is logically viewed as a program flaw. For example, it is likely that few people in the United States fully understand the rules for the Social Security program, so the failure of individuals to respond rationally to the rules should be viewed as a program issue rather than an evaluation issue. On the other hand, when a demonstration is implemented, participants might not understand the treatment due to a failure to communicate the treatment or because the treatment is inherently difficult to understand. The Primary Prevention Initiative described subsequently provides a good example of a fairly straightforward treatment: parents whose children were not immunized, did not receive scheduled physical examinations, or missed too many days of school had their welfare grants reduced by a fixed amount. Many interventions are more complex, however, and the failure to understand the treatment could be either a flaw in the program or a failure to communicate the rules.

A communication failure was a serious problem with the German Targeted Negative Income Tax demonstration, which was run for public assistance recipients in seven sites in Germany from 1999 to 2002. In most, but not all, sites, those eligible for the tested program, which was quite complex, were initially informed about the program by letter, with no further attempt at follow-up. Because of this, no conclusions about impacts were possible in six of the seven sites.²⁶ In hindsight, the need to communicate further with the target population should have been built into the program design.

The Primary Prevention Initiative (PPI), which was a test of requirements for school attendance, physical examinations, and appropriate immunizations for the children of AFDC recipients in Maryland, provides another illustration of a failure to understand the treatment. A telephone survey of over 200 members of the treatment group indicated that over 80% of them could not identify the mandatory requirements of the program (Wilson, Stoker, & McGrath, 1999, table 1). Knowledge among those who had been sanctioned through a grant reduction for failing to meet the requirements of the tested program was greater but only slightly so (Wilson et al., 1999, table 2). Not surprisingly, the impact of the program on school attendance and immunization was negligible. Thus, the PPI, as implemented, did not test the impact of the intervention on a knowledgeable population. If the state believes that the welfare population would be educated about the PPI rules should the program be adopted as regular policy, then the efficacy of the strategy cannot be ascertained from the experiment undertaken.

Table 1.

Summary of Lessons From Experimental Flaws.

Problem	When It Occurs	Seriousness	Approaches for Addressing the Problem
Too small a sample due to—			Power tests should be conducted prior to implementing the experiment to determine if this is likely
Insufficient marketing	Test of voluntary treatment when pre-implementation planning is insufficient	Potential failure to detect impacts	Increase outreach and marketing effort
Target group too small	Pre-implementation research insufficient.	Potential failure to detect impacts	Attempt to estimate size of target group prior to initiating program
Disinterest in participating	Test of voluntary treatment	Potential failure to detect impacts, but useful information still provided	Possibly increase communication with target group. Consider small financial incentives for participating
Budgetary constraint	Pre-implementation planning insufficient	Potential failure to detect impacts	Consider replication and/or meta-analysis. Consider not undertaking experiment
Sample attrition	More likely when survey data are used	Potential failure to detect impacts	Increased effort at survey follow-up
Improper randomization	When those administering or subject to the treatment have some control over randomization, but sometimes done inadvertently	Serious but not necessarily fatal	Give the evaluators control of the randomization process. Compare treatment and control characteristics at baseline to detect. Conduct nonexperimental analysis with statistical correction of selection bias
Control crossover	When treatment is attractive and administrators fail to prevent controls from receiving it	Not too serious, if not too large	Use implementation analysis to detect. Use the Orr crossover correction
Response bias due to attrition	Much more likely when survey data are used	Serious but not necessarily fatal	Use techniques to maximize survey response rates. When available, use baseline or administrative data to detect. Conduct nonexperimental analysis with statistical correction of selection bias
Failure in implementing treatment	Usually occurs with demonstration programs. Budget inadequate. Administrators resistant to aspects of the treatment. Misunderstandings. Program plan highly complex	Depends on the degree to which actual treatment deviates from planned treatment	Use implementation analysis to detect. Possibly hold discussions with those implementing the treatment
Inadequate communication of treatment	When treatment is complex and/or effort to explain treatment is insufficient. Sometimes difficult to avoid	Not necessarily serious if lack of understanding in the demonstration program is similar to what it would be should the program be implemented as regular policy	Use implementation analysis to detect. Increase the effort at communication with the treatment group when communication has been inadequate
Stakeholder resistance	Especially likely when existing programs are evaluated. Also occurs when keeping controls from receiving an attractive treatment.	May cause experiment to be modified in unfavorable ways, shut down, or not initiated	Gain agreement to random assignment by key stakeholders as part of the design effort
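
Two of the approaches listed in Table 1, pre-implementation power tests and a correction for control crossover, can be illustrated with a short numerical sketch. The Python below is illustrative only and is not taken from any of the evaluations discussed. The function names, the sample sizes, the 20% crossover rate, and the 400-unit impact are all hypothetical. The first function follows the minimum detectable effect logic of Bloom (1995) for a simple difference in means with no covariates or clustering. The second uses a commonly cited form of the crossover adjustment, which assumes that crossovers experience the same effect as treated members of the treatment group; the exact formulation in Orr (1999) should be checked against the source.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(n, treated_share=0.5, sigma=1.0,
                              alpha=0.05, power=0.80):
    """Smallest true impact detectable with the given power (two-sided test).

    Assumes a simple difference in means, a common outcome standard
    deviation `sigma`, and no covariate adjustment or clustering.
    """
    z = NormalDist()
    # Sum of the critical value and the power quantile (about 2.80 by default).
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    std_error = sigma * sqrt(1.0 / (treated_share * (1 - treated_share) * n))
    return multiplier * std_error


def crossover_adjusted_impact(itt_impact, receipt_rate_treatment,
                              receipt_rate_control):
    """Rescale the treatment-control difference by the gap in receipt rates."""
    gap = receipt_rate_treatment - receipt_rate_control
    if gap <= 0:
        raise ValueError("treatment group must receive the treatment more often")
    return itt_impact / gap


# Hypothetical planning numbers: 1,000 sample members, then 600 after attrition.
planned = minimum_detectable_effect(n=1000)
after_attrition = minimum_detectable_effect(n=600)

# Hypothetical crossover case: 20% of controls received the treatment.
adjusted = crossover_adjusted_impact(itt_impact=400.0,
                                     receipt_rate_treatment=1.0,
                                     receipt_rate_control=0.2)

print(round(planned, 3), round(after_attrition, 3), round(adjusted, 1))
```

In this sketch, the loss of sample raises the minimum detectable effect (expressed in standard deviation units because `sigma` is 1), which is why attrition appears in Table 1 as a source of too small a sample. The crossover adjustment scales the estimated impact up; it does not restore the statistical precision lost to crossover.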

Communication problems are sometimes unavoidable. For example, the Arkansas Welfare Waiver demonstration tested the effects of subjecting AFDC recipients to a family cap in which their benefits would no longer increase with the birth of an additional child. The experiment was limited to 10 small rural counties, while all AFDC recipients in the remainder of the state were simply made subject to the family cap. Members of neither the treatment group nor the control group were certain what effect an additional child would have on their welfare benefits because, as indicated by their responses to a survey, the information provided to them was often vague and misleading. In fact, many controls in the 10 experimental sites believed they were subject to the cap, although they were not. This was probably also due to the rest of the state being subject to the cap, widespread publicity about the family cap, feedback from those who were subject to the cap, and misinformation from some caseworkers (Turturro, Benda, & Turney, 1997).²⁷

A Weakening of the Originally Planned Experimental Design

Stakeholder resistance

Special efforts at the beginning of an experiment may be required to overcome stakeholder resistance. Resistance by various stakeholders to either a program innovation being tested through random assignment or to random assignment itself can force changes in the program or in the experimental design that weaken an experiment. In extreme instances, it may result in the experiment being cancelled. We first consider instances in which an experimental design is compromised and then describe a few cases in which experiments were cancelled. Cancellation obviously does not result in a flawed experiment, since the experiment did not take place. But it is arguably more serious in many situations, as we fail to learn about the effects of the intervention.

Due to a combination of stakeholder resistance and adverse publicity, the random assignment Matriculation Awards Demonstration in Israeli high schools, in which entire schools were randomly assigned and cash awards for reaching achievement goals were offered to students at the treatment schools, was suspended after 1 year of a planned 3-year experiment, even though there was evidence that the policy had a considerable positive effect on matriculation rates (Angrist & Lavy, 2002). Staff at the Israeli Ministry of Education, especially those with academic training in education, were resistant to the idea of incentivizing students with financial rewards, and they expressed their views to journalists who cover education in the media. The view of these journalists, many of whom had their training in education, was that students should have enough intrinsic motivation to invest in education and that financially rewarding them for achievements was inappropriate. This negative feeling toward the program, together with a new minister of education who cancelled most of the new programs put in place by his predecessor, made it possible to stop the program between its first and second years.²⁸ As a result, less information about matriculation awards was obtained than would have been ideal.

Caseworkers and their clients are often resistant to random assignment. Thus, prior to initiating some welfare-to-work experiments, program administrators have explained to caseworkers the purpose of random assignment and the mechanics of doing it. Caseworkers have also been given scripts to help them discuss with clients their assignment to treatment or control status. This is especially important for those clients who are unhappy with their assignment. Sometimes caseworkers are concerned that certain clients should not be randomly assigned (e.g., they are too ill to participate in program activities if they are assigned to treatment status). To obtain caseworker buy-in, caseworkers in some experiments have been permitted to exclude some persons from the random assignment pool. This results in a less representative sample, but as long as the number excluded is kept small and the exclusion occurs prior to random assignment, the problem should not be too serious.²⁹

A more serious problem in obtaining a representative sample occurs in selecting sites in multisite experiments, as sites are almost always allowed to opt out of participating. For example, initial plans for the National JTPA Study, which was a random assignment evaluation of training programs for the disadvantaged funded under the Job Training Partnership Act, called for randomly selecting sites for participation in the study. The evaluation was limited instead to those self-selected sites that were willing to participate (Orr et al., 1996). This experience is probably fairly commonplace with large-scale experiments, as experiments with nationally representative samples of sites are rare.³⁰ In these instances, one approach is to compare the observable characteristics of the sample population with those of the target population of the tested program and possibly reweight the former to account for differences with the latter.
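
The reweighting approach mentioned above can be sketched as a simple post-stratification exercise. The Python below is a minimal illustration under stated assumptions, not the procedure used in the National JTPA Study or any other evaluation cited here. The strata ("urban" and "rural"), the counts, the target shares, and the impact values are all hypothetical, as are the function names. Reweighting of this kind adjusts only for observable characteristics; it cannot correct for unobserved differences between sites that opt in and those that do not.

```python
def poststratification_weights(sample_counts, target_shares):
    """Weight each stratum by (target population share) / (sample share)."""
    total = sum(sample_counts.values())
    weights = {}
    for stratum, count in sample_counts.items():
        sample_share = count / total
        weights[stratum] = target_shares[stratum] / sample_share
    return weights


def weighted_mean_impact(impacts, sample_counts, weights):
    """Average stratum-level impacts using reweighted sample counts."""
    numerator = sum(impacts[s] * sample_counts[s] * weights[s] for s in impacts)
    denominator = sum(sample_counts[s] * weights[s] for s in impacts)
    return numerator / denominator


# Hypothetical self-selected sample that over-represents urban participants.
sample_counts = {"urban": 80, "rural": 20}
target_shares = {"urban": 0.5, "rural": 0.5}
impacts = {"urban": 100.0, "rural": 200.0}  # hypothetical stratum impacts

weights = poststratification_weights(sample_counts, target_shares)
unweighted = weighted_mean_impact(impacts, sample_counts,
                                  {s: 1.0 for s in sample_counts})
reweighted = weighted_mean_impact(impacts, sample_counts, weights)

print(unweighted, reweighted)
```

Here the unweighted average reflects the composition of the sites that agreed to participate, while the reweighted average reflects the assumed composition of the target population.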

More seriously, and more recently, an experimental evaluation was planned of the national Upward Bound program, which aims to help disadvantaged youth enter and succeed in college. However, a previous experimental evaluation of Upward Bound had been initiated in the 1990s, with follow-up work completed after the turn of the century. This evaluation generally showed negligible program impacts (Seftor, Mamun, & Schirm, 2009). The youth advocacy community attempted to both cast doubt on the findings and prohibit further evaluation. Ultimately, Congress decided to withhold additional funding for the subsequent evaluation, which forced the U.S. Education Department to cancel the study.³¹ In addition, Congress reauthorized the Higher Education Act (now called the Higher Education Opportunity Act) with language to preclude a repeat of the previous evaluation by preventing the Education Department from requiring grantees to recruit a sufficient number of youth so that some could be assigned to a control group.

In 2005, the U.S. Department of Labor planned and designed a random assignment evaluation of the Youth Offender Demonstration Project. The experiment was to take place in six jurisdictions that were already operating the program. The random assignment design required that courts allow youths to be randomly assigned to one of three groups (two treatment groups and a control group). As it turned out, despite the efforts of the evaluation team, none of the courts or program operators in the six jurisdictions was willing to accept the experimental design because of changes in the programs that would have resulted and because of various ethical, legal, and political issues that random assignment raised. As a result, the experiment was not conducted (Dunham, Wiegand, & Michalopoulos, 2008).

Sometimes negative publicity about an experiment causes stakeholder resistance. The New Deal for Disabled Persons (NDDP) was a voluntary welfare-to-work program for incapacity (disability) claimants in the United Kingdom. Original plans called for the NDDP to be evaluated with a random assignment experimental design at the time it was introduced nationally. Although the effectiveness of the program was unproven, adverse publicity about denying program services to disabled persons assigned to control status led a U.K. government minister to drop the planned experimental evaluation shortly before NDDP was introduced. Although an evaluation was conducted, it was nonexperimental (Orr, Bell, & Lam, 2007).

Conclusions

In this article, we have examined eight flaws that can occur in conducting social experiments, sometimes causing the experiment to fail. Table 1 lists the major lessons from this effort. Flaws that appear to occur especially often are response bias resulting from attrition, a failure to adequately implement the treatment as designed, and too small a sample to detect impacts. The third of these flaws can result from insufficient marketing, too small an initial target group, disinterest on the part of the target group in participating (if the treatment is voluntary), or attrition.

The discussion of experimental flaws is in no way intended to discourage the use of social experiments for evaluating social programs. Experiments are usually the best tool available. However, the discussion demonstrates that without due care and sufficient funding, experiments can face major obstacles.

To a considerable extent, these obstacles can be minimized. For instance, implementation failures and too small a sample can usually be avoided with sufficient effort and planning, and response bias can often be mitigated, for example, through increased follow-up efforts in conducting surveys. More generally, flawed evaluations can be minimized by careful and formal consideration of whether the program is ready for an impact evaluation. Wholey (2004) developed the concept of conducting evaluability assessments to determine if an intervention is ready for a formal evaluation. Epstein and Klerman (2012) extend Wholey’s concepts to describe how detailed logic models can be used to conduct falsification tests to assess if a program is ready for an impact evaluation. Of course, some impact evaluations are still likely to be flawed, but this article and the earlier work by Wholey (2004) and Epstein and Klerman (2012) provide guidance that should reduce the problem.

Footnotes

Authors’ Note

The authors are grateful for very helpful comments made on previous drafts of this paper by the editor and three anonymous referees.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Notes

References

Angrist, J. D., & Lavy (2002). The effect of high school matriculation awards: Evidence from randomized trials. Cambridge, MA: National Bureau of Economic Research working paper 9389.

Azrin, N. H., Flores, & Kaplan, S. J. (1975). Job-finding club: A group-assisted program for finding employment. Behavioral Research and Therapy, 13, 17–27.

Azrin, N. H., & Philip, R. A. (1979). The job club method for the handicapped: A comparative outcome study. Rehabilitation Counseling Bulletin, 23, 144–155.

Azrin, N. H., Philip, R. A., Thienes-Hontos, & Besalel, V. A. (1980). Comparative evaluation of the job club program with welfare recipients. Journal of Vocational Behavior, 16, 133–145.

Banerjee, Duflo, & Glennerster (2011). Is decentralized iron fortification a feasible option to fight anemia among the poorest? In D. A. Wise (Ed.), Explorations in the economics of aging (pp. 317–344). Chicago, IL: University of Chicago Press.

Barnow & Greenberg (2013). Replication issues in social experiments. Journal for Labour Market Research, 46, 239–252.

Basch, C. E., Sliepcevich, E. M., Gold, R. S., Duncan, D. F., & Kolbe, L. J. (1985). Avoiding type III errors in health education program evaluations: A case study. Health Education Quarterly, 12, 315–331.

Birtwhistle, Barnes, & Looby (1994). Evaluation of supportive caseloading (1-2-1) in North Norfolk. Sheffield, England: Research and Evaluation, Employment Service, Tracking Study.

Bloom, H. S. (1995). Minimum detectable effects: A simple way to report the statistical power of experimental designs. Evaluation Review, 19, 547–556.

Bloom, H. S., Hill, C. J., & Riccio (2003). Linking program implementation and effectiveness: Lessons from a pooled sample of welfare-to-work experiments. Journal of Policy Analysis and Management, 22, 551–575.

Blustein (2005). Toward a more public discussion of the ethics of federal social program evaluation. Journal of Policy Analysis and Management, 24, 824–846.

Bruhn & McKenzie (2009). In pursuit of balance: Randomization in practice in development field experiments. American Economic Journal: Applied Economics, 1, 200–232.

Cancian, Kaplan, & Rothe (2000). Wisconsin’s self-sufficiency/pay for performance: Results and lessons from a social experiment. Madison: University of Wisconsin-Madison, Institute for Research on Poverty.

Cook, T. D., & Wong (2007). The warrant for universal pre-K: Can several thin reeds make a strong policy boat? Social Policy Review, 21, 14–15.

Cook, P. J., O’Brien, Braga, & Ludwig (2012). Lessons from a partially controlled field trial. Journal of Experimental Criminology, 8, 271–287.

Davies & Irving (2000). New deal for young people: Intensive gateway trailblazers. A final report to the Employment Services. Birmingham, England: ECOTEC Research and Consulting Limited.

Devine, J. A., Wright, J. D., & Brody, C. J. (1997). Evaluating an alcohol and drug treatment program for the homeless: An econometric approach. Evaluation and Program Planning, 20, 205–215.

Dunham, Wiegand, & Michalopoulos (2008). The effort to implement the Youth Offender Demonstration Project (YODP) impact evaluation: Lessons and implications for future research, final report. Ottawa Ontario: Social Policy Research Associates.

Dynarski & Wood (1997). Helping high-risk youths: Results from the Alternative Schools Demonstration Program. Princeton, NJ: Mathematica Policy Research.

Epstein & Klerman, J. A. (2012). When is a program ready for rigorous impact evaluation? The role of a falsifiable logic model. Evaluation Review, 36, 375–401.

Glennerster & Takavarasha (2013). Running randomized evaluation: A practical guide. Princeton, NJ: Princeton University Press.

Greenberg, Ashworth, Cebulla, & Walker (2004). Meta-analysis: Discovering what works best in welfare provision. Evaluation, 10, 193–216.

Greenberg & Shroder (2004). The digest of social experiments (3rd ed.). Washington, DC: Urban Institute Press.

Greenberg, Michalopoulos, & Robins, P. K. (2003). A meta-analysis of government-sponsored training programs. Industrial and Labor Relations Review, 57, 31–53.

Greenberg, Shroder, & Onstott (1999). The social experiment market. Journal of Economic Perspectives, 13, 157–172.

Grinstein-Weiss, Sherraden, M. W., Gale, W. G., Rohe, Schreiner, & Key (2012). Long-term follow-up of individual development accounts: Evidence from the ADD experiment. Chapel Hill, NC: The University of North Carolina.

Hahn (1994). Evaluation of the Quantum Opportunities Program (QOP): Did the program work? Waltham, MA: Brandeis University Center for Human Resources.

Harrison, G. W., & List, J. A. (2004). Field experiments. Journal of Economic Literature, 42, 1009–1055.

Hendra, Riccio, J. A., Dorsett, R., Greenberg, D. H., Knight, G., Phillips, J., Robins, P. K., Vegeris, S., Walter, J., Hill, A., Ray, K., & Smith, J. (2011). Breaking the low-pay, no-pay cycle: Final evidence from the UK employment retention and advancement (ERA) demonstration. Sheffield, London: Department for Work and Pensions. Research Report No 765.

Leiman, J. M. (1982). The WIN labs: A federal/local partnership in social research. New York, NY: MDRC.

List, J. A. (2011). Why economists should conduct field experiments and 14 tips for pulling one off. The Journal of Economic Perspectives, 25, 3–15.

Manski (1990). Nonparametric bounds on treatment effects. American Economic Review, 80, 319–323.

Maxfield, Castner, Maralani, & Vencill (2003). The quantum opportunity program demonstration: Implementation findings. Princeton, NJ: Mathematica Policy Research.

Mowbray, C. T., Collins, M. E., Plum, T. B., Masterton, & Mulder (1997). Harbinger I: The development and evaluation of the first pact replication. Administration and Policy in Mental Health, 25, 105–123.

Newman, Pradham, & Rawlings, L. B. (2002). An impact evaluation of education, health, and water supply investments by the Bolivian Social Investment Fund. The World Bank Economic Review, 16, 241–271.

Olsen, R. B., Orr, L. L., Bell, S. H., & Stuart, E. A. (2013). External validity in policy evaluations that choose sites purposively. Journal of Policy Analysis and Management, 32, 107–121.

Orr (1999). Social experiments: Evaluating public programs with experimental methods. Thousand Oaks, CA: Sage.

Orr, L. L., Bell, & Lam (2007). Long-term impacts of the new deal for disabled people: Final report. Leeds, England: DWP Research Report No. 342.

Orr, L. L., Bloom, H. S., Bell, S. H., Doolittle, Lin, & Cave (1996). Does training for the disadvantaged work? Evidence from the National JTPA Study. Washington, DC: Urban Institute Press.

Puma, M. J., Burstein, N. R., Merrell, & Silverstein (1990). Evaluation of the food stamp employment and training program: Final report, Volume I. Washington, DC: U.S. Department of Agriculture, Food and Nutrition Service.

Rosholm & Skipper (2009). Is labour market training a curse for the unemployed? Evidence from a social experiment. Journal of Applied Econometrics, 24, 338–365.

Seftor, N. S., Mamun, & Schirm (2009). The impacts of regular upward bound on postsecondary outcomes 7-9 years after scheduled high school graduation: Final report. Princeton, NJ: Mathematica Policy Research.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston, MA: Houghton Mifflin.

Turturro, Benda, & Turney (1997). Arkansas welfare waiver demonstration final report. Fayetteville, NC: University of Arkansas.

U.S. Department of Health and Human Services (1992). Demonstration Partnership Programs Project: Summary of final evaluation findings from FY1989. Washington, D.C.: Office of Community Services, Monograph Series 100–89: Case Management Family Intervention Models.

U.S. Department of Health and Human Services (1994). Partnership for Hope, Whatcom County Opportunity Council, Bellingham, Washington. Summary of final evaluation findings from FY 1990, Demonstration Partnership Project. Washington, D.C.: Office of Community Services, Monograph Series 100–90: Case Management Family Intervention Models, 100-4-72—110-4-83.

Walker, Hoggart, & Hamilton (2006). Making random assignment happen: Evidence from the UK employment retention and advancement (ERA) demonstration. London, England: Department for Work and Pensions Research Report 330.

Wholey (2004). Exploratory evaluation. In H. P. Hatry, J. S. Wholey, & K. E. Newcomer (Eds.), Handbook of practical program evaluation (2nd ed., pp. 241–260). San Francisco, CA: Jossey-Bass.

Wilson, L. A., Stoker, R. P., & McGrath (1999). Welfare bureaus as moral tutors: What do clients learn from paternalistic welfare reforms? Social Science Quarterly, 80, 473–486.