Abstract
Along with the late Howard Freeman, Richard Berk was a founding editor of Evaluation Review (then Evaluation Quarterly) in 1977. He resigned as editor of this journal at the end of 2010. In this article, he reflects on his experiences.
Introduction
The Evaluation Review was born as the Evaluation Quarterly in 1977. The journal was a natural outgrowth of interest in program evaluation that had begun in the 1930s and accelerated in the 1970s. In a brief history, Rossi, Freeman, and Lipsey (1979, 9–20) link the more recent visibility of program evaluation to improvements in social science research methods, the co-evolution of program evaluation and policy analysis, growth of evidence-based public policy, and globalization. The content of the Evaluation Review has reflected these trends.
Over the years, the Evaluation Review has also been a partner in a major victory for informed public policy. When the journal was founded, program evaluation was evolving into evaluation research, a broader species of applied social research that bridged the social sciences and statistics. As such, it was not to be found in any mainstream journals or at meetings of the major social science professional associations (Berk 1981, 1987). Now, there are several journals devoted to evaluation research, evaluations of various kinds are found in many of the major social science journals, and evaluations are often presented at social science professional meetings.
However, this academic success would be a policy failure were evaluation research not an important part of the policy process. In fact, evaluation research can now be found in most federal agencies and in a large number of agencies at state, county, and municipal levels. Indeed, the current mantra of “evidence-based” policy is a testimony to an even broader acceptance.
Coupled with the many advances, unfortunately, have come a variety of new problems. In particular, the policy-making process too often does not articulate what qualifies as evidence. Having read hundreds of papers over the years that purport to provide evidence on any number of policy matters, I find that for every well-done and useful study there is at least one other doing what under the famous Daubert ruling would be called junk science (Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, Supreme Court, 1993). This might not be a problem for policy except that too many users of evaluation research have trouble telling the difference.
In all fairness, it is sometimes difficult to determine which is which. Some purveyors of junk do not know what they are doing but proceed with all of the superficial trappings of high-quality work. Other purveyors of junk know well what they are doing and use those same trappings to cover their tracks.
It would be useful were there some simple decoding device that would allow evaluation research users to tell the difference between real evidence and fake evidence. I know of no such device, and the risk is that any easily implemented set of rules will stifle innovation. Innovations are by definition novel changes in business as usual.
What I offer instead is a short sampling of issues around which the battle between evidence and junk is being waged. The focus will be on how to estimate the impact of interventions, which is so central to evaluation research. From this brief review, I offer at the end a few suggestions about ways to combat the junk.
Defining Causal Effects
One of the major conceptual advances over the past three decades has been a rethinking of “cause and effect.” Much of the credit goes to Donald Rubin (1986), who extended an earlier formulation from Jerzy Neyman (1923). Other key contributors include Paul Holland (1986) and more recently, Judea Pearl (2000). Although there are certainly dissenters (e.g., Dawid 2000), the “potential outcomes” formulation originating in Neyman’s paper has come to dominate the technical literature.
The key idea is that causal effects need to be conceptualized as hypothetical. Suppose, for example, an intervention comes in the form of a treatment condition and a comparison condition. The treatment condition might be to offer a prison inmate job training and the comparison condition might be to offer free time in the prison yard. The outcome might be labor market success after release. One imagines what the outcome would be for an inmate under the treatment condition, and what the outcome would be for that same inmate under the comparison condition. Both outcomes are hypothetical. They are carefully considered before any data are analyzed. Then, a causal effect is defined (not observed) by a comparison between these two potential outcomes. One such comparison might be the difference in earnings after release from prison—what might be the earnings if an inmate is offered job training compared to what might be the earnings if that same inmate were offered free time in the prison yard.
The potential outcomes formulation allows one to clearly distinguish between the definition of a causal effect and the estimation of a causal effect. This is an important advance. Too often the two have been conflated, with confusion the usual result. If causal effects are not clearly defined, it is necessarily unclear what exactly is being estimated.
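The distinction can be made concrete with a small simulation. The sketch below is only an illustration under invented assumptions: the earnings figures, the effect sizes, and all variable names are hypothetical. It first writes down both potential outcomes for every inmate, which is the definition step, and then hides one of them by random assignment, which is the estimation step.

```python
import random
import statistics

random.seed(0)
n = 2000

# Definition step: imagine BOTH potential outcomes for every inmate.
# All numbers are invented (earnings in arbitrary units).
y_free_time = [random.gauss(20.0, 5.0) for _ in range(n)]
unit_effect = [random.gauss(3.0, 2.0) for _ in range(n)]
y_training = [y0 + d for y0, d in zip(y_free_time, unit_effect)]

# The average causal effect is defined from the full table of potential outcomes.
defined_effect = statistics.fmean(unit_effect)

# Estimation step: each inmate reveals only ONE potential outcome.
# Random assignment decides which one is observed.
trained = set(random.sample(range(n), n // 2))
observed_training = [y_training[i] for i in range(n) if i in trained]
observed_free_time = [y_free_time[i] for i in range(n) if i not in trained]
estimated_effect = (
    statistics.fmean(observed_training) - statistics.fmean(observed_free_time)
)

print(f"defined: {defined_effect:.2f}  estimated: {estimated_effect:.2f}")
```

The defined quantity exists whether or not any data are collected. The estimate depends on the design and will differ from it by chance.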
For example, there are different ways to define an average treatment effect that lead to different estimands (Imbens 2004, 6–7): the population average treatment effect, the population average treatment effect on the treated, the sample average treatment effect, the sample average treatment effect on the treated, and the average sample treatment effect conditional on sample covariates. Each of these definitions identifies a particular set of study subjects for whom average causal effects are desired. Such distinctions are commonly overlooked so that the nature of the estimand and the properties of estimates are obscure. Credible evidence is unlikely to result.
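In one common notation, the five estimands might be written as follows. The symbols are assumptions introduced here for illustration: $Y_i(1)$ and $Y_i(0)$ are the potential outcomes for unit $i$ under treatment and comparison, $W_i$ is the treatment indicator, $X_i$ the covariates, $N$ the sample size, and $N_T$ the number of treated units.

```latex
\begin{align*}
\tau_{\text{PATE}} &= \mathbb{E}\bigl[\,Y_i(1) - Y_i(0)\,\bigr] \\
\tau_{\text{PATT}} &= \mathbb{E}\bigl[\,Y_i(1) - Y_i(0) \mid W_i = 1\,\bigr] \\
\tau_{\text{SATE}} &= \frac{1}{N} \sum_{i=1}^{N} \bigl[\,Y_i(1) - Y_i(0)\,\bigr] \\
\tau_{\text{SATT}} &= \frac{1}{N_T} \sum_{i:\,W_i = 1} \bigl[\,Y_i(1) - Y_i(0)\,\bigr] \\
\tau_{\text{CATE}} &= \frac{1}{N} \sum_{i=1}^{N} \mathbb{E}\bigl[\,Y_i(1) - Y_i(0) \mid X_i\,\bigr]
\end{align*}
```

The expectations in the first two lines are taken over a population, whereas the last three are tied to the units actually in the sample. That is why the same estimate can have different properties depending on which estimand is intended.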
My sense is that gradually the potential outcomes formulation has been making important inroads in evaluation research practice, especially among economists. But the gains have come very slowly.
Estimating Causal Effects With Randomized Experiments
There is no doubt that the increasing use of randomized field experiments has been an important advance in program evaluation. A powerful and early case for randomized field experiments was made by Campbell (1969) in his classic paper “Reforms as Experiments,” with much of the technical reasoning found in a still earlier statement by Campbell and Stanley (1963). If one’s primary concern is with internal validity, randomized experiments properly implemented are likely to provide the strongest possible evidence.
That said, the widespread use of randomized experiments has led to at least two unfortunate consequences. First, even if as some claim randomized experiments are the “gold standard,” randomized experiments in the field will often fall far short (Berk 2005). For example, randomized experiments are being routinely used when the units randomized are clusters of smaller units—cluster random assignment. Perhaps the most common instances involve studies in educational settings. Entire classrooms of students are assigned to one of several interventions, although interest centers on how individual students will respond. That is, the unit of policy interest is the student, but the unit subject to random assignment is the classroom.
A host of technical complications can follow. For example, the number of units randomized is often too few to meaningfully ensure balance in all confounders. Treatment effect estimates are still unbiased, but the point estimate from a given experiment may be a substantial distance from the truth and inadvertently incorporate associations with one or more covariates.
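A simulation can illustrate the point. The sketch below rests on invented assumptions: classroom baselines, score variability, a true effect of 2.0, and the function name `one_experiment` are all hypothetical. It randomizes whole classrooms many times and compares the spread of the resulting estimates when 4 classrooms are randomized with the spread when 40 are.

```python
import random
import statistics

def one_experiment(n_classrooms, rng, true_effect=2.0, class_size=25):
    """Randomize whole classrooms; return the difference in mean student scores."""
    # Each classroom has its own baseline level (a classroom-level covariate).
    baselines = [rng.gauss(50.0, 5.0) for _ in range(n_classrooms)]
    treated = set(rng.sample(range(n_classrooms), n_classrooms // 2))
    treated_scores, control_scores = [], []
    for classroom, base in enumerate(baselines):
        for _ in range(class_size):
            score = base + rng.gauss(0.0, 5.0)
            if classroom in treated:
                treated_scores.append(score + true_effect)
            else:
                control_scores.append(score)
    return statistics.fmean(treated_scores) - statistics.fmean(control_scores)

rng = random.Random(1)
few = [one_experiment(4, rng) for _ in range(500)]
many = [one_experiment(40, rng) for _ in range(500)]

# Averaged over repeated experiments, both designs recover the true effect,
# but any single small experiment can land far from it.
print(f"4 classrooms:  mean {statistics.fmean(few):.2f}, sd {statistics.stdev(few):.2f}")
print(f"40 classrooms: mean {statistics.fmean(many):.2f}, sd {statistics.stdev(many):.2f}")
```

In this setup, the chance imbalance in classroom baselines is what moves a single small-experiment estimate away from the truth, even though the estimator is unbiased.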
More troubling still is that the students in a given classroom will interact so that the outcomes measured are not realized independently across students. Some researchers think this is “only” a problem for statistical tests and confidence intervals, and researchers are correct when they claim that there can be effective remedies.
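One widely used remedy of this kind inflates variances by a design effect. As a sketch, assuming equal classroom sizes $m$, a common intraclass correlation $\rho$, and $n$ students in total (all three symbols are introduced here for illustration):

```latex
\[
\text{deff} = 1 + (m - 1)\,\rho,
\qquad
n_{\text{eff}} = \frac{n}{\text{deff}}
\]
```

With hypothetical values $m = 25$ and $\rho = 0.10$, the design effect is 3.4, so 1,000 students carry roughly the information of 294 independent ones.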
Though too rarely recognized, the problem is much worse: “subject interference,” sometimes referred to as a violation of the “stable unit treatment value assumption” (Rosenbaum 2010, 28–29). Because the outcome for any one student depends on the students with whom he or she interacts, and because that depends on which students are assigned to which interventions, causal effects are defined not just by the interventions, but by which students happen to be assigned where. That is, causal effects depend in part on the outcome of the random assignment. So, there can be one causal effect for each possible shuffling of the study units, and a single causal effect no longer exists. As of now, there is no effective statistical remedy for subject interference.
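A toy example may help show why a single causal effect no longer exists. Everything in the sketch below is hypothetical: four students, an invented friendship network, a gain of 5 from one's own treatment, and a gain of 2 for each treated friend. It enumerates every way of treating two of the four students and records student 0's outcome under each.

```python
from itertools import combinations

# Invented friendship network among four students.
friends = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}

def outcome(student, treated):
    """Score depends on own treatment AND on how many friends are treated."""
    own = 5.0 if student in treated else 0.0
    spillover = 2.0 * len(friends[student] & treated)
    return 10.0 + own + spillover

# Every way of assigning exactly two of the four students to treatment.
assignments = [frozenset(pair) for pair in combinations(friends, 2)]

# Student 0's outcome under each assignment, split by student 0's own condition.
when_treated = sorted({outcome(0, a) for a in assignments if 0 in a})
when_control = sorted({outcome(0, a) for a in assignments if 0 not in a})

# Every treated-versus-control contrast that could be called "the" effect.
possible_effects = sorted({t - c for t in when_treated for c in when_control})

print(when_treated, when_control, possible_effects)
# prints [15.0, 17.0] [10.0, 12.0] [3.0, 5.0, 7.0]
```

Student 0 has no single outcome under treatment and no single outcome under control, so the treated-versus-control contrast takes several values depending on how the other students happen to be assigned.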
A second unfortunate consequence is that research not based on random assignment can sometimes be categorically dismissed. This is a serious mistake, especially if internal validity is but one of several kinds of validity that matter. For example, one is not likely to be interested in the results of a randomized experiment unless they provide guidance for future interventions; the intent is to affect subsequent policy decisions. This means that one must consider how well the results generalize to new settings, new program participants, and related interventions (because no two social interventions will be identical). Thus, a mandatory job training experiment for individuals receiving unemployment compensation may not provide sufficiently useful estimates of what would happen if the program rolled out were voluntary. Then, one might make the case for a research design that has less internal validity than a randomized experiment as long as external validity were substantially improved.
These and other concerns about randomized experiments have been noted by a number of respected economists (e.g., Heckman and Smith 1995; Heckman 2000; Deaton 2008). Their overarching point is that estimates of causal effects are nearly useless unless one also learns the mechanisms by which the causal effects are produced. Randomized experiments risk just that outcome. There is a need for more substance, even if somewhat weaker causal effect estimates result.
Context also matters. The benchmark for any evaluation is current knowledge, not perfection. Hence, designs with less internal validity than randomized experiments can sometimes be probative. For example, large observational studies in which matching is used to adjust for confounders can sometimes provide instructive results (Rosenbaum 2010). The policy decisions are better informed than without the research even though the research does not employ random assignment.
But there are powerful counterarguments (e.g., LaLonde 1986; Duflo, Glennerster, and Kremer 2008; and especially Imbens 2009). At stake is whether the ideal becomes an enemy of the possible. No one can object to the need for understanding how an intervention works and the circumstances in which it will work best. But one can certainly object if with current methods the search for understanding is likely to be fruitless and precludes learning anything at all about the real-world impact of interventions.
For me, perhaps the strongest case for random assignment is that it can be more effective than other designs in keeping researchers from making serious statistical errors. Evaluation researchers have over the years shown remarkable skill in getting themselves into all sorts of statistical train wrecks, but I believe that this is more difficult to do if a well-implemented randomized experiment is analyzed as the elementary textbooks recommend. The analysis of other kinds of designs (quasi experiments and observational studies) generally requires more statistical skill and a deeper understanding of the requisite statistical tools. Causal modeling is perhaps the best example.
Causal Modeling
Causal modeling, also called structural equation modeling, is the usual alternative to randomized experiments and has a long history in the social sciences (Goldberger 1973; Duncan 1975). Its aim is to impose on an observational study a model of how nature generated the data and then from the data, to estimate the values of the parameters nature employed. The impact of causal modeling on evaluation research, and more generally on empirical social science, is mixed at best. For a relatively early critique, Ed Leamer’s (1983) famous “Let’s Take the Con Out of Econometrics” is still a very entertaining read. Freedman’s (2005) book Statistical Models: Theory and Practice provides perhaps the most thorough treatment building on two decades of well-argued concerns about causal modeling. Over the years, I have had a few things to say about this too (Berk 2004).
Despite the many critical papers and books, causal modeling is far too often the statistical tool of choice. Recent examples include isolating the role of mediating or moderating variables and the ubiquitous hierarchical linear model. Among the reasons for causal modeling’s staying power is its promise first articulated in econometrics. Causal modeling burst onto the scene in the 1970s as a way to formally integrate social science theory and statistics. As such, it gave social science theory more explicit empirical relevance and helped to legitimate statistics as a useful social science tool. As a marriage of substantive and statistical theory, what could be better?
Causal modeling’s overreach only gradually became apparent over the next two decades. Researchers trained some time ago and not current on the causal model critiques may still be trying to capitalize on the promise of causal modeling or, even if current, may believe that the promise outweighs the risks. And it is sometimes difficult to specify the risks exactly. Causal modeling errors are matters of degree. There is no clear empirical demarcation between models that are close enough to right and models that are not. One result is lots of wiggle room.
A less charitable explanation is that the widespread availability of point-and-click statistical packages makes causal modeling seem easy. No deep understanding is required. Indeed, I have often reviewed papers in which the statistical modeling procedure used is referenced by the name of the software package (e.g., “A LISREL analysis was done”). An only slightly less troubling practice is to cite the software used as if that is all one needs to know (e.g., “The analysis was done using proc mixed”).
For these and other reasons, responses to critiques of causal modeling are often rhetorical. Freedman (2005, 195) has compiled an instructive but incomplete list: We all know that. Nothing is perfect. Linearity has to be a good first approximation. Log linearity has to be a good first approximation. The assumptions are reasonable. The assumptions don’t matter. The assumptions are conservative. You can’t prove the assumptions are wrong. The biases will cancel. We can model the biases. We’re only doing what everyone else does. Now we will use more sophisticated techniques. If we don’t do it, someone else will. What would you do? The decision-maker has to be better off with us than without us. We all have mental models. Not using a model is still a model. The models aren’t totally useless. You have to do the best you can with the data. You have to make assumptions in order to make progress. You have to give the models the benefit of the doubt. Where’s the harm?
Matching
There are promising alternatives to causal modeling. A good source is the 2009 paper by Imbens and Wooldridge, “Recent Developments in the Econometrics of Program Evaluation.” Among the more popular options is matching.
Matching provides an alternative to causal modeling less dependent on untestable assumptions, more subject to empirical diagnostics, and less vulnerable to statistical malpractice. For example, many causal models are the product of model selection. A range of models is employed with a given data set and a “best” model is selected. The result is that all statistical inference that follows with that data set is likely to be wrong, often badly wrong (Leeb and Pötscher 2005, 2006; Berk, Brown, and Zhao 2010).
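The damage done by model selection can be seen in a small simulation. The sketch below is an illustration under stated assumptions, not a reproduction of any cited analysis: the response and ten candidate predictors are pure noise, the sample size is 50, and the function names are hypothetical. The "best" predictor is chosen by its correlation with the response and then tested as if it had been specified in advance.

```python
import math
import random

def correlation(x, y):
    """Pearson correlation computed from first principles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def best_predictor_is_significant(rng, n=50, n_candidates=10, r_critical=0.279):
    """Pick the candidate most correlated with y, then test it at the 5% level.

    r_critical is (approximately) the two-sided 5% cutoff for a single
    correlation with n = 50; it ignores the selection step on purpose.
    """
    y = [rng.gauss(0.0, 1.0) for _ in range(n)]
    best = max(
        abs(correlation([rng.gauss(0.0, 1.0) for _ in range(n)], y))
        for _ in range(n_candidates)
    )
    return best > r_critical

rng = random.Random(2)
reps = 1000
false_positive_rate = sum(
    best_predictor_is_significant(rng) for _ in range(reps)
) / reps

print(f"nominal rate: 0.05  rate after selection: {false_positive_rate:.2f}")
```

Nothing in these data is related to anything else, yet the selected predictor is declared significant far more often than the nominal 5%.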
Matching sidesteps model selection problems because the set of matching variables employed is determined without reference to the response variable. The goal is to arrive at balance in the sample and for this enterprise, the response variable can be locked in a safe. That is, subjects across treatment groups (or treatment and control groups) are matched so that on average the treatment groups (or treatment and control groups) have covariates with effectively the same distributions. In this sense, the groups are made comparable. There is no need to consider the response.
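A minimal sketch of the idea follows. It uses greedy one-to-one nearest-neighbor matching on a single invented covariate, which is only one of many matching procedures and is chosen here for brevity. The data, the group sizes, and the helper name `standardized_difference` are hypothetical. Note that no response variable is ever created.

```python
import random
import statistics

rng = random.Random(3)

# Hypothetical covariate (say, age): treated subjects tend to be older.
treated_x = [rng.gauss(35.0, 5.0) for _ in range(100)]
control_x = [rng.gauss(30.0, 8.0) for _ in range(1000)]

def standardized_difference(a, b):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((statistics.variance(a) + statistics.variance(b)) / 2) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_sd

# Greedy 1:1 nearest-neighbor matching without replacement,
# using the covariate only. The response plays no role.
available = list(control_x)
matched_control_x = []
for x in treated_x:
    nearest = min(available, key=lambda c: abs(c - x))
    matched_control_x.append(nearest)
    available.remove(nearest)

before = standardized_difference(treated_x, control_x)
after = standardized_difference(treated_x, matched_control_x)

print(f"standardized difference before: {before:.2f}  after: {after:.2f}")
```

Balance is judged entirely from the covariate distributions, so the matched sample can be fixed before anyone looks at outcomes.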
Recent work has improved the way matching can be undertaken, the way sample balance is assessed, and the manner in which sensitivity to omitted covariates is determined (Rosenbaum 2010, Part II). But here too mistakes by researchers are too common. Causal and statistical inferences under matching require that nature conducts the equivalent of a randomized experiment conditional on a set of covariates. Therefore, one must make the case that important covariates have not been overlooked. Because balance in the sample does not mean that there is balance in unobserved covariates, justification for the nature-randomized perspective begins with past research and theory. Sensitivity tests can help, but except in extreme cases will rarely be definitive. Just as important, the key assumptions required for randomized experiments should apply. For example, one must meet the no interference requirement. Likewise, interaction effects with covariates are best addressed by post stratification, not a causal model.
Meta-Analysis
Meta-analysis starts off as a good idea. When there is a set of randomized experiments, with effectively the same outcomes, treatments, protocols, and mix of study units, it can be useful to pool the results. Ideally, the data from each experiment are obtained and a pooled analysis undertaken. Almost as good, key outcomes from each experiment are obtained and combined. In both cases, one would at least gain statistical power, and in principle an enriched analysis could be undertaken.
Meta-analysis begins to run off the tracks when the studies being combined are not comparable randomized experiments (cf. Hedges and Olkin 1985). For example, if the treatments vary in content across studies, the usefulness for policy of an estimated average causal effect over all of the treatments is obscure. What does one do with a finding that averaging over studies of job training and studies of job counseling, effects are beneficial? Should there be more job training programs, more job counseling programs, or more of both? In the average over treatments, there is no evidence that by themselves either kind of intervention works, nor any evidence that both work if introduced together. The routine practice of standardizing the causal effects just papers over this problem. Matters are complicated in a similar fashion if the response to the treatments is defined differently in different studies (e.g., getting a job vs. earnings) or if across studies the subjects differ in important ways.
In addition, when the studies are not randomized experiments, there is a strong likelihood that a collection of biased treatment effect estimates is being combined. How is one then better off? Biased estimates are not random errors and do not cancel out. The result can be just a more precise causal estimate that has the wrong sign or is systematically far too large or far too small.
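The arithmetic is easy to illustrate. In the sketch below, every number is an invented assumption: a true effect of -0.2, a bias of +0.6 shared by all studies, and a standard error of 0.4 per study. Fifty such studies are pooled with conventional fixed-effect (inverse-variance) weights.

```python
import math
import random

rng = random.Random(4)

true_effect = -0.2   # hypothetical: the intervention is mildly harmful
shared_bias = 0.6    # hypothetical: every study overstates the benefit
study_se = 0.4       # hypothetical standard error of each study

def pooled_estimate(n_studies):
    """Fixed-effect (inverse-variance) pooling of equally precise studies."""
    estimates = [
        true_effect + shared_bias + rng.gauss(0.0, study_se)
        for _ in range(n_studies)
    ]
    weights = [1.0 / study_se**2] * n_studies
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    return pooled, pooled_se

pooled, pooled_se = pooled_estimate(50)
ci_low = pooled - 1.96 * pooled_se
ci_high = pooled + 1.96 * pooled_se

print(f"true effect: {true_effect}  pooled: {pooled:.2f}  "
      f"95% interval: ({ci_low:.2f}, {ci_high:.2f})")
```

Pooling shrinks the standard error by a factor of about seven but leaves the bias untouched, so the interval is narrow, sits on the wrong side of zero, and excludes the true effect.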
Finally, in the absence of random assignment, the usual statistical inference becomes very difficult to justify (Berk 2007). One must assume that the studies being combined are a probability sample of all possible studies that could be (or were) done and that the studies were realized independently of one another. Even a casual exercise in the sociology of program evaluation will make clear that the studies summarized are not a probability sample of anything real. And the usual requirement of independence would mean that researchers do not read each other’s work, do not talk about that work at professional meetings, do not ever work collaboratively, and do not hire each other’s students. Indeed, the use of meta-analysis in a given field precludes the very independence on which the statistical inference depends.
In short, the importance of meta-analysis for estimating causal effects has been grossly overrated. A conventional literature review will often do better. At the very least, readers will not be swayed by statistical malpractice disguised as statistical razzle-dazzle.
Conclusions
One should applaud the view that public policy is to be based on evidence. However, what qualifies as evidence, let alone strong evidence, is too often left unspecified. Into this vacuum has been drawn a mix of evaluations ranging from excellent to terrible. That would be acceptable if policy makers could routinely make similar distinctions.
One way to help policy makers has been in play for decades. When there is an important policy issue and contradictory evidence, it can be useful for a knowledgeable, neutral committee to undertake a review. This is often done by the National Research Council of the National Academy of Sciences. Foundations such as the Pew Charitable Trusts sometimes support independent panels or reviews by staff that do much the same thing. It can also help if policy makers have ready access to the technical expertise of neutral third parties.
Moving up the supply chain, evaluation research is perhaps best undertaken with teams that include individuals who have genuine expertise in the statistical procedures being employed. Large research firms such as RAND, MDRC, and Mathematica often try to proceed in this manner. University-based researchers sometimes try to do much the same thing, drawing on skills across several academic departments. There is certainly no guarantee that in either setting the requisite expertise will be properly represented, but there is at least a structure in which it can happen.
Finally, it is important for evaluation researchers to keep abreast of developments in the “data sciences”: statistics, econometrics, and computer science. I believe that computer science will in the next decade influence evaluation research much as statistics did in the 1970s, and because the data sciences are so driven by advances in computing power and the increasing availability of large data sets, change will come very quickly. Keeping abreast means more than knowing how to point and click in the newest software packages. Here as well, therefore, working in groups where the latest expertise can be found will likely help. It will also help in the longer run if social science graduate students interested in evaluation research receive at least master’s level training in the data sciences.
Acknowledgment
The author would like to thank Michael Foster for helpful comments on an earlier draft of this article.
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
The author(s) received no financial support for the research, authorship, and/or publication of this article.
