Abstract

In this book, Howard Wainer, a well-known statistician and formerly a principal research scientist at the Educational Testing Service, interrogates educational practices and ideas using hard data and statistical analysis. Most of the chapters deal with fairly narrow questions, particularly those that involve the use of large-scale tests, such as the SAT and PSAT (preliminary SAT). The author, however, has a larger project: modeling how the careful use of statistics can and should guide educational decision making. It is this larger project that makes the book worthwhile, because, as he works toward this goal, Wainer effectively demonstrates both the strengths and limits of basing decisions on statistical evidence. In so doing, he demonstrates what evidence-based reasoning might look like when it is done well.
In the first chapter, Wainer examines whether the SAT should be made optional as an admission requirement. To investigate this question, he examines Bowdoin College, a small liberal arts school that made the SAT optional in 1969. If you look at the mean published SAT scores, Wainer explains, Bowdoin fares well among similar institutions. If you include the scores of those who chose not to submit the SAT (data that Wainer was somehow able to access), Bowdoin’s mean SAT scores were quite a bit lower. Wainer argues that Bowdoin should care about this discrepancy. He looked at the grade point average (GPA) of the new freshmen, comparing those who submitted their SAT scores with those who did not. On average, the GPA of those students who did not submit their scores was 0.2 grade points lower than that of those who did. Using these data, Wainer argues that SAT scores are indicative of college success and therefore should be included in the college entrance requirements. Colleges that do not use the SAT, Wainer warns, are robbing themselves of a valid way of predicting student success.
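The arithmetic behind the Bowdoin comparison can be sketched in a few lines. This is only an illustration under our own assumptions: the scores below are invented, the names `submitted` and `withheld` are hypothetical, and nothing here reproduces Wainer's data or his method. It shows only that a mean computed over self-selected submitters can sit well above the mean for the whole class.

```python
from statistics import mean

# Hypothetical SAT scores, invented for illustration (not Bowdoin's data).
# Students who chose to report their scores to the college:
submitted = [1350, 1400, 1280, 1320, 1450]
# Students who took the SAT but chose not to report it:
withheld = [1150, 1200, 1100, 1250]

# The "published" mean is computed over submitters only.
published_mean = mean(submitted)

# The full-class mean includes the scores that were withheld.
full_mean = mean(submitted + withheld)

# A positive gap means the published figure overstates the class average.
gap = published_mean - full_mean
print(f"published: {published_mean:.0f}, full class: {full_mean:.0f}, gap: {gap:.0f}")
```

The size of the gap depends entirely on who chooses to withhold, which is why the unpublished scores matter to the comparison.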
Notice here that this warning only makes sense if one assumes that maximizing first-year GPA should be the goal of college admissions. Wainer himself realizes that this all depends on the adoption of particular admission goals: ‘If the goal of admissions policy is to admit students who are likely to be better in their college courses,’ he writes, ‘students with higher SAT scores should be chosen’ (p. 18, emphasis added). The careful conditional is important here, and we wish that Wainer had said more at this point. After all, one could certainly contest whether ‘maximizing first-year GPA’ should be the goal of college admissions. For example, one could argue that first-year courses are less important than other outcomes (success in final-year courses, perhaps). Or, one could argue that admitting a group of less-prepared, disadvantaged students should be part of a college’s educational mission. Whatever the case, the final decision about the use of the SAT must attend to these moral and political issues. Wainer’s analysis succeeds in showing how data can be helpful, but it also clarifies for the careful reader the limits of this sort of evidence-based decision making.
Take another example. In Chapter 4, Wainer argues for the use of PSAT tests to determine who should enroll in Advanced Placement (AP) courses. According to Wainer, there is a strong correlation (which he shows statistically) between a passing score on an AP exam and the relevant score on the PSAT. For example, if a student scores highly on the mathematics portion of the PSAT, then his or her chances of passing the AP calculus exam also increase. These two measures, in other words, are positively correlated. Wainer suggests that this statistical evidence is helpful as we think about allocating AP courses. Because AP courses are a scarce resource, they should be offered to those who will benefit the most from them.
Wainer is right that his analysis is helpful, and he is right to highlight that this issue depends on notions of distributive justice – something beyond statistical evidence. Here again, his analysis reveals the underlying limitations of evidence-based decision making and, here again, we wish he had said more. The evidence linking PSAT scores and AP scores only matters, it seems, in light of a defensible position with respect to distributive justice. Wainer’s seemingly preferred position is a ‘utilitarian’ position, where distributive justice is really about maximizing social efficiency. As Christopher Jencks (1988) points out, one could adopt different positions here, such as a ‘humane’ position (where resources are devoted to those who are disadvantaged), a ‘moralistic’ position (where resources are devoted to those who are morally deserving), or a ‘democratic’ position, where resources are devoted equally, without respect to disadvantage, desert, or efficiency. Only when this hard work has been done does Wainer’s statistical evidence become relevant.
One particularly valuable contribution, in our estimation, arises in Chapter 9, where Wainer discusses value-added (VA) measures or assessments. Wainer brings up three serious difficulties with using VA in assessing teacher performance. First, to make a value-added causal inference about, say, a certain teacher X, we need to make a counterfactual judgment about what students would have achieved had they not had teacher X. To make these judgments, we often compare teacher X with other teachers. For this to be legitimate, however, we need to assume that students are assigned randomly among the different teachers. Random assignment, however, is very rare in actual schools. Thus, VA judgments about teachers are often unjustified statistically. Second, Wainer points to the problem of ‘missing data’. As we evaluate teachers over time, the group of students is constantly changing. These changes, alas, do not happen at random – certain types of students move, or fail to show up for exams, or fail to take exams seriously, and so forth. We cannot simply assume, then, Wainer points out, that ‘the data that are not observed are just like the data that are observed’ (p. 132). Thus, many of the inferences drawn from current VA models are unjustified. Third, Wainer argues that VA models need to assume that there is some static element of performance that can be measured across time. With many subjects, mathematics for instance, there is little relationship between one element of the subject and another (the concept and manipulation of numbers are not the same thing as measurement, or algebra, or geometry). He argues, ‘Just because you call what is being measured “math” in two different years does not mean that it is the same thing’ (pp. 134–135). This makes claims about ‘value’ being ‘added’ in a static subject area highly questionable. In a precise and readable form, Wainer raises some central questions with respect to a hot policy trend.
His recommendation is to ‘be careful’ in using VA assessments. The problems he points to, however, call the whole endeavor into serious question.
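The missing-data difficulty can be made concrete with a small sketch. The class roster below is invented, the tuple layout `(pre, post, tested)` is our own hypothetical device, and the simple mean gain score stands in for a real VA model, which would be far more elaborate. The point is only that when the students who miss the follow-up exam differ systematically from those who sit it, the observed average gain no longer describes the class.

```python
from statistics import mean

# Hypothetical roster for one teacher: (pre-test, post-test, sat the post-test?).
# Post-test values for untested students are what they WOULD have scored;
# in practice these are never observed. All numbers are invented.
students = [
    (50, 60, True),
    (55, 66, True),
    (60, 72, True),
    (40, 42, False),  # moved away before the exam
    (45, 46, False),  # did not show up for the exam
]

# Gain the teacher would be credited with if every student were observed.
true_gain = mean([post - pre for pre, post, _ in students])

# Gain computed only from students who actually sat the post-test.
observed_gain = mean([post - pre for pre, post, tested in students if tested])

print(f"true mean gain: {true_gain}, observed mean gain: {observed_gain}")
```

Here the absent students are the low gainers, so the observed figure flatters the teacher; with a different pattern of absence it could just as easily understate the gain.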
There is much more to the book. Wainer also addresses questions of examinee choice on exams, comparing tests to each other, aptitude versus achievement tests, college selection strategies, and more. Each time the story is familiar: Wainer says something valuable about the topic, while highlighting (sometimes explicitly, sometimes not) the moral and political considerations that also must be addressed for the evidence to have meaning and value. In the end, this process produces a valuable book for students and policy makers, at least for those who take the time to read it carefully. The aim of the book is to show how statistics can be used to shine light on educational questions. It succeeds at this. It also succeeds at showing, we believe, the limitations of statistics in thinking about educational policy.
