Abstract
Many states now possess the data and statistical methods that can produce teacher value-added scores and link them to preparation programs. It is important to understand the limitations of these measures and the inferences that they do and do not support. These limitations fall into three categories. First, value-added measures (VAM) provide information about only one of several important dimensions of teacher preparation program quality, focusing on one outcome measure, but not addressing other program characteristics, including the quality of program resources, the appropriateness of program content, and the contributions programs make to teacher learning. Second, comparing programs on their average VAM scores raises the question of whether mean performance is the most appropriate way to look at program quality. Third, the measurement of program graduates’ VAM is strongly affected by the labor market for teachers, which weakens the inferences from VAM scores to the quality of preparation programs.
Teacher education has moved to the forefront in discussions about U.S. education. Teacher preparation programs, wherever they are based, play an important part in supplying teachers for K-12 schools. Recent reports, as well as those in decades past, have claimed that the quality of teacher preparation programs is uneven, with some programs of low quality and others of high quality. Reports vary in the program features they point to as the basis for these judgments. Official reports about teacher preparation, however, often paint a picture in which almost all programs are described as meeting standards, with little differentiation across institutions.
The mismatch between a general sense of high variability and an official picture of uniformity has frustrated many policy makers. The provisions for state reporting written into the Higher Education Act, for example, suggested that Congress expected ranked lists of programs, with some programs identified as top quality and others as poor quality. When the initial implementation of the law produced reports listing virtually all programs as generally meeting state standards, lawmakers expressed frustration. In negotiation about rules for the reauthorized legislation, federal officials have pressed for use of measures expected to make clearer distinctions among programs (U.S. Department of Education, 2011).
Among the measures federal officials are advocating is a set of indicators based on value-added scores associated with program graduates. The basic logic linked to these measures is simple: Teacher preparation programs should be judged by the quality of teachers they produce. In these judgments, “teacher quality” should be defined by the effect of teachers on the learning of their K-12 pupils. Estimates of that effect on learning should be measured in a way that takes account of the academic and demographic characteristics of their pupils, acknowledging that some students may be more difficult to teach than others. That is, estimates of effect on learning should be a measure of the teacher’s “value added,” the difference in learning compared with what would be typical for students with a given set of characteristics. The bottom line is that teacher preparation programs should be evaluated, at least to a large extent, by the average “value added” of their graduates, their value-added measure (VAM) scores.
Given the apparent simplicity of this argument, and the general appeal of the idea that student learning is the primary criterion for evaluating teachers, it is no wonder that policy makers have begun to press for value-added approaches to the evaluation of teacher preparation programs. These policy makers may see resistance to adoption of a value-added approach as a self-serving or protective impulse on the part of teacher preparation program leaders.
As the other articles in this issue of the Journal of Teacher Education demonstrate (Gansle, Noell, & Burns, 2012; Henry, 2012; Plecki, 2012), the administrative and statistical apparatus needed to link VAMs to preparation programs is now available. Investments in state data systems, prompted in part by the Race to the Top competitions, have increased the number of states where it is possible to connect student test results to individual teachers. That increase in the use of state assessments and in the capacity to link student and teacher data has enlarged the number of states where it is technically feasible to calculate value-added scores for many teachers, a prerequisite to making the further connection between teacher value added and teacher preparation programs.
The combination of the common-sense appeal of judging programs by the effectiveness of their graduates and the increasing ease in computing value-added scores for at least some of a program’s graduates makes it likely that more states will begin to use teacher value-added scores as one component of the evaluation of teacher preparation programs. The use of VAMs for evaluation of teacher preparation is likely to increase and probably to endure. Given this increasing prevalence, it is important to understand the limitations of these measures and the inferences that they do and do not support.
These limitations fall into three categories. First, VAMs provide information about only one of several important dimensions of teacher preparation program quality, focusing on one characteristic of program graduates, but not addressing other important program characteristics, such as the quality of program resources, the appropriateness of program content, and the contributions programs make to teacher learning. Second, comparing programs on the average VAM scores for graduates raises the question of whether mean performance across graduates, as opposed to, for example, a minimum level of performance achieved by all graduates, is the most appropriate way to look at program quality. Third, the fact that the measurement of program graduates’ VAM is strongly affected by the labor market for teachers weakens inferences from VAM scores to the quality of preparation programs. Complications due to the labor market stem from the nonrandom process by which program graduates are distributed across schools, with schools offering jobs to those they think most likely to succeed, graduates choosing offers from schools they find most appealing, and many program graduates not working in schools in the state where they completed teacher preparation. These complications result in missing data for all graduates who do not teach in the state where they were prepared, biased estimates of individual teacher VAM due to school working conditions, and biased estimates of program average VAM due to school and graduate choices about who teaches in which schools.
Varying Definitions of Program Quality
Considering how to interpret the average teacher VAM score for a teacher preparation program must start with considering the possible characteristics that could be taken to constitute preparation program quality. In the history of evaluating professional programs in a range of occupations, evaluations have attended to the “inputs” (library, qualifications of staff, physical facilities, etc.), the training experiences (content covered by coursework, assignments required, literature read, cognitive demand of discussions, particulars of field experiences), and the program outputs (graduates’ scores on written and performance assessments, job placement rates, employer satisfaction, value-added scores). Current attention increasingly emphasizes the program outputs, especially value-added scores. However, the other program features are also seen as aspects of program quality. Programs, for example, have been criticized because their staff lack K-12 teaching experience, or because the course syllabi fail to include readings based on the latest research on reading instruction. Value-added scores of graduates may be an important part of evaluating the quality of teacher preparation programs, but they do not capture all dimensions of program quality. They capture only information about characteristics of program graduates.
To elaborate, saying that a teacher preparation program is of high quality might point to any of four distinct characteristics of the program:
The program has facilities and staff of high quality, such as up-to-date instructional technology, highly qualified faculty, and connections to high-performing K-12 schools.
The content and skills covered in program courses and experiences are of high quality, for example, discussing research-based approaches to literacy instruction, having intensive coursework in content areas such as mathematics and science, and having lengthy field experiences with frequent supervision by experienced teachers.
Teacher preparation students who complete the program requirements make great gains in the knowledge and skills important for teaching. (That is, the program adds value to its teachers-in-training, the value added by the program, an aspect that parallels, but is distinct from, the value that a program graduate adds to his or her pupils’ learning.)
The graduates of the program have high levels of the knowledge and skills important for teaching, including the knowledge and skills that will result in high VAM scores when they begin teaching.
These four ways of thinking about program quality may often be found together, with some programs excelling in all dimensions, with top facilities and staff, with courses and experiences that cover the most appropriate content, with students making substantial improvements during the program, and with graduates leaving the program with exemplary knowledge and skill. However, the four ways of thinking about program quality are distinct, so that a program may be of high quality in one or more of these ways, but not in all four. For example, a program at a well-endowed, selective college might have high-quality faculty and facilities, and might produce graduates who have considerable knowledge and skill important for teaching. However, if the students entering that program already possessed high levels of knowledge and skills, the program might decide not to require study of some important content (e.g., the subject matter content in the K-12 curriculum), and the teacher preparation students might leave the program having changed little from their program entry level in the knowledge and skills needed for teaching.
Given the wide variation in the institutional missions of higher education programs (and other non-college-based teacher preparation programs), this disjuncture among definitions of teacher preparation program quality is a real feature of the teacher preparation landscape. Teacher preparation based at a community college, for example, might be thought of as high quality in the sense that it makes strong contributions to the knowledge and skills of its students, although it would not be of high quality as measured by its graduates’ knowledge in the content areas. (Levine’s (2006) report on teacher preparation, for example, finds that programs at colleges classified as Masters I institutions admit students with lower SAT scores than do programs at colleges classified as research universities. It may be, however, that the Masters I programs require more intensive study of content linked to the K-12 curriculum and that their students improve more in that area than do teacher preparation students at research universities.)
In a perfect world, one would want all teacher preparation programs to be of high quality in all four senses. However, given the need for hundreds of thousands of new teachers each year, it is unrealistic to expect all programs to be of high quality in every sense.
Different audiences probably differ in the definition of program quality that they find of most interest. For the prospective teacher preparation student, the definition of most interest may be the extent of learning that will take place. That is, those preparing to teach would likely be most interested in how much they can expect to learn in the program, rather than in the average skills of their graduating class or the qualifications of the faculty. State policy makers hoping to have highly skilled teachers in every classroom in the state may be most interested in knowing that program graduates have high levels of knowledge and skill for teaching. They may be most interested in whether all programs in their state are producing skilled graduates, whether by recruiting program entrants who already possess most of the knowledge and skill they need or by admitting a range of students and raising their levels of knowledge and skill. Legislators financing higher education, however, might be interested in whether the dollars spent on a program were adding substantially to the prospective teachers’ knowledge and skill and whether the program graduates all have at least an adequate amount of teaching knowledge and skill. Principals and superintendents may pay most attention to the attributes of job applicants, with relatively little interest in the quality of the programs from which the applicants graduated. That is, if an applicant’s recommendations and performance on the measures used in the hiring process give them a clear sense of whether the applicant will succeed in their schools, a district may not have much interest in which program recommended that applicant for certification.
This walk through the varying dimensions of program quality serves as a reminder that the average VAM score of a program’s graduates may be one important component of teacher preparation program quality, but it is not the sole measure of quality. Evaluations of program quality legitimately include attention to resources, curriculum, and contribution to prospective teacher learning. Even within the category of program outcomes, VAM scores are not the only outcome of interest. Other important outcomes include graduates’ moral character, their commitment to teaching (effort and length of service), and their ability to take leading professional roles.
Which Distribution of VAM Program Outcomes Is Best?
Reports that use VAM to describe differences among teacher preparation programs generally use sophisticated statistical models to estimate the mean VAM score for a program’s graduates employed by schools in the state. Comparisons among programs are made by comparing these means, using information about the standard errors of the estimated means to judge whether differences among program means are statistically significant (e.g., Gansle, Burns, & Noell, 2010, 2012; Goldhaber & Liddle, 2012; Plecki, 2012).
Comparing estimates of these means across graduates suggests that programs with higher estimated means are higher quality programs. Mean VAM for the graduates in the sample is, however, only one way to think about program quality. Considering other possible statistics is a way to recognize that care should be taken in interpreting comparisons based on mean VAM.
Rather than simply looking at the estimated mean VAM score for graduates, for example, one could also attend to the variability of VAM scores across graduates. State policy makers might think it important that all of a program’s graduates have VAM scores above some threshold. A program with a high mean VAM, but high variability among graduates, might be recommending for certification some graduates who would have low effects on student learning, achieving a high mean by also having some graduates with extremely high VAM scores.
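The contrast between a mean and a threshold criterion can be made concrete with a small sketch. The VAM scores, the program labels, and the threshold below are all invented for illustration; they are not drawn from any state's data, and the choice of statistics is only one of several that could be reported.

```python
from statistics import mean, pstdev

# Hypothetical VAM scores for two programs' graduates, in student-level
# standard deviation units. The numbers are invented for illustration.
program_a = [-0.30, -0.25, 0.00, 0.35, 0.40, 0.45]   # high mean, high spread
program_b = [0.02, 0.05, 0.08, 0.10, 0.12, 0.14]     # lower mean, low spread
THRESHOLD = -0.10  # assumed minimum acceptable VAM; not a value from the article


def summarize(scores, threshold):
    """Return the mean, the spread, and the share of graduates below threshold."""
    return {
        "mean": mean(scores),
        "sd": pstdev(scores),
        "share_below": sum(s < threshold for s in scores) / len(scores),
    }


summary_a = summarize(program_a, THRESHOLD)
summary_b = summarize(program_b, THRESHOLD)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 3) for key, value in summary.items()})
```

In this invented case, program A has the higher mean, yet one third of its graduates fall below the assumed threshold, whereas none of program B's graduates do. Which program looks better depends on which statistic the evaluator treats as the definition of quality.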
Another alternative to looking at the mean VAM score of graduates employed in the state would be to estimate the VAM scores (mean and variability) of all program graduates, using information available on the graduates who are not employed in the state to predict their likely VAM scores, if they had taken teaching positions in the state. This estimate would represent the teaching effectiveness of the full set of program graduates, rather than including only those who took teaching positions in the state. For a state where most program graduates begin teaching in the state immediately on graduation, this estimate might be little different from current estimates. But for a state, such as Michigan, where few recent graduates have secured teaching positions within the state, the estimate might be dramatically different. The two estimates would support different inferences about the quality of a program’s graduates. If a program’s most effective graduates were recruited to other states, or chose not to move immediately into teaching positions, the estimate of its quality based on mean VAM scores for those who did begin teaching in state would be lower than the estimate of its quality based on information about all graduates. The first is tied to the program’s contribution to the teaching force in the state; the second might be tied to the program’s contribution to the nation’s cadre of educators, some of whom might delay entry into teaching or take nonteaching positions. Both are defensible ways of thinking about program quality.
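A minimal sketch of the all-graduates estimate follows. It assumes, purely for illustration, that each graduate has an exit-assessment score that predicts VAM, that the relationship is linear, and that the relationship estimated for in-state teachers also holds for those who left. None of these assumptions comes from the studies cited here, and the data are invented.

```python
from statistics import mean

# Hypothetical records for one program. Each in-state graduate has an
# exit-assessment score (an assumed predictor) and an observed VAM score.
in_state = [(50, -0.12), (55, -0.03), (60, 0.00), (65, 0.03), (70, 0.12)]
# Graduates teaching elsewhere: only the exit-assessment score is known.
out_of_state_assessments = [75, 80, 85]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return y_bar - slope * x_bar, slope


intercept, slope = fit_line(in_state)
# Predicted VAM for graduates who are not observed teaching in the state.
predicted = [intercept + slope * x for x in out_of_state_assessments]

in_state_mean = mean([vam for _, vam in in_state])
all_graduates_mean = mean([vam for _, vam in in_state] + predicted)
print(round(in_state_mean, 3), round(all_graduates_mean, 3))
```

Because the invented out-of-state graduates have the highest assessment scores, the all-graduates mean exceeds the in-state mean. The predictions extrapolate beyond the range observed in state, so an estimate of this kind is only as credible as the assumed prediction model.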
Labor Market Effects on a Program’s VAM Scores
The average estimated VAM scores associated with a teacher preparation program are generally computed in a way that attempts to take account of the variation in demographic and academic characteristics of the pupils in a teacher’s class. The statistical adjustments are made so that the teacher’s VAM score can be interpreted as the amount their pupils learned beyond what would be expected for such a group of students.
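The logic of the adjustment can be sketched as follows. The prediction rule and its coefficients are hypothetical, as are the pupil records; operational VAM models are considerably more elaborate, so this is an illustration of the idea rather than of any state's method.

```python
from statistics import mean


def expected_score(prior_score, low_income):
    """Hypothetical prediction of a pupil's end-of-year score.

    The coefficients are invented; real models estimate them from state data
    and typically include many more pupil characteristics.
    """
    return 10 + 9 * prior_score / 10 - 3 * low_income


def teacher_vam(pupils):
    """Mean difference between actual and expected scores for one class."""
    return mean(
        [
            actual - expected_score(prior, low_income)
            for prior, low_income, actual in pupils
        ]
    )


# (prior score, low-income indicator, actual end-of-year score)
classes_of_program_graduates = {
    "teacher_1": [(50, 1, 55), (60, 0, 66), (40, 1, 45)],
    "teacher_2": [(70, 0, 72), (80, 0, 81), (60, 1, 60)],
}

vam_by_teacher = {
    teacher: teacher_vam(pupils)
    for teacher, pupils in classes_of_program_graduates.items()
}
# The program-level score is the average across its graduates' VAM scores.
program_mean_vam = mean(list(vam_by_teacher.values()))
print(vam_by_teacher, round(program_mean_vam, 3))
```

Here the first teacher's pupils score above expectation and the second teacher's pupils score slightly below it, so the program's mean VAM averages a positive and a negative teacher score.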
One estimation approach is to compare teachers with other teachers in the same school. The logic of this approach is that it controls not only for characteristics of the pupils in the teacher’s classroom but also for the other characteristics of the school that affect student learning, such as the school climate, the physical condition of the school, the effectiveness of the school principal, and the support the school receives from parents and community members. Such school factors do make a difference in pupil learning; thus, a teacher whose pupils excel in a school with challenging conditions should be assigned a higher VAM score than a teacher whose pupils do equally well in a school with strong learning supports. The teacher in the challenging school has had a larger effect, although the students learned equal amounts in the two schools. An analysis of Florida data compared VAM estimates based on these in-school comparisons of teachers with those that did not use this approach to account for differences in the school context. The two estimation methods produced different rankings of teacher preparation programs, showing that using these within-school comparisons to account for school context is consequential for judgments about preparation program quality (Mihaly, McCaffrey, Sass, & Lockwood, 2012).
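A toy example shows how a within-school comparison can reverse a ranking. The schools, programs, and gains below are invented, and centering each teacher on the school average is a simplified stand-in for the school fixed-effects models used in studies such as the Florida analysis; it is not a reproduction of that analysis.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical teacher records: (school, preparation program, mean pupil gain).
# School "S1" is assumed to have strong supports, "S2" challenging conditions.
teachers = [
    ("S1", "X", 0.30), ("S1", "X", 0.28), ("S1", "Y", 0.35),
    ("S2", "X", -0.12), ("S2", "Y", -0.05), ("S2", "Y", -0.03),
]


def program_means(records):
    """Average the third field of each record by program."""
    by_program = defaultdict(list)
    for _, program, value in records:
        by_program[program].append(value)
    return {program: mean(values) for program, values in by_program.items()}


# Unadjusted comparison: school context is ignored.
raw = program_means(teachers)

# Within-school comparison: express each teacher relative to the school average.
by_school = defaultdict(list)
for school, _, gain in teachers:
    by_school[school].append(gain)
school_mean = {school: mean(gains) for school, gains in by_school.items()}
centered = [(s, p, gain - school_mean[s]) for s, p, gain in teachers]
within = program_means(centered)

print({p: round(v, 3) for p, v in raw.items()})
print({p: round(v, 3) for p, v in within.items()})
```

In the unadjusted comparison, program X ranks higher because more of its graduates teach in the supportive school. Once each teacher is compared with colleagues in the same school, program Y ranks higher.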
The teacher labor market, however, makes it difficult to do such a within-school analysis. The within-school comparisons require finding sufficient numbers of schools with recent graduates from the various teacher preparation programs to be compared. Teacher education graduates, however, often work in schools close to where they completed their preparation (Mihaly et al., 2012), so that schools are mostly staffed by nearby programs, rather than by the range of programs within a state. The Florida study, moreover, found that the schools with the largest degree of preparation program overlap were not typical of schools in the state. Therefore, basing program comparisons on schools with adequate data would produce results that could not be considered typical of results for the state. Thus, the labor market, that is, the decisions by new graduates and schools about which teachers teach in which schools, makes it difficult to appropriately take account of the effects of school context on teacher value-added scores.
The labor market undercuts inferences about teacher preparation program effects in a second way. When schools make hiring decisions, they have a pool of applicants from which to choose, perhaps including graduates from a variety of preparation programs. To the extent that schools are able to predict a teacher’s effectiveness from the candidate’s record, interviews with school staff, and perhaps a brief demonstration of teaching, they might use this information as a major criterion for deciding which applicant to accept. However, applicants often apply to several schools and use what they know about the schools to decide which offers they would be most likely to accept. Teachers who appear most likely to be effective will likely have the greatest number of offers. The result is that a school will select teachers from a relatively narrow range of applicants because it finds the weakest applicants unacceptable and is unable to hire the strongest applicants, who will accept better offers. If teachers’ likely VAM scores play a major part in these decisions, then the teachers at any particular school will be similar to one another in their VAM scores. Within-school comparisons of VAM scores for teachers from varying teacher preparation programs will then find little difference across programs because the hiring process resulted in low variability among teachers in each school. This scenario could occur even if preparation programs did differ substantially from one another in the mean effectiveness of their graduates. The within-school similarity would result, for example, from a school hiring one of the most effective graduates from a program with lower average VAM, matched with one of the less effective graduates from a program with higher average VAM.
Faculty conducting such a between-program comparison as part of Michigan State University’s Teachers for a New Era program speculated that school hiring decisions explained the absence of between-program effects they found (Floden, 2006). A more recent VAM study of teacher preparation (Goldhaber & Liddle, 2012) also suggests that the labor market leads to a compression in estimated differences between teacher preparation programs.
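The compression scenario described above can be illustrated with a deliberately extreme sketch. It assumes that hiring sorts graduates perfectly by effectiveness, so that each school hires a narrow band of candidates. The programs, effectiveness values, and school size are hypothetical, and real hiring is far noisier than this.

```python
SCHOOL_SIZE = 4  # assumed number of new hires per school

# Hypothetical "true" effectiveness in hundredths of a student-level standard
# deviation. Program H graduates are, on average, 20 points more effective.
graduates = [(score, "H") for score in range(0, 60, 5)]
graduates += [(score, "L") for score in range(-20, 40, 5)]


def average(values):
    return sum(values) / len(values)


def program_gap(records):
    """Mean effectiveness of program H minus program L."""
    h_scores = [score for score, program in records if program == "H"]
    l_scores = [score for score, program in records if program == "L"]
    return average(h_scores) - average(l_scores)


# Assumed hiring process: perfectly sorted, so each school hires a narrow band.
ranked = sorted(graduates)
schools = [ranked[i:i + SCHOOL_SIZE] for i in range(0, len(ranked), SCHOOL_SIZE)]

# Within-school gaps can be computed only where both programs are represented.
mixed = [s for s in schools if {program for _, program in s} == {"H", "L"}]
within_school_gap = average([program_gap(s) for s in mixed])
true_gap = program_gap(graduates)

print(true_gap, within_school_gap, len(mixed), len(schools))
```

Under these assumptions, the programs differ by 20 points across all graduates, yet the average within-school difference is zero, because each mixed school pairs stronger graduates of program L with weaker graduates of program H. Two of the six schools contain graduates of only one program and contribute nothing to the within-school comparison.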
Conclusion
Are the best teacher preparation programs those whose teachers add the most to their pupils’ learning? At first glance, it seems that the answer must be “yes.” What else could it be? On reflection, however, it is apparent from points raised in the debates about teacher preparation that, although teacher value added is important, other characteristics are also important components of program quality. The relative importance of each component may vary by audience, with prospective teachers, for example, interested in how much they will learn, but policy makers more interested in how many effective teachers are being certified. Therefore, the mean graduate VAM score is one among several dimensions of program quality.
In looking at value added, are the best programs those with the highest average VAM scores, computed using new state data systems? Again, this seems the obvious first cut, but interpretation of the estimates from these data systems must be done with caution. In addition to mean scores, variability should be considered. Moreover, the estimates that come from data systems are influenced by the processes that affect which teachers are included in the estimates and where they teach.
One reason that cautious interpretation is important is that, as VAM estimates begin to enter into accountability systems, they will affect preparation programs, just as accountability systems have affected K-12 schools (National Research Council, 2011). The way in which school context is taken into account, for example, may affect how much preparation programs attempt to encourage graduates to work in hard-to-staff schools. Some educators (e.g., Henry, 2012) advocate selecting an evaluation process that “neither benefits nor adversely affects the TPP [teacher preparation program] due to forces beyond the control of the programs, such as the choice of the type of schools in which their graduates choose to teach” (p. 346). However, others prefer to reward programs for steering graduates to high poverty schools (U.S. Department of Education, 2011).
Many states now possess the data and statistical methods that can produce teacher value-added scores and link them to preparation programs. These scores highlight the important responsibility that programs have to help prospective teachers learn to be effective instructors. Used thoughtfully, they can inform improvements in teacher preparation. Used simplistically, they may have unfortunate, unintended consequences.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
Work on this paper was made possible in part by a Teachers for a New Era grant from the Carnegie Corporation of New York, the Ford Foundation, and the Annenberg Foundation. The statements made and views expressed are solely the responsibility of the author.
