The Rhetoric of Teacher Evaluation: New York City Teachers’ Responses to Performance Labels

Abstract

This paper uses the case of New York City teachers’ interpretations of the labels that are assigned to their performance to explore how teachers experience teacher evaluation systems. Based on our analyses of 141 interviews with New York City teachers, we argue that the ordinal performance labels assigned to teachers by New York City’s Advance teacher development and evaluation system (Highly Effective, Effective, Developing, and Ineffective) have meanings that extend beyond how they locate teachers in relation to one another, or in relation to an absolute standard of teaching performance. The labels can evoke powerful emotions in teachers, and may stymie policymakers’ hopes that teachers will be motivated by the labels to redouble their efforts to teach well. Teachers’ reactions to the labels can also reveal their strategies to resist the redefinition of teachers’ work.

Keywords

high-stakes accountability, professionalism, teacher evaluation

What’s in a name? Perhaps more than you think.

This paper uses a particular case, New York City teachers’ interpretations of the labels that are assigned to their performance, to explore how teachers experience evaluation systems. We argue that the ordinal performance labels assigned to teachers by New York City’s Advance teacher development and evaluation system (Highly Effective, Effective, Developing, and Ineffective) have meanings that extend beyond how they locate teachers in relation to one another, or in relation to an absolute standard of teaching performance. The labels can evoke powerful emotions in teachers, and may stymie policymakers’ hopes that teachers will be motivated by the labels to redouble their efforts to teach well. Teachers’ reactions to the labels can also reveal their strategies to resist the redefinition of teachers’ work.

Many scholars have viewed K-12 classroom teaching as a semi-profession, a term popularized by Etzioni (1969) to reflect partial enactment of the classic attributes of a profession, and extended by others to note that teaching, along with other gendered occupations such as nursing and social work, had claims to professional status suppressed by bureaucracies and governments dominated by men. Freidson (1970) and others, analyzing the high-status occupations ascribed the status of a profession, identified four such attributes: (a) an abstract knowledge base to guide practice, acquired via specialized education and training; (b) autonomy to exercise discretion in the workplace in the course of practice; (c) the legitimate authority to determine who can enter the profession and to act in the interest of clients; and (d) an ethic of altruism that recognizes a moral commitment to the interests of clients, and specifies codes of conduct. Each of these attributes reflects efforts by practitioners to control the boundaries of the profession and the scope of its work.

Even in the “golden age” of the study of professions in the 1950’s and 1960’s (Gorman & Sandefur, 2011), classroom teaching fell short, most notably on the presence of an abstract knowledge base to guide practice. Building such a knowledge base was a major agenda in the 1980’s and 1990’s for schools of education and other advocates in the US (Labaree, 1992), one not yet realized, as there remains considerable variability across teacher preparation programs in the content of what they teach. Other limits on teachers’ ability to enact their autonomy and authority were evident in works ranging from Waller (2014) to Lortie (1975).

Over time, a conception of professionalism lodged in an occupation was supplanted by what Evetts (2009) has called “organizational professionalism.” Increasingly, the work practices of the classic professions, as well as other occupations that make claims to professional status, have been regulated by public and private work organizations, with a particular bent toward evaluation and accountability. In the U.S., amidst the expansion of Great Society social programs in the 1960’s, evaluation became an important policy tool to determine if particular programs were meeting their goals, and to hold those programs accountable for their performance.

The two main forms of evaluation, formative and summative, each involved gathering data to inform public decision-making about whether a program was worthy of continued support, expansion, or, perhaps, suspension and elimination. Each federal social and educational program, and its state and local counterparts, was unique, and thus one could not call the evaluation of Great Society programs an accountability system. Rather, formative and summative evaluation were mechanisms for programs, and the workers who staffed them, to give an account of their performance, with the government the primary audience for these accounts.

Conversely, in the United Kingdom, concern for the efficiency and responsiveness of public sector services and organizations gave rise to the 1980’s-era phenomenon of New Public Management (NPM). NPM emphasized organizational efficiency, indicated by an organization’s outputs, its costs, and the relationship between the two (Hood, 1991; see also Flack, 2020). Central to NPM was setting measurable performance targets (including customer satisfaction), and monitoring organizational progress toward those targets. If public-sector organizations proved unable to meet their targets, government managers could contract with the private sector. The competition between the public and private sectors was believed to yield improvements in service quality and cost reduction (Evetts, 2009).

As the era of Great Society government programs in the U.S. wound down, both the U.S. and the U.K. could be characterized by a policy model known as neo-liberalism, which promotes free-market competition, and seeks to transfer power from the government and public sector to the private sector. In the realm of K-12 education in the U.S., this involved a proliferation of new institutional forms, such as charter schools and alternative routes to teacher certification, culminating in school and teacher accountability systems (Mittleman & Jennings, 2018).

New Public Management principles, reflected in these accountability systems, are illustrations of what has been termed performativity, a flexible term that links language, action, and identity. Ball (2003, p. 216) defines performativity as “a technology, a culture and a mode of regulation that employs judgements, comparisons and displays as means of incentive, control, attrition and change—based on rewards and sanctions (both material and symbolic). The performances (of individual subjects or organizations) serve as measures of productivity or output, or displays of ‘quality’, or ‘moments’ of promotion or inspection. As such they stand for, encapsulate or represent the worth, quality or value of an individual or organization within a field of judgment.”

Ball (2003) develops the notion of performativity in relation to teachers’ work, arguing that contemporary classroom teaching is regulated via rewards and sanctions that define competence. Inevitably, he argues, teachers must confront their understanding of themselves as teachers, as accountability systems prompt them to redefine the self in relation to the performance labels that the technology assigns to them. Identity work is intertwined with emotion (Reio, 2005), and the process can provoke acute anxiety, as performativity shifts teachers’ motivation from an internal sense of what is right and worthwhile to the measures of performance that the system values. Reconstructing the self in the face of discrepant understandings does require emotional labor, and labels that are inconsistent with an existing sense of self can be upsetting and angering (Stets, 2005).

The assignment of performance labels to schools and teachers is not new, and the format of these labels does influence public opinion about school quality (Jacobsen et al., 2014). But the context of the application of performance labels has changed substantially over the past two decades, as the broader trends described above have reshaped the relationship between schools and their external environments. Neoinstitutional theories of education portrayed schools, and the educators working within them, as buffered from the demands and expectations of their external environments; teachers could simply shut their doors and teach, without concern for the direct inspection of either their teaching or their students’ learning (Meyer & Rowan, 1977). There was a presumption that teachers were competent, and many school districts relied on a binary Satisfactory/Unsatisfactory rating system that was overwhelmingly weighted toward ratings of Satisfactory, frequently with 99% of teachers receiving a satisfactory rating (Weisberg et al., 2009). Some years later, Kraft and Gilmour (2017) found that this pattern of overwhelmingly positive ratings persisted even after state-level reforms of teacher evaluation systems.

The landscape began to change toward the end of the first decade of the 21st century, with the advent of new tools for assessing individual and organizational performance, some of which purported to isolate the impact of particular teachers and schools on their students’ academic performance, and others of which held individual and organizational attributes up to a yardstick defining what was good and what was bad. Two major dimensions began to emerge: ratings of student learning, and ratings of teachers’ professional practices (Kane et al., 2014).

There is great policy interest in whether the implementation of school and teacher accountability systems can lead to higher levels of student performance, and perhaps reduce the strong associations between students’ social backgrounds and their academic success. The evidence on these effects is mixed, with some studies finding that accountability systems boost student performance, and others concluding that they have limited, and sometimes unintended, effects (Pallas, 2020; Mittleman & Jennings, 2018).

In this paper, we do not tackle the effects of teacher accountability systems. Rather, our focus is on the meanings that teachers assign to performance labels, and whether they incorporate them into their professional identities, or resist doing so. Does the specific language used in the performance labels shape these meanings, and their potential consequences for teachers’ thoughts and actions? Why or why not?

We chose this emphasis to build additional evidence on the mechanisms by which teacher accountability systems might exert effects on teachers and their performance. The link between accountability and performance has become an institutionalized myth (Dubnick, 2005), with the prevailing belief that incentives will inevitably result in the desired performance changes. Policy analysts devote a great deal of effort to “getting the incentives right” in accountability systems, and misaligned incentives are a key explanation for why accountability systems often do not work the way they are intended (Figlio & Loeb, 2011; Mittleman & Jennings, 2018; Neal, 2018). Incentives are, though, typically thought of as material carrots and sticks; rarely are performance labels themselves treated as incentives.¹ In this paper, we expand our understanding of incentives, and possible sources of resistance to them, by studying teachers’ reactions to performance labels.

The Institutional Context of Teacher Evaluation Systems

In 2010, as the state of New York was in the hunt for funds from the $4.4 billion Race to the Top program introduced by the Obama administration, the state passed education law 3012-c requiring annual performance evaluations of classroom teachers and building principals. Following negotiations between the New York State Education Department and the New York State United Teachers, the law was revised in 2012 to clarify the use of locally-selected achievement measures (described below), and the roughly 700 school districts in the state were required to submit a plan for the annual professional performance review (APPR) of teachers and principals. Prior to the passage of this law, teachers in New York City and elsewhere in New York State were rated either Satisfactory (S) or Unsatisfactory (U), often based on perfunctory classroom observations (Toch & Rothman, 2008). In New York City, as in many other school districts around the country, nearly 99% of teachers were rated Satisfactory (Weisberg et al., 2009).

New York’s law 3012-c required that teachers be rated in three domains: growth on state assessments or a comparable measure (0–20 points), locally-selected measures of student achievement (0–20 points), and a district’s choice of observational rubrics for teachers’ practices (0–60 points). In each of these three domains, a teacher could be rated Highly Effective, Effective, Developing, or Ineffective, with different scoring bands for these ratings. The overall composite score summed across the three yielded an overall rating of Highly Effective (91–100), Effective (75–90), Developing (65–74), or Ineffective (0–64). Ostensibly to assist teachers whose performance was judged “below the bar,” the law provided that a teacher who received an overall rating of Developing or Ineffective would be subject to a Teacher Improvement Plan, in which an evaluator would meet with the teacher periodically to set goals and monitor progress toward meeting them. It also stipulated that a teacher who received an overall rating of Ineffective 2 years in a row could be subject to an expedited dismissal process, regardless of the teacher’s tenure status. Teachers would receive their overall rating for the preceding school year by September 1 of the following year.
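The composite scoring rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not the state's actual scoring procedure: the function and variable names are hypothetical, and the sketch assumes that domain scores are whole numbers of points, which the text does not state.

```python
from typing import Tuple

# Overall rating bands as stated in the text (inclusive composite points).
OVERALL_BANDS = [
    (91, 100, "Highly Effective"),
    (75, 90, "Effective"),
    (65, 74, "Developing"),
    (0, 64, "Ineffective"),
]

# Maximum points available in each of the three domains, as stated in the text.
# The dictionary keys are hypothetical names chosen for this sketch.
MAX_POINTS = {"state_growth": 20, "local_measures": 20, "observation": 60}


def overall_rating(state_growth: int, local_measures: int, observation: int) -> Tuple[int, str]:
    """Sum the three domain scores and look up the overall label.

    ASSUMPTION: scores are integers; the text does not say how
    fractional points, if any, were handled.
    """
    scores = {
        "state_growth": state_growth,
        "local_measures": local_measures,
        "observation": observation,
    }
    for name, value in scores.items():
        if not 0 <= value <= MAX_POINTS[name]:
            raise ValueError(f"{name} must be between 0 and {MAX_POINTS[name]}")
    composite = sum(scores.values())
    for low, high, label in OVERALL_BANDS:
        if low <= composite <= high:
            return composite, label
    raise AssertionError("bands cover 0-100, so this is unreachable")


if __name__ == "__main__":
    print(overall_rating(15, 16, 50))
```

The sketch makes one feature of the bands visible: a single composite point (64 versus 65, or 74 versus 75) separates adjacent labels.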

Some elements of the APPR system were subject to local bargaining between school districts and local teachers unions, whereas others were set by the state. For teachers of English Language Arts and mathematics in grades 4 through 8, the state growth measures were derived from a statistical “value-added” model that sought to isolate a teacher’s contribution to her students’ spring scores on annual state assessments in these subjects, net of prior year achievement and student, school and classroom characteristics. This was controversial in New York City (Corcoran, 2010), as from 2007 to 2010, under the administration of Schools Chancellor Joel Klein, the district compiled what it referred to as Teacher Data Reports, an in-house version of value-added models for the state tests that compared approximately 18,000 teachers to one another across the district. In 2012, after a legal skirmish, the district released these reports, and local media published them, attaching teachers’ names to the rankings. Although district officials claimed that the reports were to be a management tool to improve teacher practice, and not to be used in tenure decisions, many teachers believed that the reports were intended for that purpose. The publication was humiliating for many (Pallas, 2012).

Halfway through the 2015 to 2016 school year, following a recommendation from the governor’s Common Core Task Force and evidence that throughout the state, one in six parents were opting their children out of these state assessments, the state Board of Regents placed a moratorium on the use of state-generated growth scores from the state English Language Arts and mathematics assessments in the calculation of the state growth measures. (Many teachers in our study had received state growth ratings based on the value-added model in the preceding school year.)

New York City’s evaluation system, termed Advance, described measures of teacher practice (MOTP) and measures of student learning (MOSL), with the former counting for 60% of the overall evaluation score, and the latter reflecting 20% for student achievement measures set by the state and 20% for locally-negotiated student achievement measures. New York City chose elements from Charlotte Danielson’s (2013) A Framework for Teaching as the rubric to guide the classroom observations that were the basis for MOTP, typically conducted by the school principal or assistant principal. The Danielson Framework has 22 components, spread across four domains: (1) Planning and Preparation, (2) The Classroom Environment, (3) Instruction, and (4) Professional Responsibilities. In 2015 to 2016, the year of this study, the United Federation of Teachers (UFT), New York City’s teachers union, and the New York City Department of Education negotiated that eight components would be rated, with those in domains 2 and 3 weighted more heavily (85%) than those in domains 1 and 4 (15%):

1a. Demonstrating knowledge of content and pedagogy

1e. Designing coherent instruction

2a. Creating an environment of respect and rapport

2d. Managing student behavior

3b. Using questioning and discussion techniques

3c. Engaging students in learning

3d. Using assessment in instruction

4e. Growing and developing professionally

Teachers could choose one of four observation options, conditional on their prior year overall rating. Option 1 consisted of at least one formal observation of a full class period, with pre- and post-observation conferences, and at least three informal observations of 15 minutes or longer, typically unannounced. Option 2 involved at least six informal observations. Teachers with an overall Highly Effective rating the preceding year could elect Option 3, involving at least three informal observations and at least three classroom visits by colleagues. Those with an overall rating of Effective in the preceding year could choose Option 4, which called for at least four informal observations.

For each observation, the rater would score each of the eight components observed during the observation on a scale from 1 to 4. These scores were averaged to generate an overall MOTP rating at the end of the year, with an average score of 1.00 to 1.74 converting to a rating of Ineffective, 1.75 to 2.50 to Developing, 2.51 to 3.50 to Effective, and 3.51 to 4.00 to Highly Effective. Teachers thus equated the numerical component-level ratings in an observation to the equivalent label of Highly Effective to Ineffective.
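The conversion from an average component score to an MOTP label can be illustrated with a short Python sketch. The cut points come from the text; everything else is an assumption made for illustration. In particular, the sketch assumes that the 85% and 15% domain weights are split evenly among the components in each group and that the average is rounded to two decimal places before the label is looked up. Neither detail is specified above, and the function names are hypothetical.

```python
from typing import Dict

# Cut points as stated in the text: (lower bound, upper bound, label).
MOTP_BANDS = [
    (1.00, 1.74, "Ineffective"),
    (1.75, 2.50, "Developing"),
    (2.51, 3.50, "Effective"),
    (3.51, 4.00, "Highly Effective"),
]

# The eight rated components, grouped by the weighting described in the text.
HEAVY = ("2a", "2d", "3b", "3c", "3d")  # domains 2 and 3, 85% in total
LIGHT = ("1a", "1e", "4e")              # domains 1 and 4, 15% in total


def weighted_motp(component_means: Dict[str, float]) -> float:
    """Combine year-end component means into one weighted average.

    ASSUMPTION: each group's weight is shared equally among its components.
    """
    heavy = sum(component_means[c] for c in HEAVY) / len(HEAVY)
    light = sum(component_means[c] for c in LIGHT) / len(LIGHT)
    return round(0.85 * heavy + 0.15 * light, 2)


def motp_label(score: float) -> str:
    """Map an average score on the 1-4 scale to its performance label.

    ASSUMPTION: scores are rounded to two decimals, matching the
    precision of the published cut points.
    """
    score = round(score, 2)
    for low, high, label in MOTP_BANDS:
        if low <= score <= high:
            return label
    raise ValueError("score must lie between 1.00 and 4.00")


if __name__ == "__main__":
    means = {c: 3.0 for c in HEAVY + LIGHT}
    print(weighted_motp(means), motp_label(weighted_motp(means)))
```

Under these assumptions, a teacher scored 4 on every domain 2 and 3 component and 1 on every domain 1 and 4 component would still average 3.55, which illustrates how heavily the classroom-facing components dominate the label.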

Thus, New York City teachers received the performance labels in three different ways. Their performance on each of the eight Danielson components in a particular observation was rated Highly Effective to Ineffective, using a rubric that specified the criteria for ratings of 1, 2, 3, or 4 (corresponding roughly to Ineffective, Developing, Effective, and Highly Effective). They also received end-of-year summary labels in the two domains of MOTP and MOSL that ranged from Highly Effective to Ineffective. Finally, the summary performance on MOTP and MOSL was combined to yield an overall rating for the year of Highly Effective to Ineffective.

Method

During the 2015 to 2016 school year, the author and two other researchers, a college faculty member and a doctoral student with experience as a teacher educator, recruited a team of 11 research assistants, all current or former K-12 classroom teachers, and the research team interviewed 141 teachers and 24 principals working in 27 traditional elementary, middle and high schools in New York City.² Our sampling design intentionally sampled schools from prespecified strata, and then relied on quota sampling of teachers within schools. We sought to sample schools and teachers reflecting variability in grade level (K-5, 6–8, 9–12, and District 75, which is New York City’s designation for special education schools), average student performance level (high vs. low), prior aggregate teacher evaluations (high vs. low rates of prior school percentages of Developing and Ineffective ratings), student poverty rate (high vs. low), and size (above or below the median for schools with a particular grade configuration). The student performance strata for K-5 and 6–8 schools were defined by being in the top vs. bottom tercile on the percentage of students scoring at Levels 3 and 4 on the annual New York state English and math assessments administered in grades 3 through 8. The student performance strata for high schools were based on being in the top vs. bottom tercile on the school’s most recent 4-year high school graduation rate for entering ninth-graders. We used a threshold of 65% of students eligible for free or reduced-price lunch (FRPL) to define high vs. low poverty schools.

Following Stecher et al.’s (2010) coinage of “accountability with teeth,” we used the term “bite” as a descriptor of school settings in which there was heightened risk for low teacher evaluations that might have consequences for individual teachers (and hence the evaluations had more “bite”). We defined “bite” via the 2013 to 2014 aggregate distribution of teacher evaluation ratings for each school, as reported on the New York State Education Department website. A “high-bite” school is one in which 30% or more of the school’s teachers were rated Developing or Ineffective in the 2013 to 2014 school year. Conversely, a “low-bite” school had 3% or fewer of its teachers rated Developing or Ineffective in that year.
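The two “bite” thresholds can be expressed as a simple classification rule. The Python sketch below is illustrative only; the function name is hypothetical, and returning None for schools that fall between the two thresholds is an assumption, since the definition above names only the high-bite and low-bite categories.

```python
from typing import Optional


def bite_category(pct_developing_or_ineffective: float) -> Optional[str]:
    """Classify a school by its 2013 to 2014 share of low ratings.

    The argument is the percentage (0-100) of a school's teachers
    rated Developing or Ineffective.
    """
    if not 0 <= pct_developing_or_ineffective <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    if pct_developing_or_ineffective >= 30:
        return "high-bite"
    if pct_developing_or_ineffective <= 3:
        return "low-bite"
    # ASSUMPTION: schools between 3% and 30% belong to neither category.
    return None


if __name__ == "__main__":
    for pct in (2.0, 12.5, 41.0):
        print(pct, bite_category(pct))
```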

Since we were sampling teachers within schools, we limited our sample to schools with at least 20 teachers, according to New York City Department of Education public reports, and sought to sample 20% of the teachers in a sampled school, with a target minimum of 8 teachers and a maximum of 16 teachers to be interviewed in each school (and a $50 gift incentive for schools that reached their target N’s). Our eventual sample consisted of 47 teachers across 10 low-performance/high-bite schools; 45 teachers across 6 low-performance/low-bite schools; 41 teachers across 6 high-performance/low-bite schools; and 8 teachers across 2 District 75 schools, representing 141 teacher interviews in 27 New York City public schools. (There were no schools in the high-performance/high-bite quadrant).
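The within-school interview target described above amounts to 20% of a school's teachers, bounded below by 8 and above by 16. A minimal Python sketch follows. The function name is hypothetical, and rounding fractional targets up to the next whole teacher is an assumption, as the text does not state a rounding rule.

```python
from typing import Optional


def interview_target(n_teachers: int) -> Optional[int]:
    """Return the target number of teacher interviews for one school.

    Schools with fewer than 20 teachers were outside the sampling
    frame, so the function returns None for them.
    """
    if n_teachers < 20:
        return None
    # 20% of the teaching staff, computed in integer arithmetic.
    # ASSUMPTION: fractional targets are rounded up.
    twenty_percent = (n_teachers + 4) // 5
    # Apply the stated floor of 8 and ceiling of 16 interviews.
    return max(8, min(16, twenty_percent))


if __name__ == "__main__":
    for n in (19, 20, 50, 61, 100):
        print(n, interview_target(n))
```

Under this rule the 20% figure only binds for schools with roughly 40 to 80 teachers; smaller eligible schools default to the floor of 8 and larger ones to the ceiling of 16.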

Semi-structured interviews with participants lasted an average of 47 minutes each, and with few exceptions, were one-on-one, audio-recorded and subsequently transcribed.

The interviews addressed a wide range of topics regarding the teachers’ backgrounds, their perceptions of their school and its administrators, and of course the teacher evaluation system itself. For this paper, we focus on teachers’ comments about the MOTP, MOSL, and summary categories of Highly Effective, Effective, Developing, and Ineffective, particularly their understandings of what these labels meant, and the feelings the labels evoked in them. The interview questions that most often stimulated these comments were:

For the Measures of Teacher Practice, you could be rated Highly Effective, Effective, Developing, or Ineffective. What’s the lowest rating that would be acceptable to you, and why?

For the Measures of Student Learning, you could be rated Highly Effective, Effective, Developing, or Ineffective. What’s the lowest rating that would be acceptable to you, and why?

What’s your understanding of the consequences of getting a particular rating in the current Advance system? Are there rewards, or sanctions, associated with some of these?

The state came up with the categories of Highly Effective, Effective, Developing, and Ineffective to rate teacher performance. If it were up to you, how many categories would you use to rate teachers, and what labels would you use?

How did you feel about yourself as a teacher after receiving your overall composite rating in September?

Relying on the qualitative analysis software tool Dedoose (SocioCultural Research Consultants, 2018), we developed a thematic analysis system for coding and analyzing the interviews that began with a set of categories reflective of the research questions and prior research on performance evaluation in general, and teacher evaluation in particular. Our approach generally followed the protocols described in Deterding and Waters (2021). The two lead researchers and project manager initiated a coding scheme which, when revised in consultation with our coders, yielded 37 top-level “parent” codes (e.g., the extent to which a teacher feels a sense of control over some feature of her evaluation; when a teacher reports that the presence of an observer changes what happens in the classroom; the extent to which a component of the Advance evaluation system is perceived as fair or unfair to the respondent or other teachers; or an emotion, feeling or affective response associated with an evaluation rating or feature of the evaluation process). Each parent code had a description that set bounds on its application, and in most cases, the codebook also included a sample excerpt for which the code would be appropriate.

Nested under these top-level codes were 148 more detailed “child” codes representing specific cases or variations of the parent codes. Not all parent codes had child codes, but most did, with the greatest incidence of child codes for the parent codes that were either broad in scope or of particular interest for the project. For example, the Measures of Teacher Practice were based on classroom observations in which an observer, typically the school principal, would apply a rubric for elements of Charlotte Danielson’s (2013) A Framework for Teaching, a well-known tool for describing and assessing teachers’ practices. The parent code “Danielson Framework” picked up reference to the Danielson Framework or some element of it. The dozen child codes included:

(a) teacher indicates that s/he is unfamiliar with the Danielson Framework;

(b) teacher is confused about what the Danielson Framework is, or how it is used in the Advance system;

(d) teacher describes the elements of the Framework as desirable teaching practices, or consistent with the practices the school values;

(e) teacher refers to the number of Danielson elements on which s/he is rated;

(f) teacher describes the Danielson Framework as missing important things;

(g) teacher describes the Danielson Framework as being poorly implemented (rather than being opposed to its content);

(h) teacher indicates that observers are under pressure to “prove” that teachers have met the indicators in the Danielson Framework;

(i) teacher states that some Danielson elements are hard to observe, whereas others are easier to see;

(j) teacher states that the Danielson Framework is inappropriate under some conditions (as distinct from inappropriate or incorrect usage);

(k) teacher states that an observer can say whatever s/he wants, on a Danielson observation, due to the subjectivity of the framework; and

(l) teacher describes the Danielson Framework as objective, or a clear standard against which different teachers can be rated.

In addition, we developed 104 “baby” codes nested under the child codes, again representing specific cases or variations of the higher-level phenomenon being coded. For example, the parent code for when a teacher discusses the teaching practices and capacities that they feel are missed by the Advance evaluation system had four child codes: out-of-classroom roles and responsibilities; children’s overall development (e.g., social-emotional development); teacher thinking, planning, or preparation; and teacher effort. The out-of-classroom roles and responsibilities child code had three baby codes: sponsoring extracurricular activities, teacher leadership or mentorship, and school committee service.
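The three-level nesting of parent, child, and baby codes can be pictured as a small tree. The Python sketch below is a hypothetical representation built for illustration, not an export from Dedoose. The class and function names are assumptions, and the parent code's label is a paraphrase, since the text describes that code without naming it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Code:
    """One node in the codebook; its children are the next level down."""
    name: str
    children: List["Code"] = field(default_factory=list)


def count_by_depth(codes: List[Code], depth: int = 0,
                   counts: Optional[Dict[int, int]] = None) -> Dict[int, int]:
    """Count codes at each level (0 = parent, 1 = child, 2 = baby)."""
    if counts is None:
        counts = {}
    for code in codes:
        counts[depth] = counts.get(depth, 0) + 1
        count_by_depth(code.children, depth + 1, counts)
    return counts


# The example from the text. The parent label is a hypothetical paraphrase.
missed_practices = Code("Practices missed by Advance", [
    Code("Out-of-classroom roles and responsibilities", [
        Code("Sponsoring extracurricular activities"),
        Code("Teacher leadership or mentorship"),
        Code("School committee service"),
    ]),
    Code("Children's overall development"),
    Code("Teacher thinking, planning, or preparation"),
    Code("Teacher effort"),
])

if __name__ == "__main__":
    print(count_by_depth([missed_practices]))
```

Applied to the full codebook, a tally of this kind would correspond to the totals reported in the text: 37 parent codes, 148 child codes, and 104 baby codes.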

In addition to this coding of excerpts from the transcribed interviews, we also coded each teacher interview for a set of invariant descriptive features of the teachers and their schools, derived either from the interview itself or from prior knowledge about the schools. The school-level descriptors included grade configuration, size, average student performance, poverty level, and “bite.” Teacher-level descriptors included gender, years of teaching experience, years in the school building, pathway into the profession, whether the teacher was a career-changer, tenure status, subject and grade levels taught, Advance observation option elected, whether the teacher understood growth scores, and the lowest Advance rating the teacher would accept. The continuous measures (e.g., years of teaching experience) were collapsed into a smaller set of ordinal categories.

All told, we coded 4,051 excerpts from the 141 teacher interviews, an average of about 29 excerpts per interview. We applied a total of 7,388 instances of a parent code to these excerpts, or approximately 52 applications of a parent code per interview. An additional 11,000 child or baby codes were also applied to the excerpts.³

Because our research questions for this paper bear specifically on the meaning of the four labels designed to classify teachers’ performance, we took an additional step in preparing the interview data for analysis. We extracted all of the coded excerpts that included the four labels (i.e., Highly Effective, Effective, Developing, and Ineffective), and read them repeatedly. Based on these readings, we coded these excerpts with an additional 20 codes bearing specifically on the meanings that teachers associated with the evaluation labels. These included codes such as “Developing provides room to grow,” “Highly Effective is unattainable on observations,” and “Ineffective implies no learning or impact.”

Results

We begin our analysis by focusing on teachers’ understandings of the performance labels in the Advance system. We emphasize the labels Developing and Ineffective, and how teachers understand them in relation to their understanding of teaching as a profession, followed by consideration of the label Highly Effective. Then we address teachers’ responses, often tinged with emotion, to the labels, with special attention to being assigned the labels Ineffective and Developing.

Teachers’ Understandings of Performance Labels

The first phase of our analysis examines the labels Developing, Ineffective, and Highly Effective. Because the label Effective is “unmarked,” in Zerubavel’s (2018) terms, in that it is “normal” to think of people as competent, we did not conduct a parallel analysis for Effective.

Developing

We found the label Developing to be the most controversial and problematic of the four teacher performance labels. Our interviews suggest several reasons for this. First, the label has an ambiguous location in relation to the label of Effective, the normative standard for competent performance, and thus an uncertain meaning. Second, because teachers believe that competent teachers should constantly be striving to improve, the label conflicted with professional norms. Third, Developing implies change over time, yet the label is a static snapshot. And finally, teachers believed that Developing was an appropriate label for novice teachers, but inappropriate and/or insulting for more experienced teachers. We discuss these reasons, derived from teachers’ words, below.

Ambiguity about the proximity of the label to a threshold of effectiveness

Prior to the implementation of Advance, teachers were rated Satisfactory (S) or Unsatisfactory (U). A rating of Ineffective is clearly below the threshold of satisfactory performance that was the hallmark of the prior teacher evaluation system, corresponding to U, or Unsatisfactory. A U rating was a kind of scarlet letter. Conversely, Effective and Highly Effective clearly are “above the bar” of competent performance associated with the prior rating of Satisfactory. But the rating of Developing is not as clearly above or below the bar. This ambiguity is illuminated in a 16-year veteran teacher’s preference for the old system: “I would go back to Satisfactory, Unsatisfactory and that’s it. You fall in two. You are either Satisfactory or you’re Unsatisfactory. Period. Why would you be Developing? Developing into what? What are you developing into? I mean, really. To me that’s like what are we doing here? Why are we in between? You are either Satisfactory or you’re Unsatisfactory. Period.”

Another teacher, in her 11th year, saw a contrast between the rating of Developing and a teacher’s maturity and experience, saying, “What is Developing? A girl’s breasts are developing when she’s in puberty, not a teacher.” The nexus of maturity, experience, and competence is challenged by the label Developing.

Developing is, therefore, a label that is recognized as between Effective and Ineffective, but it provides a teacher little insight into where she or he is located along a continuum of professional competence, and why. The performance of a teacher labeled Developing is below that of a teacher labeled Effective, but that teacher may experience the boundary between the two as arbitrary.

Conflict with professional norms

As the third of four ordered labels, Developing is clearly not a desirable rating, and as we will see, many teachers had powerful emotional responses to receiving a rating of Developing, or to the prospects of such a rating. But the negative connotation of this label butts up against an occupational norm of continuous improvement. Teachers reported that they didn’t like the label, because they felt that all teachers, regardless of their current level of expertise, should be striving to develop their practice. Said a teacher in her sixth year, “Teaching is a never-ending craft, so you’re constantly learning. So there’s always room for development.” A seventh-year teacher said, “So if I were to redesign the categories . . . I guess, I think the best teachers are always at this idea of Developing. You know, we’re always trying to improve. We’re always trying to figure out what we’ve done well and what we need to do better.”

Another teacher with 4 years of experience noted, “I mean Developing. I mean we’re all developing. I mean it’s all about level of developing. We’re all growing and developing, and that’s the problem with this. Like what is it? When you get Highly Effective, you’ve arrived. You have nowhere to go from there?”

Put differently, teachers in our sample believed that effective teachers were constantly developing. The mutual exclusivity of the labels Developing and Effective felt confusing and inconsistent. A 16-year teacher said, “I’m here because I feel that my goal is to continue to improve,” and yet another with similar experience stated, “And you know what? We should be developing every day. Nobody is perfect. Nobody should stay stagnant.” This latter teacher told us, “I don’t necessarily think Developing is a bad thing . . . Maybe it’s I don’t know what the word should be because I think everybody should be developing on a daily basis, getting better, or trying to change things around to see what works, what didn’t work, and try to change it to see how we can make it work. I don’t necessarily think it’s such a bad—it shouldn’t be such a bad thing, but it is a bad thing, in a way.”

Finally, a fourth-year teacher explicitly cast the problem with the label Developing as a matter of semantics. “I think people have put such—had had such a negative response to the word Developing,” she said. “And if you just really think about it, Developing just means you have to maybe try other things, get better at it but it’s like, oh my gosh, God forbid you get a Developing. It’s the worst. It has just such a negative connotation to it. So maybe try a different word besides Developing will fix the whole problem.”

In sum, many teachers felt that effective teachers were and should be seeking to develop and grow in their practice. Thus, the system’s use of the label Developing to denote a substandard level of performance was inconsistent with their beliefs.

Developing as change over time

Since most teachers were observed repeatedly during the school year, they recognized that a rating of Developing on one or more Danielson elements early in the year could lead to a higher rating later in the year, and thus provided room for growth to register on their evaluation. For some, this provision was a source of motivation. A teacher in her eighth year said, “If you’re going to observe me in November, maybe I can be Highly Effective in one component but I feel like, even as my students, you’re still learning the curriculum in that year. It’s freshly new. We’re just in 2 months. I can always improve; my students can always improve. So Developing for me kind of lights that little fire under me, like, okay, this is what they want to see, this is what I’m going to do, you know, and then I can be reminded like into my daily lesson plans, how am I going to show this so if they came in they were able to see it?”

The notion of growth over a year and motivation were often connected. A 14-year veteran said, “But if I got Developing, I feel like I’d kind of get a sense of that after my first informal [observation]. And then, if I was scoring low, then I’d pay attention. That’s the thing. Now I don’t pay attention to that Danielson, but I would pay attention if I started getting 2s, because 2 would be Developing.”

A second-year teacher said, “And I feel like if a school notices that a teacher might be Developing at the beginning, they should do everything to try and remedy that and either offer PD or support from another teacher. There’s no reason why a teacher is Developing at the beginning of the year and as Developing at the end of the year.”

In some cases, teachers saw the application of the rating of Developing in observations as a calculated effort on the part of administrators. “It’s the same thing we do with the students for report cards,” said a teacher with 10 years of experience. “They don’t wanna tell you in the beginning of the year that you’re doing great because they want you to improve, just take you to the next level. Wherever you are, they wanna take you to the next level. If you’re a good teacher they wanna make you a great teacher. If you’re a great teacher, they wanna make you a mentor teacher. That’s what they do. They always wanna push you and try and see if they can get more out of you.”

In some cases, the notion of room to grow extended to the course of a novice teacher’s career, and not simply a single school year. Another 10-year veteran noted, “I also have had administrators tell me, ‘I think you’re good, but I’m giving you an Ineffective because you need to show growth. So, you’ll thank me in 3 years when you’re up for tenure.’” Another, with more than 25 years in the classroom, found this strategy arbitrary and objectionable. “I have a problem when it feels you just need to find something because that’s how somebody that conducts an observation—you need to find a something so that you can tell them that they need to grow,” she said. “But so Developing would be like, okay.”

Developing as appropriate for novice teachers, but not more experienced ones

Many teachers saw beginners as learning the craft of teaching, with the expectation that they would not yet be as skilled as more experienced teachers, and would have more supports to move toward greater expertise. In this view, Developing is a developmentally appropriate rating for a beginning teacher. Said a second-year teacher, “If you’re a new teacher and you have less than five years of experience, I think if you get a Developing, that’s definitely okay because that’s your mentor that needs to really help you.” Another, with 15 years in the classroom, commented, “I would say Developing is for a young teacher, and it should not be something that’s negative; it’s something to support them. Which areas are they developing in, and where they need support so they could grow.”

Some teachers argued for a totally different rating system for novices. “You have to give room for new teachers to grow,” said a newly-tenured teacher with 3 years of full-time experience. “They have to have their own system for at least two or three years. No one can do their best work in the first three years of the profession. After three years, everyone should be operating on a pretty level playing field. Just because it’s like . . . being a doctor. You have to have your rotations. You have to see your share of challenges. You have to be challenged. You have to have a chance to know what to do there. You should be operating on a different level for your first two or three years and then you should be evaluated after that growing period.” Later, we consider how teachers viewed both Developing and Ineffective as offensive to experienced teachers.

Although we argue that teachers are responding primarily to the subjective meaning of the term Developing, we did see some evidence that they linked it to its consequences. Most notably, teachers who receive a Developing overall rating are placed on a Teacher Improvement Plan, which many viewed as an encroachment on professional autonomy and as a documentation process that could eventually lead to dismissal. As a 30-year veteran put it:

I’ve come kind of to this acceptance that the people who make these choices about what’s good and what’s not good don’t know as much as me. So what am I gonna do? I still do it for them, and I do it for me, and I don’t need to be—I don’t have to have the Highly Effective plastered anywhere. But don’t tell me I’m Developing because now I’m gonna get pissed. Then we need an improvement plan. And then I’m out of here.

In this quote, the teacher conveys that if she were rated Developing she would be “pissed,” because the rating leads to a mandatory and potentially demeaning Teacher Improvement Plan. Thus, we cannot rule out the possibility that in some cases, it is the actual consequences of the label Developing that evoke particular meanings and emotional reactions from teachers, such as a threat to leave in a fit of anger. But we believe the weight of the evidence points clearly to the meaning of the label.

Ineffective

Teachers described the rating of Ineffective as problematic in several ways. First, they felt that the label implied that no student learning and no teaching were occurring in a teacher’s classroom, which they believed was almost never the case. Second, they drew parallels between labeling students and labeling teachers as failing, which they saw as demoralizing to both. Third, they believed that no teacher who was putting in effort could reasonably be labeled Ineffective. Often, references to the term Ineffective evoked powerful emotions from teachers, either as they anticipated how they might feel if assigned this label, or as they recounted the experiences of other teachers in their buildings.

Ineffective as no learning or impact

The term Ineffective has very negative connotations for teachers, which undergirds their view that it is offensive. Said a teacher with 13 years of experience, “I think it’s the wording. When you say a teacher’s Ineffective and people take it different ways, but it could be that teacher has no sorta impact on kids or her students, none at all.” Another long-time teacher, referring to personal experience, said, “You know, this evaluation system is phony because when I got the Ineffective, the kids participated in class. They were able to answer questions. They were able to go to the board and answer algebra with geometry—altogether. How to find the missing angle and all those things, and they gave me an Ineffective.”

Some interpreted the rating of Ineffective as a teacher not teaching, instead being a passive presence—or non-presence—in the classroom. “Ineffective just means that you being in that classroom is the equivalent of kids being in front of a TV,” said a 10-year veteran. Another, with 9 years under her belt, equated it with “if they’re sitting there reading a paper and things like that,” or “someone who just sits at their desk or stands in front of her room, does absolutely nothing and lets sheer anarchy take place in the classroom.” Not surprisingly, few teachers believed that they, or their peers, were so unprofessional. “To be Ineffective means you’re really not doing no work at all,” said a teacher in her 13th year. “If you stand up in the classroom and you fumble or you do something, might be not working this way, it’s not that you’re Ineffective. It’s just that maybe that was not the greatest day for you to teach or maybe the kids were having a bad time.” Still others struggled to believe that a teacher who did absolutely nothing in the classroom would be allowed to teach. “I just still cannot wrap my head around someone getting an Ineffective but yet they’re teaching,” said a second-year teacher. “I mean to me, an Ineffective would be like blatantly not teaching. You just come to work and you just sit in your desk and you just let the students do whatever they want. You’re not—you don’t plan and someone who doesn’t interact with the students. I just don’t see it.”

Parallels to labels that demoralize students

Some teachers noted that teachers are socialized to avoid calling students failures, and asked why the system should call teachers failures. Said a third-year interviewee, “I would avoid the word Ineffective because, just like when we see students, we don’t want to see them as failures. We want to support them and help them and we know that, if we help them, they will improve. So why do we evaluate them and give them Ineffectives? If we do not want to give Ineffectives to students, why do they give this to teachers?”

In some cases, teachers made explicit references to motivation and morale. “But Ineffective is harsh,” said a 20-year veteran. “That’s a little—I mean—you know, the worst thing to do is, not saying hurt someone’s feelings, but to—you don’t want to take away someone’s confidence. So words are hurtful.” She went on to say, “Yeah, it’s just—needs work. How could you say, ‘Needs work,’ in a nice way. That’s what these—that’s what we have to, like a committee to come together and think of a good word, ‘Needs work.’”

Others came up with creative reframings of the term Ineffective. A teacher with a decade of experience said:

I don’t like the terminology. I don’t like Ineffective. Ineffective to teachers is like Ineffective would be to students. There was a lady, Rita Pierson, “Every Kid Needs a Champion,” she did like a little speech last year. [NYC Schools Chancellor] Carmen Farina had suggested all schools watch it. The thing that stood out to me, something that she said was, a student got three right out of, I think 20, on a test. So they got 17 wrong. And she put three, and she put a smiley face. And the kid said, “But Miss, isn’t this an F?” And she said, “Yes, it’s an F. But you didn’t get everything wrong. You got something right.”

So for me, that resonated in my mind and I never forgot that particular part. Because what it is, is saying that you weren’t the best, but you didn’t suck. When you tell someone they suck, when you tell someone they’re Ineffective, it sucks the life out of them.

The presumed theory of action behind the ratings is that they will motivate teachers to improve and excel; failing that, the ratings can lead to dismissal. But this quote suggests that negative labels can have the unintended effect of demoralization, “sucking the life out of” a teacher. This is further evidence that it is the emotional weight of the labels, more so than their consequences, that matters to teachers.

The lack of recognition of effort

Most teachers believed that they were working hard, and that their effort should be recognized and rewarded in the evaluation system. They believed that ratings such as Developing and Ineffective were inappropriate because they ignored effort and commitment, both of which are central to their beliefs about teaching as a profession. Said a tenured teacher with 6 years in the classroom, “I think that I’m doing a lot. I’m doing a lot. I work—like Saturdays and Sundays, I do a lot of work, just to plan a lesson, and to do things that I need to do as a teacher. Like if you will ask me, I don’t have, like, a life, if that is appropriate. And for me to really get—to be rated Developing in any of those aspects, would be, I think, no—what’s the appropriate term? Like, not appropriate.” A second-year teacher said, “I don’t believe that I’m Ineffective because I do my job. I take it seriously. I take my kids seriously. And I put in the time and the effort.”

Another, with a decade of experience, made a similar argument about effort and devotion to the job of teaching:

And it really, really, pardon my language, but it pisses me off that I am not found Effective all the time because I work really, really hard. I put a lot of energy into what I do. I put a lot of energy into my past, into my college education, into what—how I get the kids involved. The stuff I squirrel away, buy with my own money, go to the basement and find, get into dumpsters and pull out for the students to have hands-on experience—because I know they’re not learning from a book.

And so all that stuff I put in there to those lessons—try to meet the kids—I see them outside on the sidewalk—talk to them, talk to their parents, talk to their brothers and sisters—it’s not reflected if you’re telling them only Developing.

These quotes reflect teachers’ own reactions to the possibility of being assigned a rating of Developing or Ineffective, but we also observed teachers imagining how others might react. “I like where I am on the spectrum,” said a tenured teacher with 5 years of experience who was rated Effective the preceding year. “It makes me feel better and it’s rewarding for all of your hard work. But if you really work hard and you just can’t get it together, the Ineffective or Developing could be demoralizing, I guess. I don’t know the answer to that. I guess now that I look at both ends, it can be demoralizing for someone on that end.”

It is easy to imagine that teachers might critique “low” ratings labels such as Developing and Ineffective. But our analysis also revealed that the top rating of Highly Effective had uncertain and problematic meanings.

Highly Effective

Teachers believed that the category Highly Effective was problematic for a number of reasons. In some cases, they were uneasy about any category that seemingly capped performance, and did not allow for improvement. In others, they felt that the rating of Highly Effective was unattainable. Throughout, they expressed concern that the Highly Effective category, rooted in a highly standardized observational rubric, did not acknowledge the discretionary work of teachers.

Highly Effective as capping performance

A fifth-year, untenured teacher said, “Well, I feel like Highly Effective, I feel like there’s always room for me to improve, so I would never take Highly Effective, because I feel like there’s always something I can do. . .” Another, with more than 10 years of experience, said, “Because there always has to be something that they need you to improve or it might seem a little weird if somebody was so perfect and then—you know what I mean?”

Highly Effective as unattainable

Teachers used a variety of expressions to convey that the rating of Highly Effective was unattainable on a routine basis. A two-decade veteran said, “Well, they say Highly Effective is a place we visit.” Another, with 16 years of experience, echoed this language, saying, “Highly Effective, I believe, is a place you visit once in a while. I don’t think it’s attainable on a regular basis.” A third, not yet tenured in her fourth year, said, “I understand Highly Effective is almost Nirvana.” Jokes about the category abounded. “Somebody had sent around an e-mail around Thanksgiving time,” said a teacher with 18 years in her building, “like the Danielson Thanksgiving dinner, and it’s like, the free-range turkey you’ve been raising in your backyard runs into the kitchen and leaps into the oven all by itself. And it’s just so grateful to give up its life to feed you on Thanksgiving. And it’s like, yeah, you really read it, that’s kind of how the Highly Effective is looked at.”

This unattainability, they believed, was due to the stringency of the Danielson rubric, and its inability to recognize the discretionary practices of teachers in the course of a given day in a specific classroom context. “I mean, when you look at Danielson,” said this same teacher, “when you look at Highly Effective, the language of it is just so, it’s so verbose. It’s so over the top. Are there teachers that are Highly Effective? Yes. Is a teacher Highly Effective every moment of every day, in every class? It’s impossible.”

Many teachers were frustrated by the fact that what they viewed as exemplary teaching could not be rated Highly Effective, due to the features of the Danielson rubric. “We talk a lot about the Danielson rubric,” said a recently-tenured teacher, “and they seem to–like we all interpret in the same way, and sometimes we’ll have personal feelings that are, you know, it was an amazing lesson, and we all followed the rubric, so that means we can’t give them Highly Effective even though we want to, because the rubric says it has to be Effective. And I feel like the evaluators at my school really stick to the rubric.” Another novice teacher commented, “I know a lot of teachers who I would personally classify as Highly Effective, but according to the Danielson rubric, if you walk into their classroom, they’ll never be graded as Highly Effective.” Said another teacher with 16 years of experience, “I don’t think the Danielson Framework was meant, originally, as an evaluation process. I think Effective and Highly Effective—or, Highly Effective is virtually impossible. I don’t know. I don’t see how anybody could be Highly Effective all the time.”

Teachers’ Emotional Responses to Performance Labels

In our interviews, teachers would occasionally refer to the meaning of the four labels—Highly Effective, Effective, Developing, and Ineffective—to teachers in general, and, in some cases, to themselves. In both instances, Developing and Ineffective were viewed as offensive or objectionable. Overwhelmingly, this was because teachers believed that their experience defined them as competent and effective practitioners.

This is not surprising, as many teachers are skeptical of the preparation of novice teachers emerging from “alternate route” programs such as Teach For America or the New York City Teaching Fellows. For better or worse, collective bargaining agreements that link salaries to experience also implicitly tie experience to competence, as does the very nature of teacher tenure. In addition, most teachers recognize that there is a steep learning curve; though they may have floundered in their early years, over time they believe that they have learned the craft of classroom teaching.

The language that teachers used to characterize their reactions, real or conjectured, to being labeled Developing or Ineffective was colorful. “But to have that Ineffective was like a slap in my face,” said a 30-year veteran. “And so it just disgusts me.” Another, in her 11th year, said, “it pisses me off that I am not found Effective all the time.” Others said, “it really upsets me,” “I’d be very disappointed,” “I’d be very upset,” “Developing would be really a shock,” “I’d probably be really angry,” and “If I got Developing, I’d lose my mind.” Still other teachers used terms such as “depressing,” “discouraging,” and “heartbreaking,” or “like a blow.” “I think I would cry for—I would think about quitting if I got less than Effective,” said a teacher approaching a decade in the classroom.

Many of these examples are hypotheticals, in which teachers imagine how they would react to a rating of Developing or Ineffective. But there are enough cases in which teachers did receive these labels to suggest that they do in fact evoke powerful negative emotions. At a minimum, it is clear that teachers believe that the labels have this power.

Teachers told us that the categories of Developing and Ineffective, both of which connoted less than effective or satisfactory performance, were at odds with their years of experience as teachers and the histories of performance they had demonstrated under the prior evaluation system, sometimes over many years. For example, a teacher with a decade of experience said, “I know teachers who have been teaching for many years and always got Satisfactories, and now, with the whole change of everything, are getting either Developing or Ineffective. They didn’t turn ineffective overnight.” It seemed inconsistent for teachers to be rated Satisfactory year after year and suddenly receive a rating that placed them below the bar for competence. Said another highly-experienced teacher, “It’s insulting to me. Because I’ve been teaching 19 years and now you’re going to tell me I’m Developing? All these years I’ve just been Developing in that area?”

A third said, “No one wants to be rated an Ineffective, and definitely no one wants to be rated Developing, especially at my level of education. I’ve been in the system for 16 years, and I’ve never been rated Ineffective. I’ve never had a poor rating. So, just to get a Developing is like a blow. You understand? For me, I’m always aiming for Highly Effective.” Still another, with 24 years in the classroom, said, “What am I supposed to be more concerned with? Should I be more concerned with my career that I’ve spent so much money and a pension that I’ve put into, the fact that hey, I’m approaching 55 in about 3 years, okay, so I better learn how to play the game really well because in this system, if you get one strike, two strikes, you understand? You’re out at the end of the two strikes after putting 20-something years. I am Ineffective? When did you come up with the fact that after all this time, Ineffective?”

We saw clear evidence of an association between a teacher’s experience and the lowest performance rating they would be willing to accept. Among the 121 teachers in our sample for whom we were able to code tenure status and the lowest rating they would accept, 49% of the untenured teachers would accept either a Developing or Ineffective rating, whereas only 27% of the tenured teachers would accept ratings this low (χ² = 5.2, p < .05).
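For readers who wish to see how a test statistic of this kind is computed, the following is a minimal sketch of a Pearson chi-square test of independence for a 2×2 table (tenure status by willingness to accept a Developing or Ineffective rating). The cell counts are not reported above; the counts used here are purely hypothetical, chosen only because they sum to 121 and are consistent with the reported percentages. They should not be read as the study's data, and the function name is an illustrative assumption.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the 2x2 table
    [[a, b], [c, d]], with the p-value for 1 degree of freedom."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# HYPOTHETICAL counts (not from the study): (would accept, would not accept)
untenured = (18, 19)  # 18 of 37, about 49%
tenured = (23, 61)    # 23 of 84, about 27%

chi2, p_value = chi_square_2x2(*untenured, *tenured)
print(f"chi-square = {chi2:.1f}, p = {p_value:.3f}")
```

Under these assumed counts the statistic rounds to the reported value, but other splits of the 121 teachers could match the reported percentages as well.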

We also saw clear evidence that teachers’ sense of themselves as professionals was shaped by the annual teacher evaluation labels assigned to them. Unlike the labels assigned periodically during teacher observations, the summative end-of-year labels evoked powerful emotions, as they represented an institutional attribution of competence and of a durable quality of teachers as individuals. Although some teachers simply shrugged off the labels, others internalized them, and came to view themselves as having the attributes implied by the label. For those receiving labels of Developing and Ineffective, this was upsetting.

A long-time teacher said: “I got Developing . . . I knew that it was coming. Yeah. Because I felt – I feel really bad. I feel like I’m good for nothing because I spent all this time. I go home and sit with my books next to my bed. I go home, all I do is lessons, lesson. I mean, if a person that is Developing is a person at night, to me, is—I mean, I’ve been here 30 years and I know what I’m doing. I don’t even have to write a lesson plan for me to teach a lesson. You give me a book.”

Techniques of Neutralization

The emotional labor associated with the labels of Developing and Ineffective is unmistakable; teachers used phrases such as “a punch in the gut,” “really angry,” “defeated,” “you feel like shit,” and “a shock,” and said that they were “relieved” when they learned that they had been rated Effective rather than Developing. We draw on what Sykes and Matza (1957), writing about how adolescents protect their self-image when engaged in acts that external authorities deem delinquent or deviant, refer to as techniques of neutralization. A full exposition of the strategies that teachers use to neutralize their negative reactions to the labels of Developing and Ineffective is beyond the scope of this paper. But we do wish to draw attention to one specific strategy that we refer to as compartmentalization. We saw evidence that teachers sought to compartmentalize ratings of Developing and Ineffective as features of specific observations, rather than as durable attributes of their performance. In so doing, they could tolerate a rating of Developing, or even Ineffective, on a Danielson element by recognizing it as transient, and changeable over time.

Teachers were much more forgiving of ratings of Developing on particular Danielson elements than of a summary rating of Developing on the Measures of Teaching Practice or a rating of Developing overall. They frequently saw ratings of Developing on particular elements as transient, and did not experience them as a durable label. On a given day, for reasons that teachers might not be able to control, an observer might categorize their practice as Developing. “I think that I’m comfortable with my teaching practice,” said a teacher with nearly three decades of experience. “I really am. I think the kids are learning. I think that overall I’m a pretty good teacher. But there’s days that things don’t work out, and I’m fine with that. And I’ve gotten Developings and I’m perfectly okay with that. I’m all for being given advice to improve my practice.”

“So Developing or Ineffective has just never been a practice for me,” said a 10-year veteran. “And I’m gonna finish that statement—does that not mean that certain things within a lesson—well, Ineffective just wouldn’t exist. Things could be considered, let’s say Developing at one time or another, for a variety of factors.” Another experienced teacher commented, “I’m sure there are times where, I mean I know there are times, probably every day where I’m Developing or Ineffective on the Danielson rubric.” Still another, with 5 years of experience, said, “I would take [Developing] if it’s in a few categories, but not as an overall score.” A probationary teacher said, “I know that there are certain things that, when you come into the classroom and I cannot be perfect for all the elements. So, at that moment, I could be Developing but, for the long term, for if you see me as a teacher in the whole picture, I am Effective or Highly Effective.”

Some teachers neutralized their reactions to a rating of Developing by pointing to the rigidity of the rubric. “I’ve gotten Developing in a few things,” said a teacher in her 16th year, “but my principal takes the rating system very seriously—like the letter of the law. She’ll read that rubric, and rate you against what’s there in black and white. There’s no room. There’s no, like, ‘I know this teacher, and this is just happening now.’ It’s whatever is in that rubric, she holds us to it.” Another, in his 11th year, said, “We do these observations inside the school where a team of colleagues go into somebody else’s classroom and give ratings. You know, like we observe and we give feedback, but we also like rate them on the Danielson. And the thing that we always talk about is like there is times where it’s very reasonable of the teacher to actually choose Developing.”

Discussion

For many decades, it was taken for granted that teachers are competent. Parents and other teachers might recognize some teachers as more or less skilled, but competence itself was usually unremarkable. That was the default, and competence was not named as something worthy of remark or notice (Zerubavel, 2018). However, the imposition of the statewide teacher evaluation system in New York, as in other states, represents a claim on the part of the state that it has the legitimate authority and power to define standards of competence for K-12 classroom teachers (Flack, 2020). Labels such as Ineffective, Developing, Effective, and Highly Effective have been imposed externally, normalizing the idea that there are gradations of competence among teachers.

In a context where 99% of teachers have been rated Satisfactory, the idea that one needs these labels is jarring to teachers. It also departs from how society has treated other occupations that claim professional status, such as physicians. Ratings chip away at an image of teachers as professionals, although that image is itself contested by scholars who view teaching more appropriately as a semi-profession (Flack, 2020; Horowitz, 1985; Ingersoll & Collins, 2018). We still do not formally rate doctors as “effective,” although individual patients and their families may voice their opinions, increasingly on websites that do allow ratings. Still, for most occupations, competence has not been named as something that is worthy of remark or notice.

One of the most striking findings in our study is the powerful emotions that are evoked by the receipt, or threat of receipt, of the ratings Developing and Ineffective. We cannot rule out that some of these reactions stem from the real or anticipated consequences of receiving these ratings (e.g., being placed on a Teacher Improvement Plan, in the case of an annual overall rating of Developing, or being fast-tracked for dismissal, regardless of tenure status, in the case of two consecutive overall ratings of Ineffective). Nevertheless, the weight of the evidence suggests that what is evoking these emotions is the fact that the labels Developing and Ineffective define a teacher as less than competent.

Ball’s (2003) argument that performance labels are a means of control, pressing teachers to internalize the criteria for competence, is relevant here. His reference to performance labels and “intensive work on the self” implies that teachers are obliged to redefine the self, or at least that portion that is rooted in an occupational identity, in ways that conform to the labels. In a sense, you are what you’re labeled. Reconstructing the self requires emotional labor and, as our data suggest, can be anxiety-provoking, as teachers seek to reconcile a negative label with their own sense of their competence and trajectory. A label that is inconsistent with an existing sense of self can be upsetting and angering.

The use of performance labels chosen by the state represents a struggle between what Evetts (2009) refers to as occupational professionalism and organizational professionalism. Neoliberal policy regimes and New Public Management approaches imply that expertise and professions are answerable to the state, rather than to clients such as students and their parents (Fournier, 1999). Moreover, the regulatory system measures and monitors a limited set of criteria, most notably standardized measures of student performance, and the teacher practices that are intended to boost that performance. This is far less expansive than what teachers have come to value through their preparation and experience (Ranson, 2003).

The heart of the struggle takes two forms. First, the state’s attempts to control work are accompanied by efforts to standardize it (Evetts, 2009). Measures of teaching practice such as rubrics based on the Danielson Framework for Teaching represent the application of standardized criteria to work that teachers view as highly discretionary and not amenable to standardization. Classic professionalism relies on the exercise of discretion in applying specialized knowledge acquired through formal education and experience (Flack, 2020). We saw obvious frictions between the standardized Danielson rubric, deployed by observers who might not have the capacity to pick up the particularities of context, and an ethic of individualism and discretion.

Second, it is well recognized that teacher preparation programs do not adequately prepare novice practitioners to teach (Feuer et al., 2013; Lortie, 1975). This is even more evident in the presence of alternate routes to teaching that have extremely short induction periods. Teachers may therefore believe that experience, rather than formal preparation, confers expertise (Whitney et al., 2013). This is why so many experienced teachers reported powerful negative emotions around the actual or conjectured assignment of the labels Developing or Ineffective. The labels challenged the very nature of their self-efficacy and professional identity.

Not surprisingly, teachers resisted the substitution of external labels for their hard-won identities, often launching sophisticated critiques of the tools and processes generating these labels. Many teachers did not view the Advance system as legitimate, and few saw it as a resource for action. This reaction is presumably at odds with the intentions of the policymakers who designed the system (Herlihy et al., 2014; Paufler & Sloat, 2020).

If teachers do not view an evaluation system as legitimate, it will devolve into a system rooted in sanctions much more so than trust (Mansbridge, 2014), weakening the moral basis for teaching, and the social compact binding classroom teachers to their communities. Teacher evaluation systems can evolve in significant ways over time (Dee et al., 2021), and it is possible that implementation features that address teachers’ concerns about control and professional discretion can help sustain trust, even as the systems promote the departure of teachers identified as low-performing. Nevertheless, our results show that at least in New York City, many teachers experience the teacher evaluation system as more punitive than supportive, and the performance labels assigned to teachers contribute to this view.

We note that our study is unable to link teachers’ perceptions of the performance labels, or the performance labels they were assigned, to changes in their instructional practice. There are risks in inferring behaviors from what respondents say in interviews (Jerolmack & Khan, 2014), perhaps especially regarding changes in instructional practice. Other limitations include teachers’ confusion regarding the details of the evaluation system and the context of “policy churn” that linked teacher evaluation to a barrage of other reforms, many of which did not last long.

Perhaps most important, this study was situated in New York City, the nation’s largest school system, and one that is highly idiosyncratic. In the years preceding this study, both city and state officials were dismissive of teachers and teaching in their public statements, promoting a culture of fear among rank-and-file teachers, and a polarization and stiffening of the state and local unions representing them. Whether our findings about teachers’ interpretations of performance labels can generalize to other, less-contested settings will require additional research.

A practical recommendation is to design teacher evaluation systems via collective bargaining between administrators and teachers. By doing so, the parties can build trust that can enable them to develop procedures—and performance labels—that will not have unintended consequences. Designs of this sort are not automatic; teacher evaluation is often described as a “wicked problem” (Lillejord et al., 2018), precisely because of its competing purposes of professional development and quality assurance (Herlihy et al., 2014; Kraft & Gilmour, 2016). Teachers are generally comfortable with those features of evaluation systems that provide feedback for professional development and improved instructional practice, but resist those features that are tied to high-stakes personnel decisions (Sartain et al., 2020).

Nevertheless, there are examples that demonstrate that collective bargaining can lead to consensus on the design and implementation of teacher evaluation systems. Donaldson and Papay (2015), for example, document the development of New Haven, CT’s TEVAL system, which assigns summative ratings on a five-point scale ranging from “Needs Improvement” to “Exemplary.”⁴ In New Haven, the emphasis was on evaluation for professional development and support, which built teacher trust and curbed resistance and dissatisfaction.

Organizing teacher evaluation around formative professional development opens the door to other innovations, such as peer observation and evaluation, which teachers typically resist if they believe that their observations will put their colleagues at risk. The performance labels embedded in teacher evaluation systems may be less offensive if they are decoupled from high-stakes decisions such as tenure and termination.

The evaluation of the performance of K-12 classroom teachers need not be as fraught as was evidenced in this study. In most white-collar occupations, workers do not fear losing their jobs based on performance evaluations. Even in New York City, it is far more likely that classroom teachers receiving low ratings will have their probationary periods extended than that they will be dismissed outright.

The emotional energy New York City teachers expend in responding to real or conjectured performance labels displaces energy that could feed their professional development. Revising these labels is a step toward devising teacher evaluation systems that can balance their multiple goals.

Footnotes

Acknowledgements

The author would like to thank Yeonsoo Choi, Clare Buckley Flack, Anna Neumann, and Cami Touloukian for their thoughtful comments on earlier drafts. The author is also grateful to Jennifer Jennings, Katie Ledwell, and the entire research team for the New York City Teacher Evaluation Study.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by Grant #201500152 from the Spencer Foundation.

ORCID iD

Aaron M. Pallas

Notes

Author Biography

Aaron M. Pallas is Arthur I. Gates Professor of Sociology and Education and Chair of the Department of Education Policy and Social Analysis at Teachers College, Columbia University. He educates stakeholders—including representatives of the media—about the complexities and unexpected consequences of accountability and resource distribution policies in public schools.

References

Ball, S. J. (2003). The teacher’s soul and the terrors of performativity. Journal of Education Policy, 18(2), 215–228.

Corcoran, S. P. (2010). Can teachers be evaluated by their students’ test scores? Should they be? The use of value-added measures of teacher effectiveness in policy and practice. Annenberg Institute at Brown University.

Danielson (2013). The framework for teaching: Evaluation instrument. The Danielson Group.

Dee, T. S., James, & Wyckoff (2021). Is effective teacher evaluation sustainable? Evidence from District of Columbia Public Schools. Education Finance and Policy, 16, 313–346.

Deterding, N. M., & Waters, M. C. (2021). Flexible coding of in-depth interviews: A twenty-first-century approach. Sociological Methods & Research, 50(2), 708–739.

Donaldson, M. L., & Papay, J. P. (2015). An idea whose time had come: Negotiating teacher evaluation reform in New Haven, Connecticut. American Journal of Education, 122(1), 39–70.

Dubnick (2005). Accountability and the promise of performance: In search of the mechanisms. Public Performance & Management Review, 28(3), 376–417.

Etzioni (1969). The semi-professions and their organization; teachers, nurses, social workers. Free Press.

Evetts (2009). New professionalism and new public management: Changes, continuities and consequences. Comparative Sociology, 8(2), 247–266.

Feuer, M. J., Floden, R. E., Chudowsky, & Ahn (2013). Evaluation of teacher preparation programs: Purposes, methods, and policy options. National Academy of Education.

Figlio & Loeb (2011). Chapter 8—School accountability. In Hanushek, E. A., Machin, & Woessmann (Eds.), Handbook of the economics of education (Vol. 3, pp. 383–421). Elsevier.

Flack, C. B. (2020). Executing content: Instructional guidance infrastructures and conceptions of teacher professionalism [PhD dissertation]. Columbia University.

Fournier (1999). The appeal to ‘professionalism’ as a disciplinary mechanism. The Sociological Review, 47(2), 280–307.

Freidson (1970). Profession of medicine: A study of the sociology of applied knowledge. University of Chicago Press.

Gorman, E. H., & Sandefur, R. L. (2011). “Golden age,” quiescence, and revival: How the sociology of professions became the study of knowledge-based work. Work and Occupations, 38(3), 275–302.

Herlihy, Karger, Pollard, Hill, Kraft, Williams, & Howard (2014). State and local efforts to investigate the validity and reliability of scores from teacher evaluation systems. Teachers College Record, 116, 1–28.

Hood (1991). A public management for all seasons? Public Administration, 69(1), 3–19.

Horowitz, T. R. (1985). Professionalism and semi-professionalism among immigrant teachers from the U.S.S.R. and North America. Comparative Education, 21(3), 297–307.

Ingersoll & Collins (2018). The status of teaching as a profession. In Ballantine, J. H., Spade, J. Z., & Stuber, J. M. (Eds.), Schools and society: A sociological approach to education (6th ed., pp. 199–213). Pine Forge Press/Sage Publications.

Jacobsen, Snyder, J. W., & Saultz (2014). Informing or shaping public opinion? The influence of school accountability data format on public perceptions of school quality. American Journal of Education, 121(1), 1–27.

Jerolmack & Khan (2014). Talk is cheap: Ethnography and the attitudinal fallacy. Sociological Methods & Research, 43(2), 178–209.

Kane, T. J., Kerr, K. A., & Pianta, R. C. (Eds.). (2014). Designing teacher evaluation systems: New guidance from the measures of effective teaching project. John Wiley & Sons.

Kraft, M. A., & Gilmour (2016). Can principals promote teacher development as evaluators? A case study of principals’ views and experiences. Educational Administration Quarterly, 52(5), 711–753.

Kraft, M. A., & Gilmour, A. F. (2017). Revisiting the widget effect: Teacher evaluation reforms and the distribution of teacher effectiveness. Educational Researcher, 46(5), 234–249.

Labaree (1992). Power, knowledge, and the rationalization of teaching: A genealogy of the movement to professionalize teaching. Harvard Educational Review, 62(2), 123–155.

Lillejord, Elstad, & Kavli (2018). Teacher evaluation as a wicked policy problem. Assessment in Education: Principles, Policy and Practice, 25(3), 291–309.

Lortie, D. C. (1975). Schoolteacher: A sociological study. University of Chicago Press.

Mansbridge (2014). A contingency theory of accountability. In Bovens, Goodin, R. E., & Schillemans (Eds.), Oxford handbook of public accountability (pp. 55–68). Oxford University Press.

Meyer, J. W., & Rowan (1977). Institutionalized organizations: Formal structure as myth and ceremony. American Journal of Sociology, 83(2), 340–363.

Mittleman & Jennings, J. L. (2018). Accountability, achievement, and inequality in American public schools: A review of the literature. In Schneider (Ed.), Handbook of the sociology of education in the 21st century (pp. 475–492). Springer International Publishing.

Neal (2018). Information, incentives, and education policy. Harvard University Press.

Pallas, A. M. (2012). The fuzzy scarlet letter. Educational Leadership, 70(3), 54–57.

Pallas, A. M. (2020). A sociological analysis of the effects of standards-based accountability policies on the distribution of educational outcomes. In Grek, Maroy, & Verger (Eds.), World Yearbook of Education 2021: Accountability and datafication in the governance of education (pp. 199–214). Routledge.

Papay, J. P., Murnane, R. J., & Willett, J. B. (2016). The impact of test score labels on human-capital investment decisions. Journal of Human Resources, 51(2), 357–388.

Paufler, N. A., & Sloat, E. F. (2020). Using standards to evaluate accountability policy in context: School administrator and teacher perceptions of a teacher evaluation system. Studies in Educational Evaluation, 64, 100806.

Ranson (2003). Public accountability in the age of neo-liberal governance. Journal of Education Policy, 18(5), 459–480.

Reio, T. G. (2005). Emotions as a lens to explore teacher identity and change: A commentary. Teaching and Teacher Education, 21(8), 985–993.

Sartain, Zou, Gutiérrez, Shyjka, Hinton, Brown, E. R., & Easton, J. Q. (2020). Teacher evaluation in CPS: Perceptions of REACH implementation, five years in. University of Chicago Consortium on School Research.

SocioCultural Research Consultants. (2018). Dedoose Version 8.0.35, web application for managing, analyzing, and presenting qualitative and mixed method research data. SocioCultural Research Consultants, LLC.

Stecher, B. M., Vernez, & Steinberg, P. S. (2010). Reauthorizing no child left behind: Facts and recommendations. RAND Corporation.

Stets, J. E. (2005). Examining emotions in identity theory. Social Psychology Quarterly, 68(1), 39–56.

Sykes, G. M., & Matza (1957). Techniques of neutralization: A theory of delinquency. American Sociological Review, 22(6), 664–670.

Toch & Rothman (2008). Rush to judgment: Teacher evaluation in public education. Education Sector.

Waller (2014). The sociology of teaching. Martino Fine Books.

Weisberg, Sexton, Mulhern, J. L., & Keeling (2009). The widget effect: Our national failure to acknowledge and act on teacher differences. The New Teacher Project.

Whitney, A. E., Olan, E. L., & Fredricksen, J. E. (2013). Experience over all: Preservice teachers and the prizing of the “practical.” English in Education, 45(2), 184–200.

Zerubavel (2018). Taken for granted: The remarkable power of the unremarkable. Princeton University Press.