Observation Studies in Special Education: A Synthesis of Validity Evidence for Observation Systems

Abstract

Classroom observation research plays an important role in policy, practice, and scholarship for students with disabilities. When interpreting results of observation studies, it is important to consider the validity evidence provided by researchers and how that speaks to the intended use of those results. In this literature synthesis, we used Kane’s argument-based approach to validity to describe evidence of validity for uses of observation instruments in classroom observation research regarding teachers of students with disabilities. We identified 102 studies from 1975 to 2020 that met inclusion criteria. Results indicated many studies did not report validity evidence to support their use of the observation instruments. Over time, reporting levels for much of the evidence have remained relatively constant, but we noted a consistent decrease in the number of observations conducted per teacher and a consistent and large increase in reporting of teacher participant characteristics. We provide implications of these findings for research and practice and suggestions for improving classroom observation research.

Keywords

classroom observation, students with disabilities, literature synthesis, validity

By documenting teaching practice, classroom observational research fills an important function in educational systems (e.g., Cohen & Goldhaber, 2016): It provides insights into the practices teachers enact and how skilled they are in enacting these practices, thereby indicating in what areas they may need support to improve. This research also indicates conditions under which students are more likely to experience effective instruction, thereby informing, for example, policies guiding student placement decisions.

Observational research is especially important for students with disabilities. The Individuals with Disabilities Education Act (2004) guarantees students with disabilities access to a free, appropriate education, which must provide “more than de minimis” benefit; it must support them in making meaningful progress toward individualized goals, while also providing access to general education curricula (Yell & Bateman, 2019). Due to the learning and/or behavioral challenges students with disabilities experience, teachers are legally and ethically responsible to consistently use effective, efficient instructional practices (e.g., modeling, feedback) with a high degree of skill (Brownell et al., 2010). Classroom observation research plays an essential role in special education policy, leadership, and research, by providing insight into the ways the teacher workforce is meeting this responsibility, and the circumstances and settings in which they are more or less likely to fulfill this responsibility well.

High-quality classroom observational research depends on high-quality observation systems (Bell et al., 2012). Observation systems typically include an observation instrument, a rubric or rating form to help observers focus on particular aspects of instruction (e.g., classroom management, questioning) that are of empirical or theoretical importance for promoting positive student outcomes (Liu et al., 2019). Some observation systems are intended to yield descriptive data about the nature of teachers’ practice (e.g., the number of times they provided corrective feedback), and others yield a quality rating, indicating the strength of teachers’ practice overall.

Items in observation instruments guide the dimensions of teaching practice researchers observe (Bell et al., 2012). Some instruments focus on instruction for specific content areas (e.g., Reading in Special Education), and others are content-generic, focusing on aspects of instruction that are meant to be universal across all content areas (e.g., Classroom Assessment Scoring System). Furthermore, some tools have a narrow focus, and others are more holistic. For example, some instruments count the number of opportunities for students to respond, focusing narrowly on one type of teacher–student interaction, whereas more evaluative tools include ratings of broader aspects of instruction (e.g., classroom management). Both types of instruments can be used in observation systems. However, observation systems include more than the items; procedures embedded in observation systems are used to enhance the validity of data collected using an observation instrument. These include training and reliability procedures and conducting a sufficient number and duration of observations (Liu et al., 2019).

Understanding the observation systems used to evaluate instruction for students with disabilities can provide insight into the quality of inferences drawn from observational research. Thus, the purpose of this systematic review is to examine the validity evidence provided for researchers’ use of observation systems in naturalistic observations of teachers of students with disabilities from 1975 to 2020. We acknowledge that this is a significant window of time, during which there have been important changes in both conceptions of effective special education practice (Brownell et al., 2010) and measurement theory (e.g., American Educational Research Association, American Psychological Association, National Council on Measurement in Education, Joint Committee on Standards for Educational, & Psychological Testing, 2013; Kane, 2010); thus, we attend to changes in the use of observation systems over time.

Conceptual Framework: An Argument-Based Approach to Validity

Drawing on Kane’s (2010) argument-based approach to validity, in this study, we examine ways in which observational studies of teachers’ instruction for students with disabilities have provided evidence that could be used to support their uses of observation systems. The Standards for Educational and Psychological Testing, put forth by the American Educational Research Association (2013) and other professional organizations, define validity as the extent to which “evidence and theory support the interpretations of test scores for proposed uses of tests” (p. 11). From an argument-based approach to validity, the intended interpretations of a measure imply particular inferences, or assumptions, users make when they draw conclusions and make decisions based on someone’s performance on a measure (Kane, 2010). By empirically demonstrating the accuracy of the inferences on which interpretations and uses are based, one can provide evidentiary support for those interpretations and uses of the measure (Kane, 2010).

Using an argument-based approach to validity, Bell et al. (2012) articulate key inferences on which classroom observation systems are based. First, scoring inferences refer to the assumption that a given behavior during instruction accurately and consistently translates, without bias, to a score on the observation system. Scoring inferences presume observers are not systematically biased in the scores they assign to particular teachers (e.g., men vs. women), in particular contexts (e.g., settings with which they are more/less familiar), or for particular aspects of the observation instrument (e.g., items on which they are less proficient). Observation systems can be designed to enhance evidence for scoring inferences by ensuring observers hold relevant knowledge, randomly assigning observers, masking observers to conditions that might activate biases, and conducting calibration to monitor and address observer drift. Researchers can evaluate scoring inferences after scores are assigned, by examining reliability across observers and testing whether scores systematically vary (e.g., by observer, teacher, context features). Note that reliability is a precondition for claims of validity; an observation system cannot validly measure a construct unless it can reliably produce consistent scores.
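As an illustration of the reliability check described above, the following minimal Python sketch computes two common interobserver statistics, percent agreement and Cohen's kappa, for two observers who coded the same intervals. The function names and the sample codes are hypothetical and are not drawn from any study in the corpus; the sketch assumes nominal codes and exactly two observers.

```python
from collections import Counter


def percent_agreement(obs_a, obs_b):
    """Proportion of intervals on which two observers assigned the same code."""
    if len(obs_a) != len(obs_b) or not obs_a:
        raise ValueError("both observers must code the same non-empty set of intervals")
    return sum(a == b for a, b in zip(obs_a, obs_b)) / len(obs_a)


def cohens_kappa(obs_a, obs_b):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(obs_a)
    p_o = percent_agreement(obs_a, obs_b)
    counts_a, counts_b = Counter(obs_a), Counter(obs_b)
    # Chance agreement: product of each observer's marginal proportions, summed over codes.
    p_e = sum(counts_a[c] * counts_b[c] for c in set(counts_a) | set(counts_b)) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both observers used a single identical code")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical interval codes (1 = behavior observed, 0 = not observed).
observer_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
observer_2 = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
agreement = percent_agreement(observer_1, observer_2)
kappa = cohens_kappa(observer_1, observer_2)
```

In this hypothetical example the observers agree on 8 of 10 intervals, yet kappa is noticeably lower than raw agreement because half of that agreement would be expected by chance.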

Second, generalization inferences support extending interpretations of the observed score to “a claim about expected performance over a larger universe of observations” (Kane, 2006, p. 24). There are two assumptions that fall under the generalization inference. The first is that “the sample of teaching observed is representative of all the instances of teaching to which one wants to generalize” (Bell et al., 2012, p. 67); this presumes the observation score is not capturing noise peculiar to the observed lesson but represents a typical range of instruction. The second assumption is that the participant sample is representative of the population to which the user intends to generalize the results. Observation systems can be designed to support generalization inferences by observing a sample of a teacher’s instruction that is representative of the population of instruction about which researchers are generalizing (e.g., spreading observations over a relevant time period), ensuring an adequate number and length of observations, and observing a representative sample of teachers who can plausibly generalize to the population of teachers about whom researchers draw conclusions. After observation, one can also evaluate generalization inferences by examining the extent to which aspects of sampling (e.g., time of day or year) contribute to within-teacher variability in scores; the larger the contribution of these factors to scores, the more observations are required to support generalization inferences.
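The link between within-teacher variability and the number of observations needed can be made concrete with a standard one-facet formula from generalizability theory (a Spearman-Brown type result), in which the reliability of a teacher's mean score over n observations is var_teacher / (var_teacher + var_within / n). The sketch below is illustrative only: the variance components are hypothetical, the function names are invented for this example, and real observation designs usually involve additional facets (e.g., raters, lessons, occasions).

```python
def reliability_of_mean(var_teacher, var_within, n_obs):
    """Reliability of a teacher's mean score across n_obs observations.

    var_teacher: between-teacher (true score) variance.
    var_within: within-teacher variance from sampling lessons, days, etc.
    """
    if var_teacher <= 0 or var_within < 0 or n_obs < 1:
        raise ValueError("need var_teacher > 0, var_within >= 0, n_obs >= 1")
    return var_teacher / (var_teacher + var_within / n_obs)


def observations_needed(var_teacher, var_within, target, max_obs=1000):
    """Smallest number of observations whose mean reaches the target reliability."""
    if not 0 < target < 1:
        raise ValueError("target reliability must be strictly between 0 and 1")
    for n_obs in range(1, max_obs + 1):
        if reliability_of_mean(var_teacher, var_within, n_obs) >= target:
            return n_obs
    raise ValueError("target not reached within max_obs observations")


# Hypothetical variance components: doubling within-teacher noise
# doubles the number of observations needed for the same target.
n_low_noise = observations_needed(var_teacher=1.0, var_within=2.0, target=0.70)
n_high_noise = observations_needed(var_teacher=1.0, var_within=4.0, target=0.70)
```

Under these assumed values, a target reliability of .70 requires five observations per teacher when within-teacher variance is twice the between-teacher variance, and ten when it is four times as large.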

Third, extrapolation inferences refer to the assumption that scores on the measure are related to a broader construct, in the case of Bell et al.’s (2012) scholarship, teaching quality. Observation systems can be designed to strengthen the argument for extrapolation inferences by clearly defining the dimensions of instruction they evaluate and by providing empirical support for those dimensions (e.g., citing prior research or theory, providing definitions of dimensions observed). Extrapolation inferences can also be supported by evidence indicating observation scores are (a) related to other measures of similar dimensions of teaching quality and/or outcomes of teaching, but (b) unrelated to measures of unrelated constructs.

Finally, implication inferences refer to the assumption that uses of the measure as intended lead to expected outcomes (Bell et al., 2012). Implication inferences can be evaluated by examining whether using the scores as intended produces expected outcomes. For example, researchers may use the results of an observation study to decide what aspects of instruction they should prioritize for a future PD; if this decision leads to desirable outcomes (e.g., a PD that improves teacher practice and student outcomes), that would support the implication inference.

Kane (2006) laid out two stages of instrument validation. In the first stage, Development, developers identify their goal(s) for creating the instrument and collect evidence related to the interpretive argument, the framework for interpreting scores. In this stage, developers generally adopt a confirmationist bias, because they are attempting to support their interpretations and improve the instrument. The second stage is the Appraisal stage, in which researchers take a “more neutral or even critical stance” (p. 26). In the Appraisal stage, the instrument is assumed to be finished, and the interpretive argument is examined critically. Kane makes clear that validation is not a process that can be completed in one study; rather, it requires a series of studies providing different information related to validity. To appraise an instrument for a particular interpretation, the appraiser would gather the available research data related to the instrument and evaluate whether the data are sufficient to support their intended interpretation or use of the instrument.

Adopting Kane’s (2010) approach to validity and Bell et al.’s (2012) application of this approach to observation systems, we posit that, when using observation systems, researchers should provide evidence to support scoring, generalization, and/or extrapolation inferences, so other scholars have the opportunity to appraise the instrument’s appropriateness for their purposes. In this review, we evaluate the extent to which observation studies of teachers of students with disabilities provide evidence to support scoring, generalization, and extrapolation inferences. We do not evaluate the extent to which they provide evidence to support implication inferences; because we focus on uses of observation systems in research, evaluating the implication inference would require us to determine the impact of each study (e.g., policy changes made as a result of a study’s results), which is beyond the scope of the present review.

Previous Syntheses of Classroom Observational Research in Special Education

Researchers have conducted several prior syntheses to examine trends in classroom observational research. Most extant syntheses have examined specific populations (e.g., students with emotional/behavioral disorders; Vaughn et al., 2002) and content areas (e.g., reading; McKenna et al., 2015). Furthermore, most prior syntheses focused on results of included studies, not methods. For example, they examined whether studies’ results showed teachers using evidence-based practices (McKenna et al., 2015; E. A. Swanson, 2008; Vaughn et al., 2002; Walker & Stevens, 2017). These syntheses are useful for understanding the extent of the research-to-practice gap.

Taking a different perspective, other reviews focused on aspects of observation systems used in classroom observational studies. Park et al. (2019) described instruments used in observations of special education reading instruction, focusing on how instruments’ content aligned with principles of effective reading instruction for students with disabilities. They concluded a few instruments captured general instructional practices, but not specific effective reading practices for students with disabilities, indicating evidence for the extrapolation inference may not be present, at least for reading instruction. E. A. Swanson (2008) synthesized observation studies of reading instruction and found more than half reported conducting observer training, but the same number did not report how they calculated interobserver reliability, providing mixed evidence to support scoring inferences. Salvia and Meisel (1980) examined methods used in observation studies published in four journals in the prior 2 years to determine the extent to which observer bias may have influenced studies’ results; they examined three aspects of study design: (a) whether the study’s purposes were masked¹ for observers, (b) interrater reliability, and (c) observer training. They found about half the studies in their corpus had high potential for bias, indicating studies provided weak evidence to support scoring inferences.

These reviews provide important insight into the quality of observation systems, but none comprehensively examined the evidence of validity provided in researchers’ descriptions of observation systems; for example, Park et al. (2019) examined the content of observation instruments, not the systems used to promote valid ratings. Furthermore, the authors of these reviews defined their corpora in ways that limit the reviews’ relevance for improving the quality of special education classroom observation research writ large. Salvia and Meisel’s (1980) review is limited by its age and focus on only one aspect of observational quality. Although they are more recent, the other two reviews (Park et al., 2019; E. A. Swanson, 2008) are limited by a narrow focus on observations in reading. Though some teaching practices are content specific, (a) many are effective across content areas, (b) methods used to evaluate practice are seldom content specific, and (c) special education research often evaluates instruction across content areas because special educators often provide instruction across content areas (Jones & Brownell, 2014). Thus, a broader review likely has stronger relevance for informing future observation research.

The Current Study

To date, no reviews have comprehensively examined how observation studies of teachers of students with disabilities provide evidence that could be used to support validity inferences. Synthesizing validity evidence of observational systems is important for improving the quality of classroom observational research, thereby improving the capacity of this kind of research to inform research, practice, and policy aimed at improving the quality of instruction. Thus, we take a comprehensive approach, looking across the history of codified special education practice. Of note, the past 45 years have seen important changes in measurement theory (e.g., American Educational Research Association, 2013; Kane, 2010), as well as changes in how scholars conceptualize strong instruction for students with disabilities (Brownell et al., 2010), and where students with disabilities are served (e.g., Williamson et al., 2019). These changes likely require adjustments to the validity evidence needed to support inferences; as such, we attended to changes in validity evidence reported over this time frame.

Our goal in conducting this systematic review was to understand the extent to which observation studies provide compelling evidence to support common inferences related to observation research and how reporting of that evidence has shifted from 1975 to 2020, such that scholars reading these studies could appraise whether the validity evidence would support their intended uses. The following research questions guided our analysis:

Research Question 1: What evidence have researchers provided to support scoring, generalization, and extrapolation inferences in classroom observation studies?

Research Question 2: In what ways has the evidence researchers provided for these inferences changed from 1975 to 2020?

Method

Search and Inclusion Procedures

To locate relevant articles, we aligned search strategies with guidance for systematic reviews (Alexander, 2020). Figure 1 illustrates our search, following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA; Moher et al., 2009) guidelines for identification, screening, eligibility, and inclusion processes in a systematic review.

Figure 1.

Study selection process.

Inclusion Criteria

We sought published, peer-reviewed observational studies from 1975 to 2020 that targeted in-service teachers in K–12 settings serving students with disabilities. We included studies that met the following criteria: (a) published between 1975 and 2020; (b) peer-reviewed research; (c) focused on in-service teachers of students with disabilities; (d) methods employed systematic classroom observations, with observation foci determined a priori; (e) focused on teachers’ classroom activities; (f) observations were conducted in natural school settings; and (g) observation results were reported separately from other data.

Time period: 1975 to 2020

Using policy to demarcate the bounds of our search, we began with articles published in 1975, the earliest year in which a major law codified special education. The U.S. Congress passed PL 94-142, the Education for All Handicapped Children Act, in 1975, which required schools to provide equal access to education for students with disabilities. In subsequent years, such policies were increasingly enacted internationally (e.g., in 1983 Finland’s Parliament passed the New Comprehensive Schools Act; in 1986 China adopted the Compulsory Education Law of the People’s Republic of China). We used 1975 as the start for our search because PL 94-142 is the earliest known national legislation for special education.

Peer-reviewed research

Aligned with our focus on evidence for validity arguments within published research, we included only peer-reviewed, published research in our final corpus, excluding dissertations, working papers, and other gray literature.

Teachers of students with disabilities

Because our purpose was to identify evidence supporting validity arguments in observations of teachers’ classroom practice when serving students with disabilities, we included articles focused on teachers’ behaviors in classrooms serving students with disabilities. We defined a teacher as an in-service teacher in any K–12 setting (e.g., general education, self-contained, juvenile justice); though studies of preservice teachers may include systematic observations, we decided the extrapolation inference related to teaching quality in preservice settings would likely be conceptually distinct from inferences related to in-service teachers. Some studies focused on observing students, and only examined teachers’ instruction incidentally (most notably, the Code for Instructional Structure and Student Academic Response; e.g., Rotholz et al., 1989); because of their central focus on students rather than teachers, we determined these studies did not directly address our research aim. We also excluded articles in which teaching students with disabilities was not an explicit focus of the observation, such as studies in which students with disabilities were present in classes, but research questions and data analysis did not explicitly address instruction for students with disabilities. For example, Gage et al. (2018) conducted observations of teachers’ implementation of evidence-based classroom management practices, but analysis and results were not disaggregated by student population. Though students with disabilities were likely included in these classes, there was not an explicit examination of the practices provided to this population of students, so this study was excluded from analysis.

Systematic classroom observations

To be included, studies had to use observation systems in which coding schemes were determined a priori. We excluded studies in which researchers analyzed field notes inductively. For example, we excluded a study by Moin et al. (2009) in which data collection consisted of free narrative paragraphs in an observation journal. Though inductive approaches provide important information about classroom practice, the concept of validity is less relevant than other methodological criteria (e.g., trustworthiness, credibility) for inductive data collection and analytic methods (Trainor & Graue, 2014).

Observations focused on classroom activities

We only included articles that focused on teachers’ classroom activities. We excluded articles that employed observations of teachers outside of instruction, such as those that looked at teacher time use more generally. For example, Sargent (1981) collected descriptive observational data regarding teachers’ time use; their data collection focused on time spent providing consultation, training, staffing for Individualized Education Program meetings, and record keeping. Because observed behaviors extended beyond the classroom, we excluded this study. Bounding our corpus in this way allowed us to focus on observation systems used to understand instruction, which are distinct from those employed to understand how teachers collaborate and consult, roles with unique measurement challenges (Jones et al., 2019).

Observations in naturally occurring settings

We only included studies that focused on natural classroom settings, excluding studies that used observation systems to assess the effects of interventions or professional development. For example, Kennedy et al. (2017) assessed effects of professional development on teachers’ vocabulary instruction in inclusive science classes. We excluded these studies because their purpose was to assess how practice changed through intervention, a different use of observation systems that would depend on different validity arguments.

Observation results presented separately

Finally, we only included studies that reported using systematic observations in the Method section and presented findings from these data in the results. Some studies used an observation system to support an overall understanding of participants but did not report observation results distinct from qualitative teacher profiles (e.g., Kelman & Branco, 2004); we excluded these studies. In other studies, observations were conducted to confirm results of other measures, but observation results were not reported. For example, Hang and Rabren (2009) used observations to confirm teachers’ assertions, but did not report observation results; we also excluded these types of studies.

Electronic Search

We screened titles and abstracts in the following electronic databases: PsycINFO and a combined database that included ERIC, Academic Search Complete, Education Full Text, Education Index Retrospective, Education Research Complete, Psychology and Behavioral Sciences Collection, and SPORTDiscus. We used the search terms observ*, “special education,” teacher or teachers or “special education teachers,” NOT pre-k or “pre-kindergarten” or preschool or “early childhood.” The database search resulted in 1,851 potential articles.

Two researchers reviewed each title and abstract, discussing decisions regarding inclusion for full-text review. Before working independently, they reached 100% agreement on 20% of articles. They then separately reviewed titles and abstracts of remaining identified articles in each database and determined whether they met initial inclusion criteria; following this, we met to discuss questionable articles. At this stage, we erred on the side of including false positives; for example, if an abstract mentioned using observation systems but did not specify whether participants were preservice or in-service teachers, we retained it for full text review. Through the electronic search, we identified 262 articles potentially relevant to our purpose.

Additional Search Efforts

Consistent with recommendations for systematic reviews (Alexander, 2020), we used three further strategies to identify relevant studies. Each strategy was enacted by pairs of researchers; after one team member conducted each search, a second member reviewed questionable articles. First, we hand searched the six journals that published the highest numbers of included articles from the database search: Exceptional Children, Teacher Education and Special Education, Learning Disability Quarterly, Remedial and Special Education, American Annals of the Deaf, and Journal of Special Education. This yielded 32 additional articles for full-text review. Next, using referential backtracking, we reviewed references from prior syntheses of observation studies (McKenna et al., 2015; Park et al., 2019; E. A. Swanson, 2008; Vaughn et al., 2002; Walker & Stevens, 2017), identifying two articles. Finally, we compiled a list of all authors involved in the research in our corpus. Based on this list, we reviewed three researchers’ publication records (i.e., Brownell, Johnson, Vaughn) to identify any additional research that warranted our attention. We identified three articles at this stage for full-text review.

Full-Text Review

After completing the search and screening processes, we reviewed the full texts of all identified articles (n = 299) to assess eligibility with regard to our inclusion criteria. As we screened the full text of each article, we refined our inclusion criteria to clarify the conceptual boundaries of our corpus and to ensure included studies meaningfully contributed to our research purpose (final inclusion criteria are described above). Consistent with other reviews, this process was iterative (Alexander, 2020); as such, after finalizing our inclusion criteria, we returned to articles that had previously been considered to determine whether they met our more stringent criteria. In the process of full-text screening, we excluded 198 studies, resulting in a final corpus of 100 articles. Two articles (Ivarie et al., 1984; Thompson et al., 1984) each described two studies; for this reason, across the 100 articles, 102 studies were analyzed.

Data Analysis

Analysis occurred across several phases: code development, code application, and data synthesis. Analysis was informed by previous observation syntheses’ coding structures, the literature on observation methods in educational settings (e.g., Jones & Brownell, 2014; Liu et al., 2019), and the framework proposed by Bell et al. (2012).

Development of Codes

Drawing on sources noted above, we developed codes and definitions to examine the extent to which researchers provided information that would allow readers to appraise evidence supporting scoring, generalization, and extrapolation inferences for their intended uses. Development of coding definitions was iterative (Alexander, 2020). After establishing a preliminary set of codes, we independently coded a subset of articles and met to compare codes and examine disagreements; this helped clarify definitions, provide examples, and establish decision rules. In some cases, information was not clearly defined in an article but was available from other sources (e.g., previous articles). We elected not to consult resources outside articles in our corpus because (a) we did not have access to all outside sources (e.g., protocols), and we wanted to standardize coding; and (b) unclear reporting has implications for practical use of study results. However, we did include codes noting if authors referred to other sources.

Scoring inference

To evaluate evidence to support the scoring inference, we designed codes to address how researchers attended to accuracy, consistency, and bias in the scoring process. We coded for variables related to observer training, the use of ongoing monitoring and recalibration of observers’ scoring, interobserver agreement (IOA), and measures taken to examine results for potential bias (e.g., whether observers were masked to condition, whether researchers evaluated sources of bias in scores; see Table 1 for the full list of codes).

Table 1

Scoring inference codes and definitions

Code	Options (numerical code)	Definition
Double-coding	N/A	Percentage of observations double-coded for reliability (99 = not specified)
Interobserver agreement (IOA)^a	Low (1)	Below 75% agreement, Cohen’s kappa < .61, ICC < .80
	Moderate (2)	75%–89% agreement, Cohen’s kappa = .61–.80, ICC = .80–.89
	High (3)	≥90% agreement, Cohen’s kappa ≥ .81, ICC ≥ .90
	Unspecified statistic (4)	IOA reported, but statistic calculated was not specified
	Other statistic (5)	IOA reported using a statistic other than the above
	N/A (0)	No IOA calculated (or no double coding was done)
	Not specified (99)	Information not provided in article
Number of observers	N/A	Number of observers (0 = not specified)
Observer training	No (0)	No observer training occurred
	Yes (1)	Observer training procedures noted/described in article
	Not specified (99)	Information not provided in article
Acceptable training reliability	No (0)	Trainees not required to reach minimum reliability
	Yes (1)	Trainees required to reach minimum acceptable reliability before collecting data
	N/A (2)	Observers not trained
	Not specified (99)	Information not provided in article
Observers masked	No (0)	Observers aware of study’s purposes
	Yes (1)	Observers unaware of the study’s purposes
	Not specified (99)	Information not provided in article
Trainees’ scores comparison	Other observers’ scores (1)	Observers’ scores during training/calibration were compared with other observer trainees
	Master/“true” scores (2)	Observers’ scores during training/calibration were compared with scores determined by master coders
	Combination (3)	Observers’ scores were compared with each other and to master scores
	N/A (0)	No training reliability was conducted or calculated
	Not specified (99)	Information not provided in article
Continued monitoring/recalibration	No (0)	No ongoing monitoring or recalibration was conducted
	Yes (1)	Researchers conducted ongoing monitoring or at least one recalibration round
	Not specified (99)	Information not provided in article
Validation tests conducted	No (0)	No validation tests conducted
	Yes (1)	Observers were given “test” videos during coding to conduct validation of observation scores
	N/A (2)	Observations were conducted live or based on transcripts
	Not specified (99)	Information not provided in article
Observer assignment	Not described (0)	Article does not include a description of how observers were assigned to teachers or classrooms
	Multiple observers assigned (1)	Teachers were observed by multiple observers (beyond double-coding for reliability)
	Observer expertise considered (2)	Observers purposefully assigned based on experience or expertise in the content area or grade level
	Other (3)	Another method was taken when assigning observers to mitigate impacts of potential observer bias
	Multiple (4)	Multiple methods were used
Examining results for bias	Not described (0)	Researchers do not describe analyzing results for potential bias, or there was only one observer
	Bias based on gender identity (1)	Researchers analyzed results for systematic differences in teachers scores by teacher gender
	Bias based on race/ethnicity (2)	Researchers analyzed results for systematic differences in scores of teachers based on teacher race/ethnicity
	Bias based on observer judgment (3)	Researchers analyzed observation results for systematic differences across observers
	Other (4)	Researchers analyzed observation results for systematic differences based on other teacher or individual factors
	Multiple (5)	Researchers analyzed for multiple types of bias

Note. N/A = not applicable; ICC = intraclass correlation.

^a Thresholds for IOA were taken from Graham et al. (2012).
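One way to express the IOA levels in Table 1 is as a small lookup. This is an illustrative sketch only. The function name `classify_ioa` and the statistic labels are hypothetical and not part of the coding protocol. The cut points assume that the boundary values for the Low and High levels are inequalities (e.g., kappa below .61 is low; kappa of .81 or above is high).

```python
# Illustrative sketch of the Table 1 IOA levels (thresholds attributed to Graham et al., 2012).
# Function and key names are hypothetical, not part of the original coding protocol.

# (moderate_floor, high_floor) for each statistic
THRESHOLDS = {
    "percent_agreement": (75.0, 90.0),  # percent, 0-100
    "kappa": (0.61, 0.81),              # Cohen's kappa
    "icc": (0.80, 0.90),                # intraclass correlation
}


def classify_ioa(statistic, value):
    """Return the Table 1 code: 1 = low, 2 = moderate, 3 = high, 5 = other statistic."""
    if statistic not in THRESHOLDS:
        return 5  # IOA reported with a statistic other than those listed
    moderate_floor, high_floor = THRESHOLDS[statistic]
    if value >= high_floor:
        return 3
    if value >= moderate_floor:
        return 2
    return 1
```

For example, `classify_ioa("kappa", 0.70)` falls in the moderate band, and `classify_ioa("percent_agreement", 92)` falls in the high band.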

Generalization inference

We examined evidence to support generalization inferences using codes about how researchers sampled observations from the population of instruction and accounted for potential “noise” in the data associated with sampling of observations. In this category, we coded for information regarding the number and length of observations; how observations were scheduled; whether results were examined for systematic differences related to contextual factors; and characteristics of the participant sample (see Table 2).

Table 2

Generalization inference codes and definitions

Code	Options (numerical code)	Definition
Setting	General education (1)	Observation(s) took place in a general education class
	Special education (2)	Observation(s) took place in self-contained classes
	General and special education (3)	Observation(s) took place across general and special education classes
	Other (4)	Setting did not fit either general or special education
	Not specified (99)	Information not provided in article
Level	Elementary (1)	Observation(s) took place in Grades K–5
	Secondary (2)	Observation(s) took place in Grades 6–12
	Both (3)	Elementary and secondary teachers were included
Content area	Cross-curricular (1)	Observation(s) occurred across multiple content areas
	Language arts (2)	Observation(s) conducted in reading and/or writing
	Mathematics (3)	As defined and reported in the study
	Science (4)
	Social studies (5)
	Other (6)	None of the above (e.g., physical education)
	Not specified (99)	Information not provided in article
Student population: Disability	Varied (1)	Students with various disabilities included
	SLD (2)	Student population as reported by the researchers
	EBD (3)
	ID (4)
	ASD (5)
	OHI (not ADHD) (6)
	ADHD (7)
	D/HI (8)
	B/VI (9)
	SLI (10)
	TBI (11)
	OI (12)
	Deaf-Blindness (13)
	Not specified (99)	Information not provided in article
Race/ethnicity reported	Not reported (0)	Race/ethnicity of teacher participants not reported
	Reported (1)	Race/ethnicity of teacher participants reported
Teacher race/ethnicity	White	Number of teachers identified in each category
	Black/African American
	Hispanic/Latinx
	Other
Gender reported	Not reported (0)	Gender of teacher participants not reported
	Reported (1)	Gender of teacher participants reported
Teacher gender	Women	Number of teachers identified in each category
	Men
Disability status reported	Not reported (0)	Disability status of teacher participants not reported
	Reported (1)	Disability status of teacher participants reported
Number of teachers	N/A	Total number of teachers for whom observation data were collected and reported^a
Number of observations	N/A	Average or minimum number of observations conducted per teacher^a
Length of observations	N/A	Average or minimum length of observations in minutes^a
Scheduling observations	Not described (0)	No information provided
	Randomized observations (1)	Randomly selected class/time of day for observation
	Multiple times of day (2)	Scheduled observations at various times of school day
	Multiple times of year (3)	Scheduled observations at different points in the year
	Other (4)	Different method for scheduling observations described; purposefully supports generalization
	Multiple (5)	Researchers used more than one of the methods above
Examining results for impact of context	Not described (0)	Article does not describe analyzing for impact of context
	Time of day (1)	Observation results analyzed for systematic differences related to time of day observations took place
	Time of year (2)	Observation results analyzed for systematic differences related to time of year observations took place
	Other (3)	Observation results analyzed for systematic differences associated with another contextual factor
	Multiple (4)	Observation results analyzed for potential differences related to multiple contextual factors

Note. SLD = specific learning disability; EBD = emotional and behavioral disorders; ID = intellectual disability; ASD = autism spectrum disorder; OHI = other health impairment; ADHD = attention deficit hyperactivity disorder; D/HI = deaf/hearing impairment; B/VI = blindness/visual impairment; SLI = speech or language impairment; TBI = traumatic brain injury; OI = orthopedic impairment; N/A = not applicable.

^a When this number was not explicitly provided, we calculated it based on the information that was reported.

Extrapolation inference

The extrapolation inference refers to the assumption that scores on the measure are related to the broader construct of teaching quality (Bell et al., 2012). Although studies in our corpus focused on different dimensions of instruction, all focused on dimensions their authors posited were relevant to understanding teaching quality. In some studies, researchers made judgments about teaching quality (e.g., quality of classroom management practice), whereas in others, they described behaviors that signaled something specific about teaching quality (e.g., use of different grouping structures, praise statements, or, in contrast, negative regard). Even when researchers only described behaviors, however, observational research carries an underlying assumption that these behaviors are an important part of understanding teaching quality in classrooms serving students with disabilities. Thus, we considered evidence to support the extrapolation inference broadly, examining whether articles provided information that would allow an appraiser to connect the instrument content to an understanding of teaching quality. Coding evidence to support extrapolation inferences involved both deductive and inductive coding. Deductively, we coded for (a) the method researchers used to establish content validity (e.g., expert review, literature review); (b) whether researchers provided definitions and examples of observed variables; (c) whether researchers tested the instrument during planning and development; and (d) whether and how researchers examined relationships between observation data and other measures theoretically related to teaching quality, such as teacher knowledge, student achievement, and student behavior (see Table 3).

Table 3

Extrapolation inference codes and definitions

Code	Options (numerical code)	Definition
Method for establishing content validity	Not described (0)	Method for reviewing/ensuring the value of the observation instrument content not described in article
	Expert review (1)	Sent content to experts for review and feedback
	Systematic literature review (2)	Content of the instrument based on systematic literature review
	Nonsystematic literature review (3)	Content of the instrument based on existing literature, but article does not describe a systematic review
	Existing instrument (4)	Article refers to another source of instrument development
	Other (5)	Article describes a different method for ensuring value
	Multiple (6)	Article describes multiple methods for ensuring value
Definition/examples of dimensions observed	None (0)	Does not provide definitions or examples of any observed dimensions
	Definition only (1)	Includes definitions but not examples
	Examples only (2)	Provides examples but not definitions
	Both (3)	Provides definitions and examples
Instrument tested	Not described (0)	Article does not describe pilot testing during instrument development
	Yes (1)	Article describes pilot testing during instrument development (either by them or the developers)
	Existing instrument (2)	Article does not include pilot testing but provides a citation to an existing source for instrument development details
Examined relationship between observation data and other measures	Not applicable (0)	No relationships examined between observation results and another measure
	Another teacher measure (1)	Report analyzing relationships between observation results and another measure related to teachers (e.g., evaluation score, knowledge assessment)
	Student achievement measure (2)	Report analyzing relationships between observation results and a measure of student academic achievement
	Student behavior measure (3)	Report analyzing relationships between observation results and a measure of student behavior or engagement
	Other measure (4)	Report analyzing relationships between observation results and another type of measure other than those above
	Multiple types of measures (5)	Report analyzing relationships with multiple types of measures from those listed above

Inductively, we coded the dimensions of teaching practice that researchers observed. Our intention was to determine which dimensions were observed; we anticipated these might have changed over time, as conceptions of teaching practice for students with disabilities have changed (Brownell et al., 2010). To develop codes for these dimensions, we (a) independently reviewed studies’ purposes, research questions, methods, and results; (b) constructed analytic memos documenting categories and subcategories examined; and (c) met to determine final codes. In this process, we noted considerable variability in how researchers named and defined dimensions of teaching. We collapsed codes into broader categories for the most prominent dimensions of teaching quality researchers examined. Of note, some studies reported observation results as one composite score including multiple dimensions without decomposing results (e.g., Bishop et al., 2010; Nougaret et al., 2005); because we found considerable variability in how these composites were defined, we could not accurately determine the dimension of teaching practice observed, so in our final coding, we coded these as Other. Also in the Other category, some teaching behaviors were so idiosyncratic (e.g., facial expression, teacher positioned to view students) that creating individual categories would have resulted in categories that were meaningless given the breadth of our review. Table 4 defines the final dimension codes.

Table 4

Dimensions of teaching practice codes

Dimension	Definition	Examples of observed items
Classroom management	The specific systems and practices that teachers use to manage the classroom and student behavior	• Rules and routines • Reinforcement or reward systems • Management of transitions
Co-teaching	The roles assumed by co-teachers and the co-teaching strategies used in inclusive classrooms	• Consulting • Parallel teaching • Station teaching
General teaching tasks	The teacher behaviors not related to delivering instruction or managing the classroom	• Monitoring • Assisting students • Making assignments
Instructional grouping	The grouping structures teachers use during instruction	• Whole group • Partner work • Independent work
Instructional practice	The purposeful instructional practices that teachers use during instruction	• Delivering explicit instruction • Fostering engagement • Modeling • Opportunities to respond

Application of Codes

In the next phase, teams of two researchers coded each article using the final coding protocol. During coding, we met biweekly to discuss disagreements and work toward consensus. We calculated Cohen’s kappa (a measure of coding reliability for nominal data that accounts for agreement due to chance) for each code. Kappa values ranged from .386 to 1.00. The average kappa value across all codes was .706, indicating substantial agreement (Landis & Koch, 1977). All kappa values were statistically significant. We used consensus codes to examine patterns and synthesize information in the final stage of analysis.
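As a reminder of how Cohen’s kappa corrects for chance agreement, the following minimal sketch computes it for two coders’ nominal codes as (observed agreement − chance agreement) / (1 − chance agreement). The function name and the example codes are hypothetical and are not data from this review.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length sequences of nominal codes."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("Both coders must rate the same non-empty set of items.")
    n = len(coder_a)
    # Observed agreement: proportion of items given the same code by both coders.
    p_observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of the coders' marginal proportions, summed over codes.
    counts_a, counts_b = Counter(coder_a), Counter(coder_b)
    p_chance = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_chance == 1.0:
        # Both coders used one identical code throughout; kappa is undefined,
        # so this sketch treats the case as perfect agreement.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for one variable (e.g., 0, 1, or 99) across ten studies.
coder_a = [1, 1, 0, 99, 1, 1, 99, 0, 1, 1]
coder_b = [1, 1, 0, 99, 1, 99, 99, 0, 1, 0]
kappa = cohens_kappa(coder_a, coder_b)
```

For these hypothetical codes, observed agreement is .80, chance agreement is .36, and kappa is approximately .69.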

Synthesis

After coding, we used descriptive analysis to provide a broad perspective on the research questions. We compiled, organized, and examined the data by inference (Research Question 1). Finally, we examined data descriptively by decade between 1980 and 2019 (we excluded the 1970s and the year 2020 from this part of the analysis because they were only partial decades) to assess how reporting of validity evidence for observational systems in special education has changed over time (Research Question 2). Given changes in measurement theory (e.g., Kane, 2010), we expected to find a trend toward more persuasive evidentiary support for all three inferences over time.
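The by-decade description can be sketched as follows. The record fields (`year`, `reported_gender`) and the values are hypothetical placeholders, not data from the corpus; only the decade boundaries (1980 to 2019) come from the procedure described above.

```python
from collections import defaultdict


def reporting_rate_by_decade(studies, code, first_year=1980, last_year=2019):
    """Proportion of studies per decade whose binary `code` equals 1.

    Studies outside first_year..last_year (the partial decades) are excluded.
    """
    totals = defaultdict(int)
    reported = defaultdict(int)
    for study in studies:
        year = study["year"]
        if not first_year <= year <= last_year:
            continue
        decade = (year // 10) * 10
        totals[decade] += 1
        if study[code] == 1:
            reported[decade] += 1
    return {decade: reported[decade] / totals[decade] for decade in sorted(totals)}


# Hypothetical records for illustration only.
studies = [
    {"year": 1979, "reported_gender": 0},  # excluded: partial decade
    {"year": 1982, "reported_gender": 0},
    {"year": 1989, "reported_gender": 1},
    {"year": 2005, "reported_gender": 1},
    {"year": 2018, "reported_gender": 1},
    {"year": 2020, "reported_gender": 1},  # excluded: partial decade
]
rates = reporting_rate_by_decade(studies, "reported_gender")
```

Decades with no studies in the sample are simply absent from the result rather than reported as zero, which avoids implying a reporting rate where there is no evidence.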

Results

We identified 100 articles containing 102 studies, spanning the years 1975 to 2020. Most studies were conducted in the United States of America (n = 84), but we identified studies from 13 countries. Across studies, researchers completed >306 observations of >2,369 teachers using >35 different observation instruments. The most commonly used instruments were as follows: CCS (n = 7; McIntosh et al., 1993), Recognizing Effective Special Education Teachers (RESET; n = 6; E. S. Johnson et al., 2018), Flanders Interaction Analysis system (n = 4; originally developed by Stuck & Wyne, 1971), Brophy–Good Dyadic Interactions System (n = 4; originally developed by Brophy & Good, 1970), Classroom Observation Keyed for Effectiveness Research (COKER; n = 3; K. M. Algozzine et al., 1986), and Reading in Special Education (n = 3; Brownell et al., 2009). Most (n = 80) reported descriptive rather than evaluative data, meaning they intended to capture features of what was happening in classroom(s) (e.g., number of times teachers provided feedback) without explicitly evaluating the quality of instruction.

In the following sections, we first present findings regarding the evidence researchers provided to support scoring, generalization, and extrapolation inferences for their uses of observation systems. We then discuss patterns in validity evidence over time. Supplemental Tables S1, S2, and S3 (available in the online version of this article) contain our full coding results related to these three inferences.

Evidence Related to Validity Inferences

Scoring Inference

Many studies in our corpus failed to provide strong evidence to support accuracy, consistency, and lack of bias in the instruments used or how observers interpreted instruments.

Accuracy

In 76 studies, researchers reported observers received training before collecting data; 24 did not report on observer training. The remaining two studies’ descriptions of their process indicated they did not provide training (Jordan et al., 1997; McKenzie, 2009). In 52 studies, researchers specified trainees had to reach at least moderate reliability (i.e., at least 75% agreement, .61 Cohen’s kappa, or .80 ICC). Forty-six studies reported what observers’ scores were compared with during training; of those, 16 studies (35%) compared trainees’ scores with those of other observer trainees, 29 (63%) compared trainees’ scores with “master” scores, and one used a combination.

Consistency

Twelve studies either used only one observer (n = 5) or did not calculate IOA (n = 7). Of the remaining 90 studies, in 34 (38%), researchers did not indicate whether any observations were double-coded. The percentage of observations double-coded in the other studies ranged from 0% to 100% (median = 25%). Of the 59 studies that calculated and reported IOA, 16 (27%) either did not specify the statistic they used or used a statistic other than the ones referenced in Graham et al. (2012), a reference document providing guidelines for interrater reliability for observations. Seven (12%) studies reported low agreement, 23 (39%) reported moderate agreement, and 13 (22%) reported high agreement (see Table 1 for thresholds and Graham et al., 2012, for explanations of the included statistics). Researchers reported conducting recalibration or ongoing monitoring of IOA during data collection in 17 studies. Of 18 studies that used video, none reported conducting validation tests in which researchers embedded “test” videos among actual videos to assess observer reliability.

Lack of bias

Thirty-six studies did not report the number of observers used, and five studies used only one observer, meaning there was no way to statistically test for potential bias. The mean number of observers per study was 6, and the mode was 2 (n = 16). Only seven studies reported the study’s purposes and/or research questions were masked to observers, a strategy that could reduce observer bias. In 77 studies, authors did not report using a purposeful process of assigning observers to reduce the potential for observer bias to impact results. Of the 25 studies that did report such steps, 16 (64%) ensured each teacher was observed by more than one observer for each observation. One study (Cipriano et al., 2018) described randomly assigning video segments to observers. H. L. Swanson (1984) ensured no observer was at the same school more than once and assigned two observers to each observation. Lawson and Cruz (2018) assigned two observers to each video and randomized which video each pair of observers coded. In most studies involving RESET (e.g., Crawford et al., 2019; E. S. Johnson et al., 2021), researchers ensured each video segment was rated by more than one rater and randomized the order of the segments raters watched. Of the 31 studies that spanned elementary and secondary schools and the 49 studies that included multiple content areas, none reported assigning observers to particular grades or content areas based on observer expertise, a potential source of bias because observers could rate teachers systematically differently in contexts in which they held expertise.

Authors of eight studies described examining results for evidence of bias. Chow and Kasari (1999) described conducting preliminary analysis for differences in observation results related to teacher gender or the school at which the observation occurred. The purpose of Lawson and Cruz’s (2018) study was to examine bias related to experience in special education, specifically differences in observation scores between observers with and without such experience. E. S. Johnson and Semmelroth (2015) examined the factors that contributed to variance in observation scores; one of those factors was the rater. Five studies using RESET (Crawford et al., 2020; E. S. Johnson et al., 2021; E. S. Johnson, Moylan, et al., 2019; E. S. Johnson, Zheng, et al., 2019; E. S. Johnson, Zheng, et al., 2020a) examined the relative severity of each individual rater and calculated fair scores compensating for differences in rater severity. No studies reported exploring their results for systematic differences related to teacher race or ethnicity.

Conclusions about evidence to support scoring inference

The number of studies reporting evidence to support scoring inferences was low. A relative strength was the 58% of articles that reported results of IOA calculations, though many of those articles did not specify the statistic used. No one article reported all types of evidence coded in this section. Brady (1989) and Swank et al. (1989) reported most of the information; the only codes from this section they did not report on were whether the study’s purpose was masked to observers, how observers were assigned to classes, and whether researchers analyzed results for potential bias. Studies using RESET also reported most of this information.

Generalization Inference

Most studies did not describe strong evidence to support the generalization inference. Thirty-eight studies took place in inclusive general education classes, 43 in self-contained special education classes, and 15 across both types of classes. One study (Rich, 1979) conducted observations in a summer workshop for students with mild emotional disturbance. Five studies did not report the setting.

Thirty-five studies were conducted in elementary settings, 24 in secondary schools, and 31 across elementary and secondary settings. Twelve did not report grade level(s) observed.

Forty-nine studies examined multiple content areas. Of the 34 studies in a single content area, 23 (68%) observed English language arts; 16 (70%) of these focused on reading instruction. Social studies was the next most commonly observed content area, representing four (12%) of the content-specific studies. Physical education classes were the setting for three studies, and two studies each took place in math and science classes. Eighteen studies did not report a focal content area.

Sixty-nine studies reported the disability category of students in observed classes. Of those, 39 (57%) were conducted with teachers of students with varied disabilities. In disability-specific studies, 10 focused on classes for students with learning disabilities and eight for students with emotional and behavioral disorders. Classes of students with intellectual disabilities were the focus of four studies; another four studies related to students who were Deaf or hard of hearing. One study each focused on classes containing students with autism spectrum disorder (Kodak et al., 2018), visual impairments (Wolffe et al., 2002), acquired Deafblindness (Haakma et al., 2016), and varied physical disabilities (e.g., muscular dystrophy; Brulle et al., 1983). In 33 studies, researchers did not report the disability category of students.

For the 100 studies that reported the number of teacher participants (two did not), the mean was 26, with a range of 1 to 117 and a median of 17. Many studies did not report demographic information about participants. Just over half (n = 56) reported participants’ gender. Across those studies, 85% of teachers were women (15% were men). Only 26 studies reported the race or ethnicity of the participants. Across those studies, 84% of teachers were White, 6% were Black/African American, and 7% were Latinx. Only one study (Serwatka et al., 1986) reported disability status of teacher participants; they observed three pairs of teachers, each with one who was Deaf.

Sixteen studies did not report the number of observations. Of those that did, the mean number of observations per teacher was 3.9 (SD = 3.0), with a mode of 3 (n = 27). Twenty-two studies did not provide enough information for us to calculate average length of observations. In the 80 studies that did report it, the mean observation length was 49.2 minutes, though it varied widely (SD = 63.3); the mode was 20 minutes (n = 11).

Seventy-four studies did not describe purposefully scheduling observations to capture variability in teachers’ instruction (e.g., ensuring all observations did not occur at the same time of day). Of the 28 studies that did report this, 3 (11%) randomized the time of day/block they observed, 2 (7%) scheduled observations across multiple times of day, and 17 (61%) scheduled observations at different times of year. Note that we coded studies in the last category only when authors indicated they spaced observations out purposefully to capture a wider range of instruction; if they only reported the length of time over which observations took place, we did not consider that purposeful scheduling to support generalization.

Many studies (e.g., Englert & Thomas, 1982) reported asking teachers to identify a lesson that represented typical instruction. Because of the likelihood teachers would select a lesson representing better than average instruction, knowing they were being observed, we did not consider this strong evidence for the generalization inference unless researchers took measures to verify the chosen lesson was indeed typical. In one study, for example, researchers recorded and viewed multiple lessons, then selected one that represented typical instruction to code and analyze (Haakma et al., 2016). Five studies used multiple methods of purposeful scheduling. Derevensky and Leckerman (1997), Hillen et al. (1992), and Larrivee (1982) scheduled observations at multiple times of the day and week. Larrivee (1982) also sampled across “activity type” (p. 3), although they did not define it. Wang et al. (2015) and Cipriano et al. (2018) randomized observation times and conducted observations at different points in the year.

Only seven studies reported analyzing for the potential impact of contextual factors on their results. Four of those (Brady & Taylor, 1989; Ciullo et al., 2016; E. S. Johnson, Zheng, et al., 2019; E. S. Johnson, Zheng, et al., 2020a) compared results of observations conducted at different times of the school year. Two studies (E. S. Johnson & Semmelroth, 2015; Stanovich & Jordan, 1998) reported analyzing whether the lesson observed seemed to impact the results. Prior to conducting a full analysis of the observational data, Chow and Kasari (1999) reported checking for systematic differences in results across times of the school year and across schools. When they found no such differences, they collapsed the data and reported results for the full sample.

Conclusions about evidence to support generalization inferences

Most studies reported the number of observations conducted. However, most did not report evidence that they sampled observations in ways that would support generalizing to the population of instruction about which they were drawing conclusions. McIntosh et al. (1993) and the series of studies on RESET were particularly strong examples, as they reported on most of this evidence.

Extrapolation Inference

We found most studies did not provide strong evidence for the extrapolation inference.

Dimensions observed

Dimensions examined varied greatly across studies, including instructional practices, classroom management, general teaching tasks, grouping structures, co-teaching, and other dimensions. Categorizing dimensions proved quite difficult because of the array of terms researchers used to refer to them and the lack of specificity about what they meant by those terms. Notably, this lack of clarity is itself a limitation of many studies’ support for the extrapolation inference, as clarity regarding the constructs to which one is trying to extrapolate is a necessary precondition for providing evidence to support that inference.

Instructional practices

Instructional practice was examined in 89 studies. Researchers conceptualized this dimension in varied ways, focusing on characteristics (e.g., explicitness) or broader practices (e.g., modeling). For example, Klingner et al. (2010) reported how often teachers used reading comprehension strategies such as predicting, paraphrasing, and meta-cognition. Within the instructional practice dimension, we identified a major subset of observation instruments (n = 74; 83%) focused at least partially on discrete teacher behaviors directed toward students, specifically teachers’ verbal behaviors. For example, researchers noted whether feedback was positive or negative and the frequency of different kinds of feedback (e.g., Haakma et al., 2016). Researchers also evaluated various aspects of questioning, such as whether questions were open-ended (e.g., Klingner et al., 2010), academic versus nonacademic (e.g., Brady, 1989), or directed toward the whole class or individual students (e.g., Scott et al., 2011).

Classroom management

The 25 studies coded in this dimension included observations that addressed the systems and practices teachers used to manage the classroom and student behavior. Some focused on specific practices, such as positive reinforcement (B. Algozzine et al., 1988), scanning the classroom (Englert & Thomas, 1982), and redirecting off-task behavior (Levy & Vaughn, 2002). Others measured time spent managing behavior (e.g., King-Sears et al., 2014) or the quality of teachers’ classroom management overall (e.g., Bishop et al., 2010).

General teaching tasks

The 37 studies coded as general teaching tasks observed teachers’ classroom behaviors other than delivering instruction or managing behaviors. For example, researchers tracked how often teachers modified instruction (e.g., K. M. Algozzine et al., 1986) or provided individual support (e.g., Baker & Zigmond, 1990). Many studies documented monitoring, such as circulation during independent or group work (e.g., Ciullo et al., 2016) or assisting students (e.g., Mitchell et al., 2012).

Grouping structure

In 26 studies, researchers tracked grouping structure, primarily by recording the amount of time teachers spent working with different sized groups (e.g., Larrivee, 1982). Typical options included whole group, small group, individual, and no one.

Co-teaching

Eight studies focused on co-teaching, examining interaction patterns between co-teachers, grouping structures (e.g., the six co-teaching models; Friend, 2015), and teacher roles (e.g., teaching, monitoring). Other researchers simply recorded the presence or frequency of co-teaching (e.g., Magiera & Zigmond, 2005).

Other

In 53 studies, researchers observed dimensions of practice that did not fit in the above categories. Some captured teacher characteristics, such as “the ability to work with individuals as well as groups,” “proper listening skills” (B. Algozzine et al., 1988, p. 262), or whether teachers were impartial (e.g., Fletcher et al., 1999). Several collected data on when teachers were absent from the classroom (e.g., Ivarie et al., 1984) and teachers’ physical behavior, such as whether they could see all students (e.g., Englert & Thomas, 1982).

Strength of instrument content

Forty-nine studies did not include any definitions or examples of the dimensions observed. Twenty-two provided definitions only, four provided examples only, and 27 reported both definitions and examples. There are two important caveats to this. One is that we were generous in our coding of this variable, counting even articles that only provided definitions/examples for a subset of the items in the instrument. For example, some researchers (e.g., Rieth et al., 1987) only provided such information about one of many variables. The other caveat is that a definition or example had to be provided in the article itself, given our decision not to examine documents outside of our corpus. Therefore, if researchers only referred to another document (e.g., E. S. Johnson et al., 2021) and did not include any definitions or examples in the article, we coded it as not providing them.

In 32 studies, authors did not describe evidence collected during instrument development to support dimensions of instruction evaluated. Nine studies described conducting systematic literature reviews or soliciting expert reviews. Twenty studies referenced existing literature to support their instrument but did not explicitly describe a systematic review. Six studies described other methods for establishing appropriateness. For example, Webster and Johnson (1987) described a theoretical model on which they built their codebook, and Magiera and Zigmond (2005) and Seo et al. (2008) described basing their instrument’s content on previous studies. In 28 studies, researchers did not describe evidence of appropriateness of content but did provide a citation to previously published work describing the instrument.

Evidence of dimensions’ relationship with teaching quality

One study compared observation scores with other measures for the primary purpose of validating their interpretation of scores as a measure of teaching quality. E. S. Johnson, Crawford, Zheng, et al. (2020) compared rank order of teachers based on scores on the RESET and on an instrument based on Danielson’s Framework for Teaching. In 28 studies researchers compared observation results with results of other measures, examining correlations between scores and other constructs related to teaching quality, such as student behavior or achievement. Eight of the 28 (29%) compared observation results with another measure of teacher knowledge or behavior. For example, Nougaret et al. (2005) compared observation results with a teacher self-assessment. In eight studies, researchers compared observations with measures of student achievement. For example, Korinek (1987) reported correlations between teacher behaviors and correct oral responses from students. Kent et al. (2018) analyzed data using a two-level model to predict student learning by the amount of time teachers spent using specific instructional practices.
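As an illustration only, a rank-order comparison of the kind described above can be sketched with Spearman's rank correlation. The scores and variable names below are hypothetical, and this is not the analysis reported by any study in the corpus; it simply shows how agreement between two instruments' orderings of the same teachers might be quantified.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1 (lowest) upward; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho, computed as the Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long score lists with at least 2 teachers")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined when all scores are tied")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical mean scores for the same five teachers on two instruments.
instrument_a = [3.2, 2.1, 3.8, 2.9, 1.7]
instrument_b = [2.8, 2.5, 3.5, 3.0, 1.9]
print(round(spearman_rho(instrument_a, instrument_b), 2))  # 0.9
```

A value near 1 would indicate that the two instruments order teachers similarly; it would not by itself show that either ordering reflects teaching quality.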

In nine studies, researchers compared teacher behaviors with measures of student behavior, most commonly student engagement/on-task behavior. For example, McKenna et al. (2015) reported correlations between teacher behaviors (e.g., praise, opportunities to respond) and student engagement. Englert and Thomas (1982) defined teacher effectiveness in classroom management by calculating the “mean percentage of total students who were on-task during seatwork activities” (p. 5). They then compared behaviors (e.g., interactions with students, teacher positioning) of the teachers rated as most and least effective, as determined by the student engagement measure. Researchers in three studies examined relationships between multiple measures. Stanovich and Jordan (1998) reported correlations between results from observation scores, a teacher attitude scale, and a principal rating scale. Brownell et al. (2009) compared student reading growth with teachers’ use of specific instructional practices and compared those practices with a measure of teacher knowledge about reading instruction. Finally, David and Kuyini (2012) conducted regression analysis examining the relationship of teacher factors (e.g., gender, age, experience), teacher attitudes about inclusion, self-efficacy, and an observation score representing facets of teacher instruction (not specifically defined).

Conclusions about evidence to support extrapolation inferences

Instructional practice was the most common dimension observed, although it is notable that most of those studies focused on a very specific type of instructional practice: teachers’ verbal behaviors, such as questioning and feedback statements. Slightly more than half of the studies presented definitions or examples for the variables observed within the published article, although some of these only reported them for a portion of the observed variables. About one quarter of included studies compared results from another measure with the observation results, though the methods of comparison were varied and often descriptive in nature, making it difficult to draw conclusions across them.

Changes in Validity Evidence Over Time

The evidence researchers reported to support validity arguments has changed in important ways over time, in some cases becoming stronger and in others becoming weaker. The number of studies per decade varied, with most studies (60%) occurring in the 1980s or the 2010s. There was a dip in studies in the 1990s and 2000s, representing 31% of the total studies, split fairly evenly across the two decades. In the following sections, we report how studies’ descriptions of the evidence for scoring, generalization, and extrapolation inferences have changed over time. Because we analyzed by decade, we did not include studies published before 1980 or after 2019 in the following analysis.

Scoring

Accuracy

The percentage of studies reporting observer training and requiring a minimum level of reliability in training has remained consistently high over time. The percentage of studies using the “gold standard,” comparing trainee observers’ scores with a master coder’s scores, has increased greatly over time, with the notable exception of the 14 studies published between 2000 and 2009, none of which did so. From 2010 to 2019, however, 15 out of 34 studies (44%) employed master scores during training.
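To make the master-score comparison concrete, the following minimal sketch computes two common agreement indices between a trainee's codes and a master coder's codes for the same observed intervals: simple percent agreement and Cohen's kappa, which adjusts for agreement expected by chance. The behavior codes and data are hypothetical, and the reviewed studies may have used other indices or criteria.

```python
from collections import Counter
from typing import Hashable, Sequence


def percent_agreement(trainee: Sequence[Hashable], master: Sequence[Hashable]) -> float:
    """Proportion of intervals on which the trainee's code matches the master code."""
    if len(trainee) != len(master) or not master:
        raise ValueError("need two non-empty code lists of equal length")
    matches = sum(t == m for t, m in zip(trainee, master))
    return matches / len(master)


def cohens_kappa(trainee: Sequence[Hashable], master: Sequence[Hashable]) -> float:
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    n = len(master)
    p_observed = percent_agreement(trainee, master)
    t_counts, m_counts = Counter(trainee), Counter(master)
    # Chance agreement: probability both coders pick the same code independently.
    p_chance = sum(
        t_counts[code] * m_counts[code] for code in set(t_counts) | set(m_counts)
    ) / (n * n)
    if p_chance == 1:
        raise ValueError("kappa is undefined when both coders use a single code")
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes for ten observation intervals.
master_codes = ["praise", "question", "question", "feedback", "praise",
                "question", "feedback", "praise", "question", "feedback"]
trainee_codes = ["praise", "question", "feedback", "feedback", "praise",
                 "question", "feedback", "question", "question", "feedback"]

print(percent_agreement(trainee_codes, master_codes))       # 0.8
print(round(cohens_kappa(trainee_codes, master_codes), 2))  # about 0.70
```

The same functions could be applied to double-coded observations during data collection, in which case the second list would hold another observer's codes rather than master codes.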

Consistency

The percentage of observations that were double-coded in each study was higher in the past two decades (43% and 43%, respectively) when compared with those published in the 1980s and 1990s (20% and 23%, respectively). The percentage of studies in which researchers reported whether they conducted recalibration activities during data collection has stayed consistently low over time.

Lack of bias

The average number of observers per study slightly decreased over time until the 2010s; studies published between 1980 and 1989 reported the highest number, with a mean of 6.8 (SD = 5.0). The percentage of studies reporting that the purpose of the study was masked to the observers has stayed consistently low across time, as has the percentage reporting purposeful assigning of observers and the percentage describing analysis of potential bias.

Generalization

The setting in which observations occurred shifted over time toward more inclusive settings, which would be expected, given the rise in inclusion in recent decades (Williamson et al., 2019). The only decade in which observations in self-contained settings outnumbered those in general education settings was the 1980s, in which 14 studies out of 27 (52%) were set in self-contained settings only. In the 1990s and 2000s, about one third of the studies were conducted solely in self-contained classrooms; in the 2010s, it was about one fourth. The rest were split between general education only and a mix of general and special education classes. Observations in secondary classrooms represented the smallest percentage of studies in each decade, except in the 1980s, when secondary and elementary each represented eight studies (30%). Six of the studies from the 1980s (22%) did not report the grade level, as opposed to only one or two studies in each subsequent decade. Observations in each decade were almost exclusively done across content areas or in English language arts. The few studies (n = 11) that took place in other content areas (i.e., math, science, social studies, physical education) were almost all conducted since 2000, with the exception of social studies; three of those were conducted by the same research group in the 1980s (i.e., Brady, 1989; Brady & Taylor, 1989; Swank et al., 1989).

As expected given advances in technology, all video observations were published after 2000. In every decade, most studies were conducted in classrooms with students identified with varied disabilities. All studies that reported teachers’ race or ethnicity have been published since 1990, and the percentage that reports this information has approximately doubled each decade since, with 15 out of 34 studies (44%) reporting it between 2010 and 2019. Teachers’ gender has been reported in studies in every decade, with the percentage reporting increasing every decade; 25 out of 34 studies since 2010 (74%) reported teachers’ gender.

Interestingly, the average number of observations conducted per teacher has decreased over time, with a mean of 6.1 (SD = 4.8) in the 1980s and a mean of 2.8 (SD = 1.4) in studies in the 2010s. The average length of observations has been more variable, but the mean length has been lower in the two decades since 2000 than in the prior two decades. The percentage of studies reporting purposeful scheduling of observations has remained consistently low over time, as has the percentage reporting analyzing the results for potential impact of context.

Extrapolation

Most studies in our corpus did not compare results of the observations to any other measure. However, one change over time is that six of the eight studies that compared observation results with a measure of student behavior were published in the 2010s.

Discussion

Overall, we found many studies did not report strong evidence that could be used to support inferences associated with validity of observation instruments. We also noted some small but significant changes over time regarding the information that has been reported in observation research of teachers for students with disabilities.

Critique and Directions for Future Research

We first critique the extent to which extant studies have provided evidence to support scoring, generalization, and extrapolation inferences, providing specific recommendations for how researchers could better support these inferences in future research. We then provide general recommendations for how studies could, overall, enhance the validity of their uses of observation systems in future classroom observation research.

Validity Inferences

Scoring inference

Most studies included in our corpus provided minimal evidence to support the scoring inference, but we found relative strengths in the widespread use of rater training and reliability, both of which are intended to provide evidence of the accuracy of raters’ assessments of teaching practice. Given the historical emphasis in special education on identifying discrete, low inference teacher behaviors (e.g., feedback), this emphasis on accuracy is not surprising, as observers must be skilled in precisely identifying and classifying behaviors. Training and reliability are a common way of ensuring observers have a clear understanding of the discrete differences between behaviors and then communicating that to readers.

However, our findings suggest room for growth. In more than half of the included studies, researchers failed to report using any strategies to enhance consistency (e.g., double scoring observations) and minimize bias (e.g., randomly assigning observers, masking observers to condition). This is concerning because claims that comprise the scoring inference are interdependent; accurate scores depend on consistent and unbiased ratings. It is difficult to claim scores appropriately reflect observed teaching practice without attending to the consistency of observers’ assessments and the inherent bias that observers—humans with their own beliefs, experiences, and knowledge—carry into observations (Bell et al., 2019). Omission of strategies to address consistency and bias may result in data that do not account for the extent to which observers’ cognitive work is the foundation of this research (Bell et al., 2019). Strategies such as recalibration (to help correct for observer drift) support observers after initial training, providing them practice and feedback to maintain accuracy and minimize error over the course of a study (Pianta & Hamre, 2016). This type of support is especially important for studies in which observations are conducted over longer periods of time or multiple observers are deployed.

There are several potential explanations for these findings. First, strategies to promote consistency may have been subsumed into researchers’ conception of training, or perhaps these strategies were embedded in studies’ designs but not reported in the articles. This issue could be ameliorated through online supplemental materials, which provide more room for researchers to share technical details. In the future, journal editors could require that authors report information regarding strategies used to train and support raters, thus making this reporting a standard part of the peer review process. Second, it is also possible efforts to enhance the scoring inference were unsuccessful, undermining confidence in findings, and thus researchers chose not to include them. This could be addressed through open science practices (Cook et al., 2018), which place the onus on researchers to fully report and then make sense of their findings in light of these data. A third plausible explanation is that researchers have historically used claims of accuracy (based on training, reliability) as a proxy for more comprehensive evidence to support the scoring inference. In other words, this may be a result of insufficient researcher knowledge. We recommend scholars look to some of the stronger studies in our corpus as exemplars of how to design an observation system to support both accuracy and consistency (e.g., Lawson & Cruz, 2018; Swank et al., 1989; Wexler et al., 2018). For example, Wexler et al. (2018) depict how they trained and supported observers, providing evidence that all scores were attuned to the same standard of measurement. Yet, even in these exemplar studies, researchers did not report using strategies to reduce bias, a significant problem when we consider how observer choices, shaped by knowledge and beliefs, introduce error into data collection (Pianta & Hamre, 2016).

Generalization inference

The generalization inference assumes the instruction observed is representative of the population of instruction about which researchers are drawing inferences. We found many studies did not include information about the contexts and participants in their studies, precluding support for the generalization inference.

From the studies that do report this information, we identified many underrepresented participant groups. We only found one observation study focused on teachers of students with autism spectrum disorder and none for attention deficit hyperactivity disorder. Intervention researchers have identified specific teaching strategies for these populations (e.g., DuPaul et al., 2011; Wong et al., 2015), but extant research provides almost no indication of the prevalence of those strategies in typical school settings (outside of intervention research). Research on secondary teachers is also limited. This is consistent with previous analyses of special education research, which have identified limited research in secondary settings (e.g., McKenna et al., 2015). Though few observation studies focused on co-teaching, they show an opposite trend; there is little observation research on co-teaching in elementary schools. Research is also lacking in content areas other than English language arts. Some instructional practices are recommended across content areas, but there are also specific practices identified as evidence based for students with disabilities in specific content areas (e.g., Gersten et al., 2009; Scruggs et al., 2010); extant research provides scant information about the instructional practice occurring in those content areas for students with disabilities.

One encouraging finding was that researchers are reporting demographic information about teacher participants more frequently as time goes on, which is important in observation research, given that there is some evidence men and teachers from minoritized racial/ethnic groups may systematically receive lower assessments of effectiveness than women and White teachers (e.g., Drake et al., 2019). However, despite increased reporting of demographics, we found researchers have not analyzed their results for these potential biases.

Finally, we found many researchers did not describe scheduling observations to strengthen the generalization of their results to teachers’ typical practice. As with the scoring inference, it may be that researchers schedule observations carefully and simply do not report this. However, without this information, it is difficult for readers to be certain the results truly represent a teacher’s typical practice. Studies have found, for example, teachers’ interactions with students can vary based on the time of day or the type of instructional activity (e.g., Curby et al., 2011). Therefore, if researchers schedule all observations in a short period of time or at the same time of day, they may capture noise associated with the context, limiting their capacity to draw valid inferences about typical practice. Even researchers who did carefully schedule observations to consider context did not consistently analyze their results for potential effects of context.
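One way to spread observations across contexts, sketched here under stated assumptions, is to draw each teacher's observation weeks at random and rotate through time-of-day blocks so that no block repeats until all have been used. The function name, block labels, and 12-week window are hypothetical choices for illustration, not a procedure drawn from the reviewed studies.

```python
import random
from typing import Iterable, Optional, Sequence


def schedule_observations(
    teachers: Sequence[str],
    time_blocks: Sequence[str],
    weeks: Iterable[int],
    n_obs: int,
    seed: Optional[int] = None,
) -> dict[str, list[tuple[int, str]]]:
    """Assign each teacher n_obs (week, time block) slots.

    Weeks are sampled without replacement, so observations are spread over
    the data collection window; time blocks are shuffled per teacher and
    cycled, so each block is used once before any block repeats.
    """
    week_pool = list(weeks)
    if n_obs > len(week_pool):
        raise ValueError("cannot schedule more observations than available weeks")
    rng = random.Random(seed)  # a fixed seed makes the schedule reproducible
    schedule: dict[str, list[tuple[int, str]]] = {}
    for teacher in teachers:
        blocks = list(time_blocks)
        rng.shuffle(blocks)
        chosen_weeks = sorted(rng.sample(week_pool, n_obs))
        schedule[teacher] = [
            (week, blocks[i % len(blocks)]) for i, week in enumerate(chosen_weeks)
        ]
    return schedule


# Hypothetical study: three observations per teacher over a 12-week window.
plan = schedule_observations(
    teachers=["Teacher A", "Teacher B"],
    time_blocks=["morning", "midday", "afternoon"],
    weeks=range(1, 13),
    n_obs=3,
    seed=2021,
)
for teacher, slots in plan.items():
    print(teacher, slots)
```

Recording the resulting week and time block alongside each score would also allow results to be analyzed afterward for potential effects of context.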

Extrapolation inference

When we examined evidence to support the extrapolation inference, the most striking finding was the frequent lack of complete definitions and examples of the observed dimensions of teachers’ practice. In many studies, it was difficult even to ascertain what was being observed. This may be a result of page limitations in publishing, but it is a substantial problem. Even seemingly straightforward variables, such as teacher questioning, could be interpreted differently by researchers. Vague definitions and examples are particularly problematic, however, when researchers are examining broader dimensions of teaching, such as classroom management, which can be interpreted in widely different ways. Without a clear conception of how researchers defined variables of interest, it is difficult to evaluate the extent to which those variables may be related to teaching quality.

Relatedly, many researchers did not report taking steps to verify the appropriateness of the content of their observation instruments. This was largely true even in cases where the researchers developed the instrument for the study or when they adapted an existing instrument. In the latter case, researchers may have assumed because the instrument developers conducted content verifications, they did not need to conduct their own, but to provide evidence for extrapolation, verification of at least the adapted content would be important.

Most studies did not compare observation results with other measures, and only one did so to provide evidence of validity; others examined relationships with other measures to answer a substantive research question (e.g., relationship between knowledge and practice; Brownell et al., 2009). Combined with the lack of detail about content, this means researchers often provided only weak evidence to support the inference that observed variables relate to teaching quality.

General Critique and Directions

These findings also highlight the need for research examining evidence for implications inferences, which was outside the scope of this review. Taking a cue from research examining replication (Therrien et al., 2016), conducting progeny reviews that focus on the most prominently used observation systems in our corpus could help to understand how results of observation research are taken up. This is consistent with advice from researchers in educational measurement to focus attention on established instruments (e.g., Pianta, 2012).

One concern is the sheer number of different observation systems used in special education research, which makes it difficult to draw conclusions across studies. One reason for this variety could be that few instruments have been well validated for use with special educators; because few extant studies provided strong evidence to support validity inferences, researchers may be hesitant to adopt the systems previously used.

One group of studies and related articles presents a stark contrast to the general pattern, and, for this reason, is worth exploring in more detail. In recent years, E. S. Johnson, Crawford, and colleagues have worked to develop the RESET observation system. Their development process is illustrative of a sequence of studies that provide valuable evidence other researchers can use to appraise the validity of RESET for their own purposes. A full description of their development process is beyond the scope of this article; however, the range of validation work the researchers have conducted, aligned with Kane’s (2013) interpretive use argument approach, is summarized in E. S. Johnson, Crawford, Moylan, et al. (2020). Here we provide an overview of the evidence provided in their published articles to demonstrate how providing evidence related to validity might support readers in making appropriate appraisals of classroom observational research.

RESET is an observation system consisting of 21 rubrics, each one aligned to a type of evidence-based practice (e.g., explicit instruction, E. S. Johnson, Zheng, et al., 2019; reading comprehension, E. S. Johnson, Moylan, et al., 2019). The researchers used the evidence-centered design framework to guide their development process (see E. S. Johnson et al., 2018). The content of each rubric was developed by conducting systematic literature reviews related to the central practice, consulting experts for feedback, and watching videos of special educators teaching; the research team is explicit about this process across articles and observation rubrics underneath the broader umbrella of RESET. To examine how the rubrics perform in naturalistic classroom settings, the researchers have conducted a series of studies employing many-facet Rasch measurement, which allows them to investigate the difficulty of each item on the rubric as well as the ways in which various components relate to raters’ scores (e.g., rater severity, rater experience). In one study (E. S. Johnson, Crawford, Zheng, et al., 2020), as mentioned above, they compared teachers’ scores on the RESET with their scores on the Framework for Teaching instrument, an observation instrument that is commonly used in schools but that researchers have found may not adequately support principles of effective special education instruction (see Jones & Brownell, 2014; Morris-Mathews et al., 2020).

The studies conducted thus far presenting the validity evidence for the RESET are rigorous; the evidence provided in the published articles is rich and would provide a wealth of information researchers can use to appraise whether the RESET is appropriate for their intended use(s). For example, researchers would be able to determine at what levels of reliability raters use the rubrics (e.g., E. S. Johnson, Zheng, et al., 2020a), information that can support the scoring inference. Furthermore, the published studies provide a wealth of information regarding factors that contribute to scores on the RESET (e.g., E. S. Johnson, Zheng, et al., 2020b), which can be used to support the generalization inference. Additionally, in support of the extrapolation inference, researchers could identify with which populations of participants the RESET has (and has not) been used and whether time of year of the observations creates systematic differences in scores (e.g., E. S. Johnson, Zheng, et al., 2019). The developers’ intended use of the RESET system is to provide instructional feedback to teachers (E. S. Johnson, Crawford, Moylan, et al., 2020); if a researcher were considering using the RESET for that or another purpose, published studies from this line of research provide a strong foundation for that researcher’s appraisal of the instrument’s validity.

Others in the field of special education are undertaking this work as well; a more recent observation system, initially developed for use with preservice teachers, is undergoing a similar development process. Pua et al. (2021) focused their attention on building a strong validity argument for use of the Preservice Observation Instrument for Special Education with preservice teachers. In subsequent studies, they have built further evidence that supports its use with K–12 special educators (Jones et al., 2021). Examining the ways the RESET and Preservice Observation Instrument for Special Education developers build a validity argument and disseminate their findings, and how other stakeholders such as policy makers, district personnel, and school leaders make sense of and take up those findings, could be an important contribution for future research.

Limitations

As with any research study, there are limitations to ours, which point to potential future research studies that might support or add to our findings. First, we only included observations in naturalistic settings (i.e., not intervention studies); our results may have been different if we had included other types of studies as well. Additionally, we did not examine the results of the included studies or the claims the authors made with regard to those findings. These claims have important implications for validity, depending on whether the researchers’ methods (e.g., instrument chosen, participant sample, observation schedule) support the claims they make. We also did not examine the authors’ stated purposes of the studies, which is another important consideration in the argument approach to validity (i.e., whether the chosen instrument and methods support the stated purpose). We found the purposes were so varied across the 102 studies that an analysis of them across our entire corpus would perhaps not have been meaningful. Additionally, researchers’ claims sometimes did not align with their stated purpose in the beginning of the article, adding complexity to this analysis. An important area for future research would be to analyze stated purposes and claims of a subset of these studies and the alignment of those purposes and claims with the observation systems used. Additionally, we judged as outside the bounds of our study the ways in which the included studies have been taken up by later researchers or policy makers, which are important aspects of the implication inference. This again would be an important area for future research.

Conclusions

Our findings demonstrate the importance of including specific details related to methods taken in observational research. Because findings from observational research are often used as catalysts for intervention studies and as justifications for policy decisions (e.g., increased use of inclusive settings), it is essential that those who wish to take up observational findings be able to appraise the relevance and validity of those findings for their intended purposes. Without information regarding, for example, included participants, dimensions observed, methods for training observers and scheduling observations, and analysis for potential impacts of bias and context, such evaluation is hindered. Improvements in technology and increased expectations for transparency in research methods, combined with the availability of online supplemental materials in publication, should lead to improved reporting in future observational studies.

Supplemental Material

Supplemental material, sj-docx-1-rer-10.3102_00346543211042419 for Observation Studies in Special Education: A Synthesis of Validity Evidence for Observation Systems by Wendy J. Rodgers, Hannah Morris-Mathews, John Elwood Romig and Elizabeth Bettini in Review of Educational Research

Authors

WENDY J. RODGERS is an assistant professor of special education at Virginia Commonwealth University, Box 842020, 1015 West Main Street, Richmond, VA, USA; email: rodgerswj@vcu.edu. Her research focuses on effective practices for inclusive classrooms, co-teaching, and classroom observations of special education teachers.

HANNAH MORRIS-MATHEWS is an assistant professor at the University of Florida, 2 1713L Norman Hall, PO Box 117050, Gainesville, FL, USA; email: hmmathews@coe.ufl.edu. Her research focuses on the ways that special educators’ professional socialization experiences influence their instructional practice.

JOHN ELWOOD ROMIG is an assistant professor at the University of Texas at Arlington, 701 South Nedderman Drive, Arlington, TX, USA; email: john.romig@uta.edu. His research focuses on writing instruction and writing assessment.

ELIZABETH BETTINI is an assistant professor in the Special Education Program at Boston University’s Wheelock College of Education and Human Development, 2 Silber Way, Boston, MA, USA; email: lbettini@bu.edu. Dr. Bettini’s research focuses on the special education teacher workforce, particularly special educators’ working conditions and equitable access to skilled special educators.

References

Alexander, P. A. (2020). Methodological guidance paper: The art and science of quality systematic reviews. Review of Educational Research, 90(1), 6–23. https://doi.org/10.3102/0034654319854352

*Algozzine, B., Morsink, C. V., & Algozzine, K. M. (1988). What’s happening in self-contained special education classrooms? Exceptional Children, 55(3), 259–265. https://doi.org/10.1177/001440298805500309

*Algozzine, K. M., Korinek, Morsink, C. V., & Algozzine, B. (1987). Who’s on first, what’s on second: Questioning in special education classrooms. Journal of Classroom Interaction, 23(1), 22–27. https://www.jstor.org/stable/23869319

*Algozzine, K. M., Morsink, C. V., & Algozzine, B. (1986). Classroom ecology in categorical special education classrooms: And so, they counted the teeth in the horse! Journal of Special Education, 20(2), 209–217. https://doi.org/10.1177/002246698602000207

*Alves, A. J., & Gottlieb. (1986). Teacher interactions with mainstreamed handicapped students and their nonhandicapped peers. Learning Disability Quarterly, 9(1), 77–83. https://doi.org/10.2307/1510404

American Educational Research Association, American Psychological Association, National Council on Measurement in Education, & Joint Committee on Standards for Educational and Psychological Testing. (2013). Standards for educational and psychological testing. American Educational Research Association.

*Ammah, J. O. A., & Hodge, S. R. (2005). Secondary physical education teachers’ beliefs and practices in teaching students with severe disabilities: A descriptive analysis. High School Journal, 89(2), 40–54. https://doi.org/10.1353/hsj.2005.0019

*Anderson, L. F., & Hendrickson, J. M. (2007). Early-career EBD teacher knowledge, ratings of competency importance, and observed use of instruction and management competencies. Education and Treatment of Children, 30(4), 43–65. https://doi.org/10.1353/etc.2007.0019

*Ayers, J. B. (1983). Verbal behavior in resource and regular elementary classrooms. Exceptional Children, 50(2), 176–177. https://doi.org/10.1177/001440298305000212

*Baker, J. M., & Zigmond. (1990). Are regular education classes equipped to accommodate students with learning disabilities? Exceptional Children, 56(6), 515–526. https://doi.org/10.1177/001440299005600603

Bell, C. A., Dobbelaer, M. J., Klette, & Visscher. (2019). Qualities of classroom observation systems. School Effectiveness and School Improvement, 30(1), 3–29. https://doi.org/10.1080/09243453.2018.1539014

Bell, C. A., Gitomer, D. H., McCaffrey, D. F., Hamre, B. K., & Pianta, R. C. (2012). An argument approach to observation protocol validity. Educational Assessment, 17(2–3), 62–87. https://doi.org/10.1080/10627197.2012.715014

*Beyda, S. D., Zentall, S. S., & Ferko, D. J. K. (2002). The relationship between teacher practices and the task-appropriate and social behavior of students with behavioral disorders. Behavioral Disorders, 27(3), 236–255.

*Bishop, A. G., Brownell, M. T., Klingner, J. K., Leko, M. M., & Galman, S. A. C. (2010). Differences in beginning special education teachers: The influence of personal attributes, preparation, and school environment on classroom reading practices. Learning Disability Quarterly, 33(2), 75–92. https://doi.org/10.1177/073194871003300202

*Brady, M. P. (1989). Differential measures of teachers’ questioning in mainstreamed classes: Individual and classwide patterns. Journal of Research and Development in Education, 23(1), 10–17. https://eric.ed.gov/?id=EJ401629

*Brady, M. P., & Taylor, R. D. (1989). Instructional consequences in mainstreamed middle school classes: Reinforcement and corrections. Remedial and Special Education, 10(2), 31–36. https://doi.org/10.1177/074193258901000207

Brophy, J. E., & Good, T. L. (1970). Teachers’ communication of differential expectations for children’s classroom performance: Some behavioral data. Journal of Educational Psychology, 61(5), 365–374. https://doi.org/10.1037/h0029908

*Brownell, M. T., Bishop, A. G., Gersten, Klingner, Penfield, R. D., Dimino, Haager, Menon, & Sindelar, P. T. (2009). The role of domain expertise in beginning special education teacher quality. Exceptional Children, 75(4), 391–411. https://doi.org/10.1177/001440290907500401

Brownell, M. T., Sindelar, P. T., Kiely, M. T., & Danielson, L. C. (2010). Special education teacher quality and preparation: Exposing foundations, constructing a new model. Exceptional Children, 76(3), 357–377. https://doi.org/10.1177/001440291007600307

*Brulle, A. R., Barton, L. E., Barton, C. L., & Wharton, D. L. (1983). A comparison of teacher time spent with physically handicapped and able-bodied students. Exceptional Children, 49(6), 543–545. https://doi.org/10.1177/001440298304900609

*Cameron, D. L. (2014). An examination of teacher–student interactions in inclusive classrooms: Teacher interviews and classroom observations. Journal of Research in Special Educational Needs, 14(1), 264–273. https://doi.org/10.1111/1471-3802.12021

*Cameron, D. L., Cook, B. G., & Tankersley. (2012). An analysis of the different patterns of 1:1 interactions between educational professionals and their students with varying abilities in inclusive classrooms. International Journal of Inclusive Education, 16, 1335–1354. https://doi.org/10.1080/13603116.2011.580459

*Chow, V. T., & Kasari. (1999). Task-related interactions among teachers and exceptional, at-risk, and typical learners in inclusive classrooms. Remedial and Special Education, 20(4), 226–232. https://doi.org/10.1177/074193259902000406

*Cipriano, Barnes, Bertoli, & Rivers. (2018). Applying the classroom assessment scoring system in classrooms serving students with emotional and behavioral disorders. Emotional and Behavioural Difficulties, 23(4), 343–360. https://doi.org/10.1080/13632752.2018.1461454

25.

*Ciullo

Ely

McKenna

J. W.

Alves

K. D.

Kennedy

M. J.

(2019). Reading instruction for students with learning disabilities in grades 4 and 5: An observation study. Learning Disability Quarterly, 42(2), 67–79. https://doi.org/10.1177/0731948718806654

26.

*Ciullo

Lembke

E. S.

Carlisle

Thomas

C. N.

Goodwin

Judd

(2016). Implementation of evidence-based literacy practices in middle school Response to Intervention: An observation study. Learning Disability Quarterly, 39(1), 44–57. https://doi.org/10.1177/0731948714566120

27.

Cohen

Goldhaber

(2016). Observations on evaluating teacher performance: Assessing the strengths and weaknesses of classroom observations and value-added measures. In Grissom

J. A.

Youngs

(Eds.), Improving teacher evaluation systems: Making the most of multiple measures (pp. 8–21). Teachers College, Columbia University.

28.

Cook

B. G.

Lloyd

J. W.

Mellor

Nosek

B. A.

Therrien

W. J.

(2018). Promoting open science to increase the trustworthiness of evidence in special education. Exceptional Children, 85(1), 104–118. https://doi.org/10.1177/0014402918793138

29.

*Cosson

Wilton

(1985). Teacher-pupil interaction and oral language development of mildly retarded children. Australia and New Zealand Journal of Developmental Disabilities, 11(2), 97–105. https://doi.org/10.3109/13668258509008753

30.

Crawford

A. R.

Johnson

E. S.

Moylan

L. A.

Zheng

(2019). Variance and reliability in special educator observation rubrics. Assessment for Effective Intervention, 45(1), 27–37. https://doi.org/10.1177/1534508418781010

31.

*Crawford

A. R.

Johnson

E. S.

Zheng

Y. Z.

Moylan

L. A.

(2020). Developing an understanding procedures observation rubric for mathematics intervention teachers. School Science and Mathematics, 120(3), 153–164. https://doi.org/10.1111/ssm.12393

32.

Curby

T. W.

Stuhlman

Grimm

Mashburn

Chomat-Mooney

Downer

Hamre

Pianta

R. C.

(2011). Within-day variability in the quality of classroom interactions during third and fifth grade. Elementary School Journal, 112(1), 16–37. https://doi.org/10.1086/660682

33.

*David

Kuyini

A. B.

(2012). Social inclusion: Teachers as facilitators in peer acceptance of students with disabilities in regular classrooms in Tamil Nadu, India. International Journal of Special Education, 27(2), 157–168. https://eric.ed.gov/?id=EJ982870

34.

*Dembo

Yoshida

Reilly

(1978). Teacher-student interaction in special education classrooms. Exceptional Children, 45(3), 212–213. https://doi.org/10.1177/001440297804500310

35.

*Derevensky

J. L.

Leckerman

(1997). Teachers’ differential use of praise and reinforcement practices. Canadian Journal of School Psychology, 13(1), 15–27. https://doi.org/10.1177/082957359701300102

36.

*DeVoe

McMillen

(1994). Teacher/student dyadic interaction with emotionally disturbed youth. Journal of Instructional Psychology, 21(4), 318–321. https://psycnet.apa.org/record/1995-27210-001

37.

*Dovey

T. M.

Francis

Corbett

Dibb

(2017). Perception and use of reinforcement by special education teachers. Journal of Research in Special Educational Needs, 17(4), 282–293. https://doi.org/10.1111/1471-3802.12386

38.

Drake

Auletto

Cowen

J. M.

(2019). Grading teachers: Race and gender differences in low evaluation ratings and teacher employment outcomes. American Educational Research Journal, 56(5), 1800–1833. https://doi.org/10.3102/0002831219835776

39.

DuPaul

G. J.

Weyandt

L. L.

Janusis

G. M.

(2011). ADHD in the classroom: Effective intervention strategies. Theory Into Practice, 50(1), 35–42. https://doi.org/10.1080/00405841.2011.534935

40.

*Eldar

Ayvazo

Hirschmann

(2018). Descriptive analysis of the instructional control of teachers in a classroom of students with behavioral disorders. Journal of International Special Needs Education, 21(1), 14–20. https://doi.org/10.9782/2159-4341-20.2.111

41.

*Englert

C. S.

Thomas

C. C.

(1982). Management of task involvement in special education classrooms: Implications for teacher preparation. Teacher Education and Special Education, 5(2), 3–10. https://doi.org/10.1177/088840648200500202

42.

*Fletcher

T. V.

Bos

C. S.

Johnson

L. M.

(1999). Accommodating English language learners with language and learning disabilities in bilingual education classrooms. Learning Disabilities Research & Practice, 14(2), 80–91. https://doi.org/10.1207/sldrp1402_2

43.

Friend

(2015). Welcome to co-teaching 2.0. Educational Leadership, 73(4), 16–22. https://eric.ed.gov/?id=EJ1084156

44.

*Gable

Hendrickson

Young

Shores

R. E.

Stowitschek

J. J.

(1982). A comparison of teacher approval and disapproval statements across categories of exceptionality. Journal of Special Education Technology, 6(1), 15–22. https://doi.org/10.1177/016264348300600103

45.

*Gable

R. A.

Hendrickson

J. M.

Young

C. C.

(1984). Teacher behavior in classes for the mentally retarded and multiply handicapped: A comparative analysis. Education and Training of the Mentally Retarded, 19(1), 54–63. https://www.jstor.org/stable/23877216

46.

Gage

N. A.

Scott

Hirn

MacSuga-Gage

A. S.

(2018). The relationship between teachers’ implementation of classroom management practices and student behavior in elementary school. Behavioral Disorders, 43(2), 302–315. https://doi.org/10.1177/0198742917714809

47.

*Gelzheiser

Meyers

(1991). Reading instruction by classroom, remedial, and resource room teachers. Journal of Special Education, 24(4), 512–526. https://doi.org/10.1177/002246699102400409

48.

Gersten

Chard

D. J.

Jayanthi

Baker

S. K.

Morphy

Flojo

(2009). Mathematics instruction for students with learning disabilities: A meta-analysis of instructional components. Review of Educational Research, 79(3), 1202–1242. https://doi.org/10.3102/0034654309334431

49.

*Ghedin

Aquario

(2020). Collaborative teaching in mainstream schools: Research with general education and support teachers. International Journal of Whole Schooling, 16(2), 1–34.

50.

Graham

Milanowski

Miller

(2012). Measuring and promoting inter-rater agreement of teacher and principal performance ratings. Center for Educator Compensation Reform. https://files.eric.ed.gov/fulltext/ED532068.pdf

51.

*Griffin

C. C.

League

M. B.

Griffin

V. L.

Bae

(2013). Discourse practices in inclusive elementary mathematics classrooms. Learning Disability Quarterly, 36(1), 9–20. https://doi.org/10.1177/0731948712465188

52.

*Haakma

Janssen

Minnaert

(2016). Understanding the relationship between teacher behavior and motivation in students with acquired deafblindness. American Annals of the Deaf, 161(3), 314–326. https://doi.org/10.1353/aad.2016.0024

53.

Hang

Rabren

(2009). An examination of co-teaching: Perspectives and efficacy indicators. Remedial and Special Education, 30(5), 259–268. https://doi.org/10.1177/0741932508321018

54.

*Harbort

Gunter

P. L.

Hull

Brown

Venn

M. L.

Wiley

L. P.

Wiley

E. W.

(2007). Behaviors of teachers in co-taught classes in a secondary school. Teacher Education and Special Education, 30(1), 13–23. https://doi.org/10.1177/088840640703000102

55.

*Hardman

Smith

Wall

(2005). Teacher–pupil dialogue with pupils with special educational needs in the National Literacy Strategy. Educational Review, 57(3), 299–316. https://doi.org/10.1080/00131910500149051

56.

*Hillen

Houghton

Bain

(1992). The effects of integrating children with special needs into mainstream classes: A case study. Educational and Child Psychology, 9(4), 52–56. https://psycnet.apa.org/record/1993-31392-001

57.

Individuals with Disabilities Education Act, 20 U.S.C. § 1400 (2004). https://uscode.house.gov/view.xhtml?path=/prelim@title20/chapter33&edition=prelim

58.

*Ivarie

Hogue

Brulle

A. R.

(1984). An investigation of mainstream teacher time spent with students labeled learning disabled. Exceptional Children, 51(2), 142–149. https://doi.org/10.1177/001440298405100206

59.

Johnson

E. S.

Crawford

Moylan

L. A.

Zheng

(2018). Using evidence-centered design to create a special educator observation system. Educational Measurement: Issues and Practice, 37(2), 35–44. https://doi.org/10.1111/emip.12182

60.

Johnson

E. S.

Crawford

Moylan

L. A.

Zheng

(2020). Validity of a special education teacher observation system. Educational Assessment, 25(1), 31–46. https://doi.org/10.1080/10627197.2019.1702461

61.

*Johnson

E. S.

Crawford

A. R.

Zheng

Moylan

L. A.

(2020). Does special educator effectiveness vary depending on the observation instrument used? Educational Measurement: Issues and Practice, 40(1), 36–43. https://doi.org/10.1111/emip.12394

62.

*Johnson

E. S.

Moylan

L. A.

Crawford

Zheng

(2019). Developing a comprehension instruction observation rubric for special education teachers. Reading & Writing Quarterly, 35(2), 118–136. https://doi.org/10.1080/10573569.2018.1521319

63.

*Johnson

E. S.

Semmelroth

C. L.

(2015, Fall). Validating an observation protocol to measure special education teacher effectiveness. Journal of the American Academy of Special Education Professionals, 99–119.

64.

*Johnson

E. S.

Zheng

Crawford

A. R.

Moylan

L. A.

(2019). Developing an explicit instruction special education teacher observation rubric. Journal of Special Education, 53(1), 28–40. https://doi.org/10.1177/0022466918796224

65.

*Johnson

E. S.

Zheng

Crawford

A. R.

Moylan

L. A.

(2020a). Evaluating an explicit instruction teacher observation protocol through a validity argument approach. Journal of Experimental Education. Advance online publication. https://doi.org/10.1080/00220973.2020.1811194

66.

Johnson

E. S.

Zheng

Crawford

A. R.

Moylan

L. A.

(2020b). Examining rater accuracy and consistency with a special education observation protocol. Studies in Educational Evaluation, 64(March), Article 100827. https://doi.org/10.1016/j.stueduc.2019.100827

67.

*Johnson

E. S.

Zheng

Crawford

A. R.

Moylan

L. A.

(2021). The relationship of special education teacher performance on observation instruments with student outcomes. Journal of Learning Disabilities, 54(1), 54–65. https://doi.org/10.1177/0022219420908906

68.

*Johnson

H. A.

Griffith

P. L.

(1986). The instructional patterns of two fourth-grade spelling classes: A mainstreaming issue. American Annals of the Deaf, 131(5), 331–338. https://doi.org/10.1353/aad.2012.1040

69.

Jones

N. D.

Bell

C. A.

Brownell

M. T.

Peyton

Pua

Fowler

Holtzman

(2021). Using classroom observations in the evaluation of special education teachers [Manuscript submitted for publication].

70.

Jones

N. D.

Bettini

Brownell

M. T.

(2019). Competing strands of educational reform policy: Can collaborative school reform and teacher evaluation reform be reconciled? Elementary School Journal, 119(3), 468–486. https://eric.ed.gov/?id=EJ1208116

71.

Jones

N. D.

Brownell

M. T.

(2014). Examining the use of classroom observations in the evaluation of special education teachers. Assessment for Effective Intervention, 39(2), 112–124. https://doi.org/10.1177/1534508413514103

72.

*Jordan

Lindsay

Stanovich

P. J.

(1997). Classroom teachers’ instructional interactions with students who are exceptional, at risk, and typically achieving. Remedial and Special Education, 18(2), 82–93. https://doi.org/10.1177/074193259701800202

73.

Kane

M. T.

(2006). Validation. In Brennan

(Ed.), Educational measurement (4th ed., pp. 17–64). American Council on Education and Praeger.

74.

Kane

M. T.

(2010). Validity and fairness. Language Testing, 27(2), 177–182. https://doi.org/10.1177/0265532209349467

75.

Kane

M. T.

(2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457. https://doi.org/10.1080/02796015.2013.12087465

76.

Kelman

C. A.

Branco

A. U.

(2004). Deaf children in regular classrooms: A sociocultural approach to a Brazilian experience. American Annals of the Deaf, 149(3), 274–280. https://www.jstor.org/stable/26234672

77.

Kennedy

M. J.

Rodgers

W. J.

Romig

J. E.

Lloyd

J. W.

Brownell

M. T.

(2017). Effects of a multimedia professional development package on inclusive science teachers’ vocabulary instruction. Journal of Teacher Education, 68(2), 213–230. https://doi.org/10.1177/0022487116687554

78.

*Kent

S. C.

Wanzek

Martinez

(2018). The application of empirically supported practices in middle school social studies classrooms. Remedial and Special Education, 39(6), 341–352. https://doi.org/10.1177/0741932517744662

79.

*King-Sears

M. E.

Brawand

A. E.

Jenkins

M. C.

Preston-Smith

(2014). Co-teaching perspectives from secondary science co-teachers and their students with disabilities. Journal of Science Teacher Education, 25(6), 651–680. https://doi.org/10.1007/s10972-014-9391-2

80.

*Klingner

J. K.

Urbach

Golos

Brownell

Menon

(2010). Teaching reading in the 21st century: A glimpse at how special education teachers promote reading comprehension. Learning Disability Quarterly, 33(2), 59–74. https://doi.org/10.1177/073194871003300201

81.

*Kodak

Cariveau

LeBlanc

B. A.

Mahon

J. J.

Carroll

R. A.

(2018). Selection and implementation of skill acquisition programs by special education teachers and staff for students with autism spectrum disorder. Behavior Modification, 42(1), 58–83. https://doi.org/10.1177/0145445517692081

82.

*Korinek

(1987). Questioning strategies in special education: Links to teacher efficacy research in general education. Journal of Research and Development in Education, 21(1), 16–22.

83.

*Kurth

J. A.

Born

Love

(2016). Ecobehavioral characteristics of self-contained high school classrooms for students with severe cognitive disability. Research and Practice for Persons with Severe Disabilities, 41(4), 227–243. https://doi.org/10.1177/1540796916661492

84.

*Lanasa

P. J.

III Mayo

N. B.

(1979). Verbal interaction in secondary special education classes: An observational study. High School Journal, 63(3), 120–123.

85.

Landis

J. R.

Koch

G. C.

(1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://www.jstor.org/stable/2529310

86.

*Larrivee

(1982). Identifying effective teaching behaviors for mainstreaming. Teacher Education and Special Education, 5(3), 2–6. https://doi.org/10.1177/088840648200500301

87.

*Lawson

J. E.

Cruz

R. A.

(2018). Evaluating special educators’ classroom performance: Does rater “type” matter? Assessment for Effective Intervention, 43(4), 227–240. https://doi.org/10.1177/1534508417736260

88.

*Levy

Vaughn

(2002). An observational study of teachers’ reading instruction of students with emotional or behavioral disorders. Behavioral Disorders, 27(3), 215–235. https://doi.org/10.1177/019874290202700303

89.

*Litchfield

Lartz

M. N.

(2002). Role analysis of teachers certified in teaching students who are deaf/hard of hearing who team-teach in co-enrollment classrooms. Teacher Education and Special Education, 25(2), 145–153. https://doi.org/10.1177/088840640202500206

90.

Liu

Bell

C. A.

Jones

N. D.

McCaffrey

D. F.

(2019). Classroom observation systems in context: A case for the validation of observation systems. Educational Assessment, Evaluation and Accountability, 31(1), 61–95. https://doi.org/10.1007/s11092-018-09291-3

91.

*Lyon

(1977). Teacher nonverbal behavior related to perceived pupil social-personal attributes. Journal of Learning Disabilities, 10(3), 173–177. https://doi.org/10.1177/002221947701000308

92.

*Maggin

D. M.

Wehby

J. H.

Moore Partin

T. C.

Robertson

Oliver

R. M.

(2011). A comparison of the instructional context for students with behavioral issues enrolled in self-contained and general education classrooms. Behavioral Disorders, 36(2), 84–99. https://doi.org/10.1177/019874291103600201

93.

*Magiera

Zigmond

(2005). Co-teaching in middle school classrooms under routine conditions: Does the instructional experience differ for students with disabilities in co-taught and solo-taught classes? Learning Disabilities Research & Practice, 20(2), 79–85. https://doi.org/10.1111/j.1540-5826.2005.00123.x

94.

*McGhie-Richmond

Underwood

Jordan

(2007). Developing effective instructional strategies for teaching in inclusive classrooms. Exceptionality Education Canada, 17(1), 27–51. https://eric.ed.gov/?id=EJ773905

95.

*McIntosh

Vaughn

Schumm

J. S.

Haager

Lee

(1993). Observations of students with learning disabilities in general education classrooms. Exceptional Children, 60(3), 249–261. https://doi.org/10.1177/001440299406000306

96.

*McKenna

J. W.

Muething

Flower

Bryant

D. P.

Bryant

(2015). Use and relationships among effective practices in co-taught inclusive high school classrooms. International Journal of Inclusive Education, 19(1), 53–70. https://doi.org/10.1080/13603116.2014.906665

97.

McKenna

J. W.

Shin

Ciullo

(2015). Evaluating reading and mathematics instruction for students with learning disabilities: A synthesis of observation research. Learning Disability Quarterly, 38(4), 195–207. https://doi.org/10.1177/0731948714564576

98.

*McKenzie

A. R.

(2009). Emergent literacy supports for students who are deaf-blind or have visual and multiple impairments: A multiple-case study. Journal of Visual Impairment & Blindness, 103(5), 291–302. https://doi.org/10.1177/0145482x0910300507

99.

*Mitchell

B. B.

Deshler

D. D.

Ben-Hanania Lenz

B. K.

(2012). Examining the role of the special educator in a response to intervention model. Learning Disabilities: A Contemporary Journal, 10(2), 53–74. https://eric.ed.gov/?id=EJ998225

100.

Moher

Liberati

Tetzlaff

Altman

D. G.

& the PRISMA Group. (2009). Preferred Reporting Items for Systematic Reviews and Meta-Analyses: The PRISMA statement. PLoS Medicine, 6(7), Article e1000097. https://doi.org/10.1371/journal.pmed.1000097

101.

Moin

L. J.

Magiera

Zigmond

(2009). Instructional activities and group work in the U.S. inclusive high school co-taught science class. International Journal of Science and Mathematics Education, 7(4), 677–697. https://doi.org/10.1007/s10763-008-9133-z

102.

*Moody

S. W.

Vaughn

Hughes

M. T.

Fischer

(2000). Reading instruction in the resource room: Set up for failure. Exceptional Children, 66(3), 305–316. https://doi.org/10.1177/001440290006600302

103.

Morris-Mathews

Stark

K. R.

Jones

N. D.

Brownell

M. T.

Bell

C. A.

(2020). Danielson’s framework for teaching: Convergence and divergence with conceptions of effectiveness in special education. Journal of Learning Disabilities, 54(1), 66–78. https://doi.org/10.1177/0022219420941804

104.

*Nougaret

A. A.

Scruggs

T. E.

Mastropieri

M. A.

(2005). Does teacher education produce better special education teachers? Exceptional Children, 71(3), 217–229. https://doi.org/10.1177/001440290507100301

105.

*Nowacek

E. J.

McKinney

J. D.

Hallahan

D. P.

(1990). Instructional behaviors of more and less effective beginning regular and special educators. Exceptional Children, 57(2), 140–149. https://psycnet.apa.org/record/1991-25800-001

106.

*Nyborg

(2011). The quality of mediational teaching practiced by teachers in special education: An observational study in Norwegian primary schools. International Journal of Special Education, 26(1), 101–111. https://eric.ed.gov/?id=EJ921194

107.

*O’Sullivan

P. J.

Marston

Magnusson

(1987). Categorical special education teacher certification: Does it affect instruction of mildly handicapped pupils? Remedial and Special Education, 8(5), 13–18. https://doi.org/10.1177/074193258700800503

108.

Park

Brownell

M. T.

Bettini

E. F.

Benedict

A. E.

(2019). Multiple dimensions of instructional effectiveness in reading: A review of classroom observation studies and implications for special education classrooms. Exceptionality, 27(1), 1–17. https://doi.org/10.1080/09362835.2017.1283628

109.

*Pennington

R. C.

Courtade

G. R.

(2015). An examination of teacher and student behaviors in classrooms for students with moderate and severe intellectual disability. Preventing School Failure, 59(1), 40–47. https://doi.org/10.1080/1045988X.2014.919141

110.

Pianta

Hamre

B. K.

(2016). Implementing rigorous observation of teachers: Synchronizing theory with systems for implementation and support. In Grissom

J. A.

Youngs

(Eds.), Improving teacher evaluation systems: Making the most of multiple measures (pp. 22–36). Teachers College, Columbia University.

111.

Pianta

R. C.

(2012). Implementing observation protocols: Lessons for K-12 education from the field of early childhood. Center for American Progress.

112.

Pua

D. J.

Peyton

D. J.

Brownell

M. T.

Contesse

V. A.

Jones

N. D.

(2021). Preservice observation in special education: A validation study. Journal of Learning Disabilities, 54(1), 6–19. https://doi.org/10.1177/0022219420920382

113.

*Restorff

D. E.

Abery

B. H.

(2013). Observations of academic instruction for students with significant intellectual disability: Three states, thirty-nine classrooms, one view. Remedial and Special Education, 34(5), 282–292. https://doi.org/10.1177/0741932512474995

114.

*Rich

H. L.

(1979). Classroom interaction patterns among teachers and emotionally disturbed children. Exceptional Child, 26(1), 34–40. https://doi.org/10.1080/0156655790260105

115.

*Rieth

Bahr

Okolo

Polsgrove

Eckert

(1988). An analysis of the impact of microcomputers on the secondary special education classroom ecology. Journal of Educational Computing Research, 4(4), 425–441. https://doi.org/10.2190/c1rd-kybj-59rt-n68n

116.

*Rieth

Polsgrove

Okolo

Bahr

Eckert

(1987). An analysis of the secondary special education classroom ecology with implications for teacher training. Teacher Education and Special Education, 10(3), 113–119. https://doi.org/10.1177/088840648701000303

117.

Rotholz

D. A.

Kamps

D. M.

Greenwood

C. R.

(1989). Ecobehavioral assessment and analysis in special education settings for students with autism. Journal of Special Education, 23(1), 59–81. https://doi.org/10.1177/002246698902300106

118.

Salvia

J. A.

Meisel

C. J.

(1980). Observer bias: A methodological consideration in special education research. Journal of Special Education, 14(2), 261–270. https://doi.org/10.1177/002246698001400213

119.

Sargent

L. R.

(1981). Resource teacher time utilization: An observational study. Exceptional Children, 47(6), 420–425. https://doi.org/10.1177/001440298104700603

120.

*Sazak-Pinar

Güner-Yildiz

(2013). Investigating teachers’ approval and disapproval behaviors towards academic and social behaviors of students with and without special needs. Educational Sciences: Theory and Practice, 13(1), 551–556. https://files.eric.ed.gov/fulltext/EJ1016664.pdf

121.

*Scheetz

N. A.

Martin

D. S.

(2006). Teacher quality: A comparison of national board-certified and non-board-certified teachers of deaf students. American Annals of the Deaf, 151(1), 71–87. https://doi.org/10.1353/aad.2006.0020

122.

*Schumm

J. S.

Moody

S. W.

Vaughn

(2000). Grouping for reading instruction: Does one size fit all? Journal of Learning Disabilities, 33(5), 477–488. https://doi.org/10.1177/002221940003300508

123.

*Scott

T. M.

Alter

P. J.

Hirn

R. G.

(2011). An examination of typical classroom context and instruction for students with and without behavioral disorders. Education and Treatment of Children, 34(4), 619–641. https://doi.org/10.1353/etc.2011.0039

124.

Scruggs

T. E.

Mastropieri

M. A.

Berkeley

Graetz

J. E.

(2010). Do special education interventions improve learning of secondary content? A meta-analysis. Remedial and Special Education, 31(6), 437–449. https://doi.org/10.1177/0741932508327465

125.

*Seo

Brownell

M. T.

Bishop

A. G.

Dingle

(2008). Beginning special education teachers’ classroom reading instruction: Practices that engage elementary students with learning disabilities. Exceptional Children, 75(1), 97–122. https://doi.org/10.1177/001440290807500105

126.

*Serwatka

T. S.

Anthony

R. A.

Simon

S. C.

(1986). A comparison of deaf and hearing teacher effectiveness. American Annals of the Deaf, 131(5), 339–343. https://doi.org/10.1353/aad.2012.0949

127.

*Simmons

R. J.

Magiera

(2007). Evaluation of co-teaching in three high schools within one school district: How do you know when you are truly co-teaching? Teaching Exceptional Children Plus, 3(3), 1–22. http://files.eric.ed.gov/fulltext/EJ967127.pdf

128.

*Sindelar

P. T.

Espin

C. A.

Smith

M. A.

Harriman

N. E.

(1990). A comparison of more and less effective special education teachers in elementary-level programs. Teacher Education and Special Education, 13(1), 9–16. https://doi.org/10.1177/088840649001300102

129.

*Sindelar

P. T.

Smith

M. A.

Harriman

N. E.

Hale

R. L.

Wilson

R. J.

(1986). Teacher effectiveness in special education programs. Journal of Special Education, 20(2), 195–207. https://doi.org/10.1177/002246698602000206

130.

*Stanovich

P. J.

Jordan

(1998). Canadian teachers’ and principals’ beliefs about inclusive education as predictors of effective teaching in heterogeneous classrooms. Elementary School Journal, 98(3), 221–238. https://doi.org/10.1086/461892

131.

Stuck

G. B.

Wyne

M. D.

(1971). Study of verbal behavior in special and regular elementary school classrooms. American Journal of Mental Deficiency, 75(4), 463–469. https://psycnet.apa.org/record/1971-23756-001

132.

*Sucuoglu

Akalin

Sazak-Pinar

(2010). The effects of classroom management on the behaviors of students with disabilities in inclusive classrooms in Turkey. Journal of the International Association of Special Education, 11(1), 64–74. https://eric.ed.gov/?id=EJ947834

133.

*Sutton

J. P.

McKinney

J. D.

Hallahan

D. P.

(1992). Effects of grade level and educational setting on behaviors of beginning learning disabilities teachers. Learning Disabilities Research & Practice, 7(1), 16–24.

134.

*Swank

P. R.

Taylor

R. D.

Brady

M. P.

Freiberg

H. J.

(1989). Sensitivity of classroom observation systems: Measuring teacher effectiveness. Journal of Experimental Education, 57(2), 171–186. https://doi.org/10.1080/00220973.1989.10806504

135.

Swanson

E. A.

(2008). Observing reading instruction for students with learning disabilities: A synthesis. Learning Disability Quarterly, 31(3), 115–133. https://doi.org/10.2307/25474643

136.

*Swanson

E. A.

Solis

Ciullo

McKenna

J. W.

(2012). Special education teachers’ perceptions and instructional practices in response to intervention implementation. Learning Disability Quarterly, 35(2), 115–126. https://doi.org/10.1177/0731948711432510

137.

*Swanson

E. A.

Vaughn

(2010). An observation study of reading instruction provided to elementary students with learning disabilities in the resource room. Psychology in the Schools, 47(5), 481–492. https://doi.org/10.1002/pits.20484

138.

*Swanson

H. L.

(1984). Does theory guide teaching practice? Remedial and Special Education, 5(5), 7–16. https://doi.org/10.1177/074193258400500504

139.

Therrien

W. J.

Mathews

H. M.

Hirsch

S. E.

Solis

(2016). Progeny review: An alternative approach for examining the replication of intervention studies in special education. Remedial and Special Education, 37(4), 235–243. https://doi.org/10.1177/0741932516646081

140.

*Thompson

R. H.

Vitale

P. A.

Jewett

J. P.

(1984). Teacher-student interaction patterns in mainstreamed classrooms. Remedial and Special Education, 5(6), 51–61. https://doi.org/10.1177/074193258400500608

141.

Trainor

A. A.

Graue

(2014). Evaluating rigor in qualitative methodology and research dissemination. Remedial and Special Education, 35(5), 267–274. https://doi.org/10.1177/0741932514528100

142.

Vaughn

Levy

Coleman

Bos

C. S.

(2002). Reading instruction for students with LD and EBD: A synthesis of observation studies. Journal of Special Education, 36(1), 2–13. https://doi.org/10.1177/00224669020360010101

143.

*Vaughn

Moody

S. W.

Schumm

J. S.

(1998). Broken promises: Reading instruction in the resource room. Exceptional Children, 64(2), 211–225. https://doi.org/10.1177/001440299806400205

144.

Walker

M. A.

Stevens

E. A.

(2017). Reading instruction for students with learning disabilities: An observation study synthesis (1980–2014). Learning Disability Quarterly, 40(1), 17–28. https://doi.org/10.1177/0731948716633868

145.

*Wang

Wang

Wen

(2015). Teaching practice of physical education teachers for students with special needs: An application of the theory of planned behaviour. International Journal of Disability, Development and Education, 62(6), 590–607. https://doi.org/10.1080/1034912X.2015.1077931

146.

*Watson

(1995). Teacher talk and pupil thought. Educational Psychology, 15(1), 57–68. https://doi.org/10.1080/0144341950150106

147.

*Webster

R. E.

Johnson

M. M.

(1987). Teacher-student verbal communication patterns in regular and special classrooms. Psychology in the Schools, 24(2), 174–179. https://doi.org/10.1002/1520-6807(198704)24:2<174::AID-PITS2310240213>3.0.CO;2-7

148.

*Wexler

Kearns

D. M.

Lemons

C. J.

Mitchell

Clancy

Davidson

K. A.

Sinclair

A. C.

Wei

(2018). Reading comprehension and co-teaching practices in middle school English language arts classrooms. Exceptional Children, 84(4), 384–402. https://doi.org/10.1177/0014402918771543

149.

Williamson

Hoppey

McLeskey

Bergmann

Moore

(2019). Trends in LRE placement rates over the past 25 years. Journal of Special Education, 53(4), 236–244. https://doi.org/10.1177/0022466919855052

150.

*Wolffe

K. E.

Sacks

S. Z.

Corn

A. L.

Erin

J. N.

Huebner

K. M.

Lewis

(2002). Teachers of students with visual impairments: What are they teaching? Journal of Visual Impairment & Blindness, 96(5), 293–304. https://doi.org/10.1177/0145482x0209600502

151.

Wong

Odom

S. L.

Hume

K. A.

Cox

A. W.

Fettig

Kucharczyk

Brock

M. E.

Plavnick

J. B.

Fleury

V. P.

Schultz

T. R.

(2015). Evidence-based practices for children, youth, and young adults with autism spectrum disorder: A comprehensive review. Journal of Autism and Developmental Disorders, 45(7), 1951–1966. https://doi.org/10.1007/s10803-014-2351-z

152.

Yell

M. L.

Bateman

D. F.

(2019). Free appropriate public education and Endrew F. v. Douglas County School System (2017): Implications for personnel preparation. Teacher Education and Special Education, 42(1), 6–17. https://doi.org/10.1177/0888406417754239

153.

*Yildiz

N. G.

(2015). Teacher and student behaviors in inclusive classrooms. Educational Sciences: Theory & Practice, 15(1), 177–184. https://doi.org/10.12738/estp.2015.1.2155
