Designing an Observation Protocol to Assess Tiered Instruction: An Exploratory Study

Abstract

Accurately measuring high-leverage practices (e.g., HLP 16: explicit instruction) within multitiered system of support (MTSS) instruction is imperative for assessing the efficacy of teacher efforts designed to improve it, yet intricacies associated with assessing such instruction make it difficult. In response to the need for assessments for this purpose, we modified a researcher-developed observation protocol to measure general and special educators’ implementation of HLPs in tiered instruction and explored how teachers implement HLPs across Tiers 1, 2, and 3 using generalizability theory and multifaceted Rasch model analyses. Our results indicate that there was some variability in teaching quality between teachers across tier settings. These findings illustrate how the observation tool and procedures can inform future professional development (PD) research by identifying (a) which HLPs require greater emphasis in PD sessions and (b) how the observation protocol can be improved to better achieve its intended purpose.

Keywords

classroom observation, quantitative methods, instructional strategies, general and special education teachers

Introduction

To improve teaching for students with specific learning disabilities (SWSLDs) and other learners receiving multitiered system of support (MTSS) instruction, researchers need to identify effective professional development (PD) interventions for improving teachers’ instructional practices. Central to this aim is the ability to create measures that can assess the changes teachers make, at each tier, in the instructional practices targeted by specific PD interventions. Furthermore, to refine PD interventions, researchers need to identify those instructional practices that require the most support to change. Finally, to compare PD interventions and identify those that are most effective, researchers need a common metric of instruction. Common measures allow them to comparatively evaluate the effectiveness of PD interventions focused on MTSS instruction in different content areas and at different grade levels (Jones, 2023). Common metrics of instruction are also necessary tools for developing a science of teacher learning (Brownell et al., 2020).

Assessing Tiered Instruction: The Challenges

Developing observation protocols to assess tiered instruction involves common and unique measurement challenges. Evaluation of teachers’ instruction requires researchers to develop observation protocols that capture teaching practices that are common across lessons. In addition, if the instrument is to be used widely for PD studies, researchers must be able to capture behaviors that can be observed across content areas, such as reading and mathematics, and types of students being taught. Prior literature also examines how differences in lessons, students, content, and raters influence teachers’ scores on observation protocols. For example, teachers tended to receive lower ratings on teaching quality when they taught students with disabilities or low-performing students (Campbell & Ronfeldt, 2018; Cohen & Goldhaber, 2016), or when they were evaluated by severe raters (Johnson et al., 2020; Styck et al., 2021). Thus, any common metric of instruction must be able to distinguish teacher performance after accounting for these sources of variability (Hill et al., 2012; Liu et al., 2019).

The development of observation protocols for tiered instruction, however, is further complicated by the fact that an instructional tier may introduce an additional source of variance. Teachers may choose different evidence-based practices (EBPs) at each tier, and the timing and methods of teaching those EBPs may vary depending on the specific student needs being addressed. Tier 1 teachers (i.e., general education teachers, sometimes with the support of co-teachers or paraprofessionals) often provide the full complement of content area instruction, whereas teachers in Tiers 2 and 3 have limited time and are likely to focus on targeted areas. For example, in reading, Tier 1 teachers often address multiple reading skills (e.g., comprehension, decoding, phonological awareness, spelling, fluency, and vocabulary) depending on the grade level taught. Teachers providing Tiers 2 and 3 instruction, however, might target only one or two skills (e.g., multisyllabic decoding and fluency). In addition, teachers might provide instruction in EBPs that complement each other rather than teaching the same EBP. To illustrate, a Tier 3 teacher might provide explicit instruction in basic decoding patterns, whereas the Tier 1 teacher might explicitly teach a multisyllabic word-solving strategy. Finally, teachers provide more intensive instruction (e.g., more modeling, practice opportunities, and feedback) to small groups of students in Tiers 2 and 3. Teachers in Tier 1, however, often provide large-group instruction (Carlisle et al., 2013) to students with a wider range of abilities; thus, teacher–student interactions may be different for Tiers 2 and 3 than they are in Tier 1.

Our Study

In this study, we collected data on teachers’ tiered instruction using an observation protocol originally designed by Pua and colleagues (2021) and adapted for our PD Development and Innovation study (Benedict et al., 2025) funded by the Institute of Education Sciences (IES). We wanted to determine whether data collected using the adapted protocol could be used to (a) distinguish teachers’ instruction at each tier and (b) improve our instrument and observation training procedures before using the protocol in our PD intervention pilot study. To address challenges associated with measuring tiered instruction, we focused on instructional practices common to the effective implementation of EBPs. We examined specific high-leverage practices (HLPs) that have been associated with student achievement (Nelson et al., 2022). HLPs define a broad set of practices that are foundational to the implementation of effective instruction for students with disabilities and to the practice of special education (Aceves & Kennedy, 2024). In this study, we focused on those HLPs that are essential to the effective implementation of EBPs, including explicit instruction, effective classroom management strategies, and responsiveness to students, with the latter defined by specific feedback to students, monitoring student learning, and scaffolding or support for student learning (Nelson et al., 2022). These behaviors were found to distinguish the performance of special education teacher candidates providing small-group reading intervention instruction to first-grade students with reading difficulties (Pua et al., 2021).

Previous Attempts to Measure Tiered Instruction

Prior observation studies of tiered and intervention instruction focused primarily on assessing teachers’ performance at one or two instructional tiers (Brownell et al., 2017; Ciullo et al., 2019; Doabler et al., 2021), whereas only four studies compared instruction across multiple tiers (Al Otaiba et al., 2025; Carlisle et al., 2013; Kent et al., 2017; Solis & McKenna, 2025). Studies of tiered instruction assessed the use of content-specific practices employed mostly in Tier 1 reading instruction and the use of HLPs provided in Tiers 1, 2, or 3 instruction.

Content-Specific Measures

Multiple studies assessed Tier 1 reading instruction using interval recording methods. In one set of studies, researchers used the individualizing student instruction (ISI) classroom observation system to assess teachers’ use of instructional practices in code-focused (i.e., decoding, phonological awareness), fluency, vocabulary, and comprehension content, and the degree to which Tier 1 teachers used teacher-directed or student-managed instruction in each of these areas (Al Otaiba et al., 2016; Connor et al., 2009, 2011, 2014). These researchers showed moderate to high inter-rater reliability across multiple samples (Cohen’s κ = .80; intraclass correlation coefficient [ICC] range: .61–.81). In addition, they found that providing teacher-directed instruction in code-focused, vocabulary, and comprehension instruction predicted the reading achievement in decoding and comprehension of lower-achieving students compared with higher-achieving students. They did not, however, provide information about the degree to which teachers’ ratings were stable across lessons taught and raters.

Other researchers used a similar observation protocol, the Instructional Content Emphasis–Revised (ICE-R; Edmonds & Briggs, 2003), to examine instruction at different tiers. Teachers are scored, using interval recording methods, on time spent: (a) on specific content (e.g., decoding, fluency, vocabulary) and (b) in instructional groupings used to teach content (e.g., whole class, small group). Likert scales have also been employed to score the overall level of student engagement and teaching quality. We were able to identify only three studies that compared instruction across multiple tiers (Al Otaiba et al., 2025; Kent et al., 2017; Solis & McKenna, 2025). For these studies, inter-rater agreement was established before conducting and scoring observations; additional training or reliability checks were provided in two studies to avoid rater drift (Al Otaiba et al., 2025; Kent et al., 2017). Only Solis and McKenna provided inter-rater agreement for scores based on observation field notes compared with scores for live observations during the course of the study. None of the researchers assessed the variability in teachers’ scores attributed to lesson, tier, or rater, although two of the studies’ results suggested that tier may contribute variance to teachers’ scores. In the Solis and McKenna study, teachers provided somewhat different proportions of content instruction in literacy according to tier (e.g., Tier 1 teachers spent more time on writing than Tiers 2 and 3 combined). Furthermore, Al Otaiba and colleagues found that teachers provided more code-focused than meaning-focused instruction in Tier 3 compared with Tier 1, and this difference was more pronounced for first-grade teachers than third-grade teachers. These findings suggest observation protocols focused on content-specific strategies may have limitations when used in PD studies where researchers compare treatment teachers providing multiple tiers of instruction to what teachers provide to those in control groups (Charalambous & Kyriakides, 2017; Sohn, 2023).

High-Leverage Practice Measures

Researchers have also examined general and special educators’ use of HLPs in tiered instruction or special education instruction using a combination of frequency, interval, and Likert rating scales. HLPs are strategies essential to the implementation of effective instruction for students with disabilities or the implementation of the Individuals with Disabilities Education Improvement Act (IDEA, 2004); they represent foundational teaching practices that can be explicitly taught in classroom settings across four key domains (i.e., collaboration, data-driven planning, instruction, and intensive intervention; Aceves & Kennedy, 2024; Nelson et al., 2022).

General Education

Doabler and colleagues (2019, 2021) created instrumentation (i.e., the Classroom Observations of Student-Teacher Interactions–Mathematics [COSTI-M]) to observe how frequently teachers used components of what they described as explicit instruction, such as modeling, practice opportunities, and academic feedback (HLP 16: explicit instruction, HLP 18: promote student engagement, and HLP 22: provide academic feedback), when implementing an evidence-based mathematics curriculum in Tier 1 or 2 instruction. They reported moderate to high levels of inter-rater reliability across domains, with ICCs ranging from .61 to .99. Findings showed that teachers who demonstrated more frequent use of explicit instruction practices achieved higher student outcomes in mathematics in either Tier 1 (Doabler et al., 2019) or Tier 2 (Doabler et al., 2021) settings, suggesting that explicit instruction can distinguish teaching quality in mathematics. The researchers, however, did not assess instruction across tiers, so they could not determine how much variance in teachers’ scores was attributable to tier, lesson, or rater.

Doabler and colleagues (2021) also employed a Likert rating measure to assess the quality of explicit instruction in Tier 2 instruction. Explicit instruction was defined as encompassing HLPs identified by Aceves and Kennedy (2024) (e.g., HLP 16: teacher modeling embedded in explicit instruction, HLP 18: promote student engagement, HLP 22: academic feedback, HLP 15: instructional scaffolding). They found weak-to-moderate relationships between quantity and quality indicators of explicit instruction (range: r = .06–.33), suggesting quantity of instruction does not represent quality. They also showed that frequent use of explicit instruction behaviors was negatively correlated with some student outcomes, whereas instructional quality consistently predicted positive gains across all student outcome measures. These authors demonstrated that quality ratings of explicit instruction might be more robust predictors of student achievement; however, they only assessed instruction at Tier 2.

Carlisle and colleagues examined general education teachers’ use of explicit instruction and found that use was affected by content area and instructional group. Kelcey and Carlisle (2013) found that general education teachers used explicit instruction less in fluency compared with comprehension instruction in Tier 1 instruction. Teachers used explicit instruction (i.e., explanation and modeling) 44% and 38% of the time in fluency, compared to 67% and 82% in comprehension. Furthermore, Carlisle and colleagues (2013) showed that teachers who taught reading lessons in small groups (similar to Tier 2 instruction) implemented explanation and modeling for 52.1% of their time compared to 45.2% in Tier 1 instruction.

Tier 3 or Special Education

Other researchers have shown that observation protocols assessing HLP implementation in Tier 3 or special education instruction (which in many states is Tier 3 instruction) are useful tools for differentiating teacher quality (Johnson et al., 2021; Pua et al., 2021). Johnson and colleagues developed an observation protocol to capture special education teachers’ implementation of explicit, systematic instruction on a 3-point Likert scale. Explicit, systematic instruction was broadly defined and included behaviors identified in Aceves and Kennedy’s (2024) consensus document on HLPs (e.g., HLPs 12 and 13: identifying and communicating lesson goals, HLP 15: systematically faded support, HLP 16: modeling embedded in explicit instruction, HLP 22: constructive feedback, HLP 7: responsive learning environment, HLP 18: promoting student engagement, HLP 20: intensive instruction). Using multifaceted Rasch model (MFRM) analysis, they demonstrated that teachers’ ability to implement explicit, systematic instruction can be distinguished on a common metric across different content areas (i.e., reading, mathematics), and that certain aspects of explicit instruction were more difficult for teachers to implement than others. In their generalizability study (G-study) based on reading and mathematics lessons, the proportion of variation accounted for was as follows: 7.5% by teachers, 12.3% by items, 4.5% by raters, and 3.3% by the interaction between lessons and teachers (Johnson et al., 2020). They also found that explicit instruction partially predicted student growth if specific elements of this instruction (e.g., reviewing prior knowledge, modeling, and providing specific feedback) were well-implemented.

Pua and colleagues (2021) developed the Preservice Observation Instrument for Special Education (POISE) to assess the quality of special education preservice teachers’ Tier 3 reading instruction in a structured curriculum for first-grade students. Although the researchers studied teachers providing reading instruction, they designed POISE with the intention of using it across content areas and grade levels. Preservice Observation Instrument for Special Education consists of 12 items, rated on a 5-point Likert scale, that represent different HLPs and align with three instructional domains: (a) classroom management (CM; HLP 7: responsive learning environment, HLP 8: constructive behavioral feedback, HLP 18: promote student engagement); (b) explicit and systematic instruction (ESI; HLPs 12 and 13: identifying and communicating lesson goals, HLP 15: systematically faded support, HLP 16: explicit instruction); and (c) responsiveness to individual student learning (RISL; HLP 20: intensive instruction and HLP 22: constructive academic feedback). To provide evidence of content validity for these instructional domains, they reviewed the literature, interviewed scholars with expertise in effective special education instruction, and conducted a Q-sort with special education practitioners that involved sorting items into the three identified instructional domains. Finally, they conducted a G-study analysis to determine whether scores on POISE could distinguish the behaviors of teachers. They found that POISE had promising psychometric properties, such as moderate to high reliability coefficients (Weighted κ = .28–.53; α = .87–.93) and moderate-to-large variance (21%–38%) attributable to the teacher facet from a G-study. Raters only accounted for 6% of the variance, and lessons accounted for 5%. These findings indicated that POISE could successfully differentiate among levels of teacher proficiency after controlling for multiple sources of variance (e.g., rater, lesson).

Summary

Findings from these studies suggest that scores on measures of content-specific practices and HLPs can differentiate effective teaching in different content areas and at different grade levels. It should be noted, however, that researchers’ definitions of explicit instruction and other HLPs varied across studies, as did their approach to measurement (e.g., Likert vs. interval), making it difficult to compare findings across studies. Researchers also do not know if the different approaches to assessing HLPs yield data that are comparable across tiers. In fact, findings from Al Otaiba and colleagues (2025), Solis and McKenna (2025), and Kelcey and Carlisle (2013) suggest that interval measures of HLPs and content-specific practices do not yield comparable data across tiers or content areas. One possibility is that Likert ratings, because they assess quality of implementation, are less sensitive than interval and frequency ratings to differences in how often or how long teachers use particular practices in different content areas or at different tiers. This possibility needs to be assessed in future studies.

Purpose of This Study and Research Questions

We explored general and special education teachers’ use of HLPs across Tiers 1, 2, and 3 using the Tiered Instruction Rating Scale (TIERS) protocol, a modified version of the POISE. Using a 7-point Likert scale, teachers’ quality implementation of HLPs was assessed on three constructs previously mentioned that are important for effective instruction of SWSLDs: (a) CM (HLP 7: responsive learning environment, HLP 8: constructive behavioral feedback, HLP 18: promote student engagement); (b) ESI (HLPs 12 and 13: identifying and communicating lesson goals, HLP 15: systematically faded support, HLP 16: explicit instruction); and (c) RISL (HLP 20: intensive instruction and HLP 22: constructive academic feedback).

We wanted to determine if (a) TIERS could distinguish the performance of teachers providing instruction at each instructional tier, (b) raters could successfully apply scores on the Likert scale, and (c) specific items were more difficult than others for teachers. To accomplish these goals, we used G-study and MFRM analyses to answer the following research questions:

Research Question 1: Which factors (teacher, item, domain, rater, lesson, tier) explain variability in teachers’ ratings?

Research Question 2: Which HLPs are most challenging for teachers to implement effectively?

Findings from our study would show if TIERS could isolate the variance unique to teaching quality for each tier of instruction and if raters could successfully apply scoring rules, or if additional rating support was needed. These findings would allow us to determine whether TIERS could be used in our IES professional development (PD) study. In addition, findings from the MFRM analysis would help us identify instructional practices in tiered instruction that require greater teacher support in our PD intervention and in future PD studies. Previous literature has demonstrated (Johnson et al., 2020; Styck et al., 2021) that instructional practices do not present an even level of difficulty.

Method

Participants

Forty-three fourth-grade general and special education teachers from 15 schools (8 treatment, 7 control) in the Southwestern United States participated in the larger PD project (Benedict et al., 2025). These teachers were included in this study (see Table 1). Teachers came from schools housed in four separate school districts; individual schools had free and reduced-price lunch rates that ranged from 10% to 98%, with an average of 56%. These schools varied considerably in terms of the ethnic and linguistic backgrounds of students; overall, approximately 55% of students were White, and 45% were non-White; the majority of the latter were Hispanic (33%). Sixty-three percent of the teachers were general education, and 37% were special education. Approximately 74% of teacher participants were White, and most non-White teachers were Hispanic (9%). Most teachers (60%) had elementary education certificates, 30% had both elementary and special education certificates, and 7% had special education certificates. Teaching experience ranged from 1 year to 15 or more years.

Table 1.

Demographics of 43 Teacher Participants in Tiered Instruction Study: 2019–2020.

Teacher demographics	Treatment school (n = 8)		Control school (n = 7)
Teacher demographics	n	%	n	%
Gender
Male	2	9	1	5
Female	20	91	20	95
Race/Ethnicity
American Indian or Alaskan Native	1	5	0	0
Asian	3	14	0	0
Black or African American	2	9	0	0
Hispanic	3	14	1	5
White	13	59	19	90
Mixed-race	0	0	1	5
Roles
General educator	14	64	13	62
Special educator	8	36	8	38
Highest level of education achieved
Bachelor’s	10	45	14	67
Master’s	12	55	7	33
Certification type
Professional	20	91	20	95
National board	1	5	0	0
Alternate pathway	2	9	1	5
Certification content area
Elementary	15	68	11	52
Special education	1	5	2	10
Elementary and special education	5	23	8	38
No certificate	1	5	0	0
Teaching experience (in years)
1–4	5	23	3	14
5–9	5	23	7	33
10–14	5	23	4	19
15 or more	7	32	7	33
Years at current school
1–4	11	50	10	48
5–9	7	32	6	29
10–14	3	14	4	19
15 or more	1	5	1	5

Observation Protocol: TIERS

TIERS is a modified version of the POISE (Pua et al., 2021) and consists of two components. First, raters use the interval component of TIERS to record the occurrence of effective core instructional behaviors (e.g., modeling, feedback) and content-specific practices (e.g., cognitive strategy for decoding words, summarization strategy) in 30-second segments. The latter were used by researchers in a larger PD study funded by the IES (Benedict et al., 2025) to determine whether teachers were using the EBPs learned. Once raters completed the first component, they used interval data and field notes recorded about implementation of the core instructional behaviors to rate the quality of teachers’ instruction, using a 7-point Likert scale, on CM, ESI, and RISL (see Table 2). TIERS’ Likert items were adapted from POISE to better address general and special education instruction in the three domains of core instruction, and to adjust for problems we experienced scoring the protocol in this study. Teachers were scored on 10 items within the three domains of instruction (i.e., CM, ESI, and RISL; see Table 2) using a 7-point scale rather than the 5-point scale used in POISE. The scale was expanded to allow researchers in the IES PD study to obtain more useful and discriminating information about teaching performance (Hill & Grossman, 2013).

Table 2.

TIERS Likert-Type Rating Form.

Likert item		Rating
Classroom management
1	Uses instructional time efficiently	1	2	3	4	5	6	7
2	Establishes an organized classroom environment	1	2	3	4	5	6	7
3	Employs behavior management strategies	1	2	3	4	5	6	7
4	Creates a positive classroom climate	1	2	3	4	5	6	7
Explicit and systematic instruction
1	Maintains a clear and coherent pedagogical structure	1	2	3	4	5	6	7
2	Models/describes/explains concepts, strategies, and skills effectively	1	2	3	4	5	6	7
3	Provides appropriate practice opportunities	1	2	3	4	5	6	7
Responsiveness to individual student learning
1	Monitors students’ understanding	1	2	3	4	5	6	7
2	Gives timely, appropriate feedback	1	2	3	4	5	6	7
3	Provides support to facilitate student responses	1	2	3	4	5	6	7
Overall rating for instructional quality		1	2	3	4	5	6	7

Our study focuses on data generated from the Part II Likert scale. We wanted to assess validity evidence for the Likert scale as it provides a common metric of instruction across tiers and content areas. If data collected in our study provided evidence that TIERS can effectively distinguish teacher performance, it could be used in the IES PD study (Benedict et al., 2025) to validly assess teachers’ tiered literacy instruction and potentially be used in other studies of tiered instruction in different subject areas, such as a study of tiered mathematics instruction.

Procedures for Collecting and Rating Observation Data

Observation data were collected and rated according to the procedures described below.

Video-Recorded Lessons

The research team for the larger study collected two to five pre-intervention lessons, with an average of four videos per teacher, in fall 2019 using a recording system called Swivl. Researchers commonly recommend using multiple observations per teacher to achieve a reliable estimate of a teacher’s instructional quality across various instructional contexts (Johnson et al., 2018; Johnson et al., 2021). We used data collected during pretest observations, since those lessons best represent teachers’ naturally occurring reading instruction. Our PD intervention may have changed the variability of teachers’ scores on the observation protocol during the implementation phase.

Each video was rated using an online rating application created for the POISE (Pua et al., 2021) and modified for our study. There were 176 pre-intervention videos rated, 30% (n = 53) of which were double-rated using TIERS. A total of 61 videos were collected from Tier 1 settings, 54 from Tier 2 settings, and 61 from Tier 3 settings (i.e., special education). Length of videos averaged 20 min, and the range was 10 to 60 min. Percentage adjacent agreement on the pre-intervention Likert scores was 46.7%; scores within a particular category were considered adjacent (e.g., Levels 1 and 2 are adjacent within the unsatisfactory category). A quadratic weighted kappa score of .42 indicated a moderate level of inter-rater reliability (Cicchetti, 1994). These agreement levels fall below the 80% benchmark often applied to observation data. However, prior large, widely cited studies, such as the Measures of Effective Teaching (MET) project, demonstrate that 80% benchmarks are often statistically impractical due to the inherent, irreducible volatility of instruction across different lessons (Ho & Kane, 2013; Kane & Staiger, 2012). Consequently, to move beyond simple consensus in scoring, we utilized a G-study and MFRM analysis to dissect the underlying score structures and identify detectable teacher signals (Hill et al., 2012; Johnson et al., 2020).
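For readers unfamiliar with the quadratic weighted kappa reported above, the following minimal Python sketch shows how the statistic can be computed for two raters scoring the same videos on a 7-point scale. It is an illustration only, not the procedure or software used in this study; the function name and the example scores are hypothetical.

```python
def quadratic_weighted_kappa(ratings_a, ratings_b, n_categories=7):
    """Cohen's kappa with quadratic weights for ordinal scores 1..n_categories.

    Hypothetical helper for illustration; not the study's own code.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(ratings_a)
    k = n_categories

    # Observed joint proportions of (rater A score, rater B score).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        if not (1 <= a <= k and 1 <= b <= k):
            raise ValueError("scores must lie in 1..n_categories")
        observed[a - 1][b - 1] += 1.0 / n

    # Marginal proportions for each rater.
    marg_a = [sum(row) for row in observed]
    marg_b = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    # Weighted observed and chance-expected disagreement.
    num = 0.0
    den = 0.0
    for i in range(k):
        for j in range(k):
            weight = ((i - j) ** 2) / ((k - 1) ** 2)
            num += weight * observed[i][j]
            den += weight * marg_a[i] * marg_b[j]
    if den == 0.0:
        raise ValueError("kappa is undefined when both raters use one identical category")
    return 1.0 - num / den


# Hypothetical double-rated scores for eight videos.
rater_1 = [3, 4, 5, 2, 6, 4, 3, 5]
rater_2 = [4, 4, 6, 2, 5, 3, 3, 6]
print(round(quadratic_weighted_kappa(rater_1, rater_2), 2))
```

Quadratic weights penalize a two-point disagreement four times as heavily as a one-point disagreement, which is why this statistic is usually preferred over exact agreement for ordinal rating scales.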

Rater Training

Four anchor raters provided 4 days of training for the interval (Part I) and Likert (Part II) portions of the TIERS to 19 raters who varied in educational background and teaching experience. The four anchor raters included the principal investigator (PI) and Co-PI of our larger IES PD study, who had extensive teacher-training experience, as well as two doctoral students who had 5–13 years of Grades K–12 teaching experience and had received rater training from the PI. Each training session took approximately 2 hr and included the following components to help raters standardize application of scoring rules: (a) definition of each interval and Likert item, (b) discussions of examples and non-examples for each item, and (c) example videos representing target practices. After the training sessions, raters received weekly group and/or one-on-one training until they reached a percentage agreement of 80% with anchor raters on three calibration videos (Graham et al., 2012; Pua et al., 2021). We used percentage agreement to estimate inter-rater reliability, defined as the proportion of rating decisions on which the anchor rater and an individual rater made the same decision. We selected this method because it is easy to calculate and provides an intuitive and efficient way to determine how closely individual raters matched the anchor rater (Lombard et al., 2002). After raters were trained to 80% mastery, they received randomly assigned videos, balanced by length of recording, schools (e.g., district), content (e.g., decoding, summarization), and teacher role (i.e., general or special education teacher).
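The percentage agreement criterion described above can be sketched in a few lines of Python. This is a hedged illustration: the function names and scores are hypothetical, and the sketch assumes that agreement is pooled over all rating decisions on a calibration video, which the text does not specify.

```python
def percent_agreement(anchor_scores, rater_scores):
    """Share (0-100) of rating decisions on which rater and anchor match exactly."""
    if len(anchor_scores) != len(rater_scores) or not anchor_scores:
        raise ValueError("need two equal-length, non-empty score lists")
    matches = sum(a == r for a, r in zip(anchor_scores, rater_scores))
    return 100.0 * matches / len(anchor_scores)


def meets_criterion(anchor_scores, rater_scores, criterion=80.0):
    """True when agreement reaches the training criterion (80% in the text)."""
    return percent_agreement(anchor_scores, rater_scores) >= criterion


# Hypothetical scores on the 10 Likert items for one calibration video.
anchor = [4, 5, 3, 6, 4, 2, 5, 4, 3, 5]
trainee = [4, 5, 3, 5, 4, 2, 5, 4, 4, 5]
print(percent_agreement(anchor, trainee))  # 8 of 10 decisions match -> 80.0
print(meets_criterion(anchor, trainee))    # True
```

Because exact agreement ignores how far apart two discrepant scores are, it works as a simple training checkpoint rather than as a full reliability estimate.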

Data Analysis

To explore which facets (e.g., teacher, rater, tier) explain variability in performance scores, we conducted G-study and MFRM analyses for the Likert scale (Hill et al., 2012; Johnson et al., 2020). These analyses partition the observed variance in teacher performance scores into the portions contributed by different facets, whereas traditional reliability statistics (e.g., percentage agreement, Cronbach’s alpha) assume that all the variance is attributed to differences among raters, ignoring other sources of variance (e.g., teacher, item, domain, tier, lesson) that contribute to the final performance scores (Hill et al., 2012).

The G-study allowed us to detect variance components attributable to multiple facets (e.g., teacher, rater, lesson, tier, domain) and interactions among facets for the TIERS Likert scores. We employed a partially nested mixed design with lessons l within teachers t within tiers i crossed with domains d and raters r. In this (l:t:i)dr design, teachers t, lessons l, and raters r are treated as random, while tiers i and domains d are treated as fixed. We selected these conditions because elements are not interchangeable with others within tiers (i.e., Tiers 1, 2, 3) and domain facets (i.e., CM, ESI, RISL). In other words, it is theoretically unreasonable to generalize beyond fixed levels of tier and domain facets in the universe, as each element represents unique constructs of teaching quality (domains d) or instructional settings (tiers i) (Shavelson & Webb, 1991). Meanwhile, we were interested in estimating random effects for other facets to determine whether the rater population can consistently apply the scoring rules to identify different levels of teacher proficiency on the TIERS Likert scale. The G-study model in this study is notated as

\[
X_{(l:t:i)dr} = \mu + \nu_{t} + \nu_{l:t} + \nu_{r} + \nu_{t \times r} + \nu_{l \times r} + \varepsilon_{(l:t:i)dr} \tag{1}
\]

where X_{(l:t:i)dr} is the observed rating for one lesson (l) nested within one teacher (t) nested within a tier (i), crossed with domains (d) and raters (r); μ is the grand mean; ν_{t} is the teacher main effect; ν_{l:t} is the effect for lessons nested within teachers; ν_{r} is the rater main effect; ν_{t×r} is the teacher-by-rater interaction effect; ν_{l×r} is the lesson-by-rater interaction effect; and ε_{(l:t:i)dr} is the residual effect (Brennan, 2003; Huebner & Lucht, 2019). We used Moore’s (2022) gtheory package in R (R Core Team, 2024) to conduct the analysis.
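As a reading aid, the percentage of variance attributed to a facet can be related to Equation 1 as follows. The decomposition below is a sketch that assumes, as is standard in generalizability theory, that the effects in Equation 1 are uncorrelated; the σ² notation is introduced here for illustration and does not appear in the original analysis output.

```latex
\sigma^{2}(X) = \sigma^{2}_{t} + \sigma^{2}_{l:t} + \sigma^{2}_{r}
              + \sigma^{2}_{t \times r} + \sigma^{2}_{l \times r} + \sigma^{2}_{\varepsilon},
\qquad
\text{percentage of variance for teachers} = 100 \times \frac{\sigma^{2}_{t}}{\sigma^{2}(X)}
```

Under this reading, a large teacher share indicates that scores mainly separate teachers from one another, whereas large rater or lesson-within-teacher shares indicate that scores depend heavily on who rated the lesson or which lesson was observed.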

The MFRM study allowed us to analyze TIERS Likert scores, including multiple facets (e.g., teacher, item, domain, rater, tier, lesson) and calibrate interactions among elements within each facet onto a common scale (Wolfe & Dobria, 2008). MFRM is an extended version of the family of Rasch models; it models the item-level scores and offers different psychometric information as compared to the G-study, which models test-level scores. The MFRM provides, among other things, estimates of logit-scale location (e.g., indicators of identifying severe raters), separation statistics to investigate variation between each element, and model fit statistics. Our MFRM model is defined as

\ln\left(\frac{π_{nidrltk}}{π_{nidrlt(k-1)}}\right) = θ_{n} - δ_{i} - α_{d} - β_{r} - r_{l} - ζ_{t} - τ_{k}, \qquad (2)

where teacher n is rated by rater r on item i within domain d within lesson l within tier t. π_{nidrltk} is the probability of teacher n being awarded a score in category k on item i, so the left-hand side of Equation 2 is the log-odds (logit) of being rated in category k rather than category k − 1; θ_n is the performance (latent trait) of teacher n; δ_i is the difficulty of item i; α_d is the difficulty of domain d; β_r is the bias (i.e., severity) of rater r; r_l is the difficulty of lesson l; ζ_t is the difficulty of tier t; and τ_k is the Rasch–Andrich threshold (i.e., step difficulty), the point at which the probability of being observed in category k is the same as that of being observed in category k − 1 (Linacre et al., 2004; Wolfe & Dobria, 2008). We conducted the MFRM study using the Facets software program, Version 3.83.2 (Linacre, 2020).
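To make Equation 2 concrete, the following Python sketch converts a set of facet locations into the probability of each rating category under a rating-scale formulation of the many-facet Rasch model. It is a minimal illustration, not the estimation routine used by Facets. The function name, the argument name `lesson` (standing in for r_l), and all numeric values in the usage lines are hypothetical.

```python
import math


def category_probabilities(theta, delta, alpha, beta, lesson, zeta, taus):
    """Return P(category k), k = 0..K, implied by the adjacent-category logits.

    taus[k - 1] is the Rasch-Andrich threshold between categories k - 1 and k.
    All inputs are on the logit scale.
    """
    # Linear predictor shared by every category step.
    eta = theta - delta - alpha - beta - lesson - zeta

    # log P(k) up to a constant is the running sum of (eta - tau_j) for j <= k.
    log_numerators = [0.0]
    for tau in taus:
        log_numerators.append(log_numerators[-1] + (eta - tau))

    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(value - peak) for value in log_numerators]
    total = sum(weights)
    return [weight / total for weight in weights]


# Hypothetical teacher, item, domain, rater, lesson, and tier locations,
# with three hypothetical thresholds (a four-category scale).
probs = category_probabilities(
    theta=0.8, delta=0.4, alpha=0.1, beta=0.5, lesson=0.0, zeta=0.2,
    taus=[-1.0, 0.0, 1.0],
)
print([round(p, 3) for p in probs])
```

By construction, the log of the ratio of any two adjacent category probabilities equals the linear predictor minus the corresponding threshold, which is the relationship stated in Equation 2.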

Results

Generalizability Study Findings

We conducted a G-study to detect the variance components attributable to multiple facets (e.g., teacher, rater, tier, domain) and interactions between facets (e.g., teacher × rater) for TIERS Likert scores (see Table 3). Analyses of ratings by teacher role and tier, controlling for the domain effect, demonstrate that a larger proportion of the variance was attributed to special education teachers (27.3%) than to general education teachers (13.7%); general educators showed similar levels of variability in the quality of instruction in Tier 1 and Tier 2 settings (9.3% and 7.6%, respectively). This indicates that the TIERS scores better distinguished the performance of special education teachers. Variance associated with a lesson–teacher interaction was substantial across tiers (Tier 1: 33.0%, Tier 2: 18.1%, Tier 3: 29.1%), suggesting that general and special education teachers provided lessons that varied considerably in the degree to which they demonstrated behaviors measured by TIERS.

Table 3.

Variance Decomposition in the TIERS Likert Scores.

Variance component (%)	Analyses of ratings by teacher role		Analyses of ratings by tier			Analyses of ratings by domain		
	General Ed (Tiers 1 and 2)	Special Ed (Tier 3)	Tier 1	Tier 2	Tier 3	CM	ESI	RISL
Teacher (t)	13.7	27.3	9.3	7.6	27.3	13.8	14.2	19.5
Lesson:teacher (l:t)	32.0	29.1	33.0	18.1	29.1	42.0	47.9	29.4
Rater (r)	8.7	9.5	11.4	6.8	9.5	10.0	13.2	9.6
Teacher × rater (t × r)	4.4	12.7	13.2	0.0	12.7	24.7	19.6	34.2
Lesson × rater (l × r)	13.2	0.0	8.2	36.8	0.0	9.6	5.1	7.3
Residual	27.9	21.4	24.9	30.7	21.4	0.0	0.0	0.0

Note. CM = classroom management; ESI = explicit and systematic instruction; RISL = responsiveness to individual student learning.

Variance attributable to a teacher-by-rater interaction in Tier 2 (0.0%) was negligible, whereas in Tiers 1 and 3 (13.2%, 12.7%), it was moderate. There were a few differences in the number of videos provided from Tiers 1 and 2, suggesting that raters might apply different interpretations of the TIERS scoring rules to teachers who taught in Tiers 1 and 3. The variance attributed to a lesson-by-rater interaction in special education settings was marginal (0.0%), while it was moderate in general education settings (13.2%). Interestingly, in the disaggregated model by tier, the lesson-by-rater variance for general educators who taught in Tier 2 settings was considerably greater (36.8%) than that for those who taught in Tier 1 settings (8.2%). This indicates that raters likely assigned diverse scores to Tier 2 lessons. Relatively low variation was attributable to the rater main effect across tiers (Tier 1: 11.4%, Tier 2: 6.8%, Tier 3: 9.5%), implying that raters tended to be consistent with the scoring rules when controlling for other external factors (e.g., teacher, lesson).

We also controlled for the tier effect in separate analyses of ratings by domain. A considerable amount of variance associated with the teacher main effect was detected across domains (CM: 13.8%, ESI: 14.2%, RISL: 19.5%), suggesting that raters were able to differentiate between various levels of teacher proficiency within each construct. Lesson–teacher interaction accounted for the largest portion of variance in TIERS (CM: 42.0%, ESI: 47.9%, RISL: 29.4%), which means teachers demonstrate different levels of domain-related behaviors based on the lesson taught. ESI and CM were more lesson-dependent than RISL.

Multifaceted Rasch Model Study Findings

In addition to the G-study, we conducted the MFRM analysis to obtain further information on psychometric properties of the TIERS Likert scores, such as the Wright map and logit-scale locations, separation statistics, and model fit statistics (see Table 4 and Figure 1). The Wright map in Figure 1 plots the location estimates of individual elements within six latent variables (i.e., teacher, item, domain, rater, tier, lesson). On the logit scale, more difficult elements (e.g., more proficient teachers, more severe raters, more difficult domains) are situated in higher locations.

Table 4.

Separation Statistics.

Statistic type	Latent variables					
	Teacher proficiency	Item difficulty	Domain difficulty	Rater severity	Tier difficulty	Lesson difficulty
M (measure)	0.17	0.00	0.00	0.00	0.00	0.00
SD (measure)	0.55	0.23	0.18	0.46	0.28	0.04
M (SE)	0.11	0.05	0.03	0.09	0.03	0.04
Separation ratio (G)	4.77	4.29	6.11	4.06	9.65	0.00
Separation reliability (Rel)	0.96	0.95	0.97	0.94	0.99	0.00
Fixed χ²	908.4	168.7	80.5	329.7	185.9	3.2
df	42	9	2	15	2	4
Significance (p)	< .001	< .001	< .001	< .001	< .001	.53

Figure 1.

The Wright map displays the estimated locations of individual elements within each of the six latent variables.

The good fit statistics for all facets (e.g., teacher, item, domain, rater, tier, lesson) indicate that the model parameters reported below represent the patterns in the underlying data. For the teacher facet, the separation ratio was 4.77, and the Rel statistic was high and statistically significant: Rel = 0.96, fixed χ²(42) = 908.4, p < .001, indicating that the various levels of teacher proficiency were reliably distinguished. For the item facet, ESI 2 (models concepts effectively) with 0.45 logits was the most difficult item (i.e., the hardest for a teacher to receive a high rating), while CM 4 (creates positive classroom climates) with −0.32 logits was the easiest item. Separation statistics indicate that locations of items were reliably and significantly differentiated, separation ratio: 4.29, Rel = 0.95, fixed χ²(9) = 168.7, p < .001. Separation statistics for the domain facet indicate a substantial amount of difference in domain variance, as the separation ratio was 6.11, and the Rel statistic was 0.97: fixed χ²(2) = 80.5, p < .001. Explicit and systematic instruction was located at the highest logit value of 0.15, suggesting that teachers might receive lower scores in ESI than in other domains. For the rater facet, Rater 16 with 0.89 logits was the most stringent rater, whereas Rater 1 with −0.88 logits was the most lenient rater. A total of 16 raters had distinct levels of severity, separation ratio: 4.06, Rel = 0.94, fixed χ²(15) = 329.7, p < .001. The tier facet also exhibits significantly distinct levels of difficulty; the separation ratio was 9.65, and the Rel statistic was 0.99: fixed χ²(2) = 185.9, p < .001. Tier 1 had the highest logit value of 0.21, implying that teachers who taught in Tier 1 settings received higher scores than teachers who taught in Tier 2 or 3 settings. 
For the lesson facet, separation statistics indicate lesson elements did not show distinct levels of difficulty (separation ratio <1.00), and the different locations of lesson elements were not distinguished: Rel = 0.0, fixed χ²(4) = 3.2, p = .53; this suggests that teachers’ scores were stable across lessons when controlling for other variables (e.g., rater severity, tier difficulty).
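The separation ratios and separation reliabilities in Table 4 are two views of the same quantity. In Rasch measurement the two are conventionally linked by Rel = G² / (1 + G²), and the short Python sketch below applies that relationship to the separation ratios reported in Table 4. The function name is hypothetical, and the sketch assumes the standard definition rather than anything specific to how the software computed the reported values.

```python
def separation_reliability(separation_ratio):
    """Convert a Rasch separation ratio G into separation reliability.

    Uses the conventional identity Rel = G^2 / (1 + G^2).
    """
    g_squared = separation_ratio ** 2
    return g_squared / (1.0 + g_squared)


# Separation ratios (G) as reported in Table 4.
table_4_ratios = {
    "teacher": 4.77,
    "item": 4.29,
    "domain": 6.11,
    "rater": 4.06,
    "tier": 9.65,
    "lesson": 0.00,
}

for facet, g in table_4_ratios.items():
    print(f"{facet}: G = {g:.2f}, Rel = {separation_reliability(g):.2f}")
```

Rounded to two decimals, the computed values agree with the Rel row of Table 4, which illustrates why a separation ratio near zero for the lesson facet corresponds to a reliability of zero.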

Discussion

To improve MTSS instruction, researchers need assessments that can validly measure the effect of PD on changes in teachers’ instruction at each tier. These assessments should generate data that allow researchers to compare treatment and control groups of teachers providing instruction at different tiers. Without this capacity, researchers cannot judge the efficacy of PD efforts. We developed the TIERS protocol to assess common features of effective MTSS instruction (e.g., HLP 16: explicit instruction) regardless of content taught (e.g., word decoding strategies) across instructional tiers. Our observation study focused on evaluating the portion of TIERS that uses Likert rating scales to assess three domains of instruction: CM, ESI, and RISL. G-study and MFRM analyses generated data that showed the TIERS Likert scales can assess teachers’ use of these HLPs at Tiers 1, 2, and 3. The information provided will allow us to improve the TIERS observation rating process and PD for our IES study. In addition, the information provided suggests that with further study, the TIERS observation protocol could be a viable protocol for assessing MTSS instruction in other PD studies.

Scores on TIERS, according to our results, can identify differences in teaching proficiency for general and special education teachers and thus should be able to distinguish between teachers who make changes and those who make little progress in a PD effort focused on MTSS instruction. There are, however, some complexities in our findings worth considering. G-study findings indicated TIERS scores distinguished the instructional performance of special education teachers better than that of general education teachers. The amount of variance attributed to special education teachers was 27.3%, which is within the range of variance captured in other studies attempting to assess the instruction of special education teachers (Semmelroth & Johnson, 2014: 14.8%–21.3%; Peyton, 2019: 13%–33%; Pua et al., 2021: 21%–35%). Meanwhile, the variance attributed to general education teachers was 13.7%, which is comparable to the lower range of score variance identified for other observation tools designed to assess general education instruction (e.g., the Classroom Assessment Scoring System [Mantzicopoulos et al., 2018: 10.9%–44.5%; Mashburn et al., 2014: 16.9%–37.4%] and the Framework for Teaching [Kane & Staiger, 2012: 15.0%–33.0%; Mantzicopoulos et al., 2018: 16.2%–36.4%]). In our study, special education teachers varied more in instructional quality than did general education teachers. The MFRM analysis showed high separation statistics for the tier facet. Furthermore, the Wright map generated for this analysis shows that general education teachers providing instruction in Tiers 1 and 2 settings scored similarly on the TIERS Likert scale, whereas special education teachers in Tier 3 settings scored lower. The two lowest-scoring teachers in our study were special education teachers, and they were notable outliers in the distribution of scores (see Figure 1). We do not know if future studies of special and general education teachers would produce similar results.

Our findings regarding the variability associated with each facet are also complex. Findings from the G-study showed considerable variation in a lesson:teacher interaction for each instructional tier (Tier 1: 33.0%, Tier 2: 18.1%, Tier 3: 29.1%), indicating that individual teachers respond differently to each lesson. These findings align with those from previous research (Mantzicopoulos et al., 2018: 18.2%–30.2%; Semmelroth & Johnson, 2014: 9.5%–14.9%). Interestingly, however, results of the MFRM analysis show there was no variability in the lesson facet, suggesting lessons are of uniform quality after accounting for the effects of other facets (e.g., teacher, rater, domain, tier). Findings from the G-study and MFRM analyses combined suggest that teachers might perform differently across lessons depending on other confounding factors, particularly the tier of instruction.

Results of our analyses also suggest that domains of instruction assessed by TIERS are distinct. The MFRM analysis generated high separation statistics for the domain, suggesting that CM, ESI, and RISL are distinct constructs. In addition, evidence from the G-study shows that raters assign scores in ways that distinguish teachers’ performance at each domain, and that teachers perform more similarly on certain domains than others (CM: 13.8%, ESI: 14.2%, RISL: 19.5%). Furthermore, teachers demonstrated the highest scores on CM and the lowest on ESI, suggesting that teachers may need more support in PD studies for learning how to implement explicit systematic instruction effectively.

In addition, findings from the MFRM analyses provide some insight into those behaviors for which teachers may need robust PD support. Separation statistics for the item facet were high, indicating that teachers demonstrated different levels of quality for individual practices. Teachers had the highest scores on the CM items (e.g., creates positive classroom climates) but displayed the most difficulty with modeling (HLP 16: explicit instruction), followed by coherent pedagogical structure (HLPs 12 and 13: identifying and communicating lesson goals), timely appropriate feedback (HLP 22: constructive academic feedback), and other instructional practices (e.g., HLP 7: responsive learning environment and HLP 8: constructive behavioral feedback). These findings are supported by previous studies using the MFRM analyses; Johnson and colleagues (2020, 2021) found that explaining (defined partly as modeling in HLP 16) and systematically withdrawing support (HLP 15) were the most difficult indicators of explicit instruction for special education teachers to demonstrate. Similarly, general education teachers showed the lowest scores for language modeling and quality of feedback on the Classroom Assessment Scoring System (Styck et al., 2021). We intend to use these results to improve our PD. Researchers and teacher educators can also use results from our study and similarly designed studies to identify HLPs in tiered instruction that may require more time and support to develop.

Raters were also capable of applying TIERS scoring rules, although results seem more nuanced and potentially complicated by the effect of the tier. The variance attributed to the rater main effect in the G-study (general education settings: 8.7%, special education settings: 9.5%) was within the range of rater variance identified in other G-studies of observation protocols (Hill et al., 2012: 6.2%–28.6%; Jones et al., 2022: 1%–17%; Semmelroth & Johnson, 2014: 1.8%–15.9%). In contrast, findings from the MFRM analysis showed a high level of separation statistics for raters, indicating there was variability in the raters’ ability to apply TIERS scoring rules consistently. Interestingly, a modestly higher amount of variance was due to the teacher × rater (general education settings: 4.4%, special education settings: 12.7%) and lesson × rater (general education settings: 13.2%, special education settings: 0.0%) interactions. Moreover, variance attributable to a teacher-by-rater interaction in Tier 2 (0.0%) was negligible, whereas in Tiers 1 and 3 (13.2%, 12.7%) it was moderate. These findings suggest that the teacher × rater and lesson × rater interactions might be confounded by the effect of tier, as teachers did not provide lessons at each tier. In this study, general education teachers taught Tiers 1 and 2 lessons, but special education teachers only taught Tier 3 lessons.

Limitations

Findings from our study provide some evidence supporting the effectiveness of TIERS for assessing tiered instruction. There are, however, limitations to consider. First, our partially nested design makes it difficult to disentangle confounding interactions (e.g., lesson:teacher) or facets (e.g., tier, teacher, lesson). In this study, only 30% of lessons were double-coded. By not double-coding all lessons, we could not account for a greater portion of rater variance that may have been associated with teachers or lessons in our study (Shavelson et al., 1989). However, we should note that this percentage is high compared with that of other studies (Kane & Staiger, 2012; Kent et al., 2017). Furthermore, double-coding all videos would be challenging to implement due to the costs associated with rating videos. In addition, the tier facet is fully confounded with the teacher facet, as special education teachers taught only Tier 3 instruction. Using a fully crossed design, in which teachers provide lessons at each tier, and every rater scores every lesson, would help researchers provide a more precise analysis of variance decomposition in the TIERS scores. Such a design, however, is likely unrealistic and even impossible. Special educators nearly always provide Tier 3 instruction, and never or rarely provide Tiers 1 and 2 instruction.

Second, the data included in our study were constrained to fourth-grade teachers providing naturally occurring reading instruction. Validity evidence from our study could change if TIERS is applied to other instructional contexts, such as different subjects (e.g., mathematics, science), grade levels (e.g., secondary students), or larger populations of teachers. Future research should include a larger and more heterogeneous sample of teachers to better support the generalizability of the TIERS Likert scores and confirm whether this common measure can function as intended with teachers in a variety of instructional settings (e.g., different content areas, grade levels).

Implications for Future Research

Our study provides initial evidence that HLPs represented in TIERS can be used as a common metric of general and special education teachers’ MTSS instruction and may be used to assess the efficacy of MTSS-focused PD interventions in our IES study and those of other researchers. Furthermore, findings from our study complement and expand on those examining observation protocols that assess HLPs (e.g., HLP 7: responsive learning environment, HLP 15: systematically faded support, HLP 16: explicit instruction) for general and special education teachers (Doabler et al., 2021; Johnson et al., 2020; Pua et al., 2021). Taken together, results suggest observation protocols can be constructed to assess certain HLPs, known to be effective for teaching students with disabilities regardless of instructional tier. Moreover, findings from our G-study and MFRM analyses demonstrate that using these two approaches to analyze variance in classroom observations can help researchers identify areas of instruction that may be targeted for additional support in PD efforts and improve observation protocols, including procedures for training raters. To fully examine the potential of TIERS and similar observation protocols for use in PD, however, more research is needed.

First, to fully validate TIERS (and other similar instruments), future research should employ designs where every lesson is coded by at least two different raters. Doing so will allow researchers to more precisely estimate rater effects for TIERS scores. Second, researchers need to analyze the relationship between TIERS scores and student outcomes, such as engagement and academic achievement, which represent important external indicators of teaching quality. Thus, it is imperative to determine whether TIERS scores can be extrapolated to outcomes for students, particularly those served in Tiers 2 and 3, who are most vulnerable to academic failure. Finally, since the TIERS Likert scale is intended to provide a common metric of HLPs within MTSS settings, more research is needed to confirm whether this tool can be used to assess tiered instruction for teachers at different grade levels and in other content areas (e.g., mathematics, primary reading, writing). This research will add to researchers’ understanding of the generalizability of TIERS.

In conclusion, findings from our study provide some initial understanding about how a common metric of HLPs, like TIERS, can be developed to assess general and special education teachers’ tiered instruction. The evidence generated in this study will be used to enhance rater training for our PD study, enhance the support we provide to teachers, and assess changes in teachers’ practice as a result of our PD intervention. Moreover, other researchers could use analyses similar to those used in this study to further develop TIERS or similar instruments to assess the efficacy of MTSS-focused PD interventions. If our field wants to improve PD for MTSS instruction, researchers need valid assessments of general and special education teachers’ HLPs in tiered instruction; these measures should provide a common metric of instructional practice that can be used to compare results across PD studies. Such assessments ultimately will also be crucial for use by practitioners in schools and districts, as they need to know whether their efforts to improve MTSS instruction are effective.

Footnotes

Authors’ Note

The content and opinions expressed do not represent those of IES.

ORCID iD

Hyojong Sohn

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the United States Department of Education, Institute for Education Sciences (IES) and National Center for Special Education Research (R324A170135).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Aceves

T. C.

Kennedy

M. J.

(Eds.). (2024). High-leverage practices for students with disabilities (2nd ed.). Council for Exceptional Children and CEEDAR Center.

Al Otaiba

Folsom

J. S.

Wanzek

Greulich

Waesche

Schatschneider

Connor

C. M.

(2016). Professional development to differentiate kindergarten Tier 1 instruction: Can already effective teachers improve student outcomes by differentiating Tier 1 instruction? Reading & Writing Quarterly, 32(5), 454–476. https://doi.org/10.1080/10573569.2015.1021060

Al Otaiba

Stewart

van Dijk

Conner

Freudenthal

D. R.

Rivas

Yovanoff

Allor

(2025). Comparing Tier 1 reading instruction with Tier 3 or special education intervention through an observational snapshot of school-implemented response to intervention across Grades 1–5. Reading and Writing, 38(4), 1129–1151. https://doi.org/10.1007/s11145-024-10534-7

Benedict

A. E.

Brownell

M. T.

Sohn

Williams

Kelcey

Koziarski

(2025). Project Coordinate: Impact of content-focused lesson study on teacher knowledge, collaboration, and MTSS instruction. Teacher Education and Special Education, 48(1), 26–45. https://doi.org/10.1177/08884064241298261

Brennan

R. L.

(2003). Coefficients and indices in generalizability theory. Center for advanced studies in measurement and assessment. CASMA Research Report, 1, 1–44.

Brownell

M. T.

Jones

N. D.

Sohn

Stark

(2020). Improving teaching quality for students with disabilities: Establishing a warrant for teacher education practice. Teacher Education and Special Education, 43(1), 28–44. https://doi.org/10.1177/0888406419880351

Brownell

Kiely

M. T.

Haager

Boardman

Corbett

Algina

Dingle

M. P.

Urbach

(2017). Literacy learning cohorts: Content-focused approach to improving special education teachers’ reading instruction. Exceptional Children, 83(2), 143–164. https://doi.org/10.1177/0014402916671517

Campbell

S. L.

Ronfeldt

(2018). Observational evaluation of teachers: Measuring more than we bargained for? American Educational Research Journal, 55(6), 1233–1267. https://doi.org/10.3102/0002831218776216

Carlisle

J. F.

Kelcey

Berebitsky

(2013). Teachers’ support of students’ vocabulary learning during literacy instruction in high poverty elementary schools. American Educational Research Journal, 50(6), 1360–1391. https://doi.org/10.3102/0002831213492844

10.

Charalambous

C. Y.

Kyriakides

(2017). Working at the nexus of generic and content-specific teaching practices: An exploratory study based on TIMSS secondary analyses. The Elementary School Journal, 117(3), 423–454. https://doi.org/10.1086/690221

11.

Cicchetti

D. V.

(1994). Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. Psychological Assessment, 6(4), 284. https://doi.org/10.1037/1040-3590.6.4.284

12.

Ciullo

Ely

McKenna

J. W.

Alves

K. D.

Kennedy

M. J.

(2019). Reading instruction for students with learning disabilities in Grades 4 and 5: An observation study. Learning Disability Quarterly, 42(2), 67–79. https://doi.org/10.1177/0731948718806654

13.

Cohen

Goldhaber

(2016). Building a more complete understanding of teacher evaluation using classroom observations. Educational Researcher, 45(6), 378–387. https://doi.org/10.3102/0013189X16659442

14.

Connor

C. M.

Morrison

F. J.

Schatschneider

Toste

J. R.

Lundblom

Crowe

E. C.

Fishman

(2011). Effective classroom instruction: Implications of child characteristics by reading instruction interactions on first graders’ word reading achievement. Journal of Research on Educational Effectiveness, 4(3), 173–207. https://doi.org/10.1080/19345747.2010.510179

15.

Connor

C. M.

Piasta

S. B.

Fishman

Glasney

Schatschneider

Crowe

Underwood

Morrison

F. J.

(2009). Individualizing student instruction precisely: Effects of child × instruction interactions on first graders’ literacy development. Child Development, 80(1), 77–100. https://doi.org/10.1111/j.1467-8624.2008.01247.x

16.

Connor

C. M.

Spencer

Day

S. L.

Giuliani

Ingebrand

S. W.

McLean

Morrison

F. J.

(2014). Capturing the complexity: Content, type, and amount of instruction and quality of the classroom learning environment synergistically predict third graders’ vocabulary and reading comprehension outcomes. Journal of Educational Psychology, 106(3), 762–778. https://doi.org/762.10.1037/a0035921

17.

Doabler

C. T.

Clarke

Kosty

Fien

Smolkowski

Liu

Baker

S. K.

(2021). Measuring the quantity and quality of explicit instructional interactions in an empirically validated Tier 2 kindergarten mathematics intervention. Learning Disability Quarterly, 44(1), 50–62. https://doi.org/10.1177/0731948719884921

18.

Doabler

C. T.

Stoolmiller

Kennedy

P. C.

Nelson

N. J.

Clarke

Gearin

Fien

Smolkowski

Baker

S. K.

(2019). Do components of explicit instruction explain the differential effectiveness of a core mathematics program for kindergarten students with mathematics difficulties? A mediated moderation analysis. Assessment for Effective Intervention, 44(3), 197–211. https://doi.org/10.1177/1534508418758364

19.

Edmonds

Briggs

K. L.

(2003). The instructional content emphasis instrument: Observations of reading instruction. In Vaughn

Briggs

K. L.

(Eds.), Reading in the classroom: Systems for the observation of teaching and learning (pp. 31–52). Brookes.

20.

Graham

Milanowski

Miller

(2012). Measuring and promoting interrater agreement of teacher and principal performance ratings. Center for Educator Compensation Reform. https://files.eric.ed.gov/fulltext/ED532068.pdf

21.

Hill

H. C.

Charalambous

C. Y.

Kraft

M. A.

(2012). When rater reliability is not enough: Teacher observation systems and a case for the generalizability study. Educational Researcher, 41(2), 56–64. https://doi.org/10.3102/0013189X12437203

22.

Hill

H. C.

Grossman

(2013). Learning from teacher observations: Challenges and opportunities posed by new teacher evaluation systems. Harvard Educational Review, 83(2), 371–384. https://doi.org/10.17763/haer.83.2.d11511403715u376

23.

A. D.

Kane

T. J.

(2013). The reliability of classroom observations by school personnel [Research paper]. MET Project. Bill & Melinda Gates Foundation.

24.

Huebner

Lucht

(2019). Generalizability theory in R. Practical Assessment, Research, and Evaluation, 24(1), 1–12. https://doi.org/10.7275/5065-gc10

25.

Individuals with Disabilities Education Improvement Act, 20 U.S.C. § 1400 et seq. (2004).

26.

Johnson

E. S.

Crawford

Moylan

L. A.

Zheng

(2018). Using evidence-centered design to create a special educator observation system. Educational Measurement: Issues and Practice, 37(2), 35–44. https://doi.org/10.1111/emip.12182

27.

Johnson

E. S.

Crawford

Moylan

L. A.

Zheng

(2020). Validity of a special education teacher observation system. Educational Assessment, 25(1), 31–46. https://doi.org/10.1080/10627197.2019.1702461

28.

Johnson

E. S.

Zheng

Crawford

A. R.

Moylan

L. A.

(2021). The relationship of special education teacher performance on observation instruments with student outcomes. Journal of Learning Disabilities, 54(1), 54–65. https://doi.org/10.1177/0022219420908906

29.

Jones

N. D.

(2023). A research framework for the study of special education teacher preparation. In McCray

E. D.

Bettini

Brownell

M. T.

McLeskey

Sindelar

P. T.

(Eds.), Handbook of research on special education teacher preparation (2nd ed., pp. 85–105). Taylor & Francis Group.

30.

Jones

N. D.

Bell

C. A.

Brownell

Peyton

Pua

Fowler

Holtzman

(2022). Using classroom observations in the evaluation of special education teachers. Educational Evaluation and Policy Analysis, 44(3), 429–457. https://doi.org/10.3102/01623737211068523

31.

Kane

T. J.

Staiger

D. O.

(2012). Gathering feedback for teaching: Combining high-quality observations with student surveys and achievement gains [Measures of Effective Teaching Project]. Bill & Melinda Gates Foundation. http://eric.ed.gov/?id=ED540960

32.

Kelcey

Carlisle

J. F.

(2013). Learning about teachers’ literacy instruction from classroom observations. Reading Research Quarterly, 48(3), 301–317. https://doi.org/10.1002/rrq.51

33.

Kent

S. C.

Wanzek

Al Otaiba

(2017). Reading instruction for fourth-grade struggling readers and the relation to student outcomes. Reading & Writing Quarterly, 33(5), 395–411. https://doi.org/10.1080/10573569.2016.1216342

34.

Linacre

J. M.

(2020). Facets computer program for many-facet Rasch measurement (Version 3.83.2). https://www.winsteps.com

35.

Linacre

J. M.

Smith

E. V.

Smith

R. M.

(2004). Introduction to Rasch measurement: Theory, models, and applications. JAM Press.

36.

Liu

Bell

C. A.

Jones

N. D.

McCaffrey

D. F.

(2019). Classroom observation systems in context: A case for the validation of observation systems. Educational Assessment, Evaluation and Accountability, 31(1), 61–95. https://doi.org/10.1007/s11092-018-09291-3

37.

Lombard

Snyder-Duch

Bracken

C. C.

(2002). Content analysis in mass communication: Assessment and reporting of intercoder reliability. Human Communication Research, 28(4), 587–604. https://doi.org/10.1111/j.1468-2958.2002.tb00826.x

38.

Mantzicopoulos

French

B. F.

Patrick

Watson

J. S.

Ahn

(2018). The stability of kindergarten teachers’ effectiveness: A generalizability study comparing the framework for teaching and the classroom assessment scoring system. Educational Assessment, 23(1), 24–46. https://doi.org/10.1080/10627197.2017.1408407

39.

Mashburn

A. J.

Meyer

J. P.

Allen

J. P.

Pianta

R. C.

(2014). The effect of observation length and presentation order on the reliability and validity of an observational measure of teaching quality. Educational and Psychological Measurement, 74(3), 400–422. https://doi.org/10.1177/0013164413515882

40.

Moore, C. T. (2022). Apply generalizability theory with R. https://cran.r-project.org/web/packages/gtheory/gtheory.pdf

41.

Nelson, Cook, S. C., Zarate, Powell, S. R., Maggin, D. M., Drake, K. R., Kiss, A. J., Ford, J. W., Sun, & Espinas, D. R. (2022). A systematic review of meta-analyses in special education: Exploring the evidence base for high-leverage practices. Remedial and Special Education, 43(5), 344–358. https://doi.org/10.1177/07419325211063491

42.

Peyton (2019). Explicit and systematic instructional practices across special education contexts: A generalizability study [Doctoral dissertation, University of Florida]. ProQuest Dissertations & Theses Global.

43.

Pua, D. J., Peyton, D. J., Brownell, M. T., Contesse, V. A., & Jones, N. D. (2021). Preservice observation in special education: A validation study. Journal of Learning Disabilities, 54(1), 6–19. https://doi.org/10.1177/0022219420920382

44.

R Core Team. (2024). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

45.

Semmelroth, C. L., & Johnson (2014). Measuring rater reliability on a special education observation tool. Assessment for Effective Intervention, 39, 131–145. https://doi.org/10.1177/1534508413511488

46.

Shavelson, R. J., & Webb, N. M. (1991). Generalizability theory: A primer. Sage.

47.

Shavelson, R. J., Webb, N. M., & Rowley, G. L. (1989). Generalizability theory. American Psychologist, 44(6), 922–932. https://doi.org/10.1037/0003-066X.44.6.922

48.

Sohn (2023). Developing an observation protocol for tiered reading instruction: A validation study (Publication No. 30529898) [Doctoral dissertation, University of Florida]. ProQuest Dissertations & Theses Global.

49.

Solis, & McKenna, J. W. (2025). Reading instruction for students with autism spectrum disorder: Comparing observations of instruction to student reading profiles. Journal of Behavioral Education, 34(2), 399–419. https://doi.org/10.1007/s10864-023-09532-6

50.

Styck, K. M., Anthony, C. J., Sandilos, L. E., & DiPerna, J. C. (2021). Examining rater effects on the classroom assessment scoring system. Child Development, 92(3), 976–993. https://doi.org/10.1111/cdev.13460

51.

Wolfe, E. W., & Dobria (2008). Applications of the multifaceted Rasch model. In J. W. Osborne (Ed.), Best practices in quantitative methods (1st ed., pp. 71–85). Sage.