Continued Validation of the Direct Behavior Rating-Classroom Management: Examining the Impact of Training on Rating Accuracy

Abstract

Although associated with desirable student and teacher outcomes, many educators lack sufficient training in evidence-based, effective classroom management practices. The Direct Behavior Rating-Classroom Management (DBR-CM) is a low-inference, flexible, and feasible observational assessment tool developed to support professional development in classroom management. Validation efforts have yielded strong psychometric evidence in support of DBR-CM use thus far; however, early validation work was conducted with participants who received only brief, familiarization training. Given the positive influence of training on the reliability and accuracy of observational assessment, this study evaluated the impact of brief, comprehensive training on the accuracy of DBR-CM ratings. Results indicated that training significantly improved DBR-CM rating accuracy. The impact of training on the perceived social validity of the DBR-CM was also examined. Training predicted higher levels of understanding and acceptability, but not other dimensions of social validity (e.g., feasibility, system climate, system support). Findings support the efficacy of comprehensive training, which included modeling, guided practice, performance feedback, and calibration, in improving DBR-CM rating accuracy.

Keywords

observation, professional development, screening, classroom management, progress monitoring

The effectiveness of universal, Tier I supports in schools is incontrovertibly connected to classroom management (CM), or the collective actions that teachers engage in while executing their daily instructional and related activities. Effective CM is consistently associated with positive short- and long-term student outcomes (e.g., engagement, achievement, behavior; Chow et al., 2024; Herman et al., 2022). CM also has significant implications for teachers: ineffective CM is associated with increased stress, burnout, and higher rates of professional attrition (Reinke et al., 2025). Unfortunately, despite the connection between CM and a host of desirable school outcomes, the use of effective, evidence-based CM practices remains a significant challenge facing schools and educators (Greenberg et al., 2014). Given that this challenge is most frequently attributed to insufficient preservice training (Greenberg et al., 2014), school districts must be prepared to address it through in-service professional development.

Recent studies have highlighted the benefits of ongoing professional development activities to support durable change in educator behavior, including the use of evidence-based CM strategies (Kleinert et al., 2017; Reinke et al., 2012; Smith & Gillespie, 2023). Furthermore, support that combines modeling, practice, and performance feedback appears to be the most effective approach to increase teacher adoption of evidence-based CM practices (Guskey & Yoon, 2009). Consultation is a low-cost indirect service delivery model that provides support for incorporating these strategies. Importantly, it has been shown to increase the use of evidence-based CM strategies and promote more favorable classroom outcomes (Kleinert et al., 2017; Reinke et al., 2012). Most school districts employ professionals (e.g., school psychologists, instructional coaches) trained to engage in consultation that supports the implementation of evidence-based CM practices. However, to engage in effective consultation, these professionals need access to data collection tools that are reliable, valid, and easy to administer. With this goal in mind, developers engage in efforts to ensure that assessment tools generate sound, usable data as efficiently as possible. Such efforts frequently incorporate the judicious use of training to enhance the accuracy and reliability of data generated by users. To this end, this study examined the impact of training on the accuracy of Direct Behavior Rating-Classroom Management (DBR-CM) ratings. The influence of training on user perceptions of social validity was also examined.

Data-Driven CM Support

Research indicates that the most effective professional development approaches utilize ongoing, data-driven support practices (Zepeda, 2012) to drive goal setting, feedback, and decision-making (Stoiber & Gettinger, 2016). Such approaches require the use of data collection tools that are easy to administer and psychometrically defensible. Unfortunately, historically, the number of easily administered, reliable, and valid CM assessment tools has been limited (Sims et al., 2020). A review by Chow et al. (2024) noted that direct observations were the most common method for assessing CM, followed by self-reports, interviews, checklists, video-based self-reflection, local rubrics, and reports by coaches and principals. Unfortunately, these methods for assessing CM appear unlikely to lead to the adoption or sustained use of evidence-based CM practices due to perceived or actual shortcomings in reliability, validity, ease of use, or practical relevance. However, Direct Behavior Rating (DBR; Chafouleas et al., 2010) has emerged as an alternative assessment format that demonstrates the potential to support ongoing, data-driven professional development activities in CM.

Direct Behavior Ratings (DBRs)

DBRs are brief direct observation assessments with strong conceptual foundations that combine the strengths of systematic direct observation (SDO), behavior rating scales, and general outcome measures. This combination produces a flexible, low-inference observation method completed in the environment where the behavior occurs, immediately after a period of direct observation (Christ & Boice, 2009). Although SDO is well-established as a method for assessing behavior in education, there may be drawbacks to its use in applied settings. SDO can be time-consuming and requires high levels of training to facilitate reliable and valid use (Riley-Tillman et al., 2005). Educators are sometimes reluctant to utilize SDO procedures for progress monitoring due to the intensive nature of data collection, which can create issues with reliability when observers have competing attentional priorities (e.g., while teaching). In contrast, the flexibility of the DBR assessment methodology reduces the sustained attentional demands required by SDO (Chafouleas et al., 2010), making it well-suited for naturalistic, minimally invasive observation and thus more feasible and acceptable for use in screening and progress monitoring activities in classroom settings. Furthermore, multiple studies have demonstrated that reliable, valid, and accurate DBR data can be collected with minimal training (LeBel et al., 2010; Schlientz et al., 2009). Taken together, DBR is an efficient, reliable, defensible, and flexible assessment tool that can be used reliably by less skilled observers, making it well-suited for use in assessing CM.

Direct Behavior Rating-Classroom Management (DBR-CM)

The DBR-CM was developed to address the limited availability of CM-focused assessment tools that are time- and resource-efficient (Sims et al., 2020). It offers a practical, flexible method to facilitate screening, formative assessment, and ongoing professional development in educational contexts. To date, the DBR-CM has shown promise as a reliable and valid assessment of evidence-based educator CM practice use. DBR-CM has shown moderate to strong correlations with concurrent assessments of educator CM behavior in both elementary and secondary classrooms and acceptable interrater reliability (Sims et al., 2020, 2023). In addition, when receiving brief, familiarization training (i.e., reading instructions, presentation of operational item definitions), users appear to provide consistent ratings (i.e., interrater reliability) at acceptable, but not optimal levels across most DBR-CM items (see Sims et al., 2023). Although promising, these early efforts to accumulate psychometric evidence supporting the use of the DBR-CM have identified opportunities to enhance reliability through efforts to improve rater accuracy. To this point, prior student-focused DBR (single-item scales) validation work illustrated that more comprehensive training, which included modeling, practice, and performance feedback, increased accuracy (i.e., reduced deviation scores; Schlientz et al., 2009). It is reasonable to anticipate that more comprehensive training would produce similar effects on DBR-CM ratings (e.g., greater accuracy).

Training Support

Assessment development and validation literature notes the importance of explicit efforts to enhance the accuracy and reliability of assessment results (Clark & Watson, 2019). Early endeavors to promote reliability and accuracy focus on aspects of instrument design, such as item clarity, construct alignment, and response format standardization, and continue with efforts to enhance user completion of assessment tools through parsimonious training activities (Clark & Watson, 2019). Training reduces inexperience-driven error, subjective interpretation, and cognitive bias, leading to more dependable measurement outcomes. Research demonstrates that modeling, guided practice, performance feedback, and calibration exercises within training are consistently associated with improved rater accuracy and interrater reliability, especially in observational or judgment-based assessments (Feldman et al., 2012). Such impacts were noted within student-focused DBR research. Schlientz et al. (2009) noted significant improvement in DBR rating accuracy (i.e., decreased score deviation) for users who received comprehensive training when compared to raters who received brief familiarization training. Generally, when raters receive structured instruction, calibration opportunities, and guided practice, subjective errors are reduced, making ratings more consistent (i.e., reliable) and better aligned with expert or criterion standards (i.e., accurate; Chafouleas et al., 2015; Feldman et al., 2012; Schlientz et al., 2009). This evidence underscores the importance of training not only for implementation fidelity but also for strengthening the psychometric defensibility of assessments used in applied settings.

Social Validity

In addition to generating psychometrically sound data, the willingness to use an assessment tool is a critical consideration when developing and refining assessment tools in educational and applied settings. Social validity refers to the perceived acceptability, feasibility, usability, and relevance of a tool or intervention by its intended users (Mandracchia & Sims, 2020). Social validity should be considered as a complementary element of psychometric defensibility, as it emphasizes human factors that underlie effective and sustainable assessment use. Adoption and sustained use of even the most psychometrically sound assessment tools is unlikely if they are unacceptable or are perceived as burdensome, difficult, unclear, or misaligned with user needs and values. In contrast, socially valid assessments are more likely to be adopted, used with fidelity, integrated into ongoing practice, and sustained over time (Leif et al., 2024). Within implementation science, concepts underlying social validity (e.g., acceptability, feasibility) are conceptualized as variable implementation outcomes that can be influenced through iterative user-centered training and other support strategies (Strohmeier et al., 2014). To this point, although empirical research is sparse, prior work suggests that training users in how to administer and interpret structured assessments can improve their perceptions of the social validity of assessment tools (Quinn, 2021). Although early evaluations suggest the DBR-CM is perceived as socially valid (Sims et al., 2020, 2023), these findings were associated with brief, familiarization training. As with accuracy, it is reasonable to hypothesize that perceptions of social validity may be enhanced via brief, comprehensive training. Confirming this hypothesis would further support the continued development and use of the DBR-CM.

Current Study

To contribute to the accumulation of validity evidence supporting the proposed interpretations and uses (see Kane, 2013) of the DBR-CM, this study examined the effects of a brief yet comprehensive training that included modeling, guided practice, performance feedback, and calibration exercises on rater accuracy and perceptions of social validity. Guiding research questions and related hypotheses included the following:

Research Question 1: Does brief, comprehensive training result in more accurate DBR-CM ratings?

Hypothesis 1: Participants in the training group will be more accurate in their DBR-CM ratings.

Hypothesis 2: The accuracy of DBR-CM ratings will improve for participants receiving delayed training (i.e., significant differences observed from pretest to posttest phase for the delayed training group).

Research Question 2: Does brief, comprehensive training result in more favorable perceptions of DBR-CM social validity?

Hypothesis 3: Trained participants will provide higher ratings of acceptability, understanding, feasibility, system climate, and system support for the DBR-CM.

Method

Participants

A total of 93 college students and education professionals in the Southwestern United States with connections primarily to psychology and education were recruited for participation in this study. Among those reporting demographic information, 90.4% (n = 84) identified as female and 9.6% (n = 9) as male. Racial and ethnic composition was 47.0% White, 33.0% Latinx, 12.2% Asian, 3.5% Black or African American, 2.6% mixed/other, 0.9% Native American, and 0.9% unknown (i.e., did not respond). Participants ranged in age from 18 to 68 (M = 31.0, SD = 11.4). Educational attainment included bachelor’s (32.2%, n = 30), master’s (17.4%, n = 16), master’s plus 15 units (15.7%, n = 15), associate’s (13.9%, n = 13), and doctoral degrees (12.2%, n = 11); 8.7% (n = 8) did not respond. Participants were primarily graduate students (34.8%, n = 33), school psychologists (27.0%, n = 26), and undergraduate students (24.3%, n = 24). A small number of participants were postgraduate educational professionals, including classroom educators and researchers (each 4.3%, n = 4), other professionals (2.6%, n = 3), specialized support personnel (1.7%, n = 2), and administrators (0.9%, n = 1). Among non-student participants, 21.7% (n = 25) reported 1 to 2 years of experience and 7.0% (n = 8) reported 21 or more years of professional experience.

Measures

Direct Behavior Rating-Classroom Management

The DBR-CM is a structured observational tool developed to assess educator use of evidence-based CM practices. Users rate observed CM practices across four items (Praise, Communication, Enthusiasm, and Rapport) on an 11-point Likert-type scale from 0 (low) to 10 (high; Sims et al., 2020). Item scores can be summed to generate a total score that indicates the overall use of evidence-based CM practices. Interrater reliability estimates for domain items have ranged from weak to strong (ICC = .67–.96), while moderate to strong concurrent validity estimates have been reported with the Classroom Atmosphere Scale and Classroom Assessment Scoring System (see Sims et al., 2020, 2023).

Modified Usage Rating Profile Assessment (URP-A)

The Usage Rating Profile-Assessment (URP-A) is a 28-item measure designed to assess the social validity of an evaluation process. The URP-A encompasses six subscales: Acceptability, Understanding, Feasibility, Home-School Collaboration, System Climate, and System Support. The URP-A was modified to exclude items not applicable to this study (i.e., three Home-School items). Items are rated on a six-point Likert scale ranging from 1 (strongly disagree) to 6 (strongly agree). The URP-A demonstrates a six-factor structure with acceptable model fit. Internal consistency, assessed via Cronbach’s alpha, ranged from .71 to .90 across factors, with the exception of the system support factor (Miller et al., 2013).

Procedures

Data Collection

Study activities were conducted virtually, with participants completing tasks on a private device (e.g., personal computer, tablet) from a location of their choice. After providing consent, participants completed a brief demographic survey. Participants were assigned to one of two groups, Delayed Training (DT; n = 58) or Training (T; n = 57), using stratified randomization. To enhance comparability, group assignment was randomized by level of education (i.e., undergraduate, graduate, postgraduate) and gender. Participants were emailed a personalized link to the Qualtrics survey for their respective group (i.e., DT, T). Each survey contained directions for completing the survey activities, which included an embedded DBR-CM training (video), six 5-minute clips of classroom instruction, six electronically formatted DBR-CM (i.e., one following each video clip), and selected URP-A items. The order of survey items varied by group assignment (see Supplemental Figure 1). Participants in the T group viewed the survey greeting and general directions, followed by a brief, comprehensive Qualtrics survey-embedded DBR-CM training, which included modeling, guided practice, performance feedback, and calibration exercises. T group participants then completed DBR-CM ratings for three 5-minute video clips. Ratings were provided immediately after each video clip. Participants in the DT group viewed the survey greeting and general directions, then completed DBR-CM ratings for three 5-minute video clips. For consistency in DBR-CM exposure and use across groups, both groups completed the selected URP-A items after rating the third video clip. Following completion of the modified URP-A, the DT group viewed the brief, comprehensive DBR-CM training. Each group then completed an additional three DBR-CM ratings for an additional three 5-minute video clips, with ratings provided immediately following viewing of each video clip. 
The survey concluded with a Thank You message and a request for a preferred email address to facilitate raffle activities as compensation. Most participants (75%) completed all study activities in less than 1 day, but completion ranged from 1 to 15 days.
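The stratified randomization described above can be illustrated with a minimal sketch. The source does not describe the exact assignment procedure, so the approach below (shuffling within each education-by-gender stratum and alternating group assignment from a random starting group) is an assumption, and the roster, field names, and `stratified_assign` function are hypothetical.

```python
import random
from collections import defaultdict


def stratified_assign(participants, strata_keys, groups=("DT", "T"), seed=0):
    """Assign participants to groups at random, balancing within each stratum."""
    rng = random.Random(seed)

    # Collect participants into strata defined by the chosen variables.
    strata = defaultdict(list)
    for person in participants:
        strata[tuple(person[k] for k in strata_keys)].append(person)

    assignment = {}
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        # A random starting group keeps odd-sized strata from always
        # favoring the same condition.
        offset = rng.randrange(len(groups))
        for i, person in enumerate(members):
            assignment[person["id"]] = groups[(i + offset) % len(groups)]
    return assignment


# Hypothetical roster: four participants in each education-by-gender stratum.
combos = [
    (edu, gen)
    for edu in ("undergraduate", "graduate", "postgraduate")
    for gen in ("female", "male")
]
roster = [
    {"id": i, "education": edu, "gender": gen}
    for i, (edu, gen) in enumerate(combos * 4)
]

groups = stratified_assign(roster, ("education", "gender"))
```

With even-sized strata, each stratum contributes equally to the DT and T groups, which is the comparability the stratification is meant to protect.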

DBR-CM Training

Participants completed a brief, comprehensive 41-minute DBR-CM video training. This training was developed by the lead DBR-CM developer, a researcher and trainer with background and expertise in assessment development and validation, data-based decision-making, and teaching or training adult learners, with support from an expert panel to derive “true scores” for practice items. The training presented the background and rationale for the development of the DBR-CM as well as the importance of the use of effective, evidence-based CM practices. The training introduced and defined DBR-CM items and provided examples and non-examples associated with each. This training also covered item completion, an introduction to the response format and Likert scale, as well as more detailed descriptions of the scale anchors that guide rating. This training included several DBR-CM practice rating opportunities (i.e., practice, calibration, feedback).

Data Cleaning and Coding

Data were exported from Qualtrics to Excel and deidentified. Twenty-two participants were removed via listwise deletion as they did not complete any DBR-CM ratings despite providing consent. Given the randomized presentation of items, this attrition was deemed missing completely at random (MCAR) and unrelated to study content. Additional missing ratings for individual clips (DT = 4; T = 2) followed a missing at random (MAR) pattern, likely resulting from participant fatigue or burden over the course of the six video clips. Imputation procedures were considered when addressing these missing data, but were deemed unnecessary, as planned analyses were determined to be sufficiently robust to withstand the impact of MCAR and MAR data. Furthermore, deviation scores were calculated at the individual item and participant level, rather than aggregated. Lastly, the final sample of 93 participants (DT, n = 49; T, n = 44; DT deviation scores, n = 1,136; T deviation scores, n = 1,024) provided sufficient power to detect observed effects (d = .60–.84) and exceeded the recommendation of 50 clusters for multilevel modeling (Hox et al., 2010).

Data Analysis

Outcome Variables

The outcome variable used to examine the impact of training on rater accuracy was based on DBR-CM item deviation scores and was consistent with those employed by Schlientz et al. (2009). Deviation scores were calculated by determining the difference between the score provided by an observer and the expert panel-derived “true score” for a given DBR-CM item and video clip. “True scores” were derived via consensus ratings provided by an expert panel. Panel members included four doctoral-level school psychologists and one doctoral candidate in school psychology. Beyond pre- and in-service training, each panel member had experience conducting observational assessments (i.e., systematic direct observation, rating scales, DBR) in applied settings within research and service delivery applications. All expert panel members completed the comprehensive, web-based DBR-CM training prior to independently completing ratings for each of the six video clips used in the study. Consensus agreement across ratings was used to arrive at “true scores” for DBR-CM items for each video clip. Instances of disagreement were reconciled via open discussion and review of ratings and video clips.
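As a minimal sketch of the deviation score calculation described above, the following computes signed and absolute deviations for one rater and one clip. The scores are hypothetical, and the sign convention (observer score minus "true score") is an assumption; the source states only that the difference between the two was taken.

```python
def deviation_scores(rater_scores, true_scores):
    """Return signed and absolute deviations per DBR-CM item.

    Assumed convention: deviation = observer score - expert "true score".
    """
    signed = {item: rater_scores[item] - true_scores[item] for item in true_scores}
    absolute = {item: abs(dev) for item, dev in signed.items()}
    return signed, absolute


# Hypothetical ratings on the 0-10 DBR-CM scale for a single video clip.
true_scores = {"Praise": 6, "Communication": 7, "Enthusiasm": 5, "Rapport": 8}
rater_scores = {"Praise": 4, "Communication": 7, "Enthusiasm": 8, "Rapport": 7}

signed, absolute = deviation_scores(rater_scores, true_scores)

# Mean absolute deviation across the four items for this rater and clip.
mean_abs = sum(absolute.values()) / len(absolute)
```

Absolute deviations treat over- and underestimates alike, so a lower value indicates a rating closer to the expert consensus regardless of direction.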

Statistical Analyses

Hypothesis 1

To address the first study hypothesis, multivariate linear mixed-effects modeling (MLMM) was conducted using R to determine whether training predicted more accurate DBR-CM ratings (i.e., lower absolute deviation scores). MLMM was selected to account for the hierarchical structure of the data, specifically the nesting of multiple video clip ratings within individual raters (Stawski, 2013). This approach preserves the covariance structure among the conceptually related outcomes while incorporating a random intercept for participants to control for intra-rater correlation (Olive, 2017). The multivariate effect of training was evaluated within the mixed-modeling framework using Kenward-Roger degrees of freedom adjustments (Kuznetsova et al., 2017). To examine item-specific impacts, the model included an interaction between training condition and DBR-CM item. Interpretation of unstandardized coefficients (b) focused on the direction and magnitude of training-related changes in absolute deviation scores, with larger absolute values indicating more substantive differences and narrower 95% confidence intervals indicating greater precision (Darlington & Hayes, 2017). The Intraclass Correlation Coefficient (ICC) was calculated to quantify the proportion of variance attributable to rater-level differences. Effect sizes were interpreted using conventional benchmarks (.20 = small, .50 = moderate, .80 = large; Cohen, 1988; Lee, 2016).

Hypothesis 2

To address the second study hypothesis, a linear mixed-effects analysis with random intercepts (LMM-RI) was conducted in RStudio (Version 2025.09.2+418) to examine training effects on accuracy across time points (videos 1–3 vs. 4–6). LMMs are well-suited for repeated-measures data because they: (a) account for dependence among observations, (b) accommodate unbalanced designs, and (c) incorporate both fixed and random sources of variability (Bates et al., 2015; Brown, 2021). Fixed effects included training condition (T vs. DT), time (pre- vs. posttraining), item (e.g., Praise, Communication), and the training condition × time interaction, which indexed differential improvement after training as a function of training condition. Random intercepts for participants and clips accounted for baseline differences in accuracy and clip difficulty. Models were estimated via restricted maximum likelihood (REML) using the lme4 package (Bates et al., 2015) in R. Significance of fixed effects was determined using Satterthwaite-adjusted degrees of freedom via the lmerTest package (Kuznetsova et al., 2017). Coefficients were interpreted as unstandardized effects, with negative values indicating greater accuracy (i.e., reduced deviation). Estimated marginal means and 95% confidence intervals were obtained using the emmeans package in R (Lenth, 2024). Effect sizes were calculated by dividing the interaction estimate by the residual standard deviation and interpreted using standard benchmarks (Cohen, 1988).
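The effect size rule stated above (interaction estimate divided by the residual standard deviation) can be worked through with the values reported in the Results for the condition × time interaction (b = 1.08, residual variance = 1.96). This is a sketch of the arithmetic only; the function name is hypothetical.

```python
import math


def standardized_effect(estimate, residual_variance):
    """Divide an unstandardized estimate by the residual standard deviation."""
    return estimate / math.sqrt(residual_variance)


# Values as reported in the Results: interaction b = 1.08, residual variance = 1.96.
d = standardized_effect(1.08, 1.96)
d_rounded = round(d, 2)
```

The residual standard deviation is the square root of 1.96, or 1.40, so the ratio is approximately .77, in line with the reported d.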

Hypothesis 3

To address the third study hypothesis, multivariate multiple regression (MMR) was conducted to determine whether training predicted more favorable perceptions of aspects of social validity of the DBR-CM. MMR was selected due to the conceptually related and moderately correlated nature of the outcomes and to preserve the covariance structure among them (Olive, 2017). This approach provides a single multivariate test of the predictor (training) across dependent variables while controlling Type I error (Olive, 2017). Although the outcomes are interrelated, univariate regression coefficients were also examined for each distinct social validity construct rated (e.g., understanding, feasibility), as these constructs are conceptually related but potentially mutually exclusive. Interpretation of unstandardized coefficients (b) focused on the direction and magnitude of training-related changes in social validity ratings, with larger values indicating more favorable user perceptions of acceptability, understanding, feasibility, system climate, and system support (Darlington & Hayes, 2017).

Results

Assumptions for the statistical analyses were tested through the performance package in RStudio (Version 2025.09.2+418). Visual diagnostics confirmed that the absolute deviation scores satisfied requirements for linearity and homoscedasticity. Q–Q plots of random intercepts for raters confirmed that rater-level differences were normally distributed, validating the use of a hierarchical framework to account for the nesting of video clip ratings within raters. Although minor deviations in residual normality were observed, the sample size was sufficiently large to ensure the reliability of all multivariate and univariate estimates (Hox et al., 2010).

Training Impact on Accuracy of DBR-CM Ratings

To address the first hypothesis, a multivariate linear mixed-effects model (MLMM) was used to examine whether training predicted DBR-CM absolute deviation scores. This approach accounted for the nested structure of the data (i.e., multiple ratings within individual raters). A statistically significant multivariate effect of training on the combined deviation score outcomes was evident (b = −1.22, SE = .18, t(91.39) = −6.63, p < .001), indicating that training significantly improved rating accuracy across the combined outcomes (see Table 1). The model evaluated the impact of training for each specific DBR-CM item while incorporating a random intercept for raters. Results indicated that training significantly predicted lower absolute deviation scores (indicating greater accuracy) for all items. Compared to the DT group, trained raters demonstrated significantly lower absolute deviation scores for Praise (b = −1.43, SE = .17, p < .001), Communication (b = −1.15, SE = .17, p < .001), Enthusiasm (b = −1.30, SE = .17, p < .001), and Rapport (b = −1.01, SE = .17, p < .001). The Intraclass Correlation Coefficient (ICC) value (ICC = .194) indicated participant-level variance accounted for 19.4% of the total variance in DBR-CM deviation scores and demonstrated a moderate degree of consistency across raters. Effect size indices, derived from the model estimates, indicated medium-to-large effects for DBR-CM items, with Cohen’s d values ranging from .65 to .93 (see Table 1). The 95% confidence intervals for the training effect across all items indicated high precision in these estimates, as evidenced by the narrow standard errors.

Table 1.

Multivariate Effects of Training on Deviation Scores for DBR-CM Ratings.

	b	SE	95% CI	df	t	p	d
Combined Accuracy
	−1.22	.18	[−1.57, −0.87]	91.39	−6.63	<.001	.87
Individual Item Effects
Praise	−1.43	.17	[−1.76, −1.10]	91	−8.41	<.001	.92
Communication	−1.15	.17	[−1.48, −0.82]	91	−6.76	<.001	.82
Enthusiasm	−1.30	.17	[−1.63, −0.97]	91	−7.64	<.001	.93
Rapport	−1.01	.17	[−1.34, −0.68]	91	−5.94	<.001	.65

Note. Negative b values indicate a reduction in absolute deviation from the expert panel-derived “true scores” (i.e., an increase in rater accuracy). Combined Accuracy represents the multivariate fixed effect of training condition across all four DBR-CM items.
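As a reading aid for Table 1, the tabled 95% confidence intervals can be reproduced to two decimals from each b and SE using a normal-approximation (Wald) interval, b ± 1.96 × SE. This is a sketch under that assumption; the source reports Kenward-Roger degrees of freedom and does not state how the intervals were computed, and the `wald_ci` function name is hypothetical.

```python
def wald_ci(b, se, z=1.96):
    """Normal-approximation 95% interval, rounded to two decimals."""
    return (round(b - z * se, 2), round(b + z * se, 2))


# Estimates and standard errors as reported in Table 1.
table1 = {
    "Combined": (-1.22, 0.18),
    "Praise": (-1.43, 0.17),
    "Communication": (-1.15, 0.17),
    "Enthusiasm": (-1.30, 0.17),
    "Rapport": (-1.01, 0.17),
}

intervals = {name: wald_ci(b, se) for name, (b, se) in table1.items()}

# An interval that excludes zero agrees with a significant effect at the .05 level.
excludes_zero = {name: hi < 0 or lo > 0 for name, (lo, hi) in intervals.items()}
```

Every interval lies entirely below zero, which matches the direction and significance of the training effects reported for each item.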

To address the second study hypothesis, a linear mixed-effects analysis with random intercepts (LMM-RI) was conducted to examine training effects on absolute deviation scores across time points (Clips 1–3 vs. 4–6). The model included fixed effects for training condition (T vs. DT), time (pre- vs. posttraining), item, and the condition × time interaction, with random intercepts for participant and clip. Results revealed a significant condition × time interaction, b = 1.08, SE = .12, t(2,076.50) = 8.88, p < .001, d = .77 (see Table 2). Estimated marginal means (EMMs) showed that the DT group demonstrated a reduction in absolute deviation score from pretraining (M = 2.30, SE = .17) to posttraining (M = 1.22, SE = .18). The T group exhibited stable accuracy levels from pretraining (M = 1.19, SE = .13) to posttraining (M = 1.28, SE = .13). A significant main effect of condition was observed, b = −1.21, SE = .18, t(113.75) = −6.71, p < .001. Item-level effects indicated that absolute deviation was significantly lower (indicating higher accuracy) for Communication (b = −.47, p < .001), Enthusiasm (b = −.41, p < .001), and Rapport (b = −.50, p < .001) when compared to Praise. Analysis of random effects indicated significant variance at the participant level (SD = .77) and the clip level (SD = .56), with a residual variance of 1.96 (SD = 1.40) (see Table 2).

Table 2.

Linear Mixed-Effects Model Results for DBR-CM Rating Accuracy.

Predictor	b	SE	95% CI	df	t	p
Fixed Effects
(Intercept)	2.29	.35	[1.60, 2.98]	5.30	6.49	.001
Condition (T vs. DT)	−1.21	.18	[−1.56, −0.86]	91	−6.71	<.001
Time (Post vs. Pre)	−.94	.47	[−1.86, −0.02]	4.13	−2.00	.114
Condition × Time	1.08	.12	[0.84, 1.32]	1,679	8.88	<.001
Item Contrasts
Communication	−.47	.09	[−0.65, −0.29]	2,061.57	−5.49	<.001
Enthusiasm	−.41	.09	[−0.59, −0.23]	2,061.57	−4.78	<.001
Rapport	−.50	.09	[−0.68, −0.32]	2,061.57	−5.92	<.001
Random Effects
	Variance	SD
Participant	.59	.77
Clip	.32	.56
Residual	1.96	1.40

Note. Degrees of freedom were adjusted using the Satterthwaite approximation.
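The random-effects rows of Table 2 can be read as a partition of the unexplained variance. The sketch below computes each component's share of the total and the corresponding standard deviations. It is an illustration applied to the Table 2 model only; the ICC of .194 reported earlier comes from the Hypothesis 1 model, so the participant share here is not expected to match it exactly.

```python
import math

# Variance components as reported in Table 2.
variances = {"participant": 0.59, "clip": 0.32, "residual": 1.96}

total_variance = sum(variances.values())

# Proportion of total random variance attributable to each source.
shares = {name: var / total_variance for name, var in variances.items()}

# Standard deviations are the square roots of the variance components.
sds = {name: math.sqrt(var) for name, var in variances.items()}

for name in variances:
    print(f"{name}: share = {shares[name]:.3f}, SD = {sds[name]:.2f}")
```

Under this partition, roughly one fifth of the variance sits at the participant level and about one tenth at the clip level, with the remainder left as residual variation.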

Training Impact on Social Validity

To address the third hypothesis, MMR was conducted to determine if training predicted participant perceptions of social validity (i.e., acceptability, understanding, feasibility, system climate, system support) for the DBR-CM. Descriptive statistics for these social validity variables are provided in Table 3. Consistent with our research focus on specific aspects of social validity, individual regression coefficients were examined to identify the impact of training on each rated outcome. Results indicated that training resulted in significantly more favorable perceptions of acceptability (b = .44, SE = .19, p = .020) and understanding (b = .46, SE = .22, p = .038). A moderate effect size was observed for both acceptability (d = −.50) and understanding (d = −.44). Training did not significantly predict feasibility, system climate, or system support, with corresponding effect sizes ranging from negligible to small (see Table 4).

Table 3.

Descriptive Statistics for Study Items by Training Condition.

Direct behavior rating-classroom management
Clip	Group	n	Praise deviation M (SD)	Communication deviation M (SD)	Enthusiasm deviation M (SD)	Rapport deviation M (SD)
1	Delayed	49	−.14 (2.22)	−.04 (1.19)	−1.41 (2.18)	−.06 (1.97)
1	Training	44	.36 (1.33)	−.02 (1.02)	−.32 (1.33)	.34 (1.27)
2	Delayed	49	−1.94 (2.66)	−2.69 (2.49)	−.84 (2.01)	−.73 (1.94)
2	Training	43	−.14 (1.64)	−.67 (1.66)	−.42 (1.38)	.37 (1.33)
3	Delayed	49	−3.78 (2.40)	−2.94 (2.60)	−2.80 (2.52)	−3.29 (2.44)
3	Training	43	−1.93 (2.35)	−1.67 (2.18)	−.67 (1.71)	−1.58 (2.07)
4	Delayed	46	−1.91 (2.47)	−.78 (1.85)	−1.24 (2.31)	−.48 (1.93)
4	Training	43	−1.47 (2.09)	−.72 (1.84)	−1.23 (1.88)	−.21 (1.50)
5	Delayed	45	−.78 (1.94)	.33 (1.43)	−.87 (1.85)	−.62 (1.43)
5	Training	43	−.98 (1.96)	.14 (1.57)	−1.02 (1.65)	−.81 (1.58)
6	Delayed	45	−.56 (2.29)	−.18 (1.92)	−1.24 (2.32)	−.44 (1.96)
6	Training	42	−.88 (2.29)	−.38 (1.74)	−1.55 (2.11)	−.62 (1.74)
Usage Rating Profile-Assessment^a
			Acceptability M (SD)	Understanding M (SD)	Feasibility M (SD)	System Climate M (SD)
	Delayed	49	4.27 (.99)	4.45 (1.26)	4.80 (.72)	4.30 (.89)
	Training	42	4.71 (.76)	4.90 (.66)	4.98 (.76)	4.36 (1.21)

^a Includes only selected items targeting considerations relevant to this study.

Table 4.

Multivariate and Univariate Effects of Training on Social Validity Outcomes.

	b	SE	95% CI (b)	p	R²	Cohen’s d	95% CI (d)
Multivariate	—	—	—	.080	—	—	—
	Pillai’s Trace = .107, F(5, 85) = 2.05, partial η² = .108
Acceptability	.44	.19	[0.073, 0.817]	.020	.0596	−.50	[−0.92, −0.08]
Understanding	.46	.22	[0.027, 0.885]	.038	.0477	−.44	[−0.87, −0.02]
Feasibility	.17	.16	[−0.134, 0.481]	.266	.0139	−.24	[−0.65, 0.18]
System Climate	.06	.22	[−0.377, 0.500]	.782	.0009	−.06	[−0.48, 0.36]
System Support	.06	.36	[−0.664, 0.787]	.867	.0003	−.04	[−0.45, 0.38]

Note. Negative Cohen’s d values indicate higher scores for the Training group relative to the Delayed Training group.
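
To make the sign convention in Table 4 concrete, the sketch below recomputes a standardized mean difference from the group means and standard deviations in Table 3. It assumes a pooled-SD Cohen’s d with the Delayed Training group entered first. The article does not state the exact formula used, so this is an assumption rather than the authors’ procedure, and the function and variable names are hypothetical. System Support is omitted because Table 3 does not list its descriptive statistics.

```python
import math


def cohens_d(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Standardized mean difference (group A minus group B) using the pooled SD."""
    pooled_var = ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    return (mean_a - mean_b) / math.sqrt(pooled_var)


# (mean, SD) pairs from Table 3: Delayed (n = 49) first, Training (n = 42) second.
N_DELAYED, N_TRAINING = 49, 42
table3 = {
    "Acceptability": ((4.27, 0.99), (4.71, 0.76)),
    "Understanding": ((4.45, 1.26), (4.90, 0.66)),
    "Feasibility": ((4.80, 0.72), (4.98, 0.76)),
    "System Climate": ((4.30, 0.89), (4.36, 1.21)),
}

# Delayed minus Training, so a negative d means the Training group scored higher.
effect_sizes = {
    outcome: cohens_d(d_mean, d_sd, N_DELAYED, t_mean, t_sd, N_TRAINING)
    for outcome, ((d_mean, d_sd), (t_mean, t_sd)) in table3.items()
}

for outcome, d in effect_sizes.items():
    print(f"{outcome}: d = {d:.2f}")
```

Under this assumption the recomputed values fall within rounding distance of those in Table 4; small differences (e.g., for Acceptability) are expected because the inputs are themselves rounded to two decimals.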

Discussion

Evidence-based CM practices are essential to producing desired outcomes for students and classroom educators. To this end, the DBR-CM was developed to help facilitate data collection efforts that support ongoing CM-focused professional development (Sims et al., 2020). This study sought to extend the growing body of validity evidence supporting the DBR-CM by examining the effects of a brief, yet comprehensive training on improving rater accuracy and user perceptions of social validity. It was hypothesized that comprehensive training would predict better accuracy of DBR-CM ratings as well as higher social validity ratings for the DBR-CM.

Impact of Training on Accuracy

The literature highlights the role of training in reducing subjective interpretation, inexperience, and cognitive bias, factors that can undermine the psychometric defensibility of assessment tools (Feldman et al., 2012; LeBel et al., 2010; Schlientz et al., 2009). Prior work establishing validity evidence with the DBR-CM yielded encouraging results. However, these studies used a brief familiarization training approach that lacked several components that may facilitate more accurate assessment use. Thus, a comprehensive DBR-CM training was developed that included the following evidence-based enhancement mechanisms outlined in Feldman et al. (2012): exposure to the background and rationale for development, modeling, guided practice, performance feedback, and anchoring instruction. As anticipated, across multiple analytic approaches, receipt of brief, comprehensive DBR-CM training was a significant predictor of more accurate DBR-CM ratings. Notably, the impact of training was significant for all DBR-CM items, with the most pronounced improvement observed for the Praise item. Furthermore, after participants in the delayed training group received the comprehensive DBR-CM training, their deviation scores decreased substantially, reflecting a significant “catch-up” effect in which their accuracy improved to match that of the group trained at baseline.

Together, the magnitude and consistency of effects across all items underscore the effectiveness of comprehensive training in enhancing DBR-CM rating accuracy. These findings suggest that comprehensive training should be incorporated when using DBR-CM and other observational assessment tools to increase their psychometric defensibility. Given the intended applications of DBR-CM to drive professional development activities promoting evidence-based CM practices, ensuring that DBR-CM ratings are psychometrically sound is essential.

Social Validity

Research indicates training can strengthen key elements of social validity, such as acceptability, feasibility, and usability. In this study, comprehensive training was not a significant predictor of the social validity dimensions when they were considered simultaneously. However, it predicted significantly greater user acceptability and understanding, which are critical facets of social validity. This suggests that comprehensive training may help users gain a deeper appreciation of the conceptual purpose and rationale of the DBR-CM, as well as enhance their confidence and openness toward using the tool. These gains also underscore the importance of training for enhancing the foundational user perceptions that support DBR-CM adoption and sustained use. Notably, they emerged regardless of broader contextual factors, such as system climate and system support, which remained unchanged by the brief training. These results may also reinforce the assertion that social validity is a complementary element of psychometric defensibility and highlight the value of training for cultivating the human factors necessary for effective and sustainable DBR-CM implementation.

Limitations and Future Directions

Despite these findings, several limitations warrant consideration. First, although videos allowed for standardized observations across all participants, they may not fully capture the dynamic, real-time complexity of in vivo classroom settings. In addition, observation length may influence rating accuracy. Video segments were limited to 5 minutes to reduce participant burden; however, observations in applied school settings are typically longer (e.g., 15 minutes). The sample, although diverse in experience and background, was composed primarily of students and very early-career professionals, which may limit generalizability of accuracy-focused findings to raters with more applied experience. Similarly, generalizability of social validity findings to more experienced users may also be limited. Perceptions of social validity noted for these participants may differ from those of veteran, in-service educators.

Future research should examine the impact of training on ratings from more experienced educators and explore both external observer and self-report DBR-CM formats. Including student developmental level as a variable may clarify how rater training and context interact. Comparisons across training conditions and developmental contexts could guide decisions related to the investment of resources in assessment training. It may also be useful to assess the long-term retention of rating accuracy posttraining and whether additional training is necessary for maintenance. Future work should also investigate whether improved observer accuracy leads to changes in educator behavior or student outcomes. Broadly, research should continue to build validity evidence for DBR-CM across varied settings, participants, and analytic approaches. For example, studies should examine the role of the DBR-CM in problem-solving consultation and its potential for identifying intervention targets and providing feedback. Finally, collecting self-report DBR-CM data would allow researchers to examine the relationship between external ratings and self-ratings and thereby inform related professional development applications (e.g., self-monitoring).

Implications for Practice

These findings have meaningful implications for practice. CM remains a persistent challenge in schools, and many educators lack sufficient training in evidence-based practices. The DBR-CM offers a promising, accessible assessment method to support professional development in CM. This study suggests that incorporating a brief, formal training can enhance the accuracy of DBR-CM ratings, increasing confidence in and defensibility of the data produced by this low-inference observation tool. Combined with prior research, these results support the utility of the DBR-CM to promote improved CM. The growing psychometric support for the DBR-CM also has implications for Multi-Tiered Systems of Support implementation. A reliable and efficient CM assessment tool benefits Tier I teams and consultants aiming to improve universal practices. The DBR-CM represents a clear improvement over anecdotal observations (Chow et al., 2024) and offers a more feasible alternative to labor-intensive methods that are often implemented inconsistently (Riley-Tillman et al., 2005). Results further suggest that educators with varying experience and expertise can produce reliable ratings with brief training, though additional research is needed. In the context of staffing shortages, especially of school psychologists, DBR-CM may expand the pool of personnel who can support Tier I fidelity and adoption of evidence-based practices.

Conclusion

Brief training predicts improved DBR-CM rating accuracy and enhances perceptions of tool acceptability and understanding. These findings support the role of training in the implementation of defensible, data-driven tools and reinforce the promise of DBR-CM for CM assessment and professional development. Modest effects on user perceptions warrant continued research with larger samples to further clarify and extend these results.

Supplemental Material

Supplemental material, sj-docx-1-aei-10.1177_15345084261452141 for Continued Validation of the Direct Behavior Rating-Classroom Management: Examining the Impact of Training on Rating Accuracy by Wesley A. Sims, Drew Hunter, Dennis T. Sisco-Taylor, Danielle Zahn and Jordan Gallegos in Assessment for Effective Intervention

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Bates, Mächler, Bolker, & Walker (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67(1), 1–48. https://doi.org/10.18637/jss.v067.i01

Brown (2021). An introduction to linear mixed-effects modeling in R. Advances in Methods and Practices in Psychological Science, 4(1), 1–19. https://doi.org/10.1177/2515245920960351

Chafouleas, S. M., Riley-Tillman, T. C., Jaffery, Miller, F. G., & Harrison, S. E. (2015). Preliminary investigation of the impact of a web-based module on direct behavior rating accuracy. School Mental Health, 7(2), 92–104. https://doi.org/10.1007/s12310-014-9130

Chafouleas, S. M., Volpe, R. J., Gresham, F. M., & Cook, C. R. (2010). School-based behavioral assessment within problem-solving models: Current status and future directions. School Psychology Review, 39(3), 343–349. https://doi.org/10.1080/02796015.2010.12087756

Chow, J. C., Sayers, Granger, K. L., McCullough, Kingsbery, & Morse (2024). A systematic meta-review of measures of classroom management in school settings. Assessment for Effective Intervention, 49(2), 60–74. https://doi.org/10.1177/15345084231208671

Christ, T. J., & Boice (2009). Rating scale items: A brief review of nomenclature, components, and formatting to inform the development of direct behavior rating (DBR). Assessment for Effective Intervention, 34(4), 242–250. https://doi.org/10.1177/1534508409336182

Clark, L. A., & Watson (2019). Constructing validity: New developments in creating objective measuring instruments. Psychological Assessment, 31(12), 1412–1427. https://doi.org/10.1037/pas0000626

Cohen (1988). Set correlation and contingency tables. Applied Psychological Measurement, 12(4), 425–434. https://doi.org/10.1177/014662168801200

Darlington, R. B., & Hayes, A. F. (2017). Regression analysis and linear models: Concepts, applications, and implementation. Guilford Publications.

Feldman, Lazzara, E. H., Vanderbilt, A. A., & DiazGranados (2012). Rater training to support high-stakes simulation-based assessments. The Journal of Continuing Education in the Health Professions, 32(4), 279–286. https://doi.org/10.1002/chp.21156

Greenberg, Putman, & Walsh (2014). Training our future teachers: Classroom management. National Council on Teacher Quality. https://www.nctq.org/dmsView/Future_Teachers_Classroom_Management_NCTQ_Report

Guskey, T. R., & Yoon, K. S. (2009). What works in professional development? Phi Delta Kappan, 90(7), 495–500. https://doi.org/10.1177/003172170909000709

Herman, K. C., Reinke, W. M., Dong, & Bradshaw, C. P. (2022). Can effective classroom behavior management increase student achievement in middle school? Findings from a group randomized trial. Journal of Educational Psychology, 114(1), 144–160. https://doi.org/10.1037/edu0000641

Hox, Moerbeek, & Schoot (2010). Multilevel analysis: Techniques and applications (2nd ed.). Routledge. https://doi.org/10.4324/9780203852279

Kane (2013). The argument-based approach to validation. School Psychology Review, 42(4), 448–457. https://doi.org/10.1080/02796015.2013.12087465

Kleinert, W. L., Silva, M. R., Codding, R. S., Feinberg, A. B., & St. James, P. S. (2017). Enhancing classroom management using the classroom check-up consultation model with in-vivo coaching and goal setting components. School Psychology Forum, 11(1), 5–19. https://www.nasponline.org/publications/periodicals/spf/

Kuznetsova, Brockhoff, P. B., & Christensen, R. H. B. (2017). lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software, 82(13), 1–26. https://doi.org/10.18637/jss.v082.i13

LeBel, T. J., Kilgus, S. P., Briesch, A. M., & Chafouleas (2010). The impact of training on the accuracy of teacher-completed direct behavior ratings (DBRs). Journal of Positive Behavior Interventions, 12(1), 55–63. https://doi.org/10.1177/1098300708325265

Lee, D. K. (2016). Alternatives to P value: Confidence interval and effect size. Korean Journal of Anesthesiology, 69, 555–562. https://doi.org/10.4097/kjae.2016.69.6.555

Leif, E. S., Kelenc-Gasior, Bloomfield, B. S., Furlonger, & Fox, R. A. (2024). A systematic review of social-validity assessments in the Journal of Applied Behavior Analysis: 2010-2020. Journal of Applied Behavior Analysis, 57(3), 542–559. https://doi.org/10.1002/jaba.1092

Lenth, R. V. (2024). emmeans: Estimated marginal means, aka least-squares means [R package version 1.10.1]. https://CRAN.R-project.org/package=emmeans

Mandracchia, N. R., & Sims, W. A. (2020). Development of the Usage Rating Profile-Web Resource (URP-WR): Using assessment to inform web resource selection. Computers in the Schools, 37(4), 269–291. https://doi.org/10.1080/07380569.2020.1835388

Miller, F. G., Neugebauer, S. R., Chafouleas, S. M., Briesch, A. M., & Riley-Tillman, T. C. (2013, August). Examining innovation usage: Construct validation of the usage rating profile-assessment [Conference session]. Poster presentation at the American Psychological Association Annual Convention, Honolulu, HI, United States.

Olive, D. J. (Ed.) (2017). Multiple linear regression. In Linear regression (pp. 39–81). Springer. https://doi.org/10.1007/978-3-319-68253-2_12

Quinn, B. N. E. (2021). An analysis of the acceptability, feasibility, and utility of the Global Mental Health Assessment Tool for Primary Care (GMHAT/PC) in a UK primary healthcare setting: A practice-based mixed methods study [Unpublished doctoral thesis]. University of Chester.

Reinke, W. M., Herman, K. C., Stormont, & Ghasemi (2025). Teacher stress, coping, burnout, and plans to leave the field: A post-pandemic survey. School Mental Health, 17(1), 32–44. https://doi.org/10.1007/s12310-024-09738-7

Reinke, W. M., Stormont, Webster-Stratton, Newcomer, L. L., & Herman, K. C. (2012). The incredible years teacher classroom management program: Using coaching to support generalization to real-world classroom settings. Psychology in the Schools, 49(5), 416–428. https://doi.org/10.1002/pits.21608

Riley-Tillman, T. C., Kalberer, S. M., & Chafouleas, S. M. (2005). Selecting the right tool for the job: A review of behavior monitoring tools used to assess student response to intervention. The California School Psychologist, 10(1), 81–91. https://doi.org/10.1007/BF03340923

Schlientz, M. D., Riley-Tillman, T. C., Briesch, A. M., Chafouleas, S. M., & Walcott, C. M. (2009). The impact of training on the accuracy of direct behavior ratings (DBR). School Psychology Quarterly, 24(2), 73–83. https://doi.org/10.1037/a0016255

Sims, W. A., King, K. R., Reinke, W. M., Herman, & Riley-Tillman, T. C. (2020). Development and preliminary validity evidence for the direct behavior rating-classroom management (DBR–CM). Journal of Educational and Psychological Consultation, 31(2), 215–245. https://doi.org/10.1080/10474412.2020.1732990

Sims, W. A., King, K. R., Zahn, Mandracchia, Monteiro, & Klaib (2023). Measuring classroom management in secondary settings: Ongoing validation of the direct behavior rating-classroom management. Assessment for Effective Intervention, 48(3), 149–158. https://doi.org/10.1177/15345084221118316

Smith & Gillespie (2023). Research on professional development and teacher change: Implications for adult basic education. In Comings, Garner, & Smith (Eds.), Review of adult learning and literacy: Connecting research, policy, and practice (Vol. 7, pp. 205–244). Routledge.

Stawski, R. S. (2013). Multilevel analysis: An introduction to basic and advanced multilevel modeling (2nd Edition). Structural Equation Modeling: A Multidisciplinary Journal, 20(3), 541–550. https://doi.org/10.1080/10705511.2013.797841

Stoiber, K. C., & Gettinger (2016). Multi-tiered systems of support and evidence-based practices. In Jimerson, S. R., Burns, M. K., & VanDerHeyden, A. M. (Eds.), Handbook of response to intervention: The science and practice of multi-tiered systems of support (pp. 121–141). Springer. https://doi.org/10.1007/978-1-4899-7568-3_9

Strohmeier, Mulé, & Luiselli, J. K. (2014). Social validity assessment of training methods to improve treatment integrity of special education service providers. Behavior Analysis in Practice, 7, 15–20. https://doi.org/10.1007/s40617-014-0004-5

Zepeda, S. J. (2012). Professional development, what works (2nd ed.). Eye on Education. https://doi.org/10.4324/9781315854878
