Abstract
Objective:
Irritability and temper outbursts are common presenting concerns in pediatric mental health settings, but systematic monitoring can be difficult to sustain with broad, repeated assessment batteries. This study evaluated the temper outbursts/irritability-2 (TOI-2), an ultra-brief, two-item screener assessing temper outbursts and frequently irritable mood in youth.
Methods:
Participants were 1515 youth aged 5–17 years (M = 11.80, SD = 3.60) referred to an outpatient child psychiatry clinic. Female caregivers (90.4% biological mothers) completed the TOI-2 and criterion measures at intake. Convergent and discriminant validity were assessed via correlations with the Affective Reactivity Index (ARI; administered with a current-state timeframe) and with symptom measures of oppositional defiant disorder (ODD) dimensions, attention-deficit/hyperactivity disorder (ADHD), conduct disorder (CD), depression, and anxiety. Screening accuracy was evaluated using receiver operating characteristic (ROC) analysis, with elevated irritability defined as ARI ≥ 4. Clinical utility was examined using screen-positive versus screen-negative comparisons and known-groups analyses.
Results:
The TOI-2 showed strong convergence with the ARI and ODD-Irritability symptoms (both rs = 0.82). Its association with ODD-Irritability was significantly stronger than its association with ODD-Behavioral symptoms (r = 0.70; p < 0.001), supporting relative specificity to irritability within the broader disruptive behavior spectrum. ROC analysis indicated excellent classification accuracy (area under the curve [AUC] = 0.91). Cut scores of ≥2.5 and ≥3.0 both optimized overall performance (Youden’s J = 0.66 for each). A threshold of ≥3.0 is recommended for specialty-clinic triage, whereas ≥2.5 may be preferable when maximizing case detection is the priority. Youth who screened positive showed greater symptom severity and poorer adjustment across clinical domains. Known-groups analyses showed the largest TOI-2 elevations among youth meeting symptom-count thresholds for ODD and CD.
Conclusions:
The TOI-2 is a psychometrically sound, ultra-brief screener for clinically significant irritability in youth. Positive screens should prompt more comprehensive clinical assessment of irritability, impairment, disruptive behavior, mood symptoms, and treatment needs.
Irritability in youth, characterized as a persistently irritable or angry mood and an increased proneness to temper outbursts (Carlson et al., 2016; Leibenluft et al., 2024), is among the most common presenting concerns in pediatric mental health settings (Evans et al., 2023). Beyond its role as a diagnostic criterion for disruptive mood dysregulation disorder (DMDD) and oppositional defiant disorder (ODD), irritability is a transdiagnostic component of emotional dysregulation (Nigg et al., 2020) and is associated with suicidality, adult depression, and functional impairment (Benarous et al., 2019; Dougherty et al., 2016; Stringaris and Goodman, 2009). Despite its clinical importance, routine irritability assessment remains difficult to sustain within clinical workflows, underscoring the need for practical, systematic approaches to measurement.
The American Academy of Pediatrics and the American Psychological Association recommend universal screening for mental, emotional, and behavioral health at every routine health maintenance visit (Weitzman et al., 2025). In busy clinical settings, however, implementation is constrained by cumulative screening burden. Irritability is rarely assessed in isolation; clinics often assess attention-deficit/hyperactivity disorder (ADHD), oppositional behavior, conduct problems, anxiety, depression, impairment, and other concerns within the same visit or across repeated follow-ups. In this context, even brief measures can add to respondent and clinician burden, increasing reliance on unstructured impressions rather than quantitative symptom tracking (American Psychiatric Association, 2023). The temper outbursts/irritability-2 (TOI-2) was developed to address this need. It is not intended to replace more comprehensive irritability measures when diagnostic classification or treatment planning is needed; rather, it is designed to serve as a first-step screener for triage and repeated monitoring within typical clinical workflows.
Several existing measures provide important points of comparison. Wakschlag and colleagues developed clinically optimized 2- to 5-item screeners derived from the Multidimensional Assessment Profile Scales Temper Loss (MAPS-TL) scale, using item response theory and other methods (Alam et al., 2023; Hirsch et al., 2023; Kirk et al., 2023; Wiggins et al., 2018; Wiggins et al., 2023). These screeners are developmentally sensitive and well-suited to community and risk-enriched contexts. Child behavior checklist (CBCL)-derived irritability composites are also useful when the CBCL has already been administered (e.g., Evans et al., 2020a; Stringaris et al., 2012b), but they are embedded within a longer measure and often include mood-lability content that is only indirectly related to irritability. Similarly, the DSM-5-TR Level 1 Cross-Cutting Symptom Measure for children includes frequency-rated items assessing angry/irritable mood and temper outbursts (Narrow et al., 2013). The TOI-2 fills a narrower clinical workflow gap. It scores two problem-severity-anchored items, identical across ages 5–17 years, as a standalone irritability screener, allowing ratings to reflect the caregiver’s perceived clinical burden of outbursts and irritable mood.
Arguably, a screener for irritability should address both acute outbursts and sustained irritable mood. This distinction is often described as phasic irritability, characterized by sudden and intense outbursts, and tonic irritability, characterized by a more persistent irritable or annoyed mood. Although the clinical utility of separating these components remains an active area of research, prior work suggests that they may show partly different associations with psychopathology and impairment (Cardinale et al., 2021; Silver et al., 2025). The distinction is also consistent with diagnostic frameworks that include both elements: DMDD requires both severe temper outbursts and persistently irritable mood between outbursts, whereas ODD includes symptoms reflecting both loss of temper and sustained irritable or angry mood (American Psychiatric Association, 2013). Thus, an ultra-brief screener should ideally include items assessing both outbursts and irritable mood.
Adult mental health care has addressed similar measurement barriers through ultra-brief screeners, such as the patient health questionnaire-2 (PHQ-2) for depression (Kroenke et al., 2003) and the generalized anxiety disorder-2 (GAD-2) for anxiety (Kroenke et al., 2007). Meta-analytic evidence suggests these tools retain acceptable diagnostic accuracy relative to full diagnostic interviews (Manea et al., 2016; Plummer et al., 2016). A parallel approach may be useful for assessing pediatric irritability. However, no two-item irritability screener that directly assesses both temper outbursts and frequently irritable mood has been validated in a treatment-seeking clinical sample.
The current study evaluated the TOI-2 from the Penn State Psychiatry Clinical Assessment and Rating Evaluation System (PCARES) project (Saunders et al., 2021; Waschbusch et al., 2020). The TOI-2 consists of two parent-rated items assessing temper outbursts and frequent irritable mood. These items were selected to index the irritability component of emotional dysregulation and map onto the irritability dimension of ODD: “temper or anger outbursts” corresponds to losing temper, whereas “frequently irritable” corresponds to angry/resentful mood and being touchy or easily annoyed. In contrast, the ODD-Behavioral dimension captures behaviors such as arguing, defying, deliberately annoying others, blaming others, and spitefulness. Using a large outpatient sample of youth aged 5.0 to 17.9 years, we examined convergent and discriminant validity, classification accuracy for clinically significant irritability, and clinical utility of the recommended screening threshold. We hypothesized that the TOI-2 would show strong convergence with the Affective Reactivity Index (ARI), correlate more strongly with ODD-Irritability than with ODD-Behavioral symptoms, exhibit expected transdiagnostic associations with externalizing and internalizing symptoms, and demonstrate acceptable classification accuracy.
Method
Participants and procedures
Data were drawn from intake assessments of 1515 youth ages 5.0–17.9 years referred to an outpatient child psychiatry clinic within an academic medical center. As part of routine intake, caregivers completed standardized rating scales in the waiting room immediately before the scheduled visit. Because multiple caregivers occasionally attended intake, ratings from a single caregiver per child were retained to satisfy rater independence; specifically, we used the primary female caregiver, most often the biological mother. See Table 1 for demographic and clinical characteristics.
Demographic and Clinical Characteristics of the Sample (N = 1515)
Sample sizes for demographic variables were as follows: child age (N = 1515; 100%), child sex (n = 1514; 99.9%), child race (n = 1452; 95.8%), child ethnicity (n = 1331; 87.9%), informant relationship (n = 1515; 100%), parent education (n = 1462; 96.5%), and annual income (n = 1392; 91.9%). Percentages in the table were computed using valid (non-missing) cases. Parent income and education refer to the reporting female caregiver. Clinical scores for ODD, ADHD, and CD are item means (range 0–3). Depression and Anxiety are summed scores. ARI = Affective Reactivity Index (Stringaris et al., 2012a); MFQ-S = Mood and Feelings Questionnaire-Short Form (Angold et al., 1995); SCARED-5 = Screen for Child Anxiety Related Emotional Disorders-5 Item (Birmaher et al., 1999; Waschbusch et al., 2025); TOI-2 = Temper Outbursts/Irritability-2 Screener. ADHD, ODD, and CD were computed from the Disruptive Behavior Disorders Rating Scale (Fosco et al., 2023; Pelham et al., 1992).
Data were collected as part of the Penn State Psychiatry Clinical Assessment and Rating Evaluation System, a measurement-based care initiative that integrates standardized assessment into routine clinical workflow (Saunders et al., 2021; Waschbusch et al., 2020). Of 2770 intake assessments collected between January 2017 and February 2022, we retained those for youth ages 5.0–17.9 years and one eligible caregiver per child. We prioritized the caregiver with greater caregiver-child interaction frequency and resolved ties by retaining the earliest intake. Youth were excluded if both TOI-2 items were missing, yielding the final analytic sample (N = 1515). This study was approved by the Penn State College of Medicine IRB (STUDY00028393); consent was waived because data were collected as part of routine clinical care and analyzed retrospectively in de-identified form.
Measures
Brief problem rating form
The brief problem rating form (BPRF) is a 35-item measure developed for PCARES to assess a broad range of psychopathology symptoms. Items are rated from 0 (Not a Problem) to 6 (Serious Problem). Items 1–19 assess specific symptom domains, and item 20 (“Overall adjustment”) provides a global indicator of functioning. The TOI-2 was operationalized as the mean of two BPRF items: item 7 (“Temper or anger outbursts”) and item 8 (“Frequently irritable”). The standalone TOI-2 is included in the Appendix. Internal consistency of the two-item score was good (α = 0.84). For clinical utility analyses, BPRF item 20 was analyzed both continuously and as a dichotomy indicating no/low impairment (0–2) versus moderate-to-severe impairment (3–6).
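The TOI-2 scoring rules can be sketched in Python. This sketch is not the PCARES scoring code. The function names are hypothetical, and missing responses are assumed to be coded as `None`. The one-item fallback follows the missing-data rule given under Statistical analysis.

```python
from typing import Optional

BPRF_MIN, BPRF_MAX = 0, 6   # BPRF ratings: 0 (Not a Problem) to 6 (Serious Problem)
TRIAGE_CUT = 3.0            # recommended specialty-clinic triage threshold
CASE_FINDING_CUT = 2.5      # alternative when maximizing detection is the priority


def toi2_score(item7: Optional[float], item8: Optional[float]) -> Optional[float]:
    """Mean of BPRF item 7 (temper/anger outbursts) and item 8 (frequently irritable).

    If one item is missing, the other is used alone; if both are missing,
    the case is unscorable and None is returned.
    """
    present = [x for x in (item7, item8) if x is not None]
    for x in present:
        if not BPRF_MIN <= x <= BPRF_MAX:
            raise ValueError(f"BPRF rating out of range: {x}")
    if not present:
        return None
    return sum(present) / len(present)


def toi2_screen_positive(score: Optional[float], cut: float = TRIAGE_CUT) -> Optional[bool]:
    """Screen status at the given cut score; None when the TOI-2 is unscorable."""
    return None if score is None else score >= cut


def adjustment_impaired(item20: Optional[float]) -> Optional[bool]:
    """BPRF item 20 dichotomized: 0-2 = no/low vs 3-6 = moderate-to-severe impairment."""
    return None if item20 is None else item20 >= 3
```

For example, ratings of 3 and 2 give a TOI-2 score of 2.5. That score screens positive at ≥2.5 but not at ≥3.0.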
Affective reactivity index
The ARI served as the primary criterion measure of irritability (Stringaris et al., 2012a). It includes six irritability items and one impairment item (“Overall, irritability causes him/her problems”), rated from 0 (Not True) to 2 (Certainly True). In the PCARES intake battery, caregivers were instructed: “At this time and as compared to others of the same age, how well does each of the following statements describe the behavior/feelings of your child/adolescent?” Thus, the item content and response scale were retained, but the temporal anchor reflected current functioning rather than the original 6-month timeframe. This administration was selected to align with the TOI-2 and other PCARES intake measures and is consistent with the current-state timeframe used in the DSM-5-TR Level 2 version of the ARI (Narrow et al., 2013). The ARI total score was calculated as the sum of items 1–6, excluding the impairment item. A cutoff of ≥4 was used to identify elevated irritability for three reasons: it distinguishes youth with DMDD from controls (Stringaris et al., 2012a), it corresponds to the median score in a comparable outpatient sample (Evans et al., 2021), and it marks the point at which impairment became common in the present sample. ARI item 7 was dichotomized as no impairment (0) versus any impairment (1–2). Internal consistency was excellent (α = 0.91).
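The ARI scoring described above can be sketched as follows. The function names are hypothetical. The handling of partially missing items is also an assumption: the text states only that scale scores were computed when at least half of the items were completed, not how a summed score is rescaled.

```python
from typing import Optional, Sequence

ARI_ELEVATED_CUT = 4  # ARI total >= 4 defines elevated irritability in this study


def ari_total(items_1_to_6: Sequence[Optional[int]]) -> Optional[float]:
    """Sum of ARI items 1-6 (each 0-2); the impairment item 7 is excluded.

    Returns None unless at least half (3) of the six items were answered.
    ASSUMPTION: partially missing responses are prorated from the answered-item
    mean; the article does not state how summed scores were rescaled.
    """
    if len(items_1_to_6) != 6:
        raise ValueError("pass exactly ARI items 1-6")
    answered = [x for x in items_1_to_6 if x is not None]
    if any(not 0 <= x <= 2 for x in answered):
        raise ValueError("ARI items are rated 0-2")
    if 2 * len(answered) < 6:
        return None
    return sum(answered) * 6 / len(answered)


def ari_elevated(total: Optional[float]) -> Optional[bool]:
    """Elevated irritability criterion used for the ROC analyses."""
    return None if total is None else total >= ARI_ELEVATED_CUT


def ari_impaired(item7: Optional[int]) -> Optional[bool]:
    """ARI item 7 dichotomized: 0 = no impairment, 1-2 = any impairment."""
    return None if item7 is None else item7 >= 1
```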
Disruptive behavior disorders rating scale
The disruptive behavior disorders rating scale (DBDRS) is a 45-item measure assessing the DSM-5 symptoms of ADHD, ODD, and conduct disorder (CD) (Fosco et al., 2023; Pelham et al., 1992). Items are rated from 0 (Not at All) to 3 (Very Much). Following prior research distinguishing irritable and behavioral dimensions within ODD (Burke et al., 2014; Burke et al., 2021), symptoms were divided into the ODD-Irritability (items 24, 26, 28) and ODD-Behavioral (items 3, 13, 15, 17, 39) subscales. Additional scales included ADHD-Inattention, ADHD-Hyperactive/Impulsive, and CD. Internal consistencies were acceptable to high: ADHD-Inattention (α = 0.92), ADHD-Hyperactive/Impulsive (α = 0.91), CD (α = 0.79), ODD-Behavioral (α = 0.87), and ODD-Irritability (α = 0.83).
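DBDRS subscale scores are item means on the 0–3 metric. The sketch below applies the at-least-half-completed rule from the Statistical analysis section to the ODD item sets listed above. The dictionary input format and the function name are assumptions made for illustration.

```python
from typing import Mapping, Optional, Sequence

# DBDRS item numbers for the two ODD dimensions, as listed in the text
ODD_IRRITABILITY_ITEMS = (24, 26, 28)
ODD_BEHAVIORAL_ITEMS = (3, 13, 15, 17, 39)


def subscale_mean(responses: Mapping[int, Optional[int]], items: Sequence[int]) -> Optional[float]:
    """Item mean on the 0-3 metric, computed only if at least half the items were answered.

    `responses` maps DBDRS item number -> rating (0 = Not at All ... 3 = Very Much);
    None or an absent key means the item was skipped.
    """
    answered = [responses[i] for i in items if responses.get(i) is not None]
    if 2 * len(answered) < len(items):
        return None  # fewer than half of the items completed
    return sum(answered) / len(answered)
```

With three ODD-Irritability items, two answered items are enough to score the subscale. With five ODD-Behavioral items, three are required.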
Screen for child anxiety-related emotional disorders–short form (SCARED-5)
Anxiety symptoms were assessed using the 5-item SCARED-5 (Birmaher et al., 1999; Waschbusch et al., 2025). Items are rated from 0 (Never True or Hardly Ever True) to 2 (Very True or Often True) and summed, with scores ≥3 indicating elevated anxiety. Internal consistency was acceptable (α = 0.71).
Mood and feelings questionnaire—short form (MFQ-S)
Depressive symptoms were assessed using the 13-item MFQ-S (Angold et al., 1995). Items are rated from 0 (Not True) to 2 (True) and summed, with scores ≥11 indicating elevated depression. Internal consistency was excellent (α = 0.90).
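Both short forms use summed scores with a fixed cutoff. The sketch below assumes complete responses, because this section does not specify how partially missing summed scores were handled. The function names are hypothetical.

```python
from typing import Sequence

SCARED5_CUT = 3  # summed SCARED-5 score >= 3 indicates elevated anxiety
MFQS_CUT = 11    # summed MFQ-S score >= 11 indicates elevated depression


def summed_score(items: Sequence[int], n_items: int, max_rating: int = 2) -> int:
    """Sum of complete item responses, each rated 0..max_rating."""
    if len(items) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(items)}")
    if any(not 0 <= x <= max_rating for x in items):
        raise ValueError("item rating out of range")
    return sum(items)


def elevated_anxiety(scared5_items: Sequence[int]) -> bool:
    return summed_score(scared5_items, n_items=5) >= SCARED5_CUT


def elevated_depression(mfqs_items: Sequence[int]) -> bool:
    return summed_score(mfqs_items, n_items=13) >= MFQS_CUT
```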
Statistical analysis
Analyses were conducted using R version 4.5.2 (R Core Team, 2025). The TOI-2 was calculated as the mean of the two BPRF items; when one item was missing, the other was used. For multi-item measures, scale scores were computed when at least half of the items were completed. Item-level missingness was low (<5%).
Primary validity analyses used Pearson correlations between the TOI-2 and criterion measures of irritability, disruptive behavior, ADHD, depression, and anxiety. The a priori hypothesis that TOI-2 scores would correlate more strongly with ODD-Irritability than with ODD-Behavioral symptoms was tested using Steiger’s Z test for dependent correlations (Steiger, 1980).
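Steiger’s test compares two correlations that share a variable, here the TOI-2. It accounts for the correlation between the two non-shared variables, here ODD-Irritability and ODD-Behavioral. The sketch below implements one widely used form of the test, which pools the two correlations under the null hypothesis before estimating their covariance. The correlation between the two ODD dimensions is not reported in this section, so any inputs are illustrative. This is not the authors’ analysis code.

```python
import math


def steiger_z(r_jk: float, r_jh: float, r_kh: float, n: int) -> tuple[float, float]:
    """Steiger's (1980) Z for two dependent correlations sharing variable j.

    r_jk, r_jh: correlations of the shared variable j (e.g., TOI-2) with k and h
    r_kh: correlation between k and h (e.g., ODD-Irritability with ODD-Behavioral)
    Returns (Z, two-sided p-value).
    """
    z_jk, z_jh = math.atanh(r_jk), math.atanh(r_jh)  # Fisher r-to-z transforms
    r_bar = (r_jk + r_jh) / 2                         # pooled correlation under H0
    psi = r_kh * (1 - 2 * r_bar**2) - 0.5 * r_bar**2 * (1 - 2 * r_bar**2 - r_kh**2)
    c = psi / (1 - r_bar**2) ** 2                     # correlation between z_jk and z_jh
    z = (z_jk - z_jh) * math.sqrt(n - 3) / math.sqrt(2 - 2 * c)
    p = math.erfc(abs(z) / math.sqrt(2))              # two-sided standard normal p-value
    return z, p
```

A positive Z indicates that the first correlation is the larger one. Swapping the two correlations flips the sign of Z but leaves the p-value unchanged.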
Classification accuracy was evaluated using receiver operating characteristic (ROC) analysis, with elevated irritability defined as an ARI total score ≥4. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), area under the curve (AUC), and Youden’s J were examined across candidate cut scores to identify thresholds appropriate for specialty clinic triage versus broader case-finding.
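The cut-score statistics can be computed directly from the 2 × 2 classification at each threshold. The sketch below uses toy data, not study data. It estimates the AUC with the Mann-Whitney formulation, which equals the trapezoidal area under the empirical ROC curve when ties are counted as one half. The function names are hypothetical.

```python
from typing import Sequence


def cut_score_table(scores: Sequence[float], cases: Sequence[bool],
                    cuts: Sequence[float]) -> list[dict[str, float]]:
    """Sensitivity, specificity, PPV, NPV, and Youden's J for the rule 'score >= cut'."""
    n_case = sum(1 for c in cases if c)
    n_non = len(cases) - n_case
    rows = []
    for cut in cuts:
        tp = sum(1 for s, c in zip(scores, cases) if c and s >= cut)
        fp = sum(1 for s, c in zip(scores, cases) if not c and s >= cut)
        fn, tn = n_case - tp, n_non - fp
        sens, spec = tp / n_case, tn / n_non
        rows.append({
            "cut": cut,
            "sensitivity": sens,
            "specificity": spec,
            "ppv": tp / (tp + fp) if tp + fp else float("nan"),
            "npv": tn / (tn + fn) if tn + fn else float("nan"),
            "youden_j": sens + spec - 1,
        })
    return rows


def auc(scores: Sequence[float], cases: Sequence[bool]) -> float:
    """Mann-Whitney AUC: P(case > non-case) + 0.5 * P(tie) over all case/non-case pairs."""
    pos = [s for s, c in zip(scores, cases) if c]
    neg = [s for s, c in zip(scores, cases) if not c]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: TOI-2 scores with ARI-defined case status (ARI total >= 4)
toi2 = [0.5, 1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
ari_case = [False, False, False, True, False, True, True, True]
```

In this toy example, the ≥2.5 cut captures every case but also one non-case. The ≥3.0 cut misses one case while keeping the same false positive. This is the sensitivity/specificity trade-off that the candidate-threshold comparison is meant to expose.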
Clinical utility was evaluated by classifying participants as screen-positive or screen-negative based on the recommended TOI-2 threshold of ≥3.0. Groups were compared on clinical outcomes and demographic characteristics using independent-samples t tests and chi-square tests, with Cohen’s d used to quantify effect sizes. The Benjamini-Hochberg false discovery rate procedure was applied to the family of clinical outcome comparisons. Categorical impairment outcomes were also examined using risk differences and odds ratios with 95% confidence intervals. Additional analyses of symptom-defined clinical groups, item-level performance, and age-related effects are summarized in the Results and reported fully in the Supplement.
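Cohen’s d with a pooled standard deviation and the Benjamini-Hochberg step-up procedure can each be sketched in a few lines. The false discovery rate level q = 0.05 is an assumption, since the text does not state it. The function names are hypothetical.

```python
import math
from statistics import mean, variance
from typing import Sequence


def cohens_d(group1: Sequence[float], group2: Sequence[float]) -> float:
    """Standardized mean difference (group1 - group2) using the pooled SD."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / math.sqrt(pooled_var)


def benjamini_hochberg(pvalues: Sequence[float], q: float = 0.05) -> list[bool]:
    """Benjamini-Hochberg step-up procedure: True where the null is rejected at FDR q."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= (k / m) * q
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank / m * q:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= k_max
    return reject
```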
Results
Convergent and discriminant validity
The TOI-2 showed strong convergence with the ARI total score and ODD-irritability symptoms (Table 2). Consistent with our hypothesis, the TOI-2 correlated more strongly with ODD-Irritability than ODD-behavioral symptoms, Steiger’s Z = 11.19, p < 0.001, supporting relative specificity to the affective/irritable component of ODD.
Convergent and Discriminant Validity: Correlations of the TOI-2 Screener and ARI Total Score with Criterion Measures
Sample sizes ranged from 1470 to 1502 due to missing data. All correlations were statistically significant (p < 0.001). ADHD, ODD, and conduct disorder measures were derived from the Disruptive Behavior Disorders Rating Scale (DBDRS). TOI-2 = Temper Outbursts/Irritability–2 Screener; ARI = Affective Reactivity Index; MFQ-S = Mood and Feelings Questionnaire–Short Form; SCARED-5 = Screen for Child Anxiety Related Emotional Disorders–5 Item.
The TOI-2 also showed moderate associations with CD, ADHD-Hyperactive/Impulsive, and ADHD-Inattention symptoms, consistent with the expected overlap between irritability and broader externalizing problems in clinical samples (Table 2). Associations with internalizing symptoms were smaller, with a modest association with depression and a weaker association with anxiety. The overall pattern of correlations closely paralleled the ARI (Table 2).
Classification accuracy of the TOI-2 screener
ROC analysis was conducted among participants with complete TOI-2 and ARI data (N = 1486). Elevated irritability was defined as an ARI total score ≥4; 840 youth (56.5%) met this criterion. As shown in Figure 1, the TOI-2 demonstrated excellent classification accuracy, AUC = 0.91, 95% CI [0.90, 0.93].

Receiver operating characteristic curve for the TOI-2 Screener predicting clinically elevated irritability on the Affective Reactivity Index. AUC = 0.91, 95% CI [0.90, 0.93]. Clinically elevated irritability was defined as an Affective Reactivity Index total score ≥4. The diagonal line represents chance-level classification (AUC = 0.50).
TOI-2 cut scores of ≥2.5 and ≥3.0 yielded comparable overall performance (Table 3), with both yielding Youden’s J = 0.66. We selected ≥3.0 as the recommended threshold for specialty clinic intake triage because it provided higher specificity and PPV, thereby reducing false-positive classifications, and because it corresponds to a clinically interpretable average item rating of at least “moderate problem.” A lower threshold of ≥2.5 may be preferable in settings where maximizing sensitivity is the priority.
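Assume both items are answered with whole-number ratings from 0 to 6. The TOI-2 mean can then take only half-point values, so ≥3.0 is equivalent to an item sum of at least 6, and ≥2.5 to a sum of at least 5. Enumerating the 49 possible response pairs makes the two thresholds concrete. Note that a rating of 6 on one item and 0 on the other also reaches ≥3.0. The function name is hypothetical.

```python
from itertools import product


def positive_pairs(cut: float) -> list[tuple[int, int]]:
    """All (item 7, item 8) rating pairs, each 0-6, whose mean is >= cut."""
    return [(a, b) for a, b in product(range(7), repeat=2) if (a + b) / 2 >= cut]


for cut in (2.5, 3.0):
    pairs = positive_pairs(cut)
    print(f">= {cut}: {len(pairs)} of 49 rating pairs screen positive")
```

Of the 49 pairs, 28 meet ≥3.0 and 34 meet ≥2.5. The six additional pairs captured by the lower threshold are those summing to exactly 5.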
Diagnostic Performance of the TOI-2 Screener at Various Cut-Off Scores
The criterion for clinically significant irritability was an ARI total score ≥4. The TOI-2 score represents the mean of Items 7 and 8 of the Brief Problem Rating Form (range = 0–6). Sensitivity reflects the proportion of true clinical cases correctly identified, and specificity reflects the proportion of non-clinical cases correctly excluded. Positive predictive value (PPV) is the proportion of individuals at or above the cut-off who meet criteria for clinical irritability, whereas negative predictive value (NPV) is the proportion of individuals below the cut-off who do not meet criteria. Bold font indicates the cut-off scores that maximized Youden’s Index (J = Sensitivity + Specificity − 1).
Clinical utility
Using the recommended threshold of ≥3.0, 784 youth (51.7%) screened positive, and 731 (48.3%) screened negative. As shown in Table 4, screen-positive youth had significantly greater symptom severity and poorer overall adjustment across all clinical domains. Effects were largest for ARI total score, ODD-irritability, ODD-behavioral, and CD symptoms, with smaller but significant differences for ADHD, depression, and anxiety. All clinical outcome comparisons remained significant after Benjamini-Hochberg correction. Results were similar when using the alternative ≥2.5 threshold (Supplementary Table S1).
Clinical Utility of the TOI-2: Characteristics by Screen Status (Cut Score ≥3.0)
Ns vary across variables due to missing data. Screen-negative cases had a TOI-2 mean score <3.0, whereas screen-positive cases had a TOI-2 mean score ≥3.0. The TOI-2 score represents the mean of two Brief Problem Rating Form items assessing temper outbursts and irritability (range = 0–6). ARI impairment was indexed using item 7 (“Overall, irritability causes him/her problems”), dichotomized as 0 versus 1–2. BPRF item 20 (“Overall adjustment”) was analyzed as a continuous and categorical indicator; for categorical analyses, it was dichotomized into 0–2 (no/low impairment) versus 3–6 (moderate/high impairment). ADHD, ODD, and conduct disorder measures were derived from the Disruptive Behavior Disorders Rating Scale. MFQ-S = Mood and Feelings Questionnaire–Short Form; SCARED-5 = Screen for Child Anxiety Related Emotional Disorders–5 Item. The Benjamini-Hochberg false discovery rate procedure was applied to the family of nine clinical and functional outcome comparisons; all remained significant after correction. * = p < 0.05. ** = p < 0.01. *** = p < 0.001.
Categorical impairment analyses supported the clinical meaningfulness of the screening threshold. Screen-positive youth were more likely than screen-negative youth to have irritability-related impairment on ARI item 7 and moderate-to-severe overall adjustment problems on BPRF item 20 (Table 4). Demographic differences were statistically significant but small: males, youth from lower-income households, and youth whose parents had lower educational attainment were more likely to screen positive; there were no differences by race or ethnicity.
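The risk differences and odds ratios for the dichotomized impairment outcomes can be sketched with standard Wald 95% confidence intervals. Three choices here are assumptions: the counts used in any example are hypothetical, the text does not say which interval method was used, and zero cells are rejected rather than corrected. The function name is hypothetical.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def rd_and_or(a: int, b: int, c: int, d: int) -> dict[str, tuple[float, float, float]]:
    """Risk difference and odds ratio with Wald 95% CIs for a 2x2 table.

    a, b = impaired, not impaired among screen-positive youth
    c, d = impaired, not impaired among screen-negative youth
    Returns {measure: (estimate, lower, upper)}.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all cells must be positive (no zero-cell correction applied)")
    p1, p0 = a / (a + b), c / (c + d)          # impairment risk in each screen group
    rd = p1 - p0
    se_rd = math.sqrt(p1 * (1 - p1) / (a + b) + p0 * (1 - p0) / (c + d))
    log_or = math.log((a * d) / (b * c))
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return {
        "risk_difference": (rd, rd - Z_95 * se_rd, rd + Z_95 * se_rd),
        "odds_ratio": (math.exp(log_or),
                       math.exp(log_or - Z_95 * se_log_or),
                       math.exp(log_or + Z_95 * se_log_or)),
    }
```

Consider a hypothetical table in which 60 of 100 screen-positive and 30 of 100 screen-negative youth are impaired. That table gives a risk difference of 0.30 and an odds ratio of (60 × 70) / (40 × 30) = 3.5.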
Known-groups validity
Known-groups analyses further supported clinical utility (Table 5; Supplementary Fig. S1). TOI-2 scores and screen-positive rates were highest among youth meeting symptom-count thresholds for ODD and CD. More moderate elevations were observed for ADHD and depression, whereas anxiety showed the smallest group difference. Among youth meeting ADHD criteria, TOI-2 elevation was concentrated among those with comorbid ODD or CD; youth with ADHD only scored below the screening threshold on average.
TOI-2 Scores and Screen Positive Rates Across Symptom-Defined Groups
In Group = participants who met grouping criteria; Out Grp = participants who did not meet grouping criteria. Groups were overlapping rather than mutually exclusive. Comparisons were between youth meeting the criteria for a given condition and clinic-referred youth not meeting those criteria. Screen-positive = TOI-2 ≥ 3.0. ODD = Oppositional Defiant Disorder. ADHD = Attention-Deficit/Hyperactivity Disorder. CD = Conduct Disorder. Dep = Depression. Anx = Anxiety. ODD, ADHD, and CD groups were defined using symptom-count criteria on the Disruptive Behavior Disorders Rating Scale. Dep and Anx groups were defined using cutoff scores on the Mood and Feelings Questionnaire–Short Form and the SCARED-5, respectively. All t-tests and chi-square tests were significant at p < 0.001.
Additional sensitivity analyses
Item-level analyses examined the two TOI-2 items separately to evaluate the phasic and tonic components of irritability, reflected in “temper or anger outbursts” and “frequently irritable,” respectively. The items were strongly correlated, r = 0.73, p < 0.001, supporting their combination into a two-item mean score without suggesting redundancy. Both items performed well individually in ROC analyses. Differential correlation analyses showed that the temper outburst item was more strongly associated with externalizing criteria, whereas the frequently irritable item was more strongly associated with internalizing criteria. Age analyses showed that screen-positive rates declined across development, but demographic associations with screen-positive status were unchanged after age adjustment. Full item-level and age-related analyses are reported in Supplementary Tables S2, S3, S4, and S5.
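For two items, standardized Cronbach’s alpha reduces to the Spearman-Brown formula. That formula links the inter-item correlation reported here to the internal consistency reported for the TOI-2. The reported α = 0.84 may be raw alpha based on item variances, so the correspondence below is approximate rather than exact.

```latex
\alpha_{\mathrm{std}} = \frac{k\,\bar{r}}{1 + (k - 1)\,\bar{r}}
  \;\overset{k=2}{=}\; \frac{2r}{1 + r}
  = \frac{2(0.73)}{1.73} \approx 0.84
```

An inter-item correlation of 0.73 therefore accounts almost entirely for the observed α. At the same time, it leaves roughly half of each item's variance unshared (1 − 0.73² ≈ 0.47), which is consistent with the items being related but not redundant.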
Discussion
This study evaluated the psychometric properties of the TOI-2, an ultra-brief screener for assessing temper outbursts and frequent irritability in youth. Results supported our hypotheses. The TOI-2 showed strong convergence with the ARI and a significantly stronger association with ODD-irritability than with ODD-behavioral symptoms, supporting relative specificity to the affective/irritability dimension of ODD. The screener also showed excellent classification accuracy for identifying ARI-defined irritability. Cut scores of ≥2.5 and ≥3.0 both optimized performance, with ≥3.0 recommended for specialty clinic triage and ≥2.5 potentially preferable for broad case-finding. Youth who screened positive exhibited substantially greater symptom severity and impairment across clinical domains. Associations with internalizing symptoms were smaller but consistent with models of irritability as a transdiagnostic risk marker (Brotman et al., 2017; Leibenluft et al., 2024).
Irritability and emotion dysregulation
The TOI-2 should be interpreted as a screener for irritability, not as a comprehensive measure of emotion dysregulation. Its two items assess temper outbursts and frequent irritable mood, which are central manifestations of irritability and overlap with the emotional dysregulation construct emphasized in recent clinical frameworks. This focus is consistent with the American Academy of Child and Adolescent Psychiatry (AACAP) Presidential Task Force’s emphasis on impairing emotional outbursts as an important measurement target (Althoff et al., 2025). However, the TOI-2 does not assess broader components of emotion dysregulation, including mood lability, emotional intensity, regulatory capacity, contextual triggers, or functional consequences. Positive screens should therefore prompt further assessment rather than be interpreted as a complete characterization of emotion dysregulation.
The TOI-2 should also be interpreted alongside other brief irritability indicators. MAPS-TL-derived screeners offer developmentally optimized tools for assessing irritability and temper loss, particularly in community and risk-enriched samples, though their item content varies across developmental periods. CBCL-derived irritability composites are useful when the CBCL is already administered as part of a broadband assessment, but they are embedded within a longer measure and often include mood-lability content that is related to, but broader than, irritability. DSM cross-cutting screening items similarly capture angry/irritable mood and temper outbursts within a broader symptom screen. The TOI-2 fills a narrower clinical workflow niche by using the same two problem-severity-anchored items across ages 5–17, which may facilitate cross-age comparison, specialty clinic triage, and repeated symptom monitoring. Selection among brief irritability indicators depends on the intended purpose (i.e., developmental characterization, broadband assessment, triage, diagnostic clarification, or treatment monitoring).
Classification accuracy and clinical utility
The TOI-2 achieved an AUC of 0.91 for identifying ARI-defined irritability, with cut scores of ≥2.5 and ≥3.0 yielding comparable overall performance. This diagnostic performance compares favorably with established adult ultra-brief screeners, including the PHQ-2 for depression (Manea et al., 2016) and the GAD-2 for anxiety (Plummer et al., 2016). However, the AUC is likely an upper-bound estimate because the criterion measure (ARI) relied on the same informant and method as the TOI-2. Validation against a structured diagnostic interview and independently assessed impairment is needed to establish performance under more stringent conditions.
Clinical utility analyses indicated that youth who screened positive exhibited markedly greater symptom severity and poorer overall adjustment, with particularly large differences on irritability-related measures. At the same time, elevated TOI-2 scores should not be interpreted as indicating irritability outside the broader disruptive behavior spectrum. The TOI-2 was strongly associated with ODD-behavioral symptoms, and known-groups analyses showed the largest elevations among youth meeting symptom-count thresholds for ODD and CD. Thus, positive screens should prompt comprehensive assessments of ODD symptoms, impairment, and related treatment needs rather than be interpreted as evidence of a condition separate from disruptive behavior pathology.
Demographic differences in screen-positive status were significant but small. Males were slightly more likely than females to screen positive, and higher positive-screen rates were observed among families with lower household income and lower parental education. No differences emerged by race or ethnicity. These patterns are consistent with epidemiological research showing higher rates of externalizing problems and irritability among youth from lower socioeconomic backgrounds (Carlson et al., 2016; Copeland et al., 2013), whereas findings on sex differences are mixed (Copeland et al., 2015; Leibenluft et al., 2024; Mayes et al., 2019; Riglin et al., 2019). Future research should test measurement invariance to determine whether observed group differences reflect true differences in irritability rather than differential item functioning.
Clinical implications
The TOI-2 may be useful for identifying and monitoring youth with clinically significant irritability, including youth with DMDD symptoms or the irritable dimension of ODD. Elevated scores should not be interpreted as evidence of a condition separate from ODD or as an indication for medication-focused care. Rather, positive screens should prompt assessment of irritability, impairment, ODD symptoms, DMDD criteria, ADHD, mood symptoms, and contextual contributors to outbursts.
This distinction is important because behavioral parent training and cognitive-behavioral therapy are empirically supported treatments for youth with irritability and disruptive behavior problems, and available evidence indicates that these approaches can reduce irritability (Breaux et al., 2023; Waxmonsky et al., 2021). For example, Evans et al. (2020b) found that standard behavioral parent training and cognitive-behavioral treatment protocols, as well as the modular transdiagnostic treatment, outperformed usual care in reducing irritability among youth with severe irritability and mood dysregulation. Systematic screening with the TOI-2 may therefore support timely triage toward these evidence-based psychosocial treatments.
This is particularly relevant given evidence that youth diagnosed with DMDD are increasingly receiving antipsychotics and mood stabilizers, often before psychotherapy services or optimization of ADHD medications (Baweja et al., 2025). Used as part of a measurement-based workflow, the TOI-2 could help identify youth who warrant further assessment and referral to behavioral services, while repeated assessments may help clinicians detect inadequate response to first-line interventions and adjust treatment plans accordingly.
Limitations
Several limitations should be considered. First, ratings were provided by a single female caregiver, predominantly biological mothers, which may limit generalizability to other informants. Multi-informant irritability studies support the validity and reliability of both parent- and self-report measures, but interrater agreement is modest (Bekiropoulou et al., 2026; Vuori et al., 2022). Second, classification accuracy was evaluated against a rating-scale criterion, ARI ≥ 4, rather than against a structured diagnostic interview or independently assessed impairment. Shared informant and method variance may have inflated classification performance. In addition, the ARI was administered with a current-state timeframe to align with the TOI-2 and other PCARES measures, whereas the ≥ 4 threshold was derived from prior ARI research using different temporal anchors. Third, the cross-sectional design precludes conclusions about sensitivity to change, which will be important for using the TOI-2 as a monitoring tool. Longitudinal research is needed to determine whether the TOI-2 detects clinically meaningful improvement during treatment and to establish benchmarks for reliable change. Fourth, although the TOI-2 captures core features of irritability, it does not assess contextual factors, duration, intensity, or situational pervasiveness of outbursts, all of which may be relevant for treatment planning and diagnosis. Fifth, the sample was drawn from a single outpatient psychiatry clinic and did not include a healthy comparison group, which may limit generalizability. Finally, clinical group comparisons were based on rating-scale symptom thresholds rather than structured diagnostic interviews, so they should be interpreted as known-groups validity analyses rather than diagnostic validation.
Conclusions
The TOI-2 is a psychometrically sound, ultra-brief screener for clinically significant irritability in youth aged 5–17 years. By assessing temper outbursts and frequent irritable mood, the TOI-2 captures a clinically important component of youth emotion dysregulation. It demonstrated strong convergent validity with established irritability measures, excellent classification accuracy for ARI-defined elevated irritability, and robust known-groups validity. The TOI-2 reduces the assessment burden to two parent-rated items while maintaining operating characteristics comparable to those of established ultra-brief screeners. It should be used as a first-step screener, with positive results prompting a more comprehensive assessment.
Clinical Significance
The TOI-2 provides clinicians with a tool that can be completed in under 30 seconds and scored immediately, enabling systematic screening for irritability without adding substantial burden to the clinical workflow. A cut score of ≥3.0 is recommended for specialty clinic intake triage, whereas a cut score of ≥2.5 may be preferable when maximizing case detection is the priority. The TOI-2 may support front-end triage, routine monitoring of treatment response, and EHR-integrated outcome tracking for quality improvement and pragmatic research. Clinicians should use the TOI-2 as a first-step screen, with positive results prompting a comprehensive assessment using validated measures, such as the full ARI and evaluation of diagnostic criteria for ODD, CD, DMDD, ADHD, and mood disorders.
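As an illustration of how the two recommended cut scores might be applied at intake, the following is a minimal sketch. It assumes the TOI-2 score has already been computed (the item response scale and scoring rule are not restated in this section), and the function and constant names are hypothetical, not part of any published scoring tool. A positive result only flags the need for comprehensive assessment; it is not a diagnosis.

```python
# Hypothetical helper for applying TOI-2 cut scores.
# ASSUMPTION: `score` is an already-computed TOI-2 score; the scoring rule
# itself is not reproduced here.

TRIAGE_CUT = 3.0     # recommended for specialty-clinic intake triage
DETECTION_CUT = 2.5  # preferable when maximizing case detection is the priority


def toi2_screen(score: float, maximize_detection: bool = False) -> bool:
    """Return True (screen positive) if the score meets or exceeds the chosen cut."""
    cut = DETECTION_CUT if maximize_detection else TRIAGE_CUT
    return score >= cut


if __name__ == "__main__":
    # Compare the two thresholds on a few example scores.
    for s in (2.0, 2.5, 3.0):
        print(s, toi2_screen(s), toi2_screen(s, maximize_detection=True))
```

The lower threshold trades specificity for sensitivity, so a score of 2.5 is screen positive only when case detection is prioritized.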
Authors’ Contributions
D.A.W.: Conceptualization, data curation, formal analysis, methodology, writing—original draft, writing—review and editing. D.E.B.: Conceptualization, writing—review and editing. F.H.S.: Conceptualization, writing—review and editing. J.G.W.: Conceptualization, writing—review and editing. R.B.: Conceptualization, writing—review and editing.
Ethical Considerations
This study was approved by the Penn State College of Medicine IRB (STUDY00028393).
Consent to Participate
Informed consent was waived because the data were collected as part of routine clinical care and analyzed retrospectively in de-identified form.
Data Availability
Data are available upon request from the first author.
Declaration of Conflict of Interest
Dr. Saunders reports funding from the National Institutes of Health and LivaNova, outside the submitted work. Dr. Babinski reports funding from the National Institutes of Health, outside the submitted work. Dr. Baweja reports funding from the Cardinal Health Foundation/Children’s Hospital Association Zero Suicide Initiative and Supernus, and has served on an advisory board for Ironshore, outside the submitted work. Dr. Waxmonsky reports consulting fees from Supernus Pharmaceuticals and Collegium Pharmaceutical, and funding from the National Institutes of Health, outside the submitted work. Dr. Waschbusch reports funding from Children’s Miracle Network, outside the submitted work. The authors declare no other potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Footnotes
Acknowledgments
The authors would like to thank the research assistants who supported this project.
Supplemental Material
References
