Abstract
Analyzing heterogeneous treatment effects (HTEs) plays a crucial role in understanding the impacts of educational interventions. A standard practice for HTE analysis is to examine interactions between treatment status and preintervention participant characteristics, such as pretest scores, to identify how different groups respond to treatment. This study demonstrates that identical patterns of HTE on test score outcomes can emerge either from variation in treatment effects due to a preintervention participant characteristic or from correlations between treatment effects and item easiness parameters. We demonstrate analytically and through simulation that these two scenarios cannot be distinguished when the analysis is based on summary scores alone. We then describe a novel approach that identifies the relevant data-generating process by leveraging item-level data. We apply our approach to a randomized trial of a reading intervention in second grade and show that any apparent HTE by pretest ability is driven by the correlation between treatment effect size and item easiness. Our results highlight the potential of employing measurement principles in causal analysis, beyond their common use in test construction.
1. Introduction
Heterogeneous treatment effects (HTEs) are a topic of growing importance and interest as understanding HTE is crucial for determining the policy relevance of interventions. HTE analysis allows researchers to identify the target population for which interventions are most effective. Further, identifying subgroups for whom there might be much larger or smaller treatment effects can lead to efficiency gains if policies are targeted to groups that are most responsive to interventions (Abenavoli, 2019; Blundell et al., 2005; Brand et al., 2014; Bryan et al., 2021; Schochet et al., 2014; Torche et al., 2024).
One challenge with estimating HTE is the risk that researchers may engage in questionable research practices such as selectively reporting subgroups with large effects or p-hacking, thus providing misleading conclusions about the generalizability and replicability of HTE analyses (Schuetze & von Hippel, 2023). Many methods have been proposed to address these problems, such as preregistration (Olken, 2015) and machine learning approaches (Chernozhukov et al., 2022; Wager & Athey, 2018; Wallace et al., 2023; Yeager et al., 2019). These approaches address spurious HTE as a problem of inference. However, even when a researcher takes a principled approach to estimating HTE or employs novel machine learning methods, we show that they may still detect spurious HTE if the relevant outcome is a psychological construct or other form of latent outcome measured using psychometric techniques. This spurious HTE can occur when the items used to construct outcome measures, such as a test of reading comprehension based on a set of test items, themselves exhibit HTE. The present study thus complements the broader HTE literature by reconsidering spurious HTE as a problem of identification instead of a problem of inference. In particular, we show that two qualitatively different data-generating processes (DGPs) can yield identical patterns of HTE when psychometric outcomes, such as educational test scores, are used to estimate treatment effects.
To illustrate this problem, consider a common approach to estimating treatment effects. Education researchers often collect item-level assessment data, generate test scores for each individual from the item responses, and use these scores as outcomes in a standard regression framework to estimate treatment effects. The regression model can then be extended to include interactions between person characteristics and treatment status to probe potential sources of HTE. However, recent scholarship has proposed models where the treatment might differentially impact the individual assessment items (Ahmed et al., 2023; Gilbert et al., 2023b; Sales et al., 2021). Thus, by not modeling individual item-level responses directly, researchers are potentially ignoring the HTE among the items that constitute the outcome measure.
We offer two examples that could give rise to treatment effects that vary at the item level. First, in education, a treatment effect may reflect “teaching to the test” or “score inflation” rather than an improvement in the latent ability giving rise to the item responses (Koretz, 2005). For example, instruction in test-taking strategies such as “process of elimination” could have the effect of making multiple choice items uniformly easier without improving latent student ability. Second, in psychology, spousal loss induces a systematic change in the reporting of depressive symptoms, such that people with otherwise similar levels of depression are much more likely to report the specific symptoms of loneliness and sadness, as opposed to other symptoms such as a loss of motivation (see figure S7 in Domingue et al., 2021). The interpretation of this change for the purposes of diagnosis has been a long-standing challenge (e.g., Olivera-Aguilar & Rikoon, 2024; Zisook et al., 2007). These illustrations are not meant to be exhaustive but rather suggestive of the possibilities that we have in mind when we discuss item-dependent HTE.
In this study, we show that correlations between item-specific treatment effects and item easiness parameters can generate observed HTE patterns indistinguishable from those generated by HTE arising from treatment interacting with person characteristics. We demonstrate the resulting confounding both analytically and through Monte Carlo simulation. To resolve this issue, we propose a novel approach that leverages item-level data to simultaneously estimate treatment-by-person-characteristic interactions and the correlation between item-specific treatment effects and item easiness parameters. Using Monte Carlo simulation, we show that our approach identifies the relevant DGP. Finally, we apply our approach to a randomized evaluation of a second-grade content literacy intervention and show that the observed HTE by pretest scores is driven by the correlation between item-specific treatment effects and the item easiness parameters.
Our study contributes to the burgeoning literature on the estimation of HTE (Athey & Imbens, 2016; Bryan et al., 2021; Chernozhukov et al., 2022; Künzel et al., 2019; Schuetze & von Hippel, 2023; Wager & Athey, 2018; Wendling et al., 2018). In particular, our study extends the literature on the identification and estimation of latent heterogeneity, that is, variation in treatment effects that is not driven by baseline variables (Jeon et al., 2021; Lyu et al., 2023; Pearl, 2022; VanderWeele & Batty, 2023; Winship & Morgan, 1999; Xie et al., 2012). We build on recent work describing how item-level assessment data, typically only used to construct outcome measures, can provide additional insights into the nature of treatment effects (Ahmed et al., 2023; Gilbert et al., 2023b; Sales et al., 2021), and more generally, how estimates of treatment impact may be sensitive to measurement properties such as the alignment between interventions and outcomes (Francis et al., 2022), the comparability of effect sizes across studies (Wolf & Harbatkin, 2023), effect moderation and mediation (Montoya & Jeon, 2020; VanderWeele & Vansteelandt, 2022), and the consequences of scoring decisions and outcome metric properties for inference (Domingue et al., 2020, 2022; Gilbert, 2023a; McNeish & Wolf, 2020; Skrondal & Laake, 2001; Soland et al., 2022; Widaman & Revelle, 2023). Our study leverages a measurement model (Briggs, 2008; De Boeck, 2004) within a potential outcomes framework (Holland, 1986; Rubin, 1974) to identify and estimate HTE, effectively integrating the three foundational elements of empirical research (measurement, identification, and inference) into a cohesive framework.
The study is structured as follows. Section 2 sets up a potential outcomes framework for the causal inference model and reviews the person-dependent HTE and item-dependent HTE DGPs. Section 3 examines how the two DGPs can yield identical observed patterns of heterogeneity and presents our proposed solution to this problem. Section 4 presents a Monte Carlo simulation study, demonstrating the identification challenge and its resolution. Section 5 presents an empirical application to a randomized evaluation of a content literacy intervention. Section 6 discusses the implications of applying measurement principles in impact evaluation in education and other fields.
2. A Model of Treatment Effects
We begin with a model of treatment effects for item response data. We consider the following DGP for dichotomous item responses on a test used to evaluate the efficacy of a given treatment. The probability of a correct response to item i by person j is
where f is a monotonically increasing function bounded within

Figure 1. Item response functions (IRFs) and treatment effects. Notes. The plots in the top row present the linear predictor and the plots in the bottom row present the probability of a correct response. A1 and A2 depict a standard IRF that describes the probability of a correct response as a function of the sum
Both
Suppose that an individual is randomly assigned to treatment or control conditions; we write
which implies
As an alternative, if we suppose that the treatment effect is entirely related to change in the functioning of the items,
we similarly produce
We now introduce HTE into the model. At the person level, we can allow the treatment effect to depend on person characteristic $X_j$ (e.g., pretest scores, gender, socioeconomic status [SES]) by introducing an interaction between
At the item level, we can similarly allow the treatment effect size to depend on item characteristic $X_i$ (e.g., item content, item type, position of item in the test):
We consider HTE in Panel C of Figure 1. The red and blue lines represent different increases in
2.1. Person-Dependent HTE
We first consider a DGP in which HTE arises due to variation in the treatment effect as a function of pretreatment ability (
Here,
2.2. Item-Dependent HTE
We now contrast the person-dependent HTE model with an alternative DGP:
Again consider the treatment effect
The
A directed acyclic graph representation of the two DGPs is presented in Figure 2. Concretely, the person-dependent HTE model may be theoretically justifiable when an intervention targets students by ability level. For example, an intervention that provides more resources and supports to lower ability students would presumably have a larger effect on low ability students and this effect would be constant across all items (i.e.,

Figure 2. Directed acyclic graphs for person-dependent heterogeneous treatment effect (HTE; top) and item-dependent HTE (bottom) data-generating processes. Notes. Squares indicate observed variables, hollow circles indicate latent variables, and solid circles represent cross product interaction terms. In are item responses and $T_j$ is the treatment indicator.
3. Appropriate Identification of HTE
3.1. The Problem of Identification
We now describe a scenario under which conventional analysis of HTE would leave the DGP unidentified. By this, we mean that the person-dependent or item-dependent HTE processes produce data with equivalent patterns of treatment effects on an estimated posttreatment sum score $S_j$ conditional on
We use sum scores for clarity of exposition and because, despite active debate among psychometric researchers concerning their properties (McNeish & Wolf, 2020; Soland et al., 2022; Widaman & Revelle, 2023), they are commonly used as outcomes in empirical research (Flake et al., 2017). However, our argument is not dependent on the use of sum scores. When items are equally discriminating, the sum score is a sufficient statistic for an item response theory (IRT)-based score. In fact, sum scores will be an injective (i.e., one-to-one) function of the true abilities (Birnbaum, 1968; Borsboom, 2005), but monotonic transformations can induce HTE or remove it (Ding et al., 2019; Domingue et al., 2022). Further, our simulations will show that our argument applies equally to latent variable models that estimate the measurement and regression models simultaneously (Lockwood & McCaffrey, 2020).
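The sufficiency property mentioned above can be checked numerically. The following Python sketch is only an illustration and is not taken from the authors' materials: the three easiness values, the discrimination vectors, the function names, and the assumed linear predictor a_i(θ + b_i) with a logistic link are all hypothetical choices. With equal discriminations, two response patterns that share the same sum score have a likelihood ratio that does not change with ability, so the pattern adds no information about ability beyond the sum. With unequal discriminations, the ratio moves with ability.

```python
import math

# Hypothetical item easiness values for a three-item test (assumption).
EASINESS = [-1.0, 0.0, 1.0]


def pattern_likelihood(pattern, theta, discriminations):
    """Probability of a 0/1 response pattern given ability theta.

    Assumes a logistic response function with linear predictor a_i * (theta + b_i).
    """
    likelihood = 1.0
    for y, b, a in zip(pattern, EASINESS, discriminations):
        p_correct = 1.0 / (1.0 + math.exp(-a * (theta + b)))
        likelihood *= p_correct if y == 1 else 1.0 - p_correct
    return likelihood


def likelihood_ratio(pattern_a, pattern_b, theta, discriminations):
    """Ratio of the likelihoods of two response patterns at ability theta."""
    return (pattern_likelihood(pattern_a, theta, discriminations)
            / pattern_likelihood(pattern_b, theta, discriminations))


EQUAL = [1.0, 1.0, 1.0]    # equally discriminating items
UNEQUAL = [0.5, 1.0, 2.0]  # hypothetical unequal discriminations
PATTERN_A = (1, 0, 0)      # both patterns have a sum score of 1
PATTERN_B = (0, 0, 1)

for label, disc in (("equal", EQUAL), ("unequal", UNEQUAL)):
    ratios = [likelihood_ratio(PATTERN_A, PATTERN_B, th, disc)
              for th in (-1.0, 0.0, 1.0)]
    print(label, [round(r, 4) for r in ratios])
```

In the equal-discrimination case the printed ratios coincide across the three ability values, which is the sense in which the sum score carries all the information about ability.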
Let
We define the causal estimand at the test score level as
We emphasize that
We first illustrate the identification problem with a toy example, setting
In Table 1, we calculate
Table 1. Toy Example Illustrating the Identification Problem on a Three-Item Test
Note. Toy example shows how both the person-dependent and item-dependent data-generating processes produce treatment effects on sum scores that depend on

Figure 3. Toy example illustrating the identification problem on a three-item test. Notes. Toy example based on the parameters used to generate Table 1 shows how both the person-dependent and item-dependent data-generating processes produce treatment effects on sum scores that depend on
This phenomenon holds in more general settings. For example, Figure 4 shows a more realistic case with 16 items, where the top row depicts a person-dependent HTE scenario, where

Figure 4. Different data-generating processes (DGPs) can produce the same pattern of sum score responses. Notes. This figure illustrates how different DGPs can produce the same pattern of sum score responses. The top row depicts person-dependent heterogeneous treatment effect (HTE) and the bottom row depicts item-dependent HTE. We first generated the item-level data from the item-dependent HTE model, where we set
We can formalize the intuition provided by Table 1 and Figure 4 analytically. We begin with an arbitrary set of
Applying the above equation to each potential outcome yields the following logistic functions, each of which can in turn be approximated by a linear equation because logistic curves are approximately linear between about 20%–80% of the upper asymptote (Long, 1997; Von Hippel, 2015), where
The treatment effect,
For the item-dependent HTE model, we assume
We again compute
Clearly,
We now have two expressions for
as the
3.2. The Solution
We can identify the relevant DGP with item-level data by including both the baseline ability by treatment interaction
We propose the following flexible explanatory item response model (EIRM) to disentangle the causes of HTE:
Because this model allows for both person- and item-dependent HTE and models the item-level data directly, we can determine whether person-dependent or item-dependent HTE is a better fit to the underlying item-level data and thereby identify the appropriate causes of HTE. We turn now to a simulation study demonstrating this result.
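The intuition for why item-level data separate the two processes can be sketched with a small simulation. The code below is a simplified stand-in and not the EIRM itself: it treats ability as observed in three strata, uses empirical log-odds rather than a fitted mixed model, and every parameter value and function name is a hypothetical choice. Within each stratum it computes the treated-minus-control log-odds for every item, then summarizes how those item-level effects vary with item easiness (a least squares slope) and with ability (the gap between the high and low strata).

```python
import math
import random
from statistics import mean

# All values below are hypothetical choices for illustration.
EASINESS = [-1.5 + 0.25 * k for k in range(13)]  # item easiness b_i
THETAS = (-1.0, 0.0, 1.0)                        # ability strata, treated as observed
BETA, DELTA, GAMMA = 0.5, 0.3, 0.4


def treatment_effect(dgp, theta, b):
    """Logit-scale effect for ability theta on an item with easiness b."""
    if dgp == "person":
        return BETA + DELTA * theta  # varies across persons, constant across items
    return BETA - GAMMA * b          # varies across items, constant across persons


def empirical_logit(n_correct, n_total):
    # 0.5 is added to both counts to avoid taking the log of zero.
    return math.log((n_correct + 0.5) / (n_total - n_correct + 0.5))


def item_logit_diffs(dgp, theta, n_per_arm, rng):
    """Treated-minus-control log-odds of a correct response, one value per item."""
    diffs = []
    for b in EASINESS:
        logits = []
        for treated in (0, 1):
            lp = theta + b + treated * treatment_effect(dgp, theta, b)
            p = 1.0 / (1.0 + math.exp(-lp))
            # Given theta, responses are independent Bernoulli draws.
            n_correct = sum(rng.random() < p for _ in range(n_per_arm))
            logits.append(empirical_logit(n_correct, n_per_arm))
        diffs.append(logits[1] - logits[0])
    return diffs


def ols_slope(x, y):
    """Least squares slope of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    num = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    den = sum((xi - x_bar) ** 2 for xi in x)
    return num / den


def summarize(dgp, seed=0, n_per_arm=3000):
    """Return (slope of item effects on easiness, high-minus-low ability gap)."""
    rng = random.Random(seed)
    by_theta = {th: item_logit_diffs(dgp, th, n_per_arm, rng) for th in THETAS}
    slope_on_easiness = mean(ols_slope(EASINESS, by_theta[th]) for th in THETAS)
    ability_gap = mean(by_theta[1.0]) - mean(by_theta[-1.0])
    return slope_on_easiness, ability_gap


for dgp in ("person", "item"):
    slope, gap = summarize(dgp)
    print(f"{dgp}-dependent DGP: slope on easiness = {slope:.2f}, "
          f"high-minus-low ability gap = {gap:.2f}")
```

Under these assumptions the person-dependent process shows an ability gap but essentially no slope on easiness, while the item-dependent process shows the reverse. In real data ability is latent and measured with error, which is why a model that estimates both terms jointly is needed rather than this stratified shortcut.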
4. Monte Carlo Simulation
A key implication of the empirical indistinguishability of the person-dependent and item-dependent HTE processes is that we will estimate a nonzero value of
The simulation results are shown in Figure 5. The plots in the first row show that each model appropriately recovers the target parameter with minimal bias. That is, whether the DGP includes treatment by baseline ability interaction

Figure 5. Simulation results showing bias in parameter estimates. Notes. Figures depict bias in parameter estimates. The top row presents properly specified models and the bottom row presents misspecified models. The following parameters are fixed across all trials: 500 subjects, 40 items with constant discrimination of 1,
5. Empirical Application
For our empirical application, we analyze a public use data set from Kim et al. (2023), which examined the impact of the Model of Reading Engagement (MORE) literacy intervention on the reading comprehension test scores of 2,174 Grade 2 students through a randomized controlled trial (RCT). The researcher-designed reading comprehension test contained 20 dichotomous items. Figure 6 plots the standardized outcome sum score against the standardized pretest score; the right panel appears to show that the treatment effect is larger for students with higher preintervention scores.

Figure 6. Scatter plot of standardized sum-score outcome on preintervention scores. Notes. The figure presents a scatter plot of standardized reading comprehension sum scores on preintervention measure of academic progress reading scores by treatment status. The left panel depicts the main effect of the treatment and the right panel also includes the interaction effect. Hollow circles indicate the conditional means of the outcome at each decile of the baseline score.
However, it would be premature to assert that the treatment is more effective for students with higher preintervention achievement (i.e.,

Figure 7. Fitted logistic curves by treatment group for each item. Notes. The figure presents the fitted logistic curves of the probability of a correct response on preintervention measure of academic progress reading scores by treatment status. The items are arranged by difficulty, with percentage correct values listed in the headings. The logistic curves are derived from a fixed effects model of the correct response as a function of treatment, pretest, and item indicator, with two-way interactions between treatment and pretest and treatment and the item indicators.
To formally test whether the person- or item-dependent model is a better fit to the data and determine the source of HTE, we fit four EIRMs to the item-level data. Our baseline specification is a constant treatment effect model as described in Equation 3. We then fit a person-dependent HTE model that incorporates an interaction between treatment and pretest scores,
Table 2. Empirical Application of Explanatory Item Response Models
Note. We apply the EIRM to data from Kim et al. (2023), which examined the impact of a content literacy intervention through a randomized-controlled trial. Column (1) presents a baseline constant effects model, column (2) presents the person-dependent heterogeneous treatment effect (HTE) model, column (3) presents the item-dependent HTE model, and column (4) presents the flexible model allowing for both person-dependent and item-dependent HTE. The unit of observation is item-person for all the models. Standard errors are in parentheses.
*p < .05. ***p < .001.
Column 1 shows a clear positive average treatment effect (ATE) in log-odds units. That is, the treatment is estimated to increase the log-odds of a correct response by 0.19 logits across all students and items, holding pretest scores constant. Adding the treatment by pretest interaction term in Column 2 shows a small but significant positive interaction

Figure 8. Correlation between item easiness and item treatment effect size. Notes. The points represent empirical Bayes estimates of item easiness in the control group ($b_i$) and item-specific treatment effect size (
6. Discussion
Our study shows that using summary measures of student performance as outcome variables can lead to scenarios where two distinct DGPs produce virtually identical observed results. A treatment may appear more beneficial for students with high or low baseline ability levels when the effectiveness of the treatment genuinely varies based on students’ baseline abilities (person-dependent HTE) or when the treatment impacts easier items more (or less) than hard ones (item-dependent HTE). These DGPs have distinct interpretations, and without item-level data, they cannot be distinguished. A treatment that disproportionately benefits low-scoring students across all items has different policy implications compared to one that uniformly enhances learning of easier content across all student groups. Consequently, standard analytic practices to estimate HTE may lead to misleading conclusions and poor policy decisions when incorrectly applied.
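A small deterministic calculation illustrates the point about summary scores. The Python sketch below is not the authors' simulation: it assumes a logistic response function with linear predictor θ + b_i plus a logit-scale treatment effect, and the item easiness grid, the values of BETA, DELTA, and GAMMA, and all function names are hypothetical. It computes the expected treated-minus-control sum score at different ability levels under a constant-effect baseline, a person-dependent process, and an item-dependent process in which harder items receive larger effects.

```python
import math

# Hypothetical parameter values chosen for illustration only.
EASINESS = [-3.0 + 0.5 * k for k in range(13)]  # item easiness b_i
BETA = 0.5    # average treatment effect on the logit scale
DELTA = 0.3   # person-dependent HTE: effect grows by DELTA per unit of ability
GAMMA = 0.3   # item-dependent HTE: effect shrinks by GAMMA per unit of easiness


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def item_effects(dgp, theta):
    """Logit-scale treatment effect on each item for a person with ability theta."""
    if dgp == "constant":
        return [BETA for _ in EASINESS]
    if dgp == "person":
        return [BETA + DELTA * theta for _ in EASINESS]
    if dgp == "item":
        return [BETA - GAMMA * b for b in EASINESS]
    raise ValueError(f"unknown DGP: {dgp}")


def sum_score_effect(dgp, theta):
    """Expected treated-minus-control sum score for a person with ability theta."""
    return sum(
        sigmoid(theta + b + effect) - sigmoid(theta + b)
        for b, effect in zip(EASINESS, item_effects(dgp, theta))
    )


def ability_gradient(dgp, low=-1.0, high=1.0):
    """Change in the expected sum-score effect between low and high ability."""
    return sum_score_effect(dgp, high) - sum_score_effect(dgp, low)


for dgp in ("constant", "person", "item"):
    effects = [round(sum_score_effect(dgp, th), 3) for th in (-1.0, 0.0, 1.0)]
    print(dgp, effects, "gradient:", round(ability_gradient(dgp), 3))
```

Under these assumed values, both heterogeneous processes yield sum-score effects that rise with ability, while the constant-effect baseline stays nearly flat. An analyst who sees only sum scores observes the same qualitative pattern in both cases and cannot tell which process produced it.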
We show that this identification problem can be resolved by using an EIRM that accounts for both variation in the treatment effect along a preintervention participant characteristic and a correlation between item easiness and treatment effect size. Such a model can correctly determine the relevant DGP and support correct conclusions. The EIRM is flexible and can be extended beyond the relatively simple application explored here to a wide array of data analytic settings, such as the inclusion of additional covariates at the person or item level, treatment by item cluster interactions, randomization blocks, multiple treatments, additional levels of hierarchy, such as the nesting of students within schools, polytomous item responses, 2PL or 3PL models, or Bayesian approaches (Bulut et al., 2021; Bürkner, 2021; De Boeck et al., 2011; Gilbert, 2023b, 2024; Gilbert et al., 2023b, 2024; Stanke & Bulut, 2019), underscoring the applicability of our approach to diverse contexts. For researchers interested in applying our model to their own data sets, our references provide tutorials in the R programming language: the general EIRM for dichotomous items (De Boeck et al., 2011), extending the EIRM to polytomous items (Bulut et al., 2021), the item-dependent HTE model for dichotomous items (Gilbert, 2023b), and a Bayesian approach that allows for extensions including 2PL or 3PL models (Bürkner, 2021; Gilbert, 2023b). Furthermore, we provide both a detailed replication tool kit in our Online Supplemental Materials (OSM) and a brief Appendix showing the basic R syntax to fit our models.
The problem of identification here hinges on item easiness varying in a systematic way such that
We used the example of pretest ability to motivate this study, but our argument extends to any pretreatment characteristic correlated with outcomes, such as age or SES. In our OSM, we provide three additional empirical examples with alternative covariates (mathematics pretest, race indicator, and high SES indicator) and an additional data set with item-level outcome data (a 36-item vocabulary assessment) to demonstrate this point. In all cases,
While our exposition is primarily drawn from the perspective of education research, where the analyses of student assessment data are ubiquitous, the implications of our study are relevant across various disciplines in which interest in HTE is high and detailed item-level data may be available. Our results apply to economists, who collect item-level data on consumption, spending, or attitudes (Jackson et al., 2021), psychologists studying traits such as depression through survey scales (Beevers et al., 2007; Gilbert et al., 2024; Schmitt et al., 2009), political scientists measuring political knowledge through questionnaires (Baek & Wojcieszak, 2009), or medical practitioners using surveys to supplement biometric measurement or clinical evaluation (Domingue et al., 2021; Hieronymus et al., 2019; Jessen et al., 2018). By making item-level data available, as advocated by Domingue and Kanopka (2023), researchers can better analyze and interpret the impact of policies and interventions in education and other fields.
We acknowledge four limitations of our approach. First, item-level data are not always available to the researcher, which may preclude item-level analysis in many applications. Accordingly, it is unknown how prevalent or large item easiness by treatment effect size correlations may be in empirical applications, and thus, the scope of the problem outlined in this study is not well understood. Second, the logit coefficients produced by the EIRM are generally difficult to interpret compared to conventional test score metrics like standardized sum scores. While there are existing methods available to convert logit coefficients to standardized effect sizes (Breen et al., 2018; Gilbert et al., 2023b; Hox et al., 2017; Mood, 2010), these methods rest on strong assumptions and may add an extra layer of complexity for the researcher. For example, we can convert the main effect of the MORE treatment (Table 2, Model 1) to an effect size by dividing our coefficient
Despite these limitations, our findings emphasize the critical role of measurement principles in program evaluation and demonstrate the useful implications of IRT-based causal analyses, beyond the traditional use of measurement tools solely for test construction. The adaptability of the EIRM makes it a powerful tool for any field that relies on detailed, item-level data to uncover patterns of treatment heterogeneity not readily apparent through more traditional data analysis methods.
7. Appendix
Authors' Note
A replication tool kit containing our code and Online Supplemental Material is available via Research Box for researchers interested in replicating or extending the analyses in this study at the following URL: https://researchbox.org/2221. The code includes links to download the empirical data.
Acknowledgments
The authors wish to thank Zach Himmelsbach, Kenji Kitamura, Alex Bolves, Andrew Ho, the HGSE Measurement Lab, and the Stanford Psychometrics Lab for their helpful comments on drafts of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported in part by the Jacobs Foundation.
