Is Effort Moderated Scoring Robust to Multidimensional Rapid Guessing?

Abstract

To mitigate the potential damaging consequences of rapid guessing (RG), a form of noneffortful responding, researchers have proposed a number of scoring approaches. The present simulation study examines the robustness of the most popular of these approaches, the unidimensional effort-moderated (EM) scoring procedure, to multidimensional RG (i.e., RG that is linearly related to examinee ability). Specifically, EM scoring is compared with the Holman–Glas (HG) method, a multidimensional scoring approach, in terms of model fit distortion, ability parameter recovery, and omega reliability distortion. Test difficulty, the proportion of RG present within a sample, and the strength of association between ability and RG propensity were manipulated to create 80 total conditions. Overall, the results showed that EM scoring provided improved model fit compared with HG scoring when RG comprised 12% or less of all item responses. Furthermore, no significant differences in ability parameter recovery and omega reliability distortion were noted when comparing these two scoring approaches under moderate degrees of RG multidimensionality. These limited differences were largely due to the limited impact of RG on aggregated ability (bias ranged from 0.00 to 0.05 logits) and reliability (distortion was ≤ .005 units) estimates when as much as 40% of item responses in the sample data reflected RG behavior.

Keywords

noneffortful responding, rapid guessing, item response theory, model fit, reliability

Test-taking effort is a necessary condition to support the validity of score-based inferences from all social science measures. When examinees fail to put forth their maximal effort, inferences made from such measures may not reflect the assessed constructs. Assuming that examinees have had the opportunity to learn what is being assessed, can fully access the test content, and possess sufficient time to answer all items, disengagement for a given item may occur when the resource demands of that item exceed the effort that an examinee is willing to expend (Wise & Smith, 2011). For example, Asseburg and Frey (2013) found a positive linear relationship between the ability-difficulty balance and individuals’ test-taking effort. In such circumstances, examinees can disengage in various ways, such as omitting item responses, providing responses with intentional disregard for item content (i.e., noneffortful responding), or both (e.g., Ulitzsch et al., 2020).

Although disengagement of any form is undesirable, noneffortful responding is particularly concerning because it introduces psychometrically distortive information into scores. To mitigate the potential damaging consequences of noneffortful responding, researchers have proposed leveraging behavioral indicators of disengaged behavior, such as item multimedia interactions (e.g., Harmes & Wise, 2016), eye tracking (e.g., Lindner et al., 2017), electroencephalography (e.g., Halderman et al., 2021), retroactive video evaluations of emotional ratings (e.g., Lehman & Zapata-Rivera, 2018), cursor movements (Pokropek et al., 2022), and response times (see Wise, 2017), to identify potential construct-irrelevant responses.¹ Of these indicators, response times have been most utilized in applied research (Silm et al., 2020) and operational testing contexts (Wise & Kuhfeld, 2020). This behavioral indicator is advantageous because it allows for disengagement to be assessed for each examinee-by-item interaction and is collected unobtrusively, which limits potential observer effects.

Using this proxy measure, any response provided in a timeframe that is incommensurate with the amount of time needed to thoroughly read the item stem and response options and solve the presented problem is deemed to be a noneffortful response (for a description of methods used to identify noneffortful responding using response times, see Rios & Deng, 2021). Within the literature, this form of noneffortful responding is referred to as rapid guessing (RG) and has been studied predominantly within the context of cognitive assessments that possess multiple-choice keyed response options, given that correct rates can be used to help validate inferences of disengagement (Soland et al., 2019).² RG has been documented to occur at high rates in assessment contexts in which there are little to no personal consequences for examinee performance (i.e., low-stakes tests), due to a diminished sense of task value, which limits some examinees’ effort capacity (Penk & Schipolowski, 2015). In a recent meta-analytic investigation, Rios et al. (2022) found that across 25 independent studies, an average of 28% of examinees engaged in RG on at least one item, while the mean percentage of RG responses across data matrices was equal to 7%.

RG has been found to generally underestimate test performance (Rios et al., 2022), which leads to bias in both measurement properties, such as item characteristics (e.g., van Barneveld, 2007) and measurement invariance (e.g., Deng & Rios, 2022), as well as score-based inferences, including aggregated ability estimates (e.g., Rios et al., 2022), subgroup comparisons (e.g., Rios, 2021) and growth estimates (e.g., Yildirim-Erbasli & Bulut, 2020), to name a few. Given the potential deleterious effects of RG, researchers have proposed various recoding and modeling approaches for handling this form of noneffortful responding. In the following sections, a review of these approaches is provided as a means to set the foundation for the objective of the present manuscript, which is to examine the robustness of the most popular of these approaches, the effort-moderated (EM) scoring procedure, to multidimensional RG (i.e., RG that is linearly related to examinee ability).

Response Time Scoring Approaches for Handling RG

There are currently two main categories of scoring approaches for contending with RG, which differ in classifying RG and estimating ability: (a) item response theory (IRT) mixture modeling; and (b) two-stage scoring.

IRT Mixture Modeling

The most recent set of scoring procedures for handling RG leverage item response theory (IRT) mixture modeling to distinguish between unique, latent response strategies (RG and effortful responding). Within this approach, a probabilistic model is employed to estimate the likelihood that an item response belongs to a latent class that reflects RG, based on the assumption that response times reflect different data-generating processes associated with RG and effortful responding. IRT mixture modeling advantageously avoids the need for dichotomous classifications of RG and instead utilizes the RG class probability to weight the observed item responses, which avoids the potential for fully misclassifying effortful responding as RG (see Rios, 2021). In addition, this general approach simplifies the modeling process by simultaneously estimating class membership probability and model parameter estimates.

Despite these advantages, IRT mixture models have several limitations. First, many of these models assume that examinees possess a constant speed when effortfully responding, regardless of item characteristics (e.g., reading load) or cognitive fatigue, which is an assumption found to be untenable in applied testing contexts (Bolsinova et al., 2017; Domingue et al., 2021; Meng et al., 2015; van der Linden & Glas, 2010). Second, many models presume that response times reflective of RG follow a lognormal distribution with a common mean and variance across all items (this assumption is relaxed by Nagy & Ulitzsch, 2022). However, if this assumption is violated, prior research suggests that class membership probabilities may be biased, a potential outcome given idiosyncratic RG response time patterns between- and within-examinees (Molenaar et al., 2018; Wise & Kingsbury, 2016). Beyond these theoretical limitations, there are a number of practical constraints that have restricted the application of IRT mixture models. These limitations include model convergence issues when small sample sizes and/or proportions of RG are present. In addition, these models require specialized software capable of employing Bayesian estimation, which can lead to estimation times that are quite extensive (see Pokropek, 2016).³ These combined constraints have led some researchers to suggest that IRT mixture models are impractical for operational use and instead are best utilized as sophisticated tools for investigating the occurrence of RG in small-scale research efforts (Ulitzsch et al., 2022).

Two-Stage Scoring

Due to the limitations associated with IRT mixture models, an increasingly popular tactic to mitigate bias from RG is to employ a two-stage scoring approach. Within this general approach, a response to item i for examinee j is defined as a RG response (F_ij = 1) if the associated response time, RT_ij, is less than a predetermined response time threshold, T_i, that distinguishes the minimal time associated with solution behavior. A host of procedures have been proposed for establishing T_i based on heuristic rules (e.g., 3 seconds), the form of response time distributions, and/or response accuracy or item information accumulated over time (for more details, see Rios & Deng, 2021). Regardless of the procedure used to establish T_i, if F_ij = 1, the observed item score for item i by examinee j, Y_ij, is recoded as Y*_ij; otherwise, if F_ij = 0, Y_ij is unaltered (i.e., Y_ij = Y*_ij). Researchers have transformed Y_ij to Y*_ij by treating a RG item score as either incorrect (Deribo et al., 2021; Wright, 2019) or missing (Liu et al., 2019; Wise & DeMars, 2006). This process is completed for each i × j interaction to create a transformed item response matrix for the total sample, Y*.
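As an illustrative sketch only (not an implementation taken from any of the cited studies), the flagging and recoding steps of two-stage scoring could be written in Python as follows. The function names, the toy response times and scores, and the uniform 3-second thresholds are hypothetical; rows are assumed to index examinees and columns to index items.

```python
from typing import List, Optional

def flag_rapid_guesses(rt: List[List[float]], thresholds: List[float]) -> List[List[int]]:
    """Return F, where an entry is 1 if the response time is below the item threshold T_i."""
    return [[1 if t < thresholds[i] else 0 for i, t in enumerate(row)] for row in rt]

def recode(y: List[List[int]], flags: List[List[int]],
           rg_value: Optional[int] = None) -> List[List[Optional[int]]]:
    """Return Y*: flagged scores become rg_value (None = missing, 0 = incorrect)."""
    return [[rg_value if f == 1 else score for score, f in zip(y_row, f_row)]
            for y_row, f_row in zip(y, flags)]

# Hypothetical data: 2 examinees by 3 items, response times in seconds.
rt = [[2.1, 15.0, 8.4], [12.3, 1.0, 2.9]]
y = [[1, 0, 1], [1, 1, 0]]
thresholds = [3.0, 3.0, 3.0]          # assumed heuristic threshold for every item

F = flag_rapid_guesses(rt, thresholds)
y_star_missing = recode(y, F)                 # RG scores treated as missing
y_star_incorrect = recode(y, F, rg_value=0)   # RG scores treated as incorrect
```

Effortful responses pass through unchanged under either recoding rule, so the two versions of Y* differ only in the flagged cells.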

Recoding RG item scores as incorrect (e.g., Penalized scoring; Wright, 2016) innately presumes that an examinee engages in RG due to a low true response probability, and thus, strategically guesses to increase the likelihood of correctly answering an item. This assumption requires an examinee to accurately identify that they have an expected probability below chance, and provide an answer in less time than T_i, which may be untenable in applied settings. On one hand, a quick evaluation from test-takers seems hardly possible; on the other hand, prior research has shown that RG is likely most strongly associated with low task value in low-stakes testing contexts (i.e., examinees with high true response probabilities engage in RG due to a lack of interest in the presented task; Goldhammer et al., 2017; Penk & Schipolowski, 2015). Furthermore, recent simulation research suggests that imputing incorrect item scores for RG responses leads to greater error in both item and person parameters than naïvely ignoring the presence of RG (Rios & Deng, 2024). Due to strong impractical assumptions and poor psychometric performance, it is recommended that treating RG item scores as incorrect should be avoided where possible.

To account for the possibility that examinees may engage in RG for idiosyncratic reasons, an alternative recoding strategy is to treat all RG item scores as missing data (this approach was proposed in relation to aberrant responding by Waller [1974]). The rationale for this approach is that RG represents a response process that is unreflective of the assessed construct(s), and thus, provides psychometrically uninformative data. This strategy is advantageous because it does not make strong assumptions about the underlying rationale for RG behavior, and furthermore, avoids imputing responses that may increase bias in parameter estimates, such as assigning incorrect item scores (Rios & Deng, 2024). To date, recoding RG item scores as missing has been extensively researched and is the only approach to be employed in operational settings (see Wise & Kuhfeld, 2021). Two modeling approaches have been proposed for use once RG responses are recoded as missing data: EM (Wise & DeMars, 2006) and Holman–Glas (HG; Liu et al., 2019) scoring.⁴

EM Scoring

This scoring approach simply applies a user-defined unidimensional IRT model to Y*, such as the three-parameter logistic (3PL) model, which can be expressed in the slope-intercept form as:

P (X_{i j} = 1 | θ_{j}, a_{i}, c_{i}, d_{i}) = c_{i} + (1 - c_{i}) \frac{e^{a_{i} θ_{j} + d_{i}}}{1 + e^{a_{i} θ_{j} + d_{i}}},

(1)

where $θ_{j}$ is the ability parameter for examinee j, while $a_{i}, c_{i}, d_{i}$ are respectively the slope, lower asymptote, and intercept parameters for item i.
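For readers who prefer code to notation, the slope-intercept 3PL response function in equation 1 can be sketched in Python as below. This is a minimal illustration, and the function name and the example parameter values are hypothetical.

```python
import math

def p_3pl(theta: float, a: float, c: float, d: float) -> float:
    """Probability of a correct response under the slope-intercept 3PL model.

    theta: examinee ability; a: slope; c: lower asymptote; d: intercept.
    """
    z = a * theta + d
    # 1 / (1 + exp(-z)) is algebraically equal to exp(z) / (1 + exp(z)).
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# Hypothetical item: slope 1.0, lower asymptote 0.2, intercept 0.
p_average = p_3pl(theta=0.0, a=1.0, c=0.2, d=0.0)
```

As ability decreases the probability approaches the lower asymptote c rather than zero, which is how the model accommodates guessing on multiple-choice items.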

EM scoring has been generally shown to mitigate bias in ability parameter estimates when compared with ignoring the presence of RG (particularly as the rate of RG increases), be robust to nonidiosyncratic patterns of RG, and yield proficiency estimates that are significantly associated with self-report measures of effort (Rios et al., 2017; Rios & Soland, 2021a, 2021b; Wise & DeMars, 2006; Wise & Kuhfeld, 2021). However, it is susceptible to nonconvergence and increased standard errors in ability estimation as RG rates rise (Rios & Soland, 2021b). Furthermore, by converting RG item scores to missing, EM scoring presumes that the propensity to rapid guess is not systematically related to item and examinee characteristics, given that examinee ability is modeled unidimensionally. The tenability of the latter assumption is particularly questionable because prior research has demonstrated a negative relationship between RG propensity and ability in certain testing contexts (i.e., low ability examinees are more likely to engage in RG than high ability examinees; Deribo et al., 2021; Rios et al., 2017). With that noted, EM scoring is the most researched scoring approach to date (e.g., Rios & Soland, 2021a, 2021b; Wise & DeMars, 2006; Wise & Kuhfeld, 2021) and is the only known method currently employed in operational settings (Wise & Kuhfeld, 2021).

HG Scoring

To account for the potential linear relationship between RG propensity and ability, Liu et al. (2019) extended the missing data model of Holman and Glas (2005). In this scoring approach, examinee ability ( $θ$ ) and RG propensity ( $ξ$ ) are estimated within a multidimensional IRT framework and assumed to follow a bivariate normal distribution. Two types of latent variable indicators are specified, item responses (Y*) and RG classifications (F), and are combined into an I × 2J matrix, where I and J are the total number of examinees and items respectively (Y* ranges from column 1 to J, while F is specified in columns J+ 1 to 2J).

Upon arranging the data matrix, the latent construct relationship can be expressed within a multidimensional IRT (MIRT) framework by specifying a user-defined unidimensional measurement model separately for Y* and F. For instance, Liu et al. (2019) conceptualized the prediction of F by $ξ$ using the Rasch model based on the assumption that the employment of RG is associated with item difficulty when $θ$ and $ξ$ are linearly related. In contrast, they chose the 3PL model to explain the predictive relationship of Y* by $θ$ given the potential presence of pseudo-guessing on single response multiple choice items. These relationships can be formally expressed in the slope/intercept form for person j on item i as

P (X_{ij} = 1 | δ_{j}, a_{i}, c_{i}, d_{i}) = c_{i} + (1 - c_{i}) \frac{e^{a_{i} δ'_{j} + d_{i}}}{1 + e^{a_{i} δ'_{j} + d_{i}}},

(2)

where $X_{ij}$ is equivalent to Y*_ij = 1 and F_ij = 1, and $δ_{j}$ is a vector of person coordinates ( $θ_{j}, ξ_{j}$ ) for examinee j. Turning to the item parameters, different parameterizations are made separately for Y*_ij and F_ij, given the different measurement model specifications. Focusing first on Y*_ij, for item i, $a_{i}$ , a vector of item discrimination parameters, is equal to ( $a_{i}$ ,0), while $d_{i}$ is the intercept parameter, and $c_{i}$ is the lower asymptote or pseudo-guessing parameter. In contrast, for F_ij, $a_{i}$ is equivalent to (0, 1), $d_{i}$ is the intercept parameter, and $c_{i}$ is equal to 0.
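The data arrangement and the constrained MIRT response function used in HG scoring can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure of Liu et al. (2019): the function names and numeric values are hypothetical, rows index examinees, and missing item scores are represented by None.

```python
import math
from typing import List, Optional, Sequence

def stack_hg(y_star: List[List[Optional[int]]],
             flags: List[List[int]]) -> List[List[Optional[int]]]:
    """Place Y* in the first block of columns and F in the second block."""
    return [y_row + f_row for y_row, f_row in zip(y_star, flags)]

def p_mirt(delta: Sequence[float], a_vec: Sequence[float], c: float, d: float) -> float:
    """Equation 2: delta = (theta, xi) and a_vec is the item's slope vector."""
    z = sum(a * x for a, x in zip(a_vec, delta)) + d
    return c + (1.0 - c) / (1.0 + math.exp(-z))

# One examinee, two items: the second item score was recoded to missing.
stacked = stack_hg([[1, None]], [[0, 1]])

delta = (0.3, -1.2)                              # hypothetical (theta, xi)
p_item = p_mirt(delta, (1.1, 0.0), 0.2, -0.4)    # Y* indicator: slopes (a_i, 0)
p_flag = p_mirt(delta, (0.0, 1.0), 0.0, 0.5)     # F indicator: slopes (0, 1), c = 0
```

Because each indicator loads on only one dimension, the item-response probability depends on $θ$ alone and the RG-classification probability depends on $ξ$ alone; the two dimensions are linked only through their covariance.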

HG scoring has the potential to provide improved ability estimation when RG is multidimensional (e.g., latent ability and propensity for RG are correlated). In addition, it provides a flexible procedure for investigating correlates of examinees’ latent propensity for RG via structural equation modeling. However, applying HG scoring to datasets in which RG is ignorable (e.g., with minor RG rates) could lead to model overparameterization and biased parameter estimates. This arises from incorporating more parameters into the model than what the data justifies, leading to unnecessary complexity. In addition, if covariates (e.g., item content) account for differential RG among examinees with the same ability level, parameter estimates from this model may be biased and ultimately alter the interpretation of ability inferences, due to the assumption that $θ$ and $ξ$ are linearly related.

Study Rationale

Contending with RG is important for operational testing programs given the high rates of noneffortful responding previously observed (Rios & Deng, 2021). However, practitioners are presented with multiple options of RG scoring approaches, with each providing its own strengths and limitations. For instance, EM scoring is a simple method that can be easily integrated into operational workflows, as it relies on unidimensional IRT models that are readily employed by many testing programs. This approach has been shown to be robust to nonidiosyncratic patterns of RG (Rios & Soland, 2021b); however, it is less understood how robust parameter estimates are when missing data are nonignorable due to a linear relationship between $θ$ and $ξ$ . In a prior investigation of this question, Liu et al. (2019) employed certain methodological choices that might be reconsidered in light of subsequent research. First, their study operated under the assumption that RG rates were consistent across all examinees. This assumption seems inconsistent with previous studies, such as Wise and Kingsbury (2016), which have highlighted significant variability in RG rates among examinees. Second, Liu et al. used expected a posteriori (EAP) ability estimation. While a common practice, this method has its critics, with some suggesting it can shrink ability estimates toward the mean, potentially introducing bias. In addition, although HG scoring was developed to account for this linear relationship, prior research has not examined the degree of model misfit present when RG proportions are small in a given dataset. This area of research is important as prior research has shown significant variations in RG rates by examinee populations (e.g., Rios & Guo, 2020; Rios & Soland, 2022).

To address the limitations in the literature, we conducted a simulation study to compare model fit, ability parameter recovery, and factor analytic reliability between the EM and HG scoring approaches under various conditions. Specifically, these outcomes are examined when manipulating test difficulty, the proportion of RG present within a sample, and the strength of association between $θ$ and $ξ$ . The following research objectives are addressed in the simulation study:

Research Question 1 (RQ1): Does HG scoring provide poorer model fit compared with EM scoring when RG proportions are small?

Hypothesis 1 (H1): Model fit for the HG scoring approach will be worse compared with EM scoring under small RG rates due to overparameterization.

Research Question 2 (RQ2): Is there greater ability estimate error for EM scoring when compared with HG scoring when RG is multidimensional?

Hypothesis 2 (H2): Compared with the EM approach, HG scoring will provide improved ability estimates and less distorted omega reliability estimates as the covariance between $θ$ and $ξ$ increases, with larger benefits observed under growing rates of RG.

The findings from this study have the potential to inform practitioners about whether a unidimensional approach to mitigating bias from RG is robust to underlying multidimensionality.

Method

Data Generation

Data were generated for a 50-item multiple-choice test using a two-factor correlated traits simple structure model. A multivariate normal distribution was employed to sample 2,000 $θ$ and $ξ$ parameters:

π_{j} ~ N (μ, Σ),

(3)

where $π_{j}$ is a 2 × 1 vector of simulee parameters with each row respectively representing $θ_{j}$ and $ξ_{j}$ , $μ$ is a 2 × 1 vector of mean ability values (constrained to zero across all conditions), and $Σ$ is a 2 × 2 covariance matrix with the diagonal components equal to 1 and the off-diagonal components equal to the covariance between $θ_{j}$ and $ξ_{j}$ . Two covariance levels were manipulated in which the two factors were inversely related for disengaged simulees (i.e., simulees engaging in RG on one or more items): −0.1 and −0.5. The former level reflects a scenario in which there is largely no linear relationship between $θ_{j}$ and $ξ_{j}$ , which supports an underlying assumption of EM scoring and allowed for the examination of model overfitting for HG scoring. The second condition reflects a moderate association between factors that has been observed in prior applied analyses and is aligned with the multidimensionality specified in HG scoring (see Liu et al., 2019). The two person parameters (i.e., $θ$ and $ξ$ ) were generated separately for the effortful and disengaged groups, and the proportion of disengaged simulees was manipulated (details are described below).
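The sampling step in equation 3 can be illustrated with a short Python sketch. It is an assumption-laden simplification: it draws a single group of simulees (the study generated effortful and disengaged groups separately), uses only the standard library, and the function name and seed are hypothetical. Because both variances equal 1, the covariance equals the correlation, so the bivariate draw reduces to a linear combination of two independent standard normal variates.

```python
import math
import random
from typing import List, Optional, Tuple

def sample_persons(n: int, cov: float, seed: Optional[int] = None) -> List[Tuple[float, float]]:
    """Draw n (theta, xi) pairs from a bivariate normal with zero means,
    unit variances, and covariance cov (requires -1 < cov < 1)."""
    rng = random.Random(seed)
    root = math.sqrt(1.0 - cov ** 2)
    persons = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        # xi has unit variance and covariance cov with theta by construction.
        xi = cov * theta + root * rng.gauss(0.0, 1.0)
        persons.append((theta, xi))
    return persons

# Hypothetical usage: 2,000 simulees under the moderate covariance level.
persons = sample_persons(2000, cov=-0.5, seed=2024)
```

With a covariance of −0.5, simulees with lower $θ$ tend to receive higher $ξ$ values, which is the pattern the moderate-association condition is meant to represent.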

Upon sampling $θ$ and $ξ$ parameters, effortful response probabilities for $θ$ were generated based on the unidimensional 3PL model (see equation 1; the slope-threshold parameterization was used for data generation) by obtaining item parameters from the following distributions:

a_{i} ~ unif (0.4, 1.5)

b_{i} ~ N ({\bar{b}}_{i}, 1)

c_{i} ~ unif (0.1, 0.25),

(4)

where ${\bar{b}}_{i}$ was manipulated to either −1 or 0 to reflect an easy or moderately difficult test. This independent variable was altered as RG has been found to disproportionately impact parameter estimates based on test difficulty (e.g., Rios & Soland, 2021b).⁵ Similar to Liu et al. (2019), probabilities for $ξ$ were generated using the unidimensional Rasch model:

P (F_{ij} = 1 | ξ_{j}, b_{i}) = \frac{e^{ξ_{j} - b_{i}}}{1 + e^{ξ_{j} - b_{i}}} .

(5)

where $b_{i}$ is the item difficulty parameter generated from the distribution in equation 4.
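A minimal Python sketch of the item parameter draws and the Rasch model for RG propensity is given below. The function names and seed are hypothetical, and the intercept d = −ab is included only to show the standard correspondence between the slope-threshold form a(θ − b) and the slope-intercept form aθ + d; it is not a detail reported for the study.

```python
import math
import random
from typing import Dict, List, Optional

def generate_items(n_items: int, mean_b: float, seed: Optional[int] = None) -> List[Dict[str, float]]:
    """Draw 3PL item parameters from the distributions stated in the text."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a = rng.uniform(0.4, 1.5)      # slope
        b = rng.gauss(mean_b, 1.0)     # threshold (difficulty)
        c = rng.uniform(0.1, 0.25)     # lower asymptote
        items.append({"a": a, "b": b, "c": c, "d": -a * b})
    return items

def p_rasch(xi: float, b: float) -> float:
    """Probability that a response is a rapid guess (F = 1) under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(xi - b)))

# Hypothetical usage: a 50-item test of moderate difficulty (mean threshold of 0).
items = generate_items(50, mean_b=0.0, seed=7)
rg_probs = [p_rasch(0.8, item["b"]) for item in items]
```

Under this specification a simulee's RG probability is highest on the items with the largest $ξ_{j} - b_{i}$ difference, so the ordering of items by RG probability is the same for every simulee.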

The next step in the data generation process consisted of manipulating the percentage of RG responses in Y. This step required two stipulations, the proportion of simulees engaging in some degree of RG and the average percentage of within-disengaged simulee RG. Concerning the former, five levels were varied ranging from .10 to .50 in increments of .10, while the average rate of RG within-disengaged simulees was equal to one of four levels: 10%, 20%, 40%, and 80%. However, to address the fact that examinees can differ in the degree to which they employ RG (Wise & Kingsbury, 2016), the number of RG responses for each disengaged simulee was allowed to vary based on sampling from a multinomial distribution. Within each disengaged simulee, RG was generated for the items with the highest probabilities obtained from equation 5. Therefore, if examinee j was stipulated to engage in RG on three items, noneffortful responding would be imputed for the three items with the highest $F_{ij}$ probabilities. Combining the independent variables from this step resulted in the following percentages of RG responses in Y: 1%, 2%, 3%, 4%, 5%, 6%, 8%, 10%, 12%, 16%, 20%, 24%, 32%, and 40%. These levels fall within observed RG rates in operational tests (e.g., Deng & Rios, 2022; Rios & Guo, 2020; Rios & Soland, 2022).
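The rule for placing RG within a disengaged simulee's response vector can be sketched as follows. The function name and the toy probabilities are hypothetical, and the number of RG responses k is taken as given here because the details of the multinomial draw are not reproduced in this sketch.

```python
from typing import List

def assign_rg(f_probs: List[float], k: int) -> List[int]:
    """Flag the k items with the highest RG probabilities (1 = rapid guess)."""
    order = sorted(range(len(f_probs)), key=lambda i: f_probs[i], reverse=True)
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(f_probs))]

# Hypothetical simulee with five items who is stipulated to rapid guess on three.
flags = assign_rg([0.10, 0.90, 0.40, 0.70, 0.20], k=3)
```

The flagged item scores would then be replaced with rapid-guess responses in Y, while the unflagged scores keep their effortful values.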

In summary, the following four independent variables and their respective levels were included in the simulation design:

Covariance between $θ$ and $ξ$ : −0.1 and −0.5

Test difficulty: easy and moderately difficult

Proportion of disengaged simulees in sample: .10, .20, .30, .40, and .50

Average percentage of within-disengaged simulee RG: 10%, 20%, 40%, and 80%

These independent variables were fully crossed resulting in 80 conditions, with each condition replicated 100 times.
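The design arithmetic can be checked with a few lines of Python. The variable names are hypothetical; the levels are those listed above, with both percentages stored as integers so that the products are exact.

```python
from itertools import product

covariances = [-0.1, -0.5]
difficulties = ["easy", "moderate"]
disengaged_pcts = [10, 20, 30, 40, 50]   # percentage of simulees who are disengaged
within_rg_pcts = [10, 20, 40, 80]        # average percentage of RG within those simulees

# Fully crossed design: every combination of the four factors.
conditions = list(product(covariances, difficulties, disengaged_pcts, within_rg_pcts))

# Overall percentage of RG responses in Y = disengaged share x within-simulee rate.
total_rg_pcts = sorted({(p * w) // 100 for p in disengaged_pcts for w in within_rg_pcts})

n_conditions = len(conditions)
```

The 20 combinations of the two RG factors collapse to 14 distinct overall RG percentages because several pairs share the same product (e.g., 20% of simulees at 20% RG and 10% of simulees at 40% RG both yield 4%).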

Model Estimation

Prior to jointly estimating item and ability parameters, missing data were imputed for known RG responses to create Y*, while F was created by reflecting the true generating RG classifications. After that, item and ability parameters were concurrently estimated for EM and HG scoring in the mirt R package (Chalmers, 2012). To serve as a baseline measure, a naïve 3PL model (hereafter referred to as the naïve approach) was also estimated in which the presence of RG was ignored. Across scoring procedures, the Newton–Raphson estimation algorithm was utilized to determine the maximum of the likelihood function, with the starting value and convergence criterion established at 0 and 0.0001, respectively. Estimation was deemed nonconverged if the convergence criterion was not met within 100 iterations. In addition, RG’s impact on factor analytic reliability was investigated using Green and Yang’s (2009) formula in the semTools R package (Jorgensen et al., 2016) to account for categorical indicators and possible violations of tau-equivalence. Given that EM and HG scoring possess the same item response indicators, reliability was identical between these approaches.

Outcomes

To examine whether HG scoring tends to overfit data under small RG proportions, the Bayesian information criterion (BIC) was collected for each condition and compared with the BIC value obtained when no RG was present (this latter value was calculated separately in baseline conditions by test difficulty level). While the manipulated and baseline conditions differ only in terms of the presence of rapid guessing responses, both conditions originate from the same underlying simulated data structure. Therefore, this comparison of BIC provided a means of assessing distortion in model fit due to RG.
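The model fit distortion outcome can be illustrated with the standard BIC formula, BIC = −2 ln L + k ln n. The Python sketch below is illustrative only: the function names, log-likelihoods, and parameter counts are hypothetical, and the sign convention (condition minus baseline) is an assumption made for this example.

```python
import math

def bic(log_lik: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: -2 * log-likelihood + k * ln(n)."""
    return -2.0 * log_lik + n_params * math.log(n_obs)

def bic_distortion(bic_condition: float, bic_baseline: float) -> float:
    """Assumed convention: positive values mean the BIC exceeds the no-RG baseline."""
    return bic_condition - bic_baseline

# Hypothetical values for one replication with 2,000 simulees.
baseline = bic(log_lik=-55000.0, n_params=150, n_obs=2000)
condition = bic(log_lik=-55200.0, n_params=150, n_obs=2000)
distortion = bic_distortion(condition, baseline)
```

Because the penalty term grows with the number of estimated parameters, a model that adds parameters without a commensurate gain in log-likelihood will show a larger BIC.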

Addressing the second research question within this study, both ability estimate bias and mean absolute error (MAE) were calculated to gauge ability parameter recovery. Bias was defined for each replication as follows:

bias = \frac{Σ_{j = 1}^{n} ({\hat{θ}}_{jr} - θ_{jr})}{n},

(6)

where ${\hat{θ}}_{jr}$ is the estimated ability parameter for examinee j, replication r, $θ_{jr}$ is the ability parameter for examinee j, replication r, and n is the number of observations per sample. Given the need to consider the balance between the reduction of systematic bias and random error, MAE was calculated as follows:

MAE = \frac{Σ_{j = 1}^{n} | {\hat{θ}}_{jr} - θ_{jr} |}{n} .

(7)
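The two recovery outcomes in equations 6 and 7 can be computed for a single replication as in the Python sketch below. The function names and the toy values are hypothetical.

```python
from typing import Sequence

def bias(estimates: Sequence[float], truths: Sequence[float]) -> float:
    """Mean signed difference between estimated and generating abilities."""
    return sum(e - t for e, t in zip(estimates, truths)) / len(truths)

def mae(estimates: Sequence[float], truths: Sequence[float]) -> float:
    """Mean absolute difference between estimated and generating abilities."""
    return sum(abs(e - t) for e, t in zip(estimates, truths)) / len(truths)

# Hypothetical replication with three simulees.
theta_hat = [0.5, -0.5, 1.0]
theta_true = [0.0, 0.0, 1.0]
rep_bias = bias(theta_hat, theta_true)
rep_mae = mae(theta_hat, theta_true)
```

In this toy example the positive and negative errors cancel, so bias is zero while MAE is not, which is why the two outcomes are reported together.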

These outcomes were evaluated separately for engaged and disengaged simulees to investigate the impact of RG on each subgroup. To assist in data interpretation, an analysis of variance model was fit in which bias was treated as the outcome, while scoring approach (EM and HG scoring), test difficulty, the covariance between $θ$ and $ξ$ , the proportion of disengaged simulees in the sample, as well as the average percentage of within-disengaged simulee RG were included as independent variables. In addition to examining the statistical significance of each moderating variable at an alpha level of .05, the eta-squared effect size was calculated. Furthermore, distortion in the omega reliability estimate was computed as the difference between the manipulated conditions, in which RG responses were introduced, and the baseline conditions with no RG present.

Results

Across conditions, both the naïve and HG scoring approaches were found to converge for every replication; however, EM scoring was susceptible to nonconvergence, with rates ranging from 0% to 6%. The highest rates of nonconvergence were generally noted for conditions under moderate test difficulty with minimal rates observed for easy test difficulty conditions (ranged from 0% to 1%). All nonconverged replications were removed from the analyses presented below.

Does HG Scoring Overfit Data When RG Proportions Are Small?

Figure 1 presents model fit distortion by scoring procedure disaggregated by test difficulty, the proportion of disengaged simulees in the sample, the percentage of within-disengaged simulee RG, and the covariance between $θ$ and $ξ$ . Results from this analysis indicate that naïvely ignoring the presence of RG was associated with minimal distortion in BIC values, particularly under easy test difficulty conditions. Slight negative model fit distortion was observed within the moderately difficult test condition, with distortion generally increasing as the proportion of disengaged simulees in the sample grew. In contrast, HG scoring possessed positive distortion across all conditions, with marginal differences in trends observed across levels of test difficulty and covariance between $θ$ and $ξ$ . Whereas the naïve approach’s model fit distortion was approximately zero when the test difficulty was easy and RG responses comprised 1% of all item responses, the BIC value for HG scoring was overestimated by a value of 500 and increased to nearly 1,000 as the rate of RG responses across the sample grew to 5%, supporting our prior hypothesis. Although susceptible to increasing negative distortion as the proportion of disengaged simulees increased, model fit distortion was negligible for EM scoring under small rates of RG responses in the sample data (1%–5%). In contrasting HG and EM scoring, the latter was found to provide less absolute BIC distortion across all conditions in which RG made up 12% or less of the total sample’s item responses; however, beyond this proportion, HG scoring retained less absolute distortion.

Figure 1

Model Fit Distortion by Scoring Procedure

How Robust Is EM Scoring to Multidimensional RG?

Figures 2 and 3 respectively provide ability estimate bias and MAE across conditions for both engaged and disengaged simulees. Focusing first on engaged simulees, results demonstrated that ability estimate bias was slightly higher when test difficulty was easy compared with moderately difficult; however, within these levels, this outcome was not found to vary across the proportion of disengaged simulees in the sample, the percentage of RG responses within-disengaged simulees, or the covariance between $θ$ and $ξ$ . Specifically, for EM scoring, the mean ability estimate bias was equal to approximately 0.05 logits when test difficulty was easy, while equaling roughly 0.00 logits under tests of moderate difficulty. After controlling for all independent variables, the ANOVA results suggest that there were no statistically significant differences between EM and HG scoring for engaged simulee ability estimate bias (p = .60), indicating no advantage for modeling multidimensional RG (Table 1). Furthermore, an inspection of Figures 2 and 3 showed that both RG scoring approaches provided nearly identical results in terms of both bias and MAE to the naïve approach, which reveals that RG had a negligible effect on engaged simulee ability estimates.

Figure 2

Ability Parameter Estimate Bias Disaggregated by Engagement Type and Scoring Procedure

Figure 3

Ability Parameter Estimate Mean Absolute Error Disaggregated by Engagement Type and Scoring Procedure

Table 1

ANOVA Results Investigating the Effects of Study Factors on Ability Parameter Estimate Bias

Outcome	Factor	SS	df	MS	F	p	$η^{2}$
Engaged θ	Scoring Approach	0.00	1	0.00	0.28	.60	.00
	Disengaged Proportion	0.00	4	0.00	1.37	.27	.00
	RG %	0.00	3	0.00	28.45	<.05	.01
	$θ$ and $ξ$ Covariance	0.00	1	0.00	5.85	<.05	.00
	Test Difficulty	0.06	1	0.06	8557.69	<.05	.97
Disengaged θ	Scoring Approach	0.00	1	0.00	0.02	.88	.00
	Disengaged Proportion	0.00	4	0.00	0.10	.98	.00
	RG %	0.36	3	0.12	130.23	<.05	.72
	$θ$ and $ξ$ Covariance	0.00	1	0.00	0.35	.56	.00
	Test Difficulty	0.00	1	0.00	0.57	.45	.00

Note. $SS$ = Sum of squares. $MS$ = Mean square. $θ$ and $ξ$ Covariance = Covariance between latent ability and latent rapid guessing propensity. Engaged $θ$ = ability estimates ( $θ$ ) for engaged simulees. Disengaged $θ$ = ability estimates ( $θ$ ) for disengaged simulees. The $R^{2}$ for the engaged and disengaged linear regression models was equal to .98 and .72, respectively.
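As a reading aid for the $η^{2}$ column, the sketch below applies the classical definition of eta squared, the effect sum of squares divided by the total sum of squares. The effect sum of squares is the RG % value reported for disengaged simulees in Table 1; the total sum of squares is not reported in the table and is a hypothetical value chosen here only so that the arithmetic can be followed.

```python
def eta_squared(ss_effect: float, ss_total: float) -> float:
    """Classical eta squared: share of total variation attributed to one factor."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    return ss_effect / ss_total


ss_rg_percent = 0.36      # SS for RG % (disengaged simulees), as reported in Table 1
ss_total_assumed = 0.50   # hypothetical total SS; not reported in the table

eta_sq = eta_squared(ss_rg_percent, ss_total_assumed)
print(round(eta_sq, 2))
```

Because the tabled sums of squares are rounded to two decimals, effects displayed as 0.00 cannot be converted to $η^{2}$ in this way without the unrounded values.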

Turning to disengaged simulee ability estimate bias, no significant differences were noted across levels of test difficulty (p = .45), the proportion of disengaged simulees (p = .98), or the covariance between $θ$ and $ξ$ (p = .56); however, bias was found to significantly differ across levels of within-disengaged simulee RG percentages (p < .05; $η^{2}$ = .72). A closer inspection of Figure 2 shows that for moderate test difficulty conditions, EM scoring possessed an average bias of approximately 0.00 logits when the within-disengaged simulee RG rate was as high as 40%; yet, when this percentage grew to 80%, bias increased to roughly 0.20 logits. After controlling for all independent variables, the ANOVA results indicated a similar pattern for HG scoring, given no statistically significant differences in mean ability estimate bias between scoring approaches (p = .88), which runs counter to our hypothesis that HG scoring would provide added value when the covariance between $θ$ and $ξ$ increased. In addition, the naïve approach was found to provide overlapping confidence intervals with both EM and HG scoring for rates of RG in the total sample data matrix that were as high as 40% (Figure 2), and actually showed smaller MAE values for high within-disengaged simulee RG percentages (40% and 80%; Figure 3). Moreover, failing to account for RG had a negligible impact on omega reliability estimate distortion (distortion values ranged from approximately 0.00 to 0.0075 units), and as a result, no advantages were noted when employing RG scoring approaches (EM and HG scoring provided identical degrees of distortion; Figure 4).

Figure 4

Omega Reliability Distortion Disaggregated by Scoring Procedure
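To make the notion of omega reliability distortion concrete, the sketch below computes coefficient omega for a one-factor model in its simple linear form, (Σλ)² / [(Σλ)² + Σψ], and takes distortion to be the estimate from contaminated data minus the estimate from a reference data set. Both choices are assumptions made for illustration: the loadings are hypothetical standardized values, and the linear formula is a simplification rather than the categorical-data estimator of Green and Yang (2009) cited in this article.

```python
def omega(loadings, residual_variances):
    """Coefficient omega for a one-factor model with unit factor variance."""
    common = sum(loadings) ** 2
    return common / (common + sum(residual_variances))


def standardized_residuals(loadings):
    """Residual variances implied by standardized loadings: 1 - lambda^2."""
    return [1.0 - lam ** 2 for lam in loadings]


# Hypothetical standardized loadings without and with rapid guessing.
loadings_reference = [0.70, 0.60, 0.80, 0.50]
loadings_with_rg = [0.68, 0.60, 0.78, 0.50]

omega_reference = omega(loadings_reference, standardized_residuals(loadings_reference))
omega_with_rg = omega(loadings_with_rg, standardized_residuals(loadings_with_rg))

# Assumed definition: distortion = contaminated estimate - reference estimate.
distortion = omega_with_rg - omega_reference
print(round(omega_reference, 4), round(omega_with_rg, 4), round(distortion, 4))
```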

Discussion

The objective of the present manuscript was to examine whether EM scoring, a unidimensional RG scoring approach, is robust to multidimensional RG in terms of model fit distortion, ability parameter recovery, and omega reliability distortion. To this end, a comprehensive simulation investigation was conducted. Overall, the findings of this study indicate that EM scoring provided less model fit distortion compared with HG scoring when the percentage of RG responses comprised as much as 12% of all responses across the total sample. In fact, HG scoring was susceptible to positive BIC value distortion under low rates of RG (1%), due largely to overparameterization. However, benefits were noted in mitigating model fit distortion as the percentage of RG responses in the total sample increased beyond 12% when employing the multidimensional HG approach. With this noted, both RG scoring approaches were found to provide greater absolute BIC value distortion compared to naïvely ignoring the presence of RG, particularly as the average rate of within-disengaged simulee RG grew.

In addition to model fit, this study investigated ability estimate parameter recovery to determine whether EM scoring was robust to multidimensional RG. Although the data generating model was equivalent to HG scoring, ability parameter estimate bias and MAE were equivalent between EM and HG scoring for both engaged and disengaged simulees across levels of test difficulty, proportions of disengaged simulees, within-disengaged simulee RG rates, and most importantly, the covariance between $θ$ and $ξ$. Moreover, identical degrees of omega reliability distortion were observed between EM and HG scoring. These findings indicate that EM scoring was robust to data with underlying multidimensional RG, adding support for employing a simpler model to handle RG, given the limited added value of HG scoring.

With these findings noted, naïvely ignoring the presence of RG was not found to lead to significantly worse ability parameter recovery when compared with either RG scoring approach. In fact, when the percentage of RG responses in the total sample was quite large (40%), the naïve approach provided significantly lower mean bias in contrast to HG scoring. In addition, ignoring the presence of RG notably reduced combined systematic and random error (i.e., MAE) when within-disengaged simulee RG rates were high (40% and 80%). This is likely because while EM and HG scoring attempt to address the issue of RG by imputing missing data, this act of imputation itself can limit available information and inadvertently lead to biases when the RG percentage is high (i.e., greater than 40%). Overall, the naïve approach was associated with minimal bias in average ability estimates for both engaged and disengaged simulees, with absolute bias ranging from 0.00 to 0.05 logits, and similarly led to negligible degrees of distortion in omega reliability estimates (absolute distortion ≤ .0025). This latter finding adds further support that RG generally has minimal impact on reliability estimates (Rios & Deng, 2022).
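The loss of information described above can be seen in a minimal sketch of effort-moderated scoring for a single examinee. The sketch assumes a two-parameter logistic model with known item parameters, treats responses flagged as rapid guesses as contributing nothing to the likelihood, and locates the maximum likelihood ability estimate by grid search. The item parameters, responses, flags, and grid are all hypothetical, and the estimation routine is a simplification rather than the procedure used in the simulation.

```python
import math


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def em_log_likelihood(theta, responses, rg_flags, items):
    """Log-likelihood in which flagged rapid guesses are skipped (treated as missing)."""
    total = 0.0
    for u, is_rg, (a, b) in zip(responses, rg_flags, items):
        if is_rg:
            continue  # flagged response carries no information about theta
        p = p_correct(theta, a, b)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total


def theta_mle(responses, rg_flags, items, grid=None):
    """Grid-search maximum likelihood estimate of theta on [-4, 4]."""
    grid = grid or [x / 100.0 for x in range(-400, 401)]
    return max(grid, key=lambda t: em_log_likelihood(t, responses, rg_flags, items))


# Hypothetical six-item test: (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.0, -0.5), (1.0, 0.0), (1.0, 0.5), (1.0, 1.0), (1.0, 1.5)]
responses = [1, 1, 1, 0, 0, 0]
rg_flags = [False, False, False, False, True, True]  # last two flagged as rapid guesses

theta_em = theta_mle(responses, rg_flags, items)
theta_naive = theta_mle(responses, [False] * len(items), items)  # ignores the flags
print(theta_em, theta_naive)
```

In this example the effort-moderated estimate rests on four responses rather than six, so it is less precise even though it is not pulled downward by the two flagged incorrect answers; as the share of flagged responses grows, fewer responses remain to inform the estimate.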

Limitations and Areas for Future Research

A number of limitations should be noted when interpreting the results from this study. First, only one form of noneffortful responding was examined, and thus, the findings are limited to contexts in which examinees solely disengage by employing RG. However, it is quite possible that within a single testing event, examinees may display disengagement in a variety of ways, such as both RG and omitting item responses. To address this possibility, recent work by Ulitzsch et al. (2020) has attempted to incorporate both response time information and item responses to evaluate engagement based on RG and item omission by employing a hierarchical latent response model. Efforts such as these are promising as they consider the complexity of accounting for multiple disengaged behaviors simultaneously. Clearly, more research is needed in this area.

An additional limitation associated with this study was related to the data generation process. First, the data were generated based on a bivariate normal distribution, which assumed that RG propensity and ability were linearly associated. However, this assumption may be untenable when differential RG rates are observed within examinees with similar true abilities, a situation that may arise when RG correlates unassociated with ability are present (for a review, see Rios & Soland, 2022; Wise, 2017). If the assumption of a linear association is violated, biased parameter estimates will likely result (Liu et al., 2019). Thus, future research may benefit from employing more nuanced data generation approaches that employ structural equation models to combine both measurement and path models that better encapsulate RG behavior. Furthermore, this simulation considered only a weak and a moderate relationship between latent ability and RG propensity. Future studies should also encompass scenarios reflecting a stronger correlation between latent ability and RG propensity.
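The linearity assumption discussed above follows directly from how bivariate normal data are generated. The sketch below draws pairs of latent ability ($θ$) and RG propensity ($ξ$) with a specified covariance by constructing $ξ$ as a linear function of $θ$ plus independent noise. Unit variances for both latent variables, the sample size, the seed, and the function names are assumptions made for illustration; the covariance of −0.5 is used only as an example value.

```python
import math
import random


def simulate_theta_xi(n, covariance, seed=1):
    """Draw (theta, xi) pairs from a standard bivariate normal distribution.

    Both variances are assumed to equal 1, so the covariance is also the
    correlation and must lie in [-1, 1].
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        theta = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        # xi depends on theta only through a straight line plus noise.
        xi = covariance * theta + math.sqrt(1.0 - covariance ** 2) * noise
        pairs.append((theta, xi))
    return pairs


def sample_covariance(pairs):
    """Unbiased sample covariance of the two coordinates."""
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_x = sum(x for _, x in pairs) / n
    return sum((t - mean_t) * (x - mean_x) for t, x in pairs) / (n - 1)


pairs = simulate_theta_xi(20000, covariance=-0.5)
print(round(sample_covariance(pairs), 2))
```

Any relationship that is not a straight line, such as RG that is elevated only among the lowest-ability examinees, cannot be produced by this mechanism, which is the limitation noted in the text.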

The final limitation associated with the present study is that only aggregated sample inferences were investigated. Although this level of aggregation is common in many low-stakes testing contexts (e.g., international comparative studies and educational accountability measures), the results highlighted in this paper do not generalize to situations where inferences are made at the individual level (e.g., formative assessments). Greater focus on this latter context is needed given that the preponderance of research on RG has been limited to aggregated sample contexts (see Wise, 2017), with limited exceptions (e.g., Rios, 2022).

Practical Implications

The findings from this study provide a number of insights for practitioners. First, fitting a multidimensional model to data with minimal RG responses can lead to poor model fit due to model overparameterization. Compared with the EM approach, HG scoring appears to only provide improved model fit when the percentage of RG responses in the total sample exceeds 12%. Second, when data are missing not at random (e.g., when latent ability and RG propensity are correlated, as in this simulation) and as the rate of missing data rises, practitioners should be cautious in interpreting model fit indices, such as BIC, as a means for model selection, as they may be increasingly biased (Davey, 2005). Third, when RG is multidimensional (with the covariance between $θ$ and $ξ$ being as high as −0.5), HG scoring provides negligible added value over the EM scoring approach. Thus, if accounting for RG in scoring, the unidimensional method may be preferable because it provides equivalent ability and omega reliability estimates to the more complex method, and it can be readily employed in operational testing programs that are currently using unidimensional IRT scoring.

Finally, under the conditions investigated, RG was found to have a negligible impact on average ability estimates for both engaged and disengaged simulees as well as on omega reliability estimate distortion. Therefore, when aggregating scores for large samples, inferences concerning reliability and examinee ability may be robust to considerable rates of RG in the data (up to 40%), supporting prior analyses (e.g., Wise et al., 2020). With this noted, it is important to consider that this finding does not generalize to contexts in which inferences are made about individual examinees or between subgroups that differ in RG (see Rios, 2021, 2022). Thus, as recommended by the Standards for Educational and Psychological Testing (American Educational Research Association et al., 2014; Standard 1.12), testing programs reporting scores aggregated at the sample level may still benefit from continually gauging examinee engagement and documenting the degree and impact of construct-irrelevant responses on score-based inferences.

Footnotes

Author Contributions

J.A.R. conceived of the present study, designed the simulation study, interpreted the findings from the simulation, and wrote the majority of the manuscript. J.D. wrote the R syntax for the simulation and created all tables and figures. All authors conducted critical revisions of the article throughout the review process and approved of the final version to be published.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Joseph A. Rios

Jiayi Deng

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing (6th ed.). American Educational Research Association.

Asseburg, & Frey (2013). Too hard, too easy, or just right? The relationship between effort or boredom and ability-difficulty fit. Psychological Test and Assessment Modeling, 55(1), 92–104.

Bolsinova, Tijmstra, Molenaar, & De Boeck (2017). Conditional dependence between response time and accuracy: An overview of its possible sources and directions for distinguishing between them. Frontiers in Psychology, 8, Article 202. https://doi.org/10.3389/fpsyg.2017.00202

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29.

Davey (2005). Issues in evaluating model fit with missing data. Structural Equation Modeling, 12(4), 578–597.

Deng, & Rios, J. A. (2022). Investigating the effect of differential rapid guessing on population invariance in equating. Applied Psychological Measurement, 46(7), 589–604.

Deribo, Kroehne, & Goldhammer (2021). Model-based treatment of rapid guessing. Journal of Educational Measurement, 58(2), 281–303. https://doi.org/10.1111/jedm.12290

Domingue, B. W., Kanopka, Stenhaug, Soland, Kuhfeld, Wise, & Piech (2021). Variation in respondent speed and its implications: Evidence from an adaptive testing scenario. Journal of Educational Measurement, 58(3), 335–363.

Goldhammer, Martens, & Lüdtke (2017). Conditioning factors of test-taking engagement in PIAAC: An exploratory IRT modelling approach considering person and item characteristics. Large-Scale Assessments in Education, 5, 1–25.

Green, S. B., & Yang (2009). Reliability of summed item scores using structural equation modeling: An alternative to coefficient alpha. Psychometrika, 74(1), 155–167.

Halderman, L. K., Finn, Lockwood, J. R., Long, N. M., & Kahana, M. J. (2021). EEG correlates of engagement during assessment. Educational Testing Service. https://doi.org/10.1002/ets2.12312

Harmes, J. C., & Wise, S. L. (2016). Assessing engagement during the online assessment of real-world skills. In Rosen, Ferrara, & Mosharraf (Eds.), Handbook of research on technology tools for real-world skill development (pp. 805–824). IGI Global.

Holman, & Glas, C. A. (2005). Modelling non-ignorable missing-data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, 58(1), 1–17.

Jorgensen, T. D., Pornprasertmanit, Miller, Schoemann, & Rosseel (2016). Package “SEM tools.” https://cran.r-project.org/web/packages/semTools/semTools.pdf

Lehman, & Zapata-Rivera (2018). Student emotions in conversation-based assessments. IEEE Transactions on Learning Technologies, 11(1), 41–53. https://doi.org/10.1109/TLT.2018.2810878

Lindner, M. A., Lüdtke, Grund, & Köller (2017). The merits of representational pictures in educational assessment: Evidence for cognitive and motivational effects in a time-on-task analysis. Contemporary Educational Psychology, 51, 482–492. https://doi.org/10.1016/j.cedpsych.2017.09.009

Liu, & Luo (2019). Modeling test-taking non-effort in MIRT models. Frontiers in Psychology, 10, Article 145. https://doi.org/10.3389/fpsyg.2019.00145

Meng, X.-B., Tao, & Chang, H.-H. (2015). A conditional joint modeling approach for locally dependent item responses and response times. Journal of Educational Measurement, 52(1), 1–27. https://doi.org/10.1111/jedm.12060

Molenaar, Bolsinova, & Vermunt, J. K. (2018). A semi-parametric within-subject mixture approach to the analyses of responses and response times. British Journal of Mathematical and Statistical Psychology, 71(2), 205–228.

Nagy, & Ulitzsch (2022). A multilevel mixture IRT framework for modeling response times as predictors or indicators of response engagement in IRT models. Educational and Psychological Measurement, 82(5), 845–879. https://doi.org/10.1177/00131644211045351

Penk, & Schipolowski (2015). Is it all about value? Bringing back the expectancy component to the assessment of test-taking motivation. Learning and Individual Differences, 42, 27–35.

Pokropek (2016). Grade of membership response time model for detecting guessing behaviors. Journal of Educational and Behavioral Statistics, 41(3), 300–325. https://doi.org/10.3102/1076998616636618

Pokropek, Marks, G. N., & Borgonovi (2022). How much do students’ scores in PISA reflect general intelligence and how much do they reflect specific abilities? Journal of Educational Psychology, 114(5), 1121–1135.

Rios, J. A. (2021). Is differential noneffortful responding associated with type I error in measurement invariance testing? Educational and Psychological Measurement, 81(5), 957–979.

Rios, J. A. (2022). A comparison of robust likelihood estimators to mitigate bias from rapid guessing. Applied Psychological Measurement, 46(3), 236–249.

Rios, J. A., & Deng (2021). Does the choice of response time threshold procedure substantially affect inferences concerning the identification and exclusion of rapid guessing responses? A meta-analysis. Large-scale Assessments in Education, 9(1), 1–25.

Rios, J. A., & Deng (2022). Quantifying the distorting effect of rapid guessing on estimates of coefficient alpha. Applied Psychological Measurement, 46(1), 40–52.

Rios, J. A., & Deng (2024). A comparison of response-threshold scoring procedures in mitigating bias from rapid guessing. Educational and Psychological Measurement, 84(2), 387–420. https://doi.org/10.1177/00131644231168398

Rios, J. A., Deng, & Ihlenfeldt, S. D. (2022). To what degree does rapid guessing distort aggregated test scores? A meta-analytic investigation. Educational Assessment, 27(1), 1–18.

Rios, J. A., & Guo (2020). Can culture be a salient predictor of test-taking engagement? An analysis of differential noneffortful responding on an international college-level assessment of critical thinking. Applied Measurement in Education, 33(4), 263–279.

Rios, J. A., Guo, Mao, & Liu, O. L. (2017). Evaluating the impact of careless responding on aggregated-scores: To filter unmotivated examinees or not? International Journal of Testing, 17(1), 74–104. https://doi.org/10.1080/15305058.2016.1231193

Rios, J. A., & Soland (2021a). Investigating the impact of noneffortful responses on individual-level scores: Can the Effort-Moderated IRT model serve as a solution? Applied Psychological Measurement, 45(6), 391–406.

Rios, J. A., & Soland (2021b). Parameter estimation accuracy of the Effort-Moderated Item Response Theory Model under multiple assumption violations. Educational and Psychological Measurement, 81(3), 569–594.

Rios, J. A., & Soland (2022). An investigation of item, examinee, and country correlates of rapid guessing in PISA. International Journal of Testing, 22(2), 154–184.

Silm, Pedaste, & Täht (2020). The relationship between performance and test-taking effort when measured with self-report or time-based instruments: A meta-analytic review. Educational Research Review, 31, 100335.

Soland, Wise, S. L., & Gao (2019). Identifying disengaged survey responses: New evidence using response time metadata. Applied Measurement in Education, 32(2), 151–165.

Ulitzsch, Pohl, Khorramdel, Kroehne, & von Davier (2022). A response-time-based latent response mixture model for identifying and modeling careless and insufficient effort responding in survey data. Psychometrika, 87(2), 593–619.

Ulitzsch, von Davier, & Pohl (2020). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. British Journal of Mathematical and Statistical Psychology, 73, 83–112.

van Barneveld (2007). The effect of examinee motivation on test construction within an IRT framework. Applied Psychological Measurement, 31(1), 31–46.

van der Linden, W. J., & Glas, C. A. (2010). Statistical tests of conditional independence between responses and/or response times on test items. Psychometrika, 75(1), 120–139. https://doi.org/10.1007/S11336-009-9129-9

Waller, M. I. (1974). Removing the effects of random guessing from latent trait ability estimates (Research Report No. 74-32). Educational Testing Service. https://doi.org/10.1002/j.2333-8504.1974.tb00660.x

Wise, S. L. (2017). Rapid-guessing behavior: Its identification, interpretation, and implications. Educational Measurement: Issues and Practice, 36(4), 52–61. https://doi.org/10.1111/emip.12165

Wise, S. L., & DeMars, C. E. (2006). An application of item response time: The effort-moderated IRT model. Journal of Educational Measurement, 43(1), 19–38. https://doi.org/10.1111/j.1745-3984.2006.00002.x

Wise, S. L., & Gao (2017). A general approach to measuring test-taking effort on computer-based tests. Applied Measurement in Education, 30(4), 343–354.

Wise, S. L., & Kingsbury, G. G. (2016). Modeling student test-taking motivation in the context of an adaptive achievement test. Journal of Educational Measurement, 53(1), 86–105.

Wise, S. L., & Kuhfeld, M. R. (2020). A cessation of measurement: Identifying test taker disengagement using response time. In Margolis, M. J., & Feinberg, R. A. (Eds.), Integrating timing considerations to improve testing practices (pp. 150–164). Routledge.

Wise, S. L., & Kuhfeld, M. R. (2021). Using retest data to evaluate and improve effort-moderated scoring. Journal of Educational Measurement, 58(1), 130–149.

Wise, S. L., & Smith, L. F. (2011). A model of examinee test-taking effort. In Bovaird, J. A., Geisinger, K. F., & Buckendahl, C. W. (Eds.), High-stakes testing in education: Science and practice in K-12 settings (pp. 139–153). American Psychological Association.

Wise, S. L., & Soland (2020). The (non) impact of differential test taker engagement on aggregated scores. International Journal of Testing, 20(1), 57–77.

Wright, D. B. (2016). Treating all rapid responses as errors (TARRE) improves estimates of ability (slightly). Psychological Test and Assessment Modeling, 58(1), 15–31.

Wright, D. B. (2019). Treating rapid responses as incorrect for non-timed formative tests. Open Education Studies, 1(1), 56–72. https://doi.org/10.1515/edu-2019-0004

Yildirim-Erbasli, S. N., & Bulut (2020). The impact of students’ test-taking effort on growth estimates in low-stakes educational assessments. Educational Research and Evaluation, 26(7–8), 368–386.