Polytomous Testlet Response Models for Technology-Enhanced Innovative Items: Implications on Model Fit and Trait Inference

Abstract

The development of technology-enhanced innovative items calls for practical models that can describe polytomous testlet items. In this study, we evaluate four measurement models that can characterize polytomous items administered in testlets: (a) generalized partial credit model (GPCM), (b) testlet-as-a-polytomous-item model (TPIM), (c) random-effect testlet model (RTM), and (d) fixed-effect testlet model (FTM). Using data from GPCM, FTM, and RTM, we examine performance of the scoring models in multiple aspects: relative model fit, absolute item fit, significance of testlet effects, parameter recovery, and classification accuracy. The empirical analysis suggests that relative performance of the models varies substantially depending on the testlet-effect type, effect size, and trait estimator. When testlets had no or fixed effects, GPCM and FTM led to most desirable measurement outcomes. When testlets had random interaction effects, RTM demonstrated best model fit and yet showed substantially different performance in the trait recovery depending on the estimator. In particular, the advantage of RTM as a scoring model was discernable only when there existed strong random effects and the trait levels were estimated with Bayes priors. In other settings, the simpler models (i.e., GPCM, FTM) performed better or comparably. The study also revealed that polytomous scoring of testlet items has limited prospect as a functional scoring method. Based on the outcomes of the empirical evaluation, we provide practical guidelines for choosing a measurement model for polytomous innovative items that are administered in testlets.

Keywords

technology-enhanced assessment, innovative items, polytomous items, testlet, item response theory

Introduction

With the increasing use of computers in recent testing, innovative items that make use of computer features have become increasingly popular. For example, items can deploy dynamic tools to present information more efficiently (e.g., audio, animation, video, simulation) or adopt interactive response mode to measure higher order skills or enhance task engagement (e.g., hot spot, graph, drag-and-drop). When innovative items are used in practice, they are often presented in testlets (e.g., lab scenario, case study, chart, table) and scored polytomously allowing partial credit (Betts et al., 2021; Davey et al., 1997; Jiao et al., 2012; Jodoin, 2003). These items are characterized as polytomous items embedded in testlets.

While testlets help enhance testing efficiency, they can induce local dependence among the items sharing the same testlet (Wainer & Kiely, 1987; Yen, 1984). The common stimulus attached to a testlet brings about extra correlation among the associated items, engendering local dependence above and beyond the relationship that can be explained by the primary latent factor. Studies have suggested that local dependence among testlet items can adversely affect psychometric inference. It can lead to overestimation of measurement precision (Marais & Andrich, 2008; Sireci et al., 1991; Wainer & Thissen, 1996), bias in item parameter estimation (Tuerlinckx & De Boeck, 2001; Wainer & Wang, 2000), item misfit (Marais & Andrich, 2008), and linking and equating error (Lee et al., 2001; Li et al., 2005).

Several strategies have been discussed in the measurement literature to deal with the local dependence problem (e.g., Bradlow et al., 1999; Li et al., 2005; Wainer & Kiely, 1987; W.-C. Wang & Wilson, 2005). These studies, however, mostly centered on binary-response items. Despite the increasing interest in polytomously scored innovative items, only scant attention has been given to polytomous testlet items (e.g., Li et al., 2010; Himelfarb et al., 2020). The purpose of this study is to fill this void and investigate measurement models that can be used for polytomous testlet items. We give special attention to innovative items that are emerging in recent operational assessments. These items are often tied to technology-rich stimuli (e.g., graphic, video, simulation) and require various interactive actions (e.g., matching, hot spot, drop-down). The items typically involve high degrees of multiplicity in response scores and face greater risk of convergence problems and scaling challenges. In this study, we conduct a comprehensive evaluation of candidate measurement models and document their relative performance in model fit and trait inference. Four models are considered for investigation: the (a) generalized partial credit model (GPCM), (b) testlet-as-a-polytomous-item model (TPIM), (c) random-effect testlet model (RTM), and (d) fixed-effect testlet model (FTM). TPIM and RTM have been widely considered in the analysis of binary response data (Rosenbaum, 1988; Wainer & Kiely, 1987; X. Wang et al., 2002; Wainer et al., 2007). FTM is a new model suggested in this study.

The performance of the models is evaluated in multiple aspects. Earlier studies mostly drew on relative fit measures (e.g., log-likelihood, Akaike information criterion) to decide a comparatively better fitting model (e.g., DeMars, 2012; Glas et al., 2002; Hernandez-Camacho et al., 2017). These measures are, however, known to have high Type I error (De Champlain & Gessaroli, 1998; Hayashi et al., 2007). In addition, our empirical evaluation suggests that a better fitting model does not necessarily lead to better measurement outcomes. Notwithstanding the greater goodness-of-fit, a complex general model can result in poorer trait recovery than the simpler functional models. In applied settings, a general better-fitting model may not always be applicable to operational assessment (due, for example, to scaling, sample size, or score interpretability), and a simpler model that permits easy scaling and model-fitting may be preferable if it does not show significant misfit. The goal of this study is to evaluate the functioning of the models with such practical considerations, taking special note of practicality for large-scale operational assessment programs. Specifically, the study evaluates performance of the models in various aspects, including relative model fit, absolute item fit, recovery of item and person parameters, and classification accuracy. The study also suggests statistical inference methods relevant to those evaluations, such as significance tests of testlet effect, item calibration under FTM, and trait inference under the polytomous RTM. The outcomes of the study will provide important implications for operational testing programs that seek to include innovative items or polytomous testlet items.

The remainder of this article is organized as follows. The next section gives a brief outline of the measurement models considered in the study. This is followed by a section that presents statistical tests that evaluate significance of testlet effect. The subsequent two sections respectively provide simulation study and real data analysis that show empirical performance of the models. The final section discusses key findings and implications for real testing programs.

Modeling Framework

The study considers four measurement models for describing polytomous testlet items. All models are formulated for partial credit items given their prevalence in applied settings. The assumptions and parameterization of each model are outlined below.

Generalized Partial Credit Model

The first model, the generalized partial credit model (GPCM; Masters, 1982; Muraki, 1992), is one of the most widely used polytomous response models for partial credit items. The model assumes that all items on a test are conditionally independent given a latent trait parameter. It does not assume any local dependence within a testlet and models the response scores at the individual item levels. Specifically, for a response category $k$ of item $j$ , GPCM models the probability of scoring $k$ as

P_{ijk} = \Pr (X_{ij} = k | θ_{i}) \equiv \frac{\exp (\sum_{l = 0}^{k} (a_{j} θ_{i} - b_{jl}))}{\sum_{h = 0}^{K_{j}} \exp (\sum_{l = 0}^{h} (a_{j} θ_{i} - b_{jl}))},

(1)

where $X_{ij}$ is the item response variable for examinee $i$ and item $j$ with $k$ indexing integer item scores ( $k \in \{0, 1, \dots, K_{j}\}$ ); $θ_{i}$ is the latent trait level of examinee $i$ ; $a_{j}$ models the impact of $θ_{i}$ on the logit of the item category response function, $P_{ijk}$ ; and $b_{jk}$ indicates the location of response category $k$ of item $j$ . The coefficient $b_{jk} / a_{j}$ gives the point at which $P_{ij, k - 1}$ and $P_{ijk}$ intersect on the $θ$ continuum, because the two probabilities are equal exactly when $a_{j} θ_{i} - b_{jk} = 0$ . Note that the current study parameterizes $b_{jk}$ as a location parameter associated with difficulty of a response category. This is to make the estimation compatible with the existing calibration programs that use intercept expression (e.g., Chalmers, 2012; Robitzsch et al., 2020); that is, $\exp (k a_{j} θ_{i} - \sum_{l = 0}^{k} b_{jl}) = \exp (k a_{j} θ_{i} + d_{jk})$ . When the step difficulty parameters are desired, the exponent can be alternatively parameterized as $a_{j} (θ_{i} - b_{jk}^{'}) = a_{j} (θ_{i} - b_{jk} / a_{j})$ .
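To make Equation 1 concrete, the following is a minimal Python sketch of the category response function for a single item. The function name `gpcm_probs` and the example parameter values are illustrative assumptions, not part of the study's estimation code. The sketch also assumes the common convention that the first location, $b_{j0}$, is set to zero; this term is shared by every category and cancels in the normalization.

```python
import math

def gpcm_probs(theta, a, b):
    """Category probabilities for one GPCM item (Equation 1).

    theta : latent trait level of the examinee
    a     : item slope a_j
    b     : category locations [b_j0, ..., b_jK]; b[0] is assumed to be 0
            (it is common to all categories and cancels out)
    """
    # Cumulative exponents: sum_{l=0}^{k} (a * theta - b_l) for k = 0..K
    exponents = []
    running = 0.0
    for b_l in b:
        running += a * theta - b_l
        exponents.append(running)
    # Subtract the maximum before exponentiating for numerical stability
    m = max(exponents)
    weights = [math.exp(e - m) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category item (scores 0 to 3)
probs = gpcm_probs(theta=0.5, a=1.2, b=[0.0, -0.6, 0.3, 1.1])
```

The returned list holds $P_{ij0}, \dots, P_{ijK_{j}}$ and sums to one.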

Testlet as a Polytomous Item

GPCM assumes conditional independence of individual items and does not account for testlet effects. As a practical way of addressing the local dependence, Wainer and Kiely (1987) suggested scoring testlet items as a single polytomous item. By treating a testlet as a measurement unit, this strategy shifts the local independence assumption of individual items to that of testlets. Since the items within a testlet are treated as a fungible unit, this scoring is referred to as testlet-as-a-polytomous-item modeling (TPIM).

The scoring procedure can be described as follows. For a testlet $s$ , items within the testlet are scored by the sum of the observed item scores, $Y_{is} = \sum_{j \in S} X_{ij},$ where $S$ is the set of items anchored to the testlet $s$ . The rescored data are then treated as ordinary response scores and analyzed by a regular polytomous item response model (e.g., Bock, 1972; Muraki, 1992; Samejima, 1972). In the case of GPCM, the testlet score $Y_{is}$ is modeled as

P_{isk} = \Pr (Y_{is} = k | θ_{i}) \equiv \frac{\exp (\sum_{l = 0}^{k} (a_{s} θ_{i} - b_{sl}))}{\sum_{h = 0}^{K_{s}} \exp (\sum_{l = 0}^{h} (a_{s} θ_{i} - b_{sl}))},

(2)

where $P_{isk}$ denotes the probability that examinee $i$ scores $k$ on testlet $s$ ; $a_{s}$ and $b_{sl}$ are the slope and category location parameters of testlet $s$ ; $Y_{is} \in \{0, 1, \dots, K_{s}\}$ is the testlet score; and $K_{s} = \sum_{j \in S} \max X_{j}$ is the maximum testlet score. We note that the response data obtained under the TPIM framework will have fewer measurement units and the measurement variables will have wider score ranges than the original data. Coalescing multiple items into one measurement unit will also lead to fewer free parameters.

Random-Effect Testlet Model

The other way of addressing the local dependence problem is to use a parametric model that explicitly accounts for the testlet effect. One of the most commonly considered models in this approach is the random-effect testlet model (RTM; Bradlow et al., 1999; X. Wang et al., 2002; Wainer et al., 2007). The model assumes that extra dependence among the testlet items is due to random interaction between the testlet and test-takers. For example, items within a testlet can show stronger correlation because of an individual’s unique background knowledge about the testlet stimulus, misunderstanding of the testlet information, or general frustration with the testlet. To model such interaction, RTM uses a random-effect term within the item category response function. Let $γ_{is (j)}$ model the interaction between examinee $i$ and testlet $s (j)$ . The subscript $s (j)$ indicates the testlet $s$ ( $s = 1, \dots, S$ ) to which item $j$ ( $j = 1, \dots, J_{s}$ ) belongs. Given the testlet factor, $γ_{is (j)}$ , RTM defines the item category response probability as

P_{ijk} = \Pr (X_{ij} = k | θ_{i}, γ_{is (j)}) \equiv \frac{\exp (\sum_{l = 0}^{k} (a_{j} θ_{i} + γ_{is (j)} - b_{jl}))}{\sum_{h = 0}^{K_{j}} \exp (\sum_{l = 0}^{h} (a_{j} θ_{i} + γ_{is (j)} - b_{jl}))},

(3)

where $P_{ijk}$ denotes the probability that examinee $i$ scores $k$ on item $j$ . Note that the current study assumes a distinct slope parameter for the primary trait dimension but no separate slopes for the testlet-effect dimensions. Our experimentation with simulated data suggests that assuming equal slopes for $θ$ and $γ$ (e.g., Bradlow et al., 1999) or distinct slopes for the testlet dimensions (e.g., Li et al., 2005) induces systematic bias in the slope estimation and consequently leads to poor recovery of the latent factors (a similar pattern was observed in DeMars, 2006). This being the case, we constrain the slope of $γ$ at one and assume that the effect of a testlet is manifested only by the nominal value of $γ$ . The size of the testlet effect (testlet effect-size hereafter) can be modeled by $σ_{γ_{s}}^{2} = Var (γ_{is} : i = 1, \dots, N)$ where $N$ is the number of examinees that received the testlet $s$ .
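As an illustration of Equation 3, the sketch below evaluates the RTM category probabilities for one item at three values of the person-by-testlet interaction. The function names and parameter values are assumptions chosen for illustration; the slope of $γ$ is fixed at one, as in the text.

```python
import math

def rtm_probs(theta, gamma, a, b):
    """Category probabilities under RTM (Equation 3) for one item.

    gamma is the examinee-by-testlet interaction; its slope is fixed at 1.
    b holds [b_j0, ..., b_jK], with b[0] assumed to be 0.
    """
    exponents, running = [], 0.0
    for b_l in b:
        running += a * theta + gamma - b_l
        exponents.append(running)
    m = max(exponents)  # stabilize the exponentials
    weights = [math.exp(e - m) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def expected_score(probs):
    """Expected item score implied by a vector of category probabilities."""
    return sum(k * p for k, p in enumerate(probs))

b = [0.0, -0.5, 0.2, 0.9]  # hypothetical category locations
base = rtm_probs(theta=0.0, gamma=0.0, a=1.0, b=b)       # gamma = 0 reduces to GPCM
helped = rtm_probs(theta=0.0, gamma=0.7, a=1.0, b=b)     # favorable interaction
hindered = rtm_probs(theta=0.0, gamma=-0.7, a=1.0, b=b)  # unfavorable interaction
```

A positive $γ_{is (j)}$ shifts probability toward the higher score categories of every item in the testlet for that examinee, and a negative value shifts it toward the lower categories, which is the source of the extra within-testlet correlation.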

Fixed-Effect Testlet Model

RTM ascribes within-testlet dependence to an individual’s trait level associated with the testlet stimulus. Alternatively, one may attribute local dependence to the testlet’s own characteristics, such as the complexity of a scenario or the added difficulty in processing the testlet information. In the context of GPCM, a fixed-effect testlet model (FTM) can be formulated as

P_{ijk} = \Pr (X_{ij} = k | θ_{i}) \equiv \frac{\exp (\sum_{l = 0}^{k} (a_{j} θ_{i} - (γ_{s} + b_{jl})))}{\sum_{h = 0}^{K_{j}} \exp (\sum_{l = 0}^{h} (a_{j} θ_{i} - (γ_{s} + b_{jl})))},

(4)

where $γ_{s}$ models the constant effect of testlet $s$ . $γ_{s}$ is a structural parameter that remains invariant across samples. Hence, once the model parameters are estimated from a sample, they can be used in other samples under the regular assumption of measurement invariance. The current conception is in line with the traditional item response models in the sense that it treats item effects as fixed and person effects as random.

We note that FTM in its current form is subject to nonidentifiability between $γ$ and $b$ . While the sum of the testlet and item parameters, $γ + b$ , can be identified by fixing the location of $θ$ , the origins of each $γ$ and $b$ cannot be determined precisely unless the location of either $γ$ or $b$ is fixed by design or by anchored items. Our simulation study suggests that introducing the new location parameter can induce modest bias when estimating the individual $γ$ and $b$ parameters. However, the bias had negligible impact on the recovery of the locations of the response categories (i.e., the sum of $γ$ and $b$ ) and did not alter the inference about $θ$ . In applied settings, this problem can be resolved by fixing the posterior probability of $θ$ using standalone items or by applying partial prior on $γ_{s}$ .
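The nonidentifiability between $γ$ and $b$ can be checked numerically. In the sketch below (function name and parameter values are hypothetical), adding a constant to $γ_{s}$ and subtracting the same constant from every category location of an item leaves the response probabilities unchanged, so the data alone cannot separate the two.

```python
import math

def ftm_probs(theta, a, b, gamma_s):
    """Category probabilities under FTM (Equation 4) for one item."""
    exponents, running = [], 0.0
    for b_l in b:
        running += a * theta - (gamma_s + b_l)
        exponents.append(running)
    m = max(exponents)  # stabilize the exponentials
    weights = [math.exp(e - m) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

b = [0.0, -0.4, 0.5, 1.2]  # hypothetical category locations
c = 0.4                    # arbitrary shift
original = ftm_probs(theta=0.3, a=1.1, b=b, gamma_s=0.25)
shifted = ftm_probs(theta=0.3, a=1.1, b=[b_l - c for b_l in b], gamma_s=0.25 + c)

# Largest discrepancy between the two parameterizations (zero up to rounding)
max_diff = max(abs(p - q) for p, q in zip(original, shifted))
```

Only the sum $γ_{s} + b_{jl}$ enters the likelihood, which is why an external constraint on the location of either $γ$ or $b$ is needed to pin down the individual parameters.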

Evaluating Testlet Effects

The models in the previous section can be used to test significance of testlet effect. In this section, we present three statistical tests that evaluate significance of testlet effect under the GPCM, RTM, and FTM. All tests are performed based on the maximum likelihood (ML) estimates of the model parameters.

Likelihood Ratio Test

The first test, the likelihood ratio (LR) test, compares the likelihoods of RTM and GPCM to determine significance of the random interaction effect. If the likelihood of RTM, which assumes nonzero random testlet effects, is significantly larger than the likelihood of GPCM, which assumes zero testlet effect, it will suggest that the items within the testlets have strong random interaction with test takers. The dissimilarity between the likelihoods is measured by the logarithm of the likelihood ratio. Let ${\hat{η}}_{rtm}$ and ${\hat{η}}_{gpcm}$ each denote the vector of structural parameters estimated under RTM and GPCM. A test statistic evaluating significance of random interaction is calculated as

LR = 2 (\log L ({\hat{η}}_{rtm}) - \log L ({\hat{η}}_{gpcm})),

(5)

where $L (\cdot)$ denotes the likelihood of the final model. In the null case of no significant random testlet effects, the LR statistic asymptotically follows a $χ^{2}$ distribution with degrees of freedom ( $df$ ) equal to the number of testlets.

While the LR test provides uniformly greatest statistical power, it requires fitting of two models. If the fitted outcomes differ in estimation precision, or if any of the models has convergence problems, the test results will become questionable. In addition, the test can evaluate testlet effects only at the test level. Rejection of the null hypothesis suggests that there exists at least one testlet with significant random interaction; it does not inform which particular testlets had significant effects.
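The LR computation in Equation 5 can be sketched as follows. The helper `chi2_sf` is a hand-rolled upper-tail probability for integer degrees of freedom, written so that the example needs only the standard library; in practice a library routine such as `scipy.stats.chi2.sf` would normally be used. The deviances are the averaged values that Table 1 reports for GPCM-generated data and are used purely for illustration, since a real test would use the log-likelihoods of a single pair of fitted models.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate (positive integer df)."""
    if x <= 0:
        return 1.0
    y = x / 2.0
    # Starting values: Q(1, y) = exp(-y) and Q(1/2, y) = erfc(sqrt(y))
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    # Recurrence Q(a + 1, y) = Q(a, y) + y^a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return q

def lr_test(loglik_alt, loglik_null, df):
    """LR statistic (Equation 5) and its asymptotic chi-square p value."""
    lr = 2.0 * (loglik_alt - loglik_null)
    return lr, chi2_sf(lr, df)

# Deviance = -2 log L, so log L = -deviance / 2 (values from Table 1)
loglik_rtm = -57450.8 / 2.0
loglik_gpcm = -57452.6 / 2.0
lr, p_value = lr_test(loglik_rtm, loglik_gpcm, df=4)  # four testlets
```

With these inputs the statistic is about 1.8 on four degrees of freedom, far from significance, which is what one would expect when the data contain no testlet effect.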

Score Test

The second statistical test, the Lagrange multiplier (LM) test (Aitchison & Silvey, 1958; Rao, 1948), makes use of only one model and can evaluate testlet effect for each item cluster. Although the test is not precisely designed for testlet effect evaluation, it can indicate local dependence within a testlet. In the present setting, the test is performed on a set of ML estimates of GPCM and examines if the score function at ${\hat{η}}_{gpcm}$ is sufficiently close to zero. If the ML estimates of GPCM adequately approximate the maximum of the likelihood, the score function evaluated at ${\hat{η}}_{gpcm}$ will be near zero. If the score function deviates substantially from zero, it will indicate that the likelihood function at ${\hat{η}}_{gpcm}$ is far from the peak and there may be violations of the assumptions of the model, including local independence of items.

The departure from the maximum likelihood is assessed by a quadratic form of the score function:

LM = S ({\hat{η}}_{gpcm})^{T} I^{- 1} ({\hat{η}}_{gpcm}) S ({\hat{η}}_{gpcm}),

(6)

where $S ({\hat{η}}_{gpcm})$ is the score function evaluated at ${\hat{η}}_{gpcm}$ ,

S ({\hat{η}}_{gpcm}) = \frac{\partial \log L (η; X)}{\partial η} |_{η = \hat{η}} = \nabla_{\hat{η}} \log L

with $X$ being the response matrix, and $I (\hat{η})$ is the information matrix evaluated at $\hat{η}$ ,

I (\hat{η}) = Var [S (\hat{η})] = E [S (\hat{η}) S {(\hat{η})}^{T} | η] = E [(\nabla_{\hat{η}} \log L) {(\nabla_{\hat{η}} \log L)}^{T} | η] .

The present study obtains the information matrix as the observed Fisher information matrix:

I (\hat{η}) \approx - H (\hat{η}) = - \nabla_{\hat{η}} \nabla_{\hat{η}}^{T} \log L = - \frac{\partial^{2} \log L (η; X)}{\partial η \, \partial η^{T}} |_{η = \hat{η}},

where $H (\hat{η})$ is the Hessian of $\log L (η; X)$ evaluated at $\hat{η}$ . Under the null hypothesis, the LM statistic asymptotically follows a $χ^{2}$ distribution with $df$ equal to the number of testlets. It is also useful to note that when the null hypothesis holds, the variance-covariance matrix of the item parameter estimates becomes a diagonal matrix. One can exploit this property to evaluate testlet effect separately for each cluster. In this case, the LM statistic is calculated for a set of testlet items and significance of the statistic is evaluated under the $χ^{2}$ distribution with one degree of freedom.

Again it is to be noted that the LM test evaluates overall violation of the model assumptions. Since the test is performed without a specific alternative hypothesis, rejection of the null hypothesis does not inform whether the model misfit is a consequence of testlet effects or any other violations of the model assumptions (e.g., inappropriate parameterization, presence of other trait dimension(s), differential item functioning, within-person dependency, local dependence between item pairs).

Wald Test

The third test, the Wald test, can determine significance of testlet effects explicitly. The test is based on FTM and evaluates significance of the constant testlet effect. As with the LM test, the Wald test requires fitting of only one model—the alternative model that assumes nonzero testlet effect. Let $\hat{γ} = ({\hat{γ}}_{1}, \dots, {\hat{γ}}_{S})^{T}$ denote the vector of estimated testlet parameters that model constant effects of testlets. A statistic evaluating significance of the fixed testlet effects is calculated as:

W = {\hat{γ}}^{T} \cdot I (\hat{γ}) \cdot \hat{γ},

where $I (\hat{γ})$ is the Fisher information matrix evaluated at $\hat{γ}$ . The other parameters are fixed at the corresponding estimated values. Under the null hypothesis of no significant fixed testlet effects, $W$ has an asymptotic $χ^{2}$ distribution with $df$ equal to the number of testlets.

As with the LM test, the Wald test can be applied to individual testlets. Under the null hypothesis, items from different testlets become conditionally independent once the trait level and testlet effects are accounted for. One can capitalize on this result to evaluate significance of testlet effect separately for each item cluster. For example, the Wald statistic for a testlet $s$ is calculated as $W = \frac{{\hat{γ}}_{s}^{2}}{Var ({\hat{γ}}_{s})},$ and its significance can be evaluated under the $χ^{2}$ distribution with $df = 1$ . One can alternatively evaluate significance of $\sqrt{W}$ in the standard normal distribution if standard error of the estimate is readily accessible.
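The testlet-level Wald test reduces to simple arithmetic once an estimate and its standard error are available. The sketch below uses a hypothetical estimate ($\hat{γ}_{s} = 0.30$ with a standard error of 0.12); the function name is likewise an assumption.

```python
import math

def wald_test_single(gamma_hat, se):
    """Wald test for one fixed testlet effect.

    Returns W = gamma_hat^2 / Var(gamma_hat), z = sqrt(W) with the sign of
    gamma_hat, and the p value from the chi-square distribution with df = 1.
    """
    z = gamma_hat / se
    w = z * z
    # For df = 1, P(chi-square > w) = erfc(sqrt(w / 2)), which equals the
    # two-sided standard normal p value of z
    p_value = math.erfc(math.sqrt(w / 2.0))
    return w, z, p_value

w, z, p_value = wald_test_single(gamma_hat=0.30, se=0.12)
```

For these hypothetical inputs $W = 6.25$ and $z = 2.5$, giving a p value of about .012; the chi-square and normal versions of the test yield the same p value by construction.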

Simulation Study

A Monte Carlo simulation was carried out to examine the performance of the polytomous testlet models. The study in particular gave close attention to the relative performance of the models when the models were fit to different response data with varying testlet effects. With the known properties of the response data, the study examined the advantages of fitting a correct model as well as the consequences of fitting an ill-posed model. The simulation design, evaluation criteria, and simulation outcomes are presented below.

Design

Model

The study used three models to generate response data, GPCM, FTM, and RTM, and four models to analyze the observed data, RTM, FTM, GPCM, and TPIM.

Linear Test

Tests were created in linear forms with the fixed length of four testlets. Each testlet included six polytomous items (i.e., test length = 24) with varying maximum scores. The maximum item score was randomly determined between two and six with a constraint on the maximum testlet score (to be between 20 and 30). This setup makes the testlets comparable in the response scale as well as in the amount of testlet information. Given the fixed maximum item score, item parameters were randomly sampled from the uniform and normal distributions, respectively, as $a \sim U (0.5, 2.0)$ and $b \sim N (0, 1)$ .
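One way to generate a test form that satisfies these constraints is sketched below. Several details are assumptions made for illustration: the maximum item scores are drawn uniformly from the integers 2 to 6, the testlet-score constraint is enforced by redrawing the whole testlet until it is met, and the first category location of each item is fixed at zero. The text does not state how the authors implemented these steps.

```python
import random

def draw_testlet_max_scores(rng, n_items=6, lo=2, hi=6, total_lo=20, total_hi=30):
    """Draw maximum item scores until the testlet maximum falls in range."""
    while True:
        maxima = [rng.randint(lo, hi) for _ in range(n_items)]
        if total_lo <= sum(maxima) <= total_hi:
            return maxima

def draw_item_parameters(rng, max_score):
    """Draw a ~ U(0.5, 2.0) and one b ~ N(0, 1) per scored category."""
    a = rng.uniform(0.5, 2.0)
    # b[0] = 0 is an assumed convention; it cancels in the response function
    b = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(max_score)]
    return a, b

rng = random.Random(2024)  # arbitrary seed for reproducibility
test_form = []
for s in range(4):  # four testlets of six items each
    maxima = draw_testlet_max_scores(rng)
    test_form.append([draw_item_parameters(rng, k) for k in maxima])
```

Each entry of `test_form` is a testlet holding six `(a, b)` pairs, where the length of `b` minus one is the maximum score of the item.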

Sample

Examinees’ trait parameters were drawn from the (multivariate) normal distribution with the fixed sample sizes, $N$ = 1,000 and 3,000. When testlets had no or fixed effects (i.e., GPCM, FTM), ability parameters were sampled from the standard normal distribution. When testlets had random interaction with test-takers (i.e., RTM), trait parameters were sampled from the independent multivariate normal distribution with zero means and covariance matrix, $N_{5} (0, diag (1, σ_{γ}^{2}, \dots, σ_{γ}^{2}))$ , where $σ_{γ}^{2}$ is fixed at a constant value.

Testlet Effect

Prior studies examining empirical binary-response data reported that real assessments tend to have small to moderate testlet effects (Boyd et al., 2013; Wainer et al., 2007). In this study, variance of testlet effects was systematically varied to model small, moderate, and large effects— $σ_{γ}^{2}$ = .25, .50, and 1.0. At each conditioned $σ_{γ}^{2}$ , the testlet parameters were simulated according to the data-generating model. When RTM was used as a generating model, latent testlet trait parameters were randomly sampled from the multivariate normal distribution along with the primary trait parameters. When FTM was used as a generating model, constant testlet parameters were randomly drawn from the normal distribution with a zero mean and the conditioned $σ_{γ}^{2}$ . GPCM does not assume any testlet effects and can be viewed as modeling zero testlet effect.
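The two kinds of testlet parameters differ in what is sampled: RTM draws one interaction value per examinee and testlet, whereas FTM draws one constant per testlet. A minimal sketch follows, with hypothetical function names and an arbitrary seed; because the covariance matrix is diagonal, the multivariate normal draw is implemented as independent univariate draws.

```python
import random

def draw_rtm_persons(rng, n_persons, n_testlets, var_gamma):
    """Draw (theta_i, [gamma_i1, ..., gamma_iS]) for each examinee.

    theta ~ N(0, 1) and gamma_is ~ N(0, var_gamma), all independent.
    """
    sd = var_gamma ** 0.5
    persons = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        gammas = [rng.gauss(0.0, sd) for _ in range(n_testlets)]
        persons.append((theta, gammas))
    return persons

def draw_ftm_effects(rng, n_testlets, var_gamma):
    """Draw one constant testlet effect gamma_s ~ N(0, var_gamma) per testlet."""
    sd = var_gamma ** 0.5
    return [rng.gauss(0.0, sd) for _ in range(n_testlets)]

rng = random.Random(7)  # arbitrary seed
rtm_persons = draw_rtm_persons(rng, n_persons=1000, n_testlets=4, var_gamma=0.50)
ftm_effects = draw_ftm_effects(rng, n_testlets=4, var_gamma=0.50)
```

Under RTM the sample holds 1,000 × 4 interaction values, while under FTM only four testlet constants are drawn and shared by all examinees.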

Model Estimation

Applying the parameters generated above, response data were simulated by rejection sampling, and the four scoring models were fit to each simulated data set. All models were estimated by maximum marginal likelihood estimation with the expectation-maximization algorithm (Dempster et al., 1977). Specifically, for fitting RTM, GPCM, and TPIM, we used an R package, mirt (Chalmers, 2012). For fitting FTM, GPCM, and TPIM, an independently developed estimation routine was applied (see Supplemental Appendix A, available online, for estimation details).¹ GPCM and TPIM were fit twice to examine the comparability between the estimation programs. The simulation results suggested that the two programs yielded approximately equivalent outcomes. This article reports the outcomes from the independently developed program for simplicity.

Trait Estimation

Examinees’ trait levels were estimated by three estimators: (a) ML, (b) maximum a posteriori (MAP), and (c) expected a posteriori (EAP). The trait inference in the unidimensional models was performed similarly to earlier studies (e.g., Donoghue, 1994; Penfield & Bergeron, 2005; Muraki, 1993). The ML and MAP estimation under the polytomous RTM was performed under the bifactor modeling framework (see Supplemental Appendix B, available online, for details). EAP estimates under RTM were obtained from the output of mirt that used stochastic nodes. Except for EAP in RTM, all estimation was performed in the independently developed estimation programs. The Bayesian estimation in RTM assumed $N (0, 1)$ and $N (0, 0.5^{2})$ initial priors for $θ$ and $γ$ , respectively.
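For the unidimensional case, EAP estimation can be sketched with a fixed quadrature grid. This is a simplified illustration under stated assumptions: it applies to GPCM only, uses equally spaced nodes with a standard normal prior, and treats the item parameters as known. It is not the authors' routine, and it differs from the stochastic-node approach used for RTM. All names and values are hypothetical.

```python
import math

def gpcm_probs(theta, a, b):
    """GPCM category probabilities; b = [b_0, ..., b_K] with b[0] = 0."""
    exponents, running = [], 0.0
    for b_l in b:
        running += a * theta - b_l
        exponents.append(running)
    m = max(exponents)
    weights = [math.exp(e - m) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def eap_estimate(scores, items, n_nodes=61, lo=-4.0, hi=4.0):
    """Posterior mean and SD of theta under a N(0, 1) prior."""
    nodes = [lo + (hi - lo) * q / (n_nodes - 1) for q in range(n_nodes)]
    log_post = []
    for t in nodes:
        lp = -0.5 * t * t  # log prior up to an additive constant
        for x, (a, b) in zip(scores, items):
            lp += math.log(gpcm_probs(t, a, b)[x])
        log_post.append(lp)
    m = max(log_post)
    w = [math.exp(v - m) for v in log_post]
    total = sum(w)
    mean = sum(t * wi for t, wi in zip(nodes, w)) / total
    var = sum((t - mean) ** 2 * wi for t, wi in zip(nodes, w)) / total
    return mean, math.sqrt(var)

# Hypothetical three-item test, each item scored 0 to 2
items = [(1.0, [0.0, -0.5, 0.5]), (1.5, [0.0, 0.0, 1.0]), (0.8, [0.0, -1.0, 0.2])]
eap_low = eap_estimate([0, 0, 0], items)
eap_high = eap_estimate([2, 2, 2], items)
```

The posterior standard deviation returned alongside the estimate plays the role of the SE for the EAP estimator.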

Replication

All simulation conditions were replicated 100 times and the results were summarized by the average of the observed statistics.

Evaluation

The outcomes of the simulation were evaluated in five aspects: (a) relative model fit, (b) absolute item fit, (c) testlet effects, (d) parameter recovery, and (e) classification accuracy.

Goodness of Fit

The model goodness-of-fit was assessed by relative model fit and absolute item fit statistics. The examined model fit measures include: Deviance (i.e., −2 log-likelihood), Akaike information criterion (AIC; Akaike, 1973, 1987), consistent AIC (CAIC; Bozdogan, 1987), Bayesian information criterion (BIC; Schwarz, 1978), and adjusted BIC (ABIC; Sclove, 1987). The absolute item fit was assessed by residual-based measures, including $X^{2}$ (Bock, 1972; Chen & Thissen, 1997; Yen, 1981), $Q_{3}$ (Yen, 1984), mean of absolute deviance of residual covariance (MADRC; McDonald & Mok, 1995), and standardized root mean square residual (SRMSR; Bentler, 1995; Jöreskog & Sörbom, 1988; Maydeu-Olivares, 2013). The measures were chosen so that they are easily accessible in the existing software (e.g., Chalmers, 2012). Our analysis confirmed the compatibility in the fit statistics between mirt and the locally crafted program.
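The relative fit indices are all penalized versions of the deviance. The sketch below uses the standard textbook definitions, which is an assumption in the sense that the text does not print the formulas. The deviance and sample size are taken from the GPCM row of Table 1 and the simulation design, while the parameter count of 122 is a hypothetical value chosen for illustration.

```python
import math

def information_criteria(deviance, n_params, n_obs):
    """Penalized-deviance fit indices (standard definitions assumed)."""
    log_n = math.log(n_obs)
    return {
        "AIC": deviance + 2 * n_params,
        "CAIC": deviance + n_params * (log_n + 1),
        "BIC": deviance + n_params * log_n,
        # Sample-size adjusted BIC replaces N with (N + 2) / 24
        "ABIC": deviance + n_params * math.log((n_obs + 2) / 24),
    }

ic = information_criteria(deviance=57452.6, n_params=122, n_obs=1000)
```

Under these definitions CAIC exceeds BIC by exactly the number of free parameters, so models with extra testlet parameters (RTM, FTM) pay a slightly larger penalty than GPCM for the same deviance.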

Significance of Testlet Effect

To evaluate type and size of testlet effect, we conducted statistical tests based on the fitted model outcomes. We mention that each statistical test outlined in Section 3 serves a different purpose and carries different implications. The current study uses significance test results only to demonstrate the patterns across the different models and different response data.

Estimation Accuracy

The accuracy of the parameter estimates was evaluated by absolute bias (ABias), root mean square error (RMSE), product-moment correlation (Cor), and standard error (SE). When examining the item parameter estimates from the ill-fitted models, estimation accuracy was evaluated based on SE only. The precision of person parameter estimates was similarly evaluated applying the absolute bias, RMSE, correlation, and SE. In addition to the trait recovery, we also examined classification outcomes when the examinees were categorized into two groups based on their estimated trait parameters. The evaluated criteria include false positive rate (Type I error), false negative rate (Type II error), sensitivity, and specificity.
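The recovery and classification criteria can be computed as in the following sketch. Two points are assumptions: ABias is taken to be the mean absolute difference between estimates and true values (the text does not give the formula), and examinees are classified as positive when their trait value is at or above a hypothetical cut score of zero. The example values are made up for illustration.

```python
import math

def recovery_stats(true, est):
    """Mean absolute difference (ABias), RMSE, and Pearson correlation."""
    n = len(true)
    diffs = [e - t for t, e in zip(true, est)]
    abias = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    mt, me = sum(true) / n, sum(est) / n
    cov = sum((t - mt) * (e - me) for t, e in zip(true, est))
    cor = cov / math.sqrt(sum((t - mt) ** 2 for t in true)
                          * sum((e - me) ** 2 for e in est))
    return abias, rmse, cor

def classification_stats(true, est, cut=0.0):
    """False positive/negative rates, sensitivity, and specificity."""
    tp = fp = tn = fn = 0
    for t, e in zip(true, est):
        if t >= cut:
            if e >= cut:
                tp += 1
            else:
                fn += 1
        else:
            if e >= cut:
                fp += 1
            else:
                tn += 1
    fpr = fp / (fp + tn)
    fnr = fn / (fn + tp)
    return {"FPR": fpr, "FNR": fnr, "sensitivity": 1 - fnr, "specificity": 1 - fpr}

true_theta = [-1.0, -0.2, 0.1, 0.8, 1.5]
est_theta = [-0.8, 0.1, -0.1, 0.9, 1.2]
abias, rmse, cor = recovery_stats(true_theta, est_theta)
rates = classification_stats(true_theta, est_theta, cut=0.0)
```

In this toy example one of the two true negatives is misclassified as positive and one of the three true positives as negative, so the false positive rate is 1/2 and the false negative rate is 1/3.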

Results

The simulation results for each aspect are presented below. In each evaluation, we discuss the relative performance of the four scoring models and the impact of the design variables (e.g., data-generating model, $σ_{γ}^{2}$ ). Where appropriate, significance test results from the analysis of variance are presented to inform the significance of the differences in performance. Space limitations preclude a detailed presentation of all comparison results. We therefore focus on the results from $σ_{γ}^{2}$ = .0, .50, 1.0 under $N$ = 1,000 in this document, and present complete results in Supplemental Appendix C (available online). We note that the results from $σ_{γ}^{2}$ = .25 and/or $N$ = 3,000 differed only in the accuracy level (i.e., more accurate in general) and showed very consistent patterns with those presented in this document.

Relative Model Fit

Table 1 reports relative model fit statistics of the four models evaluated. Observe that TPIM yielded consistently larger log-likelihoods (i.e., smaller deviances). This is because polytomous scoring of testlet items led to fewer item response category functions and multiplying fewer probability functions led to larger likelihoods. Due to the systematic difference in the likelihoods, we left out TPIM from the model comparison and used the results for reference only. The patterns concerning the other models varied depending on the response data. When data were drawn from GPCM or FTM, GPCM and RTM showed the best fit. GPCM yielded the smallest CAIC and BIC; RTM produced the smallest deviance statistics. The observed fit statistics were, however, overall comparable across GPCM, RTM, and FTM, and no particular model significantly outperformed the others (p > .828). When data were derived from RTM, RTM performed notably better (p < .001). It showed the smallest deviance, CAIC, and BIC statistics across all nonzero $σ_{γ}^{2}$ conditions. Under the same settings, GPCM and FTM yielded larger fit statistics and performed similarly (p > .868). The current results suggest that the relative model fit measures can adequately identify the existence of random testlet effects while they are generally insensitive to fixed testlet effects.

Table 1.

Relative Model Fit.

	Generating model
	GPCM			FTM			RTM
Fit	Dev	CAIC	BIC	Dev	CAIC	BIC	Dev	CAIC	BIC
	$σ_{γ}^{2} = .0$			$σ_{γ}^{2} = .50$			$σ_{γ}^{2} = .50$
RTM	57450.8	58449.7	58323.4	55027.9	56024.7	55898.6	58550.0	59547.9	59421.7
FTM	57452.8	58451.7	58325.4	55029.7	56026.5	55900.5	60876.6	61874.5	61748.3
GPCM	57452.6	58419.8	58297.5	55029.7	55994.8	55872.7	60876.4	61842.7	61720.5
TPIM	20473.4	21280.1	21178.1	19972.8	20766.6	20666.2	23456.6	24264.5	24162.3
				$σ_{γ}^{2} = 1.0$			$σ_{γ}^{2} = 1.0$
RTM				53311.7	54309.3	54183.2	57854.6	58854.2	58727.8
FTM				53313.7	54311.4	54185.2	62628.4	63628.0	63501.6
GPCM				53314.1	54280.1	54158.0	62628.2	63596.1	63473.7
TPIM				19578.0	20365.6	20266.0	24347.2	25156.9	25054.5

Note. The deviance of TPIM is systematically smaller than that of the other models (i.e., its log-likelihood is larger) because of the smaller number of response category functions. Fit = fitted model; Dev = deviance; BIC = Bayesian information criterion; CAIC = consistent Akaike information criterion; $σ_{γ}^{2}$ = variance of random or fixed testlet effects; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.

In Table 1, a consistent pattern was observed with respect to the testlet effect. An increase in $σ_{γ}^{2}$ led to smaller fit statistics in the FTM-generated data while it entailed larger fit statistics in the RTM-derived data. We surmise that the improved fit in the FTM data was due to greater stability in the parameter estimation. Increased variation in the fixed testlet effects can make the response categories more distinct and yield more evenly distributed response categories. The increased frequency in the extreme score categories can, in consequence, lead to more stable parameter estimation and correspondingly higher likelihoods. The pattern in the random testlet effect, on the contrary, can be attributed to increased uncertainty in the response data, which typically degrades goodness-of-fit.

Absolute Item Fit

Table 2 presents the item-level fit statistics. The three models with standard scoring (i.e., GPCM, FTM, RTM) performed consistently well overall. The models produced reasonably small deviance statistics, and the proportions of ill-fitting items mostly stayed below the nominal level. The three models performed comparably when data were generated from GPCM or FTM ($p$ > .504). When response data followed RTM, GPCM and FTM outperformed RTM despite the distinct random testlet effects. In fact, RTM was the most susceptible to nonzero values of $σ_{γ}^{2}$ and reproduced the observed response scores least accurately. The present results suggest a somewhat different conclusion from Table 1. Although RTM was preferred by the relative fit measures in the RTM data, it became less favorable when assessed by the absolute item fit measures. This outcome demonstrates the need for caution when selecting a model. As this example shows, the best-fitting model does not necessarily demonstrate the best measurement qualities. In applied settings, an operational scoring model should be chosen on the basis of comprehensive evidence rather than solely on the relative model fit statistics.

Table 2.

Absolute Item Fit.

	Generating model
	GPCM					FTM					RTM
	$X^{2}$		$a Q_{3}$		SRMSR	$X^{2}$		$a Q_{3}$		SRMSR	$X^{2}$		$a Q_{3}$		SRMSR
Fit	Avg	% Sig	Avg	% Sig		Avg	% Sig	Avg	% Sig		Avg	% Sig	Avg	% Sig	SRMSR
	$σ_{γ}^{2} = .0$					$σ_{γ}^{2} = .50$					$σ_{γ}^{2} = .50$
RTM	16.0	.001	.031	.003	.016	15.9	.003	.031	.004	.016	71.4	.695	.107	.362	.091
FTM	16.0	.001	.031	.003	.016	15.9	.003	.031	.004	.016	23.8	.069	.105	.327	.069
GPCM	16.0	.001	.031	.003	.016	15.9	.003	.031	.004	.016	23.8	.069	.105	.327	.069
TPIM	509.7	.075	.054	.246	.003	580.5	.070	.055	.270	.003	602.2	.073	.057	.257	.010
						$σ_{γ}^{2} = 1.0$					$σ_{γ}^{2} = 1.0$
RTM						15.8	.004	.031	.005	.016	198.6	1.000	.159	.541	.143
FTM						15.8	.004	.031	.005	.016	35.2	.207	.157	.522	.111
GPCM						15.8	.003	.031	.005	.016	35.2	.207	.157	.522	.111
TPIM						478.1	.065	.065	.343	.003	621.8	.072	.044	.132	.033

Note. Significance of $X^{2}$ was evaluated against the $χ^{2}$ distribution. Significance of $a Q_{3}$ was evaluated with a $z$ test after Fisher’s $r$-to-$z$ transformation. Avg = average; Fit = fitted model; % Sig = proportion of item pairs with $p < .05$; SRMSR = standardized root mean square residual; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.
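
The $z$ test mentioned in the note to Table 2 can be illustrated with a minimal Python sketch. It assumes the common form of the test, in which the transformed correlation is scaled by the square root of $N - 3$; the function name, the sample size, and the input correlation below are hypothetical and are not taken from the study.

```python
import math

def fisher_z_test(r, n):
    """Two-sided z test of H0: rho = 0 via Fisher's r-to-z transformation.

    r : residual correlation for an item pair (e.g., a Q3-type statistic)
    n : number of examinees (hypothetical here)
    """
    # atanh(r) is Fisher's transformation; its sampling SD is about 1/sqrt(n - 3).
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided p value under the standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p

# Illustrative call with assumed inputs.
z, p = fisher_z_test(0.10, 1000)
print(round(z, 3), round(p, 4))
```

An item pair would be flagged when the returned $p$ value falls below .05, which is how a "% Sig" column of this kind could be tallied.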

Table 2 also makes clear that TPIM performed least favorably in reproducing the observed data. The items calibrated under the TPIM framework (i.e., testlets) showed the largest deviance statistics and the highest proportions of misfit. Although TPIM produced small standardized root mean square residuals, this was largely due to the response scale of the input data: the more score categories there are, the more stable the product-moment correlations and thus the smaller the variation in the correlation coefficients. Notably, although both TPIM and RTM were devised to address random testlet effects, they tended to underperform the simpler models in handling the residuals or extra correlations.

Item Parameter Estimation

Table 3 evaluates the accuracy of the item parameter estimates under the generating models. The true item parameters were recovered with adequate accuracy: the estimates showed small bias and RMSEs and correlated strongly and positively with the generating parameters. The $b$ estimates from FTM contained somewhat larger estimation error. As alluded to above, this was expected because of the indeterminacy in the origin of the parameters. Table 3 suggests that decomposing the item location parameters can introduce additional error when estimating the individual $γ$ and $b$ parameters. Nevertheless, the overall location of the response categories (i.e., the sum of $γ$ and $b$) was identified with reasonable accuracy.
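
The location indeterminacy can be illustrated with a short Python sketch. It assumes a GPCM-type category function in which each step location enters as $γ + b$ (the sign convention and the function name are assumptions made for illustration, not the authors' estimation code). Shifting $γ$ by a constant and every $b$ by the opposite amount leaves the category probabilities unchanged, which is why only the sum is identified.

```python
import math

def category_probs(theta, a, gamma, b):
    """Category probabilities for one item under an assumed FTM-style GPCM.

    theta : trait level
    a     : slope
    gamma : fixed testlet effect shared by items in the testlet
    b     : list of item step locations (one per score step)
    """
    # Cumulative logits: category 0 is the reference with value 0.
    cum = [0.0]
    for b_v in b:
        cum.append(cum[-1] + a * (theta - (gamma + b_v)))
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(cum)
    weights = [math.exp(c - m) for c in cum]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical parameter values; c is an arbitrary shift.
c = 0.25
p_original = category_probs(0.3, 1.2, 0.5, [-0.4, 0.6])
p_shifted = category_probs(0.3, 1.2, 0.5 + c, [-0.4 - c, 0.6 - c])
print(max(abs(x - y) for x, y in zip(p_original, p_shifted)))
```

The printed maximum difference is zero up to floating-point error, so the data cannot distinguish the two parameter sets without an additional constraint on the origin.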

Table 3.

Item Parameter Recovery.

	Generating model
	GPCM		FTM						RTM
	$σ_{γ}^{2}$ = .0		$σ_{γ}^{2}$ = .50			$σ_{γ}^{2}$ = 1.0			$σ_{γ}^{2}$ = .50			$σ_{γ}^{2}$ = 1.0
Par	$a$	$b$	$a$	$b$	$γ + b$	$a$	$b$	$γ + b$	$a$	$b$	$d$	$a$	$b$	$d$
AbsBias	.068	.112	.068	.342	.125	.070	.434	.133	.067	.110	.174	.070	.115	.192
RMSE	.089	.145	.087	.417	.169	.090	.526	.183	.085	.140	.225	.089	.145	.246
Cor	.982	.990	.983	.929	.990	.981	.902	.990	.984	.991	.993	.983	.991	.992
SE	.072	.130	.074	.141	.165	.075	.153	.180	.082	—	.214	.085	—	.221

Note. SE of $b$ in RTM was unavailable because mirt assumed intercept-based parameterization. Par = item parameter; $a$ = slope parameter; $b$ = location parameter; $d$ = intercept parameter. AbsBias = average of absolute bias; Cor = product-moment correlation; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.

Table 4 reports the average SEs of the item parameter estimates when the models were misspecified. The SEs from the correct generating models are repeated for comparison. On the whole, the models kept the errors adequately small despite the misspecification. Although the SEs of the location parameter estimates were somewhat larger, their overall size remained reasonably small. Note that the trends across the design variables in Table 4 are not directly comparable. Because each condition involved a unique item parameterization and a different testlet effect, the outcomes can only be evaluated within each cell in absolute terms.

Table 4.

Average Standard Errors of Item Parameter Estimates When the Models Were Misfit.

		Generating model
		GPCM	FTM		RTM
Fit	Par	$σ_{γ}^{2} = .0$	$σ_{γ}^{2} = .50$	$σ_{γ}^{2} = 1.0$	$σ_{γ}^{2} = .50$	$σ_{γ}^{2} = 1.0$
RTM	$a$	.083	.085	.087	.082	.085
FTM		.072	.074	.075	.059	.053
GPCM		.072	.074	.076	.059	.053
TPIM		.056	.057	.058	.019	.013
RTM	$d$	.215	.240	.267	.214	.221
FTM	$b$	.129	.141	.153	.122	.120
FTM	$γ + b$	.147	.165	.180	.137	.134
GPCM	$b$	.130	.142	.153	.123	.121
TPIM	$b$	.258	.281	.296	.237	.234

Note. Underlined statistics are average SEs from the generating model. Fit = fitted model; Par = item parameter; $a$ = slope parameter; $d$ = intercept parameter; $b$ = location parameter; $γ$ = fixed testlet-effect parameter; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.

Testlet Effect

The outcomes of the significance tests are reported without tabulation. The likelihood ratio test achieved a perfect detection rate whenever $σ_{γ}^{2}$ was nonzero. When there was no random testlet effect, it flagged a testlet effect at an average rate of 1.83%. The score test was generally insensitive to the testlet effects. When GPCM was fit to RTM- or FTM-generated data, the estimated slope and location parameters were quite close to the generating parameters. This means that, in spite of the distinct testlet effects, the outcomes of GPCM closely approximated the probable outcomes of RTM and FTM, resulting in nonsignificant score statistics. We note that, although the score test was insensitive to the testlet effects, it performed well in identifying other violations (e.g., one- vs. two-parameter models). The performance of the Wald test varied with the variance estimator. The current study examined three estimators: (a) Hessian, (b) outer-product-of-gradients (OPG), and (c) sandwich. The simulation results suggest that the Hessian estimator is the most sensitive to testlet shift (100% and 87.50% correct flagging rates at the test and testlet levels, respectively), followed by the sandwich estimator (99.67% [test], 76.67% [testlet]) and the OPG estimator (22.00% [test], 12.08% [testlet]).

Person Parameter Estimation

Table 5 summarizes the trait inference results. Overall, the true trait levels were adequately estimated across the different modeling scenarios. While the performance of the scoring models varied with the response data and the trait estimator, the overall comparison suggests that GPCM achieved the highest accuracy, followed by FTM, TPIM, and RTM. Concretely, when data were drawn from GPCM or FTM, GPCM led to the most desirable outcomes. Its error statistics (i.e., ABias, RMSE) were substantially smaller than those from RTM and TPIM (p < .001) and marginally smaller than those from FTM (p = .741). When response data followed RTM, either GPCM or RTM delivered the best performance, depending on the estimator. In particular, when the ML estimator was used, GPCM and FTM showed substantially smaller error statistics and higher correlation coefficients. When the Bayesian estimators were applied, RTM produced moderately better outcomes. We note that the performance of GPCM and FTM in the ML inference was significantly better than that of RTM (p < .001), whereas the advantage of RTM in the Bayesian inference was marginal (p > .140 compared with GPCM, p > .051 compared with FTM). On the whole, the present results suggest that, when there is little to no random testlet effect, GPCM and FTM are more serviceable as an operational scoring model. When testlets have moderate to large random interaction effects, scoring under RTM with the Bayesian estimators appears more desirable. The detailed relative performance, however, depended strongly on the trait estimator.

Table 5.

Trait Recovery.

		Generating model
		GPCM				FTM				RTM
Est	Fit	ABias	RMSE	Cor	SE	ABias	RMSE	Cor	SE	ABias	RMSE	Cor	SE
ML		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .5$				$σ_{γ}^{2} = .5$
	RTM	.300	.388	.930	.542	.321	.411	.925	.578	.600	.786	.813	.742
	FTM	.172	.236	.976	.207	.173	.236	.976	.210	.291	.384	.936	.240
	GPCM	.169	.233	.976	.206	.172	.236	.976	.209	.290	.384	.936	.240
	TPIM	.172	.230	.976	.208	.176	.234	.975	.212	.350	.484	.921	.422
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.343	.439	.915	.585	.624	.820	.803	.780
	FTM					.174	.237	.976	.211	.368	.482	.900	.261
	GPCM					.173	.235	.975	.210	.367	.482	.900	.261
	TPIM					.177	.234	.975	.213	.441	.605	.878	.528
MAP		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .5$				$σ_{γ}^{2} = .5$
	RTM	.162	.214	.979	.522	.166	.218	.978	.559	.254	.323	.948	.658
	FTM	.161	.211	.979	.196	.162	.212	.979	.200	.267	.338	.942	.225
	GPCM	.159	.209	.979	.196	.162	.211	.979	.200	.266	.337	.942	.226
	TPIM	.166	.219	.977	.201	.170	.223	.976	.205	.286	.363	.934	.373
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.166	.216	.978	.558	.299	.379	.929	.684
	FTM					.164	.215	.978	.202	.334	.421	.909	.243
	GPCM					.163	.213	.978	.201	.333	.419	.909	.243
	TPIM					.172	.224	.976	.206	.358	.452	.898	.447
EAP		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .5$				$σ_{γ}^{2} = .5$
	RTM	.158	.207	.979	.195	.161	.210	.979	.198	.251	.318	.949	.306
	FTM	.160	.210	.979	.194	.161	.210	.979	.197	.268	.339	.942	.221
	GPCM	.158	.207	.979	.194	.161	.210	.979	.197	.267	.338	.942	.221
	TPIM	.166	.217	.977	.198	.170	.221	.976	.202	.283	.358	.934	.354
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.163	.212	.978	.200	.297	.376	.928	.356
	FTM					.164	.213	.978	.199	.336	.423	.909	.238
	GPCM					.163	.212	.978	.198	.335	.422	.909	.238
	TPIM					.171	.223	.976	.203	.352	.444	.898	.416

Note. Est = trait estimator; Fit = fitted model; ABias = average absolute bias; Cor = correlation; RMSE = root mean square error; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.

In Table 5, all design factors had a significant impact on trait recovery. Among the generating models, GPCM- and FTM-generated data entailed the smallest error, while RTM data induced relatively larger error. The impact of $σ_{γ}^{2}$ was consistent with expectations: the larger $σ_{γ}^{2}$, the larger the estimation error. We note that the small standard errors of the alternative scoring models under the RTM data do not necessarily indicate overestimation of measurement precision. Despite the concern raised in the literature (Marais & Andrich, 2008; Sireci et al., 1991; Wainer & Thissen, 1996), the current simulation study did not indicate any distinct pattern of overestimated reliability.² The large standard errors of the RTM trait estimates were largely due to the greater uncertainty that originates from the additional testlet dimensions.
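
For readers who wish to reproduce recovery statistics of the kind reported in Table 5, the following Python sketch computes an average absolute bias, an RMSE, and a product-moment correlation from paired true and estimated trait levels. It is a minimal sketch under the assumption that each statistic is aggregated over examinees within one replication; the function name and the example values are hypothetical.

```python
import math

def recovery_stats(true_theta, est_theta):
    """Return (ABias, RMSE, Cor) for paired true and estimated trait levels."""
    n = len(true_theta)
    errors = [e - t for t, e in zip(true_theta, est_theta)]
    # Average absolute deviation of the estimates from the true values.
    abias = sum(abs(d) for d in errors) / n
    # Root mean square error.
    rmse = math.sqrt(sum(d * d for d in errors) / n)
    # Product-moment correlation between true and estimated values.
    mean_t = sum(true_theta) / n
    mean_e = sum(est_theta) / n
    cov = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_theta, est_theta))
    ss_t = sum((t - mean_t) ** 2 for t in true_theta)
    ss_e = sum((e - mean_e) ** 2 for e in est_theta)
    cor = cov / math.sqrt(ss_t * ss_e)
    return abias, rmse, cor

# Hypothetical true and estimated trait levels for three examinees.
abias, rmse, cor = recovery_stats([-1.0, 0.0, 1.0], [-0.5, 0.0, 1.5])
print(round(abias, 3), round(rmse, 3), round(cor, 3))
```

In a simulation study these quantities would typically be averaged again over replications before being tabulated.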

Examinee Classification

Table 6 reports the classification outcomes when the examinees were classified into two groups based on the estimated trait levels (cutoff = 0). The scoring models kept false classification rates adequately low (.039 on average) and correct classification rates reasonably high (.922 on average). On the whole, GPCM and FTM performed significantly better than TPIM and RTM. The difference between GPCM and FTM was marginal (p > .757), whereas the differences between FTM and TPIM and between TPIM and RTM were significant (p < .001). As with trait recovery, Table 6 suggests a strong interaction between the models and the trait estimator. When the response data were generated from GPCM or FTM, using the corresponding generating model led to the best classification outcomes. Across the evaluation criteria, GPCM, FTM, and TPIM tended to give parallel outcomes (p > .088), while RTM performed substantially worse. When data displayed random interaction effects, the scoring models showed substantially different patterns depending on the trait estimator. When the ML estimator was used, GPCM and FTM consistently outperformed TPIM (p < .001) and RTM (p < .001). Despite the distinct random testlet effects, RTM yielded subpar classification outcomes throughout. When the Bayesian estimators were applied, RTM yielded the most accurate classification results. In particular, when $σ_{γ}^{2}$ = 1.0, RTM significantly outperformed its counterparts. As $σ_{γ}^{2}$ decreased, the advantage became marginal and the classification results were comparable to those of GPCM and FTM (p > .010 when $σ_{γ}^{2}$ = 0.5, and p > .036 when $σ_{γ}^{2}$ = 0.25).

Table 6.

Classification Accuracy.

		Generating model
		GPCM				FTM				RTM
Est	Fit	FP	FN	Sen	Spc	FP	FN	Sen	Spc	FP	FN	Sen	Spc
ML		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .50$				$σ_{γ}^{2} = .50$
	RTM	.047	.045	.910	.906	.052	.051	.898	.895	.087	.089	.823	.827
	FTM	.025	.023	.954	.951	.025	.024	.952	.949	.047	.047	.907	.907
	GPCM	.024	.022	.955	.952	.025	.025	.951	.950	.047	.046	.908	.907
	TPIM	.025	.024	.953	.950	.026	.026	.948	.948	.050	.050	.900	.900
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.058	.051	.897	.884	.090	.092	.817	.820
	FTM					.027	.025	.950	.947	.061	.062	.876	.878
	GPCM					.027	.024	.952	.946	.061	.061	.877	.878
	TPIM					.028	.026	.948	.944	.065	.066	.868	.870
MAP		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .50$				$σ_{γ}^{2} = .50$
	RTM	.025	.022	.956	.951	.026	.024	.953	.947	.044	.044	.912	.913
	FTM	.025	.023	.954	.951	.025	.024	.952	.949	.047	.047	.907	.907
	GPCM	.024	.022	.955	.952	.025	.025	.951	.950	.047	.046	.908	.907
	TPIM	.025	.024	.953	.950	.026	.026	.948	.948	.050	.050	.900	.900
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.029	.023	.954	.942	.053	.053	.895	.894
	FTM					.027	.025	.950	.947	.061	.062	.876	.878
	GPCM					.027	.024	.952	.946	.061	.061	.877	.878
	TPIM					.028	.026	.948	.944	.065	.066	.868	.870
EAP		$σ_{γ}^{2} = .0$				$σ_{γ}^{2} = .50$				$σ_{γ}^{2} = .50$
	RTM	.025	.022	.956	.951	.026	.024	.953	.949	.044	.043	.914	.912
	FTM	.024	.023	.954	.952	.025	.024	.952	.950	.047	.047	.907	.907
	GPCM	.024	.022	.955	.952	.025	.025	.951	.950	.047	.046	.908	.907
	TPIM	.025	.024	.953	.950	.026	.026	.948	.948	.051	.050	.901	.899
						$σ_{γ}^{2} = 1.0$				$σ_{γ}^{2} = 1.0$
	RTM					.028	.023	.953	.945	.054	.053	.894	.893
	FTM					.026	.025	.950	.948	.061	.062	.876	.878
	GPCM					.027	.024	.952	.946	.062	.061	.878	.877
	TPIM					.028	.026	.948	.944	.065	.066	.869	.870

Note. Est = trait estimator; Fit = fitted model; FP = false positive rate; FN = false negative rate; Sen = sensitivity; Spc = specificity; ML = maximum likelihood; MAP = maximum a posteriori; EAP = expected a posteriori; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.

The trends relating to the estimator and the testlet effect size were consistent with those reported earlier. The MAP and EAP estimators showed smaller classification errors (.036 on average) and higher success rates (.927) than the ML estimator (.044 and .911, respectively; largely attributable to RTM). An increase in the testlet effect size reduced classification accuracy. Across the increasing $σ_{γ}^{2}$ values (.0, .5, 1.0), the average false classification rates rose (.026, .039, .046) and the average correct classification rates fell (.949, .923, .908).
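
The classification statistics in Table 6 can be sketched in Python as follows. The sketch assumes that examinees at or above the cutoff form the positive group and that FP and FN are expressed as proportions of the whole sample, a convention that appears consistent with the tabled values (e.g., FN = .045 alongside Sen = .910 in a roughly balanced sample). The function name, the tie rule at the cutoff, and the example values are hypothetical.

```python
def classification_rates(true_theta, est_theta, cutoff=0.0):
    """Return FP, FN (proportions of all examinees), sensitivity, and specificity."""
    n = len(true_theta)
    tp = fp = tn = fn = 0
    for t, e in zip(true_theta, est_theta):
        truly_positive = t >= cutoff      # assumed tie rule: at cutoff counts as positive
        flagged_positive = e >= cutoff
        if truly_positive and flagged_positive:
            tp += 1
        elif truly_positive:
            fn += 1
        elif flagged_positive:
            fp += 1
        else:
            tn += 1
    # Guard against empty groups so the sketch never divides by zero.
    sen = tp / (tp + fn) if (tp + fn) else float("nan")
    spc = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"FP": fp / n, "FN": fn / n, "Sen": sen, "Spc": spc}

# Hypothetical true and estimated trait levels for four examinees.
rates = classification_rates([-1.0, -0.5, 0.5, 1.0], [-0.8, 0.2, 0.4, -0.1])
print(rates)
```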

Empirical Data Analysis

In this section, we present an example analysis of empirical data to demonstrate the application of the models in real test settings. The data were obtained from a large-scale standardized licensure assessment program in the United States. We analyzed five test forms that were administered as part of the field-testing of technology-enhanced innovative items. The analysis paralleled that of the previous section: we fit the four measurement models and examined the fitted outcomes. The results were consistent overall across the test forms. This section presents the results for one test form as an illustrative example.

Data

The data set contained 1,316 examinees’ responses to 12 innovative items. The items were administered in two testlets of six items each. The testlets had total scores of 16 and 24, respectively, and the items had maximum scores between two and six. We note that the data were collected as part of voluntary field-testing, and the examinee sample was somewhat homogeneous in its ability distribution.

Analysis

The study applied the four polytomous testlet models to the response data. As in the simulation study, we used mirt to fit RTM, GPCM, and TPIM, and the independently developed estimation program to fit FTM, GPCM, and TPIM. The outcomes of the two calibration programs were almost identical for GPCM and TPIM. For simplicity, we report the results from the independently developed program only. The results were evaluated in three aspects: (a) the relative and absolute goodness of fit, (b) the type and size of testlet effect, and (c) parameter agreement between the scoring models. The specific evaluation criteria followed those reported in Section 4.

Results

Goodness of Fit

Table 7 reports the model and item fit statistics. TPIM again showed distinctly smaller model fit statistics as a result of its fewer response category functions. Among the three models that applied regular scoring, RTM was preferred by deviance and ABIC, and GPCM was preferred by CAIC. The preference for GPCM under CAIC was chiefly due to its smaller number of free parameters. Table 7 also shows that the item-level fit statistics similarly pointed to RTM as the best-fitting model. RTM yielded the smallest deviance statistics and the lowest proportions of ill-fitting item pairs. GPCM and FTM produced somewhat larger statistics, but the values were generally comparable to those of RTM.

Table 7.

Model Fit Evaluation in Real Data.

			Model fit			Item fit
				$X^{2}$		$Q_{3}$		$a Q_{3}$		MADRC	SRMSR
	Dev	CAIC	ABIC	Avg	%Sig	Avg	%Sig	Avg	%Sig	MADRC	SRMSR
RTM	38978.5	39420.3	39194.8	11.3	0	.050	.167	.035	0	.041	.027
FTM	38990.0	39431.8	39206.3	11.5	0	.052	.182	.035	.015	.042	.029
GPCM	38990.0	39415.4	39198.3	11.5	0	.052	.182	.035	.015	.042	.029
TPIM	12831.8	13069.1	12947.9	177.9	0	.294	1	0	0	1.321	.002

Note. All average statistics were obtained as averages of absolute statistics. The proportions of ill-fitting items under TPIM equaled zero because the test contained only two testlets. Dev = deviance; CAIC = consistent Akaike information criterion; ABIC = sample-adjusted Bayesian information criterion; MADRC = mean absolute deviance in residual covariance; SRMSR = standardized root mean square residual; Avg = average; % Sig = proportion of item pairs with $p < .05$; RTM = random-effect testlet model; FTM = fixed-effect testlet model; GPCM = generalized partial credit model; TPIM = testlet-as-a-polytomous-item model.
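
The way the penalty terms drive the preference between RTM and GPCM in Table 7 can be checked with a short Python sketch using the standard definitions CAIC = Dev + k(ln N + 1) and ABIC = Dev + k ln((N + 2)/24). The sample size N = 1,316 comes from the data description. The free-parameter counts (54 for RTM and FTM, 52 for GPCM) are not stated in this section; they are assumed here because they reproduce the tabled values to rounding.

```python
import math

def caic(deviance, k, n):
    """Consistent AIC: deviance plus k * (ln n + 1)."""
    return deviance + k * (math.log(n) + 1.0)

def abic(deviance, k, n):
    """Sample-size-adjusted BIC: deviance plus k * ln((n + 2) / 24)."""
    return deviance + k * math.log((n + 2.0) / 24.0)

N = 1316  # examinees in the empirical data set
# (deviance from Table 7, assumed number of free parameters)
models = {"RTM": (38978.5, 54), "FTM": (38990.0, 54), "GPCM": (38990.0, 52)}

for name, (dev, k) in models.items():
    print(name, round(caic(dev, k, N), 1), round(abic(dev, k, N), 1))
```

Because CAIC charges roughly 8.2 points per parameter at this sample size while ABIC charges about 4.0, two extra parameters outweigh an 11.5-point deviance gain under CAIC but not under ABIC.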

Testlet Effect

Table 8 reports the outcomes of the significance tests for testlet effects. The likelihood ratio test suggests that RTM achieved significantly better fit than GPCM, indicating significant random interaction effects (p = .012). Note that this test can evaluate testlet effects only at the test level, so it remains unclear which testlet had a significant interaction. The score test led to a somewhat different conclusion: neither testlet showed a significant violation of the assumptions of GPCM. It seems that, despite the better fit of RTM, the score function evaluated at the ML estimates of GPCM did not depart significantly from zero. The Wald test similarly indicated that the testlets had no significant fixed effects. Although the second testlet showed a $p$ value close to the nominal level, this was largely due to the small standard error of $\hat{γ}$.

Table 8.

Testlet Effect Evaluation in Real Data.

	Score test			Wald test
	Stat	$df$	$p$	$\hat{γ}$	SE( $\hat{γ}$ )	$χ^{2}$	$df$	$p$
Test	1.483	2	.476	—	—	4.367	2	.113
Testlet1	1.483	1	.223	.016	.017	.901	1	.343
Testlet2	.000	1	.991	.031	.016	3.466	1	.063

Note. Likelihood ratio test: log-likelihoods = −19493.7 (GPCM), −19489.2 (RTM). Test statistic = 8.849. $df$ = 2, $p$ = .012. GPCM = generalized partial credit model; RTM = random-effect testlet model.
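
The $p$ values in Table 8 and its note can be recovered from the reported statistics with a small Python sketch. It relies on the closed forms of the chi-square upper-tail probability for one and two degrees of freedom, so no external library is needed; the function name is hypothetical, and the statistics are taken as reported rather than recomputed from the rounded log-likelihoods or standard errors.

```python
import math

def chi2_sf(stat, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        # A chi-square(1) variable is a squared standard normal.
        return math.erfc(math.sqrt(stat / 2.0))
    if df == 2:
        # A chi-square(2) variable is exponential with mean 2.
        return math.exp(-stat / 2.0)
    raise ValueError("this sketch covers df = 1 and df = 2 only")

# Statistics as reported in Table 8 and its note.
print("LRT    ", round(chi2_sf(8.849, 2), 3))
print("Wald   ", round(chi2_sf(4.367, 2), 3))
print("Wald T2", round(chi2_sf(3.466, 1), 3))
print("Score  ", round(chi2_sf(1.483, 2), 3))
```

With two testlet variances under test, the likelihood ratio statistic is referred to two degrees of freedom, which is what separates its conclusion from the per-testlet score and Wald tests.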

Item Parameter Estimates

Figure 1 compares item parameter estimates across the different fitting models. As can be noted, the three models showed close correspondence in the parameter values. The estimates from GPCM and FTM closely resembled those from RTM despite the significant random testlet effects. The comparable item parameter values in particular suggest that the three models are likely to perform similarly in trait inference.

Figure 1.

Correspondence of item parameter estimates

Person Parameter Estimates

To examine the patterns in trait inference, we compared the ability estimates from the four scoring models. Figure 2 plots the ability levels estimated by the three trait estimators. The agreement among the models varied with the estimator. When the ML estimator was used, RTM showed distinctly different patterns from the alternative scoring models, with correlations between .268 and .462 with the other models. The unidimensional models performed very similarly, with an average correlation of .981 in the same setting. We surmise that the distinct behavior of RTM is due to the increased uncertainty from the testlet factors as well as the sampling error entailed by the real data. In the simulated data, RTM performed comparably to the other models, showing average correlations of .901, .888, and .871 with the other models under the $σ_{γ}^{2}$ = .25, .50, and 1.0 conditions, respectively. The outcomes of the real data analysis suggest that, despite the comparable item parameter values, ML estimation under RTM can lead to substantially different conclusions about the examinees’ trait levels because of the added uncertainty. Figure 2 also shows that Bayesian trait estimation can help reduce the discrepancy between RTM and the alternative scoring models. When the ability levels were estimated via maximum a posteriori or expected a posteriori, the trait estimates from RTM correlated highly with those from the other models. Again, the three unidimensional models showed consistently high correlation (.974 on average), suggesting that they performed alike in trait inference.

Figure 2.

Correspondence in person parameter estimates.

Conclusion

The purpose of this study was to investigate practical scoring models for polytomous testlet items and evaluate their performance on various test response data. The study considered four models: GPCM, TPIM, RTM, and FTM. The response data were simulated from GPCM, FTM, and RTM, which assume no, fixed, and random testlet effects, respectively. The study also used empirical assessment data to examine performance in real test settings. The behavior of the models was evaluated in five aspects: relative model fit, absolute item fit, significance of testlet effects, item and trait parameter recovery, and classification accuracy.

The empirical experimentation suggests that the three models with standard scoring perform comparably overall. While their specific performance varied with the data-generating model, trait estimator, and $σ_{γ}^{2}$, they showed broadly parallel performance in the final trait inference. The most salient difference was observed when the data were generated from RTM. Our empirical evaluation revealed that the performance of RTM varies substantially with the trait estimator. The advantage of RTM as a scoring model was conspicuous only when there were strong random testlet effects and the trait levels were estimated with a Bayesian prior. In other situations, it performed only marginally better than, or substantially worse than, the simpler models. The current findings have important implications for operational assessment programs. Although RTM demonstrates better goodness of fit on the relative fit criteria, it does not necessarily lead to the best measurement outcomes. Often the simpler models (GPCM and FTM) performed better than RTM even when there were nonzero random testlet effects. Our experience with the simulated and real data suggests that these models can be used as functional alternatives to RTM with limited ramifications for trait inference.

Two of the models considered in this study, TPIM and FTM, deserve further comment. Throughout the study, we found that TPIM has limited promise as a practical alternative to RTM. Apart from the limitations documented in the literature (e.g., loss of information, unfair equating of items), TPIM tended to experience frequent convergence problems due to data sparsity. The fitted outcomes also showed substantially different patterns from those of the item-level measurement models. In addition to the estimation problem, coalescing multiple polytomous items made it difficult to interpret the item parameters and impeded item-level analysis (e.g., item difficulty, information, differential item functioning). Taken together, we conclude that TPIM is the least favored approach for scoring polytomous testlet items.

The second comment concerns FTM. We noted that simultaneous estimation of $γ$ and $b$ in FTM is subject to location indeterminacy. The results from the simulation study suggest that, although the $γ$ estimates tend to have inward bias, the degree of bias is small overall and has minimal impact on the inference of the other parameters. In fact, when FTM is fit without scale constraints, the item location parameters are decomposed into a common testlet-effect parameter ($γ$) and item-specific effect parameters ($b$). Because of this, the outcomes of FTM were almost identical to those of GPCM. In applied settings, it may be preferable to use GPCM and capitalize on the available calibration software. FTM may be used when it is desired to evaluate, in a confirmatory manner, the significance of a shift in the item locations or the effects of item covariates (e.g., item position).

We conclude the article with the limitations of the study and future research directions. First, the current study fixed some of the design factors to keep the scope of work manageable. A more systematic study may be conducted in the future to investigate the interaction between the scoring models and different design factors. For example, in applied settings, tests commonly mix independent items with testlet items, and testlets can vary in their composition (e.g., the type and number of items, testlet effect size). A future study may examine the performance of the models under varying levels of these factors and the consequences of using a simpler and more practicable scoring model. Second, our experience from the current study suggests that TPIM generally leads to unsatisfactory outcomes when items have many response categories and/or testlets include large numbers of items. When polytomous scoring of testlet items produces a large number of distinct response scores, one may consider continuous response models (Mellenbergh, 1994; Müller, 1987; Samejima, 1974) as an alternative to TPIM. For example, our analysis of real innovative items suggests that polytomous scoring of polytomous testlet items can yield response scores on quite wide scales (e.g., 20-50). In these cases, it may be more desirable to apply the continuous response models to circumvent the convergence problems and to retain the information across the response scores. Before the continuous response models are applied, however, substantive evidence must first be established that they can adequately address the local dependence within testlets.

Supplemental Material

Supplemental material, sj-pdf-1-epm-10.1177_00131644211032261 for Polytomous Testlet Response Models for Technology-Enhanced Innovative Items: Implications on Model Fit and Trait Inference by Hyeon-Ah Kang, Suhwa Han, Doyoung Kim and Shu-Chuan Kao in Educational and Psychological Measurement

Supplemental Material

Supplemental material, sj-zip-2-epm-10.1177_00131644211032261 for Polytomous Testlet Response Models for Technology-Enhanced Innovative Items: Implications on Model Fit and Trait Inference by Hyeon-Ah Kang, Suhwa Han, Doyoung Kim and Shu-Chuan Kao in Educational and Psychological Measurement

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

This work was supported by the National Council of State Boards of Nursing, UTA19-000392.

ORCID iD

Hyeon-Ah Kang

Supplemental Material

Supplemental material for this article is available online.

Notes
