Interactions Between Termination Criteria and Ability Estimators in Computerized Adaptive Testing

Abstract

Computerized adaptive testing (CAT) aims to optimize measurement by tailoring item administration to individual examinees. The efficiency and precision of a CAT heavily depend on the choice of ability ( $θ$ ) estimator and the termination criterion (stopping rule). Prior research suggests these components interact, but comprehensive evaluations across varying item bank characteristics remain limited. This simulation study investigated the interactive effects of four $θ$ estimators (maximum likelihood [MLE], weighted likelihood [WLE], maximum a posteriori [MAP], and expected a posteriori [EAP]) and four termination criteria (fixed-length, standard error of measurement [SEM], minimum information [MI], and change-in-estimate [Δ $θ$ ]) on measurement bias, precision (RMSE), and test length. These combinations were evaluated across low- (100-item) and high-information (500-item) item banks with both flat and peaked information distributions using the three-parameter logistic model. The results demonstrated that the optimal CAT configuration is contingent on item bank size and shape. Across all conditions, WLE emerged as the most robust estimator, effectively neutralizing the boundary estimation issues of MLE and the shrinkage bias characteristic of Bayesian estimators. In high-information banks, the SEM and fixed-length rules yielded the lowest conditional RMSE and bias regardless of bank shape. However, in low-information peaked banks, the strict SEM rule frequently failed to reach precision targets at the θ extremes, resulting in inefficient, maximum-length tests. Under these sparse conditions, the Δθ rule paired with WLE provided a superior balance of accuracy and efficiency by halting administration when precision gains stagnated. Conversely, the MI rule consistently exhibited the highest bias and RMSE. These findings underscore that optimal CAT design is not a one-size-fits-all solution. 
For high-quality banks, WLE paired with an SEM or fixed-length rule is recommended. For lower-quality banks, practitioners should adopt a $Δ θ$ rule or a hybrid SEM approach to prevent inefficient test elongation.

Keywords

computerized adaptive testing, item response theory, stopping rules, ability estimation, measurement precision, item bank characteristics

Introduction

Computerized adaptive testing (CAT) is designed to deliver an optimal test for each examinee by administering fewer items while preserving or improving measurement precision relative to paper-and-pencil or other fixed-length tests (Meijer & Nering, 1999; Weiss, 1982, 2014). To implement a CAT, six components are required: (a) a specific response model; (b) a bank of pre-tested items; (c) an entry rule; (d) a method for selecting the next item; (e) an ability ( $θ$ ) estimator; and (f) a termination criterion (Weiss & Kingsbury, 1984; Weiss & Şahin, 2024). In most applications, an item response theory (IRT) model is assumed to describe response behavior. Among these components, the termination criterion has a particularly large influence because it governs both test length and the precision of ability estimation. Two common approaches are fixed-length and variable-length stopping rules.

Stopping Rules in CAT

The fixed-length rule ends the CAT when a predetermined number of items have been administered; all examinees receive the same number of items regardless of their interim ability estimates or the degree of precision achieved. Although simple, this approach has drawbacks in both efficiency and quality of measurement: some examinees are administered more items than necessary, increasing testing time, whereas others receive fewer items than needed, yielding lower measurement precision. Therefore, to ensure that examinees are measured to a desired degree of precision, variable-length CAT is typically preferred (Weiss & Kingsbury, 1984), and prior research has shown that variable-length CAT performs as well as, or better than, fixed-length CAT (Babcock & Weiss, 2012).

There are three general types of variable-length stopping rules. The most popular and most researched is the standard error of measurement (SEM) cutoff rule (Weiss & Kingsbury, 1984). Under the SEM rule, the test terminates when the SEM of the ability ( $θ$ ) estimate reaches a prespecified level, thereby aiming to ensure equal precision of measurement across examinees. When the item bank provides relatively flat and sufficiently high information across $θ$ , the SEM rule performs well in practice. However, with an item bank whose information is peaked, the test may fail to stop for some examinees (Tai et al., 2025), imposing an unnecessary burden on examinees with extreme $\hat{θ}$ and reducing efficiency. Tests may also end prematurely when model assumptions are violated (Ravand, 2015), or when informative items remain unused—an issue of particular concern in applications that require high accuracy (e.g., medical outcomes; Ware et al., 2005). For these reasons, the SEM rule is often paired with minimum- and maximum-items safeguards.

The second type of termination rule is the minimum information (MI) criterion. Used in early CAT research, MI specifies that the CAT should stop when no item in the bank can provide information exceeding a prespecified cutoff at the examinee’s current $θ$ estimate (Gialluca & Weiss, 1979; Maurelli & Weiss, 1981). This criterion tends to push precision as high as the bank allows but typically produces longer tests than the SEM rule (Babcock & Weiss, 2012; Stafford et al., 2019). To mitigate this drawback while maintaining precision, Choi et al. (2011) proposed the predicted standard error reduction (PSER) rule, a look-ahead variant that estimates the expected reduction in SEM for the next item under a minimum expected posterior variance selection framework. The test continues only if the next item is expected to reduce SEM by at least a threshold; otherwise, it ends. Choi et al. (2011) showed that PSER performs better than both SEM and MI rules.

The third type of termination rule is the change-in-estimate ( $Δ θ$ ) rule. Under this rule, the test ends when the absolute change in the $θ$ estimate falls below a prespecified tolerance for a specified number of consecutive items (Hart et al., 2005, 2006). Although this rule is under-researched, simulation studies have shown that $Δ θ$ performs as well as, or better than, SEM and MI with respect to precision and test length (Babcock & Weiss, 2012; Stafford et al., 2019). There are other types of termination criteria, such as confidence-interval-based and sequential classification rules; because the present study focuses on measurement rather than classification, these criteria are excluded from consideration here. Importantly, stopping rules determine how much information is accumulated before termination, and item-bank information can be peaked or flat across $θ$ ; bank information shape can therefore substantially moderate the performance of any given rule.

Effects of $θ$ Estimator

Beyond termination criteria, the choice of $θ$ estimator also affects precision. Estimators commonly used in IRT include maximum likelihood estimation (MLE; Lord, 1983), weighted likelihood estimation (WLE; Warm, 1989), maximum a posteriori (MAP; Bock & Aitkin, 1981; Samejima, 1969), and expected a posteriori (EAP; Bock, 1983; Bock & Aitkin, 1981). MLE uses the response pattern and the item response function to estimate $θ$ , but it yields undefined (infinite) estimates for non-mixed response patterns (e.g., all correct or all incorrect), which are common early in an adaptive test. Early CAT implementations therefore truncated the estimate or imposed ad hoc bounds in the absence of a mixed response pattern. WLE reduces small-sample bias relative to MLE and avoids the need for truncated or arbitrary $θ$ estimates. In operational CATs, extreme response patterns are typically handled via bounds or Bayesian estimators (S. Wang & Wang, 2001).

Bayesian methods (MAP and EAP) do not suffer from boundary problems because they incorporate a prior distribution; both form a posterior distribution by combining the prior with the likelihood, with MAP using the posterior mode and EAP using the posterior mean. Several simulation studies have found that EAP often exhibits lower bias than MAP, particularly in short tests (Hambleton & Swaminathan, 1985; S. Wang & Wang, 2001; T. Wang & Vispoel, 1998). However, when priors are misspecified, estimation can be highly biased; even with default priors centered at 0, examinees with extreme $θ$ tend to have their $θ$ estimates shrunk toward the center of the $θ$ scale.

When T. Wang et al. (1999) evaluated $θ$ estimators, they found that variable-length CAT produced different bias patterns than fixed-length CAT. Subsequently, Yi et al. (2001) compared termination criteria and estimators and reported an interaction between estimator and stopping rule. The topic was revisited only recently: Lee and Han (2024) evaluated the standard error of the total score and SEM performance under different estimation procedures, and Dwahdh and Alshraifin (2025) used real data to compare SEM and fixed-length rules across estimators. Furthermore, recent theoretical work by C. Wang et al. (2019) suggests that standard stopping rules may fail when item banks are exhausted of information, a factor that has not been fully examined in the context of unidimensional estimator interactions. Their findings were not consistent with prior research and did not account for differences in item-bank shape. Prior work comparing termination criteria indicates that bank shape is a major factor in bias across termination rules (Babcock & Weiss, 2012). Consequently, a comprehensive simulation is needed to examine the interaction of stopping rules and $θ$ estimators under different bank shapes and to provide clearer guidance for practice.

Purpose

The present study aimed to quantify how different termination criteria interacted with $θ$ estimators to affect bias, root mean squared error (RMSE), and test length. The shape of the item bank (peaked vs. flat) was also considered. A unified simulation was conducted to address how to choose an optimal pairing of termination criterion and $θ$ estimator for a given bank type, and practical guidance to researchers is provided. The research questions were: Given an item bank, which combination of stopping rule and estimator generated the least bias and RMSE in $θ$ estimation? What were the bias and RMSE patterns for each combination? Which combination yielded an optimal test length while maintaining relatively low bias?

Method

IRT Model

The three-parameter logistic dichotomous IRT model (3PLM) was used. Assume an IRT model with $J$ test items answered by $N$ individuals, where each item has two response categories (correct/incorrect). There are $2^{J}$ possible response patterns. The probability that respondent $i$ with ability $θ_{i}$ answers item $j$ correctly is

P (Y_{ij} = 1 ∣ θ_{i}) = c_{j} + (1 - c_{j}) \frac{\exp [D a_{j} (θ_{i} - b_{j})]}{1 + \exp [D a_{j} (θ_{i} - b_{j})]},

(1)

where $a_{j}$ is the discrimination parameter, $b_{j}$ is the difficulty parameter, $c_{j}$ is the guessing parameter, and $D = 1.7$ is a scaling constant to approximate the normal–ogive model.
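As a minimal sketch of Equation 1, the response probability can be written as a small Python function. The function name `p3pl` and the example parameter values are illustrative assumptions, not part of the study's software.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM (Equation 1)."""
    z = D * a * (theta - b)
    # exp(z) / (1 + exp(z)) rewritten as 1 / (1 + exp(-z))
    return c + (1.0 - c) / (1.0 + math.exp(-z))


# When theta equals b, the logistic term is 0.5, so P = c + (1 - c) / 2.
p_at_b = p3pl(theta=0.0, a=1.0, b=0.0, c=0.2)
```

The lower asymptote is $c_{j}$ : for a very low $θ$ the function returns a value close to the guessing parameter rather than 0.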

Item Banks

To approximate real-world variation in test information, four item-bank design conditions were manipulated: (a) 100-item banks with normally distributed $b_{j}$ parameters; (b) 500-item banks with normally distributed $b_{j}$ parameters; (c) 100-item banks with uniformly distributed $b_{j}$ parameters; and (d) 500-item banks with uniformly distributed $b_{j}$ parameters. The shape of the bank information function was controlled by the distribution of the item difficulty ( $b_{j}$ ) parameters: the normal distribution generated a peaked information bank, whereas the uniform distribution generated a flat information bank. Furthermore, the overall magnitude of bank information was manipulated via bank size rather than individual item quality. Because items in all conditions were generated from the same discrimination ( $a_{j}$ ) and guessing ( $c_{j}$ ) parameter distributions, the average individual item quality remained constant. However, a 500-item bank provides a much higher density of items at any given $θ$ level than a 100-item bank. In a CAT environment, this higher density allows the item selection algorithm to continuously select highly informative items for a longer duration before the optimal local item pool is exhausted. Therefore, the 100-item banks operated as sparse, low-information banks, while the 500-item banks operated as dense, high-information banks.

To ensure the generalizability of findings across different item banks, item banks were treated as a random factor. The simulation employed 1,000 independent replications for each of the four conditions (resulting in 4,000 unique generated item banks in total). Example bank information functions from a single replication are shown in Figure 1. Because of the guessing parameter in the 3PLM (Equation 1), tails naturally possess lower information than the mid-range even under flat $b_{j}$ distributions.

Figure 1.

Bank information functions for the first bank in each condition.

For each bank, 3PLM parameters were generated as follows: discriminations were lognormal, $\log a_{j} \sim N (0.05, 0.25^{2})$ (ensuring $a_{j} > 0$ ; thus $E [a] \approx 1.085$ and $SD (a) \approx 0.275$ ); difficulties $b_{j} \sim N (0, 1)$ for peaked banks and $b_{j} \sim \text{Unif} (- 3, 3)$ for flat banks; guessing parameters were fixed at $c_{j} = 0.20$ .
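The generating distributions above can be sketched with Python's standard `random` module. The function name `generate_bank`, the tuple layout `(a, b, c)`, and the seed are assumptions made for illustration; the study's actual generator is not described in that detail.

```python
import math
import random


def generate_bank(n_items, shape, rng):
    """Return a list of (a, b, c) tuples for one simulated item bank."""
    bank = []
    for _ in range(n_items):
        a = rng.lognormvariate(0.05, 0.25)  # log a ~ N(0.05, 0.25^2)
        if shape == "peaked":
            b = rng.gauss(0.0, 1.0)         # normal difficulties
        elif shape == "flat":
            b = rng.uniform(-3.0, 3.0)      # uniform difficulties
        else:
            raise ValueError("shape must be 'peaked' or 'flat'")
        bank.append((a, b, 0.20))           # guessing fixed at 0.20
    return bank


rng = random.Random(2024)  # arbitrary seed, for reproducibility only
bank = generate_bank(500, "flat", rng)

# Theoretical lognormal moments, matching the values reported in the text.
mean_a = math.exp(0.05 + 0.25 ** 2 / 2)                 # about 1.085
sd_a = mean_a * math.sqrt(math.exp(0.25 ** 2) - 1.0)    # about 0.275
```

Regenerating the bank inside each replication, as the design describes, would amount to calling `generate_bank` once per replication and condition.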

Termination Criteria

Four termination criteria were investigated.

Fixed-Length Rule

The fixed-length rule served as a reference. These CATs terminated after 20 (low precision), 25, 30 and 35 (high precision) administered items.

SEM Rule

The SEM rule terminates a CAT when the standard error of the current $θ$ estimate falls below a prespecified threshold. For a response vector $Y = (y_{1}, \dots, y_{J})$ , the likelihood is

L (θ ∣ Y) = \prod_{j = 1}^{J} P_{j} (θ)^{y_{j}} Q_{j} (θ)^{1 - y_{j}},

(2)

where $P_{j} (θ)$ is given by Equation 1 and $Q_{j} (θ) = 1 - P_{j} (θ)$ . The (expected) Fisher information for item $j$ is

I_{j} (θ) = \frac{[P_{j}^{\prime} (θ)]^{2}}{P_{j} (θ) Q_{j} (θ)},

(3)

and the test information is $I (θ) = \sum_{j \in A} I_{j} (θ)$ , where $A$ indexes administered items. Equivalently, the expected Fisher information can be written as

I (θ) = E_{Y ∣ θ} [- \frac{\partial^{2}}{\partial θ^{2}} ℓ (θ; Y)],

(4)

where $ℓ (θ; Y) = \log L (θ ∣ Y)$ . The SEM is

SEM (θ) = \sqrt{\frac{1}{I (θ)}},

(5)

evaluated at the current $\hat{θ}$ . From a classical reliability perspective,

SEM = σ_{X} \sqrt{1 - ρ_{xx}} \approx \sqrt{1 - ρ_{xx}},

(6)

where $ρ_{xx}$ is the reliability coefficient in classical test theory and the approximation assumes $σ_{X} \approx 1$ . Therefore $SEM \approx 0.316$ (low precision) corresponds to $ρ_{xx} \approx .90$ and $SEM \approx 0.224$ (high precision) to $ρ_{xx} \approx .95$ . Both thresholds were used. Minimum and maximum test lengths were set to 5 and 50 items, respectively.
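The mapping between reliability, SEM, and required test information in Equations 5 and 6 can be checked with a short sketch. The function names are hypothetical, and the length safeguards are passed in as arguments rather than hard-coded.

```python
import math


def sem_from_info(test_info):
    """Equation 5: SEM = 1 / sqrt(I(theta))."""
    return 1.0 / math.sqrt(test_info)


def sem_cutoff(reliability):
    """Equation 6 with sigma_X close to 1: SEM is about sqrt(1 - rho_xx)."""
    return math.sqrt(1.0 - reliability)


def sem_rule_stops(test_info, n_items, cutoff, min_items, max_items):
    """True when the SEM rule, with length safeguards, ends the test."""
    if n_items < min_items:
        return False          # never stop before the minimum length
    if n_items >= max_items:
        return True           # always stop at the maximum length
    return sem_from_info(test_info) < cutoff


low = sem_cutoff(0.90)    # about 0.316
high = sem_cutoff(0.95)   # about 0.224
# Information needed to reach each cutoff: 1 / SEM^2 = 1 / (1 - rho_xx).
info_needed = (1.0 / low ** 2, 1.0 / high ** 2)   # about 10 and 20
```

Read this way, the high-precision cutoff asks for roughly twice the accumulated test information of the low-precision cutoff.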

Minimum Information Rule

The MI rule terminates when no unused item provides information at the current $\hat{θ}$ exceeding a small cutoff criterion. For a 3PLM item, Equation 3 simplifies to

I_{j} (θ) = D^{2} a_{j}^{2} \frac{Q_{j} (θ)}{P_{j} (θ)} \left( \frac{P_{j} (θ) - c_{j}}{1 - c_{j}} \right)^{2} .

(7)

Early MI cutoffs were bank specific (e.g., Gialluca & Weiss, 1979; Maurelli & Weiss, 1981); simulations suggested that $I_{\min} \approx 0.56$ and $0.42$ were comparable to SEM cutoffs of $0.35$ and $0.30$ , respectively (Stafford et al., 2019). Because optimal MI thresholds are highly dependent on the specific item pool, a sensitivity analysis was conducted prior to the main study to evaluate a range of candidate cutoffs ( $0.05$ to $0.35$ ). Pilot testing revealed that higher thresholds (e.g., $0.15$ to $0.35$ ) frequently resulted in premature test termination, particularly in the sparse 100-item bank conditions, because the algorithm rapidly exhausted the few highly informative items available. Consequently, the final thresholds were set to $I_{\min} = 0.05$ (high precision) and $0.10$ (low precision), as these values allowed the CAT to operate across all bank designs without forcing immediate termination.
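A minimal sketch of the MI check follows, using the 3PLM information function in Equation 7. The names `item_info` and `mi_rule_stops`, and the three-item example, are illustrative assumptions.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """3PLM response probability (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PLM item information (Equation 7)."""
    p = p3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2


def mi_rule_stops(theta_hat, unused_items, cutoff):
    """Stop when no unused item exceeds the information cutoff at theta_hat."""
    return all(item_info(theta_hat, a, b, c) <= cutoff
               for a, b, c in unused_items)


unused = [(1.0, 2.5, 0.2), (1.0, 3.0, 0.2)]          # only hard items remain
stop_at_low_theta = mi_rule_stops(-3.0, unused, cutoff=0.05)
stop_at_high_theta = mi_rule_stops(3.0, unused, cutoff=0.05)
```

In this toy case the rule ends the test for a simulee near $θ = -3$ but not for one near $θ = 3$ , which illustrates why test length under MI depends on where the remaining items are located.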

Change in $\hat{θ}$ Rule

The change-in- $θ$ rule terminates when

| {\hat{θ}}_{t} - {\hat{θ}}_{t - 1} | \leq ε \quad \text{for each of the last } k \text{ administered items},

where $t$ indexes the current item, $ε$ is a tolerance, and $k$ is the number of consecutive items for which the criterion must hold; larger $k$ is more conservative, and $k = 1$ was used in Babcock and Weiss (2012). Following that study, settings of $ε = 0.05$ (low precision) and $ε = 0.02$ (high precision) were used.
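The change-in-estimate check can be sketched as a function over the history of interim estimates. The name `delta_theta_stops` and the example history are hypothetical.

```python
def delta_theta_stops(theta_history, eps, k=1):
    """True when the last k successive changes in theta-hat are all <= eps.

    theta_history holds the interim estimates in administration order.
    """
    if len(theta_history) < k + 1:
        return False  # not enough estimates to form k differences
    recent = theta_history[-(k + 1):]
    changes = [abs(later - earlier)
               for earlier, later in zip(recent, recent[1:])]
    return all(change <= eps for change in changes)


history = [0.0, 0.5, 0.82, 0.79]                   # illustrative estimates
stop_low = delta_theta_stops(history, eps=0.05)    # last change is about 0.03
stop_high = delta_theta_stops(history, eps=0.02)   # 0.03 exceeds 0.02
```

With the same history, the low-precision tolerance ends the test while the high-precision tolerance continues it, which is the mechanism behind the shorter tests under $ε = 0.05$ .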

$θ$ Estimation Procedures

Maximum Likelihood Estimation

The maximum likelihood estimate (MLE), ${\hat{θ}}_{MLE}$ , maximizes $\log L (θ ∣ Y)$ (Equation 2). In early CAT stages, non-mixed response patterns (all correct or all incorrect) yield no finite solution. Operationally, this is handled by bounding $\hat{θ}$ or by using a Bayesian estimator (T. Wang & Vispoel, 1998). Following Babcock and Weiss (2012), when responses remained non-mixed, $\hat{θ}$ was updated by a fixed step of $\pm 0.5$ until a mixed pattern occurred.
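One way to sketch the interim MLE update with the fixed-step fallback is shown below. The grid search over $[-4, 4]$ stands in for whatever optimizer the study used, and the step direction (up after all-correct, down after all-incorrect) is an assumption consistent with the description above; the function names are hypothetical.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """3PLM response probability (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def log_lik(theta, items, responses):
    """Log of the likelihood in Equation 2 for the administered items."""
    total = 0.0
    for (a, b, c), y in zip(items, responses):
        p = p3pl(theta, a, b, c)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total


def mle_update(prev_theta, items, responses,
               step=0.5, lo=-4.0, hi=4.0, n_grid=801):
    """Fixed step for non-mixed patterns, otherwise a grid-search MLE."""
    if all(y == 1 for y in responses):
        return prev_theta + step   # assumed direction: move up
    if all(y == 0 for y in responses):
        return prev_theta - step   # assumed direction: move down
    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    return max(grid, key=lambda t: log_lik(t, items, responses))


items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
after_two_correct = mle_update(0.0, items[:2], [1, 1])   # step rule applies
after_mixed = mle_update(0.5, items, [1, 1, 0])          # proper MLE
```

A grid search is used here only because the 3PLM likelihood can have local maxima; a Newton-type routine with sensible starting values would be a common alternative.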

Maximum a Posteriori

Samejima (1969) proposed maximum a posteriori (MAP) as an alternative to MLE in cases where a prior distribution of $θ$ can be specified. The MAP estimates are found by maximizing the posterior distribution $p (θ | Y)$ where $p (θ ∣ Y) \propto L (θ ∣ Y) g (θ)$ , and $g (θ)$ is the prior distribution of $θ$ . $\hat{θ}$ is found by solving $\frac{\partial}{\partial θ} [p (θ | Y)] = 0$ . In this study, a standard normal prior $θ ~ N (0, 1)$ was used.

Expected a Posteriori

Expected a posteriori (EAP) estimation also used a standard normal prior. The EAP estimate is the posterior mean,

{\hat{θ}}_{EAP} = \frac{\sum_{k = 1}^{q} X_{k} L_{J} (X_{k}) W (X_{k})}{\sum_{k = 1}^{q} L_{J} (X_{k}) W (X_{k})},

(8)

where $q$ is the number of quadrature points, $\{X_{k}, W (X_{k})\}_{k = 1}^{q}$ are the quadrature points and weights (with $\sum_{k} W (X_{k}) = 1$ ), and $L_{J}$ is the likelihood over the $J$ administered items (Bock & Aitkin, 1981). The default $q = 33$ quadrature points were used in estimation.
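Equation 8 can be sketched with equally spaced quadrature points and normalized standard-normal weights. The range $[-4, 4]$ and the equal spacing are assumptions; the text specifies only $q = 33$ and the standard normal prior.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """3PLM response probability (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def likelihood(theta, items, responses):
    """Likelihood of the responses to the administered items (Equation 2)."""
    value = 1.0
    for (a, b, c), y in zip(items, responses):
        p = p3pl(theta, a, b, c)
        value *= p if y == 1 else 1.0 - p
    return value


def eap(items, responses, q=33, lo=-4.0, hi=4.0):
    """Posterior mean of theta under a N(0, 1) prior (Equation 8)."""
    points = [lo + (hi - lo) * k / (q - 1) for k in range(q)]
    dens = [math.exp(-0.5 * x * x) for x in points]
    total = sum(dens)
    weights = [d / total for d in dens]   # normalized so they sum to 1
    like = [likelihood(x, items, responses) for x in points]
    num = sum(x * l * w for x, l, w in zip(points, like, weights))
    den = sum(l * w for l, w in zip(like, weights))
    return num / den


items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
theta_all_correct = eap(items, [1, 1, 1])   # finite despite a non-mixed pattern
```

Because the prior bounds the posterior, an all-correct pattern still yields a finite estimate; the same prior is what pulls extreme estimates toward 0.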

Weighted Likelihood Estimation

Weighted likelihood estimation (WLE, Warm, 1989) maximizes a weighted likelihood $L^{*} (θ) = L (θ) f (θ)$ , typically with $f (θ) = \sqrt{I (θ)}$ . Equivalently,

{\hat{θ}}_{WLE} = \underset{θ}{\arg \max} \left\{ \log L (θ ∣ Y) + \frac{1}{2} \log I (θ) \right\},

(9)

which can be viewed as MAP estimation with a Jeffreys prior. In practice, WLE reduces small-sample bias relative to MLE without introducing the bias associated with an informative prior distribution (S. Wang & Wang, 2001; Warm, 1989).
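Taking the weight $f (θ) = \sqrt{I (θ)}$ stated above, the weighted likelihood can be maximized on the log scale as $\log L (θ) + \frac{1}{2} \log I (θ)$ . The sketch below does this by grid search over $[-4, 4]$ ; the search method, bounds, and function names are assumptions for illustration, not the study's implementation.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """3PLM response probability (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PLM item information (Equation 7)."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def wle(items, responses, lo=-4.0, hi=4.0, n_grid=801):
    """Grid-search maximizer of log L(theta) + 0.5 * log I(theta)."""
    def objective(theta):
        log_l = 0.0
        info = 0.0
        for (a, b, c), y in zip(items, responses):
            p = p3pl(theta, a, b, c)
            log_l += math.log(p) if y == 1 else math.log(1.0 - p)
            info += item_info(theta, a, b, c)
        return log_l + 0.5 * math.log(info)

    grid = [lo + (hi - lo) * i / (n_grid - 1) for i in range(n_grid)]
    return max(grid, key=objective)


items = [(1.0, -1.0, 0.2), (1.0, 0.0, 0.2), (1.0, 1.0, 0.2)]
theta_all_correct = wle(items, [1, 1, 1])   # interior maximum, no step rule
```

The information term falls toward zero in the tails, so the objective has an interior maximum even for non-mixed patterns, which is why no ad hoc bound or step is needed.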

Simulated Examinees

This study used $1,600$ simulated examinees (simulees), 100 at each of 16 evenly spaced points on $θ$ from $- 3$ to $3$ (a spacing of 0.4). Results were summarized both conditional on $θ$ and aggregated within experimental conditions.

CAT Implementation

All simulees started at $θ_{0} = 0$ (the default in many CAT software packages). The next item was selected by maximum Fisher information: at each step, the unused item with the greatest $I_{j} (\hat{θ})$ was administered. For each item bank, all four estimation procedures were crossed with all four termination criteria at each $θ$ level. For each item-bank condition, 1,000 replications were run.
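Maximum-information selection, as described above, reduces to an argmax over the unused items. The sketch below assumes a bank stored as a list of `(a, b, c)` tuples and a set of administered indices; both representations and the function names are hypothetical.

```python
import math

D = 1.7  # scaling constant from Equation 1


def p3pl(theta, a, b, c):
    """3PLM response probability (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_info(theta, a, b, c):
    """3PLM item information (Equation 7)."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_next_item(theta_hat, bank, administered):
    """Index of the unused item with the greatest information at theta_hat."""
    candidates = [j for j in range(len(bank)) if j not in administered]
    if not candidates:
        raise ValueError("item bank exhausted")
    return max(candidates, key=lambda j: item_info(theta_hat, *bank[j]))


bank = [(1.0, -2.0, 0.2), (1.0, 0.0, 0.2), (1.0, 2.0, 0.2)]
first = select_next_item(0.0, bank, administered=set())   # start at theta_0 = 0
second = select_next_item(0.0, bank, administered={first})
```

No exposure control or content balancing appears in the sketch, mirroring the unconstrained design of the simulation.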

While operational CATs frequently utilize item exposure control and content balancing constraints, these mechanisms were deliberately excluded from the current simulation. Implementing such constraints forces the selection algorithm to deviate from maximum information, which would introduce confounding variables and mask the pure mathematical interactions between the $θ$ estimators and termination criteria. By employing an unconstrained design—and instead accounting for generalizability by regenerating the entire item bank for each of the 1,000 replications—this study establishes a clean, theoretical baseline of these psychometric interactions without the confounding noise of operational restrictions.

Evaluation

Three outcomes were used to evaluate estimator–stopping-rule combinations.

Test Length

For each simulee the number of administered items was recorded; mean length was computed per condition to assess efficiency.

Bias

For each grid point of $θ$ ,

Bias (θ) = \frac{1}{n_{θ}} \sum_{i : θ_{i} = θ} ({\hat{θ}}_{i} - θ),

(10)

where $n_{θ}$ is the number of estimates at that grid point. Bias reflects the signed mean difference between estimated $θ$ and true $θ$ .

Root Mean Squared Error

For each grid point of $θ$ ,

RMSE (θ) = \sqrt{\frac{1}{n_{θ}} \sum_{i : θ_{i} = θ} {({\hat{θ}}_{i} - θ)}^{2}} .

(11)

Together, bias and root mean squared error (RMSE) summarize overall accuracy.
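Equations 10 and 11 can be computed together by grouping estimation errors by true $θ$ . The function name `conditional_bias_rmse` and the toy data are illustrative assumptions.

```python
import math
from collections import defaultdict


def conditional_bias_rmse(true_thetas, estimates):
    """Map each true theta to (bias, RMSE) per Equations 10 and 11."""
    errors = defaultdict(list)
    for theta, est in zip(true_thetas, estimates):
        errors[theta].append(est - theta)
    summary = {}
    for theta, errs in errors.items():
        n = len(errs)
        bias = sum(errs) / n                              # signed mean error
        rmse = math.sqrt(sum(e * e for e in errs) / n)    # root mean squared error
        summary[theta] = (bias, rmse)
    return summary


# Toy data: two simulees at theta = 0 and two at theta = 1.
summary = conditional_bias_rmse([0.0, 0.0, 1.0, 1.0],
                                [0.1, -0.1, 1.2, 1.4])
```

In the toy data the errors at $θ = 0$ cancel, giving zero bias but a nonzero RMSE, which shows why the two indices are reported together.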

Results

$θ$ Recovery in Low-Information (100-Item) Banks

Bias

Figure 2 illustrates the bias of each estimator and termination criterion conditional on $θ$ . Overall, the fixed-length and SEM termination rules yielded the lowest conditional bias, closely followed by the $Δ θ$ rule, whereas the MI rule exhibited the most substantial bias. Regarding bank structure, flat banks generally produced lower bias than peaked banks at the extremes of the $θ$ scale; however, performance was comparable between bank shapes in the central region of the $θ$ distribution. Across all variable-length rules, increasing stringency (i.e., requiring a lower standard error, a lower information cutoff, or a smaller $θ$ change) consistently reduced bias. Among the estimation methods, WLE outperformed the alternatives in the high-precision conditions, demonstrating less bias even at extreme $θ$ values, a robustness that held in the peaked bank condition. While MLE performed adequately with flat banks, it exhibited a characteristic outward bias, overestimating positive $θ$ and underestimating negative $θ$ . This expansion bias was exacerbated under the MI and $Δ θ$ termination rules. Conversely, the Bayesian methods (MAP and EAP) demonstrated shrinkage bias, pulling estimates toward the prior mean of 0. The performance differences between MAP and EAP were negligible under these conditions. For bias recovery in low-information settings, pairing either WLE or MLE with the SEM termination rule proved most effective and robust, with WLE preferred because it does not require arbitrary values of $\hat{θ}$ for non-mixed response vectors.

Figure 2.

Conditional bias by estimator and termination rule with the low-information bank. (A) Low precision condition. (B) High precision condition.

Test Length and Efficiency

As shown in Figure 3, the superior accuracy of the SEM rule was achieved at the expense of test efficiency. In the flat bank condition requiring high precision (SEM $< 0.224$ ), test length consistently reached the ceiling of 50 items. This ceiling effect suggests that, given the limited information in the bank, the strict precision criterion was unattainable for most simulees. In peaked banks, simulees with $θ$ near 0 achieved shorter tests (though rarely fewer than 30 items), whereas those at the extremes reached the 50-item limit. Notably, the EAP estimator produced an anomalous trend in the flat bank condition: Simulees with extreme $θ$ values received shorter tests than those near the center. This artifact likely arose because the Bayesian prior artificially reduces the standard error at the extremes, causing the SEM stopping rule to trigger prematurely. In contrast, the MI rule displayed an inverted U-shaped relationship with test length. Simulees with central $θ$ values were administered significantly more items than those at the extremes, a pattern driven by the availability of high-information items near the center of the scale.

Figure 3.

Conditional test length by estimator and termination rule with the low-information bank. (A) Low precision condition. (B) High precision condition.

The $Δ θ$ rule offered a compromise, producing significantly shorter tests: approximately 25 items for high precision ( $ε = 0.02$ ) and 15 for low precision ( $ε = 0.05$ ; Figure 3A). Similar to the MI rule, the $Δ θ$ rule resulted in longer tests for simulees near $θ = 0$ and shorter tests at the extremes. From a purely efficiency-oriented perspective, the low-precision $Δ θ$ rule was optimal. However, when balancing efficiency and accuracy, the $Δ θ$ rule (or MI rule) combined with WLE offered a favorable trade-off, achieving acceptable bias with significantly fewer items than the SEM rule. In the low-precision condition, the MI and $Δ θ$ rules exhibited similar but flatter test-length profiles, whereas the SEM rule used fewer items for central $θ$ values but many more items at the extremes.

Conditional RMSE

As illustrated in Figure 4, the conditional RMSE results indicate that the SEM and fixed-length rules generally outperformed the alternative stopping criteria in terms of measurement precision. Both rules exhibited similar error profiles; specifically, when using the MLE and WLE estimators, the lowest RMSE values were observed in the center of the $θ$ distribution, with error increasing as $θ$ approached the extremes. Notably, a non-linear decrease in RMSE was observed at the tails of the $θ$ scale, a known artifact attributable to boundary effects and the resulting reduction in score variance. Furthermore, the SEM rule tended to yield a more uniform (flatter) RMSE profile across the $θ$ continuum compared to the fixed-length condition. In the center of the distribution, EAP demonstrated the highest precision (lowest RMSE), though its error increased sharply at extreme $θ$ levels due to Bayesian shrinkage. MAP followed a similar trend but maintained a higher RMSE overall. Conversely, MLE and WLE exhibited superior performance at the scale’s extremes. The MI rule consistently underperformed across all conditions. Finally, while the $Δ θ$ rule significantly improved the performance of Bayesian estimators in the middle of the $θ$ range, it resulted in higher RMSE for MLE and WLE compared to the fixed-length and SEM rules.

Figure 4.

Conditional RMSE by estimator and termination rule with the low-information bank. (A) Low precision condition. (B) High precision condition.

Overall Accuracy

Table 1 summarizes the RMSE and test length distributions. For flat banks, the lowest RMSE was achieved by the WLE and MLE estimators paired with the strict SEM rule (SEM $< 0.224$ ); however, these combinations required the maximum test length (49.4 items). When factoring in efficiency, optimal performance—defined as minimizing RMSE while maintaining moderate test length—was observed with the EAP estimator paired with $Δ θ < 0.02$ , or WLE paired with SEM $< 0.316$ . A similar pattern emerged for peaked banks: MLE and WLE paired with SEM $< 0.224$ yielded the lowest RMSE but required nearly maximum test lengths (44.7 items). In this condition, the WLE estimator paired with $Δ θ < 0.02$ provided the best balance of accuracy and efficiency.

Table 1.

Low-Information Bank Results (J = 100).

Estimator	Variable stopping rules					Fixed-length rules
		Flat		Peaked			Flat	Peaked
	Stopping rule	RMSE	Length	RMSE	Length	Items	RMSE	RMSE
MLE	SEM $< 0.224$	0.250	49.4 ± 2.6	0.309	44.7 ± 8.1	$J = 20$	0.325	0.370
	SEM $< 0.316$	0.324	22.5 ± 10.2	0.375	30.0 ± 16.9	$J = 25$	0.295	0.344
	MI $< 0.05$	0.367	28.7 ± 14.3	0.415	25.6 ± 14.5	$J = 30$	0.276	0.328
	MI $< 0.10$	0.432	20.6 ± 11.3	0.473	18.8 ± 12.2	$J = 35$	0.265	0.318
	$Δ θ < 0.02$	0.349	23.4 ± 8.7	0.420	23.7 ± 10.7
	$Δ θ < 0.05$	0.419	14.6 ± 4.9	0.475	14.1 ± 5.2
WLE	SEM $< 0.224$	0.249	49.4 ± 2.5	0.309	44.7 ± 8.1	$J = 20$	0.323	0.367
	SEM $< 0.316$	0.321	22.8 ± 10.2	0.374	29.8 ± 16.8	$J = 25$	0.293	0.342
	MI $< 0.05$	0.412	26.4 ± 15.2	0.474	22.6 ± 14.9	$J = 30$	0.275	0.328
	MI $< 0.10$	0.500	17.9 ± 11.7	0.554	16.2 ± 11.8	$J = 35$	0.264	0.318
	$Δ θ < 0.02$	0.340	22.6 ± 7.9	0.374	23.9 ± 8.4
	$Δ θ < 0.05$	0.463	13.3 ± 4.6	0.486	13.4 ± 4.2
MAP	SEM $< 0.224$	0.272	48.6 ± 4.0	0.368	43.2 ± 9.3	$J = 20$	0.364	0.439
	SEM $< 0.316$	0.398	17.5 ± 6.4	0.473	24.3 ± 16.1	$J = 25$	0.325	0.407
	MI $< 0.05$	0.470	26.7 ± 15.1	0.546	23.9 ± 14.9	$J = 30$	0.302	0.390
	MI $< 0.10$	0.578	18.1 ± 11.5	0.637	17.2 ± 12.1	$J = 35$	0.288	0.378
	$Δ θ < 0.02$	0.329	25.4 ± 5.8	0.410	25.7 ± 5.9
	$Δ θ < 0.05$	0.472	14.7 ± 4.2	0.523	14.2 ± 3.4
EAP	SEM $< 0.224$	0.279	46.1 ± 9.1	0.353	44.6 ± 7.9	$J = 20$	0.363	0.422
	SEM $< 0.316$	0.377	17.7 ± 4.4	0.414	24.0 ± 12.7	$J = 25$	0.328	0.394
	MI $< 0.05$	0.446	27.0 ± 15.1	0.509	23.5 ± 14.7	$J = 30$	0.307	0.377
	MI $< 0.10$	0.534	18.4 ± 11.5	0.583	16.7 ± 11.7	$J = 35$	0.293	0.366
	$Δ θ < 0.02$	0.319	27.2 ± 5.0	0.391	27.5 ± 6.5
	$Δ θ < 0.05$	0.414	16.1 ± 3.5	0.465	16.0 ± 3.3

Note. Boldface values indicate the optimal performance (lowest RMSE or shortest test length) within each item bank and termination criterion condition.

$θ$ Recovery in High-Information (500-Item) Banks

Bias

Figure 5 illustrates the conditional bias across $θ$ estimators and termination criteria for high-information banks. Overall, the fixed-length rule yielded the lowest bias, followed by the SEM rule and the $Δ θ$ rule. The MI rule exhibited the highest bias overall, particularly at the extremes of the $θ$ scale. In contrast to the low-information condition, bank shape exerted minimal influence on bias in the high-information setting; both flat and peaked banks demonstrated similar trends, characterized by near-zero bias around $θ = 0$ that increased as $θ$ approached the extremes. Regarding estimation methods, the patterns largely mirrored those observed in the low-information condition. WLE yielded the lowest overall bias, showing a slight shrinkage effect as $θ$ moved toward the extremes. MAP and EAP followed a similar pattern but with more pronounced shrinkage bias at extreme $θ$ values. However, a notable exception was observed with the MLE estimator. Unexpectedly, MLE did not converge to zero bias at $θ = 0$ ; rather, it exhibited a slight positive bias at the center of the scale. Furthermore, this bias became asymmetrical: it increased dramatically as $θ$ approached -3 but decreased only slightly as $θ$ approached 3. This asymmetry was effectively reversed when MLE was paired with the $Δ θ$ termination rule. In general, the most accurate $θ$ recovery in this condition was observed using the fixed-length rule combined with the WLE estimator.

Figure 5.

Conditional bias by estimator and termination rule with the high-information bank. (A) Low precision condition. (B) High precision condition.

Test Length and Efficiency

As shown in Figure 6, the three variable-length rules produced distinct test length distributions. Under the SEM rule, the denser item bank allowed simulees with central $θ$ values to reach precision targets quickly, resulting in shorter tests, whereas those at the extremes required longer tests. This “U-shaped” pattern was further accentuated by the peaked bank structure. Conversely, the MI rule produced an inverted trend: simulees with central $θ$ values received longer tests, while those at the extremes received shorter tests, driven by the concentration of information in the bank. The $Δ θ$ rule exhibited a more complex pattern. In the low-precision condition, test length remained relatively flat across the $θ$ scale. However, the high-precision condition revealed an “M-shaped” or bimodal distribution: test lengths were shortest at the center ( $θ = 0$ ) and the far extremes ( $θ = \pm 3$ ), while peaking at intermediate values (around $θ = \pm 2$ ). If test efficiency is the sole criterion, the low-precision $Δ θ$ rule is superior. Compared with the SEM and MI rules, the $Δ θ$ rule was the least affected by both the shape of the item bank and the precision level, with mean conditional test length similar for peaked and flat item banks; the SEM rule was most affected by item bank structure and level of precision.

Figure 6.

Conditional test length by estimator and termination rule with the high information bank. (A) Low precision condition. (B) High precision condition.

Conditional RMSE

As illustrated in Figure 7, the conditional RMSE patterns in the high-information banks closely mirrored those observed in the low-information conditions, albeit with an overall reduction in absolute measurement error. The fixed-length rule consistently yielded the lowest RMSE across both flat and peaked bank shapes, with the SEM rule demonstrating a comparable error profile across the $θ$ continuum. Conversely, the MI rule continued to underperform, exhibiting the highest conditional RMSE across all $θ$ levels. Furthermore, the $Δ θ$ rule remained particularly effective when paired with Bayesian estimators (EAP and MAP), reducing their error in the center of the distribution compared to other termination rules. Ultimately, while the increased item information improved overall measurement precision, the relative effectiveness and conditional error trajectories of the stopping rules remained largely unchanged.

Figure 7.

Conditional RMSE by estimator and termination rule with the high information bank. (A) Low precision condition. (B) High precision condition.

Overall Accuracy (RMSE)

Table 2 summarizes the RMSE and test length distributions. For flat banks, the lowest RMSE was achieved by the WLE and MLE estimators paired with the 35-item fixed-length rule. When factoring in efficiency, optimal performance—defined as minimizing RMSE while maintaining moderate test length—was observed using WLE and MLE estimators paired with the SEM $< 0.224$ rule. A similar pattern emerged for peaked banks: MLE and WLE paired with the 35-item fixed-length rule yielded the lowest absolute RMSE but required the full test length. In this condition, the MLE and WLE estimators paired with SEM $< 0.224$ provided the best balance of accuracy and efficiency.

Table 2.

High Information Bank Results (J = 500).

Estimator	Variable rule	Variable RMSE (flat)	Variable length (flat)	Variable RMSE (peaked)	Variable length (peaked)	Fixed-length rule	Fixed RMSE (flat)	Fixed RMSE (peaked)
MLE	SEM $< 0.224$	0.226	26.2 ± 5.6	0.241	32.7 ± 13.4	$J = 20$	0.273	0.296
	SEM $< 0.316$	0.386	12.4 ± 3.4	0.419	18.3 ± 13.2	$J = 25$	0.235	0.261
	MI $< 0.05$	0.282	37.6 ± 15.1	0.312	34.0 ± 15.9	$J = 30$	0.213	0.239
	MI $< 0.10$	0.345	30.3 ± 15.9	0.374	26.8 ± 15.7	$J = 35$	0.197	0.224
	$Δ θ < 0.02$	0.287	26.9 ± 10.7	0.328	27.0 ± 12.3
	$Δ θ < 0.05$	0.373	14.6 ± 4.9	0.405	14.6 ± 5.6
WLE	SEM $< 0.224$	0.225	26.7 ± 5.9	0.242	32.8 ± 13.4	$J = 20$	0.270	0.297
	SEM $< 0.316$	0.376	13.0 ± 3.8	0.418	18.5 ± 13.3	$J = 25$	0.234	0.262
	MI $< 0.05$	0.366	33.8 ± 17.3	0.417	30.2 ± 17.3	$J = 30$	0.212	0.240
	MI $< 0.10$	0.473	24.9 ± 16.9	0.523	22.2 ± 16.2	$J = 35$	0.197	0.225
	$Δ θ < 0.02$	0.271	26.3 ± 9.5	0.290	26.9 ± 10.2
	$Δ θ < 0.05$	0.433	13.4 ± 4.6	0.450	13.7 ± 4.7
MAP	SEM $< 0.224$	0.253	24.5 ± 5.3	0.278	30.2 ± 13.4	$J = 20$	0.306	0.347
	SEM $< 0.316$	0.498	10.9 ± 3.3	0.548	12.9 ± 8.4	$J = 25$	0.259	0.301
	MI $< 0.05$	0.386	34.2 ± 16.8	0.442	30.4 ± 16.9	$J = 30$	0.231	0.273
	MI $< 0.10$	0.493	25.5 ± 16.5	0.550	22.9 ± 15.9	$J = 35$	0.211	0.254
	$Δ θ < 0.02$	0.234	30.0 ± 7.7	0.273	30.0 ± 7.5
	$Δ θ < 0.05$	0.400	15.3 ± 4.2	0.430	15.2 ± 4.1
EAP	SEM $< 0.224$	0.252	24.3 ± 4.9	0.265	29.5 ± 10.7	$J = 20$	0.294	0.328
	SEM $< 0.316$	0.391	12.3 ± 2.6	0.405	12.9 ± 3.3	$J = 25$	0.255	0.291
	MI $< 0.05$	0.328	35.9 ± 16.3	0.380	31.3 ± 16.5	$J = 30$	0.231	0.268
	MI $< 0.10$	0.409	27.4 ± 16.5	0.461	23.4 ± 15.6	$J = 35$	0.213	0.251
	$Δ θ < 0.02$	0.226	31.4 ± 6.2	0.264	31.6 ± 7.2
	$Δ θ < 0.05$	0.331	16.8 ± 3.4	0.362	16.7 ± 3.4

Note. Boldface values indicate the optimal performance (lowest RMSE or shortest test length) within each item bank and termination criterion condition.
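
The two SEM thresholds in Table 2 can also be read on the information scale. The worked values below assume the usual large-sample relation between the standard error of a likelihood-based estimate and test information, and assume that $θ$ is scaled to unit variance in the population; both are assumptions made here for illustration rather than statements taken from the Method section.

$$
\mathrm{SEM}(\hat{θ}) \approx \frac{1}{\sqrt{I(\hat{θ})}}
\quad\Rightarrow\quad
I(\hat{θ}) \gtrsim \frac{1}{0.224^{2}} \approx 19.9
\quad\text{and}\quad
I(\hat{θ}) \gtrsim \frac{1}{0.316^{2}} \approx 10.0
$$

$$
ρ \approx 1 - \mathrm{SEM}^{2}: \qquad 1 - 0.224^{2} \approx 0.95, \qquad 1 - 0.316^{2} \approx 0.90
$$

Under this reading, the stricter threshold asks for roughly twice the test information of the lenient one, which is consistent with the roughly doubled mean test lengths reported for SEM $< 0.224$ relative to SEM $< 0.316$ in the flat bank.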

Discussion

The current study evaluated the interactive effects of item bank information, bank shape, $θ$ estimators, and termination criteria on the bias, precision, and efficiency of CAT. The results indicate that the optimal configuration is not static; rather, it relies heavily on the interaction between the information density of the bank and the chosen $θ$ estimator.

Robustness of $θ$ Estimators

Generally, WLE proved to be the most robust estimator among the four methods across both low- and high-information conditions. As theoretically expected, likelihood-based bias remained negligible until the extremes of the $θ$ range. While MLE is known to be affected by extreme response patterns in the early stages of CAT, WLE incorporates a weighted penalty term that reduces finite-item bias. These findings strongly support T. Wang and Vispoel (1998), who first documented that variable-length CATs produce distinct bias patterns compared to fixed-length tests. The present results extend their work by demonstrating that WLE effectively neutralizes these variable-length bias fluctuations, whereas MLE remains vulnerable to them. Interestingly, in the high-information condition (500 items), MLE exhibited an unexpected, persistent positive bias even at the center of the $θ$ scale. This suggests that even with a high-quality bank, MLE remains sensitive to stopping rule interactions.

Conversely, the Bayesian methods (MAP and EAP) utilize a standard normal prior, which inevitably shrinks estimates toward the prior mean of $θ$ (0), resulting in higher bias at the extremes. Beyond this expected Bayesian bias, a notable artifact was observed in the conditional RMSE results for the likelihood-based estimators (MLE and WLE): an apparent, yet misleading, reduction in overall error at the extreme tails of the $θ$ distribution. This artificial drop in RMSE is driven by boundary effects and score compression (Baker & Kim, 2004). As simulees’ true $θ$ exceeded the difficulty range of the item bank, the probability of uniform response patterns (e.g., all-correct or all-incorrect) increased. Because the maximum likelihood estimate for such patterns is infinite, estimation algorithms must truncate the $θ$ estimate at predefined numerical search boundaries. This truncation artificially compresses the variance of the estimates, mathematically deflating the RMSE. Consequently, the lower RMSE observed at the extremes reflects an algorithmic boundary constraint rather than an increase in true measurement precision. Although WLE is slightly more computationally complex than MLE, the difference is negligible with modern computing power, making it the superior choice for minimizing bias while maintaining or improving precision and test efficiency.
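The boundary effect described above can be reproduced in a few lines. This is a minimal sketch, assuming a bounded grid search for the MLE on $[-4, 4]$ and four hypothetical 3PL items in the logistic metric; the bounds, the item parameters, and the function names are illustrative assumptions and are not the study's settings.

```python
import math

THETA_MIN, THETA_MAX = -4.0, 4.0   # hypothetical numerical search bounds

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response (logistic metric assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def log_lik(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern at theta."""
    total = 0.0
    for (a, b, c), u in zip(items, responses):
        p = p_3pl(theta, a, b, c)
        total += math.log(p) if u == 1 else math.log(1.0 - p)
    return total

def bounded_mle(items, responses, n_steps=800):
    """Grid-search MLE restricted to [THETA_MIN, THETA_MAX]."""
    width = THETA_MAX - THETA_MIN
    grid = [THETA_MIN + width * i / n_steps for i in range(n_steps + 1)]
    return max(grid, key=lambda t: log_lik(t, items, responses))

# Hypothetical items (a, b, c); not taken from the study's banks.
items = [(1.0, b, 0.2) for b in (-1.0, 0.0, 1.0, 2.0)]

# All-correct pattern: the likelihood keeps rising, so the estimate hits the bound.
print(bounded_mle(items, [1, 1, 1, 1]))
# Mixed pattern: the likelihood has an interior maximum.
print(bounded_mle(items, [1, 1, 0, 0]))
```

Every simulee who produces the all-correct pattern receives the same bounded estimate, so the spread of estimates among such simulees collapses. Because mean squared error is the sum of squared bias and variance, removing that variance lowers the conditional RMSE without any gain in real precision.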

$θ$ Estimator and Stopping Rule Interaction

A distinct interaction was observed between EAP and the termination criteria: EAP tended to terminate prematurely under the SEM rule but performed robustly with the $Δ θ$ rule. This confirms the early findings of Yi et al. (2001), who reported that $θ$ estimator choice significantly alters the behavior of stopping rules. Furthermore, the present results help clarify the recent inconsistencies noted by Lee and Han (2024). While Lee and Han found mixed performance when evaluating standard error-based termination, the current study isolates the source of this instability: the combination of Bayesian priors and SEM stopping rules. The Bayesian prior artificially lowers the standard error in the early stages of a CAT, effectively “tricking” the SEM rule into assuming high precision has been met. In contrast, the $Δ θ$ rule depends on the stability of the $θ$ estimate; because the prior prevents rapid shifts in $θ$ , EAP requires more items to demonstrate stability, leading to longer, more accurate tests.
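The way a prior can satisfy an SEM rule early can be illustrated numerically. This is a minimal sketch, assuming the normal approximation in which a normal prior with standard deviation `prior_sd` contributes `1 / prior_sd**2` to the posterior information, and assuming every administered item contributes a constant, hypothetical amount of information (`info_per_item = 0.5`). It is an illustration of the mechanism, not the study's estimation code.

```python
import math

def sem_likelihood(test_info):
    """Standard error implied by test information alone (likelihood-based)."""
    return 1.0 / math.sqrt(test_info)

def sem_posterior(test_info, prior_sd=1.0):
    """Posterior SD under a normal approximation: the prior adds 1/prior_sd**2."""
    return 1.0 / math.sqrt(test_info + 1.0 / prior_sd ** 2)

def items_to_stop(sem_fn, threshold, info_per_item=0.5, max_items=1000):
    """Number of items until sem_fn drops below threshold (hypothetical constant info)."""
    for n in range(1, max_items + 1):
        if sem_fn(n * info_per_item) < threshold:
            return n
    return max_items

threshold = 0.316  # lenient SEM threshold from Table 2
print(items_to_stop(sem_likelihood, threshold))  # likelihood-based SE
print(items_to_stop(sem_posterior, threshold))   # SE that includes the prior
print(sem_likelihood(0.5), sem_posterior(0.5))   # reported error after one item
```

With these illustrative numbers the rule is satisfied after 21 items when the standard error comes from the likelihood alone and after 19 items when the prior is included. The reported error after the first item is also much smaller with the prior, even though the data carry the same information in both cases.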

The Moderating Role of Bank Characteristics

The impact of bank shape was heavily moderated by the total information available in the bank, a finding that supports and refines the results of Babcock and Weiss (2012). In the low-information condition (100 items), the SEM rule failed to reach precision targets at the extremes of peaked banks, forcing simulees to take maximum-length tests. This aligns with C. Wang et al. (2019), who demonstrated analytically that when an item bank lacks informative items (a “bank gap”), the reduction in standard error becomes negligible. Because the standard SEM rule monitors only absolute precision rather than the rate of improvement, it fails to detect this stagnation. This mechanism offers a plausible explanation for Dwahdh and Alshraifin (2025), whose finding that SEM yielded no advantage over fixed-length rules was likely confounded by the density of their item bank. Crucially, however, this inefficiency disappeared in high-information banks (500 items), where the abundance of items allowed the SEM rule to function as intended regardless of shape. These results underscore the theoretical importance of the minimum information (MI) rule. Conceptually similar to the predicted standard error reduction (PSER; Choi et al., 2011) and standard error change (SEC; C. Wang et al., 2019) criteria, the MI rule is designed to detect exactly the type of local bank exhaustion that the standard SEM rule misses. Consequently, the present findings suggest that MI’s true utility lies not as a standalone rule, but as a conditional constraint or secondary termination rule: a “fail-safe” that prevents the runaway test lengths observed when strict SEM rules are applied to finite banks.
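One way to express the fail-safe idea is as an ordered set of checks evaluated after each item. This is a minimal sketch: the function name `stop_reason`, the default thresholds (borrowed from the lenient values in Table 2), and the maximum length of 50 items are assumptions made for illustration, not a configuration evaluated in this study.

```python
def stop_reason(sem, max_remaining_info, n_items,
                sem_target=0.316, mi_floor=0.10, max_items=50):
    """Return why the CAT should stop after the current item, or None to continue.

    sem                -- current standard error of the theta estimate
    max_remaining_info -- largest item information among unused items at theta-hat
    n_items            -- number of items administered so far
    """
    if sem < sem_target:
        return "sem"            # primary rule: precision target reached
    if max_remaining_info < mi_floor:
        return "mi_failsafe"    # no remaining item can move the SEM meaningfully
    if n_items >= max_items:
        return "max_length"     # hard cap as a last resort
    return None

# Example: precision not yet met, but the bank is locally exhausted.
print(stop_reason(sem=0.45, max_remaining_info=0.04, n_items=30))
```

Ordering the checks this way keeps the SEM rule in charge whenever the bank can still deliver the precision target, and lets the MI check end the test only when further items would add almost nothing.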

Practical Implications

For high-stakes testing where legal defensibility and equal precision are paramount, the optimal configuration depends on the bank size. For large, high-information banks (e.g., 500 items), WLE paired with a fixed-length rule (e.g., 35 items) yielded the lowest absolute bias, while WLE paired with a strict SEM rule provided the best balance of efficiency and accuracy. For smaller, low-information banks (e.g., 100 items), practitioners face a stricter trade-off. In these cases, WLE paired with the $Δ θ$ rule is the superior option for peaked banks to avoid the “running out of items” problem while maintaining acceptable accuracy. Although the MI rule in this study performed inconsistently, C. Wang et al. (2019) derived the theoretical relationship between item information and the change in standard error (SEC). They proposed that stopping when the gain in precision (SE drop) falls below a threshold is a more efficient strategy for finite banks than fixing an absolute standard error. Our findings regarding the $Δ θ$ rule support this general principle: by monitoring the stability of the estimate (which correlates with the inability of new items to shift $θ$ ), the algorithm avoids the exhaustive testing observed under the strict SEM rule. Practitioners must be aware, however, that the $Δ θ$ rule can occasionally lead to premature termination if the standard error remains high despite a stable estimate. Therefore, the present results concur with Babcock and Weiss (2012) that a minimum test length constraint (e.g., 10 items) should always be implemented when using the $Δ θ$ criterion.
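The minimum-length safeguard can be sketched as a guard in front of the change-in-estimate check. This sketch assumes the $Δ θ$ rule compares the two most recent provisional estimates, which is a common formulation but is an assumption here; the function name and the default values (threshold 0.05, minimum of 10 items) are illustrative.

```python
def delta_theta_stop(theta_history, threshold=0.05, min_items=10):
    """Return True when the change-in-estimate rule says to stop.

    theta_history -- provisional theta estimates, one per administered item
    threshold     -- stop when the last absolute change falls below this value
    min_items     -- never stop before this many items have been given
    """
    n = len(theta_history)
    if n < max(min_items, 2):
        return False  # too early: the estimate may be stable only by chance
    return abs(theta_history[-1] - theta_history[-2]) < threshold

# A stable estimate after only three items does not trigger termination.
print(delta_theta_stop([0.40, 0.41, 0.41]))
# The same stability after ten items does.
print(delta_theta_stop([0.9, 0.2, 0.6, 0.3, 0.5, 0.35, 0.45, 0.38, 0.41, 0.40]))
```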

Limitations and Future Research

A key limitation of this study is the use of simulated, unidimensional 3PLM data. Operational item banks often exhibit non-ideal characteristics, including multidimensionality and parameter drift. Furthermore, this study represents a “best-case scenario” as it did not incorporate content balancing or exposure control constraints. Future research should evaluate advanced termination criteria under more realistic constraints that include content balancing and exposure control mechanisms in testing environments where these constraints are appropriate.

Conclusion

The present study demonstrates that optimal CAT design relies heavily on the interaction between item bank information density, bank shape, and psychometric specifications. Overall, weighted likelihood estimation (WLE) emerged as the most robust estimator across all conditions. It offered stable parameter recovery without the shrinkage bias associated with Bayesian methods and successfully mitigated the extreme boundary estimation issues that persisted for MLE, even in high-information banks.

The simulations revealed that the influence of item bank shape is distinctly moderated by bank size. In low-information settings, a critical trade-off exists: while the SEM stopping rule ensures uniform precision, it results in excessive test lengths for peaked banks as the algorithm exhausts informative items. For these sparse, peaked banks, the $Δ θ$ rule provided a superior balance of accuracy and efficiency, though it requires careful pairing to avoid premature termination with Bayesian estimators. In contrast, high-information settings relax these constraints. An abundance of high-quality items renders bank shape less critical, allowing the SEM rule to function as a highly efficient termination criterion and enabling fixed-length rules to achieve maximum absolute accuracy.

Consequently, practical CAT configurations must be tailored to item bank quality. For high-stakes examinations utilizing large, high-information banks, practitioners should pair WLE with either a fixed-length rule (for maximum accuracy) or a strict SEM rule (for optimal efficiency). For lower-quality, peaked banks, a hybrid termination strategy is advisable. A strict SEM rule can serve as the primary criterion, paired with a minimum information (MI) constraint acting as a fail-safe to prevent inefficient test elongation when local items are exhausted. Alternatively, WLE paired with the $Δ θ$ rule offers a highly efficient solution for low-information contexts, provided a minimum test-length constraint is implemented.

Finally, while this study provides a comprehensive baseline using simulated unidimensional 3PLM data, operational testing introduces additional complexities. Future research should extend these findings by incorporating realistic operational constraints, such as content balancing and exposure control. In addition, systematically varying the arbitrary simulation parameters used in this study (e.g., threshold values for stopping rules, specific bank sizes, and length constraints) will further establish the generalizability of these optimal CAT configurations.

ORCID iD

Xinyu Liu

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Babcock, & Weiss (2012). Termination criteria in computerized adaptive tests: Do variable-length CATs provide efficient and effective measurement? Journal of Computerized Adaptive Testing, 1, 1–18. https://www.jcatpub.net/index.php/jcat/article/view/16

Baker, F. B., & Kim, S.-H. (Eds.). (2004, July 19). Item response theory: Parameter estimation techniques (2nd ed.). CRC Press. https://doi.org/10.1201/9781482276725

Bock, R. D. (1983). The discrete Bayesian. In Wainer & Messick (Eds.), Principals of modern psychological measurement (pp. 103–115). Routledge. https://doi.org/10.4324/9780203056653

Bock, R. D., & Aitkin (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46(4), 443–459. https://doi.org/10.1007/BF02293801

Choi, S. W., Grady, M. W., & Dodd, B. G. (2011). A new stopping rule for computerized adaptive testing. Educational and Psychological Measurement, 71(1), 37–53. https://doi.org/10.1177/0013164410387338

Dwahdh, & Alshraifin (2025). The impact of computerized adaptive test termination rules on accuracy across different ability estimation methods. Eurasia Journal of Mathematics, Science and Technology Education, 21(1), em2571. https://doi.org/10.29333/ejmste/15897

Gialluca, K. A., & Weiss, D. J. (1979). Efficiency of an adaptive inter-subtest branching strategy in the measurement of classroom achievement (RR796). Defense Technical Information Center. https://apps.dtic.mil/sti/html/tr/ADA080956/

Hambleton, R. K., & Swaminathan (1985). Item response theory: Principles and applications. Springer. https://doi.org/10.1177/014662168500900315

Hart, D. L., Mioduski, J. E., & Stratford, P. W. (2005). Simulated computerized adaptive tests for measuring functional status were efficient with good discriminant validity in patients with hip, knee, or foot/ankle impairments. Journal of Clinical Epidemiology, 58(6), 629–638. https://doi.org/10.1016/j.jclinepi.2004.12.004

Hart, D. L., Mioduski, J. E., Werneke, M. W., & Stratford, P. W. (2006). Simulated computerized adaptive test for patients with lumbar spine impairments was efficient and produced valid measures of function. Journal of Clinical Epidemiology, 59(9), 947–956. https://doi.org/10.1016/j.jclinepi.2005.10.017

Lee, & Han, K. T. (2024). Evaluating effectiveness of standard error of score estimation as a CAT termination criterion. Journal of Computerized Adaptive Testing, 11(2), 13–29. https://doi.org/10.7333/2410-1102013

Lord, F. M. (1983). Unbiased estimators of ability parameters, of their variance, and of their parallel-forms reliability. Psychometrika, 48(2), 233–245. https://doi.org/10.1007/BF02294018

Maurelli, V. A., & Weiss, D. J. (1981, November). Factors influencing the psychometric characteristics of an adaptive testing strategy for test batteries. https://eric.ed.gov/?id=ED212676

Meijer, R. R., & Nering, M. L. (1999). Computerized adaptive testing: Overview and introduction. Applied Psychological Measurement, 23(3), 187–194. https://doi.org/10.1177/01466219922031310

Ravand (2015). Assessing testlet effect, impact, differential testlet, and item functioning using cross-classified multilevel measurement modeling. Sage Open, 5(2), 1–9. https://doi.org/10.1177/2158244015585607

Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika, 34(S1), 1–97.

Stafford, R. E., Runyon, C. R., Casabianca, J. M., & Dodd, B. G. (2019). Comparing computer adaptive testing stopping rules under the generalized partial-credit model. Behavior Research Methods, 51(3), 1305–1320. https://doi.org/10.3758/s13428-018-1068-x

Tai, M. H., DeWeese, J. N., & Weiss (2025). Stochastic curtailment: A new approach to improve efficiency of variable-length computerized adaptive tests. Journal of Computerized Adaptive Testing, 12(4), 164–235. https://doi.org/10.7333/2508-1204165

Wang, Weiss, D. J., & Shang (2019). Variable-length stopping rules for multidimensional computerized adaptive testing. Psychometrika, 84(3), 749–771. https://doi.org/10.1007/s11336-018-9644-7

Wang (2001). Precision of Warm’s weighted likelihood estimates for a polytomous model in computerized adaptive testing. Applied Psychological Measurement, 25(4), 317–331. https://doi.org/10.1177/01466210122032163

Wang, Hanson, B. A., & Lau, C.-M. A. (1999). Reducing bias in CAT trait estimation: A comparison of approaches. Applied Psychological Measurement, 23(3), 263–278. https://doi.org/10.1177/01466219922031383

Wang, & Vispoel, W. P. (1998). Properties of ability estimation methods in computerized adaptive testing. Journal of Educational Measurement, 35(2), 109–135. https://doi.org/10.1111/j.1745-3984.1998.tb00530.x

Ware, J. E., Gandek, Sinclair, S. J., & Bjorner, J. B. (2005). Item response theory and computerized adaptive testing: Implications for outcomes measurement in rehabilitation. Rehabilitation Psychology, 50(1), 71–78. https://doi.org/10.1037/0090-5550.50.1.71

Warm, T. A. (1989). Weighted likelihood estimation of ability in item response theory. Psychometrika, 54(3), 427–450. https://doi.org/10.1007/BF02294627

Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6(4), 473–492. https://doi.org/10.1177/014662168200600408

Weiss, D. J. (2014). New horizons in testing: Latent trait test theory and computerized adaptive testing. Elsevier. https://doi.org/10.1016/C2009-0-03014-1

Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21(4), 361–375. https://doi.org/10.1111/j.1745-3984.1984.tb01040.x

Weiss, D. J., & Şahin (2024). Computerized adaptive testing: From concept to implementation. The Guilford Press. https://www.guilford.com/books/Computerized-Adaptive-Testing/Weiss-Sahin/9781462554515

Yi, Wang, & Ban, J.-C. (2001). Effects of scale transformation and test-termination rule on the precision of ability estimation in computerized adaptive testing. Journal of Educational Measurement, 38(3), 267–292. https://doi.org/10.1111/j.1745-3984.2001.tb01127.x
