A Knowledge-Based Approach for Item Exposure Control in Computerized Adaptive Testing

Abstract

The purpose of this study is to investigate a functional relation between item exposure parameters (IEPs) and item parameters (IPs) over parallel pools. This functional relation is approximated by a well-known tool in machine learning. Let P and Q be parallel item pools and suppose IEPs for P have been obtained via a Sympson and Hetter–type simulation. Based on these simulated parameters, a functional relation k = f_P (a, b, c) relating IPs to IEPs of P is obtained by an artificial neural network and used to estimate IEPs of Q without tedious simulation. Extensive experiments using real and synthetic pools showed that this approach worked well for many variants of the Sympson and Hetter procedure. It worked especially well for the conditional Stocking and Lewis multinomial selection procedure and the Chen and Lei item exposure and test overlap control procedure. This study provides a first step toward an alternative means of estimating IEPs without iterative simulation.

Keywords

computerized adaptive testing, item exposure control, item exposure parameters, parallel item pools, machine learning, artificial neural networks

Introduction

Computerized adaptive testing (CAT) has proved to be an effective electronic testing mechanism for many real-world test practices. Each year, the CAT version of the Armed Services Vocational Aptitude Battery (CAT-ASVAB) is administered to hundreds of thousands of applicants entering or advancing their careers in the military services. Using item response theory, CAT adapts test items to a test-taker’s estimated ability and may shorten the test length without losing assessment precision. CAT is emerging as a useful alternative to conventional paper-and-pencil group-administered testing.

One practical advantage of computerized adaptive tests is that they can be administered on a flexible schedule rather than at fixed times. The convenience and flexibility for examinees, however, may severely compromise test security if item exposure is not well controlled. Chang and Ansley (2003) gave a comparative study of various item exposure control methods used in CAT. To date, randomized item selection (e.g., McBride & Martin, 1983) and conditioned item selection (e.g., Sympson & Hetter, 1985) are the main approaches used to avoid item overexposure in CAT (Way, 1998). Although randomized item selection is simple and easy to implement, this approach does not control item exposure well; it merely shuffles items (Sympson & Hetter, 1985). On the other hand, conditioned item selection can control item exposure well so that most items are administered with exposure rates less than a prespecified maximum item exposure rate.

The procedure of conditioned item selection was originally proposed by Sympson and Hetter (SH) in 1985 and served as a foundation for other conditioned item selection methods. Because of the high-stakes nature of the test results, CAT-ASVAB has been administered with item exposure control based on the SH procedure (Hetter & Sympson, 1997; Segall & Moreno, 1999). The SH procedure was designed to directly control item exposure in a probabilistic fashion. Three kinds of probability are defined in this approach: P(S), the probability that an item is selected as the best item based on a CAT algorithm; P(A), the probability that an item is actually administered to examinees; and P(A|S), the conditional probability that an item is administered, given that it is selected as the best item, also called the item exposure parameter (IEP). Because an item must be selected before it can be administered, the relationship among these three probabilities is P(A) = P(A|S) × P(S). To meet the requirement that no item is administered more often than a prespecified maximum item exposure rate r_max, the condition thus becomes P(A|S) × P(S) ≤ r_max.

Although it is simple to decide P(A|S) as long as P(S) is known, determining P(S) is not trivial. Furthermore, when P(A|S) for an item is decided, the P(S) values for the rest of the items in the pool will change, as will the P(A|S) values associated with them. Thus, a series of CAT simulations is needed to find stabilized P(S) and P(A|S) values under which all P(A) values are less than or equal to r_max. The procedure for conducting iterative simulation is described later in the article.

Way (1998) observed that probabilistic item exposure control such as the SH procedure may not be adequate to guarantee a secure CAT implementation and suggested a system of rotating item pools to prevent a security breach. These pools should have similar distribution of contents and statistical attributes to support uniform measurement quality to examinees. Using techniques of constrained combinatorial optimization, Ariel, Veldkamp, and van der Linden (2004) have established a procedure to construct rotating item pools. These pools have identical distribution of item parameters (IPs) and thus are parallel pools constructed from a master pool. Because the SH-type IEPs cannot be solved analytically (van der Linden, 2003), finding a mechanism to estimate IEPs of parallel pools without tedious simulation becomes a challenging and interesting research problem.

Purpose of the Study

Today, most SH-type procedures find IEPs by iterative simulation of CATs. Although these parameters cannot be solved analytically (van der Linden, 2003), this does not mean that they cannot be estimated without tedious simulation. The purpose of this study is to investigate a functional relation between IEPs and IPs over parallel pools. This functional relation is approximated by a knowledge-based tool in machine learning. Because rotating parallel pools are useful in CAT implementations, it is suitable to restrict this study to parallel pools. Let P and Q be parallel item pools and suppose IEPs of P have been found via conventional SH-type simulations. Based on these simulated IEPs, a functional relation k = f_P (a, b, c) relating IPs to IEPs of the pool P is approximated by an artificial neural network (ANN) and used to estimate IEPs of Q without tedious simulation.

The original Sympson and Hetter (1985) procedure uses the maximal item information criterion to select items and adjusts IEPs in successive iterations with a well-known formula. Later variants of the SH procedure modify either the item selection criterion or the IEP updating formula to achieve more robust results. This study examined many popular extensions of the SH procedure. Various item exposure controls based on the SH procedure will be discussed next, which is followed by an introduction to the tools used in this study, particularly the ANN from machine learning. Then, item pools will be introduced with an operational definition of parallel pools. Simulation studies and results will follow the description of item pools. A discussion of the results and possible future studies will close this article.

Item Exposure Control

Item exposure control in tests is needed to maintain test security in practical CATs. The popular SH procedure and its many variants used in this study are described in the following. These are the primary procedures adopted in today’s CAT implementation or research studies regarding item exposure control.

The Sympson and Hetter (1985) Procedure

An item response model, which specifies how likely an examinee is to answer an item correctly given a few characteristics of the item and the latent ability of the examinee, must be specified first. This study used the three-parameter logistic item response model (3PLM), in which the probability of a correct response given an ability level $θ$ is defined by $P_{i}(θ) = c_{i} + \frac{1 - c_{i}}{1 + \exp[-1.7 a_{i} (θ - b_{i})]}$ (1), where a_i is the item discrimination parameter, b_i is the item difficulty parameter, and c_i is the pseudo-guessing parameter of item i. The Sympson and Hetter (1985) procedure uses the maximal item information criterion to select items. A selected item is administered when a uniform random number is smaller than the IEP of that item. If the item is not administered, it is removed from the pool for the remainder of that test-taker’s test, and the item with the next best item information is selected for administration consideration. An unbiased value of zero was assumed as the initial ability estimate for a test-taker, and expected a posteriori (EAP) estimation was used to estimate the latent ability after an item was administered.
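The selection and administration rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the item information formula is the standard Fisher information for the 3PLM, which the text does not write out, and the function names `p3pl`, `item_information`, `select_item`, and `administer` are hypothetical.

```python
import math
import random

D = 1.7  # scaling constant that appears in Equation 1


def p3pl(theta, a, b, c):
    """Probability of a correct response under the 3PLM (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def item_information(theta, a, b, c):
    """Standard 3PLM Fisher information (assumed; not stated in the text)."""
    p = p3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def select_item(theta, pool, available):
    """Index of the available item with maximal information at theta.

    `pool` is a list of (a, b, c) tuples; `available` is an iterable of indices.
    """
    return max(available, key=lambda i: item_information(theta, *pool[i]))


def administer(k, rng):
    """SH filter: administer a selected item only if a uniform draw is below its IEP k."""
    return rng.random() < k
```

In a simulated test, an item rejected by `administer` would be dropped from `available` for that simulee and `select_item` called again, as the paragraph above describes.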

After a target r_max is set and an initial value of 1.0 is assumed for the IEP k of all items, a series of CAT simulations is conducted on a population of simulees. An iteration of simulation consists of a complete run of CATs for all simulees of interest according to the CAT algorithm specified above. At the end of each iteration, P(S) and P(A) are determined for each item by computing the proportion of times an item has been selected and administered, respectively. The maximum value of P(A) in the item pool is noted and k is redefined for each item as follows: $k = \begin{cases} r_{\max} / P(S), & \text{if } P(S) > r_{\max} \\ 1, & \text{if } P(S) \leq r_{\max} \end{cases}$ (2). To guarantee that a test-taker will take a complete test before exhausting the item pool, the l largest IEPs are set to 1.0, where l is the test length. The iterative simulation is repeated until all k values have stabilized and the maximum value of P(A) is slightly above r_max and oscillates in successive simulations.
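As a rough sketch of the update step just described, the following Python applies Equation 2 and then the fix-up that sets the l largest IEPs to 1.0. The CAT simulation itself is abstracted behind a hypothetical callable `simulate(k)` that returns the lists of P(S) and P(A); a fixed number of iterations stands in for the stabilization check, which is a simplification.

```python
def sh_update(p_selected, r_max, test_length):
    """One Sympson-Hetter update of all IEPs (Equation 2) plus the fix-up step."""
    k = [r_max / ps if ps > r_max else 1.0 for ps in p_selected]
    # Fix-up: force the `test_length` largest IEPs to 1.0 so that a
    # complete test can always be administered.
    order = sorted(range(len(k)), key=lambda i: k[i], reverse=True)
    for i in order[:test_length]:
        k[i] = 1.0
    return k


def sh_iterate(simulate, n_items, r_max, test_length, n_iter=10):
    """Alternate simulation and update, starting from k = 1.0 for every item.

    `simulate(k)` is a hypothetical callable returning (P(S) list, P(A) list).
    """
    k = [1.0] * n_items
    for _ in range(n_iter):
        p_selected, _p_administered = simulate(k)
        k = sh_update(p_selected, r_max, test_length)
    return k
```

For instance, with r_max = 0.2 an item selected for half of the simulees, P(S) = 0.5, receives k = 0.4, so its expected exposure rate in the next iteration is about 0.2 if P(S) does not change.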

The van der Linden (2003) Alternatives

After analyzing formal properties of the SH procedure, van der Linden pointed out that the SH procedure has a couple of unpleasant features: it is generally time-consuming, and it can show unexpected behavior of exposure rates for some items at some steps and may have difficulty converging to a stable state at all (van der Linden, 2003). The researcher proposed a few alternative formulas for the IEP updating step to achieve more effective results. Some of these alternatives have shown substantial improvements over the SH procedure. Among them, Formula 12, recommended by the author, was considered in this study. This formula defines the IEP updating step as follows: $k^{(t+1)} = \begin{cases} r_{\max} / P^{(t)}(S) - γ, & \text{if } P^{(t)}(A) > r_{\max} \\ k^{(t)}, & \text{if } P^{(t)}(A) \leq r_{\max} \end{cases}$ (3), where $0 \leq γ \leq r_{\max} / P^{(t)}(A)$ is an adjustable parameter and the superscript indicates the iteration number. A few special features of this formula can be noted: (a) the adjusting criterion is based on the observed exposure rate P(A); and (b) when the observed exposure rate of an item does not exceed the target, its IEP is not adjusted. Experiments have shown that with this new formula, the maximum observed exposure rate decreases steadily with the number of iterations, and IEPs converge to a stable state faster than with the SH procedure.
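To make the contrast with Equation 2 concrete, here is a small Python sketch of the update in Equation 3 as it is written above. The function name `vdl_update` is hypothetical, and no clamping of the result to [0, 1] is applied because the text does not mention any.

```python
def vdl_update(k_old, p_selected, p_administered, r_max, gamma=0.0):
    """Update IEPs following Equation 3.

    An item's IEP changes only when its observed exposure rate P(A)
    exceeds r_max; otherwise the previous value is carried over.
    """
    k_new = []
    for k, ps, pa in zip(k_old, p_selected, p_administered):
        if pa > r_max:
            k_new.append(r_max / ps - gamma)
        else:
            k_new.append(k)
    return k_new
```

Unlike the SH update, an item whose IEP has already been lowered is not reset to 1.0 merely because its P(S) falls below r_max, which is feature (b) noted above.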

The Stocking and Lewis (1998, 2000) Multinomial Procedure

The SH procedure is not flawless. Stocking (1994) pointed out that this procedure may not converge during its simulation stage. Instead of setting the highest IEPs back to 1.0 in the fix-up step of the SH procedure, Stocking and Lewis (1998, 2000) proposed a more robust multinomial model to select items. In the basic unconditional model of their approach, a list of items ordered from the highest item information to the lowest item information together with the IEPs k_i is formed. New probabilities $l_{i} = (1 - k_{1}) \cdots (1 - k_{i-1}) k_{i}$ are computed and normalized to form a multinomial distribution from which an item is selected for administration consideration. A proper length for the list is approximated by the ratio of the pool size to the test length. The unconditional multinomial model conducts iterative simulation on a population of simulees with a particular distribution of abilities. However, the more advanced conditional Stocking and Lewis procedure conducts iterative simulation conditional on ability. The ability scale is first discretized into M levels: $θ_{m}$, m = 1, …, M. For each $θ_{m}$, simulation is conducted using simulees of ability $θ_{m}$ to derive IEPs for this ability level. This results in a matrix of IEPs whose columns are the parameters for the discretized ability levels. During the simulation or operational phase, the column with the ability level closest to the estimated ability of a test-taker is used to control item administration in a multinomial selection fashion (Stocking & Lewis, 2000).
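The multinomial probabilities in the unconditional model can be computed directly from the ordered IEPs. The following Python is a minimal sketch under the assumption that at least one IEP in the list is positive; `multinomial_probs` and `draw_item` are hypothetical names, and the list is assumed to be already ordered by decreasing item information.

```python
import random


def multinomial_probs(k_ordered):
    """Return normalized l_i = (1 - k_1) ... (1 - k_{i-1}) k_i for an ordered list of IEPs."""
    probs = []
    none_taken_yet = 1.0  # running product (1 - k_1) ... (1 - k_{i-1})
    for k in k_ordered:
        probs.append(none_taken_yet * k)
        none_taken_yet *= 1.0 - k
    total = sum(probs)  # assumed > 0
    return [p / total for p in probs]


def draw_item(k_ordered, rng):
    """Draw the position (in the ordered list) of the item to administer."""
    probs = multinomial_probs(k_ordered)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

For the conditional procedure, `k_ordered` would be taken from the column of the IEP matrix whose ability level is closest to the current ability estimate.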

The Chen and Lei (2005) Item Exposure and Test Overlap Control

Test security can be considered at two levels: item exposure and test overlap. Item exposure control limits the exposure rate of individual items, whereas test overlap is concerned with how similar the tests received by two examinees are on average. Davey and Parshall (1995) proposed an item selection mechanism conditional on items administered earlier in a test to solve the test overlap issue. The Davey and Parshall procedure is time-consuming and may be difficult to implement in a practical test environment (Stocking & Lewis, 1998). Using a relationship between the test overlap rate and the variance of observed exposure rates, Chen and Lei (2005) developed a modified SH procedure to control item exposure and test overlap at the same time. In their approach, given a desired test overlap rate T₀, the corresponding variance $S_{0}^{2}$ for observed exposure rates is first solved from the following equation: $T_{0} = \frac{S_{0}^{2} + μ^{2}}{μ}$ (4), where μ is equal to the test length divided by the pool size. Let $S_{1}^{2}$ denote the variance of observed exposure rates after an iteration of the SH procedure is completed. If $S_{1}^{2}$ is smaller than $S_{0}^{2}$, update IEPs as usual by Equation 2 and continue to the next run of simulation. Otherwise, for each item i, adjust the observed exposure rate from P_i(A) to $P'_{i}(A)$ according to the following formula: $P'_{i}(A) = S_{0} \left[\frac{P_{i}(A) - μ}{S_{1}}\right] + μ$ (5). Then, the updated IEP $k'_{i}$ from Equation 2 is further modified as follows: $k_{i} = \min(k'_{i}, P'_{i}(A) / P_{i}(S))$ (6). The iterative simulation is monitored from two perspectives: the maximum observed exposure rate and the test overlap rate. When these two indices stabilize and are close to the targets r_max and T₀, the iteration stops and the simulated IEPs can be used to control exposure at both the item and test levels (Chen & Lei, 2005).
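Equations 4 to 6 can be traced in code as follows. This Python sketch makes several assumptions that the text does not settle: the variance of observed exposure rates is taken as the population variance about μ, items that were never selected keep their Equation 2 value, and the names `target_variance` and `chen_lei_adjust` are hypothetical.

```python
import math


def target_variance(t0, test_length, pool_size):
    """Solve Equation 4 for S_0^2, which gives S_0^2 = mu * (T_0 - mu)."""
    mu = test_length / pool_size
    return mu * (t0 - mu)


def chen_lei_adjust(k_prime, p_selected, p_administered, t0, test_length):
    """Apply Equations 5 and 6 to the IEPs k' produced by Equation 2."""
    n = len(p_administered)
    mu = test_length / n
    s0_sq = target_variance(t0, test_length, n)
    # Assumption: population variance about mu.
    s1_sq = sum((pa - mu) ** 2 for pa in p_administered) / n
    if s1_sq < s0_sq:
        return list(k_prime)  # overlap target already met; keep Equation 2 values
    scale = math.sqrt(s0_sq / s1_sq)  # S_0 / S_1
    adjusted = []
    for kp, ps, pa in zip(k_prime, p_selected, p_administered):
        pa_new = scale * (pa - mu) + mu  # Equation 5
        # Equation 6; never-selected items (ps == 0) are left unchanged (assumption).
        adjusted.append(min(kp, pa_new / ps) if ps > 0 else kp)
    return adjusted
```

With a test length of 20 and a pool of 360 items, μ = 1/18, so a target overlap of T₀ = 0.2 corresponds to a target variance of about 0.008.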

A Structural Requirement for Tests—the Content-Balancing Requirement

A test may need to explore an examinee’s ability in different content areas of a subject at the same time. Content balancing can be added as a structural constraint to a CAT implementation. Items from different content areas must be administered in a preset manner such that the content coverage is balanced (Leung, Chang, & Hau, 2003). The original SH procedure can be modified to handle this content-balancing requirement. To do so, one must design a mechanism to select a content area before an item is selected from that area. For example, the content area may be chosen randomly or sequentially. Another approach is to choose the area farthest from its preset coverage requirement (Kingsbury & Zara, 1989).
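The last selection rule can be illustrated with a short Python sketch. It assumes that "farthest from its preset coverage requirement" means the largest shortfall of the observed share below the target share; the function name and the dictionary layout are hypothetical.

```python
def next_content_area(targets, administered_counts):
    """Pick the content area whose observed share is farthest below its target share.

    `targets` maps area -> target proportion; `administered_counts` maps
    area -> number of items administered so far in the current test.
    """
    total = sum(administered_counts.values())

    def shortfall(area):
        observed = administered_counts.get(area, 0) / total if total else 0.0
        return targets[area] - observed

    return max(targets, key=shortfall)
```

The SH filter would then be applied to the most informative item within the chosen area.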

The Chang and Ying (1999) Stratification Procedure

Item exposure control can be viewed from two complementary perspectives: item overexposure and item underexposure. Overexposed items result in test security problems, whereas underexposed items waste the resources devoted to developing the test items. It has been found that in a 3PLM environment, items with high discrimination parameters are frequently overexposed, whereas items with low discrimination parameters are selected less often for administration. To overcome this exposure imbalance problem, Chang and Ying (1999) proposed an a-stratified multistage CAT procedure, in which items are stratified into a number of levels based on the discrimination parameter. The researchers found that with these stratified procedures, CAT operations were efficient and well balanced with respect to item usage. The SH procedure can be added to deal with item overexposure in the stratified multistage procedure.

Method

Approximating a functional relation between IEPs and IPs is considered a regression problem in statistics. Multiple linear regression (MLR) is commonly used to investigate a linear relationship between independent and dependent variables. However, it will be shown that MLR is not good enough to estimate IEPs from IPs. More flexible regression techniques from other fields must be considered.

ANN for Function Approximation

ANN has been successfully applied to solve many function approximation problems in engineering and social sciences. An ANN simulates the neural system of a brain to learn patterns from examples and uses the learned knowledge to make predictions for future data (Gupta, Jin, & Homma, 2003). A basic data processing unit in a neural net is called a neuron, which is connected to other neurons via synapses. The structure of an ANN refers to the number of neurons and the way they are distributed and connected. To simplify the computation, neurons are scattered into layers and information is transferred from layer to layer. Figure 1 shows the structure of a typical two-layer feed-forward neural net. The input layer represents the independent variables in a function approximation problem, for example, neurons for the IPs a, b, and c. The output layer corresponds to the dependent variables, for example, a neuron for the IEP k in Figure 1. Layers between the input and output layers are called hidden layers. An ANN with hidden layers is also called a multilayer perceptron (MLP). Without a hidden layer, a simple perceptron can hardly learn any interesting functions in practical applications (Minsky & Papert, 1969). Although the neural system of a real brain may not be arranged in such a layer style, it has been shown that an MLP can approximate arbitrarily well any continuous decision region provided that there are enough layers and neurons (Gallant & White, 1992).

FIGURE 1.

Schema for a two-layer feed-forward artificial neural network.

Each connection between two neurons (i.e., a synapse) has a weight to be learned from training examples. In a feed-forward network, data flow from the input layer to the hidden layers and to the output layer in one direction. A hidden layer works like an intermediate information store that aggregates data from the previous layer. Data computation at a neuron consists of a weighted summation step followed by an activation step, $o_{j} = f(\sum_{i} w_{ij} o_{i})$, where neuron o_j receives input from neurons o_i in the previous layer through weights w_ij, and f is an activation function. A commonly used activation function is the sigmoid function $f(s) = 1 / (1 + e^{-s})$.

Using an ANN to solve a function approximation problem is a two-phase procedure: the training phase and the operational phase. In the training phase (i.e., the knowledge acquiring phase), examples with known input-output pairs are used to adjust the synaptic weights so that computed outputs from the network are as close as possible to the target outputs. This is commonly done with the so-called back-propagation technique as follows. Let (a, b, c, t) be a training example in Figure 1 with (a, b, c) denoting the IPs of an item and t denoting the simulated IEP. Therefore, t is the target output for the input (a, b, c). To explain the back-propagation technique, the notation is modified as follows: the input layer is denoted by x_i, that is, x₁ = a, x₂ = b, x₃ = c; the hidden layer is denoted by z_j, that is, $z_{j} = f(\sum_{i=1}^{3} w_{ij} x_{i})$; and the final network output is denoted by $y = f(\sum_{j=1}^{4} v_{j} z_{j})$. The purpose of training is to minimize the difference between y and t, and this is conveniently measured by the squared error function E = (t − y)²/2. The error is indeed a function of the network weights: $E(w_{ij}, v_{j}) = \left(t - f\left(\sum_{j=1}^{4} v_{j} f\left(\sum_{i=1}^{3} w_{ij} x_{i}\right)\right)\right)^{2} / 2$.

Although this is a complex composite function, a simple formula for the weight adjustment can be derived using the steepest descent method. The steepest descent method adjusts the current weight by a multiple of the negative gradient direction: $\begin{aligned} v_{j}^{\text{new}} &= v_{j}^{\text{old}} + Δv_{j}, & Δv_{j} &= -η \frac{\partial E}{\partial v_{j}} \\ w_{ij}^{\text{new}} &= w_{ij}^{\text{old}} + Δw_{ij}, & Δw_{ij} &= -η \frac{\partial E}{\partial w_{ij}} \end{aligned}$ (7), where η is the learning rate. With the chain rule of calculus and a special feature of the sigmoid function, f′(s) = f(s)(1 − f(s)), the weight adjustment can be summarized as Δv_j = η(t − y) y (1 − y) z_j and Δw_ij = η(t − y) y (1 − y) z_j (1 − z_j) v_j x_i. For a general multilayer neural network with multiple dependent variables, the weight adjustment is conducted backward from the output layer to the input layer using the formula $Δw_{ij} = -η (\partial E / \partial w_{ij}) = η δ_{j} o_{i}$, where o_i is the computed output of a neuron in the layer preceding neuron j and δ_j has two formulas depending on whether neuron j is in the output layer: $δ_{j} = \begin{cases} (t_{j} - y_{j}) y_{j} (1 - y_{j}), & \text{output layer} \\ o_{j} (1 - o_{j}) \sum_{k} δ_{k} w'_{jk}, & \text{hidden layer} \end{cases}$ In the above formula, y_j denotes the network output of a dependent variable and t_j is the corresponding target output. When neuron j is in a hidden layer, the summation $\sum_{k}$ runs over the neurons in the immediately following layer to which j is connected. Notice that the $δ_{k}$ values in the sum have been computed in a previous step (i.e., back-propagated). A detailed derivation of the above formulas and also a good tutorial on neural networks can be found in Munakata (1998, chap. 2).
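The closed-form adjustments Δv_j and Δw_ij can be checked with a small pure-Python sketch of the 3-input, 4-hidden-neuron, 1-output network used in the derivation. This is only an illustration of the formulas above, not the NeuroSolutions setup used in the study; like the derivation, it omits bias terms, and the names `forward` and `train_step` are hypothetical.

```python
import math
import random


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(x, w, v):
    """x: inputs (a, b, c); w[i][j]: input-to-hidden weights; v[j]: hidden-to-output weights."""
    z = [sigmoid(sum(w[i][j] * x[i] for i in range(len(x)))) for j in range(len(v))]
    y = sigmoid(sum(v[j] * z[j] for j in range(len(v))))
    return z, y


def train_step(x, t, w, v, eta):
    """One steepest-descent update (Equation 7) for a single example, in place.

    Returns the squared error E = (t - y)^2 / 2 measured before the update.
    """
    z, y = forward(x, w, v)
    delta_out = (t - y) * y * (1.0 - y)
    for j in range(len(v)):
        # Hidden-layer delta uses v_j before it is updated.
        delta_hidden = delta_out * v[j] * z[j] * (1.0 - z[j])
        for i in range(len(x)):
            w[i][j] += eta * delta_hidden * x[i]
        v[j] += eta * delta_out * z[j]
    return 0.5 * (t - y) ** 2
```

A training epoch would call `train_step` once for each example (a_i, b_i, c_i, k_i) in the training pool, starting from small random weights.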

During the training phase, network weights are initialized by random numbers first. Then, training examples are sequentially or randomly fed into the network to adjust weights using Equation 7. An epoch of network training is one complete presentation of the training examples to the network. After the network has been trained satisfactorily, it enters the operational phase. Using the trained weights, data with inputs only can be fed into the network to compute outputs. Although the training phase may take some time to obtain satisfactory synaptic weights, the operational phase is usually fast. For example, it took a trained MLP only seconds to compute all IEPs of a pool.

The Proposed Knowledge-Based Approach for Item Exposure Control

First, the input-output pairs of data (a_i, b_i, c_i, k_i), i = 1, …, N = pool size, must be prepared with an SH-type procedure for a training pool, say P. Thus, the target IEP k_i is obtained from a series of CAT simulations. These data constitute the training examples for an MLP to learn a functional relation k = f_P (a, b, c), which is then used to estimate IEPs of parallel pools.

Baringhaus and Franz (2004) Multivariate Two-Sample Test

Parallelism between two item pools will be defined in terms of the sample distribution of IPs in the pools. To perform a multivariate two-sample test, a test statistic from Baringhaus and Franz (2004) was adopted. The statistic is the difference between the sum of all the Euclidean interpoint distances between observations from the two different samples and one-half of the two corresponding sums of distances between observations within the same sample. Baringhaus and Franz found that this new test provided power similar to that of Hotelling’s T² test under the assumption of normal distributions. In addition, the test also performed well in a distribution-free environment. The Baringhaus and Franz test can be implemented with the cramer package under the free R software environment for statistical computing.
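For readers without R at hand, the statistic described above can be sketched in plain Python. The scaling factor mn/(m + n) and the 1/(mn), 1/(2m²), 1/(2n²) weights follow the usual statement of the Baringhaus and Franz statistic and should be treated as an assumption here, since the text describes the statistic only in words. The p value below uses a simple permutation scheme and is not the cramer package's implementation.

```python
import math
import random


def _sum_dist(xs, ys):
    """Sum of Euclidean distances over all ordered pairs (x, y)."""
    return sum(math.dist(x, y) for x in xs for y in ys)


def bf_statistic(xs, ys):
    """Between-sample mean distance minus half of each within-sample mean distance, scaled."""
    m, n = len(xs), len(ys)
    between = _sum_dist(xs, ys) / (m * n)
    within_x = _sum_dist(xs, xs) / (2.0 * m * m)
    within_y = _sum_dist(ys, ys) / (2.0 * n * n)
    return (m * n / (m + n)) * (between - within_x - within_y)


def permutation_p_value(xs, ys, n_perm, rng):
    """Share of random relabelings whose statistic is at least the observed one."""
    observed = bf_statistic(xs, ys)
    pooled = list(xs) + list(ys)
    m = len(xs)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if bf_statistic(pooled[:m], pooled[m:]) >= observed:
            count += 1
    return count / n_perm
```

Here each observation would be an item's (a, b, c) triple, and a large p value would mean that the hypothesis of a common distribution is not rejected. The pairwise loops are quadratic in the pool size, so this sketch is slow for pools of 360 items with many permutations.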

Taylor and Thompson (1986) Random Data Generator

In addition to real parallel pools obtained from an operational CAT, parallel pools will also be synthesized. To achieve this goal, a random data generator based on a set of samples must be developed. Taylor and Thompson (1986) have developed such a generator as follows:

First, a multivariate observation x_j is randomly chosen from the given sample set;

The nearest m neighbors of x_j from the sample set are determined: x_j1, …, x_jm. The mean ${\bar{x}}_{j}$ of these nearest neighbors is computed;

Random samples u₁, …, u_m are generated from the uniform distribution on the interval ( $1/m - \sqrt{3(m-1)/m^{2}}$ , $1/m + \sqrt{3(m-1)/m^{2}}$ ). With these data, a random sample is generated as

$\sum_{i=1}^{m} u_{i} (x_{ji} - {\bar{x}}_{j}) + {\bar{x}}_{j}$.

The above procedure can be repeated as many times as needed to provide enough random data. The Taylor and Thompson (1986) procedure is implemented as a subroutine in the International Mathematics and Statistics Library (IMSL) numerical libraries.
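The three steps above translate into a short Python function. This sketch is not the IMSL routine; it assumes that x_j counts as one of its own m nearest neighbors (it is at distance zero), and the name `taylor_thompson_sample` is hypothetical.

```python
import math
import random


def taylor_thompson_sample(samples, m, rng):
    """Generate one synthetic observation from a list of equal-length tuples."""
    # Step 1: pick an observation at random.
    x_j = rng.choice(samples)
    # Step 2: its m nearest neighbors (x_j itself included) and their mean.
    neighbours = sorted(samples, key=lambda s: math.dist(s, x_j))[:m]
    dim = len(x_j)
    mean = [sum(p[d] for p in neighbours) / m for d in range(dim)]
    # Step 3: uniform weights on (1/m - h, 1/m + h) with h = sqrt(3(m - 1) / m^2).
    h = math.sqrt(3.0 * (m - 1) / m ** 2)
    u = [rng.uniform(1.0 / m - h, 1.0 / m + h) for _ in range(m)]
    return tuple(
        mean[d] + sum(u[i] * (neighbours[i][d] - mean[d]) for i in range(m))
        for d in range(dim)
    )
```

Calling the function 360 times on the (a, b, c) triples of a pool would produce a synthetic pool of the same size. The generated values are not constrained to valid parameter ranges, so any such check would have to be added separately.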

Item Pools

Seven item pools were used in this study with 360 items in each pool. Descriptive statistics of the IPs for each pool are shown in Table 1. The first three pools (training, T1, T2) consisted of IPs actually calibrated from multiple forms of a large-scale standardized test based on the 3PLM. They were assembled from a large master pool by requiring similar statistical attributes in each pool. Although previous studies on parallel pools did not give an explicit definition for the concept, they all emphasized similar statistical attributes to support parallelism between pools (Ariel, Veldkamp, & van der Linden, 2004; Way, 1998). Thus, an operational definition for parallel pools in this study is given as follows.

TABLE 1.

Descriptive Statistics of Item Parameters for Pools Used in the Study

Item Pool	N	Mean	SD	Min	Max	Skewness	Kurtosis
a
Training	360	0.9688	0.3237	0.2821	2.3660	0.7700	1.4130
T1	360	0.9755	0.3145	0.2449	1.9432	0.5045	0.1055
T2	360	0.9956	0.3208	0.2449	2.1600	0.4497	0.1162
R1	360	0.9591	0.3008	0.3374	2.3865	1.0527	2.1771
R2	360	0.9802	0.3271	0.3126	2.3166	0.9608	1.4979
SD	360	0.9755	0.3148	0.2449	1.9432	0.5034	0.0993
BD	360	0.7629	0.4088	0.1000	1.5000	0.1301	−1.2568
b
Training	360	0.3983	1.1221	−3.4292	2.9434	−0.4480	0.3680
T1	360	0.4085	1.1499	−3.8185	4.0127	−0.2313	0.5343
T2	360	0.3947	1.1141	−3.8185	3.5764	−0.2838	0.4441
R1	360	0.3861	1.0718	−2.8812	2.8662	−0.3066	0.1238
R2	360	0.4567	1.0924	−3.5190	2.7397	−0.5241	0.7241
SD	360	0.4086	1.1499	−3.8185	4.0127	−0.2316	0.5352
BD	360	2.0221	1.4511	−2.5900	5.5500	−0.0790	−0.0676
c
Training	360	0.1852	0.0865	0.0252	0.5000	1.0630	1.5710
T1	360	0.1891	0.0811	0.0322	0.5000	0.8883	1.0163
T2	360	0.1901	0.0802	0.0462	0.5000	0.7976	0.5174
R1	360	0.1840	0.0768	0.0153	0.5183	0.9966	1.5821
R2	360	0.1807	0.0723	0.0233	0.4720	0.8893	1.1220
SD	360	0.1891	0.0811	0.0322	0.5000	0.8883	1.0163
BD	360	0.1891	0.0810	0.0300	0.5000	0.8925	1.0505

Notes: BD = big deviation; SD = small deviation.

Parallel Pools

Two item pools are parallel if their IPs come from the same multivariate distribution.

Using the Baringhaus and Franz test, it was observed that T1 and T2 were each parallel to the training pool, as shown in Table 2. The null hypothesis H₀ of being distributed as the training pool was not rejected, with a very high p value in each case. Four synthetic pools were also formed.

TABLE 2.

Baringhaus and Franz Test for the Null Hypothesis H₀ of Being Distributed as the Training Pool

Pool	0.95 Critical Value	Statistic	p Value	Test Result for H₀
T1	1.4835	0.2103	.980	Not rejected
T2	1.5177	0.2686	.926	Not rejected
R1	1.6022	0.2654	.928	Not rejected
R2	1.5192	0.2458	.935	Not rejected
SD	1.5145	0.2108	.982	Not rejected
BD	2.1155	98.6046	.000	Rejected

Notes: BD = big deviation; SD = small deviation. Result is based on 1,000 permutation bootstrap-replicates.

Synthetic Item Pools R1 and R2

Using IPs from the training pool as the set of samples, the Taylor and Thompson (1986) procedure was applied with m = 5 and m = 3 to generate synthetic pools R1 and R2, respectively. Each of these pools was parallel to the training pool, as shown in Table 2 by the high p values.

Synthetic Item Pools SD and BD

The SD (small deviation) pool in Table 1 was synthesized from T1 by perturbing the a (discrimination) and b (difficulty) parameters of 10 random items. The perturbation was in the range of 0.01, and these 10 items had diverse IEPs. This mimics a situation where some of the items have been compromised and must be replaced by items with similar quality. The big deviation (BD) pool was prepared using a uniform sampling for a and a normal sampling for b, respectively. The BD pool was intended to be a nonparallel pool to the training pool.

Simulation Studies

IEPs of the training pool were first obtained using a simulation procedure. Various extensions of the Sympson and Hetter (1985) procedure were considered. Except for the conditional Stocking and Lewis procedure, a population of 10,000 simulees drawn from the standard normal distribution N(0, 1) was used to find IEPs of the training pool. For the conditional multinomial procedure, 5,000 simulees with fixed ability were used for each ability level.

Evaluation of Predicted IEPs

For each test pool (i.e., T1, T2, R1, R2, SD, and BD), the mean absolute error (MAE) was calculated to assess the accuracy of MLP predictions. The MAE is defined as follows: $MAE = \sum_{i=1}^{N} |k_{i} - {\hat{k}}_{i}| / N$ (9), where k_i is the target IEP simulated from the same SH-type procedure used for the training pool, and ${\hat{k}}_{i}$ is the predicted IEP from the trained neural net. A correlation r between k_i and ${\hat{k}}_{i}$ was also computed. The smaller the MAE and the closer r is to 1, the better the performance of the trained MLP.
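Both accuracy measures are straightforward to compute. The following Python sketch implements Equation 9 and, as an assumption about what "correlation" means here, the Pearson product-moment correlation.

```python
import math


def mean_absolute_error(k, k_hat):
    """Equation 9: average absolute difference between simulated and predicted IEPs."""
    return sum(abs(a - b) for a, b in zip(k, k_hat)) / len(k)


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)
```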

The performance of k_i and ${\hat{k}}_{i}$ was further investigated with respect to the precision of ability estimation and item exposure control. After an operational CAT was conducted on a second population of 10,000 simulees drawn from the normal distribution N(0, 1) using k_i or ${\hat{k}}_{i}$ to control item exposure, bias and root mean squared error (RMSE) were computed as follows: $Bias = \frac{\sum_{j=1}^{Pop} ({\bar{θ}}_{j} - θ_{j})}{Pop}$ (10), $RMSE = \sqrt{\frac{\sum_{j=1}^{Pop} ({\bar{θ}}_{j} - θ_{j})^{2}}{Pop}}$ (11), where ${\bar{θ}}_{j}$ and θ_j were, respectively, the estimated and the true ability of the jth simulee, and Pop (= 10,000) was the population size. Regarding item exposure control, the maximum observed exposure rate and the number of overexposed items were recorded. An item was considered overexposed if its observed exposure rate was 0.1 over the target r_max. The number of used items was taken to signify item usage. The MAE and the correlation of observed exposure rates using k_i and ${\hat{k}}_{i}$ in the operational CAT were also computed.
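A minimal Python sketch of the two precision measures follows. The bias is divided by the population size, an assumption made so that it sits on the same per-simulee scale as the RMSE in Equation 11.

```python
import math


def bias(theta_hat, theta_true):
    """Mean signed difference between estimated and true abilities."""
    return sum(e - t for e, t in zip(theta_hat, theta_true)) / len(theta_true)


def rmse(theta_hat, theta_true):
    """Equation 11: root mean squared error of the ability estimates."""
    return math.sqrt(
        sum((e - t) ** 2 for e, t in zip(theta_hat, theta_true)) / len(theta_true)
    )
```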

Experimental Environment

Hardware: A personal computer with Intel Pentium 4/3.0 GHz CPU and 512 MB memory was used for the simulation studies.

Software: The NeuroSolutions software from NeuroDimension Inc. (Gainesville, FL) was used for MLP implementation, the cramer package under R was adopted for the Baringhaus and Franz (2004) multivariate two-sample test, and the rndat subroutine from IMSL was used for the Taylor and Thompson (1986) random data generator.

MLP environment: The network structure recommended by NeuroSolutions was tried first and then tuned to find a satisfactory structure. Each case was trained for 100,000 epochs.

Simulation Results

Characteristics of Simulated IEPs

Due to the nature of simulation, IEPs fluctuate with the random seed used to start an SH-type procedure. Because the purpose of this study was to find a functional relation k = f_P (a, b, c), it was desirable to know how stable k was across different random seeds. It turned out that for all variants of the SH procedure except the van der Linden (2003) alternatives, simulated IEPs were quite stable regardless of the seeds used. Taking the training pool as an example, 10 runs of the original SH procedure were conducted with different seeds. For each item, the standard deviation of the 10 simulated IEPs was close to 0. The maximum standard deviation across the 360 items of the pool was 0.019323, and the mean of these standard deviations was 0.001163. For the van der Linden (2003) alternatives with γ = 0.15, simulated IEPs depended critically on the seed used to start a simulation. For example, with two different random seeds, the SH procedure produced two sets of IEPs with a maximum absolute difference of 0.0415, whereas the van der Linden procedure produced two such sets with a maximum absolute difference of 0.1777.

The Sympson and Hetter (1985) Procedure

The original SH procedure, using item information to select items and Equation 2 to update IEPs, was implemented with r_max = 0.2 and a test length of 20. With simulated IEPs from the training pool, an MLR was first applied to investigate a linear relationship between IPs and IEPs:

$k = 1.123 - 0.418 a + 0.1 b + 0.715 c,$ (12)

where each regression coefficient was significant with p < .001 and the unadjusted R² was .400. This indicates that all three IPs are statistically significant predictors of k, although the linear model explains only 40% of the variance in k. The negative coefficient of a implies that if an item is more discriminating, it should be controlled with a tighter IEP to guard against overexposure. The regression formula was used to predict IEPs from IPs of the test pools. The predicted IEPs were further trimmed to lie in the range (r_max, 1). Then, predicted IEPs were evaluated as outlined above, with results summarized in Table 3. The MLR formula, though easy to interpret, did not perform well on the test pools. Many items were overexposed, and one could hardly tell any difference between parallel and nonparallel pools except that the nonparallel BD pool used substantially fewer items than the parallel pools.
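As a small illustration of how the regression in Equation 12 turns item parameters into a trimmed IEP, consider the following sketch. The function name is hypothetical, and the trimming is implemented as clipping to the closed interval [r_max, 1], which is an assumption about how the range (r_max, 1) was enforced.

```python
def predict_iep_mlr(a, b, c, r_max=0.2):
    """Predict an IEP from (a, b, c) with Equation 12, then trim it."""
    k = 1.123 - 0.418 * a + 0.1 * b + 0.715 * c
    # Assumption: trimming means clipping to [r_max, 1].
    return min(1.0, max(r_max, k))


# A moderately discriminating item keeps a loose IEP, whereas a highly
# discriminating item is pushed down to the floor r_max.
moderate = predict_iep_mlr(a=1.0, b=0.0, c=0.2)
sharp = predict_iep_mlr(a=3.0, b=-1.0, c=0.1)
```

The example makes the sign of the a coefficient concrete: raising a lowers the predicted k, so highly discriminating items receive tighter exposure control.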

TABLE 3.

Evaluation of MLR With the SH Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.1135	.754	.0242	.845	.574	−.00334	.287	167	17
T2	.1189	.742	.0238	.850	.557	.00165	.282	166	18
R1	.1214	.685	.0304	.793	.641	−.00501	.295	156	26
R2	.1239	.690	.0281	.814	.627	−.00024	.289	161	23
SD	.1139	.754	.0241	.843	.586	−.00391	.286	167	17
BD	.1232	.650	.0464	.675	.607	−.00548	.327	122	32

Notes: BD = big deviation; IEP = item exposure parameter; MAE = mean absolute error; MLR = multiple linear regression; SD = small deviation. Column interpretation: MAE is defined in Equation 9; Corr is the correlation between simulated and predicted IEPs; MAE2 and Corr2 are MAE and Corr of observed exposure rates; rMax is the maximum observed exposure rate with predicted IEPs; Bias is defined in Equation 10; RMSE is defined in Equation 11; Used is the number of used items; Over is the number of overexposed items. Test length = 20, r_max = 0.2.

The naive linear regression in Equation 12 may not be able to track a nonlinear relationship between IPs and IEPs well. Because an IEP is in a sense a proportion, a log odds transformation was performed on the IEP and the result was regressed by an MLR, that is, log(k/(1 − k)) = β₀ + β₁ a + β₂ b + β₃ c. Note that this log odds transformation is equivalent to a one-layer neural network, $k = 1 ∕ (1 + e^{- (β_{0} + β_{1} a + β_{2} b + β_{3} c)})$. Then, IEPs of T1 were predicted and assessed with this new logistic formula. The correlation between simulated and predicted IEPs dropped markedly from .754 (in MLR) to .443 (in logistic regression). The observed exposure rates had a maximum of .937, and 28 items were overexposed. This observation was not unexpected because early ANN research had shown that one-layer neural networks have limited learning capability (Minsky & Papert, 1969).
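The equivalence between the log odds regression and the one-layer logistic form follows from solving the regression equation for k. The short derivation below is a worked step added for clarity; the symbol z is introduced here only as shorthand for the linear predictor.

```latex
\begin{align*}
\log\frac{k}{1-k} &= z, \qquad z = \beta_0 + \beta_1 a + \beta_2 b + \beta_3 c \\
\frac{k}{1-k} &= e^{z} \\
k &= \frac{e^{z}}{1+e^{z}} = \frac{1}{1+e^{-z}}
\end{align*}
```

Because the logistic function maps every real z into (0, 1), predictions from this form always stay inside the unit interval, unlike those of Equation 12.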

Based on the idea that tightly controlled items should receive larger weights in training, a weighted MLR was also considered. Each item in the training pool received a weight depending on its IEP as follows:

$w = \begin{cases} 20, & 0.2 \leq k < 0.3 \\ 10, & 0.3 \leq k < 0.5 \\ 5, & 0.5 \leq k < 0.7 \\ 1, & 0.7 \leq k \leq 1.0 \end{cases}$

This weight assignment was determined subjectively. Using the weighted MLR from the training pool, IEPs of T1 were again predicted and assessed as before. The MAE between simulated and predicted IEPs increased from .1135 (in MLR) to .1463 (in weighted MLR), and the correlation stayed about the same (.754 for MLR vs. .751 for weighted MLR). A particular difficulty in weighted MLR was setting proper weights for training examples. Applying the same weight assignment to logistic regression improved the results only slightly over the unweighted logistic regression.
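The weight scheme can be written as a small lookup, and its effect on a least-squares criterion shown directly. This is a sketch under assumptions: the function names are hypothetical, IEPs outside [0.2, 1.0] are rejected because the scheme does not define a weight for them, and the weighted sum of squared errors is the generic criterion a weighted regression minimizes rather than a quantity reported in the study.

```python
def item_weight(k):
    """Training weight for an item, given its simulated IEP k."""
    if 0.2 <= k < 0.3:
        return 20
    if 0.3 <= k < 0.5:
        return 10
    if 0.5 <= k < 0.7:
        return 5
    if 0.7 <= k <= 1.0:
        return 1
    # Assumption: the scheme is undefined outside [0.2, 1.0].
    raise ValueError("IEP outside the range covered by the weight scheme")


def weighted_sse(simulated, predicted):
    """Weighted sum of squared errors between simulated and predicted IEPs."""
    return sum(item_weight(k) * (k - p) ** 2
               for k, p in zip(simulated, predicted))
```

An error of 0.1 on a tightly controlled item (k = 0.25) contributes 20 times as much to this criterion as the same error on a loosely controlled item (k = 0.9).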

Because of the unsatisfactory results of MLR and logistic regression, MLP was used to find a functional relation between IPs and IEPs. Based on the trained MLP, IEPs of the test pools were predicted and evaluated as before, with results summarized in Table 4. This MLP relation performed quite differently on parallel and nonparallel pools. The MAE between simulated and predicted IEPs of the parallel pools (.0067 to .0330) was several times smaller than that of the BD pool (.1140). Moreover, the correlation was much higher and close to 1.0 on parallel pools. The same holds for the MAE and correlation of the observed exposure rates (the MAE2 and Corr2 columns). The five parallel pools had a maximum observed exposure rate less than .27 using between 158 and 173 items, whereas the BD pool had a maximum observed exposure rate of .723 using 114 items. Bias, RMSE, and item usage were about the same size on parallel pools when simulated or predicted IEPs were used in an operational CAT. Thus, the functional relation discovered by MLP preserved key characteristics of CAT operations on parallel pools.

TABLE 4.

Evaluation of MLP With the SH Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0228	.974	.0067	.987	.240	−.00064	.298	172	0
T2	.0330	.953	.0085	.977	.259	−.00437	.300	172	0
R1	.0137	.989	.0061	.987	.267	−.00025	.310	158	0
R2	.0067	.997	.0031	.996	.264	−.00300	.301	166	0
SD	.0224	.974	.0065	.987	.246	−.00030	.296	173	0
BD	.1140	.559	.0433	.680	.723	.00395	.349	114	29

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2.

To further examine the knowledge-based approach, similar experiments with the SH procedure were conducted with a tighter exposure control (r_max = 0.1) or a longer test length (l = 30). Results for the tighter control are listed in Table 5, and results for the longer test length are listed in Table 6. These two experiments used more items to satisfy stricter conditions, but the comparison between simulated and predicted IEPs is similar to that for the basic case. Although MLP provided good results in estimating IEPs without simulation, it is difficult, unlike with the MLR in Equation 12, to see directly from the MLP relation the effect of a, b, or c on k.

TABLE 5.

Evaluation of MLP With the SH Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0318	.981	.0057	.980	.135	.00089	.340	261	0
T2	.0368	.975	.0069	.966	.133	−.00033	.335	260	0
R1	.0332	.972	.0062	.976	.136	−.00272	.350	258	0
R2	.0151	.994	.0041	.989	.136	−.00259	.346	248	0
SD	.0090	.999	.0033	.993	.126	−.00283	.344	249	0
BD	.3250	.416	.0616	.396	.785	−.00495	.420	132	27

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.1.

TABLE 6.

Evaluation of MLP With the SH Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0276	.975	.0086	.984	.250	−.00174	.263	241	0
T2	.0333	.973	.0096	.980	.244	.00180	.261	243	0
R1	.0116	.995	.0052	.994	.262	.00114	.271	232	0
R2	.0089	.997	.0043	.996	.258	.00102	.265	234	0
SD	.0278	.976	.0086	.984	.245	−.00263	.262	239	0
BD	.1865	.491	.0695	.611	.796	−.00264	.331	158	38

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 30, r_max = 0.2.

The van der Linden (2003) Alternatives

The van der Linden procedure uses item information to select items, as does the original SH procedure; however, it changes the IEP updating step to Equation 3. Two runs of the van der Linden procedure for the training pool with r_max = 0.2, test length = 20, and γ = 0.15 in Equation 3 were conducted. These two runs differed in the random seed used to start the simulation. In the evaluation process, the seed used in a run was also used to obtain simulated IEPs of the test pools. Predicted IEPs were not trimmed to be greater than r_max because simulated IEPs from the van der Linden procedure can be smaller than r_max. Results of the MLP approach with the two random seeds are reported in Tables 7 and 8, respectively. As before, the functional relations approximated by MLP performed well on parallel pools and poorly on the BD pool. On parallel pools, the maximum observed exposure rate was better controlled and more items were used than with the knowledge-based approach for the SH procedure.

TABLE 7.

Evaluation of MLP With the van der Linden Alternatives and Random Seed #1

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0376	.963	.0113	.954	.203	−.00128	.309	200	0
T2	.0504	.947	.0142	.931	.202	−.00016	.309	204	0
R1	.0245	.982	.0105	.957	.228	−.00058	.321	194	0
R2	.0167	.992	.0078	.972	.240	−.00260	.321	190	0
SD	.0382	.963	.0117	.950	.203	−.00065	.315	199	0
BD	.1887	.505	.0588	.427	.740	−.00060	.377	119	21

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2.

TABLE 8.

Evaluation of MLP With the van der Linden Alternatives and Random Seed #2

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0297	.973	.0112	.956	.216	.00082	.309	196	0
T2	.0418	.955	.0137	.935	.214	.00203	.306	201	0
R1	.0214	.986	.0105	.955	.256	−.00040	.317	184	0
R2	.0181	.989	.0089	.968	.234	−.00230	.318	186	0
SD	.0290	.975	.0109	.959	.214	−.00296	.312	194	0
BD	.1892	.505	.0592	.420	.740	.00128	.376	117	21

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2.

The Unconditional Stocking and Lewis (1998, 2000) Multinomial Procedure

Unlike the van der Linden procedure, the Stocking and Lewis (unconditional or conditional) multinomial procedure modifies the item selection step in the SH procedure while keeping the same IEP updating step in Equation 2. This procedure does not use the maximal item information criterion to select items but rather a normalized multinomial distribution based on a list of candidate items. For the unconditional Stocking and Lewis multinomial procedure, the target exposure rate was set at r_max = 0.2 and the test length was 20. Therefore, each list of candidate items had 18 (= 360/20) items. Evaluation of the knowledge-based approach is reported in Table 9. The results are very similar to those of the SH procedure in every respect.

TABLE 9.

Evaluation of MLP With the Unconditional Stocking and Lewis Multinomial Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0225	.974	.0064	.987	.247	.00012	.299	173	0
T2	.0340	.954	.0084	.976	.246	.00333	.297	172	0
R1	.0141	.988	.0060	.989	.264	.00032	.308	162	0
R2	.0069	.997	.0030	.996	.258	.00085	.304	165	0
SD	.0225	.975	.0064	.986	.247	.00041	.298	172	0
BD	.1111	.566	.0423	.691	.727	.00113	.357	112	27

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2.

The Conditional Stocking and Lewis (1998, 2000) Multinomial Procedure

For the conditional Stocking and Lewis multinomial procedure, the ability scale was discretized into 15 levels: θ_m = −3.5 + (m − 1) × 0.5, m = 1, …, 15. A target exposure rate was set at r_max = 0.2. The test length was 20, and each list of candidate items had 18 items. Because of the large number of CAT simulations required (5,000 × 15 = 75,000 per iteration), this was the most expensive procedure for finding simulated IEPs.

To accommodate the ability factor in MLP, inputs were expanded to include a θ variable. Thus, a functional relation k = f_P (a, b, c, θ) was sought using a training set of 360 × 15 = 5,400 examples from the training pool. The θ variable was replaced by the formula −3.5 + (m − 1) × 0.5 during the training and operational phases of MLP. Results in Table 10 show that the performance of this knowledge-based approach was excellent. For each parallel pool, the maximum observed exposure rate was 0.21, and substantially more items were used. The nonparallel pool still performed poorly with the predicted IEPs. To examine the precision at each ability level, operational CATs were conducted for simulees with ability fixed at −3, −2, −1, 0, 1, 2, and 3. The Bias and RMSE are plotted against the ability level in Figures 2 and 3, respectively. At each ability level, the Bias and RMSE values for the parallel pools were almost the same as those for the training pool. This shows that the predicted IEPs did not degrade the precision of ability estimation at any ability level.
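The ability grid and the expanded MLP inputs described above can be illustrated as follows. This is a minimal sketch with hypothetical function names; it builds only the input tuples (a, b, c, θ) and leaves out the target, which would be the simulated conditional IEP of each item at each ability level.

```python
def theta_levels(n_levels=15, start=-3.5, step=0.5):
    """Ability levels theta_m = start + (m - 1) * step for m = 1..n_levels."""
    return [start + (m - 1) * step for m in range(1, n_levels + 1)]


def expand_inputs(items, levels):
    """Pair every item's (a, b, c) with every ability level."""
    return [(a, b, c, theta) for (a, b, c) in items for theta in levels]


levels = theta_levels()
# Placeholder pool of 360 identical items, used only to count examples.
pool = [(1.0, 0.0, 0.2)] * 360
inputs = expand_inputs(pool, levels)
# Simulated CATs per iteration: 5,000 simulees at each of the 15 levels.
cats_per_iteration = 5000 * len(levels)
```

The grid runs from −3.5 to 3.5 in steps of 0.5, so a pool of 360 items yields 5,400 training inputs, matching the count stated in the text.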

TABLE 10.

Evaluation of MLP With the Conditional Stocking and Lewis Multinomial Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0288	.973	.0041	.993	.210	−.00082	.328	291	0
T2	.0306	.970	.0041	.993	.210	.00069	.329	292	0
R1	.0221	.978	.0042	.991	.210	.00066	.342	279	0
R2	.0199	.984	.0042	.990	.210	.00108	.337	282	0
SD	.0293	.972	.0041	.993	.210	.00005	.330	290	0
BD	.1451	.471	.0364	.704	.801	.00226	.397	181	15

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2.

FIGURE 2.

Bias from operational computerized adaptive tests (CATs) conditional on ability level. Training pool uses simulated item exposure parameters (IEPs) from the conditional Stocking and Lewis multinomial procedure; other pools use predicted IEPs.

FIGURE 3.

Root mean squared error (RMSE) from operational computerized adaptive tests (CATs) conditional on ability level. Training pool uses simulated item exposure parameters (IEPs) from the conditional Stocking and Lewis multinomial procedure; other pools use predicted IEPs.

The Chen and Lei (2005) Item Exposure and Test Overlap Control

The Chen and Lei procedure modifies the SH procedure by using a different IEP updating step (Equation 6) while preserving the item selection step based on item information. Using the Chen and Lei (2005) procedure with a target exposure rate of r_max = 0.2, a target test overlap rate of T₀ = 0.15, and a test length of 20, IEPs were simulated for the training pool. An MLP was trained with these simulated IEPs and used to predict IEPs of the test pools. Evaluation of predicted IEPs is summarized in Table 11. Item exposure and test overlap were well controlled for parallel pools, and this procedure used more items than the SH procedure.

TABLE 11.

Evaluation of MLP With the Chen and Lei Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	TestOver	Bias	RMSE	Used	Over
T1	.0353	.972	.0079	.980	.207	.144	.00126	.310	199	0
T2	.0511	.957	.0102	.964	.200	.140	−.00419	.301	204	0
R1	.0151	.992	.0045	.994	.199	.156	−.00381	.314	189	0
R2	.0117	.996	.0034	.996	.204	.154	−.00107	.313	188	0
SD	.0310	.978	.0070	.983	.207	.144	−.00025	.307	198	0
BD	.1656	.525	.0453	.657	.712	.369	−.00438	.367	125	18

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 20, r_max = 0.2, target test overlap = 0.15. TestOver is the observed test overlap rate.

A Structural Requirement for Tests—the Content-Balancing Requirement

The real pools (training, T1, and T2) contained a content indicator to impose a content-balancing requirement. Six content areas existed in each pool, and the numbers of items in these areas were 84, 60, 54, 54, 84, and 24, with 7, 5, 4, 5, 7, and 2 items, respectively, to be administered. The subpools of these three pools in each content area were all found to be parallel, with high p values (>.8) in the Baringhaus and Franz test. The SH procedure was modified in the item selection step to handle a content-balancing requirement, whereas the original IEP updating step was preserved. To minimize the learning complexity for an ANN, a sequential content selection procedure was implemented; that is, content area s had to supply its prespecified number of items before a test moved on to the next content area s + 1. In each content area, a computerized adaptive test used the criterion of maximal item information to select items. A target exposure rate was set at r_max = 0.2. Inputs to the knowledge-based approach were expanded to include a nominal content indicator s (= 1, 2, 3, 4, 5, or 6). Evaluation of predicted IEPs is summarized in Table 12, which shows that the functional relation k = f_training (a, b, c, s) transferred knowledge well from the training pool to the parallel pools and poorly to the nonparallel pool. Although the predicted IEPs seemed to perform better in T2 in terms of observed exposure rates, the MAE and Corr actually indicated that MLP had made more precise predictions in T1. A further look into the parallelism of each content-area subpool revealed that T2 had a lower p value (around 0.8) in one area.
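The sequential content selection can be pictured as a fixed schedule of content areas, one entry per test position. The sketch below only builds that schedule from the quotas given above; the names are hypothetical, and the adaptive item selection and SH exposure filtering within each area are not modeled.

```python
POOL_SIZES = [84, 60, 54, 54, 84, 24]   # items available per content area
QUOTAS = [7, 5, 4, 5, 7, 2]             # items administered per content area


def content_schedule(quotas):
    """Content area (1-based) from which each successive item is drawn."""
    return [s for s, quota in enumerate(quotas, start=1) for _ in range(quota)]


schedule = content_schedule(QUOTAS)
pool_size = sum(POOL_SIZES)
test_length = len(schedule)
```

The quotas sum to a test length of 30 and the area sizes to a pool of 360 items, consistent with the test length reported in the notes to Table 12.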

TABLE 12.

Evaluation of MLP With the SH Procedure and Content-Balancing Requirement

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0386	.954	.0127	.966	.305	−.00287	.269	261	1
T2	.0442	.935	.0135	.963	.264	−.00198	.275	263	0
BD	.1618	.528	.0668	.627	.794	.00112	.346	171	39

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Test length = 30, r_max = 0.2.

One may speculate that if each content area is learned separately, a better result may be obtained. That is, a functional relation k_s = k_s (a, b, c) is learned from the training pool for each content area s and used to predict IEPs of T1 or T2 in that area. Unfortunately, experiments revealed that this approach had worse results than the original approach. This may be explained as follows. Owing to the sequential content selection method, the functional relation k_s depends on the relations that precede it, that is, k_s = k_s (a, b, c, k_{s−1}, …, k_1), because item selections in the sth area depend on the ability estimation up to the (s − 1)th area, and IEPs in an area constrain how items are selected in that area and thus can influence ability estimation. Therefore, the sth functional relation k_s depends on IPs not only from the sth content area but also from previous areas.

The Chang and Ying (1999) Stratification Procedure

The Chang and Ying (1999) procedure is very similar to the sequential content selection procedure described above, except that a content area is now called a stratum. Leung, Chang, and Hau (2002) implemented an a-stratified design with a modified SH procedure to control item exposure. Their procedure deviated from the Sympson and Hetter (1985) procedure in two places: (a) the difficulty parameter b was used to select items; and (b) a multiplicative form k′ = ck was used to update IEPs. The factor c (not to be confused with the item parameter c) was set to 1.04 when the observed exposure rate of an item was smaller than the target r_max and to 0.95 when it was greater. In the following, two experiments were conducted to implement the a-stratification procedure. Both experiments used the same stratification for item pools, but they differed in the item selection step within strata. The first experiment followed the original SH procedure to control item exposure in a stratum, and the second experiment adopted the mechanism in Leung, Chang, and Hau.
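The multiplicative update k′ = ck can be sketched as a one-line rule. The function and parameter names are hypothetical, and two details are assumptions not stated in the text: an IEP is left unchanged when the observed rate equals the target exactly, and the updated value is capped at 1 so that it remains a probability.

```python
def update_iep_multiplicative(k, observed_rate, r_max=0.2, up=1.04, down=0.95):
    """One multiplicative IEP update step, k' = c * k."""
    if observed_rate < r_max:
        factor = up        # item is underexposed: loosen control
    elif observed_rate > r_max:
        factor = down      # item is overexposed: tighten control
    else:
        factor = 1.0       # assumption: no change at exactly r_max
    return min(1.0, k * factor)   # assumption: cap at 1
```

For example, an item with k = 0.5 moves to 0.52 when its observed rate is below the target and to 0.475 when it is above.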

The training pool was partitioned into four strata according to the discrimination parameter a. Each stratum had 90 items and supplied 5 items to a test-taker. During the first stage of a test, items were selected and administered using the original SH procedure on the first stratum of 90 items. After five items had been administered from this stratum, the same procedure moved to the next stratum until 20 items were administered to a test-taker. The target exposure rate was set at r_max = 0.2. For the MLP approach, the inputs consisted of four variables: a, b, c, and s. The nominal variable s (= 1, 2, 3, or 4) was used to indicate the stratum number. Evaluation of the predicted IEPs is summarized in Table 13. A few items were overexposed on the parallel pools, and Corr dropped to as low as .841. Because of the a-stratification step, a lower degree of parallelism among subpools, as indicated by the low p values in Table 14, partially contributed to this effect. For example, T2 and SD had low p values in a few strata. Incidentally, they also had higher MAE and lower Corr than the other parallel pools.
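The partition into strata can be sketched as a sort on the discrimination parameter followed by equal-size slicing. The function name is hypothetical, and the ordering is an assumption: strata are formed in ascending order of a, so the first stratum holds the least discriminating items, as in the a-stratified design of Chang and Ying (1999).

```python
def a_stratify(a_values, n_strata=4):
    """Split item indices into equal-size strata by ascending a."""
    if len(a_values) % n_strata != 0:
        raise ValueError("pool size must be divisible by the number of strata")
    order = sorted(range(len(a_values)), key=lambda i: a_values[i])
    size = len(a_values) // n_strata
    return [order[s * size:(s + 1) * size] for s in range(n_strata)]


# Placeholder discrimination values for a 360-item pool (illustrative only).
a_values = [0.3 + ((i * 37) % 360) / 200 for i in range(360)]
strata = a_stratify(a_values)
# Each stratum supplies 5 items, giving a 20-item test.
items_per_test = 5 * len(strata)
```

Each returned stratum is a list of item indices, so the stratum number s used as an MLP input is simply the 1-based position of the list containing the item.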

TABLE 13.

Evaluation of MLP With the Chang and Ying a-Stratification Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0436	.862	.0143	.931	.365	.00073	.323	187	3
T2	.0526	.841	.0148	.928	.315	−.00166	.320	186	1
R1	.0404	.847	.0137	.931	.439	.00033	.325	175	4
R2	.0236	.948	.0101	.963	.346	.00274	.322	175	3
SD	.0464	.842	.0153	.919	.400	.00066	.324	187	4
BD	.1377	.328	.0535	.560	.999	−.00556	.372	104	25

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. Item information is used to select items. Test length = 20, r_max = 0.2.

TABLE 14.

Baringhaus and Franz Test for the Degree of Parallelism Between Corresponding Strata of the Training and Test Pools

p Value	Stratum 1	Stratum 2	Stratum 3	Stratum 4
T1	.917	.659	.835	.509
T2	.895	.291	.302	.276
R1	.974	.631	.656	.756
R2	.555	.983	.980	.999
SD	.874	.583	.821	.496
BD	.000	.000	.000	.000

Notes: BD = big deviation; SD = small deviation. Result is based on 1,000 permutation bootstrap-replicates.

The above stratification approach used a few more items than the original SH procedure without stratification. When the difficulty parameter b was used to select items, item utilization improved substantially. In this second experiment, the item whose difficulty parameter was closest to the estimated ability of a test-taker was selected as the candidate for administration. The multiplicative form k′ = ck used to update IEPs in Leung, Chang, and Hau (2002) was found to be inefficient; thus, this experiment used Equation 2 to update IEPs. The results are summarized in Table 15, which shows a pattern very similar to that of the previous experiment, with two exceptions: (a) more items were used; and (b) RMSE was higher. Therefore, using the difficulty parameter to select items improved item usage, but it also made ability estimation less efficient.
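The b-matching selection rule used in the second experiment can be sketched as a nearest-neighbor lookup on the difficulty parameter. The function name and the tie-breaking rule (lowest item index wins) are assumptions; the subsequent exposure filter, in which the candidate is administered or set aside according to its IEP, is not modeled here.

```python
def select_by_difficulty(theta_hat, b_values, available):
    """Index of the available item whose b is closest to theta_hat."""
    # sorted() makes ties resolve to the lowest index (assumption).
    return min(sorted(available), key=lambda i: abs(b_values[i] - theta_hat))


b_values = [-1.0, 0.2, 0.9, 2.0]
first_pick = select_by_difficulty(0.5, b_values, {0, 1, 2, 3})
# If item 1 has already been used or set aside, the next-closest item is chosen.
second_pick = select_by_difficulty(0.5, b_values, {0, 2, 3})
```

Because the rule ignores a and c, items are chosen without regard to their information at the current ability estimate, which is consistent with the higher RMSE reported for this experiment.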

TABLE 15.

Evaluation of MLP With the Chang and Ying a-Stratification Procedure

Pool	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used	Over
T1	.0168	.803	.0082	.946	.353	−.00732	.346	321	1
T2	.0202	.773	.0088	.901	.501	−.00236	.345	320	3
R1	.0177	.823	.0085	.942	.351	−.00489	.355	320	2
R2	.0142	.893	.0090	.936	.381	.00260	.345	316	4
SD	.0158	.828	.0094	.924	.406	−.00344	.348	322	3
BD	.1344	.354	.0361	.674	.866	−.00160	.402	173	19

Notes: BD = big deviation; MAE = mean absolute error; SD = small deviation. The difficulty parameter b is used to select items. Test length = 20, r_max = 0.2.

Discussions

CAT cannot be implemented effectively unless item exposure is well controlled. The Sympson and Hetter (1985) procedure and its various modifications are commonly used in real applications or CAT studies to control item exposure. The SH-type procedures find stabilized IEPs through iterative simulation. A knowledge-based approach using ANNs has been proposed in this study to approximate a functional relation between IPs and IEPs of a training pool so that IEPs of parallel pools can be directly estimated without tedious simulation.

An operational definition of parallel pools was given using statistical attributes of the multivariate samples formed by IPs of item pools. Three real pools and four synthetic pools, together with the SH procedure and its many extensions, were investigated in this study to verify the proposed knowledge-based approach. It was found that the knowledge-based approach provided good results for item exposure control based on the original Sympson and Hetter (1985) procedure, the van der Linden (2003) alternatives and the unconditional Stocking and Lewis (1998, 2000) multinomial procedure. This knowledge-based approach provided excellent results for the conditional Stocking and Lewis multinomial procedure and the Chen and Lei (2005) procedure to control both item exposure and test overlap.

The knowledge-based approach worked satisfactorily for the SH procedure with a content-balancing requirement and less satisfactorily for the Chang and Ying (1999) a-stratification procedure. For these two types of exposure control, a nominal variable indicating the content area or stratum number was added to the list of independent variables. Several learning difficulties need to be investigated to improve the performance of the knowledge-based approach. For example, a previous reasoning argues that IEPs in the sth stage depend on IEPs of previous stages. Thus, a naive input form (a, b, c, s) may not be able to capture this intricacy efficiently. A further causality study is needed to design more efficient predictors for IEPs in multistage item selection mechanisms. Another difficulty involves the degree of parallelism in each subpool of content area or stratum, and this is especially serious with the a-stratification procedure.

In general, the p value in the Baringhaus and Franz test can be used to indicate the degree of parallelism between pools. However, how robust this indicator is in CAT applications remains to be verified in future studies. For example, using tens of neighbors in the Taylor and Thompson (1986) procedure, item pools with varying degrees of parallelism to the training pool were generated, with p values ranging from .010 to .776. When the knowledge-based approach was applied to the original SH procedure with r_max = 0.2 and test length = 20, it was found that, even with a p value as low as .010, the MLP functional relation k = f_training (a, b, c) generalized well to these pools (Table 16). However, the impact of a lower degree of parallelism seemed to be amplified when multistage item selection mechanisms were considered.

TABLE 16.

Evaluation of MLP With the SH Procedure

p Value	MAE	Corr	MAE2	Corr2	rMax	Bias	RMSE	Used
.776	.0148	.984	.0062	.986	.253	.00006	.310	165
.516	.0140	.990	.0055	.990	.258	−.00233	.308	161
.365	.0179	.974	.0050	.990	.253	−.00125	.300	170
.141	.0212	.964	.0068	.984	.269	−.00060	.308	159
.074	.0087	.994	.0033	.995	.252	.01166	.335	167
.010	.0145	.988	.0046	.992	.251	.00508	.312	163

Notes: MAE = mean absolute error. Pools are identified by the p value. Test length = 20, r_max = 0.2.

The knowledge-based approach was supposed to faithfully capture the relationship between IPs and simulated IEPs of the training pool. Predicted IEPs and simulated IEPs were expected to have similar item usage on parallel pools, that is, if the simulated IEPs underused many items, so would predicted IEPs. This phenomenon has been observed for all procedures tested in this study. With the SH procedure, Figure 4 shows that predicted IEPs of T1 underused many items with low discrimination parameter. When the a-stratification procedure was implemented with the SH procedure in each stratum, item usage was spread to all strata (Figure 5). However, this procedure still underused many items and tended to use highly discriminatory items in each stratum. Figure 6 shows results for the a-stratification procedure with the difficulty parameter to select items. More items were used across the entire spectrum of the discrimination parameter. However, this procedure also produced higher RMSE for ability estimation and a few overexposed items. The conditional Stocking and Lewis multinomial procedure used items across the entire spectrum of the discrimination parameter without losing precision of ability estimation or item overexposure control (Figure 7).

FIGURE 4.

Exposure rates for the T1 pool; items are sorted according to the discrimination parameter. Predicted item exposure parameters (IEPs) are used with the Sympson and Hetter procedure.

FIGURE 5.

Exposure rates for the T1 pool; items are sorted according to the discrimination parameter. Predicted item exposure parameters (IEPs) are used with the Chang and Ying a-stratification procedure. Item information is used for item selection.

FIGURE 6.

FIGURE 7.

Exposure rates for the T1 pool; items are sorted according to the discrimination parameter. Predicted item exposure parameters (IEPs) are used with the conditional Stocking and Lewis multinomial procedure.

Based on all evaluation criteria, the conditional multinomial procedure is deemed the best application of the knowledge-based approach. Incidentally, the conditional multinomial procedure was also the most expensive procedure to implement, and the knowledge-based approach has provided a practical solution to implement this procedure on parallel pools.

This knowledge-based approach to item exposure control considered mechanisms that needed the SH-type procedure to derive stabilized IEPs. Item exposure controls that did not need any IEPs were not covered in this study. For example, owing to the fact that the discrimination parameter and the difficulty parameter often are positively correlated, Chang, Qian, and Ying (2001) proposed an a-stratified multistage procedure with b blocking. The main idea was to make sure that each stratum had a balanced distribution of b values to guarantee a good match of estimated ability for different test-takers. The a and b parameters of the training pool had a correlation value of .476. Because this a-stratification with b blocking procedure did not need any IEPs to control item exposure, the knowledge-based approach had nothing to learn.

Readers are reminded that the knowledge-based approach in this study was applied to parallel pools defined primarily with statistical attributes. The content-balancing requirement, as a nonstatistical structural requirement, was also considered. As a professional test may include thousands of statistical and nonstatistical constraints on item pools (Ariel et al., 2004), future studies may focus on expanding the knowledge-based approach to cover more requirements on item pools.

References

Ariel, Veldkamp, & van der Linden (2004). Constructing rotating item pools for constrained adaptive testing. Journal of Educational Measurement, 41, 345–359.

Baringhaus & Franz (2004). On a new multivariate two-sample test. Journal of Multivariate Analysis, 88, 190–206.

Chang, Qian, & Ying (2001). a-Stratified multistage computerized adaptive testing with b blocking. Applied Psychological Measurement, 25, 333–341.

Chang & Ying (1999). a-Stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211–222.

Chang & Ansley (2003). A comparative study of item exposure control methods in computerized adaptive testing. Journal of Educational Measurement, 40, 71–103.

Chen, S. Y., & Lei, P. W. (2005). Controlling item exposure and test overlap in computerized adaptive testing. Applied Psychological Measurement, 29, 204–217.

Davey & Parshall, C. G. (1995, April). New algorithms for item selection and exposure control with computerized adaptive testing. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.

Gallant & White (1992). On learning derivatives of an unknown mapping with multilayer feedforward networks. Neural Networks, 5, 129–138.

Gupta, M. M., Jin, & Homma (2003). Static and dynamic neural networks, from fundamentals to advanced theory. Hoboken, NJ: John Wiley & Sons.

Hetter, R. D., & Sympson, J. B. (1997). Item exposure control in CAT-ASVAB. In W. A. Sands, B. K. Waters, & J. R. McBride (Eds.), Computerized adaptive testing: From inquiry to operation (pp. 141–144). Washington, DC: American Psychological Association.

Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2, 359–375.

Leung, Chang, & Hau (2002). Item selection in computerized adaptive testing: Improving the a-stratified design with the Sympson-Hetter algorithm. Applied Psychological Measurement, 26, 376–392.

Leung, Chang, & Hau (2003). Computerized adaptive testing: A comparison of three content balancing methods. The Journal of Technology, Learning and Assessment, 2, 1–15.

McBride & Martin (1983). Reliability and validity of adaptive ability tests in a military setting. In D. J. Weiss (Ed.), New horizons in testing (pp. 223–236). New York: Academic Press.

Minsky, M. L., & Papert, S. A. (1969). Perceptrons. Cambridge, MA: MIT Press.

Munakata (1998). Fundamentals of the new artificial intelligence beyond traditional paradigms. New York: Springer-Verlag.

Segall, D. O., & Moreno, K. E. (1999). Development of the CAT-ASVAB. In Drasgow & J. B. Olson-Buchanan (Eds.), Innovations in computerized assessment (pp. 35–65). Hillsdale, NJ: Lawrence Erlbaum Associates.

Stocking (1994). Three practical issues for modern adaptive testing item pools (Research Report 94–5). Princeton, NJ: Educational Testing Service.

Stocking & Lewis (1998). Controlling item exposure conditional on ability in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 23, 57–75.

Stocking & Lewis (2000). Methods of controlling the exposure of items in CAT. In W. J. van der Linden & A. W. Glas (Eds.), Computerized adaptive testing: Theory and practice. Norwell, MA: Kluwer Academic Publishers.

Sympson & Hetter (1985, October). Controlling item-exposure rates in computerized adaptive testing. In Proceedings of the 27th Annual Meeting of the Military Testing Association (pp. 973–977). San Diego, CA: Navy Personnel Research and Development Center.

Taylor & Thompson (1986). Data based random number generation for a multivariate distribution via stochastic simulation. Computational Statistics & Data Analysis, 4, 93–101.

van der Linden (2003). Some alternatives to Sympson-Hetter item-exposure control in computerized adaptive testing. Journal of Educational and Behavioral Statistics, 28, 249–265.

Way (1998). Protecting the integrity of computerized testing item pools. Educational Measurement: Issues and Practice, 17, 17–27.