A Bivariate Lognormal Response-Time Model for the Detection of Collusion Between Test Takers

Abstract

A bivariate lognormal model for the distribution of the response times on a test by a pair of test takers is presented. As the model has parameters for the item effects on the response times, its correlation parameter automatically corrects for the spuriousness in the observed correlation between the response times of different test takers because of variation in the time intensities of the items. This feature suggests using the model in a routine check of response-time patterns for possible collusion between test takers using an estimate of the correlation parameter or a statistical test of a hypothesis about it. Closed-form expressions for the maximum-likelihood estimations of the model parameters and a Lagrange multiplier test for the correlation parameter are presented. As in any type of statistical decision making, results from such procedures should be corroborated by evidence from other sources, for example, results from a response-based analysis or observations during the test session. The effectiveness of the model in removing the spuriousness from correlated response times is illustrated using empirical response-time data.

Keywords

cheating, Lagrange multiplier test, lognormal model, maximum-likelihood estimation, response time

One of the increasing concerns of producers and consumers of test scores is fraudulent behavior during the test. Such behavior is more likely to be observed when the test takers' stakes are high, such as in admission, licensing, and certification testing. A simple form of cheating is one test taker attempting to copy the answers from another. This may happen without the source noticing. The classical form of this type of cheating is one test taker peeking at another’s answer sheet. A more modern form arises when computerized tests are administered in a regular classroom setting. Computer screens now have such a high resolution that, unless special arrangements are made, students can easily read the screens of others a few rows in front of them.

Test takers may also plan to collude during the test. A classical form of collusion is the signaling of question and answer numbers between test takers using a prearranged code (e.g., number of silent finger taps). More advanced communication is possible in the form of electronic transmission of data. One obvious instance is communication between test takers through a local network or over the internet when their PCs have not been locked down properly. However, more sophisticated forms have now become possible as a result of the rapid miniaturization of electronic devices. For instance, some digital pens now look like regular pens and automatically record handwritten responses or notes. The author would not be surprised if soon test takers would be caught transmitting such information wirelessly to others.

The traditional weapon of the testing industry against collusion is analysis of the agreement between the response patterns of the suspected test takers. The key quantity in all current statistics for detecting answer copying on multiple-choice items—traditionally one of the item formats most vulnerable to cheating—is the number of matching alternatives between test takers (Angoff, 1974; Frary, Tideman, & Watts, 1977; Holland, 1996; Lewis & Thayer, 1998; Sotaridona & Meijer, 2002, 2003; Sotaridona, van der Linden, & Meijer, 2006; van der Linden & Sotaridona, 2004, 2006; Wollack, 1997; Wollack & Cohen, 1998). However, these statistics differ in more or less subtle aspects. For example, some of them ignore matching correct alternatives, the idea being that a string of them may also be indicative of high-ability test takers who worked independently, whereas a string of matching incorrect alternatives is much less likely to occur under normal circumstances. Another difference resides in the null distributions postulated for these statistics. Some of them make an appeal to large-sample normality after some form of standardization, whereas others model the process of answer copying and derive the test statistic from the model.

It is not our intention to review these tests of answer copying or advocate a position with respect to the different assumptions underlying them. Readers interested in our view should refer to the introductory section of van der Linden and Sotaridona (2006). Instead, the goal of this research was to find an alternative method of detecting collusion between test takers based on their response times (RTs). RTs are a valuable extra source of information about the behavior of test takers during the test. For example, they have the advantage of being continuous rather than dichotomous and are therefore more informative about the size of possible aberrances than responses.

An even more attractive feature of RTs is that their use is not restricted to any specific response format. Particularly, it does not suppose the multiple-choice format for which the statistical tests above were developed. In fact, the more complicated the response format, the more informative the RTs. This point can be illustrated by the constructed-response format, for which we could not only analyze the total RT but also the time elapsed, for example, until a specific concept was introduced or a necessary tool was used. Of course, it will be more difficult to automate RT analyses, the closer we get to the level of the full set of time stamps in the test takers' logfiles. However, the possibilities are seemingly endless.

Finally, because the focus is no longer on responses, we avoid the statistical subtleties involved in deciding on what information the statistical test should be conditioned: the number of incorrect alternatives by the source, his or her entire pattern of incorrect alternatives, or even the two full patterns of responses for the pair of test takers (Lewis, 2006). Such decisions become especially difficult when it is impossible to identify one test taker as the source and the other as the copier.

However, RT-based procedures for detecting collusion may seem to have disadvantages. Clearly, they can only be applied to tests administered in a computerized mode. However, we believe this to be a short-term disadvantage. Computerized testing offers numerous improvements over paper-and-pencil testing, such as the use of innovative item types, immediate scoring, and adaptive item selection. We therefore expect it soon to be the standard mode of testing. Because collusion is only possible when items between test takers are the same, this type of cheating is not a serious danger to adaptive testing, though. For RT-based procedures to detect aberrances more typical of adaptive testing, such as memorization and preknowledge of some of the items, see van der Linden and Guo (2008).

A more serious disadvantage may be test takers becoming aware of the fact that their RTs are checked and trying to fake realistic RTs to hide their collusion. However, we expect this to be too difficult for them, especially because they have to do this in real time. The expected RTs on the items in a typical operational test easily differ by a factor of 5–8. For test takers trying to hide their collusion, it would be quite a challenge to find out what their patterns should be on a set of test items they have not seen before while keeping an eye on the clock, still solving the items, and avoiding running out of time. A more effective strategy for them might be to plan ahead. One possibility that has come to the author’s mind is an odd-even strategy, in which one test taker works on the odd and the other on the even items and they periodically exchange the answers. Such strategies may not be detected by the statistics presented in this manuscript but would always leave a trail of RTs that could be picked up by supplemental analyses. In fact, this particular trail would immediately be flagged by the procedure for the detection of aberrances in individual RT patterns in van der Linden and Guo (2008).

As in any other area of statistical decision making, detection of collusion between test takers is prone to Type 1 and Type 2 errors; here, these are the errors of false accusations of cheating and incorrect assumptions of regular behavior. The former hurt innocent test takers; the latter are unfair to fellow test takers, especially if they are competing with the cheaters for scarce positions. In addition, cheating always undermines the credibility of the testing program.

The question of how to balance these errors is a policy issue that transcends the statistical aspects of the problem. But whatever the policy, it is always beneficial to take steps to reduce the likelihood of both types of errors to the maximum extent possible. One important step is to conduct response-based and RT-based procedures independently, and then check whether their conclusions agree. In addition, the decisions should be corroborated by additional observations (e.g., reports by proctors and analyses of logfiles). In the absence of such evidence, a practical approach already followed by some test organizations is to suspend the decision and invite the suspects to take the test again under controlled conditions.

1. RT Models

The analog of a string of matching responses between two test takers is a string of correlating RTs. The detection of such strings should be based on a statistical model for RT distributions under regular test behavior; otherwise, we might easily make wrong inferences. For example, as already indicated, RT distributions on test items depend heavily on their labor or time intensity. Because of this, the RTs of different test takers on the same set of test items always show a tendency to correlate, simply because of the variation in the time intensity from one item to the next. Likewise, test takers vary in the speed at which they operate. Ignoring this factor might lead to the confusion of speed with collusion as the cause of similar RTs. For empirical examples of spurious relations in patterns of observed RTs, see van der Linden (in press) and van der Linden and Guo (2008).

A model that does take item and person effects into account is the lognormal RT model (van der Linden, 2006). The core of the model is an equation that defines the speed of labor by a test taker on an item. Let $β_{i}^{*}$ denote the (unknown) amount of labor required to solve item $i$ and $t_{i j}$ the RT by test taker $j$ on this item. Observe that this RT is the time elapsed during labor. Hence, the (average) speed of labor $τ_{j}^{*}$ by test taker $j$ on item $i$ is $τ_{j}^{*} = \frac{β_{i}^{*}}{t_{i j}} .$ 1 This definition follows the format of any definition of speed as the rate of change of some substantive measure with respect to time. For example, the well-known definition of the speed of motion of a body from physics has the same form as Equation 1 but with the amount of labor in the numerator replaced by the distance traveled. For more on this fundamental equation of RT modeling, see van der Linden (in press).

Observe that Equation 1 is equivalent to $t_{i j} = \frac{β_{i}^{*}}{τ_{j}^{*}},$ 2

which shows that the equation actually decomposes one (known) RT into two (unknown) effects for the person and the item. The lognormal model follows from Equation 2 in two obvious steps: (a) a logarithmic transformation of the RTs to remove the typical skewness from their distributions and (b) the addition of an item-specific random term to allow for the randomness of RTs. The result is $\ln T_{i j} = β_{i} - τ_{j} + ε_{i}, ε_{i} \sim N (0, α_{i}^{- 2}),$ 3

where $β_{i}$ and $τ_{j}$ are now parameters for the item and person effects on a logarithmic scale and $α_{i}$ can be interpreted as an item discrimination parameter. Equivalently, $T_{i j}$ follows a lognormal distribution with density $f (t_{i j}; τ_{j}, α_{i}, β_{i}) = \frac{α_{i}}{t_{i j} \sqrt{2 π}} \exp \left\{ - \frac{1}{2} {[α_{i} (\ln t_{i j} - (β_{i} - τ_{j}))]}^{2} \right\},$ 4

where $τ_{j} \in ℝ$ is the speed at which test taker $j$ operates on the test, $β_{i} \in ℝ$ is the time or labor intensity of item $i$, and $α_{i} \in ℝ^{+}$ is its discrimination parameter. The factor $1 / t_{i j}$ in Equation 4 arises from the change of variable in the density. The model is not yet identifiable, but a convenient way to obtain identifiability is to set the mean speed to zero, $μ_{τ} = 0,$ 5 for the population of test takers on which the items are calibrated.
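To make Equations 3 and 4 concrete, the following minimal Python sketch evaluates the lognormal density and draws a single RT from the model. The function names (`lognormal_rt_density`, `simulate_rt`) and the parameter values are illustrative assumptions and are not part of the original model specification.

```python
import math
import random

def lognormal_rt_density(t, tau, alpha, beta):
    """Density of Equation 4 for a response time t > 0."""
    # Standardized residual log RT: alpha * (ln t - (beta - tau)).
    psi = alpha * (math.log(t) - (beta - tau))
    return alpha / (t * math.sqrt(2.0 * math.pi)) * math.exp(-0.5 * psi ** 2)

def simulate_rt(tau, alpha, beta, rng):
    """Draw one RT from Equation 3: ln T = beta - tau + eps, eps ~ N(0, alpha^-2)."""
    return math.exp(beta - tau + rng.gauss(0.0, 1.0 / alpha))

# Illustrative (assumed) values: an average-speed test taker on one item.
rng = random.Random(1)
t = simulate_rt(tau=0.0, alpha=1.8, beta=2.7, rng=rng)
density_at_t = lognormal_rt_density(t, tau=0.0, alpha=1.8, beta=2.7)
```

A larger $α_{i}$ shrinks the standard deviation $1 / α_{i}$ of the log RTs, which is why it acts as a discrimination parameter in this sketch.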

For statistical issues related to parameter estimation and model validation, which can be performed efficiently by embedding the model in a hierarchical framework along with a regular response model, see Fox, Klein Entink, and van der Linden (2007); Klein Entink, Fox, and van der Linden (2009); and van der Linden (2006, 2007). A review of empirical applications of the model is given in van der Linden (2009).

1.1 Bivariate Lognormal Model

To analyze the RTs of pairs of test takers for possible collusion when their regular behavior can be assumed to follow the lognormal model in Equation 4, we need an extension of the model for the joint distribution of their RTs on a fixed item. Obviously, for a pair of test takers $(j, k)$, the bivariate generalization of Equation 3 for the distribution of $(\ln T_{i j}, \ln T_{i k})$ on item $i$ has density $f (\ln t_{i j}, \ln t_{i k}; τ_{j}, τ_{k}, α_{i}, β_{i}, ρ_{j k}) = \frac{α_{i}^{2}}{2 π \sqrt{1 - ρ_{j k}^{2}}} \exp \left\{ \frac{- 1}{2 (1 - ρ_{j k}^{2})} (ψ_{i j}^{2} - 2 ρ_{j k} ψ_{i j} ψ_{i k} + ψ_{i k}^{2}) \right\},$ 6 where $ψ_{i j} = α_{i} [\ln t_{i j} - (β_{i} - τ_{j})] .$ 7 Equivalently, $(T_{i j}, T_{i k})$ has a bivariate lognormal density $f (t_{i j}, t_{i k}; τ_{j}, τ_{k}, α_{i}, β_{i}, ρ_{j k}) = \frac{α_{i}^{2}}{t_{i j} t_{i k} 2 π \sqrt{1 - ρ_{j k}^{2}}} \exp \left\{ \frac{- 1}{2 (1 - ρ_{j k}^{2})} (ψ_{i j}^{2} - 2 ρ_{j k} ψ_{i j} ψ_{i k} + ψ_{i k}^{2}) \right\} .$ 8 The factor $1 / t_{i j}$ in Equation 4 generalizes to $1 / (t_{i j} t_{i k})$ in Equation 8 as a result of the two separate log transformations of $T_{i j}$ and $T_{i k}$. For $ρ_{j k} = 0$, Equations 6 and 8 reduce to the product of the two densities for the normal and lognormal representations of the univariate model in Equations 3 and 4, respectively. For computational convenience, we will capitalize on the representation in Equation 6.
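The bivariate density on the log scale and a way of drawing correlated RT pairs from it can be sketched as follows. This is a minimal illustration under the assumption that the residuals $ψ_{i j}$ and $ψ_{i k}$ are standard bivariate normal with correlation $ρ_{j k}$; the function names and parameter values are hypothetical.

```python
import math
import random

def psi(log_t, tau, alpha, beta):
    """Residual log RT of Equation 7."""
    return alpha * (log_t - (beta - tau))

def bivariate_log_density(log_tj, log_tk, tau_j, tau_k, alpha, beta, rho):
    """Joint density of (ln T_ij, ln T_ik) as in Equation 6."""
    pj = psi(log_tj, tau_j, alpha, beta)
    pk = psi(log_tk, tau_k, alpha, beta)
    quad = pj ** 2 - 2.0 * rho * pj * pk + pk ** 2
    norm = alpha ** 2 / (2.0 * math.pi * math.sqrt(1.0 - rho ** 2))
    return norm * math.exp(-quad / (2.0 * (1.0 - rho ** 2)))

def simulate_pair(tau_j, tau_k, alpha, beta, rho, rng):
    """Draw (T_ij, T_ik) whose residuals correlate rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    pj = z1
    pk = rho * z1 + math.sqrt(1.0 - rho ** 2) * z2  # correlated with pj
    # Invert Equation 7: ln t = beta - tau + psi / alpha.
    return (math.exp(beta - tau_j + pj / alpha),
            math.exp(beta - tau_k + pk / alpha))

# Illustrative (assumed) values for one item and one pair of test takers.
t_j, t_k = simulate_pair(0.0, 0.0, alpha=1.8, beta=2.7, rho=0.5,
                         rng=random.Random(3))
```

For `rho = 0.0` the function `bivariate_log_density` factors into the product of two univariate normal densities, which mirrors the remark about Equations 3 and 4 above.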

Note that ρ_jk is a parameter for the degree to which the RTs by two fixed test takers agree. It is thus independent of the RTs of any of the other test takers. This fact is important because several of the statistical tests for matching alternatives on multiple-choice tests referred to earlier are population dependent and lead to different decisions when the same pair of test takers would be included in different populations. In addition, note that ρ_jk is the correlation between ψ_ij and ψ_ik in Equation 7. Because the impact of the identifiability restriction in Equation 5 on β_i and τ_j on ψ_ij cancels, the use of the population mean in Equation 5 does not make any of the inferences about ρ_jk population dependent.

Another important feature of the model is its automatic correction of ρ_jk for the time intensities β_i of the items. As a result, ρ_jk is insensitive to spuriousness in the observed correlation between the logtimes of two test takers because of the variation in the time intensity between the items. The model also allows us to compare the speed τ_j of different test takers. This is important because one would expect test takers who collude to work at approximately the same speed. Although there may be several other reasons why test takers have similar speed, this expectation makes τ = (τ_j, τ_k) a second potential parameter of interest. We will return to this point later.
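The correction for the time intensities can be illustrated with a small simulation. The sketch below generates logtimes for two test takers who work independently ($ρ_{j k} = 0$) and compares the correlation between their raw logtimes with the correlation between their residuals. All numerical settings (2,000 items, a common discrimination of 2.0, uniformly distributed time intensities, known speeds) are assumptions chosen only to make the pattern stable; they do not come from the article.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson product-moment correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(2024)
n_items = 2000             # hypothetical; far longer than an operational test
alpha = 2.0                # assumed common discrimination
tau_j, tau_k = 0.2, -0.2   # assumed (known) speeds
betas = [rng.uniform(2.0, 4.0) for _ in range(n_items)]

# Independent test takers: separate random terms, shared item effects.
log_tj = [b - tau_j + rng.gauss(0.0, 1.0 / alpha) for b in betas]
log_tk = [b - tau_k + rng.gauss(0.0, 1.0 / alpha) for b in betas]

# Correlation of the raw logtimes is inflated by the shared betas.
raw_r = pearson(log_tj, log_tk)

# Residuals of Equation 7 remove the item effects before correlating.
psi_j = [alpha * (lt - (b - tau_j)) for lt, b in zip(log_tj, betas)]
psi_k = [alpha * (lt - (b - tau_k)) for lt, b in zip(log_tk, betas)]
resid_r = pearson(psi_j, psi_k)
```

Under these assumed settings the raw correlation is clearly positive even though the two test takers never interact, whereas the residual correlation stays close to zero.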

2. Estimation of $ρ_{j k}$

It is assumed that the items have been calibrated with enough precision before their operational use and that we can therefore treat their parameters as known. Thus, for test takers $j$ and $k$ on items $i = 1, \dots, n$, the only unknown parameters are the parameters of interest, $ρ_{j k}$ and $τ = (τ_{j}, τ_{k})$. In an abuse of notation, let $\ln t_{j} = (\ln t_{1 j}, \dots, \ln t_{n j})$ and $\ln t_{k} = (\ln t_{1 k}, \dots, \ln t_{n k})$. Hence, the log likelihood of $(ρ_{j k}, τ)$ can be written as follows: $ℓ (ρ_{j k}, τ; \ln t_{j}, \ln t_{k}) = \text{const} - \frac{n}{2} \ln (1 - ρ_{j k}^{2}) - \sum_{i = 1}^{n} \frac{1}{2 (1 - ρ_{j k}^{2})} (ψ_{i j}^{2} - 2 ρ_{j k} ψ_{i j} ψ_{i k} + ψ_{i k}^{2}) .$ 9 From Equation 9, using the first-order derivatives in Equations A1, A2, and A3 in the appendix, the likelihood equations for $(ρ_{j k}, τ)$ follow as $\frac{\partial ℓ (ρ_{j k}, τ; \ln t_{j}, \ln t_{k})}{\partial τ_{j}} = \frac{1}{(1 - ρ_{j k}^{2})} \sum_{i = 1}^{n} α_{i} (ρ_{j k} ψ_{i k} - ψ_{i j}) = 0,$ 10 $\frac{\partial ℓ (ρ_{j k}, τ; \ln t_{j}, \ln t_{k})}{\partial τ_{k}} = \frac{1}{(1 - ρ_{j k}^{2})} \sum_{i = 1}^{n} α_{i} (ρ_{j k} ψ_{i j} - ψ_{i k}) = 0,$ 11 $\frac{\partial ℓ (ρ_{j k}, τ; \ln t_{j}, \ln t_{k})}{\partial ρ_{j k}} = \frac{n ρ_{j k}}{(1 - ρ_{j k}^{2})} - \sum_{i = 1}^{n} \frac{ρ_{j k} ψ_{i j}^{2} - ρ_{j k}^{2} ψ_{i j} ψ_{i k} + ρ_{j k} ψ_{i k}^{2} - ψ_{i j} ψ_{i k}}{{(1 - ρ_{j k}^{2})}^{2}} = 0.$ 12 The first two equations yield $ρ_{j k} \sum_{i = 1}^{n} α_{i}^{2} [\ln t_{i k} - (β_{i} - τ_{k})] = \sum_{i = 1}^{n} α_{i}^{2} [\ln t_{i j} - (β_{i} - τ_{j})],$ 13 $ρ_{j k} \sum_{i = 1}^{n} α_{i}^{2} [\ln t_{i j} - (β_{i} - τ_{j})] = \sum_{i = 1}^{n} α_{i}^{2} [\ln t_{i k} - (β_{i} - τ_{k})] .$ 14 Note the symmetry between Equations 13 and 14. Hence, they only hold when the two sums are equal to 0 or $| ρ_{j k} | = 1$. The first condition gives ${\hat{τ}}_{j} = \frac{\sum_{i = 1}^{n} α_{i}^{2} (β_{i} - \ln t_{i j})}{\sum_{i = 1}^{n} α_{i}^{2}},$ 15 ${\hat{τ}}_{k} = \frac{\sum_{i = 1}^{n} α_{i}^{2} (β_{i} - \ln t_{i k})}{\sum_{i = 1}^{n} α_{i}^{2}} .$ 16 The second condition can be ignored because the interest is in a nondegenerate bivariate distribution. In addition, $ρ_{j k} = 1$ involves identical RTs and in this degenerate case, Equations 15 and 16 already give identical results.

Following an argument in Lehmann (1999, ex. 7.5.4), the solution of Equation 12 is ${\hat{ρ}}_{j k} = \frac{\sum_{i = 1}^{n} {\hat{ψ}}_{i j} {\hat{ψ}}_{i k}}{{(\sum_{i = 1}^{n} {\hat{ψ}}_{i j}^{2} \sum_{i = 1}^{n} {\hat{ψ}}_{i k}^{2})}^{1 / 2}},$ 17 where ${\hat{ψ}}_{i j}$ and ${\hat{ψ}}_{i k}$ are the residuals in Equation 7 with the estimates in Equations 15 and 16 substituted for the τs.

The maximum-likelihood estimator (MLE) of $τ_{j}$ weighs the observations $β_{i} - \ln t_{i j}$ by the square of the item discrimination parameters $α_{i}$ . Its expression thus has the well-known form of a precision-weighted average. The averaged quantity is $β_{i} - \ln t_{i j}$ , which is the negative of the logtimes corrected for the time intensities of the items (the negative is due to the fact that $τ_{j}$ is a speed and not a slowness parameter). The MLE of $ρ_{j k}$ appears to have the regular form of a sample covariance over the product of its two standard deviations. However, these quantities are not calculated over the logtimes but their estimated residuals in Equation 7.
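The closed-form estimators in Equations 15 through 17 translate directly into a few lines of code. The sketch below is a minimal illustration; the function names are hypothetical, and the toy data are the first four items for Participants 1 and 2 in Table 1, used here only to show the call pattern rather than to reproduce the reported estimates.

```python
import math

def mle_speed(log_t, alphas, betas):
    """Equations 15-16: precision-weighted mean of beta_i - ln t_ij."""
    num = sum(a ** 2 * (b - lt) for a, b, lt in zip(alphas, betas, log_t))
    den = sum(a ** 2 for a in alphas)
    return num / den

def residuals(log_t, tau, alphas, betas):
    """Equation 7 evaluated at a given speed."""
    return [a * (lt - (b - tau)) for a, b, lt in zip(alphas, betas, log_t)]

def mle_rho(psi_j, psi_k):
    """Equation 17: cross-product of residuals over the root of their sums of squares."""
    s_jk = sum(x * y for x, y in zip(psi_j, psi_k))
    s_jj = sum(x * x for x in psi_j)
    s_kk = sum(y * y for y in psi_k)
    return s_jk / math.sqrt(s_jj * s_kk)

# Toy data: item parameters and RTs for the first four items in Table 1.
alphas = [1.8, 2.1, 1.7, 1.9]
betas = [2.7, 3.0, 3.3, 3.4]
log_tj = [math.log(t) for t in (22, 19, 40, 43)]
log_tk = [math.log(t) for t in (26, 38, 101, 57)]

tau_j = mle_speed(log_tj, alphas, betas)
tau_k = mle_speed(log_tk, alphas, betas)
psi_j = residuals(log_tj, tau_j, alphas, betas)
psi_k = residuals(log_tk, tau_k, alphas, betas)
rho_hat = mle_rho(psi_j, psi_k)
```

Note that the residuals are not centered before the cross-products are taken; the estimated speeds already absorb the level of each test taker, so that the $α_{i}$-weighted sum of each residual vector is zero at the MLE.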

3. Test of Hypothesis on $ρ_{j k}$

In a routine screening of test data, we could not only calculate the MLE of $ρ_{j k}$ but also conduct a formal statistical test of a hypothesis about the parameter. The main difference between the two procedures is that the latter allows for the uncertainty in the parameter estimates.

We discuss the test of $H_{0} : ρ_{j k} = 0$ 18 against $H_{1} : ρ_{j k} > 0.$ 19 This test has the advantage of a statistic that reduces to a simple expression and allows us to directly follow the argument on which it is based. The generalization to a test of the more general hypothesis $H_{0} : ρ_{j k} = c$ against $H_{1} : ρ_{j k} > c$ is outlined in the appendix. The generalization is straightforward and allows us to check whether $ρ_{j k}$ differs substantially enough from zero to warrant concerns about possible collusion between the test takers, for instance, in situations when the test of Equation 18 against Equation 19 appears to have too much power and may detect negligible differences from $ρ_{j k} = 0$ . For such cases, see below.

The proposed test is a Lagrange multiplier (LM) test. For a general introduction to this type of test and a discussion of their favorable statistical properties, the reader is referred to Lehmann (1999, sect. 7.7) or Silvey (1975, sect. 7.4). The use of LM tests for the diagnosis of violations of various IRT models has been popularized by recent work by Glas and his associates (e.g., Glas, 1999; Glas & Dagohoy, 2007; Glas & Suárez Falcón, 2003; van der Linden & Glas, in press). A practical feature of LM tests is that their statistics have to be evaluated only at the MLEs of the parameters in the null model and more complicated computation of any of the estimates of the alternative parameters is avoided.

The statistic for the LM test of Equation 18 against Equation 19 with unknown parameters $τ = (τ_{j}, τ_{k})$ can be written as follows: $LM(ρ_{jk}) = \frac{h {(ρ_{j k})}^{2}}{h (ρ_{j k}, ρ_{j k}) - H (τ, ρ_{jk})' H {(τ, τ)}^{- 1} H (τ, ρ_{jk})} |_{τ = \hat{τ}, ρ_{jk} = 0},$ 20 where $h (ρ_{j k}) = \frac{\partial ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial ρ_{j k}},$ 21 $h (ρ_{j k}, ρ_{j k}) = - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial ρ_{j k}^{2}},$ 22 $H (τ, τ) = (\begin{matrix} - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{j}^{2}} & - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{j} \partial τ_{k}} \\ - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{k} \partial τ_{j}} & - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{k}^{2}} \end{matrix}),$ 23 and $H (τ, ρ_{j k}) = (\begin{matrix} - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{j} \partial ρ_{j k}} \\ - \frac{\partial^{2} ℓ (ρ_{j k}, τ; t_{j}, t_{k})}{\partial τ_{k} \partial ρ_{j k}} \end{matrix}) .$ 24 $h (ρ_{j k})$ is known as the score function associated with $ρ_{j k}$ , whereas $h (ρ_{j k}, ρ_{j k})$ , $H (τ, τ)$ , and $H (τ, ρ_{j k})$ are submatrices of the observed information matrix. The statistic in Equation 20 is asymptotically $χ^{2}$ distributed with one degree of freedom.

All necessary derivatives are given in the appendix. Observe that they are for the full log-likelihood function in Equation 9; that is, under the alternative bivariate lognormal model. However, when calculating Equation 20, the derivatives need to be evaluated not only at the MLEs of $τ_{j}$ and $τ_{k}$ in Equations 15 and 16 but also with $ρ_{j k}$ set equal to its null value. For the $H_{0}$ in Equation 18, this parameter thus vanishes from the derivatives, which simplify considerably. Besides, $H (τ, τ)$ specializes to a diagonal matrix. As a result, the LM statistic in Equation 20 can easily be computed as follows: $LM(ρ_{jk}) = \frac{{(\sum_{i = 1}^{n} {\hat{ψ}}_{i j} {\hat{ψ}}_{i k})}^{2}}{\sum_{i = 1}^{n} ({\hat{ψ}}_{i j}^{2} + {\hat{ψ}}_{i k}^{2}) - n - \frac{{(\sum_{i = 1}^{n} α_{i} {\hat{ψ}}_{i j})}^{2} + {(\sum_{i = 1}^{n} α_{i} {\hat{ψ}}_{i k})}^{2}}{\sum_{i = 1}^{n} α_{i}^{2}}} .$ 25 It is interesting to note the presence of the same key quantity in the numerators of LM $(ρ_{j k})$ and the MLE of $ρ_{j k}$ in Equation 17. As explained in Glas (1999), the denominator of Equation 20 can be interpreted as the variance of the score function for the parameters under the null hypothesis adjusted for the estimation of the unknown parameters. It follows that LM $(ρ_{j k})$ can be interpreted as the ratio of the squared average cross-product of the residual logtimes and their adjusted variances, which explains its asymptotic behavior as a chi-square statistic.
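Equation 25 can be computed with the following minimal sketch, which takes the estimated residuals and the item discriminations as input. The function names are hypothetical. The p value relies on the identity that, for a chi-square variable with one degree of freedom, the upper-tail probability at $x$ equals $\operatorname{erfc}(\sqrt{x / 2})$, so no external library is assumed. The correction term in the denominator is kept for fidelity to Equation 25, although the sums $\sum α_{i} {\hat{ψ}}_{i j}$ are zero by construction when the speeds are estimated exactly by Equations 15 and 16.

```python
import math

def lm_statistic(psi_j, psi_k, alphas):
    """LM statistic of Equation 25 for H0: rho_jk = 0."""
    n = len(alphas)
    score = sum(x * y for x, y in zip(psi_j, psi_k))          # numerator root
    info_rho = sum(x * x + y * y for x, y in zip(psi_j, psi_k)) - n
    sum_a2 = sum(a * a for a in alphas)
    # Adjustment for estimating the two speed parameters.
    adj = (sum(a * x for a, x in zip(alphas, psi_j)) ** 2
           + sum(a * y for a, y in zip(alphas, psi_k)) ** 2) / sum_a2
    return score ** 2 / (info_rho - adj)

def chi2_1_sf(x):
    """Upper-tail probability of a chi-square(1) variable at x >= 0."""
    return math.erfc(math.sqrt(x / 2.0))

# Hypothetical residuals for four items, chosen for easy arithmetic.
psi_j = [1.0, -1.0, 2.0, -2.0]
psi_k = [1.0, -1.0, -2.0, 2.0]
alphas = [1.0, 1.0, 1.0, 1.0]

lm = lm_statistic(psi_j, psi_k, alphas)   # (-6)^2 / (20 - 4 - 0) = 2.25
p_value = chi2_1_sf(lm)
```

As a check on the p value helper, `chi2_1_sf(1.71)` is approximately .19, which agrees with the p value reported for the third pair of participants in Table 2.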

4. Illustration With Empirical RTs

The effectiveness of the bivariate RT model in avoiding erroneous conclusions about collusion between test takers likely to arise when the inference is based on direct analyses of the RTs without any formal model is illustrated for a small sample of test takers from an administration of the Natural World Assessment (NAW-8) test in a study by Wise, Kong, and Pastor (2007). The NAW-8 is a test of quantitative and scientific reasoning proficiencies for college students. In an earlier study, we calibrated the same items under the lognormal RT model in Equation 4 and obtained excellent fit of the model to the items (Klein Entink et al., 2009).

In the Wise et al. (2007) administration of the NAW-8, the students served as participants in an experiment and had no stakes whatsoever in the test. Attempts to detect collusion between them would therefore be bound to fail. In fact, Wise et al. conducted their study to show the possibilities of using RTs for the detection of test takers with motivation problems. However, the goal of our current use of the data set is only to exemplify the proposed procedures for a few typical RT patterns. In particular, we wanted to show how a focus on the estimation of the correlation parameter $ρ_{j k}$ in the model (or a statistical test of a hypothesis about it) can prevent us from making the erroneous inferences suggested by a direct inspection of the RT patterns. In addition, the example is intended to illustrate the ease of calculating all relevant quantities in the procedures.

We analyzed the first 10 pairs of participants in the data set. The results for the three pairs of participants in Tables 1 and 2 were picked because they nicely illustrate the points we wanted to make with this example. The estimates of the time intensity and discrimination parameters of the items and the RTs of the participants in the example are shown in Table 1. The estimates of the item parameters were obtained from the full data set for 386 test takers using the identifiability restriction in Equation 5. Hence, the MLEs of the speed parameters of the six participants in the first row of Table 2, which were calculated directly from the logRTs and the $α_{i}$ parameters of the items using Equations 15 and 16, center about zero. The next relevant quantities are the residual logRTs $ψ_{i j}$ in Equation 7, whose estimates follow directly once the MLEs of the speed parameters have been calculated. These estimates are also shown in Table 1. Remember that all model parameters are on the logarithmic scale, and that the same thus holds for the estimates of $ψ_{i j}$ .

TABLE 1
Response Times (RTs) and Estimated Residual logRTs for Three Pairs of Participants

Item	$α_{i}$	$β_{i}$	Participant 1 $T_{i j}$	Participant 1 ${\hat{ψ}}_{i j}$	Participant 2 $T_{i j}$	Participant 2 ${\hat{ψ}}_{i j}$	Participant 3 $T_{i j}$	Participant 3 ${\hat{ψ}}_{i j}$	Participant 4 $T_{i j}$	Participant 4 ${\hat{ψ}}_{i j}$	Participant 5 $T_{i j}$	Participant 5 ${\hat{ψ}}_{i j}$	Participant 6 $T_{i j}$	Participant 6 ${\hat{ψ}}_{i j}$
1	1.8	2.7	22	0.53	26	−0.18	20	−0.59	21	0.83	21	−0.42	13	0.42
2	2.1	3.0	19	−0.14	38	0.14	43	0.46	15	−0.18	39	0.36	14	0.20
3	1.7	3.3	40	0.54	101	1.16	51	0.04	22	−0.12	86	1.02	4	−2.64
4	1.9	3.4	43	0.57	57	0.04	45	−0.35	22	−0.29	33	−0.84	26	0.50
5	2.0	3.1	27	0.27	37	−0.22	26	−0.87	23	0.39	60	0.92	18	0.40
6	1.9	3.0	27	0.50	21	−1.03	37	0.09	20	0.34	35	0.08	18	0.62
7	1.7	3.2	45	0.99	116	1.64	90	1.26	16	−0.41	33	−0.37	23	0.64
8	1.8	3.3	23	−0.33	44	−0.17	31	−0.76	24	0.14	45	0.01	12	−0.67
9	2.0	2.5	14	0.16	10	−1.60	12	−1.19	19	1.18	30	0.70	11	0.60
10	1.5	3.6	47	0.32	117	0.85	80	0.32	18	−0.82	37	−0.79	30	0.34
11	1.5	2.6	12	−0.21	18	−0.44	22	−0.10	7	−0.70	25	0.17	3	−1.60
12	1.7	2.1	5	−0.90	9	−0.85	23	0.80	10	0.66	17	0.37	2	−1.67
13	1.7	2.0	3	−1.63	25	1.06	29	1.31	9	0.64	12	−0.07	4	−0.33
14	1.8	1.9	4	−0.91	16	0.55	22	1.17	11	1.27	14	0.46	4	−0.08
15	1.8	2.9	16	−0.25	34	0.09	17	−1.07	19	0.43	23	−0.46	46	2.42
16	1.9	3.0	19	−0.10	27	−0.51	41	0.35	6	−1.92	51	0.87	16	0.46
17	1.9	1.9	9	0.47	6	−1.35	11	−0.15	5	−0.23	7	−0.91	6	0.59
18	2.1	1.9	6	−0.36	17	0.65	9	−0.63	3	−1.37	7	−1.05	6	0.62
19	1.6	3.4	47	0.74	97	0.97	105	1.14	25	0.06	42	−0.22	4	−2.43
20	1.8	2.6	12	−0.21	20	−0.30	16	−0.66	12	0.19	22	0.02	12	0.65

TABLE 2

Estimated Speed Parameters, Estimated Correlations, and Lagrange Multiplier (LM) Statistics for Three Pairs of Participants

	Participant 1	Participant 2	Participant 3	Participant 4	Participant 5	Participant 6
${\hat{τ}}_{j}$	−0.05	−0.61	−0.59	0.16	−0.53	0.41
r_jk	0.89		0.43		0.09
${\hat{ρ}}_{j k}$	0.02		−0.03		−0.36
LM $(ρ_{j k})$	0.02		0.04		1.71
p_jk	0.90		0.83		0.19

A naive analysis of the agreement between the RTs of two test takers would focus on the Pearson correlation $r_{j k}$ between their RTs. For the three pairs of participants, these correlations in the second row of Table 2 seem to suggest substantial agreement between the behavior of the first two pairs and no agreement at all between the third pair. In addition, the lack of correlation for the third pair may seem to be consistent with their RT patterns in Table 1, which displays generally much greater RTs for Participant 5 (the difference in the size between the RTs of these two participants seems to be confirmed by the MLEs of their speed parameters: Participant 5 [ ${\hat{τ}}_{5} = - 0.53$ ] worked considerably slower than Participant 6 [ ${\hat{τ}}_{6} = 0.41$ ]).

However, the estimates of correlation parameter $ρ_{j k}$ in the bivariate RT model reveal an entirely different picture. The estimates for the first two pairs of participants point at negligible correlation: ${\hat{ρ}}_{12} = 0.02$ and ${\hat{ρ}}_{34} = - 0.03$. The lack of correlation is confirmed by the values of the statistic for the test of $H_{0} : ρ_{j k} = 0$ in the last two rows of Table 2: LM $(ρ_{12}) = .02$ and LM $(ρ_{34}) = .04$ with $p_{12} = .90$ and $p_{34} = .83$. However, the estimate for the third pair of participants was negative: ${\hat{ρ}}_{56} = - 0.36$. Its size was substantial enough to be surprising but too small to have much practical meaning: LM $(ρ_{56})$ was equal to 1.71, with $p_{56} = 0.19$. A check of the estimated residuals ${\hat{ψ}}_{i j}$ for the third pair of participants in Table 1 reveals that the negative value of ${\hat{ρ}}_{56}$ is almost entirely explained by extreme residuals for Participant 6 on only three of the items (items 3, 15, and 19).
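The contrast between the naive and the model-based analysis can be retraced for the first pair of participants directly from the entries in Table 1. The sketch below is an illustration only: it uses the rounded tabulated values, so small discrepancies from the figures in Table 2 are to be expected, and the variable names are hypothetical.

```python
import math

# Table 1, Participants 1 and 2: RTs, estimated residuals, discriminations.
rt_1 = [22, 19, 40, 43, 27, 27, 45, 23, 14, 47,
        12, 5, 3, 4, 16, 19, 9, 6, 47, 12]
rt_2 = [26, 38, 101, 57, 37, 21, 116, 44, 10, 117,
        18, 9, 25, 16, 34, 27, 6, 17, 97, 20]
psi_1 = [0.53, -0.14, 0.54, 0.57, 0.27, 0.50, 0.99, -0.33, 0.16, 0.32,
         -0.21, -0.90, -1.63, -0.91, -0.25, -0.10, 0.47, -0.36, 0.74, -0.21]
psi_2 = [-0.18, 0.14, 1.16, 0.04, -0.22, -1.03, 1.64, -0.17, -1.60, 0.85,
         -0.44, -0.85, 1.06, 0.55, 0.09, -0.51, -1.35, 0.65, 0.97, -0.30]
alphas = [1.8, 2.1, 1.7, 1.9, 2.0, 1.9, 1.7, 1.8, 2.0, 1.5,
          1.5, 1.7, 1.7, 1.8, 1.8, 1.9, 1.9, 2.1, 1.6, 1.8]

def pearson(x, y):
    """Pearson correlation, as used in the naive analysis of raw RTs."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sum((a - mx) ** 2 for a in x)
                           * sum((b - my) ** 2 for b in y))

n = len(alphas)
cross = sum(x * y for x, y in zip(psi_1, psi_2))
ss_1 = sum(x * x for x in psi_1)
ss_2 = sum(y * y for y in psi_2)

r_12 = pearson(rt_1, rt_2)                 # naive correlation of raw RTs
rho_12 = cross / math.sqrt(ss_1 * ss_2)    # Equation 17

# Equation 25, including the adjustment for the estimated speeds.
adj = (sum(a * x for a, x in zip(alphas, psi_1)) ** 2
       + sum(a * y for a, y in zip(alphas, psi_2)) ** 2) / sum(a * a for a in alphas)
lm_12 = cross ** 2 / (ss_1 + ss_2 - n - adj)
```

With these inputs, `r_12` comes out close to the .89 in Table 2, whereas `rho_12` and `lm_12` both come out close to .02, in line with the reported results for this pair.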

5. Power of the Test

LM tests are optimal in several ways. For instance, they are known to be locally most powerful, consistent, and asymptotically equivalent to the better-known likelihood-ratio and Wald tests. The power of LM tests has been studied empirically for several applications in response modeling in the papers by Glas and his associates referred to earlier. To demonstrate the power of the current test, a simulation study was done for the items in the same NAW-8 test as in the previous section.

The power function of the test is the probability of rejecting the null hypothesis as a function of the true value of the parameters of interest, $ρ_{j k}$ . For a test of level α, the function is given by $\Pr {LM(ρ_{jk}) > χ_{1 - α}^{2} (1) | ρ_{j k}},$ 26 where $χ_{1 - α}^{2} (1)$ is the $(1 - α)$ th quantile under a chi-square distribution with one degree of freedom. In this study, we used the conventional level of $α = .05$ .

Clearly, the power increases with the number of common items for the two test takers. We therefore simulated RTs under the bivariate lognormal model in Equation 8 for pairs of test takers on the first $n = 20$, 40, and 60 items of the NAW-8 and estimated the probability in Equation 26 as a function of $ρ_{j k} = .00 (.05) .90$. To check whether there would be any (unwanted) sensitivity of the test to the speed at which the test takers work, we simulated RTs at each of the true values of $ρ_{j k}$ for each of the speed combinations $(τ_{j}, τ_{k}) = (.0, .0), (- .1, .1), (- .2, .2),$ and $(- .3, .3)$. For each combination of these parameter values, 1,000 replications were simulated.
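A minimal sketch of this design in Python follows. It assumes that, under the model, the log RTs of a pair on item $i$ are bivariate normal with means $β_{i} - τ_{j}$ and $β_{i} - τ_{k}$, common standard deviation $1/α_{i}$, and correlation $ρ_{j k}$. It uses the item parameters of the 20 items in Table 1, treats the speed parameters as known, and summarizes each replication by the correlation between the residuals rather than by the LM statistic, whose expression is not repeated in this section. All function names are ours.

```python
import math
import random

# Item parameters copied from Table 1 (items 1-20).
ALPHA = [1.8, 2.1, 1.7, 1.9, 2.0, 1.9, 1.7, 1.8, 2.0, 1.5,
         1.5, 1.7, 1.7, 1.8, 1.8, 1.9, 1.9, 2.1, 1.6, 1.8]
BETA = [2.7, 3.0, 3.3, 3.4, 3.1, 3.0, 3.2, 3.3, 2.5, 3.6,
        2.6, 2.1, 2.0, 1.9, 2.9, 3.0, 1.9, 1.9, 3.4, 2.6]

# Design grid from the text: rho = .00 (.05) .90 and four speed combinations.
RHO_GRID = [round(0.05 * k, 2) for k in range(19)]
SPEEDS = [(0.0, 0.0), (-0.1, 0.1), (-0.2, 0.2), (-0.3, 0.3)]


def simulate_pair(rho, tau_j, tau_k, rng):
    """Draw RTs for two test takers under the assumed bivariate lognormal model."""
    t_j, t_k = [], []
    for a, b in zip(ALPHA, BETA):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        t_j.append(math.exp(b - tau_j + z1 / a))
        t_k.append(math.exp(b - tau_k + z2 / a))
    return t_j, t_k


def residuals(times, tau):
    """Standardized residuals alpha_i * (ln t_i - (beta_i - tau))."""
    return [a * (math.log(t) - (b - tau)) for t, a, b in zip(times, ALPHA, BETA)]


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((p - mean_x) * (q - mean_y) for p, q in zip(x, y))
    sxx = sum((p - mean_x) ** 2 for p in x)
    syy = sum((q - mean_y) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


def mean_residual_correlation(rho, tau_j, tau_k, reps=1000, seed=1):
    """Average residual correlation over `reps` simulated pairs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        t_j, t_k = simulate_pair(rho, tau_j, tau_k, rng)
        total += pearson(residuals(t_j, tau_j), residuals(t_k, tau_k))
    return total / reps


# One cell of the grid per speed combination, at a true correlation of .50.
for tau_j, tau_k in SPEEDS:
    print((tau_j, tau_k), round(mean_residual_correlation(0.5, tau_j, tau_k), 2))
```

Because the speed parameters are taken as known in this sketch, the residuals do not depend on $(τ_{j}, τ_{k})$ at all; the study itself addresses the more realistic case in which the speed parameters are estimated.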

The resulting estimates of the power functions are given in Figure 1 . As expected, each of the curves increased with the true value of $ρ_{j k}$ . In addition, the curves for a longer test length dominated those for a shorter length. For $ρ_{j k} = .0$ , the power of the test is always equal to its level—a feature that was reproduced by the results for each of the cases in Figure 1. In addition, across the four panels in Figure 1, the power functions appeared to be identical for all practical purposes, which indicates that the LM test was insensitive to the actual levels of speed of the test takers. Generally, the test appears to reach near perfect power at $ρ_{j k} = .4$ for test length n=60, $ρ_{j k} = .6$ for n=40, and $ρ_{j k} = .8$ for n=20.

FIGURE 1. Estimated power functions of the Lagrange multiplier test of Equation 18 against Equation 19 for test lengths of $n = 20$ (dotted curve), 40 (dashed curve), and 60 (solid curve) and combinations of speed parameters $(τ_{j}, τ_{k}) = (.0, .0), (- .1, .1), (- .2, .2),$ and $(- .3, .3)$.

Just as for any other powerful statistical test, when using the proposed test in this article we should be aware of a possible embarrassment involved in having too much power. For instance, if for a test of n=40 items the interest is not in detecting any correlations smaller than .6, it might be prudent to use the generalization of the test in the appendix, say, for c=.4.

6. Concluding Remarks

The main point in this article is that RTs offer valuable information about the test takers' behavior during a test, which can be profitably used to check data sets for suspicious agreement between the RT patterns of different test takers. However, to prevent capitalization on spurious relations, the RTs should be analyzed under a model for their joint distribution for the test takers.

The bivariate lognormal model proposed in this article captures the two main sources of spuriousness: the effects of the differences in time intensity between the items and the speed at which the test takers operate during the test. The first type of effect leads to correlation between the observed RTs that should not be confused with actual collusion between test takers. The first two pairs of participants in Table 2 illustrated this point. The second type leads to additional similarity between RTs when two test takers happen to work at approximately the same speed. For example, the agreement in the general level of the RTs of Participants 2, 3, and 5 is due to their generally working slowly on the test and does not point to any coordination of their behavior.
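The first source of spuriousness can be made concrete with a short calculation. It is only a sketch under assumptions that are ours: the log RTs are written as $\ln T_{i j} = β_{i} - τ_{j} + ε_{i j}/α_{i}$ with standard normal $ε_{i j}$, the pair $(ε_{i j}, ε_{i k})$ has correlation $ρ_{j k}$, and the items are treated as a random sample with variance $σ_{β}^{2}$ of the time intensities and mean $\overline{α^{-2}}$ of the inverse squared discrimination parameters, independent of the $ε_{i j}$. Taking moments over items then gives:

```latex
\begin{align*}
\operatorname{Cov}_i(\ln T_{ij}, \ln T_{ik}) &= \sigma_\beta^2 + \rho_{jk}\,\overline{\alpha^{-2}},\\
\operatorname{Var}_i(\ln T_{ij}) = \operatorname{Var}_i(\ln T_{ik}) &= \sigma_\beta^2 + \overline{\alpha^{-2}},\\
\operatorname{Corr}_i(\ln T_{ij}, \ln T_{ik}) &= \frac{\sigma_\beta^2 + \rho_{jk}\,\overline{\alpha^{-2}}}{\sigma_\beta^2 + \overline{\alpha^{-2}}}.
\end{align*}
```

For the 20 items in Table 1, $σ_{β}^{2}$ is roughly 0.30 and $\overline{α^{-2}}$ roughly 0.31, so under these assumptions two test takers who work entirely independently ($ρ_{j k} = 0$) would still be expected to show a correlation of about one half between their log RTs. The speed parameters drop out of this expression because they only shift the mean of each test taker's log RTs.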

A third type of quantity that should always be checked, especially when the estimate of $ρ_{j k}$ needs further explanation, consists of the estimated residuals ${\hat{ψ}}_{i j}$. The MLE of $ρ_{j k}$ turns out to be just the observed correlation between these residuals; see Equation 17. We illustrated their use in our explanation of the negative estimate of $ρ_{j k}$ for the third pair of participants.

The examples also show how easy it is to conduct the proposed analysis. All relevant quantities have simple closed-form expressions that follow immediately from the item parameters and the RTs by the test takers. The only extra requirement is the calibration of the items under the RT model. However, this is also easily performed as part of the regular calibration of the test items (van der Linden, 2006).

Appendix

The necessary derivatives of the full log likelihood under the bivariate lognormal model in Equation 9 are given in the following two sections.

References

Angoff, W. H. (1974). The development of statistical indices for detecting cheaters. Journal of the American Statistical Association, 69, 44–49.

Glas, C. A. W. (1999). Modification indices for the 2PL and the nominal response model. Psychometrika, 64, 273–294.

Glas, C. A. W., & Dagohoy, A. V. T. (2007). Person fit tests for IRT models for polytomous items with estimated person and item parameters. Psychometrika, 72, 159–180.

Glas, C. A. W., & Suárez Falcón, J. C. (2003). A comparison of item-fit statistics for the three-parameter logistic model. Applied Psychological Measurement, 27, 87–106.

Fox, J.-P., Klein Entink, R. H., & van der Linden, W. J. (2007). Modeling of responses and response times with the package CIRT. Journal of Statistical Software, 20(7), 1–14.

Frary, R. B., Tideman, T. N., & Watts, T. M. (1977). Indices of cheating on multiple-choice tests. Journal of Educational Statistics, 2, 235–256.

Holland, P. W. (1996). Assessing unusual agreement between incorrect answers of two examinees using the K-index: Statistical theory and empirical support (Technical Report No. 96-4). Princeton, NJ: Educational Testing Service.

Klein Entink, R. H., Fox, J.-P., & van der Linden, W. J. (2009). A multivariate multilevel approach to simultaneous modeling of accuracy and speed on test items. Psychometrika, 74, 21–48.

Lehmann, E. L. (1999). Elements of large-sample theory. New York: Springer.

Lewis (2006). Note on conditional and unconditional hypothesis testing: A discussion of an issue raised by van der Linden and Sotaridona. Journal of Educational and Behavioral Statistics, 31, 305–309.

Lewis, & Thayer, D. T. (1998). The power of the K-index (or PMIR) to detect copying (Research Report RR-98-49). Princeton, NJ: Educational Testing Service.

Silvey, S. D. (1975). Statistical inference. London: Chapman & Hall.

Sotaridona, L. S., & Meijer, R. R. (2002). Statistical properties of the K-index for detecting answer copying. Journal of Educational Measurement, 39, 115–132.

Sotaridona, L. S., & Meijer, R. R. (2003). Two new statistics to detect answer copying. Journal of Educational Measurement, 40, 53–69.

Sotaridona, van der Linden, W. J., & Meijer, R. R. (2006). Detecting answer copying using the kappa statistic. Applied Psychological Measurement, 30, 412–431.

van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31, 181–204.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72, 287–308.

van der Linden, W. J. (in press). Conceptual issues in response-time modeling. Journal of Educational Measurement, 46.

van der Linden, W. J., & Glas, C. A. W. (in press). Statistical tests of conditional independence between responses and/or response times on test items. Psychometrika, 75.

van der Linden, W. J., & Guo (2008). Bayesian procedures for identifying aberrant response-time patterns in adaptive testing. Psychometrika, 73, 365–384.

van der Linden, W. J., & Sotaridona (2004). A statistical test for detecting answer copying on multiple-choice tests. Journal of Educational Measurement, 41, 361–377.

van der Linden, W. J., & Sotaridona (2006). Detecting answer copying when the regular response process follows a known response model. Journal of Educational and Behavioral Statistics, 31, 283–304.

Wise, S. L., Kong, X. J., & Pastor, D. A. (2007, April). Understanding correlates of rapid-guessing behavior in low-stakes testing: Implications for test development and measurement practice. Paper presented at the Annual Meeting of the National Council on Measurement in Education, Chicago, IL.

Wollack, J. A. (1997). A nominal response model approach to detect answer copying. Applied Psychological Measurement, 21, 307–320.

Wollack, J. A., & Cohen, A. S. (1998). Detection of answer copying with unknown item and trait parameters. Applied Psychological Measurement, 22, 144–152.

Item	Item Parameters		Participant 1		Participant 2		Participant 3		Participant 4		Participant 5		Participant 6
Item	$α_{i}$	$β_{i}$	$T_{i j}$	${\hat{ψ}}_{i j}$	$T_{i j}$	${\hat{ψ}}_{i j}$	$T_{i j}$	${\hat{ψ}}_{i j}$	$T_{i j}$	${\hat{ψ}}_{i j}$	$T_{i j}$	${\hat{ψ}}_{i j}$	$T_{i j}$	${\hat{ψ}}_{i j}$
1	1.8	2.7	22	0.53	26	−0.18	20	−0.59	21	0.83	21	−0.42	13	0.42
2	2.1	3.0	19	−0.14	38	0.14	43	0.46	15	−0.18	39	0.36	14	0.20
3	1.7	3.3	40	0.54	101	1.16	51	0.04	22	−0.12	86	1.02	4	−2.64
4	1.9	3.4	43	0.57	57	0.04	45	−0.35	22	−0.29	33	−0.84	26	0.50
5	2.0	3.1	27	0.27	37	−0.22	26	−0.87	23	0.39	60	0.92	18	0.40
6	1.9	3.0	27	0.50	21	−1.03	37	0.09	20	0.34	35	0.08	18	0.62
7	1.7	3.2	45	0.99	116	1.64	90	1.26	16	−0.41	33	−0.37	23	0.64
8	1.8	3.3	23	−0.33	44	−0.17	31	−0.76	24	0.14	45	0.01	12	−0.67
9	2.0	2.5	14	0.16	10	−1.60	12	−1.19	19	1.18	30	0.70	11	0.60
10	1.5	3.6	47	0.32	117	0.85	80	0.32	18	−0.82	37	−0.79	30	0.34
11	1.5	2.6	12	−0.21	18	−0.44	22	−0.10	7	−0.70	25	0.17	3	−1.60
12	1.7	2.1	5	−0.90	9	−0.85	23	0.80	10	0.66	17	0.37	2	−1.67
13	1.7	2.0	3	−1.63	25	1.06	29	1.31	9	0.64	12	−0.07	4	−0.33
14	1.8	1.9	4	−0.91	16	0.55	22	1.17	11	1.27	14	0.46	4	−0.08
15	1.8	2.9	16	−0.25	34	0.09	17	−1.07	19	0.43	23	−0.46	46	2.42
16	1.9	3.0	19	−0.10	27	−0.51	41	0.35	6	−1.92	51	0.87	16	0.46
17	1.9	1.9	9	0.47	6	−1.35	11	−0.15	5	−0.23	7	−0.91	6	0.59
18	2.1	1.9	6	−0.36	17	0.65	9	−0.63	3	−1.37	7	−1.05	6	0.62
19	1.6	3.4	47	0.74	97	0.97	105	1.14	25	0.06	42	−0.22	4	−2.43
20	1.8	2.6	12	−0.21	20	−0.30	16	−0.66	12	0.19	22	0.02	12	0.65