Abstract
Andrich, Marais, and Humphry showed formally that Waller’s procedure, which removes responses to multiple choice (MC) items that are likely to have been guessed, eliminates the bias in the Rasch model (RM) estimates of difficult items and makes them more difficult. They did not study the consequences for the person proficiency estimates. This article shows that when the procedure is applied, the more proficient persons, who are least likely to guess, benefit by a greater amount than the less proficient, who are most likely to guess. This surprising result is explained by appreciating that the more proficient persons answer difficult items correctly at a greater rate than do the less proficient, even when the latter guess some items correctly. As a consequence, increasing the difficulty estimates of the difficult items benefits the more proficient persons more than the less proficient persons. Analyses of a simulated and a real example illustrate the result. So that the more proficient persons are not disadvantaged, it is suggested that Waller’s procedure be used when the RM is used to analyze MC items.
Many educational and psychological assessments consist of multiple choice (MC) items, where there is a possibility that correct answers will be guessed. Depending on how difficult a person finds an item, if the person does not know the correct answer then he or she may eliminate some distractors and guess randomly among the rest, or if not able to eliminate any distractors, guess randomly among all of them. In this article, partial or complete random guessing will be referred to as random guessing. The previously implied proposition, and a theme of the article, is that a person will tend to guess a response as a function of how difficult the person finds an item, and not because of structural properties of the item. Thus, it is not assumed that guessing is present or not, but that it is a matter of degree, which is a function of a person’s proficiency relative to an item’s difficulty.
Responses to MC items are often analyzed using the three-parameter logistic (3PL) model, which includes a guessing parameter for each item. However, because of the proposition that the degree of guessing is a function of the relative difficulty of an item for a person, this article studies the consequences of guessing by analyzing responses using the dichotomous Rasch model (RM), which has only an item difficulty parameter. There are other reasons, compatible with the earlier proposition, for analyzing responses to MC items using the dichotomous RM. The most important is Rasch’s (1961) theory of invariant comparisons: if the data fit the dichotomous RM within a specified frame of reference, then the item parameter estimates are independent of any assumptions about the person distribution, and the person estimates are independent of which subset of items is used for assessment (Andrich, 2004). This feature is relevant in adaptive or tailored testing, where the focus is on the assessment of individuals and where, from a specified class of items, different persons are administered different items that are aligned relatively closely to each person’s proficiency. In addition, if tailored testing is applied, it is expected that little or no random guessing will take place, and it seems at best redundant to have an item characterized partly by a guessing parameter. Finally, the model has the convenience that the person’s total score on a set of items is sufficient for the person parameter estimate.
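For reference, the dichotomous RM and the sufficiency of the total score can be written out as follows. The notation is the conventional one and is assumed here rather than taken from this section: $\beta_n$ is the proficiency of person $n$, $\delta_i$ is the difficulty of item $i$, $x_{ni} \in \{0, 1\}$ is the scored response, and $r_n$ is the total score.

```latex
\Pr\{X_{ni}=1\} = \frac{\exp(\beta_n-\delta_i)}{1+\exp(\beta_n-\delta_i)},
\qquad
\Pr\{(x_{n1},\dots,x_{nI})\}
  = \frac{\exp\!\bigl(r_n\beta_n-\sum_{i} x_{ni}\delta_i\bigr)}
         {\prod_{i}\bigl[1+\exp(\beta_n-\delta_i)\bigr]},
\qquad
r_n=\sum_{i} x_{ni}.
```

In the second expression, $\beta_n$ enters the likelihood of a response pattern only through $r_n$, which is why persons who answer the same items and obtain the same total score receive the same proficiency estimate.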
Random guessing violates the RM and will therefore bias the estimates of its difficulty parameters. Specifically, because the sufficient statistic for an item’s difficulty is the number of correct responses, which guessing increases, the more difficult items will appear relatively easier than they would be without guessing. This violation of the RM does not preclude using it as a hypothesis that there is no guessing in the responses, seeking evidence to the contrary, and then considering how to deal with any guessing found. This, rather than focusing on modeling the data using the 3PL or a more complicated model (e.g., San Martin, del Pino, & De Boeck, 2006), is the approach taken in this study (Andrich, 2004).
Only a few studies have investigated the effect of guessing on item and person estimates in the RM. Using simulated data, and as might be expected, Waller (1973, 1976, 1989) showed that items that had correctly guessed responses appeared relatively easier than their simulated values and that persons with low proficiency, who were hypothesized to be more likely to guess, were estimated to be more proficient than their simulated value. Waller proposed removing responses likely to be guessed from the data and showed that the resulting analysis recovered simulated person and item locations more accurately. Waller’s approach removes the response of a person to an item after it is established that the person has a low proficiency relative to the item’s difficulty and, therefore, that the response is likely to have a guessing component. In addition to Waller’s studies, it seems that only Wainer and Wright (1980) investigated the effect of guessing on the estimates of the dichotomous RM. They proposed an “adjustment” for guessing using the RM and a jackknife scheme, in which each item was omitted separately and proficiencies reestimated.
Recently, Andrich, Marais, and Humphry (2012) formalized Waller’s approach, which is simpler than that of Wainer and Wright and which permits estimating the magnitude and significance of the effect of guessing on the item difficulty estimates in the RM. Andrich et al. (2012) did not consider the consequent effects on the person estimates. The purpose of this study is to formalize the effects on person proficiency estimates in the RM when the effect of guessing is removed from the item parameter estimates using the procedure summarized above. The main new result, perhaps surprising at first because it is the least proficient persons who guess, is that although the estimates of the least proficient persons are increased, the estimates of the most proficient persons are increased by an even greater margin. The reason for this effect is that the most proficient persons answer more difficult items correctly at a greater rate than the least proficient, even when the latter guess answers correctly, and are therefore affected more by changes in the locations of the difficult items. This effect is one of the main illustrations of the study.
The rest of the article is structured as follows: The section titled “The Data Sets” describes two data sets, one simulated and one real, which are analyzed for illustrative purposes. The “Removing Random Guessing From the Item Parameter Estimates” section summarizes the procedure used to correct the item parameter estimates. The section “Proficiency Estimates” describes the consequent effect on the person parameter estimates and the final section is a summary and discussion. Appendices A and B include elaborations of the article.
The Data Sets
Two data sets used illustratively in Andrich et al. (2012) are used in this study. They are a set of real data from the Raven’s Advanced Progressive Matrices (RAPM) and a set of data simulated to parallel the real data. The RAPM is a non-verbal test of reasoning consisting of 36 MC items (Raven, 1940). Each item has a two-dimensional matrix with a pattern in which the element in the lower right-hand corner is missing. From six or eight alternatives, the respondent is required to choose one to complete the pattern. This advanced version is more difficult than Raven’s Standard Progressive Matrices, the dimensionality of which was investigated by Van der Ven and Ellis (2000) using the dichotomous RM. The study by Andrich et al. (2012) of a sample of 469 persons with complete data on the RAPM showed that the items were relatively difficult for them and that there was substantial guessing on the most difficult items. Item 36 was so difficult that it was removed from all analyses.
The simulated data set had parameters similar to those from the RAPM data. The details of the item parameter estimates for the RAPM and the simulated data are shown for completeness in Table A1. In summary, the items were uniformly distributed in the range −3.0 to 3.0 logits with a mean of 0, and 470 persons were normally distributed with a mean of −0.75 and a standard deviation (SD) of 1.2 logits. In parallel to the evidence from the RAPM data, 23 of the most difficult items were simulated to have guessing.
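A minimal sketch of how data with this general shape could be generated is given below. It is not the simulation algorithm of Andrich et al. (2012): the number of items (35), the evenly spaced difficulties, the fixed lower asymptote `c = 0.125`, and the function names are all assumptions made for illustration, and guessing is imposed with a plain 3PL-style lower asymptote rather than with their generalization.

```python
import math
import random


def rasch_prob(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))


def simulate(n_persons=470, n_items=35, n_guess_items=23, c=0.125, seed=1):
    """Simulate 0/1 responses; the hardest n_guess_items items allow guessing.

    Assumptions (not from the article): n_items, evenly spaced difficulties,
    and a constant lower asymptote c for the items with guessing.
    """
    rng = random.Random(seed)
    # Difficulties evenly spaced over [-3, 3], so their mean is 0.
    deltas = [-3.0 + 6.0 * i / (n_items - 1) for i in range(n_items)]
    # Proficiencies drawn from a normal distribution, mean -0.75, SD 1.2.
    betas = [rng.gauss(-0.75, 1.2) for _ in range(n_persons)]
    first_guess_item = n_items - n_guess_items
    data = []
    for beta in betas:
        row = []
        for i, delta in enumerate(deltas):
            p = rasch_prob(beta, delta)
            if i >= first_guess_item:
                # Lower asymptote: a correct response can also arise by guessing.
                p = c + (1.0 - c) * p
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return betas, deltas, data
```

Calling `simulate()` returns the generating proficiencies, the generating difficulties, and a persons-by-items matrix of scored responses.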
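Because guessing by moderately proficient persons on moderately difficult items appeared to be less prevalent than implied by the 3PL, Andrich et al. (2012) used a simulation algorithm, a generalization of the 3PL, which permits modifying the degree of guessing as a function of the difference between a person’s proficiency and an item’s difficulty. This generalization is now summarized. The 3PL takes the form
$$\Pr\{X_{ni}=1\} = c_i + (1-c_i)\,P_{ni}, \qquad P_{ni} = \frac{\exp[\alpha_i(\beta_n-\delta_i)]}{1+\exp[\alpha_i(\beta_n-\delta_i)]},$$
where $P_{ni}$ is the probability of a correct response without guessing, $\beta_n$ is the proficiency of person $n$, and $\delta_i$, $\alpha_i$, and $c_i$ are the difficulty, discrimination, and guessing parameters of item $i$.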
The generalization of the 3PL used in Andrich et al. (2012) takes the form
where
Removing Random Guessing From the Item Parameter Estimates
The procedure in Andrich et al. (2012) for removing the effects of random guessing is summarized below in part for completeness and in part to understand the reasons for the observed effects on the person proficiency estimates. All analyses used the RUMM2030 software (Andrich, Sheridan, & Luo, 2013), which provides consistent estimates (Zwinderman, 1995) using a pairwise conditional method of estimation of the item parameters (Andrich & Luo, 2003). Then taking the item parameters as known, the person parameters are estimated using a weighted likelihood method (Warm, 1989), which reduces the stretching bias of persons with scores toward the extremes found in maximum likelihood estimates.
The Tailored Analysis and Choice of Probability Cutoff
To operationalize Waller’s procedure for removing random guessing, Andrich et al. (2012) used the following steps. First, all responses (which may have included guessed responses) were analyzed with the RM. This analysis was termed the original analysis. To the degree that guessing is present, the responses will misfit the RM and the difficulty estimates will be biased. However, they will be biased in a predictable way. It is this reasoning that leads to the subsequent steps, which identify the likely presence of guessing in MC items using the dichotomous RM. Fit is considered again with the analysis of the illustrative examples.
Second, given the estimates from the original analysis, if the probability of a person’s correct response to an item is smaller than a specified value, the response, whether correct or not, is converted to missing data. The effect is analogous to adaptive or tailored testing, whereby students are not administered items that they are expected to find very difficult. In this case, the tailoring is carried out post hoc, rather than a priori. Accordingly, the analysis is referred to as a tailored analysis and the data as tailored data. The hypothesis of guessing is tested by comparing the difficulty estimates from the two analyses; the differences between them would not be statistically significant if there were no guessing and the data fitted the dichotomous RM well. It is stressed that the process does not identify persons who may or may not have guessed any response; rather, it identifies those responses likely to have a guessing component. It is also stressed that there is no way of telling whether or not any particular response was guessed. For example, a person of low proficiency may use specialized knowledge to answer an item correctly, but the response is still converted to missing.
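The tailoring step can be sketched as follows. This is an illustrative outline only, not the RUMM2030 implementation: the function names are hypothetical, the person and item estimates are assumed to come from the original analysis, and the default cutoff of .3 is the value this article reports for Andrich et al. (2012).

```python
import math


def rasch_prob(beta, delta):
    """Probability of a correct response under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))


def tailor(data, betas, deltas, cutoff=0.3):
    """Return a copy of data in which unlikely-to-be-known responses are missing.

    A response is replaced by None whenever the model probability of a correct
    response is below the cutoff, regardless of whether the observed response
    was correct (1) or incorrect (0).
    """
    tailored = []
    for row, beta in zip(data, betas):
        tailored.append([
            x if rasch_prob(beta, delta) >= cutoff else None
            for x, delta in zip(row, deltas)
        ])
    return tailored


# Hypothetical estimates from an original analysis: two persons, three items.
betas = [0.0, -2.0]
deltas = [-1.0, 0.0, 2.0]
data = [[1, 0, 1],
        [1, 0, 0]]
tailored = tailor(data, betas, deltas)
```

In this toy example the first person’s correct response to the hardest item is converted to missing, and the second person, whose proficiency is low relative to every item, is left with no responses at all.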
Specifically, Andrich et al. (2012) set a cutoff, which converted any response with a probability less than .3 of being correct to missing. This conservative value, relative to a theoretical value of
The Origin-Equated Analysis
In a RM analysis, an arbitrary identifying constraint, usually
This provided a third analysis referred to in this article as the origin-equated analysis. Evidence of guessing then is that the estimates of the more difficult items will be more difficult in the tailored analysis than in the origin-equated analysis (complete data), while the very easy items will have similar difficulty estimates. Andrich et al. (2012) showed that the magnitude and significance of the difference in difficulty estimates could be assessed using a theorem by Andersen (1995, 2002). They also showed that the tailored analysis of the simulated data set, being minimally affected by guessing, provides unbiased difficulty estimates.
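The comparison of difficulty estimates can be sketched as below. The sketch assumes a particular form of Andersen’s result, namely that the variance of the difference between an estimate from a subset of the data (the tailored analysis) and the estimate from the full data (the origin-equated analysis) equals the difference of the two error variances. This section does not state the theorem, so the formula, the function name, and the example values are all assumptions.

```python
import math


def difficulty_shift_z(d_tailored, se_tailored, d_full, se_full):
    """Standardized difference between tailored and full-data difficulty estimates.

    Assumes Var(d_tailored - d_full) = se_tailored**2 - se_full**2, which
    requires the tailored (subset) estimate to have the larger standard error.
    """
    var_diff = se_tailored ** 2 - se_full ** 2
    if var_diff <= 0:
        raise ValueError("tailored standard error must exceed the full-data one")
    return (d_tailored - d_full) / math.sqrt(var_diff)


# Hypothetical estimates (in logits) for one difficult item.
z = difficulty_shift_z(d_tailored=2.4, se_tailored=0.30, d_full=1.6, se_full=0.12)
```

A large positive value of `z` for a difficult item would be read as evidence that guessing made the item appear easier in the complete-data analysis.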
Summary of the Difficulty Estimates
Appendix A shows details of the item parameter estimates from the different analyses. However, for the purposes of this study, which focuses on the person estimates, it is sufficient to demonstrate the relationships graphically. Figure 1 (top) shows, for the simulated example, the estimated item locations from the origin-equated and the tailored analyses against the simulated item locations. The difficult items are clearly more difficult in the tailored analysis than in the origin-equated analysis, whereas the very easy items are equally difficult. The difficulties in the tailored analysis are closer to the identity line with simulated values. Because the more difficult items have more responses eliminated, the standard errors of their estimates are greater, and this is reflected in the greater variation of the more difficult items about the identity line. Figure 1 (bottom) shows the estimated item locations from the origin-equated and tailored analyses for the RAPM. The difficult items are more difficult in the tailored analysis than in the origin-equated analysis, strongly indicating the presence of random guessing in the more difficult items. Furthermore, because the differences in difficulty estimates increase as a function of difficulty, the pattern confirms the hypothesis that guessing is a matter of degree as a function of relative difficulty, and not simply present or absent.

Figure 1. Item estimates from the tailored and origin-equated analyses against their simulated values (top), and item estimates from the origin-equated analysis against those from the tailored analysis for the RAPM (bottom), both showing the hypothesized identity line.
Proficiency Estimates
From Figure 1, the difficulty estimates in the tailored analysis are taken to be less biased than those of the origin-equated analysis. Accordingly, it is expected that the proficiencies estimated from the tailored analysis will also be less biased. However, when proficiency estimates of individual performances are reported, often for policy reasons, no responses of persons can be removed. Therefore, a fourth analysis was conducted. In this analysis, all of the original responses of all persons are used to estimate the person proficiencies, but the difficulties of all items are anchored to their estimates from the tailored analysis. This analysis is referred to as the all-anchored analysis. Therefore, in the origin-equated and all-anchored analyses, the complete set of responses is analyzed, while in the tailored analysis a subset of responses is analyzed. Because meaningful comparisons require the same origin, we compare the person distributions of the tailored, origin-equated, and all-anchored analyses. In the latter two analyses, all persons have responded to the same items, so the sufficiency of the total score allows the proficiency estimates to be compared score by score.
Estimates From Total Scores to Proficiency Estimates
Because the simulated example is used to confirm the accuracy of the proposed procedure for correcting difficulty estimates for guessing, the consequent effects on its proficiency estimates are summarized first. Figure 2 shows the estimated proficiencies for each total score from the origin-equated and the all-anchored analyses. It is evident that at the least proficient end of the continuum there is little difference between the two, but that as proficiency increases, the all-anchored estimates for each total score show a systematically increasing difference. This is a consequence of the nonlinear way in which the difficulty estimates of items increase in the all-anchored analysis relative to the origin-equated analysis. The proficiency estimates of the RAPM show the same relationship.

Figure 2. Number-correct score to logit conversion for the simulated example (top) and the RAPM (bottom) for the origin-equated and all-anchored analyses.
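The score-to-logit conversion compared in Figure 2 can be illustrated with a small sketch. It uses plain maximum likelihood solved by bisection rather than the weighted likelihood method used in the article, and the two sets of six item difficulties are hypothetical: in the "regressed" set the three hardest items appear easier, as they would with guessing left in, while the "anchored" set stands in for difficulties fixed at tailored values.

```python
import math


def expected_score(theta, deltas):
    """Expected total score at proficiency theta under the Rasch model."""
    return sum(1.0 / (1.0 + math.exp(-(theta - d))) for d in deltas)


def score_to_logit(r, deltas, lo=-10.0, hi=10.0):
    """ML proficiency for total score r (0 < r < number of items), by bisection.

    The ML estimate is the theta at which the expected score equals r;
    the expected score is strictly increasing in theta.
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if expected_score(mid, deltas) < r:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Hypothetical difficulties (logits); only the three hardest items differ.
regressed = [-2.0, -1.0, 0.0, 0.8, 1.5, 2.0]
anchored = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]

for r in range(1, len(anchored)):
    a = score_to_logit(r, regressed)
    b = score_to_logit(r, anchored)
    print(r, round(a, 2), round(b, 2), round(b - a, 2))
```

For these made-up values the gain from anchoring grows with the total score, which mirrors the pattern described for Figure 2: persons with high scores have more at stake in the locations of the difficult items.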
Summary of the Proficiency Distribution
Table 1 shows the proficiency means and standard deviations from the three analyses described above for both the simulated and the RAPM data. In addition, for the simulated example, these values for the generating parameters and the actual simulated values, which are slightly different, are provided. Finally, for the simulated example, the regression of residuals (differences between the simulated values and the estimated values) on the simulated values is shown. Because no bias in the person estimates implies that this regression line should have a zero intercept and a zero slope, it was calculated as an indicator of bias in the estimates. The mean square difference between the simulated value and the estimate could also have been calculated as an index of bias, but because the tailored analysis reduces the number of responses for the less proficient persons, it increases the standard error of their estimates. As a result, bias and reduced precision are confounded in the mean square index.
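The bias indicator described above amounts to an ordinary least squares regression of the residuals on the simulated values. A minimal sketch follows; the function name and the toy numbers are hypothetical, and the standard errors needed for the significance tests are omitted.

```python
def regress_residuals(simulated, estimated):
    """OLS intercept and slope of (simulated - estimated) on simulated."""
    n = len(simulated)
    resid = [s - e for s, e in zip(simulated, estimated)]
    mean_x = sum(simulated) / n
    mean_y = sum(resid) / n
    sxx = sum((x - mean_x) ** 2 for x in simulated)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(simulated, resid))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy example: estimates that are regressed (shrunken) and shifted downward.
simulated = [-3.0, -2.0, -1.0, 0.0, 1.0]
estimated = [0.8 * s - 0.2 for s in simulated]
intercept, slope = regress_residuals(simulated, estimated)
```

With unbiased estimates both values would be close to zero; in the toy example the positive slope reflects estimates that are increasingly too low as proficiency increases.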
Table 1. Proficiency Estimates From Different Analyses and Regression of Residuals on the Generating Parameters for the Simulated Example.
Note. RAPM = Raven’s Advanced Progressive Matrices.
Person distribution is normal.
Tailored Analysis
First, Table 1 shows that the intercept and slope of the regression of residuals in the simulated example from the tailored analysis are as expected (intercept −.017, slope −.011). Taking plus or minus two standard errors as a confidence interval around the value of 0.0, neither is statistically significant. Second, given the known standard deviation of the simulated values (1.290), the standard deviation of the mean is given by
The tailored analysis, therefore, also provides a frame of reference for the analyses of the RAPM. In the tailored analysis, seven persons had no response to any item, indicating that their proficiency was very low relative to the difficulty of the test. (One person had no correct answer to any item even before the tailored analysis.) The mean of the remaining 462 persons (−0.742) is substantially less than the constrained mean item difficulty of 0.0, and it is this relative difficulty that is considered to have engendered guessing. The total number of responses deleted in the tailored analysis was 7,144 from a possible 16,380 responses, excluding the one person who had no correct answer in the original data. All items had at least one response removed, with six items having only 16 removed. These were the easiest items used to set the origin. The most difficult item had 466 responses removed. For completeness, Figure A2 shows that the persons are aligned to the easier end of the continuum. Clearly, with a lower cutoff for the probability of a successful response, say, .2, fewer responses would have been deleted. However, as indicated already, that might have left more potentially guessed responses in the data.
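The count of possible responses can be checked directly, assuming 35 items (the 36 RAPM items less Item 36) and 468 persons (the 469 in the sample less the one person with no correct answer):

```latex
468 \times 35 = 16{,}380, \qquad \frac{7{,}144}{16{,}380} \approx 0.436 .
```

On these figures, roughly 44% of the available responses were converted to missing in the tailored analysis.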
Origin-Equated Analysis
In contrast to the tailored analysis, in the origin-equated analysis the intercept and slope of the residuals regressed on the simulated proficiencies are significantly different from zero. In addition, the mean proficiency is significantly less than the mean of both the simulating parameters and of the tailored analysis. This is initially counterintuitive: it would be expected that with complete data, and therefore the presence of guessing, the mean would be greater in the origin-equated analysis. It is not greater, however, because the difficulties of a substantial number of items in the origin-equated analysis are regressed to lesser difficulties and the reward for a correct response to these items is also regressed. For the same reason, the standard deviation of the proficiencies is also smaller than that of the simulating parameters and of the tailored analysis. In the RAPM, the pattern is the same, with the mean and standard deviation of the origin-equated analysis both smaller than those in the tailored analysis.
All-Anchored Analysis
As indicated earlier, in any individual reporting of proficiency, all responses of a person are generally used. In this case, the relevant analysis is the all-anchored analysis, in which the items are anchored to the tailored item estimates, but where all responses of each person are included. Because of the presence of guessing in the data, the slope and intercept of the regressed residuals for the simulated example in Table 1 are significantly different from zero, indicating that the proficiency estimates are biased. However, the degree of bias is substantially smaller than in the origin-equated analysis: 0.113 compared with 0.254 for the slope, and −0.118 compared with 0.367 for the intercept.
The person mean in the all-anchored analysis (−0.584) is greater than both the simulated mean (−0.790) and the mean in the origin-equated analysis (−0.957). It is greater than the simulated mean because it includes guessed responses, and greater than the origin-equated mean because the item difficulties are greater than in the origin-equated analysis. The standard deviation in the all-anchored analysis (1.251) is slightly smaller than the simulated value (1.290), but is noticeably greater than that in the origin-equated analysis (1.055). These results point to the all-anchored analysis being preferred over the origin-equated analysis when all responses are analyzed. Incidentally, because the total score in the RM is the sufficient statistic, persons who have completed the same items and have the same total score have the same proficiency estimates, irrespective of the pattern of responses, although if the responses fit the model, they will be probabilistically Guttman-like (Andrich, 1985). This is not the case with the 3PL, where each pattern of responses generates a different estimate, and where, as shown by Chiu and Camilli (2012), a correct response for a person of low proficiency does not obtain the same credit as a correct response for a person of high proficiency.
As in the simulated example, the mean of the all-anchored analysis in the RAPM is greater than the means of both the tailored and origin-equated analyses, where in the latter, again counterintuitively, the mean is less than in the tailored analysis. The standard deviation of the all-anchored analysis is also greater than that of the origin-equated analysis, and closer to that of the tailored analysis.
Thus, the first demonstration of the article is that (a) if guessing is removed from responses, unbiased estimates of the item difficulties are obtained and (b) if all responses are retained for the proficiency estimates of persons in the RM, then the mean and the standard deviations of the proficiencies are greater relative to those when guessing is not removed. However, as shown in the following, this effect does not arise from a uniform change across the continuum.
Cumulative Frequencies and Cut Points
In many large-scale educational assessments, cut points are set to indicate a minimum benchmark for achievement, and if this benchmark is not reached, remedial action is indicated. To not focus solely on reaching a minimum achievement standard, higher cut points that recognize meeting excellent benchmarks are also set. Because the RAPM and the simulated examples are relatively difficult, suppose for illustrative purposes that the lower and upper cut points are set at −2.0 and 0.0 logits, respectively. Figure 3 shows the cumulative percentage of persons in the simulated and RAPM examples. Consistent with Figure 2, the percentage of persons below any proficiency estimate in the origin-equated analysis is either equal to or greater than that in the all-anchored analysis, and the percentage difference increases as the proficiency level increases.

Figure 3. Cumulative percentage of person estimates for the simulated example (top) and the RAPM (bottom) for the origin-equated and all-anchored analyses.
Applying the cut points graphically, the percentages of persons below the lower benchmark for the origin-equated and all-anchored analyses are both approximately 13, while the percentages below the upper benchmark are 89 and 72, respectively. Therefore, the percentages above the upper benchmark are 11 and 28, respectively, which represents a substantial underestimate in the origin-equated analysis. Applying the same relative cut points, the same interpretation can be given for the cumulative percentages in the RAPM shown in Figure 3 (bottom), which are close to those of the simulated example. This closeness is not surprising, given that the simulated example was based on the parameters of the tailored analysis of the real data, and the degree of guessing was also simulated to be similar.
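Instead of reading the percentages from a graph, they can be computed directly from the person estimates. The sketch below uses a hypothetical helper and made-up estimates for five persons under the two analyses; only the cut points of −2.0 and 0.0 logits come from the text.

```python
def percent_below(estimates, cut):
    """Percentage of person estimates strictly below a cut point (in logits)."""
    return 100.0 * sum(1 for e in estimates if e < cut) / len(estimates)


# Hypothetical estimates for the same five persons under the two analyses.
origin_equated = [-2.4, -1.1, -0.4, -0.1, 0.6]
all_anchored = [-2.4, -1.0, -0.2, 0.2, 1.1]

for cut in (-2.0, 0.0):  # lower and upper cut points used in the text
    print(cut, percent_below(origin_equated, cut), percent_below(all_anchored, cut))
```

In this toy case the two analyses agree at the lower cut point but differ at the upper one, which is the qualitative pattern reported for Figure 3.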
Thus, the second main demonstration of the article is that the greater mean in the all-anchored analysis than in the origin-equated analysis does not result from a uniform shift in the proficiency estimates; the estimates of the most proficient are increased by a greater amount than those of the least proficient. This nonlinear effect is demonstrated in the cumulative frequencies for both the simulated and RAPM examples in Figure 3. The implication is that if the effects of random guessing are not removed in estimating the item difficulties, then the relative achievements of the more proficient persons are underrecognized in the RM.
Fit of Responses to the Model
As already noted, guessing violates the dichotomous RM. Consequently, the fit in the tailored analysis in which guessing is removed is hypothesized to be better than in the origin-equated analysis. For an illustrative general test of fit for this study, persons were divided into eight class intervals, and for each class interval an approximate
General Approximate
Note. RAPM = Raven’s Advanced Progressive Matrices.
Number of Alternatives
The RAPM data were chosen to build on the analyses of the same data as in Andrich et al. (2012), in which the method of removing the effects of guessing from difficulty estimates in the RM was first formalized. These data have noticeable guessing, despite a larger than usual number of distractors, and so are ideal to illustrate a novel method for dealing with guessing. However, many traditional achievement tests have only four alternatives in their MC items and are designed not to be very difficult for persons to whom they are administered. For illustration of the process of removing guessing in these circumstances, Appendix B summarizes such an example and shows the graph of the distribution of the persons relative to the items and the key graph comparing item difficulties for identifying guessing. The same patterns consistent with the hypothesis that guessing is a function of the relative difficulty of an item for the proficiency of a person, though with an expected smaller effect because the items are relatively easy for the persons, are demonstrated clearly. Of course, whether or not there are effects due to guessing is an empirical matter in any data set, and if the items are very easy for the persons, it is expected that there will be relatively little guessing.
Summary and Discussion
The aim of this study was to determine the effects of removing guessing in the manner reported in Andrich et al. (2012) on person estimates in the dichotomous RM. The first conclusion is that in parallel with the effect on item difficulties, if the effect of guessing is not removed and the correct origin for comparison applied, then counterintuitively, the mean of the estimates is less than their actual mean. This result was not observed by Waller, who concluded that the mean was greater when guessing was not removed. However, he compared only the original analysis with the tailored analysis and did not equate the origin to be identical in the two analyses. In addition to the mean being less than the actual mean, the proficiencies are regressed toward their mean. As with the difficulty estimates, the article demonstrates that the regression effect on the proficiency estimates is not uniform across the continuum. In particular, the estimates of higher proficiency persons are regressed more when the effect of guessing is not removed from the difficulty estimates than those of lower proficiency persons, making their proficiency estimates less than their actual proficiencies.
This non-uniform effect can be explained by the observation that although the low proficiency persons tend to guess on the very difficult items, the more proficient persons answer these more difficult items correctly at a greater rate than the less proficient persons, even though it is the latter who guess more. The more proficient persons therefore obtain a greater benefit when the difficult items have their real difficulties rather than regressed difficulties. Thus, not removing guessing when MC items are analyzed with the dichotomous RM will understate the achievement of the more proficient persons. This may have substantial policy implications in large-scale national and international assessments.
Finally, the study confirms the rationale that in at least two real data sets (the RAPM and the example in Appendix B), guessing can be interpreted as a function of the difficulty of each item relative to the proficiency of each person, rather than it being a general property of an item or a complicated function of proficiency.
Appendix A
Appendix B
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research reported in this article was supported, in part, by an Australian Research Council Linkage grant with the School Curriculum and Standards Authority of Western Australia and the Australian Curriculum Assessment and Reporting Authority as Industry Partners, and by Pearson plc.
