Abstract
This study examines three controversial aspects in differential item functioning (DIF) detection by logistic regression (LR) models: first, the relative effectiveness of different analytical strategies for detecting DIF; second, the suitability of the Wald statistic for determining the statistical significance of the parameters of interest; and third, the degree of equivalence between the main DIF classification systems. Different strategies for testing LR models, and different DIF classification systems, were compared using data obtained from the University of Tehran English Proficiency Test (UTEPT). The data obtained from 400 test takers who hold a master’s degree in science and engineering or humanities were investigated for DIF. The data were also analyzed with the Mantel–Haenszel procedure in order to have an appropriate comparison for detecting uniform DIF. The article provides some guidelines for DIF detection using LR models that can be useful for practitioners in the field of language testing and assessment.
Many methods for detecting Differential Item Functioning (DIF) have been widely employed in the field of language assessment (see Ferne & Rupp, 2007, for a review). In general, the techniques derived under item response theory (IRT) are usually preferable to other types of analysis, as long as the sample size is large enough and the data fulfill the assumptions required for these models. In contrast, observed-score methods, such as the Mantel–Haenszel (MH) procedure (Holland & Thayer, 1988) or logistic regression (LR), require fewer assumptions to be met, smaller sample sizes, and less complicated computing than IRT-based analyses. In research on language testing, use of MH (Harding, 2012; Koo, Becker, & Kim, 2013; Ockey, 2007; Pae, 2004, 2012; Roever, 2007; Uiterwijk & Vallen, 2005) is clearly predominant over LR (Alavi, Rezaee, & Amirian, 2011; Kim, 2001; Li & Suen, 2013; Rezaee & Shabani, 2010), even though MH was designed to detect uniform DIF and is not appropriate for detecting non-uniform DIF, especially when it arises in moderately difficult items (Narayanan & Swaminathan, 1996). In this respect, the advantages of LR over MH are that it is capable of simultaneously detecting both uniform and non-uniform DIF, and that ability is treated as a continuous variable without losing stratification power.
One of the goals of this study is to examine the application of LR to DIF detection when sample sizes are small, which is frequently the case in language assessment. Specifically, we used the responses of a random sample of 400 takers of the University of Tehran English Proficiency Test (UTEPT) to investigate three controversial aspects associated with the application of logistic regression (LR) models in DIF detection: (a) usefulness of different detection strategies with respect to DIF type; (b) feasibility of the Wald statistic for checking the statistical significance of DIF with small samples; and (c) extent to which different DIF classification systems are equivalent. These goals are described in detail below.
DIF detection strategies. Almost all simulation studies that compare LR to the MH chi-square (
With regard to this point, we hypothesize that, if there is uniform DIF and the strategy followed by Herrera and Gómez (2008) is used, LR has almost no power compared to MH. In contrast, when any of the three analytical strategies we consider appropriate is used (described in the following section), results similar to those of MH are expected.
Statistical significance. Some simulation studies have shown that there is no difference between applying the likelihood ratio statistic or the Wald test for testing the statistical significance of the parameters of interest in the LR models (Rogers, 1989). However, this equivalence may not hold true in some situations. The Wald and likelihood ratio tests are asymptotically equivalent; this means that in small samples there might be differences between them (see Paek, 2012). Moreover, the Wald statistic has two major disadvantages compared to the likelihood ratio test. First, it has a tendency to commit Type II errors when logistic regression coefficients are very large. In such situations, the standard error is usually inflated, increasing the probability of concluding that there is no statistical significance when in fact there is a very large effect (Menard, 2002). Second, it is less reliable in small samples (Agresti, 2002). It would, therefore, be important to have studies that compare how equivalent the two tests of significance are in a realistic situation. We can consider a study on empirical data an instrument allowing the results of simulation studies to be generalized to new situations. To use terminology common in experimental and quasi-experimental designs, empirical studies, to the extent that they corroborate the results of simulation studies, increase their external validity.
Practical significance. This study compares the application of two of the most popular DIF classification systems using LR: the Educational Testing Service (ETS) classification system (Monahan, McHorney, Stump, & Perkins, 2007a) and the one proposed by Jodoin and Gierl (2001). Although rarely used in the field of education and language testing and assessment, the Crane, Hart, Gibbons, and Cook (2006) criterion was also employed. Finding the degree of equivalence among the different classification systems is a very relevant question because, in case of discrepancy, choosing one or another could make a drastic difference in the results. It should be stressed that to date there are no studies, whether simulation or empirical, that compare the degree of equivalence of these classification systems when sample sizes are small.
The following section gives a brief review of the statistics employed in this study.
Logistic regression
Testing DIF
LR was first proposed for detecting DIF by Swaminathan and Rogers (1990), based on the inclusion of different terms in the regression model that would indicate different types of DIF. It assesses to what extent item scores (1 correct response, 0 incorrect response) can be predicted from total scores alone (model 1), from total scores and group membership (model 2), or from total scores, group membership, and interaction between total scores and group membership (model 3).
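For reference, the three nested models described above are conventionally written as follows. The labels X (total test score) and G (group membership) are notation assumed here for illustration; β0 to β3 are the coefficients referred to in the text, with β2 attached to group membership and β3 to the interaction.

```latex
\begin{align}
\text{Model 1:}\quad \ln\!\left(\frac{\pi}{1-\pi}\right) &= \beta_0 + \beta_1 X \\
\text{Model 2:}\quad \ln\!\left(\frac{\pi}{1-\pi}\right) &= \beta_0 + \beta_1 X + \beta_2 G \\
\text{Model 3:}\quad \ln\!\left(\frac{\pi}{1-\pi}\right) &= \beta_0 + \beta_1 X + \beta_2 G + \beta_3 (X \times G)
\end{align}
```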
Where, ln is the natural logarithm, π is the probability of correct response to the studied item
The expression on the left-hand side of the equal sign is known as a logit and is equal to the natural logarithm of the odds. The model parameters (β0, …, β3) are estimated by the maximum likelihood method on the log-odds or logit scale. The strategy for evaluating DIF is based on the search for the most parsimonious model that best fits the data. The choice of model usually depends on the existence of statistically significant differences between the likelihood function of the model that incorporates all of the parameters of interest (full model) and that of a model whose parameters are a subset of the parameters present in the full model (nested model). The likelihood ratio statistic that is used for this is equal to twice the natural logarithm of the ratio of the likelihood functions of the models compared at their maximum likelihood estimate of the parameters:
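Written out, with L_full and L_nested as labels assumed here for the maximized likelihoods of the full and nested models, the statistic takes the familiar form of a difference in deviances:

```latex
G^2 \;=\; 2\,\ln\!\left(\frac{L_{\text{full}}}{L_{\text{nested}}}\right)
    \;=\; \bigl(-2\ln L_{\text{nested}}\bigr) - \bigl(-2\ln L_{\text{full}}\bigr)
```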
This statistic follows a χ2 distribution with degrees of freedom equal to the difference in the number of parameters of the models compared.
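As a small illustration of how such a comparison might be carried out, the following Python sketch computes the likelihood ratio statistic from two maximized log-likelihoods and its p-value. The function names are hypothetical, the log-likelihood values in the usage note are invented, and only the 1-df and 2-df cases relevant to Models 1 to 3 are handled, using the closed-form chi-square tail probabilities.

```python
import math

def lr_statistic(loglik_nested, loglik_full):
    """Twice the log of the likelihood ratio (full model vs. nested model)."""
    return 2.0 * (loglik_full - loglik_nested)

def chi2_sf(x, df):
    """Upper-tail chi-square probability; closed forms exist for df = 1 and df = 2."""
    if x <= 0:
        return 1.0
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# Invented log-likelihoods, for illustration only
g2 = lr_statistic(loglik_nested=-250.0, loglik_full=-246.0)
p_value = chi2_sf(g2, df=2)
```

With these invented values the statistic is 8.0, which would be referred to a χ2 distribution with 2 df when Model 3 is compared with Model 1.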
Several strategies have been proposed in the literature to determine whether the item shows DIF based on Models 1, 2, and 3. First, simultaneously test uniform and non-uniform DIF by comparing the model without DIF (Model 1) with the model that includes terms for both non-uniform and uniform DIF (Model 3). This strategy checks the hypothesis that there is no DIF (H0 : β3 = β2 = 0). The null hypothesis of no DIF is tested against any of three alternatives: β3 ≠ 0, β2 ≠ 0, or both β3 ≠ 0 and β2 ≠ 0. The likelihood ratio statistic follows a χ2 distribution with 2 degrees of freedom (df). It should be pointed out that the same hypothesis could be checked by applying the Wald statistic, which requires fitting only one model to the data, Model 3. Under the hypothesis that there is no DIF, the Wald statistic follows a χ2 distribution with 2 degrees of freedom. (This is the statistic employed by Swaminathan & Rogers, 1990, in their seminal study.) Second, evaluate non-uniform DIF (H0 : β3 = 0) separately by comparing Models 2 and 3, and the uniform DIF (
Effect size measures
Many researchers have emphasized that an examination of both the statistical significance and a measure of effect size is needed to identify DIF using logistic regression (Jodoin & Gierl, 2001; Monahan et al., 2007a; Zumbo & Thomas, 1997). So using the LR coefficient
DIF is negligible (category A) if
DIF is moderate (category B) if
DIF is large (category C) if
The classification presented above is defined on the log-odds scale on which the LR model parameters are expressed, rather than on the Delta scale used in the original ETS proposal, and is based on the formula by Monahan et al. (2007a).
As mentioned above, the Wald statistic is used to determine the statistical significance of
It follows a standard normal distribution under the hypothesis that β = 0. This statistic can also be calculated as Wald =
If the model which best fits the data is the Model 3, the regression coefficient of interest is
This disadvantage may be avoided using an estimator of the magnitude of DIF that provides a similar interpretation of uniform and non-uniform DIF. Among the proposals are measures similar to the coefficient of determination (R2) in linear regression, such as Cox and Snell’s R2 or Nagelkerke’s R2 (Jodoin & Gierl, 2001; Zumbo & Thomas, 1997). Using this last statistic, an estimate of the size of DIF may be found by comparing Nagelkerke’s R2 of the full model with that of the nested model. Thus non-uniform DIF is equal to the difference in R2 between the non-uniform and uniform DIF models:
Type A items – negligible DIF:
Type B items – moderate DIF: 0.035 ≤
Type C items – large DIF:
Following the criteria of Jodoin and Gierl (2001), an item is considered to have DIF if the probability of either 1-df test was less than .05, and the corresponding
Mantel–Haenszel procedure
The MH procedure was applied to the data in order to have a reference point with which to compare detection of uniform DIF using LR. To apply the MH procedure, the test takers’ answers to the item must be arranged in a 2 × 2 × Q contingency table, where Q is the number of intervals into which the total test score is divided (h = 1, …, Q). Thus there is a 2 × 2 contingency table for each score level h, with the group (reference/focal) as one of the entries and the answer to the item (right/wrong) as the other, as observed in Table 1. The values in Cells Ah, Bh, Ch, and Dh denote the number of test takers in each category, and Nh is the total number of test takers at score level h (Nh = Ah + Bh + Ch + Dh).
Table 1. The 2 × 2 contingency table for the hth score level.
The MH procedure has an estimator of the magnitude of the DIF (the Mantel–Haenszel common log odds ratio estimator,
The above equality may be expressed as a quotient such that the ratio of the odds, referred to as the odds ratio {α = (A/B)/(C/D)} or cross-product odds ratio {α = (A•D)/(C•B)}, will be 1. Assuming homogeneity of the odds ratios of each stratum (α1 = · · · = αh = · · · = αQ), the MH measure of association calculated across all 2 × 2 contingency tables is the common odds ratio estimator (
The MH procedure, in addition to providing an estimator of the magnitude of DIF, provides a test of statistical significance. Given that
where
Objectives
Using data acquired from 400 University of Tehran English Proficiency Test takers, this study intended to answer the following questions.
Regardless of sample size, what DIF detection strategies are appropriate for detecting uniform and non-uniform DIF?
Does it make any difference whether the likelihood ratio test or the Wald test is used to determine the statistical significance of DIF when the sample is small?
How much equivalence is there between the ETS, Crane et al. (2006), and Jodoin and Gierl (2001) DIF classification systems when the sample is small?
Method
Instrumentation
The University of Tehran English Proficiency Test (UTEPT) is a high-stakes test of English developed and administered by the English Department of the University of Tehran. It is a prerequisite for master’s degree holders who wish to sit the PhD exams of the University of Tehran. It consists of 100 items with a time limit of 100 minutes. A more detailed description of this test may be found in Rezaee and Shabani (2010). It should be noted that the purpose of this study is not to analyze the test per se, but to analyze LR applied to DIF detection. Before proceeding to the analysis of DIF, analyses were done to examine the psychometric properties of the items, as well as the dimensionality and reliability of the scale as a whole. The original scale showed a Cronbach’s alpha reliability coefficient of 0.89, but many of its items had very low discrimination indices (estimated by the corrected item–total point-biserial correlation). Items with a discrimination index below 0.20 were eliminated from the analyses. The resulting scale comprises 75 items and has a Cronbach’s alpha of 0.90. It is this reduced scale on which the analyses were performed. Two conditional covariance estimation-based procedures were used: (a) one to test the null hypothesis that the test data satisfy a unidimensional model (DIMTEST; Stout, 1987; Stout et al., 2001), and (b) one to estimate an effect size for the multidimensionality (DETECT; Kim, 1994; Zhang & Stout, 1999). DIMTEST results (T = 1.7693, p = .0384) indicated that the test is essentially unidimensional. The maximum DETECT value (0.3055) indicated that the test showed weak multidimensionality (Monahan, Stump, Finch, & Hambleton, 2007b).
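The item screening described above can be sketched as follows. This is only an illustration of the standard formulas for Cronbach’s alpha and the corrected item–total correlation, not the software used in the study; the function names and the toy response matrix are hypothetical, and the 0.20 cutoff is the one stated in the text.

```python
from statistics import mean, pvariance

def cronbach_alpha(rows):
    """rows: one list of 0/1 item scores per examinee."""
    k = len(rows[0])
    item_vars = [pvariance([r[j] for r in rows]) for j in range(k)]
    total_var = pvariance([sum(r) for r in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

def corrected_item_total(rows, j):
    """Correlation between item j and the total score computed without item j.
    Assumes neither variable is constant (otherwise the denominator is zero)."""
    x = [r[j] for r in rows]
    y = [sum(r) - r[j] for r in rows]
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def retained_items(rows, cutoff=0.20):
    """Indices of items whose discrimination index reaches the cutoff."""
    return [j for j in range(len(rows[0]))
            if corrected_item_total(rows, j) >= cutoff]

# Toy data: four examinees, three perfectly consistent items
toy = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
alpha = cronbach_alpha(toy)
kept = retained_items(toy)
```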
Participants
Most of the groups compared in studies on DIF are defined based on race/ethnicity, gender or other demographic variables, but very few employ academic background (Pae, 2004). Previous studies on DIF in the UTEPT were based on gender (Rezaee & Shabani, 2010) and academic background (Alavi, Rezaee, & Amirian, 2011). As pointed out above, one of the purposes of this study was to examine the behavior of LR with small samples. Since the study by Alavi et al. (2011) and this one share the same sample size, it was decided to also select the same variable for forming the groups to facilitate the extrapolation of results. The Rezaee and Shabani (2010) study was done with a much larger sample (N = 6,555).
The participants of the present study were 400 test takers randomly selected from the 1600 examinees who took the University of Tehran English Proficiency Test (UTEPT) in November 2010. Passing UTEPT is a prerequisite for them to be allowed to sit for the PhD exam of the University of Tehran. Based on academic background, the sample was divided into a reference group of 200 participants with science and engineering background and a focal group of 200 participants with humanities backgrounds. The science and engineering group comprised students of chemistry, physics, mathematics, biology, mechanical engineering, electrical engineering, and civil engineering and the humanities group consisted of students of social sciences, law, political sciences, management, Persian literature, and foreign languages. There are both male and female test takers in the sample. A sample size of 200 test takers per group was chosen because some studies identified this size as the minimum for adequate power in MH and LR procedures (Güler & Penfield, 2009; Mazor, Clauser, & Hambleton, 1992; Paek & Wilson, 2011; Rogers & Swaminathan, 1993). The power found in simulation studies with smaller sample sizes (50 or 100 test takers per group) is only moderately acceptable when the magnitude of the DIF is high (Fidalgo, Ferreres, & Muñiz, 2004; Fidalgo, Hashimoto, Bartram, & Muñiz, 2007). Many simulation studies show that differences in ability among the groups measured by the test, which is technically called impact, increase Type I error rates in both LR and MH (DeMars, 2009; French & Maller, 2007; Jodoin & Gierl, 2001; Li et al., 2012; Pei & Li, 2010). To find out whether our analyses could be affected by this variable, an independent-samples t-test was conducted to compare UTEPT scores between groups. 
There was no significant difference in the total test scores for the science and engineering (M = 45.14, SD = 11.03) and humanities (M = 42.51, SD = 12.62) groups at α = .01 (t = 2.218, df = 398, p = .027). The coefficient of determination
Analysis
Logistic regression
To use LR for DIF analysis, Models 1, 2, and 3 were fit to the data using SPSS (version 18). The logistic regression procedure uses the item response (1 for a correct response, and 0 for an incorrect response) as the dependent variable, with grouping variable (dummy coded as 1 = Science and engineering/reference group; 0 = Humanities/focal group), total scale score for each subject, and group by total score interaction, as independent variables.
Logistic regression was applied in two stages because of its advantages with regard to Type I and Type II errors (French & Maller, 2007; Navas-Ara & Gómez-Benito, 2002). In the first stage, uniform and non-uniform DIF were evaluated simultaneously using the likelihood ratio test of Model 3 vs. Model 1 at a .05 alpha level. This detection strategy was chosen for the first stage because it maximizes the probability of identifying both uniform and non-uniform DIF and controls the overall Type I error rate. In the second stage, the test score for each examinee is refined by removing items that were found to show DIF in the first stage. Using the total refined score as the estimate of ability, the following tests were calculated for each item to determine whether there was DIF:
T1 = Likelihood ratio test of Model 3 vs. Model 1 (Test of Uniform and/or Non-Uniform DIF).
T2 = Likelihood ratio test of Model 2 vs. Model 1 (Test of Uniform DIF).
T3 = Wald test of β2 in Model 2 (Test of Uniform DIF).
T4 = Wald test of β2 in Model 3 (Test of Uniform DIF).
T5 = Likelihood ratio test of Model 3 vs. Model 2 (Test of Non-Uniform DIF).
T6 = Wald test of β3 in Model 3 (Test of Non-Uniform DIF).
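The six tests can be sketched in code. The following is a minimal, dependency-free Python illustration, not the SPSS procedure used in the study: all function names are hypothetical, the models are fit by plain Newton-Raphson without safeguards against separation or singular information matrices, and the interaction term is assumed to be the product of score and group.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    return 1 / (1 + math.exp(-t)) if t >= 0 else math.exp(t) / (1 + math.exp(t))

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def chi2_sf(x, df):
    """Upper-tail chi-square probability for df = 1 or df = 2."""
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) if df == 1 else math.exp(-x / 2)

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [v - f * w for v, w in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def information(X, beta):
    """Linear predictors and Fisher information matrix evaluated at beta."""
    k = len(beta)
    eta = [sum(bj * xj for bj, xj in zip(beta, x)) for x in X]
    w = [sigmoid(e) * (1 - sigmoid(e)) for e in eta]
    info = [[sum(wi * x[i] * x[j] for x, wi in zip(X, w)) for j in range(k)]
            for i in range(k)]
    return eta, info

def fit_logit(X, y, iters=30):
    """Maximum likelihood fit; returns (coefficients, log-likelihood, covariance)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        eta, info = information(X, beta)
        grad = [sum((yi - sigmoid(e)) * x[j] for x, yi, e in zip(X, y, eta))
                for j in range(k)]
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    eta, info = information(X, beta)
    loglik = sum(yi * e - log1pexp(e) for yi, e in zip(y, eta))
    cov = [solve(info, [1.0 if i == j else 0.0 for i in range(k)]) for j in range(k)]
    return beta, loglik, cov

def dif_tests(score, group, resp):
    """Return {name: (statistic, p-value)} for T1 to T6 on a single item."""
    m1 = [[1.0, s] for s in score]
    m2 = [[1.0, s, g] for s, g in zip(score, group)]
    m3 = [[1.0, s, g, s * g] for s, g in zip(score, group)]
    (_, ll1, _), (b2, ll2, c2), (b3, ll3, c3) = (fit_logit(m, resp) for m in (m1, m2, m3))

    def lr(ll_nested, ll_full, df):
        g2 = 2 * (ll_full - ll_nested)
        return g2, chi2_sf(g2, df)

    def wald(b, var):
        w = b * b / var  # squared z statistic, 1 df
        return w, chi2_sf(w, 1)

    return {"T1": lr(ll1, ll3, 2), "T2": lr(ll1, ll2, 1),
            "T3": wald(b2[2], c2[2][2]), "T4": wald(b3[2], c3[2][2]),
            "T5": lr(ll2, ll3, 1), "T6": wald(b3[3], c3[3][3])}
```

Because the three models are nested, the T1 statistic equals the sum of the T2 and T5 statistics, which offers a quick consistency check on any implementation.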
Mantel–Haenszel procedure
Scale purification was conducted in the same manner as for LR; that is, the MH statistics were computed in two stages. The item under investigation is always included in the matching criterion, even if it displayed DIF in the initial analysis (Holland & Thayer, 1988; Tan et al., 2010). The level of significance used in all analyses was .05.
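For readers who wish to reproduce the MH statistics, a minimal Python sketch is given below. It uses the standard formulas for the common odds ratio estimator, the continuity-corrected MH chi-square (1 df), and the ETS Delta transformation; the function name, the cell layout (A and B for the reference group right/wrong, C and D for the focal group right/wrong), and the example counts are assumptions made for illustration. Purification is not shown.

```python
import math

def mantel_haenszel(tables):
    """tables: one (A, B, C, D) tuple of counts per score level.
    Assumed layout: A, B = reference right/wrong; C, D = focal right/wrong.
    Returns (common odds ratio, MH chi-square, p-value, MH delta)."""
    num = den = a_sum = e_sum = v_sum = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a level with fewer than two examinees adds no information
        num += a * d / n
        den += b * c / n
        a_sum += a
        e_sum += (a + b) * (a + c) / n  # expected A under no DIF
        v_sum += (a + b) * (c + d) * (a + c) * (b + d) / (n * n * (n - 1))
    alpha = num / den  # undefined if den is zero in every level
    chi2 = (abs(a_sum - e_sum) - 0.5) ** 2 / v_sum
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # chi-square tail, 1 df
    delta = -2.35 * math.log(alpha)  # ETS Delta scale
    return alpha, chi2, p_value, delta

# Invented counts for a single score level
alpha, chi2, p_value, delta = mantel_haenszel([(20, 10, 10, 20)])
```

With these invented counts the common odds ratio is 4, and the negative Delta value indicates an item that favors the reference group.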
Results
All the results reported in this section come from the second stage. That is, they were obtained by computing the statistics with the purified test score.
Table 2 presents the results of using LR analysis to determine the statistical significance (α = .05) of the parameters of interest. Out of a total of 75 items, 9 items were flagged as DIF items based on the 2-df likelihood ratio test (T1). The number of detections increased notably when the statistical significance of the parameters of interest was tested individually using the 1-df likelihood ratio tests (T2 and T5). Out of a total of 75 items, 18 items were flagged as DIF, of which 13 correspond to uniform DIF, four to non-uniform DIF, and one item (item 24) showed statistical significance in both parameters (
Table 2. Items classified as exhibiting uniform and non-uniform DIF by type of test (α = .05).
T1 = Likelihood ratio test of Model 3 vs. Model 1 (Test of Uniform and/or Non-Uniform DIF).
T2 = Likelihood ratio test of Model 2 vs. Model 1 (Test of Uniform DIF).
T3 = Wald test of β2 in Model 2 (Test of Uniform DIF).
T4 = Wald test of β2 in Model 3 (Test of Uniform DIF).
T5 = Likelihood ratio test of Model 3 vs. Model 2 (Test of Non-Uniform DIF).
T6 = Wald test of β3 in Model 3 (Test of Non-Uniform DIF).
Table 3 shows the effect size measures proposed by Jodoin and Gierl (2001) and Crane et al. (2006). Comparing the values corresponding to columns
Table 3. Results of the Jodoin–Gierl DIF classification criterion (Nagelkerke R2) and the Crane et al. (2006) criterion. According to the criteria of Jodoin and Gierl, the only item that would show moderate DIF is Item 24 (Category B).
Items identified according to the Crane et al. (2006) criterion:
Table 4. Results of the ETS DIF classification system for the Mantel–Haenszel and the LR methods (uniform DIF).
Note: The items identified only by LR are in gray. The rest of the items have been identified by both procedures.
A correlational analysis of the estimators shows how all the DIF measures are highly related. Spearman rank correlations were found in a range of 0.89 for the correlation between Δ
Discussion
This study examined three controversial points concerning use of LR as a DIF detection method: (a) the DIF detection strategy chosen; (b) the type of statistic employed to determine statistical significance; and (c) the extent of equivalence of different DIF classification systems.
Concerning the first point, the Introduction described the most common analytical strategies in the literature and pointed out the risk that some of them are incorrect even though they may seem plausible. Moreover, it was hypothesized that the inability of LR to detect uniform DIF in Herrera and Gómez’s (2008) simulation study is the product of a model specification error. The results support the plausibility of this hypothesis, showing that LR has detection rates similar to MH, as long as a proper analysis strategy is employed (see Tables 2 and 4). Since this is a study using real data and neither the magnitude nor the type of DIF in the items is known, the degree of certainty of our conclusion is limited. Therefore, an additional short Monte Carlo study was performed to clarify that point. The largest sample size used by Herrera and Gómez (2008) (1500 examinees per group) was chosen to achieve statistical power that would demonstrate the differences between strategies more clearly. The IRT parameters reported in the Herrera and Gómez study were employed for the 12 items with DIF. The parameters of the remaining 88 items without DIF were drawn from the following distributions: b ~ N(0, 1), a ~ N(0.5, 0.2), and c = 0.2 for all items. The ability of the reference and focal groups was normally distributed with mean 0 and standard deviation 1, N(0, 1). Data (20 replications) were generated with the WinGen program (Han, 2007). Table 5 shows the results of the second stage of calculation of the statistics. The main difference between the results of the first and second stages is the substantial reduction in Type I error rate as a consequence of using the purified test score (second stage).
Table 5. Power and Type I error rate (α = .05) in the Monte Carlo study (N = 3000; 20 replications).
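The data-generation step of such a Monte Carlo study can be illustrated with a short Python sketch of the three-parameter logistic (3PL) model. This is not the WinGen program; the function name is hypothetical, and two choices are assumptions rather than details reported above: the scaling constant D = 1.7 and the reading of 0.2 as the standard deviation of the a parameters. The DIF items, whose parameters come from Herrera and Gómez (2008), are not reproduced here.

```python
import math
import random

def simulate_3pl(n_examinees, items, rng, mean=0.0, sd=1.0, scale=1.7):
    """items: list of (a, b, c) tuples. Returns (abilities, 0/1 response matrix).
    scale is the logistic scaling constant D (1.7 is an assumption)."""
    thetas = [rng.gauss(mean, sd) for _ in range(n_examinees)]
    data = []
    for theta in thetas:
        row = []
        for a, b, c in items:
            # 3PL probability of a correct response
            p = c + (1 - c) / (1 + math.exp(-scale * a * (theta - b)))
            row.append(1 if rng.random() < p else 0)
        data.append(row)
    return thetas, data

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
# 88 DIF-free items: b ~ N(0, 1), a ~ N(0.5, 0.2), c = 0.2
# (a normal draw for a can occasionally be negative; no truncation is applied)
items = [(rng.gauss(0.5, 0.2), rng.gauss(0.0, 1.0), 0.2) for _ in range(88)]
ref_theta, ref_data = simulate_3pl(1500, items, rng)    # reference group
focal_theta, focal_data = simulate_3pl(1500, items, rng)  # focal group
```

For the DIF-free items both groups share the same item parameters, so any flagged item among them in a replication counts toward the Type I error rate.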
T1 = Likelihood ratio test of Model 3 vs. Model 1 (Test of Uniform and/or Non-Uniform DIF).
T2 = Likelihood ratio test of Model 2 vs. Model 1 (Test of Uniform DIF).
T3 = Wald test of β2 in Model 2 (Test of Uniform DIF).
MH = Mantel–Haenszel test.
T4 = Wald test of β2 in Model 3 (Test of Uniform DIF).
T5 = Likelihood ratio test of Model 3 vs. Model 2 (Test of Non-Uniform DIF).
T6 = Wald test of β3 in Model 3 (Test of Non-Uniform DIF).
As may be observed in Table 5, when the analytical strategy followed by Herrera and Gómez (2008) is applied, the power of LR for detecting uniform DIF is practically null. However, when an appropriate analytical strategy is used, LR shows uniform DIF detection rates quite similar to those of MH (see statistics T1, T2 and T3) and very much higher than MH in the detection of non-uniform DIF (see statistics T1, T5 and T6). Consequently, checking the hypothesis of no uniform DIF by fitting Model 3 and applying the Wald (or likelihood ratio) statistic with 1 df limits the power of this statistic for detecting uniform DIF. This strategy, inadequate from a theoretical viewpoint, has been included as an alternative analysis in other simulation studies (Hidalgo-Montesinos, Gómez-Benito, & Padilla-García, 2005), and in empirical studies (Doğan & Öğretmen, 2012), with results similar to those of Herrera and Gómez (2008).
From an applied viewpoint, the first conclusion that can be drawn from this study is that performing a differentiated analysis of uniform DIF and non-uniform DIF always involves fitting more than one regression model. The regression models that would have to be fit based on the hypothesis checked and the type of statistic chosen, the likelihood ratio or Wald, are described below.
The second goal refers to the degree of equivalence between the likelihood ratio and the Wald statistic when samples are small. Even though likelihood ratio is theoretically preferable to the Wald statistic (Agresti, 2002; Menard, 2002), our results show practically identical behavior of the two statistics, corroborating the findings of the simulation study by Rogers (1989) which was done with 250 and 500 examinees per group.
The third point refers to the degree of equivalence of the Jodoin and Gierl (2001) DIF classification criterion, the Crane et al. (2006) criterion, and the ETS classification system (Monahan et al., 2007a). The results show that when samples are small, as they are in this study, there is wide discrepancy between the results of the Jodoin and Gierl (2001) classification system, which only classified one item with DIF, and the other criteria. Thus using the criteria established by Jodoin and Gierl (2001) for Nagelkerke’s R2 statistic, only one of the 18 statistically significant items would be classified as having DIF; the Crane criteria identified 9 items, and the ETS classification found 14 items in the category of moderate DIF.
These data have important implications in the field of language assessment. For example, limiting ourselves to research on DIF with the UTEPT, the results of the previous Alavi et al. (2011) study, in which only one item with DIF was identified, should be taken with caution, because the sample size was small and the only DIF classification criterion employed was based on Nagelkerke’s R2. More generally, we can conclude that, in the absence of simulation studies that determine the effectiveness of the various classification systems with small samples, more than one DIF detection method should always be applied, as recommended by Hambleton (2006).
Implications and future research
The results of this study support a series of recommendations for using LR in DIF research. First, detecting the type of DIF requires fitting separate models. The results show that an error in model specification has drastic consequences for the statistical power of LR for DIF detection. Therefore, we warn against the naive idea of establishing equivalence between the type of DIF and the type of parameter without considering the fitted model. Second, with a small sample size of only 400 test takers, no difference was found between the Wald statistic and the likelihood ratio test. Therefore, the Wald statistic may be used to determine statistical significance, with the consequent simplification in terms of fitting models (see Table 6). Third, when sample sizes are small, more than one DIF detection technique should be applied, owing to the wide divergence in results found among DIF classification systems. Furthermore, the Jodoin and Gierl criterion, compared to the ETS classification system, seems extremely conservative, and simulation studies would be useful to determine the behaviour of both with small sample sizes and to derive more accurate cutoff points.
Table 6. Logistic regression models that would have to be fit depending on the hypothesis being checked and the type of statistic chosen. The degrees of freedom (df) of each statistic are also shown.
Acknowledgements
We appreciate the reviewers’ very detailed analysis of the manuscript and all the suggestions received. It enabled us to improve the results substantially.
Funding
This work was supported by the Spanish Ministry of Economy and Competitiveness (grant number PSI2009-08529).
