A Framework for Detecting Both Main Effect and Interactive DIF in Multidimensional Forced-Choice Assessments

Abstract

In recent decades, multidimensional forced-choice (MFC) tests have gained widespread popularity in organizational settings due to their effectiveness in reducing response biases. Detecting differential item functioning (DIF) is crucial in developing MFC tests, as it relates to test fairness and validity. However, existing methods appear insufficient for detecting DIF induced by the interaction between multiple covariates. Furthermore, for multi-category, ordered or continuous covariates, existing approaches often dichotomize them using a-priori cutoffs, commonly using the median of the covariates. This may lead to information loss and reduced power in detecting MFC DIF. To address these limitations, we propose a method to identify both main effect DIF and interactive DIF. This method can automatically search for the optimal cutoffs for ordered or continuous covariates without pre-defined cutoffs. We introduce the rationale behind the proposed method and evaluate its performance through three Monte Carlo simulation studies. Results demonstrate that the proposed method effectively identifies various DIF forms in MFC tests, thereby increasing detection power. Finally, we provide an empirical application to illustrate the practical applicability of the proposed method.

Keywords

multidimensional forced-choice questionnaires, differential item functioning, interactive DIF, Thurstonian IRT model, recursive partitioning

Over the past two decades, a growing body of evidence has demonstrated the strong validity of noncognitive attributes, such as personality, leadership, motivation, and career interests, to predict job performance (e.g., Denis et al., 2010; Shaffer & Postlethwaite, 2013; Sitser et al., 2013). Consequently, in high-stakes personnel selection settings, noncognitive attributes are often assessed alongside cognitive abilities to accurately determine a candidate's suitability for a position. While historically the Likert-type rating scale has been widely used for measuring noncognitive attributes, it has been criticized for its susceptibility to response bias in many studies (e.g., Danner et al., 2016; Frick, 2022; Ng et al., 2020; Wetzel et al., 2016). As Paulhus (1991) defined, a response bias is “a systematic tendency to respond to a range of questionnaire items on some basis other than the specific item content.” Three prominent types of response bias are: (a) socially desirable responding (including faking bad, lying, etc.), (b) acquiescence (tendency to agree), and (c) extremity bias (tendency to use extreme ratings). Researchers have demonstrated that response biases can jeopardize the validity and utility of measures (e.g., Griffith et al., 2007; Schmitt & Oswald, 2006) and potentially lead to incorrect hiring decisions in high-stakes settings (Winkelspecht et al., 2006), rendering traditional Likert-type scales a suboptimal choice for personnel selection.

Forced-Choice Formats and Item Response Theory Models for Multidimensional Forced-Choice Tests

To reduce response biases, psychometric researchers have long advocated for the use of multidimensional forced-choice (MFC) tests (e.g., Anguiano-Carrasco et al., 2015; Calderón Carvajal et al., 2021; Christiansen et al., 2005; Guenole et al., 2018; Jackson et al., 2000; Merk et al., 2017; Ng et al., 2020). In MFC tests, respondents are given two or more statements with similar appeals, and are asked to choose the one that fits them best or to rank them in order of fit (for an overview of different MFC formats, see Hontangas et al., 2015). Numerous studies have shown that the MFC form significantly reduces socially desirable responding and acquiescence (e.g., Heggestad et al., 2006; Jackson et al., 2000). Some research also demonstrated that MFC tests have similar or higher construct and criterion-related validity than Likert-type personality scales (Bartram, 2007; Salgado et al., 2015). Currently, examples of widely used MFC questionnaires include the International Personality Item Pool-Multidimensional Forced Choice (IPIP-MFC; Heggestad et al., 2006), the Occupational Personality Questionnaire (ipsative shorted format; OPQ32r; Brown & Bartram, 2009), and the Personality and Preference Inventory (Cubiks, 2010). According to Tett et al. (2011), almost 30% of U.S. companies use MFC tests for talent selection in organizational contexts.

Although MFC tests have great advantages, traditional scoring formats of these tests have been criticized for yielding ipsative or partially ipsative scores (Baron, 1996; Brown & Maydeu-Olivares, 2011; Cattell, 1944). Cattell (1944) first used the term ipsative (from the Latin ipse: he, himself) to name a type of scale in which a score on one attribute is relative to scores on the person's other attributes. The scores generated when individuals respond to such a scale are ipsative scores. Ipsative scores only reflect relative levels of the measured traits within an individual because the sum of scores is a constant for each respondent (Hontangas et al., 2015). Accordingly, ipsative scores are not directly comparable across individuals (Loo, 1999). Brown and Maydeu-Olivares (2013) also indicated that ipsative scores have problematic psychometric properties, including reliability, criterion-related validity, and construct validity. For a period of time, this limited the application of MFC tests.
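The constant-sum property of ipsative scores can be made concrete with a minimal sketch. The scoring rule below (the top-ranked statement in a block earns the most points for its trait), the trait labels, and the two response patterns are illustrative assumptions rather than the scoring key of any instrument cited here.

```python
from collections import Counter


def ipsative_scores(rankings, block_size=3):
    """Classical rank scoring for forced-choice blocks (hypothetical rule).

    rankings: list of blocks, each a list of trait labels ordered from most
    to least descriptive. The first position earns block_size - 1 points,
    the last position earns 0.
    """
    scores = Counter()
    for block in rankings:
        for position, trait in enumerate(block):
            scores[trait] += block_size - 1 - position
    return dict(scores)


# Two hypothetical respondents answering the same four triplets.
person_a = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"], ["A", "B", "C"]]
person_b = [["C", "B", "A"], ["C", "A", "B"], ["B", "C", "A"], ["C", "B", "A"]]

scores_a = ipsative_scores(person_a)
scores_b = ipsative_scores(person_b)

# Every block hands out the same number of points, so the totals are identical.
print(scores_a, sum(scores_a.values()))
print(scores_b, sum(scores_b.values()))
```

Both respondents end up with the same total even though their profiles are mirror images, so a high score on one trait only says that the trait is high relative to that person's other traits.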

Fortunately, this problem has been addressed using modern psychometric theories. Specifically, researchers have proposed a group of multidimensional forced-choice item response theory (MFC-IRT) models to analyze MFC item response data and generate reliable trait scores that can be compared across individuals. Research has revealed that IRT-estimated scores effectively overcome the limitation of ipsative data and offer higher measurement precision than ipsative scores (Brown & Maydeu-Olivares, 2013). In the past few decades, various MFC-IRT models have been proposed to establish the relationship between the overt responses and the latent traits assessed by MFC tests. To classify existing MFC-IRT models, Brown (2014) introduced a unified framework comprising three dimensions: (1) the MFC format used, (2) the measurement model for the relationships between items and measured traits, and (3) the decision model for choice behavior. In the framework, measurement models used for personality assessments are categorized into two main types, (1) dominance models and (2) ideal point models, depending on assumptions regarding respondents’ response processes. Dominance models posit a monotonic increase in the likelihood of endorsing an item with an increasing underlying trait level. Conversely, ideal point models assume a nonmonotonic relationship where the probability of endorsement peaks when the respondent's trait level matches the item's location. As for decision models, four primary types have been employed to model MFC response data: Thurstone's Law of Comparative Judgment (Thurstone, 1927), Andrich's Forced Endorsement Model (Andrich, 1989), Luce's Choice Axiom (Luce, 1977), and the Bradley–Terry Model (Bradley & Terry, 1952). Following Brown's (2014) framework, existing MFC-IRT models are classified as depicted in Table 1.

Table 1.

Existing MFC-IRT Models and Their Categorization in Brown's (2014) Framework.

MFC-IRT Models	Authors	MFC Formats	Measurement Model	Decision Model
Zinnes-Griggs (Z-G) model	Zinnes and Griggs (1974)	PICK	IPM	BT
Multi-unidimensional pairwise-preference model (MUPP-GGUM)	Stark et al. (2005)	PICK	IPM	AFE
Thurstonian IRT (TIRT) model	Brown and Maydeu-Olivares (2011)	PICK, MOLE, RANK	DM	TLCJ
Generalized graded unfolding-RANK (GGUM-RANK) model	Hontangas et al. (2015) and Lee et al. (2019)	PICK, MOLE, RANK	IPM	AFE
Multi-unidimensional pairwise preference two-parameter logistic (MUPP-2PL) model	Morillo et al. (2016)	PICK	DM	AFE
Rasch ipsative model (RIM)	Wang et al. (2017)	PICK	DM	AFE
Bayesian random block IRT (BRB IRT) model	Lee and Smith (2020a)	PICK, MOLE, RANK	DM	AFE
Zinnes-Griggs pairwise preference item response theory model	Joo et al. (2021)	PICK	IPM	TLCJ
Forced-choice ranking models	Hung and Huang (2022)	MOLE, RANK	IPM	AFE
Generalized Thurstonian unfolding model (GTUM)	Zhang et al. (2023)	PICK, MOLE, RANK	IPM	TLCJ

Note: PICK = choose or pick the statement that is most descriptive of them, MOLE = choose the most and least descriptive statements, RANK = rank all statements by their descriptiveness, DM = dominance model, IPM = ideal point model, AFE = Andrich's Forced Endorsement Model, TLCJ = Thurstone's Law of Comparative Judgment, LUCE = Luce's axiom of choice, BT = Bradley-Terry Model.

Recent advancements in IRT models for analyzing MFC response data have created two noteworthy and innovative research avenues. First, the development of novel MFC-IRT models that incorporate response time (e.g., Bunji & Okada, 2020, 2022; Guo et al., 2023) has provided a deeper understanding of the MFC item response process. This integration of response modeling and response time modeling holds promise for enhancing measurement accuracy. Second, researchers have delved into modeling aberrant response behavior specific to MFC tests (e.g., Frick, 2022; Peng et al., 2023). Expanding the scope of MFC-IRT models to accommodate diverse aberrant response patterns contributes significantly to enhancing the validity of derived trait estimates. Looking ahead, by continuing to advance MFC-IRT methodology and expand the scope of modeled response behaviors, researchers can further unlock the richness of MFC response data to obtain enhanced insights into the cognitive processes underlying test performance and more valid measurement of the noncognitive constructs of interest.

In summary, modern IRT approaches facilitate robust modeling and scoring of MFC tests, providing versatile options for analyzing MFC response data and generating reliable trait scores. Despite the abundance of MFC-IRT models, this study chose the TIRT model for three reasons: (a) it can deal with common MFC formats, including choosing the statement that is most descriptive (PICK), choosing the most and least descriptive statements (MOLE), and ranking all statements by their descriptiveness (RANK); (b) its adaptability and widespread use in recent research (e.g., Brown & Maydeu-Olivares, 2011; Bürkner et al., 2019; Peng et al., 2023) have established it as a reliable analysis framework; and (c) the development of MFC item blocks for dominance models seems easier and incurs a lower cost than for ideal point models (Brown & Maydeu-Olivares, 2010), and some studies have pointed out that the majority of existing personality assessments seem to fit a dominance model better than an ideal point model (e.g., Cho et al., 2015).

Why Detect Differential Item Functioning for MFC Tests?

As delineated in the Principles for the Validation and Use of Personnel Selection Procedures (2018), a fair selection procedure enables all test takers to show their status on a construct without undue advantage or disadvantage due to other personal attributes (e.g., age, race, ethnicity, and gender). As an impactful assessment tool in high-stakes personnel selection procedures, MFC tests must also address the issue of test fairness. In psychometric contexts, a frequently used concept intimately tied to test fairness is measurement invariance (MI). MI reflects the comparability of trait estimates across respondents with different demographic characteristics (e.g., males and females). MI at the item level is also referred to as being free from differential item functioning (DIF; see Magis et al., 2010 for an overview). In MFC tests, MFC DIF refers to how individuals with matched latent trait levels but from different subgroups may respond to the same MFC item block¹ in a different way. As highlighted by the Standards for Educational and Psychological Testing (AERA et al., 2014) and Zieky (2015), DIF analysis has been a routine practice in test construction to ensure the fairness and accuracy of test outcomes. Hence, to ensure the fairness and validity of newly developed MFC tests in high-stakes personnel selection environments, it is essential to conduct MFC DIF analysis.

Compared to the traditional DIF detection with single-stimulus items, detecting DIF in MFC tests faces unique challenges. Specifically, the manifestation of DIF is more complex in MFC tests than in traditional tests. First, each item block in the MFC test contains at least two statements, which increases the chances that DIF appears in MFC tests. Additionally, following the views of Qiu and Wang (2021) and Lee et al. (2021), there are multiple possible sources of DIF in MFC tests, including statements, pairwise comparisons, or both. Finally, DIF in MFC tests may cancel or accumulate depending on how the statements are perceived relative to each other (Lee et al., 2021). This complexity requires unique detection methods for DIF in MFC tests, necessitating research in developing DIF detection methods tailored specifically for MFC tests.

Existing MFC DIF Detection Methods and Their Limitations

The current literature on MFC DIF detection methods is thin, with only four studies published to date to our knowledge. Existing methods can be classified into three categories, depending on whether they address differential functioning at the statement level, the item block level, or the whole test level.

Detecting MFC Differential Functioning at the Statement Level. Chen and Wang (2014) argued that a statement in an MFC item block may exhibit different utilities for different respondents. Therefore, they extended the logistic regression method (LR; Swaminathan & Rogers, 1990) to detect MFC statement-level differential functioning. Qiu and Wang (2021) introduced the term differential statement functioning (DSF) to define the differential functioning manifested at the statement level in MFC tests. Based on the Rasch ipsative model, they proposed three methods for detecting DSF: equal-mean-utility, all-other-statement, and constant-statement. Although the feasibility of these methods has been well supported, they do not take into account an important issue raised by Lee et al. (2021): DSF may be canceled or accumulated at the item block level. More specifically, an MFC item block might exhibit DIF even when none of its statements exhibits significant DSF; conversely, statements that exhibit DSF could be combined to produce a DIF-free item block. Therefore, performing MFC DIF detection only at the statement level is insufficient.

Detecting MFC Differential Functioning at the Item Block Level. Lee et al. (2021) argued that it may be more prudent to detect MFC DIF at the item block level directly, as DSF can be canceled or accumulated within an item block. To achieve this, they changed the unit of differential functioning analysis from a statement to an item block and proposed a method called omnibus Wald tests (OWT) to test MFC DIF at the item block level for the TIRT model. Taking an MFC triplet as an example, the basic idea of this method is to perform an omnibus 6-df Wald test on the three loading parameters and the three threshold parameters of each item block. A limitation of this method is that it cannot locate the source of the DIF, that is, whether it arises at the statement level, the pairwise comparison level, or both.
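As a rough illustration of what an omnibus 6-df test on a triplet involves, the sketch below computes a two-group Wald statistic of the usual quadratic form, the parameter difference weighted by the inverse of the summed covariance matrices, and refers it to a chi-square distribution with six degrees of freedom. The parameter estimates and covariance matrices are invented for illustration, and the helper names are hypothetical; this is not the implementation of Lee et al. (2021).

```python
import math


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting (small systems)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))) / m[i][i]
    return x


def wald_statistic(est_ref, est_foc, cov_ref, cov_foc):
    """Quadratic form d' (cov_ref + cov_foc)^-1 d with d = est_ref - est_foc."""
    d = [r - f for r, f in zip(est_ref, est_foc)]
    pooled = [[cr + cf for cr, cf in zip(row_r, row_f)]
              for row_r, row_f in zip(cov_ref, cov_foc)]
    return sum(di * zi for di, zi in zip(d, solve(pooled, d)))


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; this closed form holds for even df only."""
    if df % 2 != 0:
        raise ValueError("closed form requires an even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical triplet: three loadings followed by three thresholds per group.
est_ref = [0.8, 0.7, 0.9, 0.1, -0.2, 0.3]
est_foc = [0.8, 0.7, 0.9, 0.5, -0.2, 0.3]  # only the first threshold differs
diag = [[0.01 if r == c else 0.0 for c in range(6)] for r in range(6)]

w = wald_statistic(est_ref, est_foc, diag, diag)
p_value = chi2_sf_even_df(w, df=6)
print(round(w, 3), round(p_value, 3))
```

Because the six differences are pooled into one statistic, a sizable shift in a single threshold can be diluted by five parameters that do not differ, and the statistic alone does not say which parameter drives a significant result.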

Detecting MFC Differential Functioning at the Whole Test Level. Lee and Smith (2020b) introduced the multiple group confirmatory factor analysis (CFA) method based on the TIRT model to detect differential functioning on the whole test level. This work found that absolute changes (Δ) in alternative fit indices (e.g., CFI and NCI) worked well in testing measurement invariance for MFC tests. They also proposed cutoffs for changes in model fit indices to better detect metric and scalar non-invariance. However, this approach can only detect whether the whole MFC test maintains measurement invariance across subgroups, and cannot further provide information about which specific item blocks exhibit MFC DIF, which limits its practical application.

Overall, all of the existing methods for detecting MFC DIF reviewed above employ a multigroup approach, a technique also utilized in traditional single-stimulus DIF detection. This approach requires that the covariates used for DIF detection be categorical, frequently with only two groups, such as male and female respondents. While these established MFC DIF detection methods have demonstrated effectiveness, none adequately tackles two crucial issues.

First, in organizational settings, many covariates of interest are ordered or multi-category (e.g., country, ethnicity, race, socioeconomic status, educational level, department/division, job grade, and cultural background) or continuous (e.g., working years, language proficiency, and salary level). When no protected classes are prescribed by laws or regulations for such covariates, the subgroups are defined arbitrarily using a-priori cutoffs, commonly by dichotomizing each covariate at its median in the collected data (Bourion-Bédès et al., 2015; Strobl et al., 2013). However, as cautioned by MacCallum et al. (2002), a-priori discretization carries the risk of discarding crucial variation within artificial categories, and may even fail to capture precisely the range where DIF occurs. Moreover, the outcomes of DIF detection are significantly affected by how subgroups are formed. Some researchers have also demonstrated that determining the groups arbitrarily a-priori may lead to information loss and a potential decrease in the power to detect DIF (Strobl et al., 2013; Tay et al., 2015).

The second limitation of the existing MFC DIF detection methods is that they omit the interaction among multiple covariates that may induce DIF, as they only detect MFC DIF for each covariate separately. To distinguish DIF induced by the interaction between multiple covariates from DIF induced by a single covariate, as investigated in previous MFC DIF studies, we refer to the former as interactive DIF and to the latter as main effect DIF. Detecting interactive DIF in MFC tests has several benefits. First, it allows for a more thorough understanding of the complex sources causing differential functioning (Belzak, 2023; Robert et al., 2006; Tay et al., 2015), enabling experts to make targeted modifications. As stated by Collins (1990), there is an increasing recognition that identity encompasses the intersection of multiple background characteristics. Second, it helps to ensure test fairness more effectively. Revealing more hidden measurement bias through interactive DIF detection helps to prevent hiring discrimination. Finally, it further enhances test validity and the overall quality of the tests. In contrast to main effect DIF, interactive DIF is particularly insidious because it only becomes apparent when multiple covariates are considered simultaneously in a single round of DIF testing. Yet interactive DIF may be common. For example, measurement bias caused by the interaction between covariates is quite evident in the assessment of adolescent delinquent behavior (Bauer, 2017). Over the past decade, researchers have shown great interest in interactive DIF and have developed methods to detect it in traditional single-stimulus tests (e.g., Bollmann et al., 2018; Strobl et al., 2013; Tutz & Berger, 2016). However, methods to detect interactive DIF in MFC tests are lacking.

In conclusion, the current methods for detecting MFC DIF face limitations in dichotomizing the grouping covariates arbitrarily a-priori, and omitting the investigation of interactive DIF. This paper proposes a new method based on the recursive partitioning technique to address these limitations.

A Brief Overview of Recursive Partitioning Technique and Its Application in IRT-Based DIF Detection

Advances in machine learning (ML) have provided new opportunities to overcome the constraints of conventional DIF detection approaches. As a multivariate data mining technique in ML, recursive partitioning (RP) has proven remarkably effective in analyzing multivariate data across diverse scientific fields, including genetics (e.g., Díaz-Uriarte & Alvarez de Andrés, 2006), ecology (e.g., Cutler et al., 2007), medicine (e.g., Khalilia et al., 2011), and psychology (e.g., Hayes et al., 2015). RP has its origins in classification and regression tree (CART) models introduced by Breiman et al. (1984). The RP algorithm works by building a tree-like structure, with covariates being the nodes. The tree starts with a single node, and then branches into two mutually exclusive child nodes that maximize within-node homogeneity and between-node heterogeneity. This branching recurs for each new node.

In recent years, RP has emerged as a promising ML approach for IRT-based DIF detection in single-stimulus tests. One major advantage is the ability to incorporate both multi-category and continuous covariates into the partitioning, with the branching cutoffs determined from the data rather than fixed in advance. RP is also expected to uncover complex interactive DIF that traditional techniques cannot. Owing to these strengths, RP has gained increasing research attention over the past decade as an innovative tool for detecting DIF in traditional single-stimulus tests (e.g., Bollmann et al., 2018; Finch et al., 2015; Strobl et al., 2013; Tutz & Berger, 2016). Such an RP algorithm involves the following five key steps:

1. Starting with the full sample, perform group-wise parameter calibration on the current sample of respondents for each covariate of interest.

2. Calculate the selected splitting criterion (e.g., the log-likelihood test statistic) for each covariate of interest.

3. Locate the maximum value among all calculated split criteria, and take the combination of covariate and split point that produces this maximum as the first best split covariate and its best split point.

4. Search for the next best split covariate and its best split point within the nodes produced by Step 3.

5. Recursively repeat Steps 2–4 on each resulting node until the stopping criteria are met.
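The five steps can be sketched generically as follows. To keep the example self-contained, the split criterion is a stand-in (a squared z statistic for a difference in response proportions) rather than an IRT-based test statistic, and the stopping rules (a minimum node size, a critical value of 3.84, and a maximum depth) are assumptions chosen for illustration. The function names and the toy data are hypothetical.

```python
def split_statistic(left, right):
    """Stand-in split criterion: squared z statistic for a difference in proportions.

    An IRT-based tree would replace this with a model-based test statistic.
    """
    n1, n2 = len(left), len(right)
    pooled = (sum(left) + sum(right)) / (n1 + n2)
    variance = pooled * (1 - pooled) * (1 / n1 + 1 / n2)
    if variance == 0:
        return 0.0
    return (sum(left) / n1 - sum(right) / n2) ** 2 / variance


def best_split(rows, covariates, min_size):
    """Steps 1-3: evaluate every covariate and cut point, keep the maximum."""
    best = None
    for name in covariates:
        levels = sorted({row[name] for row in rows})
        for cut in levels[:-1]:  # the largest level would leave the right node empty
            left = [row["y"] for row in rows if row[name] <= cut]
            right = [row["y"] for row in rows if row[name] > cut]
            if min(len(left), len(right)) < min_size:
                continue
            stat = split_statistic(left, right)
            if best is None or stat > best[0]:
                best = (stat, name, cut)
    return best


def grow_tree(rows, covariates, min_size=20, critical=3.84, max_depth=3, depth=0):
    """Steps 4-5: split recursively until a stopping rule applies."""
    best = best_split(rows, covariates, min_size) if depth < max_depth else None
    if best is None or best[0] < critical:
        return {"n": len(rows)}  # leaf node
    stat, name, cut = best
    left = [row for row in rows if row[name] <= cut]
    right = [row for row in rows if row[name] > cut]
    return {
        "split": (name, cut),
        "stat": stat,
        "left": grow_tree(left, covariates, min_size, critical, max_depth, depth + 1),
        "right": grow_tree(right, covariates, min_size, critical, max_depth, depth + 1),
    }


# Deterministic toy data: the response rate is elevated only when
# gender == 1 AND age > 40, i.e., through an interaction of two covariates.
rows = []
for age in range(20, 60):
    for gender in (0, 1):
        ones = 9 if (gender == 1 and age > 40) else 3
        rows += [{"age": age, "gender": gender, "y": int(rep < ones)} for rep in range(10)]

tree = grow_tree(rows, ["gender", "age"])
print(tree["split"], tree["right"]["split"])
```

In this toy setting the tree first splits on age and then, only within the older node, on gender, so the interaction appears as a nested split and the cutoff on age is found from the data rather than fixed at the median. The sketch applies no correction for the many cut points examined at each node, which a real application would need to consider.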

Existing studies employing RP for IRT-based DIF detection can be grouped into two main categories: global and item-level approaches. Global-level methods, such as Rasch Trees (Strobl et al., 2013) and polytomous Rasch Trees (Komboz et al., 2016), detect parameter instability across covariates to identify regions where DIF occurs but fail to automatically identify the items responsible for DIF. In contrast, item-level RP approaches, such as item-focused trees for the Rasch Model (IFT; Tutz & Berger, 2016) and item-focused trees for the Partial Credit Model (PCM-IFT; Bollmann et al., 2018), enable the simultaneous identification of both the items and the variables responsible for DIF. Item-level methods generate tree diagrams that visualize which items exhibit DIF, at which levels of which covariates, and in which ways. This granular information facilitates focused review of potential DIF sources (Finch et al., 2015). Thus, the item-level RP approach is more useful for leveraging DIF outcomes to improve test development and fairness.

The Present Study

To our knowledge, published studies on DIF detection using RP techniques have been limited to single-stimulus tests. While promising results have been demonstrated, the effectiveness of RP in MFC tests has not yet been thoroughly investigated. In this study, we designed an RP-based method to detect both main effect DIF and interactive DIF in MFC tests and evaluated its efficacy through simulation. The proposed method seeks to enhance the power of detecting MFC DIF related to ordered and continuous covariates without predefined protected classes through data-optimized grouping. Furthermore, it is designed to simultaneously identify the item blocks and covariates responsible for DIF, holding significant practical implications for improving the validity and fairness of MFC tests in high-stakes personnel selection.

In what follows, we first introduce the Thurstonian IRT model, a statistical model used to fit forced-choice response data in this study. Second, we provide detailed information on the procedures of the proposed method. Third, we evaluate the proposed method's performance using three Monte Carlo simulation studies. In the fourth section, we present an empirical example that illustrates the applicability of the proposed method using a dataset derived from a well-validated MFC test collected in a real-world setting. Lastly, we discuss the limitations of the current study and offer recommendations for future research aimed at enhancing the capabilities of the proposed method.

The Thurstonian IRT Model

Thurstone (1927) introduced Thurstone's Law of Comparative Judgment to describe the respondents’ response process when responding to an MFC item block. He argued that for two statements, i and k, paired in an item block j, respondents reacted independently to each one, and each produced a utility, that is, t_i and t_k. He assumed that:

$$y_{j}^{*} = t_{i} - t_{k} + ε, \tag{1}$$

$$y_{j} = \begin{cases} 1, & \text{if } y_{j}^{*} \geq 0, \\ 0, & \text{if } y_{j}^{*} < 0. \end{cases} \tag{2}$$

In the above equations, $y_{j}^{*}$ is a continuous variable, and the response $y_{j}$ contains two categories, that is, 0 and 1. When $y_{j}^{*} \geq 0$, the respondent's answer is 1, which means that they prefer statement i over statement k. Otherwise, an answer of 0 indicates that they prefer statement k to statement i. Thurstone believed that the judgment error $ε$ should exist, and he also assumed that $t_{i}$, $t_{k}$, and $ε$ are normally distributed and independent of each other.

The Thurstonian IRT (TIRT) model was proposed by Brown and Maydeu-Olivares (2011) and used to fit the response data generated from the MFC questionnaires. Assuming that the utility is a linear function of the latent trait being measured, the utility values $t_{i}$ and $t_{k}$ for the two statements in the block j can be expressed as the following linear functions of the latent traits $θ_{a}$ and $θ_{b}$ :

$$\begin{cases} t_{i} = μ_{i} + λ_{i} θ_{a} + ε_{i}, \\ t_{k} = μ_{k} + λ_{k} θ_{b} + ε_{k}, \end{cases} \tag{3}$$

where $μ_{i}$ and $μ_{k}$ are the utility means, $λ_{i}$ and $λ_{k}$ are the loading parameters of statements i and k, and $ε_{i}$ and $ε_{k}$ are the unique factors (error terms) of the two utilities. Inserting Equation 3 into Equation 1, it can be derived that:

$$y_{j}^{*} = t_{i} - t_{k} = -γ_{j} + (λ_{i} θ_{a} - λ_{k} θ_{b}) + (ε_{i} - ε_{k}), \tag{4}$$

where $γ_{j} = -(μ_{i} - μ_{k})$ is the threshold parameter of item block j, that is, the negative of the difference between the utility means of the two statements involved.

Because the latent traits and errors are normally distributed, $y_{j}^{*}$ is also normally distributed. The conditional probability of preferring statement i to statement k in block j is expressed as follows:

$$P(y_{j} = 1 \mid θ_{a}, θ_{b}) = Φ\left(\frac{-γ_{j} + λ_{i} θ_{a} - λ_{k} θ_{b}}{\sqrt{ψ_{i}^{2} + ψ_{k}^{2}}}\right), \tag{5}$$

where $Φ(\cdot)$ denotes the cumulative standard normal distribution function, and $ψ_{i}^{2}$ and $ψ_{k}^{2}$ are the unique variances of the two utilities, that is, $ψ_{i}^{2} = \operatorname{var}(ε_{i})$ and $ψ_{k}^{2} = \operatorname{var}(ε_{k})$. The TIRT model can also be viewed as an extension of the normal ogive IRT model to situations in which statements are presented in blocks that measure multiple latent traits. For parameter calibration, the observed responses should first be converted into pairwise comparison results and then encoded into binary data.
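The two operations just described, recoding a block response into binary pairwise outcomes and evaluating the response probability in Equation 5, can be sketched as follows. The function names, the data layout (one rank per statement, with 1 meaning most descriptive), and the parameter values are assumptions made for illustration.

```python
import math
from itertools import combinations


def rank_to_binary(ranks):
    """Recode a full ranking of one block into binary pairwise outcomes.

    ranks[s] is the rank given to statement s (1 = most descriptive).
    Returns {(i, k): 1 if statement i is preferred to statement k else 0}
    for every pair with i < k.
    """
    return {(i, k): int(ranks[i] < ranks[k])
            for i, k in combinations(range(len(ranks)), 2)}


def normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tirt_probability(theta_a, theta_b, gamma_j, lam_i, lam_k, psi2_i, psi2_k):
    """Probability of preferring statement i to statement k (Equation 5)."""
    z = (-gamma_j + lam_i * theta_a - lam_k * theta_b) / math.sqrt(psi2_i + psi2_k)
    return normal_cdf(z)


# A triplet in which the second statement is ranked first and the third last.
outcomes = rank_to_binary([2, 1, 3])
print(outcomes)

# Hypothetical parameter values for one pair of statements.
p = tirt_probability(theta_a=1.0, theta_b=0.0, gamma_j=0.0,
                     lam_i=1.0, lam_k=1.0, psi2_i=1.0, psi2_k=1.0)
print(round(p, 3))
```

A triplet therefore contributes three binary outcomes and a block of n statements contributes n(n - 1)/2, which is why the pairwise comparison, not the single statement, is the unit that enters the likelihood.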

The Proposed Method for Detecting Both the Main Effect DIF and Interactive MFC DIF

Differential Item Functioning for the Thurstonian IRT Model

In the TIRT model, we can model the DIF by specifying separate equations for the two groups of interest using the following expression:

$$P(y_{j} = 1 \mid θ_{a}, θ_{b}, γ_{jg}, λ_{ig}, λ_{kg}, ψ_{ig}, ψ_{kg}, g) = Φ\left(\frac{-γ_{jg} + λ_{ig} θ_{a} - λ_{kg} θ_{b}}{\sqrt{ψ_{ig}^{2} + ψ_{kg}^{2}}}\right), \quad g = 1, 2, \tag{6}$$

and its linear predictor η can easily be obtained and expressed as

$$η_{j} = \frac{λ_{ig} θ_{a} - λ_{kg} θ_{b} - γ_{jg}}{\sqrt{ψ_{ig}^{2} + ψ_{kg}^{2}}}, \quad g = 1, 2, \tag{7}$$

where g represents the group of respondents, with g equal to 1 for the focal group and g equal to 2 for the reference group.

According to Lee et al. (2021), DIF in the TIRT model can manifest in one of three scenarios: only loading parameters (λ), only threshold parameters (γ), or both loading and threshold parameters. These scenarios correspond to DIF at the statement level, pairwise comparison level, and both levels simultaneously, respectively. When examining DIF at the MFC item block level, it is important to note that DIF may occur even when all statements within the block are individually invariant. Therefore, a robust approach is to perform simultaneous DIF detection for both the loading and threshold parameters of each item block, as testing one parameter alone may lead to inaccurate conclusions.
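A small numerical sketch may help show why the block, rather than the statement, is the natural unit of analysis. From Equations 3 and 4, the threshold equals the difference of the two utility means, γ_j = μ_k − μ_i, so group differences in the utility means can cancel or accumulate in γ_j. The parameter values and group labels below are hypothetical, and the unique variances are fixed at 1 purely for convenience.

```python
import math


def normal_cdf(z):
    """Cumulative standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def block_probability(theta_a, theta_b, mu_i, mu_k, lam_i, lam_k,
                      psi2_i=1.0, psi2_k=1.0):
    """P(y_j = 1) for one pair, using gamma_j = mu_k - mu_i (Equations 3-5)."""
    gamma_j = mu_k - mu_i
    z = (-gamma_j + lam_i * theta_a - lam_k * theta_b) / math.sqrt(psi2_i + psi2_k)
    return normal_cdf(z)


# Hypothetical reference-group parameters for one statement pair.
reference = dict(mu_i=0.4, mu_k=0.1, lam_i=1.0, lam_k=0.8)
# Focal group A: both utility means shift by +0.5, so the shifts cancel in gamma_j.
focal_cancel = dict(mu_i=0.9, mu_k=0.6, lam_i=1.0, lam_k=0.8)
# Focal group B: only the first utility mean shifts, so gamma_j changes.
focal_shift = dict(mu_i=0.9, mu_k=0.1, lam_i=1.0, lam_k=0.8)

theta_a, theta_b = 0.5, -0.2  # identical latent traits in every group
p_ref = block_probability(theta_a, theta_b, **reference)
p_cancel = block_probability(theta_a, theta_b, **focal_cancel)
p_shift = block_probability(theta_a, theta_b, **focal_shift)
print(round(p_ref, 3), round(p_cancel, 3), round(p_shift, 3))
```

With matched traits, the first focal group responds exactly like the reference group despite both statements having shifted, whereas the second focal group does not. Only the second case would register as DIF at the block level.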

The Procedures of the Proposed Method: the Block-Based Sequential Recursive Partitioning Trees

This paper proposes the Block-based Sequential Recursive Partitioning Trees method, abbreviated as MFC-BSRPT, to detect DIF in MFC tests. This method combines the recursive partitioning technique with the TIRT model to grow a DIF tree for each studied item block. We first estimate the item parameters of the item block based on response data from respondents in two different subgroups (i.e., focal and reference groups) at each splitting point; then, based on binary splits, we sequentially grow a recursive partitioning tree for each item block using a selected split criterion; and finally we generate a tree diagram that displays where DIF is present, at which cutoff of which covariate, and in what ways. To aid understanding, we provide a flow chart of the procedure (see Figure 1). Below we elaborate on the procedures and technical details of the proposed method, taking a single studied item block as an example.

Figure 1.

A flow chart of the fitting procedures for the proposed method.

Step 1: Locate All Possible Cut Points in All Available Covariates

Two key issues must be considered when growing a tree: (1) how to select a covariate for a node and identify the best cut point for the selected covariate, and (2) when to stop growing the tree. These issues are explained in subsequent sections.

To grow a tree, one must identify all possible cut points within the covariate space. The covariate space refers to all levels of all covariates of interest, and these covariates can be categorical, ordered, or continuous. Notably, while the possible levels of multi-category and ordered covariates are countable, a continuous covariate has infinitely many possible levels. Therefore, to locate cut points suitable for dichotomizing a continuous covariate for DIF detection, we need to determine a countable and meaningful set of levels for it. A common approach is to fix the number of decimal places retained for each cutoff. For example, Strobl et al. (2013) took each possible cutoff value used to dichotomize the continuous covariate to two decimal places.
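Enumerating candidate cut points for an ordered or continuous covariate can be sketched as below, assuming a binary split of the form x ≤ c versus x > c. The function name, the default of two decimal places, and the example covariates are illustrative assumptions. Unordered multi-category covariates would instead require enumerating partitions of the categories, which this sketch does not cover.

```python
def candidate_cutpoints(values, decimals=2):
    """Return the sorted candidate cut points c for a split x <= c versus x > c.

    Values are rounded to `decimals` places so that a continuous covariate
    yields a countable set of levels. The largest level is dropped because
    it would leave the right-hand node empty.
    """
    levels = sorted({round(v, decimals) for v in values})
    return levels[:-1]


# Hypothetical covariates: years of service (continuous) and job grade (ordered).
years = [1.234, 2.5, 1.231, 3.0, 7.75]
grade = [1, 3, 2, 3, 1]

year_cuts = candidate_cutpoints(years)
grade_cuts = candidate_cutpoints(grade, decimals=0)
print(year_cuts)
print(grade_cuts)
```

When cut points are taken from rounded values, the same rounding should be applied to the covariate before respondents are assigned to nodes so that the candidate levels and the group assignments agree.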

At each node, a covariate is selected and we divide all respondents of the full set A into two subgroups (i.e., focal and reference groups), that is, two subsets A₁ and A₂, which are denoted by the following equation:

$$A_{1} = A \cap \{x_{m} \leq c\} \quad \text{and} \quad A_{2} = A \cap \{x_{m} > c\}, \tag{8}$$

where c is one of the levels in the covariate $x_{m}$.

Let $x_{p}^{T} = (x_{p1}, \dots, x_{pm})$ denote a person-specific covariate vector of length m. To detect MFC DIF for an item block, one should examine all the covariates and all possible splits in each covariate using the following linear predictor of the TIRT model,

$$
\begin{aligned}
\eta_{j} &= \frac{\theta_{a} \cdot \mathrm{node}_{\lambda_{i}} - \theta_{b} \cdot \mathrm{node}_{\lambda_{k}} - \mathrm{node}_{\gamma_{j}}}{\sqrt{\mathrm{node}_{\Psi_{i}^{2}} + \mathrm{node}_{\Psi_{k}^{2}}}}, \\
\mathrm{node}_{\lambda_{i}} &= \lambda_{il} \cdot I(x_{m} \leq c_{m}) + \lambda_{ir} \cdot I(x_{m} > c_{m}), \\
\mathrm{node}_{\lambda_{k}} &= \lambda_{kl} \cdot I(x_{m} \leq c_{m}) + \lambda_{kr} \cdot I(x_{m} > c_{m}), \\
\mathrm{node}_{\gamma_{j}} &= \gamma_{jl} \cdot I(x_{m} \leq c_{m}) + \gamma_{jr} \cdot I(x_{m} > c_{m}), \\
\mathrm{node}_{\Psi_{i}^{2}} &= \Psi_{il}^{2} \cdot I(x_{m} \leq c_{m}) + \Psi_{ir}^{2} \cdot I(x_{m} > c_{m}), \\
\mathrm{node}_{\Psi_{k}^{2}} &= \Psi_{kl}^{2} \cdot I(x_{m} \leq c_{m}) + \Psi_{kr}^{2} \cdot I(x_{m} > c_{m}).
\end{aligned}
\tag{9}
$$

In Equation 9, $I(\cdot)$ is the indicator function, with $I(d) = 1$ if $d$ is true and $I(d) = 0$ otherwise; $c_{m}$ is a possible split point for the $m$th covariate; the subscripts $l$ and $r$ are abbreviations for left and right, respectively; $\lambda_{il}$ is the loading parameter of statement $i$ in the left node ($x_{m} \leq c_{m}$), and $\lambda_{ir}$ is the loading parameter of statement $i$ in the right node ($x_{m} > c_{m}$). The focal and reference groups are constructed from the two split regions (i.e., $\{x_{m} \leq c_{m}\}$ and $\{x_{m} > c_{m}\}$) obtained from the $m$th covariate at split point $c_{m}$. Therefore, Equation 9 can be regarded as an alternative representation of Equation 6.
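To make the role of the indicator functions concrete, the sketch below evaluates the linear predictor of Equation 9 for one pairwise comparison. Each item parameter is passed as a (left, right) pair, and the split on the covariate decides which value is used. All function names and numeric values are hypothetical and serve only as an illustration.

```python
import math


def node_value(left, right, x_m, c_m):
    """Return the left-node value if x_m <= c_m, otherwise the right-node value."""
    return left if x_m <= c_m else right


def linear_predictor(theta_a, theta_b, x_m, c_m, lam_i, lam_k, gamma_j, psi2_i, psi2_k):
    """Linear predictor of Equation 9; every parameter argument is a (left, right) pair."""
    li = node_value(*lam_i, x_m, c_m)
    lk = node_value(*lam_k, x_m, c_m)
    g = node_value(*gamma_j, x_m, c_m)
    pi = node_value(*psi2_i, x_m, c_m)
    pk = node_value(*psi2_k, x_m, c_m)
    return (theta_a * li - theta_b * lk - g) / math.sqrt(pi + pk)


# Hypothetical parameters: only the loading of statement i differs between nodes.
params = dict(lam_i=(1.0, 1.3), lam_k=(0.8, 0.8), gamma_j=(0.2, 0.2),
              psi2_i=(0.5, 0.5), psi2_k=(0.5, 0.5))
eta_left = linear_predictor(1.0, 0.5, x_m=30, c_m=40, **params)   # respondent in left node
eta_right = linear_predictor(1.0, 0.5, x_m=55, c_m=40, **params)  # respondent in right node
print(eta_left, eta_right)
```

Two respondents with identical latent traits obtain different predictors solely because they fall on different sides of the split, which is what DIF means in this parameterization.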

Step 2: Search for the First Best Split Covariate and the Best Split Point Based on a Split Criterion

After locating all possible cut points in all available covariates, the algorithm searches for the optimal split of each covariate by applying a selected split criterion. According to Tutz and Berger (2016) and Strobl et al. (2009), there are two commonly accepted split criteria in tree-based modeling approaches: (a) impurity measures such as the Gini Index or Shannon Entropy, and (b) test-based splits such as the log-likelihood test statistic. Test-based splits assess whether the current child node requires further partitioning based on the corresponding test statistic. Because this study utilizes the TIRT model, we chose the test-based split criterion for the proposed method.

In general, any statistic which has been shown to perform well in identifying MFC DIF can serve as a test statistic. In this study, we chose the Wald statistic. Based on Lee et al.'s (2021) study, the Wald statistic $(χ^{2})$ is a potential tool in this regard, and it can be expressed as follows:

$$\chi^{2} = (\xi_{R} - \xi_{F})^{T} (\Sigma_{R} + \Sigma_{F})^{-1} (\xi_{R} - \xi_{F}), \tag{10}$$

where $\xi_{R}$ and $\xi_{F}$ represent the vectors of item parameter estimators of the reference group (R) and focal group (F) for a single item block, respectively; $\Sigma_{R}$ and $\Sigma_{F}$ denote the asymptotic variance and covariance matrices for $\xi_{R}$ and $\xi_{F}$, respectively. The test statistic utilized in the proposed method follows a chi-square distribution.
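The computation in Equation 10 can be sketched in plain Python as follows. The example uses hypothetical parameter estimates and diagonal covariance matrices for a triplet with six parameters; in practice these quantities would come from the calibrated TIRT model. The helper `chi2_sf_even_df` relies on the closed-form upper-tail probability of the chi-square distribution, which holds only for even degrees of freedom (such as the 6-df test used here).

```python
import math


def solve(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - s) / m[i][i]
    return z


def wald_statistic(xi_r, xi_f, sigma_r, sigma_f):
    """Equation 10: (xi_R - xi_F)^T (Sigma_R + Sigma_F)^{-1} (xi_R - xi_F)."""
    d = [r - f for r, f in zip(xi_r, xi_f)]
    pooled = [[sr + sf for sr, sf in zip(row_r, row_f)]
              for row_r, row_f in zip(sigma_r, sigma_f)]
    z = solve(pooled, d)
    return sum(di * zi for di, zi in zip(d, z))


def chi2_sf_even_df(x, df):
    """Upper-tail chi-square probability; closed form valid for even df only."""
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(df // 2))


# Hypothetical estimates: three loadings and three thresholds per group.
xi_r = [1.0, 0.9, 1.1, 0.2, -0.1, 0.3]
xi_f = [1.3, 0.9, 1.1, 0.2, -0.1, 0.3]
sigma = [[0.005 if i == j else 0.0 for j in range(6)] for i in range(6)]
wald = wald_statistic(xi_r, xi_f, sigma, sigma)
p_value = chi2_sf_even_df(wald, df=6)
print(wald, p_value)
```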

One issue that must be addressed before using the Wald statistic for MFC DIF detection is how to obtain comparable item parameter estimates for the different subgroups of respondents. In this study, we used the well-validated free baseline DIF testing approach employed by Lee et al. (2021). The procedure is as follows. First, an anchor subset consisting of one or more DIF-free item blocks is identified, which ensures a common metric between groups during TIRT parameter calibration. Second, the parameters are calibrated with a TIRT model in which the parameters of all anchor item blocks are constrained to be equal across groups, whereas the parameters of all remaining item blocks are freely estimated in each group. Finally, the equivalence of the parameters of the current block is examined using an omnibus 6-df Wald test. Lee et al. (2021) found that, as the sample size and DIF magnitude increased, power approached 1.0 and Type I error rates approached the nominal level (.05). Therefore, the Wald statistic based on free baseline testing is adopted as the selection criterion for the proposed method.

To identify the first best split covariate and its best cut point, the algorithm fits the TIRT model to differently grouped response datasets, which are generated by splitting the original response data at every possible cut point of the covariates being examined. After parameter calibration, the algorithm tests the null hypothesis H₀: $\lambda_{il} = \lambda_{ir}$, $\lambda_{kl} = \lambda_{kr}$, $\gamma_{jl} = \gamma_{jr}$ with the Wald statistic calculated from each dataset. It is worth noting that, when locating the best split covariate and its best cut point, the proposed method uses only the numerical magnitude of the Wald statistic as the basis for selection. Specifically, the combination of covariate and cut point with the maximum Wald statistic is selected as the initial split for constructing the left and right child nodes, which are denoted as $I(x_{m} \leq c_{m})$ and $I(x_{m} > c_{m})$, respectively. The significance of the Wald statistic is taken into account later, in Step 4, where it is described in detail.

In general RP applications, a covariate may be reused at different nodes of the same tree. In the proposed method, however, we impose the restriction that a covariate cannot be reused. More specifically, once the first best split covariate m is identified, it is excluded from further consideration during subsequent search iterations within the same block. This restriction was imposed for two main reasons: (1) to control Type I errors arising from multiple testing on the same covariate and (2) to reduce the complexity of the tree structure and thereby improve the interpretability of the results.
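The selection rule of Step 2 amounts to an exhaustive search for the covariate and cut point with the largest Wald statistic, followed by removal of the chosen covariate from the candidate set. A minimal sketch is given below. The callback `wald_for_split` is a hypothetical stand-in for refitting the multigroup TIRT model and returning the Wald statistic of one split; here it simply looks up made-up values.

```python
def best_split(candidates, wald_for_split):
    """Return (covariate, cutpoint, wald) with the maximum Wald statistic.

    candidates: dict mapping covariate name -> list of candidate cutpoints.
    wald_for_split: hypothetical callback standing in for the TIRT calibration
    and Wald test of a single (covariate, cutpoint) split.
    """
    best = None
    for covariate, cutpoints in candidates.items():
        for cut in cutpoints:
            wald = wald_for_split(covariate, cut)
            if best is None or wald > best[2]:
                best = (covariate, cut, wald)
    return best


# Made-up Wald statistics for illustration only.
toy_wald = {("gender", 0): 8.4, ("age", 30): 12.1, ("age", 40): 21.7, ("age", 50): 15.3}
candidates = {"gender": [0], "age": [30, 40, 50]}

first = best_split(candidates, lambda cov, cut: toy_wald[(cov, cut)])
# The selected covariate is not reused in later searches within the same block.
remaining = {cov: cuts for cov, cuts in candidates.items() if cov != first[0]}
print(first, list(remaining))
```

Only the magnitude of the statistic is used at this stage; whether the selected split is significant is decided later, in Step 4.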

Step 3: Search for Further Splits Based on Step 2

After identifying the first best split covariate and its best cut point, the method continues to search the remaining covariates for the subsequent best split covariates and their best split points. Specifically, the algorithm re-calibrates the TIRT model item parameters using only the subset of response data that belongs to the current node and then re-computes the corresponding Wald statistics. The algorithm then selects the combination of covariate and split point that maximizes the Wald statistic.

Taking the established right child node (i.e., $I(x_{m} > c_{m})$) as an example, the possible further nodes formed by the variable $s$ are denoted as $I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s})$ and $I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s})$. The adjusted linear predictor of the TIRT model can then be expressed as:

$$
\begin{aligned}
\eta_{j} &= \frac{\theta_{a} \cdot \mathrm{node}_{\lambda_{i}} - \theta_{b} \cdot \mathrm{node}_{\lambda_{k}} - \mathrm{node}_{\gamma_{j}}}{\sqrt{\mathrm{node}_{\Psi_{i}^{2}} + \mathrm{node}_{\Psi_{k}^{2}}}}, \\
\mathrm{node}_{\lambda_{i}} &= \lambda_{il} \cdot I(x_{m} \leq c_{m}) + \lambda_{il}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s}) + \lambda_{ir}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s}), \\
\mathrm{node}_{\lambda_{k}} &= \lambda_{kl} \cdot I(x_{m} \leq c_{m}) + \lambda_{kl}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s}) + \lambda_{kr}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s}), \\
\mathrm{node}_{\gamma_{j}} &= \gamma_{jl} \cdot I(x_{m} \leq c_{m}) + \gamma_{jl}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s}) + \gamma_{jr}^{[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s}), \\
\mathrm{node}_{\Psi_{i}^{2}} &= \Psi_{il}^{2} \cdot I(x_{m} \leq c_{m}) + \Psi_{il}^{2[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s}) + \Psi_{ir}^{2[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s}), \\
\mathrm{node}_{\Psi_{k}^{2}} &= \Psi_{kl}^{2} \cdot I(x_{m} \leq c_{m}) + \Psi_{kl}^{2[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} \leq c_{s}) + \Psi_{kr}^{2[n]} \cdot I(x_{m} > c_{m}) \cdot I(x_{s} > c_{s}),
\end{aligned}
\tag{11}
$$

where the superscript $[n]$ marks the parameters of the new child nodes in the grown tree; for example, $\lambda_{ir}^{[n]}$ denotes the loading parameter of statement $i$ in the new right child node, and likewise for the remaining parameters.

Step 4: Repeat Steps 1 to 3 Until the Stopping Rules are Met

Given that the data are recursively partitioned, subsequent nodes contain progressively smaller samples. Therefore, a rule must be implemented to ensure that the number of observed responses remains adequate for accurate parameter estimation. The stopping rules for the proposed algorithm are: (a) no covariates remain for further MFC DIF detection, or (b) the sample size of a node drops below a threshold (e.g., 100 respondents). When a child node satisfies either rule, the algorithm ceases to split it further. After the algorithm terminates, the proposed method uses the significance level of the final Wald statistic of each established child node, which is the maximum value among all possible splits, to determine whether the item block exhibits DIF. Because a large number of significance tests are performed simultaneously, the problem of multiple testing needs to be addressed, and the family-wise Type I error needs to be controlled. We propose four strategies to control the Type I error using the adjusted local significance level and/or DIF effect sizes; see the section titled “Controlling the Type I Error” after Step 5 for details. If none of the Wald statistics exceeds the significance threshold, the proposed method concludes that the current block is DIF-free.
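The recursion implied by Steps 1 to 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: `toy_wald` is a hypothetical stand-in for the TIRT calibration and Wald test on the current node, the data are made-up dictionaries, and the significance decisions of Step 4 are left out.

```python
def grow(rows, covariates, wald_for_split, min_node_size=100):
    """Recursively partition `rows` (a list of dicts) and return a nested dict tree.

    Stopping rules from the text: (a) no covariates remain, or
    (b) the node holds fewer than `min_node_size` respondents.
    """
    if not covariates or len(rows) < min_node_size:
        return {"n": len(rows), "split": None}
    best = None
    for name in covariates:
        levels = sorted({row[name] for row in rows})
        for cut in levels[:-1]:  # the largest level cannot define a split
            wald = wald_for_split(rows, name, cut)
            if best is None or wald > best[2]:
                best = (name, cut, wald)
    if best is None:  # every remaining covariate is constant within this node
        return {"n": len(rows), "split": None}
    name, cut, wald = best
    remaining = [v for v in covariates if v != name]  # covariates are not reused
    left = [row for row in rows if row[name] <= cut]
    right = [row for row in rows if row[name] > cut]
    return {
        "n": len(rows), "split": (name, cut), "wald": wald,
        "left": grow(left, remaining, wald_for_split, min_node_size),
        "right": grow(right, remaining, wald_for_split, min_node_size),
    }


def toy_wald(rows, name, cut):
    """Hypothetical split statistic: n times the squared difference in mean score."""
    left = [r["score"] for r in rows if r[name] <= cut]
    right = [r["score"] for r in rows if r[name] > cut]
    diff = sum(left) / len(left) - sum(right) / len(right)
    return len(rows) * diff ** 2


# Made-up data: 100 respondents per cell; the score differs only when x1 = x2 = 1.
rows = [{"x1": a, "x2": b, "score": float(a == 1 and b == 1)}
        for a in (0, 1) for b in (0, 1) for _ in range(100)]
tree = grow(rows, ["x1", "x2"], toy_wald)
shallow = grow(rows, ["x1", "x2"], toy_wald, min_node_size=250)
print(tree["split"], tree["right"]["split"], shallow["left"]["split"])
```

With the larger threshold, the two child nodes of the root (200 respondents each) are no longer split, which illustrates stopping rule (b).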

Step 5: Terminate DIF Detection and Output a Tree Diagram Presenting the MFC DIF Form of the Current Item Block

It is worth noting that tree diagrams are unnecessary for item blocks that are DIF-free on all covariates included in the MFC DIF detection. For DIF item blocks, however, a tree diagram is essential to show which covariates induce DIF and the corresponding forms of MFC DIF. To aid practitioners' understanding, Figure 2 presents four schematic tree diagrams, each representing a different MFC DIF form. Each child node of the tree represents a subgroup, and all TIRT model item parameters (i.e., λ and γ) of the current item block for that subgroup are displayed in the child node. Solid and dashed arrows indicate whether a covariate induces MFC DIF, with solid lines signifying the presence of MFC DIF and dashed lines indicating its absence. While item blocks 1 and 2 both exhibit main effect DIF, item block 1 exhibits DIF solely on age, whereas item block 2 exhibits DIF on both age and gender. In contrast, item blocks 3 and 4 both exhibit interactive DIF, but their MFC DIF forms differ. Specifically, the DIF in item block 3 is induced solely by the interaction between the two covariates, whereas item block 4 exhibits main effect DIF on age in addition to DIF induced by the interaction between the two covariates.

Figure 2.

Schematic tree diagrams of different MFC DIF forms for different item blocks of an MFC test.

Controlling the Type I Error

Compared to existing MFC DIF detection methods, a potential advantage of the proposed method is its ability not only to detect MFC DIF on multiple covariates simultaneously, but also to handle ordered, multi-category, and continuous covariates in addition to binary covariates, all within a single algorithm. However, the problem of multiple testing needs to be addressed. Whether performing inter-covariate DIF detection (i.e., simultaneous DIF detection across multiple covariates) or intra-covariate DIF detection (i.e., simultaneous DIF detection at the multiple possible levels used to dichotomize each multi-category, ordered, or continuous covariate), the proposed method involves multiple DIF tests. Consequently, small or trivial DIF effects may be detected in item blocks that do not actually exhibit DIF on a covariate, especially for multi-category, ordered, and continuous covariates, which may inflate the false alarm rate.

An important concept in the statistical literature on multiple testing is the familywise error rate. According to Benjamini and Hochberg (1995), the familywise error rate refers to the probability of at least one false rejection when a family of null hypotheses is tested simultaneously. In the proposed method, it corresponds to the probability that, for an item block, at least one of the Wald tests conducted among multiple covariates and multiple possible dichotomization levels within each covariate is falsely positive, making a Type I error, and flagging the item block as a DIF item block. Following the suggestion of Tutz and Berger (2016), if one wants to control the familywise error rate by a global significance level α in DIF detection, one has to use much smaller significance levels in the single tests, that is, one should adjust the local significance level based on the global significance level.

To effectively control the family-wise Type I error of the proposed method, we implemented the Bonferroni adjustment in this study. The Bonferroni adjustment divides the overall significance level α (here set to 0.05) by the number of hypotheses under examination, yielding the local significance level for the Wald test applied to each item block and covariate. This adjustment accounts for the increased probability of false positives due to multiple testing, thereby ensuring that the probability of incorrectly identifying a DIF-free item block as a DIF block is controlled at or below α. It is worth noting that, when adjusting the local significance level with the Bonferroni adjustment, the denominator for α can be set in two ways: (1) the number of covariates requiring DIF testing, or (2) the number of different possible levels at which each covariate requiring DIF testing is dichotomized. For a binary covariate, only the first option is applicable; for multi-category, ordered, or continuous covariates, both options are applicable.

In addition to the Bonferroni adjustment, Nye (2011) noted that incorporating DIF effect size estimation indices, a measure of practical significance, can potentially remedy the weaknesses of hypothesis testing (i.e., statistical significance) and potentially mitigate Type I errors. For the MFC test, Lee et al. (2021) proposed to calculate the MFC DIF effect size for each item block, an adaptation of Nye's (2011) method, as depicted in the equation below.

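$$d_{DIF} = \frac{1}{SD_{jP}} \sqrt{\int \cdots \int \left[P_{jR}(\theta) - P_{jF}(\theta)\right]^{2} f_{F}(\theta) \, d\theta_{1} \cdots d\theta_{D}}, \tag{12}$$

where $SD_{jP}$ represents a pooled estimate of the standard deviation of the binary pairwise comparison outcome $j$ for $P$ respondents; $P_{jR}(\theta)$ and $P_{jF}(\theta)$ denote the TIRT probabilities of a binary outcome for the reference group (R) and focal group (F); $f_{F}(\theta)$ signifies a multivariate normal distribution of the latent trait in the focal group, with the mean and variance derived from the focal group's θ distribution; and $D$ stands for the total number of measured latent traits.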

Furthermore, Nye (2011) pointed out that, within the IRT framework, a DIF effect size of 0.1 is considered small. Therefore, one can use this value as a threshold to dichotomize the DIF effect size continuum into a binary classification and then combine the significance of the p-value with this binary classification to determine whether the current item block exhibits DIF. The procedure for controlling Type I errors using the DIF effect size is as follows. First, the Wald statistic and its p-value are obtained at each child node created by recursively partitioning the current block with the proposed method. Next, the average DIF effect size estimate for each child node within the current block is calculated using Equation 12. Finally, the significance of each child node's p-value and its average DIF effect size are combined to determine whether DIF is present in the current block (i.e., the first two rows in Table 2). Specifically, an item block is flagged as a DIF item block if the p-value at any established child node is below the adjusted local significance level and the average DIF effect size estimate of that node is greater than or equal to 0.1.

Table 2.

Potential Strategies for Controlling Type I Errors in the Proposed Method.

Strategies for Controlling Type I Errors	Criteria for Flagging as DIF
Number of covariates and DIF effect size (NCD)	$I (p_{j} < \frac{α}{M}) \cdot I (E S_{j m} \geq 0.1)$
Number of covariates and their levels and DIF effect size (NCLD)	$I (p_{j m} < \frac{α}{M (L_{m} - 1)}) \cdot I (E S_{j m} \geq 0.1)$
Number of covariates (NC)	$I (p_{j} < \frac{α}{M})$
Number of covariates and their levels (NCL)	$I (p_{j m} < \frac{α}{M (L_{m} - 1)})$

Note: I(•) = indicator function, with I(d) = 1 if d is true and I(d) = 0 otherwise; p_j= p-value of the Wald statistic for item block j; α = the overall significance level; M = the total number of covariates detected simultaneously; L_m= the total number of levels within the covariate m; ES_j= the average DIF effect size estimation of item block j.

In the present study, four possible strategies for controlling Type I errors in the proposed method are outlined in Table 2, depending on the Bonferroni adjustment perspective adopted and on whether the average DIF effect size estimate is included. Through these strategies, we hope to provide effective guidance to potential users and to ensure a more rigorous and meaningful assessment of DIF, while reducing the risk of identifying spurious effects due to multiple testing.
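The four flagging criteria in Table 2 can be written as a single decision function, sketched below. The function name `flag_dif`, its argument names, and the example numbers are hypothetical; the adjusted significance levels follow the expressions in Table 2, with `n_levels` playing the role of $L_{m}$.

```python
def flag_dif(p_value, effect_size, n_covariates, n_levels, strategy,
             alpha=0.05, es_cut=0.1):
    """Apply one of the four Table 2 strategies (NC, NCL, NCD, NCLD) to one split.

    n_covariates is M; n_levels is L_m, the number of levels of covariate m
    (2 for a binary covariate, so that L_m - 1 = 1).
    """
    if strategy in ("NC", "NCD"):
        local_alpha = alpha / n_covariates
    elif strategy in ("NCL", "NCLD"):
        local_alpha = alpha / (n_covariates * (n_levels - 1))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    significant = p_value < local_alpha
    if strategy in ("NCD", "NCLD"):
        # Additionally require a non-negligible average DIF effect size.
        return significant and effect_size >= es_cut
    return significant


# Hypothetical split: two covariates, a five-level ordered covariate,
# p = .01, and an average DIF effect size of 0.08.
results = {s: flag_dif(0.01, 0.08, n_covariates=2, n_levels=5, strategy=s)
           for s in ("NC", "NCL", "NCD", "NCLD")}
print(results)
```

In this made-up case only the NC strategy flags the block, because the local level is .025 under NC but .00625 under NCL, and the effect size falls below 0.1.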

Simulation Studies

Three simulation studies were conducted to investigate the three main research questions. Study 1 aimed to compare the DIF detection performance of the proposed method with that of an existing method in the field of MFC DIF detection, under different MFC DIF forms in the presence of two binary covariates. Study 2 investigated the ability of the proposed method to detect MFC DIF in the presence of both a binary covariate and an ordered covariate under different forms of DIF in a more realistic scenario. This scenario used a true latent trait correlation matrix from a well-validated personality scale, more latent traits, and more item blocks in the MFC test, and was intended to reflect real-world data more accurately. Study 3 further investigated the performance of the proposed method under different DIF forms when there is both a binary covariate and a continuous covariate. All simulation code used in the three studies can be found at https://osf.io/b98dy/.

Study 1

Design

To enable a comprehensive comparison between the proposed method and an existing method (i.e., omnibus Wald tests on loadings and thresholds; Lee et al., 2021), we included two binary covariates, x₁ and x₂. In the present study, we adopted most of the simulation conditions used in the study of Lee et al. (2021). These conditions comprised 10 MFC item blocks, three measured latent traits, and three statements within each block (i.e., a triplet). This format has been used extensively in MFC research (e.g., Bürkner et al., 2019; Guenole et al., 2018; Lee et al., 2021; Ng et al., 2020) due to its reduced cognitive load and enhanced measurement accuracy compared to other formats such as pairs and tetrads. In addition, seven factors were manipulated as follows:

1. The percentage of blocks that exhibit DIF: (a) 0% and (b) 10%.

2. Type of noninvariant item parameters of the TIRT model: DIF on (a) loading parameter (λ), (b) threshold parameter (γ), and (c) both loading and threshold parameters.

3. DIF size: (a) 0.3 (i.e., 0.3 increase in focal group's loading and/or threshold parameter) and (b) 0.6 (i.e., 0.6 increase in focal group's loading and/or threshold parameter). In previous DIF studies (e.g., Kim et al., 2016; Lee et al., 2021; Stark et al., 2006), these effect sizes were commonly defined as small and large, respectively.

4. MFC DIF detection method: (a) omnibus Wald tests on loadings and thresholds (abbreviated as OWT; Lee et al., 2021) and (b) the proposed method, i.e., MFC-BSRPT. We selected the OWT method for comparison with our proposed method because it can detect MFC DIF at the item block level. We excluded certain other methods, such as the MFC-based logistic regression method (Chen & Wang, 2014), DSF-based methods (Qiu & Wang, 2021), and multiple group confirmatory factor analysis of the TIRT model (Lee & Smith, 2020b), as they are not designed to detect MFC DIF at the item block level.

5. Number of noninvariant item parameters under a specific noninvariant item parameter type within an item block: (a) 1, (b) 2, and (c) 3. When selecting a noninvariant item parameter type, it was important to consider not only the type itself, but also how many parameters in the block exhibited DIF for that type. For example, in a triplet with three loading and three threshold parameters, if we assumed that the DIF was related to the loading parameter in the MFC form, we needed to determine exactly how many loading parameters exhibited DIF.

6. MFC DIF form: (a) only the main effect DIF, (b) only the interactive DIF, and (c) both main effect DIF and interactive DIF.

7. The sample size per group: (a) 500, (b) 1,000, and (c) 2,000. These three levels correspond to a total sample size of 1,000, 2,000, and 4,000, respectively. Regarding the sample size of MFC response data, previous simulation studies using the TIRT model have mainly used large samples (e.g., 1,000, 2,000, or 4,000) (Brown & Maydeu-Olivares, 2011, 2012).

The sample sizes were equal across the levels of each covariate. The total number of simulation conditions was 330 (0% of MFC DIF item blocks could not be meaningfully crossed with factors 1, 2, 3, 5, and 6; therefore, the 318 extraneous conditions were removed). For each condition, we performed 50 replications. All simulation code was written in R version 4.1.1 (R Development Core Team, 2021). To implement the proposed method, we utilized two R packages, thurstonianIRT (Bürkner et al., 2019) and MplusAutomation (Hallquist & Wiley, 2018), together with the Mplus software (Version 8.0; Muthén & Muthén, 2017). The former package automatically generated the Mplus syntax for fitting MFC response data to the TIRT model and performing DIF detection, while the latter facilitated the use of Mplus for complex projects in R, such as Monte Carlo simulation studies or the comparison of many methods. It is important to note that the thurstonianIRT package cannot be directly applied to DIF detection; we therefore made appropriate adjustments to its original R code.
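The condition count can be verified with a few lines of Python. The sketch enumerates the factor levels listed above (the level labels are shorthand introduced here) and treats the 0% DIF proportion as crossed only with the detection method and the sample size.

```python
from itertools import product

dif_types = ["loading", "threshold", "both"]
dif_sizes = [0.3, 0.6]
methods = ["OWT", "MFC-BSRPT"]
n_noninvariant = [1, 2, 3]
dif_forms = ["main", "interactive", "main+interactive"]
sample_sizes = [500, 1000, 2000]

# Conditions with 10% DIF blocks: all six remaining factors are fully crossed.
dif_conditions = list(product(dif_types, dif_sizes, methods,
                              n_noninvariant, dif_forms, sample_sizes))
# Conditions with 0% DIF blocks: only method and sample size are meaningful.
null_conditions = list(product(methods, sample_sizes))

total = len(dif_conditions) + len(null_conditions)
full_cross = 2 * len(dif_conditions)  # both DIF proportions fully crossed
removed = full_cross - total
print(total, removed)  # 330 318
```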

Data Generation

To generate response data for DIF-free item blocks, we followed these steps: (1) Latent trait vectors were drawn from a multivariate normal distribution with a mean vector of 0 and a covariance matrix

$$\Sigma_{\theta} = \begin{bmatrix} 1 & -0.4 & 0 \\ -0.4 & 1 & 0.3 \\ 0 & 0.3 & 1 \end{bmatrix},$$

sourced from Lee et al. (2021); (2) measurement errors were randomly sampled from a univariate standard normal distribution (i.e., with a mean of 0 and a standard deviation of 1); (3) we ensured that the latent traits measured within the same block were all different, with only one trait measured by each statement and an equal number of statements under each trait; (4) item parameters from Lee et al.'s (2021) study were employed; and (5) MFC responses for individual respondents were derived through the following equation:

$$
y_{j} =
\begin{cases}
1 \ (\text{E is preferred to F}), & \text{if } y_{EF}^{*} = -\gamma_{j} + (\lambda_{E} \eta_{a} - \lambda_{F} \eta_{b}) + (\varepsilon_{E} - \varepsilon_{F}) \geq 0, \\
0 \ (\text{F is preferred to E}), & \text{if } y_{EF}^{*} = -\gamma_{j} + (\lambda_{E} \eta_{a} - \lambda_{F} \eta_{b}) + (\varepsilon_{E} - \varepsilon_{F}) < 0.
\end{cases}
\tag{13}
$$

The response data generated in this study were paired comparison data based on the TIRT model for the triplet format. To ensure that the responses were logically consistent within each triplet, a specific simulation approach was employed. For instance, suppose an item block contains three statements (i.e., E, F, and G) and the first two pairwise responses are E > F and E < G. The third pair must then be F < G, because the implied ranking is G > E > F. If the response to the third pair could not be inferred from the first two pairs, Equation 13 was used once more to generate the response for that pair.
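The consistency rule for triplets can be illustrated with a small helper. The coding below (1 = the first statement of the pair is preferred) follows Equation 13; the function name `infer_third_pair` is hypothetical, and returning `None` signals that the third response has to be generated from Equation 13.

```python
def infer_third_pair(y_ef, y_eg):
    """Infer the response to pair (F, G) from pairs (E, F) and (E, G).

    Each response is 1 if the first statement of the pair is preferred, else 0.
    Returns None when F versus G is not determined by the first two pairs.
    """
    if y_ef == 1 and y_eg == 0:  # E > F and G > E, so G > E > F
        return 0
    if y_ef == 0 and y_eg == 1:  # F > E and E > G, so F > E > G
        return 1
    return None  # E is ranked first or last; F versus G remains open


for y_ef, y_eg in [(1, 0), (0, 1), (1, 1), (0, 0)]:
    print(y_ef, y_eg, infer_third_pair(y_ef, y_eg))
```

Only two of the four response patterns for the first two pairs fix the third pair; in the other two, statement E is either the most or the least preferred, so the third comparison carries independent information.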

The process of generating data for DIF item blocks followed the same procedure as outlined above, with one exception in Step 4. Under the DIF condition, Step 4 involved manipulating certain item parameters (i.e., λ and/or γ) of the focal group of DIF item blocks. For this study, we created MFC DIF by manipulating the item parameters of Item Block 3, which was designated as the DIF block with a 10% DIF proportion in the MFC test. Table 3 illustrates the simulation of the different DIF forms used to manipulate the item parameters of Block 3.

Table 3.

True Simulated Differences of Item Parameters for Different DIF Forms Under the Comparison of Focal and Reference Groups.

Block code	Differences of Item Parameters
Block code	DIF Form 1	DIF Form 2	DIF Form 3
3	$z \cdot I(x_{2} = 1)$	$\begin{matrix} z \cdot I(\{x_{1} = 0\} \cap \{x_{2} = 0\}) \\ z \cdot I(\{x_{1} = 1\} \cap \{x_{2} = 1\}) \end{matrix}$	$z \cdot I(x_{1} = 1) + z \cdot I(\{x_{1} = 1\} \cap \{x_{2} = 1\})$

Note: z = value of DIF size, DIF form 1 = only the main effect DIF, DIF form 2 = only the interactive DIF, DIF form 3 = both main effect DIF and interactive DIF.

We also provide a simple example illustrating how to generate DIF data to aid understanding. Based on the assumption that the first two loading parameters of Block 3 (i.e., statements 7 and 8) exhibit main effect DIF on the covariates x₁ and x₂, and that these parameters also exhibit interactive DIF due to the interaction between these two covariates, we use the following equation to simulate responses:

$$
\begin{aligned}
y_{7,8}^{*} &= -\gamma_{7,8} + [(\lambda_{7} + \mathit{size}) \eta_{1} - (\lambda_{8} + \mathit{size}) \eta_{2}] + (\varepsilon_{7} - \varepsilon_{8}), \\
y_{7,9}^{*} &= -\gamma_{7,9} + [(\lambda_{7} + \mathit{size}) \eta_{1} - \lambda_{9} \eta_{3}] + (\varepsilon_{7} - \varepsilon_{9}), \\
y_{8,9}^{*} &= -\gamma_{8,9} + [(\lambda_{8} + \mathit{size}) \eta_{2} - \lambda_{9} \eta_{3}] + (\varepsilon_{8} - \varepsilon_{9}), \\
\mathit{size} &= 0.3 \cdot I(x_{1} = 1) + 0.3 \cdot I(\{x_{1} = 1\} \cap \{x_{2} = 1\}).
\end{aligned}
\tag{14}
$$
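The following sketch evaluates the `size` term of Equation 14 for the four covariate combinations and plugs it into the first latent response. The item parameters, trait values, and error terms are hypothetical numbers chosen for illustration; they are not the generating values used in the study.

```python
def dif_size(x1, x2, z=0.3):
    """Size term of Equation 14: z * I(x1 = 1) + z * I(x1 = 1 and x2 = 1)."""
    return z * (x1 == 1) + z * (x1 == 1 and x2 == 1)


def latent_7_8(eta1, eta2, eps7, eps8, lam7, lam8, gamma78, x1, x2):
    """First line of Equation 14: latent response for the pair (7, 8)."""
    s = dif_size(x1, x2)
    return -gamma78 + ((lam7 + s) * eta1 - (lam8 + s) * eta2) + (eps7 - eps8)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, dif_size(x1, x2))

# Hypothetical respondent and item values, with the error terms set to zero.
common = dict(eta1=1.0, eta2=0.5, eps7=0.0, eps8=0.0, lam7=1.0, lam8=0.8, gamma78=0.2)
y_ref = latent_7_8(x1=0, x2=0, **common)  # no shift in the loadings
y_foc = latent_7_8(x1=1, x2=1, **common)  # both shifts apply (size = 0.6)
print(y_ref, y_foc)
```

Under this specification the loadings shift by 0.3 when only x₁ = 1, by 0.6 when both covariates equal 1, and not at all when x₁ = 0.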

Parameter Estimation and DIF Detection

In this study, we utilized the mean-and-variance-adjusted unweighted least squares (ULSMV) estimator in Mplus, following the recommendation of Brown and Maydeu-Olivares (2011). This was done to facilitate the fitting of multigroup TIRT models with constrained item parameter estimates, specifically applied to data partitioned by varying split points. To ensure comparability of item parameter estimates across subsamples, a pre-defined anchor subset (Blocks 1 and 2) was utilized prior to conducting DIF detection. In the process of generating the Mplus syntax, we set all studied MFC item blocks to perform unconstrained cross-group item parameter estimation, with the exception of the anchor subset, where the item parameters were constrained to be equal. As indicated by Lee et al. (2021), an MFC triplet's Wald statistic possesses 6 degrees of freedom (df). To ascertain the significance of each node and determine the optimal split covariate, we used the MODEL TEST command in Mplus and conducted a simultaneous 6-df Wald test for each block.

Evaluation Criteria

True Positive Rates and False Positive Rates. To assess the effectiveness of the proposed method, we computed the true positive rates (TPRs) and false positive rates (FPRs) for each condition, which are also referred to as power and Type I error rate, respectively. To provide a thorough comparison of the performance of the two methods, we simultaneously assessed two different dimensions of the evaluation criteria: block level and the combination of block and covariate levels.

Let each block be characterized by a vector $\xi_{j}^{T} = (\xi_{j1}, \dots, \xi_{jm})$, where $m$ is the number of covariates, with $\xi_{jm} = 1$ if block $j$ exhibits differential functioning on covariate $m$ and $\xi_{jm} = 0$ otherwise. If at least one component of $\xi_{j}^{T}$ is 1, block $j$ is a DIF block. Block $j$ is a DIF-free block only if all elements of the vector are zero, that is, $\xi_{j}^{T} = (0, \dots, 0)$. Using the indicator function $I(\cdot)$ and writing $\hat{\xi}_{j}$ for the corresponding detection result, the evaluation criteria are calculated as follows:

1. TPR on the block level (TPR_B)

$$TPR_{B} = \frac{1}{\#\{j : \xi_{j} \neq 0\}} \sum_{j : \xi_{j} \neq 0} I(\hat{\xi}_{j} \neq 0) \tag{15}$$

2. FPR on the block level (FPR_B)

$$FPR_{B} = \frac{1}{\#\{j : \xi_{j} = 0\}} \sum_{j : \xi_{j} = 0} I(\hat{\xi}_{j} \neq 0) \tag{16}$$

3. TPR for the combination of block and variable (TPR_BV)

$$TPR_{BV} = \frac{1}{\#\{j, m : \xi_{jm} \neq 0\}} \sum_{j, m : \xi_{jm} \neq 0} I(\hat{\xi}_{jm} \neq 0) \tag{17}$$

4. FPR for the combination of block and variable (FPR_BV)

$$FPR_{BV} = \frac{1}{\#\{j, m : \xi_{jm} = 0\}} \sum_{j, m : \xi_{jm} = 0} I(\hat{\xi}_{jm} \neq 0). \tag{18}$$
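The four criteria can be computed from two 0/1 matrices, one holding the true DIF indicators and one holding the detection results. The sketch below assumes that at least one DIF block and one DIF-free block exist (otherwise a denominator would be zero); the function name `rates` and the toy matrices are hypothetical.

```python
def rates(true, est):
    """Compute TPR_B, FPR_B, TPR_BV, and FPR_BV.

    true, est: lists with one row per block; each row holds one 0/1 entry
    per covariate (true DIF indicators and detection results, respectively).
    """
    dif_blocks = [j for j, row in enumerate(true) if any(row)]
    free_blocks = [j for j, row in enumerate(true) if not any(row)]
    tpr_b = sum(any(est[j]) for j in dif_blocks) / len(dif_blocks)
    fpr_b = sum(any(est[j]) for j in free_blocks) / len(free_blocks)

    cells = [(j, m) for j, row in enumerate(true) for m in range(len(row))]
    dif_cells = [(j, m) for j, m in cells if true[j][m]]
    free_cells = [(j, m) for j, m in cells if not true[j][m]]
    tpr_bv = sum(est[j][m] != 0 for j, m in dif_cells) / len(dif_cells)
    fpr_bv = sum(est[j][m] != 0 for j, m in free_cells) / len(free_cells)
    return tpr_b, fpr_b, tpr_bv, fpr_bv


# Toy example: four blocks, two covariates; only block 3 has DIF, on covariate 1.
true = [[0, 0], [0, 0], [1, 0], [0, 0]]
est = [[0, 0], [0, 1], [1, 1], [0, 0]]
tpr_b, fpr_b, tpr_bv, fpr_bv = rates(true, est)
print(tpr_b, fpr_b, tpr_bv, fpr_bv)
```

In the toy example the DIF block is detected on the correct covariate, one of the three DIF-free blocks is falsely flagged, and two of the seven DIF-free block-covariate combinations are falsely flagged.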

DIF Effect Size Estimation. For the MFC test, Lee et al. (2021) proposed the calculation of the MFC DIF effect size for each item block, an adaptation of Nye's (2011) method, as depicted in the equation below.

$$d_{DIF} = \frac{1}{SD_{jP}} \sqrt{\int \cdots \int \left[P_{jR}(\theta) - P_{jF}(\theta)\right]^{2} f_{F}(\theta) \, d\theta_{1} \cdots d\theta_{D}}, \tag{19}$$

where $SD_{jP}$ represents a pooled estimate of the standard deviation of the binary pairwise comparison outcome $j$ for $P$ respondents; $P_{jR}(\theta)$ and $P_{jF}(\theta)$ denote the TIRT probabilities of a binary outcome for the reference group (R) and focal group (F); $f_{F}(\theta)$ signifies a multivariate normal distribution of the latent trait in the focal group, with the mean and variance derived from the focal group's θ distribution; and $D$ stands for the total number of measured latent traits. According to Nye (2011), DIF effect sizes, as calculated with Equation 19, of 0.1, 0.2, and 0.3 correspond to small, medium, and large DIF values, respectively. Effect sizes exceeding 0.2 are generally deemed more significant, whereas those below 0.2 are considered negligible. When calculating DIF effect size estimates, the proposed and existing methods have both similarities and differences. They are similar in that both use Equation 19 to calculate the estimates. For each item block, average effect size estimates were computed across 50 replications for all pairwise comparisons, using Equation 19. They differ in the dimensions accounted for when obtaining DIF effect size estimates. The existing OWT method averages DIF effect size estimates by each DIF-inducing covariate, while the proposed method averages by each established split.

Results of Study 1 for Power and Type I Error Rate

When all the item blocks in the MFC test were DIF-free, the false positive rates (i.e., Type I error rates) of the two methods are presented in Table 4. Under a 10% DIF proportion in the MFC test, due to space constraints, we only present the results under the condition of 2,000 per group in Table 5 of the main text. Please refer to Tables S1 to S3 and Tables S13 to S15 in the Supporting Information for the results from conditions with sample sizes of 500 or 1,000 per group. Notably, in Study 1, we found that when both covariates were binary, both methods yielded acceptable Type I error rates even without considering DIF effect size estimation. Therefore, in Tables 4 and 5 of the main text, we only show the results when the strategy for controlling Type I errors was the number of covariates (NC). For simulation results of the proposed method when the strategy for controlling Type I errors was the number of covariates and DIF effect size (NCD), see Tables S13 to S15 in the Supporting Information. Overall, as shown in Table 5, the performance of the two methods is comparable when the item block exhibits only the main effect DIF, while the proposed method shows better detection performance than the conventional omnibus Wald tests when the item block exhibits the interactive DIF. Subsequent sections analyze the impact of each manipulated factor on the experimental results.

Table 4.

Type I Error Rates Under Different Sample Sizes per Group When all Item Blocks in the MFC Test are DIF-Free in Study 1.

Sample Size Per Group	Methods	FPR_B	FPR_BV
500	OWT	.03	.01
500	MFC-BSRPT	.02	.01
1,000	OWT	.03	.01
1,000	MFC-BSRPT	.03	.01
2,000	OWT	.03	.01
2,000	MFC-BSRPT	.02	.01

Note: OWT = omnibus Wald tests, MFC-BSRPT = the proposed method.

Table 5.

Simulation Results of 10% DIF Proportion Under Different MFC DIF Detection Methods and Different MFC DIF Forms When the DIF Size Is Large (Small) and the Sample Size per Group is 2,000 in Study 1.

MFC DIF Form	Results	Indicator	Methods	λ			γ			both
MFC DIF Form	Results	Indicator	Methods	1	2	3	1	2	3	1	2	3
Only the main effect DIF	Power	TPR_B	OWT	1 (.82)	1 (.94)	1 (1)	1 (.76)	1 (.88)	1 (1)	1 (1)	1 (1)	1 (1)
		TPR_B	MFC-BSRPT	1 (.82)	1 (.94)	1 (1)	1 (.76)	1 (.88)	1 (1)	1 (1)	1 (1)	1 (1)
		TPR_BV	OWT	1 (.78)	1 (.92)	1 (1)	1 (.76)	1 (.88)	1 (1)	1 (1)	1 (1)	1 (1)
		TPR_BV	MFC-BSRPT	1 (.78)	1 (.92)	1 (1)	1 (.76)	1 (.88)	1 (1)	1 (1)	1 (1)	1 (1)
	Type I error	FPR_B	OWT	.03 (.02)	.03 (.03)	.03 (.03)	.02 (.02)	.02 (.02)	.03 (.02)	.02 (.02)	.03 (.02)	.02 (.02)
		FPR_B	MFC-BSRPT	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)
		FPR_BV	OWT	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
		FPR_BV	MFC-BSRPT	.01 (.01)	.02 (.02)	.02 (.01)	.02 (.02)	.01 (.01)	.02 (.01)	.02 (.02)	.02 (.01)	.01 (.01)
Only the interactive DIF	Power	TPR_B	OWT	.04 (.06)	.02 (.10)	.04 (.06)	.10 (.06)	.10 (.04)	.06 (.04)	.06 (.06)	.06 (.08)	.06 (.06)
		TPR_B	MFC-BSRPT	1 (.36)	1 (.70)	1 (.96)	1 (.50)	1 (.62)	1 (1)	1 (.88)	1 (1)	1 (1)
		TPR_BV	OWT	.02 (.04)	.02 (.06)	.04 (.02)	.04 (.04)	.04 (.02)	.04 (.02)	.02 (.04)	.04 (.04)	.04 (.04)
		TPR_BV	MFC-BSRPT	1 (.34)	1 (.66)	1 (.96)	1 (.50)	1 (.62)	1 (1)	1 (.88)	1 (1)	1 (1)
	Type I error	FPR_B	OWT	.03 (.02)	.02 (.02)	.02 (.02)	.02 (.03)	.02 (.02)	.03 (.02)	.03 (.03)	.02 (.02)	.03 (.02)
		FPR_B	MFC-BSRPT	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.04 (.03)	.03 (.03)	.03 (.03)
		FPR_BV	OWT	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
		FPR_BV	MFC-BSRPT	.01 (.01)	.01 (.01)	.01 (.01)	.02 (.02)	.02 (.01)	.02 (.01)	.02 (.02)	.02 (.01)	.02 (.01)
Both the main effect DIF and interactive DIF	Power	TPR_B	OWT	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)
		TPR_B	MFC-BSRPT	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)	1 (1)
		TPR_BV	OWT	.36 (.12)	.38 (.16)	.92 (.42)	.72 (.18)	.92 (.22)	.98 (.50)	.64 (.24)	.86 (.44)	1 (.92)
		TPR_BV	MFC-BSRPT	.64 (.26)	.68 (.32)	1 (.80)	.98 (.38)	1 (.56)	1 (.90)	1 (.52)	.98 (.78)	1 (1)
	Type I error	FPR_B	OWT	.02 (.02)	.02 (.03)	.03 (.03)	.03 (.03)	.03 (.02)	.03 (.02)	.03 (.02)	.02 (.02)	.03 (.03)
		FPR_B	MFC-BSRPT	.03 (.02)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.03 (.03)	.04 (.03)	.03 (.03)	.03 (.03)
		FPR_BV	OWT	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
		FPR_BV	MFC-BSRPT	.01 (.01)	.01 (.02)	.01 (.02)	.02 (.02)	.02 (.01)	.02 (.01)	.02 (.02)	.01 (.01)	.01 (.01)

Note: OWT = omnibus Wald tests, MFC-BSRPT = the proposed method. Values outside (inside) parentheses are for the large (small) DIF size. The numbers 1, 2, and 3 in the second header row indicate the number of noninvariant item parameters of the given type within an item block.

MFC DIF Detection Method and MFC DIF Form. As shown in Table 5, the proposed method outperformed the conventional omnibus Wald tests (OWT) in detecting interactive DIF. For example, with only interactive DIF present, a sample size of 2,000 per group, a small DIF size (0.3), and a DIF block proportion of 10%, block-level TPRs ranged from .04∼.10 for OWT versus .36∼1 for the proposed method. Both methods also kept FPRs below the nominal level (.05). With only main effect DIF present, the two methods had comparable power and low Type I error rates (FPR < .05). For example, with a sample size of 2,000 per group, a small DIF size (0.3), and 10% DIF item blocks, block-level TPRs ranged from .76∼1 for both OWT and the proposed method. This comparability arises because both methods use the same test statistic for main effect DIF. With interactive DIF present, however, OWT is limited because it examines the covariates independently, whereas the results in Table 5 suggest that the proposed method has greater potential to identify interactive DIF. For example, with only interactive DIF present, a sample size of 2,000 per group, a large DIF size (0.6), a 10% DIF block proportion, and noninvariance in both the loading and threshold parameters, block-level TPRs were .06 for OWT versus 1 for the proposed method.

Number of Noninvariant Item Parameters Under a Specific Noninvariant Item Parameter Type Within an Item Block, DIF Size and the Sample Size per Group. The results in Table 5 show that, in most experimental conditions, the detection performance of both the omnibus Wald tests and the proposed method improved as the number of noninvariant item parameters within an item block increased, given a specific noninvariant item parameter type. For example, when the sample size per group was 2,000, the DIF form was only the main effect DIF, the DIF size was small (0.3), and only the loading parameter was noninvariant, TPRs at the block level ranged from .82 to 1 (for one and three noninvariant loading parameters within an item block, respectively) for the omnibus Wald tests, and from .82 to 1 for the proposed method. This can be attributed to the accumulation of the DIF effect when there were more noninvariant item parameters within an item block, making it easier for both methods to identify DIF item blocks. Furthermore, as the DIF size increased, both methods exhibited improved detection performance. Lastly, the detection performance of both methods was further enhanced with an increase in the sample size per group. For example, as shown in Table S1 of Supporting Information, under the condition of only main effect DIF, a 10% DIF proportion, a small DIF size (0.3), and only one noninvariant loading parameter within an item block, TPRs at the block level ranged from .10 (for a sample size per group of 500) to .82 (for a sample size per group of 2,000) for both methods.

Type of Noninvariant Item Parameters of the TIRT Model. When MFC DIF was present in either the loading parameter (λ) or the threshold parameter (γ) alone, neither detection method showed a notable difference in its ability to detect DIF. However, when MFC DIF was present in both the loading and threshold parameters, both methods performed better than under the other two factor levels. This was primarily due to the superposition of DIF effects on both parameters within an item block, which increased the likelihood that the block was correctly flagged as a DIF item block.

To improve the clarity and understanding of the results, we also generated Receiver Operating Characteristic (ROC) curves across different experimental conditions. In this study, the area under the curve (AUC) assesses the effectiveness of different MFC DIF detection methods in identifying DIF item blocks. A larger AUC indicates better detection performance. With a sample size per group of 2,000, a DIF proportion of 10%, a large DIF size (0.6), and two item parameters showing DIF within an item block under a specific noninvariant TIRT item parameter type, Figure 3 presents ROC curves under different conditions. These curves included different conditions such as MFC DIF form, detection method, and noninvariant item parameter type, both at the block level and the block-variable combination level. ROC curves for all other conditions are shown in Figures S1 to S17 in the Supporting Information. Overall, Figure 3 shows the better performance of the proposed method in detecting interactive DIF compared to the existing OWT method.
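For readers unfamiliar with the AUC, the sketch below shows one standard way to compute it: as the probability that a randomly chosen DIF item block receives a higher detection score than a randomly chosen DIF-free block, with ties counted as one half. This is an illustrative calculation, not the plotting code used for Figure 3; the detection scores (for instance, one minus the p-value of the test statistic) and the function name `auc` are assumptions made for the example.

```python
def auc(scores_dif, scores_clean):
    """Rank-based AUC: P(score of a DIF block > score of a DIF-free block).

    Ties contribute 0.5, which matches the area under the empirical ROC
    curve obtained by sweeping a threshold over the scores.
    """
    wins = 0.0
    for s_dif in scores_dif:
        for s_clean in scores_clean:
            if s_dif > s_clean:
                wins += 1.0
            elif s_dif == s_clean:
                wins += 0.5
    return wins / (len(scores_dif) * len(scores_clean))


# Hypothetical scores for two DIF blocks and three DIF-free blocks.
example_auc = auc([0.9, 0.8], [0.1, 0.8, 0.3])
print(round(example_auc, 3))
```

An AUC of 0.5 corresponds to chance-level separation of DIF and DIF-free blocks, and an AUC of 1 to perfect separation.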

Figure 3.

ROC curves across various conditions at a sample size per group of 2,000, a 10% DIF proportion, with a large DIF size (0.6) and two item parameters exhibiting DIF within an item block under a specific noninvariant TIRT item parameter type in Study 1.

Results of Study 1 for DIF Effect Size Estimates

In addition to power and Type I error rate, we calculated the DIF effect size estimates for all experimental conditions. Due to space constraints, these are reported only in Tables S4 to S12 of the online Supporting Information. As shown there, the DIF effect size estimates showed promise for identifying, alongside the significance test, the item blocks that exhibited DIF. Overall, the DIF effect size estimates based on the proposed method revealed the existence of DIF more accurately than those based on the existing OWT method. It is worth noting that with a smaller sample size (i.e., 500 respondents per group), the DIF effect size estimates performed poorly under both methods. However, as the sample size per group increased, the proposed method performed better on this indicator than the existing OWT method.

Study 2

Design

While Study 1 yielded encouraging results, both covariates used in that study were binary, which allowed a direct comparison between the omnibus Wald tests and the proposed method in detecting MFC DIF. In practical applications, however, covariates may have more than two levels. Unlike the omnibus Wald tests, the proposed method can handle covariates with multiple levels that may induce DIF. To further test its effectiveness, Study 2 considered both a binary covariate x₃ and an ordered covariate $x_{4} \in \{1, 2, 3, 4\}$. Specifically, we examined the performance of the proposed method on an MFC test of 20 triplets measuring five latent traits. The simulation manipulated the following six factors.

The percentage of item blocks that exhibit DIF: (a) 10% (i.e., Item Blocks 3 and 4), and (b) 30% (i.e., Item Blocks 3, 4, 5, 6, 7, and 8).

Type of noninvariant item parameters of the TIRT model: DIF on (a) loading parameter (λ), (b) threshold parameter (γ), and (c) loading and threshold parameters.

DIF size: (a) small DIF (i.e., 0.3 increase) and (b) large DIF (i.e., 0.6 increase).

Number of noninvariant item parameters under a certain noninvariant item parameter type within an item block: (a) 1, (b) 2, and (c) 3.

MFC DIF form: (a) only the main effect DIF, (b) both main effect DIF and interactive DIF, and (c) only the interactive DIF.

The total sample size: (a) 1,000 and (b) 4,000. In Study 2, we considered both binary and ordered covariates, and the difference in the number of levels of the two covariates made it impossible to describe the sample size per group with a single number. Therefore, we used the total sample size in place of the sample size per group described in Study 1. For example, 500 respondents per group in Study 1 corresponds to a total sample size of 1,000 here. In addition, to simplify the simulation, the condition corresponding to a total sample size of 2,000 in Study 1 was removed from the present study.

A total of 216 conditions were employed in the simulation. To enhance the authenticity of the results, we kept the sample sizes for different levels of the same DIF-inducing covariate relatively uniform, although not exactly equal. Each of the conditions was replicated 50 times, and all simulation codes were written in R version 4.1.1 (R Development Core Team, 2021).
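The count of 216 follows from fully crossing the six manipulated factors (2 × 3 × 2 × 3 × 3 × 2). The short sketch below enumerates the design grid to make the arithmetic explicit; the dictionary keys and level labels are illustrative names, not identifiers from the simulation code.

```python
from itertools import product

# Factor levels as described in the design; labels are illustrative.
factors = {
    "dif_proportion": [0.10, 0.30],
    "parameter_type": ["loading", "threshold", "both"],
    "dif_size": [0.3, 0.6],
    "n_noninvariant": [1, 2, 3],
    "dif_form": ["main", "main+interactive", "interactive"],
    "total_n": [1000, 4000],
}

# Fully crossed design: one tuple of levels per simulation condition.
conditions = list(product(*factors.values()))
n_conditions = len(conditions)
print(n_conditions)  # 216
```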

Data Generation

Although the data generation procedures for this study were largely the same as those used in Study 1, there were some notable differences. First, for our latent trait vector generation, we retrieved the true correlation matrix of traits from the NEO-PIR (McCrae & Costa, 1992) and represented it as:

$$\Sigma_{\theta} = \begin{bmatrix} 1 & -0.21 & -0.53 & -0.25 & 0 \\ -0.21 & 1 & 0.27 & 0 & 0.4 \\ -0.53 & 0.27 & 1 & 0.24 & 0 \\ -0.25 & 0 & 0.24 & 1 & 0 \\ 0 & 0.4 & 0 & 0 & 1 \end{bmatrix}.$$

In addition, the loading and threshold parameters were randomly sampled from U(0.8, 1.3) and U(−1, 1), respectively, following Lee et al. (2021). The generated parameters are shown in Table S8 of the Supporting Information. Finally, Table 6 details how the item parameters of DIF blocks in the focal group were manipulated under the different DIF forms.
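The generation step described above can be sketched as follows. This is a simplified illustration in Python, not the R code used in the study. It assumes a correlation matrix with unit diagonal (as a correlation matrix requires), multivariate normal latent traits obtained through a Cholesky factor, one loading per statement, and one threshold per pairwise comparison within a triplet (hence 60 of each for 20 triplets). The seed and the number of simulated respondents are arbitrary.

```python
import math
import random

# Trait correlation matrix for the five latent traits (unit diagonal).
SIGMA = [
    [1.00, -0.21, -0.53, -0.25, 0.00],
    [-0.21, 1.00, 0.27, 0.00, 0.40],
    [-0.53, 0.27, 1.00, 0.24, 0.00],
    [-0.25, 0.00, 0.24, 1.00, 0.00],
    [0.00, 0.40, 0.00, 0.00, 1.00],
]


def cholesky(a):
    """Lower-triangular factor C with C C^T = a (a must be positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def draw_traits(low, rng):
    """One latent trait vector: C z with z a vector of standard normals."""
    z = [rng.gauss(0.0, 1.0) for _ in low]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]


rng = random.Random(2021)            # arbitrary seed for reproducibility
chol = cholesky(SIGMA)
theta = [draw_traits(chol, rng) for _ in range(1000)]

# 20 triplets: 60 statements and 60 pairwise comparisons (assumed layout).
loadings = [rng.uniform(0.8, 1.3) for _ in range(60)]
thresholds = [rng.uniform(-1.0, 1.0) for _ in range(60)]
```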

Table 6.

True Simulated Differences of Item Parameters for Different DIF Forms Under Different Groups in Study 2.

DIF block code	Differences of Item Parameters
DIF block code	DIF Form 1	DIF Form 2	DIF Form 3
$\{3, 4\}$ or $\{3, 4, 5, 6, 7, 8\}$	$z \cdot I(x_{4} \geq 3)$	$\begin{matrix} z \cdot I(\{x_{3} = 0\} \cap \{x_{4} < 3\}) \\ z \cdot I(\{x_{3} = 1\} \cap \{x_{4} \geq 3\}) \end{matrix}$	$z \cdot I(x_{3} = 1) + z \cdot I(\{x_{3} = 1\} \cap \{x_{4} \geq 3\})$

Note: z = value of DIF size, DIF form 1 = only the main effect DIF, DIF form 2 = only the interactive DIF, DIF form 3 = both main effect DIF and interactive DIF.
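The indicator expressions in Table 6 can be read as a function that returns the shift added to a DIF block's item parameter for a respondent with covariate values x₃ and x₄. The sketch below is one reading of the table, not the authors' code: for DIF Form 2 it assumes that the two stacked expressions are summed, which is harmless because the two events are mutually exclusive. The function name `dif_shift` is hypothetical.

```python
def dif_shift(form, x3, x4, z):
    """Parameter shift for a DIF block under the three DIF forms of Table 6.

    form: 1 = only main effect DIF, 2 = only interactive DIF,
          3 = both main effect and interactive DIF
    x3:   binary covariate (0 or 1); x4: ordered covariate in {1, 2, 3, 4}
    z:    DIF size (0.3 or 0.6 in the simulation)
    """
    if form == 1:
        return z * int(x4 >= 3)
    if form == 2:
        # Assumed reading: the two stacked indicators are mutually exclusive.
        return z * int(x3 == 0 and x4 < 3) + z * int(x3 == 1 and x4 >= 3)
    if form == 3:
        return z * int(x3 == 1) + z * int(x3 == 1 and x4 >= 3)
    raise ValueError("form must be 1, 2, or 3")


# Shifts for every covariate combination under Form 3 with a small DIF size.
for x3 in (0, 1):
    for x4 in (1, 2, 3, 4):
        print(x3, x4, dif_shift(3, x3, x4, 0.3))
```

Under Form 3, respondents with x₃ = 1 and x₄ ≥ 3 receive both the main effect and the interaction shift, so their parameter moves by 2z.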

Evaluation Criteria

Study 2 used the same evaluation criteria as Study 1; see the corresponding section of Study 1 for details.

Results of Study 2

Because the results for the 10% and 30% DIF proportions were similar, and owing to space limitations, only the results for the 30% DIF proportion are presented in Table 7. In addition, Table 7 reports only the results under the Number of Covariates and their Levels & DIF Effect Size (NCLD) strategy for Type I error control, as it outperformed the other strategies in terms of power and Type I error control. Full results can be found in Tables S17 to S24 in the online Supporting Information. Overall, as shown in Table 7, the proposed method showed favorable potential for detecting DIF with the ordered covariate. It also showed promising results in handling different DIF forms and in simultaneously identifying DIF item blocks and DIF-inducing covariates.

Table 7.

Simulation Results Under Different Conditions When the DIF Size is Large (Small) and the Percentage of Blocks That Exhibit DIF Is 30% in Study 2.

MFC DIF Form	Total Sample Size	Results	Indicators	λ			γ			both
MFC DIF Form	Total Sample Size	Results	Indicators	1	2	3	1	2	3	1	2	3
Only the main effect DIF	1,000	Power	TPR_B	.41 (.03)	.76 (.13)	.90 (.24)	.50 (.04)	.68 (.07)	.91 (.16)	.81 (.14)	.97 (.27)	.97 (.47)
		Power	TPR_BV	.41 (.03)	.76 (.13)	.90 (.23)	.50 (.04)	.68 (.07)	.91 (.16)	.81 (.14)	.97 (.27)	.97 (.47)
		Type I error	FPR_B	.02 (.02)	.02 (.02)	.02 (.02)	.02 (.02)	.01 (.02)	.02 (.02)	.02 (.02)	.02 (.02)	.02 (.02)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
	4,000	Power	TPR_B	.99 (.16)	1 (.79)	1 (.96)	.56 (.03)	.99 (.30)	1 (.66)	1 (.32)	1 (.99)	1 (1)
		Power	TPR_BV	.99 (.16)	1 (.79)	1 (.96)	.54 (.03)	.99 (.29)	1 (.66)	1 (.32)	1 (.99)	1 (1)
		Type I error	FPR_B	.02 (.02)	.02 (.02)	.02 (.02)	.03 (.03)	.03 (.03)	.02 (.03)	.02 (.02)	.02 (.02)	.02 (.02)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
Only the interactive DIF	1,000	Power	TPR_B	.04 (.02)	.10 (.04)	.22 (.03)	.10 (0)	.08 (.01)	.24 (.02)	.20 (.01)	.22 (.06)	.34 (.07)
		Power	TPR_BV	.04 (.02)	.10 (.03)	.22 (.03)	.10 (0)	.08 (.01)	.24 (.02)	.20 (.01)	.22 (.04)	.34 (.07)
		Type I error	FPR_B	.02 (.02)	.02 (.03)	.02 (.02)	.02 (.02)	.02 (.03)	.02 (.02)	.02 (.02)	.02 (.02)	.02 (.02)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
	4,000	Power	TPR_B	.63 (.17)	.73 (.37)	.86 (.52)	.58 (.16)	.73 (.32)	.88 (.36)	.83 (.33)	.92 (.53)	.98 (.67)
		Power	TPR_BV	.63 (.14)	.73 (.36)	.86 (.52)	.57 (.14)	.72 (.30)	.88 (.36)	.83 (.31)	.92 (.52)	.98 (.67)
		Type I error	FPR_B	.03 (.02)	.03 (.03)	.03 (.03)	.03 (.02)	.03 (.03)	.03 (.03)	.03 (.02)	.03 (.03)	.03 (.03)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.02 (.01)	.02 (.01)
Both the main effect DIF and interactive DIF	1,000	Power	TPR_B	.89 (.24)	.97 (.54)	.99 (.77)	.98 (.33)	1 (.40)	1 (.74)	.99 (.66)	1 (.86)	1 (.98)
		Power	TPR_BV	.12 (0)	.17 (.02)	.03 (.01)	.04 (0)	.08 (0)	.22 (.01)	.27 (.02)	.24 (.06)	.11 (.03)
		Type I error	FPR_B	.02 (.02)	.02 (.01)	.01 (.02)	.02 (.02)	.02 (.03)	.02 (.02)	.02 (.02)	.02 (.02)	.02 (.02)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)
	4,000	Power	TPR_B	1 (.70)	1 (1)	1 (1)	1 (.30)	1 (.93)	1 (1)	1 (.96)	1 (1)	1 (1)
		Power	TPR_BV	.64 (.13)	.88 (.46)	.99 (.66)	.71 (.03)	.97 (.41)	1 (.63)	.87 (.43)	.98 (.81)	.98 (.96)
		Type I error	FPR_B	.02 (.03)	.02 (.02)	.03 (.02)	.03 (.03)	.02 (.03)	.02 (.03)	.02 (.02)	.03 (.03)	.03 (.03)
		Type I error	FPR_BV	.01 (.01)	.01 (.01)	.02 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.01)	.01 (.02)	.01 (.02)

Other important findings were as follows. First, the power trends across factor levels in Study 2 mirrored those in Study 1. Second, for the same total sample size, the proposed method performed somewhat worse with the ordered covariate in Study 2 than with the binary covariates in Study 1. For example, with a total sample size of 1,000, a small DIF size (0.3), and only main effect DIF, block-level TPRs ranged from .10∼.88 in Study 1 versus .03∼.47 in Study 2. This occurred because, holding the total sample size constant, a binary covariate has more cases per level than an ordered covariate. As noted by Lee et al. (2021), sample size has a substantial impact on the performance of MFC DIF detection. Therefore, for multi-category and ordered covariates, the proposed method requires more than 500 respondents in each subgroup if more robust detection performance is desired. Third, the NCLD strategy performed best among all strategies for controlling Type I errors. The next best strategy considered the number of covariates and their levels (NCL), followed by the number of covariates and the DIF effect size (NCD); the strategy that considered only the number of covariates (NC) performed worst. Two conclusions follow from these results: (1) multiple comparisons among the levels of a single ordered covariate can inflate Type I error rates; and (2) even when the p-value of the Wald statistic for an item block incorrectly flagged as having DIF falls below the adjusted α, the average DIF effect size for that block is below 0.1 in the vast majority of cases. Therefore, considering the number of covariates and their levels together with the DIF effect size estimates helped reduce Type I errors while maintaining acceptable power.
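The logic of combining an adjusted significance level with an effect size check can be sketched as a simple decision rule. The code below is a hypothetical illustration only: it assumes a Bonferroni-style adjustment in which α is divided by a caller-supplied number of tests, and it takes the minimum effect size as a parameter rather than fixing a value. The exact adjustment used by each strategy (NC, NCL, NCD, NCLD) is the one defined in the method section and is not reproduced here.

```python
def flag_block(p_value, effect_size, n_tests, min_effect, alpha=0.05):
    """Flag an item block only if it is both significant and non-negligible.

    n_tests:    hypothetical number of comparisons used to adjust alpha
                (e.g., driven by the number of covariates and their levels)
    min_effect: smallest average DIF effect size treated as non-negligible
    """
    adjusted_alpha = alpha / n_tests          # assumed Bonferroni-style adjustment
    significant = p_value < adjusted_alpha
    non_negligible = effect_size >= min_effect
    return significant and non_negligible


# A significant block with a tiny effect size is not flagged ...
print(flag_block(p_value=0.001, effect_size=0.06, n_tests=4, min_effect=0.1))
# ... whereas a significant block with a larger effect size is.
print(flag_block(p_value=0.001, effect_size=0.24, n_tests=4, min_effect=0.1))
```

The first call illustrates the pattern described in conclusion (2): a false positive that passes the significance test is screened out by its small average effect size.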

Study 3

In Study 3, we performed a follow-up simulation to further validate the DIF detection performance of the proposed method in the presence of a continuous covariate. Specifically, we included a binary covariate x₅ and a continuous covariate x₆. The MFC test was constructed with 10 triplets measuring three latent traits, so that each trait was measured by 10 statements. Following the procedure for generating continuous covariates in existing studies (Berger & Tutz, 2016; Strobl et al., 2013), we sampled the values of x₆ from a standard normal distribution. The DIF forms were simulated as in Study 2, except for the cutoff location at which the continuous covariate exhibited DIF, which we set to 0.1 following Strobl et al. (2013). It is worth noting that, whereas the levels of a multi-category covariate are countable, a continuous covariate admits infinitely many possible cutoffs. Therefore, to find suitable cutoff values for dichotomizing the continuous covariate, we again followed Strobl et al. (2013), who rounded each candidate cutoff value to two decimal places. We also used the item parameters and the covariance matrix from Study 1, and the generation of latent trait vectors and response data remained consistent with Study 1. To ensure that the sample sizes at the extreme levels of x₆ were not too small, we considered only the total sample size of 4,000 used in both Study 1 and Study 2. For reasons of computational efficiency when handling a continuous covariate with the proposed method, we investigated only the six boundary conditions listed in Table 8. Each condition was replicated 50 times.
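The handling of the continuous covariate can be illustrated with a small sketch: draw x₆ from a standard normal distribution, round the observed values to two decimal places to obtain a finite grid of candidate cutoffs, and dichotomize at a given cutoff. This is an assumption-laden illustration rather than the study's implementation: the direction of the split (values strictly above the cutoff form the second group), the rule that a candidate must leave both groups non-empty, and the function names are all choices made for the example.

```python
import random


def candidate_cutoffs(values):
    """Observed values rounded to two decimals, kept only if both sides are non-empty."""
    grid = sorted({round(v, 2) for v in values})
    return [
        c for c in grid
        if any(v <= c for v in values) and any(v > c for v in values)
    ]


def dichotomize(values, cutoff):
    """Group membership: 1 if the value lies strictly above the cutoff, else 0."""
    return [int(v > cutoff) for v in values]


rng = random.Random(3)                       # arbitrary seed
x6 = [rng.gauss(0.0, 1.0) for _ in range(4000)]

cutoffs = candidate_cutoffs(x6)
groups_at_true_cutoff = dichotomize(x6, 0.1)  # 0.1 is the simulated DIF location
print(len(cutoffs), sum(groups_at_true_cutoff))
```

Rounding to two decimals turns an infinite set of possible cutoffs into at most a few hundred candidates for a standard normal sample of this size, which keeps the search tractable.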

Table 8.

Simulation Results of the Proposed Method Employing Different Strategies for Controlling Type I Errors in Study 3.

Strategies	Condition	DIF Form	DIF Size	Type of DIF Item Parameters	Number of DIF Item Parameters	The Average DIF Effect Size Estimates			TPR_B	TPR_BV	FPR_B	FPR_BV
Strategies	Condition	DIF Form	DIF Size	Type of DIF Item Parameters	Number of DIF Item Parameters	1st	2nd-L	2nd-R	TPR_B	TPR_BV	FPR_B	FPR_BV
NC	1	Form 1	0.6	λ	1	.14	.06	.06	1	1	.32	.32
	2	Form 1	0.6	both	3	.32	.06	.06	1	1	.34	.35
	3	Form 2	0.6	λ	1	.09	.13	.10	.40	.33	.34	.35
	4	Form 2	0.6	both	3	.09	.24	.22	.93	.87	.35	.36
	5	Form 3	0.6	λ	1	.18	.09	.14	1	.83	.32	.33
	6	Form 3	0.6	both	3	.42	.09	.26	1	1	.34	.35
NCL	1	Form 1	0.6	λ	1	.14	.06	.06	1	1	.16	.16
	2	Form 1	0.6	both	3	.32	.06	.06	1	1	.17	.17
	3	Form 2	0.6	λ	1	.09	.13	.10	.40	.33	.16	.16
	4	Form 2	0.6	both	3	.09	.24	.22	.93	.86	.17	.17
	5	Form 3	0.6	λ	1	.18	.09	.14	1	.83	.15	.15
	6	Form 3	0.6	both	3	.42	.09	.26	1	1	.17	.17
NCD	1	Form 1	0.6	λ	1	.14	.06	.06	.97	.97	.01	.01
	2	Form 1	0.6	both	3	.32	.06	.06	1	1	.01	.01
	3	Form 2	0.6	λ	1	.09	.13	.10	.33	.33	.01	.01
	4	Form 2	0.6	both	3	.09	.24	.22	.86	.86	.02	.02
	5	Form 3	0.6	λ	1	.18	.09	.14	1	.07	.01	.01
	6	Form 3	0.6	both	3	.42	.09	.26	1	1	.01	.01
NCLD	1	Form 1	0.6	λ	1	.14	.06	.06	.97	.97	.01	.01
	2	Form 1	0.6	both	3	.32	.06	.06	1	1	.01	.01
	3	Form 2	0.6	λ	1	.09	.13	.10	.33	.33	.0095	.0095
	4	Form 2	0.6	both	3	.09	.24	.22	.86	.86	.0095	.0095
	5	Form 3	0.6	λ	1	.18	.09	.14	1	.07	.0095	.0095
	6	Form 3	0.6	both	3	.42	.09	.26	1	1	.0095	.0095

Note: Strategies = the proposed different strategies for controlling Type I errors, Form 1 = only the main effect DIF, Form 2 = only the interactive DIF, Form 3 = both the main effect DIF and interactive DIF, NC = Number of covariates, NCL = Number of covariates and their levels, NCD = Number of covariates & DIF effect size, NCLD = Number of covariates and their levels & DIF effect size. 1st = the first split, 2nd-L = the left node of the second split, 2nd-R = the right node of the second split.

Table 8 shows the detection performance indices of the proposed method under each condition for the different strategies for controlling Type I errors. The results for both TPR_B and TPR_BV were acceptable for most conditions. However, FPR_B and FPR_BV differed greatly across strategies. Specifically, the strategies that considered the DIF effect size (NCD and NCLD) controlled Type I errors much better (FPRs of .02 or lower) than the strategies that did not (FPRs from .15 to .36). More importantly, including the DIF effect size estimate did not noticeably reduce power in most conditions; the exception was TPR_BV in Condition 5, which dropped from .83 to .07.

Empirical Application

Instrument and Data Collection

The present study used a Chinese version of the Multidimensional Forced-choice Five Facets of Mindfulness Questionnaire (MFC-FFMQ) to assess mindfulness. The Five Facets of Mindfulness Questionnaire (FFMQ; Baer et al., 2006) was originally a 39-item Likert-type assessment that measures five latent traits: Act, Describe, Nonjudging, Nonreactivity, and Observe. Deng et al. (2011) investigated the psychometric properties of the Chinese version of the FFMQ (Ch-FFMQ) and found that it has acceptable psychometric properties and is a valid instrument for assessing mindfulness. The MFC-FFMQ is an MFC version of the Ch-FFMQ that includes 13 triplet item blocks and uses the MOLE scoring format (i.e., participants choose the statements that are most and least descriptive of them); each statement within a triplet measures a different dimension of mindfulness.

Data for the MFC-FFMQ were obtained by recruiting 1,408 participants through a combination of online media and field recruitment. All participants were university students who voluntarily agreed to participate in the study, and their anonymity was ensured. After screening the data for quality, 1,276 valid responses (61.91% female) were retained for subsequent data analysis. Respondents ranged in age from 16 to 35 years, with a mean age of 20.15 years (SD = 1.93). Of the participants, 65.44% were from rural areas and 34.56% were from urban areas.

Data Analysis

To begin, we calculated the empirical reliability index to evaluate the reliability of MFC-FFMQ. Then, we conducted a TIRT five-factor confirmatory factor analysis (CFA) on the MFC-FFMQ's five-factor structure using the R package thurstonianIRT (Bürkner et al., 2019) to evaluate the construct validity. Next, we used the proposed method and the omnibus Wald test to investigate MFC DIF in the MFC-FFMQ, with the goal of comparing the results yielded by each method. Specifically, we examined MFC DIF with respect to two binary covariates: gender (male: 1, female: 2) and household registration (rural population: 1, urban population: 2). To ensure good control of Type I errors, we applied the effective sequential free baseline approach recommended by Lee et al. (2021) to identify an anchor subset consisting of DIF-free item blocks. After establishing anchor subsets, we performed DIF detection using the two previously mentioned methods. All statistical analyses were conducted using R version 4.1.1 (R Development Core Team, 2021), with an overall significance level of 0.05 set for DIF test statistics.

Results

Measurement Reliability and Validity

Empirical Reliability. The reliability analysis showed that, for the five latent traits (i.e., Act, Describe, Nonjudging, Nonreactivity, and Observe), the empirical reliability estimates were .70, .65, .62, .70, and .56, respectively. As noted by Brown and Maydeu-Olivares (2013), the reliability of MFC tests may be relatively low compared to Likert-type scales. From their explanations, it can be inferred that the 5-point scoring format of the FFMQ contains four bits of information, whereas a triplet of the MFC-FFMQ contains only two bits of information per item block, which may be the cause of the reduced reliability. Overall, the empirical reliability of the MFC-FFMQ was acceptable.

Construct Validity. The TIRT CFA model converged successfully and demonstrated a good fit. Specifically, the following model fit indices were obtained: χ² = 1226.24, df = 666; comparative fit index (CFI) = .91; Tucker-Lewis index (TLI) = .90; root mean square error of approximation (RMSEA) = .026 (90% confidence interval [CI] = [.023, .028]). Taken together, these results provide solid support for the five-factor structure of the MFC-FFMQ.
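As a quick consistency check, the reported RMSEA can be approximately recovered from the reported χ², degrees of freedom, and the analysis sample of 1,276 respondents using the conventional point-estimate formula, RMSEA = sqrt(max(χ² − df, 0) / (df · (N − 1))). The sketch below assumes this textbook formula; estimators for categorical data often use scaled test statistics, so exact agreement with software output is not guaranteed.

```python
import math


def rmsea(chi_sq, df, n):
    """Conventional RMSEA point estimate from a chi-square statistic."""
    return math.sqrt(max(chi_sq - df, 0.0) / (df * (n - 1)))


# Values reported for the TIRT CFA of the MFC-FFMQ (N = 1,276 valid responses).
rmsea_value = rmsea(1226.24, 666, 1276)
print(round(rmsea_value, 3))  # 0.026
```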

MFC DIF Analysis

The sequential free baseline approach indicated that MFC item blocks 1, 8, and 11 best satisfied Lee et al.'s (2021) anchor screening criteria, as they were consistently DIF-free across all covariates. Therefore, we selected these item blocks as the anchor set for the subsequent MFC DIF detection on the remaining 10 item blocks. Table 9 presents an overview of the final MFC-FFMQ DIF detection results obtained using the two methods, which overlapped strongly. We also report the average DIF effect size estimates within each item block for both the proposed method and the existing OWT method. As shown in Table 9, both methods identified the same four MFC item blocks (i.e., 5, 7, 12, and 13) as exhibiting MFC DIF. Figure 4 shows the estimated trees for these item blocks. The trees indicate that only gender induced MFC DIF for item blocks 12 and 13, whereas both gender and household registration induced MFC DIF for item blocks 5 and 7, with an interaction between the two covariates. That is, item blocks 5 and 7 showed both main effect DIF and interactive DIF.

Figure 4.

The tree diagrams for the MFC item blocks 5, 7, 12 and 13 of MFC-FFMQ.

Table 9.

Comparison of MFC DIF Detection Results for the MFC-FFMQ Using the Proposed Method and the Omnibus Wald Test.

Item Block	Dimension	MFC-BSRPT				OWT
		The Detection Results	DIF Effect Size Estimates			The Detection Results	DIF Effect Size Estimates
		The Detection Results	1st	2nd-L	2nd-R	The Detection Results	x ₁	x ₂
1	D-A-O	Anchor	0	0	0	Anchor	0	0
2	D-A-O	×	.09	.18	.12	×	.09	.08
3	D-A-J	×	.12	.13	.11	×	.12	.08
4	D-J-R	×	.18	.18	.52	×	.18	.09
5	D-O-R	√	.24	.15	.17	√	.24	.18
6	D-J-R	×	.14	.13	.43	×	.14	.10
7	D-A-R	√	.26	.06	.18	√	.26	.16
8	A-O-J	Anchor	0	0	0	Anchor	0	0
9	D-J-R	√	.11	.13	.60	×	.11	.11
10	A-O-R	×	.13	.16	.13	×	.13	.13
11	O-J-R	Anchor	0	0	0	Anchor	0	0
12	A-O-J	√	.17	.20	.15	√	.17	.13
13	A-O-R	√	.17	.17	.18	√	.17	.11

Note: MFC item block 1, 8, and 11 formed an anchor subset; “Dimension” refers to the personality traits measured by three statements within each triplet of MFC-FFMQ; OWT = omnibus Wald tests, MFC-BSRPT = the proposed method;√ = flagged as a DIF item block; × = flagged as a DIF-free item block; Describe = D; Act = A; Observe = O; Nonjudging = J; Nonreactivity = R.

However, the two detection methods differed on MFC item block 9: the proposed method detected MFC DIF for this item block, but the OWT method did not. Figure 5 shows the recursive partitioning tree for item block 9, which indicates MFC DIF caused by an interaction between the two covariates. Solid arrows connect the two leaf nodes in the lower right corner to the internal node directly above them, whereas dashed arrows connect all other parent nodes to their child nodes. Thus, MFC item block 9 was classified as exhibiting only interactive DIF. As expected, the omnibus Wald test did not detect MFC DIF for item block 9, given its limitation in identifying interactive DIF in MFC tests. Specifically, there was no significant difference between urban and rural males on item block 9, with an average DIF effect size estimate of 0.13 within this block. However, there was a significant difference between urban and rural females, with an average DIF effect size estimate of 0.60, which exceeds the 0.2 threshold and indicates non-negligible DIF.

According to Costa et al. (2001), social desirability bias can lead to the endorsement of gender-relevant traits by both men and women, and they also suggested that “certain traits, such as fearfulness, may be viewed as less undesirable for women than for men” (p.131). These claims appear to be supported by the results of our study. For example, for MFC item block 12, the loading parameter of the third statement (i.e., Usually when I have distressing thoughts or images, I just notice them and let them go.) was much higher for the female group than for the male group (.745 vs. .320). We found this statement to be highly socially desirable, with the potential to activate gender stereotypes for women and elicit socially desirable responses. Lee et al. (2021) have pointed out that post hoc interpretation of DIF can be problematic, and our explanation is admittedly speculative. Nevertheless, the results indicate that the proposed method can identify more MFC DIF forms.


Figure 5.

The tree diagram for the MFC item block 9 of MFC-FFMQ. Note: The solid line indicates the presence of MFC DIF in the current node and the dashed line indicates the absence of MFC DIF in the current node. Estimated TIRT-based item parameters (i.e., λ and γ) are given in each leaf of the tree.

Discussion

Summary of the Present Study

Detecting DIF for MFC tests is important for maintaining the fairness and validity of the tests. However, existing methods to detect DIF for MFC tests all have limitations (see the introduction section for details). The manifestation of DIF in MFC tests is more complex than in traditional single-stimulus tests, making DIF detection in MFC tests more challenging.

In this study, we introduce an MFC DIF detection method called MFC-BSRPT, which uses the recursive partitioning technique to address the limitations of current MFC DIF detection approaches. To investigate the performance of the proposed method, we conducted three simulation studies with two aims: (1) a comprehensive comparison of the detection performance of the existing omnibus Wald test (OWT) and the proposed method when the covariates used for MFC DIF detection are all binary. The experimental conditions included varying MFC DIF forms, sample sizes per group, percentages of MFC DIF item blocks, types of noninvariant item parameters of the TIRT model, DIF sizes, and numbers of noninvariant item parameters of a given type within an item block; and (2) a comprehensive evaluation of the detection performance of the proposed method for ordered and continuous covariates. The experimental conditions included different percentages of item blocks that exhibit DIF, types of noninvariant item parameters of the TIRT model, DIF sizes, numbers of noninvariant item parameters of a given type within an item block, MFC DIF forms, and total sample sizes. In addition to the simulation studies, we conducted an empirical study to demonstrate the feasibility of the proposed method in a practical setting and provided a comprehensive comparison of our framework with the existing OWT approach.

The main findings of this study are as follows. First, the simulation studies demonstrated that the proposed method effectively controlled Type I error and exhibited higher detection power than the existing OWT method under most experimental conditions. Second, regarding MFC DIF forms, the proposed method performed comparably to the existing OWT method in identifying main effect DIF and showed higher power in detecting interactive DIF. Third, regarding the types of DIF-inducing covariates that can be handled, the proposed method showed robust power in identifying MFC DIF for binary covariates, as well as good power and adequate Type I error control for multi-category, ordered, and continuous covariates, which the existing OWT method does not handle adequately. Fourth, in the empirical study, the proposed method produced the same DIF detection results as the existing OWT method for 12 of the 13 item blocks of the MFC-FFMQ.

Advantages of Using the MFC-BSRPT Approach for Developing and Validating MFC Tests

The increased MFC DIF detection power. The MFC-BSRPT approach demonstrates higher sensitivity in identifying both the item blocks and covariates responsible for DIF than the existing OWT method. Specifically, compared to the existing OWT method, MFC-BSRPT not only identifies items exhibiting DIF more accurately, but also flags covariates that cause DIF more effectively. This advancement is particularly valuable as it addresses a limitation seen in the OWT method, where the identification of covariates associated with DIF may not be robust enough.

Simultaneous detection of both simple and complex MFC DIF forms. The MFC-BSRPT method is versatile in its ability to identify different MFC DIF forms. This includes both the simple form characterized by main effect DIF and a more complex form involving interactive DIF. Our approach allows more potential sources of measurement bias to be explored in depth, providing a more nuanced understanding of how item blocks function across different subgroups.

Enabling MFC DIF detection for multi-category, ordered, or continuous covariates. It is worth noting that this advantage applies to scenarios in which grouping is not explicitly predefined based on legally protected classes. In real-world settings, if the protected classes are explicitly defined for a multi-category, ordered, or continuous covariate, DIF detection is performed on the groups defined by the given cutoff values of that covariate. For covariates without predefined cutoffs, a distinct advantage of the MFC-BSRPT approach is that it can detect DIF across all possible cutoffs of these covariates. It also allows for a more flexible and adaptive examination of covariates, accommodating various demographic and contextual factors that may affect item performance.
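As a hedged illustration of what examining all possible cutoffs of one ordered or continuous covariate can look like, the following Python sketch lists every cutoff that separates the observed values into two groups. It is not the authors' R implementation; the function name `candidate_cutoffs` and the `min_group_size` argument are hypothetical, and the step that scores each candidate split with a test statistic is deliberately left out.

```python
from typing import List, Sequence, Tuple


def candidate_cutoffs(
    values: Sequence[float], min_group_size: int = 1
) -> List[Tuple[float, int, int]]:
    """Return (cutoff, n_left, n_right) for every admissible binary split.

    The left group holds values <= cutoff and the right group holds values
    > cutoff. `min_group_size` is a hypothetical safeguard against very
    small groups; it is an assumption of this sketch, not a documented
    setting of the proposed method.
    """
    ordered = sorted(values)
    n = len(ordered)
    splits = []
    for i in range(1, n):
        lower, upper = ordered[i - 1], ordered[i]
        if lower == upper:
            continue  # tied values cannot be separated by a cutoff
        n_left, n_right = i, n - i
        if n_left >= min_group_size and n_right >= min_group_size:
            # Use the midpoint between adjacent distinct values as the cutoff.
            splits.append(((lower + upper) / 2, n_left, n_right))
    return splits
```

For the covariate values 18, 20, 20, 25, and 30 with `min_group_size=2`, only the cutoff 22.5 remains, leaving three respondents at or below it and two above it. In a full analysis, each remaining candidate would then be compared using a test statistic, which this sketch does not attempt.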

Visualization of MFC DIF detection results. The MFC-BSRPT approach presents the DIF detection results graphically, offering a clear and intuitive means of interpreting the DIF findings. This visualization facilitates communication and understanding of the detected biases among test developers and item review experts. It also aids in the transparent reporting of results, contributing to the overall rigor and transparency of the validation process.

Richer information to facilitate expert review and revision for MFC DIF item blocks. Compared to the traditional OWT method, the MFC-BSRPT approach provides more information about potential sources of DIF, including the item blocks and the covariates responsible for DIF, as well as the significance of those DIF findings. This detailed insight enables expert reviewers to conduct a thorough investigation of the flagged DIF item blocks, allowing for a more informed and targeted revision process.

Recommendations for Applying the MFC-BSRPT Approach in Organizational Applications and Research

Provision of R code and detailed instructions. To facilitate the use of the proposed method, we provide an online resource (available at https://osf.io/b98dy/). The resource includes the following main components: (1) a detailed step-by-step tutorial on how to perform the MFC DIF analysis using our proposed method; (2) annotated syntax for performing MFC DIF analysis in R using our method; and (3) demo response data and demo code used to demonstrate the analysis process.

Recommendation on sample sizes. Consistent with the findings of Lee et al. (2021), a larger sample size is expected to yield higher DIF detection power with our method. Our simulation results show that detection performance depends directly on the sample size per group: performance is acceptable under the vast majority of conditions when the sample size exceeds 1,000 per group, whereas the method performs less well with fewer than 500 respondents per group. We therefore recommend a sample size of 1,000 or more per group to achieve adequate detection accuracy. Note that this recommendation is per group and does not automatically translate into a total sample size of 2,000. The two groups of a binary covariate may be unequal in size, and for multi-category, ordered, or continuous covariates, the MFC-BSRPT algorithm can split the sample into two groups of different sizes. In either case, a total sample size of 2,000 leaves one of the groups with fewer than 1,000 respondents. Hence, we recommend that (1) when all covariates are binary, the total sample size be chosen based on the distribution of the categories to ensure at least 1,000 respondents per group; and (2) when some covariates are multi-category, ordered, or continuous, a total sample size larger than 2,000 be used, because the splits cannot be predicted in advance.
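The per-group recommendation can be turned into a rough planning calculation. The Python sketch below is only an illustration under stated assumptions: the function name `required_total_n` is hypothetical, the group proportions are assumed to be known in advance, and the result refers to expected group sizes, so realized samples can still fall short.

```python
import math
from typing import Sequence


def required_total_n(proportions: Sequence[float], per_group: int = 1000) -> int:
    """Smallest total N whose smallest expected group reaches `per_group`.

    `proportions` are the assumed population shares of the groups formed by
    a covariate (hypothetical planning inputs, not values from the study).
    """
    if not proportions or any(p <= 0 or p > 1 for p in proportions):
        raise ValueError("each proportion must lie in (0, 1]")
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    # The smallest group is the binding constraint on the total sample size.
    return math.ceil(per_group / min(proportions))
```

With an even split, `required_total_n([0.5, 0.5])` gives 2,000, whereas an assumed 30/70 split requires 3,334 respondents for the smaller group to reach an expected 1,000. For ordered or continuous covariates the split proportions are not known beforehand, so a calculation of this kind can only bracket plausible scenarios.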

Recommendation on selecting MFC DIF detection methods. When selecting an MFC DIF detection method, one should keep in mind that the proposed approach does not universally outperform the omnibus Wald test. Therefore, we recommend choosing the detection method based on the specific situation. First, since the omnibus Wald test can effectively detect main effect DIF, it is recommended as the preferred method when only main effect DIF is examined. Second, the sample size is a key factor affecting the effectiveness of all MFC DIF detection methods. Regardless of the method used, a sample that is too small can lead to misinterpretation of the detection results. When sample sizes are inadequate, over-reliance on DIF results should be avoided. Increasing the sample size remains the best way to increase detection power. Third, the detection method proposed in this study is more complicated than the traditional OWT method because it considers the more complex interactive DIF. Therefore, researchers can choose between the OWT method and the proposed method according to their needs, weighing detection performance against computational efficiency. For the investigation of interactive DIF, the proposed approach is preferred. Finally, given its strengths in handling multi-category, ordered, or continuous covariates, the proposed method is recommended when analyzing such covariates. Because current methods for detecting MFC DIF require arbitrary dichotomization of the grouping covariates and omit interactive DIF, our proposed method is expected to be a more robust alternative overall.

Dealing with the flagged MFC DIF item blocks. According to Lee et al. (2021) and Nye (2011), an item block that exhibits non-negligible MFC DIF would have both of the following characteristics: (1) the average DIF effect size of the block is greater than or equal to 0.2; and (2) the p-value for the Wald statistic of this item block is lower than the adjusted α level. Therefore, for item blocks with the above characteristics, we recommend that potential users substantially modify such item blocks. If the modification is too difficult, removing the block from the test is recommended. As for item blocks that have a significant p-value but an average DIF effect size of less than 0.2, they can be included in the test or used with slight modifications. It should be emphasized, however, that we do not advocate for relying solely on the results of the statistical test to draw conclusions about whether these item blocks exhibit MFC DIF. A more prudent approach is to first use the existing detection methods to initially screen the item blocks most likely to exhibit DIF, then to invite experts to conduct a more deliberate and thorough review of these flagged item blocks, and finally to determine whether the flagged item block does exhibit DIF.
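The two-part screening rule above can be summarized in a short Python sketch. It is an illustration only: the function name `classify_block` and the returned labels are hypothetical, the adjusted α level is supplied by the analyst, and treating non-significant blocks as "not flagged" is an assumption of this sketch, since the text addresses only blocks with a significant Wald test.

```python
def classify_block(
    effect_size: float,
    p_value: float,
    adjusted_alpha: float,
    threshold: float = 0.2,
) -> str:
    """Apply the two-part screening rule to one item block.

    effect_size:    average DIF effect size of the block
    p_value:        p-value of the block's Wald statistic
    adjusted_alpha: significance level after multiplicity adjustment
    threshold:      effect size at or above which DIF is non-negligible
    """
    significant = p_value < adjusted_alpha
    if significant and effect_size >= threshold:
        return "non-negligible DIF: revise substantially or remove"
    if significant:
        return "negligible DIF: retain or revise slightly"
    # Assumption of this sketch: blocks without a significant test are not flagged.
    return "not flagged"
```

For example, with a hypothetical p-value of 0.001 and an adjusted α of 0.01, an average effect size of 0.6 falls in the first category and an effect size of 0.13 in the second. The returned label is only an initial screen; the expert review described above still determines whether a flagged item block exhibits DIF.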

Limitations and Directions for Future Study

Although this study found encouraging results demonstrating the effectiveness of the proposed method, there is room for improvement. First, although the proposed method showed potential in correctly identifying DIF item blocks and DIF-inducing covariates, it does not yet specify the exact level of MFC DIF manifestation, whether it is at the pairwise comparison level, the statement level, or both. Therefore, future research could explore more fine-grained ways to detect and distinguish at which level MFC DIF manifests based on the proposed method.

Second, our proposed method relies on the Wald statistic to split response data, which may not perform as well with smaller sample sizes, as suggested by our simulation results and other recent research (Lee et al., 2021). One possible direction for future research is to develop new test statistics for MFC DIF detection or to adapt existing single-stimulus-based DIF test statistics to improve the detection performance in small sample conditions.

Third, given the difficulty of obtaining representative samples in high-stakes talent selection settings, we used only a sample of university students in our empirical study. University students tend to be younger, more educated, less experienced, and come from a narrower range of backgrounds than real candidates for high-stakes personnel selection. Therefore, findings based only on student samples may not generalize well to real personnel selection contexts. Further research is needed to collect more diverse, representative samples from organizational settings.

Fourth, like the existing studies on IRT-based DIF detection using the recursive partitioning technique, our proposed method currently only performs binary splits. In future studies, researchers can consider the problem of multivariate splits.

Finally, our proposed method currently operates only under the dominance model, and its applicability to ideal point models is unknown. Therefore, future research can explore the proposed method's performance under ideal point models, such as the widely used multi-unidimensional pairwise-preference model (MUPP-GGUM; Stark et al., 2005), which has been applied to the MFC computerized adaptive testing algorithm and several personality tests for selection in the U.S. military (Stark et al., 2012).

Supplemental Material

Supplemental material, sj-pdf-1-orm-10.1177_10944281241244760 and sj-docx-2-orm-10.1177_10944281241244760, for A Framework for Detecting Both Main Effect and Interactive DIF in Multidimensional Forced-Choice Assessments by Kai Liu, Yi Zheng, Daxun Wang, Yan Cai, Yuanyuan Shi, Chongqin Xi and Dongbo Tu in Organizational Research Methods

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant numbers 32160203, 62167004, 32300942).

Author Biographies

Kai Liu is a PhD candidate in Psychometrics at the Jiangxi Normal University in Nanchang, Jiangxi, China. His primary research interests include item response theory, cognitive diagnosis theory, differential item functioning, educational and psychological measurement, and computerized adaptive testing.

Yi Zheng is an associate professor at the Arizona State University in Tempe, AZ, USA. Her primary research interests include computerized adaptive testing, multistage testing, item response theory, educational and psychological test development and validation, automated test assembly, and machine-learning methods.

Daxun Wang is an instructor at the School of Psychology at the Jiangxi Normal University in Nanchang, Jiangxi, China. His primary research interests include item response theory, educational and psychological test development and validation, computerized adaptive testing, and Q-matrix validation in cognitive diagnosis.

Yan Cai is a full professor at the School of Psychology at the Jiangxi Normal University in Nanchang, Jiangxi, China. Her current research interests include item response theory, cognitive diagnosis theory, big data analysis, and test construction.

Yuanyuan Shi is a postgraduate student at the Jiangxi Normal University in Nanchang, Jiangxi, China. Her primary research interests include cognitive diagnosis theory and differential item functioning.

Chongqin Xi is an instructor at the Faculty of Education at the Guangxi Normal University in Guilin, Guangxi, China. His research mainly focuses on applied measurement in education, item response theory, computerized adaptive testing, and cognitive diagnosis theory.

Dongbo Tu is a full professor at the School of Psychology at the Jiangxi Normal University in Nanchang, Jiangxi, China. His research mainly focuses on applied measurement in education, differential item functioning, item response theory, computerized adaptive testing, cognitive diagnosis modeling, and machine learning.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). The standards for educational and psychological testing. AERA Publications.

Andrich (1989). A probabilistic IRT model for unfolding preference data. Applied Psychological Measurement, 13(2), 193–216. https://doi.org/10.1177/014662168901300211

Anguiano-Carrasco, MacCann, Geiger, Seybert J. M., Roberts R. D. (2015). Development of a forced-choice measure of typical-performance emotional intelligence. Journal of Psychoeducational Assessment, 33(1), 83–97. https://doi.org/10.1177/0734282914550387

Baer R. A., Smith G. T., Hopkins, Krietemeyer, Toney (2006). Using self-report assessment methods to explore facets of mindfulness. Assessment, 13(1), 27–45. https://doi.org/10.1177/1073191105283504

Baron (1996). Strengths and limitations of ipsative measurement. Journal of Occupational and Organizational Psychology, 69(1), 49–56. https://doi.org/10.1111/j.2044-8325.1996.tb00599.x

Bartram (2007). Increasing validity with forced-choice criterion measurement formats. International Journal of Selection and Assessment, 15(3), 263–272. https://doi.org/10.1111/j.1468-2389.2007.00386.x

Bauer D. J. (2017). A more general model for testing measurement invariance and differential item functioning. Psychological Methods, 22(3), 507–526. https://doi.org/10.1037/met0000077

Belzak W. C. (2023). The multidimensionality of measurement bias in high-stakes testing: Using machine learning to evaluate complex sources of differential item functioning. Educational Measurement: Issues and Practice, 42(1), 24–33. https://doi.org/10.1111/emip.12486

Benjamini, Hochberg (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B (Methodological), 57(1), 289–300. https://doi.org/10.1111/J.2517-6161.1995.TB02031.X

Berger, Tutz (2016). Detection of uniform and nonuniform differential item functioning by item-focused trees. Journal of Educational and Behavioral Statistics, 41(6), 559–592. https://doi.org/10.3102/1076998616659371

Bollmann, Berger, Tutz (2018). Item-focused trees for the detection of differential item functioning in partial credit models. Educational and Psychological Measurement, 78(5), 781–804. https://doi.org/10.1177/0013164417722179

Bourion-Bédès, Schwan, Laprevote, Bédès, Bonnet J.-L., Baumann (2015). Differential item functioning (DIF) of SF-12 and Q-LES-Q-SF items among French substance users. Health and Quality of Life Outcomes, 13, 172. https://doi.org/10.1186/s12955-015-0365-7

Bradley R. A., Terry M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345. https://doi.org/10.2307/2334029

Breiman, Friedman J. H., Olshen R. A., Stone J. C. (1984). Classification and regression trees. Wadsworth.

Brown (2014). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(1), 135–160. https://doi.org/10.1007/s11336-014-9434-9

Brown, Bartram (2009). Development and psychometric properties of OPQ32r (Supplement to the OPQ32 technical manual). SHL Group.

Brown, Maydeu-Olivares (2010). Issues that should not be overlooked in the dominance versus ideal point controversy. Industrial and Organizational Psychology, 3(4), 489–493. https://doi.org/10.1111/j.1754-9434.2010.01277.x

Brown, Maydeu-Olivares (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. https://doi.org/10.1177/0013164410375112

Brown, Maydeu-Olivares (2012). Fitting a Thurstonian IRT model to forced-choice data using Mplus. Behavior Research Methods, 44(4), 1135–1147. https://doi.org/10.3758/s13428-012-0217-x

Brown, Maydeu-Olivares (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1), 36–52. https://doi.org/10.1037/a0030641

Bunji, Okada (2020). Joint modeling of the two-alternative multidimensional forced-choice personality measurement and its response time by a Thurstonian D-diffusion item response model. Behavior Research Methods, 52(3), 1091–1107. https://doi.org/10.3758/s13428-019-01302-5

Bunji, Okada (2022). Linear ballistic accumulator item response theory model for multidimensional multiple-alternative forced-choice measurement of personality. Multivariate Behavioral Research, 57(4), 658–678. https://doi.org/10.1080/00273171.2021.1896351

Bürkner P.-C., Schulte, Holling (2019). On the statistical and practical limitations of Thurstonian IRT models. Educational and Psychological Measurement, 79(5), 827–854. https://doi.org/10.1177/0013164419832063

Calderón Carvajal, Ximénez Gómez, Lay-Lisboa, Briceño (2021). Reviewing the structure of Kolb’s learning style inventory from factor analysis and Thurstonian item response theory (IRT) model approaches. Journal of Psychoeducational Assessment, 39(5), 593–609. https://doi.org/10.1177/07342829211003739

Cattell R. B. (1944). Psychological measurement: Normative, ipsative, interactive. Psychological Review, 51(5), 292–303. https://doi.org/10.1037/h0057299

Chen C.-W., Wang W.-C. (2014, April). Detecting differential statement functioning in ipsative tests using the logistic regression method. Paper presented at the annual meeting of National Council on Measurement in Education, Philadelphia, PA.

Cho, Drasgow, Cao (2015). An investigation of emotional intelligence measures using item response theory. Psychological Assessment, 27(4), 1241–1252. https://doi.org/10.1037/pas0000132

Christiansen N. D., Burns G. N., Montgomery G. E. (2005). Reconsidering forced-choice item formats for applicant personality assessment. Human Performance, 18(3), 267–307. https://doi.org/10.1207/s15327043hup1803_4

Collins P. H. (1990). Black feminist thought: Knowledge, consciousness, and the politics of empowerment. Unwin Hyman.

Costa P. T., Terracciano, McCrae R. R. (2001). Gender differences in personality traits across cultures: Robust and surprising findings. Journal of Personality and Social Psychology, 81(2), 322–331. https://doi.org/10.1037/0022-3514.81.2.322

Cubiks. (2010). PAPI: Personality and Preference Inventory. http://www.cubiks.com/PRODUCTS/PERSONALITYASSESSMENTS/Pages/papi.aspx

Cutler D. R., Edwards T. C., Beard K. H., Cutler, Hess K. T., Gibson, Lawler J. J. (2007). Random forests for classification in ecology. Ecology, 88(11), 2783–2792. https://doi.org/10.1890/07-0539.1

Danner, Blasius, Breyer, Eifler, Menold, Paulhus D. L., Rammstedt, Roberts R. D., Schmitt, Ziegler (2016). Current challenges, new developments, and future directions in scale construction. European Journal of Psychological Assessment, 32(3), 175–180. https://doi.org/10.1027/1015-5759/a000375

Deng Y.-Q., Liu X.-H., Rodriguez M. A., Xia C.-Y. (2011). The five facet mindfulness questionnaire: Psychometric properties of the Chinese version. Mindfulness, 2(2), 123–128. https://doi.org/10.1007/s12671-011-0050-9

Denis P. L., Morin, Guindon (2010). Exploring the capacity of NEO PI-R facets to predict job performance in two French-Canadian samples. International Journal of Selection and Assessment, 18(2), 201–207. https://doi.org/10.1111/j.1468-2389.2010.00501.x

Díaz-Uriarte, Alvarez de Andrés (2006). Gene selection and classification of microarray data using random forest. BMC Bioinformatics, 7(1), 3. https://doi.org/10.1186/1471-2105-7-3

Finch W. H., Hernández Finch M. E., French B. F. (2015). Recursive partitioning to identify potential causes of differential item functioning in cross-national data. International Journal of Testing, 16(1), 21–53. https://doi.org/10.1080/15305058.2015.1039644

Frick (2022). Modeling faking in the multidimensional forced-choice format: The faking mixture model. Psychometrika, 87(2), 773–794. https://doi.org/10.1007/s11336-021-09818-6

Griffith R. L., Chmielowski, Yoshita (2007). Do applicants fake? An examination of the frequency of applicant faking behavior. Personnel Review, 36(3), 341–355. https://doi.org/10.1108/00483480710731310

Guenole, Brown A. A., Cooper A. J. (2018). Forced-choice assessment of work-related maladaptive personality traits: Preliminary evidence from an application of Thurstonian item response modeling. Assessment, 25(4), 513–526. https://doi.org/10.1177/1073191116641181

Guo, Wang, Cai (2023). An item response theory model for incorporating response times in forced-choice measures. Educational and Psychological Measurement, Advance online publication. https://doi.org/10.1177/00131644231171193

Hallquist M. N., Wiley J. F. (2018). Mplus automation: An R package for facilitating large-scale latent variable analyses in Mplus. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 621–638. https://doi.org/10.1080/10705511.2017.1402334

Hayes, Usami, Jacobucci, McArdle J. J. (2015). Using classification and regression trees (CART) and random forests to analyze attrition: Results from two simulations. Psychology and Aging, 30(4), 911–929. https://doi.org/10.1037/pag0000046

Heggestad E. D., Morrison, Reeve C. L., McCloy R. A. (2006). Forced-choice assessments of personality for selection: Evaluating issues of normative assessment and faking resistance. Journal of Applied Psychology, 91(1), 9–24. https://doi.org/10.1037/0021-9010.91.1.9

Hontangas P. M., de la Torre, Ponsoda, Leenen, Morillo, Abad F. J. (2015). Comparing traditional and IRT scoring of forced-choice tests. Applied Psychological Measurement, 39(8), 598–612. https://doi.org/10.1177/0146621615585851

Hung S.-P., Huang H.-Y. (2022). Forced-choice ranking models for raters’ ranking data. Journal of Educational and Behavioral Statistics, 47(5), 603–634. https://doi.org/10.3102/10769986221104207

Jackson D. N., Wroblewski V. R., Ashton M. C. (2000). The impact of faking on employment tests: Does forced choice offer a solution? Human Performance, 13(4), 371–388. https://doi.org/10.1207/s15327043hup1304_3

Joo S. H., Lee, Stark (2021). Modeling multidimensional forced choice measures with the Zinnes and Griggs pairwise preference item response theory model. Multivariate Behavioral Research, 58(2), 241–261. https://doi.org/10.1080/00273171.2021.1960142

Khalilia, Chakraborty, Popescu (2011). Predicting disease risks from highly imbalanced data using random forest. BMC Medical Informatics and Decision Making, 11, 51. https://doi.org/10.1186/1472-6947-11-51

Kim E. S., Joo S.-H., Lee, Wang, Stark (2016). Measurement invariance testing across between-level latent classes using multilevel factor mixture modeling. Structural Equation Modeling: A Multidisciplinary Journal, 23(6), 870–887. https://doi.org/10.1080/10705511.2016.1196108

Komboz, Strobl, Zeileis (2016). Tree-based global model tests for polytomous Rasch models. Educational and Psychological Measurement, 78(1), 128–166. https://doi.org/10.1177/0013164416664394

Lee, Smith W. Z. (2020a). A Bayesian random block item response theory model for forced-choice formats. Educational and Psychological Measurement, 80(3), 578–603. https://doi.org/10.1177/0013164419871659

Lee, Smith W. Z. (2020b). Fit indices for measurement invariance tests in the Thurstonian IRT model. Applied Psychological Measurement, 44(4), 282–295. https://doi.org/10.1177/0146621619893785

Lee, Joo S.-H., Stark (2021). Detecting DIF in multidimensional forced choice measures using the Thurstonian item response theory model. Organizational Research Methods, 24(4), 739–771. https://doi.org/10.1177/1094428120959822

Lee, Joo S.-H., Stark, Chernyshenko O. S. (2019). GGUM-RANK statement and person parameter estimation with multidimensional forced choice triplets. Applied Psychological Measurement, 43(3), 226–240. https://doi.org/10.1177/0146621618768294

Loo (1999). Issues in factor-analyzing ipsative measures: The learning style inventory (LSI-1985) example. Journal of Business & Psychology, 14(1), 149–154. https://doi.org/10.1023/A:1022918803653

Luce R. D. (1977). The choice axiom after twenty years. Journal of Mathematical Psychology, 15(3), 215–233. https://doi.org/10.1016/0022-2496(77)90032-3

MacCallum R. C., Zhang, Preacher K. J., Rucker D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7(1), 19–40. https://doi.org/10.1037/1082-989X.7.1.19

Magis, Béland, Tuerlinckx, De Boeck P. (2010). A general framework and an R package for the detection of dichotomous differential item functioning. Behavior Research Methods, 42(3), 847–862. https://doi.org/10.3758/BRM.42.3.847

McCrae R. R., Costa P. T. (1992). Discriminant validity of NEO-PIR facet scales. Educational and Psychological Measurement, 52(1), 229–237. https://doi.org/10.1177/001316449205200128

Merk, Schlotz, Falter (2017). The motivational value systems questionnaire (MVSQ): Psychometric analysis using a forced choice Thurstonian IRT model. Frontiers in Psychology, 8, Article 1626. https://doi.org/10.3389/fpsyg.2017.01626

Morillo, Leenen, Abad F. J., Hontangas, de la Torre, Ponsoda (2016). A dominance variant under the multi-unidimensional pairwise-preference framework. Applied Psychological Measurement, 40(7), 500–516. https://doi.org/10.1177/0146621616662226

Muthén L. K., Muthén (2017). Mplus (Version 8) [computer software]. Muthén & Muthén.

Ng, Lee, Ho M.-H. R., Kuykendall, Stark, Tay (2020). The development and validation of a multidimensional forced-choice format character measure: Testing the Thurstonian IRT approach. Journal of Personality Assessment, 103(2), 224–237. https://doi.org/10.1080/00223891.2020.1739056

65.

Nye

C. D.

(2011). The development and validation of effect size measures for IRT and CFA studies of measurement equivalence (Unpublished doctoral dissertation). University of Illinois at Urbana-Champaign.

66.

Paulhus

D. L.

(1991). Measurement and control of response bias . In Robinson

J. P.

Shaver

R. P.

Wrightsman

L. S.

(Eds.), Measures of social psychological attitudes (pp. 17–59). Academic.

67.

Peng

Man

Veldkamp

B. P.

Cai

(2023). A mixture model for random responding behavior in forced-choice noncognitive assessment: Implication and application in organizational research. Organizational Research Methods, Advanced online publication. https://doi.org/10.1177/10944281231181642

68.

Principles for the validation and use of personnel selection procedures. (2018). Industrial and Organizational Psychology: Perspectives on Science and Practice, 11(S1), 1–97. https://doi.org/10.1017/iop.2018.195

69.

Qiu

X.-L.

Wang

W.-C.

(2021). Assessment of differential statement functioning in ipsative tests with multidimensional forced-choice items. Applied Psychological Measurement, 45(2), 79–94. https://doi.org/10.1177/0146621620965739

70.

Robert

Lee

W. C.

Chan

K.-Y.

(2006). An empirical analysis of measurement equivalence with the INDCOL measure of individualism and collectivism: Implications for valid cross-cultural inference. Personnel Psychology, 59(1), 65–99. https://doi.org/10.1111/j.1744-6570.2006.00804.x

71. Salgado, J. F., Anderson, N., & Tauriz, G. (2015). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 88(4), 797–834. https://doi.org/10.1111/joop.12098

72. Schmitt, N., & Oswald, F. L. (2006). The impact of corrections for faking on the validity of noncognitive measures in selection settings. Journal of Applied Psychology, 91(3), 613–621. https://doi.org/10.1037/0021-9010.91.3.613

73. Shaffer, J. A., & Postlethwaite, B. E. (2013). The validity of conscientiousness for predicting job performance: A meta-analytic test of two hypotheses. International Journal of Selection and Assessment, 21(2), 183–199. https://doi.org/10.1111/ijsa.12028

74. Sitser, T., van der Linden, D., & Born, M. P. (2013). Predicting sales performance criteria with personality measures: The use of the general factor of personality, the big five and narrow traits. Human Performance, 26(2), 126–149. https://doi.org/10.1080/08959285.2013.765877

75. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise-preference model. Applied Psychological Measurement, 29(3), 184–203. https://doi.org/10.1177/0146621604273988

76. Stark, S., Chernyshenko, O. S., & Drasgow, F. (2006). Detecting differential item functioning with confirmatory factor analysis and item response theory: Toward a unified strategy. Journal of Applied Psychology, 91(6), 1292–1306. https://doi.org/10.1037/0021-9010.91.6.1292

77. Stark, S., Chernyshenko, O. S., Drasgow, F., & White, L. A. (2012). Adaptive testing with multidimensional pairwise preference items. Organizational Research Methods, 15(3), 463–487. https://doi.org/10.1177/1094428112444611

78. Strobl, C., Kopf, J., & Zeileis, A. (2013). Rasch trees: A new method for detecting differential item functioning in the Rasch model. Psychometrika, 80(2), 289–316. https://doi.org/10.1007/s11336-013-9388-3

79. Strobl, C., Malley, J., & Tutz, G. (2009). An introduction to recursive partitioning: Rationale, application, and characteristics of classification and regression trees, bagging, and random forests. Psychological Methods, 14(4), 323–348. https://doi.org/10.1037/a0016973

80. Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361–370. https://doi.org/10.1111/j.1745-3984.1990.tb00754.x

81. Tay, L., Huang, Q., & Vermunt, J. K. (2015). Item response theory with covariates (IRT-C): Assessing item recovery and differential item functioning for the three-parameter logistic model. Educational and Psychological Measurement, 76(1), 22–42. https://doi.org/10.1177/0013164415579488

82. Tett, R. P., Christiansen, N. D., Robie, C., & Simonet, D. V. (2011). International survey of personality test use: An American baseline. In 15th Conference of the European Association of Work and Organizational Psychology, Maastricht, The Netherlands.

83. Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34(4), 273–286. https://doi.org/10.1037/h0070288

84. Tutz, G., & Berger, M. (2016). Item-focussed trees for the identification of items in differential item functioning. Psychometrika, 81(3), 727–750. https://doi.org/10.1007/s11336-015-9488-3

85. Wang, W.-C., Qiu, X.-L., Chen, C.-W., Ro, S., & Jin, K.-Y. (2017). Item response theory models for ipsative tests with multidimensional pairwise comparison items. Applied Psychological Measurement, 41(8), 600–613. https://doi.org/10.1177/0146621617703183

86. Wetzel, E., Böhnke, J. R., & Brown, A. (2016). Response biases. In F. T. L. Leong, D. Bartram, F. M. Cheung, K. F. Geisinger, & D. Iliescu (Eds.), The ITC international handbook of testing and assessment (pp. 349–363). Oxford University Press.

87. Winkelspecht, C., Lewis, P., & Thomas, A. (2006). Potential effects of faking on the NEO-PI-R: Willingness and ability to fake changes who gets hired in simulated selection decisions. Journal of Business and Psychology, 21(2), 243–259. https://doi.org/10.1007/s10869-006-9027-4

88. Zhang, B., Tu, N., Angrave, L., Zhang, S., Sun, T., Tay, L., & Li, J. (2023). The generalized Thurstonian unfolding model (GTUM): Advancing the modeling of forced-choice data. Organizational Research Methods. Advance online publication. https://doi.org/10.1177/10944281231210481

89. Zieky, M. J. (2015). Developing fair tests. In Handbook of test development (pp. 97–115). Routledge.

90. Zinnes, J. L., & Griggs, R. A. (1974). Probabilistic, multidimensional unfolding analysis. Psychometrika, 39(3), 327–350. https://doi.org/10.1007/BF02291707

