Mixed-Keying or Desirability-Matching in the Construction of Forced-Choice Measures? An Empirical Investigation and Practical Recommendations

Abstract

Forced-choice (FC) measures are becoming increasingly popular as an alternative to single-statement (SS) measures. However, to ensure the practical usefulness of an FC measure, it is crucial to address the tension between psychometric properties and faking resistance by balancing mixed keying and social desirability matching. It is currently unknown from an empirical perspective whether the two design criteria can be reconciled, and how they impact respondent reactions. Using a two-wave experimental design, we constructed four FC measures with varying degrees of mixed-keying and social desirability matching from the same statement pool and investigated their differences in terms of psychometric properties, faking resistance, and respondent reactions. Results showed that all FC measures demonstrated comparable reliability and induced similar respondent reactions. Forced-choice measures with stricter social desirability matching were more faking resistant, while FC measures with more mixed keyed blocks had higher convergent validity with the SS measure and displayed discriminant and criterion-related validity profiles similar to those of the SS benchmark. More importantly, we found that it is possible to strike a balance between social desirability matching and mixed keying, such that FC measures can have adequate psychometric properties and faking resistance. A seven-step recommendation and a tutorial based on the autoFC R package are provided to help readers construct their own FC measures.

Keywords

forced-choice, faking, mixed-keying, social desirability matching

The forced-choice (FC) format has regained popularity in recent years as an alternative to the traditional single-statement (SS) format for several reasons. First, properly designed FC measures are substantially more faking-resistant than their SS counterparts (Cao & Drasgow, 2019). Second, compared to SS measures, personality scores derived from FC measures possess higher predictive validity for important work outcomes (Salgado et al., 2015; Speer et al., 2023). Third, FC measures by design are immune to a number of response biases plaguing SS measures, such as acquiescence, extreme responding, and midpoint responding (Schulte et al., 2021). Finally, the recent development of freely accessible R packages for automatic test assembly (Li et al., 2022) and scoring (Bürkner, 2019; Zhang, Tu et al., 2023) has lifted many technical barriers. It is not surprising that FC measures are receiving growing attention, especially from those concerned with applicant faking.

However, previous Monte Carlo simulation studies revealed a seeming tension between psychometric properties and faking-resistance (e.g., Brown & Maydeu-Olivares, 2011; Bürkner, 2022; Bürkner et al., 2019; Lee et al., 2022). On one hand, for the estimated latent trait scores to be reliable, the state-of-the-art FC scoring model requires mixed keying (Brown & Maydeu-Olivares, 2011), which refers to the inclusion of FC blocks with a mixture of both positively and negatively keyed statements. On the other hand, faking-resistance requires statements within a block to have similar levels of social desirability, which in most cases means that statements within a block cannot be mixed keyed. How to strike a balance between mixed-keying and social desirability matching such that we can develop psychometrically sound and faking-resistant FC measures is one of the most urgent practical issues to be addressed in FC measurement. While simulation studies are informative for understanding the reliability of trait score estimates under fully controlled ideal conditions, they offer limited insights regarding the fakability of FC measures in the real world, as this is an empirical question by nature. Therefore, we argue that it is critical to move beyond simulations and toward empirical studies to investigate the key question of how to strike a balance between mixed-keying and social desirability matching to produce psychometrically sound and faking-resistant FC measures.

Another often neglected aspect of FC measurement is respondent reactions, which are critical for respondent engagement, data quality, and recruitment success (Hausknecht et al., 2004). Although some previous examinations of respondent reactions toward FC measures have been conducted (Converse et al., 2008; Dalal et al., 2021; Sass et al., 2020; Zhang, Sun et al., 2020; Zhang, Luo et al., 2023), no evidence is available yet for the potential impact of mixed keying and social desirability matching on respondents’ reactions in both honest and motivated faking situations. Again, the question of respondent reactions is purely empirical in nature and thus requires empirical data to answer it.

Therefore, echoing the call for more research on developing psychometrically sound, faking-resistant, and user-friendly personnel selection tools (Van Iddekinge et al., 2023), this study seeks to provide the first piece of empirical evidence on how mixed-keying and social desirability matching may impact FC measures’ psychometric properties (i.e., reliability, convergent validity, discriminant validity, and criterion-related validity), fakability (i.e., rank-order stability and mean score inflation), and respondent reactions (i.e., general and selection-specific). Based on our empirical findings, we also provide a step-by-step recommendation on how to construct good FC measures in the Discussion section. While we do not conduct simulation studies to answer the focal empirical questions, we do appreciate the value of simulations for FC measure development, such as estimating the reliability of trait scores derived from FC measures in ideal conditions. Therefore, we updated the R package autoFC (Li et al., 2022) with several additional functions that users can easily use to run customized simulations, and we provide an associated tutorial in the Online Supplementary Materials. Ultimately, we aim to contribute empirical knowledge and practical tools to the construction of high-quality FC measures.

A Brief Overview of FC Measurement

Noncognitive constructs such as personality and vocational interests have been playing increasingly important roles in organizational research and personnel selection, due to their sizeable predictive validity for important organizational outcomes (He et al., 2019; Nye et al., 2012) and their potential to reduce adverse impact (Cottrell et al., 2015; Jones et al., 2022). Thus, accurate assessment of these constructs has been of key interest. The SS format, which requires respondents to indicate to what extent they agree with each statement on a polytomous scale (e.g., 1 = “Strongly disagree”, …, 5 = “Strongly agree”), is no doubt the most widely adopted format to assess noncognitive constructs, due to the relative ease of scale development, administration, and scoring. However, scores derived from SS measures are often contaminated by various response biases and deliberate faking (Kreitchmann et al., 2019; Wetzel et al., 2021; Zhang, Cao et al., 2020), all of which will, at least to some degree, render such scores less valid for making between-person comparisons and predicting key workplace outcomes (Schulte et al., 2021).

To address the issues of response biases and faking in the SS format, the FC format was introduced as an alternative (Sisson, 1948). In an FC measure, individuals are presented with blocks, each containing at least two statements (“statements” in FC measures are the same as “items” in SS measures). Respondents are then asked to either (1) choose the statement(s) that are most and/or least descriptive (the MOLE format) of themselves or (2) rank all the statements in each block from the most descriptive to the least descriptive (the RANK format) of themselves (Cao & Drasgow, 2019). The number of statements per block is called block size, which is often constant within an FC measure and typically ranges from 2 to 5. A block is called unidimensional if the statements within that block measure the same latent trait, and multidimensional if the statements within that block measure different latent traits. Multidimensional blocks are more common than unidimensional ones. In Figure 1, we illustrated examples of multidimensional FC blocks.

Figure 1.

Examples of multidimensional forced-choice blocks.

When the block size n is greater than 2, responses to each block (e.g., Lily chose statement A as “the most like me,” statement B as “the least like me,” and left statement C in between; A > C > B) will be decomposed into n(n-1)/2 pseudo items representing dichotomous outcomes of all unique pairwise comparisons (e.g., AB = 1; AC = 1; BC = 0), each indicating whether the first statement in a pair is preferred to the second (outcome 1) or not (outcome 0). These pseudo items will serve as indicators of latent factors and will be subjected to the Thurstonian Item Response Theory (TIRT; Brown & Maydeu-Olivares, 2011) model, which is a special type of categorical confirmatory factor analysis model for estimating statement parameters and person scores. This explains why larger blocks are more psychometrically informative when everything else is held constant, as larger blocks mean more indicators for latent factors (e.g., responses to block sizes of 2, 3, 4, and 5 correspond to 1, 3, 6, and 10 pseudo items, respectively, if full ranking is elicited).
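The decomposition described above can be illustrated with a short Python sketch. It is for illustration only and is not part of any scoring package; the function names `ranking_to_pairs` and `n_pseudo_items` are hypothetical, and a full ranking within the block is assumed.

```python
from itertools import combinations

def ranking_to_pairs(ranking):
    """Decompose one block's full ranking into binary pseudo items.

    `ranking` lists statement labels from most to least descriptive.
    Each pseudo item is 1 if the first statement of the pair is
    preferred to the second, and 0 otherwise.
    """
    position = {label: rank for rank, label in enumerate(ranking)}
    outcomes = {}
    # Pairs are formed in alphabetical order: AB, AC, BC, ...
    for first, second in combinations(sorted(ranking), 2):
        outcomes[first + second] = int(position[first] < position[second])
    return outcomes

def n_pseudo_items(block_size):
    """Number of unique pairwise comparisons, n(n-1)/2."""
    return block_size * (block_size - 1) // 2

# Lily's response from the text: A most like me, B least like me (A > C > B).
print(ranking_to_pairs(["A", "C", "B"]))        # {'AB': 1, 'AC': 1, 'BC': 0}
print([n_pseudo_items(n) for n in (2, 3, 4, 5)])  # [1, 3, 6, 10]
```

In a triplet answered in the MOLE format, the most and least choices imply the full ranking, so the same decomposition applies.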

Unlike SS measures where participants make an absolute judgment regarding their agreement with each statement, FC measures require respondents to decide which statement in the current block describes them relatively better than others. Even when all statements within a block describe them with similar accuracy in the absolute sense (e.g., “all the statements describe me accurately/inaccurately”), they are still required to make a relative choice. Given the “forced” nature of responding, FC measures are immune to response biases such as acquiescence, extreme responding, halo bias, and leniency bias by design (Schulte et al., 2021). If statements within blocks are further matched on social desirability, FC measures are also substantially more faking-resistant than their SS counterparts (Cao & Drasgow, 2019; Speer et al., 2023).

In sum, the removal of multiple response biases and the faking resistance potential has rendered the FC format a promising alternative to the SS format. However, one challenge in FC measures is that good psychometric properties and faking resistance often seem incompatible with each other. Next, we will elaborate on the rationale for social desirability matching and mixed keying in constructing FC measures, followed by a discussion on why the two seem incompatible. We will further discuss why respondent reactions should be considered when developing FC measures.

Maintaining Faking Resistance by Matching Social Desirability

Many noncognitive measures include both positively and negatively keyed statements for better coverage of the construct continuum (Tay & Ng, 2018). In most cases, positively keyed statements (e.g., “I am hardworking”) are substantially more socially desirable than their negatively keyed counterparts (e.g., “I often come to work late”). Participants presented with these statements in the same FC block can often identify and choose the more desirable statement in motivated faking situations (Bürkner et al., 2019) regardless of whether it is truly more descriptive of them than others (Schulte et al., 2021). In this sense, this FC measure is said to be more fakable. Following previous practices (Cao & Drasgow, 2019; Hu & Connelly, 2021), we operationalized the fakability of a measure in two complementary ways: (1) as the standardized mean score difference (Cohen's d) between latent trait scores obtained in faking versus honest conditions, and (2) as the rank-order stability of latent trait scores across faking and honest situations. A larger standardized mean score difference and a lower rank-order stability both indicate a higher level of fakability. To minimize the opportunities for faking in FC measures, statements within the same block need to be matched on social desirability (Cao & Drasgow, 2019). To achieve this goal, researchers first need to obtain the social desirability values of all statements. Next, an index of similarity for desirability values (and the corresponding cutoff) is determined for the statements to be paired within a block. More details on how to obtain social desirability values and how to use them for creating blocks are discussed in the Recommended Steps to Develop FC Measures section in the Discussion section.
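The two fakability indices can be sketched in Python as follows. This is an illustration under stated assumptions, not the analysis code of this study: Cohen's d is computed here with the average of the two condition variances as the standardizer, rank-order stability is taken to be the Pearson correlation between honest and faked scores, and the trait scores are made-up values.

```python
from math import sqrt
from statistics import mean, stdev

def cohens_d(faked, honest):
    """Standardized mean difference (faked minus honest).

    Assumption: the standardizer is the square root of the average
    of the two condition variances.
    """
    pooled_sd = sqrt((stdev(faked) ** 2 + stdev(honest) ** 2) / 2)
    return (mean(faked) - mean(honest)) / pooled_sd

def rank_order_stability(honest, faked):
    """Pearson correlation between honest and faked scores of the same people."""
    m_h, m_f = mean(honest), mean(faked)
    cov = sum((h - m_h) * (f - m_f) for h, f in zip(honest, faked))
    ss_h = sum((h - m_h) ** 2 for h in honest)
    ss_f = sum((f - m_f) ** 2 for f in faked)
    return cov / sqrt(ss_h * ss_f)

# Hypothetical latent trait scores for five respondents.
honest = [-1.0, -0.5, 0.0, 0.5, 1.0]
faked = [0.0, 0.5, 1.0, 1.5, 2.0]

print(round(cohens_d(faked, honest), 2))            # score inflation
print(round(rank_order_stability(honest, faked), 2))  # 1.0: ordering fully preserved
```

In this toy example every respondent inflates by the same amount, so the mean shift is large while the rank order is untouched, which shows why the two indices carry different information.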

Improving Psychometric Properties by Using Mixed Keying

Aside from fakability, a fundamental requirement for any FC measure is that the estimated trait scores are reliable and valid. However, Brown (2016) and Bürkner (2022) mathematically showed that latent trait scores estimated from the FC format will be unreliable if all blocks contain statements keyed in the same direction (all statements in a block have either positive or negative factor loadings). This is because equally keyed blocks provide little information regarding the sum of latent trait scores involved in these blocks, which is essential for recovering their absolute locations (Brown & Maydeu-Olivares, 2011; Schulte et al., 2021). More precisely, when holding other factors constant, the amount of information provided by a pair is positively related to the absolute difference between factor loadings of the two statements (Brown & Maydeu-Olivares, 2011; Bürkner, 2022) and will drop to zero if the two statements have identical factor loadings. Given that most statements have been selected to possess moderate to high factor loadings, if statements keyed in the same direction are put into the same block, their factor loading difference would be small and thus not psychometrically informative.
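The argument can be made concrete with the response function for a single pairwise comparison under the TIRT model (Brown & Maydeu-Olivares, 2011). The notation below is a sketch chosen for this illustration: $y_{ik}$ is the binary outcome for statements $i$ and $k$, which load on traits $\eta_a$ and $\eta_b$ with loadings $\lambda_i$ and $\lambda_k$; $\gamma_{ik}$ is the pair threshold; $\psi_i^2$ and $\psi_k^2$ are the uniquenesses; and the equal-magnitude loading $\lambda > 0$ in the special cases is a simplifying assumption.

```latex
\begin{align}
P(y_{ik} = 1 \mid \eta_a, \eta_b)
  &= \Phi\!\left(
       \frac{-\gamma_{ik} + \lambda_i \eta_a - \lambda_k \eta_b}
            {\sqrt{\psi_i^2 + \psi_k^2}}
     \right) \\
\text{equally keyed } (\lambda_i = \lambda_k = \lambda):\quad
  \lambda_i \eta_a - \lambda_k \eta_b &= \lambda\,(\eta_a - \eta_b) \\
\text{mixed keyed } (\lambda_i = \lambda,\ \lambda_k = -\lambda):\quad
  \lambda_i \eta_a - \lambda_k \eta_b &= \lambda\,(\eta_a + \eta_b) \\
\text{same trait } (\eta_a = \eta_b = \eta):\quad
  \lambda_i \eta_a - \lambda_k \eta_b &= (\lambda_i - \lambda_k)\,\eta
\end{align}
```

Under these assumptions, an equally keyed pair responds only to the difference between the two traits, a mixed-keyed pair responds to their sum, and a same-trait pair with identical loadings carries no information about the trait at all.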

Indeed, many simulation studies have confirmed that FC measures with only equally keyed blocks suffered from various psychometric issues, including low model convergence rates, severely biased estimates of statement parameters, unreliable estimates of latent trait scores (Brown & Maydeu-Olivares, 2011), and biased estimates of inter-trait correlations (Bürkner et al., 2019; Schulte et al., 2021). One theoretical solution is to include a large number of latent factors (e.g., 30) and/or a large number of statements per trait (e.g., 15) with high factor loadings (e.g., .80 or above) in one FC measure, which has been shown to be effective in simulation studies (Bürkner et al., 2019; Schulte et al., 2021) and empirical settings (Brown & Bartram, 2009). However, this solution is likely too demanding and most often impractical. The easiest and the most effective solution is to include a substantial number of mixed-keyed blocks that contain both positively and negatively keyed statements, as it is easier to maintain a substantial factor loading difference when one loading is positive and another one is negative, and both are of moderate magnitude. Simulations have shown that FC measures including mixed-keyed blocks consistently outperformed those with only equally keyed blocks in terms of model convergence and parameter recovery accuracy (Brown & Maydeu-Olivares, 2011; Bürkner et al., 2019; Schulte et al., 2021).

Social Desirability Matching vs. Mixed Keying: A Dilemma?

As discussed above, both social desirability matching and mixed keying are important for different properties of FC measures. However, researchers face a dilemma if they wish to follow both design criteria: on one hand, the faking resistance of an FC measure is based on social desirability matching that often requires equally keyed blocks; on the other hand, accurate score estimation requires a substantial number of mixed keyed blocks, which can be equivalent to social desirability mismatching because positively keyed statements are often more desirable than their negatively keyed counterparts. It seems that no matter which side we prioritize, the other side will suffer. This leads some researchers to conclude that it is impossible to simultaneously maintain good psychometric properties and faking resistance for an FC measure (Bürkner et al., 2019; Ng et al., 2021; Schulte et al., 2021).

While we agree that social desirability matching and mixed keying may conflict with each other on some occasions, we argue that it is still possible to find a sweet spot between the two criteria, such that we can develop FC measures that are both sufficiently faking-resistant and psychometrically sound. According to a recent simulation study (Lee et al., 2022), 20% mixed keyed blocks suffice to ensure reliable latent trait score estimates, as long as the statements are reliable indicators of the latent factors. In addition, the marginal utility of more mixed keyed blocks for reliability gradually decreases and reaches a plateau when the proportion of mixed keyed blocks exceeds 60%, implying that too many mixed keyed blocks are unnecessary. These findings are important as they showed that including a small proportion of mixed-keyed blocks can substantially benefit the psychometric properties of an FC measure while at the same time presumably not affecting the fakability as severely as previously thought (an FC measure with 80% of its blocks matched on social desirability may still be fairly faking-resistant). It is possible to compromise slightly on both criteria to reach a sweet spot where both good enough psychometric properties and faking resistance are achieved. There is hope!

However, in pursuit of such a sweet spot, we must rely on empirical evidence from real human responses instead of simulations because faking is a complex psychological phenomenon, for which we do not yet have a satisfactory psychometric model. As such, even though Lee et al.'s (2022) simulations provide benchmarks for satisfactory reliability, these simulation results tell us little about the degree to which different proportions of mixed-keyed blocks will impact the fakability of an FC measure. Aside from fakability, other important psychometric properties of a measure, such as criterion-related validity, can only be examined using empirical data collected from real human respondents as well. In fact, even previous simulation findings on the reliability of FC scores should also be subjected to empirical tests because all simulation studies are based on untested assumptions and if these assumptions do not hold empirically, findings may be untrustworthy. In sum, it is critical to move beyond simulation studies and use empirical data to investigate whether it is possible to balance social desirability matching and mixed keying, and ultimately, develop FC measures with good enough psychometric properties and faking-resistance.

Respondent Reactions to FC Measures

Another important but often neglected issue in developing FC measures is respondent reactions. Respondent reactions refer to respondents’ attitudes, affect, or cognitions related to the measurement tool (Hausknecht et al., 2004). Positive respondent reactions can elicit favorable impressions on employers from respondents’ perspectives, increase applicants’ intention to recommend the employer to other job seekers, and improve data quality through enhanced test motivation (Hausknecht et al., 2004; McCarthy et al., 2017; Sass et al., 2020).

However, studies examining how FC design features can impact respondent reactions are still lacking, except Dalal et al. (2021) and Fuechtenhans and Brown (2022). Although Dalal et al. (2021) examined how different FC designs impacted respondent reactions, their use of a computerized adaptive testing design (where each respondent was presented with different statements that best matched their latent trait levels) may confound the effect of different statements with the effect of design features. Moreover, in many scenarios, researchers and practitioners would use a static FC measure constructed from a small statement pool. It is thus critical to know how different pairing strategies will impact respondents’ reactions when holding the statement pool constant. Fuechtenhans and Brown (2022) used a qualitative study design to examine how statement matching would impact respondents’ experience with the FC format. They found that blocks with both desirable and undesirable statements are generally considered as easier and less cognitively demanding than blocks matched on social desirability. Although these findings are valuable, it is important to complement these findings with quantitative estimates from a rigorously designed experiment and provide a more comprehensive coverage of other aspects of respondent reactions (e.g., perceived fakability), which are still lacking in the current literature.

The Present Study

The present study seeks to empirically examine how different levels of social desirability matching and mixed keying influence the (1) psychometric properties of, (2) fakability of, and (3) respondent reactions to FC measures. Answers to these questions would not only complement previous simulation findings but also provide an evidence-based guide to the construction of reliable, valid, faking-resistant, and user-friendly FC measures. To achieve these goals, we constructed four different versions of FC measures based on the same set of statements (see Methods section below for details). To further facilitate the use of the FC format, we provide step-by-step recommendations on how to construct high-quality FC measures in the Discussion section. A tutorial written in R implementing each of these steps using the autoFC (Li et al., 2022) package is also provided in the Online Supplementary Materials.

Methods

Participants and Procedures

We conducted a two-wave study to examine the psychometric properties of, fakability of, and respondent reactions to four different FC measures with different design features while holding the statement pool constant. Respondents were recruited from the Prolific crowdsourcing platform. The study flow and demographic information in each group can be found in Figure 2. At Time 1 (honest condition), we aimed for 550 respondents per group, and a total of 2,187 respondents were eventually recruited, each paid $3 for participation. After consenting to proceed, participants first responded to demographic questions and were then presented randomly with one of the four FC measures. After this, they were immediately asked about their reactions to the FC measure, then followed by the SS measure. Finally, participants completed several criterion measures, which were presented in random order. For both FC and SS measures, participants were instructed to respond as honestly as possible. After excluding responses that failed more than one out of six quality control items, a total of 2,147 usable responses were retained.

Figure 2.

Study procedures.

Three months later (Time 2; fake-good condition), all participants who participated in Time 1 survey were invited to join a follow-up survey for a $2 reward. Similar to Time 1, participants consented to their participation and provided demographic information. However, before presenting the focal measures, we simulated a fake-good test situation where participants were instructed to respond as if they were applying for their dream job. To increase the fidelity of the simulation, we first asked respondents to write down names of their dream organizations and positions. They were then asked to use one or two sentences to explain why they wanted these positions. After the explanation, respondents were asked to imagine that their dream organizations were hiring and the organizations would use a personality test to decide who will be invited to fly to the headquarters for the final interview. To seize this opportunity, respondents were asked to try their best to get on the invitation list. Our decision of implementing a “faking for your dream job” scenario was based on two reasons. First, describing their most wanted positions from their dream organization can make the simulated scenario more personally relevant and can better represent a real job application situation where most respondents would apply for jobs they like. Second, instructing participants to fake for their dream jobs overcomes the limitations brought by differential ability and motivation to fake had they been asked to fake for a predetermined position (Fuechtenhans & Brown, 2022). Following the simulated job application scenario, participants were presented with the same FC personality measure they completed in Time 1. After the completion of the FC measure, participants again indicated their reactions toward the measure, and then responded to the SS personality measure. No criterion measures were presented in Time 2. 
A total of 1,177 responses were collected at Time 2, resulting in a response rate of 54.82%.

Measures

Demographics

Respondents self-reported their demographic information, including age, gender (1 = Female, 2 = Male, 3 = Nonbinary), education level (1 = Primary school, 2 = High school or equivalent, 3 = Some college or equivalent, 4 = Bachelor or equivalent, 5 = Master, 6 = PhD), and annual income before tax (1 = under $10,000, 2 = $10,000–$19,999, 3 = $20,000–$29,999, 4 = $30,000–$39,999, 5 = $40,000–$49,999, 6 = $50,000–$74,999, 7 = $75,000–$99,999, 8 = $100,000–$150,000, 9 = Over $150,000).

HEXACO-60_FC

In this study, we constructed FC measures based on the HEXACO-60 (Ashton & Lee, 2009) and used a triplet format. We chose 20 triplets because (1) we wanted the FC measures to be sensitive to manipulations, (2) we wanted to keep the survey to a reasonable length, and (3) 20 triplets for 5–6 latent factors are also quite common (e.g., Brown & Maydeu-Olivares, 2011; Lee et al., 2019; Walton et al., 2020; Wetzel & Frick, 2020). To develop triplet multidimensional FC (MFC) HEXACO measures, we first obtained social desirability ratings (ranging from 1 to 5) for the 60 statements from Anglim et al. (2017). Specifically, social desirability ratings from the applicant sample in their study were used. We then constructed four different FC versions, each with 20 triplets consisting of three statements measuring different dimensions from HEXACO. We constructed these FC measures using an automatic item pairing R package, autoFC (Li et al., 2022). The design criteria for these four measures (also see Figure 2) were as follows: (1) For the first two FC measures (FC1 and FC2), statements within a triplet were matched by similar levels of social desirability. More specifically, we operationalized social desirability discrepancy as the maximum difference of social desirability among the three statements in a block and maintained the mean discrepancy across all 20 blocks as 0.34 (min = 0.05, max = 0.70) for FC1 and 0.37 (min = 0.13, max = 0.73) for FC2. The numbers of mixed keyed blocks for FC1 and FC2 were set to be 3 and 6, respectively. (2) For the remaining two FC measures (FC3 and FC4), the numbers of mixed keyed blocks were set to be 13 for FC3 and 12 for FC4 to represent cases with more mixed keyed blocks¹. The mean discrepancy across 20 blocks was 0.86 (min = 0.13, max = 1.85) for FC3 and 1.09 (min = 0.48, max = 2.01) for FC4. (3) Across the four FC versions, we tried our best to ensure that each latent trait was paired with the other five traits for about an equal number of times (at least twice) while satisfying all previous constraints. Participants were required to select one statement describing them most, and another one describing them least from each block. Detailed block design for the four FC measures is presented in Table S1 in the Online Supplementary Materials.
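The two design quantities used above, social desirability discrepancy and the number of mixed keyed blocks, can be computed as in the following Python sketch. It is an illustration only: it does not use the autoFC package, the function names are hypothetical, and the two example blocks carry made-up desirability ratings rather than the ratings from Anglim et al. (2017).

```python
from statistics import mean

# Hypothetical blocks: each statement is (trait, key, desirability rating).
# key is +1 for a positively keyed and -1 for a negatively keyed statement.
blocks = [
    [("H", +1, 4.1), ("X", +1, 4.3), ("C", +1, 4.2)],  # equally keyed
    [("E", -1, 2.9), ("A", +1, 3.4), ("O", +1, 3.1)],  # mixed keyed
]

def desirability_discrepancy(block):
    """Maximum difference in social desirability among statements in a block."""
    ratings = [rating for _, _, rating in block]
    return max(ratings) - min(ratings)

def is_mixed_keyed(block):
    """True if the block contains both positively and negatively keyed statements."""
    return {key for _, key, _ in block} == {+1, -1}

def summarize(blocks):
    """Mean, minimum, and maximum discrepancy plus the mixed keyed block count."""
    gaps = [desirability_discrepancy(block) for block in blocks]
    return {
        "mean_discrepancy": mean(gaps),
        "min_discrepancy": min(gaps),
        "max_discrepancy": max(gaps),
        "n_mixed_keyed": sum(is_mixed_keyed(block) for block in blocks),
    }

summary = summarize(blocks)
print({name: round(value, 2) for name, value in summary.items()})
```

Applied to all 20 triplets of a candidate design, the same summary would yield the kind of mean, minimum, and maximum discrepancy values reported for FC1 to FC4.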

In sum, FC1 and FC4 represent two realistic extremes of the compromise between social desirability matching and mixed keying. FC1 has the best match in terms of social desirability but has the fewest mixed-keyed blocks, while FC4 has the majority of blocks being mixed-keyed but is least matched on social desirability. The two FC measures in between (FC2 and FC3) represent attempts to strike a balance between mixed-keying and social desirability matching. Ideally, we would expect FC1 and FC2 to be more faking-resistant and FC3 and FC4 to be superior in measurement precision. By comparing FC1–FC2, we can examine the effect of increasing the number of mixed keyed blocks while maintaining the same degree of social desirability matching; by comparing FC3–FC4, we can examine the effect of relaxing social desirability matching while maintaining a sufficient number of mixed keyed blocks. Finally, by comparing FC2–FC3, we can investigate the impact of different preferences for the balance between the two design criteria, in which FC2 favors better social desirability matching while FC3 favors more mixed keyed blocks. Overall, different comparisons between the four versions of FC can provide us with a holistic picture of the individual and joint impact of the two criteria.

HEXACO-60_SS

The same 60 statements from HEXACO-60 (Ashton & Lee, 2009) were also used as an SS Likert-type measure. Participants were instructed to indicate the extent to which each item described themselves on a 5-point rating scale. Items were randomly presented for each participant to reduce order effects. The SS measure served as an anchor to evaluate the psychometric properties of different FC measures.

Criterion measures

Details about criterion measures (e.g., reliability, length, rating scales) and their HEXACO correlates based on previous meta-analyses and large-sample primary studies are presented in Table 1. Full items can be found in the Online Supplementary Materials Section 4. Means, SDs, and reliabilities can be found in Table S2 in the Online Supplementary Materials.

Table 1.

Details about Criterion Measures and Respondent Reactions Measures.

Criterion	Example item	Rating scale	α	Theoretical HEXACO correlates
OCB (10) (Fox et al., 2007)	Helped new employees get oriented to the job.	1 = Never, 5 = Every day	.83–.86	H(+), E(/), X(+), A(+), C(+), O(+) (Pletzer et al., 2021; Table 2)
CWB (10) (Spector et al., 2006)	Ignored someone at work.	1 = Never, 5 = Every day	.80–.85	H(-), E(-), X(/), A(-), C(-), O(/) (Pletzer et al., 2020; Table 1)
JP (7) (Williams & Anderson, 1991)	I adequately complete assigned duties.	1 = Strongly disagree, 5 = Strongly agree	.79–.82	H(+) (Lee, Berry and Gonzalez-Mulé, 2019; Table 4), E(/), X(/), A(/), C(+), O(+) (Zettler et al., 2020; Table 9)
JS (9) (Spector, 1985)	All in all, how satisfied are you with the pay of your job?	1 = Very dissatisfied, 5 = Very satisfied	.85–.88	H(+), E(-), A(+), C(-) (Pletzer et al., 2023; Supplementary Materials Table 9), X(+), O(/) (Judge et al., 2002; Table 1)
BNT (6) (Kristensen et al., 2005)	How often do you feel tired?	1 = Never, 5 = Always	.88–.90	H(-), E(+), X(-), A(-), C(-), O(/) (Pletzer et al., 2023; Supplementary Materials Table 9)
TI (6) (Roodt, 2004)	How often have you considered leaving your job?	1 = Always, 5 = Never	.77–.82	E(+), X(-), A(-), C(-), O(/) (Zimmerman, 2008; Table 3)
ORG (5) (self-made)	In your current organization, do you have the right to hire people?	1 = Yes, 0 = No	.74–.80	E(/), X(+), A(/), C(+), O(+) (Harms et al., 2007; Table 1)
CHAR (3) (self-made)	In the past year, have you volunteered?	1 = Yes, 0 = No	.29–.42	E(/), X(+), A(+), C(+), O(/) (Carlo et al., 2005; Table 1)
SWB (5) (Diener et al., 1985)	In most ways my life is close to my ideal.	1 = Strongly disagree, 5 = Strongly agree	.87–.90	H(+), E(-), X(+), A(+), C(+), O(+) (Anglim et al., 2020; Table 7)
FS (5) (Munyon et al., 2020)	I have adequate income.	1 = Strongly disagree, 5 = Strongly agree	.84–.88	E(-) (Munyon et al., 2020; Table 5, 10)
PHQ (14) (Schat et al., 2005)	How often have you experienced headaches?	1 = Not at all, 7 = All the time	.85–.87	H(/), E(-), X(/), A(/), C(+), O(/) (Pletzer et al., 2023; Table 3)
NARC (4) (Jonason & Webster, 2010)	I tend to want others to admire me.	1 = Strongly disagree, 5 = Strongly agree	.75–.80	H(-), E(-), X(+), A(-), C(/), O(/) (Schreiber & Marcus, 2020; Table 1)
MACH (4) (Jonason & Webster, 2010)	I tend to manipulate others to get my way.	1 = Strongly disagree, 5 = Strongly agree	.79–.82	H(-), E(-), X(/), A(-), C(-), O(/) (Schreiber & Marcus, 2020; Table 1)
PSYCH (4) (Jonason & Webster, 2010)	I tend to lack remorse.	1 = Strongly disagree, 5 = Strongly agree	.67–.77	H(-), E(-), X(/), A(-), C(-), O(/) (Schreiber & Marcus, 2020; Table 1)

Respondent reactions (T1)	Example item	Rating scale	α
Positive Affect (3) (Adapted from Zhang et al., 2020 + self-made)	This questionnaire is interesting.	1 = Strongly disagree 5 = Strongly agree	.67–.75
Accuracy (3) (Adapted from Dalal et al., 2021 + Self-made)	This questionnaire can accurately measure my personality characteristics.	1 = Strongly disagree 5 = Strongly agree	.66–.74
Utility (3) (self-made)	This questionnaire is useful for personnel selection.	1 = Strongly disagree 5 = Strongly agree	.72–.77
Faking Resistance (3) (self-made)	It is hard to fake on this questionnaire.	1 = Strongly disagree 5 = Strongly agree	.65–.76
Difficulty (3) (Adapted from Zhang et al., 2020 + self-made)	This questionnaire is difficult to answer.	1 = Strongly disagree 5 = Strongly agree	.75–.78
Cognitive Burden (3) (self-made)	Completing this questionnaire makes me exhausted.	1 = Strongly disagree 5 = Strongly agree	.45–.56
Degree of Concentration (3) (Adapted from Zhang et al., 2020 + self-made)	I was concentrated when completing this questionnaire.	1 = Strongly disagree 5 = Strongly agree	.25–.39
Exerted Effort (3) (Adapted from Zhang et al., 2020 + self-made)	How much effort do you have to exert in order to complete this questionnaire as instructed?	0 = Zero effort 10 = All my efforts	NA
Exhaustion (3) (self-made)	How exhausted are you after completing this questionnaire?	0 = Not exhausted at all 10 = Completely exhausted	NA
Energy Level (3) (self-made)	Let's say your energy level was 10 before you start to work on this questionnaire. What's your current energy level after completing this questionnaire?	0 = Zero energy 10 = Full energy	NA

Respondent reactions (T2)	Example item	Rating scale	α
Fairness (2) (Chan et al., 1998; Lopez et al., 2019)	Overall, I believe the test was fair.	1 = Strongly disagree 5 = Strongly agree	.71–.84
Predictive Validity (2) (Kluger & Rothstein, 1993; Macan et al., 1994)	The test measured the skills necessary to perform well on the job.	1 = Strongly disagree 5 = Strongly agree	.82–.85
Satisfaction with Process (2) (Sylva & Mol, 2009; Tonidandel et al., 2002)	I liked taking this type of test.	1 = Strongly disagree 5 = Strongly agree	.69–.77
Organizational Attractiveness (2) (Highhouse et al., 2003)	For me, this company would be a good place to work.	1 = Strongly disagree 5 = Strongly agree	.83–.88
Intent to Accept Job (2) (Highhouse et al., 2003)	I would accept a job offer from this company.	1 = Strongly disagree 5 = Strongly agree	.75–.84
Face Validity (2) (Chan et al., 1998; Macan et al., 1994)	The actual content of the test is clearly related to the job.	1 = Strongly disagree 5 = Strongly agree	.84–.89
Intent to Recommend (2) (Highhouse et al., 2003; Smither et al., 1993)	Based on my experience with the test, I would recommend others to apply to this organization.	1 = Strongly disagree 5 = Strongly agree	.88–.93
Faking Resistance (2) (self-made)	It's hard to fake on this questionnaire.	1 = Strongly disagree 5 = Strongly agree	.87–.91
Accuracy (2) (Harris et al., 2021)	I believe the assessment accurately measured my personality.	1 = Strongly disagree 5 = Strongly agree	.56–.65

Note. Number after each construct name indicates the number of items measuring that construct. OCB = Organizational citizenship behavior; CWB = Counterproductive work behavior; JP = Job performance; JS = Job satisfaction; BNT = Burnout; TI = Turnover intentions; ORG = Organizational status; CHAR = Charity behaviors; SWB = Subjective well-being; FS = Financial Security; PHQ = Physical health; NARC = Narcissism; MACH = Machiavellianism; PSYCH = Psychopathy; H = Honesty-Humility; E = Emotionality; X = Extraversion; A = Agreeableness; C = Conscientiousness; O = Openness. For the “Theoretical HEXACO Correlates” column, a “(+)” notation indicates evidence for a positive association between the personality trait and the corresponding criterion variable, with an absolute magnitude of ≥.10. A “(-)” notation indicates evidence for a negative association between the personality trait and the corresponding criterion variable, with an absolute magnitude of ≥.10. A “(/)” notation indicates that the association between the personality trait and the corresponding criterion variable is small, with a magnitude < .10 (regardless of direction). Also for the “Theoretical HEXACO Correlates” column, notation not in italics represents evidence from a published meta-analysis, while italics represent evidence from a primary study. Reliabilities for Exerted Effort, Exhaustion, and Energy Level are not available as these variables were measured by only one item.

Respondent reactions

Respondent reaction measures at Time 1 focused on general perceptions of the FC measures, while those at Time 2 were tailored to the job application context. Items were either adapted from previous studies (Chan et al., 1998; Dalal et al., 2021; Harris et al., 2021; Highhouse et al., 2003; Lopez et al., 2019; Macan et al., 1994; Smither et al., 1993; Tonidandel et al., 2002; Zhang, Luo et al., 2023) or self-developed. Assessed facets and example items can be found in Table 1. Complete items can be found in Section 4 of the Online Supplementary Materials.

Quality control items

Six quality control items were embedded, with five in the Likert measures and one in the FC measure. For items embedded in Likert measures, respondents were instructed to endorse a particular response option (e.g., strongly disagree). The quality control block in the FC measure required participants to select the first statement from the block as “most like me” and the second statement as “least like me.” In all subsequent analyses, we screened out respondents who missed more than one quality control item.

Scoring

All four FC measures were scored using the TIRT model (Brown & Maydeu-Olivares, 2011) with the R package thurstonianIRT (version 0.12.1; Bürkner, 2019). Specifically, we used the Markov chain Monte Carlo approach with default diffuse priors to estimate the TIRT model. The TIRT model converged well for the four versions with the largest potential scale reduction factor (R hat) being less than 1.10. As the SS version of the HEXACO scales was identical across the four groups, we pooled their responses together and scored them by the Multidimensional Graded Response Model (Samejima, 1997) using the R package mirt (version 1.33.2; Chalmers, 2012) with the estimator based on Cai's (2010) Metropolis-Hastings Robbins-Monro algorithm². To ensure comparability across time, Time 2 responses were scored by fixing statement parameters to those obtained at Time 1. Maximum a posteriori estimates were obtained for both FC and SS personality measures. For the sake of simplicity, criterion and all other measures were scored using sum scores after reverse coding. For transparency, all data and analysis scripts were made available on the Open Science Framework: https://osf.io/yvpz3/?view_only=08601755f471440b80973194571b60bd.

Results Reporting

For psychometric properties, we reported (a) empirical reliability, computed as $\frac{\mathrm{var}(\hat{\theta})}{\mathrm{var}(\hat{\theta}) + \mathrm{mean}(SE(\hat{\theta})^{2})}$ (Brown & Maydeu-Olivares, 2018), (b) convergent validity between FC scores and their SS counterparts, (c) discriminant validity (intercorrelations) between traits, as well as similarity of intercorrelations as indexed by double-entry intraclass correlation (ICC; Furr, 2010), and (d) criterion-related validity of FC and SS, as well as the similarity between FC and SS criterion-related validity profiles as indexed by double-entry ICC. For fakability of FC and SS measures, we reported (e) rank-order stability of personality scores between honest and fake-good conditions, and (f) faking effect as indexed by Cohen's d between trait estimates obtained in honest and fake-good conditions. Respondent reactions were presented for each of the four FC measures at both time points. We reported descriptive statistics, McDonald's ω, and Cohen's d for pairwise comparisons between groups for each of the respondent reaction dimensions. Note that we focused on effect sizes instead of statistical significance.
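As an illustration of the empirical reliability formula in (a), the following minimal Python sketch computes the ratio from a vector of trait estimates and their standard errors. The function name, the toy data, and the use of the sample (n − 1) variance are assumptions made for illustration; the study itself obtained these quantities from the R packages named in the Scoring section.

```python
from statistics import mean, variance


def empirical_reliability(theta_hat, se):
    """Return var(theta_hat) / (var(theta_hat) + mean(SE^2)).

    theta_hat: trait estimates, one per respondent.
    se: standard errors of those estimates, in the same order.
    """
    if len(theta_hat) != len(se):
        raise ValueError("theta_hat and se must have the same length")
    if len(theta_hat) < 2:
        raise ValueError("at least two respondents are required")
    score_var = variance(theta_hat)  # sample variance (n - 1); an assumption
    error_var = mean([s ** 2 for s in se])  # average squared standard error
    return score_var / (score_var + error_var)


# Toy data, not taken from the study.
toy_theta = [-1.0, 0.0, 1.0]
toy_se = [0.5, 0.5, 0.5]
print(round(empirical_reliability(toy_theta, toy_se), 2))
```

The ratio approaches 1 when the standard errors are small relative to the spread of the trait estimates, and it shrinks as the average error variance grows.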

Results

Psychometric Properties

Reliability

As shown in Table 2, the empirical reliabilities of all six traits for the four FC measures were at least marginally acceptable (≥ .61). Also, the reliabilities of all four FC measures were consistently lower than those of the SS measure (FC: average reliabilities across traits ranging from .69 to .73 at Time 1 and .67 to .71 at Time 2; SS: average reliabilities across traits equal to .84 at Time 1 and Time 2), regardless of the FC design features or the measurement contexts. Furthermore, for all four FC measures, reliability estimates at Time 2 (fake-good condition) were slightly lower than those at Time 1 (honest condition). When the reliability estimates of each individual factor were compared across FC measures, the differences were mostly small (less than .10). Notable exceptions were (1) Extraversion between FC1 and the other three FC measures, where the reliability for FC1 was .13 lower than FC2, .10 lower than FC3, and .14 lower than FC4 at Time 1, and .10 lower than FC2 at Time 2, and (2) Conscientiousness between FC3 and FC4, where the reliability for FC3 was .10 lower at Time 1. Overall, results from Table 2 showed that the impact of the extent of social desirability matching and mixed keying on FC reliability was limited as long as there were at least three mixed triplets. In addition, empirical reliability did not seem to be substantially compromised even if participants were motivated to fake their responses. Readers interested in the standard error of measurement for each person score against their estimated latent trait levels can refer to Figures S1 and S2 in the Online Supplementary Materials.

Table 2.

FC and SS Empirical Reliability.

Trait	Time 1 = Honest					Time 2 = Fake-good
Trait	FC1	FC2	FC3	FC4	SS	FC1	FC2	FC3	FC4	SS
H	.70	.71	.74	.68	.85	.67	.73	.73	.64	.83
E	.73	.75	.67	.75	.83	.71	.70	.62	.67	.81
X	.66	.79	.76	.80	.87	.68	.78	.73	.76	.87
A	.68	.69	.67	.68	.81	.65	.69	.64	.66	.82
C	.65	.71	.63	.73	.85	.62	.69	.61	.70	.85
O	.73	.70	.67	.68	.83	.71	.67	.67	.66	.84
Mean	.69	.73	.69	.72	.84	.67	.71	.67	.68	.84

Note. H = Honesty-Humility; E = Emotionality; X = Extraversion; A = Agreeableness; C = Conscientiousness; O = Openness. FC = Forced-choice measure, SS = Single-statement measure. Empirical reliability for single-statement measure is calculated based on participants from all four study groups.

Convergent validity

Table 3 shows the raw and corrected (for unreliability) convergent validity of all four FC measures with their SS counterpart. To control for the confounding effect of different reliabilities, we interpret the corrected convergent validities. At Time 1, the average correlation between trait scores measured with FC and SS measures was substantially lower for FC1 (but still large in magnitude; M = .81, min = .59, max = 1.00) while the other three versions had very similar and higher convergent validity (FC2: M = .91, min = .79, max = 1.00; FC3: M = .94, min = .88, max = 1.00; FC4: M = .94, min = .85, max = 1.00). This pattern suggests that although the construct validity of all four FC measures was properly retained, FC2–FC4 still fared better than FC1. At Time 2, the convergent validity dropped substantially (FC1: M = .71, min = .53, max = .84; FC2: M = .73, min = .55, max = .84; FC3: M = .79, min = .74, max = .93; FC4: M = .77, min = .73, max = .89). Noticeably, convergent validity for some traits in FC1 and FC2 was substantially worse than that in FC3 or FC4, particularly for Conscientiousness (.53 and .55 for FC1 and FC2 vs. .74 for FC3 and FC4). However, we note that the lower convergent validity with SS scores at Time 2 was likely because SS scores were substantially distorted due to faking.

Table 3.

FC and SS Convergent Validity.

	Uncorrected								Corrected for unreliability
	Time 1 = Honest				Time 2 = Fake-good				Time 1 = Honest				Time 2 = Fake-good
	FC1	FC2	FC3	FC4	FC1	FC2	FC3	FC4	FC1	FC2	FC3	FC4	FC1	FC2	FC3	FC4
H	.54	.64	.72	.65	.45	.55	.59	.54	.69	.83	.91	.85	.60	.70	.75	.73
E (Reverse)	.81	.79	.74	.76	.65	.63	.54	.56	1.00	1.00	1.00	.95	.84	.84	.77	.75
X	.57	.79	.76	.84	.52	.60	.64	.62	.75	.95	.94	1.00	.67	.73	.81	.77
A	.67	.72	.64	.71	.57	.57	.54	.55	.89	.95	.88	.96	.77	.75	.76	.75
C	.44	.61	.67	.70	.38	.42	.54	.57	.59	.79	.92	.88	.53	.55	.74	.74
O	.75	.73	.75	.75	.65	.58	.70	.67	.96	.95	1.00	1.00	.84	.78	.93	.89
Mean	.63	.71	.71	.74	.54	.56	.59	.59	.81	.91	.94	.94	.71	.73	.79	.77

Note. H = Honesty-Humility, E = Emotionality, X = Extraversion, A = Agreeableness, C = Conscientiousness, O = Openness. FC = Forced-choice measure, SS = Single-statement measure. Time 1 matched sample size: N_FC1 = 541, N_FC2 = 528, N_FC3 = 543, N_FC4 = 535. Time 2 matched sample size: N_FC1 = 289, N_FC2 = 283, N_FC3 = 302, N_FC4 = 303. Corrected convergent validity estimates larger than 1.0 were set to 1.0.
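The correction for unreliability in Table 3 can be illustrated with the classical disattenuation formula, in which the raw correlation is divided by the square root of the product of the two reliabilities and then capped at 1.0. The sketch below is written under that assumption; the function name is hypothetical, and the inputs are the rounded Time 1 values from Tables 2 and 3, so results may differ from the published entries in the second decimal.

```python
import math


def disattenuate(r_xy, rel_x, rel_y, cap=1.0):
    """Correct a correlation for unreliability in both measures, then cap it."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    corrected = r_xy / math.sqrt(rel_x * rel_y)
    # Estimates beyond +/- cap are truncated, as described in the table note.
    return max(-cap, min(cap, corrected))


# Time 1, FC1 vs. SS, using the rounded values in Tables 2 and 3.
conscientiousness = disattenuate(0.44, 0.65, 0.85)
emotionality = disattenuate(0.81, 0.73, 0.83)  # exceeds 1 before capping
print(round(conscientiousness, 2), emotionality)
```

With these inputs the Conscientiousness value rounds to the .59 shown in Table 3, and the Emotionality value is truncated to 1.0.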

Discriminant validity

In Table 4, we reported the model-based latent correlations among the six traits for FC1–FC4 and SS. As we scored Time 2 responses by fixing model parameters obtained from Time 1 instead of separately estimating a model for each format at Time 2, we only reported discriminant validity information for Time 1. Overall, the trait intercorrelations of FC2, FC3, and FC4 showed similarly moderate agreement with those estimated from SS (ICCs = .71, .67, and .64), while the trait intercorrelations for FC1 were vastly different from those found in SS, as shown by the low ICC (.17). For example, the correlations of Emotionality with Honesty-Humility and Extraversion were .06 and −.34, respectively, for SS but were .36 and −.03 for FC1. These results suggested that FC1 demonstrated substantially lower construct validity in terms of the intercorrelations between the traits. Beyond these point estimates, the standard errors of the discriminant validity estimates of FC1 (M = .078) were also about 10% to 20% higher than those for FC2–FC4 (M = .064, .070, and .068).

Table 4.

Discriminant Validity at Time 1.

Trait pair	Time 1 latent correlations
	FC1		FC2		FC3		FC4		SS
	r	SE	r	SE	r	SE	r	SE	r	SE
H-E	.364	.070	.111	.060	.191	.067	.216	.067	.056	.056
H-X	-.130	.076	.174	.061	-.092	.062	.117	.069	.023	.051
H-A	.250	.067	.086	.067	.129	.069	.226	.070	.230	.044
H-C	.304	.085	.343	.065	.188	.068	.273	.066	.216	.029
H-O	.147	.077	.047	.064	.119	.065	.251	.068	.015	.051
E-X	-.027	.094	-.185	.059	-.311	.065	-.153	.062	-.337	.044
E-A	.075	.072	-.051	.069	.020	.074	-.127	.069	-.188	.039
E-C	.162	.095	-.157	.065	.102	.081	-.115	.068	-.078	.050
E-O	.114	.079	.092	.066	-.012	.071	.125	.066	-.003	.054
X-A	-.003	.079	.442	.055	.220	.065	.089	.067	.235	.048
X-C	-.096	.079	.461	.052	.047	.075	.369	.059	.204	.049
X-O	.114	.073	.087	.063	.086	.066	.149	.066	.070	.055
A-C	.003	.077	.124	.071	-.078	.076	.151	.070	.111	.036
A-O	.024	.071	.069	.071	.118	.073	.227	.079	-.004	.028
C-O	.159	.080	.064	.065	.054	.077	.152	.067	.102	.052
Mean Absolute Correlation & SE	.131 (.078)		.166 (.064)		.118 (.070)		.183 (.068)		.125 (.046)
ICC with SS	.17		.71		.67		.64		NA

Note. H = Honesty-Humility; E = Emotionality; X = Extraversion; A = Agreeableness; C = Conscientiousness; O = Openness. The SS estimates were based on merged sample from all four conditions, which results in smaller standard errors.
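To make the "ICC with SS" row concrete, the sketch below computes a double-entry ICC as a Pearson correlation on doubled data, in which every pair of values is entered once in each order. This is the usual textbook form of the index and is assumed, not confirmed, to match the authors' computation; the function name is hypothetical, and the inputs are the rounded FC1 and SS correlations from Table 4.

```python
def double_entry_icc(profile_a, profile_b):
    """Pearson correlation computed after entering each pair in both orders."""
    if len(profile_a) != len(profile_b) or len(profile_a) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    x = list(profile_a) + list(profile_b)
    y = list(profile_b) + list(profile_a)
    grand_mean = sum(x) / len(x)  # x and y share the same mean and variance
    cross = sum((a - grand_mean) * (b - grand_mean) for a, b in zip(x, y))
    total = sum((a - grand_mean) ** 2 for a in x)
    if total == 0:
        raise ValueError("profiles have no variance")
    return cross / total


# Latent correlations from Table 4, in row order (H-E, H-X, ..., C-O).
FC1 = [.364, -.130, .250, .304, .147, -.027, .075, .162,
       .114, -.003, -.096, .114, .003, .024, .159]
SS = [.056, .023, .230, .216, .015, -.337, -.188, -.078,
      -.003, .235, .204, .070, .111, -.004, .102]

print(round(double_entry_icc(FC1, SS), 2))
```

On these rounded inputs the result agrees with the .17 reported for FC1. Unlike an ordinary Pearson correlation, the index is sensitive to differences in elevation and scatter between the two profiles, not only to their shape.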

Criterion-related validity

Table 5 presents the double-entry ICC between the validity profiles of each FC measure and the SS measure. The underlying correlations were corrected for unreliability of personality scores to control for the differential reliabilities of the FC and SS measures (see Table S4 and Table S5 in the Online Supplementary Materials for full corrected and raw correlations). ICCs for each factor and across all HEXACO factors are reported. Generally speaking, FC1 was the least similar to the SS in terms of criterion-related validity (ICC = .54) and FC4 was the most similar (ICC = .96). FC2 and FC3 also had validity profiles similar to that of the SS (ICCs = .83 and .77 for FC2 and FC3). When examined trait by trait, the patterns were also consistent with those revealed by the overall ICCs. Specifically, FC1 displayed the validity profiles that were least similar to the SS (double-entry ICCs = −.04, .08, and .23 for Conscientiousness, Openness, and Honesty-Humility). In contrast, FC4 demonstrated the highest resemblance to the SS for every trait except Emotionality, where FC1 was marginally higher (.97 vs. .96). We also observed some trait specificity beyond the general pattern. For example, the validity profiles of Emotionality for FC1, FC2, and FC4 were highly similar to those of the SS (ICCs ranging from .94 to .97), while the ICC for FC3 Emotionality was somewhat lower (ICC = .79). Additionally, the ICCs for Openness were consistently at the lower end of the six traits across the four FC measures (ICCs = .08, .40, .66, and .68 for FC1–FC4), which was likely an artifact of range restriction because Openness correlated weakly with all criteria.

Table 5.
R² of HEXACO Trait Scores Predicting Criterion Variables and Double Entry ICCs of Validity Profile between FC and SS Measures, Corrected for Unreliability of Personality Scores.

Criterion variable	FC1	FC2	FC3	FC4	SS
MACH	.181	.137	.268	.328	.377
PSYCH	.187	.152	.316	.290	.421
NARC	.276	.135	.175	.275	.217
OCB	.048	.136	.110	.104	.096
CWB	.079	.076	.174	.173	.191
JP	.061	.043	.099	.157	.248
JS	.114	.146	.062	.117	.145
BNT	.232	.279	.133	.333	.311
TI	.104	.138	.088	.115	.162
SWB	.127	.322	.169	.364	.353
FS	.124	.120	.055	.080	.135
PHQ	.189	.135	.096	.168	.186
EDU	.038	.057	.036	.064	.041
WAG	.046	.032	.047	.098	.057
ORG	.163	.112	.063	.093	.089
CHAR	.048	.074	.070	.079	.045
GEN	.142	.225	.365	.172	.217
AGE	.036	.017	.057	.097	.083
TEN	.028	.009	.056	.065	.061
Mean R²	.117	.123	.129	.167	.181
FC-SS Validity Profile ICC	FC1	FC2	FC3	FC4	SS
Honesty-Humility	.23	.67	.88	.96	/
Emotionality	.97	.94	.79	.96	/
Extraversion	.60	.92	.75	.97	/
Agreeableness	.71	.81	.52	.96	/
Conscientiousness	−.04	.70	.72	.95	/
Openness	.08	.40	.66	.68	/
Overall	.54	.83	.77	.96	/

Note. MACH = Machiavellianism; PSYCH = Psychopathy; NARC = Narcissism; CWB = Counterproductive work behavior; JS = Job satisfaction; BNT = Burnout; FS = Financial Security; OCB = Organizational citizenship behavior; SWB = Subjective well-being; TI = Turnover intentions; JP = Job performance; PHQ = Physical health; EDU = Education; WAG = Wage; ORG = Organizational status; CHAR = Charity behaviors.

Table 5 also presents the R² obtained when all six personality traits jointly predict each criterion variable. Averaging across all criterion variables, FC4 demonstrated the highest average R² (.167) among the four FC measures, which was also close to the one produced by SS (.181). This was followed by FC3 (.129) and FC2 (.123), while FC1 had the lowest average R² (.117). If we further differentiate criterion variables that were subjectively assessed by Likert-type measures (e.g., dark personality, OCB, and CWB) from more objectively reported criterion variables (e.g., education, wages), we can see that the SS measure displayed superiority over the FC measures in predicting the former (average R²s were .144, .152, .145, .209, and .237 for FC1, FC2, FC3, FC4, and SS) but showed no advantage at all in predicting the latter (average R²s were .072, .075, .099, .095, and .085 for FC1, FC2, FC3, FC4, and SS). Overall, these results showed that criterion-related validity was the best for FC4 and the worst for FC1.
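The averages in the preceding paragraph can be retraced from the Table 5 entries with the short Python sketch below. The split into objectively reported criteria (EDU, WAG, ORG, CHAR, GEN, AGE, TEN) versus all remaining criteria is an assumption: it was inferred because it reproduces the reported averages, and it is not stated explicitly in the text. All names in the code are illustrative.

```python
from statistics import mean

MEASURES = ("FC1", "FC2", "FC3", "FC4", "SS")

# R-squared values copied from Table 5, in MEASURES order.
R_SQUARED = {
    "MACH": (.181, .137, .268, .328, .377),
    "PSYCH": (.187, .152, .316, .290, .421),
    "NARC": (.276, .135, .175, .275, .217),
    "OCB": (.048, .136, .110, .104, .096),
    "CWB": (.079, .076, .174, .173, .191),
    "JP": (.061, .043, .099, .157, .248),
    "JS": (.114, .146, .062, .117, .145),
    "BNT": (.232, .279, .133, .333, .311),
    "TI": (.104, .138, .088, .115, .162),
    "SWB": (.127, .322, .169, .364, .353),
    "FS": (.124, .120, .055, .080, .135),
    "PHQ": (.189, .135, .096, .168, .186),
    "EDU": (.038, .057, .036, .064, .041),
    "WAG": (.046, .032, .047, .098, .057),
    "ORG": (.163, .112, .063, .093, .089),
    "CHAR": (.048, .074, .070, .079, .045),
    "GEN": (.142, .225, .365, .172, .217),
    "AGE": (.036, .017, .057, .097, .083),
    "TEN": (.028, .009, .056, .065, .061),
}

# Assumed grouping (inferred, not stated in the text).
OBJECTIVE = {"EDU", "WAG", "ORG", "CHAR", "GEN", "AGE", "TEN"}
SUBJECTIVE = [c for c in R_SQUARED if c not in OBJECTIVE]


def mean_r_squared(criteria):
    """Average R-squared per measure over the given criterion labels."""
    return {
        measure: mean([R_SQUARED[c][i] for c in criteria])
        for i, measure in enumerate(MEASURES)
    }


for label, group in (("subjective", SUBJECTIVE), ("objective", sorted(OBJECTIVE))):
    averages = mean_r_squared(group)
    print(label, {m: round(v, 3) for m, v in averages.items()})
```

Because the inputs are rounded to three decimals, the printed averages can differ from the reported ones in the last digit.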

Fakability

Rank-order stability

Raw and corrected correlations between the same personality trait scores obtained in both honest and fake-good conditions are presented in Table 6. To control for the confounding effect of reliability differences, we focused on corrected rank-order stability. A more faking-resistant measure should be better at preserving respondents’ rank orders across honest and fake-good conditions. As expected, FC1 was the most faking-resistant (M = .81, min = .72, max = .95), and FC4 was the least faking-resistant (M = .71, min = .54, max = .90). All FC measures except FC4 were more faking-resistant than the SS measure (M = .72, min = .62, max = .89). Averaged across measures, Conscientiousness appeared to be most susceptible to faking, while Openness consistently showed the lowest susceptibility to faking. Correlations between Time 1 SS scores and Time 2 FC scores can be found in Table S7.

Table 6.
Rank-Order Stability and Mean Score Differences.

	Rank-order stability										Cohen's d (T2–T1)
	Uncorrected					Corrected for unreliability					Uncorrected					Corrected for unreliability
	FC1	FC2	FC3	FC4	SS	FC1	FC2	FC3	FC4	SS	FC1	FC2	FC3	FC4	SS	FC1	FC2	FC3	FC4	SS
H	.65	.53	.55	.51	.62	.95	.75	.75	.78	.73	.03	.31	.19	.48	.37	.03	.36	.22	.58	.40
E (Reverse)	.51	.48	.58	.47	.63	.72	.66	.91	.66	.77	.48	.33	.14	.38	.40	.56	.39	.17	.45	.44
X	.56	.49	.55	.49	.54	.83	.62	.74	.63	.62	.21	.32	.49	.68	.61	.25	.36	.56	.77	.65
A	.48	.56	.45	.49	.52	.72	.81	.68	.73	.64	.17	.33	.26	.56	.46	.21	.40	.32	.68	.51
C	.52	.47	.44	.38	.54	.81	.67	.70	.54	.64	.26	.55	.59	.98	.50	.33	.66	.75	1.16	.55
O	.59	.57	.65	.61	.75	.83	.83	.96	.90	.89	.11	.07	.23	.48	.25	.13	.08	.28	.58	.28
Mean	.55	.52	.54	.49	.60	.81	.72	.79	.71	.72	.21	.32	.31	.59	.43	.25	.37	.38	.70	.47

Note. H = Honesty-Humility; E = Emotionality; X = Extraversion; A = Agreeableness; C = Conscientiousness; O = Openness; FC = Forced-choice measure; SS = Single-statement measure.

Mean score inflation

Raw and corrected (using formulas from Wiernik & Dahlke, 2020) standardized mean score differences (Cohen's d) between the honest and fake-good conditions can also be found in Table 6. Again, we focused on corrected effect sizes to account for the confounding effect of reliability differences. A more faking-resistant test should have smaller mean score inflation. As expected, FC1 was the most faking-resistant (M = 0.25, min = 0.03, max = 0.56), and FC4 was the least faking-resistant (M = 0.70, min = 0.45, max = 1.16). FC2 and FC3 were in between and performed very similarly (FC2: M = 0.37, min = 0.08, max = 0.66; FC3: M = 0.38, min = 0.17, max = 0.75) but showed differential faking effects across traits. For example, participants seemed to inflate their scores more easily on Honesty-Humility and Emotionality in FC2, while the same trend was observed for Extraversion and Openness in FC3. Inspections by trait yielded the same conclusion as when examining rank-order stability: averaged across FC measures, Conscientiousness consistently showed the highest susceptibility to faking, while Openness consistently exhibited the lowest susceptibility. Interestingly, the observed faking effect sizes in FC4 were higher than those in SS (M = 0.47, min = 0.28, max = 0.65).
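As a rough illustration of how a raw and a corrected faking effect relate, the sketch below computes Cohen's d with a pooled standard deviation and then divides it by the square root of a reliability estimate. Both steps are assumptions: this is one common correction for measurement error and may not be the exact Wiernik and Dahlke (2020) formula the authors applied, and averaging the Time 1 and Time 2 reliabilities is likewise only a guess. Function names are hypothetical.

```python
import math
from statistics import mean, variance


def cohens_d(honest, faked):
    """Standardized mean difference (faked minus honest), pooled SD.

    The simple average of the two variances assumes equal group sizes.
    """
    pooled_sd = math.sqrt((variance(honest) + variance(faked)) / 2)
    return (mean(faked) - mean(honest)) / pooled_sd


def correct_d_for_unreliability(d, reliability):
    """Divide d by the square root of the score reliability (assumed form)."""
    if not 0 < reliability <= 1:
        raise ValueError("reliability must lie in (0, 1]")
    return d / math.sqrt(reliability)


# Toy scores, not taken from the study: a one-SD upward shift.
print(cohens_d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0]))

# FC4 Conscientiousness: raw d from Table 6, reliabilities from Table 2.
assumed_reliability = (0.73 + 0.70) / 2  # averaging is an assumption
print(round(correct_d_for_unreliability(0.98, assumed_reliability), 2))
```

Under these assumptions the FC4 Conscientiousness value rounds to the 1.16 shown in Table 6, although other entries may deviate slightly because the inputs are rounded.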

Respondent Reactions

As displayed in Table 7, in the honest condition, we found almost no meaningful differences in any aspect of respondent reactions. Specifically, after adjusting for multiple comparisons, none of the differences was statistically significant (|d|s ≤ .19). Likewise, in the fake-good condition, the majority of the comparisons were statistically nonsignificant with tiny effect sizes, except for perceived faking resistance between FC2 and FC4, where FC2 was perceived as more faking-resistant than FC4 (adjusted p < .05, d = .30).

Table 7.
Respondent Reactions.

	FC1			FC2			FC3			FC4			Cohen's d
	M	SD	ω	M	SD	ω	M	SD	ω	M	SD	ω	1–2	1–3	1–4	2–3	2–4	3–4
Time 1 Respondent Reactions
Affect	4.04	0.70	.75	4.01	0.69	.74	3.99	0.68	.76	4.00	0.65	.68	.05	.08	.06	.03	.01	−.02
Accuracy	3.39	0.75	.68	3.38	0.81	.75	3.34	0.70	.68	3.39	0.74	.69	.02	.07	−.003	.05	−.02	−.07
Utility	3.33	0.80	.76	3.33	0.81	.77	3.32	0.73	.72	3.39	0.73	.72	.00	.02	−.07	.01	−.07	−.09
Faking Resistance	2.93	0.89	.77	2.88	0.84	.68	2.89	0.78	.68	2.87	0.80	.66	.06	.05	.07	−.02	.01	.02
Difficulty	2.47	0.91	.76	2.46	0.93	.78	2.50	0.90	.77	2.49	0.90	.75	.01	−.04	−.02	−.04	−.03	.02
Burden	2.39	0.68	.59	2.38	0.73	.59	2.37	0.66	.52	2.40	0.70	.59	.01	.02	−.02	.01	−.03	−.04
Concentration	3.85	0.62	.53	3.80	0.60	.46	3.75	0.61	.54	3.79	0.60	.52	.07	.15	.10	.08	.02	−.06
Exerted Effort	7.83	2.24	-	7.81	2.29	-	7.43	2.06	-	7.58	2.08	-	.01	.19	.12	.17	.11	−.07
Exhaustion	3.25	2.36	-	3.40	2.48	-	3.51	2.30	-	3.62	2.39	-	−.06	−.11	−.16	−.05	−.09	−.05
Energy Level	9.33	1.66	-	9.19	1.64	-	9.28	1.59	-	9.30	1.64	-	.09	.03	.02	−.06	−.07	−.01
Time 2 Respondent Reactions
Fairness	3.22	1.00	.84	3.35	0.95	.81	3.31	0.86	.71	3.36	0.90	.75	−.13	−.10	−.15	.05	−.01	−.06
Validity	2.51	1.06	.85	2.65	1.06	.83	2.48	0.93	.82	2.58	1.08	.85	−.13	.04	−.06	.17	.07	−.10
Satisfaction	3.08	1.01	.77	3.23	0.96	.71	3.15	0.92	.72	3.27	0.92	.69	−.16	−.07	−.19	.09	−.03	−.13
Org Attractiveness	3.66	0.98	.88	3.75	0.87	.86	3.62	0.84	.84	3.72	0.84	.83	−.10	.04	−.07	.16	.04	−.13
Intent to Accept	4.10	0.79	.82	4.16	0.72	.84	4.07	0.69	.75	4.15	0.72	.82	−.08	.04	−.07	.13	.01	−.12
Face Validity	2.67	1.11	.88	2.77	1.07	.89	2.71	1.00	.84	2.78	1.05	.86	−.09	−.04	−.10	.06	−.01	−.07
Intent to Recommend	3.24	1.05	.93	3.33	0.97	.89	3.14	0.89	.88	3.30	0.94	.91	−.09	.11	−.06	.21	.03	−.18
Faking Resistance	2.60	1.13	.91	2.84	1.16	.88	2.65	1.09	.87	2.50	1.12	.88	−.21	−.05	.09	.17	.30*	.14
Accuracy	3.19	0.91	.65	3.34	0.86	.61	3.20	0.77	.56	3.32	0.88	.65	−.16	−.01	−.14	.17	.02	−.14

Note. *Statistically significant after adjusting for multiple comparisons. Significant differences were bolded. FC = Forced-choice measure.

Discussion

When intended for high-stakes situations, good psychometric properties and strong faking resistance are the two primary yet somewhat contradictory requirements for FC design. Achieving faking resistance often necessitates social desirability matching, but good psychometric properties require some mixed-keyed blocks, which often inevitably entail social desirability mismatching. In addition, respondent reactions are important in both low- and high-stakes situations because they may impact data quality and recruitment outcomes. However, no empirical evidence is yet available regarding the effects of different levels of social desirability matching and mixed keying on the psychometric properties of, faking resistance of, and respondents’ reactions to FC measures. To fill this critical empirical gap, we conducted the first time-lagged experimental study to examine these three issues under different conditions by manipulating the levels of social desirability matching and mixed keying. Results showed that (1) the impact of social desirability matching and mixed keying on reliability was small (as long as there were at least three mixed keyed blocks), (2) FC measures with more mixed keyed blocks had substantially higher convergent validity with SS, criterion-related and discriminant validity profiles more similar to those of SS, and better prediction of criterion variables, (3) FC measures with better social desirability matching were generally more faking-resistant, and (4) different combinations of mixed keying and social desirability matching had negligible impact on respondents’ reactions in both honest and fake-good conditions. These findings demonstrate that it is possible to find a sweet spot between social desirability matching and mixed keying and thus construct a psychometrically sound and faking-resistant FC measure. Based on these findings and our first-hand experience with FC construction, we also provide tentative guidance on how to construct such a measure.

Mixed Keying or Social Desirability Matching?

Psychometric properties and faking resistance

Building upon initial attempts to reach a possible sweet spot between the two design criteria (Lee et al., 2022), we provided the first comprehensive empirical investigation of the effects of different mixed keying and social desirability matching combinations. First, as expected, focusing solely on either social desirability matching (FC1) or mixed keying (FC4) could not produce FC measures that are both faking resistant and psychometrically sound. For FC1, although its high degree of social desirability matching brings notable advantages in faking resistance, the lack of mixed keyed blocks undermines its convergent validity, discriminant validity, and criterion-related validity. This essentially brings into question the construct validity of the scores. On the other hand, for FC4, its extensive use of mixed keyed blocks indeed allows its construct validity to be well maintained, consistent with the emphasis on mixed keying in simulation studies. However, relaxing social desirability matching too much also renders it more fakable than its FC counterparts. The utility of FC4 under the fake-good condition is hence limited.

These negative outcomes revealed by FC1 and FC4 suggest that for FC measures to be as valid and faking resistant as they are supposed to be, scale developers need to consider designing FC measures within a “middle ground.” As such, FC2 and FC3, representing the “middle ground” compromise between mixed keying and social desirability matching, demonstrate a better balance between psychometric properties and faking resistance. Comparing FC1 with FC2, consistent with Lee et al. (2022), psychometric properties of FC measures can be effectively improved and reach an acceptable level with the inclusion of just three more mixed keyed blocks (i.e., from 15% to 30%). We further extended their findings by showing that such improvement can even be achieved with a slight compromise in social desirability matching. Comparing FC3 with FC4, we found that faking resistance can be substantially strengthened with a better match in terms of social desirability (from 1.09 to 0.86 in terms of mean block desirability discrepancy), without reducing the number of mixed keyed blocks. Even more importantly, although FC2 and FC3 differed in terms of social desirability matching and mixed keying, their psychometric properties and faking resistance were largely similar. Admittedly, the trade-off still exists and a certain amount of loss in desirable psychometric properties is unavoidable, but such a trade-off seems acceptable for keeping FC as a both valid and faking-resistant measurement tool. We also acknowledge that it is not an easy task to find such a balance manually given a fixed statement pool, because the number of possible combinations can be astronomical. Hence, we recommend researchers use the R package autoFC (Li et al., 2022) to automate the search process and find the nearly optimal solutions.

Respondent reactions

No substantial impact was found for social desirability matching or mixed keying on respondent reactions. This is reassuring because it suggests that test developers do not need to worry a lot about respondent reactions when developing new FC measures.

Forced-Choice vs. Single Statement Measures

Although the primary focus of the present study is on the comparisons across the four FC measures, we believe the comparisons between the FC and the SS measures may also be of interest. By design, FC measures are less susceptible to or even immune from multiple response biases that plague the SS format (Kreitchmann et al., 2019; Zhang, Luo et al., 2023). Our study further demonstrated that FC measures can be designed to maintain good construct validity. Nevertheless, readers may still be legitimately concerned about the utility of FC measures, given their relatively lower reliability estimates compared to their SS counterparts. The reliability discrepancy between the FC and the SS measures may originate from two sources. First, the FC responses are dichotomous in nature because respondents are only allowed to choose A or B. In comparison, the SS format allows respondents to indicate their degree of agreement. When holding other factors constant, dichotomous responses provide less information than graded ones, resulting in lower reliability of FC measures (Brown & Maydeu-Olivares, 2018). Fortunately, we can easily add a few more hard-to-fake desirability-matched blocks to FC2/FC3 to make trait scores derived from them as reliable as those from the SS format while maintaining their faking-resistance. However, it is much harder (if at all possible) to make the SS format as faking-resistant as the FC format. Second, it is well-known that the SS format is susceptible to various response biases, such as acquiescent, midpoint, and extreme response styles (Li et al., 2021; Plieninger & Heck, 2018; Sun et al., 2019; Sun et al., 2022). These systematic but construct-irrelevant biases can inflate reliability estimates.
In the Online Supplementary Materials (Tables S8–S12 and Figures S5–S7), we presented additional analysis results where we corrected the SS scores for three common response biases (acquiescence, extreme responding, and midpoint responding) using the method developed by Plieninger and Heck (2018). It turned out that, after correction, the average reliability of the SS scores dropped from .84 to .70, which was very similar to that of FC. Taken together, these additional results suggested that the higher reliability estimates of SS were inflated, at the very least to some extent, by response biases, and that the FC format can mitigate these issues and provide more realistic estimates when carefully designed.

Another important finding is that FC4 demonstrated even greater susceptibility to faking compared to SS. Many papers have discussed the FC format as more faking-resistant than the SS format without properly noting that they have to be thoughtfully designed to be so. Our finding highlighted that the FC format is NOT a panacea for preventing faking. When desirable and undesirable statements are contrasted with each other within the same block, the social desirability difference among statements may become even more salient than when they are presented separately, thus making such blocks more susceptible to faking than their constituting statements (McCloy et al., 2005). If there are a substantial number of such blocks, FC measures can be even more fakable than their SS counterparts. Therefore, we urge users interested in using the FC format to counteract faking to be aware of this issue.

Recommended Steps to Develop FC Measures

Despite all the promises of the FC format, many people still find it difficult to develop a good FC measure due to the lack of guidelines. To promote a wider adoption of the FC format in organizational research and practices, below we provide a step-by-step guideline on how to develop high-quality FC measures based on our research findings and first-hand experience. These recommendations are intended as tentative guidelines that should be updated with more empirical evidence in the future rather than a gold standard.

Step 1. Generate a sufficient pool of high-quality statements for focal traits and obtain statement parameters. Several excellent guidelines have provided detailed discussions on how to write and select high-quality statements (Cao et al., 2015; Clark & Watson, 1995; 2019; Hinkin, 1998; Lambert & Newman, 2022; Worthington & Whittaker, 2006). Readers are encouraged to refer to them for more details. Here we want to emphasize the following considerations in the context of FC measure development. First, we should avoid the use of extremely worded statements (e.g., “I have never complained about anything”). Avoiding extreme wording can substantially lower the risk of the statement being too socially (un)desirable and hence too difficult to be matched with other statements. Second, it is important to keep a small proportion of negatively keyed statements (2–4 per trait) because we need them for mixed keyed blocks. Third, it is strongly recommended to keep more statements per trait than needed for a target FC measure as this can greatly ease the pairing in subsequent steps. In this step, researchers can also obtain statement parameters that will be used in the following steps. Specifically, if the test-developer adopts a dominance-response-process-based approach, it is recommended to fit a correlated-factor-analysis model to responses to the single statements and record the standardized factor loadings, statement intercepts, variance of statement uniqueness, and latent correlations among traits. If they adopt an unfolding-model-based approach, it is recommended to fit a Multidimensional Generalized Graded Unfolding Model (Tu et al., 2021; Tu et al., 2023; Wang & Wu, 2016) to the dichotomized responses and record statement discrimination, location, and threshold parameters, and latent correlations among traits.

Step 2. Obtain social desirability estimates of statements. There are three approaches to obtaining social desirability estimates for statements developed in Step 1. The first approach is direct rating where a small group of subject matter experts provide their direct ratings of the social desirability of each statement on a Likert scale (e.g., 1 = Very undesirable, 5 = Very desirable; see examples from Vasilopoulos et al., 2006; and Wetzel et al., 2021). Subject matter experts can be asked to rate the general and/or job-specific social desirability of each statement, depending on the intended use of the measure: If the measure is designed for use in specific jobs or organizations, then job-specific social desirability can be more appropriate; if the measure is intended for selection across jobs/organizations, then general social desirability is preferred. The second approach is to ask respondents to respond to these statements as if they were ideal job candidates (Naemi et al., 2014; Stark et al., 2005). These fake-good responses can also be used to operationalize the social desirability of statements. Recently, Hommel (2023) demonstrated that natural language processing techniques can also be used to predict statement social desirability with high accuracy. As of now, we recommend the direct rating approach because it is the most straightforward operationalization of social desirability. Fake-good responses may be contaminated by other irrelevant factors such as faking motivation. The natural language processing approach is promising but ignores individual differences in the perception of statement social desirability. Further, we recommend researchers to (1) use at least 30 participants for more reliable estimates of social desirability, (2) ensure each trait has statements spanning a similar range of social desirability levels, and (3) examine interrater agreement and prioritize statements whose social desirability was agreed upon by most raters.

Step 3. Determine block size. One of the most important decisions when developing FC measures is block size, which could range from 2 to the total number of statements (full ranking task, which is impractical with any substantial number of statements). When making this decision, researchers need to consider psychometric properties and respondents’ cognitive load. Larger block sizes should demonstrate superior psychometric properties because they produce more pairwise comparisons, but may also impose heavier cognitive load on respondents, potentially leading to compromised respondent reactions and data quality, thereby jeopardizing psychometric properties (Brown & Maydeu-Olivares, 2011). Surprisingly, very few studies have systematically examined the impact of block size on psychometric properties of (but see Frick et al., 2023 for an exception) and respondent reactions to FC measures. Drawing from our own results obtained from three samples with >4,500 respondents in another ongoing project (results available upon request as we are still writing this manuscript), we found minor differences (Cohen's ds = −.16 to .18) on perceived difficulty, exhaustion and cognitive load between FC measures with block sizes of three and five when holding statements constant. As such, block sizes ranging from 3 to 5 can all be considered as reasonable for static FC measures (all respondents received identical blocks) because they strike a good balance between psychometric information and respondent reactions. Five is also the up-to-date estimate of the upper limit of working memory capacity for meaningful chunks for adults (Cowan, 2010; Halford et al., 2007). A block size of 2 is recommended for computerized adaptive tests because it is much easier to implement (Stark et al., 2012), but not for static FC measures because it is not very psychometrically efficient. If researchers have specific reasons to maintain a block size of 2, we recommend using the graded FC format. 
This format allows respondents to indicate their degree of preference, thereby providing more psychometric information (Brown & Maydeu-Olivares, 2018; Zhang, Luo et al., 2023; Zhang, Tu et al., 2023) and potentially fostering more positive respondent reactions (Dalal et al., 2021). It is also recommended that for multidimensional FC measures, block size should not exceed the number of measured latent traits because we generally want to avoid having more than one statement of the same latent trait in the same block.

Step 4. Determine the number of mixed keyed blocks. After obtaining social desirability estimates and deciding on block size, researchers need to decide on the number of mixed keyed blocks. Previous simulations demonstrated that 20–30% of mixed keyed blocks in a triplet format were sufficient for maintaining satisfactory reliability of trait scores (Lee et al., 2022). Our empirical findings further confirmed that this setting can also maintain sufficient faking-resistance. However, it should be noted that it is hard to recommend an absolute number that universally applies to all FC measures because it depends on block size and the number of latent traits being measured. We also note that what matters for psychometric properties is the number of mixed keyed pairs (recoded pairwise comparisons) and what matters for fakability is the proportion of mixed keyed blocks. Our findings suggest that 6 mixed keyed triplets (6÷20 = 30% mixed keyed blocks) and 14 matched triplets, corresponding to 12 mixed keyed pairs (each mixed keyed triplet has two mixed keyed pairs and one matched pair) and 48 matched pairs (14 × 3 = 42 matched pairs from matched triplets, and 6 × 1 = 6 from mixed keyed triplets) when recoded into pairwise comparison items, suffice for measuring six traits. This means that 30% or fewer mixed keyed blocks and two mixed keyed pairs per trait without duplication would be a reasonable recommendation. Of course, more matched pairs will be even better as they provide more information without impacting fakability. Suppose three researchers each want to measure 12 traits using FC measures; they need to have at least 24 mixed keyed pairs (12 traits × 2 mixed keyed pairs per trait) regardless of the block size.
If researcher A plans to use a block size of 3, there should be 12 mixed keyed triplets (24 mixed keyed pairs÷2 mixed keyed pairs per mixed keyed triplet) and 28 ([12 mixed keyed triplets÷30%] × 70%) or more matched triplets (28 × 3 + 12 = 96 matched pairs or more); if researcher B wants to use a block size of 4 and designs the mixed keyed blocks as containing 2 positively keyed statements + 2 negatively keyed statements (four mixed keyed pairs and two matched pairs per mixed keyed block), they need to have 6 (24÷4) mixed keyed quadruplets and 14 ([6÷30%] × 70%) or more matched quadruplets (6 × 2 + 14 × 6 = 96 matched pairs or more); if researcher C uses a block size of 5 and designs the mixed keyed blocks as containing 2(3) positively keyed statements + 3(2) negatively keyed statements (six mixed keyed pairs and four matched pairs per mixed keyed block), they need to have 4 (24÷6) mixed keyed quintets and 10 ([4÷30%] × 70%, rounded up) or more matched quintets (4 × 4 + 10 × 10 = 116 matched pairs or more). Furthermore, we recommend that mixed-keyed triplets be composed of 2(1) positively and 1(2) negatively keyed statements, mixed-keyed quadruplets be composed of two positively and two negatively keyed statements, and mixed-keyed quintets be composed of 3(2) positively and 2(3) negatively keyed statements. These designs allow the maximum number of mixed keyed pairs to appear.
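The arithmetic in Step 4 can be sketched in a few lines of Python. This is only an illustration of the counting rules stated above, not part of the autoFC package: the function name `plan_blocks` and its arguments are hypothetical, and the sketch assumes that mixed keyed blocks split positively and negatively keyed statements as evenly as possible, that two mixed keyed pairs per trait are required, and that mixed keyed blocks make up at most 30% of all blocks.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Integer ceiling division (avoids floating-point rounding surprises)."""
    return -(-numerator // denominator)


def plan_blocks(n_traits, block_size, mixed_pairs_per_trait=2, max_mixed_percent=30):
    """Minimum block counts implied by the Step 4 counting rules (hypothetical helper)."""
    # Every block of size k recodes into k * (k - 1) / 2 pairwise comparisons.
    pairs_per_block = block_size * (block_size - 1) // 2
    # An even split of keying directions maximizes mixed keyed pairs per block.
    n_positive = block_size // 2
    mixed_pairs_per_block = n_positive * (block_size - n_positive)
    matched_pairs_per_mixed_block = pairs_per_block - mixed_pairs_per_block

    mixed_pairs_needed = n_traits * mixed_pairs_per_trait
    n_mixed_blocks = ceil_div(mixed_pairs_needed, mixed_pairs_per_block)
    # Smallest number of matched blocks that keeps the mixed share at or below the cap.
    n_matched_blocks = ceil_div(n_mixed_blocks * (100 - max_mixed_percent), max_mixed_percent)

    return {
        "mixed_blocks": n_mixed_blocks,
        "matched_blocks": n_matched_blocks,
        "mixed_pairs": n_mixed_blocks * mixed_pairs_per_block,
        "matched_pairs": (n_mixed_blocks * matched_pairs_per_mixed_block
                          + n_matched_blocks * pairs_per_block),
    }


# Researchers A, B, and C from the text: 12 traits, block sizes 3, 4, and 5.
for size in (3, 4, 5):
    print(size, plan_blocks(n_traits=12, block_size=size))
```

With six traits and a block size of 3, the same function returns 6 mixed keyed and 14 matched triplets, which corresponds to the design described in the text.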

Step 5. Create blocks. While mixed keyed blocks almost inevitably involve bundling desirable and undesirable statements, researchers can still try some degree of matching by putting moderately desirable and moderately undesirable statements together instead of putting very desirable and very undesirable statements together. Therefore, we recommend users to construct mixed keyed blocks first so that they have the largest statement pool to choose from. For any FC measures, researchers should try to ensure that (1) statements within the same block measure different latent traits, (2) each trait should be paired with all other traits for about an equal number of times, (3) each trait should also be involved in at least one mixed keyed pair, and (4) statements in the same block should be matched on social desirability as much as they can. Given all these constraints, it becomes challenging to create optimal blocks manually. Therefore, in the Online Supplementary Materials, we provided a tutorial on how to use the autoFC R package (Li et al., 2022) to automatically assemble blocks according to multiple criteria.
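Constraints (1) to (3) above can be checked mechanically once a candidate set of blocks exists. The sketch below is a hypothetical checker written for illustration; it is not the autoFC interface. It assumes each statement is represented as a `(trait, key)` tuple, with `key` equal to `+1` for positively keyed and `-1` for negatively keyed statements. Desirability matching, constraint (4), is not covered here.

```python
from collections import Counter
from itertools import combinations


def check_blocks(blocks):
    """Summarize how well a block design meets constraints (1)-(3). Hypothetical helper."""
    traits = sorted({trait for block in blocks for trait, _ in block})
    # Start every trait pair at zero so pairs that never co-occur are visible.
    pair_counts = Counter({frozenset(pair): 0 for pair in combinations(traits, 2)})
    traits_in_mixed_pair = set()
    blocks_with_repeated_trait = []

    for index, block in enumerate(blocks):
        names = [trait for trait, _ in block]
        if len(set(names)) < len(names):  # constraint (1) violated
            blocks_with_repeated_trait.append(index)
        for (trait_a, key_a), (trait_b, key_b) in combinations(block, 2):
            if trait_a == trait_b:
                continue
            pair_counts[frozenset((trait_a, trait_b))] += 1  # constraint (2)
            if key_a != key_b:  # constraint (3)
                traits_in_mixed_pair.update((trait_a, trait_b))

    return {
        "blocks_with_repeated_trait": blocks_with_repeated_trait,
        "pair_count_range": (min(pair_counts.values()), max(pair_counts.values())),
        "traits_without_mixed_pair": [t for t in traits if t not in traits_in_mixed_pair],
    }


# Toy design: four traits, four triplets, two of them mixed keyed.
example_blocks = [
    [("A", +1), ("B", +1), ("C", -1)],
    [("A", +1), ("B", -1), ("D", +1)],
    [("A", +1), ("C", +1), ("D", +1)],
    [("B", +1), ("C", +1), ("D", +1)],
]
print(check_blocks(example_blocks))
```

A narrow `pair_count_range` indicates that traits are paired about equally often; an empty `traits_without_mixed_pair` list indicates that every trait takes part in at least one mixed keyed pair.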

Before moving to the next step, we consider two additional issues deserving further attention. The first issue concerns how to use social desirability values for matching. The most popular way is to focus on the mean value for each statement across raters and try to minimize the absolute difference between statements’ mean desirability values (the D index; Edwards, 1957; Pavlov, 2022). Statements are said to be matched if the largest D between all possible statement pairs within a block is smaller than a predefined cutoff. We recommend setting the cutoff to .50 for a 5-point scale based on previous studies (e.g., Vasilopoulos et al. [2006] used .357; Chernyshenko et al. [2009], Drasgow et al. [2012], and Hughes et al. [2021] used .714) and our first-hand experience. For mixed keyed blocks, the cutoff should be relaxed, though less evidence exists on what cutoff should be set. Based on our experience with FC questionnaire construction, 1–1.5 on a 5-point scale seems to be a reasonable cutoff for mixed keyed blocks. One potential issue of using mean desirability values is that the variance of social desirability values across raters is ignored. To overcome this issue, Pavlov et al. (2022) proposed the inter-item agreement (IIA) approach, which essentially utilizes robust interrater agreement indices, such as the Brennan–Prediger index (Brennan & Prediger, 1981; Gwet, 2014) and the AC index (Gwet, 2008, 2014). Statement pairing, in turn, is based on the interrater agreement on social desirability values, rather than differences in mean social desirability values. We believe that the IIA approach is promising for statement matching. Readers interested in this approach can use the autoFC R package to execute it.
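As a minimal sketch of the mean-based matching rule described above, the following Python snippet computes the largest absolute difference in mean desirability between any two statements in a block and compares it with a cutoff. The function names and the ratings are made up for illustration; the cutoffs (.50 for matched blocks, 1 to 1.5 for mixed keyed blocks) are the ones suggested in the text.

```python
from itertools import combinations
from statistics import mean


def block_d_index(ratings_by_statement):
    """Largest absolute difference in mean desirability between any two statements.

    `ratings_by_statement` holds one list of rater scores per statement in the block.
    """
    statement_means = [mean(ratings) for ratings in ratings_by_statement]
    return max(abs(a - b) for a, b in combinations(statement_means, 2))


def is_matched(ratings_by_statement, cutoff=0.50):
    """A block counts as matched when its largest D is smaller than the cutoff."""
    return block_d_index(ratings_by_statement) < cutoff


# Hypothetical ratings from five raters on a 5-point scale.
matched_block = [[4, 4, 5, 4, 4], [4, 4, 4, 4, 4], [4, 5, 4, 4, 5]]  # means 4.2, 4.0, 4.4
mixed_block = [[4, 4, 4, 4, 4], [3, 3, 2, 3, 3], [4, 4, 5, 4, 4]]    # means 4.0, 2.8, 4.2

print(is_matched(matched_block))            # compared against the .50 cutoff
print(is_matched(mixed_block))              # too discrepant for the .50 cutoff
print(is_matched(mixed_block, cutoff=1.5))  # relaxed cutoff for mixed keyed blocks
```

The same function can be reused for ratings collected under block administration, which is the check described for the second issue.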

The second issue concerns the contextual nature of social desirability. While it is common to match statements based on social desirability ratings obtained from SS administration, this practice implicitly assumes that respondents’ perception of statement social desirability remains constant when these statements are administered individually versus in pairs with other statements (Frick, 2022). However, a statement may become more or less desirable depending on the statements it is paired with (Lin & Brown, 2017). Even two statements with identical social desirability ratings when presented individually can still be perceived as differentially (un)desirable when paired together. As such, after constructing preliminary blocks, test developers can invite human raters to rate the desirability of multiple statements presented simultaneously in a block. Blocks that may need further revision can be identified by checking the D index computed from the social desirability ratings obtained from block administration. If the D index exceeds the cutoff suggested above, researchers should revise the block and repeat this process until they find blocks that satisfy the criterion. While the present study focused on dominance-model-based FC measures, ideal-point-model-based FC measures are also widely used (Drasgow et al., 2012; Boyce et al., 2015). Under the unfolding framework, test developers should match statements on both social desirability and extremity to ensure faking-resistance (Cao & Drasgow, 2019).

Step 6. Examine the reliability of the FC measure using simulated data. It is extremely helpful to have an initial understanding of the reliability of trait scores derived from the FC measure constructed in the previous step under ideal conditions using Monte Carlo simulations. If the reliability does not fare well in these ideal conditions, it is unlikely to be satisfactory in more realistic conditions. In those cases, researchers should go back to the previous step to construct a new FC measure and examine its reliability in ideal conditions again before collecting empirical data. To examine reliability using simulated data, researchers need to simulate FC responses based on the statement parameters obtained in Step 1, assuming that statement parameters are largely invariant across FC and SS (Lin & Brown, 2017; Morillo et al., 2019). Specifically, if test developers adopt the dominance model, FC responses should be generated according to equation (4) in Brown and Maydeu-Olivares (2013); if the unfolding model is adopted, FC responses can be generated according to equation (2) in Lee, Joo and Lee (2019) or equation (13)/(14) in Zhang, Tu et al. (2023). Finally, researchers could fit either a dominance (e.g., TIRT) or an unfolding (e.g., GGUM-RANK or GTUM) model to the simulated data depending on the data generation model used, and obtain reliability estimates accordingly.
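To make the data-generation idea concrete, here is a minimal sketch of simulating FC responses under a dominance (Thurstonian) response process: each statement receives a latent utility equal to its intercept plus its loading times the respondent's trait score plus a normally distributed unique error, and the statement with the higher utility is preferred. This is a simplified illustration, not the autoFC tutorial code and not a reproduction of the cited equations. The parameter values are invented, and traits are drawn as uncorrelated standard normal variables for brevity, whereas a real simulation would use the latent correlations recorded in Step 1.

```python
import random
from itertools import combinations


def latent_utilities(trait_scores, statements, rng):
    """Utility of each statement: intercept + loading * trait score + unique error."""
    return [
        s["intercept"]
        + s["loading"] * trait_scores[s["trait"]]
        + rng.gauss(0.0, s["uniqueness_var"] ** 0.5)
        for s in statements
    ]


def pairwise_outcomes(utilities):
    """Recode a block into binary comparisons: 1 if statement i is preferred over k."""
    return {
        (i, k): int(utilities[i] >= utilities[k])
        for i, k in combinations(range(len(utilities)), 2)
    }


def simulate_block(statements, n_traits, n_respondents, seed=0):
    """Simulate binary outcomes for one block (assumption: uncorrelated traits)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_respondents):
        trait_scores = [rng.gauss(0.0, 1.0) for _ in range(n_traits)]
        data.append(pairwise_outcomes(latent_utilities(trait_scores, statements, rng)))
    return data


# Hypothetical mixed keyed triplet: the third statement is negatively keyed.
triplet = [
    {"trait": 0, "intercept": 0.2, "loading": 0.8, "uniqueness_var": 0.36},
    {"trait": 1, "intercept": 0.0, "loading": 0.7, "uniqueness_var": 0.51},
    {"trait": 2, "intercept": -0.1, "loading": -0.6, "uniqueness_var": 0.64},
]
responses = simulate_block(triplet, n_traits=3, n_respondents=200, seed=1)
print(responses[0])
```

Because the outcomes are derived from a single set of utilities per respondent, the simulated comparisons within a block are always transitive, which matches how rankings behave. The simulated data would then be passed to a scoring model such as TIRT to obtain estimated trait scores.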

A natural follow-up question is how to calculate reliability when assembling a new FC measure from calibrated statement banks and empirical data are not yet available. If the FC measure is intended to screen in/out respondents within a certain range of the latent trait continuum, we recommend direct examination of standard errors of measurement within the range of interest because this is the most straightforward way to quantify measurement precision. If the measure is designed for general purpose and an overall estimate of reliability is needed, following Lin (2022), we recommend that test developers use the squared correlation between estimated person scores and true person scores as the reliability estimate, because this index relies on the fewest assumptions and is a straightforward operationalization of reliability under Classical Test Theory. However, this squared correlation is also the most conservative estimate (Lin, 2022). Thus, we additionally recommend the empirical reliability estimate using the formula $\frac{\mathrm{var}(\hat{\theta})}{\mathrm{var}(\hat{\theta}) + \mathrm{mean}(SE(\hat{\theta})^{2})}$ suggested by Brown and Maydeu-Olivares (2018). In this formula, $\mathrm{var}(\hat{\theta})$ refers to the variance of estimated latent trait scores, and $SE(\hat{\theta})^{2}$ refers to the squared standard error of measurement. But we note that since empirical reliability is likely to be slightly inflated (Lin, 2022), simultaneous consideration of the two reliability indices is recommended. Given the complexity of the TIRT model and simulations in general, we provided a step-by-step tutorial that automates the entire process, implemented using the autoFC R package, in the Online Supplementary Materials (Section 1). Users just need to input the population parameters mentioned above. The R functions will run the simulation and summarize simulation results automatically. Currently, autoFC only covers TIRT-based dominance models. Other models will be included in future updates.
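Both reliability indices are simple to compute once estimated scores, their standard errors, and (in simulations) the true scores are available. The sketch below is a plain-Python illustration with hypothetical function names and made-up numbers; it assumes the sample variance (n − 1 denominator) for $\mathrm{var}(\hat{\theta})$, a choice the text does not specify.

```python
from statistics import mean, variance


def squared_correlation_reliability(true_scores, estimated_scores):
    """Squared Pearson correlation between true and estimated person scores."""
    mean_true, mean_est = mean(true_scores), mean(estimated_scores)
    cross = sum((t - mean_true) * (e - mean_est)
                for t, e in zip(true_scores, estimated_scores))
    ss_true = sum((t - mean_true) ** 2 for t in true_scores)
    ss_est = sum((e - mean_est) ** 2 for e in estimated_scores)
    return cross ** 2 / (ss_true * ss_est)


def empirical_reliability(estimated_scores, standard_errors):
    """var(theta_hat) / (var(theta_hat) + mean(SE(theta_hat)^2))."""
    score_variance = variance(estimated_scores)  # assumption: sample variance
    mean_error_variance = mean([se ** 2 for se in standard_errors])
    return score_variance / (score_variance + mean_error_variance)


# Made-up scores for five simulated respondents on one trait.
true_theta = [-1.3, -0.4, 0.1, 0.6, 1.2]
estimated_theta = [-1.0, -0.6, 0.3, 0.4, 0.9]
standard_errors = [0.45, 0.40, 0.38, 0.40, 0.47]

print(squared_correlation_reliability(true_theta, estimated_theta))
print(empirical_reliability(estimated_theta, standard_errors))
```

Only the second index can be computed in empirical samples, because true scores are known only in simulated data.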

Step 7. Empirical validation. If the FC measure performs satisfactorily in simulations, researchers can then proceed to empirically test its psychometric properties, fakability, and respondent reactions. We believe our study provided a good example of empirical validation design that interested readers can adopt for their own studies. Specifically, we recommend a within-subjects design to contrast honest versus motivated faking situations to comprehensively estimate the fakability of an FC measure. We also recommend that researchers include the SS counterpart as a benchmark to gauge the degree of possible benefits (in terms of psychometric properties and faking resistance) and costs (in terms of respondent reactions) brought by the FC format. At this stage, we recommend using empirical reliability or test–retest reliability to quantify measurement precision, as they reflect the measurement accuracy in actual rather than hypothetical samples. We additionally recommend that test developers regularly examine measurement invariance across demographic groups (e.g., gender or racial groups) to identify potentially noninvariant blocks and ensure score comparability across groups. Several techniques for assessing measurement invariance of FC measures have been proposed in recent years (Lee & Smith, 2020; Lee et al., 2021; Qiu & Wang, 2021), and we refer readers to these approaches.

Limitations and Future Directions

Despite its many strengths (e.g., large sample, experimental design, comprehensiveness, guidelines, and tutorial), the present study is still limited in the following ways. First, we only used FC measures with a block size of three, which provided less information compared to larger block sizes. Future researchers are strongly encouraged to examine whether block size will moderate the effects of mixed keying and social desirability matching on psychometric properties, faking resistance, and respondent reactions. Second, although the current study demonstrated negligible impact of FC designs on respondent reactions, it remains possible that respondent reactions are dependent on the construct being measured as well. For example, if respondents are required to choose between statements measuring the dark personality traits (A = “I manipulate people to get what I want,” B = “I deserve more attention than others,” C = “I enjoy quick and nasty revenge”) that may threaten their self-images, they may have more salient negative reactions (Fuechtenhans & Brown, 2022). In such cases, the design of FC measures may become more relevant. Hence, we believe that a promising future research direction is to examine the effect of construct types on respondent reactions, and whether the impact of FC design on respondent reactions depends on these construct types. Third, we exclusively used self-reported data for measuring both personality and criterion variables. It would be interesting for future studies to explore whether the criterion-related validity of different FC measures would vary in the same manner when predicting other-reported outcomes.

Conclusion

We presented the first piece of comprehensive empirical evidence on the impact of social desirability matching and mixed keying on the psychometric properties of, fakability of, and respondent reactions to FC measures. Most notably, a small compromise on desirability matching in exchange for more mixed keyed blocks is feasible, such that the improvement in psychometric properties does not substantially harm the faking resistance of an FC measure. Also, respondents did not report differential reactions toward different FC designs. All in all, we showed that it is possible to find a middle ground between social desirability matching and mixed keying such that the FC measures can have both good psychometric properties and high faking resistance. We further provided researchers tools for constructing such FC measures.

Supplemental Material

Supplemental material, sj-docx-1-orm-10.1177_10944281241229784 for Mixed-Keying or Desirability-Matching in the Construction of Forced-Choice Measures? An Empirical Investigation and Practical Recommendations by Mengtong Li, Bo Zhang, Lingyue Li, Tianjun Sun and Anna Brown in Organizational Research Methods

Footnotes

Acknowledgment

The authors thank Dr. Fritz Drasgow for his insightful comments on an earlier version of the manuscript.

Authors’ Note

An earlier version of this paper was presented at the 38th Annual Conference of the Society for Industrial and Organizational Psychology.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Supplemental Material

Supplemental material for this article is available online.

Author Biographies

Mengtong Li is currently a doctoral candidate in the industrial-organizational psychology program at University of Illinois at Urbana-Champaign. His research mainly focuses on personality measurement, leadership, and research methods. His work advocates for advanced techniques on facilitating automatic test construction and validation.

Bo Zhang is an assistant professor at the School of Labor and Employment Relations and the Department of Psychology at the University of Illinois Urbana-Champaign. His research focuses on personnel selection, personality, and quantitative methods.

Lingyue Li is currently a doctoral candidate in the Department of Psychology at University of Illinois Urbana-Champaign. Her research primarily focuses on personnel selection, individual differences, and quantitative methods.

Tianjun Sun received her Ph.D. in psychology from University of Illinois Urbana-Champaign and is currently an assistant professor of industrial-organizational psychology at Kansas State University. Her research primarily focuses on personnel selection, individual differences, and quantitative methods. Her work advocates for the responsible use of psychometric tools and advanced technology to improve psychological sciences and solve organizational problems.

Anna Brown is currently professor of Psychometrics at the School of Psychology at University of Kent, United Kingdom. Her research focuses on scaling comparative judgements, detecting and preventing response biases and faking behaviors, and modelling response processes using IRT and SEM frameworks more broadly. Beyond academia, Professor Brown has extensive experience in designing, developing, and implementing psychometric testing solutions in the workplace, health settings, and education, and provides psychometric advice to several organizations in the private and public sectors internationally.

References

1.
Anglim J. Horwood S. Smillie L. D. Marrero R. J. Wood J. K. (2020). Predicting psychological and subjective well-being from personality: A meta-analysis. Psychological Bulletin, 146(4), 279–323. https://doi.org/10.1037/bul0000226

2.
Anglim J. Morse G. De Vries R. E. MacCann C. Marty A. (2017). Comparing job applicants to non–applicants using an item–level bifactor model on the HEXACO personality inventory. European Journal of Personality, 31(6), 669–684. https://doi.org/10.1002/per.2120

3.
Ashton M. C. Lee K. (2009). The HEXACO-60: A short measure of the major dimensions of personality. Journal of Personality Assessment, 91(4), 340–345. https://doi.org/10.1080/00223890902935878

4.
Boyce A. S. Conway J. S. Caputo P. M. (2015). ADEPT-15 technical documentation: Development and validation of Aon Hewitt’s Personality Model and Adaptive Employee Personality Test (ADEPT-15). Aon Hewitt.

5.
Brennan R. L. Prediger D. J. (1981). Coefficient kappa: Some uses, misuses, and alternatives. Educational and Psychological Measurement, 41(3), 687–699. https://doi.org/10.1177/001316448104100307

6.
Brown A. (2016). Item response models for forced-choice questionnaires: A common framework. Psychometrika, 81(1), 135–160. https://doi.org/10.1007/s11336-014-9434-9

7.
Brown A. Bartram D. (2009). Doing less but getting more: Improving forced-choice measures with IRT. In: Society for Industrial and Organizational Psychology Conference; April 2–4, 2009; New Orleans. Retrieved from http://kar.kent.ac.uk/44788/.

8.
Brown A. Maydeu-Olivares A. (2011). Item response modeling of forced-choice questionnaires. Educational and Psychological Measurement, 71(3), 460–502. https://doi.org/10.1177/0013164410375112

9.
Brown A. Maydeu-Olivares A. (2013). How IRT can solve problems of ipsative data in forced-choice questionnaires. Psychological Methods, 18(1), 36–52. https://doi.org/10.1037/a0030641

10.
Brown A. Maydeu-Olivares A. (2018). Ordinal factor analysis of Graded-Preference Questionnaire Data. Structural Equation Modeling: A Multidisciplinary Journal, 25(4), 516–529. https://doi.org/10.1080/10705511.2017.1392247

11.
Bürkner P. C. (2019). thurstonianIRT: Thurstonian IRT models in R. Journal of Open Source Software, 4(42), 1662–1663. https://doi.org/10.21105/joss.01662

12.
Bürkner P. C. (2022). On the information obtainable from comparative judgments. Psychometrika, 87, 1439–1472. https://doi.org/10.1007/s11336-022-09843-z

13.
Bürkner P. C. Schulte N. Holling H. (2019). On the statistical and practical limitations of Thurstonian IRT models. Educational and Psychological Measurement, 79(5), 827–854. https://doi.org/10.1177/0013164419832063

14.
Cai L. (2010). High-dimensional exploratory item factor analysis by a Metropolis-Hastings Robbins–Monro algorithm. Psychometrika, 75(1), 33–57. https://doi.org/10.1007/s11336-009-9136-x

15.
Cao M. Drasgow F. (2019). Does forcing reduce faking? A meta-analytic review of forced-choice personality measures in high-stakes situations. Journal of Applied Psychology, 104(11), 1347–1368. https://doi.org/10.1037/apl0000414

16.
Cao M. Drasgow F. Cho S. (2015). Developing ideal intermediate personality items for the ideal point model. Organizational Research Methods, 18(2), 252–275. https://doi.org/10.1177/1094428114555993

17.
Carlo G. Okun M. A. Knight G. P. de Guzman M. R. T. (2005). The interplay of traits and motives on volunteering: Agreeableness, extraversion and prosocial value motivation. Personality and Individual Differences, 38(6), 1293–1305. https://doi.org/10.1016/j.paid.2004.08.012

18.
Chalmers R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48, 1–29. https://doi.org/10.18637/jss.v048.i06

19.
Chan D. Schmitt N. Sacco J. M. DeShon R. P. (1998). Understanding pretest and posttest reactions to cognitive ability and personality tests. Journal of Applied Psychology, 83(3), 471–485. https://doi.org/10.1037/0021-9010.83.3.471

20.
Chernyshenko O. S. Stark S. Prewett M. S. Gray A. A. Stilson F. R. Tuttle M. D. (2009). Normative scoring of multidimensional pairwise preference personality scales using IRT: Empirical comparisons with other formats. Human Performance, 22(2), 105–127. https://doi.org/10.1080/08959280902743303

21.
Clark L. A. Watson D. (1995). Constructing validity: Basic issues in objective scale development. Psychological Assessment, 7(3), 309–319. https://doi.org/10.1037/1040-3590.7.3.309

22.
Clark L. A. Watson D. (2019). Constructing validity: New developments in creating objective measuring instruments. Psychological Assessment, 31(12), 1412–1427. https://doi.org/10.1037/pas0000626

23.
Converse P. D. Oswald F. L. Imus A. Hedricks C. Roy R. Butera H. (2008). Comparing personality test formats and warnings: Effects on criterion-related validity and test-taker reactions. International Journal of Selection and Assessment, 16(2), 155–169. https://doi.org/10.1111/j.1468-2389.2008.00420.x

24.
Cottrell J. M. Newman D. A. Roisman G. I. (2015). Explaining the black-white gap in cognitive test scores: Toward a theory of adverse impact. Journal of Applied Psychology, 100(6), 1713–1736. https://doi.org/10.1037/apl0000020

25.
Cowan N. (2010). The magical mystery four: How is working memory capacity limited, and why? Current Directions in Psychological Science, 19(1), 51–57. https://doi.org/10.1177/0963721409359277

26.
Dalal D. K. Zhu X. S. Rangel B. Boyce A. S. Lobene E. (2021). Improving applicant reactions to forced-choice personality measurement: Interventions to reduce threats to test takers’ self-concepts. Journal of Business and Psychology, 36(1), 55–70. https://doi.org/10.1007/s10869-019-09655-6

27.
Diener E. D. Emmons R. A. Larsen R. J. Griffin S. (1985). The satisfaction with life scale. Journal of Personality Assessment, 49(1), 71–75. https://doi.org/10.1207/s15327752jpa4901_13

28.
Drasgow F. Stark S. Chernyshenko O. S. Nye C. D. Hulin C. L. White L. A. (2012). Development of the tailored adaptive personality assessment system (TAPAS) to support army personnel selection and classification decisions. Drasgow Consulting Group.

29.
Edwards A. L. (1957). The social desirability variable in personality assessment and research. Dryden.

30.
Fox S. Spector P. E. Bruursema K. Kessler S. Goh A. (2007). Necessity is the mother of behavior: Organizational constraints, CWB and OCB. In Meeting of the Academy of Management, Philadelphia, PA.

31.
Frick S. (2022). Modeling faking in the multidimensional forced-choice format: the faking mixture model. Psychometrika, 87(2), 773–794. https://doi.org/10.1007/s11336-021-09818-6

32.
Frick S. Brown A. Wetzel E. (2023). Investigating the normativity of trait estimates from multidimensional forced-choice data. Multivariate Behavioral Research, 58(1), 1–29. https://doi.org/10.1080/00273171.2021.1938960

33.
Fuechtenhans M. Brown A. (2022). How do applicants fake? A response process model of faking on multidimensional forced-choice personality assessments. International Journal of Selection and Assessment. https://doi.org/10.1111/ijsa.12409

34.
Furr R. M. (2010). The double-entry intraclass correlation as an index of profile similarity: Meaning, limitations, and alternatives. Journal of Personality Assessment, 92(1), 1–15. https://doi.org/10.1080/00223890903379134

35.
Gwet K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48. https://doi.org/10.1348/000711006X126600

36.
Gwet K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.

37.
Halford G. S. Cowan N. Andrews G. (2007). Separating cognitive capacity from knowledge: A new hypothesis. Trends in Cognitive Sciences, 11(6), 236–242. https://doi.org/10.1016/j.tics.2007.04.001

38.
Harms P. D. Roberts B. W. Wood D. (2007). Who shall lead? An integrative personality approach to the study of the antecedents of status in informal social organizations. Journal of Research in Personality, 41(3), 689–699. https://doi.org/10.1016/j.jrp.2006.08.001

39.
Harris A. M. McMillan J. T. Carter N. T. (2021). Test-taker reactions to ideal point measures of personality. Journal of Business and Psychology, 36(3), 513–532. https://doi.org/10.1007/s10869-020-09682-8

40.
Hausknecht J. P. Day D. V. Thomas S. C. (2004). Applicant reactions to selection procedures: An updated model and meta-analysis. Personnel Psychology, 57(3), 639–683. https://doi.org/10.1111/j.1744-6570.2004.00003.x

41.
He Y. Donnellan M. B. Mendoza A. M. (2019). Five-factor personality domains and job performance: A second order meta-analysis. Journal of Research in Personality, 82, 1–24. https://doi.org/10.1016/j.jrp.2019.103848

42.
Highhouse S. Lievens F. Sinar E. F. (2003). Measuring attraction to organizations. Educational and Psychological Measurement, 63(6), 986–1001. https://doi.org/10.1177/0013164403258403

43.
Hinkin T. R. (1998). A brief tutorial on the development of measures for use in survey questionnaires. Organizational Research Methods, 1(1), 104–121. https://doi.org/10.1177/109442819800100106

44.
Hommel B. E. (2023). Expanding the methodological toolbox: Machine-based item desirability ratings as an alternative to human-based ratings. Personality and Individual Differences, 213, 112307. https://doi.org/10.1016/j.paid.2023.112307

45.
Hu J. Connelly B. S. (2021). Faking by actual applicants on personality tests: A meta-analysis of within-subjects studies. International Journal of Selection and Assessment, 29(3-4), 412–426. https://doi.org/10.1111/ijsa.12338

46.
Hughes A. W. Dunlop P. D. Holtrop D. Wee S. (2021). Spotting the “ideal” personality response: Effects of item matching in forced choice measures for personnel selection. Journal of Personnel Psychology, 20(1), 17–26. https://doi.org/10.1027/1866-5888/a000267

47.
Jonason P. K. Webster G. D. (2010). The dirty dozen: a concise measure of the dark triad. Psychological Assessment, 22(2), 420. https://doi.org/10.1037/a0019265

48.
Jones K. S. Newman D. A. Su R. Rounds J. (2022). Vocational interests and adverse impact: How attraction and selection on vocational interests relate to adverse impact potential. Journal of Applied Psychology, 107(4), 604–627. https://doi.org/10.1037/apl0000893

49.
Judge T. A. Heller D. Mount M. K. (2002). Five-factor model of personality and job satisfaction: A meta-analysis. Journal of Applied Psychology, 87(3), 530–541. https://doi.org/10.1037/0021-9010.87.3.530

50.
Kieftenbeld V. Natesan P. (2012). Recovery of graded response model parameters: A comparison of marginal maximum likelihood and Markov chain Monte Carlo estimation. Applied Psychological Measurement, 36(5), 399–419. https://doi.org/10.1177/0146621612446170

51.
Kluger A. N. Rothstein H. R. (1993). The influence of selection test type on applicant reactions to employment testing. Journal of Business and Psychology, 8, 3–25. https://doi.org/10.1007/BF02230391

52.
Kreitchmann R. S. Abad F. J. Ponsoda V. Nieto M. D. Morillo D. (2019). Controlling for response biases in self-report scales: Forced-choice vs. psychometric modeling of Likert items. Frontiers in Psychology, 10, 2309. https://doi.org/10.3389/fpsyg.2019.02309

53.
Kristensen T. S. Borritz M. Villadsen E. Christensen K. B. (2005). The Copenhagen Burnout Inventory: A new tool for the assessment of burnout. Work & Stress, 19(3), 192–207. https://doi.org/10.1080/02678370500297720

54.
Lambert L. S. Newman D. A. (2022). Construct development and validation in three practical steps: Recommendations for reviewers, editors, and authors. Organizational Research Methods. https://doi.org/10.1177/10944281221115374

55.
Lee H. Smith W. Z. (2020). Fit indices for measurement invariance tests in the Thurstonian IRT model. Applied Psychological Measurement, 44(4), 282–295. https://doi.org/10.1177/0146621619893785

56.
Lee P. Joo S. H. Lee S. (2019). Examining stability of personality profile solutions between Likert-type and multidimensional forced choice measure. Personality and Individual Differences, 142, 13–20. https://doi.org/10.1016/j.paid.2019.01.022

57.
Lee P. Joo S. H. Stark S. (2021). Detecting DIF in multidimensional forced-choice measures using the Thurstonian item response theory model. Organizational Research Methods, 24(4), 739–771. https://doi.org/10.1177/1094428120959822

58.
Lee P. Joo S. H. Stark S. Chernyshenko O. S. (2019). GGUM-RANK statement and person parameter estimation with multidimensional forced choice triplets. Applied Psychological Measurement, 43(3), 226–240. https://doi.org/10.1177/0146621618768294

59.
Lee P. Joo S. H. Zhou S. Son M. (2022). Investigating the impact of negatively keyed statements on multidimensional forced-choice personality measures: A comparison of partially ipsative and IRT scoring methods. Personality and Individual Differences, 191, 111555. https://doi.org/10.1016/j.paid.2022.111555

60.
Lee Y. Berry C. M. Gonzalez-Mulé E. (2019). The importance of being humble: A meta-analysis and incremental validity analysis of the relationship between honesty-humility and job performance. Journal of Applied Psychology, 104(12), 1535. https://doi.org/10.1037/apl0000421

61.
Li M. Sun T. Zhang B. (2022). autoFC: An R package for automatic item pairing in forced-choice test construction. Applied Psychological Measurement, 46(1), 70–72. https://doi.org/10.1177/01466216211051726

62.
Li Z. Zhang B. Cao M. Tay L. (2021). Accounting for item response process and response styles using the Unfolding Item Response Tree (UIRTree) model. Preprint. https://doi.org/10.31219/osf.io/8w36e

63.
Lin Y. (2022). Reliability estimates for IRT-based forced-choice assessment scores. Organizational Research Methods, 25(3), 575–590. https://doi.org/10.1177/1094428121999086

64.
Lin Y. Brown A. (2017). Influence of context on item parameters in forced-choice personality assessments. Educational and Psychological Measurement, 77(3), 389–414. https://doi.org/10.1177/0013164416646162

65.
Lopez F. J. Hou N. Fan J. (2019). Reducing faking on personality tests: Testing a new faking-mitigation procedure in a US job applicant sample. International Journal of Selection and Assessment, 27(4), 371–380. https://doi.org/10.1111/ijsa.12265

66.
Macan T. H. Avedon M. J. Paese M. Smith D. E. (1994). The effects of applicants’ reactions to cognitive ability tests and an assessment center. Personnel Psychology, 47(4), 715–738. https://doi.org/10.1111/j.1744-6570.1994.tb01573.x

67.
McCarthy J. M. Bauer T. N. Truxillo D. M. Anderson N. R. Costa A. C. Ahmed S. M. (2017). Applicant perspectives during selection: A review addressing “So what?,” “What’s new?,” and “Where to next?”. Journal of Management, 43(6), 1693–1725. https://doi.org/10.1177/0149206316681846

68.
McCloy R. A. Heggestad E. D. Reeve C. L. (2005). A silk purse from the sow's ear: Retrieving normative information from multidimensional forced-choice items. Organizational Research Methods, 8(2), 222–248. https://doi.org/10.1177/1094428105275374

69.
Morillo D. Abad F. J. Kreitchmann R. S. Leenen I. Hontangas P. Ponsoda V. (2019). The journey from Likert to forced-choice questionnaires: Evidence of the invariance of item parameters. Revista de Psicología del Trabajo y de las Organizaciones, 35(2), 75–83. https://doi.org/10.5093/jwop2019a11

70.
Morillo D. Ponsoda V. Leenen I. Abad F. J. Hontangas P. (2016). Comparing CFA and Bayesian estimations of forced-choice questionnaires with paired dominance items. International Test Commission (ITC) 2016 Conference, Vancouver, Canada.

71.
Munyon T. P. Carnes A. M. Lyons L. M. Zettler I. (2020). All about the money? Exploring antecedents and consequences for a brief measure of perceived financial security. Journal of Occupational Health Psychology, 25(3), 159–175. https://doi.org/10.1037/ocp0000162

72.
Naemi B. Seybert J. Robbins S. Kyllonen P. (2014). Examining the WorkFORCE™ assessment for job fit and core capabilities of FACETS™. ETS Research Report Series, 2014(2), 1–43. https://doi.org/10.1002/ets2.12040

73.
Ng V. Lee P. Ho M. H. R. Kuykendall L. Stark S. Tay L. (2021). The development and validation of a multidimensional forced-choice format character measure: Testing the Thurstonian IRT approach. Journal of Personality Assessment, 103(2), 224–237. https://doi.org/10.1080/00223891.2020.1739056

74.
Nye C. D. Su R. Rounds J. Drasgow F. (2012). Vocational interests and performance: A quantitative summary of over 60 years of research. Perspectives on Psychological Science, 7(4), 384–403. https://doi.org/10.1177/1745691612449021

75.
Pavlov G. (2022). Comparing different approaches for obtaining item desirability ratings [Poster]. Society for Industrial and Organizational Psychology Annual Conference, Seattle, WA, United States.

76.
Pavlov G. Shi D. Maydeu-Olivares A. Fairchild A. (2022). Item desirability matching in forced-choice test construction. Personality and Individual Differences, 183, 111114. https://doi.org/10.1016/j.paid.2021.111114

77.
Pletzer J. L. Oostrom J. K. Bentvelzen M. de Vries R. E. (2020). Comparing domain-and facet-level relations of the HEXACO personality model with workplace deviance: A meta-analysis. Personality and Individual Differences, 152, 109539. https://doi.org/10.1016/j.paid.2019.109539

78.
Pletzer J. L. Oostrom J. K. de Vries R. E. (2021). HEXACO personality and organizational citizenship behavior: A domain-and facet-level meta-analysis. Human Performance, 34(2), 126–147. https://doi.org/10.1080/08959285.2021.1891072

79.
Pletzer J. L. Thielmann I. Zettler I. (2023). Who is healthier? A meta-analysis of the relations between the HEXACO personality domains and health outcomes. European Journal of Personality. https://doi.org/10.1177/08902070231174574

80.
Plieninger H. Heck D. W. (2018). A new model for acquiescence at the interface of psychometrics and cognitive psychology. Multivariate Behavioral Research, 53(5), 633–654. https://doi.org/10.1080/00273171.2018.1469966

81.
Qiu X. L. Wang W. C. (2021). Assessment of differential statement functioning in ipsative tests with multidimensional forced-choice items. Applied Psychological Measurement, 45(2), 79–94. https://doi.org/10.1177/0146621620965739

82.
Roodt G. (2004). Turnover intentions. Unpublished document, University of Johannesburg, Johannesburg, South Africa.

83.
Salgado J. F. Anderson N. Tauriz G. (2015). The validity of ipsative and quasi-ipsative forced-choice personality inventories for different occupational groups: A comprehensive meta-analysis. Journal of Occupational and Organizational Psychology, 88(4), 797–834. https://doi.org/10.1111/joop.12098

84.
Samejima F. (1997). Graded Response Model. In van der Linden W. J. Hambleton R. K. (Eds.), Handbook of Modern Item Response Theory. Springer. https://doi.org/10.1007/978-1-4757-2691-6_5

85.
Sass R. Frick S. Reips U. D. Wetzel E. (2020). Taking the test taker’s perspective: Response process and test motivation in multidimensional forced-choice versus rating scale instruments. Assessment, 27(3), 572–584. https://doi.org/10.1177/1073191118762049

86.
Schat A. C. Kelloway E. K. Desmarais S. (2005). The Physical Health Questionnaire (PHQ): construct validation of a self-report scale of somatic symptoms. Journal of Occupational Health Psychology, 10(4), 363–381. https://doi.org/10.1037/1076-8998.10.4.363

87.
Schreiber A. Marcus B. (2020). The place of the “Dark Triad” in general models of personality: Some meta-analytic clarification. Psychological Bulletin, 146(11), 1021–1041. https://doi.org/10.1037/bul0000299

88.
Schulte N. Holling H. Bürkner P. C. (2021). Can high-dimensional questionnaires resolve the ipsativity issue of forced-choice response formats? Educational and Psychological Measurement, 81(2), 262–289. https://doi.org/10.1177/0013164420934861

89.
Sisson E. D. (1948). Forced choice—The new army rating. Personnel Psychology, 1(3), 365–381. https://doi.org/10.1111/j.1744-6570.1948.tb01316.x

90.
Smither J. W. Reilly R. R. Millsap R. E. Pearlman K. Stoffey R. W. (1993). Applicant reactions to selection procedures. Personnel Psychology, 46(1), 49–76. https://doi.org/10.1111/j.1744-6570.1993.tb00867.x

91.
Spector P. E. (1985). Measurement of human service staff satisfaction: Development of the Job Satisfaction Survey. American Journal of Community Psychology, 13(6), 693–713. https://doi.org/10.1007/bf00929796

92.
Spector P. E. Fox S. Penney L. M. Bruursema K. Goh A. Kessler S. (2006). The dimensionality of counterproductivity: Are all counterproductive behaviors created equal? Journal of Vocational Behavior, 68(3), 446–460. https://doi.org/10.1016/j.jvb.2005.10.005

93.
Speer A. B. Perrotta J. Jacobs R. R. (2023). Supervised construct scoring to reduce personality assessment length: A field study and introduction to the short 10. Organizational Research Methods. https://doi.org/10.1177/1094428122114569

94.
Stark S. Chernyshenko O. S. Drasgow F. (2005). An IRT approach to constructing and scoring pairwise preference items involving stimuli on different dimensions: The multi-unidimensional pairwise-preference model. Applied Psychological Measurement, 29(3), 184–203. https://doi.org/10.1177/0146621604273988

95.
Stark S. Chernyshenko O. S. Drasgow F. White L. A. (2012). Adaptive testing with multidimensional pairwise preference items: Improving the efficiency of personality and other noncognitive assessments. Organizational Research Methods, 15(3), 463–487. https://doi.org/10.1177/1094428112444611

96.
Sun T. Zhang B. Cao M. Drasgow F. (2022). Faking detection improved: Adopting a Likert item response process tree model. Organizational Research Methods, 25(3), 490–512. https://doi.org/10.1177/10944281211002904

97.
Sun T. Zhang B. Phan W. M. J. Drasgow F. Roberts B. (2019, July). “Meh!”: Examining Midpoint Endorsement Habitude (MEH) in survey research. In Academy of Management Proceedings (Vol. 2019, p. 16421). Academy of Management. https://doi.org/10.5465/AMBPP.2019.227

98.
Sylva H. Mol S. T. (2009). E-Recruitment: A study into applicant perceptions of an online application system. International Journal of Selection and Assessment, 17(3), 311–323. https://doi.org/10.1111/j.1468-2389.2009.00473.x

99.
Tay L. Ng V. (2018). Ideal point modeling of non-cognitive constructs: Review and recommendations for research. Frontiers in Psychology, 9, 2423. https://doi.org/10.3389/fpsyg.2018.02423

100.
Tonidandel S. Quiñones M. A. Adams A. A. (2002). Computer-adaptive testing: The impact of test characteristics on perceived performance and test takers’ reactions. Journal of Applied Psychology, 87(2), 320–332. https://doi.org/10.1037/0021-9010.87.2.320

101.
Tu N. Zhang B. Angrave L. Sun T. (2021). bmggum: An R package for Bayesian estimation of the multidimensional generalized graded unfolding model with covariates. Applied Psychological Measurement, 45(7-8), 553–555. https://doi.org/10.1177/01466216211040488

102.
Tu N. Zhang B. Angrave L. Sun T. Neuman M. (2023). Estimating the multidimensional generalized graded unfolding model with covariates using a bayesian approach. Journal of Intelligence, 11(8), 163. https://doi.org/10.3390/jintelligence11080163

103.
Van Iddekinge C. H. Lievens F. Sackett P. R. (2023). Personnel selection: A review of ways to maximize validity, diversity, and the applicant experience. Personnel Psychology.

104.
Vasilopoulos N. L. Cucina J. M. Dyomina N. V. Morewitz C. L. Reilly R. R. (2006). Forced-choice personality tests: A measure of personality and cognitive ability? Human Performance, 19(3), 175–199. https://doi.org/10.1207/s15327043hup1903_1

105.
Walton K. E. Cherkasova L. Roberts R. D. (2020). On the validity of forced choice scores derived from the Thurstonian item response theory model. Assessment, 27(4), 706–718. https://doi.org/10.1177/1073191119843585

106.
Wang W. C. Wu S. L. (2016). Confirmatory multidimensional IRT unfolding models for graded-response items. Applied Psychological Measurement, 40(1), 56–72. https://doi.org/10.1177/0146621615602855

107.
Wetzel E. Frick S. (2020). Comparing the validity of trait estimates from the multidimensional forced-choice format and the rating scale format. Psychological Assessment, 32(3), 239–253. https://doi.org/10.1037/pas0000781

108.
Wetzel E. Frick S. Brown A. (2021). Does multidimensional forced-choice prevent faking? Comparing the susceptibility of the multidimensional forced-choice format and the rating scale format to faking. Psychological Assessment, 33(2), 156–170. https://doi.org/10.1037/pas0000971

109.
Wiernik B. M. Dahlke J. A. (2020). Obtaining unbiased results in meta-analysis: The importance of correcting for statistical artifacts. Advances in Methods and Practices in Psychological Science, 3(1), 94–123. https://doi.org/10.1177/2515245919885611

110.
Williams L. J. Anderson S. E. (1991). Job satisfaction and organizational commitment as predictors of organizational citizenship and in-role behaviors. Journal of Management, 17(3), 601–617. https://doi.org/10.1177/014920639101700305

111.
Worthington R. L. Whittaker T. A. (2006). Scale development research: A content analysis and recommendations for best practices. The Counseling Psychologist, 34(6), 806–838. https://doi.org/10.1177/0011000006288127

112.
Zettler I. Thielmann I. Hilbig B. E. Moshagen M. (2020). The nomological net of the HEXACO model of personality: A large-scale meta-analytic investigation. Perspectives on Psychological Science, 15(3), 723–760. https://doi.org/10.1177/1745691619895036

113.
Zhang B. Cao M. Tay L. Luo J. Drasgow F. (2020). Examining the item response process to personality measures in high-stakes situations: Issues of measurement validity and predictive validity. Personnel Psychology, 73(2), 305–332. https://doi.org/10.1111/peps.12353

114.
Zhang B. Luo J. Li J. (2023). Moving beyond Likert and traditional forced-choice scales: A comprehensive investigation of the graded forced-choice format. Multivariate Behavioral Research. https://doi.org/10.1080/00273171.2023.2235682

115.
Zhang B. Sun T. Drasgow F. Chernyshenko O. S. Nye C. D. Stark S. White L. A. (2020). Though forced, still valid: Psychometric equivalence of forced-choice and single-statement measures. Organizational Research Methods, 23(3), 569–590. https://doi.org/10.1177/1094428119836486

116.
Zhang B. Tu N. Angrave L. Zhang S. Sun T. Tay L. Li J. (2023). The Generalized Thurstonian Unfolding Model (GTUM): Advancing the modeling of forced-choice data. Organizational Research Methods. https://doi.org/10.1177/109442812312104

117.
Zimmerman R. D. (2008). Understanding the impact of personality traits on individuals’ turnover decisions: A meta-analytic path model. Personnel Psychology, 61(2), 309–348. https://doi.org/10.1111/j.1744-6570.2008.00115.x


Measure	FC1 M	FC1 SD	FC1 ω	FC2 M	FC2 SD	FC2 ω	FC3 M	FC3 SD	FC3 ω	FC4 M	FC4 SD	FC4 ω	d (1–2)	d (1–3)	d (1–4)	d (2–3)	d (2–4)	d (3–4)
Time 1 Respondent Reactions
Affect	4.04	0.70	.75	4.01	0.69	.74	3.99	0.68	.76	4.00	0.65	.68	.05	.08	.06	.03	.01	−.02
Accuracy	3.39	0.75	.68	3.38	0.81	.75	3.34	0.70	.68	3.39	0.74	.69	.02	.07	−.003	.05	−.02	−.07
Utility	3.33	0.80	.76	3.33	0.81	.77	3.32	0.73	.72	3.39	0.73	.72	.00	.02	−.07	.01	−.07	−.09
Faking Resistance	2.93	0.89	.77	2.88	0.84	.68	2.89	0.78	.68	2.87	0.80	.66	.06	.05	.07	−.02	.01	.02
Difficulty	2.47	0.91	.76	2.46	0.93	.78	2.50	0.90	.77	2.49	0.90	.75	.01	−.04	−.02	−.04	−.03	.02
Burden	2.39	0.68	.59	2.38	0.73	.59	2.37	0.66	.52	2.40	0.70	.59	.01	.02	−.02	.01	−.03	−.04
Concentration	3.85	0.62	.53	3.80	0.60	.46	3.75	0.61	.54	3.79	0.60	.52	.07	.15	.10	.08	.02	−.06
Exerted Effort	7.83	2.24	-	7.81	2.29	-	7.43	2.06	-	7.58	2.08	-	.01	.19	.12	.17	.11	−.07
Exhaustion	3.25	2.36	-	3.40	2.48	-	3.51	2.30	-	3.62	2.39	-	−.06	−.11	−.16	−.05	−.09	−.05
Energy Level	9.33	1.66	-	9.19	1.64	-	9.28	1.59	-	9.30	1.64	-	.09	.03	.02	−.06	−.07	−.01
Time 2 Respondent Reactions
Fairness	3.22	1.00	.84	3.35	0.95	.81	3.31	0.86	.71	3.36	0.90	.75	−.13	−.10	−.15	.05	−.01	−.06
Validity	2.51	1.06	.85	2.65	1.06	.83	2.48	0.93	.82	2.58	1.08	.85	−.13	.04	−.06	.17	.07	−.10
Satisfaction	3.08	1.01	.77	3.23	0.96	.71	3.15	0.92	.72	3.27	0.92	.69	−.16	−.07	−.19	.09	−.03	−.13
Org Attractiveness	3.66	0.98	.88	3.75	0.87	.86	3.62	0.84	.84	3.72	0.84	.83	−.10	.04	−.07	.16	.04	−.13
Intent to Accept	4.10	0.79	.82	4.16	0.72	.84	4.07	0.69	.75	4.15	0.72	.82	−.08	.04	−.07	.13	.01	−.12
Face Validity	2.67	1.11	.88	2.77	1.07	.89	2.71	1.00	.84	2.78	1.05	.86	−.09	−.04	−.10	.06	−.01	−.07
Intent to Recommend	3.24	1.05	.93	3.33	0.97	.89	3.14	0.89	.88	3.30	0.94	.91	−.09	.11	−.06	.21	.03	−.18
Faking Resistance	2.60	1.13	.91	2.84	1.16	.88	2.65	1.09	.87	2.50	1.12	.88	−.21	−.05	.09	.17	.30*	.14
Accuracy	3.19	0.91	.65	3.34	0.86	.61	3.20	0.77	.56	3.32	0.88	.65	−.16	−.01	−.14	.17	.02	−.14
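
The Cohen's d columns in the table above can be approximately reproduced from the reported means and standard deviations. The following is a minimal sketch, not the authors' procedure: it assumes equal group sizes (group ns are not reported in this table), so the pooled SD is the root mean square of the two SDs. The function name `cohens_d` is hypothetical and introduced only for illustration. Because the inputs are rounded to two decimals, results can differ from the tabled values in the second decimal.

```python
import math


def cohens_d(m1, sd1, m2, sd2):
    """Cohen's d for two independent groups.

    Assumption (not confirmed by the source): equal group sizes, so the
    pooled SD is sqrt((sd1^2 + sd2^2) / 2).
    """
    pooled_sd = math.sqrt((sd1 ** 2 + sd2 ** 2) / 2)
    return (m1 - m2) / pooled_sd


# Time 2 Faking Resistance, FC2 vs. FC4 (tabled d = .30)
d_faking = cohens_d(2.84, 1.16, 2.50, 1.12)

# Time 1 Exerted Effort, FC1 vs. FC3 (tabled d = .19)
d_effort = cohens_d(7.83, 2.24, 7.43, 2.06)

print(f"Faking Resistance (2-4): d = {d_faking:.2f}")
print(f"Exerted Effort (1-3):    d = {d_effort:.2f}")
```

With unequal group sizes, the pooled SD would instead weight each variance by its degrees of freedom, which requires the per-condition ns.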