A Multimodal Item Response Modeling for Personality Assessment in Organizational Research

Abstract

Recent advances in process data collection have made it possible to efficiently collect multimodal behavioral indicators, such as response times and eye-tracking measures. These multimodal data have been widely applied in cognitive and achievement assessments, where they have improved the accuracy of latent construct estimation. However, the use of informative multimodal process data in noncognitive assessments, such as personality measures widely used in organizational research, has received considerably less attention. To address this gap, we integrate response time and eye-tracking data into a conventional item response model to capture respondents’ response processes, thereby improving differentiation across trait levels and enhancing noncognitive assessment. Simulation studies were conducted to evaluate the performance of the proposed model and compare it with a conventional IRT model. Results indicate that model parameters can be accurately recovered and that incorporating multimodal data significantly improves the accuracy of person latent trait estimates. Finally, an empirical analysis was conducted to demonstrate the applicability and advantages of the proposed model in personality assessment.

Keywords

graded response model; multimodal data; response uncertainty; personality assessment; Markov chain Monte Carlo (MCMC) algorithm

In organizational research, self-report measures, such as questionnaires and surveys, are widely used to assess constructs such as personality, attitudes, and beliefs. Although efficient, these measures are subject to well-known limitations, including social desirability bias, extreme response styles, and unconscious influences (Donaldson & Grant-Vallone, 2002; Kam & Meyer, 2015; Wetzel et al., 2016). Consequently, scholars have increasingly emphasized the need to complement self-report data with behavioral measures and to integrate multiple sources of information to improve measurement accuracy (Chen & Wojcik, 2016; Meißner & Oll, 2019).

Recent technological advances have made the collection of multimodal process data (e.g., response times, eye-tracking measures, and mouse trajectories) more accessible. Unlike traditional self-report scores, which primarily capture a static outcome (i.e., the final response), multimodal data provide insight into the dynamic cognitive and decision-making processes underlying respondents’ answers. This additional information enables researchers to refine response scoring, thereby improving the accuracy of inferences about respondents’ behavioral patterns and psychological traits.

Theoretical work on item response processes provides additional support for incorporating multimodal data. Response difficulty increases as the distance between a person's trait level and the item location decreases (Ferrando, 2007; Ferrando & Lorenzo-Seva, 2007; Meng et al., 2013). For polytomous items, the latent trait can be conceptualized as a continuum that each item partitions with multiple category thresholds. When a respondent's trait level lies far from these thresholds, responses tend to be faster and more confident. Conversely, when the trait level is close to a threshold, none of the available categories fully matches the respondent's standing, leading to greater hesitation and decision conflict. These differences in response certainty reflect person–item interactions and can be captured by multimodal behavioral indicators (Guo et al., 2024; Krajbich et al., 2010; Uggeldahl et al., 2016), such as response times and eye-tracking measures (e.g., fixation counts).

Joint modeling of item responses and multimodal process data within the framework of item response theory (IRT) has demonstrated considerable potential in cognitive and achievement assessments by improving parameter recovery and providing deeper insight into individual differences (Liang et al., 2023; Man et al., 2022; Zhan, 2022). However, relatively few studies have applied this approach to personality measurement, and even fewer have integrated more than two data modalities. To address this gap, the present study employs a cross-loading multimodal modeling framework to jointly analyze process data and outcome scores, thereby enhancing the assessment of personality traits.

Multimodal Process Data

As one of the most accessible and widely used sources of process data, response times (RTs) serve as a valuable indicator for detecting aberrant responding (Ulitzsch et al., 2020a, 2020b, 2022), identifying response styles (Henninger & Plieninger, 2021), and predicting preference strength (Konovalov & Krajbich, 2019). Specifically, RTs have been shown to be strongly associated with response uncertainty (Meng et al., 2013; Uggeldahl et al., 2016). The time required to respond is influenced by the individual's level of uncertainty: greater hesitation leads to longer latencies, whereas greater decisiveness leads to shorter latencies. Whereas response scores indicate what was selected, response time reflects how decisively the choice was made. Analysis of RT data can improve the differentiation of individuals across trait levels (Guo et al., 2024). A substantial body of research further indicates that jointly modeling response content and response times significantly improves the accuracy of latent trait estimation (Man et al., 2022; Tian et al., 2023).

Eye-tracking data is among the most widely used and important measures in psychological research. Although eye-tracking data collection remains predominantly lab-based, technological advances are increasing its accessibility and practicality. Apple's ARKit provides built-in eye-tracking capabilities on iOS devices, enabling researchers to use native eye-tracking functions on iPads without requiring additional specialized hardware (Anisimov et al., 2021; Taore et al., 2025). Scalable online research platforms, such as jsPsych internationally (Anwyl-Irvine et al., 2021) and Credamo in China (Zhang et al., 2023), now allow researchers to conduct remote eye-tracking studies using participants’ standard webcams. These platforms employ computer vision models to estimate gaze points, thereby making large-scale, cost-effective behavioral data collection more feasible. Meißner and Oll (2019) also emphasize the timeliness and potential of eye-tracking in organizational science, highlighting: (a) technological advances that have lowered costs and improved usability; (b) the growing importance of attention and information processing in today's information-rich organizational environments; (c) the ability of eye-tracking to provide real-time, objective behavioral data that complements traditional self-report methods.

Commonly used eye-tracking indicators include fixation count (FC), which measures the number of times an individual fixates on a location; saccades, which capture changes in fixation position; regression count, which indicates the number of times a participant returns their gaze to a specific target; and fixation duration, which quantifies how long the eyes remain still at a location. Among these indicators, FC is the most commonly used, as it may reflect the intensity of attention allocated to a targeted visual region (Zhan et al., 2022). In survey research, eye-tracking data have been used to examine questionnaire completion processes, for example: to investigate the relationship between eye movement measures and response consistency (Chauliac et al., 2020); to examine participants’ cognitive strategies, including how they interpret items, question stems, and response options (Margot et al., 2023); and to assess cognitive load (Kosch et al., 2018) and monitor attention (Michinov et al., 2015; Skinner et al., 2018).

Visual fixations are also employed to examine the decision-making process (Fisher, 2021; Krajbich et al., 2010; Uggeldahl et al., 2016; Van Loo et al., 2018). According to the attentional drift-diffusion model (aDDM; Krajbich & Rangel, 2011; Krajbich et al., 2012; Liu et al., 2023), when the difference between two options is small, more fixations are typically needed to accumulate sufficient evidence to reach the decision threshold and make a choice. Conversely, when the difference between the options is large, the decision-maker can reach the threshold with fewer fixations and make a decision more quickly. In other words, the number of fixations during item processing increases with rising uncertainty.

In summary, response time (RT) and fixation count (FC), as two types of multimodal process data, serve as key indicators of response uncertainty. When an individual is highly certain about how to respond to an item, the difficulty of selecting among alternatives is low; in this case, RTs are short, and the number of fixations (FC) tends to decrease. Conversely, if an individual perceives two or more alternatives as plausible, both RT and FC increase relative to the previous scenario.

Why Is Multimodal Assessment Needed in Personality Assessment?

In organizational research, noncognitive attributes, such as personality, leadership styles, self-efficacy, and job satisfaction, play a crucial role in guiding high-stakes decisions, including hiring, promotion, and personnel development (Al-Malki & Juan, 2018; Bakker et al., 2012; Bin & Shmailan, 2015; Judge et al., 2007). Measuring these constructs is critical for organizational decision-making. Self-report questionnaires remain the primary tool for their assessment and are widely considered efficient, cost-effective, and easy to administer.

Completing questionnaires is often regarded as a simple task, requiring participants to respond to each item sequentially. For each item, respondents must interpret the question and reflect on it using their own judgment. The Cognitive Aspects of Survey Methodology (CASM) movement emphasized that understanding individual response processes is essential for evaluating questionnaire validity and identifying potential sources of error (Fowler, 2013). The key assumption is that respondents’ cognitive processes underlie their survey responses (Schwarz, 2007; Tourangeau, 2003). The response process is not merely a mechanical selection of responses but a complex sequence of cognitive operations involving several stages, including item comprehension, retrieval of relevant information from memory, judgment formation, and mapping these judgments onto overt responses (Margot et al., 2023). The complexity of the response process cannot be fully captured by a single test score, which reflects only the final outcome of cognitive operations.

In cognitive and achievement assessment, the scope of measurable data has substantially expanded. These data include not only traditional response outcomes but also various process-related indicators. The technology-enhanced assessment (TEA; Sweeney et al., 2017; Webb & Gibson, 2015) system employs built-in sensors to record learners’ real-time psychological and biological responses (e.g., eye-trackers, motion detectors, heart rate monitors), enabling educators to monitor learning and adjust instruction accordingly (Man et al., 2022). With technological advancements, collateral process data collection has become routine in large-scale computer-based assessments, such as the Program for International Student Assessment (PISA). Since 2015, the PISA test has been administered on computers, with all student–computer interactions logged, enabling researchers to examine test-takers’ behaviors in greater detail (Ivanova et al., 2020). Compared with unimodal models that rely solely on response outcomes, multimodal data introduce additional informational constraints in parameter estimation, thereby improving the precision of latent trait estimation (e.g., Liang et al., 2023; Tian et al., 2023; Zhan, 2022; Zhan et al., 2022).

Although multimodal data have shown considerable utility in cognitive and achievement testing, their application in noncognitive assessment remains limited. Since measuring noncognitive traits also involves complex internal psychological processes, extending multimodal research approaches to this domain has become an inevitable trend. This approach shifts the focus from simple response outcomes to a more comprehensive capture of cognitive and behavioral processes, thereby enhancing the precision of noncognitive assessment.

Existing Approaches and Limitations

A review of prior literature on the joint modeling of examinees’ response data and multimodal process data indicates that hierarchical framework modeling, as proposed by Van der Linden (2007), is predominantly used for the combined analysis of responses and response times (RT). However, hierarchical framework modeling also presents several limitations. Ranger (2013) highlighted that the main theoretical limitation of hierarchical framework modeling is that incorporating RT data improves the estimation accuracy of latent ability only when the correlation between latent trait and latent processing speed is nonzero. Moreover, empirical studies have reported a low correlation between latent ability and latent processing speed (Man et al., 2022). According to Bolsinova and Tijmstra (2018), joint hierarchical framework models fall short in fully leveraging RT data because they assume that RTs are influenced solely by latent processing speed and not by latent traits. Crucially, hierarchical framework models do not account for how respondents’ cognitive processes generate the multimodal data observed during task performance.

Cross-loading modeling (Bolsinova & Tijmstra, 2018; Meng et al., 2013; Molenaar et al., 2015) can be viewed as an extension of the joint hierarchical modeling framework. In principle, including cross-loading terms allows auxiliary process indicators to directly inform the estimation of the focal latent trait, rather than being treated solely as correlated secondary outcomes. Regardless of the correlation between the latent trait and latent variables derived from multimodal data, these data can contribute to the estimation of latent trait parameters, thereby enhancing estimation accuracy.

Ferrando and Lorenzo-Seva (2007) initially proposed a response time model with cross-loading terms for dichotomous personality test items. Meng et al. (2013) later extended this model to polytomous IRT and argued that the difficulty of responding to an item reflects uncertainty in the decision-making process, with differences among response probabilities serving as key factors in determining item difficulty. Guo et al. (2024) introduced response uncertainty as a cross-loading term to link the latent trait with response time and found that this approach improved latent trait estimation accuracy in forced-choice assessments.

Therefore, we focus on personality assessment and develop a multimodal polytomous IRT model that integrates response outcomes, RT, and FC, with response uncertainty—measured via information entropy (see the section “Response Uncertainty in Personality Assessment”)—serving as a cross-loading indicator within a joint–cross-loading modeling framework.

Aims and Highlights of the Present Study

In both organizational and broader behavioral research, there have been strong calls for increased use of behavioral data and the integration of multiple methodological sources. With the growing prevalence of computer- and web-based assessments, the collection of process data has become routine. For example, the development of TEAs, together with advances in multimodal data acquisition, has emphasized the need for corresponding improvements in modeling techniques that can leverage these data. Despite the richness and diversity of these data sources, their application in organizational research remains limited. Most traditional IRT-based assessments provide limited information and fail to fully capture individual differences in latent traits, potentially constraining informed organizational decision-making.

To address this gap, this study employs a cross-loading modeling framework to jointly analyze multimodal process data. The model uses cross-loading terms to explain how response uncertainty arising from person–item interactions produces variability in multimodal data, and how these multimodal indicators, in turn, inform the cross-loading terms to enhance the precision of latent trait parameter estimation. As response uncertainty increases, item difficulty also rises, resulting in longer response times and higher visual fixation counts. This approach not only enables a more precise differentiation of individual latent traits but also establishes a methodological foundation for utilizing multimodal behavioral data in organizational decision-making contexts.

The remainder of this paper is organized as follows. First, cross-loading terms within the cross-loading modeling framework are introduced. Next, the three sub-models of the proposed framework are presented, namely the IRT model, the response time model, and the visual fixation count (FC) model, along with the rationale for their integration. Then, a simulation study is conducted to evaluate the feasibility of the proposed model, followed by a real data analysis to demonstrate the utility of multimodal data in practical applications. Finally, implications, limitations, and directions for future research are discussed.

The Proposed Multimodal Item Response Modeling for Personality Assessment

Response Uncertainty in Personality Assessment

In the present study, information entropy was introduced to more accurately quantify response uncertainty during the decision-making process. Differences among response probabilities reflect the uncertainty in responding to an item (Meng et al., 2013). In information theory, information entropy measures the uncertainty associated with a random variable (Shannon, 1948): the more uniform the distribution of outcomes, the greater the uncertainty and the higher the entropy; conversely, if an outcome is highly certain, the entropy is low. Therefore, to more effectively capture differences among response probabilities and quantify response uncertainty during the answering process, information entropy was introduced as an index of uncertainty. For K response categories, with integer values ranging from 0, 1, 2, …, K–1, information entropy can be defined as:

\begin{matrix} H_{i j} = - \sum_{k = 0}^{K - 1} P (y_{i j} = k ∣ θ_{i}) \log P (y_{i j} = k ∣ θ_{i}), \end{matrix}

(1)

Here, $H_{i j}$ represents the information entropy of person i's response to item j, and $P (y_{i j} = k ∣ θ_{i})$ denotes the probability of selecting response category k, as estimated by the IRT model. This uncertainty can manifest in multimodal data recorded during responding: low response uncertainty is typically associated with shorter RTs and fewer fixations, whereas high uncertainty corresponds to longer RTs and more fixations.

When information entropy approaches its maximum, log(K), meaning that the probabilities of all response categories are equal, i.e., $P (y_{i j} = 0 ∣ θ_{i}) = P (y_{i j} = 1 ∣ θ_{i}) = \cdots = P (y_{i j} = K - 1 ∣ θ_{i}) = 1 / K$, the examinee is highly indecisive and faces maximal difficulty in selecting an option, as all K alternatives are equally probable. Consequently, both RT and FC are likely to increase due to this heightened uncertainty.

When information entropy approaches its minimum, 0, meaning that the probability of selecting one response category is 100%, the examinee is certain which response to select, and choosing among alternatives involves minimal difficulty. Consequently, a rapid response with fewer fixations is expected.
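The two limiting cases described above can be checked numerically. The following is a minimal sketch of equation (1) using natural logarithms; the function name `response_entropy` and the three example probability vectors are illustrative assumptions, not part of the original study.

```python
import math

def response_entropy(probs):
    """Shannon entropy (natural log) of a categorical response distribution."""
    # Categories with probability 0 contribute nothing (limit of p*log(p) as p -> 0).
    return sum(-p * math.log(p) for p in probs if p > 0.0)

K = 5
uniform = [1.0 / K] * K                  # all categories equally likely
certain = [0.0, 0.0, 1.0, 0.0, 0.0]      # one category selected with probability 1
peaked = [0.02, 0.08, 0.80, 0.08, 0.02]  # hypothetical in-between case

print("uniform:", response_entropy(uniform), "log(K):", math.log(K))
print("certain:", response_entropy(certain))
print("peaked: ", response_entropy(peaked))
```

Under these assumptions, the uniform case attains the upper bound log(K), the degenerate case gives zero, and any intermediate distribution falls strictly between the two.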

The Framework of the Proposed Modeling (Multimodal-IRTM)

In this study, we employed a joint cross-loading modeling approach to integrate multiple types of data sources. Specifically, the joint model, referred to as multimodal-IRTM, consists of three components: Sub-Model I describes the response score data using an IRT model (e.g., the Graded Response Model; Samejima, 1969); Sub-Model II describes the response time data using the lognormal RT (LRT) model (Van der Linden, 2007); and Sub-Model III describes visual FC data using the Negative Binomial Fixation (NBF) model (Man & Harring, 2019). We then incorporate the introduced information entropy (see equation (1)) as a cross-loading term to link the three sub-models and to associate individuals’ latent traits with their response times and visual FCs. Further details of the proposed multimodal modeling framework are illustrated in Figure 1.

Figure 1.

A graphical representation of the proposed multimodal-IRTM.

For the Graded Response Model (Samejima, 1969), let there be $i = 1, 2, \ldots, N$ individuals, $j = 1, 2, \ldots, J$ items, and K response categories (integer values $0, 1, 2, \ldots, K - 1$). The item parameter $a_{j}$ is the slope parameter, and $b_{j} = (b_{j 1}, b_{j 2}, \ldots, b_{j, K - 1})$ is a set of strictly ordered threshold parameters. Let the cumulative category response probabilities be:

\begin{matrix} \begin{matrix} P (y_{i j} \geq 1 ∣ θ_{i}) = \frac{1}{1 + e^{- 1.7 a_{j} (θ_{i} - b_{j 1})}}, \\ ⋮ \\ P (y_{i j} \geq K - 1 ∣ θ_{i}) = \frac{1}{1 + e^{- 1.7 a_{j} (θ_{i} - b_{j K - 1})}} . \end{matrix} \end{matrix}

(2)

Then the probability that examinee i attains exactly score k on item $j$ is:

\begin{matrix} P (y_{i j} = k ∣ θ_{i}) = P (y_{i j} \geq k ∣ θ_{i}) - P (y_{i j} \geq k + 1 ∣ θ_{i}), \end{matrix}

(3)

with $P (y_{i j} \geq 0 ∣ θ_{i}) = 1$ and $P (y_{i j} \geq K ∣ θ_{i}) = 0$.
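As an illustration of equations (2) and (3), the sketch below computes GRM category probabilities by differencing the cumulative probabilities, using the boundary conventions P(y ≥ 0) = 1 and P(y ≥ K) = 0. The function name `grm_category_probs`, the slope value, and the threshold values are hypothetical and chosen only for demonstration.

```python
import math

def grm_category_probs(theta, a, thresholds, scale=1.7):
    """Category probabilities P(y = k | theta), k = 0..K-1, under the GRM."""
    # Cumulative probabilities P(y >= k | theta) for k = 1..K-1 (equation (2)).
    cumulative = [1.0 / (1.0 + math.exp(-scale * a * (theta - b))) for b in thresholds]
    # Boundary conventions: P(y >= 0) = 1 and P(y >= K) = 0.
    bounds = [1.0] + cumulative + [0.0]
    # Adjacent differences give the category probabilities (equation (3)).
    return [bounds[k] - bounds[k + 1] for k in range(len(bounds) - 1)]

thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical ordered thresholds for K = 5
for theta in (-3.0, 0.0, 0.5):
    probs = grm_category_probs(theta, a=1.0, thresholds=thresholds)
    print(theta, [round(p, 3) for p in probs])
```

With these illustrative values, a trait level far from all thresholds concentrates probability on one category, whereas a trait level sitting on a threshold splits probability between the two adjacent categories.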

The variation in respondents’ observed RT when answering an item is primarily influenced by three factors: (a) the information processing speed at which respondents work during the test (i.e., the person main effect), (b) the amount of labor required by the item (i.e., the item main effect), and (c) the person–item interaction effect (Thissen, 1983). Information entropy can be viewed as representing the person–item interaction term. The positive skewness and frequent large outliers inherent in RT data make the lognormal RT model a preferred choice for modeling such data (Van der Linden, 2007). Accordingly, the lognormal RT (LRT) model, incorporating information entropy, can be expressed as follows:

\begin{matrix} T_{i j} \sim f (t_{i j}; τ_{i}, θ_{i}, ω_{j}, ξ_{j}, β_{j}) = \frac{ω_{j}}{t_{i j} \sqrt{2 π}} \exp (- \frac{ω_{j}^{2}}{2} {(\log t_{i j} - (ξ_{j} + β_{j} H_{i j} - τ_{i}))}^{2}), \end{matrix}

(4)

with $ω_{j} = 1 / σ_{j}$.

Equation (4) can also be expressed as:

\begin{matrix} \log (T_{i j}) \sim N ((ξ_{j} + β_{j} H_{i j} - τ_{i}), σ_{j}^{2}), \end{matrix}

(5)

where $T_{i j}$ is the observed RT of person i on item j, and $\log (T_{i j})$ is its logarithm; $τ_{i}$ is the latent processing speed parameter, which reflects individual differences in general work pace; $ξ_{j}$ is the time intensity of item j, which reflects the time demand of the item; $ω_{j}$ is the reciprocal of the standard deviation of the error term (i.e., $1 / σ_{j}$) and is treated here as a time-precision parameter; and $β_{j}$ is the regression coefficient that determines the influence of information entropy $H_{i j}$ on response time.
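To make the role of the entropy term concrete, the sketch below evaluates the mean of the log response time in equation (5), the log-density in equation (4), and a single simulated response time. All function names and numeric values (including the two entropy levels) are illustrative assumptions rather than quantities reported in the study.

```python
import math
import random

def expected_log_rt(xi, beta, entropy, tau):
    """Mean of log(T_ij) in equation (5): xi_j + beta_j * H_ij - tau_i."""
    return xi + beta * entropy - tau

def lognormal_rt_logpdf(t, xi, beta, entropy, tau, omega):
    """Log of the density in equation (4), with omega_j = 1 / sigma_j."""
    resid = math.log(t) - expected_log_rt(xi, beta, entropy, tau)
    return (math.log(omega) - math.log(t) - 0.5 * math.log(2.0 * math.pi)
            - 0.5 * (omega * resid) ** 2)

def simulate_rt(xi, beta, entropy, tau, omega, rng):
    """Draw one response time by exponentiating a normal draw on the log scale."""
    return math.exp(rng.gauss(expected_log_rt(xi, beta, entropy, tau), 1.0 / omega))

rng = random.Random(1)
for h in (0.3, 1.5):  # hypothetical low and high entropy values
    median_rt = math.exp(expected_log_rt(xi=1.0, beta=0.5, entropy=h, tau=0.0))
    draw = simulate_rt(1.0, 0.5, h, 0.0, 2.0, rng)
    print("H =", h, "median RT =", round(median_rt, 3), "one draw =", round(draw, 3))
```

With a positive regression coefficient, a higher entropy value shifts the whole lognormal distribution toward longer response times, which is the mechanism the cross-loading term is meant to capture.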

For visual FC data, we adopt a parallel decomposition framework similar to that used for response times, in which FCs reflect person effects, item effects, and person–item interactions. Information entropy is incorporated to quantify response uncertainty. Following the approach of Man and Harring (2019), the Negative Binomial Fixation (NBF) model for FCs, incorporating information entropy, can be expressed as follows:

\begin{matrix} \begin{matrix} V_{i j} \sim f (V_{i j}; ϵ_{i}, θ_{i}, h_{j}, m_{j}) = \\ \frac{Γ (V_{i j} + h_{j})}{V_{i j}! Γ (h_{j})} {(\frac{h_{j}}{\exp (ϵ_{i} + λ_{j} H_{i j} + m_{j}) + h_{j}})}^{h_{j}} {(\frac{\exp (ϵ_{i} + λ_{j} H_{i j} + m_{j})}{\exp (ϵ_{i} + λ_{j} H_{i j} + m_{j}) + h_{j}})}^{V_{i j}} \end{matrix}, \end{matrix}

(6)

with $E (V_{i j}) = μ_{i j} = \exp (ϵ_{i} + λ_{j} H_{i j} + m_{j})$. Defining $d_{j} = 1 / \sqrt{u_{\cdot j} + u_{\cdot j}^{2} / h_{j}}$, where $u_{\cdot j} = \sum_{j = 1}^{J} μ_{i j} / J$, equation (6) can also be expressed as:

\begin{matrix} V_{i j} \sim N B (\exp (m_{j} + λ_{j} H_{i j} + ϵ_{i}), d_{j}^{- 2}), \end{matrix}

(7)

where $V_{i j}$ is the observed FC of person i on item j; $ϵ_{i}$ is the latent visual engagement of person $i$, which denotes the examinee's overall level of test engagement; $m_{j}$ is the visual intensity of item j, which represents the average amount of cognitive engagement required to answer the item; $h_{j}$ is the dispersion (shape) parameter of the distribution of $V_{i j}$; $d_{j}$ is the visual discrimination parameter, reflecting the dispersion of the FCs on item $j$; and $λ_{j}$ is the regression coefficient that determines the influence of information entropy $H_{i j}$ on FCs.
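The probability mass function in equation (6) can be evaluated directly on the log scale. The sketch below does so with the standard library and checks numerically that the distribution has mean μ and variance μ + μ²/h, which is the quantity under the square root in the definition of the visual discrimination parameter. Function names and parameter values are illustrative assumptions.

```python
import math

def nbf_mean(epsilon, lam, entropy, m):
    """Expected fixation count: exp(epsilon_i + lambda_j * H_ij + m_j)."""
    return math.exp(epsilon + lam * entropy + m)

def nbf_log_pmf(v, mu, h):
    """Log of equation (6): negative binomial with mean mu and shape h."""
    return (math.lgamma(v + h) - math.lgamma(v + 1) - math.lgamma(h)
            + h * math.log(h / (mu + h))
            + v * math.log(mu / (mu + h)))

mu = nbf_mean(epsilon=0.0, lam=0.5, entropy=1.0, m=2.0)  # hypothetical values
h = 3.0                                                  # hypothetical shape
pmf = [math.exp(nbf_log_pmf(v, mu, h)) for v in range(2000)]  # truncated support

total = sum(pmf)
mean = sum(v * p for v, p in enumerate(pmf))
variance = sum((v - mean) ** 2 * p for v, p in enumerate(pmf))
print("total probability:", total)
print("mean:", mean, "vs mu:", mu)
print("variance:", variance, "vs mu + mu^2/h:", mu + mu ** 2 / h)
```

The truncation at 2,000 counts is an arbitrary choice that leaves a negligible tail for these parameter values; a larger mean would call for a wider range.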

Finally, suppose that a correlation exists between participants’ latent processing speed and latent visual engagement. In this case, the joint distribution of latent processing speed and latent visual engagement is assumed to follow a multivariate normal distribution:

\begin{matrix} \begin{matrix} Θ_{i} = (\begin{matrix} τ_{i} \\ ϵ_{i} \end{matrix}) \sim M V N ((\begin{matrix} μ_{τ} \\ μ_{ϵ} \end{matrix}), Σ_{person}), Σ_{person} = (\begin{matrix} σ_{τ}^{2} \\ σ_{τ ϵ} & σ_{ϵ}^{2} \end{matrix}) \end{matrix} \end{matrix}

(8)

A multivariate normal distribution is also assumed for the item parameters, such that:

\begin{matrix} Ξ_{j} = (\begin{matrix} b_{j k} \\ ξ_{j} \\ m_{j} \end{matrix}) \sim M V N ((\begin{matrix} μ_{b_{k}} \\ μ_{ξ} \\ μ_{m} \end{matrix}), Σ_{i t e m}), Σ_{i t e m} = (\begin{matrix} σ_{b_{k}}^{2} \\ σ_{b_{k} ξ} & σ_{ξ}^{2} \\ σ_{b_{k} m} & σ_{ξ m} & σ_{m}^{2} \end{matrix}) \end{matrix}

(9)

and $b_{j 1} < b_{j 2} < b_{j 3} < b_{j 4}$.

The log-likelihood of the proposed model can be expressed as:

\begin{matrix} \begin{matrix} ℓ (θ, Θ, Ξ ∣ y, \log (t), V) = \sum_{i = 1}^{N} \sum_{j = 1}^{J} \log p (y_{i j}, \log (t_{i j}), V_{i j} ∣ θ_{i}, Θ_{i}, Ξ_{j}) \\ = \sum_{i = 1}^{N} \sum_{j = 1}^{J} [\log p (y_{i j} ∣ θ_{i}, Ξ_{j}) + \log p (\log (t_{i j}) ∣ θ_{i}, Θ_{i}, Ξ_{j}) + \log p (V_{i j} ∣ θ_{i}, Θ_{i}, Ξ_{j})] \end{matrix} \end{matrix}

(10)

where $p (y_{i j} ∣ θ_{i}, Ξ_{j})$ is the response probability calculated from equations (2) and (3); $p (\log (t_{i j}) ∣ θ_{i}, Θ_{i}, Ξ_{j})$ represents the normal probability density function with mean and variance given by equation (4); and $p (V_{i j} ∣ θ_{i}, Θ_{i}, Ξ_{j})$ denotes the probability mass function of a negative binomial distribution with parameters derived from equation (6). Equation (10) is a formal model that links response scores, RT, and FC to the participants’ latent traits. It can be used to infer an individual’s trait scores based on the observed response scores, RTs, and FCs.
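The additive structure of equation (10) can be sketched for a single person–item observation as follows. This is an illustrative reading of the model rather than the authors' estimation code: the helper names, the dictionary layout for item parameters, and all numeric values are assumptions.

```python
import math

def grm_probs(theta, a, b, scale=1.7):
    """Category probabilities from equations (2) and (3)."""
    cum = [1.0] + [1.0 / (1.0 + math.exp(-scale * a * (theta - bk))) for bk in b] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(b) + 1)]

def entropy(probs):
    """Information entropy from equation (1), natural log."""
    return sum(-p * math.log(p) for p in probs if p > 0.0)

def normal_logpdf(x, mean, sd):
    return -0.5 * math.log(2.0 * math.pi) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2

def nb_logpmf(v, mu, h):
    """Log of the negative binomial mass function in equation (6)."""
    return (math.lgamma(v + h) - math.lgamma(v + 1) - math.lgamma(h)
            + h * math.log(h / (mu + h)) + v * math.log(mu / (mu + h)))

def joint_loglik(y, log_t, v, theta, tau, eps, item):
    """One summand of equation (10): score + log-RT + fixation-count terms."""
    probs = grm_probs(theta, item["a"], item["b"])
    h_ij = entropy(probs)  # cross-loading term shared by the RT and FC parts
    ll_y = math.log(probs[y])
    ll_t = normal_logpdf(log_t, item["xi"] + item["beta"] * h_ij - tau, item["sigma"])
    ll_v = nb_logpmf(v, math.exp(eps + item["lam"] * h_ij + item["m"]), item["h"])
    return ll_y + ll_t + ll_v

# Hypothetical item parameters for a five-category item.
item = {"a": 1.0, "b": [-1.5, -0.5, 0.5, 1.5], "xi": 1.0, "sigma": 0.5,
        "m": 2.0, "h": 3.0, "beta": 0.5, "lam": 0.5}
print(joint_loglik(y=2, log_t=1.4, v=9, theta=0.2, tau=0.0, eps=0.0, item=item))
```

Because the entropy is a function of the latent trait, the RT and FC terms carry information about the trait whenever the regression coefficients are nonzero; setting both coefficients to zero leaves only the score term dependent on the trait.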

Bayesian Parameter Estimation

Given the complexity of the proposed model, the parameters of the multimodal-IRTM model were estimated using a Bayesian approach with MCMC sampling, implemented in R via the nimble package. The priors were set following those in Man and Harring (2019), Man et al. (2022), Zhan et al. (2022), Guo et al. (2024), and Zhou and Guo (2026).

First, $Y_{i j}$ , $\log (T_{i j})$ and $V_{i j}$ are assumed to be distributed as:

Y_{i j} \sim Categorical (P_{i j}),

\log (T_{i j}) \sim N ((ξ_{j} + β_{j} H_{i j} - τ_{i}), σ_{j}^{2}),

V_{i j} \sim N B (\exp (m_{j} + λ_{j} H_{i j} + ϵ_{i}), d_{j}^{- 2}),

Due to the assumed correlation between latent processing speed and latent visual engagement, the priors for the person parameters of the LRT and NBF models are assumed to follow a multivariate normal distribution, that is:

(\begin{matrix} τ_{i} \\ ϵ_{i} \end{matrix}) \sim M V N ((\begin{matrix} 0 \\ 0 \end{matrix}), Σ_{person}) .

Furthermore, the hyperprior for the covariance matrix of the multivariate normal distribution is specified as $Σ_{person} \sim Inverse - Wishart (R, 2)$ , where R is a 2 × 2 identity matrix.

For the GRM, the person parameter $θ$ is assigned a standard normal prior, $θ \sim N (0, 1)$ .

In addition, the priors of item parameters are specified as

a \sim Lognormal (0, 0.25),

(\begin{matrix} b_{j k} \\ ξ_{j} \\ m_{j} \end{matrix}) \sim M V N ((\begin{matrix} 0 \\ 1 \\ 2 \end{matrix}), Σ_{i t e m}),

and b_{j 1} < b_{j 2} < b_{j 3} < b_{j 4},

Σ_{item} \sim Inverse - Wishart (R, 6),

σ_{j} \sim InvGamma (1, 1),

h_{j} \sim InvGamma (1, 1)

Following Guo et al. (2024), the priors for the remaining two regression coefficients of information entropy are set as

β_{j} \sim N (0, 1), λ_{j} \sim N (0, 1)

Simulation Studies

The simulation study was designed to evaluate the performance of the proposed multimodal-IRTM model. Specifically, two complementary analyses were conducted within the same simulation framework: first, parameter recovery was assessed to examine the accuracy and precision of latent trait estimates under varying test lengths, sample sizes, and regression coefficients, with a fixed number of response categories (K = 5); second, the comparative performance of the GRM and multimodal-IRTM models was evaluated to determine whether incorporating multimodal information improves the accuracy of personality assessment.

Design and Data Generation

A simulation design was adopted to investigate both parameter recovery and comparative model performance. Following prior simulation studies (e.g., Man et al., 2022; Meng et al., 2013; Zhan et al., 2022), the manipulated factors included test length (J = 15, 30), sample size (N = 200, 500, 1000), regression coefficients of information entropy (RC: $β_{j}$ and $λ_{j}$ = 0, 0.5, 1, 1.5), and analysis model (GRM vs. multimodal-IRTM). Each condition was replicated 50 times. The number of response categories was set to K = 5, reflecting the widespread use of five-point Likert scales in personality research. When the regression coefficients associated with information entropy are fixed at 0, the multimodal-IRTM model reduces to a combination of the separate GRM, LRT, and NBF models, thereby allowing a direct comparison of whether RT and FC contribute to latent trait estimation.
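The size of the design can be tallied with a short enumeration. The sketch below assumes that the two regression coefficients take the same value within a condition and that the factors are fully crossed; both are readings of the design description rather than details stated explicitly, and the variable names are illustrative.

```python
from itertools import product

test_lengths = [15, 30]
sample_sizes = [200, 500, 1000]
regression_coefs = [0.0, 0.5, 1.0, 1.5]  # assumed shared value for beta_j and lambda_j
models = ["GRM", "multimodal-IRTM"]
n_replications = 50

# Data-generating conditions: test length x sample size x regression coefficient.
data_conditions = list(product(test_lengths, sample_sizes, regression_coefs))
# Each generated data set is analyzed under both models.
analysis_cells = list(product(data_conditions, models))

print("data-generating conditions:", len(data_conditions))
print("condition-by-model cells:", len(analysis_cells))
print("generated data sets:", len(data_conditions) * n_replications)
```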

The latent processing speed $τ_{i}$ and the latent visual engagement $ϵ_{i}$ were generated from a multivariate normal distribution, as follows:

(\begin{matrix} τ_{i} \\ ϵ_{i} \end{matrix}) \sim M V N ((\begin{matrix} 0 \\ 0 \end{matrix}), (\begin{matrix} 0.25 \\ ρ_{τ ϵ} \times 0.25 & 0.25 \end{matrix})),

where $ρ_{τ ϵ} = - 0.5$, as suggested in Man et al. (2022), represents a moderate negative correlation, indicating that slower processing speed is associated with higher visual engagement, whereas faster processing speed corresponds to lower visual engagement.

The latent personality trait $θ$ was drawn from a standard normal distribution $N (0, 1)$ . The correlations between the latent trait θ and latent processing speed τ, as well as between θ and latent visual engagement ϵ, were not explicitly specified because the present study adopted a joint cross-loading modeling framework, in which the relationships between θ and τ, and between θ and $ϵ$ , were implicitly captured by the cross-loading terms (Zhan et al., 2022).

The ordered item thresholds $b_{j k}$ , the item time intensity $ξ_{j}$ , and the item visual intensity $m_{j}$ were generated from a multivariate normal distribution, as follows:

(\begin{matrix} b_{j k} \\ ξ_{j} \\ m_{j} \end{matrix}) \sim M V N ((\begin{matrix} 0 \\ 1 \\ 2 \end{matrix}), (\begin{matrix} 1 \\ 0.15 & 0.25 \\ 0.15 & 0.125 & 0.25 \end{matrix})),

where $ρ_{b_{k} ξ} = 0.3$, $ρ_{b_{k} m} = 0.3$, and $ρ_{ξ m} = 0.5$. Such settings follow established procedures in prior studies (e.g., Tian et al., 2023; Zhan, 2022; Zhan et al., 2022). Specifically, the item threshold parameters were determined using a hierarchical framework model and guided by multiple prior studies (e.g., Curtis, 2010; Tong et al., 2022; Zhou & Guo, 2026). A normal distribution was selected for the item threshold parameters to account for potential variations in response times and eye movements across items. Prior research indicates that item parameters derived from both response outcomes and process data are often correlated rather than independent (Man et al., 2022). Consequently, a positive correlation was hypothesized between item threshold parameters and their corresponding time- and visual-intensity parameters, implying that items with higher threshold values are associated with higher time- and visual-intensity measures.
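The stated correlations and the covariance matrix used for data generation can be cross-checked, since each covariance equals the correlation multiplied by the two standard deviations. The sketch below rebuilds the matrix and draws one item parameter vector through a Cholesky factor. It is a simplified illustration: it draws a single threshold location per item and does not reproduce the ordering constraint on the K − 1 thresholds, and the function names are assumptions.

```python
import math
import random

sds = [1.0, 0.5, 0.5]   # square roots of the diagonal entries 1, 0.25, 0.25
cors = [[1.0, 0.3, 0.3],
        [0.3, 1.0, 0.5],
        [0.3, 0.5, 1.0]]
means = [0.0, 1.0, 2.0]  # means of b_jk, xi_j, m_j

# Covariance = correlation * sd_row * sd_column.
sigma_item = [[cors[r][c] * sds[r] * sds[c] for c in range(3)] for r in range(3)]

def cholesky(matrix):
    """Lower-triangular factor L with L * L^T = matrix (Cholesky-Banachiewicz)."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower

def draw_item_params(rng):
    """One multivariate normal draw: mean + L * z with z standard normal."""
    lower = cholesky(sigma_item)
    z = [rng.gauss(0.0, 1.0) for _ in range(3)]
    return [means[r] + sum(lower[r][k] * z[k] for k in range(r + 1)) for r in range(3)]

for row in sigma_item:
    print([round(x, 3) for x in row])
print(draw_item_params(random.Random(0)))
```

The rebuilt off-diagonal entries match the values 0.15, 0.15, and 0.125 given in the generating distribution, and the successful factorization confirms that the matrix is positive definite.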

The remaining parameter of the GRM, the item slope $a_{j}$ , was drawn from a log-normal distribution, i.e., $\log (a_{j}) \sim N (0, {0.25}^{2})$ (Gorney, 2024). The dispersion or shape parameter of the NBF model, $h_{j}$ , was generated from $InvGamma (2, 6)$ . For simplicity, the time-precision parameter $ω_{j}$ was set to 2, i.e., $ω_{j} = 1 / σ_{j} = 2$ , so that the standard deviation of the error term, $σ_{j}$ , is 0.5 (Guo et al., 2024).
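As a rough illustration of these generating distributions, the sketch below draws the three sets of item parameters with the Python standard library. It is not the authors' simulation code: the seed and variable names are arbitrary, and reading the two inverse-gamma parameters as (shape, scale) is an assumption, since the text does not state the parameterization.

```python
import random

rng = random.Random(2024)  # arbitrary fixed seed for reproducibility
n_items = 15

# Item slopes: log(a_j) ~ N(0, 0.25^2), so a_j is log-normal and positive.
slopes = [rng.lognormvariate(0.0, 0.25) for _ in range(n_items)]

# Dispersion parameters: h_j ~ InvGamma(2, 6).
# ASSUMPTION: (2, 6) is read as (shape, scale). If X ~ Gamma(shape, rate = 6),
# then 1 / X ~ InvGamma(shape, scale = 6). random.gammavariate takes
# (shape, scale), so the gamma scale passed here is 1 / 6.
shape, scale = 2.0, 6.0
dispersions = [1.0 / rng.gammavariate(shape, 1.0 / scale) for _ in range(n_items)]

# Time precision fixed at omega_j = 2, hence sigma_j = 1 / omega_j = 0.5.
omega = 2.0
sigmas = [1.0 / omega] * n_items
```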

Finally, item scores, response times, and visual FCs were simulated using the GRM model (equations (2) and (3)), the LRT model (equation (5)), and the negative binomial model (equation (7)), respectively.
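To make the score-generation step concrete, the following sketch simulates a single item score under a standard logistic graded response model. The parameterization P(Y ≥ k) = logistic(a(θ − b_k)) is an assumption, because equations (2) and (3) are not reproduced in this section, and the threshold values are hypothetical.

```python
import math
import random


def grm_category_probs(theta, slope, thresholds):
    """Category probabilities under a standard logistic graded response model.

    ASSUMPTION: P(Y >= k) = 1 / (1 + exp(-slope * (theta - b_k))) with ordered
    thresholds b_1 < ... < b_K; the paper's equations (2) and (3) may differ.
    """
    cumulative = [1.0]  # P(Y >= lowest category) is always 1
    for b in thresholds:
        cumulative.append(1.0 / (1.0 + math.exp(-slope * (theta - b))))
    cumulative.append(0.0)  # P(Y > highest category) is always 0
    return [cumulative[k] - cumulative[k + 1] for k in range(len(thresholds) + 1)]


def simulate_score(theta, slope, thresholds, rng):
    """Draw one item score in 1..K+1 from the category probabilities."""
    probs = grm_category_probs(theta, slope, thresholds)
    return rng.choices(range(1, len(probs) + 1), weights=probs, k=1)[0]


rng = random.Random(1)
example_thresholds = [-1.5, -0.5, 0.5, 1.5]  # hypothetical ordered thresholds
probs = grm_category_probs(theta=0.0, slope=1.0, thresholds=example_thresholds)
score = simulate_score(0.0, 1.0, example_thresholds, rng)
```

Response times and fixation counts would be generated analogously from the LRT and negative binomial models, which are omitted here because their exact forms depend on equations (5) and (7).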

Analysis

Considering the complexity of the proposed model, this study used the R package nimble (de Valpine et al., 2017) for Bayesian inference. The nimble package supports programming with BUGS models using syntax similar to WinBUGS and JAGS, but with more flexibility in defining models and algorithms. Users work from within R, and nimble generates C++ code for faster computation. For each condition, parameters were estimated using the MCMC algorithm with a single chain, a thinning interval of 1, and 10,000 iterations; the first 5,000 iterations were discarded as burn-in, and the remaining 5,000 were used for model parameter inference.

To investigate the accuracy of parameter estimation, we computed the BIAS and root mean square error (RMSE) as

\begin{matrix} BIAS (\hat{x}) & = \frac{\sum_{r = 1}^{R} \sum_{n = 1}^{N} (x - \hat{x})}{R \cdot N}, \end{matrix}

\begin{matrix} RMSE (\hat{x}) & = \sqrt{\frac{\sum_{r = 1}^{R} \sum_{n = 1}^{N} {(x - \hat{x})}^{2}}{R \cdot N}} \end{matrix},

where $x$ is the true value, $\hat{x}$ is the estimated value, $R = 50$ represents the number of repetitions, and $N$ denotes the number of persons or items.

Additionally, correlations (COR) between estimated and true parameter values were evaluated to assess parameter recovery accuracy.
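The three recovery criteria can be sketched as follows. The functions take paired true and estimated values that have already been pooled across replications; BIAS follows the sign convention of the formula above (true value minus estimate), so a negative value means the estimates exceed the true values. The example values are hypothetical.

```python
import math


def bias(true_values, estimates):
    """Mean of (true - estimate), following the sign convention in the text."""
    return sum(t - e for t, e in zip(true_values, estimates)) / len(true_values)


def rmse(true_values, estimates):
    """Root mean square error between true values and estimates."""
    squared = sum((t - e) ** 2 for t, e in zip(true_values, estimates))
    return math.sqrt(squared / len(true_values))


def cor(true_values, estimates):
    """Pearson correlation between true values and estimates."""
    n = len(true_values)
    mean_t = sum(true_values) / n
    mean_e = sum(estimates) / n
    cov_te = sum((t - mean_t) * (e - mean_e) for t, e in zip(true_values, estimates))
    var_t = sum((t - mean_t) ** 2 for t in true_values)
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    return cov_te / math.sqrt(var_t * var_e)


true_theta = [-1.0, 0.0, 1.0, 2.0]  # hypothetical true values
est_theta = [-0.8, 0.1, 1.3, 2.2]   # hypothetical estimates, all too high

example_bias = bias(true_theta, est_theta)   # negative: overestimation
example_rmse = rmse(true_theta, est_theta)
example_cor = cor(true_theta, est_theta)
```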

Results

Table 1 summarizes the recovery results for the latent trait ( $θ$ ) under the GRM and the multimodal-IRTM. For the multimodal-IRTM, BIAS values were close to zero, the maximum RMSE was approximately 0.31, and COR values exceeded 0.95 across all conditions. Overall, parameter recovery, as indicated by RMSE and COR, improved under more favorable design conditions. Both test length and sample size enhanced estimation precision; specifically, longer tests and larger samples resulted in lower RMSE and higher COR values. In addition, larger RC values were associated with improved parameter recovery.

Table 1.

Latent Traits Parameter Recovery Between GRM and Multimodal-IRTM.

$R C$	$N$	$I$	GRM BIAS	GRM RMSE	GRM COR	Multimodal-IRTM BIAS	Multimodal-IRTM RMSE	Multimodal-IRTM COR
0	200	15	0.003	0.304	0.955	−0.005	0.313	0.953
	200	30	−0.001	0.226	0.976	0.000	0.237	0.975
	500	15	−0.003	0.296	0.956	−0.002	0.298	0.956
	500	30	−0.005	0.217	0.978	−0.009	0.219	0.977
	1000	15	0.004	0.297	0.956	0.005	0.298	0.956
	1000	30	0.007	0.215	0.978	0.003	0.217	0.978
0.5	200	15	0.004	0.305	0.955	−0.005	0.291	0.960
	200	30	−0.002	0.226	0.976	−0.002	0.217	0.979
	500	15	−0.002	0.296	0.957	−0.001	0.280	0.962
	500	30	0.003	0.217	0.977	−0.005	0.206	0.981
	1000	15	−0.001	0.298	0.956	0.001	0.280	0.961
	1000	30	0.004	0.213	0.978	0.004	0.199	0.981
1	200	15	−0.006	0.304	0.955	−0.006	0.249	0.972
	200	30	−0.008	0.227	0.976	−0.003	0.190	0.986
	500	15	−0.002	0.297	0.956	0.003	0.238	0.973
	500	30	0.004	0.217	0.978	−0.002	0.170	0.987
	1000	15	−0.001	0.297	0.956	0.001	0.234	0.973
	1000	30	0.004	0.213	0.978	0.001	0.166	0.987
1.5	200	15	−0.003	0.305	0.954	−0.021	0.211	0.981
	200	30	−0.007	0.226	0.976	−0.015	0.162	0.991
	500	15	−0.008	0.297	0.956	−0.018	0.193	0.983
	500	30	0.002	0.216	0.978	−0.004	0.141	0.992
	1000	15	0.001	0.295	0.957	0.001	0.192	0.983
	1000	30	0.007	0.213	0.978	0.002	0.137	0.992

Note: $N$ = sample size; $I$ = test length; RC = the regression coefficients of information entropy.

Compared with the GRM, the multimodal-IRTM demonstrated more accurate latent trait recovery in most conditions. When RC was equal to zero, the RMSE and COR for $θ$ estimated by the multimodal-IRTM were slightly inferior to those obtained from the GRM. However, when RC exceeded zero, the multimodal-IRTM outperformed the GRM in estimating $θ$ , and estimation accuracy increased as RC became larger. These findings suggest that incorporating multimodal data can substantially enhance the estimation of latent traits.

Figure 2 and Figure 3 report the recovery performance of item parameters (i.e., $a, b_{1}, b_{2}, b_{3}, b_{4}$ ) for the multimodal-IRTM and the GRM under the 15-item condition. In the boxplots, each box summarizes results across 50 independent replications under the corresponding experimental condition. The corresponding results for the 30-item condition, which were similar to those observed in the 15-item condition, are provided in Part 1 of the Supplemental Document. In addition, to further enhance transparency, Part 1 of the Supplemental Document also includes the mean BIAS, RMSE, and COR for item parameter recovery across all simulation conditions.

Figure 2.

Boxplots of RMSE for item parameter recovery under the 15-item condition.

Figure 3.

Boxplots of COR for item parameter recovery under the 15-item condition.

For the multimodal-IRTM, both types of item parameters (slopes and thresholds) were recovered with good accuracy, with the vast majority of RMSE values below 0.4 and COR above 0.8. Moreover, increases in sample size were associated with improved parameter recovery, reflected in lower RMSE and higher COR values at larger sample sizes. The estimation results for the four threshold parameters indicate that $b_{2}$ and $b_{3}$ were recovered more accurately than $b_{1}$ and $b_{4}$ , as evidenced by smaller RMSE values, larger COR coefficients, and reduced dispersion in the boxplots. This pattern may be explained by the assumption that $θ_{i} \sim N (0, 1)$ , under which the probabilities of endorsing extreme categories are low, resulting in limited information for estimating the corresponding threshold parameters. In contrast, the intermediate thresholds correspond to categories located in the central region of the latent trait distribution, where most observations occur, thereby supporting more accurate and stable estimation. This pattern was also observed in the empirical study, in which the posterior standard deviations of $b_{2}$ and $b_{3}$ were smaller than those of $b_{1}$ and $b_{4}$ .

Figures 2 and 3 clearly show that the item parameter recovery of the GRM is largely invariant to changes in RC. In contrast, for the multimodal-IRTM, RMSE decreases and COR increases as RC increases, indicating an overall trend of improving estimation accuracy. In other words, as RC increases, the advantage of the multimodal-IRTM over the GRM becomes increasingly pronounced. When RC was equal to zero, similar to the results for $θ$ , the multimodal-IRTM exhibited very slightly higher RMSE and lower COR than the GRM. When RC was 0.5, some item parameter estimates from the multimodal-IRTM demonstrated similar RMSE values but higher COR than those from the GRM, while others yielded lower RMSE and higher COR. When RC was 1 and 1.5, the multimodal-IRTM demonstrated superior item parameter recovery performance relative to the GRM, with uniformly lower RMSE and higher COR values. Overall, when RC was greater than 0, parameters estimated by the multimodal-IRTM generally outperformed those from the GRM.

When RC was equal to zero, the multimodal-IRTM was effectively equivalent to an extended version of the GRM, differing only in that three sets of item parameters (i.e., the item threshold parameters $b_{1}, b_{2}, b_{3}, b_{4}$ , the item time intensity parameters $ξ$ , and the item visual intensity parameters $m$ ) were estimated jointly under a multivariate normal distribution. As a result, parameter estimates from the multimodal-IRTM were slightly less precise than those from the traditional GRM. When RC exceeded zero, the multimodal-IRTM incorporated a cross-loading term for information entropy, thereby linking the latent trait with the process data and enabling the model to use this additional information to refine $θ$ estimates and improve latent trait recovery. As RC increased, the contribution of information entropy within the LRT and NBF models became stronger, providing additional auxiliary information and further enhancing estimation accuracy.

Tables 2 and 3 present the parameter recovery results for the LRT model and the NBF model within the multimodal-IRTM framework, respectively. The multimodal-IRTM demonstrated satisfactory parameter recovery across conditions; larger sample sizes were associated with improved recovery of item parameters, whereas longer tests resulted in more accurate estimation of person parameters. However, as RC increased, the RMSE for the speed parameter increased slightly, a pattern also reported by Guo et al. (2024), which may reflect additional uncertainty introduced into the response model parameters.

Table 2.

Recovery of the LRT Sub-Model Parameters of the Proposed Model.

$R C$	$N$	$I$	$τ$ BIAS	$τ$ RMSE	$τ$ COR	$ξ$ BIAS	$ξ$ RMSE	$ξ$ COR	$σ$ BIAS	$σ$ RMSE	$β$ BIAS	$β$ RMSE
0	200	15	−0.005	0.130	0.969	−0.018	0.194	0.921	−0.002	0.026	0.010	0.164
		30	−0.005	0.097	0.983	−0.026	0.185	0.933	−0.001	0.026	0.019	0.154
	500	15	0.006	0.126	0.969	0.002	0.121	0.974	−0.001	0.017	0.001	0.101
		30	0.002	0.092	0.984	0.000	0.119	0.974	−0.001	0.017	0.003	0.098
	1000	15	0.000	0.125	0.968	−0.001	0.088	0.986	0.000	0.012	0.001	0.075
		30	−0.001	0.091	0.984	−0.002	0.085	0.985	−0.001	0.012	0.001	0.071
0.5	200	15	−0.007	0.138	0.964	−0.019	0.203	0.914	−0.001	0.027	0.014	0.163
		30	0.004	0.104	0.981	−0.005	0.205	0.918	−0.003	0.027	0.011	0.161
	500	15	0.009	0.135	0.964	−0.002	0.130	0.966	−0.001	0.018	0.010	0.109
		30	0.002	0.099	0.981	0.004	0.127	0.969	−0.001	0.016	0.001	0.103
	1000	15	0.000	0.134	0.964	−0.003	0.097	0.982	0.000	0.012	0.003	0.079
		30	−0.002	0.098	0.981	−0.003	0.093	0.982	−0.001	0.012	0.002	0.077
1	200	15	−0.010	0.156	0.953	−0.088	0.243	0.890	−0.003	0.027	0.065	0.192
		30	0.003	0.115	0.975	−0.070	0.220	0.909	−0.003	0.027	0.061	0.175
	500	15	0.010	0.151	0.954	−0.027	0.159	0.956	−0.002	0.017	0.032	0.123
		30	0.001	0.108	0.977	−0.032	0.152	0.955	−0.001	0.016	0.028	0.120
	1000	15	−0.002	0.149	0.955	−0.012	0.111	0.976	−0.001	0.012	0.010	0.087
		30	−0.002	0.108	0.977	−0.018	0.111	0.975	−0.001	0.012	0.014	0.087
1.5	200	15	−0.012	0.167	0.947	−0.194	0.311	0.869	−0.003	0.028	0.148	0.245
		30	0.001	0.125	0.971	−0.163	0.278	0.891	−0.004	0.028	0.133	0.220
	500	15	0.009	0.161	0.947	−0.086	0.196	0.944	−0.002	0.018	0.076	0.154
		30	0.000	0.117	0.973	−0.076	0.176	0.949	−0.001	0.016	0.062	0.138
	1000	15	−0.001	0.161	0.947	−0.032	0.124	0.970	−0.001	0.012	0.024	0.093
		30	−0.001	0.115	0.974	−0.046	0.137	0.965	−0.001	0.012	0.037	0.105

Note: $N$ = sample size; $I$ = test length; RC = the regression coefficients of information entropy; $τ =$ latent processing speed; $ξ =$ item time intensity; $σ =$ item standard deviation; $β =$ the regression coefficients of information entropy from the LRT model. COR = correlations between estimated and true item parameters. Since $σ$ and $β$ were fixed as constants during data generation, COR values were not provided for them.

Table 3.

Recovery of the NBF Sub-Model Parameters of the Proposed Model.

$R C$	$N$	$I$	$ε$ BIAS	$ε$ RMSE	$ε$ COR	$m$ BIAS	$m$ RMSE	$m$ COR	$d$ BIAS	$d$ RMSE	$λ$ BIAS	$λ$ RMSE
0	200	15	0.006	0.164	0.949	−0.061	0.247	0.878	−0.009	0.019	0.004	0.189
		30	0.004	0.124	0.972	−0.061	0.235	0.894	−0.008	0.018	0.004	0.177
	500	15	−0.002	0.164	0.948	−0.053	0.169	0.954	−0.009	0.014	0.003	0.123
		30	−0.003	0.122	0.973	−0.052	0.158	0.957	−0.009	0.015	−0.001	0.114
	1000	15	0.001	0.162	0.949	−0.052	0.144	0.963	−0.009	0.013	−0.003	0.099
		30	0.000	0.122	0.973	−0.050	0.127	0.973	−0.009	0.013	−0.001	0.088
0.5	200	15	0.007	0.152	0.954	−0.060	0.245	0.879	−0.003	0.011	0.032	0.192
		30	−0.002	0.115	0.975	−0.050	0.239	0.887	−0.003	0.011	0.031	0.187
	500	15	−0.004	0.153	0.953	−0.039	0.175	0.937	−0.003	0.008	0.021	0.136
		30	−0.002	0.110	0.976	−0.048	0.165	0.949	−0.004	0.008	0.027	0.127
	1000	15	0.001	0.152	0.954	−0.050	0.141	0.963	−0.003	0.007	0.023	0.101
		30	0.002	0.110	0.977	−0.043	0.136	0.964	−0.003	0.006	0.018	0.101
1	200	15	0.008	0.156	0.952	−0.126	0.276	0.874	−0.001	0.006	0.092	0.218
		30	0.000	0.118	0.973	−0.105	0.258	0.876	−0.001	0.007	0.082	0.203
	500	15	−0.005	0.156	0.951	−0.045	0.182	0.941	−0.001	0.004	0.037	0.144
		30	0.001	0.111	0.976	−0.061	0.179	0.942	−0.001	0.005	0.045	0.138
	1000	15	0.003	0.154	0.952	−0.049	0.146	0.962	−0.001	0.003	0.032	0.111
		30	0.002	0.111	0.976	−0.046	0.145	0.961	−0.001	0.003	0.031	0.111
1.5	200	15	0.006	0.165	0.946	−0.202	0.327	0.853	−0.001	0.004	0.158	0.259
		30	−0.001	0.124	0.970	−0.186	0.306	0.875	0.000	0.004	0.150	0.246
	500	15	−0.002	0.161	0.947	−0.103	0.213	0.929	0.000	0.003	0.083	0.162
		30	0.001	0.115	0.974	−0.093	0.200	0.935	0.000	0.002	0.071	0.157
	1000	15	0.001	0.160	0.948	−0.064	0.160	0.954	−0.001	0.002	0.049	0.119
		30	0.000	0.114	0.974	−0.064	0.158	0.954	0.000	0.002	0.050	0.122

Note: $ε =$ latent visual engagement; $m =$ item visual intensity; $d =$ item visual discrimination; $λ =$ the regression coefficients of information entropy from the NBF model. COR = correlations between estimated and true item parameters. For parameters without reported COR, the reasons are the same as described in Table 2.

As shown in Table 2, most BIAS values for the item time-intensity parameter $ξ$ were negative. Because BIAS is computed as the true value minus the estimate, negative values indicate that the estimates exceeded the true values, that is, overestimation. For the regression coefficients $β$ of information entropy from the LRT model, most BIAS values were positive, indicating that the estimates were lower than the true values and thus reflected underestimation. The results in Table 3 indicate that the BIAS pattern in the NBF model was similar: the item visual-intensity parameter $m$ tended to be overestimated, whereas the regression coefficients $λ$ tended to be underestimated. This pattern is consistent with the findings of Guo et al. (2024), who reported opposite signs for the BIAS of item parameters and regression coefficients, with one set overestimated and the other underestimated. The systematic underestimation of $β$ and $λ$ is also consistent with the well-documented phenomenon that linear relationships tend to be attenuated when predictors contain measurement error.
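The attenuation phenomenon mentioned above can be illustrated with a generic simulation. This is a textbook errors-in-variables example, not the proposed model: the true slope, noise levels, seed, and sample size are all arbitrary choices. With a predictor whose error variance equals its true variance (reliability 0.5), the ordinary least squares slope shrinks toward roughly half of its true value.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(7)  # arbitrary fixed seed
n = 20000
true_slope = 1.0

true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * x + rng.gauss(0.0, 0.5) for x in true_x]

# Predictor observed with error; reliability = 1 / (1 + 1) = 0.5.
noisy_x = [x + rng.gauss(0.0, 1.0) for x in true_x]

slope_clean = ols_slope(true_x, y)    # close to the true slope
slope_noisy = ols_slope(noisy_x, y)   # attenuated toward true_slope * 0.5
```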

Empirical Data Application

Data Collection

To demonstrate the practical application of the proposed multimodal-IRTM model in personality assessment, we used the Chinese version of the Big Five Inventory (BFI; Li et al., 2025), whose items were selected from the International Personality Item Pool (IPIP; Goldberg, 1992). Response scores, response times, and eye-tracking data were collected using the Tobii Pro Fusion and Tobii TX300 eye trackers.

At the beginning of the questionnaire, participants were asked to provide demographic information, including age, gender (1 = male; 2 = female), and educational level (1 = Bachelor's degree; 2 = Master's degree; 3 = Doctoral degree or higher). Additionally, all participants provided informed consent prior to completing the formal tasks.

The experiment was conducted in a controlled eye-tracking laboratory setting. The study utilized a web-based program developed in Visual Studio Code, which incorporated the Big Five Personality Inventory. The program was deployed on a computer connected to a Tobii eye tracker, through which participants completed the questionnaire. During the task, the program automatically recorded participants’ response times and eye-movement data from the onset of each item to the corresponding keypress. Participants rated each item on a 5-point Likert scale ranging from 1 (strongly disagree) to 5 (strongly agree). Prior to the formal experiment, an example item was presented to illustrate the response procedure, and participants were encouraged to seek clarification from the experimenter if needed.

Data Description

Data were collected from undergraduate students enrolled at two universities across different provinces in China. Participants with color blindness, strabismus, severe myopia, or significant interocular differences in myopia were excluded to minimize potential sources of bias. The study initially recruited a total of 262 participants. After excluding participants whose eye-tracking data had a sampling rate below 85% (He et al., 2025) and those who failed either of the two attention-check items, 236 participants were retained for the final analysis. Among the final sample, 53 participants were male and 183 were female. The mean age was 21.58 years (SD = 2.82).

The scale comprised a total of 62 items, including 60 items assessing the Big Five personality traits (i.e., Agreeableness, Conscientiousness, Extraversion, Neuroticism, and Openness), with each trait measured by 12 items. Additionally, two attention-check items were inserted following the 20th and 40th items. In the current study, Cronbach's α coefficients for the five traits were 0.827, 0.887, 0.914, 0.917, and 0.865, respectively.

We simultaneously collected three types of data: response scores, response times, and visual FCs. Before analysis, observations with response times shorter than 0.3 s were excluded, as such rapid responses suggest that participants may not have adequately engaged with the items (Bunji & Okada, 2020; Guo et al., 2024). Response times were then converted to seconds and log-transformed to reduce skewness and limit the influence of outliers (Van der Linden, 2007). Visual FCs were computed using Tobii's default algorithm, which is based on the I-VT (Velocity-Threshold Identification) method proposed by Salvucci and Goldberg (2000).
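The response-time preprocessing described above amounts to a filter followed by a log transform, as in the sketch below. The function name and the example values are hypothetical, and the input is assumed to be in seconds already.

```python
import math


def preprocess_response_times(rts_seconds, min_seconds=0.3):
    """Drop responses faster than min_seconds and return log response times."""
    kept = [rt for rt in rts_seconds if rt >= min_seconds]
    return [math.log(rt) for rt in kept]


raw_rts = [0.12, 2.5, 4.8, 0.29, 11.3]  # hypothetical response times in seconds
log_rts = preprocess_response_times(raw_rts)
```

In this example, the two responses faster than 0.3 s are removed and the remaining three are log-transformed.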

Analysis

Model Convergence

Parameter convergence was evaluated using the potential scale reduction factor (PSRF; Brooks & Gelman, 1998), commonly denoted as $\hat{R}$ , with values below 1.2 considered indicative of satisfactory convergence. This measure is widely used in Bayesian statistics to assess MCMC convergence and ensure that parameter estimates have stabilized.
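For a single parameter, the PSRF compares between-chain and within-chain variability. The sketch below implements the basic Gelman-Rubin form for equal-length chains; it omits the degrees-of-freedom correction discussed by Brooks and Gelman (1998), so it should be read as an approximation of what MCMC software reports. The chain values are hypothetical.

```python
import math


def potential_scale_reduction(chains):
    """Basic Gelman-Rubin R-hat for equal-length chains of one parameter."""
    m = len(chains)
    n = len(chains[0])
    chain_means = [sum(chain) / n for chain in chains]
    grand_mean = sum(chain_means) / m
    # Between-chain variance B and mean within-chain variance W.
    between = n * sum((mu - grand_mean) ** 2 for mu in chain_means) / (m - 1)
    within = sum(
        sum((x - mu) ** 2 for x in chain) / (n - 1)
        for chain, mu in zip(chains, chain_means)
    ) / m
    pooled = (n - 1) / n * within + between / n
    return math.sqrt(pooled / within)


chain_a = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]   # hypothetical draws
chain_b = [0.0, 0.2, -0.3, 0.1, -0.1, 0.1]   # overlaps with chain_a
chain_c = [x + 5.0 for x in chain_b]         # stuck in a different region

rhat_mixed = potential_scale_reduction([chain_a, chain_b])
rhat_stuck = potential_scale_reduction([chain_a, chain_c])
```

Chains exploring the same region give a value below the 1.2 cutoff, whereas chains centered in different regions give a value far above it.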

Model Fit

Bayesian leave-one-out cross-validation (LOO) and the Watanabe-Akaike Information Criterion (WAIC), as introduced in Watanabe and Opper (2010), were used as MCMC-based Bayesian model fit indices to evaluate and compare the relative adequacy of competing models, where lower values indicate better fit.
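As a sketch of how WAIC is obtained from MCMC output, the function below takes a matrix of pointwise log-likelihood values (posterior draws in rows, observations in columns) and returns WAIC on the deviance scale, so that lower values indicate better fit. It uses the variance-based penalty; the input values are hypothetical, and in practice dedicated software would normally be used for both WAIC and LOO.

```python
import math


def waic(log_lik):
    """Return (WAIC, lppd, p_waic) from a draws x observations matrix."""
    n_draws = len(log_lik)
    n_obs = len(log_lik[0])
    lppd = 0.0
    p_waic = 0.0
    for i in range(n_obs):
        column = [log_lik[s][i] for s in range(n_draws)]
        # Log of the posterior-mean likelihood, via log-sum-exp for stability.
        peak = max(column)
        lppd += peak + math.log(sum(math.exp(v - peak) for v in column) / n_draws)
        # Penalty: posterior variance of the pointwise log-likelihood.
        mean = sum(column) / n_draws
        p_waic += sum((v - mean) ** 2 for v in column) / (n_draws - 1)
    return -2.0 * (lppd - p_waic), lppd, p_waic


# Hypothetical log-likelihoods: 3 posterior draws, 2 observations.
toy_log_lik = [[-1.0, -2.0], [-1.2, -1.8], [-0.9, -2.1]]
toy_waic, toy_lppd, toy_penalty = waic(toy_log_lik)
```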

Since the GRM did not incorporate process data, such as response times or FCs, it was estimated using less information than the multimodal-IRTM, which integrates both outcome and process indicators. Hence, WAIC values from the GRM are not directly comparable with those from the multimodal-IRTM. To facilitate comparison between the GRM and multimodal-IRTM, we adopted the separate modeling approach (Meng et al., 2013; Zhan, 2022), consistent with conventional psychometric practice, in which each modality is modeled independently and the GRM, LRT, and NBF models are analyzed separately. For convenience, the three models generated using the separate modeling approach are hereafter referred to as the separate-Model. We then applied the separate-Model and the proposed multimodal-IRTM to analyze empirical data across the five dimensions of the Big Five Inventory (BFI). MCMC analyses were conducted using the nimble package in R. For each model, two MCMC chains were run for 60,000 iterations each, with the first 50,000 iterations discarded as burn-in.

Results

As shown in Table 4, across all analyses of the five personality dimensions, convergence diagnostics indicated satisfactory convergence for both the separate-Model and multimodal-IRTM, with at least 95% of parameter estimates in each dimension exhibiting a potential scale reduction factor (PSRF) below 1.2. Table 4 also presents model fit indices for the separate-Model and the proposed multimodal-IRTM applied to the empirical data. Across all five dimensions of the Big Five personality traits, the multimodal-IRTM produced lower LOO and WAIC values than the separate-Model, indicating superior model performance. These results suggest that, for the present multimodal data, joint modeling achieves better fit than analyzing each modality separately.

Table 4.

Model Convergence and Model Fit in Real-Data Application.

Personality factor	Separate-model %PSRF < 1.2	Separate-model LOO	Separate-model WAIC	Multimodal-IRTM %PSRF < 1.2	Multimodal-IRTM LOO	Multimodal-IRTM WAIC
Agreeableness (A)	100%	27,151.430	24,376.459	99.9%	26,990.230	23,932.146
Conscientiousness (C)	99.2%	27,436.508	24,792.654	100%	27,264.147	24,380.491
Extraversion (E)	100%	28,178.317	25,421.454	100%	28,079.380	25,135.520
Neuroticism (N)	100%	27,941.122	25,017.553	99.6%	27,829.949	24,769.885
Openness (O)	98.5%	27,584.433	25,029.312	98.8%	27,550.598	24,806.860

Note: Separate-model = the three types of data fitted separately with three measurement models; PSRF = the potential scale reduction factor.

Table 5 reports the estimates of item parameters for the three measurement models in the multimodal-IRTM corresponding to the Agreeableness dimension (similar patterns were observed for other dimensions; details are provided in Part 2 of the Supplemental Document), along with the regression coefficients $β$ and $λ$ for information entropy. Most parameters have posterior standard errors below 0.2, indicating stable and reliable estimates. Some item parameters, however, exhibit moderately high posterior standard errors, particularly the thresholds $b_{1}$ and $b_{4}$ . The parameters $β$ and $λ$ represent the regression coefficients for information entropy in the LRT and NBF models, respectively. All estimates of $β$ and $λ$ are positive, consistent with the hypothesis that information entropy is positively associated with RT and FC. It should be noted that the magnitudes of $β$ and $λ$ depend on the scales of information entropy and the log-transformed RTs and FCs; therefore, these values do not directly reflect the strength of the relationships (Guo et al., 2024).

Table 5.

Estimates of Item Parameters for the Multimodal-IRTM Model in Empirical Data (Dimension A).

Item	$a$	$b_{1}$	$b_{2}$	$b_{3}$	$b_{4}$	$ξ$ (RT)	$σ$ (RT)	$m$ (FC)	$d$ (FC)	$β$	$λ$
4	0.69 (0.06)	−3.97 (0.09)	−2.82 (0.21)	−2.05 (0.19)	1.08 (0.15)	0.35 (0.20)	0.32 (0.02)	1.38 (0.23)	0.23 (0.01)	0.92 (0.24)	1.06 (0.28)
9	0.68 (0.10)	−3.79 (0.41)	−1.02 (0.18)	0.17 (0.14)	3.21 (0.44)	0.98 (0.21)	0.32 (0.02)	2.19 (0.21)	0.20 (0.01)	0.31 (0.18)	0.31 (0.18)
13	1.46 (0.19)	−3.88 (0.22)	−1.54 (0.15)	−0.86 (0.11)	1.27 (0.12)	0.63 (0.13)	0.37 (0.02)	1.63 (0.15)	0.21 (0.01)	0.80 (0.17)	0.99 (0.20)
22	1.94 (0.21)	−3.42 (0.37)	−1.59 (0.16)	−1.20 (0.11)	0.95 (0.10)	0.46 (0.12)	0.18 (0.02)	1.48 (0.13)	0.30 (0.01)	1.48 (0.17)	1.65 (0.18)
28	0.77 (0.11)	−3.47 (0.46)	−0.58 (0.15)	0.64 (0.15)	3.09 (0.47)	0.94 (0.25)	0.33 (0.02)	2.05 (0.27)	0.20 (0.01)	0.36 (0.21)	0.47 (0.23)
33	0.89 (0.11)	−3.83 (0.27)	−2.33 (0.26)	−1.24 (0.17)	1.48 (0.18)	0.52 (0.15)	0.31 (0.02)	1.57 (0.18)	0.29 (0.01)	0.49 (0.17)	0.53 (0.21)
36	0.72 (0.12)	−2.84 (0.53)	−0.76 (0.20)	0.28 (0.13)	2.29 (0.36)	0.73 (0.23)	0.34 (0.02)	1.79 (0.24)	0.24 (0.01)	0.27 (0.18)	0.33 (0.19)
41	0.70 (0.08)	−3.96 (0.15)	−1.89 (0.21)	−0.65 (0.14)	2.58 (0.35)	0.68 (0.18)	0.38 (0.02)	1.87 (0.22)	0.20 (0.01)	0.46 (0.17)	0.45 (0.21)
50	1.26 (0.19)	−2.96 (0.39)	−1.43 (0.20)	−0.74 (0.13)	0.96 (0.13)	0.86 (0.12)	0.34 (0.02)	1.90 (0.16)	0.21 (0.01)	0.36 (0.12)	0.50 (0.16)
52	0.70 (0.08)	−3.97 (0.12)	−2.29 (0.26)	−0.42 (0.14)	2.54 (0.31)	0.82 (0.18)	0.32 (0.02)	1.92 (0.21)	0.24 (0.01)	0.32 (0.17)	0.37 (0.21)
54	1.10 (0.14)	−2.88 (0.37)	−1.22 (0.15)	−0.48 (0.10)	1.36 (0.17)	0.35 (0.18)	0.32 (0.02)	1.41 (0.22)	0.25 (0.01)	0.69 (0.17)	0.75 (0.21)
61	0.77 (0.12)	−2.50 (0.38)	−0.57 (0.14)	0.47 (0.13)	2.87 (0.43)	0.34 (0.27)	0.35 (0.02)	1.53 (0.31)	0.18 (0.01)	0.79 (0.21)	0.82 (0.25)

Note: $ξ =$ item time intensity; $σ =$ item standard deviation; $β =$ the regression coefficients of the Information entropy from the LRT model; $m =$ item visual intensity; $d =$ item visual discrimination; $λ =$ the regression coefficients of the Information entropy from the NBF model. Values in parentheses indicate standard errors (i.e., posterior standard deviations).

Figure 4 presents the distributions of posterior standard deviations for latent trait estimates across the five personality dimensions. Posterior standard deviations are generally low, indicating that latent traits are estimated with adequate precision for the current sample size (N = 236). Notably, measurement precision varies across dimensions, with Openness showing relatively greater dispersion and Extraversion displaying the most concentrated estimates. The presence of a few right-tailed observations in Agreeableness indicates that some respondents were located in regions of the latent continuum with lower information. Importantly, despite the increased complexity introduced by multimodal process data, posterior standard deviations remain well controlled, supporting robust recovery of individual differences.

Figure 4.

Boxplots of posterior standard deviations of latent traits ( $θ$ ) across five personality dimensions.

To assess the fit of the multimodal model to the multimodal data, we examined scatter plots comparing observed and predicted log-RTs and FCs for each item in the Agreeableness dimension across all respondents (see Figure 5). Results for the other dimensions displayed a similar pattern. For brevity, detailed results for the remaining dimensions are provided in Part 3 of the Supplemental Document. In all cases, the relationships appeared essentially linear. The absence of significant nonlinear trends indicates that the model adequately captured item-level relationships. The scatter plots showed that predicted values closely matched the observed values. For both log-RTs and FCs, the multimodal model accurately captured the underlying data structure.

Figure 5.

Scatter plots of the observed and predicted multimodal data for 12 agreeableness items: (a) log response times (log-RTs) data; (b) fixation counts (FCs) data.

Eye-Tracking Trajectory Analysis

Figure 6 presents the response patterns of participant #228 for two items within a single dimension. The items in panels (a) and (b) are both positively keyed, so that higher scores reflect greater endorsement of the construct, with response options ranging from “1 strongly disagree” to “5 strongly agree.” More examples can be found in Part 4 of the Supplemental Document.

Figure 6.

Response probabilities, information entropy and eye-movement trajectories of participant 228.

As shown in Figure 6(a), the participant's eye-movement pattern was relatively straightforward. More fixations and longer fixation durations were observed on the item stem, indicating that the participant devoted considerable effort to understanding the item. However, fixations on the response options were few, indicating that the participant made the selection confidently and with minimal hesitation. In contrast, panel (b) of Figure 6 shows that the participant exhibited uncertainty regarding the response options. After reading the item stem, the participant initially fixated on option “2 disagree” without making a decision. The participant then returned to the item stem for further cognitive processing, revisited option “2 disagree,” and ultimately changed the selection to “4 agree.” Had the participant been certain, the total number of fixations would have been approximately 12. However, due to hesitation and additional cognitive processing, the participant re-evaluated both the item stem and response options before making a final selection. The additional four fixations likely reflect the participant's response uncertainty.

Both the separate-model (i.e., the GRM) and the multimodal-IRTM were applied to analyze the empirical data. The table below the figure displays the response probabilities for all five categories under both models, alongside the corresponding information entropy. The alignment between predicted response probabilities and eye-movement trajectories shows that the multimodal-IRTM produces category probabilities that more accurately correspond to observed eye-movement patterns during the response process. Response uncertainty for panel (a) was low; however, predicted probabilities from the GRM indicate a comparatively high level of uncertainty. Furthermore, as shown in panel (a), the eye-tracking trajectory shows that the respondent did not fixate on option “4 agree,” whereas the GRM predicted a relatively high probability for this category. Taken together, by incorporating multimodal process data, the multimodal-IRTM better aligns with participants’ observed behavioral patterns than traditional response-only models.

The present study examines the association between participants’ latent traits and visual FCs by modeling response uncertainty, under the assumption that cognitive uncertainty during the response process is reflected in multimodal measurements. This link is supported by analyses of eye-movement trajectories. Due to uncertainty in decision-making, participants engaged in more extensive cognitive processing, manifested as increased FCs. In other words, these additional fixations, exceeding both participants’ average fixation levels and the item-level average required fixations, are captured in the model by the latent visual engagement and item visual intensity parameters. In summary, by modeling uncertainty through information entropy, participants’ latent traits are linked to response times and visual FCs, enabling the model to represent cognitive processes during responding as reflected in multimodal measurements.
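The link between category probabilities and response uncertainty can be illustrated with Shannon entropy. The sketch assumes that information entropy is the Shannon entropy (natural logarithm) of the five category probabilities; the paper's own definition appears in an earlier section and may differ in log base or scaling. Both probability vectors are hypothetical and are not the values shown in Figure 6.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (natural log) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical category probabilities for options 1 to 5.
confident = [0.01, 0.02, 0.05, 0.12, 0.80]  # mass concentrated on one option
uncertain = [0.05, 0.30, 0.25, 0.30, 0.10]  # mass split across several options

entropy_confident = shannon_entropy(confident)
entropy_uncertain = shannon_entropy(uncertain)
max_entropy = math.log(5)  # upper bound, reached by a uniform distribution
```

A respondent whose probability mass is split between, say, "2 disagree" and "4 agree" has higher entropy than one whose mass sits on a single option, which is the pattern the cross-loading terms relate to longer response times and more fixations.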

Discussion

Summary of the Present Study

Personality assessment plays a crucial role in organizational research, facilitating the selection of individuals whose traits align most closely with job requirements. If latent trait estimates can be made more precise, assessments will yield more informative and reliable guidance for personnel selection. Advances in technology have lowered the barrier to collecting rich process data, allowing researchers to capture information (e.g., response time, eye movements, mouse trajectories) that extends well beyond traditional self-reports. Multimodal assessment holds substantial potential. As noted in the introduction, research in cognitive and achievement testing increasingly demonstrates that incorporating multimodal data enhances measurement precision, provides deeper insights into test-taker behavior, and improves assessment quality and informativeness. However, despite their potential, the use of multimodal process data in noncognitive personality assessment remains limited, which is regrettable given its promise to improve measurement quality and support more effective personnel selection, job placement, and broader organizational decisions.

This research, situated in personality assessment using Likert-type items, leverages response uncertainty to connect participants’ latent traits with two types of process data: response times and eye-tracking measures. Using this link as a cross-loading term, we develop a joint model for response scores, response times, and visual FCs, referred to as the multimodal-IRTM. A simulation study evaluated the parameter recovery of the proposed model under multiple conditions. The results indicate that the MCMC algorithm yields reliable parameter estimates and that model parameters exhibit satisfactory recovery.

The simulation study also compared the proposed model with the GRM. When the regression coefficient is positive, the person parameters ( $θ$ ) estimated by the multimodal-IRTM consistently show higher recovery accuracy than those from the GRM. Moreover, as sample size and the magnitude of the regression coefficient increase, the performance gap becomes more pronounced, with a maximum ΔRMSE of 0.104 and a maximum ΔCOR of 0.027. For item parameters, the multimodal-IRTM also shows slight improvements over the GRM, with a maximum ΔRMSE of 0.032 and a maximum ΔCOR of 0.064. In the empirical study, the multimodal-IRTM achieves better model fit than the separate-Model, with reductions in LOO of up to 172.361 and in WAIC of up to 444.313. Taken together, the proposed multimodal-IRTM improves on response-only modeling in both parameter recovery and model fit.
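The maximum differences for $θ$ recovery and model fit can be checked directly against the reported tables. The sketch below uses the RC = 1.5 rows of Table 1, where the gap between models is largest, and the fit indices in Table 4; only the variable names are introduced here.

```python
# (GRM, multimodal-IRTM) values for theta in the RC = 1.5 rows of Table 1.
rmse_pairs = [(0.305, 0.211), (0.226, 0.162), (0.297, 0.193),
              (0.216, 0.141), (0.295, 0.192), (0.213, 0.137)]
cor_pairs = [(0.954, 0.981), (0.976, 0.991), (0.956, 0.983),
             (0.978, 0.992), (0.957, 0.983), (0.978, 0.992)]

# (separate-Model, multimodal-IRTM) fit indices from Table 4 for A, C, E, N, O.
loo_pairs = [(27151.430, 26990.230), (27436.508, 27264.147),
             (28178.317, 28079.380), (27941.122, 27829.949),
             (27584.433, 27550.598)]
waic_pairs = [(24376.459, 23932.146), (24792.654, 24380.491),
              (25421.454, 25135.520), (25017.553, 24769.885),
              (25029.312, 24806.860)]

# RMSE, LOO, and WAIC: lower is better, so the gain is baseline minus joint.
max_delta_rmse = round(max(grm - joint for grm, joint in rmse_pairs), 3)
max_delta_loo = round(max(sep - joint for sep, joint in loo_pairs), 3)
max_delta_waic = round(max(sep - joint for sep, joint in waic_pairs), 3)

# COR: higher is better, so the gain is joint minus baseline.
max_delta_cor = round(max(joint - grm for grm, joint in cor_pairs), 3)
```

The four maxima agree with the values reported above for $θ$ recovery and model fit.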

In addition, the response probabilities estimated by the multimodal-IRTM are more consistent with participants’ eye-movement patterns. As expected, Figure 6 shows that, compared with the GRM, response probabilities from the multimodal-IRTM more closely matched the category preference patterns observed in eye-movement trajectories. One possible explanation lies in the mismatch between the five discrete Likert categories and the continuous latent traits. Confronted with this limitation, respondents engage in a decision-making process in which they select the option that better, rather than perfectly, represents their latent traits. The resulting score therefore reflects a comparative evaluation, and this cognitive decision-making process is evident in the multimodal data, as indicated by longer response times and increased fixations.

What Are the Advantages of the Proposed Multimodal-IRTM in Organizational Research?

As Meißner and Oll (2019) noted, “the time is right to expand the standard methodological toolkit of organizational scholars by bringing eye-tracking to their minds and hands.” In organizational research, incorporating multimodal data into noncognitive assessment represents a promising alternative to traditional self-report–based methods. This study proposes a method capable of simultaneously analyzing participants’ scores alongside multimodal data. By integrating multimodal behavioral data, richer layers of information can be accessed, enabling a more comprehensive and accurate estimation of participants’ latent traits. The key innovations of this approach are outlined as follows:

Enriching Self-Report Analysis with Multimodal Data. The present model provides a framework for integrating multimodal data with self-report scores, enabling self-report measures to be complemented by behavioral information. In future research, additional multimodal sources—such as eye-tracking, physiological signals, facial expressions, and mouse trajectories—could be incorporated to enrich organizational behavior studies. This approach can further enhance the assessment of latent constructs, capture underlying response dynamics, and improve both the precision and interpretability of self-report measures in applied contexts, including talent selection, employee development, and high-potential identification.

Increased Precision of Person Trait Assessment. Simulation results indicate that when the regression coefficient exceeds zero, the RMSE of latent trait estimates decreases, and the correlation between estimated and true latent traits increases. In the empirical study, the proposed model also produced lower LOO and WAIC values. Moreover, analyses of eye-tracking trajectories indicate that the model incorporating multimodal data generates predicted response patterns that more closely align with eye-movement patterns than predictions based solely on response scores. By incorporating behavioral and process-based indicators, multimodal assessment extends beyond static self-reports to capture latent differences in response processes that are unobservable from final Likert ratings. These differences allow researchers to distinguish respondents with similar observed scores but differing response behaviors, cognitive strategies, or decision-making strategies. In practice, the proposed model can produce richer talent profiles for applications including personnel selection and high-potential identification: beyond more accurate trait scores, it yields process indicators that reflect how stable a candidate's responses are. This, in turn, supports more informed and comprehensive talent decisions and reduces the risk of misclassification.

Enhanced Interpretation of Process Data. Process-level behavioral data offer insights into several aspects of respondents’ noncognitive test performance, including: (a) person's main effect, (b) item's main effect, and (c) the person–item interaction effect. By explicitly modeling cross-loadings between person–item interactions (i.e., information entropy) and process indicators, this model directly connects latent traits to observed process behaviors. Unlike traditional hierarchical frameworks, which primarily capture correlations between latent traits and process variables without specifying mechanisms, the present model provides a more precise account of how respondents’ latent characteristics influence behavioral manifestations during the response process.
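The person–item interaction described above can be illustrated with a short Python sketch that computes graded response model category probabilities and their Shannon entropy. This is a sketch under stated assumptions rather than the authors' implementation: the function names, the logistic parameterization, and the parameter values are hypothetical, and the way entropy enters the response time and fixation count components is not reproduced here.

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities under a logistic graded response model.

    theta: latent trait level; a: item discrimination;
    b: increasing threshold parameters (K - 1 values for K categories).
    """
    b = np.asarray(b, float)
    # Boundary probabilities P(X >= k) for k = 1, ..., K - 1.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0, then difference.
    cum = np.concatenate(([1.0], p_star, [0.0]))
    return cum[:-1] - cum[1:]

def response_entropy(probs):
    """Shannon entropy (natural log) of a category probability vector."""
    p = np.asarray(probs, float)
    p = p[p > 0]  # treat 0 * log(0) as 0
    return float(-np.sum(p * np.log(p)))

# Hypothetical five-category item.
a, b = 1.5, [-2.0, -1.0, 1.0, 2.0]
for theta in (1.0, 4.0):
    probs = grm_category_probs(theta, a, b)
    print(theta, np.round(probs, 3), round(response_entropy(probs), 3))
```

For this hypothetical item, a respondent whose trait level sits on a threshold (θ = 1) has probability mass spread over adjacent categories and therefore higher entropy than a respondent far above all thresholds (θ = 4), which is the sense in which entropy captures response uncertainty for a given person and item.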

What Are the Recommendations for Multimodal Assessment?

Provision of R code and detailed instructions. To facilitate replication and application of the proposed multimodal assessment models, R code implementing the MCMC analyses has been made available (https://osf.io/wuq5d/overview). The resource includes two main components: (1) annotated R code for performing Multimodal MCMC analyses of simulation studies, and (2) a tutorial demonstrating how to apply the Multimodal MCMC method to empirical data. This enables other researchers to reproduce the analyses and adapt the code for their own datasets.

Consideration of sample size. Our simulation results indicate that when each group comprises more than 200 respondents, the proposed method performs reliably, supporting its direct application in routine assessment scenarios, including internal talent reviews and targeted recruitment within organizations. When higher estimation precision is required (e.g., for core position hiring or key talent evaluation), it is recommended to increase the sample size to over 500 to further reduce the RMSE of parameter estimates and improve the stability of results.

Flexibility to incorporate alternative multimodal data. The modeling framework permits seamless integration of additional process-level indicators, such as mouse-tracking, EEG, or electrodermal activity, without altering the core model structure. This flexibility facilitates adaptation to diverse organizational assessment contexts and allows exploration of how various multimodal signals enhance latent trait estimation. It is reasonable to anticipate that researchers will soon overcome current hardware limitations, enabling the acquisition and analysis of a broader range of multimodal data in large-scale assessments, thereby further increasing the applicability and impact of these models.

What Are the Limitations of the Current Study and Future Directions?

This study has several limitations that warrant further investigation in future research.

First, the present study employed the graded response model (GRM) as the psychometric foundation for the proposed multimodal framework. The GRM represents one of several viable models within the broader family of polytomous IRT. Future research could broaden this work by incorporating alternative polytomous IRT models, such as the partial credit model (PCM), generalized partial credit model (GPCM), or nominal response model (NRM). Moreover, personality and organizational assessments frequently involve multidimensional latent constructs. For example, the empirical dataset used in this study, the Big Five Personality Inventory, is multidimensional. The GRM was applied separately within each dimension to evaluate the feasibility and interpretability of the proposed multimodal framework. Future research should extend the current GRM-based framework to multidimensional graded response models (MGRM), enabling items to load on multiple latent traits simultaneously.

Second, the present study is conducted within the framework of dominance models, in which the probability of endorsing an item increases monotonically with the person's latent trait level relative to the item location parameter. Within this framework, higher latent trait levels correspond to a higher probability of selecting keyed or higher-category responses, reflecting a cumulative response process. However, prior studies have questioned the assumption of a dominance response process for personality data and have recommended ideal point models instead (e.g., Chernyshenko et al., 2001, 2007; Tay & Ng, 2018). Unlike dominance models, ideal point models assume a non-monotonic response function, in which the probability of endorsing an item is highest when the person's latent trait is near the item location parameter. Incorporating multimodal process data into ideal point frameworks could therefore provide deeper insights into the cognitive mechanisms underlying item endorsement and may improve parameter recovery and overall model fit.
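The contrast between the two response processes can be written schematically as follows. The first expression is the familiar logistic boundary function of the GRM; the second is a stylized single-peaked curve included only for illustration and is not the specific ideal point model recommended in the cited studies. The symbols $a_j > 0$, $b_{jk}$, and $\delta_j$ are notational assumptions and may differ from the notation used elsewhere in this article.

```latex
\begin{align}
P^{*}_{jk}(\theta_i) &= \frac{1}{1 + \exp\{-a_j(\theta_i - b_{jk})\}}
  && \text{(dominance: strictly increasing in } \theta_i\text{)} \\
P_{j}(\theta_i) &= \exp\{-a_j(\theta_i - \delta_j)^{2}\}
  && \text{(ideal point: maximal at } \theta_i = \delta_j\text{)}
\end{align}
```

Under the first form, endorsement becomes ever more likely as $\theta_i$ rises above $b_{jk}$; under the second, endorsement declines on both sides of the item location $\delta_j$.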

Third, although the present study focused on unidimensional Likert-scale items, a key direction for future research is the adoption of forced-choice response formats. Forced-choice designs present respondents with multiple options simultaneously, requiring comparative judgments that can effectively reduce response biases, including social desirability, acquiescence, and extreme responding. Such designs compel respondents to prioritize among options. Incorporating process data into forced-choice formats has significant potential to deepen understanding of the cognitive mechanisms underlying decision-making. Process data can capture subtle aspects of respondents’ mental states, attentional allocation, and decision dynamics that are not observable from final choice outcomes alone. Extending the current multimodal framework to incorporate forced-choice response data could further improve the precision of latent trait estimation, enhance the ecological and psychological validity of personality and organizational assessments, and offer richer insights into the structure of multidimensional constructs.

Fourth, the empirical study should be interpreted as a preliminary demonstration of feasibility rather than a definitive validation in organizational settings. The sample size (N = 236) provides sufficient information for model estimation in this context; however, it is relatively small from an organizational research perspective and is drawn from a student sample with an age-constrained and female-dominated composition. Such characteristics limit the generalizability of the findings to typical organizational populations, which are generally more heterogeneous in age, gender, and work experience. Therefore, while the empirical results suggest that the multimodal-IRTM can be successfully estimated under moderate sample sizes, they should not be interpreted as evidence of its performance in representative organizational datasets. Future research should evaluate the robustness and generalizability of the proposed model using larger and more diverse samples, particularly in real-world organizational contexts.

Fifth, the current study focused exclusively on two types of multimodal response data: response times and visual FCs. Although these indicators offer valuable insights into the cognitive processes and attentional patterns underlying test-taking behavior, technological advances now allow for the collection of a broader range of process-related data. For example, researchers can record electrodermal responses, electroencephalography (EEG) signals, mouse or touchscreen interactions, and other behavioral or physiological measures. Future research could investigate whether the current multimodal framework can be extended to incorporate these diverse modalities, and whether such extensions might improve measurement precision, enhance model interpretability, and offer a richer understanding of the complex mechanisms underlying responses in personality, cognitive, and organizational assessments.

Sixth, the present study focused exclusively on data collected under normal and effortful response conditions and did not explicitly consider aberrant responding behaviors, such as careless or insufficient effort responding. Multimodal indicators, such as unusually short or long response times, atypical eye-movement patterns, or irregular physiological signals, may capture deviations in engagement, attention, or decision-making strategies that standard response models do not adequately explain. Disengaged examinees may interact with items using shorter response times and fewer fixations than would be expected if they were reading, comprehending, and providing an engaged response (Ulitzsch et al., 2020a). Under normal responding conditions, responses associated with higher uncertainty generally require more time than those associated with lower uncertainty. Even responses with low uncertainty, however, generally take longer than random or careless responses, which are typically characterized by minimal cognitive engagement and abnormally short response times. Including respondents who exhibit careless or insufficient effort responding in future research would allow a systematic examination of whether multimodal measures improve the detection, modeling, and correction of aberrant responding, thereby enhancing the accuracy and robustness of latent trait estimation.

Supplemental Material

Supplemental material, sj-docx-1-orm-10.1177_10944281261457337 for A Multimodal Item Response Modeling for Personality Assessment in Organizational Research by Dongbo Tu, Fumei Zhang, Siwei Peng, Daxun Wang and Yan Cai in Organizational Research Methods

Footnotes

Funding

This work was supported by Grants 32300942, 32160203, 62467002 and 62167004 from the National Natural Science Foundation of China.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental Materials

Supplemental material for this article is available online.

Author biographies

Dongbo Tu is a full professor at the School of Psychology at the Jiangxi Normal University in Nanchang, Jiangxi, China. His research mainly focuses on applied measurement in education, differential item functioning, item response theory, computerized adaptive testing, cognitive diagnosis modeling, and machine learning.

Fumei Zhang is a master's candidate in Psychometrics at the Jiangxi Normal University in Nanchang, Jiangxi, China. Her primary research interests include item response theory, cognitive diagnosis theory, educational and psychological measurement, and Bayesian analysis.

Siwei Peng is an instructor at the School of Psychology at Zhejiang Normal University in Jinhua, Zhejiang, China. Her primary research interests include educational/psychological measurement, response time modeling, item response theory, cognitive diagnosis, and Bayesian analysis.

Daxun Wang is an associate professor at the School of Psychology at Jiangxi Normal University in Nanchang, Jiangxi, China. His primary research interests include item response theory, educational and psychological test development and validation, computerized adaptive testing, and Q-matrix validation in cognitive diagnosis.

Yan Cai is a full professor at the School of Psychology at the Jiangxi Normal University in Nanchang, Jiangxi, China. Her current research interests include item response theory, cognitive diagnosis theory, big data analysis, and test construction.

References

Al-Malki, Juan (2018). Leadership styles and job performance: A literature review. Journal of International Business Research and Marketing, 3(3), 40–49. https://doi.org/10.18775/jibrm.1849-8558.2015.33.3004

Anisimov, Chernozatonsky, Pikunov, Raykhrud, Revazov, Shedenko, Zhigulskaya, Zuev (2021). Okenreader: ML-based classification of the reading patterns using an Apple iPad. Procedia Computer Science, 192, 1944–1953. https://doi.org/10.1016/j.procs.2021.08.200

Anwyl-Irvine, Dalmaijer, E. S., Hodges, Evershed, J. K. (2021). Realistic precision and accuracy of online experiment platforms, web browsers, and devices. Behavior Research Methods, 53(4), 1407–1425. https://doi.org/10.3758/s13428-020-01501-5

Bakker, A. B., Tims, Derks (2012). Proactive personality and job performance: The role of job crafting and work engagement. Human Relations, 65(10), 1359–1378. https://doi.org/10.1177/0018726712453471

Bin, A. S., Shmailan (2015). The relationship between job satisfaction, job performance and employee engagement: An explorative study. Issues in Business Management and Economics, 4(1), 1–8. https://doi.org/10.15739/IBME.16.001

Bolsinova, Tijmstra (2018). Improving precision of ability estimation: Getting more from response times. British Journal of Mathematical and Statistical Psychology, 71(1), 13–38. https://doi.org/10.1111/bmsp.12104

Brooks, S. P., Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455. https://doi.org/10.1080/10618600.1998.10474787

Bunji, Okada (2020). Joint modeling of the two-alternative multidimensional forced-choice personality measurement and its response time by a Thurstonian D-diffusion item response model. Behavior Research Methods, 52(3), 1091–1107. https://doi.org/10.3758/s13428-019-01302-5

Chauliac, Catrysse, Gijbels, Donche (2020). It is all in the “Surv-eye”: Can eye tracking data shed light on the internal consistency in self-report questionnaires on cognitive processing strategies? Frontline Learning Research, 8(3), 26–39. https://doi.org/10.14786/flr.v8i3.489

Chen, E. E., Wojcik, S. P. (2016). A practical guide to big data research in psychology. Psychological Methods, 21(4), 458. https://doi.org/10.1037/met0000111

Chernyshenko, O. S., Stark, Chan, K.-Y., Drasgow, Williams (2001). Fitting item response theory models to two personality inventories: Issues and insights. Multivariate Behavioral Research, 36(4), 523–562. https://doi.org/10.1207/S15327906MBR3604_03

Chernyshenko, O. S., Stark, Drasgow, Roberts, B. W. (2007). Constructing personality scales under the assumptions of an ideal point response process: Toward increasing the flexibility of personality measures. Psychological Assessment, 19(1), 88. https://doi.org/10.1037/1040-3590.19.1.88

Curtis, S. M. (2010). BUGS Code for item response theory. Journal of Statistical Software, 36, 1–34. https://doi.org/10.18637/jss.v036.c01

de Valpine, Turek, Paciorek, C. J., Anderson-Bergman, Lang, D. T., Bodik (2017). Programming with models: Writing statistical algorithms for general model structures with NIMBLE. Journal of Computational and Graphical Statistics, 26(2), 403–413. https://doi.org/10.1080/10618600.2016.1172487

Donaldson, S. I., Grant-Vallone, E. J. (2002). Understanding self-report bias in organizational behavior research. Journal of Business and Psychology, 17(2), 245–260. https://doi.org/10.1023/A:1019637632584

Ferrando, P. J. (2007). An item response theory model for incorporating response time data in binary personality items. Applied Psychological Measurement, 31(6), 525–543. https://doi.org/10.1177/0146621606295197

Ferrando, P. J., Lorenzo-Seva (2007). A measurement model for Likert responses that incorporates response time. Multivariate Behavioral Research, 42(4), 675–706. https://doi.org/10.1080/00273170701710247

Fisher (2021). A multiattribute attentional drift diffusion model. Organizational Behavior and Human Decision Processes, 165, 167–182. https://doi.org/10.1016/j.obhdp.2021.04.004

Fowler, F. J., Jr. (2013). Survey research methods. Sage Publications.

Goldberg, Lewis R. (1992). The development of markers for the Big-Five factor structure. Psychological Assessment, 4(1), 26–42. https://doi.org/10.1037/1040-3590.4.1.26

Gorney (2024). Three new corrections for standardized person-fit statistics for tests with polytomous items. British Journal of Mathematical and Statistical Psychology, 77(3), 634–650. https://doi.org/10.1111/bmsp.12342

Guo, Wang, Cai (2024). An item response theory model for incorporating response times in forced-choice measures. Educational and Psychological Measurement, 84(3), 450–480. https://doi.org/10.1177/00131644231171193

Zhang, Qin, Shi, Wang, Dong (2025). A simultaneous EEG and eye-tracking dataset for remote sensing object detection. Scientific Data, 12(1), 651. https://doi.org/10.1038/s41597-025-04995-w

Henninger, Plieninger (2021). Different styles, different times: How response times can inform our knowledge about the response process in rating scale measurement. Assessment, 28(5), 1301–1319. https://doi.org/10.1177/1073191119900003

Ivanova, Michaelides, Eklöf (2020). How does the number of actions on constructed-response items relate to test-taking effort and performance? Educational Research and Evaluation, 26(5–6), 252–274. https://doi.org/10.1080/13803611.2021.1963939

Judge, T. A., Jackson, C. L., Shaw, J. C., Scott, B. A., Rich, B. L. (2007). Self-efficacy and work-related performance: The integral role of individual differences. Journal of Applied Psychology, 92(1), 107. https://doi.org/10.1037/0021-9010.92.1.107

Kam, C. C. S., Meyer, J. P. (2015). How careless responding and acquiescence response bias can influence construct dimensionality: The case of job satisfaction. Organizational Research Methods, 18(3), 512–541. https://doi.org/10.1177/1094428115571894

Konovalov, Krajbich (2019). Revealed strength of preference: Inference from response times. Judgment and Decision Making, 14(4), 381–394. https://doi.org/10.1017/S1930297500006082

Kosch, Hassib, Woźniak, P. W., Buschek, Alt (2018, April). Your eyes tell: Leveraging smooth pursuit for assessing cognitive workload. In Proceedings of the 2018 CHI conference on human factors in computing systems, Montreal QC, Canada.

Krajbich, Armel, Rangel (2010). Visual fixations and the computation and comparison of value in simple choice. Nature Neuroscience, 13(10), 1292–1298. https://doi.org/10.1038/nn.2635

Krajbich, Camerer, Rangel (2012). The attentional drift-diffusion model extends to simple purchasing decisions. Frontiers in Psychology, 3, Article 193. https://doi.org/10.3389/fpsyg.2012.00193

Krajbich, Rangel (2011). Multialternative drift-diffusion model predicts the relationship between visual fixations and choice in value-based decisions. Proceedings of the National Academy of Sciences, 108(33), 13852–13857. https://doi.org/10.1073/pnas.1101328108

Zhang, Mou (2025, Jun). Though forced, still valid: Examining the psychometric performance of forced-choice measurement of personality in children and adolescents. Assessment, 32(4), 521–543. https://doi.org/10.1177/10731911241255841

Liang, Cai (2023). Using process data to improve classification accuracy of cognitive diagnosis model. Multivariate Behavioral Research, 58(5), 969–987. https://doi.org/10.1080/00273171.2022.2157788

Liu, H.-Z., Yang, X.-L., Q.-Y., Wei, Z.-H. (2023). Preference of dimension-based difference in intertemporal choice: Eye-tracking evidence. Acta Psychologica Sinica, 55(4), article 612. https://doi.org/10.3724/SP.J.1041.2023.00612

Man, Harring, J. R. (2019). Negative binomial models for visual fixation counts on test items. Educational and Psychological Measurement, 79(4), 617–635. https://doi.org/10.1177/0013164418824148

Man, Harring, J. R., Zhan (2022). Bridging models of biometric and psychometric assessment: A three-way joint modeling approach of item responses, response times, and gaze fixation counts. Applied Psychological Measurement, 46(5), 361–381. https://doi.org/10.1177/01466216221089344

Margot, Leen, David, Sven, D. M., Vincent (2023). Self-report questionnaires scrutinised: Do eye movements reveal individual differences in cognitive processes while completing a questionnaire? International Journal of Social Research Methodology, 26(4), 391–407. https://doi.org/10.1080/13645579.2022.2052696

Meißner, Oll (2019). The promise of eye-tracking methodology in organizational research: A taxonomy, review, and future avenues. Organizational Research Methods, 22(2), 590–617. https://doi.org/10.1177/1094428117744882

Meng, X. B., Tao, Shi, N. Z. (2013). An item response model for Likert-type data that incorporates response time in personality measurements. Journal of Statistical Computation and Simulation, 84(1), 1–21. https://doi.org/10.1080/00949655.2012.692368

Michinov, Jamet, Métayer, Le Hénaff (2015). The eyes of creativity: Impact of social comparison and individual creativity on performance and attention to others’ ideas during electronic brainstorming. Computers in Human Behavior, 42, 57–67. https://doi.org/10.1016/j.chb.2014.04.037

Molenaar, Tuerlinckx, van der Maas, H. L. (2015). A bivariate generalized linear item response theory modeling framework to the analysis of responses and response times. Multivariate Behavioral Research, 50(1), 56–74. https://doi.org/10.1080/00273171.2014.962684

Ranger (2013). A note on the hierarchical model for responses and response times in tests of van der Linden (2007). Psychometrika, 78(3), 538–544. https://doi.org/10.1007/s11336-013-9324-6

Salvucci, D. D., Goldberg, J. H. (2000). Identifying fixations and saccades in eye-tracking protocols. Proceedings of the 2000 symposium on Eye tracking research & applications, Palm Beach Gardens, Florida, USA.

Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometric Society.

Schwarz (2007). Cognitive aspects of survey methodology. Applied Cognitive Psychology, 21(2), 277–287. https://doi.org/10.1002/acp.1340

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Skinner, I. W., Hübscher, Moseley, G. L., Lee, Wand, B. M., Traeger, A. C., Gustin, S. M., McAuley, J. H. (2018). The reliability of eyetracking to assess attentional bias to threatening words in healthy individuals. Behavior Research Methods, 50, 1778–1792. https://doi.org/10.3758/s13428-017-0946-y

Sweeney, West, Groessler, Haynie, Higgs, Macaulay, Mercer-Mapstone, Yeo (2017). Where's the transformation? Unlocking the potential of technology-enhanced assessment. Teaching & Learning Inquiry: The ISSOTL Journal, 5(1), 1–13. https://doi.org/10.20343/5.1.5

Taore, Lobo, Turnbull, P. R., Dakin, S. C. (2025). Diagnosing colour vision deficiencies using eye movements (without dedicated eye-tracking hardware). Journal of Eye Movement Research, 18(5), 51. https://doi.org/10.3390/jemr18050051

Tay, Ng (2018). Ideal point modeling of non-cognitive constructs: Review and recommendations for research. Frontiers in Psychology, 9, article 2423. https://doi.org/10.3389/fpsyg.2018.02423

Thissen (1983). Timed testing: An approach using item response theory. In New horizons in testing (pp. 179–203). Elsevier.

Tian, Zhan, Wang (2023). Joint cognitive diagnostic modeling for probabilistic attributes incorporating item responses and response times. Acta Psychologica Sinica, 55(9), article 1573. https://doi.org/10.3724/SP.J.1041.2023.01573

Tong, Qin, Peng, Zhong (2022). Detection of aberrant response patterns using a residual-based statistic in testing with polytomous items. Acta Psychologica Sinica, 54(9), article 1122. https://doi.org/10.3724/SP.J.1041.2022.01122

Tourangeau (2003). Cognitive aspects of survey measurement and mismeasurement. International Journal of Public Opinion Research, 15(1), 3–7. https://doi.org/10.1093/ijpor/15.1.3

Uggeldahl, Jacobsen, Lundhede, T. H., Olsen, S. B. (2016). Choice certainty in discrete choice experiments: Will eye tracking provide useful measures? Journal of Choice Modelling, 20, 35–48. https://doi.org/10.1016/j.jocm.2016.09.002

Ulitzsch, Pohl, Khorramdel, Kroehne, von Davier (2022). A response-time-based latent response mixture model for identifying and modeling careless and insufficient effort responding in survey data. Psychometrika, 87(2), 593–619. https://doi.org/10.1007/s11336-021-09817-7

Ulitzsch, von Davier, Pohl (2020a). A hierarchical latent response model for inferences about examinee engagement in terms of guessing and item-level non-response. British Journal of Mathematical and Statistical Psychology, 73, 83–112. https://doi.org/10.1111/bmsp.12188

Ulitzsch, von Davier, Pohl (2020b). Using response times for joint modeling of response and omission behavior. Multivariate Behavioral Research, 55(3), 425–453. https://doi.org/10.1080/00273171.2019.1643699

Van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. https://doi.org/10.1007/s11336-006-1478-z

Van Loo, E. J., Grebitus, Nayga, R. M., Jr., Verbeke, Roosen (2018). On the measurement of consumer preferences and food choice behavior: The relation between visual attention and choices. Applied Economic Perspectives and Policy, 40(4), 538–562. https://doi.org/10.1093/aepp/ppy022

Watanabe, Opper (2010). Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. Journal of Machine Learning Research, 11, 3571–3594. https://doi.org/10.48550/arXiv.1004.2316

Webb, Gibson (2015). Technology enhanced assessment in complex collaborative settings. Education and Information Technologies, 20(4), 675–695. https://doi.org/10.1007/s10639-015-9413-5

Wetzel, Lüdtke, Zettler, Böhnke, J. R. (2016). The stability of extreme response style and acquiescence over 8 years. Assessment, 23(3), 279–291. https://doi.org/10.1177/1073191115583714

Zhan (2022). Joint-cross-loading multimodal cognitive diagnostic modeling incorporating visual fixation counts. Acta Psychologica Sinica, 54(11), article 1416. https://doi.org/10.3724/SP.J.1041.2022.01416

Zhan, Man, Wind, S. A., Jonathan (2022). Cognitive diagnosis modeling incorporating response times and fixation counts: Providing comprehensive feedback and accurate diagnosis. Journal of Educational and Behavioral Statistics, 47(6), 736–776. https://doi.org/10.3102/10769986221111085

Zhang, Koenitz (2023). Effects of store fixture shape at retail checkout: Evidence from field and online studies. Production and Operations Management, 32(10), 3158–3173. https://doi.org/10.1111/poms.14028

Zhou, Guo (2026). Psychometric model framework for multiple response items. Psychometrika, 91(2), 587–619. https://doi.org/10.1017/psy.2025.10073
