A Bayesian Framework to Establish Validity Evidence for Multi-Unidimensional Instruments With Small Samples

Abstract

Accurate estimation of latent traits is critical in educational and psychological measurement, yet establishing validity evidence is challenging under small samples, ordinal data, and skewed distributions. We propose a Bayesian framework that incorporates expert-informed priors to enhance estimation and validity evidence for multi-unidimensional instruments. Simulation studies demonstrate that expert input improves estimation accuracy over ordinal confirmatory factor analysis, especially in samples as small as N = 25, with diminishing returns beyond six to nine experts. We compare variational inference (automatic differentiation variational inference) and Hamiltonian Monte Carlo in terms of estimation accuracy, computational efficiency, and posterior quality. A real-world application using TIMSS Grade 8 Mathematics data illustrates practical implications for expert selection and estimation strategies in small-sample instrument development.

Keywords

multidimensional instrument validity, Bayesian item development, item response theory, multidimensional IRT, Hamiltonian Monte Carlo, automatic differentiation variational inference

Introduction

Educational and psychological tests are essential for measuring individuals’ latent abilities or traits, which are unobservable but inferred through carefully designed instruments. These tests are widely used in state-level assessments (e.g., Massachusetts Comprehensive Assessment System [MCAS]), national assessments (e.g., National Assessment of Educational Progress [NAEP]), and international assessments (e.g., Trends in International Mathematics and Science Study [TIMSS]). The results provide policymakers and educators with critical information about student achievement and areas needing improvement. Similarly, tools like the Classroom Assessment Scoring System Toddler (Pianta et al., 2008) help educators evaluate classroom interactions to enhance teaching quality and student learning.

Multi-unidimensional instruments play a vital role in measuring complex constructs such as critical thinking or depression by assessing multiple distinct but related traits through subtests (e.g., TIMSS mathematics assessment; Lindquist et al., 2017). Test development is a rigorous, multi-stage process requiring validity and reliability evidence at the instrument and item levels (Crocker & Algina, 1986; DeVellis, 1991). Content validity is often established through expert reviews, while construct validity relies on factor analysis using participant data. However, traditional methods analyze these components separately, requiring large samples and prolonged timelines, often spanning years (Ellis et al., 2009).

Small sample sizes present significant challenges to instrument development. In niche populations, such as individuals with rare conditions, ethical or logistical constraints may limit participation (Burns & Grove, 2010; Patel et al., 2003). Smaller samples result in unreliable parameter estimates, inflated measurement error, and model convergence issues (Linacre, 1994; Stone & Yumoto, 2004). Additionally, the ordinal nature of response data, common in educational and psychological research (e.g., Likert scales), complicates validity testing. Maximum Likelihood estimation, the standard approach in confirmatory factor analysis (CFA), assumes continuous, normally distributed data and struggles with ordinal scales, often yielding biased parameter estimates and unreliable fit indices (Brown, 2006; Flora & Curran, 2004).

To address these challenges, Bayesian methods have emerged as a promising alternative. Gajewski et al. (2013) proposed a Bayesian framework integrating content and construct validity using expert ratings as priors to reduce sample size requirements. Building on this, Jiang et al. (2014) introduced the Bayesian Instrument Development (BID) method, incorporating Fisher’s transformation to allow full-range correlations, achieving better precision and efficiency compared to traditional CFA. Garrard et al. (2015) extended these methods to ordinal data with their Ordinal BID approach, enhancing predictive validity and parameter estimation for small samples.

Despite these advancements, prior studies are limited to unidimensional models and fail to address multidimensional instruments commonly used in educational and psychological assessments (Sheng & Wikle, 2007). To overcome these gaps, our study introduces a novel Bayesian framework for developing multi-unidimensional instruments, integrating expert and participant data, and leveraging state-of-the-art tools like Stan (Stan Development Team, 2023). Stan’s Hamiltonian Monte Carlo (HMC) algorithm offers improved sampling efficiency and enables advanced model evaluation techniques. To further enhance computational efficiency, we incorporate Variational Approximation, which transforms Bayesian inference into an optimization problem (Kucukelbir et al., 2017).

Purpose and Research Questions

The main purpose of this study is to develop a Bayesian framework that integrates expert opinions and participant data to establish validity evidence for multi-unidimensional instruments. Through extensive simulations, this study evaluates parameter estimation accuracy and computational efficiency under small-sample conditions. The specific research questions are as follows:

Research Question 1: How does the proposed Bayesian method, incorporating priors from expert ratings, improve estimation accuracy compared to traditional Ordinal CFA across varying dimensions, item counts, sample sizes, and participant data characteristics (e.g., skewness)? Additionally, how do expert-derived priors enhance estimation accuracy compared to non-informative priors? Specifically, what is the relationship between the number of experts ( $K$ ) and sample size ( $N$ ) required to achieve equivalent estimation accuracy?

Research Question 2: How does Variational Approximation compare to HMC in terms of estimation accuracy, computational efficiency, and the quality of approximated posterior distributions?

The organization of this paper is as follows. The next section presents Methodology, which introduces item response theory (IRT), multi-unidimensional IRT models, the hierarchical Bayesian framework, and parameter estimation techniques (HMC, automatic differentiation variational inference [ADVI]), followed by our proposed Bayesian modeling framework to establish validity evidence for multi-unidimensional instruments, including model formulation, prior specification, and parameter estimation. Section “Simulation Studies” illustrates the simulation designs addressing our research questions. Section “Results” presents findings from the two simulation studies. Section “Empirical Study” applies the framework to TIMSS 2019 Grade 8 Mathematics Student Questionnaire data. Sections “Discussion and Conclusion” and “Limitation and Future Research” summarize key findings, implications, limitations, and future research directions.

Methodology

IRT and Multi-Unidimensional IRT Models

IRT, also known as latent trait theory, is a modern framework for evaluating assessments by modeling the relationship between item responses and underlying latent abilities using person and item parameters. Developed by scholars such as Lord, Rasch, and Novick in the 1950s and 1960s, IRT yields robust, sample-independent estimates that overcome many limitations of Classical Test Theory, making it widely applicable in educational, psychological, and health-related assessments (van der Linden & Hambleton, 2013). Multidimensional IRT (MIRT) extends this framework to accommodate multiple latent traits or uncertain test structures (Reckase, 1997, 2006), while multi-unidimensional instruments—comprising subtests that each measure a specific ability yet capture inter-dimensional correlations—are particularly valuable for designing, maintaining, and equating assessments such as personality tests and achievement batteries (Sheng & Wikle, 2007).

In a multi-unidimensional test comprising $V$ subtests, each containing $J_{v}$ ( $v = 1, \dots, V$ ) items, let $y_{ivj}$ denote the ordinal response from the $i$ th examinee to the $j$ th item on the $v$ th subtest, where $i = 1, 2, \dots, N$ and $j = 1, 2, \dots, J_{v}$ . Suppose that the item has $m_{vj}$ response categories ( $1, 2, \dots, m_{vj}$ ). The probability of a response in category $c$ or below is then modeled as

\begin{matrix} P (y_{ivj} \leq c | θ_{iv}, α_{vj}, δ_{vjc}) = Φ (α_{vj} (δ_{vjc} - θ_{iv})) = Φ (τ_{vjc} - α_{vj} θ_{iv}), c = 1, \dots, m_{vj} \end{matrix},

(1)

where $θ_{iv}$ is the $i$ th examinee’s ability in dimension $v$ assumed to follow a normal distribution $N (0, 1)$ ; we use $θ_{i} = {(θ_{i 1}, \dots, θ_{iV})}^{T}$ to denote a vector of the $i$ th examinee’s abilities across the $V$ dimensions, and $θ_{i}$ follows a multivariate normal distribution $N_{V} (0, Σ)$ ; $α_{vj}$ is the item discrimination parameter and $δ_{vjc}$ is the transition parameter separating adjacent response categories $c$ and $c + 1$ . For simplicity in model notation, $τ_{vjc}$ is used to denote $α_{vj} δ_{vjc}$ . The underlying continuous variable $y_{ivj}^{*} = α_{vj} θ_{iv} + ε_{ivj}$ is linked to $y_{ivj}$ via a set of ordered cut-points $- \infty = τ_{vj 0} < τ_{vj 1} < \dots < τ_{vj m_{vj}} = + \infty$ : $y_{ivj} = c$ if $y_{ivj}^{*} \in (τ_{vj (c - 1)}, τ_{vjc}]$ .

In Equation 1, Φ(⋅) is the standard normal cumulative distribution function. This link makes the model a normal-ogive (probit) IRT model, which is a common formulation in multidimensional IRT. If the logistic function $logistic (x) = 1 / (1 + \exp (- Dx))$ , with scaling constant $D \approx 1.7$ to align the logistic and normal-ogive models, is used instead, the formulation becomes the familiar logistic IRT model. We verified that our method produces equivalent results under both links, so the conclusions also generalize to logistic IRT models.
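As a minimal illustration of Equation 1, the sketch below turns the cumulative probit probabilities into category probabilities for a single examinee and item. The function name `category_probabilities` and the example values (ability 0, discrimination 1, cut-points -0.75, 0, 0.75) are illustrative assumptions, not part of the original study code.

```python
from statistics import NormalDist

def category_probabilities(theta, alpha, taus):
    """Return P(y = c) for c = 1..m under the probit graded response model.

    taus holds the interior cut-points tau_1 < ... < tau_{m-1};
    tau_0 = -inf and tau_m = +inf are implied.
    """
    phi = NormalDist().cdf
    # Cumulative probabilities P(y <= c) = Phi(tau_c - alpha * theta)
    cumulative = [phi(tau - alpha * theta) for tau in taus]
    bounds = [0.0] + cumulative + [1.0]
    # Category probabilities are differences of adjacent cumulative values
    return [bounds[c] - bounds[c - 1] for c in range(1, len(bounds))]

# Hypothetical four-category item evaluated at theta = 0
probs = category_probabilities(theta=0.0, alpha=1.0, taus=[-0.75, 0.0, 0.75])
print([round(p, 3) for p in probs])
```

Because the cut-points in this example are symmetric around zero and the ability is zero, the resulting category probabilities are symmetric as well.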

Because the ability $θ_{iv}$ and the measurement error $ε_{ivj}$ are both assumed to follow the standard normal distribution and to be independent of each other, the underlying continuous variable $y_{ivj}^{*}$ follows a normal distribution $N (0, 1 + α_{vj}^{2})$ . Measurement errors $ε_{ivj}$ are assumed to be independent across items and subtests.

Standardizing $y_{ivj}^{*}$ yields:

\frac{y_{ivj}^{*}}{\sqrt{1 + α_{vj}^{2}}} = \frac{α_{vj}}{\sqrt{1 + α_{vj}^{2}}} θ_{iv} + \frac{ε_{ivj}}{\sqrt{1 + α_{vj}^{2}}} .

(2)

This connects to a factor analysis model:

z_{ivj}^{*} = ρ_{vj} f_{iv} + e_{ivj}, z_{ivj}^{*} ~ N (0, 1), f_{i} ~ N_{V} (0, Σ), e_{ivj} ~ N (0, 1 - ρ_{vj}^{2}),

(3)

where $z_{ivj}^{*}$ follows a standard normal distribution; $ρ_{vj}$ represents the loading of item $j$ on dimension $v$ ; factors $f_{i} = (f_{i 1}, \dots, f_{iV})$ follow a multivariate normal distribution $N_{V} (0, Σ)$ where the main diagonals of $Σ$ are fixed at 1: $diag (Σ) = (1, \dots, 1)$ . The interchangeable relationship between the discrimination parameter for multi-unidimensional IRT models and factor loadings for factor analysis models is established through

ρ_{vj} = \frac{α_{vj}}{\sqrt{1 + α_{vj}^{2}}}, α_{vj} = \frac{ρ_{vj}}{\sqrt{1 - ρ_{vj}^{2}}} .

(4)
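The two conversions in Equation 4 are inverses of each other, which a short numerical check can confirm. The helper names below are hypothetical, and the loadings 0.3, 0.5, and 0.7 are simply the low, moderate, and high values used later in the simulation design.

```python
import math

def loading_from_discrimination(alpha):
    """Factor loading rho implied by an IRT discrimination alpha."""
    return alpha / math.sqrt(1.0 + alpha ** 2)

def discrimination_from_loading(rho):
    """IRT discrimination alpha implied by a factor loading rho."""
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    return rho / math.sqrt(1.0 - rho ** 2)

# Round trip: loading -> discrimination -> loading
for rho in (0.3, 0.5, 0.7):
    alpha = discrimination_from_loading(rho)
    print(rho, round(alpha, 3), round(loading_from_discrimination(alpha), 3))
```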

A Hierarchical Framework for Bayesian Item Response Modeling

Bayesian IRT emerged in the 1980s to address challenges of complex response behaviors and hierarchical data structures (Fox, 2010). Researchers like Swaminathan and Gifford (1982, 1985), Rigdon and Tsutakawa (1983), and Mislevy (1986), among others, pioneered Bayesian methods to model guessing, extreme responding, missing data, and intentional misrepresentation. A key strength of the Bayesian approach lies in its ability to incorporate prior knowledge to improve the accuracy of parameter estimates when sample sizes are small. Furthermore, this approach provides a flexible framework for analyzing survey data with intricate hierarchical structure (Fox, 2010).

The hierarchical framework for item response modeling consists of two levels to account for the relationship between the item and person parameters.

At the first level, an item response model captures the probabilities of individual responses. Let $y$ represent observed responses, $θ$ denote the matrix of all examinees’ latent abilities across all dimensions, $ξ$ denote the set of all item parameters (including item threshold parameters $T$ and discrimination parameters $α$ ), and $y^{*}$ represent an augmented latent variable such that $y_{ivj}^{*} ~ N (α_{vj} θ_{iv}, 1)$ . For a test comprising $V$ subtests, each with $J_{v}$ items, the sampling distribution of $y$ is expressed as

\begin{matrix} f (y | θ, ξ) \propto f (y | y^{*}, T) f (y^{*} | θ, α) . \end{matrix}

(5)

The second level describes population-level characteristics for both items (in Equation 6) and examinees (in Equation 7), with each parameter being assigned a prior distribution and a joint distribution at the population level:

\begin{matrix} p (ξ) \propto p (α) p (T), \end{matrix}

(6)

\begin{matrix} p (θ) \propto p (θ | Σ) p (Σ) . \end{matrix}

(7)

With the two-level models described above, the full joint posterior distribution of $(θ, α, T, Σ, y^{*})$ can be written as

\begin{matrix} p (θ, α, T, Σ, y^{*} | y) \propto f (y | y^{*}, T) f (y^{*} | θ, α) p (θ | Σ) p (Σ) p (α) p (T) . \end{matrix}

(8)

Bayesian Parameter Estimation Techniques

Advances in computational statistics have revolutionized parameter estimation in IRT models, particularly through the development of sampling-based techniques like Markov Chain Monte Carlo (MCMC) and variational inference techniques. Among MCMC methods, HMC has gained prominence for its efficiency in sampling from complex, high-dimensional posteriors by leveraging Hamiltonian dynamics (Duane et al., 1987; Neal, 2011). HMC introduces auxiliary momentum variables, uses the leapfrog integrator to simulate motion through an energy landscape, and employs a Metropolis acceptance step that corrects for the integrator's discretization error so that the chain targets the exact posterior (see Appendix A).
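The three ingredients named above (momentum resampling, leapfrog integration, Metropolis acceptance) can be sketched for a one-dimensional toy target. This is a bare-bones illustration under stated assumptions: the target is a standard normal, the step size and number of leapfrog steps are fixed by hand, and none of the adaptation performed by Stan's NUTS is included.

```python
import math
import random

def leapfrog(q, p, grad_u, step_size, n_steps):
    """Simulate Hamiltonian dynamics with the leapfrog integrator."""
    p = p - 0.5 * step_size * grad_u(q)        # half step for momentum
    for step in range(n_steps):
        q = q + step_size * p                  # full step for position
        if step < n_steps - 1:
            p = p - step_size * grad_u(q)      # full step for momentum
    p = p - 0.5 * step_size * grad_u(q)        # final half step for momentum
    return q, p

def hmc_sample(u, grad_u, q0, n_samples, step_size=0.3, n_steps=5, seed=1):
    """Draw samples from exp(-u(q)) using basic HMC (no tuning)."""
    rng = random.Random(seed)
    q, draws = q0, []
    for _ in range(n_samples):
        p0 = rng.gauss(0.0, 1.0)               # resample auxiliary momentum
        q_new, p_new = leapfrog(q, p0, grad_u, step_size, n_steps)
        h_old = u(q) + 0.5 * p0 ** 2           # potential + kinetic energy
        h_new = u(q_new) + 0.5 * p_new ** 2
        # Metropolis step: accept with probability min(1, exp(h_old - h_new))
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            q = q_new
        draws.append(q)
    return draws

# Toy target: standard normal, so U(q) = q^2 / 2 and dU/dq = q
draws = hmc_sample(lambda q: 0.5 * q * q, lambda q: q, q0=0.0, n_samples=5000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(round(mean, 2), round(var, 2))
```

The leapfrog integrator nearly conserves the total energy, which is why most proposals are accepted even though each one moves far from the current state.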

In contrast, ADVI approximates posteriors by transforming them into a standard coordinate space and fitting a simple variational distribution—typically a diagonal Gaussian—via optimization of the evidence lower bound (ELBO; Kucukelbir et al., 2017). While ADVI offers significant computational efficiency for large-scale or high-dimensional models, its mean-field assumption may limit its capacity to capture parameter dependencies (see Appendix B).

Bayesian Multi-Unidimensional IRT Models

Consider a $J$ -item multi-unidimensional test with $V$ subtests, where each subtest $v$ contains $J_{v}$ items designed to measure a specific ability dimension. Let $y_{ivj}$ denote the ordinal response of examinee $i$ to the item $j$ in subtest $v$ , linked to a latent variable $z_{ivj}^{*} ~ N (0, 1)$ through $z_{ivj}^{*} = ρ_{vj} θ_{iv} + e_{ivj}$ and partitioned into response categories by ordered cut-points $T_{vj} = (τ_{vj 0}, τ_{vj 1}, \dots, τ_{vj m_{vj}})$ . Examinee abilities $θ_{i} = (θ_{i 1}, \dots, θ_{iV})$ follow a multivariate normal distribution $θ_{i} ~ N_{V} (0, Σ)$ , with $Σ = diag (σ_{1}, \dots, σ_{V}) R diag (σ_{1}, \dots, σ_{V})$ , where $σ_{1}, \dots, σ_{V}$ are fixed at 1.0, and $R$ is the between-trait correlation matrix. Measurement errors $e_{ivj}$ are assumed independent of the abilities $θ_{iv}$ and mutually independent across all combinations of $i$ , $j$ , and $v$ .

Prior Specification From Expert Ratings

Following Gajewski et al. (2011), experts’ ratings on an item’s relevance are used to inform priors for the item’s correlation parameter $ρ_{vj}$ . Experts’ ratings $x_{vjk}$ are mapped to correlations via Cohen’s (1998) cut-points.

\begin{matrix} x_{vjk} = \begin{cases} 1 \text{ ‘not relevant’} & \text{if } 0.00 \leq ρ_{vjk} < 0.10 \\ 2 \text{ ‘somewhat relevant’} & \text{if } 0.10 \leq ρ_{vjk} < 0.30 \\ 3 \text{ ‘relevant’} & \text{if } 0.30 \leq ρ_{vjk} < 0.50 \\ 4 \text{ ‘highly relevant’} & \text{if } 0.50 \leq ρ_{vjk} < 1.00 \end{cases} \end{matrix}

(9)

Similarly, ratings for between-dimension correlations ( $r_{v_{1} v_{2}}$ ) follow the same structure. Fisher’s transformation $μ = g (ρ) = \frac{1}{2} \log \frac{1 + ρ}{1 - ρ}$ is applied to both item-to-dimension and between-dimension correlations. The prior for the transformed parameter $μ_{vj}$ is specified as

\begin{matrix} μ_{vj} ~ N (\frac{1}{K} \sum_{k = 1}^{K} g (ρ_{vjk}^{*}), \frac{\exp (5 \times s^{2})}{5 \times K}), \end{matrix}

(10)

where $ρ_{vjk}^{*}$ is the anchor value based on the expert’s rating: $ρ_{vjk}^{*} = 0.05, 0.20, 0.40$ , or $0.75$ if $x_{vjk} = 1, 2, 3$ , or $4$ , respectively; $K$ is the number of experts, and $s^{2}$ is the sample variance reflecting the level of agreement among the experts. The correlation matrix $R$ receives a joint distribution consisting of a uniform LKJ distribution (with $η = 1$ ; Lewandowski et al., 2009) and normal distributions on its Fisher-transformed unique, lower-triangular elements:

p (R | η) \propto LKJ (η = 1) Π_{u = 1}^{\frac{V (V - 1)}{2}} N (\frac{1}{K} \sum_{k = 1}^{K} g (r_{uk}^{*}), \frac{\exp (5 \times s^{2})}{5 \times K}),

(11)

where $r_{uk}^{*}$ is the anchor value based on expert $k$ ’s rating $x_{uk}$ of the $u$ th between-dimension correlation: $r_{uk}^{*} = 0.05, 0.20, 0.40$ , or $0.75$ if $x_{uk} = 1$ (“no correlation”), 2 (“weak correlation”), 3 (“moderate correlation”), or 4 (“strong correlation”), respectively.
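To make the prior construction in Equation 10 concrete, the sketch below maps a set of expert ratings to the prior mean and variance on the Fisher scale. Two points are assumptions rather than statements from the text: the function names are hypothetical, and $s^{2}$ is taken here to be the sample variance of the Fisher-transformed anchor values (the text only says it reflects expert agreement). At least two ratings are needed for a sample variance.

```python
import math
from statistics import mean, variance

# Anchor correlations attached to ratings 1..4, as listed in the text
ANCHORS = {1: 0.05, 2: 0.20, 3: 0.40, 4: 0.75}

def fisher(rho):
    """Fisher transformation g(rho)."""
    return 0.5 * math.log((1.0 + rho) / (1.0 - rho))

def inverse_fisher(mu):
    """Back-transformation from the Fisher scale to a correlation."""
    return (math.exp(2.0 * mu) - 1.0) / (math.exp(2.0 * mu) + 1.0)

def expert_prior(ratings):
    """Return (prior mean, prior variance) on the Fisher scale for one item."""
    transformed = [fisher(ANCHORS[x]) for x in ratings]
    k = len(transformed)
    # ASSUMPTION: s^2 is the sample variance of the transformed anchors
    s2 = variance(transformed)
    return mean(transformed), math.exp(5.0 * s2) / (5.0 * k)

# Hypothetical ratings from six experts for a single item
prior_mean, prior_var = expert_prior([4, 4, 3, 4, 3, 4])
print(round(prior_mean, 3), round(prior_var, 3), round(inverse_fisher(prior_mean), 3))
```

Under this reading, unanimous experts give $s^{2} = 0$ and a prior variance of $1 / (5 K)$, while disagreement inflates the variance through the exponential term.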

The LKJ prior is widely applied in Bayesian hierarchical and multivariate modeling because it places a proper distribution on valid correlation matrices (symmetric, positive-definite, ones on the diagonal) and is controlled by a shape parameter η. When η = 1, the prior is uniform over all correlation matrices, avoiding any preference for particular structures. We adopted this noninformative setting so that the data and expert-informed priors, rather than strong distributional assumptions, determine the estimated correlations.

Transition parameters $τ_{vjc}$ are modeled using truncated normal distributions, where $τ_{vj (c - 1)}$ serves as the lower truncation boundary to ensure ordered constraints (Fox, 2010).

The full posterior distribution is given by

\begin{matrix} π (μ, θ, Σ, T | y) \propto f (y | μ, θ, T) p (μ) p (θ | Σ) p (R | η) p (T) . \end{matrix}

(12)

Simulation Studies

Simulating Experts’ Rating

Following Jiang et al. (2014), we assume experts unanimously interpret correlations in relation to item relevance and that their ratings align with participants’ response patterns. To simulate item-to-dimension correlations, we adopt Garrard et al.’s (2015) approach by incorporating low (.3), moderate (.5), and high (.7) correlations, with all simulations conducted using R 4.1.2 (R Core Team, 2022).

Experts’ ratings are simulated using the Rating Scale Many-Facet Rasch (RS-MFR) model (Andrich, 1978; Linacre, 1989). Specifically, the log-odds of expert $k$ assigning a rating transitioning from category $c$ to $c + 1$ for item $j$ in dimension $v$ is given by

\ln [\frac{P (x_{vjk} = c + 1)}{P (x_{vjk} = c)}] = μ_{vj}^{*} - β_{v} - λ_{k} - δ_{c}

(13)

where $μ_{vj}^{*}$ is the standardized true item-to-dimension correlation, $β_{v}$ represents the rating difficulty¹ for dimension $v$ and is generated from a normal distribution $N (0, 0.5)$ . $λ_{k}$ is the expert’s severity/leniency simulated using a narrow distribution $λ ~ N (0, 0.25)$ to reflect that experts are well-trained and demonstrate relatively little variation compared to the variation in participants’ ability (Raczynski et al., 2015). $δ_{c}$ is the difficulty of transitioning from category $c$ to category $c + 1$ . The RS-MFR model is also used to simulate experts’ ratings for between-dimension correlations, with parameters $μ_{v_{1} v_{2}}^{*}$ , $β_{v_{1} v_{2}}$ , $λ_{k}$ , and $δ_{c}$ defined similarly.
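Equation 13 specifies adjacent-category log-odds, from which the full category probabilities follow by accumulating the logits and normalizing. The sketch below shows one way this could be simulated. Several choices are assumptions: the step difficulties `DELTAS` are hypothetical (the text does not list them), the second argument of each normal distribution is read as a variance, and the value passed as `mu` is an arbitrary stand-in for the standardized correlation.

```python
import math
import random

def rating_probabilities(mu, beta, lam, deltas):
    """P(x = 1), ..., P(x = C) implied by the adjacent-category logits."""
    eta = mu - beta - lam
    log_weights = [0.0]                        # category 1 is the reference
    for delta in deltas:                       # one delta per step c -> c + 1
        log_weights.append(log_weights[-1] + eta - delta)
    peak = max(log_weights)                    # subtract max for stability
    weights = [math.exp(w - peak) for w in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

def simulate_expert_ratings(mu, beta, n_experts, deltas, rng):
    """Draw one rating per expert, each with its own severity lambda_k."""
    ratings = []
    for _ in range(n_experts):
        lam = rng.gauss(0.0, math.sqrt(0.25))  # ASSUMPTION: 0.25 is a variance
        probs = rating_probabilities(mu, beta, lam, deltas)
        ratings.append(rng.choices(range(1, len(probs) + 1), weights=probs)[0])
    return ratings

rng = random.Random(2024)
DELTAS = (-1.0, 0.0, 1.0)                      # hypothetical step difficulties
beta_v = rng.gauss(0.0, math.sqrt(0.5))        # ASSUMPTION: 0.5 is a variance
ratings = simulate_expert_ratings(mu=0.8, beta=beta_v, n_experts=6,
                                  deltas=DELTAS, rng=rng)
print(ratings)
```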

Simulation Study One

The first simulation study is designed to address Research Question 1 by comparing Bayesian methods using expert-informed priors with ordinal CFA and determining the number of experts needed to match the accuracy of a doubled sample size. A six-way factorial design is used with factors: number of dimensions ( $V \in \{2, 3, 4\}$ ), number of items per dimension ( $J_{v} \in \{5, 10\}$ ), response categories per item ( $C \in \{2, 3, 4, 5\}$ ), number of participants ( $N \in \{25, 50, 100\}$ ), between-dimension correlations ( $r \in \{0.3, 0.5, 0.7\}$ ), and response data distribution (non-skewed vs. low, medium, and high skewness: $skew \in \{0, 0.25, 0.5, 1\}$ ), resulting in 3 × 2 × 4 × 3 × 3 × 4 = 864 combinations. For each combination, participants’ response datasets are simulated through the following four steps:

1. Participants’ abilities $θ_{i} = (θ_{i 1}, \dots, θ_{iV})$ are drawn from a multivariate normal distribution with a mean of 0 and a covariance matrix whose off-diagonal elements are all set to 0.3 (low), 0.5 (medium), or 0.7 (high) and whose diagonal elements equal 1.

2. Standardized responses are generated using $z_{ivj}^{*} = ρ_{vj} θ_{iv} + e_{ivj}$ , with true item-to-dimension correlations specified as $ρ_{v}^{*} = {(0.3, 0.5, 0.7, 0.7, 0.7)}^{T}$ for a 5-items-per-dimension scenario, and $ρ_{v}^{*} = {(0.3, 0.3, 0.5, 0.5, 0.5, 0.7, 0.7, 0.7, 0.7, 0.7)}^{T}$ for a 10-items-per-dimension scenario, reflecting a mix of low, moderate, and high correlation items (20%, 20%, and 60% in the 5-item scenario; 20%, 30%, and 50% in the 10-item scenario).

3. For each item, $z_{ivj}^{*}$ is converted to an ordinal response $y_{ivj}$ using a set of ordered cut-points: for C = 4, $T_{vj}^{*} = (- \infty, - 0.75, 0, 0.75, + \infty)$ for non-skewed data and $T_{vj}^{*} = (- \infty, - 0.75 + skew, skew, 0.75 + skew, + \infty)$ for skewed data at different skewness levels: 0.25 (low), 0.5 (medium), and 1 (high). Similarly, the non-skewed cut-points for C = 2 are $T_{vj}^{*} = (- \infty, 0, + \infty)$ ; for C = 3, $(- \infty, - 0.5, 0.5, + \infty)$ ; for C = 5, $(- \infty, - 0.9, - 0.3, 0.3, 0.9, + \infty)$ . Skewed cut-points are obtained by adding a skewness value (0.25, 0.5, or 1) to the non-skewed cut-points.

4. One hundred replications are generated for each of the 864 scenarios.
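The data-generation steps above can be sketched for one scenario as follows. This is an illustrative reconstruction, not the study's R code: the function name is hypothetical, and the equicorrelated abilities are built from a shared component plus a dimension-specific component, which reproduces unit variances and a common correlation $r$ without a matrix decomposition (this shortcut is an assumption and only works for a common non-negative $r$, as in the design).

```python
import math
import random
from bisect import bisect_left

def simulate_responses(n, n_dims, loadings, r, cuts, rng):
    """Return data[i][v][j]: ordinal responses coded 1..C."""
    data = []
    for _ in range(n):
        # Abilities with unit variance and pairwise correlation r
        shared = rng.gauss(0.0, 1.0)
        thetas = [math.sqrt(r) * shared
                  + math.sqrt(1.0 - r) * rng.gauss(0.0, 1.0)
                  for _ in range(n_dims)]
        person = []
        for theta in thetas:
            row = []
            for rho in loadings:
                # Error variance 1 - rho^2 keeps z* standard normal
                e = rng.gauss(0.0, math.sqrt(1.0 - rho ** 2))
                z = rho * theta + e
                # y = c when z* falls in (tau_{c-1}, tau_c]
                row.append(bisect_left(cuts, z) + 1)
            person.append(row)
        data.append(person)
    return data

rng = random.Random(7)
LOADINGS_5 = (0.3, 0.5, 0.7, 0.7, 0.7)         # 5-items-per-dimension case
BASE_CUTS = (-0.75, 0.0, 0.75)                 # interior cut-points for C = 4
skew = 0.5                                     # medium skewness
cuts = [t + skew for t in BASE_CUTS]
data = simulate_responses(n=25, n_dims=2, loadings=LOADINGS_5,
                          r=0.5, cuts=cuts, rng=rng)
print(data[0])
```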

Parameter Estimation

CFA baseline: The ordinal CFA model is estimated using the WLSMV estimator provided by the “lavaan” R package (Rosseel, 2012). WLSMV (Weighted Least Squares Mean and Variance adjusted; Muthén et al., 1997) is widely used in structural equation modeling for categorical indicators (binary or ordinal). It operates on polychoric or polyserial correlations, minimizes the weighted difference between observed and model-implied statistics, and adjusts chi-square tests and standard errors for non-normality and non-continuity.

Bayesian IRT model: Experts’ ratings are simulated using the rating scale model (Equation 13), with varying numbers of experts $K \in \{2, 3, 6, 9, 12\}$ . Based on the simulated experts’ ratings, priors are specified following Equations 10 and 11. For K = 0, which represents the baseline scenario where no expert information is incorporated, noninformative priors were assigned, with diffuse distributions for $ρ$ on (0, 1) and for $σ$ on (−1, 1). Posterior sampling is performed using HMC with the No-U-Turn Sampler (NUTS) in the CmdStanR package (version 0.7.0; Stan Development Team, 2023), with 4,000 iterations (2,000 burn-in) across four chains. Estimated posterior means are then transformed back using ${\hat{ρ}}_{vj} = \frac{\exp (2 {\hat{μ}}_{vj}) - 1}{\exp (2 {\hat{μ}}_{vj}) + 1}$ for item-to-dimension correlations and ${\hat{σ}}_{u} = \frac{\exp (2 {\hat{ϕ}}_{u}) - 1}{\exp (2 {\hat{ϕ}}_{u}) + 1}$ for between-dimension correlations.

Model Performance Comparison

The average mean squared error (MSE) is used as the metric to compare estimation accuracy across models and different numbers of experts. Lower values indicate greater accuracy.

Let $ρ_{vj}^{*}$ be the true correlation between item $j$ and dimension $v$ and $σ_{v_{1} v_{2}}^{*}$ be the true correlation between dimension $v_{1}$ and dimension $v_{2}$ .

The MSE for each ${\hat{ρ}}_{vj}$ and ${\hat{σ}}_{v_{1} v_{2}}$ estimate across 100 replications is calculated as

MSE ({\hat{ρ}}_{vj}) = \frac{1}{100} \sum_{s = 1}^{100} {({\hat{ρ}}_{vj} (s) - ρ_{vj}^{*})}^{2}, MSE ({\hat{σ}}_{v_{1} v_{2}}) = \frac{1}{100} \sum_{s = 1}^{100} {({\hat{σ}}_{v_{1} v_{2}} (s) - σ_{v_{1} v_{2}}^{*})}^{2} .

(14)

The average MSEs for $\hat{ρ}$ and $\hat{σ}$ are then calculated as

\bar{MSE} (\hat{ρ}) = \frac{1}{J} \sum_{v = 1}^{V} \sum_{j = 1}^{J_{v}} MSE ({\hat{ρ}}_{vj}), \bar{MSE} (\hat{σ}) = \frac{2}{V (V - 1)} \sum_{1 \leq v_{1} < v_{2} \leq V} MSE ({\hat{σ}}_{v_{1} v_{2}}) .

(15)
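Equations 14 and 15 amount to a per-parameter MSE over replications followed by a plain average over parameters. A small sketch, with hypothetical function names and made-up estimates for two loadings across three replications:

```python
def mse(estimates, truth):
    """Mean squared error of one parameter across replications (Equation 14)."""
    return sum((est - truth) ** 2 for est in estimates) / len(estimates)

def average_mse(estimates_by_param, truths):
    """Average of the per-parameter MSEs (Equation 15)."""
    per_param = [mse(est, t) for est, t in zip(estimates_by_param, truths)]
    return sum(per_param) / len(per_param)

# Illustrative values only, not results from the study
true_rho = [0.3, 0.7]
rho_hat = [
    [0.25, 0.35, 0.30],   # replicate estimates for the first loading
    [0.60, 0.75, 0.70],   # replicate estimates for the second loading
]
print(round(average_mse(rho_hat, true_rho), 5))
```

The same two functions apply to the between-dimension correlations, where the average runs over the $V (V - 1) / 2$ unique pairs.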

Mixing Quality of HMC Sampling

The mixing quality of HMC sampling is assessed using Gelman’s $\hat{R}$ diagnostic, a widely used metric for evaluating the convergence and mixing of Markov chains in Bayesian analysis (Gelman et al., 2014). A value of $\hat{R}$ < 1.1 indicates convergence across the four chains. Conversely, a value exceeding 1.1 suggests potential convergence issues and poor mixing quality, which can lead to biased estimates and unreliable inference.
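For intuition, the classic form of the diagnostic compares between-chain and within-chain variance. The sketch below implements that basic version for a single parameter; it is an illustration only, and the refined split or rank-normalized variants that current Stan tooling may report are not reproduced here. The simulated chains are artificial.

```python
import random
from statistics import mean, variance

def r_hat(chains):
    """Basic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    between = n * variance(chain_means)              # B
    within = mean([variance(c) for c in chains])     # W
    var_plus = (n - 1) / n * within + between / n    # pooled variance estimate
    return (var_plus / within) ** 0.5

rng = random.Random(11)
# Four chains exploring the same distribution
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four chains stuck around two different locations
stuck = [[rng.gauss(shift, 1.0) for _ in range(1000)] for shift in (0, 0, 3, 3)]
print(round(r_hat(mixed), 3), round(r_hat(stuck), 3))
```

Chains that agree give a value close to 1, whereas chains centered on different locations push the statistic well above the 1.1 threshold.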

Comparison of Sample Size $N$ with $K$ Experts Versus $2 N$ Using Non-Informative Priors

This analysis is a prior sensitivity analysis that assesses the value of expert-derived priors in enhancing estimation accuracy. Specifically, we determine the number of experts $K$ needed for a sample size $N$ to achieve the same accuracy in estimating $ρ$ as a sample size $2 N$ using non-informative priors. This quantifies the trade-off between collecting additional data and utilizing expert input, which matters most in resource-limited scenarios.

Simulation Study Two

The second simulation study is designed to answer Research Question 2: how variational approximation (ADVI) compares to HMC in terms of estimation accuracy, computational efficiency, and approximation quality under various conditions. Simulated datasets are generated with the same data simulation method and six-way factorial design as in Simulation Study One, except that the number of response categories is fixed at C = 4 and the response distribution is restricted to $skew \in \{0, 0.5\}$ , giving 3 × 2 × 1 × 3 × 3 × 2 = 108 scenarios in total.
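The scenario counts for both studies can be checked by enumerating the factorial grids directly. The dictionary keys below are illustrative labels for the six design factors.

```python
from itertools import product

# Factor levels for Simulation Study One, as listed in the design
design_one = {
    "V": [2, 3, 4],
    "J_v": [5, 10],
    "C": [2, 3, 4, 5],
    "N": [25, 50, 100],
    "r": [0.3, 0.5, 0.7],
    "skew": [0, 0.25, 0.5, 1],
}
# Study Two fixes C = 4 and keeps two skewness levels
design_two = dict(design_one, C=[4], skew=[0, 0.5])

scenarios_one = list(product(*design_one.values()))
scenarios_two = list(product(*design_two.values()))
print(len(scenarios_one), len(scenarios_two))  # 864 108
```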

Parameter Estimation

Experts’ ratings are again simulated using the rating scale model (Equation 13), with the number of experts set based on findings from Simulation Study 1. Priors are then specified according to the simulated ratings following Equations 10 and 11.

ADVI and HMC are employed in parallel to estimate the model parameters:

HMC: Sampling from the full posterior using the HMC algorithm with the NUTS, implemented via the CmdStanR package (version 0.7.0) in R 4.1.2. The sampling procedure consists of 4,000 iterations, with a 2,000-iteration burn-in period, across four chains.

ADVI: Approximating the full posterior using ADVI with a mean-field Gaussian variational family. The ADVI process is run for a maximum of 3,000 iterations, with 50 Monte Carlo samples for estimating ELBO gradients and 100 Monte Carlo samples for the ELBO estimate. Implementation is also done using the CmdStanR package (version 0.7.0) in R 4.1.2.

Model Performance Comparison

Estimation Accuracy: We compare HMC and ADVI using the average MSEs of $\hat{ρ}$ and $\hat{σ}$ over 100 simulations. In addition, we calculate the average squared bias, defined as $\bar{{Bias}^{2}} (\hat{ρ}) = \frac{1}{J} \sum_{v = 1}^{V} \sum_{j = 1}^{J_{v}} Bias {({\hat{ρ}}_{vj})}^{2}, \bar{{Bias}^{2}} (\hat{σ}) = \frac{2}{V (V - 1)} \sum_{1 \leq v_{1} < v_{2} \leq V} Bias {({\hat{σ}}_{v_{1} v_{2}})}^{2}$ , to assess whether HMC and ADVI tend to consistently overestimate or underestimate the model parameters across different scenarios.

Computational Efficiency: For each scenario, we compare the average runtime of HMC with the NUTS and ADVI to determine which method is more efficient.

Approximation Quality of ADVI: To check the approximation quality of ADVI, we compare ADVI’s results to those from HMC, which is considered the gold standard in Bayesian inference. Specifically, we use the Kolmogorov–Smirnov (KS) test (Kolmogorov, 1933; Smirnov, 1948) to compare ADVI’s posterior estimates with HMC’s samples. We then measure ADVI’s overall performance by calculating the fraction of parameters for which the KS test p-value is greater than .05, indicating a good approximation.
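The comparison described above can be sketched as a two-sample KS test applied parameter by parameter. The code below is an illustration under assumptions: it treats both the ADVI output and the HMC output as sets of draws, uses a standard asymptotic approximation for the p-value rather than any particular package's implementation, and the function names and toy draws are hypothetical.

```python
import math

def ks_statistic(a, b):
    """Largest gap between the two empirical distribution functions."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:        # advance past ties at x
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_p_value(d, n_a, n_b, terms=100):
    """Asymptotic two-sided p-value with a small-sample correction."""
    n_eff = n_a * n_b / (n_a + n_b)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    if lam < 0.3:                              # series is ~1 in this range
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))

def fraction_well_approximated(draw_pairs, alpha=0.05):
    """Share of parameters whose KS p-value exceeds alpha."""
    passed = 0
    for advi_draws, hmc_draws in draw_pairs:
        d = ks_statistic(advi_draws, hmc_draws)
        if ks_p_value(d, len(advi_draws), len(hmc_draws)) > alpha:
            passed += 1
    return passed / len(draw_pairs)

# Toy draws: one parameter matches exactly, one is shifted far away
grid = [i / 100 for i in range(100)]
pairs = [(grid, list(grid)), (grid, [x + 5.0 for x in grid])]
print(fraction_well_approximated(pairs))
```

One caveat on the design: posterior draws from HMC are autocorrelated, so p-values from a test that assumes independent samples should be read as a rough screening device rather than an exact significance level.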

Results

Results for RQ1: Expert-Informed Priors Improve Estimation Accuracy

This section presents two analyses: (a) Parameter estimation accuracy comparisons between the Bayesian method and ordinal CFA, with an assessment of mixing quality for HMC sampling. (b) Comparison of a sample size $N$ with $K$ experts versus $2 N$ using non-informative priors.

Parameter Estimation Accuracy: Bayesian Method Versus Ordinal CFA

Figures 1 and 2 compare the average MSEs for $\hat{ρ}$ and $\hat{σ}$ estimates, respectively, for $V = 2, J_{v} = 5$ , C = 4 instruments, with a reference line at y = 0.01 indicating the acceptable threshold. The average MSEs are summarized in Tables S2 and S3, and additional comparisons are provided in Figures S1 to S10 in the Supplemental Material (available in the online version of this article).

Figure 1.

Comparisons of average MSEs of $\hat{ρ}$ between ordinal CFA and HMC for a 2-dimension, 5-item-per-dimension, 4-response-category instrument.

Figure 2.

Comparisons of average MSEs of $\hat{σ}$ between ordinal CFA and HMC for a 2-dimension, 5-item-per-dimension instrument.

The Bayesian method, using priors from at least six experts, consistently yields lower MSEs for $\hat{ρ}$ than ordinal CFA, particularly in small samples ( $N = 25$ ) where ordinal CFA often fails to converge or produces extreme estimates (average MSEs >0.05). Although the gap narrows at larger sample sizes ( $N = 50, 100$ ), the Bayesian approach maintains its advantage, especially under skewed response distributions, and remains robust across all between-dimension correlation levels.

Estimation accuracy declines with increasing dimensions for both methods, but ordinal CFA is more adversely affected—especially in small samples—while the Bayesian method remains stable. Ordinal CFA benefits from more items per dimension, whereas the Bayesian method’s performance is unaffected by item count. Additionally, increasing the number of experts markedly improves the Bayesian method up to six experts, with further increases offering diminishing yet positive returns.

The Bayesian method also consistently outperforms ordinal CFA in estimating between-dimension correlations $\hat{σ}$ , particularly for small sample sizes ( $N = 25, 50$ ) when expert-informed priors are used. Ordinal CFA is notably sensitive to high correlations ( $r = 0.7$ ) and skewed response distributions, leading to unstable threshold estimates, whereas the Bayesian method leverages priors to stabilize variance estimates even under challenging conditions. Additionally, while both methods perform similarly across two to four dimensions, the Bayesian method maintains MSEs below 0.01 with more than six experts, and fewer items per dimension further emphasize its advantage. Overall, achieving acceptable $\hat{σ}$ (MSE <0.01) requires at least 12 experts for $N = 25$ , 9 to 12 for $N = 50$ , and 6 to 9 for $N = 100$ .

Figures 1 and 2 reveal two divergent patterns between CFA and Bayesian estimation. First, CFA’s estimation of item-to-dimension correlations (ρ) improves as the between-dimension correlation (r) increases from .3 to .7. Stronger correlations across subtests increase shared variance, which stabilizes the factor loading estimates that underlie ρ. Second, the opposite occurs for CFA’s estimation of between-dimension correlations (σ): higher r values reduce the effective variability among dimensions, making σ more difficult to estimate accurately, especially in small samples. By contrast, the Bayesian approach with HMC exhibits relative stability in estimating both ρ and σ, as the posterior distribution incorporates prior information and fully accounts for uncertainty.

Mixing Quality of HMC Sampling Versus Convergence of Ordinal CFA

The purpose of assessing HMC sampling quality was to verify that including experts improved not only estimation accuracy but also convergence. We evaluated mixing quality using the $\hat{R}$ statistic (Gelman et al., 2014), and across all scenarios and replications, no parameter exceeded $\hat{R}$ = 1.1, indicating proper convergence throughout (Table S1 in Supplemental Material).
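The $\hat{R}$ check described above can be sketched as follows. This is a minimal illustration of the split-chain form of the statistic described in Gelman et al. (2014); the function name `split_rhat` and the simulated chains are assumptions made for illustration, not the authors' code, and in practice the diagnostic reported by the sampling software would normally be used.

```python
import random
from statistics import fmean, variance

def split_rhat(chains):
    """Split-R-hat for one scalar parameter; `chains` is a list of equal-length lists of draws."""
    half = len(chains[0]) // 2
    # Split each chain in two so that drift within a chain also inflates the statistic.
    parts = [c[:half] for c in chains] + [c[half:2 * half] for c in chains]
    n = half
    part_means = [fmean(p) for p in parts]
    within = fmean([variance(p) for p in parts])   # W: average within-chain variance
    between = n * variance(part_means)             # B: n times the variance of chain means
    var_plus = (n - 1) / n * within + between / n  # pooled estimate of the posterior variance
    return (var_plus / within) ** 0.5

rng = random.Random(0)
# Four hypothetical chains that all target the same distribution.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
# Four hypothetical chains stuck around different values.
stuck = [[rng.gauss(float(j), 1.0) for _ in range(1000)] for j in range(4)]

rhat_mixed = split_rhat(mixed)
rhat_stuck = split_rhat(stuck)
print(round(rhat_mixed, 3), round(rhat_stuck, 3))
```

Values close to 1 indicate that between-chain and within-chain variability agree; the 1.1 cutoff used in the text flags the second set of chains but not the first.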

In contrast to zero non-convergence in HMC, CFA exhibited some notable convergence issues (Table S2). Across 864 scenarios varying in sample size (N = 25, 50, 100), factor correlation (r = .3, .5, .7), and response distribution (non-skewed vs. skewed, sk = 0.25, 0.5, 1), non-convergence was concentrated in small samples (N = 25), particularly under severe skewness (up to 28% at r = .3, 26% at r = .5, and 22% at r = .7). By N = 100, convergence failures were nearly eliminated (<0.3%). Although higher r slightly reduced failure rates, skewness effects remained substantial at N = 25. These results indicate that CFA is prone to instability in small, skewed samples but stabilizes rapidly with moderate sample sizes (N ≥ 50).

Table 1 summarizes the recommended number of experts needed to keep the MSEs of both $\hat{ρ}$ and $\hat{σ}$ below 0.01.

Table 1.

Number of Experts for Various Instrument Designs

Scenario	N = 25	N = 50	N = 100
2 dimensions, 5 items per dimension	K = 12	K = 9	K = 6
2 dimensions, 10 items per dimension	K = 12	K = 9	K = 6
3 dimensions, 5 items per dimension	K = 12	K = 9	K = 6
3 dimensions, 10 items per dimension	K = 12	K = 9	K = 9
4 dimensions, 5 items per dimension	K = 12	K = 9	K = 6
4 dimensions, 10 items per dimension	K = 12	K = 9	K = 9

Comparisons of Sample Size $N$ with $K$ Experts Versus $2 N$ Using Non-Informative Priors

The average MSEs of $\hat{ρ}$ serve as the primary measure of estimation accuracy. Figure 3 compares the average MSEs of $\hat{ρ}$ for a 2-dimension, 5-item-per-dimension instrument under expert-informed priors versus non-informative priors. The 3D plots show a curved surface (MSE as a function of sample size N and number of experts $K$ ) alongside flat surfaces for non-informative priors at $N = 50, 100$ , and $200$ . The intersection points shown in the corresponding 2D contour plot indicate the number of experts required for a given $N$ to match the accuracy of a sample size $2 N$ with non-informative priors.

Figure 3.

Comparisons of average MSEs of $\hat{ρ}$ : sample size $N$ with $K$ experts versus sample size $2 N$ using non-informative priors for a 2-dimension, 5-item-per-dimension, 4-response-category instrument. (A) 3D surface and (B) Contour.

Reciprocal MSE Modeling: Contributions of Factors

Estimation accuracy, measured by MSE, is known to be inversely proportional to sample size (MSE ∝ 1/N, or equivalently 1/MSE ∝ N). Rather than modeling MSE directly, which produces a nonlinear relationship as shown in Figure 3, we examined its reciprocal (1/MSE), which reveals a simple and additive linear relationship with number of experts (K) and sample size (N), as illustrated in Figure 4.

Figure 4.

Estimation accuracy of item–trait correlations ( $\hat{ρ}$ ) and between-trait correlations ( $\hat{σ}$ ) under HMC. Panels A and B show 1/(average MSE of $\hat{ρ}$ ) as a function of number of experts ( $K$ ) and sample size ( $N$ ) under different numbers of response categories ( $C$ ), while Panels C and D show 1/(average MSE of $\hat{σ}$ ). Accuracy improves additively with both $N$ and $K$ .

To further investigate the relative importance of the simulation factors, we conducted a forward selection regression analysis using 1/MSE of $\hat{ρ}$ and $\hat{σ}$ from HMC as dependent variables. Candidate predictors included all simulation design factors (N, K, V, J_v, C, skewness, and r, each treated as continuous), an indicator for whether expert information was incorporated ( $I (K > 0)$ ), and an indicator for whether response was polytomous (I(C >2)). Both main effects and two-way interactions were considered. At each step, the predictor that most improved model fit (as measured by Akaike Information Criterion [AIC]) was added, yielding final models that quantify both the variance explained (R²) and the direction and size of effect for each factor (Table 2).
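The forward selection procedure described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the data are synthetic (generated here from two of the Table 2 coefficients plus noise, with a pure-noise candidate named `junk`), only main effects are considered, and the helper names `solve`, `aic`, and `forward_select` are hypothetical. It is not the authors' analysis code.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def aic(columns, y):
    """Gaussian AIC (up to an additive constant) of an OLS fit with an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * y[i] for i, row in enumerate(design)) for j in range(p)]
    beta = solve(xtx, xty)
    rss = sum((y[i] - sum(b * v for b, v in zip(beta, design[i]))) ** 2 for i in range(n))
    return n * math.log(rss / n) + 2 * p

def forward_select(candidates, y):
    """Add, one at a time, the predictor that lowers AIC most; stop when none does."""
    chosen, best = [], aic([], y)
    while True:
        trials = {name: aic([candidates[c] for c in chosen + [name]], y)
                  for name in candidates if name not in chosen}
        if not trials:
            break
        name = min(trials, key=trials.get)
        if trials[name] >= best:
            break
        chosen.append(name)
        best = trials[name]
    return chosen

rng = random.Random(1)
# Synthetic design grid: 3 sample sizes x 5 expert counts x 20 replications.
rows = [(n, k) for n in (25, 50, 100) for k in (0, 3, 6, 9, 12) for _ in range(20)]
candidates = {
    "N": [float(n) for n, _ in rows],
    "K": [float(k) for _, k in rows],
    "junk": [rng.gauss(0.0, 1.0) for _ in rows],
}
y = [0.57 * n + 6.43 * k + rng.gauss(0.0, 5.0) for n, k in rows]  # toy 1/MSE values
order = forward_select(candidates, y)
print(order)
```

In this toy grid the expert count enters first and the sample size second, because those two terms carry almost all of the variance in the simulated response.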

Table 2.

Forward Selection Results on Estimation Accuracy of $\hat{ρ}$ and $\hat{σ}$ Under HMC

HMC: 1/average MSE( $\hat{ρ}$ ) Forward selection				Final model: 1/MSE( $\hat{ρ}$ ) ~ K + N + I(C > 2) + Skew + N × I(C > 2) + I(K > 0)
Step	AIC	$R^{2}$	$Δ R^{2}$	Term added	Estimate	Standard error	t-Statistic	p-Value
0	54,267.5	0	—	(Intercept)	−8.13	0.603	−13.5	<.0001
1	51,295.3	.437	0.437	K	6.43	0.036	179.0	<.0001
2	45,582.2	.813	0.376	N	0.57	0.008	72.3	<.0001
3	42,152.9	.903	0.091	I(C >2)	6.13	0.598	10.2	<.0001
4	40,096.4	.935	0.032	Skew	21.82	0.33	66.1	<.0001
5	38,545.1	.952	0.017	N × I(C >2)	0.44	0.009	48.2	<.0001
6	37,256.2	.963	0.011	I(K >0)	15.32	0.4	38.3	<.0001
HMC: 1/average MSE( $\hat{σ}$ ) Forward selection				Final model: 1/MSE( $\hat{σ}$ ) ~ K + N + r + K × r + C + Jv
Step	AIC	$R^{2}$	$Δ R^{2}$	Term added	Estimate	Std. error	t-Statistic	p-Value
0	54,390.0	0	—	(Intercept)	−44.59	1.940	−23.0	<.0001
1	51,552.3	.422	0.422	K	1.32	0.213	6.2	<.0001
2	48,900.3	.653	0.232	N	0.71	0.009	80.4	<.0001
3	47,292.2	.746	0.093	r	22.97	2.741	8.4	<.0001
4	46,643.6	.776	0.03	K × r	11.73	0.406	28.9	<.0001
5	46,093.8	.799	0.023	C	6.18	0.246	25.1	<.0001
6	45,671.2	.814	0.016	Jv	2.31	0.110	21.0	<.0001

Note. Sample size (N) and number of experts (K) were the dominant predictors of estimation accuracy. Other design and data features (skewness, number of items, response categories, true correlation, and the presence of any experts) were statistically significant but explained only modest additional variance. HMC = Hamiltonian Monte Carlo; MSE = mean squared error.

The forward selection results show that sample size ( $N$ ) and number of experts ( $K$ ) are the dominant factors influencing estimation accuracy for both ρ and σ, while other factors (e.g., skewness, number of response categories, between-trait correlation, items per dimension) were statistically significant but explained only a small amount of additional variance. For ρ, the regression equation can be summarized as $1/\text{MSE} (\hat{ρ}) = (0.57 + 0.44 I (C > 2)) N + 15.3 I (K > 0) + 6.4 K + 6.1 I (C > 2) + 21.8 \text{Skew} + μ (\text{other factors})$ , with an overall R² = .963, of which N and K together accounted for 81% of the total variance and other factors 15% (Table 2). For σ, the corresponding equation is $1/\text{MSE} (\hat{σ}) = 0.71 N + (1.3 + 11.7 r) K + 23 r + 6.18 C + 2.3 J_{v} + μ (\text{other factors})$ , with R² = .814, where N and K explained 65%, r explained 9%, K × r 3%, C 2.3%, and $J_{v}$ 1.6%.

Based on the ratios of the coefficients, we derive equivalences between the number of experts and sample size. For item–trait correlations $ρ$ , $N_{eff} (K) = 27 + 11.2 K$ when the response is dichotomous (C = 2) and $N_{eff} (K) = 15 + 6.3 K$ when the response is polytomous (C > 2), for $K > 0$ . For between-trait correlations $σ$ , $N_{eff} (K) = (1.8 + 16.5 r) K$ , that is, $N_{eff} (K) = 6.8 K$ if r = .3, 10K if r = .5, and 13.4K if r = .7. For example, when responses are polytomous (C > 2) and r = .5, moving from no experts (K = 0) to any expert input yields a one-time gain equivalent to about 15 subjects for ρ, and each expert, including the first, contributes information comparable to about 6 subjects for ρ and 10 subjects for σ (holding other factors constant).
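The expert-to-subject equivalences above are ratios of Table 2 coefficients, and the arithmetic can be reproduced with a short sketch. The function and dictionary names below are hypothetical; only the numeric coefficients come from Table 2.

```python
# Coefficients copied from Table 2 (HMC models for 1/MSE).
RHO = {"N": 0.57, "N_x_poly": 0.44, "K": 6.43, "any_expert": 15.32}
SIGMA = {"N": 0.71, "K": 1.32, "K_x_r": 11.73}

def n_eff_rho_terms(polytomous):
    """Return (one-time gain, per-expert gain) for rho, both in units of subjects."""
    per_subject = RHO["N"] + (RHO["N_x_poly"] if polytomous else 0.0)
    return RHO["any_expert"] / per_subject, RHO["K"] / per_subject

def n_eff_sigma_per_expert(r):
    """Subjects that one expert is worth for sigma at between-trait correlation r."""
    return (SIGMA["K"] + SIGMA["K_x_r"] * r) / SIGMA["N"]

for label, poly in (("dichotomous", False), ("polytomous", True)):
    jump, slope = n_eff_rho_terms(poly)
    print(f"rho, {label}: N_eff(K) = {jump:.1f} + {slope:.1f} K")
for r in (0.3, 0.5, 0.7):
    print(f"sigma, r = {r}: N_eff(K) = {n_eff_sigma_per_expert(r):.1f} K")
```

Small differences from the rounded values quoted in the text (for example 11.3 versus 11.2) reflect rounding of the coefficients before or after division.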

In addition to the dominant roles of sample size (N) and number of experts (K), several other factors made statistically significant, though comparatively modest, contributions to estimation accuracy (Table 2). For $\hat{ρ}$ , skewness of category thresholds had a positive coefficient (+21.8), improving accuracy (ΔR² = .032). For $\hat{σ}$ , between-trait correlation (r, +23.0) made a moderate positive contribution (ΔR² = .093), suggesting that stronger correlations are easier to recover; in addition, number of response categories showed a small positive effect (+6.18; ΔR² = .023), while more items per dimension also showed a small positive effect (Jv, +2.31; ΔR² = .016). Overall, these contributions were minor compared with the dominant roles of N and K.

The full model equations in Table 2 can be used to guide study design. For example, if a precision criterion is set such that MSE( $\hat{ρ}$ ) <0.01 and MSE( $\hat{σ}$ ) <0.01 (equivalent to requiring an RMSE <0.1, or approximately a standard error <0.1 for the correlation estimates), the regression equations can be solved for N and K to identify suitable design choices. Because these models are based on extensive simulation results, the recommended sample sizes and numbers of experts are expected to align with the operating characteristics observed in practice.
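As a hedged illustration of this kind of planning, the sketch below plugs the Table 2 coefficients (including intercepts) into both regression equations and searches for the smallest N at which both predicted MSEs fall below the target. The function names, the default arguments, and the assumption that `skew` is on the same scale as the simulation's skewness factor are illustrative choices, not part of the original analysis.

```python
def inv_mse_rho(n, k, c, skew):
    """Predicted 1/MSE of rho-hat from the Table 2 model."""
    poly = 1.0 if c > 2 else 0.0
    any_expert = 1.0 if k > 0 else 0.0
    return (-8.13 + 6.43 * k + 0.57 * n + 6.13 * poly + 21.82 * skew
            + 0.44 * n * poly + 15.32 * any_expert)

def inv_mse_sigma(n, k, r, c, items):
    """Predicted 1/MSE of sigma-hat from the Table 2 model."""
    return (-44.59 + 1.32 * k + 0.71 * n + 22.97 * r + 11.73 * k * r
            + 6.18 * c + 2.31 * items)

def min_sample_size(k, r, c, items, skew=0.0, target_mse=0.01, n_max=2000):
    """Smallest N for which both predicted MSEs are below target_mse (None if not found)."""
    threshold = 1.0 / target_mse
    for n in range(1, n_max + 1):
        if (inv_mse_rho(n, k, c, skew) > threshold
                and inv_mse_sigma(n, k, r, c, items) > threshold):
            return n
    return None

# Hypothetical design: 4 response categories, 5 items per dimension, r = .5, no skew.
for k in (6, 9, 12):
    print(k, min_sample_size(k=k, r=0.5, c=4, items=5))
```

For this hypothetical design the sketch returns 76, 46, and 15 subjects for K = 6, 9, and 12, which follows the same ordering as Table 1. Any value outside the simulated range of N = 25 to 100 is an extrapolation of the regression and should be read with caution.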

Results for RQ2: HMC Versus ADVI Comparison

Parameter Estimation Accuracy: ADVI Versus HMC

Figures 5 and 6 present the comparisons of average MSEs and squared bias of $\hat{ρ}$ and $\hat{σ}$ between ADVI and HMC for $V = 2, J_{v} = 5$ instruments. Additional comparisons for other instrument designs (Figures S16–S25) are provided in the Supplemental Material (available in the online version of this article).

Figure 5.

Comparisons of average MSEs of $\hat{ρ}$ and $\hat{σ}$ between ADVI and HMC for a 2-Dimension, 5-Item-per-dimension instrument.

Figure 6.

Comparisons of average squared bias of $\hat{ρ}$ and $\hat{σ}$ between ADVI and HMC for a 2-dimension, 5-item-per-dimension instrument.

Overall, the results show the complementary strengths of ADVI and HMC. ADVI consistently demonstrated superior performance in estimating $\hat{ρ}$ , with smaller MSEs and squared bias, while HMC was more robust for estimating $\hat{σ}$ , particularly under challenging conditions such as high correlations ( $r = 0.7$ ) and larger sample sizes. Table 3 summarizes the preferred estimation methods across various instrument designs and sample sizes, considering both average MSEs and squared biases.

Table 3.

Preferred Estimation Methods by Instrument Design and Sample Size

	N = 25		N = 50		N = 100
Scenario (dimensions × items per dimension)	$\hat{ρ}$	$\hat{σ}$	$\hat{ρ}$	$\hat{σ}$	$\hat{ρ}$	$\hat{σ}$
2 × 5	ADVI	HMC	ADVI	HMC	ADVI	HMC
2 × 10	ADVI	HMC	ADVI	HMC	ADVI	HMC
3 × 5	ADVI	HMC	ADVI	HMC	ADVI	HMC
3 × 10	ADVI	HMC	ADVI	HMC	ADVI	ADVI
4 × 5	ADVI	HMC	ADVI	HMC	ADVI	HMC
4 × 10	ADVI	ADVI	ADVI	HMC	ADVI	HMC

Note. ADVI = automatic differentiation variational inference; HMC = Hamiltonian Monte Carlo.

Computational Efficiency: ADVI Versus HMC

For each simulation using HMC, the average runtime for 4,000 iterations (2,000 burn-in and 2,000 sampling) across four chains was recorded. ADVI used an adaptive stopping criterion, terminating optimization when the relative change in ELBO fell below 0.01 (up to 3,000 iterations), using 50 samples for gradient estimation and 100 for ELBO calculation.

Figure 7 (with additional comparisons in Figures S26–S30 in the Supplemental Material in the online version of the journal) shows that while runtimes for both methods increase linearly with sample size, HMC consistently ran longer than ADVI, with the gap widening substantially as sample size, number of dimensions, or number of items per dimension increased. For example, at $N = 100$ with four dimensions and 10 items per dimension, HMC’s runtime per chain exceeded 3 min, whereas ADVI completed the process in under 1 min. Given that HMC typically uses four chains to assess convergence, its total runtime can be up to 10 times longer than that of ADVI in such cases.

Figure 7.

Comparisons of average runtime of HMC and ADVI for a 2-dimension, 5-item-per-dimension instrument.

Approximation Quality of ADVI

The KS test ( $α =$ .05, Table 4) reveals that ADVI’s approximation quality declines with larger sample sizes and higher between-dimension correlations, likely because of its mean-field assumption. Even under the best conditions, ADVI adequately approximates only about 20% of parameters (at most .22 in Table 4). Figure 8 shows that for a representative simulation ( $N = 50, r = 0.5$ , skewed data, two dimensions, 10 items per dimension), ADVI’s density aligns with HMC in central regions but fails to capture dependencies and tail behavior, resulting in a low KS-test pass rate (9.5%). Because the primary goal of validity evidence is not to perfectly approximate the entire posterior distribution, the approximation quality of ADVI is acceptable when it is used for exploratory purposes as a computationally efficient alternative to HMC.
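A two-sample KS comparison of this kind can be sketched as follows. The draws here are simulated stand-ins (a normal "HMC" sample and a too-narrow "ADVI" sample for a single parameter), the p-value uses the asymptotic Kolmogorov series, and the rule that a parameter counts as well approximated when p > .05 is an assumption about how the threshold in Table 4 is applied. The function names are hypothetical.

```python
import math
import random

def ks_statistic(a, b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    gap = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(a) - j / len(b)))
    return gap

def ks_pvalue(d, n, m, terms=100):
    """Asymptotic two-sided p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.04:  # the series is numerically equal to 1 in this range
        return 1.0
    tail = 2.0 * sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                     for k in range(1, terms + 1))
    return min(1.0, max(0.0, tail))

rng = random.Random(0)
hmc_draws = [rng.gauss(0.6, 0.10) for _ in range(2000)]    # stand-in for HMC draws
advi_narrow = [rng.gauss(0.6, 0.06) for _ in range(2000)]  # same centre, too little spread

d = ks_statistic(hmc_draws, advi_narrow)
p = ks_pvalue(d, len(hmc_draws), len(advi_narrow))
well_approximated = p > 0.05
print(round(d, 3), well_approximated)
```

The example mirrors the pattern described for Figure 8: the two samples share a centre, yet the test rejects equality because the narrower sample misses the tails.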

Table 4.

Proportion of Parameters Well-Approximated by ADVI Based on Kolmogorov–Smirnov Test Results (Threshold = 0.05)

Scenario	$V = 2;$ $J_{v} = 5$	$V = 2;$ $J_{v} = 10$	$V = 3;$ $J_{v} = 5$	$V = 3;$ $J_{v} = 10$	$V = 4;$ $J_{v} = 5$	$V = 4;$ $J_{v} = 10$
r = .3
N = 25
Non-skewed	.16	.17	.17	.14	.14	.13
Skewed	.20	.21	.22	.19	.18	.16
N = 50
Non-skewed	.08	.10	.08	.10	.08	.10
Skewed	.09	.11	.08	.10	.07	.11
N = 100
Non-skewed	.03	.07	.05	.11	.03	.08
Skewed	.03	.09	.05	.10	.03	.08
r = .5
N = 25
Non-skewed	.14	.16	.13	.12	.10	.11
Skewed	.16	.18	.17	.15	.13	.14
N = 50
Non-skewed	.07	.08	.06	.09	.06	.08
Skewed	.08	.10	.07	.08	.06	.09
N = 100
Non-skewed	.04	.07	.05	.10	.04	.06
Skewed	.03	.08	.04	.09	.03	.07
r = .7
N = 25
Non-skewed	.09	.13	.10	.10	.10	.10
Skewed	.12	.16	.13	.13	.13	.12
N = 50
Non-skewed	.06	.07	.06	.07	.06	.06
Skewed	.06	.08	.06	.06	.06	.07
N = 100
Non-skewed	.04	.06	.03	.08	.04	.04
Skewed	.03	.07	.04	.08	.03	.06

Note. For skewed scenarios, $sk = 0.50$ . ADVI = automatic differentiation variational inference.

Figure 8.

Comparison of posterior distributions: HMC versus ADVI (one simulation with $N$ = 50, $r$ = .5, skewed data, two dimensions, 10 items per dimension).

Empirical Study: Application to TIMSS Grade 8 Mathematics Student Questionnaire

Having established the performance of our Bayesian parameter estimation methods (HMC and ADVI) in simulated data, we applied these methods to real data from the 2019 TIMSS Grade 8 Mathematics Student Questionnaire (TIMSS & PIRLS International Study Center, 2019). This application serves to validate our proposed methods in a real-world context, where factors such as sample variability, missing data, and expert priors introduce additional complexities.

TIMSS 2019 is a large-scale international assessment of mathematics and science at Grades 4 and 8, administered to nationally representative samples across 64 countries and 8 benchmarking participants. Typical national samples included ~4,000 students per grade drawn from ~150 to 200 schools, resulting in approximately 330,000 Grade 4 and 250,000 Grade 8 students worldwide. Alongside the mathematics and science tests, TIMSS 2019 collects student questionnaire data covering attitudinal, behavioral, and background variables.

Data and Model Specification

The 2019 TIMSS Grade 8 Mathematics Student Questionnaire includes a wide range of background and attitudinal items designed to capture students’ engagement with mathematics learning. In particular, we focused on 27 Likert-type items (with four response categories) that measure three key dimensions of engagement: (a) confidence in learning mathematics—belief in one’s ability to succeed in mathematics; (b) enjoyment of learning mathematics—general interest and positive attitude toward mathematics; (c) value of learning mathematics—perceived importance of mathematics for future academic and career opportunities.

To reflect real-world classroom contexts, we randomly selected 10 small samples, each consisting of approximately 25 U.S. students, from the 2019 TIMSS dataset. Based on our previous findings, we gathered expert ratings from 12 eighth-grade mathematics teachers to serve as prior knowledge in our Bayesian approach. These teachers, recruited from various schools in Boston and Texas, evaluated each item’s relevance to its intended dimension using a four-point scale.

Parameter Estimation Results

For the CFA benchmark, we used the full U.S. Grade 8 sample of TIMSS 2019, which consists of 8,698 students who completed the mathematics student questionnaire. Given this large sample size, CFA estimates are stable and can reasonably be treated as “true” parameter values for benchmarking the performance of our Bayesian estimation methods. The item-to-dimension correlations were generally high, with most exceeding .80, confirming the strong construct validity of the TIMSS instrument. When applying our Bayesian estimation methods to the small samples, results (Table 5) corroborate our earlier findings for instruments with three dimensions and 10 items per dimension:

CFA produced out-of-bound estimates when sample sizes were as small as 25.

ADVI outperformed HMC in estimating item-to-dimension correlations $\hat{ρ}$ . However, for between-dimension correlation $\hat{σ}$ , neither method dominates, with ADVI and HMC each performing better in different cases.

All HMC chains mixed well when priors derived from the 12 experts’ ratings were used.

Table 5.

Comparison of MSEs Across Three Methods for Estimating $\hat{ρ}$ and $\hat{σ}$ in 10 Randomly Selected Class Samples

			MSE of $\hat{ρ}$			MSE of $\hat{σ}$
Class ID	Sample size	Valid cases	CFA	HMC	ADVI	CFA	HMC	ADVI
510204	30	25	.022	.010	.009	.152	.040	.032
505211	30	23	.029	.011	.010	.017	.006	.007
501601	27	26	.019^a	.016	.014	.045	.022	.017
502807	27	23	.039^a	.014	.013	.020	.008	.008
518101	27	22	.071	.010	.009	.026	.008	.009
500414	26	22	.036^a	.014	.012	.136	.027	.024
501404	25	24	.019^a	.010	.009	.043	.012	.014
515910	25	24	.033^a	.012	.011	.002	.004	.005
504207	24	22	.036^a	.015	.013	.109	.013	.011
500104	23	22	.041	.012	.011	.038	.021	.018

Note. ADVI = automatic differentiation variational inference; CFA = confirmatory factor analysis; HMC = Hamiltonian Monte Carlo; MSE = mean squared error.

^a Indicates that out-of-bound $\hat{ρ}$ estimates occurred when using CFA.

A notable challenge arose in setting expert-informed priors. Cohen’s (1988) conventional cut-points were insufficient for capturing the high correlations observed in the real data. Even when all 12 experts rated an item as “Highly relevant” (i.e., a rating of 4), the prior mean for $ρ$ remained only .75, which was substantially lower than the true correlations exceeding .90. To address this issue, we adjusted the prior means of $ρ$ so that a “Highly relevant” rating corresponded to a correlation of .90. Additionally, we made a slight shift to the prior mean of $σ$ to better match the observed values.

This adjustment is reasonable for the following reasons. First, Cohen’s original cut-points are intentionally broad; the interval for “Highly relevant” spans ρ ∈ (0.50, 1.0), which does not align well with the narrow range of high correlations observed here. Second, the TIMSS 2019 questionnaire is a production-quality instrument that has undergone extensive development and screening, leaving items with consistently strong item–trait correlations (often near .90). In contrast, during an instrument development stage, where weaker items remain, Cohen’s thresholds may be more appropriate. Third, a natural long-term solution may be to refine Cohen’s scale by introducing a fifth category, distinguishing between “Highly relevant” (e.g., ρ ∈ (0.50, 0.80)) and “Extremely relevant” (e.g., ρ ∈ (0.80, 1.0)). Such an extension would better accommodate mature instruments while preserving applicability in early-stage development contexts.

Discussion and Conclusion

This study introduced a novel Bayesian framework for establishing validity evidence in multi-unidimensional instruments under small-sample conditions. By integrating expert-informed priors into a Bayesian parameter estimation procedure—and leveraging advanced sampling techniques such as HMC and ADVI—we addressed the limitations of traditional ordinal CFA when sample sizes are restricted. Extensive simulation studies and an empirical application using TIMSS Grade 8 Mathematics data provided critical insights into our approach.

We first demonstrated that incorporating expert-informed priors significantly improves parameter estimation accuracy in small-sample settings. Traditional CFA struggled under these conditions, often producing out-of-bound estimates and high MSEs. By contrast, our Bayesian approach, incorporating priors from $K$ experts, consistently outperformed CFA, particularly in estimating item-to-dimension correlations ( $\hat{ρ}$ ) when the sample size was $N \leq 50$ . Table 1 suggests the following guidelines for the number of experts needed to keep the MSEs of both $\hat{ρ}$ and $\hat{σ}$ below 0.01: for $N = 25$ , at least 12 experts are required; for $N = 50$ , 9 experts suffice; for $N = 100$ , 6 experts are generally adequate, although designs with three or four dimensions and 10 items per dimension call for 9. Beyond these empirical guidelines, the regression models for 1/MSE presented in Table 2 provide additional insight into factors influencing estimation accuracy. These models not only quantify the relative importance of sample size and expert input but also offer a practical tool for study planning, allowing researchers to anticipate precision levels and determine appropriate combinations of subjects and experts.

Next, we compared HMC and ADVI with respect to estimation accuracy, computational efficiency, and approximation quality. Table 3 shows that ADVI consistently performs better than HMC for item-to-dimension correlations ( $\hat{ρ}$ ) across various sample sizes and instrument designs, while HMC provides more accurate estimates for between-dimension correlations ( $\hat{σ}$ ), particularly when $N \leq 50$ . This trend also emerged in the TIMSS empirical study (Table 5), where ADVI surpassed HMC in estimating $\hat{ρ}$ . Neither method, however, displayed a clear advantage for σ in the empirical data (Figure S20 in the online version of the journal), likely due to real-world factors such as prior biases and the limited granularity of the 1 to 4 expert rating scale—which contributed to higher MSEs relative to simulations.

When comparing runtimes, ADVI showed a pronounced advantage in computational efficiency, requiring far less processing time than HMC. This gap widened as sample size, dimensionality ( $V$ ), or the number of items per dimension ( $J_{v}$ ) increased. Although ADVI’s mean-field assumption led to low KS test pass rates for equivalence with HMC (Table 4), it still captured central tendencies effectively (Figure 8). Given that the primary goal of validation is to provide actionable insights rather than perfectly replicate the entire posterior distribution, ADVI’s ability to recover the central region of the HMC posterior makes it an efficient option for exploratory analyses.

When deciding between HMC and ADVI, researchers should consider both the goals of their analysis and practical constraints. ADVI is generally the better choice when estimating item-to-dimension correlations ( $\hat{ρ}$ ), particularly in large-scale studies where computational efficiency is a priority. Its scalability makes it ideal for handling large $N$ , high dimensionality, or limited computing resources. HMC, on the other hand, provides a more accurate estimation of between-dimension correlations ( $\hat{σ}$ ), especially for smaller samples ( $N \leq 50$ ) or when capturing parameter dependencies and tail behavior is critical. If computational resources and time are not constraints, researchers can benefit from using both methods: ADVI for an efficient estimation of $\hat{ρ}$ and HMC for a more accurate estimation of $\hat{σ}$ and full posterior distributions. This combined approach leverages the strengths of each method, providing both computational efficiency and a comprehensive understanding of the underlying parameters.

Limitations and Future Research

Despite the promising results and practical contributions of this research, several limitations and potential directions for future work warrant discussion.

First, this study focused on multi-unidimensional instruments with up to four dimensions and 10 items per dimension. Although these designs represent many practical assessment scenarios, they do not encompass more complex structures (e.g., multidimensional or hierarchical models). Extensions of the current Bayesian approach to accommodate such advanced measurement frameworks could provide deeper insights into its versatility and performance.

Second, a key feature of our Bayesian approach is the incorporation of expert-informed priors. While the inclusion of expert knowledge can greatly enhance parameter estimation for small samples, the quality and consistency of these expert ratings proved critical. Our simulations and empirical findings both revealed that coarse rating scales (e.g., 1–4) often fail to capture high correlations reliably, leading to biases. Future research could explore the use of more granular rating scales (e.g., 1–5 or continuous sliders) and provide additional training to improve rater accuracy and consistency.

Third, while HMC remains the “gold standard” for capturing intricate posterior dependencies, its computational overhead rises rapidly as the number of dimensions, the number of items, or the sample size increases. Investigating more efficient sampling algorithms could help strike a balance between accuracy and runtime.

Lastly, although ADVI demonstrated significant speed advantages and performed well for item-level parameters ( $ρ$ ), its reliance on the mean-field assumption limited its ability to capture complex multimodal posterior distributions and inter-parameter dependencies, particularly for highly correlated latent dimensions. Future research could explore more sophisticated variational families, such as full-rank approximations or normalizing-flow-based methods, to improve the quality of the posterior estimates obtained from variational inference. In addition, methods such as Stein variational gradient descent (Liu & Wang, 2016) offer a promising direction, as they avoid the mean-field assumption by employing a particle-based approach.

Supplemental Material

Supplemental material, sj-docx-1-jeb-10.3102_10769986251393420 for A Bayesian Framework to Establish Validity Evidence for Multi-Unidimensional Instruments With Small Samples by Jihang Chen and Zhushan Li in Journal of Educational and Behavioral Statistics

Footnotes

Appendix A: Hamiltonian Monte Carlo

Hamiltonian Monte Carlo (HMC) simulates Hamiltonian dynamics to explore posterior distributions efficiently. The Hamiltonian function is defined as

H (z, v) = U (z) + K (v),

where $U (z) = - \log p (z)$ represents the potential energy derived from the posterior distribution, and $K (v) = \frac{1}{2} v^{T} M^{- 1} v$ is the kinetic energy, assuming a Gaussian momentum distribution.

The evolution of the system follows Hamilton’s equations:

\frac{dv}{dt} = - \frac{\partial U}{\partial z}, \frac{dz}{dt} = \frac{\partial K}{\partial v} .

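To numerically approximate these continuous dynamics, HMC employs the leapfrog method, which for a step size $ε$ alternates half-step momentum updates with a full-step position update:

v^{(t + 1/2)} = v^{(t)} - \frac{ε}{2} \frac{\partial U}{\partial z} (z^{(t)}),

z^{(t + 1)} = z^{(t)} + ε M^{- 1} v^{(t + 1/2)},

v^{(t + 1)} = v^{(t + 1/2)} - \frac{ε}{2} \frac{\partial U}{\partial z} (z^{(t + 1)}) .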

After multiple steps, the proposed state ( $z^{*}, v^{*}$ ) is accepted or rejected using the Metropolis-Hastings criterion:

α = \min \left(1, \frac{p (z^{*}, v^{*})}{p (z^{(n)}, v^{(n)})}\right),

where $p (z, v) \propto \exp (- H (z, v))$ . If a uniform random variable $u \sim \text{Uniform} (0, 1)$ satisfies $u < α$ , the new state is accepted; otherwise, the current state is retained.
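The sampler described in this appendix can be sketched for a one-dimensional target. The sketch assumes a standard normal target, so that $U (z) = z^{2}/2$ up to a constant, and a unit mass matrix ( $M = 1$ ); the step size, number of leapfrog steps, and function names are illustrative choices, and production software adds adaptation that is omitted here.

```python
import math
import random

def potential(z):
    """U(z) = -log p(z) for a standard normal target, up to an additive constant."""
    return 0.5 * z * z

def grad_potential(z):
    return z

def leapfrog(z, v, step, n_steps):
    """Half-step momentum update, alternating full steps, final half-step momentum update."""
    v -= 0.5 * step * grad_potential(z)
    for i in range(n_steps):
        z += step * v                      # position update (M = 1)
        if i < n_steps - 1:
            v -= step * grad_potential(z)  # two adjacent half steps merged into one
    v -= 0.5 * step * grad_potential(z)
    return z, v

def hmc(n_draws, step=0.2, n_steps=10, seed=0):
    rng = random.Random(seed)
    z, draws = 0.0, []
    for _ in range(n_draws):
        v = rng.gauss(0.0, 1.0)            # resample the momentum
        z_new, v_new = leapfrog(z, v, step, n_steps)
        h_old = potential(z) + 0.5 * v * v
        h_new = potential(z_new) + 0.5 * v_new * v_new
        # Accept with probability min(1, exp(H_old - H_new)).
        if rng.random() < math.exp(min(0.0, h_old - h_new)):
            z = z_new
        draws.append(z)
    return draws

draws = hmc(5000)
mean = sum(draws) / len(draws)
var = sum((d - mean) ** 2 for d in draws) / (len(draws) - 1)
print(round(mean, 2), round(var, 2))
```

Because the leapfrog integrator nearly conserves the Hamiltonian, almost every proposal is accepted at this step size, and the sample mean and variance should land near 0 and 1.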

Appendix B: Automatic Differentiation Variational Inference

Automatic differentiation variational inference (ADVI) approximates the posterior $p (θ | y)$ using a variational distribution $q (θ | ϕ)$ , where $ϕ$ represents variational parameters. The constrained parameters are first mapped to an unconstrained real space, $ζ = T (θ)$ , and the variational family is defined on $ζ$ .

A common choice for the variational family is a diagonal Gaussian:

q (ζ | ϕ) = \prod_{k = 1}^{d} \mathcal{N} (ζ_{k} | μ_{k}, σ_{k}^{2}),

where $μ_{k}$ and $σ_{k}^{2}$ are the mean and variance parameters of the variational distribution.

The ELBO is computed as

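\mathcal{L} (ϕ) = \mathbb{E}_{q (ζ | ϕ)} [\log p (y, T^{- 1} (ζ)) + \log | \det J_{T^{- 1}} (ζ) |] + \mathbb{H} [q (ζ | ϕ)],

where $J_{T^{- 1}} (ζ)$ is the Jacobian of the inverse transformation $T^{- 1}$ and $\mathbb{H} [q (ζ | ϕ)]$ is the entropy of the variational distribution.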

To maximize the ELBO, ADVI uses stochastic gradient ascent, with gradients estimated by Monte Carlo sampling from the variational distribution and computed by automatic differentiation.
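A minimal sketch of this optimization for a single positive parameter follows. Everything specific here is an assumption for illustration: the target is a hypothetical LogNormal(1, 0.5) density, the transformation is $T (θ) = \log θ$ , a central finite difference stands in for automatic differentiation, and a fixed step size replaces ADVI's adaptive step-size sequence. With this target the density of $ζ$ is exactly Gaussian, so the optimum is known: mean 1 and standard deviation 0.5.

```python
import math
import random

M_TRUE, S_TRUE = 1.0, 0.5  # hypothetical LogNormal target for a positive parameter theta

def log_density_unconstrained(zeta):
    """log p(T^{-1}(zeta)) + log |det J_{T^{-1}}(zeta)| for T(theta) = log(theta)."""
    theta = math.exp(zeta)  # T^{-1}(zeta)
    log_p = -math.log(theta) - (math.log(theta) - M_TRUE) ** 2 / (2 * S_TRUE ** 2)
    log_jacobian = zeta     # d theta / d zeta = exp(zeta)
    return log_p + log_jacobian

def derivative(f, x, h=1e-5):
    """Central finite difference, used here in place of automatic differentiation."""
    return (f(x + h) - f(x - h)) / (2 * h)

def advi(n_iter=2000, n_grad_samples=50, rate=0.05, seed=0):
    rng = random.Random(seed)
    mu, omega = 0.0, 0.0    # variational mean and log standard deviation
    for _ in range(n_iter):
        g_mu = g_omega = 0.0
        for _sample in range(n_grad_samples):
            eta = rng.gauss(0.0, 1.0)
            zeta = mu + math.exp(omega) * eta  # reparameterized draw from q
            slope = derivative(log_density_unconstrained, zeta)
            g_mu += slope
            g_omega += slope * eta * math.exp(omega)
        mu += rate * g_mu / n_grad_samples
        omega += rate * (g_omega / n_grad_samples + 1.0)  # +1 is the entropy gradient
    return mu, math.exp(omega)

mu_hat, sd_hat = advi()
print(round(mu_hat, 2), round(sd_hat, 2))
```

The 50 draws per gradient estimate match the setting reported for the simulations; the remaining settings are arbitrary.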

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Zhushan Li

Notes

Authors

JIHANG CHEN received his PhD from Boston College, Department of Measurement, Evaluation, Statistics, and Assessment, Campion Hall, 140 Commonwealth Avenue, Chestnut Hill, MA 02467; e-mail: jihang@bc.edu. His research interests include Bayesian methods and educational measurement.

ZHUSHAN LI is an associate professor at Boston College, Department of Measurement, Evaluation, Statistics, and Assessment, Campion Hall 336A, 140 Commonwealth Avenue, Chestnut Hill, MA 02467; e-mail: zhushan.li@bc.edu. Her research interests are in item response theory, latent variable models, estimation, and Bayesian methods, with applications to education and health.

References

Andrich

(1978). Scaling attitude items constructed and scored in the Likert tradition. Educational and Psychological Measurement, 38(3), 665–680.

Brown

T. A.

(2006). Confirmatory factor analysis for applied research. Guilford Press.

Burns

Grove

S. K.

(2010). Understanding nursing research-eBook: Building an evidence-based practice. Elsevier Health Sciences.

Cohen

(1988). Statistical power analysis for the behavioral sciences. Routledge.

Crocker

Algina

(1986). Introduction to classical and modern test theory. Holt, Rinehart and Winston.

DeVellis

R. F.

(1991). Scale development: Theory and applications. Sage.

Duane

Kennedy

A. D.

Pendleton

B. J.

Roweth

(1987). Hybrid Monte Carlo. Physics Letters B, 195(2), 216–222.

Ellis

Jablonski

Levy

Mansfield

(2009). High school science performance assessments: An examination of instruments for Massachusetts. Education Development Center.

Flora

D. B.

Curran

P. J.

(2004). An empirical evaluation of alternative methods of estimation for confirmatory factor analysis with ordinal data. Psychological Methods, 9(4), 466.

10.

Fox

J. P.

(2010). Bayesian item response modeling: Theory and applications. Springer.

11.

Gajewski

B. J.

Coffland

Boyle

D. K.

Bott

Price

L. R.

Leopold

Dunton

(2011). Assessing content validity through correlation and relevance tools. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8, 81–96.

12.

Gajewski

B. J.

Price

L. R.

Coffland

Boyle

D. K.

Bott

M. J.

(2013). Integrated analysis of content and construct validity of psychometric instruments. Quality & Quantity, 47, 57–78.

13.

Garrard

Price

L. R.

Bott

M. J.

Gajewski

B. J.

(2015). A novel method for expediting the development of patient-reported outcome measures and an evaluation of its performance via simulation. BMC Medical Research Methodology, 15, 1–14.

14.

Gelman

Carlin

J. B.

Stern

H. S.

Dunson

D. B.

Vehtari

Rubin

D. B.

(2014). Bayesian data analysis. Chapman and Hall/CRC.

15.

Jiang

Boyle

D. K.

Bott

M. J.

Wick

J. A.

Gajewski

B. J.

(2014). Expediting clinical and translational research via Bayesian instrument development. Applied Psychological Measurement, 38(4), 296–310.

16.

Kolmogorov

A. N.

(1933). On the empirical determination of a distribution law. Giornale dell'Istituto Italiano degli Attuari, 4, 89–91.

17.

Kucukelbir

Tran

Ranganath

Gelman

Blei

D. M.

(2017). Automatic differentiation variational inference. Journal of Machine Learning Research, 18(14), 1–45.

18.

Lewandowski

Kurowicka

Joe

(2009). Generating random correlation matrices based on vines and extended onion method. Journal of Multivariate Analysis, 100(9), 1989–2001.

Linacre, J. M. (1989). Many-faceted Rasch measurement [Doctoral dissertation, The University of Chicago].

Linacre, J. M. (1994). Sample size and item calibration stability. Rasch Measurement Transactions, 7, 328.

Lindquist, Philpot, Mullis, I. V. S., & Cotter, K. E. (2017). TIMSS 2019 mathematics framework. In I. V. S. Mullis & M. O. Martin (Eds.), TIMSS 2019 assessment frameworks (pp. 13–25). TIMSS & PIRLS International Study Center, Boston College.

Liu & Wang (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. Advances in Neural Information Processing Systems, 29.

Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177–195.

Muthén, du Toit, S. H. C., & Spisic (1997). Robust inference using weighted least squares and quadratic estimating equations in latent variable modeling with categorical and continuous outcomes [Unpublished technical report]. https://www.statmodel.com/download/Article_075.pdf

Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Brooks, Gelman, G. L. Jones, & X.-L. Meng (Eds.), Handbook of Markov Chain Monte Carlo (pp. 113–162). Chapman & Hall/CRC.

Patel, M. X., Doku, & Tennakoon (2003). Challenges in recruitment of research participants. Advances in Psychiatric Treatment, 9(3), 229–238.

Pianta, R. C., Belsky, Vandergrift, Houts, & Morrison, F. J. (2008). Classroom effects on children’s achievement trajectories in elementary school. American Educational Research Journal, 45(2), 365–397.

Raczynski, K. R., Cohen, A. S., & Engelhard, G., Jr. (2015). Comparing the effectiveness of self-paced and collaborative frame-of-reference training on rater accuracy in a large-scale writing assessment. Journal of Educational Measurement, 52(3), 301–318.

R Core Team. (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org

Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21(1), 25–36.

Reckase, M. D. (2006). Multidimensional item response theory. In C. R. Rao & Sinharay (Eds.), Handbook of Statistics: Volume 26, Psychometrics (pp. 607–642). Elsevier.

Rigdon, S. E., & Tsutakawa, R. K. (1983). Parameter estimation in latent trait models. Psychometrika, 48(4), 567–574.

Rosseel (2012). Lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.

Sheng & Wikle, C. K. (2007). Comparing multiunidimensional and unidimensional item response theory models. Educational and Psychological Measurement, 67(6), 899–919.

Smirnov (1948). Table for estimating the goodness of fit of empirical distributions. The Annals of Mathematical Statistics, 19(2), 279–281.

Stan Development Team. (2023). Stan: A probabilistic programming language (version 2.32.0). https://mc-stan.org

Stone & Yumoto (2004). The effect of sample size for estimating Rasch/IRT parameters with dichotomous items. Journal of Applied Measurement, 5(1), 48–61.

Swaminathan & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7(3), 175–191.

Swaminathan & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50(3), 349–364.

TIMSS & PIRLS International Study Center. (2019). TIMSS 2019 Grade 8 Student Questionnaire. https://nces.ed.gov/timss/pdf/T19_GR8_StudentQ_USA_Questionnaire.pdf

van der Linden, W. J., & Hambleton, R. K. (Eds.). (2013). Handbook of modern item response theory. Springer.

