Abstract
This study explored the application of deep reinforcement learning (DRL) as an innovative approach to optimize test length. The primary focus was to evaluate whether the current length of the National Board of Chiropractic Examiners Part I Exam is justified. By modeling the problem as a combinatorial optimization task within a Markov Decision Process framework, an algorithm capable of constructing test forms from a finite set of items while adhering to critical structural constraints, such as content representation and item difficulty distribution, was used. The findings reveal that although the DRL algorithm was successful in identifying shorter test forms that maintained comparable ability estimation accuracy, the existing test length of 240 items remains advisable as we found shorter test forms did not maintain structural constraints. Furthermore, the study highlighted the inherent adaptability of DRL to continuously learn about a test-taker’s latent abilities and dynamically adjust to their response patterns, making it well-suited for personalized testing environments. This dynamic capability supports real-time decision-making in item selection, improving both efficiency and precision in ability estimation. Future research is encouraged to focus on expanding the item bank and leveraging advanced computational resources to enhance the algorithm’s search capacity for shorter, structurally compliant test forms.
Introduction
Balancing test length and content is critical for designing effective assessments that measure examinees’ knowledge and skills comprehensively yet efficiently (Angoff, 1953; Haberman, 2020; Kruyen et al., 2012; Şahin & Anıl, 2017; Yamamoto, 1995). Test length should be sufficient to ensure the content validity of the assessment, meaning it adequately covers the breadth and depth of the constructs being measured (Burisch, 1997; Horst, 1951; Kane & Bridgeman, 2017; Raykov & Marcoulides, 2011). A test that is too short may fail to capture the full range of competencies, leading to reduced reliability and potentially invalid conclusions about the examinee’s performance. Conversely, excessively long tests may introduce fatigue effects, compromising the validity of responses (Ackerman & Kanfer, 2009; Jensen et al., 2013). Hence, test developers must carefully consider the number of items to optimize measurement precision while maintaining alignment with the testing objectives and constraints.
The cost of administering longer exams represents a significant consideration for testing programs, as it directly impacts resource allocation and operational efficiency (Ellis, 2021). Longer assessments typically require increased time for proctoring, extended use of testing facilities, and higher costs for scoring, particularly if manual or rubric-based evaluation is involved (Harris et al., 2008; Jakee & Keller, 2017; Nelson, 2013). Moreover, they may pose logistical challenges such as scheduling conflicts and heightened examinee stress, potentially affecting test-taker engagement and performance (Hughes, 2005; Pascoe et al., 2020). These financial and operational burdens necessitate a strategic approach to test design, ensuring that the benefits of extended testing, such as enhanced construct representation, justify the associated costs and logistical complexities. Balancing these factors is essential for creating assessments that are not only psychometrically sound but also economically viable (Davey et al., 2015). Furthermore, the size of an item bank directly influences the flexibility in test development and the length of test forms (Weiss, 2013). A robust item bank enables the generation of multiple test forms and supports adaptive testing, where the difficulty of items dynamically adjusts to the test-taker’s ability level. However, creating and maintaining large item banks is resource-intensive, requiring significant investment in item development, calibration, and ongoing updates (Xing & Hambleton, 2004). Thus, optimizing test length is imperative to balance comprehensive content measurement with efficiency and practicality. Innovative approaches are urgently needed to create assessments that are psychometrically sound, economically viable, and adaptable to diverse testing needs (Svetina et al., 2019; Yasuda et al., 2021).
This study explores the application of deep reinforcement learning (DRL; Francois-Lavet et al., 2018; Mousavi et al., 2018) as a method for optimizing test length and introduces a framework for its utilization in computer adaptive testing (CAT). Using the Basic Science (Part I) testing program of the National Board of Chiropractic Examiners (NBCE), this study’s goal was to examine how DRL could be used as a test creation tool by applying it to test length optimization. By conceptualizing the problem as a combinatorial optimization task (Schrijver, 2003) and modeling it as a Markov Decision Process, the study developed an algorithm to construct tests from a finite pool of items while adhering to structural constraints, including appropriate content representation and psychometric specifications.
A Brief History
Reinforcement learning (RL) has emerged as a powerful approach to solve complex problems, particularly in the domain of combinatorial optimization. Its utility is well-illustrated through the traveling salesman problem (TSP; Hoffman et al., 2013), a classical problem that has long served as a benchmark for optimization algorithms (Agatz et al., 2018; Johnson, 1990). The TSP involves finding the shortest possible route that visits a set of cities once and returns to the starting point, making it representative of a wide range of real-world applications, such as logistics, routing, and network design (Junger et al., 1995).
The application of RL to the TSP has demonstrated significant advancements, leveraging neural networks and policy optimization techniques to achieve near-optimal solutions. Recent studies have shown that RL models, such as those employing attention mechanisms and sequence-to-sequence frameworks, can learn heuristics for TSP without relying on handcrafted features, offering generalizability to unseen instances (Kool et al., 2019). Moreover, RL approaches have been integrated with Monte Carlo Tree Search and other optimization strategies to further enhance performance and efficiency (Vinyals et al., 2015).
In recent years, RL has revolutionized the approach to solving the TSP, marking a significant shift toward data-driven, adaptive optimization. RL models leverage neural networks to learn solution heuristics directly from data, enabling them to generalize across different problem instances. Methods such as pointer networks and attention mechanisms have demonstrated the ability to produce high-quality solutions efficiently, even for large-scale problems, by dynamically adapting to the constraints and nuances of individual instances (Kool et al., 2019; Vinyals et al., 2015). Unlike traditional algorithms, RL approaches also offer the flexibility to incorporate additional constraints seamlessly, making them particularly versatile for real-world applications (Bello et al., 2016).
The evolution of algorithms for solving the TSP reflects a steady progression in computational efficiency and adaptability, driven by advancements in optimization methods and RL. In 1970, the quadratic assignment algorithm employed dynamic programming techniques to achieve one of the shortest distances calculated for the TSP at the time. This algorithm utilized the Bellman equations, a foundational approach in dynamic programming, to simplify function approximation by breaking the problem into smaller, recursive subproblems. This principle of problem decomposition is also a hallmark of modern RL algorithms (Graves & Whinston, 1970; Rahman et al., 2021; Y. Yang & Whinston, 2023).
The development of ant-Q in 1995 marked an early application of RL to the TSP, combining elements of Q-learning with ant colony optimization. This algorithm used simulated pheromone trails to learn solutions iteratively while Q-learning facilitated the recording and evaluation of policies based on the quality of actions taken (Gambardella & Dorigo, 1995). Ant-Q introduced a novel framework for comparing and evaluating solutions, making it a significant step forward in adaptive problem-solving. However, like its predecessor, the quadratic assignment algorithm, ant-Q faced limitations in scalability, particularly when applied to larger and more dynamic environments (Y. Yang & Whinston, 2023).
Neural networks enhance the ability of RL algorithms to approximate complex functions, enabling them to process larger datasets and adapt to more intricate problem spaces (Francois-Lavet et al., 2018). A notable milestone in this evolution was the 2019 application of the REINFORCE algorithm, paired with deep neural networks, to the TSP. This combination significantly reduced the computational complexity associated with solving the TSP. It outperformed both the quadratic assignment algorithm and ant-Q in handling larger problem instances, generating solution paths for a greater number of cities with enhanced accuracy and efficiency (Y. Yang & Whinston, 2023; Mazyavkina et al., 2021).
Algorithms like REINFORCE have ushered in a new era of DRL, characterized by their capacity to leverage advancements in computing power and neural network architectures. DRL algorithms are now widely recognized for their adaptability and effectiveness in solving combinatorial optimization problems beyond the TSP, making them a versatile and evolving tool in computational optimization.
RL in Education and Tests
Li et al. (2023) proposed the use of DRL to develop individualized learning plans that adaptively select the most appropriate learning materials based on a learner’s latent traits (abilities). Their approach utilized a model-free DRL algorithm, specifically the deep Q-learning algorithm, which effectively identifies an optimal learning policy from data on learners’ progress without requiring prior knowledge of the transition model for learners’ continuous latent traits. To enhance data efficiency, they incorporated a transition model estimator using neural networks to emulate the learning process. Simulation studies demonstrated that the proposed algorithm efficiently identified optimal learning policies, particularly when aided by the transition model estimator, even with limited training data from a small sample of learners.
Pian et al. (2023) developed an RL framework for automated test item selection. Their method employs RL to learn item selection algorithms in a data-driven manner, capturing implicit cognitive relationships between test items while avoiding unnecessary item administration. Unlike traditional approaches, their method does not rely on examinees’ estimated knowledge states, mitigating potential inaccuracies from imprecise estimations. The proposed approach leverages implicit cognitive process information to enhance efficiency in item selection, providing a more effective and reliable testing experience.
Xue et al. (2021) introduced a semi-supervised learning framework to correct biased item difficulty estimates in virtual learning environments. Using deep learning techniques, the authors converted observed response patterns into continuous latent traits and approximated complex continuous functions that are difficult to model mathematically. In addition, the study proposed two adjustment methods to enhance the accuracy of item parameter estimates within the semi-supervised learning framework. Simulations under the two-parameter logistic Item Response Theory model showed that the proposed framework successfully reduced biases in both student ability and item parameter estimates, thereby improving the overall accuracy of the system.
In a related study, Zhen and Zhu (2024) developed a learning framework for cheating detection in educational assessments using TabNet and other machine learning models. Their research involved a comprehensive evaluation of 12 base models, including Naive Bayes, linear discriminant analysis, Gaussian processes, support vector machines, decision trees, random forests, Extreme Gradient Boosting (XGBoost), AdaBoost, logistic regression, k-nearest neighbors, multilayer perceptrons, and TabNet. The findings revealed new insights into the potential of deep neural network models for identifying cheating in educational settings, highlighting the utility of TabNet as a robust tool for predictive accuracy and interpretability.
The RL Algorithm
RL algorithms are fundamentally inspired by the study of animal learning, particularly the groundbreaking work of Ivan Pavlov and B. F. Skinner. Pavlov’s experiments on classical conditioning demonstrated how animals could form associations between a neutral stimulus and a biologically significant event, providing early insights into the mechanisms of learning through feedback (Pavlov, 1927). By contrast, Skinner’s operant conditioning research emphasized the active role of behavior in shaping learning, introducing the concept of reinforcement through rewards and punishments (Skinner, 1938). Skinner’s work on reward schedules revealed how animals adapt their actions to maximize positive outcomes, forming the basis for many reward-driven learning models (Sutton & Barto, 2018).
The RL algorithms are characterized by five key components: the agent, environment, reward, policy, and value function (Szepesvári, 2022; Shakya et al., 2023). The agent represents the decision-making entity within the RL framework, navigating through the environment to achieve specified objectives. The environment is the structured system that provides the agent with a series of states, a set of possible actions, and corresponding rewards. At each state, the agent selects an action from its available options, adhering to predefined constraints, and receives a reward as feedback for its choice. This reward functions as a signal reflecting the immediate consequence or quality of the selected action (Qiang & Zhongli, 2011).
Based on this feedback, the agent updates its policy function, which governs the strategy for action selection in subsequent states. The policy aims to optimize the agent’s behavior to maximize cumulative rewards over time. The iterative nature of this feedback loop allows RL algorithms to learn and adapt dynamically, improving their performance as they interact with the environment. In addition, the value function serves as an evaluation metric, estimating the long-term expected rewards associated with each state or state-action pair, further guiding the agent’s decision-making process. Together, these elements form a cohesive framework enabling RL systems to effectively solve complex decision-making problems (Gosavi, 2017).
A policy, which maps states to actions, is defined by:
The best policy produces the highest possible cumulative reward throughout an episode, or single run from the initial to terminal state of the environment. The value function estimates the long-term expected reward of a given state-action pair under a policy:
These estimates are used to evaluate the quality of the decisions made by the algorithm through an episode. In the latest equation above,
Through iterative repetition and exploration, RL algorithms progressively refine their approach to identify an optimal policy. Repeatedly selecting actions deemed optimal reinforces effective behaviors while exploring alternative actions in various states enables the agent to gain a more comprehensive understanding of the environment. This iterative process incorporates temporal difference learning, a critical component of RL, which allows the agent to update its value function estimates by leveraging the difference between current estimates and those derived from subsequent states (Sutton & Barto, 2018). The improvement of the policy emerges incrementally, as these updates align the value function more closely with the observed outcomes and expected returns. Over time, the cumulative effect of these updates drives the algorithm to prioritize actions predicted to yield higher rewards. This decision-making framework is formalized within the structure of the MDP (Puterman, 1990), which underlines the mathematical foundation of RL (Wei et al., 2017).
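The temporal difference update described above can be illustrated with a minimal tabular Q-learning sketch. Everything in it is an assumption made for illustration, not the study's algorithm: the toy chain environment, the function names `q_learning_chain` and `greedy_policy`, and the values of the learning rate `alpha`, discount factor `gamma`, and exploration rate `epsilon`.

```python
import numpy as np

def q_learning_chain(n_states=5, episodes=2000, alpha=0.1, gamma=0.9,
                     epsilon=0.1, seed=0):
    """Tabular Q-learning on a hypothetical chain of states 0..n_states-1.

    Actions: 0 = move left, 1 = move right. Reaching the last state ends
    the episode with reward 1; every other step yields reward 0.
    """
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, 2))          # action-value estimates Q(s, a)
    terminal = n_states - 1
    for _ in range(episodes):
        s = 0
        while s != terminal:
            # Epsilon-greedy: explore at random, otherwise exploit
            # (ties between the two actions are also broken at random).
            if rng.random() < epsilon or q[s, 0] == q[s, 1]:
                a = int(rng.integers(2))
            else:
                a = int(np.argmax(q[s]))
            s_next = max(s - 1, 0) if a == 0 else s + 1
            r = 1.0 if s_next == terminal else 0.0
            # TD target: immediate reward plus the discounted value of the
            # best action in the next state (no bootstrap at the terminal).
            target = r if s_next == terminal else r + gamma * q[s_next].max()
            # Move the estimate a fraction alpha toward the target.
            q[s, a] += alpha * (target - q[s, a])
            s = s_next
    return q

def greedy_policy(q):
    """Action with the highest estimated value in each state."""
    return np.argmax(q, axis=1)

q = q_learning_chain()
print(greedy_policy(q)[:-1])   # greedy actions in the non-terminal states
```

The update line is the core of the method: the difference between the target and the current estimate is the temporal difference error, and repeated small corrections propagate the terminal reward backward through the chain.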
NBCE Part I Exam
The Part I examination, administered by the NBCE, serves as a foundational assessment for chiropractic students, evaluating their knowledge in core scientific disciplines integral to the practice of chiropractic care. The exam is divided into six domains: General Anatomy, Spinal Anatomy, Physiology, Chemistry, Pathology, and Microbiology (NBCE, 2024).
The exam is designed to ensure equal representation across all six domains, with an equivalent proportion of test items allocated to each. The exam contains 50 items per domain and is scored within each domain, providing six scores on a scale of 125 to 800 with a cut score set at 375 (Himelfarb et al., 2020, 2022).
Literature Review
Over the past decade, the development of efficient and psychometrically sound shorter-form assessments has gained significant attention in psychological and educational research. Traditional methods of test reduction, such as selecting items with the highest factor loadings or maximizing test information, often fail to adhere to multiple psychometric criteria required by operational testing programs. In response, recent advancements in computational methods such as structural equation modeling (SEM)-based techniques, machine learning algorithms, and tree-based adaptive classification models have provided more sophisticated solutions for scale abbreviation. Often, these approaches optimize item selection based on predefined validity criteria while maintaining measurement accuracy and structural integrity.
Recent advancements in personality research highlighted the need for shorter inventories to improve efficiency without compromising accuracy. However, few such measures exist. A study conducted by Yarkoni (2010) introduced an automated method for abbreviating personality inventories with minimal effort, making assessment more scalable. Its validity was tested across three studies, demonstrating that the method effectively preserves psychometric properties while significantly reducing test length. In one application, it generated an abbreviated inventory that accurately reproduced scores from multiple existing measures. Findings support automated abbreviation techniques as a valuable tool for streamlining personality assessment while maintaining validity and structural integrity.
Browne et al. (2018) presented an SEM-based approach that utilized the standardized residual variance–covariance matrix to integrate multiple traditional psychometric criteria, including item homogeneity and reliability, as well as convergent and discriminant validity. Using SEM models with a fixed structure, the researchers demonstrated a straightforward progressive elimination algorithm that systematically optimizes item selection across multiple psychometric criteria. This approach was then applied to the development of a short-form version of a multidimensional scale that served as an indicator of psychological vulnerability to gambling-related problems.
In a relatively recent inquiry, researchers introduced an automated genetic algorithm (GA)-based method for abbreviating psychometric instruments. In their studies, this method was applied to develop a concise 40-item version of a psychological scale. The abbreviated measure demonstrated strong convergent correlations with the original scale and outperformed an alternative measure developed using a conventional methodology (Eisenbarth et al., 2015).
Previously, researchers explored the application of the ant colony optimization (ACO) algorithm in the development of short-form psychometric scales. As a demonstration, a 22-item abbreviated version of a quality-of-life assessment tool for individuals with diabetes was constructed using data from a sample of 265 diabetes patients. In addition, a simulation study was conducted to compare the performance of the ACO algorithm with traditional item selection methods, including those based on the largest factor loadings and maximum test information criteria. The findings indicated that the ACO algorithm outperformed these conventional approaches, highlighting its efficacy in optimizing item selection for scale reduction (Leite et al., 2008).
Further research showed that various psychological instruments suffer from psychometric deficiencies, as the derived person parameters often lack a solid theoretical foundation and fail to meet established psychometric criteria. The authors noted that one approach to enhancing the psychometric properties of such instruments is through abbreviation. Their study evaluated and compared multiple techniques for shortening self-report assessments using the Trait Self-Description Inventory within a large sample of 14,347 participants. The methods examined included: maximizing reliability and main loadings, minimizing modification indices and cross-loadings, the PURIFY Algorithm in Tetrad, ACO, and GA. Among these approaches, ACO demonstrated superior performance in enhancing the model fit of short-form scales (Olaru et al., 2015).
An additional study examined the effectiveness of several automated item selection algorithms, including ACO, Tabu search, GA, and a novel implementation of the simulated annealing algorithm, using Monte Carlo simulation. The study assessed these algorithms in selecting short forms of scales with unidimensional, multidimensional, and bifactor structures, both under correctly specified and misspecified confirmatory factor analysis (CFA) models and in the presence or absence of external variables. Findings indicated that when the CFA model of the full-scale version was correctly specified or contained only minor misspecifications, all four algorithms generated short forms that retained strong psychometric properties and preserved the intended factor structure. However, under conditions of major model misspecification, the performance of all algorithms declined (Raborn et al., 2020).
Lim and Chapman (2013) noticed that existing instruments designed to assess attitudes toward mathematics have been criticized for being excessively long, outdated, or developed primarily using Western samples, limiting their generalizability. To address these limitations, a shortened version of the Attitudes Toward Mathematics Inventory (ATMI) was developed, measuring four key subscales: enjoyment of mathematics, motivation to engage in mathematics, self-confidence in mathematical abilities, and perceived value of mathematics. The psychometric properties of this abbreviated instrument were evaluated using a sample of 1,601 participants from Singapore.
The authors used CFA to confirm the original four-factor structure of the ATMI. However, within this structure, several items exhibited high intercorrelations, suggesting redundancy, so the authors performed scale reduction. The removal of the problematic items either enhanced or did not adversely affect the psychometric properties of the instrument, leading to the creation of the short version of the ATMI. The short ATMI demonstrated strong correlations with the original ATMI (mean r = .96), high internal consistency both for the overall scale (α = .93) and individual subscales (mean α = .87), and satisfactory test–retest reliability over a 1-month period (mean r = .75).
Later, McArdle (2014) explored the effectiveness of a Decision Tree Analysis (DTA) approach in the context of CAT. The underlying psychometric assumption was that if an individual’s total score is derived from a comprehensive set of test items (I), their performance on a smaller subset of items (i < I) can be used to approximate the overall test score with slightly reduced but still substantial accuracy. The author demonstrated that if this assumption holds, administering only a selected subset of items rather than the full set could significantly reduce test administration time while maintaining acceptable measurement precision.
The findings indicated that the DTA approach achieves considerably higher accuracy, with a scale reliability of
In another study, researchers applied GAs to develop a shortened version of a psychological assessment while maintaining its original multidimensional structure and psychometric integrity. The full-length instrument, though reliable, posed practical limitations due to its length. While an existing brief version was available, it condensed multiple dimensions into a single factor, limiting its applicability. To address this, a GA-based method was used to create a more efficient version that retained the original factor structure while significantly reducing administration time. Results demonstrated that the abbreviated version closely mirrored the full-length measure in terms of structural consistency, inter-correlations, and associations with key psychological constructs, making it a viable alternative for both research and clinical applications (Sahdra et al., 2016).
Finally, a recent study explored the use of machine learning techniques to develop a short, tree-based adaptive classification test from a lengthy assessment. A case study on risk assessment for juvenile delinquency highlighted key challenges, including the complexity of measuring multiple constructs and imbalanced training data due to a low prevalence of target outcomes. Traditional adaptive testing methods may be ineffective in this context, whereas decision tree models offer a promising alternative. A cross-validation study comparing eight tree-based adaptive tests to five benchmark methods found that the best-performing models achieved superior or comparable classification accuracy while drastically reducing test length (Zheng et al., 2020).
Recent advancements in statistical software have enabled the reduction of lengthy scales. The R packages GAabbreviate (Scrucca & Sahdra, 2016), ShortForm (Raborn & Leite, 2018), and GA (Scrucca, 2013) provide powerful tools for optimizing psychometric assessments and solving complex optimization problems. GAabbreviate is designed to automate the abbreviation of lengthy psychological scales using GAs, ensuring that shortened versions retain key psychometric properties while minimizing administration time. ShortForm facilitates the development of short-form scales by selecting items based on multiple validity criteria, such as model fit and relationships with external variables, utilizing ACO to optimize item selection. Meanwhile, GA offers a flexible framework for applying GAs to a wide range of optimization problems, including mathematical functions and statistical modeling.
Current Study
In the context of test development, the optimal objective is to design a valid and reliable assessment that adheres to multiple structural constraints while being constructed from a finite set of discrete test items. These constraints may include content coverage, proportional representation of item difficulty levels, and alignment with psychometric specifications such as validity, reliability, and fairness (Haladyna & Rodriguez, 2013). Furthermore, test creation is inherently sequential in nature, as the ordering of items often plays a critical role in maintaining logical flow and ensuring that the test adheres to cognitive and instructional principles (Sireci, 1998). For example, certain test frameworks require items to be presented in increasing difficulty or to group questions by domain or skill, adding a layer of complexity to the test construction process.
Modern machine learning methods offer promising solutions for addressing the complexity and constraints of test creation efficiently. Algorithms such as RL and other optimization-based approaches can be employed to dynamically select and order test items while optimizing for multiple objectives. RL, for instance, can model test creation as a sequential decision-making process, where the system learns to select the next item based on the current state of the test under construction (Wang et al., 2024). These algorithms not only account for predefined structural constraints but can also adaptively refine their selection policies through iterative learning, improving performance over time.
Moreover, machine learning approaches are particularly advantageous for large-scale assessments, where the size of item banks and the complexity of test blueprints make manual test construction infeasible. By integrating neural networks or combinatorial optimization techniques, these systems can simultaneously consider content balance, psychometric properties, and even time constraints to produce test forms that meet rigorous standards (van der Linden, 2005). Recent advancements in attention-based models and automated item selection algorithms further enhance the ability to construct optimal tests with minimal computational overhead (Kool et al., 2019).
The NBCE has recently started a revision of its Basic Sciences (Part I) exam, prompting the need to reevaluate the appropriate number of items included in the assessment. This process ensures that the exam maintains its validity and reliability by providing adequate information to accurately estimate the examinee’s ability. However, achieving this optimal set of items is a complex task, as it requires constructing numerous test forms while adhering to content, psychometric specifications, and exposure constraints established by the development team. In the process, the NBCE increased the number of annual exam administrations; therefore, we considered a possible test reduction.
For the NBCE, a shorter exam provides opportunities for the development of a greater number of diverse test forms, which can substantially enhance item exposure control. In addition, the creation of shorter, equally reliable test forms can optimize resource utilization, as fewer items per exam may allow for more efficient item bank management and streamlined test assembly processes.
For examinees, a shorter exam can have profound positive effects on the testing experience. Reducing the number of test items can help mitigate the effects of test fatigue, a phenomenon where prolonged cognitive effort leads to diminished focus, increased stress, and reduced performance accuracy, particularly during lengthy assessments (Ackerman et al., 2010; Tagher & Robinson, 2016). By shortening the exam while maintaining psychometric rigor, students can engage more consistently across all test items, yielding results that are both more accurate and representative of their true abilities.
The challenge was amplified by the increasing size of item banks and the growing complexity of constraints, making it more difficult to identify subsets of items that optimize both test precision and structural integrity. In turn, this gave the researchers an opportunity to examine RL as a promising solution due to its capacity for self-training and adaptive decision-making. In this context, the purpose of this study was to develop a DRL algorithm capable of determining, or confirming, the number of items required to estimate
Method
During the restructuring of the test, domain scores for multiple administrations of the Part I exam, involving 1,425 examinees and 240 test items, were generated using Item Response Theory (IRT)-based calibration, linking, and scoring procedures. One of the primary advantages of IRT lies in its ability to integrate examinee performance and item difficulty estimates onto a common scale. Furthermore, IRT provides a robust framework for ensuring that scores reflect not only the number of correct responses but also the complexity of the items encountered, thereby enhancing the fairness and precision of the assessment (Bock et al., 1997; Bortolotti et al., 2013). The equation for the 3PL model is given by the following:
where
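As a minimal sketch, the standard 3PL response function can be computed as shown below. The code assumes the common logistic form without the D = 1.7 scaling constant, and the names a, b, and c are the conventional labels for discrimination, difficulty, and pseudo-guessing; the notation used in the study may differ.

```python
import math


def prob_correct_3pl(theta, a, b, c):
    """Probability of a correct response under the standard 3PL model.

    theta: examinee ability; a: discrimination; b: difficulty;
    c: pseudo-guessing (lower asymptote). No D = 1.7 scaling is applied.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * logistic


# Hypothetical item: when ability equals difficulty, the probability is
# halfway between the guessing floor c and 1.
p = prob_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2)
print(round(p, 3))  # 0.6
```

The lower asymptote c keeps the probability above the guessing floor even for very low abilities, which is the feature that separates the 3PL from the one- and two-parameter models.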
Calibrating these items makes it possible to evaluate DRL both as a basis for CAT and as a general test construction tool, because the items and their parameter values support a comparison of three approaches: an implicit learning approach, a heuristic approach, and a mixed approach. The implicit and mixed approaches both used DRL. The mixed approach incorporated influences from traditional CAT systems, which explicitly use item information as an item selection criterion, whereas the implicit approach learns which items to administer through experience. The heuristic approach is stricter in its search for a shorter test form because it is rule-based rather than driven by trial and error.
Implicit Approach
In the implicit DRL algorithm, examinee j’s ability (θ) was estimated using the expected a posteriori approach (Bock & Mislevy, 1982; de Ayala, 2009), which was given by:
The equation uses Gauss-Hermite quadrature to approximate the normal distribution of the examinee’s ability. In this equation,
Corresponding standard errors
The average standard error
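The expected a posteriori (EAP) estimate and its standard error can be illustrated with the following sketch. It is not the study's implementation: for the sake of a dependency-free example it swaps the Gauss-Hermite nodes for an evenly spaced grid weighted by a standard normal density, and the grid range, number of points, and item parameters are all assumptions.

```python
import math


def prob_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def eap_estimate(responses, items, n_points=61, lo=-4.0, hi=4.0):
    """Return (EAP ability estimate, posterior SD) on an evenly spaced grid.

    responses: list of 0/1 scores; items: list of (a, b, c) tuples.
    The posterior SD plays the role of the standard error of the estimate.
    """
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + k * step for k in range(n_points)]
    weights = []
    for theta in nodes:
        prior = math.exp(-0.5 * theta * theta)  # unnormalized N(0, 1) prior
        likelihood = 1.0
        for u, (a, b, c) in zip(responses, items):
            p = prob_correct_3pl(theta, a, b, c)
            likelihood *= p if u == 1 else (1.0 - p)
        weights.append(prior * likelihood)
    total = sum(weights)
    mean = sum(t * w for t, w in zip(nodes, weights)) / total
    var = sum((t - mean) ** 2 * w for t, w in zip(nodes, weights)) / total
    return mean, math.sqrt(var)


# Hypothetical three-item response pattern.
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.9, 0.5, 0.2)]
theta_hat, se = eap_estimate([1, 1, 0], items)
print(theta_hat, se)
```

With no responses the estimate falls back to the prior mean, and the standard error shrinks as more items are scored, which is the quantity a stopping rule would monitor.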
The environment of the implicit DRL algorithm was modeled to resemble a computer-adaptive test. The initial state of the program was empty, representing the beginning of an exam. The agent's first action within the environment was to administer one random item from the item bank, which held all 240 items with their equating parameters, item index, domain indicator, and difficulty level. Thereafter, the items administered at future states were contingent on the probability assigned to them by the policy function. For every state after the first in each episode, 0s and 1s were generated to represent whether the hypothetical student answered an administered item incorrectly or correctly. These responses (
The parameters of the Binomial distributed response vector
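A simulated 0/1 response of this kind can be drawn as in the sketch below. It assumes each response is a single Bernoulli draw whose success probability comes from the 3PL model at a fixed simulated ability; the ability value, item parameters, and seed are hypothetical.

```python
import math
import random


def prob_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def simulate_response(theta, item, rng):
    """Draw a 0/1 response for one item given a simulated ability."""
    a, b, c = item
    return 1 if rng.random() < prob_correct_3pl(theta, a, b, c) else 0


rng = random.Random(0)           # seeded for reproducibility
item = (1.0, 0.0, 0.2)           # hypothetical (a, b, c)
draws = [simulate_response(0.0, item, rng) for _ in range(10000)]
print(sum(draws) / len(draws))   # close to the model probability of 0.6
```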
The algorithm was guided toward desired outcomes through a structured reward system. Positive rewards were assigned at each step when the algorithm successfully achieved the predefined domain and difficulty ratio constraints. To encourage efficiency in measurement, the algorithm was designed to minimize the number of administered items while still achieving a sufficiently low standard error for the ability estimate
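The structure of such a reward system might look like the sketch below. Every number in it is a placeholder: the tolerance, the standard error target, and the size of the bonus are hypothetical and are not taken from the study, and the function names are invented for illustration.

```python
def step_reward(domain_ratios, target_domain, diff_ratios, target_diff, tol=0.05):
    """Return 1.0 when every domain and difficulty ratio is within `tol`
    of its target, and 0.0 otherwise (hypothetical values)."""
    domain_ok = all(
        abs(domain_ratios.get(k, 0.0) - v) <= tol for k, v in target_domain.items()
    )
    diff_ok = all(
        abs(diff_ratios.get(k, 0.0) - v) <= tol for k, v in target_diff.items()
    )
    return 1.0 if (domain_ok and diff_ok) else 0.0


def terminal_reward(standard_error, n_items, se_target=0.3, bank_size=240):
    """Bonus that grows as fewer items are used, paid only when the
    standard error target is met (hypothetical scaling)."""
    if standard_error > se_target:
        return 0.0
    return 10.0 * (bank_size - n_items) / bank_size


print(terminal_reward(standard_error=0.25, n_items=120))  # 5.0
```

Separating the per-step constraint reward from the end-of-episode efficiency bonus makes it easier to see which signal the agent is responding to during training.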
The selected policy optimization method was proximal policy optimization (PPO), an RL algorithm known for its efficiency and robustness. PPO offers several advantages, including its ability to use the value function to guide policy updates through trajectories and advantage estimates, as well as its clipped objective, which approximates a trust region to keep learning stable (Schulman et al., 2017). Trajectories are the sequences of states, actions, and rewards that the agent experiences during an episode, capturing its interactions with the environment over time. Advantage estimates, derived from these trajectories, measure how much better a specific action is than the average action in a given state. This feedback enables the policy to focus on the actions that contribute most to optimal outcomes while keeping the learning process stable.
Let us consider the following:
The equation above finds estimates under policy π by taking the difference between the action-value function
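For reference, the advantage described here is conventionally written as the action-value function minus the state-value function. The expression below uses standard RL notation, which may differ from the symbols used in the study:

```latex
A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)
```

A positive value indicates that action a_t performed better than the policy's average behavior in state s_t, and a negative value indicates that it performed worse.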
The algorithm was trained for 100,000 episodes, during which trends in total reward and policy training loss were analyzed across episodes. Upon completion of training, the optimized policy was saved, and the environment was reset to its initial state (Episode 1). Using the saved policy, the algorithm was further evaluated over an additional 10,000 episodes. During these episodes, the goal was to identify subsets of test items that satisfied several key conditions: the standard error of ability estimation
Mixed Approach
In the mixed approach, the DRL algorithm kept largely the same infrastructure, with the major change being the addition of item information as a reward-shaping tool. Using the 3PL item parameters and the updated θ value at each step, obtained through Equations (4) and (5), item information values could be estimated at the current θ value:
During the first step when no items have been administered, the algorithm assumes the simulated student’s
Here,
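Item information under the 3PL model can be sketched as follows, using the standard closed form without the D = 1.7 scaling constant. The item bank and the selection step are hypothetical and only illustrate how a traditional CAT rule would rank items at the current θ.

```python
import math


def prob_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information_3pl(theta, a, b, c):
    """Fisher information of a 3PL item at ability theta (no D scaling)."""
    p = prob_correct_3pl(theta, a, b, c)
    return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


# Hypothetical bank of (a, b, c) triples; pick the most informative item
# at the current ability estimate, as a maximum-information rule would.
bank = [(1.0, -1.5, 0.2), (1.3, 0.1, 0.2), (0.8, 2.0, 0.2)]
theta_now = 0.0
best = max(bank, key=lambda item: item_information_3pl(theta_now, *item))
print(best)
```

When c is zero the expression reduces to a² P (1 − P), the familiar two-parameter result, which is a quick check on the formula.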
Heuristic Approach
The heuristic approach was implemented with the TestDesign package in R (Silva, van der Linden, & Ortiz, 2019). A shortened test was to be assembled within the Mixed Integer Programming framework, given specific total item constraints as well as the existing domain and difficulty representation constraints. The optimization function sought to maximize the cumulative item information of the selected items:
where
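The objective can be illustrated on a toy bank with the sketch below. It is not a Mixed Integer Programming solver and does not use TestDesign: it enumerates every subset of a fixed size, which is only feasible for a handful of items, and stands in for the solver to show what "maximize cumulative information subject to content constraints" means. The bank, the target θ, and the minimum-per-domain constraint are hypothetical.

```python
import math
from itertools import combinations


def prob_correct_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def item_information_3pl(theta, a, b, c):
    p = prob_correct_3pl(theta, a, b, c)
    return (a ** 2) * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2


def best_form(bank, theta, n_items, min_per_domain):
    """Exhaustively pick the n_items subset with maximal total information
    that has at least min_per_domain[d] items from each listed domain d."""
    best, best_info = None, -1.0
    for combo in combinations(range(len(bank)), n_items):
        counts = {}
        for i in combo:
            counts[bank[i]["domain"]] = counts.get(bank[i]["domain"], 0) + 1
        if any(counts.get(d, 0) < m for d, m in min_per_domain.items()):
            continue  # violates a content constraint
        total = sum(
            item_information_3pl(theta, bank[i]["a"], bank[i]["b"], bank[i]["c"])
            for i in combo
        )
        if total > best_info:
            best, best_info = combo, total
    return best, best_info


toy_bank = [
    {"a": 1.5, "b": 0.0, "c": 0.2, "domain": "A"},
    {"a": 1.4, "b": 0.1, "c": 0.2, "domain": "A"},
    {"a": 1.3, "b": -0.1, "c": 0.2, "domain": "A"},
    {"a": 0.6, "b": 2.0, "c": 0.2, "domain": "B"},
    {"a": 0.5, "b": -2.0, "c": 0.2, "domain": "B"},
]
form, info = best_form(toy_bank, theta=0.0, n_items=3, min_per_domain={"A": 1, "B": 1})
print(form, info)
```

If no subset satisfies the constraints, the function returns `None` for the form, which mirrors the situation in which a rule-based assembler reports that no feasible shorter form exists.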
Results
Implicit DRL
The total reward for each episode of the implicit DRL is presented in Figure 1, with fluctuations indicating the algorithm’s exploratory process in searching for an optimal policy. The predominance of relatively high reward values suggests that a viable policy may have been identified early in the training process. Figure 2 illustrates the loss values across episodes of the implicit DRL, providing insight into the convergence of the policy optimization. The loss function for the policy is mathematically defined as:
Here
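For orientation, the clipped surrogate loss from Schulman et al. (2017) can be sketched as below. This is only the policy term in its standard form; whether the study's loss also included value-function or entropy terms is not stated here, and the clipping range of 0.2 is an assumed default rather than a reported setting.

```python
def ppo_clip_loss(ratios, advantages, epsilon=0.2):
    """Negative clipped surrogate objective averaged over a batch.

    ratios: new-policy / old-policy action probabilities;
    advantages: advantage estimates for the same actions.
    """
    terms = []
    for r, adv in zip(ratios, advantages):
        clipped = max(1.0 - epsilon, min(r, 1.0 + epsilon))
        # Take the more pessimistic of the unclipped and clipped terms.
        terms.append(min(r * adv, clipped * adv))
    return -sum(terms) / len(terms)


# A ratio of 1.5 with a positive advantage is clipped at 1.2.
print(ppo_clip_loss([1.5], [1.0]))
```

Clipping removes the incentive to move the policy far from the one that collected the data, which is the mechanism behind PPO's stable updates.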

Figure 1. Implicit DRL total training reward by episode.

Figure 2. Implicit DRL total training loss by episode.
The results demonstrate the utility of the DRL, as they indicate a suitable policy for administering test items was discovered and optimal actions were reinforced while still allowing for exploration.
Across the 10,000 additional episodes conducted using the trained policy, none of the generated item sets fully satisfied all the desired specifications but there were 5,244 instances in which a subset of items yielded an
The smallest subset with a total of 97 items had final values for
Table 1 lists the characteristics of the four subsets of most interest: the subset providing the highest total reward, the subset providing the lowest total reward, the subset that used the fewest items, and the subset whose
Table 1. Implicit DRL Subsets of Interest.
Note. Label indicates the subset of interest. The corresponding columns are the total items administered for that episode, the total reward at the end of the episode, the final ability and standard error estimates, the means and ranges of the 3PL parameters for the subset of items administered, and the final domain and difficulty ratios.
Although the results highlight some of the learning process and identify characteristics of a test that provides efficient measurement, these findings underscore a critical limitation: the restricted size of the item bank used in this study. Expanding the item bank could allow for a more extensive exploration of potential subsets that might better align with all predefined constraints.
Mixed DRL
The total reward for each episode of the mixed DRL is presented in Figure 3. Unlike the implicit DRL, the total reward by episode appeared to rise more slowly and to show a large spike before leveling out and fluctuating around a local maximum. This could be explained by the addition of item information in reward shaping. The lower values near the beginning illustrate the algorithm's lack of knowledge about the environment, while the short spike could indicate instances where the algorithm found small subsets of items that meet the desired

Figure 3. Mixed DRL total training reward by episode.

Figure 4. Mixed DRL total training loss by episode.
The application of the mixed DRL did not result in any saved subsets that used fewer than 240 items. This could be due to the information bonus overshadowing the final rewards. It is possible that the agent kept administering items rather than terminating early because it learned to focus on the more immediate reward of picking the item with the highest information. This could also explain why, in Figure 3, the local maximum on which the mixed DRL settled was much lower than the maximum it had found, in contrast to the reward plot for the implicit DRL in Figure 1. This overfitting to local gains appeared even though only small values were applied to the shaping term, meaning other methods, such as penalization, might be needed to encourage more exploration. The total number of items available for test construction could also be the main bottleneck for exploration.
Test Design
The heuristic approach also did not lead to any subsets being saved for review. As the strictest of the three approaches, with no mechanism for exploration, it was required to satisfy the structural requirements of domain and difficulty representation, so it is unsurprising that it produced no subsets. This result is consistent with the implicit DRL, which likewise yielded no subset whose size was a multiple of six and that met all other structural or precision requirements. Therefore, such a subset may not be obtainable from a single test form, and 240 items appears to be the optimal total given the constraints and limitations.
Although no approach found a smaller subset of items with all the desired qualities, these results illustrate the effectiveness of the DRL algorithm in leveraging temporal difference learning to refine its policy and reinforce optimal actions over time. Through the implicit DRL, it was also possible to identify characteristics of items that led to precise ability estimates with fewer items. The scope of learning was constrained by the limited resources available, as only items from a single exam form were used. Adding more items to the item bank could enhance the algorithm’s ability to explore and identify subsets meeting all desired specifications. This limitation highlights the need for larger datasets in future studies to fully harness the potential of DRL algorithms in optimizing test design.
Discussion
RL is transforming the field of testing by enabling adaptive assessment systems that tailor themselves to individual learners’ abilities and needs (Wang et al., 2024). Traditional testing systems often rely on static question sets that do not dynamically adjust to the examinee’s responses. RL introduces a significant paradigm shift by allowing tests to adapt in real time (Liu et al., 2024). For example, RL algorithms can analyze an examinee’s response patterns and dynamically select questions of appropriate difficulty to maintain an optimal challenge level (Li et al., 2023). Furthermore, RL-driven adaptive tests improve efficiency by reducing the number of questions required to reach reliable conclusions, thus shortening test durations while maintaining or enhancing precision.
Another major advantage of RL in testing is its ability to focus on the underlying processes behind responses rather than just the answers themselves. By modeling test-takers’ cognitive and behavioral patterns, RL can provide insights into problem-solving strategies, misconceptions, and areas requiring targeted intervention (Islam et al., 2021). For instance, in CAT, RL algorithms leverage a reward-based framework to optimize question selection, aiming for both mastery learning and diagnostic insights. Beyond individual assessments, RL-based testing systems contribute to large-scale education by continuously improving the question bank through feedback loops. Questions that fail to provide discriminatory power or are consistently answered incorrectly can be flagged for review or replaced, creating a self-improving testing ecosystem. Moreover, these systems enable the creation of longitudinal profiles of learners, helping educators track progress over time and tailor future instruction to maximize educational outcomes (Wang et al., 2024; Li et al., 2023).
A well-structured RL algorithm has the potential to address complex challenges, such as optimizing the test length for high-stakes examinations like the NBCE Part I. By utilizing dynamic programming and policy optimization techniques, RL can effectively identify subsets of test items that meet specific constraints related to content coverage and difficulty. However, the efficacy of such algorithms is inherently tied to the availability of computational resources and the robustness of the training environment. Localized training, while accessible, often falls short in handling the extensive computational demands required for training and optimizing RL models. In this context, while the algorithm demonstrated moderate success in identifying subsets of test items, current limitations necessitate retaining the test length at 240 items. This ensures the examination continues to fulfill its content and difficulty requirements until further advancements in algorithm refinement and computational scalability are achieved.
In terms of test length optimization, DRL provides powerful tools for achieving an ideal balance between brevity and measurement accuracy. Through its reward-based framework, DRL algorithms can prioritize item selection strategies that minimize the number of questions administered while still achieving precise estimates of test-taker ability. By leveraging partial information at each stage of the test, DRL models can dynamically determine when the addition of more items ceases to significantly improve the measurement outcomes, thus enabling early termination without compromising validity. In addition, DRL systems can simulate and analyze various test configurations, identifying optimal stopping rules and conditions that align with predefined accuracy thresholds. This capacity to adaptively optimize test length not only enhances efficiency but also reduces test fatigue for examinees, improving their overall testing experience. Ultimately, the flexibility and learning capabilities of DRL make it an indispensable tool for modernizing and refining the test construction and administration process in educational and professional contexts.
Future research should prioritize expanding the item bank to include a more diverse set of questions, allowing the algorithm greater flexibility in forming optimal test configurations. This expansion would enable the exploration of broader combinations, potentially enhancing the algorithm’s performance. In addition, the integration of cloud-based or high-performance computing infrastructure could provide the computational capacity needed to train and refine the algorithm efficiently. Alternative strategies, such as fine-tuning hyperparameters or adopting advanced model architectures, should also be explored to determine whether these adjustments yield improved outcomes. While multiple iterations of the RL algorithm were tested in this study, the scope for further experimentation remains significant, as alternative configurations may uncover superior solutions. These advancements will be critical for ensuring that future implementations can optimize test lengths while preserving the validity and reliability of the assessment.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
