Adaptive learning path generation in higher education using knowledge tracing and reinforcement learning: A multi-institutional study

Abstract

Individualized learning in higher education requires systems that adapt instructional sequences to the knowledge and learning behavior of individual students. Conventional fixed curricula do not account for how learning changes over time, which leads to inefficient progress and poor performance. This research presents a framework for adaptive learning path generation that combines Deep Knowledge Tracing (DKT) with a Deep Q-Network (DQN), a deep reinforcement learning method, to model student mastery and recommend the best module order. The DKT model is an LSTM network that predicts the probability of mastery from sequential learning interactions. A DQN agent uses these representations and formulates learning path optimization as a Markov Decision Process (MDP), selecting modules that maximize long-term learning gain. The framework was evaluated on multi-institutional student datasets in terms of mastery prediction, convergence, learning gain, engagement, efficiency, and generalization. Results show the DKT model achieved 87.6% accuracy, AUC-ROC 0.7343, RMSE 0.1569, and stable convergence across 42 epochs. The DQN agent increased cumulative reward from 53.94 to 57.68, reaching 70.01 and converging by episode 742. Adaptive learning paths improved learning gain by 23.59%, reduced learning steps by 36.44%, increased average mastery from 0.68 to 0.84, and enhanced weak concept recovery to 86.5%. Engagement improved, with time spent rising from 45% to 68%, revisit rate from 12% to 32%, and dropout reduced from 18% to 5%. Cross-institution evaluation confirmed strong generalization with consistent learning gain improvements. These findings demonstrate that the framework delivers personalized, scalable learning for tutoring systems and modern education platforms, supporting diverse student populations in higher education settings.
The dataset, built from more than 9000 interaction sessions, includes approximately 300 students, multiple simulated institutions, and diverse learning modules.

Keywords

adaptive learning, knowledge tracing, reinforcement learning, Deep Q-Network, personalized learning

Introduction

Over the past few years, higher education has changed substantially with the rapid development of digital technologies, online learning systems, and intelligent tutoring systems.¹ These developments have enabled educational institutions to provide flexible and scalable learning opportunities to a large and diverse student population.² Learners, however, vary greatly in background, cognitive capacity, pace of learning, and mode of engagement.³ Conventional curriculum designs follow a predetermined, strictly linear progression and assume that all students start from the same background.⁴ This fixed approach tends to produce an ineffective learning experience: some students struggle with new material that depends on earlier topics they have not yet mastered, whereas others become bored by repeating content they already know.⁵ These limitations lead to lower engagement, frustration, and high dropout rates in institutions of higher learning.⁶ Adaptive learning systems have emerged as a solution to these challenges.⁷ These systems use student interaction data, performance, and behavioral trends to dynamically tailor instructional content to the needs of individual learners.⁸ By offering individualized suggestions, adaptive learning systems improve student interaction, learning efficiency, and mastery-oriented progression.⁹ With the growing integration of technology-based instruction into higher education, there is a move towards intelligent adaptive frameworks that continuously track students' knowledge development and provide tailored learning experiences that maximize academic performance and learning outcomes.¹⁰ Because traditional educational programs are static, they cannot meet the unique learning requirements of students, which leads to inefficient learning, disengagement, and higher dropout rates. Such programs assume that all learners progress through the same learning path despite differences in background, learning speed, and level of participation. Adaptive learning systems use student behavior data to create personalized learning paths that help students overcome their learning challenges. Integrating Knowledge Tracing (KT) with Reinforcement Learning (RL) yields a complete system in which KT tracks student knowledge development while RL improves module recommendation decisions. This combination enables educational institutions to create personalized learning paths that use actual performance data to improve student learning results and engagement in contemporary learning environments.¹¹

Recent developments in artificial intelligence (AI), especially deep learning (DL) models, have made it far easier to model complex student learning behavior and predict knowledge progression.¹² Knowledge Tracing (KT) is a foundational methodology that estimates student mastery over time by analyzing sequential learning interactions, including quiz questions, assignment submissions, and module completions.¹³ Classical KT methods, including Bayesian Knowledge Tracing (BKT), are based on probabilistic models and in most cases fail to capture the complex temporal dependencies found in real-world learning data.¹⁴ To address these shortcomings, Deep Knowledge Tracing (DKT) was proposed, using LSTM networks to better model sequential learning patterns.¹⁵ DKT allows the system to learn temporal associations between previous and ongoing learning tasks and thus to predict accurately the probability that a student has mastered each knowledge component.¹⁶ As a result, the educational system can maintain a constantly updated record of each student's knowledge state. Nevertheless, although DKT is very effective at forecasting mastery levels, it does not directly indicate the best next learning activity.¹⁷ Stated differently, DKT is a predictive model, not a decision-making system. Consequently, there is an essential gap between the estimation of knowledge and the optimization of instruction,¹⁸ which shows that intelligent decision-making processes are needed to turn knowledge-state predictions into optimal personalized learning trajectories.¹⁹

RL is an effective computational framework for sequential decision-making problems, which makes it well suited to adaptive learning.¹⁵ By interacting with an environment and receiving feedback in the form of rewards, an RL agent learns an optimal decision policy.²⁰ In adaptive learning, the environment state can be defined as the knowledge state of the student, whereas the learning modules are the actions.²¹ The RL agent is trained to real-life higher education systems. Assistive technologies enable students with different disabilities to access educational materials because they address those students' unique learning requirements. These innovations create inclusive learning environments by improving accessibility, student engagement, and academic participation. Traditional fixed curricula assume that all students learn at the same pace, which prevents students from building knowledge in a personalized way and from learning at their own pace. Students then tend to lose interest, which leads to poor educational results. Recent deep learning-based knowledge tracing approaches have improved mastery prediction; however, they lack decision-making capabilities for optimal content sequencing. Combining Deep Knowledge Tracing with Deep Reinforcement Learning creates a system that delivers accurate knowledge assessment and personalized learning path development for smart tutoring systems.

Research objectives

The development of an intelligent adaptive learning framework for the creation of personalized learning paths is the main objective of this project. The specific objectives are:

(1) Analyze sequential student interaction data to characterize learning behavior and knowledge development.

(2) Use Deep Knowledge Tracing (DKT) to forecast student mastery over time.

(3) Formulate adaptive module recommendation as a reinforcement learning problem of optimal path selection.

(4) Deploy a Deep Q-Network (DQN) agent to recommend learning modules dynamically based on the knowledge state of the student.

(5) Evaluate the framework's effectiveness, impact on engagement, learning performance, and generalization across institutions.

Research contributions

This research proposes a combined adaptive learning model that integrates DKT and deep reinforcement learning (DRL) to generate personalized learning paths. An LSTM-based DKT model predicts students' degree of mastery from their learning sequences, and learning path optimization is modeled as an MDP and solved with a DQN agent. The framework introduces new mastery-oriented state representations intended to improve policy learning and policy stability. The experimental results show substantial improvements in learning outcomes, engagement, and efficiency compared with fixed curricula and baseline adaptive approaches. Moreover, the model is tested on datasets from several institutions, which provides evidence of its robustness, scalability, and ability to generalize. The proposed solution is an effective and scalable approach for intelligent tutoring systems, online learning platforms, and higher education settings.

The remainder of this work is structured as follows. Literature Review covers prior work on adaptive learning systems, knowledge tracing, and reinforcement learning in educational environments. Methodology presents the proposed adaptive learning framework and its system architecture, including the Deep Knowledge Tracing model, the reinforcement learning environment, the state representation, and the adaptive learning path generation process. Experimental Setup describes the dataset, the preprocessing steps, and the feature engineering methods applied to student interaction data. Result and Discussion reports the experimental results, covering mastery prediction performance, learning gain, engagement, efficiency, and multi-institution generalization. Conclusion and Future Work closes the paper and outlines future research directions for improving adaptive learning systems and enabling widespread real-world deployment.

Literature review

The purpose of adaptive learning systems is to tailor course material to the individual characteristics, performance, and knowledge level of students. Early adaptive learning methods relied on rule-based logic and fixed curriculum sequences, which did not respond to learner progress or cognitive state. Recent developments in artificial intelligence, knowledge tracing, and reinforcement learning have greatly improved adaptive learning. Recent studies also show that assistive technologies now include screen readers, speech recognition, and adaptive interfaces that support students with disabilities in educational settings. However, educational institutions still find it difficult to establish scalable systems that deliver personalized learning experiences across varied teaching environments.

Adaptive learning systems

Early adaptive learning systems emphasized structured curricula and predetermined recommendation strategies. Rincon-Flores et al. established that adaptive learning strategies enhance academic performance and student engagement because educational content is customized according to learner progress. Similarly, Contrino et al. created an adaptive learning system that increased student performance and satisfaction in both online and traditional forms of learning, which underscores the value of personalized learning structures. Recent studies have adopted artificial intelligence to improve adaptive learning systems. Song et al. presented a dynamic feedback-based learning optimization model that used machine learning to modify learning trajectories based on student interactions. Similarly, Sajja et al. proposed an AI-based intelligent assistant that provides customized suggestions in tertiary learning institutions. Suryanarayana et al. also established that AI-based educational management systems enhance the efficiency, scalability, and sustainability of learning through automated adaptive learning processes. Ma et al. introduced a personalized learning path recommendation approach that incorporates several algorithms to improve learning results. Nevertheless, their model did not provide the real-time adaptation and dynamic optimization needed in continuous learning settings.

Knowledge tracing models

Knowledge tracing is a central component of adaptive learning systems because it predicts a learner's knowledge state throughout the learning process. Conventional knowledge tracing methods such as BKT offered probabilistic models but could not capture sophisticated learning patterns. Knowledge tracing models based on deep learning have achieved high accuracy in estimating student knowledge. Tong et al. proposed a neural-network-based DKT model that predicts students' knowledge and cognitive load, enhancing learning path recommendations. Similarly, Fang et al. developed an adaptive knowledge tracing model that simulates student learning activity from behavioral interaction data. Fu et al. combined reinforcement learning and dynamic knowledge tracing to optimize individualized learning paths. Their model showed better learning efficiency but lacked strong state representation and generalization across institutions. Although these models improve the estimation of student knowledge, they still have issues with state representation, scalability, and flexibility in dynamic learning contexts.

Reinforcement learning in adaptive learning

Adaptive learning path optimization via RL has proven effective because RL can make sequential decisions based on student performance. Lou et al. proposed a dynamic programming algorithm that employed cognitive graphs to produce dynamic learning paths; however, their model did not use deep knowledge tracing to estimate the student state accurately. Lin et al. described a hierarchical reinforcement learning model that uses knowledge tracing to recommend learning paths meeting multiple objectives; however, their model still exhibited scalability issues when applied to very large educational systems. Pögelt et al. used reinforcement learning to suggest mathematical problems based on intended learning outcomes. The use of RL for learning path optimization by Fu et al. further validated RL for dynamically adapting content sequences to current student performance; however, challenges with training stability, generalization, and real-time adaptation remained unresolved.

Multi-institution adaptive learning and evaluation

Recent studies have stressed the need to assess adaptive learning systems in different types of educational settings to establish their robustness and ability to scale. Suryanarayana et al. argue for the significance of AI-powered education management systems in large-scale learning environments. In addition, Song et al. and Contrino et al. show significant improvements in student performance and engagement from adaptive learning systems across multiple learning environments; however, these studies used only single-institution datasets, which do not provide enough evidence of generalizability. Fu et al. and Lin et al. have worked on optimizing learning paths using reinforcement learning and knowledge tracing but do not evaluate their models sufficiently across different institutions and learning contexts.

Research gap

Although adaptive learning systems have made tremendous advancements, there are still a number of key limitations:

(1) Most existing adaptive learning models lack a detailed representation of the student knowledge state, especially temporal learning behavior, behavioral interactions, and cognitive development, which reduces the precision of learning path recommendations.

(2) The majority of existing systems employ either knowledge tracing or reinforcement learning separately, rather than combining both in one tool for precise student modeling and optimal decision-making.

(3) A number of adaptive learning models have been tested only on small or single-source datasets, which limits their ability to generalize across a variety of educational resources and large-scale learning platforms.

(4) Many adaptive systems currently in use rely on fixed or semi-adaptive recommendation policies and lack fully autonomous schemes that continuously revise learning trajectories based on real-time student performance.

(5) The vast majority of adaptive learning frameworks are evaluated in controlled or single-institution environments, which restricts their robustness, scalability, and generalizability in real-world, multi-institutional educational scenarios.

The proposed model addresses these limitations with a single framework that combines Deep Knowledge Tracing and Reinforcement Learning to obtain precise student modeling and dynamic learning path optimization. The Deep Knowledge Tracing component captures learning patterns over time and generates accurate representations of the knowledge state, which improves insight into student mastery. The Reinforcement Learning agent uses these knowledge states to continually recommend the best learning modules according to real-time performance, ensuring fully adaptive and personalized learning trajectories. The model is also tested on datasets from a variety of institutions to assess robustness, scalability, and generalization across learning settings. Together, these elements make the integrated method more effective for learning and more accurate in its recommendations. It also provides a scalable solution that can be applied to intelligent tutoring systems in higher education and in real-world deployments. In short, the framework closes the identified gaps through a DKT model that estimates temporal mastery and a DQN agent that uses those estimates to improve learning path selection, so that knowledge modeling feeds directly into adaptive recommendation.

Methodology

The primary objective of the study is to provide a framework for constructing adaptive learning paths for college students by combining KT and RL to support individualized learning that maximizes efficiency, engagement, and mastery. First, student learning interaction data are obtained from a large-scale educational dataset of module attempts, quiz results, time spent, and session counts. The raw data are preprocessed through missing-value handling, normalization, and sequence generation to produce temporally ordered learning trajectories. Feature engineering then maps modules to concepts, aggregates interaction statistics, smooths learning trends, and converts raw performance metrics into mastery indicators. The resulting sequences are fed into a DKT model that estimates student mastery probabilities for each module; these estimates form the compact state vectors used by the RL agent. The adaptive learning setting is modeled as a Markov Decision Process and solved with a DQN, in which the agent chooses the next module based on the student's knowledge state and rewards are determined by learning gain, mastery improvement, and the variety of exercises. The study uses technology as its main method to create assistive solutions that help disabled students learn more effectively. The methodology combines adaptive systems with user-centered design principles to create solutions that can be used by all people. The framework is additionally evaluated in a multi-institutional simulation to test generalization, and its performance is measured in terms of KT accuracy, learning gain, engagement, RL reward, and cross-domain adaptability.
Figure 1 presents the overall design of the proposed adaptive learning path generation system, which combines Deep Knowledge Tracing and Deep Reinforcement Learning to personalize module recommendations and is evaluated in a multi-institution setup.

Figure 1.

The proposed adaptive learning framework’s overall block diagram.

Feature engineering

Feature engineering is essential for converting unstructured student interaction data into meaningful representations appropriate for KT and RL models. The objective of this stage is to extract informative features that accurately reflect student learning behavior, conceptual understanding, engagement level, and mastery progression over time. Since KT models require sequential learning inputs and RL models require structured state representations, feature engineering ensures that student interactions are represented as continuous, interpretable, and temporally consistent feature vectors. The Exponential Moving Average (EMA) method is used to smooth student interaction data because it stabilizes temporal behavioral patterns by reducing short-term fluctuations and measurement noise. EMA gives more weight to recent data while preserving the effect of earlier data, which allows it to show how student learning patterns develop over time. This matters because sequential learning data often contain sudden performance changes that do not reflect actual learning. The smoothing yields stable feature values, which improves the quality of the input to the knowledge tracing model and the precision of the subsequent reinforcement learning outcomes.

Module-to-concept mapping

The learning modules are linked to one or several underlying concepts which depict certain knowledge elements. This mapping also allows the system to follow conceptual level mastery and not just at module level, which increases generalization and interpretability. Let the set of concepts be defined as Equation (1),

C = \{C_{1}, C_{2}, C_{3}, \dots, C_{K}\}

(1)

where $K$ denotes the overall number of concepts. Each module $m_{j}$ is mapped to a binary concept vector $ϕ(m_{j}) = [ϕ_{1}(m_{j}), \dots, ϕ_{K}(m_{j})]$ whose $k$-th entry is given by Equation (2),

ϕ_{k}(m_{j}) = \begin{cases} 1, & \text{if module } m_{j} \text{ belongs to concept } C_{k} \\ 0, & \text{otherwise} \end{cases}

(2)

The Knowledge Tracing model can determine the likelihood of mastery at the concept level by mapping, which can be used to give more accurate adaptive learning advice. Concept-based tracking also enables the Reinforcement Learning agent to specify modules that can fill special gaps in knowledge.
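As an illustration of Equation (2), the following minimal sketch builds the binary concept vector for each module. The module and concept names are hypothetical placeholders, not taken from the study's dataset, and the helper `build_concept_matrix` is introduced here only for illustration.

```python
from typing import Dict, List, Set


def build_concept_matrix(
    module_concepts: Dict[str, Set[str]], concepts: List[str]
) -> Dict[str, List[int]]:
    """Return the binary concept vector phi(m_j) of length K for every module."""
    return {
        module: [1 if concept in linked else 0 for concept in concepts]
        for module, linked in module_concepts.items()
    }


# Hypothetical catalogue: names are illustrative only.
concepts = ["C1", "C2", "C3"]
module_concepts = {"m1": {"C1"}, "m2": {"C1", "C3"}, "m3": {"C2"}}

phi = build_concept_matrix(module_concepts, concepts)
print(phi["m2"])  # [1, 0, 1]
```

A module linked to several concepts simply has several entries set to 1, which is what lets concept-level mastery be tracked separately from module-level progress.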

Feature aggregation across learning sessions

Students have the ability to engage in several sessions with the same module, thus feature aggregation is conducted to generalize learning behavior and minimize noise due to individual session variability. Aggregated feature values are calculated as the mean interaction feature values of all sessions that a student $i$ takes in a given module $j$ . Aggregated feature can be defined by the following Equation (3),

f_{i, j} = \frac{1}{N_{i, j}} \sum_{t = 1}^{N_{i, j}} x_{i, j, t}

(3)

where $f_{i, j}$ represents the aggregated feature value, $N_{i, j}$ represents the total number of interactions of student $i$ with module $j$, and $x_{i, j, t}$ represents the feature value of student $i$ at time step $t$.

Mean quiz score, mean assignment score, mean time spent, mean attention score and mean attempts completed are aggregated to show the overall student learning performance. This averaging is highly useful in limiting the impact of short-term variations, and gives a consistent, dependable overview of student behavior. Such aggregated indicators, in turn, assist the Knowledge Tracing and Reinforcement Learning models in gaining a better insight into long-term learning dynamics and predicting mastery more accurately and make adaptive learning directions more likely to succeed.
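The averaging in Equation (3) can be sketched as follows. The record layout (student id, module id, feature value) and the sample values are assumptions made for illustration; the study's actual data schema is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (student id, module id, feature value).
Record = Tuple[str, str, float]


def aggregate_feature(records: Iterable[Record]) -> Dict[Tuple[str, str], float]:
    """Compute f_{i,j}: the mean feature value over all sessions of student i in module j."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for student, module, value in records:
        grouped[(student, module)].append(value)
    return {key: sum(values) / len(values) for key, values in grouped.items()}


# Illustrative quiz scores for one student across three sessions.
sessions = [("s1", "m1", 60.0), ("s1", "m1", 80.0), ("s1", "m2", 50.0)]
aggregated = aggregate_feature(sessions)
print(aggregated[("s1", "m1")])  # 70.0
```

The same function can be applied once per feature (quiz score, assignment score, time spent, attention score, attempts) to obtain the aggregated indicators listed above.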

Learning trend smoothing using exponential moving average (EMA)

The student learning process tends to have peaks and lows because of short-term changes in engagement or performance. To ensure that the underlying learning trend is captured and reduce noise, EMA smoothing is used in features involving performance. The EMA is defined as Equation (4),

\mathrm{EMA}_{t} = α \cdot x_{t} + (1 - α) \cdot \mathrm{EMA}_{t - 1}

(4)

where $x_{t}$ is the feature value at time step $t$, $\mathrm{EMA}_{t}$ is the smoothed feature value, and $α \in [0, 1]$ is the smoothing factor. Higher α values place more emphasis on recent performance, so the model represents recent performance more faithfully. EMA is applied to the quiz score, assignment score, and attention score progressions and to the learning trend indicator, smoothing short-term variations to bring out long-term learning patterns. This smoothing reduces noise and gives a more consistent picture of student performance over time, which enhances the accuracy and reliability of mastery estimation in the Knowledge Tracing model.
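A minimal sketch of the recursion in Equation (4) is given below. Two details are assumptions rather than values reported in the study: the series is initialized with its first observation, and the smoothing factor α = 0.5 is chosen only to make the arithmetic easy to follow.

```python
from typing import List, Sequence


def ema(values: Sequence[float], alpha: float) -> List[float]:
    """Apply EMA_t = alpha * x_t + (1 - alpha) * EMA_{t-1} to a series."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    smoothed: List[float] = []
    for x in values:
        if not smoothed:
            # Assumption: the first smoothed value equals the first observation.
            smoothed.append(x)
        else:
            smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


# Illustrative normalized quiz scores with a sudden dip at the third step.
scores = [0.5, 1.0, 0.0, 1.0]
print(ema(scores, alpha=0.5))  # [0.5, 0.75, 0.375, 0.6875]
```

The dip to 0.0 is damped to 0.375 in the smoothed series, which illustrates how a single poor session has a limited effect on the learning trend.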

Mastery indicator conversion

In order to quantitatively encode student learning performance information, the raw performance scores are transformed into continuous mastery indicators with a range of 0 to 1. This makes it interpretable probabilistically and compatible with neural network models. The mastery indicator is defined as Equation (5),

m_{t} = \frac{\mathrm{score}_{t}}{\mathrm{score}_{\max}}

(5)

where $m_{t}$ denotes the mastery level at time step $t$, $\mathrm{score}_{t}$ represents the student score at time step $t$, and $\mathrm{score}_{\max}$ represents the maximum possible score. This transformation ensures that mastery values lie within the range $0 \leq m_{t} \leq 1$. Higher values indicate greater mastery. In cases where binary success indicators are available, mastery probability can also be represented as Equation (6),

P (C_{k} ∣ S_{t}) = σ (z_{t})

(6)

where $P (C_{k} ∣ S_{t})$ represents the probability of mastering concept $C_{k}$, $σ$ denotes the sigmoid function, and $z_{t}$ denotes the neural network output.
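Equations (5) and (6) can be illustrated with the short sketch below. The function names and the input validation are assumptions added for illustration, and the sigmoid is written in a numerically stable form that is mathematically identical to the standard definition.

```python
import math


def mastery_from_score(score: float, max_score: float) -> float:
    """Equation (5): normalize a raw score into a mastery indicator in [0, 1]."""
    if max_score <= 0 or not 0 <= score <= max_score:
        raise ValueError("expected 0 <= score <= max_score with max_score > 0")
    return score / max_score


def mastery_from_logit(z: float) -> float:
    """Equation (6): sigmoid of the network output z_t, computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


print(mastery_from_score(42, 60))  # 0.7
print(mastery_from_logit(0.0))     # 0.5
```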

Engagement feature construction

Student engagement plays a significant role in learning outcomes. Therefore, engagement-related features are combined to create composite engagement indicators. The engagement score is defined as Equation (7),

E_{t} = \frac{w_{1} \cdot \mathrm{time\_spent}_{t} + w_{2} \cdot \mathrm{attention\_score}_{t} + w_{3} \cdot \mathrm{video\_watched}_{t}}{w_{1} + w_{2} + w_{3}}

(7)

where $E_{t}$ represents the engagement score and $w_{1}, w_{2}, w_{3}$ denote weighting coefficients. This feature helps the Reinforcement Learning agent select modules that maximize both learning and engagement. The three engagement indicators are time spent, attention score, and interaction intensity, chosen because they directly measure student engagement and cognitive focus during learning activities. Time spent indicates effort investment, attention score captures focus level, and interaction frequency reflects active participation. Together, these features give a complete engagement measurement, which enables the model to suggest educational modules that improve both learning results and active participation.
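The weighted average in Equation (7) can be sketched as follows. The weight values and the assumption that the three inputs are already normalized to [0, 1] are illustrative choices; the weights actually used in the study are not stated here.

```python
from typing import Tuple


def engagement_score(
    time_spent: float,
    attention_score: float,
    video_watched: float,
    weights: Tuple[float, float, float] = (1.0, 1.0, 1.0),
) -> float:
    """Equation (7): weighted average of three engagement indicators."""
    w1, w2, w3 = weights
    total = w1 + w2 + w3
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return (w1 * time_spent + w2 * attention_score + w3 * video_watched) / total


# Illustrative inputs, assumed to be normalized to [0, 1]; weights are hypothetical.
e_t = engagement_score(0.5, 0.75, 1.0, weights=(2.0, 1.0, 1.0))
print(e_t)  # 0.6875
```

Because the weights are divided by their sum, the score stays in [0, 1] whenever the inputs do, regardless of how the weights are scaled.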

Feature vector construction

After feature extraction and transformation, each student interaction is represented as a feature vector Equation (8),

X_{t} = [m_{t}, q_{t}, a_{t}, {ts}_{t}, {att}_{t}, E_{t}]

(8)

where $m_{t}$ represents the module ID embedding, $q_{t}$ denotes the normalized quiz score, $a_{t}$ represents the normalized assignment score, ${ts}_{t}$ represents the normalized time spent, ${att}_{t}$ denotes the attention score, and $E_{t}$ denotes the engagement score. This vector serves as the input to the Knowledge Tracing model. The final output of this stage is a structured feature representation of student learning behavior and mastery progression Equation (9),

F = \{X_{1}, X_{2}, X_{3}, \dots, X_{T}\}

(9)

These engineered feature vectors offer a well-structured and holistic representation of the student's knowledge state, performance, progression, and behavioral patterns. They are fed into the Knowledge Tracing model to obtain precise mastery estimates and into the Reinforcement Learning agent to support adaptive optimization of the learning path. This feature engineering procedure improves the overall precision, stability, and interpretability of the adaptive learning model, which makes learning recommendations more meaningful and individualized.
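The construction of the sequence in Equation (9) from individual interactions can be sketched as below. The `Interaction` record, its field names, and the use of a plain module index in place of the module ID embedding are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Interaction:
    """One logged interaction; field names are hypothetical."""
    timestamp: int
    module_index: int   # stands in for m_t before any embedding is applied
    quiz: float         # q_t, normalized
    assignment: float   # a_t, normalized
    time_spent: float   # ts_t, normalized
    attention: float    # att_t
    engagement: float   # E_t

    def to_vector(self) -> List[float]:
        """Return X_t in the order used in Equation (8)."""
        return [
            float(self.module_index), self.quiz, self.assignment,
            self.time_spent, self.attention, self.engagement,
        ]


def build_sequence(interactions: List[Interaction]) -> List[List[float]]:
    """Return F = {X_1, ..., X_T}, ordered by time."""
    ordered = sorted(interactions, key=lambda item: item.timestamp)
    return [item.to_vector() for item in ordered]


# Illustrative log, deliberately out of order.
log = [
    Interaction(2, 1, 0.9, 0.8, 0.4, 0.7, 0.6),
    Interaction(1, 0, 0.5, 0.6, 0.3, 0.5, 0.4),
]
F = build_sequence(log)
print(len(F), len(F[0]))  # 2 6
```

Sorting by timestamp before vectorizing keeps the sequence temporally consistent, which the sequential KT model depends on.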

Knowledge tracing (KT)

To model students' learning progress and dynamically forecast mastery levels, the present study applies the DKT framework, built on an LSTM neural network. Knowledge Tracing is a sequential prediction task that predicts how the student knowledge state changes over time given the historical sequence of interactions. Unlike conventional models such as BKT, which assume independent knowledge components and fixed transition probabilities, DKT uses deep neural networks to capture complex, nonlinear, and long-term dependencies in learning behavior. The LSTM architecture is an appropriate choice because its internal memory state retains previous learning activities, which enables the model to estimate mastery progression across modules accurately. This ability is necessary in adaptive learning systems because it gives the system a consistent picture of what the student has learned, which the reinforcement learning agent can use to create individualized learning paths. The LSTM-based knowledge tracing model receives its input through an embedding layer that transforms categorical module identifiers into dense vector representations. Module IDs are first index-encoded, and each index is then mapped to a continuous, low-dimensional embedding vector. The embedding captures latent relationships between learning modules, which enables the model to transfer what it learns to related concepts. The embedding vectors are learned jointly during model training, allowing the representation to adapt to student interaction patterns. This approach avoids the sparsity of one-hot encoding and makes sequential knowledge modeling more efficient and expressive.

At each time step $t$, a student interaction is represented as a multidimensional feature vector capturing cognitive performance, engagement, and behavioral characteristics, as in Equation (10),

x_{t} = [m_{t}, q_{t}, a_{t}, {ts}_{t}, {att}_{t}, {eng}_{t}]

(10)

where $x_{t} \in R^{n}$ is the input feature vector at time step $t$, containing student learning interaction features such as quiz score, assignment score, time spent, and attention score; $m_{t}$ represents the module identifier (encoded using an embedding representation), $q_{t}$ is the normalized quiz score, $a_{t}$ is the assignment score, ${ts}_{t}$ denotes the normalized time spent, ${att}_{t}$ denotes the number of attempts, and ${eng}_{t}$ represents engagement features (attention score, clicks, or interaction intensity). Since module identifiers are categorical, they are transformed into dense embedding vectors, as in Equation (11),

e_{t} = E (m_{t})

(11)

where $E \in R^{M \times d_{e}}$ represents the embedding matrix and $d_{e}$ denotes the embedding dimension. The final input vector becomes Equation (12),

x_{t} = [e_{t}, q_{t}, a_{t}, {ts}_{t}, {att}_{t}, {eng}_{t}]

(12)
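The embedding lookup and concatenation of Equations (11) and (12) can be sketched as follows. This is a minimal illustration only: the table is filled with random values rather than learned, and the helper names `make_embedding` and `build_input`, the module count, and the embedding dimension are assumptions introduced here, not values reported in the study.

```python
import random

def make_embedding(num_modules, embed_dim, seed=0):
    """Return a hypothetical M x d_e embedding table with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(embed_dim)]
            for _ in range(num_modules)]

def build_input(E, module_id, quiz, assign, time_spent, attempts, engagement):
    """Eqs. (11)-(12): look up e_t = E(m_t), then append the scalar features."""
    e_t = E[module_id]  # a row lookup is equivalent to multiplying by a one-hot vector
    return list(e_t) + [quiz, assign, time_spent, attempts, engagement]

# Assumed toy sizes: M = 6 modules, d_e = 4.
E = make_embedding(num_modules=6, embed_dim=4)
x_t = build_input(E, module_id=2, quiz=0.8, assign=0.7,
                  time_spent=0.5, attempts=0.25, engagement=0.9)
print(len(x_t))  # d_e + 5 scalar features
```

In a trained model the rows of `E` would be updated by backpropagation together with the LSTM weights, which is what allows related modules to end up with similar vectors.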

For each student $i$, the complete interaction sequence is represented as Equation (13),

X^{(i)} = \{x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{T}^{(i)}\}

(13)

where $T$ denotes the sequence length and $i = 1, 2, \dots, N$. This sequential representation lets the model capture temporal dependencies in student learning behavior. The LSTM network maintains a hidden state representing the latent knowledge state of the student. The hidden state update is defined as Equation (14),

h_{t} = \mathrm{LSTM} (x_{t}, h_{t - 1})

(14)

where $h_{t} \in R^{d}$ is the hidden state vector and $d$ is the hidden dimension. The hidden state vector of the preceding step, $h_{t - 1} \in R^{d}$, represents the prior knowledge of the learner. The LSTM network is built on several gating mechanisms that control the information flow and internal operations of the memory cell. These gates allow the model to selectively retain knowledge, add new information, and create a new hidden representation of the student learning state.

The forget gate decides whether information from the previous cell state is retained or discarded. This mechanism allows the model to remove outdated or irrelevant knowledge and preserve useful learning patterns. The forget gate is mathematically represented in Equation (15):

f_{t} = σ (W_{f} x_{t} + U_{f} h_{t - 1} + b_{f})

(15)

where $W_{f}, W_{i}, W_{c}, W_{o} \in R^{d \times n}$ represent the weight matrices associated with the input vector, $b_{f}, b_{i}, b_{c}, b_{o} \in R^{d}$ represent the bias vectors, and $σ (\cdot)$ represents the sigmoid activation function.

The input gate regulates how much new information from the current input is added to the cell state. By evaluating the significance of recently observed student learning characteristics, it controls the updating process. The input gate is defined in Equation (16):

i_{t} = σ (W_{i} x_{t} + U_{i} h_{t - 1} + b_{i})

(16)

The weight matrices related to the hidden state are represented by $U_{f}, U_{i}, U_{c}, U_{o} \in R^{d \times d}$. The candidate memory represents the new information that might be added to the cell state. It is generated from the current input and the prior hidden state, as expressed in Equation (17):

{\tilde{c}}_{t} = \tanh (W_{c} x_{t} + U_{c} h_{t - 1} + b_{c})

(17)

where ${\tilde{c}}_{t} \in R^{d}$ represents the candidate memory vector containing new knowledge information, and $\tanh (\cdot)$ denotes the hyperbolic tangent activation function. The updated cell state $c_{t} \in R^{d}$ holds the refined long-term knowledge memory.

The new candidate memory and the previously held memory are combined to update the cell state. The input gate decides what should be added, and the forget gate decides what should be kept. This update mechanism is defined in Equation (18):

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ {\tilde{c}}_{t}

(18)

where $c_{t - 1} \in R^{d}$ represents the previous cell state storing long-term knowledge information, the input gate vector $i_{t} \in R^{d}$ controls how much new information is added, and the forget gate vector $f_{t} \in R^{d}$ controls which information is discarded.

The output gate controls how much of the updated cell state is exposed to the hidden state, so that the model can generate a suitable knowledge representation at the current time step. Equation (19) defines the output gate:

o_{t} = σ (W_{o} x_{t} + U_{o} h_{t - 1} + b_{o})

(19)

where $o_{t} \in R^{d}$ represents the output gate vector controlling the hidden state output, and ⊙ represents element-wise multiplication. The hidden state is the final output of the LSTM cell and serves as the updated student knowledge state. It is computed from the output gate and the new cell state, as in Equation (20):

h_{t} = o_{t} ⊙ \tanh (c_{t})

(20)
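The gate equations (15) to (20) can be traced through a single time step with the sketch below. It uses plain Python lists for clarity instead of a tensor library, and the parameter dictionary layout (`Wf`, `Uf`, `bf`, and so on) and the toy sizes are assumptions made for illustration, not the study's implementation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, x, U, h, b):
    """Compute W x + U h + b for list-of-lists matrices."""
    return [sum(w * xi for w, xi in zip(W[r], x))
            + sum(u * hi for u, hi in zip(U[r], h))
            + b[r]
            for r in range(len(b))]

def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step following Eqs. (15)-(20)."""
    f = [sigmoid(z) for z in affine(p["Wf"], x, p["Uf"], h_prev, p["bf"])]          # (15)
    i = [sigmoid(z) for z in affine(p["Wi"], x, p["Ui"], h_prev, p["bi"])]          # (16)
    c_tilde = [math.tanh(z) for z in affine(p["Wc"], x, p["Uc"], h_prev, p["bc"])]  # (17)
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, c_tilde)]        # (18)
    o = [sigmoid(z) for z in affine(p["Wo"], x, p["Uo"], h_prev, p["bo"])]          # (19)
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]                                # (20)
    return h, c

# Toy check with all-zero parameters (n = 3 inputs, d = 2 hidden units):
# every gate equals 0.5 and the candidate memory is 0, so c_t = 0.5 * c_{t-1}.
n, d = 3, 2
params = {}
for g in "fico":
    params["W" + g] = [[0.0] * n for _ in range(d)]
    params["U" + g] = [[0.0] * d for _ in range(d)]
    params["b" + g] = [0.0] * d
h, c = lstm_step([0.2, 0.4, 0.6], [0.0, 0.0], [1.0, -1.0], params)
print(c)  # [0.5, -0.5]
```

With zero weights the forget gate halves the previous memory and nothing new is written, which makes the roles of $f_{t}$ and $i_{t}$ in Equation (18) easy to verify by hand.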

These gating processes allow the model to retain relevant knowledge, drop obsolete information, and capture long-term learning dependencies, which makes the student knowledge representation more accurate and the adaptive learning path recommendation more effective. The hidden state $h_{t}$ encodes the student’s latent knowledge representation. This representation is mapped to mastery probabilities using a fully connected layer, as in Equation (21),

{\hat{y}}_{t} = σ (W h_{t} + b)

(21)

where ${\hat{y}}_{t} \in {[0, 1]}^{M}$ and $M$ denotes the number of modules. The output of the DKT model represents the predicted mastery probability of each learning module for a student at a given time step. This prediction is expressed as Equation (22),

{\hat{y}}_{t, j} = P (M_{j} = 1 ∣ X_{1 : t})

(22)

where ${\hat{y}}_{t, j}$ denotes the predicted probability that the student has mastered module $j$ at time step $t$, $P (\cdot)$ represents the conditional probability function, $M_{j}$ indicates the mastery status of module $j$ (where $M_{j} = 1$ denotes mastery and $M_{j} = 0$ denotes non-mastery), and $X_{1 : t}$ represents the sequence of student learning interactions from time step 1 to $t$, including quiz scores, assignment performance, time spent, and behavioral features.

Based on these predictions, the student knowledge state is represented as a knowledge state vector defined as Equation (23),

S_{t} = [{\hat{y}}_{t, 1}, {\hat{y}}_{t, 2}, \dots, {\hat{y}}_{t, M}]

(23)

where $S_{t} \in R^{M}$ represents the student’s knowledge state vector at time step $t$, and $M$ denotes the total number of learning modules. Each element of this vector is the mastery probability of a particular module. The DKT architecture for student mastery prediction is depicted in Figure 2.

Figure 2.

DKT architecture for Student Mastery Prediction.

The hidden state size of the LSTM is chosen to balance learning capacity against generalization. A moderate hidden dimension captures complex temporal dependencies in student learning sequences while keeping computation efficient. The embedding dimensionality is chosen so that sparse interaction inputs become dense representations from which the model can learn relationships between learning activities. Together, these choices let the model capture both short-term performance signals and long-term knowledge progression, yielding a reliable representation of student learning behavior for mastery prediction.

The knowledge state vector has several properties. First, its dimension is $M$, the total number of modules. Second, every entry lies in the range [0, 1] and is a mastery probability. Third, it is dynamic and changes over time as the student is exposed to new learning modules. Finally, it is personalized: each student has a distinct knowledge state that reflects his or her own learning process. This knowledge state vector gives a concise, continuous, and informative account of student knowledge and serves as the state input $s_{t}$ of the reinforcement learning agent, allowing the adaptive learning system to recommend individualized learning trajectories.
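The mapping from a hidden state to the knowledge state vector in Equations (21) and (23) can be illustrated with a small readout sketch. The weights, bias, and hidden state below are hypothetical values chosen for the example (three modules, hidden size two); they are not parameters from the trained model.

```python
import math

def mastery_readout(W, b, h):
    """Eq. (21): y_hat = sigmoid(W h + b), giving one probability per module."""
    return [1.0 / (1.0 + math.exp(-(sum(w * hk for w, hk in zip(row, h)) + bj)))
            for row, bj in zip(W, b)]

# Hypothetical readout weights for M = 3 modules and hidden size d = 2.
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, 0.0]]
b = [0.0, 0.0, 0.0]
h_t = [2.0, -1.0]

S_t = mastery_readout(W, b, h_t)                          # knowledge state, Eq. (23)
weakest = min(range(len(S_t)), key=S_t.__getitem__)       # index of the lowest mastery
print([round(p, 3) for p in S_t], weakest)
```

Because every entry passes through a sigmoid, the vector stays within [0, 1] regardless of the hidden state, and the lowest entry directly identifies the module most in need of remediation.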

The DKT model is also used to forecast the probability that the student succeeds on the next learning module, given the current knowledge state. This prediction is given by Equation (24),

{\hat{p}}_{t + 1} = {\hat{y}}_{t} \cdot \mathrm{onehot} (m_{t + 1})

(24)

where ${\hat{p}}_{t + 1}$ is the predicted probability of success on the next module at time step $t + 1$, ${\hat{y}}_{t} \in R^{M}$ is the predicted mastery probability vector at time step $t$, and $\mathrm{onehot} (m_{t + 1}) \in R^{M}$ is the one-hot encoded representation of the next module $m_{t + 1}$. The dot product extracts the mastery probability of the selected module. This forecast allows the adaptive learning system to anticipate future student performance and to prescribe suitable learning resources, remediation, or new material depending on the estimated mastery. The model is trained with the binary cross-entropy (BCE) loss, which measures the discrepancy between actual student performance and the predicted mastery probability. Equation (25) defines the loss at time step $t$.

L_{KT} = - \frac{1}{M} \sum_{j = 1}^{M} [y_{t, j} \log ({\hat{y}}_{t, j}) + (1 - y_{t, j}) \log (1 - {\hat{y}}_{t, j})]

(25)

where $L_{KT}$ represents the knowledge tracing loss at time step $t$, $M$ denotes the total number of learning modules, $y_{t, j} \in \{0, 1\}$ represents the actual mastery label of module $j$ at time $t$, and ${\hat{y}}_{t, j} \in [0, 1]$ represents the predicted mastery probability. The total loss across all students and time steps is defined as Equation (26),

L = \frac{1}{N} \sum_{i = 1}^{N} \sum_{t = 1}^{T} L_{KT}^{(i, t)}

(26)

where $L$ denotes the total training loss, $N$ represents the number of students, $T$ denotes the total number of time steps, and $L_{KT}^{(i, t)}$ represents the loss for student $i$ at time step $t$.

The model’s parameters are optimized with the Adam optimizer, which combines gradient-based updates with adaptive learning rates. The underlying gradient-based update rule is given in Equation (27),

θ = θ - η \nabla_{θ} L

(27)

where $η$ is the learning rate, $θ$ denotes the trainable model parameters, and $\nabla_{θ} L$ is the gradient of the loss function with respect to the parameters.
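The next-module prediction of Equation (24) and the per-step loss of Equation (25) can be checked numerically with the sketch below. The probabilities and labels are made-up illustrative values, and the clipping constant `eps` is an added numerical safeguard that is not part of the equations in the text.

```python
import math

def next_module_prob(y_hat, next_module):
    """Eq. (24): dot product with a one-hot vector, i.e. selecting one entry."""
    onehot = [1.0 if j == next_module else 0.0 for j in range(len(y_hat))]
    return sum(p * o for p, o in zip(y_hat, onehot))

def bce_loss(y_true, y_hat, eps=1e-7):
    """Eq. (25): mean binary cross-entropy over the M modules."""
    total = 0.0
    for y, p in zip(y_true, y_hat):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0); an assumption of this sketch
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

y_hat = [0.9, 0.2, 0.6]   # hypothetical predicted mastery for M = 3 modules
y_true = [1, 0, 1]        # hypothetical observed mastery labels

p_next = next_module_prob(y_hat, next_module=2)
loss = bce_loss(y_true, y_hat)
print(p_next, round(loss, 4))
```

The loss is small when confident predictions agree with the labels and grows quickly when a confident prediction is wrong, which is the behavior the training procedure relies on.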

To achieve stable convergence and effective learning, the DKT model was trained with carefully chosen hyperparameters. The learning rate ($η$) was set to 0.001 so that parameters are updated slowly and steadily. The batch size was 32, trading off computational efficiency against training stability. The LSTM used 128 hidden units so that it could learn temporal patterns and student learning progress. The model was trained for 50 epochs, which was sufficient for learning without overfitting. These hyperparameter values worked well in practice and yielded accurate mastery predictions.

The DKT training procedure learns sequentially from student interaction data. First, the input sequence $X = \{x_{1}, x_{2}, \dots, x_{T}\}$ is prepared, where each input vector is a student interaction feature vector. Next, forward propagation through the LSTM network computes the hidden states and the predicted mastery probabilities. The predicted mastery values ${\hat{y}}_{t}$ are then compared with actual student performance to compute the BCE loss. Backpropagation Through Time (BPTT) is used to compute the gradients across time steps. Finally, the Adam optimizer updates the model parameters iteratively until convergence.

After training, the DKT model produces a sequence of knowledge state vectors that describes the student’s learning over time. This is defined as Equation (28),

S = \{S_{1}, S_{2}, \dots, S_{T}\}

(28)

Here, $S_{t} \in R^{M}$ is the student knowledge state vector at time step $t$, and $T$ is the total number of learning interactions. Each state vector contains the mastery probabilities of every module. These knowledge states give a dynamic and continuous representation of the student’s learning progress and are the input of the reinforcement learning agent.

The final output of the trained DKT model is a mastery matrix defined as Equation (29),

\hat{Y} \in R^{N \times M}

(29)

where $\hat{Y}$ represents the mastery prediction matrix, $N$ denotes the total number of students, and $M$ denotes the total number of learning modules.

Each element of the matrix is defined as Equation (30),

{\hat{Y}}_{i, j} = \text{predicted mastery probability of student } i \text{ for module } j

(30)

where ${\hat{Y}}_{i, j} \in [0, 1]$ is the probability that student $i$ has mastered module $j$. This mastery matrix is a complete representation of each student’s knowledge across all modules and forms the basis of adaptive learning path optimization through reinforcement learning.

The LSTM-based DKT model is trained on temporal student interaction data. The network receives each student’s learning sequence in chronological order: at time step $t$ it processes the input $x_{t}$ together with the previous hidden state $h_{t - 1}$ to produce an updated state $h_{t}$, which tracks how knowledge develops over time. The model outputs mastery probabilities ${\hat{y}}_{t}$, which are compared against observed outcomes using the binary cross-entropy loss, and its parameters are optimized with Backpropagation Through Time (BPTT). Mini-batch training with padding and masking handles sequences of different lengths and keeps learning stable.
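Padding and masking for variable-length sequences can be sketched as below. This is a simplified illustration under stated assumptions: each time step carries a single binary label rather than $M$ module labels, the loss is averaged over real (unpadded) steps, and the helper names `pad_batch` and `masked_bce` are introduced here rather than taken from the study.

```python
import math

PAD = None  # hypothetical marker for padded time steps

def pad_batch(sequences, pad=PAD):
    """Right-pad variable-length label sequences and return a 0/1 mask."""
    T = max(len(s) for s in sequences)
    padded = [s + [pad] * (T - len(s)) for s in sequences]
    mask = [[1] * len(s) + [0] * (T - len(s)) for s in sequences]
    return padded, mask

def masked_bce(labels, probs, mask):
    """Average binary cross-entropy over real (mask == 1) steps only."""
    total, count = 0.0, 0
    for lab_row, p_row, m_row in zip(labels, probs, mask):
        for y, p, m in zip(lab_row, p_row, m_row):
            if m:  # padded positions contribute nothing to the loss
                total -= y * math.log(p) + (1 - y) * math.log(1 - p)
                count += 1
    return total / count

# Two students with 3 and 1 interactions; predictions at padded steps are ignored.
labels, mask = pad_batch([[1, 0, 1], [1]])
probs = [[0.8, 0.3, 0.9], [0.7, 0.5, 0.5]]
loss = masked_bce(labels, probs, mask)
print(mask, round(loss, 4))
```

Without the mask, the arbitrary predictions at padded positions would contribute to the gradient and bias the model toward students with short sequences.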

The mastery vector produced by the DKT model is central to adaptive learning path generation because it describes student proficiency across all modules in detail. It identifies weak and strong modules, so the system can suggest individualized learning material according to each student’s needs. It also supports learning path optimization by guiding the selection of relevant modules and the adaptive adjustment of difficulty to the student’s ability. Moreover, the mastery vector serves as the state of the RL agent, which uses it to decide which sequence of modules to recommend. DKT has several benefits over traditional knowledge modeling methods, including higher prediction accuracy, the ability to model nonlinear and temporal learning behavior, better personalization, and continuous mastery estimation. The model is also highly scalable and can be applied to large educational datasets. The knowledge state vector $S_{t}$, the mastery probability matrix $\hat{Y}$, and the chronological description of student learning activities are the final outputs of the knowledge tracing phase. These outputs give a holistic and dynamic account of student learning progress and are the key inputs to the reinforcement learning agent for adaptive learning path optimization. The DKT procedure for student mastery prediction is given in Algorithm 1.

The Deep Knowledge Tracing (DKT) model represents student knowledge as a continuously changing hidden state that develops over time. At every time step $t$, the LSTM holds a hidden state $h_{t}$ that summarizes the complete history of previous interactions $(x_{1}, x_{2}, \dots, x_{t})$. This representation tracks immediate academic results as well as student development over an extended period. The gating mechanisms keep important knowledge and discard outdated information, which lets the model capture both retention and forgetting. The knowledge state supports continuous mastery assessment, gives personalized results, improves prediction accuracy, and provides the input for reinforcement learning-based adaptive decision-making.

State representation

Because the knowledge state vector is high-dimensional, PCA is applied to reduce its dimensionality, which also improves computational performance. The reduction removes redundant information while preserving the patterns that matter for learning, which leads to faster and more stable training of the reinforcement learning agent. The trained DKT model outputs a mastery prediction matrix $\hat{Y} \in R^{N \times M}$, where each element represents the probability that a student has mastered a module. This matrix provides a global view of student knowledge. At each step, a row vector is extracted as the student’s knowledge state $s_{t}$. This vector serves as input to the reinforcement learning agent and defines the state space of the MDP. Using this representation, the RL agent selects modules, linking mastery prediction with adaptive learning path optimization. In RL-based adaptive learning systems, the state representation must accurately reflect a student’s current knowledge and learning progress. In this study, the state is derived from the mastery probability vectors generated by the DKT model.

Initial knowledge state representation

At each time step $t$, the DKT model produces a mastery probability vector that represents the likelihood of a student mastering each learning module. This vector is used as the initial state representation for the RL agent. The knowledge state vector is defined as Equation (31),

s_{t} = [{\hat{y}}_{t, 1}, {\hat{y}}_{t, 2}, {\hat{y}}_{t, 3}, \dots, {\hat{y}}_{t, M}]

(31)

where $s_{t} \in R^{M}$ is the knowledge state of the student at time $t$, ${\hat{y}}_{t, j} \in [0, 1]$ is the predicted mastery probability for learning module $j$, and $M$ is the number of learning modules. Higher values indicate greater mastery, whereas lower values indicate modules that need improvement. For example, consider Equation (32),

s_{t} = [0.92, 0.45, 0.30, 0.81, 0.62, \dots, 0.76]

(32)

Here, 0.92 indicates strong mastery of module 1, 0.45 moderate mastery of module 2, 0.30 weak mastery of module 3, and 0.76 good mastery of the final module. This vector is a comprehensive, quantitative description of the student’s learning status across all modules. In real-world adaptive learning systems, $M$ may be very large, which leads to a high-dimensional knowledge state vector. High dimensionality increases computational complexity, slows the convergence of reinforcement learning, and raises the risk of overfitting. It also reduces the learning efficiency of the RL agent and increases memory demands. To overcome these difficulties, dimensionality reduction is used to map the knowledge state vector to a lower-dimensional representation that retains the essential information about student knowledge, which improves training efficiency, scalability, and overall system performance.

Final state representation using PCA

In this study, PCA is used to project the high-dimensional knowledge state onto a lower-dimensional representation. The transformed state vector is given by Equation (33),

s_{t}^{'} = \mathrm{PCA} (s_{t})

(33)

where $s_{t}^{'} \in R^{d}$ denotes the transformed state and $d ≪ M$ denotes the reduced state dimension. PCA maps the original vector to a lower-dimensional space while preserving as much variance as possible, and it identifies the most significant latent knowledge patterns by extracting orthogonal principal components. The transformation is expressed as Equation (34),

s_{t}^{'} = W^{T} s_{t}

(34)

where $W \in R^{M \times d}$ is the PCA projection matrix, $s_{t} \in R^{M}$, and $s_{t}^{'} \in R^{d}$. The reduced state vector retains the most important information about the student’s mastery while eliminating redundant and non-informative components. The compressed state vector $s_{t}^{'}$ expresses the student’s latent knowledge properties in a concise form: each dimension is a combination of related learning modules and represents a higher-level learning pattern. As an example, the initial state vector is

s_{t} = [0.92, 0.45, 0.30, 0.81, 0.62, \dots, 0.76]

(35)

and the reduced state vector is

s_{t}^{'} = [1.42, - 0.37, 0.85, 0.21, - 0.54]

(36)

The transformed features represent latent knowledge elements such as learning proficiency, conceptual understanding, consistency, retention ability, and concentration of learning weaknesses.
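The projection step of Equation (34) can be sketched as follows. The example covers only the projection with an already available matrix: the orthonormal matrix `W` below is a hypothetical stand-in for one fitted by PCA on the mastery matrix, and the optional `mean` argument reflects the centering that PCA usually applies, which Equation (34) does not show explicitly.

```python
def project(state, W, mean=None):
    """Eq. (34): s' = W^T s, optionally after subtracting a fitted mean."""
    if mean is None:
        mean = [0.0] * len(state)  # no centering, matching Eq. (34) as written
    centered = [s - m for s, m in zip(state, mean)]
    d = len(W[0])
    return [sum(W[j][k] * centered[j] for j in range(len(centered)))
            for k in range(d)]

# Hypothetical orthonormal projection for M = 4 modules and d = 2 components.
W = [[0.5, 0.5],
     [0.5, -0.5],
     [0.5, 0.5],
     [0.5, -0.5]]
s_t = [0.92, 0.45, 0.30, 0.81]

s_reduced = project(s_t, W)
print([round(v, 2) for v in s_reduced])
```

In this toy case the first component is proportional to overall mastery and the second contrasts odd-numbered with even-numbered modules, which illustrates how each reduced dimension combines several modules into one higher-level pattern.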

The reinforcement learning agent uses the final reduced state $s_{t}^{'}$ as its input state, which gives a succinct description of the student’s current level of knowledge. This state allows the agent to recognize the student’s strengths and weaknesses, anticipate learning ability, and choose the next modules to include in the personalized learning path. As the student engages with the recommended modules, the knowledge tracing model receives the resulting performance data and updates the knowledge state to produce a new state vector. This dynamic state transition lets the reinforcement learning agent keep adjusting its recommendations, which increases learning efficiency, personalization, and long-term knowledge retention. The state transition can be expressed as $s_{t}^{'} \to s_{t + 1}^{'}$, where $s_{t}^{'}$ represents the current state and $s_{t + 1}^{'}$ represents the new state after the student’s interaction.

The proposed state representation provides a compact and precise model of student knowledge because it is built from the mastery probabilities of the learning modules. PCA reduces dimensionality without losing important learning patterns, while improving computational efficiency and the convergence of reinforcement learning. This structured form allows the RL agent to produce high-quality personalized recommendations based on each student’s strengths and weaknesses. The representation also scales to large populations of students, modules, and interactions, so the framework can be used in real-world adaptive learning systems and large-scale educational platforms.

The sequence of states used by the RL agent is defined as Equation (37),

S = \{s_{1}^{'}, s_{2}^{'}, s_{3}^{'}, \dots, s_{T}^{'}\}

(37)

where $S$ is the sequence of student states, $T$ is the number of learning interactions, and $s_{t}^{'} \in R^{d}$. The RL framework’s adaptive learning path optimization starts from this state representation. The state transition mechanism links DKT-generated mastery probabilities with reinforcement learning updates. At each time step $t$, the DKT model produces a mastery vector $s_{t}$, representing the student’s current knowledge state, which is used as input to the RL agent. Based on this state, the agent selects an action $a_{t}$ corresponding to the next learning module. After the interaction, new performance data is added to the sequence, and the DKT model updates the mastery vector to $s_{t + 1}$. This transition $s_{t} \to s_{t + 1}$, along with reward feedback, enables the RL agent to optimize adaptive learning decisions.

PCA was selected as the dimensionality reduction method for this study because it reduces data dimensions efficiently while preserving the most critical variance in the knowledge-state data. Compared with alternatives such as autoencoders and nonlinear embedding techniques, PCA has a light computational load and produces stable state transformations, which benefits reinforcement learning systems that must update their state information frequently. The transformed features remain interpretable as combinations of the original modules, while redundant components and noise are removed. The result is a compact student knowledge representation that improves the training speed and stability of the reinforcement learning agent.

Reinforcement learning (RL) environment

The recommendation process is formulated as an MDP so that it can be optimized in an adaptive and personalized way. The MDP framework provides a mathematical basis for modeling sequential decision-making, in which an agent interacts with the learning system and chooses the best learning modules for the student according to the current knowledge state. By suggesting the best module at each stage, the RL agent aims to maximize cumulative learning improvement. The MDP is defined by the tuple ($S, A, P, R, γ$), where $S$ is the state space, $A$ the action space, $P$ the state transition probabilities, $R$ the reward function, and $γ$ the discount factor. The state ($S$) represents the student’s current knowledge state derived from DKT mastery probabilities. The action space ($A$) corresponds to the selection of the next learning module. The reward function ($R$) measures learning progress through performance assessment and student engagement metrics. The transition probability ($P$) describes how student knowledge changes after the student completes a suggested module. This formulation enables the RL agent to learn an optimal policy for adaptive, sequential learning path recommendation.

The reward function serves two goals: it guides students toward useful learning progress and it discourages unnecessary recommendations. It combines performance-based and engagement-based rewards with a repetition penalty that discourages recommending modules that were already suggested recently. Reducing the reward for repeated module suggestions leads to more diverse learning paths and prevents students from reaching learning dead ends. Students can therefore discover new learning materials while consolidating their existing knowledge. The RL agent uses this balance to create learning sequences that adapt to student needs without producing redundant content.

State definition

The state represents the student’s current knowledge level across all modules. At time step $t$, the state is represented as the mastery probability vector obtained from the knowledge tracing model, Equation (38),

s_{t} = [{\hat{y}}_{t, 1}, {\hat{y}}_{t, 2}, {\hat{y}}_{t, 3}, \dots, {\hat{y}}_{t, M}]

(38)

where $s_{t} \in R^{M}$ is the student knowledge state, ${\hat{y}}_{t, j}$ is the predicted mastery probability for module $j$, and $M$ is the total number of modules. After dimensionality reduction using PCA, the reduced state is represented as Equation (39),

s_{t}^{'} = \mathrm{PCA} (s_{t}), \quad s_{t}^{'} \in R^{d}, \quad d ≪ M

(39)

The RL agent uses this state representation as input for decision-making, capturing the student’s learning progress.

Action space

The action represents the selection of the next learning module to recommend. At each time step $t$, the RL agent selects an action $a_{t} \in \{1, 2, 3, \dots, M\}$, where $a_{t}$ is the selected module index and $M$ is the total number of available modules. Each action corresponds to recommending a learning activity such as a lecture, exercise, quiz, or reinforcement session. The RL agent selects the module expected to maximize student mastery improvement and overall learning efficiency.

State transition

After the agent selects an action $a_{t}$, the student interacts with the recommended module. Based on the student’s performance, the knowledge tracing model updates the mastery probabilities, resulting in a new state, Equation (40),

s_{t + 1} = f (s_{t}, a_{t}, \mathrm{performance}_{t})

(40)

This transition reflects how student knowledge evolves after completing the recommended module.

Reward function

The reward function is designed to promote meaningful learning progress and to discourage repetitive or ineffective recommendations. The reward at time step $t$ is defined as Equation (41),

r_{t} = \frac{{APR}_{t} - {APR}_{t - 1}}{d_{t}} - λ n_{a_{t}, t}

(41)

where ${APR}_{t}$ denotes the average mastery probability at time $t$, ${APR}_{t - 1}$ the average mastery probability at the previous time step, $d_{t}$ the distance to the learning goal, $n_{a_{t}, t}$ the number of times module $a_{t}$ has been recommended, and $λ$ the repetition penalty factor. The average mastery probability and the distance to the goal are given by Equations (42) and (43),

{APR}_{t} = \frac{1}{M} \sum_{j = 1}^{M} {\hat{y}}_{t, j}

(42)

d_{t} = β - {APR}_{t}

(43)

where $β$ denotes the target mastery threshold (e.g., 0.9 or 90%). The reward formulation encourages effective learning by assigning positive rewards when student mastery improves and higher rewards for greater improvements. It prioritizes recommendations that move students closer to learning goals while penalizing repeated recommendations to promote diversity. This mechanism enables the reinforcement learning agent to learn an efficient adaptive learning strategy over time.
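The reward of Equations (41) to (43) can be computed as in the sketch below. The threshold `beta=0.9` follows the example value in the text, whereas the penalty factor `lam=0.05` and the small floor `eps` on $d_{t}$ are assumptions of this sketch: the floor is added only to avoid division by zero when the average mastery reaches or exceeds the threshold.

```python
def average_mastery(state):
    """Eq. (42): mean mastery probability across modules."""
    return sum(state) / len(state)

def reward(prev_state, state, times_recommended, beta=0.9, lam=0.05, eps=1e-6):
    """Eq. (41): mastery gain scaled by distance to goal, minus a repetition penalty."""
    apr_prev = average_mastery(prev_state)
    apr = average_mastery(state)
    d_t = max(beta - apr, eps)  # Eq. (43), floored; the floor is an added safeguard
    return (apr - apr_prev) / d_t - lam * times_recommended

prev_state = [0.5, 0.5]  # hypothetical mastery before the recommended module
state = [0.6, 0.6]       # hypothetical mastery after it

r_first = reward(prev_state, state, times_recommended=1)
r_third = reward(prev_state, state, times_recommended=3)
print(round(r_first, 4), round(r_third, 4))
```

The same mastery gain earns less when the module has already been recommended several times, and the division by $d_{t}$ makes a given gain more valuable as the student approaches the target threshold.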

Episode definition

An episode represents one complete learning session for a student, Equation (44),

\mathrm{Episode} = \{s_{1}, a_{1}, r_{1}, s_{2}, a_{2}, r_{2}, \dots, s_{T}\}

(44)

where $T$ is the total number of learning interactions. Each episode terminates when the student reaches the predefined mastery threshold $β$ or when the maximum number of learning steps is completed. This episodic framework enables the reinforcement learning agent to learn policies by observing complete learning trajectories and their outcomes.

Policy function

The RL agent follows a policy $π_{θ} (a_{t} ∣ s_{t})$, parameterized by $θ$, which gives the probability of choosing action $a_{t}$ given state $s_{t}$; that is, the policy maps states to action probabilities. The agent chooses actions that maximize the expected cumulative reward, $\max_{θ} E [\sum_{t = 1}^{T} γ^{t} r_{t}]$, where $r_{t}$ is the reward at time step $t$ and $γ$ is the discount factor.
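The quantity inside the expectation, the discounted sum of rewards along one trajectory, can be computed as in this short sketch. The reward values and the discount factor are illustrative assumptions, and the indexing starts at $t = 1$ to match the objective as written above.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma**t * r_t for t = 1..T, matching the objective in the text."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards, start=1))

# Three equal rewards with gamma = 0.5: 0.5 + 0.25 + 0.125
episode_rewards = [1.0, 1.0, 1.0]
G = discounted_return(episode_rewards, gamma=0.5)
print(G)  # 0.875
```

A smaller $γ$ makes the agent favor modules with immediate mastery gains, while a value close to 1 weights later gains almost as heavily as early ones.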

Exploration and exploitation strategy

In RL, balancing exploration and exploitation is essential for effective adaptive learning path generation. Exploitation concentrates on choosing modules that have historically yielded significant learning gains, whereas exploration lets the agent suggest novel or infrequently chosen modules to determine their potential for improving student understanding. The proposed framework employs an epsilon-greedy strategy in which the agent explores a randomly chosen module with probability $ϵ$ and chooses the module with the highest estimated value with probability $1 - ϵ$. During training, the exploration rate decreases gradually, so the agent converges toward a policy that is effective for learning and makes individualized module suggestions.
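A minimal sketch of epsilon-greedy selection with a decaying exploration rate follows. The decay rate of 0.995 and the floor of 0.05 are hypothetical values chosen for illustration; the text does not report the schedule used in the study.

```python
import random

def epsilon_greedy(q_values, epsilon, rng=random):
    """Explore a random module with probability epsilon, otherwise pick the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)

def decay_epsilon(epsilon, rate=0.995, floor=0.05):
    """Multiplicative decay with a lower bound; both constants are assumptions."""
    return max(floor, epsilon * rate)

rng = random.Random(0)
q_values = [0.1, 0.9, 0.3]  # hypothetical Q-values for three modules

greedy_choice = epsilon_greedy(q_values, epsilon=0.0, rng=rng)  # always exploits
random_choice = epsilon_greedy(q_values, epsilon=1.0, rng=rng)  # always explores

epsilon = 1.0
for _ in range(1000):
    epsilon = decay_epsilon(epsilon)
print(greedy_choice, epsilon)
```

Keeping a nonzero floor means the agent continues to try rarely recommended modules occasionally, even late in training.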

The RL environment is designed to learn a recommendation policy that increases student mastery and reduces learning time. The RL agent chooses suitable modules according to the student’s knowledge level, and the reward and penalty scheme discourages redundant recommendations. The system builds individual learning paths by adapting to each student’s learning behavior and progression history, which supports efficient knowledge acquisition and faster mastery according to the student’s needs. Through continued interaction and reward feedback, the system determines an efficient order of modules for each student. The resulting adaptive learning pathways raise mastery levels while reducing redundancy and inefficiency, supporting data-driven, personalized learning.

RL algorithm: Deep Q-Network (DQN)

The study uses the DQN algorithm, a value-based RL technique that combines Q-learning with deep neural networks, to optimize adaptive learning path construction. The DQN agent’s goal is to learn a policy that, given each student’s current knowledge state, chooses the best learning module. Unlike standard Q-learning, which relies on a tabular representation, DQN approximates the action-value function with a neural network and can therefore handle high-dimensional state spaces such as the mastery vectors produced by the Knowledge Tracing model. The central component of the algorithm is the Q-function, which gives the expected cumulative reward of taking action $a_{t}$ in state $s_{t}$ and following the optimal policy thereafter, equation (45),

Q (s_{t}, a_{t}) = E [\sum_{k = 0}^{\infty} γ^{k} r_{t + k} ∣ s_{t}, a_{t}]

(45)

where

Q (s_{t}, a_{t})

is the expected cumulative reward,

r_{t}

is the immediate reward at time step t,

γ \in [0, 1]

is the discount factor that determines how important future rewards are, and E is the expected value operator. The DQN uses a neural network parameterized by θ,

Q (s_{t}, a_{t}; θ)

to estimate this Q-function. The network generates Q-values for each potential module recommendation after receiving the student knowledge state vector as input. The Bellman equation (46) is used to update the Q-values iteratively.

Q (s_{t}, a_{t}) \leftarrow Q (s_{t}, a_{t}) + α [r_{t} + γ \max_{a} Q (s_{t + 1}, a) - Q (s_{t}, a_{t})]

(46)

where

\max_{a} Q (s_{t + 1}, a)

is the maximum expected future reward, γ is the discount factor,

r_{t}

is the immediate reward, and α is the learning rate. By reducing the discrepancy between the target and predicted Q-values, this update moves the agent step by step toward the optimal policy.
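The update in equation (46) can be illustrated in its tabular form. The sketch below is not the DQN itself, which replaces the table with a neural network; the state and module names and all numeric values are hypothetical.

```python
def q_learning_update(q, state, action, reward, next_state, alpha, gamma):
    """Apply one update of equation (46) to a nested-dict Q-table in place."""
    # Maximum expected future reward over the actions available in the next state.
    best_next = max(q[next_state].values())
    # Temporal-difference error: target minus current estimate.
    td_error = reward + gamma * best_next - q[state][action]
    q[state][action] += alpha * td_error
    return td_error


# Hypothetical two-state, two-module Q-table.
q_table = {
    "s0": {"m1": 0.0, "m2": 0.0},
    "s1": {"m1": 1.0, "m2": 0.5},
}
err = q_learning_update(
    q_table, "s0", "m1", reward=0.2, next_state="s1", alpha=0.1, gamma=0.9
)
```

Here the estimate for recommending `m1` in `s0` moves a fraction α of the way toward the target.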

The neural network is trained with the mean squared error (MSE) loss function, equation (47),

L (θ) = E [{(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}]

(47)

where the target value is defined as equation (48),

y_{t} = r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}; θ^{-})

(48)

where the target network parameters are represented by

θ^{-}

and the primary network parameters by θ. The target network is updated periodically as

θ^{-} \leftarrow θ

which stabilizes training.
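To make equations (47) and (48) concrete, the sketch below computes the targets and the MSE loss for a small mini-batch. It assumes the target-network Q-values for the next states are already available as plain lists; the numbers are hypothetical, and terminal-state handling is omitted.

```python
def td_targets(rewards, next_q_target, gamma):
    """Equation (48): y_i = r_i + gamma * max_a' Q(s_{i+1}, a'; theta^-)."""
    return [r + gamma * max(q_row) for r, q_row in zip(rewards, next_q_target)]


def mse_loss(targets, predicted):
    """Equation (47): mean of squared differences over the mini-batch."""
    return sum((y - q) ** 2 for y, q in zip(targets, predicted)) / len(targets)


# Hypothetical mini-batch of two transitions with two candidate modules each.
rewards = [1.0, 0.0]
next_q_target = [[0.5, 2.0], [1.0, 0.0]]  # rows: Q(s_{i+1}, .; theta^-)
predicted_q = [2.0, 1.0]                  # Q(s_i, a_i; theta) from the main network

targets = td_targets(rewards, next_q_target, gamma=0.5)
loss = mse_loss(targets, predicted_q)
```

Because the targets come from the separate parameters θ^-, they stay fixed between synchronizations while θ is being updated.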

The DQN approximates the action-value function with a fully connected feed-forward neural network, which lets the RL agent predict the expected cumulative reward of recommending each learning module. The network maps the representation of the student’s knowledge state to a Q-value for every available action.

The input to the neural network is the student knowledge state vector obtained from the Knowledge Tracing model, represented as $s_{t} \in R^{d}$ , where $s_{t}$ denotes the student’s knowledge state at time step $t$ , and $d$ represents the dimensionality of the state vector after feature extraction and dimensionality reduction. Each element in $s_{t}$ reflects the mastery level or latent knowledge factor of a specific learning concept.

The hidden layers consist of fully connected dense layers that transform the input knowledge state vector into higher-level feature representations. The hidden layer operation is defined as equation (49),

h = ReLU (W_{1} s_{t} + b_{1})

(49)

where

h \in R^{n}

represents the hidden layer activation vector,

W_{1} \in R^{n \times d}

is the weight matrix, and $b_{1} \in R^{n}$ is the bias vector. Equation (50) defines the Rectified Linear Unit (ReLU) activation function.

ReLU (x) = \max (0, x)

(50)

The ReLU activation introduces nonlinearity, allowing the network to learn complex patterns and interactions between student knowledge components and module effectiveness. Additional hidden layers may be used to improve representation learning and enhance model performance.
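The hidden layer operation in equations (49) and (50) can be sketched in plain Python. The weights, bias, and state below are hypothetical values with n = 3 hidden units and d = 2 state dimensions; a real implementation would use a tensor library instead.

```python
def relu(x):
    """Equation (50): ReLU(x) = max(0, x)."""
    return max(0.0, x)


def dense_relu(weights, bias, state):
    """Equation (49): h = ReLU(W s + b), with W given as n rows of d values."""
    return [
        relu(sum(w * s for w, s in zip(row, state)) + b)
        for row, b in zip(weights, bias)
    ]


# Hypothetical parameters and knowledge state.
W1 = [[1.0, -1.0], [0.5, 0.5], [-2.0, 0.0]]
b1 = [0.0, 0.25, 0.5]
s_t = [0.5, 1.0]

h = dense_relu(W1, b1, s_t)
```

Units whose pre-activation is negative are set to zero, which is the source of the nonlinearity described above.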

The output layer produces the Q-values corresponding to all possible learning module recommendations. It is defined as equation (51),

Q (s_{t}, :) = [Q (s_{t}, a_{1}), Q (s_{t}, a_{2}), \dots, Q (s_{t}, a_{M})]

(51)

where

M

is the number of modules in total. The anticipated benefit of suggesting a certain module is represented by each output. Equation (52) employs the ε-greedy strategy to strike a balance between exploration and exploitation.

a_{t} = \begin{cases} \text{random action} & \text{with probability } ϵ \\ \arg \max_{a} Q (s_{t}, a) & \text{with probability } 1 - ϵ \end{cases}

(52)

where ϵ represents the exploration rate. The random action enables exploration, while the greedy action selects the best-known module. Initially,

ϵ

is high to encourage exploration. Over time, it decays gradually according to equation (53),

ϵ_{t} = ϵ_{\min} + (ϵ_{\max} - ϵ_{\min}) e^{- k t}

(53)

where

k

is the decay rate. To improve training stability and efficiency, DQN uses an experience replay buffer, which stores past transitions, equation (54),

D = \{ (s_{t}, a_{t}, r_{t}, s_{t + 1}) \}

(54)

Instead of learning from consecutive samples, the agent randomly samples mini-batches, $(s_{i}, a_{i}, r_{i}, s_{i + 1}) \sim D$. The replay buffer breaks temporal correlations between samples, improves learning stability, increases sample efficiency by reusing past experiences, and mitigates catastrophic forgetting, which makes reinforcement learning more robust and stable. Figure 3 shows the DQN architecture diagram for adaptive path generation.

Figure 3.

DQN architecture for adaptive path generation.
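The experience replay buffer of equation (54) can be sketched as a fixed-capacity queue with uniform random sampling. The class name `ReplayBuffer`, the capacity, and the stored values are hypothetical and chosen only for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (s_t, a_t, r_t, s_{t+1}) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once it is full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks temporal ordering.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=5)
for step in range(8):
    buffer.add(state=step, action=step % 3, reward=0.1 * step, next_state=step + 1)

batch = buffer.sample(batch_size=3, rng=random.Random(0))
```

After eight insertions into a buffer of capacity five, only the five most recent transitions remain available for sampling.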

The DQN is trained through iterative interaction between the agent and the learning environment. First, an experience replay buffer D is created to store past transitions, and the Q-network parameters θ are initialized at random. At each time step t, the agent observes the current student knowledge state $s_{t}$ and chooses an action $a_{t}$ using the ε-greedy policy. After the module is completed, the environment returns the immediate reward $r_{t}$ and the next state $s_{t + 1}$, which reflects the student’s updated knowledge. The transition tuple ($s_{t}, a_{t}, r_{t}, s_{t + 1}$) is stored in the replay buffer. During training, random mini-batches are drawn from the replay buffer and the target value is calculated as $y_{t} = r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}; θ^{-})$, where $θ^{-}$ denotes the target network parameters and γ is the discount factor. The Q-network is then updated by gradient descent on the loss function L(θ). To keep learning stable, the target network parameters are periodically updated as $θ^{-} \leftarrow θ$. This process is repeated until the model converges to a stable recommendation policy. The optimal recommendation policy is defined as equation (55),

π^{*} (s) = \arg \max_{a} Q^{*} (s, a)

(55)

The policy selects the module that maximizes the expected cumulative reward. Convergence is assessed by the stabilization of Q-values, the peak reward reached, and the improvement in mastery progression over time.

The trained DQN yields a policy $π^{*} (s_{t})$ that selects the most appropriate learning module for each student based on their current learning state. This enables personalized recommendations, improves student mastery, and supports efficient learning progression. The agent continues to update its policy as new student interaction data arrive, so its recommendations remain adaptive over time. Two mechanisms stabilize training. First, experience replay stores each interaction as a transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ and draws random mini-batches from the buffer, which removes temporal correlations between samples and gives more consistent gradient updates. Second, a dedicated target network, periodically synchronized with the main network, is used to calculate the target Q-values. Together, experience replay and the target network support stable policy learning and consistent results in the adaptive learning framework.

Training uses an ε-greedy strategy to balance exploration and exploitation. The agent chooses a random action with probability ϵ and the action with the highest Q-value with probability 1 - ϵ. Training starts with a high ϵ, which lets the agent try a wide range of educational modules. ϵ is then reduced according to a predefined schedule, so random exploration decreases and the agent relies increasingly on the learned policy. This decay provides the exploration needed early in training while permitting stable learning and high-quality decisions later on.
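The ε-greedy rule of equation (52) and the exponential decay of equation (53) can be sketched together. The values of ϵ_min, ϵ_max, and the decay rate k below are hypothetical and are not the settings used in the study.

```python
import math
import random


def epsilon_at(step, eps_min, eps_max, decay_rate):
    """Equation (53): eps_t = eps_min + (eps_max - eps_min) * exp(-k * t)."""
    return eps_min + (eps_max - eps_min) * math.exp(-decay_rate * step)


def select_action(q_values, epsilon, rng):
    """Equation (52): random module with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(42)
q_values = [0.1, 0.7, 0.3]  # hypothetical Q-values for three modules

eps_start = epsilon_at(0, eps_min=0.05, eps_max=1.0, decay_rate=0.01)
eps_late = epsilon_at(1000, eps_min=0.05, eps_max=1.0, decay_rate=0.01)
greedy_action = select_action(q_values, epsilon=0.0, rng=rng)
```

With these illustrative settings the agent starts fully exploratory and ends close to ϵ_min, at which point it almost always picks the module with the highest Q-value.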

Adaptive learning path generation

The Adaptive Learning Path Generation module uses a deep reinforcement learning architecture built on the DQN algorithm to recommend individual learning modules based on each student’s knowledge state. The aim is to learn a policy that maximizes student mastery progress while keeping learning efficient and adaptive.

The state space is the student’s knowledge state obtained from the DKT model. At time step $t$, the state $s_{t} \in R^{M}$ consists of the predicted mastery probabilities ${\hat{y}}_{t, j} \in [0, 1]$ for each learning module $j$, where $M$ is the total number of modules. This vector is a compact summary of the student’s overall proficiency, strengths, and weaknesses across all learning concepts, which allows the reinforcement learning agent to choose the most suitable next module.

The action space represents the selection of the next learning module to recommend. At each time step $t$, the agent chooses an action $a_{t} \in A = \{1, 2, 3, \dots, M\}$, where $a_{t}$ denotes the recommended module and $M$ is the total number of available modules. This allows the reinforcement learning agent to recommend the most appropriate module based on the student’s current knowledge state.

The optimal action is selected using the learned Q-function, equation (56),

a_{t} = \arg \max_{a \in A} Q (s_{t}, a; θ)

(56)

where θ stands for the neural network parameters and

Q (s_{t}, a; θ)

is the estimated action-value function. The reward function assesses how effective the suggested learning module is in terms of mastery improvement, engagement, and learning efficiency. Equation (57) defines the reward at time step t.

r_{t} = α Δ m_{t} + β e_{t} - γ c_{t}

(57)

where

r_{t}

is the immediate reward,

Δ m_{t} = \frac{1}{M} \sum_{j = 1}^{M} ({\hat{y}}_{t + 1, j} - {\hat{y}}_{t, j})

represents mastery improvement,

e_{t}

represents student engagement score,

c_{t}

represents learning cost (time or effort) and

α, β, γ

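are weighting coefficients; they are weights specific to the reward and are not to be confused with the learning rate α and discount factor γ in equation (46). This reward formulation encourages mastery improvement while penalizing inefficient learning sequences.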
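Equation (57) can be illustrated with a short sketch. The parameter names `w_mastery`, `w_engage`, and `w_cost` stand in for the reward weights α, β, and γ, and all numeric values are hypothetical.

```python
def mastery_gain(prev_mastery, next_mastery):
    """Average change in predicted mastery over all M modules."""
    diffs = [n - p for p, n in zip(prev_mastery, next_mastery)]
    return sum(diffs) / len(prev_mastery)


def reward(prev_mastery, next_mastery, engagement, cost,
           w_mastery, w_engage, w_cost):
    """Equation (57): weighted mastery gain plus engagement minus cost."""
    gain = mastery_gain(prev_mastery, next_mastery)
    return w_mastery * gain + w_engage * engagement - w_cost * cost


# Hypothetical mastery vectors before and after one recommended module (M = 4).
prev = [0.5, 0.5, 0.5, 0.5]
nxt = [0.75, 0.5, 0.5, 0.75]

r_t = reward(prev, nxt, engagement=0.5, cost=0.25,
             w_mastery=1.0, w_engage=0.5, w_cost=0.5)
```

Raising the cost weight makes long or effortful modules less attractive unless they produce a correspondingly larger mastery gain.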

After executing action $a_{t}$ , the student state is updated using the Knowledge Tracing model represented as equation (58),

s_{t + 1} = f (s_{t}, a_{t}, r_{t})

(58)

where

s_{t + 1}

is the updated student state and

f (\cdot)

represents the DKT state transition function. The learning objective is achieved when the average mastery reaches the mastery threshold β:

\frac{1}{M} \sum_{j = 1}^{M} {\hat{y}}_{t, j} \geq β

The remediation and acceleration mechanism ensures adaptive learning progression based on student mastery levels. Remediation is triggered when the predicted mastery probability ${\hat{y}}_{t, j} < τ$ , where $τ$ is the remediation threshold, prompting reinforcement of prerequisite concepts. Conversely, acceleration occurs when ${\hat{y}}_{t, j} \geq β$ , where $β$ is the mastery goal threshold, enabling advancement to higher-level modules and improving learning efficiency.
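The threshold rules above can be sketched as follows. The values of τ and β are hypothetical, and how the flagged modules are then scheduled is left out because it is not specified here.

```python
def classify_modules(mastery, tau, beta):
    """Split module indices into remediation and acceleration candidates."""
    remediate = [j for j, y in enumerate(mastery) if y < tau]
    accelerate = [j for j, y in enumerate(mastery) if y >= beta]
    return remediate, accelerate


def objective_reached(mastery, beta):
    """Stopping rule: average predicted mastery is at least beta."""
    return sum(mastery) / len(mastery) >= beta


# Hypothetical mastery vector for four modules.
mastery = [0.25, 0.75, 0.5, 1.0]
remediate, accelerate = classify_modules(mastery, tau=0.5, beta=0.75)
done = objective_reached(mastery, beta=0.75)
```

Modules with mastery between τ and β fall into neither list and are left to the learned policy.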

The DQN is used to approximate the optimal action-value function, $Q (s_{t}, a_{t}; θ) \approx Q^{*} (s_{t}, a_{t})$, where $Q (s_{t}, a_{t}; θ)$ is the predicted Q-value parameterized by the network weights θ and $Q^{*} (s_{t}, a_{t})$ is the optimal action-value function. Given the student’s current knowledge state, the DQN learns to estimate the expected cumulative reward of recommending each learning module. The network has three main parts: the input layer, the hidden layers, and the output layer. The student state vector, $s_{t} \in R^{M}$, is fed to the input layer.

The hidden layers are fully connected dense layers that use nonlinear activation functions to convert the input state into higher-level feature representations.

The transformations are defined as equations (59) and (60),

h_{1} = σ (W_{1} s_{t} + b_{1})

(59)

h_{2} = σ (W_{2} h_{1} + b_{2})

(60)

where σ is the Rectified Linear Unit (ReLU) activation function,

b_{1}

and

b_{2}

are bias vectors, and

W_{1}

and

W_{2}

are weight matrices. By learning nonlinear relationships between student knowledge states and expected learning outcomes, these hidden layers help the network make more accurate decisions. According to equation (61), the output layer generates the Q-values for every available action (learning module).

Q (s_{t}, :) = W_{3} h_{2} + b_{3}

(61)

where

W_{3}

and

b_{3}

are the output layer weights and biases. Each output value corresponds to the expected cumulative reward of recommending a specific learning module. The RL agent selects the action with the highest Q-value, enabling optimal personalized learning path recommendations. To improve training stability, an experience replay buffer

D

stores transitions, equation (62),

D = \{ (s_{t}, a_{t}, r_{t}, s_{t + 1}) \}

(62)

Mini-batches are sampled randomly to break temporal correlations.

A target network with parameters $θ^{-}$ is used to compute stable target Q-values, which is represented as equation (63),

y_{t} = r_{t} + γ \max_{a^{'}} Q (s_{t + 1}, a^{'}; θ^{-})

(63)

where

γ

is the discount factor and

y_{t}

is the target Q-value. The DQN is trained by minimizing the loss function, equation (64),

L (θ) = E [{(y_{t} - Q (s_{t}, a_{t}; θ))}^{2}]

(64)

The reinforcement learning agent was trained using the DQN algorithm with hyperparameters chosen for efficient and stable learning. Training ran for 1000 episodes to allow sufficient exploration and policy optimization. A batch size of 64 was used to stabilize gradient updates, with a learning rate of 0.001. The discount factor γ = 0.99 led the agent to favor long-term mastery improvement over immediate rewards. An experience replay buffer with a capacity of 10,000 transitions was used to stabilize training, improve sample efficiency, and reduce temporal correlation between samples. The agent converged at episode 780, indicating that the learned policy had stabilized. After training, the agent produces an adaptive learning trajectory, equation (65),

P = \{ a_{1}, a_{2}, a_{3}, \dots, a_{T} \}

(65)

in which

a_{t}

is the action at step t, that is, the recommended next learning module expected to best improve mastery and overall learning efficiency. The framework thus delivers dynamic, personalized learning by continually adapting module recommendations to the student’s current knowledge state, which supports remediation of weak areas and faster progress through mastered modules, and ultimately improves learning outcomes and efficiency. The step-by-step DQN procedure for Adaptive Learning Path Generation is shown in Algorithm 2.

The proposed framework couples Knowledge Tracing (KT) and Reinforcement Learning (RL) in a sequential, iterative loop. The DKT model processes student interaction data to produce mastery probability vectors that describe the current knowledge status. The RL agent receives these vectors as input states and chooses the most suitable learning module. When the student works through the suggested module, new performance data are generated and the DKT model updates the mastery state. This continuous exchange lets the prediction and decision-making components work together to optimize personalized learning paths.

Multi-institution simulation

A multi-institution simulation is carried out to test the robustness, scalability, and generalization of the proposed adaptive learning system. The simulation examines whether the KT- and RL-based system can personalize learning paths across varied educational settings, diverse learners, and different module structures. This evaluation matters for real-world deployment, where the system must work reliably in institutions with different curricula, engagement patterns, and performance characteristics.

The dataset is partitioned into institutional groups by module clusters, course categories, or simulated institutional identifiers. The entire dataset is expressed as equation (66),

D = \{ D_{A}, D_{B}, D_{C} \}

(66)

where

D_{A}

is the data of Institution A (training) and

D_{B}, D_{C}

the data of Institutions B and C (testing). Each dataset includes student interaction records, module activity records, quiz scores, and mastery progression. The training/testing split is given as equation (67),

D_{train} = D_{A}, D_{test} = \{ D_{B}, D_{C} \}

(67)

This allows the model to be tested on previously unobserved institutional settings and thus assesses its ability to generalize across institutions. The Knowledge Tracing model is first trained using student interaction sequences from the training institution $D_{A}$. The model learns the mapping $f_{K T} : X \to S$, where $X$ represents student interaction sequences and $S$ represents mastery probability vectors.

After training, the KT model is applied directly to student data from Institutions B and C without retraining. The predicted mastery vectors are given by equation (68),

S_{B} = f_{K T} (X_{B}), S_{C} = f_{K T} (X_{C})

(68)

This evaluates whether the KT model can accurately estimate student knowledge states in previously unseen institutional contexts.
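The institution-level split of equation (67) can be sketched as below. The field name `institution` and the example records are hypothetical; in the study the groups are derived from module clusters, course categories, or simulated identifiers.

```python
def split_by_institution(records, train_ids, test_ids):
    """Return training records and a per-institution dict of held-out records."""
    train = [r for r in records if r["institution"] in train_ids]
    test = {
        inst: [r for r in records if r["institution"] == inst]
        for inst in test_ids
    }
    return train, test


# Hypothetical interaction records tagged with an institution label.
records = [
    {"student_id": "s1", "institution": "A", "module_id": "m1"},
    {"student_id": "s2", "institution": "B", "module_id": "m1"},
    {"student_id": "s3", "institution": "A", "module_id": "m2"},
    {"student_id": "s4", "institution": "C", "module_id": "m3"},
]

d_train, d_test = split_by_institution(records, train_ids={"A"}, test_ids=["B", "C"])
```

Keeping B and C in separate test groups makes it possible to report results for each unseen institution individually.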

Similarly, the Reinforcement Learning agent is trained using interaction data and mastery states derived from Institution A. The RL agent learns an optimal policy, equation (69),

π^{*} = \arg \max_{π} E [\sum_{t = 0}^{T} γ^{t} r_{t}]

(69)

where

π^{*}

is the optimal adaptive learning policy,

r_{t}

is the reward representing mastery improvement and

γ

is the discount factor. The trained policy

π^{*}

is then applied to Institutions B and C,

a_{t} = π^{*} (s_{t})

(70)

This measures how effectively the learned policy prescribes learning paths in different institutional settings. System performance is measured with metrics for mastery prediction accuracy (RMSE, MSE, AUC, accuracy), learning gain (LG, normalized LG), RL policy effectiveness (cumulative and average rewards, convergence), student engagement (time spent, revisit rate, dropout, interaction), adaptive learning path quality (efficiency, recommendation accuracy), and cross-institution generalization (learning gain, accuracy, generalization gap). Higher learning gain, reward, engagement, and path efficiency, together with lower prediction error and a smaller generalization gap, indicate robust and scalable performance. The findings show that the framework estimates student mastery reliably, recommends personalized learning paths, and sustains engagement across educational settings, which supports its applicability to large-scale adaptive learning deployments.

Experimental setup

Dataset overview

The data used in this study come from the Student Learning Interaction Logs Dataset, which is openly accessible on Kaggle (https://www.kaggle.com/datasets/ziya07/student-learning-interaction-logs-dataset). The dataset is simulated but realistic longitudinal data of student interactions in a virtual learning environment, intended to support research on adaptive learning, student modeling, and personalized education systems; it is summarized in Table 1. It comprises over 9000 learning sessions from 300 students and covers a range of educational modules and topics. Each record corresponds to one learning session and contains behavioral, engagement, and performance measures, including student ID, session time, module ID, quiz score, assignment score, attempts, time in session, engagement measures, and learning progress measures. The data are a time series: they preserve the temporal order of student interactions, which suits sequential modeling with Knowledge Tracing techniques. The dataset also provides key indicators such as success_label, which denotes whether a student has mastered a module, and next_module_prediction, which supports adaptive path optimization through reinforcement learning. Because it contains both performance and engagement features, the proposed framework can model behavioral learning patterns in addition to cognitive mastery. The dataset is therefore well suited to developing and testing the proposed framework: it allows estimation of student knowledge states, training of the reinforcement learning policy, and simulation of multi-institutional behavior through data partitioning.
To study generalization, the student interaction data are partitioned into multiple simulated institutional settings. The modules span analytical, quantitative, and conceptual subjects. Student interaction records are organized by session time, which shows how students engaged with different modules over the course of their learning, and each record contains performance, behavioral, and engagement data. Chronological ordering ensures temporal consistency. Missing numerical values are handled with median imputation, which limits the effect of outliers, and categorical temporal features are forward-filled to maintain sequence continuity and preserve realistic learning progressions.

Table 1.

Dataset overview.

Attribute	Description
Dataset Name	Student Learning Interaction Logs Dataset
Dataset Type	Time-series educational interaction dataset
Number of Students	300
Number of Sessions	9000+
Number of Features	22
Label Variable	success_label
Key Features	quiz_score, assignment_score, time_spent_minutes, attempts_taken, attention_score
Categorical Features	student_id, session_id, module_id, feedback_type
Format	CSV
Memory Size	∼1.2 MB

The dataset contains more than 9000 interaction sessions collected across modules that involve different types of learning activities. Each session records a student’s interaction with a module, which provides enough temporal information and variety in learning patterns to model adaptive learning behavior.

Data preprocessing

The raw student interaction data are passed through a preprocessing pipeline to make them compatible with the sequential KT and RL models. Because the dataset contains temporal, numerical, and categorical variables describing student learning behavior, preprocessing is needed to improve data quality, enforce temporal consistency, and produce structured learning sequences that can be fed to deep learning models.

Handling missing values

Educational data are often incomplete because of missing sessions, unattempted quizzes, or system logging anomalies. To address this, an imputation strategy is chosen according to feature type. Missing values of numerical variables such as quiz_score, assignment_score, time_spent_minutes, attention_score, and attempts_taken are filled by median imputation, which is robust to outliers and limits distortion of the data distribution. Median imputation is given by equation (71),

x_{i} = \begin{cases} x_{i}, & \text{if } x_{i} \text{ is available} \\ \text{median} (X), & \text{if } x_{i} \text{ is missing} \end{cases}

(71)

where

X

is the set of valid observations of the given feature. For categorical variables such as module_id, feedback_type, and the revisit flag, missing values are handled by forward-fill imputation to maintain the continuity of the time series, equation (72),

c_{t} = \begin{cases} c_{t}, & \text{if available} \\ c_{t - 1}, & \text{otherwise} \end{cases}

(72)

If no previous value is available, mode imputation is used,

c_{i} = mode (C)

where $C$ is the set of observed values of the feature. This keeps categorical features logically aligned within learning sequences.
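The imputation rules in equations (71) and (72) can be sketched with the standard library. Missing entries are represented here as `None`, which is an assumption about the data encoding, and each feature is assumed to have at least one observed value.

```python
from statistics import median, mode


def impute_median(values):
    """Equation (71): replace missing numeric values with the feature median."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def impute_forward_fill(values):
    """Equation (72): carry the previous value forward; use the mode at the start."""
    observed = [v for v in values if v is not None]
    fallback = mode(observed)
    filled, last = [], None
    for v in values:
        if v is None:
            v = last if last is not None else fallback
        filled.append(v)
        last = v
    return filled


# Hypothetical quiz scores and feedback types for one student, in session order.
quiz = impute_median([0.5, None, 0.9, 0.7])
feedback = impute_forward_fill([None, "hint", None, "hint", "video", None])
```

Forward-fill must be applied after the records are in chronological order, since it depends on which value came immediately before.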

Feature normalization

The features have different numerical scales, so normalization is needed to prevent features with larger magnitudes from dominating the learning process. Numerical features are scaled to the range [0,1] using min-max normalization, which improves the convergence and stability of deep learning models.

The normalized feature value is calculated as equation (73),

x^{'} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(73)

where

x

is the original feature value,

x_{\min}

and

x_{\max}

are the minimum and maximum values of the feature, and

x^{'}

is the normalized feature value. With all features on a common scale, each contributes in a balanced way to the Knowledge Tracing and Reinforcement Learning models, which supports prediction accuracy and adaptive recommendation learning.
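Equation (73) can be sketched in a few lines. The guard for a constant feature is an added assumption, since the equation is undefined when the minimum and maximum coincide; the sample values are hypothetical.

```python
def min_max_normalize(values):
    """Equation (73): scale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical session durations in minutes.
time_spent = [10.0, 20.0, 40.0, 50.0]
scaled = min_max_normalize(time_spent)
```

In a train/test setting such as the multi-institution split, the minimum and maximum would normally be taken from the training data only and reused on the held-out data.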

Temporal sequencing and ordering

Student interaction records are sorted chronologically by timestamp, since Knowledge Tracing models depend on sequential learning behavior. The interaction sequence of each student $u$ is ordered as equation (74),

S_{u} = \{ (m_{1}, s_{1}, t_{1}), (m_{2}, s_{2}, t_{2}), \dots, (m_{T}, s_{T}, t_{T}) \}

(74)

in which

m_{t}

is the module accessed at time t,

s_{t}

is the performance score, and $t_{t}$ is the corresponding timestamp. This ensures that learning progression is tracked correctly, so that the Knowledge Tracing model can estimate changes in mastery over time.

Each student sequence forms a learning trajectory, equation (75),

S_{u} = \{ x_{1}, x_{2}, x_{3}, \dots, x_{T} \}

(75)

where each interaction vector

x_{t}

contains:

x_{t} =

[ module_id, quiz_score, assignment_score, time_spent, attempts_taken, attention_score].
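The grouping and chronological ordering described above can be sketched as follows. The field name `timestamp` is a hypothetical stand-in for the dataset’s session time column, and the records are illustrative.

```python
from collections import defaultdict


def build_sequences(records):
    """Group records by student and sort each group by timestamp."""
    by_student = defaultdict(list)
    for rec in records:
        by_student[rec["student_id"]].append(rec)
    return {
        sid: sorted(recs, key=lambda r: r["timestamp"])
        for sid, recs in by_student.items()
    }


# Hypothetical records arriving out of order.
records = [
    {"student_id": "s1", "timestamp": 3, "module_id": "m2", "quiz_score": 0.6},
    {"student_id": "s2", "timestamp": 1, "module_id": "m1", "quiz_score": 0.4},
    {"student_id": "s1", "timestamp": 1, "module_id": "m1", "quiz_score": 0.5},
]

sequences = build_sequences(records)
```

Each value in `sequences` is one student trajectory in the sense of equation (75).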

Sliding window sequence generation

To facilitate time-series modeling and improve training efficiency, sliding window segmentation is used to create fixed-length learning sequences. This splits long sequences into shorter ones that can be trained in batches by LSTM-based Knowledge Tracing models.

Given a student sequence $S_{u}$ , sliding window subsequences are obtained as equation (76),

S_{i} = \{ (m_{t}, s_{t}), (m_{t + 1}, s_{t + 1}), \dots, (m_{t + w - 1}, s_{t + w - 1}) \}

(76)

where w is the window size,

m_{t}

is the module at time step t, and

s_{t}

is the student performance score. Each sliding window captures a partial history of learning and mastery development. The number of subsequences generated from a sequence of length T is

N = T - w + 1

This improves model generalization by providing several training samples from each student trajectory.

The window of size w moves forward one step at a time over a sequence of length T, producing overlapping subsequences that preserve the original order of learning activities. The choice of window size sets how much history each sample carries: smaller windows capture short-term patterns, whereas larger windows capture longer-term dependencies. Fixed-length windows also allow efficient batch training of the LSTM-based knowledge tracing model.
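The sliding window segmentation of equation (76) can be sketched as below. The sequence and window size are hypothetical; sequences shorter than the window are simply skipped here, which is an assumption about how short trajectories are handled.

```python
def sliding_windows(sequence, window_size):
    """Return all overlapping subsequences of length window_size, stride 1."""
    if window_size > len(sequence):
        return []  # assumption: too-short sequences yield no windows
    return [
        sequence[i:i + window_size]
        for i in range(len(sequence) - window_size + 1)
    ]


# Hypothetical (module, score) pairs for one student, T = 5.
sequence = [("m1", 0.4), ("m1", 0.6), ("m2", 0.5), ("m2", 0.7), ("m3", 0.9)]
windows = sliding_windows(sequence, window_size=3)
```

With T = 5 and w = 3 this yields N = T - w + 1 = 3 windows, each preserving the original order of interactions.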

Feature encoding and vector construction

Categorical variables such as module_id and feedback_type are converted to numerical form by integer encoding or one-hot encoding. Each student interaction is then represented as a feature vector, equation (77),

x_{t} = [m_{t}, q_{t}, a_{t}, ts_{t}, at_{t}, att_{t}]

(77)

where

m_{t}

is the module ID,

q_{t}

is the quiz score,

a_{t}

is the assignment score,

ts_{t}

is the time spent,

at_{t}

is the number of attempts taken, and $att_{t}$ is the attention score.

This structured vector representation can be fed directly into deep neural networks. After preprocessing, the dataset consists of cleaned, normalized, and temporally ordered student interaction sequences suitable for Knowledge Tracing and Reinforcement Learning.

The final output consists of structured sequences:

D = \{ S_{1}, S_{2}, \dots, S_{N} \}

(78)

where each sequence represents a student learning trajectory. These sequences serve as input to the Knowledge Tracing model for mastery estimation and provide the state representation required for adaptive learning path optimization using Reinforcement Learning.
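The encoding and vector construction of equation (77) can be sketched as follows. The column names follow Table 1; the helper names and the record values are hypothetical, and the numeric features are assumed to be already normalized.

```python
def build_vocab(module_ids):
    """Integer encoding: map each distinct module_id to an index."""
    return {m: i for i, m in enumerate(sorted(set(module_ids)))}


def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]


def encode_interaction(rec, vocab, use_one_hot=False):
    """Build x_t from one record: module encoding followed by numeric features."""
    idx = vocab[rec["module_id"]]
    module_part = one_hot(idx, len(vocab)) if use_one_hot else [float(idx)]
    return module_part + [
        rec["quiz_score"],
        rec["assignment_score"],
        rec["time_spent_minutes"],
        rec["attempts_taken"],
        rec["attention_score"],
    ]


# Hypothetical normalized record.
rec = {
    "module_id": "m2", "quiz_score": 0.8, "assignment_score": 0.7,
    "time_spent_minutes": 0.5, "attempts_taken": 0.25, "attention_score": 0.9,
}
vocab = build_vocab(["m1", "m2", "m1"])
x_int = encode_interaction(rec, vocab)
x_hot = encode_interaction(rec, vocab, use_one_hot=True)
```

Integer encoding keeps the vector at six entries as in equation (77), while one-hot encoding avoids implying an ordering among modules at the cost of a longer vector.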

Software and hardware requirements

All experiments were carried out in Python 3.10, chosen for its broad support of ML and DL libraries. The LSTM-based DKT model and the deep reinforcement learning components were built with TensorFlow 2.x and Keras, which provide flexible tools for constructing and training neural networks. The DQN agent was implemented with TensorFlow-Agents and custom Python modules for environment interaction and policy optimization. NumPy and Pandas were used for data processing, feature engineering, and numerical computation, and Scikit-learn for data splitting, normalization, and evaluation metrics. Matplotlib and Seaborn were used for visualization and result analysis, including plots of learning gain, engagement, and efficiency improvement. The experimental workflow was run in Jupyter Notebook and Google Colab environments to support reproducibility and efficient model training. The experiments were performed on a system with an Intel Core i7 (or similar) processor, 16 GB of RAM, and a 512 GB SSD, running Windows 11 or Ubuntu Linux. GPU acceleration on a CUDA-enabled NVIDIA GPU was used to speed up neural network training; the framework can nevertheless run on CPU-only systems because of the moderate dataset size and optimized model configurations. Together, these tools form a scalable experimental pipeline for model development, training, and evaluation.

Hyperparameter configuration

The effectiveness of the proposed adaptive learning model depends on the configuration of both the DKT model and the DQN reinforcement learning agent. Baseline models were first trained with default hyperparameter settings to establish reference performance. The hyperparameters were then tuned through empirical experimentation and validation analysis to obtain the best prediction accuracy and policy convergence. For the DKT model, the number of LSTM units, learning rate, batch size, dropout rate, and number of training epochs were tuned to model the temporal evolution of knowledge. The tuning objective was to minimize prediction error and maximize mastery prediction accuracy, measured by RMSE and AUC-ROC. For the DQN agent, the hyperparameters were tuned for stable learning and efficient convergence. The sensitive parameters were the learning rate, discount factor (γ), replay buffer size, batch size, exploration rate (ε), and target network update frequency, which were tuned to maximize cumulative reward and learning gain. Experience replay and an ε-greedy strategy were used to stabilize learning and to balance exploration and exploitation during training. The final hyperparameter values for the DKT model and the DQN agent are summarized in Table 2.

Table 2.

Hyperparameter configuration and optimal values for DKT and DQN models.

Model	Hyperparameter	Search range	Optimal value
Deep Knowledge Tracing (DKT)	LSTM units	32, 64, 128, 256	128
	Learning rate	0.01, 0.001, 0.0005	0.001
	Batch size	32, 64, 128	64
	Dropout rate	0.1, 0.2, 0.3, 0.5	0.2
	Epochs	20, 50, 100	50
	Optimizer	Adam, RMSProp	Adam
Deep Q-Network (DQN)	Learning rate	0.01, 0.001, 0.0005	0.001
	Discount factor (γ)	0.90, 0.95, 0.99	0.99
	Batch size	32, 64, 128	64
	Replay buffer size	5,000, 10,000, 50,000	10,000
	Exploration rate (ε)	1.0 → 0.01 decay	1.0 → 0.05
	Target network update frequency	50, 100, 200	100
	Training episodes	500, 800, 1000	1000
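For illustration, the optimal values in Table 2 can be collected into plain configuration objects. The sketch below is not the authors' code: the names `DKT_CONFIG`, `DQN_CONFIG`, and `epsilon_at` are hypothetical, and the linear shape of the ε decay is an assumption, since Table 2 gives only the start and end values (1.0 → 0.05).

```python
# Optimal hyperparameters from Table 2, stored as plain dictionaries.
# The dictionary names and keys are illustrative, not the authors' identifiers.
DKT_CONFIG = {
    "lstm_units": 128,
    "learning_rate": 0.001,
    "batch_size": 64,
    "dropout_rate": 0.2,
    "epochs": 50,
    "optimizer": "adam",
}

DQN_CONFIG = {
    "learning_rate": 0.001,
    "gamma": 0.99,  # discount factor
    "batch_size": 64,
    "replay_buffer_size": 10_000,
    "epsilon_start": 1.0,
    "epsilon_end": 0.05,
    "target_update_frequency": 100,
    "training_episodes": 1000,
}


def epsilon_at(episode: int, config: dict = DQN_CONFIG) -> float:
    """Exploration rate for a given 0-based episode index.

    ASSUMPTION: linear decay over all training episodes; the paper does not
    state the decay schedule.
    """
    last = config["training_episodes"] - 1
    fraction = min(max(episode, 0), last) / last
    span = config["epsilon_end"] - config["epsilon_start"]
    return config["epsilon_start"] + fraction * span


if __name__ == "__main__":
    for ep in (0, 500, 999):
        print(ep, round(epsilon_at(ep), 3))
```

Keeping the values in one place makes it easier to check that a training run actually uses the configuration reported in the table.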

Cross-institutional evaluation requires the dataset to be partitioned by institution. The model was trained on data from some institutions and tested on previously unseen data from other institutions, so that each test partition simulates deployment at a new institution. This setup captures the variation in student activity, course design, and student interaction that arises across educational institutions. Performance metrics were averaged over multiple partitions to obtain reliable estimates and reduce measurement error. The evaluation configuration therefore tests whether the model generalizes across educational environments while maintaining its performance.
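One common way to realize such a protocol is a leave-one-institution-out split, sketched below. This is an assumption about the partitioning, not a description of the authors' exact procedure; the record layout (dictionaries with an `institution` key) and both function names are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def leave_one_institution_out(records):
    """Yield (held_out, train, test) partitions, one per institution.

    `records` is a list of dicts with an "institution" key (hypothetical layout).
    """
    by_institution = defaultdict(list)
    for record in records:
        by_institution[record["institution"]].append(record)

    for held_out in sorted(by_institution):
        test = by_institution[held_out]
        train = [
            row
            for name, rows in by_institution.items()
            if name != held_out
            for row in rows
        ]
        yield held_out, train, test


def average_over_partitions(records, evaluate):
    """Average a metric over all partitions.

    `evaluate(train, test)` is a user-supplied callable returning a float.
    """
    return mean(
        evaluate(train, test)
        for _, train, test in leave_one_institution_out(records)
    )


if __name__ == "__main__":
    # Toy records only; they are not taken from the study's datasets.
    toy = [
        {"institution": "A", "student": 1},
        {"institution": "A", "student": 2},
        {"institution": "B", "student": 3},
        {"institution": "C", "student": 4},
    ]
    for held_out, train, test in leave_one_institution_out(toy):
        print(held_out, len(train), len(test))
```

Splitting by institution rather than by student ensures that no data from the test institution leaks into training.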

Evaluation metrics

The proposed adaptive learning model is evaluated with a detailed set of metrics covering Knowledge Tracing accuracy, Reinforcement Learning policy effectiveness, student learning progress, engagement, and cross-institution generalization.

Knowledge tracing performance metrics

The KT model is evaluated on how well it predicts student mastery levels across modules over time. Since KT is fundamentally a sequential prediction problem, both regression-based and classification-based evaluation metrics are used.

RMSE

RMSE measures the average deviation between predicted mastery probabilities and actual student performance, with lower values indicating higher prediction accuracy. It is defined in equation (79),

\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i = 1}^{N} {({\hat{y}}_{i} - y_{i})}^{2}}

(79)

where {\hat{y}}_{i} represents the predicted mastery probability, y_{i} denotes the actual student performance outcome, and N denotes the total number of observations.

MSE

MSE quantifies the mean squared difference between predicted and actual mastery. It penalizes larger prediction errors more heavily and provides insight into prediction stability. It is defined in equation (80),

\mathrm{MSE} = \frac{1}{N} \sum_{i = 1}^{N} {({\hat{y}}_{i} - y_{i})}^{2}

(80)
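Equations (79) and (80) can be written directly in a few lines of Python. The following is a minimal sketch with toy inputs; the function names are illustrative and the values are not data from the study.

```python
import math


def mse(predicted, actual):
    """Equation (80): mean squared error between predictions and outcomes."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)


def rmse(predicted, actual):
    """Equation (79): the square root of the MSE."""
    return math.sqrt(mse(predicted, actual))


if __name__ == "__main__":
    # Toy values for illustration only.
    y_hat = [0.9, 0.2, 0.7, 0.4]
    y = [1, 0, 1, 0]
    print(mse(y_hat, y), rmse(y_hat, y))
```

Because RMSE is the square root of MSE, the two reported values should agree with each other; for example, an RMSE of 0.1569 corresponds to an MSE of about 0.0246.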

AUC-ROC

AUC measures the model’s ability to distinguish mastered from non-mastered modules; values closer to 1 indicate better classification performance. It is defined in equation (81),

\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR}) \, d(\mathrm{FPR})

(81)

where TPR represents the True Positive Rate and FPR represents the False Positive Rate.
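The integral in equation (81) equals the probability that a randomly chosen mastered example receives a higher score than a randomly chosen non-mastered one, which gives a simple way to compute AUC without constructing the ROC curve. The sketch below uses this pairwise formulation; it is an illustration with a hypothetical function name, not the evaluation code used in the study.

```python
def auc_roc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as one half.
    Labels are 1 (mastered) or 0 (not mastered).
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

The double loop is quadratic in the number of examples, so it is suitable for small checks only; for large datasets a rank-based or library implementation is preferable.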

Accuracy

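Accuracy indicates the proportion of correctly predicted mastery states among all predictions; higher values reflect better mastery prediction. It is defined in equation (82),

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(82)

where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.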
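To apply equation (82), predicted mastery probabilities first have to be converted into binary mastery states. The sketch below assumes a decision threshold of 0.5, which is a common default but is not stated in the text; the function names are illustrative.

```python
def confusion_counts(probabilities, labels, threshold=0.5):
    """Return (TP, TN, FP, FN) after thresholding mastery probabilities.

    ASSUMPTION: a probability >= threshold is treated as "mastered".
    """
    tp = tn = fp = fn = 0
    for p, y in zip(probabilities, labels):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        elif predicted == 1 and y == 0:
            fp += 1
        else:
            fn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Equation (82): correct predictions divided by all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)
```

Unlike AUC, accuracy depends on the chosen threshold, so the two metrics can rank models differently.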

Learning gain metrics

Learning gain measures the improvement in student mastery levels resulting from adaptive learning path recommendations. This is a key indicator of educational effectiveness.

Learning gain (LG)

LG measures the improvement in student mastery after adaptive learning; higher values indicate more effective knowledge enhancement. It is defined in equation (83),

LG = \mathrm{APR}_{post} - \mathrm{APR}_{pre}

(83)

where \mathrm{APR}_{pre} is the average mastery probability before adaptive learning and \mathrm{APR}_{post} is the average mastery probability after adaptive learning.

Normalized learning gain

The normalized learning gain (NLG) scales the gain by the room for improvement left by the initial knowledge level, which allows a fair comparison between students whose initial mastery differs. It is defined in equation (84),

LG_{norm} = \frac{\mathrm{APR}_{post} - \mathrm{APR}_{pre}}{1 - \mathrm{APR}_{pre}}

(84)
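The difference between equations (83) and (84) is easiest to see with two hypothetical students. In the sketch below (illustrative function names, toy values), a student who moves from 0.2 to 0.6 and a student who moves from 0.8 to 0.9 have different raw gains but the same normalized gain, because each closes half of the remaining gap to full mastery.

```python
def learning_gain(apr_pre, apr_post):
    """Equation (83): raw improvement in average mastery probability."""
    return apr_post - apr_pre


def normalized_learning_gain(apr_pre, apr_post):
    """Equation (84): gain relative to the maximum possible improvement."""
    if apr_pre >= 1.0:
        raise ValueError("normalized gain is undefined when initial mastery is 1")
    return (apr_post - apr_pre) / (1.0 - apr_pre)


if __name__ == "__main__":
    # Toy values for illustration only.
    for pre, post in [(0.2, 0.6), (0.8, 0.9)]:
        print(pre, post, learning_gain(pre, post), normalized_learning_gain(pre, post))
```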

Reinforcement learning performance metrics

The Reinforcement Learning agent is assessed by its ability to learn policies that produce sensible learning paths.

Cumulative reward

The cumulative reward measures the total learning payoff obtained over an episode; it is higher when the policy is more effective and the learning trajectory is better optimized. It is defined in equation (85),

R_{total} = \sum_{t = 1}^{T} r_{t}

(85)

where r_{t} denotes the reward at time step t and T represents the episode length.

Average reward per episode

The average reward evaluates learning efficiency across episodes and reflects how consistently the RL agent performs. It is defined in equation (86),

R_{avg} = \frac{1}{E} \sum_{e = 1}^{E} R_{e}

(86)

where E denotes the number of episodes and R_{e} denotes the reward in episode e.

Policy convergence rate

Policy convergence measures how quickly the RL agent learns a stable optimal policy. It is evaluated by tracking the change in Q-values or cumulative reward over training iterations, as given in equation (87),

\Delta Q = | Q_{t} - Q_{t - 1} |

(87)

When \Delta Q becomes very small, the policy is considered converged.
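The three reinforcement learning metrics above can be sketched as follows. The convergence check is applied to episode returns rather than Q-values, and the `tolerance` and `patience` parameters are assumptions introduced here to make "very small" concrete; the paper does not state the threshold that was used.

```python
def cumulative_reward(step_rewards):
    """Equation (85): sum of the step rewards in one episode."""
    return sum(step_rewards)


def average_reward(episode_returns):
    """Equation (86): mean return over all episodes."""
    return sum(episode_returns) / len(episode_returns)


def convergence_episode(episode_returns, tolerance=0.5, patience=3):
    """1-based index of the episode at which returns become stable.

    ASSUMPTION: the policy is treated as converged once the absolute change
    between consecutive episode returns (equation (87), applied to returns)
    stays below `tolerance` for `patience` consecutive episodes.
    Returns None if this never happens.
    """
    stable = 0
    for e in range(1, len(episode_returns)):
        delta = abs(episode_returns[e] - episode_returns[e - 1])
        stable = stable + 1 if delta < tolerance else 0
        if stable >= patience:
            # The stable run starts at 0-based index e - patience,
            # which is episode e - patience + 1 in 1-based numbering.
            return e - patience + 1
    return None


if __name__ == "__main__":
    # Toy returns for illustration only.
    returns = [50.0, 54.0, 57.0, 57.2, 57.1, 57.3, 57.2]
    print(average_reward(returns), convergence_episode(returns))
```

Requiring several consecutive small changes guards against declaring convergence on a single flat step during exploration.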

Student engagement metrics

Student engagement metrics evaluate how effectively the adaptive system maintains student participation and interaction.

Time spent per module

This metric measures the average duration for which students actively engage with modules; higher values indicate stronger engagement. It is defined in equation (88),

\mathrm{Time}_{avg} = \frac{1}{N} \sum_{i = 1}^{N} \mathrm{time\_spent}_{i}

(88)

where N denotes the number of students.

Revisit rate

The revisit rate is the fraction of module interactions that are repeat visits; it indicates reinforcement of earlier material and active engagement in learning. It is defined in equation (89),

\text{Revisit Rate} = \frac{\text{Number of revisits}}{\text{Total module interactions}}

(89)

A moderate revisit rate indicates effective reinforcement of learning.

Dropout rate

The dropout rate measures student disengagement; lower values indicate better retention under adaptive learning. It is defined in equation (90),

\text{Dropout Rate} = \frac{\text{Number of students who stopped}}{\text{Total number of students}}

(90)

Interaction frequency

Interaction frequency reflects how actively students interact with the system; higher values imply more active participation. It is defined in equation (91),

\text{Interaction Rate} = \frac{\text{Total interactions}}{\text{Total learning sessions}}

(91)
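The four engagement metrics in equations (88) to (91) can all be derived from a single interaction log. The sketch below assumes a hypothetical log layout of `(student, session, module, minutes)` tuples, interprets the time spent by student i as that student's total minutes, and counts an interaction as a revisit when the same student has already opened the same module; none of these details are specified in the text.

```python
def engagement_metrics(interactions, dropped_students):
    """Compute equations (88)-(91) from an interaction log.

    interactions: list of (student, session, module, minutes) tuples
                  (hypothetical layout).
    dropped_students: iterable of students who stopped using the system.
    """
    time_per_student = {}
    seen_modules = set()
    sessions = set()
    revisits = 0

    for student, session, module, minutes in interactions:
        time_per_student[student] = time_per_student.get(student, 0.0) + minutes
        if (student, module) in seen_modules:
            revisits += 1  # same student returning to the same module
        seen_modules.add((student, module))
        sessions.add((student, session))

    dropped = set(dropped_students)
    all_students = set(time_per_student) | dropped

    return {
        "time_avg": sum(time_per_student.values()) / len(time_per_student),
        "revisit_rate": revisits / len(interactions),
        "dropout_rate": len(dropped) / len(all_students),
        "interaction_rate": len(interactions) / len(sessions),
    }
```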

Adaptive learning path effectiveness metrics

To evaluate the quality of generated adaptive learning paths, the following metrics are used:

Path efficiency

Path efficiency measures how effectively the recommended modules improve mastery; higher values indicate more efficient learning paths. It is defined in equation (92),

\text{Efficiency} = \frac{\text{Mastery improvement}}{\text{Number of modules recommended}}

(92)

Recommendation accuracy

Recommendation accuracy is the fraction of recommendations that led to a mastery improvement; higher values reflect better policy decisions. It is defined in equation (93),

\text{Recommendation Accuracy} = \frac{\text{Successful recommendations}}{\text{Total recommendations}}

(93)
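Equations (92) and (93) are simple ratios, sketched below with illustrative function names. The sketch assumes that a recommendation counts as successful when the mastery estimate increases after the recommended module, which is one reasonable reading of the definition rather than a detail given in the text.

```python
def path_efficiency(mastery_before, mastery_after, modules_recommended):
    """Equation (92): mastery improvement per recommended module."""
    if modules_recommended <= 0:
        raise ValueError("at least one module must be recommended")
    return (mastery_after - mastery_before) / modules_recommended


def recommendation_accuracy(mastery_deltas):
    """Equation (93): share of recommendations followed by a mastery increase.

    ASSUMPTION: "successful" means the change in mastery is strictly positive.
    """
    successes = sum(1 for delta in mastery_deltas if delta > 0)
    return successes / len(mastery_deltas)
```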

Multi-institution generalization metrics

Generalization metrics are computed to assess how well the proposed framework adapts across institutions.

Cross-institution learning gain

Cross-institution learning gain is the improvement in mastery averaged over institutions; a positive value indicates that the system generalizes. It is defined in equation (94),

LG_{cross} = \frac{1}{K} \sum_{k = 1}^{K} (\mathrm{APR}_{post, k} - \mathrm{APR}_{pre, k})

(94)

where K represents the number of institutions.

Cross-institution mastery prediction accuracy

This metric reflects the average prediction accuracy across institutions; higher values indicate that mastery estimation remains reliable at each institution. It is defined in equation (95),

\mathrm{Accuracy}_{cross} = \frac{1}{K} \sum_{k = 1}^{K} \mathrm{Accuracy}_{k}

(95)

Generalization gap

The generalization gap is the difference between performance on the training data and performance on the test data. A smaller gap implies better generalization and indicates that adaptive learning will remain reliable in different institutional environments. It is defined in equation (96),

\mathrm{Gap} = \mathrm{Performance}_{train} - \mathrm{Performance}_{test}

(96)
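The generalization metrics in equations (94) to (96) can be sketched as follows. The function names and the dictionary layout (institution name mapped to a pre/post pair) are hypothetical.

```python
def cross_institution_gain(pre_post_by_institution):
    """Equation (94): learning gain averaged over K institutions.

    pre_post_by_institution maps an institution name to (APR_pre, APR_post).
    """
    gains = [post - pre for pre, post in pre_post_by_institution.values()]
    return sum(gains) / len(gains)


def cross_institution_accuracy(accuracy_by_institution):
    """Equation (95): prediction accuracy averaged over K institutions."""
    values = list(accuracy_by_institution.values())
    return sum(values) / len(values)


def generalization_gap(performance_train, performance_test):
    """Equation (96): training performance minus test performance."""
    return performance_train - performance_test
```

Each institution contributes equally to the averages regardless of how many students it has, which matches the unweighted form of equations (94) and (95).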

Results and discussion

This section presents the experimental findings for the DKT + DQN adaptive learning path generation framework. The assessment covers mastery prediction accuracy, reinforcement learning convergence, adaptive path effectiveness, engagement, efficiency, and cross-institutional generalization. The findings indicate that the proposed approach is useful, sound, and scalable: the adaptive learning paths improved learning gain, mastery levels, and student engagement.

Knowledge tracing performance evaluation

RMSE, MSE, AUC-ROC, and classification accuracy were used to evaluate how accurately the DKT model estimates student mastery states. Figure 4 compares the predicted mastery probabilities with actual student performance in core academic modules.

Figure 4.

Comparison of predicted and actual student mastery probabilities using the DKT model.

The findings show close agreement between the predicted and actual mastery levels. For example, in statistics the predicted mastery of 95.1 was very close to the actual performance of 94.5, a prediction error of only 0.6. This indicates that the LSTM-based DKT model is well suited to learning temporal learning patterns and to modeling the development of student knowledge. The mastery prediction performance is summarized in Table 3.

Table 3.

Knowledge tracing prediction performance.

Metric	Value
RMSE	0.1569
MSE	0.0246
AUC-ROC	0.7343
Accuracy	87.6%
Convergence Epoch	42

The AUC-ROC of 0.7343 indicates that the model discriminates between mastered and non-mastered states well above chance level (0.5), whereas the small RMSE indicates accurate estimation of mastery probabilities. These findings support the use of DKT to represent student knowledge states in adaptive learning settings. This is further supported by Figure 5, which presents the confusion matrix of mastery prediction. The model shows high true positive and true negative rates, which indicates reliable classification of mastered and non-mastered concepts.

Figure 5.

Confusion matrix for mastery state classification using the deep knowledge tracing model.

The training stability of the DKT model is illustrated in Figure 6, which shows the loss curve over the training epochs. The steady decrease in loss confirms effective convergence and model stabilization.

Figure 6.

Training and validation loss curve of the deep knowledge tracing (DKT) model across epochs.

Reinforcement learning convergence analysis

The policy optimization of the DQN agent was assessed through cumulative reward, average reward, policy convergence, and stability of the reward distribution. Convergence behavior and learning performance were examined by tracking the cumulative reward over the 1000 training episodes. The cumulative reward curve in Figure 7 shows how the agent learns a policy for adaptive learning path generation from the student knowledge states provided by the Knowledge Tracing module.

Figure 7.

Cumulative reward progression of the Deep Q-Network (DQN) agent across training episodes.

At the beginning of training the agent displayed exploratory behavior, which led to comparatively low cumulative reward values. The initial cumulative reward was 53.94, reflecting the agent’s limited knowledge of the optimal sequence of learning modules. Over training, the agent gradually learned to match students’ mastery states with suitable learning interventions, which resulted in better decisions and higher accumulated reward.

The cumulative reward increased gradually during training, reaching a final value of 57.68, with a maximum observed reward of 70.01. The final value corresponds to an improvement of about 6.9% over the initial reward of 53.94, which indicates that the agent refines its policy over time. The growth in reward suggests that the agent learned to select learning modules that maximize long-term student mastery rather than immediate performance improvements. Moreover, the reward curve leveled off at about episode 742, which implies that the agent had reached a near-optimal policy. After this point the change in reward was minimal, so the learning process can be considered stabilized and additional training would be expected to produce only marginal gains. This convergence behavior supports the stability and reliability of the reinforcement learning model.

Further evidence of policy stability is given in Figure 8, which shows the histogram of rewards over all training episodes. The distribution is concentrated in the higher reward range, with low-reward episodes occurring less frequently. This pattern suggests that the agent consistently selected beneficial learning actions and rarely produced poor module sequences. The small spread in reward values also supports the robustness and stability of the learned policy. The overall convergence behavior of the reinforcement learning agent is summarized in Table 4.

Figure 8.

Reward distribution histogram showing policy stability of the reinforcement learning agent.

Table 4.

Reinforcement learning convergence performance metrics.

Metric	Value
Initial cumulative reward	53.94
Final cumulative reward	57.68
Maximum reward achieved	70.01
Minimum reward observed	48.72
Mean reward	58.34
Standard deviation	4.21
Total episodes	1000
Convergence episode	742

Table 4 shows a mean reward of 58.34, which indicates that the agent obtained consistently high rewards over the course of training. The standard deviation of 4.21 is comparatively low, which implies stable learning behavior and limited variation in policy performance. The range between the minimum and maximum rewards reflects the agent’s early exploration stage followed by steady exploitation of the best actions. The observed convergence characteristics indicate that the reinforcement learning agent learned an effective adaptive learning policy. The agent used the mastery state estimates from the Knowledge Tracing model to dynamically select the learning modules with the highest expected cumulative reward, which corresponds to better student learning and more efficient knowledge acquisition.

Adaptive learning path effectiveness

The adaptive learning paths were compared with static curriculum progression. Figure 9 contrasts curriculum progression under the static implementation with progression under the adaptive, reinforcement learning-based path.

Figure 9.

Comparison of static curriculum progression and reinforcement learning–based adaptive learning path.

The adaptive path is dynamic: it selects modules according to each student’s mastery state. In contrast to static progression, the adaptive path revisits poorly mastered concepts and moves more quickly through material that has already been mastered. This personalization leads to faster and more efficient learning.

The development of student mastery over time is shown in Figure 10, where the mastery of individual students increases as learning progresses. The figure indicates that the proposed framework is effective in monitoring and improving student learning progress.

Figure 10.

Student mastery progression over time under the proposed adaptive learning framework.

Table 5 provides a quantitative comparison between the adaptive and static learning path strategies. The proposed adaptive framework achieved an average mastery score of 0.84, compared with 0.68 for the static curriculum, an improvement of 23.53%. The adaptive approach also reduced the number of modules needed to reach mastery by 35.53% (from 15.2 to 9.8), which indicates more efficient learning. The mastery achievement rate improved from 72.4% to 89.7%, indicating that reinforcement learning is effective in guiding students toward better learning sequences. Additionally, the adaptive system achieved a considerably higher weak concept recovery rate of 86.5%, compared with 61.3% for the static path, which implies that it detected and addressed knowledge gaps effectively. These results indicate that the proposed adaptive learning path model supports a faster, more efficient, and more personalized route to mastery than conventional linear curriculum progression.

Table 5.

Adaptive versus Static Learning Path Effectiveness Comparison.

Metric	Static learning path	Adaptive learning path (proposed)	Improvement
Average mastery score	0.68	0.84	+23.53%
Modules required to reach mastery	15.2	9.8	−35.53%
Mastery achievement rate	72.4%	89.7%	+23.90%
Average learning time (hours)	42.6	28.1	−34.04%
Weak concept recovery rate	61.3%	86.5%	+41.11%
Learning efficiency score	0.64	0.88	+37.50%
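The Improvement column of Table 5 is the relative change from the static to the adaptive value. The following short check recomputes it from the two value columns of the table; the variable and function names are illustrative.

```python
# (static, adaptive) value pairs copied from Table 5.
TABLE_5 = {
    "Average mastery score": (0.68, 0.84),
    "Modules required to reach mastery": (15.2, 9.8),
    "Mastery achievement rate (%)": (72.4, 89.7),
    "Average learning time (hours)": (42.6, 28.1),
    "Weak concept recovery rate (%)": (61.3, 86.5),
    "Learning efficiency score": (0.64, 0.88),
}


def relative_change(static, adaptive):
    """Relative change in percent, measured against the static value."""
    return 100.0 * (adaptive - static) / static


for metric, (static, adaptive) in TABLE_5.items():
    print(f"{metric}: {relative_change(static, adaptive):+.2f}%")
```

Note that for the two rates expressed in percent, the column reports the relative change (for example, 72.4% to 89.7% is a relative increase of about 23.9%), not the difference in percentage points.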

Comparative performance analysis

To test the effectiveness of the proposed adaptive learning path generation framework, its performance was compared against four baseline strategies: static curriculum progression, random module recommendation, rule-based recommendation, and knowledge tracing-only (KT-only) recommendation. These baselines cover both classic and intelligent approaches to learning path generation that are commonly employed in adaptive learning systems.

Table 6 indicates that the proposed KT + RL framework achieved the highest learning gain of 0.7349, well above all the baseline approaches. By comparison, the learning gain was 0.5943, 0.5983, and 0.6032 for the static curriculum, random recommendation, and rule-based recommendation, respectively. Although the knowledge tracing-only model performed better, with a learning gain of 0.6554, this was still considerably lower than that of the combined KT + RL model.

Table 6.

Comparative learning gain performance.

Method	Learning gain	Standard deviation
Static curriculum	0.5943	0.0027
Random recommendation	0.5983	0.0332
Rule-based recommendation	0.6032	0.0272
Knowledge tracing only (KT)	0.6554	0.0181
Proposed KT + RL framework	0.7349	0.0109

Bold values indicate the average results across all institutions and are highlighted for emphasis.

The proposed method improved learning gain by 23.59% over the static baseline, which indicates that reinforcement learning can optimize module sequencing on the basis of student knowledge states. Moreover, the proposed framework had the lowest standard deviation of the recommendation-based methods (0.0109), which demonstrates more consistent and stable performance than the random, rule-based, and KT-only baselines; only the fixed static curriculum showed a smaller standard deviation (0.0027). Conversely, the random recommendation strategy was the most variable (0.0332) because it does not adapt to the knowledge state of students.

A paired t-test was performed between the proposed KT + RL framework and the static baseline to verify the statistical significance of the observed improvements. The t-statistic was −10.3945 and the p-value was 1.1856 × 10^{-10}. Because the p-value is far below the conventional significance level of 0.05, the improvement achieved by the proposed framework is statistically significant. This indicates that the observed performance increase is unlikely to be due to random variation and can be attributed to the proposed adaptive learning strategy.
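For reference, the paired t-statistic is the mean of the per-pair differences divided by its standard error. The sketch below computes it with the Python standard library on toy values; the function name is illustrative, the data are not from the study, and taking the difference as baseline minus proposed is an assumption chosen because it yields a negative statistic when the proposed method scores higher, as in the reported result.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, proposed):
    """Paired t-statistic with d_i = baseline_i - proposed_i.

    t = mean(d) / (stdev(d) / sqrt(n)), with n - 1 degrees of freedom.
    """
    if len(baseline) != len(proposed) or len(baseline) < 2:
        raise ValueError("need two samples of equal length with at least 2 pairs")
    diffs = [b - p for b, p in zip(baseline, proposed)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


if __name__ == "__main__":
    # Toy paired scores for illustration only.
    static_scores = [1.0, 2.0, 3.0]
    adaptive_scores = [2.0, 4.0, 3.5]
    print(paired_t_statistic(static_scores, adaptive_scores))
```

Obtaining the p-value additionally requires the cumulative distribution of Student's t with n − 1 degrees of freedom; in practice a library routine such as `scipy.stats.ttest_rel` returns both the statistic and the p-value.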

Figure 11 shows the learning gain of the different methods. The proposed KT + RL framework clearly achieves the highest performance, which illustrates the value of reinforcement learning for selecting learning modules in real time given the current knowledge states of students.

Figure 11.

Comparative learning gain performance across baseline methods and the proposed KT + RL framework.

Student engagement analysis

Student engagement is a crucial factor in learning effectiveness, knowledge retention, and academic achievement in higher education. To assess the effect of the proposed adaptive learning framework on student engagement, four engagement measures were examined: time spent on learning modules, revisit rate, dropout rate, and interaction frequency. Table 7 and Figure 12 compare the outcomes of the traditional static curriculum and the proposed adaptive learning system.

Table 7.

Engagement metrics comparison.

Metric	Static learning	Adaptive learning	Improvement
Time spent on modules	45%	68%	+23 percentage points
Revisit rate	12%	32%	+20 percentage points
Dropout rate	18%	5%	−13 percentage points
Interaction rate	2.5 per session	4.1 per session	+64%

Figure 12.

Comparison of student engagement metrics between static curriculum and adaptive learning framework.

As shown in Table 7, the adaptive learning framework improved student engagement on all the measures assessed. Time spent on learning modules was 45% in the static learning environment and 68% in the adaptive learning environment, an improvement of 23 percentage points. This improvement suggests that students engaged more with the learning content when module recommendations were personalized to their individual knowledge state.

The revisit rate (the proportion of module interactions in which students return to a module they have already taken) also increased by 20 percentage points, from 12% to 32%. This outcome suggests that the adaptive learning model identified knowledge gaps and recommended relevant remedial modules that encouraged students to strengthen weak concepts. A higher revisit rate is a sign of active learning behavior and improved knowledge consolidation.

Most importantly, the dropout rate fell from 18% in the static learning environment to 5% in the adaptive learning environment, a reduction of 13 percentage points. This substantial reduction suggests that personalized learning paths improve student motivation, alleviate the frustration associated with inappropriate levels of difficulty, and increase learning satisfaction.

The engagement improvement achieved by the proposed adaptive learning framework is shown visually in Figure 12. The adaptive method outperformed the static curriculum on all engagement measures, which supports the effectiveness of reinforcement learning-based module recommendation in keeping students interested and engaged. The increases in engagement can be attributed to the reinforcement learning agent’s ability to dynamically change the learning sequence according to the mastery level of individual students. By recommending modules that are neither too complex nor too simple, the system maintains an appropriate level of challenge that keeps students motivated and cognitively engaged. In addition, the integration of Deep Knowledge Tracing allows the level of student knowledge to be estimated accurately, so that the learning agent can make informed decisions consistent with students’ learning needs. This adaptivity ensures that students receive relevant learning materials at the right time, which helps to reduce cognitive overload and improves the overall learning experience.

Learning efficiency analysis

Learning efficiency is another important performance indicator for an adaptive learning system because it measures how quickly a student can reach mastery with the least unnecessary learning effort and time. To investigate the efficiency of the proposed adaptive learning framework, the number of learning steps needed to reach mastery was compared between the traditional static curriculum and the reinforcement learning-based adaptive learning path. Table 8 and Figure 13 present the results of the comparison.

Table 8.

Learning efficiency improvement analysis.

Student	Static learning steps	Adaptive learning steps	Steps reduced	Efficiency improvement (%)
S1	14	9	5	35.71
S2	16	11	5	31.25
S3	15	8	7	46.67
S4	14	10	4	28.57
S5	15	9	6	40.00
Average	14.8	9.4	5.4	36.44

Bold values indicate the average results across all institutions and are highlighted for emphasis.

Figure 13.

Comparison of learning steps required to achieve mastery between static and adaptive learning paths.

As indicated in Table 8, the proposed adaptive learning model decreased the number of learning steps needed to reach mastery for all students. With the static curriculum, the average number of learning steps needed to reach mastery was 14.8, whereas with the adaptive learning framework it dropped to 9.4. This is an average reduction of 5.4 steps per student and an average efficiency improvement of 36.44%. Student S3 showed the greatest improvement, with a reduction of 7 learning steps, equivalent to a 46.67% improvement. Similarly, students S1 and S5 reduced their steps by 5 and 6, which corresponds to efficiency improvements of 35.71% and 40.00%, respectively. The smallest improvement, observed for student S4, was still a notable reduction of 4 steps, or a 28.57% efficiency gain.
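The per-student percentages and the reported average in Table 8 can be recomputed from the step counts. The sketch below (illustrative names, values copied from Table 8) also shows that the reported 36.44% is the mean of the per-student percentages; dividing the average step reduction by the average static step count gives a slightly different figure.

```python
# (static steps, adaptive steps) per student, copied from Table 8.
TABLE_8 = {"S1": (14, 9), "S2": (16, 11), "S3": (15, 8), "S4": (14, 10), "S5": (15, 9)}


def efficiency_improvement(static_steps, adaptive_steps):
    """Percentage reduction in learning steps relative to the static path."""
    return 100.0 * (static_steps - adaptive_steps) / static_steps


per_student = {s: efficiency_improvement(a, b) for s, (a, b) in TABLE_8.items()}

# Mean of the per-student percentages (the "Average" row of Table 8).
mean_of_percentages = sum(per_student.values()) / len(per_student)

# Alternative: percentage reduction computed from the average step counts.
mean_static = sum(a for a, _ in TABLE_8.values()) / len(TABLE_8)
mean_adaptive = sum(b for _, b in TABLE_8.values()) / len(TABLE_8)
ratio_of_means = efficiency_improvement(mean_static, mean_adaptive)

print(round(mean_of_percentages, 2), round(ratio_of_means, 2))
```

The two aggregates differ because students with fewer static steps weigh more heavily in the mean of percentages than in the ratio of means.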

Figure 13 illustrates the difference in the number of learning steps needed to reach mastery under the static and adaptive learning paths. The adaptive learning framework required considerably fewer steps for all students, which supports the effectiveness of reinforcement learning in improving learning sequences.

The efficiency improvement can be attributed to the decision-making ability of the reinforcement learning agent. Using the knowledge state representations provided by the Deep Knowledge Tracing model, the agent recommends modules that directly address the knowledge gaps of each student. This prevents students from spending time on concepts they have already mastered and directs learning activity toward areas that need improvement. In addition, the adaptive framework supports both remediation and acceleration. Students with poor mastery of some concepts are directed to the relevant remedial modules, whereas students with high mastery are allowed to skip unnecessary material. This dynamic adjustment removes unnecessary learning steps and speeds up mastery. Fewer learning steps also translate into shorter learning time, higher student productivity, and a better learning experience. At the institutional level, this efficiency gain supports more efficient use of learning resources and makes large-scale deployment to many students more feasible.

Multi-institution generalization performance

The generalization ability of the proposed adaptive learning system was tested on several institutional datasets to assess its robustness and applicability across a range of educational contexts. Because student populations, curriculum structures, and learning behaviors may differ between institutions, it is important to verify that the proposed framework maintains similar performance outside the training environment. Consistent cross-institution performance is an indication of robustness and scalability. Table 9 and Figure 14 show the results of the cross-institution evaluation.

Table 9.

Multi-institution learning gain comparison.

Institution	Static learning gain	Proposed KT + RL learning gain	Absolute improvement	Relative improvement (%)
Institution A	0.5940	0.7342	0.1402	23.59%
Institution B	0.5940	0.7331	0.1391	23.43%
Institution C	0.5950	0.7374	0.1424	23.94%
Average	0.5943	0.7349	0.1406	23.65%

Bold values indicate the average results across all institutions and are highlighted for emphasis.

Figure 14.

Cross-institution generalization performance of the proposed adaptive learning framework.

The proposed Knowledge Tracing and Reinforcement Learning (KT + RL) framework outperformed the static learning system in all three institutions (Table 9). In Institution A, the learning gain rose from 0.5940 to 0.7342, a relative improvement of 23.59%. Likewise, Institution B showed an improvement of 23.43%, and Institution C had the largest improvement of 23.94%, rising from 0.5950 to 0.7374. The proposed framework had an average learning gain of 0.7349, compared with 0.5943 for the static baseline, an average relative improvement of 23.65%. This consistency across institutions suggests that the proposed adaptive learning framework generalizes well to diverse academic settings.

Figure 14 shows that the adaptive learning model was effective at all institutions, with considerably higher learning gains than the static curriculum. The very small differences in percentage improvement across institutions further support the reliability and stability of the proposed adaptive learning model.

Two factors plausibly explain the adaptive learning model's strong generalizability. First, the deep knowledge tracing (DKT) model infers learning patterns and knowledge progression that are common to students across institutions. Second, the reinforcement learning agent learns its policy from student mastery states rather than from how students were taught at a particular institution. As a result, the learned policy can recommend suitable modules regardless of the environment in which students are learning.
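The second point, that the policy depends only on mastery states, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_q_values` is a hypothetical stand-in for the trained DQN, and the `epsilon` and `mastery_threshold` parameters are assumed for the example.

```python
import random


def toy_q_values(mastery):
    """Hypothetical stand-in for the trained DQN: one Q-value per module.

    Here weaker modules simply receive higher values; a real agent would
    learn these values from reward signals.
    """
    return [1.0 - m for m in mastery]


def select_module(q_values, mastery, epsilon=0.1, mastery_threshold=0.95, rng=random):
    """Epsilon-greedy choice among modules that are not yet mastered.

    The only inputs are the mastery vector and the Q-values derived from it,
    so no institution identifier enters the decision.
    """
    candidates = [i for i, m in enumerate(mastery) if m < mastery_threshold]
    if not candidates:
        return None  # every module is already mastered
    if rng.random() < epsilon:
        return rng.choice(candidates)  # explore
    return max(candidates, key=lambda i: q_values[i])  # exploit


if __name__ == "__main__":
    # Two students from different institutions with identical mastery states
    # receive the same greedy recommendation.
    student_at_a = [0.9, 0.4, 0.7]
    student_at_b = [0.9, 0.4, 0.7]
    for mastery in (student_at_a, student_at_b):
        print(select_module(toy_q_values(mastery), mastery, epsilon=0.0))
```

Because the state contains no institution-specific features, any difference in recommendations between institutions must come from differences in the students' estimated mastery.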

The cross-institutional consistency of the proposed framework indicates that it is not overfitted to a specific dataset and can be deployed across a variety of higher education settings. For real-world use, the ability to generalize from one university to another is essential, because scalable deployment must accommodate many types of institutions, courses, and student populations.

Overall, the results confirm that the proposed adaptive learning framework is robust, scalable, and capable of delivering consistent learning improvements across multiple institutional environments, making it suitable for large-scale deployment in modern higher education systems, as summarized in Table 10.

Table 10.

Overall performance summary of the proposed adaptive learning framework.

Metric	Proposed framework
Mastery prediction accuracy	87.6%
AUC-ROC	0.7343
RMSE	0.1569
Learning gain improvement	+23.59%
Engagement increase (time spent)	+23 percentage points
Dropout reduction	−13 percentage points
Learning efficiency improvement	36.44%
Cross-institution generalization	High
RL policy convergence	Stable
Training stability	Confirmed

The proposed framework outperforms fixed learning paths on every measure considered. Combining DKT and DQN yields personalized, efficient, and scalable adaptive learning trajectories that increase engagement, reduce dropout, and generalize across institutions. In addition to consistent performance, computational cost was consistent across the institutional datasets: training converged within a similar number of epochs, and processing time did not grow exponentially. The computational complexity of the model scales in proportion to the number of students and learning interactions, which keeps the approach feasible at large scale and makes cross-institutional deployment practical. The state representation and fixed-capacity neural architecture prevent uncontrolled parameter growth when the framework is extended to other institutions, which also favors scalability.
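The fixed-capacity argument can be made concrete by counting parameters. The sketch below assumes a single-layer LSTM with one bias vector per gate, the one-hot (skill, correctness) input encoding commonly used for DKT, and a linear output layer with one unit per skill; the skill count and hidden size are hypothetical, since this section does not report them.

```python
def lstm_param_count(input_size: int, hidden_size: int) -> int:
    """Parameters of one LSTM layer with a single bias vector per gate.

    Each of the four gates (input, forget, cell, output) has input weights,
    recurrent weights, and a bias.
    """
    per_gate = hidden_size * (input_size + hidden_size) + hidden_size
    return 4 * per_gate


def dkt_param_count(num_skills: int, hidden_size: int) -> int:
    """Parameter count of an assumed DKT model: LSTM plus a linear output layer."""
    input_size = 2 * num_skills  # assumed one-hot over (skill, correct/incorrect)
    output_layer = hidden_size * num_skills + num_skills
    return lstm_param_count(input_size, hidden_size) + output_layer


if __name__ == "__main__":
    # Hypothetical sizes. The count depends on the skill vocabulary and the
    # hidden size only, not on how many students or institutions are added.
    print(dkt_param_count(num_skills=50, hidden_size=128))
```

Under these assumptions, adding students or institutions increases the amount of training data, and therefore training time, but leaves the number of parameters unchanged as long as the skill vocabulary is shared.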

Discussion

The experimental findings provide clear evidence that the adaptive learning framework combining DKT and DQN substantially improves the effectiveness of personalized learning systems. Accurate modeling of student knowledge, together with adaptive decision-making, allows the framework to generate learning paths that produce measurable improvements in learning outcomes, engagement, and efficiency. The accuracy of the Knowledge Tracing component is one of the major strengths of the proposed framework. The DKT model achieved a mastery prediction accuracy of 87.6%, an AUC-ROC of 0.7343, and an RMSE of 0.1569, indicating that it can be used to estimate student knowledge states. These findings suggest that the LSTM-based DKT model captures the temporal dynamics of student learning behavior and the change in knowledge over time. Mastery estimation is vital because it provides the state representation on which the reinforcement learning agent acts.

The RL component showed effective and stable policy optimization. Cumulative reward increased steadily over training, and the reward distribution remained stable. The DQN agent converged after about 742-780 training episodes. This indicates that the agent learned an effective module-selection policy that maximizes long-term learning gain. The reward-based optimization mechanism is what allows the agent to respond dynamically to student progress and thus deliver efficient, individualized instruction. One of the most important findings of the present research is the significant increase in learning gain under the proposed framework. The KT + RL model has a learning gain of 0.7349, which is a 23.59% improvement over the static curriculum baseline. This difference was statistically significant (p < 0.05), which indicates that adaptive, reinforcement learning-based module sequencing has measurable advantages over standard fixed learning sequences. The ablation experiments also confirmed the importance of combining the two components: although Knowledge Tracing provides accurate mastery prediction, Reinforcement Learning is the key to optimizing the learning sequence.
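The idea of maximizing long-term learning gain can be illustrated with the standard Q-learning target that DQN training is built on. The sketch below is a simplified, assumption-laden illustration: the reward definition (mean mastery gain), the discount factor, and the tabular-style update are hypothetical stand-ins, whereas an actual DQN minimizes the squared difference between its prediction and this target by gradient descent.

```python
def mastery_gain_reward(mastery_before, mastery_after):
    """Assumed reward: mean increase in estimated mastery after one module."""
    gains = [after - before for before, after in zip(mastery_before, mastery_after)]
    return sum(gains) / len(gains)


def td_target(reward, next_q_values, gamma=0.99, done=False):
    """Q-learning target: immediate reward plus discounted best future value."""
    if done or not next_q_values:
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_value, target, learning_rate=0.1):
    """Move the current estimate a small step toward the target."""
    return q_value + learning_rate * (target - q_value)


if __name__ == "__main__":
    reward = mastery_gain_reward([0.5, 0.5], [0.7, 0.5])
    target = td_target(reward, next_q_values=[0.5, 2.0], gamma=0.9)
    print(reward, target, q_update(0.0, target, learning_rate=0.5))
```

The discounted `max` term is what lets the agent prefer a module with a modest immediate gain if it leads to states from which larger gains are reachable.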

In addition to improving learning outcomes, the proposed framework significantly improved student engagement. Time spent on learning activities rose from 45% to 68%, and the revisit rate rose from 12% to 32%, indicating more active engagement and knowledge reinforcement. Moreover, the dropout rate fell substantially (from 18% to 5%), which suggests that individualized adaptive learning increases student motivation and retention. The proposed framework also delivered a considerable efficiency gain. The adaptive approach reduced the number of learning steps needed to reach mastery by about 36% (36.44%, Table 10), implying faster learning. This efficiency reduces students' cognitive load and learning time and makes knowledge acquisition more efficient.
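Because the engagement figures are themselves percentages, it helps to separate a change in percentage points from a relative change. The sketch below uses the before and after values reported above; the function and dictionary names are illustrative assumptions.

```python
def change(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (change in percentage points, relative change in percent)."""
    points = after_pct - before_pct
    relative = 100.0 * points / before_pct
    return points, relative


# (before, after) values in percent, as reported in the text.
ENGAGEMENT = {
    "time spent": (45.0, 68.0),
    "revisit rate": (12.0, 32.0),
    "dropout rate": (18.0, 5.0),
}

if __name__ == "__main__":
    for metric, (before, after) in ENGAGEMENT.items():
        points, relative = change(before, after)
        print(f"{metric}: {points:+.0f} points, {relative:+.1f}% relative")
```

The engagement increase and dropout reduction listed in Table 10 correspond to the percentage-point differences (+23 and −13), not to the relative changes.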

Notably, the multi-institution assessment demonstrated the robustness and applicability of the proposed framework. The model generalized well, with every institution showing a relative learning gain improvement of more than 23%. This supports the conclusion that the framework does not rely on institution-specific features and can be implemented in a variety of educational settings. Overall, Knowledge Tracing and Reinforcement Learning are complementary: Knowledge Tracing models student knowledge states accurately, and Reinforcement Learning selects learning routes based on those states. This integration enables adaptive learning systems that provide personalized, efficient, and scalable learning.

Conclusion and future work

This research presented an intelligent learning path generation framework based on DKT and DQN to optimize personalized learning in higher education. The main goal was to create a data-driven framework that can accurately model student knowledge and dynamically recommend learning modules that maximize mastery and learning efficiency. The Deep Knowledge Tracing component modeled the evolution of student knowledge from temporal learning interaction data. The DKT model achieved a mastery prediction accuracy of 87.6%, an AUC-ROC of 0.7343, and an RMSE of 0.1569. These findings indicate that the LSTM-based architecture is useful for capturing sequential learning patterns and producing accurate representations of student knowledge states. Building on these knowledge representations, the Deep Reinforcement Learning component optimized module selection through a reward-based learning approach. The Deep Q-Network agent converged and learned effective policies for module recommendation. The adaptive learning model recorded a 23.59% improvement in learning gain over a traditional static curriculum progression, which indicates the feasibility of reinforcement learning for optimizing individualized learning paths. Besides improving learning outcomes, the proposed framework considerably increased student engagement and learning efficiency. Under the adaptive approach, engagement rose, dropout rates fell, and the number of steps needed to reach mastery decreased by about 36%.

These results show that personalized adaptive learning is effective in enhancing student performance and the learning experience. Moreover, the multi-institution evaluation supported the robustness and generalization ability of the proposed framework. The model performed consistently well on data from different institutions, which indicates that it can be scaled and used in a wide variety of educational settings. Overall, the integration of Knowledge Tracing and Reinforcement Learning offers an intelligent and efficient solution for adaptive learning path generation. The proposed framework supports accurate student modeling, optimized learning paths, and scalable use in higher education systems. Future research could investigate more advanced reinforcement learning approaches (e.g., PPO, Actor-Critic) to improve policy learning. It could also incorporate additional student behavioral variables, enable online use within Learning Management Systems (LMS), perform multi-objective optimization, employ explainable artificial intelligence (AI), leverage larger and more varied datasets, and use multimodal data (e.g., video interactions and eye-tracking) so that models reflect a broader understanding of the student. The adaptive learning framework developed here can be applied to intelligent tutoring systems and personalized recommendation platforms that deliver data-driven educational solutions to meet the needs of individual learners. Assistive technologies offer major benefits for inclusive education because they improve accessibility and enable students with disabilities to learn and participate in school activities. Future work should focus on scalable, intelligent, and personalized assistive systems for broader educational impact.

Footnotes

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

All authors have reviewed and approved the manuscript and consent to its publication.

Author Contribution

Long Jishuang: Conceptualization, methodology, software development, data curation, formal analysis, writing – original draft preparation, visualization. Kamisah Osman: Supervision, conceptualization, validation, writing – review and editing, project administration. Faridah Mydin Kutty: Methodological guidance, validation, writing – review and editing, academic supervision. All authors have read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement
