Abstract
Personalized recommendation systems that adapt to each learner’s own learning pace have been widely adopted in the e-learning field. Making full use of learning behavior data, psychometric assessment models keep track of the learner’s proficiency on knowledge points, and a well-designed recommendation strategy then selects a sequence of actions to maximize the learner’s learning efficiency. This article proposes a novel adaptive recommendation strategy under the framework of reinforcement learning. The proposed strategy is realized by deep Q-learning algorithms, the techniques that contributed to the success of AlphaGo Zero in reaching a super-human level in the game of Go. The proposed algorithm incorporates early stopping to account for the possibility that learners may choose to stop learning. It can properly deal with missing data and can handle more individual-specific features for better recommendations. The recommendation strategy guides individual learners along efficient learning paths that vary from person to person. The authors showcase concrete examples with numeric analysis of substantive learning scenarios to further demonstrate the power of the proposed method.
Introduction
Adaptive learning refers to an educational method that delivers personalized educational interventions, typically implemented through computerized algorithms that output personalized action recommendations based on analysis of the learners’ learning histories. The spread of Internet access has facilitated the application of adaptive learning in practice, which in turn has prompted further research interest (Sleeman & Brown, 1982; Wenger, 1987). Meanwhile, the past few years have seen great advances in big data technology. In particular, a sensational success was achieved by AlphaGo, which employed cutting-edge techniques in deep reinforcement learning. The aim of this article is to bridge adaptive learning with the recent developments in deep reinforcement learning and consequently propose novel recommendation strategies for the adaptive learning system.
Originating in behaviorist psychology (Skinner, 1938), reinforcement learning is concerned with how an agent interacts with the environment and learns to take actions that maximize total rewards. Reinforcement learning is one of the central topics in artificial intelligence (Kaelbling, Littman, & Moore, 1996) and has broad applications in areas such as robotics and industrial automation (Kober & Peters, 2012), health and medicine (Frank, Seeberger, & O’Reilly, 2004), and finance (Choi, Laibson, Madrian, & Metrick, 2009), among many others. As a significant advance in recent years, deep reinforcement learning was proposed by Google DeepMind in 2013. Applied to Atari 2600 games, it achieved expert human proficiency (Mnih et al., 2013). Applied to the game of Go, it surpassed the human level of play, starting from tabula rasa, within a training period as short as a few days (Silver et al., 2016).
To apply reinforcement learning techniques to adaptive learning, a proper formulation of the adaptive learning process in terms of the setup commonly used in reinforcement learning is needed (Chen, Li, Liu, & Ying, 2018). Each time the learner takes an action following the recommendation strategy, the system enters a new knowledge state and then receives a reward. The authors’ goal is to maximize students’ learning efficiency, that is, to help learners master knowledge points along the most efficient path. This is equivalent to balancing the trade-off between the expected total gain in knowledge points mastered by the learner and the total number of learning steps in the process. In a substantive learning scenario with assessment errors, an unknown learning model, and complex reward forms, a good recommendation strategy should make full use of the current information to maximize the learning gain and should be feasible under various learning designs. Such an optimal strategy design problem is cast as a Markov decision problem and is solved under a reinforcement learning framework.
In this article, an approach based on deep reinforcement learning is presented that provides personalized recommendations and addresses learners’ need for cost-effective learning. Specifically, a variant of the deep Q-learning algorithm (Hasselt, Guez, & Silver, 2016) is adopted, with modifications tailored to adaptive learning recommendations. The objective function is approximated by a neural network, which is used to determine the optimal strategy. The authors make several novel contributions. First, the parameter space is enlarged and early stopping is incorporated into the learning process. As a result, the deep Q-learning approach maximizes the overall gain within the shortest time, serving the purpose of maximizing the learner’s learning efficiency. Second, the model is designed with the ability to handle missing data. In existing work (e.g., Chen et al., 2018), it is assumed that the learning process follows fixed procedures in which the assessment model is indispensable at each step, so the recommendations cannot appropriately deal with missing knowledge status. In this article, a missing index is introduced and modeled, which helps to analyze incomplete data in a flexible fashion. Third, the effect of learning interest is considered. As the learning model differs among learners, a real-time recommendation system that is sensitive to the characteristics of each learner can be obtained by introducing more personal features, such as learning interest. Thanks to the nature of the deep neural network, the proposed method can scale up comparatively easily in handling big data. It is expected that, by combining domain knowledge and more individual information, the proposed method may be further developed into a more efficient and competitive adaptive learning system.
The rest of this article is organized as follows. In section “Background,” a mathematical framework for adaptive learning is reviewed, and a general cost-efficient reward for the recommendation strategy is defined. Then, a variant of deep Q-learning tailored to adaptive learning is presented in section “Deep Q-learning recommendation strategy.” In section “Learning scenarios and experiments,” concrete simulated examples are given to support the methods in more realistic learning scenarios, followed by a discussion in section “Discussion.”
Background
The objective of recommendation is to help individual learners achieve their learning goals in the shortest time by utilizing all the currently available information. Consider
Assessment Model
The diagnosis of one’s latent abilities has been studied extensively in modern psychometrics. Online learning systems can track the entire range of online learning behaviors, including frequency of login, clicks on the lecture video, and item responses, which can be dichotomous or polytomous. Such information can be incorporated to model one’s learning styles and preferences so as to make learning designs more accurate and enjoyable for learners (Coffield, Moseley, Hall, & Ecclestone, 2004).
Consider the assessment on the proficiency level of knowledge points with the test item pool
which implies that
Through the learner’s item responses, how knowledge points have been mastered so far can be observed partially. The assessment result
Learning Model
In a learning system, the learning model responds to the action taken at each time. Given action
In the context of adaptive learning, factors beyond the learners’ knowledge states should be taken into consideration so that the learning model is trained comprehensively while the recommendation strategy is learned at the same time. These factors, including cognitive abilities, learning styles, and other learning behaviors, make it possible to distinguish the learning models of different learners. Intuitively, interest in the subject could be a useful feature in the personalized recommendation system. Imagine two learners, one with strong and one with weak interest in a course. They are likely to have very different learning processes and outcomes as a result of their different levels of interest. In mathematical modeling, different transition probabilities should be assumed for these two learners, and different actions may be recommended even if they are at the same knowledge state. A detailed simulation study is presented in section “Experiment on the effect of learning interest” to illustrate how to incorporate interest as a feature in the recommendation.
Recommendation Strategy
With learners’ information up to time
Given policy
According to
To find an efficient learning path to balance total gains on knowledge and learning steps, the action space
The reward setting imposes the penalty on total rewards by
Following the reward setting above, the Q-value function of policy
Deep Q-learning Recommendation Strategy
Finding the Q-value function in the presence of uncertainties in the system is challenging. Among traditional methods, dynamic programming (Bellman, 2003) fails due to the curse of dimensionality (Niño-Mora, 2009) inherent in relatively large state spaces. Approximate dynamic programming methods (Powell, 2007), in turn, place constraints on the form of parameterization and may have convergence problems when the form of the Q-value function is complicated. In this section, a method based on deep Q-learning (Mnih et al., 2015) is employed to optimize the policy; it is called the DQN method in the rest of this article.
Q-learning
Q-learning (Watkins & Dayan, 1992) is the basic idea behind the proposed method. In a learning system, a sequence of state transitions
The above equation derived from the Bellman equation (Sutton & Barto, 1998) is based on the following intuition: if the optimal Q-value is known at the next time point, the optimal strategy is to select the action maximizing the expected value of
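As an illustration of the intuition above, the following is a minimal tabular sketch of one Q-learning update in the style of Watkins and Dayan (1992). It is not the authors’ implementation: the learning rate `alpha`, the discount factor `gamma`, the tuple-valued states, and the action names are all assumptions made for this example.

```python
from collections import defaultdict


def q_learning_update(q, state, action, reward, next_state, actions, done,
                      alpha=0.1, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new value.

    alpha (learning rate) and gamma (discount factor) are assumed values.
    """
    # Value of the best action at the next state; zero once the episode ends.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


# Hypothetical toy setting: states are tuples of mastery indicators.
actions = ["material_1", "material_2", "stop"]
q = defaultdict(float)

first = q_learning_update(q, (0, 0, 0), "material_1", 1.0, (1, 0, 0),
                          actions, done=False)
stop_value = q_learning_update(q, (1, 0, 0), "stop", 2.0, (1, 0, 0),
                               actions, done=True)
second = q_learning_update(q, (0, 0, 0), "material_1", 1.0, (1, 0, 0),
                           actions, done=False)
```

The second update of the same state-action pair already reflects the value learned for the next state, which is how information propagates backward along a learning path.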
Deep Q-network
For a large state space, it is often impractical to maintain the Q-values for all state-action pairs. The authors consider approximating

Figure 1. The architecture of the neural network with two layers in the experiments.
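To make the approximation concrete, the sketch below maps a knowledge-state vector to one Q-value per action with a small feed-forward network. It is only a sketch under stated assumptions: the layer widths, the ReLU activation, the reading of “two layers” as two hidden layers, and the uniform initialization are not taken from the article.

```python
import random


def init_layer(n_in, n_out, rng):
    """Return (weights, biases); the small uniform initialization is assumed."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu=False):
    """Affine map followed by an optional ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b
           for row, b in zip(weights, biases)]
    return [max(0.0, v) for v in out] if relu else out


def q_values(state, params):
    """Two hidden layers (assumed) followed by a linear output per action."""
    hidden = dense(state, params[0], relu=True)
    hidden = dense(hidden, params[1], relu=True)
    return dense(hidden, params[2])


def greedy_action(state, params):
    """Index of the action with the largest approximate Q-value."""
    values = q_values(state, params)
    return max(range(len(values)), key=values.__getitem__)


rng = random.Random(0)
n_state, n_hidden, n_actions = 3, 16, 4  # hypothetical sizes
params = [init_layer(n_state, n_hidden, rng),
          init_layer(n_hidden, n_hidden, rng),
          init_layer(n_hidden, n_actions, rng)]

values = q_values([0.0, 1.0, 0.0], params)
best = greedy_action([0.0, 1.0, 0.0], params)
```

Because the network outputs all action values in one forward pass, choosing the recommended action only requires an argmax over the output vector.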
Algorithm
In this part, the deep Q-learning algorithm is elaborated. The learner’s learning experiences at each time,
where
1. Initialize Q-value function
2. Take action
3. Store transition
4. Sample a mini-batch of transitions from
5. Compute the predicted Q-value with
6. Update parameter
7. Every C steps, reset
8. Repeat steps 2 to 7.
Some intuitions about training tricks in the algorithm are presented as follows:
Each transition of experiences is potentially used in many parameter updates, which allows for greater data efficiency. Besides, drawing transitions randomly from
The dynamic
Periodically updating
Reinforcement learning refers to a class of goal-oriented algorithms, and Figure 2 visualizes how it works in a flow chart. The DQN method starts from an initialized Q-value function and improves the policy through interactions with the environment, which allows it to handle a relatively large state space.

Figure 2. The interactions between the agent and environment with the DQN method.
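The training tricks described in this section (storing transitions, sampling mini-batches at random, and periodically resetting a second set of parameters) can be sketched as follows. This is an illustration rather than the authors’ code: the target rule follows the double Q-learning idea of the variant cited in the Introduction (Hasselt, Guez, & Silver, 2016), the gradient step is omitted, and `ReplayMemory`, `td_targets`, `sync_target`, and the value of `gamma` are hypothetical names and settings.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of (state, action, reward, next_state, done)."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        k = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)


def td_targets(batch, q_online, q_target, gamma=0.9):
    """Double-Q-style targets: the online net picks, the target net scores."""
    targets = []
    for _state, _action, reward, next_state, done in batch:
        if done:  # terminal step, e.g. after an early stop
            targets.append(reward)
            continue
        online_next = q_online(next_state)
        best = max(range(len(online_next)), key=online_next.__getitem__)
        targets.append(reward + gamma * q_target(next_state)[best])
    return targets


def sync_target(step, online_params, target_params, every):
    """Copy the online parameters into the target copy every `every` steps."""
    if step % every == 0:
        target_params[:] = online_params
    return target_params


memory = ReplayMemory(capacity=2)
memory.store((0,), 0, 1.0, (1,), False)
memory.store((1,), 1, 3.0, (1,), True)
memory.store((1,), 0, 0.5, (0,), False)  # evicts the first transition

# Stand-in Q-functions returning one value per action (two actions here).
targets = td_targets(list(memory.buffer),
                     q_online=lambda s: [0.0, 1.0],
                     q_target=lambda s: [5.0, 2.0])

target_params = sync_target(3, [1.0], [0.0], every=3)
```

Keeping the targets tied to a slowly changing copy of the parameters is what stabilizes the regression problem solved at each update.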
Learning Scenarios and Experiments
In this section, the authors present a concrete learning system, conduct simulations in different learning scenarios, and explore the effect of learning interest by combining a classification problem with the DQN method. Two recommendation strategies are considered as baselines: one is the random policy, which chooses actions uniformly at random from all available actions; the other is the oracle policy
where
Some intuitions about the reward setting are provided in the following:
Suppose there indeed exists a final exam at the terminal time and the final grade is a value measuring learners’ proficiency. If the
Let
Toy Experiment
The authors first go through a toy example, where the oracle policy is known. Assume that a course consists of three knowledge points

Figure 3. The hierarchy of knowledge points in the toy example.
Consider three lecture materials
which are the
Notice that the assessment model is employed to obtain estimator
Criteria
To show the power of the DQN method, the random policy serves as a lower benchmark, while the oracle policy is the upper benchmark. The performances of the recommendation strategies are evaluated by total rewards received in the learning processes. Given a scale parameter
Simulation results
Figure 4 shows the results of the toy example. In terms of total rewards, the NOIRT curve with the DQN strategy and the oracle curve almost coincide. The other DQN methods with IRT still work well, reaching relatively high rewards in the end. Furthermore, comparisons among IRT_2, IRT_8, and IRT_64 suggest that a model including more items leads to a better result. For clarity of expression, the final grade

Figure 4. The results of the experiments in the toy example.
Continuous Case
A more practical learning environment with a continuous state space is considered next. In this case, the proficiency level of each knowledge point is rescaled to be a continuous value in the interval
The learning system consists of 10 knowledge points

Figure 5. The knowledge graph in the continuous case: the numbers in the circles indicate the corresponding knowledge points, while the numbers on the arrows indicate learning prerequisites.
For learning actions, a total of 30 materials, that is,
The transition kernel was taken as a density function to reduce the storage space. Assume the learner takes action
where ⊙ stands for element-wise multiplication of vectors and
The transition function reveals some properties of the learning model. On one hand, the exponential term is always smaller than one, which implies that there is no regression in learning. On the other hand, Equation 1 can be rewritten as
The above equation shows that at the initial stage, learners acquire knowledge quickly and their knowledge states increase fast. In contrast, it is hard to make a big improvement when the state approaches 1. This phenomenon is common in the learning process: it is easy for beginners to gain a basic understanding, while in-depth work on that basis requires much more effort. Finally, let the initial knowledge state be
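The two properties described above (no regression, and diminishing gains as the state approaches 1) can be illustrated with a saturating update. The form used below is a hypothetical stand-in chosen only because it has these two properties; it is not the article’s Equation 1, and the `gain` vector is an assumed non-negative quantity representing how much a material helps each knowledge point.

```python
import math


def saturating_step(state, gain):
    """Hypothetical update: 1 - s' = (1 - s) * exp(-gain), element-wise.

    With gain >= 0 the exponential factor is at most one, so s' >= s.
    """
    return [1.0 - (1.0 - s) * math.exp(-g) for s, g in zip(state, gain)]


before = [0.0, 0.9]  # a beginner and a nearly proficient knowledge point
after = saturating_step(before, gain=[1.0, 1.0])
improvement = [a - b for a, b in zip(after, before)]
```

With the same gain, the beginner’s knowledge point improves far more than the nearly proficient one, which mirrors the diminishing-returns behavior described in the text.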
Criteria
Five simulations in total are conducted to compare with the random policy when the knowledge state space is continuous. Due to the complicated learning model, it is hard to determine the oracle policy at every scale
Simulation results
The results are shown in Figure 6. Clearly, the NOIRT curve, which involves no assessment error, attains the highest final grade with the smallest number of learning steps. The left graph shows that more accurate measurement lays the foundation for better recommendations. The Linear curve beats the random policy, but the linear approximation cannot fit the Q-value function well when its form is complicated, and it therefore performs worse than the DQN method. In summary, the DQN method outperforms the random policy and the linear approximation method under the above settings, even in the case of IRT_2.

Figure 6. The results of the experiments in the continuous case.
Experiment on Missing Data
Generally, the diagnosis and recommendation for a learner in an adaptive learning system follow fixed procedures, as shown in Figure 7. However, students are likely to skip exercises, so it cannot always be observed whether their knowledge states have increased. In the case when

Figure 7. The flow chart of general procedures in the adaptive learning system: (1) the knowledge state
To make full use of the current information, the number of missing items is taken as a missing index. Note that the missing index can partially reflect the effort the learner has made during the unobserved period. Thus, it can be included as a personal feature to track the learning model. When the learner skips the questions at time
Following the above setting, the dimension of the input of the network is expanded from
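One way to picture the expanded input is to append the missing index to the most recent knowledge-state estimate. The sketch below does this under several assumptions that are not confirmed by the article: the last available estimate is carried forward while assessments are skipped, the index counts consecutive skipped assessments, and it resets to zero once a new estimate is observed. The function names are hypothetical.

```python
def observe(last_estimate, missing_index, new_estimate=None):
    """Return the updated (estimate, missing_index) pair.

    new_estimate=None means the learner skipped the assessment, so the old
    estimate is kept and the missing index grows by one (assumed rule).
    """
    if new_estimate is None:
        return list(last_estimate), missing_index + 1
    return list(new_estimate), 0


def build_input(estimate, missing_index):
    """Network input: knowledge-state estimate plus one extra dimension."""
    return list(estimate) + [float(missing_index)]


estimate, missing = [0.2, 0.5], 0
estimate, missing = observe(estimate, missing)  # skipped once
estimate, missing = observe(estimate, missing)  # skipped twice
x_skipped = build_input(estimate, missing)

estimate, missing = observe(estimate, missing, new_estimate=[0.4, 0.6])
x_observed = build_input(estimate, missing)
```

The extra dimension lets the network treat a stale estimate differently from a fresh one without changing the rest of the training procedure.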
Criteria
The expected total rewards of DQN strategies based on varying percentages of missing data are compared against the random policy. Specifically, randomly take
Simulation results
The results are presented in Figure 8. The left figure shows that compared with the

Figure 8. The results of the experiments on the missing data.
It is worth pointing out several exceptions, for example, when scale parameter
Experiment on the Effect of Learning Interest
The learning model differs among learners. Adding more individual-specific features that distinguish types of learners can further improve the policy design and thus the recommendations. Intuitively, interest in the subject is an obvious personal feature of the learner. Therefore, reinforcement learning and the classification problem are combined by adding interest to the approximation of the Q-value function, so that the learning model is learned more thoroughly.
Two types of learners were simulated, type 1 and type 2, and they have different transition probabilities for different learning efficiency. For learners of type 1 with low learning efficiency,
Similar to the missing data experiment, the authors consider the learner’s interest as an additional dimension in the input which is
Criteria
Three recommendation strategies are compared. The authors apply the DQN method in the model involving the interest feature, denoted as interest, and also in the non-interest model. In addition, the experiment in which the types of learners are already known is presented as the upper benchmark, denoted as type_classified. To validate the interest effect, the authors assume that the knowledge states are measured without error in these three simulations.
Simulation results
As shown in Figure 9, the model involving the learner’s interest feature performs better than the non-interest model. This demonstrates that the DQN method learns from the interest feature and, to some extent, successfully distinguishes the two types of students. Because of the flexibility of the network, the DQN method avoids adding high-order terms to the parameterization, in contrast to function approximation methods.

Figure 9. The results of the experiments with the effect of learning interest.
The effect of learning interest serves as an example to show the efficiency and feasibility of the DQN method in learning from features. Of course, one feature is not enough to capture the learning model comprehensively, and more personal features can be considered to make more precise recommendations.
Discussion
In this article, a DQN recommendation strategy that makes full use of available information under the framework of the adaptive learning system is proposed. The authors introduce three components of the learning system and formulate the problem as a Markov decision problem. Based on the cost-efficient reward setting, a feed-forward neural network is used to approximate the Q-value function, and a deep Q-learning algorithm is developed to train the optimal policy. The proposed method takes into account several needs in substantive learning scenarios: allowing for early stopping, handling missing data, and incorporating learning interest. Four concrete examples are provided in the simulations to show the power of the DQN method.
In future work, more practical issues need to be considered. As illustrated in the first two simulations, the accuracy of the knowledge state estimate can affect the quality of the recommendation strategy. Even though the assessment model is not the focus here, more individual-specific information can further improve the estimation accuracy. For example, cognitive skills can also be modeled by psychological tests (Brown & Burton, 1978). Combined with well-designed learning materials, the recommendation can not only help learners master knowledge but also strengthen their cognitive abilities. Moreover, the proposed method may be further improved by adopting a reward setting that better reflects the real learning scenario. For example, the terminal reward may be defined in a different form to measure the feedback of a sequence of actions; for instance, it may relate directly to the learning duration.
The deep Q-learning method in practice needs to collect data and train the optimal strategy simultaneously in the initial stage, where the initial policy may be uncomfortable for learners. Some prior information to build the initial policy can improve the user experience. Moreover, a virtual but reasonable learning model serving as the environment can generate data for further training. As a data-driven approach, it is of interest to explore a more effective structure for the networks to better extract behavior information. For example, the number of layers in the neural network can be scaled up to meet more complicated situations. Besides, the method of policy gradient (Sutton, McAllester, Singh, & Mansour, 1999) may be borrowed, by which a more flexible stochastic policy can be obtained. Finally, theoretical interpretations remain to be studied to prove the feasibility of the deep Q-learning method in the practical large-scale learning environment.
Supplemental Material
OnlineSupplementary – Supplemental material for Adaptive Learning Recommendation Strategy Based on Deep Q-learning
Supplemental material, OnlineSupplementary for Adaptive Learning Recommendation Strategy Based on Deep Q-learning by Chunxi Tan, Ruijian Han, Rougang Ye and Kani Chen in Applied Psychological Measurement
Footnotes
Acknowledgements
The authors are very grateful for the constructive suggestions of the associate editors and the referees.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research of Kani Chen is supported by Hong Kong RGC grants: 600813, 16300714, and 16309816.
Supplemental Material
Supplemental material is available for this article online.
References
