Abstract
The difficulty in recognizing classroom student state lies in making feature judgments from students' facial expressions and movement state. At present, some intelligent models recognize classroom student state inaccurately. To improve recognition, this study builds a two-level state detection framework based on deep learning and an HMM feature recognition algorithm, and expands it into a multi-level detection model through a reasonable state classification method. In addition, this study selects a continuous HMM or deep learning to reflect the dynamic generation characteristics of fatigue, and designs random human fatigue recognition experiments to complete the collection and preprocessing of EEG data, facial video data, and subjective evaluation data of classroom students. This study also discretizes the feature indicators and builds a student state recognition model. Finally, the performance of the proposed algorithm is analyzed through experiments. The results show that the proposed algorithm has certain advantages over the traditional algorithm in recognizing classroom student state features.
Introduction
When people study, their mental attitude and body posture change with the learning content and learning method. This change reflects a state, namely the learning state, which is mainly manifested in the learning behavior of learners in the classroom. Current teaching mainly takes two forms: the traditional face-to-face classroom [1] and the network-based online learning classroom [2]. In the traditional face-to-face classroom, a good learning state depends on both the emotional influence of the lecturer and the learner's own motivation. What the learner mainly receives from the lecturer is the lecturer's rising and falling tone during teaching, vivid body language, and attentive eye contact. What the lecturer mainly receives from the learner is that the learner actively raises questions to express doubts; the lecturer can also judge the learner's mental state from facial expressions, enliven a dull learning atmosphere, and adjust the teaching schedule in time. In short, under this traditional teaching method, communication between the two parties not only helps the lecturer improve the teaching method but also enables the learner to engage better in learning. In the web-based online learning classroom, the advantage for both learners and lecturers is that teaching materials are rich and teaching time is not limited. The disadvantage is that the learner and the lecturer are not in the same physical space. On the one hand, because the learner is not under the supervision and attention of the lecturer, the learner is easily distracted, doubts about the material accumulate during long-term learning, and the learner is prone to fatigue.
On the other hand, because the lecturer cannot see the learner's emotions and state during learning, the lecturer cannot remind or guide the learner in real time or adjust the progress and plan of the course, and can only judge the student's learning state through examinations, which is one-sided and not objective. To solve the problems encountered by learners in web-based online learning classrooms, the analysis of the learner's learning state is mainly considered from the following two aspects: (1) Fatigue: When the learner studies in a normal state, the head maintains a fixed posture for a short time. However, when the learner is fatigued, the head shows continuous nodding movements in a short time. Therefore, whether the learner is fatigued is determined by the number of consecutive nodding movements of the learner's head in a short time [3]. (2) Attention: When the learner is attentive, the head holds a fixed posture. However, when the learner is inattentive, the learner looks away and the head turns repeatedly in a short time [4]. Therefore, whether the learner's attention is focused is determined by the number of times the learner continuously turns the head in a short time.
To recognize the above two learning states, it is first necessary to detect and recognize the learner's posture and movement, that is, the learner's head pose and head movement. Second, the state of the learner needs to be judged comprehensively from these characteristics.
Related work
The learning state detection and recognition system mainly aims to detect and recognize the head movement of a person in the learning state. That is, the focus of detection and recognition is the pose of the human head. The complexity of head poses determines, to a certain extent, the diversity of head pose detection methods. Since the 1990s, domestic and foreign scholars have designed a variety of systems to detect and recognize head pose, so that multiple systems now coexist [5]. According to the equipment used for posture detection, posture detection and recognition systems are divided into two types: (1) Posture detection and recognition systems based on machine vision. Such a system realizes posture detection through an external image acquisition device, mainly composed of cameras and similar equipment. It collects moving images of the human body, performs image processing, and then uses visual algorithms for detection and recognition. There is no direct contact between the detection device and the user. However, the approach is subject to space limitations, user privacy is easily compromised, the amount of data to process is large, and the requirements on hardware and software are high [6]. The literature [7] designed a real-time head pose detection system, which detects faces in images with an active appearance model, uses the POSIT algorithm to estimate head pose, and realizes real-time detection of head pose. The literature [8] used a machine vision algorithm to design an intelligent interaction system for a robotic wheelchair based on head pose recognition. The system uses Kinect sensors to collect images of the user's head, and uses random forests combined with the ICP algorithm to calculate the head pose in real time, thereby controlling the rotation of the wheelchair in real time through the detected changes in head pose.
The literature [9] designed a vision-based fatigue detection system for the problem of fatigue driving. The system uses the Seetaface detection algorithm to detect human faces, and judges the degree of fatigue of the driver from changes in the detected eye state and head pose (left, right, head up, and nod). (2) Posture detection and recognition systems based on motion sensors. Such a system realizes posture detection through a purpose-built hardware detection device, mainly composed of a microprocessor, an attitude sensor, and a power module. The device, worn on the user's head, receives the posture data measured by its sensors in real time and processes the data with an attitude estimation algorithm to obtain an accurate head pose. Such methods have high detection accuracy and good real-time performance. The literature [10] designed a wearable head-mounted device that detects fatigue in fatigue driving by collecting EEG signals and head movement data in real time. The literature [11] designed a virtual reality helmet system. The helmet uses the MPU6500 sensor to collect the wearer's head motion information and uses a purpose-designed calculation algorithm to improve the real-time performance and continuity of head pose calculation, thereby reducing both the dizziness users experience when wearing it and the delay in transmitting images.
The literature [12] used a sensor module to measure the movement posture of free-range animals, combined a quaternion-based nonlinear observer with the least squares method to process the measurements of the MEMS sensor, corrected the deviation of the angular velocity measurements, and improved the accuracy of animal body motion posture estimation. To address the susceptibility of human posture data collection to random noise, the literature [13] proposed a posture solution method. This method adjusts the attitude deviation through a proportional-integral algorithm and fuses the data through complementary filtering, which effectively suppresses random noise during data collection and improves the accuracy of the output data. The literature [14] proposed a sensor data fusion method that combines the conjugate gradient method with the complementary filter algorithm: the accelerometer and magnetometer data are processed with the conjugate gradient method to obtain the optimal estimate of the attitude quaternion, and this result and the attitude quaternion obtained from the gyroscope angular velocity data are then passed through complementary filtering to obtain the attitude angle. The method improves system performance when the attitude changes drastically. The literature [15] proposed an adaptive complementary filtering algorithm that tracks the flight attitude of an aircraft in real time and compensates for the drift error of the gyroscope. The literature [16] designed an attitude detection system composed of five modules, each including a gyroscope, an accelerometer and a magnetometer. The system uses the Levenberg-Marquardt algorithm to correct the offset error, scale factor and so on, and averages the data of the five AHRS modules to reduce the impact of high-frequency noise.
The system then applies the obtained result to an orientation estimation algorithm based on complementary filtering, and uses this algorithm to determine whether the system is affected by magnetic distortion. The literature [17] designed an improved quaternion complementary filtering algorithm that uses PID control to suppress data fluctuations. Compared with the traditional complementary filtering algorithm, the average error of the pitch angle decreased by 52.4% and the error of the roll angle decreased by 62.2%.
HMM-based classroom student state recognition model construction
HMM (Hidden Markov Model) is a probabilistic statistical model comprising two random processes: the random change of the hidden state and the random change of the observable state [18]. Together, the two correlated random processes describe the statistical characteristics of time-varying signals. The change of classroom student state is essentially such a double random process: the true state of the classroom student is the hidden state, which cannot be measured directly in practice [19], while the changes in physiological signals, facial expressions and other characteristics during the state change are observable states, which can be measured in practice by physiological parameter instruments, cameras and so on [20]. Therefore, the classroom student state recognition model can be constructed based on HMM [21–24].
HMM mainly includes the following parameters:
(1) The number of hidden states N: We assume that the N hidden states are $\theta_1, \theta_2, \cdots, \theta_N$, and that the state at time t is $q_t$, $q_t \in \{\theta_1, \theta_2, \cdots, \theta_N\}$.
(2) The number of observable states M: We assume that the M observable states are $V_1, V_2, \cdots, V_M$, and that the observable state at time t is $Q_t$, $Q_t \in \{V_1, V_2, \cdots, V_M\}$.
(3) The initial probability vector π: $\pi = (\pi_1, \pi_2, \cdots, \pi_N)^T$ represents the probability of each hidden state at the first moment, that is,

$$\pi_i = P(q_1 = \theta_i), \quad 1 \le i \le N$$

(4) State transition probability matrix A: $A = (a_{ij})_{N \times N}$, where

$$a_{ij} = P(q_{t+1} = \theta_j \mid q_t = \theta_i), \quad 1 \le i, j \le N$$

In the formula, $a_{ij}$ represents the probability of transition from the hidden state $\theta_i$ at the current time t to the hidden state $\theta_j$ at the next time t + 1.
(5) Output probability matrix B: $B = (b_{jk})_{N \times M}$, where

$$b_{jk} = P(Q_t = V_k \mid q_t = \theta_j), \quad 1 \le j \le N, \ 1 \le k \le M$$

In the formula, $b_{jk}$ represents the probability that the observable state is $V_k$ when the hidden state at the current time t is $\theta_j$.
Therefore, the HMM model can be abbreviated as λ = (N, M, π, A, B). Among them, the model parameters π and A represent the change process of the hidden state, B represents the change process of the observable state. The schematic diagram of a typical HMM is shown in Fig. 1.

Schematic diagram of HMM.
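As a minimal sketch of the double random process described above, the following Python fragment stores λ as plain lists and samples one hidden and one observable sequence. The numerical values, the choice of three observable states, and the names `PI`, `A`, `B`, `check_hmm` and `sample_hmm` are illustrative assumptions and are not taken from this study.

```python
import random

# Toy parameters (hypothetical values, not from the paper).
# Hidden states: 0 = sober, 1 = fatigue; observable states: 0, 1, 2.
PI = [0.8, 0.2]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.6, 0.3, 0.1],
     [0.1, 0.3, 0.6]]


def check_hmm(pi, a, b, tol=1e-9):
    """Return True if pi and every row of A and B sum to 1."""
    rows = [pi] + list(a) + list(b)
    return all(abs(sum(row) - 1.0) < tol for row in rows)


def sample_hmm(pi, a, b, length, rng):
    """Draw a hidden state sequence and the observable sequence it emits."""
    states = range(len(pi))
    symbols = range(len(b[0]))
    hidden, observed = [], []
    state = rng.choices(states, weights=pi)[0]           # first hidden state ~ pi
    for _ in range(length):
        hidden.append(state)
        observed.append(rng.choices(symbols, weights=b[state])[0])  # emit ~ B
        state = rng.choices(states, weights=a[state])[0]            # move ~ A
    return hidden, observed


hidden, observed = sample_hmm(PI, A, B, 10, random.Random(0))
```

Only `observed` would be available in practice; `hidden` plays the role of the true classroom student state that the model has to infer.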
The parameters of the HMM model are generally solved by the BW (Baum-Welch) algorithm, which is an unsupervised training algorithm, and its principle is as follows:
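Given an observable state sequence $O = (o_1, o_2, \cdots, o_T)$ of length T, in order to find the model parameter $\lambda^*$ that maximizes $P(O \mid \lambda)$, the auxiliary function (the Q function) is first constructed:

$$Q(\lambda, \bar{\lambda}) = \sum_{I} \log P(O, I \mid \lambda) \, P(O, I \mid \bar{\lambda})$$

In the formula, I represents the hidden state sequence, $I = (i_1, i_2, \cdots, i_T)$, $\bar{\lambda}$ is the current estimate of the model parameters, and $P(O, I \mid \lambda) = \pi_{i_1} b_{i_1}(o_1) a_{i_1 i_2} b_{i_2}(o_2) \cdots a_{i_{T-1} i_T} b_{i_T}(o_T)$, where $b_j(k)$ denotes $b_{jk}$ and $b_j(o_t)$ is the entry $b_{jk}$ with $o_t = V_k$. The Q function therefore splits into three terms that depend only on π, A and B, respectively, and each term is maximized separately.

(1) The Lagrange function on π is constructed. Since $\sum_{i=1}^{N} \pi_i = 1$, the Lagrange function is

$$\sum_{i=1}^{N} \log \pi_i \, P(O, i_1 = i \mid \bar{\lambda}) + \eta \left( \sum_{i=1}^{N} \pi_i - 1 \right)$$

By taking the partial derivative with respect to $\pi_i$ and setting it to 0, the following result is obtained:

$$\pi_i = \frac{P(O, i_1 = i \mid \bar{\lambda})}{P(O \mid \bar{\lambda})}$$

(2) The Lagrange function on A is constructed. Because $\sum_{j=1}^{N} a_{ij} = 1$, taking the partial derivative with respect to $a_{ij}$ and setting it to 0 gives:

$$a_{ij} = \frac{\sum_{t=1}^{T-1} P(O, i_t = i, i_{t+1} = j \mid \bar{\lambda})}{\sum_{t=1}^{T-1} P(O, i_t = i \mid \bar{\lambda})}$$

(3) The Lagrange function on B is constructed. Since $\sum_{k=1}^{M} b_j(k) = 1$, the partial derivative with respect to $b_j(k)$ is taken and set to 0. Only when $o_t = V_k$ is the partial derivative of $b_j(o_t)$ with respect to $b_j(k)$ not 0, which is recorded by the indicator $I(o_t = V_k)$:

$$b_j(k) = \frac{\sum_{t=1}^{T} P(O, i_t = j \mid \bar{\lambda}) \, I(o_t = V_k)}{\sum_{t=1}^{T} P(O, i_t = j \mid \bar{\lambda})}$$

The variables $\varepsilon_t(i, j)$ and $\gamma_t(i)$ are introduced. Given O and $\bar{\lambda}$, $\varepsilon_t(i, j)$ represents the probability of being in the hidden state $\theta_i$ at time t and in the hidden state $\theta_j$ at time t + 1, and $\gamma_t(i)$ represents the probability that the hidden state at time t is $\theta_i$:

$$\varepsilon_t(i, j) = \frac{P(O, i_t = i, i_{t+1} = j \mid \bar{\lambda})}{P(O \mid \bar{\lambda})}, \qquad \gamma_t(i) = \frac{P(O, i_t = i \mid \bar{\lambda})}{P(O \mid \bar{\lambda})}$$

Therefore, π, A, B can be expressed as:

$$\pi_i = \gamma_1(i), \qquad a_{ij} = \frac{\sum_{t=1}^{T-1} \varepsilon_t(i, j)}{\sum_{t=1}^{T-1} \gamma_t(i)}, \qquad b_j(k) = \frac{\sum_{t=1}^{T} \gamma_t(j) \, I(o_t = V_k)}{\sum_{t=1}^{T} \gamma_t(j)}$$

The optimal model parameter λ* of HMM can be obtained by iterating the above formulas until convergence.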
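One re-estimation step of the Baum-Welch procedure can be sketched in plain Python as follows. This is an illustrative, unscaled implementation that is only suitable for short sequences (long sequences would need scaling or log-probabilities to avoid underflow). The toy parameters, the observation sequence `OBS` and all function names are assumptions introduced for the example.

```python
def forward(pi, a, b, obs):
    """alpha[t][i] = P(o_1..o_t, state i at time t)."""
    n = len(pi)
    alpha = [[pi[i] * b[i][obs[0]] for i in range(n)]]
    for o in obs[1:]:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * a[i][j] for i in range(n)) * b[j][o]
                      for j in range(n)])
    return alpha


def backward(a, b, obs):
    """beta[t][i] = P(o_{t+1}..o_T | state i at time t)."""
    n = len(a)
    beta = [[1.0] * n]
    for o in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(a[i][j] * b[j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def baum_welch_step(pi, a, b, obs):
    """Return re-estimated (pi, A, B) and the likelihood under the old parameters."""
    n, m, T = len(pi), len(b[0]), len(obs)
    alpha, beta = forward(pi, a, b, obs), backward(a, b, obs)
    likelihood = sum(alpha[-1])
    # gamma[t][i]: probability of state i at time t given the observations.
    gamma = [[alpha[t][i] * beta[t][i] / likelihood for i in range(n)]
             for t in range(T)]
    # xi[t][i][j]: probability of state i at time t and state j at time t + 1.
    xi = [[[alpha[t][i] * a[i][j] * b[j][obs[t + 1]] * beta[t + 1][j] / likelihood
            for j in range(n)] for i in range(n)] for t in range(T - 1)]
    new_pi = gamma[0][:]
    new_a = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(n)] for i in range(n)]
    new_b = [[sum(gamma[t][j] for t in range(T) if obs[t] == k) /
              sum(gamma[t][j] for t in range(T))
              for k in range(m)] for j in range(n)]
    return new_pi, new_a, new_b, likelihood


# Hypothetical toy model and observation sequence (0-based state and symbol indices).
PI = [0.8, 0.2]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
OBS = [0, 0, 1, 2, 2, 1, 0, 2, 2, 2]
new_pi, new_a, new_b, old_likelihood = baum_welch_step(PI, A, B, OBS)
```

Calling `baum_welch_step` repeatedly on its own output corresponds to the iterative solution; the likelihood it reports should not decrease from one iteration to the next.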
After the HMM model parameters are obtained, the implicit state sequence can be derived by using HMM and known observable state sequences. The derivation algorithm is Viterbi algorithm, and its principle is as follows:
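$\delta_t(i)$ is defined as the maximum probability, over all hidden state sequences $(q_1, q_2, \cdots, q_t)$ that end in $q_t = \theta_i$ at time t, of producing the observable state sequence $o_1, o_2, \cdots, o_t$:

$$\delta_t(i) = \max_{q_1, \cdots, q_{t-1}} P(q_1, \cdots, q_{t-1}, q_t = \theta_i, o_1, \cdots, o_t \mid \lambda)$$

We define $\psi_t(i)$ as the hidden state at time t − 1 on the path that attains this maximum:

$$\psi_t(i) = \arg\max_{1 \le j \le N} \left[ \delta_{t-1}(j) \, a_{ji} \right]$$

Therefore, the optimal hidden state sequence can be traced back by using $\psi_t(i)$. The specific process is as follows, where $b_i(o_t)$ denotes the entry $b_{ik}$ with $o_t = V_k$ and $q_t^*$ denotes the index of the optimal hidden state at time t:

(1) Initialization: $\delta_1(i) = \pi_i b_i(o_1)$, $\psi_1(i) = 0$, $1 \le i \le N$.
(2) Recursion: $\delta_t(j) = \max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right] b_j(o_t)$, $\psi_t(j) = \arg\max_{1 \le i \le N} \left[ \delta_{t-1}(i) \, a_{ij} \right]$, $2 \le t \le T$.
(3) Termination: $P^* = \max_{1 \le i \le N} \delta_T(i)$, $q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)$.
(4) Backtracking: $q_t^* = \psi_{t+1}(q_{t+1}^*)$, $t = T - 1, T - 2, \cdots, 1$.

Then, the hidden state sequence $(q_1^*, q_2^*, \cdots, q_T^*)$ is the optimal hidden state sequence.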
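The four steps of the Viterbi algorithm can be sketched as follows. The parameter values and the observation sequence are hypothetical toy values with 0-based indices, and the function name `viterbi` is introduced only for illustration.

```python
def viterbi(pi, a, b, obs):
    """Return the most probable hidden state path and its joint probability."""
    n = len(pi)
    # (1) Initialization: delta_1(i) = pi_i * b_i(o_1).
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    psi = []  # psi[k][j]: best predecessor of state j at step k + 1
    # (2) Recursion over the remaining observations.
    for o in obs[1:]:
        back, new_delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] * a[i][j])
            back.append(best_i)
            new_delta.append(delta[best_i] * a[best_i][j] * b[j][o])
        psi.append(back)
        delta = new_delta
    # (3) Termination: pick the best final state.
    last = max(range(n), key=lambda i: delta[i])
    # (4) Backtracking through the stored predecessors.
    path = [last]
    for back in reversed(psi):
        path.insert(0, back[path[0]])
    return path, delta[last]


# Hypothetical toy model: hidden states 0 = sober, 1 = fatigue.
PI = [0.8, 0.2]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
path, prob = viterbi(PI, A, B, [0, 0, 2, 2])
# path is [0, 0, 1, 1]: sober for the first two observations, then fatigue.
```

For longer sequences the products become very small, so a practical implementation would usually work with log-probabilities instead.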
In addition, when the HMM model parameters and the observable state sequence are known, the likelihood probability P (O|λ) of the observable sequence can also be calculated. The commonly used calculation method is the forward algorithm, and its principle is as follows:
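Given $\lambda = (N, M, \pi, A, B)$ and the observable sequence $O = (o_1, o_2, \cdots, o_T)$, $P(O \mid \lambda)$ is solved as:

$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$

In the formula, $\alpha_T(i)$ is the forward probability $\alpha_t(i) = P(o_1, \cdots, o_t, q_t = \theta_i \mid \lambda)$ at t = T, and its calculation method is shown in formula (16):

$$\alpha_1(i) = \pi_i b_i(o_1), \qquad \alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(o_{t+1}), \quad 1 \le t \le T - 1 \tag{16}$$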
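A minimal sketch of the forward algorithm is given below. It keeps only the current forward vector and sums it at the final step. The toy parameters, the sequence `OBS` and the name `forward_likelihood` are illustrative assumptions.

```python
def forward_likelihood(pi, a, b, obs):
    """Return P(O | lambda) computed with the forward recursion."""
    n = len(pi)
    # alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n)]
    # alpha_{t+1}(j) = [sum_i alpha_t(i) * a_ij] * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][o]
                 for j in range(n)]
    return sum(alpha)


# Hypothetical toy model with 0-based state and symbol indices.
PI = [0.8, 0.2]
A = [[0.9, 0.1], [0.2, 0.8]]
B = [[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]]
OBS = [0, 0, 2, 2]
likelihood = forward_likelihood(PI, A, B, OBS)
```

The recursion needs on the order of N²T multiplications, whereas summing the joint probability over every possible hidden path would need on the order of N^T terms.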
Before constructing the HMM-based classroom student recognition model, we first need to determine the initial values of the HMM model parameters.
(1) The number of hidden states N: This article divides the state of classroom students into two types: sober and fatigue. Therefore, the number of hidden states of HMM is 2;
(2) The number of observable states M: The observable state is the classroom student characteristic index $F = (F_2, F_4, F_8)^T$. Its number of categories is obtained by clustering the characteristic index with the FCM (fuzzy C-means) algorithm and using the mixed F statistic to determine the optimal number of clusters. To ensure that the model can be fully trained, we selected the data of eight students, No. 3–8, No. 11 and No. 12 (8 × 240 = 1920 data points in total), as the training set. The data of the remaining classroom students is used as the test set. We then perform FCM clustering on all the training set data and calculate the mixed F statistic for different numbers of clusters. The results are shown in Fig. 2.

Mixed F statistics under different clustering numbers.
It can be seen from the above figure that the mixed F statistic reaches its maximum when the number of clusters is 9, which indicates that the distance between categories of the feature index is largest and the distance within each category is smallest. Therefore, we divide the feature index into 9 categories, that is, the number of HMM observable states is set to 9.
(3) Initial probability vector π: Since N is 2, π is a 2 × 1 vector. Moreover, π has little effect on the accuracy of fatigue state recognition, so the initial value of π is randomly generated subject to $\sum_{i=1}^{2} \pi_i = 1$.
(4) State transition probability matrix A: Since N is 2, A is a 2 × 2 matrix. Like π, A has little effect on the accuracy of fatigue state recognition. Therefore, the initial value of A is randomly generated subject to $\sum_{j=1}^{2} a_{ij} = 1$ for $i = 1, 2$.
(5) Output probability matrix B: Since N is 2 and M is 9, B is a 2 × 9 matrix. B has a greater influence on the accuracy of fatigue state recognition, and the model training algorithm, the Baum-Welch algorithm, only finds a local maximum. Therefore, in order to make the local maximum close to the global maximum, a suitable initial value of B must be selected for model training. The initial value of B can be obtained from statistics of the training set data:

$$b_{jk} = \frac{M_{jk}}{O_j}$$

In the formula, $O_j$ represents the number of feature index data points when the classroom student state is j, and $M_{jk}$ represents the number of feature index data points when the classroom student state is j and the corresponding feature index category is k.
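The counting rule for the initial value of B can be sketched as follows. The function name `initial_output_matrix` and the small labeled example are hypothetical; states and feature categories are 0-based here, and the sketch assumes that every state occurs at least once in the training labels.

```python
def initial_output_matrix(states, categories, n_states, n_categories):
    """Estimate b_jk = M_jk / O_j from labeled training data."""
    # counts[j][k] is M_jk: data points with state j and feature category k.
    counts = [[0] * n_categories for _ in range(n_states)]
    for j, k in zip(states, categories):
        counts[j][k] += 1
    matrix = []
    for row in counts:
        total = sum(row)  # O_j: data points with state j
        matrix.append([c / total for c in row])
    return matrix


# Hypothetical example: 5 labeled points, 2 states, 3 feature categories.
states = [0, 0, 0, 1, 1]
categories = [0, 1, 1, 2, 2]
b_init = initial_output_matrix(states, categories, 2, 3)
```

Entries that come out as exactly zero stay zero under Baum-Welch re-estimation, so a small smoothing constant is sometimes added to the counts; whether this study does so is not stated.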
After setting the initial values of each parameter, the training set data and BW algorithm are used to train the HMM model, and the optimal model parameters can be obtained.
Based on the model constructed above, the fatigue state of the classroom students in the test set can be identified. The test set includes data from four classroom students: No. 1, No. 2, No. 9, and No. 10. Considering the continuity of the changes in classroom student state, and in order to reduce the computational complexity, this paper divides the data of each classroom student into feature index data sequences of 20 data points. Each sequence is input into the model separately as an observable state sequence, and the Viterbi algorithm is then used to infer the hidden state sequence, that is, the classroom student state sequence. Figure 3 shows the test results of one of the sequences, and the corresponding statistics are given in Table 1. The ordinates 1 and 2 indicate the sober and fatigue states, respectively. It can be seen from the figure that the real state of the classroom student changes from sober to fatigue. The recognition results of the model for the first 12 data points agree with the real state change, but from the 13th data point onward the sober state is detected, which differs from the actual situation.

Statistical diagram of recognition results.

Accuracy of model recognition.
Statistical table of recognition results
In order to analyze the recognition results more easily and intuitively, this paper compares the recognition results of the model with the actual state of the classroom students obtained by the subjective evaluation method, and computes the recognition accuracy of the model.
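The comparison between recognized and actual states can be sketched as a per-state and overall accuracy calculation. The function name and the two 20-point sequences below are hypothetical illustrations, not data from the experiment; the state codes 1 (sober) and 2 (fatigue) follow the ordinate convention used above.

```python
def recognition_accuracy(true_states, predicted_states):
    """Return overall accuracy and a dict of accuracy per true state."""
    pairs = list(zip(true_states, predicted_states))
    overall = sum(t == p for t, p in pairs) / len(pairs)
    per_state = {}
    for state in sorted(set(true_states)):
        hits = [t == p for t, p in pairs if t == state]
        per_state[state] = sum(hits) / len(hits)
    return overall, per_state


# Hypothetical sequence: the true state changes from sober (1) to fatigue (2),
# and the model agrees for the first 12 points only.
true_states = [1] * 8 + [2] * 12
predicted = [1] * 8 + [2] * 4 + [1] * 8
overall, per_state = recognition_accuracy(true_states, predicted)
```

Reporting the per-state values alongside the overall value shows whether errors are concentrated in one state, as is done for the sober and fatigue states in the discussion that follows.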
Figure 4 shows the model's recognition accuracy for the different feature index data sequences of each classroom student. It can be seen from Fig. 4 that, for each student, the recognition accuracy of the model varies greatly across sequences. For student No. 9, the highest accuracy is 90% and the lowest is 20%, a difference of 70 percentage points, indicating that the stability of model recognition is poor. For student No. 1, the recognition accuracy of the first two test sequences is higher, but that of the latter four is lower. The main reason is that external factors intervened during the experiment and interfered with the original single state change process. The recognition accuracy of the sober state is 64%, but that of the fatigue state is only 34.3%, and the average accuracy is 46.7%, indicating that EEG signal acquisition in the fatigue state was subject to considerable interference. Similarly, student No. 10 was also disturbed during the experiment; the fatigue state recognition accuracy is low, and the average accuracy is 51.7%. Student No. 9 was in the fatigue state only briefly, so the fatigue recognition accuracy is only 26.7% and the average accuracy is only 38.8%, indicating that the model recognizes this student's state poorly. The average recognition accuracy for student No. 2 is higher than for the other students, reaching 61.7%. The main reason is that the training set contains classroom students with similar personality characteristics, and there was no interference during the experiment. The model's average recognition accuracy over all test set data is only 49.9%, indicating that it is greatly affected by differences in the personality characteristics of classroom students and cannot accurately detect classroom student state in practice.
In summary, the model has low accuracy in classroom student state recognition, high volatility, and poor stability. Moreover, the recognition accuracy is greatly affected by differences in the personality characteristics of classroom students: the greater the difference in personality characteristics between the training set and the test set, the lower the recognition accuracy. Therefore, the model cannot recognize the fatigue state of arbitrary classroom students and is subject to certain restrictions in practical applications.
The timeliness of the detection algorithm is the key to making the classroom student fatigue detection system practical. A feasible way to improve the timeliness of the system is to design a reasonable data sampling strategy for the different stages of the fatigue state, taking into account how the fatigue state is dynamically generated. To this end, based on an analysis of what the model requires in the training and testing stages and of human physiological characteristics, this section designs different sampling mechanisms that balance model prediction accuracy and system timeliness.
Long-term sampling remains a problem in classroom student fatigue detection. From the perspective of eye opening and closing, the normal human blink rate is about 15 times per minute, that is, about one blink every four seconds, so the sober state must be defined as the eyes staying open continuously for 3-5 seconds or more. Eyes that are open only for a short period, such as 1 second, are not enough to determine the sober state. At common video frame rates, 3-5 seconds contains around 100 video frames. For the Conv-LSTM used in the model, generally speaking, the longer the time step, the more information the model can memorize. However, the time step is limited by computing power and LSTM performance, so it needs to be shortened. At the same time, the sampled frames need to cover the 3-5 second window as fully as possible. Therefore, a segment sampling method is introduced in the training phase and combined with the time step of the convolutional LSTM used in this model, with K = 8. This mechanism is called the Long Range Sampling Mechanism (LRSM). Specifically, a continuous training video is divided into eight segments according to the number of video frames. One frame is then randomly selected from each segment as a training image, and the images are passed, in temporal order, through the convolutional neural network for feature extraction and then through the LSTM for time series information extraction.
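The segment sampling step of LRSM can be sketched as follows: the frame indices of a clip are split into K nearly equal segments and one index is drawn at random from each. The function name `long_range_sample` and the handling of clips shorter than K frames are assumptions made for this sketch.

```python
import random


def long_range_sample(n_frames, k=8, rng=random):
    """Pick one random frame index from each of k consecutive segments."""
    if n_frames < k:
        raise ValueError("clip must contain at least k frames")
    # Segment i covers frame indices bounds[i] .. bounds[i + 1] - 1.
    bounds = [i * n_frames // k for i in range(k + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(k)]


# Example: a 100-frame clip yields 8 frame indices in temporal order.
indices = long_range_sample(100, k=8, rng=random.Random(0))
```

Because each index comes from a later segment than the previous one, the sampled frames are already in temporal order and together span the whole clip, which is what allows a time step of only eight to cover a window of several seconds.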
In the testing stage, a dynamic adaptive sampling mechanism (DASM) based on state identification is introduced. Dynamic adaptation here means that the sampling interval Fn used during the test is controlled according to the current classroom student state identified by the model. The specific rules are as follows: When the model detects that the classroom student is sober, the sampling interval is set to 10 frames. When a fatigue state is detected, the mechanism decreases the sampling interval according to a linear rule. When fatigue has been detected more than three consecutive times, the sampling interval is set to 1. The purpose of controlling the sampling frequency according to the state of the classroom student is to sample sparsely while the student is sober, which reduces the computational workload, and to increase the sampling frequency when the fatigue state occurs, which prevents missed detections.
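The interval update rule can be sketched as a small state machine. The text does not give the slope of the linear rule, so the step of 3 frames per consecutive fatigue detection (10, 7, 4, 1) is an assumption of this sketch, as are the names `update_sampling`, `SOBER` and `FATIGUE`.

```python
SOBER, FATIGUE = 1, 2  # state codes, following the ordinate convention 1 = sober, 2 = fatigue


def update_sampling(state, fatigue_run, max_interval=10, step=3):
    """Return (next sampling interval Fn, updated count of consecutive fatigue results)."""
    if state == SOBER:
        return max_interval, 0          # sparse sampling while sober
    fatigue_run += 1
    if fatigue_run > 3:
        return 1, fatigue_run           # more than three in a row: sample every frame
    # Assumed linear rule: shrink the interval by `step` per consecutive detection.
    return max(1, max_interval - step * fatigue_run), fatigue_run


# Example: one sober result, four fatigue results in a row, then sober again.
intervals, run = [], 0
for state in [SOBER, FATIGUE, FATIGUE, FATIGUE, FATIGUE, SOBER]:
    fn, run = update_sampling(state, run)
    intervals.append(fn)
# intervals is [10, 7, 4, 1, 1, 10]
```

A different slope only changes how quickly the interval reaches 1; the reset to the maximum interval on a sober result and the floor of 1 follow the rules stated above.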
Figure 5 shows the accuracy curve under different Fn. The abscissa is the sampling interval, set in the range 1–12, and the ordinate is the accuracy of the model at the corresponding sampling interval. It can be seen from the figure that the accuracy is lowest, 73.43%, when Fn = 4, and that it essentially reaches its maximum when Fn = 6 and Fn = 10, at 80.60% and 80.85%, respectively. To balance the accuracy of the model and the operating efficiency of the system, the maximum sampling interval for the sober state was finally set to 10 frames. It should be noted that the value of Fn is related to the time step parameter of the long short-term memory (LSTM) network, so the appropriate Fn will differ somewhat for neural network models with different memory steps. Based on the maximum sampling interval for the sober state, the schematic diagram of the dynamic sampling interval shown in Fig. 6 is obtained.

Model accuracy at different sampling intervals.

The dynamic sampling interval.
Figure 7 shows the flow chart of the dynamic adaptive data sampling mechanism based on the above dynamic sampling interval strategy. When multiple video segments need to be detected, they are processed in sequence, each one after the previous segment has been detected. In the figure, Fn represents the dynamically changing sampling interval, which is initialized to 1 before the detection of each video segment starts and is then updated according to the detection result and passed on to the next sampling step. N represents the minimum number of video frames needed to satisfy the sampling conditions at the current sampling interval. F is the total number of video frames that currently remain to be sampled. If F is greater than N, sampling can be completed and a prediction made. When the condition is not met, detection moves on to the next video segment.

Flow chart of data sampling mechanism algorithm.

Time comparison of different video segments in the test set.
Each sample contains between 60 and 500 video frames. During the running of the designed model, the duration of a single prediction varies from 0.02 s to 0.07 s and is generally concentrated between 0.02 s and 0.03 s. It should also be noted that the first prediction involves model initialization and the restoration of trained parameters, so it takes longer, often about 220 s. The curves exclude the first prediction, that is, the time data of one test sample, so the prediction times during model running cover a total of ten samples. It can be clearly seen from the curves that: (1) The overall prediction time of each video segment is positively correlated with its length, that is, the longer the video, the longer the overall prediction time. (2) When the video segment is short, the prediction times of the dynamic adaptive data sampling mechanism and of equal interval sampling differ little. However, when the video is longer, the time consumed by the dynamic adaptive data sampling mechanism is significantly lower than that of the equal interval sampling strategy. (3) The rising and falling trends of the two curves are roughly the same. For long samples, the dynamic adaptive data sampling mechanism can shorten the prediction time by a factor of 2–4.
Table 2 shows the test accuracy of the dual-stream model on the RGB-O model. The accuracy of the dual-stream model combined with the optical flow feature is 90.91% when sampling at equal intervals, which is higher than the 81.82% of the single-channel model. However, after adopting the data sampling strategy and the fatigue discrimination strategy, the number of correctly identified samples decreased by 1, resulting in a decrease in accuracy. Inspection of the misrecognized sample shows that the test sample is short and that a single recognition error, in which fatigue was recognized as sober, occurred during the prediction of the video segment. Because the dynamic adaptive sampling mechanism then lengthened the sampling span, sampling of the video segment ended, which led directly to the final recognition error. If the number of test samples or the length of the test video segments is increased, this phenomenon can be alleviated and the reliability of the final accuracy improved.
Accuracy of dual-stream model test
In addition, this study measures the difference in test time of the model under the two sampling mechanisms. The longer the test video segment, the more obvious the difference. With the dynamic adaptive data sampling mechanism, the prediction time of a single sample is reduced by a factor of approximately 2–4.
In order to discriminate the learner's learning state, this paper further studies the problem of video recognition features. This study proposes a method for eliminating differences based on model parameter transformation. Based on this method, by clustering classroom students and applying linear weighted summation and adaptive operations to the SFIM of all classroom students in the same cluster, this study obtains a state recognition method for random classroom students. In addition, this study designs a random human fatigue recognition experiment to complete the collection and preprocessing of EEG data, facial video data, and subjective evaluation data of classroom students, and takes the fusion of facial video data and subjective evaluation data as the true state of classroom students. The EEG data is then divided into sober and fatigue data segments based on the true state, which completes the construction of the research data set for classroom student state recognition. This paper also investigates the relationship between the individual characteristics of classroom students and the fatigue detection system, and advances the research on the classification of classroom student characteristics and model adaptation methods. Finally, the model performance analysis shows that the proposed model is effective to a certain extent.
