Abstract
Educational process mining (EPM) aims to help teachers understand the overall learning process of their students. Although deep learning models have shown promising results in many domains, the event log dataset in many online courses may not be large enough for deep learning models to approximate the probability distribution of students’ learning sequences, owing to the small number of participants. This study proposes a deep learning framework to help uncover the learning progression of learners. It aims to produce a graphical representation of the overall educational process from event logs. Our framework adopts the Smith–Waterman algorithm from the bioinformatics field to evaluate general learning sequences generated from deep learning models. Using our framework, we compare the performance of a deep learning model with a modified cross-attention layer against a model without the modification and find that the modified model outperforms the other. The contribution of this framework is that it enables the use of neural architecture search techniques to uncover students’ general learning sequence irrespective of the dataset’s size. The framework also helps educators identify educational materials that act as learning bottlenecks, enabling them to improve the materials and their layout order, thus facilitating student learning.
Keywords
Introduction
Many academic institutions utilise a Learning Management System (LMS) to improve the learning and teaching experience (Aldiab et al., 2019). In recent years, these institutions have been delivering their courses through LMS platforms, where students can access different course materials freely throughout the duration of an enrolled course. In general, LMS platforms are considered to be more cost-effective than traditional classrooms (Bartley & Golek, 2004). In particular, the Modular Object-Oriented Dynamic Learning Environment (Moodle), one of the most prominent open-source LMS platforms globally, is regarded in the literature as an efficient LMS that provides sets of learning-centric resources (Mustafa & Ali, 2023).
While students interact with the Moodle platform to access course content such as videos, assignments, lecture notes, and quizzes, Moodle helps teachers monitor students’ activity with its logging system (Shrestha & Pokharel, 2021). However, the log data generated by Moodle or other LMS platforms is difficult for teachers to comprehend, making it challenging to use the data to improve the student learning experience. Therefore, Educational Data Mining (EDM) and Educational Process Mining (EPM) have emerged as methods to obtain useful educational insights from raw data.
EDM-related algorithms originate from classical regression and classification data mining techniques and aim to extract simple patterns and data correlations. EDM is often used to predict student performance, detect undesirable behaviours and automate evaluation (Hernández-Blanco et al., 2019). However, information extracted from EDM alone is not sufficient to produce an overview visualisation of the learning process and does not consider the process as a whole (Bogarín et al., 2018). In contrast, EPM is a specialised area that focuses on improving comprehension of the overall learning process.
Specifically, EPM is process-centric and assumes event sequences as the data type, with each event associated with a single process instance or online activity. In summary, EPM focuses on the overall process as a whole, whereas EDM focuses on significant patterns that emerge from data (Bogarín et al., 2018). For example, in the context of a Moodle course, EDM would focus on predicting students’ final exam scores from their other test scores. EPM algorithms, on the other hand, would analyse the sequences of events from the beginning of the course up until the final exam.
There are three main types of EPM (Bogarín et al., 2018): learning process discovery mining, conformance analysis and process model analysis. According to Bogarín et al. (2018), the main goal of educational process discovery mining is to give the instructor a clear visualisation of the student learning behaviour model. Conformance analysis, in contrast, helps the instructor to analyse whether a presumed behaviour model corresponds to the behaviour shown in event logs. Process model analysis aims to elaborate and extend the use cases of a given process model, which can help instructors to find learning bottlenecks and relationships between students in a course by comparing different process models.
Learning process discovery remains one of the most challenging tasks in EPM. Process discovery is challenging because a myriad of possible behavioural models can be drawn from any one set of event logs. This makes finding the behavioural model that best describes the general learning process out of all possible candidate models akin to finding a needle in a haystack. In the context of EPM, event logs can be produced from multiple learning progression models convoluted together, as different students may have different learning processes. As such, it is extremely difficult to generate a general learning process model from educational event logs.
Furthermore, students are allowed to access different course materials freely throughout the duration of an enrolled course in an LMS setup such as Moodle. Therefore, the log events created by students accessing different learning materials are not causally dependent on one another according to the definition originating from the alpha algorithm (Van der Aalst et al., 2004). Although a process model connecting all course materials with a parallel edge can be the correct model, it may not be a particularly useful model as the model is too generic (de Medeiros et al., 2007).
Consequently, metrics such as preciseness (de Medeiros et al., 2007) have been developed to limit the vagueness of the process model mined as a representation of the corresponding event logs. However, as the number of highly diverse traces in the event log of an LMS increases, the models that can be mined from the event log become less precise. Furthermore, as students often differ in their online learning behaviour (Hsiao et al., 2019), generalising the different types of learning behaviour into a model that educators can usefully visualise remains a challenge in EPM.
Recent advancements in deep learning research have encouraged the widespread adoption of neural network architectures based on transformers (Vaswani et al., 2017) for analysing sequential data. However, in many online courses, the event log dataset may not be large enough for deep learning models (Du et al., 2020) to accurately approximate the probability distribution of students’ general learning sequences, owing to the limited number of students. Moreover, a general deep learning framework for uncovering the learning sequence of students is lacking in the literature due to the absence of a settled evaluation method. This study proposes a novel deep learning framework to help uncover students’ learning sequence. Specifically, this study aims to develop a framework to ensure that each segment in the general learning sequence proposed by deep learning models is backed by a subsequence that can be found within the event dataset.
In our framework, we adopt the Smith-Waterman sequence alignment algorithm from bioinformatics to compare a sequence generated from a deep learning model to the original dataset and assess its similarity, enabling more precise evaluation.
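To make the comparison concrete, the following is a minimal sketch of Smith–Waterman local alignment between a generated sequence and a single event trace. The scoring values (match +2, mismatch −1, gap −1), the function name and the event labels are illustrative assumptions, not the parameters or data used in this study.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Return (best_score, aligned_a, aligned_b) for the best local alignment.

    Scoring values are illustrative assumptions. None marks a gap.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Matrix filling: every cell is floored at zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Backtracking: walk back from the best cell until a zero is reached.
    i, j = best_pos
    aligned_a, aligned_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            aligned_a.append(seq_a[i - 1])
            aligned_b.append(None)
            i -= 1
        else:
            aligned_a.append(None)
            aligned_b.append(seq_b[j - 1])
            j -= 1
    return best, aligned_a[::-1], aligned_b[::-1]


# Hypothetical generated sequence and student trace.
generated = ["pre_test", "video_1", "quiz_1", "notes_2"]
trace = ["video_1", "quiz_1", "quiz_1", "post_test"]
score, part_a, part_b = smith_waterman(generated, trace)
print(score, part_a)  # 4 ['video_1', 'quiz_1']
```

Because the score is local, only the best-matching segment contributes, which is what allows a generated sequence to be checked segment by segment against traces of very different lengths.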
We demonstrate the deep learning framework using a Moodle course that introduces statistical concepts to 94 students across three cohorts, with 33 course materials in total. Ultimately, this framework enables the use of neural architecture search techniques to uncover the general learning sequence of students, irrespective of the size of the dataset. Using our framework, we demonstrated how two deep learning models are evaluated and selected. On top of that, we design a convenient visualisation for users to see the learning process from the model output.
Background
There are four major areas of study related to EPM according to Bogarín et al. (2018): intention mining, sequence pattern mining, sub-graph mining and process mining. Intention mining aims to model the processes according to the intention of the actors rather than the actual process (Khodabandelou et al., 2013). Sequence pattern mining and sub-graph mining focus on searching for interesting patterns or sub-processes that may describe the underlying process, whilst process mining focuses on extracting the overall underlying process from event logs. All in all, process mining remains one of the most challenging areas in EPM.
In EPM, there are two major approaches to constructing learning process models: bottom-up approaches (sometimes referred to as local search) and top-down approaches. Bottom-up approaches build a learning process model by merging small local patterns, whilst top-down approaches build a learning process model by modifying a prior learning process model (Trcka & Pechenizkiy, 2009). The alpha algorithm (Van der Aalst et al., 2004) and the heuristics miner algorithm (Weijters et al., 2006) are considered to be bottom-up approaches. Genetic process mining (de Medeiros et al., 2007) and fuzzy process mining (Günther & van der Aalst, 2007) adopt top-down strategies.
Over the years, different approaches have developed their own conformance checking (Rozinat & Van Der Aalst, 2006) methodologies and metrics to find discrepancies and commonalities between the model behaviour and the observed behaviour (Bogarín et al., 2018). Although deep learning has been widely adopted in EDM to predict learning outcomes, detect undesirable student behaviours and automate evaluation (Hernández-Blanco et al., 2019), it is rarely used in EPM to construct a learning process model.
The alpha algorithm (Van der Aalst et al., 2004), one of the earliest process mining algorithms, extracts a workflow process model based on a myriad of event sequences, which are called traces. In essence, the algorithm classifies the relationships between events into three major types: causal relation, potential parallelism and neither. From these relationships, the algorithm constructs two sets A and B in which all of the events in A are causally related to B. The events within each set should be neither causally related nor parallel with one another. For every pair (A, B), a place (a type of node in a Petri net) is added to connect all of the events from A to B. A start event (an event without incoming links) and an end event (an event with only incoming links) are then added to the process model to order the intermediate processes of the process graph.
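The relation-classification step described above can be sketched as follows. This is only the first stage of the alpha algorithm (building the so-called footprint from the directly-follows relation), not the full Petri net construction; the function names and the toy traces are assumptions made for illustration.

```python
from itertools import product


def directly_follows(traces):
    """Collect every pair (a, b) where b immediately follows a in some trace."""
    return {(t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)}


def footprint(traces):
    """Classify each ordered pair of events.

    '->' causal, '<-' reverse causal, '||' potentially parallel, '#' neither.
    """
    follows = directly_follows(traces)
    events = sorted({e for t in traces for e in t})
    relations = {}
    for a, b in product(events, repeat=2):
        ab, ba = (a, b) in follows, (b, a) in follows
        if ab and not ba:
            relations[(a, b)] = "->"
        elif ba and not ab:
            relations[(a, b)] = "<-"
        elif ab and ba:
            relations[(a, b)] = "||"
        else:
            relations[(a, b)] = "#"
    return relations


# Toy log: b and c occur in both orders, so they are potentially parallel.
toy_traces = [["a", "b", "c", "d"], ["a", "c", "b", "d"]]
rel = footprint(toy_traces)
print(rel[("a", "b")], rel[("b", "c")], rel[("a", "d")])  # -> || #
```

In an LMS log where students visit materials in any order, most pairs end up in the parallel class, which is why the resulting model tends to be too generic to be useful.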
The heuristics miner algorithm (Weijters et al., 2006) uses a value ranging from −1 to 1 to express how strongly two events relate to one another in three ways. Like the alpha algorithm, the heuristic miner algorithm takes consecutive causal relations between two events into account. By normalising the frequency of directed consecutive causal relations, the heuristic values between two single events can be calculated. A threshold can then be applied to the heuristic matrix to determine whether there should be an edge connecting two events in the process model.
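As an illustration, the dependency measure commonly associated with the heuristics miner, (|a>b| − |b>a|) / (|a>b| + |b>a| + 1), can be computed and thresholded as in the sketch below. The function names, the toy traces and the threshold of 0.6 are assumptions chosen for the example.

```python
from collections import Counter


def dependency_matrix(traces):
    """Dependency value in (-1, 1) for every ordered pair of distinct events."""
    counts = Counter(
        (t[i], t[i + 1]) for t in traces for i in range(len(t) - 1)
    )
    events = sorted({e for t in traces for e in t})
    dep = {}
    for a in events:
        for b in events:
            if a == b:
                continue
            ab, ba = counts[(a, b)], counts[(b, a)]
            # The +1 in the denominator damps pairs seen only a few times.
            dep[(a, b)] = (ab - ba) / (ab + ba + 1)
    return dep


def edges_above(dep, threshold):
    """Keep only the pairs whose dependency value reaches the threshold."""
    return sorted(pair for pair, value in dep.items() if value >= threshold)


# Toy log: four traces a-b-c and one noisy trace a-c-b.
toy_traces = [["a", "b", "c"]] * 4 + [["a", "c", "b"]]
dep = dependency_matrix(toy_traces)
print(dep[("a", "b")], dep[("b", "c")])  # 0.8 0.5
print(edges_above(dep, 0.6))             # [('a', 'b')]
```

Raising the threshold removes the weaker b-to-c edge, which shows how the choice of threshold trades noise robustness against completeness of the mined model.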
Another major improvement brought by the heuristic miner algorithm is the ability to detect length-one (a repeating event) and length-two loops (two consecutive repeating events). The heuristic miner algorithm specifically counts the number of length-one and length-two loops in the event log and normalises the value. A similar approach is applied to detect long-term dependence. The heuristic miner algorithm specifically counts the number of times an event A is followed by another event B at a later point in a trace and normalises the value with the total occurrence of A. Heuristic miner algorithms allow a user to pick a threshold to filter out low-frequency occurrences, which makes the algorithm more robust to noise (Bogarín et al., 2018).
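The two counting schemes above can be sketched as follows. The length-one loop score assumes the normalisation |a>a| / (|a>a| + 1), and the long-term dependency follows the description in the text (occurrences of A that are later followed by B, divided by the total occurrences of A); both functions and the toy traces are illustrative assumptions rather than a reference implementation.

```python
def length_one_loop_score(traces, event):
    """Score how often `event` directly repeats itself: n / (n + 1)."""
    repeats = sum(
        1
        for t in traces
        for i in range(len(t) - 1)
        if t[i] == event and t[i + 1] == event
    )
    return repeats / (repeats + 1)


def long_term_dependency(traces, a, b):
    """Share of occurrences of `a` followed by `b` later in the same trace."""
    total = followed = 0
    for t in traces:
        for i, e in enumerate(t):
            if e == a:
                total += 1
                if b in t[i + 1:]:
                    followed += 1
    return followed / total if total else 0.0


# Hypothetical traces with a repeating quiz event.
toy_traces = [
    ["quiz", "quiz", "quiz", "notes", "exam"],
    ["video", "notes", "exam"],
    ["video", "quiz"],
]
print(long_term_dependency(toy_traces, "video", "exam"))  # 0.5
print(long_term_dependency(toy_traces, "notes", "exam"))  # 1.0
```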
Genetic process mining is designed to properly address non-free choices (de Medeiros et al., 2007), invisible tasks and duplicate tasks. Genetic process mining involves five major steps: reading an event log, building a population, computing fitness, selecting from the population through genetic operations and returning possible candidates.
Genetic process mining starts off with a population of models. Each model within the population has the same set of activities as those found in the event log. A new generation of models is selected from the old population according to their fitness. Genetic operators, such as crossover (merging two models) and mutation (the modification of a model), are applied to the newly selected models to increase the diversity of the new population. The genetic algorithm’s fitness function, which comprises formulas for both completeness (the model can reproduce the given set of event traces) and preciseness (the model cannot parse more than the given set of event traces), has added much value to the global search framework.
Fuzzy process mining (Günther & van der Aalst, 2007) algorithms are a branch of process mining algorithms that use a fuzzifier (van Dongen & Adriansyah, 2010) to cluster events into groups. Performing clustering before the mining process gives fuzzy process mining an advantage in handling highly unstructured behaviour and substantial numbers of events (Bogarín et al., 2018). The output of fuzzy process mining is a fuzzy net that can be interpreted as a more abstract representation of the underlying behaviour model (Bogarín et al., 2018). Fuzzy algorithms adopt both local and global strategy techniques to construct the process model (Gupta, 2014), and they remain one of the most popular algorithms in EPM research (Bogarín et al., 2018).
Methodology
In response to the challenges stated above, we propose a novel learning process discovery EPM framework using deep learning models and a dynamic visualisation methodology. The proposed framework differs from other frameworks in that it adopts the Smith–Waterman sequence alignment algorithm from bioinformatics to compare a sequence generated from a deep learning model to the original dataset, thereby enabling a more detailed evaluation of the alignment between the sequence and the dataset. This approach allows for a more nuanced analysis of the data over each time step of the sequence. We formalise our notion starting with Definitions 1–5 below.
Definition 1 (course). Let
Definition 2 (event traces and event log). Let
Definition 3 (original sequence of events). A sequence of events is an arbitrary sequence of actions denoted by
Definition 4 (start event and end event). Similar to the start and end places in a Petri net, a start event
Using the above definitions, we can now enrich the definition of event traces from
Definition 5 (general learning sequence, general learning decision instance). Given
Ultimately, we want to find a model that can help us to identify a general learning sequence, given as (4)
From the above definitions, any deep learning models can be used for
To demonstrate the framework, a minor modification is made to the transformer model’s position embedding layer (Vaswani et al., 2017), as shown in (5)
The output of the masked single-head attention layer is
We add a linear layer neural network between the left side (similar to the encoder part (Vaswani et al., 2017)) and right side (similar to the decoder part (Vaswani et al., 2017)) of the network acting as an abstraction operation. Furthermore, as shown in the model structure diagram in Figure 1, there is at least one masked operation between the input and output to reinforce the model to learn the conditional probability distribution of
We train the model to approximate
After parameterisation of the original construct,
Based on this framework, we created a web application for users to train their models and visualise the output of the probabilistic model using a Sankey diagram and the animated state transition graph. Figure 2 presents an instance of an animated state transition graph. In contrast to other frameworks, the animated state transition graph lays out the information in the time domain rather than only in the space domain. This helps to avoid overwhelming users with the spaghettified diagrams (van Dongen & Adriansyah, 2010) introduced by heuristic mining and the complex notations (Bogarín et al., 2018) introduced by fuzzy abstraction.
In our framework, a sequence alignment algorithm is used to evaluate the general learning sequence
The Smith–Waterman algorithm comprises two main steps: matrix filling and backtracking (Xia et al., 2021). The matrix filling step computes the alignment matrix, of which its maximum value can be regarded as the local alignment score between
The parametric alignment evaluation score is defined as
From
By aligning
The support measurement not only reveals the section in which
One major advantage of constructing a probabilistic model is that the model can represent a huge sample space. In our case, the size of the sample space is
Furthermore, the support evaluation can be used as an indicator to determine learning bottlenecks incrementally. Formally, we can use equation (15) to formalise learning bottlenecks. If
From equation (16), we can see that the cardinality of
With the formula stated above, we can derive learning bottlenecks in terms of a set of educational materials. Let
Educational materials can be grouped based on
In summary, our proposed deep learning framework consists of the following phases: data pre-processing, model training, general learning sequence extraction, general learning sequence alignment, general learning sequence trimming and general learning sequence presentation. The data pre-processing phase is a process of filtering out unnecessary or redundant log events. Some events are logged multiple times for each dynamic interaction, which creates an imbalance in the event log, whereas others are logged fewer times per interaction. To extract meaningful interactions in a probabilistic base framework,
General learning sequence extraction is a process of sampling a general learning sequence from any probabilistic model. As there may be an imbalance between the number of traces in
After trimming,

Modified transformer block. Note. The original design (Vaswani et al., 2017) separates the left part of the model and regards it as an encoder and the right part of the model as a decoder. ‘Modified single-head cross attention layer’ refers to the cross-attention layer with the masked operation as well as the abstraction layer.
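For readers unfamiliar with the masked operation, the sketch below shows generic single-head scaled dot-product attention with a causal mask (Vaswani et al., 2017), written in plain Python. It is not the modified cross-attention layer of this study, whose exact form is given by the model definition; the function name and the toy vectors are assumptions.

```python
import math


def masked_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position t may only attend to positions 0..t, so later events cannot
    influence the representation of earlier ones.
    """
    d = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores are computed only for the unmasked positions 0..t.
        scores = [
            sum(qi * ki for qi, ki in zip(q, keys[s])) / math.sqrt(d)
            for s in range(t + 1)
        ]
        # Numerically stable softmax over the unmasked scores.
        peak = max(scores)
        weights = [math.exp(x - peak) for x in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append(
            [
                sum(w * values[s][k] for s, w in enumerate(weights))
                for k in range(len(values[0]))
            ]
        )
    return outputs


# Toy example with two time steps and two-dimensional vectors.
q = k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = masked_attention(q, k, v)
print(out[0])  # [1.0, 2.0]: the first step can only see itself
```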

Animated state transition graph of a general learning sequence. Note. The red dot in the middle of an animated state transition graph represents the current instance of a general learning sequence. One important distinction between a state transition graph and an animated state transition graph is that there is a time step component represented by the animation in an animated state transition graph. At time step 1, the current state is ‘[Video]_Introduction_to_Sampling_Distribution’. In a state transition graph, the current state can either transition left or down. However, in an animated state transition graph, it can only go down, as revealed in the later time step 2. The degree of freedom upper bound of an animated state transition graph is

Phases of the deep learning framework for uncovering the learning progression of learners.
Results
We applied our deep learning framework to a Moodle course that introduces statistical concepts to 94 students across three cohorts, of whom 28 consented to the use of their Moodle event log data. During the course, each student is free to access the course materials apart from the post-test (final exam), which is only available at the end of the three-week course. Since Moodle logs are available at site and course level, we collected the log event data from the Moodle log database by selecting the course ID. Each event or action relates to a user accessing a page or course material in Moodle. For instance, if a student watched a video in Moodle, it would be recorded in the event log as a video event. There are four teaching sessions spanning the three-week course. In the first teaching session, we introduce the Moodle course and ask the students to complete a pre-test. We also introduce the topic ‘Sampling Distribution’ in this session. A week after the first session, we host another teaching session to go over the topic ‘Central Limit Theorem’, summarising the teaching materials that should be covered. In the third session, we ask students to cover the topic ‘Confidence Interval and Hypothesis Testing’ in preparation for the post-test in the final session. In the fourth session, we ask the students to complete the post-test.
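The step of turning course-level log rows into per-student event traces can be sketched as below. The field names (`userid`, `courseid`, `eventname`, `timecreated`) are modelled on Moodle's standard log store but should be treated as assumptions, as should the function name, the event labels and the sample rows, which are invented for illustration.

```python
from collections import defaultdict


def build_traces(log_rows, course_id):
    """Group one course's log rows by user and order each group by time."""
    per_user = defaultdict(list)
    for row in log_rows:
        if row["courseid"] == course_id:
            per_user[row["userid"]].append(
                (row["timecreated"], row["eventname"])
            )
    # Sorting the (time, event) tuples puts each trace in chronological order.
    return {
        user: [name for _, name in sorted(events)]
        for user, events in per_user.items()
    }


# Hypothetical rows; course 7 is the course of interest.
rows = [
    {"userid": 1, "courseid": 7, "eventname": "quiz_1", "timecreated": 30},
    {"userid": 1, "courseid": 7, "eventname": "video_1", "timecreated": 10},
    {"userid": 2, "courseid": 7, "eventname": "pre_test", "timecreated": 5},
    {"userid": 2, "courseid": 9, "eventname": "forum", "timecreated": 6},
]
print(build_traces(rows, 7))
# {1: ['video_1', 'quiz_1'], 2: ['pre_test']}
```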
Course Materials in the Statistical Concepts in Higher Education Course Topic ‘Sampling Distribution’.
Course Materials in the Statistical Concepts in Higher Education Course Topic ‘Central Limit Theorem’.
Course Materials in the Statistical Concepts in Higher Education Course Topic ‘Confidence Interval’.
Course Materials in the Statistical Concepts in Higher Education Course Topic ‘Hypothesis Testing’.
Although the animated state transition graph reveals the structure of the Moodle course and the section order of the Moodle course layout, only quizzes are included in the graph, with all other materials treated as noise. This phenomenon is caused by the imbalanced event counts between quizzes and other educational materials: each attempt at a question in a quiz is recorded repeatedly as a quiz event, whereas only one event is recorded when the other educational materials are accessed.
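One simple way to reduce this kind of imbalance is to collapse each run of consecutive identical events into a single event. The sketch below illustrates the idea only; it is an assumption and not necessarily the pre-processing rule applied in this study, and the event labels are hypothetical.

```python
from itertools import groupby


def collapse_repeats(trace):
    """Replace each run of consecutive identical events with one event."""
    return [event for event, _ in groupby(trace)]


# Three consecutive quiz attempts become a single quiz event.
raw = ["quiz_1", "quiz_1", "quiz_1", "video_2", "quiz_1"]
print(collapse_repeats(raw))  # ['quiz_1', 'video_2', 'quiz_1']
```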
In light of this, the event log
Table 5 shows that the model with the masked cross-attention layer significantly outperformed the model without the masked layer. Finally, we performed an independent sample t-test, and the results show that our mask modification on the cross-attention layer achieved a significantly higher
The blue line in the graph in Figure 4 represents the support of the general learning sequences produced by the model with the modified cross-attention layer measured against the event log 
The black line shown in the graph represents the accumulated trace count up to time step
Although a long trace may have a better likelihood of aligning with a general learning sequence with a longer subsequence, the section of subsequence alignment may not necessarily be supported towards the end of the sequence. As a result, there is a dip in the support for around one third of the general learning sequences produced from both the masked and non-masked models when the accumulated trace count drops below 18.
These results revealed that the deep learning model with a masked cross-attention layer outperformed other models in the region in which the accumulated trace count is larger than 30. Using the deep learning model with better performance, we mined
The structure of the online learning course Statistical Concepts in Higher Education.
During the course, we guided the students to follow the designated structure. Moodle includes the event ‘\core\event\course_viewed’, which is triggered when a student accesses the front page of the Moodle course. Figure 7 captures an instance of the

Overall, 

Less significant learning bottlenecks were the set {‘Notes for Confidence Interval’} and the set {‘Pre Test’, ‘[Video] Introduction to Basic Statistics’, ‘[Quiz] Definition of Sampling Distribution’}. Using such an analysis, course instructors can identify which topics require the most attention. In addition to the parametric analysis of the learning bottleneck, our web application allows course instructors to visualise the topology change.
Figure 9 shows the topology of the animated state transition graph (
Animated state transition graph topology
In summary, the results show that the deep learning models are able to produce a graphical representation of the overall educational process from event logs. The results also demonstrate how the Smith–Waterman algorithm from the bioinformatics field evaluates general learning sequences generated from deep learning models. Under our framework, using the event logs of 28 students from a Moodle course that introduces statistical concepts with 33 course materials, modifying the cross-attention layer allowed the deep learning model to better approximate the distribution of general learning sequences, and the modified model significantly outperformed the unmodified one (p < .05). Last but not least, the results show that a significant learning bottleneck, ‘[Quiz] Calculate Confidence Interval’, can be identified by examining the general learning process model under the framework.
Discussion and Conclusion
There are two types of search strategies for tackling the EPM process model discovery challenge: local search strategies and global search strategies. Frequency-based methods, such as the heuristics miner algorithm and the alpha algorithm, are regarded as local strategies, whereas sampling-based methods, such as the genetic algorithm, are regarded as global strategies. Although the framework proposed in this study is grounded in frequency, the parameterisation step allows global search methods to be incorporated into the search algorithm. There are seven phases in this framework: data pre-processing, model training, sequence extraction, sequence evaluation and sequence alignment, sequence trimming, state transition graph extraction and bottleneck evaluation.
Comparison of the Deep Learning Framework and Other Frameworks for Uncovering Learning Progression.
If the number of students in a course (and thus the data size) is small, course instructors may find it difficult to train a model with a large number of parameters. Therefore, deep learning models with fewer parameters should be adopted if there are only a small number of traces in the log. In this study, the model complexity (i.e., the embedding size) and training epochs are fixed for comparison;
Figure 10 illustrates how neural architecture search techniques can be merged into our deep learning process mining framework. The adoption of a neural architecture search framework would allow for the auto-tuning of model hyperparameters such as embedding size. In essence, a neural architecture search aims to find the best neural architecture given a dataset, a definition of the dedicated search space and an evaluation metric. In our deep learning framework, we properly define the event log
Deep learning framework for uncovering the learning progression of learners with a neural architecture search.
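In its simplest form, such a search is an exhaustive loop over a small search space, scored by the evaluation metric. The sketch below searches over embedding size only; `train_fn` and `score_fn` are hypothetical placeholders for model training and for an alignment-based evaluation, and the toy stand-ins at the bottom exist only to make the example runnable.

```python
def search_embedding_size(candidates, train_fn, score_fn):
    """Return (best_size, best_score) over candidate embedding sizes.

    train_fn(size) -> model and score_fn(model) -> float are supplied by
    the caller; higher scores are assumed to be better.
    """
    best_size, best_score = None, float("-inf")
    for size in candidates:
        model = train_fn(size)
        score = score_fn(model)
        if score > best_score:
            best_size, best_score = size, score
    return best_size, best_score


# Toy stand-ins: the "model" is a dict and the score peaks at size 32.
def toy_train(size):
    return {"embedding_size": size}


def toy_score(model):
    return -abs(model["embedding_size"] - 32)


print(search_embedding_size([8, 16, 32, 64], toy_train, toy_score))  # (32, 0)
```

More elaborate search strategies change how candidates are proposed, but they still rely on the same three ingredients named above: a dataset, a search space and an evaluation metric.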
The deep learning framework presented in this study can be applied to the pedagogical design of massive open online courses (MOOCs) hosted in an LMS (Oh et al., 2019). Oh et al. (2019) point out that many e-learning design principles, such as practising domain-specific problem-solving skills, were applied to a limited degree regardless of the level of course difficulty. In contrast, Oh et al. (2019) found that the principles related to the organisation and presentation of MOOC content are important. One of the major contributions of the deep learning framework presented in this study is that it presents teachers with a set of graphs that they can use to re-organise the course layout and help students overcome learning bottlenecks, whilst other frameworks only provide a metric-based evaluation of the learning bottlenecks.
In conclusion, we propose a novel framework that allows deep learning models to perform learning sequence discovery. This framework enables the use of neural architecture search techniques to uncover the general learning sequence of students regardless of the size of the dataset. A sequence alignment algorithm from bioinformatics is adopted in the evaluation, allowing for a detailed evaluation of the support of each time step of the general learning sequence. Using our framework, we suggest that the modification of the cross-attention layer in deep learning models helps to approximate the distribution of general learning sequences and produces a significantly better result with p < .05. Finally, we quantify learning bottlenecks as sets of materials that obstruct the progression of students’ learning process, and formally as the sets of materials that significantly lower the support lower bound in an animated state transition graph.
Ethical Statement
Ethical Approval
The submitted work is original and has not been published elsewhere in any form or language, partially or in full. The authors have included the ethical approval and all of the relevant declaration statements in the submission of this manuscript.
Data Availability Statement
The data presented in this article are available from the corresponding author on reasonable request.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Informed Consent
The authors have obtained the approval of the ethics committee of the University for research involving humans and the informed consent of the human participants in this study.
