Abstract
Traffic flow prediction occupies a pivotal position in intelligent transportation systems, and accurate traffic flow prediction is of great significance for alleviating traffic congestion and reducing the incidence of traffic accidents. To improve the accuracy of traffic flow forecasts, it is necessary to consider the historical data over a longer period. However, most of the existing methods only consider part of the recent historical time information, ignoring the implied fluctuation of the traffic flow in some regions in the historical contemporaneous time interval. Therefore, we propose a multidimensional long-term spatio-temporal attention model for traffic flow forecasting by capturing time series correlations. In this model, we design a multi-temporal dimensional attention mechanism and a deep fusion extraction convolutional neural network to capture multidimensional temporal information and fuse spatio-temporal correlations to predict traffic flow. The experimental results on two real datasets show that the proposed model outperforms the compared models.
Introduction
In recent years, with the rapid expansion of the motorway network and the continued increase in vehicle ownership, social problems such as traffic congestion and traffic accidents have become increasingly prominent. Traffic flow prediction plays a crucial role in intelligent transportation systems: by mining historical observation data in depth, it predicts the dynamics of traffic flow in each region at future time steps. An in-depth study of the characteristics and patterns of traffic flow allows the traffic system to be better understood and managed, thus enhancing its operational efficiency and safety.
However, achieving accurate and real-time traffic forecasts poses a significant challenge. Traffic flow patterns are highly complex and dynamic, and they are susceptible to interference from a variety of external factors such as traffic accidents and weather conditions. Current methods can capture the spatio-temporal characteristics of traffic flow data; for example, STRCNs 1 combine Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) architectures to capture spatio-temporal dependencies. Nevertheless, two limitations remain: static spatial correlation and short-term temporal correlation. (1) Static spatial correlation means that most existing studies assume a constant traffic flow relationship between any two geographic regions. This assumption does not hold in reality, because the traffic flow relationship between two regions may change significantly over time. As shown in Figure 1, an emerging network hotspot region may become a new traffic hotspot, while a previously busy region may experience a decrease in traffic flow due to a reduction in neighbouring activities. (2) Short-term temporal correlation means that many existing methods focus mainly on data from recent time intervals. However, traffic conditions are usually influenced by events on a wider time scale. The current traffic condition is affected not only by immediate events, such as recent traffic accidents and road construction, but also by cyclical events, as shown in Figure 2, such as the same peak traffic hour a week earlier. Therefore, the true dynamics of traffic flow cannot be captured by relying on recent data alone.

Over time, interregional traffic flow dynamics can significantly change due to various events, including holiday periods and traffic accidents.

Traffic flows in the region may be correlated with earlier time scales.
To address the above challenges, we propose a multidimensional long-term spatio-temporal attention model (MLSTAM) for traffic flow forecasting that captures time series correlations. In MLSTAM, we first design a temporal correlation extraction module, which captures both short-term fluctuations and long-term cyclical trends by modelling multi-scale temporal patterns. We then develop a spatio-temporal dependency attention module that uses an encoder-decoder architecture to dynamically update spatial correlations through learnable attention weights. Finally, a multi-dimensional cross attention fusion module integrates the multi-scale features. The main contributions are summarized as follows:
(1) We propose a temporal correlation feature extraction module that combines attention mechanisms and CNNs to capture the temporal correlation features of the data.
(2) We propose a Spatio-Temporal Dependency Attention Module and a Multi-dimensional Cross Attention Fusion Module for efficiently capturing spatial features and fusing temporally correlated features.
(3) We evaluate the proposed model on two real datasets, and the experimental results show that MLSTAM outperforms the comparison algorithms.
The remainder of this paper is structured as follows: Section 2 summarises related work on traffic flow forecasting. Section 3 defines terms and notations related to traffic flow. Section 4 describes our proposed general framework. Section 5 evaluates the proposed model on real datasets and compares it with other models. Finally, Section 6 concludes the paper.
With the rapid growth of motorway networks and increasing traffic demand, challenges in traffic management and safety have intensified, underscoring the need for accurate traffic flow predictions. Deep learning, with its robust spatio-temporal feature extraction capabilities and balance of dynamic factors, offers a promising approach to enhancing prediction accuracy. We will now discuss the relevant applications of deep learning in traffic flow prediction from two key perspectives.
In traffic flow prediction, CNN-based models are powerful tools widely recognized for extracting temporal and spatial relationships.2,3 For instance, Yang et al. 4 introduced an enhanced MF-CNN model considering network-scale traffic flow prediction and the impact of external factors, extracting features using CNN and combining them with external features via a logistic regression layer for prediction. Similarly, Zhang et al. 5 presented a short-term traffic flow prediction model utilizing spatio-temporal analysis and CNN to optimize input data through a feature selection algorithm and learn features. Additionally, Ren et al. 6 proposed a Combined Deep Learning Prediction (CLP) model consisting of parallel CNN-LSTM and CNN-GRU models, with weights calculated using a dynamic optimal weighting coefficient algorithm to boost prediction accuracy. Soni et al. 7 employed WKNN to capture historical data’s spatio-temporal correlations and DCNN to integrate multiple transportation metrics for prediction. Qiao et al. 8 suggested a short-term traffic flow prediction model combining a one-dimensional CNN and LSTM (1DCNN-LSTM), using a one-dimensional convolutional network to capture spatial information. Fu et al. 9 utilized CNN to capture spatial features at multiple intersections, a Transformer to reduce training time, and considered traffic flow’s cyclic trends. While CNNs are prevalent for feature extraction, capturing complex graph topology’s spatial correlations in traffic flow is challenging.
In contrast, graph neural networks (GNNs) excel at processing graph-structured data and capturing spatial correlations between graph nodes.10,11 Zhang and Jiao 12 addressed the limitations of CNN methods in capturing spatio-temporal correlations by using GNNs to extract spatial-temporal dependencies from historical traffic data. Bao et al. 13 proposed a PKET-GCN model that combines dynamic and static features, significantly improving traffic flow prediction accuracy. Yang et al. 14 introduced a multi-graph convolutional model with GRU. Considering the intricate spatio-temporal correlations in traffic flow data, Ni and Zhang 15 proposed an interpretable multi-gated graph convolution framework, constructing spatio-temporal blocks with multi-resolution 1DCNN to capture correlations effectively. Li et al. 16 suggested diffusion graph convolution to capture multiscale temporal correlations and global spatial dependencies, with a dynamic graph module integrating multi-source information. Luo et al. 17 developed an S-GNN model handling multiple traffic data inputs by combining spatio-temporal modules, using one-dimensional convolution for temporal features and GNNs for spatial relationships. Li et al. 18 used a spatio-temporal attention mechanism to assign weights to nodes, feeding them into a graph convolutional neural network to capture dynamic spatial correlations. Wang et al. 19 proposed a Dynamic Topology Man-GCN (DTM-GCN) model, constructing dynamic topology maps and integrating them with gated convolutional units to explore potential spatio-temporal features.
In recent years, attention mechanisms have become fundamental in the development of deep learning models, finding extensive use in time series prediction tasks.20–22 For instance, Ali et al. 23 proposed an ASTMGCNet based on the attention mechanism, which effectively captures complex spatio-temporal correlations by dynamically generating the graph structure. Xu and Liu 24 designed GPT4TFP to fuse spatio-temporal embedded features using a multi-head cross-attention mechanism and fine-tuned the GPT model using a freeze pre-training strategy. Kong et al. 25 introduced the ASTGAT model, which improves the initial graph structure by incorporating a Network Generator to capture the spatio-temporal correlations of hidden nodes and integrating them into a spatio-temporal graph attention network. Liu et al. 26 leveraged the strengths of spatio-temporal gated graph convolution and Transformer architecture to effectively capture both short-term and long-term spatio-temporal dependencies via the attention mechanism. Peng et al. 27 proposed the ST-DSTN model, which uses spatio-temporal fusion attention to capture dynamic correlations and a multi-scale circular stacking structure for feature interaction. Zhong et al. 28 developed a persistent perceptual temporal attention module that models the temporal, geospatial, and semantic spatial dependencies of traffic flow data for comprehensive data embedding. Bai et al. 29 created the STGNN-GCTA model by combining gated convolution with topological attention. Wu et al. 30 utilized a multilayer attention strategy for predicting subway passenger flow, using different attention layers to enhance prediction accuracy. Zhao et al. 31 employed a recursive bidirectional network with NAL-GAT-substituted GRU gating units to enhance local spatio-temporal capture. Lin et al. 32 proposed the STAtt-Net model for short-term traffic flow prediction, capable of dynamically modelling associations between locations in a city.
Despite the success of attention mechanisms in recent traffic flow prediction models, these models often struggle to capture long-term temporal dependencies and to fully model the deep connections between future and past data. To address this limitation, we combine attention mechanisms with CNNs to better capture temporal dependencies, with a particular focus on long-term temporal dependencies. Notably, in dynamic environment modelling, Moghaddasi et al. proposed a DDQN-based data offloading strategy for 5G vehicular edge computing33,34 and a three-layer D2D-edge-cloud computing architecture for IoT devices, 35 while Min et al. 36 developed a joint resource allocation mechanism for vehicle multi-access edge computing networks, enhancing the double deep Q-network to achieve dynamic multi-task offloading in high-mobility scenarios. Their research leverages deep learning techniques to enable adaptive decision-making under dynamic environmental conditions, which motivates our adoption of the attention mechanism to capture dynamic spatio-temporal correlations in traffic flow prediction.
Definition and notation
In this section, we describe the definitions and notations related to traffic flow forecasting.
Transportation network definition
In a transportation system, sensor nodes are typically dispersed along traffic roads, represented as a weighted directed graph
The definition of traffic flow prediction
In this work, we denote the observed traffic conditions in a given time interval
For traffic flow prediction at future time steps
Multidimensional Long-term Spatio-temporal attention model for traffic flow prediction
In this section, we provide a detailed account of the proposed framework and its primary components.
Model framework
Figure 3 shows the general framework of MLSTAM, which includes three modules: the Temporal Correlation Extraction Module, the Spatio-Temporal Dependency Attention Module, and the Multi-dimensional Cross Attention Fusion Module. The Temporal Correlation Extraction Module first extracts the closeness, periodicity, and trend components of the data and then uses CNNs to obtain the temporal correlation of each component. Next, the Fusion Module fuses the temporal correlations of these three components. In addition, the spatio-temporal dependency awareness multi-dimensional fusion module adopts an encoder-decoder architecture and adds the temporal and spatial information of the data through the Temporal Embedding module. Finally, the cross-attention module establishes a temporal correlation with the data output from the decoder. The following subsections explain the specific implementation process in detail.

The framework of MLSTAM.
The temporal correlation extraction module consists of three attention modules and three convolutional neural networks for obtaining multidimensional long-term temporal correlations in historical data. The module decomposes complex traffic dynamics into three independently analyzable components: sudden changes in adjacent time intervals (Closeness), recurring daily patterns (Periodicity), and gradual variations over extended periods (Trend).
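The decomposition into closeness, periodicity, and trend can be illustrated with a small slicing routine. This is only a sketch under stated assumptions: the function name `split_components`, the window length of 12 steps, the one-day lag for periodicity, and the one-week lag for trend are hypothetical choices for illustration and are not taken from the model's actual configuration.

```python
STEPS_PER_DAY = 24 * 60 // 5         # 288 five-minute intervals per day
STEPS_PER_WEEK = 7 * STEPS_PER_DAY   # 2016 five-minute intervals per week


def split_components(series, t, window=12):
    """Return (closeness, periodicity, trend) windows for a forecast starting at step t.

    series: list of readings for one sensor, one value per 5-minute step.
    t:      index of the first step to be predicted.
    window: number of steps in each component (hypothetical choice).
    """
    if window > STEPS_PER_DAY:
        raise ValueError("window must not exceed one day of steps")
    if t < STEPS_PER_WEEK or t > len(series):
        raise ValueError("need at least one week of history before t")
    # Closeness: the steps immediately preceding t.
    closeness = series[t - window:t]
    # Periodicity (assumed daily): the same clock interval one day earlier.
    periodicity = series[t - STEPS_PER_DAY:t - STEPS_PER_DAY + window]
    # Trend (assumed weekly): the same clock interval one week earlier.
    trend = series[t - STEPS_PER_WEEK:t - STEPS_PER_WEEK + window]
    return closeness, periodicity, trend
```

Each of the three windows would then be passed to its own attention module and CNN, as described above.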
The module processes the input data
After acquiring the temporal closeness, periodicity, and trend features

Module structure of attention mechanism.
The spatio-temporal dependency attention module uses an encoder-decoder architecture. In this architecture, the encoder first receives the data as well as its spatio-temporal location information and processes it through multiple Spatio-Temporal Attention Blocks (ST-attention blocks). Each ST-attention block covers both spatial and temporal attention processing. In terms of spatial attention, the model achieves this by calculating the correlation between different locations. For each location i, the model calculates the attention score
The temporal attention model calculates attention weights for different time steps. These weights are used to weight features to capture temporal dependencies. The computation of temporal attention can be expressed as:
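As a rough illustration of how attention weights over time steps can be computed and applied, the sketch below uses standard scaled dot-product self-attention. This formulation is an assumption made for illustration; the model's own expression may differ, and a trained model would apply learned query, key, and value projections that are omitted here. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(features):
    """Scaled dot-product self-attention over the time steps of one location.

    features: list of T feature vectors, each a list of d floats.
    Returns (weights, outputs): a T x T matrix of attention weights and
    the T re-weighted feature vectors. Queries, keys, and values are the
    raw features (no learned projections in this sketch).
    """
    d = len(features[0])
    weights, outputs = [], []
    for q in features:
        # Similarity of the query step to every time step, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in features]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, features)) for i in range(d)])
    return weights, outputs
```

Each row of `weights` sums to 1, so every output vector is a convex combination of the input time steps.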
Between the Encoder and Decoder, the model introduces a Transform Attention module for transforming the historical features extracted by the Encoder into a feature representation more suitable for prediction. This transformation helps to further improve the accuracy of the prediction. After the processing of multiple ST-attention blocks, the Encoder outputs the feature representation of the encoded data and combines it with the spatio-temporal location information of the data as E, which is fed into the Transform Attention module:
Finally, the transformed feature representation goes to the decoder section, which also consists of a plurality of spatio-temporal attention blocks for generating traffic predictions for future time steps after combining them with temporal correlations. The decoder gradually generates predictions for future time steps by iteratively applying spatio-temporal attention blocks. Each time step prediction is based on the previous prediction and the transformed feature representation. The representation is:
In this section, we propose a weighted fusion strategy that combines the elements of temporal correlation, closeness
This section describes the experimental procedure in terms of the datasets, baselines, evaluation metrics, parameter settings, hyperparameter study, results analysis, and ablation studies.
Datasets
We evaluate the performance of MLSTAM using two real datasets, METR-LA and PeMS-BAY, acquired by road sensors. Traffic speeds were aggregated into 5-minute intervals and normalized using Z-score normalization. Some details of the datasets are given in Table 1.
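Z-score normalization can be sketched as follows. The helper names are hypothetical, and fitting the mean and standard deviation on the training split only (then reusing them for other splits and for inverting predictions) is a common practice assumed here rather than a detail stated above.

```python
def zscore_fit(values):
    """Return (mean, std) of the given values, using the population std."""
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return mean, std


def zscore_apply(values, mean, std):
    """Normalize values to zero mean and unit variance."""
    return [(v - mean) / std for v in values]


def zscore_invert(values, mean, std):
    """Map normalized values back to the original scale (e.g. speeds)."""
    return [v * std + mean for v in values]
```

Predictions made in the normalized space are mapped back with `zscore_invert` before error metrics are computed on the original scale.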
For the METR-LA dataset, traffic data were collected by loop detectors in Los Angeles, comprising 207 sensors, from March 1, 2012 to June 30, 2012. For the PeMS-BAY dataset, traffic data were collected by detectors in the Bay Area, comprising 325 sensors, from January 1, 2017 to May 31, 2017.
Details of the datasets.
In our experiments, we compare MLSTAM with the following baseline algorithms.
ARIMA 37: The ARIMA model is a classic time series forecasting model based on smoothed data.
SVR 38: SVR is a regression model widely used to handle data with non-linear relationships.
FNN: FNN is a classical deep learning model that learns more complex non-linear relationships in data through Factorization Machine and Multi-Layer Perceptron.
FC-LSTM 39: The FC-LSTM model is a deep learning model that combines a fully connected layer with a long short-term memory network. It is suitable for processing time-series data and can effectively capture spatio-temporal information and feature associations in traffic flow prediction.
GMAN 40: GMAN is a deep learning model for traffic flow prediction. It is based on graph neural networks and attention mechanisms and aims to solve the complex spatio-temporal correlation problem in traffic flow prediction.
Evaluation metrics
MLSTAM is evaluated with the same metrics as in previous work, 41 including mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The specific formulas are as follows:
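In their standard form, the three metrics can be written as below. The notation is assumed here for illustration: $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $N$ is the number of evaluated samples.

```latex
\mathrm{MAE}  = \frac{1}{N}\sum_{i=1}^{N} \left| y_i - \hat{y}_i \right|
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^{2}}
\qquad
\mathrm{MAPE} = \frac{100\%}{N}\sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|
```

Lower values indicate better predictions for all three metrics. MAPE is undefined when an observed value $y_i$ is zero, so such samples are typically masked out before averaging.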
The experimental parameters in this section are set as shown in Table 2. The model uses a two-layer convolutional neural network to learn spatio-temporal features. The first layer has 16 filters and the second layer has 1 filter, which determines the number of features learned in each convolutional layer. The step size (stride) defines how far the convolutional kernel slides over the input at each move; it is set to 1, which means that the kernel moves one position at a time. Padding adds extra zeros around the boundaries of the input; it is set to 0, which means that no padding is used. The model encoder and decoder use 5 ST-attention blocks each to capture the spatio-temporal dependencies in the traffic flow data. In this paper, the predicted traffic flow size n is 12. The batch size, which defines the number of samples processed by the model in a single iteration, is set to 16. The number of epochs, i.e., the number of times the model completely traverses the training data, is set to a maximum of 500. The optimizer is Adam, an efficient optimization algorithm for training deep neural networks. The learning rate, which determines the step size of each weight update, is set to 0.001, a common starting value.
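The effect of a stride of 1 with no padding can be checked with the usual convolution output-size formula. In the sketch below, the 3 x 3 kernel size and the 12 x 12 input size are hypothetical placeholders, since neither is stated above; only the filter counts, stride, and padding follow the settings described.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Output length of one convolution along a single axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(height, width, layers, stride=1, padding=0):
    """Return the (channels, height, width) shape after each layer."""
    shapes = []
    for layer in layers:
        height = conv_output_size(height, layer["kernel"], stride, padding)
        width = conv_output_size(width, layer["kernel"], stride, padding)
        shapes.append((layer["filters"], height, width))
    return shapes


# Filter counts follow the text (16, then 1); kernel size 3 is an assumption.
LAYERS = [{"filters": 16, "kernel": 3}, {"filters": 1, "kernel": 3}]

print(trace_shapes(12, 12, LAYERS))  # [(16, 10, 10), (1, 8, 8)]
```

With no padding, each layer shrinks every spatial axis by the kernel size minus one, which is why the second layer's single filter yields a smaller single-channel map.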
Hyper parameter settings for the MLSTAM.
In this paper, two traffic flow datasets, METR-LA and PeMS-BAY, are analyzed. As shown in Figures 5 and 6, the prediction model is evaluated under different hyperparameter settings. The evaluation metrics, MAE, MAPE, and RMSE, provide a comprehensive understanding of the model’s performance over different prediction periods (15, 30, and 60 minutes).

Results on the METR-LA and the PEMS-BAY datasets with different

Results on the METR-LA and the PEMS-BAY datasets with different
In the METR-LA dataset, the model’s MAE, MAPE, and RMSE metrics show an overall decreasing and then increasing trend as the value increases. At that time, the model reaches the lowest values of MAE, MAPE, and RMSE on all three forecasting time periods, indicating the best forecasting performance of the model in this setting. Specifically, compared to
In the PeMS-BAY dataset, the model also achieves superior performance at
For the METR-LA dataset, when the value of
For the PeMS-BAY dataset, similar to the METR-LA dataset, the MAE, MAPE, and RMSE metrics decrease in most cases when the
Taking the above analyses together, it can be concluded that choosing
Tables 3 and 4 show the performance of the different models on the METR-LA and PeMS-BAY datasets. The main evaluation metrics are MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and RMSE (Root Mean Squared Error), each computed for prediction durations of 15, 30, and 60 minutes.
Results for different models on the METR-LA dataset.
Results for different models on the PEMS-BAY dataset.
On the PeMS-BAY dataset, MLSTAM performs well on all three evaluation metrics, MAE, MAPE, and RMSE. Specifically, MLSTAM has MAE values of 1.55, 1.68, and 1.83 for the 15, 30, and 60-minute prediction durations, respectively, which are lower than those of the other compared models. Similarly, MLSTAM also achieves relatively low values for MAPE and RMSE, which indicates its accuracy and stability in predicting traffic flow. Compared to traditional time series forecasting methods such as ARIMA and SVR, MLSTAM shows a significant performance improvement. For example, for the 60-minute prediction duration, the MAE of ARIMA is 3.38, while that of MLSTAM is only 1.83, which is about 45.9% lower; the MAE of SVR under the same conditions is 3.28, which is also clearly higher than that of MLSTAM. These results show that MLSTAM has higher prediction accuracy when dealing with complex traffic flow data. Compared with other deep learning models, such as FNN, FC-LSTM, and GMAN, MLSTAM also shows superior performance. In particular, in comparison with the best-performing baseline, GMAN, MLSTAM’s MAE value is reduced by about 8.0% for the 60-minute prediction duration, which further demonstrates the effectiveness of MLSTAM in capturing the long-term dependence of traffic flow.
On the METR-LA dataset, MLSTAM also performs well. For the three evaluation metrics, MAE, MAPE, and RMSE, the values of MLSTAM are lower than or close to those of the other comparison models. Specifically, the MAE values of MLSTAM are 3.41, 3.75, and 4.03 for the 15, 30, and 60-minute prediction durations, respectively, which are lower than or close to the best performance of the other models. The performance advantage of MLSTAM on the METR-LA dataset is more obvious compared to traditional methods such as ARIMA and SVR. For example, for the 60-minute prediction duration, the MAE of ARIMA is 6.90 while that of MLSTAM is only 4.03, which is about 41.6% lower. This indicates that MLSTAM generalizes well across different datasets. Compared with other deep learning models such as FNN, FC-LSTM, and GMAN, MLSTAM’s performance on the METR-LA dataset is also competitive. In particular, in comparison with GMAN, MLSTAM has a slightly lower MAE value for the 60-minute prediction duration, further supporting MLSTAM’s advantage in dealing with complex traffic flow data.
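The relative improvements quoted for the 60-minute horizon follow from a simple ratio of MAE values. The helper below is a hypothetical illustration that uses only the MAE figures stated in the text.

```python
def relative_reduction(baseline, model):
    """Percentage by which `model` error is lower than `baseline` error."""
    return 100.0 * (baseline - model) / baseline


# 60-minute MAE values quoted in the text.
print(f"PeMS-BAY, ARIMA vs MLSTAM: {relative_reduction(3.38, 1.83):.1f}%")  # 45.9%
print(f"PeMS-BAY, SVR vs MLSTAM:   {relative_reduction(3.28, 1.83):.1f}%")  # 44.2%
print(f"METR-LA, ARIMA vs MLSTAM:  {relative_reduction(6.90, 4.03):.1f}%")  # 41.6%
```

A positive value means the model's error is lower than the baseline's; a negative value would mean the baseline is better.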
Taken together, the MLSTAM demonstrates good prediction performance for long-term traffic flow data on both PeMS-BAY and METR-LA datasets. This is mainly attributed to the model’s ability to capture the spatio-temporal dependence of traffic flows and handle complex dynamic patterns. In addition, the stable performance of the MLSTAM over different prediction durations also demonstrates its robustness and generalization ability.
In this subsection, a variant ablation experiment is conducted to test whether some of the modules in the model improve the accuracy of the predictions. These modules include the closeness attention module, the periodic attention module, and the trend attention module. Some of the variants are listed below:
MLSTAM_C: The closeness attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict traffic flow.
MLSTAM_P: The period attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict traffic flow.
MLSTAM_T: The trend attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict traffic flow.
MLSTAM: As described in the previous model framework, closeness, period, and trend data are used to predict traffic flow.
Figure 7 shows the performance of MLSTAM and its variants (MLSTAM_C, MLSTAM_P, MLSTAM_T) on the METR-LA and PeMS-BAY datasets. The three variants take into account the closeness, periodicity, and trend of traffic data, respectively.

Comparison of experimental results of model variation on the METR-LA and PEMS-BAY datasets.
In the METR-LA dataset, MLSTAM_C performs relatively well in the 15-minute time interval, but its performance gradually decreases as the time interval increases. This suggests that closeness information has a high predictive value in a short period, but its influence diminishes over time. In the PeMS-BAY dataset, MLSTAM_C also shows the importance of closeness information, but its performance is not outstanding compared to the other models.
In both datasets, MLSTAM_P shows relatively stable performance. This suggests that periodicity information has some stability in traffic flow prediction and is not significantly affected by time intervals. MLSTAM_P performs better in longer time intervals (e.g., 60 minutes) compared to MLSTAM_C, which further confirms the importance of periodic information in long-term prediction.
In the METR-LA dataset, MLSTAM_T performs well in short time intervals (15 minutes), but its performance decreases sharply as the time interval increases. This suggests that trending information has some predictive value in short time intervals, but may be disturbed by other factors in long time intervals. In the PeMS-BAY dataset, the performance of MLSTAM_T is relatively poor, which may be related to the fact that the trending characteristics in this dataset are not obvious or are influenced by other factors.
In both datasets, MLSTAM achieves optimal results. This suggests that the combined consideration of closeness, periodicity, and trend information is essential for improving the accuracy of traffic flow prediction. MLSTAM achieves lower MAE, MAPE, and RMSE values for all time intervals compared to other models that only consider a single characteristic, which further confirms its superiority and effectiveness.
To accurately capture temporal correlations in the traffic flow prediction problem and effectively address the challenges of complex dynamic spatial dependencies and long-term temporal dependencies, we propose a multidimensional long-term spatio-temporal attention model. We combine the attention mechanism and Convolutional Neural Networks (CNNs) so that the model can sensitively capture long-term temporal correlations in traffic flow data. In addition, we design a Spatio-Temporal Dependency Aware Multi-scale Fusion Module to capture the dynamic spatio-temporal correlation between regions. Finally, we conducted experiments on two real-world datasets, and the results show that our proposed model outperforms the baseline algorithms in long-duration traffic flow prediction.
Although the proposed method achieves better performance in traffic flow prediction, it has potential limitations, such as neglecting critical real-world perturbations (e.g., dynamic meteorological variations and non-recurrent traffic disturbances), which may affect practical applicability. In the future, we will integrate multi-source data (e.g., weather forecasts, accident reports) for scenario-specific adaptation, develop lightweight models to enable real-time predictions on edge devices, and extend this framework to broader intelligent transportation applications, such as emergency evacuation planning.
Footnotes
Funding
This study received research funding from the National Natural Science Foundation of China, grant number 62462031, and the Natural Science Foundation of Jiangxi Province, grant number 20242BAB26023.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
