MLSTAM: A multidimensional long-term spatio-temporal attention model for traffic flow forecasting by capturing time series correlations

Abstract

Traffic flow prediction occupies a pivotal position in intelligent transportation systems, and accurate traffic flow prediction is of great significance for alleviating traffic congestion and reducing the incidence of traffic accidents. To improve the accuracy of traffic flow forecasts, it is necessary to consider the historical data over a longer period. However, most of the existing methods only consider part of the recent historical time information, ignoring the implied fluctuation of the traffic flow in some regions in the historical contemporaneous time interval. Therefore, we propose a multidimensional long-term spatio-temporal attention model for traffic flow forecasting by capturing time series correlations. In this model, we design a multi-temporal dimensional attention mechanism and a deep fusion extraction convolutional neural network to capture multidimensional temporal information and fuse spatio-temporal correlations to predict traffic flow. The experimental results on two real datasets show that the proposed model outperforms the compared models.

Keywords

Intelligent traffic system; traffic flow prediction; spatio-temporal correlations

1. Introduction

In recent years, with the rapid expansion of the motorway network and the continued increase in vehicle ownership, social problems such as traffic congestion and traffic accidents have become increasingly prominent. Traffic flow prediction plays a crucial role in intelligent transportation systems: by mining historical observation data in depth, it forecasts the dynamics of traffic flow in each region over future time intervals. An in-depth study of the characteristics and patterns of traffic flow allows the traffic system to be better understood and managed, thus enhancing its operational efficiency and safety.

However, achieving accurate and real-time traffic forecasts poses a significant challenge. This is due to the highly complex and dynamic nature of traffic flow patterns, which are highly susceptible to interference from a variety of external factors such as traffic accidents and weather conditions. Although current methods have been able to capture the spatio-temporal characteristics of traffic flow data (for example, STRCNs¹ combine Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) network architectures to capture spatio-temporal dependencies), two limitations remain: static spatial correlation and short-term temporal correlation. (1) Static spatial correlation means that most existing studies assume a constant traffic flow relationship between any two geographic regions. However, this assumption does not hold in reality. The traffic flow relationship between two regions may change significantly over time. As shown in Figure 1, an emerging network hotspot region may become a new traffic hotspot, while a previously busy region may experience a decrease in traffic flow due to a reduction in neighbouring activities. (2) Short-term temporal correlation refers to the fact that many existing methods focus mainly on data at recent time intervals. However, traffic conditions are usually influenced by events on a wider time scale. The current traffic condition is affected not only by immediate events such as recent traffic accidents and road construction but also by cyclical events, as shown in Figure 2, such as the same peak traffic hour a week earlier. Therefore, the true dynamics of traffic flow cannot be captured by relying on recent data alone.

Figure 1.

Over time, interregional traffic flow dynamics can significantly change due to various events, including holiday periods and traffic accidents.

Figure 2.

Traffic flows in the region may be correlated with earlier time scales.

Therefore, we propose a multidimensional long-term spatio-temporal attention model for traffic flow forecasting by capturing time series correlations to address the above challenges. In the MLSTAM model, we first designed a temporal correlation extraction module, which captures both short-term fluctuations and long-term cyclical trends by modelling multi-scale temporal patterns. Then we developed a spatio-temporal dependency attention module that uses the encoder-decoder architecture to dynamically update spatial correlations through learnable attention weights. Finally, the multi-dimensional cross attention fusion module integrates multi-scale features. The main contributions are summarized as follows:

We propose a temporal correlation feature extraction module using a combination of attention mechanisms and CNNs to capture data temporal correlation features.

We propose a Spatio-Temporal Dependency Attention Module and a Multi-dimensional Cross Attention Fusion Module for efficiently capturing spatial features and fusing temporally correlated features.

We evaluated the proposed model using two real datasets, and the experimental findings demonstrated that the performance of the MLSTAM algorithm outperforms that of the comparison algorithms.

The structure of this study can be summarised as follows: Section 2 summarises the work related to traffic flow forecasting. Section 3 defines terms and notations related to traffic flow. Section 4 describes our proposed general framework. Section 5 evaluates the proposed model on real datasets and compares it with other models. Finally, Section 7 concludes the paper.

2. Related works

With the rapid growth of motorway networks and increasing traffic demand, challenges in traffic management and safety have intensified, underscoring the need for accurate traffic flow predictions. Deep learning, with its robust spatio-temporal feature extraction capabilities and balance of dynamic factors, offers a promising approach to enhancing prediction accuracy. We now discuss the relevant applications of deep learning in traffic flow prediction from three key perspectives: CNN-based models, graph neural networks, and attention mechanisms.

In traffic flow prediction, CNN-based models are powerful tools widely recognized for extracting temporal and spatial relationships.^2,3 For instance, Yang et al.⁴ introduced an enhanced MF-CNN model considering network-scale traffic flow prediction and the impact of external factors, extracting features using CNN and combining them with external features via a logistic regression layer for prediction. Similarly, Zhang et al.⁵ presented a short-term traffic flow prediction model utilizing spatio-temporal analysis and CNN to optimize input data through a feature selection algorithm and learn features. Additionally, Ren et al.⁶ proposed a Combined Deep Learning Prediction (CLP) model consisting of parallel CNN-LSTM and CNN-GRU models, with weights calculated using a dynamic optimal weighting coefficient algorithm to boost prediction accuracy. Soni et al.⁷ employed WKNN to capture historical data’s spatio-temporal correlations and DCNN to integrate multiple transportation metrics for prediction. Qiao et al.⁸ suggested a short-term traffic flow prediction model combining a one-dimensional CNN and LSTM (1DCNN-LSTM), using a one-dimensional convolutional network to capture spatial information. Fu et al.⁹ utilized CNN to capture spatial features at multiple intersections, a Transformer to reduce training time, and considered traffic flow’s cyclic trends. While CNNs are prevalent for feature extraction, capturing complex graph topology’s spatial correlations in traffic flow is challenging.

In contrast, graph neural networks (GNNs) excel at processing graph-structured data and capturing spatial correlations between graph nodes.^10,11 Zhang and Jiao¹² addressed the limitations of CNN methods in capturing spatio-temporal correlations by using GNNs to extract spatial-temporal dependencies from historical traffic data. Bao et al.¹³ proposed a PKET-GCN model that combines dynamic and static features, significantly improving traffic flow prediction accuracy. Yang et al.¹⁴ introduced a multi-graph convolutional model with GRU. Considering the intricate spatio-temporal correlations in traffic flow data, Ni and Zhang¹⁵ proposed an interpretable multi-gated graph convolution framework, constructing spatio-temporal blocks with multi-resolution 1DCNN to capture correlations effectively. Li et al.¹⁶ suggested diffusion graph convolution to capture multiscale temporal correlations and global spatial dependencies, with a dynamic graph module integrating multi-source information. Luo et al.¹⁷ developed an S-GNN model handling multiple traffic data inputs by combining spatio-temporal modules, using one-dimensional convolution for temporal features and GNNs for spatial relationships. Li et al.¹⁸ used a spatio-temporal attention mechanism to assign weights to nodes, feeding them into a graph convolutional neural network to capture dynamic spatial correlations. Wang et al.¹⁹ proposed a Dynamic Topology Man-GCN (DTM-GCN) model, constructing dynamic topology maps and integrating them with gated convolutional units to explore potential spatio-temporal features.

In recent years, attention mechanisms have become fundamental in the development of deep learning models, finding extensive use in time series prediction tasks.^20–22 For instance, Ali et al.²³ proposed an ASTMGCNet based on the attention mechanism, which effectively captures complex spatio-temporal correlations by dynamically generating the graph structure. Xu and Liu²⁴ designed GPT4TFP to fuse spatio-temporal embedded features using a multi-head cross-attention mechanism and fine-tuned the GPT model using a freeze pre-training strategy. Kong et al.²⁵ introduced the ASTGAT model, which improves the initial graph structure by incorporating a Network Generator to capture the spatio-temporal correlations of hidden nodes and integrating them into a spatio-temporal graph attention network. Liu et al.²⁶ leveraged the strengths of spatio-temporal gated graph convolution and Transformer architecture to effectively capture both short-term and long-term spatio-temporal dependencies via the attention mechanism. Peng et al.²⁷ proposed the ST-DSTN model, which uses spatio-temporal fusion attention to capture dynamic correlations and a multi-scale circular stacking structure for feature interaction. Zhong et al.²⁸ developed a persistent perceptual temporal attention module that models the temporal, geospatial, and semantic spatial dependencies of traffic flow data for comprehensive data embedding. Bai et al.²⁹ created the STGNN-GCTA model by combining gated convolution with topological attention. Wu et al.³⁰ utilized a multilayer attention strategy for predicting subway passenger flow, using different attention layers to enhance prediction accuracy. Zhao et al.³¹ employed a recursive bidirectional network with NAL-GAT-substituted GRU gating units to enhance local spatio-temporal capture. Lin et al.³² proposed the STAtt-Net model for short-term traffic flow prediction, capable of dynamically modelling associations between locations in a city.

Despite the success of attention mechanisms in recent traffic flow prediction models, these models often struggle with capturing long-term temporal dependencies and fully understanding the deep connections between future and past data. To address this limitation, we combine attention mechanisms with CNNs to better capture temporal dependencies, with a particular focus on long-term temporal dependencies. Notably, in the dynamic environment modelling, Moghaddasi et al. proposed a DDQN-based data offloading strategy for 5G vehicular edge computing^33,34 and a three-layer D2D-edge-cloud computing architecture for IoT devices,³⁵ while Min et al.³⁶ developed a joint resource allocation mechanism for vehicle multi-access edge computing networks, enhancing the double deep Q-network to achieve dynamic multi-task offloading in high-mobility scenarios. Their research leverages deep learning techniques to enable adaptive decision-making under dynamic environmental conditions, which motivates our adoption of the attention mechanism to capture the spatio-temporal dynamics correlation in traffic flow prediction.

3. Definition and notation

In this section, we describe the definitions and notations related to traffic flow forecasting.

3.1. Transportation network definition

In a transportation system, sensor nodes are typically dispersed along traffic roads, represented as a weighted directed graph $G = (V, ξ, A)$ derived from the spatial distribution of these nodes. Herein, $V$ signifies the set of nodes situated on a traffic road, with $N = | V |$ indicating the total number of nodes. The set $ξ$ comprises the edges linking these nodes, while $A$ signifies the weighted adjacency matrix, where $A_{i j}$ denotes the distance weight between vertices $v_{i}$ and $v_{j}$ .
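As a minimal sketch of this definition, the snippet below builds the weighted adjacency matrix $A$ from a list of directed edges. The sensor names, the distances, and the helper `build_adjacency` are hypothetical and only illustrate the data structure; the paper does not specify how the distance weights are computed or transformed.

```python
def build_adjacency(nodes, edges):
    """Return (index, A) for a weighted directed graph G = (V, E, A).

    nodes: list of node identifiers (the set V).
    edges: list of (source, target, distance) triples (the edge set).
    A[i][j] holds the distance weight of edge v_i -> v_j, or 0.0 if absent.
    """
    index = {node: i for i, node in enumerate(nodes)}
    n = len(nodes)
    A = [[0.0] * n for _ in range(n)]
    for src, dst, dist in edges:
        A[index[src]][index[dst]] = float(dist)
    return index, A


# Hypothetical three-sensor road segment (distances in metres).
nodes = ["s0", "s1", "s2"]
edges = [("s0", "s1", 420.0), ("s1", "s2", 310.0), ("s2", "s0", 650.0)]
index, A = build_adjacency(nodes, edges)
```

Because the graph is directed, `A` is generally not symmetric: here `A[0][1]` is set while `A[1][0]` stays zero.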

3.2. The definition of traffic flow prediction

In this work, we denote the observed traffic conditions of the $N$ vertices in a given time interval $t$ by $X_{t} \in \mathbb{R}^{N \times C}$, where $C$ is the number of traffic conditions under study. The observations of the $N$ vertices over a history of $P$ time intervals are then denoted as $X = (X_{t_{1}}, X_{t_{2}}, \dots, X_{t_{P}})$.

For traffic flow prediction at future time steps $t + 1$ to $t + n$, we decompose historical patterns into three temporal components: the closeness component $X_{cl}$, capturing immediate patterns from the most recent $α \cdot n$ time intervals preceding the prediction window; the periodicity component $X_{pd}$, containing daily recurring patterns from $σ$ days prior to the target date $D$; and the trend component $X_{td}$, encoding weekly evolutionary patterns from historical data one week before $D$. Formally, these components are defined as $X_{cl} = \{X_{t - α n + 1 : t - (α - 1) n}^{D}, \dots, X_{t - n + 1 : t}^{D}\}$, $X_{pd} = \{X_{t - α n + 1 : t - (α - 1) n}^{D - σ}, \dots, X_{t - n + 1 : t}^{D - σ}\}$, and $X_{td} = \{X_{t + 1 : t + n}^{D - 7}, X_{t - n + 1 : t}^{D}\}$, where $α$ controls the temporal proximity scope, $σ$ specifies the periodic offset in days, and the superscripts denote dates.
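To make the three definitions concrete, the following sketch computes the index windows each component would cover. It assumes a single flat time axis on which one day spans `steps_per_day` intervals (288 for 5-minute data); the function name and this indexing convention are illustrative assumptions, not part of the paper.

```python
def component_windows(t, n, alpha, sigma, steps_per_day=288):
    """Inclusive (start, end) index windows for the three components.

    t is the absolute index of the last observed interval on day D.
    steps_per_day=288 assumes 5-minute intervals (an assumption here).
    """
    # Closeness: alpha consecutive blocks of n intervals ending at t.
    closeness = [(t - i * n + 1, t - (i - 1) * n) for i in range(alpha, 0, -1)]
    # Periodicity: the same blocks, shifted back by sigma days.
    shift = sigma * steps_per_day
    periodicity = [(s - shift, e - shift) for s, e in closeness]
    # Trend: the target window one week earlier plus the latest block on day D.
    week = 7 * steps_per_day
    trend = [(t + 1 - week, t + n - week), (t - n + 1, t)]
    return {"closeness": closeness, "periodicity": periodicity, "trend": trend}


windows = component_windows(t=3000, n=12, alpha=3, sigma=1)
```

With $n = 12$ and $α = 3$, the closeness component spans the 36 most recent intervals, split into three blocks of 12.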

4. Multidimensional Long-term Spatio-temporal attention model for traffic flow prediction

In this section, we provide a detailed account of the proposed framework and its primary components.

4.1. Model framework

Figure 3 shows the general framework of MLSTAM, which comprises three modules: the Temporal Correlation Extraction Module, the Spatio-Temporal Dependency Attention Module, and the Multi-dimensional Cross Attention Fusion Module. The Temporal Correlation Extraction Module extracts the closeness, periodicity, and trend components of the data and then derives the temporal correlation of each component with a CNN. The Spatio-Temporal Dependency Attention Module adopts an encoder-decoder architecture and incorporates the temporal and spatial information of the data through the Temporal Embedding module. Finally, the Multi-dimensional Cross Attention Fusion Module fuses the temporal correlations of the three components and uses cross attention to relate the fused temporal correlation to the data output by the decoder. The following subsections explain the specific implementation in detail.

Figure 3.

The framework of MLSTAM.

4.2. Temporal correlation extraction module

The temporal correlation extraction module consists of three attention modules and three convolutional neural networks for obtaining multidimensional long-term temporal correlations in historical data. The module decomposes complex traffic dynamics into three independently analyzable components: sudden changes in adjacent time intervals (Closeness), recurring daily patterns (Periodicity), and gradual variations over extended periods (Trend).

The module processes the input data $\{X_{cl}, X_{pd}, X_{td}\}$ through three specialized attention modules (closeness, periodicity, trend), as shown in Figure 3, to generate the output features $\{H_{cl}, H_{pd}, H_{td}\}$. In the following, we take the closeness data $X_{cl}$ as an example. For each data block $X_{t - i n + 1 : t - (i - 1) n}^{D}$ (where $i \in \{1, 2, \dots, α\}$), the attention network computes a corresponding attention weight $a_{i}$, which indicates the importance of the data block for the current prediction task. The raw attention score of each block is first computed as

$$e_{i} = \mathrm{Attention}\left(X_{t - i n + 1 : t - (i - 1) n}^{D}\right), \tag{1}$$

where $e_{i}$ is the raw attention score of the $i$-th data block and Attention is a learnable neural network. Next, we normalize the raw attention scores to get the final attention weights:

$$a_{i} = \frac{\exp (e_{i})}{\sum_{j = 1}^{α} \exp (e_{j})} . \tag{2}$$

Finally, we use the normalized attention weights to compute a weighted sum of the data blocks, which gives the weighted representation

$$H_{cl} = \sum_{i = 1}^{α} a_{i} \cdot X_{t - i n + 1 : t - (i - 1) n}^{D} . \tag{3}$$

After extracting the data closeness, we perform a convolution operation on the data to obtain the temporal correlation closeness:

$$C_{cl}^{t + 1 : t + n} = \mathrm{Convs}(H_{cl}), \tag{4}$$

where $H_{cl}$ represents the weighted representation of the closeness data. The operation is then repeated for $X_{pd}$ and $X_{td}$. Finally, the temporal correlations of the traffic flow data are obtained for the closeness $C_{cl}^{t + 1 : t + n}$, periodicity $C_{pd}^{t + 1 : t + n}$, and trend $C_{td}^{t + 1 : t + n}$.
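The block-level attention of Equations (1)-(3) can be sketched in a few lines. The learnable Attention network is replaced here by a placeholder scoring function (a scaled block mean), which is purely an assumption for illustration; the function names and toy readings are likewise hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores (Eq. 2)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(blocks, score_fn):
    """Weight each data block by its attention weight and sum them (Eqs. 1-3).

    blocks: list of equally long lists of floats, one per data block.
    score_fn: stand-in for the learnable Attention network (hypothetical).
    """
    weights = softmax([score_fn(block) for block in blocks])
    pooled = [
        sum(w * block[k] for w, block in zip(weights, blocks))
        for k in range(len(blocks[0]))
    ]
    return weights, pooled


# Toy example: three blocks of n = 4 readings each.
blocks = [
    [60.0, 58.0, 55.0, 57.0],
    [50.0, 48.0, 47.0, 49.0],
    [40.0, 42.0, 41.0, 39.0],
]
weights, h_cl = attention_pool(blocks, score_fn=lambda b: sum(b) / len(b) / 10.0)
```

The pooled result has the same length as a single block, so the $α$ blocks collapse into one representation before the convolution of Equation (4) is applied.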

4.3. Spatio-temporal dependency attention module

After acquiring the temporal closeness, periodicity, and trend features $\{C_{cl}^{t + 1 : t + n}, C_{pd}^{t + 1 : t + n}, C_{td}^{t + 1 : t + n}\}$, we propose a Spatio-Temporal Dependency Attention Module dedicated to extracting and integrating spatio-temporal features. The Spatio-Temporal Dependency Attention Module implicitly emphasizes local patterns while adaptively modelling global dependencies. The workflow of this module is illustrated in Figure 4.

Figure 4.

Module structure of attention mechanism.

The spatio-temporal dependency attention module uses an encoder-decoder architecture. In this architecture, the encoder first receives the data together with its spatio-temporal location information and processes it through multiple Spatio-Temporal Attention Blocks (ST-attention blocks). Each ST-attention block performs both spatial and temporal attention. Spatial attention is computed from the correlation between different locations. For each location $i$, the model calculates the attention score $e_{ij}$ between it and every neighbouring location $j$ and normalizes the scores with the softmax function to obtain the attention weights $α_{ij}$. These weights are then used to compute a weighted sum of the features of the neighbouring locations, which gives the spatial attention output of location $i$:

$$e_{ij} = \mathrm{score}(x_{i}, x_{j}), \tag{5}$$

$$α_{ij} = \frac{\exp (e_{ij})}{\sum_{k \in N_{i}} \exp (e_{ik})}, \tag{6}$$

$$x_{i}^{\prime} = \sum_{j \in N_{i}} α_{ij} x_{j}, \tag{7}$$

where $\mathrm{score}(\cdot)$ is the function used to calculate the attention score, $N_{i}$ denotes the set of neighbouring nodes of location $i$, $x_{i}$ and $x_{j}$ denote the features of any two sensor nodes, and $x_{i}^{\prime}$ is the feature output of the spatial attention module.
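The neighbour-restricted softmax of Equations (5)-(7) might look as follows. The paper leaves $\mathrm{score}(\cdot)$ unspecified, so a plain dot product is assumed here; the node names and features are made up, and every node is assumed to have at least one neighbour.

```python
import math


def dot(u, v):
    """Dot product, used as an assumed stand-in for score()."""
    return sum(a * b for a, b in zip(u, v))


def spatial_attention(features, neighbours, score_fn):
    """Aggregate neighbour features with softmax attention weights (Eqs. 5-7).

    features: dict mapping node -> list of floats (x_i).
    neighbours: dict mapping node -> non-empty list of neighbouring nodes (N_i).
    """
    output = {}
    for i, x_i in features.items():
        scores = [score_fn(x_i, features[j]) for j in neighbours[i]]  # Eq. 5
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        alphas = [e / total for e in exps]  # Eq. 6
        output[i] = [
            sum(a * features[j][k] for a, j in zip(alphas, neighbours[i]))
            for k in range(len(x_i))
        ]  # Eq. 7
    return output


features = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
neighbours = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
out = spatial_attention(features, neighbours, dot)
```

Node `c` scores both of its neighbours equally, so its output is their plain average; node `a` leans towards `c`, whose features align more closely with its own.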

The temporal attention model calculates attention weights for different time steps. These weights are used to weight features to capture temporal dependencies. The computation of temporal attention can be expressed as:

$$β_{t} = \mathrm{softmax}(\mathrm{score}(u_{t}, x_{t}^{D})), \tag{8}$$

$$x_{t}^{\prime\prime} = β_{t} x_{t}^{D}, \tag{9}$$

where $u_{t}$ is the hidden state at time step $t$, $\mathrm{score}(\cdot)$ is the function that computes the attention score, and $x_{t}^{\prime\prime}$ is the feature output of temporal attention. In the $l$-th ST-attention block, the output of the spatial attention block is denoted $H_{s}^{(l)}$ and, similarly, the output of the temporal attention block is denoted $H_{t}^{(l)}$. Finally, $H_{s}^{(l)}$ and $H_{t}^{(l)}$ are fused through the Add module, and the fusion equations are:

$$z = \mathrm{sigmoid}(H_{s}^{(l)} W_{z, 1} + H_{t}^{(l)} W_{z, 2} + b_{z}), \tag{10}$$

$$E_{t - n + 1 : t}^{D} = z \odot H_{s}^{(l)} + (1 - z) \odot H_{t}^{(l)}, \tag{11}$$

where $W_{z, 1}$ and $W_{z, 2}$ represent the learnable parameter matrices, $b_{z}$ represents the bias, $\odot$ represents the element-wise product, and $E_{t - n + 1 : t}^{D}$ denotes the integrated spatio-temporal information produced by the ST-attention block.
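The gated fusion of Equations (10) and (11) is easy to check on a single feature vector. The sketch below treats $H_{s}$ and $H_{t}$ as row vectors of length $d$ and the weights as $d \times d$ matrices; these shapes, the helper names, and the toy values are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(h, W):
    """Row vector h (length d) times matrix W (d x d), i.e. h W."""
    return [sum(h[r] * W[r][c] for r in range(len(h))) for c in range(len(W[0]))]


def gated_fusion(h_s, h_t, W1, W2, b):
    """z = sigmoid(h_s W1 + h_t W2 + b); fused = z * h_s + (1 - z) * h_t."""
    a = matvec(h_s, W1)
    c = matvec(h_t, W2)
    z = [sigmoid(ai + ci + bi) for ai, ci, bi in zip(a, c, b)]  # Eq. 10
    fused = [zi * s + (1.0 - zi) * t for zi, s, t in zip(z, h_s, h_t)]  # Eq. 11
    return z, fused


# With all-zero weights and bias the gate is 0.5 everywhere,
# so the fusion reduces to the element-wise average of the two inputs.
zeros = [[0.0, 0.0], [0.0, 0.0]]
z, fused = gated_fusion([2.0, 4.0], [0.0, 8.0], zeros, zeros, [0.0, 0.0])
```

Because $z$ lies in $(0, 1)$, each output element is a convex combination of the spatial and temporal features, so the gate decides per element which of the two dominates.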

Between the Encoder and Decoder, the model introduces a Transform Attention module for transforming the historical features extracted by the Encoder into a feature representation more suitable for prediction. This transformation helps to further improve the accuracy of the prediction. After the processing of multiple ST-attention blocks, the Encoder outputs the feature representation of the encoded data and combines it with the spatio-temporal location information of the data as E, which is fed into the Transform Attention module:

$$H_{t} = σ(\mathrm{score}(W_{q} E, W_{k} E)) \cdot W_{v} E, \tag{12}$$

where $σ$ denotes the sigmoid activation function, $W_{q}$, $W_{k}$, and $W_{v}$ are learnable parameter matrices, and $H_{t}$ is the feature representation after transform attention processing.

Finally, the transformed feature representation is passed to the decoder, which also consists of multiple spatio-temporal attention blocks for generating traffic predictions for future time steps after combining them with temporal correlations. The decoder gradually generates predictions for future time steps by iteratively applying spatio-temporal attention blocks. Each time step prediction is based on the previous prediction and the transformed feature representation. This can be expressed as:

$$P_{t + 1 : t + n}^{D} = \mathrm{Decoder}(H_{t} + \mathrm{STE}), \tag{13}$$

where $\mathrm{STE}$ denotes spatio-temporal location information.

4.4. Multi-dimensional cross attention fusion module

In this section, we propose a weighted fusion strategy that combines the elements of temporal correlation, closeness $C_{c l}^{t + 1 : t + n}$ , periodicity $C_{p d}^{t + 1 : t + n}$ , and trend $C_{t d}^{t + 1 : t + n}$ . Specifically, these three features are weighted and fused through the learnable parameter matrices $W_{t, 1}$ , $W_{t, 2}$ , and $W_{t, 3}$ and the bias term $b_{t}$ :

$$T^{t + 1 : t + n} = C_{cl}^{t + 1 : t + n} W_{t, 1} + C_{pd}^{t + 1 : t + n} W_{t, 2} + C_{td}^{t + 1 : t + n} W_{t, 3} + b_{t}. \tag{14}$$

The fused feature $T^{t + 1 : t + n}$ is then fed into the Cross Attention module along with $P_{t + 1 : t + n}^{D}$:

$$\begin{aligned} q_{c} & = W_{c, q} T^{t + 1 : t + n}, \\ k_{c} & = W_{c, k} P_{t + 1 : t + n}^{D}, \\ v_{c} & = W_{c, v} P_{t + 1 : t + n}^{D}, \\ α & = \mathrm{sigmoid}(\mathrm{score}(q_{c}, k_{c})), \\ \hat{X}_{t + 1 : t + n}^{D} & = α v_{c}, \end{aligned} \tag{15}$$

where $q_{c}$, $k_{c}$, and $v_{c}$ are the representations of queries, keys, and values, respectively, and $W_{c, q}$, $W_{c, k}$, and $W_{c, v}$ are learnable parameter matrices.
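A rough sketch of the weighted fusion and the sigmoid-gated cross attention in this subsection is given below. Several choices are assumptions rather than details from the paper: features are stored as $n \times d$ matrices with one row per time step, projections are applied as right-multiplications, and $\mathrm{score}(\cdot)$ is taken to be a scaled dot product.

```python
import math


def matmul(A, B):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def fuse(C_cl, C_pd, C_td, W1, W2, W3, b):
    """Weighted fusion of the three temporal features (Eq. 14)."""
    parts = [matmul(C_cl, W1), matmul(C_pd, W2), matmul(C_td, W3)]
    return [
        [x + y + z + bj for x, y, z, bj in zip(r1, r2, r3, b)]
        for r1, r2, r3 in zip(*parts)
    ]


def sigmoid_cross_attention(T, P, Wq, Wk, Wv):
    """Queries from T, keys and values from P, sigmoid-gated scores (Eq. 15)."""
    q, k, v = matmul(T, Wq), matmul(P, Wk), matmul(P, Wv)
    scale = math.sqrt(len(q[0]))  # scaled dot product is an assumed score()
    scores = matmul(q, transpose(k))
    alpha = [[1.0 / (1.0 + math.exp(-s / scale)) for s in row] for row in scores]
    return matmul(alpha, v)


identity = [[1.0, 0.0], [0.0, 1.0]]
T = fuse(identity, [[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]],
         identity, identity, identity, [0.5, 0.5])
x_hat = sigmoid_cross_attention([[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]],
                                identity, identity, identity)
```

Unlike a softmax, the sigmoid gate does not force the weights in each row of $α$ to sum to one, so the scale of the output depends on how many key positions receive large scores.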

5. Experimental results and analysis

This section describes the experimental procedure in terms of the datasets, baselines, evaluation metrics, parameter settings, hyperparameter study, results analysis, and ablation studies.

5.1. Datasets

We evaluate the performance of MLSTAM using two real datasets, METR-LA and PeMS-BAY, acquired by road sensors. Traffic speeds were aggregated to 5-minute intervals and normalized using Z-Score normalization. Details of the datasets are given in Table 1.

For the METR-LA dataset, traffic data was collected by the Los Angeles Loop Detectors, which consisted of 207 sensors and spanned the period of March 1, 2012 to June 30, 2012.

For the PeMS-BAY dataset, traffic data were collected by the Bay Area Detectors, including 325 sensors, for the period from January 1, 2017 to May 31, 2017.
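The Z-Score normalization mentioned above can be sketched as follows. Fitting the mean and standard deviation on the training split only, and using the population standard deviation, are common conventions assumed here; the paper does not state which it uses, and the speed values are invented.

```python
import math


def zscore_fit(values):
    """Return (mean, std) of the values, using the population standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def zscore_apply(values, mean, std):
    """Normalize values to zero mean and unit variance under (mean, std)."""
    return [(v - mean) / std for v in values]


def zscore_invert(values, mean, std):
    """Map normalized predictions back to the original unit."""
    return [v * std + mean for v in values]


# Hypothetical speed readings: statistics come from the training split only.
train = [60.0, 62.0, 58.0, 64.0, 56.0]
test = [61.0, 59.0]
mean, std = zscore_fit(train)
train_norm = zscore_apply(train, mean, std)
test_norm = zscore_apply(test, mean, std)
```

Predictions made in normalized space would be passed through `zscore_invert` before computing error metrics, so that the errors are reported in the original unit.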

Table 1.
Details of the datasets.

Datasets	METR-LA	PeMS-BAY
Start time	3/1/2012	1/1/2017
End time	6/30/2012	5/31/2017
Training set	3/1/2012-5/20/2012	1/1/2017-4/30/2017
Validating set	5/21/2012-6/10/2012	5/1/2017-5/31/2017
Testing set	6/11/2012-6/30/2012	6/1/2017-6/30/2017
Time interval	5 (minutes)	5 (minutes)

5.2. Baselines

In our experiments, we compare MLSTAM against the following baseline algorithms.

ARIMA³⁷ : The ARIMA model is a classic time series forecasting model based on smoothed data.

SVR³⁸ : SVR is a regression model widely used to handle data with non-linear relationships.

FNN: FNN is a classical deep learning model that learns more complex non-linear relationships in data through Factorization Machine and Multi-Layer Perceptron.

FC-LSTM³⁹ : The FC-LSTM model is a deep learning model that combines a fully connected layer with a long short-term memory network, which is suitable for processing time-series data and can effectively capture spatio-temporal information and feature associations in traffic flow prediction.

GMAN⁴⁰ : The GMAN is a deep-learning model for traffic flow prediction. It is based on graph neural networks and attentional mechanisms and aims to solve the complex spatio-temporal correlation problem in traffic flow prediction.

5.3. Evaluation metrics

Following previous work,⁴¹ we evaluate MLSTAM using the mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The specific formulas are as follows:

$$\mathrm{MAE} = \frac{1}{P} \sum_{i = 1}^{P} \left| y_{i} - \hat{y}_{i} \right|, \tag{16}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{P} \sum_{i = 1}^{P} (y_{i} - \hat{y}_{i})^{2}}, \tag{17}$$

$$\mathrm{MAPE} = \frac{1}{P} \sum_{i = 1}^{P} \left| \frac{\hat{y}_{i} - y_{i}}{y_{i}} \right|, \tag{18}$$

where $y_{i}$ denotes the actual value at some point in $X_{t + 1 : t + n}^{D}$, $\hat{y}_{i}$ denotes the corresponding predicted value, and $P$ denotes the number of predicted time steps $n$.
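The three metrics translate directly into code. The sketch below follows Equations (16)-(18) as written, so MAPE is returned as a fraction rather than a percentage; the sample values are invented. Ground-truth values of zero would make MAPE undefined and would need to be masked beforehand, which the paper does not discuss.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error (Eq. 16)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error (Eq. 17)."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true))


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction (Eq. 18); y_true must be non-zero."""
    return sum(abs((p - y) / y) for y, p in zip(y_true, y_pred)) / len(y_true)


y_true = [50.0, 40.0, 80.0, 100.0]
y_pred = [45.0, 44.0, 80.0, 90.0]
scores = {"MAE": mae(y_true, y_pred), "RMSE": rmse(y_true, y_pred), "MAPE": mape(y_true, y_pred)}
```

RMSE penalizes the single 10-unit error more heavily than MAE does, which is why it is never smaller than MAE on the same predictions.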

5.4. Parameter settings

The experimental parameters in this section are set as shown in Table 2. The model uses a two-layer convolutional neural network to learn spatio-temporal features. The first layer has 16 filters and the second layer has 1 filter, which determines the number of features learned in each convolutional layer. The stride defines how far the convolutional kernel slides over the input at each step; it is set to 1, meaning that the kernel moves one position at a time. Padding adds extra zeros around the boundaries of the input; it is set to 0, meaning that no padding is used. The encoder and decoder each use 5 ST-attention blocks to capture the spatio-temporal dependencies in the traffic flow data. In this paper, the prediction length $n$ is 12. The batch size, which defines the number of samples processed by the model in a single iteration, is set to 16. The number of epochs, i.e. the number of times the model completely traverses the training data, is set to a maximum of 500. The optimizer is Adam, an efficient optimization algorithm for training deep neural networks. The learning rate, which determines the step size of each weight update, is set to 0.001, a common starting value.

Table 2.
Hyper parameter settings for the MLSTAM.

Parameter	Value
CNN layers	2
Number of filters in CNN	16,1
Stride in CNN	1
Padding in CNN	0
ST-Att block(L)	5
Forecast time period	12
Batch size	16
Epoch	500
Optimizer	Adam
Learning rate	0.001
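To see what stride 1 and padding 0 imply for the two convolutional layers, the sketch below applies the standard output-length formula for a convolution along one axis. Table 2 does not list the kernel size, so the value 3 used here is hypothetical, as is the input length of 12.

```python
def conv_output_length(length, kernel_size, stride=1, padding=0):
    """Output length of a convolution along one axis.

    Uses floor((length + 2 * padding - kernel_size) / stride) + 1.
    """
    if length + 2 * padding < kernel_size:
        raise ValueError("kernel does not fit into the padded input")
    return (length + 2 * padding - kernel_size) // stride + 1


# Two stacked layers with the stride and padding from Table 2.
# kernel_size=3 is an assumed value, not taken from the paper.
length = 12
for _ in range(2):
    length = conv_output_length(length, kernel_size=3, stride=1, padding=0)
```

Without padding, every layer shortens the axis by `kernel_size - 1`, so the kernel size and input length have to be chosen with the desired output length in mind.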

5.5. Hyperparameters study

In this section, the prediction model is evaluated on the two traffic flow datasets, METR-LA and PeMS-BAY, under different settings of $α$ and $σ$, as shown in Figures 5 and 6. The evaluation metrics are MAE, MAPE, and RMSE, which together provide a comprehensive understanding of the model’s performance over different prediction periods (15, 30, and 60 minutes).

Figure 5.

Results on the METR-LA and the PeMS-BAY datasets with different $α$.

Figure 6.

Results on the METR-LA and the PeMS-BAY datasets with different $σ$.

In the METR-LA dataset, the model’s MAE, MAPE, and RMSE metrics show an overall decreasing and then increasing trend as the value of $α$ increases. At $α = 3$, the model reaches the lowest values of MAE, MAPE, and RMSE on all three forecasting time periods, indicating the best forecasting performance of the model in this setting. Specifically, compared to $α = 1$, the MAE at $α = 3$ decreased by 2.3%, 1.6%, and 4.0% on the 15-minute, 30-minute, and 60-minute prediction time periods, respectively; the MAPE decreased by 0.8%, 0.6%, and 0.9%, respectively; and the RMSE decreased by 0.8%, 0.7%, and 0.8%, respectively. This suggests that in the METR-LA dataset, choosing an appropriate $α$ value is crucial for improving the model performance.

In the PeMS-BAY dataset, the model also achieves superior performance at $α = 3$ . Compared with $α = 1$ , the MAE at $α = 3$ is reduced by 7.1%, 3.4%, and 3.7% for the three prediction time periods, respectively; although the MAPE increases slightly (0.1%) for the 15-minute prediction time period, it decreases by 0.7% and 0.9% for the 30-minute and 60-minute prediction time periods, respectively; and the RMSE is reduced by 1.1%, 1.2%, and 1.1%, respectively. This suggests that $α = 3$ is also a more appropriate choice for the PeMS-BAY dataset.

For the METR-LA dataset, when the value of $σ$ is increased from 1 to 3, we can see that the MAE, MAPE, and RMSE decrease in all time periods, which suggests that the predictive performance of the model is improving. When the $σ$ value is increased from 3 to 4, these metrics increase slightly or remain relatively stable, suggesting that further increases in the $σ$ value may not result in a significant increase in predictive performance, or may even result in a decrease in performance.

For the PeMS-BAY dataset, similar to the METR-LA dataset, the MAE, MAPE, and RMSE metrics decrease in most cases when the $σ$ value is increased from 1 to 3, suggesting that the prediction performance is improving. When the $σ$ value is increased from 3 to 4, the changes in these metrics are smaller and there is no obvious performance enhancement or decrease.

Taking the above analyses together, it can be concluded that choosing $α = 3$ and $σ = 4$ as the parameter settings of the model in the METR-LA and PeMS-BAY datasets can achieve better prediction performance.

5.6. Results analysis

Tables 3 and 4 show the performance of the different models on the METR-LA and PeMS-BAY datasets, respectively. The main evaluation metrics are MAE (Mean Absolute Error), MAPE (Mean Absolute Percentage Error), and RMSE (Root Mean Squared Error), computed for prediction durations of 15, 30, and 60 minutes; each table cell lists the three values in that order.

Table 3.
Results for different models on the METR-LA dataset.

Model	MAE	MAPE(%)	RMSE
ARIMA	3.99/5.15/6.90	9.60/12.70/17.40	8.21/10.45/13.23
SVR	3.99/5.05/6.72	9.30/12.10/16.70	8.45/10.87/13.76
FNN	3.99/4.23/4.49	9.90/12.90/14.00	7.94/8.17/8.69
FC-LSTM	3.44/3.77/4.37	9.60/10.90/13.20	6.30/7.23/8.69
GMAN	3.53/3.85/4.24	9.74/10.75/11.43	6.57/7.16/8.39
MLSTAM	3.41/3.75/4.03	9.55/10.48/11.20	6.32/6.90/7.90

Table 4.
Results for different models on the PeMS-BAY dataset (each cell lists the 15/30/60-minute values).

Model	MAE	MAPE(%)	RMSE
ARIMA	1.62/2.33/3.38	3.50/5.40/8.30	3.30/4.76/6.50
SVR	1.85/2.48/3.28	3.80/5.50/8.00	3.59/5.18/7.08
FNN	2.20/2.30/2.46	5.19/5.43/5.89	4.42/4.63/4.98
FC-LSTM	2.05/2.20/2.37	4.80/5.20/5.70	4.19/4.55/4.96
GMAN	1.77/1.88/1.99	3.90/4.18/4.42	3.95/4.31/5.65
MLSTAM	1.55/1.68/1.83	3.70/4.05/4.30	3.67/4.11/4.45

On the PeMS-BAY dataset, MLSTAM performs well on all three evaluation metrics, MAE, MAPE, and RMSE. Specifically, MLSTAM has MAE values of 1.55, 1.68, and 1.83 for the 15-, 30-, and 60-minute prediction durations, respectively, which are lower than those of all the compared models. MLSTAM also achieves relatively low values for MAPE and RMSE, which indicates its accuracy and stability in predicting traffic flow. Compared with traditional time series forecasting methods such as ARIMA and SVR, MLSTAM shows a significant performance improvement. For example, for the 60-minute prediction duration, the MAE of ARIMA is 3.38 while that of MLSTAM is only 1.83, which is about 45.9% lower; the MAE of SVR under the same conditions is 3.28, which is also considerably higher than that of MLSTAM. These results show that MLSTAM has higher prediction accuracy when dealing with complex traffic flow data. Compared with other deep learning models, such as FNN, FC-LSTM, and GMAN, MLSTAM also shows superior performance. In particular, in comparison with the best-performing baseline, GMAN, MLSTAM's MAE is about 8.0% lower for the 60-minute prediction duration, which further demonstrates the effectiveness of MLSTAM in capturing the long-term dependence of traffic flow.

On the METR-LA dataset, MLSTAM also performs well. For the three evaluation metrics, MAE, MAPE, and RMSE, the values of MLSTAM are lower than or close to those of the other compared models. Specifically, the MAE values of MLSTAM are 3.41, 3.75, and 4.03 for the 15-, 30-, and 60-minute prediction durations, respectively, which are lower than or close to the best values of the other models. The performance advantage of MLSTAM on the METR-LA dataset is most obvious when compared with traditional methods such as ARIMA and SVR. For example, for the 60-minute prediction duration, the MAE of ARIMA is 6.90 while that of MLSTAM is only 4.03, which is about 41.6% lower. This indicates that MLSTAM generalizes well across different datasets. Compared with other deep learning models such as FNN, FC-LSTM, and GMAN, MLSTAM's performance on the METR-LA dataset is also competitive. In particular, MLSTAM has a slightly lower MAE than GMAN for the 60-minute prediction duration (4.03 versus 4.24), which further supports the advantage of MLSTAM in dealing with complex traffic flow data.
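The relative reductions quoted above can be reproduced directly from the 60-minute MAE columns of Tables 3 and 4. The sketch below is only a worked check of that arithmetic; the helper name `relative_reduction` and the dictionary layout are illustrative assumptions, while the numbers are copied from the tables.

```python
def relative_reduction(baseline, model):
    """Percentage by which the model's error is lower than the baseline's."""
    return 100.0 * (baseline - model) / baseline

# 60-minute MAE values copied from Table 3 (METR-LA) and Table 4 (PeMS-BAY).
MAE_60 = {
    "PeMS-BAY": {"ARIMA": 3.38, "SVR": 3.28, "GMAN": 1.99, "MLSTAM": 1.83},
    "METR-LA": {"ARIMA": 6.90, "SVR": 6.72, "GMAN": 4.24, "MLSTAM": 4.03},
}

for dataset, scores in MAE_60.items():
    ours = scores["MLSTAM"]
    for name, value in scores.items():
        if name != "MLSTAM":
            gain = relative_reduction(value, ours)
            print(f"{dataset}: MLSTAM MAE is {gain:.1f}% lower than {name}")
```

On PeMS-BAY this gives a reduction of roughly 46% against ARIMA and roughly 8% against GMAN; on METR-LA the reduction against ARIMA is a little under 42%.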

Taken together, the MLSTAM demonstrates good prediction performance for long-term traffic flow data on both PeMS-BAY and METR-LA datasets. This is mainly attributed to the model’s ability to capture the spatio-temporal dependence of traffic flows and handle complex dynamic patterns. In addition, the stable performance of the MLSTAM over different prediction durations also demonstrates its robustness and generalization ability.

5.7. Ablation studies

In this subsection, a variant ablation experiment is conducted to test whether some of the modules in the model improve the accuracy of the predictions. These modules include the closeness attention module, the periodic attention module, and the trend attention module. Some of the variants are listed below:

MLSTAM_C: The closeness attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict traffic flow.

MLSTAM_P: The period attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict traffic flow.

MLSTAM_T: The trend attention module and the spatio-temporal dependency awareness multi-dimensional fusion module are used to predict the traffic flow.

MLSTAM: As described in the previous model framework, closeness, period, and trend data are used to predict traffic flow.
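One way to picture the four configurations is as a switch over the three temporal branches that feed the fusion module. The sketch below is a hypothetical configuration table, not the authors' implementation: the branch labels and the `active_branches` helper are assumptions introduced only for illustration.

```python
# Hypothetical labels for the three temporal attention branches.
BRANCHES = ("closeness", "period", "trend")

# Which branches each ablation variant keeps, following the list above.
VARIANTS = {
    "MLSTAM_C": {"closeness"},
    "MLSTAM_P": {"period"},
    "MLSTAM_T": {"trend"},
    "MLSTAM": set(BRANCHES),  # full model: all three branches
}

def active_branches(variant):
    """Return the branches passed to the fusion module, in a fixed order."""
    enabled = VARIANTS[variant]
    return [branch for branch in BRANCHES if branch in enabled]

for name in VARIANTS:
    print(name, "->", active_branches(name))
```

Keeping the fusion module in every variant means that any difference in accuracy can be attributed to the temporal inputs rather than to the fusion stage.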

Figure 7 shows the performance of MLSTAM and its variants (MLSTAM_C, MLSTAM_P, MLSTAM_T) on the METR-LA and PeMS-BAY datasets. The three variants take into account the closeness, periodicity, and trend of traffic data, respectively.

Figure 7.

Comparison of experimental results of the model variants on the METR-LA and PeMS-BAY datasets.

In the METR-LA dataset, MLSTAM_C performs relatively well in the 15-minute time interval, but its performance gradually decreases as the time interval increases. This suggests that closeness information has a high predictive value in a short period, but its influence diminishes over time. In the PeMS-BAY dataset, MLSTAM_C also shows the importance of closeness information, but its performance is not outstanding compared to the other models.

In both datasets, MLSTAM_P shows relatively stable performance. This suggests that periodicity information has some stability in traffic flow prediction and is not significantly affected by time intervals. MLSTAM_P performs better in longer time intervals (e.g., 60 minutes) compared to MLSTAM_C, which further confirms the importance of periodic information in long-term prediction.

In the METR-LA dataset, MLSTAM_T performs well in short time intervals (15 minutes), but its performance decreases sharply as the time interval increases. This suggests that trend information has some predictive value in short time intervals, but may be disturbed by other factors in long time intervals. In the PeMS-BAY dataset, the performance of MLSTAM_T is relatively poor, which may be because the trend characteristics in this dataset are not obvious or are influenced by other factors.

In both datasets, MLSTAM achieves optimal results. This suggests that the combined consideration of closeness, periodicity, and trend information is essential for improving the accuracy of traffic flow prediction. MLSTAM achieves lower MAE, MAPE, and RMSE values for all time intervals compared to other models that only consider a single characteristic, which further confirms its superiority and effectiveness.

6. Conclusion

To accurately capture temporal correlations in the traffic flow prediction problem and effectively address the complex dynamic spatial and long-term time-dependent challenges, we propose a multidimensional long-term spatio-temporal attention model. We combine the attention mechanism and Convolutional Neural Networks (CNNs) to ensure that the model can sensitively capture long-term temporal correlations in traffic flow data. In addition, we design a Spatio-Temporal Dependency Aware Multi-scale Fusion Module to capture the dynamic spatio-temporal correlation between regions. Finally, we conducted experiments on two real-world datasets, and the results show that our proposed model outperforms the baseline algorithms in long-duration traffic flow prediction. Although the proposed method achieves better performance in traffic flow prediction, it has potential limitations, such as neglecting critical real-world perturbations (e.g., dynamic meteorological variations and non-recurrent traffic disturbances), which may affect its practical applicability. In the future, we will integrate multi-source data (e.g., weather forecasts, accident reports) for scenario-specific adaptation, develop lightweight models to enable real-time predictions on edge devices, and extend this framework to broader intelligent transportation applications, such as emergency evacuation planning.

Footnotes

ORCID iD

Xiangze Liu

Funding

This study received research funding from the National Natural Science Foundation of China, grant number 62462031, and the Natural Science Foundation of Jiangxi Province, grant number 20242BAB26023.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.


