Missing Traffic Data Imputation based on Tensor Completion and Graph Network Fusion

Abstract

During traffic data acquisition, missing data often arise owing to equipment failures and network disruptions. Despite extensive research on traffic data imputation, two primary limitations persist. First, existing methods struggle to fully integrate the spatiotemporal correlations and low-rank structures inherent in traffic data. Second, current research has mostly focused on data missing completely at random (MCAR), with limited attention to other missing data patterns. To address these challenges, we propose a method for missing traffic data imputation called tensor completion and graph network fusion (TCGNF). The method first uses tensor completion for the preliminary imputation of missing data. It then constructs two road networks, one from the Pearson correlation coefficients of historical road data and one from the physical distances between detectors. Next, it uses graph sampling and aggregation (GraphSAGE) to extract spatiotemporal data features from the two road networks and fuse them. Finally, these features are fed into generative adversarial networks (GANs), which are trained for accurate data imputation. Extensive experiments were conducted on two publicly accessible traffic datasets to validate the efficacy of the TCGNF model. The results indicate that the TCGNF model generalizes well and significantly outperforms other state-of-the-art data imputation models in overall performance.

Keywords

traffic data imputation, intelligent transportation system, tensor completion, deep learning, data analysis

Acquiring basic parameters such as traffic flow, speed, and density is crucial when planning, designing, and operating transportation systems ( 1 ). These elements significantly influence the efficiency and effectiveness of urban traffic systems ( 2 ). However, despite advances in data collection techniques, traffic information management systems in many countries still grapple with issues such as data corruption and loss ( 3 ). This persists as a significant challenge for intelligent transportation systems (ITS) ( 4 ). In traffic data applications, traffic data analysis algorithms such as support vector machines (SVM) and neural networks usually require complete datasets to improve prediction accuracy and make better decisions (5 –7). Traffic data imputation algorithms have become a standard solution to cope with the problem of missing traffic data and to improve data integrity and availability ( 8 ). The application of this approach brings significant benefits to various stakeholders. For traffic management authorities, imputation algorithms can improve the accuracy of real-time monitoring so that traffic anomalies can be detected more quickly and effective emergency measures can be taken. This helps optimize resource allocation and improves the overall efficiency of traffic management. For city planners, relying on more accurate and complete data allows for more in-depth analyses of traffic patterns and infrastructure needs so that more efficient urban road networks and public transport systems can be designed to better cope with the growing traffic demand in the future. Traffic data imputation solves the challenge of missing data, empowers users to make more informed decisions, and supports the development of smarter, more efficient transportation systems.

Traffic data imputation has evolved from statistical models to deep learning methods ( 9 ). Statistical approaches primarily rely on basic statistical and mathematical models such as linear imputation and average estimation ( 10 ). Although these techniques are straightforward, they often overlook the complex relationships within the missing data, resulting in limited generalizability ( 11 ). As time series analysis methods developed, techniques like autoregressive integrated moving average (ARIMA) started to be used for traffic data imputation. While these methods enhance our understanding of the data’s temporal dependencies, they exhibit limitations in managing nonlinear and non-stationary data ( 12 ). With the rise of machine learning, techniques such as decision trees, neural networks, and SVM have proved effective at identifying complex relationships in data for traffic data imputation tasks, but they often require a large number of parameters and therefore considerable computational resources ( 13 ).

In recent years, deep learning technologies have significantly propelled the field of traffic data imputation forward ( 14 ). For instance, while convolutional neural networks (CNN) are adept at learning local features, recurrent neural networks (RNN) and long short-term memory networks (LSTM) specialize in capturing temporal dependencies (15 –17). Autoencoders have been effective in capturing low-dimensional structures ( 18 ). As a result, the precision of data imputation has experienced substantial enhancement ( 19 ). Furthermore, generative adversarial networks (GANs), which generate high-quality samples through adversarial training for traffic data imputation, have emerged as a primary research focus in recent years ( 20 ). At the same time, researchers are exploring the integration of deep learning models with other innovative technologies ( 21 ). For instance, integrating deep learning with graph neural networks (GNNs) and transfer learning has been under examination to harness additional a priori information ( 22 ). Ensemble learning frameworks are also being explored to combine the strengths of different models ( 23 ).

Despite the numerous solutions available for traffic data imputation, two primary limitations remain to be addressed.

First, existing methods often find it difficult to consider the spatiotemporal characteristics of traffic data and its low-rank structure simultaneously, even though both are prominent in practice. Traffic data have complex and variable spatiotemporal characteristics. For example, vehicles and pedestrians tend to be concentrated in office or recreational areas owing to the differences in functional attributes of different regions, which makes traffic data exhibit significant spatial correlation between regions. In the time dimension, the traffic patterns of morning and evening peaks are highly similar between days, and traffic flows are usually significantly correlated between neighboring periods or locations ( 24 ). In addition, there may be covariance in the data collected by different traffic detectors, which implies that the information from some detectors can be approximated by the data from other detectors, thus further highlighting the low-rank nature of the traffic data.

Second, many current methods are mainly designed for scenarios in which data are missing completely at random (MCAR), making them less effective for other types of missing data. However, missingness in real-world traffic data is often not entirely random. For example, factors like unfavorable geographical conditions might result in devices in certain areas consistently being unable to gather data. This phenomenon can be categorized as spatially missing completely at random (MCARS).

This paper proposes an innovative approach that imputes missing traffic data effectively and performs well across different missing-data scenarios. For example, when a city traffic monitoring system loses data from some sensors owing to equipment failure or network problems, the monitoring of traffic conditions at key intersections is affected. By combining spatiotemporal correlation and low-rank structure imputation algorithms, the missing information is recovered using neighboring road sections and historical data. This helps the traffic management department monitor traffic conditions more accurately, identify potential congestion, and optimize the emergency response, thus enhancing the efficiency of traffic management and citizens’ travelling experience. The approach can achieve effective imputation for both MCAR and MCARS, providing more accurate and comprehensive support for analyzing and applying traffic data. The contributions of this paper are as follows:

(i) A new deep learning framework is proposed to uncover the low-rank characteristics and complex spatiotemporal correlations within traffic networks and to accurately and robustly impute missing traffic data.

(ii) The preliminary imputation of traffic data using tensor completion is proposed. The preliminary imputation of the data by this mathematical method takes full account of the low-rank structure of the data itself. It lays a solid foundation for the accuracy of the spatial–temporal feature extraction of the subsequent data ( 25 ).

(iii) A new method for constructing temporal and spatial correlation road networks is proposed. The temporal and spatial correlation networks are constructed using the Pearson correlation coefficient and Gaussian function. The temporal correlation network reveals the correlation of traffic data between different time points, while the spatial correlation network shows the traffic flow patterns between different locations ( 26 ).

(iv) A spatiotemporal feature fusion method based on GraphSAGE is proposed. GraphSAGE is an efficient GNN algorithm that learns node embeddings without relying on global graph information ( 27 ). This method can better utilize the information from temporal and spatial-related road networks to extract more representative spatiotemporal features.

Related Work

The imputation of missing traffic data has been extensively studied, with various methods demonstrating strengths and limitations in handling spatiotemporal complexities. Tensor completion methods leverage the inherent low-rank nature of traffic data to achieve notable accuracy in data recovery but face challenges related to rank determination, outlier sensitivity, and scalability in dynamic scenarios ( 28 ). GNNs excel at capturing spatial dependencies in graph-structured traffic networks, enhancing imputation precision. However, their reliance on graph quality and limited capacity for modelling temporal dynamics pose challenges for large-scale, dynamic networks ( 29 ). GANs, particularly variants like Wasserstein GAN (WGAN), show promise in handling high-dimensional data distributions and improving training stability and imputation performance. However, they remain prone to mode collapse, sensitivity to architecture, and difficulties with spatiotemporal correlations ( 30 ). These limitations underscore the potential of hybrid approaches that integrate tensor completion, GNNs, and GANs, combining their strengths to address spatiotemporal challenges better and enhance the robustness and accuracy of traffic data imputation.

Tensor Completion

Tensor completion algorithms are techniques designed to impute missing values in multidimensional data ( 31 ). The core idea involves estimating missing values using low-rank matrices or tensor approximations ( 32 ). Given the inherent low-rank nature of traffic data, researchers have proposed a variety of tensor-completion-based methods for recovering missing traffic flow, speed, and trajectory data ( 33 ). These methods take full advantage of the low-rank structure of traffic data by designing matrix or tensor rank functions that can accurately describe its structure (34, 35). For example, in 2020, Chen et al. proposed a novel non-convex low-rank tensor completion (LRTC) model specifically designed to enhance the imputation of missing spatiotemporal traffic data. This model employs the truncated nuclear norm (TNN) minimization approach, which exhibits superior performance over existing methods, particularly in scenarios with high missing data rates. The LRTC-TNN model effectively captures and leverages the intrinsic low-rank structure of spatiotemporal traffic data, resulting in more accurate data imputation ( 36 ). By contrast, in 2022, Nie et al. introduced a method utilizing the truncated tensor Schatten-norm, aimed explicitly at efficiently addressing the complex missing patterns in spatiotemporal traffic data ( 37 ). The model efficiently addresses the non-convex optimization problem by integrating the alternating direction method of multipliers (ADMM) with the generalized soft thresholding (GST) technique. Furthermore, the study proposed a truncation rate decay strategy for different missing data scenarios, demonstrating through experimental results that the method achieves outstanding performance across various conditions. These methods have achieved satisfactory imputation effects when dealing with real traffic datasets.

Although tensor completion provides an efficient imputation scheme by exploiting the low-rank structure of traffic data, it still faces some challenges in practical application ( 38 ). For instance, the difficulty in precisely determining the rank of a tensor can lead to excessive or insufficient data imputation, and sensitivity to outliers may affect the model’s accuracy. Consequently, to improve the expressive power and usefulness of the model, further optimization of the use of tensor completion in traffic data imputation tasks is required. This includes developing more rationalized rank functions, strengthening the model’s resistance to anomalies, and adding outside data for supervision ( 39 ).

Graph Neural Networks

GNNs are a specialized category of deep learning models crafted for handling data structured as graphs ( 40 ). Graph-structured data encompass not only nodes (e.g., intersections, sensors in traffic networks) and edges (e.g., road segments, relationships between nodes) but also the intricate interplay between them, which can include aspects such as the strength of connections, directional flows, or dynamic changes over time ( 41 ). Compared with traditional deep learning models, the uniqueness of GNNs lies in their ability to operate directly on graph structures, enabling them to effectively capture complex relationships and dependencies between nodes. For instance, in social network analysis, GNNs can be used to identify community structures; in molecular structure identification, they can predict chemical properties ( 42 ).

In traffic data imputation, GNNs analyze node data and capture node interactions within road networks. This dual capability enables more precise prediction and imputation of missing data. Particularly in complex or incomplete traffic networks, GNNs leverage information from adjacent nodes, enhancing the accuracy of data imputation by effectively mapping spatial relationships ( 43 ).

Considering specific applications, in 2022, Cini et al. introduced an innovative method termed the graph recurrent imputation network (GRIN) ( 44 ). This method, grounded in GNNs, addresses multivariate time series imputation. The essence of GRIN lies in its ability to reconstruct missing data within multivariate time series. This is achieved through the learning of spatiotemporal representations from sensor network data. Such an approach directly addresses the pervasive challenge of data incompleteness, especially evident in traffic network analysis.

GNNs also face several limitations in traffic data imputation. Their performance heavily relies on the quality of the graph structure, meaning inaccuracies in the adjacency matrix or relational definitions can degrade results. Additionally, for large-scale traffic networks, the computational complexity of graph networks poses challenges in terms of training time and memory consumption. Furthermore, while effective at capturing spatial relationships, graph networks often struggle to model temporal features adequately, limiting their ability to represent dynamic traffic patterns ( 45 ).

Generative Adversarial Network

GANs comprise two opposing networks: the generator, which produces refined samples iteratively to deceive the discriminator, and the discriminator, which determines whether a sample is actual or generated ( 46 ). GANs have demonstrated significant potential for traffic data imputation ( 47 ). GANs are more adept at grasping intricate high-dimensional data distributions than other deep learning frameworks. They can produce high-quality simulated data, effectively imputing missing information. For instance, in 2018, Yoon et al. introduced the generative adversarial imputation network (GAIN), which utilizes adversarial training for data imputation ( 48 ). In 2020, Chen et al. proposed a new method to enhance traffic data imputation through parallel data and GANs ( 49 ).

Early GANs employed the Jensen–Shannon divergence to measure the difference between the distribution generated by the model and the actual data distribution ( 50 ). Using this metric led to challenges in training stability, notably gradient vanishing and mode collapse. To mitigate these problems, Arjovsky et al. championed the WGAN, which adopts the Wasserstein distance over the Jensen-Shannon divergence ( 51 ). Adopting the Wasserstein distance has significantly improved GAN training stability. In traffic data imputation, WGANs have demonstrated superior performance ( 52 ). For instance, in 2022, Xu et al. employed a combination of WGAN and graph aggregators to address traffic data gaps effectively, demonstrating significant performance improvements ( 53 ).

While GANs hold promise for traffic data imputation, several limitations affect their effectiveness. The primary challenge is mode collapse, which can produce incomplete or repetitive imputed patterns, undermining result reliability. Additionally, GANs struggle to capture the temporal and spatial correlations essential in traffic data, and their training process is unstable and highly sensitive to architecture, hyperparameters, and loss functions, making generalization difficult across different datasets ( 54 ). Overcoming these issues requires refining network architectures, developing task-specific loss functions, and exploring hybrid models that integrate GANs with GNNs to capture spatiotemporal dependencies in traffic data better.

Method

Figure 1 illustrates the overall framework of the method proposed in this paper. It encompasses four main components. (1) Construction of two road networks: one based on temporal correlations derived from historical data and another based on spatial correlations from sensor distances. (2) Using the LRTC module for the preliminary imputation of missing data. (3) Extracting spatiotemporal features from the preliminarily completed data of the two road networks through GraphSAGE and fusing the spatiotemporal information of these two networks. (4) Using generative adversarial networks to achieve the final repair of network traffic data.

Figure 1.

The structure of the tensor completion and graph network fusion (TCGNF).

Data Preprocessing

Firstly, the detectors in the road network are regarded as nodes of a graph $G = (V, E)$ , where $V = \{v_{1}, v_{2}, v_{3}, \dots, v_{N}\}$ is the collection of detectors, $v_{i}$ represents an individual node, and N signifies the total number of detectors. The edge set $E = \{e_{ij}\}_{i, j = 1}^{N}$ depicts relationships between nodes, with $e_{ij}$ being 1 if nodes $v_{i}$ and $v_{j}$ are adjacent, and 0 otherwise. Define the traffic data for the $i$th detector as $x_{i} = (x_{i1}, x_{i2}, x_{i3}, \dots, x_{iT})$ . Here, $x_{it}$ represents the traffic data of the $i$th detector during the time interval t, with $i = 1$ to N and $t = 1$ to T. T denotes the historical data length for each detector. Subsequently, the detectors’ historical data are divided into blocks, each of length F, resulting in a data matrix $X \in \mathbb{R}^{N \times F}$ encompassing N nodes and F features.

Next, based on the pattern of missing values in the traffic data matrix X, a mask matrix M of the same dimensions as X is introduced. Missing data points are marked as 0, whereas present data points are represented by 1. Furthermore, a matrix $\tilde{X} = X * M$ represents the data with missing values, which will be the target matrix for subsequent imputation. The symbol “*” denotes element-wise multiplication, meaning multiplication of corresponding elements, as illustrated in Figure 2.

Figure 2.

Missing data generation.
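The masking step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names `mcar_mask` and `mcars_mask`, the toy data, and the reading of spatial missingness as whole detectors dropping out are assumptions made for the example.

```python
import numpy as np

def mcar_mask(shape, missing_rate, rng):
    """Drop each entry independently with probability `missing_rate` (1 = observed)."""
    return (rng.random(shape) >= missing_rate).astype(float)

def mcars_mask(shape, missing_rate, rng):
    """Drop whole detectors (rows); an assumed reading of spatial missingness."""
    n_nodes = shape[0]
    n_missing = int(round(missing_rate * n_nodes))
    mask = np.ones(shape)
    dropped = rng.choice(n_nodes, size=n_missing, replace=False)
    mask[dropped, :] = 0.0
    return mask

rng = np.random.default_rng(0)
X = rng.uniform(20.0, 70.0, size=(6, 12))  # toy N x F matrix of speeds
M = mcar_mask(X.shape, 0.3, rng)           # mask matrix, same shape as X
X_tilde = X * M                            # element-wise product: missing entries become 0
```

Because missing entries are encoded as 0 in this formulation, the mask M has to be carried along with the data so that genuine zero readings are not confused with missing values.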

Road Network Construction

Construction of the Temporal-Correlation-Based Road Network

Using the Pearson correlation coefficient to calculate the temporal correlation between nodes in a road network is effective because it quantifies the linear relationship between traffic flow data at different time points. This method helps identify how traffic conditions at one node are related to those at another over time, capturing the temporal dependencies between traffic patterns at different locations. Given that traffic flow often follows linear trends over time, Pearson’s correlation is particularly suited for this purpose. The coefficient, ranging from −1 to 1, provides an intuitive and easily interpretable measure of correlation strength, facilitating the identification of significant temporal associations. Additionally, Pearson’s correlation is computationally efficient, making it highly suitable for large-scale traffic datasets and essential for reconstructing temporal correlation-based road networks in real-world applications.

For each road node i monitored by a detector, based on the historical traffic state data $X_{i}$ , the formula for the Pearson correlation coefficient is as follows:

ρ_{ij} = \frac{K \sum_{k = 1}^{K} X_{ik} X_{jk} - \sum_{k = 1}^{K} X_{ik} \sum_{k = 1}^{K} X_{jk}}{\sqrt{K \sum_{k = 1}^{K} X_{ik}^{2} - {(\sum_{k = 1}^{K} X_{ik})}^{2}} \sqrt{K \sum_{k = 1}^{K} X_{jk}^{2} - {(\sum_{k = 1}^{K} X_{jk})}^{2}}}

(1)

where K is the number of historical traffic state observations from each selected detector used to calculate the Pearson correlation coefficient. The coefficients $ρ_{ij}$ computed for all pairs of detectors form the Pearson correlation matrix R.
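Equation 1 can be evaluated for all detector pairs at once. The following is a minimal NumPy sketch under the assumption that the historical data are stored as an N × K array with one row per detector; the function name `pearson_matrix` is hypothetical.

```python
import numpy as np

def pearson_matrix(X):
    """Pearson correlation matrix R for X of shape (N, K), following Equation 1."""
    K = X.shape[1]
    s = X.sum(axis=1)                         # sum_k X_ik for every detector i
    cross = X @ X.T                           # sum_k X_ik * X_jk for every pair (i, j)
    num = K * cross - np.outer(s, s)          # numerator of Equation 1
    var = K * (X ** 2).sum(axis=1) - s ** 2   # term under each square root
    return num / np.sqrt(np.outer(var, var))

rng = np.random.default_rng(0)
history = rng.normal(size=(5, 288))           # toy data: 5 detectors, 288 observations
R = pearson_matrix(history)
```

The result agrees with `np.corrcoef(history)` up to floating-point error. A detector with a constant series has zero variance and would produce a division by zero, so such rows need separate handling in practice.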

In the correlation coefficient matrix, some neighboring nodes have relatively low correlation with a given road node, which may interfere with the extraction of network features. To account for this, the traffic network is reconstructed by selecting, for each node $v_{i}$ , the n detectors with the highest Pearson correlation coefficients as its edge relationships. This process yields a logically correlated network $H = (V, T)$ where $T = \{a_{ij}\}_{i, j = 1}^{N}$ is the adjacency matrix, with $a_{ij} = 1$ if there is an edge between nodes $v_{i}$ and $v_{j}$ , and $a_{ij} = 0$ otherwise. Figure 3 illustrates the schematic diagram of the network reconstruction under $n = 2$ . It is worth noting that, as shown in the figure, there is a significant difference between the original road network and the network reconstructed using Pearson’s correlation coefficient. This discrepancy arises because the network constructed by Pearson’s correlation coefficient primarily captures temporal correlations, which are not directly represented in the original road network. While this method effectively captures temporal dependencies, it is less sensitive to the spatial distance between nodes. This limitation further underscores the necessity of the road network fusion method proposed in this paper.

Figure 3.

Construction of the temporal-correlation-based road network.
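The neighbor selection can be sketched as follows. This is an illustrative example, not the authors' implementation: the function name `top_n_adjacency` and the toy correlation matrix are assumptions, and the text does not state whether the resulting adjacency matrix is symmetrized, so the sketch leaves it directed.

```python
import numpy as np

def top_n_adjacency(R, n):
    """For each node, connect it to the n nodes with the highest correlation."""
    N = R.shape[0]
    scores = R.astype(float).copy()
    np.fill_diagonal(scores, -np.inf)               # a node is never its own neighbor
    neighbors = np.argsort(-scores, axis=1)[:, :n]  # indices of the n largest per row
    A = np.zeros((N, N), dtype=int)
    rows = np.repeat(np.arange(N), n)
    A[rows, neighbors.ravel()] = 1
    return A

R_toy = np.array([
    [1.0, 0.9, 0.1, 0.5],
    [0.9, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.8],
    [0.5, 0.3, 0.8, 1.0],
])
A_toy = top_n_adjacency(R_toy, n=2)
```

With n = 2, every row of `A_toy` contains exactly two ones. Node 2 selects node 1 as a neighbor although node 1 does not select node 2, which shows why the choice between a directed and a symmetrized graph matters.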

This method enables the identification of temporal associations between detectors within the original road network, thereby constructing a more correlated and accurate road network diagram. Although other methods, such as dynamic time warping (DTW), mutual information (MI), or even deep learning-based approaches, could potentially measure temporal associations, they often involve higher computational costs or added complexity. Compared with these alternatives, the Pearson correlation coefficient achieves a desirable balance between accuracy, simplicity, and computational efficiency, making it highly suitable for this study.

Construction of the Spatial-Correlation-Based Road Network

When considering the spatial relationships in sensor networks, traditional methods often limit their analysis to the direct physical connections between sensors. However, the indirect connections between sensors are particularly important in many scenarios, such as traffic monitoring or environmental surveillance. To more comprehensively reveal the network’s spatial relationships, this method proposes using a Gaussian function to construct an adjacency matrix that reflects the spatial relationships between sensors ( 55 ).

Initially, an adjacency matrix $M = \{m_{ij}\}$ whose size matches the number of sensors is initialized. $m_{ij}$ is an element in the adjacency matrix, representing the distance from sensor i to sensor j. When i and j are not equal, their initial distance is set to infinity ∞, indicating the absence of a direct connection. When i and j are equal, their distance is set to 0. Subsequently, a dictionary mapping sensor IDs to matrix indices is established. Then, the matrix is populated based on the physical distance information between sensors.

The key step involves using a Gaussian function to convert each distance into a corresponding weight based on the standard deviation of the distance matrix, thereby reflecting the degree of interconnection between sensors in the network. The reason for using the Gaussian function is that it can smoothly decay the weights, making the closer sensors have stronger associations with each other, while the associations of the more distant sensors are gradually weakened, avoiding the phenomenon of abrupt changes in the traditional method, and also reflecting more accurately the natural spatial relationships between the sensors. The process is as follows:

Calculate the standard deviation σ of all non-infinite elements in the matrix M:

σ = \sqrt{\frac{\sum {(m_{ij} - μ)}^{2}}{N}}

(2)

where μ is the mean of all non-infinite elements, and N here denotes the number of non-infinite elements (not the number of detectors).

Transform distances into weights using a Gaussian function to obtain the weight matrix W:

W_{ij} = \exp (- \frac{m_{ij}^{2}}{2 σ^{2}})

(3)

This transformation reduces the impact of greater distances and normalizes the weights between 0 and 1. To decrease computational complexity and highlight closer spatial associations, the method adopts a similar construction approach to the temporal correlation matrix by selecting an appropriate n value, thereby increasing the sparsity of the matrix. The final output sparse adjacency matrix maps the physical distances between sensors and reveals their potential spatial relationships. This method makes it possible to understand and analyze the structure and function of sensor networks more precisely, providing a solid foundation for data processing and decision support.
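The distance-to-weight conversion in Equations 2 and 3 can be sketched in NumPy as follows. The function name `gaussian_weights` and the toy distance matrix are assumptions for illustration, and pairs without a recorded distance are given weight 0, which is the limit of the Gaussian as the distance tends to infinity.

```python
import numpy as np

def gaussian_weights(dist):
    """Convert a distance matrix (np.inf = no recorded distance) into weights in [0, 1]."""
    finite = np.isfinite(dist)
    sigma = dist[finite].std()   # population standard deviation, as in Equation 2
    W = np.zeros_like(dist, dtype=float)
    W[finite] = np.exp(-dist[finite] ** 2 / (2.0 * sigma ** 2))  # Equation 3
    return W

dist = np.array([
    [0.0, 1.0, np.inf],
    [1.0, 0.0, 2.0],
    [np.inf, 2.0, 0.0],
])
W = gaussian_weights(dist)
```

The diagonal distances of 0 map to a weight of 1, and the pair at distance 1 receives a larger weight than the pair at distance 2. Keeping only the n largest weights per row afterwards would give the sparse adjacency matrix described above.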

Low-Rank Tensor Completion

LRTC is a class of tensor completion methods based on the low-rank assumption. In this study, we apply LRTC for the initial imputation of traffic data. First, the incomplete traffic data $\tilde{X}$ are converted into a third-order tensor $Y \in \mathbb{R}^{N \times D \times F}$ , where N is the number of nodes (i.e., traffic network segments or detectors), D is the number of days, and F is the number of time steps per day. The objective of the LRTC model is to recover the missing data by minimizing the rank of the tensor. The objective function is formulated as:

\begin{array}{l} \underset{X}{\arg \min} \ \operatorname{rank} (X) \\ \text{subject to } P_{Ω} (X) = P_{Ω} (Y) \end{array}

(4)

where $X \in \mathbb{R}^{N \times D \times F}$ is the tensor we aim to recover, and Ω is the index set of the observed entries. The rank minimization objective exploits the low-rank nature of the traffic data in both the spatial and temporal dimensions. The operator $P_{Ω}$ projects a tensor onto the observed entries and is defined as follows for any tensor X:

[P_{Ω} (X)]_{n, d, f} = \begin{cases} X_{n, d, f}, & \text{if } (n, d, f) \in Ω, \\ 0, & \text{otherwise}. \end{cases}

(5)

Here, $(n, d, f)$ is the index of the tensor, where $n \in [1, N]$ corresponds to nodes, $d \in [1, D]$ corresponds to days, and $f \in [1, F]$ corresponds to time steps.
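The tensor construction and the projection operator in Equation 5 can be illustrated with a short NumPy sketch. The text does not specify how the series are folded into the tensor, so the day-major reshape used in the hypothetical helper `to_tensor` is an assumption, as is representing Ω by a boolean array.

```python
import numpy as np

def to_tensor(series, n_days, steps_per_day):
    """Fold an (N, D*F) array of per-detector series into an (N, D, F) tensor."""
    n_nodes = series.shape[0]
    return series.reshape(n_nodes, n_days, steps_per_day)

def project_observed(X, observed):
    """P_Omega: keep entries whose index is in Omega (observed == True), zero the rest."""
    return np.where(observed, X, 0.0)

series = np.arange(24.0).reshape(2, 12)   # toy data: 2 detectors, 3 days x 4 steps
Y = to_tensor(series, n_days=3, steps_per_day=4)
observed = (Y % 2 == 0)                   # toy index set: even-valued entries are observed
P = project_observed(Y, observed)
```

The constraint in Equation 4 then reads `project_observed(X, observed) == project_observed(Y, observed)`: the recovered tensor must agree with the data wherever an observation exists.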

LRTC has several key limitations in traffic data imputation. First, rank minimization is an NP-hard problem, resulting in high computational cost, especially when handling large-scale, high-dimensional spatiotemporal data. Second, LRTC uses the standard nuclear norm (NN), which fails to distinguish between important and less significant singular values, potentially overlooking critical traffic patterns. Third, NN minimization can lead to over-smoothing, diminishing the representation of essential features, particularly in the presence of missing or noisy data. To address these issues, we introduce the TNN. By truncating smaller singular values and retaining larger ones, TNN effectively reduces computational complexity and improves data recovery accuracy. Moreover, TNN effectively prevents over-smoothing, significantly enhancing imputation performance, particularly in cases of missing or noisy data. The definition of the TNN regularization term is as follows:

∥ X ∥_{TNN} = \sum_{i = 1}^{r} σ_{i} (X) \cdot I (σ_{i} (X) > τ)

(6)

where X is the tensor, r is its rank, τ is a threshold parameter, and $I (σ_{i} (X) > τ)$ is an indicator function that outputs 1 when $σ_{i} (X) > τ$ and 0 otherwise. Minimizing the TNN regularization term limits the tensor’s rank, resulting in better tensor completion.
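Equation 6 can be evaluated numerically with a singular value decomposition. The sketch below applies the definition to a matrix; for a third-order tensor, one common choice (assumed here, since the text does not state it) is to apply it to a mode unfolding. The names `unfold` and `thresholded_nuclear_norm` are hypothetical.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: rows index the chosen mode, columns the remaining ones."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def thresholded_nuclear_norm(mat, tau):
    """Sum of the singular values of `mat` that exceed `tau`, as written in Equation 6."""
    s = np.linalg.svd(mat, compute_uv=False)
    return float(s[s > tau].sum())

D = np.diag([5.0, 2.0, 0.5])                  # singular values are 5, 2, and 0.5
value = thresholded_nuclear_norm(D, tau=1.0)  # only 5 and 2 exceed the threshold
```

With τ = 1 the singular value 0.5 is excluded from the sum, and with τ = 0 the expression reduces to the standard nuclear norm.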

Since the objective function minimized by TNN is non-convex, an ADMM-based non-convex optimization algorithm is employed. This algorithm decomposes the original tensor completion problem into three subproblems that are solved iteratively ( 56 ).

Subproblem 1: Updating the tensor X

X^{k + 1} = \underset{X}{\arg \min} (∥ X^{k} ∥_{r, *} + ρ_{k} ∥ X^{k} - M^{k + 1} ∥_{F}^{2} + \langle X^{k}, T^{k} \rangle)

(7)

Subproblem 2: Updating the matrix M

M^{k + 1} = \underset{M}{\arg \min} (\sum_{i = 1}^{3} ∥ X^{k} - M^{k + 1} ∥_{F}^{2} - \langle M^{k + 1}, T^{k} \rangle)

(8)

Subproblem 3: Updating the auxiliary variable T

T^{k + 1} = T^{k} + ρ_{k} (X^{k} - M^{k + 1})

(9)

where $X^{k}$ is the data tensor at the current iteration, $X^{k + 1}$ is the updated data tensor for the next iteration, $∥ X^{k} ∥_{r, *}$ is the truncated nuclear norm used as a surrogate for the rank of the data tensor, $ρ_{k}$ is a non-negative parameter that balances the TNN term and the data-fitting term, $∥ X^{k} - M^{k + 1} ∥_{F}^{2}$ is the squared Frobenius norm of the difference between the data tensor and the variable $M^{k + 1}$ , and $\langle X^{k}, T^{k} \rangle$ denotes the inner product of the data tensor $X^{k}$ and the auxiliary variable $T^{k}$ .

These three subproblems collectively form the core optimization framework of the LRTC model. Solving these subproblems iteratively enables the model to converge toward an approximately completed tensor X, where missing values are imputed. This crucial step enhances the likelihood of having complete data available for the subsequent extraction of road network node features.
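The alternating structure of the three updates can be sketched on a matrix instead of a third-order tensor. This is a simplified illustration under several assumptions, not the authors' algorithm: it uses the standard scaled ADMM form with a penalty of ρ/2, a truncated singular value shrinkage that leaves the `theta` largest singular values untouched, and hypothetical names and default values (`truncated_svt`, `admm_complete`, `theta`, `rho`, `n_iter`).

```python
import numpy as np

def truncated_svt(mat, tau, theta):
    """Shrink all but the `theta` largest singular values of `mat` by `tau`."""
    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    s_new = s.copy()
    s_new[theta:] = np.maximum(s[theta:] - tau, 0.0)
    return (u * s_new) @ vt

def admm_complete(Y, mask, theta=2, rho=1e-2, n_iter=100):
    """Matrix completion by ADMM; `mask` is True or 1 where Y is observed."""
    observed = mask.astype(bool)
    M = np.where(observed, Y, 0.0)   # completed data, initialized with zeros in the gaps
    T = np.zeros_like(M)             # auxiliary (dual) variable
    for _ in range(n_iter):
        X = truncated_svt(M - T / rho, 1.0 / rho, theta)  # subproblem 1: low-rank variable
        M = np.where(observed, Y, X + T / rho)            # subproblem 2: fill missing entries only
        T = T + rho * (X - M)                             # subproblem 3: dual update (Equation 9)
        rho = min(rho * 1.05, 1e5)                        # gradually tighten the penalty
    return M

rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 10))  # toy rank-2 matrix
mask = rng.random(truth.shape) > 0.3                        # roughly 70% of entries observed
filled = admm_complete(truth * mask, mask)
```

By construction, `filled` equals the input on every observed entry, so the constraint in Equation 4 holds exactly and only the missing positions are estimated. How closely those estimates match the true values depends on `theta`, `rho`, and the missing pattern.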

Road Network Feature Fusion

This study employs GraphSAGE to extract spatiotemporal information from reconstructed road network data, with the aim of improving the imputation of missing traffic data. The core of this method lies in how to aggregate feature information from neighboring nodes. The computation process can be divided into the following three steps:

Firstly, the adjacency matrix is processed to eliminate self-connections and is then normalized. Normalization gives each neighboring node equal influence during aggregation, so that the aggregated features of a node are not biased by differences in the number of neighbors.

Next, feature aggregation for the target node and its neighbors is performed using a selected aggregation function. Various aggregation functions are available, such as mean aggregation, LSTM aggregation, and pooling aggregation. This study selects mean aggregation, which calculates the average of each dimension across the embeddings of the neighboring nodes. The mean aggregation operation can be described as:

h_{ν_{i}}^{z} = \frac{1}{| N (ν_{i}) |} \sum_{ν_{j} \in N (ν_{i})} h_{ν_{j}}^{z - 1}

(10)

where $Z$ represents the depth of aggregation, $h_{ν_{i}}^{z}$ represents the aggregated features of node $ν_{i}$ at the $z$-th layer ($z = 1, \ldots, Z$), $N (ν_{i})$ represents the set of neighboring nodes of $ν_{i}$, and $| N (ν_{i}) |$ represents the number of neighboring nodes. In the context of traffic flow, $h_{ν_{j}}^{z - 1}$ represents the traffic-related feature vectors (such as traffic volume or speed) at the previous aggregation level, capturing the spatiotemporal correlations.

Finally, the embedding features of the nodes are computed through nonlinear transformation operations. The formula for computing the node embedding features is:

h_{ν_{i}}^{z} = σ (W^{z} \cdot h_{ν_{i}}^{z})

(11)

where $W^{z}$ represents the weight matrix at the $z$-th layer, and σ is the nonlinear activation function (e.g., ReLU). This step maps the aggregated feature information through a nonlinear transformation, allowing the model to capture more complex relationships and interactions within the road network data.

Through Z layers of aggregation operations, we obtain the aggregated features for each node, which represent the road network’s features at different levels of abstraction. The node aggregation process is illustrated in Figure 4, where two layers of GraphSAGE are used to extract features from the road network nodes.

Figure 4.

Node feature extraction.

After aggregating features from the spatially correlated road network data and temporally correlated traffic flow data to obtain $H_{v}^{z}$ and $H_{v}^{z^{'}}$ , respectively, the two sets of aggregated features are fused by directly concatenating them using the CONCAT function. This results in a merged feature set:

H_{v}^{z^{''}} = CONCAT (H_{v}^{z}, H_{v}^{z^{'}})

(12)

This concatenated feature vector $H_{v}^{z^{''}}$ combines spatial and temporal information, providing enriched features that improve the imputation of missing traffic data. The fused features serve as the input for the subsequent data imputation task. By using GraphSAGE aggregators, we capture the key spatiotemporal features of the traffic data, enriching the feature set and enhancing the accuracy and reliability of the traffic flow imputation process.
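The aggregation and fusion steps in Equations 10 to 12 can be sketched in NumPy as follows. This is an illustrative sketch only: the function names, the toy adjacency matrices, and the identity weight matrix are assumptions chosen to keep the arithmetic easy to follow, and a trained model would learn $W^{z}$.

```python
import numpy as np

def normalize_adjacency(adj):
    """Drop self-connections and row-normalize so each neighbor has equal weight."""
    a = np.array(adj, dtype=float)
    np.fill_diagonal(a, 0.0)
    degree = a.sum(axis=1, keepdims=True)
    return a / np.where(degree > 0, degree, 1.0)   # isolated nodes keep an all-zero row

def sage_mean_layer(adj_norm, h, weight):
    """One layer: mean over neighbors (Equation 10), then ReLU(W h) (Equation 11)."""
    aggregated = adj_norm @ h
    return np.maximum(aggregated @ weight.T, 0.0)

def fuse(h_spatial, h_temporal):
    """Concatenate the two feature sets along the feature axis (Equation 12)."""
    return np.concatenate([h_spatial, h_temporal], axis=1)

# Toy example: 3 detectors with 2 features each and two hypothetical graphs.
features = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
adj_spatial = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
adj_temporal = np.array([[0, 0, 1], [0, 0, 1], [1, 1, 0]])
w = np.eye(2)  # identity weights, for illustration only

h_s = sage_mean_layer(normalize_adjacency(adj_spatial), features, w)
h_t = sage_mean_layer(normalize_adjacency(adj_temporal), features, w)
fused = fuse(h_s, h_t)   # shape (3, 4): spatial features followed by temporal features
```

Stacking two such layers, each with its own weight matrix, would correspond to the two-layer configuration shown in Figure 4.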

Generative Adversarial Network Training

The GAN, consisting of a generator and a discriminator, is employed to generate accurate traffic state data. The generator produces complete traffic state data by using the aggregated features of missing traffic state data. Simultaneously, the discriminator provides supervision to ensure that the distribution of the generated complete data is similar to the distribution of real data.

The training of the generator and the discriminator is driven by their loss functions. The generator consists of four linear layers: three hidden layers with 64, 128, and 256 neurons, respectively, and one output layer. It takes the aggregated features as input, passes them through the three hidden layers and an activation function, and outputs the generated complete traffic state data. The generator’s loss function has two parts: the discriminator’s loss on the generated data and the reconstruction loss, which measures the difference between the generated data and the real data. The generator’s loss function is as follows:

G_{loss} = - D_{loss_fake} + α \cdot MSE_{loss}

(13)

where $D_{loss_fake}$ is the discriminator’s loss on the generated data, α is the weighting coefficient of the reconstruction loss, and $MSE_{loss}$ is the mean squared error loss. The reconstruction loss ensures that the generated data accurately reflects the traffic state, particularly relating to the continuity and consistency of traffic flow data.

The discriminator in this study is symmetric to the generator and consists of four linear layers: three hidden layers with 256, 128, and 64 neurons, respectively, and one output layer. After the output layer, a sigmoid activation function maps the output to a range between 0 and 1, representing the probability that the input data are real. The discriminator’s loss function combines the losses on the generated data and on the real data, as follows:

D_{loss} = D_{loss_real} + D_{loss_fake}

(14)

To address stability issues during training, this study employs the Wasserstein GAN (WGAN). Training alternates between the discriminator and the generator: the discriminator’s parameters are updated five times before each update of the generator’s parameters to improve model convergence. The generator and the discriminator parameters are optimized using the Adam optimizer ( 57 ). This approach successfully implements data imputation for missing traffic data.
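The arithmetic of Equations 13 and 14 and the alternating update order can be illustrated with a short sketch. The sign convention below is an assumption: the text does not define $D_{loss_real}$ and $D_{loss_fake}$ explicitly, so a common WGAN-style convention is used (the discriminator is rewarded for high scores on real data and penalized for high scores on generated data). All function names are hypothetical.

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    """D_loss = D_loss_real + D_loss_fake (Equation 14), assumed WGAN-style signs."""
    d_loss_real = -np.mean(d_real)   # assumption: high scores on real data lower the loss
    d_loss_fake = np.mean(d_fake)    # assumption: high scores on generated data raise the loss
    return d_loss_real + d_loss_fake

def generator_loss(d_fake, generated, target, alpha=100.0):
    """G_loss = -D_loss_fake + alpha * MSE_loss (Equation 13)."""
    d_loss_fake = np.mean(d_fake)
    mse_loss = np.mean((generated - target) ** 2)
    return -d_loss_fake + alpha * mse_loss

def update_schedule(n_generator_steps, n_critic=5):
    """Order of updates: n_critic discriminator steps before each generator step."""
    return (["D"] * n_critic + ["G"]) * n_generator_steps

# Toy numbers for illustration only.
d_real = np.array([0.9, 0.7])
d_fake = np.array([0.2, 0.4])
generated = np.array([1.0, 2.0])
target = np.array([1.0, 4.0])
d_loss = discriminator_loss(d_real, d_fake)
g_loss = generator_loss(d_fake, generated, target)
```

With a large reconstruction weight α, the mean squared error term dominates the generator loss unless the generated values are already close to the real ones, which is consistent with the emphasis on reconstruction accuracy described above.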

Experiment

Datasets and Data Configuration

This paper verifies the performance of tensor completion and graph network fusion (TCGNF) on two real traffic datasets. Figure 5 shows the distribution of detectors across the two datasets.

Figure 5.

The distribution of detectors across different datasets: (a) Seattle Loop dataset and (b) PEMS-BAY dataset (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Seattle Loop Dataset: This dataset contains 2015 traffic speed data from 323 detectors along Seattle’s I-5, I-405, I-90, and SR-520 highways, recorded at 5-min intervals. Each detector provides time-stamped speed readings, structured as a time series with rows for time intervals and columns for individual detector readings (58, 59).

PEMS-BAY Dataset: Sourced from the performance measurement system (PEMS) by Caltrans, this dataset includes traffic speed data from 325 detectors across the San Francisco Bay Area, covering January 1 to June 30, 2017, at 5-min intervals. Similar to the Seattle Loop dataset, it provides time-series data with timestamps and detector-specific readings ( 60 ).

This paper considers two missing data patterns: MCAR, where the probability of a value being missing is completely independent of any other variable values, and spatially missing completely at random (MCARS), where missingness is tied to spatial location, for example when all data from a detector on a specific road segment are lost owing to a memory fault in the detector. Experiments are conducted at missing rates ranging from 10% to 70% under both patterns. Taking the PEMS-BAY data from January 1, 2017, as an example, Figure 6 shows the heatmap of the complete data, where the x-axis represents the time of day and the y-axis represents the different detectors.

Figure 6.

Heatmap of the PEMS-BAY observed data on January 1, 2017. (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Figure 7 displays the heatmaps for both the MCAR and MCARS missing patterns at missing rates of 20%, 40%, and 60%. As the missing rate increases, the number of zero-valued (missing) entries grows, which appears in the heatmaps as an increasing number of purple points.

Figure 7.

Heatmap of PEMS-BAY data on January 1, 2017: (a) missing completely at random (MCAR) missing patterns at missing rates of 20%, 40%, and 60% and (b) spatially missing completely at random (MCARS) missing patterns at missing rates of 20%, 40%, and 60%. (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)
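The two missing patterns can be simulated with boolean masks, as in the sketch below. The MCARS generator is an assumption about how the pattern is produced: it drops every reading of a randomly chosen subset of detectors, which matches the detector-fault example given above but may differ from the exact procedure used in the experiments. The function names and the synthetic speed matrix are hypothetical.

```python
import numpy as np

def mcar_mask(shape, missing_rate, rng):
    """True = observed. Each entry is dropped independently of all others."""
    return rng.random(shape) >= missing_rate

def mcars_mask(shape, missing_rate, rng):
    """True = observed. Whole detectors (rows) are dropped (assumed procedure)."""
    n_detectors, _ = shape
    n_missing = int(round(missing_rate * n_detectors))
    dropped = rng.choice(n_detectors, size=n_missing, replace=False)
    mask = np.ones(shape, dtype=bool)
    mask[dropped, :] = False
    return mask

rng = np.random.default_rng(42)
speeds = rng.uniform(20.0, 70.0, size=(100, 288))   # synthetic: detectors x 5-min slots
mcar = mcar_mask(speeds.shape, 0.4, rng)
mcars = mcars_mask(speeds.shape, 0.4, rng)
speeds_mcar = np.where(mcar, speeds, 0.0)    # missing readings stored as zeros
speeds_mcars = np.where(mcars, speeds, 0.0)
```

Under MCAR the zeros are scattered across the whole matrix, whereas under this MCARS sketch they form complete horizontal bands, one per failed detector.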

Evaluation Metrics and Baseline Models

This paper employs three metrics for performance evaluation, including mean absolute error (MAE), root mean square error (RMSE), and mean absolute percentage error (MAPE). The calculation formulas are as follows:

MAE = \frac{1}{n} \sum_{i = 1}^{n} | {\hat{y}}_{i} - y_{i} |

(15)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {({\hat{y}}_{i} - y_{i})}^{2}}

(16)

MAPE = \frac{1}{n} \sum_{i = 1}^{n} | \frac{{\hat{y}}_{i} - y_{i}}{y_{i}} |

(17)

where ${\hat{y}}_{i}$ is the imputed data and $y_{i}$ is the original data.
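As a worked example of Equations 15 to 17, the three metrics can be computed as follows. The function name is hypothetical, and the sketch assumes that no ground-truth value is zero, since MAPE is undefined in that case. The function returns MAPE as a fraction, exactly as Equation 17 is written; the tables appear to report it as a percentage, which would correspond to multiplying by 100.

```python
import numpy as np

def imputation_errors(y_true, y_imputed):
    """Return (MAE, RMSE, MAPE) over the supplied data points."""
    y_true = np.asarray(y_true, dtype=float)
    y_imputed = np.asarray(y_imputed, dtype=float)
    diff = y_imputed - y_true
    mae = np.mean(np.abs(diff))
    rmse = np.sqrt(np.mean(diff ** 2))
    mape = np.mean(np.abs(diff / y_true))   # assumes y_true contains no zeros
    return mae, rmse, mape

# Two missing speed readings (true values 10 and 20) imputed as 12 and 18.
mae, rmse, mape = imputation_errors([10.0, 20.0], [12.0, 18.0])
```

Here both absolute errors equal 2, so MAE and RMSE are both 2, while the relative errors of 0.2 and 0.1 average to a MAPE of 0.15.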

TCGNF is compared with the following representative baseline models:

(1) GAIN: A model that uses the generative adversarial network framework to generate and impute missing data, suitable for multivariate datasets ( 48 ).

(2) BGCP: A spatiotemporal traffic data imputation method based on Bayesian tensor decomposition, which enhances data integrity through Bayesian statistics and tensor decomposition techniques ( 61 ).

(3) LRTC-TNN: A non-convex low-rank tensor completion model utilizing truncated nuclear norm minimization, focused on addressing complex missing problems in spatiotemporal traffic data ( 36 ).

(4) LRTC-TSρN: Based on truncated tensor Schatten-norm, aimed at filling in complex missing patterns in traffic data ( 37 ).

(5) GA-GAN: A traffic state data imputation method based on graph aggregation generative adversarial networks, combining generative adversarial networks and graph aggregation technology ( 53 ).

(6) LCR: Combines circulant matrix nuclear norm with Laplacian kernel-based temporal regularization to efficiently impute traffic time series ( 62 ).

In TCGNF, the feature dimension $F = 288$ in the LRTC module covers an entire day of data (288 five-minute intervals), whereas GraphSAGE’s feature dimension $F = 12$ covers 1 h. The method uses 7 days of historical data to calculate the Pearson correlation coefficient for constructing the temporal-correlated network. For both the temporal-correlated and spatial-correlated networks, the model’s performance at different values of n is compared in terms of MAE, RMSE, and MAPE to choose the optimal n. The generator and the discriminator in the GAN each have three hidden layers, with 64, 128, and 256 units in the generator and 256, 128, and 64 units in the discriminator. The reconstruction coefficient α in the GAN loss is set to 100. Before each update of the generator parameters, the discriminator’s parameters are updated 5 times. To ensure a fair comparison, all methods are evaluated on the same set of missing data points.
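One plausible way to build the temporal-correlated network from historical data is sketched below. It assumes that n denotes the number of most-correlated detectors linked to each node, which is an interpretation rather than a detail confirmed in this section, and the function name is hypothetical. With 7 days of 5-min data, each row of `history` would hold 2,016 readings; the toy example uses 10.

```python
import numpy as np

def pearson_topn_adjacency(history, n):
    """history: (detectors, time steps). Returns a 0/1 matrix in which row i
    marks the n detectors most correlated with detector i (assumed construction)."""
    corr = np.corrcoef(history)
    np.fill_diagonal(corr, -np.inf)              # never pick a detector as its own neighbor
    neighbors = np.argsort(-corr, axis=1)[:, :n]
    adj = np.zeros(corr.shape, dtype=int)
    rows = np.arange(corr.shape[0])[:, None]
    adj[rows, neighbors] = 1
    return adj

# Toy history: detectors 0 and 1 move together, detector 2 moves in the opposite direction.
t = np.arange(10.0)
history = np.stack([t, 2 * t + 1, -t, t ** 2])
adj = pearson_topn_adjacency(history, n=1)
```

The resulting matrix is not necessarily symmetric, because being among the top n neighbors of one detector does not imply the reverse; whether to symmetrize it is a design choice left open here.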

Experimental Results

The Experimental Results of the Seattle Loop Dataset

The selection of n values for the temporal-correlated and spatial-correlated networks was evaluated on the Seattle Loop dataset. The best overall performance was achieved with $n = 3$ for the temporal-correlated network and $n = 7$ for the spatial-correlated network. Tables 1 and 2 present the errors for different n values in the temporal-correlated and spatial-correlated networks under the MCAR pattern with a 40% missing rate. As n increases, the aggregation of more nodes leads to a significant increase in resource consumption, such as memory usage and computation time. Beyond these values, the errors do not improve, while the computational overhead increases substantially. Therefore, selecting $n = 3$ for the temporal-correlated network and $n = 7$ for the spatial-correlated network offers the best trade-off between performance and computational efficiency.

Table 1.

Results of Different n in Temporal-Correlated Networks at a 40% Missing Rate Under the Missing Completely at Random (MCAR) Pattern

n	2	3	4	5	6	7	8
MAE	2.19	2.15	2.24	2.18	2.23	2.22	2.29
RMSE	3.27	3.11	3.19	3.16	3.28	3.28	3.27
MAPE	5.21	5.13	5.22	5.25	5.21	5.23	5.21

Note: MAE = mean absolute error; MAPE = mean absolute percentage error; RMSE = root mean square error.

Bold indicates best performance.

Table 2.

Results of Different n in Spatial-Correlated Networks at a 40% Missing Rate Under the Missing Completely at Random (MCAR) Pattern

n	4	5	6	7	8	9	10
MAE	2.20	2.25	2.20	2.16	2.17	2.27	2.22
RMSE	3.26	3.30	3.27	3.25	3.25	3.31	3.28
MAPE	5.33	5.37	5.56	5.35	5.45	5.39	5.40

Note: MAE = mean absolute error; MAPE = mean absolute percentage error; RMSE = root mean square error.

Bold indicates best performance.

Figure 8 displays the distribution of residuals between the real data and the imputed data under different missing rates in the Seattle Loop dataset. The left and right images in each subplot represent the residual distributions for the MCAR and MCARS missing patterns, respectively. At a 20% missing rate, the residuals in the MCAR pattern are concentrated around 0, and this concentration remains pronounced even when the missing rate is increased to 60%. In the MCARS pattern, most residuals are also near 0, indicating that the model handles both missing patterns effectively and maintains its accuracy and robustness even at high missing rates.

Figure 8.

The residual distribution in the Seattle Loop dataset: (a) the residual distribution in the Seattle Loop dataset in the missing rate of 20%, (b) the residual distribution in the Seattle Loop dataset in the missing rate of 40%, and (c) the residual distribution in the Seattle Loop dataset in the missing rate of 60%.

Comparative experiments were conducted on the PEMS-BAY and Seattle Loop datasets against multiple baselines, missing rates, and missing patterns. Table 3 presents the error metrics for the Seattle Loop data under the MCAR pattern, while Table 4 reports the errors under the MCARS pattern. Both tables show that our model achieves the lowest errors in most settings under both patterns; the main exceptions are at low missing rates, where LCR is marginally better on some metrics (for example, on all three metrics at the 10% missing rate under MCAR). Our model also outperforms the LRTC-TSρN model, a strong low-rank baseline, in both the MCAR and MCARS patterns. In particular, under the MCAR pattern, our model keeps the MAPE below 6.5% at every missing rate, and under the MCARS pattern, the MAPE remains below 10%. While the GA-GAN model performs well under the MCAR pattern, its performance deteriorates under the MCARS pattern, where missingness is spatially structured. Our analysis indicates that the GA-GAN model primarily relies on temporal correlation analysis within the network, but fails to sufficiently consider the low-rank structure of the data. This likely explains its weaker performance in addressing missing data at the spatial level. Additionally, the LCR model is the second-best performer overall, after our model. However, owing to its reliance on a single low-rank completion algorithm, it struggles at high missing rates, a limitation shared by the LRTC-based models. The LRTC-TNN and LRTC-TSρN models exhibit similar issues, highlighting the significant performance degradation of low-rank tensor completion algorithms when faced with more complex missing data scenarios.

Table 3.

Error of Methods on Seattle Loop Data in Missing Completely at Random (MCAR) Pattern

Model	10%			20%			30%			40%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	3.92	6.07	10.94	3.88	6.21	11.46	3.95	6.29	11.46	4.08	6.43	12.06
BGCP	3.44	5.43	9.88	3.43	5.43	9.89	3.43	5.43	9.91	3.42	5.42	9.86
LRTC-TNN	2.26	3.41	5.73	2.33	3.52	5.96	2.41	3.66	6.23	2.51	3.81	6.54
LRTC-TSρN	2.26	3.46	5.90	2.34	3.59	6.15	2.42	3.73	6.43	2.50	3.88	6.73
GA-GAN	2.28	3.39	5.39	2.64	3.62	5.82	2.31	3.46	5.47	2.70	3.79	6.06
LCR	2.11	3.09	5.07	2.15	3.28	5.33	2.34	3.55	6.11	2.31	3.44	6.24
Ours	2.15	3.11	5.13	2.10	3.10	5.17	2.16	3.18	5.19	2.18	3.25	5.35
	50%			60%			70%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	4.17	6.56	12.18	4.18	6.62	12.52	4.62	6.98	13.35
BGCP	3.42	5.43	9.88	3.43	5.43	9.87	3.43	5.44	9.93
LRTC-TNN	2.62	3.99	6.90	2.76	4.22	7.36	2.92	4.48	7.89
LRTC-TSρN	2.61	4.06	7.10	2.73	4.27	7.54	2.88	4.53	8.05
GA-GAN	2.72	3.91	6.23	3.03	4.26	6.89	2.97	4.46	7.14
LCR	2.33	3.79	6.60	2.54	4.08	7.17	2.88	4.28	7.77
Ours	2.25	3.38	5.58	2.34	3.54	5.80	2.50	3.80	6.48

Note: BGCP = A spatiotemporal traffic data imputation method based on Bayesian tensor decomposition; GAIN = generative adversarial imputation network; GA = Graph Aggregator; GAN = generative adversarial networks; LCR = Laplacian Convolutional Representation for Traffic Time Series Imputation; LRTC = low-rank tensor completion; MAE = mean absolute error; MAPE = mean absolute percentage error; RMSE = root mean square error; TNN = truncated nuclear norm.

Bold indicates best performance.

Table 4.

Error of Methods on Seattle Loop Data in Spatially Missing Completely at Random (MCARS) Pattern

Model	10%			20%			30%			40%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	3.85	6.01	10.97	3.84	6.01	10.84	3.89	6.15	11.32	3.93	6.20	11.40
BGCP	3.56	5.66	10.35	3.57	5.70	10.35	3.60	5.73	10.44	3.64	5.87	10.59
LRTC-TNN	2.68	4.22	7.22	2.78	4.38	7.56	2.89	4.55	7.86	3.02	4.78	8.34
LRTC-TSρN	2.66	4.27	7.41	2.75	4.44	7.69	2.86	4.60	8.08	2.97	4.80	8.60
GA-GAN	4.28	7.38	11.24	4.38	7.34	11.69	4.44	7.37	12.47	4.72	7.67	13.16
LCR	2.55	4.03	7.21	2.71	4.18	7.36	2.77	4.48	7.66	2.89	4.77	8.14
Ours	2.58	4.11	7.15	2.68	4.28	7.26	2.75	4.43	7.50	2.87	4.65	8.03
	50%			60%			70%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	4.07	6.38	11.73	4.35	6.66	12.26	7.54	10.64	17.68
BGCP	3.67	5.89	10.75	3.77	6.13	10.95	4.00	6.18	11.83
LRTC-TNN	3.17	5.01	8.83	3.35	5.29	9.39	3.57	5.65	10.17
LRTC-TSρN	3.15	5.11	9.62	3.50	5.92	11.96	4.00	7.02	14.91
GA-GAN	5.10	8.04	14.42	5.42	8.62	16.37	6.32	9.54	19.16
LCR	3.07	4.99	8.66	3.35	5.33	9.24	3.61	5.77	10.03
Ours	3.02	4.92	8.76	3.31	5.19	9.11	3.55	5.61	9.98

Bold indicates best performance.

The Experimental Results of the PEMS-BAY Dataset

Figure 9 presents the residual distribution in the PEMS-BAY dataset; the residuals cluster around 0, indicating that the model performs well on this dataset. Table 5 shows the errors for different missing rates under the MCAR pattern on the PEMS-BAY dataset, and Table 6 shows the errors under the MCARS pattern. Under the MCAR pattern, the TCGNF model achieves the lowest errors at every missing rate, with a MAPE below 2.4%. Under the MCARS pattern, its MAPE does not exceed 5%, and it achieves the lowest MAE at missing rates up to 40%, whereas at 50% to 70% LRTC-TSρN attains lower errors. It is worth noting that LRTC-TSρN performs better on the PEMS-BAY dataset than on the Seattle Loop dataset because PEMS-BAY is a larger dataset with clear spatial–temporal dependencies, which aligns well with the model’s low-rank assumption. In contrast, the Seattle Loop dataset, which consists of loop detector data from four interconnected highways, has a more complex spatial layout and traffic flow patterns, making it harder for the model to fully capture these intricate spatial–temporal relationships and leading to lower performance. Overall, across most tested scenarios on both the PEMS-BAY and Seattle Loop datasets, TCGNF outperforms the comparison models.

Figure 9.

The residual distribution in the PEMS-BAY dataset: (a) the residual distribution in the PEMS-BAY dataset in the missing rate of 20%, (b) the residual distribution in the PEMS-BAY dataset in the missing rate of 40%, and (c) the residual distribution in the PEMS-BAY dataset in the missing rate of 60%. (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Table 5.

Error of Methods on PEMS-BAY Data in Missing Completely at Random (MCAR) Pattern (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Model	10%			20%			30%			40%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	2.50	4.42	5.38	2.55	4.53	5.49	2.64	4.62	5.67	2.58	4.65	5.69
BGCP	1.95	3.56	4.23	1.93	3.54	4.19	1.93	3.55	4.20	1.93	3.55	4.19
LRTC-TNN	1.03	1.51	1.88	0.96	1.66	2.08	1.22	1.85	2.22	1.23	2.05	2.38
LRTC-TSρN	0.83	1.40	1.64	0.88	1.52	1.77	0.95	1.66	1.93	1.04	1.85	2.14
GA-GAN	1.16	1.81	2.18	1.44	2.14	2.69	1.36	2.14	2.60	1.59	2.47	3.03
LCR	0.93	1.73	1.85	0.95	1.79	1.90	0.98	1.84	1.96	1.01	1.92	2.04
Ours	0.80	1.34	1.56	0.85	1.40	1.66	0.88	1.47	1.72	0.93	1.55	1.80
	50%			60%			70%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	2.66	4.73	5.83	2.78	4.80	6.04	3.00	4.99	6.40
BGCP	1.93	3.55	4.19	1.93	3.55	4.18	1.94	3.57	4.22
LRTC-TNN	1.26	2.27	2.63	1.46	2.51	2.91	1.65	2.77	3.22
LRTC-TSρN	1.14	2.06	2.38	1.26	2.31	2.66	1.40	2.61	3.01
GA-GAN	1.55	2.56	3.11	1.75	2.83	3.48	2.88	3.44	4.30
LCR	1.06	2.02	2.14	1.12	2.15	2.28	1.21	2.35	2.50
Ours	0.97	1.68	1.94	1.04	1.82	2.09	1.15	2.06	2.36

Note: BGCP = A spatiotemporal traffic data imputation method based on Bayesian tensor decomposition; GAIN = generative adversarial imputation network; GA = Graph Aggregator; GAN = generative adversarial networks; LCR = Laplacian Convolutional Representation for Traffic Time Series Imputation; LRTC = low-rank tensor completion; MAE = mean absolute error; MAPE = mean absolute percentage error; RMSE = root mean square error; TNN = truncated nuclear norm.

Bold indicates best performance.

Table 6.

Error of Methods on PEMS-BAY Data in Spatially Missing Completely at Random (MCARS) Pattern (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Model	10%			20%			30%			40%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	2.48	4.36	5.27	2.45	4.30	5.25	2.52	4.43	5.64	2.65	4.56	5.64
BGCP	2.24	4.27	4.95	2.29	4.35	4.93	2.35	4.83	5.10	2.48	6.59	5.35
LRTC-TNN	1.64	3.10	3.53	1.71	3.17	3.60	1.79	3.31	3.73	1.91	3.51	4.03
LRTC-TSρN	1.58	2.93	3.36	1.65	3.13	3.16	1.71	3.21	3.64	1.89	3.33	3.81
GA-GAN	4.12	6.41	8.13	3.88	6.15	8.07	4.17	6.40	8.64	4.50	6.66	9.45
LCR	1.59	2.99	3.31	1.65	3.08	3.42	1.74	3.20	3.56	1.88	3.44	3.79
Ours	1.55	2.83	3.26	1.64	3.13	3.08	1.66	3.21	3.57	1.79	3.22	3.71
	50%			60%			70%
Model	MAE	RMSE	MAPE	MAE	RMSE	MAPE	MAE	RMSE	MAPE
GAIN	2.89	4.85	6.01	4.07	6.37	7.84	32.62	37.94	51.19
BGCP	2.57	6.37	5.60	2.74	6.19	5.97	3.28	7.39	7.05
LRTC-TNN	2.03	3.73	4.37	2.21	3.99	4.66	2.45	4.36	5.21
LRTC-TSρN	1.87	3.51	4.04	2.00	3.72	4.35	2.17	4.01	4.75
GA-GAN	4.29	6.74	9.25	4.83	7.39	10.30	5.15	7.85	11.27
LCR	2.01	3.74	4.27	2.11	3.99	4.58	2.36	4.31	5.18
Ours	2.00	3.75	4.29	2.13	4.01	4.63	2.27	4.30	5.00

Note: BGCP = A spatiotemporal traffic data imputation method based on Bayesian tensor decomposition; GAIN = generative adversarial imputation network; GA = Graph Aggregator; GAN = generative adversarial networks; LCR = Laplacian Convolutional Representation for Traffic Time Series Imputation; LRTC = low-rank tensor completion; MAE = mean absolute error; MAPE = mean absolute percentage error; RMSE = root mean square error; TNN = truncated nuclear norm.

Bold indicates best performance.

Ablation Experiment

This section presents ablation studies that show the contributions of the different components of the proposed model. Given that MCAR is the most commonly studied missing pattern in the literature, the experiments are conducted under this pattern.

The Effectiveness of the LRTC Module

To verify the effectiveness of the LRTC module, two comparative experiments were set up: the first removes the LRTC module, and the second replaces the LRTC module with a basic moving smoothing completion module. The results are illustrated in Figure 10. Figure 10a and b show the comparisons for the Seattle Loop and PEMS-BAY datasets, respectively, under the MCAR missing pattern. The x-axis of each graph represents the missing rate, and the y-axes of the three subgraphs represent MAE, RMSE, and MAPE, respectively. “LRTC” indicates data preliminarily completed by the LRTC module, “SMOOTH” indicates data preliminarily completed by the moving smoothing completion module, and “NAN” indicates data without preliminary completion.

Figure 10.

The effectiveness of the low-rank tensor completion (LRTC) module: (a) the effectiveness of the LRTC module in the Seattle Loop and (b) the effectiveness of the LRTC module in PEMS-BAY. (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

The moving window step size for the moving smoothing completion was set to 3. In Figure 10, it is evident that the models which underwent preliminary completion using the LRTC module exhibit the lowest error values and the best imputation effects. The models that were preliminarily completed using the moving smoothing completion module ranked second, and the models without any preliminary completion performed the worst as regards imputation effectiveness. This demonstrates that LRTC preliminary completion is crucial in enhancing the model’s performance.
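For reference, a moving smoothing completion with a window of 3 could look like the sketch below. The details are assumptions, since the baseline is not specified further in the text: each missing value is replaced by the mean of the observed values inside a centered window, and a gap is left unfilled when the window contains no observation. The function name is hypothetical.

```python
import numpy as np

def moving_smooth_fill(series, window=3):
    """Fill NaNs with the mean of the observed values in a centered window."""
    series = np.asarray(series, dtype=float)
    filled = series.copy()
    half = window // 2
    for i in np.flatnonzero(np.isnan(series)):
        neighborhood = series[max(0, i - half): i + half + 1]
        observed = neighborhood[~np.isnan(neighborhood)]
        if observed.size:                 # leave the gap if nothing nearby was observed
            filled[i] = observed.mean()
    return filled

# One detector's readings with isolated and consecutive gaps.
readings = [1.0, np.nan, 3.0, np.nan, np.nan, np.nan, 7.0]
filled = moving_smooth_fill(readings)
```

A local filter of this kind can only bridge short gaps, which is one possible reason why a global low-rank completion gives a better starting point at high missing rates.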

The Effectiveness of Road Network Fusion

To validate the effectiveness of network fusion, two comparative experiments were conducted: one using only the temporal-correlation matrix for feature extraction and the other using only the spatial-correlation matrix. The results are illustrated in Figure 11. Figure 11a and b show the results of the experiments on the Seattle Loop and PEMS-BAY datasets, respectively. In traffic data imputation, spatial correlation has a greater impact than temporal correlation, and the best imputation results are achieved when features from both are fused.

Figure 11.

The effectiveness of road network fusion: (a) the effectiveness of road network fusion in the Seattle Loop and (b) the effectiveness of road network fusion in PEMS-BAY. (PEMS-BAY—performance measurement system including data from the San Francisco Bay area.)

Conclusion

This paper presents a novel approach for traffic data imputation, integrating tensor completion techniques with graph network fusion. We proposed a traffic state data imputation algorithm that leverages low-rank tensor completion, GraphSAGE feature fusion, and GAN. The algorithm begins by addressing missing data through low-rank tensor completion, followed by the construction of temporal and spatial correlation networks. Temporal correlations are calculated using the Pearson correlation coefficient, while spatial correlations are derived from the physical distances between detectors. GraphSAGE is then employed to extract features from these networks, optimizing the imputation process by effectively capturing the spatiotemporal dynamics of traffic data. Finally, GANs are used for model training, ensuring the generated imputed data reflect realistic traffic conditions. Experimental results on the Seattle Loop and PEMS-BAY datasets demonstrate the effectiveness of the proposed method, showing superior performance across various missing data patterns. The model successfully captures both the low-rank structure and spatiotemporal dependencies inherent in traffic data, significantly improving imputation accuracy. Despite these promising results, several avenues for future research remain. First, optimizing the method for computing temporal correlations between road networks, particularly by exploring alternatives to the Pearson correlation coefficient, could further enhance imputation accuracy. Second, while the approach has demonstrated effectiveness in traffic data imputation, its framework holds significant potential for generalization to other time series datasets, offering a versatile solution to missing data challenges across various domains.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Chengliang Xia, Xiang Yin; data collection: Junyang Yu, Xiaoli Liang, Lei Chen; analysis and interpretation of results: Chengliang Xia, Xiang Yin; draft manuscript preparation: Chengliang Xia, Xiang Yin, Junyang Yu. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is partly supported by the National Natural Science Foundation of China (No. 42201465) and the Key Research and Development and Promotion Special Project in Henan Province (No. 232102210120, No. 232102211025).

References

Determining Traffic-Flow Characteristics by Definition for Application in ITS. IEEE Transactions on Intelligent Transportation Systems, Vol. 8, No. 2, 2007, pp. 181–187.

Nellore

Hancke

G. P.

A Survey on Urban Traffic Management System Using Wireless Sensor Networks. Sensors, Vol. 16, No. 2, 2016, p. 157.

Jayakrishnan

Mahmassani

H. S.

T.-Y.

An Evaluation Tool for Advanced Traffic Information and Management Systems in Urban Networks. Transportation Research Part C: Emerging Technologies, Vol. 2, No. 3, 1994, pp. 129–147.

Mallick

Balaprakash

Rask

Macfarlane

Graph-Partitioning-Based Diffusion Convolutional Recurrent Neural Network for Large-Scale Traffic Forecasting. Transportation Research Record: Journal of the Transportation Research Board, 2020. 2674(9): 473–488.

Feng

Ling

Zheng

Chen

Adaptive Multi-Kernel SVM With Spatial–Temporal Correlation for Short-Term Traffic Flow Prediction. IEEE Transactions on Intelligent Transportation Systems, Vol. 20, No. 6, 2019, pp. 2001–2013.

Chen

Traffic Prediction Using Neural Networks. Proc., GLOBECOM ’93. IEEE Global Telecommunications Conference, Houston, TX, Vol. 2, 1993, pp. 991–995.

Duan

Liu

Y.-L.

Wang

F.-Y.

An Efficient Realization of Deep Learning for Traffic Data Imputation. Transportation Research Part C: Emerging Technologies, Vol. 72, 2016, pp. 168–181.

Wang

F.-Y.

Parallel Control and Management for Intelligent Transportation Systems: Concepts, Architectures, and Applications. IEEE Transactions on Intelligent Transportation Systems, Vol. 11, No. 3, 2010, pp. 630–638.

Zhang

Lin

Wang

A Customized Deep Learning Approach to Integrate Network-Scale Online Traffic Data Imputation and Prediction. Transportation Research Part C: Emerging Technologies, Vol. 132, 2021, p. 103372.

10.

Shamo

Asa

Membah

Linear Spatial Interpolation and Analysis of Annual Average Daily Traffic Data. Journal of Computing in Civil Engineering, Vol. 29, No. 1, 2015, p. 04014022.

11.

Xue

Feng

Ukkusuri

S. V.

Network Macroscopic Fundamental Diagram-Informed Graph Learning for Traffic State Imputation. Transportation Research Part B: Methodological, Vol. 189, 2024, p. 102996.

12.

Zhang

Traffic Prediction, Data Compression, Abnormal Data Detection and Missing Data Imputation: An Integrated Study Based on the Decomposition of Traffic Time Series. Proc., 17th International IEEE Conference on Intelligent Transportation Systems (ITSC), Qingdao, China, 2014, pp. 282–289.

13.

Peng

Tang

Guo

Hierarchical Spatio-Temporal Graph Convolutional Neural Networks for Traffic Data Imputation. Information Fusion, Vol. 106, 2024, p. 102292.

14.

Shu

Wang

Peng

Meng

Low-Rank Tensor Completion With 3-D Spatiotemporal Transform for Traffic Data Imputation. IEEE Transactions on Intelligent Transportation Systems, Vol. 25, 2024, pp. 18673–18687.

15.

Zong

Yan

An Intelligent Deep Learning Framework for Traffic Flow Imputation and Short-Term Prediction Based on Dynamic Features. Knowledge-Based Systems, Vol. 300, 2024, p. 112178.

16.

Yang

A Brief Review on Missing Traffic Data Imputation Methods for Intelligent Transportation Systems. Proc., 2024 7th International Symposium on Autonomous Systems (ISAS), Chongqing, China, 2024, pp. 1–6.

17.

Wang

Peeta

Leveraging Transformer Model to Predict Vehicle Trajectories in Congested Urban Traffic. Transportation Research Record: Journal of the Transportation Research Board, 2023. 2677(2): 898–909.

18.

Multi-Stage Deep Residual Collaboration Learning Framework for Complex Spatial–Temporal Traffic Data Imputation. Applied Soft Computing, Vol. 147, 2023, p. 110814.

19.

Shen

Zhou

Zhang

Liu

Kong

Bidirectional Spatial–Temporal Traffic Data Imputation Via Graph Attention Recurrent Neural Network. Neurocomputing, Vol. 531, 2023, pp. 151–162.

20. Fang, Chen. MDTGAN: Multi Domain Generative Adversarial Transfer Learning Network for Traffic Data Imputation. Expert Systems with Applications, Vol. 255, 2024, p. 124478.

21. Zhang, Miao, Chen. Spatial-Temporal Traffic Data Imputation Based on Dynamic Multi-Level Generative Adversarial Networks for Urban Governance. Applied Soft Computing, Vol. 151, 2024, p. 111128.

22. Yang, Luo. Di-GraphGAN: An Enhanced Adversarial Learning Framework for Accurate Spatial-Temporal Traffic Forecasting Under Data Missing Scenarios. Information Sciences, Vol. 677, 2024, p. 120911.

23. Qin, Xie. A Traffic Flow Data Quality Repair Model Based on Spatiotemporal Correlation. IEEE Access, Vol. 12, 2024, pp. 116816–116828.

24. Wang, Fulda, Huang, Schultz, G. G., Macfarlane, G. S., Arnesen, Khayyat. Predicting Directional Traffic Volume at Intersections with Automated Traffic Signal Performance Measures Data Using Machine Learning Algorithms. Transportation Research Record: Journal of the Transportation Research Board, Vol. 2678, No. 12, 2024, pp. 1736–1750.

25. Chen, Lin, Liu, Yang, Zhang. NT-DPTC: A Non-Negative Temporal Dimension Preserved Tensor Completion Model for Missing Traffic Data Imputation. Information Sciences, Vol. 653, 2024, p. 119797.

26. Lin, Luo. HRST-LR: A Hessian Regularization Spatio-Temporal Low Rank Algorithm for Traffic Data Imputation. IEEE Transactions on Intelligent Transportation Systems, Vol. 24, No. 10, 2023, pp. 11001–11017.

27. Liu, Ong, G. P., Chen. GraphSAGE-Based Traffic Speed Forecasting for Segment Network With Sparse Data. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 3, 2022, pp. 1755–1766.

28. B.-Z., Zhao, X.-L., Chen, Ding, Wen Liu. Convolutional Low-Rank Tensor Representation for Structural Missing Traffic Data Imputation. IEEE Transactions on Intelligent Transportation Systems, Vol. 25, No. 11, 2024, pp. 18847–18860.

29. Cheng, Osman, Ballan. FastSTI: A Fast Conditional Pseudo Numerical Diffusion Model for Spatio-Temporal Traffic Data Imputation. IEEE Transactions on Intelligent Transportation Systems, Vol. 25, No. 12, 2024, pp. 20547–20560.

30. Chen, Wang, Zhou. Dynamic Origin-Destination Flow Imputation Using Feature-Based Transfer Learning. IEEE Transactions on Intelligent Transportation Systems, Vol. 25, No. 11, 2024, pp. 17147–17159. https://doi.org/10.1109/TITS.2024.3421233

31. Long, Liu, Chen, Zhu. Low Rank Tensor Completion for Multiway Visual Data. Signal Processing, Vol. 155, 2019, pp. 301–316.

32. Peng, Wei. Low-Rank Tensor Completion With a New Tensor Nuclear Norm Induced by Invertible Linear Transforms. Proc., 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, 2019, pp. 5989–5997.

33. Ran, Tan, Jin, P. J. Tensor Based Missing Traffic Data Completion with Spatial–Temporal Correlation. Physica A: Statistical Mechanics and its Applications, Vol. 446, 2016, pp. 54–63.

34. Goulart, J. D. M., Kibangou, Favier. Traffic Data Imputation Via Tensor Completion Based on Soft Thresholding of Tucker Core. Transportation Research Part C: Emerging Technologies, Vol. 85, 2017, pp. 348–362.

35. Liu, Musialski, Wonka. Tensor Completion for Estimating Missing Values in Visual Data. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 35, No. 1, 2013, pp. 208–220.

36. Chen, Yang, Sun. A Nonconvex Low-Rank Tensor Completion Model for Spatiotemporal Traffic Data Imputation. Transportation Research Part C: Emerging Technologies, Vol. 117, 2020, p. 102673.

37. Nie, Qin, Sun. Truncated Tensor Schatten P-Norm Based Approach for Spatiotemporal Traffic Data Imputation with Complicated Missing Patterns. Transportation Research Part C: Emerging Technologies, Vol. 141, 2022, p. 103737.

38. Zhang, Zhou, Xie. Semantic Understanding and Prompt Engineering for Large-Scale Traffic Data Imputation. Information Fusion, Vol. 102, 2024, p. 102038.

39. Wang, Zhuang, Sun. Low-Rank Hankel Tensor Completion for Traffic Speed Estimation. IEEE Transactions on Intelligent Transportation Systems, Vol. 24, No. 5, 2023, pp. 4862–4871.

40. Chen, X. M. A Novel Reinforced Dynamic Graph Convolutional Network Model with Data Imputation for Network-Wide Traffic Flow Prediction. Transportation Research Part C: Emerging Technologies, Vol. 143, 2022, p. 103820.

41. Wang, Zhang, Zheng. A Multi-View Bidirectional Spatiotemporal Graph Network for Urban Traffic Flow Imputation. International Journal of Geographical Information Science, Vol. 36, No. 6, 2022, pp. 1231–1257.

42. Liang, Xie, Zhang, K.-C., Souri. Spatial-Temporal Aware Inductive Graph Neural Network for C-ITS Data Recovery. IEEE Transactions on Intelligent Transportation Systems, Vol. 24, No. 8, 2023, pp. 8431–8442.

43. Zhang, J. J. Spatial-Temporal Traffic Data Imputation Via Graph Attention Convolutional Network. Proc., International Conference on Artificial Neural Networks, Bratislava, Slovakia, Springer, 2021, pp. 241–252.

44. Cini, Marisca, Alippi. Filling the G_ap_s: Multivariate Time Series Imputation by Graph Neural Networks. arXiv Preprint arXiv:2108.00298, 2022.

45. Yuan, Zhang, Wang, Peng, Yin. STGAN: Spatio-Temporal Generative Adversarial Network for Traffic Data Imputation. IEEE Transactions on Big Data, Vol. 9, No. 1, 2023, pp. 200–211.

46. Creswell, White, Dumoulin, Arulkumaran, Sengupta, Bharath, A. A. Generative Adversarial Networks: An Overview. IEEE Signal Processing Magazine, Vol. 35, No. 1, 2018, pp. 53–65.

47. Zhang, Zheng, Zhao. A Generative Adversarial Network for Travel Times Imputation Using Trajectory Data. Computer-Aided Civil and Infrastructure Engineering, Vol. 36, No. 2, 2021, pp. 197–212.

48. Yoon, Jordon, Schaar. GAIN: Missing Data Imputation Using Generative Adversarial Nets. Proc., International Conference on Machine Learning, PMLR, Stockholm, Sweden, 2018, pp. 5689–5698.

49. Chen, Wang, F.-Y. Traffic Flow Imputation Using Parallel Data and Generative Adversarial Networks. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, No. 4, 2020, pp. 1624–1630.

50. Farnia, Tse. A Convex Duality Framework for GANs. In Advances in Neural Information Processing Systems, NeurIPS 2018 Program Committee, Curran Associates Inc., Red Hook, NY, Vol. 31, 2018, pp. 5254–5263.

51. Arjovsky, Chintala, Bottou. Wasserstein Generative Adversarial Networks. Proc., International Conference on Machine Learning, PMLR, Sydney, Australia, 2017, pp. 214–223.

52. Gulrajani, Ahmed, Arjovsky, Dumoulin, Courville, A. C. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems, NeurIPS 2017 Program Committee, Curran Associates Inc., Red Hook, NY, Vol. 30, 2017, pp. 5769–5779.

53. Peng, Wei, Shang. Traffic State Data Imputation: An Efficient Generating Method Based on the Graph Aggregator. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 8, 2022, pp. 13084–13093.

54. Zhou, Wang, Huang, Liu. Comparative Study on the Time Series Forecasting of Web Traffic Based on Statistical Model and Generative Adversarial Model. Knowledge-Based Systems, Vol. 213, 2021, p. 106467.

55. Guo. A Simple Algorithm for Fitting a Gaussian Function [DSP Tips and Tricks]. IEEE Signal Processing Magazine, Vol. 28, No. 5, 2011, pp. 134–137.

56. Wang, Yin, Zeng. Global Convergence of ADMM in Nonconvex Nonsmooth Optimization. Journal of Scientific Computing, Vol. 78, 2019, pp. 29–63.

57. Kingma, D. P. Adam: A Method for Stochastic Optimization. arXiv Preprint arXiv:1412.6980, 2017.

58. Cui, Wang. Deep Bidirectional and Unidirectional LSTM Recurrent Neural Network for Network-Wide Traffic Speed Prediction. arXiv Preprint arXiv:1801.02143, 2019.

59. Cui, Henrickson, Wang. Traffic Graph Convolutional Recurrent Neural Network: A Deep Learning Framework for Network-Scale Traffic Learning and Forecasting. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, 2020, pp. 4883–4894.

60. Dong, Sun, Jin, Song, Zhang, Luo. Uncertainty Graph Convolution Recurrent Neural Network for Air Quality Forecasting. Advanced Engineering Informatics, Vol. 62, 2024, p. 102651.

61. Chen, Sun. A Bayesian Tensor Decomposition Approach for Spatiotemporal Traffic Data Imputation. Transportation Research Part C: Emerging Technologies, Vol. 98, 2019, pp. 73–84.

62. Chen, Cheng, Cai, Saunier, Sun. Laplacian Convolutional Representation for Traffic Time Series Imputation. IEEE Transactions on Knowledge and Data Engineering, Vol. 36, No. 11, 2024, pp. 6490–6502.