Multi-Agent Deep Reinforcement Learning with Graph Attention Network for Traffic Signal Control in Multiple-Intersection Urban Areas

Abstract

Deep reinforcement learning has seen significant progress in traffic signal control. However, existing methods still struggle to capture correlations in road network information and to perceive traffic signal states. To address this gap, we propose a multi-intersection traffic signal control method that integrates a graph attention network, named the graph attention network-deep deterministic policy gradient (GAT-DDPG) algorithm. This algorithm incorporates random walk with restart into the attention mechanism, exploring graph information through global random walks, reducing reliance on local nodes, and enhancing the model’s comprehensive understanding of graph structure features, thereby improving the modeling capability of traffic network structures. Moreover, the algorithm can automatically identify and extract key features from the complex data of the traffic network without manual intervention, adapting to different traffic network topologies, and can update and adjust the traffic signal control system in real time to accommodate actual traffic flow and congestion situations. Experimental results indicate that the GAT-DDPG algorithm reduces average vehicle travel time significantly across three real road networks (Hangzhou and Jinan in China, and New York, U.S.) and two synthetic road network datasets. Additionally, it achieves the best convergence speed and performance among the compared methods on these real datasets, attributed to its capability to capture global information and deeply comprehend the intricate structures of traffic networks. These results indicate that the model has significant advantages in the field of traffic signal control, improving the operational efficiency of urban area intersections. Future work will incorporate additional road environment factors to better adapt to complex urban traffic.

Keywords

graph attention network, deep reinforcement learning, deep deterministic policy gradient, adaptive traffic signal control, multi-intersection

With rapid urbanization, traffic congestion has become increasingly prominent, causing excess fuel consumption and environmental pollution, among other problems, and resulting in significant economic and social losses ( 1 – 3 ). Addressing traffic congestion could involve expanding infrastructure or optimizing traffic signal control methods. Among these measures, effective control of traffic signals appears particularly crucial.

Traditional fixed-time traffic signal control methods, which set fixed durations for green and red lights, are unable to adjust flexibly based on actual traffic conditions, leading to traffic congestion, unpredictable travel times, and energy wastage. To address these issues, the field of traffic management has developed an increasing number of efficient adaptive traffic signal control methods ( 4 ). Adaptive traffic signal control methods can adjust the signal light control strategy in real-time according to the current road traffic conditions to reduce traffic congestion and improve traffic efficiency ( 5 ). Currently, there are various adaptive traffic signal control methods, such as those based on fuzzy logic ( 6 – 9 ). However, these methods tend to have poor robustness and are susceptible to external environmental influences. Additionally, methods based on genetic algorithms require pre-defined traffic scenarios, making it difficult to respond to real-time changes in traffic conditions and achieve real-time traffic light control ( 10 ). Methods based on immune network algorithms can search for better solutions but have slow convergence speeds and require many iterations ( 11 ). Methods based on cellular automata typically rely on static traffic flow models, making it difficult to acquire and apply real-time traffic data ( 12 ). This may lead to suboptimal performance, especially in highly dynamic traffic environments. Therefore, further research is needed to develop traffic light control methods that are more robust, real-time, and efficient.

With the wave of artificial intelligence, the application of deep reinforcement learning (DRL) in the field of traffic signal control is becoming increasingly widespread ( 13 – 17 ).

A signal timing algorithm based on DRL was proposed, which learns Q-functions from traffic states and performance outputs, solving modeling and optimization problems through deep Q-networks (DQN) ( 18 ). This method significantly outperforms traditional approaches in finding better signal timing solutions. The DQN algorithm was applied to an adaptive traffic light timing system, directly controlling phase durations and defining the action space as all possible phase durations, thereby avoiding the issue of uncertain phase durations before the end of a phase ( 19 ). The multi-agent DQN algorithm was proposed and applied to traffic signal control, addressing the curse of dimensionality problem in traffic networks with high traffic volumes and interference ( 14 ). Additionally, a model and data-driven integrated reinforcement learning approach enhances system stability ( 20 ). A general framework for multi-intersection control using reinforcement learning employs a parameter-sharing training scheme, achieving strong generalization capabilities ( 21 ).

Focusing on policy-based methods, the proximal policy optimization (PPO) algorithm was applied to traffic signal control and compared with DQNs and double DQNs (DDQN). The results showed that the PPO policy outperformed DQN and DDQN in traffic signal control ( 22 ). The regional-grid advantage actor-critic algorithm, based on the advantage actor-critic algorithm, was proposed to enhance traffic signal control performance by sharing policies and locally discounted states with neighboring agents ( 23 , 24 ).

Different traffic signal control methods demonstrate different advantages under various simulation scenarios and levels of road network complexity. Existing research has indicated that DQN exhibits good performance in single-intersection road scenarios but may not achieve optimal solutions in large-scale road networks ( 25 ). In comparison, algorithms based on the actor-critic (AC) architecture show more pronounced advantages in complex road network studies ( 26 , 27 ).

Despite the progress made by these DRL-based algorithms in traffic signal control, there is still a need for further improvement in capturing the relevance of road network information and enhancing state perception ( 28 – 30 ). Graph neural networks (GNNs) can better capture the correlation of road network information, but research integrating GNNs with DRL for traffic signal control remains relatively limited, with existing studies still having some shortcomings.

The inductive graph reinforcement learning algorithm, based on graph convolutional networks (GCN), adapts to new road networks and traffic situations by learning effective control strategies while leveraging fine-grained vehicle data. However, it has limitations in optimizing global performance ( 31 ). A model based on graph-structured data state representation uses an end-to-end framework, employing GNNs for traffic network representation learning and Q-learning for policy learning, enhancing policy transferability but only considering isolated intersections ( 32 ). The hierarchical graph representation learning algorithm integrates the current step state of multiple agents and the previous multiple step states of each agent, facilitating the generation of optimal final embeddings. Nonetheless, the representation capability of the traffic network state remains inadequate ( 33 ). A graph-based DRL algorithm uses arbitrary signal phases as actions to achieve flexible traffic flow control and can be directly applied to intersections of any geometric shape. However, it only considers lane nodes without accounting for more detailed graph structures ( 34 ).

The main contributions and limitations of previous studies are summarized in Table 1.

Table 1.

Main Contributions and Limitations of Previous Studies

Year	Authors	Method	Main contributions	Limitations
2016	Li et al. ( 18 )	DQN	Addressing modeling and optimization problems using DQN.	The basic algorithm controls isolated intersections.
2022	Liu et al. ( 19 )	DQN	Defining the action space as traffic signal phase durations.	Only considering isolated intersections.
2020	Rasheed et al. ( 14 )	MADQN	Expanding DQN to MADQN to address multi-agent coordination issues.	Effective generalization becomes challenging when the state space dimension is high.
2022	Zhu et al. ( 22 )	PPO	Proving that the PPO algorithm outperforms the DQN and DDQN algorithms.	Only considering isolated intersections.
2020	Maske et al. ( 24 )	RGA2C	Sharing policies and local discount states with neighboring agents.	Unable to provide information about dynamic changes in traffic flow.
2021	Devailly et al. ( 31 )	GCN Q-learning	Using available data granularity by capturing lane and vehicle-level (dynamic) demands.	Unable to achieve global optimization.
2021	Yoon et al. ( 32 )	GNN Q-learning	Modeling the correlations among features generated by the spatial structure of intersections to enhance policy transferability.	Only considering isolated intersections.
2023	Yang ( 33 )	GNN AC	Integrating the current step states of multiple agents and multiple previous step states of each agent.	Inadequate representation capability of traffic network states.
2024	Wang et al. ( 34 )	GAT DQN	Enabling flexible traffic flow control, directly applicable to intersections of any geometric shape.	Only considering lane nodes, with a simplified graph structure.

Note: AC = Actor-Critic; DQN = Deep Q-Networks; GAT = Graph Attention Network; GCN = Graph Convolutional Networks; GNN = Graph Neural Networks; MADQN = Multi-Agent DQN; PPO = Proximal Policy Optimization; RGA2C = Regional-Grid Advantage Actor-Critic.

Therefore, in this paper, we delve into the control of traffic signals using the DRL approach with an AC architecture. To acquire the status information of relevant vehicles more accurately, we construct a model that represents states as graph-structured data, aiming to effectively model the interrelationships among features induced by the spatial structure of intersections. Simultaneously, we propose a fusion algorithm that incorporates GNNs into the deep deterministic policy gradient (DDPG) algorithm. Through this centralized control algorithm, we further enhance the capability to capture correlations in road network information and improve state awareness, resulting in a significant enhancement in decision-making for traffic signal control.

The experimental results demonstrate that our model exhibits superior performance in complex urban traffic networks with multiple intersections. This not only further validates the effectiveness of the proposed algorithm but also provides strong support for addressing urban traffic signal control problems.

The contributions of this paper are as follows:

1) We propose a graph attention network-deep deterministic policy gradient (GAT-DDPG) algorithm. The traffic signal control problem is modeled as a Markov decision process (MDP) with a graph-based state representation, enabling the learning of graph-structured data. By utilizing latent graph features, the algorithm learns complex decision-making for traffic signal control, thereby enhancing DRL’s perception of the urban road network state.

2) We devise a fusion algorithm that combines the attention mechanism with random walk with restart (RWR). This fusion modifies the weight calculation formula of GAT, incorporating the node importance ranking from the RWR algorithm. As a result, the algorithm directs increased attention toward highly significant nodes. Leveraging the available node importance information enhances GAT’s capability in graph data modeling.

3) We conduct experiments on synthetic road networks as well as large-scale real road networks containing hundreds of traffic signals. The experimental results demonstrate that, compared with the original DDPG and other methods, the proposed GAT-DDPG method shows significant improvements in both convergence and learning effectiveness.

Traffic System Modeling

In this section, the traffic signal control problem is redefined as a novel MDP, and the states, actions, and rewards are defined accordingly. The objective is to make autonomous decisions about which phase the intersections in the road network should be in, aiming to reduce lane pressure and improve traffic flow rate.

Markov Decision Process (MDP)

The MDP model consists of a state set $S$ , an action set $A$ , a state transition probability function $P$ (representing the probability of the agent transitioning from state $s_{t}$ to state $s_{t + 1}$ after taking action, i.e., $P (s_{t + 1} | s_{t}, a_{t})$ ), a reward function $R$ , and a discount factor $γ$ (indicating the agent’s emphasis on future rewards—a larger $γ$ value means the agent places higher importance on future rewards).

We model the traffic signal control problem as an MDP based on a graph-based state representation. GAT is employed to efficiently process the information of nodes and edges into embedded vectors while considering their topological relationships. In the MDP process of DRL, the agent selects the action $A_{t}$ based on the state $S_{t}$ at time $t$ . The external environment then uses the state transition function $P$ to obtain the next state $S_{t + 1}$ and provides the agent with an immediate reward $R_{t}$ , after which it proceeds to the next time step. The agent continuously interacts with the urban traffic network environment to generate sample data for learning, with the objective of finding the decision-making behavior that maximizes the cumulative reward.

State Representation by Graph

We represent the geometric features of urban traffic networks (note that this research applies to systems where traffic travels on the right-hand side of the road) using graphs, and embed traffic state data into the graph to represent the states of the target MDP. Each individual decision-making process state is composed of real-time information from local sensors. As it does not encompass all the information of the global MDP but only a portion of the urban traffic network, it is also referred to as partially observable.

Four node types are defined in the graph: vehicles, lanes, intersections, and traffic signal controllers (TSCer). The defined edge types are as follows:

Each node is connected to itself by an edge.

Each node is connected bidirectionally to other nodes of the same type.

Each lane node is connected bidirectionally to all vehicle nodes on that lane.

Each lane node is connected bidirectionally to other lane nodes at the same intersection.

Each TSCer node is connected bidirectionally to intersection nodes. (The phases of TSCer nodes influence the direction of vehicles at the intersection, and the intersection node information includes information from all TSCer nodes within the intersection.)

The global state is encoded as $g^{(0)} (t) = (h_{t}^{(0)}, e_{ij}, H^{(0)}, E)$ , where $h_{t}^{(0)}$ represents the initial node types in the graph, $e_{ij}$ represents the predefined edge relationship in the graph, where $i$ and $j$ denote nodes and $i \neq j$ , $H^{(0)}$ represents the node type set, and $E$ represents the graph edge type set. The initial node features are initialized as follows:

\begin{matrix} H^{(0)} (t) = \{h_{1}^{(0)} (t), \dots, h_{4}^{(0)} (t)\} \end{matrix}

(1)

where

superscript (0) = the node features of the input graph (initial node features, refer to the original data obtained from local observations).

The edge type set is defined as $E = \{e_{i, j}\}_{i \neq j}$ , encompassing all edges in the graph, where each edge connects node $i$ to a distinct node $j$ .

The node features are summarized in Table 2. “Current speed” is the normalized current velocity of a vehicle, as captured by lane sensors. “Position” denotes the relative position of a vehicle on the lane. “Length” indicates the length of the lane in meters. “Quantity” is the number of vehicles in the lane, given by the total count captured by lane sensors. “(EWA, $d_{t}^{1}$ )” represents the duration of green lights for the east-west direction and the east-west left-turn lights. “Is on EWA” indicates whether the east-west connection is available and is synchronized with the TSCer node feature (EWA, $d_{t}^{1}$ ).

Table 2.

Nodes and their Corresponding Features

Node type	Features
Vehicle (V)	Current speed, position
Lane (L)	Length, quantity
TSCer	(EWA, $d_{t}^{1}$ ), (EWLA, $d_{t}^{2}$ ), (NSA, $d_{t}^{3}$ ), (NSLA, $d_{t}^{4}$ )
Intersection (I)	Is on EWA, Is on EWLA, Is on NSA, Is on NSLA

Note: EWA = East-West Through Traffic; EWLA = East-West Left Turn; NSLA = North-South Left Turn; NSA = North-South Through Traffic; TSCer = Traffic Signal Controller.

Action Space

We adopt a four-phase control scheme as illustrated in Figure 1, which offers advantages such as comprehensiveness, efficiency, adaptability, and safety. Compared with other methods, four-phase control is typically easier to implement and manage, and it demonstrates greater reliability in reducing congestion and improving efficiency. The phases transition in the order of east-west through traffic (EWA), east-west left turn (EWLA), north-south through traffic (NSA), and north-south left turn (NSLA). The phase set is defined as follows:

\begin{matrix} phase = \{EWA, EWLA, NSA, NSLA\} \end{matrix}

(2)

where

EWA = the east-west through movement traffic signal turning green (allowing vehicles on the east-west lanes to proceed through or make right turns),

EWLA = the east-west left-turn movement traffic signal turning green (allowing vehicles on the east-west lanes to make left turns),

NSA = the north-south through movement traffic signal turning green (allowing vehicles on the north-south lanes to proceed through or make right turns), and

NSLA = the north-south left-turn movement traffic signal turning green (allowing vehicles on the north-south lanes to make left turns).

Figure 1.

Schematic diagram of four-phase control scheme.

At each time step $t$ , the agent determines, based on the input state information, whether the current signal phase needs to be switched, and outputs the action $a^{t} (a^{t} \in A)$ . If $a^{t} = 1$ , the traffic signal is switched to the next phase in the phase set; if $a^{t} = 0$ , the current traffic signal phase is kept unchanged. The duration of each phase is denoted as $d_{t}^{c} (c \in \{1, 2, 3, 4\})$ , and it belongs to the range $[D_{\min}, D_{\max}]$ . When two signal phases transition, there will be a 3 s yellow signal time followed by a 2 s all-red signal time. The action set for $N$ TSCer intelligent agents is represented as follows:

\begin{matrix} a^{t} = \{a_{i}^{t}\}_{i = 1}^{N} \in A \\ a_{i}^{t} \in \{0, 1\} \end{matrix}

(3)

In the DDPG algorithm, actions are continuous, so in the application of traffic signal control, continuous action outputs need to be converted into discrete phase change decisions. This paper sets a threshold of 0.5: if the network output is greater than this threshold, the decision is to switch to the next phase (i.e., $a^{t} = 1$ ); if it is less than or equal to this threshold, the decision is to maintain the current phase (i.e., $a^{t} = 0$ ).
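The thresholding rule above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are hypothetical, and the cyclic wrap from NSLA back to EWA is an assumption, since the text specifies the phase order but does not state what follows the last phase.

```python
PHASES = ["EWA", "EWLA", "NSA", "NSLA"]  # phase order from Equation 2
THRESHOLD = 0.5  # switching threshold stated in the text


def to_discrete_action(actor_output: float) -> int:
    """Map a continuous actor output to 1 (switch) or 0 (keep)."""
    return 1 if actor_output > THRESHOLD else 0


def next_phase(current: str, action: int) -> str:
    """Return the phase after applying the action.

    Assumption: the phase sequence wraps around cyclically (NSLA -> EWA).
    """
    if action == 0:
        return current
    idx = PHASES.index(current)
    return PHASES[(idx + 1) % len(PHASES)]
```

For example, an actor output of 0.73 while the signal is in NSA yields action 1 and moves the signal to NSLA, whereas an output of exactly 0.5 keeps the current phase.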

Figure 2 illustrates the phase transition process with an example. If the current traffic signal phase is the green light for the north-south direction, and the agent chooses to keep it at time step $t$ (i.e., $a^{t} = 0$ ), then the north-south green phase continues for a duration of $d_{t}^{3}$ , without a yellow light phase in between. On the other hand, if the action selected at time step $t + 1$ is $a^{t + 1} = 1$ , the traffic signal switches to the next phase, which is the left-turn signal for the north-south direction, after a 3 s yellow signal followed by a 2 s all-red signal.

Figure 2.

Signal light changes for different actions.

Reward Scheme

We propose a calculation method based on lane-normalized pressure, using the pressure difference between all traffic phases at intersections as the algorithm’s reward. The normalized pressure formula is as follows:

\begin{matrix} ρ_{a} = k \times Q_{a} + ω \times {(Q_{a})}^{m} \end{matrix}

(4)

where

$ρ_{a}$ = the normalized pressure on lane $a$ , and

$Q_{a}$ = the queue length of vehicles on lane $a$ .

In Equation 4, when the queue length is small, the pressure growth is determined by $k \times Q_{a}$ , so the curve is approximately linear when $Q_{a}$ is small. As the queue length increases, the pressure growth is determined by $ω \times (Q_{a})^{m}$ , so the curve rises steeply, as a power of $Q_{a}$ with exponent $m$ , as $Q_{a}$ becomes larger.

The normalized pressure value is in the range of [0,1], where a pressure value of 0 indicates that there are no vehicles queuing. When the queue length reaches the maximum capacity of the road, the pressure value equals 1, indicating that the road cannot receive any more vehicles from upstream roads. Let $C_{a}$ be the maximum capacity that road $a$ can handle. When $Q_{a} = C_{a}$ , the road’s pressure reaches its maximum value $ρ_{a} = 1$ .

\begin{matrix} ρ_{a} = \begin{cases} 0, & \text{if } Q_{a} = 0 \\ 1, & \text{if } Q_{a} = C_{a} \end{cases} \end{matrix}

(5)

By substituting Equation 5 into Equation 4, the parameter $ω$ is obtained as follows:

\begin{matrix} ω = \frac{1 - k \times C_{a}}{{(C_{a})}^{m}} \end{matrix}

(6)

The normalized pressure formula obtained by substituting Equation 6 into Equation 4 is as follows:

\begin{matrix} ρ_{a} = k \times Q_{a} + \frac{1 - k \times C_{a}}{{(C_{a})}^{m}} \times {(Q_{a})}^{m} \end{matrix}

(7)

Assuming a lane’s maximum capacity $C_{a} = 600$ , with $k = 0.001$ and $m = 8$ , the normalized pressure function of the lane is depicted in Figure 3. For queue lengths $Q_{a} < 500$ , the pressure value grows approximately linearly, whereas for $Q_{a} > 500$ , the power term drives the growth and the pressure value rises steeply. This illustrates that, as the queue length on the road approaches its maximum capacity, the pressure increases at a faster rate, resulting in larger pressure values and longer access times, thereby effectively alleviating current traffic pressures.
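As a quick numerical check of the boundary conditions, the normalized pressure can be sketched as follows. The function name and the clamping of the queue length to $[0, C_{a}]$ are assumptions made for illustration; the defaults are the example values quoted in the text ( $C_{a} = 600$ , $k = 0.001$ , $m = 8$ ).

```python
def normalized_pressure(queue: float, capacity: float = 600.0,
                        k: float = 0.001, m: float = 8.0) -> float:
    """Normalized lane pressure: k*Q + omega*Q^m with omega from Equation 6.

    omega * Q^m is evaluated as (1 - k*C) * (Q/C)^m, which is algebraically
    identical to ((1 - k*C) / C^m) * Q^m but avoids very large intermediates.
    """
    # Assumption: queue lengths outside [0, capacity] are clamped.
    q = min(max(queue, 0.0), capacity)
    return k * q + (1.0 - k * capacity) * (q / capacity) ** m


if __name__ == "__main__":
    for q in (0, 300, 500, 600):
        print(q, round(normalized_pressure(q), 4))
```

With these values the function returns 0 for an empty lane and 1 at full capacity, and at $Q_{a} = 300$ the linear term (0.3) is still far larger than the power term (about 0.0016).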

Figure 3.

Plot of lane normalized pressure function.

The traffic phase pressure difference is defined as the difference between the entering lane pressure and the exiting lane pressure. The traffic phase $(a, b)$ pressure difference for a vehicle driving from lane $a$ into lane $b$ is expressed as:

\begin{matrix} ρ (a, b) = ρ_{a} - ρ_{b} \end{matrix}

(8)

where

$ρ_{a}$ = the normalized pressure on the entering lane $a$ , and

$ρ_{b}$ = the normalized pressure on the exiting lane $b$ .

$(a, b) \in phase$ indicates that $(a, b)$ represents a movement from the set of phases $phase = \{EWA, EWLA, NSA, NSLA\}$ .

The pressure on intersection $i$ is the sum of the traffic phase pressure differences for all possible movements in the phase set, which can be expressed as follows:

\begin{matrix} P_{i} = \sum_{(a, b) \in phase} ρ (a, b) \end{matrix}

(9)

The pressure at the intersections is chosen as the reward for the reinforcement learning algorithm. A larger reward value indicates a smaller sum of pressures at the intersections. The reward function is defined as follows:

\begin{matrix} r_{i} = - P_{i} \end{matrix}

(10)
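The reward computation described above can be sketched as follows, assuming each movement is supplied as a pair of already-normalized lane pressures (entering, exiting). The function names and the input layout are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import Iterable, Tuple


def movement_pressure(rho_in: float, rho_out: float) -> float:
    """Pressure difference of one movement: entering minus exiting lane."""
    return rho_in - rho_out


def intersection_pressure(movements: Iterable[Tuple[float, float]]) -> float:
    """Sum of pressure differences over all movements at one intersection."""
    return sum(movement_pressure(a, b) for a, b in movements)


def reward(movements: Iterable[Tuple[float, float]]) -> float:
    """Reward is the negative intersection pressure (Equation 10)."""
    return -intersection_pressure(movements)
```

For instance, two movements with pressures (0.6, 0.2) and (0.3, 0.1) give an intersection pressure of 0.6 and therefore a reward of -0.6; lowering upstream queues raises the reward.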

The rewards of the agents and their neighboring regions are combined using a weighted sum, and a spatial discount factor is introduced to make the rewards of agents in the neighborhood positively correlated with the distance between the two agents. The spatial discount factor is represented as follows:

\begin{matrix} φ_{i, j} = φ_{0} \cdot d_{i}^{j} \end{matrix}

(11)

where

$d_{i}^{j}$ = the distance between node $i$ and node $j$ ,

$j \in N_{i}$ ( $N_{i}$ = the set of neighboring nodes of node $i$ ), and

$φ_{0}$ = the discount factor.

The reward function for the intelligent agent is defined as the sum of the local reward obtained from the current local environment and the rewards obtained from all neighboring agents within its vicinity. This definition is based on the concept proposed in Chu et al. ( 23 ). It is specifically expressed as:

\begin{matrix} r_{t}^{i} = r_{t}^{i^{'}} + \sum_{j \in N_{i}} φ_{i, j} \cdot r_{t}^{j} \end{matrix}

(12)

where

$r_{t}^{i^{'}}$ = the local reward obtained by the intelligent agent,

$φ_{i, j} \cdot r_{t}^{j}$ = the reward information obtained from neighboring intelligent agent $j$ , and

$φ_{i, j}$ = the spatial discount factor.
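A minimal sketch of Equations 11 and 12, implemented as written in the text, is given below. The dictionary-based inputs, the function names, and the numeric values in the usage note are illustrative assumptions.

```python
from typing import Dict


def spatial_discount(phi0: float, distance: float) -> float:
    """Spatial discount factor of Equation 11: phi0 multiplied by distance."""
    return phi0 * distance


def combined_reward(local_reward: float,
                    neighbor_rewards: Dict[str, float],
                    distances: Dict[str, float],
                    phi0: float) -> float:
    """Local reward plus spatially weighted neighbor rewards (Equation 12)."""
    neighbor_term = sum(
        spatial_discount(phi0, distances[j]) * r_j
        for j, r_j in neighbor_rewards.items()
    )
    return local_reward + neighbor_term
```

As a usage example with hypothetical values, a local reward of -1.0, neighbors B and C with rewards -2.0 and -0.5 at distances 1.0 and 2.0, and $φ_{0} = 0.1$ give a combined reward of -1.0 - 0.2 - 0.1 = -1.3.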

Methodology

In this section, we introduce the proposed GAT-DDPG algorithm, which uses GAT to represent the state that serves as the input for DRL. By employing neighborhood cooperation, the algorithm enhances the agent’s perception of the road network’s state, thereby improving the decision-making performance of DRL and enhancing the traffic signal control capabilities.

Deep Deterministic Policy Gradient (DDPG)

In traffic signal control, small variations in strategy can lead to significant fluctuations in traffic flow, making stability a crucial consideration. Despite the widespread application of the DQN algorithm in traffic signal control, it relies on approximating the Q-value function to indirectly formulate strategies ( 35 , 36 ). This indirect approach may pose challenges in maintaining stability. Therefore, we employ the DDPG algorithm to directly learn the policy function, thereby reducing the policy instability caused by approximation errors in the Q-value function ( 37 ).

The mean squared error between the estimated value and the target value is used as the loss function for the online critic network, as shown in Equation 13:

\begin{matrix} L (w) = \sum {[r_{t} + γ Q_{\bar{w}} (s_{t + 1}, a_{t + 1}) - Q_{w} (s_{t}, a_{t})]}^{2} \end{matrix}

(13)

where

$a_{t} = μ_{θ} (s_{t})$ = the action output by the online actor network at time step $t$ ,

$μ$ = the deterministic behavioral policy,

$Q_{w} (s_{t}, a_{t})$ = the estimated value output by the online critic network at time step $t$ ,

$r_{t}$ = the immediate reward received from the road network environment at time step $t$ ,

$γ$ = the discount factor, and

$Q_{\bar{w}} (s_{t + 1}, a_{t + 1})$ = the estimated value output by the target critic network at time step $t + 1$ (they together constitute the target value at moment $t$ ).

The online actor network updates its parameters by minimizing the following loss function:

\begin{matrix} L (θ) = - Q_{w} (s_{t}, a_{t}) \end{matrix}

(14)

After training each mini-batch of traffic data, the parameters of the two target networks are softly updated based on the parameters of the two online networks, as shown in Equation 15:

\begin{matrix} \bar{w} \leftarrow τ w + (1 - τ) \bar{w} \\ \bar{θ} \leftarrow τ θ + (1 - τ) \bar{θ} \end{matrix}

(15)

where

$w$ = the parameter for the online critic network,

$θ$ = the parameter for the online actor network,

$\bar{w}$ = the parameter for the target critic network,

$\bar{θ}$ = the parameter for the target actor network, and

$τ$ = the update coefficient.
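The critic loss of Equation 13 and the soft update of Equation 15 can be illustrated on plain Python lists, without any deep learning framework. In this sketch the network outputs are passed in as precomputed numbers, parameters are flat lists of floats, and all function names are hypothetical.

```python
from typing import List, Tuple


def td_target(reward: float, gamma: float, target_q_next: float) -> float:
    """Target value r_t + gamma * Q_target(s_{t+1}, a_{t+1})."""
    return reward + gamma * target_q_next


def critic_loss(batch: List[Tuple[float, float, float]], gamma: float) -> float:
    """Summed squared error over a mini-batch, as in Equation 13.

    Each item is (r_t, Q_target(s_{t+1}, a_{t+1}), Q_online(s_t, a_t)).
    Divide by len(batch) to obtain a mean instead of a sum.
    """
    return sum(
        (td_target(r, gamma, q_next) - q_now) ** 2
        for r, q_next, q_now in batch
    )


def soft_update(target: List[float], online: List[float], tau: float) -> List[float]:
    """Move target parameters a fraction tau toward the online parameters."""
    return [tau * w + (1.0 - tau) * w_bar for w, w_bar in zip(online, target)]
```

A small $τ$ keeps the target networks changing slowly, which is what stabilizes the target value used in the critic loss.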

By continuously optimizing the parameters of the neural networks to select the optimal control strategy, traffic signal control can effectively reduce lane pressure and improve vehicle traffic efficiency.

Graph Attention Network-Deep Deterministic Policy Gradient (GAT-DDPG)

The representation of traffic signal states is crucial for the design and performance of control algorithms, as it directly affects the optimization effectiveness and real-time responsiveness of traffic signals. Many papers and systems on traffic signal control primarily adopt traditional state representations, such as traffic flow and vehicle density based on static features ( 38 ). Although this representation may be effective in certain cases, it often fails to comprehensively capture the complexities and diversities of modern urban traffic. Therefore, we introduce a novel approach by applying GAT to address this issue. GAT provides a new perspective for representing traffic signal states.

GAT employs attention mechanisms to compute weighted sums of neighboring node features, making the weights of neighboring node features entirely dependent on the node features and independent of the graph structure. The core of GAT is the graph attention layer, which takes an input $h = \{\vec{h_{1}}, \vec{h_{2}}, \dots, \vec{h_{N}}\}, \vec{h_{i}} \in R^{F}$ , where $N$ is the number of nodes, and $F$ is the number of features for each node. Its output is $h^{'} = \{\vec{h_{1}^{'}}, \vec{h_{2}^{'}}, \dots, \vec{h_{N}^{'}}\}, \vec{h_{i}^{'}} \in R^{F'}$ , where $F'$ represents a higher-level number of features.

As shown in Figure 4, to achieve sufficient expressive power for extracting higher-level features, a mapping with the parameter $W \in R^{F^{'} \times F}$ is applied to each node, and the resulting vectors are concatenated. Then, a feedforward neural network is used to map the concatenated vector to a real value, activated by the leaky rectified linear unit (ReLU) function, as shown in Equation 16:

\begin{matrix} e_{ij} = LeakyReLU (\vec{a^{T}} [W \vec{h_{i}} \| W \vec{h_{j}}]) \end{matrix}

(16)

where

$e_{ij}$ = the initial attentional correlation between node $j$ and node $i$ ,

$\vec{a^{T}}$ = a feedforward neural network, and

|| = vector concatenation.

Figure 4.

Node attention weights.

While GAT demonstrates good performance, it is highly sensitive to the structure of the input graph, often requiring different hyperparameter settings or adjustments to the model architecture for optimal performance with varying graph structures. This sensitivity limits the model’s generalization ability. Therefore, we propose a fusion algorithm of GAT and RWR. RWR explores the information of the entire graph by conducting random walks starting from multiple nodes, thus providing a more global perspective. This global information helps the model better understand the structural characteristics of the entire graph, reducing reliance on local neighbor nodes and, consequently, lowering sensitivity to the input graph structure. Using the RWR algorithm for node random walks, nodes are selected with a certain probability, and, through continuous random walks, the importance of nodes gradually propagates and accumulates. The algorithm is defined as follows:

\begin{matrix} \vec{r_{l}} = (1 - c) {(E - c \tilde{X})}^{- 1} \vec{e_{l}} \end{matrix}

(17)

where

$\vec{r_{l}} = [r_{i, j}]$ = the importance vector,

$r_{i, j}$ = the importance ranking of node $j$ with respect to node $i$ ,

$c$ = the restart probability,

E = the unit matrix,

$X = [x_{i, j}]$ = the graph with directed weights,

$\tilde{X}$ = the matrix obtained after standardizing $X$ , and

$\vec{e_{l}}$ = the initial vector.
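Equation 17 is the closed-form solution of the fixed-point relation $\vec{r_{l}} = c \tilde{X} \vec{r_{l}} + (1 - c) \vec{e_{l}}$ , so it can be approximated without a matrix inverse. The sketch below uses fixed-point iteration instead of the inverse, and it assumes that "standardizing" $X$ means normalizing each column to sum to 1; both choices, along with the function names and default values, are assumptions for illustration.

```python
from typing import List


def column_normalize(X: List[List[float]]) -> List[List[float]]:
    """Scale each column of X to sum to 1 (assumed form of standardization)."""
    n = len(X)
    col_sums = [sum(X[i][j] for i in range(n)) for j in range(n)]
    return [
        [X[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]


def rwr(X: List[List[float]], start: int, c: float = 0.85,
        iters: int = 200) -> List[float]:
    """Approximate r = (1 - c) (E - c X~)^{-1} e by iterating
    r <- c * X~ r + (1 - c) * e, which converges for 0 <= c < 1."""
    n = len(X)
    Xn = column_normalize(X)
    e = [1.0 if i == start else 0.0 for i in range(n)]
    r = e[:]
    for _ in range(iters):
        r = [
            c * sum(Xn[i][j] * r[j] for j in range(n)) + (1.0 - c) * e[i]
            for i in range(n)
        ]
    return r
```

On a three-node path graph with the walk started at one end, the resulting scores sum to 1 and the start node scores higher than the node at the far end, reflecting proximity to the start node.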

Introducing $\vec{r_{l}}$ as a learnable parameter vector multiplied by attention weights, the normalized attention weight $α_{ij}$ for each neighboring node $j$ of node $i$ is computed as follows:

\begin{matrix} α_{ij} = SoftMax_{j} (e_{ij} \cdot \vec{r_{l}}) \\ = \frac{\exp (LeakyReLU (\vec{a^{T}} [W \vec{h_{i}} \| W \vec{h_{j}}]) \cdot ((1 - c) {(E - c \tilde{X})}^{- 1} \vec{e_{l}}))}{\sum_{k \in N_{i}} \exp (LeakyReLU (\vec{a^{T}} [W \vec{h_{i}} \| W \vec{h_{k}}]) \cdot ((1 - c) {(E - c \tilde{X})}^{- 1} \vec{e_{l}}))} \end{matrix}

(18)

where

$SoftMax_{j}$ = the $SoftMax$ operation applied to the attention weights of all neighboring nodes $j$ of node $i$ (ensuring their sum equals 1), and

$N_{i}$ = the set of neighboring nodes of node $i$ .
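The normalization in Equation 18 can be illustrated with a short sketch. It assumes that the RWR term contributes one scalar importance $r_{i, j}$ per neighbor and that LeakyReLU uses a negative slope of 0.2; both are assumptions made for illustration, and the function and variable names are hypothetical.

```python
import math


def leaky_relu(x, slope=0.2):
    # slope=0.2 is an assumed value; the paper does not state it.
    return x if x > 0 else slope * x


def rwr_attention(raw_scores, rwr_weights):
    """Softmax over the neighbors j of node i of e_ij * r_ij.

    raw_scores:  {neighbor j: a^T [W h_i || W h_j]} before LeakyReLU
    rwr_weights: {neighbor j: r_ij} taken from the RWR importance vector
    """
    logits = {j: leaky_relu(s) * rwr_weights[j] for j, s in raw_scores.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {j: math.exp(v - m) for j, v in logits.items()}
    z = sum(exps.values())
    return {j: v / z for j, v in exps.items()}


# Three hypothetical neighbors of one intersection.
alpha = rwr_attention({1: 0.8, 2: 0.8, 3: -0.5}, {1: 0.23, 2: 0.23, 3: 0.20})
```

The returned weights sum to 1 over the neighborhood, and neighbors with equal raw scores and equal RWR importance receive equal attention.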

The output representation is obtained by linearly accumulating the neighborhood representations of nodes according to the attention weights, as follows:

\begin{matrix} {\vec{h}}_{i 1}^{'} = σ (\sum_{j \in N_{i}} α_{ij} W \vec{h_{j}}) \end{matrix}

(19)

where

$σ (\cdot)$ = the sigmoid activation function, and

subscript 1 in ${\vec{h}}_{i 1}^{'}$ = the output obtained through a linear aggregation method.

To ensure stable node representation through self-attention, we introduce a multi-head self-attention mechanism to enhance the model’s representational power. Specifically, using $K$ to denote the number of heads, the aforementioned operations are performed for each head, and the results are concatenated together, as follows:

\begin{matrix} {\vec{h}}_{i 2}^{'} = \|_{k = 1}^{K} σ (\sum_{j \in N_{i}} α_{ij}^{k} W^{k} \vec{h_{j}}) \end{matrix}

(20)

where

subscript 2 in ${\vec{h}}_{i 2}^{'}$ = the output obtained after processing through multi-head attention and concatenating the results.

In the final layer of GAT, to avoid high-dimensional outputs arising from concatenation, averaging is used to fuse the results of multi-head attention, obtaining the final representation:

\begin{matrix} {\vec{h}}_{i}^{'} = σ (\frac{1}{K} \sum_{k = 1}^{K} \sum_{j \in N_{i}} α_{ij}^{k} W^{k} \vec{h_{j}}) \end{matrix}

(21)
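The difference between the concatenation in Equation 20 and the averaging in Equation 21 can be sketched as follows. The attention weights and transformed features $W^{k} \vec{h_{j}}$ are hypothetical inputs supplied as plain dictionaries; the helper names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def head_sum(alpha, Wh):
    """Weighted sum over neighbors: sum_j alpha_ij * (W h_j), before sigma."""
    dim = len(next(iter(Wh.values())))
    return [sum(alpha[j] * Wh[j][d] for j in alpha) for d in range(dim)]


def concat_heads(alphas, Whs):
    """Eq. 20: apply sigma per head, then concatenate the K head outputs."""
    out = []
    for alpha, Wh in zip(alphas, Whs):
        out.extend(sigmoid(v) for v in head_sum(alpha, Wh))
    return out


def average_heads(alphas, Whs):
    """Eq. 21: average the K pre-activation sums, then apply sigma once."""
    sums = [head_sum(a, Wh) for a, Wh in zip(alphas, Whs)]
    K = len(sums)
    return [sigmoid(sum(s[d] for s in sums) / K) for d in range(len(sums[0]))]


# Two heads, two neighbors (1 and 2), two-dimensional transformed features.
alphas = [{1: 0.5, 2: 0.5}, {1: 0.25, 2: 0.75}]
Whs = [{1: [1.0, 0.0], 2: [0.0, 1.0]}, {1: [2.0, 0.0], 2: [0.0, 2.0]}]
h_concat = concat_heads(alphas, Whs)   # length K * dim = 4
h_avg = average_heads(alphas, Whs)     # length dim = 2
```

Concatenation grows the output dimension with the number of heads, whereas averaging keeps it fixed, which is why averaging is preferred in the final layer.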

The GAT network structure used in this paper is illustrated in Figure 5. Firstly, the network employs an attention mechanism $a$ implemented by a feedforward neural network to compute the initial attention coefficients $e_{ij}$ for node $i$ ’s neighbor nodes $j$ , with the activation function $LeakyReLU (\cdot)$ , followed by normalization through $SoftMax (\cdot)$ . To prevent overfitting and enhance model stability, GAT introduces a multi-head self-attention mechanism, replacing the fixed normalization operations in GCN. This significantly enhances its ability to capture graph spatial information correlations, making it highly suitable for complex traffic road network environments. During the computation of attention coefficients, GAT computes multiple independent attention weightings in parallel, aggregates these results through concatenation or averaging, and derives the new feature ${\vec{h}}_{i}^{'}$ for node $i$ .

Figure 5.

Network structure of the graph attention algorithm.

Based on this, we incorporate the RWR algorithm into GAT, enabling the model to more fully consider the long-distance dependency between nodes and the global topological characteristics of urban road networks, thereby enhancing the model’s generalization ability and robustness. The principle of the GAT-DDPG method is illustrated in Figure 6.

Figure 6.

The structure of the graph attention-deep deterministic policy gradient (GAT-DDPG) algorithm.

GAT-DDPG effectively captures the complex relationships between nodes in traffic networks, enabling the model to better understand the topology and traffic flow patterns of the network. By integrating the RWR algorithm, the model’s capability to represent and generalize complex graph structures is enhanced. Feeding the graph features extracted by the GNN into the AC architecture of DRL as state inputs allows the model to learn complex traffic signal control decisions from latent graph features. This enables the model to make decisions and optimizations in real-time traffic environments, facilitating flexible adjustments and responses to dynamic traffic conditions.

Unlike decentralized control, GAT-DDPG achieves centralized processing and analysis of traffic data across the entire area through a unified decision center. This enables the formulation of optimal signal control strategies based on a global perspective. Such global optimization effectively avoids traffic signal conflicts and enhances traffic flow.

Experiment and Evaluation

In this section, we conduct a comparative analysis of the proposed algorithm against other methods on both synthetic and real-world road networks. The experiments are conducted in the CityFlow simulation environment, an open-source platform widely used for traffic flow simulation research. These comparative experiments demonstrate the superiority of the proposed GAT-DDPG algorithm.

Datasets and Related Settings

We use three real datasets from New York, U.S., and Hangzhou and Jinan in China. The New York dataset comprises 196 intersections arranged in a grid format of 7 rows and 28 columns; the Hangzhou dataset consists of 16 intersections arranged in a grid format of 4 rows and 4 columns; and the Jinan dataset contains 12 intersections arranged in a grid format of 4 rows and 3 columns. In these real datasets, processed traffic data are inputted, including intersection identifiers, vehicle counts, traffic light status configurations, road identifiers, and road adjacency information, among others. Vehicle travel times are obtained by parsing vehicle trajectory data.

Meanwhile, the model was evaluated on two synthetic road datasets: a 3 × 3 traffic network and a 6 × 6 traffic network. The synthetic datasets comprise both one-way and two-way traffic data. In the two-way lanes, there are four possible directions: west to east, east to west, north to south, and south to north. On the other hand, one-way lanes only have two possible directions: west to east and north to south.
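The grid layouts described above can be turned into a graph input with a few lines of code. The following sketch treats each network as a plain undirected lattice with row-major intersection indices; this is a simplifying assumption, since the road adjacency of the real datasets is supplied with the data rather than generated, and the function name is hypothetical.

```python
def grid_adjacency(rows, cols):
    """Return {intersection id: sorted neighbor ids} for a rows x cols grid."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = []
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.append(rr * cols + cc)  # row-major node index
            adj[r * cols + c] = sorted(nbrs)
    return adj


hangzhou = grid_adjacency(4, 4)    # 16 intersections
jinan = grid_adjacency(4, 3)       # 12 intersections
new_york = grid_adjacency(7, 28)   # 196 intersections
```

Corner intersections have two neighbors, edge intersections three, and interior intersections four, so the neighbor set $N_{i}$ used by the attention mechanism varies in size across the network.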

In the experiments of this study, the mild traffic flow for the 3 × 3 synthetic road dataset is 626 vehicles per hour (vph) (i.e., 626 vehicles generated per hour), the moderate traffic flow is 1,071 vph, and the heavy traffic flow is 2,340 vph. For the 6 × 6 synthetic road dataset, the mild traffic flow is 1,021 vph, the moderate traffic flow is 2,426 vph, and the heavy traffic flow is 4,680 vph. Other relevant settings are shown in Table 3.

Table 3.

Other Relevant Settings

Traffic network size	Minimum intersection crossings (count)	Number of lanes (count)	Maximum negative acceleration (m/s²)	Maximum positive acceleration (m/s²)	Maximum speed (m/s)	Intersection spacing (m)
3 × 3	3	4	4.8	2.5	11.33	500
6 × 6	6	4	5.6	3.2	12.22	700

The study conducted 200 episodes of model training. Throughout the training process, a batch size of 32 was employed, with each training cycle utilizing 1,000 samples. Additionally, the study set the maximum capacity for storing past experiences to 10,000. The target Q-network was updated every five training cycles, with a discount factor $γ$ of 0.95 and a soft update coefficient of 0.99. The learning rates for the actor and critic networks were set at $1 \times 10^{- 4}$ and $5 \times 10^{- 5}$ , respectively. Within the GAT network, six hidden layers were used. Finally, the study retained the five best models and averaged their results.
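For reference, the training settings listed above can be collected in a single configuration object, together with a fixed-capacity experience buffer. This is a sketch only: the field names, the `ReplayBuffer` class, and the placeholder transitions are hypothetical and do not reproduce the authors' implementation.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    # Values as reported in the text; field names are illustrative.
    episodes: int = 200
    batch_size: int = 32
    samples_per_cycle: int = 1_000
    buffer_capacity: int = 10_000
    target_update_every: int = 5
    gamma: float = 0.95
    soft_update_coef: float = 0.99
    actor_lr: float = 1e-4
    critic_lr: float = 5e-5
    gat_hidden_layers: int = 6


class ReplayBuffer:
    """Fixed-capacity experience store; the oldest transitions are dropped first."""

    def __init__(self, capacity):
        self._data = deque(maxlen=capacity)

    def add(self, transition):
        self._data.append(transition)

    def sample(self, batch_size, rng=random):
        return rng.sample(list(self._data), batch_size)

    def __len__(self):
        return len(self._data)


cfg = TrainConfig()
buffer = ReplayBuffer(cfg.buffer_capacity)
for step in range(12_000):  # placeholder transitions, not simulator output
    buffer.add((step, 0, 0.0, step + 1))
batch = buffer.sample(cfg.batch_size)
```

Once more than 10,000 transitions have been added, the buffer holds only the most recent 10,000, and each sampled batch contains 32 of them.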

Compared Algorithms and Metrics

We use average travel time (ATT) as the primary metric for evaluating algorithm performance; this is computed as the ATT (in seconds) of all vehicles between their entry and exit from the controlled area. Compared with other metrics that focus more on specific regions or nodes, such as average queue length or queue length per lane, ATT covers the entire journey from start to finish. This provides a comprehensive reflection of the overall performance and efficiency of the traffic system. Optimizing signal timings to reduce congestion and delays is a core objective of traffic signal control systems. ATT, as a performance metric, intuitively reflects the effectiveness of signal timing and provides robust support for optimization decisions.
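A minimal sketch of the ATT computation is given below. It assumes that each vehicle contributes one (entry time, exit time) pair in seconds and that only vehicles that have left the controlled area are counted; how unfinished trips are handled is not specified in the text, so that choice is an assumption.

```python
def average_travel_time(trips):
    """trips: iterable of (enter_time_s, exit_time_s) pairs for finished vehicles."""
    durations = [leave - enter for enter, leave in trips]
    if not durations:
        raise ValueError("no completed trips")
    return sum(durations) / len(durations)


# Three hypothetical vehicles with travel times of 90 s, 120 s, and 90 s.
att = average_travel_time([(0, 90), (10, 130), (25, 115)])
```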

The compared algorithms encompass traditional traffic signal control methods and reinforcement learning (RL)-based traffic signal control methods, outlined as follows:

Traditional Traffic Signal Control Methods

Fixedtime (39): Uses fixed timings with random offsets, employing predetermined schemes for traffic cycle lengths and phase times. This method is widely applied when traffic flow is steady.

Maxpressure (40, 41): Chooses phases that maximize pressure. Maximum pressure only requires local information, as intersection control depends solely on queue lengths on adjacent road segments.
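The max-pressure rule can be sketched in a few lines. Here the pressure of a phase is taken as the sum, over the movements it permits, of the incoming queue minus the outgoing queue; the lane names, phase names, and queue values are hypothetical, and practical implementations differ in details such as minimum green times.

```python
def max_pressure_phase(phases, queue):
    """Pick the phase whose permitted movements have the largest total pressure.

    phases: {phase name: list of (incoming lane, outgoing lane) movements}
    queue:  {lane: number of queued vehicles}
    """
    def pressure(movements):
        return sum(queue[i] - queue[o] for i, o in movements)

    return max(phases, key=lambda p: pressure(phases[p]))


phases = {
    "NS": [("n_in", "s_out"), ("s_in", "n_out")],
    "EW": [("e_in", "w_out"), ("w_in", "e_out")],
}
queue = {"n_in": 6, "s_in": 4, "e_in": 2, "w_in": 3,
         "n_out": 1, "s_out": 0, "e_out": 5, "w_out": 2}
chosen = max_pressure_phase(phases, queue)  # NS pressure 9, EW pressure -2
```

Only the queues on lanes adjacent to the intersection enter the decision, which is why the method needs no information from the rest of the network.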

RL-Based Decentralized Traffic Signal Control Approach

CGDRL (Coordinated Deep Reinforcement Learners) (42): A DRL-based approach to multi-intersection signal control that models joint actions, using a coordination graph to optimize the joint actions of pairs of intersections.

IntelliLight (43): A DRL method that demonstrates good performance in two-phase signal control. It does not consider neighboring information, with each intersection controlled by an individual agent. The agents do not share parameters and are updated independently.

GCNN (Graph Convolutional Neural Networks) (44): A multi-intersection traffic signal control method based on RL and GCNN, using GCNN to directly extract geometric road network features and employing model-free RL methods to adaptively learn policies.

CoLight (45): A multi-intersection signal control method based on GAT. It considers not only the spatiotemporal impact of adjacent intersections on the target intersection but also models adjacent intersections without indexing.

MA2C (Multi-Agent Advantage Actor-Critic) (23): A DRL-based multi-intersection traffic signal control method that uses the same states and rewards as IntelliLight and employs a fully scalable decentralized multi-agent reinforcement learning algorithm. This method considers neighbor information, enhancing observability.

MDQN (Multi-Agent Deep Q-Networks) (46): A DRL-based approach for multi-intersection signal control. It considers neighbor information and incorporates this information into the loss function of the DQN. Each agent selects the optimal action to control the intersection by obtaining local lane state information.

FRAP (Flip and Rotation invariant All Phase) (47): A DRL method based on the intuitive principle of mutual competition in traffic signal control. When conflicts occur between two traffic signals, priority is given to the one with greater traffic movement, that is, higher demand.

DDPG (48): A DRL method that optimizes deterministic policies by simultaneously learning policy and value function networks. It is suitable for continuous traffic signal control tasks.

RL-Based Centralized Traffic Signal Control Approach

GAT-DDPG: Learns from graph-structured data, incorporating neighboring node information and their significance, while leveraging latent graph features to comprehend the intricate decisions associated with traffic signal control.

Performance Comparison

In this study, we have demonstrated the exceptional performance of the GAT-DDPG method through validation on synthetic datasets, including 3 × 3 and 6 × 6 traffic networks, as well as three real-world datasets. The specific results are presented in Table 4.

Table 4.

Performance Comparisons on Both Synthetic and Real-World Datasets

Data	Grid 3 × 3			Grid 6 × 6			Hangzhou	Jinan	New York
Vehicle (vehicles per hour)	Low 626	Medium 1,071	High 2,340	Low 1,021	Medium 2,426	High 4,680	High 2,983	High 6,295	High 2,796
Fixedtime	171.63	256.36	302.64	187.47	270.56	459.36	728.56	830.28	1,946.52
Maxpressure	165.32	156.25	201.56	140.59	232.41	369.49	426.72	343.45	1,628.45
CGDRL	362.89	564.35	968.24	708.26	687.23	1,956.36	589.45	1,012.56	2,045.88
IntelliLight	121.03	159.56	146.32	207.56	245.26	509.91	494.56	513.32	2,007.85
GCNN	76.32	136.25	163.62	159.36	204.25	406.39	455.65	580.25	1,900.56
CoLight	59.36	100.35	159.24	143.65	175.46	265.49	331.36	335.27	1,463.75
MA2C	50.36	99.25	146.49	123.26	105.12	203.31	306.34	345.76	1,300.74
MDQN	45.25	91.26	120.12	98.52	98.45	126.38	297.65	316.54	1,186.47
FRAP	88.12	84.57	114.57	48.36	88.66	118.63	986.31	280.57	1,030.23
DDPG	42.89	82.79	114.36	62.39	72.97	100.65	286.32	296.58	1,058.42
GAT-DDPG	40.21	77.56	100.03	56.25	64.21	88.56	248.21	251.73	870.47
Improvement (vs. DDPG)	6.25%	6.32%	12.53%	9.84%	12.01%	14.09%	13.31%	15.12%	17.76%

Note: CGDRL = Coordinated Deep Reinforcement Learners; DDPG = Deep Deterministic Policy Gradient; FRAP = Flip and Rotation invariant All Phase; GAT-DDPG = Graph Attention Network-Deep Deterministic Policy Gradient; GCNN = Graph Convolutional Neural Networks; MA2C = Multi-Agent Advantage Actor-Critic; MDQN = Multi-Agent Deep Q-Networks.

For the 3 × 3 traffic network, when compared with the DDPG algorithm, our approach shows a 6.25% improvement in ATT under light traffic conditions, a 6.32% improvement under moderate traffic conditions, and a significant 12.53% improvement under heavy traffic conditions.

In the case of the 6 × 6 traffic network, similarly compared with the DDPG algorithm, our method achieves a 9.84% enhancement in ATT under light traffic conditions, a 12.01% improvement under moderate traffic conditions, and an impressive 14.09% improvement under heavy traffic conditions.

Furthermore, in the real-world datasets of cities with heavy traffic flow such as New York, Hangzhou, and Jinan, this method also achieves significant improvements in ATT for vehicles. Compared with the DDPG method, our approach results in a 13.31% improvement in ATT in the Hangzhou dataset, a 15.12% improvement in the Jinan dataset, and a remarkable 17.76% improvement in the New York dataset.
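The improvement figures quoted above are relative reductions in ATT with respect to DDPG, that is, (ATT of DDPG minus ATT of GAT-DDPG) divided by ATT of DDPG. The sketch below recomputes a few of them from the ATT values in Table 4; the function name is illustrative.

```python
def improvement_pct(baseline_att, method_att):
    """Relative ATT reduction of the method over the baseline, in percent."""
    return 100.0 * (baseline_att - method_att) / baseline_att


# (DDPG ATT, GAT-DDPG ATT) pairs copied from Table 4, in seconds.
pairs = {
    "Grid 3 x 3, low": (42.89, 40.21),
    "Hangzhou": (286.32, 248.21),
    "Jinan": (296.58, 251.73),
    "New York": (1058.42, 870.47),
}
gains = {name: round(improvement_pct(b, m), 2) for name, (b, m) in pairs.items()}
```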

Based on comprehensive data analysis, we can conclude that our method exhibits superior performance enhancements under heavy traffic flow conditions. Particularly in more complex environments with multiple intersections and nodes, as in the case of 36 intersections under heavy traffic conditions, our method shows the highest performance improvement among the synthetic networks, reaching 14.09%. This demonstrates the effectiveness of the GAT algorithm combined with the RWR algorithm in enhancing urban road network state perception and overall performance.

Our research findings highlight that adopting the GAT-DDPG approach can significantly reduce the ATT of vehicles under various traffic flow conditions, especially in complex urban road networks, where performance improvements are most pronounced.

In intricate and complex traffic environments, traffic signal control remains at the core of real-world traffic management and is one of the critical challenges currently needing to be overcome. The outstanding performance of the GAT-DDPG algorithm in simulating complex environments effectively verifies its potential efficiency and reliability in real-world complex traffic environments. This remarkable adaptability positions GAT-DDPG with tremendous potential and value in addressing urban traffic bottlenecks and optimizing traffic network layouts.

Convergence Comparison

Self-Convergence Comparison

As depicted in Figure 7, we provide a visual representation of the self-convergence of our algorithm in 3 × 3 and 6 × 6 traffic networks for both unidirectional and bidirectional heavy traffic scenarios. To make the data easier to present, we took logarithms of the ATT.

In Figure 7a, it can be observed that, in the case of a 3 × 3 traffic network with unidirectional heavy traffic, the algorithm reaches a converged state at the 5th training iteration, and the logarithmic values of ATT stabilize around 4.08. In Figure 7c, for the 6 × 6 traffic network with unidirectional heavy traffic, the algorithm successfully achieves convergence by the 4th training iteration, and the logarithmic values of ATT stabilize around 3.53. This represents a performance improvement of 13.48% compared with the unidirectional heavy traffic scenario in the 3 × 3 traffic network.

Figure 7.

Self-convergence of the graph attention-deep deterministic policy gradient (GAT-DDPG) algorithm in synthetic datasets for both unidirectional and bidirectional heavy traffic scenarios: (a) 3 × 3 grid unidirectional, (b) 3 × 3 grid bidirectional, (c) 6 × 6 grid unidirectional, and (d) 6 × 6 grid bidirectional.

In Figure 7b, for the 3 × 3 traffic network with bidirectional heavy traffic, the algorithm reaches a converged state by the 6th training iteration, and the logarithmic values of ATT stabilize around 3.81. In Figure 7d, for the 6 × 6 traffic network with bidirectional heavy traffic, the algorithm also achieves convergence by the 6th training iteration, with the logarithmic values of ATT stabilizing around 3.65. This represents a performance improvement of 4.20% compared with the bidirectional heavy traffic scenario in the 3 × 3 traffic network.
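The percentages quoted for Figure 7 are relative differences between the converged logarithmic ATT values. The sketch below reproduces them and, under the assumption that the natural logarithm was used (the text does not state the base), also shows what the same gaps correspond to on the original ATT scale, where a gap $d$ in log(ATT) is a factor of $e^{-d}$.

```python
import math


def relative_drop(a, b):
    """Percent decrease from a to b."""
    return 100.0 * (a - b) / a


# Converged log(ATT) values reported for Figure 7.
log_drop_uni = relative_drop(4.08, 3.53)   # 3 x 3 vs 6 x 6, unidirectional
log_drop_bi = relative_drop(3.81, 3.65)    # 3 x 3 vs 6 x 6, bidirectional

# Assumption: natural logarithm. A gap d in log(ATT) is a factor exp(-d) in ATT.
att_drop_uni = 100.0 * (1.0 - math.exp(3.53 - 4.08))
att_drop_bi = 100.0 * (1.0 - math.exp(3.65 - 3.81))
```

Under that assumption, the difference on the ATT scale is larger than the difference between the logarithmic values, so the two kinds of percentage should not be compared directly.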

Comparing the convergence of the 3 × 3 traffic network and the 6 × 6 traffic network under the same traffic conditions, we can observe that the algorithm achieves better convergence in the complex traffic environment. Therefore, the algorithm proposed in this paper demonstrates superior performance in adapting to complex traffic scenarios.

Comparing the convergence of the 6 × 6 traffic network with bidirectional heavy traffic flow to the convergence of both unidirectional and bidirectional heavy traffic flow in the 3 × 3 traffic network, we observe that, despite the more complex bidirectional heavy traffic flow environment in the 6 × 6 network, it achieves a convergence speed similar to that of the 3 × 3 network under unidirectional heavy traffic flow conditions. Moreover, the logarithmic values of ATT converge to around 3.65, which is the best performance across the three traffic environments. This finding further validates the superior performance of our proposed algorithm in scenarios with more intersections and complex traffic conditions. It also signifies the effectiveness of our algorithm in enhancing road throughput and better adapting to real-world urban road scenarios.

The experimental results depicted in Figure 7 illustrate the rapid convergence of the data. This phenomenon could potentially stem from the relatively large gradient descent step size during the initial stages, causing parameters to swiftly traverse the parameter space and, thus, accelerating the convergence speed. As training progresses, the learning rate may gradually decrease, leading to smaller steps in parameter updates, thereby inducing the model’s performance improvement to gradually stabilize.

Comparative Convergence Analysis of Algorithms

Convergence speed is an important metric for evaluating the performance of an algorithm. In reinforcement learning, we usually want the algorithm to converge quickly because it can reduce the training time and the computational cost. An algorithm with good convergence can accurately approximate or even reach the optimal policy in a shorter period of time, and can operate stably and find the optimal solution under different environmental conditions. This is especially critical in the real-world traffic environment, which is complex, changing, and full of uncertainties.

As shown in Figure 8, we compared the learning curves of GAT-DDPG against four other DRL methods. The evaluation was conducted on three real datasets, revealing that GAT-DDPG significantly outperformed other baseline methods in the time taken to reach a threshold and the final learned performance. GAT-DDPG initiated with optimal performance, reached the targeted performance at the fastest pace, and concluded with optimal convergence.

Figure 8.

Convergence of the global graph attention-deep deterministic policy gradient (GAT-DDPG) and the other four algorithms during the training process: (a) New York, (b) Hangzhou, and (c) Jinan. Note: DDPG = Deep Deterministic Policy Gradient; FRAP = Flip and Rotation invariant All Phase; MDQN = Multi-Agent Deep Q-Networks; GAT-DDPG = Graph Attention Network-Deep Deterministic Policy Gradient.

Importantly, the attention GAT-DDPG pays to neighborhood information and importance during learning did not lead to a deceleration of model convergence. On the contrary, it accelerated the speed at which the model approximated the optimal policy. This underscores the efficacy and superiority of GAT-DDPG in tackling traffic signal control challenges.

Effectiveness Comparison

In Figure 9, the running times of GAT-DDPG and four other methods are compared on three real datasets and four synthetic datasets. “Running time” refers to the time required for the model to converge during training. To ensure fair evaluation, all methods were assessed using the same hardware. Because of the model’s complexity, GAT-DDPG initially requires more training time. However, after approximately 20 iterations, it achieves significant results, as depicted in Figure 8. The reduction in the number of iterations substantially shortens the overall running time, as illustrated in Figure 9. This demonstrates the effectiveness of the GAT-DDPG model, enabling flexible adjustment and rapid response to dynamic traffic conditions, thereby achieving faster decision convergence.

Figure 9.

Running times for different models: (a) synthetic datasets and (b) real datasets. Note: DDPG = Deep Deterministic Policy Gradient; DQN = Deep Q-Networks; FRAP = Flip and Rotation invariant All Phase; GAT-DDPG = Graph Attention Network-Deep Deterministic Policy Gradient.

Simultaneously, we considered model variants and conducted ablation experiments to validate the effectiveness of each module:

DDPG: Retains the DDPG component for learning traffic network states while removing the GAT algorithm component to assess the impact of GAT on model performance.

GCN-DDPG: Replaces GAT, used for learning traffic network states, with GCN to evaluate the impact of GAT on model performance.

GAT-DDPG-noRWR: Retains the GAT component for learning traffic network states and only removes the RWR algorithm component to assess the impact of RWR on model performance.

Figure 10 compares the results of ablation experiments on three real datasets. From the experimental results, it can be concluded that integrating GAT into the DDPG model effectively improves model performance, reducing the ATT of vehicles. Simultaneously, incorporating the RWR module effectively enhances the model’s convergence speed, which is crucial for enhancing training efficiency and performance stability, especially when dealing with large-scale and complex traffic network data.

Figure 10.

Ablation experiment: (a) Hangzhou and Jinan, and (b) New York. Note: DDPG = Deep Deterministic Policy Gradient; GAT-DDPG = Graph Attention Network-Deep Deterministic Policy Gradient; GAT-DDPG-noRWR = Graph Attention Network-Deep Deterministic Policy Gradient without Random Walk with Restart; GCN-DDPG = Graph Convolutional Network-Deep Deterministic Policy Gradient.

Conclusion

Accurately capturing and comprehensively representing traffic states remains a major challenge in the current traffic signal control domain. Since the traffic condition of an intersection is often profoundly affected by multiple intersections in its vicinity, this intricate interdependence greatly increases the complexity and difficulty of state acquisition. To address this challenge, we propose an innovative GAT-DDPG algorithm.

Specifically, GAT-DDPG uses GAT’s unique attention mechanism to accurately identify and measure the complex relationships and interdependencies among different intersections in the traffic network. This capability allows GAT not only to capture direct influences between neighboring intersections but also to reveal potential indirect connections and interactions among intersections at a distance. Through this deep understanding of complex intersection relationships, GAT significantly enhances the perception capability of traffic network states. Additionally, GAT-DDPG integrates RWR technology, effectively reducing GAT’s reliance on local neighboring nodes and mitigating sensitivity to the input graph structure. This improvement provides a more comprehensive representation of traffic network states, enhancing the model’s generalization ability. Finally, the AC architecture of the DDPG algorithm enables more flexible and efficient traffic signal control strategies.

GAT-DDPG can potentially update and adjust the state of traffic signal control systems in real-time to adapt to actual traffic flow and congestion conditions. Importantly, it possesses powerful automatic learning capabilities that allow it to directly identify and extract key features from the complex data of traffic networks, without the need for human intervention to define or select the importance of features. This characteristic enables GAT-DDPG to better adapt to and handle various traffic network topologies.

We conducted comparative experiments with other algorithms such as MDQN, FRAP, and DDPG in synthetic road networks (including 3 × 3 and 6 × 6 traffic networks) and three real road networks: Hangzhou (16 intersections), Jinan (12 intersections), and New York (196 intersections), under heavy traffic flow conditions. The experimental results demonstrate that the GAT-DDPG algorithm consistently outperforms others across different complexities of traffic networks, affirming its robustness. Additionally, we conducted ablation experiments that further validate the significant role of algorithmic components in enhancing model training efficiency and stability. Furthermore, across the three real datasets with different traffic network structures, GAT-DDPG exhibited superior convergence compared with benchmark algorithms such as CoLight, MDQN, FRAP, and DDPG. It rapidly adapts and achieves excellent performance on diverse datasets, underscoring its strong generalization capability.

The method proposed in this paper still has some limitations, mainly because it currently focuses only on vehicular traffic, ignoring key factors such as pedestrians and weather. To overcome these limitations, future research is planned to incorporate more diverse external data such as pedestrian behavior models as well as weather conditions, which may help to improve the performance of the model and make the algorithm more comprehensively adaptable to the complex and changing urban traffic environment.

Footnotes

Author Contributions

The authors confirm contribution to the paper as follows: study conception and design: Guoqing Yang, Xin Wen; data collection: Fuqiang Chen, Xin Wen; analysis and interpretation of results: Fuqiang Chen, Xin Wen; draft manuscript preparation: Xin Wen. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Tianjin Postgraduate Research Innovation Project (Approval Number: 2022SKYZ394, 2021YJSS336).

ORCID iD

Fuqiang Chen

References

Bharadwaj

Ballare

Chandel

M. K.

Impact of Congestion on Greenhouse Gas Emissions for Road Transport in Mumbai Metropolitan Region. Transportation Research Procedia, Vol. 25, 2017, pp. 3538–3551.

Grote

Williams

Preston

Kemp

Including Congestion Effects in Urban Road Traffic CO2 Emissions Modelling: Do Local Government Authorities Have the Right Options?

Transportation Research Part D: Transport and Environment, Vol. 43, 2016, pp. 95–106.

Jame

Rapelli

Casetti

Reducing Pollutant Emissions through Virtual Traffic Lights. Computer Communications, Vol. 188, 2022, pp. 167–177.

Genders

Razavi

Evaluating Reinforcement Learning State Representations for Adaptive Traffic Signal Control. Procedia Computer Science, Vol. 130, 2018, pp. 26–33.

Araghi

Khosravi

Creighton

A Review on Computational Intelligence Methods for Controlling Traffic Signal Timing. Expert Systems with Applications, Vol. 42, 2015, pp. 1538–1550.

Tunc

Yesilyurt

A. Y.

Soylemez

M. T.

Different Fuzzy Logic Control Strategies for Traffic Signal Timing Control with State Inputs. IFAC-PapersOnLine, Vol. 54, 2021, pp. 265–270.

Tunc

Soylemez

M. T.

Fuzzy Logic and Deep Q Learning Based Control for Traffic Lights. Alexandria Engineering Journal, Vol. 67, 2023, pp. 343–359.

Toan

T. D.

Wong

Y. D.

Lam

S. H.

Meng

Developing a Fuzzy-Based Decision-Making Procedure for Traffic Control in Expressway Congestion Management. Physica A: Statistical Mechanics and its Applications, Vol. 604, 2022, p. 127899.

Zhai

Sun

Zhou

Chen

Multiagent Control Approach with Multiple Traffic Signal Priority and Coordination. Journal of Transportation Engineering, Part A: Systems, Vol. 149, No. 1, 2023, p. 04022124.

10.

Wang

A Genetic Timing Scheduling Model for Urban Traffic Signal Control. Information Sciences, Vol. 576, 2021, pp. 475–483.

11.

Darmoul

Elkosantini

Louati

Said

L. B.

Multi-Agent Immune Networks to Control Interrupted Flow at Signalized Intersections. Transportation Research Part C: Emerging Technologies, Vol. 82, 2017, pp. 290–313.

12.

Wang

Jiang

Qin

A Multi-Agent Based Cellular Automata Model for Intersection Traffic Control Simulation. Physica A: Statistical Mechanics and its Applications, Vol. 584, 2021, p. 126356.

13.

Bálint

Tamás

Deep Reinforcement Learning Based Approach for Traffic Signal Control. Transportation Research Procedia, Vol. 62, 2022, pp. 278–285.

14.

Rasheed

Yau

K.-L. A.

Low

Y.-C.

Deep Reinforcement Learning for Traffic Signal Control Under Disturbances: A Case Study on Sunway City, Malaysia. Future Generation Computer Systems, Vol. 109, 2020, pp. 431–445.

15.

Liu

Sheng

Chen

Shi

Ran

Longitudinal Control of Connected and Automated Vehicles Among Signalized Intersections in Mixed Traffic Flow with Deep Reinforcement Learning Approach. Physica A: Statistical Mechanics and its Applications, Vol. 629, 2023, p. 129189.

16.

Chen

Yuan

Liu

Zhao

Yang

Liu

Vehicle Detection from Road Image Sequences for Intelligent Traffic Scheduling. Computers and Electrical Engineering, Vol. 95, 2021, p. 107406.

17.

Mao

A Comparison of Deep Reinforcement Learning Models for Isolated Traffic Signal Control. IEEE Intelligent Transportation Systems Magazine, Vol. 15, No. 1, 2023, pp. 160–180.

18.

Wang

F.-Y.

Traffic Signal Timing via Deep Reinforcement Learning. IEEE/CAA Journal of Automatica Sinica, Vol. 3, 2016, pp. 247–254.

19.

Liu

Qin

Luo

Wang

Yang

Intelligent Traffic Light Control by Exploring Strategies in an Optimised Space of Deep Q-Learning. IEEE Transactions on Vehicular Technology, Vol. 71, 2022, pp. 5960–5970.

20.

Chow

A. H.

Zhong

Adaptive Network Traffic Control with an Integrated Model-Based and Data-Driven Approach and a Decentralised Solution Method. Transportation Research Part C: Emerging Technologies, Vol. 128, 2021, p. 103154.

21.

Liang

Fang

Zhong

Oam: An Option-Action Reinforcement Learning Framework for Universal Multi-Intersection Control. Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 2022, pp. 4550–4558.

22.

Zhu

Cai

Schwarz

C. W.

Xiao

Intelligent Traffic Light via Policy-Based Deep Reinforcement Learning. International Journal of Intelligent Transportation Systems Research, Vol. 20, 2022, pp. 734–744.

23.

Chu

Wang

Codecà

Multi-Agent Deep Reinforcement Learning for Large-Scale Traffic Signal Control. IEEE Transactions on Intelligent Transportation Systems, Vol. 21, 2020, pp. 1086–1095.

24.

Maske

Chu

Kalabić

Control of Traffic Light Timing Using Decentralized Deep Reinforcement Learning. IFAC-PapersOnLine, Vol. 53, 2020, pp. 14936–14941.

25.

Liu

X.-Y.

Zhu

Borst

Walid

Deep Reinforcement Learning for Traffic Light Control in Intelligent Transportation Systems. arXiv Preprint arXiv:2302.03669, 2023.

26.

Shen

Telikani

Fahmideh

Liang

Distributed Agent-Based Deep Reinforcement Learning for Large Scale Traffic Signal Control. Knowledge-Based Systems, Vol. 241, 2022, p. 108304.

27.

Qiao

Tian

Multiagent Soft Actor–Critic for Traffic Light Timing. Journal of Transportation Engineering, Part A: Systems, Vol. 149, 2023, p. 04022133.

28. Borges F. S. P., Fonseca A. P., Garcia R. C. Deep Reinforcement Learning Model to Mitigate Congestion in Real-Time Traffic Light Networks. Infrastructures, Vol. 6, No. 10, 2021, p. 138.

29. Mohamad Alizadeh Shabestary, Abdulhai. Adaptive Traffic Signal Control with Deep Reinforcement Learning and High Dimensional Sensory Inputs: Case Study and Comprehensive Sensitivity Analyses. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, 2022, pp. 20021–20035.

30. Zhu, Ding. Auto-Learning Communication Reinforcement Learning for Multi-Intersection Traffic Light Control. Knowledge-Based Systems, Vol. 275, 2023, p. 110696.

31. Devailly F.-X., Larocque, Charlin. IG-RL: Inductive Graph Reinforcement Learning for Massive-Scale Traffic Signal Control. IEEE Transactions on Intelligent Transportation Systems, Vol. 23, No. 7, 2021, pp. 7496–7507.

32. Yoon, Ahn, Park, Yeo. Transferable Traffic Signal Control: Reinforcement Learning with Graph Centric State Representation. Transportation Research Part C: Emerging Technologies, Vol. 130, 2021, p. 103321.

33. Yang. Hierarchical Graph Multi-Agent Reinforcement Learning for Traffic Signal Control. Information Sciences, Vol. 634, 2023, pp. 55–72.

34. Wang, Zhu, Zhang, Tian, Zhang. A Large-Scale Traffic Signal Control Algorithm Based on Multi-Layer Graph Deep Reinforcement Learning. Transportation Research Part C: Emerging Technologies, Vol. 162, 2024, p. 104582.

35. Wang. Urban Traffic Signal Control with Reinforcement Learning from Demonstration Data. Proc., 2022 International Joint Conference on Neural Networks (IJCNN), Padua, Italy, IEEE, New York, 2022, pp. 1–8.

36. Wang, Taitler, Smirnov, Sanner, Abdulhai. eMARLIN: Distributed Coordinated Adaptive Traffic Signal Control with Topology-Embedding Propagation. Transportation Research Record: Journal of the Transportation Research Board, 2024. 2678: 189–202.

37. Multi-Agent Deep Deterministic Policy Gradient for Traffic Signal Control on Urban Road Network. Proc., 2020 IEEE International Conference on Advances in Electrical Engineering and Computer Applications (AEECA), Dalian, China, IEEE, New York, 2020, pp. 896–900.

38. Liu, Ding. A Distributed Deep Reinforcement Learning Method for Traffic Light Control. Neurocomputing, Vol. 490, 2022, pp. 390–399.

39. Koonce. Traffic Signal Timing Manual. FHWA-HOP-08-024, US Department of Transportation, 2008. https://ops.fhwa.dot.gov/publications/fhwahop08024/index.htm. Accessed March 5, 2024.

40. Varaiya. Max Pressure Control of a Network of Signalized Intersections. Transportation Research Part C: Emerging Technologies, Vol. 36, 2013, pp. 177–195.

41. Kouvelas, ShangGuan, Makridis M. A. Dynamic Capacity Estimation of Mixed Traffic Flows with Application in Adaptive Traffic Signal Control. Physica A: Statistical Mechanics and its Applications, Vol. 606, 2022, p. 128065.

42. Van der Pol, Oliehoek F. A. Coordinated Deep Reinforcement Learners for Traffic Light Control. Proceedings of Learning, Inference and Control of Multi-Agent Systems (at NIPS2016), Vol. 8, 2016, pp. 21–38.

43. Wei, Zheng, Yao. IntelliLight: A Reinforcement Learning Approach for Intelligent Traffic Light Control. Proc., 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, Association for Computing Machinery, New York, 2018, pp. 2496–2505.

44. Nishi, Otaki, Hayakawa, Yoshimura. Traffic Signal Control Based on Reinforcement Learning with Graph Convolutional Neural Nets. Proc., 2018 21st International Conference on Intelligent Transportation Systems (ITSC), Maui, HI, IEEE, New York, 2018, pp. 877–883.

45. Wei, Zhang, Zheng, Zang, Chen, Zhang, Zhu. CoLight: Learning Network-Level Cooperation for Traffic Signal Control. Proc., 28th ACM International Conference on Information and Knowledge Management, Beijing, China, Association for Computing Machinery, New York, 2019, pp. 1913–1922.

46. Haddad T. A., Hedjazi, Aouag. A Deep Reinforcement Learning-Based Cooperative Approach for Multi-Intersection Traffic Signal Control. Engineering Applications of Artificial Intelligence, Vol. 114, 2022, p. 105019.

47. Zheng, Xiong, Zang, Feng, Wei, Zhang. Learning Phase Competition for Traffic Signal Control. Proc., 28th ACM International Conference on Information and Knowledge Management, Beijing, China, Association for Computing Machinery, New York, 2019, pp. 1963–1972.

48. Chai, Xiong. An Effective Deep Reinforcement Learning Approach for Adaptive Traffic Signal Control. Proc., 2020 Chinese Automation Congress (CAC), Shanghai, China, IEEE, New York, 2020, pp. 6419–6425.