Optimizing Train Scheduling in Heavy-Haul Railways Using Diversified Cooperative Deep Reinforcement Learning

Abstract

With the increasing role of railway transportation in the globalized economy, the issue of loading delays has emerged as a critical factor affecting the efficiency and reliability of rail transport. This paper addresses the loading-delay problem in heavy-haul railway transportation by proposing an optimization method based on a diversified cooperative deep reinforcement-learning network (DCD-RLN). Within an actor–critic framework, this method integrates optimization algorithms with loading-delay models to develop a tailored strategy for adjusting train schedules during delays. The proposed approach considers constraints such as empty and loaded unit trains, coupling requirements, and competitive and cooperative interactions among multiple intelligent agents, aiming to minimize assembly times for unit trains and reduce waiting and assembly times during loading and coupling phases for loaded unit trains. Comparative experimental results with existing solutions demonstrate the significant advantages of the DCD-RLN method in enhancing the robustness and efficiency of train operations. This study not only provides a domain-specific reinforcement-learning framework for railway-transportation researchers but also offers valuable references for practical operations in railway-transportation enterprises.

Keywords

railway-transportation efficiency; loading-delay optimization; diversified cooperative deep reinforcement-learning network (DCD-RLN); actor–critic framework; intelligent agent interactions

Introduction

In the context of a globalized economy, rail transportation has become a vital link connecting economic activities across different regions because of its large-scale and long-distance transport capabilities, relatively low transportation costs, and minimal environmental impact. The efficient operation of rail transportation systems is crucial for supporting economic development, ensuring supply-chain stability, and meeting societal demands ( 1 ). With the rapid growth of the economy and the acceleration of urbanization, rail transportation plays an increasingly significant role in the transportation systems of various regions ( 2 ). It is not only an efficient, safe, and environmentally friendly mode of transport but also plays a key role in addressing the growing demand for passenger and freight transport ( 3 ). Technological innovation and digital advancement, such as high-speed railways, intelligent train-scheduling systems, and advanced rail transport technologies, have provided strong support for improving the efficiency and quality of rail transportation ( 4 ). Moreover, the application of digital technology ensures accurate data collection and analysis, providing a powerful tool for optimizing rail transport networks and enhancing the safety and reliability of railway lines ( 5 ).

Rail freight plays a significant role globally, with different countries developing distinct transportation practices based on their geography, economy, and policy conditions. This paragraph aims to provide an overview and comparison of the practices in rail-freight organization in the United States, China, and European countries ( 6 ). In the United States, rail freight is primarily operated by private companies such as Union Pacific Railroad and BNSF Railway. These companies typically own their rolling stock and tracks, transporting bulk commodities like coal and grain. The main characteristic of American railways is their efficient transportation of long-haul, heavy-load cargo, which benefits from an extensive network of tracks and standardization of rolling stock ( 7 ). In contrast, rail freight in China is dominated by the state-owned China State Railway Group Co., Ltd. While meeting the substantial domestic freight demand, China’s railways have also promoted international freight development through the Belt and Road Initiative ( 8 ). A notable feature of China’s rail transport is the large-scale transportation of resource-based commodities such as coal and ores ( 9 ). In Europe, rail freight is more regulated and requires coordination between multiple countries and rail systems. Operators such as Deutsche Bahn and SNCF must adhere to the European Union’s transport regulations and standards ( 10 ). European rail transport emphasizes multimodality and environmental protection, improving transport efficiency through methods like combined road-rail transport ( 11 ). By comparison, it is evident that while the rail-freight practices of different countries have their own characteristics, they all face the common challenges of increasing transportation efficiency and reducing costs. Additionally, the impact of vehicle ownership on the flexibility of transport scheduling is also worthy of attention ( 12 , 13 ).

In recent years, numerous scholars have contributed significantly to the literature on train-scheduling optimization. The efficiency and punctuality of rail transportation systems are crucial for modern transportation networks. Lamorgese and Mannino ( 14 ) proposed an exact decomposition method in 2015 to address real-time train-scheduling problems by processing conflicts and delays in train scheduling, providing an effective scheduling strategy for railway operations. Subsequently, in 2018, Lamorgese et al. ( 15 ) further explored train-scheduling issues, summarizing various models and solutions for train-scheduling problems in the Handbook of Optimization in the Railway Industry, laying a theoretical foundation for decision support in the railway industry. With the increasing complexity of rail transportation systems, traditional scheduling methods face new challenges. Liao et al. ( 16 ) employed deep reinforcement learning (RL) in 2021 to solve the train-timetable rescheduling problem under disturbances, aiming to save energy. Their method not only responds in real time to disturbances in train operation but also significantly reduces energy consumption, which is of great significance for improving the sustainability and economic benefits of rail transportation. Narayanaswami and Rangaraj ( 17 ) provided a comprehensive review and analysis of the scheduling and rescheduling problems in railway operations in 2011, offering a valuable perspective for understanding the complexity of railway scheduling problems. Following this, in 2013, they proposed a mixed-integer linear programming (MILP) model for optimizing train-timetable rescheduling after railway traffic disruptions, considering not only the minimization of total delays but also other performance objectives, providing a new solution for railway scheduling problems ( 18 ).

The optimization of rail transportation systems is a complex issue, involving multiple aspects such as train scheduling, timetable rescheduling, and route planning. In their 2012 research, Yan and Yang ( 19 ) proposed a mixed-integer programming-based method to address the train motion planning problem and developed heuristic algorithms and decomposition techniques, providing an effective solution for train-scheduling problems. Boccia et al. ( 20 , 21 ) further explored train-scheduling problems in multi-track regions in their research in 2012 and 2013, using MILP and heuristic methods to optimize train scheduling and operation, offering new perspectives and methods for train scheduling in multi-track areas. Adenso-Díaz et al. ( 22 ) proposed an online train-timetable rescheduling method in their 1999 research for regional train services, one which can respond in real time to disturbances in train services and improve the reliability of train timetables. Tornquist and Persson ( 23 ) used tabu search and simulated annealing algorithms to handle train traffic deviations in their 2005 research, techniques which can effectively deal with train traffic deviations and improve the flexibility and efficiency of train scheduling. Higgins et al. ( 24 ) proposed heuristic techniques for single-line train scheduling in their 1997 research, while Cai et al. ( 25 ) proposed a greedy heuristic method for rapid scheduling of single-track trains in their 1998 research, providing effective solutions for single-line train-scheduling problems. Samà et al. ( 26 ) proposed a variable neighborhood search method in their 2017 research for rapid train scheduling and routing under railway traffic disturbances, which can effectively handle railway traffic disturbances and improve the efficiency of train scheduling and routing.
These research works demonstrate the application of various methods from mixed-integer programming to heuristic algorithms, and then to variable neighborhood search in train-scheduling problems, providing a variety of solutions to improve the efficiency and reliability of rail transportation systems.

Despite significant advancements in technology and management within the railway-transportation sector, numerous challenges persist, particularly the issue of loading delays. Loading delays can precipitate a decline in transportation efficiency and trigger a cascade of subsequent problems, such as train delays and the postponement of cargo delivery, resulting in substantial economic losses for railway-transportation enterprises ( 27 ). Moreover, loading delays can have an impact on the stability and reliability of the entire transportation network, increase transportation costs, and even affect the efficiency of the entire supply chain ( 28 ). Consequently, optimizing train scheduling to mitigate the impact of loading delays has become a pressing issue in the field of railway transportation.

In the field of freight-train scheduling, a multitude of scholars have conducted extensive research, focusing primarily on train consist planning, timetable optimization, and cargo transportation route selection. For instance, researchers have utilized multi-objective nonlinear 0-1 programming models to optimize train consist plans, thereby enhancing transportation efficiency ( 29 ). However, these studies often overlook the issue of loading delays, that is, the waiting time of trains during the loading and coupling stages, which directly affects the operational efficiency of trains and may lead to a decline in the efficiency of the entire transportation network.

This study addresses the loading-delay issue and proposes an optimization method based on a diversified cooperative deep reinforcement-learning network (DCD-RLN). Within the actor–critic framework, this method integrates optimization algorithms with loading-delay models to develop a customized strategy for adjusting train scheduling during delays. The main contribution of this research is a new optimization model that adjusts train scheduling during loading delays by considering constraints such as empty and loaded cars, coupling requirements, and competitive and cooperative interactions among multiple intelligent agents. The primary goal is to minimize the assembly time of individual trains and reduce the waiting and assembly time of heavy unit trains during loading and coupling stages. Experimental results show that this method significantly improves the robustness and efficiency of train operations. The capacity of trains, the amount of available rolling stock, and the depot-entry and -exit operations are also considered. Three solution approaches are proposed to solve the resulting multi-objective mixed-integer nonlinear programming (MINLP) problem to deliver both an irregular train schedule (i.e., departure and arrival times of all train services) and a rolling-stock circulation plan (including depot-entry and -exit operations of rolling stock and connections between train services) simultaneously.
The scholarly research paradigm is illustrated in Figure 1.

Figure 1.

The scholarly research paradigm.

Through the in-depth analysis and empirical research of this study, we have not only filled the gap in the existing research on loading delays but also provided a new optimization strategy for the field of railway transportation to cope with the increasing transportation demands and complex transportation environments. Our research findings will help railway transportation enterprises to improve operational efficiency, reduce costs, and ultimately enhance the stability and reliability of the entire transportation network. The subsequent sections of this manuscript are structured as follows. The next section reviews the existing literature on train-rescheduling problems. The third section delineates the methodology for modeling the coal-transportation overflow within railway networks, encompassing the development of the model framework, the formulation of objective functions, and the delineation of constraints. The fourth section elucidates the implementation of DCD-RLNs to address the model. The fifth section evaluates the efficacy of the proposed approach through a series of small-scale case studies. Finally, the sixth section synthesizes the research findings presented in this paper and posits potential avenues for future inquiry.

Related Work

In recent years, numerous learning-based methods have been proposed for intelligent transportation systems. Table 1 summarizes some representative existing methods in this field. To the best of our knowledge, Šemrov et al. ( 32 ) were the first to employ the Q-learning algorithm in a train-rescheduling method, in which the state was defined by the train positions, track availability, and current time. To circumvent the problem of a large state space, Khadilkar et al. ( 38 ) defined the state by employing only the track availability information surrounding the current train. To enhance the decision-making performance, Prasad et al. ( 31 ) adopted evolutionary strategies to search for better parameters of the policy network. Li and Ni ( 32 ) introduced a timetable scheme based on multi-agent deep RL, in which each train is independently controlled based on its own operation information, including its departure time and whether it stops at a station. Although the aforementioned methods have achieved some promising results, their learned policies fail to maintain good performance when extended to various problem instances. The main reason is the absence of delay information in the state definitions, which makes it difficult to distinguish different delay scenarios. As a consequence, the same actions may be taken in different delay scenarios, thereby degrading the performance.

Table 1.

A Brief Summary of Existing Learning-Based Methods

Work	Problem setting	State definition
Šemrov et al. ( 32 )	Train rescheduling with multiple train delays	Train locations, current time, and track availability
Prasad et al. ( 33 )	Train scheduling/rescheduling with multiple train delays	Track availability around the train to be controlled
Li and Ni ( 34 )	Train rescheduling with multiple train delays	Departure time, whether to stop at stations, and running direction of the current train
Zhu et al. ( 35 )	Train rescheduling with multiple train delays	Delay time, location, and local track availability of the current train
Ning et al. ( 36 )	Train rescheduling with multiple train delays	Actual arrival time and departure time of trains
Ghasempour et al. ( 37 , 38 )	Train rescheduling in single/multiple junctions	Arrival and departure time of trains entering junction
Wang et al. ( 39 )	Train rescheduling with single train delay	Planned arrival time, planned departure time, actual arrival time, and actual departure time
Our work	Train rescheduling with multiple train delays	Train delays

Zhu et al. ( 33 ) attempted to incorporate train delay information into the state definition, and experimental results showed that the resulting policy could achieve better performance. Ning et al. ( 34 ) defined the state by using the actual arrival and departure times of all trains, thus enabling the learned policy to distinguish each delay scenario within a fixed scheduled timetable. However, when confronted with various train-delay situations, their method still cannot maintain good performance, primarily because the learned policy is optimized based on a single problem instance. Moreover, Ghasempour et al. ( 35 , 36 ) proposed to learn the policy from multiple instances, but the policy was only used to address the delay scenarios at railway intersections. Although Yin et al. introduced a policy-based RL method for the timetable, where the policy was also learned from many delay scenarios, only one train was allowed to be delayed in their problem instance. However, in practical scenarios, it is not uncommon for multiple trains to be delayed simultaneously ( 37 ).

The aforementioned studies primarily focus on problem-solving in delay scenarios for passenger-transportation activities. However, delay scenarios in heavy-haul transportation caused by factors such as loading delays and adverse weather conditions have not been adequately explored. To effectively address these issues, this paper proposes a domain-specific RL framework. By utilizing states to represent the delay conditions of trains, the learned policy can directly determine the state changes at the next time stamp $(t + 1)$ . In addition, this paper constructs a delay-scenario model for the loading-delay problem of trains, ensuring that trains strictly adhere to railway rules during operation.

Unlike standard RL algorithms that assume generic state transitions and unconstrained actions, our method embeds railway-specific structural constraints—including dynamic track availability, yard-level marshaling restrictions, and scheduling-slot rigidity—into the state, action, and reward modules. Furthermore, domain-aware action masking ensures only feasible train movements are considered during training, significantly narrowing the policy search space and accelerating convergence. This integration of operational logic into the RL loop enables more robust and realistic decision-making, which is not achievable through vanilla deep Q-network (DQN) or Proximal Policy Optimization (PPO) implementations.
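The action-masking idea described above can be illustrated with a minimal sketch. It is not the implementation used in this paper: the function name `masked_softmax`, the example logits, and the feasibility flags are all hypothetical, and the sketch assumes a discrete action set with one logit per candidate train movement.

```python
import math
from typing import Sequence


def masked_softmax(logits: Sequence[float], feasible: Sequence[bool]) -> list[float]:
    """Turn policy logits into probabilities, giving infeasible actions zero mass."""
    if len(logits) != len(feasible):
        raise ValueError("logits and feasible must have the same length")
    allowed = [x for x, ok in zip(logits, feasible) if ok]
    if not allowed:
        raise ValueError("at least one action must be feasible")
    # Subtract the largest feasible logit for numerical stability.
    peak = max(allowed)
    weights = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(logits, feasible)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical example: four candidate movements, two ruled out by
# track availability or marshaling restrictions.
logits = [1.0, 2.0, 0.5, 3.0]
feasible = [True, False, True, False]
probs = masked_softmax(logits, feasible)
```

Because the masked actions receive exactly zero probability, the policy can only sample movements that satisfy the operational rules, which is the sense in which masking narrows the search space.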

Modeling Transportation Scenarios in Heavy-Haul Railway Coal Corridors

We abstract the loading-delay optimization problem in heavy-haul railway coal corridors as a multifaceted assignment problem, focusing on the allocation of empty trains that enter the target route’s loading zone from the transportation-corridor interface ( 4 ). These trains are distributed to either loading stations or disassembly stations. After the designated empty trains are disassembled into unit trains, the unit trains undergo a second assignment step. After loading at the loading stations, the vehicles change from unit empties to unit loads. At this point, a decision is made whether to dispatch the train directly or to assemble it at a technical station with other train sets before departure ( 39 ). The flow of loaded and empty trains during resequencing in these overloaded heavy-haul railway coal corridors is depicted in Figure 2.

Figure 2.

Sequencing diagram of loaded- and empty-train flow continuity in loading operations.

In the realm of railway train scheduling, loading delays are a pervasive challenge, often attributed to a myriad of factors such as traffic congestion, inclement weather conditions, or malfunctions in loading equipment. To mitigate these delays and optimize the train schedule, a comprehensive modeling approach is indispensable. This approach necessitates a deep understanding and articulation of the objectives, constraints, and decision variables that are intrinsic to the problem at hand. By addressing these complexities within the overloaded heavy-haul railway coal corridors, this research endeavors to fortify the robustness and efficiency of train operations, providing a solid foundation for operational enhancement in rail-freight logistics.

Assumptions of the Model

The model assumes the following.

The foundational data of the model, including the coal-source demand plan from the shipper, the location settings of line stations, and standard times for coal transportation to the loading yard, is known.

The capabilities of heavy-haul railway lines and stations along the route, such as throughput capacities and loading efficiency at loading stations, are known.

Coal-source demands within a given timeframe do not change. Railway transportation typically operates on a 24 h decision cycle, and, for computational efficiency, optimization is conducted with minutes as the basic unit. The line is assumed to be a double-track system, and the combination of operations for large and small trains can only occur at technical stations. Grouping at technical stations aims to align with large trains or tonnage.

The causes of loading delays at the loading yard are not constrained by temporal or spatial limitations. This includes events such as adverse weather conditions (rain, snow, mudslides), breakdowns of land transport vehicles, or track fractures within the loading station.

Once the train schedule is set, an empty train being loaded must reach full-axle or full-weight conditions before departure; otherwise, it waits indefinitely.

To simplify the research dimension from mines to loading stations, transportation time and relaxation time are uniformly used for representation.

Since the focus of this study is the single-commodity transportation process in heavy-haul railways, the impact of passenger-transportation schedules on freight is not considered.

Two main causes of loading delays are considered. In respect of delays caused during the loading process, it is assumed that the quantity of empty trains meets the needs of each research station. For delays arising from the inability to guarantee loading for empty trains, it is assumed that once the conditions for empty-train arrival are met, loading stations can immediately execute the transition from empty to loaded.

The “train” is taken as the smallest research unit. The railway’s coal-transportation channel typically employs rolling stock for cyclic coal transportation. The use of “trains” as the basic unit of deployment aligns with the actual transportation mode of the railway.

Explanation of Symbols

Set Configuration and Explanations

$S$ is the set of stations within the research line; $i$ denotes the index of the loading or technical station where the train originates. Station attributes are indexed as $z$ and $j$: $s_{i}^{z}$ indicates that station $i$ is a loading station; $s_{i}^{j}$ indicates that station $i$ is a technical station; and $s_{i}^{z, j}$ indicates that station $i$ is both a loading and a technical station, with $s \in S$ and $S = s^{z} \cup s^{j} \cup s^{z, j}$.

$K$ is the set of empty-train types; $k = 1$ represents a 5000 t unit empty train, $k = 2$ represents a 10,000 t unit empty train, and $k = 3$ represents a 20,000 t unit empty train, $K = \{k \mid k = 1, 2, 3\}$.

$M$ is the set of combined empty-train identifiers continuously joining the line from the division point; $m \in M$, with a total quantity denoted as $| M |$. The set of identifiers for unit empty trains joining at the division point is $N_{1}$, with a total quantity denoted as $| N_{1} |$. After the decomposition of combined empty trains into unit trains with double locomotives, the set of identifiers for unit empty trains can be represented as $N_{2} = \{| N_{1} | + 1, \dots, | N_{1} | + | M |\}$. The set of identifiers for the latter combined empty trains can be represented as $N_{3} = \{| N_{1} | + | M | + 1, \dots, | N_{1} | + 2 | M |\}$.
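As a small illustration of how the identifier ranges $N_{1}$, $N_{2}$, and $N_{3}$ fit together, the sketch below builds them from the two counts $| N_{1} |$ and $| M |$. The function name `build_identifier_sets` is hypothetical, and numbering $N_{1}$ as $1, \dots, | N_{1} |$ is an assumption inferred from the definitions of $N_{2}$ and $N_{3}$ rather than stated explicitly.

```python
def build_identifier_sets(num_unit: int, num_combined: int):
    """Return (N1, N2, N3) as lists of consecutive integer identifiers.

    num_unit     -- |N1|, unit empty trains joining at the division point
    num_combined -- |M|, combined empty trains joining at the division point
    """
    if num_unit < 0 or num_combined < 0:
        raise ValueError("counts must be non-negative")
    # Assumption: N1 is numbered 1..|N1|.
    n1 = list(range(1, num_unit + 1))
    # N2 = |N1|+1 .. |N1|+|M|
    n2 = list(range(num_unit + 1, num_unit + num_combined + 1))
    # N3 = |N1|+|M|+1 .. |N1|+2|M|
    n3 = list(range(num_unit + num_combined + 1, num_unit + 2 * num_combined + 1))
    return n1, n2, n3


# Hypothetical instance: three unit empty trains and two combined empty trains.
n1, n2, n3 = build_identifier_sets(num_unit=3, num_combined=2)
```

For this instance the three ranges are disjoint and together cover the identifiers 1 through $| N_{1} | + 2 | M | = 7$, so every train produced by decomposition has its own identifier.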

Setting and Explanatory Notes on Model Parameters

$q_{k}$ is the total number of single-unit or combined empty trains received from the separation point.

$q_{k}^{s_{i}}$ is the number of $k$ -type empty trains received from the separation point and directly scheduled to station $s_{i}$ as planned.

$E_{k}^{s_{i}}$ is the total number of $k$ -type empty trains required at station $s_{i}$ based on the loading task.

$u_{s_{i}, s_{i + 1}}$ represents the line capacity between station $s_{i}$ and its next connecting station $s_{i + 1}$ , where $u \in U$ .

$t_{nk}$ represents the moment when the $k$ -type empty train with the access number $n$ enters the junction, where $n \in M \cup N_{1}$ .

$t_{g}^{s^{z}}$ is the departure time of the graph-planned loaded train at station $s^{z}$ , which occurs after the completion of the loading task and is the $g$ th scheduled departure in sequence.

$t_{ng}^{s^{z}}$ represents the time at which the $g$ th scheduled departure of the unit train with the identifier $n$ ( $n \in N_{1} \cup N_{2} \cup N_{3}$ ) occurs, connecting with the previously loaded train at station $s^{z}$ after the completion of loading.

$t_{k}^{s_{i_{travel}}}$ represents the travel time for an empty train of type $k$ to move from the division point to station $s_{i}$ in accordance with the plan.

$t_{s_{i}, s_{e}}^{k}$ represents the transportation time of a heavy-haul train of type $k$ from station $s_{i}$ to the unloading station $s_{e}$ .

$t_{i, de}^{k}$ , $t_{i, pa}^{k}$ , and $t_{i, lo}^{k}$ represent the additional departure time, parking time, and loading operation time, respectively, for a heavy-haul train of type $k$ at station $s_{i}$ .

$t_{i, de}^{j}$ , $t_{i, pa}^{j}$ , and $t_{i, lo}^{j}$ represent the additional departure time, parking time, and disassembly or combination operation time, respectively, for a heavy-haul train at technical station $s^{j}$ .

$ϖ$ is the number of groups into which a combined train is decomposed into unit trains (referred to as decomposition groups in this paper).

Decision Variable Definition and Explanations

This section defines the binary decision variables, which are based on the technical operations for unit or combined empty trains and on the sequencing of scheduled loaded trains at each station.

If the boundary interface accepts a type- $k$ empty train with a given identifier $n$ at time $t_{nk}$ , where $n \in N_{1}$ , it is assigned to the loading operation at station $s_{i}^{z}$ . In this case, $X_{nk}^{s_{i}^{z}}$ is set to 1; otherwise, $X_{nk}^{s_{i}^{z}}$ is set to 0.

If an empty train of type $k$ with identifier $n$ is assigned to the technical station $s^{j}$ for loading operation at time $t_{nk}$ , where $n \in N_{1}$ , then $X_{nk}^{s^{j}}$ is set to 1; otherwise, $X_{nk}^{s^{j}}$ is set to 0.

If a combined empty train of type $k$ with identifier $m$ is assigned to the technical station $s_{i}^{j}$ for disassembly operation at time $t_{mk}$ , where $m \in M$ , then $Y_{mk}^{s_{i}^{j}}$ is set to 1; otherwise, $Y_{mk}^{s_{i}^{j}}$ is set to 0.

If a type- $k$ empty train with identifier $α$ (where $α \in N_{3}$ ) disassembled at technical station $s_{i}^{j}$ is assigned to station $s_{i^{'}}^{z}$ for loading operation, then $Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}}$ is set to 1; otherwise, $Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}}$ is set to 0.

If a type- $k$ empty train with identifier $α$ (where $α \in N_{3}$ ) disassembled at technical station $s_{i}^{j}$ is assigned to another technical station $s_{i^{'}}^{j}$ for loading operation, then $Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{j}}$ is set to 1; otherwise, $Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{j}}$ is set to 0.

If a type- $k$ empty train with identifier $p$ (where $p \in N_{2}$ ) disassembled at technical station $s^{j}$ is loaded at the combined loading and technical station $s^{z, j}$ , then $W_{kp}^{s^{j}, s^{z, j}}$ is set to 1; otherwise, $W_{kp}^{s^{j}, s^{z, j}}$ is set to 0.

If a type- $k$ empty train with identifier $u$ (where $u \in N_{1} \cup N_{3}$ ) completed loading at loading station $s_{i}^{z}$ and is sent to technical station $s_{i^{'}}^{j}$ for the assembly of a combined train, then $H_{ku}^{s_{i}^{z}, s_{i^{'}}^{j}}$ is set to 1; otherwise, $H_{ku}^{s_{i}^{z}, s_{i^{'}}^{j}}$ is set to 0.

If a type- $k$ empty train with identifier $n$ (where $n \in N_{1} \cup N_{2} \cup N_{3}$ ) completes loading at station $s^{z}$ , and directly follows the scheduled departing combined train with identifier $g$ , then $L_{k, ng}^{s^{z}}$ is set to 1; otherwise, $L_{k, ng}^{s^{z}}$ is set to 0.

If a type- $k$ train experiences a loading delay at the loading station $s_{i}^{z}$ , then $O_{k}^{s_{i}^{z}} = 1$ ; otherwise, $O_{k}^{s_{i}^{z}} = 0$ .
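One convenient way to hold these 0-1 variables in code is to store only the index tuples that equal 1. The sketch below is illustrative only: the class `BinaryVariable`, its methods, and the station labels "A" and "B" are hypothetical and are not taken from the model.

```python
class BinaryVariable:
    """Sparse 0-1 decision variable that stores only the index tuples equal to 1."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._ones: set[tuple] = set()

    def assign(self, *index) -> None:
        """Set the variable to 1 for the given index tuple."""
        self._ones.add(index)

    def value(self, *index) -> int:
        """Return 1 if the index tuple was assigned, otherwise 0."""
        return 1 if index in self._ones else 0


# Hypothetical instance: station labels "A" and "B" are placeholders.
X = BinaryVariable("X")  # indexed by (n, k, loading station)
H = BinaryVariable("H")  # indexed by (k, u, loading station, technical station)

X.assign(1, 1, "A")       # unit empty train n=1 of type k=1 loads at station A
H.assign(1, 1, "A", "B")  # after loading, it is sent to technical station B

# The factor (1 - sum over technical stations of H) equals 1 only when the
# loaded train departs directly instead of going to a technical station.
technical_stations = ["B"]
departs_directly = 1 - sum(H.value(1, 1, "A", j) for j in technical_stations)
```

Here `departs_directly` evaluates to 0 because the train was routed to a technical station for assembly, so it would not contribute to a direct-departure term.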

Objective Function Formulation and Explanation

The optimization objective for empty-train deployment is to minimize the total transportation time in the event of loading delays. In formulating the objective function, two scenarios of loading delays need to be considered. First, in cases where coal delivery to the loading station is delayed or there are facility malfunctions causing unit trains to wait indefinitely, the objective is to schedule unit trains to other loading stations to avoid prolonged waiting at the delayed loading station. Second, if the untimely deployment of empty trains results in the loading station’s inability to complete the empty-to-loaded transition, the goal is to alleviate delays at the loading station by deploying a limited number of unit trains as efficiently as possible. This approach aims to address delays efficiently and effectively in both scenarios of loading delays.

Minimizing Unit-Train Connection Time

This means ensuring the shortest possible duration, as shown in Equation 1:

\begin{aligned}
\min Z_{1} = {} & \sum_{n \in N_{1}} \sum_{k} \sum_{s_{i}} \sum_{g} \Big[ X_{nk}^{s_{i}^{z}} \Big( 1 - \sum_{i^{'}} H_{kn}^{s_{i}^{z}, s_{i^{'}}^{j}} \Big) L_{k, ng}^{s^{z}} t_{ng}^{s^{z}} \Big] \\
& + \sum_{k^{'}} \sum_{k} \sum_{g} \sum_{n \in N_{2} \cup N_{3}} \sum_{s_{i}} \sum_{s_{i}^{y}} \sum_{m} Y_{m k^{'}}^{s_{i}^{j}} Z^{s_{i}^{j}, s_{i^{'}}^{j}} \Big( 1 - \sum_{s_{i}} H_{ku}^{s_{i}^{z}, s_{i^{'}}^{j}} \Big) L_{k, ng}^{s^{z}} t_{ng}^{d} \\
& + \sum_{k} \sum_{g} \sum_{s_{i} \in s^{j}} O_{k}^{s_{i}} \big( t_{i, de}^{k} + t_{i, lo}^{k} \big)
\end{aligned}

(1)

Minimizing the Waiting and Assembly Time of Unit Trains

In this objective, we aim to minimize the total waiting time and assembly time for unit trains awaiting grouping. This involves optimizing the time each unit train spends waiting to be assembled and the time required for the actual grouping process. The objective function considers various factors, including the arrival times of single-unit trains, the scheduling of assembly operations, and the associated time durations. The goal is to efficiently organize the unit trains, ensuring the shortest possible waiting and assembly times. This objective is crucial for enhancing the overall efficiency and timely deployment of unit trains in the transportation system. This objective function can be shown in Equation 2:

\begin{aligned}
\min Z_{2} = {} & \sum_{n \in N_{1}} \sum_{p \in N_{2}} \sum_{s_{i}} \sum_{k} \sum_{s^{j}} \Big[ X_{nk}^{s_{i}^{z}} H_{kn}^{s_{i}^{z}, s_{i^{'}}^{j}} W_{kp}^{s^{j}, s^{z, j}} \big( t_{np}^{s_{i}^{z}} + t_{k}^{s^{j}} \big) \Big] \\
& + \sum_{k^{'} \in K, k^{'} \neq 1} \sum_{s^{j}} \sum_{m \in M} \sum_{s_{i}} \sum_{α \in N_{3}} \sum_{p \in N_{2}} \Big[ Y_{m k^{'}}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} H_{ku}^{s^{j}, s^{z, j}} W_{kp}^{s^{j}, s^{z, j}} \big( t_{α p}^{s^{j}} + t_{k}^{s^{j}} \big) \Big]
\end{aligned}

(2)

where $t_{np}^{s^{j}}$ represents the time required to group single-unit empty train $n$ with unit train $p$ ( $p \in N_{2}$ ) after it completes the loading operations at station $s_{i}^{j}$ , and $t_{α p}^{s^{j}}$ denotes the assembly time between unit trains $α$ and $p$ ( $p \in N_{2}$ ).

Analysis

Although both objective functions defined in this section are cost functions measured in time, they contribute differently to the efficiency of heavy-haul transportation production, and each can be analyzed from a different perspective. The first equation emphasizes the flexibility of “load and go,” that is, the swift completion of loading and immediate deployment into transportation. The second equation focuses on the coordinated optimization of loading and grouping, as well as the enhancement of overall transportation capacity through the capacity tension in buffer intervals.

In practical scenarios, it is essential to consider factors such as station layout, optimal utilization of line and station capacities, composition of train flows, and train flow intensities. Therefore, the evaluation of the contribution of each formula needs to integrate these factors for a comprehensive analysis, enabling better adaptation to specific transportation environments and requirements.

To further clarify our approach, we combine the two optimization objectives into a single objective by means of a weighted sum. This is a common strategy for transforming multi-objective problems into single-objective ones, which allows for a more focused optimization process. The importance weight for unit-train connection time is set as $ω_{1}$ , and the importance weight for the waiting and assembly time of unit heavy trains awaiting assembly is set as $ω_{2}$ . The single objective is calculated as in Equation 3:

F (*) = ω_{1} \times Z_{1} (*) + ω_{2} \times Z_{2} (*)

(3)
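As a small numerical illustration of Equation 3, the following sketch combines two objective values with importance weights. The function name `weighted_objective`, the sample times, and the weights 0.6 and 0.4 are assumptions chosen for illustration; they are not values reported in this paper.

```python
def weighted_objective(z1, z2, w1, w2):
    """Scalarize two time-based objectives as in Equation 3: F = w1*Z1 + w2*Z2."""
    if w1 < 0 or w2 < 0:
        # Assumption: importance weights are non-negative.
        raise ValueError("importance weights are assumed to be non-negative")
    return w1 * z1 + w2 * z2


# Hypothetical values: Z1 = 420 min of connection time and Z2 = 300 min of
# waiting and assembly time, with illustrative weights 0.6 and 0.4.
f_value = weighted_objective(z1=420.0, z2=300.0, w1=0.6, w2=0.4)
print(f_value)
```

A larger $ω_{1}$ steers the schedule toward "load and go" behavior, whereas a larger $ω_{2}$ favors coordinated loading and grouping.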

Constraint Conditions

Constraints on Unit Empty- and Loaded-Train Connection

For a unit empty train with the identifier $n$ introduced from the division point, it will be assigned to a loading station for loading operations. If this unit empty train, after completing the loading operation, is not planned to be sent to a technical station to form a grouped train with other unit loads, then, when departing from the loading station, it must be connected to a predetermined unit-loaded train to complete the departure operation. Based on the above analysis, we can derive the following constraint shown in Equation 4:

\sum_{s_{i}} \sum_{g} X_{nk}^{s_{i}} L_{k, ng}^{s_{i}^{z}} + \sum_{s_{i^{'}}} \sum_{s_{i}} H_{ku}^{s_{i}^{z}, s_{i^{'}}^{j}} = 1

(4)

If the aforementioned unit empty train needs to be sent to a technical station after completing the loading task to form a grouped train with other unit loads, then it should only be able to complete the grouped-train formation operation at one technical station. This requirement can be expressed by the constraint given as Equation 5:

\sum_{s_{i}} \sum_{s_{i^{'}}} H_{ku}^{s_{i}^{z}, s_{i^{'}}^{j}} = 1 \forall n \in N_{1}

(5)

The same reasoning applies to the unit trains generated after a technical station completes the disassembly of combination trains. These unit trains should also satisfy the above constraints, as represented by Equation 6:

\sum_{s_{i}} \sum_{s_{i^{'}}} \sum_{g} \sum_{m} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} L_{k, ng}^{s^{z}} + \sum_{s_{i}} \sum_{s_{i^{'}}} H_{ku}^{s_{i}^{z}, s^{j}} = 1

(6)

Constraints on the Connection Between Scheduled and Pending Heavy Trains

The operational track of a specific scheduled heavy train at a loading station does not necessarily connect with a loaded unit train, so each scheduled heavy train is connected with at most one unit train, as shown in Equations 7 and 8:

\sum_{s_{i}} \sum_{n \in N_{1}} X_{nk}^{s_{i}^{z}} L_{k, ng}^{s^{z}} \leq 1

(7)

\sum_{s_{i}} \sum_{α \in N_{3}} \sum_{s_{i}} \sum_{m} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} L_{k, ng}^{s^{z}} \leq 1

(8)

Constraints on the Demand for Empty Trains at Loading Stations

In periods outside the designated operational windows, the interface continuously receives unit or composite empty trains. Currently, loading station $s_{i}$ has loading requirements according to the transportation plan. The $k$ -type unit or composite empty trains received at the interface are assigned to loading station $s_{i}$ to complete the scheduled loading operations. This assignment must meet the required number of unit wagons at loading station $s_{i}$ and can be formulated as in Equation 9:

\sum_{s_{i}} p_{k}^{s^{f}, s_{i}} + q_{k}^{s_{i}} = E_{k}^{s_{i}} \forall s_{i} \in S

(9)

where $\sum_{s_{i}} p_{k}^{s^{f}, s_{i}}$ represents the total quantity of $k$ -type unit wagons generated by decomposition at all technical stations and assigned to complete loading operations at $s_{i}$ ; and $q_{k}^{s_{i}}$ denotes the quantity of $k$ -type unit wagons directly assigned from the interface to loading station $s_{i}$ for loading operations.

Loading-Demand Constraints at Technical Stations

For a technical station $s^{z, j}$ that is capable of both loading and marshaling operations, the $k$ -type single-unit or combination empty trains that originate from the boundary interface and are assigned to $s^{z, j}$ for both loading and marshaling operations should satisfy the demand for $k$ -type single-unit empty trains required to complete loading operations at that station. This requirement can be expressed by Equation 10:

\begin{aligned} & \sum_{s_{i}^{j}} \sum_{α \in N_{3}} \sum_{m \in M} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} + \sum_{s_{i}^{j}} \sum_{m \in M} \sum_{p \in N_{2}} Y_{mk}^{s_{i}^{j}} W_{k α}^{s^{j}, s^{z, j}} \\ & + \sum_{n \in N_{1}} X_{nk}^{s_{i}^{z}} = z_{k}^{s_{i}^{z, j}} \quad \forall s^{z} \in Z \end{aligned}

(10)

where $z_{k}^{s_{i}^{j}}$ represents the demand for $k$ -type single-unit empty trains at technical station $s^{j}$ capable of completing loading operations; $\sum_{s_{i}^{j}} \sum_{α \in N_{3}} \sum_{m \in M} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}}$ denotes the total number of $k$ -type single-unit or combination empty trains that are decomposed by all technical stations, assigned to another loading station $s_{i^{'}}^{z}$ for loading operations, and not equipped with double locomotives; $\sum_{s_{i}^{j}} \sum_{m \in M} \sum_{p \in N_{2}} Y_{mk}^{s_{i}^{j}} W_{k α}^{s^{j}, s^{z, j}}$ represents the number of $k$ -type single-unit empty trains that are generated and assigned by all technical stations on the coal-transportation channel, directed to the loading station $s^{z}$ for loading operations, and equipped with double locomotives; and $\sum_{n \in N_{1}} X_{nk}^{s_{i}^{z, j}}$ is the quantity of single-unit empty trains directly transferred from the boundary interface to the technical station $s^{z, j}$ with loading capabilities.

Constraints on the Quantity Relationship Before and After Disassembling Combined Empty Trains

After the combined empty trains arrive at all technical stations and complete the disassembling operations, the resulting quantity of unit empty trains should be equal to the product of the quantity of each type of combined empty train and their disassembling group numbers, as shown in Equation 11:

\begin{aligned} & \sum_{s_{i}} \sum_{α \in N_{3}} \sum_{m \in M} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} \\ & + \sum_{s_{i}} \sum_{z} \sum_{m \in M} \sum_{p \in N_{2}} (Y_{mk}^{s_{i}^{j}} \times W_{k α}^{s^{j}, s^{z, j}}) = ϖ q_{k} \end{aligned}

(11)

From another perspective, the quantity of unit empty trains generated by the disassembly of combined empty trains at technical station $s^{j}$ must equal the sum of the unit empty trains produced there without double locomotives and those produced there with double locomotives. This relationship constraint can be expressed using Equation 12:

\sum_{s_{i}^{j}} \sum_{α \in N_{3}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{z}} + \sum_{p \in N_{2}} \sum_{z} W_{k α}^{s^{j}, s^{z, j}} = \sum_{m \in M} ϖ Y_{mk}^{s_{i}^{j}} \quad \forall s_{i}^{j} \in S

(12)

Interval Passing Capacity Constraints

Assuming $q_{k}^{s_{i}, s_{i + 1}}$ represents the number of trains of type $k$ operating between two adjacent stations $s_{i}$ and $s_{i + 1}$ , the total number of unidirectional vehicles passing through this interval includes two aspects. On the one hand, it comprises the quantity of empty trains directly transferred from the separation point to station $s_{i + 1}$ and its more distant loading stations according to the plan. On the other hand, it involves the quantity of empty trains sent from $s_{i}$ and previous technical stations to station $s_{i + 1}$ and its more distant loading stations. This can be expressed by Equation 13:

\begin{aligned} & \sum_{k} \sum_{v = s + 1}^{| S |} q_{k}^{v} σ_{0, v}^{s_{i}, s_{i + 1}} + \sum_{m} \sum_{v = 0, v \in s^{j}}^{s} Y_{km}^{v} \sum_{d \in D} \sum_{α \in N_{3}} Z_{k α}^{vd} σ_{v, d}^{s_{i}, s_{i + 1}} \\ & = \sum_{k} q_{k}^{s_{i}, s_{i + 1}} \quad \forall s_{i}, s_{i + 1} \in S \end{aligned}

(13)

where $| S |$ represents the number of stations on the entire coal-transport route. Starting from the separation point, each station on the railway coal-transport route is sequentially numbered, with the station closest to the separation point assigned the number 0. Therefore, $σ_{0, v}^{s_{i}, s_{i + 1}}$ is a decision variable indicating whether a train is sent from the separation point to station $v (v \in S)$ and passes through the route between stations $s_{i}$ and $s_{i + 1}$ . If it is determined to pass through stations $s_{i}$ and $s_{i + 1}$ , then $σ_{0, v}^{s_{i}, s_{i + 1}} = TRUE$ ; otherwise, $σ_{0, v}^{s_{i}, s_{i + 1}} = FALSE$ . Previously, it was assumed that the capacity between stations $s_{i}$ and $s_{i + 1}$ is $u_{s_{i}, s_{i + 1}}$ . Therefore, the total number of trains of different types operating in the interval between $s_{i}$ and $s_{i + 1}$ should not exceed $u_{s_{i}, s_{i + 1}}$ . It can be expressed by Equation 14:

\sum_{k} q_{k}^{s_{i}, s_{i + 1}} \leq u_{s_{i}, s_{i + 1}} \quad \forall s_{i}, s_{i + 1} \in S

(14)
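The capacity check in Equation 14 amounts to summing the trains of all types on a section and comparing the total with the section capacity. A minimal sketch follows; the function name, the train-type labels, and the numbers are hypothetical and serve only to illustrate the inequality.

```python
def section_within_capacity(trains_by_type, capacity):
    """Check Equation 14: the trains of all types on one section must not exceed its capacity."""
    return sum(trains_by_type.values()) <= capacity


# Hypothetical section between two adjacent stations with capacity u = 20.
planned = {"single_unit": 14, "combined": 5}
feasible = section_within_capacity(planned, capacity=20)

# Adding two more single-unit trains pushes the total to 21 and violates the limit.
overloaded = dict(planned, single_unit=16)
infeasible = section_within_capacity(overloaded, capacity=20)
```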

Combining Equations 13 and 14, the final constraint on interval throughput capacity can be expressed as in Equation 15:

\begin{aligned} & \sum_{k} \sum_{v = s + 1}^{| S |} q_{k}^{v} σ_{0, v}^{s_{i}, s_{i + 1}} + \sum_{m} \sum_{v = 0, v \in s^{j}}^{s} Y_{km}^{v} \sum_{d \in D} \sum_{α \in N_{3}} Z_{k α}^{vd} σ_{v, d}^{s_{i}, s_{i + 1}} \\ & \leq \sum_{i}^{S} u_{s_{i}, s_{i + 1}} \quad \forall s_{i}, s_{i + 1} \in S \end{aligned}

(15)

Capacity Constraints for Disassembly Operations at Technical Stations

The quantity of disassembled composite trains at technical station $s^{j}$ does not exceed its disassembly capacity $C^{f}$ . This component of the constraints can be expressed as Equation 16:

\sum_{m \in M} Y_{mk}^{s_{i}^{j}} \leq C^{f} \forall s_{i}^{j} \in S

(16)

Constraints on the Sorting Capacity of Technical Stations

The constraint on the number of combined trains composed by technical station $s^{j}$ is that it does not exceed its maximum composition capacity $C^{z}$ , as shown in Equation 17:

\sum_{n \in N_{1}} \sum_{s_{i}} \sum_{k} X_{nk}^{s_{i}} H_{nk}^{s_{i}^{z}, s_{s^{'}}^{j}} + \sum_{s_{i}} \sum_{m \in M} \sum_{α \in N_{3}} (Y_{mk}^{s_{i}^{j}} \times H_{k α}^{s_{i}^{z}, s_{i^{'}}^{j}}) \leq C^{z}

(17)

Constraints on the Quantity Relationship Between Disassembled Combined Empty Trains and Assembled Combined Loaded Trains

Under the assumption of fixed locomotives, the total quantity of disassembled combined empty trains at various technical stations should be consistent with the total quantity of assembled combined loaded trains at those stations, as expressed in Equation 18.

\begin{aligned} & \sum_{n \in N_{1}} \sum_{s_{i}} X_{nk}^{s_{i}^{j}} H_{nk}^{s_{i}^{z}, s_{i^{'}}^{j}} + \sum_{s_{i}} \sum_{m \in M} \sum_{α \in N_{3}} Y_{mk}^{s_{i}^{j}} Z_{k α}^{s_{i}^{j}, s_{i^{'}}^{j}} \\ & = \sum_{m \in M} \sum_{s_{i}} \sum_{p \in N_{2}} Y_{mk}^{s_{i}^{j}} W_{kp}^{s^{j}, s^{z, j}} \quad \forall s_{i}^{j} \in S \end{aligned}

(18)

Continuity Time Constraints

For a unit train $i$ entering station $s_{i}$ from the boundary, the arrival time at station $s_{i}$ is given by $t_{ki}^{s_{i}} = t_{ki} + t_{k}^{s_{i_{travel}}} + t_{i, pa}^{k}$ . Let $t_{g}^{s_{i}}$ denote the departure time of the $g$ th scheduled unit heavy train at station $s_{i}$ . The continuity time for the unit train with index $i$ to connect with the departure of the $g$ th scheduled unit heavy train can then be expressed using Equation 19:

t_{ig}^{s_{i}} = \begin{cases} t_{g}^{s_{i}} - t_{ki}^{s_{i}}, & t_{g}^{s_{i}} - t_{ki}^{s_{i}} \geq t_{k}^{s_{i}^{lo}} \\ t_{g}^{s_{i}} + 1440 - t_{ki}^{s_{i}}, & t_{g}^{s_{i}} - t_{ki}^{s_{i}} < t_{k}^{s_{i}^{lo}} \end{cases}

(19)
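Equation 19 can be read as follows: if the gap between the scheduled departure and the arrival is at least the loading time, the train connects on the same day; otherwise it waits for the same departure slot on the next day, which adds 1440 minutes. The sketch below assumes all times are in minutes since midnight; the function name and the sample times are illustrative only.

```python
MINUTES_PER_DAY = 1440


def connection_time(departure, arrival, loading_time):
    """Continuity time of Equation 19, with all times in minutes.

    departure    -- departure time of the g-th scheduled unit heavy train
    arrival      -- arrival time of the unit train at the loading station
    loading_time -- time needed to complete loading at that station
    """
    gap = departure - arrival
    if gap >= loading_time:
        # Enough time to load before the scheduled departure on the same day.
        return gap
    # Otherwise connect with the same scheduled departure on the next day.
    return gap + MINUTES_PER_DAY


# Hypothetical case: departure at 10:00 (600), loading takes 90 minutes.
same_day = connection_time(departure=600, arrival=480, loading_time=90)
next_day = connection_time(departure=600, arrival=540, loading_time=90)
```

With these assumed inputs, an arrival at 08:00 leaves a 120-minute gap and connects the same day, while an arrival at 09:00 leaves only 60 minutes and is pushed to the next day.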

For a combined train with index $m$ entering the boundary, the time it takes for the train to arrive at $s^{j}$ and complete the disassembly is given by $t_{ki}^{s_{j}} = t_{km} + t_{k}^{s_{i}^{j_{travel}}} + t_{k}^{s_{i}^{j_{pa}}} + t_{k}^{s_{i}^{j_{de}}}$ , where $i = | N_{1} | + | M | + m$ . Alternatively, when the train is sent to station $s_{i}$ , the time spent is $t_{ki}^{s_{i}} = t_{ki}^{s_{i^{'}}^{j}} + t_{k}^{s_{i^{'}}^{j_{sp}}} + t_{k}^{s_{i}, s_{i_{j}^{'}}} + t_{k}^{s_{i}^{pa}}$ . However, when a station serves as both a loading station and a technical station, $t_{k}^{s_{i}} = t_{ki}^{s_{i}^{j}} + t_{tr}^{s_{i}}$ , where $t_{tr}^{s_{i}}$ is the time for the unit train after disassembly to be delivered from the access line to the loading line. The time for the unit train to be sent to $s_{i^{'}}$ after loading at $s_{i}$ is $t_{ki}^{s_{i}, s_{i^{'}}} = t_{ki}^{s_{i}} + t_{k}^{s_{i}^{lo}} + t_{k}^{s_{i}^{de}} + t_{k}^{s_{i}, s_{i^{'}}} + t_{k}^{s_{i^{'}}^{pa}}$ . However, when the loading station and technical station are the same station, the assigned unit train with double locomotives to the technical station $s^{j}$ with index $p = | N_{1} | + m$ completes the loading operation in time $t_{p}^{s_{i}^{me}} = t_{km} + t_{k}^{s_{i}^{travel}} + t_{k}^{s_{i}^{pa}} + t_{k}^{s_{i}^{de}} + t_{k}^{s_{i}^{j}, s_{i^{'}}^{j}} + t_{k}^{s_{i^{'}}^{pa}} + t_{k}^{s_{i^{'}}^{lo}}$ . Then, the waiting time required to group two unit heavy trains (one of which is a unit heavy train with double locomotives) into a combined heavy train at this technical station is given by Equation 20:

t^{s_{i}^{j}} = | t_{p}^{s_{i}^{j}} - t_{ki}^{s_{i}^{j}, s_{i^{'}}^{j}} | \quad p \in N_{2}, i \in N_{1} \cup N_{2}

(20)

Optimal Adjustment Method for Train-Scheduling Plans in the Scenario of Loading Delay

In contrast to generic RL applications, our approach embeds domain-specific operational rules and characteristics of heavy-haul freight railways into the model. For instance, the agent’s decision-making process is constrained by realistic scheduling windows, maximum axle loads, station throughput capacities, and coupling policies that reflect actual operations in Chinese coal corridors. Additionally, the multi-level coordination of empty and loaded-train units—subject to real-world marshaling and recombination constraints—is modeled explicitly, allowing our RL framework to capture the granular dynamics of heavy-haul scheduling in a way general RL setups cannot.

Algorithmic Symbol Representation and Flowchart Design

When traditional RL algorithms are applied in multi-agent settings, each agent continuously learns and improves its strategy, so the environment appears dynamically unstable (non-stationary) from the perspective of every other agent ( 40 ). This situation contradicts the convergence conditions of traditional RL algorithms because the dynamic nature of the environment introduces uncertainty into the agent’s learning. Additionally, because of the instability of the environment, traditional RL techniques, such as experience replay (ER), are of limited effectiveness when dealing with multi-agent systems ( 41 ). It is noteworthy that increasing the number of agents further exacerbates the variance problem when using policy-gradient (PG) algorithms ( 42 ).

In a multi-agent environment, the growing interaction among agents increases the uncertainty and variance in policy updates. This escalation makes it challenging for traditional PG algorithms to adapt to dynamically unstable environments, as the adjustment of a single agent’s policy struggles to effectively address the overall dynamic nature of the environment ( 43 , 44 ). In this context, traditional RL algorithms like DQN face even more significant challenges because they struggle to handle the complex interactions and cooperative learning problems among agents in multi-agent systems ( 45 ). Therefore, in researching novel RL algorithms, it is crucial to consider how to address dynamic instability in multi-agent environments and adopt corresponding strategies to enhance the robustness and performance of the algorithm.

In the domain of RL, one of the most prevalent approaches is the utilization of function approximation. Because of the formidable expressive capabilities of neural networks, they are frequently chosen to approximate the function $Q$ . When the action space in RL is continuous (infinite), the neural network’s input consists of the state $s$ and action $a$ , with its output being a scalar that represents the value obtained by taking action $a$ in state $s$ . Conversely, when the action space in RL is discrete (finite), in addition to the method used for continuous actions, it is also possible to input only the state $s$ into the neural network, which then concurrently outputs the $Q$ -value for each action. Typically, DQNs are tailored to handle discrete actions, given that the update process of the $Q$ -function involves the $\max_{a}$ operation. Assuming the parameters used by the neural network to fit the function are $w$ , the $Q$ -value for all possible actions $a$ in each state $s$ can be denoted as $Q_{w} (s, a)$ . Here, the neural network employed to fit the $Q$ -function is referred to as the $Q$ -network, as illustrated in Figure 3.

Figure 3.

Schematic diagram of DQN network operation.
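To make the role of the $\max_{a}$ operation concrete, the sketch below shows how a state-only $Q$ -network output (one value per discrete action) is used to pick a greedy action and to form the standard Q-learning target $r + γ \max_{a} Q (s^{'}, a)$ . The list of $Q$ -values stands in for a network output, and all names and numbers are assumptions for illustration rather than part of the method described in this paper.

```python
def greedy_action(q_values):
    """Index of the action with the largest Q-value (the arg max over discrete actions)."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


def q_learning_target(reward, next_q_values, gamma, done=False):
    """Standard Q-learning target: r + gamma * max_a Q(s', a), or r at a terminal state."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)


# Hypothetical network output for three dispatching actions in the next state.
next_q = [0.2, 1.5, -0.3]
best = greedy_action(next_q)
target = q_learning_target(reward=1.0, next_q_values=next_q, gamma=0.95)
```

Because the maximum is taken by enumerating the outputs, this construction only works when the action set is finite, which is why DQN-style methods are normally restricted to discrete actions.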

The actor–critic algorithm framework is divided into two parts: the actor network (policy network) and the critic network (value network). The responsibility of the policy network is to interact with the environment and learn a better policy using PG under the guidance of the value network. The value network’s role is to learn a value function from the data collected through the interactions between the policy network and the environment. This value function is used to determine which actions are good and which are not in the current state, thereby aiding the policy network in updating its policy. The policy network is updated based on the principles of PG, while the value network, denoted $V_{ω}$ with parameters $ω$ , is updated by temporal-difference (TD) learning. For a single data point, the loss function of the value function is defined as in Equation 21:

L (ω) = \frac{1}{2} {(r + γ V_{ω} (s_{t + 1}) - V_{ω} (s_{t}))}^{2}

(21)

Similar to DQN, a method akin to the target network is used here. The term $r + γ V_{ω} (s_{t + 1})$ in Equation 21 is employed as the temporal-difference target, which does not produce gradients for updating the value function. Therefore, the gradient of the value function can be expressed using Equation 22:

\nabla_{ω} L (ω) = - (r + γ V_{ω} (s_{t + 1}) - V_{ω} (s_{t})) \nabla_{ω} V_{ω} (s_{t})

(22)

The parameters of the value network can be updated using the gradient-descent method.
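Equations 21 and 22 can be illustrated with a linear value function $V_{ω} (s) = ω^{\top} φ (s)$ , for which $\nabla_{ω} V_{ω} (s) = φ (s)$ . The sketch below is a minimal example under that assumption; the feature vectors, learning rate, and function names are hypothetical, and a real critic would be a neural network trained with automatic differentiation.

```python
def value(weights, features):
    """Linear value estimate V(s) = w . phi(s)."""
    return sum(w * x for w, x in zip(weights, features))


def td_update(weights, features, reward, next_features, gamma, lr):
    """One gradient-descent step on the TD loss of Equation 21.

    The TD target r + gamma * V(s') is treated as a constant, so the gradient
    is -(td_error) * phi(s), which matches Equation 22 for a linear V.
    """
    td_error = reward + gamma * value(weights, next_features) - value(weights, features)
    loss = 0.5 * td_error ** 2
    grad = [-td_error * x for x in features]
    new_weights = [w - lr * g for w, g in zip(weights, grad)]
    return new_weights, loss


# Hypothetical transition with two features per state.
w0 = [0.0, 0.0]
w1, loss0 = td_update(w0, features=[1.0, 0.0], reward=1.0,
                      next_features=[0.0, 1.0], gamma=0.95, lr=0.1)
w2, loss1 = td_update(w1, features=[1.0, 0.0], reward=1.0,
                      next_features=[0.0, 1.0], gamma=0.95, lr=0.1)
```

Repeating the update on the same transition shrinks the TD error, so the second loss is smaller than the first.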

The problem addressed in this research is train scheduling under loading delays. Based on the mathematical model and decision variables constructed above, this problem can be cast as a zero-sum stochastic game (SG) ( 46 ). In this research, a domain-specific RL framework, DCD-RLN, is adopted. This algorithm assumes a discrete mode of information exchange among multiple agents, operates on the actor–critic algorithm framework, utilizes the deep deterministic policy-gradient (DDPG) algorithm, and addresses scalability issues in large-scale multi-agent environments by enabling parameter sharing among multiple agents ( 47 ).

The SG problem, which involves a dynamic game process with multiple agents in multiple states, can be represented by the tuple $(S, \{A_{i}\}_{i = 1}^{N}, \{B_{i}\}_{i = 1}^{M}, T, \{R_{i}\}_{i = 1}^{N + M})$ . Here, $S$ represents the state space of the current optimization problem, shared among all agents. The initial state $s_{1}$ follows $s_{1} \sim p_{1} (s)$ . In the DCD-RLN method defined in this paper, the action space of agent $i$ is denoted as $A_{i}$ , and the action space of each other agent $j$ with a competitive relationship is denoted as $B_{j}$ . The function $T : S \times A^{N} \times B^{M} \to S$ represents the deterministic transition function of the environment, and $R_{i} : S \times A^{N} \times B^{M} \to ℜ$ represents the reward function for each agent $i \in [1, N]$ .

To keep the DCD-RLN framework flexible and able to accommodate different numbers of agents, and because different trains have competitive and cooperative relationships in train scheduling under loading delays, both competitors and collaborators share the same state space $S$ . Additionally, the agents within each group share the same action space, $A$ for cooperating agents and $B$ for competing agents. In other words, for each cooperating agent $i \in [1, N]$ and each competing agent $j \in [1, M]$ , $A_{i} = A$ and $B_{j} = B$ . In this study of train scheduling under loading delays, a time-varying global reward function based on the difference in reward level between two consecutive time steps is introduced, as shown in Equation 23:

r (s, a, b) = \frac{1}{M} \sum_{j = N + 1}^{N + M} Δ R_{j}^{t} (s, a, b) - \frac{1}{N} \sum_{i = 1}^{N} Δ R_{i}^{t} (s, a, b)

(23)

To simplify the representation of the global reward function, the subscript $t$ in the global reward $r (s, a, b)$ has been omitted in the following text. For any time step $t$ , the intelligent agents needing to choose actions will take the next action $a \in A^{N}$ , and the corresponding agents with competitive relationships will also take actions $b \in B^{M}$ . Using $Δ R_{i}^{t} (\cdot) \equiv R_{i}^{t - 1} (s, a, b) - R_{i}^{t} (s, a, b)$ , we can represent the cumulative rewards of all agents in the current overall environment, including those with cooperative and competitive relationships.
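The global reward of Equation 23 can be computed directly from per-agent reward levels at two consecutive time steps. The sketch below follows the equation as printed, with $Δ R_{i}^{t} = R_{i}^{t - 1} - R_{i}^{t}$ , averaging over the $M$ competing agents and the $N$ cooperating agents. The function name and the reward levels are hypothetical values used only to show the arithmetic.

```python
def global_reward(prev_coop, curr_coop, prev_comp, curr_comp):
    """Global reward of Equation 23 from reward levels at steps t-1 and t.

    prev_coop / curr_coop -- reward levels R_i of the N cooperating agents
    prev_comp / curr_comp -- reward levels R_j of the M competing agents
    """
    delta_coop = [p - c for p, c in zip(prev_coop, curr_coop)]
    delta_comp = [p - c for p, c in zip(prev_comp, curr_comp)]
    mean_comp = sum(delta_comp) / len(delta_comp)
    mean_coop = sum(delta_coop) / len(delta_coop)
    return mean_comp - mean_coop


# Hypothetical levels: two cooperating agents and three competing agents.
r = global_reward(prev_coop=[10.0, 8.0], curr_coop=[12.0, 9.0],
                  prev_comp=[5.0, 7.0, 6.0], curr_comp=[4.0, 7.0, 5.0])
```

In this assumed example the cooperating agents' levels rise (mean difference of -1.5) while the competing agents' levels fall (mean difference of 2/3), so the global reward is positive.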

Equation 23 shows that the reward is derived from each agent’s action, encompassing both cooperating and competing agents. The reward values for these two types of agents may be exactly opposite, resulting in a net reward of zero. According to Equation 23, the rewards for multiple agents in the entire environment will continuously increase over time. If the rewards for agents with competitive relationships exceed those for agents with cooperative relationships, the overall reward value is taken as its absolute value, ensuring a non-negative numerical outcome.

In the context of the defined global reward $r (s, a, b)$ , when agents with competitive relationships take joint actions $b$ , agents with cooperative relationships jointly take actions $a$ in the current state $s$ . The ultimate goal for the agents is to learn a strategy that maximizes the expected sum of discounted rewards.

The learning strategy for the DCD-RLN algorithm involves expanding a network of length $N$ (the number of intelligent agents) and applying backpropagation through time (BPTT) over the traversal time. Gradients are propagated to individual functions $Q_{i}$ and policy functions, which are aggregated from all intelligent agents and their chosen actions in the current state. The gradient of the cumulative reward from all agents is then propagated to influence the next actions of each intelligent agent, and the obtained gradients are further propagated back through the network to update the parameters.

Let $J_{i} (Θ)$ denote the objective for each intelligent agent $i$ , that is, maximizing its expected cumulative individual reward $r_{i}$ . This objective is represented as $J_{i} (Θ) = E_{s \sim ρ_{a_{Θ}}^{T}} [r_{i} (s, a_{Θ} (s))]$ , where $ρ_{a_{Θ}}^{T} (s)$ is the discounted state distribution corresponding to the policy $a_{Θ}$ under the transition $T$ . It can also be chosen as the stationary distribution of the Markov decision process (MDP). Therefore, the overall objective $J (Θ)$ for the $N$ agents can be expressed as Equation 24:

J (Θ) = E_{s \sim ρ_{a_{Θ}}^{T}} [\sum_{i = 1}^{N} r_{i} (s, a_{Θ} (s))]

(24)

The process of the DCD-RLN method is illustrated in pseudo-code as Algorithm 1 and the overall flowchart of the DCD-RLN algorithm is shown in Figure 4.

Algorithm 1 DCD-RLN algorithm
1: Initialize the actor network $ξ$ and critic network $Ƴ$
2: Initialize the target networks $ξ^{'} \leftarrow ξ$ and $Ƴ^{'} \leftarrow Ƴ$
3: Initialize the replay buffer $R$
4: for $episodes = 1, E$ do
   Initialize a random process $℧$ for agent action exploration. Determine the initial observation state $s^{1}$ based on the initial process $℧$ of the agent
5:   for $t = 1, T$ do
6:     For each agent $i$ , select the corresponding action $a_{i}^{t} = a_{i, θ} (s^{t}) + N_{t}$
7:     Obtain rewards ${[r_{i}^{t}]}_{i = 1}^{N}$ and the new observation state $s^{t + 1}$ based on the decision action
8:     Store the obtained previous observation state, next observation state, rewards, and actions in the replay buffer: $\{s^{t}, {[a_{i}^{t}, r_{i}^{t}]}_{i = 1}^{N}, s^{t + 1}\} \in R$
9:     Randomly sample a mini-batch of $M$ transitions from the replay buffer: ${\{s_{m}^{t}, {[a_{m, i}^{t}, r_{m, i}^{t}]}_{i = 1}^{N}, s_{m}^{t + 1}\}}_{m = 1}^{M} \in R$
10:    Utilize the DCD-RLN algorithm to compute the target values for each agent in each transition step
11:    for $m = 1, M$ do
12:      For each agent $i$ , compute ${\hat{Q}}_{m, i} = r_{m, i} + λ Q_{m, i}^{ξ^{'}} (s_{m}^{t + 1}, a_{θ^{'}} (s_{m}^{t + 1}))$ using the DCD-RLN algorithm
13:    end for
14:    Compute the critic gradient estimate using the formula:
15:    $Δ ξ = \frac{1}{M} \sum_{m = 1}^{M} \sum_{i = 1}^{N} [({\hat{Q}}_{m, i} - Q_{m, i}^{ξ} (s_{m}, a_{θ} (s_{m}))) \cdot \nabla_{ξ} Q_{m, i}^{ξ} (s_{m}, a_{θ} (s_{m}))]$
16:    Compute the actor gradient estimate according to the formula, replacing the Q-value with the critic value estimate:
17:    $Δ θ = \frac{1}{M} \sum_{m = 1}^{M} \sum_{i = 1}^{N} \sum_{j = 1}^{N} [\nabla_{θ} a_{j, θ} (s_{m}) \cdot \nabla_{a_{j}} Q_{m, i}^{ξ} (s_{m}, a_{θ} (s_{m}))]$
18:    Update the base networks using the gradient estimates above
19:    Update the target networks: $ξ^{'} \leftarrow γ ξ + (1 - γ) ξ^{'}, θ^{'} \leftarrow γ θ + (1 - γ) θ^{'}$
20:   end for
21: end for

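Two bookkeeping steps of Algorithm 1, storing and sampling transitions (steps 8 and 9) and the soft update of the target networks (step 19), can be sketched in plain Python. The class and function names are hypothetical, parameters are represented as flat lists of floats rather than network tensors, and the update rate of 0.1 is an assumed value; the sketch illustrates the mechanics only and is not the implementation used in this study.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def store(self, state, actions, rewards, next_state):
        # One entry holds the joint actions and rewards of all agents at a step.
        self.storage.append((state, actions, rewards, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def soft_update(target_params, online_params, rate):
    """Step 19: target <- rate * online + (1 - rate) * target, element-wise."""
    return [rate * o + (1.0 - rate) * t for t, o in zip(target_params, online_params)]


buffer = ReplayBuffer(capacity=3)
for step in range(5):
    # Hypothetical transition: scalar state, two agents' actions and rewards.
    buffer.store(step, [0, 1], [0.5, -0.5], step + 1)

batch = buffer.sample(2, rng=random.Random(0))
new_target = soft_update([0.0, 0.0], [1.0, 2.0], rate=0.1)
```

A small update rate keeps the target networks close to their previous values, which stabilizes the bootstrapped targets computed in step 12.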

Figure 4.

The DCD-RLN plays a pivotal role in establishing effective communication among agents by employing a cyclic network that facilitates the connection between individual agent strategies and their respective Q-networks. This network architecture is designed to enhance collaborative learning and decision-making processes within a multi-agent system.

Neural Network Architecture and Training Procedure

To implement the proposed DCD-RLN method, we adopt an actor–critic framework in which both the policy (actor) and value (critic) functions are approximated by neural networks. These networks are specifically tailored to capture the operational complexity of heavy-haul railway dispatching and to ensure scalability and adaptability under varying delay scenarios.

Policy Network (Actor)

The actor network is designed as a multilayer perceptron (MLP) composed of the following components:

Input dimension: Matches the state vector dimension, including variables such as train arrival times, station occupancy, load types, and delay indicators;

Hidden layers: Two fully connected layers with 128 and 64 units respectively;

Activation function: Rectified linear unit (ReLU) in hidden layers;

Output layer: A softmax layer producing a probability distribution over the discrete action space;

Output dimension: Equals the number of feasible train dispatching actions.

Value Network (Critic)

The critic network estimates the state value and adopts a structure similar to the actor:

Input dimension: Identical to the actor’s input;

Hidden layers: Two fully connected layers with 128 and 64 ReLU-activated units;

Output: A scalar value representing the expected return of the current state.
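For a sense of scale, the number of trainable parameters implied by the two architectures above can be counted layer by layer: a fully connected layer with $n_{in}$ inputs and $n_{out}$ outputs has $(n_{in} + 1) n_{out}$ weights and biases. The state dimension of 20 and the 10 dispatching actions used below are assumed values for illustration only, since the actual dimensions depend on the scenario.

```python
def mlp_parameter_count(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum((n_in + 1) * n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


STATE_DIM = 20    # hypothetical state-vector length
NUM_ACTIONS = 10  # hypothetical number of feasible dispatching actions

# Actor: state -> 128 -> 64 -> one output per action (softmax adds no parameters).
actor_params = mlp_parameter_count([STATE_DIM, 128, 64, NUM_ACTIONS])
# Critic: state -> 128 -> 64 -> scalar state value.
critic_params = mlp_parameter_count([STATE_DIM, 128, 64, 1])
```

Under these assumptions both networks have on the order of $10^{4}$ parameters, which gives a rough magnitude for the quantity $P$ used in the complexity analysis.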

Training Setup

The networks are trained using a synchronous actor–critic method with the following settings:

Optimizer: Adam with learning rate $η = 5 \times 10^{- 4}$ ;

Discount factor: $γ = 0.95$ ;

Batch size: 32;

Replay buffer size: 10,000 transitions;

Exploration strategy: $ϵ$ -greedy with linear decay from 1.0 to 0.1 over the first 1500 episodes;

Training episodes: 3000;

Normalization: State features are standardized to zero mean and unit variance to accelerate convergence.
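The exploration schedule listed above can be written as a simple function of the episode index. The sketch assumes episodes are counted from 0 and that $ϵ$ stays at its final value after the decay period; the function name and these indexing conventions are assumptions rather than details stated in the settings.

```python
def epsilon_at(episode, start=1.0, end=0.1, decay_episodes=1500):
    """Linearly decay epsilon from `start` to `end` over the first `decay_episodes` episodes."""
    if episode >= decay_episodes:
        return end
    fraction = episode / decay_episodes
    return start + fraction * (end - start)


# Epsilon at the start, halfway through the decay, at its end, and in the last episode.
schedule = [epsilon_at(e) for e in (0, 750, 1500, 2999)]
```

With 3000 training episodes, the second half of training therefore runs with a constant exploration rate of 0.1.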

Data-Driven Environment Calibration

The training environment is calibrated using real-world operational data from a coal-dedicated heavy-haul corridor in northern China. Historical data sources include:

Timestamped records of train arrivals and departures;

Delay logs at loading and technical stations;

Train formation and coupling records;

Throughput statistics at each station and section.

These records are used to parameterize delay distributions and simulate dynamic railway conditions. By incorporating empirical knowledge into the simulation environment, the learning agent is exposed to realistic and representative scheduling scenarios, enhancing the policy’s applicability to practical heavy-haul operations.
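One possible way to parameterize a delay distribution from such logs is sketched below. The delay values are invented for illustration, and the lognormal family is an assumption of this example; the text does not state which distribution family is fitted.

```python
import math
import random
import statistics

# Hypothetical loading-delay log in minutes (illustrative values, not corridor data).
delay_log_min = [12.0, 25.0, 8.0, 40.0, 15.0, 60.0, 22.0, 18.0, 33.0, 10.0]


def fit_lognormal(samples):
    """Estimate lognormal parameters as the mean and standard deviation of log-delays."""
    logs = [math.log(x) for x in samples]
    return statistics.mean(logs), statistics.stdev(logs)


def sample_delay(mu, sigma, rng):
    """Draw one simulated delay (minutes) for the training environment."""
    return rng.lognormvariate(mu, sigma)


mu, sigma = fit_lognormal(delay_log_min)
rng = random.Random(42)
simulated = [sample_delay(mu, sigma, rng) for _ in range(1000)]
```

Sampling from a fitted distribution, rather than replaying the historical log, lets each training episode see a different but statistically similar delay pattern.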

Computational Complexity of DCD-RLN Algorithm

The analysis uses the following notation:

$E$ : The total number of training episodes.

$T$ : The number of time steps per episode.

$N$ : The number of agents in the system.

$M$ : The number of transitions sampled from the replay buffer per time step.

$A$ : The computational cost of a forward pass through the actor network for one agent.

$C$ : The computational cost of a forward pass through the critic network for one agent.

$P$ : The total number of parameters in the actor and critic networks.

$| R |$ : The capacity of the replay buffer (used in initialization).

Single Time-Step Complexity

The computational complexity of the DCD-RLN algorithm can be analyzed based on its key components and operations. The analysis assumes the algorithm is executed over $E$ episodes, each with $T$ time steps, and involves $N$ agents interacting with the environment. Additionally, $M$ samples are drawn from the replay buffer at each time step.

Initialization Phase

The actor network $ξ$ and the critic network $Θ$ are initialized. Assuming the number of weights in both networks is $O (P)$ , the complexity of this step is $O (P)$ .

The target networks $ξ^{'} \leftarrow ξ$ and $Θ^{'} \leftarrow Θ$ are initialized with the same complexity, $O (P)$ .

Initializing the replay buffer involves allocating memory for $| R |$ transitions, resulting in a complexity of $O (| R |)$ .

Training Loop

As above, training runs for $E$ episodes of $T$ time steps each, with $N$ agents and $M$ transitions sampled from the replay buffer at every time step.

Action selection and interaction with the environment: For each agent, the complexity of selecting the action $a_{i}^{t}$ depends on the inference complexity of the actor network. Assuming the inference complexity of the actor network is $O (A)$ , then the total complexity at each time step is $O (N \cdot A)$ . The complexity of interacting with the environment and obtaining rewards and new states can generally be neglected.

Data storage and replay buffer sampling: Storing the current state, action, and reward in the replay buffer costs $O (1)$. Randomly sampling $M$ transitions from the replay buffer costs $O (M)$.

Target value calculation: For each of the $M$ sampled transitions, calculating the target value ${\hat{Q}}_{m, i}$ for each of the $N$ agents requires an inference pass through the critic network. Assuming the inference complexity of the critic network is $O (C)$, the total complexity is $O (M \cdot N \cdot C)$.

Gradient estimation and network updates: Estimating the critic gradient costs $O (M \cdot N \cdot C)$. Estimating the actor gradient costs $O (M \cdot N^{2} \cdot A)$, because the update for each of the $N$ agents requires the action gradients of all $N$ agents. Updating the base network weights costs $O (P)$, and updating the target networks also costs $O (P)$.

The computational complexity for a single time step is given by: $O (N \cdot A + M + M \cdot N \cdot C + M \cdot N^{2} \cdot A + P)$ .

Overall Complexity

The total computational complexity over $E$ episodes and $T$ time steps is: $O (E \cdot T \cdot (N \cdot A + M + M \cdot N \cdot C + M \cdot N^{2} \cdot A + P))$ .

Dominant Term (High-Dimensional Multi-Agent Systems)

For high-dimensional systems with many agents $N$ and a large sample size $M$, the actor-gradient term dominates, giving $O (E \cdot T \cdot M \cdot N^{2} \cdot A)$.
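The relative weight of the five per-step terms can be checked numerically. The sketch below uses $E = 3000$ from the training setup and takes $M = 32$ on the assumption that $M$ equals the batch size; the values of $T$, $N$, $A$, $C$, and $P$ are hypothetical and chosen only to illustrate the calculation.

```python
def per_step_cost(N, M, A, C, P):
    """Return the five terms of the per-time-step cost expression."""
    return {
        "action_selection": N * A,       # O(N * A)
        "replay_sampling": M,            # O(M)
        "critic_passes": M * N * C,      # O(M * N * C), targets and critic gradient
        "actor_gradient": M * N**2 * A,  # O(M * N^2 * A)
        "weight_updates": P,             # O(P), base and target networks
    }


def total_cost(E, T, N, M, A, C, P):
    """Multiply the per-step cost by the number of episodes and time steps."""
    return E * T * sum(per_step_cost(N, M, A, C, P).values())


# E and M follow the training setup; T, N, A, C, and P are hypothetical.
E, T, N, M = 3000, 100, 8, 32
A, C, P = 10_000, 9_800, 20_000

terms = per_step_cost(N, M, A, C, P)
step_total = sum(terms.values())
dominant = max(terms, key=terms.get)
share = terms[dominant] / step_total

print(f"dominant term: {dominant} ({share:.1%} of the per-step cost)")
print(f"total cost units: {total_cost(E, T, N, M, A, C, P):,}")
```

Under these illustrative values the actor-gradient term accounts for close to 90% of the per-step cost, which is why it is singled out as the dominant term.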

Intelligent Train-Dispatch Adjustment with DCD-RLN

In this paper, the loading-delay problem in heavy-haul railway transportation is modeled as an MDP and optimized using deep-RL techniques. An MDP provides a modular way to express the interaction between an agent and its environment through accumulated rewards, and consists of four key sub-modules: states, actions, state-transition functions, and rewards. In this study, the agent is defined as the collection of all train-operation states. By continuously updating the local states of the trains represented by the agent, the global environment of the heavy-haul transportation route can be learned and updated. The following sections define each of the four sub-modules for the loading-delay environment.

State Space

The state space represents the information that the agent receives from the loading-delay environment in heavy-haul railway transportation and is specifically defined as the current state at time step $t$ in RL. For the loading-delay problem in heavy-haul railway transportation, the agent makes decisions on whether to depart or wait based on the current state of the train. In the DCD-RLN algorithm used for train operation, the global state $s^{t}$ at time stamp $t$ is defined as the collection of states of all trains at the current time stamp $t$, that is, $s^{t} = \{ s_{k}^{t}, k \in K \}$. This paper defines three attributes for the state space. The local state of the train at time stamp $t$ is represented as $s_{k}^{t} = \{ c_{k}^{t}, b_{k}^{t}, o_{k}^{t} \}$, where $c_{k}^{t}$ denotes the remaining loading time of the train at the loading station; $b_{k}^{t}$ denotes the remaining departure time of the train at the technical station; and $o_{k}^{t}$ indicates whether the train is reclassified into another train. When $o_{k}^{t} = 0$, the train has no reclassification plan; when $o_{k}^{t}$ is a positive integer, the train is reclassified into multiple trains; and when $o_{k}^{t}$ is a negative integer, the train is reclassified into a combined train with other trains.
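The local and global states can be represented with a small data structure. The following is a sketch under assumptions: the class and function names are hypothetical, times are assumed to be in minutes, and the flattening order (sorted train index) is one possible convention for building the network input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainState:
    c: float  # remaining loading time at the loading station (minutes assumed)
    b: float  # remaining time until departure at the technical station (minutes assumed)
    o: int    # reclassification indicator: 0, positive, or negative

    def reclassification(self):
        """Interpret the sign of o as described in the text."""
        if self.o == 0:
            return "none"
        return "split" if self.o > 0 else "combined"


def flatten(global_state):
    """Concatenate the local states in a fixed train order to form one state vector."""
    vec = []
    for k in sorted(global_state):
        s = global_state[k]
        vec.extend([s.c, s.b, float(s.o)])
    return vec


# Hypothetical global state for three trains.
global_state = {
    1: TrainState(c=25.0, b=0.0, o=0),    # still loading, no reclassification plan
    2: TrainState(c=0.0, b=12.0, o=2),    # to be split into multiple trains
    3: TrainState(c=0.0, b=40.0, o=-1),   # to be combined with other trains
}
state_vector = flatten(global_state)
```

Each train contributes three entries, so the input dimension of the networks grows linearly with the number of trains in $K$.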

Action Space

In the DCD-RLN algorithm, at time stamp $t$, the agent executes an action $a^{t} \in A$ to determine the train operation plan for the loading-delay problem in heavy-haul railway transportation. The action space of $a^{t}$ represents a set of available actions, which can be specifically defined as $A = \{ 1, 2, 3, \dots, K \}$. In the DCD-RLN algorithm, the agent executes action $a^{t}$ at the departure time of each train. Thus, the execution of an action by a train implies that it will either wait (for loading or to become fully loaded) or depart. Only a single action is permitted to be executed at each time stamp $t$, thereby ensuring that the train satisfies the constraints (Equations 4–9).

State-Transition Function

At time stamp $t$, the agent executes the decision action $a^{t}$, thereby determining the train’s departure situation. Based on the global state $s^{t}$ at the current time stamp $t$, the global state $s^{t + 1}$ at the next time stamp $t + 1$ can be obtained through the state-transition function $s^{t + 1} \sim P (s^{t}, a^{t})$. Generally, when the agent makes a correct decision, the global state $s^{t}$ will change accordingly. At this point, the current train $k^{t}$ will execute the departure action, and its local state $s_{k}^{t} = \{ c_{k}^{t}, b_{k}^{t}, o_{k}^{t} \}$ will also be updated to the next time stamp $t + 1$.

Reward Function

A specific research objective is to update the learning policy through interactions with the learning environment. The reward function, serving as a feedback module for the agent’s actions, directly reflects the success or failure of those actions and thereby guides the agent toward optimal learning behavior. For the loading-delay problem in heavy-haul railway transportation, the agent in the DCD-RLN algorithm determines the behavior of trains at time stamp $t$. To meet the target set of the loading-delay problem, we construct Equation 23 as the target for the agent’s eventual convergence. Additionally, stepwise rewards are provided immediately after each action is executed, enabling the agent to learn the environment more rapidly and to better understand the feedback of each action. Stepwise rewards also help the DCD-RLN algorithm adapt to the changing learning environment. The stepwise reward setting is shown in Equation 24. The global reward is defined as: $R (s_{t}, a_{t}, b_{t}) = w_{1} \cdot z_{1} (s_{t}, a_{t}, b_{t}) + w_{2} \cdot z_{2} (s_{t}, a_{t}, b_{t})$.
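The weighted-sum form of the global reward can be illustrated with a short example. This is a sketch under assumptions: the components $z_{1}$ and $z_{2}$ are taken here to be the negatives of connection time and waiting/assembly time, and the equal weights are hypothetical; the exact definitions are those of Equations 23 and 24.

```python
def global_reward(z1, z2, w1=0.5, w2=0.5):
    """Weighted sum R = w1 * z1 + w2 * z2 (weights are hypothetical defaults)."""
    return w1 * z1 + w2 * z2


# Assumed components: negative times in minutes, so shorter times give higher reward.
plan_a = global_reward(z1=-120.0, z2=-45.0)  # 120 min connection, 45 min waiting/assembly
plan_b = global_reward(z1=-90.0, z2=-45.0)   # same waiting time, shorter connection

print(plan_a, plan_b)
```

Because both components enter with positive weights, any action that shortens one time without lengthening the other strictly increases the reward.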

Integration of Deterministic Scheduling Model and RL Framework

To establish a coherent and operational link between the deterministic scheduling model and the proposed RL framework, we construct the environment, state space, action space, and reward function of the RL model based on the parameters and constraints defined in the mathematical formulation.

Specifically, the state space encapsulates dynamic system status, including train identifiers, arrival times at loading/technical stations ( $t_{nk}$ , $t_{si}^{travel}$ ), station loading capacities ( $E_{si}^{k}$ ), and line capacities ( $u_{si, si + 1}$ ). These variables are directly drawn from the deterministic model’s input and constraint set. The action space represents discrete dispatching decisions, such as whether to send an empty train to a loading station ( $X_{si}^{nk} = 1$ ) or assign it to a coupling sequence ( $H_{s_{i}, s_{j}}^{nk} = 1$ ). These choices directly reflect decision variables in the deterministic model. The reward function is designed to align with the objectives $Z_{1}$ and $Z_{2}$ in Equations 1 and 2, aiming to minimize connection time and waiting/assembly time. Penalties are assigned for constraint violations, excessive delays, or infeasible combinations, thereby embedding the deterministic model logic into the RL training process. In this way, the RL agent learns optimal dispatching policies within the feasible space defined by the original deterministic problem. This integration ensures that the learned strategies are both data-driven and operationally grounded.

To further clarify the correspondence between the deterministic scheduling model and the components of the RL framework, we present Table 2, which summarizes how the key variables and constraints in the mathematical model are systematically mapped to elements in the RL environment. This mapping demonstrates how environment states, actions, and reward functions are grounded in the operational and structural logic of the original formulation.

Table 2.

Correspondence Between Deterministic Model Components and RL Elements

Deterministic model component	RL framework element	Description
Train arrival times $t_{nk}$; travel time $t_{si}^{travel}$	State variable	Encodes train progress and scheduling status
Loading demand $E_{si}^{k}$; line capacity $u_{si, si + 1}$	State variable / environment constraint	Reflects operational limits
Dispatch decision $X_{si}^{nk}$; coupling plan $H_{si, sj}^{nk}$	Action	Determines train assignment or assembly
Connection cost $Z_{1}$; waiting time cost $Z_{2}$	Reward function (negative)	Guides policy learning toward efficiency
Capacity constraint $u_{si, si + 1}$; disassembly constraint $C_{f}$	Action masking / penalty	Prevents infeasible actions during training
Scheduled departures $t_{sz}^{g}$; loading duration $t_{lo}$	State-transition dynamics	Used to update environment time state

Note: RL = reinforcement learning.

In addition, Figure 5 provides a schematic flowchart illustrating the integration process between the deterministic model and the RL framework. It visualizes the transformation of static model parameters into dynamic decision variables within the learning environment and highlights the constraint-driven masking mechanism that ensures feasibility and consistency with domain rules during policy optimization.

Figure 5.

Flowchart of the integration between the deterministic scheduling model and RL framework.
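The constraint-driven masking mechanism can be sketched as a masked softmax, in which infeasible actions receive zero probability before an action is sampled. The logits, section loads, and capacities below are hypothetical, and the simple capacity check stands in for the full constraint set of the deterministic model.

```python
import math


def masked_softmax(logits, feasible):
    """Softmax over logits, with infeasible actions forced to probability zero."""
    if not any(feasible):
        raise ValueError("at least one action must be feasible")
    masked = [l if ok else -math.inf for l, ok in zip(logits, feasible)]
    m = max(masked)                            # finite, since one action is feasible
    exps = [math.exp(l - m) for l in masked]   # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical example: three candidate dispatch actions, each using one section.
logits = [1.0, 2.0, 0.5]
section_load = [3, 5, 1]      # trains already scheduled on the section
section_capacity = [4, 5, 4]  # hypothetical passing capacity of the section

feasible = [load < cap for load, cap in zip(section_load, section_capacity)]
probs = masked_softmax(logits, feasible)
```

Masking at the policy output keeps exploration inside the feasible region, whereas a penalty-only scheme would let the agent try infeasible actions and learn to avoid them from negative rewards.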

Moreover, to reflect the unique characteristics of heavy-haul freight operations—such as high tonnage, double-locomotive traction, and long loading cycles—we explicitly encode load capacity, composition type (unit or combined train), and yard-specific coupling constraints into the state representation. These elements are typically irrelevant in passenger-train operations but are essential to capture the operational complexity of heavy-haul corridors. This differentiation ensures that our RL policy is both context-aware and operationally grounded in real-world heavy-haul practices.

Case-Validation Analysis

The proposed approach is evaluated on a representative case of a coal-dedicated heavy-haul railway corridor, where trains are configured with up to 20,000 t of load and operate under stringent infrastructure and scheduling constraints. Unlike passenger trains, these freight trains face significant bottlenecks because of loading/unloading delays, limited passing capacity, and inflexible departure slots. These features make real-time dispatching highly sensitive to loading delays, and demand a specialized optimization approach, as developed in this study.

This section presents a case study using the ShenShuo line of the ShenHua Railway as an illustrative example. The ShenShuo Railway extends from DaLiuTa in the north to ShuoZhou West in the south, covering a total mainline distance of 266 km. Within this segment, there are a total of 16 stations involved in the origination and termination operations of trains, thereby constituting the set of stations for optimizing the intelligent adjustment of train schedules under loading-delay conditions.

For this case study of the ShenShuo line, DaLiuTa station is designated as the boundary station, and ShenChi South is identified as the terminal station for the coal-transport corridor. Figure 6 provides an overview of the spatial distribution of the stations, including their attributes and the directional flow of vehicles. Stations colored in blue are loading stations, those in yellow are technical stations, and those in red serve both loading and technical functions.

Figure 6.

Schematic diagram of the ShenShuo line.

This case study lays the foundation for investigating the intelligent adjustment and optimization of train schedules under loading-delay conditions on the ShenShuo line. The detailed analysis of station attributes, vehicle flow directions, and the distinctive roles of each station contributes to a comprehensive understanding of the operational context. The use of DaLiuTa as the boundary station and ShenChi South as the terminal station reflects the specific focus on the coal-transport corridor within the ShenShuo line.

Model Parameter Configuration

In this case study, the relevant information and parameter settings of each station on the ShenShuo Railway are listed in Table 3. Table 4 provides a detailed display of the operating times of different types of heavy-duty freight trains between stations, while Table 5 illustrates the passing capacities of different types of trains in the sections. Additionally, the details of technical stations in the study route are configured in Table 6.

Table 3.

Relevant Information and Parameter Settings for Each Station on the ShenShuo Railway

Station number	Name	Loading capacity (minutes)	Loading operation time (minutes)
S1	DaLiuTa	33	2.2
S2	ZhuGaiTa	10	2.5
S3	YanJiaTa	18	3.4
S4	ShenMuBei	NA	NA
S5	HuangYangCheng	12	3.0
S6	XinChengChuan	8	2.8
S7	GuShanChuan	8	3.1
S8	FuGu	12	3.1
S9	BaoDe	15	2.7
S10	WangJiaZhai	16	3.2
S11	YinTa	16	2.9
S12	HanJiaLou	14	3.2
S13	SanCha	16	3.2
S14	HeZhi	10	3.0
S15	NanPoDi	NA	NA
S16	ShenChiNan	NA	NA

Note: NA = not available.

Table 4.

Operating Times of Different Types of Heavy-Duty Freight Trains Between Stations on the ShenShuo Railway

Operating section	10,000-ton train (minutes)	Regular train (minutes)
S2–S4	30	22
S3–S4	16	12
S4–S10	106	96
S5–S10	95	83
S6–S10	72	61
S7–S8	20	15
S7–S10	70	60
S8–S10	45	33
S9–S10	33	23
S10–S11	15	12
S10–S11	105	93
S11–S11	44	30
S12–S13	14	11
S13–S14	20	14
S14–S16	47	37

Table 5.

Passing Capacities of Different Types of Trains in Sections on the ShenShuo Railway

Operating section	10,000-ton train (minutes)	Regular train (minutes)
S2–S4	25	49
S4–S8	28	45
S8–S10	26	52
S10–S14	41	82
S14–S16	44	88

Table 6.

Technical Station Configuration

Station ID	Combination capacity (minutes)	Disassembly capacity (minutes)
S2	15	15
S3	4	4
S4	10	10
S10	10	10
S13	10	10
S14	4	4

In this case study, we only consider the 5000 t unit train as the basic unit for investigation. Two 5000 t unit trains can be combined to form a 10,000 t combination train, and a 10,000 t combination train can be disassembled into two 5000 t unit trains using two different operational methods. For computational convenience, the operational time for combining unit trains into a combination train is uniformly set to 60 min, while the disassembly time of a combination train into unit trains at the technical station is set to 35 min.

Acquisition Process of Heavy-Duty Empty-Train Flow

The source of the heavy-duty empty-train flow is an integrated loading plan for one day on the ShenShuo Railway, from 18:00 to 18:00 the next day. The plan schedules a total of 64 unit trains for loading operations. Following the time division method used during the planning phase of ShenHua Railway, this section divides one loading plan into six stages, with each stage covering a tracking time of 4 h. Table 7 provides information on the approved loading plan and transportation-demand quantities.
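The division of the daily plan into six 4 h stages can be expressed as a simple index calculation. In the sketch below, the calendar date and the function name `stage_index` are hypothetical; only the 18:00 start, the 4 h stage length, and the six stages come from the text.

```python
from datetime import datetime, timedelta

STAGE_HOURS = 4  # each stage covers 4 h
N_STAGES = 6     # six stages per daily loading plan


def stage_index(plan_start, t):
    """Return the 1-based stage (1..6) that contains time t."""
    elapsed = t - plan_start
    if not timedelta(0) <= elapsed < timedelta(hours=STAGE_HOURS * N_STAGES):
        raise ValueError("time lies outside the 24 h loading plan")
    return int(elapsed // timedelta(hours=STAGE_HOURS)) + 1


plan_start = datetime(2024, 1, 1, 18, 0)  # hypothetical date; the plan starts at 18:00

first = stage_index(plan_start, plan_start)                               # 18:00
second = stage_index(plan_start, plan_start + timedelta(hours=4))         # 22:00
last = stage_index(plan_start, plan_start + timedelta(hours=23, minutes=59))
```

Stage 1 therefore covers 18:00 to 22:00, and stage 6 covers 14:00 to 18:00 on the following day.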

Table 7.

Integrated Daily Loading Plan for ShenShuo Railway

Number	Origin	Destination	Vehicle type
1	S3	S16	C80
2	S3	S16	C80
3	S3	S16	C80
4	S9	S16	C80
5	S3	S16	C80
6	S3	S16	C80
7	S3	S16	C80
8	S3	S16	C80
9	S3	S16	C80
10	S5	S16	C80
11	S5	S16	KM98
12	S5	S16	C80
13	S5	S16	C80
14	S5	S16	C80
15	S5	S16	C80
16	S6	S11	C70
17	S6	S16	C70
18	S7	S16	C80
19	S7	S16	C80
20	S8	S16	C80
21	S8	S16	C70
22	S9	S11	C80
23	S9	S11	C80
24	S9	S11	C80
25	S9	S16	C80
26	S9	S16	C80
27	S10	S16	C80
28	S10	S16	C80
29	S10	S16	C80
30	S10	S16	C80
31	S11	S16	C80
32	S11	S16	C80
33	S11	S16	C80
34	S11	S16	C80
35	S11	S16	C80
36	S11	S16	C70
37	S11	S16	C70
38	S11	S16	C60
39	S11	S16	C60
40	S11	S16	C60
41	S12	S16	C80
42	S12	S16	C80
43	S12	S16	C60
44	S12	S16	C60
45	S12	S16	C60
46	S12	S16	C70
47	S12	S16	C70
48	S13	S16	KM98
49	S13	S16	C80
50	S13	S16	C80
51	S9	S16	KM98
52	S13	S16	C60
53	S13	S16	C60
54	S13	S16	C70
55	S13	S16	C70
56	S13	S16	C80
57	S13	S16	C80
58	S13	S16	C80
59	S13	S16	C80
60	S14	S16	C80
61	S14	S16	C80
62	S14	S16	C80
63	S14	S16	C80
64	S14	S16	C80

In this study, within one decision cycle, a total of eight trains at loading stations experience loading delays caused by delayed coal delivery or station-equipment malfunctions: six 5000 t single-unit empty trains and two 10,000 t combined empty trains. Table 8 lists the empty trains stored at each station during the different time periods.

Table 8.

Empty-Train Storage Data at Empty-Train Supply Stations on ShenShuo Railway

Number	Station	Type	Category	Time (minutes)
1	S4	C80	Large	1
2	S2	C80	Large	1
3	S16	C80	Large	2
4	S2	C70	Large	2
5	S16	C80	Small	1
6	S16	C80	Large	6
7	S16	C60	Small	1
8	S13	C70	Small	1
9	S16	C60	Small	1
10	S4	C80	Small	1
11	S16	C80	Small	2
12	S16	C80	Small	2
13	S16	C80	Small	2
14	S16	C80	Small	1
15	S16	C80	Large	1
16	S2	KM98	Large	1
17	S16	C80	Small	2
18	S4	C80	Small	2
19	S2	C80	Large	6
20	S16	C70	Small	2
21	S16	C80	Small	2
22	S14	C70	Small	2
23	S4	C80	Small	2
24	S2	C80	Large	6
25	S2	KM98	Small	2
26	S16	C80	Large	3
27	S16	C80	Small	4
28	S16	C80	Small	3
29	S16	C70	Small	3
30	S16	C80	Small	3
31	S16	C70	Small	5
32	S4	C80	Small	3
33	S4	C60	Small	4
34	S16	C80	Small	3
35	S2	C80	Large	4
36	S16	C80	Small	4
37	S16	C60	Large	6
38	S16	C80	Large	6
39	S2	C80	Large	4
40	S16	C60	Small	4
41	S4	C80	Small	5
42	S16	C80	Small	5
43	S16	C80	Small	4
44	S10	C80	Small	4
45	S16	C60	Small	4
46	S16	C70	Small	6
47	S2	C80	Small	4
48	S4	C70	Small	5
49	S2	C80	Small	4
50	S2	C80	Small	4
51	S16	C70	Large	5
52	S16	C80	Small	1

Model Solution and Results Analysis

In this subsection, the mathematical model of loading delay in the coal-transport channel proposed above is solved using the DCD-RLN method. In addition to the model parameter settings and relevant station information, the learning rate of the DCD-RLN method is set to 0.002, the discount factor to 0.98, the target network update rate to 0.9, and the maximum number of iterations per training session to 800.

Original Freight-Train Operating Plan

The original freight-train operating plan consists of two parts: the plan for loaded trains and the plan for empty trains. The initial plan for loaded trains is given in Table 9, and the initial plan for empty trains in Table 10. In both tables, OS denotes the originating station, DS the destination station, and OF the operating frequency.

Table 9.

Original Plan for Loaded-Train Operation

Number	OS	DS	OF (minutes)	Type
1	S1	S4	5	Small
2	S1	S10	1	Small
3	S3	S4	3	Small
4	S5	S10	4	Small
5	S5	S16	2	Small
6	S6	S10	1	Small
7	S6	S11	1	Small
8	S7	S10	2	Small
9	S8	S10	1	Small
10	S8	S13	1	Small
11	S9	S10	4	Small
12	S9	S11	1	Small
13	S11	S13	10	Small
14	S12	S13	5	Small
15	S12	S16	2	Small
16	S13	S16	12	Small
17	S14	S16	5	Small
18	S2	S16	10	Large
19	S4	S16	5	Large
20	S10	S16	9	Large
21	S13	S16	8	Large

Note: OS = originating station; DS = destination station; OF = operating frequency.

Table 10.

Original Plan for Empty-Train Operation

Number	OS	DS	OF (minutes)	Type
1	S16	S14	2	Small
2	S16	S13	9	Small
3	S16	S12	3	Small
4	S16	S11	6	Small
5	S16	S10	4	Small
6	S16	S8	2	Small
7	S16	S7	1	Small
8	S16	S6	2	Small
9	S16	S5	6	Small
10	S16	S3	7	Small
11	S14	S14	2	Small
12	S14	S13	2	Small
13	S14	S12	4	Small
14	S14	S11	2	Small
15	S10	S10	3	Small
16	S10	S9	4	Small
17	S8	S7	1	Small
18	S3	S3	2	Small
19	S16	S14	4	Large
20	S16	S13	1	Large
21	S16	S10	3	Large
22	S16	S9	2	Large

Note: OS = originating station; DS = destination station; OF = operating frequency.

The initial loaded-train operation plan in Table 9 schedules 41 direct unit trains from loading stations to technical stations or freight ports. The remaining trains, after completing the grouping operation at the technical station, are dispatched to freight ports as 10,000 t combination trains based on freight-flow demand.

According to Table 10, the initial empty-train dispatch plan schedules 45 5000 t unit trains directly from their respective locations to the loading stations for loading operations. The remaining 12 trains, each with a capacity of 10,000 t, are planned to undergo disassembly at intermediate technical stations into 5000 t unit trains, which are then dispatched to loading stations based on demand to complete loading operations.

Optimization Results of Freight-Train Dispatch Plan

Based on the initial dispatch plans for loaded and empty trains, we employed the DCD-RLN method to optimize the loaded- and empty-train plans jointly. After coordinated optimization, the total transport time for loaded trains, from loading to completing their transportation tasks, was reduced to 29,875 min. The demand for empty trains was met, fulfilling the loading requirements on the studied rail line. In the DCD-RLN, the process of optimizing the objective function is as shown in Equation 3.

The optimization results of the mathematical model for loading delays in the coal-transportation channel, obtained with the DCD-RLN method, are summarized in Figure 7, which shows the relationship between the optimized objective function value and the number of iterations.

Figure 7.

The convergence plot for the DCD-RLN method.

Performance Comparison of Similar Algorithms

In this study, the principal innovation is the introduction of a Multi-Agent Reinforcement Learning (MARL) algorithm. The crux of modeling in MARL lies in addressing challenges related to coordination, communication, and equilibrium selection, particularly in competitive scenarios. The interdependence of agents’ actions introduces significant complexities, as the environment appears nonstationary to each agent. Generic RL designs do not address these aspects and are not specialized for the unique characteristics of railways and multi-agent systems. The proposed design therefore specifies how agents are generated and how communication among agents and with the environment is organized; Figures 3 and 4 illustrate the specific design of the model and algorithm ( 48 ).

The reason for selecting metaheuristic algorithms, such as the Grey Wolf Optimizer (GWO) ( 49 ) and sine cosine algorithm (SCA) ( 50 ), as comparative benchmarks is twofold. First, these algorithms are well-established in the field of optimization and have proven their efficacy across a variety of complex problems, including feature selection and classification tasks. Second, they serve as a robust baseline to assess the performance of our DCD-RLN algorithm, given their adaptability and widespread use in solving optimization problems similar to ours. The hyperparameter settings for the GWO and SCA algorithms are consistent with those of the original authors. Both algorithms employ a population size of 10 individuals and a maximum number of iterations set to 800.
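For readers unfamiliar with the baselines, the sketch below shows the standard SCA position update with the population size (10) and iteration limit (800) stated above. It is applied to a toy sphere function rather than to the scheduling model, and the search bounds, dimension, and seed are hypothetical choices for illustration.

```python
import math
import random


def sca_minimize(objective, dim, lower, upper, pop_size=10, max_iter=800, a=2.0, seed=0):
    """Minimal sine cosine algorithm: agents oscillate around the best solution found."""
    rng = random.Random(seed)
    pop = [[rng.uniform(lower, upper) for _ in range(dim)] for _ in range(pop_size)]
    best = min(pop, key=objective)[:]  # copy of the best initial agent
    best_val = objective(best)
    history = [best_val]
    for t in range(max_iter):
        r1 = a - t * a / max_iter  # step amplitude shrinks linearly from a toward 0
        for x in pop:
            for j in range(dim):
                r2 = rng.uniform(0.0, 2.0 * math.pi)
                r3 = rng.uniform(0.0, 2.0)
                r4 = rng.random()
                wave = math.sin(r2) if r4 < 0.5 else math.cos(r2)
                x[j] += r1 * wave * abs(r3 * best[j] - x[j])
                x[j] = min(max(x[j], lower), upper)  # keep the agent inside the bounds
        for x in pop:
            val = objective(x)
            if val < best_val:
                best, best_val = x[:], val
        history.append(best_val)  # best-so-far value, as plotted in a convergence curve
    return best, best_val, history


def sphere(x):
    """Toy objective with its minimum of 0 at the origin (not the railway model)."""
    return sum(v * v for v in x)


best, best_val, history = sca_minimize(sphere, dim=5, lower=-10.0, upper=10.0)
```

The `history` list records the best-so-far objective value per iteration, which is the quantity typically shown in convergence plots for such metaheuristics.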

Furthermore, Figure 8 illustrates the convergence behavior of the GWO and SCA, two metaheuristic algorithms, under the same operational conditions. A comparison between Figures 7 and 8 reveals that the DCD-RLN algorithm achieves a final convergence value of 12,004, whereas the convergence values for the comparative algorithms GWO and SCA are 13,568 and 14,721, respectively. This indicates that the DCD-RLN algorithm exhibits superior global search performance compared with GWO and SCA.

Figure 8.

Convergence plot for GWO and SCA methods.

In the final optimization results, Table 11 presents the optimized empty-train dispatch plan for unit trains and combined trains, and Table 12 presents the optimized loaded-train dispatch plan. A comparison of Tables 10 and 11 shows that the optimized empty-train deployment plan is identical to the pre-optimized one.

Table 11.

Optimized Empty-Train Dispatch Plan

Number	OS	DS	OF (minutes)	Type
1	S16	S14	2	Small
2	S16	S13	9	Small
3	S16	S12	3	Small
4	S16	S11	6	Small
5	S16	S10	4	Small
6	S16	S8	2	Small
7	S16	S7	1	Small
8	S16	S6	2	Small
9	S16	S5	6	Small
10	S16	S3	7	Small
11	S14	S14	2	Small
12	S14	S13	2	Small
13	S14	S12	4	Small
14	S14	S11	2	Small
15	S10	S10	3	Small
16	S10	S9	4	Small
17	S8	S7	1	Small
18	S3	S3	2	Small
19	S16	S14	4	Large
20	S16	S13	1	Large
21	S16	S10	3	Large
22	S16	S9	2	Large

Note: OS = originating station; DS = destination station; OF = operating frequency.

Table 12.

Optimized Loaded-Train Dispatch Plan

Number	OS	DS	OF (minutes)	Type
1	S3	S4	3	Small
2	S5	S1	5	Small
3	S5	S16	2	Small
4	S6	S10	1	Small
5	S6	S11	1	Small
6	S7	S10	2	Small
7	S8	S10	1	Small
8	S8	S13	1	Small
9	S9	S10	4	Small
10	S9	S11	1	Small
11	S11	S13	10	Small
12	S12	S13	5	Small
13	S9	S16	2	Small
14	S13	S16	12	Small
15	S14	S16	5	Small
16	S2	S16	10	Large
17	S4	S16	6	Large
18	S10	S16	10	Large
19	S13	S16	8	Large

Note: OS = originating station; DS = destination station; OF = operating frequency.

Tables 9 to 12 show that only the loaded-train operation plan changes after optimization. In the optimized loaded-train schedule, which accounts for loading delays, 34 trains are scheduled to run from loading stations to freight ports, seven fewer than in the schedule before optimization. In addition, the number of 10,000 t combination loaded trains is 15, one more than in the initial operation plan.

In conjunction with the comparison and analysis of freight-train operation plans on the ShenShuo Railway coal-transportation channel before and after optimization, this section employs a more intuitive approach to validate the effectiveness of intelligent adjustment in the face of loading delays. The analysis further considers the operational utilization rate at technical stations and the alignment with train demand.

We first examine the operational utilization rate of technical stations before and after applying the DCD-RLN method. Table 13 compares the number of trains combined at each technical station before and after optimization. The utilization rates of technical stations S4, S10, and S13 rise by roughly 9 to 18 percentage points relative to the pre-optimization plan. This observation suggests that when heavy-duty freight trains cannot operate efficiently as unit trains because of the distance from loading stations or freight ports, completing marshaling operations at technical stations to form 10,000 t trains effectively improves the utilization of technical station operational capacity. This, in turn, alleviates capacity constraints within line segments. These insights contribute to a deeper analysis of capacity bottlenecks on the ShenShuo line of the ShenHua Railway, facilitating future enhancements in transportation capacity and operational utilization at technical stations.

Table 13.

Comparison of Technical Station Operation Utilization Before and After Optimization

Number	Station	Operation capacity (minutes)	Original train count	Original utilization rate (%)	Optimized train count	Optimized utilization rate (%)	Change rate (%)
1	S2	14	9	64.28	10	71.42	+7.14
2	S4	11	5	45.45	6	54.55	+9.1
3	S10	11	8	72.73	9	81.82	+9.09
4	S13	11	8	72.73	10	90.91	+18.18
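The utilization figures in Table 13 follow from dividing the train count by the operation capacity. The short check below recomputes them from the table's capacity and train-count columns; it rounds to two decimals, so the last digit may differ slightly from printed values that were truncated.

```python
# (operation capacity, original train count, optimized train count) from Table 13.
stations = {
    "S2": (14, 9, 10),
    "S4": (11, 5, 6),
    "S10": (11, 8, 9),
    "S13": (11, 8, 10),
}


def utilization(count, capacity):
    """Utilization rate in percent: trains handled relative to operation capacity."""
    return 100.0 * count / capacity


for name, (cap, before, after) in stations.items():
    u0, u1 = utilization(before, cap), utilization(after, cap)
    print(f"{name}: {u0:.2f}% -> {u1:.2f}% ({u1 - u0:+.2f} percentage points)")
```

The change column is thus a difference in percentage points between the two utilization rates, not a relative growth rate.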

Figure 9 presents the optimized empty-train allocation plan and the loaded-train operation plan. As shown in the left part of Figure 9, the allocation plan schedules 65 5000 t unit trains to be directly dispatched from the empty-train stations to the loading stations for loading operations, while the remaining 20 10,000 t trains are decomposed into 5000 t unit trains at the technical stations and then sent to the loading stations according to the loading demands. The right part of Figure 9 reveals that the optimized loaded-train operation plan schedules 37 direct unit trains from the loading stations to the destination stations, which is four fewer than the initial operation plan. Meanwhile, 34 10,000 t combined trains are planned, meaning that a total of 68 5000 t unit trains are combined at different combination stations and dispatched in the form of combined 10,000 t heavy-haul trains, which is two more than the initial operation plan. In addition to these changes, the optimized loaded-train operation plan has also undergone corresponding adjustments in train operation sections, quantities, time periods, and types. Moreover, the operation time periods of some trains have been optimized to better match the arrival time periods of the allocated empty trains.

Figure 9.

Efficient optimization of empty-train allocation and loaded-train operation.
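The train counts reported for Figure 9 can be cross-checked with simple arithmetic. The sketch below assumes that every 10,000 t train corresponds to exactly two 5000 t unit trains; the text states this for the 34 combined loaded trains, and applying it to the 20 decomposed empty trains is our assumption. The constant and function names are illustrative only. Under this assumption, both plans cover the same number of 5000 t unit trains.

```python
# Assumption: one 10,000 t train is equivalent to two 5000 t unit trains.
UNITS_PER_COMBINED_TRAIN = 2


def unit_train_equivalents(direct_unit_trains: int, combined_trains: int) -> int:
    """Total number of 5000 t unit trains represented by a plan."""
    return direct_unit_trains + UNITS_PER_COMBINED_TRAIN * combined_trains


# Empty-train allocation plan: 65 direct unit trains plus 20 decomposed 10,000 t trains.
empty_side = unit_train_equivalents(direct_unit_trains=65, combined_trains=20)

# Loaded-train operation plan: 37 direct unit trains plus 34 combined 10,000 t trains.
loaded_side = unit_train_equivalents(direct_unit_trains=37, combined_trains=34)

if __name__ == "__main__":
    print(f"Empty-train plan:  {empty_side} unit trains")
    print(f"Loaded-train plan: {loaded_side} unit trains")
```

Both totals come to 105 unit trains, which is consistent with every allocated empty unit train being matched by a loaded departure.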

Next, we assess the method by examining how well the loaded-train and empty-train plans align. Table 14 compares the alignment of the two plans before and after optimization. The total time for the heavy-haul freight trains to complete their transportation operations falls from 31,411 min in the original plan to 29,875 min in the optimized plan, a reduction of 1536 min (4.89%). The matching degree between loaded and empty trains rises from 92.75% (64 of 69) to 100% (69 of 69).

Taken together, the two aspects analyzed in this section, the utilization rate of technical station operations and the alignment of loaded and empty trains, show that intelligently adjusting schedules under loading delays, while accounting for how reallocated empty trains affect the completion of loading operations at loading stations, preserves the continuity between loaded and empty trains. Within the constraints of limited transportation capacity, this approach improves the efficiency and execution of goods transportation in scenarios with loading delays.

Table 14.

Comparison of Matching Degree in Loaded/Empty-Train Plans Before and After Optimization

Comparison	Objective value	Matching degree (%)
Original plan	31,411	92.75 (64/69)
Optimized plan	29,875	100 (69/69)
Change rate (%)	−4.89	+8.6
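The objective-value change and the matching degrees in Table 14 follow directly from the reported totals. The sketch below reproduces them; the helper names are illustrative, and the matching degree is assumed to be the number of matched trains divided by the total, as the "(64/69)" notation in the table suggests.

```python
def relative_change_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100.0 * (after - before) / before


def matching_degree_percent(matched: int, total: int) -> float:
    """Assumed definition: share of trains whose loaded and empty plans match."""
    return 100.0 * matched / total


ORIGINAL_OBJECTIVE_MIN = 31_411   # total operation time, original plan (min)
OPTIMIZED_OBJECTIVE_MIN = 29_875  # total operation time, optimized plan (min)

saved_minutes = ORIGINAL_OBJECTIVE_MIN - OPTIMIZED_OBJECTIVE_MIN
objective_change = relative_change_percent(ORIGINAL_OBJECTIVE_MIN, OPTIMIZED_OBJECTIVE_MIN)
original_matching = matching_degree_percent(64, 69)
optimized_matching = matching_degree_percent(69, 69)

if __name__ == "__main__":
    print(f"Time saved: {saved_minutes} min ({objective_change:.2f}%)")
    print(f"Matching degree: {original_matching:.2f}% -> {optimized_matching:.2f}%")
```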

Conclusion

In this study, we introduced an optimization method, the DCD-RLN, to tackle the challenge of loading delays in heavy-haul railway transportation. By analyzing the impact of loading delays on railway efficiency, we highlighted the necessity and urgency of this research. The DCD-RLN method adjusts train schedules during loading delays by accounting for constraints such as empty and loaded unit trains, coupling requirements, and the complex interactions among multiple intelligent agents in a competitive and cooperative environment. In comparative experiments, the method improved the robustness and efficiency of train operations relative to existing solutions. These findings emphasize the role of advanced intelligent optimization methods in addressing real-world problems in railway transportation. The DCD-RLN method not only illustrates the potential of deep RL for optimizing intricate transportation systems but also offers railway enterprises a new tool for overcoming the challenges associated with loading delays.

However, it is important to acknowledge the limitations of our current approach. The DCD-RLN method, while effective in controlled scenarios, may face scalability issues when applied to larger and more diverse transportation networks. Additionally, the method’s reliance on accurate initial data and predefined parameters could limit its adaptability to unforeseen disruptions or changes in operational conditions. Future work will focus on extending the DCD-RLN method to more complex transportation scenarios and incorporating real-time data for dynamic scheduling, thereby enhancing the adaptability and responsiveness of railway-transportation systems.

We recommend further research to address these limitations and to explore the application of the DCD-RLN method in other types of transportation networks. This will help verify its broad applicability and contribute to the modernization and sustainable development of railway transportation. By refining our method and expanding its scope, we aim to improve the efficiency and reliability of transportation systems more broadly and to adapt the method to the evolving demands of the transportation industry.

Footnotes

Author Contribution Statement

The authors confirm contribution to the paper as follows: study conception and design: Ai-Qing Tian, Jeng-Shyang Pan; data collection: Hong-Xia Lv; analysis and interpretation of results: Ai-Qing Tian, Jeng-Shyang Pan; draft manuscript preparation: Ai-Qing Tian. All authors reviewed the results and approved the final version of the manuscript.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (Project Nos. 52072314, 52172321, 52102391), the Sichuan Science and Technology Program (Project Nos. 2020YJ0268, 2020YJ0256, 2021YFQ0001, 2021YFH0175), the Science and Technology Plan of China Railway Corporation (Project No. 2019F002), the China Shenhua Energy Co., Ltd. Science and Technology Program (Project No. CJNY-20-02), the China Railway Beijing Bureau Group Co., Ltd. Science and Technology Program (Project Nos. 2021BY02, 2020AY02), and the National Key R&D Program of China (Project No. 2017YFB1200702).

ORCID iDs

Ai-Qing Tian

Jeng-Shyang Pan

Data Accessibility Statement

Some or all of the data, models, or code generated or used during the study was provided by Southwest Jiaotong University under license and therefore cannot be made freely available. Direct requests for this data should be made to Southwest Jiaotong University.
