Abstract
With the increasing role of railway transportation in the globalized economy, the issue of loading delays has emerged as a critical factor affecting the efficiency and reliability of rail transport. This paper addresses the loading-delay problem in heavy-haul railway transportation by proposing an optimization method based on a diversified cooperative deep reinforcement-learning network (DCD-RLN). Within an actor–critic framework, this method integrates optimization algorithms with loading-delay models to develop a tailored strategy for adjusting train schedules during delays. The proposed approach considers constraints such as empty and loaded unit trains, coupling requirements, and competitive and cooperative interactions among multiple intelligent agents, aiming to minimize assembly times for unit trains and reduce waiting and assembly times during loading and coupling phases for loaded unit trains. Comparative experimental results with existing solutions demonstrate the significant advantages of the DCD-RLN method in enhancing the robustness and efficiency of train operations. This study not only provides a domain-specific reinforcement-learning framework for railway-transportation researchers but also offers valuable references for practical operations in railway-transportation enterprises.
Keywords
Introduction
In the context of a globalized economy, rail transportation has become a vital link connecting economic activities across different regions because of its large-scale and long-distance transport capabilities, relatively low transportation costs, and minimal environmental impact. The efficient operation of rail transportation systems is crucial for supporting economic development, ensuring supply-chain stability, and meeting societal demands ( 1 ). With the rapid growth of the economy and the acceleration of urbanization, rail transportation plays an increasingly significant role in the transportation systems of various regions ( 2 ). It is not only an efficient, safe, and environmentally friendly mode of transport but also plays a key role in addressing the growing demand for passenger and freight transport ( 3 ). Technological innovation and digital advancement, such as high-speed railways, intelligent train-scheduling systems, and advanced rail transport technologies, have provided strong support for improving the efficiency and quality of rail transportation ( 4 ). Moreover, the application of digital technology ensures accurate data collection and analysis, providing a powerful tool for optimizing rail transport networks and enhancing the safety and reliability of railway lines ( 5 ).
Rail freight plays a significant role globally, with different countries developing distinct transportation practices based on their geography, economy, and policy conditions. This paragraph aims to provide an overview and comparison of the practices in rail-freight organization in the United States, China, and European countries ( 6 ). In the United States, rail freight is primarily operated by private companies such as Union Pacific Railroad and BNSF Railway. These companies typically own their rolling stock and tracks, transporting bulk commodities like coal and grain. The main characteristic of American railways is their efficient transportation of long-haul, heavy-load cargo, which benefits from an extensive network of tracks and standardization of rolling stock ( 7 ). In contrast, rail freight in China is dominated by the state-owned China State Railway Group Co., Ltd. While meeting the substantial domestic freight demand, China’s railways have also promoted international freight development through the Belt and Road Initiative ( 8 ). A notable feature of China’s rail transport is the large-scale transportation of resource-based commodities such as coal and ores ( 9 ). In Europe, rail freight is more regulated and requires coordination between multiple countries and rail systems. Operators such as Deutsche Bahn and SNCF must adhere to the European Union’s transport regulations and standards ( 10 ). European rail transport emphasizes multimodality and environmental protection, improving transport efficiency through methods like combined road-rail transport ( 11 ). By comparison, it is evident that while the rail-freight practices of different countries have their own characteristics, they all face the common challenges of increasing transportation efficiency and reducing costs. Additionally, the impact of vehicle ownership on the flexibility of transport scheduling is also worthy of attention ( 12 , 13 ).
In recent years, numerous scholars have contributed significantly to the literature on train-scheduling optimization. The efficiency and punctuality of rail transportation systems are crucial for modern transportation networks. Lamorgese and Mannino ( 14 ) proposed an exact decomposition method in 2015 to address real-time train-scheduling problems by processing conflicts and delays in train scheduling, providing an effective scheduling strategy for railway operations. Subsequently, in 2018, Lamorgese et al. ( 15 ) further explored train-scheduling issues, summarizing various models and solutions for train-scheduling problems in the Handbook of Optimization in the Railway Industry, laying a theoretical foundation for decision support in the railway industry. With the increasing complexity of rail transportation systems, traditional scheduling methods face new challenges. Liao et al. ( 16 ) employed deep reinforcement learning (RL) in 2021 to solve the train-timetable rescheduling problem under disturbances, aiming to save energy. Their method not only responds in real time to disturbances in train operation but also significantly reduces energy consumption, which is of great significance for improving the sustainability and economic benefits of rail transportation. Narayanaswami and Rangaraj ( 17 ) provided a comprehensive review and analysis of the scheduling and rescheduling problems in railway operations in 2011, offering a valuable perspective for understanding the complexity of railway scheduling problems. Following this, in 2013, they proposed a mixed-integer linear programming (MILP) model for optimizing train-timetable rescheduling after railway traffic disruptions, considering not only the minimization of total delays but also other performance objectives, providing a new solution for railway scheduling problems ( 18 ).
The optimization of rail transportation systems is a complex issue, involving multiple aspects such as train scheduling, timetable rescheduling, and route planning. In their 2012 research, Yan and Yang ( 19 ) proposed a mixed-integer programming-based method to address the train motion planning problem and developed heuristic algorithms and decomposition techniques, providing an effective solution for train-scheduling problems. Boccia et al. ( 20 , 21 ) further explored train-scheduling problems in multi-track regions in their research in 2012 and 2013, using MILP and heuristic methods to optimize train scheduling and operation, offering new perspectives and methods for train scheduling in multi-track areas. Adenso-Díaz et al. ( 22 ) proposed an online train-timetable rescheduling method in their 1999 research for regional train services, one which can respond in real time to disturbances in train services and improve the reliability of train timetables. Törnquist and Persson ( 23 ) used tabu search and simulated annealing algorithms to handle train traffic deviations in their 2005 research, techniques which can effectively deal with train traffic deviations and improve the flexibility and efficiency of train scheduling. Higgins et al. ( 24 ) proposed heuristic techniques for single-line train scheduling in their 1997 research, while Cai et al. ( 25 ) proposed a greedy heuristic method for rapid scheduling of single-track trains in their 1998 research, providing effective solutions for single-line train-scheduling problems. Samà et al. ( 26 ) proposed a variable neighborhood search method in their 2017 research for rapid train scheduling and routing under railway traffic disturbances, which can effectively handle railway traffic disturbances and improve the efficiency of train scheduling and routing.
These research works demonstrate the application of various methods from mixed-integer programming to heuristic algorithms, and then to variable neighborhood search in train-scheduling problems, providing a variety of solutions to improve the efficiency and reliability of rail transportation systems.
Despite significant advancements in technology and management within the railway-transportation sector, numerous challenges persist, particularly the issue of loading delays. Loading delays can precipitate a decline in transportation efficiency and trigger a cascade of subsequent problems, such as train delays and the postponement of cargo delivery, resulting in substantial economic losses for railway-transportation enterprises ( 27 ). Moreover, loading delays can have an impact on the stability and reliability of the entire transportation network, increase transportation costs, and even affect the efficiency of the entire supply chain ( 28 ). Consequently, optimizing train scheduling to mitigate the impact of loading delays has become a pressing issue in the field of railway transportation.
In the field of freight-train scheduling, a multitude of scholars have conducted extensive research, focusing primarily on train consist planning, timetable optimization, and cargo transportation route selection. For instance, researchers have utilized multi-objective nonlinear 0-1 programming models to optimize train consist plans, thereby enhancing transportation efficiency ( 29 ). However, these studies often overlook the issue of loading delays, that is, the waiting time of trains during the loading and coupling stages, which directly affects the operational efficiency of trains and may lead to a decline in the efficiency of the entire transportation network.
This study addresses the loading-delay issue and proposes an optimization method based on a diversified cooperative deep reinforcement-learning network (DCD-RLN). Within the actor–critic framework, this method integrates optimization algorithms with loading-delay models to develop a customized strategy for adjusting train scheduling during delays. The main contribution of this research is a new optimization model that adjusts train scheduling during loading delays by considering constraints such as empty and loaded cars, coupling requirements, and competitive and cooperative interactions among multiple intelligent agents. The primary goal is to minimize the assembly time of individual trains and reduce the waiting and assembly time of heavy unit trains during the loading and coupling stages. Experimental results show that this method improves the robustness and efficiency of train operations. The capacity of trains, the amount of available rolling stock, and the depot-entry and -exit operations are also considered. Three solution approaches are proposed to solve the resulting multi-objective mixed-integer nonlinear programming (MINLP) problem to deliver both an irregular train schedule (i.e., departure and arrival times of all train services) and a rolling-stock circulation plan (including depot-entry and -exit operations of rolling stock and connections between train services) simultaneously.
The scholarly research paradigm is illustrated in Figure 1.

Figure 1. The scholarly research paradigm.
Through the in-depth analysis and empirical research of this study, we have not only filled the gap in the existing research on loading delays but also provided a new optimization strategy for the field of railway transportation to cope with the increasing transportation demands and complex transportation environments. Our research findings will help railway transportation enterprises to improve operational efficiency, reduce costs, and ultimately enhance the stability and reliability of the entire transportation network. The subsequent sections of this manuscript are structured as follows. The next section reviews the existing literature on train-rescheduling problems. The third section delineates the methodology for modeling the coal-transportation overflow within railway networks, encompassing the development of the model framework, the formulation of objective functions, and the delineation of constraints. The fourth section elucidates the implementation of DCD-RLNs to address the model. The fifth section evaluates the efficacy of the proposed approach through a series of small-scale case studies. Finally, the sixth section synthesizes the research findings presented in this paper and posits potential avenues for future inquiry.
Related Work
In recent years, numerous learning-based methods have been proposed for intelligent transportation systems. Table 1 summarizes some representative existing methods in this field. To the best of our knowledge, Šemrov et al. ( 32 ) were the first to apply the Q-learning algorithm to train rescheduling, defining the state by the train positions, track availability, and current time. To circumvent the problem of a large state space, Khadilkar et al. ( 38 ) defined the state by employing only the track availability information surrounding the current train. To enhance the decision-making performance, Prasad et al. ( 31 ) adopted evolutionary strategies to search for better parameters of the policy network. Li and Ni ( 32 ) introduced a timetable scheme based on multi-agent deep RL, in which each train is controlled independently based on its own operation information, including its departure time and whether it stops at a station. Although the aforementioned methods have achieved some promising results, their learned policies fail to maintain good performance when extended to various problem instances. The main reason is the absence of delay information in the state definitions, which makes it difficult to distinguish different delay scenarios. As a consequence, the same actions may be taken in different delay scenarios, thereby degrading the performance.
Table 1. A Brief Summary of Existing Learning-Based Methods
Zhu et al. ( 33 ) incorporated train delay information into the state definition, and experimental results showed that the resulting policy could achieve better performance. Ning et al. ( 34 ) defined the state by using the actual arrival and departure times of all trains, thus enabling the learned policy to distinguish each delay scenario within a fixed scheduled timetable. However, when confronted with various train-delay situations, their method still cannot maintain good performance, primarily because the learned policy is optimized based on a single problem instance. Ghasempour et al. ( 35 , 36 ) proposed to learn the policy from multiple instances, but the policy was only used to address the delay scenarios at railway intersections. Yin et al. introduced a policy-based RL method for the timetable in which the policy was also learned from many delay scenarios, but only one train was allowed to be delayed in their problem instance. In practical scenarios, however, it is not uncommon for multiple trains to be delayed simultaneously ( 37 ).
The aforementioned studies primarily focus on problem-solving in delay scenarios for passenger-transportation activities. However, delay scenarios in heavy-haul transportation caused by factors such as loading delays and adverse weather conditions have not been adequately explored. To effectively address these issues, this paper proposes a domain-specific RL framework. By utilizing states to represent the delay conditions of trains, the learned policy can directly determine the state changes at the next time stamp.
Unlike standard RL algorithms that assume generic state transitions and unconstrained actions, our method embeds railway-specific structural constraints—including dynamic track availability, yard-level marshaling restrictions, and scheduling-slot rigidity—into the state, action, and reward modules. Furthermore, domain-aware action masking ensures only feasible train movements are considered during training, significantly narrowing the policy search space and accelerating convergence. This integration of operational logic into the RL loop enables more robust and realistic decision-making, which is not achievable through vanilla deep Q-network (DQN) or Proximal Policy Optimization (PPO) implementations.
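The action-masking idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the three-action example, and the feasibility flags are hypothetical, and the mask is assumed to be supplied by some external feasibility check (for instance, a track-occupancy test).

```python
import math


def masked_greedy_action(q_values, feasible):
    """Return the index of the highest-valued feasible action.

    q_values: one estimated value per candidate train movement.
    feasible: one bool per action; False marks a movement that a
              (hypothetical) railway constraint rules out.
    """
    if not any(feasible):
        raise ValueError("no feasible action in this state")
    # Infeasible actions are set to -inf so they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, feasible)]
    return max(range(len(masked)), key=masked.__getitem__)


def masked_softmax(logits, feasible):
    """Softmax restricted to feasible actions; infeasible ones get probability 0."""
    if not any(feasible):
        raise ValueError("no feasible action in this state")
    # Subtract the largest feasible logit for numerical stability.
    peak = max(l for l, ok in zip(logits, feasible) if ok)
    exps = [math.exp(l - peak) if ok else 0.0 for l, ok in zip(logits, feasible)]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical state: action 1 has the best raw value but is infeasible.
q = [1.2, 3.5, 0.7]
mask = [True, False, True]
best = masked_greedy_action(q, mask)
probs = masked_softmax(q, mask)
```

Because infeasible movements receive zero probability, the agent never spends training samples on them, which is one way the policy search space can be narrowed.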
Modeling Transportation Scenarios in Heavy-Haul Railway Coal Corridors
We abstract the loading-delay optimization problem in heavy-haul railway coal corridors as a multifaceted assignment problem, focusing on the allocation of empty trains that enter the target route’s loading zone from the transportation-corridor interface ( 4 ). These trains are distributed to either loading or disassembly stations. In particular, after the designated empty trains are disassembled into unit trains, the unit trains undergo a second assignment, which is pivotal for the overall efficiency of the rail logistics chain. After loading operations at the loading stations, the vehicles change from unit empties to unit loads. At this point, a decision is made on whether to dispatch the train directly or to assemble it with other train sets at a technical station before departure ( 39 ). The flow of loaded and empty trains during the resequencing process in these overloaded heavy-haul railway coal corridors is depicted in Figure 2.

Figure 2. Sequencing diagram of loaded- and empty-train flow continuity in loading operations.
In the realm of railway train scheduling, loading delays are a pervasive challenge, often attributed to a myriad of factors such as traffic congestion, inclement weather conditions, or malfunctions in loading equipment. To mitigate these delays and optimize the train schedule, a comprehensive modeling approach is indispensable. This approach necessitates a deep understanding and articulation of the objectives, constraints, and decision variables that are intrinsic to the problem at hand. By addressing these complexities within the overloaded heavy-haul railway coal corridors, this research endeavors to fortify the robustness and efficiency of train operations, providing a solid foundation for operational enhancement in rail-freight logistics.
Assumptions of the Model
The model assumes the following.
The foundational data of the model, including the coal-source demand plan from the shipper, the location settings of line stations, and standard times for coal transportation to the loading yard, are known.
The capabilities of heavy-haul railway lines and stations along the route, such as throughput capacities and loading efficiency at loading stations, are known.
Coal-source demands within a given timeframe do not change. Railway transportation typically operates on a 24 h decision cycle, and, for computational efficiency, optimization is conducted with minutes as the basic unit. The line is assumed to be a double-track system, and the combination of operations for large and small trains can only occur at technical stations. Grouping at technical stations aims to align with large trains or tonnage.
The causes of loading delays at the loading yard are not constrained by temporal or spatial limitations. This includes events such as adverse weather conditions (rain, snow, mudslides), breakdowns of land transport vehicles, or track fractures within the loading station.
Once the train schedule is set, the process of loading the empty train must achieve full axle or full weight conditions for departure; otherwise, the train waits indefinitely.
To simplify the research dimension from mines to loading stations, transportation time and relaxation time are uniformly used for representation.
Since the focus of this study is the single-commodity transportation process in heavy-haul railways, the impact of passenger-transportation schedules on freight is not considered.
Two main causes of loading delays are considered. In respect of delays caused during the loading process, it is assumed that the quantity of empty trains meets the needs of each research station. For delays arising from the inability to guarantee loading for empty trains, it is assumed that once the conditions for empty-train arrival are met, loading stations can immediately execute the transition from empty to loaded.
The “train” is taken as the smallest research unit. The railway’s coal-transportation channel typically employs rolling stock for cyclic coal transportation. The use of “trains” as the basic unit of deployment aligns with the actual transportation mode of the railway.
Explanation of Symbols
Set Configuration and Explanations
Setting and Explanatory Notes on Model Parameters
Decision Variable Definition and Explanations
This section covers the definition of binary decision variables based on technical operations for unit or combined empty trains and the sequencing of scheduled heavy columns at each station.
If the boundary interface accepts a type-
If an empty train of type
If a combined empty train of type
If a type-
If a type-
If a type-
If a type-
If a type-
If type
Objective Function Formulation and Explanation
The optimization objective for empty-train deployment is to minimize the total transportation time in the event of loading delays. In formulating the objective function, two scenarios of loading delays need to be considered. First, in cases where coal delivery to the loading station is delayed or there are facility malfunctions causing unit trains to wait indefinitely, the objective is to schedule unit trains to other loading stations to avoid prolonged waiting at the delayed loading station. Second, if the untimely deployment of empty trains results in the loading station’s inability to complete the empty-to-loaded transition, the goal is to alleviate delays at the loading station by deploying a limited number of unit trains as efficiently as possible. This approach aims to address delays efficiently and effectively in both scenarios of loading delays.
Minimizing Unit-Train Connection Time
This means ensuring the shortest possible duration, as shown in Equation 1:
Minimizing the Waiting and Assembly Time of Unit Trains
In this objective, we aim to minimize the total waiting time and assembly time for unit trains awaiting grouping. This involves optimizing the time each unit train spends waiting to be assembled and the time required for the actual grouping process. The objective function considers various factors, including the arrival times of single-unit trains, the scheduling of assembly operations, and the associated time durations. The goal is to efficiently organize the unit trains, ensuring the shortest possible waiting and assembly times. This objective is crucial for enhancing the overall efficiency and timely deployment of unit trains in the transportation system. This objective function can be shown in Equation 2:
where
Analysis
Although the objective functions defined in this section are based on cost functions measured in time, these two functions contribute differently to the efficiency of heavy-haul transportation production. We can analyze the contribution of each formula from different perspectives. The first equation emphasizes the flexibility of “load and go,” emphasizing the swift completion of loading and immediate deployment into transportation. The second equation focuses on the coordinated optimization of loading and grouping, as well as the enhancement of overall transportation capacity through the capacity tension in buffer intervals.
In practical scenarios, it is essential to consider factors such as station layout, optimal utilization of line and station capacities, composition of train flows, and train flow intensities. Therefore, the evaluation of the contribution of each formula needs to integrate these factors for a comprehensive analysis, enabling better adaptation to specific transportation environments and requirements.
To further clarify our approach, we have weighted the two optimization objectives into a single objective. This is a common strategy for transforming multi-objective problems into single-objective ones, which allows for a more focused optimization process. The importance weight for unit-train connection time is set as
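The weighted-sum scalarization described in this paragraph can be sketched as follows. The weight values and time figures are purely hypothetical (the weights used in the paper are not reproduced here), and the function name is an assumption made for illustration.

```python
def weighted_objective(connection_time, waiting_assembly_time,
                       w_connection, w_waiting):
    """Combine the two time-based objectives into one scalar cost.

    connection_time:       value of the unit-train connection-time objective.
    waiting_assembly_time: value of the waiting-and-assembly-time objective.
    w_connection, w_waiting: non-negative importance weights (assumed values).
    """
    if w_connection < 0 or w_waiting < 0:
        raise ValueError("weights must be non-negative")
    return w_connection * connection_time + w_waiting * waiting_assembly_time


# Hypothetical plan: 120 min of connection time and 300 min of waiting
# and assembly time, with illustrative weights 0.4 and 0.6.
score = weighted_objective(120.0, 300.0, w_connection=0.4, w_waiting=0.6)
```

A larger weight on one term makes the optimizer trade more of the other objective to reduce it, so the choice of weights encodes the operator's priorities between the two objectives.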
Constraint Conditions
Constraints on Unit Empty- and Loaded-Train Connection
For a unit empty train with the identifier
If the aforementioned unit empty train needs to be sent to a technical station after completing the loading task to form a grouped train with other unit loads, then it should only be able to complete the grouped-train formation operation at one technical station. This requirement can be expressed by the constraint given as Equation 5:
The same research process can be applied to the unit trains generated after the technical station completes the disassembly operation of combination trains. For these unit trains, they should also satisfy the above constraints. This can be represented by Equation 6:
Constraints on the Connection Between Scheduled and Pending Heavy Trains
The operational track of a specific scheduled heavy train at a loading station may not necessarily connect with a loaded unit train. This implies, as shown in Equations 7 and 8:
Constraints on the Demand for Empty Trains at Loading Stations
In periods outside the designated operational windows, the interface continuously receives unit or composite empty trains. Currently, loading station
where
Loading-Demand Constraints at Technical Stations
For technical station
where
Constraints on the Quantity Relationship Before and After Disassembling Combined Empty Trains
After the combined empty trains arrive at all technical stations and complete the disassembling operations, the resulting quantity of unit empty trains should be equal to the product of the quantity of each type of combined empty train and their disassembling group numbers, as shown in Equation 11:
From another perspective, the quantity of unit empty trains generated by the disassembly of combined empty trains at technical station
Interval Passing Capacity Constraints
Assuming
where
The ultimate constraint on interval throughput capacity can be expressed as in Equation 15:
Capacity Constraints for Disassembly Operations at Technical Stations
The quantity of disassembled composite trains at technical station
Constraints on the Sorting Capacity of Technical Stations
The constraint on the number of combined trains composed by technical station
Constraints on the Quantity Relationship Between Disassembled Combined Empty Trains and Assembled Combined Loaded Trains
Under the assumption of fixed locomotives, the total quantity of disassembled combined empty trains at various technical stations should be consistent with the total quantity of assembled combined loaded trains at those stations, as expressed in Equation 18.
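As a small illustration of this balance requirement, the check below compares network-wide totals rather than per-station counts, following the wording above. The station labels and counts are hypothetical, and the function name is an assumption.

```python
def disassembly_matches_assembly(disassembled, assembled):
    """Check the network-wide balance between disassembled combined empty
    trains and assembled combined loaded trains.

    Both arguments map a technical-station label to a train count.
    Only the totals over all stations are compared, so an individual
    station may disassemble more trains than it assembles.
    """
    return sum(disassembled.values()) == sum(assembled.values())


# Hypothetical plan with two technical stations.
disassembled = {"T1": 3, "T2": 2}
assembled = {"T1": 1, "T2": 4}
balanced = disassembly_matches_assembly(disassembled, assembled)
```

In this example the totals agree (five trains each) even though neither station is balanced on its own.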
Continuity Time Constraints
For a unit train
For a combined train with index
Optimal Adjustment Method for Train-Scheduling Plans in the Scenario of Loading Delay
In contrast to generic RL applications, our approach embeds domain-specific operational rules and characteristics of heavy-haul freight railways into the model. For instance, the agent’s decision-making process is constrained by realistic scheduling windows, maximum axle loads, station throughput capacities, and coupling policies that reflect actual operations in Chinese coal corridors. Additionally, the multi-level coordination of empty and loaded-train units—subject to real-world marshaling and recombination constraints—is modeled explicitly, allowing our RL framework to capture the granular dynamics of heavy-haul scheduling in a way general RL setups cannot.
Algorithmic Symbol Representation and Flowchart Design
In traditional RL algorithms, each agent continuously learns and improves its strategy, resulting in dynamic instability from the perspective of each agent ( 40 ). This situation contradicts the convergence conditions of traditional RL algorithms because the dynamic nature of the environment introduces uncertainty into the agent’s learning. Additionally, because of the instability of the environment, traditional RL techniques, such as experience replay (ER), are limited in their effectiveness when dealing with multi-agent systems ( 41 ). It is noteworthy that the increasing number of agents further exacerbates the variance problem when using policy-gradient (PG) algorithms ( 42 ).
In a multi-agent environment, the growing interaction among agents increases the uncertainty and variance in policy updates. This escalation makes it challenging for traditional PG algorithms to adapt to dynamically unstable environments, as the adjustment of a single agent’s policy struggles to effectively address the overall dynamic nature of the environment ( 43 , 44 ). In this context, traditional RL algorithms like DQN face even more significant challenges because they struggle to handle the complex interactions and cooperative learning problems among agents in multi-agent systems ( 45 ). Therefore, in researching novel RL algorithms, it is crucial to consider how to address dynamic instability in multi-agent environments and adopt corresponding strategies to enhance the robustness and performance of the algorithm.
In the domain of RL, one of the most prevalent approaches is the utilization of function approximation. Because of the formidable expressive capabilities of neural networks, they are frequently chosen to approximate the function

Figure 3. Schematic diagram of DQN network operation.
The actor–critic algorithm framework is divided into two parts: the actor network (policy network) and the critic network (value network). The responsibility of the policy network is to interact with the environment and learn a better policy using PG under the guidance of the value network. The value network’s role is to learn a value function from the data collected through the interactions between the policy network and the environment. This value function is used to determine which actions are good and which are not in the current state, thereby aiding the policy network in updating its policy. The policy network is updated based on the principles of PG, while the value network is updated using
Similar to DQN, a method akin to the target network is used here. The term
The parameters of the value network can be updated using the gradient-descent method.
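A minimal sketch of such a value-network update is given below. It assumes a temporal-difference target computed with a slowly updated target copy, and it replaces the neural network with a linear critic so the gradient can be written by hand. The feature vectors, learning rate, discount factor, and soft-update rate are all hypothetical.

```python
def value(weights, features):
    """Linear stand-in for the value network: V(s) = w . phi(s)."""
    return sum(w * x for w, x in zip(weights, features))


def critic_step(weights, target_weights, features, reward, next_features,
                gamma=0.99, lr=0.1, tau=0.01):
    """One gradient-descent step on the squared TD error.

    The TD target uses the target weights and is treated as a constant,
    so the gradient of 0.5 * td_error**2 w.r.t. weights is -td_error * features.
    """
    td_target = reward + gamma * value(target_weights, next_features)
    td_error = td_target - value(weights, features)
    new_weights = [w + lr * td_error * x for w, x in zip(weights, features)]
    # Soft update: the target copy moves a small step toward the new weights.
    new_target = [tau * w + (1.0 - tau) * t
                  for w, t in zip(new_weights, target_weights)]
    return new_weights, new_target, td_error


# Hypothetical transition with a reward of -5 (e.g. five minutes of delay).
w = [0.0, 0.0]
w_target = [0.0, 0.0]
w, w_target, delta = critic_step(w, w_target, [1.0, 0.0],
                                 reward=-5.0, next_features=[0.0, 1.0])
```

Keeping the target weights nearly fixed between steps stabilizes the regression target, which is the same motivation as the target network in DQN.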
The problem addressed in this research is train scheduling under loading delays. Based on the mathematical model and decision variables constructed above, this problem can be classified as a zero-sum stochastic game (SG) ( 46 ). In this research, the domain-specific RL framework DCD-RLN is adopted. This algorithm assumes a discrete mode of information exchange among multiple agents, operates on the actor–critic algorithm framework, utilizes the deep deterministic policy-gradient (DDPG) algorithm, and addresses scalability issues in large-scale multi-agent environments by enabling parameter sharing among multiple agents ( 47 ).
For the SG problem, which involves a dynamic game process with multiple agents in multiple states, the SG problem can be represented by the tuple
To maintain a flexible framework in the DCD-RLN algorithm and accommodate different numbers of agents to meet the problem’s requirements, in the train scheduling under loading delays, where different trains have competitive and cooperative relationships, it is essential for both competitors and collaborators to share the same state space
To simplify the representation of the global reward function, the subscript
The equation above shows that the reward is computed from each agent’s action and covers both cooperating and competing agents. The reward values for these two types of agents may be exactly opposite, in which case the net reward is zero. Under this computation, the rewards of the agents in the environment increase continuously over time. If the rewards of agents with competitive relationships exceed those of agents with cooperative relationships, the absolute value of the overall reward is taken so that the result is non-negative.
In the context of the defined global reward
The learning strategy for the DCD-RLN algorithm involves expanding a network of length
Let
The process of the DCD-RLN method is illustrated in pseudo-code as Algorithm 1 and the overall flowchart of the DCD-RLN algorithm is shown in Figure 4.

The DCD-RLN plays a pivotal role in establishing effective communication among agents by employing a cyclic network that facilitates the connection between individual agent strategies and their respective Q-networks. This network architecture is designed to enhance collaborative learning and decision-making processes within a multi-agent system.
Neural Network Architecture and Training Procedure
To implement the proposed DCD-RLN method, we adopt an actor–critic framework in which both the policy (actor) and value (critic) functions are approximated by neural networks. These networks are specifically tailored to capture the operational complexity of heavy-haul railway dispatching and to ensure scalability and adaptability under varying delay scenarios.
Policy Network (Actor)
The actor network is designed as a multilayer perceptron (MLP) composed of the following components:
Value Network (Critic)
The critic network estimates the state value and adopts a structure similar to the actor:
Training Setup
The networks are trained using a synchronous actor–critic method with the following settings:
Data-Driven Environment Calibration
The training environment is calibrated using real-world operational data from a coal-dedicated heavy-haul corridor in northern China. Historical data sources include:
Timestamped records of train arrivals and departures;
Delay logs at loading and technical stations;
Train formation and coupling records;
Throughput statistics at each station and section.
These records are used to parameterize delay distributions and simulate dynamic railway conditions. By incorporating empirical knowledge into the simulation environment, the learning agent is exposed to realistic and representative scheduling scenarios, enhancing the policy’s applicability to practical heavy-haul operations.
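One simple way to turn such delay logs into a simulator input is to resample from the empirical distribution. The sketch below does this under stated assumptions: the delay values are invented placeholders rather than data from the corridor, and `sample_delays` is a hypothetical helper. The paper does not state which distribution family was fitted, so no parametric form is assumed.

```python
import random

def sample_delays(observed_minutes, n, seed=0):
    """Draw n delays (in minutes) by resampling observed values with replacement."""
    rng = random.Random(seed)
    return [rng.choice(observed_minutes) for _ in range(n)]

# Placeholder delay log in minutes (illustrative values only, not corridor data).
observed = [0, 0, 5, 12, 20, 45, 90]
simulated = sample_delays(observed, n=1000)
mean_delay = sum(simulated) / len(simulated)
```

Fixing the seed keeps simulated scenarios reproducible across training runs.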
Computational Complexity of DCD-RLN Algorithm
To clarify the analysis, we define the following notations:
Single Time-Step Complexity
The computational complexity of the DCD-RLN algorithm can be analyzed based on its key components and operations. The analysis assumes the algorithm is executed over
Initialization Phase
The actor network
The target networks
Initializing the replay buffer involves allocating memory for
Training Loop
This assumes a total of
Action selection and interaction with the environment: For each agent, the complexity of selecting the action
Data storage and replay buffer sampling: The complexity of storing the current state, action, and reward into the replay buffer is
Target value calculation: For each sampled transition data (a total of
Gradient estimation and network updates: Critic network gradient estimation: The complexity is
The computational complexity for a single time step is given by:
Overall Complexity
The total computational complexity over
Dominant Term (High-Dimensional Multi-Agent Systems)
For high-dimensional systems with large
Intelligent Train-Dispatch Adjustment with DCD-RLN
In this paper, the loading-delay problem in heavy-haul railway transportation is modeled as an MDP and optimized using deep-RL techniques. An MDP provides a modular way to express the interaction between an agent and its environment through accumulated rewards, and consists of four key sub-modules: states, actions, state-transition functions, and rewards. In this study, the agent is defined as the collection of all train-operation states. By continuously updating the local states of the trains represented by the agent, the global environment of the heavy-haul transportation route can be learned and updated. The following sections define these four sub-modules for the interaction between the agent and the loading-delay environment.
State Space
The state space represents the information that the agent receives from the loading-delay environment in heavy-haul railway transportation and is specifically defined as the current state at time step
Action Space
In the DCD-RLN algorithm, at time stamp
State-Transition Function
At time stamp
Reward Function
A specific research objective is to update the learning policy through interactions with the learning environment. The reward function, serving as a feedback module for the agent’s actions, directly reflects the success or failure of the agent’s actions, thereby effectively guiding the agent toward optimal learning behavior. For the loading-delay problem in heavy-haul railway transportation, in the DCD-RLN algorithm, the agent can determine the behavior of vehicles at time stamp
Integration of Deterministic Scheduling Model and RL Framework
To establish a coherent and operational link between the deterministic scheduling model and the proposed RL framework, we construct the environment, state space, action space, and reward function of the RL model based on the parameters and constraints defined in the mathematical formulation.
Specifically, the state space encapsulates dynamic system status, including train identifiers, arrival times at loading/technical stations (
To further clarify the correspondence between the deterministic scheduling model and the components of the RL framework, we present Table 2, which summarizes how the key variables and constraints in the mathematical model are systematically mapped to elements in the RL environment. This mapping demonstrates how environment states, actions, and reward functions are grounded in the operational and structural logic of the original formulation.
Correspondence Between Deterministic Model Components and RL Elements
Note: RL = reinforcement learning.
In addition, Figure 5 provides a schematic flowchart illustrating the integration process between the deterministic model and the RL framework. It visualizes the transformation of static model parameters into dynamic decision variables within the learning environment and highlights the constraint-driven masking mechanism that ensures feasibility and consistency with domain rules during policy optimization.

Flowchart of the integration between the deterministic scheduling model and RL framework.
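The constraint-driven masking idea can be sketched as follows: actions that violate a domain rule receive a score of negative infinity before the greedy choice, so they can never be selected. The action names, scores, and feasibility flags are hypothetical; the actual feasibility checks come from the constraints of the scheduling model and are not reproduced here.

```python
import math

def masked_argmax(scores, feasible):
    """Return the index of the best-scoring feasible action.

    scores   -- list of floats, one per action
    feasible -- list of bools, True if the action satisfies all constraints
    """
    if not any(feasible):
        raise ValueError("no feasible action available")
    # Infeasible actions are pushed to -inf so they cannot win the argmax.
    masked = [s if ok else -math.inf for s, ok in zip(scores, feasible)]
    return max(range(len(masked)), key=masked.__getitem__)

# Hypothetical dispatching actions for one train at one decision step.
actions = ["depart_now", "hold", "couple_at_technical_station"]
scores = [2.5, 0.3, 1.7]
feasible = [False, True, True]  # e.g. no departure slot is currently free
choice = actions[masked_argmax(scores, feasible)]
```

Here the highest-scoring action is infeasible, so the next-best feasible action is chosen instead.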
Moreover, to reflect the unique characteristics of heavy-haul freight operations—such as high tonnage, double-locomotive traction, and long loading cycles—we explicitly encode load capacity, composition type (unit or combined train), and yard-specific coupling constraints into the state representation. These elements are typically irrelevant in passenger-train operations but are essential to capture the operational complexity of heavy-haul corridors. This differentiation ensures that our RL policy is both context-aware and operationally grounded in real-world heavy-haul practices.
Case-Validation Analysis
The proposed approach is evaluated on a representative case of a coal-dedicated heavy-haul railway corridor, where trains are configured with up to 20,000 t of load and operate under stringent infrastructure and scheduling constraints. Unlike passenger trains, these freight trains face significant bottlenecks because of loading/unloading delays, limited passing capacity, and inflexible departure slots. These features make real-time dispatching highly sensitive to loading delays, and demand a specialized optimization approach, as developed in this study.
This section presents a case study using the ShenShuo line of the ShenHua Railway as an illustrative example. The ShenShuo Railway extends from DaLiuTa in the north to ShuoZhou West in the south, covering a total mainline distance of 266 km. Within this segment, there are a total of 16 stations involved in the origination and termination operations of trains, thereby constituting the set of stations for optimizing the intelligent adjustment of train schedules under loading-delay conditions.
For the purpose of this case study focusing on the ShenShuo line, the DaLiuTa station is designated as the boundary station, and ShiChi South is identified as the terminal station for the coal-transport corridor. Figure 6 provides an overview of the spatial distribution of each station, including their attributes and the directional flow of vehicles. Stations colored in blue indicate loading stations, those in yellow represent technical stations, and those in red signify stations that serve both loading and technical functions.

Schematic diagram of the ShenShuo line.
This case study lays the foundation for investigating the intelligent adjustment and optimization of train schedules under loading-delay conditions on the ShenShuo line. The detailed analysis of station attributes, vehicle flow directions, and the distinctive roles of each station contributes to a comprehensive understanding of the operational context. The utilization of DaLiuTa as the boundary station and ShiChi South as the terminal station reflects the specific focus on the coal-transport corridor within the ShenShuo line.
Model Parameter Configuration
In this case study, the relevant information and parameter settings of each station on the ShenShuo Railway are listed in Table 3. Table 4 provides a detailed display of the operating times of different types of heavy-duty freight trains between stations, while Table 5 illustrates the passing capacities of different types of trains in the sections. Additionally, the details of technical stations in the study route are configured in Table 6.
Relevant Information and Parameter Settings for Each Station on the ShenShuo Railway
Note: NA = not available.
Operating Times of Different Types of Heavy-Duty Freight Trains Between Stations on the ShenShuo Railway
Passing Capacities of Different Types of Trains in Sections on the ShenShuo Railway
Technical Station Configuration
In this case study, we only consider the 5000 t unit train as the basic unit for investigation. Two 5000 t unit trains can be combined to form a 10,000 t combination train, and a 10,000 t combination train can be disassembled into two 5000 t unit trains using two different operational methods. For computational convenience, the operational time for combining unit trains into a combination train is uniformly set to 60 min, while the disassembly time of a combination train into unit trains at the technical station is set to 35 min.
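The fixed operation times above can be expressed as a small helper, sketched here for illustration. The function names and the example counts of combining and disassembly operations are assumptions; only the 5000 t unit size and the 60 min and 35 min operation times come from the text.

```python
UNIT_TONNAGE_T = 5000   # tonnage of one unit train
COMBINE_MIN = 60        # combining two unit trains into a combination train
SPLIT_MIN = 35          # disassembling a combination train into unit trains

def combined_tonnage(n_units=2):
    """Tonnage of a train assembled from n_units unit trains."""
    return n_units * UNIT_TONNAGE_T

def station_handling_minutes(n_combine, n_split):
    """Total technical-station time for the given numbers of operations."""
    return n_combine * COMBINE_MIN + n_split * SPLIT_MIN

# Hypothetical workload: three combining and two disassembly operations.
total = station_handling_minutes(n_combine=3, n_split=2)
```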
Acquisition Process of Heavy-Duty Empty-Train Flow
The source of the heavy-duty empty-train flow is an integrated loading plan for one day on the ShenShuo Railway, from 18:00 to 18:00 the next day. The plan schedules a total of 64 unit trains for loading operations. Following the time division method used during the planning phase of ShenHua Railway, this section divides one loading plan into six stages, with each stage covering a tracking time of 4 h. Table 7 provides information on the approved loading plan and transportation-demand quantities.
Integrated Daily Loading Plan for BaoShen Railway
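The six-stage division of the daily loading plan can be made concrete with a short sketch that maps a clock time to its stage. Numbering the stages from 1 to 6 and the name `stage_index` are assumptions for illustration; the 18:00 start and the 4 h stage length follow the description of the plan.

```python
PLAN_START_HOUR = 18  # the plan runs from 18:00 to 18:00 the next day
STAGE_HOURS = 4       # each stage covers 4 h, giving six stages per day

def stage_index(hour, minute=0):
    """Return the stage number (1..6) that contains the given clock time."""
    minutes_since_start = ((hour - PLAN_START_HOUR) % 24) * 60 + minute
    return minutes_since_start // (STAGE_HOURS * 60) + 1

first = stage_index(18, 0)
last = stage_index(17, 59)
```

The modulo on the hour handles the wrap past midnight, so 02:30 falls in the third stage and 17:59 in the sixth.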
Within one decision cycle, a total of eight trains at loading stations experienced loading delays resulting from delayed coal delivery or station-equipment malfunctions: six 5000 t single-unit empty trains and two 10,000 t combined empty trains. Table 8 provides data on the storage of empty trains at various stations during different time periods.
Empty-Train Storage Data at Empty-Train Supply Stations on BaoShen Railway
Model Solution and Results Analysis
In this subsection, the mathematical model of loading delay in the coal-transport channel proposed above will be solved using a DCD-RLN method. In addition to model parameter settings and relevant station information, the learning rate of the DCD-RLN method is set to 0.002, the discount factor is chosen as 0.98, the target network update rate is 0.9, and the maximum number of iterations per training session is set to 800.
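The settings above could be collected in a small configuration object, as sketched below. The soft-update rule shown is one common reading of a target network update rate; whether the reported 0.9 weights the online or the target parameters is not stated, so the convention used here (weight on the online parameters) is an assumption, as are the names `TrainConfig` and `soft_update`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    learning_rate: float = 0.002
    discount: float = 0.98
    target_update_rate: float = 0.9
    max_iterations: int = 800

def soft_update(target_params, online_params, rate):
    """Blend online parameters into the target parameters (assumed convention)."""
    return [rate * o + (1.0 - rate) * t
            for t, o in zip(target_params, online_params)]

cfg = TrainConfig()
# Toy parameter vectors, purely for illustration.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], cfg.target_update_rate)
```

Under this convention, a rate of 0.9 moves the target network most of the way toward the online network at every update.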
Original Freight-Train Operating Plan
The original freight-train operating plan consists of two parts: the plan for loaded trains and the plan for empty trains. The initial plan for loaded trains is given in Table 9 and the initial plan for empty trains in Table 10. In the headers of both tables, OS denotes the originating station, DS the destination station, and OF the operating frequency.
Original Plan for Loaded-Train Operation
Note: OS = originating station; DS = destination station; OF = operating frequency.
Original Plan for Empty-Train Operation
Note: OS = originating station; DS = destination station; OF = operating frequency.
Table 9 shows that the initial loaded-train operation plan schedules 41 direct unit trains from loading stations to technical stations or freight ports. The remaining trains complete the grouping operation at a technical station and are then dispatched to freight ports as 10,000 t combination trains according to freight-flow demand.
Table 10 shows that the initial empty-train dispatch plan sends 45 5000 t unit trains directly from their respective locations to the loading stations for loading operations. The remaining 12 trains, each of 10,000 t, are disassembled at intermediate technical stations into 5000 t unit trains, which are then dispatched to loading stations according to demand.
Optimization Results of Freight-Train Dispatch Plan
Starting from the initial dispatch plans for loaded and empty trains, we applied the DCD-RLN method to optimize both jointly. After coordinated optimization, the total transport time for loaded trains, from loading to completion of their transportation tasks, was reduced to 29,875 min, and the demand for empty trains needed to fulfill the loading requirements on the studied line was met. The objective function optimized by the DCD-RLN is given in Equation 3.
The results of solving the loading-delay model for the coal-transportation channel with the DCD-RLN method are summarized in Figure 7, which shows the relationship between the optimized objective function value and the number of iterations.

The convergence plot for the DCD-RLN method.
Performance Comparison of Similar Algorithms
The principal innovation of this study is a Multi-Agent Reinforcement Learning (MARL) algorithm. The crux of modeling in MARL lies in coordination, communication, and equilibrium selection, particularly in competitive scenarios. The interdependence of agents’ actions introduces significant complexity, because the environment appears nonstationary to each agent. Generic RL designs do not address these aspects and are not specialized for the characteristics of railroads and multi-agent systems. For this reason, this paper details how agents are generated and how communication among agents and the environment is designed, with Figures 3 and 4 illustrating the specific design of the model and algorithm ( 48 ).
The reason for selecting metaheuristic algorithms, such as the Grey Wolf Optimizer (GWO) ( 49 ) and sine cosine algorithm (SCA) ( 50 ), as comparative benchmarks is twofold. First, these algorithms are well-established in the field of optimization and have proven their efficacy across a variety of complex problems, including feature selection and classification tasks. Second, they serve as a robust baseline to assess the performance of our DCD-RLN algorithm, given their adaptability and widespread use in solving optimization problems similar to ours. The hyperparameter settings for the GWO and SCA algorithms are consistent with those of the original authors. Both algorithms employ a population size of 10 individuals and a maximum number of iterations set to 800.
Furthermore, Figure 8 illustrates the convergence behavior of the GWO and SCA, two metaheuristic algorithms, under the same operational conditions. A comparison between Figures 7 and 8 reveals that the DCD-RLN algorithm achieves a final convergence value of 12,004, whereas the convergence values for the comparative algorithms GWO and SCA are 13,568 and 14,721, respectively. This indicates that the DCD-RLN algorithm exhibits superior global search performance compared with GWO and SCA.

Convergence plot for GWO and SCA methods.
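The gap between the reported convergence values can be quantified with a short calculation. The sketch assumes that a lower objective value is better, which matches the time-minimization objective; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(baseline, value):
    """Fraction by which value is lower than baseline."""
    return (baseline - value) / baseline

dcd_rln = 12_004
baselines = {"GWO": 13_568, "SCA": 14_721}
gaps = {name: relative_reduction(v, dcd_rln) for name, v in baselines.items()}

for name, gap in gaps.items():
    print(f"DCD-RLN vs {name}: {gap:.1%} lower final objective value")
```

On these figures, the final DCD-RLN value is roughly 11.5% below that of GWO and roughly 18.5% below that of SCA.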
In the final optimization results, Table 11 presents the empty-train dispatch plan for unit trains and combined trains, and Table 12 presents the corresponding loaded-train dispatch plan. It can be observed that the optimized empty-train dispatch plan remains consistent with the pre-optimization plan.
Optimized Empty-Train Dispatch Plan
Note: OS = originating station; DS = destination station; OF = operating frequency.
Optimized Loaded-Train Dispatch Plan
Note: OS = originating station; DS = destination station; OF = operating frequency.
A comparison of Tables 9 to 12 shows that only the loaded-train operation plan changes after optimization. With loading delays taken into account, the optimized loaded-train schedule plans 34 trains from loading stations to freight ports, seven fewer than before optimization. The number of 10,000 t combination loaded trains is 15, one more than in the initial operation plan.
In conjunction with the comparison and analysis of freight-train operation plans on the ShenShuo Railway coal-transportation channel before and after optimization, this section employs a more intuitive approach to validate the effectiveness of intelligent adjustment in the face of loading delays. The analysis further considers the operational utilization rate at technical stations and the alignment with train demand.
We first examine the operational utilization rate of technical stations before and after applying the DCD-RLN method. Table 13 compares the number of combinable trains at technical stations before and after optimization. Technical stations such as S4, S10, and S13 exhibit an operational utilization rate that exceeds its pre-optimization level by approximately 10%. This suggests that when heavy-duty freight trains cannot operate efficiently as unit trains because of their distance from loading stations or freight ports, completing marshaling operations at technical stations to form 10,000 t trains improves the utilization of technical-station operational capacity, which in turn alleviates capacity constraints within line segments. These insights support a deeper analysis of the capacity bottleneck on the ShenShuo line of the ShenHua Railway and can inform future improvements in transportation capacity and operational utilization at technical stations.
Optimized Loaded-Train Dispatch Plan
Figure 9 presents the optimized empty-train allocation plan and the loaded-train operation plan. As shown in the left part of Figure 9, the allocation plan schedules 65 5000 t unit trains to be directly dispatched from the empty-train stations to the loading stations for loading operations, while the remaining 20 10,000 t trains are decomposed into 5000 t unit trains at the technical stations and then sent to the loading stations according to the loading demands. The right part of Figure 9 reveals that the optimized loaded-train operation plan schedules 37 direct unit trains from the loading stations to the destination stations, which is four fewer than the initial operation plan. Meanwhile, 34 10,000 t combined trains are planned, meaning that a total of 68 5000 t unit trains are combined at different combination stations and dispatched in the form of combined 10,000 t heavy-haul trains, which is two more than the initial operation plan. In addition to these changes, the optimized loaded-train operation plan has also undergone corresponding adjustments in train operation sections, quantities, time periods, and types. Moreover, the operation time periods of some trains have been optimized to better match the arrival time periods of the allocated empty trains.

Efficient optimization of empty-train allocation and loaded-train operation.
Next, the effectiveness of this work is examined through the alignment of the plans for loaded and empty trains, which Table 14 reports before and after optimization. The total time for the optimized heavy-duty freight trains to complete transportation operations is 29,875 min, which is 1536 min (4.89%) less than the 31,411 min of the pre-optimization plan. Taken together, the two aspects analyzed in this section, the utilization rate of technical-station operations and the alignment of loaded and empty trains, show that intelligent adjustment of schedules under loading delays, which accounts for the effect of reallocating empty trains on loading operations at loading stations, ensures the continuity of loaded and empty trains. Within the constraints of limited transportation capacity, this approach improves the efficiency and execution of goods transportation in scenarios with loading delays.
Comparison of Matching Degree in Loaded/Empty-Train Plans Before and After Optimization
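The reported time saving can be checked directly from the two totals. The variable names in this sketch are illustrative only.

```python
before_min = 31_411  # total transport time before optimization (min)
after_min = 29_875   # total transport time after optimization (min)

saved_min = before_min - after_min
saved_pct = 100 * saved_min / before_min
saved_hours = saved_min / 60

print(f"saved {saved_min} min ({saved_pct:.2f}%), about {saved_hours:.1f} h")
```

The calculation reproduces the 1536 min and 4.89% figures, which correspond to about 25.6 h of total train time.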
Conclusion
In this study, we have introduced an innovative optimization method, the DCD-RLN, to tackle the challenge of loading delays in railway transportation. Through an extensive analysis of the impact of loading delays on railway efficiency, we have highlighted the necessity and urgency of our research. The DCD-RLN method adeptly adjusts train schedules during loading delays by accounting for constraints such as empty and loaded unit trains, coupling requirements, and the complex interactions among multiple intelligent agents in a competitive and cooperative environment. Our method has proven to significantly enhance the robustness and efficiency of train operations compared with existing solutions. Our research findings emphasize the critical role of advanced intelligent optimization methods in addressing real-world problems in railway transportation. The successful implementation of the DCD-RLN method not only illustrates the potential of deep RL in optimizing intricate transportation systems but also offers a novel tool for railway enterprises to overcome the challenges associated with loading delays.
However, it is important to acknowledge the limitations of our current approach. The DCD-RLN method, while effective in controlled scenarios, may face scalability issues when applied to larger and more diverse transportation networks. Additionally, the method’s reliance on accurate initial data and predefined parameters could limit its adaptability to unforeseen disruptions or changes in operational conditions. Future work will focus on extending the DCD-RLN method to more complex transportation scenarios and incorporating real-time data for dynamic scheduling, thereby enhancing the adaptability and responsiveness of railway-transportation systems.
We recommend further research to address these limitations and to explore the application of the DCD-RLN method in other types of transportation networks, which would help verify its broad applicability and contribute to the modernization and sustainable development of railway transportation. By refining the method and expanding its scope, we aim to improve the efficiency and reliability of transportation systems more widely and to adapt the method to the evolving demands of the transportation industry.
Footnotes
Author Contribution Statement
The authors confirm contribution to the paper as follows: study conception and design: Ai-Qing Tian, Jeng-Shyang Pan; data collection: Hong-Xia Lv; analysis and interpretation of results: Ai-Qing Tian, Jeng-Shyang Pan; draft manuscript preparation: Ai-Qing Tian. All authors reviewed the results and approved the final version of the manuscript.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the National Natural Science Foundation of China (Project No. 52072314; 52172321; 52102391), Sichuan Science and Technology Program (Project No. 2020YJ0268; 2020YJ0256; 2021YFQ0001; 2021YFH0175), Science and Technology Plan of China Railway Corporation (Project No. 2019F002), China Shenhua Energy Co., Ltd. Science and Technology Program (Project No. CJNY-20-02), China Railway Beijing Bureau Group Co., Ltd. Science and Technology Program (2021BY02; 2020AY02), and the National Key R&D Program of China (2017YFB1200702).
Data Accessibility Statement
Some or all of the data, models, or code generated or used during the study was provided by Southwest Jiaotong University under license and therefore cannot be made freely available. Direct requests for this data should be made to Southwest Jiaotong University.
