Bridging gaps in intelligent truck dispatching: An underexplored PPO-based model with expanded feature integration within open pit mines

Abstract

The mining industry is increasingly adopting intelligent systems to address challenges such as declining ore grades, rising operational costs and sustainability demands. This study presents a Proximal Policy Optimisation (PPO)-based truck dispatching model designed to enhance operational efficiency in open-pit mining. Addressing two key research gaps – limited integration of dispatching features and underutilisation of advanced reinforcement learning (RL) algorithms – the proposed model incorporates 19 critical features and is evaluated against the conventional Fixed Schedule (FS) method. A discrete event simulation environment was developed to emulate an open pit case study with heterogeneous trucks and shovels. The PPO model demonstrated convergence within 3.5 h and outperformed the FS baseline across multiple key performance indicators, including a 5.7% increase in total production, 4.2% improvement in plant delivery, and 13.2% higher truck utilisation. Compared to widely used RL algorithms in this domain, the PPO approach achieved faster convergence despite handling a more complex feature set. These findings highlight the potential of PPO as a robust and scalable solution for intelligent dispatching, offering practical benefits for Mining 4.0 initiatives.

Keywords

Truck dispatching; simulation; PPO; fleet management; open pit mining

Introduction

The mining industry is undergoing a transformative shift as it confronts various challenges – declining ore grades, rising energy costs and increasing pressure to reduce environmental impact (Igogo et al., 2021). In response, the sector is embracing Mining 4.0, a digital revolution that leverages automation, artificial intelligence and real-time data analytics to enhance efficiency and sustainability (Hazrathosseini and Moradi Afrapoli, 2023c). A critical component of this evolution is intelligent truck dispatching, which directly influences operational costs, productivity and carbon emissions (Hazrathosseini and Moradi Afrapoli, 2023b). Traditional dispatching methods, often reliant on static heuristics, struggle to adapt to the dynamic nature of mining operations, leading to inefficiencies in fuel consumption, fleet utilisation and route optimisation (Hazrathosseini and Moradi Afrapoli, 2023a; Khorasgani et al., 2020; Zhang et al., 2020). Therefore, adopting intelligent solutions for truck dispatching is essential to meet the evolving demands of modern mining, enabling adaptive, data-driven decisions that optimise resources, reduce emissions and drive sustainable productivity.

Reinforcement learning (RL) has emerged as a transformative solution, enabling dispatching systems to learn and improve continuously by interacting with their environment. Unlike conventional approaches, RL-based systems can dynamically adjust to real-time variables – such as fluctuating demand, equipment breakdowns and weather disruptions – while optimising for key performance indicators (KPIs) like fuel efficiency, production output and emissions reduction (De Carvalho and Dimitrakopoulos, 2021; Hazrathosseini and Moradi Afrapoli, 2025; Huo et al., 2023; Matsui et al., 2023; Zhang et al., 2020). Recent advancements in deep RL and multi-agent systems have demonstrated significant gains over conventional methods, including a 5.56% increase in ore production (Zhang et al., 2020), 10–30% lower greenhouse gas emissions (Huo et al., 2024), tighter control of grade fluctuations in the ore flow (Qiu et al., 2023) and a 47% improvement in cash flow (De Carvalho and Dimitrakopoulos, 2023). These figures underscore RL's potential to revolutionise truck dispatching, bridging the gap between theoretical advancements and practical mining applications.

Over the past five years, numerous researchers have explored the application of RL to develop intelligent truck dispatching solutions. However, as will be discussed in further detail in section ‘Literature review’, a notable research gap exists concerning the comprehensiveness of truck dispatching features addressed within the models developed in the existing literature. Specifically, Hazrathosseini and Moradi Afrapoli (2024) conducted a critical review of previously published articles on RL-based fleet management systems in open-pit mines. Their study introduced a five-feature-class scale, incorporating 29 widely used dispatching features. Their findings revealed that 60% of these features were unaddressed in prior research, indicating substantial potential for improvement in the underlying algorithms.

An additional research gap is evident in the less frequent consideration of alternative RL algorithms, such as Proximal Policy Optimisation (PPO). Existing intelligent dispatching systems have predominantly utilised the Deep Q-Network (DQN) and Double DQN (DDQN), often without exploring the efficacy of other viable approaches. Consequently, this study aims to address these two identified gaps: the oversight of essential dispatching features and the overuse of certain RL algorithms. To achieve this, the present research is guided by the following three objectives:

To develop an RL-based truck dispatching system incorporating a more comprehensive set of features than those currently addressed in the existing literature.

To leverage PPO, an algorithm that has been less explored in the relevant literature.

To quantitatively assess the operational efficiency of the developed system within an open pit mine case study against a conventional industry baseline, specifically the Fixed Schedule (FS) strategy, utilising common KPIs such as total production, material delivery to the plant, waste dump delivery, plant feed grade, truck fleet utilisation, and queuing times at shovels.

Literature review

The foundational concept behind intelligent dispatching is the shift from rule-based systems to data-driven decision-making frameworks. In response, researchers have developed intelligent models using RL techniques. Zhang et al. (2020) introduced a multi-agent framework to dispatch large-scale heterogeneous fleets using DQN. Their environment used centralised learning with decentralised execution, allowing individual trucks to act autonomously while sharing knowledge through a global experience buffer. Their approach improves productivity by 5.56%, reduces queuing and shovel starvation, and demonstrates robustness to fleet changes. De Carvalho and Dimitrakopoulos (2021) addressed the challenge of dynamic truck dispatching in mining complexes under operational and geological uncertainties. The authors proposed a DDQN framework integrated with a simulator to optimise dispatching decisions while managing stochastic factors like truck/shovel failures, variable cycle times, and ore grade uncertainty. Their approach, tested in a copper-gold mining complex, demonstrates 12–16% higher copper recovery and 20–23% higher gold recovery compared to baseline methods, while reducing queue times and improving fleet utilisation.

In another effort, De Carvalho and Dimitrakopoulos (2023) proposed an actor-critic RL framework to integrate short-term production planning with dynamic fleet management in mining complexes under geological and operational uncertainties. The method employs two RL agents: the first optimises shovel allocation to mining fronts, while the second assigns material destinations and truck numbers, leveraging a simulator to forecast material flow and update orebody models using real-time sensor data. Tested in a copper mining complex, the approach achieved a 47% improvement in cash flow compared to static baseline strategies by dynamically adapting to equipment failures, grade variability and production targets. Huo et al. (2023) proposed another approach to optimise truck fleet dispatching in open-pit mining, targeting reductions in greenhouse gas emissions and improvements in operational efficiency. Using tabular Q-learning in a hypothetical mining environment, their method dynamically adjusted truck routes and schedules based on real-time factors like payload, traffic and maintenance needs. The results showed a 30% reduction in emissions and a 55% increase in correct material deliveries compared to traditional fixed-schedule dispatching. Matsui et al. (2023) presented a real-time dispatching algorithm utilising Dueling DQN to optimise autonomous haulage truck operations in open pit mining. Their model, simulated in an environment replicating autonomous truck behaviours and accounting for vehicle interactions, demonstrated superior performance compared to industry-standard methods. The experimental results highlighted a 15–20% increase in transportation efficiency and a decrease in fuel consumption. Qiu et al. (2023) proposed a dynamic multi-objective evolutionary algorithm for optimising truck scheduling in open-pit mining, focusing on real-time ore blending and operational efficiency. The study incorporated tabular Q-learning to minimise ore grade fluctuations at crushing stations while maximising production and reducing fuel consumption. Results demonstrated that the proposed method outperforms traditional approaches, achieving better control over grade stability and improving scheduling efficiency.

Huo et al. (2024) developed a smart dispatching solution for mining haul truck fleets using a DDQN approach to improve operational efficiency and reduce greenhouse gas emissions. They simulated a hypothetical open pit mine environment to compare the performance of their model with conventional methods. The results demonstrated that their solution increased productivity by at least 27%, reduced emissions per unit production by 10–30%, and effectively handled operational disruptions, such as fleet size changes and shovel grade variations. Hazrathosseini and Moradi Afrapoli (2025) developed an intelligent rule-based decision-making system for preliminary truck dispatching in open pit mines using a modified Q-learning algorithm. Compared to a conventional rule-based system, the proposed solution achieved 10 times fewer incorrect dispatches, 4% fuel savings, a 10% reduction in queuing time, and a 14% increase in ore production. The study highlighted the system's ability to handle unforeseen scenarios and its potential as an upper-stage solution in multi-stage intelligent dispatching frameworks. Recently, Noriega et al. (2025) developed a real-time truck dispatching system for open-pit mining using DDQN. They created a discrete event simulation model to train the system, accounting for uncertainties in equipment cycles and production targets. The results showed that their DDQN-based solution outperformed traditional heuristics, achieving higher productivity, better adherence to ore quality targets, and improved fleet utilisation while integrating production indicators and traffic rules like no-overtaking constraints.

Table 1 offers a comprehensive overview, meticulously detailing the advantages and disadvantages associated with previously developed RL-based dispatching systems. Through this analysis, two distinct research gaps become particularly noticeable. Firstly, there is a consistent oversight of essential dispatching features in existing models, as highlighted by Hazrathosseini and Moradi Afrapoli (2024). Secondly, a prevalent reliance on specific value-based algorithms, such as DQN and DDQN, is observed, often to the exclusion of exploring alternative and potentially more suitable algorithms like PPO, an actor-critic method. This research initiative aims to address these two identified gaps directly by developing a more comprehensive truck dispatching algorithm that leverages PPO, an algorithm that remains relatively underexplored in this specific application domain.

Table 1.

Comparison of the RL-based models in the literature.

Article	Algorithm	Advantages	Disadvantages
Zhang et al. (2020)	DQN	(1) Shared centralised learning and decentralised execution scheme, (2) Truck failures and heterogeneity are addressed.	(1) Single objective formulation, (2) Poor reward shaping, (3) Essential dispatching features are overlooked.
De Carvalho and Dimitrakopoulos (2021)	DDQN	(1) Centralised learning and decentralised execution scheme, (2) A more complex scenario was tested.	(1) Single objective formulation, (2) Essential dispatching features are overlooked.
Huo et al. (2023)	Tabular Q-learning	(1) Greenhouse gas emissions are incorporated, (2) Intricate reward shaping, (3) Some unprecedented features such as truck maintenance and fuel consumption are addressed.	(1) Use of tabular Q-learning, (2) Decentralised learning scheme, (3) Homogeneous fleet, (4) Essential dispatching features are overlooked.
Matsui et al. (2023)	Dueling DQN	(1) The algorithm is benchmarked against a linear programming model, (2) Vehicle interactions are considered.	(1) Essential dispatching features are overlooked.
Qiu et al. (2023)	Tabular Q-learning	(1) Real-time ore blending is considered.	(1) Use of tabular Q-learning, (2) Essential dispatching features are overlooked.
De Carvalho and Dimitrakopoulos (2023)	An unspecified actor-critic method	(1) Shovel allocation, (2) Production scheduling decisions are improved, (3) Sensor data are used.	(1) Learning process is complicated, requiring 24 h of training, (2) Essential dispatching features are overlooked.
Huo et al. (2024)	DDQN	(1) Greenhouse gas emissions are incorporated, (2) Truck maintenance is addressed.	(1) Basic state representation lacking decision-making parameters (e.g. shovel production targets), (2) Essential dispatching features are overlooked.
Hazrathosseini and Moradi Afrapoli (2025)	A variant of Q-learning	(1) Underexplored dispatching features are addressed, (2) The algorithm can be integrated into multi-stage dispatching models.	(1) The developed model dispatches trucks to the relevant shovel and destination without optimising the operational objectives.
Noriega et al. (2025)	DDQN	(1) Attention to bunching effects on roads, (2) Efficient reward shaping and state representation.	(1) Essential dispatching features are overlooked.

Methodology

Figure 1 illustrates the methodological flowchart for the intelligent truck dispatching model developed in this study. First, an array of technical dispatching features is selected for inclusion in the model with a view to filling the research gaps. Then, a simulated open pit mining environment is created using Python®. This simulated environment serves as the training platform for the truck agents, enabling them to gain experience and knowledge through interactions within this virtual setting. The environment incorporates the technical dispatching features identified as research gaps earlier. In the next step, an underexplored variant of RL algorithms, known as PPO, is implemented, providing the framework for training the truck agents. Once the training process is complete, the developed intelligent model is evaluated by benchmarking it against a baseline in the subsequent section to provide valuable insights into the effectiveness and potential benefits of the proposed model.

Figure 1.

Summary of the methodology.

Dispatching features

Section ‘Literature review’ highlighted the imperative for developing more comprehensive dispatching algorithms capable of integrating a broader array of essential features. Specifically, analysis of prior research indicated that only 12 out of 29 identified features had been addressed in existing studies, leaving a significant portion unexamined. To bridge this gap, the current study selects and incorporates 19 key features, since an effective truck dispatching model for open pit mining must consider a wide range of operational features to reflect the complexity of real-world environments. More specifically, geological uncertainties can affect ore accessibility and quality, so accounting for these in the model helps minimise disruptions. Ore grades determine whether material should be sent to a waste dump or a processing plant, making grade awareness crucial. Block sequences influence the extraction order, and incorporating this ensures optimised shovel and truck allocation. Operational constraints such as shovel refuelling, shovel movement, shovel failure and shovel maintenance are also critical – refuelling is inevitable, movement causes delays, and failures or maintenance can significantly affect productivity.

Given the diversity of equipment, shovel heterogeneity and truck heterogeneity must be reflected, as capacities and speeds vary across units. The model should also support truck scalability, allowing the number of active trucks to change based on demand. Truck failure and maintenance need to be modelled to avoid dispatching unavailable trucks and to reroute as needed. Likewise, truck refuelling, which contributes notably to idle time, must be planned intelligently. Weather conditions and blasting activities introduce external disruptions, requiring the model to reroute trucks or halt operations when necessary. Beyond operational logistics, the model must align with production objectives. These include ore production targets, which ensure trucks are dispatched to meet output goals; ore processing targets, which maintain a steady flow to processing plants; and processing plant head grade, which ensures consistent ore quality for efficient recovery. Lastly, waste production targets must be respected to adhere to scheduling constraints and stripping ratios. Collectively, these features ensure the dispatching system is robust, adaptive and aligned with mine-wide operational goals. By integrating these features into an intelligent dispatching model, this research aims to contribute a more holistic dispatching framework.

Mine environment

A ‘mine environment’ denotes a simulated environment emulating the conditions and complexities encountered in mining operations. The environment can be modelled as a Markov Decision Process (MDP), where the decision-making agent learns to optimise long-term operational efficiency. The MDP is formally defined by: a set of states ( $s$ ) representing the current status of trucks, shovels, destinations and operational targets; a set of actions ( $a$ ) corresponding to dispatch decisions (e.g. assign truck $i$ to shovel $j$ ); a transition function ( $p$ ) describing the probabilistic outcome of each action given the current state; a reward function ( $r$ ) quantifying the effectiveness of a dispatching decision; and a discount factor ( $γ$ ) that balances immediate versus future rewards (Puterman, 2014). PPO is used to learn a policy ( $π$ ) that maps states to actions to maximise expected cumulative reward under this MDP formulation. The relationship between the environment and PPO is one of mutual dependency. At each timestep, the learning agent observes its current state and executes an action within the training environment. In response, the environment transitions the agent to a subsequent state and evaluates the chosen action by issuing a reward or penalty signal. These experiences are then used by the PPO algorithm to update a neural network, which encodes the learned policy. The environment designed for the developed truck dispatching model comprises several distinct components and equations (equations (1) and (2)), as elaborated in Table 2.

Table 2.

Definition of MDP components.

Component	Description	Variables
Agent ( $i$ )	The number of agents is equal to the number of trucks in the case study.	$i \in \{1, \dots, 15\}$
State ( $s$ )	The state encapsulates decision variables represented as a vector available to the agent in every time step. $s = [P_{shovels}, P_{plant}, P_{waste}, G_{shovels}, G_{plant}, T, D_{shovels}, F_{routes}, C_{truck}]$ (1)	$P$ : The hourly and shift productions for shovels, plant, or waste; $G$ : Ore grades for shovels or plant; $D$ : Distances from shovels; $F$ : The summed capacity of trucks en route or already at a shovel for each route; $C$ : Truck capacity; $T$ : Time.
Action ( $a$ )	The action space consists of five locations in the case study, namely Shovel 1, Shovel 2, Shovel 3, Shovel 4, and Parking Yard.	$a \in \{0, 1, 2, 3, 4\}$
Reward ( $r$ )	The reward function plays a significant role in training the agent to meet operational targets. $r = (r_{sh} + r_{p} + r_{g} + r_{w}) - (p_{sh} + p_{p} + p_{g} + p_{w}) - 1 + B$ (2)	$r$ : Reward for meeting targets for shovel production, plant feed, plant grade, or waste production; $p$ : Punishment for deviation from targets for shovel production, plant feed, plant grade, or waste production; $B$ : Bonus of 20 for meeting all the targets.
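To make equation (2) concrete, the following minimal Python sketch evaluates the reward for a single dispatching decision. The function name, the argument layout and the unit-valued reward terms in the example are illustrative assumptions rather than the study's implementation; unit rewards are used only because they reproduce the maximum of 23 per decision quoted in section ‘Training’.

```python
from typing import Sequence

ALL_TARGETS_BONUS = 20.0  # B in equation (2)
STEP_COST = 1.0           # the constant -1 term in equation (2)


def dispatch_reward(rewards: Sequence[float],
                    penalties: Sequence[float],
                    all_targets_met: bool) -> float:
    """Evaluate equation (2) for one dispatching decision.

    rewards   -> (r_sh, r_p, r_g, r_w)
    penalties -> (p_sh, p_p, p_g, p_w)
    How each term is computed is not specified here; the caller supplies them.
    """
    if len(rewards) != 4 or len(penalties) != 4:
        raise ValueError("expected four reward terms and four penalty terms")
    bonus = ALL_TARGETS_BONUS if all_targets_met else 0.0
    return sum(rewards) - sum(penalties) - STEP_COST + bonus


# Hypothetical best case: every target met with unit rewards and no penalties.
best_case = dispatch_reward((1, 1, 1, 1), (0, 0, 0, 0), all_targets_met=True)
print(best_case)  # 23.0
```

The constant step cost means that a decision earning no reward still scores below zero, which discourages dispatches that contribute nothing to the targets.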

The environment must provide a simulator within which the truck agents interact. The simulation model used in this study is designed based on an open pit mine case study encompassing 4 heterogeneous shovels, 15 heterogeneous trucks of two types, and 1 parking yard embedded with a fuel station and a repair shop (Figure 2). Discrete event simulation was employed as the paradigm for modelling the mining environment using the logic depicted in Figure 3. The simulation initialises each truck with attributes, then iteratively evaluates actions based on a dynamic event table reflecting uncertainties and maintenance needs. A PPO agent determines the dispatch destination (a shovel or the yard) using a state vector, executes the action, and updates the event table if uncertainties or maintenance criteria are met. The truck's state and a computed reward are stored in PPO's replay memory for learning, and the simulation continues until the 12-h shift is completed. The model was implemented in Python® using SimPy®.

Figure 2.

The open pit mine case study.

Figure 3.

The simulation model.
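The event-driven logic described above can be illustrated with a deliberately simplified sketch. It uses Python's standard heapq module as a dependency-free stand-in for SimPy, and it omits the event table of uncertainties and maintenance needs. The names run_shift, cycle_hours and policy, and the fixed cycle times, are hypothetical and do not come from the study.

```python
import heapq
from typing import Callable, Dict, List, Tuple

SHIFT_HOURS = 12.0  # shift length stated in the text


def run_shift(n_trucks: int,
              cycle_hours: Dict[int, float],
              policy: Callable[[int, float], int]) -> Dict[int, int]:
    """Count completed haul cycles per destination over one shift.

    cycle_hours maps a destination to a fixed (hypothetical) cycle time;
    policy(truck, now) returns the destination chosen for a free truck.
    """
    # Event queue ordered by the time at which each truck becomes free.
    events: List[Tuple[float, int]] = [(0.0, truck) for truck in range(n_trucks)]
    heapq.heapify(events)
    completed = {dest: 0 for dest in cycle_hours}
    while events:
        now, truck = heapq.heappop(events)
        dest = policy(truck, now)        # dispatching decision
        finish = now + cycle_hours[dest]
        if finish > SHIFT_HOURS:
            continue                     # cycle would overrun the shift
        completed[dest] += 1
        heapq.heappush(events, (finish, truck))
    return completed


# Two trucks, each always sent to the destination matching its own id.
print(run_shift(2, {0: 1.0, 1: 2.0}, lambda truck, now: truck))
# {0: 12, 1: 6}
```

In a learning setting, the policy callback is the point at which a trained agent would be queried and its transition recorded.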

PPO

PPO is an RL algorithm introduced by Schulman et al. (2017) as a more practical alternative to Trust Region Policy Optimisation (TRPO) (Schulman et al., 2015). It belongs to the class of on-policy, actor-critic methods and has become a cornerstone in modern deep RL due to its balance of efficiency and empirical performance across a wide range of tasks (Gu et al., 2022). PPO operates by optimising a clipped surrogate objective function that constrains the policy update to remain within a trust region, thereby preventing large, destabilising updates. Specifically, the algorithm maximises the following objective (equation (3)):

$L^{CLIP}(θ) = \mathbb{E}_{t} [\min (r_{t}(θ) \hat{A}_{t}, \text{clip}(r_{t}(θ), 1 - ϵ, 1 + ϵ) \hat{A}_{t})]$ (3)

where $L^{CLIP}(θ)$ is the clipped surrogate objective function, $r_{t}(θ)$ is the probability ratio between the new and old policies, $\hat{A}_{t}$ is the advantage estimate, $θ$ represents the parameters of the policy neural network and $ϵ$ is the clipping ratio (e.g. 0.2), limiting how much $r_{t}(θ)$ can deviate from 1. If $\hat{A}_{t} > 0$ , the surrogate function encourages the action but clips to avoid excessive updates, and if $\hat{A}_{t} < 0$ , the surrogate function discourages the action but clips to avoid over-penalising. This formulation ensures that policy updates are conservative, thus improving training stability without the computational overhead of second-order methods like TRPO.
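The effect of clipping in equation (3) is easiest to see on individual samples. The sketch below is a plain-Python illustration, not the study's TensorFlow-style implementation; the function names and the example ratios and advantages are chosen purely for demonstration.

```python
from typing import Sequence


def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample term inside the expectation of equation (3)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def clipped_objective(ratios: Sequence[float],
                      advantages: Sequence[float],
                      epsilon: float = 0.2) -> float:
    """Empirical mean of the clipped terms over a batch."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# Positive advantage, ratio above 1 + epsilon: the gain is capped at 1.2 * A.
print(round(clipped_surrogate(1.5, 2.0), 6))   # 2.4
# Negative advantage, ratio below 1 - epsilon: the term is held at 0.8 * A.
print(round(clipped_surrogate(0.5, -1.0), 6))  # -0.8
# Ratio inside the clipping range: the term is simply ratio * A.
print(round(clipped_surrogate(1.1, 2.0), 6))   # 2.2
```

In the first two cases the term no longer depends on the ratio, so the gradient offers no incentive to push the policy further from the old one.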

The PPO algorithm, coded in Python®, was used to train truck agents for dispatching in the mining environment developed earlier, which is implemented as the MineEnvironment class. The pseudo code for this algorithm is detailed in Table 3. The PPO algorithm consists of three main parts: (1) Hyperparameters, (2) The PPOAgent class and (3) The episodic training part. The hyperparameters were initially chosen based on the values recommended by Schulman et al. (2017) and were subsequently refined through iterative trial-and-error tuning to better suit the specific characteristics of the dispatching environment. Regarding the PPOAgent class, the agent's neural network consists of a shared feature extractor with multiple dense ReLU layers, followed by dual output heads: an actor generating a softmax probability distribution over dispatching actions (e.g. assigning trucks to shovels or dump sites) and a critic estimating state values to stabilise training. The select_action method ensures operational feasibility by masking invalid actions (e.g. unavailable routes) and applying a numerically stable softmax to sample valid actions, enhancing decision-making efficiency in constrained environments. Experience transitions (state, action, reward, next state) are stored in a replay buffer (memory) and sampled in mini-batches for training, where the compute_gae function computes Generalised Advantage Estimation (GAE) to balance bias and variance in reward signals – critical for handling sparse, delayed rewards in dispatching. During training (train_step), the agent optimises the policy using a clipped surrogate objective to prevent destructive updates, while a critic loss refines value predictions and entropy regularisation encourages exploration. The replay method orchestrates batch sampling, GAE calculation and multi-epoch updates, ensuring robust policy refinement.
The episodic training phase of the PPO-based truck dispatching algorithm begins with the initialisation of the simulation environment and learning agent. Specifically, an instance of the custom-designed MineEnvironment is created to emulate the operational dynamics of an open pit mine, while a PPOAgent instance is initialised to learn optimal dispatching policies. At the start of each training episode, the environment is reset to a predefined initial state. Subsequently, multiple truck agents are instantiated as parallel simulation processes, each interacting with the shared MineEnvironment and PPOAgent. This parallelism enables realistic modelling of concurrent truck operations and facilitates efficient data collection for policy learning. The shared architecture ensures that all agents contribute to and benefit from a centralised learning process, promoting coordinated behaviour across the fleet.
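The action masking performed by select_action can be sketched as follows. This is an illustrative variant that applies a numerically stable softmax to raw action scores restricted to the valid actions, whereas Table 3 describes filtering the network's probabilities. The function names, the example scores and the use of index 4 for the Parking Yard are assumptions.

```python
import math
import random
from typing import List, Sequence


def masked_action_probs(scores: Sequence[float],
                        valid_actions: Sequence[int]) -> List[float]:
    """Softmax over valid actions only; invalid actions get probability 0."""
    if not valid_actions:
        raise ValueError("at least one valid action is required")
    # Subtracting the largest valid score keeps exp() from overflowing.
    peak = max(scores[a] for a in valid_actions)
    exp_scores = {a: math.exp(scores[a] - peak) for a in valid_actions}
    total = sum(exp_scores.values())
    return [exp_scores.get(a, 0.0) / total for a in range(len(scores))]


def select_action(scores: Sequence[float],
                  valid_actions: Sequence[int],
                  rng: random.Random) -> int:
    """Sample one action from the masked distribution."""
    probs = masked_action_probs(scores, valid_actions)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# Five actions (Shovels 1-4 and the Parking Yard); only 0 and 4 are valid here.
rng = random.Random(0)
action = select_action([2.0, 1.0, 0.5, 0.0, -1.0], [0, 4], rng)
print(action in (0, 4))  # True
```

Because invalid actions receive exactly zero probability, the sampler can never return an infeasible destination.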

Table 3.

The PPO algorithm.

Hyperparameters

Memory_capacity = 10,000, Batch_size = 32, Clip_ratio = 0.2, Lambda = 0.95, Epochs = 3, State_size = 27, Action_size = 5, Memory_capacity = 200,000, Gamma = 0.99, Learning_rate = 0.00025, Number of layers in the neural network: 6, Number of neurons in each layer: 300.

The PPOAgent Class

Method: __init__(self, state_size, action_size)

Set the state dimension and action dimension

Initialise experience memory with fixed capacity

Set the PPO clipping ratio hyperparameter

Initialise shared actor-critic model via _build_model()

Initialise the Adam optimiser with the learning rate

Method: _build_model (self)

Input: state vector of shape (state_size)

Pass input through multiple fully connected layers with ReLU activations

Split the output of the neural network into two specific heads (A head is a specific output layer attached to a shared ‘body’ of a neural network to perform multiple distinct tasks efficiently).

- Actor: outputs action probabilities using softmax

- Critic: outputs a scalar state-value estimate

Return the combined actor-critic model

Method: remember (self, state, action, reward, next_state, done)

For each agent, append the current experience to the memory buffer.

Method: select_action (self, state, valid_actions)

Pass state through model to get action probabilities

Filter probabilities for only valid actions

Normalise probabilities using softmax for numerical stability

Sample an action from the valid actions using the filtered distribution

Return selected action

Method: compute_gae (rewards, values, dones)

For each step:

- Compute Delta, which is the one-step temporal difference error (TD error). It is the fundamental building block of many RL algorithms and represents the difference between the estimated value of the current state and the better estimate formed by combining the immediate reward with the value of the next state.

- Update GAE using Delta, Gamma (the discount factor, determining the present value of future rewards) and Lambda (the exponential decay factor for GAE, controlling the bias-variance trade-off in the advantage estimate)

- Compute return using GAE

Return returns as a NumPy array

Method: train_step (states, actions, old_probs, advantages)

Start gradient tape

- Pass states through model to get new_probs, the probabilities for the same actions as predicted by the policy network after it has been updated. These are the probabilities of the current, training policy. In contrast, old_probs are the probabilities for actions as predicted by the policy network before it was updated.

- Convert actions to one-hot encoding

- Compute new and old action probabilities

- Calculate probability ratios using old_probs and new_probs

- Compute clipped surrogate loss for actor

- Compute mean squared error loss for critic

- Compute entropy loss for exploration

- Combine losses into total loss

Compute gradients of loss with respect to model parameters

Apply gradients using the Adam optimiser

Return total loss

Method: replay (self)

If the memory buffer contains fewer experiences than the batch size, return (do not train).

Sample a minibatch of transitions from memory

Extract states, actions, rewards, next_states, and dones

Pass states and next_states through model to get values

Compute returns using the compute_gae method

Compute advantages using returns

Normalise advantages

Get old action probabilities from model

Convert all data to tensors

For a fixed number of epochs:

- Call the train_step method with current batch

Method: load (self)

Load pre-trained weights into the network.

Method: save (self)

Save the current weights of the network.

The PPO Episodic Training part

Initialisation: Create instances of the MineEnvironment and PPOAgent classes.

For each episode:

Reset the mine environment.

Create all trucks as parallel simulation processes that run simultaneously while sharing the MineEnvironment and PPOAgent instances.

End For
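The compute_gae step outlined in Table 3 can be sketched in plain Python as below. The sketch assumes that the transitions are consecutive steps of a single trajectory, and it takes the next-state values as a separate argument, which differs from the signature listed in the table. The default Gamma and Lambda are the values given under Hyperparameters.

```python
from typing import List, Sequence, Tuple


def compute_gae(rewards: Sequence[float],
                values: Sequence[float],
                next_values: Sequence[float],
                dones: Sequence[bool],
                gamma: float = 0.99,
                lam: float = 0.95) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) for one ordered trajectory."""
    advantages = [0.0] * len(rewards)
    gae = 0.0
    # Walk backwards so each step can reuse the advantage of the step after it.
    for t in reversed(range(len(rewards))):
        not_done = 0.0 if dones[t] else 1.0
        # One-step TD error (Delta in Table 3).
        delta = rewards[t] + gamma * next_values[t] * not_done - values[t]
        gae = delta + gamma * lam * not_done * gae
        advantages[t] = gae
    # The value target is the advantage plus the critic's current estimate.
    returns = [adv + val for adv, val in zip(advantages, values)]
    return advantages, returns


advantages, returns = compute_gae(
    rewards=[1.0, 1.0], values=[0.0, 0.0],
    next_values=[0.0, 0.0], dones=[False, True],
)
print([round(a, 4) for a in advantages])  # [1.9405, 1.0]
```

Setting Lambda to 0 reduces each advantage to the one-step TD error, while values near 1 approach the full discounted return minus the value baseline.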

Training

The PPO-based truck dispatching algorithm was trained for 3.5 h over 100 episodes using an NVIDIA® A100 GPU on the Google Colab® platform (Figure 4). On average, each truck completed approximately 45 haulage cycles per episode. Given a fleet size of 15 trucks and a maximum achievable reward of 23 per truck (as defined in equation (2)), a cumulative reward of 15,525 was established as the minimum reward target. The learning process demonstrated convergence after approximately 90 episodes, with total episodic rewards peaking at 20,000 – well above the defined target threshold. In terms of operational metrics, the total material produced stabilised just above 15,000 tonnes, with each shovel contributing an average of 3800 tonnes, thereby exceeding the per-shovel production target. Material deliveries to both the processing plant and the waste dump plateaued around 7500 tonnes, aligning with their respective minimum thresholds. Notably, the plant feed grade exhibited marked stabilisation in the final episodes, suggesting the algorithm's capacity to maintain consistent ore quality. Additionally, truck fleet utilisation remained steady at approximately 91%, attributed to reduced queuing times at shovels following policy convergence.
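As a quick check of the reward target, multiplying the fleet size, the average number of haulage cycles per truck and the maximum reward per dispatching decision reproduces the stated figure. This reading assumes that each haulage cycle corresponds to one rewarded decision:

$15 \times 45 \times 23 = 15{,}525$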

Figure 4.

Performance of the developed model over 100 episodes during training.

Results and discussion

The previously developed and trained PPO-based truck dispatching model is evaluated against the FS method, a widely used baseline in open-pit mining operations. This comparative analysis, grounded in a set of KPIs, seeks to assess the potential advantages offered by the PPO approach. The FS method represents a traditional dispatching strategy in which truck routes and shovel assignments are predetermined and remain fixed throughout each episode. Under this scheme, each truck consistently follows the same route and loads from the same designated shovel. To ensure a fair comparison, the same simulation environment was employed for both methods, with modifications made to the dispatching logic to implement the FS strategy. Specifically, trucks were grouped and assigned to shovels as follows: trucks 1–4 to Shovel 1, trucks 5–8 to Shovel 2, trucks 9–12 to Shovel 3 and trucks 13–15 to Shovel 4. Both the PPO and FS models were executed over a 30-day simulation period (30 episodes), and their performance was evaluated using KPIs such as total production, plant delivery, waste dump delivery and plant feed grade.
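The FS grouping described above amounts to a static lookup from truck to shovel. A minimal sketch is given below; the names FS_GROUPS and fixed_schedule_shovel are hypothetical, and only the grouping itself is taken from the text.

```python
# Truck-to-shovel grouping of the Fixed Schedule baseline.
FS_GROUPS = {
    1: range(1, 5),    # trucks 1-4   -> Shovel 1
    2: range(5, 9),    # trucks 5-8   -> Shovel 2
    3: range(9, 13),   # trucks 9-12  -> Shovel 3
    4: range(13, 16),  # trucks 13-15 -> Shovel 4
}


def fixed_schedule_shovel(truck_id: int) -> int:
    """Return the shovel a truck is permanently assigned to under FS."""
    for shovel, trucks in FS_GROUPS.items():
        if truck_id in trucks:
            return shovel
    raise ValueError(f"unknown truck id: {truck_id}")


print([fixed_schedule_shovel(t) for t in (1, 4, 5, 12, 13, 15)])
# [1, 1, 2, 3, 4, 4]
```

Unlike the learned policy, this lookup ignores the state vector entirely, so it cannot react to queues, failures or production shortfalls.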

Figure 5 presents a comparative evaluation of the PPO and FS models across six KPIs, averaged over three runs of each model. The results indicate that the PPO approach yielded a total production of 15,350 tonnes, marking a 5.7% improvement over the 14,523 tonnes achieved by the FS method. In terms of ore delivery to the processing plant, PPO averaged 7554 tonnes – an increase of 4.2% compared to the 7251 tonnes delivered under FS. Similarly, waste dump deliveries under PPO reached 7633 tonnes, reflecting a 7.5% gain relative to the 7099 tonnes recorded with FS. While the average ore grade in the plant feed showed only a slight difference between the two methods, PPO demonstrated a modest 0.9% improvement and, more importantly, exhibited greater stability in grade delivery. This consistency is particularly valuable for downstream processing. A key factor contributing to PPO's superior performance is truck fleet utilisation. By reducing idle time at shovel queues by approximately 30.4%, the PPO model enhanced overall fleet utilisation by 13.2%, reaching an average utilisation of 91.15%. These findings highlight the effectiveness of the PPO-based approach in optimising both operational throughput and resource utilisation.
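The percentage gains reported above follow from the tonnage figures as relative improvements over the FS baseline. The short check below uses only the values quoted in the text; the dictionary layout and function name are illustrative.

```python
def relative_gain(ppo: float, fs: float) -> float:
    """Percentage improvement of the PPO value over the FS baseline."""
    return (ppo - fs) / fs * 100.0


# (PPO, FS) average tonnages as reported in the text.
KPIS = {
    "total production": (15350, 14523),
    "plant delivery": (7554, 7251),
    "waste dump delivery": (7633, 7099),
}

gains = {name: round(relative_gain(ppo, fs), 1) for name, (ppo, fs) in KPIS.items()}

for name, gain in gains.items():
    print(f"{name}: +{gain}%")
# total production: +5.7%
# plant delivery: +4.2%
# waste dump delivery: +7.5%
```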

Figure 5.

Comparison of PPO and FS in terms of average KPIs over three runs of each model.

The findings support the potential of PPO as a viable and effective alternative to traditional RL algorithms in mining applications. The integration of a more comprehensive feature set allowed the dispatching agent to make more context-aware decisions, reflecting the complex realities of modern mining operations. The model's ability to converge within 90 episodes and maintain stable performance across multiple operational dimensions highlights its robustness and practical applicability. Notably, the PPO-based model developed in this study, which incorporates a broader set of dispatching features, achieved convergence within approximately 3.5 h. This is slightly faster than the 4-h training time reported for the DDQN-based model by Noriega et al. (2025), which was trained on the same Google Colab® platform and included fewer dispatching features. This comparison suggests that the PPO algorithm may offer improved training efficiency, even when handling more complex input representations.

Conclusion

This study contributes to intelligent truck dispatching in open pit mining by integrating a set of 19 operational features, broader than the scope of prior research, and by implementing the PPO algorithm, moving beyond conventional value-based RL methods. The proposed PPO-based model outperformed the standard FS approach across the evaluated KPIs: total production rose by 5.7%, plant delivery by 4.2% and waste dump delivery by 7.5%, while idle time at shovel queues fell by approximately 30.4%, raising average fleet utilisation to 91.15%. The model also delivered a more stable plant feed grade, which is vital for processing efficiency. The results highlight the algorithm's capacity to adapt to dynamic mining conditions, optimising resource allocation and mitigating inefficiencies. This research indicates that richer feature representations combined with modern policy-based RL algorithms can yield substantial improvements in mining productivity and sustainability, providing both a valuable academic contribution and practical guidance for the industry's transition towards data-driven, adaptive dispatching systems aligned with Mining 4.0 principles. Future work may explore the scalability of the proposed model in larger mining environments, as well as its integration with digital twin platforms for real-time decision support.

Footnotes

Author contributions

Eyman Hazrathoseyni: conceptualisation, methodology, data curation, investigation, software, writing – original draft. Arman Hazrathosseini: writing – review and editing.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

Data is available upon request.

ORCID iDs

Eyman Hazrathoseyni

Arman Hazrathosseini

References

De Carvalho and Dimitrakopoulos (2021) Integrating production planning with truck-dispatching decisions through reinforcement learning while managing uncertainty. Minerals 11(6): 87.

De Carvalho and Dimitrakopoulos (2023) Integrating short-term stochastic production planning updating with mining fleet management in industrial mining complexes: An actor-critic reinforcement learning approach. Applied Intelligence 53(20): 23179–23202.

Cheng, Chen CLP, et al. (2022) Proximal policy optimization with policy feedback. IEEE Transactions on Systems, Man, and Cybernetics: Systems 52(7): 4600–4610.

Hazrathosseini and Moradi Afrapoli (2023a) Intelligent fleet management systems in surface mining: Status, threats, and opportunities. Mining, Metallurgy and Exploration 40(6): 2087–2106.

Hazrathosseini and Moradi Afrapoli (2023b) Maximizing mining operations: Unlocking the crucial role of intelligent fleet management systems in surface mining’s value chain. Mining 4(1): 7–20.

Hazrathosseini and Moradi Afrapoli (2023c) The advent of digital twins in surface mining: Its time has finally arrived. Resources Policy 80: 103155.

Hazrathosseini and Moradi Afrapoli (2024) Transition to intelligent fleet management systems in open pit mines: A critical review on application of reinforcement-learning-based systems. Mining Technology 133: 50–73.

Hazrathosseini and Moradi Afrapoli (2025) An intelligent rule-based decision-making system for preliminary truck dispatching within open-pit mines. Mining Technology: Transactions of the Institutions of Mining and Metallurgy 134(1): 3–14.

Huo, Sari, Kealey, et al. (2023) Reinforcement learning-based fleet dispatching for greenhouse gas emission reduction in open-pit mining operations. Resources, Conservation and Recycling 188: 106664.

Huo, Sari and Zhang (2024) Smart dispatching for low-carbon mining fleet: A deep reinforcement learning approach. Journal of Cleaner Production 435: 140459.

Igogo, Awuah-Offei, Newman, et al. (2021) Integrating renewable energy into mining operations: Opportunities, challenges, and enabling approaches. Applied Energy 300: 117375.

Khorasgani, Wang and Gupta (2020) Challenges of applying deep reinforcement learning in dynamic dispatching. arXiv:2011.05570.

Matsui, Escribano and Angeloudis (2023) Real-time dispatching for autonomous vehicles in open-pit mining deployments using deep reinforcement learning. In: IEEE Conference on Intelligent Transportation Systems, Proceedings, ITSC, pp. 5468–5475.

Noriega, Pourrahimian and Askari-Nasab (2025) Deep reinforcement learning based real-time open-pit mining truck dispatching system. Computers & Operations Research 173: 106815.

Puterman (2014) Markov Decision Processes: Discrete Stochastic Dynamic Programming. New Jersey, USA: John Wiley & Sons. Available at: https://books.google.com/books?hl=en&lr=&id=VvBjBAAAQBAJ&oi=fnd&pg=PP9&ots=rtkAzIVZLL&sig=88nyhp2oGHi5jMrGkcmfJdn_ivM

Qiu and Yang (2023) A reinforcement learning based dynamic multi-objective constrained evolutionary algorithm for open-pit mine truck scheduling. In: Proceedings - 2023 China Automation Congress, CAC 2023, pp. 5370–5375.

Schulman, Levine, Abbeel, et al. (2015) Trust region policy optimization. In: International conference on machine learning, pp. 1889–1897. https://proceedings.mlr.press/v37/schulman15.html

Schulman, Wolski, Dhariwal, et al. (2017) Proximal policy optimization algorithms. arXiv:1707.06347.

Zhang, Odonkor and Zheng (2020) Dynamic dispatching for large-scale heterogeneous fleet via multi-agent deep reinforcement learning. In: Proceedings - 2020 IEEE international conference on big data, big data 2020, pp. 1436–1441.