Policy-guided Q-learning algorithm for intelligent travel path planning

Abstract

In complex urban environments, traditional path planning methods have significant shortcomings in terms of safety assurance, multi-objective path optimization, and personalized travel recommendations. To address these issues, this paper proposes a reinforcement learning-based path planning algorithm with a strategy-guided mechanism and further constructs a path optimization model suitable for multi-destination travel scenarios. This method introduces a safety-aware potential field and a composite reward mechanism to guide the agent to achieve a dynamic balance between path length and safety. In the experimental section, a dataset incorporating map and urban public safety information was constructed, and 800 rounds of path learning simulation experiments were conducted. The results show that the convergence time of the proposed algorithm is 32% shorter than that of the greedy strategy-based method and 27% shorter than that of the policy enhancement method. Additionally, the average path length is reduced by over 100 m, and the safety score improves by over 14%. In multi-destination travel tests, when the number of target points was 20, the total path length was reduced by 3.00% compared to the distance matrix method and by 0.15% compared to the genetic algorithm, verifying its scalability and stability in complex scenarios. The research results indicate that this method can provide efficient, safe, and personalized path planning solutions for urban travelers, offering broad application prospects.

Keywords

policy-guided Q-learning; safe path planning; multi-destination travel optimization; pointer network; intelligent route recommendation

Introduction

With the development of society and the prosperity of tourism, travel path planning has become an increasingly important research field. When planning their journeys, modern travelers are increasingly demanding the efficiency and convenience of routes, while also focusing more and more on the safety of paths and personalized experiences. This trend reflects the significant changes in travel consumer behavior in today’s society, in which safety and personalization have become important considerations.^1,2 However, with the diversification of tourist destinations and the increasing personalization needs of travelers, traditional route planning methods have faced many challenges in dealing with complex routes, ensuring safety, and meeting personalization needs. On the one hand, complex urban road structures and frequent traffic incidents pose challenges to travel efficiency and safety. On the other hand, users’ demand for multi-destination, real-time response, and safety-assured route planning is increasing.^3,4 In the context of smart city and intelligent transportation system construction, how to achieve efficient, safe, and personalized travel route planning has become an important issue that needs to be addressed urgently. Furthermore, with the proliferation of smartphones and location-based services, travelers are more inclined to use these technologies to enhance their travel experience. The popularity of smart devices and applications has created unprecedented data resources for trip planning, but it also presents challenges in how to effectively use this data to meet personalized needs.^5,6 To this end, the study proposes a Q-learning path intelligent planning algorithm (QL-PIPA) based on policy guidance mechanism (PGM). The research aims to address the challenges of path planning in modern travel, including path diversity, safety, and personalization needs. 
The objective is to develop an intelligent path planning solution that provides travelers with safe and efficient travel routes and is able to satisfy users’ needs for a personalized travel experience. In light of the fact that travelers may visit multiple destinations, the research further develops multi-objective path intelligent planning (MO-PIP) algorithms with the objective of optimizing travelers’ paths between multiple stops. This study refers to the synchronous motion wobble control proposed for self-reconfiguring mobile robots in reference 7, particle swarm optimization for optimizing a novel proportional-integral differential neural network model in reference 8, and a neural network study on the control of nonlinear dynamic systems in reference 9. Furthermore, this study aims to address the challenges faced by traditional path planning algorithms in terms of dynamic environment adaptation, computational complexity, multi-objective optimization, and personalization and customization requirements.

The innovative aspects of this research are as follows. First, a safety-aware potential field is introduced to guide agents in avoiding high-risk areas, thereby achieving a dynamic balance between path safety and efficiency. Second, a composite reward mechanism combining path length and safety distance is designed to meet personalized travel needs and enhance the robustness of path planning. Third, by integrating pointer networks and reinforcement learning, the MO-PIP model is proposed to address the complexity and scalability challenges of multi-destination path planning. Additionally, the proposed algorithm outperforms traditional Q-learning and reinforcement learning methods in terms of convergence speed, safety, and computational efficiency, demonstrating significant practical application potential.

The main contributions of this study are as follows. First, a QL-PIPA model incorporating PGM is proposed. By introducing a safety-aware potential field, the model achieves a dynamic trade-off between path length and safety, thereby enhancing the algorithm’s adaptability in complex urban environments. Second, a composite reward function system based on safety distance and risk level weighting is constructed, effectively guiding agents to avoid high-risk areas and improving the safety and personalization of path planning results. Third, for multi-destination travel scenarios, the MO-PIP model, which combines pointer networks and reinforcement learning mechanisms, is designed to maintain good convergence performance and path optimization effects even when the number of nodes increases. The methods proposed in this study have strong practical application value.

The study is organized into four sections. The first section summarizes domestic and international research on trip planning and its shortcomings. The second section proposes QL-PIPA based on PGM and then, considering that travelers usually visit multiple destinations, proposes MO-PIP. The third section evaluates the proposed models through comparative experiments. The fourth section summarizes the experimental results, points out the shortcomings of the study, and proposes future research directions.

Related works

In recent years, the application of deep learning in the field of path planning has received increasing attention. With the advancement of technology, especially in computing power and data processing, deep learning has become a key technology in optimizing urban traffic, autonomous driving, and intelligent tourism planning.¹⁰ Representative studies are summarized below. Zamoum et al. addressed the path planning problem for unmanned aerial vehicles (UAVs) navigating in complex environments by applying deep Q-learning (QL) and Dyna Q-learning methods to UAV path planning and introducing fuzzy logic for enhanced control, thereby optimizing the accuracy, adaptability, and obstacle avoidance capabilities of path planning.¹¹ Merikhipour et al. addressed the issue that traditional transportation mode recognition models rely on manual features and struggle to capture complex temporal patterns. They proposed a deep learning model that integrates a spatial attention mechanism and combined it with the proximal policy optimization (PPO) algorithm to optimize feature selection, thereby improving the accuracy and robustness of transportation mode recognition.¹² Nikookar et al. addressed the issue of insufficient model reusability in reinforcement learning by proposing a framework based on graph data models and designing a parameterized algorithm that balances efficiency and cumulative rewards, thereby achieving efficient strategy reuse and performance improvements in tasks such as query optimization and robot control.¹³ To solve the problem of high network deployment complexity due to resource constraints of UAVs, Xi et al. proposed a lightweight reinforcement learning-based real-time path planning method, the adaptive soft actor-critic algorithm. 
The generalization capability was enhanced by interacting structured environment models with dynamic information, thus improving the adaptability and planning efficiency of UAVs in complex environments.¹⁴ Shang et al. proposed an improved RRT algorithm combining QL and Gaussian distribution of obstacles for path planning of self-driving vehicles. The step size was dynamically adjusted according to the density of obstacle distribution to quickly generate the initial path and reduce the planning time. Finally, the paths were optimized by an enhanced bi-directional pruning technique to ensure that the resulting paths complied with vehicle kinematics and dynamics constraints while remaining smooth and safe.¹⁵ Cai et al. proposed a novel robot path planning method that combines DRL and construction worker motion prediction to enhance the safety and efficiency of human–robot collaboration on construction sites. In addition, the predicted motion of workers was integrated into the state space and reward function computation. The results showed that this prediction-based path planning method could ensure that the robot successfully reaches the destination along a near-shortest path.¹⁶

Zhou et al. tuned the exploration rate and learning rate of classical QL in order to improve its convergence speed. The study also designed an improved Dyna-2 algorithm to enhance the generalization ability of the QL algorithm. The effectiveness of this hybrid intelligent algorithm was demonstrated by path planning experiments in two static complex environments.¹⁷ Chen et al. proposed a soft actor-critic method based on deep reinforcement learning (DRL) for dynamic obstacle avoidance path planning of a robotic arm. The method achieved effective avoidance of moving obstacles in the environment and real-time path planning by designing an integrated reward function for dynamic obstacle avoidance and goal approach. The results revealed that the method could effectively avoid moving obstacles and complete the planning task with a high success rate.¹⁸ Brandonisio et al. created an artificial intelligence agent through deep Q-network and advantage actor-critic methods and followed reinforcement learning principles for training. The results showed that the trained agent performed well in expanding the target coverage.¹⁹

In summary, despite extensive research on path planning and its application to various fields, most existing methods lack adaptability to dynamic environmental changes and the ability to generalize to complex route selection. The ε-greedy strategy (EGS) is deficient in balancing exploration and exploitation. In intricate urban environments, existing systems are unable to accurately discern potential safety hazards, which leads to inadequate safety measures in path planning. Traditional methods are often unable to handle safety risks in complex urban environments, especially in path selection in evacuation areas and high-risk areas. The study proposes QL-PIPA based on PGM to address this issue. MO-PIP is also proposed to address the needs of multi-site travel. This study provides a more comprehensive optimization scheme for modern travelers’ route selection, with a focus on enhancing safety and providing satisfying personalized travel experiences. By introducing PGM, QL-PIPA can reduce the convergence time and improve the efficiency of path planning while ensuring the safety of the path. Especially in complex environments, PGM guides the agent to find the best path faster and reduces unnecessary random exploration. A comparison of the existing literature with the proposed method is shown in Table 1.

Table 1.

Comparison of different methods.

Reference	Year	Research methodology	Finding	Shortcoming
Zamoum et al.¹¹	2025	Deep QL and Dyna QL methods	Optimized UAV path planning	Cannot be directly applied to path planning in complex urban environments
Merikhipour et al.¹²	2025	Deep learning model integrating spatial attention mechanism, combined with PPO optimization feature selection	Improved the accuracy and robustness of transportation mode recognition	The model has limited generalization ability in cross-scenario or low-resource environments
Nikookar et al.¹³	2025	A framework based on graph data models	Achieved efficient strategy reuse in query optimization and robot control tasks	The model has poor scalability in large-scale or high-dimensional state spaces
Xi et al.¹⁴	2024	Adaptive soft actor-critic algorithm	Proposing adaptive temperature coefficients to flexibly adjust the probability of exploration for UAV	Ineffective in solving the multi-objective path selection problem
Shang et al.¹⁵	2023	Improved RRT algorithm combining QL and obstacle Gaussian distributions	Quickly generates initial paths and optimizes them to satisfy kinematics and dynamics constraints of self-driving vehicles	Difficult to deal with complex multi-objective path planning and dynamically changing environments
Cai et al.¹⁶	2023	DRL combined with construction worker movement prediction for path planning	Ensures that the robot successfully reaches the destination along the near shortest path	Lacks adaptability to complex environments
Zhou et al.¹⁷	2022	Adjusting the exploration rate and learning rate of QL and designing an improved Dyna-2 algorithm	Enhancing the convergence speed and generalization of QL algorithm for static complex environments	Applicable to static environments only, lack of sufficient processing capability for dynamically changing environments
Chen et al.¹⁸	2022	DRL-based soft actor-critic approach to dynamic obstacle avoidance path planning	Successful avoidance of dynamic obstacles and real-time path planning	High computational complexity
Brandonisio et al.¹⁹	2021	Agent training based on deep Q-networks and advantage actor-critic method	Successfully extends the target coverage area and shows good path planning performance	Focuses on spatial goal planning and does not deal with path planning in complex multi-objective or dynamic environments
This paper	-	PGM-based QL-PIPA	Improves the efficiency and safety of path planning, shortens the convergence time, and adapts to dynamic environments	-

Methodology

In this research, the QL safe path planning algorithm is first introduced and QL-PIPA incorporating PGM is further developed to improve the convergence speed of the algorithm. Subsequently, considering the needs of multi-destination travels, MO-PIP is proposed with the aim of providing travelers with a safe as well as an efficient multi-destination path planning scheme.

QL-PIPA construction with policy guidance

The main goal of the research is to address the challenges of travel path planning in modern dynamic environments. Specifically, the research aims to develop an intelligent path planning algorithm capable of efficiently navigating through complex urban environments while prioritizing safety and personalization. The study defines $S$ as the state space, which represents all possible states of the environment, including location, traffic conditions, and safety risks.²⁰ $A$ is the action space, representing all feasible actions that the agent can take, such as moving to a neighboring node or changing methods. The goal of the agent is to find a policy that maps states to actions and maximizes the cumulative reward over a sequence of states and actions. To meet travelers’ needs for path diversity, safety, and personalization during travel, the study proposes a PGM-based QL-PIPA. The algorithm considers both the distance and the safety of paths and addresses the safety misjudgment of traditional methods in sparsely populated areas. By balancing distance and safety through reward functions and QL, routes that are both fast and safe are obtained. The total path length $D$, the first of the two quantities being balanced, is given in equation (1).

D = ‖ S_{beginning} - p_{c,1} ‖_{2} + \sum_{k=1}^{K-1} ‖ p_{c,k} - p_{c,k+1} ‖_{2} + ‖ p_{c,K} - S_{endpoint} ‖_{2}

(1)

In equation (1), $S_{beginning}$ represents the 2D coordinates of the starting point, $S_{endpoint}$ denotes the 2D coordinates of the end point, $K$ denotes the number of intersections (nodes), and $p_{c,k}$ represents the 2D coordinates of the $k$th intersection (node) on the path from the starting point to the end point. The path consists of $K$ intersections $p_{c,k}$ and therefore has $K + 1$ hops. To enhance safety, the traveled path should also keep away from hazardous areas. The average safety distance $Ψ$ of a path is the average, rank-weighted distance from the street midpoints and nodes of a planned path to the hazardous locations around them.²¹ The greater the distance, the farther the path is from the hazardous area and the greater the safety, as shown in equation (2).

Ψ = \frac{\sum_{k=1}^{K+1} \sum_{i=1}^{m_{k}} \frac{d(s_{k\_mid}, c_{k,i})}{rank(c_{k,i})} + \sum_{k=1}^{K} \sum_{j=1}^{n_{k}} \frac{d(p_{c,k}, c_{k,j})}{rank(c_{k,j})}}{\sum_{k=1}^{K+1} m_{k} + \sum_{k=1}^{K} n_{k}}

(2)

In equation (2), $c$ denotes the list of hazardous locations. $m_{k}$ denotes the number of hazardous locations in the list corresponding to the $k$th street, and $n_{k}$ denotes the number of hazardous locations in the list corresponding to the $k$th intersection (node). $c_{k,i}$ denotes the $i$th hazardous location in the list of hazards corresponding to the $k$th street, and $c_{k,j}$ denotes the $j$th hazardous location in the list of hazards corresponding to the $k$th intersection. $rank(c_{k,i})$ and $rank(c_{k,j})$ denote the hazard classes of hazardous locations $c_{k,i}$ and $c_{k,j}$, respectively, with higher values indicating greater danger. $d$ denotes the Euclidean distance between two points. The coordinates of the midpoint of each street are labeled $s_{k\_mid}$, and a hazard zone is defined around each midpoint with a radius of about the length of the street. The number of hazardous points within each zone is $m_{k}$. Safe path planning is based on reinforcement learning and involves an agent, an interactive environment, and a decision information cache. The framework of the interaction between the agent and the environment is shown in Figure 1 and aims to optimize the safety and efficiency of the traveler’s route.
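As a rough illustration of equations (1) and (2), the following Python sketch computes the total path length and the rank-weighted average safety distance. The function names, the input layout (each street midpoint or node paired with the hazards already assigned to its zone), and the choice to return infinity when a path has no hazards at all are assumptions made for this example, not details given in the study.

```python
import math

def path_length(start, nodes, end):
    """Equation (1): sum of Euclidean hop lengths start -> nodes -> end."""
    points = [start, *nodes, end]
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def average_safety_distance(street_hazards, node_hazards):
    """Equation (2): mean of distance / rank over all (anchor, hazard) pairs.

    street_hazards: list of (street_midpoint, [(hazard_xy, rank), ...])
    node_hazards:   list of (node_xy,         [(hazard_xy, rank), ...])
    """
    total, count = 0.0, 0
    for anchor, hazards in [*street_hazards, *node_hazards]:
        for hazard_xy, rank in hazards:
            total += math.dist(anchor, hazard_xy) / rank
            count += 1
    # Assumption: a path with no hazards nearby is treated as maximally safe.
    return total / count if count else math.inf

# One intermediate node, so K = 1 and the path has K + 1 = 2 hops.
D = path_length((0, 0), [(3, 4)], (3, 8))
psi = average_safety_distance(
    street_hazards=[((0, 0), [((3, 4), 1)])],
    node_hazards=[((0, 0), [((0, 2), 2)])],
)
```

Dividing each distance by the hazard rank means that a more dangerous location contributes a smaller effective distance, which lowers $Ψ$ for paths that pass near serious hazards.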

Figure 1.

Framework diagram of interaction between agent and environment.

Figure 1 illustrates that the structure includes a simulated robotic agent that interacts with the map environment to find the best path with limited information. In this context, $S_{t}$ represents the coordinates of the current location, and $R_{t}$ represents the reward value obtained from the environment after the agent performs an action at time step $t$. An information cache stores states, rewards, and key algorithmic factors to assist the agent in decision-making during this process. The agent initially obtains the start and goal point coordinates from environment observations. Subsequently, it optimizes the algorithm by continuously updating the state and rewards through environment observations and feedback. Eventually, after continuous learning and parameter tuning, the algorithm converges to intelligent planning of safe paths.^22,23 Figure 2 further refines the framework of QL-PIPA for safe paths.

Figure 2.

QL-PIPA framework.

Figure 2 shows the overall framework of the QL-PIPA, which includes the interaction mechanism between the agent and the environment. In this framework, the agent continuously optimizes the path planning strategy by interacting with the environment to improve the efficiency and safety of path planning. By updating the state and reward information, the algorithm is able to converge to the optimal path. After the agent performs an action, the $Q$-value is updated according to equation (3): the first case applies when the next state of the environment is the end of the path, and the second case applies otherwise.

\begin{cases} Q(s, a) \leftarrow Q(s, a) + α [r - Q(s, a)], & s^{'} \text{ is the end point} \\ Q(s, a) \leftarrow Q(s, a) + α [r + γ \max_{a^{'}} Q(s^{'}, a^{'}) - Q(s, a)], & \text{otherwise} \end{cases}

(3)

In equation (3), $r$ denotes the immediate reward and $r + γ \max_{a^{'}} Q(s^{'}, a^{'}) - Q(s, a)$ denotes the temporal difference error. The discount rate $γ$ and the learning rate $α$ (the same learning rate as in Table 2) are used to adjust the values. The key to this process is to enable the agent to pursue optimal long-term rewards in its interaction with the environment, focusing on safe and short travel paths. Therefore, the study defines the path length reward as shown in equation (4).

PathLengthReward (s, a) = - β \times PathLength (s, a)

(4)

In equation (4), $β$ denotes the weight of the path length. $PathLength(s, a)$ denotes the length of the path segment traversed when action $a$ takes the agent from state $s$ to state $s^{'}$. The safe distance reward is defined as shown in equation (5).

SafetyReward (s, a) = α \times SafetyDistance (s, s^{'})

(5)

In equation (5), $α$ denotes the weight of the safe distance. $SafetyDistance (s, s^{'})$ denotes the safe distance of the path. The PGM controls the short-term efficiency of paths and the equilibrium between path safety and efficiency by introducing path length weights $β$ and safe distance weights $α$ , respectively. Safety distance is defined as the minimum distance between each node on a path and the nearest hazardous area, and safety risk is evaluated based on the hazard level of each node and the surrounding hazardous area. To balance safety and efficiency in path planning, safety distance and safety risk are incorporated into the reward function as important considerations for path selection. For this purpose, the reward function incorporates safety metrics as shown in equation (6).

r = \begin{cases} SI(s, s^{'}), & s^{'} \text{ is not the end point and } d^{'} \leq d \\ ξ, & s^{'} \text{ is not the end point and } d^{'} > d \\ κ, & s^{'} \text{ is the end point} \end{cases}

(6)

In equation (6), $ξ$ is a negative constant, $κ$ is the reward assigned when the end point is reached, and $SI$ denotes the safety measure. The contribution of each action to the overall route length also needs to be considered. The transfer safety metric $SI(s, s^{'})$ for state $s$ to $s^{'}$ is shown in equation (7).

SI(s, s^{'}) = \frac{SPI(s, s^{'}) + SCI(s, s^{'})}{L(p)}

(7)

In equation (7), $p$ denotes a street and $L(p)$ is its length. $SCI(s, s^{'})$ indicates the level of safety at intersections, and $SPI(s, s^{'})$ indicates the level of safety along streets. In the safety evaluation of roads and intersections, the effect of a hazard varies with its location. Therefore, the study uses the safety distance to determine the safety level of the path chosen by the agent, which is calculated as shown in equation (8).

\begin{cases} SPI(s, s^{'}) = \begin{cases} \frac{\sum_{i=1}^{m} d(s_{mid}, c_{i}) / rank(c_{i})}{m}, & m \neq 0 \\ L(p), & m = 0 \end{cases} \\ SCI(s, s^{'}) = \begin{cases} \frac{\sum_{j=1}^{n} d(p_{c}^{'}, c_{j}) / rank(c_{j})}{n}, & n \neq 0 \\ \frac{d(p_{c}^{'}, c_{near})}{rank(c_{near})}, & n = 0 \end{cases} \end{cases}

(8)
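Equations (6) to (8) can be sketched in Python as follows. This is a minimal illustration, not the study's implementation: the function names are hypothetical, the hazards are assumed to be pre-filtered to the relevant zone, the default values of `xi` and `kappa` are placeholders, and `d_now` and `d_next` simply stand for the quantities $d$ and $d^{'}$ that appear in equation (6).

```python
import math

def spi(s_mid, street_len, hazards):
    """Street safety index of equation (8); hazards is [(xy, rank), ...]."""
    if not hazards:                      # m = 0: fall back to L(p)
        return street_len
    return sum(math.dist(s_mid, c) / rank for c, rank in hazards) / len(hazards)

def sci(node, hazards, nearest):
    """Intersection safety index of equation (8)."""
    if not hazards:                      # n = 0: use the nearest hazard instead
        c_near, rank_near = nearest
        return math.dist(node, c_near) / rank_near
    return sum(math.dist(node, c) / rank for c, rank in hazards) / len(hazards)

def safety_index(spi_value, sci_value, street_len):
    """Equation (7): combined safety per unit street length."""
    return (spi_value + sci_value) / street_len

def reward(si, is_endpoint, d_next, d_now, xi=-1.0, kappa=10.0):
    """Equation (6); xi (negative) and kappa are placeholder constants."""
    if is_endpoint:
        return kappa
    return si if d_next <= d_now else xi

street_safety = spi((0, 0), 10.0, [((3, 4), 1), ((0, 2), 2)])
node_safety = sci((0, 0), [], nearest=((0, 4), 2))
si = safety_index(street_safety, node_safety, 10.0)
```

Normalizing by $L(p)$ in `safety_index` keeps long streets from earning a large reward purely because of their length.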

In equation (8), $s_{mid}$ denotes the center point of the street connecting the two intersection nodes between states $s$ and $s^{'}$. $m$ denotes the number of hazardous locations inside the zone centered on $s_{mid}$ whose radius is derived from street $p$. $rank(c_{i})$ denotes the hazard class of the $i$th hazardous location. $p_{c}^{'}$ denotes the intersection node in state $s^{'}$. $n$ denotes the number of hazardous locations inside the zone centered on $p_{c}^{'}$ whose radius is derived from street $p$. $c_{near}$ denotes the hazardous location that has the minimum distance from node $p_{c}^{'}$. To speed up the convergence of the algorithm, this study proposes QL-PIPA with PGM, which uses the principle of the artificial potential field to guide the agent. The attraction function in the potential field is used to increase the chance that the agent obtains high rewards, improve efficiency, and shorten the convergence time. The related function is detailed in equation (9).

U_{at}(s) = \frac{1}{2} \cdot \bar{k} \cdot (s_{now} - s_{endpoint})^{2}

(9)

In equation (9), $U_{at}(s)$ denotes the target gravitational field of state $s$. $\bar{k}$ denotes the gravitational field coefficient. $s_{now}$ denotes the coordinates of the current intersection node. $s_{endpoint}$ denotes the coordinates of the target node. Compared with traditional collision detection-based path planning methods, the safety-aware potential field mechanism optimizes the path through global guidance and mitigates the problem of local optimal solutions. Traditional methods often rely on step-by-step detection and avoidance of obstacles and cannot effectively avoid the potential safety risks of complex scenes. In contrast, the safety-aware potential field adjusts the gravitational field in real time, so that agents circumvent obstacles and quickly identify safe and efficient paths in complex environments, as shown in equation (10).

R_{safe} = β \cdot SafetyDistance - α \cdot RiskLevel

(10)
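The two potential-field quantities in equations (9) and (10) can be illustrated with a short Python sketch. It assumes that the squared difference in equation (9) is the squared Euclidean distance between 2D coordinates; the helper `guided_action`, which picks the neighbor with the lowest attractive potential, is a hypothetical illustration of how such a field could bias action selection and is not taken from the study.

```python
import math

def attraction(s_now, s_endpoint, k_bar=1.0):
    """Equation (9): 0.5 * k_bar * squared distance to the end point."""
    return 0.5 * k_bar * math.dist(s_now, s_endpoint) ** 2

def r_safe(safety_distance, risk_level, beta=1.0, alpha=1.0):
    """Equation (10): reward safety distance, penalize risk level."""
    return beta * safety_distance - alpha * risk_level

def guided_action(neighbors, s_endpoint, k_bar=1.0):
    """Hypothetical guidance rule: move toward lower attractive potential."""
    return min(neighbors, key=lambda n: attraction(n, s_endpoint, k_bar))

u = attraction((0, 0), (3, 4))                       # distance 5 -> 0.5 * 25
bonus = r_safe(4.0, 1.0, beta=0.5, alpha=2.0)        # 2.0 - 2.0
step = guided_action([(1, 0), (0, 1)], (3, 0))       # (1, 0) is closer
```

Because the potential grows with the squared distance, nodes far from the end point are penalized more strongly, which is what pulls exploration toward the goal.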

In equation (10), $R_{safe}$ denotes the reward of the safety-aware potential field mechanism. $RiskLevel$ denotes the safety risk level. $SafetyDistance$ denotes the safety distance. Using this gravitational field, the algorithm guides the agent within the $ε$-greedy strategy: consistent with Table 2, the agent selects a random action with probability $ε$ and otherwise selects the action with the highest value. This guidance reduces randomness and improves the chances of pursuing long-term gains. Next, the detailed flow of PGM combined with QL safe path intelligent planning is shown in Figure 3.

Figure 3.

Flowchart of intelligent planning of PGM and QL safe paths.

Figure 3 illustrates the intelligent planning process for safe paths based on PGM combined with QL. The figure shows the role of PGM in path selection: PGM not only accelerates the convergence process by balancing the exploration and exploitation of paths, but also influences the updating of Q-values by adjusting the reward function to optimize the path selection. As mentioned above, the study’s approach overcomes the problem of small crimes that may be missed when determining the shortest travel route using only the crime density map, and improves the safety of travel path planning. In addition, by introducing PGM, the study balances exploration and exploitation in the intelligent planning of safe paths in QL, which speeds up the convergence of the algorithm. The PGM is integrated with the QL algorithm mainly in two aspects: action selection and Q-value update. In the action selection phase, the PGM provides a policy that guides the agent to select actions in each state, taking into account not only the immediate rewards but also the global path optimization to reach the goal. In the Q-value update phase, the PGM influences the Q-value update by adjusting the reward function, which allows the algorithm to converge to the optimal strategy more quickly. Specifically, the potential field in the PGM is used to adjust the immediate rewards in the QL algorithm so that they reflect the safety and efficiency of the paths. The pseudo-code of the QL-PIPA algorithm is shown in Table 2.

Table 2.

Pseudo-code for QL-PIPA

Algorithm 1: QL-PIPA with policy-guided reward adjustment
Input:	State space S, Action space A, Learning rate α, Discount factor γ, Exploration rate ε, Safety weight factor W
Output:	Optimized Q-table Q(s,a)
1	Initialize Q(s,a) ← 0, ∀s∈S, a∈A
2	for episode = 1 to N do
3	Initialize state s
4	while s is not terminal do
5	With probability ε select a random action a
6	Else select a ← argmax_a Q(s,a)
7	Execute action a, observe next state s’, and reward r
8	Compute policy-guided reward: r ← r + W·SafetyScore(s’)
9	Update Q(s,a): Q(s,a) ← Q(s,a) + α·[r + γ·max_a’ Q(s’,a’) − Q(s,a)] (the term γ·max_a’ Q(s’,a’) is dropped when s’ is terminal, as in equation (3))
10	s ← s’
11	end while
12	end for
13	return Q
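Algorithm 1 can be made concrete with a small tabular sketch in Python. This is an illustration under stated assumptions rather than the study's implementation: the environment is a plain adjacency dictionary in which choosing an action means moving to a neighbor, `safety_score` and `step_reward` are hypothetical callables standing in for SafetyScore and the environment reward, every non-goal node is assumed to have at least one neighbor, and `max_steps` is an added guard against endless episodes.

```python
import random
from collections import defaultdict

def train_ql_pipa(graph, start, goal, safety_score, step_reward,
                  episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, w=1.0,
                  max_steps=200, seed=0):
    """Tabular Q-learning with the policy-guided reward of Algorithm 1."""
    rng = random.Random(seed)
    q = defaultdict(float)                      # Q(s, a) starts at 0
    for _ in range(episodes):
        s = start
        for _ in range(max_steps):
            if s == goal:
                break
            actions = graph[s]
            if rng.random() < epsilon:          # explore with probability epsilon
                a = rng.choice(actions)
            else:                               # otherwise act greedily
                a = max(actions, key=lambda x: q[(s, x)])
            s_next = a                          # the action is the neighbor moved to
            r = step_reward(s, s_next) + w * safety_score(s_next)
            if s_next == goal:                  # terminal case of equation (3)
                target = r
            else:
                target = r + gamma * max(q[(s_next, b)] for b in graph[s_next])
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q

def greedy_path(q, graph, start, goal, max_steps=50):
    """Follow the highest-valued action from start until goal is reached."""
    path = [start]
    while path[-1] != goal and len(path) <= max_steps:
        s = path[-1]
        path.append(max(graph[s], key=lambda x: q[(s, x)]))
    return path

# Toy map: two equally long routes from A to D; node B is hazardous.
graph = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D"], "D": []}
q = train_ql_pipa(
    graph, "A", "D",
    safety_score=lambda node: -5.0 if node == "B" else 0.0,
    step_reward=lambda s, s_next: 10.0 if s_next == "D" else -1.0,
)
route = greedy_path(q, graph, "A", "D")
```

In this toy map both routes have the same number of hops, so the learned preference for the route through C comes entirely from the safety term added to the reward.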

MO-PIP design for multi-destination planning

Because the PGM-based QL-PIPA addresses only the single-destination safe path planning problem, the research further proposes the MO-PIP algorithm to satisfy the needs of travel with multiple destinations. The algorithm models the problem as an optimization task similar to the traveling salesman problem and handles multi-destination planning by combining pointer networks (PNs) with long short-term memory (LSTM) networks. The multi-destination locations are represented in the form of a two-dimensional sequence of coordinates as shown in equation (11).

l = \{d_{i}\}_{i=1}^{o}

(11)

In equation (11), $d_{i}$ denotes the coordinates of each travel destination and $o$ is the number of destinations. Multi-destination path planning is treated as a sequential problem and is handled by factorization using the chain rule. The shorter the resulting path, the higher the output probability assigned to the visiting order, and vice versa.²⁴ Sequence element probabilities are computed based on the parameters $θ$ and policy $π$. The order of destination visits is determined through the Softmax layer as shown in equation (12).

P_{θ}(π | l) = \prod_{i=1}^{n} P_{θ}(π(i) | π(1), \ldots, π(i-1), l)

(12)
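The chain-rule factorization in equation (12) can be checked numerically with a few lines of Python. The sketch below assumes that a decoder has already produced one conditional distribution per step; the function name and the dictionary layout are chosen for illustration only.

```python
import math

def sequence_log_prob(step_probs, order):
    """log P(pi | l) as the sum of per-step conditional log-probabilities.

    step_probs[i] maps each destination still available at step i to its
    probability, already conditioned on the choices made at earlier steps.
    """
    return sum(math.log(step_probs[i][dest]) for i, dest in enumerate(order))

# Three destinations visited in the order 2 -> 0 -> 1.
step_probs = [
    {0: 0.2, 1: 0.3, 2: 0.5},   # step 1: nothing visited yet
    {0: 0.6, 1: 0.4},           # step 2: destination 2 already visited
    {1: 1.0},                   # step 3: only destination 1 remains
]
log_p = sequence_log_prob(step_probs, [2, 0, 1])
p = math.exp(log_p)             # 0.5 * 0.6 * 1.0
```

Working in log space avoids numerical underflow when the number of destinations is large, since the product in equation (12) shrinks quickly.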

The MO-PIP network deals with strongly correlated sequence pairs and uses recurrent neural networks for sequence prediction decision-making. Moreover, combined with bidirectional long short-term memory (LSTM) and an attention mechanism, it specifically addresses prediction problems in which the output depends on the input. For a sequence pair $\{X, Y^{X}\}$, the model calculates the transition probability from $X$ to $Y^{X}$ as shown in equation (13).

\begin{cases} P(Y^{X} | X; θ) = \prod_{i=1}^{n} P_{θ}(Y_{i} | Y_{1}, \ldots, Y_{i-1}, X; θ) \\ θ^{*} = \underset{θ}{\arg\max} \sum_{X, Y^{X}} \log P(Y^{X} | X; θ) \end{cases}

(13)

The sequence-to-sequence model employs two LSTM units as encoder and decoder, respectively, as shown in Figure 4. At each time point $i$, the model reads one element of the input sequence $x_{1}, x_{2}, x_{3}$ and transforms it into fixed-dimension embedded information until the end flag appears. After entering the decoding phase, the decoder selects the sequence element with the highest probability at each step for output. Each output of the decoder becomes the input for the next step until the end marker.

Figure 4.

Encoder and decoder processes for sequence-to-sequence models.

To accommodate the traveling salesman problem, where the output sequence length is the same as the input sequence length, the study proposes an improved sequence model. This model incorporates the attention mechanism as shown in equation (14).

\begin{cases} u_{j}^{i} = v^{T} \tanh(W_{1} e_{j} + W_{2} d_{i}), & i, j \in (1, 2, \ldots, o) \\ P(Y_{i} | Y_{1}, \ldots, Y_{i-1}, X) = \mathrm{softmax}(u^{i}) \end{cases}

(14)

In equation (14), $v$ , $W_{1}$ , and $W_{2}$ are learnable parameters that combine the hidden state output of the encoder with the decoder state to form a pointer to the input element $j$ . PNs are usually optimized through supervised learning; this study instead investigates the use of an evaluation network and reinforcement learning. This improves the decision-making of the policy network and addresses the problems of data label acquisition and model generalization.^25,26 The evaluation network is based on long short-term memory units, as shown in Figure 5, and is used to assess the efficacy of the strategy.
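The pointer computation in equation (14) can be sketched in a few lines of plain Python. The encoder states, decoder state, and weights below are toy values chosen for illustration only; the helper names `matvec` and `pointer_distribution` are hypothetical and not taken from the paper.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def pointer_distribution(enc_states, dec_state, v, W1, W2):
    """u_j = v^T tanh(W1 e_j + W2 d_i), followed by a softmax over inputs j."""
    proj_d = matvec(W2, dec_state)
    scores = []
    for e_j in enc_states:
        proj_e = matvec(W1, e_j)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return softmax(scores)

# Toy example: three encoder states of dimension 2 and identity weight matrices.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = pointer_distribution(enc, dec, v=[1.0, 1.0], W1=identity, W2=identity)
# Greedy decoding points at the input element with the highest probability.
pointed = max(range(len(probs)), key=probs.__getitem__)
print(pointed, probs)
```

Because the softmax is taken over the input positions themselves, the output vocabulary automatically matches the number of destinations, which is what lets the output sequence have the same length as the input.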

Figure 5.

Evaluation network structure diagram.

In Figure 5, the encoder of the PN uses the same long short-term memory structure to encode a two-dimensional coordinate sequence $X$ of multiple target points into a $d$ -dimensional sequence of latent memory states. The encoded states $enc = \{e_{i}\}_{i = 1}^{n}$ lie in this $d$ -dimensional space, and the encoder also produces a hidden state $h$ . The long short-term memory unit processes the output of each time step, and the result is computed as shown in equation (15).

\begin{cases} u_{i} = v_{c}^{T} \tanh (W_{e} e_{i} + W_{h} h), \quad i \in (1, 2, \ldots, o) \\ P (h, X) = \operatorname{softmax} (u) \end{cases}

(15)

In the final stage of the process, a two-layer fully-connected network reduces the dimensionality of the hidden layer state to produce a predicted state value $V$ for the multi-target point sequence. This predicted value is then used to compute the temporal difference error that drives the network update. The actor-critic strategy, as a policy-driven reinforcement learning paradigm, trains the network through the interaction of the agent with the environment. The MOPIP algorithm is based on this training mechanism, as shown in Figure 6.

Figure 6.

Schematic diagram of MOPIP algorithm network training update structure.

In the MOPIP algorithm, the sequence of multi-target point locations is defined as the state, and the visiting sequence output by the policy network is the action. The feedback from the environment on these actions is a reward signal containing total length information, so the problem is formulated as a Markov decision process. As it interacts with the environment, the agent continuously acquires the current state, feeds it into the policy and evaluation networks, predicts actions based on the state and environmental impact, and receives rewards from environmental feedback. The network parameters are updated by calculating the error and gradient. In each training step, the agent uses the initial parameters of the policy network to interact with the sequence of multi-target points A times, forming a series of interaction trajectories that record the whole process from the initial sequence to the reward. The instantaneous reward is given in equation (16).

r (π_{i} | l_{i}) = C - L (π_{i} | l_{i})

(16)

In equation (16), $r (π_{i} | l_{i})$ is the reward received by the agent in state $l_{i}$ when it performs action $a_{i}$ according to policy $π_{i}$ , and $L (π_{i} | l_{i})$ is the total length of the new path obtained after performing action $a_{i}$ . The policy network is evaluated and its parameters are updated after these interactions. The temporal difference error $δ_{i}$ is computed for all states $l_{i}$ within the sample, as shown in equation (17).

δ_{i} = r (π_{i} | l_{i}) + γ V (l_{i + 1}, θ_{v}) - V (l_{i}, θ_{v})

(17)
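Equations (16) and (17) can be traced with a small numeric sketch. The target points, the constant $C = 10$, the discount factor $γ = 0.9$, and the value estimates are all hypothetical; the sketch also assumes an open path without a return leg, which the text does not specify.

```python
import math

def tour_length(points, order):
    """Total Euclidean length of visiting `points` in `order` (open path, an assumption)."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))

def reward(points, order, C):
    """Equation (16): r = C - L(pi | l)."""
    return C - tour_length(points, order)

def td_error(r, v_next, v_curr, gamma):
    """Equation (17): delta = r + gamma * V(l_{i+1}) - V(l_i)."""
    return r + gamma * v_next - v_curr

points = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0)]
r_long = reward(points, [0, 1, 2], C=10.0)   # path length 5 + 4 = 9
r_short = reward(points, [0, 2, 1], C=10.0)  # path length 3 + 4 = 7
delta = td_error(r_short, v_next=2.0, v_curr=4.0, gamma=0.9)
print(r_long, r_short, delta)
```

The shorter visiting order receives the larger reward, which is the signal that pushes the policy network toward shorter total paths.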

The evaluation network uses the mean square of the temporal difference error as its loss and is optimized by gradient descent, while the policy network is updated by gradient ascent along the policy gradient, as shown in equation (18).

\begin{cases} \nabla_{θ} J (θ) = \frac{1}{N} \sum_{i = 0}^{N} \nabla_{θ} \log π (a_{i} | l_{i}, θ) \cdot δ_{i} \\ \mathrm{Loss}_{θ_{v}} = \frac{1}{N} \sum_{i = 0}^{N} ‖ δ_{i} ‖_{2}^{2} \end{cases}

(18)
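The two quantities in equation (18) can be written as plain functions over a batch of samples. This is a sketch under simplifying assumptions: the temporal difference errors and log-probabilities are hypothetical numbers, the averages are taken over the number of samples supplied, and no automatic differentiation is performed. In a real implementation the gradient of the surrogate with respect to $θ$ would be taken with the temporal difference errors held constant.

```python
def critic_loss(td_errors):
    """Mean squared temporal difference error (second line of equation (18))."""
    return sum(d * d for d in td_errors) / len(td_errors)

def actor_surrogate(log_probs, td_errors):
    """Scalar whose gradient with respect to theta matches the first line of
    equation (18) when the TD errors are treated as constants."""
    return sum(lp * d for lp, d in zip(log_probs, td_errors)) / len(td_errors)

td_errors = [0.8, -0.4]    # hypothetical delta_i values
log_probs = [-1.0, -0.5]   # hypothetical log pi(a_i | l_i, theta)

loss_v = critic_loss(td_errors)              # minimised by gradient descent
j_theta = actor_surrogate(log_probs, td_errors)  # maximised by gradient ascent
print(loss_v, j_theta)
```

A positive temporal difference error increases the log-probability of the sampled action, and a negative one decreases it, which is how the evaluation network steers the policy update.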

Experiments and results

A travel path planning dataset containing map and city safety information is first constructed. Then, based on these data, the PGM-based safe path planning algorithm is evaluated to verify its efficacy in ensuring traveler safety. Finally, the study presents and analyzes the results of multi-objective safe path planning.

QL-PIPA performance analysis

The study develops a route planning dataset from OpenStreetMap maps and Chicago municipal public safety data. The safety risk data encompass crime and traffic accident records from 2021 to 2023. The data are temporally segmented into two categories, daytime and nighttime, and are then normalized so that different risk types are scored on a uniform metric. The main objective of the dataset is to evaluate the effectiveness of the proposed QL-PIPA for path planning in complex urban environments, especially its balance between safety and efficiency. The detailed information is shown in Figure 7.

Figure 7.

City map network latitude and longitude data. (a) A city with a high concentration of illegal activities (b) Targeting pedestrians and vehicles passing through areas with high levels of illegal activity.

Figure 7(a) shows the area with a high concentration of offenses, while Figure 7(b) shows the area with a high concentration of offenses targeting pedestrians and vehicles passing through. Both sub-maps cover approximately latitude 41.877° to 41.887° and longitude −87.635° to −87.625°. Safety risk data are collected and integrated into the path planning algorithms. These data include information on crime-intensive areas, particularly crimes targeting pedestrians and vehicles. The study area is known for frequent criminal activity and has multiple security concerns, which makes it well suited to assessing the effectiveness of the proposed path planning algorithm in ensuring traveler safety. The study defines the safe area based on the distance of each node on the path from the surrounding dangerous areas. The level of a dangerous area is assessed from historical data and crime frequency. The safe distance of a path node is defined as the shortest distance between the node and the nearest dangerous area, and a larger safe distance means a safer path. During path planning, the algorithm prioritizes paths with larger safe distances to ensure the safety of travelers.
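The safe distance definition above can be sketched as follows. The sketch assumes that each dangerous area is represented by a single latitude and longitude point and that distances are measured with the haversine formula; both are simplifications introduced here, and the danger location is a hypothetical value inside the study area.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius used by the haversine formula

def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))

def safe_distance(node, danger_points):
    """Shortest distance from a path node to the nearest dangerous area."""
    return min(haversine_m(node, d) for d in danger_points)

def mean_safe_distance(path, danger_points):
    """Average safe distance over all nodes of a path (larger means safer)."""
    return sum(safe_distance(n, danger_points) for n in path) / len(path)

path = [(41.880, -87.633), (41.886, -87.630)]  # start and goal used in the experiments
danger = [(41.883, -87.630)]                   # hypothetical danger location
print(round(mean_safe_distance(path, danger), 1))
```

Averaging the per-node safe distance gives a single score per path, which is the kind of quantity a planner can trade off against total path length.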

Before the data are used for the path planning algorithm, several pre-processing steps are required, including data cleaning, data integration, feature engineering, normalization, and data segmentation. The multiple feasible routes in this region provide a suitable test scenario for evaluating how well the proposed PGM-based safe path planning algorithm ensures travelers’ safety. To improve the stability and generalization ability of the model, the study conducts feature screening before modeling. The initial feature set is constructed based on domain knowledge, and redundant or invalid features are eliminated by correlation analysis and variance filtering.²⁷ On this basis, the importance of each variable is ranked using the permutation feature importance method under the XGBoost framework, and the more influential features are retained as final inputs.
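The variance filtering and correlation analysis steps can be sketched with the standard library alone. The feature names, the toy values, and the two thresholds are hypothetical, and the importance ranking under XGBoost is not reproduced here.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def screen_features(columns, min_variance=1e-3, max_abs_corr=0.95):
    """Drop near-constant features, then drop features that are almost
    perfectly correlated with a feature that was already kept."""
    kept = {name: vals for name, vals in columns.items()
            if pvariance(vals) > min_variance}
    selected = []
    for name in kept:
        if all(abs(pearson(kept[name], kept[other])) < max_abs_corr
               for other in selected):
            selected.append(name)
    return selected

columns = {
    "crime_rate": [0.1, 0.4, 0.3, 0.9],
    "crime_rate_x2": [0.2, 0.8, 0.6, 1.8],   # redundant: a rescaled copy
    "constant": [1.0, 1.0, 1.0, 1.0],        # invalid: zero variance
    "accident_rate": [0.5, 0.1, 0.8, 0.2],
}
print(screen_features(columns))
```

Running the variance filter first guarantees that every remaining feature has a non-zero standard deviation, so the correlation step never divides by zero.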

To evaluate the effectiveness of the proposed policy-guided QL algorithm for intelligent planning of travel paths, 800 simulation experiments are executed. These experiments are based on city maps and safety hazard data, with the start and goal points defined as [41.880°, −87.633°] and [41.886°, −87.630°], respectively. The experimental environment is an Intel Core i7-9700K CPU, an NVIDIA RTX 3080 GPU, and 32 GB of DDR4 RAM. Each journey of the agent from the starting point to the end point is regarded as a learning cycle. In the experiments, the algorithm is compared with EGS and strategy enhancement methods. The learning rate is set to $a = 0.01$ and the strategy parameter to $ε = 0.1$ .²⁸ The choice of hyperparameter settings and initial values has a significant impact on the performance of the algorithm. In this study, the values of these hyperparameters are determined through a series of pre-experiments to ensure that the algorithm converges in a reasonable amount of time and achieves better performance. The Q-value table for all states is initialized to 0, which is a common practice because it provides an unbiased starting point for the algorithm. Three methods are compared: E-greedy strategy-based Q-learning path planning (E-GQLPP), strategy-enhanced Q-learning path planning (SSAQPP), and the proposed Q-learning path planning algorithm based on the policy guidance mechanism (QLPGM).^29,30 Figure 8 shows the performance of these three strategies in terms of convergence efficiency and effectiveness of path planning.
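The baseline configuration described above (zero-initialized Q-table, learning rate 0.01, ε of 0.1, 800 learning cycles) can be sketched on a toy problem. The four-state corridor environment, its rewards, and the discount factor of 0.9 are assumptions made for this illustration only; the sketch shows plain ε-greedy Q-learning, not the policy guidance mechanism.

```python
import random

def epsilon_greedy(q_row, epsilon, rng):
    """Pick a random action with probability epsilon, otherwise a greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_row))
    best = max(q_row)
    return rng.choice([a for a, q in enumerate(q_row) if q == best])

def q_update(Q, s, a, r, s_next, alpha=0.01, gamma=0.9):
    """Standard Q-learning update towards r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * max(Q[s_next])
    Q[s][a] += alpha * (td_target - Q[s][a])

# Toy corridor: states 0..3, action 1 moves right, action 0 moves left.
n_states, n_actions = 4, 2
Q = [[0.0] * n_actions for _ in range(n_states)]  # Q-table initialised to 0
rng = random.Random(0)

for episode in range(800):           # one learning cycle per start-to-goal journey
    s = 0
    while s != n_states - 1:
        a = epsilon_greedy(Q[s], 0.1, rng)
        s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s_next == n_states - 1 else -0.01
        q_update(Q, s, a, r, s_next)
        s = s_next

print([[round(q, 3) for q in row] for row in Q])
```

With a learning rate this small, each value moves only 1% of the way towards its target per update, which is one reason mechanisms that cut unnecessary exploration can shorten convergence noticeably.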

Figure 8.

Three strategies converge efficiency and effectiveness in path planning. (a) Cumulative reward (b) Path total length (c) Learning steps (d) Learning time.

As illustrated in Figure 8(a), the cumulative reward of the proposed method stabilizes after an initial 160 iterations. Comparing the total path length in Figure 8(b), E-GQLPP produces the longest path. The superiority of QLPGM is further illustrated by the learning steps in Figure 8(c) and the time consumption in Figure 8(d). The proposed algorithm effectively reduces the initial fluctuation, accelerates the discovery of the optimal policy, and reduces the time consumption of each iteration. Therefore, the proposed method demonstrates significant advantages in terms of efficiency and performance. Convergence is a key metric for evaluating the performance of optimization algorithms. The QL-PIPA reduces the convergence time by 32% and 27% compared to the E-GQLPP and SSAQPP, respectively. The convergence trend can be observed by monitoring the cumulative reward after each iteration. The experimental results show that the method significantly improves its performance by exploring unknown environments and trial-and-error learning in the first 80 iterations and stabilizes after 160 iterations. Figure 9 further compares the convergence time of the three methods.

Figure 9.

Convergence time of three strategy methods.

In Figure 9, SSAQPP, which uses policy selection assistance, reaches algorithmic stability faster than E-GQLPP. Although this augmented approach reduces unnecessary action selection, much of its early exploration is still ineffective because it remains based on EGS. In contrast, QLPGM speeds up convergence more effectively: it reduces the convergence time by 32% relative to E-GQLPP and by 27% relative to SSAQPP. The experimental results show that QLPGM reaches a steady state after the first 160 iterations and remains stable over the full 800 learning iterations. This indicates that the method adapts quickly to the environment and converges to a stable strategy within a limited number of iterations. The introduction of PGM significantly increases the convergence speed and reduces the initial fluctuations, which helps the algorithm maintain stability in the face of environmental changes. PGM guides the agent's action selection by balancing exploration and exploitation, which accelerates convergence. The weaker performance of E-GQLPP may be because it relies mainly on a combination of random selection and greedy strategies. Although simple and effective, this combination performs poorly in complex dynamic environments, especially in multi-objective path planning, where it cannot quickly identify and avoid potential safety risks. SSAQPP relies only on policy extension; although this improves the exploratory nature of traditional QL, its adaptability and convergence speed in complex environments remain limited, and it cannot effectively balance path safety and efficiency. The results of the ablation experiments are shown in Table 3.

Table 3.

Verification results of ablation experiments.

Experimental setup		Convergence time (number of iterations)	Path length (m)	Safety score (distance value)
PGM vs No PGM	PGM	220	4500.12	325.47
PGM vs No PGM	No PGM	350	4702.58	280.33
Different QL exploration strategies	E-greedy	280	4600.73	300.45
	SSAQPP	250	4552.63	310.12
	QLPGM	220	4500.12	325.47
Safety-weighted reward function	Safety distance weighted at 0.5, risk weighted at 0.5	230	4556.45	320.78
	Safe distance weighted at 0.8, risk weighted at 0.2	240	4480.91	335.22
	Safe distance weighted at 0.2, risk weighted at 0.8	260	4721.56	295.66

In Table 3, the method with PGM reduces the convergence time by 37% and improves the safety score by 14.4% compared to the algorithm without PGM. This indicates that PGM significantly improves the efficiency and safety of path planning. Compared with the EGS and SSAQPP, the PGM-based QL method performs better in terms of both convergence time and path length, and has the highest safety score. This indicates that the guidance mechanism of PGM can effectively reduce unnecessary random exploration and optimize path selection. In addition, the study verifies the impact of different safety distance and risk weights. When the weight of the safety distance is greater (0.8), the method chooses the safest path (safety score 335.22), which in this test is also the shortest (4480.91 m), at the cost of a slightly longer convergence time than the balanced setting. On the other hand, increasing the weight of risk to 0.8 yields the longest path (4721.56 m), the lowest safety score (295.66), and the slowest convergence. This indicates that the reward function is adjustable, and its parameters can be tuned for different urban environments.

Safe path planning evaluation

To further illustrate the effectiveness of the proposed method in safe path planning, the study introduces the dueling deep Q-network (DDQN), deep deterministic policy gradient (DDPG), and proximal policy optimization (PPO) for performance comparison, and repeats the test five times using the dataset as training data.^31,32 All baseline methods were retrained and fine-tuned using the same dataset and experimental environment. The hyperparameters of each algorithm were optimized through grid search, and multiple rounds of repeated experiments were conducted to obtain average results, which were then subjected to a two-sample t-test. The specific results are shown in Table 4.

Table 4.

Test results of different methods of path planning.

Method	Average runtime (s)	Path length (m)	Safety score (distance value)
QLPGM	85.32 ± 3.67	4523.87 ± 9.62	321.41 ± 7.28
DDQN	92.68 ± 4.12	4593.44 ± 14.10	305.76 ± 6.54
DDPG	110.56 ± 5.34	4682.56 ± 13.22	290.89 ± 8.93
PPO	98.44 ± 4.98	4635.12 ± 12.49	298.14 ± 9.12
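The two-sample t-test mentioned above can be sketched from the summary statistics in Table 4. This sketch assumes that the values after ± are sample standard deviations, that each method was run five times, and that the unequal-variance (Welch) form of the test is used; none of these details are confirmed by the text.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch's t statistic and approximate degrees of freedom from summary statistics."""
    se1, se2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(se1 + se2)
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Average runtime of QLPGM versus DDQN, taken from Table 4 (n = 5 is an assumption).
t, df = welch_t(85.32, 3.67, 5, 92.68, 4.12, 5)
print(round(t, 2), round(df, 1))
```

Under these assumptions the statistic has a magnitude of roughly 3 on about 8 degrees of freedom, which exceeds the two-sided 5% critical value of about 2.3 and is therefore consistent with the reported P < 0.05.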

In Table 4, QLPGM has the lowest average running time of the four methods, and the difference is statistically significant (P < 0.05). This indicates that it has a lower computational cost and higher computational efficiency. Compared with DDQN, DDPG, and PPO, the proposed method also yields the shortest path with the smallest spread and the highest overall safety score. The study then compares the path planning results of the proposed policy-guided safe path planning algorithm with those of the Dijkstra shortest path algorithm and the safe path approach. Three sets of paths with different start and end points are selected for testing to analyze the performance differences of each algorithm in path planning. The corresponding start and end point data are shown in Table 5.

Table 5.

Data related to starting and destination points.

/	Path one	Path two	Path three
Initial point	[41.880°, −87.633°]	[41.886°, −87.630°]	[41.884°, −87.634°]
End point	[41.886°, −87.628°]	[41.880°, −87.629°]	[41.880°, −87.626°]

Next, the planned paths are evaluated in terms of length and safety, based on the overall path length and the average safe distance, as shown in Figure 10. Figure 10(a), (b), and (c) present the planning results of the three algorithms for paths one, two, and three listed in Table 5, respectively.

Figure 10.

The average length and safe distance of the planning algorithm for paths one to three. (a) Path one (b) Path two (c) Path three.

As can be observed from the three subgraphs, for paths one through three the proposed algorithm increases the path length by only 0.39%, 0.63%, and 0.55%, respectively, relative to the shortest paths, while improving the average safe distance by 8.01%, 11.96%, and 16.50%, respectively. In contrast, the traditional shortest path algorithm mainly optimizes the path length and fails to effectively avoid safety hazards. Compared to the safe path approach, the proposed algorithm shows a slight increase in path length but a more significant improvement in the average safe distance. Although the safe path approach takes safety into account, it may overlook some high-risk segments when planning short paths. Therefore, the proposed algorithm demonstrates better safety and efficiency in travel path planning.
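The contrast between a pure shortest-path baseline and a risk-aware planner can be illustrated with Dijkstra's algorithm run under two different edge costs. This is not the proposed QL-PIPA; the toy graph, the risk values, and the cost `length_m * (1 + risk)` are hypothetical choices made only to show how a safety term can change the selected route.

```python
import heapq

def dijkstra(graph, source, target, cost):
    """Return the minimum-cost path from source to target.

    graph[u] is a list of (v, length_m, risk) edges and cost(length_m, risk)
    gives the non-negative weight of one edge. The target must be reachable.
    """
    best = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            break
        if d > best.get(u, float("inf")):
            continue  # stale heap entry
        for v, length_m, risk in graph.get(u, []):
            nd = d + cost(length_m, risk)
            if nd < best.get(v, float("inf")):
                best[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    return path[::-1]

graph = {
    "A": [("B", 100.0, 0.9), ("C", 120.0, 0.1)],
    "B": [("D", 100.0, 0.9)],
    "C": [("D", 110.0, 0.1)],
}
shortest = dijkstra(graph, "A", "D", lambda length_m, risk: length_m)
safer = dijkstra(graph, "A", "D", lambda length_m, risk: length_m * (1.0 + risk))
print(shortest, safer)
```

In this toy graph the risk-aware cost accepts a route that is 30 m longer in exchange for much lower risk, mirroring the small length increase and larger safety gain reported above.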

Multi-destination planning evaluation

The experiments first set the number of target points to 5 and 10, with an initial learning rate of 0.001. When the number of target points is 20, the initial learning rate is adjusted to 0.0001, and the number of interactions is 10,000. Next, Euclidean distances are used to calculate the distance between the target points, and the path planning results of the distance matrix mapping (DMM), genetic algorithm (GA), PN, and MOPIP algorithms are compared, as shown in Figure 11.

Figure 11.

Different algorithm path length comparison.

Figure 11 illustrates the average total path length across three different urban regions, namely downtown (data1), residential (data2), and high-risk mixed-use (data3), based on 90 to 105 repeated simulations per setting. Results are derived from real-world data from OpenStreetMap and the City of Chicago Data Portal. Figure 11(a), (b), and (c) compare the algorithms on data1, data2, and data3, respectively. In all three subfigures, the GA, the PN algorithm, and the MOPIP algorithm achieve similar average total path lengths when the number of target points is 5 or 10. This indicates that the MOPIP algorithm matches the other methods when the number of path planning nodes is small. Subsequently, the study introduces the safe distance between two target points as an evaluation criterion to test the practical efficacy of the MOPIP algorithm in urban settings. The MOPIP algorithm is compared with other multi-objective point path planning methods. The results of intelligent multi-objective point path planning with different algorithms are shown in Table 6.

Table 6.

Comparison of the results of intelligent planning algorithm for multi-target point security path.

Target point number K	DMM	GA	MOPIP
5	2046.30	2046.30	2046.30
10	3388.00	3076.80	2980.95
20	6639.40	6450.60	6441.30

In Table 6, the MOPIP algorithm, the DMM method, and the GA achieve the same result in the scenario with 5 target points. However, as the number of target points increases, the superiority of the MOPIP algorithm becomes more obvious. In the case of 10 target points, the total length of safe paths planned by the MOPIP algorithm is reduced by 11.98% and 3.09% compared to the DMM method and GA, respectively. When the number of target points increases to 20, the total path length is still reduced by 3.00% and 0.15% compared to the DMM method and GA, respectively. This indicates that the MOPIP algorithm retains an efficiency advantage and good stability when dealing with more complex multi-objective point path planning tasks.

Computational complexity analysis

Finally, the study further analyzes the computational complexity of the QL-PIPA and the MO-PIP algorithm in the proposed framework, as well as a comparison of the complexity of both with other algorithms. Figure 12 depicts the details.

Figure 12.

Comparison of computational complexity of different algorithms. (a) Comparison of time complexity (b) Comparison of space complexity.

Figure 12(a) and (b) compare the time and space complexity of the different algorithms. Here, E denotes the number of learning iterations, S the size of the state space, A the size of the action space, and A′ the average number of actions in each state after the introduction of PGM. k denotes the number of target points, l the number of layers of the network, and p the number of parameters of the network. g denotes the total number of iterations required by the GA, and N the population size. In Figure 12, the time complexity of the QLPGM algorithm is significantly reduced after the introduction of PGM. The QLPGM algorithm needs to store additional policy guidance information, so it requires more storage space than the E-GQLPP algorithm. Combining the QLPGM and MOPIP algorithms, the total time complexity of the proposed framework is O(E × S × A′ + k × l) and the total space complexity is O(S × A + S + p).

Discussion

To address the limitations of traditional path planning algorithms in complex urban environments, the study proposed the QL-PIPA based on PGM and extended it to the multi-objective planning scenario with MO-PIP. The experimental results showed that, compared with E-GQLPP and SSAQPP, the proposed method reduced the convergence time by 32% and 27%, respectively, and exhibited strong stability after 160 iterations. This improvement was due to the PGM, which effectively balanced exploration and exploitation by introducing safety-aware potential fields and the reward function design. The study showed that PGM could avoid unnecessary exploration by the agent and converge to optimal policies quickly, which is consistent with related work on guided safe reinforcement learning.³³

In terms of safety, QL-PIPA outperformed Dijkstra’s shortest path and traditional safe path algorithms. Although the increase in path length was minimal, the average safety distance was significantly improved, confirming the algorithm’s advantage in navigating high-risk urban areas. In multi-goal scenarios, the MO-PIP algorithm showed performance gains as the number of goal points increased. When the number of destinations reached 20, MO-PIP reduced the total path length by 3.00% compared to DMM and by 0.15% compared to GA. This demonstrated its scalability and suitability for complex routing problems. Compared to baseline methods such as DDQN, DDPG, and PPO, the proposed method achieved better average runtime, shorter paths, and higher safety scores. In addition, ablation experiments confirmed the critical contribution of PGM and safety-weighted reward functions to overall performance.

In addition, the path planning algorithm proposed in this paper has good practical applicability. In the context of smart city construction, this algorithm can be integrated into navigation applications, urban emergency evacuation systems, and public transportation scheduling platforms to provide travelers with safe and efficient path recommendations. In the logistics and delivery sector, the MO-PIP model can address path optimization challenges in multi-destination delivery tasks, enhancing delivery efficiency and reducing operational risks. In autonomous driving scenarios, the strategy-guided mechanism helps vehicles avoid high-risk areas in complex road networks, ensuring driving safety. Therefore, the methodology proposed in this study demonstrates feasibility and scalability for deployment in real-world urban mobility and traffic control systems.

Nevertheless, there are limitations to this approach. Although the proposed method performs well in static or semi-dynamic environments, its adaptability to rapidly changing environments, such as real-time traffic updates and weather disturbances, has not been fully tested. Future research could introduce time-sensitive risk weighting mechanisms to enhance the model’s adaptability in dynamic safety environments. A similar modelling approach has been applied in Zhu et al.'s research on time-aware path planning.¹ In addition, uncertainty quantification (UQ) mechanisms can be introduced, following the approach proposed by Barzegar et al. to achieve efficient UQ without changing the network structure.³⁴ The incorporation of such mechanisms is expected to improve the stability and reliability of the model in high-risk areas or data-sparse scenarios.

Conclusions

This study proposed a policy-guided QL-PIPA and its multi-destination extension MO-PIP to address the safety, efficiency, and scalability requirements of modern travel path planning. A series of experiments in realistic urban scenarios led to the following findings. QL-PIPA reduced convergence time by up to 32% compared to baseline QL strategies. The algorithm significantly improved path safety, increasing the average safety distance by over 10% while maintaining competitive path lengths. The MO-PIP model demonstrated strong scalability, with improved performance in multi-destination route planning as the number of destinations increased. Compared to DDQN, DDPG, and PPO, the proposed algorithm achieved better runtime and higher safety scores, validating its effectiveness. These results indicated that the proposed framework could provide travelers with safer and more efficient routes, especially in complex urban environments. It also laid the foundation for future use in applications such as autonomous driving, logistics routing, and smart tourism.

Future research will integrate dynamic environmental data such as traffic conditions and weather changes to improve the real-time adaptability of the algorithms, and will explore lightweight model variants suitable for deployment on mobile or edge devices. Moreover, it will combine PGM with DRL techniques to deal with higher dimensional state-action spaces and more complex decision scenarios. In addition, input perturbation analysis and feature attribution methods will be introduced to clarify the degree of influence of different features on model output, optimize feature weights and model structure, and improve model interpretability.

Footnotes

ORCID iD

Jingya Shi

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research is supported by: Special ideological and political project of scientific research projects in Colleges and universities of the autonomous region in 2021, Research on the teaching method of integrating the red tour into the ideological and political course of Tourism Specialty in Higher Vocational Colleges, (NO. NJSZZX2174).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Zhu, Cheng, Zhang, et al. Multi-robot environmental coverage with a two-stage coordination strategy via deep reinforcement learning. IEEE Trans Intell Transport Syst 2024; 25(6): 5022–5033.

2. Pan, Yan, et al. Golden eagle optimizer with double learning strategies for 3D path planning of UAV in power inspection. Math Comput Simulat 2022; 193(1): 509–532.

3. Liu, Gao, et al. Double BP Q-learning algorithm for local path planning of Mobile robot. Comput Commun 2021; 9(6): 138–157.

4. Parmar, Guzzetti. Interactive imitation learning for spacecraft path-planning in binary asteroid systems. Adv Space Res 2021; 68(6): 1928–1951.

5. Yan, Xiang, Wang. Towards real-time path planning through deep reinforcement learning for a UAV in dynamic environments. J Intell Rob Syst 2020; 98(2): 297–309.

6. Jiang, Zeng, Guzzetti. Path planning for asteroid hopping rovers with pre-trained deep reinforcement learning architectures. Acta Astronaut 2020; 171(6): 265–279.

7. Rayguru, Kumar, Ramalingam, et al. Simultaneous motion-slosh control for a self-reconfigurable Mobile robot. In: 2021 International Symposium of Asian Control Association on Intelligent Robotics and Industrial Automation (IRIA), 2021, pp. 240–246.

8. Chaturvedi, Kumar. A PSO-Optimized novel PID neural network model for temperature control of jacketed CSTR: design, simulation, and a comparative study. Soft Comput 2024; 28(6): 4759–4773.

9. Kumar, Srivastava, Gupta. Comparative study of neural networks for control of nonlinear dynamical systems with lyapunov stability-based adaptive learning rates. Arabian J Sci Eng 2018; 43: 2971–2993.

10. Sengupta, Dangi, Guruprasad. Towards sustainable and resilient smart cities: a comprehensive review. Int J Environ Sci 2025; 11(9): 609–622. https://theaspd.com/index.php/ijes/article/view/1647

11. Zamoum, Baiche, Benkeddad, et al. Modern artificial intelligence technics for unmanned aerial vehicles path planning and control. Bulletin EEI 2025; 14(1): 153–172.

12. Merikhipour, Khanmohammadidoustani, Abbasi. Transportation mode detection through spatial attention-based transductive long short-term memory and off-policy feature selection. Expert Syst Appl 2025; 267: 126196.

13. Nikookar, Namazi Nia, Basu Roy, et al. Model reusability in Reinforcement Learning. The VLDB Journal 2025; 34(4): 1–22.

14. Dai, et al. A lightweight reinforcement-learning-based real-time path-planning method for unmanned aerial vehicles. IEEE Internet Things J 2024; 11(12): 21061–21071.

15. Shang, Liu, Qin, et al. Research on path planning of autonomous vehicle based on RRT algorithm of Q-learning and obstacle distribution. Eng Comput 2023; 40(5): 1266–1286.

16. Cai, Liang, et al. Prediction-based path planning for safe and efficient human-robot collaboration in construction via deep reinforcement learning. J Comput Civ Eng 2023; 37(1): 04022046.

17. Zhou, Wang. Path planning of Mobile robot in complex environment based on improved Q-learning algorithm. Int J Mech Robot Syst 2022; 5(3): 223–245.

18. Chen, Pei, et al. A deep reinforcement learning based method for real-time path planning and dynamic obstacle avoidance. Neurocomputing 2022; 497(8): 64–75.

19. Brandonisio, Michèle, Guzzetti. Reinforcement learning for uncooperative space objects smart imaging path-planning. Springer Sci Business Media LLC 2021; 68(4): 1145–1169.

20. Jindal, Agarwal, Chanak. Emergency evacuation system for clogging-free and shortest-safe path navigation with IoT-enabled WSNs. IEEE Internet Things J 2021; 9(13): 10424–10433.

21. Pasha. Multi-UAV path planning using grey wolf optimization and RRT algorithm. J Soft Comp Para 2025; 7(2): 90–102.

22. Yue, Liu, et al. A modified dueling DQN algorithm for robot path planning incorporating priority experience replay and artificial potential fields. Appl Intell 2025; 55(6): 1–27.

23. Wang, Hirota, et al. Hybrid bidirectional rapidly exploring random tree path planning algorithm with reinforcement learning. J Adv Comput Intell Intell Inf 2021; 25(1): 121–129.

24. A novel deep learning driven robot path planning strategy: Q-learning approach. Int J Comput Appl Technol 2023; 71(3): 237–243.

25. Wang, Zhu, Wang, et al. An active object detection model with multi-step prediction based on deep q-learning network and innovative training algorithm. Appl Intell 2025; 55(2): 1–21.

26. Cao. Design research based on wearable device and ar technology: a case study of treatment for patients with dental anxiety. Appl Comput Lett 2022; 6(1).

27. Mokayed, Quan, Alkhaled, et al. Real-time human detection and counting system using deep learning computer vision techniques. Artif Intell Appl 2023; 1(4): 221–229.

28. Barto. Reinforcement learning: an introduction by Richard’s Sutton. SIAM Rev 2021; 6(2): 423.

29. Chen, Kuang, et al. Static photovoltaic models’ parameter extraction using reinforcement learning strategy adapted local gradient Nelder-Mead Runge Kutta method. Appl Intell 2023; 53(20): 24106–24141.

30. Kong, Zhou, et al. Hierarchical reinforcement learning from competitive self-play for dual-aircraft formation air combat. J Comput Des Eng 2023; 10(2): 830–859.

31. Guo, Wang, et al. Deep deterministic policy gradient algorithm based reinforcement learning controller for single-inductor multiple-output DC–DC converter. IEEE Trans Power Electron 2024; 39(4): 4078–4090.

32. Funika, Koperek, Kitowski. Automated cloud resources provisioning with the use of the proximal policy optimization. J Supercomput 2023; 79(6): 6674–6704.

33. Wang, Qin, Kan. Shielded planning guided data-efficient and safe reinforcement learning. IEEE Transact Neural Networks Learn Syst 2024; 36(2): 3808–3819.

34. Abbaszadeh Shahri, Shan, Larsson. A novel approach to uncertainty quantification in groundwater table modeling by automated predictive deep learning. Nat Resour Res 2022; 31(3): 1351–1373.