Abstract
The inherent training instability of the Deep Deterministic Policy Gradient (DDPG) algorithm has critically hindered its practical application to complex, safety-critical tasks such as quadrotor attitude control. To address this key challenge, this paper proposes an integrated approach named RS-DDPG (Robust and Stabilized DDPG), designed to enhance training stability and controller robustness. While individual components like delayed policy updates (adapted from Twin Delayed DDPG (TD3)) and exponential reward functions have been explored, our contribution lies in the synergistic integration of these elements with a structured curriculum and evaluation framework. This holistic approach is shown to be effective for this specific control problem. Extensive simulations and ablation studies, benchmarked against both standard DDPG and TD3, provide evidence of the efficacy of our approach. The resulting controller not only surpasses the baselines in convergence speed and performance but also exhibits strong robustness against a wide range of random initial states, persistent external disturbances, and significant model uncertainties. This work demonstrates how the careful integration of existing and novel components can yield a reliable, high-performance, data-driven controller, representing a vital step toward bridging the gap between simulation and real-world deployment in aerial robotics.
Introduction
Background and motivation
Quadrotors, a type of unmanned aerial vehicle (UAV), have garnered significant attention in recent years due to their simple mechanical structure, hovering capabilities, and vertical take-off and landing (VTOL) abilities. These characteristics make them ideal for a wide range of applications, including aerial photography, package delivery, infrastructure inspection, and search and rescue missions. However, the control of a quadrotor presents a formidable challenge. Its dynamics are inherently nonlinear, strongly coupled, and underactuated, making it difficult to design high-performance and robust controllers using traditional model-based methods.
In response to these challenges, reinforcement learning (RL) has emerged as a powerful, model-free paradigm for solving complex control problems. Among various RL algorithms, the Deep Deterministic Policy Gradient (DDPG) is particularly well-suited for continuous control tasks like quadrotor attitude control, as it can learn policies in high-dimensional state and action spaces. By directly interacting with the environment, a DDPG agent can learn an optimal control policy without requiring an explicit mathematical model of the system dynamics.
This potential has motivated the application of DDPG to quadrotor control. However, the standard DDPG algorithm is known for its training instability and sensitivity to hyperparameters. A direct application often fails to yield a controller that is both high-performing and robust enough for the demanding task of attitude stabilization. This gap between the potential of DDPG and the practical requirements of quadrotor control motivates our work to develop an enhanced algorithm and a more reliable training framework.
Challenges of DDPG in quadrotor control
Despite its theoretical appeal, applying the standard DDPG algorithm to quadrotor attitude control is fraught with several significant challenges:
Our contributions
To address the aforementioned challenges, this paper proposes Robust and Stabilized Deep Deterministic Policy Gradient (RS-DDPG), an enhanced DDPG-based approach integrated within a Robust Training Framework. Our contributions are as follows:
An Integrated DDPG Algorithm with Enhanced Stability: Our primary contribution is not a single novel component but the systematic integration of three key modifications to improve stability and learning efficiency, adapting best practices from prior research:
● Delayed policy updates
● Exponential reward shaping
● Gradient clipping
Related work
Quadrotor control methods
The control of quadrotors has been a subject of extensive research, leading to a variety of model-based control strategies. Proportional–Integral–Derivative (PID) controllers, due to their simplicity and effectiveness, are widely used and often serve as a baseline for attitude and position stabilization (Moin et al., 2024). However, PID controllers require meticulous manual tuning and may exhibit suboptimal performance when faced with significant nonlinearities and external disturbances. To address these limitations, more advanced model-based techniques such as backstepping (Wen et al., 2022) and robust control methods have been proposed.
Recognizing the limitations of purely model-based or purely learning-based approaches, a significant trend in recent research is the development of hybrid control architectures. These frameworks aim to combine the stability guarantees of traditional controllers with the adaptability and learning capabilities of RL. For instance, some studies propose supplementary RL controllers that work in tandem with a baseline controller, using the RL agent to compensate for model uncertainties and external disturbances online (Lin et al., 2019; Xu et al., 2022). Yoo et al. (2021) integrate a classical PD or LQR controller with an RL policy, demonstrating faster convergence and improved performance. Similarly, Hua and Fang (2023) propose a novel RL-based robust control strategy where a Robust Integral of the Signum of the Error (RISE) controller acts as a benchmark to guide the learning process, ensuring both performance and stability. These hybrid methods underscore the benefits of integrating learning with classical control; however, they often rely on a pre-existing stable controller and may not fully exploit the potential of end-to-end learning for optimal policy discovery. Our work, while purely RL-based, draws inspiration from this line of research by focusing on creating an inherently stable and robust learning process from the ground up.
RL for quadrotor control
Deep reinforcement learning (DRL) has emerged as a powerful model-free paradigm for quadrotor control, capable of learning complex policies directly from interaction with the environment. Among various DRL algorithms, actor-critic methods that can handle continuous action spaces are particularly popular. The DDPG algorithm has been a foundational method applied to quadrotor attitude control (Fei et al., 2024; Kim and Jung, 2024) and for tuning the parameters of other controllers (Al-Shayeb et al., 2024; Li and Wang, 2024; Sonmez et al., 2024). This trend is part of a broader application of RL in aerospace, which includes tasks such as fault-tolerant flight control for passenger aircraft (Mohammadi et al., 2025).
However, the standard DDPG algorithm is known to suffer from significant training instability, primarily due to the overestimation bias in the critic network, which can lead to suboptimal or divergent policies. To address this, subsequent research has focused on improving the stability of the learning process. The TD3 algorithm, an extension of DDPG, introduces clipped double Q-learning, delayed policy updates, and target policy smoothing to mitigate this bias (see Fujimoto et al. (2018)). TD3 has been successfully applied to various quadrotor tasks, including payload transportation (Lin et al., 2024) and trajectory tracking (Deng et al., 2023), demonstrating superior robustness compared to DDPG. Our proposed RS-DDPG algorithm explicitly incorporates the principle of delayed policy updates from TD3 as a core component to stabilize the learning of the critic. Furthermore, some works have noted the issue of overestimation bias and proposed novel network structures, such as the advantaged clipped critic networks, to alleviate it (Zhang et al., 2024), confirming that tackling this bias is a critical research direction.
More recently, algorithms such as soft actor-critic (SAC) have gained prominence due to their excellent sample efficiency and stability, achieved by augmenting the expected-return objective with an entropy maximization term (see Haarnoja et al. (2018)). While SAC represents the state-of-the-art in many continuous control tasks, our work focuses on demonstrating that a carefully structured and enhanced DDPG/TD3-style architecture can still achieve high levels of performance and robustness for specific applications like quadrotor attitude control. Furthermore, a critical challenge for all DRL-based controllers is the simulation-to-reality (sim-to-real) gap. Techniques such as domain randomization, which involves training the agent across a wide range of simulated physical parameters, are often employed to create policies that are robust enough for real-world deployment. While this paper focuses on establishing a robust controller in an idealized simulation, acknowledging the sim-to-real challenge is crucial for future work.
Another key challenge in applying DRL to quadrotor control is the design of an effective reward function to guide the learning agent. A sparse or poorly designed reward signal can lead to inefficient exploration and slow convergence. Consequently, significant effort has been dedicated to reward shaping. For example, Trad et al. (2023) designed a reward function to speed up the learning process for attitude control. Abo Mosali et al. (2022) introduced a novel reward formulation incorporating exponential functions and piecewise decomposition to enable more stable learning behavior. Our work builds upon this insight by proposing an exponential reward shaping strategy that provides a smooth and dense gradient, effectively guiding the agent towards stable flight.
Finally, to improve sample efficiency and handle complex tasks, researchers have explored structured training methodologies. Curriculum learning, where the agent is trained on progressively more difficult tasks, is effective for challenging maneuvers like flying through a narrow gap (Xiao et al., 2023) or for autonomous motion control in unknown dynamic environments (Hu et al., 2023). While these approaches improve learning efficiency, they often lack a systematic framework for evaluating the final policy’s robustness. A model that performs well on average during a stochastic training process (i.e. with exploration noise) may not be the most reliable or performant in a deterministic (deployment) setting. Our evaluation protocol is designed to address this mismatch.
In summary, while existing research has independently addressed the issues of algorithmic stability, reward shaping, and training strategy, a holistic framework that integrates these solutions to produce a genuinely robust controller is still lacking. Our work, RS-DDPG, aims to fill this gap by combining key algorithmic enhancements (delayed policy updates, exponential reward shaping, gradient clipping) with a robust training framework that leverages curriculum learning and a rigorous, performance-based model selection criterion. This integrated approach is essential for moving beyond high average rewards in simulation to developing controllers that are reliable for real-world deployment.
The RS-DDPG framework
Problem formulation
The challenge of quadrotor attitude control is to devise a control policy that steers the vehicle’s angular orientation and velocities from any initial state to a desired stable setpoint, typically a hover state where all angular positions and rates are zero. We formalize this task within the mathematical structure of a Markov Decision Process (MDP), which is defined by the tuple
The ultimate objective is to learn a deterministic policy
Preliminaries on DDPG
The DDPG algorithm serves as the foundation for our approach. As an off-policy, model-free actor-critic method, DDPG is well-suited for control problems with continuous state and action spaces. It concurrently learns a Q-function (the critic) and a policy (the actor). The critic,
where
This process, however, is susceptible to instabilities, which we address in our proposed framework.
The proposed RS-DDPG algorithm
Our RS-DDPG algorithm enhances the standard DDPG framework through three critical modifications aimed at improving training stability, learning efficiency, and overall control performance.
Delayed policy updates
A primary source of instability in DDPG is the coupled nature of the actor and critic updates, where a rapidly changing critic can provide a noisy and oscillating gradient to the actor. To decouple these updates and mitigate the critic’s overestimation bias, we adopt the principle of delayed policy updates, a key feature of the TD3 algorithm. In our implementation, the critic network is updated at every training step, whereas the actor network and the target networks are updated less frequently, specifically once every two critic updates (
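The update schedule described in this subsection can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy weight vectors, and the soft-update coefficient `tau` are assumptions introduced for illustration, and the actual network updates are replaced by counters.

```python
def soft_update(target, online, tau):
    """Polyak-average online parameters into the target parameters."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target, online)]


def run_schedule(num_steps, policy_delay=2, tau=0.005):
    """Count updates under a delayed-policy-update schedule.

    The critic is updated at every step; the actor and the target
    networks are updated once every `policy_delay` critic updates.
    """
    counts = {"critic": 0, "actor": 0, "target": 0}
    online, target = [1.0, -1.0], [0.0, 0.0]  # toy stand-ins for network weights
    for step in range(1, num_steps + 1):
        counts["critic"] += 1              # a critic gradient step would go here
        if step % policy_delay == 0:
            counts["actor"] += 1           # an actor gradient step would go here
            target = soft_update(target, online, tau)
            counts["target"] += 1
    return counts, target


counts, target = run_schedule(num_steps=10)
print(counts)  # {'critic': 10, 'actor': 5, 'target': 5}
```

With a delay of two, the actor sees the critic only after two consecutive value updates, which is the decoupling the text relies on.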
Exponential reward shaping
The design of the reward function is paramount for effectively guiding the agent’s learning process. A poorly shaped reward can lead to slow convergence or suboptimal behaviors. We introduce a novel exponential reward shaping strategy to provide a dense, smooth, and highly informative learning signal. The reward at each time step,
To formalize this, we define the attitude error vector as
These scalar values quantify the overall magnitude of the attitude deviation and angular rate error, respectively. The weights
Table 1. RS-DDPG hyperparameters.
The selection of an exponential function, as opposed to a more common quadratic penalty (e.g.
In contrast, the gradient of our exponential reward term (
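The gradient argument can be checked numerically with a short sketch. The reward forms used here are assumptions chosen only to illustrate the comparison: a quadratic penalty r(e) = -e² and an exponential term r(e) = exp(-k·e) of a scalar error magnitude e ≥ 0, with a hypothetical gain k = 1. They are not the paper's exact reward definition.

```python
import math


def quadratic_reward_grad(e):
    """d/de of the quadratic penalty r(e) = -e**2."""
    return -2.0 * e


def exponential_reward_grad(e, k=1.0):
    """d/de of the exponential reward r(e) = exp(-k * e), for e >= 0."""
    return -k * math.exp(-k * e)


# Compare the two gradients as the error magnitude shrinks toward zero.
for e in (1.0, 0.1, 0.001):
    print(f"e={e:<6} quadratic={quadratic_reward_grad(e):+.4f} "
          f"exponential={exponential_reward_grad(e):+.4f}")
```

Under these assumptions the quadratic gradient shrinks in proportion to the error, while the exponential gradient tends to -k, so the learning signal does not vanish near the setpoint.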
Gradient clipping
During the backpropagation process, large error signals can lead to excessively large gradients, causing drastic and unstable updates to the network weights, a phenomenon often referred to as “exploding gradients.” To ensure a stable and monotonic learning curve, we implement gradient clipping for both actor and critic updates. After the gradients
This procedure caps the magnitude of the parameter updates in a single step, which helps prevent the learning process from diverging and promotes smooth convergence of the network weights.
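As an illustration, one common way to implement such a ceiling is clipping by the global L2 norm. The sketch below assumes norm-based clipping and a hypothetical threshold `max_norm`; the paper's exact clipping rule and threshold may differ.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so its L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the original norm.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads), norm          # already within the limit
    scale = max_norm / norm               # shrink, preserving direction
    return [g * scale for g in grads], norm


clipped, original_norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(original_norm)  # 5.0
print(clipped)
```

Rescaling the whole vector keeps the update direction unchanged and only limits its length, which is why it is often preferred over clipping each component independently.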
The robust training framework
To complement the algorithmic enhancements, we developed a robust training framework that methodically improves the agent’s capabilities and ensures that the final selected model is genuinely reliable.
Curriculum learning strategy
Directly exposing a naive agent to highly challenging initial conditions can lead to repeated failures and hinder the learning process. We therefore employ a curriculum learning strategy that gradually increases the task difficulty. The difficulty is controlled by a factor
The choice of 200 episodes for the ramp-up was determined empirically. This duration was found to be a good trade-off: short enough to accelerate learning, but long enough for the agent to robustly master the simpler tasks before the full difficulty was introduced. A faster ramp-up led to instability, while a slower one unnecessarily prolonged the training process. The initial state for an episode is then set according to:
where c is a vector defining the maximum possible deviation. This easy-to-hard progression allows the agent to first master the fundamentals of stabilization from near-stable states before being required to handle larger perturbations, which significantly accelerates learning and improves the quality of the final converged policy.
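The easy-to-hard progression can be sketched as follows. The 200-episode ramp length comes from the text; the linear ramp shape, the uniform sampling, and the numeric deviation bounds in `max_dev` are assumptions made for illustration only.

```python
import random

RAMP_EPISODES = 200  # ramp-up length stated in the text


def difficulty(episode, ramp_episodes=RAMP_EPISODES):
    """Linear ramp from 0 to 1 over the first `ramp_episodes` episodes (assumed shape)."""
    return min(1.0, episode / ramp_episodes)


def sample_initial_state(episode, max_dev, rng):
    """Draw each state component uniformly within +/- difficulty * max deviation."""
    d = difficulty(episode)
    return [d * c * rng.uniform(-1.0, 1.0) for c in max_dev]


rng = random.Random(0)
# Hypothetical bounds: three angles (rad) followed by three angular rates (rad/s).
max_dev = [0.5, 0.5, 0.5, 1.0, 1.0, 1.0]
for ep in (20, 100, 200, 300):
    state = sample_initial_state(ep, max_dev, rng)
    print(ep, difficulty(ep), [round(x, 3) for x in state])
```

Early episodes therefore start close to hover, and the full range of initial deviations is reached only once the ramp completes.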
Robust model evaluation and selection
The stochasticity introduced by exploration noise during training means that a high reward in a single episode does not guarantee a robust policy. To select a controller based on its true performance, we implement a rigorous evaluation and selection protocol. The training process is periodically paused (e.g. every 10 episodes) to conduct a series of deterministic evaluations. During this phase, the agent’s current policy is tested for a fixed number of episodes (e.g. 5) with exploration noise turned off. The average cumulative reward over these evaluation episodes,
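A minimal sketch of such a protocol is given below. The evaluation interval (10 episodes) and the number of evaluation episodes (5) are the example values from the text; the selection rule (keep the checkpoint with the highest average evaluation return) is an assumption, and the training and rollout steps are replaced by toy random stand-ins.

```python
import random

EVAL_INTERVAL = 10   # evaluate every 10 training episodes (example value from the text)
EVAL_EPISODES = 5    # noise-free rollouts per evaluation (example value from the text)


def evaluate(policy_quality, rng, episodes=EVAL_EPISODES):
    """Average return over noise-free rollouts; a toy stand-in for simulation."""
    returns = [policy_quality + rng.gauss(0.0, 1.0) for _ in range(episodes)]
    return sum(returns) / len(returns)


def train_with_selection(num_episodes, rng):
    """Train, evaluate periodically, and remember the best evaluated checkpoint."""
    best_score, best_episode = float("-inf"), None
    policy_quality = 0.0
    for episode in range(1, num_episodes + 1):
        policy_quality += rng.uniform(-1.0, 2.0)   # placeholder for one training episode
        if episode % EVAL_INTERVAL == 0:
            score = evaluate(policy_quality, rng)
            if score > best_score:                  # assumed rule: keep the best average
                best_score, best_episode = score, episode
    return best_episode, best_score


best_episode, best_score = train_with_selection(300, random.Random(0))
print(best_episode, round(best_score, 2))
```

Selecting on the averaged deterministic score, rather than on a single noisy training episode, is what separates the saved model from a lucky rollout.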
Experiments and results
Simulation setup
All experiments are conducted in a numerical simulation environment developed in MATLAB. The simulation proceeds with a fixed time step of
The quadrotor model is based on a rigid-body dynamics assumption. The moments of inertia are set to
The actor and critic networks share a similar feedforward architecture. The actor network consists of an input layer accepting the six-dimensional state vector, followed by two hidden layers with 256 and 128 neurons, respectively, using the rectified linear unit (ReLU) activation function. The output layer has three neurons corresponding to the action dimensions, followed by a hyperbolic tangent (tanh) activation function to constrain the output to the range
The hyperparameters for the RS-DDPG algorithm are detailed in Table 1. These values were selected to ensure stable and efficient learning.
Training performance analysis
To evaluate the effectiveness of our proposed RS-DDPG framework, we conducted a comparative training experiment against both a standard DDPG and a more advanced TD3 baseline. We selected TD3 as the primary advanced baseline because our RS-DDPG algorithm builds directly upon its core principles (like delayed updates), making it the most relevant algorithm for the comparison of our additional enhancements (exponential reward and curriculum). The learning curves for the 300-episode training run are presented in Figure 1.

Figure 1. Learning curves of TD3 and standard DDPG versus RS-DDPG. The plot shows the cumulative reward achieved in each episode during the training process.
As illustrated, the performance difference is stark. Both baselines failed to learn an effective policy. The standard DDPG algorithm (magenta curve) exhibited extreme instability with wildly fluctuating negative rewards, confirming its known fragility in complex control tasks. The more advanced TD3 baseline (green curve), while significantly more stable, was also unable to achieve positive rewards, and its learning stagnated at a suboptimal level. This limitation stems from its use of a traditional quadratic penalty reward function, which provides a gradient that is linear with respect to the error and approaches zero as the agent gets very close to the setpoint. This vanishing gradient leads to stalled convergence during the final, high-precision stabilization phase, as the agent lacks the granular feedback needed to eliminate small residual errors.
In sharp contrast, our proposed RS-DDPG algorithm (red curve) demonstrated a significantly superior learning trajectory. It learned rapidly, achieving a high cumulative reward of approximately 2000. While minor dips occur after episode 200 (a common artifact of continued exploration in RL), the performance remains exceptionally stable compared to the baselines, which fail to learn entirely. This success is fundamentally attributed to our exponential reward shaping strategy. Unlike the baselines, our reward function provides a persistent and informative learning signal even for very small errors, as its gradient approaches a non-zero constant when the agent is near the target setpoint. This non-vanishing gradient continuously encourages the agent to actively minimize any residual deviations, allowing it to overcome the performance plateau that constrained the TD3 baseline.
These results provide clear empirical evidence that our integrated approach is highly effective for the challenging quadrotor attitude control problem, vastly outperforming both standard DDPG and TD3. The comparison highlights that a sophisticated reward function design is as critical as algorithmic stability enhancements for achieving high-performance control.
To validate the effectiveness and necessity of each key component in our proposed RS-DDPG algorithm for improving quadrotor attitude control performance, we conducted a series of ablation studies. We separately removed three core modules from the complete RS-DDPG algorithm: delayed policy updates, exponential reward shaping, and curriculum learning. The performance of these ablated variants was then compared against the full RS-DDPG algorithm. The experimental results are shown in Figure 2, which displays the reward curves for the four algorithms over 300 training episodes.

Figure 2. Comparison of reward curves for RS-DDPG and its ablated variants.
As can be clearly observed from Figure 2, the algorithm with exponential reward shaping removed (the yellow curve) exhibits a catastrophic performance degradation. Its reward values remain consistently in the negative range throughout the entire training process, showing no signs of effective learning. This indicates that without a carefully designed reward function to guide the agent, a simple quadratic penalty is insufficient to provide a meaningful learning signal from the complex environment. Our exponential reward function, by offering smooth and dense rewards near the target state, provides a clear optimization direction for the policy gradients, fundamentally addressing the problem of sparse and inadequate learning signals. Therefore, exponential reward shaping is the most critical prerequisite for the agent to successfully learn the control task in this study.
By comparing the full RS-DDPG algorithm (the blue curve) with the variant that removes delayed policy updates (the orange curve), we can see that the latter’s performance stagnates quickly after reaching a reward value of approximately 300, failing to improve further. This fully demonstrates the necessity of the delayed policy update mechanism. In DDPG, the actor’s update depends on the critic’s Q-value estimates. If they are updated simultaneously, unstable Q-value estimates can introduce significant errors, leading to policy oscillations or even divergence. By delaying the actor’s update frequency, we provide the critic network with more time to converge to a more accurate value estimate, thus ensuring the stability and effectiveness of the policy updates. The experimental results show that this mechanism is key to overcoming performance plateaus and converging to a superior policy.
The algorithm without curriculum learning (the purple curve) eventually reaches a high reward level, but its training process exhibits extremely high volatility. As seen in the figure, its reward curve is accompanied by severe oscillations throughout the training period, especially in the early stages, which significantly increases the risk of training failure. Our introduced curriculum learning mechanism, by following a “simple-to-complex” approach, provides the agent with a relatively easier environment for exploration in the initial stages, allowing it to quickly grasp the fundamental control strategy. As training progresses, the task difficulty gradually increases, guiding the agent to steadily generalize to more complex initial states. This mechanism significantly smooths the learning curve, reduces initial exploration costs, and is a vital guarantee for ensuring a stable and efficient training process.
The results of our ablation study demonstrate the distinct role of each component within our proposed RS-DDPG algorithm. Exponential reward shaping provides the effectiveness for learning, delayed policy updates ensure the high performance ceiling, and curriculum learning secures training stability. The synergistic combination of these three components creates a robust and highly efficient DRL controller, which demonstrates superior performance in the quadrotor attitude control task compared to single-component or original DDPG algorithms. The stable, high-performance blue curve (RS-DDPG) contrasts sharply with the purple curve (no curriculum) and the orange curve (no delay). This growing separation between the curves does not indicate divergence but rather highlights the shortcomings of the ablated models and the superior, stable performance achieved only by our complete, integrated framework.
Robustness and control performance
To address the need for a comprehensive evaluation and to justify our proposed robust training framework, the following sections benchmark the final controller’s performance. We analyze its response in a deterministic case, its generalization across a wide range of random initial states, and its performance under persistent external disturbances and parametric model uncertainties. This multi-faceted analysis, combined with the baseline comparisons, provides a thorough validation of the RS-DDPG framework.
Performance without disturbances and uncertainties
This section analyzes the performance of the designed RS-DDPG controller under a specific, deterministic scenario with an initial state of

Figure 3. RS-DDPG control performance of rotation angle.

Figure 4. RS-DDPG control performance of angular velocities.

Figure 5. RS-DDPG control performance of rotation control input.
The controller effectively drives all attitude angles and angular velocities to their desired zero values. The settling time (time to converge and stay within 2% of the setpoint) is approximately 1.0 second for the attitude angles and angular velocities. This demonstrates a highly responsive control action. While the large initial deviations in pitch and yaw angles result in significant transient responses and a temporary overshoot, the strong damping ensures rapid decay, preventing sustained oscillations. After convergence, all state variables remain stable and close to the desired values, confirming excellent steady-state control performance.
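For reference, a settling time of this kind can be computed from a logged trajectory as sketched below. Because the setpoint is zero, the tolerance band is taken here as 2% of the initial deviation; that choice, the toy damped response, and the function name are assumptions for illustration and are not taken from the paper's data.

```python
import math


def settling_time(times, values, band):
    """Return the first time after which |value| stays within `band` until the end."""
    settled_at = None
    for t, v in zip(times, values):
        if abs(v) <= band:
            if settled_at is None:
                settled_at = t      # candidate settling instant
        else:
            settled_at = None       # left the band, so restart the search
    return settled_at


dt = 0.01
times = [i * dt for i in range(500)]
# Toy damped oscillation standing in for an attitude-angle trajectory.
values = [0.5 * math.exp(-5.0 * t) * math.cos(8.0 * t) for t in times]
band = 0.02 * abs(values[0])        # assumed: 2% of the initial deviation
print(settling_time(times, values, band))
```

The function returns `None` if the trajectory never stays inside the band, which is a convenient failure flag for batch evaluations.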
The controller generates precise and efficient torque commands to execute the stabilization task. To correct the large initial errors, the controller commands significant initial torques, particularly for the yaw channel. These control inputs, however, quickly decay to near-zero values within 1.5 seconds, indicating that the controller requires minimal effort to maintain the stable state. This rapid decay highlights its energy efficiency. Furthermore, the smooth, non-oscillatory nature of the control input curves is beneficial for practical hardware implementation, as it reduces wear and tear on motors.
In conclusion, the presented results confirm the controller’s excellent performance. It successfully achieves rapid and stable convergence from a significant initial disturbance, which is a key indicator of its robustness. The controller’s ability to handle challenging initial conditions and produce efficient, smooth control signals validates the superiority demonstrated in the ablation study and confirms the practical applicability of the proposed RS-DDPG algorithm.
Performance with random initial states
To assess the robustness of the RS-DDPG controller, we conducted 100 Monte Carlo simulations with initial states sampled from wide uniform distributions: attitude angles from
As shown in Figure 6, the controller demonstrates exceptional and highly consistent performance. Crucially, all 100 trajectories for each attitude angle successfully converge to the zero setpoint, achieving a 100% success rate with no failures or divergent behaviors. Most trajectories settle within 2–3 seconds, and the transient responses remain well-bounded with smoothly decaying overshoots, even from severe initial conditions.

Figure 6. Attitude control performance under random initial conditions from Monte Carlo simulations. Each subplot shows 100 trajectories: (a) roll angle (ϕ) convergence, (b) pitch angle (θ) convergence, and (c) yaw angle (ψ) convergence.
This flawless performance across a wide operational envelope provides compelling evidence of the controller’s robustness and generalization. It validates that our proposed enhancements have effectively addressed the instability issues commonly associated with standard DDPG.
Performance with disturbances
Figures 7 and 8 show the controller’s performance under continuous random disturbances applied to the control inputs. A comparison with the previous results (without disturbances) reveals the controller’s robust behavior in a more realistic environment. The controller’s initial response to the large attitude and velocity errors remains highly effective, despite the continuous random disturbance. The convergence time for both attitude angles and angular velocities is nearly identical to the non-disturbed case. This demonstrates that the controller’s transient response is robust and not significantly degraded by external interference during initial stabilization.

Figure 7. RS-DDPG control performance of rotation angle with disturbances.

Figure 8. RS-DDPG control performance of angular velocities with disturbances.
The steady-state phase highlights the most important difference. Unlike the smooth curves in the non-disturbed case, the curves now exhibit small-amplitude, high-frequency fluctuations around the desired zero line. This jitter is a direct result of the controller continuously generating small, rapid corrections to counteract the random external torques. The confinement of these fluctuations to a narrow range around the setpoint demonstrates the controller’s effective disturbance rejection capability.
The comparative simulation results provide strong evidence of the RS-DDPG algorithm’s superiority. While it demonstrates fast and efficient control in an ideal environment, its true value lies in its ability to maintain system stability and suppress disturbances in a simulated real-world scenario. The controller successfully keeps the system close to the desired state, validating its robustness and practical applicability.
To further evaluate the controller’s robustness against persistent structured interference, a specific scenario was conducted where a periodic disturbance signal with an amplitude of 1 is introduced into the control action starting from

Figure 9. RS-DDPG control performance of rotation angle with periodic disturbances introduced at
Performance with inertial uncertainties
Figures 10 and 11 illustrate the performance of the pre-trained RS-DDPG controller when the quadrotor’s moment of inertia parameters are significantly perturbed, representing a test of the system’s robustness to unmodeled dynamics. Compared to the ideal case, the controller’s convergence is slightly slower, with attitude angles requiring approximately 2–2.5 seconds to settle. While the overall control trajectory is similar, a key difference lies in the presence of noticeable transient oscillations, particularly in the yaw angle (ψ) and its velocity (r) during the convergence phase. Such oscillations are typical of a controller operating on dynamics that differ from those it was trained on.

Figure 10. RS-DDPG control performance of rotation angle with inertial uncertainties.

Figure 11. RS-DDPG control performance of angular velocities with inertial uncertainties.
These transient oscillations contrast with the steady-state fluctuations observed under continuous random disturbances. The oscillations here reflect the closed-loop response of the fixed, pre-trained policy to the changed system dynamics as it drives the system to a stable solution, rather than a continual counteraction of external torques. After this brief transient, the system converges to and maintains a smooth steady state, demonstrating that the controller tolerates the inertial model uncertainty.
In conclusion, these results provide strong evidence of the RS-DDPG algorithm’s remarkable robustness to system uncertainty. Despite the significant inertial perturbation, the pre-trained, data-driven policy successfully generalizes and adapts to the new dynamics. This is a critical advantage over traditional model-based control methods that would likely fail with such a large model mismatch. The ability to achieve stable control, even with compromised performance during a brief transient period, validates the practical applicability of the proposed RS-DDPG algorithm in real-world scenarios.
Conclusion and future work
Conclusion
Standard DDPG algorithms notoriously suffer from training instability, which has critically hindered their application to complex, safety-critical tasks such as quadrotor attitude control. This paper introduced RS-DDPG, a holistic framework designed specifically to overcome these limitations by fundamentally enhancing training stability and controller robustness. Our approach integrates crucial algorithmic modifications—including delayed policy updates, exponential reward shaping, and gradient clipping—with a structured training methodology that leverages curriculum learning and a rigorous evaluation protocol.
Extensive simulations and ablation studies provide evidence of our framework’s efficacy. The resulting controller not only demonstrates superior convergence and performance over the standard DDPG and TD3 baselines but also exhibits strong robustness against a wide range of initial states, persistent external disturbances, and significant model uncertainties. This work represents a vital step toward developing reliable, high-performance, data-driven controllers that can bridge the gap between simulation and real-world deployment in aerial robotics. While the specific hyperparameters are tailored to the quadrotor attitude control task, the methodology (integrating stability techniques, task-specific reward shaping, and a curriculum-based robust evaluation framework) provides a generalizable template for applying DRL to other complex, safety-critical control problems.
Future work
Building on this robust foundation, our immediate efforts will focus on bridging the sim-to-real gap by deploying the trained controller on a physical quadrotor to validate its performance against real-world challenges like sensor noise and actuator dynamics. Subsequently, we plan to expand the scope of our framework to tackle the more complex problem of full 6-degree-of-freedom (6-DOF) control for complete trajectory tracking. Further research will also explore integrating our robust training methodology with more sample-efficient algorithms, such as SAC, and investigating online adaptation methods. This will enable the controller to handle unmodeled dynamics like payload changes or component failures, thereby enhancing its autonomy and reliability for practical applications.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded in part by the Guizhou Provincial Basic Research Program (Natural Science), under Grant No. ZK[2023]yiban139.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
The authors confirm that the data supporting the findings of this study are available within the article.
