Abstract
Designing the auxiliary signal for active fault diagnosis is challenging because control performance must be preserved and the dynamic characteristics of nonlinear systems are uncertain. This paper presents a novel data-driven approach to auxiliary signal design for the active fault diagnosis of nonlinear systems that guarantees control performance. Specifically, we introduce a double actor-critic network to generate the tracking and diagnostic signals, respectively. A two-objective optimization method based on deep reinforcement learning is then proposed to address the tradeoff between tracking performance and fault diagnosis. Finally, the effectiveness of the method is verified on a cart-pole system with stochastic noise.
Keywords
Introduction
In recent years, fault diagnosis has attracted significant attention and found widespread application in domains such as cloud computing systems, power electronics, and chemical processes (Lee et al., 2022; Li et al., 2020; Liu et al., 2024; Taqvi et al., 2021; Zhao et al., 2024, 2025). Fault diagnosis can be categorized into passive and active diagnosis. Early research primarily focused on passive fault diagnosis (PFD), in which faults are diagnosed by analyzing the passively measured states or outputs of the system. Because of this passive nature, valuable diagnostic information may be mistaken for system uncertainty and overlooked, and some faults may be masked by feedback controllers, which reduces diagnostic accuracy.
To address this limitation, active fault diagnosis (AFD) designs an auxiliary signal that elicits sufficient fault information within a given timeframe while complying with system constraints. The diagnostic algorithms used in AFD and PFD differ little and have been thoroughly surveyed (Blanke et al., 2006; Chiang et al., 2006; Venkatasubramanian et al., 2003a; 2003b; 2003c). Auxiliary signal design is therefore the key research problem in AFD.
Generally speaking, there are two main research directions in auxiliary signal design: the probabilistic method and the set-based method (Heirung and Mesbah, 2019). The main idea of the probabilistic method is to design an auxiliary signal that maximizes the distance between the probability distributions associated with the individual faults and minimizes the Bayes risk of misdiagnosis (Hellman and Raviv, 1970; Tawarmalani and Sahinidis, 2002; Zhu et al., 2024). For discrete-time stochastic systems, Guo and He (2024) propose a new upper bound on the misdiagnosis probability for incipient actuator faults, which improves diagnosis performance.
The set-based method describes different faults by sets. To achieve fault diagnosis, it designs an input such that the sets associated with the different system models do not intersect (Andjelkovic and Campbell, 2011; Nikoukhah, 1998; Scott and Campbell, 2014; Tabatabaeipour, 2015). The AFD method introduced by Xu (2021) addresses a two-level optimization problem by designing a single input at each time step. Xu (2024) proposes a newly defined excluding degree of the origin from all healthy and faulty residual zonotopes, which has a more explicit physical meaning.
While the aforementioned studies have made notable progress, these methods are difficult to apply to nonlinear systems because of computational challenges and the complexity of auxiliary signal design. In the past decade, data-driven methods have gained widespread popularity and proven effective in addressing various nonlinear problems (Song et al., 2023). One of these methods is deep reinforcement learning (DRL).
In essence, DRL can be viewed as an optimization algorithm: the control objective is transformed into an objective function, and the optimal solution of this function is the control signal (Silver et al., 2014; Van Hasselt et al., 2016; Wang et al., 2016; Yan and Xu, 2019). DRL has demonstrated effectiveness across various domains (Wang et al., 2024). For example, some researchers have applied DRL to nonlinear systems such as fuzzy systems and switched systems and achieved prescribed-time convergence (Zhang et al., 2024a; 2024b; Zhang and Xiang, 2025). In recent years, several studies (Li et al., 2023; Wang et al., 2023b) have integrated DRL into fault diagnosis with promising results. However, these studies focus primarily on fault diagnosis and ignore the impact of input signals on system performance, mainly because research on fault diagnosis with DRL is not yet mature. Moreover, the conflict between fault diagnosis and control performance makes it difficult to design reward functions within the DRL framework. Yan et al. (2023) introduce an actor-critic (AC) framework, a well-established method in DRL. They transform the tracking issue into a constrained optimization problem and use the Trust Region Policy Optimization (TRPO) method to address auxiliary signal design for AFD. The method is evaluated in two case studies, which demonstrate its application in practical scenarios.
Building on the work above, we propose a novel AC framework that generates the tracking and diagnostic signals separately. Furthermore, to handle the tradeoff between tracking performance and fault diagnosis, we present a two-objective optimization method based on deep reinforcement learning.
The approach adopts a dual AC architecture. One AC architecture is responsible for fault diagnosis: its critic network assesses the accuracy of fault diagnosis, and its actor network generates the fault diagnosis inputs. The other AC architecture optimizes tracking performance: its critic network evaluates the control performance of the system, and its actor network outputs the control inputs.
The balance between control and diagnostic performance is reflected in the design of the reward function. A tradeoff parameter is introduced to quantify this balance and can be flexibly adjusted to satisfy different requirements. In the study by Jia et al. (2023), an optimal trajectory is generated to diagnose faults, and a controller is designed to track this trajectory; the design of that controller depends on the system dynamics. In contrast, the method proposed in this paper adopts a model-free strategy in which the controller design relies not on the system dynamics but on a large quantity of data to balance the control and diagnosis requirements. Similarly, in the study by Wang et al. (2023a), Q-learning serves as an aid to control algorithms rather than as a completely model-free strategy. Adopting a model-free method avoids complex mathematical derivation and makes the design more intuitive and comprehensible. For computational efficiency, we employ Proximal Policy Optimization (PPO) as the policy learning method, which simplifies the optimization problem encountered in the TRPO algorithm. This choice not only streamlines the computation but also promotes more stable convergence because the size of each policy update is limited.
In general, the main contributions are as follows:
An Active Fault Diagnosis Tracking (AFDT) network is introduced in this paper. The network outputs the auxiliary signal for AFD while ensuring the control performance of the nonlinear system. The nonlinear system may be unknown, so the method requires little prior knowledge.
In contrast to the auxiliary signal designs of Li et al. (2023) and Wang et al. (2023b), the auxiliary signal design in this paper guarantees control performance while fault diagnosis is being conducted.
Compared with Yan et al. (2023), the compromise between diagnostic accuracy and control performance is resolved by a deep reinforcement learning method. Furthermore, to reduce computational cost, PPO is used in place of TRPO while maintaining algorithmic stability.
The rest of this paper is organized as follows. Section 2 introduces the system model and background on RL. Section 3 presents the design of the AFD network and the main problem of AFD with control constraints. Section 4 presents a simulation on the cart-pole system to show the effect of the AFDT network. Section 5 concludes the paper.
Problem statements
System modeling and problem formulation
Assume there exists a set of discrete-time nonlinear stochastic systems, described as follows (Yan et al., 2023):
where i indicates the
With the assumption above, the fault matrix
where
For auxiliary control signal design, a reference system is defined to provide a reference:
where
We assume
It can be denoted as:
where β is the threshold of tracking error.
Reinforcement learning preliminaries
Assume that an agent follows a Markov decision process (MDP), which consists of the tuple
γ is a discounted parameter,
The process that the agent changes from state

Figure 1. The RL process.
After several transitions, the agent reaches a terminal state at instant T (whether or not the target has been accomplished), generating a trajectory described as
Generally speaking, the reward
Similarly, the action value function is defined, which represents the value of action
On the basis of
which denotes the advantage of action
By sampling the target policy
where α is the learning rate, which regulates the pace of gradient ascent. In actual training, the selection of α is a challenging task. If α is too large, excessive variation will occur during parameter updating, which easily makes the convergence curve unstable. Conversely, a very small α will lead to slow convergence and increase calculation costs.
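The sensitivity to α can be seen on a toy problem. The following is a minimal sketch, not the policy gradient used in this paper: the quadratic objective J(θ) = −(θ − 3)², the starting point, and the three step sizes are illustrative assumptions. It runs plain gradient ascent θ ← θ + α∇J(θ) and compares a very small, a moderate, and an overly large learning rate.

```python
def gradient_ascent(alpha, theta0=0.0, steps=50):
    """Plain gradient ascent on the toy objective J(theta) = -(theta - 3)^2."""
    theta = theta0
    for _ in range(steps):
        grad = -2.0 * (theta - 3.0)   # dJ/dtheta
        theta = theta + alpha * grad  # ascent step scaled by the learning rate
    return theta


small = gradient_ascent(alpha=0.001)   # still far from the maximiser after 50 steps
moderate = gradient_ascent(alpha=0.1)  # approaches the maximiser theta = 3
large = gradient_ascent(alpha=1.1)     # overshoots further at every step and diverges
```

With the large step size, each update multiplies the distance to the maximiser by a factor of magnitude greater than one, which is the unstable behaviour described above.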
The TRPO method was proposed to solve this problem by imposing a constraint on the Kullback–Leibler (KL) divergence between the new policy and the old policy. Because the two policies are then close to each other, importance sampling can be applied with low variance.
In equation (11),
The gradient ascent problem is then transformed into an optimization problem by TRPO (Schulman et al., 2015):
where δ is a threshold to limit the KL divergence.
Although TRPO reduces the dependence of the policy gradient (PG) method on the learning rate, it is computationally expensive. The PPO method was therefore proposed to simplify it by integrating the constraint on the KL divergence into the optimization function. The target policy is as follows (Schulman et al., 2017):
where
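As a rough illustration of the PPO idea, the sketch below implements the widely used clipped surrogate objective of Schulman et al. (2017), in which the probability ratio between the new and old policies is clipped to [1 − ϵ, 1 + ϵ]. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the batch layout (lists of log-probabilities and advantages), and the default ϵ = 0.2 are hypothetical, and the objective used in this paper may differ in detail.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate over a batch of sampled (state, action) pairs."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)               # pi_new(a|s) / pi_old(a|s)
        clipped = max(1.0 - eps, min(1.0 + eps, ratio))  # keep the ratio near 1
        total += min(ratio * adv, clipped * adv)         # pessimistic (lower) bound
    return total / len(advantages)


# One sample with positive advantage whose probability ratio has grown to 1.5:
# the objective is capped at (1 + eps) * advantage, so pushing the policy
# further away from the old one earns nothing extra.
capped = ppo_clip_objective([math.log(1.5)], [0.0], [2.0])
```

Taking the minimum of the clipped and unclipped terms is what removes the incentive for large policy changes, which plays the role of the KL constraint in TRPO at a much lower computational cost.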
Main results
In this section, a DRL-based AFD method with guaranteed control performance for nonlinear systems is presented. A double actor-critic network structure is introduced to generate a control signal and an AFD signal. A fitting network integrates these two signals to obtain the input of the detected system. An overview of the whole network structure is shown in Figure 2.

Figure 2. AFD with tracking performance guaranteed.
Network structure
According to the problem statement, the input can be divided into two parts. One is to ensure the performance of controlling, which is realized by tracking the reference signal, thus we assume it to be
Tracker network
The tracker network has an AC structure: the actor network outputs the tracking signal for the detected system, and the critic network outputs the value of state
In the critic network, a reward function is involved to measure the tracking effectiveness of the detected system. The reward function can be written as:
where
The term
By calculating the total reward
where
In the actor network, we employ a continuous stochastic policy based on the beta distribution instead of a discrete policy to ensure control accuracy. The outputs of the actor network are the two parameters of the beta distribution (
Based on the loss function in equation (16) and the PPO method in equation (13), let
In many cases, the function
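To make the beta-distribution policy concrete, the sketch below shows one common way to turn two actor outputs into a bounded continuous action. It is an illustration under stated assumptions rather than the authors' network: the function names, the softplus-plus-one mapping that keeps both shape parameters above 1, and the action range of ±10 are all hypothetical. A beta sample lies in [0, 1], so it is rescaled linearly to the actuator range.

```python
import math
import random


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def sample_beta_action(raw_a, raw_b, low, high, rng=random):
    """Map two unconstrained actor outputs to a bounded continuous action."""
    a = 1.0 + softplus(raw_a)   # shape parameters kept above 1 (unimodal density)
    b = 1.0 + softplus(raw_b)
    u = rng.betavariate(a, b)   # sample in [0, 1]
    return low + (high - low) * u


rng = random.Random(0)
actions = [sample_beta_action(0.3, -0.2, low=-10.0, high=10.0, rng=rng)
           for _ in range(1000)]
```

Because the beta distribution has bounded support, every sampled action respects the input limits without any additional clipping, which is one reason it is often preferred over a Gaussian policy for constrained actuators.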
AFD network
The AFD network has the same structure as the tracker network, but its reward is associated with fault diagnosis. Considering the problem stated in equation (4) and using a linear map similar to that in equation (15), we obtain the following formulation:
where
It should be pointed out that the reward in equation (18) has a serious flaw: it has no diagnostic specificity for each individual fault. For example, a designed input that diagnoses two faults exceptionally well but diagnoses the others poorly can still produce a large reward, which is not the desired behavior. To avoid this situation and achieve better fault diagnosis, the maximum mean discrepancy (MMD) method is introduced.
MMD is a measure of the similarity of two distributions. Assume there exist two sets of observed samples
where f is a map function from the function space
By using the trick of the kernel function, MMD can be calculated without knowing the function f (both the expected form and the empirical estimate form are given):
where
where
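For concreteness, the kernel-based empirical estimate can be sketched as follows. The Gaussian kernel, its bandwidth sigma, and the synthetic one-dimensional samples are assumptions made for illustration; the kernel actually used in this paper is not specified here. The function computes the standard biased estimate of the squared MMD: the mean kernel value within each sample set, minus twice the mean kernel value across the two sets.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for equal-length vectors."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased empirical estimate of the squared MMD between two sample sets."""
    k_xx = sum(gaussian_kernel(a, b, sigma) for a in xs for b in xs) / len(xs) ** 2
    k_yy = sum(gaussian_kernel(a, b, sigma) for a in ys for b in ys) / len(ys) ** 2
    k_xy = sum(gaussian_kernel(a, b, sigma) for a in xs for b in ys) / (len(xs) * len(ys))
    return k_xx + k_yy - 2.0 * k_xy


rng = random.Random(0)
same_a = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
same_b = [[rng.gauss(0.0, 1.0)] for _ in range(200)]
shifted = [[rng.gauss(2.0, 1.0)] for _ in range(200)]

mmd_same = mmd_squared(same_a, same_b)      # two samples from one distribution
mmd_shifted = mmd_squared(same_a, shifted)  # samples from different distributions
```

The estimate is close to zero when both sample sets come from the same distribution and grows as the distributions move apart, which is the property the diagnosis step relies on.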
Under the consideration of MMD, the input of the AFD network is derived as:
where
where
Next, the algorithm of the AFD network with MMD is introduced.
First, set all flags l to TRUE and initialize the systems’ parameters. Then, after some time, calculate the MMD between the detected system and each fault system. Next, find the fault system with the largest MMD and exclude this fault type. Finally, repeat this process for the remaining fault systems until only one flag remains TRUE. The fault type corresponding to this remaining flag is the diagnosis result. The complete procedure is presented in Algorithm 1.
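The exclusion procedure described above can be sketched in a few lines. This is a hedged illustration of the loop structure only: the function name, the callback `mmd_to_detected`, and the numerical MMD values in the example are hypothetical and are not taken from the paper's experiments.

```python
def isolate_fault(fault_ids, mmd_to_detected):
    """Exclude, round by round, the candidate fault model that matches worst.

    mmd_to_detected(fault_id, round_index) is a hypothetical callback that
    returns the MMD between the detected system's outputs and the outputs of
    the given fault model, computed on the data collected up to that round.
    """
    flags = {fault_id: True for fault_id in fault_ids}   # all candidates active
    round_index = 0
    while sum(flags.values()) > 1:
        active = [f for f in fault_ids if flags[f]]
        scores = {f: mmd_to_detected(f, round_index) for f in active}
        worst = max(scores, key=scores.get)   # largest MMD = least similar
        flags[worst] = False                  # exclude this fault type
        round_index += 1
    return next(f for f in fault_ids if flags[f])


# Illustrative numbers only: two rounds of MMD values for three fault models.
table = {0: {1: 0.20, 2: 0.90, 3: 0.05}, 1: {1: 0.60, 3: 0.04}}
diagnosed = isolate_fault([1, 2, 3], lambda f, r: table[r][f])
```

In this made-up example, fault model 2 is excluded in the first round and fault model 1 in the second, so the remaining flag identifies fault model 3. With N candidate faults the loop needs N − 1 MMD evaluations rounds.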
Whole algorithm
The outputs of the AFD network and tracker network are loaded into the fitting network. In order to improve training efficiency, the AFD network and tracker network are trained offline.
During the training of the whole network, the parameters of the AFD and tracker networks are not updated. Only the parameters of the fitting network are updated, using the policy gradient ascent method shown in equation (10). Since
In equation (25),
Considering both tracking and diagnostic performance, the reward function of the fitting network is designed as:
where
The training of these three networks is conducted offline. The trained networks are loaded onto the upper computer when fault diagnosis is needed. Therefore, the online computational cost is determined by the complexity of the networks. Before calculating the computational cost, some preliminaries should be stated:
Generally speaking, the calculation of the neural network can be regarded as:
where y represents the output,
In equation (27), there are:
After sorting, the neural network in equation (27) performs
Expand the Taylor series to the second term,
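As a rough aid to the cost analysis, the sketch below counts the multiplications and additions needed by the affine part of each fully connected layer. It is a simplified illustration, not the accounting used in this paper: the cost of the squashing functions is deliberately left out, the function names are hypothetical, and in the example only the hidden sizes 64, 64, and 32 follow the critic described in the training setup, while the input size of 4 and the scalar output are assumptions.

```python
def dense_layer_ops(n_in, n_out):
    """Multiplications and additions for y = W x + b with W of shape (n_out, n_in)."""
    multiplications = n_in * n_out
    additions = (n_in - 1) * n_out + n_out   # row sums plus the bias term
    return multiplications, additions


def mlp_ops(layer_sizes):
    """Total affine-map operations for a fully connected network."""
    mults = adds = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        m, a = dense_layer_ops(n_in, n_out)
        mults += m
        adds += a
    return mults, adds


# Hypothetical example: 4 inputs, hidden layers of 64, 64 and 32 units, 1 output.
critic_mults, critic_adds = mlp_ops([4, 64, 64, 32, 1])
```

For these sizes the affine maps need 6432 multiplications and 6432 additions per forward pass, which indicates that the online cost of a network of this scale is small.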
Simulation
Training setup
The AFD and tracking networks share the same AC network structure. The critic network is made up of three hidden layers with 64, 64, and 32 hidden units, respectively. The squashing functions among these layers are
(
(
The actor network needs:
[
[
To sum up, the AFD and tracking networks each take (
The fitting network has the same structure as the actor network, but its shared hidden layer has 256 hidden units, and its input dimension is twice the output dimension of the actor network. The two separate sets of hidden layers each consist of three hidden layers with 256, 128, and 64 hidden units, respectively. The squashing functions among them are
During the offline training, the AFD, tracker, and fitting networks are trained separately. For linear map parameters,
The other main training parameters are listed in Table 1, where α is the learning rate, γ is the discount factor of the value function, ϵ is the clipping factor in PPO, and λ is the weight parameter in the fitting network reward function. The terminating step is the maximum number of transitions in each trajectory.
Parameter γ determines how far into the future the influence of an action is counted: the closer γ is to 1, the greater the weight placed on future states.
Parameter ϵ is used to limit the update step, which avoids the influence of large gradient updates on the stability of the training process. Commonly, it is chosen as 0.2.
Parameter λ is to balance the performance of tracking and AFD. It is suggested that
Table 1. Training parameters.
Training case
The training case for this method is a cart-pole system; the details of the modeling can be found in the study by Barto et al. (1983). Assume a cart of mass M moves on a smooth horizontal surface. A pole of length
Let
During the training process, there are two criteria that determine the termination of a trajectory. First, if the angle θ or the displacement x exceeds a predefined threshold, it signifies a termination state. In this case, it indicates that the controller has failed to achieve the desired control target. Second, if the cart-pole system reaches the terminating steps, it is considered that the control target has been successfully achieved. These criteria serve as the indicators for determining the end of a trajectory during training.
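The two termination criteria can be written as a small helper. This is a minimal sketch: the function name and the numerical thresholds in the usage lines (0.21 rad for the angle and 2.4 for the displacement) are illustrative assumptions, not the values used in this paper, while the terminating step of 1000 matches the trajectory length discussed in the simulation results.

```python
def trajectory_status(theta, x, step, theta_limit, x_limit, max_steps):
    """Classify the current step of a cart-pole trajectory.

    Returns "failed" if the angle or displacement leaves its allowed range,
    "succeeded" if the terminating step is reached, and "running" otherwise.
    """
    if abs(theta) > theta_limit or abs(x) > x_limit:
        return "failed"        # the controller did not achieve the control target
    if step >= max_steps:
        return "succeeded"     # the trajectory survived until the terminating step
    return "running"


# Hypothetical thresholds, for illustration only.
limits = dict(theta_limit=0.21, x_limit=2.4, max_steps=1000)
status_mid = trajectory_status(theta=0.02, x=0.5, step=300, **limits)
status_fall = trajectory_status(theta=0.30, x=0.5, step=300, **limits)
status_end = trajectory_status(theta=0.02, x=0.5, step=1000, **limits)
```

The failure check is evaluated first, so a state that violates a threshold on the last step is still counted as a failure.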
In this paper, we assume that
The parameters
The reference system has the same model as in equation (30) and is stabilized by a PID controller with the controller parameters being
Simulation and discussions
In this paper, the tracker network is trained for 3000 trajectories and the AFD network for 5000 trajectories. The learning curves of the tracker network and the AFD network are given in Figure 3 to show the convergence of these two networks.

Figure 3. Rewards of AFD network and tracker network: (a) AFD reward. (b) Tracker reward.
According to the reward defined in equation (15), the maximum total reward is 1000, with each step contributing a reward of 1 over a trajectory of 1000 steps. The observed convergence value is approximately 940. While the maximum total reward represents an ideal outcome, minor deviations are unavoidable. Therefore, we can conclude that the network has successfully completed the tracking task. In addition, a test was conducted to evaluate the performance of the tracker network. Figure 4 illustrates the results, where the actual states of two systems and the corresponding errors between them are plotted on a graph with dual axes. It can be observed that the vertical angle error remains consistently below 0.1 rad, approaching zero. Furthermore, the displacement of the cart remains within the defined threshold over 1000 steps, indicating that the tracking network effectively achieves the desired controlling objective.

Figure 4. Tracking performance of tracker network: (a) Angle tracking effect, (b) Angular velocity tracking effect, (c) Displacement tracking effect and (d) Velocity tracking effect.
For the AFD network, since no upper limit is imposed in the reward setting, it is difficult to determine the maximum value of the total reward. However, the learning curve in the figure eventually converges. Because the main purpose of the AFD network is to maximize the separation between the outputs of the various faulty systems, its actual capability can be assessed by testing. Figure 5 shows the fault separation performance of the AFD network: the network separates each fault, which largely accomplishes its task. Admittedly, the AFD network is trained separately and its only target is to isolate different faults, so the tracking performance of the system is not considered, and the system may exceed its output limit before running 100 steps. This issue is addressed later by the fitting network.

Figure 5. The test result of AFD network (with fault model 1): (a) AFD State. (b) AFD MMD.
When training the fitting network, a fault occurs at step 0, and the MMD calculation is then performed at the 50th and 100th steps. The fault diagnosis is completed at step 100, leaving 400 steps to test the control performance of the AFDT network. Figure 6 shows the learning curve during training (the AFDT network is trained for 3000 trajectories); the final total reward converges to a stable value, indicating that the policy has converged. Figure 7 shows the fault diagnosis and control performance of the AFDT network. In Figure 7(a), the MMD is calculated at step 50 and fault system 2 is excluded. Similarly, fault system 1 is excluded at step 100, so the AFDT network identifies the fault type as model 3. The tracking error is presented in Figure 7(b): it stays within 0.1 rad over the 500 steps, so control capability is maintained.

Figure 6. The learning curve of AFDT network: (a) AFDT reward. (b) AFDT step.

Figure 7. The result of AFDT (with fault model 3): (a) AFDT MMD. (b) AFDT track error.
To illustrate the diagnostic capability of this network for the remaining fault types, we tested the other two fault types with random initial conditions; the results are presented in Figure 8. At the 100th step, the fault model was identified. It must be acknowledged that diagnosing faults while maintaining control performance under fault model 1 is particularly challenging, because the controller’s performance is severely limited, which impedes its ability to achieve the diagnostic and control objectives concurrently. Nevertheless, the tracking error remains within 0.1 rad, indicating that some degree of control performance is still maintained. These results suggest that the AFDT network can effectively identify fault types while ensuring a certain level of control performance.

Figure 8. The other result of AFDT: (a) AFDT MMD (With fault model 1), (b) AFDT track error (With fault model 1), (c) AFDT MMD (With fault model 2) and (d) AFDT track error (With fault model 2).
Finally, to highlight the advantages of the AFDT method, a comparative experiment was conducted against the method of Yan et al. (2023). The initial states for both methods were limited to the range [–0.05, 0.05] (rad), with a simulation duration of 500 steps, each lasting 0.005 s. The evaluation examines fault diagnosis and tracking performance under the three fault models. The simulation results are depicted in Figures 9 and 10.

Figure 9. The fault diagnosis performance of AFDT and the compared method: (a) AFDT MMD (With fault model 1), (b) Compared method MMD (With fault model 1), (c) AFDT MMD (With fault model 2), (d) Compared method MMD (With fault model 2), (e) AFDT MMD (With fault model 3) and (f) Compared method MMD (With fault model 3).

Figure 10. The tracking performance of AFDT and the compared method: (a) Track error of AFDT and the compared method (With fault model 1), (b) Track error of AFDT and the compared method (With fault model 2) and (c) Track error of AFDT and the compared method (With fault model 3).
Figure 9 illustrates that both methods complete fault diagnosis within 100 steps without misdiagnosis. The only notable difference between the two methods is the fluctuation of the MMD value, which does not affect the diagnosis outcome. The curve in Figure 9(b) terminates early because the vertical angle exceeds the predefined threshold.
In Figure 10, AFDT clearly exhibits better tracking performance. Compared with the method of Yan et al. (2023), the AFDT approach shows smaller error fluctuations under fault models 2 and 3. Despite the severe actuator constraints in fault model 1, the AFDT method maintains stability, whereas the compared method loses control after 175 steps.
The primary strength of AFDT lies in its ability to flexibly adjust the weight between tracking and fault diagnosis performance. In this simulation case, the fault models cover a wide range of actuator faults; in particular, fault model 1 represents a severe attenuation of the control signal. Compared with the approach of Yan et al. (2023), AFDT demonstrates superior tracking performance while maintaining similar fault diagnosis performance.
Conclusion and future work
In this article, a data-driven approach to auxiliary signal design for the AFD of nonlinear systems with guaranteed control performance is proposed. To facilitate the design, two AC networks are established to achieve the fault diagnosis and control targets, respectively. Then, to balance the conflict between the two optimization objectives, a deep reinforcement learning method is proposed to generate an integrated output. With this output, fault diagnosis of the system can be realized while a certain level of control performance is ensured. Finally, a simulation on the cart-pole system was conducted to verify the effectiveness of the AFDT network.
It is important to note that the proposed method requires a reference control signal to ensure effective control performance, which may be challenging to implement for certain systems. In addition, the current approach is constrained by a limited number of fault types. In future work, we aim to extend the method to accommodate a broader range of fault types and to explore strategies for maintaining satisfactory control performance without relying on a reference signal.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (grant no. 62273116).
Data availability statement
Data sharing not applicable to this article as no data sets were generated or analyzed during this study.
