Fault diagnosis of RV reducer with dual-branch fusion of enhanced GAN and multimodal data

Abstract

To address the challenges posed by the limited generalization of diagnostic models due to data sparsity in Rotate Vector (RV) reducers, the lack of targeted multimodal fusion strategies, and complex operating conditions, this paper proposes a novel fault diagnosis method based on an enhanced generative adversarial network (GAN) and dual-branch multimodal data fusion. First, an RV reducer fault test bench was established to acquire vibration and current signals. Subsequently, a Wasserstein GAN-residual and attention guidance network, incorporating double-layer residual connections and a multihead self-attention mechanism, was employed for multimodal data augmentation. This approach significantly improves the quality and diversity of generated samples, effectively mitigating data scarcity while ensuring stable training via Wasserstein distance and gradient penalty techniques. The augmented data were then transformed into time-frequency representations using the short-time Fourier transform. Finally, leveraging the global representativeness of vibration signals and the sensitivity of current signals to localized disturbances, a Self-Calibrated Convolution and Vision Transformer Fusion Network for Dual-Modality Classification (SCViT) dual-branch model was developed. This model achieves comprehensive fault diagnosis through multimodal feature fusion. Experimental results demonstrate that the proposed method exhibits superior diagnostic performance under three operating conditions, with diagnostic accuracies of 97.40%, 97.83%, and 96.54%, respectively. Compared with single-modality diagnosis, the method achieves an average improvement of 3.8 percentage points in diagnostic accuracy. The proposed method maintains high stability and accuracy under loaded conditions and reciprocating motions, providing novel insights for the intelligent maintenance of RV reducers.

Keywords

RV reducer, WGAN-RAG, SCViT, multimodal data fusion, fault diagnosis

Introduction

As a fundamental transmission component in high-precision machinery such as industrial robots and Computer Numerical Control (CNC) machine tools, the Rotate Vector (RV) reducer occupies a pivotal position in intelligent manufacturing owing to its superior transmission efficiency, prolonged service life, and excellent dynamic performance.¹⁻⁶ Nonetheless, under complex operating conditions, RV reducers subjected to sustained high loads are susceptible to critical failures such as cycloidal gear wear and planetary gear train malfunctions,⁷⁻⁹ thereby compromising equipment safety and production efficiency. With the ongoing advancement of intelligent manufacturing and Industry 4.0, the necessity for efficient and intelligent predictive maintenance and health monitoring technologies for RV reducers has become increasingly imperative.¹⁰,¹¹

Data-driven fault diagnosis methods automatically extract fault features through the analysis of extensive operational data, thereby overcoming the inherent limitations of traditional model-based approaches and exhibiting substantial potential in the health monitoring of industrial equipment.¹²,¹³ However, in practical industrial applications, fault diagnosis for RV reducers encounters three principal challenges: limited small-sample data resulting in constrained model generalization capabilities, inefficiencies in multisource data fusion, and difficulties in accurately characterizing complex operating conditions. These challenges markedly hinder the effective deployment of intelligent diagnostic technologies.

In practical industrial environments, the acquisition of equipment failure data is both challenging and resource-intensive, rendering the small-sample problem a critical bottleneck that impedes the broad implementation of intelligent diagnostic technologies. Generative adversarial networks (GANs) have attracted considerable interest owing to their powerful data generation capabilities. Nonetheless, existing GAN architectures exhibit limitations related to data quality and training stability. These limitations include challenges in capturing subtle feature variations within fault signals,¹⁴ inadequate compensation for information loss during signal transformation,¹⁵ absence of effective mechanisms to identify and preserve essential feature information,¹⁶ as well as common issues such as training instability and gradient vanishing. Collectively, these challenges substantially restrict the quality and diversity of the generated data. Hu et al.¹⁷ proposed PCASTNet, a physics-constrained adaptive style transfer network, which decouples fault content and machine style via wavelet transform and AdaSN module, and ensures physical consistency of generated samples through band energy constraint, effectively mitigating the drawbacks of traditional GANs. Similarly, Huang et al.¹⁸ proposed a simulation-to-real transformation framework for aeroengine dual-rotor systems, integrating asymmetric Gaussian chirplet model (AGCM)-based feature enhancement and adaptive multiscale style transfer network (AMSTN) to align simulation-real data distributions while preserving fault semantics, achieving excellent small-sample diagnostic performance.

Under variable load and reciprocating operating conditions, RV reducers exhibit fault characteristics with pronounced nonlinearity and time-varying behavior, which pose significant challenges for traditional feature extraction methods in accurately characterizing such complex operating states. Multisource data fusion technology substantially improves fault diagnosis accuracy and reliability by integrating signals acquired from heterogeneous sensors. Nonetheless, current fusion techniques remain constrained when handling multimodal data, manifesting as insufficient refinement in feature engineering and fusion strategies,¹⁹ inadequate consideration of the intrinsic characteristics of multimodal data,²⁰ and a lack of specialized designs tailored to the distinct properties of different modalities. Xia et al.²¹ proposed a multisource data fusion approach based on hierarchical attention mechanisms; however, the limited diversity of fault samples in their study restricts the model’s generalization capability and diagnostic accuracy when confronted with unknown or heterogeneous faults. Peng et al.²² achieved multidomain feature integration via deep residual networks combined with a fusion embedding layer; yet, their hierarchical fusion strategy did not effectively consolidate features across multiple abstraction levels. Moreover, prevailing deep learning methodologies predominantly utilize single-network architectures, which hampers the simultaneous extraction of both global and local features,²³,²⁴ often leading to the omission of critical nonlinear feature information under complex operating conditions.²⁵ In addition, there remains a paucity of systematic analysis concerning performance variations across different modalities under complex scenarios, resulting in suboptimal adaptability of feature extraction strategies.

Vibration signals, because of their direct interaction with mechanical structures, rich informational content, stability, capability to reflect multiple fault types, and independence from operating conditions, provide a comprehensive representation of the system’s overall macroscopic dynamic response and mechanical integrity.²⁶,²⁷ Accordingly, they serve as an optimal global feature source for fault diagnosis. Conversely, current signals demonstrate heightened sensitivity to transient variations in mechanical load. Localized faults generate periodic torque oscillations transmitted through the drive shaft to the motor, inducing fluctuations in rotational speed and electromagnetic parameters. The modulation depth of amplitude and frequency within the current signal exhibits a linear correlation with the severity of the fault impact.²⁸ Therefore, current signals more precisely capture the dynamic characteristics associated with localized faults. Based on these complementary attributes, this study employs vibration and current signals as the primary input modalities for multimodal fault diagnosis.

In summary, current research on fault diagnosis of RV reducers continues to face several critical technical challenges: (1) the absence of high-quality data generation methods to address small-sample issues; existing GAN architectures demonstrate limitations in capturing detailed signal features and ensuring stable training; (2) insufficient efficacy of multimodal information fusion strategies, with existing methods inadequately accounting for the intrinsic differences in signal characteristics across modalities when integrating multisource data such as vibration and current signals. Moreover, these fusion strategies lack specificity and fail to effectively consolidate complementary information; (3) constrained feature representation capabilities under complex operating conditions, where existing approaches struggle to concurrently extract both global and local features and lack hierarchical fusion mechanisms.

To address the aforementioned challenges, this paper proposes a fault diagnosis method for RV reducers based on an improved GAN and a dual-branch fusion framework for multimodal data. The key contributions are summarized as follows:

Enhanced GAN architecture: The proposed architecture significantly improves the extraction of detailed features from vibration signals and effectively mitigates the vanishing gradient problem, thereby enhancing training stability and the quality of generated data.

Efficient multimodal data fusion strategy: A dual-branch fault diagnosis model is developed, integrating self-calibrating convolutional (SCConv) neural networks and Vision Transformers (ViTs) to achieve effective fusion of multilevel features.

Feature representation under complex operating conditions: The generated one-dimensional signals are transformed into two-dimensional time-frequency images, preserving both temporal and spectral information. Furthermore, differentiated feature extraction strategies for vibration and current signals are employed to enhance the robustness and accuracy of the diagnostic model under complex scenarios such as variable loads and reciprocating motions.

The remainder of this paper is organized as follows: the second section presents the relevant theoretical foundations; the third section describes the experimental data utilized; the fourth section details the experimental validation; and the final section provides a summary of the conclusions.

Theoretical foundation

Generative adversarial network

GANs, introduced by Goodfellow et al.,²⁹ constitute a robust deep learning framework for synthetic data generation. The fundamental mechanism of a GAN is characterized by a dynamic adversarial learning process involving two neural networks: a generator (G) and a discriminator (D), which engage in a competitive yet collaborative training paradigm. The generator aims to produce samples that closely resemble real data to deceive the discriminator, whereas the discriminator endeavors to accurately differentiate between real and generated samples. Through this adversarial interplay, the GAN progressively improves the quality of generated data, ultimately approximating the true data distribution with high fidelity. Specifically, the generator G accepts a random noise vector z as input, mapping it through nonlinear transformations to the target data space to generate a synthetic sample G(z). The discriminator D functions as a binary classifier, tasked with accurately distinguishing real samples x from generated samples G(z). It outputs probability values D(x) or $D (G (z))$, indicating the likelihood that the input sample originates from the real data distribution. The GAN training process is fundamentally a minimax game, with its objective function expressed as:

\begin{matrix} \min_{G} \max_{D} V (D, G) = E_{x} + E_{z} \\ E_{x} = E_{x \sim p_{d} (x)} [\log D (x)] \\ E_{z} = E_{z \sim p_{z} (z)} [\log (1 - D (G (z)))] \end{matrix}

(1)

where $p_{d} (x)$ denotes the distribution of the real data; $p_{z} (z)$ represents the distribution of the noise vector input to the generator, typically modeled as a Gaussian noise distribution; D(x) is the discriminator’s estimated probability that the sample x is real; and $D (G (z))$ denotes the discriminator’s estimated probability that the generated sample G(z) is real.
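As a minimal illustration of Eq. (1), the value function can be estimated from a batch of discriminator outputs by replacing the two expectations with sample means. The function name `gan_value` and the example probabilities below are hypothetical and are not taken from the experiments in this paper.

```python
import math

def gan_value(d_real, d_fake):
    """Sample-mean estimate of V(D, G) in Eq. (1).

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    e_x = sum(math.log(p) for p in d_real) / len(d_real)
    e_z = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return e_x + e_z

# A discriminator that cannot separate real from generated samples outputs 0.5
# everywhere, which gives V = log(0.5) + log(0.5) = -log(4).
undecided = gan_value([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

# A confident and correct discriminator pushes V towards 0 from below.
confident = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this value, whereas the generator is trained to decrease it by driving $D (G (z))$ towards 1.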

During the practical training process, the discriminator and generator are alternately optimized. Through this adversarial learning mechanism, the generator progressively learns to produce samples that are difficult to distinguish from real fault signals, thereby enhancing the model’s capability to understand various fault patterns. In the context of fault diagnosis for RV reducers, GANs can effectively mitigate data imbalance issues and improve the recognition accuracy of rare fault types.

Wasserstein distance and GP

The Wasserstein distance, also referred to as the Earth Mover’s distance (EMD), quantifies the minimal “cost” required to transform one probability distribution into another. Employing the Wasserstein distance in lieu of the Jensen–Shannon divergence to assess the distributional discrepancy between real and generated samples has been demonstrated to substantially enhance the training stability of GANs.³⁰ Mathematically, the Wasserstein distance is defined as:

W (p_{d} (x), p_{g} (y)) = \inf_{γ \in Π (p_{d} (x), p_{g} (y))} E_{(x, y) \sim γ} [{‖ x - y ‖}_{2}]

(2)

where $p_{d} (x)$ denotes the distribution of real data x, $p_{g} (y)$ represents the distribution of generated data y, and $Π (p_{d} (x), p_{g} (y))$ is the set of all joint distributions $γ (x, y)$ whose marginals are $p_{d} (x)$ and $p_{g} (y)$.
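For intuition, Eq. (2) has a closed form in one special case: for two one-dimensional samples of equal size, the optimal coupling pairs the sorted values, so the distance reduces to the mean absolute difference between order statistics. The sketch below relies on that special case only; the helper name `wasserstein_1d` and the toy data are illustrative assumptions.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equally sized 1-D samples.

    Sorting both samples yields the optimal transport plan in one dimension,
    so no explicit search over joint distributions is needed here.
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)

real = [0.0, 1.0, 2.0, 3.0]
shifted = [v + 0.5 for v in real]       # same shape, moved by 0.5

dist_same = wasserstein_1d(real, real)      # identical samples
dist_shift = wasserstein_1d(real, shifted)  # cost of moving all mass by 0.5
```

The distance grows smoothly with the shift between the two samples, which is the property that makes it a useful training signal even when the real and generated distributions barely overlap.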

Then, the loss function of the Wasserstein GAN (WGAN) can be expressed as:

\begin{matrix} \min_{G} \max_{D} W (D, G) = E_{s} - E_{z} \\ E_{s} = E_{s \sim p_{d}} [D (s)] \\ E_{z} = E_{z \sim p_{z}} [D (G (z))] \end{matrix}

(3)

Gradient penalty (GP) is a regularization method explicitly formulated to enforce the discriminator’s gradient norm to remain close to unity. It serves as an alternative to weight clipping, aiming to improve the stability of network training.³¹ The objective of GP is to introduce an auxiliary loss term that constrains the gradient norm within a desired range, thereby addressing the limitations associated with weight clipping. The GP loss is formally defined as:

\begin{matrix} L_{gp} = λ E_{\bar{x} \sim P_{\bar{x}}} [{({‖ \nabla_{\bar{x}} D (\bar{x}) ‖}_{2} - 1)}^{2}] \\ \bar{x} = ϵ x + (1 - ϵ) \tilde{x}, \quad \tilde{x} \sim p_{g} \end{matrix}

(4)

where ∇ represents the gradient, $‖ \cdot ‖_{2}$ denotes the Euclidean norm, λ indicates the penalty factor, x comes from the real data distribution $p_{d} (x)$, $\tilde{x}$ comes from the generated data distribution $p_{g} (y)$, $\bar{x}$ is a random interpolation between x and $\tilde{x}$, and $ϵ$ follows a uniform distribution over [0,1].
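The penalty in Eq. (4) can be sketched without a deep learning framework by approximating the input gradient with central finite differences; in practice the gradient would be obtained by automatic differentiation. The function names, the toy critics, and the value λ = 10 are illustrative assumptions and are not settings reported in this paper.

```python
import math
import random

def numerical_grad(critic, x, h=1e-6):
    """Central finite-difference gradient of a scalar critic with respect to x."""
    grad = []
    for i in range(len(x)):
        up, down = list(x), list(x)
        up[i] += h
        down[i] -= h
        grad.append((critic(up) - critic(down)) / (2.0 * h))
    return grad

def gradient_penalty(critic, real, fake, lam=10.0, seed=0):
    """Sample-mean estimate of the gradient penalty for paired real/generated samples."""
    rng = random.Random(seed)
    total = 0.0
    for x, x_tilde in zip(real, fake):
        eps = rng.uniform(0.0, 1.0)
        # Random point on the segment between a real and a generated sample.
        x_bar = [eps * a + (1.0 - eps) * b for a, b in zip(x, x_tilde)]
        grad = numerical_grad(critic, x_bar)
        grad_norm = math.sqrt(sum(g * g for g in grad))
        total += (grad_norm - 1.0) ** 2
    return lam * total / len(real)

real = [[0.0, 1.0], [2.0, -1.0]]
fake = [[1.0, 0.0], [0.5, 0.5]]

# A linear critic with unit gradient norm incurs (almost) no penalty.
unit = gradient_penalty(lambda x: 0.6 * x[0] + 0.8 * x[1], real, fake)
# A critic with gradient norm 5 is penalized by 10 * (5 - 1)^2 = 160.
steep = gradient_penalty(lambda x: 3.0 * x[0] + 4.0 * x[1], real, fake)
```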

Convolutional neural network

Convolutional neural networks (CNNs) have demonstrated outstanding performance in image feature extraction and pattern recognition, attributable to their inherent local receptive fields and translation invariance. In the context of RV reducer fault diagnosis, CNNs are capable of effectively processing time-frequency representations derived from transformed fault signals, thereby capturing the spatial structural characteristics intrinsic to fault features.³²,³³

A CNN primarily consists of convolutional layers, pooling layers, and fully connected layers. The convolution operation extracts local features by sliding convolutional kernels over the input data, while the parameter-sharing mechanism endows the model with translation invariance, enabling it to recognize positional variations in fault features. The core computation of the convolution operation can be expressed as:

(x * w) (t) = \sum_{a = 1}^{m} x (a) \cdot w (t - a)

(5)

where x represents the input information, w denotes the weight kernel, t denotes the output position, a is the summation index, and m represents the size of the weight kernel.
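The sum in Eq. (5) can be written out directly. The sketch below uses zero-based indices and treats t as the output position; the function name `conv1d_full` and the toy signal are illustrative assumptions. Note that most deep learning frameworks implement the convolution layer as a cross-correlation, that is, without flipping the kernel.

```python
def conv1d_full(x, w):
    """Discrete convolution y[t] = sum_a x[a] * w[t - a] with zero-based indices."""
    n, m = len(x), len(w)
    y = [0.0] * (n + m - 1)
    for t in range(len(y)):
        for a in range(n):
            k = t - a                 # kernel index paired with input index a
            if 0 <= k < m:            # terms outside the kernel contribute nothing
                y[t] += x[a] * w[k]
    return y

signal = [1.0, 2.0, 3.0]
kernel = [1.0, 1.0]                   # two-tap moving sum
result = conv1d_full(signal, kernel)  # [1.0, 3.0, 5.0, 3.0]
```

Because the same kernel is applied at every position t, a feature that shifts along the input simply shifts along the output, which is the parameter-sharing property described above.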

Pooling layers reduce the dimensionality of feature maps to improve computational efficiency while preserving essential features. Common pooling strategies include max pooling and average pooling. Fully connected layers are responsible for integrating high-dimensional features to achieve the final classification. Activation functions introduce nonlinear mappings, enabling the network to learn complex relationships; typical examples include the rectified linear unit (ReLU) and Sigmoid functions. The loss function quantifies the discrepancy between the model’s predictions and the ground truth. In this study, the cross-entropy loss is employed, coupled with the Adam optimizer for parameter optimization, with the objective of minimizing the loss and enhancing model performance.

The hierarchical architecture of CNNs facilitates the automatic extraction of multilevel representations from fault signals. Specifically, shallow layers capture fundamental texture and edge features, whereas deeper layers integrate these low-level features to form more abstract and discriminative representations of fault patterns. This capability of hierarchical feature learning enables CNNs to effectively identify various fault types in RV reducers, including wear and fracture. Nonetheless, CNNs are intrinsically constrained by their limited local receptive fields, which impedes their ability to capture long-range temporal dependencies and cross-cycle correlations within fault signals.

Transformer architecture

The Transformer architecture, owing to its powerful information modeling capabilities,³⁴ effectively overcomes the inherent limitations of traditional deep learning networks in capturing non-local correlations. This advantage is particularly crucial in the fault diagnosis of RV reducers, as fault features in planetary gear systems often manifest as complex modulated phenomena spanning multiple rotational cycles, which necessitate models with deep global feature correlation modeling capabilities. Specifically, in planetary gear systems, fault characteristics generated by sun-planet gear meshing impacts frequently exhibit modulation effects across several rotational cycles. These interrelated feature patterns, distributed in different regions of the time-frequency map, are difficult for conventional CNNs to fully extract.

To leverage the strengths of the Transformer for image-based time-frequency feature analysis, this paper adopts the ViT architecture, which adapts the standard Transformer to image inputs. ViT splits a two-dimensional time-frequency image into a sequence of image patches and processes them as sequential data. Through its core self-attention mechanism, ViT dynamically assigns weights between different image patches, thereby overcoming the local perception limitations of traditional CNNs. Consequently, ViT effectively captures cross-regional feature coupling within fault signals.
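The patch-sequence idea can be sketched as follows: a two-dimensional image is cut into non-overlapping square blocks, and each block is flattened into one token. The helper name `image_to_patches`, the 4 × 4 toy image, and the patch size of 2 are illustrative assumptions; a full ViT additionally applies a learnable linear projection and position embeddings to each token.

```python
def image_to_patches(image, patch):
    """Split an H x W image (list of rows) into flattened patch x patch blocks."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    sequence = []
    for top in range(0, height, patch):          # walk over patch rows
        for left in range(0, width, patch):      # walk over patch columns
            block = [image[r][c]
                     for r in range(top, top + patch)
                     for c in range(left, left + patch)]
            sequence.append(block)
    return sequence

# Toy 4 x 4 "time-frequency image" whose pixel values are 0..15 in row order.
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
tokens = image_to_patches(toy, 2)   # four tokens, each holding four pixels
```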

The self-attention mechanism is the core component of the Transformer architecture. It enables the model to dynamically compute the relevance between any two positions in the input sequence, thereby capturing global dependencies.³⁴ The multihead attention mechanism extends this by running multiple attention heads in parallel, allowing the model to simultaneously focus on information from different representation subspaces and positions. Specifically, given an input sequence $X = [x_{1}, x_{2}, \dots, x_{n}]$ , the self-attention mechanism first linearly projects the input into three vectors: Query, Key, and Value, denoted as:

Q = X W^{Q}, K = X W^{K}, V = X W^{V}

(6)

where $W^{Q}$, $W^{K}$, and $W^{V}$ are learnable weight matrices.

The attention output is computed from the Query, Key, and Value matrices using scaled dot-product attention as follows:

\mathrm{head} = \mathrm{Attention} (Q, K, V) = \mathrm{softmax} \left( \frac{Q K^{T}}{\sqrt{d_{k}}} \right) V

(7)

where $1 / \sqrt{d_{k}}$ is a scaling factor, and $d_{k}$ denotes the dimension of the Key vectors.
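A single attention head following Eq. (7) can be sketched with plain lists. The matrices Q, K, and V are supplied directly here, so the learnable projections of Eq. (6) are omitted; the helper names and the 2 × 2 toy inputs are illustrative assumptions.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the weight matrix."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[10.0, 0.0], [0.0, 10.0]]
out, weights = attention(q, k, v)
```

Each row of `weights` sums to one, and each query attends most strongly to the key it is aligned with, so the corresponding value dominates that row of the output.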

The outputs of all attention heads are concatenated and then passed through a final linear projection to produce the multihead attention output. This process can be mathematically expressed as:

\mathrm{MultiHead} (Q, K, V) = \mathrm{Concat} (\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W^{O}

(8)

where W^O is a learnable weight matrix used for the final linear projection of the multihead attention output.

Fault diagnosis model based on multisource multimodal data fusion for small samples

This section presents a multisource data fusion framework for fault diagnosis specifically designed for RV reducers operating under small-sample conditions. The proposed framework mitigates data scarcity by leveraging an enhanced GAN and utilizes a dual-branch feature extraction network to comprehensively exploit the complementary information inherent in vibration and current signals. This approach enables precise and reliable fault identification.

WGAN-RAG network architecture

Although WGAN and its GP variant (WGAN-GP) have significantly improved the training stability and generative performance of traditional GANs by mitigating issues such as mode collapse and vanishing gradients, they remain challenged when modeling complex time-series signals. These architectures demonstrate limited capability in extracting local features and insufficient awareness of the global signal structure and periodic variations, thereby impeding the generation of high-fidelity synthetic data that accurately captures fault characteristics. To overcome these limitations, this study proposes an enhanced GAN architecture termed WGAN with residual and attention guidance (WGAN-RAG). This architecture systematically augments both the generator and discriminator modules, enhancing model stability and improving output fidelity. The network architecture of WGAN-RAG is depicted in Figure 1.

Figure 1.

Schematic diagram of the WGAN-RAG architecture. WGAN: Wasserstein generative adversarial network; RAG: residual and attention guidance.

Generator improvements

To address the insufficient local feature extraction of WGAN-GP and its inadequate perception of the overall structural and periodic variations of signals, this paper first optimizes the generator. The generator architecture mainly comprises an initial feature mapping and upsampling layer, a deep residual feature extraction network, a multihead self-attention mechanism, and a final residual block followed by a multichannel output layer:

1) Initial feature mapping and upsampling layer: The input noise vector is initially mapped into a high-dimensional feature space through a fully connected layer. Subsequently, the network employs a combination of two upsampling operations and one-dimensional convolutions to progressively construct and refine temporally structured high-dimensional features. After these two processing steps, an independent one-dimensional convolutional layer is applied, with its output range constrained by a Tanh activation function. These three layers of one-dimensional convolutions and upsampling operations work in close coordination. Their combined configuration and parameter design aim to optimize the initial representation of temporal signals, thereby enhancing the feature learning efficiency of subsequent modules.

2) Deep residual feature extraction network: To further enhance the robust extraction of complex temporal features within deep generator networks and effectively alleviate the vanishing gradient problem commonly encountered in traditional deep network training, this study proposes a deep residual network composed of two sequential residual blocks (denoted as the double-layer residual network module in Figure 1). Each residual block internally integrates batch normalization (BN), ReLU activation functions, and skip connections, ensuring smooth propagation of deep feature information while effectively preserving multilevel temporal characteristics. This architectural design enables the generator to thoroughly mine and learn intricate patterns present in fault signals, while maintaining stable training dynamics.

First, the initial feature x_init is passed through the first residual block to obtain the intermediate feature x₁:

x_{1} = \mathrm{ReLU} (\mathrm{BN} (\mathrm{Conv1d} (x_{\mathrm{init}}))) + x_{\mathrm{init}}

(9)

Next, x₁ is taken as the input to the second residual block to obtain the final output of the residual network, $h_{\mathrm{res\_out}}$:

h_{\mathrm{res\_out}} = \mathrm{ReLU} (\mathrm{BN} (\mathrm{Conv1d} (x_{1}))) + x_{1}

(10)

3) Introduction of multihead self-attention mechanism: Due to the limited local receptive fields of convolutional layers, effectively capturing complex dependencies between nonadjacent time steps in long sequential signals remains challenging. To address this, the present study incorporates a multihead self-attention mechanism following the extraction of high-level features by the deep residual network. This mechanism allows the model to dynamically compute the relevance between each time step and all other time steps in the sequence, thereby directly modeling the contextual dependencies within the data. The output of the multihead attention M is then fused with the original features $h_{\mathrm{res\_out}}$ through weighted summation to obtain the final output feature h_att:

\begin{matrix} h_{\mathrm{att}} = γ M + (1 - γ) h_{\mathrm{res\_out}}, \\ M = \mathrm{MultiHead} (h_{\mathrm{res\_out}}) \end{matrix}

(11)

where γ denotes the fusion coefficient between the attention output and the original features.

4) Final residual block and multichannel output layer: To further refine feature representations and ensure output stability, this study incorporates a final residual block subsequent to the multihead self-attention mechanism. This residual block performs the ultimate optimization and reconstruction of attention-enhanced features, thereby augmenting their representational capacity and establishing a robust foundation for the subsequent multichannel output layer.
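The residual blocks of Eqs. (9) and (10) and the weighted fusion of Eq. (11) can be sketched on a single-channel sequence. Several simplifications are assumed here: batch normalization is replaced by normalization over the sequence length, the kernels and the fusion coefficient γ = 0.3 are arbitrary, and the attention output M is a placeholder rather than a real multihead computation. All function names are hypothetical.

```python
import math

def conv1d_same(x, kernel):
    """Single-channel 1-D convolution layer with zero padding (odd kernel size)."""
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [sum(kernel[j] * padded[i + j] for j in range(len(kernel)))
            for i in range(len(x))]

def normalize(x, eps=1e-5):
    """Stand-in for BN: zero mean and unit variance over the sequence."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_block(x, kernel):
    """ReLU(BN(Conv1d(x))) + x, following the form of Eqs. (9) and (10)."""
    branch = [max(0.0, v) for v in normalize(conv1d_same(x, kernel))]
    return [b + s for b, s in zip(branch, x)]

def fuse(attention_out, features, gamma):
    """h_att = gamma * M + (1 - gamma) * h_res_out, as in Eq. (11)."""
    return [gamma * m + (1.0 - gamma) * h for m, h in zip(attention_out, features)]

x_init = [0.5, -1.0, 2.0, 0.0, 1.5]
x1 = residual_block(x_init, [0.2, 0.5, 0.2])
h_res_out = residual_block(x1, [-0.1, 0.3, 0.4])

# With an all-zero kernel the convolution branch vanishes and only the skip
# connection remains, so the block reduces to the identity mapping.
identity = residual_block(x_init, [0.0, 0.0, 0.0])

# Placeholder for MultiHead(h_res_out); a real model would compute it by attention.
m_placeholder = [0.0] * len(h_res_out)
h_att = fuse(m_placeholder, h_res_out, gamma=0.3)
```

The identity case shows why skip connections ease optimization: even when the convolution branch contributes nothing, the input still reaches the next layer unchanged.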

Discriminator improvements

To augment the generalization ability of the discriminator, this study implements a multilevel Dropout mechanism, thereby establishing a stochastic feature suppression strategy:

\begin{matrix} D (x) = σ \left( \sum_{l = 1}^{3} M_{l} (‖ W_{fc, l} \cdot x ‖) \right) \\ M_{l} \sim \mathrm{Bernoulli} (0.75) \end{matrix}

(12)

where M_l denotes the lth dropout layer, whose mask follows a Bernoulli distribution with a keep probability of 0.75 (the Dropout probability is set to 0.25 in this article), and W_fc,l represents the weights of the lth layer of the discriminator. The symbol ⊙ denotes element-wise multiplication.
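The stochastic suppression described above can be sketched by sampling a Bernoulli mask with keep probability 0.75 and multiplying it element-wise with a feature vector. The function names and the fixed seed are illustrative assumptions. The sketch omits the rescaling by the inverse keep probability that deep learning frameworks commonly apply during training.

```python
import random

def dropout_mask(size, keep_prob=0.75, seed=0):
    """Draw a 0/1 mask whose entries are kept with probability keep_prob."""
    rng = random.Random(seed)
    return [1 if rng.random() < keep_prob else 0 for _ in range(size)]

def apply_dropout(features, mask):
    """Element-wise product of a feature vector and a dropout mask."""
    return [f * m for f, m in zip(features, mask)]

mask = dropout_mask(10000)
kept_fraction = sum(mask) / len(mask)   # close to 0.75 for a large mask

# A fixed mask makes the effect explicit: the second feature is suppressed.
dropped = apply_dropout([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 1])
```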

In summary, the module configuration of the improved GAN, WGAN-RAG, is presented in Table 1. The parameter settings of the convolutional modules (Conv_blocks), upsampling module (Up_sample), residual network (Res_blocks), and multihead self-attention mechanism (MHSA) are designed to maximize the model’s capability to capture complex dynamic features of time-series signals. By meticulously designing the number of layers, convolutional kernel sizes, attention head counts, and residual connection schemes for each component, the model capacity between the generator and discriminator is balanced, thereby ensuring the stability of adversarial training and the quality of the generated samples.

Table 1.

WGAN-RAG architecture configuration.

Type	Layer	Activation function	Parameter size	Output size
Generator	Input			(128)
	init_linear	relu	128 × (128 × 9)	(128,9)
	Up sample1			(128,18)
	Conv1	relu	128 × 128 × 3	(128,32)
	Up sample2			(128,36)
	Conv2	relu	128 × 128 × 3	(128,36)
	Conv3	tanh	128 × 128 × 3	(128,36)
	ResBlocks (2×)	relu	2 * (3 × 128)	(128,36)
	MHSA		Q,K:1 × 16,V:1 × 128	(128,36)
	ResBlock (post-attention)	relu	2 * (128 × 128 × 3)	(128,36)
	Flatten			(4608)
	Output	tanh	4608 × 1240	(1240)
Discriminator	Input			(1240)
	Linear_Block_1	leakyrelu(0.2), dropout(0.25)	1240 × 1024	(1024)
	Linear_Block_2	leakyrelu(0.2), dropout(0.25)	1024 × 512	(512)

WGAN: Wasserstein generative adversarial network; RAG: residual and attention guidance; MHSA: multi-head self-attention mechanism.
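Some of the sizes in Table 1 can be cross-checked with simple arithmetic. The sketch below assumes that each upsampling step doubles the sequence length, which matches the 9 → 18 → 36 progression in the table, and it counts weights only (bias terms are excluded). The variable names are illustrative.

```python
noise_dim = 128        # generator input size
channels = 128         # feature channels after init_linear
initial_length = 9     # sequence length after init_linear

# init_linear maps the noise vector to a (128, 9) feature map.
init_linear_weights = noise_dim * (channels * initial_length)

length = initial_length
for _ in range(2):     # Up sample1 and Up sample2
    length *= 2        # assumed factor-2 upsampling: 9 -> 18 -> 36

flattened = channels * length           # size after the Flatten layer
output_length = 1240                    # length of one generated signal
output_weights = flattened * output_length
```

Under these assumptions the flattened size is 128 × 36 = 4608, in agreement with the table, and the output layer alone holds 4608 × 1240 = 5,713,920 weights, which makes it by far the largest generator layer listed.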

The core architecture of WGAN-RAG, which integrates dual-layer residual connections and multihead self-attention mechanisms, features strong modular compatibility. Regardless of variations in dataset modal types or levels of sample imbalance, the model retains its core residual-attention backbone. Only targeted adjustments to specific components are needed instead of a full architectural redesign. The residual-attention mechanism and Wasserstein distance-based training framework of WGAN-RAG are generalizable to other rotating machinery. Through transfer learning-based fine-tuning of the feature mapping layer, the model can adapt to the fault feature distributions of different target mechanical systems. The multihead self-attention module maintains its effectiveness in modeling long-range dependencies of periodic fault-related signals, while relevant layers can be adjusted appropriately to match the signal characteristics of the target system, ensuring stable generative performance.

SCViT dual-branch fusion diagnostic model

In the presence of gear faults such as cracks or wear, the vibration signals of RV reducers exhibit distinct periodic impact and modulation phenomena. These effects extend over multiple meshing cycles, resulting in continuous variations in the signal’s energy and frequency components over prolonged time scales. Such variations characterize the comprehensive health condition of the mechanical system, herein referred to as the “overall time-frequency correlation characteristics.” Complementarily, current signals demonstrate high sensitivity to transient mechanical load fluctuations. Localized defects (e.g., root cracks) induce periodic oscillations in the load torque, which, through electromechanical coupling, generate amplitude and frequency modulations in the stator currents. These modulations manifest as characteristic sidebands within the frequency spectrum. These “local dynamic perturbations” convey critical diagnostic information essential for the accurate identification of early-stage or minor faults.

To address the intrinsic differences between vibration and current signals in terms of frequency distribution and time-domain characteristics, this study proposes the SCViT dual-branch fusion diagnostic model, a Self-Calibrated Convolution and Vision Transformer Fusion Network for Dual-Modality Classification. The model incorporates dedicated feature extraction pathways tailored to each signal modality: (1) the SCConv-based vibration branch is designed to effectively capture comprehensive dynamic patterns spanning multiple fault feature cycles within vibration signals; (2) the ViT-based current branch utilizes its self-attention mechanism to precisely identify localized spectral line variations and periodic modulation patterns induced by fault-related modulations in current signals. This targeted architectural design enables the dual-branch network to more thoroughly exploit the complementary information inherent in both modalities, thereby enhancing diagnostic accuracy.

The dual-branch diagnostic framework is depicted in Figure 2. It comprises an SCConv-based vibration feature extraction branch and a ViT-based current feature extraction branch, collectively facilitating the comprehensive utilization of complementary information from multi-source heterogeneous data.

Figure 2.

Structure of the dual-branch diagnostic model.

Time-frequency transformation of multisource signals

To fully preserve the temporal characteristics of the signals and facilitate subsequent deep learning model processing, this study employs the short-time Fourier transform (STFT) to convert one-dimensional signals into two-dimensional time-frequency representations.³⁵ STFT is a widely used method for time-frequency analysis. The fundamental computational formula of STFT is expressed as follows:

STFT (t, ω) = \int_{- \infty}^{\infty} x (τ) g (τ - t) e^{- j ω τ} d τ

(13)

where $x (τ)$ denotes the original signal, t represents time, τ is the integration variable, ω indicates frequency, and $g (τ - t)$ is a window function centered at t.

It is noteworthy that the length of the window function directly affects the resolution trade-off between the time and frequency domains. Therefore, to achieve precise analysis tailored to specific signal characteristics, careful selection of the window length is imperative. The parameter values adopted in this study are summarized in Table 7.
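As a minimal sketch of a discrete counterpart to equation (13), the following code computes a magnitude spectrogram with the settings listed in Table 7 (Hamming window of 256 samples, overlap of 128, 1240 FFT points, sampling frequency 25,600). The function name `stft_magnitude`, the absence of boundary padding, and the synthetic 2 kHz test tone are assumptions made for illustration; the phase reference is taken relative to each frame, which does not affect the magnitude.

```python
import numpy as np

def stft_magnitude(x, win_len=256, n_overlap=128, n_fft=1240):
    """Magnitude STFT with a Hamming window (no boundary padding)."""
    hop = win_len - n_overlap
    window = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[m * hop : m * hop + win_len] * window for m in range(n_frames)])
    # rfft zero-pads each windowed frame to n_fft points.
    return np.abs(np.fft.rfft(frames, n=n_fft, axis=1)).T  # (frequency bins, frames)

fs = 25600
t = np.arange(1240) / fs
x = np.sin(2 * np.pi * 2000.0 * t)   # synthetic tone standing in for a measured segment

spec = stft_magnitude(x)
freqs = np.fft.rfftfreq(1240, d=1.0 / fs)
peak_hz = freqs[spec.mean(axis=1).argmax()]
print(spec.shape, peak_hz)
```

With these settings, a 1240-sample segment yields 621 frequency bins and 8 frames. The 256-sample window fixes the actual frequency resolution, while the 1240-point FFT only interpolates the spectrum onto a finer grid of about 20.6 Hz per bin.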

Self-calibrated convolution

Traditional CNNs, because of their inherent locality in feature extraction, struggle to effectively capture the global time-frequency characteristics of vibration spectrograms for RV reducer fault diagnosis. To address the limitations of local feature enhancement in vibration signals, this study introduces SCConv as a core feature extraction module. SCConv establishes local feature associations between spatial regions of the input spectrogram,³⁶ thereby enabling the extraction of comprehensive time-frequency features from vibration signals and enhancing the overall discriminative capability.

As illustrated in the architecture of the self-calibrated convolution module in Figure 3, the module employs a dual-path design for feature extraction. Specifically, the input feature map X is divided into $X_{1}$ and $X_{2}$ , corresponding to path 1 (self-calibration path) and path 2 (spatial information path), respectively. This bifurcated approach enables the model to effectively process dynamic features in vibration signals. Path 1 focuses on capturing multiscale global contextual information, while path 2 emphasizes the extraction of localized and discriminative features based on temporal changes in the input signal. The convolution kernel K is also divided into four corresponding parts $K_{1}, K_{2}, K_{3}, K_{4}$ to assist in the extraction and fusion of pathway features. The core operation steps are as follows.

1) Multiscale spatial transformation

Figure 3.

RV reducer fault diagnosis architecture.

Initially, the self-calibrated convolution module downsamples the input $X_{1}$ via the $r \times Down$ operation (as shown in Figure 3) to expand the receptive field. Subsequently, feature extraction is performed using $F_{2} [K_{2}]$ , and the upsampling operation $r \times Up$ is applied to restore the feature map to its original resolution.

This multiscale spatial transformation is essential for capturing the overall time-frequency correlation patterns within vibration signals. For example, in distributed faults such as multi-tooth surface wear on sun gears, where the effects extend across multiple meshing cycles, SCConv can effectively integrate these cross-temporal and cross-frequency patterns by expanding the receptive field.

2) Adaptive feature calibration

Adaptive feature calibration constitutes the core component of the self-calibrated convolution. It performs local weighting on multiscale temporal features extracted from X₁. The calibrated feature Y₁ is computed as follows:

Y_{1} = F_{1} (X_{1}) ⊙ σ (F_{2} (X_{1}) + F_{3} (X_{1}))

(14)

where ⊙ denotes element-wise multiplication, and σ represents the Sigmoid activation function.

Adaptive feature calibration utilizes the global temporal features extracted from path 1 as a reference, while path 2 preserves the original multiscale features for adaptive weighting and correction. This mechanism enables the model to emphasize fault-related characteristics within vibration signals.

3) Dual-path fusion

The final output is obtained by concatenating the output Y₁ from path 1 and the output Y₂ from path 2, as expressed by:

Y = Concat (Y_{1}, Y_{2})

(15)

Given that gear faults in RV reducers (such as root cracks and tooth surface wear) induce periodic impacts, resonances, and modulations in vibration signals exhibiting holistic correlation characteristics across cycles and frequencies, the SCConv module effectively captures and integrates these dispersed yet interrelated time-frequency correlation patterns distributed across different regions of the time-frequency map via its self-calibration mechanism. This approach compensates for the inherent limitations of traditional CNNs’ localized receptive fields, thereby providing more comprehensive vibration feature representations for diagnosing distributed faults such as multitooth surface wear.
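The three steps above can be sketched in a few lines of NumPy. This is an illustrative simplification, not the network used in this study: the learned convolutions are replaced by hypothetical 1 × 1 channel-mixing maps (`pointwise`), the downsampling and upsampling operations are taken to be average pooling and nearest-neighbour repetition, and the fourth operator is assumed to serve path 2.

```python
import numpy as np

rng = np.random.default_rng(0)

def pointwise(x, w):
    """1x1 channel-mixing map, a stand-in for the learned convolutions F_i."""
    return np.einsum("oc,chw->ohw", w, x)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_pool(x, r):
    """r x Down: average pooling with stride r."""
    c, h, w = x.shape
    return x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))

def upsample(x, r):
    """r x Up: nearest-neighbour upsampling back to the input resolution."""
    return x.repeat(r, axis=1).repeat(r, axis=2)

def sc_conv(x, weights, r=4):
    c = x.shape[0] // 2
    x1, x2 = x[:c], x[c:]                        # split into path 1 and path 2
    f1, f2, f3, k_path2 = weights
    # Step 1: multiscale context from the downsampled map.
    context = upsample(pointwise(avg_pool(x1, r), f2), r)
    # Step 2: gate F1(X1) with the calibration weights, as in Eq. (14).
    y1 = pointwise(x1, f1) * sigmoid(context + pointwise(x1, f3))
    y2 = pointwise(x2, k_path2)                  # path 2 keeps the original resolution
    # Step 3: concatenate both paths, as in Eq. (15).
    return np.concatenate([y1, y2], axis=0)

x = rng.standard_normal((8, 16, 16))             # (channels, height, width), hypothetical size
weights = [rng.standard_normal((4, 4)) for _ in range(4)]
y = sc_conv(x, weights)
print(y.shape)
```

Because the sigmoid gate lies between 0 and 1, each element of the calibrated output is an attenuated copy of the corresponding element of F₁(X₁); the context computed on the pooled map decides where the attenuation is weakest.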

Vision transformer

This study adopts the ViT as the feature extraction module to address localized modulated disturbances present in current signals caused by faults. ViT partitions the time-frequency map into nonoverlapping local patches and encodes them into feature vectors, thereby capturing modulation components within specific temporal windows and frequency bands. By incorporating positional encoding, ViT preserves spatiotemporal information, enabling precise localization of modulation effects. Its self-attention mechanism dynamically establishes cross-region correlations, effectively constructing a fault feature representation network.³⁷

ViT initially segments the input time-frequency map into fixed-size, nonoverlapping image patches, which are subsequently flattened and projected linearly into embedding vectors to preliminarily capture local features within each patch. A class token is then prepended to the embedding sequence, and learnable positional encodings are added to retain spatiotemporal positional information, forming the input sequence to the Transformer encoder. This sequence is processed by a Transformer encoder composed of L identical stacked Encoder layers. Each Encoder layer integrates an MHSA module and a multilayer perceptron module, employing layer normalization and residual connections to stabilize training. At this stage, the self-attention mechanism dynamically establishes cross-region dependencies, directly capturing relationships among image patches. Finally, the class-token feature vector $z_{L}^{CLS} \in R^{D}$ output from the Transformer encoder is taken as input to a linear classification layer, yielding the predicted probabilities for the K categories:

y = Linear (z_{L}^{CLS})

(16)

where $y \in R^{K}$ denotes the predicted probability vector for the K categories.
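The tokenization and classification steps can be made concrete with a small NumPy sketch. The input size follows the CNN output in Table 2, whereas the patch size (8) and embedding width (64) are assumed values chosen for illustration, the weights are random rather than trained, and the L encoder layers are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(image, patch):
    """Split an (H, W, C) map into flattened non-overlapping patches."""
    h, w, c = image.shape
    grid = image.reshape(h // patch, patch, w // patch, patch, c)
    return grid.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)

image = rng.standard_normal((120, 120, 3))   # size of the CNN output in Table 2
patch, dim, n_classes = 8, 64, 7             # patch size and embedding width are assumed

patches = patchify(image, patch)                                # (225, 192)
w_embed = rng.standard_normal((patches.shape[1], dim)) * 0.02
tokens = patches @ w_embed                                      # linear projection
cls = np.zeros((1, dim))                                        # class token (learnable in practice)
pos = rng.standard_normal((tokens.shape[0] + 1, dim)) * 0.02    # positional encodings
z0 = np.concatenate([cls, tokens], axis=0) + pos                # encoder input, (226, 64)

# Encoder layers omitted; the linear head of Eq. (16) acts on the class-token row.
w_head = rng.standard_normal((dim, n_classes)) * 0.02
logits = z0[0] @ w_head
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(patches.shape, z0.shape, probs.shape)
```

Under these assumptions, a 120 × 120 map produces 225 patch tokens plus one class token. A softmax is applied to the linear output here to obtain probabilities, consistent with the implicit softmax noted in Table 2.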

Dual-branch diagnostic network for vibration and current feature fusion

In summary, this study proposes the SCViT diagnostic model architecture, which integrates two distinct processing branches to comprehensively exploit the complementary information inherent in vibration and current signals:

1) Vibration signal branch (CNN-SCConv): Vibration signals represent the energy dissipation within mechanical systems, with their broadband energy distribution reflecting the overall dynamic performance of the reducer. This branch employs a parameter-sharing CNN to extract preliminary features, which are subsequently enhanced by an SCConv module. This design effectively consolidates global information from the time-frequency representation and integrates cross-dimensional features, enabling precise characterization of complex cross-cycle patterns within the vibration signals.

2) Current signal branch (CNN-ViT): Current signals are more sensitive to transient mechanical load variations; for example, step changes in meshing stiffness caused by sun gear cracks manifest as abrupt spikes in the current time-domain signal. This branch similarly utilizes a parameter-sharing CNN for initial feature extraction and incorporates a ViT architecture. Leveraging multihead attention mechanisms, the ViT focuses on localized regions within the time-frequency map while effectively modeling long-range dependencies among these regions, thereby facilitating comprehensive analysis of dynamic disturbance features in current signals.

Features extracted from both branches undergo adaptive pooling and are concatenated along the same dimension before being input to a fully connected layer for classification across seven fault categories (including the normal state) under various operating conditions. This architectural design complements the full-time-domain modulation patterns of vibration signals with the localized abrupt features of current signals, thereby substantially enhancing diagnostic accuracy. The principal structural parameters of the model are summarized in Table 2.

Table 2.

Selected parameters of the model.

Layer	Parameter name	Activation function	Parameter size	Output size
Input 1				128 × 128 × 3
Input 2				128 × 128 × 3
Shared convolutional feature extractor	Conv block 1	relu	Conv2d(3,16,3)	126 × 126 × 16
	Conv block 2	relu	Conv2d(16,32,3)	124 × 124 × 32
	Conv block 3	relu	Conv2d(32,16,3)	122 × 122 × 16
	Conv block 4	tanh	Conv2d(16,3,3)	120 × 120 × 3
Self-calibrated convolutions	SCConv module		SCConv	120 × 120 × 64
ViT	ViT module		ViT	7
Feature fusion	Flatten SCConv		Flatten	921,600
	Concatenation		Concatenate	921,607
	Fully connected	relu + dropout(0.5)	Linear(921607,256)	256
Output		Softmax (implicit)	Linear(256, 7)	7-class

ViT: Vision Transformer; SCConv: self-calibrated convolution.
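The output sizes in Table 2 can be checked with simple arithmetic. The sketch below assumes unit stride and no padding for the four 3 × 3 convolutions, which is what the listed sizes imply; the helper name `conv_out` is introduced here for illustration.

```python
def conv_out(size, kernel=3, stride=1, padding=0):
    """Spatial output size of a convolution along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

size = 128
sizes = []
for _ in range(4):               # four 3x3 conv blocks without padding
    size = conv_out(size)
    sizes.append(size)

scconv_features = size * size * 64      # flattened SCConv output
fused_features = scconv_features + 7    # concatenated with the 7-dimensional ViT output
fc_params = fused_features * 256 + 256  # weights plus biases of Linear(921607, 256)

print(sizes, scconv_features, fused_features, fc_params)
```

The spatial size shrinks by two pixels per block (126, 124, 122, 120), and the flattened and concatenated sizes match the 921,600 and 921,607 entries in the table. The first fully connected layer alone therefore holds roughly 236 million parameters.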

To ensure methodological rigor and clarify the impact of training settings on performance, the key training hyperparameters and configuration details of SCViTNet are summarized in Table 3. The total number of training epochs is set to 100, based on the observed convergence behavior: monitoring of the training process shows that the training loss and test accuracy stabilize around the 40th epoch. Training for 100 epochs ensures complete convergence, so that the model fully learns the time-frequency features and fusion rules of the bimodal signals. With too few epochs, feature learning and gradient updates would be insufficient, and the model would fail to capture the global time-frequency correlation features of vibration signals and the local dynamic perturbation features of current signals, leading to low fault classification accuracy. At the same time, the 100-epoch schedule does not cause unnecessary consumption of computing resources. For multifault classification, CrossEntropyLoss is adopted, which measures the discrepancy between the predicted probability distributions and the true fault labels. As a standard choice for multiclass classification tasks, it drives the model to learn discriminative bimodal fusion features.

Table 3.

Training hyperparameters and configuration of SCViTNet.

Hyperparameter category	Parameter name	Value
Training cycle	Total training epochs	100
Optimization setup	Initial learning rate	1e-4 (0.0001)
	Optimizer	Adam
	Loss function	CrossEntropyLoss
Batch configuration	Training batch size	10
	Testing batch size	20
Regularization	Weight decay (L2 regularization)	1e-4 (0.0001)
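To illustrate what the cross-entropy loss in Table 3 measures, the following NumPy sketch evaluates it for two hypothetical sets of seven-class outputs. The function `cross_entropy` is an illustrative reimplementation that applies the softmax internally, and the logits are invented for the example.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

labels = np.array([0, 3, 6])
uniform = np.zeros((3, 7))                    # no preference among the 7 classes
confident = np.full((3, 7), -5.0)
confident[np.arange(3), labels] = 5.0         # strongly favours the true class

loss_uniform = cross_entropy(uniform, labels)
loss_confident = cross_entropy(confident, labels)
print(loss_uniform, loss_confident)
```

An uninformative prediction over seven classes gives a loss of ln 7 (about 1.946), which is a useful reference value when reading the early part of a training curve; the loss approaches zero as the predicted probability of the true class approaches one.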

Proposed model algorithm flow

The overall fault diagnosis procedure of the proposed method is depicted in Figure 3, with the detailed steps as follows:

Step 1: Synchronously acquire periodic vibration signals and periodic current signals from the RV reducer under various fault types and operating conditions, thereby constructing a raw fault dataset comprising vibration and current signals across multiple operating scenarios.

Step 2: Construct an original small-sample imbalanced dataset (refer to Table 6), train the WGAN-RAG model on the imbalanced raw dataset, then use the trained model to generate augmented data.

Step 3: Generate augmented one-dimensional data to expand the imbalanced small-sample dataset, followed by conversion into time-frequency representations via STFT.

Step 4: Develop the SCViTNet dual-branch network for model training, integrating a vibration feature extraction branch composed of a parameter-sharing CNN and SCConv, as well as a current feature extraction branch combining a parameter-sharing CNN with a ViT. Partition the two-dimensional time-frequency image fault dataset of vibration and current signals from the RV reducer under different operating conditions into training and testing subsets, using the training subset for model training.

Step 5: Merge the independently processed features from the SCConv and ViT branches. Apply adaptive average pooling for dimensionality reduction, concatenate the pooled features along the same dimension to ensure data consistency, and linearly transform the result into the output space to generate the final diagnostic results.

The aforementioned theoretical advancements collectively address the principal challenges in RV reducer fault diagnosis: (1) to mitigate the weak generalization capability of diagnostic models caused by sparse small-sample data in RV reducers, the WGAN-RAG network alleviates data scarcity and improves the quality of generated samples under small-sample conditions; (2) to overcome the lack of targeted multimodal information fusion, the SCViT dual-branch architecture (SCConv and ViT) differentially extracts holistic spatiotemporal correlation features from vibration signals and localized dynamic perturbations from current signals, thereby enhancing the specificity of multimodal fusion strategies; (3) to tackle the difficulty of feature representation under complex operating conditions, the SCViT dual-branch network, through its differentiated feature extraction and deep fusion mechanism, effectively captures and integrates fault features across varying load, reciprocating, and other complex conditions, thereby enhancing the model’s feature representation capability and diagnostic robustness under dynamic operating environments.

Experimental validation and analysis

Experimental setup and data acquisition

To demonstrate the effectiveness of the proposed method, a dedicated RV reducer test rig was established, as illustrated in Figure 4. The test rig comprises four principal components: a drive system, an RV reducer, a sensing unit, and a data acquisition system. During the experimental procedures, a servo drive controlled a permanent magnet synchronous motor, which served as the power source to actuate the RV reducer via reciprocating oscillation tests. The signal acquisition subsystem incorporated current sensors and tri-axial vibration accelerometers, synchronously capturing multichannel signals through a National Instruments (NI) data acquisition card interfaced with the LabVIEW platform. The key equipment specifications and detailed operational parameters are summarized in Table 4.

Figure 4.

Experimental platform: (a) current transformers, (b) vibration sensors, (c) NI 9234 data acquisition card, and (d) LabVIEW interface.

Table 4.

Experimental equipment parameters.

Equipment	Model	Test parameters
RV reducer	150BX-141-RVE-19	Reduction ratio: 141
Servo drive	SV-X3EB100A-A2	Oscillation angle: ±90°; angular velocity: 100° s⁻¹
Rotary encoder	SV-X2MH100C-B2LN	Speed: 3000 r min⁻¹
PCB sensor	256A16	Sensitivity: 100 mV g⁻¹ (±5%)
Hall-type current transformer	ZHTK25	Rated input current: 20 A
Data acquisition card	NI 9234	Sensitivity: 33.56 mV g⁻¹

The experimental platform employs electrical discharge machining to fabricate artificial defects on the sun and planet gears, encompassing three damage sizes (0.1, 0.3, and 0.5 mm) and two failure categories: crack-type and wear-type failures. This setup results in six representative failure modes: cracks at the root of the sun gear teeth, cracks at the root of the planet gear teeth, single tooth surface wear of the sun gear, multiple tooth surface wear of the sun gear, single tooth surface wear of the planet gear, and multiple tooth surface wear of the planet gear. Together with the normal condition, these constitute seven operational states of the RV reducer, as summarized in Table 5. Figure 5 presents physical samples corresponding to each failure mode: Figure 5(a) illustrates cracks at the root of the sun gear teeth, Figure 5(b) depicts cracks at the root of the planet gear teeth, Figure 5(c) shows single tooth surface wear of the sun gear, Figure 5(d) demonstrates multiple tooth surface wear of the sun gear, Figure 5(e) displays single tooth surface wear of the planet gear, and Figure 5(f) exhibits multiple tooth surface wear of the planet gear. The reducer was tested under three operational conditions: condition 1 involved continuous unidirectional rotation of the output shaft through a complete cycle without additional load; condition 2 consisted of reciprocal oscillation of the output shaft at 90° intervals with an 8 kg external load; condition 3 entailed reciprocal oscillation of the output shaft at 90° intervals without load. The rotational speed was maintained at 100° s⁻¹ for all three conditions. Figure 6 depicts the time-domain periodic waveforms of vibration and current signals acquired from the sun gear exhibiting single tooth surface wear.

Table 5.

Fault types of RV reducer.

Number	Fault type	Speed (° s⁻¹)	Signal type	Motion type	Damage size (mm)
1	Cracks at the root of the sun gear teeth	100	Current, vibration	Full rotation, reciprocating	0.1
2	Cracks at the root of the planet gear teeth	100	Current, vibration	Full rotation, reciprocating	0.1
3	Single tooth surface wear of the sun gear	100	Current, vibration	Full rotation, reciprocating	0.3
4	Multiple tooth surface wear of the sun gear	100	Current, vibration	Full rotation, reciprocating	0.5
5	Single tooth surface wear of the planet gear	100	Current, vibration	Full rotation, reciprocating	0.3
6	Multiple tooth surface wear of the planet gear	100	Current, vibration	Full rotation, reciprocating	0.5
7	Normal condition	100	Current, vibration	Full rotation, reciprocating	0.0

Figure 5.

Gear fault images: (a) cracks at the root of the sun gear teeth, (b) cracks at the root of the planet gear teeth, (c) single tooth surface wear of the sun gear, (d) multiple tooth surface wear of the sun gear, (e) single tooth surface wear of the planet gear, and (f) multiple tooth surface wear of the planet gear.

Figure 6.

Sun gear single tooth surface fault of RV reducer: (a) current signal and (b) vibration signal.

Diagnostic results and analysis

Experimental data were acquired and subsequently processed on a Windows 11 platform equipped with a GeForce RTX 3050 GPU and 12 GB of RAM. The raw signals were divided into segments of 1240 samples each to form a small-sample, imbalanced dataset. The detailed configuration of the experimental dataset is provided in Table 6.

Table 6.

Experimental dataset configuration.

Label	Fault type	Sample size	Signal type	Experimental conditions
0	Cracks at the root of the sun gear teeth	30	Current, vibration	1/2/3
1	Cracks at the root of the planet gear teeth	20	Current, vibration	1/2/3
2	Single tooth surface wear of the sun gear	50	Current, vibration	1/2/3
3	Multiple tooth surface wear of the sun gear	70	Current, vibration	1/2/3
4	Single tooth surface wear of the planet gear	10	Current, vibration	1/2/3
5	Multiple tooth surface wear of the planet gear	20	Current, vibration	1/2/3
6	Normal condition	100	Current, vibration	1/2/3

The proposed small-sample multisource data fusion method initially inputs the raw samples into the pre-trained WGAN-RAG model for data augmentation, thereby expanding the total sample size to 700, with 100 samples allocated per fault category. The augmented one-dimensional signals are then transformed into two-dimensional time-frequency images (Figure 7) via STFT, whose parameter settings are specified in Table 7. Finally, these images are fed into the SCViT dual-branch feature fusion diagnostic network for fault identification.
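The augmentation budget implied by Table 6 can be worked out directly. The sketch below assumes that the original samples are retained and that each class is topped up with generated samples until it reaches 100; the text does not state whether the 100 samples per class include the originals, so this is one plausible reading.

```python
original = {0: 30, 1: 20, 2: 50, 3: 70, 4: 10, 5: 20, 6: 100}  # sample sizes from Table 6
target_per_class = 100

# Number of synthetic samples needed per label under the top-up assumption.
to_generate = {label: target_per_class - n for label, n in original.items()}
total_original = sum(original.values())
total_generated = sum(to_generate.values())
total_augmented = total_original + total_generated

print(to_generate, total_original, total_generated, total_augmented)
```

Under this reading, 300 original samples are complemented by 400 generated ones, giving the stated total of 700. The most under-represented class (label 4, with 10 originals) would then consist of 90% generated data, which is why the fidelity of the generator matters most for the rarest faults.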

Figure 7.

STFT maps of generated images: (a) vibration signal and (b) current signal. STFT: short-time Fourier transform.

Table 7.

STFT parameter settings.

Window function	Window length	$N_{overlap}$	$N_{fft}$	$f_s$
Hamming window	256	128	1240	25,600

STFT: short-time Fourier transform.

Evaluation of generated data quality

To evaluate the data generation quality of the improved GAN network, this study conducted a comparative analysis involving four classical GAN models: A Novel GAN-Based Data Augmentation Method Coupled Time–Frequency Domain (FTGAN),³⁸ Auxiliary Classifier Generative Adversarial Network (ACGAN),³⁹ Deep Convolutional Generative Adversarial Networks (DCGAN),⁴⁰ and WGAN.³⁰ Figure 8 presents the comparison of the generated time-domain signals from each model. It is observed that, for both current and vibration signals, the data generated by the proposed model exhibit a high degree of similarity in pattern to the real data, whereas the data generated by the other models contain varying levels of noise interference, particularly in the cases of DCGAN and WGAN. To further validate the effectiveness of the proposed model, signal similarity was analyzed using probability density function (PDF) curves. Figure 9 presents a comparison of PDF curves between the original data and those generated by various models for cracks at the root of the sun gear teeth (damage size 0.1 mm). The results indicate that the probability distribution of the data generated by the proposed WGAN-RAG model is most similar to the real data, with Pearson’s correlation coefficients reaching 0.8872 for vibration signals and 0.8901 for current signals, which are significantly higher than those of the comparative models (Table 8). The results demonstrate that WGAN-RAG maintains high generation fidelity across all fault categories in the RV reducer dataset, including those categories with relatively small original sample sizes. This indicates the model’s robustness to imbalanced and small-sample distributions, laying a foundation for its application to other mechanical systems with similar data characteristics.

Figure 8.

Time- and frequency-domain comparison of signals generated by different models. Top row: time-domain plots of (a) vibration signal and (b) electric current. Bottom row: frequency-domain plots of (a) vibration signal and (b) electric current.

Figure 9.

Probability density curves of generated data by various models for label 0: cracks at the root of the sun gear teeth: (a) vibration signal and (b) current signal.

Table 8.

Pearson’s correlation coefficients of generated data.

Model	Vibration	Current
WGAN-RAG (Proposed)	0.8872	0.8901
FTGAN	0.4751	0.6914
ACGAN	0.4463	0.7081
DCGAN	0.3327	0.5672
WGAN	0.3701	0.5585

WGAN: Wasserstein generative adversarial network; RAG: residual and attention guidance.
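One possible way to obtain a similarity score of the kind reported in Table 8 is to correlate histogram-based density estimates of the real and generated samples. The exact procedure used in this study is not detailed here, so the binning, the function name `pdf_pearson`, and the synthetic Gaussian data below are assumptions for illustration only.

```python
import numpy as np

def pdf_pearson(real, generated, bins=50):
    """Pearson correlation between histogram density estimates of two samples."""
    lo = min(real.min(), generated.min())
    hi = max(real.max(), generated.max())
    edges = np.linspace(lo, hi, bins + 1)       # shared bin edges for both samples
    p_real, _ = np.histogram(real, bins=edges, density=True)
    p_gen, _ = np.histogram(generated, bins=edges, density=True)
    return np.corrcoef(p_real, p_gen)[0, 1]

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 5000)
close = rng.normal(0.0, 1.0, 5000)      # drawn from the same distribution as `real`
shifted = rng.normal(3.0, 1.0, 5000)    # drawn from a clearly different distribution

r_close = pdf_pearson(real, close)
r_shifted = pdf_pearson(real, shifted)
print(r_close, r_shifted)
```

A generator that reproduces the amplitude distribution of the real signal scores close to 1, while a mismatched distribution scores much lower. Note that such a score compares amplitude statistics only and is insensitive to the temporal ordering of the samples.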

The proposed GAN-based network was used to augment the original small-sample dataset, expanding each fault category to 100 samples. The dataset was divided into 75% for training and 25% for testing. The dual-branch model for vibration and current signals was trained on the training set, and its performance was evaluated on the testing set.
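A 75/25 partition of the balanced dataset can be sketched as follows. Whether the split in this study was stratified by class is not stated; the per-class (stratified) split, the function name, and the random seed below are assumptions.

```python
import numpy as np

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return train/test index arrays with the same class ratio in both."""
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        n_test = int(round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return np.array(train_idx), np.array(test_idx)

labels = np.repeat(np.arange(7), 100)     # 7 classes x 100 augmented samples
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

With 700 samples this gives 525 training and 175 testing samples, that is, 75 and 25 per class. Splitting per class keeps the test set balanced, so overall accuracy is not dominated by any single fault type.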

Validation of diagnostic performance for dual-branch architecture

To verify the rationality of the dual-branch architecture design, five comparative experiments were conducted under condition 1. The CNN-SCConv, CNN-ViT, CNN, and ViT branches were each evaluated independently, and their results were compared with those of the dual-branch fusion diagnostic model. Figure 10 illustrates the diagnostic accuracy curves for each architecture. The proposed dual-branch diagnostic model achieves the highest accuracy, 97.40%, with the most stable diagnostic curve. The next best results come from the CNN-ViT branch and the CNN-SCConv branch, with accuracies of 93.94 and 90.91%, respectively. Models employing only the CNN or only the ViT branch exhibited accuracies well below 90% and showed greater fluctuations.

Figure 10.

Comparative diagnostic accuracy curves of single-branch and dual-branch architectures under condition 1.

Analysis of the advantages of multisource data fusion

To compare the advantages of multisource data fusion versus single-modality diagnosis, ablation experiments were conducted in this study. Figure 11 presents the diagnostic accuracy curves of single-modality and multimodality fusion under three working conditions:

1) Condition 1 (full rotation, unidirectional forward rotation, no load): the vibration single-modality accuracy reached 94.57%, the current single-modality accuracy stabilized at 91.56%, while the dual-modality fusion accuracy rapidly stabilized at 97.40%;

2) Conditions 2 (partial rotation, reciprocating, with load) and 3 (partial rotation, reciprocating, no load): the vibration single-modality curves exhibited oscillations, whereas the dual-modality fusion diagnostic accuracies stabilized at 97.83 and 96.54%, respectively.

Figure 11.

Diagnostic results under different working conditions: (a) condition 1, (b) condition 2 and (c) condition 3.

Notably, the current single-modality showed convergence delay characteristics across all conditions, whereas the vibration-current fusion diagnostic model achieved the highest accuracy, faster convergence speed, and stronger stability under all conditions.

For a systematic evaluation of classification reliability, global performance analysis was conducted using confusion matrices (Figure 12). It should be noted that the confusion matrix is derived from data under condition 1, serving as a typical case to visualize classification results. The diagonal elements of the confusion matrices represent the classification accuracy of each fault type. The diagonal values of the multisource data fusion approach are close to 1.0, indicating minimal misclassification; in contrast, the number of misclassifications significantly increased with the single-modality approach.

Figure 12.

Confusion matrices for each diagnostic method: (a) multisource data fusion, (b) vibration signal and (c) electric current.

To gain deeper insight into the diagnostic performance of the model, t-distributed stochastic neighbor embedding (t-SNE) was employed for dimensionality reduction and visualization of features (Figure 13). Consistent with the confusion matrix, the t-SNE visualization is also based on condition 1 data. The results show that the fault feature clusters derived from multisource data fusion have clear boundaries and high separability, whereas the features from single-modality data exhibit significant overlap, leading to reduced diagnostic accuracy.

Figure 13.

t-SNE visualization of fault diagnosis for multisource data fusion and single-source data: (a) multi-source data fusion, (b) vibration signal and (c) electric current. t-SNE: t-distributed stochastic neighbor embedding.

To further quantify the discriminative ability of the model, receiver operating characteristic (ROC) curves and the corresponding area under the curve (AUC) values were utilized for performance evaluation (Figure 14). Similar to the confusion matrix and t-SNE visualization, the ROC curves are generated based on condition 1 data. ROC curves are superior for performance assessment as they comprehensively reflect the trade-off between true positive rate (TPR) and false positive rate (FPR) across all classification thresholds, rather than relying on a single threshold, thus providing a more robust and holistic evaluation of the model’s discriminative power. As shown in Figure 14, the macro-average ROC curve of multisource fusion (Figure 14(a)) is closest to the top-left corner (ideal classifier performance), while those of single-modality methods (vibration and electric current, Figure 14(b) and (c)) deviate noticeably. In addition, the per-class curves of multi-source fusion cluster tightly near the top, indicating stable and strong discriminative power across all faults. In contrast, single-modality curves are more dispersed, with some deviating further, confirming the superiority of multi-source fusion.

Figure 14.

ROC curves of multisource data fusion versus single-source data for fault diagnosis: (a) multisource data fusion, (b) vibration signal and (c) electric current. ROC: receiver operating characteristic.
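The macro-average AUC can be computed without explicitly tracing the ROC curve, using its interpretation as the probability that a positive sample receives a higher score than a negative one. The sketch below is an illustrative one-vs-rest implementation; the six-sample, three-class score matrix is invented for the example and is unrelated to the experimental data.

```python
import numpy as np

def binary_auc(scores, positives):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[positives]
    neg = scores[~positives]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

def macro_auc(probs, labels):
    """Unweighted mean of one-vs-rest AUCs over the classes."""
    return np.mean([binary_auc(probs[:, k], labels == k) for k in range(probs.shape[1])])

labels = np.array([0, 0, 1, 1, 2, 2])
probs = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],   # class-1 sample for which class 0 still has the top score
    [0.1, 0.2, 0.7],
    [0.2, 0.2, 0.6],
])
auc = macro_auc(probs, labels)
accuracy = (probs.argmax(axis=1) == labels).mean()
print(auc, accuracy)
```

In this toy case the macro AUC equals 1.0 although one sample is misclassified by the arg-max rule. AUC is a ranking measure evaluated per class, which explains how an AUC at or near 1.0 can coexist with an accuracy below 100%.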

For a comprehensive and quantitative comparison of diagnostic performance across all three working conditions, Table 9 (diagnostic performance comparison for multisource data fusion and single-source data) summarizes the Accuracy, F1 Score, and AUC values of multisource fusion, vibration single-modality, and electric current single-modality under each condition. The F1 Score, as a harmonic mean of precision and recall, effectively balances the two metrics and is particularly suitable for evaluating classification performance in fault diagnosis tasks with potential class imbalance. The table provides complete statistical results, including standard deviations, which compensate for the lack of visualization results for conditions 2 and 3. It is worth noting that visualizations for conditions 2 and 3 are omitted to avoid redundant content, as their variation trends are consistent with condition 1. Multisource fusion consistently outperforms single-modality methods in terms of accuracy, stability, and discriminative ability. The statistical results in Table 9 fully validate the generality and effectiveness of the proposed multisource fusion strategy across different working conditions.

Table 9.

Diagnostic performance comparison for multisource data fusion and single-source data.

Working condition	Data type	Accuracy (%)	F1 Score	AUC
Condition 1	Multisource data fusion	97.400 ± 0.136	0.972 ± 0.001	1.0000 ± 0.0000
	Vibration signal	94.569 ± 1.000	0.943 ± 0.009	0.9990 ± 0.0006
	Electric current	91.565 ± 1.500	0.912 ± 0.015	0.9960 ± 0.0010
Condition 2	Multisource data fusion	97.830 ± 0.095	0.976 ± 0.001	0.9995 ± 0.0001
	Vibration signal	92.083 ± 1.421	0.917 ± 0.015	0.9985 ± 0.0008
	Electric current	93.092 ± 0.179	0.928 ± 0.002	0.9975 ± 0.0004
Condition 3	Multisource data fusion	96.540 ± 0.125	0.963 ± 0.001	0.9992 ± 0.0001
	Vibration signal	91.019 ± 0.931	0.906 ± 0.010	0.9980 ± 0.0007
	Electric current	92.190 ± 1.300	0.919 ± 0.013	0.9970 ± 0.0009

AUC: area under the curve.
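For reference, the macro-averaged F1 score can be derived from a confusion matrix as sketched below. Whether the F1 values in Table 9 are macro- or weighted-averaged is not stated, so macro averaging is an assumption, and the 3 × 3 matrix is a hypothetical example rather than an experimental result.

```python
import numpy as np

def macro_f1(confusion):
    """Macro-averaged F1 from a confusion matrix (rows: true, columns: predicted)."""
    tp = np.diag(confusion).astype(float)
    precision = tp / confusion.sum(axis=0)
    recall = tp / confusion.sum(axis=1)
    f1 = 2 * precision * recall / (precision + recall)
    return f1.mean()

# Hypothetical 3-class confusion matrix, not taken from the reported results.
confusion = np.array([
    [9, 1, 0],
    [1, 8, 1],
    [0, 0, 10],
])
accuracy = np.trace(confusion) / confusion.sum()
f1 = macro_f1(confusion)
print(accuracy, f1)
```

The sketch assumes that every class is predicted at least once; otherwise the precision of an unpredicted class is undefined and needs explicit handling.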

Comprehensive performance comparison across models

To validate the advantages of the proposed method, comparative tests were conducted between the proposed model and classical models, including entropy-weighted complementary ensemble empirical mode decomposition and support vector machine (EWCEEMD-SVM), Multi-Scale Convolutional Neural Network (MSCNN), and Transformer, under three working conditions. Detailed comparison results are presented in Table 10 and Figure 15:

Condition 1 (full rotation, unidirectional forward rotation, no load): the proposed model outperforms all comparison methods with an accuracy of 97.4%, which is 3.1 percentage points higher than that of the traditional spectrum analysis model EWCEEMD-SVM. Its standard deviation is also 78.1% lower than that of the Transformer, the most accurate of the comparison models;

Condition 2 (partial rotation, reciprocating, with load): facing the coupled interference of the 8 kg dynamic load and local angular displacement, the proposed model attained an accuracy of 97.83% with a standard deviation of 0.095, a 59.2% reduction compared with the 0.233 of the Multi-Scale Convolutional Neural Network-Long Short-Term Memory (MCNN-LSTM) model (Table 10);

Condition 3 (partial rotation, reciprocating, no load): the proposed method maintained the highest diagnostic accuracy of 96.54%, and its prediction stability (standard deviation 0.125) improved by 90.6% compared with the Subdomain Adaptation Transfer Learning Network (SATLN).

Table 10.

Diagnostic accuracy of different models under various working conditions.

Model	Condition 1		Condition 2		Condition 3
	Accuracy (%)	Standard deviation	Accuracy (%)	Standard deviation	Accuracy (%)	Standard deviation
SCViT (Proposed)	97.4	0.136	97.83	0.095	96.54	0.125
MSCNN⁴¹	95.29	0.311	95.0	0.203	94.5	0.201
MCNN-LSTM⁴²	95.3	0.201	96.0	0.233	95.3	0.356
EWCEEMD-SVM⁴³	94.3	0.269	95.2	0.160	95.1	0.106
SATLN⁴⁴	94.9	1.35	95.6	0.365	94.5	1.325
XGBoost⁴⁵	93.0	0.568	94.2	0.659	93.9	0.986
BA-PNN⁴⁶	94.0	0.667	95.6	0.365	94.6	0.558
Transformer⁴⁷	95.6	0.620	95.5	0.569	94.9	0.336

SCViT: Self-Calibrated Convolution and Vision Transformer Fusion Network for Dual-Modality Classification; MSCNN: Multi-Scale Convolutional Neural Network; MCNN: Multi-scale Convolutional Neural Network; LSTM: Long Short-Term Memory; EWCEEMD-SVM: entropy-weighted complementary ensemble empirical mode decomposition and support vector machine; SATLN: Subdomain Adaptation Transfer Learning Network; XGBoost: eXtreme Gradient Boosting; BA-PNN: Bat Algorithm-Optimized Probabilistic Neural Network.

Figure 15.

Diagnostic results of different models under various working conditions.

Across the three working conditions, the standard deviation of the proposed model was controlled within the range of 0.095–0.136, representing a reduction of 49.4–90.6% compared to representative models. These results demonstrate the superior performance of the proposed method under multidimensional variable coupling scenarios, including full and non-full rotations, as well as loaded and unloaded states.
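The relative reductions quoted in this comparison follow from the standard deviations in Table 10 as (baseline − proposed) / baseline. The following minimal Python sketch reproduces that arithmetic for the baselines named in the text. The variable and function names are illustrative assumptions, and the values are copied from Table 10.

```python
# Standard deviations copied from Table 10 (condition -> model -> std).
STD = {
    "Condition 1": {"SCViT": 0.136, "Transformer": 0.620, "EWCEEMD-SVM": 0.269},
    "Condition 2": {"SCViT": 0.095, "MCNN-LSTM": 0.233},
    "Condition 3": {"SCViT": 0.125, "SATLN": 1.325},
}


def relative_reduction(baseline: float, proposed: float) -> float:
    """Percentage by which `proposed` is lower than `baseline`."""
    return 100.0 * (baseline - proposed) / baseline


def reductions(table):
    """Map (condition, baseline model) to the SCViT std reduction in percent."""
    out = {}
    for condition, stds in table.items():
        ours = stds["SCViT"]
        for model, std in stds.items():
            if model != "SCViT":
                out[(condition, model)] = round(relative_reduction(std, ours), 1)
    return out


if __name__ == "__main__":
    for (condition, model), pct in reductions(STD).items():
        print(f"{condition}: SCViT std is {pct}% lower than {model}")
```

With the Table 10 values, this yields 78.1% (Transformer, Condition 1), 49.4% (EWCEEMD-SVM, Condition 1), 59.2% (MCNN-LSTM, Condition 2), and 90.6% (SATLN, Condition 3).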

Stability evaluation of small-sample augmentation

To evaluate the contribution of the GAN-based augmentation strategy for small-sample enhancement, this study compares the comprehensive diagnostic performance of different models under multiple augmentation ratios of 5:1, 10:1, 20:1, 50:1, and 100:1 using radar charts (Figure 16). The results indicate that the proposed model maintains high accuracy across all augmentation ratios; in contrast, other models, such as MSCNN and MCNN-LSTM, perform acceptably at low augmentation ratios (5:1), but their performance significantly deteriorates as the augmentation ratio increases to 50:1 or 100:1, due to the insufficient quality of generated samples.

Figure 16.

Performance comparison of models under imbalanced sample ratios.

The experiments demonstrate that the improved GAN proposed in this paper, by dynamically balancing the generation ratio, effectively alleviates model bias caused by data scarcity and maintains stable performance under high augmentation ratio scenarios, showing strong adaptability to small-sample conditions.

Conclusion

To address key challenges in small-sample fault diagnosis of RV reducers, namely data scarcity and insufficient generation quality, inadequate multimodal information fusion strategies, and limited feature representation capability under complex working conditions, this study proposes a collaborative diagnostic framework, SCViT, that integrates a WGAN-RAG network with cross-modal feature fusion. The following innovations have led to significant improvements in diagnostic performance:

WGAN-RAG architecture for small-sample data augmentation: This approach improves the generation quality of small-sample fault data and effectively mitigates data distribution skewness. The probability distributions of generated data closely approximate those of real data, with Pearson correlation coefficients reaching 0.8872 (vibration) and 0.8901 (current), respectively.

SCViT dual-branch fusion diagnostic network fully leveraging modal characteristics: Based on vibration and current signals, a dual-path feature extraction mechanism is constructed, integrating parameter-shared CNN, self-calibrated convolution, and ViT. The differentiated feature extraction strategy effectively overcomes the modality limitations of single-source signals, significantly improving fusion diagnostic accuracy. Under load conditions, the standard deviation of the diagnostic accuracy is reduced to one third of that of single-modality methods.

Validation under multidimensional working conditions: Experiments demonstrate that the proposed method maintains stable diagnostic accuracy within the range of 96.54–97.83% across complex scenarios, including complete cycle motion, reciprocating motion, and load. Compared to state-of-the-art deep learning frameworks, the standard deviation is reduced by 49.4–90.6%, validating the strong adaptability of the method to complex mechanical motions.

Stability under high-ratio sample augmentation: Systematic comparisons under augmentation ratios from 5:1 to 100:1 verify that the framework maintains stable performance in high-ratio data expansion scenarios, confirming its strong adaptability to small-sample conditions.
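The Pearson correlation coefficients quoted for the WGAN-RAG generated data compare the probability distributions of generated and real signals. The following minimal Python sketch shows one way such a comparison could be computed. It assumes the distributions are estimated with equal-width histograms over a shared range, since the estimation procedure is not specified in this section. The function names and the synthetic Gaussian samples are placeholders, not the paper's code or data.

```python
import math
import random


def histogram_density(samples, lo, hi, bins):
    """Equal-width histogram normalised so that the bin probabilities sum to 1."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for x in samples:
        if lo <= x <= hi:
            idx = min(int((x - lo) / width), bins - 1)  # clamp x == hi into last bin
            counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def distribution_similarity(real, generated, bins=30):
    """Correlate the histogram estimates of two samples over a shared range."""
    lo = min(min(real), min(generated))
    hi = max(max(real), max(generated))
    p_real = histogram_density(real, lo, hi, bins)
    p_gen = histogram_density(generated, lo, hi, bins)
    return pearson(p_real, p_gen)


if __name__ == "__main__":
    rng = random.Random(0)
    # Placeholder signals: stand-ins for real and generated amplitudes.
    real = [rng.gauss(0.0, 1.0) for _ in range(5000)]
    generated = [rng.gauss(0.05, 1.1) for _ in range(5000)]
    print(f"Pearson r = {distribution_similarity(real, generated):.4f}")
```

A value close to 1 indicates that the two histograms have nearly the same shape. The bin count affects the estimate, so it should be kept fixed when comparing modalities.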

Experimental results indicate that the proposed approach effectively addresses critical challenges in small-sample fault diagnosis of RV reducers, exhibiting superior engineering applicability under various working conditions and providing a novel technical solution for intelligent operation and maintenance of rotating machinery.

Despite the aforementioned advancements and promising results, this study still has certain limitations that warrant further attention. Specifically, under conditions of imbalanced original data distributions, the proposed framework tends to yield comparatively higher misclassification rates for fault categories with limited initial samples than for those with sufficient data. To address these limitations and enhance the practical applicability of the framework, future work will focus on the following directions. First, to improve the diagnostic performance for rare faults in imbalanced data scenarios, we plan to integrate advanced few-shot learning techniques with the existing WGAN-RAG framework. In particular, the incorporation of meta-learning mechanisms is envisaged to empower the model to rapidly adapt to novel fault diagnosis tasks with limited labeled samples, thereby improving its generalization capability across imbalanced distributions. Second, considering the stringent requirements for real-time processing and computational efficiency in industrial applications, subsequent research will also concentrate on model lightweight design and inference acceleration to facilitate efficient deployment of the framework in industrial environments.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the following projects: (1) National Natural Science Foundation of China, “Research on Industrial Robot Joint Health Condition Assessment and Evolution Mechanism Based on Current Information and Modal Analysis” (project no. 52165065) and (2) Yunnan Provincial Department of Science and Technology General Program, “Research on Industrial Robot Health Assessment and Damage Mechanism Based on Multi-source Sensors and Multi-deep Learning Model Fusion” (project no. 02401AT070346).

ORCID iDs

Meng Meiwang

Wang Zhihai

Liu Xiaoqin

Liu Tao

Xu Qiyang

References

1. Yang, Zhang, Cheng, et al. Reliability-based design optimization for RV reducer with experimental constraint. Struct Multidisc Optim 2021; 63(4): 2047–2064.

2. Yang, Zhou, Chang, et al. A modelling approach for kinematic equivalent mechanism and rotational transmission error of RV reducer. Mech Mach Theory 2021; 163: 104384.

3. Wang, Mao. Design of high power density for RV reducer. J Brazil Soc Mech Sci Eng 2021; 43(6): 1–12.

4. Chen, Zhang, et al. Application of nonlinear output frequency response functions and deep learning to RV reducer fault diagnosis. IEEE Trans Instrum Meas 2021; 70: 1–14.

5. Xie, Deng YQ. A dynamic approach for evaluating the moment rigidity and rotation precision of a bearing-planetary frame rotor system used in RV reducer. Mech Mach Theory 2022; 173: 104851.

6. Ahn H-J, Choi, Lee, et al. Impact analysis of tolerance and contact friction on a RV reducer using FE method. Int J Precis Eng Manuf 2021; 22(7): 1285–1292.

7. Yang, Liu, et al. Acoustic emission signal fault diagnosis based on compressed sensing for RV reducer. Sensors 2022; 22(7): 2641.

8. Liu, Yang, et al. An improved convolutional capsule network for compound fault diagnosis of RV reducers. Sensors 2022; 22(17): 6442.

9. Liu, Wang, et al. Network lightweight method based on knowledge distillation is applied to RV reducer fault diagnosis. Meas Sci Technol 2023; 34(9): 095110.

10. Raouf, Lee, Kim HS. Mechanical fault detection based on machine learning for robotic RV reducer using electrical current signature analysis: a data-driven approach. J Comput Des Eng 2022; 9(2): 417–433.

11. Jiang, Zhang, Wei, et al. Fault diagnosis of RV reducer based on denoising time-frequency attention neural network. Expert Syst Appl 2024; 238(Part B): 121762.

12. Wang, Liu, et al. Transient feature extraction of encoder signal based on envelope demodulation for fault diagnosis of RV reducer. IEEE Sensors J 2024; 24: 21082–21092.

13. Qiu, Nie, Peng, et al. A variable-speed-condition fault diagnosis method for crankshaft bearing in the RV reducer with WSO-VMD and ResNet-SWIN. Qual Reliab Eng Int 2024; 40: 2321–2347.

14. Meng, Kong, et al. Small sample fault diagnosis method for wind turbine gearbox based on optimized generative adversarial networks. Eng Fail Anal 2022; 140: 106573.

15. Yang, Liu, Xie, et al. Conditional GAN and 2-D CNN for bearing fault diagnosis with small samples. IEEE Trans Instrum Meas 2021; 70: 1–12.

16. Luo, Yin, Yuan, et al. An intelligent method for early motor bearing fault diagnosis based on Wasserstein distance generative adversarial networks meta learning. IEEE Trans Instrum Meas 2023; 72: 1–11.

17. Huang, et al. PCASTNet: a physics-constrained adaptive style transfer network for sample generation in cross-machine small-sample fault diagnosis. IEEE Trans Instrum Meas 2025; 74: 1–17.

18. Huang, Liu, et al. A simulation-to-real transformation for small-sample fault diagnosis in aeroengine dual-rotor systems. IEEE Trans Instrum Meas 2026; 75: 1–23.

19. Zhang, Gao, Shi. Bearing fault diagnosis method based on multi-source heterogeneous information fusion. Meas Sci Technol 2022; 33(7): 075901.

20. Liu, Wang, et al. Multi-feature fusion for fault diagnosis of rotating machinery based on convolutional neural network. Comput Commun 2021; 173: 160–169.

21. Xia, Zhou, Shi, et al. A fault diagnosis method with multi-source data fusion based on hierarchical attention for AUV. Ocean Eng 2022; 266: 112595.

22. Peng, Xia, et al. An intelligent fault diagnosis method for rotating machinery based on data fusion and deep residual neural network. Appl Intell 2021; 52(3): 3051–3065.

23. Lingli, Shuhui, Xuejun, et al. Fault diagnosis of a planetary gearbox based on a local bi-spectrum and a convolutional neural network. Meas Sci Technol 2022; 33(4): 045008.

24. Mansouri, Dhibi, Hajji, et al. Interval-valued reduced RNN for fault detection and diagnosis for wind energy conversion systems. IEEE Sensors J 2022; 22(13): 13581–13588.

25. Cao, Yunusa-Kaltungo. An automated data fusion-based gear faults classification framework in rotating machines. Sensors 2021; 21(9): 2957.

26. Immovilli, Bellini, Rubini, et al. Diagnosis of bearing faults in induction machines by vibration or current signals: a critical comparison. IEEE Trans Ind Appl 2010; 46: 1350–1359.

27. HyangFu, et al. Vibration source inversion-based fault diagnosis: approach and application. J Sound Vib 2025; 597: 118818.

28. Feng, Chen, Zuo MJ. Induction motor stator current AM-FM model and demodulation analysis for planetary gearbox fault diagnosis. IEEE Trans Ind Informat 2019; 15(4): 2386–2394.

29. Goodfellow, Pouget-Abadie, Mirza, et al. Generative adversarial networks. Commun ACM 2014; 63: 139–144.

30. Arjovsky, Chintala, Bottou. Wasserstein generative adversarial networks. In: International conference on machine learning, Sydney, Australia, 2017, pp. 214–223. PMLR.

31. Gulrajani, Ahmed, Arjovsky, et al. Improved training of Wasserstein GANs. Adv Neural Inf Process Syst 2017; 30: 5767–5777.

32. Alzubaidi, Zhang, Humaidi, et al. Review of deep learning: concepts, CNN architectures, challenges, applications, future directions. J Big Data 2021; 8: 53.

33. Zhang, Zuo, Zhang. FFDNet: toward a fast and flexible solution for CNN-based image denoising. IEEE Trans Image Process 2017; 27: 4608–4622.

34. Vaswani, Shazeer, Parmar, et al. Attention is all you need. Neural Inform Process Syst 2017; 7: 1–15.

35. Liu. Rolling bearing fault diagnosis based on STFT-deep learning and sound signals. Shock Vib 2016; 2016(1): 6127479.

36. Liu, Hou, Cheng, et al. Improving convolutional networks with self-calibrated convolutions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, Virtual Event, 2020, pp. 10096–10105. New York, NY: IEEE.

37. Dosovitskiy, Beyer, Kolesnikov, et al. An image is worth 16x16 words: transformers for image recognition at scale. IEEE Trans Pattern Anal Mach Intell 2021; 44: 5185–5195.

38. Wang, Lang, et al. FTGAN: a novel GAN-based data augmentation method coupled time–frequency domain for imbalanced bearing fault diagnosis. IEEE Trans Instrum Meas 2023; 72: 1–14.

39. Odena, Olah, Shlens. Conditional image synthesis with auxiliary classifier GANs. Int Conf Mach Learn 2017; 70: 2642–2651.

40. Radford, Metz, Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the International Conference on Learning Representations (ICLR), 2016.

41. Men, Chen, et al. Railway wagon bearing fault diagnosis method based on improved sparrow search algorithm optimizing variational mode decomposition and multi-level convolutional neural network. Rev Sci Instrum 2024; 95(4): 045104.

42. Chen, Zhang, Gao. Bearing fault diagnosis base on multi-scale CNN and LSTM model. J Intell Manuf 2020; 32: 971–987.

43. Men, et al. A hybrid intelligent gearbox fault diagnosis method based on EWCEEMD and whale optimization algorithm-optimized SVM. Int J Struct Integr 2023; 14(2): 322–336.

44. Wang, Yang, et al. Subdomain adaptation transfer learning network for fault diagnosis of roller bearings. IEEE Trans Ind Electron 2022; 69: 8430–8439.

45. Sun, Lou. Fault diagnosis of controlled pitch propeller hydraulic system test bench based on XGBoost method [in Chinese]. Ship Eng 2024; 46(S2): 190–196.

46. Yang, Chen, et al. BA-PNN-based methods for power transformer fault diagnosis. Adv Eng Inform 2019; 39: 178–185.

47. Liu, Han, et al. Attention on the key modes: machinery fault diagnosis transformers through variational mode decomposition. Knowl Based Syst 2024; 289: 111479.