A parallel neural network method based on multi-level fusion of acoustic-vibration signals for slight and compound fault diagnosis of rolling bearings

Abstract

Conventional diagnosis methods relying solely on vibration signals often fail to identify the weak fault features of slight and compound faults, particularly under noise interference. To address these limitations, this study proposes a parallel neural network method based on multi-level fusion of acoustic-vibration signals for rolling bearing fault diagnosis. First, Filter Bank (Fbank) features are extracted from both acoustic and vibration signals to enhance the representation of slight and compound faults, addressing the challenges posed by weak fault features and high noise sensitivity. Second, a CNN-BiGRU parallel neural network is constructed to comprehensively capture weak fault features by integrating a convolutional neural network (CNN) for spatial feature extraction with a bidirectional gated recurrent unit (BiGRU) for temporal feature extraction. Finally, a multi-level fusion strategy that combines feature-level and decision-level fusion is adopted to exploit complementary information from rolling bearings, improving diagnostic accuracy and overcoming the reduced reliability of slight and compound fault diagnosis caused by incomplete fault information in unimodal diagnostic methods. Experimental results on a slight and compound fault dataset demonstrate that the proposed method achieves superior diagnostic performance under various noise conditions: it reaches a diagnostic accuracy of 99.55% at SNR = 10 dB and maintains over 90% accuracy even at SNR = −10 dB, outperforming conventional diagnostic approaches.

Keywords

slight and compound fault diagnosis; acoustic-vibration signal; filter bank feature; parallel neural network; multi-level fusion

1. Introduction

Rolling bearings are critical components of rotating mechanical systems and are widely used in industries such as aerospace (Rejitha et al., 2023), automotive (Jin et al., 2023), and energy (Liu and Zhang, 2020). The operating condition of rolling bearings significantly affects the performance and lifespan of machinery. Rolling bearings often operate continuously under complex and adverse conditions, which accelerates wear and fatigue and eventually leads to faults. Such faults can interrupt normal production and, in severe cases, cause safety accidents (Deng et al., 2025; Lourari et al., 2025). Consequently, accurate monitoring and reliable identification of bearing operating conditions are of considerable practical importance, prompting sustained research interest in this area (Aburakhia et al., 2022; Zhang et al., 2026).

Existing rolling bearing fault diagnosis approaches can be broadly classified into three categories: signal-based methods, conventional machine learning-based methods, and deep learning-based methods. Signal-based approaches assess bearing condition by analyzing signals in the time domain, frequency domain, or time-frequency domain (Gao et al., 2015). Conventional machine learning-based methods typically rely on sensor-acquired data that reflect bearing operating states. The collected data are subsequently processed using models designed based on prior knowledge, in which features are manually selected and extracted (Chen et al., 2023). Although these two categories of methods have achieved notable performance in bearing fault detection, each exhibits inherent limitations that leave a significant research gap in complex industrial applications. Signal-based approaches often struggle with the non-stationary and nonlinear characteristics of signals captured under variable operating conditions, and they rely heavily on specialized prior knowledge and subjective expert interpretation. Conventional machine learning-based methods, in turn, depend on manually designed feature extraction procedures (Zhu et al., 2020). This reliance not only introduces the risk of including irrelevant or redundant features but also restricts the model’s ability to adaptively learn discriminative representations from raw data, thereby limiting its practical applicability in automated diagnostic systems. By comparison, deep learning-based methods have attracted increasing attention with the rapid development of data acquisition and sensing technologies. These methods are capable of automatically learning discriminative fault representations from large-scale sensor data and constructing effective diagnosis models, which makes them well suited for complex mechanical systems operating under strong interference. 
In addition, deep learning-based approaches reduce reliance on handcrafted features and extensive diagnostic expertise (Li et al., 2025, 2026; Liu et al., 2025; Niu et al., 2021a). Recent advances have further enhanced the capability of deep learning models in cross-domain and data-scarce scenarios. For example, one study proposed a conditional distribution-guided adversarial transfer learning network with multi-source domains to effectively transfer knowledge from different machines for rolling bearing fault diagnosis (Wu et al., 2023). Another study developed an adaptive fused domain-cycling variational generative adversarial network that addresses data scarcity by generating high-quality synthetic data and performing adaptive fusion with real samples (Wang et al., 2026a). Additionally, a spatial-channel collaborative multi-scale graph interaction deep transfer learning method was introduced for unsupervised rotating machinery fault diagnosis, which enhances multi-source feature interaction and prototype extraction (Wang et al., 2026b).

The fundamental principle of deep learning-based fault diagnosis lies in establishing a direct mapping between sensor-acquired fault data (e.g., vibration signals) and fault categories using deep learning models. This end-to-end framework enables automatic feature learning and fault classification from raw data, without explicit reliance on prior knowledge (Dybala and Zimroz, 2014; Wen et al., 2018). This framework is especially advantageous for diagnosing individual faults. Beyond vibration signals, acoustic signals have proven effective for fault detection, particularly in environments where vibration sensors are less reliable, such as high-temperature, corrosive, or confined spaces. Acoustic measurements offer non-invasive, real-time monitoring, enabling effective fault detection under such challenging conditions (Mamun et al., 2025). However, in real industrial scenarios, the diagnostic challenge is further intensified by the fact that rotating mechanical systems frequently produce compound faults where multiple defects coexist. In this study, slight faults are defined as low-severity physical defects that generate extremely weak impulsive signatures. These subtle signals are easily submerged by strong background noise or masked by the dominant energy components of other concurrent faults. Existing diagnostic models and feature extraction methods, which are primarily optimized for prominent single fault scenarios, often fail to isolate these subtle features. This leads to a significant decrease in diagnostic accuracy and a higher risk of missed detections. To provide a comprehensive understanding of compound fault diagnosis, a systematic review was conducted summarizing research status, challenges, and future prospects of rolling bearing compound fault diagnosis methods, covering analytical models, signal processing, and artificial intelligence approaches (Li et al., 2024). 
Furthermore, an interpretable subdomain enhanced adaptive network (ISEANet) was proposed to improve unsupervised cross-domain fault diagnosis of rolling bearings by incorporating sparse subsegment-guided noise reduction, lightweight multi-feature extraction, and improved local maximum mean discrepancy (Liu et al., 2024). Consequently, unimodal measurement methods may be insufficient for reliably diagnosing faults in rotating machinery under actual operational conditions. Multimodal diagnostic approaches, such as the joint utilization of acoustic and vibration signals, offer an effective means to address these challenges. By exploiting complementary information from different sensing modalities, these methods reduce the likelihood of missed fault detection and improve diagnostic accuracy (Ji et al., 2021).

Significant progress has been made in multimodal deep learning fault diagnosis, with several advanced methods proposed in recent studies. For instance, a deeply coupled autoencoder network has been introduced to fuse vibration and acoustic data for fault diagnosis of gears and bearings (Ma et al., 2018). Niu et al. converted multi-sensor data into grayscale images and employed a deep residual network for fault diagnosis (Niu et al., 2021b). Zhou et al. proposed a hybrid approach in which deep features were extracted from one-dimensional vibration data and two-dimensional image data using a stacked autoencoder (SAE) and a convolutional neural network (CNN), respectively (Zhou et al., 2019). They then performed feature fusion for fault diagnosis, validating their method on the Case Western Reserve University bearing dataset. Although the aforementioned studies effectively exploit multimodal data to enhance feature learning and improve diagnostic accuracy, they generally overlook the impact of noise commonly encountered in industrial environments, which limits their practical applicability. As a result, increasing research attention has been directed toward fault diagnosis under noisy conditions. For instance, one method uses a multi-sensor fusion algorithm based on convolutional neural network-long short-term memory (CNN-LSTM) networks to monitor bearing faults (Hao et al., 2020). Wang et al. extracted features directly from raw vibration and acoustic signals and integrated them using a one-dimensional CNN (Wang et al., 2021). Their method, validated on bearing diagnosis at various signal-to-noise ratios, demonstrated higher recognition accuracy than algorithms relying on single-modal sensors. Wang et al. transformed signals from multiple vibration sensors into images and fused them along the channel dimension to create feature-rich multichannel images (Wang et al., 2023). 
These methods substantially improve the diagnostic performance of rotating machinery through multimodal deep learning. Nevertheless, notable limitations remain, particularly the insufficient consideration of slight and compound faults.

Beyond the limitations discussed above, further challenges remain despite the progress achieved by multimodal deep learning-based fault diagnosis methods. For example, some studies have indicated that existing fault feature extraction and processing techniques still exhibit deficiencies. To address this problem, future efforts could focus on two key areas. First, developing feature extraction methods that provide rich frequency-domain and time-domain features would enable more precise fault characterization. Second, designing model architectures tailored to the fault features of the different signals acquired from equipment could enhance the learning capability for spatiotemporal features, further improving diagnostic accuracy.

In terms of feature extraction methods, many researchers have used Mel-frequency cepstral coefficients (MFCC) or gammatone frequency cepstral coefficients (GFCC) to capture distinct mechanical states for fault diagnosis. Geng et al. proposed a method combining GFCC-based time-frequency representations of acoustic signals with a convolutional neural network (CNN) for fault diagnosis (Geng et al., 2019). Yue et al. proposed a new method called mel frequency mapping classification (MFMC) for the nonlinear mapping and classification of fault features, which is capable of distinguishing various health states of machinery under fluctuating operating conditions (Yue et al., 2024). Zhou et al. introduced a state identification method based on acoustic signals from rolling tires to detect bulging issues in tire endurance tests. This method utilizes the modified mel frequency cepstral coefficients (SMFCC) to represent the acoustic signal characteristics (Zhou et al., 2025). Yan et al. combined an improved time-frequency spectral kurtosis (MTSK) feature parameter, which exploits the sensitivity of kurtosis to impact signals, with MFCC, which reflect the acoustic characteristics of the human ear, achieving a fault diagnosis accuracy of 99.6% at a speed of 20 km/h (Yan et al., 2022). However, the signal processing methods employed in the aforementioned studies exhibit critical limitations when integrated into deep learning-based diagnostic frameworks. While techniques such as MFCC offer high computational efficiency by using the discrete cosine transform (DCT) to highlight spectral envelope information, this compression is fundamentally designed for human speech recognition rather than mechanical fault detection. In the context of bearing diagnostics, the DCT step inevitably results in the loss of frequency-domain feature details and transient components that contain the signatures of slight physical defects. 
Consequently, these methods are not optimal for end-to-end deep learning models that require high fidelity features to accurately characterize machinery states.

To address the limitations of the aforementioned studies, this research employs multimodal (vibration and acoustic) signals processed using the Fbank method for fault diagnosis through multi-source data fusion targeting slight and compound bearing faults. First, the collected acoustic and vibration signals are converted into WAV audio files before undergoing Fbank-based processing. For diagnosing faults in rotating machinery, Fbank features effectively capture the spectral details of acoustic signals, while vibration signals are processed in the same way to extract critical information regarding machinery states. Unlike other feature extraction methods, Fbank features retain more fault-relevant information by omitting the discrete cosine transform (DCT) step. This allows Fbank to provide richer frequency-domain and time-domain features, which contribute to a more precise characterization of fault conditions. To further enhance the generalization and robustness of the fault diagnosis method, this study explores the use of a multi-source data fusion strategy. This technique integrates information from multiple sources according to specific rules to produce unified and more meaningful insights. Compared to single-source information, multi-source data fusion leverages the complementary nature of diverse information streams, enhancing the effectiveness and accuracy of diagnostic systems.

To address the limitations identified above, this research develops a parallel neural network method based on the multi-level fusion of acoustic-vibration signals. The main contributions are as follows:

(1) Fbank-based signal processing: To resolve the problem of information loss in traditional DCT-based methods, we adopt an Fbank approach. This innovation ensures the retention of richer fault-related information, providing a more comprehensive data foundation for characterizing slight and compound bearing faults under noisy conditions.

(2) CNN-BiGRU parallel neural network framework: To handle the complexity of compound fault signals, we design a parallel deep learning framework. Unlike serial models, this framework uses the CNN to extract spatial features and the BiGRU to capture temporal dependencies simultaneously, enabling the model to extract more effective fault signatures from complex, overlapped signals.

(3) A multi-level fusion approach: This study develops a multi-level fusion approach that integrates acoustic and vibration information at both the feature and decision levels. By exploiting the complementarity of multimodal signals, this approach effectively addresses the problem of insufficient information in single-source data, significantly improving diagnostic robustness for slight and compound faults.

2. Methodology

2.1. Fbank-based signal processing

The Fbank feature, short for Filter Bank feature, is an acoustic feature derived from the post-processing of the short time Fourier transform (STFT) of a speech signal. This feature effectively characterizes the spectral properties of speech by decomposing the audio signal into multiple frequency bands and extracting the energy information of each band. The Fbank-based signal processing utilized in this study is illustrated in Figure 1. The Fbank-based signal processing typically involves the following steps:

(1) Pre-emphasis: The input speech signal undergoes pre-emphasis to enhance high-frequency components and mitigate the effects of high-frequency attenuation.

(2) Framing: The speech signal is divided into short-duration frames, often with a certain degree of overlap between frames.

(3) Windowing: A window function (such as the Hamming window or the Hanning window) is applied to each frame to minimize edge effects.

(4) Fourier Transform: The short time Fourier transform (STFT) is applied to the windowed frames to compute the spectrum.

(5) Filter Bank Processing: The spectrum is processed through a set of overlapping filters (typically triangular filters), and the energy output of each filter is calculated.

(6) Logarithmic Scaling: The energy outputs of the filter bank are logarithmically scaled to align the feature representation with human auditory perception.

Figure 1.

Fbank-based signal processing.

This study adopts the Fbank method for signal processing instead of the conventional MFCC approach because of its superior capability in preserving fault-related spectral information. For Fbank extraction, the Mel filter bank energies are directly retained as:

F_{i} = \log \left( \sum_{k} |X(k)|^{2} H_{i}(k) \right)

(1)

where X(k) denotes the spectrum of the vibration or acoustic signal and H_i(k) represents the i-th Mel filter. Since no decorrelation transformation is applied, the original spectral energy distribution and the correlations among adjacent frequency bands are preserved. These characteristics are particularly important for rolling bearing diagnosis because slight and compound faults usually generate weak impulsive components and coupled modulation frequencies distributed across multiple frequency bands.
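As an illustration, the six processing steps and Eq. (1) can be sketched in a few lines of NumPy. This is a minimal sketch rather than the implementation used in this study: the frame length, hop size, FFT size, number of filters, pre-emphasis coefficient, and the common Mel conversion formula 2595·log10(1 + f/700) are all assumed values, and the function names are hypothetical.

```python
import numpy as np

def hz_to_mel(f):
    # Common Mel-scale convention (assumed; the paper does not state one).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters H_i(k) with centers equally spaced on the Mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):      # rising edge of the triangle
            bank[i - 1, k] = (k - left) / (center - left)
        for k in range(center, right):     # falling edge of the triangle
            bank[i - 1, k] = (right - k) / (right - center)
    return bank

def fbank(signal, sample_rate, frame_len=400, hop=160, n_fft=512,
          n_filters=40, pre_emph=0.97):
    """Return log filter bank energies of shape (n_filters, n_frames)."""
    # (1) Pre-emphasis: boost high-frequency components.
    x = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    # (2) Framing: overlapping frames of frame_len samples, hop samples apart.
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx]
    # (3) Windowing with a Hamming window.
    frames = frames * np.hamming(frame_len)
    # (4) Fourier transform of each frame, then power spectrum |X(k)|^2.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    # (5) Filter bank energies: sum_k |X(k)|^2 H_i(k).
    energies = power @ mel_filter_bank(n_filters, n_fft, sample_rate).T
    # (6) Logarithmic scaling, Eq. (1); the small constant avoids log(0).
    return np.log(energies + 1e-10).T

# Synthetic one-second signal at an assumed 16 kHz sampling rate.
rng = np.random.default_rng(0)
demo_signal = rng.standard_normal(16000)
demo_features = fbank(demo_signal, 16000)
```

With these assumed settings the result is a 40 × 98 matrix (filters × frames). No DCT is applied at any point, which is the property the text above relies on.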

In contrast, MFCC introduces a discrete cosine transform (DCT) to compress the filter bank energies into cepstral coefficients:

C_{n} = \sum_{i = 1}^{M} F_{i} \cos [\frac{π n (i - 0.5)}{M}]

(2)

where C_n denotes the n-th cepstral coefficient and M is the number of Mel filters. In practical implementations, only the first several low-order coefficients are typically retained, while the higher-order coefficients are discarded for dimensional compression and noise suppression. The retained energy ratio can be expressed as:

R = \frac{\sum_{n = 1}^{p} |C_{n}|^{2}}{\sum_{n = 1}^{M} |C_{n}|^{2}}

(3)

where p is the number of retained cepstral coefficients. This truncation process inevitably removes part of the high-frequency spectral fluctuation information contained in the discarded coefficients. Although such compression is beneficial for speech recognition tasks, it suppresses transient spectral structures and weak impulsive signatures that are crucial for identifying slight and compound bearing faults. By avoiding the DCT-based compression process, Fbank preserves more detailed spectral information and provides a higher-fidelity representation for deep learning-based fault diagnosis.
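The effect of the truncation can be checked numerically. The sketch below evaluates Eqs. (2) and (3) exactly as written above for a synthetic vector of log filter bank energies; the vector itself (a smooth envelope with two narrow peaks), the choice M = 40, and p = 13 are illustrative assumptions, not values from the experiments.

```python
import numpy as np

def cepstral_coefficients(F):
    """Eq. (2): C_n = sum_{i=1}^{M} F_i cos(pi * n * (i - 0.5) / M), n = 1..M."""
    M = len(F)
    i = np.arange(1, M + 1)            # filter index, row direction
    n = np.arange(1, M + 1)[:, None]   # coefficient index, column direction
    return (np.cos(np.pi * n * (i - 0.5) / M) * F).sum(axis=1)

def retained_energy_ratio(C, p):
    """Eq. (3): energy of the first p coefficients over the total energy."""
    energy = np.abs(C) ** 2
    return energy[:p].sum() / energy.sum()

# Hypothetical log filter bank energies: smooth trend plus two narrow peaks
# standing in for localized, fault-related spectral structure.
M = 40
F = np.linspace(2.0, 1.0, M)
F[[12, 27]] += 1.5

C = cepstral_coefficients(F)
R_13 = retained_energy_ratio(C, 13)   # keep 13 low-order coefficients
R_all = retained_energy_ratio(C, M)   # keep everything
```

Any value of `R_13` below one quantifies the share of coefficient energy that the truncation removes; narrow spectral peaks spread their energy across many high-order coefficients, so they are affected more strongly than a smooth envelope.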

2.2. CNN-BiGRU

A convolutional neural network (CNN) is a type of feed-forward neural network that incorporates convolutional operations. Its architecture comprises several types of layers, including convolutional layers, activation functions, pooling layers, fully connected layers, and a classification layer (Li et al., 2022). The convolutional layers, activation functions, and pooling layers work together to extract features from the inputs, while the fully connected layer operates on the flattened feature maps. This layer, in combination with the Softmax classifier, produces the classification results.

A recurrent neural network (RNN) is a network architecture with memory capabilities, allowing it to retain information from previous inputs in its internal states (Weerakody et al., 2021). RNNs are particularly well suited for solving time series problems, but their training process can encounter challenges such as gradient vanishing or gradient exploding. To address these issues, advanced variants of RNNs, such as gated recurrent units (GRUs), introduce a gating mechanism. GRUs simplify the internal structure by employing only an update gate and a reset gate, reducing model parameters and mitigating overfitting risks. The bidirectional GRU (BiGRU), used in this study, further enhances performance by capturing both forward and backward dependency information.
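For readers unfamiliar with the gating mechanism, a minimal NumPy forward pass of a bidirectional GRU is sketched below. It uses the standard update-gate and reset-gate formulation with the convention h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t (some references swap the roles of z_t and 1 − z_t). The parameter names, hidden size, and random initialization are assumptions made for illustration and do not describe the trained network of this study.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_gru(input_dim, hidden_dim, rng):
    """Small random parameters for one GRU direction (illustrative only)."""
    params = {}
    for gate in ("z", "r", "h"):
        params["W" + gate] = 0.1 * rng.standard_normal((hidden_dim, input_dim))
        params["U" + gate] = 0.1 * rng.standard_normal((hidden_dim, hidden_dim))
        params["b" + gate] = np.zeros(hidden_dim)
    return params

def gru_forward(x_seq, p):
    """Run a GRU over x_seq of shape (T, input_dim); return states (T, hidden_dim)."""
    h = np.zeros(p["bz"].shape[0])
    states = []
    for x in x_seq:
        z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])              # update gate
        r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])              # reset gate
        h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])  # candidate state
        h = (1.0 - z) * h + z * h_tilde                               # gated update
        states.append(h)
    return np.stack(states)

def bigru_forward(x_seq, p_fwd, p_bwd):
    """Concatenate forward-in-time and backward-in-time hidden states per step."""
    fwd = gru_forward(x_seq, p_fwd)
    bwd = gru_forward(x_seq[::-1], p_bwd)[::-1]   # process reversed, then re-align
    return np.concatenate([fwd, bwd], axis=1)

# Example: 98 frames of 40 features (sizes are assumptions), hidden size 16.
rng = np.random.default_rng(0)
x_seq = rng.standard_normal((98, 40))
bigru_out = bigru_forward(x_seq, init_gru(40, 16, rng), init_gru(40, 16, rng))
```

Each time step of `bigru_out` therefore carries a summary of the frames before it (forward half) and after it (backward half).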

This study integrates the features of CNN and BiGRU within a CNN-BiGRU parallel network model, as depicted in Figure 2. While CNNs excel at extracting spatial features, they may lose critical temporal features during the training process. For audio signals, capturing temporal relationships is particularly important. The BiGRU component in the model plays a crucial role in effectively extracting these temporal features from faulty signals. The outputs of the parallel network are fused into a new feature vector using the method detailed in Section 2.3, enabling the model to leverage both spatial and temporal features for a deeper understanding of complex fault patterns.

Figure 2.

CNN-BiGRU parallel network model structure.

2.3. Multi-level fusion

Multi-source information fusion is a technology that integrates, correlates, and synthesizes data from multiple similar or dissimilar sensors to evaluate and identify the source of information. Since the data from unimodal sensors cannot fully capture the operating condition of equipment, there is inherent uncertainty in their diagnostic results. By leveraging multi-source information fusion technology, the physical, spatial, and temporal attributes of the acquired information are expanded, enabling a more comprehensive representation of the equipment’s state. This ultimately leads to higher diagnostic accuracy. Based on the level at which fusion is performed, multi-source information fusion methods can typically be categorized into data-level fusion, feature-level fusion, decision-level fusion, and model-level fusion (Zhang et al., 2024). This study adopts a multi-level fusion approach that combines feature-level fusion and decision-level fusion to enhance diagnostic robustness and accuracy.

Feature-level fusion involves extracting features from sensor data, fusing the resulting feature vectors, selecting the most relevant fused features, and using them for pattern recognition. As an intermediate level of fusion, it processes data after feature extraction, thereby reducing the volume of data to be handled and facilitating real-time processing. Decision-level fusion, on the other hand, involves making independent decisions for each sensor data mode and then combining these decisions using a specific rule to obtain a unified final decision. As the highest level of information fusion, this method effectively reduces the misdiagnosis rate and improves diagnostic robustness. Even if one modality or model underperforms, other modalities or models can compensate, ensuring a stable and accurate overall diagnosis.

This study proposes a multi-level and multimodal data fusion framework that achieves efficient integration of vibration and acoustic signals through the synergistic interaction of feature-level and decision-level fusion. The core concept of this framework involves hierarchically processing the heterogeneity of different modal data to fully leverage the characteristics of each level, thereby avoiding the limitations of a single fusion strategy. Simultaneously, it introduces a dynamic weighting mechanism, primarily implemented through adaptive weighting. The designed adaptive weights serve as trainable parameters optimized through back-propagation with constraints. This approach enhances the complementarity of multimodal information via adaptive weight optimization, thereby improving the accuracy and robustness of fault diagnosis models.

2.4. Fbank-CNN-BiGRU parallel network method

This study proposes an innovative Fbank-CNN-BiGRU parallel network method for fault diagnosis of rolling bearings based on acoustic-vibration fusion, as illustrated in Figure 3. First, vibration and acoustic signals undergo Fbank-based feature extraction to obtain spectral representations, thereby enhancing feature representation ability. Feature extraction begins by computing the power spectrum:

P(t, f) = |X(t, f)|^{2}

(4)

where X (t, f) denotes the complex spectrogram of the signal at time t and frequency f.

Figure 3.

Fbank-CNN-BiGRU parallel network based on multi-level acoustic-vibration signal fusion method structure.

Subsequently, the feature matrix of energy is obtained using the Mel filter bank in Fbank:

E_{m} (t) = \sum_{f} P (t, f) \cdot H_{m} (f)

(5)

where H_m(f) is the m-th triangular filter, with the filters spaced at linear intervals on the Mel scale.

Finally, an M × T characteristic matrix is obtained. Next, the extracted data is fed into a parallel network, where the CNN-BiGRU parallel network is employed to extract both spatial and temporal features, ensuring a more comprehensive feature representation. Convolutional networks process vibration signals to extract local spatiotemporal energy patterns and edge features. Following convolution and pooling layers, Flatten and Dense layers produce high-level feature vectors:

f_{v} = ϕ_{v} (x_{v})

(6)

where x_v is the original input feature of the vibration branch and ϕ_v denotes the mapping implemented by the CNN.

The vibration branch then outputs its category prediction via softmax:

{\hat{y}}_{v} = \mathrm{softmax} (W_{v} f_{v} + b_{v})

(7)

where W_v and b_v are the learnable weight matrix and bias of the vibration branch classifier.

Audio signals exhibit strong temporal correlations. BiGRU is employed to capture temporal dependencies between preceding and subsequent segments, followed by a fully connected layer to extract acoustic features:

f_{s} = ϕ_{s} (x_{s})

(8)

The final result is the acoustic branch prediction:

{\hat{y}}_{s} = \mathrm{softmax} (W_{s} f_{s} + b_{s})

(9)

In the multi-level fusion stage, the signals first undergo feature extraction through different networks. A subset of the extracted features undergoes feature-level fusion, while the remaining features are retained independently. In feature-level fusion, the vibration feature f_v extracted by the CNN is concatenated with the acoustic feature f_s extracted by the BiGRU to obtain the fused feature:

f_{F} = [f_{v}; f_{s}]

(10)

A fused prediction from the vibration and acoustic signals is then obtained:

{\hat{y}}_{F} = \mathrm{softmax} (W_{F} f_{F} + b_{F})

(11)
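Eqs. (10) and (11) amount to a concatenation followed by a linear classifier with softmax. A minimal NumPy sketch is given below; the feature dimensions (64 and 32), the number of classes (5), and the random weights are placeholders chosen for illustration, not values reported in this study.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def fused_prediction(f_v, f_s, W_F, b_F):
    """Feature-level fusion: concatenate branch features, then classify."""
    f_F = np.concatenate([f_v, f_s])     # Eq. (10): f_F = [f_v; f_s]
    return softmax(W_F @ f_F + b_F)      # Eq. (11)

# Placeholder features standing in for the CNN and BiGRU branch outputs.
rng = np.random.default_rng(0)
f_v = rng.standard_normal(64)                 # vibration feature (assumed size)
f_s = rng.standard_normal(32)                 # acoustic feature (assumed size)
W_F = 0.1 * rng.standard_normal((5, 96))      # 5 classes is an assumption
b_F = np.zeros(5)
y_F = fused_prediction(f_v, f_s, W_F, b_F)
```

Because the classifier sees both feature vectors at once, it can weight a vibration cue against an acoustic cue for the same sample, which the two single-branch classifiers cannot do.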

Finally, the results from the feature-level fusion and the outputs of the independent models are integrated using a dynamically weighted decision-level fusion approach, ultimately yielding the classification results. The final prediction is obtained through a weighted combination of three branch outcomes:

{\hat{y}}_{final} = α {\hat{y}}_{F} + β {\hat{y}}_{v} + (1 - α - β) {\hat{y}}_{s}

(12)

where {\hat{y}}_{F}, {\hat{y}}_{v}, and {\hat{y}}_{s} denote the prediction probabilities of the feature fusion branch, vibration branch, and acoustic branch, respectively. The adaptive parameters α and β are initialized before training and jointly optimized with all network parameters through back-propagation using the Adam optimizer.

To ensure the validity of the weighted fusion process, the adaptive parameters satisfy the following constraints:

0 \leq α \leq 1

0 \leq β \leq 1 - α

(13)

thereby guaranteeing that all three branch weights are non-negative and sum to one.
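The weighted combination of Eq. (12) under the constraints of Eq. (13) can be sketched as follows. The clipping step is only one possible way to keep α and β inside the feasible region and is an assumption of this sketch; the text above does not specify how the constraint is enforced during training. The probability vectors are made-up numbers.

```python
def project_weights(alpha, beta):
    """Clip (alpha, beta) into the region 0 <= alpha <= 1, 0 <= beta <= 1 - alpha.

    Clipping is an assumed enforcement strategy, used here for illustration.
    """
    alpha = min(max(alpha, 0.0), 1.0)
    beta = min(max(beta, 0.0), 1.0 - alpha)
    return alpha, beta

def fuse_decisions(y_F, y_v, y_s, alpha, beta):
    """Eq. (12): weighted sum of the fusion, vibration, and acoustic predictions."""
    alpha, beta = project_weights(alpha, beta)
    gamma = 1.0 - alpha - beta            # weight of the acoustic branch
    return [alpha * f + beta * v + gamma * s for f, v, s in zip(y_F, y_v, y_s)]

# Made-up three-class probabilities for the three branches.
y_final = fuse_decisions([0.7, 0.2, 0.1],
                         [0.6, 0.3, 0.1],
                         [0.2, 0.5, 0.3],
                         alpha=0.5, beta=0.3)
```

Since each input is a probability vector and the three weights are non-negative and sum to one, the fused output is again a valid probability vector.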

To simultaneously optimize the synergy between feature learning and fusion decision-making, a compound loss function is defined:

\begin{array}{l} L = λ_{final} \mathrm{CE} (y, {\hat{y}}_{final}) \\ + λ_{comp} (w_{F} \mathrm{CE} (y, {\hat{y}}_{F}) + w_{V} \mathrm{CE} (y, {\hat{y}}_{V}) + w_{S} \mathrm{CE} (y, {\hat{y}}_{S})) \\ + μ (w_{F} \mathrm{KL} ({\hat{y}}_{final} \| {\hat{y}}_{F}) + w_{V} \mathrm{KL} ({\hat{y}}_{final} \| {\hat{y}}_{V}) + w_{S} \mathrm{KL} ({\hat{y}}_{final} \| {\hat{y}}_{S})) \end{array}

(14)

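where CE denotes the cross-entropy loss, KL represents the Kullback-Leibler divergence used to maintain consistency among the different prediction branches, and λ_final, λ_comp, and μ are the weighting coefficients of the three loss terms. The subscripts F, V, and S refer to the feature fusion, vibration, and acoustic branches (the latter two are written as v and s in Eqs. (7)-(12)).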
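A plain-Python sketch of Eq. (14) for a single sample is given below to make the three terms concrete. The coefficient values (1.0, 0.5, 0.1), the dictionary keys, and the example probabilities are assumptions for illustration; the branch weights are passed in as given numbers.

```python
import math

def cross_entropy(label, y_hat, eps=1e-12):
    """CE for one sample with an integer class label."""
    return -math.log(y_hat[label] + eps)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def compound_loss(label, y_final, branches, w,
                  lam_final=1.0, lam_comp=0.5, mu=0.1):
    """Eq. (14); branches and w are dicts sharing the same branch keys."""
    ce_final = cross_entropy(label, y_final)
    ce_branches = sum(w[k] * cross_entropy(label, branches[k]) for k in branches)
    consistency = sum(w[k] * kl_divergence(y_final, branches[k]) for k in branches)
    return lam_final * ce_final + lam_comp * ce_branches + mu * consistency

# Made-up example: true class 0, equal branch weights.
weights = {"fusion": 1 / 3, "vibration": 1 / 3, "acoustic": 1 / 3}
agree = {k: [0.7, 0.2, 0.1] for k in weights}
loss_agree = compound_loss(0, [0.7, 0.2, 0.1], agree, weights)

disagree = dict(agree, acoustic=[0.2, 0.5, 0.3])
loss_disagree = compound_loss(0, [0.7, 0.2, 0.1], disagree, weights)
```

When all branches agree with the final prediction the KL term vanishes; a branch that disagrees raises both its own cross-entropy term and the consistency term, so `loss_disagree` exceeds `loss_agree`.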

The confidence-weighted coefficients w_F, w_V, w_S are dynamically adjusted according to branch prediction confidence and fixed hyperparameters τ and κ. Specifically, the confidence score of each branch is defined as:

\mathrm{conf}_{i} = \max ({\hat{y}}_{i}), \quad i \in \{F, V, S\}

(15)

The confidence scores are normalized through the Softmax operation:

{\hat{w}}_{i} = \frac{\exp (κ \cdot \mathrm{conf}_{i})}{\sum_{j} \exp (κ \cdot \mathrm{conf}_{j})}

(16)

Accordingly, the final confidence-aware fusion weight is computed as:

w_{i} = τ \cdot ω_{i} + (1 - τ) \cdot {\hat{w}}_{i}

(17)

where ω_i is the prior weight of branch i, equal to α, β, and 1 − α − β for the feature fusion, vibration, and acoustic branches, respectively. The hyperparameter κ controls how sharply the normalized weights respond to differences in branch confidence, while τ balances the contributions of the prior adaptive weights and the confidence-aware dynamic weights.
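Eqs. (15)-(17) can be traced with a short plain-Python sketch. The values κ = 5 and τ = 0.5, the equal prior weights, and the branch probabilities are assumed for illustration only.

```python
import math

def confidence_weights(branch_probs, prior, kappa=5.0, tau=0.5):
    """Blend prior weights with softmax-normalized branch confidences."""
    conf = [max(p) for p in branch_probs]                  # Eq. (15)
    exps = [math.exp(kappa * c) for c in conf]
    total = sum(exps)
    w_hat = [e / total for e in exps]                      # Eq. (16)
    return [tau * om + (1.0 - tau) * wh                    # Eq. (17)
            for om, wh in zip(prior, w_hat)]

# Branch order: feature fusion, vibration, acoustic (made-up probabilities).
probs = [[0.90, 0.05, 0.05],
         [0.60, 0.30, 0.10],
         [0.40, 0.35, 0.25]]
prior = [1 / 3, 1 / 3, 1 / 3]
w = confidence_weights(probs, prior)
```

Because both the prior weights and the softmax-normalized confidences sum to one, the blended weights also sum to one for any τ between 0 and 1. With equal priors, the most confident branch receives the largest weight.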

During training, the adaptive parameters α and β are updated together with all network parameters through gradient optimization:

θ \leftarrow θ - η \frac{\partial L}{\partial θ}

(18)

where θ = \{α, β, W\} represents the trainable parameter set of the entire network, with W denoting the network weights, and η is the learning rate.

Therefore, the loss function dynamically adjusts the modal weights according to their confidence, while the parameters α and β are jointly optimized to ensure that this adjustment converges toward the global optimum. By integrating feature-level fusion and decision-level fusion through the above design, comprehensive information about rolling bearings is synthesized to enhance diagnostic accuracy.

In addition to the main structure described above, regularization and dynamic learning rate methods are incorporated into the model to enhance the performance of the parallel network. To mitigate overfitting, L2 regularization and Dropout are incorporated into the network architecture. Dropout randomly deactivates hidden neurons during training with a given probability, while leaving the input and output neurons unchanged. L2 regularization adds a penalty term proportional to the sum of the squared network weights, which helps to smooth the model and reduce its tendency to overfit the training data (Benninger et al., 2023). These techniques not only lower the risk of overfitting but also reduce the computational time required for training the network. Additionally, an early stopping method is implemented to further reduce overfitting. By monitoring the validation loss at the end of each training epoch, training is discontinued if no improvement is observed for a predefined number of epochs. This approach prevents the model from overlearning on the training set, thereby improving its generalization performance on unseen data while saving computational resources.
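The early stopping rule described above can be sketched in plain Python. The patience value, the optional minimum improvement `min_delta`, and the validation loss curve are hypothetical; the study does not report the settings used.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    Training stops once the validation loss has failed to improve on its
    best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss       # new best validation loss: reset the counter
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses, one value per epoch.
history = [1.00, 0.80, 0.70, 0.71, 0.72, 0.70, 0.73]
stop_at = early_stopping_epoch(history, patience=3)
```

In this example the best loss is reached at epoch 2 and the following three epochs bring no improvement, so training would stop at epoch 5. In practice the weights from the best epoch are usually restored afterwards.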

To enhance the final diagnostic performance, the learning rate decay method is employed to improve the convergence behavior of the model. This technique dynamically adjusts the learning rate in a stepwise manner: a larger learning rate is used in the early stages of training to accelerate convergence, while a smaller learning rate is applied in later stages to fine-tune parameters more precisely (Fan et al., 2024). This strategy ensures a balance between convergence speed and precision, ultimately improving the overall performance of the model.
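A stepwise schedule of this kind can be written in one line. The initial learning rate, the decay factor of 0.5, and the interval of 10 epochs below are assumed values for illustration, not the settings of this study.

```python
def step_decay(initial_lr, epoch, drop=0.5, epochs_per_drop=10):
    """Multiply the learning rate by `drop` once every `epochs_per_drop` epochs."""
    return initial_lr * (drop ** (epoch // epochs_per_drop))

# Learning rate for the first 30 epochs under the assumed settings.
schedule = [step_decay(1e-3, epoch) for epoch in range(30)]
```

The rate stays at its initial value for epochs 0 to 9, is halved for epochs 10 to 19, and is halved again for epochs 20 to 29.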

3. Experiment

3.1. Data set and experimental environment

The experimental data in this paper come from the BPS (Bearing Prognosis Simulation) experimental bench in our laboratory, which mainly comprises an AC motor, a coupling, a support bearing, a test bearing, a vibration sensor, and an acoustic sensor. The details are shown in Figure 4(a).

Figure 4.

BPS experimental system and bearing fault types.

The output end of the AC motor is connected to the transmission system via couplings, with the support bearing and test bearing positioned near the two ends of the rotating shaft, respectively. The rotational speed of the motor spindle is adjustable using a frequency converter to simulate various operating conditions for the bearings. In this experiment, three rotational speeds were selected: 960 r/min, 1200 r/min, and 1500 r/min. The test bearings utilized in this study are ER-16K deep groove ball bearings.

To simulate different types of bearing faults, damage was introduced to the test bearings using laser machining technology, resulting in four types of slight bearing faults: inner race fault, outer race fault, rolling element fault, and compound fault. Compound fault in this study refers specifically to the condition where the inner ring, outer ring, and rolling element of the bearing are all damaged simultaneously. It should be noted that this represents a full compound fault scenario. In practical applications, compound faults may also occur as partial combinations, such as inner ring and outer ring defects or inner ring and rolling element defects. Due to limitations in the available experimental conditions, only one type of compound fault specimen (i.e., full compound fault) was considered in this work. Visual representations of these four fault types are shown in Figure 4(b), with the fault types and their corresponding dimensions detailed in Table 1. Based on the experimental bench and test bearings, condition monitoring experiments on rolling bearings were carried out in the laboratory environment. During each experiment, real-time signal acquisition starts once the bearing reaches a stable operating state, and the vibration and acoustic signals corresponding to the various health states of the rolling bearings under the different working conditions are collected.

Table 1.

Fault types and dimensions.

Faulty bearing type	Inner race fault	Outer race fault	Rolling element fault	Compound fault
Depth of fault (mm)	0.1	0.1	0.1	0.1
Fault diameter (mm)	0.1	0.1	0.5	0.1, 0.5
Fault shape	Linear	Linear	Linear	Linear, spherical

In this study, vibration sensor signals in the vertical direction and acoustic sensor signals were selected to construct the dataset. The dataset was divided into training, validation, and test sets with a ratio of 6:2:2 (Yang et al., 2026). To ensure balanced representation, the split was performed in a stratified manner based on fault types, with each state containing a total of 1000 samples that were proportionally allocated to each subset. It should be noted that, in this work, each classification task is conducted under a single rotational speed condition. That is, data corresponding to different rotational speeds are not mixed within the same classification experiment but are treated independently. For each speed condition, the dataset is separately constructed and then divided into training, validation, and test sets following the same 6:2:2 ratio. This setting ensures that the model is trained and evaluated under consistent operating conditions.
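The stratified 6:2:2 split can be illustrated with a short standard-library sketch. The function name `stratified_split`, the fixed random seed, and the state labels are assumptions made for illustration; only the ratio, the five health states, and the 1000 samples per state come from the text.

```python
import random


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/validation/test sets, class by class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]   # remainder goes to the test set
    return train, val, test


# Five health states with 1000 samples each, for one rotational speed.
states = ["normal", "inner", "outer", "rolling", "compound"]
labels = [s for s in states for _ in range(1000)]
train_idx, val_idx, test_idx = stratified_split(labels)
```

Under these assumptions each state contributes 600, 200, and 200 samples to the training, validation, and test sets, and the same procedure would be repeated independently for each speed.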

To simulate noisy environments in a controlled and reproducible manner, additive Gaussian white noise was introduced directly to the raw time-domain signals prior to feature extraction. It should be noted that this type of noise is a simplified approximation and does not fully represent the complex noise characteristics encountered in real industrial scenarios, such as harmonic interference, non-stationary noise, and impulsive disturbances. Specifically, for each clean signal sample $x(t)$, a noise sequence $n(t) \sim N(0, \sigma^{2})$ was generated and added to obtain the noisy signal $\tilde{x}(t) = x(t) + n(t)$.

The signal-to-noise ratio (SNR) was defined based on the power ratio between the clean signal and the noise, consistent with the formulation used in Zhang et al. (2018):

\mathrm{SNR} = 10 \log_{10} \left( \frac{P_{signal}}{P_{noise}} \right)

(19)

where $P_{signal}$ and $P_{noise}$ are the signal and noise power, respectively. Based on this definition, the noise power can be derived as:

P_{noise} = \frac{P_{signal}}{10^{\mathrm{SNR} / 10}}

(20)

In implementation, the signal power was computed as the mean squared value of the signal, that is, $P_{signal} = mean(x^{2})$. The noise was then generated as Gaussian white noise with variance $P_{noise}$ and added to the signal:

\tilde{x} = x + \sqrt{P_{noise}} \cdot randn(\cdot)

(21)

This procedure ensures that the resulting noisy signal meets the target SNR level in expectation (the realized SNR of a finite-length sample fluctuates slightly around the target), which is consistent with the implementation used in our code.
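The noise injection in equations (19) to (21) can be sketched as follows. This is a minimal standard-library illustration, not the authors' code: the function names `add_awgn` and `measured_snr_db`, the random seed, and the sine test signal are assumptions.

```python
import math
import random


def add_awgn(signal, snr_db, rng=None):
    """Return `signal` plus white Gaussian noise scaled to the target SNR in dB."""
    rng = rng or random.Random()
    p_signal = sum(v * v for v in signal) / len(signal)   # mean squared value
    p_noise = p_signal / (10 ** (snr_db / 10))            # equation (20)
    sigma = math.sqrt(p_noise)
    return [v + rng.gauss(0.0, sigma) for v in signal]    # equation (21)


def measured_snr_db(clean, noisy):
    """SNR in dB from the clean signal and the realized noise, as in equation (19)."""
    p_signal = sum(v * v for v in clean) / len(clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy)) / len(clean)
    return 10 * math.log10(p_signal / p_noise)


# Hypothetical test signal: a 50 Hz sine sampled at 10 kHz for 2 s.
clean = [math.sin(2 * math.pi * 50 * n / 10_000) for n in range(20_000)]
noisy = add_awgn(clean, snr_db=-10, rng=random.Random(0))
```

Calling `measured_snr_db(clean, noisy)` on a long record returns a value close to the requested −10 dB, which is a convenient sanity check before feature extraction.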

After noise injection, Fbank features were extracted from the noisy signals, ensuring that the extracted features reflect the simulated noisy conditions. In this study, the proposed method was evaluated under five SNR levels: −10 dB, −6 dB, 0 dB, 6 dB, and 10 dB.

To evaluate the performance of the model, appropriate evaluation metrics need to be established. These metrics assess the model’s capabilities from different perspectives, and the choice of metrics typically depends on the specific learning task. Since fault diagnosis assigns each sample to one of several health states, it can be framed as a classification problem. Taking a binary classification problem as an example, the outcomes fall into four types based on the true class of a sample and the class predicted by the model. True Positive (TP): positive samples correctly classified as positive; False Positive (FP): negative samples incorrectly classified as positive; True Negative (TN): negative samples correctly classified as negative; False Negative (FN): positive samples incorrectly classified as negative. Let TP, FP, TN, and FN denote the number of samples in each of these categories. For the multi-class task considered here, these counts are obtained for each fault category by treating that category as the positive class and all other categories as negative. From these outcomes, commonly used evaluation metrics for fault diagnosis, such as accuracy, precision, recall, and F1-score, can be derived.

In this study, multiple evaluation metrics are adopted to comprehensively assess the diagnostic performance of the proposed method. Accuracy, the proportion of correctly classified samples among all samples, is the most commonly used performance metric in fault diagnosis and provides an overall assessment of model performance. However, for multi-class fault diagnosis problems, accuracy alone may not fully reflect how well the model recognizes individual fault categories. Therefore, precision, recall, F1-score, and the identification rate for each fault type are additionally introduced. Accuracy is calculated using equation (22):

\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}

(22)

Precision represents the proportion of correctly identified positive samples among all samples predicted as positive, reflecting the reliability of the model predictions. A higher precision indicates that fewer normal samples or other fault types are incorrectly classified as the target fault. Precision is calculated using equation (23):

\mathrm{Precision} = \frac{TP}{TP + FP}

(23)

Recall measures the proportion of actual positive samples that are correctly identified by the model. A higher recall indicates that the model has a stronger capability to detect fault samples and reduces the probability of missed diagnosis. Recall is calculated using equation (24):

\mathrm{Recall} = \frac{TP}{TP + FN}

(24)

The F1-score is the harmonic mean of precision and recall, which comprehensively evaluates the balance between fault detection capability and prediction reliability. It is particularly suitable for imbalanced classification problems. The F1-score is defined using equation (25):

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}

(25)

In addition, to further evaluate the diagnostic capability of the model for different fault categories, the recall value of each fault category is used as the individual identification rate. This metric reflects the proportion of correctly identified samples within a specific fault category and can effectively evaluate the recognition performance of the model for different fault types, including compound faults. The identification rate of fault categories is calculated using equation (26):

\mathrm{IdentificationRate}_{i} = \mathrm{Recall}_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}

(26)
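Equations (22) to (26) can be evaluated directly from a multi-class confusion matrix, as in the sketch below. Macro averaging over classes is an assumption: the paper does not state the averaging scheme, although Table 4 reports recall equal to accuracy, which is what macro-averaged recall gives on a class-balanced test set. The function name and the 3-class example matrix are hypothetical.

```python
def classification_metrics(confusion):
    """Compute accuracy, macro precision/recall/F1, and per-class recall.

    `confusion[i][j]` counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))

    precisions, recalls, f1s = [], [], []
    for i in range(n_classes):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                                # row sum minus diagonal
        fp = sum(confusion[r][i] for r in range(n_classes)) - tp   # column sum minus diagonal
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n_classes,
        "recall": sum(recalls) / n_classes,
        "f1": sum(f1s) / n_classes,
        "identification_rate": recalls,   # equation (26), one value per class
    }


# Hypothetical 3-class confusion matrix (not data from this study).
example = [
    [9, 1, 0],
    [0, 8, 2],
    [1, 0, 9],
]
metrics = classification_metrics(example)
```

For this example the per-class identification rates are 0.9, 0.8, and 0.9, and the overall accuracy is 26/30.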

3.2. Results and discussion

This section begins by presenting the collected data in a visual format. The time-domain waveforms of the bearing vibration signal for five different states are shown in Figure 5(a). As observed in the figure, the differences in waveforms among the various fault types are not visually distinct, making it challenging to identify the specific fault type through naked-eye observation. The feature images generated by applying the Fbank feature extraction method to the original time-domain signal are shown in Figure 5(b). These images display the extracted features of the signal and provide a more detailed and distinguishable representation than the raw time-domain waveforms.

Figure 5.

(a) Raw signals in the time domain for different states of bearing and (b) Fbank feature image of different signals for different states of bearing.

From the extracted results, the Fbank feature images of different states of slight faults are more distinguishable than the original time-domain signals. However, visual observation alone is still not enough to accurately distinguish the different states of bearings, and deep learning training is needed for more detailed analysis.

To provide a quantitative evaluation of Fbank’s effectiveness, a comparative experiment was conducted using both Fbank and MFCC features under identical model configurations. The experimental results, summarized in Table 2, show Fbank’s advantage in preserving relevant information for slight and compound faults. For acoustic signals, the model using Fbank features achieved an accuracy of 93.66%, an improvement of 14.82 percentage points over the 78.84% achieved by MFCC. Similarly, for vibration signals, the Fbank approach reached 96.50% accuracy, outperforming the MFCC approach (86.77%) by 9.73 percentage points. This disparity indicates that the information discarded by the DCT step in MFCC is important for accurate fault diagnosis. The ability of Fbank to maintain these detailed spectral features allows the model to achieve more robust performance in distinguishing states of rolling bearings.
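The effect of the DCT step can be illustrated with a toy example. MFCCs are commonly obtained by applying a DCT to the log filter-bank energies and keeping only the low-order coefficients; the sketch below shows that this truncation smears a narrow peak in a single band. The 8-band vector, the choice of 4 retained coefficients, and all function names are hypothetical and do not reflect the feature extraction settings used in this study.

```python
import math


def dct2(values):
    """Orthonormal DCT-II of a list of log filter-bank energies."""
    n = len(values)
    out = []
    for k in range(n):
        s = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * s)
    return out


def idct2(coeffs):
    """Inverse of `dct2` (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1 / n)
        s += sum(
            c * math.sqrt(2 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs) if k > 0
        )
        out.append(s)
    return out


def truncate(coeffs, keep):
    """Zero all but the first `keep` coefficients, mimicking MFCC truncation."""
    return coeffs[:keep] + [0.0] * (len(coeffs) - keep)


# Hypothetical log energies in 8 bands with a narrow peak in band 5.
fbank = [1.0, 1.0, 1.0, 1.0, 1.0, 2.0, 1.0, 1.0]
full = dct2(fbank)
recon_full = idct2(full)                 # all coefficients kept: lossless
recon_trunc = idct2(truncate(full, 4))   # low-order coefficients only: peak is smeared
```

With all coefficients retained the band energies are recovered exactly, whereas the truncated version reconstructs band 5 with a clearly reduced peak, which is the kind of detail loss the comparison in Table 2 points to.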

Table 2.

Accuracy of different features.

Method	Fbank-acoustic signal accuracy (%)	Fbank-vibration signal accuracy (%)	MFCC-acoustic signal accuracy (%)	MFCC-vibration signal accuracy (%)
Accuracy	93.66	96.50	78.84	86.77

The training and validation loss functions for the unimodal data and the acoustic-vibration fusion data are illustrated in Figure 6. From the loss function results, it is evident that the training and validation losses of the proposed method converge rapidly as the number of iterations increases. Additionally, the acoustic-vibration fusion signals demonstrate lower and more stable loss values during both the training and validation phases compared to those of unimodal diagnosis. These findings preliminarily highlight the advantages of acoustic-vibration fusion. However, more comprehensive comparisons and statistical analyses are required to fully validate the superiority of this approach.

Figure 6.

(a) Loss function of Fbank-CNN based on vibration signal, (b) loss function of Fbank-CNN based on acoustic signal, and (c) loss function of the proposed method.

To further evaluate the diagnostic performance of the proposed method and highlight its advantages over unimodal diagnosis, two visualization techniques are utilized in this study: the t-SNE visualization algorithm and the confusion matrix. t-SNE (t-Distributed Stochastic Neighbor Embedding) converts the high-dimensional Euclidean distances between data points into conditional probabilities that represent their similarities and projects the data into a 2D or 3D space that preserves these similarities. Once its optimization has converged, the resulting projection shows whether the dataset is well separated, allowing for a clearer understanding of the data structure. The confusion matrix is a tool for visualizing the performance of a classification model. It compares the predicted and true labels, and performance metrics such as accuracy, precision, recall, and F1-score can be computed from it, offering a comprehensive evaluation of the model’s performance. Moreover, it helps identify specific categories that are prone to misclassification, providing valuable insights for model improvement. The results of the t-SNE visualization and the confusion matrix are presented in Figures 7 and 8, respectively.

Figure 7.

t-SNE results.

Figure 8.

(a) Confusion matrix of Fbank-CNN based on vibration signal, (b) confusion matrix of Fbank-CNN based on acoustic signal, and (c) confusion matrix of the proposed method.

To further analyze the diagnostic capability of the proposed method, the t-SNE visualization results and confusion matrices were investigated in detail. Compared with the unimodal methods based on vibration or acoustic signals, the proposed method exhibits clearer clustering boundaries and less overlap among different fault categories in the t-SNE feature space. The confusion matrices further support this observation. Compared with the unimodal methods, the proposed method produces fewer misclassified samples among different fault categories. The proposed method maintains relatively high recognition accuracy for different slight fault categories, demonstrating that the fused features remain sufficiently distinguishable even when the fault signatures are weak. The results show that unimodal data alone may not always provide sufficiently stable fault representations for slight and compound fault diagnosis. After combining vibration and acoustic information, more complementary fault characteristics can be retained, resulting in more stable feature representations and improved classification performance.

Given the inherent randomness in the initial weights of the neural network, the model was trained and tested 10 times to ensure the reliability of the training outcomes, and the average accuracy across these 10 runs was used as the indicator of model performance. The accuracy and standard deviation of the proposed method under different speeds and SNR conditions are presented in Table 3.
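The aggregation over repeated runs amounts to a mean and a standard deviation. In the sketch below the ten accuracy values are invented for illustration, and the use of the sample standard deviation (divisor n − 1) is an assumption, since the paper does not state which form was used.

```python
import statistics

# Hypothetical accuracies (%) from 10 independent runs; not data from this study.
run_accuracies = [99.4, 99.6, 99.3, 99.7, 99.5, 99.8, 99.2, 99.6, 99.5, 99.4]

mean_accuracy = statistics.mean(run_accuracies)
std_accuracy = statistics.stdev(run_accuracies)   # sample standard deviation (n - 1)

print(f"accuracy = {mean_accuracy:.2f}% ± {std_accuracy:.2f}%")
```

Replacing `statistics.stdev` with `statistics.pstdev` would give the population form instead; for 10 runs the two differ by about 5%.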

Table 3.

Accuracy and standard deviation for different rotational speeds and different SNR conditions.

Rotational speed	Metric	SNR = 10 dB	SNR = 6 dB	SNR = 0 dB	SNR = −6 dB	SNR = −10 dB
960 r/min	Accuracy of the method (%)	99.32	97.80	96.43	94.62	90.32
960 r/min	Standard deviation of the method (%)	0.18	0.20	0.28	0.34	0.41
1200 r/min	Accuracy of the method (%)	99.55	98.12	96.26	95.55	92.25
1200 r/min	Standard deviation of the method (%)	0.21	0.26	0.33	0.39	0.42
1500 r/min	Accuracy of the method (%)	99.46	97.91	96.37	94.54	91.29
1500 r/min	Standard deviation of the method (%)	0.19	0.22	0.33	0.41	0.48

The accuracy of the proposed diagnostic method remained high across different speeds (960 r/min, 1200 r/min, and 1500 r/min). The highest accuracy, 99.55%, was achieved at 1200 r/min and SNR = 10 dB. Even under relatively harsh conditions (e.g., SNR = −10 dB), the method maintains an accuracy above 90%: 90.32% at 960 r/min, 91.29% at 1500 r/min, and 92.25% at 1200 r/min. These results demonstrate the method’s effectiveness in diagnosing slight and compound bearing faults under simulated noise interference, indicating its promising potential for practical applications.

Since the diagnostic performance at 1200 r/min is the best at four of the five SNR levels (the exception is 0 dB, where the three speeds differ by less than 0.2 percentage points), the dataset at 1200 r/min was selected for further comparative analysis. To comprehensively evaluate the effectiveness of the proposed method, the 1200 r/min dataset was additionally applied to unimodal fault diagnosis models based on vibration signals and acoustic signals.

Considering that accuracy alone may not adequately reflect the classification performance in multi-class fault diagnosis problems, additional evaluation metrics including Precision, Recall, and F1-score were introduced in this study. Moreover, the recall rate of each fault category was further provided to evaluate the recognition capability for different fault types. The detailed experimental results are summarized in Table 4.

Table 4.

Diagnostic results of bearing fault at 1200 r/min.

Rotational speed	Algorithm	Evaluation metric	SNR = 10 dB	SNR = 6 dB	SNR = 0 dB	SNR = −6 dB	SNR = −10 dB
1200 r/min	Proposed method	Accuracy (%)	99.55	98.15	97.1	94.85	92.85
		Precision (%)	99.54	98.15	97.11	94.9	92.85
		Recall (%)	99.55	98.15	97.1	94.85	92.85
		F1-score (%)	99.55	98.15	97.1	94.86	92.84
		Normal recall (%)	99.47	98.4	98.67	95.73	92.73
		Inner race fault recall (%)	99.76	99.52	97.86	95.25	93.77
		Outer race fault recall (%)	99.49	98.47	96.68	95.65	93.88
		Rolling element fault recall (%)	99.28	99.02	97.32	96.83	95.34
		Compound fault recall (%)	99.75	95.29	95.04	90.82	86.59
	Fbank-vibration signal	Accuracy (%)	97.3	96.4	93.2	89.7	84.3
		Precision (%)	97.33	96.52	93.25	89.89	84.74
		Recall (%)	97.3	96.4	93.2	89.7	84.3
		F1-score (%)	97.3	96.4	93.18	89.74	84.33
		Normal recall (%)	94.5	97	94	92	91.5
		Inner race fault recall (%)	98	98	90	88	80
		Outer race fault recall (%)	97	90.5	88	89.5	87
		Rolling element fault recall (%)	99.5	99.5	95.5	89.5	77.5
		Compound fault recall (%)	97.5	97	98.5	89.5	85.5
	Fbank-acoustic signal	Accuracy (%)	94.5	92.9	90.2	87.7	83.5
		Precision (%)	94.69	93.01	90.69	88.37	84.06
		Recall (%)	94.5	92.9	90.2	87.7	83.5
		F1-score (%)	94.51	92.9	90.21	87.76	83.54
		Normal recall (%)	89	89	77.5	82	81
		Inner race fault recall (%)	94	96	93	85	83
		Outer race fault recall (%)	97	92	99	89	84.5
		Rolling element fault recall (%)	96.5	93.5	83.5	95.5	83
		Compound fault recall (%)	96	94	98	87	86

The results indicate that the proposed method achieves consistently superior diagnostic performance compared with the unimodal methods under different SNR conditions. As the SNR decreases from 10 dB to −10 dB, the diagnostic performance of all methods gradually deteriorates because the fault features become increasingly submerged by environmental noise. However, compared with the vibration-based and acoustic-based models, the proposed method exhibits a significantly slower degradation trend in Accuracy, Precision, Recall, and F1-score, demonstrating stronger robustness and stability under noisy conditions.

Specifically, the proposed method consistently achieves the highest Accuracy, Precision, Recall, and F1-score values under all tested SNR conditions. The higher Accuracy indicates that the proposed method can improve the overall correctness of fault classification by effectively integrating complementary information from vibration and acoustic signals. Meanwhile, the higher Precision and Recall demonstrate that the proposed method can simultaneously reduce false classifications and missed detections under noisy environments. This advantage is particularly important for slight bearing faults, since weak fault signatures are more susceptible to noise interference and are difficult to distinguish using single-modal information alone. Furthermore, the superior F1-score indicates that the proposed method achieves a better balance between prediction reliability and fault detection capability. Compared with the unimodal approaches, the proposed method can preserve more discriminative fault features under low-SNR conditions, thereby achieving more stable and reliable diagnostic performance.

The recall values of each fault category are used to represent the individual identification rates, which provide a more detailed evaluation of the recognition capability for different bearing fault types. The individual identification rates further verify the effectiveness of the proposed method for both slight and compound bearing fault diagnosis. For normal conditions, inner race faults, outer race faults, and rolling element faults, the proposed method maintains recall values above 92% at every tested SNR level. Compound faults are the most difficult category: their recall decreases from 99.75% at 10 dB to 86.59% at −10 dB. These results demonstrate that the proposed method can effectively extract and preserve discriminative fault features from both vibration and acoustic signals, thereby improving the identification capability for both slight and compound bearing faults in noisy environments.

To further highlight the advantages of the proposed method under various SNRs, the confusion matrices for the proposed method and the two unimodal methods are generated as comparisons. These matrices were plotted at three noise disturbance levels (SNR = 10 dB, 0 dB, and −10 dB) and are presented in Figure 9.

Figure 9.

Comparison of confusion matrices under different SNR levels.

At 10 dB, all methods achieve relatively high classification accuracy because the fault features are still distinguishable from background noise. However, the unimodal methods already exhibit some confusion among different slight fault categories, while the proposed method maintains fewer misclassified samples. As the SNR decreases to 0 dB, the influence of noise becomes more significant. The unimodal methods show an obvious increase in misclassification. In contrast, the proposed method still maintains relatively stable classification performance. This indicates that the proposed method can preserve more useful fault features under noisy conditions. At −10 dB, severe noise interference causes substantial degradation in all methods. Several fault categories in the unimodal methods are incorrectly identified as other fault types or even normal conditions. Although the recognition accuracy of the proposed method also decreases under noise interference, the overall classification performance remains better than that of the unimodal approaches.

The results demonstrate that this method is less sensitive to noise than the unimodal approaches across the tested SNR scenarios. It retains higher accuracy and correct classification rates under noisy conditions, further validating its robustness and effectiveness in diagnosing slight and compound bearing faults within simulated noisy environments.

To enhance the credibility of the results, several commonly used fault diagnosis models were selected for comparison. Some of these methods were referenced from Wang et al. (2021), and the comparative results are presented in Table 5.

Table 5.

Comparison results of different algorithms for bearing fault diagnosis.

Algorithm	SNR = 10 dB	SNR = 6 dB	SNR = 0 dB	SNR = −6 dB	SNR = −10 dB
Proposed method (%)	99.55	98.12	96.26	95.55	92.25
WDCNN using vibration signals (%)	94.14	91.45	86.85	82.52	75.62
FFT-BP using vibration signals (%)	86.56	83.66	78.58	73.16	66.64
MSCNN (%)	96.62	94.44	88.92	85.24	79.86
1DCNN using vibration signals (%)	92.13	87.79	83.25	77.85	72.61
1DCNN using acoustic signals (%)	88.39	84.65	77.19	72.32	68.68
1DCNN using acoustic-vibration signals (%)	92.62	86.43	82.35	78.58	73.16

The results show that the method proposed in this study achieves the highest accuracy across all SNR levels, with its advantage being particularly evident at low SNR. Specifically, the accuracy of the proposed method decreases by 7.30 percentage points between SNR = 10 dB and SNR = −10 dB (from 99.55% to 92.25%), whereas the comparison methods lose between 16.76 and 19.92 percentage points over the same range. These results further validate the superiority of the proposed method in diagnosing slight and compound bearing faults in noisy environments.
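The accuracy degradation of each method between the two extreme SNR levels can be computed directly from Table 5. The short script below only rearranges the reported values; the dictionary keys are shortened labels introduced here for readability.

```python
# Accuracies (%) at SNR = 10 dB and SNR = -10 dB, copied from Table 5.
table5 = {
    "Proposed method": (99.55, 92.25),
    "WDCNN (vibration)": (94.14, 75.62),
    "FFT-BP (vibration)": (86.56, 66.64),
    "MSCNN": (96.62, 79.86),
    "1DCNN (vibration)": (92.13, 72.61),
    "1DCNN (acoustic)": (88.39, 68.68),
    "1DCNN (acoustic-vibration)": (92.62, 73.16),
}

# Drop in percentage points when the SNR falls from 10 dB to -10 dB.
drops = {name: round(high - low, 2) for name, (high, low) in table5.items()}
most_robust = min(drops, key=drops.get)

for name, drop in sorted(drops.items(), key=lambda item: item[1]):
    print(f"{name}: {drop:.2f}")
```

The smallest drop belongs to the proposed method, and the remaining methods all lose more than 16 percentage points.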

To analyze the source of the additional computational overhead, ablation-based comparisons were conducted among the 1DCNN acoustic-vibration model, CNN-BiGRU, CNN-BiGRU-Fusion, and the proposed method. The comparison results are shown in Table 6. The proposed method requires a longer training time than the 1DCNN baseline, and the BiGRU branch is the primary contributor to this increase: after introducing the BiGRU structure, the training time rises from 148.79 s to 395.5 s.

Table 6.

Comparison of training times and inference times for different bearing fault diagnosis algorithms.

Algorithm	Accuracy (%)	Training time (s)	Inference time (s)	Inference time (s/sample)
Proposed method	99.55	421.8	2.0411	0.00102
CNN-BiGRU-fusion	98.39	403.1	2.0666	0.00103
CNN-BiGRU	98.13	395.5	2.1940	0.00109
1DCNN using acoustic-vibration signals	92.62	148.79	0.434	0.000217

By comparison, the computational overhead introduced by the multi-level feature fusion strategy is relatively limited. After incorporating the fusion module, the training time only increases from 395.5 s to 403.1 s, indicating that the feature fusion operation introduces comparatively small additional complexity. Furthermore, after introducing the adaptive dynamic weighting mechanism, the training time increases from 403.1 s to 421.8 s. Compared with the computational cost introduced by the BiGRU branch, the increase caused by the adaptive weighting mechanism remains moderate.
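The contribution of each component to the added training time can be tabulated from the values in Table 6. Attributing each increment to a single component assumes that consecutive models in the ablation differ only by that component, as the description above implies; the variable names are illustrative.

```python
# Training times (s) copied from Table 6, ordered from baseline to full model.
training_time = [
    ("1DCNN using acoustic-vibration signals", 148.79),
    ("CNN-BiGRU", 395.5),
    ("CNN-BiGRU-fusion", 403.1),
    ("Proposed method", 421.8),
]

# Increment contributed by each successive component.
increments = [
    (later[0], round(later[1] - earlier[1], 2))
    for earlier, later in zip(training_time, training_time[1:])
]
total_increase = round(training_time[-1][1] - training_time[0][1], 2)
shares = {name: inc / total_increase for name, inc in increments}

for name, inc in increments:
    print(f"{name}: +{inc:.2f} s ({shares[name]:.1%} of the total increase)")
```

Under that assumption the BiGRU branch accounts for roughly nine tenths of the 273.01 s increase, with the fusion module and the adaptive weighting mechanism sharing the remainder.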

In addition to training efficiency, inference efficiency was also evaluated. The proposed method requires approximately 2.0411 s for inference on the entire test set, corresponding to an average inference time of approximately 0.00102 s per sample. Although the proposed method exhibits higher computational complexity than the baseline method, the inference speed remains sufficiently fast for practical bearing fault diagnosis applications.

Overall, the proposed method requires longer training time than the baseline models, with the majority of the additional computational cost originating from the BiGRU branch. However, the increased computational burden is mainly concentrated in the offline training stage, while the online inference time remains sufficiently low for real-time applications. Considering the significant improvement in diagnosing slight and compound bearing faults under noisy conditions, the additional computational cost is considered acceptable for practical industrial fault diagnosis tasks.

4. Conclusion

In this study, an Fbank-CNN-BiGRU parallel network-based method is proposed for rolling bearing fault diagnosis using acoustic-vibration signal fusion. The method performs classification of slight and compound bearing faults using acoustic and vibration signals acquired from microphones and accelerometers.

Experimental results demonstrate that the proposed method achieves high diagnostic accuracy for slight and compound bearing faults under various SNR conditions, particularly in low-SNR environments. It significantly outperforms fault diagnosis methods based on unimodal sensors or single-level data fusion strategies. These results confirm that the proposed approach exhibits superior classification accuracy and robustness for both slight and compound bearing faults. However, since the noise considered in this study is limited to Gaussian white noise, its applicability to real-world environments with more complex noise structures should be interpreted with caution. In addition, although the proposed method has been compared with several representative approaches, comparisons with more state-of-the-art methods have not yet been fully investigated, which may limit the comprehensiveness of the current performance evaluation.

Future research will focus on extending the applicability of the proposed method by evaluating its performance on a wider range of mechanical systems with diverse fault types, including additional compound fault combinations. Particular attention will be paid to bearing fault diagnosis under real industrial noise environments, moving beyond Gaussian white noise simulation to incorporate more realistic noise characteristics. More comprehensive comparisons with state-of-the-art methods will also be conducted to further validate the superiority and generalization capability of the proposed approach. To mitigate the increased training time, efforts will also be directed toward optimizing the model for lightweight implementation. In addition, the development of adaptive mechanisms will be investigated to improve generalization capability, enabling effective diagnosis of previously unseen fault types.

Footnotes

ORCID iD

Changhang Xu

Author contributions

Ziming Ji: Conceptualization, investigation, methodology, writing—original draft, formal analysis, validation, visualization, and software.

Changhang Xu: Project administration, writing—review and editing, and resources.

Jun Zhao: Investigation and software.

Na Li: Investigation and writing—review and editing.

Wenbo Yao: Investigation and formal analysis.

Zhiyuan Zhang: Data curation and supervision.

Wenao Wang: Supervision and validation.

Qingrui Hu: Supervision and formal analysis.

Rui Liu: Supervision and visualization.

All authors read and approved the final manuscript.

Funding

This work was supported by the National Natural Science Foundation of China (No. 52574304) and the National Key Research and Development Program of China (No. 2023YFC3009202).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Aburakhia, Myers, Shami (2022) A hybrid method for condition monitoring and fault diagnosis of rolling bearings with low system delay. IEEE Transactions on Instrumentation and Measurement 71: 3519913. Piscataway: IEEE. https://doi.org/10.1109/tim.2022.3198477

Benninger, Liebschner, Kreischer (2023) Fault detection of induction motors with combined modeling- and machine-learning-based framework. Energies 16(8): 3429. Basel: MDPI. https://doi.org/10.3390/en16083429

Chen, Yang, Xue, et al. (2023) Deep transfer learning for bearing fault diagnosis: a systematic review since 2016. IEEE Transactions on Instrumentation and Measurement 72: 3508221. Piscataway: IEEE. https://doi.org/10.1109/tim.2023.3244237

Deng, Zhou, et al. (2025) Bearing fault diagnosis of variable working conditions based on conditional domain adversarial-joint maximum mean discrepancy. International Journal of Advanced Manufacturing Technology 136(11–12): 5043–5060. London: Springer London Ltd. https://doi.org/10.1007/s00170-025-15087-9

Dybala, Zimroz (2014) Rolling bearing diagnosing method based on empirical mode decomposition of machine vibration signal. Applied Acoustics 77: 195–203. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.apacoust.2013.09.001

Fan, Wang, et al. (2024) Performance degradation assessment of rolling bearing cage failure based on enhanced CycleGAN. Expert Systems with Applications 255: 124697. Pergamon-Elsevier Science Ltd. https://doi.org/10.1016/j.eswa.2024.124697

Gao, Cecati, Ding (2015) A survey of fault diagnosis and fault-tolerant techniques-part I: fault diagnosis with model-based and signal-based approaches. IEEE Transactions on Industrial Electronics 62(6): 3757–3767. Piscataway: IEEE. https://doi.org/10.1109/tie.2015.2417501

Geng, Wang, Zhou (2019) Mechanical fault diagnosis of power transformer by GFCC time-frequency map of acoustic signal and convolutional neural network. In: 2019 IEEE Sustainable Power and Energy Conference (iSPEC), Beijing, China, 21–23 November 2019, pp. 2106–2110. Available at: https://ieeexplore.ieee.org/document/8975318 (accessed 6 January 2026).

Hao, F-X, et al. (2020) Multisensor bearing fault diagnosis based on one-dimensional convolutional long short-term memory networks. Measurement 159: 107802. Oxford: Elsevier Sci Ltd. https://doi.org/10.1016/j.measurement.2020.107802

Han, Zhang, et al. (2021) Parallel sparse filtering for intelligent fault diagnosis using acoustic signal processing. Neurocomputing 462: 466–477. Amsterdam: Elsevier. https://doi.org/10.1016/j.neucom.2021.08.049

Jin, J-W, Kang, Lee (2023) Fatigue analysis for automotive wheel bearing flanges. International Journal of Precision Engineering and Manufacturing 24(4): 621–628. Seoul: Korean Soc Precision Eng. https://doi.org/10.1007/s12541-023-00773-z

Liu, Yang, et al. (2022) A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Transactions on Neural Networks and Learning Systems 33(12): 6999–7019. Piscataway: IEEE. https://doi.org/10.1109/TNNLS.2021.3084827

Wang, Yan, et al. (2024) A systematic review of diagnosis methods for rolling bearing compound faults: research status, challenges, and future prospects. Measurement Science and Technology 36(1): 012008. IOP Publishing. https://doi.org/10.1088/1361-6501/ad9766

Zhang, et al. (2025) Multi-scale attention-based xLSTM for rolling bearing fault diagnosis. Measurement Science and Technology 36(6): 066116. Bristol: IOP Publishing Ltd. https://doi.org/10.1088/1361-6501/add953

Zhao, Cao (2026) An UDA bearing fault diagnosis method based on synergistic optimization of unbalanced data. Mechanical Systems and Signal Processing 243: 113700. London: Academic Press Ltd-Elsevier Science Ltd. https://doi.org/10.1016/j.ymssp.2025.113700

16.

Liu

Zhang

(2020) A review of failure modes, condition monitoring and fault diagnosis methods for large-scale wind turbine bearings. Measurement 149: 107002. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.measurement.2019.107002

17.

Liu

Yan

Liu

, et al. (2024) ISEANet: an interpretable subdomain enhanced adaptive network for unsupervised cross-domain fault diagnosis of rolling bearing. Advanced Engineering Informatics 62: 102610. https://doi.org/10.1016/j.aei.2024.102610

18.

Liu

Deng

Zhao

, et al. (2025) A noise-robust and cross-domain few-shot fault diagnosis method of rolling bearings based on TFC-FPN. Measurement Science and Technology 36(4): 046127. Bristol: IoP Publishing Ltd. https://doi.org/10.1088/1361-6501/adc027

19.

Lourari

El Yousfi

Essaidi

(2025) Enhanced diagnosis of bearing and gear faults using hilbert-huang transform, singular value decomposition, and supervised learning methods. International Journal of Advanced Manufacturing Technology 139(1-2): 983–999. London: Springer London Ltd. https://doi.org/10.1007/s00170-025-15970-5

20.

Sun

Chen

(2018) Deep coupling autoencoder for fault diagnosis with multimodal sensory data. IEEE Transactions on Industrial Informatics 14(3): 1137–1145. Piscataway: Ieee-Inst Electrical Electronics Engineers Inc. https://doi.org/10.1109/tii.2018.2793246

21.

Mamun

Guerra-Zubiaga

Peng

(2025) Smart systems for real-time bearing faults diagnosis by using vibro-acoustics sensor fusion with bayesian optimised 1-D CNNs. Nondestructive Testing and Evaluation 40(5): 2113–2137. Abingdon: Taylor & Francis Ltd. https://doi.org/10.1080/10589759.2024.2375567

22.

Niu

Liu

Bin

, et al. (2021a) A deep residual convolutional neural network based bearing fault diagnosis with multi-sensor data . In: 2021 4th Ieee International Conference on Industrial Cyber-Physical Systems, Icps, New York, 2021, pp. 655–660. IEEE.

23.

Niu

Wang

Golda

, et al. (2021b) An optimized adaptive PReLU-DBN for rolling element bearing fault diagnosis. Neurocomputing 445: 26–34. Amsterdam: Elsevier. https://doi.org/10.1016/j.neucom.2021.02.078

24.

Rejitha

Kesavan

Chakravarthy

, et al. (2023) Bearings for aerospace applications. Tribology International 181: 108312. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.triboint.2023.108312

25.

Wang

Mao

(2021) Bearing fault diagnosis based on vibro-acoustic data fusion and 1D-CNN network. Measurement 173: 108518. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.measurement.2020.108518

26.

Wang

Zhu

(2023) A rotating machinery fault diagnosis method based on multi-sensor fusion and ECA-CNN. IEEE Access 11: 106443–106455. Piscataway: Ieee-Inst Electrical Electronics Engineers Inc. https://doi.org/10.1109/access.2023.3320065

27.

Wang

Jiang

Dong

, et al. (2026a) Spatial-channel collaborative multi-scale graph interaction deep transfer learning for unsupervised rotating machinery fault diagnosis. Engineering Applications of Artificial Intelligence 176: 114691. https://doi.org/10.1016/j.engappai.2026.114691

28.

Wang

Jiang

Zeng

, et al. (2026b) An adaptive fused domain-cycling variational generative adversarial network for machine fault diagnosis under data scarcity. Information Fusion 126: 103616. https://doi.org/10.1016/j.inffus.2025.103616

29.

Weerakody

Wong

Wang

, et al. (2021) A review of irregular time series data handling with gated recurrent neural networks. Neurocomputing 441: 161–178. Amsterdam: Elsevier. https://doi.org/10.1016/j.neucom.2021.02.046

30.

Wen

Gao

, et al. (2018) A new convolutional neural network-based data-driven fault diagnosis method. IEEE Transactions on Industrial Electronics 65(7): 5990–5998. Piscataway: Ieee-Inst Electrical Electronics Engineers Inc. https://doi.org/10.1109/tie.2017.2774777

31.

Jiang

Liu

, et al. (2023) Conditional distribution-guided adversarial transfer learning network with multi-source domains for rolling bearing fault diagnosis. Advanced Engineering Informatics 56: 101993. https://doi.org/10.1016/j.aei.2023.101993

32.

Yan

Zhu

Zhang

, et al. (2022) Abnormal noise monitoring of subway vehicles based on combined acoustic features. Applied Acoustics 197: 108951. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.apacoust.2022.108951

33.

Yang

Zhang

Zhu

, et al. (2026) Semi-supervised cross-domain fault diagnosis via contrastive pre-training and annotation-efficient alignment strategy. Journal of Industrial Information Integration 50: 101076. https://doi.org/10.1016/j.jii.2026.101076

34.

Yue

Wang

Zhang

(2024) Mel frequency mapping for intelligent diagnosis of rolling element bearings across different working conditions. Applied Acoustics 220: 109944. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.apacoust.2024.109944

35.

Zhang

Peng

, et al. (2018) A deep convolutional neural network with new training methods for bearing fault diagnosis under noisy environment and different working load. Mechanical Systems and Signal Processing 100: 439–453. London: Academic Press Ltd- Elsevier Science Ltd. https://doi.org/10.1016/j.ymssp.2017.06.022

36.

Zhang

Yang

Chen

, et al. (2024) Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: a systematic review of recent advancements and future prospects. Expert Systems with Applications 237: 121692. Pergamon-Elsevier Science Ltd. https://doi.org/10.1016/j.eswa.2023.121692

37.

Zhang

Guo

Chen

, et al. (2026) Fast eserogram: a novel adaptive spectrum segmentation method for rolling bearing fault diagnosis. Mechanical Systems and Signal Processing 242: 113632. London: Academic Press Ltd- Elsevier Science Ltd. https://doi.org/10.1016/j.ymssp.2025.113632

38.

Zhou

Han

(2019) Fault diagnosis of multi-source heterogeneous information fusion based on deep learning. In: 2019 IEEE 8th Data Driven Control and Learning Systems Conference (DDCLS), Dali, May 2019, pp. 1295–1300. Available at: https://ieeexplore.ieee.org/document/8909017. (accessed 6 January 2026).

39.

Zhou

Gao

, et al. (2025) State identifying method for rolling tire in lab test using acoustic signal. Applied Acoustics 231: 110487. London: Elsevier Sci Ltd. https://doi.org/10.1016/j.apacoust.2024.110487

40.

Zhu

Luo

Zhao

, et al. (2020) Research on deep feature learning and condition recognition method for bearing vibration. Applied Acoustics 168: 107435. Oxford: Elsevier Sci Ltd. https://doi.org/10.1016/j.apacoust.2020.107435