Abstract
Conventional diagnosis methods relying solely on vibration signals often fail to identify the weak fault features of slight and compound faults, particularly under noise interference. To address these limitations, this study proposes a parallel neural network method based on multi-level fusion of acoustic-vibration signal for rolling bearing fault diagnosis. First, Filter Bank (Fbank) features are extracted to enhance the representation of slight and compound faults within multimodal signals by addressing the challenges posed by weak fault features and high noise sensitivity. This approach enables the extraction of spectral information from both acoustic and vibration signals, thereby improving the feature representation capability of the proposed framework. Second, a CNN-BiGRU parallel neural network is constructed to comprehensively capture weak fault features by integrating a convolutional neural network for spatial feature extraction with a bidirectional gated recurrent unit (BiGRU) for temporal feature extraction. Finally, a multi-level fusion strategy that combines feature-level and decision-level fusion is adopted to exploit complementary information from rolling bearings, significantly improving diagnostic accuracy and overcoming the reduced reliability of slight and compound fault diagnosis caused by incomplete fault information in unimodal diagnostic methods. Experimental results on a slight and compound fault dataset demonstrate that the proposed method achieves superior diagnostic performance under various noise conditions. The proposed method achieves a diagnostic accuracy of 99.55% at SNR = 10 dB and maintains over 90% accuracy even at SNR = −10 dB, outperforming conventional diagnostic approaches.
Keywords
1. Introduction
Rolling bearings are critical components of rotating mechanical systems and are extensively applied in various industries such as aerospace (Rejitha et al., 2023), automotive (Jin et al., 2023), and energy fields (Liu and Zhang, 2020). The operating condition of rolling bearings significantly impacts the performance and lifespan of machinery. Rolling bearings often operate continuously under complex and adverse conditions, which accelerates wear and fatigue and eventually leads to faults. Such faults can interrupt normal production and, in severe cases, cause safety accidents (Deng et al., 2025; Lourari et al., 2025). Consequently, accurate monitoring and reliable identification of bearing operating conditions are of considerable practical importance, prompting sustained research interest in this area (Aburakhia et al., 2022; Zhang et al., 2026).
Existing rolling bearing fault diagnosis approaches can be broadly classified into three categories: signal-based methods, conventional machine learning-based methods, and deep learning-based methods. Signal-based approaches assess bearing condition by analyzing signals in the time domain, frequency domain, or time-frequency domain (Gao et al., 2015). Conventional machine learning-based methods typically rely on sensor-acquired data that reflect bearing operating states. The collected data are subsequently processed using models designed based on prior knowledge, in which features are manually selected and extracted (Chen et al., 2023). Although these two categories of methods have achieved notable performance in bearing fault detection, each exhibits inherent limitations that leave a significant research gap in complex industrial applications. Signal-based approaches often struggle with the non-stationary and nonlinear characteristics of signals captured under variable operating conditions, and they rely heavily on specialized prior knowledge and subjective expert interpretation. Conventional machine learning-based methods, for their part, depend on manually designed feature extraction procedures (Zhu et al., 2020). This reliance not only introduces the risk of including irrelevant or redundant features but also restricts the model’s ability to adaptively learn discriminative representations from raw data, thereby limiting the practical applicability of these methods in automated diagnostic systems. By comparison, deep learning-based methods have attracted increasing attention with the rapid development of data acquisition and sensing technologies. These methods are capable of automatically learning discriminative fault representations from large-scale sensor data and constructing effective diagnosis models, which makes them well suited for complex mechanical systems operating under strong interference.
In addition, deep learning-based approaches reduce reliance on handcrafted features and extensive diagnostic expertise (Li et al., 2025, 2026; Liu et al., 2025; Niu et al., 2021a). Recent advances have further enhanced the capability of deep learning models in cross-domain and data-scarce scenarios. For example, one study proposed a conditional distribution-guided adversarial transfer learning network with multi-source domains to effectively transfer knowledge from different machines for rolling bearing fault diagnosis (Wu et al., 2023). Another study developed an adaptive fused domain-cycling variational generative adversarial network that addresses data scarcity by generating high-quality synthetic data and performing adaptive fusion with real samples (Wang et al., 2026a). Additionally, a spatial-channel collaborative multi-scale graph interaction deep transfer learning method was introduced for unsupervised rotating machinery fault diagnosis, which enhances multi-source feature interaction and prototype extraction (Wang et al., 2026b).
The fundamental principle of deep learning-based fault diagnosis lies in establishing a direct mapping between sensor-acquired fault data (e.g., vibration signals) and fault categories using deep learning models. This end-to-end framework enables automatic feature learning and fault classification from raw data, without explicit reliance on prior knowledge (Dybala and Zimroz, 2014; Wen et al., 2018). This framework is especially advantageous for diagnosing individual faults. Beyond vibration signals, acoustic signals have proven effective for fault detection, particularly in environments where vibration sensors are less reliable, such as high-temperature, corrosive, or confined spaces. Acoustic measurements offer non-invasive, real-time monitoring, enabling effective fault detection under such challenging conditions (Mamun et al., 2025). However, in real industrial scenarios, the diagnostic challenge is further intensified by the fact that rotating mechanical systems frequently produce compound faults where multiple defects coexist. In this study, slight faults are defined as low-severity physical defects that generate extremely weak impulsive signatures. These subtle signals are easily submerged by strong background noise or masked by the dominant energy components of other concurrent faults. Existing diagnostic models and feature extraction methods, which are primarily optimized for prominent single fault scenarios, often fail to isolate these subtle features. This leads to a significant decrease in diagnostic accuracy and a higher risk of missed detections. To provide a comprehensive understanding of compound fault diagnosis, a systematic review was conducted summarizing research status, challenges, and future prospects of rolling bearing compound fault diagnosis methods, covering analytical models, signal processing, and artificial intelligence approaches (Li et al., 2024). 
Furthermore, an interpretable subdomain enhanced adaptive network (ISEANet) was proposed to improve unsupervised cross-domain fault diagnosis of rolling bearings by incorporating sparse subsegment-guided noise reduction, lightweight multi-feature extraction, and improved local maximum mean discrepancy (Liu et al., 2024). Consequently, unimodal measurement methods may be insufficient for reliably diagnosing faults in rotating machinery under actual operational conditions. Multimodal diagnostic approaches, such as the joint utilization of acoustic and vibration signals, offer an effective means to address these challenges. By exploiting complementary information from different sensing modalities, these methods reduce the likelihood of missed fault detection and improve diagnostic accuracy (Ji et al., 2021).
Significant progress has been made in multimodal deep learning fault diagnosis, with several advanced methods proposed in recent studies. For instance, a deeply coupled autoencoder network has been introduced to fuse vibration and acoustic data for fault diagnosis of gears and bearings (Ma et al., 2018). Niu et al. converted multi-sensor data into grayscale images and employed a deep residual network for fault diagnosis (Niu et al., 2021b). Zhou et al. proposed a hybrid approach in which deep features were extracted from one-dimensional vibration data and two-dimensional image data using a stacked autoencoder (SAE) and a convolutional neural network (CNN), respectively (Zhou et al., 2019). They then performed feature fusion for fault diagnosis, validating their method on the Case Western Reserve University bearing dataset. Although the aforementioned studies effectively exploit multimodal data to enhance feature learning and improve diagnostic accuracy, they generally overlook the impact of noise commonly encountered in industrial environments, which limits their practical applicability. As a result, increasing research attention has been directed toward fault diagnosis under noisy conditions. For instance, one method monitors bearing faults using a multi-sensor fusion algorithm based on convolutional neural network-long short-term memory (CNN-LSTM) networks (Hao et al., 2020). Wang et al. extracted features directly from raw vibration and acoustic signals and integrated them using a one-dimensional CNN (Wang et al., 2021). Their method, validated on bearing diagnosis at various signal-to-noise ratios, demonstrated higher recognition accuracy than algorithms relying on single-modal sensors. Wang et al. transformed signals from multiple vibration sensors into images and fused them along the channel dimension to create feature-rich multichannel images (Wang et al., 2023).
These methods substantially improve the diagnostic performance of rotating machinery through multimodal deep learning. Nevertheless, notable limitations remain, particularly the insufficient consideration of slight and compound faults.
Beyond the limitations discussed above, further challenges remain despite the progress achieved by multimodal deep learning-based fault diagnosis methods. For example, some studies have indicated that existing fault feature extraction and processing techniques still exhibit deficiencies. To address this problem, future efforts could focus on two key areas. First, developing feature extraction methods that provide rich frequency-domain and time-domain features would enable a more precise fault characterization. Second, designing model architectures tailored to the fault features of diverse signals of equipment could enhance the learning capability for spatiotemporal features, further improving diagnostic accuracy.
In terms of feature extraction methods, many researchers have used mel-frequency cepstral coefficients (MFCC) or gammatone frequency cepstral coefficients (GFCC) to capture distinct mechanical states for fault diagnosis. Geng et al. proposed a method combining GFCC-based time-frequency representations of acoustic signals with a convolutional neural network (CNN) for fault diagnosis (Geng et al., 2019). Yue et al. proposed a new method called mel frequency mapping classification (MFMC) for the nonlinear mapping and classification of fault features, which is capable of distinguishing various health states of machinery under fluctuating operating conditions (Yue et al., 2024). Zhou et al. introduced a state identification method based on acoustic signals from rolling tires to detect bulging issues in tire endurance tests. This method utilizes the modified mel frequency cepstral coefficients (SMFCC) to represent the acoustic signal characteristics (Zhou et al., 2025). Yan et al. combined an improved time-frequency spectral kurtosis (MTSK) feature parameter, which exploits the sensitivity of kurtosis to impact signals, with MFCC, which reflect the acoustic characteristics of the human ear, achieving a fault diagnosis accuracy of 99.6% at a speed of 20 km/h (Yan et al., 2022). However, the signal processing methods employed in the aforementioned studies exhibit critical limitations when integrated into deep learning-based diagnostic frameworks. While techniques such as MFCC offer high computational efficiency by using the discrete cosine transform (DCT) to highlight spectral envelope information, this compression is fundamentally designed for human speech recognition rather than mechanical fault detection. In the context of bearing diagnostics, the DCT step inevitably results in the loss of frequency-domain feature details and transient components that contain the signatures of slight physical defects.
Consequently, these methods are not optimal for end-to-end deep learning models that require high fidelity features to accurately characterize machinery states.
To address the limitations of the aforementioned studies, this research employs multimodal (vibration and acoustic) signals processed using the Fbank method for fault diagnosis through multi-source data fusion targeting slight and compound bearing faults. First, the collected acoustic and vibration signals are converted into WAV audio files before undergoing Fbank-based processing. For diagnosing faults in rotating machinery, Fbank effectively captures the spectral details of acoustic signals, while vibration signals are similarly processed to extract critical information regarding machinery states. Unlike other feature extraction methods, Fbank features retain more fault-relevant information by omitting the discrete cosine transform (DCT) step. This allows Fbank to provide richer frequency-domain and time-domain features, which contribute to a more precise characterization of fault conditions. To further enhance the generalization and robustness of the fault diagnosis method, this study explores the use of a multi-source data fusion strategy. This technique integrates information from multiple sources according to specific rules to produce unified and more meaningful insights. Compared to single-source information, multi-source data fusion leverages the complementary nature of diverse information streams, enhancing the effectiveness and accuracy of diagnostic systems.
To address the limitations identified above, this research develops a parallel neural network method based on the multi-level fusion of acoustic-vibration signals. The main contributions are as follows: (1) Fbank-based signal processing: To resolve the problem of information loss in traditional DCT-based methods, we adopt an Fbank approach. This innovation ensures the retention of richer fault-related information, providing a more comprehensive data foundation for characterizing slight and compound bearing faults under noisy conditions. (2) CNN-BiGRU parallel neural network framework: To handle the complexity of compound fault signals, we design a parallel deep learning framework. Unlike serial models, this framework uses the CNN to extract spatial features and the BiGRU to capture temporal dependencies simultaneously, enabling the model to extract more effective fault signatures from complex, overlapped signals. (3) A multi-level fusion approach: This study develops a multi-level fusion approach that integrates acoustic and vibration information at both the feature and decision levels. By exploiting the complementarity of multimodal signals, this approach effectively addresses the problem of insufficient information in single-source data, significantly improving diagnostic robustness for slight and compound faults.
2. Methodology
2.1. Fbank-based signal processing
The Fbank feature, short for Filter Bank feature, is an acoustic feature derived from the post-processing of the short-time Fourier transform (STFT) of a speech signal. This feature effectively characterizes the spectral properties of speech by decomposing the audio signal into multiple frequency bands and extracting the energy information of each band. The Fbank-based signal processing utilized in this study is illustrated in Figure 1. The Fbank-based signal processing typically involves the following steps: (1) Pre-emphasis: The input speech signal undergoes pre-emphasis to enhance high-frequency components and mitigate the effects of high-frequency attenuation. (2) Framing: The speech signal is divided into short-duration frames, often with a certain degree of overlap between frames. (3) Windowing: A window function (such as the Hamming window or the Hanning window) is applied to each frame to minimize edge effects. (4) Fourier Transform: The STFT is applied to the windowed frames to compute the spectrum. (5) Filter Bank Processing: The spectrum is processed through a set of overlapping filters (typically triangular filters), and the energy output of each filter is calculated. (6) Logarithmic Scaling: The energy outputs of the filter bank are logarithmically scaled to align the feature representation with human auditory perception.
Figure 1. Fbank-based signal processing.
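The six steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in this study: the function names, the frame length, hop size, number of filters, and pre-emphasis coefficient are all assumptions chosen for a small demo, and a naive DFT stands in for the FFT that a practical pipeline would use.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale.

    Returns a list of n_filters rows, each with n_fft // 2 + 1 weights.
    """
    n_bins = n_fft // 2 + 1
    mel_max = hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 edge points: left edge, filter centres, right edge
    edges_hz = [mel_to_hz(mel_max * i / (n_filters + 1)) for i in range(n_filters + 2)]
    edges_bin = [hz * n_fft / sample_rate for hz in edges_hz]  # fractional bin positions
    bank = []
    for m in range(1, n_filters + 1):
        left, centre, right = edges_bin[m - 1], edges_bin[m], edges_bin[m + 1]
        row = []
        for k in range(n_bins):
            if left < k <= centre:
                row.append((k - left) / (centre - left))    # rising slope
            elif centre < k < right:
                row.append((right - k) / (right - centre))  # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def power_spectrum(frame, n_fft):
    """One-sided power spectrum via a naive DFT (O(n^2), fine for short demo frames)."""
    spectrum = []
    for k in range(n_fft // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n_fft) for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n_fft)
    return spectrum


def fbank(signal, sample_rate, frame_len=64, hop=32, n_filters=8, pre_emph=0.97):
    """Log mel filter bank energies; returns T frames, each with n_filters values."""
    # (1) Pre-emphasis: first-order high-pass filter
    emphasized = [signal[0]] + [
        signal[i] - pre_emph * signal[i - 1] for i in range(1, len(signal))
    ]
    # (3) Hamming window, computed once and reused for every frame
    window = [
        0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1))
        for i in range(frame_len)
    ]
    bank = mel_filter_bank(n_filters, frame_len, sample_rate)
    features = []
    # (2) Framing with overlap (hop < frame_len)
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = [emphasized[start + i] * window[i] for i in range(frame_len)]
        spec = power_spectrum(frame, frame_len)                              # (4)
        energies = [sum(h * p for h, p in zip(row, spec)) for row in bank]   # (5)
        features.append([math.log(max(e, 1e-10)) for e in energies])         # (6)
    return features
```

The result is stored frame by frame (T rows of M values); transposing it gives an M × T feature matrix. The small floor inside the logarithm only guards against taking the log of zero for a filter that receives no energy.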

This study adopts the Fbank method for signal processing instead of the conventional MFCC approach because of its superior capability in preserving fault-related spectral information. For Fbank extraction, the Mel filter bank energies are directly retained as:
In contrast, MFCC introduces a discrete cosine transform (DCT) to compress the filter bank energies into cepstral coefficients:
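For reference, the two feature types are commonly written as follows. This is the standard textbook form and not necessarily the notation used by the authors; the symbols $X_t(k)$ (spectrum of frame $t$), $H_m(k)$ (the $m$-th mel filter), $K$ (number of frequency bins), $M$ (number of filters), and $N$ (number of retained cepstral coefficients) are assumptions introduced here.

```latex
% Fbank: log mel filter bank energies, retained directly
F_t(m) = \log\!\left( \sum_{k=0}^{K-1} H_m(k)\, \lvert X_t(k) \rvert^{2} \right),
\qquad m = 1, \dots, M

% MFCC: DCT-II of the same log energies, usually truncated to N < M coefficients
c_t(n) = \sum_{m=1}^{M} F_t(m)\, \cos\!\left[ \frac{\pi n}{M} \left( m - \tfrac{1}{2} \right) \right],
\qquad n = 0, \dots, N-1
```

Keeping only the first $N$ coefficients discards the rapidly varying part of $F_t(m)$ across the filter index $m$, which is the spectral detail that the Fbank representation retains.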
2.2. CNN-BiGRU
A convolutional neural network (CNN) is a type of feed-forward neural network that incorporates convolutional operations. Its architecture comprises several layers, including convolutional layer, activation function, pooling layer, fully connected layer, and classification layer (Li et al., 2022). The convolutional layer, activation function, and pooling layer work together to extract features from the inputs, while the fully connected layer flattens the feature maps. This layer, in combination with the Softmax classifier, produces classification results.
A recurrent neural network (RNN) is a network architecture with memory capabilities, allowing it to retain information from previous inputs in its internal states (Weerakody et al., 2021). RNNs are particularly well suited for solving time series problems, but their training process can encounter challenges such as gradient vanishing or gradient exploding. To address these issues, advanced variants of RNNs, such as gated recurrent units (GRUs), introduce a gating mechanism. GRUs simplify the internal structure by employing only an update gate and a reset gate, reducing model parameters and mitigating overfitting risks. The bidirectional GRU (BiGRU), used in this study, further enhances performance by capturing both forward and backward dependency information.
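The gating mechanism can be written compactly. The following is one common formulation of the GRU and its bidirectional extension, given as a sketch: the weight symbols $W$, $U$, $b$ and the sign convention of the update gate are assumptions, since conventions differ between references.

```latex
z_t = \sigma(W_z x_t + U_z h_{t-1} + b_z)                                  % update gate
r_t = \sigma(W_r x_t + U_r h_{t-1} + b_r)                                  % reset gate
\tilde{h}_t = \tanh\!\big(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h\big)     % candidate state
h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t                      % new hidden state

% BiGRU: run one GRU forward and one backward over the sequence, then concatenate
h_t^{\mathrm{bi}} = \big[\, \overrightarrow{h}_t \,;\, \overleftarrow{h}_t \,\big]
```

With only two gates and no separate cell state, a GRU has fewer parameters than an LSTM of the same hidden size, which is the simplification referred to above.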
This study integrates the features of CNN and BiGRU within a CNN-BiGRU parallel network model, as depicted in Figure 2. While CNNs excel at extracting spatial features, they may lose critical temporal features during the training process. For audio signals, capturing temporal relationships is particularly important. The BiGRU component in the model plays a crucial role in effectively extracting these temporal features from faulty signals. The outputs of the parallel network are fused into a new feature vector using the method detailed in Section 2.3, enabling the model to leverage both spatial and temporal features for a deeper understanding of complex fault patterns.
Figure 2. CNN-BiGRU parallel network model structure.
2.3. Multi-level fusion
Multi-source information fusion is a technology that integrates, correlates, and synthesizes data from multiple similar or dissimilar sensors to evaluate and identify the source of information. Since the data from unimodal sensors cannot fully capture the operating condition of equipment, there is inherent uncertainty in their diagnostic results. By leveraging multi-source information fusion technology, the physical, spatial, and temporal attributes of the acquired information are expanded, enabling a more comprehensive representation of the equipment’s state. This ultimately leads to higher diagnostic accuracy. Based on the level at which fusion takes place, multi-source information fusion methods can typically be categorized into data-level fusion, feature-level fusion, decision-level fusion, and model-level fusion (Zhang et al., 2024). This study adopts a multi-level fusion approach that combines feature-level fusion and decision-level fusion to enhance diagnostic robustness and accuracy.
Feature-level fusion involves extracting features from sensor data, fusing the resulting feature vectors, selecting the most relevant fused features, and using them for pattern recognition. As an intermediate level of fusion, it processes data after feature extraction, thereby reducing the volume of data to be handled and facilitating real-time processing. Decision-level fusion, on the other hand, involves making independent decisions for each sensor data mode and then combining these decisions using a specific rule to obtain a unified final decision. As the highest level of information fusion, this method effectively reduces the misdiagnosis rate and improves diagnostic robustness. Even if one modality or model underperforms, other modalities or models can compensate, ensuring a stable and accurate overall diagnosis.
This study proposes a multi-level and multimodal data fusion framework that achieves efficient integration of vibration and acoustic signals through the synergistic interaction of feature-level and decision-level fusion. The core concept of this framework involves hierarchically processing the heterogeneity of different modal data to fully leverage the characteristics of each level, thereby avoiding the limitations of a single fusion strategy. Simultaneously, it introduces a dynamic weighting mechanism, primarily implemented through adaptive weighting. The designed adaptive weights serve as trainable parameters optimized through back-propagation with constraints. This approach enhances the complementarity of multimodal information via adaptive weight optimization, thereby improving the accuracy and robustness of fault diagnosis models.
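One common way to keep trainable fusion weights valid is to pass unconstrained parameters through a softmax, which makes them non-negative and forces them to sum to one. The sketch below illustrates weighted decision-level fusion under that assumption; the function names and the softmax parameterization are hypothetical and are not claimed to be the constraint mechanism used in this study.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_decisions(branch_probs, raw_weights):
    """Weighted decision-level fusion of per-branch class probabilities.

    branch_probs: one class-probability vector per branch
                  (e.g. fused-feature, vibration, and acoustic branches).
    raw_weights:  unconstrained scalars, one per branch (hypothetical
                  trainable parameters; softmax maps them to valid weights).
    """
    weights = softmax(raw_weights)  # non-negative and summing to 1
    n_classes = len(branch_probs[0])
    fused = [
        sum(w * probs[c] for w, probs in zip(weights, branch_probs))
        for c in range(n_classes)
    ]
    return fused, weights
```

Because each branch output is a probability vector and the weights form a convex combination, the fused output is again a valid probability vector, so the predicted class is simply its largest entry.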
2.4. Fbank-CNN-BiGRU parallel network method
This study proposes an innovative Fbank-CNN-BiGRU parallel network method for fault diagnosis of rolling bearings based on acoustic-vibration fusion, as illustrated in Figure 3.
Figure 3. Structure of the Fbank-CNN-BiGRU parallel network method based on multi-level acoustic-vibration signal fusion.
First, vibration and acoustic signals undergo Fbank-based feature extraction to obtain spectral representations, thereby enhancing feature representation ability. In this stage, the model performs feature extraction, which begins by obtaining the power spectrum:
Subsequently, the feature matrix of energy is obtained using the Mel filter bank in Fbank:
Finally, an M × T characteristic matrix is obtained. Next, the extracted data is fed into a parallel network, where the CNN-BiGRU parallel network is employed to extract both spatial and temporal features, ensuring a more comprehensive feature representation. Convolutional networks process vibration signals to extract local spatiotemporal energy patterns and edge features. Following convolution and pooling layers, Flatten and Dense layers produce high-level feature vectors:
The vibration branch then outputs its category prediction via softmax:
Audio signals exhibit strong temporal correlations. BiGRU is employed to capture temporal dependencies between preceding and subsequent segments, followed by a fully connected layer to extract acoustic features:
The final result is the acoustic branch prediction:
Finally, the results from the feature-level fusion and the outputs of the independent models are integrated using a dynamically weighted decision-level fusion approach, ultimately yielding the classification results. The final prediction is obtained through a weighted combination of three branch outcomes:
To ensure the validity of the weighted fusion process, the adaptive parameters satisfy the following constraints:
To simultaneously optimize the synergy between feature learning and fusion decision-making, a compound loss function is defined:
The confidence-weighted coefficients w_F, w_V, and w_S are dynamically adjusted according to branch prediction confidence and fixed hyperparameters τ and κ. Specifically, the confidence score of each branch is defined as:
The confidence scores are normalized through the Softmax operation:
Accordingly, the final confidence-aware fusion weight is computed as:
During training, the adaptive parameters α and β are updated together with all network parameters through gradient optimization:
Therefore, the loss function dynamically adjusts the modal weights according to their confidence, while the parameters α and β are jointly optimized to ensure that this adjustment converges toward the global optimum. By integrating feature-level fusion and decision-level fusion through the above design, comprehensive information about rolling bearings is synthesized to enhance diagnostic accuracy.
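The idea of confidence-aware weighting can be illustrated with a short sketch. It rests on two loudly stated assumptions that are not taken from this study: a branch's confidence score is taken to be its largest class probability, and τ is treated as a softmax temperature. The role of κ is not modeled here, and the function name is hypothetical.

```python
import math


def confidence_weights(branch_probs, tau=0.5):
    """Softmax-normalised confidence weights, one per branch.

    Assumptions (illustrative only): confidence = max class probability,
    tau = softmax temperature (smaller tau gives sharper weights).
    """
    scores = [max(probs) / tau for probs in branch_probs]
    peak = max(scores)  # subtract the peak for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Under these assumptions, a branch that predicts with high certainty receives a larger share of the final decision, and lowering τ concentrates the weight further on the most confident branch.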
In addition to the main structure described above, regularization and dynamic learning rate methods are incorporated into the model to enhance the performance of the parallel network. To mitigate overfitting risks, L2 regularization and Dropout are added to the network architecture. Dropout randomly deactivates neurons during training with a certain probability, while leaving the input and output neurons unchanged. L2 regularization adds a penalty term proportional to the sum of the squared weight coefficients, which helps to smooth the model and reduce its tendency to overfit the training data (Benninger et al., 2023). Additionally, an early stopping method is implemented to further reduce overfitting. By monitoring the validation loss at the end of each training cycle, training is discontinued if no improvement is observed for a predefined number of cycles. This approach prevents the model from overlearning on the training set, thereby improving its generalization performance on unseen data while saving computational resources.
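The early stopping rule described above reduces to a small amount of bookkeeping. The sketch below is a minimal illustration; the function name and the default `patience` and `min_delta` values are assumptions, not settings reported in this study.

```python
def early_stopping_epoch(val_losses, patience=5, min_delta=0.0):
    """Return the epoch index at which training would stop,
    or None if the patience budget is never exhausted.

    val_losses: validation loss recorded at the end of each epoch.
    patience:   number of consecutive non-improving epochs tolerated.
    min_delta:  minimum decrease that counts as an improvement.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            best = loss   # new best validation loss, reset the counter
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return None
```

In practice the model weights from the best epoch are usually restored after stopping, so that the final model corresponds to the lowest validation loss rather than the last epoch trained.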
To enhance the final diagnostic performance, the learning rate decay method is employed to improve the convergence behavior of the model. This technique dynamically adjusts the learning rate in a stepwise manner: a larger learning rate is used in the early stages of training to accelerate convergence, while a smaller learning rate is applied in later stages to fine-tune parameters more precisely (Fan et al., 2024). This strategy ensures a balance between convergence speed and precision, ultimately improving the overall performance of the model.
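A stepwise schedule of this kind can be expressed in one line. The sketch below is illustrative only: the function name, the drop factor, and the interval between drops are hypothetical values, not the settings used in this study.

```python
def step_decay(initial_lr, epoch, drop_factor=0.5, epochs_per_drop=10):
    """Learning rate at a given epoch under stepwise decay.

    The rate is multiplied by drop_factor once every epochs_per_drop epochs,
    so it stays constant within each step and falls at the step boundaries.
    """
    return initial_lr * drop_factor ** (epoch // epochs_per_drop)
```

With these example values, the rate is held at its initial value for the first ten epochs and is then halved at epochs 10, 20, 30, and so on.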
3. Experiment
3.1. Data set and experimental environment
The experimental data in this paper come from the BPS (Bearing Prognosis Simulation) experimental bench in our laboratory, which mainly consists of an AC motor, a coupling, a support bearing, a test bearing, a vibration sensor, and an acoustic sensor. The details are shown in Figure 4(a).
Figure 4. BPS experimental system and bearing fault types.
The output end of the AC motor is connected to the transmission system via couplings, with the support bearing and test bearing positioned near the two ends of the rotating shaft, respectively. The rotational speed of the motor spindle is adjustable using a frequency converter to simulate various operating conditions for the bearings. In this experiment, three rotational speeds were selected: 960 r/min, 1200 r/min, and 1500 r/min. The test bearings utilized in this study are ER-16K deep groove ball bearings.
Fault types and dimensions.
In this study, vibration sensor signals in the vertical direction and acoustic sensor signals were selected to construct the dataset. The dataset was divided into training, validation, and test sets with a ratio of 6:2:2 (Yang et al., 2026). To ensure balanced representation, the split was performed in a stratified manner based on fault types, with each state containing a total of 1000 samples that were proportionally allocated to each subset. It should be noted that, in this work, each classification task is conducted under a single rotational speed condition. That is, data corresponding to different rotational speeds are not mixed within the same classification experiment but are treated independently. For each speed condition, the dataset is separately constructed and then divided into training, validation, and test sets following the same 6:2:2 ratio. This setting ensures that the model is trained and evaluated under consistent operating conditions.
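A stratified 6:2:2 split of this kind can be sketched as follows. The function name and the fixed random seed are assumptions made for reproducibility of the example; the sketch operates on sample indices and would be applied separately to the dataset of each rotational speed.

```python
import random


def stratified_split(labels, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/validation/test sets so that
    every class keeps the same proportions in each subset."""
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)  # shuffle within the class before slicing
        n_train = round(ratios[0] * len(indices))
        n_val = round(ratios[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test
```

With 1000 samples per state, each state contributes 600 training, 200 validation, and 200 test samples.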
To simulate noisy environments in a controlled and reproducible manner, additive Gaussian white noise was introduced directly to the raw time-domain signals prior to feature extraction. It should be noted that this type of noise is a simplified approximation and does not fully represent the complex noise characteristics encountered in real industrial scenarios, such as harmonic interference, non-stationary noise, and impulsive disturbances. Specifically, for each clean signal sample
The signal-to-noise ratio (SNR) was defined based on the power ratio between the clean signal and the noise, consistent with the formulation used in Zhang et al. (2018):
This procedure ensures that the resulting noisy signal satisfies the target SNR level precisely, which is consistent with the implementation used in our code.
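The noise injection step can be sketched as follows, assuming the usual decibel definition SNR = 10·log10(P_signal / P_noise). The function names and the seed are hypothetical, and this is an illustration of the general technique, not the code used in this study. The noise is rescaled by its measured power, not its nominal variance, which is what makes the realised SNR match the target exactly.

```python
import math
import random


def add_noise_at_snr(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled so the realised SNR equals snr_db."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # Noise power required to reach the target SNR
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * n for x, n in zip(signal, noise)]


def measured_snr_db(clean, noisy):
    """SNR in dB computed from a clean signal and its noisy version."""
    signal_power = sum(x * x for x in clean)
    noise_power = sum((y - x) ** 2 for x, y in zip(clean, noisy))
    return 10.0 * math.log10(signal_power / noise_power)
```

At −10 dB the noise carries ten times the power of the signal, while at 10 dB it carries one tenth, which gives a sense of how demanding the lowest SNR setting is.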
After noise injection, Fbank features were extracted from the noisy signals, ensuring that the extracted features reflect realistic noisy conditions. In this study, the proposed method was evaluated under five SNR levels: −10 dB, −6 dB, 0 dB, 6 dB, and 10 dB.
To evaluate the performance of the model, appropriate evaluation metrics need to be established. These metrics assess the model’s capabilities from different perspectives, and the choice of metrics typically depends on the specific learning task. Since fault diagnosis involves classifying different health states into distinct categories, it can be framed as a classification problem. Taking a binary classification problem as an example, the outcomes can be categorized into four types based on the true class of a sample and the predicted class of the model. True Positive (TP): positive samples correctly classified as positive; False Positive (FP): negative samples incorrectly classified as positive; True Negative (TN): negative samples correctly classified as negative; False Negative (FN): positive samples incorrectly classified as negative. Let TP, FP, TN, and FN represent the number of samples in each of these categories. From these outcomes, commonly used evaluation metrics for fault diagnosis, such as accuracy, precision, recall, and F1-score, can be derived. In this study, multiple evaluation metrics are adopted to comprehensively assess the diagnostic performance of the proposed method. Accuracy is the proportion of correctly classified samples among all samples; it is the most commonly used performance metric in fault diagnosis and provides an overall assessment of model performance. However, for multi-class fault diagnosis problems, accuracy alone may not fully reflect the recognition capability of the model for minority fault categories. Therefore, precision, recall, F1-score, and the identification rate for each fault type are additionally introduced. Accuracy is calculated using equation (22):
Precision represents the proportion of correctly identified positive samples among all samples predicted as positive, reflecting the reliability of the model predictions. A higher precision indicates that fewer normal samples or other fault types are incorrectly classified as the target fault. Precision is calculated using equation (23):
$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{23}$$
Recall measures the proportion of actual positive samples that are correctly identified by the model. A higher recall indicates that the model has a stronger capability to detect fault samples and a lower probability of missed diagnosis. Recall is calculated using equation (24):
$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{24}$$
The F1-score is the harmonic mean of precision and recall, which comprehensively evaluates the balance between fault detection capability and prediction reliability. It is particularly suitable for imbalanced classification problems. The F1-score is defined using equation (25):
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{25}$$
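For a multi-class task, the binary definitions above are usually applied one class at a time (one-vs-rest) to a confusion matrix. The sketch below illustrates that idea under stated assumptions: the macro averaging, the function names, and the 3-class matrix are illustrative choices, and the paper does not state which averaging scheme it uses.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[k][k] for k in range(n)) / total
    rows = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                       # class-k samples predicted as another class
        fp = sum(cm[i][k] for i in range(n)) - tp  # other classes predicted as class k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        rows.append((precision, recall, f1))
    return accuracy, rows


def macro_average(rows):
    """Unweighted mean of precision, recall, and F1 over all classes."""
    n = len(rows)
    return tuple(sum(r[i] for r in rows) / n for i in range(3))


# Hypothetical 3-class confusion matrix, for illustration only.
cm = [
    [48, 1, 1],
    [2, 45, 3],
    [0, 4, 46],
]
accuracy, rows = per_class_metrics(cm)
print(f"accuracy = {accuracy:.4f}")
for k, (p, r, f1) in enumerate(rows):
    print(f"class {k}: precision={p:.4f} recall={r:.4f} f1={f1:.4f}")
print("macro P/R/F1 = " + ", ".join(f"{v:.4f}" for v in macro_average(rows)))
```

The per-class recall values computed this way correspond to what the text calls the identification rate of each fault type.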
3.2. Results and discussion
This section begins by visualizing the collected data. The time-domain waveforms of the bearing vibration signal in five different states are shown in Figure 5(a). The waveforms of the various fault types are not visually distinct, which makes it difficult to identify the specific fault type by eye. The feature images obtained by applying the Fbank feature extraction method to the original time-domain signals are shown in Figure 5(b). These images present the extracted features of the signals and provide a more detailed and distinguishable representation than the raw time-domain waveforms.
Figure 5. (a) Raw time-domain signals for different bearing states and (b) Fbank feature images of the signals for different bearing states.
From the extracted results, the Fbank feature images of different states of slight faults are more distinguishable than the original time-domain signals. However, visual observation alone is still not enough to accurately distinguish the different states of bearings, and deep learning training is needed for more detailed analysis.
Accuracy of different features.
The training and validation loss curves for the unimodal data and the acoustic-vibration fusion data are shown in Figure 6. The training and validation losses of the proposed method converge rapidly as the number of iterations increases. In addition, the acoustic-vibration fusion signals yield lower and more stable loss values than unimodal diagnosis during both the training and validation phases. These findings give a preliminary indication of the advantages of acoustic-vibration fusion. However, more comprehensive comparisons and statistical analyses are required to fully validate the superiority of this approach.
Figure 6. (a) Loss function of Fbank-CNN based on vibration signal, (b) loss function of Fbank-CNN based on acoustic signal, and (c) loss function of the proposed method.
To further evaluate the diagnostic performance of the proposed method and highlight its advantages over unimodal diagnosis, two visualization techniques are used in this study: the t-SNE visualization algorithm and the confusion matrix. t-SNE (t-distributed stochastic neighbor embedding) is a visualization technique that converts the high-dimensional Euclidean distances between data points into conditional probabilities that represent their similarities. Once training has converged, projecting the data into a 2D or 3D space with t-SNE shows whether the classes are well separated and gives a clearer picture of the data structure. The confusion matrix is a tool for visualizing the performance of a classification model. It compares the predicted and true labels to assess the model’s classification accuracy. From the confusion matrix, performance metrics such as accuracy, recall, and F1-score can be computed, offering a comprehensive evaluation of the model’s performance. It also helps identify the categories that are prone to misclassification, providing valuable insights for model improvement. The results of the t-SNE visualization and the confusion matrix are presented in Figures 7 and 8, respectively.
Figure 7. t-SNE results.
Figure 8. (a) Confusion matrix of Fbank-CNN based on vibration signal, (b) confusion matrix of Fbank-CNN based on acoustic signal, and (c) confusion matrix of the proposed method.
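As a rough illustration of how a confusion matrix exposes the categories that are prone to misclassification, the sketch below builds one from true and predicted labels and lists the largest off-diagonal cells. The five class names follow the bearing states discussed in this section; the label sequences and function names are hypothetical and are not results from the paper.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Count matrix with rows = true class and columns = predicted class."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def most_confused(cm, labels, top=3):
    """Off-diagonal cells sorted by count: (count, true label, predicted label)."""
    pairs = [
        (cm[i][j], labels[i], labels[j])
        for i in range(len(cm))
        for j in range(len(cm))
        if i != j and cm[i][j] > 0
    ]
    return sorted(pairs, key=lambda pair: pair[0], reverse=True)[:top]


labels = ["normal", "inner race", "outer race", "rolling element", "compound"]
# Hypothetical true and predicted class indices, for illustration only.
y_true = [0, 0, 1, 1, 1, 2, 2, 3, 3, 4, 4, 4]
y_pred = [0, 0, 1, 4, 1, 2, 3, 3, 3, 4, 1, 1]

cm = confusion_matrix(y_true, y_pred, len(labels))
for name, row in zip(labels, cm):
    print(f"{name:>16}: {row}")
for count, true_name, pred_name in most_confused(cm, labels):
    print(f"{count} x '{true_name}' predicted as '{pred_name}'")
```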

To further analyze the diagnostic capability of the proposed method, the t-SNE visualization results and confusion matrices were investigated in detail. Compared with the unimodal methods based on vibration or acoustic signals, the proposed method exhibits clearer clustering boundaries and less overlap among different fault categories in the t-SNE feature space. The confusion matrices further support this observation. Compared with the unimodal methods, the proposed method produces fewer misclassified samples among different fault categories. The proposed method maintains relatively high recognition accuracy for different slight fault categories, demonstrating that the fused features remain sufficiently distinguishable even when the fault signatures are weak. The results show that unimodal data alone may not always provide sufficiently stable fault representations for slight and compound fault diagnosis. After combining vibration and acoustic information, more complementary fault characteristics can be retained, resulting in more stable feature representations and improved classification performance.
Accuracy and standard deviation for different rotational speeds and different SNR conditions.
The accuracy of the proposed diagnostic method remained high across the three rotational speeds (960 rpm, 1200 rpm, and 1500 rpm). The highest accuracy, 99.55%, was obtained at 1200 rpm with SNR = 10 dB. Even under relatively harsh conditions (SNR = −10 dB), the method maintains a high level of accuracy: 90.32% at 960 rpm and 92.25% at 1200 rpm. These results demonstrate the method’s superiority in diagnosing slight and compound bearing faults under simulated noise interference, indicating its promising potential for practical applications.
Since the diagnostic performance at 1200 rpm consistently achieves the best results under different SNR conditions, the dataset at 1200 rpm was selected for further comparative analysis. To comprehensively evaluate the effectiveness of the proposed method, the 1200 rpm dataset was additionally applied to unimodal fault diagnosis models based on vibration signals and acoustic signals.
Diagnostic results of bearing fault at 1200 rpm.
The results indicate that the proposed method achieves consistently superior diagnostic performance compared with the unimodal methods under different SNR conditions. As the SNR decreases from 10 dB to −10 dB, the diagnostic performance of all methods gradually deteriorates because the fault features become increasingly submerged by environmental noise. However, compared with the vibration-based and acoustic-based models, the proposed method exhibits a significantly slower degradation trend in Accuracy, Precision, Recall, and F1-score, demonstrating stronger robustness and stability under noisy conditions.
Specifically, the proposed method consistently achieves the highest Accuracy, Precision, Recall, and F1-score values under all tested SNR conditions. The higher Accuracy indicates that the proposed method can improve the overall correctness of fault classification by effectively integrating complementary information from vibration and acoustic signals. Meanwhile, the higher Precision and Recall demonstrate that the proposed method can simultaneously reduce false classifications and missed detections under noisy environments. This advantage is particularly important for slight bearing faults, since weak fault signatures are more susceptible to noise interference and are difficult to distinguish using single-modal information alone. Furthermore, the superior F1-score indicates that the proposed method achieves a better balance between prediction reliability and fault detection capability. Compared with the unimodal approaches, the proposed method can preserve more discriminative fault features under low-SNR conditions, thereby achieving more stable and reliable diagnostic performance.
The recall values of each fault category are used to represent the individual identification rates, which provide a more detailed evaluation of the recognition capability for different bearing fault types. The individual identification rates further verify the effectiveness of the proposed method for both slight and compound bearing fault diagnosis. For normal conditions, inner race faults, outer race faults, rolling element faults, and compound faults, the proposed method maintains consistently high recall values under different SNR conditions. These results demonstrate that the proposed method can effectively extract and preserve discriminative fault features from both vibration and acoustic signals, thereby improving the identification capability for both slight and compound bearing faults in noisy environments.
To further highlight the advantages of the proposed method under various SNRs, confusion matrices for the proposed method and the two unimodal methods are compared. These matrices were plotted at three noise levels (SNR = 10 dB, 0 dB, and −10 dB) and are presented in Figure 9.
Figure 9. Comparison of confusion matrices under different SNR levels.
At 10 dB, all methods achieve relatively high classification accuracy because the fault features are still distinguishable from background noise. However, the unimodal methods already exhibit some confusion among different slight fault categories, while the proposed method maintains fewer misclassified samples. As the SNR decreases to 0 dB, the influence of noise becomes more significant. The unimodal methods show an obvious increase in misclassification. In contrast, the proposed method still maintains relatively stable classification performance. This indicates that the proposed method can preserve more useful fault features under noisy conditions. At −10 dB, severe noise interference causes substantial degradation in all methods. Several fault categories in the unimodal methods are incorrectly identified as other fault types or even normal conditions. Although the recognition accuracy of the proposed method also decreases under noise interference, the overall classification performance remains better than that of the unimodal approaches.
The results show that the proposed method is less sensitive to noise than the unimodal approaches across the tested SNR levels. It retains higher accuracy and fewer misclassifications under noisy conditions, which further supports its robustness and effectiveness in diagnosing slight and compound bearing faults in simulated noisy environments.
Comparison results of different algorithms for bearing fault diagnosis.
The results show that the method proposed in this study achieves the highest accuracy across all SNR levels, and its advantage over the other methods is particularly evident at low SNR. Specifically, the accuracy of the proposed method decreases by 7.3 percentage points between SNR = 10 dB and SNR = −10 dB, which is a smaller reduction than that observed for the other methods. These results further validate the superiority of the proposed method in diagnosing slight and compound bearing faults in noisy environments.
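The size of the accuracy reduction can be checked against the figures quoted earlier in this section. This assumes the comparison uses the 1200 rpm results, where the accuracy is 99.55% at SNR = 10 dB and 92.25% at SNR = −10 dB:

```latex
\Delta_{\mathrm{Acc}}
  = \mathrm{Acc}_{10\,\mathrm{dB}} - \mathrm{Acc}_{-10\,\mathrm{dB}}
  = 99.55\% - 92.25\%
  = 7.30 \text{ percentage points}
```

The difference is an absolute drop in percentage points rather than a relative change.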
Comparison of training times and inference times for different bearing fault diagnosis algorithms.
By comparison, the computational overhead introduced by the multi-level feature fusion strategy is relatively limited. After incorporating the fusion module, the training time increases only from 395.5 s to 403.1 s (about 1.9%), indicating that the feature fusion operation adds little complexity. After introducing the adaptive dynamic weighting mechanism, the training time increases further from 403.1 s to 421.8 s (about 4.6%). Compared with the computational cost introduced by the BiGRU branch, the increase caused by the adaptive weighting mechanism remains moderate.
In addition to training efficiency, inference efficiency was also evaluated. The proposed method requires approximately 2.0411 s for inference on the entire test set, corresponding to an average inference time of approximately 0.00102 s per sample. Although the proposed method exhibits higher computational complexity than the baseline method, the inference speed remains sufficiently fast for practical bearing fault diagnosis applications.
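The two reported figures are linked by a simple division: 2.0411 s in total at about 0.00102 s per sample implies a test set on the order of 2000 samples, assuming the average was taken over the whole test set. A minimal sketch of how such a measurement can be taken is given below. The `dummy_predict` function is a hypothetical stand-in for a trained model, and the sample count is illustrative.

```python
import time


def time_inference(predict, samples):
    """Return (total seconds, mean seconds per sample) for one pass over the samples."""
    start = time.perf_counter()
    for sample in samples:
        predict(sample)
    total = time.perf_counter() - start
    return total, total / len(samples)


def dummy_predict(sample):
    # Stand-in for a trained model's forward pass (hypothetical, not the paper's network):
    # returns the index of the largest score.
    return max(range(len(sample)), key=sample.__getitem__)


samples = [[0.1, 0.7, 0.2]] * 2000  # hypothetical test set size
total, per_sample = time_inference(dummy_predict, samples)
print(f"total = {total:.4f} s, per sample = {per_sample:.6f} s")
```

In practice, per-sample latency also depends on batch size and hardware, so a measurement like this is only comparable across methods when those are held fixed.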
Overall, the proposed method requires longer training time than the baseline models, with the majority of the additional computational cost originating from the BiGRU branch. However, the increased computational burden is mainly concentrated in the offline training stage, while the online inference time remains sufficiently low for real-time applications. Considering the significant improvement in diagnosing slight and compound bearing faults under noisy conditions, the additional computational cost is considered acceptable for practical industrial fault diagnosis tasks.
4. Conclusion
In this study, an Fbank-CNN-BiGRU parallel network-based method is proposed for rolling bearing fault diagnosis using acoustic-vibration signal fusion. The method classifies slight and compound bearing faults using acoustic and vibration signals acquired from microphones and accelerometers.
Experimental results demonstrate that the proposed method achieves high diagnostic accuracy for slight and compound bearing faults under various SNR conditions, particularly in low-SNR environments. It significantly outperforms fault diagnosis methods based on unimodal sensors or single-level data fusion strategies. These results confirm that the proposed approach exhibits superior classification accuracy and robustness for both slight and compound bearing faults. However, since the noise considered in this study is limited to Gaussian white noise, its applicability to real-world environments with more complex noise structures should be interpreted with caution. In addition, although the proposed method has been compared with several representative approaches, comparisons with more state-of-the-art methods have not yet been fully investigated, which may limit the comprehensiveness of the current performance evaluation.
Future research will focus on extending the applicability of the proposed method by evaluating its performance on a wider range of mechanical systems with diverse fault types, including additional compound fault combinations. Particular attention will be paid to bearing fault diagnosis under real industrial noise environments, moving beyond Gaussian white noise simulation to incorporate more realistic noise characteristics. More comprehensive comparisons with state-of-the-art methods will also be conducted to further validate the superiority and generalization capability of the proposed approach. To mitigate the increased training time, efforts will also be directed toward optimizing the model for lightweight implementation. In addition, the development of adaptive mechanisms will be investigated to improve generalization capability, enabling effective diagnosis of previously unseen fault types.
Footnotes
Author contributions
Funding
This work was supported by the National Natural Science Foundation of China (No. 52574304) and the National Key Research and Development Program of China (No. 2023YFC3009202).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
