Abstract
Feature extraction and fusion are important for the fault diagnosis and prediction of rotating machinery. While traditional deep learning networks can learn features with a single attribute, they still face difficulties in capturing heterogeneous features with distinct attributes during the fusion process. To solve this problem, a novel capsule network (CapsNet) based on dual heterogeneous feature resonance fusion is presented for heterogeneous feature extraction and fault diagnosis. Firstly, a dual-scale deformable convolution network is proposed to extract dual heterogeneous features. Then, an adaptive heterogeneous feature adjustment mechanism is presented to adjust the weights of heterogeneous features and identify discriminative features. Next, a resonance fusion mechanism is constructed to coordinate and select correlated heterogeneous features in both structural and spatial dimensions, avoiding information conflicts in feature fusion. Lastly, the heterogeneous resonance gain features are introduced into the CapsNet for fault diagnosis and classification tasks. The superiority of the proposed network lies in its ability to integrate and coordinate global and local information, enhancing the correlation between heterogeneous features for improved performance. Comparative experiments on multiple datasets with the state-of-the-art methods demonstrate that the proposed method excels in extracting and fusing dual heterogeneous features under complex operating conditions and noise interference.
Keywords
Introduction
Rotating machinery is widely used in industrial manufacturing and transportation equipment.1,2 However, rotating components are susceptible to wear, fatigue, and other failures due to harsh operating environments and variable operating conditions, which may result in internal damage and even severe safety incidents. 3 As a result, accurate and prompt detection of mechanical faults is important for improving the reliability and productivity of industrial equipment.4,5
Traditional signal-based fault diagnosis methods, such as the Fourier transform, 6 wavelet decomposition, 7 empirical mode decomposition, 8 and variational mode decomposition, 9 have been developed rapidly to extract features from the time and frequency domains and recognize faults. Despite these advancements, traditional approaches still depend predominantly on manual feature extraction and empirical judgment, which poses challenges for fault diagnosis. The introduction of machine learning (ML) provides an efficient and easily designed approach for industrial equipment maintenance by analyzing and identifying the intrinsic relationships between data, with methods such as principal component analysis, 10 hybrid kernel ridge regression, 11 support vector machines, 12 decision trees, 13 and artificial neural networks. 14 However, these traditional ML methods rely on shallow fault features extracted from the signals, which restricts the model’s capacity for nonlinear pattern recognition and autonomous fault diagnosis in complex classification scenarios.
In recent years, deep learning (DL), as a class of ML algorithms, has shown great potential in the field of fault diagnosis due to its automatic feature extraction capabilities and its ability to learn end to end directly from raw vibration signals. Classical DL architectures, including convolutional neural networks (CNNs), 15 graph convolutional networks, 16 deep belief networks, 17 and generative adversarial networks (GANs), 18 have been widely adopted to analyze vibration signals from rotating components such as rolling bearings and gearboxes, enabling accurate fault diagnosis and remaining useful life prediction. In particular, CNNs have made extensive progress in the application of fault diagnosis due to their strong feature extraction ability.19,20 However, although the pooling layer of the CNN provides the model with a prior of translational invariance, it ignores specific pose and spatial location information.
The introduction of the capsule network (CapsNet) addresses the limitations of CNNs in handling spatial relationships and pose recognition during feature extraction. 21 Unlike a neuron in a traditional CNN, a capsule in the CapsNet outputs a vector instead of a scalar, which not only represents the probability of an entity’s existence in the image but also contains multidimensional information such as the pose, angle, and position of that entity. 22 Liu et al. 23 proposed an improved residual-based GAN combined with CapsNet to solve the problem of unbalanced fault data. Xu et al. 24 proposed a convolutional block attention module (CBAM) integrated with CapsNet and applied it to fault diagnosis with data samples at different signal-to-noise ratios (SNRs). However, most of the existing DL models use convolution kernels of a fixed single scale to extract features, which renders the networks ineffective in capturing the critical fault characteristics under nonstationary process conditions.
To effectively exploit diverse fault features, a combination of multiscale convolution and selective kernel adaptation is employed to enable the DL method to dynamically extract features at different scales.25,26 The integration of these techniques further enhances the model’s ability to fuse multiscale data. 27 Wang and Liu 28 proposed a multiscale meta-learning network that can be applied to cross-domain diagnosis with few samples. Wang et al. 29 proposed a multilevel attitude-aware denoising network for bearing fault diagnosis under noisy conditions. Xiong et al. 30 proposed a multiscale adaptive-routing capsule contrastive network for intelligent fault diagnosis in rotating machinery, improving accuracy and robustness under noisy environments and labels. Although these methods can effectively extract multiscale features, they do not fully consider the feature variability of multiple channels.
Due to noise and variable operating conditions, different samples with the same label are fed into a DL model, resulting in corresponding changes in the learned feature vectors. Heterogeneous features from different sources or scales have distinct attributes and spatial dimensions.31,32 During the fusion process, large differences between these features can reduce model accuracy, while highly similar features may lead to redundancy and decrease the model’s learning efficiency.33,34 These potential factors are collectively referred to as information conflict. Miao et al. 35 proposed a channel-wise CNN with feature augmentation for fault diagnosis of wheeled mobile robots, improving accuracy and robustness in multiheterogeneous sensor data. Han et al. 36 proposed a multisource heterogeneous information fusion network for intelligent fault diagnosis of rotating machinery, enhancing accuracy and robustness with limited datasets. Miao et al. 37 proposed a deep feature interactive network for machinery fault diagnosis that uses multisource heterogeneous data such as infrared thermal images and vibration signals for adaptive feature fusion. However, these methods fail to deeply mine and adjust the weights of heterogeneous features during the extraction process, which can lead to increased information redundancy. Furthermore, the fusion process neglects the correlation between heterogeneous features, potentially causing information conflicts and resulting in unreliable diagnostic results. This highlights the potential of a new DL method for capturing heterogeneous features under complex working conditions or noise interference and eliminating the conflict information differences between features.
In summary, how to accurately capture the heterogeneous features under complex working conditions or noise interference and eliminate the conflicting information between features is a key issue in current research. For this purpose, we construct a model based on dual heterogeneous feature resonance fusion for CapsNets (DHFRF-CN), which effectively learns heterogeneous features, ensuring robust and reliable fault diagnosis performance across various scenarios.
The main contributions of this article are as follows:
A dual-scale deformable convolution network (DS-DCN) is developed to capture dual heterogeneous features from time-domain signals under complex working conditions and noisy environments.
An adaptive heterogeneous feature adjustment mechanism (AHFAM) is proposed to adjust the weights of heterogeneous features in the pooling layer, which can remove redundant signals and mine discriminative features for improved performance.
The resonance fusion mechanism (RFM) is proposed to coordinate the correlation of heterogeneous features across structural and spatial dimensions, which avoids feature conflicts during the fusion process and enhances the overall feature integration.
We conducted comparative experiments with other methods on different fault datasets. The effectiveness of the method in capturing heterogeneous features and eliminating the information conflict between the features under complex working conditions and noise environment is verified.
The rest of the article is organized as follows: The second section introduces the related theoretical knowledge. The third section describes the DHFRF-CN methodology and the fault diagnosis framework in detail. The fourth section verifies the effectiveness of the DHFRF-CN through experimental cases. Finally, the fifth section briefly summarizes the work of this article.
Relevant theories
Deformable convolution network
Since the convolution kernel is fixed in shape and size in conventional CNNs, it cannot effectively capture complex feature information from the objects with irregularities or deformations. DCNs were proposed to capture complex deformation features of objects while suppressing background interference. 38 As shown in Figure 1, the sampling positions in the DCN convolution kernel can be dynamically adjusted according to the content of the input feature map. As shown in Figure 2, DCN introduces a learnable offset on top of ordinary convolution, which enables the convolution operation to adjust the sampling position according to the changes of the features. The convolution operation of DCN is
where

Different sampling points of CNN and DCN. CNN: convolutional neural network; DCN: deformable convolution network.

Architecture of the DCN. DCN: deformable convolution network.
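As a rough illustration of offset-based sampling, the following one-dimensional sketch evaluates a deformable convolution at a single output position by sampling the input at the regular grid position plus an offset, using linear interpolation for fractional positions. The function names and toy values are assumptions made for illustration. The offsets are fixed by hand here, whereas in a DCN they are predicted from the input by an additional convolution layer.

```python
import math


def sample_linear(x, pos):
    """Linearly interpolate the 1D signal x at a fractional position (zero outside)."""
    lo = math.floor(pos)
    frac = pos - lo

    def at(i):
        return x[i] if 0 <= i < len(x) else 0.0

    return (1.0 - frac) * at(lo) + frac * at(lo + 1)


def deformable_conv1d_at(x, weights, offsets, p0):
    """Output at centre p0: sum over n of w_n * x(p0 + p_n + offset_n)."""
    half = len(weights) // 2
    total = 0.0
    for n, w in enumerate(weights):
        p_n = n - half  # regular grid position, e.g. -1, 0, 1 for a kernel of size 3
        total += w * sample_linear(x, p0 + p_n + offsets[n])
    return total


signal = [0.0, 1.0, 4.0, 9.0, 16.0]
kernel = [1.0, 1.0, 1.0]

# Zero offsets reduce to an ordinary convolution over positions 1, 2, 3.
plain = deformable_conv1d_at(signal, kernel, [0.0, 0.0, 0.0], 2)
# Non-zero offsets move the outer taps to positions 1.5 and 2.5.
shifted = deformable_conv1d_at(signal, kernel, [0.5, 0.0, -0.5], 2)
```

With all offsets set to zero the operation is identical to an ordinary convolution, which is why the offsets can be learned on top of a standard convolution layer.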
Capsule network
The core idea of CapsNet is to capture the features of the target and its spatial transformations through capsules. As shown in Figure 3, the architecture of the CapsNet is made up of the convolution layer, primary capsule layer, and digit capsule layer.39,40

Architecture of the CapsNet. CapsNet: capsule network.
The primary capsule layer receives the local feature maps. These are divided into 32 channels with 8-dimensional vectors to obtain a feature map of size [6 × 6 × 8 × 32]. The length and direction of the capsule vectors indicate the probability of the existence of an entity and certain attributes of the entity. Dynamic routing is employed to transmit feature information between the PrimaryCaps and DigitCaps layers within CapsNet. This process establishes the connection between low-level and top-level features, facilitating an efficient transmission of features. The prediction vector
where
where
where
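The routing procedure can be sketched as follows. This is a minimal sketch of the standard routing-by-agreement scheme from the original CapsNet formulation (softmax coupling coefficients, weighted sum, squash, agreement update), not necessarily the exact variant used here. The three iterations and the toy prediction vectors are assumptions.

```python
import math


def squash(s):
    """Shrink vector s so that its length lies in [0, 1) while keeping its direction."""
    sq = sum(v * v for v in s)
    if sq == 0.0:
        return [0.0 for _ in s]
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * v for v in s]


def softmax(row):
    m = max(row)
    e = [math.exp(v - m) for v in row]
    z = sum(e)
    return [v / z for v in e]


def dynamic_routing(u_hat, iterations=3):
    """u_hat[i][j] is the prediction vector from lower capsule i for upper capsule j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits
    v = []
    for _ in range(iterations):
        c = [softmax(row) for row in b]  # coupling coefficients per lower capsule
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in)) for d in range(dim)]
            v.append(squash(s_j))
        # Raise the logit wherever a prediction agrees with the upper capsule output.
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(dim))
    return v


# Two lower capsules, two upper capsules, 2-dimensional vectors (toy values).
u_hat = [
    [[1.0, 0.0], [0.1, 0.0]],
    [[0.9, 0.1], [0.0, 0.1]],
]
outputs = dynamic_routing(u_hat)
lengths = [math.sqrt(sum(x * x for x in vec)) for vec in outputs]
```

In this toy case both lower capsules make strong, consistent predictions for the first upper capsule, so its output vector ends up longer than that of the second.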
The CapsNet outputs probabilities in such a way as to push the probability of the correct category as close to 1 as possible and the probability of the wrong category infinitely close to 0. The margin loss is defined as the objective function of CapsNet:
where
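For reference, the margin loss is commonly written as L_k = T_k max(0, m+ - ||v_k||)^2 + lambda (1 - T_k) max(0, ||v_k|| - m-)^2, summed over classes. The sketch below implements that common form. The constants m+ = 0.9, m- = 0.1, and lambda = 0.5 are the values used in the original CapsNet work and are assumed here, since the settings of this article are not stated in the text.

```python
def margin_loss(lengths, target, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss for one sample given the capsule lengths and the true class index."""
    total = 0.0
    for k, length in enumerate(lengths):
        if k == target:
            # The correct class is penalised when its capsule is shorter than m_pos.
            total += max(0.0, m_pos - length) ** 2
        else:
            # Wrong classes are penalised when their capsules are longer than m_neg.
            total += lam * max(0.0, length - m_neg) ** 2
    return total


confident = margin_loss([0.95, 0.05, 0.02], target=0)
mistaken = margin_loss([0.20, 0.80, 0.02], target=0)
```

A confident, correct prediction incurs no loss, while a long capsule for a wrong class is penalised with the down-weighting factor lambda.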
Dual heterogeneous feature resonance fusion for capsule network
Inspired by the limitations of feature extraction in single-scale networks, we adopt the dual-scale network concept and propose DHFRF-CN. The DHFRF-CN is a DL model applied to fault diagnosis and its architecture is clearly illustrated in Figure 4. In contrast to the dual-scale network, which only uses convolutional layers, we not only use multistage convolutional layers but also incorporate the proposed AHFAM in the two subpaths. Furthermore, our DHFRF-CN is designed with feature resonance fusion and enhancement stages to extract more diverse heterogeneous features.

Architecture of the DHFRF-CN. DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
The specific operation steps are as follows:
Step 1. The two-dimensional matrix obtained from the vibration signal sampling is used as the input of DS-DCN.
Step 2. The DS-DCN is constructed to capture dual heterogeneous features by adjusting the sampling points of the convolution kernels.
Step 3. AHFAM is used to adaptively calculate and adjust the weights of the heterogeneous features.
Step 4. The RFM is applied to coordinate the relevance of heterogeneous features and achieve feature fusion.
Step 5. The heterogeneous resonance gain features obtained through fusion are encoded into capsule units for fault diagnosis and classification.
The DHFRF-CN addresses key problems in dual-heterogeneous feature extraction, weight optimization, and fusion conflict under complex operating conditions and noisy environments, achieving superior fault diagnosis performance.
Dual-scale deformable convolution network
The model first extracts heterogeneous features from different data sources or modalities from the time domain signal. DS-DCN is designed to extract dual-scale heterogeneous features, which adapt to the feature nature of different data and modalities. In DS-DCN, the convolution kernels emphasize different aspects of extracting dual-scale heterogeneous features, with the small kernel capturing fine-grained detail features, and the large kernel capturing macroscopic features.
DS-DCN is shown in Figure 5. The multistage DS-DCN is constructed by multiple ordinary convolution layers and deformable convolution layers with varying kernel sizes, which are stacked to progressively learn the hierarchical structure of heterogeneous features. The connection of the ordinary convolution layer module with the deformable convolution layer at different stages enables the proposed network to extract effective cross-modal heterogeneous features. The DS-DCN enhances the perception of dual-scale heterogeneous features by adjusting the sampling points of each layer based on the offset and the signal distribution.

Architecture of the DS-DCN. DS-DCN: dual-scale deformable convolution network.
Generally, the kernel size and dilation rate of convolution layers determine the receptive field for extracting global and local features.41,42 Table 1 shows the sensitivity analysis of kernel sizes and dilation rates. In DS-DCN, CNNs using larger kernels can capture deeper information, while DCN adjusts sampling points through offsets, offering high flexibility. The dilation rate expands the receptive field of convolution without changing the number of parameters. However, because DCN already enhances the flexibility of the receptive field through the offset, the dilation rate is usually kept at 1 or 2 to avoid excessive sparsity.
Sensitivity analysis of kernel sizes and dilation rates.
CNN: convolutional neural network; DCN: deformable convolution network; ERF: effective receptive field.
A kernel that is too large may lead to complex offset learning and increase the number of training parameters, while a kernel that is too small results in an insufficient effective receptive field. Therefore, a balanced choice of parameters is particularly important in fault diagnosis, and kernels of moderate size are usually sufficient to capture features. It can be concluded that in DS-DCN, a larger kernel is used in the early stages to adapt to complex vibration signals, while the kernel is appropriately reduced in the later stages to focus on features.
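The relationship between kernel size, dilation rate, and receptive field can be made concrete with a small calculation. A kernel of size k with dilation rate d spans d(k - 1) + 1 input positions while still holding only k weights. The sketch below computes this span and the nominal receptive field of a stack of layers. The layer configuration is hypothetical and is not taken from Table 1, and the additional reach provided by the learned DCN offsets is not modelled.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input positions spanned by a dilated kernel."""
    return dilation * (kernel_size - 1) + 1


def receptive_field(layers):
    """Nominal receptive field of stacked layers given as (kernel_size, dilation, stride)."""
    rf, jump = 1, 1
    for kernel_size, dilation, stride in layers:
        rf += (effective_kernel(kernel_size, dilation) - 1) * jump
        jump *= stride
    return rf


# Hypothetical stack: a larger kernel first, smaller kernels in later stages.
stack = [(7, 1, 1), (5, 2, 1), (3, 1, 1)]
span = effective_kernel(5, 2)
total_rf = receptive_field(stack)
```

Here the 5-tap kernel with dilation rate 2 covers the same span as a 9-tap kernel, which illustrates why a small dilation rate is enough once the offsets already provide flexibility.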
Adaptive heterogeneous feature adjustment mechanism
The AHFAM is proposed to adaptively calculate and adjust the weight of each feature according to the contribution of the heterogeneous features in the classification task, enhancing the feature discriminative ability of DS-DCN.
As shown in Figure 6, AHFAM employs two modules, global average pooling (GAP) and global maximum pooling (GMP), which aggregate global and local features to capture the dependencies between channels. The former module adjusts the overall weights of the heterogeneous features to reflect the overall trend of the features by aggregating the mean values in the region, while the latter module assigns the local maximum weights of the heterogeneous features to effectively capture the local detail changes of the features. The AHFAM simultaneously aggregates the global and local weights of both modules to capture the dependencies between the two channels, forming an aggregated channel. The outputs from these three channels are fused into a discriminative heterogeneous feature, which enables the network to consider features across multiple dimensions and results in more accurate feature representations.

Architecture of the AHFAM. AHFAM: adaptive heterogeneous feature adjustment mechanism.
As can be seen from Figure 6, GAP and GMP are structurally parallel and symmetric. For the input feature
where
where
where
where
where
where
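The idea of deriving channel weights from GAP and GMP descriptors can be illustrated with the sketch below. It is only a schematic under stated assumptions: the sigmoid gating, the form of the aggregated weight, and the averaging of the three branches are all hypothetical choices and are not the authors' formulation.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def channel_descriptors(feature):
    """feature is a list of channels, each a list of activations."""
    gap = [sum(ch) / len(ch) for ch in feature]  # global average pooling per channel
    gmp = [max(ch) for ch in feature]            # global maximum pooling per channel
    return gap, gmp


def reweight(feature):
    """Scale every channel by a weight built from its GAP and GMP descriptors."""
    gap, gmp = channel_descriptors(feature)
    out = []
    for ch, a, m in zip(feature, gap, gmp):
        w_avg = sigmoid(a)      # weight reflecting the overall trend
        w_max = sigmoid(m)      # weight reflecting the local peak
        w_agg = sigmoid(a + m)  # aggregated weight (assumed form)
        w = (w_avg + w_max + w_agg) / 3.0  # assumed fusion of the three branches
        out.append([w * v for v in ch])
    return out


feature = [[0.0, 0.0, 0.0, 0.0], [1.0, 3.0, 1.0, 3.0]]
gap, gmp = channel_descriptors(feature)
weighted = reweight(feature)
```

In this toy input the second channel has both a larger mean and a larger peak than the first, so it receives a weight closer to 1.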
Resonance fusion mechanism
Traditional feature fusion methods often result in information loss or conflict due to excessive differences between features. To prevent conflicts arising from differences in information and spatial size between heterogeneous features, the RFM is proposed to capture discriminative cross-modal information. It comprises a structural alignment layer, an interaction connection layer, an association filtering layer, and a feature enhancement layer. These layers work together to obtain heterogeneous resonance gain features, which effectively coordinate the diverse features across both spatial and structural dimensions.
As shown in Figure 7, the procedures of RFM are described as follows: First, a structural alignment layer is designed to facilitate dimensional and structural compatibility of dual-scale heterogeneous features, where a convolution operation is performed on the input feature mapping with same padding. For the heterogeneous features
where

Architecture of the RFM. RFM: resonance fusion mechanism.
Next, an interaction connection layer is constructed to enable feature interaction. The outer product operation is performed on the aligned features
where
where
Then, the association filtering layer is performed on the interacted features, which retains related features and removes conflicting features. The masking operation is introduced to set a threshold, preserving features with higher correlation and suppressing or removing those with lower correlation. The threshold
where
Finally, the fully connected layer is used to obtain the fused heterogeneous resonance gain features
where
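The sequence of interaction, filtering, and fusion can be sketched schematically as follows. The sketch assumes the two feature vectors have already been aligned to the same length. The threshold (the mean of the interaction matrix) and the row sum that stands in for the fully connected layer are hypothetical placeholders rather than the definitions used in this article.

```python
def outer(a, b):
    """Outer product of two aligned feature vectors."""
    return [[x * y for y in b] for x in a]


def mask_by_threshold(matrix, threshold):
    """Keep entries at or above the threshold and set the rest to zero."""
    return [[v if v >= threshold else 0.0 for v in row] for row in matrix]


def resonance_fuse(f_small, f_large):
    interaction = outer(f_small, f_large)
    flat = [v for row in interaction for v in row]
    threshold = sum(flat) / len(flat)  # assumed: mean interaction strength
    kept = mask_by_threshold(interaction, threshold)
    # Assumed stand-in for the fully connected layer: sum each row.
    return [sum(row) for row in kept], threshold


f_small = [1.0, 0.5, -1.0]  # toy small-kernel branch features
f_large = [2.0, 0.0, 1.0]   # toy large-kernel branch features
fused, threshold = resonance_fuse(f_small, f_large)
```

In this toy case the third small-kernel feature correlates negatively with the large-kernel branch, so all of its interactions fall below the threshold and are removed before fusion.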
DHFRF-CN for fault diagnosis
The fault diagnosis process of DHFRF-CN is shown in Figure 8 as follows:
Divide the data sample proportionally.
The training set is sent into the network to adjust the parameters and the validation set is used to adjust the hyperparameters, so that the network maintains the best generalization to unknown data.
Fault classification is performed on the testing set and diagnostic results are obtained. The results are also visualized using a confusion matrix and t-distributed stochastic neighbor embedding (t-SNE).

Fault diagnosis process of DHFRF-CN. DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Experimental verification
This section employs multiple datasets and designs a range of experimental scenarios to evaluate the effectiveness of the DHFRF-CN model in diagnostic tasks. The configuration of the experiments, including the selection of the experimental environment and the model parameters, is also described in detail to ensure that the experiments are conducted under suitable conditions to obtain reliable results.
Experimental configuration and parameter settings
The proposed network in this article is based on the TensorFlow 2.9 (Google LLC, Mountain View, CA, USA) framework in Python 3.9 and implemented on an Intel Xeon E5-2680 v4 (Intel Corporation, Santa Clara, CA, USA) and RTX 3060 (NVIDIA Corporation, Santa Clara, CA, USA) under Windows 11. One thousand samples were collected for each class, and the dataset was divided in the ratio of 7:1:2. 43 Each experiment is conducted five times to obtain more reliable results. For the comparison experiments, we selected seven methods to compare with DHFRF-CN. These comparison methods include CNN, multiscale CNN (MSCNN), multiscale CapsNet (MSCN), deep CNN with wide first-layer kernels (WDCNN), 44 bidirectional long short-term memory and CapsNet with CNN (BLC-CNN), 45 dual convolutional CapsNet (DC-CN), 46 and convolutional block attention mechanism CapsNet (CBAM-CN). 24
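The 7:1:2 division can be sketched as follows for the 1000 samples of one class. The shuffling, the seed, and the function name are assumptions, since the text does not describe how samples are assigned to the three subsets.

```python
import random


def split_indices(n_samples, ratios=(7, 1, 2), seed=0):
    """Shuffle sample indices and split them into training, validation, and testing parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(1000)
```

For 1000 samples per class this yields 700 training, 100 validation, and 200 testing samples.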
Case 1: Rolling bearing fault diagnosis
The CWRU bearing dataset contains test data for bearings under normal and fault conditions. 47 The test platform for this dataset is illustrated in Figure 9. Rolling bearing vibration data supporting the motor shaft were collected using vibration accelerometers with sampling frequencies of 12 and 48 kHz. By adjusting the loads, data were collected for one healthy state and nine fault states at four different loads. The fault types and data labels are shown in Table 2. Figure 10 shows the vibration waveform of the 10 states collected at 0HP.

CWRU bearing test bench; CWRU: Case Western Reserve University.
Description of sample labels on the CWRU bearing dataset.

Vibration waveforms of data samples.
Evaluation under identical loading conditions
In this section, we use the sample data collected under the same working conditions for model training and evaluation. For example, dataset A-A indicates that the samples are all from 0HP. To thoroughly assess the effectiveness of DHFRF-CN, we conducted tests under each working condition separately. As shown in Table 3, the accuracy reaches 100% at 0, 2, and 3HP, and is close to 100% at 1HP.
Diagnostic accuracy of DHFRF-CN under the same load.
DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
In addition, Figure 11 shows the confusion matrices of DHFRF-CN under four different working conditions to further analyze the classification effect of the model. The accuracy and misclassification rate are labeled below each matrix. The figure shows that the misclassification rate under working condition 1 is 0.005, while the misclassification rates under the other working conditions are all 0, which verifies the reliability of DHFRF-CN under different working conditions.

Confusion matrix of DHFRF-CN under the same loading. DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
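The accuracy and misclassification rate reported below each matrix follow directly from the confusion matrix entries: accuracy is the sum of the diagonal divided by the total number of samples, and the misclassification rate is its complement. The matrix in the sketch below is a toy three-class example and is not taken from Figure 11.

```python
def accuracy_from_confusion(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


# Toy matrix with 200 testing samples per class (illustrative values only).
toy_cm = [
    [200, 0, 0],
    [1, 199, 0],
    [0, 2, 198],
]
accuracy = accuracy_from_confusion(toy_cm)
misclassification_rate = 1.0 - accuracy
```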
Evaluation under different loading conditions
To verify the diagnostic ability of DHFRF-CN under variable load, the sample data collected at 0, 1, 2, and 3HP are denoted as datasets A, B, C, and D, respectively. Dataset A→B indicates that the training and validation samples are from 0HP and the testing samples are from 1HP. The results are shown in Table 4. It can be seen that DHFRF-CN not only achieves the highest accuracy in all four conditions but also has the lowest standard deviation.
Diagnostic accuracy of eight methods under different loads (%).
CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Bold represents the optimal performance of DHFRF-CN under different conditions.
Figure 12 shows the diagnostic accuracy of the eight models under different loads. Overall, the evaluation accuracies on dataset A-B are all slightly lower than those on the other datasets, which indicates that the large difference in signal characteristics between datasets A and B leads to a lower accuracy. Even with such large feature differences, DHFRF-CN still achieves better diagnostic results than the other models, which further validates its ability to extract differing features. The signal feature differences between datasets B-C and C-D are small, and most of the eight models can extract useful and reliable features. DHFRF-CN achieves the highest accuracy of 99.78 and 99.00% on these two datasets, respectively, which shows that the method is robust and extracts differing features well. Dataset D-A has the lowest evaluation accuracy, which indicates that there is a very large signal feature difference between datasets D and A. The overall gain of DHFRF-CN on this dataset compared to CNN, MSCNN, MSCN, WDCNN, BLC-CNN, DC-CN, and CBAM-CN is 15.36, 15.36, 11.96, 8.54, 8.60, 22.12, and 4.72%, respectively. The comparison reveals that DC-CN has the lowest accuracy, indicating that the single-scale CapsNet is unable to extract diverse features. CBAM-CN has the second highest accuracy after the present method and is the only one among the compared models that is above 70%. It can be seen that both channel and spatial attention in the convolutional block attention mechanism can effectively extract key information but cannot mitigate scale information conflicts. DHFRF-CN not only overcomes the limitation of single-scale networks, which extract a single type of feature, but also extracts the most critical and effective feature information. It also avoids conflicting information across channels of different scales and achieves the highest accuracy in the different-load experiments.

Diagnostic accuracy of eight methods under different loads.
Figure 13 shows the accuracy of the eight models under different working conditions in each individual trial. The results show that DHFRF-CN has the highest accuracy under all working conditions, and its results across the five trials are stable with only small fluctuations. This indicates that DHFRF-CN performs consistently in different experiments and can effectively cope with the challenges of various working conditions with strong robustness.

Diagnostic accuracy of each trial for the eight methods at different loads.
In contrast, the diagnostic accuracies of the other seven methods fluctuate considerably, indicating that they are susceptible to changes in working conditions or fluctuations in training data. For example, the low accuracy of DC-CN indicates that relying on only two convolution layers to process the signal fails to extract sufficiently rich features for CapsNet to perform effective classification. The heterogeneous resonance gain features obtained by DHFRF-CN through the first three stages of processing provide richer and more diverse input features for the CapsNet. This process effectively enhances the diversity and accuracy of feature expression, thus laying a foundation for accurate classification by the CapsNet. Therefore, the stable and superior performance of DHFRF-CN in this experiment indicates its potential advantages in fault diagnosis tasks, especially in the face of complex working conditions and data uncertainty.
Case 2: Rolling bearing fault diagnosis
The experimental workbench for the dataset from the University of Paderborn (PU) in Germany is shown in Figure 14. Bearing data under four operating conditions were collected at different speeds and loads. The bearing data from the PU dataset are divided into three types: normal, inner ring fault, and outer ring fault. 48 Damage is classified as either level 1 or level 2 based on the severity of the fault. In this experiment, we selected five fault state datasets. As shown in Table 5, the fault state data include one healthy bearing (K001), two outer ring fault bearings with different degrees of damage (KA05, KA06), and two inner ring fault bearings with different degrees of damage (KI05, KI07).

PU bearing test bench. PU: University of Paderborn.
Description of sample labels on the PU bearing dataset.
PU: University of Paderborn.
Fault diagnostic results under benchmark conditions
The training accuracy and loss curve of DHFRF-CN under 0HP are shown in Figure 15. The curve converges within the first 10 training cycles and tends to be stable after that. This indicates that DHFRF-CN has good generalization performance.

Training loss curve.
Under 0 HP, the diagnostic results of the eight methods are shown in Table 6 and Figure 16. As can be seen from the figures, the precision, recall, F1 score, and average accuracy of DHFRF-CN are 97.72, 97.68, 97.65, and 97.85%, respectively, achieving the best metrics compared to other methods. Additionally, the standard deviation of DHFRF-CN is 0.19%, the lowest among all networks, indicating its greater stability. Figure 17 shows the confusion matrix of the eight methods diagnosed at 0HP. DHFRF-CN (Figure 17(h)) achieved the highest classification accuracy overall, with a misclassification rate of only 0.022.
Diagnostic results under benchmark conditions.
CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multi-scale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Bold represents the optimal performance of DHFRF-CN under different conditions.

Diagnostic results of the eight methods.

Confusion matrix: (a) CNN, (b) MSCNN, (c) MSCN, (d) WDCNN, (e) BLC-CNN, (f) DC-CN, (g) CBAM-CN, and (h) DHFRF-CN. CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
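Precision, recall, and F1 score can be derived from a confusion matrix as sketched below. The sketch assumes that rows are true classes and columns are predicted classes, and that the class-wise scores are macro-averaged. Both are assumptions, since the averaging convention is not stated in the text, and the matrix is a toy example.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1) for every class of a square confusion matrix."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted as k but not truly k
        fn = sum(cm[k]) - tp                       # truly k but predicted otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2.0 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def macro_average(results):
    """Unweighted mean of each metric over the classes."""
    n = len(results)
    return tuple(sum(r[i] for r in results) / n for i in range(3))


toy_cm = [[8, 2], [0, 10]]  # illustrative values only
macro_precision, macro_recall, macro_f1 = macro_average(per_class_metrics(toy_cm))
```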
Fault diagnostic results in complex noise environments
To verify the reliability of DHFRF-CN in fault diagnosis under complex noise environments, Gaussian white noise with different SNRs was added to the original vibration signals under benchmark conditions, constructing vibration signals in noisy environments. 49 The lower the SNR, the stronger the noise intensity; conversely, the higher the SNR, the weaker the noise intensity. The experimental results are shown in Table 7. The data are intuitively presented in Figures 18 and 19. It can be seen that interference information in the signal significantly affects the performance of the eight methods. Under different noise conditions, DHFRF-CN outperforms the other seven methods. Especially under strong noise conditions with an SNR of −6 dB, the proposed DHFRF-CN achieves overall gains of 16.6, 11.87, 10.6, 23.67, 19.1, 10.87, and 18.67% compared to CNN, MSCNN, MSCN, WDCNN, BLC-CNN, DC-CN, and CBAM-CN, respectively. This further demonstrates the strong robustness of the proposed DHFRF-CN.
Detailed diagnostic results of the eight methods under different SNRs (%).
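Noise at a prescribed SNR is usually generated from the definition SNR = 10 log10(P_signal / P_noise): the signal power is measured, the required noise power is derived from it, and zero-mean Gaussian noise with that variance is added. The sketch below follows this common procedure. The sinusoidal test signal, the seed, and the function names are assumptions for illustration.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add zero-mean white Gaussian noise so that the result has the requested SNR in dB."""
    rng = random.Random(seed)
    signal_power = sum(v * v for v in signal) / len(signal)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    sigma = math.sqrt(noise_power)
    return [v + rng.gauss(0.0, sigma) for v in signal]


def measured_snr_db(clean, noisy):
    """Empirical SNR of a noisy signal relative to its clean version."""
    p_signal = sum(v * v for v in clean)
    p_noise = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10.0 * math.log10(p_signal / p_noise)


# Synthetic stand-in for a vibration signal.
clean = [math.sin(2.0 * math.pi * 5.0 * t / 2000.0) for t in range(20000)]
noisy = add_gaussian_noise(clean, snr_db=-6.0)
snr_check = measured_snr_db(clean, noisy)
```

At −6 dB the noise power is roughly four times the signal power, which is why this setting is treated as a strong-noise condition.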
CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Bold represents the optimal performance of DHFRF-CN under different conditions.

Diagnostic accuracy of the eight methods under different SNRs. SNR: signal-to-noise ratio.

Diagnostic accuracy of the eight methods under different SNRs. SNR: signal-to-noise ratio.
Case3: Gearbox fault diagnosis
In this section, the Southeast University (SEU) gearbox dataset is used for fault diagnosis experiments. 50 As shown in Figure 20, the experiments were conducted using a driveline dynamics simulator. The dataset consists of two sub-datasets, one for the bearing and one for the gearbox. Each sub-dataset has one normal (healthy) operating state and four fault states. The gearbox fault types are shown in Table 8.

SEU gearbox test bench. SEU: Southeast University.
Description of sample labels on the SEU gearbox dataset.
SEU: Southeast University.
Fault diagnostic results under benchmark conditions
The training loss curve of DHFRF-CN is shown in Figure 21. The model converges quickly: within the first five training cycles, the curve drops rapidly and then stabilizes. Additionally, the average training accuracy curve shows no sign of overfitting throughout the entire training process, which suggests that DHFRF-CN generalizes well during learning.

Training loss curve.
As the detailed data in Table 9 show, the minimum accuracy of DHFRF-CN remains above the maximum accuracy of each of the other seven methods. As shown in Figure 22, DHFRF-CN achieved accuracies of 99.4, 99.2, 99.8, 99.2, and 99.6% in five independent experiments, the highest in every experiment, which verifies the superior performance of the method. The bar chart visualizes the performance of DHFRF-CN against the other seven methods in the diagnostic process. In terms of average F1 score and average accuracy, the proposed DHFRF-CN also shows clear advantages over CNN, MSCNN, MSCN, WDCNN, BLC-CNN, DC-CN, and CBAM-CN, with overall gains of 14.36, 2.28, 8.72, 1.8, 1.76, 1.08, and 10.68%, respectively. In addition, the standard deviations of DHFRF-CN are significantly lower than those of the other methods, which indicates that DHFRF-CN has higher stability and robustness.
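The stability comparison rests on simple summary statistics over repeated runs. As a hedged sketch, the following computes the mean, minimum, maximum, and standard deviation of the five DHFRF-CN accuracies quoted above. The helper name `summarize_runs` is hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which estimator Table 9 uses.

```python
import statistics


def summarize_runs(accuracies):
    """Summarize repeated-run accuracies (%) as mean, min, max, and sample std."""
    return {
        "mean": statistics.mean(accuracies),
        "min": min(accuracies),
        "max": max(accuracies),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std": statistics.stdev(accuracies),
    }


# Accuracies of DHFRF-CN in the five independent experiments quoted in the text.
dhfrf_cn_runs = [99.4, 99.2, 99.8, 99.2, 99.6]
summary = summarize_runs(dhfrf_cn_runs)
print({name: round(value, 2) for name, value in summary.items()})
```

Comparing the `min` of one method with the `max` of another is the check behind the statement that the worst DHFRF-CN run still exceeds the best run of each baseline.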
Diagnostic results under benchmark conditions.
CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Bold represents the optimal performance of DHFRF-CN under different conditions.

Diagnostic results of the eight methods.
To further verify the feature clustering effect of each method, Figure 23 visualizes the feature clustering results of the eight methods using the t-SNE algorithm. The different colors in the figure represent the five operating states of the gearbox. The comparison shows that DHFRF-CN (Figure 23(h)) produces the most distinct clustering, grouping the features of each gear state into tighter clusters. By comparison, the clustering of the other seven methods (Figure 23(a) to (g)) is poorer, and the separation between gear states is less clear than that of DHFRF-CN. Therefore, DHFRF-CN not only achieves higher classification accuracy but also surpasses the other methods in feature representation and clustering effectiveness.

Feature distribution visualization: (a) CNN, (b) MSCNN, (c) MSCN, (d) WDCNN, (e) BLC-CNN, (f) DC-CN, (g) CBAM-CN, and (h) DHFRF-CN. CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Fault diagnostic results in complex noise environments
To verify the reliability of DHFRF-CN in fault diagnosis under complex noise environments, Gaussian white noise with different SNRs was added to the original vibration signals under benchmark conditions, constructing vibration signals in a noisy environment. The lower the SNR, the stronger the noise intensity; conversely, the higher the SNR, the weaker the noise intensity. The experimental results are shown in Table 10. The data are intuitively presented in Figures 24 and 25. It can be seen that interference information in the signal significantly affects the performance of the eight methods. Under different noise conditions, DHFRF-CN outperforms the other seven methods. Especially under strong noise conditions with an SNR of −6 dB, the proposed DHFRF-CN achieves overall gains of 15.67, 9.6, 16.47, 12.13, 9.2, 9.2, and 9.93% compared to CNN, MSCNN, MSCN, WDCNN, BLC-CNN, DC-CN, and CBAM-CN, respectively. This further demonstrates the strong robustness of the proposed DHFRF-CN.
Detailed diagnostic results of the eight methods under different SNRs (%).
SNR: signal-to-noise ratio; CNN: convolutional neural network; MSCNN: multiscale convolutional neural network; MSCN: multiscale capsule network; WDCNN: deep convolutional neural network with wide first-layer kernel; BLC-CNN: bidirectional long short-term memory and capsule network with convolutional neural network; DC-CN: dual convolutional capsule network; CBAM-CN: convolutional block attention mechanism capsule network; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network.
Bold represents the optimal performance of DHFRF-CN under different conditions.

Diagnostic accuracy of the eight methods under different SNRs. SNR: signal-to-noise ratio.

Diagnostic accuracy of the eight methods under different SNRs. SNR: signal-to-noise ratio.
Ablation study
In this section, we carry out ablation experiments on the PU bearing and SEU gearbox datasets to verify the contribution and effect of each module in the network.
Effectiveness of the AHFAM
In this section, we established DHFRF-CN-NAHFAM, a variant of DHFRF-CN without the AHFAM, to verify the effectiveness of AHFAM. Experiments were performed on both the PU bearing and the SEU gearbox datasets, with detailed results shown in Table 11 and visually presented in Figures 26 and 27. On the two datasets, the diagnostic accuracy of DHFRF-CN was 10.84 and 4.48% higher than that of DHFRF-CN-NAHFAM, respectively, with a smaller standard deviation. This indicates that AHFAM can effectively extract discriminative information from fault features during the learning process, thereby significantly enhancing the model’s fault recognition capability.
Average diagnostic accuracy of the four methods.
PU: University of Paderborn; SEU: Southeast University; DHFRF-CN: dual heterogeneous feature resonance fusion for capsule network; NAHFAM: No adaptive heterogeneous feature adjustment mechanism; NRFM: No resonance fusion mechanism; SS: Single-channel structure. Bold represents the optimal performance of DHFRF-CN under different conditions.

Average accuracy for the four methods.

Diagnostic accuracy of each trial for the four methods.
Effectiveness of the RFM
In this section, we established DHFRF-CN-NRFM, a variant of DHFRF-CN without the RFM, to verify the effectiveness of RFM. Experiments were performed on both the PU bearing and the SEU gearbox datasets, with detailed results shown in Table 11 and visually presented in Figures 26 and 27. On the two datasets, the diagnostic accuracy of DHFRF-CN was 5 and 1.72% higher than that of DHFRF-CN-NRFM, respectively, with a smaller standard deviation. This indicates that RFM can adaptively integrate feature information from different channels, effectively avoiding information conflicts.
Effectiveness of the dual-scale network
In this section, we established DHFRF-CN-SS, a variant of DHFRF-CN with a single-channel structure, to verify the effectiveness of the dual-scale network. Experiments were performed on both the PU bearing and the SEU gearbox datasets, with detailed results shown in Table 11 and visually presented in Figures 26 and 27. On the two datasets, the diagnostic accuracy of DHFRF-CN was 8.52 and 5.12% higher than that of DHFRF-CN-SS, respectively, with a smaller standard deviation. This demonstrates the advantage of the dual-scale network in capturing diverse features. In contrast, the single-channel structure is limited when dealing with diverse features: it struggles to fully uncover potential fault modes, which leads to poorer diagnostic performance.
Conclusion
In this article, we construct a novel DHFRF-CN for mechanical fault diagnosis. This method can accurately capture heterogeneous features and eliminate information conflicts between features under complex working conditions or noise interference. We conducted case studies on different fault datasets, comparing DHFRF-CN with state-of-the-art networks. The results show that the proposed network has strong fault diagnosis and noise resistance capabilities. Additionally, ablation experiments were designed to verify the effectiveness of the DHFRF-CN components in fault diagnosis.
The DHFRF-CN method proposed in this article focuses on fault diagnosis of rolling bearings and gearboxes. Future research will explore the application of this method to a broader range of machinery, such as motors and pumps. By adapting to the fault characteristics of different equipment, further validation of the general applicability and robustness of DHFRF-CN in mechanical fault diagnosis will be conducted. Additionally, with the rapid development of ML, emerging technologies like transfer learning and self-supervised learning have made significant progress. Therefore, future research will concentrate on how to effectively integrate these technologies into existing frameworks to address more complex operating conditions and diverse fault types.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (nos. 51875457 and 62271390), the Key Research and Development Program of Shaanxi Province of China (2025CY-YBXM-602), the Scientific Research Program Funded by Education Department of Shaanxi Provincial Government (program no. 24JC083), and the Graduate Student Innovation Fund of Xi'an University of Posts and Telecommunications (CXJJYL2024069).
