Abstract
Rotating machinery fault diagnosis is fundamental to the safe and stable operation of smart manufacturing systems. However, vibration-based diagnostic models are often limited by the scarcity of labeled fault data, which weakens their robustness and generalization under varying operating conditions. Recent studies have applied denoising diffusion probabilistic models (DDPMs) to vibration signal augmentation in limited-sample scenarios. Yet, most existing methods are developed on raw time-series signals or generic time-frequency representations, without explicitly incorporating the multiscale impulsive characteristics of fault vibrations. As a result, the generated signals often exhibit weakened fault-related transients. To address this issue, this paper proposes a wavelet-preprocessed temporal-attention diffusion framework for vibration-signal augmentation. Multiscale wavelet priors are integrated into the diffusion backbone to preserve impact-sensitive structures, temporal-attention modules are embedded into the reverse denoising process to enhance temporal dependency modeling, and an asymmetric U-Net with a strengthened decoder is adopted to improve reconstruction quality. Experiments on two bearing test platforms demonstrate that the proposed method generates signals with higher physical fidelity and achieves superior diagnostic performance compared with representative generative baselines. These results indicate that the proposed framework is an effective augmentation strategy for robust vibration-based intelligent diagnostics under data-scarce conditions.
Keywords
1. Introduction
Industrial rotating machinery is expected to operate with high reliability and availability in smart manufacturing systems. As key supporting components in rotating transmission chains, rolling bearings play a critical role in maintaining motion accuracy, load transfer, and operational stability. Bearing failures can therefore lead to unplanned downtime, increased maintenance cost, and performance degradation, making accurate fault diagnosis essential for reliable condition monitoring (Jiang et al., 2024). In practice, however, labeled vibration data are often scarce because fault events are infrequent, fault reproduction is costly and potentially unsafe, and data annotation depends heavily on expert knowledge (Peng et al., 2025; Zhang et al., 2026). This data scarcity makes data-driven diagnostic models more vulnerable to overfitting and poor generalization under varying operating conditions (Qi et al., 2025; Shao et al., 2018). Therefore, developing robust fault diagnosis methods under limited labeled data remains a critical challenge in vibration-based intelligent maintenance.
To alleviate the scarcity of labeled vibration data in fault diagnosis, generative models have been increasingly explored for synthetic sample generation (Chen et al., 2026b). Early studies mainly relied on generative adversarial networks (GANs) (Fu et al., 2023; Liu and Wu, 2024) and variational autoencoders (VAEs) (Dixit and Verma, 2020; Ju et al., 2024; Zhao et al., 2020) to enrich limited datasets and improve diagnostic robustness. These methods have demonstrated the potential of data augmentation in small-sample scenarios (Wang et al., 2020; Zhao et al., 2022), but adversarial models often suffer from training instability, whereas latent-variable models may produce over-smoothed signals and insufficiently preserve transient fault details (Pan et al., 2022). More recently, denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) have attracted growing attention because of their stable training behavior and strong generative capability. Existing diffusion-based studies have reported promising results in vibration-signal augmentation, few-shot diagnosis, and cross-condition learning (Fan et al., 2025; Tong et al., 2026; Wang et al., 2024). However, most of these methods still treat vibration signals as generic time series or transformed representations, without explicitly emphasizing the impulsive and multiscale nature of bearing fault responses (Lin et al., 2025; Lourari et al., 2025; Peng et al., 2026).
Despite the progress of diffusion-based methods, several limitations remain when they are applied to vibration-oriented fault diagnosis. First, the scale-localized impulsive characteristics induced by repetitive fault impacts are often insufficiently modeled, which may weaken the preservation of fault-discriminative transient structures in generated samples. Second, temporal dependency during reverse diffusion is usually captured only implicitly by conventional U-Net backbones, making it difficult to accurately reconstruct the timing, recurrence, and modulation patterns of fault-related impulses (Chen et al., 2026a). Third, the quality of generated signals is still commonly evaluated using generic similarity measures or downstream classification accuracy alone (Qi et al., 2026), which cannot fully reflect whether the synthesized signals preserve physically meaningful waveform, spectral, and impulsive characteristics required for reliable diagnosis.
To address the above limitations, this study proposes a wavelet-preprocessed temporal-attention DDPM (WaveletTAM-DDPM), a diffusion-based vibration-signal generation framework designed to enhance data-scarce intelligent maintenance and fault diagnosis of industrial rotating systems, validated on rolling bearings as a representative rotating component. The main contributions of this work are summarized as follows:
i. A DDPM-based framework is developed to enhance bearing diagnosis under limited samples by augmenting scarce labeled data with diffusion-generated vibration signals, providing a physics-informed data enrichment module that can be integrated into intelligent maintenance and condition monitoring systems.
ii. The proposed framework introduces a physics-informed diffusion structure, where wavelet kernel convolution injects multiscale impact-sensitive priors, temporal-attention modules model long-range temporal dependency during reverse diffusion, and an asymmetric U-Net with a deeper decoder enhances reconstruction capability.
iii. A composite evaluation metric is constructed to assess generated signals from multiple vibration-physics perspectives, integrating time-, frequency-, and entropy-based similarity measures to provide a more systematic assessment of physical fidelity and its relevance to downstream intelligent diagnosis and maintenance decision support.
The rest of the paper is organized as follows. Section 2 details the proposed framework, its network structure, and the proposed composite metric. Section 3 reports the experimental evaluation and validation. Section 4 concludes the paper.
2. Proposed method
2.1. Overall workflow of the proposed method
To address the scarcity of labeled vibration data in industrial fault diagnosis, this study proposes a data augmentation framework based on WaveletTAM-DDPM. The goal is to generate high-fidelity synthetic vibration signals that expand the training distribution without requiring additional physical experiments. As illustrated in Figure 1, the workflow consists of four stages, described in Sections 2.1.1 to 2.1.4.
Figure 1. Overall framework of the proposed method.
2.1.1. Data preprocessing
Raw vibration signals under normal and fault conditions are collected and segmented into fixed-length samples using a sliding window. This increases sample diversity while preserving local transient impulses.
2.1.2. Diffusion-based signal generation
During training, each vibration sample is first processed by a wavelet kernel convolution layer integrated directly into the diffusion network to extract localized time-frequency impulse features (Fujieda et al., 2018). The extracted representations are then denoised by an asymmetric U-Net equipped with temporal attention to perform the reverse diffusion process, ultimately generating synthetic vibration signals.
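For readers less familiar with diffusion models, the forward noising that the reverse process learns to invert can be sketched as follows. This uses the standard closed form of DDPM (Ho et al., 2020). The linear schedule, the number of steps, the placeholder sine segment, and the function names are illustrative assumptions rather than the settings used in this work.

```python
import numpy as np

def linear_beta_schedule(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear noise schedule (values are assumptions for illustration)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps  # eps is the regression target of the denoising network

betas = linear_beta_schedule()
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal-retention factor

rng = np.random.default_rng(0)
# Placeholder 1024-point segment (a 50 Hz sine sampled at 25.6 kHz), not real data.
x0 = np.sin(2 * np.pi * 50 * np.arange(1024) / 25600.0)
x_early, _ = forward_noise(x0, 10, alpha_bar, rng)    # still close to x0
x_late, _ = forward_noise(x0, 999, alpha_bar, rng)    # close to pure noise
```

At small `t` the sample is dominated by the original segment, whereas at the final step it is nearly indistinguishable from Gaussian noise, which is the starting point of reverse-diffusion sampling.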
2.1.3. Signal-level quality evaluation
The generated vibration signals are assessed using the proposed composite metric, which evaluates their similarity to real signals from the time-domain, frequency-domain, and impact-feature perspectives.
2.1.4. Task-level diagnostic evaluation
High-fidelity synthetic data are combined with real signals to train the fault diagnostic networks, whereas only real samples are used for testing. This verifies that the generated signals improve diagnostic robustness rather than biasing the evaluation.
Signal-level and task-level evaluation jointly measure the usefulness of the generated data. The proposed framework ensures that synthetic signals are not only visually similar to real vibration data, but also practically beneficial for downstream fault diagnosis.
2.2. WaveletTAM-DDPM
The structure of the proposed WaveletTAM-DDPM is shown in Figure 2. The model integrates three key components into the diffusion process: wavelet kernel convolution, an asymmetric U-Net backbone, and temporal attention, enabling effective generation of vibration signals under limited-sample conditions.
Figure 2. WaveletTAM-DDPM U-Net structure.
2.2.1. Wavelet kernel convolution for multi-channel vibration representations
At the first stage of the diffusion U-Net, the raw vibration segment passes through a wavelet kernel convolution (WKC) layer, replacing the conventional first convolution layer (Yi et al., 2024). Unlike standard convolution kernels that learn arbitrary filter shapes, WKC uses predefined wavelet functions as kernels, capturing localized transient impulses and oscillatory patterns that dominate vibration signals. This makes WKC more suitable for rotating machinery fault signals, which are inherently impulsive, nonstationary, and multi-frequency.
In addition, WKC expands the original single-channel signal into a multi-channel representation, where each channel corresponds to a wavelet kernel at a different scale. At larger scales, the wavelet channels emphasize low-frequency components associated with defect-induced resonance bands, while smaller scales respond strongly to high-frequency transients and their modulation sidebands. This scale-based decomposition allows each channel to act as a band-selective filter that captures fault-related spectral signatures directly at the network input, such as the bearing characteristic frequencies (ball pass frequency of the inner race (BPFI), ball pass frequency of the outer race (BPFO), and ball spin frequency (BSF)) along with their surrounding amplitude-modulated sidebands. As a result, the U-Net receives an explicitly encoded time-frequency representation instead of having to infer scale-dependent patterns implicitly through deeper layers.
In this work, the Mexican hat wavelet (Mexh) is used as the kernel basis. Mexh is a second-derivative Gaussian wavelet that produces symmetric, sharp impulse responses, which is ideal for rotating machinery fault impulses. Compared with Morlet and Laplace wavelets, Mexh offers stronger localization of transient spikes, clear separation between oscillatory components and impulse peaks, and zero mean, which improves stability during reverse diffusion. Thus, WKC provides impulse-aware representations before any diffusion step occurs, making the reverse denoising path easier to learn.
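The idea of a fixed-shape, multiscale wavelet front end can be illustrated with a minimal NumPy sketch. It builds a bank of Mexican hat kernels and convolves a single-channel segment with each of them. The kernel length, the scale values, and the function names are assumptions for illustration, and the sketch does not reproduce the trainable WKC layer used in the network.

```python
import numpy as np

def mexican_hat(length, scale):
    """Discrete Mexican hat wavelet sampled on `length` points."""
    t = np.arange(length) - (length - 1) / 2.0
    u = t / scale
    psi = (1.0 - u**2) * np.exp(-0.5 * u**2)
    psi -= psi.mean()                   # enforce zero mean on the truncated support
    return psi / np.linalg.norm(psi)    # unit energy per kernel

def wavelet_kernel_conv(x, scales, kernel_len=65):
    """Map a 1-channel signal (N,) to a multi-channel array (len(scales), N)."""
    return np.stack(
        [np.convolve(x, mexican_hat(kernel_len, s), mode="same") for s in scales]
    )

# A single unit impulse as a stand-in for one fault impact (hypothetical input).
x = np.zeros(1024)
x[512] = 1.0
channels = wavelet_kernel_conv(x, scales=[2.0, 4.0, 8.0, 16.0])
print(channels.shape)  # one output channel per scale
```

Because the kernel is symmetric and zero-mean, each channel responds with a peak centered on the impulse location and suppresses constant offsets, with the width of the response set by the scale.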
2.2.2. Asymmetric U-Net for improved denoising reconstruction
The backbone of WaveletTAM-DDPM follows a U-Net structure but adopts an intentionally asymmetric depth. Each down-sampling layer contains one residual block, whereas each up-sampling layer contains two. This asymmetry is motivated by the different functional roles of the encoder and decoder in diffusion modeling. The encoder primarily performs feature contraction, mapping the multiscale wavelet representation into a compact latent space. Because this stage emphasizes abstraction rather than detail preservation, a single residual block at each level is sufficient to encode the dominant structures required for denoising.
In contrast, the decoder operates during the reverse diffusion process, where the model must progressively remove noise while reconstructing fine-grained temporal structures that are easily corrupted by the forward noising steps. Reverse denoising is inherently a detail restoration task, and its difficulty increases as the time step decreases. To support this, each decoder stage includes two residual blocks, enabling stronger nonlinear refinement, greater sensitivity to subtle impulse cues, and more expressive feature transformation. The additional residual block also improves the fusion between encoder and decoder features passed through skip connections, which is essential for accurately restoring impulse shapes and modulation envelopes.
Together, this asymmetric design allocates model capacity where it is most needed, namely, on the reconstruction side of the U-Net, stabilizing the denoising trajectory and improving the fidelity of the synthesized vibration signals.
2.2.3. Temporal-attention module (TAM)
The TAM (Wang et al., 2022) is inserted in the bottleneck and all decoder blocks to strengthen reconstruction of long-range temporal structures. Its structure is shown in Figure 3.
Figure 3. TAM structure.
Given an input feature tensor
The attention map
Finally, TAM fuses the attention map with the original features via a residual connection:
TAM is applied only in the bottleneck and up-sampling stages. During down-sampling, the network extracts local representations and the temporal correlations are still highly redundant, so attention is unnecessary. During up-sampling, the model must reconstruct temporal structure from noise, and TAM strengthens long-range dependency modeling, helping recover fault impulses and periodic vibration patterns more accurately.
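As a generic illustration of attention along the time axis with residual fusion, the following NumPy sketch applies scaled dot-product self-attention to a feature map of shape (channels, time). It is an assumption-based stand-in: the projection matrices `w_q`, `w_k`, and `w_v`, the scaling, and the function names are hypothetical, and the exact TAM formulation of Wang et al. (2022) may differ.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, w_q, w_k, w_v):
    """x: (C, T) features. Returns fused (C, T) features and the (T, T) attention map."""
    q = w_q @ x                                 # (D, T) queries, one per time step
    k = w_k @ x                                 # (D, T) keys
    v = w_v @ x                                 # (C, T) values
    scores = (q.T @ k) / np.sqrt(q.shape[0])    # (T, T): step i attends to step j
    attn = softmax(scores, axis=-1)             # each row sums to 1
    out = v @ attn.T                            # out[:, i] = sum_j attn[i, j] * v[:, j]
    return x + out, attn                        # residual connection

rng = np.random.default_rng(0)
C, T, D = 8, 64, 4                              # hypothetical sizes
x = rng.standard_normal((C, T))
w_q, w_k = 0.1 * rng.standard_normal((2, D, C))
w_v = 0.1 * rng.standard_normal((C, C))
y, attn = temporal_attention(x, w_q, w_k, w_v)
```

The residual term keeps the original features intact, so the attention branch only has to learn a correction that links distant time steps, for example successive fault impulses.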
Main hyperparameter settings.
2.3. Composite metric
To evaluate the fidelity of generated vibration signals in a comprehensive and physically meaningful way, a composite multi-domain quality metric is constructed. Vibration signals from rotating machinery exhibit strong impulsive transients, harmonic spectral structures, and amplitude-modulated sidebands produced by bearing impacts, fault repetition frequencies, and resonance amplification. Assessing such signals using a single similarity measure is inadequate. Metrics such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR) measure waveform proximity but are insensitive to spectral distortions and may still assign high scores to signals that lack fault impulses. Purely spectral metrics such as frequency spectrum cosine similarity (FSCS) capture harmonic alignment but ignore local amplitude variations and transient strength. Impact-based indicators alone, such as spectral entropy difference (SED), quantify impulsiveness but overlook global waveform accuracy and harmonic structure. Relying on only one of these perspectives risks overlooking essential characteristics needed for reliable fault diagnosis.
Therefore, three complementary indicators are selected to jointly characterize fidelity from the time, frequency, and impact perspectives: PSNR evaluates waveform similarity in the time domain, FSCS measures the alignment of harmonic structures and spectral energy distribution, and SED captures the impulsiveness and disorder level associated with fault-induced impacts. Combining them aligns fidelity evaluation with the physics of vibration signals, ensuring that generated samples preserve waveform shape, spectral content, and impulse characteristics. The three indicators are fused using the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) (De Lima Silva and De Almeida Filho, 2020) to produce a single composite score. TOPSIS is used here because PSNR, FSCS, and SED reflect complementary aspects of signal fidelity and may exhibit different trends across competing methods. By jointly accounting for their different scales and optimization directions, the fused result provides a more comprehensive assessment of generated-signal quality than any single metric alone.
PSNR is computed as
FSCS is defined as
SED is computed as
An evaluation metrics matrix is constructed, where
Composite metric is computed as
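Thus, a larger composite score indicates that the generated signals are closer to the ideal solution, that is, of higher fidelity.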
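The composite metric described in this section can be sketched in NumPy as follows. The sketch assumes common definitions: PSNR with the peak taken as the maximum absolute value of the real signal, FSCS as the cosine similarity of magnitude spectra, SED as the absolute difference of Shannon entropies of the normalized power spectra, and TOPSIS with vector normalization and equal weights. These choices, the test signal, and all function names are assumptions and may differ from the exact formulation used in this work.

```python
import numpy as np

def psnr(real, gen):
    """Peak signal-to-noise ratio in dB (peak = max |real|, an assumption)."""
    mse = np.mean((real - gen) ** 2)
    return 10.0 * np.log10(np.max(np.abs(real)) ** 2 / mse)

def fscs(real, gen):
    """Cosine similarity between one-sided magnitude spectra."""
    a, b = np.abs(np.fft.rfft(real)), np.abs(np.fft.rfft(gen))
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spectral_entropy(x):
    """Shannon entropy of the normalized power spectrum."""
    p = np.abs(np.fft.rfft(x)) ** 2
    p = p / p.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def sed(real, gen):
    """Spectral entropy difference (smaller is better)."""
    return abs(spectral_entropy(real) - spectral_entropy(gen))

def topsis(matrix, benefit, weights=None):
    """Closeness coefficient per row; `benefit[j]` is True if larger is better."""
    m = np.asarray(matrix, dtype=float)
    w = np.full(m.shape[1], 1.0 / m.shape[1]) if weights is None else np.asarray(weights)
    v = w * m / np.linalg.norm(m, axis=0)          # vector-normalized, weighted
    benefit = np.asarray(benefit)
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    worst = np.where(benefit, v.min(axis=0), v.max(axis=0))
    d_pos = np.linalg.norm(v - ideal, axis=1)      # distance to ideal solution
    d_neg = np.linalg.norm(v - worst, axis=1)      # distance to anti-ideal solution
    return d_neg / (d_pos + d_neg)

# Synthetic stand-ins: a tone with periodic impulses plus two noise levels.
rng = np.random.default_rng(0)
real = np.sin(2 * np.pi * 500 * np.arange(2048) / 25600.0)
real[::256] += 2.0
good = real + 0.01 * rng.standard_normal(real.size)
bad = real + 0.5 * rng.standard_normal(real.size)

rows = [[psnr(real, g), fscs(real, g), sed(real, g)] for g in (good, bad)]
scores = topsis(rows, benefit=[True, True, False])  # SED is a cost criterion
```

In this toy comparison the lightly perturbed candidate receives the higher closeness coefficient. The sketch does not guard against degenerate inputs, such as a zero MSE or identical candidates, which would need explicit handling in practice.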
3. Experimental verification
3.1. Data description
Two bearing datasets were used to evaluate the proposed WaveletTAM-DDPM under different operating conditions, sampling rates, and fault types. The datasets were selected to cover both cross-roller and roller bearings, enabling assessment of generalization across bearing structures and excitation characteristics.
3.1.1. Cross-roller bearing dataset
Data were collected on the test platform using SW-HG-20 cross-roller bearings. The inner race was rigidly fixed to a 450 Nm load arm, while the outer race was coupled to the input shaft. Vibration signals were measured using a PCB6061 accelerometer and sampled at 25.6 kHz. Four health conditions were tested: normal, inner-race fault, outer-race fault, and rolling-element fault. All defects were laser-machined with dimensions of length × width × depth. Inner- and outer-race defects were sized
3.1.2. Roller-bearing dataset
The second dataset was collected on the roller-bearing test platform developed in our laboratory. NU205 M and N205 M bearings were used for the experiments. Inner-race faults were machined on NU205 M bearings, whereas outer-race faults were machined on N205 M bearings. Apart from the number of rolling elements, the two bearings share identical geometric parameters. All defects were produced with a depth of 0.5 mm and widths of either 0.5 mm or 1.0 mm. Vibration signals were acquired at a sampling rate of 100 kHz under a rotational speed of 1400 rpm. The dataset includes three health conditions: normal, inner-race fault, and outer-race fault.
All signals were scaled using min-max normalization to eliminate amplitude-scaling effects caused by sensor gain or installation variability. Samples were generated using a sliding window with a 100-point overlap. Window lengths were selected to cover at least one full shaft rotation under each sampling setting while following common practice in rotating machinery diagnostics. Accordingly, segments of 1024 points were used for the 25.6 kHz data, and segments of 5120 points were used for the 100 kHz data. To avoid information leakage caused by overlapping windows, the raw continuous signals were first divided into mutually exclusive training, validation, and test subsets, and sliding-window segmentation with overlap was then performed separately within each subset. Therefore, no overlapping segments were shared across subset boundaries.
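The leakage-free segmentation described above can be sketched as follows: the continuous record is split first, and overlapping windows are then cut inside each subset only. The split fractions, the per-segment application of min-max scaling, the random placeholder record, and the function names are assumptions for illustration.

```python
import numpy as np

def min_max_normalize(x):
    """Scale a segment to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())

def split_then_segment(signal, window, overlap, fractions=(0.6, 0.2, 0.2)):
    """Split the continuous record first, then window each part separately."""
    step = window - overlap
    sizes = [int(round(f * len(signal))) for f in fractions]
    bounds = np.cumsum([0] + sizes)
    subsets = []
    for start, stop in zip(bounds[:-1], bounds[1:]):
        part = signal[start:stop]
        n = (len(part) - window) // step + 1      # windows that fit in this part
        segs = np.stack([part[i * step : i * step + window] for i in range(n)])
        subsets.append(np.stack([min_max_normalize(s) for s in segs]))
    return subsets                                # [train, val, test]

def samples_per_revolution(fs_hz, rpm):
    """Number of samples recorded during one shaft revolution."""
    return fs_hz * 60.0 / rpm

rng = np.random.default_rng(0)
record = rng.standard_normal(200_000)             # placeholder for a raw record
train, val, test = split_then_segment(record, window=5120, overlap=100)
rev = samples_per_revolution(100_000, 1400)       # about 4286 samples per revolution
```

For the 100 kHz data at 1400 rpm, one revolution spans roughly 4286 samples, so a 5120-point window covers more than one full rotation. Because the windows never straddle a subset boundary, no sample point appears in more than one subset.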
Experimental settings for Case 1, 2, and 3.
3.2. Compared methods
To demonstrate the advantages of the proposed WaveletTAM-DDPM, several representative generative models were selected as baselines. These models cover the main categories of vibration-signal generation approaches used in rotating machinery research, including adversarial learning, latent-variable modeling, and diffusion-based sampling. All baselines were trained on the same preprocessed samples as the proposed method to ensure fair comparison.
Collectively, these baselines provide a comprehensive comparison spanning adversarial, latent-variable, and diffusion paradigms, enabling a rigorous evaluation of the proposed model. For fairness, all compared generative methods were trained and evaluated under the same preprocessing procedure, augmentation setting, and downstream diagnostic protocol.
3.3. Quality evaluation of the generated signals
To evaluate sample fidelity under limited samples, 10 real samples from each class were used to train the generative models, after which each model produced 200 synthetic signals for each class under the same augmentation setting. This setting reflects practical scarcity scenarios in rotating machinery and provides sufficient samples for statistically stable evaluation of quality metrics.
Composite-metric evaluation of generated signals (Case 1).
Composite-metric evaluation of generated signals (Case 3).
Across both cases, WGAN and VAE-GAN achieve high PSNR due to their strong amplitude reconstruction capability, but exhibit weaker FSCS because their adversarial objectives provide limited constraints on spectral structures. DCGAN achieves moderately balanced performance, whereas DDPM performs worst when trained on only 10 real samples, showing unstable PSNR and reduced FSCS caused by diffusion-step over-smoothing under data scarcity.
In contrast, WaveletTAM-DDPM achieves the highest FSCS and competitive SED in both cases, indicating better preservation of frequency-domain structure and impact-related characteristics. Although its PSNR is slightly lower than that of VAE-GAN and WGAN, it obtains the highest composite score. This result suggests that the proposed method preserves diagnostically meaningful structures rather than only matching signal amplitudes, thereby demonstrating robust generative performance under limited-sample conditions.
Taking a representative outer-race fault sample from Case 3 as an illustrative example, Figure 4 compares the time- and frequency-domain waveforms of real and generated signals. WaveletTAM-DDPM closely reproduces impulse intervals, amplitude ranges, and overall waveform morphology. In the frequency domain, the generated spectra accurately capture fault-related characteristic frequencies and their harmonics, despite mild attenuation of impulse peaks. These results confirm that the proposed method produces realistic and condition-specific vibration signals well suited for data augmentation in downstream fault diagnosis.
Figure 4. Time-domain (a) and frequency-domain (b) comparison between real and generated signals for a representative outer-race fault sample in Case 3.
3.4. Fault diagnosis under limited samples
To assess the usefulness of WaveletTAM-DDPM for downstream applications, fault classification experiments were conducted under limited-sample conditions. Two end-to-end diagnostic networks, the deep convolutional neural network with wide first-layer kernels (WDCNN) (Zhang et al., 2017) and ResNet-18 (He et al., 2016), were adopted as classifiers. Both models integrate feature extraction and classification within a single convolutional structure, enabling direct learning from raw vibration segments.
During the augmentation stage, 10 or 30 real samples per class were used to train the generative models, after which each compared generative model produced 200 synthetic signals for each class. The same augmentation setting was applied across all methods, cases, and sample conditions to ensure a fair comparison. For training the diagnostic networks, these synthetic samples were combined with the corresponding 10 or 30 real samples, whereas the test set consisted exclusively of real signals to ensure unbiased evaluation. All diagnostic networks were trained for up to 300 epochs with early stopping based on validation accuracy, and each experiment was repeated ten times to ensure statistical reliability across all three cases.
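The data-assembly step of this protocol can be sketched as follows: synthetic samples are appended to the few real training samples, while the test set is left untouched and purely real. The array shapes, the zero-filled placeholder arrays, and the function names are assumptions for illustration, and no real data or results are implied.

```python
import numpy as np

def build_augmented_training_set(real_x, real_y, synth_x, synth_y):
    """Concatenate real and synthetic samples for classifier training only."""
    x = np.concatenate([real_x, synth_x], axis=0)
    y = np.concatenate([real_y, synth_y], axis=0)
    return x, y

def summarize_runs(accuracies):
    """Mean and sample standard deviation over repeated runs."""
    a = np.asarray(accuracies, dtype=float)
    return a.mean(), a.std(ddof=1)

# Placeholder arrays: 4 classes, 10 real and 200 synthetic segments per class.
n_classes, n_real, n_synth, length = 4, 10, 200, 1024
real_x = np.zeros((n_classes * n_real, length))
real_y = np.repeat(np.arange(n_classes), n_real)
synth_x = np.zeros((n_classes * n_synth, length))
synth_y = np.repeat(np.arange(n_classes), n_synth)

train_x, train_y = build_augmented_training_set(real_x, real_y, synth_x, synth_y)
print(train_x.shape, np.bincount(train_y))  # 210 training samples per class
```

`summarize_runs` would then be applied to the test accuracies collected over the ten repetitions of each experiment.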
Diagnosis accuracy (%) of different models.

Diagnosis accuracy of different models under case 1 using 10 real samples per category for training.

Diagnosis accuracy of different models under case 2 using 10 real samples per category for training.

Diagnosis accuracy of different models under case 3 using 10 real samples per category for training.
In contrast, WaveletTAM-DDPM achieves the best or near-best diagnostic performance across nearly all cases and network backbones. With only 10 real samples per class, it reaches 95.28% accuracy with WDCNN and 98.04% with ResNet-18, improvements of more than 40 and 20 percentage points, respectively, over training without augmentation (52.75% and 74.17%). These gains highlight the model’s ability to generate discriminative, class-consistent vibration signals even in extreme low-sample settings. In the medium-sample regime (30 real samples per class), WaveletTAM-DDPM remains highly competitive and exhibits lower variance than both adversarial and diffusion baselines, demonstrating strong robustness and generalization.
Figures 8–10 show confusion matrices for Case 3 with ResNet-18 under the 10-sample condition. Without augmentation, the classifier collapses toward a single dominant class, reflecting the lack of discriminative features in scarce data. DCGAN and DDPM reduce this collapse but still suffer from substantial misclassification between adjacent fault types. VAE-GAN and WGAN further improve separability but still exhibit overlaps in classes with similar spectral signatures. In contrast, WaveletTAM-DDPM produces a confusion matrix with strong diagonal dominance and minimal off-diagonal errors. This indicates that its synthetic signals preserve class-dependent temporal and spectral structures such as impulse intervals, modulation patterns, and characteristic-frequency harmonics, allowing the classifier to learn reliable decision boundaries under extreme scarcity.
Figure 8. Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) Without augmentation; (b) DCGAN.
Figure 9. Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) DDPM; (b) VAE-GAN.
Figure 10. Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) WGAN; (b) Proposed method.


4. Conclusion
This study addressed vibration-based fault diagnosis under limited labeled data in industrial rotating machinery by proposing WaveletTAM-DDPM, a physics-informed diffusion-based augmentation framework that integrates wavelet kernel convolution for multiscale impulse-aware representation with temporal attention for long-range signal modeling. A composite metric combining time-, frequency-, and entropy-domain similarity was further introduced to evaluate the physical fidelity of generated samples. Experiments on two bearing datasets demonstrated the effectiveness of the proposed method under severe data scarcity. With only ten real samples per class, WaveletTAM-DDPM improved WDCNN accuracy from 52.75% to 95.28% and ResNet-18 accuracy from 74.17% to 98.04% in a representative case. These results indicate that the proposed framework improves both the quality of the augmented data and downstream diagnostic performance. Nevertheless, the current framework remains limited to single-sensor inputs, lacks stronger physics-based constraints during diffusion sampling, and has not been validated in broader cross-domain scenarios. Future work will therefore focus on multi-sensor fusion, stronger physics-consistent diffusion mechanisms, broader cross-condition validation, and tighter integration with industrial PHM workflows.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China [grant number U23A20620]; and the National Natural Science Foundation of China [grant number 52275111].
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The first dataset used in this study was provided by an industry partner and is not publicly available due to confidentiality restrictions. The second dataset can be made available upon reasonable request.
