Physics-informed diffusion-based augmentation for vibration-based fault diagnosis of rotating machinery

Abstract

Rotating machinery fault diagnosis is fundamental to the safe and stable operation of smart manufacturing systems. However, vibration-based diagnostic models are often limited by the scarcity of labeled fault data, which weakens their robustness and generalization under varying operating conditions. Recent studies have applied denoising diffusion probabilistic models (DDPMs) to vibration signal augmentation in limited-sample scenarios. Yet, most existing methods are developed on raw time-series signals or generic time-frequency representations, without explicitly incorporating the multiscale impulsive characteristics of fault vibrations. As a result, the generated signals often exhibit weakened fault-related transients. To address this issue, this paper proposes a wavelet-preprocessed temporal-attention diffusion framework for vibration-signal augmentation. Multiscale wavelet priors are integrated into the diffusion backbone to preserve impact-sensitive structures, temporal-attention modules are embedded into the reverse denoising process to enhance temporal dependency modeling, and an asymmetric U-Net with a strengthened decoder is adopted to improve reconstruction quality. Experiments on two bearing test platforms demonstrate that the proposed method generates signals with higher physical fidelity and achieves superior diagnostic performance compared with representative generative baselines. These results indicate that the proposed framework is an effective augmentation strategy for robust vibration-based intelligent diagnostics under data-scarce conditions.

Keywords

rotating machinery fault diagnosis; vibration-signal augmentation; denoising diffusion probabilistic model; data-scarce intelligent diagnostics; prognostics and health management

1. Introduction

Industrial rotating machinery is expected to operate with high reliability and availability in smart manufacturing systems. As key supporting components in rotating transmission chains, rolling bearings play a critical role in maintaining motion accuracy, load transfer, and operational stability. Bearing failures can therefore lead to unplanned downtime, increased maintenance cost, and performance degradation, making accurate fault diagnosis essential for reliable condition monitoring (Jiang et al., 2024). In practice, however, labeled vibration data are often scarce because fault events are infrequent, fault reproduction is costly and potentially unsafe, and data annotation depends heavily on expert knowledge (Peng et al., 2025; Zhang et al., 2026). This data scarcity makes data-driven diagnostic models more vulnerable to overfitting and poor generalization under varying operating conditions (Qi et al., 2025; Shao et al., 2018). Therefore, developing robust fault diagnosis methods under limited labeled data remains a critical challenge in vibration-based intelligent maintenance.

To alleviate the scarcity of labeled vibration data in fault diagnosis, generative models have been increasingly explored for synthetic sample generation (Chen et al., 2026b). Early studies mainly relied on generative adversarial networks (GANs) (Fu et al., 2023; Liu and Wu, 2024) and variational autoencoders (VAEs) (Dixit and Verma, 2020; Ju et al., 2024; Zhao et al., 2020) to enrich limited datasets and improve diagnostic robustness. These methods have demonstrated the potential of data augmentation in small-sample scenarios (Wang et al., 2020; Zhao et al., 2022), but adversarial models often suffer from training instability, whereas latent-variable models may produce over-smoothed signals and insufficiently preserve transient fault details (Pan et al., 2022). More recently, denoising diffusion probabilistic models (DDPMs) (Ho et al., 2020) have attracted growing attention because of their stable training behavior and strong generative capability. Existing diffusion-based studies have reported promising results in vibration-signal augmentation, few-shot diagnosis, and cross-condition learning (Fan et al., 2025; Tong et al., 2026; Wang et al., 2024). However, most of these methods still treat vibration signals as generic time series or transformed representations, without explicitly emphasizing the impulsive and multiscale nature of bearing fault responses (Lin et al., 2025; Lourari et al., 2025; Peng et al., 2026).

Despite the progress of diffusion-based methods, several limitations remain when they are applied to vibration-oriented fault diagnosis. First, the scale-localized impulsive characteristics induced by repetitive fault impacts are often insufficiently modeled, which may weaken the preservation of fault-discriminative transient structures in generated samples. Second, temporal dependency during reverse diffusion is usually captured only implicitly by conventional U-Net backbones, making it difficult to accurately reconstruct the timing, recurrence, and modulation patterns of fault-related impulses (Chen et al., 2026a). Third, the quality of generated signals is still commonly evaluated using generic similarity measures or downstream classification accuracy alone (Qi et al., 2026), which cannot fully reflect whether the synthesized signals preserve physically meaningful waveform, spectral, and impulsive characteristics required for reliable diagnosis.

To address the above limitations, this study proposes a wavelet-preprocessed temporal-attention DDPM (WaveletTAM-DDPM), a diffusion-based vibration-signal generation framework designed to enhance data-scarce intelligent maintenance and fault diagnosis of industrial rotating systems, validated on rolling bearings as a representative rotating component. The main contributions of this work are summarized as follows:

i. A DDPM-based framework is developed to enhance bearing diagnosis under limited samples by augmenting scarce labeled data with diffusion-generated vibration signals, providing a physics-informed data enrichment module that can be integrated into intelligent maintenance and condition monitoring systems.

ii. The proposed framework introduces a physics-informed diffusion structure, where wavelet kernel convolution injects multiscale impact-sensitive priors, temporal-attention modules model long-range temporal dependency during reverse diffusion, and an asymmetric U-Net with a deeper decoder enhances reconstruction capability.

iii. A composite evaluation metric is constructed to assess generated signals from multiple vibration-physics perspectives, integrating time-, frequency-, and entropy-based similarity measures to provide a more systematic assessment of physical fidelity and its relevance to downstream intelligent diagnosis and maintenance decision support.

The rest of the paper is organized as follows. Section 2 provides a detailed explanation of the proposed framework, its network structure, and the proposed composite metric. Section 3 reports the experimental evaluation and validation. Section 4 concludes the paper.

2. Proposed method

2.1. Overall workflow of the proposed method

To address the scarcity of labeled vibration data in industrial fault diagnosis, this study proposes a data augmentation framework based on WaveletTAM-DDPM. The goal is to generate high-fidelity synthetic vibration signals that expand the training distribution without requiring additional physical experiments. As illustrated in Figure 1, the workflow consists of four stages:

Figure 1.

Overall framework of the proposed method.

2.1.1. Data preprocessing

Raw vibration signals under normal and fault conditions are collected and segmented into fixed-length samples using a sliding window. This increases sample diversity while preserving local transient impulses.

2.1.2. Diffusion-based signal generation

During training, each vibration sample is first processed by a wavelet kernel convolution layer integrated directly into the diffusion network to extract localized time-frequency impulse features (Fujieda et al., 2018). The extracted representations are then denoised by an asymmetric U-Net equipped with temporal attention to perform the reverse diffusion process, ultimately generating synthetic vibration signals.

2.1.3. Signal-level quality evaluation

The generated vibration signals are assessed using the proposed composite metric, which evaluates similarity to real signals from time-domain, frequency-domain, and impact-feature perspectives.

2.1.4. Task-level diagnostic evaluation

High-fidelity synthetic data are combined with real signals to train fault diagnostic networks, whereas only real samples are used for testing. This verifies that the generated signals improve diagnostic robustness rather than biasing the evaluation.

Signal-level and task-level evaluation jointly measure the usefulness of the generated data. The proposed framework ensures that synthetic signals are not only visually similar to real vibration data, but also practically beneficial for downstream fault diagnosis.

2.2. WaveletTAM-DDPM

The structure of the proposed WaveletTAM-DDPM is shown in Figure 2. The model integrates three key components into the diffusion process: wavelet kernel convolution, an asymmetric U-Net backbone, and temporal attention, enabling effective generation of vibration signals under limited-sample conditions.

Figure 2.

WaveletTAM-DDPM U-Net structure.

2.2.1. Wavelet kernel convolution for multi-channel vibration representations

At the first stage of the diffusion U-Net, the raw vibration segment passes through a wavelet kernel convolution (WKC) layer, replacing the conventional first convolution layer (Yi et al., 2024). Unlike standard convolution kernels that learn arbitrary filter shapes, WKC uses predefined wavelet functions as kernels, capturing localized transient impulses and oscillatory patterns that dominate vibration signals. This makes WKC more suitable for rotating machinery fault signals, which are inherently impulsive, nonstationary, and multi-frequency.

In addition, WKC expands the original single-channel signal into a multi-channel representation, where each channel corresponds to a wavelet kernel at a different scale. At larger scales, the wavelet channels emphasize low-frequency components associated with defect-induced resonance bands, while smaller scales respond strongly to high-frequency transients and their modulation sidebands. This scale-based decomposition allows each channel to act as a band-selective filter that captures fault-related spectral signatures such as the bearing characteristic frequencies: ball pass frequency of inner race (BPFI), ball pass frequency of outer race (BPFO), and ball spin frequency (BSF), along with their surrounding amplitude-modulated sidebands, directly at the network input. As a result, the U-Net receives an explicitly encoded time-frequency representation instead of having to infer scale-dependent patterns implicitly through deeper layers.

In this work, the Mexican hat wavelet (Mexh) is used as the kernel basis. Mexh is a second-derivative Gaussian wavelet that produces symmetric, sharp impulse responses, which is ideal for rotating machinery fault impulses. Compared with Morlet and Laplace wavelets, Mexh offers stronger localization of transient spikes, clear separation between oscillatory components and impulse peaks, and zero mean, which improves stability during reverse diffusion. Thus, WKC provides impulse-aware representations before any diffusion step occurs, making the reverse denoising path easier to learn.
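As a rough illustration of the WKC idea, the following Python sketch builds a bank of fixed Mexican hat kernels and convolves a single-channel segment with each of them. The 96 filters follow Table 1; the kernel length of 65 points, the geometric scale range from 1 to 8, and the helper names `mexh_kernel` and `wavelet_kernel_conv` are assumptions made for this example and are not taken from the original implementation.

```python
import numpy as np

def mexh_kernel(scale, length=65):
    """Sample a Mexican hat wavelet at the given scale on `length` points.

    psi(t) = 2 / (sqrt(3) * pi**0.25) * (1 - t**2) * exp(-t**2 / 2),
    evaluated at t = k / scale and divided by sqrt(scale).
    """
    t = (np.arange(length) - (length - 1) / 2) / scale
    psi = (2.0 / (np.sqrt(3.0) * np.pi ** 0.25)) * (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)
    return psi / np.sqrt(scale)

def wavelet_kernel_conv(x, scales, length=65):
    """Expand a 1-channel signal of shape (N,) into (len(scales), N).

    Each output channel is the signal filtered by one fixed Mexh kernel,
    so small scales react to sharp transients and large scales to slower
    oscillations.
    """
    return np.stack(
        [np.convolve(x, mexh_kernel(s, length), mode="same") for s in scales]
    )

# Assumed settings: 96 scales (Table 1), geometrically spaced (hypothetical range).
scales = np.geomspace(1.0, 8.0, 96)
segment = np.zeros(1024)
segment[500] = 1.0  # a single synthetic impact
features = wavelet_kernel_conv(segment, scales)  # shape (96, 1024)
```

With a unit impulse as input, each channel simply reproduces its own kernel centered on the impact, which makes the scale-dependent response easy to inspect before moving to real vibration segments.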

2.2.2. Asymmetric U-Net for improved denoising reconstruction

The backbone of WaveletTAM-DDPM follows a U-Net structure but adopts an intentionally asymmetric depth. Each down-sampling layer contains one residual block, whereas each up-sampling layer contains two. This asymmetry is motivated by the different functional roles of the encoder and decoder in diffusion modeling. The encoder primarily performs feature contraction, mapping the multiscale wavelet representation into a compact latent space. Because this stage emphasizes abstraction rather than detail preservation, a single residual block at each level is sufficient to encode the dominant structures required for denoising.

In contrast, the decoder operates during the reverse diffusion process, where the model must progressively remove noise while reconstructing fine-grained temporal structures that are easily corrupted by the forward noising steps. Reverse denoising is inherently a detail restoration task, and its difficulty increases as the time step decreases. To support this, each decoder stage includes two residual blocks, enabling stronger nonlinear refinement, greater sensitivity to subtle impulse cues, and more expressive feature transformation. The additional residual block also improves the fusion between encoder and decoder features passed through skip connections, which is essential for accurately restoring impulse shapes and modulation envelopes.

Together, this asymmetric design allocates model capacity where it is most needed, namely, on the reconstruction side of the U-Net, stabilizing the denoising trajectory and improving the fidelity of the synthesized vibration signals.

2.2.3. Temporal-attention module (TAM)

The TAM (Wang et al., 2022) is inserted in the bottleneck and all decoder blocks to strengthen reconstruction of long-range temporal structures. Its structure is shown in Figure 3. Given an input feature tensor $X \in R^{C \times N}$ where $C$ represents the number of channels and $N$ represents the temporal length, for each channel, TAM first computes two linear projections through $1 \times 1$ convolutions:

P^{TAM} = W_{p} * X

(1)

Q^{TAM} = W_{q} * X

(2)

where $W_{p}$ and $W_{q}$ are learnable $1 \times 1$ convolution kernels and $*$ denotes convolution. A normalized attention precursor $W^{TAM}$ is obtained by applying a SoftMax operation along the temporal axis:

W^{TAM} = SoftMax (Q^{TAM})

(3)

Figure 3.

TAM structure.

The attention map $T^{TAM}$ is then computed by element-wise multiplication:

T^{TAM} = P^{TAM} \odot W^{TAM}

(4)

Finally, TAM fuses the attention map with the original features via a residual connection:

Y^{TAM} = X \oplus T^{TAM}

(5)

where $\odot$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition. Normalizing $Q^{TAM}$ enforces relative temporal weighting, allowing $W^{TAM}$ to selectively gate $P^{TAM}$. The resulting attention map $T^{TAM}$ highlights informative temporal regions while suppressing redundancy, improving temporal feature reconstruction in the reverse diffusion process.

TAM is applied only in the middle and up-sampling stages. During down-sampling, the network extracts local representations, and temporal correlations are still redundant, making attention unnecessary. During up-sampling, the model must reconstruct temporal structure from noise, and TAM strengthens long-range dependency modeling, helping recover fault impulses and periodic vibration patterns more accurately.
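The TAM computation in Equations (1) to (5) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: each $1 \times 1$ convolution is assumed to map $C$ channels to $C$ channels, so it reduces to a matrix product over the channel axis, and the names `softmax` and `temporal_attention` are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention(x, w_p, w_q):
    """Apply the temporal-attention module to features x of shape (C, N).

    w_p and w_q are (C, C) weight matrices standing in for the two
    1x1 convolutions (assumed to keep the channel count unchanged).
    """
    p = w_p @ x              # Eq. (1): projection P
    q = w_q @ x              # Eq. (2): projection Q
    w = softmax(q, axis=1)   # Eq. (3): normalize along the temporal axis
    t = p * w                # Eq. (4): element-wise gating of P
    return x + t             # Eq. (5): residual fusion with the input

rng = np.random.default_rng(0)
features = rng.standard_normal((8, 128))       # C = 8 channels, N = 128 steps
w_p = rng.standard_normal((8, 8)) * 0.1
w_q = rng.standard_normal((8, 8)) * 0.1
out = temporal_attention(features, w_p, w_q)   # same shape as the input
```

Because the SoftMax weights of each channel sum to one over time, the gate redistributes emphasis among time steps rather than rescaling the whole channel, which is the relative temporal weighting described above.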

The main hyperparameter settings are summarized in Table 1.

Table 1.

Main hyperparameter settings.

Parameter	Value
Number of steps	1000
Maximum epochs	1000
Batch size	8
Learning rate	0.00005
Optimizer	Adam
Wavelet function	Mexh
Number of wavelet filters	96
$β_{1}$	0.9
$β_{2}$	0.99

2.3. Composite metric

To evaluate the fidelity of generated vibration signals in a comprehensive and physically meaningful way, a composite multi-domain quality metric is constructed. Vibration signals from rotating machinery exhibit strong impulsive transients, harmonic spectral structures, and amplitude-modulated sidebands produced by bearing impacts, fault repetition frequencies, and resonance amplification. Assessing such signals using a single similarity measure is inadequate. Metrics such as mean squared error (MSE) and peak signal-to-noise ratio (PSNR) measure waveform proximity but are insensitive to spectral distortions and may still assign high scores to signals that lack fault impulses. Purely spectral metrics such as frequency spectrum cosine similarity (FSCS) capture harmonic alignment but ignore local amplitude variations and transient strength. Impact-based indicators alone, such as spectral entropy difference (SED), quantify impulsiveness but overlook global waveform accuracy and harmonic structure. Relying on only one of these perspectives risks overlooking essential characteristics needed for reliable fault diagnosis.

Therefore, three complementary indicators are selected to jointly characterize fidelity from the time, frequency, and impact perspectives: PSNR evaluates waveform similarity in the time domain, FSCS measures the alignment of harmonic structures and spectral energy distribution, and SED captures the impulsiveness and disorder level associated with fault-induced impacts. Together, PSNR reflects temporal accuracy, FSCS captures frequency component correctness, and SED quantifies impact behavior. Combining them aligns fidelity evaluation with the physics of vibration signals, ensuring that generated samples preserve waveform shape, spectral content, and impulse characteristics. The three indicators are fused using the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) (De Lima Silva and De Almeida Filho, 2020) to produce a single composite score. TOPSIS is used here because PSNR, FSCS, and SED reflect complementary aspects of signal fidelity and may exhibit different trends across competing methods. By jointly considering their different scales and optimization directions, the fusion result provides a more comprehensive assessment of generated-signal quality than any single metric alone.

PSNR is computed as

PSNR = \frac{1}{n} \sum_{i = 1}^{n} 10 \log_{10} (\frac{{\max (y_{i})}^{2}}{MSE (y_{i}, \hat{y_{i}})})

(6)

FSCS is defined as

FSCS = \frac{\sum_{i = 1}^{n} FFT (y_{i}) \cdot FFT (\hat{y_{i}})}{\sqrt{\sum_{i = 1}^{n} {FFT (y_{i})}^{2}} \sqrt{\sum_{i = 1}^{n} {FFT (\hat{y_{i}})}^{2}}}

(7)

SED is computed as

SED = | SE_{real} - SE_{gen} |

(8)

SE = - \sum_{i = 1}^{N} p_{i} \log (p_{i})

(9)

where $y_{i}$ and $\hat{y_{i}}$ denote the $i$-th sample points of the real and generated vibration signals, $MSE$ is the mean squared error, $n$ denotes the number of samples, the Fast Fourier Transform (FFT) is used to obtain the signal spectrum, and spectral entropy (SE) is computed from the normalized spectral power distribution $p_{i}$. Higher PSNR and FSCS, together with lower SED, indicate that generated signals retain temporal fidelity, spectral structure, and impulsiveness consistent with real vibration responses. To unify directionality so that a higher value indicates better quality, the SED of the $i$-th sample, $SED_{i}$, is transformed as

SED_{i}^{'} = \max (SED) - SED_{i}

(10)
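The three component indicators can be sketched as follows for a single pair of real and generated segments. This is an illustrative reading of Equations (6) to (9) under stated assumptions: FSCS is computed on one-sided FFT magnitude spectra, spectral entropy uses the natural logarithm with a small `eps` to avoid `log(0)`, and all function names are hypothetical.

```python
import numpy as np

def psnr(y, y_hat):
    """Eq. (6) for one signal pair: 10 * log10(max(y)^2 / MSE(y, y_hat))."""
    mse = np.mean((y - y_hat) ** 2)
    return float(10.0 * np.log10(np.max(y) ** 2 / mse))

def fscs(y, y_hat):
    """Eq. (7): cosine similarity of the magnitude spectra (assumed one-sided)."""
    a = np.abs(np.fft.rfft(y))
    b = np.abs(np.fft.rfft(y_hat))
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def spectral_entropy(y, eps=1e-12):
    """Eq. (9): Shannon entropy of the normalized spectral power distribution."""
    power = np.abs(np.fft.rfft(y)) ** 2
    p = power / power.sum()
    return float(-np.sum(p * np.log(p + eps)))

def sed(y, y_hat):
    """Eq. (8): absolute spectral entropy difference."""
    return abs(spectral_entropy(y) - spectral_entropy(y_hat))

# Toy example: a min-max scaled tone versus a noisy copy of it.
rng = np.random.default_rng(0)
t = np.arange(1024) / 1024.0
real = 0.5 + 0.5 * np.sin(2 * np.pi * 50 * t)
generated = real + 0.05 * rng.standard_normal(real.size)
scores = (psnr(real, generated), fscs(real, generated), sed(real, generated))
```

Note that `psnr` is undefined when the two signals are identical (zero MSE), so a practical implementation would need to guard that case.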

An evaluation matrix $x_{ij}$ is constructed, where $i$ indexes the signal samples and $j$ indexes the evaluation metrics. All metrics are standardized with:

v_{i j} = \frac{x_{i j}}{\sqrt{\sum_{i = 1}^{n} x_{i j}^{2}}}

(11)

where $v_{ij}$ represents the standardized evaluation metrics. The TOPSIS procedure then establishes the ideal and anti-ideal solutions $A^{+}$ and $A^{-}$ according to each metric's attribute characteristics:

A^{+} = \{ v_{1}^{+}, v_{2}^{+}, v_{3}^{+} \}

(12)

A^{-} = \{ v_{1}^{-}, v_{2}^{-}, v_{3}^{-} \}

(13)

where $v_{j}^{+} = \max_{i} (v_{ij})$ and $v_{j}^{-} = \min_{i} (v_{ij})$. The Euclidean distances between each candidate alternative and both the ideal solution $A^{+}$ and the anti-ideal solution $A^{-}$ are then computed:

d_{i}^{+} = \sqrt{\sum_{j = 1}^{3} {(v_{i j} - v_{j}^{+})}^{2}}

(14)

d_{i}^{-} = \sqrt{\sum_{j = 1}^{3} {(v_{i j} - v_{j}^{-})}^{2}}

(15)

The composite metric is computed as

CompositeMetric_{i} = \frac{d_{i}^{-}}{d_{i}^{+} + d_{i}^{-}}

(16)

Thus, a larger $CompositeMetric_{i}$ value denotes higher signal quality.
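The fusion in Equations (10) to (16) can be sketched as a short NumPy routine. The function names are hypothetical and the input values are toy numbers chosen for illustration, not results from this study; the sketch also assumes that no metric column is entirely zero and that the candidates are not all identical, since either case would lead to a division by zero.

```python
import numpy as np

def topsis_score(x):
    """TOPSIS closeness for a (n_samples, n_metrics) matrix of
    larger-is-better metrics."""
    x = np.asarray(x, dtype=float)
    v = x / np.sqrt((x ** 2).sum(axis=0))             # Eq. (11): vector normalization
    a_pos = v.max(axis=0)                             # Eq. (12): ideal solution
    a_neg = v.min(axis=0)                             # Eq. (13): anti-ideal solution
    d_pos = np.sqrt(((v - a_pos) ** 2).sum(axis=1))   # Eq. (14)
    d_neg = np.sqrt(((v - a_neg) ** 2).sum(axis=1))   # Eq. (15)
    return d_neg / (d_pos + d_neg)                    # Eq. (16)

def composite_metric(psnr_values, fscs_values, sed_values):
    """Flip SED so that larger is better (Eq. 10), then fuse with TOPSIS."""
    sed_values = np.asarray(sed_values, dtype=float)
    sed_flipped = sed_values.max() - sed_values       # Eq. (10)
    return topsis_score(np.column_stack([psnr_values, fscs_values, sed_flipped]))

# Toy values for three candidate signals (illustrative only).
scores = composite_metric(
    psnr_values=[10.0, 5.0, 8.0],
    fscs_values=[0.9, 0.2, 0.5],
    sed_values=[0.1, 2.0, 1.0],
)
```

A candidate that is best on every indicator coincides with the ideal solution and receives a score of 1, whereas one that is worst on every indicator receives 0; all other candidates fall in between.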

3. Experiment verification

3.1. Data description

Two bearing datasets were used to evaluate the proposed WaveletTAM-DDPM under different operating conditions, sampling rates, and fault types. The datasets were selected to cover both cross-roller and roller bearings, enabling assessment of generalization across bearing structures and excitation characteristics.

3.1.1. Cross-roller–bearing dataset

Data were collected on the test platform using SW-HG-20 cross-roller bearings. The inner race was rigidly fixed to a 450 Nm load arm, while the outer race was coupled to the input shaft. Vibration signals were measured using a PCB6061 accelerometer and sampled at 25.6 kHz. Four health conditions were tested: normal, inner-race fault, outer-race fault, and rolling-element fault. All defects were laser-machined, with dimensions given as length × width × depth. Inner- and outer-race defects were sized $2 mm \times 0.5 mm \times 0.05 mm$, while the rolling-element defect measured $2 mm \times 1.0 mm \times 0.2 mm$. Defects were introduced at fixed circumferential positions, but their angular locations were not used as diagnostic information.

3.1.2. Roller-bearing dataset

The second dataset was collected on the roller-bearing test platform developed in our laboratory. NU205 M and N205 M bearings were used for the experiments. Inner-race faults were machined on NU205 M bearings, whereas outer-race faults were machined on N205 M bearings. Apart from the number of rolling elements, the two bearings share identical geometric parameters. All defects were produced with a depth of 0.5 mm and widths of either 0.5 mm or 1.0 mm. Vibration signals were acquired at a sampling rate of 100 kHz under a rotational speed of 1400 rpm. The dataset includes three health conditions: normal, inner-race fault, and outer-race fault.

All signals were standardized using max-min normalization to eliminate amplitude-scaling effects caused by sensor gain or installation variability. Samples were generated using a sliding window with 100-point overlap. Window lengths were selected to span at least one full bearing rotation under each sampling setting and to follow common practice in rotating machinery diagnostics. Accordingly, segments of 1024 points were used for the 25.6 kHz data, and 5120 points were used for the 100 kHz data. To avoid information leakage caused by overlapping windows, the raw continuous signals were first divided into mutually exclusive training, validation, and test subsets, and sliding-window segmentation with overlap was then performed separately within each subset. Therefore, no overlapping segments were shared across the subset boundaries.
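The leakage-free segmentation order described above can be sketched as follows, together with a quick check of the window lengths against the shaft speeds listed in Table 2. The 60/20/20 split fractions and all function names are assumptions for illustration only; the text does not state the actual split ratios.

```python
import numpy as np

def sliding_windows(x, window, overlap=100):
    """Cut x into fixed-length segments that overlap by `overlap` points."""
    step = window - overlap
    n = (len(x) - window) // step + 1
    return np.stack([x[i * step : i * step + window] for i in range(n)])

def split_then_segment(x, window, overlap=100, fractions=(0.6, 0.2, 0.2)):
    """Split the continuous record into disjoint train/validation/test parts
    first, then window each part separately so no segment crosses a boundary.
    The fractions are hypothetical."""
    bounds = np.cumsum([0] + [round(f * len(x)) for f in fractions])
    return [
        sliding_windows(x[a:b], window, overlap)
        for a, b in zip(bounds[:-1], bounds[1:])
    ]

def points_per_revolution(fs_hz, rpm):
    """Number of sampled points in one shaft revolution."""
    return fs_hz * 60.0 / rpm

record = np.random.default_rng(0).standard_normal(200_000)  # stand-in raw signal
train, val, test = split_then_segment(record, window=1024)

# Window-length check using the speeds in Table 2.
rev_case1 = points_per_revolution(25_600, 1500)    # 1024 points, equal to the window
rev_case23 = points_per_revolution(100_000, 1400)  # about 4286 points, below 5120
```

Splitting before windowing is what prevents two overlapping segments from landing in different subsets; windowing first and shuffling afterwards would not give that guarantee.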

Three cases were constructed to represent a range of bearing types and defect severities. Case 1 used the cross-roller-bearing dataset with four health conditions. Cases 2 and 3 used the roller-bearing dataset, corresponding to small-size (0.5 mm) and large-size (1.0 mm) defects, respectively. To evaluate diagnostic performance under limited samples, either 10 or 30 samples per class were used for training in each case. The resulting six experimental tasks are summarized in Table 2.

Table 2.

Experimental settings for Cases 1, 2, and 3.

Case	Bearing	Class	Health status	Defect size (mm)	Speed (rpm)	Samples/Class
1	Cross roller	0	Normal	-	1500	10/30
		1	Inner-race fault	2 × 0.5 × 0.05	1500	10/30
		2	Outer-race fault	2 × 0.5 × 0.05	1500	10/30
		3	Rolling-element fault	2 × 1.0 × 0.2	1500	10/30
2	Roller	0	Normal	-	1400	10/30
		1	Inner-race fault	0.5 width, 0.5 depth	1400	10/30
		2	Outer-race fault	0.5 width, 0.5 depth	1400	10/30
3	Roller	0	Normal	-	1400	10/30
		1	Inner-race fault	1.0 width, 0.5 depth	1400	10/30
		2	Outer-race fault	1.0 width, 0.5 depth	1400	10/30

3.2. Compared methods

To demonstrate the advantages of the proposed WaveletTAM-DDPM, several representative generative models were selected as baselines. These models cover the main categories of vibration-signal generation approaches used in rotating machinery research, including adversarial learning, latent-variable modeling, and diffusion-based sampling. All baselines were trained on the same preprocessed samples as the proposed method to ensure fair comparison.

Deep convolutional GAN (DCGAN) (Radford et al., 2016) employs fully convolutional generator and discriminator networks, improving training stability compared with early GAN variants. DCGAN has been widely adopted for vibration-signal generation due to its efficiency and ability to capture local temporal structures.

Wasserstein GAN (WGAN) (Li et al., 2022) replaces the Jensen-Shannon divergence with the Wasserstein distance, alleviating mode collapse and offering smoother gradient behavior. This makes it a strong baseline for generating impact-like responses in bearing vibration signals.

VAE-GAN (Liu et al., 2021) integrates variational autoencoding with adversarial training, combining structured latent encoding with high-fidelity reconstruction. Its hybrid formulation allows better modeling of both global trends and localized fault-induced impulses.

DDPM (Ho et al., 2020) represents the diffusion-based class of generative models. It generates diverse and high-quality samples through iterative denoising steps that reverse a forward noise-injection process. DDPM has recently shown strong performance on non-Gaussian and highly nonstationary vibration data.

Collectively, these baselines provide a comprehensive comparison spanning adversarial, latent-variable, and diffusion paradigms, enabling a rigorous evaluation of the proposed model. For fairness, all compared generative methods were trained and evaluated under the same preprocessing procedure, augmentation setting, and downstream diagnostic protocol.

3.3. Quality evaluation of the generated signals

To evaluate sample fidelity under limited samples, 10 real samples from each class were used to train the generative models, after which each model produced 200 synthetic signals for each class under the same augmentation setting. This setting reflects practical scarcity scenarios in rotating machinery and provides sufficient samples for statistically stable evaluation of quality metrics.

Tables 3 and 4 report the proposed composite metric and its component measures for Case 1 and Case 3. Results for Case 2 show a similar overall trend: WaveletTAM-DDPM again achieves the highest composite metric (0.5458 ± 0.1312) and the highest FSCS (0.7542 ± 0.0496), which further supports the consistency of the proposed method across cases. The remaining Case 2 values are omitted here for brevity. PSNR reflects amplitude-level fidelity, FSCS measures spectral-shape consistency, and SED quantifies deviations in spectral entropy, indicating whether noise-like or overly smooth components appear in generated signals.

Table 3.

Composite-metric evaluation of generated signals (Case 1).

Method	PSNR↑	FSCS↑	SED↓	Composite Metric↑
WGAN	10.3827 $\pm$ 2.0191	0.7571 $\pm$ 0.0704	1.3747 $\pm$ 0.3441	0.6435 $\pm$ 0.1323
DDPM	−10.7431 $\pm$ 4.7664	0.4538 $\pm$ 0.1004	2.4954 $\pm$ 0.9748	0.4382 $\pm$ 0.1419
DCGAN	8.9914 $\pm$ 1.9483	0.6592 $\pm$ 0.0642	1.7570 $\pm$ 0.5420	0.6083 $\pm$ 0.1408
VAE-GAN	10.6218 $\pm$ 1.8694	0.2320 $\pm$ 0.1010	1.7296 $\pm$ 1.2245	0.3551 $\pm$ 0.1967
Proposed	8.6813 $\pm$ 2.1073	0.7759 $\pm$ 0.0603	1.7066 $\pm$ 0.2844	0.6848 $\pm$ 0.1128

Table 4.

Composite-metric evaluation of generated signals (Case 3).

Method	PSNR↑	FSCS↑	SED↓	Composite Metric↑
WGAN	14.3181 $\pm$ 3.1200	0.6260 $\pm$ 0.0986	0.7729 $\pm$ 0.1898	0.5754 $\pm$ 0.1086
DDPM	−2.9313 $\pm$ 6.6705	0.3669 $\pm$ 0.1328	3.6981 $\pm$ 1.8016	0.5065 $\pm$ 0.2225
DCGAN	12.6748 $\pm$ 3.1056	0.5529 $\pm$ 0.0577	1.0469 $\pm$ 0.2769	0.5854 $\pm$ 0.0969
VAE-GAN	14.4938 $\pm$ 3.0240	0.2991 $\pm$ 0.1182	3.8583 $\pm$ 1.9260	0.5197 $\pm$ 0.1939
Proposed	11.9570 $\pm$ 3.0818	0.7638 $\pm$ 0.0394	3.9225 $\pm$ 0.9480	0.5900 $\pm$ 0.1001

Across both cases, WGAN and VAE-GAN achieve the highest PSNR owing to their strong amplitude reconstruction capability, but their FSCS is lower than that of the proposed method, markedly so for VAE-GAN (0.2320 and 0.2991), because their adversarial objectives provide limited constraints on spectral structures. DCGAN achieves moderately balanced performance, whereas DDPM yields the lowest PSNR in both cases when trained on only 10 real samples, with negative mean values and reduced FSCS caused by diffusion-step over-smoothing under data scarcity.

In contrast, WaveletTAM-DDPM achieves the highest FSCS in both cases and a competitive SED in Case 1, indicating better preservation of frequency-domain structure. Its SED in Case 3 (3.9225 ± 0.9480) is, however, the highest among the compared methods, so the entropy-based indicator does not favor it there. Although its PSNR is roughly 1.7 to 2.5 dB lower than that of VAE-GAN and WGAN, it obtains the highest composite score in both cases. This result suggests that the proposed method preserves diagnostically meaningful structures rather than only matching signal amplitudes, thereby demonstrating robust generative performance under limited-sample conditions.

Taking a representative outer-race fault sample from Case 3 as an illustrative example, Figure 4 compares the time- and frequency-domain waveforms of real and generated signals. WaveletTAM-DDPM closely reproduces impulse intervals, amplitude ranges, and overall waveform morphology. In the frequency domain, the generated spectra accurately capture fault-related characteristic frequencies and their harmonics, despite mild attenuation of impulse peaks. These results confirm that the proposed method produces realistic and condition-specific vibration signals well suited for data augmentation in downstream fault diagnosis.

Figure 4.

Time-domain (a) and frequency-domain (b) comparison between real and generated signals for a representative outer-race fault sample in Case 3.

3.4. Fault diagnosis under limited samples

To assess the usefulness of WaveletTAM-DDPM for downstream applications, fault classification experiments were conducted under limited-sample conditions. Two end-to-end diagnostic networks, wide deep CNN (WDCNN) (Zhang et al., 2017) and ResNet-18 (He et al., 2016), were adopted as baselines. Both models integrate feature extraction and classification within a single convolutional structure, enabling direct learning from raw vibration segments.

During the augmentation stage, 10 or 30 real samples per class were used to train the generative models, after which each compared generative model produced 200 synthetic signals for each class. The same augmentation setting was applied across all methods, cases, and sample conditions to ensure a fair comparison. For training the diagnostic networks, these synthetic samples were combined with the corresponding 10 or 30 real samples, whereas the test set consisted exclusively of real signals to ensure unbiased evaluation. All diagnostic networks were trained for up to 300 epochs with early stopping based on validation accuracy, and each experiment was repeated ten times to ensure statistical reliability across all three cases.

Table 5 reports the average diagnostic accuracy using 30 real samples per class, while Figures 5–7 illustrate performance in the more challenging 10-sample scenario. Accuracy increases with more real data, and ResNet-18 outperforms WDCNN in all augmented settings of Table 5 due to its deeper representation capability; the only exception is the unaugmented Case 3 baseline. WGAN and VAE-GAN perform competitively when 30 samples are available but degrade sharply in the 10-sample regime, reflecting higher data requirements. DCGAN and DDPM bring moderate improvements but remain unstable under severe scarcity.

Table 5.

Diagnosis accuracy (%) of different models.

Models	WDCNN (Case 1)	ResNet-18 (Case 1)	WDCNN (Case 2)	ResNet-18 (Case 2)	WDCNN (Case 3)	ResNet-18 (Case 3)
None	65.89 $\pm$ 5.59	81.44 $\pm$ 23.30	80.91 $\pm$ 10.97	83.00 $\pm$ 19.00	78.18 $\pm$ 9.94	71.20 $\pm$ 20.32
DCGAN	94.00 $\pm$ 1.15	94.22 $\pm$ 2.97	89.40 $\pm$ 3.87	96.24 $\pm$ 3.60	82.25 $\pm$ 2.88	89.23 $\pm$ 13.69
DDPM	95.01 $\pm$ 8.31	97.92 $\pm$ 1.56	86.85 $\pm$ 1.81	91.28 $\pm$ 3.44	88.63 $\pm$ 3.04	97.92 $\pm$ 1.56
VAE-GAN	97.16 $\pm$ 1.24	97.59 $\pm$ 0.78	93.84 $\pm$ 3.87	97.87 $\pm$ 3.40	77.58 $\pm$ 4.64	96.37 $\pm$ 5.52
WGAN	97.00 $\pm$ 0.94	97.32 $\pm$ 0.83	91.05 $\pm$ 4.14	96.07 $\pm$ 3.80	75.81 $\pm$ 3.38	94.68 $\pm$ 4.77
Proposed	96.93 $\pm$ 1.93	97.73 $\pm$ 0.82	97.88 $\pm$ 2.63	98.38 $\pm$ 1.67	96.58 $\pm$ 2.61	97.83 $\pm$ 1.05
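To make the comparison in Table 5 easier to read, the following sketch transcribes the mean accuracies from the table and reports, for each column, which method leads and how far the proposed method is from the leader. The helper names `column_leaders` and `gap_to_best` are hypothetical and introduced only for this illustration; standard deviations are omitted.

```python
# Mean accuracies (%) transcribed from Table 5, in the table's column order.
columns = [
    "WDCNN (Case 1)", "ResNet-18 (Case 1)",
    "WDCNN (Case 2)", "ResNet-18 (Case 2)",
    "WDCNN (Case 3)", "ResNet-18 (Case 3)",
]
means = {
    "None":     [65.89, 81.44, 80.91, 83.00, 78.18, 71.20],
    "DCGAN":    [94.00, 94.22, 89.40, 96.24, 82.25, 89.23],
    "DDPM":     [95.01, 97.92, 86.85, 91.28, 88.63, 97.92],
    "VAE-GAN":  [97.16, 97.59, 93.84, 97.87, 77.58, 96.37],
    "WGAN":     [97.00, 97.32, 91.05, 96.07, 75.81, 94.68],
    "Proposed": [96.93, 97.73, 97.88, 98.38, 96.58, 97.83],
}

def column_leaders(means, columns):
    """Map each column to the method with the highest mean accuracy."""
    return {
        col: max(means, key=lambda method: means[method][j])
        for j, col in enumerate(columns)
    }

def gap_to_best(means, columns, method="Proposed"):
    """Percentage-point gap between `method` and the best mean per column."""
    return {
        col: round(max(v[j] for v in means.values()) - means[method][j], 2)
        for j, col in enumerate(columns)
    }

leaders = column_leaders(means, columns)
gaps = gap_to_best(means, columns)
for col in columns:
    print(f"{col}: best = {leaders[col]}, gap of Proposed = {gaps[col]} pp")
```

On these values the proposed method has the highest mean in three of the six columns (both Case 2 columns and WDCNN in Case 3) and trails the best baseline by at most 0.23 percentage points in the remaining three.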

Figure 5.

Diagnosis accuracy of different models under case 1 using 10 real samples per category for training.

Figure 6.

Diagnosis accuracy of different models under case 2 using 10 real samples per category for training.

Figure 7.

Diagnosis accuracy of different models under case 3 using 10 real samples per category for training.

In contrast, WaveletTAM-DDPM achieves the best or near-best diagnostic performance across nearly all cases and network backbones. With only 10 real samples per class, it reaches 95.28% accuracy with WDCNN and 98.04% with ResNet-18 in a representative case, improvements of more than 40 and 20 percentage points, respectively, over training without augmentation. These gains highlight the model’s ability to generate discriminative, class-consistent vibration signals even in extreme low-sample settings. In the medium-sample regime (30 real samples per class), WaveletTAM-DDPM remains highly competitive and shows consistently low variance: its standard deviation stays below 3 percentage points in every column of Table 5, whereas each adversarial and diffusion baseline exceeds that level in at least one column, indicating strong robustness and generalization.
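As a quick check on the stated gains, take the no-augmentation accuracies of 52.75% (WDCNN) and 74.17% (ResNet-18) reported for the same representative case in the Conclusion. The improvements in percentage points then work out as follows, where the symbol $\Delta$ is introduced here only for illustration:

$$\Delta_{\mathrm{WDCNN}} = 95.28 - 52.75 = 42.53, \qquad \Delta_{\mathrm{ResNet\text{-}18}} = 98.04 - 74.17 = 23.87.$$

Both values are absolute differences in accuracy rather than relative changes.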

Figures 8–10 show confusion matrices for Case 3 with ResNet-18 under the 10-sample condition. Without augmentation, the classifier collapses toward a single dominant class, reflecting the lack of discriminative features in scarce data. DCGAN and DDPM reduce this collapse but still suffer from substantial misclassification between adjacent fault types. VAE-GAN and WGAN further improve separability but still exhibit overlaps in classes with similar spectral signatures. In contrast, WaveletTAM-DDPM produces a confusion matrix with strong diagonal dominance and minimal off-diagonal errors. This indicates that its synthetic signals preserve class-dependent temporal and spectral structures such as impulse intervals, modulation patterns, and characteristic-frequency harmonics, allowing the classifier to learn reliable decision boundaries under extreme scarcity.

Figure 8.

Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) Without augmentation; (b) DCGAN.

Figure 9.

Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) DDPM; (b) VAE-GAN.

Figure 10.

Confusion matrices of ResNet-18 under Case 3 using 10 samples per category for training. (a) WGAN; (b) Proposed method.
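The two behaviours discussed for the confusion matrices, collapse toward one class and diagonal dominance, can be quantified with a short sketch. The labels below are toy values chosen for illustration and are not taken from the experiments; `dominant_prediction_share` is a hypothetical helper name, not a metric defined in this paper.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Rows index the true class, columns the predicted class."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # accumulate repeated (true, pred) pairs
    return cm

def per_class_recall(cm):
    """Diagonal entry divided by row sum; 0 for classes with no samples."""
    row_sums = cm.sum(axis=1)
    return np.divide(np.diag(cm), row_sums,
                     out=np.zeros(len(cm)), where=row_sums > 0)

def dominant_prediction_share(cm):
    """Fraction of all predictions that fall on the most-predicted class."""
    return cm.sum(axis=0).max() / cm.sum()

# Toy example: three classes, two test samples each.
y_true = [0, 0, 1, 1, 2, 2]
collapsed_pred = [0, 0, 0, 0, 0, 2]   # classifier favours class 0
diagonal_pred = [0, 0, 1, 1, 2, 2]    # ideal, fully diagonal case

cm_collapsed = confusion_matrix(y_true, collapsed_pred, 3)
cm_diagonal = confusion_matrix(y_true, diagonal_pred, 3)

print(per_class_recall(cm_collapsed))           # recall per true class
print(dominant_prediction_share(cm_collapsed))  # 5 of 6 predictions on class 0
print(dominant_prediction_share(cm_diagonal))   # 2 of 6 when predictions are balanced
```

A share close to 1 signals the collapse seen without augmentation, while per-class recall values near 1 across all rows correspond to the diagonal dominance described for the proposed method.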

4. Conclusion

This study addressed vibration-based fault diagnosis under limited labeled data in industrial rotating machinery by proposing WaveletTAM-DDPM, a physics-informed diffusion-based augmentation framework that integrates wavelet kernel convolution for multiscale impulse-aware representation with temporal attention for long-range signal modeling. A composite metric combining time-, frequency-, and entropy-domain similarity was further introduced to evaluate the physical fidelity of generated samples. Experiments on two bearing datasets demonstrated the effectiveness of the proposed method: with only ten real samples per class, WaveletTAM-DDPM improved WDCNN accuracy from 52.75% to 95.28% and ResNet-18 accuracy from 74.17% to 98.04% in a representative case. These results indicate that the framework improves both the quality of the augmented data and downstream diagnostic performance under severe data scarcity. Nevertheless, the current framework remains limited to single-sensor inputs, lacks stronger physics-based constraints during diffusion sampling, and has not been validated in broader cross-domain scenarios. Future work will therefore focus on multi-sensor fusion, stronger physics-consistent diffusion mechanisms, broader cross-condition validation, and tighter integration with industrial PHM workflows.

Footnotes

ORCID iD

Weihua Li

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China [grant number U23A20620]; and the National Natural Science Foundation of China [grant number 52275111].

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The first dataset used in this study was provided by an industry partner and is not publicly available due to confidentiality restrictions. The second dataset can be made available upon reasonable request.

References

Chen, Shen, Xia, et al. (2026a) MS-DAWCAE: multi-scale deep adversarial wavelet convolutional autoencoder toward unseen fault diagnosis with limited data. Mechanical Systems and Signal Processing 249: 114042. https://doi.org/10.1016/j.ymssp.2026.114042

Chen, Lin, Gao, et al. (2026b) Multiscale scattering forests: a domain-generalizing approach for fault diagnosis under data constraints. Knowledge-Based Systems 337: 115389. https://doi.org/10.1016/j.knosys.2026.115389

De Lima Silva, De Almeida Filho (2020) Sorting with TOPSIS through boundary and characteristic profiles. Computers & Industrial Engineering 141: 106328. https://doi.org/10.1016/j.cie.2020.106328

Dixit, Verma (2020) Intelligent condition-based monitoring of rotary machines with few samples. IEEE Sensors Journal 20(23): 14337–14346. https://doi.org/10.1109/jsen.2020.3008177

Fan, Zhang, et al. (2025) A novel lightweight DDPM-based data augmentation method for rotating machinery fault diagnosis with small sample. Mechanical Systems and Signal Processing 232: 112741. https://doi.org/10.1016/j.ymssp.2025.112741

Jiang, et al. (2023) Rolling bearing fault diagnosis based on 2D time-frequency images and data augmentation technique. Measurement Science and Technology 34(4): 045005. https://doi.org/10.1088/1361-6501/acabdb

Fujieda, Takayama, Hachisuka (2018) Wavelet convolutional neural networks. arXiv:1805.08620 [cs].

He, Zhang, Ren, et al. (2016) Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. IEEE, pp. 770–778.

Ho, Jain, Abbeel (2020) Denoising diffusion probabilistic models. arXiv:2006.11239 [cs].

Jiang, Lin, et al. (2024) Fault diagnosis of gearbox driven by vibration response mechanism and enhanced unsupervised domain adaptation. Advanced Engineering Informatics 61: 102460. https://doi.org/10.1016/j.aei.2024.102460

Chen, Qiang, et al. (2024) A systematic review of data augmentation methods for intelligent fault diagnosis of rotating machinery under limited data conditions. Measurement Science and Technology 35(12): 122004. https://doi.org/10.1088/1361-6501/ad7a97

Zou, Jiang (2022) Fault diagnosis of rotating machinery based on combination of Wasserstein generative adversarial networks and long short term memory fully convolutional network. Measurement 191: 110826. https://doi.org/10.1016/j.measurement.2022.110826

Lin, Huang, Chen, et al. (2025) Matching pursuit network: an interpretable sparse time-frequency representation method toward mechanical fault diagnosis. IEEE Transactions on Neural Networks and Learning Systems 36(7): 12377–12388. https://doi.org/10.1109/TNNLS.2024.3483954

Liu (2024) Incremental bearing fault diagnosis method under imbalanced sample conditions. Computers & Industrial Engineering 192: 110203. https://doi.org/10.1016/j.cie.2024.110203

Liu, Jiang, et al. (2021) Rolling bearing fault diagnosis using variational autoencoding generative adversarial networks with deep regret analysis. Measurement 168: 108371. https://doi.org/10.1016/j.measurement.2020.108371

Lourari, El Yousfi, Benkedjouh, et al. (2025) Enhancing bearing and gear fault diagnosis: a VMD-PSO approach with multisensory signal integration. Journal of Vibration and Control 31(19–20): 4098–4112. https://doi.org/10.1177/10775463241273842

Pan, Chen, Zhang, et al. (2022) Generative adversarial network in mechanical fault diagnosis under small sample: a systematic review on applications and future perspectives. ISA Transactions 128: 1–10. https://doi.org/10.1016/j.isatra.2021.11.040

Peng, Shao, Xiao, et al. (2025) A systematic review on interpretability research of intelligent fault diagnosis models. Measurement Science and Technology 36(1): 012009. https://doi.org/10.1088/1361-6501/ad99f4

Peng, Shao, Xiao, et al. (2026) Dual-stage interpretable domain generalization fault diagnosis: integrating prior knowledge and gradient-weighted class activation mapping. Engineering Applications of Artificial Intelligence 166: 113655. https://doi.org/10.1016/j.engappai.2025.113655

Chen, Kong, et al. (2025) Attention-guided graph isomorphism learning: a multi-task framework for fault diagnosis and remaining useful life prediction. Reliability Engineering & System Safety 263: 111209. https://doi.org/10.1016/j.ress.2025.111209

Karimi, Uhlmann, et al. (2026) Uncertainty-aware sensorless anomaly detection using a reliable indicator from position-guided multi-step deep decomposition network. Reliability Engineering & System Safety 271: 112258. https://doi.org/10.1016/j.ress.2026.112258

Radford, Metz, Chintala (2016) Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv:1511.06434 [cs].

Shao, Jiang, Lin, et al. (2018) A novel method for intelligent fault diagnosis of rolling bearings using ensemble deep auto-encoders. Mechanical Systems and Signal Processing 102: 278–297. https://doi.org/10.1016/j.ymssp.2017.09.026

Tong, Zhu, et al. (2026) ACS-DM: adaptive conditional sampling diffusion model for few-shot machinery fault diagnosis. Journal of Vibration and Control 10775463251414180. https://doi.org/10.1177/10775463251414180

Wang, Sun, Jin (2020) Imbalanced sample fault diagnosis of rotating machinery using conditional variational auto-encoder generative adversarial network. Applied Soft Computing 92: 106333. https://doi.org/10.1016/j.asoc.2020.106333

Wang, Chen, Zhang, et al. (2022) Dual-attention generative adversarial networks for fault diagnosis under the class-imbalanced conditions. IEEE Sensors Journal 22(2): 1474–1485. https://doi.org/10.1109/jsen.2021.3131166

Wang, Huang, Zhang, et al. (2024) Denoising diffusion implicit model combined with TransNet for rolling bearing fault diagnosis under imbalanced data. Sensors 24(24): 8009. https://doi.org/10.3390/s24248009

Hou, Jin, et al. (2024) Time series diffusion method: a denoising diffusion probabilistic model for vibration signal generation. Mechanical Systems and Signal Processing 216: 111481. https://doi.org/10.1016/j.ymssp.2024.111481

Zhang, Peng, et al. (2017) A new deep learning model for fault diagnosis with good anti-noise and domain adaptation ability on raw vibration signals. Sensors 17(2): 425. https://doi.org/10.3390/s17020425

Zhang, Chen, Lai, et al. (2026) Global-local contrastive learning: a multi-operating-condition guided approach for few-shot cross-domain bearing fault diagnosis. Engineering Applications of Artificial Intelligence 165: 113375. https://doi.org/10.1016/j.engappai.2025.113375

Zhao, Liu, et al. (2020) Enhanced data-driven fault diagnosis for machines with small and unbalanced data based on variational auto-encoder. Measurement Science and Technology 31(3): 035004. https://doi.org/10.1088/1361-6501/ab55f8

Zhao, Jiang, Liu, et al. (2022) A new data generation approach with modified Wasserstein auto-encoder for rotating machinery fault diagnosis with limited fault data. Knowledge-Based Systems 238: 107892. https://doi.org/10.1016/j.knosys.2021.107892