Deep learning-based detection and severity classification of distributed pitting faults in helical gearboxes

Abstract

Gears, which play a vital role in the transmission of power and torque between shafts in industrial machinery, can be susceptible to faults such as pitting, cracking, wear, and corrosion, caused by harsh operating conditions. Among these faults, distributed pitting, characterized by the formation of small pits on all surfaces of the teeth, is more difficult to detect. The early detection and severity assessment of such faults are crucial in preventing further damage to machinery. Deep learning methods, particularly autoencoders (AEs) and convolutional neural networks (CNNs), have demonstrated significant potential for detecting gear faults in vibration signals. This study explores the use of variational AEs (VAEs) as feature extractors to classify the severity of distributed pitting faults in helical gears. For this purpose, experiments were conducted using a test rig equipped with a two-stage industrial helical gearbox under four distinct load conditions, employing healthy gears and gears with varying pitting severities. Vibration signals, captured by accelerometers, were combined into one-dimensional time-series data and subsequently transformed into two-dimensional representations using two approaches to leverage image classification capabilities of deep learning methods. Various configurations of CNN, AE, and VAE models were evaluated for fault severity classification using different input modalities. Of all the methods tested, VAE combined with a classifier performed best in almost all load conditions. In addition, because not all fault types may be represented during model development, the study also examined the behavior of the best-performing models when one fault class was excluded from training. This additional evaluation was used to assess the generalization to unseen fault classes under incomplete fault coverage, and the VAE–classifier combination showed more reliable behavior than the competing alternatives within this evaluation setting.

Keywords

distributed gear pitting, fault diagnosis, deep learning, variational autoencoder, condition monitoring

Introduction

Gearboxes are among the most critical components of industrial mechanical systems, used to transfer power and torque from one shaft to another at desired rates. They are widely used in various sectors, including automotive, aerospace, energy, and manufacturing. Typically used in harsh environments, gearboxes are prone to faults such as tooth wear, bearing failure, and misalignment.¹ These faults can lead to reduced performance, increased downtime, and even catastrophic failure. One of the most challenging types of gearbox failure is pitting, which can cause significant damage to the gearbox and the equipment it drives. Pitting is a type of gear failure in which small pits or craters form on the surface of the gear teeth and can eventually grow into a larger pit or crack. These pits are typically caused by repeated load cycling of the gear or by misalignment and can cause noise, vibration, and a reduction in overall gear performance. When the pits are not restricted to a single tooth or gear section and can be found on every tooth of the gear, the fault is called a distributed pitting fault. It is difficult to determine the presence or development of a distributed pitting fault because the gear vibration signals for different fault severities are similar to each other and to those of a healthy gear.²

Early pitting detection is critical to prevent more serious damage and avoid costly repairs or replacement. Pitting can be detected by analyzing the gear’s vibration, acoustic emission, or magnetic flux. Vibration analysis is the best-known and most widely used technique for fault diagnosis.^3–8 It provides a cost-effective, reliable, and real-time monitoring solution because sensors are easy to integrate into the system. Time domain,^9–11 frequency domain,^12–14 or time–frequency domain^2,15–17 methods can be used to analyze vibration signals. Conventionally, advanced signal feature extraction methods are required to detect faults from vibration signals. These methods rely on prior knowledge of signal processing and professional experience, and the extracted features determine the success of fault detection. Deep learning methods, by contrast, have shown remarkable performance in various signal processing tasks and do not require human intervention in the feature extraction process, thus eliminating the need for domain-specific expertise.¹⁸

In recent years, deep learning methods such as convolutional neural networks (CNNs), deep belief networks, recurrent neural networks (RNNs), and autoencoders (AEs) have been successfully used for gear fault diagnosis with promising results.^5,19,20 Jing et al.²¹ implemented CNNs to detect faults in planetary gearboxes. Using vibration data, Tian and Zuo²² developed an extended RNN model to predict gearbox health. Zhang et al.²³ introduced a novel unsupervised fault diagnosis method based on generalized normalized sparse filtering to achieve high accuracy and robustness for bearings and planetary gears under complex operating conditions with limited training samples. Li et al.²⁴ proposed a deep neural network method based on an AE to diagnose faults in planetary gearboxes using features extracted by variational mode decomposition and power spectral entropy. Lupea et al.²⁵ developed CNN-based one-dimensional (1D) and two-dimensional (2D) models to detect gearbox faults using raw vibration data from a triaxial accelerometer. Li et al.²⁶ presented a CNN-based method combining multisensor data and multiscale feature fusion to achieve high accuracy and fast convergence in detecting helical gear faults under high-speed and high-load conditions. Lupea et al.²⁷ proposed a cubic kernel support vector machine (SVM) model for detecting helical gearbox faults, utilizing gear mesh frequency (GMF) harmonics and sideband features obtained from triaxial vibration signals. Qu et al.²⁸ introduced a novel methodology for fault detection in gearboxes, integrating dictionary learning with a deep sparse AE framework. Abdul and Al-Talabani²⁹ developed a model based on mel frequency cepstral coefficients and gammatone cepstral coefficients computed for the input signal frames and tested it using SVM, long short-term memory, and echo state network classification models on two different gearbox datasets.
Furthermore, Ma et al.³⁰ proposed a data-driven fault diagnosis method that combines time–frequency analysis and a deep residual network to effectively detect incipient faults in planetary gearboxes at varying speeds. Another study developed a second-order cyclostationary indicator to monitor fatigue pitting progression in gears.³¹ That study showed that the indicator can effectively evaluate gear degradation by utilizing the cyclostationary characteristics of gear vibration signals. Although each method offers unique advantages and limitations, the optimal choice depends on the specific application and available data.

Among these methods, AEs have been shown to be effective for unsupervised feature extraction by learning low-dimensional representations of input data.³² However, variational autoencoders (VAEs) go a step further with their superior nonlinear feature extraction capabilities and the ability to learn more representative and discriminative features even from limited sample data.^33,34 Unlike traditional AEs, VAEs incorporate a constraint into their coding network to ensure that latent variables follow a standard Gaussian distribution. This regularization not only addresses the problem of unregularized latent space inherent in AEs but also improves recognition performance by enabling the model to capture meaningful variation in the data.³⁵ As a result of these features, VAEs are well-suited to challenging tasks where detecting hidden and complex features is critical.

Although VAEs have demonstrated considerable success in general fault detection tasks,³⁶ their specific application as feature extractors for the classification of gear pitting faults, particularly in challenging cases such as distributed pitting in helical gears, remains underexplored. In our previous work,³⁷ a VAE-based method was proposed for anomaly detection in distributed gear pitting; however, the focus was limited to identifying the presence of a fault without assessing its severity. Similarly, Yurtsever et al.³⁸ applied a VAE model for the classification of gear pitting faults using raw vibration signals, but their study was confined to local pitting faults rather than distributed pitting in helical gears and was evaluated only under a single operating load.

Despite limited research on the use of VAEs to classify pitting faults in gear systems, VAE-based techniques have been investigated in other fault-related contexts. For example, conditional variational neural networks have been used to extract features to diagnose faults such as cracked, chipped, or missing gear teeth.³⁹ Joint VAEs have been utilized for anomaly detection in wind turbine gearboxes using SCADA and supervisory control data,⁴⁰ while CVAE-GANs have been developed to address data imbalance in planetary gearbox datasets.⁴¹ Moreover, a multifidelity VAE model has been proposed for general gearbox fault diagnosis in big data environments using vibration signals.⁴²

This study investigates the effectiveness of VAEs as feature extractors for the detection and severity classification of distributed pitting faults in helical gears under different load conditions. In addition to the classification of the severity of distributed pitting faults in helical gears, the task of classifying fault types not seen in the training dataset is also critical in the prognosis of an industrial system. In practical gearbox monitoring, it is often unrealistic to assume that all fault types will be available during model development. A diagnostic model may therefore encounter fault patterns that were not represented in the training data. To examine this practical challenge, the present study evaluated the generalization capability of the learned representations to unseen fault classes by excluding one fault class during training and then analyzing how the unseen samples were mapped by the trained model. This evaluation provides additional insight into the robustness of the proposed representations when fault coverage in the training set is incomplete.

The remainder of this article is structured as follows: The “Experimental setup” section outlines the procedures and experimental setup employed in this study. The “Dataset” section provides a detailed description of the data used for training and evaluating the model. The “Methodology” section introduces the architectures and theoretical underpinnings of the CNN, AE, and VAE models developed for pitting fault detection. The “Results and discussion” section presents the experimental findings and their interpretation. The “Computational analysis and deployment considerations” section evaluates the computational complexity of the models presented in this study and discusses the feasibility of their real-world implementation. Finally, the “Conclusion” section summarizes the main outcomes of the study and offers concluding remarks.

Experimental setup

A test rig for fault monitoring was set up, as illustrated in Figure 1. It comprises a two-stage industrial helical gearbox, an AC drive motor with a power rating of 2.2 kW, and a DC load motor with a power rating of 2.2 kW. These components are connected via belt-pulley mechanisms in order to eliminate the adverse effects that may arise from the use of AC–DC motors and misalignments. A 5-V DC ME4-S12L-PA inductive sensor, which produces a single pulse per rotation, was used to determine the position of the input shaft. The specifications of the first and second stages of the gearbox are presented in Table 1.

Figure 1.

Experimental setup.^37,43

Table 1.

Specifications of the two-stage gearbox.^37,43

Specification	First stage	Second stage
Number of teeth	29/40	13/33
Normal module (mm)	1.25	2.5
Pressure angle (°)	20	20
Helix angle (°)	30	15

Using a resistor bank, the DC motor load was adjusted to four different levels. The first of these load conditions represents the no-load condition and is referred to as 0% load. The remaining load conditions in the gearbox correspond to 33%, 66%, and 100% of the maximum power of the DC load motor, respectively. Furthermore, a speed controller for the AC drive motor was employed to enable the gearbox to operate within the speed range of 0 to 3000 rpm.⁴³

Two PCB 352A76 accelerometers with a frequency range of 5–16,000 Hz were employed to obtain the vibration signals generated by the gears. The accelerometers were placed at right angles to each other in the input shaft bearing housings, as shown in Figure 1. The raw vibration data acquired from the accelerometers were sampled at 15 kHz and recorded on a computer using a data acquisition system and National Instruments LabVIEW 7.0 software.

When there is angular misalignment in the gears, the surface contact stress along the face width of the mating gear teeth cannot be uniform. In this case, the distributed pitting fault, one of the gear failures that may occur in the future, is most likely to begin on the tooth surfaces where contact stresses greater than allowed are experienced.

All distributed pitting faults were simulated on all teeth of the pinion gear using an electro-erosion machine, as shown in Figure 2. First, a circular pit with a diameter of approximately 0.7 mm and a depth of approximately 0.1 mm was created on the tooth surfaces, as shown in Figure 3(b). Subsequently, to represent the advancement of distributed pitting faults assumed to be caused by angular misalignment, the number of pits was increased, as shown in Figure 3(c), (d), and (e).

Figure 2.

Formation of distributed pitting failure on gear tooth surfaces using an electro-erosion machine.⁴³

Figure 3.

Pinion gear with pitting faults of different severities^37,43: (a) Healthy, (b) Fault 1, (c) Fault 2, (d) Fault 3, and (e) Fault 4.

The speed of the pinion gear is 2678 rpm, which gives a fundamental tooth meshing frequency of 1294 Hz for the first stage and 420.7 Hz for the second stage. Both vibration and positioning signals (inductive sensor) were sampled at 15 kHz. The raw vibration data were recorded continuously over 1337 pinion rotations.
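The quoted meshing frequencies follow from the pinion speed and the tooth counts in Table 1. The short Python check below is an illustrative sketch only: it assumes that the 29-tooth gear is the pinion on the input shaft and that the 13-tooth gear shares the intermediate shaft with the 40-tooth gear. The function and variable names are ours and do not come from the experimental software.

```python
# Nominal values taken from the text and Table 1.
PINION_RPM = 2678.0
Z1_PINION, Z1_WHEEL = 29, 40   # first-stage tooth counts
Z2_PINION = 13                 # second-stage driving gear (assumed to sit on the intermediate shaft)


def mesh_frequencies(pinion_rpm: float) -> tuple[float, float]:
    """Return the (first-stage, second-stage) gear mesh frequencies in Hz."""
    input_shaft_hz = pinion_rpm / 60.0
    gmf_stage1 = input_shaft_hz * Z1_PINION
    # The intermediate shaft turns slower by the first-stage ratio 29/40.
    intermediate_shaft_hz = input_shaft_hz * Z1_PINION / Z1_WHEEL
    gmf_stage2 = intermediate_shaft_hz * Z2_PINION
    return gmf_stage1, gmf_stage2


gmf1, gmf2 = mesh_frequencies(PINION_RPM)
print(f"{gmf1:.1f} Hz, {gmf2:.1f} Hz")  # prints "1294.4 Hz, 420.7 Hz"
```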

Figure 4 presents the time and frequency domain representations of the vibration signals obtained from gears with four different types of distributed pitting and a healthy gear. The gearbox was disassembled and reassembled each time a distributed pitting fault was simulated on the tooth surfaces of the pinion gear. Because of this repeated reassembly and of manufacturing errors, a modulation that appears as repetitive fluctuations in each pinion rotation may have been introduced.

Figure 4.

Raw vibration acceleration signals (A1 accelerometer) detected from the gearbox with distributed pitting and their corresponding spectra.³⁷

It is clear from Figure 4 that the vibration acceleration signals are similar, making it difficult to determine whether a distributed pitting fault exists or is developing. In particular, previous studies have shown that advanced and computationally demanding signal processing methods are needed for the early detection of distributed pitting damage and the evaluation of its severity.^2,44–46

Dataset

The dataset includes five distinct classes: four representing different levels of pitting fault severity and one representing the healthy gear condition. The faults in the dataset are labeled as Fault 1, Fault 2, Fault 3, and Fault 4, and these labels directly correspond to the severity levels of the respective faults. Each class comprises data from three data channels: horizontal acceleration sensor data, vertical acceleration sensor data, and encoder output data. The encoder output data were used to identify each full rotation of the gear, and this information was used to divide the horizontal and vertical sensor data into discrete windows, one per rotation cycle. Consequently, each class was represented by 1337 data series, each comprising 334 data points. As in Hizarcı et al.,⁴⁶ the time series were filtered to remove noise between 1000 and 5400 Hz. This filtering step was applied to reduce high-frequency noise and improve signal consistency prior to model development.

Analyzing collected data using time-series methods offers significant advantages, particularly when examining frequency components and investigating signal dynamics.⁴⁷ Nevertheless, the remarkable effectiveness of image processing techniques within deep learning frameworks has highlighted the need to enhance feature richness by transforming temporal data into 2D representations. Therefore, the capabilities of deep learning algorithms can be exploited by creating images from existing data.^48,49 Two different approaches were used to transform the vibration data into images.

In the first approach, scalogram images were generated using the continuous wavelet transform with the Morlet wavelet as the mother wavelet. The Morlet wavelet is a highly effective function that provides high accuracy in time–frequency analysis and enables the localization of signals.⁵⁰ The scale value was set in the range 1–20 for more precise separation of high and low frequency components. The scalogram images reflect the time–frequency distribution of the signal in detail and provide a rich and informative data source for deep learning models. This is particularly useful for capturing the dynamic characteristics of signals and helping to improve classification performance.
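The scalogram computation can be sketched with NumPy alone. The example below is a minimal illustration rather than the toolbox routine used in the study: it assumes a complex Morlet wavelet with centre frequency parameter w0 = 6, truncates the wavelet at four scale widths, and uses a synthetic tone in place of measured data. Rendering the coefficient magnitudes as a 224 × 224 colour image is omitted.

```python
import numpy as np


def morlet(t: np.ndarray, w0: float = 6.0) -> np.ndarray:
    """Complex Morlet mother wavelet (w0 = 6 is an assumed, commonly used value)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-0.5 * t ** 2)


def scalogram(signal: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Return the CWT magnitude with shape (len(scales), len(signal))."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        half = int(np.ceil(4 * s))                       # truncate at +/- 4 scale widths
        t = np.arange(-half, half + 1) / s
        kernel = np.conj(morlet(t))[::-1] / np.sqrt(s)   # correlation written as a convolution
        out[i] = np.abs(np.convolve(signal, kernel, mode="same"))
    return out


scales = np.arange(1, 21)                    # scale range 1-20, as stated in the text
n = np.arange(334)                           # one rotation window of 334 samples
test_signal = np.sin(2 * np.pi * 0.1 * n)    # synthetic tone, not measured vibration data
image = scalogram(test_signal, scales)
best_scale = int(scales[np.argmax(image.mean(axis=1))])
```

For this synthetic tone the energy concentrates around the scale whose wavelet oscillates at the tone frequency, which is how a scalogram separates high-frequency content (small scales) from low-frequency content (large scales).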

An alternative approach involves sequentially concatenating x-axis and y-axis accelerometer readings into a 2D matrix.⁵¹ For each rotational movement instance, 334 data points from the x-axis are paired with an equal number of y-axis measurements, forming a combined vector of 668 temporal samples. To align with standard image-processing requirements, the dataset undergoes zero-padding by appending eight null values, yielding a 676-element vector. This adjusted sequence is subsequently reshaped into a 26 × 26 pixel matrix, ensuring square dimensionality for compatibility with CNN architectures commonly used in image-based deep learning frameworks. Figures 5 and 6 were created to provide a visual overview of the datasets.
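The data-matrix construction described above amounts to a concatenation, a zero-padding step, and a reshape. The sketch below illustrates it with NumPy on random stand-in data; the row-major fill order and the absence of any intensity scaling are assumptions, since the text does not specify them.

```python
import numpy as np


def to_data_matrix(x_axis: np.ndarray, y_axis: np.ndarray, side: int = 26) -> np.ndarray:
    """Concatenate two 334-sample windows, zero-pad to side*side, and reshape."""
    combined = np.concatenate([x_axis, y_axis])                    # 334 + 334 = 668 samples
    padded = np.pad(combined, (0, side * side - combined.size))    # append 8 zeros -> 676
    return padded.reshape(side, side)                              # row-major fill (assumed)


rng = np.random.default_rng(0)
x_window = rng.standard_normal(334)   # stand-in for one x-axis rotation window
y_window = rng.standard_normal(334)   # stand-in for the matching y-axis window
matrix = to_data_matrix(x_window, y_window)
```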

Figure 5.

Scalogram-based representations of different fault conditions: (a) Healthy, (b) Fault 1, (c) Fault 2, (d) Fault 3, and (e) Fault 4.

Figure 6.

Data matrix representations of different fault conditions: (a) Healthy, (b) Fault 1, (c) Fault 2, (d) Fault 3, and (e) Fault 4.

Methodology

This study presents a methodology for the detection and severity classification of distributed pitting faults in a two-stage industrial helical gearbox. Because distributed pitting gives rise to complex and overlapping vibration responses, the diagnosis task is generally more challenging than that of more localized fault signatures. To examine this problem from different perspectives, the acquired vibration signals were represented in three forms: 1D time-domain signals, 26 × 26 data-matrix images, and 224 × 224 time–frequency scalograms.

To obtain a reliable assessment of performance, all closed-set classification experiments were conducted using a chronological split of 60% for training, 10% for validation, and 30% for testing without random shuffling to reduce the risk of temporal data leakage. Unless otherwise stated, the reported neural-model results were obtained using five predefined random seeds and are reported as mean ± standard deviation. Within this protocol, the validation subset was used for model selection and hyperparameter tuning, whereas the final reported results were obtained only on the held-out test subset.
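A chronological 60/10/30 split can be expressed as three contiguous index ranges. The sketch below is one possible reading of the protocol: it assumes the split is applied to the 1337 windows of each class in recording order and that fractional counts are rounded down, neither of which is stated explicitly in the text.

```python
def chronological_split(n_samples: int, train: float = 0.6, val: float = 0.1):
    """Return contiguous index ranges (train, validation, test) in recording order."""
    n_train = int(n_samples * train)
    n_val = int(n_samples * val)
    train_idx = range(0, n_train)
    val_idx = range(n_train, n_train + n_val)
    test_idx = range(n_train + n_val, n_samples)   # the remainder goes to the test subset
    return train_idx, val_idx, test_idx


train_idx, val_idx, test_idx = chronological_split(1337)
print(len(train_idx), len(val_idx), len(test_idx))  # prints "802 133 402"
```

Because no shuffling is involved, every training window precedes every validation window, which in turn precedes every test window.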

The evaluated methods were grouped into four categories. First, handcrafted feature extraction combined with conventional classifiers was used to establish a baseline. Second, CNN models were applied according to the input format, including 1D-CNN for time-domain signals, LeNet-5 for the 26 × 26 data matrices, and transfer learning models such as ResNet50, VGG-16, MobileNetV3-Large, and EfficientNet-B0 for the scalogram images. Third, AE-based models were used to learn compact latent representations from both 1D and 2D inputs. Finally, VAEs were evaluated as probabilistic alternatives to standard AEs, with the expectation that their latent-space structure could better capture the gradual and overlapping characteristics of distributed pitting signals.

Feature extraction

The feature extraction process involves selecting informative features that effectively represent the dataset. The features were evaluated and extracted using MATLAB’s Diagnostic Feature Designer tool,⁵² after which a support vector machine (SVM) was used for classification. The extracted features include root mean square (RMS), mean, standard deviation, skewness, peak, signal-to-noise and distortion ratio (SINAD), signal-to-noise ratio (SNR), shape factor, kurtosis, clearance factor, impulse factor, crest factor, and total harmonic distortion (THD).⁵³ These features characterize various statistical and signal properties. RMS represents the signal’s energy by measuring the square root of the average squared values, the mean indicates the average signal level, and the standard deviation reflects the degree of variation. Skewness describes the asymmetry of the signal distribution, and the peak value identifies the maximum amplitude, whereas SINAD and SNR provide measures related to signal clarity. The shape factor reflects the waveform shape by relating the RMS to the average absolute value. Kurtosis measures the heaviness of the distribution tails, indicating outliers, while the clearance factor, impulse factor, and crest factor highlight fault-related characteristics by comparing the peak value to different average or RMS measures. Finally, THD quantifies the distortion caused by harmonic frequencies relative to the fundamental signal component. Together, these features provide a conventional statistical description of the signal and serve as a baseline for comparison with the deep learning-based approaches.
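For reference, the time-domain features listed above can be written out in a few lines of NumPy. This is a sketch using common textbook definitions, not a reproduction of the MATLAB tool: it uses the population standard deviation, and the spectral features (SINAD, SNR, and THD) are omitted. The function name and the sine-wave test input are ours.

```python
import numpy as np


def time_domain_features(x: np.ndarray) -> dict[str, float]:
    """Compute common statistical condition indicators for one signal window."""
    x = np.asarray(x, dtype=float)
    abs_x = np.abs(x)
    mean, std = x.mean(), x.std()          # population standard deviation (ddof = 0)
    rms = np.sqrt(np.mean(x ** 2))
    peak = abs_x.max()
    centered = x - mean
    return {
        "mean": float(mean),
        "std": float(std),
        "rms": float(rms),
        "peak": float(peak),
        "skewness": float(np.mean(centered ** 3) / std ** 3),
        "kurtosis": float(np.mean(centered ** 4) / std ** 4),
        "shape_factor": float(rms / abs_x.mean()),
        "crest_factor": float(peak / rms),
        "impulse_factor": float(peak / abs_x.mean()),
        "clearance_factor": float(peak / np.mean(np.sqrt(abs_x)) ** 2),
    }


# Sanity check on a pure sine wave sampled over ten whole periods.
t = np.arange(1000) / 1000.0
features = time_domain_features(np.sin(2 * np.pi * 10 * t))
```

For a pure sine wave the crest factor equals the square root of 2 and the kurtosis equals 1.5, which gives a quick way to validate the implementation before applying it to measured windows.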

Convolutional neural network

A CNN, which is a type of feedforward neural network, is capable of automatically extracting features from datasets through its convolutional structures.⁵⁴ The standard CNN architecture consists of several key layers: input, convolution, pooling, fully connected, and output layers.^55,56 The convolutional layer applies multiple filters to its input, allowing the extraction of numerous features that are then transformed into feature maps. An activation function is then applied to the output of the convolutional layer to introduce nonlinearity, which is essential to increase the learning capacity of the network. The pooling layer then reduces the spatial dimensions of the feature maps and the number of parameters, thereby improving computational efficiency. Although there are different types of pooling layers, this study uses max-pooling, a widely used technique that preserves the structural information of images.⁵⁷ The fully connected layer then consolidates the extracted features into a feature vector. Finally, the output layer maps this feature vector to class probabilities and determines the classification result.⁵⁸ The CNN models employed in this study can be categorized into 1D and 2D architectures, which are detailed below.
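The layer sequence described above (convolution, activation, max-pooling, fully connected layer, and output) can be traced in a short NumPy forward pass. The sketch below is purely illustrative: the weights are random and untrained, and the filter count and kernel width are arbitrary choices rather than the settings of any model evaluated in this study.

```python
import numpy as np


def conv1d_same(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Cross-correlate a 1D signal with each kernel using zero 'same' padding."""
    k = kernels.shape[1]
    padded = np.pad(x, (k // 2, k - 1 - k // 2))
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)   # (len(x), k)
    return windows @ kernels.T                                      # (len(x), n_filters)


def relu(a: np.ndarray) -> np.ndarray:
    return np.maximum(a, 0.0)


def max_pool1d(a: np.ndarray, size: int = 2) -> np.ndarray:
    """Keep the maximum of each non-overlapping block along the time axis."""
    n = (a.shape[0] // size) * size
    return a[:n].reshape(-1, size, a.shape[1]).max(axis=1)


def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()


rng = np.random.default_rng(0)
signal = rng.standard_normal(668)            # stand-in for one concatenated window
kernels = rng.standard_normal((4, 3))        # 4 filters of width 3 (illustrative sizes)
feature_maps = max_pool1d(relu(conv1d_same(signal, kernels)))      # shape (334, 4)
weights = rng.standard_normal((feature_maps.size, 5)) * 0.01       # untrained dense layer, 5 classes
probabilities = softmax(feature_maps.ravel() @ weights)
```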

One-dimensional convolutional neural network

Specifically designed for 1D data, 1D-CNNs employ kernels that slide along a single dimension, unlike traditional CNNs, which use two spatial dimensions. Recognizing the efficacy of 1D-CNNs in processing 1D signals,^59,60 we developed 12 distinct model architectures through hyperparameter tuning to evaluate their performance on our dataset. The tuned hyperparameters were the number of convolutional layers, kernel size, pooling layer size, and number of filters. Their ranges were selected to cover shallow-to-moderate architectures that were compatible with the dataset size and signal length, while avoiding unnecessarily large models that could increase overfitting risk and computational cost. All 12 1D-CNN models were applied to vibration data obtained under each load condition (no load, 33%, 66%, and full load).

Two-dimensional convolutional neural network

2D-CNNs are widely used for processing image-based data because they can detect spatial and hierarchical patterns in 2D inputs. 2D-CNNs offer a significant advantage in that they can effectively learn complex visual features from images. This study used 2D-CNNs to analyze visual representations derived from vibration signals. Specifically, transfer learning was used to train ResNet50 and VGG-16 models, as well as MobileNetV3 and EfficientNet-B0, with scalogram images, which offer time–frequency representations of the signals. In parallel, the LeNet-5 model was also trained on the data matrix image set to enable a comparative evaluation with a simpler architecture.

Transfer learning with ResNet50, VGG-16, MobileNetV3, and EfficientNet-B0

ResNet50 and VGG-16, which are among the most popular choices because of their high classification performance, were used for the classification of distributed pitting faults. ResNet50 is a deep CNN comprising 50 layers, initially proposed by Microsoft Research Asia in 2015.^61,62 The most significant distinction between ResNet50 and earlier networks is its use of residual connections, which mitigate the vanishing gradient problem and enable the training of very deep models.⁶³

The VGG-16 architecture is a deep CNN introduced by Simonyan and Zisserman.⁶⁴ The use of compact 3 × 3 filters in the convolution layers keeps the receptive field of each layer small while allowing more layers to be stacked into a deeper network. It demonstrated state-of-the-art performance in numerous image recognition benchmarks, including the ImageNet Large Scale Visual Recognition Challenge held in 2014.

In addition to these widely used architectures, MobileNetV3⁶⁵ and EfficientNet-B0⁶⁶ were also included in the comparative evaluation. These models were selected because they provide more computationally efficient alternatives while maintaining strong image classification capability, making them relevant for practical fault diagnosis settings.

The transfer learning method is used to reduce the dependency of the evaluated 2D-CNN models on large amounts of task-specific labeled data. This method uses knowledge from models that have been pre-trained on large datasets (e.g., ImageNet) to solve related problems,^67,68 which is particularly useful in industrial settings where labeled fault data are often limited.⁶⁹

LeNet-5 model

LeNet-5 represents a groundbreaking advance in the domain of CNNs. It was developed by Yann LeCun and his team in 1998.⁷⁰ The LeNet-5 network comprises seven layers, including two convolutional layers, two subsampling layers, and three fully connected layers.⁷¹ The standard LeNet-5 architecture includes an input layer with dimensions of 32 × 32 or 28 × 28. As the image set has dimensions of 26 × 26, certain layers of the network architecture were modified to accommodate this input size. The architecture of the modified LeNet-5 is shown in Figure 7.

Figure 7.

Modified LeNet-5 architecture.
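To see why the input size matters, the short sketch below traces the spatial size of the feature maps through the standard LeNet-5 sequence of 5 × 5 unpadded convolutions and 2 × 2 subsampling. It only illustrates the layer arithmetic of the classical network; the specific modifications adopted in this study are those shown in Figure 7 and are not reproduced here.

```python
def lenet5_feature_sizes(input_size: int, kernel: int = 5, pool: int = 2) -> list[int]:
    """Spatial size after conv -> pool -> conv -> pool with unpadded convolutions."""
    sizes = [input_size]
    for _ in range(2):
        sizes.append(sizes[-1] - kernel + 1)   # unpadded ("valid") convolution
        sizes.append(sizes[-1] // pool)        # non-overlapping subsampling
    return sizes


for size in (32, 28, 26):
    print(size, lenet5_feature_sizes(size))
# prints:
# 32 [32, 28, 14, 10, 5]
# 28 [28, 24, 12, 8, 4]
# 26 [26, 22, 11, 7, 3]
```

With a 26 × 26 input the second subsampling stage receives a 7 × 7 map, so the final feature maps shrink to 3 × 3 and the first fully connected layer no longer matches its classical input size, which is why some layers need to be adapted.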

Autoencoder

AEs, which were first introduced in the mid-1980s by Hinton et al.^72,73 within the Parallel Distributed Processing (PDP) framework, aim to compress high-dimensional input data into a low-dimensional representation, known as the latent space, while minimizing information loss.⁷⁴ The reduction in information loss results in minimization of distortion between the input and output data. Consequently, high-dimensional images can be represented using the latent space. The latent space representation learned by the AE can serve as a feature vector for subsequent classifier algorithms. This approach allows classifiers to be trained with low-dimensional representative values of high-dimensional images. In this article, both 1D and 2D-AE architectures were employed depending on the structure of the input data. While 1D-AEs are preferred for raw time-series signals, 2D-AEs are more suitable for visual representations such as images.

One-dimensional autoencoder

One-dimensional AEs are specifically designed to process sequential data, making them particularly suitable for vibration signals recorded over time. Unlike 2D-AEs, which operate on spatially structured image data, 1D-AEs preserve the temporal ordering of the input, enabling the model to learn patterns inherent in time-dependent sequences. In this study, horizontal and vertical vibration signals were concatenated and used as inputs to the 1D-AE model. The goal was to obtain compact latent representations that could be used to train classifiers more efficiently. The properties of 1D-AE are given in Table 2. 1D-AEs are particularly useful in applications where preserving the sequential structure of the data is critical, such as time-series analysis.^75,76 In this approach, the latent space is reduced to 50-dimensional data, and the aim is to train the classifier algorithms with these latent space values.

Table 2.

1D-AE architecture.

Layer	No. of outputs	Kern.	Str.	Act.	Pad.
Encoder
Input	668, 1	—	—	—	—
Conv1D	64	3	1	ReLU	SAME
Conv1D	64	3	1	ReLU	SAME
MaxPool1D	—	—	—	—	—
Conv1D	64	3	1	ReLU	SAME
Conv1D	64	3	1	ReLU	SAME
MaxPool1D	—	—	—	—	—
Flatten	—	—	—	—	—
Dense	50	—	—	—	—
Decoder
Input	50	—	—	—	—
Dense	167	—	—	ReLU	—
Reshape	167, 1	—	—	—	—
Conv1DTrans	64	3	1	ReLU	SAME
Conv1DTrans	64	3	2	ReLU	SAME
Conv1DTrans	64	3	1	ReLU	SAME
Conv1DTrans	64	3	2	ReLU	SAME
Conv1DTrans	1	3	1	Sigmoid	SAME
Output	668, 1	—	—	—	—

Note. 1D-AE: one-dimensional autoencoder. Kern. : Kernel size, Str. : Stride, Act. : Activation function, Pad. : Padding.
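The sequence lengths implied by Table 2 can be checked with simple arithmetic. The sketch below assumes a pooling size of 2 (the table does not list it, but this value is consistent with the 167-sample decoder input) and the usual SAME-padding rules, under which a strided convolution divides the length by the stride and a transposed convolution multiplies it.

```python
import math


def conv_same(length: int, stride: int) -> int:
    """Output length of a SAME-padded convolution."""
    return math.ceil(length / stride)


def conv_transpose_same(length: int, stride: int) -> int:
    """Output length of a SAME-padded transposed convolution."""
    return length * stride


def pool(length: int, size: int = 2) -> int:   # pooling size 2 is an assumption
    return length // size


# Encoder: Conv, Conv, MaxPool, Conv, Conv, MaxPool (all convolutions have stride 1).
length = 668
encoder_trace = [length]
for layer in ("conv", "conv", "pool", "conv", "conv", "pool"):
    length = pool(length) if layer == "pool" else conv_same(length, 1)
    encoder_trace.append(length)
flattened = length * 64          # 167 positions x 64 filters feed the 50-unit dense layer

# Decoder: five transposed convolutions with strides 1, 2, 1, 2, 1.
length = 167
for stride in (1, 2, 1, 2, 1):
    length = conv_transpose_same(length, stride)
decoder_output = length

print(encoder_trace, flattened, decoder_output)
# prints "[668, 668, 668, 334, 334, 334, 167] 10688 668"
```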

Two-dimensional autoencoder

2D-AEs are widely used for image-based data, where spatial structure and local dependencies carry essential information. In the context of this study, 2D-AEs were trained on scalogram and data matrix images derived from vibration signals. These models aim to compress high-dimensional visual representations into compact latent vectors while preserving important structural features relevant for fault detection. Due to the varying image sizes employed in the study, two different architectures are presented in Tables 3 and 4, each optimized for the specific type of input image.

Table 3.

2D-AE architecture for scalogram images.

Layer	No. of outputs	Kern.	Str.	Act.	Pad.
Encoder
Input	224, 224, 3	—	—	—	—
Conv	32	3	2	ReLU	SAME
Conv	64	3	2	ReLU	SAME
Dense	1024	—	—	—	—
Dense	512	—	—	—	—
Dense	128	—	—	—	—
Dense	50	—	—	—	—
Decoder
Input	50	—	—	—	—
Dense	28 × 28 × 64	—	—	—	—
Reshape	28, 28, 64	—	—	—	—
Trans conv	64	3	2	ReLU	SAME
Trans conv	32	3	2	ReLU	SAME
Trans conv	3	3	2	Sigmoid	SAME
Output	224, 224, 3	—	—	—	—

Note. 2D-AE: two-dimensional autoencoder. Kern. : Kernel size, Str. : Stride, Act. : Activation function, Pad. : Padding.

Table 4.

2D-AE architecture for data matrix images.

Layer	No. of outputs	Kern.	Str.	Act.	Pad.
Encoder
Input	26, 26, 1	—	—	—	—
Conv	128	3	1	ReLU	SAME
Conv	128	3	2	ReLU	SAME
Conv	128	3	2	ReLU	SAME
Flatten	—	—	—	—	—
Dense	1024	—	—	ReLU	—
Dense	512	—	—	ReLU	—
Dense	128	—	—	ReLU	—
Dense	50	—	—	—	—
Decoder
Input	50	—	—	—	—
Dense	13 × 13 × 128	—	—	ReLU	—
Reshape	13, 13, 128	—	—	—	—
Trans conv	128	3	1	ReLU	SAME
Trans conv	128	3	1	ReLU	SAME
Trans conv	128	3	2	ReLU	SAME
Trans conv	1	3	1	Sigmoid	SAME
Output	26, 26, 1	—	—	—	—

Note. 2D-AE: two-dimensional autoencoder. Kern. : Kernel size, Str. : Stride, Act. : Activation function, Pad. : Padding.
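As a consistency check on the layer sizes in Tables 3 and 4, the following minimal sketch traces the spatial size through the strided layers. It assumes the common convention that SAME padding yields an output size of ceil(n / stride) for a convolution and n × stride for a transposed convolution; the text does not name the framework, and the helper names are illustrative only.

```python
import math


def conv_same(size: int, stride: int) -> int:
    """Spatial size after a convolution with SAME padding (assumed convention)."""
    return math.ceil(size / stride)


def trans_conv_same(size: int, stride: int) -> int:
    """Spatial size after a transposed convolution with SAME padding (assumed convention)."""
    return size * stride


def trace(size, strides, op):
    """Return the list of sizes obtained by applying op once per stride."""
    sizes = [size]
    for stride in strides:
        sizes.append(op(sizes[-1], stride))
    return sizes


# Strides are read from the Str. column of Tables 3 and 4.
scalogram_decoder = trace(28, [2, 2, 2], trans_conv_same)
matrix_encoder = trace(26, [1, 2, 2], conv_same)
matrix_decoder = trace(13, [1, 1, 2, 1], trans_conv_same)

print("Table 3 decoder:", scalogram_decoder)
print("Table 4 encoder:", matrix_encoder)
print("Table 4 decoder:", matrix_decoder)
```

Under these assumptions, both decoders recover the input sizes of 224 and 26. The encoder and decoder need not mirror each other in spatial size, because the dense layers between them decouple the two halves.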

Variational autoencoders

VAEs are a subclass of AEs that encode inputs probabilistically. In contrast to a classical AE, which maps each input to a single latent vector, a VAE encodes the input as a probability distribution parameterized by a mean and a variance.⁷⁷ This enables the VAE to learn a distribution over the latent space and to capture the statistical structure of the data.⁷⁸ As with the classical AE, the latent representation obtained from the VAE can be used as a feature vector for the aforementioned classifier algorithms. The general VAE structure is shown as a block diagram in Figure 8.

Figure 8.

VAE architecture.⁷⁹ VAE: variational autoencoder.

In this study, VAE architectures were developed for both 1D and 2D input data, extending the corresponding AE structures by incorporating the reparameterization trick.

One-dimensional variational autoencoder

The 1D-VAE structure is derived from the classical 1D-AE model and is tailored for time-series data such as vibration signals. By modeling the latent space as a distribution rather than a point estimate, the 1D-VAE enables more robust and expressive feature extraction from sequential inputs. It adopts the same architecture as the 1D-AE presented in Table 2; the main difference is the use of the reparameterization trick to model the latent space probabilistically, so that the latent values carry distributional information, namely the mean and variance.
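The reparameterization trick mentioned above can be sketched in a few lines. This is an illustrative example rather than the implementation used in the study: it assumes that the encoder outputs a mean vector and a log-variance vector (a common parameterization that the text does not confirm), and the function name is hypothetical.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps element-wise, with eps ~ N(0, 1).

    Sampling eps separately keeps z a deterministic, differentiable
    function of mu and log_var, which is the point of the trick.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


rng = random.Random(0)
latent_dim = 50  # latent size reported in the text

# Standard-normal posterior: z is a noisy sample around zero.
z = reparameterize([0.0] * latent_dim, [0.0] * latent_dim, rng)

# Very small variance: the sample collapses onto the mean.
z_det = reparameterize([1.5] * latent_dim, [-1000.0] * latent_dim, rng)
```

When the latent vector is used as a classifier feature, one common choice is to use the mean vector directly instead of a random sample; the text does not state which variant was used.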

Two-dimensional variational autoencoder

For image-based data, the 2D-VAE architecture builds upon the previously introduced 2D-AE structures designed for scalogram and data matrix images. The same convolutional and dense layers are preserved, while the encoder is modified to output both the mean and variance vectors for the latent variables. The same approach was applied to both image sets: each VAE was created by adding the reparameterization trick to the corresponding AE architecture shown in Tables 3 and 4.

Following the formulation of the AE and VAE architectures, establishing a robust training framework is essential to ensure the reliability of the extracted features. To this end, hyperparameters were determined through targeted pilot studies rather than exhaustive grid searches. This approach efficiently addresses the structural complexity of both 1D and 2D data while preventing overfitting. The VAE architectures were intentionally designed as direct extensions of the AEs. This ensures a fair, controlled baseline for evaluating probabilistic regularization.

The latent dimensionality was set to $z = 50$ based on pilot experiments. In preliminary trials, smaller latent sizes were associated with higher reconstruction error and less stable downstream classification, whereas much larger latent sizes did not provide consistent gains and occasionally appeared to capture noise-like variation. Therefore, $z = 50$ was retained as a practical compromise between representational capacity and generalization. The VAE regularization weight was set to $\beta = 0.1$, which in pilot experiments provided a reasonable balance between reconstruction fidelity and latent-space regularization while maintaining stable downstream classification performance.
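The role of the regularization weight can be made concrete with a minimal sketch of a β-weighted VAE objective. The closed-form Kullback-Leibler term below is the standard expression for a diagonal Gaussian posterior against a standard normal prior; the mean squared error reconstruction term and the log-variance parameterization are assumptions, since the text does not specify them.

```python
import math


def kl_diag_gaussian(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))


def mse(x, x_hat):
    """Mean squared reconstruction error (assumed reconstruction loss)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def vae_loss(x, x_hat, mu, log_var, beta=0.1):
    """Reconstruction term plus beta times the KL regularizer."""
    return mse(x, x_hat) + beta * kl_diag_gaussian(mu, log_var)


# A posterior equal to the prior incurs no KL penalty.
kl_prior = kl_diag_gaussian([0.0] * 50, [0.0] * 50)

# Shifting one latent mean by 1 costs 0.5 nats, scaled by beta in the loss.
kl_shifted = kl_diag_gaussian([1.0], [0.0])
loss_example = vae_loss([0.2, 0.4], [0.2, 0.4], [1.0], [0.0], beta=0.1)
```

With β below 1, as in this study, the KL term is down-weighted relative to reconstruction, so the latent space is regularized less strongly than in the standard VAE objective.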

Training hyperparameters were also empirically established. The Adam optimizer, with a learning rate of $0.0005$, ensured stable convergence and smooth weight updates. A batch size of 32 provided a practical trade-off between gradient stability and memory efficiency. Lastly, training was capped at $30$ epochs, as validation loss stabilized early and further training risked memorization.
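For reference, the reported settings can be collected in a single configuration object. The sketch below only restates the values given in the text; the class name, field names, and the training-set size used in the example are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters as reported in the text (names are illustrative)."""
    latent_dim: int = 50
    beta: float = 0.1            # VAE regularization weight
    optimizer: str = "adam"
    learning_rate: float = 5e-4
    batch_size: int = 32
    max_epochs: int = 30


def max_weight_updates(cfg: TrainConfig, n_train: int) -> int:
    """Upper bound on optimizer steps if training runs for all epochs."""
    return cfg.max_epochs * math.ceil(n_train / cfg.batch_size)


cfg = TrainConfig()
# n_train=1000 is a placeholder, not the size of the study's training set.
updates = max_weight_updates(cfg, n_train=1000)
print(cfg, updates)
```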

Results and discussion

The performance of the methods used in this study is evaluated in this section. First, results obtained from approaches utilizing vibration signals as 1D time-series data are presented, including handcrafted feature extraction combined with classifiers, various CNN architectures, AEs, and VAEs. The performance of the methods applied to vibration signals transformed into images using two different techniques is then detailed. For the scalogram image dataset, the pre-trained ResNet50, VGG-16, MobileNetV3, and EfficientNet-B0 models are evaluated alongside AEs and VAEs. Finally, the performance of LeNet-5, AEs, and VAEs is evaluated on the 26 × 26 pixel image dataset.

1D-dataset: Vibration signals

This section addresses the performance of machine learning methods applied to vibration signal data obtained from accelerometers. First, the approach of extracting handcrafted statistical features and performing classification is presented. Subsequently, CNN-, AE-, and VAE-based representation learning methods are evaluated and compared.

Handcrafted feature extraction and classification

Although new methods for analyzing 1D signals continue to be developed, statistical signal features remain a common baseline. In this study, the 13 features described in the “Methodology” section were ranked by discriminative importance using a one-way analysis of variance (ANOVA) test, and classification was repeated with the 2, 4, 7, 12, and all 13 top-ranked features. The resulting accuracies were 49.75% for two features, 61.64% for four, 67.11% for seven, 66.87% for twelve, and 66.62% for all 13. The highest accuracy was therefore obtained with the seven most important features, namely RMS, Mean, Standard Deviation, Skewness, Peak, SNR, and SINAD; the corresponding confusion matrix is presented in Table 5. Including additional features beyond this point led to a slight decrease in performance. The standard deviation across the five runs was ± 0.00, which is consistent with the deterministic behavior of the SVM classifier under a fixed chronological data split.
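The feature extraction and ranking steps can be illustrated as follows. This is a minimal sketch, not the study's code: it computes five of the listed time-domain features (SNR and SINAD are omitted because their exact definitions are not given here), uses the population standard deviation as an assumption, and ranks features by the one-way ANOVA F statistic on toy values that are purely hypothetical.

```python
import math


def time_domain_features(x):
    """RMS, mean, standard deviation, skewness, and peak of one signal window."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    rms = math.sqrt(sum(v * v for v in x) / n)
    skew = (sum((v - mean) ** 3 for v in x) / n) / std ** 3 if std > 0 else 0.0
    peak = max(abs(v) for v in x)
    return {"rms": rms, "mean": mean, "std": std, "skewness": skew, "peak": peak}


def anova_f(groups):
    """One-way ANOVA F statistic; groups holds one list of feature values per class."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


features = time_domain_features([1.0, -1.0, 1.0, -1.0])

# Toy per-class values for two hypothetical features across three classes.
toy = {
    "feature_a": [[1, 2, 3], [2, 3, 4], [5, 6, 7]],   # class means differ
    "feature_b": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],   # class means identical
}
scores = {name: anova_f(groups) for name, groups in toy.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
```

A larger F statistic indicates that the class means of a feature differ more relative to the within-class spread, which is the sense in which a feature is considered more discriminative.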

Table 5.

Confusion matrix with seven optimal handcrafted features for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	60.45 ± 0.00	11.19 ± 0.00	19.15 ± 0.00	8.96 ± 0.00	0.25 ± 0.00
F1	21.89 ± 0.00	32.84 ± 0.00	41.29 ± 0.00	3.98 ± 0.00	0.00 ± 0.00
F2	14.43 ± 0.00	20.15 ± 0.00	65.42 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
F3	7.46 ± 0.00	4.73 ± 0.00	0.25 ± 0.00	81.34 ± 0.00	6.22 ± 0.00
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	4.48 ± 0.00	95.52 ± 0.00

Note. T\P stands for True\Predicted. H, F1, F2, F3, and F4 represent Healthy, Fault 1, Fault 2, Fault 3, and Fault 4, respectively. It should be noted that these abbreviations apply to all such subsequent tables.

For reference, the handcrafted feature extraction and SVM pipeline was also evaluated computationally. The total training time was 0.2226 s, and the end-to-end inference time was 0.5561 ms per sample, including feature extraction and SVM prediction. The peak memory usage during inference was 1.0917 MB of RAM. As shown in Table 5, the handcrafted feature-based approach still struggles to distinguish between some degradation states, with Fault 1 being most frequently confused with Fault 2 and Healthy. This suggests that manually extracted statistical features alone are not sufficient to capture the more complex degradation patterns in the vibration signals.
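The relation between the class-wise values in Table 5 and the reported overall accuracy can be checked with a short calculation. The sketch assumes that all five classes contain the same number of test samples, in which case the overall accuracy equals the mean of the diagonal of the row-normalized confusion matrix; the helper name is illustrative.

```python
# Mean values from Table 5 (rows: true class, columns: predicted class, in %).
labels = ["H", "F1", "F2", "F3", "F4"]
table5 = [
    [60.45, 11.19, 19.15, 8.96, 0.25],
    [21.89, 32.84, 41.29, 3.98, 0.00],
    [14.43, 20.15, 65.42, 0.00, 0.00],
    [7.46, 4.73, 0.25, 81.34, 6.22],
    [0.00, 0.00, 0.00, 4.48, 95.52],
]

# Per-class recall is the diagonal entry of each row.
recalls = [row[i] for i, row in enumerate(table5)]
# With equally sized classes (assumption), overall accuracy is their mean.
overall = sum(recalls) / len(recalls)


def top_confusion(matrix, names, true_idx):
    """Return the most frequent wrong prediction for one true class."""
    row = matrix[true_idx]
    j = max((c for c in range(len(row)) if c != true_idx), key=lambda c: row[c])
    return names[j], row[j]


print(f"overall accuracy: {overall:.2f}%")
print("Fault 1 is most often predicted as:", top_confusion(table5, labels, 1))
```

The result agrees with the 67.11% reported for the seven-feature subset, and the dominant error for Fault 1 is its assignment to Fault 2.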

One-dimensional convolutional neural network

1D-CNN models with varying architectures were constructed by modifying the number of convolutional layers, the number of filters, the pooling size, and the kernel size, and the impact of these parameters on fault detection performance was examined. A total of twelve 1D-CNN models were applied to vibration data obtained under different load conditions (no load, 33%, 66%, and full load), and their performance was evaluated over five predefined random seeds. Table 6 lists the parameter values and mean accuracies of the models. Adding convolutional layers did not improve accuracy (Models 1 to 3), so a single convolutional layer provided the best compromise between accuracy and computational efficiency. Varying the kernel size (Models 1, 4, and 5) showed that small kernels were generally sufficient, whereas larger pooling sizes (Models 6 to 9) tended to reduce performance because they caused greater feature loss. Finally, varying the number of filters (Models 1 and 10 to 12) showed that 16 filters provided a favorable compromise between accuracy and computational cost, achieving robust performance with lower computational demand than wider models. These observations indicate that tuning these parameters is important for maintaining high accuracy while minimizing complexity in fault detection tasks. Accordingly, Model 1 was selected as the final configuration because it provided the most favorable overall trade-off between accuracy and computational efficiency. The confusion matrix of this model for the no-load condition is given in Table 7; the model achieved an overall accuracy of 99.40%.
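The effect of pooling size on the retained feature length can be illustrated with simple arithmetic. The sketch below assumes an input window of 668 samples (inferred from the 668 × 1 output of the 1D-AE decoder), a stride-1 convolution with SAME padding that preserves the length, and non-overlapping pooling whose stride equals the pool size. None of these details is confirmed by the text, and the function name is hypothetical.

```python
def pooled_length(n_samples: int, pool_size: int) -> int:
    """Length after non-overlapping pooling (stride equal to pool size, no padding)."""
    return n_samples // pool_size


WINDOW = 668   # assumed input length
FILTERS = 16   # filter count of Models 1 and 6 to 9 in Table 6

flattened = {}
for pool_size in (2, 3, 4, 5, 10):
    length = pooled_length(WINDOW, pool_size)
    flattened[pool_size] = length * FILTERS
    print(f"PS={pool_size:>2}: length {length:>3}, "
          f"flattened features {flattened[pool_size]}")
```

Under these assumptions, a pooling size of 10 leaves roughly one fifth of the features retained with a pooling size of 2, which is in line with the accuracy drop observed for the largest pooling size.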

Table 6.

Accuracies of 1D-CNN models under varying load conditions and architectural parameters.

Model	Load ratio (%)				Network structure
Model	0	33	66	100	CL	KS	PS	F
Model 1	99.40 ± 0.48	97.85 ± 1.51	98.85 ± 1.05	99.37 ± 0.59	1	3	2	16
Model 2	97.76 ± 0.40	96.27 ± 0.99	96.97 ± 1.24	82.33 ± 31.17	2	3	2	16
Model 3	96.83 ± 0.77	95.89 ± 0.75	96.88 ± 0.47	93.33 ± 4.30	3	3	2	16
Model 4	99.32 ± 0.46	97.99 ± 0.38	98.69 ± 0.37	99.19 ± 0.09	1	5	2	16
Model 5	99.30 ± 0.24	98.09 ± 0.32	98.84 ± 0.21	99.29 ± 0.18	1	7	2	16
Model 6	98.47 ± 0.37	96.99 ± 0.76	97.01 ± 1.57	98.28 ± 0.57	1	3	3	16
Model 7	97.05 ± 1.23	93.43 ± 2.03	96.03 ± 1.52	96.73 ± 0.49	1	3	4	16
Model 8	94.65 ± 3.49	92.31 ± 1.35	89.88 ± 9.89	93.00 ± 5.07	1	3	5	16
Model 9	76.20 ± 4.97	71.33 ± 2.53	73.56 ± 4.37	73.03 ± 7.46	1	3	10	16
Model 10	99.27 ± 0.28	97.94 ± 0.48	98.93 ± 0.17	99.12 ± 0.44	1	3	2	32
Model 11	98.20 ± 2.42	82.65 ± 31.32	98.67 ± 0.60	99.13 ± 0.67	1	3	2	64
Model 12	99.19 ± 0.50	82.29 ± 31.15	98.34 ± 0.66	90.78 ± 16.83	1	3	2	128

Note. 1D-CNN: one-dimensional convolutional neural network; CL: number of convolutional layers; KS: kernel size; PS: pooling size; F: number of filters. The highest mean accuracy for each load ratio is obtained by Model 1 at 0% (99.40), Model 5 at 33% (98.09), Model 10 at 66% (98.93), and Model 1 at 100% (99.37).
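The mean ± standard deviation entries summarize the accuracies of the five seeded runs. The following sketch shows the aggregation on hypothetical per-seed accuracies, which are not taken from the study; whether the reported values use the population or the sample standard deviation is not stated, so the population form used here is an assumption.

```python
import statistics


def summarize(accuracies):
    """Return (mean, population standard deviation) over seeded runs."""
    return statistics.mean(accuracies), statistics.pstdev(accuracies)


# Hypothetical per-seed accuracies (in %), for illustration only.
stable_runs = [99.0, 99.5, 98.8, 99.6, 99.1]
one_failed_run = [99.0, 99.5, 98.8, 99.6, 20.0]

stable_mean, stable_std = summarize(stable_runs)
failed_mean, failed_std = summarize(one_failed_run)

print(f"stable:     {stable_mean:.2f} ± {stable_std:.2f}")
print(f"one failed: {failed_mean:.2f} ± {failed_std:.2f}")
```

A single run that fails to converge is enough to lower the mean by many points and inflate the spread to about 30 points, which is the kind of pattern seen in a few entries of Table 6. Per-seed values are not reported, so this is an illustration and not a diagnosis.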

Table 7.

Confusion matrix of 1D-CNN for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.50 ± 0.35	0.00 ± 0.00	0.15 ± 0.22	0.35 ± 0.42	0.00 ± 0.00
F1	0.20 ± 0.21	98.46 ± 0.44	1.19 ± 0.62	0.15 ± 0.14	0.00 ± 0.00
F2	0.25 ± 0.18	0.50 ± 0.63	99.20 ± 0.89	0.05 ± 0.11	0.00 ± 0.00
F3	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00	0.00 ± 0.00
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.15 ± 0.22	99.85 ± 0.22

Note. 1D-CNN: one-dimensional convolutional neural network.

One-dimensional autoencoder

The performance of the 1D-AE approach applied to the analysis of 1D vibration signals is presented in Table 8. The overall accuracy was 96.25%. The main performance loss appears to arise from confusion between Fault 1 and Fault 2, together with the relatively lower and more variable recall of Fault 3.

Table 8.

Confusion matrix of 1D-AE for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.05 ± 1.71	0.05 ± 0.11	0.75 ± 1.67	0.15 ± 0.22	0.00 ± 0.00
F1	0.20 ± 0.45	95.47 ± 5.22	3.78 ± 4.00	0.55 ± 0.83	0.00 ± 0.00
F2	0.75 ± 1.25	2.39 ± 1.37	95.77 ± 4.75	1.10 ± 2.17	0.00 ± 0.00
F3	0.15 ± 0.14	0.00 ± 0.00	2.74 ± 5.98	94.13 ± 10.74	2.99 ± 4.74
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	3.18 ± 5.87	96.82 ± 5.87

Note. 1D-AE: one-dimensional autoencoder.

One-dimensional variational autoencoder

Table 9 shows the performance of the 1D-VAE method, which adds a probabilistic latent representation to the 1D-AE. The 1D-VAE outperformed the 1D-AE, with an overall accuracy of 98.96% compared with 96.25%. Although the detection accuracy for the Fault 1 and Fault 2 classes remained lower than for the other classes, the gap was smaller than with the 1D-AE method.

Table 9.

Confusion matrix of 1D-VAE for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.65 ± 0.14	0.00 ± 0.00	0.20 ± 0.31	0.15 ± 0.14	0.00 ± 0.00
F1	0.05 ± 0.11	98.06 ± 1.05	1.39 ± 0.77	0.50 ± 0.53	0.00 ± 0.00
F2	0.20 ± 0.21	2.14 ± 1.06	97.66 ± 1.10	0.00 ± 0.00	0.00 ± 0.00
F3	0.20 ± 0.21	0.25 ± 0.31	0.10 ± 0.14	99.45 ± 0.58	0.00 ± 0.00
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00

Note. 1D-VAE: one-dimensional variational autoencoder.

2D-dataset: 224 × 224 scalogram image

This section discusses the performance of machine learning methods using scalogram images obtained with the continuous wavelet transform applied to the vibration signal data. The methods described in the methodology section are presented sequentially.

ResNet50

The ResNet50 model was employed to classify images of dimensions 224 × 224 using the transfer learning method. Evaluated over five predefined random seeds, the trained model exhibited the following mean accuracy rates and standard deviations for the five classes: “Healthy” (98.16 ± 1.00%), “Fault 1” (88.06 ± 11.03%), “Fault 2” (97.21 ± 2.51%), “Fault 3” (93.08 ± 3.25%), and “Fault 4” (97.01 ± 1.29%). In addition to the individual class accuracy rates mentioned above, the overall system performance was calculated, resulting in an accuracy rate of approximately 94.71%. The confusion matrix generated by the performance of the ResNet50 model under the 0% load condition is presented in Table 10.

Table 10.

Confusion matrix of ResNet50 model for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	98.16 ± 1.00	0.95 ± 0.24	0.20 ± 0.40	0.65 ± 0.54	0.05 ± 0.10
F1	1.39 ± 0.98	88.06 ± 11.03	7.71 ± 9.43	2.84 ± 1.05	0.00 ± 0.00
F2	0.45 ± 0.33	2.34 ± 2.22	97.21 ± 2.51	0.00 ± 0.00	0.00 ± 0.00
F3	1.34 ± 1.09	4.53 ± 1.79	0.30 ± 0.60	93.08 ± 3.25	0.75 ± 0.52
F4	0.25 ± 0.27	0.05 ± 0.10	0.00 ± 0.00	2.69 ± 1.09	97.01 ± 1.29

VGG-16

The VGG-16 model was evaluated under the 0% load condition. Across five random seeds, the model achieved mean class accuracies of 98.16 ± 2.57% for “Healthy,” 94.88 ± 1.90% for “Fault 1,” 98.36 ± 1.46% for “Fault 2,” 93.63 ± 8.29% for “Fault 3,” and 97.51 ± 4.00% for “Fault 4.” The overall average accuracy was 96.51%. The corresponding confusion matrix is presented in Table 11.

Table 11.

Confusion matrix of VGG-16 model for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	98.16 ± 2.57	0.55 ± 0.74	0.10 ± 0.12	1.09 ± 1.59	0.10 ± 0.20
F1	1.94 ± 1.45	94.88 ± 1.90	1.89 ± 1.50	1.00 ± 0.90	0.30 ± 0.29
F2	0.05 ± 0.10	1.29 ± 1.39	98.36 ± 1.46	0.25 ± 0.31	0.05 ± 0.10
F3	4.48 ± 8.46	0.85 ± 0.37	0.15 ± 0.12	93.63 ± 8.29	0.90 ± 0.84
F4	1.69 ± 3.38	0.10 ± 0.12	0.05 ± 0.10	0.65 ± 0.54	97.51 ± 4.00

MobileNetV3

The MobileNetV3 model was evaluated under the 0% load condition. Across five random seeds, the model achieved mean class accuracies of 97.46 ± 0.69% for “Healthy,” 79.90 ± 9.73% for “Fault 1,” 97.01 ± 1.79% for “Fault 2,” 95.62 ± 4.49% for “Fault 3,” and 95.52 ± 2.38% for “Fault 4.” The overall average accuracy was 93.10%. The corresponding confusion matrix is presented in Table 12.

Table 12.

Confusion matrix of MobileNetV3 model for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	97.46 ± 0.69	0.60 ± 0.68	0.30 ± 0.29	1.64 ± 0.66	0.00 ± 0.00
F1	4.98 ± 2.19	79.90 ± 9.73	7.11 ± 3.89	8.01 ± 4.00	0.00 ± 0.00
F2	0.75 ± 0.42	1.79 ± 1.47	97.01 ± 1.79	0.45 ± 0.19	0.00 ± 0.00
F3	1.64 ± 1.07	1.94 ± 3.39	0.25 ± 0.31	95.62 ± 4.49	0.55 ± 0.37
F4	0.20 ± 0.19	0.10 ± 0.12	0.00 ± 0.00	4.18 ± 2.47	95.52 ± 2.38

EfficientNet-B0

The EfficientNet-B0 model was evaluated under the 0% load condition. Across five random seeds, the model achieved mean class accuracies of 98.36 ± 0.34% for “Healthy,” 91.94 ± 1.80% for “Fault 1,” 99.00 ± 0.59% for “Fault 2,” 96.22 ± 1.37% for “Fault 3,” and 99.20 ± 0.74% for “Fault 4.” The overall average accuracy was 96.95%. The corresponding confusion matrix is presented in Table 13.

Table 13.

Confusion matrix of EfficientNet-B0 model for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	98.36 ± 0.34	0.40 ± 0.30	0.35 ± 0.43	0.85 ± 0.25	0.05 ± 0.10
F1	0.50 ± 0.27	91.94 ± 1.80	4.03 ± 1.78	3.48 ± 1.45	0.05 ± 0.10
F2	0.10 ± 0.12	0.85 ± 0.40	99.00 ± 0.59	0.05 ± 0.10	0.00 ± 0.00
F3	0.80 ± 0.10	1.04 ± 0.62	0.30 ± 0.19	96.22 ± 1.37	1.64 ± 0.98
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.80 ± 0.74	99.20 ± 0.74

Two-dimensional autoencoder

For the 2D-AE approach based on scalogram images, the AE was trained on the training subset and then used to extract latent features from the training, validation, and test data. These latent representations were subsequently used for classification, and among the evaluated classifiers, the SVM yielded the best results, with the validation set used for hyperparameter tuning. Across five random seeds, the AE–SVM pipeline achieved mean class accuracies of 99.75 ± 0.27% for “Healthy,” 96.72 ± 0.19% for “Fault 1,” 98.31 ± 0.24% for “Fault 2,” 98.06 ± 0.43% for “Fault 3,” and 98.51 ± 0.31% for “Fault 4.” The overall average accuracy was 98.27%, and the corresponding confusion matrix under the 0% load condition is presented in Table 14.

Table 14.

Confusion matrix of 2D-AE model on scalogram images for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.75 ± 0.27	0.15 ± 0.20	0.00 ± 0.00	0.10 ± 0.12	0.00 ± 0.00
F1	0.50 ± 0.22	96.72 ± 0.19	1.74 ± 0.44	1.04 ± 0.29	0.00 ± 0.00
F2	0.35 ± 0.20	1.34 ± 0.20	98.31 ± 0.24	0.00 ± 0.00	0.00 ± 0.00
F3	0.05 ± 0.10	1.44 ± 0.37	0.00 ± 0.00	98.06 ± 0.43	0.45 ± 0.29
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	1.49 ± 0.31	98.51 ± 0.31

Note. 2D-AE: two-dimensional autoencoder.

Two-dimensional variational autoencoder

For the 2D-VAE approach applied to 224 × 224 scalogram images, the VAE was trained on the training subset and then used to extract latent representations ($z_{train}$, $z_{val}$, and $z_{test}$) from all data splits. The downstream SVM classifier was trained on $z_{train}$, while $z_{val}$ was used for hyperparameter tuning.

When evaluated on the unseen $z_{test}$ set, the VAE–SVM pipeline achieved mean class accuracies of 99.80 ± 0.24% for “Healthy,” 96.27 ± 0.47% for “Fault 1,” 98.46 ± 0.24% for “Fault 2,” 98.61 ± 0.64% for “Fault 3,” and 99.35 ± 0.46% for “Fault 4.” The overall average accuracy was 98.50%. The corresponding confusion matrix under the 0% load condition is provided in Table 15.

Table 15.

Confusion matrix of VAE model on scalogram images for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.80 ± 0.24	0.00 ± 0.00	0.05 ± 0.10	0.15 ± 0.20	0.00 ± 0.00
F1	0.10 ± 0.12	96.27 ± 0.47	2.09 ± 0.46	1.54 ± 0.40	0.00 ± 0.00
F2	0.35 ± 0.12	1.19 ± 0.29	98.46 ± 0.24	0.00 ± 0.00	0.00 ± 0.00
F3	0.05 ± 0.10	1.00 ± 0.35	0.00 ± 0.00	98.61 ± 0.64	0.35 ± 0.34
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.65 ± 0.46	99.35 ± 0.46

Note. VAE: variational autoencoder.
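The evaluation protocol used for the latent features (fit on the training split, tune on the validation split, report on the test split) can be sketched as follows. To keep the example self-contained, two substitutions are made and should be read as such: synthetic Gaussian clusters stand in for the latent vectors, and a k-nearest-neighbor classifier with k as the tuned hyperparameter stands in for the SVM used in the study. All names are hypothetical.

```python
import random


def make_split(rng, n_per_class, centers, noise=0.3):
    """Synthetic stand-in for latent vectors: Gaussian clusters around class centers."""
    return [([c + rng.gauss(0.0, noise) for c in center], label)
            for label, center in enumerate(centers)
            for _ in range(n_per_class)]


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x."""
    nearest = sorted(
        train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], x))
    )[:k]
    votes = {}
    for _, label in nearest:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


def accuracy(train, split, k):
    """Fraction of samples in split that are classified correctly."""
    return sum(knn_predict(train, x, k) == y for x, y in split) / len(split)


rng = random.Random(0)
centers = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
z_train = make_split(rng, 30, centers)
z_val = make_split(rng, 10, centers)
z_test = make_split(rng, 10, centers)

# The hyperparameter is chosen on the validation split only.
best_k = max([1, 3, 5], key=lambda k: accuracy(z_train, z_val, k))
# The test split is touched once, after the hyperparameter is fixed.
test_acc = accuracy(z_train, z_test, best_k)
print(best_k, test_acc)
```

Keeping the test split out of both feature-extractor training and hyperparameter selection is what allows the reported test accuracy to be read as an estimate of performance on unseen data.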

2D-dataset: 26 × 26 Data matrix

This section discusses the performance of machine learning methods using a 26 × 26 image set generated from two-axis vibration signal data. First, the results of LeNet-5, a well-established deep learning model, are presented, followed by those of the AE and VAE approaches.

LeNet-5

The custom LeNet-5 model, operating on 26 × 26 input images, was evaluated under the 0% load condition. Across five random seeds, the model achieved mean class accuracies of 97.96 ± 1.21% for “Healthy,” 94.78 ± 4.73% for “Fault 1,” 96.87 ± 1.83% for “Fault 2,” 98.51 ± 0.97% for “Fault 3,” and 98.11 ± 1.28% for “Fault 4.” The overall average accuracy was 97.24%. The corresponding confusion matrix for the Custom LeNet-5 model is presented in Table 16. Despite the reduced input size, the model maintained competitive classification performance.

Table 16.

Confusion matrix of Custom LeNet-5 model for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	97.96 ± 1.21	0.40 ± 0.56	0.70 ± 0.62	0.95 ± 0.62	0.00 ± 0.00
F1	1.84 ± 3.56	94.78 ± 4.73	3.13 ± 1.36	0.25 ± 0.39	0.00 ± 0.00
F2	0.65 ± 0.83	1.44 ± 1.43	96.87 ± 1.83	1.04 ± 1.37	0.00 ± 0.00
F3	0.35 ± 0.30	0.05 ± 0.10	0.05 ± 0.10	98.51 ± 0.97	1.04 ± 1.15
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	1.89 ± 1.28	98.11 ± 1.28

Two-dimensional autoencoder

The 2D-AE approach was also applied to the 26 × 26 data matrix images. The AE was trained on the training subset and used to construct latent representations ($z_{train}$, $z_{val}$, and $z_{test}$) from the input matrices. When these latent representations were used as features, the SVM classifier achieved the best performance, with the validation subset used for hyperparameter tuning. Across five random seeds, the AE–SVM model achieved mean class accuracies of 99.10 ± 0.58% for “Healthy,” 98.41 ± 0.40% for “Fault 1,” 98.86 ± 0.68% for “Fault 2,” 99.10 ± 0.64% for “Fault 3,” and 99.65 ± 0.20% for “Fault 4.” The overall average accuracy was 99.02%, as shown in Table 17.

Table 17.

Confusion matrix of 2D-AE model on data matrix images for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.10 ± 0.58	0.10 ± 0.12	0.75 ± 0.44	0.05 ± 0.10	0.00 ± 0.00
F1	0.10 ± 0.20	98.41 ± 0.40	1.49 ± 0.27	0.00 ± 0.00	0.00 ± 0.00
F2	0.15 ± 0.12	0.95 ± 0.58	98.86 ± 0.68	0.05 ± 0.10	0.00 ± 0.00
F3	0.00 ± 0.00	0.35 ± 0.43	0.15 ± 0.12	99.10 ± 0.64	0.40 ± 0.25
F4	0.00 ± 0.00	0.05 ± 0.10	0.00 ± 0.00	0.30 ± 0.19	99.65 ± 0.20

Note. 2D-AE: two-dimensional autoencoder.

Two-dimensional variational autoencoder

In the 2D-VAE approach, the same architecture as the corresponding AE was used for the 26 × 26 images. The VAE was trained on the training subset, and the resulting latent representations ($z_{train}$, $z_{val}$, and $z_{test}$) were used for downstream classification. Unlike the standard AE, the VAE models the latent space probabilistically through the reparameterization trick. Among the evaluated classifiers, the SVM model achieved the best performance, with the validation subset used for hyperparameter tuning. Across five random seeds, the VAE–SVM pipeline achieved mean class accuracies of 99.95 ± 0.10% for “Healthy,” 99.25 ± 0.35% for “Fault 1,” 99.65 ± 0.12% for “Fault 2,” 99.70 ± 0.19% for “Fault 3,” and 99.75 ± 0.16% for “Fault 4.” The overall average accuracy was 99.66%. These results are presented in Table 18.

Table 18.

Confusion matrix of VAE model on data matrix images for load condition 0%.

T∖P	H	F1	F2	F3	F4
H	99.95 ± 0.10	0.05 ± 0.10	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
F1	0.00 ± 0.00	99.25 ± 0.35	0.75 ± 0.35	0.00 ± 0.00	0.00 ± 0.00
F2	0.00 ± 0.00	0.35 ± 0.12	99.65 ± 0.12	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	99.70 ± 0.19	0.30 ± 0.19
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.25 ± 0.16	99.75 ± 0.16

Note. VAE: variational autoencoder.

A comparative analysis of different input representations

The performance of the methods for three distinct input representations—1D time-series data, 224 × 224 scalogram images, and 26 × 26 data matrices—is summarized in Table 19 for the no-load (0%) condition. The results indicate that deep learning-based automatic feature extraction substantially outperformed traditional feature extraction and shallow learning (FESL) methods, which achieved an accuracy of 67.11 ± 0.00%. This finding highlights the difficulty of manually extracting discriminative features from complex vibration signals. Among the 1D input models, the 1D-CNN achieved 99.40 ± 0.48% accuracy. Although scalogram images are frequently used in the literature to capture time–frequency characteristics, the large pre-trained networks evaluated in this study did not exceed 97% accuracy. In contrast, the 2D-VAE model applied to the 26 × 26 data matrices achieved the highest baseline accuracy of 99.66 ± 0.20%, indicating that effective latent representations can still be learned from substantially reduced input dimensions.

Table 19.

Overall performance comparison of the evaluated models under 0% load condition.

Input data format	Model architecture	Accuracy (%)
1D time-series data	FESL	67.11 ± 0.00
	1D-CNN	99.40 ± 0.48
	1D-AE	96.25 ± 5.04
	1D-VAE	98.96 ± 0.73
224 × 224 Scalogram images	ResNet50	94.71 ± 2.79
	VGG-16	96.51 ± 2.59
	MobileNetV3	93.10 ± 1.28
	EfficientNet-B0	96.95 ± 0.44
	2D-AE	98.27 ± 0.22
	2D-VAE	98.50 ± 0.37
26 × 26 Data matrix images	Custom LeNet-5	97.24 ± 1.14
	2D-AE	99.02 ± 0.53
	2D-VAE	99.66 ± 0.20

Note. 1D: one-dimensional; 2D: two-dimensional; FESL: feature extraction and shallow learning; 1D-CNN: one-dimensional convolutional neural network; 1D-AE: one-dimensional autoencoder; 1D-VAE: one-dimensional variational autoencoder; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Methods that demonstrated a high degree of success were also applied to data collected under varying load conditions (0%, 33%, 66%, and 100%), as detailed in Tables 20 to 23, respectively. Table 24 presents the average accuracies aggregated across all load conditions. The VAE architecture achieved the highest overall accuracy ($99.46 \pm 0.18\%$), followed by the AE ($99.13 \pm 0.35\%$) and the 1D-CNN ($98.87 \pm 0.52\%$). Because the leading methods all produced very high and closely clustered accuracies, it is difficult to distinguish their practical diagnostic limits under standard evaluation settings. For this reason, the following section presents a separate analysis of the three best-performing methods (VAE, AE, and 1D-CNN) to further examine their behavior under more challenging and realistic conditions.

Table 20.

Performance under 0% load condition.

Class	1D-CNN	2D-AE	2D-VAE
Healthy	99.50 ± 0.35	99.10 ± 0.58	99.95 ± 0.10
Fault 1	98.46 ± 0.44	98.41 ± 0.40	99.25 ± 0.35
Fault 2	99.20 ± 0.89	98.86 ± 0.68	99.65 ± 0.12
Fault 3	100.00 ± 0.00	99.10 ± 0.64	99.70 ± 0.19
Fault 4	99.85 ± 0.22	99.65 ± 0.20	99.75 ± 0.16
Total acc.	99.40 ± 0.48	99.02 ± 0.53	99.66 ± 0.20

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Table 21.

Performance under 33% load condition.

Class	1D-CNN	2D-AE	2D-VAE
Healthy	98.26 ± 0.84	99.85 ± 0.14	99.80 ± 0.21
Fault 1	97.66 ± 1.20	99.15 ± 0.33	99.70 ± 0.32
Fault 2	99.25 ± 0.43	99.25 ± 0.00	99.75 ± 0.00
Fault 3	95.62 ± 2.64	98.31 ± 0.62	98.11 ± 0.97
Fault 4	98.46 ± 1.47	98.76 ± 0.56	99.20 ± 0.41
Total acc.	97.85 ± 1.51	99.06 ± 0.41	99.31 ± 0.50

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Table 22.

Performance under 66% load condition.

Class	1D-CNN	2D-AE	2D-VAE
Healthy	99.00 ± 0.93	100.00 ± 0.00	100.00 ± 0.00
Fault 1	97.36 ± 1.90	99.55 ± 0.11	98.71 ± 0.21
Fault 2	99.30 ± 0.62	99.15 ± 0.28	98.76 ± 0.50
Fault 3	98.91 ± 0.67	99.75 ± 0.30	99.55 ± 0.32
Fault 4	99.70 ± 0.41	99.25 ± 0.39	98.91 ± 0.45
Total acc.	98.85 ± 1.05	99.54 ± 0.26	99.19 ± 0.35

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Table 23.

Performance under 100% load condition.

Class	1D-CNN	2D-AE	2D-VAE
Healthy	99.60 ± 0.52	98.96 ± 0.48	99.75 ± 0.25
Fault 1	98.76 ± 0.58	98.26 ± 0.47	98.91 ± 0.38
Fault 2	98.51 ± 1.07	98.76 ± 0.25	99.70 ± 0.21
Fault 3	100.00 ± 0.00	98.96 ± 0.64	100.00 ± 0.00
Fault 4	100.00 ± 0.00	99.60 ± 0.38	100.00 ± 0.00
Total acc.	99.37 ± 0.59	98.91 ± 0.44	99.67 ± 0.22

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Table 24.

Average accuracy from no-load to full-load condition.

Class	1D-CNN	2D-AE	2D-VAE
Healthy	99.09 ± 0.38	99.48 ± 0.30	99.88 ± 0.10
Fault 1	98.06 ± 0.61	98.84 ± 0.30	99.14 ± 0.17
Fault 2	99.07 ± 0.40	99.01 ± 0.30	99.47 ± 0.12
Fault 3	98.63 ± 0.71	99.03 ± 0.47	99.34 ± 0.30
Fault 4	99.50 ± 0.39	99.32 ± 0.38	99.47 ± 0.15
Total acc.	98.87 ± 0.52	99.13 ± 0.35	99.46 ± 0.18

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.
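The aggregated totals in Table 24 can be reproduced from the per-load totals in Tables 20 to 23 by simple averaging, as the short check below shows. Only the means are recomputed; how the accompanying ± values in Table 24 were derived is not stated, so they are not reproduced here.

```python
# Total accuracies (%) from Tables 20 to 23, ordered by load: 0, 33, 66, 100.
loads = [0, 33, 66, 100]
totals = {
    "1D-CNN": [99.40, 97.85, 98.85, 99.37],
    "2D-AE": [99.02, 99.06, 99.54, 98.91],
    "2D-VAE": [99.66, 99.31, 99.19, 99.67],
}

# Average over the four load conditions.
averages = {name: sum(values) / len(values) for name, values in totals.items()}
ranking = sorted(averages, key=averages.get, reverse=True)

# Best method at each individual load condition.
winners = {load: max(totals, key=lambda name: totals[name][i])
           for i, load in enumerate(loads)}

for name in ranking:
    print(f"{name}: {averages[name]:.2f}%")
print(winners)
```

The 2D-VAE has the highest total accuracy at 0%, 33%, and 100% load, while the 2D-AE is slightly ahead at 66% load, which is why the load-averaged ranking places the 2D-VAE first.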

Load-wise confusion matrix analysis of the three best-performing methods

Based on the comparative results in Table 19, the three best-performing methods, each with an overall accuracy above 99% (the 1D-CNN, and the 2D-AE and 2D-VAE applied to the 26 × 26 data matrices), were further examined through their confusion matrices under different load conditions. This analysis provides a clearer view of how class-wise error patterns change with load.

For the 1D-CNN model, the confusion matrices under 33%, 66%, and 100% load conditions are presented in Tables 25 to 27. These results show that the model maintained strong performance across all evaluated loads, although class separation weakened at intermediate loads, most visibly at 33% load, where Fault 3 was misclassified as Fault 4 in 3.38 ± 2.64% of cases.

Table 25.

Confusion matrix of 1D-CNN for load condition 33%.

T∖P	H	F1	F2	F3	F4
H	98.26 ± 0.84	0.40 ± 0.38	1.00 ± 0.50	0.10 ± 0.14	0.25 ± 0.43
F1	0.95 ± 0.57	97.66 ± 1.20	0.15 ± 0.22	1.09 ± 0.57	0.15 ± 0.14
F2	0.20 ± 0.32	0.05 ± 0.11	99.25 ± 0.43	0.25 ± 0.00	0.25 ± 0.18
F3	0.40 ± 0.14	0.30 ± 0.11	0.30 ± 0.11	95.62 ± 2.64	3.38 ± 2.64
F4	0.00 ± 0.00	0.05 ± 0.11	0.00 ± 0.00	1.49 ± 1.40	98.46 ± 1.47

Note. 1D-CNN: one-dimensional convolutional neural network.

Table 26.

Confusion matrix of 1D-CNN for load condition 66%.

T∖P	H	F1	F2	F3	F4
H	99.00 ± 0.93	0.50 ± 0.50	0.25 ± 0.00	0.25 ± 0.56	0.00 ± 0.00
F1	0.60 ± 0.80	97.36 ± 1.90	1.19 ± 0.69	0.85 ± 0.67	0.00 ± 0.00
F2	0.20 ± 0.21	0.40 ± 0.57	99.30 ± 0.62	0.00 ± 0.00	0.10 ± 0.22
F3	0.10 ± 0.14	0.05 ± 0.11	0.05 ± 0.11	98.91 ± 0.67	0.90 ± 0.76
F4	0.00 ± 0.00	0.00 ± 0.00	0.10 ± 0.14	0.20 ± 0.32	99.70 ± 0.41

Note. 1D-CNN: one-dimensional convolutional neural network.

Table 27.

Confusion matrix of 1D-CNN for load condition 100%.

T∖P	H	F1	F2	F3	F4
H	99.60 ± 0.52	0.00 ± 0.00	0.05 ± 0.11	0.35 ± 0.42	0.00 ± 0.00
F1	0.20 ± 0.21	98.76 ± 0.58	0.95 ± 0.51	0.10 ± 0.14	0.00 ± 0.00
F2	0.30 ± 0.27	1.14 ± 1.02	98.51 ± 1.07	0.05 ± 0.11	0.00 ± 0.00
F3	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00	0.00 ± 0.00
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00

Note. 1D-CNN: one-dimensional convolutional neural network.

For the 2D-AE model based on 26 × 26 data matrices, the corresponding confusion matrices under 33%, 66%, and 100% load conditions are reported in Tables 28 to 30. The results indicate that the model preserved high classification performance across load levels. At 33% load, the lowest class accuracy was 98.31 ± 0.62% for Fault 3. At 66% load, the Healthy class reached 100.00 ± 0.00%, and all fault classes exceeded 99% accuracy. At 100% load, the model remained close to its no-load performance, with the largest misclassification observed between Fault 1 and Fault 2 (1.59 ± 0.57%). Overall, these results suggest that the 2D-AE approach provides stable feature representations for the compressed data-matrix inputs across the evaluated load conditions.

Table 28.

Confusion matrix of 2D-AE for load condition 33%.

T∖P	H	F1	F2	F3	F4
H	99.85 ± 0.14	0.00 ± 0.00	0.15 ± 0.14	0.00 ± 0.00	0.00 ± 0.00
F1	0.15 ± 0.22	99.15 ± 0.33	0.25 ± 0.30	0.45 ± 0.32	0.00 ± 0.00
F2	0.40 ± 0.14	0.30 ± 0.21	99.25 ± 0.00	0.05 ± 0.11	0.00 ± 0.00
F3	0.00 ± 0.00	1.19 ± 0.41	0.00 ± 0.00	98.31 ± 0.62	0.50 ± 0.47
F4	0.00 ± 0.00	0.00 ± 0.00	0.25 ± 0.00	1.00 ± 0.56	98.76 ± 0.56

Note. 2D-AE: two-dimensional autoencoder.

Table 29.

Confusion matrix of 2D-AE for load condition 66%.

T∖P	H	F1	F2	F3	F4
H	100.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
F1	0.20 ± 0.11	99.55 ± 0.11	0.25 ± 0.18	0.00 ± 0.00	0.00 ± 0.00
F2	0.20 ± 0.11	0.65 ± 0.28	99.15 ± 0.28	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	99.75 ± 0.30	0.25 ± 0.30
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.75 ± 0.39	99.25 ± 0.39

Note. 2D-AE: two-dimensional autoencoder.

Table 30.

Confusion matrix of 2D-AE for load condition 100%.

T∖P	H	F1	F2	F3	F4
H	98.96 ± 0.48	0.15 ± 0.14	0.85 ± 0.52	0.05 ± 0.11	0.00 ± 0.00
F1	0.05 ± 0.11	98.26 ± 0.47	1.59 ± 0.57	0.10 ± 0.22	0.00 ± 0.00
F2	0.20 ± 0.11	1.04 ± 0.21	98.76 ± 0.25	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.30 ± 0.32	0.15 ± 0.14	98.96 ± 0.64	0.60 ± 0.28
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.40 ± 0.38	99.60 ± 0.38

Note. 2D-AE: two-dimensional autoencoder.

For the 2D-VAE model, the confusion matrices under 33%, 66%, and 100% load conditions are shown in Tables 31 to 33. The results indicate that the 2D-VAE maintained highly stable classification performance across all evaluated load conditions. At 33% load, the lowest class accuracy was 98.11 ± 0.97% for Fault 3, whereas at 66% load the Healthy class reached 100.00 ± 0.00%, and at 100% load both Fault 3 and Fault 4 were classified with 100.00 ± 0.00% accuracy. These results are consistent with the overall trends in Table 24 and suggest that the probabilistic latent representation remained robust under varying load levels.

Table 31.

Confusion matrix of 2D-VAE for load condition 33%.

T∖P	H	F1	F2	F3	F4
H	99.80 ± 0.21	0.00 ± 0.00	0.00 ± 0.00	0.20 ± 0.21	0.00 ± 0.00
F1	0.00 ± 0.00	99.70 ± 0.32	0.00 ± 0.00	0.30 ± 0.32	0.00 ± 0.00
F2	0.05 ± 0.11	0.20 ± 0.11	99.75 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.85 ± 0.57	0.00 ± 0.00	98.11 ± 0.97	1.04 ± 0.57
F4	0.00 ± 0.00	0.00 ± 0.00	0.05 ± 0.11	0.75 ± 0.39	99.20 ± 0.41

Note. 2D-VAE: two-dimensional variational autoencoder.

Table 32.

Confusion matrix of 2D-VAE for load condition 66%.

T∖P	H	F1	F2	F3	F4
H	100.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00
F1	0.20 ± 0.11	98.71 ± 0.21	1.09 ± 0.22	0.00 ± 0.00	0.00 ± 0.00
F2	0.10 ± 0.14	1.14 ± 0.45	98.76 ± 0.50	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.25 ± 0.18	0.00 ± 0.00	99.55 ± 0.32	0.20 ± 0.21
F4	0.00 ± 0.00	0.20 ± 0.32	0.00 ± 0.00	0.90 ± 0.22	98.91 ± 0.45

Note. 2D-VAE: two-dimensional variational autoencoder.

Table 33.

Confusion matrix of 2D-VAE for load condition 100%.

T∖P	H	F1	F2	F3	F4
H	99.75 ± 0.25	0.00 ± 0.00	0.25 ± 0.25	0.00 ± 0.00	0.00 ± 0.00
F1	0.00 ± 0.00	98.91 ± 0.38	1.09 ± 0.38	0.00 ± 0.00	0.00 ± 0.00
F2	0.00 ± 0.00	0.30 ± 0.21	99.70 ± 0.21	0.00 ± 0.00	0.00 ± 0.00
F3	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00	0.00 ± 0.00
F4	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	0.00 ± 0.00	100.00 ± 0.00

Note. 2D-VAE: two-dimensional variational autoencoder.

Evaluation of model generalization to unseen fault classes

Creating a comprehensive dataset that covers every possible failure mode of components such as gearboxes across industrial applications is usually infeasible, yet identifying these failures remains crucial. As a result, a diagnostic model may encounter fault patterns that are not represented in the training data. To investigate this challenge, one fault class was excluded from the training set, and the behavior of the trained models was then examined on the unseen samples. This analysis provided an additional evaluation of generalization to unseen fault classes under incomplete fault coverage.

To evaluate this generalization capacity, the models that demonstrated high performance in the closed-set classification experiments were selected: the 1D-CNN, which processes 1D time-series data, and the 2D-AE and 2D-VAE models, which operate on 26 × 26 data matrices. In each experiment, one fault type was completely excluded from the training process and used only as the unseen test class. The remaining known classes were chronologically divided into 80% training and 20% validation subsets to avoid temporal data leakage and preserve a realistic evaluation setting.

The validation subset served different roles depending on the model. For the 1D-CNN, it was used to monitor training convergence and reduce overfitting. For the 2D-AE and 2D-VAE frameworks, it was used for hyperparameter tuning of the downstream SVM classifiers. After training and model selection on the known classes, the entire excluded fault class was presented to the trained models as the test set. The resulting predictive distributions were then examined, with particular attention to the rate of false “Healthy” predictions.
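The data partitioning described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used in the study: the function name `leave_one_class_out`, the `(time_index, label, segment)` record layout, and the choice to apply the chronological 80/20 cut separately within each known class are all hypothetical.

```python
from collections import defaultdict


def leave_one_class_out(records, held_out, train_frac=0.8):
    """Split (time_index, label, segment) records for one unseen-class run.

    The held-out class is reserved entirely for testing. Each remaining class
    is sorted by time and cut once, so its validation segments always follow
    its training segments (no shuffling, hence no temporal leakage).
    """
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[1]].append(rec)

    train, val, test = [], [], []
    for label, recs in by_label.items():
        recs = sorted(recs, key=lambda r: r[0])  # chronological order
        if label == held_out:
            test.extend(recs)
            continue
        cut = int(len(recs) * train_frac)
        train.extend(recs[:cut])
        val.extend(recs[cut:])
    return train, val, test


# Toy data: 10 placeholder segments per class, indexed by acquisition order.
labels = ["H", "F1", "F2", "F3", "F4"]
records = [(t, lab, None) for lab in labels for t in range(10)]
train, val, test = leave_one_class_out(records, held_out="F3")
```

Repeating the call with each fault label as `held_out` yields the four sub-experiments per load condition; the held-out class never appears in the training or validation subsets.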

Generalization analysis of the 1D-CNN model

Tables 34 to 37 present the results obtained using the 1D-CNN method for the evaluation of generalization to unseen fault classes, organized by increasing load conditions. For each sub-experiment, one fault type was excluded from the training set, and then the classifier was trained with the remaining fault types. Subsequently, the classifier was tested with the excluded fault type to observe how the unseen samples were assigned to the known classes, including “Healthy.”

Table 34.

Predictions of the 1D-CNN model under no-load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	48.69 ± 11.81	4.16 ± 1.36	0.01 ± 0.03	47.14 ± 11.33
F2	61.17 ± 3.75	—	1.47 ± 1.86	0.22 ± 0.22	37.14 ± 3.06
F3	3.13 ± 0.83	1.75 ± 0.99	—	46.70 ± 3.40	48.42 ± 3.64
F4	0.00 ± 0.00	0.46 ± 0.32	98.98 ± 0.50	—	0.55 ± 0.22

Note. 1D-CNN: one-dimensional convolutional neural network; TC: test classes.

Table 35.

Predictions of the 1D-CNN model under 33% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	15.47 ± 5.10	66.91 ± 5.79	3.34 ± 3.29	14.29 ± 1.73
F2	11.40 ± 4.42	—	26.73 ± 4.38	14.23 ± 7.51	47.64 ± 6.37
F3	7.67 ± 2.53	2.90 ± 0.60	—	86.46 ± 4.14	2.96 ± 1.18
F4	0.34 ± 0.14	1.23 ± 0.56	91.67 ± 2.38	—	6.76 ± 2.10

Note. 1D-CNN: one-dimensional convolutional neural network; TC: test classes.

Table 36.

Predictions of the 1D-CNN model under 66% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	54.11 ± 9.32	10.73 ± 5.51	0.49 ± 0.24	34.67 ± 7.26
F2	34.09 ± 5.49	—	6.10 ± 2.63	12.77 ± 4.70	47.03 ± 4.83
F3	6.15 ± 3.37	1.23 ± 0.65	—	82.21 ± 6.14	10.41 ± 2.77
F4	0.81 ± 0.54	8.59 ± 3.51	89.87 ± 3.89	—	0.73 ± 0.59

Note. 1D-CNN: one-dimensional convolutional neural network; TC: test classes.

Table 37.

Predictions of the 1D-CNN model under full-load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	49.21 ± 5.27	4.98 ± 3.49	0.00 ± 0.00	45.80 ± 3.94
F2	63.62 ± 6.87	—	0.84 ± 0.47	0.04 ± 0.06	35.50 ± 6.84
F3	3.29 ± 2.20	0.85 ± 0.68	—	43.77 ± 8.62	52.09 ± 7.26
F4	0.00 ± 0.00	0.21 ± 0.23	99.36 ± 0.44	—	0.43 ± 0.23

Note. 1D-CNN: one-dimensional convolutional neural network; TC: test classes.

Under the no-load (0%) condition (Table 34), the classifier showed substantial uncertainty when an unseen fault class was introduced. In particular, when Fault 1 was excluded from training, it was misclassified as “Healthy” at a rate of 47.14 ± 11.33%. A similar pattern was observed for unseen Fault 3, which was mapped to “Healthy” at a rate of 48.42 ± 3.64%. In contrast, unseen Fault 4 was assigned predominantly to Fault 3 (98.98 ± 0.50%), while the false “Healthy” rate remained low (0.55 ± 0.22%).

At the 33% and 66% load conditions (Tables 35 and 36), the assignment pattern of the unseen classes changed with the load level. For example, at 33% load, unseen Fault 2 was classified as “Healthy” at a rate of 47.64 ± 6.37%, whereas unseen Fault 3 was mapped mainly to Fault 4 (86.46 ± 4.14%).

Under full-load conditions (Table 37), the model again tended to place unseen intermediate faults near the “Healthy” boundary: unseen Fault 3 was classified as “Healthy” 52.09 ± 7.26% of the time. By contrast, unseen Fault 4 continued to be mapped almost entirely to Fault 3 (99.36 ± 0.44%) rather than to “Healthy.”

Overall, the 1D-CNN was able to group severe unseen faults into nearby faulty categories, but its tendency to assign early-stage or intermediate unseen faults to “Healthy” remains a clear limitation for fail-safe industrial deployment.

Generalization analysis of the 2D-AE model

The generalization behavior of the 2D-AE model, which uses 26 × 26 data matrices, is summarized across different load conditions in Tables 38 to 41. Similar to the 1D-CNN, the 2D-AE maps unseen samples into the decision regions formed by the known classes. Under the 0% and 100% load conditions, the model showed broadly similar assignment patterns. Severe unseen faults were generally mapped to nearby faulty classes rather than to “Healthy.” For example, at 100% load, unseen Fault 4 was assigned predominantly to Fault 3 (99.60 ± 0.74%). In addition, neighboring early-stage faults tended to map to one another, as reflected by unseen Fault 2 being classified as Fault 1 at a rate of 83.78 ± 6.23% at 100% load. However, false “Healthy” predictions remained an important limitation. Most notably, unseen Fault 3 was classified as “Healthy” at rates of 45.62 ± 1.77% at 0% load and 45.30 ± 2.11% at 100% load.

Table 38.

Predictions of the 2D-AE model under no-load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	65.58 ± 5.26	4.92 ± 1.85	0.00 ± 0.00	29.50 ± 5.88
F2	83.44 ± 6.06	—	0.84 ± 0.76	0.00 ± 0.00	15.72 ± 5.41
F3	0.40 ± 0.50	0.91 ± 0.30	—	53.06 ± 1.71	45.62 ± 1.77
F4	0.00 ± 0.00	0.00 ± 0.00	99.60 ± 0.74	—	0.40 ± 0.74

Note. 2D-AE: two-dimensional autoencoder; TC: test classes.

Table 39.

Predictions of the 2D-AE model under 33% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	31.79 ± 8.03	15.57 ± 31.07	3.23 ± 1.69	49.41 ± 21.46
F2	2.20 ± 0.56	—	0.18 ± 0.04	54.81 ± 2.93	42.81 ± 3.12
F3	1.66 ± 1.39	32.45 ± 1.41	—	57.43 ± 1.35	8.47 ± 2.12
F4	0.03 ± 0.04	55.00 ± 2.14	40.63 ± 1.59	—	4.34 ± 1.61

Note. 2D-AE: two-dimensional autoencoder; TC: test classes.

Table 40.

Predictions of the 2D-AE model under 66% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	64.65 ± 32.09	11.91 ± 23.37	0.39 ± 0.53	23.05 ± 29.74
F2	32.03 ± 6.00	—	6.91 ± 7.03	4.29 ± 1.80	56.77 ± 1.67
F3	0.61 ± 0.20	15.20 ± 11.85	—	72.13 ± 2.85	12.06 ± 9.22
F4	0.03 ± 0.04	1.90 ± 1.17	93.66 ± 2.67	—	4.41 ± 3.58

Note. 2D-AE: two-dimensional autoencoder; TC: test classes.

Table 41.

Predictions of the 2D-AE model under full-load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	66.27 ± 5.24	4.67 ± 1.39	0.00 ± 0.00	29.07 ± 5.61
F2	83.78 ± 6.23	—	0.88 ± 0.74	0.00 ± 0.00	15.33 ± 5.60
F3	0.58 ± 0.75	0.93 ± 0.27	—	53.19 ± 1.84	45.30 ± 2.11
F4	0.00 ± 0.00	0.00 ± 0.00	99.60 ± 0.74	—	0.40 ± 0.74

Note. 2D-AE: two-dimensional autoencoder; TC: test classes.

The model showed less stable behavior under transitional load conditions. At 33% load (Table 39), unseen Fault 1 was classified as “Healthy” 49.41 ± 21.46% of the time, with relatively large variation across runs. At 66% load (Table 40), unseen Fault 2 was mapped predominantly to “Healthy” (56.77 ± 1.67%). These results indicate that, although the 2D-AE performed well in the closed-set experiments, its deterministic latent representation was less effective at keeping unseen fault samples separated from the “Healthy” region, particularly under intermediate load conditions.

Generalization analysis of the 2D-VAE model

Tables 42 to 45 present the results of the 2D-VAE model in the evaluation of generalization to unseen fault classes, arranged by increasing load conditions. As with the other models, one fault type was excluded from the training set at each stage, and the classifier was trained on the remaining fault types. The excluded fault type then served as the test set to evaluate how unseen samples were mapped to the known classes. The 2D-VAE differs from the deterministic models in that it learns a probabilistic latent representation, which leads to a more distributed assignment pattern for unseen samples.

Table 42.

Predictions of the 2D-VAE model under no-load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	39.40 ± 15.78	13.72 ± 8.75	5.58 ± 7.85	41.30 ± 16.99
F2	40.37 ± 17.79	—	28.57 ± 14.03	4.52 ± 5.84	26.54 ± 4.07
F3	18.80 ± 8.02	32.03 ± 10.95	—	36.26 ± 9.88	12.91 ± 3.96
F4	3.08 ± 2.56	2.42 ± 2.82	92.98 ± 6.35	—	1.51 ± 1.41

Note. 2D-VAE: two-dimensional variational autoencoder; TC: test classes.

Table 43.

Predictions of the 2D-VAE model under 33% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	17.34 ± 8.62	53.87 ± 22.42	12.57 ± 8.34	16.23 ± 9.96
F2	48.69 ± 7.26	—	18.26 ± 11.63	14.36 ± 6.91	18.68 ± 10.76
F3	36.59 ± 3.72	10.82 ± 3.09	—	42.41 ± 6.63	10.19 ± 3.45
F4	6.49 ± 4.85	10.04 ± 1.83	68.71 ± 10.91	—	14.76 ± 7.20

Note. 2D-VAE: two-dimensional variational autoencoder; TC: test classes.

Table 44.

Predictions of the 2D-VAE model under 66% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	25.09 ± 7.39	41.63 ± 18.40	7.52 ± 8.14	25.76 ± 13.43
F2	25.52 ± 17.21	—	32.39 ± 12.68	15.83 ± 3.18	26.27 ± 12.41
F3	28.81 ± 20.70	10.14 ± 3.30	—	47.91 ± 16.41	13.13 ± 7.88
F4	10.64 ± 5.73	14.50 ± 7.39	68.69 ± 9.97	—	6.18 ± 4.96

Note. 2D-VAE: two-dimensional variational autoencoder; TC: test classes.

Table 45.

Predictions of the 2D-VAE model under 100% load condition in the unseen fault class evaluation.

TC	F1	F2	F3	F4	H
F1	—	38.12 ± 15.33	15.99 ± 9.04	5.89 ± 8.02	40.00 ± 8.01
F2	32.89 ± 12.40	—	38.67 ± 2.77	4.29 ± 3.88	24.14 ± 10.05
F3	20.63 ± 7.56	22.24 ± 10.80	—	43.77 ± 11.70	13.36 ± 2.42
F4	6.58 ± 7.14	5.88 ± 6.69	86.21 ± 13.97	—	1.33 ± 1.47

Note. 2D-VAE: two-dimensional variational autoencoder; TC: test classes.

Under the no-load (0%) condition (Table 42), Fault 1 did not collapse into a single known class; instead, it was mainly assigned to Fault 2 (39.40 ± 15.78%) and “Healthy” (41.30 ± 16.99%). Although the false “Healthy” rate remained non-negligible, the predictions were more distributed than those of the deterministic models. In contrast, when Fault 4 was excluded from training, the model avoided the “Healthy” boundary and mapped it predominantly to Fault 3 (92.98 ± 6.35%), with a false “Healthy” rate of only 1.51 ± 1.41%.

At the 33% and 66% load conditions (Tables 43 and 44), the 2D-VAE continued to suppress false “Healthy” assignments more effectively than the deterministic models. For example, at 33% load, unseen Fault 1 was classified as “Healthy” only 16.23 ± 9.96% of the time and was instead mapped mainly to Fault 3 (53.87 ± 22.42%). This suggests that the probabilistic latent representation tends to keep unseen fault patterns within the broader damaged region rather than allowing them to cross into the “Healthy” class.

Under full-load conditions, the overall behavior remained similar to that observed at no load. At the 100% load condition (Table 45), unseen Fault 1 was classified as “Healthy” at a rate of 40.00 ± 8.01%, whereas unseen Fault 4 remained well separated from the “Healthy” class, with a false “Healthy” rate of only 1.33 ± 1.47%.

Overall, although false “Healthy” predictions were not completely eliminated for unseen early-stage faults, the 2D-VAE showed the most stable behavior among the evaluated models. In particular, severe unseen faults remained within the damaged decision region rather than being misclassified as normal, which is a more desirable outcome for fail-safe industrial diagnosis.

Table 46 compares the rates at which the 1D-CNN, 2D-AE, and 2D-VAE models assigned unseen fault samples to a known fault class rather than to the “Healthy” class under each load condition. The 2D-VAE achieved the highest mean rate at every load level. The deterministic models varied more across load levels (for example, the 1D-CNN fell to 66.69% under no-load conditions and the 2D-AE to 73.74% at 33% load), whereas the 2D-VAE remained at approximately 80% or above in all test conditions. On average, the 2D-VAE achieved an overall rate of 81.73 ± 8.65%, compared with 76.09 ± 6.28% for the 2D-AE and 73.03 ± 5.00% for the 1D-CNN. This pattern suggests that the probabilistic latent representation learned by the 2D-VAE may help keep unseen fault samples within damaged decision regions more consistently than the deterministic alternatives. Consequently, although none of the evaluated models completely eliminated false “Healthy” predictions for unseen faults, the 2D-VAE showed the most reliable overall behavior in this forced-classification setting, which makes it the preferable choice among the evaluated methods for applications in which previously unseen fault conditions may be encountered.

Table 46.

Assignment of unseen fault samples to known fault classes using 1D-CNN, 2D-AE, and 2D-VAE models under varying load conditions.

Load	1D-CNN	2D-AE	2D-VAE
0%	66.69 ± 6.14	77.19 ± 3.45	79.44 ± 8.98
33%	82.09 ± 3.51	73.74 ± 7.08	85.04 ± 8.35
66%	76.79 ± 4.58	75.93 ± 11.05	82.17 ± 10.26
100%	66.55 ± 5.36	77.48 ± 3.52	80.29 ± 6.58
Average	73.03 ± 5.00	76.09 ± 6.28	81.73 ± 8.65

Note. 1D-CNN: one-dimensional convolutional neural network; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.
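The mean values in Table 46 appear consistent with taking 100 minus the average false “Healthy” rate over the four held-out fault classes. The short check below illustrates this reading for the no-load row, using the mean “H” column values of Tables 34, 38, and 42. The function name is hypothetical, and because the text does not state how the ± spreads in Table 46 were aggregated, only the means are considered.

```python
def fault_assignment_rate(false_healthy_rates):
    """Percentage of unseen-fault samples assigned to any known fault class.

    Computed as 100 minus the mean of the per-class false-"Healthy" rates.
    """
    mean_healthy = sum(false_healthy_rates) / len(false_healthy_rates)
    return 100.0 - mean_healthy


# Mean "H" column values (unseen F1..F4) copied from Tables 34, 38, and 42.
no_load_healthy = {
    "1D-CNN": [47.14, 37.14, 48.42, 0.55],
    "2D-AE": [29.50, 15.72, 45.62, 0.40],
    "2D-VAE": [41.30, 26.54, 12.91, 1.51],
}

no_load_rates = {
    model: fault_assignment_rate(rates) for model, rates in no_load_healthy.items()
}
for model, rate in no_load_rates.items():
    print(f"{model}: {rate:.2f}%")
```

The resulting values agree with the 0% row of Table 46 to within rounding, which makes the link between the per-class tables and the summary table explicit.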

Computational analysis and deployment considerations

To examine the practical deployment characteristics of the evaluated methods for continuous condition monitoring, a computational analysis was performed. Table 47 summarizes the model complexity, memory requirements, and timing metrics of all evaluated methods. All benchmarks were conducted on a workstation equipped with an NVIDIA GeForce RTX 5070 12 GB GPU, an Intel Core i7-14700F CPU, and 32 GB of RAM, using the PyTorch framework⁸⁰ with CUDA 12.8 acceleration.

Table 47.

Computational complexity and hardware performance metrics of the evaluated models.

Dataset	Model	Parameters	GFLOPs	Memory footprint (peak MB)	Training time (s/epoch)	Inference time (ms/sample)	Throughput (samples/s)
One-dimensional	1D-CNN	535,069	0.0011	28.99 ± 0.44	0.17 ± 0.00	0.02 ± 0.00	≈ 50,000.0
	1D-AE	617,784	0.07	48.58 ± 0.21	0.48 ± 0.10	0.10 ± 0.07	9951.1
	1D-VAE	1,152,234	0.07	56.37 ± 0.37	0.42 ± 0.05	0.06 ± 0.01	17,357.7
Data matrix 26 × 26	LeNet-5	24,585	0.0004	19.93 ± 0.00	0.49 ± 0.09	0.11 ± 0.02	8880.0
	2D-AE	8,864,051	0.38	153.32 ± 0.53	0.76 ± 0.02	0.13 ± 0.01	7474.9
	2D-VAE	8,870,501	0.38	153.04 ± 0.36	1.17 ± 0.17	0.19 ± 0.07	5345.0
Scalogram 224 × 224	MobileNetV3	4,208,437	0.47	561.30 ± 0.00	9.95 ± 0.06	3.52 ± 0.23	284.3
	EfficientNet-B0	4,013,953	0.83	754.85 ± 0.57	12.05 ± 0.21	3.42 ± 0.07	292.3
	ResNet50	23,518,277	8.26	1061.52 ± 1.71	16.25 ± 0.23	3.92 ± 0.16	255.2
	VGG-16	134,289,477	31.04	5244.79 ± 0.89	101.27 ± 31.68	4.68 ± 0.04	213.6
	2D-VAE	208,759,911	1.33	3232.92 ± 0.19	12.01 ± 0.26	3.30 ± 0.15	303.0
	2D-AE	208,753,461	1.33	3247.66 ± 0.74	11.58 ± 0.10	3.47 ± 0.20	288.4

Note. 1D-CNN: one-dimensional convolutional neural network; 1D-AE: one-dimensional autoencoder; 1D-VAE: one-dimensional variational autoencoder; 2D-AE: two-dimensional autoencoder; 2D-VAE: two-dimensional variational autoencoder.

Real-time implementation and edge deployment feasibility

Condition monitoring for industrial gearboxes benefits from timely processing of vibration data to support early fault detection and reduce the risk of severe mechanical damage.⁸¹ The computational analysis highlights clear differences in deployment feasibility across the evaluated models. Deep 2D architectures such as VGG-16 and the scalogram-based 2D-VAE model exhibit relatively large memory requirements and higher parameter counts, which may limit their suitability for resource-constrained edge platforms.

By contrast, lightweight models such as the 1D-CNN offer very fast inference, although their generalization behavior for unseen fault classes was less consistent than that of the best-performing latent-space models.

Among the evaluated methods, the 2D-VAE model utilizing 26 × 26 data matrices provides a favorable balance between diagnostic performance and computational cost. With an inference time of approximately 0.19 ms per instance, the model can process about $5.3 \times 10^{3}$ instances per second. Because each inference instance corresponds to one rotational segment containing 334 samples per accelerometer channel, this throughput corresponds to an approximate processing capacity of $1.79 \times 10^{6}$ raw samples/s per channel under the present segmentation scheme, which is substantially above the 15-kHz acquisition rate used in this study. These results suggest that real-time inference is achievable under the present acquisition and segmentation setting, while still leaving computational margin for signal acquisition, windowing, and preprocessing. In addition, its memory requirement of approximately 153 MB and computational complexity of 0.38 GFLOPs (giga floating-point operations) suggest that the model is more deployment-friendly than the heavier 2D alternatives, although verification on specific edge hardware remains necessary.
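The throughput argument above reduces to a short calculation, reproduced here as a sketch. The throughput is the Table 47 value for the 26 × 26 2D-VAE; the function and constant names are illustrative, and the calculation assumes non-overlapping segments, so that each raw sample is processed exactly once.

```python
INFERENCE_THROUGHPUT = 5345.0  # instances/s, 26 x 26 2D-VAE (Table 47)
SAMPLES_PER_INSTANCE = 334     # raw samples per channel in one segment
SAMPLING_RATE_HZ = 15_000      # acquisition rate per channel


def real_time_margin(throughput, samples_per_instance, sampling_rate_hz):
    """Return (raw samples/s the model can absorb, ratio to the acquisition rate)."""
    capacity = throughput * samples_per_instance
    return capacity, capacity / sampling_rate_hz


capacity, margin = real_time_margin(
    INFERENCE_THROUGHPUT, SAMPLES_PER_INSTANCE, SAMPLING_RATE_HZ
)
print(f"capacity = {capacity:.3g} samples/s per channel, margin = {margin:.0f}x")
```

Under these assumptions the processing capacity exceeds the acquisition rate by roughly two orders of magnitude on the benchmark workstation; the margin on an edge device would need to be measured separately.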

Model compression and batch processing strategies

Although the proposed 26 × 26 data matrix-based 2D-VAE architecture is already more compact than the larger 2D models, additional optimization strategies may further improve its suitability for edge deployment. In particular, post-training quantization can reduce memory usage and inference cost, making the model easier to deploy on devices with tighter power, memory, or thermal constraints. Similar gains may also be achieved through pruning or deployment-oriented runtime optimization, depending on the target hardware.
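To make the idea of post-training quantization concrete, the sketch below applies symmetric per-tensor int8 quantization to a small list of weights. It is a generic, framework-independent illustration; it was not evaluated in this study, the weight values are made up, and the function names are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of float weights to int8 codes."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clip to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


weights = [0.42, -1.27, 0.003, 0.9, -0.55]  # made-up example values
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error of at most half the scale per weight. Whether the resulting accuracy loss is acceptable for the 2D-VAE would have to be verified on the target hardware.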

For streaming vibration data, batch processing is not always necessary during standard online operation, where low-latency single-window inference is often preferred. However, small-batch execution may still be beneficial when multiple channels or parallel sensing nodes are processed simultaneously. In addition, overlapping sliding-window analysis can be used to support continuous monitoring while preserving temporal sensitivity to fault-related transients. In this setting, a buffer-based implementation can help maintain stable throughput and reduce the risk of data loss during periods of increased acquisition rate.
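A buffer-based sliding-window scheme of the kind described above might look like the following sketch. The class name and interface are hypothetical; the default window length of 334 samples follows the segment length used in this study, while the 50% overlap (hop of 167 samples) is an assumption chosen only for illustration.

```python
from collections import deque
from itertools import islice


class SlidingWindowBuffer:
    """Collect streamed samples and emit overlapping fixed-length windows."""

    def __init__(self, window=334, hop=167):
        if not 0 < hop <= window:
            raise ValueError("hop must be in (0, window]")
        self.window = window
        self.hop = hop
        self._buf = deque()

    def push(self, samples):
        """Append new samples and return every complete window now available."""
        self._buf.extend(samples)
        windows = []
        while len(self._buf) >= self.window:
            windows.append(list(islice(self._buf, self.window)))
            for _ in range(self.hop):  # advance by the hop, keep the overlap
                self._buf.popleft()
        return windows


# Small demonstration with a 4-sample window and a 2-sample hop.
buf = SlidingWindowBuffer(window=4, hop=2)
first = buf.push(range(7))
second = buf.push([7])
```

Because incomplete data stay in the buffer until enough samples arrive, chunks of irregular size from the acquisition layer can be pushed as they come, and each returned window can be passed to the model individually or grouped into a small batch.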

Overall, these considerations suggest that the 2D-VAE model is a practical candidate for real-time condition monitoring on edge-oriented industrial platforms, while also allowing further optimization through hardware-aware compression and deployment strategies.

Generalization considerations and limitations

Although the proposed VAE–SVM pipeline achieved strong diagnostic performance in the present experiments, its broader industrial generalization should be interpreted within the scope of the present validation setting. The current validation was conducted on a specific two-stage helical gearbox operating at a constant speed of 2678 rpm. Because vibration responses are influenced by machine configuration, gear and bearing characteristics, operating speed, load, and structural dynamics, the reported performance may vary when the method is transferred to different systems or to variable-speed operating conditions. In this respect, the representation-learning component may transfer more readily than the decision boundaries learned for the present gearbox, which are expected to remain more system-dependent.

In such cross-system scenarios, relying solely on a fully supervised classifier may be difficult, since newly deployed machinery rarely has sufficient labeled fault data covering all relevant operating states. In our preliminary study,³⁷ we established an unsupervised anomaly-detection baseline using VAE reconstruction-error thresholds learned from healthy data only. That earlier setting was designed to separate healthy and faulty behavior at a binary level. In contrast, the objective of the present study is more demanding, as it targets multi-class classification of fault severity levels.
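A reconstruction-error threshold learned from healthy data only can be illustrated with the sketch below. The rule used here (mean plus k standard deviations of the healthy errors, with k = 3) is an assumption chosen for illustration and is not necessarily the rule used in the preliminary study; the error values are made up, and the function names are hypothetical.

```python
from statistics import mean, stdev


def fit_threshold(healthy_errors, k=3.0):
    """Threshold = mean + k * standard deviation of healthy reconstruction errors."""
    return mean(healthy_errors) + k * stdev(healthy_errors)


def is_anomalous(error, threshold):
    """Flag a segment whose reconstruction error exceeds the healthy threshold."""
    return error > threshold


# Made-up reconstruction errors of healthy segments (arbitrary units).
healthy_errors = [0.101, 0.098, 0.105, 0.097, 0.102, 0.099]
threshold = fit_threshold(healthy_errors)

typical_flag = is_anomalous(0.100, threshold)  # error similar to the baseline
large_flag = is_anomalous(0.200, threshold)    # error far above the baseline
```

Such a detector gives only a binary healthy/faulty decision, which is why it serves as a starting point for new systems rather than as a replacement for the multi-class severity classifier.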

For practical deployment in previously unseen systems, a staged strategy may therefore be more suitable. For a new system with no available fault labels, an unsupervised or weakly supervised approach can be adopted initially using healthy baseline data. In this setting, VAE-based features may be combined with methods such as one-class SVM⁸²,⁸³ to support early detection of abnormal operating behavior. As labeled fault data become available over time, the diagnostic pipeline can then be extended toward the proposed multi-class framework.

To improve transferability across different speeds, gearbox configurations, or operating environments, future implementations may benefit from domain adaptation or transfer-learning strategies. For example, the VAE trained on the source system may be reused as an initialization and then adapted on a limited amount of target-domain data, while the downstream classifier is re-trained or fine-tuned for the new operating conditions. Before full deployment, validation on the target system should ideally include healthy baseline acquisition, evaluation under representative load and speed ranges, and limited fault-state verification whenever such data can be obtained.

Conclusion

This study investigated the detection and severity classification of distributed pitting faults in a two-stage industrial helical gearbox using vibration-based deep learning models under multiple load conditions. Distributed pitting produces complex, overlapping vibration patterns, which makes it more difficult to diagnose than faults with more localized signatures. To address this challenge, acquired vibration signals were evaluated in three input forms: 1D time-domain signals, 224 × 224 scalogram images, and 26 × 26 data-matrix representations.

The experimental results showed that model performance depended not only on the learning architecture but also on the suitability of the input representation. Among the evaluated approaches, the VAE-based model using the 26 × 26 data-matrix representation provided the most consistent overall performance across load conditions, while also offering a favorable balance between diagnostic accuracy and computational efficiency. Additional analyses under incomplete fault coverage further indicated that the learned representations could retain useful discriminative structure even when one fault class was excluded during training.

In general, the main contribution of this work lies in the systematic comparison of multiple diagnostic strategies for distributed pitting severity classification, with VAE-based representations emerging as the most consistent performers. At the same time, the transferability of these findings should be interpreted within the scope of the present validation setting, which involved a single gearbox configuration, artificially induced faults, and a constant operating speed. Future work may extend the present framework to broader operating conditions, different gearbox systems, and additional fault types to further assess its robustness and transferability.

Footnotes

ORCID iDs

Ozan Can Alper

Hatice Dogan

Hasan Ozturk

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

1. Kumar, Gandhi, Zhou, et al. Latest developments in gear defect diagnosis and prognosis: a review. Measurement 2020; 158: 107735. https://doi.org/10.1016/j.measurement.2020.107735

2. Ozturk, Yesilyurt, Sabuncu. Detection and advancement monitoring of distributed pitting failure in gears. J Nondestr Eval 2010; 29: 63–73. https://doi.org/10.1007/s10921-010-0066-4

3. Salameh, Cauet, Etien, et al. Gearbox condition monitoring in wind turbines: a review. Mech Syst Signal Process 2018; 111: 251–264.

4. Abdul, Al-Talabani, Ramadan. A hybrid temporal feature for gear fault diagnosis using the long short term memory. IEEE Sensors Journal 2020; 20(23): 14444–14452. https://doi.org/10.1109/JSEN.2020.3007262

5. Zhang, Zhou, Wang, et al. State of the art on vibration signal processing towards data-driven gear fault diagnosis. IET Collab Intell Manuf 2022; 4(4): 249–266. https://doi.org/10.1049/cim2.12064

6. Miltenović, Rakonjac, Alexandru, et al. Detection and monitoring of pitting progression on gear tooth flank using deep learning. Appl Sci 2022; 12(11): 5327. https://doi.org/10.3390/app12115327

7. Wei, et al. A review of early fault diagnosis approaches and their applications in rotating machinery. Entropy 2019; 21(4): 409. https://doi.org/10.3390/e21040409

8. Tama, Vania, Lee, et al. Recent advances in the application of deep learning for fault diagnosis of rotating machinery using vibration signals. Artif Intell Rev 2023; 56(5): 4667–4709. https://doi.org/10.1007/s10462-022-10293-3

9. Praveenkumar, Sabhrish, Saimurugan, et al. Pattern recognition based on-line vibration monitoring system for fault diagnosis of automobile gearbox. Measurement 2018; 114: 233–242. https://doi.org/10.1016/j.measurement.2017.09.041

10.

Liu

Han

Wang

, et al. A method of acoustic emission source location for engine fault based on time difference matrix. Struct Health Monit 2023; 22(1): 621–638. https://doi.org/10.1177/14759217221088995

11.

Cheng

Yang

, et al. All time-scale decomposition method and its application in gear fault diagnosis. Struct Health Monit 2024; 25(1): 320–349. https://doi.org/10.1177/14759217241289873

12.

Kar

Mohanty

. Gearbox health monitoring through multiresolution fourier transform of vibration and current signals. Struct Health Monit 2006; 5(2): 195–200. https://doi.org/10.1177/1475921706058002

13.

Liu

Riemenschneider

. Gearbox fault diagnosis using empirical mode decomposition and Hilbert spectrum. Mech Syst Signal Process 2006; 20(3): 718–734. https://doi.org/10.1016/j.ymssp.2005.02.003

14.

Kar

Mohanty

. Vibration and current transient monitoring for gearbox fault detection using multiresolution Fourier transform. J Sound Vib 2008; 311(1–2): 109–132. https://doi.org/10.1016/j.jsv.2007.08.023

15.

Wang

McFadden

. Application of wavelets to gearbox vibration signals for fault detection. J Sound Vib 1996; 192(5): 927–939. https://doi.org/10.1006/jsvi.1996.0226

16.

Fan

Zuo

. Gearbox fault detection using Hilbert and wavelet packet transform. Mech Syst Signal Process 2006; 20(4): 966–982. https://doi.org/10.1016/j.ymssp.2005.08.032

17.

Yan

Kang

, et al. Multiple faults separation and identification of rolling bearings based on time-frequency spectrogram. Struct Health Monit 2024; 23(4): 2040–2067. https://doi.org/10.1177/14759217231197110

18.

Gao

Liu

Xiang

. Fault detection in gears using fault samples enlarged by a combination of numerical simulation and a generative adversarial network. IEEE/ASME Trans Mechatron 2021; 27(5): 3798–3805. https://doi.org/10.1109/TMECH.2021.3132459

19.

Tang

Yuan

Zhu

. Deep learning-based intelligent fault diagnosis methods toward rotating machinery. IEEE Access 2019; 8: 9335–9346. https://doi.org/10.1109/ACCESS.2019.2963092

20.

Singh

Gangsar

Porwal

, et al. Artificial intelligence application in fault diagnostics of rotating industrial machines: a state-of-the-art review. J Intell Manuf 2021; 34: 931–960. https://doi.org/10.1007/s10845-021-01861-5

21.

Jing

Zhao

, et al. A convolutional neural network based feature learning and fault diagnosis method for the condition monitoring of gearbox. Measurement 2017; 111: 1–10. https://doi.org/10.1016/j.measurement.2017.07.017

22.

Tian

Zuo

. Health condition prediction of gears using a recurrent neural network approach. IEEE Trans Reliab 2010; 59(4): 700–705. https://doi.org/10.1109/TR.2010.2083231

23.

Zhang

Wang

, et al. General normalized sparse filtering: a novel unsupervised learning method for rotating machinery fault diagnosis. Mech Syst Signal Process 2019; 124: 596–612. https://doi.org/10.1016/j.ymssp.2019.02.006

24.

Cheng

Liu

, et al. Study on planetary gear fault diagnosis based on variational mode decomposition and deep neural networks. Measurement 2018; 130: 94–104. https://doi.org/10.1016/j.measurement.2018.08.002

25.

Lupea

. Detecting helical gearbox defects from raw vibration signal using convolutional neural networks. Sensors 2023; 23(21): 8769. https://doi.org/10.3390/s23218769

26. Zhao, Sun, et al. Multi-scale CNN for multi-sensor feature fusion in helical gear fault detection. Procedia Manuf 2020; 49: 89–93. https://doi.org/10.1016/j.promfg.2020.07.001

27. Lupea, Coroian. Helical gearbox defect detection with machine learning using regular mesh components and sidebands. Sensors 2024; 24(11): 3337. https://doi.org/10.3390/s24113337

28. Deutsch, et al. Detection of pitting in gears using a deep sparse autoencoder. Appl Sci 2017; 7(5): 515. https://doi.org/10.3390/app7050515

29. Abdul, Al-Talabani. Highly accurate gear fault diagnosis based on support vector machine. J Vib Eng Technol 2023; 11(7): 3565–3577. https://doi.org/10.1007/s42417-022-00768-6

30. Chu, Han. Deep residual learning with demodulated time-frequency features for fault diagnosis of planetary gearbox under nonstationary running conditions. Mech Syst Signal Process 2019; 127: 190–201. https://doi.org/10.1016/j.ymssp.2019.02.055

31. Feng, et al. A novel vibration indicator to monitor gear natural fatigue pitting propagation. Struct Health Monit 2023; 22(5): 3126–3140. https://doi.org/10.1177/14759217221142622

32. Wang, Huang, Wang, et al. Generalized autoencoder: a neural network framework for dimensionality reduction. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, Columbus, OH, USA, 23–28 June, 2014, pp. 490–497. IEEE.

33. Huang, Chai, Zhu, et al. A novel distributed fault detection approach based on the variational autoencoder model. ACS Omega 2022; 7(3): 2996–3006.

34. Yan, She, et al. Reliable fault diagnosis of bearings using an optimized stacked variational denoising auto-encoder. Entropy 2022; 24(1): 36. https://doi.org/10.3390/e24010036

35. Yan, She. Deep order-wavelet convolutional variational autoencoder for fault identification of rolling bearing under fluctuating speed conditions. Expert Syst Appl 2023; 216: 119479. https://doi.org/10.1016/j.eswa.2022.119479

36. Zemouri, Lévesque, Boucher, et al. Recent research and applications in variational autoencoders for industrial prognosis and health management: a survey. In: 2022 Prognostics and health management conference (PHM-2022 London), London, UK, 27–29 May, 2022, pp. 193–203. IEEE.

37. Alper, Doğan, Öztürk. Gear pitting fault detection: leveraging anomaly detection methods. In: 2023 14th International conference on electrical and electronics engineering (ELECO), Bursa, Turkey, 30 November–2 December, 2023, pp. 1–5. IEEE.

38. Yurtsever, Ümütlü, Öztürk. A comparative study of diverse autoencoder models in local gear pitting fault diagnosis. Konya J Eng Sci 2025; 13(1): 59–73. https://doi.org/10.36306/konjes.1571234

39. Wang, Jin, Sun, et al. Planetary gearbox fault feature learning using conditional variational neural networks under noise environment. Knowl Based Syst 2019; 163: 438–449. https://doi.org/10.1016/j.knosys.2018.09.005

40. Yang, Zhang. Wind turbine gearbox failure detection based on SCADA data: a deep learning-based approach. IEEE Trans Instrum Meas 2020; 70: 1–11. https://doi.org/10.1109/TIM.2020.3045800

41. Wang, Sun, Jin. Imbalanced sample fault diagnosis of rotating machinery using conditional variational auto-encoder generative adversarial network. Appl Soft Comput 2020; 92: 106333. https://doi.org/10.1016/j.asoc.2020.106333

42. Cai, Meng, et al. MFVAE: a multiscale fuzzy variational autoencoder for big data-based fault diagnosis in gearbox. IEEE Trans Fuzzy Syst 2024; 33(1): 180–191.

43. Öztürk. Gearbox health monitoring and fault detection using vibration analysis. PhD Thesis, Dokuz Eylul Universitesi, Turkey, 2006.

44. Öztürk, Sabuncu, Yesilyurt. Early detection of pitting damage in gears using mean frequency of scalogram. J Vib Control 2008; 14(4): 469–484. https://doi.org/10.1177/1077546307080026

45. Ümütlü, Hızarcı, Ozturk, et al. Classification of helical gear fault levels using frequency component based statistical analysis with ANN. Usak Univ J Eng Sci 2018; 1(2): 76–86. https://izlik.org/JA63SN24FD

46. Hizarci, Ümütlü, Ozturk, et al. Vibration region analysis for condition monitoring of gearboxes using image processing and neural networks. Exp Tech 2019; 43(6): 739–755. https://doi.org/10.1007/s40799-019-00329-9

47. Liu, Hsaio. Time series classification with multivariate convolutional neural network. IEEE Trans Ind Electron 2018; 66(6): 4788–4797. https://doi.org/10.1109/TIE.2018.2864702

48. Tong. Robust single accelerometer-based activity recognition using modified recurrence plot. IEEE Sensors Journal 2019; 19(15): 6317–6324.

49. Zaman, Sah, Direkoglu, et al. A survey of audio classification using deep learning. IEEE Access 2023; 11: 106620–106649. https://doi.org/10.1109/ACCESS.2023.3318015

50. Büssow. An algorithm for the continuous Morlet wavelet transform. Mech Syst Signal Process 2007; 21(8): 2970–2979. https://doi.org/10.1016/j.ymssp.2007.06.001

51. Liu, Gao. A deep learning-based fault diagnosis of leader-following systems. IEEE Access 2022; 10: 18695–18706. https://doi.org/10.1109/ACCESS.2022.3151155

52. MathWorks. Diagnostic feature designer app, https://www.mathworks.com/help/predmaint/ref/diagnosticfeaturedesigner-app.html (2024, accessed 15 December 2024).

53. Jamil, Khanam. Fault classification of rolling element bearing in machine learning domain. Int J Acoust Vib 2022; 27(2): 77–90.

54. Liu, Yang, et al. A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans Neural Netw Learn Syst 2021; 33(12): 6999–7019. https://doi.org/10.1109/TNNLS.2021.3084827

55. LeCun, Bengio, Hinton. Deep learning. Nature 2015; 521(7553): 436–444.

56. Wang, Fan, Wang. Comparative analysis of image classification algorithms based on traditional machine learning and deep learning. Pattern Recognition Letters 2021; 141: 61–67. https://doi.org/10.1016/j.patrec.2020.07.042

57. Sharma, Jain, Mishra. An analysis of convolutional neural networks for image classification. Procedia Comput Sci 2018; 132: 377–384. https://doi.org/10.1016/j.procs.2018.05.198

58. Agarap. Deep learning using rectified linear units (ReLU), 2018. arXiv:1803.08375.

59. Kiranyaz, Avci, Abdeljaber, et al. 1D convolutional neural networks and applications: a survey. Mech Syst Signal Process 2021; 151: 107398. https://doi.org/10.1016/j.ymssp.2020.107398

60. Mukhopadhyay, Panigrahy, Misra, et al. Quasi 1D CNN-based fault diagnosis of induction motor drives. In: 2018 5th International conference on electric power and energy conversion systems (EPECS), Kitakyushu, Japan, 23–25 April, 2018, pp. 1–5. IEEE.

61. Zhang, Ren, et al. Deep residual learning for image recognition, 2015. arXiv:1512.03385.

62. Zhang, Ren, et al. Identity mappings in deep residual networks. In: Computer vision–ECCV 2016: 14th European conference, Amsterdam, The Netherlands, October 11–14, 2016, Part IV 14, pp. 630–645. Springer.

63. Zhang, Ren, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, Las Vegas, NV, USA, 27–30 June, 2016, pp. 770–778. IEEE.

64. Simonyan, Zisserman. Very deep convolutional networks for large-scale image recognition, 2014. arXiv:1409.1556.

65. Howard, Sandler, Chu, et al. Searching for MobileNetV3. In: Proceedings of the IEEE/CVF international conference on computer vision, Seoul, South Korea, 27 October–2 November, 2019, pp. 1314–1324. IEEE.

66. Tan. EfficientNet: rethinking model scaling for convolutional neural networks. In: International conference on machine learning, Long Beach, CA, USA, 9–15 June, 2019, pp. 6105–6114. PMLR.

67. West, Ventura, Warnick. Spring research presentation: a theoretical foundation for inductive transfer. Brigham Young University, College of Physical and Mathematical Sciences. J Softw Eng Appl 2007; 12: 11.

68. Pan, Yang. A survey on transfer learning. IEEE Trans Knowl Data Eng 2010; 22(10): 1345–1359. https://doi.org/10.1109/TKDE.2009.191

69. Zhuang, Duan, et al. A comprehensive survey on transfer learning. Proc IEEE 2020; 109(1): 43–76. https://doi.org/10.1109/JPROC.2020.3004555

70. LeCun, Bottou, Bengio, et al. Gradient-based learning applied to document recognition. Proc IEEE 1998; 86(11): 2278–2324. https://doi.org/10.1109/5.726791

71. LeCun, Bottou, Bengio, et al. LeNet-5, convolutional neural networks. http://yann.lecun.com/exdb/lenet 2015; 20(5): 14.

72. Rumelhart, Hinton, Williams. Learning representations by back-propagating errors. Nature 1986; 323(6088): 533–536. https://doi.org/10.1038/323533a0

73. Hinton. Learning translation invariant recognition in a massively parallel networks. In: PARLE parallel architectures and languages Europe: volume I: parallel architectures, Eindhoven, The Netherlands, June 15–19, 1987, pp. 1–13. Springer.

74. Baldi. Autoencoders, unsupervised learning, and deep architectures. In: Proceedings of ICML workshop on unsupervised and transfer learning. Bellevue, WA, USA, 2 July 2011. JMLR: Workshop and Conference Proceedings, 2012, pp. 37–49.

75. Chen, Wang. One-dimensional convolutional auto-encoder-based feature learning for fault diagnosis of multivariate processes. J Process Control 2020; 87: 54–67. https://doi.org/10.1016/j.jprocont.2020.01.004

76. Tornyeviadzi, Seidu. Leakage detection in water distribution networks via 1D CNN deep autoencoder for multivariate SCADA data. Eng Appl Artif Intell 2023; 122: 106062. https://doi.org/10.1016/j.engappai.2023.106062

77. Kingma, Welling. Auto-encoding variational Bayes, 2013. arXiv:1312.6114.

78. Doersch. Tutorial on variational autoencoders, 2016. arXiv:1606.05908.

79. Addo, Zhou, Jackson, et al. EVAE-Net: an ensemble variational autoencoder deep learning network for COVID-19 classification based on chest X-ray images. Diagnostics 2022; 12(11): 2569. https://doi.org/10.3390/diagnostics12112569

80. Paszke, Gross, Massa, et al. PyTorch: an imperative style, high-performance deep learning library. Adv Neural Inform Process Syst 2019; 32: 8024–8035.

81. Yang, et al. Recent advances in vibration condition-based fault diagnosis of rotating machinery. J Control Decis 2026; 13(1): 1–32. https://doi.org/10.1080/23307706.2025.2526054

82. Wang, Cha Y-J. Unsupervised deep learning approach using a deep auto-encoder with a one-class support vector machine to detect damage. Struct Health Monit 2021; 20(1): 406–425. https://doi.org/10.1177/1475921720934051

83. Wang, Cha Y-J. Unsupervised machine and deep learning methods for structural damage detection: a comparative study. Eng Rep 2025; 7(1): e12551. https://doi.org/10.1002/eng2.12551