Automated labeling and abnormal detection based on kernel cluster local outlier factor for machinery health monitoring

Abstract

Machinery label data is necessary for training intelligent fault diagnosis models. However, unlabeled and abnormal data are commonly seen in these data, resulting in the reduction of data quality. As a result, these low-quality data may lead to inaccurate diagnosis models. To address this issue, a kernel cluster local outlier factor (CLOF) method is proposed for automated labeling and abnormal data detection. The suggested approach can establish the relationship among different samples of label data based on a parameter-free method, that is, the natural neighbor spectrum. Through this relationship, different clusters are searched. Then, CLOF is calculated to evaluate the abnormal degree of different clusters, and clusters whose CLOF is larger than the predetermined threshold value are detected as abnormal data. The natural neighbor spectrum is reconstructed after cleaning abnormalities. Finally, fault types of the data can be labeled automatically based on the relationship between unlabeled and labeled data through the reconstructed spectrum. The proposed method is validated through different experimental data collected from a gear test bench, a real wind turbine, and a centrifugal pump, respectively. The results indicate that the proposed approach is effective in detecting abnormal data with different condition types and labeling data accurately and automatically.

Keywords

Fault diagnosis, condition health monitoring, local outlier factor, natural neighbor, abnormal data detection, automated labeling

Introduction

Machinery structure health monitoring has entered the big data era, presenting both significant opportunities and challenges. As for the opportunities, big data can be processed and mined to recognize the machinery condition or diagnose faults accurately.^1–3 Different health monitoring fields, such as wind turbine drivetrains,⁴ bearings,^5,6 and gearboxes,⁷ and different working conditions, such as unstable speed,^8,9 have been studied in depth. However, big data possesses five distinct characteristics, namely substantial volume, diverse variety, rapid velocity, diminished quality, and low density of value,¹⁰ which make it challenging to process for fault diagnosis. Many researchers resort to deep learning techniques in intelligent fault diagnosis to process massive data more rapidly and efficiently.^11–14 Although these intelligent fault diagnosis methods provide an effective way to process monitoring data with high volume, high velocity, and low value density, several challenges accompany them. Because machinery usually operates in harsh environments, abnormal data are widespread. Models constructed from low-quality data containing abnormal data are usually inaccurate.

Data quality describes the extent of the association between data and the operational status of machinery.¹⁵ Abnormal label data refers to abnormal data as well as unlabeled or incorrectly labeled data. Poor-quality data are usually generated by disturbance from the environment and faults in the data acquisition equipment, and these data are irrelevant to the machinery health condition. If intelligent fault diagnosis models are trained directly on these abnormal data without data processing, unreliable models are likely to be obtained, because machine learning-based methods share the property of “garbage in, garbage out.”¹⁶ Extensive research has been conducted on the impact of abnormal data on machine learning performance, and all have found that the classification accuracy decreases as the quality of training data declines.^17–19 As a result, wrong diagnosis results are likely to be obtained from these unreliable models. Abnormal data have therefore attracted much attention and have become an issue that cannot be ignored during intelligent model construction. For example, to decrease the negative influence of polluted data, Shang et al.²⁰ proposed a core loss that can be used in an autoencoder framework. Wang et al.²¹ attempted to construct a graph neural network with data cleaning for the diagnosis of monitoring data with contamination. By using compressive sensing and an enhanced context encoder for abnormal data reconstruction, the quality of one-dimensional data was greatly improved.²² Although many intelligent models have considered the negative influence of abnormal data and many measures have been taken, they cannot eliminate this influence, and data labeling is still an open issue.

To address this issue, data cleaning methods should be used to remove abnormal data and label data for quality assurance of monitoring data before fault diagnosis. The data quality is thereby improved, and the data can then be used to train intelligent models. Normal data, intimately linked to health status, have similar characteristics and can be considered as contextual attributes. Abnormal data deviate from the typical pattern: their characteristics are indeterminate rather than deterministic and are governed by random or probabilistic distributions, so they are markedly distinct from the normal data. Thus, these contextual outliers are considered as abnormal data,²³ and numerous methods for detecting abnormal data have been proposed. These methods can be categorized into four groups: statistics-based methods, classification-based methods, regression-based methods, and cluster-based methods. Commonly used statistical indices and methods for abnormal data detection include median absolute deviation,²⁴ Gumbel distribution,²⁵ the 3 $σ$ rule,²⁶ change-point grouping and the quartile algorithm,²⁷ and minimum covariance determinant.²⁸ Although the statistics-based methods are easy to implement and time-efficient, their effectiveness relies on the assumption that normal data follow specific distributions. Classification-based methods, including one-class support vector machine,²⁹ support vector data description,³⁰ deep neural network,³¹ and Gaussian process,³² have been used for outlier detection. However, the classification model must be trained on enough labeled samples, including both normal and abnormal data, and the model has poor robustness. Least trimmed square estimator,³³ a two-layer feed forward neural network,³⁴ stacked denoising autoencoder,³⁵ and a Kalman filter³⁶ all belong to regression-based methods and have been used to identify outliers according to estimation error. Because regression-based methods are susceptible to noise, they may fail to detect abnormal data.

Many researchers have also developed abnormal data detection methods based on clustering algorithms. For example, in the assessment of wind turbine health, the density-based spatial clustering of applications with noise (DBSCAN) algorithm³⁷ was implemented to mitigate the impact of outliers arising from instrumentation errors. A local kernel density estimation approach³⁸ was proposed for abnormal data detection. To enhance the robustness of their deep learning-based energy prediction model, He et al.³⁹ utilized the local outlier factor (LOF) to refine the gathered data. A kernel-based LOF⁴⁰ was suggested for the cleansing of big data in machinery condition monitoring. Xie et al.⁴¹ proposed an adaptive sliding window and weighted multiscale LOF for abnormal data detection. The isolation forest was introduced for anomaly detection.⁴² Hu et al.⁴³ argued that data preparation should precede the data-driven modeling of wind turbine power curves and proposed a preliminary cleaning method based on the k-means clustering algorithm. Clustering algorithms usually require some parameters to be set, and their effectiveness depends greatly on these parameters, which makes these methods difficult to apply.

The above detection methods of abnormal data cannot be used for label data quality assurance, for the following reasons. First, there are differences among the data of various labels, which make abnormal data detection more difficult. Second, these methods usually fail to detect abnormal data when the abnormal data are numerous. Third, most existing work on data quality assurance focuses on abnormal data detection but neglects automatic data labeling. In actual engineering, unlabeled data are commonly seen and cannot be directly used for intelligent fault diagnosis modeling. Without an effective labeling technique, data have to be labeled by hand, which requires expert experience and takes up much time.^44,45 Therefore, an effective quality assurance method for label data should be able both to detect abnormal data and to label data automatically.

To detect abnormal data and label data automatically for label quality assurance, a natural neighbor-based kernel cluster local outlier factor (CLOF) is proposed. In the proposed method, the natural neighbor spectrum is constructed first. The concept of the natural neighbor is inspired by human friendship, in which true friendship is mutual: each individual must view the other as a genuine companion for the bond to be authentic. Accordingly, two points are considered natural neighbors only if each point belongs to the neighborhood of the other. Based on this idea, natural neighbors can be searched without setting parameters, depending only on the condition of the label data. Specifically, if data of some label have high density, the nearest neighbor number of these data will have large values, while abnormal data will have a small nearest neighbor number. A spectrum of natural neighbors can be constructed based on the relationship inherent in natural neighbor data. Second, the natural neighbor clusters are searched according to the natural neighbors. Third, the local outlier degree of different clusters, including the normal clusters and suspicious clusters, is calculated. Finally, abnormal data are detected from data of various labels according to these CLOFs, and the label is then identified according to the natural neighbor relationship between the data of known labels and the data of unknown labels.

Drawing from the aforementioned statements, the contributions of this paper can be outlined as follows:

1) The paper originally illustrates and discusses the notion of label data quality within the domain of machinery health monitoring. Moreover, based on these discussions, a method employing the natural neighbor-based kernel CLOF is proposed for improving the quality of label data by automatic labeling and abnormal data detection.

2) The application scope of the natural neighbor is broadened in machinery health monitoring through the proposed method. Based on the natural neighbor, a new quantitative evaluation of data quality is proposed and named CLOF. The CLOF is able to detect abnormal data from data of various labels, which traditional methods may fail to do.

3) Based on the natural neighbor, the label data can be automatically identified and labeled, which further improves the quality of the label data.

The remainder of this paper is organized as follows. The natural neighbor theory is introduced in the second section. The method for ensuring label data quality is elaborated in the third section. The fourth section is dedicated to the validation of the proposed technique using gear and bearing data, individually. Lastly, the fifth section provides the conclusion for this study.

Natural neighbor theory

k-Nearest neighbors

The foundation of many powerful tools in various fields has been laid by k-nearest neighbors (KNN),^46–48 which refers to a subset that can be searched according to the following rules. Given a set of points $A$, for any point $p$ in $A$, the KNN of $p$ is defined as

N_{k} (p) = \{ q \in A \setminus \{p\} \mid d (p, q) \leq kd (p) \}

(1)

where $d (p, q)$ signifies the Euclidean distance between $p$ and $q$, while $kd (p)$ represents the $k$-distance of $p$. $kd (p)$ is the distance $d (p, q)$ to an object $q \in A \setminus \{p\}$ that satisfies two rules: first, at least $k$ objects $q^{'} \in A \setminus \{p\}$ fulfill $d (p, q^{'}) \leq d (p, q)$; second, at most $k - 1$ objects $q^{'} \in A \setminus \{p\}$ fulfill $d (p, q^{'}) < d (p, q)$.

For example, in Figure 1, the KNN of $p$ when $k = 6$ is the set enclosed by the dotted line. For any point $q$ in $N_{k} (p)$, $p$ is regarded as a reverse KNN (RKNN) of $q$.

Figure 1.

KNN of p when k = 6. KNN: k-nearest neighbor.

Mutual neighbors are constructed from KNN and RKNN to represent a closer relationship between two points. Points $p$ and $o$ are mutual neighbors if they satisfy the following rule

o \in M N_{k} (p) \Leftrightarrow o \in N_{k} (p) \land p \in N_{k} (o)

(2)
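The two definitions above can be illustrated with a short Python sketch. It is a minimal brute-force illustration under stated assumptions, not the authors' implementation: the function names `knn` and `mutual_neighbors` and the toy data are hypothetical, and distance ties are broken by index order, whereas Equation (1) would keep every tied point.

```python
import math

def knn(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i)."""
    others = [j for j in range(len(points)) if j != i]
    others.sort(key=lambda j: math.dist(points[i], points[j]))
    return set(others[:k])

def mutual_neighbors(points, i, k):
    """Points o with o in N_k(p_i) and p_i in N_k(o), following Equation (2)."""
    return {j for j in knn(points, i, k) if i in knn(points, j, k)}

# Hypothetical 1-D toy data: a tight pair, a nearby point, and a distant point.
pts = [(0.0,), (1.0,), (2.5,), (10.0,)]
print(knn(pts, 2, 1))               # {1}
print(mutual_neighbors(pts, 2, 1))  # set(): point 1 prefers point 0
print(mutual_neighbors(pts, 0, 1))  # {1}
```

The third point has a nearest neighbor but no mutual neighbor for $k = 1$, which is the asymmetry that the mutual neighbor relationship is meant to expose.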

Natural neighbor

One notable shortcoming of KNN and RKNN is that they both suffer from the problem of parameter selection. It is expected that the points similar to $p$ are all contained in $N_{k} (p)$, so a suitable $k$ should be determined. However, no single $k$ can describe the neighbor relationship of a set whose points follow various distributions. Specifically, $k$ should be a smaller value for the area with sparse points, while $k$ should be a larger value for the area with dense points.

Natural neighbor is proposed to solve this shortcoming for describing the neighbor relationship exactly. The theory of natural neighbors, drawing inspiration from the camaraderie prevalent in human society,⁴⁹ encompasses the subsequent quartet of definitions.

Definition 1 (natural stable state). The search over a set of points $P = \{p_{1}, p_{2}, \dots, p_{n}\}$ reaches a natural stable state, and concludes, when a mutual neighbor has been established for each point. The condition can be described using the following equation

\forall p_{i}, \exists p_{j}, i \neq j, \text{ s.t. } p_{i} \in N_{k} (p_{j}) \land p_{j} \in N_{k} (p_{i})

(3)

Definition 2 (natural neighbor eigenvalue). At the natural stable state, the eigenvalue $R$ is equal to $k$ and is represented as follows

R = \{ k \mid \forall p_{i}, \exists p_{j}, \text{ s.t. } (i \neq j) \land (p_{i} \in M N_{k} (p_{j})) \}

(4)

Definition 3 (natural neighbors). Once the natural stable state is reached, the mutual neighbors of a point are regarded as its natural neighbors. For a point $p_{i}$, the number of its natural neighbors is

Nb (p_{i}) = | \{ p_{j} \in P \mid (p_{i} \neq p_{j}) \land (p_{i} \in M N_{k} (p_{j})) \} |, \quad k = R

(5)

The set of natural neighbors of $p_{i}$ is denoted $NR (p_{i})$, and the relationship is symmetric, which can be expressed as

p_{i} \in NR (p_{j}) \Leftrightarrow p_{j} \in NR (p_{i})

(6)

Definition 4 (natural neighbor graph). The construction of the natural neighbor graph revolves around the concept of the natural neighbor and can be delineated as follows

G_{N} = (V, E)

(7)

where $V$ and $E$ denote the vertex set and edge set, respectively. An edge $V_{ij}$ connects points $p_{i}$ and $p_{j}$ only if $p_{i} \in NR (p_{j})$.

Proposed method

To improve the quality of label data, a natural neighbor-based method is proposed. First, the natural neighbors of each data point are searched, and a natural neighbor graph is constructed from them. Second, the natural neighbor clusters are obtained by searching the natural neighbor path and the sharing path for different points. Third, the kernel CLOF is calculated, which includes the LOF of the normal clusters and of the suspicious clusters. Finally, the abnormal data are cleaned based on the kernel CLOF, the data with unknown labels are found, and some of them are further labeled with the help of the natural neighbor relationship they share with the labeled data. The implementation of each step is described in the following four sections.

Figure 2 shows the flowchart of the proposed method.

Figure 2.

The flowchart of the proposed method.

Natural neighbor graph construction

The relationship among the sample data is established through the construction of a natural neighbor graph. In total, 23 features are extracted from each sample, namely mean value, maximum value, minimum value, peak to peak, root mean square, crest factor, variance, square root amplitude, average amplitude, waveform factor, peak value, impulse factor, margin factor, kurtosis, skewness, and eight wavelet energy coefficients. The natural stable state of these samples can be obtained from these features when $k$ equals the natural neighbor eigenvalue. In traditional natural neighbor theory, all points must have natural neighbors when the natural stable state is reached. This requirement is unsuitable for data mixed with abnormal data, so the total number of natural neighbors is instead used to judge whether the natural stable state has been reached. Specifically, the search stops when the following condition is fulfilled

| \sum_{i = 1}^{N} N b_{k} (p_{i}) - \sum_{i = 1}^{N} N b_{k - 1} (p_{i}) | \leq k

(8)

The construction algorithm is described in Table 1. To aid comprehension, an example of a natural neighbor graph constructed through this process is presented in Figure 3.

Table 1.

Natural neighbor graph construction algorithm.

Algorithm 1: Natural neighbor graph construction
Input: features of all points in $P$; initialize $k = 1$, condition = 1
1: while condition do
2: for each point $p_{i} \in P$ do
3: $N b_{k} (p_{i}) = 0$, $N_{k} (p_{i}) = \emptyset$, $M N_{k} (p_{i}) = \emptyset$
4: end for
5: for each point $p_{i} \in P$ do
6: search its KNNs $N_{k} (p_{i})$
7: end for
8: for each point $p_{i} \in P$ do
9: for each point $p_{j} \in N_{k} (p_{i})$ do
10: if $p_{i} \in N_{k} (p_{j})$ then
11: $M N_{k} (p_{i}) = M N_{k} (p_{i}) \cup \{p_{j}\}$, $N b_{k} (p_{i}) = N b_{k} (p_{i}) + 1$
12: end if
13: end for
14: end for
15: if $| \sum_{i = 1}^{N} N b_{k} (p_{i}) - \sum_{i = 1}^{N} N b_{k - 1} (p_{i}) | \leq k$ then
16: condition = 0
17: else
18: k = k + 1
19: end if
20: end while
21: return $R = k$, and $NR (p_{i}) = M N_{R} (p_{i})$ and $Nb (p_{i}) = N b_{R} (p_{i})$ of each point

Figure 3.

A case of natural neighbor graph construction: (a) k = 0, (b) k = 1, (c) k = 2, (d) k = 3, (e) k = 4, and (f) the natural neighbor graph.

Starting from Figure 3(a), $k$ is initialized to 1 and the KNNs of each point are searched. The construction process is described below.

There are nine data points in total, as shown in Figure 3(a), that is, $P_{data} = \{A, B, C, D, E, F, G, H, I\}$. For point B, when k = 1, the following neighbor relationships hold: $N_{k} (B) = \{A\}$, $N_{k} (A) = \{B\}$, $N_{k} (C) = \{B\}$, $N_{k} (D) = \{B\}$. It can be found that $B \in N_{k} (A)$ and also $A \in N_{k} (B)$, thus $Nb (A) = 1$ and $Nb (B) = 1$. A and B have reached stable conditions, and their color changes from yellow to green. The natural neighbor relationship is drawn using blue arrows, while the unidirectional neighbor relationship is drawn using black arrows.

Similarly, when k = 1, $N_{k} (E) = \{H\}$, $N_{k} (H) = \{E\}$, $N_{k} (G) = \{E\}$, $N_{k} (F) = \{E\}$, and $N_{k} (I) = \{F\}$. Therefore, $Nb (E) = 1$ and $Nb (H) = 1$, while the count of natural neighbors for the other points, namely $C$, $D$, $I$, $F$, and $G$, remains 0. Figure 3(c) shows the search results when $k = 2$. Two data points, $D$ and $I$, lack natural neighbors, hence $Nb (D) = 0$ and $Nb (I) = 0$. The number of natural neighbors of each of the other points is updated to 2, because each of them has two natural neighbors. At this point, the iteration stopping condition in Equation (8) is not met. Thus, k is updated to k + 1, that is, 3, and the search results are shown in Figure 3(d). Three points, $D$, $F$, and $G$, all have natural neighbors, and thus their natural stable states are updated. Finally, when k = 4, all the data points have natural neighbors and the sum of natural neighbor numbers no longer changes, which meets the iteration stopping condition. The resulting natural neighbor spectrum is shown in Figure 3(f). The obtained spectrum shows that the construction is not only performed adaptively without input parameters but is also effective in expressing the relationships among all the data points. Data points belonging to one cluster are connected through the spectrum relationship, which is useful for data labeling and anomaly detection.
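The construction procedure can be sketched in Python as follows. This is a minimal brute-force illustration rather than the authors' implementation: the function name `natural_neighbors` and the toy data are hypothetical, the total natural neighbor count for $k = 0$ is assumed to be zero, and distance ties are broken by index order.

```python
import math

def natural_neighbors(points):
    """Grow k until the total natural-neighbour count changes by at most k (Eq. 8).

    Returns (R, nr), where nr[i] is the set of natural neighbours of point i.
    """
    n = len(points)
    # Rank all other points by distance once; N_k(p_i) is the first k entries.
    order = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: math.dist(points[i], points[j]))
        order.append(others)
    nr = [set() for _ in range(n)]
    prev_total = 0  # assumption: the sum for k = 0 is taken as zero
    for k in range(1, n):
        knn = [set(order[i][:k]) for i in range(n)]
        # Mutual k-nearest neighbours of every point.
        nr = [{j for j in knn[i] if i in knn[j]} for i in range(n)]
        total = sum(len(s) for s in nr)
        if abs(total - prev_total) <= k:
            return k, nr
        prev_total = total
    return n - 1, nr

# Hypothetical 1-D data: two groups of three points and one isolated point.
pts = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,), (30.0,)]
R, nr = natural_neighbors(pts)
print(R, nr[6])  # 3 set()
```

In this toy case the search stops at $R = 3$ while the isolated point still has no natural neighbor, which is the behavior the relaxed stopping condition is intended to allow.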

The search for natural neighbor clusters

Using the natural neighbor graph, the sample data can be grouped into clusters without setting any parameters. Specifically, if there exists a path from one point $v_{i}$ to another point $v_{k}$, denoted as $e_{i} \to \dots \to e_{j} \to e_{k}$, the two points belong to one cluster. They belong to different clusters if no path connecting $v_{i}$ to $v_{k}$ can be found. The detailed process for searching all the natural neighbor clusters is shown in Table 2.

Table 2.

Cluster searching procedure algorithm.

Algorithm 2: Cluster searching procedure
Input: the result of the natural neighbor graph
1: Initialize $n = 0$
2: while $V \neq \emptyset$ do
3: Sort elements of $V$ according to the value of their Nb from large to small
4: $n = n + 1$, $m = 1$
5: $c_{n}^{0} = \emptyset$, $c_{n}^{1} = \{v_{1}\} \cup NR (v_{1})$
6: while crad( $c_{n}^{m}, c_{n}^{m - 1}$ ) $\neq 0$ do
7: m = m + 1
8: $c_{n}^{m} = c_{n}^{m - 1} \cup NR (c_{n}^{m - 1})$
9: end while
10: $c_{n} = c_{n}^{m}$, $V = V - c_{n}$
11: end while
12: return clusters $c_{r}, r = 1, 2, \dots, n$ .

First, the natural neighbor graph constructed in the previous section is input, and the parameters $i$, $n$, and $m$ are initialized as 1, 0, and 1, respectively. Second, the elements representing the sample data are sorted by their natural neighbor number $Nb$ from large to small. Third, $v_{1}$, which has the maximum $Nb$, is selected and combined with its natural neighbors. These combined points are defined as $c_{n}^{m}$, and then $c_{n}^{m}$ is iteratively combined with the mutual neighbors (at $k = R$) of the points in $c_{n}^{m}$. The iteration stops when the following rule is satisfied

crad (c_{n}^{m}, c_{n}^{m - 1}) = 0

(9)

where $crad (c_{n}^{m}, c_{n}^{m - 1})$ represents the difference in the number between $c_{n}^{m}$ and $c_{n}^{m - 1}$ .

The rule ensures that all points connected to $v_{1}$ by a path have been found, and these points belong to $c_{n}$. Afterwards, these points are removed from $V$, and $V$ is checked for emptiness. If $V$ is not $\emptyset$, the clusters found so far do not contain all the sample data points, so $n = n + 1$ and the search for the next $c_{n}$ continues. Otherwise, the search is complete and the obtained clusters $c_{r}, r = 1, 2, \dots, n$ are output.
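The cluster search can be sketched as a repeated expansion over natural neighbor links. The sketch below is illustrative only: the function name `natural_neighbor_clusters` and the toy graph are hypothetical, and ties in $Nb$ are assumed to be broken by the smaller index.

```python
def natural_neighbor_clusters(nr):
    """Split points into clusters connected through natural-neighbour links.

    nr[i] is the set of natural neighbours of point i (assumed symmetric).
    """
    remaining = set(range(len(nr)))
    clusters = []
    while remaining:
        # Seed with the remaining point that has the most natural neighbours.
        seed = max(remaining, key=lambda i: (len(nr[i]), -i))
        cluster = {seed}
        while True:
            grown = cluster.union(*(nr[i] for i in cluster))
            if len(grown) == len(cluster):  # crad(c^m, c^{m-1}) = 0
                break
            cluster = grown
        clusters.append(cluster)
        remaining -= cluster
    return clusters

# Hypothetical graph: 0-1-2 form a chain, 3-4 a pair, 5 has no natural neighbour.
nr = [{1}, {0, 2}, {1}, {4}, {3}, set()]
clusters = natural_neighbor_clusters(nr)
# clusters == [{0, 1, 2}, {3, 4}, {5}]
```

A point without natural neighbors ends up in a cluster of its own, so it later falls among the smallest clusters.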

Afterwards, the normal data are selected according to the clusters. From the cluster search described above, the set of clusters $C = \{c_{1}, c_{2}, \dots, c_{n}\}$ is obtained. The clusters are indexed in descending order of size, so that the numbers of elements satisfy

| c_{1} | \geq | c_{2} | \geq \dots \geq | c_{n} |

(10)

Normal data are more likely than abnormal data to fall into clusters with many data points. Therefore, the set of normal clusters $C_{h} = \{c_{1}, c_{2}, \dots, c_{l}\}$ is determined according to the following two rules

| c_{1} | + | c_{2} | + \dots + | c_{l} | \geq N β

(11)

| c_{1} | + | c_{2} | + \dots + | c_{l - 1} | < N β

(12)

where $N$ is the total number of samples and $β$ is usually chosen as 90%, so that at least 90% of the data are included in $C_{h}$. The remaining clusters $\{c_{l + 1}, c_{l + 2}, \dots, c_{n}\}$ are labeled as suspicious clusters, which are probably abnormal data, and are defined as $C_{s}$.
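Rules (11) and (12) amount to taking the smallest number of leading clusters whose cumulative size reaches $N β$. A minimal sketch, in which the function name `count_normal_clusters` is hypothetical and the cluster sizes are those of the four-cluster example discussed later for Figure 5:

```python
def count_normal_clusters(sizes, beta=0.9):
    """Return l, the number of largest clusters treated as normal (Eqs. 11-12)."""
    sizes = sorted(sizes, reverse=True)  # |c_1| >= |c_2| >= ... >= |c_n|
    needed = beta * sum(sizes)           # N * beta
    covered = 0
    for l, size in enumerate(sizes, start=1):
        covered += size
        if covered >= needed:
            return l
    return len(sizes)

print(count_normal_clusters([7, 6, 6, 2]))  # 3, since 19 >= 18.9 but 13 < 18.9
```

With sizes 7, 6, 6, and 2, the first three clusters form $C_{h}$ and the last one is suspicious.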

Cluster local outlier factor

In this section, the abnormal data will be detected by CLOF, which describes the local abnormal degree of data with different labels. The calculation of CLOF is introduced as follows.

1) Calculate the CLOF of data in clusters of normal data. The improved LOF of data in the normal cluster is calculated using

{LOF}_{c} (p_{i}) = \frac{\sum_{o \in N_{R - c_{j}} (p_{i})} \frac{{lrd}_{R - c_{j}} (o)}{{lrd}_{R - c_{j}} (p_{i})}}{| N_{R - c_{j}} (p_{i}) |}

(13)

where $N_{R - c_{j}} (p_{i})$ represents the KNNs of $p_{i}$ in cluster $c_{j}$ when $k = R$, and ${lrd}_{R - c_{j}} (o)$ and ${lrd}_{R - c_{j}} (p_{i})$ denote the local reachability density of $o$ and $p_{i}$, respectively, among the data in cluster $c_{j}$. Because these quantities are calculated only within the cluster to which each normal sample belongs, the negative influence of differences among clusters is eliminated.

To enhance the ability of CLOF in dirty data detection, the kernel regression method is introduced for calculating the CLOF of data among normal clusters, and the equation is shown as follows:

CLOF (p) = \frac{\sum_{o \in N_{R - c_{j}} (p)} \frac{1}{{(kd (o))}^{2}} K (\frac{d (p, o)}{kd (o)}) {LOF}_{c} (o)}{\sum_{o \in N_{R - c_{j}} (p)} \frac{1}{{(kd (o))}^{2}} K (\frac{d (p, o)}{kd (o)})}

(14)

where $K (\cdot)$ represents the kernel function and is described as follows

K (x) = \begin{cases} 1, & \text{if } \| x \| \leq 1 \\ \exp (- {(\| x \| - 1)}^{2} / 2), & \text{otherwise} \end{cases}

(15)

In Equation (14), $x$ equals $\frac{d (p, o)}{kd (o)}$. Therefore, $K (x)$ is 1 when $d (p, o) \leq kd (o)$ and smaller than 1 when $d (p, o) > kd (o)$. In particular, $K (x)$ becomes much smaller when $d (p, o)$ is much larger than $kd (o)$: the value approaches but never equals 0 as the object $p$ moves continuously away from the object $o$. With the help of $K (x)$, a new LOF of $p$ can therefore be estimated through the kernel regression of its natural neighborhood objects. The regression weights depend on the ratio of $d (p, o)$ to $kd (o)$, which reduces the negative effect of an uneven distribution of the objects in the natural neighborhood. In other words, $K (x)$ gives a much broader scope for calculating the CLOF of an object whose neighborhood is sparse, while it gives a narrow scope when the neighborhood is dense, as shown in Figure 4.

2) Evaluate the similarity between suspicious and normal clusters. The similarity is evaluated according to the distance between two clusters and is defined as

d (c_{r}, c_{t}) = \min \{ d (p_{i}, p_{j}) \mid p_{i} \in c_{r}, p_{j} \in c_{t} \}

(16)

where $r, t = 1, 2, \dots, n$ and $r \neq t$ . It can be found that the distance between two clusters is determined by the minimum value of distance between data points in two clusters. Based on the cluster distance, the similarity between suspicious and normal clusters can be obtained and shown as

d (c_{l + q}, C_{h}) = \min \{ d (c_{l + q}, c_{r}) \mid c_{r} \in C_{h} \}

(17)

where $1 \leq q \leq n - l$ and $c_{l + q}$ denotes a suspicious cluster.

3) Calculate the CLOF of data in a suspicious cluster using the following equation

{CLOF}^{'} (c_{l + q}) = \frac{d (c_{l + q}, C_{h}) \times R}{\sum_{o \in N_{R} (p_{w})} d (o, p_{w}) \times CLOF (p_{w})}

(18)

where $p_{w}$ is the data in $C_{h}$ , $d (c_{l + q}, p_{w}) = d (c_{l + q}, c_{m})$ , and $m = \underset{c_{i} \in C_{h}}{\arg min} (d (c_{l + q}, c_{i}))$ .

Figure 4.

The calculation of the CLOF. CLOF: cluster local outlier factor.

Figure 5 shows a case of the calculation of CLOF for data in a suspicious cluster. The natural spectrum has been constructed, and four data clusters, $c_{1}$, $c_{2}$, $c_{3}$, and $c_{4}$, are formed. The numbers of samples in $c_{1}$, $c_{2}$, $c_{3}$, and $c_{4}$ are 7, 6, 6, and 2, respectively. The total number of samples $N$ is 21 and $| c_{1} | + | c_{2} | + | c_{3} | > β N$, thus $C_{h} = \{c_{1}, c_{2}, c_{3}\}$. The CLOF of samples in $C_{h}$ can be obtained by calculating the LOF of samples in their corresponding cluster. $c_{4}$ is identified as a suspicious cluster, and the nearest cluster in $C_{h}$ to $c_{4}$ is $c_{3}$. Based on Equation (18), the CLOF of each sample in $c_{4}$ is written as

{CLOF}^{'} (c_{4}) = d (c_{4}, c_{3}) \times R / \sum_{o \in N_{R} (p_{2})} d (p_{2}, o) \times LOF (p_{2})

(19)

where $d (c_{4}, c_{3}) = d (p_{1}, p_{2})$ and $R = 3$ .

4) Obtain the final CLOF. A sample near the center of its normal cluster is more likely to be normal data than a sample at the border of the cluster. However, when sample data in high-quality clusters are acquired inadequately, ${CLOF}^{'}$ may fail to evaluate the abnormal degree of different clusters. Under these circumstances, a sample near the center may be detected as abnormal data. Considering this, the angle-based outlier factor (ABOF) is introduced to improve the detection performance of ${CLOF}^{'}$. Like the Euclidean distance, the angle can be used to measure the spatial relationship of objects, and it is more suitable for objects with multi-dimensional features. Figure 6 shows that the angles between the vectors from a point to pairs of other points differ widely. The angle range of points near the center is greater than that of points at the border of the clusters. The ABOF for any object $p_{w}$ is defined as

ABOF (p_{w}) = {Var}_{p_{j}, p_{l} \in c_{m}} \left[ \frac{\langle \overrightarrow{p_{w} p_{j}}, \overrightarrow{p_{w} p_{l}} \rangle}{{\| \overrightarrow{p_{w} p_{j}} \|}^{2} \cdot {\| \overrightarrow{p_{w} p_{l}} \|}^{2}} \right]

(20)

where $Var (\cdot)$ represents the variance of angles. In Figure 6, it can be found that

ABOF (o_{1}) < ABOF (o_{2}) < ABOF (o_{3})

(21)

Figure 5.

The calculation of the CLOF. CLOF: cluster local outlier factor.

Figure 6.

The abnormal degree is evaluated by the index of angle.

Figure 7.

The experimental setup of gear health monitoring.

The final CLOF for suspicious clusters improved by ABOF can be obtained and shown as below

CLOF (c_{l + q}) = {CLOF}^{'} (c_{l + q}) \times \exp (- ABOF (p_{w}))

(22)

where $ABOF (p_{w})$ is calculated between the object $p_{w}$ and two points in $c_{m}$ .
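Several building blocks of the steps above can be sketched in Python. This is a hedged illustration, not the authors' code: all function names and the toy data are hypothetical, the population variance is assumed for $Var (\cdot)$, and ${CLOF}^{'}$ from Equation (18) is taken as a given input rather than computed.

```python
import math
from itertools import combinations

def kernel(x):
    """Equation (15): weight 1 inside the k-distance, Gaussian-like decay outside."""
    x = abs(x)
    return 1.0 if x <= 1.0 else math.exp(-((x - 1.0) ** 2) / 2.0)

def kernel_clof(dists, kdists, lofs):
    """Equation (14) for one object p, given d(p, o), kd(o) and LOF_c(o)
    for each neighbour o of p inside its cluster."""
    weights = [kernel(d / kd) / kd ** 2 for d, kd in zip(dists, kdists)]
    return sum(w * lof for w, lof in zip(weights, lofs)) / sum(weights)

def cluster_distance(c_r, c_t):
    """Equation (16): smallest pairwise distance between two clusters."""
    return min(math.dist(p, q) for p in c_r for q in c_t)

def abof(p, cluster):
    """Equation (20): variance of the weighted inner products between the
    vectors from p to every pair of points in the cluster."""
    values = []
    for a, b in combinations(cluster, 2):
        va = [x - y for x, y in zip(a, p)]
        vb = [x - y for x, y in zip(b, p)]
        na2 = sum(x * x for x in va)
        nb2 = sum(x * x for x in vb)
        if na2 == 0.0 or nb2 == 0.0:
            continue  # p coincides with a cluster point; skip that pair
        values.append(sum(x * y for x, y in zip(va, vb)) / (na2 * nb2))
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def final_clof(clof_prime, abof_value):
    """Equation (22): damp CLOF' by exp(-ABOF)."""
    return clof_prime * math.exp(-abof_value)

# Hypothetical data: a square cluster, its centre, and a distant point.
square = [(-1.0, -1.0), (1.0, -1.0), (1.0, 1.0), (-1.0, 1.0)]
centre, far = (0.0, 0.0), (10.0, 0.0)
print(kernel(0.5), round(kernel(3.0), 4))        # 1.0 0.1353
print(abof(centre, square) > abof(far, square))  # True
```

In the toy example, the point at the centre of the square sees the other points under widely varying angles and therefore has a larger ABOF than the distant point, so the factor $\exp (- ABOF)$ lowers its final CLOF more strongly.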

Label identification

The label identification is applied to clean abnormal data, and then the data of unknown label is identified. The process is described as follows.

1) Clean abnormal data. Based on CLOF, the abnormal data can be detected. Specifically, given a threshold value $λ_{c}$, the objects whose CLOF is larger than $λ_{c}$ are detected as abnormal data. According to the calculation of $LOF$, the $LOF$ of normal data is approximately 1, while abnormal data have an $LOF$ much greater than 1. Therefore, $λ_{c}$ is set as follows,^50,51 and the abnormal data can then be cleaned from the label data.

λ_{c} = max [2, mean (CLOF) + 3 std (CLOF)]

(23)

2) Identify the label of unlabeled data. After the abnormal data are cleaned, the natural neighbor graph is constructed again. It should be noted that the iteration stopping condition is updated using the following equation

| \sum_{i = 1}^{N} N b_{k} (p_{i}) - \sum_{i = 1}^{N} N b_{k - 1} (p_{i}) | \leq 1

(24)

The search for natural neighbor clusters is also performed again to establish the relationship among the data with known and unknown labels. For example, if labeled data of fault class $T$ are contained in $c_{j}$, it can be inferred that the unlabeled samples in $c_{j}$ also belong to fault class $T$. If there are no labeled data in $c_{j}$ and $c_{j} \in C_{h}$, $c_{j}$ probably denotes a fault class different from the known labels. If $c_{j} \notin C_{h}$, the label of $c_{j}$ is the same as the label of its nearest normal cluster $c_{i}$.
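The two steps above can be sketched as follows. The sketch makes several assumptions that the text does not fix: the function names and the data layout are hypothetical, the population standard deviation is used in Equation (23), and a normal cluster containing more than one known label is simply marked as ambiguous.

```python
import statistics

def clof_threshold(clofs):
    """Equation (23): lambda_c = max(2, mean(CLOF) + 3 * std(CLOF))."""
    return max(2.0, statistics.mean(clofs) + 3.0 * statistics.pstdev(clofs))

def abnormal_indices(clofs):
    """Indices of the objects whose CLOF exceeds the threshold."""
    lam = clof_threshold(clofs)
    return [i for i, v in enumerate(clofs) if v > lam]

def label_clusters(clusters, known, normal_ids, nearest_normal):
    """Assign one label per cluster.

    clusters:       list of sets of sample indices
    known:          {sample index: label} for the labeled samples
    normal_ids:     indices of the clusters that belong to C_h
    nearest_normal: {suspicious cluster index: index of its nearest normal cluster}
    """
    labels = {}
    for j in normal_ids:
        found = {known[i] for i in clusters[j] if i in known}
        if len(found) == 1:
            labels[j] = found.pop()   # unlabeled samples inherit the known label
        elif not found:
            labels[j] = "new class"   # no labeled sample: probably an unseen class
        else:
            labels[j] = "ambiguous"   # assumption: mixed labels are not resolved here
    for j, i in nearest_normal.items():
        labels[j] = labels[i]         # suspicious cluster follows its nearest normal one
    return labels

print(abnormal_indices([1.0] * 19 + [6.0]))  # [19]
print(label_clusters([{0, 1, 2}, {3, 4}, {5}], {0: "normal"}, [0, 1], {2: 0}))
# {0: 'normal', 1: 'new class', 2: 'normal'}
```

The lower bound of 2 in the threshold keeps small fluctuations of CLOF around 1 from being flagged when the spread of the values is small.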

Experimental demonstrations

Quality assurance of label data for gear health monitoring

To maintain the reliability and safety of machinery equipment, it is crucial to obtain the accurate health condition of gears, which are vital transmission elements. In this section, the effectiveness of the proposed method is verified using experimental gear data. As displayed in Figure 7, the experimental setup primarily comprises one motor, one pulley, one gearbox, and one magnetic powder brake. The motor supplies power to the pulley through the conveyor belt. The pulley is connected to the gearbox through a shaft and drives the pinion of the gearbox. The gearbox has a single high-speed stage with one pair of gears, namely the pinion (driven) and the wheel (driving). The magnetic powder brake provides an adjustable load. A vibration accelerometer is mounted on the gearbox housing to collect vibration data at a sampling frequency of 5.12 kHz.

Vibration data were collected under four condition types: normal condition, broken teeth of wheel, wheel pitting, and pinion wear. In total, 176 samples were collected, with 44 samples for each condition. Ten samples are randomly selected from each of three conditions (normal, pinion wear, and wheel pitting) and labeled with the corresponding fault types. The remaining samples, comprising 34 samples each of normal, pinion wear, and wheel pitting and all the samples of broken teeth of wheel, are treated as samples with unknown labels.

The proposed method is applied to ensure the quality of the data used for constructing deep learning models. Principal component analysis is then used to reduce the dimensionality of the natural neighbor spectrum for visualization, as shown in Figure 8. The natural neighbor clusters are then searched and labeled from $C_{1}$ to $C_{18}$. In Figure 8, only the first five clusters, $C_{1}$ to $C_{5}$, are displayed. Clusters $C_{1}$, $C_{2}$, $C_{3}$, and $C_{4}$ are detected as high-quality clusters because together they contain more than 90% of the total samples. The other clusters are considered suspicious clusters, which probably have poor quality. To evaluate the data quality quantitatively, the proposed CLOF is calculated and displayed in Figure 9. The abnormal clusters, whose CLOF is larger than the threshold value $λ_{c}$, are plotted using white dots in Figure 9. The contour of CLOF is also presented, and the areas containing low-quality data are filled in pink.
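The 90% criterion used above to separate high-quality from suspicious clusters can be illustrated with a short sketch. It assumes that clusters are taken in order of decreasing size until their cumulative sample count exceeds the coverage ratio; the function name `split_clusters` and the cluster sizes are hypothetical and are not the sizes obtained in the gear experiment.

```python
def split_clusters(cluster_sizes, coverage=0.9):
    """Split clusters into high-quality and suspicious groups.

    cluster_sizes: {cluster_id: number of samples in the cluster}
    Clusters are added from largest to smallest until the cumulative
    count exceeds `coverage` of all samples; the rest are suspicious.
    """
    total = sum(cluster_sizes.values())
    ordered = sorted(cluster_sizes, key=cluster_sizes.get, reverse=True)
    high, covered = [], 0
    for cluster in ordered:
        if covered > coverage * total:
            break
        high.append(cluster)
        covered += cluster_sizes[cluster]
    suspicious = [c for c in ordered if c not in high]
    return high, suspicious

# Hypothetical sizes for illustration only.
sizes = {"C1": 40, "C2": 30, "C3": 18, "C4": 7, "C5": 3, "C6": 2}
high, suspicious = split_clusters(sizes)
print(high)        # ['C1', 'C2', 'C3', 'C4']
print(suspicious)  # ['C5', 'C6']
```

Only the suspicious group then needs to be scored against the threshold, which is how the small, isolated clusters are singled out for cleaning.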

Figure 8.

The natural neighbor graph of the whole monitoring data.

Figure 9.

The CLOF of the whole data. CLOF: cluster local outlier factor.

The clusters from $C_{5}$ to $C_{18}$ are detected as abnormal data and cleaned from the data samples. Afterwards, the natural neighbor spectrum is reconstructed, and the result is displayed in Figure 10. Compared with the result in Figure 8, the reconstructed natural neighbor spectrum in Figure 10 does not contain the outlier samples. However, many samples still have unknown labels and cannot be directly used for training deep learning models. Therefore, the fault types of the samples with unknown labels are determined next.

Figure 10.

The reconstructed natural neighbor graph after data cleaning.

The fault types of samples can be identified according to the relationship constructed by the natural neighbor spectrum. For example, the unlabeled data samples that are connected to the labeled data marked using red dots can be identified as wheel pitting. Because there are no labeled data in $C_{4}$, the samples of $C_{4}$ have little relevance to the data in the other clusters, and it can be inferred that they belong to a newly emerging fault type different from the known types, namely normal, broken teeth of wheel, and wheel pitting. This finding is consistent with the fact that the data samples of $C_{4}$ belong to the fault type of pinion wear. The effectiveness of the proposed method for abnormal data detection and labeling of unknown data is demonstrated by the final outcome depicted in Figure 11. Consequently, data quality can be enhanced through automated labeling and abnormal data detection.

Figure 11.

The label identification result of the proposed method.

Quality assurance of label data from a real wind farm

Wind energy is renewable, pollution-free, and has zero carbon emissions; therefore, many countries have placed more emphasis on it by installing large numbers of wind turbines in mountains, at sea, and even in deserts over the last two decades. Wind turbines usually operate in harsh environments, which leads to frequent breakdowns. To ensure the safety of these turbines, monitoring systems with intelligent diagnosis algorithms are developed and applied to the equipment. Unfortunately, missing values, drift segments, and distortion are commonly seen in the monitoring data owing to the severe environment; they can lead to false alarms and should be cleaned before the data are used to train the intelligent diagnosis model.

The structure of the monitored transmission system is shown in Figure 12(a). As shown in Figure 12(b), a sensor is installed near the driven side, which is prone to faults. The sampling frequency is 25.6 kHz, and the duration of each sample is 4 s. From February 25, 2014, to April 16, 2016, 563 data samples were collected.

Figure 12.

(a) The structure of wind turbine transmission and (b) the actual transmission systems in a real wind farm.

The proposed method is then applied to abnormal data cleaning. First, invalid data are cleaned based on the recorded rotary speed of the motor. The wind turbine operates and generates electricity when the wind speed is larger than the cut-in speed, that is, 3 m/s. When the rotary speed of the electrical motor is less than 1080 rpm, the electromagnetic coil of the motor is not excited to generate electricity. Therefore, data are detected as invalid when the rotary speed is less than 1080 rpm, as shown in Figure 13. One hundred forty samples, marked in blue, remain after these invalid data are cleaned.
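The speed-based screening described above amounts to a simple mask on the recorded rotary speed. The sketch below assumes one speed reading per sample; the function name `valid_sample_indices` and the speed values are made up for illustration.

```python
def valid_sample_indices(rotary_speeds_rpm, min_speed_rpm=1080.0):
    """Return indices of samples recorded at or above the excitation speed.

    Samples below `min_speed_rpm` are treated as invalid because the
    machine is not generating electricity at those speeds.
    """
    return [i for i, rpm in enumerate(rotary_speeds_rpm) if rpm >= min_speed_rpm]

# Made-up readings, one per data sample.
speeds = [300.0, 1079.9, 1080.0, 1500.0, 0.0]
print(valid_sample_indices(speeds))  # [2, 3]
```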

Figure 13.

The rotation speed of different data samples.

Next, these data are input into the proposed CLOF method. The natural neighbor spectrum is shown in Figure 14; among these data, the several samples marked using red stars denote the data with known labels. Specifically, their labels all belong to the driven side fault. Based on this spectrum, the KCLOF values are further calculated.

Figure 14.

The natural neighbor graph of the whole monitoring data.

Figure 15 shows the CLOF contour, and it can be seen from Figure 16 that the CLOF values of the data samples with serial numbers 16, 36, 41, 119, 123, and 138 exceed the threshold value. Therefore, these data are detected as abnormal data.

Figure 15.

CLOF contour of different data samples. CLOF: cluster local outlier factor.

Figure 16.

CLOF values of different data samples. CLOF: cluster local outlier factor.

The time-domain waveforms of the detected abnormal data no. 16, 36, 41, and 119 are shown in Figure 17. There is one drift segment in each of the waveforms of no. 16, 36, and 41, while the amplitude of no. 119 is not stable and increases gradually, which may be attributed to the fluctuation of wind. Two data samples, no. 123 and 138, are wrongly detected as abnormal data. After these abnormal data are deleted, the natural neighbor spectrum is constructed again based on the remaining sample data. It can be seen that these data fall into two types. One type contains several labeled data marked using red stars, which correspond to the driven side fault, while the condition of the other type is unknown but should differ from the driven side fault.

Figure 17.

The time-domain waveforms of different data samples: (a) no. 16, (b) no. 36, (c) no. 41, and (d) no. 119.

According to the maintenance records, the driven bearing suffered a fault from February 25, 2014, to July 3, 2015. Afterwards, the wind turbine was repaired, and thus the subsequent sampling data can be considered to represent the healthy condition. In other words, the monitoring data contain two different condition types, namely driven bearing fault and healthy condition, which agrees with the result detected by the proposed method (Figure 18). Consequently, the proposed method can not only detect abnormal data among data with different label types but also label data according to the labeled data.

Figure 18.

(a) The reconstructed natural neighbor graph after data cleaning and (b) data labeling.

To verify the effectiveness of the proposed method, several advanced abnormal data detection and data labeling methods are used for comparison. The results of these comparison methods are shown in Figure 19. DBSCAN has been widely used for data cleaning and is thus used to process the wind data. However, unlike the proposed method, which is free of parameter setting, DBSCAN requires parameters such as $ε$ and MinPts to be set. It is difficult to find suitable parameters, and the performance depends greatly on them. Figure 19(a) gives the result of DBSCAN³⁶ when $ε = 2$ and MinPts = 3. The objects are divided into four normal clusters and one noise cluster. The noise cluster contains many normal data that are not anomalies. Specifically, all four abnormal data are detected successfully, while 26 data samples are wrongly detected as abnormal data. In addition, with the help of DBSCAN, cluster 4, in which there are no known label data, can be detected as a new fault type. However, the driven side fault data samples are wrongly classified into clusters 1, 2, and 3, and even Noise. In k-means,⁴³ there are also two parameters that have to be set. Although the correct number of label types is unknown before data labeling, to present the best performance of k-means, the cluster number parameter is set to 3, with the expectation that k-means can classify the whole data into three clusters: driven side fault, health, and noise.

Figure 19.

Abnormal data detection and labeling using different methods for a wind turbine: (a) DBSCAN, (b) k-means, (c) isolated forest, and (d) KLOF. DBSCAN: density-based spatial clustering of applications with noise; KLOF: kernel local outlier factor.

The result is displayed in Figure 19(b). Three of the four abnormal data samples are successfully detected, while no. 119 is wrongly assigned to the normal cluster. Some driven side fault samples are wrongly assigned to the same type as the health condition.

Isolated forest⁴¹ and LOF³⁹ are classic anomaly detection methods and have been used for data cleaning. Their results are shown in Figure 19(c) and (d), respectively. The contamination fraction is set to 0.02 for the isolated forest, which assumes that abnormal data make up 2% of the whole data. The isolated forest method then detects abnormal data by checking which samples have an anomaly score beyond the corresponding threshold value. Two data samples, no. 16 and 41, are detected as abnormal data, and no. 14 is wrongly detected. In KLOF, four abnormal data samples can be detected and just one, that is, no. 138, is wrongly detected. However, both isolated forest and KLOF can only detect the abnormal data and cannot label data samples with the condition types.

The test results of all the methods are shown in Table 3. M0 denotes the number of abnormal data samples that are successfully detected by the corresponding method. M1 represents the number of normal data samples that are wrongly detected as abnormal data. M2 is the accuracy of the data labeling. The proposed method can not only detect all the abnormal data samples but also label the condition type with an accuracy of 98.53%. DBSCAN has the largest M1 value, that is, 26, while the M1 values of the other methods are all very small. Thus, the proposed method is superior to the other methods in both abnormal data detection and labeling, which verifies its effectiveness.

Table 3.

The results of abnormal data detection and data labeling using different methods.

Task	DBSCAN	KLOF	k-Means	Isolated forest	Proposed
M0	4	3	3	2	4
M1	26	2	4	1	2
M2	67.14% (max)	—	85.71%	—	98.53%

DBSCAN: density-based spatial clustering of applications with noise; KLOF: kernel local outlier factor.
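The detection indices in Table 3 can be reproduced from sets of sample numbers. The sketch below is illustrative: the function names are hypothetical, and the labeling accuracy is assumed to be the percentage of samples whose assigned label matches the true one, which the text does not define explicitly. The sample numbers are those reported above for the proposed method on the wind farm data.

```python
def detection_metrics(true_abnormal, detected):
    """Return (M0, M1).

    M0: abnormal samples that are correctly detected.
    M1: normal samples that are wrongly detected as abnormal.
    """
    true_abnormal, detected = set(true_abnormal), set(detected)
    return len(true_abnormal & detected), len(detected - true_abnormal)

def labeling_accuracy(true_labels, predicted_labels):
    """Percentage of samples whose predicted label equals the true label."""
    correct = sum(
        1 for sample, label in true_labels.items()
        if predicted_labels.get(sample) == label
    )
    return 100.0 * correct / len(true_labels)

# Proposed method, wind farm data: no. 16, 36, 41, 119 are truly abnormal;
# no. 123 and 138 are additionally flagged.
m0, m1 = detection_metrics({16, 36, 41, 119}, {16, 36, 41, 119, 123, 138})
print(m0, m1)  # 4 2
```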

Quality assurance of label data for centrifugal pump

Centrifugal pumps are widely used in many industrial fields, such as nuclear power, petrochemical, and thermal power, and play a vital role in them. Much research has been done on fault diagnosis of centrifugal pumps, and how to improve the data quality of centrifugal pumps has always been an important topic. As shown in Figure 20, the electrical motor drives the centrifugal pump, and between them, one rolling element bearing supports the rotating shaft. The water is driven from the inlet to the circulating water tank. One sensor is installed on the bearing housing to monitor its condition.

Figure 20.

The centrifugal pump experiment platform.

Four condition types were considered, and the corresponding data were collected: unbalance fault induced by fixing one mass block on the rotating shaft, inner race fault, outer race fault, and health condition. Twenty data samples are randomly selected from all the conditions as the samples with known fault labels.

To simulate external disturbance, a hammer was used to knock the test bench while the condition type was outer race fault. In total, there are 10 abnormal data samples generated by knocking the bench. The presence of data with different label types increases the difficulty of detecting the disturbance data. The proposed method is then applied to these data to detect the abnormal data. First, the natural neighbor spectrum is constructed and shown in Figure 21. It can be seen that these data fall into five clusters, and the data samples denoted by black stars need to be labeled. The fault types of the other samples are indicated by different colors.

Figure 21.

The natural neighbor graph of the whole monitoring data.

The CLOF contour is then calculated and shown in Figure 22. As shown in Figure 23, the data whose KCLOF values are larger than the threshold value are detected as disturbance data. After the data cleaning of the first iteration, the remaining data are input into the proposed method again, and the iteration continues until all the KCLOF values are smaller than the threshold value. After two rounds of iteration, no more abnormal data can be detected.
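The iterative cleaning described above can be sketched as a loop that rescoring the remaining data after each removal. This is a toy illustration under stated assumptions: `iterative_clean` and `toy_score` are hypothetical names, `toy_score` is a simple stand-in for KCLOF (one plus the distance from the median), the threshold follows equation (23) with the population standard deviation assumed, and the data values are made up.

```python
from statistics import mean, median, pstdev

def iterative_clean(samples, score_fn, max_rounds=10):
    """Remove samples whose score exceeds the threshold until none remain.

    score_fn(samples) returns one score per sample.
    Returns (remaining samples, number of rounds that removed data).
    """
    rounds = 0
    for _ in range(max_rounds):
        if not samples:
            break
        scores = score_fn(samples)
        lam = max(2.0, mean(scores) + 3.0 * pstdev(scores))
        keep = [s for s, v in zip(samples, scores) if v <= lam]
        if len(keep) == len(samples):
            break  # nothing exceeds the threshold any more
        samples, rounds = keep, rounds + 1
    return samples, rounds

def toy_score(samples):
    """Stand-in for KCLOF: 1 + distance from the median (illustrative only)."""
    m = median(samples)
    return [1.0 + abs(x - m) for x in samples]

# Made-up data: 20 normal values, one strong and one weaker disturbance.
data = [0.0] * 20 + [10.0, 4.0]
cleaned, rounds = iterative_clean(data, toy_score)
print(len(cleaned), rounds)  # 20 2
```

In this toy case the strong disturbance inflates the mean and standard deviation in the first round, so the weaker one only exceeds the threshold after the first has been removed, which is why more than one round can be needed.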

Figure 22.

The CLOF contour of the whole data. CLOF: cluster local outlier factor.

Figure 23.

CLOF values of different data samples. CLOF: cluster local outlier factor.

Finally, nine abnormal data samples, no. 151 to 159, are successfully detected, while no. 51 and 149 are wrongly detected. The time-domain waveforms of two detected data samples, no. 154 and 156, are shown in Figure 24, and two normal data samples, no. 145 and 148, are also displayed for comparison.

Figure 24.

The time-domain waveforms of different data samples: (a) no. 145, (b) no. 162, (c) no. 154, and (d) no. 156.

These abnormal data are then cleaned, and the reconstructed natural neighbor spectrum is shown in Figure 25. The data with four condition types fall into four clusters, as shown in Figure 25(a). Then, through the spectrum relationship between the data samples with known and unknown labels, the types of the unlabeled data samples can be inferred. The final data labeling result is shown in Figure 25(b), and all the data samples are labeled accurately.

Figure 25.

(a) The reconstructed natural neighbor graph after data cleaning and (b) the data labeling result.

The four comparison methods are also used to process the vibration data collected from the centrifugal pump. Figure 26 presents the detection results of these methods. As summarized in Table 4, DBSCAN detects seven abnormal data samples, while two data samples are wrongly detected. The maximum labeling accuracy of DBSCAN is 63.53%. k-Means also gives an unsatisfactory result: three abnormal data samples are successfully detected, while eight data samples are wrongly detected. k-Means also achieves the lowest labeling accuracy, just 60.53%, far below that of the proposed method, that is, 98.2%. Furthermore, the proposed method successfully detects nine abnormal data samples and wrongly detects just two, which also gives it a great advantage over all the compared methods. Therefore, the data quality can be assured to a great extent by using the proposed method for data cleaning and labeling.

Figure 26.

Abnormal data detection and labeling using different methods for a centrifugal pump: (a) DBSCAN, (b) k-means, (c) isolated forest and (d) KLOF. DBSCAN: density-based spatial clustering of applications with noise; KLOF: kernel local outlier factor.

Table 4.

The results of abnormal data detection and data labeling using different methods.

Task	DBSCAN	KLOF	k-Means	Isolated forest	Proposed
M0	7	6	3	4	9
M1	2	0	8	0	2
M2	63.53%(max)	—	60.53%(max)	—	98.2%

DBSCAN: density-based spatial clustering of applications with noise; KLOF: kernel local outlier factor.

Conclusions

A natural neighbor-based approach is proposed in this paper to enhance the quality of label data. In this method, the natural neighbor spectrum is constructed to establish the relationship among different data samples. Different clusters are searched based on this relationship, and CLOF is used to evaluate the quality of these clusters. The abnormal clusters can then be cleaned, and the fault types of the unknown data can be decided by searching the natural neighbor graph for the labeled data that are relevant to the object data. The effectiveness of the proposed method in detecting abnormal data among data with different labels is verified by the experimental data collected from a gear test bench, a wind turbine, and a centrifugal pump. Moreover, the proposed method can assist in labeling unknown data. Because feature extraction is an important stage and greatly influences the effect of data labeling, the construction of more robust feature indicators will be considered in our future work.

Footnotes

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by National Natural Science Foundation of China (62403406), Natural Science Foundation of Hebei Province (F2025203081), Open Fund (OGE202302-13) of Key Laboratory of Oil & Gas Equipment, Ministry of Education (Southwest Petroleum University), and China Scholarship Council Program under Project 202408130110.

ORCID iDs

Xuefang Xu

Zijian Qiao

References

1. Shao, Xiao, Leng, et al. Collaborative human-computer fault diagnosis via calibrated confidence estimation. Adv Eng Inform 2025; 65: 103349.

2. Wang, Shao, et al. A novel interpretable fault diagnosis method using multi-image feature extraction and attention fusion. Pattern Recognit Lett 2025; 189: 38–47.

3. Yang, et al. Adversarial domain adaptation model based on LDTW for extreme partial transfer fault diagnosis of rotating machines. IEEE Trans Instrum Meas 2024; 73: 3538811.

4. Qiao, et al. Caputo-Fabrizio fractional order derivative stochastic resonance enhanced by ADOF and its application in fault diagnosis of wind turbine drivetrain. Renew Energy 2023; 219: 119398.

5. Yan, Liu, Jia. Health condition identification for rolling bearing using a multi-domain indicator-based optimized stacked denoising autoencoder. Struct Health Monit 2020; 19(5): 1602–1626.

6. et al. Simulated data-assisted fault diagnosis framework with dual-path feature fusion for rolling element bearings under incomplete data. IEEE Trans Instrum Meas 2025; 74: 3542671.

7. Wang, Tian, Liang, et al. Single and simultaneous fault diagnosis of gearbox via wavelet transform and improved deep residual network under imbalanced data. Eng Appl Artif Intell 2024; 133: 108146.

8. Luo, Shao, Lin, et al. Meta-learning with elastic prototypical network for fault transfer diagnosis of bearings under unstable speeds. Reliab Eng Syst Saf 2024; 245: 110001.

9. Liang, Shuai, et al. Semi-supervised subdomain adaptation graph convolutional network for fault transfer diagnosis of rotating machinery under time-varying speeds. IEEE/ASME Trans Mechatron 2024; 29(1): 730–741.

10. Chen, Mao, Liu. Big data: a survey. Mobile Netw Appl 2014; 19: 171–209.

11. Wang, Shao, Yan, et al. C-ECAFormer: a new lightweight fault diagnosis framework towards heavy noise and small samples. Eng Appl Artif Intell 2023; 126: 107031.

12. Malekloo, Ozer, AlHamaydeh, et al. Machine learning and structural health monitoring overview with emerging technology and high-dimensional data source highlights. Struct Health Monit 2022; 21(4): 1906–1955.

13. Kong, Qin, Wang, et al. Data-driven dictionary design–based sparse classification method for intelligent fault diagnosis of planet bearings. Struct Health Monit 2022; 21(4): 1313–1328.

14. Zhao, Shao, Wang, et al. Time-frequency self-similarity enhancement network and its application in wind turbines fault analysis. Adv Eng Inform 2025; 65: 103322.

15. Taleb, Dssouli, Serhani. Big Data pre-processing: a quality framework. In: 2015 IEEE international congress on Big Data (BigData Congress), New York, NY, USA, 27 June–2 July 2015, pp. 191–198. Washington, DC: IEEE.

16. Najafabadi, Villanustre, Khoshgoftaar, et al. Deep learning applications and challenges in big data analytics. J Big Data 2015; 2(1): 2–21.

17. Atla, Tada, Sheng, et al. Sensitivity of different machine learning algorithms to noise. J Comput Sci Coll 2011; 26(5): 96–103.

18. Lei, Sun, Xia. Lost data reconstruction for structural health monitoring using deep convolutional generative adversarial networks. Struct Health Monit 2021; 20(4): 2069–2087.

19. García-Gil, Luengo, García, et al. Enabling smart data: noise filtering in big data classification. Inform Sci 2019; 479: 135–152.

20. Shang, Zhao, Yan, et al. Core loss: mining core samples efficiently for robust machine anomaly detection against data pollution. Mech Syst Signal Process 2023; 189: 110046.

21. Wang, Lei, Yang, et al. A graph neural network-based data cleaning method to prevent intelligent fault diagnosis from data contamination. Eng Appl Artif Intell 2023; 126: 107071.

22. Xie, Wang, Tao, et al. Research on one-dimensional data quality assurance for machinery health monitoring using compressive sensing and enhanced context encoder. Expert Syst Appl 2025; 297(Part B): 129407.

23. Hodge, Austin. A survey of outlier detection methodologies. Artif Intell 2004; 22(2): 85–126.

24. Vera-Tudela, Kühn. Analyzing wind turbine fatigue load prediction: the impact of wind farm flow conditions. Renew Energy 2017; 107: 352–360.

25. Papatheou, Dervilis, Maguire, et al. Performance monitoring of a wind turbine using extreme function theory. Renew Energy 2017; 113: 1490–1502.

26. Guo, Lei, et al. Machinery health indicator construction based on convolutional neural networks considering trend burr. Neurocomputing 2018; 292: 142–150.

27. Shen, Zhou. A combined algorithm for cleaning abnormal data of wind turbine power curve based on change point grouping algorithm and quartile algorithm. IEEE Trans Sustain Energy 2018; 10(1): 46–54.

28. Mckinnon, Carroll, Mcdonald, et al. Comparison of new anomaly detection technique for wind turbine condition monitoring using gearbox SCADA data. Energies 2020; 13: 5152.

29. Xiao, Wang, et al. Robust one-class SVM for fault detection. Chemometr Intell Lab Syst 2016; 151: 15–25.

30. Chen, Zhang, Wang, et al. Robust support vector data description for outlier detection with noise or uncertain data. Knowl Based Syst 2015; 90: 129–137.

31. Bao, Tang, et al. Computer vision and deep learning–based data anomaly detection method for structural health monitoring. Struct Health Monit 2019; 18(2): 401–421.

32. Wang, Mao. Outlier detection based on Gaussian process with application to industrial processes. Appl Soft Comput 2019; 76: 505–516.

33. Dervilis, Worden, Cross. On robust regression analysis as a means of exploring environmental and operational conditions for SHM data. J Sound Vib 2015; 347: 279–296.

34. Ferdowsi, Jagannathan, Zawodniok. An online outlier identification and removal scheme for improving fault detection performance. IEEE Trans Neural Netw Learn Syst 2013; 25(5): 908–919.

35. Dai, Song, Sheng, et al. Cleaning method for status monitoring data of power equipment based on stacked denoising autoencoders. IEEE Access 2017; 5: 22863–22870.

36. Medina, Vilà-Valls, et al. Robust variational-based kalman filter for outlier rejection with correlated measurements. IEEE Trans Signal Process 2020; 69: 357–369.

37. Jia, Jin, Buzza, et al. Wind turbine performance degradation assessment based on a novel similarity metric for machine performance curves. Renew Energy 2016; 99: 1191–1201.

38. Tang. A local density-based approach for outlier detection. Neurocomputing 2017; 241: 171–180.

39. et al. A generic energy prediction model of machine tools using deep learning algorithms. Appl Energy 2020; 275: 115402.

40. Lei. An incorrect data detection method for big data cleaning of machinery condition monitoring. IEEE Trans Ind Electron 2019; 67(3): 2326–2336.

41. Xie, Tao, Xie, et al. Abnormal data detection based on adaptive sliding window and weighted multiscale local outlier factor for machinery health monitoring. IEEE Trans Ind Electron 2022; 70(11): 11725–11734.

42. Guo, Gao, et al. Similarity-measured isolation forest: anomaly detection method for machine monitoring data. IEEE Trans Instrum Meas 2021; 70: 1–12.

43. Pan, et al. Daily condition monitoring of grid-connected wind turbine via high fidelity power curve and its comprehensive rating. Renew Energy 2020; 146: 2095–2111.

44. Reyes, Ventura. Evolutionary strategy to perform batch-mode active learning on multi-label data. ACM Trans Intell Syst Technol 2018; 9(4): 1–26.

45. Sarr JMA, Brochier, Brehmer, et al. Complex data labeling with deep learning methods: lessons from fisheries acoustics. ISA Trans 2021; 109: 113–125.

46. Kramer. K-nearest neighbors. In: Dimensionality reduction with unsupervised nearest neighbors. Berlin, Heidelberg: Springer Berlin Heidelberg, 2013, pp. 13–23.

47. Zhang. Introduction to machine learning: k-nearest neighbors. Ann Transl Med 2016; 4(11): 218.

48. Agrawal. K-nearest neighbor for uncertain data. Int J Comput Appl 2014; 105(11): 13–16.

49. Zhu, Feng, Huang. Natural neighbor: a self-adaptive neighborhood method without parameter k. Pattern Recognit Lett 2016; 80: 30–36.

50. Wang, Liu, et al. Graph-based change detection for condition monitoring of rotating machines: techniques for graph similarity. IEEE Trans Reliab 2018; 68(3): 1034–1049.

51. Breunig, Kriegel, et al. LOF: identifying density-based local outliers. In: Proceedings of the 2000 ACM SIGMOD international conference on Management of data and symposium on principles of database systems, Texas, USA, 15–18 May 2000, pp. 93–104. New York: Association for Computing Machinery.