wCM based hybrid pre-processing algorithm for class imbalanced dataset

Abstract

Imbalanced dataset classification is challenging because of the severely skewed class distribution. The traditional machine learning algorithms show degraded performance for these skewed datasets. However, there are additional characteristics of a classification dataset that are not only challenging for the traditional machine learning algorithms but also increase the difficulty when constructing a model for imbalanced datasets. Data complexity metrics identify these intrinsic characteristics, which cause substantial deterioration of the learning algorithms’ performance. Though many research efforts have been made to deal with class noise, none of them focused on imbalanced datasets coupled with other intrinsic factors. This paper presents a novel hybrid pre-processing algorithm focusing on treating the class-label noise in the imbalanced dataset, which suffers from other intrinsic factors such as class overlapping, non-linear class boundaries, small disjuncts, and borderline examples. This algorithm uses the wCM complexity metric (proposed for imbalanced dataset) to identify noisy, borderline, and other difficult instances of the dataset and then intelligently handles these instances. Experiments on synthetic datasets and real-world datasets with different levels of imbalance, noise, small disjuncts, class overlapping, and borderline examples are conducted to check the effectiveness of the proposed algorithm. The experimental results show that the proposed algorithm offers an interesting alternative to popular state-of-the-art pre-processing algorithms for effectively handling imbalanced datasets along with noise and other difficulties.

Keywords

Classification, class imbalance, data complexity, overlapping, Bayes error, pre-processing, learning algorithms

1 Introduction

A dataset is considered imbalanced in a classification task when the distribution of data points per class is not equal. Traditional machine learning classifiers tend to predict the data points of the majority class (i.e., the class with a large number of data points) correctly, whereas the data points of the minority class (i.e., the class with fewer data points) are treated as noise and are usually ignored. Hence, the probability of misclassification is higher for the minority class than for the majority class [1]. This imbalanced dataset problem is mainly found in classification tasks such as detecting fraud in bank transactions, diagnosing cancer, and predicting software defects.

Over the past years, several imbalance handling methods have been proposed [2–5]. Their goal is to improve the classifiers’ performance on the minority class without heavily sacrificing the performance on the majority class. A complete review of the imbalance handling methods can be found in [6, 7]. However, an imbalanced dataset does not always pose a problem; for example, when the classes are linearly separable in the input feature space, or the imbalance ratio is not very high, it is not difficult to build a proper classification model [8, 9]. The problem arises when the imbalanced dataset also suffers from other intrinsic factors such as class overlap [10–12], presence of noise [13], small disjuncts [9], etc. Researchers commonly agree [9, 14] that the accuracy of a classification model on the minority class degrades when class imbalance occurs together with these intrinsic factors, even though its accuracy on the majority class remains high.

Many researchers have proposed algorithms to deal with the intrinsic factors of an imbalanced dataset, but they considered these factors independently. In 2004, Jo and Japkowicz [9] proposed an approach to handle small disjuncts in imbalanced datasets. A hybrid method for handling class overlapping and class imbalance problems was proposed in 2013 by Alejo et al. [15]. In 2013, Saez et al. [16] proposed a noise filter, an extension of SMOTE, to handle noisy and borderline examples in imbalanced datasets. In [17], the authors developed a new argument-based rule induction algorithm, ABMODLEM, which brings expert knowledge about class-imbalanced data into the learning process. A method based on feature selection and instance selection to overcome class overlap and class imbalance was suggested by Fernandez et al. [18]. In [19], the authors designed RIFUS to handle the imbalanced dataset problem and noisy points. In 2019, Koziarski et al. [20] proposed a radial-based oversampling algorithm to deal with imbalanced datasets along with class label and feature noise. In 2020, an undersampling technique, DBMIST-US, was proposed [21] to handle the class imbalance problem with class overlapping. Another method for improving classification performance on imbalanced datasets was proposed in [22]; it uses local Mahalanobis distance learning (LMDL) in the nearest neighbor classifier.

In the literature, researchers have proposed data complexity metrics to analyze the intrinsic factors of datasets. These metrics help extract knowledge about dataset factors, which supports selecting a proper learning algorithm. Ho and Basu [23] proposed data complexity metrics to measure the complexity of classification tasks. Most of the data complexity metrics proposed by various researchers [9, 23–25] are for balanced datasets. In [26], we presented a systematic study of the behavior of the existing complexity metrics when used with imbalanced datasets. We divided the imbalanced datasets into two categories, harmful and unharmful, using the complexity metrics. After that, we applied oversampling techniques to these datasets. The results revealed that these complexity metrics do not perform well for imbalanced datasets that also suffer from other intrinsic factors such as class overlapping, noise, small disjuncts, etc. In [27], we proposed the weighted complexity metric (wCM) to assess the difficulty level of an imbalanced dataset. This metric calculates the dataset complexity by computing the weighted nearest neighbors.

The hybrid approach proposed in this paper is a pre-processing technique that uses the wCM metric together with oversampling of the minority class data points and undersampling of the majority class data points. We apply our proposed hybrid pre-processing algorithm to balance the dataset intelligently by treating the data points of noisy and difficult regions. Our algorithm is greatly inspired by the SPIDER method with strong amplification, presented in [14]. That method uses the k-nearest neighbors of each data point to label it as ‘safe’ or ‘noisy.’ It amplifies all ‘noisy’ and ‘safe’ points from the minority class and removes all ‘noisy’ points from the majority class.

The novelty of our work is that, unlike other conventional pre-processing methods, the proposed algorithm can handle imbalanced class distributions along with class noise, overlapping regions, and small disjuncts. The proposed algorithm pre-processes the data points of the minority and majority classes based on the wCM metric value, which helps retain the original distribution of minority and majority class samples and thus prevents information loss. Table 1 shows a comparative analysis of our proposed technique and other existing techniques in the literature.

Table 1
Comparative analysis of the proposed technique with existing techniques

Authors	Handle Binary CIP	Handle Multi-class CIP	Handle Noise	Handle Class Overlapping	Handle Small Disjuncts	Handle Non-linear Boundaries	Handle Borderline Examples
Jo and Japkowicz, 2004 [9]	✓	×	×	×	✓	×	×
Alejo et al., 2013 [15]	×	✓	×	✓	×	✓	✓
Saez et al., 2015 [16]	✓	×	✓	×	×	×	✓
Napierała et al., 2015 [17]	✓	×	×	×	✓	✓	✓
Fernandez et al., 2015 [18]	✓	×	✓	×	×	×	✓
Kaur et al., 2018 [19]	✓	×	✓	×	×	×	×
Koziarski et al., 2019 [20]	✓	×	✓	×	✓	✓	✓
Ponce et al., 2020 [21]	✓	×	×	✓	×	×	×
Siddappa et al., 2020 [22]	✓	×	×	×	✓	✓	✓
wCM based Hybrid Pre-processing Technique (proposed in this paper)	✓	×	✓	✓	✓	✓	✓

We have conducted an experimental evaluation of our hybrid pre-processing algorithm on three artificial benchmark datasets and five real-world datasets with different noise levels. We used four classifiers: decision tree (DT), k-nearest neighbors (k-nn), logistic regression (LR), and SVM with linear and Gaussian kernels. All the experiments were conducted in MATLAB, using the “Classification Learner Toolbox of MATLAB” for the classifiers. To demonstrate the usefulness of our proposed algorithm, we compared it with existing pre-processing methods: SMOTE-ENN (SMENN), SMOTE-Tk (SMTk), SPIDER, SMOTE (SM), BSMOTE (BSM), and ADASYN.

The rest of the paper is organized as follows. Section 2 presents the background, describing the state of the art for the class imbalance problem and reviewing the data complexity metrics. Section 3 discusses the data complexity metric wCM. Section 4 presents our wCM-based approach for pre-processing imbalanced datasets. Section 5 describes the experimental study and the analysis of the results. Finally, we conclude the paper with Section 6.

2 Related works

The data complexity metrics measure the complexity of a dataset which helps in selecting the appropriate classification algorithm. Ho and Basu [28] proposed the complexity metrics for binary classification problems. Further, extending the work of Ho and Basu, Singh [29] proposed two new complexity metrics for pattern recognition based on feature space partitioning. Various researchers have used these metrics to analyze classification problems, including understanding the effect of class overlapping, data dimensionality, and class density [30, 31]; selecting an evolutionary prototyping algorithm [31]. J.A. Saez et al. and Macia et al. [16, 32] used data complexity metrics: to predict the noise filtering efficacy in the nearest neighbor classifier [16] and to characterize the datasets for the selection of appropriate classifier [32]. Luengo and Herrera [33] proposed an automatic extraction method to differentiate the problem space using data complexity measures for classification. Moreover, Zubek and Plewczynski [34] described a data complexity metric based on probability distribution and Hellinger distance. Brun et al. [35], using Ho and Basu metrics as a reference, analyzed dynamic classifiers selection techniques and proposed the selection method based on these complexity metrics for dynamic selection of classifier.

Conversely, researchers [8, 35–39] have shown that the existing data complexity metrics do not work well on imbalanced datasets. Few researchers have proposed complexity metrics that help measure the complexity of imbalanced datasets; see [8, 38]. A complexity metric using a scatter matrix-based class separability method for imbalanced datasets was proposed by Yu et al. [38]. Anwar et al. [36] proposed a data complexity metric for imbalanced datasets based on the k-nearest neighbor approach. Diez-Pastor et al. [39] used complexity metrics to predict data complexity intervals for which some diversity-enhancing techniques may improve the results of an ensemble method. Fernandez et al. [40] conducted studies on microarray data, an example of highly imbalanced gene expression data, to analyze the usefulness of data complexity metrics for evaluating the behavior of the SMOTE algorithm with respect to the feature selection method. Victor et al. [8] presented three complexity metrics for imbalanced datasets, adapted from well-known complexity metrics by regarding each class individually. In [41], the authors proposed the Bayes Imbalance Impact Index (BI³) metric to measure the impact of imbalance.

3 Weight calculation using wCM

In this paper, we use wCM to calculate the weights of the instances in the dataset based on the distances from a data point to its k nearest neighbors. In this metric, (k + 1) neighbors are found, and the last one (i.e., the (k + 1)th neighbor) is used only to compute the weights of the first k neighbors, so that the kth neighbor is not assigned a weight of 0. The metric involves the following steps.

i) Find the (k + 1)-nearest neighbors (k is odd) of each data point in a class.

ii) Assign a weight to each of the (k + 1)-nearest neighbors:

$w_{i} = \begin{cases} \frac{d (x_{0}, x_{k + 1}^{NN}) - d (x_{0}, x_{i}^{NN})}{d (x_{0}, x_{k + 1}^{NN}) - d (x_{0}, x_{1}^{NN})}, & \text{if } d (x_{0}, x_{k + 1}^{NN}) \neq d (x_{0}, x_{1}^{NN}) \\ 1, & \text{if } d (x_{0}, x_{k + 1}^{NN}) = d (x_{0}, x_{1}^{NN}) \end{cases}$ (1)

where,

w_i = weight for i-th neighbor for the data point x₀.

d (x_i, x_j) = Euclidean distance between the data points x_i and x_j.

$x_{1}^{NN}$ = closest neighbor of x₀

$x_{k + 1}^{NN}$ = (k + 1)th neighbor of x₀.

It can be seen that according to equation (1) the nearest neighbor is assigned the weight as one and the farthest neighbor (i.e. (k + 1)th neighbor) is assigned the weight as 0.

iii) Normalize the weights as: $w_{i} = \frac{w_{i}}{\sum_{j = 1}^{k} w_{j}}$ (2)

iv) Sum the weights of the nearest neighbours according to their class labels: $W_{i}^{'} = \max \sum_{(x_{i}^{NN}, y_{i}^{NN})} w_{i} \times \delta (y = y_{i}^{NN})$ (3) where y is the class label of x₀, $y_{i}^{NN}$ is the class label of the ith nearest neighbor among the k nearest neighbors of x₀, and δ (.) is the delta function, which returns one if $y = y_{i}^{NN}$ and zero otherwise.

We have used the Euclidean distance to compute the nearest neighbors. For two points a and b with p dimensions, it is defined as: $d (a, b) = \sqrt{\sum_{i = 1}^{p} (a_{i} - b_{i})^{2}}$ (4)

The number of nearest neighbours (k) is considered as an odd value. For the datasets used in the paper, the values of the features have been standardized to put equal weight on each feature in computing the distances.
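The steps of this section can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `wcm_class_weights` is hypothetical, ties in distance are broken by index order, and the sketch returns the summed weight of every class rather than applying the max of equation (3).

```python
import numpy as np

def wcm_class_weights(X, y, idx, k=3):
    """Per-class sums of the normalized neighbor weights for the point X[idx]."""
    x0 = X[idx]
    dists = np.sqrt(((X - x0) ** 2).sum(axis=1))        # Euclidean distance, Eq. (4)
    dists[idx] = np.inf                                 # a point is not its own neighbor
    order = np.argsort(dists, kind="stable")[: k + 1]   # indices of the (k + 1) nearest
    d = dists[order]
    span = d[-1] - d[0]
    # Eq. (1): the nearest neighbor gets 1 and the (k + 1)th gets 0;
    # every weight is 1 when the nearest and (k + 1)th distances are equal.
    w = (d[-1] - d) / span if span > 0 else np.ones(k + 1)
    w = w[:k] / w[:k].sum()                             # Eq. (2), first k neighbors only
    labels = y[order[:k]]
    # Eq. (3) without the max: summed weight of the neighbors of each class.
    return {c: float(w[labels == c].sum()) for c in np.unique(y).tolist()}

# Toy example (assumed data): x0 = 0 has class 1; its neighbors lie at distances 1..4.
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = np.array([1, 1, 0, 1, 0])
W = wcm_class_weights(X, y, idx=0, k=3)
print(W)  # class 1 receives about 0.667 of the weight, class 0 about 0.333
```

In the toy example the unnormalized weights of the three nearest neighbors are 1, 2/3, and 1/3, so the neighbors sharing the label of x₀ hold 2/3 of the total weight. The sketch assumes the features were standardized beforehand, as described above.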

4 Proposed Algorithm: wCM Hybrid Pre-processing

This section provides an overview of the proposed hybrid pre-processing algorithm, which is designed to overcome the following difficulties in a given dataset: class imbalance, class label noise, non-linear class boundaries, and class overlapping. Our approach identifies the noisy data points and then flips their class labels. After that, it replicates those data points of the minority class that lie in difficult areas (such as borderline examples and small disjuncts) and expands the decision borders in favor of the minority class by removing data points from the majority class. It expands the decision boundaries in such a way that the original class distribution is not altered.

This algorithm consists of three phases:

i) First Phase (Labeling the data points): In the first phase, the algorithm calculates the complexity of each data point using the wCM metric (explained in Section 3). After that, it labels each data point as ‘safe,’ ‘noise,’ or ‘difficult.’ Safe data points should be correctly classified by the base classifier, while noise and difficult data points are likely to be misclassified and thus need special attention in the second phase. Here, noise refers to those data points of the majority class (or minority class) that are located deep inside the region of the minority class (or majority class) and thus tend to be misclassified as belonging to the other class. Difficult data points are the borderline data points, most of whose neighbors belong to the opposite class, so they are also likely to be misclassified.

ii) Second Phase (Flipping the noise): The second phase treats the noisy data points by flipping the class label of majority class noisy points to minority class and vice-versa. After that, it oversamples the minority class data points by replicating all difficult data points from the minority class, and the number of copies generated for these minority class points is equal to the number of the examples in their neighborhood from the majority class.

iii) Third Phase (Undersampling the majority class): In its third phase, the algorithm re-calculates the complexity of the majority class data points using the new dataset, which consists of the old data points and the newly generated copies of the minority class. Thereafter, it undersamples ‘difficult’ data points from the majority class.
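The labeling rule of the first phase can be illustrated with a short Python sketch. It is a hedged reading of the description, not the authors' code: the function name `label_point` is hypothetical, and comparing the opposite-class weight with the threshold is an assumption, since the text does not spell out which weight is compared. The threshold values 0.6 (minority class) and 0.7 (majority class) are the ones used in this paper.

```python
# Thresholds used in this paper for flipping noisy points.
THRESHOLD = {"minority": 0.6, "majority": 0.7}

def label_point(w_same, w_other, threshold):
    """Label one data point as 'safe', 'noise', or 'difficult'.

    w_same  : summed wCM weight of the neighbors sharing the point's class
    w_other : summed wCM weight of the neighbors of the opposite class
    Comparing w_other with the threshold is an assumption of this sketch.
    """
    if w_same > w_other:
        return "safe"        # the neighborhood agrees with the point's label
    if w_other > threshold:
        return "noise"       # deep inside the other class: label will be flipped
    return "difficult"       # borderline point: handled by over/undersampling

# A majority class point whose neighborhood is 65% minority by weight is 'difficult',
# while the same neighborhood makes a minority class point 'noise' (0.65 > 0.6).
print(label_point(0.35, 0.65, THRESHOLD["majority"]))
print(label_point(0.35, 0.65, THRESHOLD["minority"]))
```

Keeping the thresholds as parameters makes it easy to study how sensitive the number of flipped labels is to these values.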

This algorithm is presented below.

Here, D denotes the dataset; D’ denotes the new dataset, where D’ = D ∪ {newly generated minority class data points}; P denotes the minority class; and N denotes the majority class.

P_safe, P_noise, and P_difficult denote the safe, noisy, and difficult points of the minority class, respectively.

N_safe, N_noise, and N_difficult denote the safe, noisy, and difficult points of the majority class, respectively.

The algorithm calculates the complexity of data points in D using the wCM metric. Here, label(q) returns the label of the data point q, nearest_neighbor(q, k) finds k nearest neighbors for data point q, and no_of_neighbors(q, k, C) gives the count of neighbors of q which belong to the class C using its k nearest neighbors. For flipping the noisy points, we have used a threshold value, ⌜. For the minority class, we set the threshold value as ⌜=0.6, whereas for the majority class, the threshold value is set to ⌜=0.7. Figure 1 shows the flowchart for the proposed algorithm.

Fig. 1

Flowchart showing the steps involved in the Proposed Algorithm.

Algorithm 1: wCMHybridPre-Processing (D, k)

Following steps calculate complexity for the majority class data points

For each q ∈ N; do

Calculate the nearest neighbors (k + 1) of q in D and assign weights to the neighbors of q using wCM weight formula (refer to steps 1 to 4 of wCM metric, Section 3).

if W(q, N)>W(q,P), then Label the point q as “N_safe”

else if W(q) > ⌜, then label q as “N_noise”; else label q as “N_difficult”

End for loop

Following steps calculate complexity for the minority class data points

For each q ∈ P; do

Calculate the nearest neighbors (k + 1) of q in D and assign weights to the neighbors of q using wCM weight formula (refer to steps 1 to 4 of wCM metric, Section 3).

if W(q, P)>W(q, N), then label the point q as “P_safe”

else if W(q) > ⌜, then label q as “P_noise”; else label q as “P_difficult”

End for loop

Change the labels of “N_noise” data points as “P_safe” and the labels of “P_noise” as “N_safe”

The following steps oversample the minority class by replicating its difficult points.

For each point q ∈ P; do

If (label(q)=”P_difficult”); then

find nearest_neighbors(q, k + 2)

If (no_of_neighbors(q, k + 2, N)>no_of_neighbors(q, k + 2, P)); then

Create copies of q = no_of_neighbors (q, k + 2, N), with label as “N_safe” or “N_difficult”

Else create copies of q= no_of_neighbors(q, k, N), with label as “N_safe” or “N_difficult”

End if

End for loop

D’=D U {new data points generated for minority class}

Under-sampling of the majority class points

Re-calculate the complexity for data points of majority class using D’ data set and re-label the majority class points as “N_safe” or “N_difficult”

For each q ∈ N; do

If (label(q)=”N_difficult”); then

Remove q from N

End for loop

Apply any traditional classifier on D’
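The replication rule of the oversampling steps can be sketched in Python as follows. This is an illustration under assumptions rather than the authors' implementation: the function name `copies_for_difficult_point` is hypothetical, neighbors are found by brute-force Euclidean distance, ties are broken by index order, and the dataset is assumed to have exactly two classes.

```python
import numpy as np

def copies_for_difficult_point(X, y, idx, k, majority_label):
    """Number of copies to create for the difficult minority point X[idx]."""
    x0 = X[idx]
    dists = np.sqrt(((X - x0) ** 2).sum(axis=1))
    dists[idx] = np.inf                          # exclude the point itself
    order = np.argsort(dists, kind="stable")     # neighbors sorted by distance
    wide = y[order[: k + 2]]                     # labels of the (k + 2) nearest neighbors
    n_major_wide = int((wide == majority_label).sum())
    n_minor_wide = len(wide) - n_major_wide      # two-class assumption
    if n_major_wide > n_minor_wide:
        # Majority class dominates the wider neighborhood: use its (k + 2) count.
        return n_major_wide
    # Otherwise fall back to the majority count among the k nearest neighbors.
    return int((y[order[:k]] == majority_label).sum())

# Toy example (assumed data): point 0 is a minority point (label 1).
X = np.arange(7, dtype=float).reshape(-1, 1)
y = np.array([1, 0, 0, 1, 0, 1, 1])
print(copies_for_difficult_point(X, y, idx=0, k=3, majority_label=0))
```

In the toy example, three of the five nearest neighbors belong to the majority class, so three copies of the point would be added to D’.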

Time Complexity Analysis:

$\begin{aligned} \Rightarrow\; & T (C_{1} \cdot O (n) \times (C_{2} \cdot O (d) + C_{3} \cdot O (1) + C_{4} \cdot O (1))) \\ & + T (C_{5} \cdot O (p) \times (C_{6} \cdot O (d) + C_{7} \cdot O (1) + C_{8} \cdot O (1) + C_{9} \cdot O (d))) \\ & + T (C_{10} \cdot O (p) \times (C_{11} \cdot O (1) \times (C_{12} \cdot O (d) + C_{13} \cdot O (1) + C_{14} \cdot O (1) + C_{15} \cdot O (1)))) \\ & + T (C_{16} \cdot O (1)) + T (C_{17} \cdot O (n) \times O (d^{'})) + T (C_{18} \cdot O (n) \times C_{19} \cdot O (1) \times C_{20} \cdot O (1)) \end{aligned}$

On simplification:

$\begin{aligned} \Rightarrow\; & T (C_{1} \cdot O (n) \times C_{234} \cdot O (d)) + T (C_{5} \cdot O (p) \times C_{678} \cdot O (d) + C_{9} \cdot O (d)) \\ & + T (C_{10} \cdot O (p) \times C_{111213141516} \cdot O (d)) + T (C_{17} \cdot O (n \times d^{'})) + T (C_{181920} \cdot O (n)) \end{aligned}$

$\begin{aligned} \Rightarrow\; & T (C_{1234} \cdot O (n \times d)) + T (C_{56789} \cdot O (p \times d)) + T (C_{10111213141516} \cdot O (p \times d)) \\ & + T (C_{17} \cdot O (n \times d^{'})) + T (C_{181920} \cdot O (n)) \end{aligned}$

$\Rightarrow T (C_{1234} \cdot O (n \times d)) + T (C_{5678910111213141516} \cdot O (p \times d)) + T (C_{17181920} \cdot O (n \times d^{'}))$

$\Rightarrow T (C_{1234} \cdot O (n \times d)) + T (C_{5678910111213141516} \cdot O (n \times d)) + T (C_{17181920} \cdot O (n \times d))$

$\Rightarrow T (C_{1234567891011121314151617181920} \cdot O (n \times d))$

$\Rightarrow O (d \times d)$

$\Rightarrow O (d^{2})$

where,

d, d’, p, and n represent the sizes of the dataset D, the new dataset D’, the minority class P, and the majority class N, respectively. C1234 = C1 x C2 + C3 + C4, C56789101112131415 = C5 x C6 + C7 + C8 + C9 + C10 x C11 x C12 + C13 + C14 + C15, and C1617181920 = C16 + C17 + C18 x C19 x C20.

Here, C1, C2, ..., C20 are integer constants representing the time taken by the machine to execute the respective operation. Because each step performs a different operation, a different constant is used as the multiplication factor for each step in the time complexity analysis. O(1) is the asymptotic notation for an operation that takes constant time. Steps 3, 4, 7, 8, 11, 13, 14, 15, 16, 19, and 20 take constant time to execute, so their time complexity is O(1). Thus, the time complexity of the first four steps is T(C1.O(n) x (C2.O(d) + C3.O(1) + C4.O(1))), which simplifies to T(C1234.O(n x d)). Here, O(n) is the complexity of the for loop, and O(d) is the complexity of finding the k nearest neighbors in dataset D and assigning weights to them. Similarly, the time complexity of Steps 5 to 9 is T(C5.O(p) x (C6.O(d) + C7.O(1) + C8.O(1) + C9.O(d))), which simplifies to T(C56789.O(p x d)). The simplified time complexity of the oversampling steps (Steps 10 to 16) is T(C10111213141516.O(p x d)), and that of the under-sampling steps (Steps 17 to 20) is T(C17181920.O(n x d’)), where d’ is the size of the new dataset D’. Here, d’ can be written in terms of d because the size of D’ is expressed as a fraction of the size of the original dataset D. Also, p can be written in terms of n because the number of minority class points is less than or equal to the number of majority class points. Finally, n can be written in terms of d because the majority class points are a subset of the dataset D. Hence, on simplification, the time complexity of the wCMHybridPre-Processing algorithm is O(d²), i.e., quadratic in the size of D.

5 Experimental study

Experiments are conducted on synthetic and real-world datasets downloaded from the KEEL Machine Learning repository. Sub-sections 5.1 and 5.2 describe the synthetic and real-world datasets, respectively.

5.1 Synthetic datasets

We have used synthetic datasets of three different shapes (with different imbalance ratios, noise levels, and sizes), downloaded from the KEEL repository [42]. These differently shaped imbalanced datasets are used to show that our proposed method can handle small disjuncts, borderline examples, and non-linear boundaries. We have also introduced noise into these datasets to show that our proposed algorithm can handle class-label noise.

i) subclus dataset: This dataset consists of five rectangular shape sub-clusters of the minority class.

ii) clover dataset: This is a flower shape dataset with five elliptic petals.

iii) paw dataset: In this dataset, the minority class data points are decomposed into three elliptic sub-regions of varying cardinalities.

These synthetic datasets are shown in Fig. 2, where red dots denote the minority class instances and blue dots denote the majority class instances. For better understanding, we have marked the boundaries of the minority class in Fig. 2. Each of the three datasets is considered in two sizes, 600 and 800 two-dimensional data points, with imbalance ratios of 1 : 5 and 1 : 7, respectively.

Fig. 2

Synthetic datasets considered in this paper.

Introducing noise in synthetic datasets: To generate class noise in these datasets, we flipped the class labels of data points randomly chosen from the majority and minority classes, using Python code in the Anaconda Spyder IDE. The noise levels generated in each dataset are 0%, 10%, 20%, 30%, and 40% (refer to Fig. 2 for the clover-shaped dataset with different noise levels induced by flipping the class labels), resulting in a total of 30 synthetic datasets.
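The label-flipping procedure can be sketched as follows. The original Python code is not reproduced in this paper, so this is only a plausible reconstruction: the function name `flip_labels` is hypothetical, labels are assumed to be encoded as 0 (majority) and 1 (minority), and the noise level is assumed to be a fraction of the whole dataset sampled uniformly without replacement.

```python
import numpy as np

def flip_labels(y, noise_level, seed=0):
    """Return a copy of the binary labels y (0/1) with a fraction of them flipped."""
    rng = np.random.default_rng(seed)            # fixed seed for reproducibility
    y_noisy = np.asarray(y).copy()
    n_flip = int(round(noise_level * len(y_noisy)))
    # Choose distinct positions from both classes and invert their labels.
    flip_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    y_noisy[flip_idx] = 1 - y_noisy[flip_idx]
    return y_noisy

# Example: a 600-point dataset with imbalance ratio 1 : 5 and 10% label noise.
y = np.array([0] * 500 + [1] * 100)
y_noisy = flip_labels(y, noise_level=0.10)
print(int((y_noisy != y).sum()))  # 60 labels differ from the original
```

Because most points belong to the majority class, uniform sampling flips far more majority labels than minority labels; sampling a fixed fraction per class would be an equally plausible reading of the description.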

5.2 Real-world datasets

We have used five real-world imbalanced datasets with different imbalance ratios and sizes, downloaded from the KEEL data repository [43]. Each dataset is used at noise levels of 0%, 5%, 10%, 15%, and 20%, resulting in a total of 25 different datasets. Table 2 summarizes the characteristics of the datasets used in our experiments. The iris, yeast, and wine datasets consist of multiple classes, so we converted them into two-class datasets following suggestions in the literature. Table 3 shows the modifications used in this paper to create the minority and majority classes.

Table 2
Summary of characteristics of real-world datasets

Dataset	#Features	#Instances	#Minority class	#Majority class	I.R.
Iris	4	150	50	100	1 : 2
Wine	13	178	59	119	0.33 : 0.67
Pima	8	768	268	500	1 : 1.9
Sonar	60	208	97	111	0.47 : 0.53
Yeast	8	1484	212	976	1 : 32.8

Table 3

Description of minority and majority classes in real-world imbalanced datasets

Dataset	Minority Class	Majority class
Iris	{setosa}	{versicolor, virginica}
Wine	{1}	{2,3,4,5}
Pima	{positive}	{negative}
Sonar	{R}	{M}
Yeast	{POX, MIT}	{CYT, NUC, VAC, ME1, ME2, ME3, ERL}
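The class grouping of Table 3 amounts to a simple relabeling. The following Python sketch shows one way to do it; the helper names `to_two_class` and `imbalance_ratio` are hypothetical, and the example uses only the iris class names and counts given in Tables 2 and 3.

```python
from collections import Counter

def to_two_class(labels, minority_classes):
    """Map multi-class labels to 'minority' / 'majority' following a grouping."""
    minority_classes = set(minority_classes)
    return ["minority" if c in minority_classes else "majority" for c in labels]

def imbalance_ratio(binary_labels):
    """Number of majority class points per minority class point."""
    counts = Counter(binary_labels)
    return counts["majority"] / counts["minority"]

# Iris: {setosa} is the minority class; {versicolor, virginica} form the majority class.
iris_labels = ["setosa"] * 50 + ["versicolor"] * 50 + ["virginica"] * 50
binary = to_two_class(iris_labels, minority_classes={"setosa"})
print(Counter(binary)["minority"], Counter(binary)["majority"])  # 50 100
print(imbalance_ratio(binary))                                   # 2.0, i.e. 1 : 2
```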

5.3 Classifiers used in this paper

We have used four different classifiers: a decision tree (DT) classifier with the number of splits set to 50, a k-nn classifier with k = 3 and the Euclidean distance, a logistic regression (LR) classifier, and a support vector machine (SVM) classifier with linear and Gaussian kernels. These classifiers are implemented using the Classification Learner Toolbox of MATLAB. We have used 5-fold cross-validation to train and test the models built by the different classifiers. We mainly focus on two evaluation metrics, sensitivity and specificity: sensitivity measures how well the minority class is recognized, and specificity checks the classifier's ability to recognize the majority class.
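For reference, the two evaluation metrics can be computed from the confusion counts as in the Python sketch below. It uses the standard definitions, sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP), and assumes that the minority class is the positive class; the function name `sensitivity_specificity` is hypothetical, and the experiments themselves were run in MATLAB.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Sensitivity and specificity, treating `positive` as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    sensitivity = tp / (tp + fn)   # share of minority points that are recognized
    specificity = tn / (tn + fp)   # share of majority points that are recognized
    return sensitivity, specificity

# Toy predictions (assumed data): 4 minority and 6 majority points.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, round(spec, 4))  # 0.75 0.6667
```

The sketch assumes that both classes occur in `y_true`; otherwise one of the denominators is zero.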

6 Results and discussion

In this study, we have proposed a wCM hybrid pre-processing algorithm. To assess its performance, we conducted a series of experiments comparing its behavior with that of other well-known pre-processing algorithms. Although we calculated sensitivity, specificity, precision, recall, and accuracy values to assess the classifiers, only the sensitivity and specificity values are shown in this paper due to space limits. The situation worsens when, in addition to noise, the imbalance ratio is high and the decision boundary is non-linear (as in the clover dataset). In Table 4, for the clover dataset, the sensitivity values on the original dataset (OD column) are lower for imbalance ratio 1 : 7 than for 1 : 5, and they decrease as the noise level increases. Also, for the paw and subclus datasets, the sensitivity values in the OD column of Table 4 are higher than for the clover dataset, because the paw dataset, with its three elliptical sub-clusters, and the subclus dataset, with its rectangular boundaries, have less complicated shapes than the clover dataset.

Table 4
Synthetic datasets sensitivity measure values for Decision Tree (DT) classifier

Dataset	Size	I.R.	Noise%	OD	Proposed Algo.	SMENN	SMTK	SPIDER	SM	BSM	ADASYN
Clover	600	1 : 5	0	0.71	0.8512	0.8533	0.7933	0.7619	0.84	0.946	0.9489
			10	0.6098	0.8603	0.8322	0.6993	0.9857	0.5734	0.8909	0.8159
			20	0.4710	0.8509	0.8918	0.8144	0.9784	0.7533	0.8203	0.7421
			30	0.3544	0.8397	0.8778	0.7442	0.9790	0.7828	0.8439	0.6981
			40	0.2994	0.84	0.8842	0.6923	0.9659	0.7624	0.7195	0.6855
	800	1 : 7	0	0.66	0.728	0.9814	0.9642	0.9864	0.9729	0.9729	0.9571
			10	0.5328	0.7943	0.9086	0.8885	0.9546	0.8569	0.9263	0.8603
			20	0.2533	0.8636	0.9092	0.8102	0.9098	0.8139	0.8492	0.7892
			30	0.2143	0.8543	0.9078	0.8015	0.9281	0.7896	0.8414	0.7017
			40	0.1831	0.8564	0.9029	0.7671	0.9773	0.7053	0.6831	0.6872
Paw	600	1 : 5	0	0.87	0.9091	0.982	0.9899	0.9583	0.984	0.976	0.9859
			10	0.6585	0.85	0.9182	0.8543	0.9410	0.8428	0.8931	0.8270
			20	0.5289	0.8915	0.8853	0.8197	0.9280	0.7922	0.8377	0.7484
			30	0.4114	0.8726	0.9140	0.7579	0.9585	0.7262	0.7738	0.7165
			40	0.3672	0.9187	0.9267	0.6568	0.9773	0.6454	0.6667	0.6128
	800	1 : 7	0	0.89	0.9712	0.9886	0.9785	1.0	0.9786	0.9814	0.9885
			10	0.6148	0.9720	0.9307	0.8885	0.9429	0.8732	0.9145	0.8326
			20	0.3667	0.9286	0.8939	0.7761	0.9522	0.7385	0.8646	0.7381
			30	0.3407	0.9653	0.9223	0.9681	0.9699	0.7039	0.7363	0.7175
			40	0.2582	0.8622	0.9285	0.69	0.9861	0.7036	0.7615	0.7162
Subclus	600	1 : 5	0	0.97	0.9107	0.984	0.9739	0.9899	0.984	0.982	0.9762
			10	0.7236	0.8926	0.9266	0.8780	0.9898	0.8909	0.9036	0.8365
			20	0.6014	0.8409	0.9156	0.8309	0.9085	0.8312	0.8268	0.7780
			30	0.4241	0.8944	0.9027	0.7299	0.9479	0.7489	0.7669	0.6961
			40	0.4124	0.9057	0.8794	0.7014	0.9519	0.7636	0.7491	0.6437
	800	1 : 7	0	0.95	0.97	0.9857	0.9842	1.0	0.9857	0.9857	0.9857
			10	0.7213	0.9167	0.9454	0.8861	0.9900	0.8569	0.9307	0.8171
			20	0.4667	0.9069	0.8831	0.7576	0.9178	0.7123	0.8415	0.6639
			30	0.3736	0.8125	0.9288	0.7625	0.9495	0.6909	0.7524	0.7123
			40	0.4131	0.8520	0.8944	0.6988	0.9908	0.6899	0.7104	0.6688

6.1 Synthetic datasets results

Tables 4 and 5 show the sensitivity and specificity values, respectively, for the DT classifier on the original datasets (OD) and after applying the proposed algorithm and the existing pre-processing algorithms: SMENN (SMOTE-ENN), SMTK (SMOTE-Tomek Links), SPIDER, SM (SMOTE), BSM (Borderline-SMOTE), and ADASYN. Due to space limitations, the results for the other classifiers are not shown.
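For readers who want to reproduce the two measures, the following is a minimal sketch of how sensitivity and specificity can be computed from true and predicted labels, assuming the minority class is treated as the positive class. The function name `sensitivity_specificity` and the toy labels are illustrative assumptions and are not taken from the experiments reported here.

```python
def sensitivity_specificity(y_true, y_pred, minority_label=1):
    """Return (sensitivity, specificity), treating minority_label as the positive class."""
    tp = fn = tn = fp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == minority_label:
            if predicted == minority_label:
                tp += 1  # minority instance correctly recognised
            else:
                fn += 1  # minority instance missed
        else:
            if predicted == minority_label:
                fp += 1  # majority instance wrongly flagged as minority
            else:
                tn += 1  # majority instance correctly recognised
    # Sensitivity: fraction of minority instances that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of majority instances that are recovered.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels (illustrative only, not data from the paper).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Reporting both values side by side, as in Tables 4 and 5, makes it visible when a gain on the minority class is paid for by a loss on the majority class.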

Table 5
Synthetic datasets specificity measure values for Decision Tree (DT) classifier

Dataset	Size	I.R.	Noise%	OD	Proposed Algo.	SMENN	SMTK	SPIDER	SM	BSM	ADASYN
Clover	600	1 : 5	0	0.944	0.9406	0.9648	0.9312	0.9505	0.932	0.932	0.938
			10	0.8889	0.9688	0.9394	0.9268	0.8817	0.9099	0.8071	0.7275
			20	0.8333	0.9461	0.7806	0.7424	0.5991	0.7078	0.7424	0.6472
			30	0.8552	0.9261	0.7513	0.7217	0.5024	0.6471	0.6041	0.5724
			40	0.8227	0.9596	0.7101	0.6549	0.3656	0.6629	0.6855	0.6584
	800	1 : 7	0	0.9386	0.9494	0.9692	0.96	0.9633	0.9471	0.9486	0.95
			10	0.9439	0.9593	0.8586	0.8097	0.7939	0.7876	0.8466	0.6667
			20	0.92	0.9452	0.7655	0.7569	0.6544	0.7123	0.7077	0.6508
			30	0.9045	0.9436	0.6759	0.6877	0.5709	0.5874	0.6181	0.5712
			40	0.8501	0.9614	0.5837	0.5724	0.3308	0.6036	0.6265	0.5451
Paw	600	1 : 5	0	0.968	0.9899	0.9894	0.974	0.9838	0.968	0.974	0.972
			10	0.9539	0.9919	0.8829	0.8889	0.8344	0.8386	0.8449	0.7254
			20	0.8506	0.9694	0.7849	0.8117	0.6492	0.7597	0.7511	0.6494
			30	0.8688	0.9821	0.7254	0.7489	0.4916	0.6697	0.5769	0.6041
			40	0.8369	0.9633	0.6728	0.7423	0.4814	0.6218	0.5272	0.5461
	800	1 : 7	0	0.9886	0.9785	0.9851	0.9871	0.9812	0.9757	0.9814	0.9829
			10	0.9543	0.9885	0.8977	0.8746	0.8902	0.8348	0.8628	0.7168
			20	0.9154	0.9798	0.7772	0.8169	0.5952	0.7908	0.7125	0.6015
			30	0.8772	0.9699	0.7109	0.7621	0.4544	0.6440	0.6036	0.6197
			40	0.8382	0.9556	0.7197	0.6354	0.3934	0.6133	0.5928	0.4685
Subclus	600	1 : 5	0	0.992	0.9802	0.9844	0.974	0.9771	0.986	0.98	0.982
			10	0.9203	0.9602	0.8875	0.8826	0.8717	0.8302	0.1006	0.7925
			20	0.8636	0.9607	0.8182	0.7900	0.7859	0.7338	0.7359	0.6883
			30	0.8394	0.9501	0.8359	0.7579	0.5525	0.7964	0.6855	0.6697
			40	0.7754	0.9412	0.8794	0.7014	0.9519	0.7636	0.7490	0.6437
	800	1 : 7	0	0.9929	0.9874	0.9736	0.9857	0.9957	0.9714	0.9799	0.9671
			10	0.9598	0.9737	0.8695	0.8614	0.855	0.8378	0.9027	0.76684
			20	0.8892	0.9832	0.7612	0.8139	0.6105	0.7769	0.7708	0.7215
			30	0.8694	0.9705	0.7023	0.7654	0.4334	0.6990	0.7346	0.6165
			40	0.8348	0.967	0.6808	0.7394	0.3047	0.6797	0.6371	0.6688

6.1.2 Comparing the performance of the proposed algorithm with pre-processing methods

In this sub-section, we compare the performance of the proposed algorithm with state-of-the-art pre-processing algorithms, namely SMENN, SMTK, SPIDER, SM, BSM, and ADASYN, using sensitivity and specificity values.

As observed in Table 4, SPIDER yields the highest sensitivity values for the minority class in most cases. On the other hand, SPIDER substantially decreases the specificity values (refer to Table 5) for all the datasets containing noise. The same trade-off between sensitivity and specificity is observed for SMENN, SMTK, SM, BSM, and ADASYN. In contrast, on applying the proposed wCM-based hybrid pre-processing algorithm, the sensitivity values (refer to Table 4) improve for all the datasets at the different noise levels. The proposed algorithm also shows improved specificity values (refer to Table 5) compared to the other pre-processing algorithms.

To better understand how minority class instances are regenerated and how noise and other difficult points are removed, Fig. 4(a)–(g) shows the clover dataset after applying the various pre-processing algorithms. Due to space limitations, we present only the clover dataset with size = 800 and noise = 30%. Here, the red dots represent the minority class and the blue dots represent the majority class. Figure 4(a) shows the result of pre-processing by the proposed algorithm, which works well on noisy imbalanced datasets: it effectively regenerates the minority class instances and under-samples the majority class instances without much affecting the class boundaries. The other pre-processing algorithms, however, are not able to identify the noise correctly and therefore regenerate even the noisy points of the minority class, which in turn distorts the original clover shape of the minority class data points (refer to Fig. 4(b)–(g)).

For challenging datasets (i.e., with a high imbalance ratio, small dataset size, small disjuncts, high noise level, non-linear boundaries, and a high number of borderline examples), k-NN, SVM with Gaussian kernel, and DT give good results, whereas SVM with linear kernel and logistic regression fail to generalize well because of the non-linear boundaries and difficult shapes.
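The following sketch illustrates why interpolation-based over-sampling can regenerate noisy minority points. It is a generic SMOTE-style interpolation written for illustration only, not the proposed algorithm and not the exact implementation of any of the compared methods; the function name `smote_like_samples`, the parameter `k`, and the toy coordinates are assumptions.

```python
import math
import random


def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (SMOTE-style)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Four clean minority points on the unit square, plus one hypothetical
# mislabelled point far away from them.
clean = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
with_noise = clean + [(10.0, 10.0)]

for name, points in (("clean", clean), ("with noise", with_noise)):
    new_points = smote_like_samples(points, n_new=50)
    outside = sum(1 for x, y in new_points if not (0 <= x <= 1 and 0 <= y <= 1))
    print(f"{name}: {outside} of {len(new_points)} synthetic points leave the unit square")
```

Whenever the mislabelled point is drawn as the base point, the synthetic instance lies on a segment between it and a genuine minority point, so the noise is propagated into the region between the classes. This is the kind of shape distortion visible in Fig. 4(b)–(g).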

Fig. 3

(a) clover shape with 10% noise; (b) clover shape with 20% noise; (c) clover shape with 30% noise; (d) clover shape with 40% noise.

Fig. 4

Clover dataset with imbalance ratio = 1 : 7, dataset size = 800, noise = 30% after applying: (a) proposed algorithm; (b) SMENN; (c) ADASYN; (d) BSM; (e) SPIDER; (f) SM; (g) SMTK.

6.2 Real-world datasets results

To compare the data generation and class-label noise treatment of the proposed algorithm with those of the other algorithms, we provide figures for the iris dataset with 20% class-label noise. Figure 5(a) shows the original imbalanced dataset, where x-marks, circles, and squares represent the noise, majority data points, and minority data points, respectively. Figure 5(b)–(h) shows the data distribution after applying the proposed algorithm, SMENN, SMTK, SM, BSM, ADASYN, and SPIDER, respectively, where circles and squares represent the majority and minority data points. One can observe from Fig. 5 that SM, BSM, ADASYN, and SMTK cannot handle class-label noise. SMENN, SPIDER, and the proposed algorithm do handle the noise; however, SMENN performs more in-depth cleaning and therefore removes a large number of majority class instances, which shifts the class boundaries in favor of the minority class. SPIDER also removes more majority class instances and cannot handle class-label noise instances located deep inside the other class region. The proposed algorithm handles the class-label noise well while regenerating minority class points and deleting majority class points without much affecting the actual class boundaries.

Tables 6 and 7 show that the specificity and sensitivity values are mostly highest for SMENN, SPIDER, and the proposed algorithm. However, as discussed above, SMENN and SPIDER do not treat the noise well: SMENN over-cleans the majority class, and SPIDER leaves noise located deep inside the other class region. In Tables 6 and 7, for the sonar dataset with noise levels of 0%, 5%, and 10%, the ADASYN algorithm is not able to do the undersampling since the imbalance ratio between the minority and majority classes is not high; the corresponding cells are therefore marked “no us”, i.e., no undersampling.
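As an illustration of what a given class-label noise level means, the sketch below flips the labels of a randomly chosen fraction of the instances in a binary dataset. Uniform random flipping is an assumption made here for illustration and may differ from the noise-injection procedure actually used in the experiments; the function name `inject_label_noise` is hypothetical.

```python
import random


def inject_label_noise(labels, noise_fraction, seed=0):
    """Flip the binary class label (0 <-> 1) of a random noise_fraction of the instances."""
    rng = random.Random(seed)
    n_flip = round(noise_fraction * len(labels))
    # Indices of the instances whose labels will be flipped.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    noisy = [1 - y if i in flip_idx else y for i, y in enumerate(labels)]
    return noisy, sorted(flip_idx)


# Binary labels with the class sizes listed for Iris in Table 2
# (50 minority instances labelled 1, 100 majority instances labelled 0).
labels = [1] * 50 + [0] * 100
noisy_labels, flipped = inject_label_noise(labels, noise_fraction=0.20)
print(len(flipped), "of", len(labels), "labels flipped")
```

Keeping the list of flipped indices makes it possible to check afterwards how many of the injected noisy instances a pre-processing algorithm actually removed or relabelled.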

Fig. 5

Iris dataset a) with 20% noise; and after applying b) proposed algorithm; c) SMENN; d) SMTK; e) SM; f) BSM; g) ADASYN; h) SPIDER.

Table 6

Real-world dataset specificity measure values for the performance comparison

Dataset	Noise%	Original	Proposed Algo	SMENN	SMTK	SPIDER	SM	BSM	ADASYN
Iris	0	1	1	1	1	1	1	1	1
	5	1	1	1	0.96	1	0.96	0.96	0.96
	10	0.9524	1	1	0.96	0.9877	0.98	0.96	0.96
	15	0.9311	1	1	0.93	1	0.93	0.92	0.92
	20	0.9205	1	1	0.93	0.9877	0.91	0.91	0.91
Wine	0	0.9789	0.9785	1	0.98	0.9892	0.98	0.99	0.96
	5	0.9783	0.9674	0.99	0.96	0.8889	0.97	0.88	0.84
	10	0.9444	0.9891	1	0.91	0.8706	0.92	0.79	0.78
	15	0.9541	1	0.92	0.93	0.7647	0.94	0.7	0.69
	20	0.8795	0.9888	0.98	0.82	0.7632	0.92	0.75	0.69
Pima	0	0.805	0.9245	0.94	0.73	0.7204	0.75	0.69	0.69
	5	0.7947	0.9221	0.94	0.7	0.6578	0.72	0.68	0.66
	10	0.7222	0.9014	0.93	0.66	0.5437	0.67	0.64	0.59
	15	0.75	0.9063	0.94	0.7	0.5403	0.74	0.68	0.59
	20	0.6563	0.869	0.99	0.7	0.434	0.69	0.6	0.59
Sonar	0	0.9326	0.8876	0.79	0.55	0.8395	0.57	0.56	no us
	5	0.9048	0.9432	0.98	0.89	0.8108	0.9	0.86	no us
	10	0.8642	0.8243	0.98	0.84	0.6912	0.88	0.86	no us
	15	0.84	0.6324	0.98	0.9	0.5256	0.93	0.84	0.78
	20	0.8194	0.9481	1	0.91	0.7869	0.93	0.9	0.89
Yeast	0	0.9406	0.9722	1	0.8	0.84	0.81	0.9	0.88
	5	0.9477	0.9662	1	0.8	0.8444	0.79	0.89	0.87
	10	0.9436	0.9692	1	0.79	0.8509	0.79	0.89	0.88
	15	0.9426	0.9643	1	0.8	0.8411	0.8	0.89	0.88
	20	0.9365	0.9643	1	0.78	0.9489	0.8	0.9	0.87

Table 7

Real-world datasets sensitivity values for the performance comparison

Dataset	Noise%	OD	Proposed Algo	SMENN	SMTK	SPIDER	SM	BSM	ADASYN
Iris	0	1	1	1	1	1	1	1	1
	5	1	1	1	0.99	1	1	0.96	1
	10	1	1	1	1	1	1	1	1
	15	0.9394	1	1	0.99	1	0.98	0.97	0.98
	20	0.8125	0.9756	1	0.98	1	0.99	1	0.98
Wine	0	1	1	1	1	1	1	1	1
	5	0.96	1	1	0.99	1	0.99	0.99	1
	10	0.9038	0.9796	1	0.98	0.9577	0.94	1	1
	15	0.8545	1	1	0.92	1	0.91	0.95	0.93
	20	0.8475	0.9483	1	0.91	0.957	0.91	0.93	0.93
Pima	0	0.514	0.9125	0.97	0.92	0.9774	0.9	0.88	0.88
	5	0.5471	0.8986	0.98	0.86	0.9843	0.84	0.86	0.86
	10	0.4961	0.9181	0.98	0.75	0.9877	0.79	0.77	0.79
	15	0.5432	0.9167	0.91	0.73	0.9851	0.73	0.68	0.79
	20	0.5714	0.9062	0.92	0.68	0.9753	0.67	0.63	0.79
Sonar	0	0.7532	0.9444	0.81	0.69	0.9804	0.69	0.71	no us
	5	0.7439	0.9245	0.98	0.79	1	0.75	0.73	no us
	10	0.7176	0.9695	0.97	0.72	0.975	0.73	0.68	no us
	15	0.6923	0.9688	0.95	0.65	0.9794	0.68	0.65	0.62
	20	0.6915	0.9216	0.98	0.67	0.9786	0.69	0.67	0.67
Yeast	0	0.5189	0.9338	0.96	0.98	0.9779	0.97	0.76	0.69
	5	0.5071	0.9455	0.98	0.98	0.9817	0.98	0.76	0.67
	10	0.545	0.9183	0.97	0.98	0.9908	0.97	0.72	0.65
	15	0.5308	0.9339	0.97	0.98	0.9743	0.97	0.76	0.69
	20	0.5118	0.9416	0.95	0.98	1	0.98	0.73	0.67

6.3 Observations based on the experimental results

In this paper, we consider the problem of learning a classification model from an imbalanced dataset. We used challenging imbalanced synthetic and real-world datasets with noise. The following observations were made through the experimental analysis:

The traditional classifiers are able to generalize well for the minority class if the imbalanced dataset does not suffer from the other intrinsic problems such as noise, small disjuncts, non-linear boundaries, small dataset size and overlapping classes.

For the imbalanced datasets, the sensitivity value for the minority class decreases sharply on increasing the noise, borderline examples and small disjuncts.

Learning the minority class in imbalanced datasets with non-linear boundaries is more difficult than in imbalanced datasets with linear boundaries.

The learning task for the minority class becomes even harder when an imbalanced dataset with non-linear boundaries also suffers from noise.

Summarizing our findings from the experiments conducted, the proposed algorithm works well for datasets suffering from class imbalance, non-linear decision boundaries, borderline examples, small disjuncts, and class-label noise. It improves the performance on both the minority class and the majority class without affecting the original class boundaries. The images of the datasets after applying the proposed algorithm also confirm that it intelligently treats the noisy and borderline data points and re-balances the distribution of data points between the minority and majority classes.

7 Conclusion and future scope

In this paper, we considered the problem of learning from an imbalanced dataset that also suffers from other intrinsic factors such as non-linear decision boundaries, borderline examples, small disjuncts, and noise. We carried out an experimental study to show the usefulness of our proposed algorithm based on the wCM metric, conducting a wide range of experiments on synthetic and real-world datasets. These experiments show that the proposed algorithm can handle class-label noise along with the imbalance ratio, non-linear decision boundaries, borderline points, and small disjuncts occurring together in a dataset. We hope that the proposed algorithm gives researchers more insight into the conditions under which a particular pre-processing algorithm should be applied to enhance classifier performance on challenging datasets. The algorithm can be further improved to work effectively for multi-class classification problems. Moreover, we plan to propose a framework based on data complexity to handle the class imbalance problem together with other intrinsic dataset factors.



Table 2
Summary of characteristics of real-world datasets

Dataset	#Features	#Instances	#Minority class	#Majority class	I.R.
Iris	4	150	50	100	1 : 2
Wine	13	178	59	119	0.33 : 0.67
Pima	8	768	268	500	1 : 1.9
Sonar	60	208	97	111	0.47 : 0.53
Yeast	8	1484	212	976	1 : 32.8