A quantum-based oversampling method for classification of highly imbalanced and overlapped data

Abstract

Data imbalance is a challenging problem in classification tasks, and when combined with class overlapping, it further deteriorates classification performance. However, existing studies have rarely addressed both issues simultaneously. In this article, we propose a novel quantum-based oversampling method (QOSM) to effectively tackle data imbalance and class overlapping, thereby improving classification performance. QOSM utilizes the quantum potential theory to calculate the potential energy of each sample and selects the sample with the lowest potential as the center of each cover generated by a constructive covering algorithm. This approach optimizes cover center selection and better captures the distribution of the original samples, particularly in the overlapping regions. In addition, oversampling is performed on the samples of the minority class covers to mitigate the imbalance ratio (IR). We evaluated QOSM using three traditional classifiers (support vector machines [SVM], k-nearest neighbor [KNN], and naive Bayes [NB] classifier) on 10 publicly available KEEL data sets characterized by high IRs and varying degrees of overlap. Experimental results demonstrate that QOSM significantly improves classification accuracy compared to approaches that do not address class imbalance and overlapping. Moreover, QOSM consistently outperforms existing oversampling methods tested. With its compatibility with different classifiers, QOSM exhibits promising potential to improve the classification performance of highly imbalanced and overlapped data.

Keywords

Classification, class imbalance, class overlapping, quantum potential energy, oversampling

Impact statement

Accurate classification and analysis of imbalanced and overlapped data, which are prevalent in the fields of biology and medical sciences, are crucial for understanding complex biological systems, including disease diagnosis, drug discovery, and personalized medicine. This research introduces quantum-based oversampling method (QOSM), a novel oversampling method based on quantum potential theory, addressing challenges in highly imbalanced and overlapped data sets. QOSM improves sample distribution representation, especially in overlapped regions, by selecting cover centers using potential energy. Compared to existing methods like SMOTE (synthetic minority oversampling technique), QOSM considers individual sample and cover relationships, effectively reducing mislabeled synthetic samples. Extensive evaluations on 10 data sets with high imbalance ratios and overlap demonstrate that QOSM significantly improves classification accuracy, outperforming other methods across multiple classifiers, such as support vector machines (SVM), k-nearest neighbor (KNN), and naive Bayes (NB) classifier. Therefore, QOSM presents a novel and effective approach to address imbalance and overlap in classification, with great potential for application in various biological and medical data sets.

Introduction

Improving the performance of classification on imbalanced data, where the number of samples of the majority class greatly exceeds that of the minority class, has been a significant challenge in the field of machine learning. Imbalanced classification is prevalent in various fields, such as biological science and medical diagnosis. For instance, it is observed in imbalanced class learning in epigenetics,¹ readmission risk prediction in clinical and medical fields to support diagnostic decision-making,² and chemical classification and toxicity analysis.³ When traditional classifiers are applied to these imbalanced data sets, they often exhibit deficiencies.⁴ Consequently, numerous researchers have been motivated to develop different approaches to address the imbalanced data set problem and improve classification accuracy.⁵

Research in this area can be categorized into algorithm- and data-level approaches. Algorithm-level approaches involve the development of new algorithms or modifications to existing ones,⁶ tailored to the characteristics of imbalanced data, with the aim of achieving higher classification accuracy. On the other hand, data-level approaches primarily focus on reducing the imbalance ratio (IR),⁷ which is defined as the ratio of the number of majority class samples to the number of minority class samples. This reduction is typically accomplished through resampling techniques,⁸ which aim to balance the distribution of samples across different classes. In data-level approaches, classification algorithms and data remain independent of each other, allowing existing classifiers to be applied to classification tasks without significant modification. In this article, our focus is on a data-level approach for addressing the challenges posed by imbalanced data. We employ existing classifiers for classification purposes while leveraging techniques to handle the imbalance in the data set.

Classification of imbalanced data has seen widespread use of downsampling and oversampling techniques. Downsampling involves randomly eliminating samples from the majority class, which may result in the loss of valuable information contained within these removed samples. Conversely, in oversampling techniques, samples from the minority class are randomly selected and added repeatedly to the training data, potentially leading to overfitting. Recognizing the limitations of downsampling and oversampling, researchers have proposed methods to generate synthetic data instances using samples from the minority class for imbalance handling. Notable among those methods are SMOTE (synthetic minority oversampling technique)⁹ and its variants, which are introduced in the next section of this article. Many publications have demonstrated the effectiveness of these synthetic data generation methods in addressing class imbalances. These methods have been shown to frequently outperform traditional downsampling and oversampling techniques.⁵ Therefore, our focus in this article is on oversampling techniques that generate synthetic data instances utilizing samples from the minority class.

Many real-world data sets exhibit not only class imbalance but also class overlap. Class overlap occurs when the samples from two classes share similar features but have different class labels, making it difficult to separate them. When a data set is both highly imbalanced and overlapped, the combined effect of these two factors becomes more intricate.¹⁰,¹¹ In such cases, defining the decision boundary for classification becomes difficult, and traditional classifiers are prone to misclassifying the minority class samples in the overlapping area as belonging to the majority class. As a consequence, the classification boundary tends to shift toward the minority class area, as depicted in Figure 1.

Figure 1.

(a) The classification boundary of the original data set. (b) The classification boundary after changing the class overlap distribution.

Most existing approaches have been designed to address either class imbalance or class overlap independently. However, in numerous real-world applications, these problems coexist, and data sets often exhibit higher imbalance and overlap ratios.¹²,¹³ A comprehensive review and analysis of previous studies on imbalance and class overlap can be found in Vuttipittayamongkol et al.¹¹ and Santos et al.¹⁴ The analysis concluded that class overlap can have varying degrees of negative impact on classification results, whereas imbalance does not always have such an effect. However, when imbalance and class overlap are present simultaneously, the influence of class overlap becomes more pronounced. Consequently, studying data sets that are both highly imbalanced and highly overlapped is of significant importance.

In this article, we propose a novel quantum-based oversampling method (QOSM) that addresses the challenges of data imbalance and overlapping simultaneously, with the goal of improving classification performance. QOSM utilizes cover structures to describe the data distribution and determine the decision boundary of the classification. Specifically, it focuses on changing the minority class distribution in the overlapping area. Each cover is represented by its center and radius.

Existing cover construction algorithms (CCAs)¹⁵ often rely on random selection of cover centers, which may not be suitable, especially in the overlapping area. To tackle this limitation, we introduce the principles of quantum theory to describe the data distribution and determine cover centers. In quantum theory, the distribution states of microscopic particles within an energy field are determined by their energy levels. Particles with low potential energy exhibit increased stability and tend to cluster together. In the context of data analysis, we apply the concept of “quantum potential” to characterize data samples. This potential energy field is computed based on the distances or similarities between data samples. Quantum potential theory offers a distinctive approach to address clustering or covering challenges because it takes into account the intricate interactions and relationships among data points, inspired by principles from quantum mechanics. Motivated by these observations and the pioneering work of Horn,¹⁶ we leverage quantum potential theory to optimize the process of determining cover centers in the CCA. In this approach, each data sample is analogously treated as a particle, and the particular samples with zero or minimum potential energy are considered central and are surrounded by more neighboring samples, making them suitable candidates for cover centers or cluster representatives.

QOSM aims to tackle the main cause of boundary migration in the classification of highly imbalanced and overlapped data. By increasing the number of minority class samples in the overlapping region, it can alter the minority class distribution, and partially shift the classification boundary toward the majority class, as illustrated in Figure 1(b). This method effectively addresses class overlapping while simultaneously reducing the IR. We expect that the oversampling method with quantum-based cover construction will improve classification accuracy for highly imbalanced and overlapped data.

The main contributions of this article are as follows:

We propose a novel quantum-based oversampling method for the classification of highly imbalanced and overlapped data.

The CCA is improved by incorporating the quantum potential energy theory, which better captures the spatial distribution of data in the overlapped area.

We demonstrate the superiority of QOSM over the state-of-the-art sampling methods through experiments on several extremely imbalanced and overlapped real-world data sets.

Related work

Several factors significantly impact the performance of classification on the imbalanced and overlapped data, including IR, the degree of overlap, the quantity of training samples, the ratio of noise samples, and the severity of intraclass sub-aggregation. Data-level sampling methods aim to mitigate the influence of these factors and improve classification performance. SMOTE⁹ is a conventional oversampling method used to reduce the IR by synthesizing new minority class instances between existing minority class samples. However, SMOTE has limitations, such as potentially generating mislabeled synthetic samples and increasing the risk of class overlap, especially in data sets containing noise. Consequently, learning from data sets processed by SMOTE often leads to a higher false prediction rate.¹⁷ To address these issues, several variants of SMOTE have been developed. Han proposed the borderline-SMOTE method,¹⁸ which focuses on oversampling borderline samples under the assumption that samples located in the boundary areas between the two classes play a more crucial role in classification. He et al.¹⁹ introduced the ADASYN (Adaptive Synthetic) method, which adaptively synthesizes new instances by considering the density distribution information of minority class samples. Clustering-based methods, such as Cluster_SMOTE²⁰ and DBSMOTE²¹ take into consideration dense or sparse subclusters of minority and majority instances.

Class overlap generally has a detrimental effect on classification performance. Existing methods for addressing this problem can be categorized as follows. One approach involves using a kernel function to map the original data to a high dimensional feature space, aiming to increase the separability between classes.²² Another category of methods focuses on addressing class overlap through data cleaning techniques, such as SMOTE + ENN (Edited Nearest Neighbors) and SMOTE + TomekLinks.²³ In the SMOTE–ENN algorithm, synthetic instances are generated using SMOTE based on the minority samples. The class label of each synthetic instance is compared with the majority vote of its k-nearest neighbors (KNNs). Only the instances with consistent class labels are retained. Our previous work also demonstrated that the removal of mislabeled instances in SMOTE–ENN can improve the classification accuracy.³ Vuttipittayamongkol proposed a neighborhood-based undersampling framework for handling class imbalance in binary data sets, which involves removing potentially overlapping data points and enhancing the visibility of minority class instances.¹² Another interesting approach involves detecting overlapping regions and employing different strategies for classification in these regions.²⁴ However, these methods may lose important information in the data sets due to undersampling, or fail to fully leverage the distribution and data structure within the overlapping region.

In recent years, there has been significant research activity in developing and improving quantum-based methods for clustering and classification. Quantum clustering methods were initially introduced by Horn and Gottlieb¹⁶ who drew inspiration from the observation that data points in a feature space can be associated with a Schrödinger equation, where the potential is determined by the data. Weinstein and Horn²⁵ further advanced this approach by developing a dynamic quantum clustering method using a time-dependent Schrödinger equation. Scott et al.²⁶ demonstrated that quantum mechanics serves as the foundation for several useful data clustering methods. Maignan and Scott²⁷ explored the mathematical task of finding all the roots of the exponential polynomial corresponding to the minima of a two-dimensional quantum potential and established that, if the points are contained within a square region of a specific size, there exists only one minimum. Decheng et al.²⁸ introduced a distance measure that enhances quantum clustering analysis by projecting non-spherical overlapping data in the Euclidean space onto a weighted Euclidean space, effectively creating non-overlapping data. Tian et al.²⁹ proposed a quantum clustering ensemble (QCE) technique derived from quantum mechanics, offering another approach to clustering tasks. Quantum clustering has also been successfully applied in data mining applications, such as outlier detection and text analysis.³⁰ Li and Kais³¹ presented a quantum algorithm for data classification based on the nearest neighbor learning algorithm. These methods have demonstrated the potential of quantum approaches in clustering and classification tasks. However, these existing methods directly applied the quantum theory for clustering and classification, without specifically addressing the challenges posed by the imbalance and overlapping data sets. 
Our QOSM takes a different approach by utilizing quantum theory to construct optimal covers from overlapped data. These covers are then employed to oversample the imbalanced data sets, effectively addressing both imbalance and overlap. Notably, our novel oversampling method is independent of classifiers, allowing for the applications of various advanced classification methods in subsequent classification tasks.

Materials and methods

We introduce a novel quantum-based oversampling method designed specifically for highly imbalanced and overlapped data in classification tasks. Our approach utilizes CCA¹⁵,³² to create a set of disjoint covers that represent the data distribution. Each cover consists of a center and radius, and samples within a cover share the same class label. While CCA provides a foundation for exploring spatial characteristics, the selection of the cover center is traditionally random, which may lead to suboptimal classification performance, especially for highly imbalanced and overlapped data. To address this, we improve the cover center selection by ranking the quantum potential energy of each sample, resulting in more representative covers. Subsequently, we utilize these optimized covers to investigate the sample distribution, with a particular focus on identifying minority class samples in the overlapping area.

Our approach considers both individual sample relationships and cover relationships to capture the data set’s sample distribution. Initially, we apply the improved cover construction method to the original data set, taking into account the distribution characteristics of each individual sample. In our approach, we partition the minority class sample space into a set of covers, while also dividing the majority class samples into their respective covers. Our subsequent step involves identifying the minority class covers within the overlapping area. Typically, samples located in the regions of class overlap and the boundary region between classes are more susceptible to misclassification. To address this, we employ the KNN method to explore the distribution relationship between covers and extract the minority class covers situated in the overlapping area. Once the relevant covers are identified, we perform oversampling on the samples belonging to the minority class covers, generating synthetic instances to address the data imbalance. For a more detailed understanding of our approach, in the following subsections, we provide an overview of the CCA that outlines the process of cover construction, a discussion of quantum potential theory that explains its role in optimizing cover selection, and the technical details of our proposed QOSM that encompasses the steps involved in identifying minority class covers in the overlapping area and conducting oversampling to generate synthetic instances.

Cover construction algorithms

CCAs play an important role in our proposed method. CCA is a domain construction algorithm that constructs covers based on the distance between samples.¹⁵ Each cover consists of multiple samples and is characterized by a cover center and cover radius. These covers exhibit the following characteristics: (1) samples within the same cover share the same class label; (2) if samples with the same label exhibit significant differences, they are assigned to separate covers; and (3) samples belonging to different categories cannot be projected onto the same cover.

To provide an overview of the CCA, let us consider an input data set $X = \{(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{p}, y_{p})\}$, where $p$ represents the number of samples in the data set $X$, $x_{i} = (x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{n})$ and $y_{i}$ denote the $i$th sample and its corresponding label, $n$ represents the dimension of sample $x_{i}$, and $x_{i}^{j}$ represents the $j$th feature of the sample $x_{i}$. The output of the CCA is a set of covers $C$, defined as $C = \{C_{1}^{1}, \dots, C_{1}^{n_{1}}, C_{2}^{1}, \dots, C_{2}^{n_{2}}, \dots, C_{m}^{1}, \dots, C_{m}^{n_{m}}\}$, where $C_{i}^{j}$ is the $j$th cover of class $i$ and $C_{i} = \cup_{j} C_{i}^{j}$ encompasses all the covers of class $i$. The main steps of CCA are similar to those in Zhang and Zhang¹⁵ and Guliyev and Ismailov³² and are briefly described as follows:

Step 1: Data normalization

Normalize the feature values of each sample to [0,1] using the max–min normalization technique, as shown in equation (1):

x' = \frac{x - \mathrm{MinValue}}{\mathrm{MaxValue} - \mathrm{MinValue}}

(1)
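As a minimal illustration of equation (1), the following Python sketch scales each feature column to [0, 1]. The function name `min_max_normalize` and the handling of constant features (mapped to 0) are assumptions made for this example and are not specified in the text.

```python
import numpy as np

def min_max_normalize(X):
    """Scale each feature (column) of X to [0, 1] as in equation (1)."""
    X = np.asarray(X, dtype=float)
    min_val = X.min(axis=0)
    max_val = X.max(axis=0)
    span = max_val - min_val
    # Assumption: a constant feature (span == 0) is mapped to 0
    # instead of dividing by zero.
    span_safe = np.where(span == 0, 1.0, span)
    return (X - min_val) / span_safe
```

In practice the minimum and maximum would presumably be taken from the training data only and reused for test samples.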

Step 2: Projection

Project the data set $X$, initially in an $n$-dimensional sphere $S^{n}$, onto an $(n + 1)$-dimensional space $S^{n + 1}$ using equation (2):

T : X \to S^{n + 1}, \quad T (x) = \left(x, \sqrt{R^{2} - {| x |}^{2}}\right), \quad R \geq \max \{| x | : x \in X\}

(2)

where $R$ denotes the radius of the sphere $S^{n}$, which is set to the maximum value of $| x_{i} |$. $x_{i} = (x_{i}^{1}, x_{i}^{2}, \dots, x_{i}^{n})$ represents the $i$th sample of the data set $X$ and $| x_{i} | = \sqrt{{(x_{i}^{1})}^{2} + {(x_{i}^{2})}^{2} + \dots + {(x_{i}^{n})}^{2}}$.
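The projection in equation (2) can be sketched in Python as follows. The function name `project_to_sphere` is hypothetical, and the clipping of small negative values under the square root is an added numerical safeguard rather than part of the method.

```python
import numpy as np

def project_to_sphere(X, R=None):
    """Append the coordinate sqrt(R^2 - |x|^2) to every sample (equation (2))."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1)
    if R is None:
        R = norms.max()  # R is set to the maximum |x_i|, as stated in the text
    # Clip guards against tiny negative values caused by rounding.
    extra = np.sqrt(np.clip(R ** 2 - norms ** 2, 0.0, None))
    return np.hstack([X, extra[:, None]]), R
```

After this step every projected sample has norm $R$, so a larger inner product between two projected samples corresponds to a smaller distance between them.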

Step 3: Cover construction

Each cover is determined by a cover center $x_{k}$ and radius $r$. The cover center is randomly selected, and the cover radius is determined as the average of $d_{\mathrm{inter}}$ and $d_{\mathrm{inner}}$, satisfying $d_{\mathrm{inner}} < d_{\mathrm{inter}}$, as illustrated in Figure 2. Here, $d_{\mathrm{inter}}$ represents the minimum distance of samples with different class labels from $x_{k}$, while $d_{\mathrm{inner}}$ denotes the maximum distance between samples with the same class label from the cover center $x_{k}$. The distance between samples is measured by the vector inner product, as shown in equation (3). The cover radius $r$ can be obtained by equation (4) through equation (6).

\langle x_{k}, x_{i} \rangle = x_{k}^{1} x_{i}^{1} + \cdots + x_{k}^{n + 1} x_{i}^{n + 1}, \quad i \in \{1, 2, \dots, p\}

(3)

d_{\mathrm{inter}} = \max_{y_{i} \neq y_{k}} \{\langle x_{k}, x_{i} \rangle\}, \quad i \in \{1, 2, \dots, p\}

(4)

d_{\mathrm{inner}} = \min_{y_{i} = y_{k}} \{\langle x_{k}, x_{i} \rangle \mid \langle x_{k}, x_{i} \rangle > d_{\mathrm{inter}}\}, \quad i \in \{1, 2, \dots, p\}

(5)

r = \frac{d_{\mathrm{inter}} + d_{\mathrm{inner}}}{2}

(6)

Figure 2.

Calculating the cover radius $r$. $d_{\mathrm{inter}}$ represents the minimum distance of samples with different class labels from the cover center, while $d_{\mathrm{inner}}$ denotes the maximum distance between samples with the same class label from the cover center.

Equation (4) calculates the maximum inner product, which represents the minimum distance between the cover center and samples of a different class. Here, $y_{k}$ and $y_{i}$ denote the categories of cover center $x_{k}$ and sample $x_{i}$, respectively. The condition $y_{i} \neq y_{k}$ indicates that $x_{k}$ and $x_{i}$ belong to different categories. The value $p$ represents the number of samples in data set $X$. Similarly, equation (5) calculates the minimum inner product, representing the maximum distance between samples of the same class. In this case, $y_{i} = y_{k}$ indicates that $x_{k}$ and $x_{i}$ are samples belonging to the same category.

Step 4: By iterating Step 3, we continue constructing covers until all samples in the data space are covered.
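As a rough illustration of equations (3) to (6), the Python sketch below computes the radius and members of a single cover for a given center. The function name `cover_radius`, the arguments `Z` (projected samples), `y` (labels), and `k` (center index), and the fallback used when no same-class sample is closer than $d_{inter}$ are assumptions made for this example; the sketch also assumes that at least one sample of another class exists.

```python
import numpy as np

def cover_radius(Z, y, k):
    """Return (r, members) for the cover centered at sample index k.

    Z holds projected samples (rows on a common sphere); y holds class labels.
    Similarity is the inner product, so larger values mean closer samples.
    """
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    sims = Z @ Z[k]                      # equation (3)
    same = y == y[k]
    d_inter = sims[~same].max()          # equation (4): nearest other-class sample
    closer = sims[same & (sims > d_inter)]
    # Assumption: fall back to the center itself if the set in equation (5) is empty.
    d_inner = closer.min() if closer.size else sims[k]   # equation (5)
    r = (d_inter + d_inner) / 2.0        # equation (6)
    members = np.flatnonzero(same & (sims >= r))
    return r, members
```

Step 4 would then repeat this computation for a new center drawn from the samples that are still uncovered.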

Quantum potential theory

In our proposed method, we leverage the theory of quantum potential energy in the cover construction process to improve the selection of cover centers. This subsection introduces the relevant theory of quantum potential energy and describes how the potential energy of each sample is calculated.

In the field of quantum mechanics,³³ a wave function represents the quantum state of a particle. Traditionally, the Schrödinger equation is solved to determine the particle distribution in various potential fields to obtain the corresponding wave function. The time-independent Schrödinger equation is given by:

H ϕ (x) \equiv (- \frac{δ^{2}}{2} \nabla^{2} + V (x)) ϕ (x) = E ϕ (x)

(7)

where $H$ represents the Hamiltonian operator, which describes the total energy of the quantum system. $ϕ (x)$ denotes the wave function, $V (x)$ is a potential function, $E$ represents the possible eigenvalue of $H$ , $\nabla^{2}$ is the Laplacian operator, and $δ$ is the scaling parameter of the wave function. Different values of $δ$ correspond to different quantum potential energies.

With the known wave function, we can determine the quantum potential function $V (x)$ by solving equation (7), which describes the probability density function of the input data.³⁴ By utilizing the Gaussian kernel, as shown in equation (8), for the wave function, we can solve for the potential function in a quantum system, as presented in equation (9).

ϕ (x) = \sum_{i = 1}^{n} e^{\frac{- {(x - x_{i})}^{2}}{2 δ^{2}}}

(8)

V (x) = E + \frac{(\frac{δ^{2}}{2}) \nabla^{2} ϕ (x)}{ϕ (x)}

(9)

By substituting equation (8) into equation (9), the potential function $V (x)$ can be obtained as:

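V (x) = E - \frac{d}{2} + \frac{1}{2 δ^{2} ϕ (x)} \sum_{i = 1}^{n} {(x - x_{i})}^{2} e^{\frac{- {(x - x_{i})}^{2}}{2 δ^{2}}}

(10)

where $d$ is the dimension of $x$; the constant terms $E$ and $d / 2$ do not affect which sample has the lowest potential. Hence, the potential of each sample can be calculated precisely. Consequently, the center of the cover is determined by the sample with the minimum potential energy.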
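To make the calculation concrete, the following Python sketch evaluates the potential at every sample, using the Gaussian wave function of equation (8) in equation (9). It returns only the data-dependent term, that is, the potential up to an additive constant, which is sufficient for ranking samples. The function name `quantum_potential` and the dense pairwise-distance computation are choices made for this illustration.

```python
import numpy as np

def quantum_potential(X, delta):
    """Potential at each sample, up to an additive constant.

    Computes sum_i |x - x_i|^2 * exp(-|x - x_i|^2 / (2 delta^2))
    divided by (2 delta^2 * phi(x)), with phi from equation (8).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=2)              # pairwise squared distances
    gauss = np.exp(-sq / (2.0 * delta ** 2))
    phi = gauss.sum(axis=1)                   # equation (8) at each sample
    return (sq * gauss).sum(axis=1) / (2.0 * delta ** 2 * phi)
```

For three equally spaced points on a line with $δ = 1$, the middle point receives the lowest potential, matching the intuition that centrally located samples are candidates for cover centers. The dense distance matrix needs memory quadratic in the number of samples, so a large data set would call for a blocked computation.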

Quantum-based oversampling method

The proposed QOSM aims to address the bias in the classification boundary by modifying the distribution of the minority class samples in the overlapping regions, thereby improving classification performance. The method involves three stages: cover construction, identification of minority covers in the overlapping area, and oversampling.

In the cover construction stage, the spatial distribution of the original data set is determined based on the relationship between individual samples. Both the minority- and majority-class cover sets are created, where each cover represents a group of samples with the same class label. The center of each cover is determined by selecting the sample with the minimum quantum potential energy calculated by equation (10).

In the second stage, the KNN method is employed to explore the distribution characteristics between covers at a finer granularity. QOSM leverages the cover properties to investigate the spatial relationships of covers. This analysis allows for the identification of minority class covers located in the overlapping area.

Finally, oversampling is applied to the samples belonging to the minority class covers identified in the second stage. This increases the sample size of the minority class in the overlapping area, resulting in a balanced data set. The technical details are presented in the following subsections.

Constructing covers using sample quantum potential energy

The construction of covers is based on the characteristics of the sample space. Each cover comprises two crucial parameters: the cover center and cover radius. The principle underlying cover construction is to gather more samples with the same label around the cover center, ensuring high similarity among these samples. In CCA, the cover center is selected randomly, and different choices of cover centers yield varying coverage areas. Consequently, the output of CCA is greatly influenced by the cover center selection, and proper selection of the cover center plays a critical role in the overall learning process.

The proposed QOSM addresses this issue by selecting cover centers based on quantum theory, aiming to construct cover sets that align better with the sample distribution. For all samples with known class labels in the training data set, the potential energy of each sample is calculated using the method described in previous sections. Then, for all uncovered samples with the same class label, the sample with the lowest potential energy of this class is selected as the center used to construct the cover. The remaining uncovered samples of the same class are also used to construct covers in a similar manner until all samples are covered. The details of cover construction in our method are described in Algorithm 1 (Figure 3).

Figure 3.

Cover construction algorithm.
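Because Algorithm 1 is given only as a figure, the Python sketch below is a reconstruction of the loop from the prose description and may differ from it in detail. It assumes that the projected samples `Z`, the labels `y`, and a precomputed `potential` array are available; these names, the function name `build_covers_by_potential`, and the dictionary layout of each cover are hypothetical.

```python
import numpy as np

def build_covers_by_potential(Z, y, potential):
    """Build covers class by class, using the lowest-potential uncovered sample as center."""
    Z = np.asarray(Z, dtype=float)
    y = np.asarray(y)
    potential = np.asarray(potential, dtype=float)
    covered = np.zeros(len(y), dtype=bool)
    covers = []
    for label in np.unique(y):
        same = y == label
        while (same & ~covered).any():
            candidates = np.flatnonzero(same & ~covered)
            k = candidates[np.argmin(potential[candidates])]  # lowest potential
            sims = Z @ Z[k]                       # inner-product similarity
            d_inter = sims[~same].max()           # nearest other-class sample
            closer = sims[same & (sims > d_inter)]
            # Assumption: fall back to the center if no same-class sample qualifies.
            d_inner = closer.min() if closer.size else sims[k]
            r = (d_inter + d_inner) / 2.0
            members = np.flatnonzero(same & (sims >= r))
            covered[members] = True
            covered[k] = True                     # guarantees the loop terminates
            covers.append({"label": label, "center": int(k),
                           "radius": float(r), "members": members})
    return covers
```

The only change relative to a random-center construction is the `argmin` over the potentials of the uncovered samples.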

In Step 3 of Algorithm 1, the scale adjustment parameter is estimated by equation (11) as described in the literature:³⁵

\tilde{δ} = \left(\frac{4}{\mathrm{dim} + 2}\right)^{\frac{1}{\mathrm{dim} + 4}} p^{- \frac{1}{\mathrm{dim} + 4}}

(11)

where $\mathrm{dim}$ represents the dimension of the data set, which is the number of features, and its value is $n + 1$ in Algorithm 1. $p$ is the number of samples in the data set $X$.
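Equation (11) can be evaluated directly, as in the short Python sketch below. It reads the first factor as $4 / (dim + 2)$ raised to the power $1 / (dim + 4)$; the function name `estimate_scale` is hypothetical.

```python
def estimate_scale(p, dim):
    """Scale parameter of equation (11) for p samples with dim features."""
    exponent = 1.0 / (dim + 4)
    return (4.0 / (dim + 2)) ** exponent * p ** (-exponent)
```

For example, with $p = 100$ samples and $dim = 1$ the estimate is about 0.42, and it decreases slowly as the number of samples grows.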

Figure 4(a) illustrates a given data set, and Algorithm 1 constructs a cover set for samples with the same label, as shown in Figure 4(b). In the figure, solid circles represent the covers of minority-class samples, while dotted circles represent the covers of majority-class samples. The algorithm maps an $n$-dimensional input vector to an $(n + 1)$-dimensional space, thus representing an $(n + 1)$-dimensional input vector covering problem. Different classes of inputs are covered by distinct sets of sphere neighborhoods. For binary classification, two cover subsets, $C_{0} = \{C_{0}^{1}, C_{0}^{2}, \dots, C_{0}^{n_{0}}\}$ and $C_{1} = \{C_{1}^{1}, C_{1}^{2}, \dots, C_{1}^{n_{1}}\}$, can be constructed, corresponding to the minority and majority classes, where $n_{0}$ and $n_{1}$ are the number of covers in the minority and majority cover sets, respectively.

Figure 4.

The process of finding minority class samples of the overlapping area. (“▲” represents minority class samples and “•” represents majority class samples.) (a) Original data set; (b) constructing two cover sets from minority and majority class samples; (c) constructed cover sets; (d) $k$ -nearest neighbors of the cover $C_{0}^{i}$ .

Finding the minority covers in the overlapping area

The second stage of the proposed method focuses on finding the minority class covers in the overlapping area. The cover construction method described previously is applied to the input data set $X$, resulting in a minority class cover set $C_{0}$ and a majority class cover set $C_{1}$, represented by the solid line circles and dotted line circles in Figure 4(c). Each cover contains essential information, such as the cover’s class label, center, radius, and samples it contains. Since the samples within each cover exhibit high similarity, we use the cover center to represent each cover. The sets of center points for the minority and majority class covers are denoted as $\mathrm{center}_{0} = \{\mathrm{center}_{0}^{1}, \mathrm{center}_{0}^{2}, \dots, \mathrm{center}_{0}^{n_{0}}\}$ and $\mathrm{center}_{1} = \{\mathrm{center}_{1}^{1}, \mathrm{center}_{1}^{2}, \dots, \mathrm{center}_{1}^{n_{1}}\}$, respectively. Next, we employ the KNN method to explore the relationship between each pair of covers and the spatial distribution characteristics among them. Specifically, we examine the number of majority class covers, $n_{i\_\mathrm{maj}}$, among the $k$ nearest neighbors of a minority class cover $C_{0}^{i}$ to determine the presence of minority covers within the overlapping area. By comparing the values of $n_{i\_\mathrm{maj}}$ and $k$ and considering a threshold $α$ (e.g., $α = 0.3$), we can obtain three different cases:

Case 1: If $n_{i\_\mathrm{maj}} > (1 - α) k$, the majority of nearest neighbors of the minority cover $C_{0}^{i}$ are majority class covers, and this minority cover is situated within the majority class area. Since most of its neighbors are majority covers, the minority cover may be considered as noise or a small unconnected set. Therefore, it cannot be located at the classification boundary and is not selected for oversampling.

Case 2: If $n_{i\_\mathrm{maj}} < α k$, most of the nearest neighbors of the cover $C_{0}^{i}$ are minority class covers. It can be inferred that the cover is situated within the area of the minority class, and it is not selected for oversampling.

Case 3: If $α k \leq n_{i\_\mathrm{maj}} \leq (1 - α) k$, the minority class cover is located in an area where samples belong to both the minority and majority classes. This suggests that the cover is situated within the overlapping area and is a good candidate for oversampling.

To illustrate the process of identifying minority covers within the overlapping area, consider an example. For a minority class cover $C_{0}^{i}$, we identify its $k$ (set $k = 4$) nearest neighbors $c_{0}^{a}$, $c_{1}^{b}$, $c_{1}^{c}$, and $c_{1}^{d}$. Among these neighbors, there are three majority class covers (represented by the red dotted line circle), which corresponds to Case 3, satisfying $0.3 k \leq n_{i\_\mathrm{maj}} \leq (1 - 0.3) k$. In this case, the cover $C_{0}^{i}$ is situated close to the decision boundary between the minority and majority classes. The majority of samples within this cover are in the overlapping area, and these minority samples will be utilized for oversampling.
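The three-way rule can be sketched in Python as follows. This is an illustration under stated assumptions: neighbors are found by Euclidean distance between cover centers (the text measures closeness by inner product on the projected sphere), each cover is excluded from its own neighbor list, and the function name `classify_minority_covers` and the labels `"noise"`, `"safe"`, and `"overlap"` are hypothetical.

```python
import numpy as np

def classify_minority_covers(centers_min, centers_maj, k=4, alpha=0.3):
    """Label each minority cover as 'noise', 'safe', or 'overlap'."""
    centers_min = np.asarray(centers_min, dtype=float)
    centers_maj = np.asarray(centers_maj, dtype=float)
    all_centers = np.vstack([centers_min, centers_maj])
    is_maj = np.r_[np.zeros(len(centers_min), dtype=bool),
                   np.ones(len(centers_maj), dtype=bool)]
    labels = []
    for i, c in enumerate(centers_min):
        dist = np.linalg.norm(all_centers - c, axis=1)
        dist[i] = np.inf                      # a cover is not its own neighbor
        nearest = np.argsort(dist)[:k]
        n_maj = int(is_maj[nearest].sum())    # majority covers among k neighbors
        if n_maj > (1 - alpha) * k:
            labels.append("noise")            # surrounded by majority covers
        elif n_maj < alpha * k:
            labels.append("safe")             # inside the minority area
        else:
            labels.append("overlap")          # candidate for oversampling
    return labels
```

Only the covers labeled `"overlap"` would be passed on to the oversampling stage.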

Oversampling in the overlapping area

The proposed QOSM incorporates oversampling of the minority class covers in the overlapping area. The goal is to ensure that the synthetic samples are distributed evenly within the overlapping region, thereby enhancing the representation of the minority class and expanding its area. The oversampling process follows these steps:

Calculate the centroid $x_{mean}$ of all the minority samples in the overlapping area. This centroid represents the central position within the overlapping region, as illustrated in Figure 5.

Generate synthetic samples $x_{synthetic}$ using linear interpolation between each sample $x_{i}$ and the centroid $x_{mean}$. Specifically, each synthetic sample is calculated as $x_{synthetic} = x_{mean} + θ \cdot (x_{i} - x_{mean})$, where $θ$ is a random number between 0 and 1. This process is similar to the SMOTE algorithm,⁹ but with a distinction. In QOSM, synthetic samples are generated between a minority sample and the centroid of all minority samples in the overlapping area, instead of generating them between pairs of individual minority samples. This approach helps mitigate the impact of noisy data and reduces the likelihood of creating mislabeled synthetic samples.

Figure 5.

The process of oversampling in the overlapping area.

The number of synthetic samples generated for each minority sample is a parameter that can be determined on a case-by-case basis, considering factors such as data characteristics (e.g., IR and the maximum Fisher’s discriminant ratio, maxF) and the desired IR after the oversampling process.
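The interpolation step can be sketched as follows. This is an illustrative sketch only: the function names, the `n_per_sample` parameter (standing in for the per-sample count discussed above), the fixed random seed, and the toy samples are assumptions and are not taken from the QOSM pseudo-code.

```python
import random

def centroid(samples):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(samples)
    return [sum(column) / n for column in zip(*samples)]

def oversample_toward_centroid(overlap_minority, n_per_sample=2, seed=0):
    """Create n_per_sample synthetic points on the segment from x_mean to each x_i."""
    rng = random.Random(seed)
    x_mean = centroid(overlap_minority)
    synthetic = []
    for x_i in overlap_minority:
        for _ in range(n_per_sample):
            theta = rng.random()  # random number in [0, 1)
            # x_synthetic = x_mean + theta * (x_i - x_mean)
            synthetic.append([m + theta * (v - m) for v, m in zip(x_i, x_mean)])
    return x_mean, synthetic

# Hypothetical minority samples located in the overlapping area.
overlap_minority = [[1.0, 2.0], [2.0, 4.0], [3.0, 3.0]]
x_mean, synthetic = oversample_toward_centroid(overlap_minority, n_per_sample=2)
print(x_mean, len(synthetic))
```

Because every synthetic point lies between an existing minority sample and the centroid, no point is generated outside the region spanned by the selected minority samples, which is the property the text relies on to limit mislabeled synthetic samples.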

It is important to note that QOSM focuses on generating synthetic samples specifically in the overlapping area. By applying oversampling, the total number of samples in the minority class increases, which tends to balance the training set. Moreover, the distribution of the minority class within the overlapping region is altered. After addressing the issue of imbalance and overlap using QOSM, the traditional classification algorithms can be employed to build the models for further classification and prediction.

Based on the principle of minimizing overall error, the decision boundary will shift to some extent toward the majority class area, aiming to achieve higher accuracy in predicting the minority class samples in the overlapping area. QOSM seeks to achieve an improved balance between the classification of the minority and the majority class, thereby improving the overall classification performance. The oversampling of the minority class samples in the overlapping area increases the discriminative capability of the classifiers to some extent. The pseudo-code for the QOSM algorithm is presented in Algorithm 2 (Figure 6).

Figure 6.

Algorithm of QOSM.

Results

Data sets

To validate the effectiveness of our proposed QOSM, we conducted experiments on 10 imbalanced and overlapped data sets from the KEEL repository website (https://sci2s.ugr.es/keel/imbalanced.php). These data sets are widely used for evaluating the performance of imbalanced classification algorithms. Table 1 provides a summary of these data sets, including their Abbreviation (Abb), Instances (Ins), Features (Fea), Minority Class Sample Proportion (%min), Majority Class Sample Proportion (%maj), IR, and maxF (the degree of overlap). Data sets are arranged in ascending order by the maxF value. A smaller maxF value indicates a higher degree of overlap. All 10 data sets exhibit high levels of class imbalance with IR values greater than 9, and significant overlap with maxF values less than 4.5. Among these data sets, only one exhibits a relatively low overlap, with a maxF value greater than 2.5, while the remaining nine data sets have maxF values below 2.5.

Table 1.

Data set description.

Data sets	Abb	Ins	Fea	%min	%maj	IR	maxF
glass-0-1-4-6_vs_2	G02	205	9	8.29	91.71	11.06	0.349
winequality-red-4	Wr4	1599	11	3.31	96.69	29.17	0.378
winequality-white-3_vs_7	Ww37	900	11	2.22	97.78	44.05	0.387
yeast-0-2-5-6_vs_3-7-8-9	Y09	1004	8	9.86	90.14	9.14	0.694
yeast4	Y4	1484	8	3.44	96.56	28.07	0.741
abalone-20_vs_8-9-10	A2010	1916	8	1.36	98.64	72.69	1.266
glass4	G4	214	9	6.07	93.93	15.47	1.469
yeast6	Y6	1484	8	2.36	97.64	41.37	1.968
abalone-21_vs_8	A218	581	8	2.41	97.59	40.49	2.436
yeast5	Y5	1484	8	2.96	97.04	32.73	4.198

In Table 1, maxF is a measurement used to quantify the degree of overlap between classes.³⁶,²⁴ It calculates the maximum discriminant ratio achieved by Fisher’s method across all the dimensions of an object. Fisher’s discriminant ratio for a single dimension is defined as:

f_{i} = \frac{{(μ_{0} - μ_{1})}^{2}}{{σ_{0}}^{2} + {σ_{1}}^{2}}

(12)

where $f_{i}$ represents the Fisher’s discriminant ratio of dimension $i$ , measuring the separability of the two classes in that dimension. $μ_{0}$ , $μ_{1}$ , ${σ_{0}}^{2}$ , and ${σ_{1}}^{2}$ denote the means and variances of classes 0 and 1 in dimension $i$ , respectively. The degree of class overlap can be evaluated by maxF, the maximum $f_{i}$ across all the feature dimensions:

\mathrm{maxF} = \max_{i} \{f_{i}\}

(13)

where maxF ranges from 0 to infinity, and a lower maxF value indicates a higher degree of overlap in the data set.
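Equations (12) and (13) translate directly into code. The sketch below is a minimal illustration under stated assumptions: it uses the population variance (the text does not say whether population or sample variance is intended), the function names are hypothetical, and the two small classes are made-up data.

```python
from statistics import mean, pvariance

def fisher_ratio(values_0, values_1):
    """Equation (12) for one feature: squared mean gap over summed variances."""
    numerator = (mean(values_0) - mean(values_1)) ** 2
    denominator = pvariance(values_0) + pvariance(values_1)
    if denominator == 0:
        # Both classes are constant in this feature: separable if the means differ.
        return float("inf") if numerator > 0 else 0.0
    return numerator / denominator

def max_f(class_0, class_1):
    """Equation (13): the largest per-feature Fisher ratio, plus all ratios."""
    ratios = [fisher_ratio(col_0, col_1)
              for col_0, col_1 in zip(zip(*class_0), zip(*class_1))]
    return max(ratios), ratios

# Hypothetical two-feature data: feature 0 separates the classes, feature 1 does not.
class_0 = [(1.0, 5.0), (2.0, 6.0), (3.0, 7.0)]
class_1 = [(7.0, 5.5), (8.0, 6.5), (9.0, 7.5)]
max_f_value, ratios = max_f(class_0, class_1)
print(max_f_value, ratios)
```

Here the first feature gives a ratio of about 27 and the second about 0.19, so maxF is driven by the single most discriminative feature; a data set has a small maxF only when every feature overlaps.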

Evaluation metrics

We employed three evaluation metrics to assess the performance of our proposed algorithm: F-measure, area under the receiver operating characteristic (ROC) curve (AUC), and G-mean. These metrics can be derived from the confusion matrix where the positive and negative classes represent the minority and majority classes, respectively. In the confusion matrix, TP represents the number of positive samples that are correctly classified as positive, FN represents the number of positive samples that are misclassified as negative, FP represents the number of negative samples misclassified as positive, and TN represents the number of negative samples that are correctly classified as negative. The evaluation metrics true positive rate (TPR), true negative rate (TNR), accuracy, precision, recall, F-measure, and G-mean are defined by equations (14) to (20), respectively.

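\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}

(14)

\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}

(15)

\mathrm{Accuracy} = \frac{\mathrm{TP} + \mathrm{TN}}{\mathrm{TP} + \mathrm{TN} + \mathrm{FP} + \mathrm{FN}}

(16)

\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}

(17)

\mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}

(18)

\mathrm{F\text{-}measure} = \frac{(1 + β^{2}) \times \mathrm{Recall} \times \mathrm{Precision}}{β^{2} \times \mathrm{Recall} + \mathrm{Precision}}, \quad β \geq 0

(19)

\mathrm{G\text{-}mean} = \sqrt{\mathrm{TPR} \times \mathrm{TNR}}

(20)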

Generally, Precision and Recall exhibit an inverse relationship, where increasing one may come at the cost of reducing the other. F-measure provides a comprehensive measure by combining Precision and Recall, where β is the parameter used to adjust their relative importance. In our experiments, β is set to 1, making F-measure the harmonic mean of Recall and Precision. In addition, we employed G-mean as an evaluation metric, which represents the geometric mean of the classification accuracy of positive and negative samples. The ROC curve can be drawn by plotting the TPR against the false positive rate (FPR = 1 − TNR) at various threshold settings, where TPR and FPR can be calculated based on the confusion matrix. The AUC, the area under this curve, ranges from 0 to 1.0 and indicates the overall performance of an algorithm. A higher AUC signifies superior overall performance.
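As a worked example of equations (14) to (20), the following sketch computes the confusion-matrix metrics for given counts. The function name, the zero-division conventions, and the example counts are assumptions made for illustration; the F-measure follows equation (19) as written, which reduces to the harmonic mean for β = 1.

```python
import math

def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
    """Confusion-matrix metrics with the minority class as the positive class."""
    tpr = tp / (tp + fn)                      # equation (14), identical to recall (18)
    tnr = tn / (tn + fp)                      # equation (15)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Assumed convention: precision is 0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tpr
    denominator = beta ** 2 * recall + precision
    f_measure = (1 + beta ** 2) * recall * precision / denominator if denominator > 0 else 0.0
    g_mean = math.sqrt(tpr * tnr)             # equation (20)
    return {"TPR": tpr, "TNR": tnr, "accuracy": accuracy,
            "precision": precision, "F-measure": f_measure, "G-mean": g_mean}

# Hypothetical counts for a data set with 40 minority and 960 majority samples.
metrics = imbalance_metrics(tp=30, fn=10, fp=20, tn=940)
all_majority = imbalance_metrics(tp=0, fn=40, fp=0, tn=960)
print(metrics)
print(all_majority)
```

The second call shows why accuracy is not used as a headline metric here: a classifier that labels every sample as majority reaches 96% accuracy on these counts while its F-measure and G-mean are both 0.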

Comparison experiments and analysis

We conducted three sets of experiments to evaluate the effectiveness of our proposed method. The first set aimed to validate the performance improvement of three traditional base classification algorithms when QOSM is used, as illustrated in Figure 7 and reported in Table 2. The second set of experiments aimed to verify the effectiveness of the improved constructive covering algorithm based on quantum potential energy, and the results are presented in Table 3. Finally, we compared the performance of our QOSM with three other oversampling methods, and the results on three base classifiers (SVM, KNN, and NB) are reported in Tables 4 to 6.

Table 2.

The average F-measure, AUC, and G-mean on the 10 data sets using three classifiers with and without QOSM.

	F-measure	AUC	G-mean
SVM	0.0470 ± 0.0984	0.5146 ± 0.0312	0.0621 ± 0.1252
SVM + QOSM	0.3998 ± 0.1741	0.8214 ± 0.0998	0.8055 ± 0.1132
KNN	0.3018 ± 0.2848	0.6267 ± 0.1300	0.3651 ± 0.3228
KNN + QOSM	0.4856 ± 0.2007	0.7981 ± 0.0983	0.7720 ± 0.1200
NB	0.1978 ± 0.1264	0.6567 ± 0.0911	0.5685 ± 0.1456
NB + QOSM	0.2763 ± 0.1705	0.7679 ± 0.0992	0.7523 ± 0.1054

AUC: area under the curve; QOSM: quantum-based oversampling method; SVM: support vector machines; KNN: k-nearest neighbor; NB: naive Bayes classifier.

The bold values represent the results obtained by the classifiers when QOSM is employed.

Table 3.

F-measure, AUC, and G-mean values of QOSM and ROSM methods using SVM, KNN, and NB classifiers.

Classifier	Data set	F-measure (ROSM)	F-measure (QOSM)	AUC (ROSM)	AUC (QOSM)	G-mean (ROSM)	G-mean (QOSM)
SVM	G02	0.1945	0.2075	0.6175	0.6614	0.5909	0.6302
	Wr4	0.0821	0.2087	0.5637	0.7450	0.5623	0.7369
	Ww37	0.0670	0.1375	0.6072	0.6830	0.5641	0.6472
	Y09	0.3327	0.5844	0.7188	0.8080	0.7130	0.7967
	Y4	0.0886	0.3473	0.5947	0.8839	0.5777	0.8837
	A2010	0.0269	0.4604	0.522	0.9073	0.4958	0.9010
	G4	0.2871	0.6934	0.7344	0.9083	0.7308	0.9016
	Y6	0.0549	0.3221	0.5714	0.8865	0.5259	0.8831
	A218	0.0836	0.5238	0.6287	0.7605	0.6202	0.7048
	Y5	0.1204	0.5137	0.6722	0.9701	0.6580	0.9697
	Average	0.1338	0.3998	0.6231	0.8214	0.6039	0.8055
KNN	G02	0.3636	0.4689	0.7544	0.8342	0.7493	0.8273
	Wr4	0.0549	0.1924	0.4944	0.6245	0.3595	0.5474
	Ww37	0	0.1345	0.4290	0.6994	0	0.6711
	Y09	0.2796	0.5788	0.6514	0.818	0.6457	0.8106
	Y4	0.0872	0.3521	0.5489	0.8107	0.4334	0.7966
	A2010	0.2000	0.4833	0.5947	0.6984	0.4448	0.6144
	G4	0.3621	0.7862	0.7767	0.9208	0.764	0.9144
	Y6	0.2476	0.5364	0.7827	0.8454	0.759	0.8506
	A218	0.2857	0.5848	0.6535	0.7939	0.5697	0.7523
	Y5	0.3612	0.7382	0.8069	0.9361	0.7868	0.9349
	Average	0.2242	0.4856	0.6493	0.7981	0.5512	0.7720
NB	G02	0.1761	0.1893	0.6158	0.6202	0.5720	0.5979
	Wr4	0.0586	0.1348	0.4888	0.6474	0.4481	0.6300
	Ww37	0.0645	0.1705	0.5511	0.7386	0.4616	0.699
	Y09	0.2019	0.5087	0.5418	0.7647	0.4926	0.7494
	Y4	0.0665	0.1210	0.5085	0.6897	0.3702	0.6754
	A2010	0.0264	0.0795	0.5162	0.7195	0.4862	0.7059
	G4	0.3758	0.6333	0.7125	0.8416	0.6881	0.8324
	Y6	0.0415	0.2601	0.4641	0.8983	0.3498	0.8953
	A218	0.0871	0.3118	0.6456	0.8272	0.628	0.8083
	Y5	0.0572	0.3539	0.4871	0.9313	0.1923	0.9296
	Average	0.1156	0.2763	0.5532	0.7679	0.4689	0.7523

AUC: area under the curve; QOSM: quantum-based oversampling method; ROSM: random oversampling method; SVM: support vector machines; KNN: k-nearest neighbor; NB: naive Bayes classifier.

Table 4.

F-measure, AUC, and G-mean values of each oversampling method using SVM.

Data set	F-measure (QOSM)	F-measure (Poly)	F-measure (Pro)	F-measure (IPF)	AUC (QOSM)	AUC (Poly)	AUC (Pro)	AUC (IPF)	G-mean (QOSM)	G-mean (Poly)	G-mean (Pro)	G-mean (IPF)
G02	0.208	0.220	0.196	0.184	0.661	0.706	0.649	0.643	0.630	0.659	0.599	0.568
Wr4	0.209	0.135	0.146	0.142	0.745	0.707	0.731	0.723	0.737	0.705	0.728	0.720
Ww37	0.138	0.100	0.075	0.077	0.683	0.610	0.584	0.614	0.647	0.544	0.527	0.575
Y09	0.584	0.538	0.523	0.515	0.808	0.798	0.790	0.788	0.797	0.787	0.779	0.776
Y4	0.347	0.310	0.286	0.294	0.884	0.874	0.866	0.860	0.884	0.874	0.865	0.860
A2010	0.460	0.276	0.250	0.238	0.907	0.873	0.888	0.885	0.901	0.868	0.883	0.881
G4	0.693	0.661	0.598	0.572	0.908	0.903	0.891	0.888	0.902	0.897	0.885	0.883
Y6	0.322	0.335	0.294	0.307	0.887	0.887	0.881	0.883	0.883	0.885	0.878	0.880
A218	0.524	0.441	0.362	0.405	0.761	0.756	0.750	0.753	0.705	0.701	0.697	0.699
Y5	0.513	0.514	0.484	0.466	0.970	0.970	0.966	0.964	0.970	0.970	0.966	0.963
Average	0.400	0.353	0.322	0.320	0.821	0.809	0.799	0.800	0.806	0.789	0.781	0.781

AUC: area under the curve; SVM: support vector machines; QOSM: quantum-based oversampling method; IPF: iterative-partitioning filter.

The bold values represent the best results achieved among the four oversampling methods for each dataset and evaluation metric.

Table 5.

F-measure, AUC, and G-mean values of each oversampling method using KNN.

Data set	F-measure (QOSM)	F-measure (Poly)	F-measure (Pro)	F-measure (IPF)	AUC (QOSM)	AUC (Poly)	AUC (Pro)	AUC (IPF)	G-mean (QOSM)	G-mean (Poly)	G-mean (Pro)	G-mean (IPF)
G02	0.469	0.453	0.420	0.426	0.834	0.829	0.818	0.818	0.827	0.821	0.812	0.812
Wr4	0.192	0.069	0.075	0.037	0.625	0.520	0.539	0.449	0.547	0.476	0.514	0.327
Ww37	0.135	0.076	0.124	0.103	0.699	0.616	0.671	0.642	0.671	0.515	0.631	0.597
Y09	0.579	0.492	0.413	0.450	0.818	0.787	0.771	0.796	0.811	0.780	0.769	0.794
Y4	0.352	0.324	0.288	0.312	0.811	0.828	0.833	0.810	0.797	0.821	0.831	0.798
A2010	0.483	0.338	0.267	0.244	0.698	0.731	0.799	0.796	0.614	0.670	0.781	0.779
G4	0.786	0.613	0.708	0.746	0.921	0.918	0.911	0.916	0.914	0.912	0.905	0.910
Y6	0.536	0.361	0.302	0.312	0.845	0.830	0.833	0.822	0.851	0.817	0.824	0.810
A218	0.585	0.596	0.458	0.449	0.794	0.825	0.816	0.844	0.752	0.799	0.791	0.822
Y5	0.738	0.620	0.566	0.603	0.936	0.929	0.924	0.928	0.935	0.928	0.924	0.927
Average	0.486	0.394	0.362	0.368	0.798	0.781	0.792	0.782	0.772	0.754	0.778	0.758

AUC: area under the curve; KNN: k-nearest neighbor; QOSM: quantum-based oversampling method; IPF: iterative-partitioning filter.

The bold values represent the best results achieved among the four oversampling methods for each dataset and evaluation metric.

Table 6.

F-measure, AUC, and G-mean values of each oversampling method using NB.

Data set	F-measure (QOSM)	F-measure (Poly)	F-measure (Pro)	F-measure (IPF)	AUC (QOSM)	AUC (Poly)	AUC (Pro)	AUC (IPF)	G-mean (QOSM)	G-mean (Poly)	G-mean (Pro)	G-mean (IPF)
G02	0.189	0.174	0.147	0.151	0.620	0.599	0.549	0.557	0.598	0.581	0.519	0.532
Wr4	0.135	0.110	0.104	0.097	0.647	0.627	0.613	0.601	0.630	0.622	0.611	0.599
Ww37	0.171	0.160	0.175	0.157	0.739	0.698	0.703	0.734	0.699	0.582	0.648	0.703
Y09	0.509	0.502	0.545	0.543	0.765	0.740	0.752	0.752	0.749	0.713	0.726	0.725
Y4	0.121	0.096	0.078	0.076	0.690	0.639	0.586	0.577	0.675	0.616	0.451	0.430
A2010	0.080	0.055	0.047	0.046	0.720	0.677	0.660	0.657	0.706	0.670	0.655	0.652
G4	0.633	0.633	0.459	0.528	0.842	0.844	0.758	0.798	0.832	0.834	0.735	0.787
Y6	0.260	0.104	0.194	0.066	0.898	0.783	0.660	0.655	0.895	0.759	0.563	0.555
A218	0.312	0.177	0.152	0.167	0.827	0.777	0.759	0.768	0.808	0.762	0.745	0.753
Y5	0.354	0.341	0.201	0.192	0.931	0.921	0.875	0.867	0.930	0.920	0.866	0.857
Average	0.276	0.235	0.210	0.202	0.768	0.731	0.691	0.697	0.752	0.706	0.652	0.659

AUC: area under the curve; NB: naive Bayes classifier; QOSM: quantum-based oversampling method; IPF: iterative-partitioning filter.

The bold values represent the best results achieved among the four oversampling methods for each dataset and evaluation metric.

It is important to note that, in QOSM, the step of finding the $k$ nearest neighbors of each minority class cover is crucial, and the value of $k$ needs to be determined empirically based on the data set. However, determining the optimal value of $k$ is not the primary focus of this article. Instead, we conducted an additional comparative experiment to find an acceptable value: we set $k$ to 3, 4, 5, and 6 in turn and evaluated the performance using the accuracy metric. The best results were obtained with $k = 4$. Therefore, all three groups of experiments were conducted using this setting.

Our work primarily focuses on the oversampling methods rather than the classification algorithms themselves. Therefore, we utilized three widely used classifiers, namely, SVM,³⁷ KNN, and NB,³⁸ as the base classifiers in our experiments. These classifiers were implemented using the scikit-learn machine learning library in Python, with their default parameter settings. In our experiments, the data sets were partitioned into training and testing sets at an 80–20% ratio, and five-fold cross-validation was employed for model selection. The models were evaluated using the F-measure, AUC, and G-mean metrics. The results of the three sets of experiments are presented and analyzed in detail in the following three subsections.

Classification performance with and without QOSM

To demonstrate the effectiveness of QOSM, we applied it to oversample 10 highly imbalanced and overlapped data sets. Then, we used three base classifiers to classify the oversampled data and compared the results with the classification performance on the original data sets without QOSM. Table 2 reports the average F-measure, AUC, and G-mean values, along with their corresponding standard deviations, obtained by the three traditional classifiers with and without QOSM on the 10 data sets. SVM performs poorly without QOSM, with an F-measure of only 0.0470 and an AUC of 0.5146. This demonstrates the sensitivity of SVM to data sets that are both imbalanced and overlapped, as the support vectors are difficult to determine within the overlapping area. When QOSM is applied, SVM’s F-measure and AUC improve to 0.3998 and 0.8214, respectively. After QOSM oversampling, the average F-measure, G-mean, and AUC increased for all three classifiers. Specifically, for KNN, the average F-measure, G-mean, and AUC improved by 18.38, 40.69, and 17.14 percentage points, respectively. For SVM, the corresponding improvements were 35.28, 74.34, and 30.68 percentage points, whereas for NB, they were 7.85, 18.38, and 11.12 percentage points. These improvements were obtained with the same classifiers and parameter settings in both cases. Table 2 also reveals that, in some cases (e.g., the SVM F-measure and AUC), the standard deviations with QOSM are higher than those of the classifiers without QOSM on the original data. This can be attributed to the poor ability of the base classifiers to identify minority class samples in these data sets, as depicted in Figure 7: without QOSM, the evaluation results on each data set are uniformly poor, resulting in smaller standard deviations.

Figure 7.

F-measure, AUC, and G-mean values of each data set using the base classifiers and the QOSM.

Experimental results for each combination of data set and base classifier are depicted in Figure 7. The three sub-figures show the F-measure, AUC, and G-mean values obtained by the three base classifiers with the default parameters on the 10 data sets, respectively. The solid lines represent the classification results using QOSM and the dashed lines represent the results without QOSM. QOSM significantly improves the classification performance of the three traditional classifiers on all 10 data sets. This improvement can be attributed to the capability of the proposed QOSM to handle the class imbalance and overlap present in the data sets.

Comparison between QOSM and random oversampling method

We conducted experiments to compare the classification performance between QOSM and random oversampling method (ROSM). The key difference between QOSM and ROSM lies in the method of selecting cover centers. While ROSM randomly selects cover centers, QOSM utilizes quantum potential theory for cover center selection.

Table 3 presents the comparison results for each evaluation metric using the SVM, KNN, and NB classifiers. “Average” values represent the mean performance across the 10 data sets. The results in Table 3 demonstrate that QOSM achieved superior performance compared with ROSM on all 10 data sets, and the average values of all three metrics were higher with QOSM than with ROSM for all three classifiers. It is noted that the F-measure values with ROSM were very low for several data sets (e.g., Ww37, Y4, and Y6), indicating that only a few positive samples were correctly classified. In contrast, QOSM significantly improved the classification performance, and the base classifiers achieved better results when combined with QOSM.

Although both methods performed poorly on the A2010 data set due to its extremely high IR (IR = 72.69), QOSM outperformed ROSM. These experimental results demonstrate that the proposed oversampling method based on the constructive covering algorithm and quantum potential theory is effective. By optimizing cover center selection and generating covers that align more closely with the original sample distribution, QOSM better defines the decision boundary through oversampling in the overlapping region, leading to improved classification accuracy.

Comparison of QOSM with other oversampling methods

We compared the performance of our proposed QOSM with three state-of-the-art oversampling algorithms: Polynom-fit-SMOTE (Poly),³⁹ ProWsyn (Pro),⁴⁰ and SMOTE-IPF (IPF).⁴¹ Polynom-fit-SMOTE focuses on improving the TPR while maintaining a reasonable TNR by employing polynomial fitting functions for oversampling the minority class. ProWsyn introduces a synthetic oversampling algorithm that assigns appropriate weight values to minority samples based on proximity information, ensuring a balanced distribution of synthetic samples. SMOTE-IPF extends the SMOTE algorithm by incorporating an iterative ensemble-based noise filter called Iterative-Partitioning Filter (IPF) to address challenges posed by noisy and borderline examples in imbalanced data sets. In this discussion, we refer to these algorithms as Poly, Pro, and IPF, respectively. These three algorithms demonstrated superior performance in the study by Kovács.⁴² The parameter settings for these algorithms in our experiments are the default values provided in the scikit-learn library for Python.

The evaluation results for F-measure, AUC, and G-mean corresponding to QOSM, Poly, Pro, and IPF on the 10 data sets using the SVM, KNN, and NB classifiers are reported in Tables 4, 5, and 6, respectively. The best result among the four oversampling methods for each data set and metric is highlighted in bold. The average values for the different metrics on the 10 data sets are provided in the last row of each table. Tables 4 through 6 show that, when using the SVM classifier, QOSM outperformed Poly, Pro, and IPF on 7 out of 10 data sets for all three evaluation measures. In fact, QOSM combined with SVM achieved better performance than the Pro and IPF oversampling methods on all 10 data sets, as shown in Table 4. When using KNN and NB, QOSM also exhibited better performance than the other three oversampling methods on most data sets, as indicated by all three evaluation measures.

The average values of F-measure, AUC, and G-mean across the 10 data sets are illustrated in Figure 7. The results demonstrate that QOSM consistently outperformed the other three methods on all three metrics when the SVM and NB classifiers were used. With the KNN classifier, QOSM achieved the highest average F-measure and AUC, while its average G-mean (0.772) was slightly lower than that of Pro (0.778) but still comparable. These findings indicate that QOSM exhibits superior performance across multiple data sets and is less affected by the choice of classifier, highlighting its good generalization ability and potential for handling the imbalance and overlapping issues that exist in many real-world data sets.

Significance test

To determine the significance of the differences between QOSM and the other three compared oversampling methods, we conducted statistical tests using the Friedman test.⁴³ The Friedman test compares the mean ranks of all the algorithms across the experimental data sets. The results of the Friedman test for each combination of metric and classifier are presented in Table 7.

Table 7.

Friedman test results on F-measure, AUC, and G-mean for all the compared methods with SVM, KNN, and NB as base classifiers.

Classifier	Metric	F-value	Null Hypothesis
SVM	F-measure	19.8462	Rejected
SVM	AUC	10.5652	Rejected
SVM	G-mean	11.4082	Rejected
KNN	F-measure	14.6842	Rejected
KNN	AUC	1.5263	Not Rejected
KNN	G-mean	1.0000	Not Rejected
NB	F-measure	8.5439	Rejected
NB	AUC	17.2391	Rejected
NB	G-mean	14.1959	Rejected

AUC: area under the curve; SVM: support vector machines; KNN: k-nearest neighbor; NB: naive Bayes classifier.

The bold font indicates that the F-value is greater than the critical value of 2.96.

With a significance level of $α = 0.05$, and considering 4 algorithms and 10 data sets in the experiments, the critical value of the Friedman test is 2.96. Table 7 shows that the F-values obtained with the SVM and NB classifiers are well above 2.96 for all three metrics. The null hypothesis, which assumes that all the compared algorithms perform similarly, can therefore be rejected; in other words, there are significant differences among these algorithms. For the KNN classifier, however, the F-values for AUC and G-mean are lower than 2.96, so the differences among the methods on these two metrics are not statistically significant. Overall, the null hypothesis is rejected at a significance level of $α = 0.05$ for seven of the nine classifier-metric combinations, which indicates that the performance differences between QOSM and the other three methods are statistically significant in most of the cases we tested.
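The F-values in Table 7 can be checked from the per-data-set scores. The sketch below assumes that the reported F-value is the Iman-Davenport form of the Friedman statistic, which follows an F distribution with $(k - 1)$ and $(k - 1)(N - 1)$ degrees of freedom; this assumption is consistent with the critical value of 2.96 for $k = 4$ algorithms and $N = 10$ data sets, but the text does not state it explicitly. The function names are hypothetical, and the input is the SVM F-measure block of Table 4.

```python
def average_ranks(scores):
    """Rank scores within one data set (rank 1 = highest); ties share the mean rank."""
    order = sorted(range(len(scores)), key=lambda j: -scores[j])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for j in order[pos:end + 1]:
            ranks[j] = shared_rank
        pos = end + 1
    return ranks

def friedman_f(table):
    """Return mean ranks, the Friedman chi-square, and the Iman-Davenport F statistic."""
    n, k = len(table), len(table[0])
    rank_rows = [average_ranks(row) for row in table]
    mean_ranks = [sum(row[j] for row in rank_rows) / n for j in range(k)]
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)
    f_stat = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return mean_ranks, chi2, f_stat

# SVM F-measure values from Table 4; columns are QOSM, Poly, Pro, IPF.
svm_f_measure = [
    [0.208, 0.220, 0.196, 0.184], [0.209, 0.135, 0.146, 0.142],
    [0.138, 0.100, 0.075, 0.077], [0.584, 0.538, 0.523, 0.515],
    [0.347, 0.310, 0.286, 0.294], [0.460, 0.276, 0.250, 0.238],
    [0.693, 0.661, 0.598, 0.572], [0.322, 0.335, 0.294, 0.307],
    [0.524, 0.441, 0.362, 0.405], [0.513, 0.514, 0.484, 0.466],
]
mean_ranks, chi2, f_stat = friedman_f(svm_f_measure)
print(mean_ranks, round(chi2, 2), round(f_stat, 4))
```

On these inputs the mean ranks are 1.3, 1.9, 3.3, and 3.5 for QOSM, Poly, Pro, and IPF, and the resulting F statistic is about 19.85, in line with the SVM F-measure row of Table 7.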

More discussion

The results presented in Table 3 demonstrate that QOSM systematically improves classification performance, and this improvement is statistically significant. It is interesting to explore the impact of the IR and the degree of overlapping (maxF) on the classification performance improvement. Table 1 provides the IR and maxF values for the ten data sets. Previous research, including our own work,³ has shown that F-measure is a better metric for evaluating classification performance on imbalanced data compared to AUC, and this has been acknowledged by other researchers.⁴⁴ Therefore, in this analysis, we use F-measure to investigate the influence of IR on classification accuracy. From Table 3, we define the performance difference F_diff as the difference between the F-measure values obtained by QOSM and ROSM for each classifier. We then compute the correlation coefficient (r) between F_diff and IR, and between F_diff and maxF. The correlation coefficients between F_diff and IR are 0.371, 0.004, and −0.266 for the SVM, KNN, and NB classifiers, respectively. It is important to note that the data characteristics of the ten data sets are diverse, and the IR varies significantly, ranging from 9 to 72. Therefore, we observe no strong correlation between F_diff and IR. On the other hand, the correlation coefficients between F_diff and maxF are 0.674, 0.673, and 0.637 for the SVM, KNN, and NB classifiers, respectively. These values indicate a strong positive correlation between F_diff and maxF. Notably, the maxF values range from 0.3487 to 4.1976, as shown in Table 1, suggesting that a higher degree of overlapping leads to a more significant improvement in classification. These findings indicate that QOSM is more effective in addressing the overlap issue than the imbalance issue.
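The correlation analysis can be reproduced from the published tables. The sketch below assumes that r denotes the Pearson correlation coefficient (the text does not name the coefficient) and uses the SVM F-measure columns of Table 3 together with the IR and maxF columns of Table 1; the function name `pearson_r` is hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Data sets in Table 1 order: G02, Wr4, Ww37, Y09, Y4, A2010, G4, Y6, A218, Y5.
ir = [11.06, 29.17, 44.05, 9.14, 28.07, 72.69, 15.47, 41.37, 40.49, 32.73]
max_f = [0.349, 0.378, 0.387, 0.694, 0.741, 1.266, 1.469, 1.968, 2.436, 4.198]
# SVM F-measure from Table 3.
f_rosm = [0.1945, 0.0821, 0.0670, 0.3327, 0.0886, 0.0269, 0.2871, 0.0549, 0.0836, 0.1204]
f_qosm = [0.2075, 0.2087, 0.1375, 0.5844, 0.3473, 0.4604, 0.6934, 0.3221, 0.5238, 0.5137]

f_diff = [q - r for q, r in zip(f_qosm, f_rosm)]
r_ir = pearson_r(f_diff, ir)
r_max_f = pearson_r(f_diff, max_f)
print(round(r_ir, 3), round(r_max_f, 3))
```

Under this assumption the SVM values come out at roughly 0.37 for IR and 0.67 for maxF, close to the coefficients reported above; the same function can be applied to the KNN and NB columns of Table 3.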

However, there are several limitations in the current work. QOSM has only been tested on ten diverse data sets with varying characteristics, particularly in terms of their wide range of IRs. It is necessary to evaluate the performance of the proposed QOSM on a larger number of high-quality data sets, considering different IRs. In addition, in future work, we plan to separately assess the performance of QOSM on data sets with imbalance and overlapping, to study the combined and individual influence of these two factors on classification performance.

While the proposed QOSM can be seamlessly integrated with various classifiers, we chose not to include deep learning-based classification methods in our analysis because they usually require large data sets for effective training, a requirement that was not met by the data sets used in this study. If a larger data set characterized by high imbalance and overlap becomes available, we would be eager to investigate the performance of deep learning-based methods in conjunction with QOSM.

Conclusions

In this study, we introduced QOSM, a novel oversampling method based on quantum potential theory, to address the challenges posed by highly imbalanced and overlapped data sets in classification tasks. By leveraging the constructive covering algorithm and quantum potential theory, QOSM selects cover centers based on their potential energy, resulting in improved representation of the original sample distribution, particularly in the overlapped region. In addition, QOSM calculates the centroid of minority covers with their KNNs, allowing for the generation of synthetic samples through linear interpolation. Compared to SMOTE and its variants, QOSM considers both the individual sample relationships and the relationships between covers, which mitigates the impact of noisy data and reduces the generation of mislabeled synthetic samples in highly imbalanced data sets.

Our experiments involved three traditional classifiers (SVM, KNN, and NB) and 10 public data sets characterized by high IRs and significant overlap. The results demonstrate that QOSM significantly enhances classification accuracy when compared to approaches that do not address the imbalance and overlap in the data. Furthermore, QOSM consistently outperforms three existing oversampling methods when evaluated with the same classifiers and data sets. Due to its classifier independence, QOSM can be effectively combined with various classifiers, offering promising potential for improving classification accuracy on highly imbalanced and overlapped data.

In conclusion, QOSM introduces a novel approach to tackle the challenges of imbalance and overlap in classification tasks. Through comprehensive evaluations and comparisons, we have shown its effectiveness in improving classification accuracy and its superiority over existing oversampling methods. With its versatility and potential for integration with different classifiers, QOSM provides a valuable tool for addressing imbalanced and overlapped data in real-world scenarios.

Footnotes

Authors’ Contributions

BY and CZ conceived the project. GT and BY completed most experiments. BY, GT, and CZ collaborated on drafting the manuscript. JL and PG provided insightful inputs and revised the manuscript. All authors have reviewed and approved the final version of the manuscript.

Declaration Of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported or partially supported by the Key Research Projects of Universities in Henan Province of China under grant number 17A520015, the Key Scientific and Technological Research Projects in Henan Province of China (grant numbers: 192102310216 and 232102210006), the Major Research Programs of Science and Technology in Henan Province of China (grant number: 221100210600), the Major Science and Technology Project in Henan Province of China (grant number: 201400210400), and the Medical Science and Technology Joint Construction Project in Henan Province of China (LHGJ20220215).

ORCID iD

Chaoyang Zhang

References

1. Haque, Skinner, Holder. Imbalanced class learning in epigenetics. J Comput Biol 2014;21:492–507

2. Zhang, Luo. Joint imbalanced classification and feature selection for hospital readmissions. Knowl Based Syst 2020;200:106020

3. Idakwo, Thangapandian, Luttrell, Wang, Zhou, Hong, Yang, Zhang, Gong. Structure–activity relationship-based chemical classification of highly imbalanced Tox21 datasets. J Cheminform 2020;12:66

4. Leevy, Khoshgoftaar, Bauder, Seliya. A survey on addressing high-class imbalance in big data. J Big Data 2018;5:42

5. Garcia. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009;21:1263–84

6. Cho, Lee, Chang. Instance-based entropy fuzzy support vector machine for imbalanced data. Theor Adv 2019;23:1183–202

7. Zhu, Guo, Xue J-H. Adjusting the imbalance ratio by the dimensionality of imbalanced data. Pattern Recogn Lett 2020;133:217–23

8. Pozzolo, Caelen, Johnson, Bontempi. Calibrating probability with undersampling for unbalanced classification. In: 2015 IEEE symposium series on computational intelligence, Cape Town, South Africa, 7–10 December 2015, pp. 159–66. New York: IEEE

9. Chawla, Bowyer, Hall, Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Jair 2002;16:321–57

10. Borsos, Lemnaru, Potolea. Dealing with overlap and imbalance: a new metric and approach. Pattern Anal Appl 2018;21:381–95

11. Vuttipittayamongkol, Elyan, Petrovski. On the class overlap problem in imbalanced data classification. Knowl Based Syst 2021;212:106631

12. Vuttipittayamongkol, Elyan. Neighbourhood-based undersampling approach for handling imbalanced and overlapped data. Inform Sci 2020;509:47–70

13. Kalid, K-H, Tong G-K, Khor K-C. A multiple classifiers system for anomaly detection in credit card data with unbalanced and overlapped classes. IEEE Access 2020;8:28210–21

14. Santos, Abreu, Japkowicz, Fernández, Soares, Wilk, Santos. On the joint-effect of class imbalance and overlap: a critical review. Artif Intell Rev 2022;55:6207–75

15. Zhang. A geometrical representation of McCulloch-Pitts neural model and its applications. IEEE Trans Neural Netw 1999;10:925–9

16. Horn, Gottlieb. Algorithm for data clustering in pattern recognition problems based on quantum mechanics. Phys Rev Lett 2001;88:018702

17. Galar, Fernández, Barrenechea, Herrera. EUSBoost: enhancing ensembles for highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn 2013;46:3460–71

18. Han, Wang W-Y, Mao B-H. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In: Huang D-S, Zhang X-P, Huang G-B (eds) Advances in intelligent computing. Berlin: Springer, 2005, pp. 878–87

19. Bai, Garcia. ADASYN: adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE international joint conference on neural networks (IEEE World Congress on Computational Intelligence), Hong Kong, 1–8 June 2008, pp. 1322–8. New York: IEEE

20.

Cieslak

Chawla

Striegel

. Combating imbalance in network intrusion datasets. In: 2006 IEEE international conference on granular computing, Atlanta, GA, 10–12 May 2006, pp. 732–7

21.

Bunkhumpornpat

Sinapiromsaran

Lursinsap

. DBSMOTE: density-based synthetic minority over-sampling technique. Appl Intell 2012;36:664–84

22.

Das

Krishnan

Cook

. Handling class overlap and imbalance to detect prompt situations in smart homes. In: 2013 IEEE 13th international conference on data mining workshops, Dallas, TX, 7–10 December 2013, pp. 266–273. New York: IEEE

23.

Batista

GEAPA

Prati

Monard

. A study of the behavior of several methods for balancing machine learning training data. SIGKDD Explor News 2004;6:20–9

24.

Vorraboot

Rasmequan

Chinnasarn

Lursinsap

. Improving classification rate constrained to imbalanced data between overlapped and non-overlapped regions by hybrid algorithms. Neurocomputing 2015;152:429–43

25.

Weinstein

Horn

. Dynamic quantum clustering: a method for visual exploration of structures in data. Phys Rev E Stat Nonlin Soft Matter Phys 2009;80:066117

26.

Scott

Therani

Wang

. Data clustering with quantum mechanics. Mathematics 2017;5:5

27.

Maignan

Scott

. A comprehensive analysis of quantum clustering: finding all the potential minima. Int J Data Min Knowl Manag Process 2021;11:33–54

28.

Decheng

Jon

Pang

Dong

Won

. Improved quantum clustering analysis based on the weighted distance and its application. Heliyon 2018;4:e00984

29.

Tian

Jia

Deng

Wang

. Quantum clustering ensemble. Int J Comput Intel Syst 2020;14:248–56

30.

Liu

Jiang

Yang

. Analyzing documents with quantum clustering. Pattern Recogn Lett 2016;77:8–13

31.

Kais

. Quantum cluster algorithm for data classification. Mater Theory 2021;5:6

32.

Guliyev

Ismailov

. A single hidden layer feedforward network with only one neuron in the hidden layer can approximate any univariate function. Neural Comput 2016;28:1289–304.

33.

Gasiorowicz

. Quantum physics. 3rd ed. Hobokn, NJ: John Wiley & Sons, 2007

34.

Nasios

Bors

. Kernel-based classification using quantum mechanics. Pattern Recogn 2007;40:875–89

35.

Wang

. Fuzzy clustering algorithm for classified attribute data based on quantum mechanism. J Syst Simul 2008;08:2119–22

36.

Alshomrani

Bawakid

Shim

S-O

Fernández

Herrera

. A proposal for evolutionary fuzzy systems using feature weighting: Dealing with overlapping in imbalanced datasets. Knowl Based Syst 2015;73:1–17

37.

Boser

Guyon

Vapnik

. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory, Pittsburgh, PA, 27–29 July 1992. New York: Association for Computing Machinery, pp. 144–52

38.

Domingos

Pazzani

. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learn 1997;29:103–30

39.

Gazzah

Amara

NEB

. New oversampling approaches based on polynomial fitting for imbalanced data sets. In: 2008 the eighth IAPR international workshop on document analysis systems, Nara, Japan, 16–19 September 2008, pp. 677–84

40.

Barua

Islam

MdM

Murase

. ProWSyn: proximity weighted synthetic oversampling technique for imbalanced data set learning. In: Pei

Tseng

Cao

Motoda

(eds) Advances in Knowledge Discovery and Data Mining. Berlin: Springer, 2013, pp. 317–28

41.

Zhu

Zhang

Gong

Zhu

. SMOTE-NaN-DE: addressing the noisy and borderline examples problem in imbalanced classification by natural neighbors and differential evolution. Knowl Based Syst 2021;223:107056

42.

Kovács

. An empirical comparison and evaluation of minority oversampling techniques on a large number of imbalanced datasets. Appl Soft Comput 2019;83:105662

43.

Zhu

Lin

Liu

. Improving interpolation-based oversampling for imbalanced data learning. Knowl Based Syst 2020;187:104826

44.

Saito

Rehmsmeier

. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE 2015;10:e0118432