A data-driven principal component analysis-support vector machine approach for breast cancer diagnosis: Comparison and application

Abstract

In recent years, with the development of artificial intelligence, data-driven methodologies have been widely studied in fault diagnosis and detection, since the growing complexity of modern systems makes mechanism model information difficult to obtain. In human health monitoring in particular, a mechanism model is very difficult to obtain. Existing challenges, such as the huge amount of data, high data dimension, and large noise interference, make data-driven approaches more suitable. To address these problems, we present a principal component analysis-support vector machine (PCA-SVM) method with different kernels to reduce data dimension, and two breast cancer datasets are used to verify the method. Additionally, support vector machine-recursive feature elimination (SVM-RFE), the original SVM with different kernels, PCA and modified PCA (MPCA) methods are applied to diagnose malignant cancer for comparison with PCA-SVM. In experiments on the two breast cancer datasets obtained from the University of Wisconsin Hospital, PCA-SVM with the radial basis function (RBF) kernel shows better performance than the other methods. Finally, PCA-SVM in this study uses only six principal components and obtains better accuracy (97.19%) than most of the previous studies.

Keywords

Artificial intelligence, machine learning, fault diagnosis, fault detection, health monitoring

Introduction

In recent years, monitoring and detection methods based on the mechanism model have developed rapidly along with artificial intelligence. However, the growing complexity of modern systems makes the mechanism model information of a system difficult to obtain. To address this problem, a growing number of researchers use data for fault diagnosis and detection (Costas-Perez and Rodriguez-Andina, 2009; Yin and Huang, 2015), and data-driven methodologies have been widely studied (Li et al., 2018; Yin et al., 2014a; Zhang et al., 2018). Meanwhile, many detection methods have been presented and applied in various fields to provide timely diagnosis. For instance, Yang et al. (2016) proposed a fast search method based on principal component analysis (PCA) to search codewords using vector quantization codebooks, which are obtained by PCA with the Linde-Buzo-Gray algorithm. Yin et al. (2014b) constructed a fault detection scheme based on a robust 1-class support vector machine (SVM), and a simulation example showed that the robust 1-class SVM was superior to the general 1-class SVM, especially when the training data set is corrupted by outliers. Yang and Hou (2016) used the support vector machine-recursive feature elimination (SVM-RFE) method to reduce the dimension of variables in order to diagnose faults in the benchmark fed-batch penicillin fermentation process. Jiang and Yin (2018) proposed a recursive total principle component regression based design and implementation approach for efficient data-driven fault detection, and carried out simulation tests on Carsim to compare the proposed approach with multiple existing methods.

The mechanism of the human body involves chemical, physical and physiological factors; therefore, it is very difficult to obtain a mechanism model for human health monitoring. With the development of modern medicine, more and more advanced testing instruments have been applied in clinical practice, and a large amount of data can be generated. Because of challenges such as the huge amount of data, high data dimension, and large noise interference, data-driven approaches seem quite suitable in this field. Medical data cover a very wide range: they can be images, audio, video or text. The amount of medical data produced every day is considerable, and appropriate data mining methods can greatly reduce the workload of doctors.

Many researchers have achieved fruitful results in health monitoring and diagnosis based on data-driven methodologies. Gutta et al. (2018) proposed a likelihood ratio test to detect obstructive sleep apnea (OSA) using the widely available heart rate and peripheral oxygen saturation measurement signals, and conducted experiments on both synthetic and real data to show the effectiveness of the proposed OSA detection framework compared to purely data-driven approaches. Lee et al. (2012) used empirical mode decomposition and statistical approaches to detect motion and noise artifacts in electrocardiograph data. Yang et al. (2019) proposed a multilevel feature extraction method based on wavelet transform and used the SVM-RFE method to select the most relevant features for atrial fibrillation detection. Jaganathan and Kuppuchamy (2013) presented a measure of feature relevance based on fuzzy entropy, tested with a radial basis function (RBF) network classifier, and five benchmark datasets from the UCI Machine Learning Repository were used to evaluate the classifier.

Cancer is considered to be one of the leading causes of human death. Early detection can greatly improve the chances of cure, but there are no obvious symptoms at the early stage. In recent years, there has been a great deal of work on data-driven early diagnosis of cancer. Wu and Zhou (2017) proposed cervical cancer diagnosis based on SVM; unlike other studies, the cervical cancer data were represented by 32 risk factors and four target variables. Breast density on mammograms may affect the accuracy of breast cancer diagnosis; Haque et al. (2017) developed a system to identify fat and cancer. Two-dimensional discrete cosine transform and principal component analysis were used to extract features, and multi-layer perceptron (MLP), SVM and k-nearest neighbor (kNN) were adopted as classifiers. Rustam and Maghfirah (2018) considered the presence of related oncogenes and used SVM-RFE as the feature selection and classification method for a cancer microarray dataset. In Tan et al. (2011), RFE was modified into a two-stage feature elimination scheme, and the effectiveness of the model was evaluated using a multi-category lung cancer problem. In order to develop an intelligent remote system for detection and diagnosis of breast cancer on the basis of cytological images, George et al. (2014) proposed an algorithm for cell nuclei detection and segmentation. Based on this summary of previous work, most existing data-based diagnostic methods can be divided into two parts, namely feature selection and classifier design. Feature selection either selects some representative data from the original data, or generates new features from the original data by dimensionality reduction. Classifier design mainly considers generalization performance, so that different samples can be distinguished by learning from the features; common classifiers include SVM, kNN, and decision trees (Cheng et al., 2019; Han et al., 2017).

Medical data are usually high-dimensional, noisy and redundant. Taking the breast cancer dataset used in this paper as an example, each patient is represented by a 30-dimensional vector, which contains the nucleus radius, smoothness, cell size uniformity, and so forth. Even for an experienced physician, it is difficult to make a diagnosis from these 30 values at the same time. To address these problems, this paper presents PCA-SVM with different kernels to reduce the dimension of the data and diagnose breast cancer. In addition, SVM-RFE, the original SVM with different kernels, PCA and MPCA methods are applied to diagnose malignant cancer for comparison with PCA-SVM. This work rests on some theoretical assumptions and has some application limitations. The datasets come from real hospital patients, so some errors cannot be avoided in the process of data collection; these errors are not considered in the experiments. We also assume that the real data have large variance and the noise has small variance. As for application limitations, the attributes of newly collected data should be consistent with the datasets used in this article, and the diagnosis result is only for physicians' reference. The contributions of this paper can be summarized as follows. Firstly, we emphasize the actual demands of cancer diagnosis and seek a concise but reliable data-driven breast cancer diagnosis method; PCA-SVM is such a method, which can guarantee high accuracy without too much computation. Secondly, MPCA is used in cancer diagnosis for the first time. MPCA can calculate the diagnostic threshold of malignant tumors, which is a completely new idea for malignant tumor diagnosis. Thirdly, in addition to the methods of previous research, we also applied a variety of machine learning methods for comparative studies, including decision tree, random forest and neural networks. The results of these comparative studies also demonstrate the superiority of PCA-SVM.

The remaining parts of the paper are organized as follows. The fundamentals of SVM and the evaluation criteria are briefly reviewed in Section 2. Section 3 introduces the methodologies used in this paper, and discusses the PCA-SVM, SVM-RFE, and MPCA approaches. Two sets of experiments are presented and discussed in Section 4, to show the validity and practical applicability of the presented method. Finally, Section 5 gives a brief summary.

Background

SVM

As mentioned in Section 1, data-driven approaches are quite suitable for breast cancer diagnosis. The remarkable generalization ability of SVM makes it one of the most popular classification algorithms. SVM was developed on the basis of the Vapnik-Chervonenkis dimension theory and the structural risk minimization principle of statistical learning theory (Vapnik, 2013). Therefore, the classification result is not related to the sample dimension.

Evaluation criteria for diagnosis performance

The effectiveness of different attributes should be evaluated by suitable metrics based on the true / false positive / negative classification results. True positives (TP) and true negatives (TN) represent the cases that the predicted classes match the actual (true) classes of the instances, and false positives (FP) and false negatives (FN) correspond to the opposite cases. The most common metrics are accuracy, sensitivity, specificity, positive predictive accuracy (PPA), negative predictive accuracy (NPA), F-measure, Kappa statistics, and they are introduced as follows.

The classification accuracy can be calculated as

Accuracy = \frac{TP + TN}{TP + TN + FP + FN} .

The sensitivity, which also means true positive rate, can be defined as

Sensitivity = \frac{TP}{TP + FN} .

The specificity or true negative rate, is calculated by

Specificity = \frac{TN}{TN + FP} .

Precision is also called PPA, and it can be calculated as

PPA = \frac{TP}{TP + FP} .

NPA is computed by

NPA = \frac{TN}{TN + FN} .

The F-measure, also called the F-score, is defined as

F\text{-}measure = \frac{2 \times TP}{2 \times TP + FP + FN} .

Kappa statistics can be computed by

Kappa = \frac{Accuracy - P_{e}}{1 - P_{e}}

where $P_{e} = \frac{(TP + FP) (TP + FN) + (FN + TN) (FP + TN)}{{(TP + FP + TN + FN)}^{2}}$ .
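The seven criteria above can be computed together from the four confusion-matrix counts. The following is a minimal sketch rather than the evaluation code used in the experiments; the function name `diagnosis_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
def diagnosis_metrics(tp, tn, fp, fn):
    """Compute the seven evaluation criteria from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    # Chance agreement P_e used by the Kappa statistic.
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppa": tp / (tp + fp),
        "npa": tn / (tn + fn),
        "f_measure": 2 * tp / (2 * tp + fp + fn),
        "kappa": (accuracy - p_e) / (1 - p_e),
    }


# Hypothetical counts, chosen only for illustration (not taken from the paper).
metrics = diagnosis_metrics(tp=50, tn=40, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts, $P_{e} = 0.505$ , so Kappa is noticeably lower than the accuracy because it discounts agreement expected by chance.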

Algorithm

PCA-SVM

PCA has been widely studied in many fields (Li et al., 2011; Yin et al., 2012). It converts multiple correlated variables into a few uncorrelated ones, namely principal components, while preserving most of the information in the data after the dimension is reduced. Cancer diagnosis using the PCA-SVM method can be considered a two-stage process: PCA projects the high-dimensional input data onto low-dimensional principal components, and the selected principal components are then used as the input of the SVM for classification. Therefore, how to select the principal components is the key point of this method.

The original cancer data $X = (x_{1}, x_{2}, \dots, x_{m})^{T}$ is an m-dimensional vector, and $Γ_{1}$ is the first principal component of X, which can be expressed as

Γ_{1} = {w_{1}}^{T} X = w_{1, 1} x_{1} + w_{1, 2} x_{2} + \dots + w_{1, m} x_{m}

(1)

where $w_{1}$ is the weight vector. Normally $‖ w_{1} ‖ = 1$ , and our goal is to find the specific $w_{1}$ that maximizes the variance of $Γ_{1}$ , denoted $var (Γ_{1})$ .

$Φ$ is the covariance matrix of vector X with eigenvalues $σ_{1} ⩾ σ_{2} ⩾ \dots ⩾ σ_{m} ⩾ 0$ , and the matrix of corresponding orthonormal eigenvectors $P = (p_{1}, p_{2}, \dots, p_{m})$ satisfies

P^{T} Φ P = diag (σ_{1}, σ_{2}, \dots, σ_{m})

(2)

In order to compute $var (Γ_{1})$ , $P^{T} w_{1}$ can be represented as a vector, that is $ε_{1} = (ε_{1, 1}, ε_{1, 2}, \dots, ε_{1, m})^{T} = P^{T} w_{1}$ , then

{ε_{1}}^{T} ε_{1} = {w_{1}}^{T} P P^{T} w_{1} = {w_{1}}^{T} w_{1} = 1

(3)

meanwhile

\begin{matrix} var (Γ_{1}) = {ε_{1}}^{T} P^{T} Φ P ε_{1} \\ = σ_{1} ε_{1, 1}^{2} + σ_{2} ε_{1, 2}^{2} + \dots + σ_{m} ε_{1, m}^{2} \\ ⩽ σ_{1} {ε_{1}}^{T} ε_{1} \end{matrix}

(4)

In this way, $var (Γ_{1})$ can be calculated from the $ε_{1}$ , when $ε_{1} = (1, 0, \dots, 0)^{T}$ ,

var (Γ_{1}) = σ_{1}

(5)

w_{1} = P ε_{1} = p_{1}

(6)

It can be seen that, under the constraint $‖ w_{1} ‖ = 1$ , $var (Γ_{1})$ attains its maximum value when $w_{1} = p_{1}$ , and

\max_{{w_{1}}^{T} w_{1} = 1} {var (Γ_{1})} = σ_{1}

(7)

Let $Γ_{2}$ be the second principal component of X, with weight vector $w_{2}$ . Under the condition $‖ w_{2} ‖ = 1$ , it can be concluded that

cov (Γ_{2}, Γ_{1}) = {w_{2}}^{T} Φ p_{1} = σ_{1} {w_{2}}^{T} p_{1}

(8)

Let ${w_{2}}^{T} p_{1} = 0$ , so that $Γ_{2}$ is uncorrelated with $Γ_{1}$ , and define

ε_{2} = (ε_{2, 1}, ε_{2, 2}, \dots, ε_{2, m})^{T} = P^{T} w_{2}

(9)

where ${ε_{2}}^{T} ε_{2} = 1$ . Since $w_{2} = P ε_{2}$ by equation (9), the condition ${w_{2}}^{T} p_{1} = 0$ can be written as

{w_{2}}^{T} p_{1} = ε_{2, 1} {p_{1}}^{T} p_{1} + ε_{2, 2} {p_{2}}^{T} p_{1} + \dots + ε_{2, m} {p_{m}}^{T} p_{1} = 0

(10)

Because the eigenvectors are orthonormal, equation (10) gives $ε_{2, 1} = 0$ . Therefore, the variance $var (Γ_{2})$ of the second principal component $Γ_{2}$ can be expressed as

\begin{matrix} var (Γ_{2}) = {w_{2}}^{T} Φ w_{2} \\ = σ_{1} ε_{2, 1}^{2} + σ_{2} ε_{2, 2}^{2} + \dots + σ_{m} ε_{2, m}^{2} \\ ⩽ σ_{2} {ε_{2}}^{T} ε_{2} \end{matrix}

(11)

Similar to the first principal component, under the conditions $‖ w_{2} ‖ = 1$ and $cov (Γ_{2}, Γ_{1}) = σ_{1} {w_{2}}^{T} p_{1} = 0$ , $var (Γ_{2})$ attains its maximum value $σ_{2}$ . In conclusion, all principal components can be obtained by iterating this procedure.

PCA-SVM is a method combining dimension reduction and classification. The purpose of PCA is dimension reduction, so its input is the original data, namely X, and its output is the features after dimension reduction, namely the principal components $Γ$ . The purpose of SVM is classification, and its input is the features after dimension reduction. The original data are highly redundant and linearly correlated; if SVM is used directly for classification, a large amount of computation is required. The advantage of PCA-SVM is that it removes the linear correlation of the original data and reduces the computation in classification.
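The PCA stage described in this section can be sketched with an eigendecomposition of the sample covariance matrix. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names `pca_fit` and `pca_transform` are hypothetical, the data are synthetic, and the SVM stage is omitted (the resulting scores would be passed to any SVM implementation).

```python
import numpy as np


def pca_fit(X, n_components):
    """Return (mean, eigenvalues, eigenvectors) of the leading principal components."""
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / (X.shape[0] - 1)        # sample covariance matrix (Phi)
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvals[order], eigvecs[:, order]


def pca_transform(X, mean, P):
    """Project centered samples onto the retained eigenvectors (columns of P)."""
    return (X - mean) @ P


# Synthetic, linearly correlated data standing in for the 30-attribute records.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = np.array([[1.0, 0.5, 0.2, 0.0, 0.3],
                   [0.0, 1.0, 0.4, 0.7, 0.1]])
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

mean, sigma, P = pca_fit(X, n_components=2)
scores = pca_transform(X, mean, P)            # low-dimensional input for a classifier
print(scores.shape, sigma)
```

The variance of each score column equals the corresponding eigenvalue, and the score columns are uncorrelated, which is the property derived in equations (7) and (8).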

SVM-RFE

In order to reduce the dimension of the breast cancer database and obtain the most relevant attributes, the SVM-RFE method (Table 1), which was proposed by Guyon et al. (2002) for colon cancer classification, is employed. The purpose is to select the minimum set of attributes that achieves the optimal accuracy (John et al., 1994). Using SVM-RFE, attributes are listed in descending order of contribution to diagnosis accuracy (the so-called RFE sequence). Each iteration eliminates the most irrelevant remaining attribute, until the RFE sequence is obtained. The attributes eliminated at the beginning of the iterations are typically noisy or redundant. The process is finished when classification accuracy decreases after elimination of the last feature on the RFE sequence. Hence, the best diagnosis accuracy can be achieved with the minimum number of (relevant) attributes.

Table 1.

SVM-RFE algorithm (Guyon et al., 2002).

Input:
    Training data: X_{0} = [X_{1}, X_{2}, \dots, X_{k}, \dots, X_{l}]^{T}
    Class labels: y_{0} = [y_{1}, y_{2}, \dots, y_{k}, \dots, y_{l}]^{T}

Initialize:
    Surviving attributes: s = [1, 2, \dots, n]
    RFE sequence: r = []

Repeat until s = []:
    Restrict training data to good attribute indices: X = X_{0} (:, s)
    Train the classifier: α = SVM-train (X, y_{0})
    Calculate the weight vector of dimension length(s): ω = \sum_{k} α_{k} y_{k} X_{k}
    Calculate the sequence criteria: c_{i} = (ω_{i})^{2} , for all i
    Identify attribute with the smallest sequence criterion: f = \arg \min (c)
    Update RFE sequence: r = [s (f), r]
    Remove the identified attribute: s = s (1 : f - 1, f + 1 : length (s))
end

Output:
    RFE sequence: r
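The elimination loop of Table 1 can be sketched in a few lines. The sketch below is only an illustration of the ranking logic: the function names are hypothetical, and `uniform_alphas` is a placeholder trainer that is not an SVM (it sets every multiplier to 1), so that the example stays self-contained. In practice the callable would return the multipliers of a trained linear SVM.

```python
def rfe_sequence(X0, y0, train_alphas):
    """Rank attributes by recursive feature elimination (the steps of Table 1).

    X0: list of samples, each a list of n attribute values.
    y0: class labels in {-1, +1}.
    train_alphas: callable (X, y) -> per-sample multipliers alpha_k.
    Returns 0-based attribute indices, most relevant first.
    """
    s = list(range(len(X0[0])))      # surviving attributes
    r = []                           # RFE sequence
    while s:
        X = [[row[j] for j in s] for row in X0]        # restrict to survivors
        alpha = train_alphas(X, y0)
        # Weight vector: omega = sum_k alpha_k * y_k * X_k
        omega = [sum(a * y * row[i] for a, y, row in zip(alpha, y0, X))
                 for i in range(len(s))]
        c = [w * w for w in omega]                     # ranking criterion
        f = min(range(len(c)), key=c.__getitem__)      # smallest criterion
        r = [s[f]] + r                                 # prepend to the sequence
        del s[f]                                       # remove the attribute
    return r


def uniform_alphas(X, y):
    """Placeholder trainer (NOT an SVM): every sample gets alpha_k = 1."""
    return [1.0] * len(X)


# Toy data: attribute 0 follows the label, attributes 1 and 2 carry no signal.
X0 = [[2, 1, 1], [2, -1, -1], [-2, 1, -1], [-2, -1, 1]]
y0 = [1, 1, -1, -1]
ranking = rfe_sequence(X0, y0, uniform_alphas)
print(ranking)
```

Because each eliminated attribute is prepended to the sequence, the attribute removed last, which is the most relevant one, ends up first, matching the descending order used for the RFE sequence.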

Modified PCA

Modified PCA (MPCA) method aims at combining squared prediction error (SPE) and $T^{2}$ statistics under some conditions (Yin et al., 2017), to reduce computation load. The main idea is presented as follows.

The likelihood ratio approach is widely utilized in diagnosis and detection, and the system model is

ψ = ε + η, \quad η = \begin{cases} η_{0}, & \text{no change} \\ η_{1}, & \text{change} \end{cases}

(12)

where $ψ, ε, η \in R^{m}$ , $ε$ obeys a zero-mean normal distribution with covariance $Φ$ , $ε ~ N (0, Φ)$ , and $η$ is a constant used to calculate the diagnostic threshold, which takes one of two values: $η_{0}$ or $η_{1}$ . The probability density function of $ψ$ is defined as

p_{η, Φ} (ψ) = \frac{1}{\sqrt{{(2 π)}^{m} det (Φ)}} e^{- \frac{1}{2} {(ψ - η)}^{T} Φ^{- 1} (ψ - η)} .

(13)

With the Gaussian vector $ψ$ , the log likelihood ratio is

\begin{matrix} L (ψ) = \ln \frac{p_{η_{1}} (ψ)}{p_{η_{0}} (ψ)} \\ = \frac{1}{2} [{(ψ - η_{0})}^{T} Φ^{- 1} (ψ - η_{0}) - {(ψ - η_{1})}^{T} Φ^{- 1} (ψ - η_{1})] . \end{matrix}

(14)

According to the decision rule, the core of likelihood ratio approach is represented as

L (ψ) \begin{cases} < 0, & η = η_{0} \text{ is accepted} \\ > 0, & η = η_{1} \text{ is accepted} . \end{cases}

(15)

Assume that $η_{0} = 0$ and that $κ$ samples $ψ_{1}, \dots, ψ_{κ}$ of $ψ$ are available. Substituting these samples into equation (14), the likelihood ratio can be written as

\begin{matrix} L_{1}^{κ} = \frac{1}{2} [\sum_{i = 1}^{κ} {ψ_{i}}^{T} Φ^{- 1} ψ_{i} - \sum_{i = 1}^{κ} {(ψ_{i} - η_{1})}^{T} Φ^{- 1} (ψ_{i} - η_{1})] \\ = \frac{1}{2} [κ {\bar{ψ}}^{T} Φ^{- 1} \bar{ψ} - κ {(\bar{ψ} - η_{1})}^{T} Φ^{- 1} (\bar{ψ} - η_{1})] \end{matrix}

(16)

where $\bar{ψ}$ is the average of the samples, $\bar{ψ} = \frac{1}{κ} \sum_{i = 1}^{κ} ψ_{i}$ . In general, $η_{1}$ is not available in reality. The generalized likelihood ratio (GLR) is therefore used to detect the change in $η$ , with maximum likelihood estimation used to estimate $η_{1}$ . The estimate $\overset{\land}{η_{1}}$ is the value of $η_{1}$ at which the likelihood ratio in equation (16) reaches its maximum value

\overset{\land}{η_{1}} = \arg \max_{η_{1}} L_{1}^{κ} = \bar{ψ} \Rightarrow \max_{η_{1}} L_{1}^{κ} = \frac{κ}{2} {\bar{ψ}}^{T} Φ^{- 1} \bar{ψ} .

(17)
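The maximization in equation (17) can be checked directly from the second line of equation (16). The worked step below is a sketch added for clarity; it assumes that $Φ$ is positive definite, so that $Φ^{- 1}$ exists and the quadratic form is non-negative, and it writes $\bar{ψ}$ as a column vector as in equation (16).

```latex
L_{1}^{\kappa}(\eta_{1})
  = \frac{\kappa}{2}\left[\bar{\psi}^{T}\Phi^{-1}\bar{\psi}
    - (\bar{\psi}-\eta_{1})^{T}\Phi^{-1}(\bar{\psi}-\eta_{1})\right],
\qquad
(\bar{\psi}-\eta_{1})^{T}\Phi^{-1}(\bar{\psi}-\eta_{1}) \geq 0 .
% The subtracted term vanishes if and only if \eta_{1} = \bar{\psi}, hence
\hat{\eta}_{1} = \bar{\psi},
\qquad
\max_{\eta_{1}} L_{1}^{\kappa} = \frac{\kappa}{2}\,\bar{\psi}^{T}\Phi^{-1}\bar{\psi} .
```

Only the subtracted quadratic form depends on $η_{1}$ , so the maximum is reached exactly when that term is zero.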

Noticing that $\bar{ψ} ~ N (0, Φ / κ)$ when no change occurs, $κ {\bar{ψ}}^{T} Φ^{- 1} \bar{ψ}$ follows a $χ^{2}$ -distribution with m degrees of freedom. With the confidence level $α$ , the change detection method based on GLR can be described as below.

Set the threshold $J_{th} = χ_{α}^{2} (m)$ , where $χ_{α}^{2} (m)$ is determined by the $χ^{2}$ -distribution table.

Define test statistic

J = κ {\bar{ψ}}^{T} Φ^{- 1} \bar{ψ}

(18)

where $\bar{ψ} = \frac{1}{κ} \sum_{i = 1}^{κ} ψ_{i}$ ;

Define detection logic

J \begin{cases} < J_{th}, & \text{no change} \\ > J_{th}, & \text{a change is detected} . \end{cases}

(19)
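The three steps above (threshold, test statistic, detection logic) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function names are hypothetical, the covariance is taken as known, the samples are made up, and the threshold helper uses the closed-form $χ^{2}$ quantile that is valid only for m = 2; for other dimensions the value would come from a $χ^{2}$ table or a statistics library.

```python
import math

import numpy as np


def glr_statistic(samples, cov):
    """J = kappa * psi_bar^T * cov^{-1} * psi_bar, as in equation (18)."""
    samples = np.asarray(samples, dtype=float)
    kappa = samples.shape[0]
    psi_bar = samples.mean(axis=0)
    # Solve cov * z = psi_bar instead of forming the explicit inverse.
    return float(kappa * psi_bar @ np.linalg.solve(cov, psi_bar))


def chi2_threshold_2dof(confidence):
    """Chi-square quantile for m = 2 only, where it equals -2 ln(1 - p)."""
    return -2.0 * math.log(1.0 - confidence)


cov = np.eye(2)                          # assumed known covariance (illustrative)
j_th = chi2_threshold_2dof(0.95)         # threshold J_th for m = 2

no_change = [[0.1, -0.1], [-0.1, 0.1]]   # sample mean at eta_0 = 0
changed = [[3.1, 2.9], [2.9, 3.1]]       # sample mean shifted away from 0

# Detection logic of equation (19): a change is flagged when J exceeds J_th.
detected = [glr_statistic(s, cov) > j_th for s in (no_change, changed)]
print(j_th, detected)
```

Setting the number of samples to one reduces this statistic to the squared Mahalanobis distance of a single sample.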

According to standard PCA statistics approach, $T^{2}$ and SPE are calculated statistics from which two thresholds can be obtained. Obviously, $T^{2}$ statistic is equal to GLR test statistic computed by equation (18) with $κ = 1$ , and the optimal diagnosis performance can be obtained.

As for monitoring the residual subspace, in order to avert the possible problem of $Λ_{res}$ and utilize $χ^{2}$ -table, an alternative test statistic is used.

Define

Γ = diag (\frac{σ_{m}}{σ_{l + 1}}, \dots, \frac{σ_{m}}{σ_{m - 1}}, 1) \in R^{(m - l) \times (m - l)}

Then

Γ^{1 / 2} P_{res}^{T} x ~ N (0, σ_{m} I_{(m - l) \times (m - l)})

and

x^{T} P_{res} Γ P_{res}^{T} x = σ_{m} x^{T} P_{res} Λ_{res}^{- 1} P_{res}^{T} x

where x is a normalized sample and $P_{res}$ consists of the corresponding orthonormal eigenvectors in the ill-conditioned (residual) subspace. With the confidence level $α$ , an alternative statistic and a threshold are introduced as follows

T_{n}^{2} = x^{T} P_{res} Γ P_{res}^{T} x

(20)

J_{th, T_{n}^{2}} = σ_{m} χ_{α}^{2} (m - l) .

(21)

The diagnosis performance can be improved with $T_{n}^{2}$ statistics and the threshold $J_{th, T_{n}^{2}}$ .

In order to ensure a high diagnosis accuracy, a linear combination of two statistics is applied for diagnosis, where $T^{2}$ is the statistic of the normal PCA calculation and $T_{res}^{2}$ is a statistic calculated in the ill-conditioned condition. It is described as

T_{c}^{2} = ε_{1} T^{2} + ε_{2} T_{res}^{2}

(22)

where $ε_{1}, ε_{2} > 0$ . Then, equation (22) can be rewritten into

T_{c}^{2} = x^{T} P Ψ P^{T} x

where

Ψ = \begin{bmatrix} ε_{1} Λ_{pc}^{- 1} & 0 \\ 0 & ε_{2} Q \end{bmatrix}

Q = \begin{cases} Λ_{res}^{- 1}, & T_{res}^{2} = T_{H}^{2} \\ I, & T_{res}^{2} = SPE \\ Γ, & T_{res}^{2} = T_{n}^{2} . \end{cases}

$Λ_{pc}$ is a singular value matrix in normal space, $Λ_{res}$ is a singular value matrix of ill-conditioned space. Considering that $P^{T} x ~ N (0, Λ), x^{T} P Λ^{- 1} P^{T} x ~ χ^{2} (m)$ , it is necessary to present a new combined test statistic to avoid the complex computation of $Λ_{res}^{- 1}$

T_{comb}^{2} = x^{T} P \bar{Γ} P^{T} x

(23)

where $\bar{Γ} = diag (\frac{σ_{m}}{σ_{1}}, \dots, \frac{σ_{m}}{σ_{m - 1}}, 1)$ . With a confidence level $α$ , the threshold of $T_{comb}^{2}$ is written as

J_{th, T_{comb}^{2}} = σ_{m} χ_{α}^{2} (m) .

(24)

The covariance matrix $Φ$ is singular in some special cases, and its singular value decomposition (SVD) is then described as

Φ = \frac{1}{N - 1} X^{T} X = \begin{bmatrix} P & P_{⊥} \end{bmatrix} \begin{bmatrix} Λ & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} P^{T} \\ P_{⊥}^{T} \end{bmatrix} .

Abnormal situation can be diagnosed by utilizing the combined statistic from equation (23) with the related threshold in equation (24) and the parity examination

P_{⊥}^{T} x = 0 .

(25)

In this paper, if $T_{comb}^{2} \leq J_{th, T_{comb}^{2}}$ and the sample x also satisfies equation (25), the data will be considered as normal. Otherwise, it will be treated as an abnormal situation.
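The combined statistic of equation (23) and its threshold in equation (24) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names and the synthetic data are hypothetical, samples are centered but not rescaled, the covariance matrix is assumed nonsingular so the parity examination of equation (25) is skipped, and the $χ^{2}$ quantile uses the closed form valid only for m = 2.

```python
import math

import numpy as np


def fit_t_comb(X_normal, chi2_quantile):
    """Estimate mean, eigenvectors, Gamma-bar and the threshold of equation (24)."""
    X_normal = np.asarray(X_normal, dtype=float)
    mean = X_normal.mean(axis=0)
    Xc = X_normal - mean
    cov = Xc.T @ Xc / (len(X_normal) - 1)
    sigma, P = np.linalg.eigh(cov)            # ascending eigenvalues
    sigma, P = sigma[::-1], P[:, ::-1]        # reorder: sigma_1 >= ... >= sigma_m
    gamma_bar = np.diag(sigma[-1] / sigma)    # diag(sigma_m/sigma_1, ..., 1)
    threshold = sigma[-1] * chi2_quantile     # sigma_m * chi-square quantile
    return mean, P, gamma_bar, threshold


def t_comb(x, mean, P, gamma_bar):
    """Combined statistic x^T P Gamma-bar P^T x for one centered sample."""
    z = P.T @ (np.asarray(x, dtype=float) - mean)
    return float(z @ gamma_bar @ z)


# Synthetic two-attribute samples standing in for the normal (benign) class.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(300, 2)) * [2.0, 0.5]

chi2_99 = -2.0 * math.log(1.0 - 0.99)         # closed form, valid only for m = 2
mean, P, gamma_bar, j_th = fit_t_comb(X_normal, chi2_99)

inlier = t_comb(mean, mean, P, gamma_bar)
outlier = t_comb(mean + np.array([0.0, 50.0]), mean, P, gamma_bar)
print(inlier <= j_th, outlier > j_th)
```

Only ratios of eigenvalues appear in the diagonal matrix, so no inverse of an ill-conditioned matrix has to be formed explicitly.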

Experiments

In order to verify the effectiveness of the presented algorithm in health monitoring, two breast cancer datasets are used in the experiments, both from the University of Wisconsin Hospital (Mangasarian et al., 1995; Wolberg and Mangasarian, 1990). One is a small dataset containing nine attributes, and the other is a high-dimensional dataset with 30 attributes. The first one contains 699 groups of data from 699 patients, of which 458 (65.5%) are benign and 241 (34.5%) are malignant. Since 16 groups have missing attribute values, these groups are removed in order to avoid deviations in the diagnostic results. Finally, 683 groups (444 benign and 239 malignant) are employed in the experiments. The attributes of this dataset (DataSet I) are shown in Table 2. The second dataset (DataSet II) contains 357 benign (62.7%) and 212 malignant (37.3%) samples, and its attribute information is shown in Table 3. The data scales of DataSet I and DataSet II are both large enough for the evaluation criteria (Section 2) to be valid in this context.

Table 2.

Attributes of DataSet I.

Number	Attributes name	Domain
1	Clump thickness	1–10
2	Uniformity of cell size	1–10
3	Uniformity of cell shape	1–10
4	Marginal adhesion	1–10
5	Single epithelial cell size	1–10
6	Bare nuclei	1–10
7	Bland chromatin	1–10
8	Normal nucleoli	1–10
9	Mitoses	1–10

Table 3.

Ten real-valued attributes computed for each cell nucleus in DataSet II.

Number	Attributes information
1	Radius (mean of distances from center to points on the perimeter)
2	Texture (standard deviation of gray-scale values)
3	Perimeter
4	Area
5	Smoothness (local variation in radius lengths)
6	Compactness (perimeter² / area - 1.0)
7	Concavity (severity of concave portions of the contour)
8	Concave points (number of concave portions of the contour)
9	Symmetry
10	Fractal dimension (“coastline approximation” - 1)

Firstly, the 10-fold cross-validation method is used to split training and testing data for both breast cancer datasets. After that, two dimension-reduction algorithms, PCA-SVM and SVM-RFE, are compared with the original SVM classifier with different kernels on DataSet I, and the methods are also used to diagnose malignant cancer on DataSet II. Secondly, two statistical methods, PCA and MPCA, are used to compare the diagnosis performances on DataSet I. Then, the better statistical method is also used to diagnose malignant cancer on DataSet II, and the results are compared with previous studies using the same dataset.
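A 10-fold split of the kind described here can be sketched with the standard library alone. The sketch is an assumption about the procedure, not a description of the authors' setup: the paper does not state whether the samples were shuffled or stratified by class, and the function name `k_fold_indices` and the seed are hypothetical. The sample count 683 is the size of DataSet I after removing incomplete records.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # assumed: shuffle before splitting
    folds = [indices[i::k] for i in range(k)]     # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(683, k=10))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

Each sample appears in exactly one test fold, so averaging a metric over the ten folds uses every record once for testing.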

The experiments based on DataSet I

Firstly, the SVM-RFE algorithm is employed to reduce the data dimension of DataSet I and to obtain the RFE sequence after iterations. As shown in Table 4, “uniformity of cell size” is the most relevant attribute of all, and the most irrelevant attribute is “mitoses”. On the basis of the RFE sequence, different numbers of attributes are chosen to feed the linear and RBF kernel SVM classifiers, respectively. As shown in Figure 1, the best classification accuracy of the linear kernel, 96.78%, is obtained with the first six attributes in the RFE sequence. In comparison, Figure 2 shows that the best result obtained by the RBF kernel SVM, 96.93%, is better and uses only the four most relevant attributes.

Table 4.

Attributes of DataSet I on RFE sequence.

Number	Attributes name
2	Uniformity of cell size
3	Uniformity of cell shape
6	Bare nuclei
1	Clump thickness
4	Marginal adhesion
8	Normal nucleoli
7	Bland chromatin
5	Single epithelial cell size
9	Mitoses

Figure 1.

Classification accuracies based on DataSet I via SVM-RFE method (linear kernel).

Figure 2.

Classification accuracies based on DataSet I via SVM-RFE method (RBF kernel).

Secondly, the training and testing data of DataSet I are fed into the PCA-SVM algorithm, and two kinds of SVMs, with linear and RBF kernels, are used in this experiment. As shown in Figure 3 and Figure 4, the best classification accuracy obtained by the RBF kernel, 97.21%, is 1.61% higher than the 95.60% obtained by the linear kernel SVM with the same five principal components.

Figure 3.

Classification accuracies based on DataSet I via PCA-SVM method (linear kernel).

Figure 4.

Classification accuracies based on DataSet I via PCA-SVM method (RBF kernel).

Table 5 shows the classification results obtained by SVM, SVM-RFE, PCA-SVM with different kernels based on DataSet I. Both SVM-RFE and PCA-SVM methods show better diagnosis performance than original SVMs with RBF kernel. The best result of all is 97.21% obtained by PCA-SVM method based on RBF kernel with five principal components.

Table 5.

The best classification results based on DataSet I with different dimension-reduction approaches (different kernels).

	SVM (linear)	SVM (RBF)	SVM-RFE (linear)	SVM-RFE (RBF)	PCA-SVM (linear)	PCA-SVM (RBF)
Attribute number	9	9	6	4	5	5
Accuracy (%)	96.48	96.63	96.78	96.93	95.60	97.21
Sensitivity (%)	98.06	97.71	98.29	98.15	98.37	98.18
Specificity (%)	93.25	94.65	94.45	94.42	90.24	95.51
PPA (%)	96.80	97.17	96.91	97.29	95.20	97.43
NPA (%)	95.88	95.96	96.36	96.76	97.01	96.57
F-measure (%)	97.39	97.39	97.54	97.64	96.63	97.78
Kappa	0.92	0.93	0.93	0.93	0.90	0.94

PCA has two traditional statistical indicators: $T^{2}$ and squared prediction error (SPE). The $T^{2}$ statistic measures the magnitude of variations inside the principal component subspace. The SPE statistic describes the degree to which an input sample deviates from the principal component model, that is, it measures changes outside the model. Their threshold values can be calculated from the data. If the SPE and $T^{2}$ of a sample are below the threshold values, the sample is considered benign; if they are above the threshold values, it is considered malignant. Based on PCA statistics, the thresholds of $T^{2}$ and SPE, both calculated from the benign data, are 13.2767 and 4.271, respectively, as shown in Figure 5. The confidence level is set as 99% and the number of principal components (PCs) is four. In Figure 5, most of the data are under the thresholds except for a few individual abnormal samples. The points above the dotted line will be regarded as malignant instances.
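The two statistics can be sketched using their common textbook definitions: $T^{2}$ sums the squared scores scaled by the eigenvalues of the retained components, and SPE is the squared norm of the part of a sample that the retained components do not reconstruct. This is an illustration under stated assumptions rather than the code behind Figure 5: the function names and the synthetic data are hypothetical, samples are centered but not rescaled, and the thresholds are not computed here.

```python
import numpy as np


def fit_pca_model(X_benign, n_pcs):
    """Fit mean, leading eigenvalues and eigenvectors on the reference data."""
    X_benign = np.asarray(X_benign, dtype=float)
    mean = X_benign.mean(axis=0)
    Xc = X_benign - mean
    cov = Xc.T @ Xc / (len(X_benign) - 1)
    sigma, P = np.linalg.eigh(cov)                 # ascending eigenvalues
    order = np.argsort(sigma)[::-1][:n_pcs]
    return mean, sigma[order], P[:, order]


def t2_and_spe(x, mean, sigma_pc, P_pc):
    """Return (T^2, SPE) for a single sample."""
    xc = np.asarray(x, dtype=float) - mean
    t = P_pc.T @ xc                                # scores in the PC subspace
    t2 = float(np.sum(t ** 2 / sigma_pc))          # variation inside the subspace
    residual = xc - P_pc @ t                       # part not explained by the model
    spe = float(residual @ residual)               # squared prediction error
    return t2, spe


# Synthetic three-attribute reference data with one low-variance direction.
rng = np.random.default_rng(1)
X_benign = rng.normal(size=(200, 3)) * [3.0, 1.0, 0.1]
mean, sigma_pc, P_pc = fit_pca_model(X_benign, n_pcs=2)

# A sample moved along the first principal direction stays inside the model.
t2_in, spe_in = t2_and_spe(mean + 2.0 * P_pc[:, 0], mean, sigma_pc, P_pc)
# A sample moved along the low-variance attribute leaves the model.
t2_out, spe_out = t2_and_spe(mean + np.array([0.0, 0.0, 5.0]), mean, sigma_pc, P_pc)
print(t2_in, spe_in, t2_out, spe_out)
```

The first sample changes only $T^{2}$ while the second mainly changes SPE, which is why the two statistics are monitored together.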

Figure 5.

PCA statistics under benign conditions of DataSet I.

The diagnosis result obtained on the basis of PCA statistics is shown in Figure 6, and the accuracy is 94.59%. Almost all of the malignant data exceed the threshold, indicating that the data are diagnosed accurately except for a few samples. However, compared with the results in Table 5, the accuracy obtained by PCA statistics is still a little worse than those of SVM-RFE, PCA-SVM, and even the SVM classifier.

Figure 6.

Malignant diagnosis based on DataSet I via PCA statistics.

The benign and malignant data of DataSet I are also introduced into MPCA to compare with the accuracy obtained by PCA statistics. As shown in Figure 7, the threshold is 6.1337 and the accuracy is 98.70%, 4.11% higher than the result on PCA statistics. Compared with the results in Table 5, MPCA shows very good performance on diagnosing malignant cancer, and its accuracy is better than all the other methods.

Figure 7.

Malignant cancer diagnosis based on DataSet I via MPCA.

The experiments based on DataSet II

In Table 5, it can be found that all of the dimension-reduction approaches with the RBF kernel function have better results than those with the linear kernel function. Therefore, the approaches with the RBF kernel function are employed to analyze DataSet II in this set of experiments. In Table 6, the RBF kernel SVM classifier uses all 30 attributes to reach 89.63% accuracy. An accuracy of 92.45% is achieved by the SVM-RFE method with 12 relevant attributes, as shown in Figure 8. The best classification accuracy, 97.19%, is obtained by PCA-SVM with the RBF kernel function (Figure 9), using only six principal components. It is 4.74% and 7.56% higher than the accuracies obtained by SVM-RFE and SVM with the RBF function, respectively.

Table 6.

The classification results based on DataSet II with dimension-reduction approaches (RBF kernel) and machine learning methods.

	Decision tree	ANN (100 nodes)	ANN (150 nodes)	Random forest (20 trees)	Random forest (30 trees)	Random forest (40 trees)	SVM (30 AN)	SVM-RFE (12 AN)	PCA-SVM (6 AN)
Accuracy (%)	91.74	92.44	91.39	95.95	96.31	96.13	89.63	92.45	97.19
Sensitivity (%)	93.56	96.64	95.52	96.92	97.20	96.92	90.53	92.68	98.93
Specificity (%)	88.68	85.38	84.43	94.34	94.81	94.81	88.00	92.55	94.26
PPA (%)	93.30	91.76	91.18	96.65	96.93	96.97	92.92	95.51	96.80
NPA (%)	89.10	93.78	91.79	94.49	95.26	94.81	84.50	88.20	98.01
F-measure (%)	93.43	94.13	93.30	96.78	97.06	96.92	91.62	93.93	97.81
Kappa	0.84	0.84	0.81	0.91	0.92	0.92	0.78	0.84	0.94

Nodes refers to the number of nodes in the network middle layer; Trees refers to the number of decision trees in the random forest; AN refers to attribute number.

Figure 8.

Classification accuracies based on DataSet II via SVM-RFE method (RBF kernel).

Figure 9.

Classification accuracies based on DataSet II via PCA-SVM method (RBF kernel).

Figures 6 and 7 show that MPCA is more accurate than PCA statistics on DataSet I, by 4.11 percentage points. Similarly, MPCA is applied to analyze DataSet II. In Figure 10, the threshold is 0.0038 and the accuracy is 96.00%. Compared with the results in Table 6, the accuracy obtained by MPCA is 3.55 and 6.37 percentage points higher than those of SVM-RFE and SVM, respectively, and 1.19 percentage points lower than the accuracy achieved by PCA-SVM.

Figure 10.

Malignant cancer diagnosis based on DataSet II via MPCA.
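The threshold-based decision shown in Figure 10 can be sketched as follows. The sketch assumes that a sample is diagnosed as malignant when its monitoring statistic exceeds the threshold, which is the usual convention in PCA-based fault detection; the function names and the statistic values are hypothetical, and only the threshold 0.0038 comes from Figure 10.

```python
def diagnose_by_threshold(statistics, threshold):
    """Flag a sample as malignant when its statistic exceeds the threshold."""
    return [value > threshold for value in statistics]


def threshold_accuracy(statistics, is_malignant, threshold):
    """Fraction of samples whose thresholded diagnosis matches the label."""
    predictions = diagnose_by_threshold(statistics, threshold)
    correct = sum(p == t for p, t in zip(predictions, is_malignant))
    return correct / len(is_malignant)


# Hypothetical statistic values and labels, for illustration only.
statistics = [0.0010, 0.0025, 0.0041, 0.0090, 0.0036, 0.0120]
is_malignant = [False, False, False, True, True, True]

accuracy = threshold_accuracy(statistics, is_malignant, threshold=0.0038)
print(f"accuracy at threshold 0.0038: {accuracy:.4f}")
```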

Table 7 compares the results of previous studies (Fan et al., 2011; Krishnan et al., 2010; Mert et al., 2014; Sweilam et al., 2010; Yadav et al., 2019; Yang et al., 2019) with those of this study. The 10-CV, PCA-SVM (RBF) approach used in this study achieves one of the highest accuracies (97.19%). The best result (98.90%) of CBFDT, proposed by Fan et al. (2011), is higher, but it was obtained with all the attributes, whereas the result in this paper is based on PCA-SVM using only six principal components. The results of Yang and Xu (2019) and Yin et al. (2016) are also slightly better than ours, because they do not use cross-validation and simply hold out part of the data as a test set, which potentially makes the test set easier to classify. Yadav et al. (2019) also used an SVM classifier, and their result differs from ours for the same reason: they held out 40% of the data as the test set. We used 10-fold cross-validation, so our results depend less on a particular split and are more reliable.

Table 7.

Comparison of the classification accuracy (%) of previous studies and this study.

Author	Method	Attribute number	Accuracy (%)
Sweilam et al.	PSO+SVM	30	93.52
	QPSO+SVM	30	93.06
Yang et al.	40% test data, DESVM	30	97.83
Yin et al.	50% test data, SVM-RFE	10	98.00
Yadav et al.	40% test data, SVM	30	95.51
Krishnan et al.	40% test data, SVM (poly)	30	92.62
	40% test data, SVM (RBF)	30	93.72
Mert et al.	10-CV, PNN	3 (2IC+DWT)	96.31
	LOO, PNN	3 (2IC+DWT)	97.01
Fan et al.	CBFDT	30	98.90 (Best)
	CBFDT	30	92.70 (Average)
	CBFDT	30	86.10 (Lowest)
This study	10-CV, SVM (RBF)	30	89.63
	10-CV, SVM-RFE (RBF)	12	92.45
	10-CV, PCA-SVM (RBF)	6	97.19
	MPCA	30	96.00
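The difference between a single hold-out split and the 10-fold cross-validation (10-CV) entries in Table 7 can be made concrete with a small sketch. The code below only illustrates how k-fold partitions are formed and averaged; it uses plain shuffled folds without stratification, and the names `k_fold_indices`, `cross_validated_accuracy` and `fit_and_score` are hypothetical rather than taken from the study.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists so that each sample is tested once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


def cross_validated_accuracy(n_samples, fit_and_score, k=10, seed=0):
    """Average the score returned by fit_and_score(train, test) over k folds."""
    scores = [
        fit_and_score(train, test)
        for train, test in k_fold_indices(n_samples, k, seed)
    ]
    return sum(scores) / len(scores)


# Placeholder scorer (assumption): reports the test-fold share of the data
# instead of training a real classifier.
share = cross_validated_accuracy(100, lambda train, test: len(test) / 100)
print(f"average test-fold share: {share:.2f}")
```

Because every sample serves as test data exactly once, the averaged score does not hinge on one particular choice of test set, unlike a single 40% or 50% hold-out split.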

In addition, to make the results more convincing, PCA-SVM is compared with commonly used statistical machine learning methods: decision tree (Quinlan, 1986), artificial neural network (ANN) (Rumelhart et al., 1986) and random forest (Vladimir et al., 2003). The decision tree has a depth of four and is built with a pruning algorithm; its structure is shown in Figure 11. The neural network is a common three-layer classification network that uses the cross-entropy loss function and the Adam optimizer (Kingma and Ba, 2014). The number of nodes in its middle layer is the variable explored. Random forest is an ensemble algorithm based on decision trees, so the number of decision trees is the variable explored. Ten-fold cross-validation is used in all of these experiments, and the results are shown in Table 6. Among these algorithms, random forest performs best with 30 trees (96.31%), but its accuracy is still lower than that of 10-CV, PCA-SVM (RBF) (97.19%). Moreover, these methods involve no dimensionality reduction, so all 30 attributes are used directly for classification.

Figure 11.

Decision tree structure.

Conclusion

In this paper, PCA-SVM is presented for health monitoring. To verify the method, two sets of experiments are conducted to compare the performance of different methodologies on two breast cancer datasets. In the first set of experiments, PCA-SVM and SVM-RFE with different kernels are employed to analyze DataSet I. With the optimal number of attributes on the RFE sequence (four of the nine attributes), the SVM-RFE method with the RBF kernel performs better than the one with the linear kernel. Its diagnosis accuracy (96.93%) is better than those of the original SVMs, whether the SVM classifier is based on the linear kernel (96.48%) or the RBF kernel (96.63%). However, PCA-SVM with the RBF kernel shows the best performance among these classifiers, reaching an accuracy of 97.21%. Then, PCA statistics and the MPCA approach are also used to diagnose malignant cancer in DataSet I. The accuracy of the MPCA approach is 98.70%, which is 4.11 percentage points higher than that achieved by PCA statistics and 1.49 percentage points higher than the result of PCA-SVM with the RBF kernel. In the second set of experiments, dimension-reduction approaches with the RBF kernel and the MPCA method are used to diagnose malignant cancer in DataSet II. The accuracy obtained by PCA-SVM with the RBF kernel is 97.19%, which is 1.19, 4.74 and 7.56 percentage points higher than those obtained by MPCA, SVM-RFE and SVM with the RBF kernel, respectively.

As the work in this paper shows, diagnosis based on PCA-SVM with the RBF kernel performs well with a small set of attributes, even on a high-dimensional dataset. On DataSet I, the result of PCA-SVM is slightly lower than that of MPCA. On DataSet II, however, the presented method diagnoses malignant cancer effectively after reducing the data dimension, and it copes well with high data dimension and large noise interference. Compared with previous studies, the 10-CV, PCA-SVM (RBF) approach used in this study performs better than most other methods while using only six principal components.

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

References

Cheng, Zhu, et al. (2019) An imitation medical diagnosis method of hydro-turbine generating unit based on Bayesian network. Transactions of the Institute of Measurement and Control 41(12): 3406–3420.

Costas-Perez, Rodriguez-Andina (2009) Algorithmic concurrent error detection in complex digital-processing systems. IEEE Design & Test of Computers 26(1): 60–67.

Fan, Chang, Lin, et al. (2011) A hybrid model combining case-based reasoning and fuzzy decision tree for medical data classification. Applied Soft Computing 11(1): 632–644.

George, Zayed, Roushdy, et al. (2014) Remote computer-aided breast cancer detection and diagnosis system based on cytological images. IEEE Systems Journal 8(3): 949–964.

Gutta, Cheng, Nguyen, et al. (2018) Cardiorespiratory model-based data-driven approach for sleep apnea detection. IEEE Journal of Biomedical and Health Informatics 22(4): 1036–1045.

Guyon, Weston, Barnhill, et al. (2002) Gene selection for cancer classification using support vector machines. Machine Learning 46(1–3): 389–422.

Han, Jiang, Zhao, et al. (2017) Comparison of random forest, artificial neural networks and support vector machine for intelligent diagnosis of rotating machinery. Transactions of the Institute of Measurement and Control 40(8): 2681–2693.

Haque, Hassan, Binmakhashen, et al. (2017) Breast density classification for cancer detection using DCT-PCA feature extraction and classifier ensemble. In: International Conference on Intelligent Systems Design and Applications, Batumi, India, 14–16 December 2017, pp. 702–711. Cham: Springer.

Jaganathan, Kuppuchamy (2013) A threshold fuzzy entropy based feature selection for medical database classification. Computers in Biology and Medicine 43(12): 2222–2229.

Jiang, Yin (2018) Recursive total principle component regression based fault detection and its application to vehicular cyber-physical systems. IEEE Transactions on Industrial Informatics 14(4): 1415–1423.

John, Kohavi, Pfleger (1994) Irrelevant features and the subset selection problem. In: Machine Learning Proceedings. Morgan Kaufmann, pp. 121–129.

Kingma (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krishnan MMR, Banerjee, Chakraborty, et al. (2010) Statistical analysis of mammographic features and its classification using support vector machine. Expert Systems with Applications 37(1): 470–478.

Lee, McManus, Merchant, et al. (2012) Automatic motion and noise artifact detection in holter ECG data using empirical mode decomposition and statistical approaches. IEEE Transactions on Biomedical Engineering 59(6): 1499–1506.

Liu CW (2011) A fuzzy-based data transformation for feature extraction to increase classification performance with small medical data sets. Artificial Intelligence in Medicine 52(1): 45–52.

Ding, Yang, et al. (2018) A fault detection approach for nonlinear systems based on data-driven realizations of fuzzy kernel representations. IEEE Transactions on Fuzzy Systems 26(4): 1800–1812.

Mangasarian, Street, Wolberg (1995) Breast cancer diagnosis and prognosis via linear programming. Operations Research 43(4): 570–577.

Mert, Kilic, Akan (2014) An improved hybrid feature reduction for increased breast cancer diagnostic performance. Biomedical Engineering Letters 4(3): 285–291.

Quinlan (1986) Induction on decision tree. Machine Learning 1(1): 81–106.

Rumelhart, Hinton, Williams (1986) Learning representations by back-propagating errors. Nature 323(3): 533–536.

Rustam, Maghfirah (2018) Correlated based SVM-RFE as feature selection for cancer classification using microarray databases. AIP Conference Proceedings 2023(1): 020235.

Sweilam, Tharwat, Moniem NKA (2010) Support vector machine for diagnosis cancer disease: a comparative study. Egyptian Informatics Journal 11(2): 81–92.

Tan, Lim, et al. (2011) A modified two-stage SVM-RFE model for cancer classification using microarray data. In: International Conference on Neural Information Processing, Shanghai, China, 13–17 November 2011, pp. 668–675. Springer.

Vapnik (2013) The Nature of Statistical Learning Theory. Berlin, Germany: Springer Science & Business Media.

Vladimir, Andy, Christopher, et al. (2003) Random forest: A classification and regression tool for compound classification and QSAR modeling. Journal of Chemical Information & Computer Sciences 43(6): 1947–1958.

Wolberg, Mangasarian (1990) Multisurface method of pattern separation for medical diagnosis applied to breast cytology. Proceedings of the National Academy of Sciences 87(23): 9193–9196.

Zhou (2017) Data-driven diagnosis of cervical cancer with support vector machine-based approaches. IEEE Access 5: 25189–25195.

Yadav, Jamir, Jain (2019) Breast cancer prediction using SVM with PCA feature selection method. International Journal of Scientific Research in Computer Science 5(2): 969–978.

Yang, Hou (2016) Fault detection based on a robust one class support vector machine. Neurocomputing 190: 117–123.

Yang, Garcia, Rodriguez-Andina, et al. (2019) Using PPG signals and wearable devices for atrial fibrillation screening. IEEE Transactions on Industrial Electronics 66(11): 8832–8842.

Yang (2019) Feature extraction by PCA and diagnosis of breast tumors using SVM with DE-based parameter tuning. International Journal of Machine Learning and Cybernetics 10(3): 591–601.

Yang, Tsai, Chou (2016) PCA-based fast search method using PCA-LBG-based VQ codebook for codebook search. IEEE Access 4: 1332–1344.

Yin, Huang (2015) Performance monitoring for vehicle suspension system via fuzzy positivistic C-means clustering based on accelerometer measurements. IEEE/ASME Transactions on Mechatronics 20(5): 2613–2620.

Yin, Ding, Haghani, et al. (2012) A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process. Journal of Process Control 22(9): 1567–1581.

Yin, Ding, Xie, et al. (2014a) A review on basic data-driven approaches for industrial process monitoring. IEEE Transactions on Industrial Electronics 61(11): 6418–6428.

Yin, Yang, Zhang, et al. (2017) A data-driven learning approach for nonlinear process monitoring based on available sensing measurements. IEEE Transactions on Industrial Electronics 64(1): 643–653.

Yin, Zhu, Jing (2014b) Fault detection based on a robust one class support vector machine. Neurocomputing 145(18): 263–26.

Yin, Fei, Yang, et al. (2016) A novel SVM-RFE based biomedical data processing approach: Basic and beyond. In: Annual Conference of the IEEE Industrial Electronics Society, Florence, Italy, 23–27 October 2016, pp. 7143–7148. IEEE.

Zhang, Qian, Mao, et al. (2018) A data-driven design for fault detection of wind turbines using random forests and XGBoost. IEEE Access 6: 21020–21031.