Abstract
For the pattern recognition, most classification models are solved iteratively, except for Linear LDA, KLDA and ELM etc. In this paper, a nonlinear classification network model based on predefined evenly-distributed class centroids (PEDCC) is proposed. Its analytical solution can be obtained and has good interpretability. Using the characteristics of maximizing the inter-class distance of PEDCC and derivative weighted minimum mean square error loss function to minimize the intra-class distance, we can not only realize the effective nonlinearity of the network, but also obtain the analytical solution of the network weight. Then, the sample is classified based on GDA. In order to further improve the performance of classification, PCA is used to reduces the dimensionality of the original sample, meanwhile, the CReLU activation function are adopted to enhances the expression ability of the features. The network transforms the samples into the higher dimensional feature space through the weighted minimum mean square error, so as to find a better separation hyperplane. In experiments, the feasibility of the network structure is verified from pure linear
Introduction
The fundamental task of pattern recognition is to extract the features of the pattern samples, and then minimize the intra-class distance and maximize the inter-class distance in the feature space to achieve the best recognition performance. In order to achieve the above goals, experts and scholars in the field of pattern recognition and machine learning have conducted long-term research and proposed a large number of feature extraction and classification methods, from the Bayesian classifier based on the statistical pattern recognition theory to the linear classifier, support vector machines, radial basis function networks, multilayer feedforward networks, convolutional neural networks, and recurrent neural networks, etc., the recognition performance is greatly improved.
The basis of statistical pattern recognition [1] is Bayesian decision theory, that is, according to the Bayesian formula, the posterior probability of the sample is obtained through prior probability and class conditional probability density to classify the pattern. However, due to the difficulty in obtaining the class conditional probability density function, most classifier models belong to the Discriminative Model, which skips the solution of the class conditional probability density and directly calculates the mapping relationship between the input sample
Based on PEDCC [2], this paper studies the classification problem from the latent feature space
Since PEDCC was proposed in the classification supervised autoencoder [2], it has been applied to the PEDCC-Loss loss function for the CNN classifier [3], clustering [4], semi-supervised learning [5], incremental learning [6] and active learning [7] and OOD Detection [8, 9] etc. Meanwhile, the analytical generation method and frame characteristic analysis of PEDCC have also been solved mathematically [10].
In this article, based on the characteristics of PEDCC, we propose a new type of simple analytical learning classifier, as shown in Fig. 1a, where PCA+Whitening+CReLU is used as a feature preprocessing method to improve classification performance. It extracts the principal feature components of the sample and reduces the dimensionality of high-dimensional data. The CReLU [11] and Tanh activation functions are used to enhance the nonlinear ability of the network. In the entire network, PCA can be solved analytically, meanwhile, the fully connected weight parameter
(a) is a diagram of structure of analytical learning classifier, where PCA
Figure 1b is the training results of the Iris data set. The Iris data set includes 150 samples, divided into 3 categories, each with 50 samples and each sample includes 4 length attributes. The result in the figure is a distribution diagram of the sample feature vector
This paper uses PEDCC and weighted linear regression function to construct a new and effective analytical learning classifier. The main contributions of the article are as follows:
Based on the derivative weighted minimum mean square error criterion function, we theoretically obtain the analytical expression of training weight parameter, avoiding the complexity of iterative solution, improving the training speed, interpretability of the network and the number of effective latent features. PCA The feasibility of the network design of the analytical learning classifier is proved experimentally. Compared with SVM and ELM, in general, this method has advantages on small dataset both in recognition accuracy and training speed, while it also has advantages in training speed on large data sets. By analyzing the relationship between the norm of latent features and the recognition rate, a multi-stage network structure based on the latent feature norm is introduced, which can further significantly improve the classification performance.
This paper is mainly divided into 5 parts. The first part introduces the background of pattern classification and the contribution of the article, as well as the structure diagram of the analytical learning classifier based on PEDCC; the second part introduces the background and basic concepts of SVM, ELM, VDA, VFC and PEDCC; The third part is the network structure and theoretical derivation; the fourth part verifies the feasibility of the network under different data sets, showing the performance of analytical network compared with ELM and SVM,and verified the effective improvement of multi-layer analytical learning classifier; the fifth part is the generalization and summary of the analytical learning classifier.
The theoretical basis of traditional statistical pattern recognition is Bayesian decision theory. Typical models include Linear Classifiers [12, 13], Support Vector Machines [14, 15], Radial Basis Function Networks [16, 17], and Multilayer Feedforward Networks [18, 19], Extreme Learning Machine [20, 21], etc. Deep learning is also a statical learning method.
SVM and ELM in the form of a three-layer feedforward network
SVM classifier is the most important pattern classification method before the re-emergence of deep learning. Although SVM is short of recognition rate compared to deep learning methods, it is still widely used in many occasions because of its small sample learning ability and good interpretability. The traditional SVM is a two-class classification method. According to the characteristics of the data set, the hard interval maximization is used for the linearly separable data set, and the soft interval maximization is used for the linearly inseparable data set. The common point is to train a hyperplane to classify the data as much as possible. It mainly uses decision functions:
From network perspective, SVM is a typical single-latent-layer three-layer feedforward network. The optimization algorithm actually obtains the weight between the latent layer and the output layer, and the parameters of the latent layer kernel function (support vectors can be considered as the parameters between the input layer and the latent layer). Similarly, the Extreme Learning Machine (ELM) [20, 21] studies the learning method of the three-layer feedforward network from another unique perspective, namely by randomizing the weight between the input layer and the latent layer to reduce the parameters that need to be trained, so the parameters between the latent layer and the output layer can be solved by the least square error method, and the analytical solution of the three-layer network is achieved. This method has fast network learning speed and good generalization performance, but usually requires a large number of latent layer nodes.
Compared with the extreme learning machine solidifying the weight between the input and the latent layer, our algorithm solidifies the weight between the latent layer and the output layer, and only the weights between the input layer and the latent layer are needed to determined. Since this solidification is only a training objective, so the parameters do not appear in the classifier. The training results are embodied in the mean and covariance of various class conditional probability densities, which can better classify and reduce the number of latent layer nodes.
In order to improve the interpretability and training efficiency of neural networks, Kuo et al. [22] proposed a mathematical model called RECOS (Rectified-Correlations on a Sphere), which answered the necessity of a nonlinear activation function and why two-layer cascade is better than the single layer. Furthermore, they [23] interpreted the multi-layer perceptron (MLP) as a generalization of a two-class LDA classifier, and proposed a 4-layer feedforward MLP (FF-MLP) model. The model does not need gradient back propagation, and the network can be trained through a single feedforward calculation, and obtaining good results on small data sets. However, the network parameters of this method require prior knowledge or a lot of attempts, and cannot be applied to complex data sets such as images, which limits its application.
VDA and VFC based on regular simplex vertices
VDA (Vertex Discriminant Analysis) [24, 25] is a discriminative analysis method based on the regular simplex vertices, which are equidistant points in space, and the number of vertices used as class indicators is
where
In order to minimize the loss function, the algorithm of VDA is to initialize
where
VFC (Vertex Feature Classification) [26], which is also a classification method related to regular simplex, connects the vertices of the simplex with category to explore the geometric distribution in the high-dimensional feature space (simplex space) and uses multi-point positioning technology to classify unknown samples. Its basic idea is to project all its features into the simplex space, and then estimate the center of gravity, which is used as the coordinates of the sample features in the simplex space. Therefore, it can classify the samples into the category closest to the center of gravity based on the metric matrix.
The above two methods have certain recognition performance on small data sets, but both face the problems of high computational complexity of iterative solution, and the limitation of the output feature dimension.
PEDCC was originally proposed in CSAE [4], which can simultaneously realize the functions of classification and autoencoders under a unified network architecture. PEDCC are some points evenly distributed on the hypersphere in the high-dimensional space. Figure 2 is a schematic diagram of the distribution of 2–4 PEDCC points in a three-dimensional space. Initially, PEDCC relied on the physical model of electric charges distributed on the hypersphere to generate iteratively. For pattern classification, we hope that the inter-class distance is as large as possible, which can be achieved by using PEDCC.
The distribution diagram of evenly-distributed points for 
The difference between PEDCC and regular simplex is that in
For the CNN classifier based on PEDCC-Loss [5], the convolutional layer plays the role of feature extraction, and the first fully connected layer maps these features to latent features, and all sample features are closely distributed in C (Number of categories) PEDCC points (class centroids) evenly distributed on the unit hypersphere. Since the linear layer can achieve arbitrary spatial rotation, these PEDCC points can be randomly generated, and as long as the distribution is evenly, the position does not affect the recognition result, but it must remain unchanged during training and testing. In addition, PEDCC is not only for classification, but also for feature extraction and uncertainty analysis.
By the solidification characteristics of PEDCC, this paper hopes to construct a new shallow neural network, which can be solved analytically, and has good interpretability and classification performance. After large comparison of trial, this paper proposes a new simple analytical learning classifier, as shown in Fig. 1a. Here PCA
Analytical solution of weight matrix W
As shown in Fig. 1a, the main problem of network training is how to obtain the weight matrix
The nonlinear network with activation function is usually trained by means of BP algorithm which is backpropagating the gradient to the previous layer network to update the network weights. Here, since the training objectives of output
Here, due to the non-linearity of the activation function, in order to achieve the training objectives at the output
However, because of the inverse mapping of the training objectives, our goal is to output the minimum mean square error at
For the training objectives in the
where * represents the dot product,
where,
is a symmetric matrix. Then, we have
The weight vector can be regarded as a pseudo-inverse solution weighted by the derivative. When calculating,
In the pure linear network without activation function Tanh, we know that the sample is projected to the
In fact, if
Since we adopt the criterion of minimum mean square error, the probability density function distribution of various latent features
The above method requires the inverse and the derivative function of the activation function, so the activation function must be reversible and derivable. ReLU fucntion, which has a zero derivative on the negative semi-axis, is not suitable for the method of this article because of the inability to reversely map
Here, only PEDCC is feasible for the analytical learning classifier. Because of the particularity of the basic PEDCC structure (any basic PEDCC point has a element is 1), so the inverse of the activation function cannot be obtained. So
The diagram of Multi-stage analytical learning classifier, where AL is PCA
(a) the samples distribution in the two-dimensional space; (b) the samples distribution under the new axis after PCA; (c) the samples distribution after whitening; (d) the samples distribution after the CReLU function, the blue is separation hyperplane extended to the three-dimensional space.
Through the analysis of experimental data, we found that the recognition rate of samples with small latent feature norm was significantly low, which means that samples with different latent feature norm have different classification difficulties, and inspired us to build a multi-stage analytic learning network as shown in Fig. 3. For each stage of classifier, the samples with low latent feature norm are further input to the next stage of analytic learning classifier for better recognition performance, otherwise ending the recognition process.
For any input sample
In the training phase, samples satisfying
PCA
Experimental analysis of ALC
In order to verify the feasibility and rationality of the proposed analytic learning classifier, the following three cases of pure linear W, W
Verification of the analytical learning classifier structure
Results of the experiment
In order to verify the rationality and effectiveness of the analytic learning classifier, Mnist, Fashion Mnist, Digits, Banknote Authentication, Pima, Diadates Disease, Iris, Breast Cancer Disease, Waveform, Wine and other data sets are used to carried out the comparative experiments in terms of recognition accuracy and training speed. The selection of the data set is based on two principles: typicality and low recognition rate of conventional recognition methods.
MNIST data set consists of 55,000 training sets and 10,000 test sets, with a total of 10 categories. Each picture is a 28*28 gray scale handwritten picture. Fashion MNIST data set are made of 60,000 training samples and 10,000 test samples, with a total of 10 categories. Each picture is a 28*28 picture. Digits data set has 1797 samples, which are digital labels of [0, 9]. There are 10 classes in total, and each picture has a pixel value of 8*8. Breast Cancer Diagnostic data set has a total of 569 samples, a total of 2 categories, and each sample has a total of 30 attributes. Banknote Authentication data set has a total of 1372 samples with 2 classes, and each sample has a total of 4 attributes. PIMA Indian Diabetes data set has 768 samples, a total of 2 categories, each sample has a total of 8 attributes. Waveform data set has 5000 samples, each sample has 21 attributes, and there are 3 categories in total. Wine datasets has 178 samples, each sample has 13 features, and 3 classes in total. Iris data set has 150 samples, 4 features, and 3 categories.
Here, our experiment is implemented using Pytorch on an Intel(R) Core(TM) i7-8700CPU, 8.0GB RAM, and Nvidia GeForce GT 730.
Analysis of experimental results
The experimental results are obtained by ten-fold cross-validation. Generally, the judgment of network performance is mainly based on training time and recognition rate. Table 1 mainly shows the recognition rate and training speed of each data set under linear, non-linear and analytical learning classifier. Compared non-linearity (training
The recognition rate (Acc.) and training time (Tr.Ti.) of each data set under linear, non-linear and analytical learning classifier
The recognition rate (Acc.) and training time (Tr.Ti.) of each data set under linear, non-linear and analytical learning classifier
Furthermore, compared analytical learning classifier with nonlinear network, in addition to the Breast Cancer Diagnostic data set, the recognition rate of the network structure with PCA is better than the network without PCA, especially on large data (Mnist, FashionMnist). It shows that feature preprocessing can not only increase the nonlinearity of the network, but also promote it to find a better separation hyperplane in the higher-dimensional feature space to improve the recognition rate. Meanwhile, feature processing for multiclassification data sets can significantly reduce the number of parameters to reduce training time; on the contrary, feature pre-processing of small data sets will increase the number of parameters and slow down the training speed.
Selection of activation function after PCA
Table 1 shows the feature preprocessing has an impact on network performance. But why CReLU is chosen as activation function in feature preprocessing? We also used ReLU and Tanh functions to compare the result on some data sets as shown in Table 2.
The recognition performance of different activation functions in feature preprocessing
The recognition performance of different activation functions in feature preprocessing
From Table 2, compared to ReLU and Tanh functions, the recognition with CReLU is the best. Based on the experiment, it is verified that CReLU can increase the nonlinearity of the network structure, so that the sample can find the separation hyperplane in the higher-dimensional feature space.
Will the randomness of PEDCC affect the performance of the network? Below we use Digits and Iris datasets to prove it.
The influence of random PEDCC on performance under Digits and Iris data sets
The influence of random PEDCC on performance under Digits and Iris data sets
Network performance of different principal feature and latent features on the Digits and Waveform datasets
The influence of latent feature
The recognition rate (Acc.) and training time (Tr.Ti.) of PCA
Although random PEDCC does not affect the recognition performance in a linear network, Table 3 shows the accuracy of the Digits and Iris data sets are slightly affected by random PEDCC on the analytical learning classifier. So once the nonlinearity of network is improved, the random PEDCC has an influence on the performance of the network. In fact, we need choose the best result from different random PEDCC.
Besides the influence of random PEDCC on network performance, what is more important is the influence of the principal features and latent features. We take Digits and Waveform as examples to illustrate the influence of principal components and latent features on network performance.
It can be concluded from Table 4 that whether in terms of principal feature components or latent features, the overall trend of the network performance of Digits and Waveform is to become larger first and smaller later. The same trend exists for other datasets, so we can find the optimal principal features and latent features to achieve a better network performance.
The influence of the number of latent features on the rank of W
Previously, we objectively explained that taking the derivative of the activation function as the step into the weighted minimum mean square error can increase the rank of
In Table 5, based on Mnist, Digits and Iris data sets, we can find that different latent features
Comparison of classification performance among analytic learning classifier (ALC), SVM and ELM
In order to verify the effectiveness of the analytical learning classifier, we compare it with SVM and ELM in Table 6. Because our algorithm adopts the PCA method, in order to ensure fairness and reliability of comparison, we also use the PCA method for SVM, which is SVM
From the Table 6, it can be seen that the training speed of the analytical learning classifier is much higher than that of SVM. The reason is that the fully connected weight parameter W (Fig. 1a) used by the analytical learning classifier is obtained by matrix multiplication, which avoids the complex iterative process in traditional methods and effectively improves the training speed, as shown in Eq. (7). Its Time complexity involves matrix multiplication and matrix inversion. The Time complexity of basic matrix multiplication is
Multistage analytical learning classifier
Results of Multistage analytical learning classifier
From the experimental data, we find that the recognition rate of samples with
Recognition rate results of 9 data sets on multi-stage ALC network
Recognition rate results of 9 data sets on multi-stage ALC network
Performance of Multi-stage analytical learning classifier recognition rate before and after adding noise with 10 dB SNR
The impact of SNR on the recognition rate of Waveform dataset.
It can be seen from the table that the multi-stage analytical learning classifier can effectively improve the recognition rate, especially for large datasets such as Mnist and Fashion. The number of data in each row of the table represents the number of stages of the network, among which Digits, Breast and Iris cannot be applied due to the small number of data samples. For the large datasets MNIST and fashion MNIST, the improvement of recognition rate is very obvious. Compared with ELM and SVM results in Table 6, the recognition rate is similar. For some small datasets with low recognition rates, such as Pima and Waveform, there is also a significant improvement, which is a substantial improvement over SVM and ELM. In general, multistage analytical learning classifier is efficient and feasible.
In image classification, noise may have an impact on the experimental results. Based on this consideration, we conducted a noise experiment and added Gaussian noise (noise whose probability density function conforms to the Gaussian distribution) to the multi-stage analytical learning classifier. During the experiment, we added Gaussian noise to compare the recognition rate under different SNR. As shown in Fig. 5, the curve of recognition rate of Waveform dataset changes with SNR. From the figure, it can be seen that a low SNR has a significant impact on the recognition rate. When the SNR is greater than 10 dB, the impact on the recognition rate is relatively small, and in some cases, it may even improve the recognition rate. This is because our method is based on MSE and inherently has a good ability to resist noise influence. The experimental result under 10 dB SNR are shown in the Table 8.
From the experimental results, it can be seen that the influence of noise greater than 10 dB on the multi-stage analytical learning classifier is relatively small, and it can maintain a relatively stable recognition effect when adding noise, further verifying the effectiveness and stability of the classifier.
Conclusion
In this paper, a new analytical learning classifier with simple structure is proposed, and the analytical expression of full connection layer
In the future, we will further improve the network, including the optimization method of network parameter about principal component and latent feature number, the distribution characteristics of class conditional probability density, the comparative application of other feature preprocessing methods, and the method of using PEDCC for pattern recognition, etc.
