Abstract
This paper proposes an evolutionary-based selective ensemble learning framework for solving classification problem. In the proposed ensemble learning framework, extreme learning machine (ELM) is selected as base learner and evolutionary algorithms are employed to optimize the weights of base learners in the ensemble. Then, some base learners, that their weights are larger than the threshold, are selected for making decision. The proposed ensemble learning framework is evaluated on 20 benchmark data sets from KEEL repository through four different evolutionary algorithms. Results show that the proposed evolutionary-based ensemble learning framework outperforms the simple voting based ensemble method in terms of classification performance. In four evolutionary optimization algorithms, PSOGA-based and DE-based weight optimization algorithms can effectively improve the classification accuracy and generalization ability.
Introduction
Extreme learning machine (ELM) proposed by Huang et al. [1] is an effective approach for single-hidden-layer feedforward neural networks (SLFNs). Different from the traditional gradient-based learning algorithms, ELM not only has fast learning speed but also achieves better generalization performance. In recent years, ELM has attracted tremendous attentions. Many applications of the ELM have been proposed in the literature, i.e., credit risk assessment [2], face recognition [3], and image analysis [4].
However, a single ELM suffers from some problems, such as stability and overfitting. Random assignment parameters of ELM may lead to these problems. Some optimization strategies have been widely used in parameter selection of ELM. Zhu et al. [5] presented an evolutionary ELM (E-ELM) to select the input weights using differential evolution (DE) algorithm. The E-ELM can achieve good generalization performance with much more compact networks. On the basis of E-ELM, Cao et al. [6] further optimized the number of hidden nodes using a self-adaptive DE algorithm, and proposed a self-adaptive evolutionary ELM for SLFNs. Feng et al. [7] proposed an evolutionary selection ELM (ES-ELM) for regression. ES-ELM uses the crossover mechanism derived from the genetic algorithm (GA) to select the optimal number of hidden nodes for ELM. However, these strategies are mainly used to parameter optimization of ELM.
Moreover, recent research shows combination of multiple learners, i.e., ensemble learning, is also an effective strategy to deal with the above problems. Generally an ensemble learning will train multiple base classifiers and combine the predictions of these classifiers. Many ensemble strategies based on ELM have been proposed [8–11]. The EN-ELM proposed by Liu and Wang [8] is an effective ensemble method, which uses the cross validation strategy and ensemble learning approach to alleviate overfitting and enhance the predictive stability. All base classifiers in EN-ELM are considered equally important. Wang and Li [9] used a dynamic AdaBoost algorithm to ensemble the outputs of ELMs, while Zhai et al. [10] proposed a dynamic ensemble ELM based on sample entropy. Their aims is to deal with instability and over-fitting of a single ELM, especially on large data sets. The voting-based ELM ensemble learning method (V-ELM) [11] incorporated the voting method into ELMs and made the final decision based on majority voting method. V-ELM enhances the classification accuracy and has much smaller variations in different trials of simulations than that of ELM. These ensemble methods assume that all base learners in the ensemble contribute equally to the making decision [12]. However, this assumption does not work well in many real-world applications. It is necessary to distinguish the importance of each base learner in making decision. Some researchers have focused on the use of evolutionary algorithms, such as GA, DE, and particle swarm optimization (PSO), to optimize the weights of each base learner in the ensemble [12–14].
Based on the above considerations, this paper presents an evolutionary-based selective ensemble learning framework based on ELM for solving classification problem. In the proposed ensemble learning framework, evolutionary algorithms are employed to optimize the weights of base learners in the ensemble. The proposed ensemble learning framework is evaluated on 20 benchmark data sets from KEEL repository through four different evolutionary algorithms. Results show that the proposed evolutionary-based ensemble learning framework outperforms the simple voting based ensemble method in terms of the classification performance.
The remainder of this paper is organized as follows. Section 2 briefly introduces ELM and several evolutionary algorithms. Then, an evolutionary-based selective ensemble learning framework is presented in Section 3. Two evolutionary-based weight optimization strategies, DE-based optimization algorithm and PSOGA-based optimization algorithm, also illustrated in this section. The performance evaluation of several different ensembles is performed on 20 benchmark data sets in Section 4. Section 5 concludes this paper.
Preliminaries
This section briefly reviews the basic ELM algorithm and several evolutionary algorithms as the background knowledge for our work.
Extreme learning machine
ELM [1] works on the single hidden layer feedforward neural networks, where the input weights and biases are randomly chosen and need not to be adjusted. The output weights are calculated analytically using Moore-Penrose generalized inverse.
For N training samples (
Equation (1) can be rewritten as follows
Thus, the output weight matrix
In ELM, different activation functions can be used in different hidden neurons [16]. If a function satisfies ELM universal approximation capability theorems [17], it can be one of activation functions in ELM.
DE is an evolutionary optimization technique proposed by Price and Storn [18]. It has been widely applied for many real-world optimization problems [19]. Similar to other evolutionary algorithms, DE algorithm is also a population-based heuristic search algorithm. It uses three principal operations of mutation, crossover and selection to search a global optimum solution. These operations are carried out iteratively and their aim is to improve the fitness of the best vector in the population.
DE algorithm starts with a randomly generated initial population of M individuals Xi,G, i = 1, … , M, where the index i denotes the ith candidate solution of the population at generation G. Each individual is defined as a D-dimensional vector Xi,G = [xi,G (1) , xi,G (2) , … , xi,G (j) , …, xi,G (D)].
For each vector Xi,G, i = 1, … , M, a mutant vector Vi,G is created by adding it to the weighted difference of two randomly selected distinct vectors from the population, according to the most common DE version [20]
After mutation operation, the crossover operation combines the mutation vector Vi,G and the target vector Xi,G to generate a trial vector Ui,G = [ui,G (1) , ui,G (2) , …, ui,G (j) , … , ui,G (D)]:
During the selection process, to determine whether the trial vector Ui,G survives to the next generation, the fitness value of the trial vector Ui,G is compared with that of the target vector Xi,G. For an objective function f, the selection operation is carried out as
If the objective function value of the trial vector is less than that of the target vector, the target vector will be replaced by the trial vector to survive to the next generation.
GA is the adaptive heuristic search algorithm based on biological evolution. A fitness function is used to evaluate individuals. A typical GA involves four major steps of fitness evaluation, selection, crossover and mutation of a new population [21].
PSO is a population-based global stochastic search method for solving optimization problems [22]. Compared to GA, PSO is easier to implement and has fewer control parameters to adjust [23]. PSO optimizes objective function by a population-based search. The population consists of particles, which are randomly initialized and fly through the multi-dimensional search space to find the best solution according to an optimization function. In the optimization process, the velocity and the position of each particle are updated iteratively.
Garg [24] combined the advantage of the GA and PSO, and presented a hybrid PSO-GA approach to solving optimization problems. Its basic idea is to embed the genetic operators in the standard PSO, and to combine the ability of social thinking in PSO with the local search capability of GA.
PSO-GA approach also starts from the initialization phase. The particles of the swarm and their corresponding velocities are generated randomly over the search space. During optimization, each particle updates its previous best performance (personal best, pbest) and the best previous performance of its neighbors (global best, gbest).
Let Xi,G and Vi,G respectively be the position and velocity of ith particle in the search space at kth iteration, then its velocity and position of this particle at (k + 1)th iteration are updated as follows:
After forming a new generation in PSO iterations, some particles of new population are selected and then GA is applied for each of them separately. The number of selected particles is determined as
After selecting individuals from the population, the PSO-GA algorithm creates a new population by replacing points in the current population with better points via GA.
Then, PSO-GA algorithm updates population size and maximum iteration number for GA after evaluating the new population.
PSO-GA algorithm repeats the process and seeks the global optimum.
Zhou et al. [25] suggested that many instead of all predictors in the ensemble could be used for decision making. There are two important phases in ensemble learning methods, constructing multiple base learners and combining them [26].
Generally two approaches are employed to construct an ensemble model, namely homogeneous model and heterogeneous model [27]. A homogeneous ensemble contains a group of identical base learners, while a heterogeneous ensemble can be generated by different learning algorithms to the same dataset. In this paper, we construct a homogeneous ensemble learning model using ELM as base learner.
Many approaches have been presented to combine base learners in an ensemble. Simple voting and weighted voting strategies are widely used [28]. In weighted voting strategy, it is a very key issue to select the appropriate weights of voting for base learners in the ensemble.
In this paper, we employ four evolutionary-based algorithms, such as GA-based, PSO-based, DE-based and PSOGA-based algorithms, to optimize the weights of voting for base learners. Before optimization, each base learner in the ensemble is randomly initialized a weight between 0 and 1. Then an evolutionary algorithm is used to optimize these weights. To utilize evolutionary-based optimization algorithm, we combine the weights of base learners w k (k = 1, …, K, where K is the number of base classifiers) in the ensemble to form a vector W = [w1, w2, …, w K ], which is taken as an individual of a population in evolutionary algorithm and to be optimized.
In optimization process, we use the error rate as the objective function of the fitness in this paper. Therefore, the main goal is to minimize this objective function using the search capability of evolutionary algorithm.
After optimization, a population of candidate weight vectors is generated and we select an individual with the best fitness value as weights of base learners in the ensemble. We use a selective strategy to choose some ELM base learners, that their weights are larger than the threshold λ, to make decision through weighted majority voting strategy.
The proposed evolutionary-based selective ensemble learning framework for classification problem is illustrated in Algorithm 1.
Input:
Sequence of N samples D = {(
Process:
1. Divide the dataset into a training set and a validation set , where N
tr
and N
va
represent number of the training set and number of the test set, respectively. D
train
is used to generate base learner and D
validation
is used to evaluate the fitness of evolutionary algorithm. Randomly assign input weight a
i
and bias b
i
, i = 1, 2, …, L. Calculate the hidden layer output matrix Calculate the output weight . Construct base learner C
k
.
2. Generate an initial population of M individuals W G = {W1,G, W2,G, …, WM,G}, where Wi,G = [wi,G (1) , wi,G (2) , …, wi,G (K)], (i = 1, …, M) represents the weights of K base learners.
3. Use evolutionary-based method to evolve the population, where the fitness of a weight vector is measured by the error rate on the validation set D validation .
4. Obtain the evolved best weight vector W*.
5. Choose base leaners that the weights are larger than the threshold λ to ensemble.
Output:
An ensemble learner:
In decision making on the ensemble, the proposed algorithm sums up the total weight values of each category through majority voting. For a sample
In Step 3 of Algorithm 1, we employ evolutionary-based method to evolve the population and optimize the weights of each base learner in the ensemble. Algorithm 2 presents a DE-based optimization algorithm. We also give a PSOGA-based optimization algorithm [24] in Algorithm 3.
Input:
A series of base learners C
k
, k = 1, 2 … , K; number of iterations G
max
.
Process:
1. Initialization: Set the generation iterator G = 0.
2. Evaluate the fitness of the vector Ui,G and Xi,G using K base classifiers C
k
on validation set D
validation
, where the fitness is defined as the error rate; Update the vector Xi,G+1 of the next generation (G + 1) using Equation (6); Increase generation iterator G = G + 1.
Output:
The best weight vector W* = Xi,G.
Input:
A series of base learners C
k
, k = 1, 2 … , K; number of iterations G
max
.
Process:
1. Initialization: Set the generation iterator G = 0; initialize the position and velocity of each particle.
2. Update pbest and gbest position of the particle; Update the velocity of each particle using Equation (7); Update the position of each particle using Equation (8); Evaluate the fitness of each individual on validation set D
validation
, where the fitness is defined as the error rate; Choose GA
num
best particles, where GA
num
is calculated by Equation (9); apply GA for each selected best particle separately, and replace some particles in the current PSO population with the best individuals through DE operation; Update GA parameters: GA
num
, population size of the particles in GA, and maximum number of iteration in GA; Increase generation iterator G = G + 1.
Output:
The best weight vector W* = Xi,G.
Experimental results
Benchmark data sets
To validate the effectiveness of ensemble methods, we investigate the classification accuracies and computational efficiencies of five optimization approaches, i.e., simple voting based ensemble, GA-based ensemble, PSO-based ensemble, DE-based ensemble, and PSOGA-based ensemble on twenty benchmark data sets. The details of the twenty data sets are summarized in Table 1, which are from the KEEL dataset repository [29]. It is noted that N sample is number of samples, N feature is number of features and N class is number of categories.
Experimental settings
Five ensemble methods are evaluated in the experiments. Simple voting based ensemble applies simple voting strategy to decide the final class label of test sample. GA-based, PSO-based, DE-based and PSOGA-based ensemble methods employ GA, PSO, DE and PSO-GA [24] algorithm to optimize the weights of base learners respectively, and employ weighted majority voting strategy to make decision.
For all ELM base learners in simple voting based ensemble method and evolutionary-based ensemble methods, the Sigmoid kernel function is selected as activation function. The number of ELM base learners K is set to be 30. The number of hidden nodes L of ELM base learner is needed to be tuned. We employ a grid search method to choose the number of hidden nodes which ranges from 10 to 200 with an interval of 10. Ten trials are repeated for a single ELM base learner and the number of hidden nodes is then determined by the validation error. And then the tuned number of hidden nodes is used in all ensemble methods.
For evolutionary-based ensemble methods in the experiments, ELM base learners are all constructed on the 70% training samples randomly selected and the rest 30% are used to validate the fitness of evolutionary algorithms. The population size M is set to be 20. The number of maximum iteration is 50. The dimension of the population is 30 corresponding with 30 base learners. The threshold λ is set to be 0.5. Namely, we only choose the base learners that the weights are more than 0.5 to ensemble and verify the test samples through weighted voting strategy.
In the experiments, 5-fold cross-validation is also used to evaluate the performances of different ensemble methods. For fair comparisons, all the ensemble methods are run on the same 5-folded data sets. The training and testing process are repeated 50 times. In addition, all experiments are carried out in Matlab 7 environment under a PC with Intel 3.2 GHz CPU and 4 GB RAM.
Results and discussion
The evaluation results among five ensemble methods are presented in Table 2. We record the best number of hidden nodes (# hidden nodes) of ELM base learner on each data set, training time, testing accuracy (acc) and testing standard deviation (dev). Each result, including training time, testing accuracy and testing standard deviation, is an average of 50 trials. We emphasize in boldface the highest value among the five values that are being compared for each data set.
From the comparison results of Table 2, we can easily find that PSOGA-based ensemble is more time consuming than simple voting based ensemble, GA-based ensemble, PSO-based ensemble and DE-based ensemble. However, it obtains the best performance on 8 out of 20 data sets in terms of accuracy, which outperforms the other four comparison methods. The training time of GA-based ensemble, PSO-based ensemble and DE-based ensemble is similar, which is slightly higher than that of simple voting based ensemble. DE-based ensemble obtains the best accuracies on 7 out of 20 data sets, while 3 and 2 for GA-based ensemble and PSO-based ensemble respectively. In terms of the comparison between PSOGA-based ensemble and DE-based ensemble, it can be observed from Table 2 that the classification performance of PSOGA-based ensemble is slightly superior to that of DE-based ensemble. However, DE-based ensemble is more efficient in terms of computational cost. It also should be noted that although the complexity of simple voting based ensemble is the lowest, its classification performance is the worst.
Statistical significance test is also carried out to check which method is statistically better than the methods in comparison in terms of accuracy. Table 3 shows the rankings of the five ensemble methods. The rankings are computed by the Friedman test, which is a non-parametric equivalent of the repeated-measures ANOVA and compares the average ranks of methods [30]. Obviously, it can be observed from Table 3 that the ranking of PSOGA-based ensemble method is the lowest, only 1.9000. The ranking of DE-based ensemble method is slightly larger than that of PSOGA-based ensemble method and lower than those of PSO-based and GA-based ensemble methods, while simple voting based ensemble method achieves the highest ranking of 4.350. It shows that compared with simple voting based ensemble method, four evolutionary-based ensemble methods can improve the performance of classification to some extent. Furthermore, Friedman test returns the p value of 6.7618×10–6, so the hypothesis that all methods are uniform is rejected.
We further employ the Holm test to compare the best ranking method (PSOGA-based ensemble) with the remaining methods. In the Holm test, PSOGA-based ensemble is selected as the control method. The result of this test is shown in Table 4. It is noted that the Holm test applied step-down procedure to sequentially test the hypotheses ordered by their significance.
In Table 4, Holm test starts with the most significant p4 value, namely, p4= 9.5837×10–7. Because p4 is less than the associated Holm’ value (α/4 = 0.0125) in the same row, we reject the hypothesis and then compare p3 with α/3. Similarly, the second and the third hypotheses are also rejected. The results show that PSOGA-based approach outperforms simple voting based, GA-based and PSO-based ensemble methods. But for the last hypothesis, p1 is larger than the corresponding Holm’ value (α/1 = 0.05), so we accept this hypothesis. It indicates that there is no statistical significant difference between PSOGA-based and DE-based ensemble methods.
Conclusion
In this paper, we employ several evolutionary-based optimization method, such as PSOGA, PSO, DE and GA, to optimize the weights of ELM-based ensemble learning, and propose a general selective ensemble learning framework. Five ensemble methods are performed on 20 benchmark data sets from KEEL dataset repository. Compared with sampling voting ensemble, four evolutionary-based ensemble methods can improve the classification performance and generalization ability. Especially PSOGA-based and DE-based ensemble methods obtain more advantage in terms of the classification performance. However, PSOGA-based ensemble has no advantage in terms of the time complexity. In the future work, we will balance the complexity and accuracy, and give a better adaptive selective ensemble framework.
Footnotes
Acknowledgments
This work is partly supported by National Natural Science Foundation of China (No. 61373127), and the State Key Laboratory for Novel Software Technology (Nanjing University) of China (No. KFKT2015B16).
