A two-stage algorithm for feature selection based on feature score and genetic algorithms

Abstract

Feature selection is an effective approach for mitigating the curse of dimensionality. Evolutionary computation techniques, such as genetic algorithms, have been extensively applied to feature selection. However, in the available algorithms, features are not screened before evolutionary computation starts, and all of them have equal status during the evolutionary process. In this paper, we propose a new algorithm that screens features before evolutionary computation starts and makes full use of the screened features during the evolutionary process. In the first stage, important and useful features are found by scoring all features, and they are given privileges over the other features in the subsequent evolutionary computation. In the second stage, we design a genetic algorithm with multiple sub populations, in which each sub population corresponds to a combination of important and useful features, and a competition mechanism between sub populations is introduced. As a result, important and useful features are used more fully and explored more extensively than in the available algorithms, and classification accuracies increase. Experiments are performed on 8 datasets, and the proposed algorithm is compared with 11 state-of-the-art algorithms. The results show that our proposed algorithm ranks first on 6 of the 8 datasets and second on the other 2.

Keywords

Feature selection, feature score, feature combination, genetic algorithms, multiple sub populations

1. Introduction

With the development of information technology, a tremendous amount of information can be collected, processed into datasets with many features, and applied to classification problems. However, too many features result in high computational time. Although some features are important and useful (good features for short), others may be unimportant, irrelevant or redundant; they introduce noise and degrade classification performance. This is known as the curse of dimensionality [1, 2, 3, 4, 5, 6]. Feature selection (FS) is an effective method for addressing the curse of dimensionality: it selects appropriate subsets of features and removes unimportant, irrelevant or redundant features in order to reduce computational time, eliminate noise and improve classification performance [1, 7, 8, 9, 10, 11, 12].

Popular feature selection techniques fall into two broad groups: filter methods and wrapper methods [10, 11]. The former are independent of the classifier and remove unimportant, irrelevant or redundant features by analyzing the features’ information. In contrast, the latter employ a classifier, whose classification performance is used to evaluate a feature subset [1, 3, 7]. Although filter methods are considered more general than wrapper methods, wrapper methods usually achieve better classification performance [1].

Filter methods have been widely discussed and many algorithms have been proposed. The Feature Selection technique based on Feature Similarity (FSFS) removes redundancy by measuring the similarity between features using a feature similarity measure [15]. Fuzzy rule based Feature Selection (FR-FS) is an algorithm that extracts fuzzy rules and selects appropriate features simultaneously [16].

Researchers have also widely studied wrapper methods and proposed many algorithms. The available algorithms fall into two broad groups: (1) deterministic algorithms [17, 18, 19, 20, 21, 22], which always derive the same feature subset when applied to a given dataset; and (2) Evolutionary Computation (EC) based algorithms [23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39], in which evolutionary computation techniques such as Genetic Algorithms (GA), Particle Swarm Optimization (PSO), Artificial Bee Colony (ABC) algorithms, Genetic Programming (GP), and Ant Colony Optimization (ACO) are applied to find appropriate feature subsets. With these algorithms, the final derived feature subsets are uncertain because of the randomness of evolutionary computation [1].

The main deterministic algorithms include Sequential Forward Selection (SFS), Sequential Backward Selection (SBS) [19], the Ranking Based method (RB-FS) [21], Graph Clustering with Node Centrality (GCNC) [22], etc. SFS adds features to an empty feature subset step by step. Conversely, SBS removes features step by step from a feature subset that initially contains all features. The two algorithms terminate when adding or removing a feature can no longer increase classification performance [19]. However, with these two algorithms, a feature that has been added cannot later be removed, and a feature that has been removed cannot later be added, which limits the search for optimal feature subsets. RB-FS first determines the number of selected features by building a convex hull in high-dimensional space, then obtains a locally optimal feature subset with a hill climbing algorithm [21]. GCNC regards all features as a graph and divides them into clusters; feature selection is then performed with an iterative search strategy based on node centrality [22].

The main EC based algorithms include Modified Discrete Artificial Bee Colony (MDisABC) [1], Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) [7], the Fast Feature Weighting algorithm of Data Gravitation Classification (FFW-DGC) [23], Multi Objective Differential Evolution based FS (DEMOFS) [36], the Binary Bat Algorithm for FS (BBA) [37], Binary Grey Wolf Optimization approaches for feature selection (bGWO) [38], the Visibility density Modified Binary coded Ant Colony Optimization algorithm (VMBACO) [39], etc. MDisABC is a binary artificial bee colony (ABC) algorithm that integrates evolutionary based similarity search mechanisms with another binary ABC variant [1]. MOEA/D performs feature selection and weighting with a Multi-Objective Evolutionary Algorithm based on Decomposition: data points are projected into a hyper space, and the inter-class and intra-class distances are then optimized by MOEA/D in order to select features [7]. FFW-DGC integrates the data gravitation classification (DGC) model with fuzzy sets to obtain optimal feature subsets and reduce computational time [23]. DEMOFS is a multi-objective feature selection algorithm based on differential evolution that aims to increase classification accuracy and decrease the number of features simultaneously [36]. BBA is a wrapper algorithm based on the behavior of bats that combines the exploration power of the bats with the speed of the Optimum-Path Forest classifier [37]. bGWO is a feature selection algorithm based on the binary version of Grey Wolf Optimization (GWO), a recent bio-inspired optimization technique [38]. VMBACO integrates a modified binary coded ACO with GA, in which the solution derived by GA is used as visibility or initial pheromone information [39].

As mentioned before, different features play different roles in classification. Although evolutionary computation has been extensively applied to feature selection, in the available algorithms features are not distinguished before evolutionary computation starts, all features have equal status during the evolutionary process, and some of the individuals considered contain no good features. As a result, good features are not sufficiently used or extensively explored, and computing resources are wasted. In this paper, we propose a method for screening good features by scoring all features before evolutionary computation starts, which is the first stage of our proposed algorithm. In the second stage, we design a genetic algorithm with multiple sub populations, in which each sub population corresponds to a combination of good features, and a competition mechanism between sub populations is introduced. As a result, good features are used more fully and explored more extensively than in the available algorithms, and classification accuracies increase. The proposed algorithm is named the Feature Selection algorithm based on Feature Score and Genetic Algorithms (FS ${}^{2}$ GA for short).

The main contributions of this work are summarized as follows: (1) it is proposed for the first time that features should be screened before evolutionary computation starts (see the paragraph above); (2) a method for screening features is proposed (see Section 2); (3) a method for dividing the population into multiple sub populations, as well as a competition mechanism between sub populations, is designed (see Section 3). Table 1 compares our proposed algorithm with the available EC based algorithms to highlight these contributions.

The rest of the paper is organized as follows: in Section 2, we describe how to find good features; in Section 3, the GA is described in detail; Section 4 describes the experiments and analyzes the results; finally, we conclude our work in Section 5.

Table 1
The comparison of FS ${}^{2}$ GA and available EC based algorithms

Algorithm	Screening individuals	Division of sub populations	Competition mechanism between sub populations
FS ${}^{2}$ GA	$\surd$	$\surd$	$\surd$
MDisABC	$\times$	$\times$	$\times$
MOEA/D	$\times$	$\times$	$\times$
FFW-DGC	$\times$	$\times$	$\times$
DEMOFS	$\times$	$\times$	$\times$
BBA	$\times$	$\times$	$\times$
bGWO	$\times$	$\times$	$\times$
VMBACO	$\times$	$\times$	$\times$

2. Finding good features

As mentioned before, we propose a method for finding good features by scoring them before evolutionary computation starts. In detail, a certain number of feature subsets are generated entirely at random, and a classifier is employed to obtain the classification accuracy of every feature subset. Then the feature subsets with high classification accuracies (elite subsets for short) are identified. The score of each feature is the sum of the accuracies of all elite subsets that contain it, and the features with high scores are regarded as good features. The rationale is that if a feature appears frequently in elite subsets, then a feature subset containing it obtains a high classification accuracy with high probability. That is to say, the feature is good, and feature subsets containing it should be focused on. If two features appear the same number of times in the elite subsets, the one contained in elite subsets with higher accuracies is more important than the other. Therefore, the accuracies of the elite subsets are also taken into account.

2.1 Representation of feature subset

A feature subset corresponds to a binary string, in which each bit corresponds to a feature. If the feature subset involves a certain feature, the corresponding bit is ‘1’, otherwise ‘0’. Therefore, a feature subset s can be expressed as the following binary string:

$\displaystyle s=b_{1}b_{2}\ldots b_{N}$ (1)

where $N$ is the number of features, and $b_{i}(1\leqslant i\leqslant N)$ can be defined as follows:

$\displaystyle b_{i}=\left\{\begin{array}{ll}0,&\text{ if $s$ doesn't involve feature $i$}\\ 1,&\text{ if $s$ involves feature $i$}\end{array}\right.$ (2)

For instance, the feature subset (F1, F3, F5, F7, F8) is represented as the binary string ‘10101011’, as shown in the following figure.

Figure 1.

Representing feature subsets with binary string.
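The encoding of Eqs. (1) and (2) can be illustrated with a minimal sketch. The function names `encode_subset` and `decode_subset` are hypothetical helpers introduced only for this illustration; features are indexed from 1, as in the text.

```python
def encode_subset(selected, n_features):
    """Return the binary string of Eq. (1) for a list of 1-based feature indices."""
    chosen = set(selected)
    return "".join("1" if i in chosen else "0" for i in range(1, n_features + 1))


def decode_subset(bits):
    """Return the 1-based indices of the features whose bit is '1'."""
    return [i for i, b in enumerate(bits, start=1) if b == "1"]


s = encode_subset([1, 3, 5, 7, 8], n_features=8)
print(s)                 # 10101011
print(decode_subset(s))  # [1, 3, 5, 7, 8]
```

The printed string matches the example given above for the subset (F1, F3, F5, F7, F8).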

2.2 Generating random feature subset

$M$ feature subsets are generated entirely at random. In detail, for each bit of a feature subset $s$ , a real number in the range [0,1) is generated at random. The bit is set to ‘1’ if the generated number is less than 0.5, and to ‘0’ otherwise. All generated random feature subsets constitute the set S ${}_{\text{rand}}$ .
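A possible rendering of this sampling step is sketched below. The helper names and the fixed seed are assumptions made for repeatability of the illustration, not part of the described method. The sketch does not treat the all-zero string specially, since the text does not say how an empty feature subset is handled.

```python
import random


def random_subset(n_features, rng):
    """Draw one subset: a bit is '1' when a uniform draw in [0, 1) is below 0.5."""
    return "".join("1" if rng.random() < 0.5 else "0" for _ in range(n_features))


def generate_s_rand(m, n_features, seed=0):
    """Build S_rand as a list of m random binary strings."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [random_subset(n_features, rng) for _ in range(m)]


s_rand = generate_s_rand(m=5, n_features=8)
for s in s_rand:
    print(s)
```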

2.3 Scoring features

Our work mainly aims at improving classification accuracy. Hence, for each $s\in S_{\text{rand}}$ , a classifier is employed to derive its classification accuracy, which is denoted Accu( $s$ ). Then the feature subsets in S ${}_{\text{rand}}$ are clustered with the K-Means $+$ $+$ clustering algorithm according to their classification accuracies. The feature subsets in the cluster with the highest classification accuracies are regarded as elite subsets and constitute the set S ${}_{\text{elite}}$ .

For feature $i$ , its score is expressed as Sco( $i$ ). The procedure computing all features’ scores can be described as the following pseudo code:

Algorithm 1: Computing all features’ scores
for ( $i=$ 1; $i\leqslant N$ ; $i++$ )
  Sco( $i$ ) $=$ 0
  for ( $s$ : S ${}_{\text{elite}}$ )
    if ( $b_{i}==$ ‘1’)
      Sco( $i$ ) $=$ Sco( $i$ ) $+$ Accu( $s$ )

The higher a feature’s score is, the better it is. The features are sorted by their scores from high to low and the top num_good_fea features are regarded as good features. The value of num_good_fea is decided by the specific scores of features.
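The scoring rule can be sketched in a few lines. The sketch assumes the elite subsets are already available as (binary string, accuracy) pairs; the clustering step that produces them is omitted, and the toy accuracies are invented purely for illustration. The names `score_features` and `top_features` are hypothetical.

```python
def score_features(elite, n_features):
    """elite: list of (bits, accuracy) pairs; returns [Sco(1), ..., Sco(N)]."""
    scores = [0.0] * n_features
    for bits, accuracy in elite:
        for i, b in enumerate(bits):
            if b == "1":
                scores[i] += accuracy  # add Accu(s) for every elite s containing feature i+1
    return scores


def top_features(scores, num_good_fea):
    """Return the 1-based indices of the num_good_fea highest-scoring features."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [i + 1 for i in order[:num_good_fea]]


# Toy data: features 1, 2 and 3 each appear in exactly two elite subsets.
elite = [("1100", 0.80), ("1010", 0.75), ("0110", 0.70)]
scores = score_features(elite, n_features=4)
good = top_features(scores, num_good_fea=2)
print(scores, good)
```

In this toy input the three features that occur equally often are separated only by the accuracies of the subsets containing them, which is the tie-breaking behavior motivated at the start of this section.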

2.4 Computational complexity analysis

The time of this stage is dominated by the evaluation of the individuals’ classification accuracies. The classification accuracies of $|\text{S}_{\text{rand}}|$ individuals need to be computed in this stage. Thus, the computational complexity of this stage, measured in classifier evaluations, is O( $|\text{S}_{\text{rand}}|$ ).

3. Genetic algorithms

In this section, we describe the second stage of our proposed algorithm. A Genetic Algorithm (GA) is employed in this stage, in which each individual corresponds to a feature subset. As mentioned before, the feature subsets containing good features should be focused on. Therefore, the individuals considered during GA should contain one, some or all of the good features. To achieve this, a GA with multiple sub populations is adopted, and each sub population corresponds to a combination of good features. Individuals of a sub population must contain the combination of good features assigned to that sub population, and must not contain any good feature outside the combination. To explore all possible individuals, the empty set of good features also corresponds to a sub population; that is, the individuals of this sub population must not contain any good feature. Good features do not participate in crossover or mutation, while all other features do. This ensures that the individuals in a sub population keep the same, constant set of good features throughout the GA operations. A competition mechanism between sub populations is introduced, under which the sub populations with higher classification accuracies obtain larger population sizes, because a high classification accuracy indicates that the sub population’s combination of good features is better and should be explored more intensively. Each sub population performs its GA operations independently, and the only interaction between sub populations is the competition mechanism. In each round, the highest accuracy over all sub populations is regarded as the accuracy of the round, and the accuracy of the final round is regarded as the final accuracy.

The main steps of the GA are shown in the following figure, where NR is the number of GA’s rounds.

Figure 2.

The main steps of the GA.

3.1 Multiple sub populations

The genetic algorithm with multiple sub populations is employed in our proposed algorithm. Each sub population corresponds to a combination of good features. That is, the number of sub populations is NM $=$ 2 ${}^{P}$ , where $P$ is the number of good features. Features are reordered by moving the good features to the head of the binary string, so that the binary string is divided into two sections. The first section holds the good features (head for short), and the second section holds the other features (tail for short), as shown in Fig. 3. Individuals in a sub population have the same head and different tails. A sub population can be expressed as SP ${}_{i}$ (1 $\leqslant i\leqslant$ NM), where $i$ is the decimal integer converted from the sub population’s head. For instance, the sub population to which the individual in Fig. 3 belongs can be expressed as SP ${}_{5}$ . Crossover and mutation are carried out only in the tail. That is to say, the head remains unchanged during all GA operations. A classifier is employed to derive the classification accuracy of every individual produced during the GA operations.

Figure 3.

The binary string is divided into two sections.
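The correspondence between heads and sub populations can be sketched as follows. The helper names are hypothetical, and the head ‘101’ is used only as an example of a head whose decimal value is 5. Under this literal reading, the all-zero head (the empty combination of good features) maps to index 0, so the indices produced here run from 0 to NM $-$ 1.

```python
from itertools import product


def all_heads(p):
    """Enumerate the 2**p heads, one per combination of good features."""
    return ["".join(bits) for bits in product("01", repeat=p)]


def sub_population_index(head):
    """Decimal integer encoded by the head, used as the sub-population label."""
    return int(head, 2)


heads = all_heads(3)
print(len(heads))                   # 8
print(sub_population_index("101"))  # 5
```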

As mentioned before, a competition mechanism between sub populations is introduced. Before round $r$ ( $r>$ 1) of GA, a sub population’s size is expressed as Size(SP ${}_{i}$ , $r$ ) and computed as follows:

$\displaystyle\text{Size}(\text{SP}_{i},r)=\frac{e^{\text{mul\_s}*\text{Max}(\text{SP}_{i})}*\text{T\_Size}}{\sum_{j=1}^{\text{NM}}e^{\text{mul\_s}*\text{Max}(\text{SP}_{j})}}$ (3)

where Max(SP ${}_{i}$ ) is the highest classification accuracy derived so far by the feature subsets (individuals) of SP ${}_{i}$ , which is updated in each round if necessary together with the individual that obtains it; T_Size is the entire population size of the GA; and mul_s is the coefficient that decides how strongly the sub populations with higher best accuracies are favored. As a result, a sub population with a higher best classification accuracy obtains a larger population size and is explored more extensively in the next round. The highest classification accuracies in the first rounds do not evaluate the sub populations precisely, because of randomness and because few individuals have been evaluated; the highest classification accuracies in the later rounds evaluate the sub populations more precisely, because many individuals have been evaluated. Therefore, the initial value of mul_s is 0, and it increases by mul_s_increase in each round.
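The size allocation of Eq. (3) is a softmax over the best accuracies of the sub populations, scaled by T_Size. The sketch below uses invented accuracies and an arbitrary mul_s value of 20 for illustration. It returns real-valued sizes, because the text does not state how they are rounded to integers.

```python
import math


def sub_population_sizes(max_acc, t_size, mul_s):
    """Eq. (3): share T_Size among sub-populations in proportion to exp(mul_s * Max)."""
    weights = [math.exp(mul_s * a) for a in max_acc]
    total = sum(weights)
    return [w * t_size / total for w in weights]


max_acc = [0.70, 0.75, 0.80, 0.72]  # hypothetical Max(SP_i) values
equal = sub_population_sizes(max_acc, t_size=320, mul_s=0.0)
skewed = sub_population_sizes(max_acc, t_size=320, mul_s=20.0)
print(equal)   # mul_s = 0 gives every sub-population the same share
print(skewed)  # a larger mul_s shifts individuals toward the best sub-population
```

With mul_s $=$ 0 the allocation is uniform, which is consistent with starting from equal sub population sizes and letting the competition sharpen as mul_s grows.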

3.2 Initial sub population and fitness function

All sub populations have the same initial size, which is computed as follows:

$\displaystyle\text{Init\_Size}=\left\lfloor\frac{\text{T\_Size}}{\text{NM}}\right\rfloor$ (4)

Because the head of SP ${}_{i}$ is assigned, identical for all of its individuals, and constant, only the tail needs to be generated and concatenated to the assigned head. In addition, the individual with an all-zero tail is generated so that the combination of good features alone is also considered. The procedure for initializing SP ${}_{i}$ is shown in the following algorithm:

Algorithm 2: Initializing sub population SP ${}_{i}$
for ( $j=$ 1; $j\leqslant$ Init_Size; $j++$ )
  tail $=$ ‘’
  for ( $k=P+$ 1; $k\leqslant N$ ; $k++$ )
    if (( $j==$ 1) || (rand() $<$ 0.5))
      tail $=$ tail $+$ ‘0’
    else tail $=$ tail $+$ ‘1’
  Id( $i$ , $j$ ) $=$ head( $i$ ) $+$ tail

where Id( $i$ , $j$ ) is the $j$ th individual of SP ${}_{i}$ , and head( $i$ ) is the assigned head of SP ${}_{i}$ .
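Algorithm 2 could be rendered in Python roughly as follows. The function name and the seeded generator are assumptions for the illustration. The numbers in the usage lines (24 features, 3 good features, T_Size of 320) are the values reported for the German dataset in Tables 3, 4 and 5, and the head ‘101’ is an arbitrary example.

```python
import random


def init_sub_population(head, n_features, init_size, rng):
    """Python rendering of Algorithm 2 for one sub-population."""
    tail_len = n_features - len(head)
    individuals = []
    for j in range(init_size):
        if j == 0:
            tail = "0" * tail_len  # first individual: the good-feature combination alone
        else:
            tail = "".join("0" if rng.random() < 0.5 else "1" for _ in range(tail_len))
        individuals.append(head + tail)
    return individuals


t_size, p = 320, 3
init_size = t_size // 2 ** p  # Eq. (4): floor(T_Size / NM)
pop = init_sub_population("101", n_features=24, init_size=init_size, rng=random.Random(0))
print(init_size, pop[0])
```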

For individual Id( $i$ , $j$ ), a classifier is employed and its classification accuracy is derived, which is regarded as the fitness function value of the individual and expressed as Fit(Id( $i$ , $j$ )).

3.3 Selection

The roulette-wheel selection method is employed by our proposed algorithm. In round $r$ , the probability that individual Id( $i$ , $j$ ) of round $r-1$ is selected is given by:

$\displaystyle\text{Pro\_Sel}(\text{Id}(i,j))=\frac{e^{\text{mul\_g}*\text{Fit}(\text{Id}(i,j))}}{\sum_{k=1}^{\text{Size}(\text{SP}_{i},r-1)}e^{\text{mul\_g}*\text{Fit}(\text{Id}(i,k))}}$ (5)

where mul_g is the coefficient that decides how strongly individuals with higher classification accuracy are favored. A random number in the range [0,1) is generated, and the individual whose probability interval contains the random number is selected. The procedure is repeated (Size(SP ${}_{i}$ , $r$ ) $-$ 1) times, and the remaining slot is taken by the individual that obtained the value Max(SP ${}_{i}$ ).
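A compact sketch of this selection step is given below. It relies on the standard-library call `random.Random.choices`, which normalizes the supplied weights and therefore reproduces the probabilities of Eq. (5). The function name, the toy population, and the mul_g value of 10 are assumptions for illustration; the best individual found so far is passed in explicitly as `best_so_far`.

```python
import math
import random


def select(individuals, fitness, new_size, mul_g, best_so_far, rng):
    """Roulette-wheel selection with exp(mul_g * Fit) weights plus one elite slot."""
    weights = [math.exp(mul_g * f) for f in fitness]
    # Draw new_size - 1 individuals with replacement, proportionally to the weights.
    chosen = rng.choices(individuals, weights=weights, k=new_size - 1)
    # The remaining slot goes to the individual that achieved Max(SP_i).
    return chosen + [best_so_far]


pop = ["1010", "1001", "1110"]
fit = [0.70, 0.80, 0.75]  # hypothetical classification accuracies
next_pop = select(pop, fit, new_size=5, mul_g=10.0, best_so_far="1001", rng=random.Random(1))
print(next_pop)
```

Because new_size is passed separately from the current population, the same routine can grow or shrink a sub population according to the size computed from Eq. (3).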

3.4 Crossover and mutation

As mentioned before, only the tails of individuals participate in crossover and mutation.

The probability that an individual is selected to perform crossover is pc. The selected individuals perform crossover in pairs. For two parents, a random number in the range [0,1) is generated for each bit of the tail, and the two parents exchange their values of that bit if the corresponding random number is less than pc_bit. This bitwise scheme lets crossover exchange any combination of tail bits rather than only contiguous segments.

The probability that an individual is selected to perform mutation is pm. Some individuals with overwhelming superiority usually appear in the later rounds and are selected many times during selection. Hence, the value of pm increases by pm_increase in each round to avoid evaluating the same individuals repeatedly. For an individual selected to perform mutation, a random number in the range [0,1) is generated for each bit of the tail, and the bit is flipped from ‘0’ to ‘1’ or from ‘1’ to ‘0’ if the corresponding random number is less than pm_bit. As with crossover, this bitwise scheme allows any combination of tail bits to mutate.
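The two tail-only operators can be sketched as below. The function names, the example strings, and the pc_bit and pm_bit values are hypothetical; the step that decides which individuals undergo crossover or mutation (with probabilities pc and pm) is left out for brevity.

```python
import random


def crossover_tails(a, b, head_len, pc_bit, rng):
    """Swap each tail bit of two parents independently with probability pc_bit."""
    ta, tb = list(a[head_len:]), list(b[head_len:])
    for k in range(len(ta)):
        if rng.random() < pc_bit:
            ta[k], tb[k] = tb[k], ta[k]
    return a[:head_len] + "".join(ta), b[:head_len] + "".join(tb)


def mutate_tail(ind, head_len, pm_bit, rng):
    """Flip each tail bit independently with probability pm_bit; the head is untouched."""
    tail = [("1" if c == "0" else "0") if rng.random() < pm_bit else c
            for c in ind[head_len:]]
    return ind[:head_len] + "".join(tail)


rng = random.Random(0)
c1, c2 = crossover_tails("10100000", "10111111", head_len=3, pc_bit=0.5, rng=rng)
m = mutate_tail("10100000", head_len=3, pm_bit=0.2, rng=rng)
print(c1, c2, m)
```

Both operators copy the head unchanged, so every offspring stays in the sub population of its parents.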

3.5 Computational complexity analysis

The time of this stage is dominated by the evaluation of the individuals’ classification accuracies. The classification accuracies of T_Size $\times$ NR individuals need to be computed in this stage. Thus, the computational complexity of this stage is O(T_Size $\times$ NR). The running time of the algorithm is the sum of the durations of the two stages. Usually, the second stage takes longer than the first stage. Therefore, the computational complexity of FS ${}^{2}$ GA, denoted C ${}_{\text{FS2GA}}$ , is given by:

$\displaystyle\text{C}_{\text{FS2GA}}=O(\text{T\_Size}\times\text{NR})$ (6)
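As a rough worked check of Eq. (6), the two parameter configurations reported in Table 5 (T_Size $=$ 320 with NR $=$ 40, and T_Size $=$ 640 with NR $=$ 20) give the same second-stage evaluation budget:

$\displaystyle 320\times 40=640\times 20=12800$

Together with the 5000 random feature subsets evaluated in the first stage (Section 4.5), a single run would therefore require about 17800 accuracy evaluations. This estimate assumes that every individual is evaluated once per round and that no evaluation is cached or skipped, which the text does not state.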

4. Experiments

Experiments are performed on 8 datasets, and our proposed algorithm is compared with 11 state-of-the-art algorithms to validate it.

4.1 Experimental environment

The software and hardware environments of the experiments are shown in Table 2.

Table 2
The software and hardware environments

Title	Content
CPU	Intel(R) Core(TM) i5-6500 CPU @ 3.20 GHz (4 Cores) $\times$ 1
GPU	Nvidia GeForce GTX 1050 (2 GB Memory) $\times$ 1
Memory	16 GB
Operating system	CentOS 7.0
Programming language	Python
Framework	TensorFlow

4.2 Datasets

Eight real world datasets from the UCI machine learning repository are used in the experiments, as shown in Table 3.

Table 3
The datasets used in the experiments

Name	Number of classes	Number of attributes	Number of instances
German	2	24	1000
Hill Valley	2	100	606
Sonar	2	60	208
Spambase	2	57	4601
Ionosphere	2	34	351
Waveform	16	21	5000
Dermatology	6	34	366
Lung	3	56	32

Table 4

The result of the first stage

Dataset	Number of good features	Good features
German	3	F9, F1, F3
Hill Valley	4	F70, F55, F73, F26
Sonar	3	F12, F11, F45
Spambase	3	F52, F27, F7
Ionosphere	3	F1, F8, F5
Waveform	4	F5, F4, F7, F6
Dermatology	2	F5, F15
Lung	3	F20, F53, F19

Table 5

The main parameter settings for each dataset

Dataset	Number of sub populations	Entire population size	Number of rounds
German	8	320	40
Hill Valley	16	640	20
Sonar	8	640	20
Spambase	8	320	40
Ionosphere	8	640	20
Waveform	16	640	20
Dermatology	4	640	20
Lung	8	640	20

Table 6

The average accuracy of different values of pc (%)

Dataset	pc $=$ 0.80	pc $=$ 0.90
German	75.20	75.35
Hill Valley	58.83	59.65
Sonar	83.03	83.37
Spambase	91.07	91.10
Ionosphere	89.65	89.73
Waveform	87.71	87.74
Dermatology	98.81	98.85
Lung	87.15	87.28

Table 7

The average accuracy of different initial values of pm (%)

Dataset	pm $=$ 0.25	pm $=$ 0.50
German	75.40	75.35
Hill Valley	59.99	59.65
Sonar	83.61	83.37
Spambase	91.13	91.10
Ionosphere	89.71	89.73
Waveform	87.73	87.74
Dermatology	98.85	98.85
Lung	86.75	87.28

Table 8

The experimental results of FS ${}^{2}$ GA (%)

Dataset	Mean	Variance	Best
German	75.69	0.22	76.00
Hill Valley	60.98	1.77	66.04
Sonar	84.69	0.83	86.50
Spambase	91.60	0.28	92.18
Ionosphere	90.49	0.34	91.16
Waveform	88.07	0.13	88.18
Dermatology	98.92	0.00	98.92
Lung	89.13	2.21	93.33

Figure 4.

The score for each feature of each dataset.

Table 9

The rankings of the algorithms for each dataset (mean $\pm$ variance %)

(1) German, Hill Valley, Sonar, Spambase
Rank	German		Hill Valley		Sonar		Spambase
1	FS ${}^{2}$ GA	75.69 $\pm$ 0.22	FS ${}^{2}$ GA	60.98 $\pm$ 1.77	FS ${}^{2}$ GA	84.69 $\pm$ 0.83	FS ${}^{2}$ GA	91.60 $\pm$ 0.28
2	MOEA/D	71.30	DEMOFS	60.46	MOEA/D	82.74	VMBACO	89.42 $\pm$ 1.44
3	BBA	70.24	MOEA/D	57.50	DEMOFS	78.60	MOEA/D	88.48
4	MDisABC	70.15 $\pm$ 1.87	MDisABC	55.08 $\pm$ 2.13	GCNC	76.33 $\pm$ 2.52	GCNC	88.21 $\pm$ 1.15
5	DEMOFS	70.10			bGWO	73.10	SFS	87.40
6	SFS	68.20					SBS	87.01
7	SBS	65.80					RB-FS	83.21
(2) Ionosphere, Waveform, Dermatology, Lung
Rank	Ionosphere		Waveform		Dermatology		Lung
1	MDisABC	93.62 $\pm$ 1.64	FS ${}^{2}$ GA	88.07 $\pm$ 0.13	FS ${}^{2}$ GA	98.92 $\pm$ 0.00	DEMOFS	90.02
2	FS ${}^{2}$ GA	90.49 $\pm$ 0.34	MOEA/D	83.65	MOEA/D	96.13	FS ${}^{2}$ GA	89.13 $\pm$ 2.21
3	MOEA/D	88.31	bGWO	78.90	VMBACO	95.16 $\pm$ 2.66	MDisABC	67.04 $\pm$ 8.49
4	SFS	88.70	SBS	78.46
5	SBS	85.92	SFS	77.82

4.3 Algorithms in comparison

The following state-of-the-art deterministic or EC based algorithms are compared with our proposed algorithm:

(1) Sequential Forward Selection (SFS) [19]
(2) Sequential Backward Selection (SBS) [19]
(3) Ranking Based method (RB-FS) [21]
(4) Graph Clustering with Node Centrality (GCNC) [22]
(5) Modified Discrete Artificial Bee Colony (MDisABC) [1]
(6) Multi-Objective Evolutionary Algorithm based on Decomposition (MOEA/D) [7]
(7) A Fast Feature Weighting algorithm of Data Gravitation Classification (FFW-DGC) [23]
(8) Multi Objective Differential Evolution based FS (DEMOFS) [36]
(9) Binary Bat Algorithm for FS (BBA) [37]
(10) Binary Grey Wolf Optimization approaches for feature selection (bGWO) [38]
(11) Visibility density Modified Binary coded Ant Colony Optimization algorithm (VMBACO) [39]

4.4 Classifier

For binary classification datasets, an SVM with a linear kernel is employed. For multi-class datasets, logistic regression is employed because the SVM with a linear kernel is not suitable for multi-class classification. Ten-fold cross-validation is adopted to compute the accuracy of every individual.

4.5 Settings and results of the first stage

In the first stage of our proposed algorithm, 5000 random feature subsets are generated for each dataset. The value of K in the K-Means $+$ $+$ clustering algorithm is 20. The score of each feature of each dataset is shown in Fig. 4, in which each bar corresponds to a feature.

Based on the scores above, good features of each dataset are screened and shown in Table 4.

Figure 5.

The accuracy of each round in GA of FS ${}^{2}$ GA for each dataset.

4.6 Settings and results of the second stage

Firstly, we determine the values of the parameters. Although large population sizes can yield better accuracies, we cannot use large population sizes because obtaining an individual’s classification accuracy is time-consuming and the performance of our hardware is limited. Thus, we set the population sizes of the datasets to 320 or 640 based on their numbers of good features and sub populations. Experiments show that the accuracy on a dataset usually stops increasing, or increases only slowly, after some generations. Thus, we set the numbers of rounds for the datasets as shown in Table 5.

To determine the value of the crossover probability pc, we perform experiments with different values. For each value of pc, we run FS ${}^{2}$ GA 3 times independently and compute the average accuracy, as shown in Table 6, where the initial value of pm is 0.50. As can be seen in the table, pc $=$ 0.90 gives a higher average accuracy than pc $=$ 0.80 on every dataset. Thus, we set pc $=$ 0.90.

As mentioned in Section 3.4, the value of the mutation probability pm increases in each round. We also perform experiments with different values to determine the initial value of pm. For each initial value of pm, we run FS ${}^{2}$ GA 3 times independently and compute the average accuracy, as shown in Table 7, where pc $=$ 0.90. As can be seen in the table, the initial value 0.25 gives a higher average accuracy than 0.50 on four datasets, a lower one on three, and the same on one. We set the initial value of pm to 0.25.

For each dataset, we run FS ${}^{2}$ GA 25 times independently, and compute the average value, best value and variance of the accuracies of the runs. The experimental results are shown in Table 8.

Table 9 shows the comparison of FS ${}^{2}$ GA with the other algorithms, in which the accuracies of the compared algorithms are taken from the corresponding literature. For each dataset, the algorithms are ranked from high to low according to their accuracies on the dataset. As can be seen in the table, our proposed algorithm takes first place on 6 datasets and second place on 2 datasets, which indicates that it outperforms the other algorithms on most datasets. We attribute this to the screening and full use of good features, together with the GA with multiple sub populations and the competition mechanism. Furthermore, Fig. 5 shows the accuracy of each round in the GA of FS ${}^{2}$ GA for each dataset. As before, each value in the figure is the average over 25 independent runs.

Table 10
The computational time of FS ${}^{2}$ GA for each dataset (hours)

Dataset	Time	Dataset	Time
German	4.50	Hill Valley	3.65
Sonar	4.47	Spambase	6.35
Ionosphere	4.44	Waveform	6.34
Dermatology	4.39	Lung	3.79

4.7 Computational Time

The computational times of one run of FS ${}^{2}$ GA on each dataset are shown in Table 10. Because the performance of our hardware, especially the GPU, is fairly limited, the computational times could be significantly reduced with high performance hardware, such as GPU computing servers with multiple high performance GPUs.

5. Conclusion

With the available EC based algorithms, features are not screened before evolutionary computation starts, and all of them have equal status during the evolutionary process. In this work, we propose a new algorithm that screens features before evolutionary computation starts and makes full use of the screened features during the evolutionary process. It is named FS ${}^{2}$ GA (a two-stage algorithm for Feature Selection based on Feature Score and Genetic Algorithms). In the first stage, all features are scored, and the good features are identified and given privileges over the other features in the subsequent evolutionary computation. In the second stage, we design a genetic algorithm with multiple sub populations, in which each sub population corresponds to a combination of good features, and a competition mechanism between sub populations is introduced. With our proposed algorithm, good features are screened, sufficiently used and extensively explored. As a result, classification accuracies are increased. The experimental results show that FS ${}^{2}$ GA ranks first on 6 of the 8 datasets and second on the other 2 when compared with 11 state-of-the-art algorithms. Future work will focus on screening good features more precisely, as well as reducing the computational time of FS ${}^{2}$ GA.

Acknowledgments

This paper is supported by the following scientific research projects of Mianyang Teachers’ College: QD2014A007 (no. 07165211) and 2014A07 (no. 07165212).

References

1. Emrah, Bing, Dervis, Mengjie. A binary ABC algorithm based on advanced similarity scheme for feature selection. Applied Soft Computing 2015; 2015(36): 334-348.

2. Isabelle, André. An introduction to variable and feature selection. J Mach Learn Res 2003; 2003(3): 1157-1182.

3. Manoranjan, Huan. Feature selection for classification. Intelligent Data Analysis 1997; 1(3): 131-156.

4. Boyan, Francisco, Daniela, Silvia. Information-theoretic selection of high-dimensional spectral features for structural recognition. Computer Vision and Image Understanding 2013; 117(3): 214-228.

5. Esmat, Hossein, Saeid. A simultaneous feature adaptation and feature selection method for content-based image retrieval systems. Knowledge-Based Systems 2013; 2013(39): 85-94.

6. Shiping, Witold, Qingxin, William. Unsupervised feature selection via maximum projection and minimum redundancy. Knowledge-Based Systems 2015; 2015(75): 19-29.

7. Sujoy, Swagatam. Simultaneous feature selection and weighting – an evolutionary multi-objective optimization approach. Pattern Recognition Letters 2015; (65): 51-59.

8. Ghadah, Wenjia. Determining appropriate approaches for using data in feature selection. International Journal of Machine Learning and Cybernetics 2017; 8(3): 915-928.

9. Isabelle, André. An introduction to variable and feature selection. J Mach Learn Res 2003; 3(2003): 1157-1182.

10. Dionysios, Aristomenis, George. MUSIPER: A system for modeling music similarity perception based on objective feature subset selection. User Modeling and User-Adapted Interaction 2008; 18(4): 315-348.

11. Dionysios, Aristomenis, George. Evaluation of modeling music similarity perception via feature subset selection. International Conference on User Modeling. 2007. p. 288-297.

12. Aristomenis, Dionisios, George. Individualization of music similarity perception via feature subset selection. The IEEE International Conference on Systems, Man and Cybernetics. 2004. p. 552-556.

13. Bing, Mengjie, Will. New fitness functions in binary particle swarm optimization for feature selection. IEEE Congress on Evolutionary Computation. 2012. p. 1-8.

14. Yvan, Iñaki, Pedro. A review of feature selection techniques in bioinformatics. Bioinformatics 2007; 2007(23): 2507-2517.

15. Pabitra, Chaitra, Sankar. Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002; 24(3): 301-312.

16. Yi-Cheng, Nikhil, I-Fang. An integrated mechanism for feature selection and fuzzy rule extraction for classification. IEEE Transactions on Fuzzy Systems 2012; 20(4): 683-698.

17. Kenji, Larry. A practical approach to feature selection. The Ninth International Workshop on Machine Learning. 1992. p. 249-256.

18. Hussein, Thomas. Learning boolean concepts in the presence of many irrelevant features. Artificial Intelligence 1994; 69(1-2): 279-305.

19.

Mineichi

Jack

. Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 2000; 33(1): 25-41.

20.

Pavel

Jana

Josef

. Floating search methods in feature selection. Pattern Recognition Letters 1994; 15(11): 1119-1125.

21.

Wang

Sun

Jiang

. Automatically fast determining of feature number for ranking-based feature selection. Electronics Letters 2012; 48(23): 1462-1463.

22.

Parham

Mehrdad

. A graph theoretic approach for unsupervised feature selection. Engineering Applications of Artificial Intelligence 2015; 2015(44): 33-45.

23.

Lizhi

Hongli

Haibo

. A fast feature weighting algorithm of data gravitation classification. Information Sciences 2016; 375: 54-78.

24.

Michael

William

Erik

Leslie

Anil

. Dimensionality reduction using genetic algorithms. IEEE Transactions on Evolutionary Computation 2000; 4(2): 164-171.

25.

Zhu

Ong

Dash

. Wrapper-filter feature selection algorithm using a memetic framework. IEEE Transactions on Systems Man and Cybernetics Part B 2007; 37(1): 70-76.

26.

Jihoon

Vasant

. Feature subset selection using a genetic algorithm. IEEE Intelligent Systems and Their Applications 1998; 13(2): 44-49.

27.

Ahmed

Zhang

Peng

. Improving feature ranking for biomarker discovery in proteomics mass spectrometry data using genetic programming. Connection Science 2014; 26(3): 215-243.

28.

Xue

Zhang

Browne

. Particle swarm optimization for feature selection in classification: A multi-objective approach. IEEE Transactions on Cybernetics 2013; 43(6): 1656-1671.

29.

Alper

. A discrete particle swarm optimization method for feature selection in binary classification problems. European Journal of Operational Research 2010; 206(3): 528-539.

30.

Liu

Wang

Chen

, et al. An improved particle swarm optimization for feature selection. Journal of Bionic Engineering 2011; 8(2): 191-200.

31.

Durga

Nikhil

Jyotirmoy

. Genetic programming for simultaneous feature selection and classifier design. IEEE Transactions on Systems Man and Cybernetics Part B 2006; 36(1): 106-117.

32.

Sheng

. Feature selection based f-score and aco algorithm in support vector machine. International Symposium on Knowledge Acquisition and Modeling. 2009. p. 19-23.

33.

Esra

Selma

AÖ

. An ant colony optimization based feature selection for webpage classification. The Scientific World Journal 2014; 2014(4): 1-16.

34.

. A rough set based hybrid method to feature selection. International Symposium on Knowledge Acquisition and Modeling. 2008. p. 585-588.

35.

Shahla

Mohammad

Nasser

Mehdi

. A novel aco-ga hybrid algorithm for feature selection in protein function prediction. Expert Systems with Applications 2009; 36(10): 12086-12094.

36.

Bing

Wenlong

Mengjie

. Multi-objective feature selection in classification: A differential evolution approach. Asia-Pacific Conference on Simulated Evolution and Learning. 2014. p. 516-528.

37.

Nakamura

RYM

Pereira

LAM

Costa

, et al. BBA: A binary bat algorithm for feature selection. The 25th SIBGRAPI Conference on Graphics, Patterns and Images. 2012. p. 291-297.

38.

Emary

Zawbaa

Hassanien

. Binary grey wolf optimization approaches for feature selection. Neurocomputing 2016; 2016(172): 371-381.

39.

Wan

Wang

, et al. A feature selection method based on modified binary coded ant colony optimization algorithm. Applied Soft Computing 2016; 2016(49): 248-258.
