Improved feature size customized fast correlation-based filter for Naive Bayes text classification

Abstract

Feature selection is an essential part of data preprocessing. In text classification, most previous feature selection algorithms rarely consider the redundancy between features. This paper focuses on eliminating that redundancy. After modifying the feature correlation formula of the original fast correlation-based filter (FCBF) and updating the algorithm strategy, we propose a new approach named improved feature size customized fast correlation-based filter (IFSC-FCBF). In addition, we combine IFSC-FCBF with the Naive Bayes (NB) classifier for text classification and test it on four typical text corpus data sets. The results demonstrate that, with the same feature size, the IFSC-FCBF method achieves higher accuracy and shorter running time than other methods.

Keywords

Feature selection Naive Bayes text classification FCBF IFSC-FCBF

1 Introduction

With the development of the Internet, textual information and its diversity continue to increase. As a result, text classification is attracting more and more attention from the research community. Text classification is the task of automatically assigning a new document to a predefined category [1]. A number of methods are now available to implement text classification, such as k-nearest neighbors (KNN) [2], decision trees [3], support vector machines [4], Naive Bayes [5] and neural networks [6]. Naive Bayes is the most popular among them due to its high computational efficiency and good predictive performance [7].

Before text classification, the text needs to be vectorized, that is, transformed into a vector or matrix that the computer can process. The text representation most commonly used in text classification tasks is the vector space model (VSM) [8], which we use in this paper to represent documents. The main idea is as follows: after word segmentation, the feature words in the training data are extracted to form a feature space (t₁, t₂, . . . , t_m), where t_k, k ∈ [1, m], are feature words and m is the dimension of the feature space. Then, the i-th document is expressed as D_i = (w_{t₁}, w_{t₂}, . . . , w_{t_m}), where w_{t_k} is the weight of feature t_k in the document D_i. If w_{t_k} = 0, t_k is absent from document D_i; otherwise w_{t_k} is the frequency with which t_k appears in D_i.
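As an illustration of the vector space model described above, the following minimal Python sketch builds term-frequency vectors from documents that are assumed to be already segmented into words. The function name `build_vsm` and the toy documents are hypothetical and are not taken from the experiments in this paper.

```python
from collections import Counter


def build_vsm(tokenized_docs):
    """Return the feature space (vocabulary) and one term-frequency vector per document."""
    # Feature space (t_1, ..., t_m): the sorted set of words in the training documents.
    vocab = sorted({tok for doc in tokenized_docs for tok in doc})
    vectors = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        # w_{t_k} is the frequency of t_k in the document; a Counter yields 0 for absent words.
        vectors.append([counts[t] for t in vocab])
    return vocab, vectors


docs = [["price", "oil", "oil"], ["game", "team"], ["oil", "team"]]
vocab, X = build_vsm(docs)
```

Each row of `X` is one document D_i, and a zero entry means that the corresponding feature word is absent from that document.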

In text classification, because the dimension of the feature space is large, it takes a long time to train the classifier. What is more, some features are useless, and redundant features may reduce the classification accuracy. Therefore, feature selection is especially important. There are two main families of feature selection methods in machine learning [9]: filters and wrappers. A filter method selects a feature subset as a preprocessing step, independently of the classification algorithm, whereas a wrapper method uses the accuracy of a classifier as the criterion for feature selection. Wrappers tend to perform better because they select the feature subset for a predetermined learning algorithm [10]. However, the wrapper method is time-consuming and complicated, which makes it unsuitable for text classification [11]. Therefore, we focus on filter methods.

Researchers have proposed many feature filter methods for text classification. Document frequency [12] and information gain (IG) [13] are the most popular ones among them. Although information gain is an effective method for feature selection, it has the drawback that information gain only gives each feature an IG value individually without considering redundancy between features [1].

In order to eliminate redundant features, Peng et al. proposed a typical method named MRMR [14], but the computational complexity is too high [1]. Lee et al. eliminated redundant information with the usage of divergence to improve information gain [15]. Uysal et al. proposed the distinguishing feature selector (DFS) based on feature probability [16]. However, although these methods can effectively remove redundant features, they are all of high complexity.

In order to perform feature selection faster, we focus on the fast correlation-based filter (FCBF) [12], which has been applied in various fields. Liu et al. used it to diagnose diesel engines [13]. Chen et al. combined FCBF with ReliefF for gene selection [14]. Davood et al. integrated PCA with FCBF to select audio and visual features for emotion recognition [15]. In the medical field, Sarojini et al. applied FCBF to select features for diabetes databases and achieved an improvement in medical diagnosis [16], and Bahasen used it in the diagnosis of epilepsy [17]. At present, FCBF is often used in emotion recognition [31, 32]. Although Suchi Vora et al. included FCBF in their study of feature selection algorithms for text classification, the classification performance of FCBF was not ideal [30].

In this paper, we propose an improved feature size customized fast correlation-based filter (IFSC-FCBF). Based on the characteristics of text features, we improve the formula of feature correlation in original FCBF and upgrade the algorithm strategy. Therefore, we can customize the selected feature size, which makes the text feature selection more accurate.

In the experiments, we combine IFSC-FCBF with Naive Bayes classifier and test it in English and Chinese corpus data sets. The results show that compared with other feature selection algorithms, IFSC-FCBF can effectively eliminate redundant features in a shorter execution time and has higher accuracy.

The rest of the paper is organized as follows: The second part discusses the related research of the existing algorithms involved in this paper. The third part introduces the theoretical derivation of IFSC-FCBF in detail. The fourth part gives the simulation comparison results of this method and other feature selection methods. The fifth part is the conclusion and future research directions of this paper.

2 Related work

2.1 Naive Bayes text classification

Text classification is a discrete data classification problem. There are usually two kinds of Naive Bayes models [18]: one is the Bernoulli Naive Bayes (BNB) [19], which only considers whether the features appear in the documents; the other is the multinomial Naive Bayes (MNB) [20], which takes the frequency of features in the documents into account. Experiments show that the multinomial model classifies better than the Bernoulli model [21]. In this paper, we choose the multinomial Naive Bayes model [20]. The idea of the algorithm is as follows: first, calculate the prior probability of each category; then, use Bayes' theorem to calculate the posterior probability of each category given the features of the document; finally, assign the document to the category with the maximum a posteriori (MAP) probability.

Assume the set of document categories is C = {C₁, C₂, . . . , C_V}, where V is the number of categories, and let D_i be a training document represented as D_i = {tf (t₁), tf (t₂), . . . , tf (t_m)}. The document D_i is assigned to the category with the maximum posterior probability, which by Bayes' theorem is: $P (C_{j} | D_{i}) = \frac{P (D_{i} | C_{j}) P (C_{j})}{P (D_{i})}$ (1) where P (C_j) is the prior probability of the category C_j; P (D_i|C_j) is the probability of observing the document D_i given the category C_j; P (D_i) = P (t₁, t₂ . . . t_m) is the joint probability of all features. Since P (D_i) does not depend on C_j, it is a constant for a given document, so equation (1) can be converted into:

$C_{map} = \arg\max_{C_{j} \in C} P (C_{j} | D_{i}) = \arg\max_{C_{j} \in C} P (C_{j}) P (D_{i} | C_{j})$ (2)

where C_map represents the final classification result.

According to the Naive Bayes feature independence assumption, equation (2) can be simplified as: $C_{map} = \arg\max_{C_{j} \in C} P (C_{j}) P (t_{1}, t_{2}, \ldots, t_{m} | C_{j}) = \arg\max_{C_{j} \in C} P (C_{j}) \prod_{k = 1}^{m} P (t_{k} | C_{j})$ (3)

where m is the number of features, t_k (k = 1, 2, 3.... m) is the kth feature word in the document D_i.
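The MAP decision rule of equation (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: it works in log space to avoid numerical underflow, weights each log-likelihood by the term frequency (as in the multinomial model), uses Laplace smoothing with a hypothetical parameter `alpha`, and assumes that every class has at least one training document.

```python
import math


def train_mnb(X, y, n_classes, alpha=1.0):
    """Estimate log P(C_j) and log P(t_k | C_j) from term-frequency vectors X and labels y."""
    n_features = len(X[0])
    class_counts = [0] * n_classes
    term_counts = [[0] * n_features for _ in range(n_classes)]
    for row, label in zip(X, y):
        class_counts[label] += 1
        for k, tf in enumerate(row):
            term_counts[label][k] += tf
    # Prior P(C_j): fraction of training documents in class j (assumed non-empty).
    log_prior = [math.log(c / len(y)) for c in class_counts]
    log_like = []
    for j in range(n_classes):
        total = sum(term_counts[j]) + alpha * n_features
        # Smoothed P(t_k | C_j); alpha is an assumed Laplace smoothing parameter.
        log_like.append([math.log((term_counts[j][k] + alpha) / total)
                         for k in range(n_features)])
    return log_prior, log_like


def predict_mnb(row, log_prior, log_like):
    """Return the index of the class maximizing log P(C_j) + sum_k tf_k * log P(t_k | C_j)."""
    scores = [lp + sum(tf * ll[k] for k, tf in enumerate(row))
              for lp, ll in zip(log_prior, log_like)]
    return max(range(len(scores)), key=scores.__getitem__)


X_train = [[2, 0], [3, 0], [0, 2], [0, 1]]
y_train = [0, 0, 1, 1]
log_prior, log_like = train_mnb(X_train, y_train, n_classes=2)
```

Taking logarithms turns the product in equation (3) into a sum, which leaves the maximizing category unchanged.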

2.2 Information gain feature selection (IG)

Information gain (IG) is an evaluation method based on information entropy, which measures the impact of the presence of feature terms on text categories [33]. For a discrete random variable X with possible values {x₁, x₂, x₃ … x_n}, let p (x_i) denote the probability that X takes the value x_i. The information entropy of X is defined as follows: $H (X) = - \sum_{i} p (x_{i}) {log}_{2} p (x_{i})$ (4)

For a classification system, the information gain IG (t_k) of the feature term t_k is the difference between the information entropy of the classification system without and with knowledge of whether t_k is present in a text. The larger the information gain of a feature term, the more valuable information it brings to the classification system. It is calculated as follows: $IG (t_{k}) = - \sum_{j = 1}^{V} P (C_{j}) log P (C_{j}) + P (t_{k}) \sum_{j = 1}^{V} P (C_{j} | t_{k}) log P (C_{j} | t_{k}) + P ({\bar{t}}_{k}) \sum_{j = 1}^{V} P (C_{j} | {\bar{t}}_{k}) log P (C_{j} | {\bar{t}}_{k})$ (5)

where V is the number of classes, P (C_j) is the probability of class C_j, P (t_k) is the probability that feature t_k appears in a document and $P ({\bar{t}}_{k})$ is the probability that it does not, P (C_j|t_k) is the conditional probability of C_j when t_k appears and $P (C_{j} | {\bar{t}}_{k})$ is the conditional probability of C_j when t_k is absent. IG has two drawbacks. First, only the document frequency of the feature terms is considered, and the term frequency is ignored. As a result, the dependence of the selected features on the category is not strong and their representativeness is insufficient. Second, the impact of feature terms on the overall classification is measured only from a global perspective, and the differences in the distribution of feature terms within and between classes are not considered. Consequently, feature terms that are more valuable for classification may be filtered out, which degrades feature selection.
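The information gain of a single feature term can be sketched as below. This is a minimal illustration assuming that all probabilities are estimated from document counts (presence or absence of the term per document) and that the logarithm is taken to base 2; the function name and toy data are hypothetical.

```python
import math


def information_gain(presence, labels, n_classes):
    """IG of one term; presence[i] is 1 if the term occurs in document i, else 0."""
    n = len(labels)

    def plogp(p):
        # Convention: 0 * log(0) = 0.
        return p * math.log2(p) if p > 0 else 0.0

    # Entropy of the class distribution: -sum_j P(C_j) log P(C_j).
    ig = -sum(plogp(labels.count(j) / n) for j in range(n_classes))
    # Add P(t) sum_j P(C_j|t) log P(C_j|t) for the term present (1) and absent (0).
    for value in (1, 0):
        subset = [c for x, c in zip(presence, labels) if x == value]
        if not subset:
            continue
        weight = len(subset) / n
        ig += weight * sum(plogp(subset.count(j) / len(subset))
                           for j in range(n_classes))
    return ig


labels = [0, 0, 1, 1]
ig_informative = information_gain([1, 1, 0, 0], labels, n_classes=2)
ig_uninformative = information_gain([1, 0, 1, 0], labels, n_classes=2)
```

A term that occurs only in one class removes all uncertainty about the class, while a term spread evenly over both classes removes none.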

2.3 Distinguishing feature selector (DFS)

DFS selects distinctive features while eliminating uninformative ones, according to certain requirements on term characteristics [11]. The first selection principle of DFS is: if a feature frequently appears in one category and does not appear in the other categories, it must receive a high score; otherwise, it receives a low score. Based on this principle, an initial scoring framework is constituted as $\sum_{j = 1}^{V} \frac{P (C_{j} | t)}{P (\bar{t} | C_{j}) + 1}$ (6)

According to a second principle, if a feature appears in all categories, it is irrelevant and must be given a low score. The formulation is therefore extended to $DFS (t_{k}) = \sum_{j = 1}^{V} \frac{P (C_{j} | t_{k})}{P (\bar{t_{k}} | C_{j}) + P (t_{k} | \bar{C_{j}}) + 1}$ (7)

where V is the number of classes, P (C_j|t_k) is the conditional probability of C_j when t_k appears, $P (\bar{t_{k}} | C_{j})$ is the conditional probability of the absence of feature t_k in class C_j, and $P (t_{k} | \bar{C_{j}})$ is the conditional probability of feature t_k appearing in all classes except C_j. Equation (7) is the final DFS scoring formula.
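Equation (7) can be sketched in Python as follows. This is a minimal illustration that assumes all probabilities are estimated from document counts; the function name `dfs_score` and the toy data are hypothetical.

```python
def dfs_score(presence, labels, n_classes):
    """DFS score of one term; presence[i] is 1 if the term occurs in document i."""
    n_with_term = sum(presence)
    if n_with_term == 0:
        return 0.0  # the term never occurs, so P(C_j | t) is undefined
    score = 0.0
    for j in range(n_classes):
        in_class = [x for x, c in zip(presence, labels) if c == j]
        out_class = [x for x, c in zip(presence, labels) if c != j]
        if not in_class:
            continue
        p_c_given_t = sum(in_class) / n_with_term            # P(C_j | t)
        p_not_t_given_c = 1 - sum(in_class) / len(in_class)  # P(not t | C_j)
        p_t_given_not_c = (sum(out_class) / len(out_class)   # P(t | not C_j)
                           if out_class else 0.0)
        score += p_c_given_t / (p_not_t_given_c + p_t_given_not_c + 1)
    return score


labels = [0, 0, 1, 1]
score_distinctive = dfs_score([1, 1, 0, 0], labels, n_classes=2)
score_everywhere = dfs_score([1, 1, 1, 1], labels, n_classes=2)
```

On this toy data, the term that occurs only in one class obtains a score of 1.0, and the term that occurs in every document obtains 0.5, which matches the two principles stated above.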

2.4 Fast correlation-based filter (FCBF)

FCBF utilizes symmetric uncertainty to measure the correlation between features. The idea is that a feature is good if it is highly correlated with the class but not highly correlated with any of the other features [12]. It is a typical heuristic sequence backward elimination method. The information entropy of the variable X is defined as: $H (X) = - \sum_{i} P (x_{i}) log P (x_{i})$ (8)

where P (x_i) represents the probability that the variable X equals x_i, and the conditional information entropy of X given variable Y is defined as: $H (X | Y) = - \sum_{j} P (y_{j}) \sum_{i} P (x_{i} | y_{j}) log P (x_{i} | y_{j})$ (9)

where P (y_j) represents the probability that the variable Y equals y_j, and P (x_i|y_j) is the probability that X equals x_i given that Y equals y_j. Information gain is the amount by which the entropy of X decreases when Y is known [22], given by $IG (X | Y) = H (X) - H (X | Y)$ (10)

Therefore, symmetrical uncertainty can be calculated as follows: $SU (X, Y) = 2 (\frac{IG (X | Y)}{H (X) + H (Y)})$ (11)

FCBF can be described as follows: if the symmetrical uncertainty between a feature and the category is high and its symmetrical uncertainty with the other features is relatively low, the feature is considered a predominant feature. First, the symmetrical uncertainty SU_{f_i,c} between each feature and the category is computed, and the feature with the maximum value of SU is selected as the predominant feature f_q. Next, the symmetrical uncertainty SU_{f_q,f_p} between the predominant feature and each other feature f_p is computed. If SU_{f_q,f_p} ≥ SU_{f_p,c}, the feature f_p is removed as redundant. Then the next predominant feature is selected from the remaining features. These steps are repeated until no features remain.
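The symmetrical uncertainty of equations (8)-(11) can be sketched in Python for two discrete sequences of equal length. This is a minimal illustration that assumes probabilities are estimated by relative frequencies and that base-2 logarithms are used; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(xs):
    """H(X) estimated from the relative frequencies of the values in xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def conditional_entropy(xs, ys):
    """H(X | Y) = sum_j P(y_j) * H(X | Y = y_j)."""
    n = len(xs)
    h = 0.0
    for y, count in Counter(ys).items():
        xs_given_y = [x for x, yy in zip(xs, ys) if yy == y]
        h += (count / n) * entropy(xs_given_y)
    return h


def symmetrical_uncertainty(xs, ys):
    """SU(X, Y) = 2 * IG(X | Y) / (H(X) + H(Y)); defined as 0 when both entropies are 0."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2 * (hx - conditional_entropy(xs, ys)) / (hx + hy)


su_identical = symmetrical_uncertainty([0, 0, 1, 1], [0, 0, 1, 1])
su_independent = symmetrical_uncertainty([0, 1, 0, 1], [0, 0, 1, 1])
```

SU equals 1 when one variable fully determines the other and 0 when the two variables are independent, which is why FCBF can use the same measure for both relevance (feature versus category) and redundancy (feature versus feature).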

3 Proposal of IFSC-FCBF

Although FCBF has made great achievements in many fields, the experiment in the next section shows that it is not ideal enough for text classification. On the one hand, the original FCBF algorithm does not consider the distribution of features in the text when computing information gain. On the other hand, when inputs are highly correlated, FCBF algorithm may eliminate too many features [22]. In order to improve this method, we use the characteristic distributions of features to improve corresponding information gain, and then calculate the correlation. Also, we add an additional parameter for customizing the feature size to balance feature selection.

3.1 Correlation between feature and category

Information entropy is a quantity that reflects the degree of uncertainty of a variable [23]. In information classification, it reflects the distribution of feature in the corpus. For categories and features, the information entropy is defined as: $H (C) = - \sum_{j = 1}^{V} P (C_{j}) \times {log}_{2} P (C_{j})$ (12) $P (C_{j}) = \frac{\sum_{i = 1}^{n} δ (C_{i}, C_{j}) + L}{n + V \times L}$ (13) $H (t_{k}) = - \sum_{j = 1}^{V} P (t_{k}) \times {log}_{2} P (t_{k})$ (14) $P (t_{k}) = \frac{tf (t_{k, i}) + L}{\sum_{i = 1}^{n} tf (t_{k, i}) + V \times L}$ (15)

where n is the number of training documents, V is the number of categories, C_i is the class label of training document D_i, P (t_k) is the probability of the feature word t_k, and tf (t_k,i) is the frequency of the feature word t_k in the training document D_i. L is a smoothing factor, and we take L = 0.001. δ (C_i, C_j) is a binary indicator function that equals 1 if C_i = C_j and 0 otherwise.

For feature t_k, the conditional information entropy given the distribution of categories is: $H (t_{k} | C) = - \sum_{j = 1}^{V} P (t_{k} | C_{j}) \times {log}_{2} P (t_{k} | C_{j})$ (16) $P (t_{k} | C_{j}) = \frac{tf (t_{k} | C_{j}) + L}{\sum_{j = 1}^{V} tf (t_{k} | C_{j}) + V \times L}$ (17)

where P (t_k|C_j) is the conditional probability of feature word t_k in class C_j, tf (t_k|C_j) is frequency of the feature word t_k appearing in class C_j, L is the smoothing factor and V is the number of categories.

The feature information gain is the change in the information entropy of the feature given the distribution of categories. The formula is as follows: $IG (t_{k} | C) = H (t_{k}) - H (t_{k} | C)$ (18)

Therefore, the correlation between feature and category can be described as: $Corr (t_{k}, C) = \frac{IG (t_{k} | C)}{\sqrt{H (t_{k})} \times \sqrt{H (C)}}$ (19)
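The feature-category correlation of equations (12)-(19) can be sketched in Python as below. This sketch rests on one interpretive assumption that should be checked against the original derivation: the entropy H (t_k) in equation (14) is taken over the per-document probabilities of equation (15), that is, summed over the n training documents. Everything else follows the equations as written, including the smoothing term V × L. The function names and toy data are hypothetical.

```python
import math

L = 0.001  # smoothing factor, as stated in the text


def _entropy(probs):
    return -sum(p * math.log2(p) for p in probs)


def corr_feature_class(tf, labels, n_classes):
    """Corr(t_k, C) for one feature; tf[i] is its frequency in document i."""
    n = len(labels)
    # Eq. (13): smoothed class probabilities P(C_j).
    p_c = [(labels.count(j) + L) / (n + n_classes * L) for j in range(n_classes)]
    # Eq. (15): smoothed per-document probabilities of t_k (assumed indexed by document i).
    total = sum(tf) + n_classes * L
    p_t = [(x + L) / total for x in tf]
    # Eq. (17): smoothed distribution of t_k over the classes.
    tf_per_class = [sum(x for x, c in zip(tf, labels) if c == j)
                    for j in range(n_classes)]
    p_t_c = [(x + L) / (sum(tf_per_class) + n_classes * L) for x in tf_per_class]
    h_c, h_t, h_t_c = _entropy(p_c), _entropy(p_t), _entropy(p_t_c)
    ig = h_t - h_t_c                                  # Eq. (18)
    return ig / (math.sqrt(h_t) * math.sqrt(h_c))     # Eq. (19)


labels = [0, 0, 0, 1, 1, 1]
corr_concentrated = corr_feature_class([1, 1, 1, 0, 0, 0], labels, n_classes=2)
corr_spread = corr_feature_class([1, 1, 1, 1, 1, 1], labels, n_classes=2)
```

Under this reading, a feature whose occurrences are concentrated in one class obtains a larger correlation with the category than a feature spread evenly over all documents.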

3.2 Correlation between feature and feature

For feature t_k1, the conditional information entropy given the distribution of feature t_k2 is: $H (t_{k 1} | t_{k 2}) = - \sum_{j = 1}^{V} P (t_{k 1} | t_{k 2}) {log}_{2} P (t_{k 1} | t_{k 2})$ (20) $P (t_{k 1} | t_{k 2}) = \frac{df (t_{k 1}, t_{k 2} | C_{j})}{df (t_{k 2} | C_{j})}$ (21)

where P (t_k1|t_k2) is the conditional probability of feature t_k1 given feature t_k2, df (t_k1, t_k2|C_j) is the number of documents of t_k1 and t_k2 appearing concurrently in class C_j, df (t_k2|C_j) is the number of documents of t_k2 appearing in class C_j.

Therefore the information gain of feature t_k1 given feature t_k2 in class C_j can be calculated as: $IG (t_{k 1} | t_{k 2}) = H (t_{k 1} | C) - H (t_{k 1} | t_{k 2})$ (22)

The correlation between t_k1 and t_k2 is as follows: $Corr (t_{k 1}, t_{k 2}) = \frac{IG (t_{k 1} | t_{k 2})}{\sqrt{H (t_{k 1})} \times \sqrt{H (t_{k 2})}}$ (23)

In order to obtain the desired feature size, we modify the procedure of FCBF. According to Yu [12], the correlation between a feature and the category is predominant, which means that Corr (t_k, C) carries more weight than the correlation between features. Therefore, if all the features have been filtered but the final output feature list is smaller than the size we set, we complement the list with the remaining features that have the largest values of Corr (t_k, C). In this way, the required number of features is obtained directly from the final output. The algorithm only adds two conditional checks to the FCBF algorithm. Therefore, its complexity is the same as that of the FCBF algorithm, O (mn log n) [12].

4 Experiments and results

4.1 Experimental data

In the experiment, we adopt four common data sets, which include two English sets, 20 Newsgroups [24, 25] and Reuters-21578 [24, 27], and two Chinese sets, Sogou Lab Corpus [28] and Fudan University Chinese Corpus [27, 29]. The 20 Newsgroups data set is a collection of approximately 20,000 newsgroup documents, partitioned evenly across 20 different newsgroups. Reuters-21578 has 36 categories containing 10,835 documents. Sogou Lab Corpus is compiled from Sohu news and includes 10 news categories, each containing 2000 documents. Fudan University Chinese Corpus is provided by Ronglu Li and includes 10 categories containing 2,816 documents. In Reuters-21578, the largest class has 2875 documents and the smallest class has only 35 documents, so we select 6 categories and 150 documents per class for the experiment. For the other data sets we select 6 categories and 200 documents per class. The experiment was performed on an Ubuntu machine with an Intel (R) Xeon (R) CPU E5-2690 v4 @ 2.60 GHz and 125 GB of RAM, and the programming language was Python 3.6. We use the document frequency as the criterion for selecting the candidate feature words and report the average over 10-fold cross-validation. The data distribution is shown in Table 2:

Table 1

IFSC-FCBF Algorithm

Input: text matrix X, category label matrix C
Output: feature list S_best
1 initialize T = {t₁, t₂, . . . , t_m}, S_list = {}, size is an integer, thresh is a decimal
2 for t_k ∈ T, calculate Corr (t_k, C), append t_k to S_list when Corr (t_k, C) ≥ thresh
3 sort S_list in descending order of Corr (t_k, C), initialize t_p = getFirst (S_list), initialize S_best = {t_p}
4 do
    t_q = getNext (S_list)
    if t_q ≠ NULL:
      calculate Corr (t_p, t_q)
      if (Corr (t_p, t_q) ≥ Corr (t_q, C)):
        remove t_q from S_list
      else:
        append t_q to S_best
      t_q = getNext (S_list)
      if length (S_best) ≥ size:
        return S_best
    else: t_p = getNext (S_list)
  until t_p == NULL
  if length (S_best) < size:
    S_best = S_best + S_list [: (size - length (S_best))]
  return S_best
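The selection procedure of the IFSC-FCBF algorithm can be sketched in Python as follows. This is one possible reading of the pseudocode, not a definitive implementation: it assumes that the correlations Corr (t_k, C) are precomputed in a list `corr_class`, that the feature-feature correlation is supplied as a hypothetical callable `corr_pair(p, q)`, that redundancy is checked in the FCBF manner (each predominant feature removes the lower-ranked features that are more correlated with it than with the category), and that the top-up step draws the removed features in descending order of Corr (t_k, C).

```python
from typing import Callable, List, Sequence


def ifsc_fcbf_select(corr_class: Sequence[float],
                     corr_pair: Callable[[int, int], float],
                     size: int,
                     thresh: float = 0.0) -> List[int]:
    """Return the indices of up to `size` selected features."""
    # Steps 1-3: keep features with Corr(t_k, C) >= thresh, sorted by descending relevance.
    ranked = sorted((k for k, c in enumerate(corr_class) if c >= thresh),
                    key=lambda k: corr_class[k], reverse=True)
    removed = set()
    selected: List[int] = []
    # Step 4: FCBF-style redundancy elimination with an early exit at the requested size.
    for i, p in enumerate(ranked):
        if p in removed:
            continue
        selected.append(p)  # p is a predominant feature
        if len(selected) >= size:
            return selected
        for q in ranked[i + 1:]:
            if q not in removed and corr_pair(p, q) >= corr_class[q]:
                removed.add(q)  # q is redundant with respect to p
    # Top-up: too few features survived, so complement with the most relevant removed ones.
    for q in ranked:
        if len(selected) >= size:
            break
        if q not in selected:
            selected.append(q)
    return selected


# Toy example: feature 1 is highly correlated with feature 0, so it is removed first.
relevance = [0.9, 0.8, 0.5, 0.4]
pairwise = [[1.0, 0.85, 0.1, 0.1],
            [0.85, 1.0, 0.1, 0.1],
            [0.1, 0.1, 1.0, 0.1],
            [0.1, 0.1, 0.1, 1.0]]
pick3 = ifsc_fcbf_select(relevance, lambda p, q: pairwise[p][q], size=3)
pick4 = ifsc_fcbf_select(relevance, lambda p, q: pairwise[p][q], size=4)
```

With `size=3` the redundant feature 1 is skipped, whereas with `size=4` it is added back at the end because too few non-redundant features remain, which is the behavior that distinguishes the customized feature size from the original FCBF.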

Table 2

Data distribution

Data Set	Total Data	Training Data	Test Data	Class Number
20_newsgroup	1200	720	480	6
Reuters-21578	900	540	360	6
Sogou Lab corpus	1200	720	480	6
Fudan corpus	1200	720	480	6

4.2 Text preprocessing and evaluation

We preprocess the text with a Python 3.6 program. In English text preprocessing, we remove special symbols, Arabic numerals and stop words, and apply stemming. In Chinese text preprocessing, the text is first segmented into words using Jieba [29], and then special symbols, Arabic numerals and stop words are removed. The text classification procedure is displayed in Fig. 1.

Fig. 1

Text classification flowchart.

The experiment uses precision (P), recall (R), F1 score and Macro F1 score to evaluate the performance of the classification algorithms. The formulas are as follows, where V represents the number of categories and P_i and R_i are the precision and recall of the i-th category:

$Precision : P = \frac{TP}{TP + FP}$ (24)

$Macro Precision : Macro\_P = \frac{1}{V} \sum_{i = 1}^{V} P_{i}$ (25)

$Recall : R = \frac{TP}{TP + FN}$ (26)

$Macro Recall : Macro\_R = \frac{1}{V} \sum_{i = 1}^{V} R_{i}$ (27)

$F 1 score : F 1 = \frac{2 \times P \times R}{P + R}$ (28)

$Macro F 1 score : Macro\_F 1 = \frac{2 \times Macro\_P \times Macro\_R}{Macro\_P + Macro\_R}$ (29)

where TP is the number of positive instances correctly predicted as positive, FP is the number of negative instances incorrectly predicted as positive, TN is the number of negative instances correctly predicted as negative, and FN is the number of positive instances incorrectly predicted as negative. The relationship is shown in Table 3:

Table 3

Parameters meaning

True label	Predicted label
	Positive	Negative
Positive	TP	FN
Negative	FP	TN
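The evaluation measures of equations (24)-(29) can be sketched in Python from a multi-class confusion matrix. This is a minimal illustration: the layout `confusion[i][j]` (true class i, predicted class j) and the function name are assumptions. Note that, following equation (29), the Macro F1 score is computed from the macro-averaged precision and recall rather than as the mean of the per-class F1 values.

```python
def macro_f1(confusion):
    """Macro F1 from a V x V confusion matrix (rows: true class, columns: predicted class)."""
    v = len(confusion)
    precisions, recalls = [], []
    for j in range(v):
        tp = confusion[j][j]
        fp = sum(confusion[i][j] for i in range(v)) - tp  # predicted j, true class differs
        fn = sum(confusion[j]) - tp                       # true j, predicted class differs
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    macro_p = sum(precisions) / v  # Eq. (25)
    macro_r = sum(recalls) / v     # Eq. (27)
    if macro_p + macro_r == 0:
        return 0.0
    return 2 * macro_p * macro_r / (macro_p + macro_r)  # Eq. (29)


score = macro_f1([[2, 0], [1, 1]])
```

For the toy matrix above, the per-class precisions are 2/3 and 1 and the recalls are 1 and 1/2, so Macro_P ≈ 0.833, Macro_R = 0.75 and the Macro F1 score is approximately 0.789.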

4.3 Experiment results

In this paper, we compare five methods: FCBF, IG, DFS, SVM and IFSC-FCBF.

FCBF: Fast Correlation-Based Filter [12]

IG: Information gain feature selection approach [8].

DFS: Distinguishing feature selector [11]

SVM: Support vector machine [4]

IFSC-FCBF: Improved feature size customized fast correlation-based filter we propose.

We first perform experiments in English data sets. In order to observe the effect of the selected feature number on the Macro F1 score, we increase the number of features from 100 to 500 continuously, and compare the performance of the above five methods as shown in Figs. 2 and 3:

Fig. 2

The Macro F1 score of algorithms on 20Newsgroup.

Fig. 3

The Macro F1 score of algorithms on Reuters-21578.

From Figs. 2 and 3, we notice a slight increase in the Macro F1 score as the number of features increases on the two English data sets. Once the number of features reaches 300, the Macro F1 score gradually levels off. What the two graphs have in common is that the IFSC-FCBF feature selection algorithm selects features more effectively and obtains the best Macro F1 score. Among the other four methods, the original FCBF algorithm eliminates too many features, as discussed in Section 3, so it cannot obtain ideal results in text classification. SVM, although a classic classification algorithm, obtains a lower F1 value than all the other algorithms except FCBF. The DFS feature selection algorithm obtains higher F1 values than the IG algorithm, especially on the 20 Newsgroups data set. When the number of features reaches 300, the F1 value of DFS is about 3% higher than that of IG.

In the same way, the results we obtain in the Chinese data sets are shown as follows:

In Figs. 4 and 5, the trend of the Macro F1 score is similar to that on the English data sets: once the number of features reaches 300, adding more features has little further effect on the performance of the algorithms. On average, the Macro F1 score of the DFS algorithm is about 1.4% higher than that of IG on the Fudan corpus and about 8% higher on the Sogou corpus, while the Macro F1 score of the IFSC-FCBF algorithm is about 1.3% higher than that of DFS on the Fudan corpus and about 1.5% higher on the Sogou corpus.

Fig. 4

The Macro F1 score of algorithms on Fudan Corpus.

Fig. 5

The Macro F1 score of algorithms on Sogou Corpus.

Then we set the feature size to 200 and compare the feature selection algorithms in each category. The maximum value of F1 in each category is represented in bold. In addition, we calculate the average of each column and add it at the bottom of each table, where the maximum average values of P, R and F1 are also represented in bold. The results are shown in Tables 4 to 7.

From Tables 4 and 5, which compare the algorithms on the English data sets, we find that most of the maximum F1 values are obtained by the IFSC-FCBF algorithm. However, different feature selection algorithms focus on different features, that is, each algorithm has its own preferred features, so the results of each algorithm favor some specific categories. Therefore, we cannot guarantee that the F1 values obtained by IFSC-FCBF are the best for all categories. We can only say that on these two English data sets, based on the comparison of precision, recall and F1 values, the IFSC-FCBF algorithm performs better at feature selection overall.

Table 4

Comparison of each category on 20Newsgroup

	IFSC-FCBF			DFS			IG			FCBF			SVM
Category	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
alt. atheism	0.95	0.86	0.91	0.94	0.84	0.88	0.63	0.99	0.77	0.44	0.79	0.56	0.91	0.81	0.85
comp. graphics	0.82	0.82	0.82	0.90	0.82	0.86	0.89	0.73	0.80	0.78	0.66	0.72	0.70	0.80	0.75
Misc. forsale	0.98	0.94	0.96	0.93	0.94	0.94	0.96	0.97	0.96	0.66	0.84	0.74	0.88	0.86	0.87
rec. autos	0.95	0.94	0.94	0.95	0.95	0.95	0.94	0.96	0.95	0.97	0.78	0.86	0.97	0.82	0.89
sci. crypt	0.84	0.81	0.83	0.82	0.80	0.81	0.83	0.70	0.76	0.94	0.56	0.70	0.73	0.89	0.80
talk. politics. guns	0.76	0.90	0.82	0.73	0.87	0.80	0.90	0.71	0.79	0.71	0.51	0.60	0.83	0.77	0.80
Average Score	0.89	0.88	0.88	0.88	0.87	0.87	0.86	0.84	0.84	0.76	0.69	0.70	0.83	0.82	0.83

Table 5

Comparison of each category on Reuters-21578

	IFSC-FCBF			DFS			IG			FCBF			SVM
Category	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
Crude	0.61	0.62	0.62	0.58	0.57	0.57	0.60	0.55	0.58	0.75	0.66	0.70	0.55	0.75	0.64
Grain	0.65	0.45	0.53	0.70	0.43	0.53	0.62	0.49	0.55	0.25	0.26	0.26	0.49	0.58	0.53
Interest	1.00	0.96	0.98	1.00	0.94	0.97	1.00	0.95	0.97	0.97	0.83	0.89	0.55	0.43	0.48
Money-fx	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	1.00	0.84	0.90	0.87	0.70	0.54	0.61
Corn	0.59	0.80	0.68	0.60	0.87	0.71	0.59	0.75	0.66	0.48	0.67	0.56	0.97	0.97	0.97
Earn	0.60	0.58	0.59	0.56	0.57	0.56	0.54	0.60	0.57	0.56	0.45	0.50	0.96	0.92	0.97
Average Score	0.76	0.76	0.76	0.76	0.75	0.75	0.75	0.74	0.74	0.67	0.65	0.66	0.71	0.69	0.69

By comparing the data in Tables 6 and 7, we find that in most categories IFSC-FCBF obtains a higher F1 value than the other algorithms. These categories are education and sports in the Fudan Corpus and health, education and tourism in the Sogou Corpus. However, the average values obtained by the IFSC-FCBF algorithm are roughly the same as those of the DFS algorithm, so it is hard to say which one is better, especially since a similar situation occurs in Fig. 2, where the results of the IFSC-FCBF and DFS algorithms are very close on the 20 Newsgroups data set.

Table 6

Comparison of each category on Fudan Corpus

	IFSC-FCBF			DFS			IG			FCBF			SVM
Category	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
Environment	0.95	0.77	0.85	0.93	0.82	0.87	0.94	0.84	0.89	0.68	0.18	0.29	0.95	0.67	0.79
Traffic	0.68	0.91	0.78	0.91	0.79	0.85	0.98	0.58	0.73	0.83	0.47	0.60	0.52	0.87	0.65
Education	0.95	0.85	0.90	0.91	0.87	0.89	0.95	0.84	0.89	0.61	0.72	0.66	1.00	0.74	0.85
Military	0.96	0.88	0.92	0.70	1.00	0.82	0.99	0.88	0.93	0.61	0.72	0.66	1.00	0.80	0.89
Economy	0.81	0.85	0.83	0.85	0.90	0.88	0.88	0.88	0.88	0.96	0.59	0.73	0.96	0.63	0.76
Sport	0.81	0.82	0.81	0.87	0.72	0.79	0.48	0.91	0.63	0.83	0.32	0.46	0.71	0.96	0.81
Average Score	0.86	0.85	0.85	0.86	0.85	0.85	0.88	0.82	0.83	0.24	0.84	0.37	0.85	0.78	0.79

Table 7

Comparison of each category on Sogou Corpus

	IFSC-FCBF			DFS			IG			FCBF			SVM
Category	P	R	F1	P	R	F1	P	R	F1	P	R	F1	P	R	F1
Health	0.92	0.80	0.86	0.86	0.83	0.85	0.81	0.80	0.81	0.85	0.46	0.60	0.97	0.72	0.82
Education	0.92	0.83	0.87	0.91	0.82	0.86	0.91	0.83	0.87	0.96	0.59	0.73	0.88	0.65	0.75
Military	0.50	0.90	0.64	0.60	0.66	0.64	0.52	0.88	0.65	0.26	0.76	0.38	0.36	0.80	0.49
Tourism	0.72	0.64	0.68	0.62	0.66	0.64	0.78	0.59	0.67	0.48	0.57	0.52	0.71	0.61	0.65
Sport	0.90	0.83	0.86	0.89	0.85	0.87	0.90	0.84	0.87	0.92	0.51	0.66	0.86	0.69	0.77
Culture	0.84	0.60	0.70	0.70	0.70	0.70	0.79	0.63	0.76	0.78	0.32	0.46	0.93	0.46	0.61
Average Score	0.81	0.76	0.77	0.77	0.76	0.76	0.79	0.76	0.76	0.72	0.53	0.56	0.78	0.66	0.68

Therefore, for further comparison, we analyze the running time of the algorithms as follows: first, 300 features are selected randomly; then the experiments are repeated ten times and the average running time is calculated. The approximate number of features in each data set is shown in Table 8, and the comparison of running times is shown in Fig. 6.

Table 8

Approximate feature number in four data sets

Datasets:	Reuters-21578	20Newsgroup	Fudan Corpus	Sogou Corpus
Feature number	5,000	13,000	12,000	18,000

Fig. 6

Timing analysis in four datasets.

Table 8 shows that different data sets contain different numbers of feature words. From Fig. 6, we can see that the running time increases with the number of features. FCBF has the shortest running time because it eliminates too many features, but its accuracy is unsatisfactory. The running times of IG, DFS and SVM are similar, while IFSC-FCBF needs only about half of their running time and achieves higher accuracy. Therefore, we believe that IFSC-FCBF performs better for text classification than the other methods.

5 Conclusion

In this paper, we propose a new feature selection approach for text classification named improved feature size customized fast correlation-based filter (IFSC-FCBF). It not only takes the elimination of redundant features into consideration, but also makes a trade-off between accuracy and complexity. The results show that IFSC-FCBF can select more effective features in a relatively short running time, which effectively improves the performance of text classification.

For further research, apart from the improvement of existing algorithm, we will extend the feature selection and use bigram model or N-gram model for validation. The effective approach proposed will be combined with the Naive Bayes weighting algorithm so that we could get a more effective method for text classification.

Acknowledgments

This research was supported by National Natural Science Foundation of China (No. 61302155, No. 61274080, No. 61871234); This work was also supported by National Natural Science Fund Incubation Project of China (NY214052).

References

1. Shang, Li, Feng, et al., Feature selection via maximizing global information gain for text classification[J], Knowledge-Based Systems 54 (2013), 298–309.
2. Salton, Wong and Yang C.S., A vector space model for automatic indexing[J], Communications of the ACM 18(11) (1975), 613–620.
3. Yu and Liu, Efficient feature selection via analysis of relevance and redundancy[J], Journal of Machine Learning Research 5 (2004), 1205–1224.
4. Kohavi and John G.H., Wrappers for feature subset selection[J], Artificial Intelligence 97(1-2) (1997), 273–324.
5. John G.H., Kohavi and Pfleger, Irrelevant features and the subset selection problem[M]//Machine Learning Proceedings 1994. Morgan Kaufmann, (1994), 121–129.
6. Huang, Cai and Xu, A hybrid genetic algorithm for feature selection wrapper based on mutual information[J], Pattern Recognition Letters 28(13) (2007), 1825–1844.
7. Azam and Yao J.T., Comparison of term frequency and document frequency based feature selection metrics in text categorization[J], Expert Systems with Applications 39(5) (2012), 4760–4768.
8. Forman, An extensive empirical study of feature selection metrics for text classification[J], Journal of Machine Learning Research 3 (2003), 1289–1305.
9. Shang, Huang, Zhu, et al., A novel feature selection algorithm for text categorization[J], Expert Systems with Applications 33(1) (2007), 1–5.
10. Lee and Lee G.G., Information gain and divergence-based feature selection for machine learning-based text categorization[J], Information Processing & Management 42(1) (2006), 155–165.
11. Uysal A.K. and Gunal, A novel probabilistic feature selection method for text classification[J], Knowledge-Based Systems 36 (2012), 226–235.
12. Yu and Liu, Feature selection for high-dimensional data: A fast correlation-based filter solution[C], Proceedings of the 20th International Conference on Machine Learning (ICML-03) (2003), 856–863.
13. Liu, Zhang and Ma, A fault diagnosis approach for diesel engines based on self-adaptive WVD, improved FCBF and PECOC-RVM[J], Neurocomputing 177 (2016), 600–611.
14. Chen J.J., Song and Zhang, A novel hybrid gene selection approach based on ReliefF and FCBF[J], International Journal of Digital Content Technology and Its Applications 5(10) (2011), 404–411.
15. Gharavian, Bejani and Sheikhan, Audio-visual emotion recognition using FCBF feature selection method and particle swarm optimization for fuzzy ARTMAP neural networks[J], Multimedia Tools and Applications 76(2) (2017), 2331–2352.
16. Balakrishnan and Narayanaswamy, Feature selection using fcbf in type ii diabetes databases[J], International Journal of the Computer, the Internet and the Management 17(1) (2009), 50–58.
17. Şen and Peker, Novel approaches for automated epileptic diagnosis using FCBF selection and classification algorithms[J], Turkish Journal of Electrical Engineering & Computer Sciences 21(Sup. 1) (2013), 2092–2109.
18. Zhang, Jiang, Li, et al., Two feature weighting approaches for naive Bayes text classifiers[J], Knowledge-Based Systems 100 (2016), 137–144.
19. Ponte J.M. and Croft W.B., A language modeling approach to information retrieval[C], ACM SIGIR Forum, ACM 51(2) (2017), 202–208.
20. McCallum and Nigam, A comparison of event models for naive bayes text classification[C], AAAI-98 Workshop on Learning for Text Categorization 752(1) (1998), 41–48.
21. Wang, Jiang, Li, A CFS-based feature weighting approach to naive bayes text classifiers[C], International Conference on Artificial Neural Networks, Springer, Cham, (2014), 555–562.
22. Senliol, Gulgezen, Yu, et al., Fast Correlation Based Filter (FCBF) with a different search strategy[C], 2008 23rd international symposium on computer and information sciences, IEEE (2008), 1–4.
23. Liu and Yu, Toward integrating feature selection algorithms for classification and clustering[J], IEEE Transactions on Knowledge & Data Engineering 2005(4), 491–502.
24. Bidi, Elberrichi, Feature selection for text classification using genetic algorithms[C], 2016 8th International Conference on Modelling, Identification and Control (ICMIC), IEEE (2016), 806–810.
25. Jiang, Liang, Feng, et al., Text classification based on deep belief network and softmax regression[J], Neural Computing and Applications 29(1) (2018), 61–70.
26. Uysal A.K., An improved global feature selection scheme for text classification[J], Expert Systems with Applications 43 (2016), 82–92.
27. Chen, Huang, Tian, et al., Feature selection for text classification with Naïve Bayes[J], Expert Systems with Applications 36(3) (2009), 5432–5435.
28. Conneau, Schwenk, Barrault, et al., Very deep convolutional networks for text classification[J], arXiv preprint arXiv:1606.01781, 2016.
29. Jiang, Wang, Han, et al., Deep feature weighting in Naive Bayes for Chinese text classification[C], 2016 4th International Conference on Cloud Computing and Intelligence Systems (CCIS), IEEE (2016), 160–164.
30. Vora, Yang, A comprehensive study of eleven feature selection algorithms and their impact on text classification[C], 2017 Computing Conference, IEEE (2017), 440–449.
31. Yan, Li and Meng, Emotion recognition based on sparse learning feature selection method for social communication[J], Signal, Image and Video Processing (2019), 1–5.
32. Özseven, A novel feature selection method for speech emotion recognition[J], Applied Acoustics 146 (2019), 320–326.
33. Lee and Lee G.G., Information gain and divergence-based feature selection for machine learning-based text categorization[J], Information Processing & Management 42(1) (2006), 155–165.