Abstract
Text classification is the task of assigning predefined categories to natural language documents, and it can provide conceptual views of document collections. The Naïve Bayes (NB) classifier is a family of simple probabilistic classifiers based on the common assumption that all features are independent of each other given the category variable, and it is often used as a baseline in text classification. However, classical NB classifiers with multinomial, Bernoulli and Gaussian event models are not fully Bayesian. This study proposes three Bayesian counterparts; it turns out that the classical NB classifier with the Bernoulli event model is equivalent to its Bayesian counterpart. Experimental results on the 20 newsgroups and WebKB data sets show that the Bayesian NB classifier with the multinomial event model performs similarly to its classical counterpart, whereas the Bayesian NB classifier with the Gaussian event model clearly outperforms its classical counterpart.
1. Introduction
Text classification [1] is the task of assigning one or more predefined categories to natural language documents. Instead of classifying documents manually or hand-crafting automatic classification rules, many machine learning algorithms are used to classify unseen documents automatically on the basis of human-labelled training documents. Given the growing volume of online documents available through the World Wide Web (WWW), news feeds, electronic mail and digital libraries, this task is of great practical significance.
The Naïve Bayes (NB) classifier is a family of simple probabilistic classifiers based on a common assumption that all features are independent of each other, given the category variable [2]. Different NB classifiers differ mainly by the assumptions they make regarding the distribution of features. The assumptions on distribution of features are called event models of the NB classifier [3]. For discrete features, multinomial or Bernoulli distributions are popular. These assumptions lead to two distinct models, which are often confused [4,5]. When dealing with continuous features, a typical assumption is Gaussian distribution.
Despite its apparently over-simplified assumptions, the NB classifier works quite well in many complex real-world applications, such as text classification [6,7], keyphrase extraction [8] and medical diagnosis [9]. Zhang explains this paradox by arguing that the true reason for its competitive classification performance lies in the dependence distribution [10]. In more detail, how the local dependence of a feature distributes in each category (evenly or unevenly) and how the local dependencies of all features work together (consistently, supporting a certain category, or inconsistently, cancelling each other out) play a crucial role.
As one of the most efficient inductive learning algorithms, the NB classifier is often used as a baseline in text classification because it is fast and easy to implement. Moreover, with appropriate pre-processing, it is competitive with more advanced methods, including support vector machines (SVMs) [4]. However, the classical NB classifier, as standardly presented, is not fully Bayesian, at least not in the sense that a posterior distribution over parameters is estimated from training documents and then used for predictive inference on a new document. Inspired by the success of Bayesian counterparts of many classical methods, this work studies Bayesian NB classifiers for text classification, with the following contributions:
Bayesian NB classifiers with multinomial, Bernoulli and Gaussian event models are proposed, and it turns out that the classical NB classifier with the Bernoulli event model is equivalent to its Bayesian counterpart.
Experimental results show that the performance of the Bayesian NB classifier with the multinomial event model is similar to that of its classical counterpart, whereas the Bayesian NB classifier with the Gaussian event model clearly outperforms its classical counterpart.
The rest of this article is organised as follows. Section 2 gives an overview of related work on the classical NB classifier and Bayesian methods. After the classical NB classifier is briefly described in section 3, fully Bayesian NB classifiers are proposed in section 4. Section 5 reports experimental results on the 20 newsgroups and WebKB data sets, comparing the Bayesian NB classifiers with their classical counterparts, and section 6 concludes this work.
2. Related work
In practice, the conditional independence assumption in the NB classifier is rarely true, and as a result, its probability estimates are often suboptimal. To reduce the inaccuracies caused by the naïve assumption, many approaches have been proposed in the literature. These methods can be grouped into two categories. The first category comprises semi-NB methods [11,12], which aim to enhance NB’s accuracy by relaxing the conditional independence assumption. The second category includes feature weighting methods [13], although feature weighting has primarily been viewed as a means of increasing the influence of highly predictive features and discounting features that have little predictive value [14,15].
Compared with classical methods, Bayesian methods [16] provide a natural and principled way of combining prior information with data, within a solid decision theoretical framework. One can incorporate past information about a parameter and form a prior distribution for future analysis. When new observations become available, the previous posterior distribution can be used as a prior. All inferences logically follow from Bayes’ theorem. Therefore, many classical approaches are reformulated within a Bayesian framework, such as hidden Markov model (HMM) [17], principal component analysis (PCA) [18], SVM [19], multidimensional scaling (MDS) [20] and many others.
However, the classical NB classifier has received little Bayesian treatment. To the best of the author’s knowledge, only Rennie described a Bayesian NB classifier, in his master’s thesis, and he found that it performed worse than the classical NB classifier [21]. Rennie’s thesis considered only the multinomial event model and did not address the Bernoulli and Gaussian event models. Furthermore, the Dirichlet hyper-parameters were not tuned, which resulted in worse performance. In order to quantify the trade-off between various classification decisions and predict the risk that accompanies such decisions, Di Nunzio put the classification decision of NB classifiers with multinomial, Gaussian, Bernoulli and Poisson event models within the framework of Bayesian decision theory [22], but the resulting classifier is still not fully Bayesian. It is worth noting that cost-sensitive NB classifiers are also applicable to Bayesian NB classifiers.
3. Classical NB classifier
In the NB classifier, every feature

Figure 1. Decision-making procedure with the Naïve Bayes classifier.
As a matter of fact, NB classifier can be viewed as a generative process. To generate a document, NB classifier first chooses a category for it, and then, it generates each of the document’s features (such as words) independently according to a category-specific distribution. Figure 2 illustrates the generative process. In this figure, an arrow indicates a conditional dependency between variables.

Figure 2. Bayesian network graph illustrating the generative process for the Naïve Bayes classifier. In this figure, an arrow indicates a conditional dependency between variables.
Given a training document set
Setting
3.1. Multinomial event model
With a multinomial event model, each document is represented by the set of word occurrences from the document. That is to say, the order of words is not captured, which yields the familiar bag-of-words representation for documents. Each document can also be viewed as a histogram, with each element counting the number of occurrences of the corresponding word in the document. Following the model, words for each category
Similar to
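As a rough illustration of the multinomial event model, the following sketch estimates category priors and smoothed class-conditional word probabilities from tokenised documents and assigns the category with the largest log-posterior score. The function names, the toy corpus and the additive smoothing constant `beta` are assumptions made for this illustration and are not taken from the article's equations.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, beta=1.0):
    """Return (log_prior, log_word_prob) estimated from tokenised documents.

    `beta` is an additive smoothing constant (an assumption of this sketch).
    """
    vocab = sorted({w for doc in docs for w in doc})
    doc_counts = Counter(labels)          # category -> number of documents
    word_counts = defaultdict(Counter)    # category -> word -> occurrences
    for doc, c in zip(docs, labels):
        word_counts[c].update(doc)

    log_prior = {c: math.log(n / len(docs)) for c, n in doc_counts.items()}
    log_word_prob = {}
    for c in doc_counts:
        denom = sum(word_counts[c].values()) + beta * len(vocab)
        log_word_prob[c] = {
            w: math.log((word_counts[c][w] + beta) / denom) for w in vocab
        }
    return log_prior, log_word_prob


def predict_multinomial_nb(doc, log_prior, log_word_prob):
    """Pick the category with the largest log-posterior score.

    Words outside the training vocabulary are ignored.
    """
    scores = {}
    for c, lp in log_prior.items():
        scores[c] = lp + sum(
            log_word_prob[c][w] for w in doc if w in log_word_prob[c]
        )
    return max(scores, key=scores.get)


# Hypothetical toy corpus, for illustration only.
DOCS = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["cpu", "code", "bug"],
    ["code", "compile", "bug"],
]
LABELS = ["sport", "sport", "tech", "tech"]
LOG_PRIOR, LOG_WORD_PROB = train_multinomial_nb(DOCS, LABELS)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, which matters for long documents.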
3.2. Bernoulli event model
In the Bernoulli event model, each document is represented by a vector
This event model is especially popular for classifying short texts [5]. It has the benefit of explicitly modelling the absence of words. Note that an NB classifier with a Bernoulli event model is not the same as a multinomial NB classifier with frequency counts truncated to one. This study estimates each of these class-conditional word probabilities
Here,
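To make the role of absent words concrete, the sketch below scores a document under a Bernoulli event model: every vocabulary word contributes to the score, either through its presence probability or through the complement. The smoothed estimate `(count + beta) / (n_docs + 2 * beta)` and all names are assumptions of this illustration rather than a reproduction of the article's equations.

```python
import math


def train_bernoulli_nb(docs, labels, vocab, beta=1.0):
    """Estimate log-priors and smoothed word-presence probabilities."""
    categories = sorted(set(labels))
    log_prior, presence_prob = {}, {}
    for c in categories:
        cat_docs = [set(d) for d, y in zip(docs, labels) if y == c]
        log_prior[c] = math.log(len(cat_docs) / len(docs))
        presence_prob[c] = {
            w: (sum(w in d for d in cat_docs) + beta) / (len(cat_docs) + 2 * beta)
            for w in vocab
        }
    return log_prior, presence_prob


def score_bernoulli_nb(doc, c, log_prior, presence_prob):
    """Log-posterior score (up to a constant) of category c for a document."""
    present = set(doc)
    score = log_prior[c]
    for w, p in presence_prob[c].items():
        # Absent words contribute log(1 - p); this is what distinguishes the
        # Bernoulli model from a multinomial model with truncated counts.
        score += math.log(p) if w in present else math.log(1.0 - p)
    return score


def predict_bernoulli_nb(doc, log_prior, presence_prob):
    return max(
        log_prior,
        key=lambda c: score_bernoulli_nb(doc, c, log_prior, presence_prob),
    )


# Hypothetical toy corpus, for illustration only.
DOCS = [
    ["ball", "game", "team"],
    ["game", "score", "team"],
    ["cpu", "code", "bug"],
    ["code", "compile", "bug"],
]
LABELS = ["sport", "sport", "tech", "tech"]
VOCAB = sorted({w for d in DOCS for w in d})
LOG_PRIOR, PRESENCE_PROB = train_bernoulli_nb(DOCS, LABELS, VOCAB)
```

Even an empty document receives a category-dependent score here, because the absence of every vocabulary word is itself treated as evidence.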
3.3. Gaussian event model
In text classification, documents are very commonly represented as term frequency/inverse document frequency (TF × IDF) vectors. The TF × IDF value increases proportionally to the number of times a word appears in the document (i.e. TF) but is offset by the frequency of the word in the corpus (i.e. document frequency, DF), which helps to adjust for the fact that some words appear more frequently in general. When dealing with continuous data, such as TF × IDF vectors, a typical assumption is that the continuous values associated with each class are distributed according to a Gaussian distribution. Another common technique is to use binning [24,25] to discretise the feature values and obtain a new set of Bernoulli-distributed features. However, discretisation may throw away some discriminative information [26].
According to the model, feature values of terms for each category
Again, ML can be used to estimate
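A minimal sketch of a Gaussian event model is given below, assuming maximum-likelihood estimates of a per-category mean and variance for each feature. The small `var_floor` added to every variance is an assumption of this sketch (it guards against zero variance on tiny samples) and is not part of the article; the function names and toy data are likewise hypothetical.

```python
import math


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Return {category: (log_prior, means, variances)} by maximum likelihood."""
    params = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + var_floor
            for col, m in zip(columns, means)
        ]
        params[c] = (math.log(n / len(X)), means, variances)
    return params


def log_posterior(x, category_params):
    """Log-prior plus the sum of per-feature Gaussian log-densities."""
    log_prior, means, variances = category_params
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * s2) - (v - m) ** 2 / (2.0 * s2)
        for v, m, s2 in zip(x, means, variances)
    )
    return log_prior + log_lik


def predict_gaussian_nb(x, params):
    return max(params, key=lambda c: log_posterior(x, params[c]))


# Hypothetical two-feature toy data, for illustration only.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
Y = ["a", "a", "b", "b"]
PARAMS = fit_gaussian_nb(X, Y)
```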
4. Bayesian multinomial NB classifier
The NB classifier, as standardly presented in section 3, is not fully Bayesian, at least not in the sense that a posterior distribution over parameters is estimated from training documents and then used for predictive inference on a new document. This section describes a fully Bayesian NB classifier in more detail. The graphical model representation for the Bayesian NB classifier is shown in Figure 3. The Bayesian NB classifier can be viewed as a generative process, which can be described as follows:
1. Draw a multinomial
2a. For each category,
2a.1. Draw a multinomial
2b. For each category,
2b.1. For each term,
2b.1.1. Draw a Bernoulli
2c. For each category,
2c.1. For each term,
2c.1.1. Draw a Gaussian
3. For each document,
3.1. Draw a category
3.2a. For each word,
3.2a.1. Draw a word
3.2b. For each word,
3.2b.1. Draw a Boolean variable x from Bernoulli
3.2b.2. If x is true, append the word v to document m; discard the word v otherwise;
3.2c. For each word
3.2c.1. Draw a term feature value from Gaussian

Figure 3. The graphical model representation for the Bayesian Naïve Bayes classifier: (a) multinomial event model, (b) Bernoulli event model and (c) Gaussian event model. In this figure, circle and double-circle variables indicate observed and latent variables, respectively. An arrow indicates a conditional dependency between variables, and stacked panes indicate a repeated sampling with the iteration number shown.
It is worth noting that one can generate the resulting documents from the above procedure for multinomial and Bernoulli event models, but one can only generate the resulting feature vector representations for Gaussian event model.
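The multinomial branch of the generative process above (steps 1, 2a and 3.2a) can be sketched as follows. The sketch assumes symmetric Dirichlet priors with hyper-parameters `alpha` and `beta` and a fixed document length; the fixed length, the function names and the toy arguments are assumptions made for illustration. Dirichlet draws are obtained by normalising independent Gamma variates.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from a Dirichlet distribution via normalised Gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, categories, vocab,
                    alpha=1.0, beta=1.0, seed=0):
    """Generate (category, words) pairs under the multinomial event model."""
    rng = random.Random(seed)
    # Step 1: draw the category proportions.
    pi = sample_dirichlet([alpha] * len(categories), rng)
    # Step 2a: draw one word distribution per category.
    phi = {c: sample_dirichlet([beta] * len(vocab), rng) for c in categories}
    corpus = []
    for _ in range(n_docs):
        # Step 3.1: draw a category for the document.
        c = rng.choices(categories, weights=pi)[0]
        # Step 3.2a: draw each word independently from that category.
        words = rng.choices(vocab, weights=phi[c], k=doc_len)
        corpus.append((c, words))
    return pi, phi, corpus
```

For example, `generate_corpus(5, 4, ["a", "b"], ["x", "y", "z"], seed=1)` returns the sampled category proportions, the per-category word distributions and five four-word documents.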
For convenience, let
4.1. Parameter estimation
Given a training document set
with
However, similar to NB classifier, in order to estimate
1. Multinomial event model
with
It is easy to see that equation (4) is equivalent to equation (13) when
2. Bernoulli event model
Following the mode of Beta distribution, MAP parameter estimates for
Again, equation (6) is equivalent to equation (15) when
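The equivalence for the Bernoulli event model can be checked numerically. The sketch below assumes that the classical estimate has the additive-smoothing form (n₁ + β)/(n + 2β), where n₁ of the n training documents in a category contain the word; this form is an assumption of the illustration. The MAP estimate is the mode of the Beta(a + n₁, b + n − n₁) posterior, and setting a = b = β + 1 makes the two coincide.

```python
import math


def beta_map_estimate(n_present, n_docs, a, b):
    """Mode of the Beta(a + n_present, b + n_docs - n_present) posterior.

    The mode formula is valid when both posterior parameters exceed 1.
    """
    return (n_present + a - 1.0) / (n_docs + a + b - 2.0)


def smoothed_estimate(n_present, n_docs, beta):
    """Additively smoothed estimate (assumed form of the classical estimator)."""
    return (n_present + beta) / (n_docs + 2.0 * beta)


def estimates_agree(n_present, n_docs, beta):
    """Check that the MAP estimate with a = b = beta + 1 matches smoothing."""
    return math.isclose(
        beta_map_estimate(n_present, n_docs, beta + 1.0, beta + 1.0),
        smoothed_estimate(n_present, n_docs, beta),
    )
```

Substituting a = b = β + 1 into the mode gives (n₁ + β)/(n + 2β) directly, so the agreement holds for every count and every β > 0.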
3. Gaussian event model
where
with the mean
4.2. Decision-making procedure
In order to assign a category c to a given document
where Γ(·) and I(·) are the Gamma function and the indicator function, respectively. Again, an event model should be assumed in order to calculate the second term in equation (19):
Multinomial event model
Bernoulli event model
where
Gaussian event model
where
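For the multinomial event model, integrating the word distribution out under a Dirichlet posterior leads to a Dirichlet-multinomial (Pólya) predictive, which is where Gamma functions enter. The sketch below is one standard form of that predictive, assuming a symmetric Dirichlet prior with hyper-parameter `beta`; it omits the multinomial coefficient, which does not depend on the category, and it may differ in detail from the article's own equations. All names are hypothetical.

```python
import math


def log_predictive_multinomial(doc_counts, cat_counts, beta):
    """log p(document | category, training data), up to a constant.

    doc_counts[v] is the count of word v in the new document, and
    cat_counts[v] is the count of word v in the category's training documents.
    """
    posterior = [n + beta for n in cat_counts]   # Dirichlet posterior parameters
    total_posterior = sum(posterior)
    total_doc = sum(doc_counts)
    value = math.lgamma(total_posterior) - math.lgamma(total_posterior + total_doc)
    for a, x in zip(posterior, doc_counts):
        if x:
            value += math.lgamma(a + x) - math.lgamma(a)
    return value
```

For a document consisting of a single word v, this predictive reduces to (n_cv + β)/(N_c + Vβ), the familiar smoothed estimate; the two approaches differ only when a document contains several words, because the Bayesian predictive does not treat repeated draws as independent.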
5. Experiments and discussions
In this study, two benchmark data sets, 20 newsgroups and WebKB, are used to evaluate performance. The 20 newsgroups data set, collected and originally used for text classification by Lang [29], contains 18,821 non-empty documents distributed evenly across 20 categories, each representing a newsgroup. WebKB contains webpages collected from computer science departments of various universities by the World Wide Knowledge Base (Web→KB) project of the CMU text learning group. As in Nigam et al. [30], the categories ‘Department’ and ‘Staff’ were discarded because there were only a few pages from each university. The category ‘Other’ was also discarded because its pages varied widely in content. After these operations, 4199 webpages remain. The same pre-processing and splitting as in McCallum and Nigam [4], Rennie [21] and Cardoso-Cachopo [31] are applied to these two data sets. The final vocabulary sizes for 20 newsgroups and WebKB are 70,216 and 7770, respectively. Please refer to Cardoso-Cachopo [31] for more details.
In order to generate continuous feature vector representations for the Gaussian event model, a variant of the TF × IDF transformation is applied as follows, and each document vector is normalised to unit length [32]
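Since the exact transformation is not reproduced here, the sketch below shows one common TF × IDF variant followed by unit-length (L2) normalisation. The weighting log(1 + TF) × log(N / DF) and the function name are assumptions chosen for illustration and may differ from the transformation actually used in the experiments.

```python
import math
from collections import Counter


def tfidf_unit_vectors(docs):
    """Map tokenised documents to L2-normalised sparse TF x IDF vectors."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {
            w: math.log(1.0 + f) * math.log(n_docs / df[w])
            for w, f in tf.items()
        }
        norm = math.sqrt(sum(v * v for v in vec.values()))
        if norm > 0.0:
            vec = {w: v / norm for w, v in vec.items()}
        vectors.append(vec)
    return vectors
```

With this variant, a word that occurs in every document receives a weight of zero, so it carries no information about the category.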
To evaluate the performance of resulting classifiers, three standard measures for binary classification, precision, recall and Fρ score, are utilised in this study. Precision, recall and Fρ score (ρ = 1 in this study) are defined formally as follows
Here, TP (true positive) is the number of the correct positive predictions, FP (false positive) is the number of incorrect positive predictions and FN (false negative) is the number of incorrect negative predictions.
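These measures can be computed directly from the three counts. The sketch below uses the standard definitions of precision, recall and the weighted harmonic mean Fρ, and adds a macro-average that takes the unweighted mean of per-category F1 scores; returning zero for empty denominators is a convention assumed here, and the function names are hypothetical.

```python
def precision_recall_f(tp, fp, fn, rho=1.0):
    """Return (precision, recall, F_rho) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = rho ** 2 * precision + recall
    f_score = (1.0 + rho ** 2) * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


def macro_f1(per_category_counts):
    """Unweighted mean of per-category F1; each item is a (tp, fp, fn) tuple."""
    scores = [precision_recall_f(tp, fp, fn)[2] for tp, fp, fn in per_category_counts]
    return sum(scores) / len(scores)
```

With ρ = 1, the F score simplifies to 2TP / (2TP + FP + FN), which gives a quick sanity check on any implementation.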
In the classical NB classifier, α is fixed to 1, and β is tuned for the multinomial and Bernoulli event models. No parameters need to be tuned for the Gaussian event model. For simplicity, symmetric Dirichlet priors are used in the Bayesian NB classifier, where

Figure 4. Macro-average F1 score under 10-fold cross-validation as a function of log2β on the 20 newsgroups data set: (a) multinomial event model (classical NB classifier), (c) multinomial event model (Bayesian NB classifier) and (d) Gaussian event model (Bayesian NB classifier). Since the Bayesian NB classifier with the Bernoulli event model is equivalent to its classical counterpart (a = b = β + 1), panel (b), the Bernoulli event model, applies to both the classical and Bayesian NB classifiers.

Figure 5. Macro-average F1 score under 10-fold cross-validation as a function of log2β on the WebKB data set: (a) multinomial event model (classical NB classifier), (c) multinomial event model (Bayesian NB classifier) and (d) Gaussian event model (Bayesian NB classifier). Since the Bayesian NB classifier with the Bernoulli event model is equivalent to its classical counterpart (a = b = β + 1), panel (b), the Bernoulli event model, applies to both the classical and Bayesian NB classifiers.
With the tuned parameters in Figures 4 and 5, the experimental results on test data are reported in Tables 1 and 2 in terms of precision, recall and F1 score. Table 3 reports two-tailed significance at the 95% confidence level from a paired-samples t-test [34]. From Tables 1 and 2, one can see that the performance of the Bayesian NB classifier with the multinomial event model is similar to that of its classical counterpart, whereas the Bayesian NB classifier with the Gaussian event model clearly outperforms its classical counterpart. Table 3 also shows that there is no statistically significant difference between the Bayesian and classical NB classifiers with the multinomial event model, but for the Gaussian event model, the difference between the Bayesian and classical NB classifiers is statistically significant, especially on the WebKB data set. This observation is not consistent with that of Rennie [21], who found that the Bayesian NB classifier performed worse than the classical one. Moreover, the NB classifier with the multinomial event model outperforms that with the Bernoulli event model, which in turn outperforms that with the Gaussian event model.
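The test statistic behind a paired-samples t-test can be sketched as follows. The sketch assumes that the paired observations are scores of two classifiers on the same items (for instance, per-category F1 scores); how the pairs were actually formed in the experiments is not stated here, so this pairing and the function name are assumptions.

```python
import math
import statistics


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for paired samples of equal length."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long samples with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)          # sample standard deviation
    if sd == 0.0:
        raise ValueError("the paired differences have zero variance")
    t = statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1
```

The two-tailed p-value then follows from the Student t distribution with the returned degrees of freedom; `scipy.stats.ttest_rel` performs both steps when SciPy is available.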
Table 1. Experimental results on 20 newsgroups data set in terms of precision (%), recall (%) and F1 score (%)
To make it clear, category names corresponding to the first column are listed as follows: 1: alt.atheism; 2: comp.graphics; 3: comp.os.ms-windows.misc; 4: comp.sys.ibm.pc.hardware; 5: comp.sys.mac.hardware; 6: comp.windows.x; 7: misc.forsale; 8: rec.autos; 9: rec.motorcycles; 10: rec.sport.baseball; 11: rec.sport.hockey; 12: sci.crypt; 13: sci.electronics; 14: sci.med; 15: sci.space; 16: soc.religion.christian; 17: talk.politics.guns; 18: talk.politics.mideast; 19: talk.politics.misc; and 20: talk.religion.misc.
Table 2. Experimental results on WebKB data set in terms of precision (%), recall (%) and F1 score (%)
To make it clear, category names corresponding to the first column are listed as follows: 1: student; 2: course; 3: faculty; and 4: project.
Table 3. Two-tailed statistical significance with 95% confidence interval by paired-samples t-test
6. Conclusion
Text classification is a supporting technology in several information processing tasks, including controlled vocabulary indexing, content filtering (spam, pornography etc.), information security and others. Instead of manually classifying documents, many machine learning algorithms are trained to automatically classify documents based on annotated training documents. The NB classifier is often used as the baseline in text classification. However, classical NB classifiers with multinomial, Bernoulli and Gaussian event models are not fully Bayesian.
Inspired by the success of Bayesian counterparts of many classical methods, such as HMM, PCA, SVM and MDS, this study proposes three Bayesian counterpart classifiers, and it turns out that the classical NB classifier with the Bernoulli event model is equivalent to its Bayesian counterpart. The approach can also be generalised to construct alternative NB classifiers with exponential family [35] event models. Finally, experimental results on the 20 newsgroups and WebKB data sets show that the Bayesian NB classifier with the multinomial event model performs similarly to its classical counterpart, whereas the Bayesian NB classifier with the Gaussian event model clearly outperforms its classical counterpart. Moreover, the NB classifier with the multinomial event model outperforms that with the Bernoulli event model, which in turn outperforms that with the Gaussian event model.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
This research received financial support from the National Science Foundation of China (ID: 71403255) and the Key Technologies R&D Program of the Chinese 12th Five-Year Plan (2011–2015) (ID: 2015BAH25F01).
