Performance of LDA and DCT models

Abstract

The Doubly Correlated Topic Model is a generative probabilistic topic model for automatically identifying topics in a corpus of text documents. It is a mixed membership model, based on the fact that a document exhibits a number of topics. We use word co-occurrence statistics to identify an initial set of topics as posterior information for the model. Exact posterior inference in existing models is intractable, so they can provide only approximate solutions. Treating co-occurring words as initial topics provides a tighter bound on topic coherence. The proposed model is motivated by the Latent Dirichlet Allocation Model. The Doubly Correlated Topic Model differs from the Latent Dirichlet Allocation Model in its posterior inference: it uses the highest-ranked co-occurring words as initial topics rather than drawing them from Dirichlet priors. The results suggest some improvement in entropy and topic coherence across different datasets.

1. Introduction

With the huge amount of information available online and its wide use by individuals and communities, more sophisticated methods are required to identify what a user needs. This has created a need to organize documents automatically and meaningfully, so that the information in a collection can be understood, searched and visualized. Categorizing documents according to their topics can improve the precision of search results.

Topic models are well-accepted approaches to analysing texts [1]. Topic modelling helps to retrieve more relevant results for a user’s search from the enormous amount of information available. The basic idea behind the topic model is twofold:

To identify the hidden topical patterns that exist in the corpus of documents. A document can exhibit multiple topics; for example, a document on ‘Zoology’ may also contain, to some extent, information about ‘Biology’, ‘Charles’, ‘Darwin’, ‘Animals’, ‘Evolution’, etc. Under the definition of topic models, such a document is understood as a mixture of these topics.

To organize the documents according to the topics identified by the topic model.

Topic models are generative models for collections of documents [2], in which each document is considered as a random mixture of corpus-wide topics and each word is drawn from those topics. They examine the document and infer the underlying topical pattern.

Topic models operate on three matrices: a non-negative topic matrix T of dimension 1×k, a non-negative vocabulary matrix V of dimension v×1, and a matrix P of dimension v×k. Every entry of P is a probability computed from the N words of the corpus that belong to the vocabulary V. Topic models consist of two steps: inference and parameter estimation. In the inference step, the topic matrix is populated with the topic proportions of the documents using one of various inference algorithms, whereas in the parameter estimation step, matrix P is populated with the term distribution over topics.

Various topic models are available for organizing, summarizing, searching and understanding text. Probabilistic Latent Semantic Analysis (PLSA) [3] and Latent Dirichlet Allocation (LDA) [4] are the most popular [5] and are often used as the basis [1] for a number of topic models.

In this study, we propose an unsupervised probabilistic model that observes concepts from the set of the highest-ranked co-occurring words in the document. These concepts are then assigned to initial topic proportions. The term distribution over these topics is calculated to identify a topical mixture over the document/corpus. The proposed model is a generative probabilistic model for documents, which identifies hidden topical patterns in the document with more approximate inference by identifying correlated terms. Our model is based on LDA [4] but makes fewer assumptions. The hyper-parameters (two smoothing parameters) play a vital role in topic models, and the values assigned to them change the performance of the model. In LDA, the value of α governs the smoothing of θ: a higher value of α results in smoother topic proportions, while α < 1 results in sparse ones. Here, sparseness refers to the number of atoms that tend to have positive or high probability. A higher value of α signifies that the document exhibits a mixture of a larger number of topics, whereas a lower value of α signifies that the document sticks to fewer topics, possibly only one [6]. On the other hand, the value of β signifies the degree of relatedness of words in a topic. If the value of β is higher, more similar words may occur in the topic. We have set these assumptions aside and do not use any hyper-parameters in the proposed model. The proposed model has only one input: a corpus.
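The effect of α in LDA described above can be illustrated with a small simulation. The following is a minimal sketch, assuming a symmetric Dirichlet prior over 10 topics; the function names `sample_dirichlet` and `mean_dominant_share`, the number of topics and the number of trials are illustrative choices and are not part of the proposed model. It reports the average weight of the single largest topic in θ, which should be large for sparse draws (small α) and small for smooth draws (large α).

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw theta from a symmetric Dirichlet(alpha) over k topics.

    Standard construction: normalise k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_dominant_share(alpha, k=10, trials=2000, seed=0):
    """Average weight of the largest topic in theta ~ Dirichlet(alpha)."""
    rng = random.Random(seed)
    shares = (max(sample_dirichlet(alpha, k, rng)) for _ in range(trials))
    return sum(shares) / trials


if __name__ == "__main__":
    # Small alpha: one topic dominates. Large alpha: weight is spread out.
    for alpha in (0.1, 1.0, 10.0):
        print(alpha, round(mean_dominant_share(alpha), 3))
```

With α well below 1, most of the probability mass falls on one or two topics, which matches the remark that a document then sticks to fewer topics.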

The paper is organized as follows. Section 2 includes the related work carried out in this area; in addition, topic modelling with PLSA and LDA is briefly described with some improvements suggested by different authors in the posterior inference for LDA. Section 3 describes the proposed model with its generative process. Lastly, Section 4 describes the evaluation of the proposed model.

2. Related work

This study relates to topic modelling and to inference algorithms.

2.1. Topic modelling with PLSA

PLSA is a statistical model for co-occurrence data [3]. It derives the observed variables with respect to their co-occurrence with latent variables (Figure 1). In PLSA, the likelihood of the document is calculated as

p (d) = \prod_{w} \sum_{t} p (t | d) \, p (w | t)

(1)

where t denotes topics and w indicates words.

Figure 1.

Graphical model representation of PLSA [3].

p(t|d) and p(w|t) can be obtained by applying the expectation maximization algorithm [7] to the likelihood of a document collection; p(t|d) gives the per-document topic proportions and p(w|t) gives the per-topic word probabilities. The words w_i and documents d_n depend conditionally on the latent topics; as a result, word w_i is generated through the latent topics. A problem with this model is that it learns topic mixtures only for the documents on which it is trained [4].
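Once p(t|d) and p(w|t) are available, Equation (1) can be evaluated directly. The following is a minimal sketch, assuming both distributions are stored as plain dictionaries; the function name `plsa_log_likelihood` and the toy probabilities are illustrative, and the logarithm is taken only to avoid numerical underflow on long documents.

```python
import math


def plsa_log_likelihood(doc_words, p_t_given_d, p_w_given_t):
    """Log of Equation (1): sum over words of log sum_t p(t|d) p(w|t).

    doc_words    : list of word tokens in the document
    p_t_given_d  : dict topic -> p(t|d) for this document
    p_w_given_t  : dict topic -> dict word -> p(w|t)
    Assumes every word has non-zero probability under at least one topic.
    """
    log_lik = 0.0
    for w in doc_words:
        mixture = sum(p_t * p_w_given_t[t].get(w, 0.0)
                      for t, p_t in p_t_given_d.items())
        log_lik += math.log(mixture)
    return log_lik


if __name__ == "__main__":
    # Toy, hand-picked probabilities for illustration only.
    p_t_given_d = {"life_science": 0.8, "transport": 0.2}
    p_w_given_t = {
        "life_science": {"zoology": 0.5, "biology": 0.5},
        "transport": {"bike": 1.0},
    }
    print(plsa_log_likelihood(["zoology", "biology", "bike"],
                              p_t_given_d, p_w_given_t))
```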

2.2. Topic modelling with LDA

LDA models each document as a mixture over topics. The documents are distributed over topics, and topics generate words. Each vector of mixture proportions is assumed to be drawn from a Dirichlet distribution. The joint distribution of hidden and observed random variables is given by

\prod_{k = 1}^{K} p (β_{k} | η) \prod_{d = 1}^{D} \left( p (θ_{d} | α) \prod_{n = 1}^{N} p (Z_{d, n} | θ_{d}) \, p (W_{d, n} | Z_{d, n}, β_{1 : K}) \right)

(2)

where α is the Dirichlet parameter, sampled once for each corpus; each β_k is a distribution over terms, and there are K of them, each drawn from a Dirichlet distribution with parameter η; θ_d is the per-document topic proportion, sampled once for each document; W_d,n is the nth word in the dth document; and Z_d,n is the per-word topic assignment, sampled for every word of a document (Figure 2).

Figure 2.

Graphical model representation of LDA [4].
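The generative process behind Equation (2) can be sketched in a few lines. This is an illustrative simulation only, assuming symmetric Dirichlet priors with scalar parameters `alpha` and `eta` and a fixed document length; the function names and the toy vocabulary are hypothetical.

```python
import random


def dirichlet(params, rng):
    """Draw from a Dirichlet by normalising independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(num_docs, doc_len, vocab, num_topics, alpha, eta, seed=0):
    """Simulate the LDA generative process for a small corpus."""
    rng = random.Random(seed)
    # beta_k ~ Dir(eta): one distribution over terms per topic.
    betas = [dirichlet([eta] * len(vocab), rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta_d ~ Dir(alpha): topic proportions, drawn once per document.
        theta = dirichlet([alpha] * num_topics, rng)
        words = []
        for _ in range(doc_len):
            # Z_{d,n} ~ Mult(theta_d): topic assignment for this word.
            z = rng.choices(range(num_topics), weights=theta)[0]
            # W_{d,n} ~ Mult(beta_z): word drawn from the assigned topic.
            words.append(rng.choices(vocab, weights=betas[z])[0])
        corpus.append(words)
    return corpus


if __name__ == "__main__":
    vocab = ["zoology", "biology", "darwin", "bike", "road"]
    for doc in generate_corpus(3, 6, vocab, num_topics=2, alpha=0.5, eta=0.5):
        print(doc)
```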

Almost all the uses of topic models require probabilistic inference [8] to determine the topical patterns that exist in each document. The inference task is to solve [9] per document topic proportions θ, per word topic assignment z, and per corpus topic distribution β. The topic proportions are calculated from the posterior, which allows the inclusion of some a priori confidence on the parameters of a prior distribution. A posterior distribution is the conditional distribution of the hidden variables given the observations. For one document, the posterior is the topic assignment and topic proportions [10]. Hence, the per document posterior p(θ|W_1…N), when the topics are fixed (LDA works with fixed K topics), will be:

\frac{p (θ | α) \prod_{n = 1}^{N} \sum_{z_{n}} p (z_{n} | θ) \, p (w_{n} | z_{n}, β_{1 : K})}{\int_{θ} p (θ | α) \prod_{n = 1}^{N} \sum_{z_{n}} p (z_{n} | θ) \, p (w_{n} | z_{n}, β_{1 : K}) \, dθ}

(3)

The denominator in Equation (3) is intractable, and hence, we cannot evaluate the exact posterior distribution. Consequently, we have to work with approximate posterior inference, which can be carried out in two ways: Gibbs sampling [11] or variational inference [4, 12, 13]. Gibbs sampling is a special case of Markov chain Monte Carlo (MCMC) [11]. For the sample space (θ, z_1:N), the Gibbs sampler first computes the conditional distribution of θ given the current states of the other hidden variables and the observations, that is, p(θ | z_1:N, w_1:N). Given the values of z, θ is independent of w. The posterior distribution of θ given n draws from θ is just a Dirichlet distribution, and thus, the conditional distribution will be:

p (θ | z_{1 : N}, w_{1 : N}) = Dir (α + n (z_{1 : N}))

(4)

Subsequently, each topic assignment z_i is resampled as follows:

p (z_{i} | z_{- i}, w_{1 : N}, θ) \propto p (z_{i} | θ) p (w_{i} | β_{1 : K}, z_{i})

(5)
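The alternation between Equations (4) and (5) can be sketched for a single document as follows. This is a minimal illustration, assuming the topics β are fixed and known, words are given as integer ids, and every word has non-zero probability under at least one topic; the function name `gibbs_single_document` and the toy topics are hypothetical.

```python
import random


def gibbs_single_document(words, beta, alpha, sweeps=200, seed=0):
    """Uncollapsed Gibbs sampler for one document with fixed topics.

    words : list of word ids
    beta  : beta[k][w] = p(w | topic k), assumed known
    alpha : list of K Dirichlet parameters
    Returns the last sampled (theta, z).
    """
    rng = random.Random(seed)
    K = len(beta)
    z = [rng.randrange(K) for _ in words]  # random initial assignments
    theta = [1.0 / K] * K
    for _ in range(sweeps):
        # Equation (4): theta | z ~ Dir(alpha + n(z)), with n(z) the topic counts.
        counts = [0] * K
        for zi in z:
            counts[zi] += 1
        draws = [rng.gammavariate(alpha[k] + counts[k], 1.0) for k in range(K)]
        total = sum(draws)
        theta = [d / total for d in draws]
        # Equation (5): p(z_i = k | theta, w_i) is proportional to
        # theta_k * beta_k[w_i].
        for i, w in enumerate(words):
            weights = [theta[k] * beta[k][w] for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
    return theta, z


if __name__ == "__main__":
    beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]  # two toy topics, three word ids
    theta, z = gibbs_single_document([0, 0, 1, 2, 0], beta, alpha=[1.0, 1.0])
    print(theta, z)
```

In practice the samples from many sweeps would be averaged after a burn-in period; the sketch returns only the final state to keep the two conditional updates visible.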

This sampler converges very slowly. However, if the sample space (θ, z_1:N) is reduced, it converges faster. The collapsed Gibbs sampler [11, 14] integrates the topic proportions out of the sampler, so that the state of the sample space is just the collection of topic assignments. We can therefore iteratively sample each topic assignment given the remaining topic assignments, with the topic proportions integrated out. The resulting conditional combines the probability of the word under a given topic with the number of times that topic occurs among the other topic assignments, that is,

p (z_{i} | z_{- i}, w_{1 : N}) \propto p (w_{i} | β_{z_{i}}) Γ (n_{z_{i}} (z_{- i}))

(6)

A modified Gibbs sampler is o-LDA [15], an online algorithm that samples the next topic by training topics on the words of the document, instead of on all preceding words. The algorithm achieves more accurate results with slightly slower performance. Furthermore, although scalable MCMC requires parallel hardware, its computational complexity still scales linearly with the data, which is not sufficient for large amounts of data [16].

Variational inference is an alternative to Gibbs sampling that replaces sampling with optimization to achieve the tightest lower bound [4]. The variational distribution is:

q (θ, z_{1 : N} | γ, \phi_{1 : N}) = q (θ | γ) \prod_{n = 1}^{N} q (z_{n} | \phi_{n})

(7)

where γ and $\phi_{1 : N}$ are the free variational parameters, which are optimized to minimize the Kullback–Leibler divergence between the variational distribution (Equation (7)) and the true posterior. The update equations are as follows:

γ = α + \sum_{n = 1}^{N} \phi_{n}

(8)

\phi_{n} \propto \exp \{E [\log θ] + \log β_{w_{n}}\}

(9)
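The two update equations (8) and (9) can be iterated to a fixed point for a single document. The sketch below assumes fixed, known topics and uses the standard identity E[log θ_k] = ψ(γ_k) − ψ(Σ_j γ_j) for a Dirichlet distribution, where ψ is the digamma function; since the Python standard library has no digamma, a simple series approximation is included. The names `digamma` and `variational_e_step` (with `phi` standing for the per-word variational parameters) are illustrative.

```python
import math


def digamma(x):
    """Digamma for x > 0: upward recurrence, then an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series


def variational_e_step(words, beta, alpha, iters=50):
    """Coordinate ascent on gamma and phi for one document.

    words : list of word ids
    beta  : beta[k][w] = p(w | topic k), assumed known and non-zero
    alpha : list of K Dirichlet parameters
    """
    K, N = len(beta), len(words)
    gamma = [alpha[k] + N / K for k in range(K)]
    phi = [[1.0 / K] * K for _ in words]
    for _ in range(iters):
        psi_sum = digamma(sum(gamma))
        e_log_theta = [digamma(g) - psi_sum for g in gamma]
        for n, w in enumerate(words):
            # Equation (9): phi_n proportional to exp(E[log theta] + log beta_w).
            unnorm = [math.exp(e_log_theta[k]) * beta[k][w] for k in range(K)]
            total = sum(unnorm)
            phi[n] = [u / total for u in unnorm]
        # Equation (8): gamma = alpha + sum_n phi_n.
        gamma = [alpha[k] + sum(phi[n][k] for n in range(N)) for k in range(K)]
    return gamma, phi


if __name__ == "__main__":
    beta = [[0.7, 0.2, 0.1], [0.1, 0.2, 0.7]]
    gamma, phi = variational_e_step([0, 0, 1, 2], beta, alpha=[0.5, 0.5])
    print(gamma)
```

Because each row of `phi` sums to one, the entries of `gamma` always sum to the total of `alpha` plus the document length, which is a convenient sanity check.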

Various extensions have been proposed for PLSA and LDA that incorporate more contextual information such as time [13, 14, 17], authorship [18, 19], ranks [20] and links [1, 21–25]. The Dirichlet distribution draws topic proportions randomly, which reveals a near-independent structure [26]. Hence, LDA neglects the correlations in document-specific topic usage [5]. When independent γ variables are normalized, the resulting components tend to exhibit a negative correlation. The Doubly Correlated Nonparametric Topic model [5] models topic and document correlations influenced by metadata for an unbounded set of potential topics. It uses MCMC techniques for learning and inference. The Logistic–Normal distribution [27] better captures the inter-component correlations. The Correlated Topic Model (CTM) [28] differs from LDA only in its first step: it draws topic proportions from a Logistic–Normal distribution. As the Logistic–Normal distribution is not conjugate to the multinomial [29], exact inference is complicated and is therefore also intractable in CTM. A fast variational inference algorithm [29] for CTM optimizes the log probability of the document with respect to the variational parameters, which tightens the bound on the marginal probability of the observations. Collapsed variational inference [16, 30] integrates out the β values to achieve an improved approximation of the posterior variance. However, during optimization, z_i depends on z_-i; hence, it is difficult to apply to large amounts of data [31]. Wang et al. [31] modified collapsed variational inference by considering a single data point x_i with its topical patterns z_i. A data-driven split-merge algorithm [32] for the hierarchical Dirichlet process dynamically expands and contracts the number of topics. The split step creates new topics and the merge step removes redundant topics. In addition, Wang et al. [33] modified the optimization process by iteratively taking a random subset of the data and updating the variational parameter. The variational objective function was optimized by stochastic optimization [34]. Furthermore, Blei et al., in supervised Latent Dirichlet Allocation [35], derived a maximum likelihood procedure that handles the intractable posterior by maximizing the evidence lower bound. For a single document response, the variational objective function was maximized with respect to $\phi_{1 : N}$ and γ to obtain an estimate of the posterior. Moreover, feaLDA [36] accounts for topics during the generative process, where each document can be associated with a single label or multiple label topics. HSLDA [37] models LDA with a global topic estimation. Each document is labelled using a hierarchy of conditionally dependent probit regressions. In HSLDA, β is estimated from the data instead of being set a priori. Hong et al. [38] employed variational inference with Kalman filtering, in which the variational parameters act as observations and the true parameters act as latent states of the model.

The latent mixed-membership model [39], in its generative process, considers the adjacency matrix as a collection of Bernoulli random variables. It uses variational inference for the log-likelihood and the approximate posterior of the multiple group membership of objects. The nested Chinese restaurant process [40, 41] is a generative probabilistic model that describes a prior distribution over a tree-structured hierarchy with infinitely many paths. The tree hierarchy is used to capture topical patterns in the document. The nested Chinese restaurant process is extended to design a nonparametric topic model tree [41]. The change-point stick-breaking process, together with a product of γ and Poisson construction, is used to represent time-evolving topics deeper in the tree. L-LDA [42] extends the LDA model to multilabelled corpora by placing ‘topics’ in 1–1 correspondence with labels. Prior-LDA [43] extends LDA by including a two-stage generative process for each document. Initially, it samples a set of observed topics from a corpus-wide multinomial distribution, and then generates the words of the document. However, prior-LDA does not include dependencies between topics, and hence, cannot be used for estimating topics for new documents. Dependency LDA [43] is an extension of prior-LDA in which the dependency between the topics is computed by a T-corpus-wide probability distribution. In the present study, we propose a generative probabilistic model for documents, which identifies hidden topical patterns in the document with more approximate inference by identifying correlated terms.

3. Doubly correlated topic model

3.1. Terminology

Starting with basic terminologies, we define the following:

A word is the basic entity of the data, having its literal and/or practical meaning. It is a sequence of non-space alphanumeric and special characters. Words on a standard stop-word list are removed; the remaining words are normalized to lowercase and stemmed with the Porter stemmer.

A topic is a subset of the set of words.

A document is a sequence of N words, that is, [w₁, w₂, …, w_N].

3.2. Model

The Doubly Correlated Topic Model (DCTM) is an unsupervised generative probabilistic model for identifying topics from a corpus of documents. The basic intuition behind the model is the mixed membership model [44], which describes a document as a mixture of multiple topics, where topics are distributed over a fixed vocabulary of words (see Figure 3). Our aim is to evaluate semantically related information from the documents. The DCTM does not treat a document as a ‘bag of words’; word position is important for the model. For example, in the sentence ‘zoology is a part of biology’, we can consider that the words ‘zoology’ and ‘biology’ may be related. The DCTM’s emphasis is on calculating the probability of words from the vocabulary over concepts generated from co-occurrence; as a result, it computes correlation over correlated words. Our goal is to identify the underlying hidden topical structure of each document among the thousands of documents available in the corpus.

Figure 3.

Document as mixture of topics.

3.2.1. The generative process

The DCTM follows a generative process for each document of a corpus (see Figure 4). In the generative process, the DCTM observes topics from the highest-ranked co-occurred words of the documents. Then, it computes the word probability with respect to the observed topics. Initially, we have K number of topics that reside outside the documents, and each topic is the distribution of terms over a fixed vocabulary. These topics come from the concepts that are computed from word co-occurrence statistical information [45]. From these topics, the word topic proportions for every word are calculated for every document of the corpus.

Figure 4.

Generative process of DCTM.

3.2.2. Algorithm

The algorithm of DCTM for each document is as follows:

(1) Create a vocabulary V {V₁, V₂, …, V_v} of words.

(2) For each word in V:

(2.1) Calculate rank R with word co-occurrence statistical information.

(3) Select concept C of K words with highest rank.

(4) Topics z ≈ Unigram_K(C).

(5) For a matrix P of dimension v×K:

(5.1) Compute p(w_i|z_j, P_i,j).

(6) For each column of P:

(6.1) Select T words with highest probability;

(6.2) Compute sum of T words.

(7) Select column Y with largest sum.

(8) End.
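The final steps of the algorithm (selecting the T most probable words in each column of P, summing them and choosing the column with the largest sum) can be sketched as follows. This is only an illustration under stated assumptions: the matrix P is taken as already filled, it is stored as a list of v rows with K entries each, and each column is normalized to sum to 1 before ranking. The function name `select_topic_column` and the toy values are hypothetical and do not come from the original implementation.

```python
def select_topic_column(P, vocab, T):
    """Pick the column of P whose T most probable words have the largest sum.

    P     : list of v rows, each with K non-negative scores (assumed given)
    vocab : list of v words, aligned with the rows of P
    T     : number of top words to keep per column
    Returns (column index, top-T words of that column).
    """
    v, K = len(P), len(P[0])
    best = None
    for j in range(K):
        col_total = sum(P[i][j] for i in range(v))
        # Normalise the column so that it sums to 1 (all zeros stay zero).
        col = [P[i][j] / col_total if col_total else 0.0 for i in range(v)]
        # Select the T words with the highest probability in this column.
        top = sorted(range(v), key=lambda i: col[i], reverse=True)[:T]
        # Sum the probabilities of those T words.
        score = sum(col[i] for i in top)
        if best is None or score > best[0]:
            best = (score, j, [vocab[i] for i in top])
    return best[1], best[2]


if __name__ == "__main__":
    vocab = ["zoology", "biology", "bike"]
    P = [[0.90, 0.30],
         [0.05, 0.30],
         [0.05, 0.40]]
    print(select_topic_column(P, vocab, T=1))
```

A column whose mass is concentrated on a few words scores higher than a flat column, which is one way to read the selection of column Y.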

3.2.3. Graphical representation

The graphical representation of the DCTM is indicated in plate notation (see Figure 5). As shown in the figure, C is the list of K concepts that arise from the word co-occurrence statistical information [45]. It is sampled once for every M number of documents. The DCTM takes K concepts from each document where each concept may be a combination of one, two or a maximum of three words, and z is the topic proportions over the document, which are being prepared from the concept list. The distribution of word over topic assignment is computed by p(w|z), which is the proportion of assignments to topic t over all documents that come from the word w. It is sampled for every N number of words of every document with K topics, and is calculated over a matrix P of order K and V, where V is the size of the vocabulary.

Figure 5.

Graphical representation of DCTM.

With the concept list computed from word co-occurrence statistical information, the joint probability of the N words and K topics z is calculated as:

p (w, z | C, P) = p (z | C) p (w | z, P)

(10)

where $p (z | C)$ is simply the top K unigrams of C, that is, |z| ≤ |C|, and C is computed from $p (C | w)$. The value of $p (w | z, P)$ is calculated as shown in Figure 6, where P is the v×K matrix. We have K topics, which are χ² distributions over words. Each column of the matrix is normalized to 1. For the word w_i under the topic z_j, where j ≤ K, the probability is computed through the marginal distribution over matrix P, that is, P_i,j, shown as the shaded cell (see Figure 6). By solving Equation (10), the joint distribution of the hidden and observed random variables is:

\left\{ \sum_{n = 1}^{N} p (C | w_{n}) \right\} \left\{ \sum_{i = 1}^{V} \sum_{j = 1}^{K} \sum_{d = 1}^{D} p (w_{i} | z_{j, d}, P_{i, j}) \right\}

(11)

Figure 6.

Computation of p(w|z, P).

In Equation (11), the last term represents the likelihood terms of the document. Assigning these likelihood terms to the topics yields the co-occurring words of the document. Reducing the sample space tightens the bound on the co-occurring words; this can be achieved by splitting the document into D parts, that is, instead of considering a document as a whole, the model iterates over the data paragraph by paragraph.

3.3. Learning

Dataset D = $\{w_{i}\}_{i = 1}^{N}$ is a sequence of independent and identically distributed realizations of random variables. As the proposed model is a mixed membership model, in which a document is distributed over topics, it requires prior knowledge of topic assignments in order to extract the hidden topical pattern of the document. A feature selection technique based on word co-occurrence [45] supplies this prior knowledge. The basic intuition behind the technique is the probability distribution of co-occurrence between the terms of the document. A term ‘a’ is taken to be a keyword of the document when the co-occurrence distribution between ‘a’ and the frequent terms is biased towards a particular subset of the frequent terms [45]. The χ² measure is used to calculate the degree of bias of the distribution, and is computed as follows:

χ^{2} = \sum_{i = 1}^{n} \frac{{(OF_{i} - EF_{i})}^{2}}{EF_{i}}

(12)

where OF_i is the observed frequency and EF_i is the expected frequency of the ith category. For feature selection, Equation (12) is modified as follows:

χ^{2} (w) = \sum_{g \in G} \frac{{(freq (w, g) - n_{w} p_{g})}^{2}}{n_{w} p_{g}}

(13)

where G is the set of frequent terms, freq(w, g) is the frequency of co-occurrence of the terms w and g, p_g is the total number of terms in the sentences where the term ‘g’ appears divided by the total number of terms in the document, and n_w is the total number of terms in the sentences where w appears.

Word intrusions [46, 47] are words that are out of place or not semantically related to the others, that is, intruders. For example, consider the set of words {‘zoology’, ‘biology’, ‘Darwin’, ‘evolution’, ‘bike’}. All of the words except ‘bike’ are semantically related to each other; hence, ‘bike’ is considered an intruder and should be removed. To make the measure robust to such bias values, the maximal term is subtracted:

{χ'}^{2} (w) = χ^{2} (w) - \max_{g \in G} \left\{ \frac{{(freq (w, g) - n_{w} p_{g})}^{2}}{n_{w} p_{g}} \right\}

(14)
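Equations (13) and (14) can be computed directly from a tokenized document. The following is a minimal sketch, assuming the document is given as a list of sentences (each a list of terms) and that freq(w, g) counts the sentences in which w and g both appear; that counting rule, the function name `chi_square_scores` and the toy sentences are assumptions made for illustration.

```python
def chi_square_scores(sentences, w, G):
    """Return (chi2, chi2_prime) for term w, following Equations (13) and (14).

    sentences : list of sentences, each a list of terms
    w         : the candidate term
    G         : iterable of frequent terms
    """
    total_terms = sum(len(s) for s in sentences)
    # n_w: total number of terms in the sentences where w appears.
    n_w = sum(len(s) for s in sentences if w in s)
    terms = []
    for g in G:
        # p_g: terms in sentences containing g, divided by all terms.
        p_g = sum(len(s) for s in sentences if g in s) / total_terms
        # freq(w, g): sentences containing both terms (assumed counting rule).
        freq_wg = sum(1 for s in sentences if w in s and g in s)
        expected = n_w * p_g
        if expected > 0:
            terms.append((freq_wg - expected) ** 2 / expected)
    chi2 = sum(terms)                                       # Equation (13)
    chi2_prime = chi2 - max(terms) if terms else 0.0        # Equation (14)
    return chi2, chi2_prime


if __name__ == "__main__":
    sentences = [["zoology", "biology"],
                 ["zoology", "animal"],
                 ["bike", "road"]]
    print(chi_square_scores(sentences, "zoology", ["biology", "animal", "road"]))
```

In this toy document the term ‘road’ never co-occurs with ‘zoology’ and contributes the single largest term, so subtracting the maximum in Equation (14) removes most of the score.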

3.4. Topic coherence

Almost all topic models learn topics as multinomial distributions over words [48], and the outcome is the set of most probable words. Sometimes these words co-occur, but the resultant topic may mix very general and very specific terms. Therefore, the quality of the topics plays an important role in selecting a topic model. In brief, the resultant topics should be trustworthy, and the words of the topic mixture should be related. Topic coherence is the degree of relatedness between the words of a topic mixture. Initially, topic coherence is improved by applying the likelihood terms in Equation (11). These likelihood terms result in the set of co-occurring words. Computing co-occurrence over the set of co-occurring terms generated by the algorithm of Matsuo and Ishizuka [45] results in more semantically related words. The likelihood terms are calculated over each column of the matrix P. More likelihood terms reflect a higher probability, resulting in more related words.

The concepts (lists of co-occurring words) may be unigrams, bigrams or trigrams; however, unigrams are preferred when assigning these concepts to initial topic proportions. If the concepts are bigrams or trigrams, they are converted to unigrams. On the other hand, computing the resultant topics from the joint distribution of the topic proportions of the same concept tightens the related words in the resultant topics. Furthermore, an appropriate number of topics limits the topic coherence. Instead of keeping a fixed number of most probable words in the resultant topic (e.g. the top 10 or 20), selecting the words above some threshold percentage, such as 50 or 60% of the maximum rank of the probable terms, could result in more semantically related words.
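The threshold-based selection mentioned above can be sketched in a few lines. This is an illustrative example, assuming each candidate word carries a non-negative score (rank or probability) and that the cut-off is a fraction of the largest score; the function name `select_by_threshold` and the toy scores are hypothetical.

```python
def select_by_threshold(word_scores, fraction=0.6):
    """Keep words scoring at least `fraction` of the maximum score.

    word_scores : dict word -> non-negative score
    Returns the kept words ordered from highest to lowest score.
    """
    if not word_scores:
        return []
    cutoff = fraction * max(word_scores.values())
    kept = [(w, s) for w, s in word_scores.items() if s >= cutoff]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [w for w, _ in kept]


if __name__ == "__main__":
    scores = {"zoology": 1.0, "biology": 0.7, "animal": 0.55, "bike": 0.2}
    print(select_by_threshold(scores, fraction=0.6))
    print(select_by_threshold(scores, fraction=0.5))
```

Unlike a fixed top-10 or top-20 cut, the number of words returned adapts to how sharply the scores fall off.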

4. Results and analysis

This section describes the evaluation of the proposed DCTM. To compare the results with those of LDA, we used the C# implementation of LDA developed by Shusen Wang.¹ Three datasets from different domains were used in our experiment: 15,000 randomly crawled documents from Wikipedia [49, 50], 1760 documents from the NIPS² [5, 32, 38, 51] collection and 10,000 documents from the PubMed³ [51] dataset. Selecting datasets from different domains helped us to show the improved performance of our model irrespective of any prior knowledge. Wikipedia documents contain different types of structured information, categorization information and links to other pages, whereas PubMed and NIPS comprise plain text. All three datasets are freely available in the public domain. The web pages crawled from Wikipedia were converted to text files; HTML tags, images and a standard list of stop words were removed from the HTML documents. Similarly, the NXML files of the PubMed dataset were converted to text files by removing XML tags along with a standard list of stop words.

A vocabulary of words was generated as per the definition in Section 3.1. Sometimes, the large size of the vocabulary generated from the corpus of documents may cause problems [4] because some words may occur in a few documents only and some words have a very low frequency. Therefore, instead of creating one big vocabulary for a corpus, an individual vocabulary was generated for each document, and consequently, the proposed model did not suffer from the large vocabulary problem.

Entropy is used as a direct evaluation technique for comparing the probability distributions obtained by our proposed model on the training and test datasets. Entropy [52] measures the uncertainty of a probability distribution, and cross entropy extends it to evaluate the distance between two probability distributions. Thus, we could evaluate how well DCTM models the random variables (data). Entropy H(x) is calculated as:

H (x) = - \sum_{i = 1}^{n} p (x_{i}) \log p (x_{i})

(15)

When comparing two different probability distributions p and q on a random variable x, entropy is calculated as cross entropy, defined as:

H (p, q) = - \sum_{i = 1}^{n} p (x_{i}) \log q (x_{i})

(16)
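Equations (15) and (16) translate directly into code. The sketch below assumes discrete distributions given as aligned lists of probabilities and uses the natural logarithm; the base of the logarithm is not stated in the text, so this choice is an assumption, as are the function names.

```python
import math


def entropy(p):
    """Equation (15): -sum_i p(x_i) log p(x_i), skipping zero-probability terms."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


def cross_entropy(p, q):
    """Equation (16): -sum_i p(x_i) log q(x_i).

    Assumes q(x_i) > 0 wherever p(x_i) > 0; otherwise math.log raises an error.
    """
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.9, 0.1]
    print(entropy(p))            # entropy of the reference distribution
    print(cross_entropy(p, q))   # larger, because q differs from p
```

The cross entropy of a distribution with itself equals its entropy, and any mismatch between p and q can only increase the value, which is why a lower value indicates a closer fit.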

When comparing two different models by entropy, the model with lower entropy is assumed to approximate the data better. The datasets were divided in a 90:10 ratio for training and testing. The entropy comparison of LDA and DCTM over the Wikipedia, NIPS and PubMed datasets suggests improved performance of DCTM over LDA (see Tables 1–3, respectively). The entropy of LDA decreased steadily from topic value 10 to topic value 200, and subsequently remained constant, whereas the entropy of DCTM increased from topic value 10 to topic value 100, and subsequently remained constant. The entropy of DCTM suggests that the model performs better with low topic values; this is because DCTM processes each document of the corpus separately, and a document can be better explained by a topic mixture of 20–30 words than by a topic mixture of 100 words. As DCTM processes each document separately, new documents can easily be added to the model for topic identification.

Table 1.

Entropy comparison between LDA and DCTM over Wikipedia dataset.

Number of topics	LDA	DCTM
10	43.3	6.37
50	13.94	7.16
100	13.17	7.33
150	13.11	7.33
200	12.35	7.33

Table 2.

Entropy comparison between LDA and DCTM over NIPS dataset.

Number of topics	LDA	DCTM
10	31.431	7.553
50	31.353	7.656
100	18.729	7.650
150	17.859	7.650
200	14.109	7.650

Table 3.

Entropy comparison between LDA and DCTM over PubMed dataset.

Number of topics	LDA	DCTM
10	44.3	7.31
100	27.7	7.74
200	18.5	7.74

Topic coherence is a semantic relatedness between the words of the topic mixture. In addition, it can also be used to evaluate the quality of topic mixtures of the documents identified by topic models. We compared the top 20 topic words identified by DCTM, LDA and Word co-occurrence statistical information on the Wikipedia-Zoology⁴ web page (see Table 4). Table 4 suggests that DCTM offers tighter bound on the topic coherence because words generated from the DCTM are more semantically related than LDA and Word co-occurrence statistical information. A question may arise regarding why we are finding the correspondences between the most co-occurred words and words from vocabulary. The top 20 unigram keywords observed from Word co-occurrence statistical information [45] are shown in column 1 of Table 4. For our model, we used these unigram words as initial topics, and then the correspondence between these topics and vocabulary was calculated for each document, and the results are shown in column 2 of Table 4, whereas the LDA results are shown in column 3 of Table 4. The first most probable word identified by both LDA and DCTM was ‘zoology’, followed by ‘biology’, ‘animal’, ‘classification’ and so on by DCTM (see column 2 of Table 4). This suggested that DCTM could identify the document related to the topic ‘zoology’. Furthermore, the document may be related to topics such as ‘biology’, ‘animal’, ‘classification’, ‘evolutionary’, ‘evolution’, and so on with gradually decreasing probabilities. On the other hand, LDA could identify the document related to ‘zoology’ and then ‘Darwin’, ‘animal’, ‘branch’, and so on with gradually decreasing probabilities.

Table 4.

A sample of topics identified by DCTM, LDA and word co-occurrence statistical information.

Word co-occurrence	DCTM	LDA
Technology	Zoology	Zoology
Engineering	Biology	Darwin
Biology	Animal	Animal
Zoology	Classification	Branch
Animal	Evolutionary	Portal
Organisms	Evolution	Ethology
Sciences	Darwin	Ancient
Physiological	Portal	Von
Study	Term	Biology
Cell	Ethology	Evolution
Systems	Structure	Thomas
Classification	Ancient	Distribution
Darwin	Systematics	Entomology
Evolution	Kingdom	Pronounced
Evolutionary	Branch	Ornithology
Species	Category	Zoologist
Biological	Research	Structure
Structure	Zoologist	Herpetology
Theory	Linnaeus	Charles
Modern	Physiological	Series

5. Conclusion

The web is usually an unstructured or a semi-structured collection of documents. Structuring the web can help to improve the search results. To structure the web, it is necessary to understand the content of the document. Knowing the topics of the document can help in understanding the content of the document, and can consequently help to organize documents in a meaningful way. Topic modelling is one of the ways to identify topics of the document.

In this study, we proposed DCTM, a simple generative probabilistic model. As the exact inference was intractable, we employed a more approximate inference algorithm by reducing the assumptions in our model compared with those in the existing topic models. A feature selection algorithm based on co-occurrence of words in a document was used for posterior inference. Our assumption started from fixing topic assignment that was generated from an inference. Later, this assumption was backtracked, which produced more accurate results than having assumed the values for different hyper-parameters. Our emphasis was to find correspondence between data and topics, and variation of the values of hyper-parameter affected the smoothing and sparseness of topics, which was not used in the model. This made our inference method simpler and gave more accurate results. DCTM is a two-stage generative process model for each document. First, it samples a set of observed topics from the document, and then generates the topical words from the document. It differs from the LDA in its first step, and draws topic proportions from word co-occurrences. As the proposed model is an unsupervised model, it does not need to have any prior knowledge of the corpus. Furthermore, the vocabulary for each document is generated individually; therefore, it does not include any type of dependency between the documents and new documents can be added, and hence, it is a scalable model. The second assumption for DCTM is choosing the appropriate number of topics K. The concepts are initialized by the K most co-occurred words observed from the document. These concepts may be unigram, bigram or trigram, and when converted into unigram words, its size is ≥K. We considered only the K concepts as topics from the set of concepts because of the fact that a document can be better explained by topic mixture of 20–30 words.

The results suggest that the proposed model offers some performance improvement over existing unsupervised topic models. Furthermore, it yielded better results with a small number of topics (between 10 and 50), and it can be used for automatic document categorization and organization.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.

References

1. Xia, Tang, Moens. Plink-LDA: Using link as prior information in topic modeling. In: Lee, Peng, Zhou, Moon, Unland, Yoo (eds) Database systems for advanced applications. Lecture Notes in Computer Science, Vol. 7238. Berlin: Springer, 2012, pp. 213–227.

2. Arora, Moitra. Learning topic models – Going beyond SVD. In: IEEE 53rd annual symposium on foundations of computer science, New Brunswick, NJ, 20–23 October 2012, pp. 1–10.

3. Hofmann. Probabilistic latent semantic indexing. In: 22nd annual international ACM-SIGIR conference on research and development in information retrieval, Berkeley, CA, 15–19 August 1999, pp. 50–57.

4. Blei, Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research 2003; 3: 993–1022.

5. Kim, Sudderth. The doubly correlated nonparametric topic model. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2011, pp. 1980–1988.

6. StackExchange. Natural interpretation for LDA hyperparameters, http://stats.stackexchange.com/questions/37405/natural-interpretation-for-lda-hyperparameters (2012, accessed 16 April 2013).

7. Dempster, Laird, Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological) 1977; 39: 1–38.

8. Sontag, Roy. Complexity of inference in latent Dirichlet allocation. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2011, pp. 1008–1016.

9. Heinrich. Parameter estimation for text analysis. Technical Report, University of Leipzig, Germany, 2008.

10. Blei. Topic models, http://videolectures.net/mlss09uk_blei_tm/ (2009, accessed 14 March 2013).

11. Griffiths, Steyvers. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 2004; 101: 5228–5235.

12. Jordan, Ghahramani, Jaakkola, Saul. Introduction to variational methods for graphical models. Machine Learning 1999; 37: 183–233.

13. Blei, Lafferty. Dynamic topic models. In: 23rd International conference on machine learning (ICML), Pittsburgh, PA, 25–29 June 2006, pp. 113–120.

14. Wang, McCallum. Topics over time: A non-Markov continuous-time model of topical trends. In: 12th ACM SIGKDD international conference on knowledge discovery and data mining, Philadelphia, PA, 20–23 August 2006, pp. 424–433.

15. Banerjee, Basu. Topic models over text streams: A study of batch and online unsupervised learning. In: 7th SIAM international conference on data mining, Minneapolis, MN, 26–28 April 2007, pp. 431–436.

16. Kurihara, Welling, Teh. Collapsed variational Dirichlet process mixture models. In: International Joint Conferences on Artificial Intelligence (IJCAI), Hyderabad, India, 9–12 January 2007, pp. 2796–2801.

17. Iwata, Yamada, Sakurai, Ueda. Online multiscale dynamic topic models. In: 16th ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, 25–28 July 2010, pp. 663–672.

18. Rosen-Zvi, Griffiths, Steyvers, Smyth. The author–topic model for authors and documents. In: 20th conference on uncertainty in artificial intelligence, Banff, Canada, 7–11 July 2004, pp. 487–494.

19. Geng, Wang, Korba. Adapting LDA model to discover author–topic relations for email analysis. In: Song, Eder, Nguyen (eds) Data warehousing and knowledge discovery. Lecture Notes in Computer Science, Vol. 5182. Berlin: Springer-Verlag, 2008, pp. 337–346.

20. Duan, Zhang, Wen. Rank topic: Ranking based topic modeling. In: IEEE 12th international conference on data mining (ICDM), Brussels, 10–13 December 2012, pp. 211–220.

21. Liu, Niculescu-Mizil, Gryc. Topic-link LDA: Joint models of topic and author community. In: ACM ICML’09 26th international conference on machine learning (ICML), Montreal, 14–18 June 2009, pp. 665–672.

22. Sun, Han, Gao. iTopic model: Information network-integrated topic modeling. In: IEEE international conference on data mining (ICDM), Miami, FL, 6–9 December 2009, pp. 493–502.

23. Nallapati, Ahmed, Xing, Cohen. Joint latent topic models for text and citations. In: 14th ACM SIGKDD international conference on knowledge discovery and data mining, Las Vegas, NV, 24–27 August 2008, pp. 542–550.

24. Chang, Blei. Hierarchical relational models for document networks. The Annals of Applied Statistics 2010; 4: 124–150.

25. Wang, Blei. Collaborative topic modeling for recommending scientific articles. In: 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, 21–24 August 2011, pp. 448–456.

26. Huang, Malisiewicz. Correlated topic model details. Technical Report, Carnegie Mellon University, 2006, http://people.csail.mit.edu/tomasz/papers/huang_ctm_tech_report_2006.pdf

27. Aitchison, Shen. Logistic-normal distributions: Some properties and uses. Biometrika 1980; 67: 261–272.

28. Blei, Lafferty. Correlated topic models. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2005, pp. 147–154.

29. Blei, Lafferty. A correlated topic model of science. The Annals of Applied Statistics 2007; 1: 17–35.

30. Teh, Kurihara, Welling. Collapsed variational inference for HDP. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2007, pp. 1481–1488.

31. Wang, Blei. Truncation-free online variational inference for Bayesian nonparametric models. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2012, pp. 413–421.

32. Bryant, Sudderth. Truly nonparametric online variational inference for hierarchical Dirichlet processes. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2012, pp. 2699–2707.

33. Wang, Paisley, Blei. Online variational inference for the hierarchical Dirichlet process. In: 14th International conference on artificial intelligence and statistics (AISTATS), Fort Lauderdale, FL, 11–13 April 2011, pp. 752–760.

34. Chen. Implicit stochastic optimization with data mining for reservoir system operation. In: 9th International conference on machine learning and cybernetics, Qingdao, China, 11–14 July 2010, pp. 2410–2415.

35. Blei, McAuliffe. Supervised topic models. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2008, pp. 121–128.

36. Lin, Pedrinaci, Domingue. Feature LDA: A supervised topic model for automatic detection of web API documentations from the web. In: Cudré-Mauroux, Heflin, Sirin (eds) The semantic web: ISWC 2012. Lecture Notes in Computer Science, Vol. 7649. Berlin: Springer-Verlag, 2012, pp. 328–343.

37. Perotte, Wood, Elhadad, Bartlett. Hierarchically supervised latent Dirichlet allocation. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2011, pp. 2609–2617.

38. Hong, Yin, Guo, Davison. Tracking trends: Incorporating term volume into temporal topic models. In: 17th ACM SIGKDD international conference on knowledge discovery and data mining, San Diego, CA, 21–24 August 2011, pp. 484–492.

39. Airoldi, Blei, Xing, Fienberg. A latent mixed membership model for relational data. In: ACM 3rd international workshop on link discovery, LINK-KDD’05, Chicago, IL, 21 August 2005, pp. 82–89.

40. Blei, Griffiths, Jordan, Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2003, pp. 17–24.

41. Zhang, Dunson, Carin. Hierarchical topic modeling for analysis of time-evolving personal choices. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2011, pp. 1395–1403.

42. Ramage, Hall, Nallapati, Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Conference on empirical methods in natural language processing, Singapore, 6–7 August 2009, pp. 248–256.

43. Rubin, Chambers, Smyth, Steyvers. Statistical topic models for multi-label document classification. Machine Learning 2012; 88: 157–208.

44. Blei. Mixed membership models, http://www.cs.princeton.edu/courses/archive/fall11/cos597C/lectures/mixed-membership.pdf (2011, accessed 12 April 2013).

45. Matsuo, Ishizuka. Keyword extraction from a single document using word co-occurrence statistical information. International Journal on Artificial Intelligence Tools 2004; 13: 157–169.

46. Chang, Boyd-Graber, Gerrish, Wang, Blei. Reading tea leaves: How humans interpret topic models. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2009, pp. 288–296.

47. Mimno, Wallach, Talley, Leenders, McCallum. Optimizing semantic coherence in topic models. In: EMNLP ’11 proceedings of the conference on empirical methods in natural language processing, Edinburgh, 27–31 July 2011, pp. 262–272.

48. Newman, Bonilla, Buntine. Improving topic coherence with regularized topic models. In: Advances in neural information processing systems (NIPS). Cambridge, MA: MIT Press, 2011, pp. 496–504.

49. Virtanen, Jia, Klami, Darrell. Factorized multi-modal topic model. In: Twenty-eighth conference on uncertainty in artificial intelligence, Catalina Island, USA, 14–18 August 2012, pp. 843–851.

50. Chaney AJB, Blei. Visualizing topic models. In: Sixth ICWSM, Dublin, 4–7 June 2012.

51. Yao, Mimno, McCallum. Efficient methods for topic model inference on streaming document collections. In: 15th ACM SIGKDD international conference on knowledge discovery and data mining, Paris, 28 June to 1 July 2009, pp. 937–946.

52. Cover, Thomas. Elements of information theory. Chichester: John Wiley & Sons, 1991, pp. 12–14.