Unsupervised learning of textual pattern based on Propagation in Bipartite Graph

Abstract

Graph-based algorithms have attracted considerable interest in recent years because they facilitate pattern recognition and learning through information propagation over a graph. Here, we propose an unsupervised learning algorithm based on propagation on a bipartite graph, referred to as the Propagation in Bipartite Graph (PBG) algorithm. The contributions of this approach are threefold: 1) we present an iterative graph-based algorithm and a straightforward bipartite representation for textual data, in which vertices represent documents and words, and edges between documents and words represent the occurrences of the words in the documents; 2) we show that PBG is more flexible and easier to adapt to different applications than the mathematical formalism of generative models; and 3) we present a comprehensive evaluation and comparison of PBG with other topic extraction techniques. We describe the strategy employed in the PBG algorithm as a problem of maximizing the similarity between latent vectors assigned to vertices and edges, and demonstrate that the proposed strategy can be improved by assigning good initial values to the vectors. We note that PBG can be parallelized by a simple adjustment to the algorithm. We also show that the proposed algorithm is competitive with LDA and NMF in modelling textual collections, returning coherent topics, and in the dimensionality reduction task.

Keywords

Unsupervised learning, topic modelling, bipartite graph representation, dimensionality reduction, text mining

1. Introduction

Data is growing faster than ever before, and a massive amount of new information is created and shared on a daily basis. The most common way of storing such information is in textual format, such as journals, web pages, reports, memos and social networks. Therefore, the development of powerful interactive mining tools remains a relevant research topic for dealing with textual data. An important task when exploring collections of documents is understanding the semantic information conveyed by texts, so that relevant concepts, themes, and topics present in the collections can be unveiled.

Many applications rely on unsupervised strategies for grouping textual documents, among them probabilistic topic modeling, a suite of algorithms that aims at discovering and annotating large collections of documents with thematic information [5]. These algorithms model each document as a mixture over a fixed set of underlying topics, in which each topic is characterized as a distribution over words. The topic probabilities can be indirectly inferred by maximizing the log-likelihood of the data to be generated [6]. However, the mathematical technicality of these approaches hampers the exploration of new assumptions, heuristics or adaptations that might be useful in many real scenarios. From a practitioner’s perspective, including heuristic knowledge in a probabilistic model and deriving an effective and implementable inference algorithm are hard and tiresome tasks. Therefore, there is room for the development of simple descriptive algorithms that allow easy incorporation of heuristic methods and adaptations for unsupervised tasks.

Topic models are useful in the sense that, empirically, they lead to good models for document collections, where topics tend to place high probability on words that represent concepts, and documents are represented as expressions of those concepts. Perusing the inferred topics is effective for model verification and for ensuring that the model is capturing the practitioner’s intuitions about the documents. Moreover, producing a human-interpretable decomposition of the texts can be a goal in itself, as when browsing or summarizing a large collection of documents.

In this spirit, much of the literature comparing different topic models presents examples of topics and of document-topic assignments to help understand the model’s mechanics. Topics can also help users discover new content via corpus exploration [5]. The presentation of these topics serves, either explicitly or implicitly, as a qualitative evaluation of the latent space, but there is no explicit quantitative evaluation of them. Instead, researchers employ a variety of metrics of model fit, such as perplexity or held-out likelihood. Such measures are useful for evaluating the predictive model, and the mathematical and formal assumptions of topic models do, empirically, lead to good models of documents. However, fitting a mathematical model does not address the more exploratory goal of topic modeling: producing a human-interpretable decomposition of the texts. On the other hand, heuristic approaches have been widely used for solving problems more quickly or for finding better solutions. In some tasks, traditional topic extraction algorithms converge slowly and require many iterations to reach their peak performance. Their slowness is mainly due to their batch nature, in which parameters are updated only once after each pass over the data. In such cases, a heuristic method based on incremental updates can potentially speed up learning and improve performance. In other cases, heuristics can be combined with optimization algorithms to improve solutions.

Another important aspect of text extraction is data representation, for which the vector space model (VSM) has been the usual approach. Nevertheless, more expressive representations, such as a bipartite graph representing the relations between documents and words, can be employed. In this intuitive and straightforward representation, each vertex represents a document or a term, and each edge corresponds to the occurrence of a word in a document, i.e. there are edges only between pairs of vertices of different types (document or word). Optionally, a weight can be assigned to each edge according to the frequency of the word in the document.
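As an illustration of this representation, the following minimal sketch builds the weighted edge set from already tokenized documents. The function name `build_bipartite_graph` and the dictionary-based storage are assumptions made for this example only; they are not prescribed by the text.

```python
from collections import Counter

def build_bipartite_graph(docs):
    """Return (vocab, edges); edges maps (j, i) -> f_{j,i} (hypothetical helper)."""
    vocab = {}   # word -> word index i
    edges = {}   # (document index j, word index i) -> frequency f_{j,i}
    for j, tokens in enumerate(docs):
        for word, freq in Counter(tokens).items():
            i = vocab.setdefault(word, len(vocab))
            edges[(j, i)] = freq   # an edge exists only if the word occurs in the document
    return vocab, edges

# Toy corpus of two tokenized documents (illustrative data).
docs = [["graph", "topic", "graph"], ["topic", "model"]]
vocab, edges = build_bipartite_graph(docs)
```

Only document-word pairs with a non-zero frequency are stored, so the memory used grows with the number of edges rather than with the full document-by-word table.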

The description of algorithms over a graph representation offers several advantages, since graph representations (1) avoid sparsity and ensure low memory consumption, (2) enable an easy description of operations that include the topological structure of a dataset, (3) enable an optimal description of the topological structure of a dataset, (4) provide local and global statistics of the dataset structure, and (5) enable the extraction of patterns that are not extracted by algorithms based on the vector-space model [11, 33, 13].

Graph-based algorithms are commonly used in a semi-supervised learning schema, in which some labeled objects propagate their labels to other objects through the graph structure via transductive classifications [44, 33, 42, 3]. However, the appropriate use of the richness of information conveyed by those graphs may lead to unsupervised algorithms that, through a suitable propagation schema, group objects densely connected, keeping in different groups sparsely connected objects.

Here, we propose an algorithm referred to as Propagation in Bipartite Graph (PBG), which uses a bipartite heterogeneous graph representation for unsupervised tasks such as topic extraction and soft clustering. We present its mathematical foundation and empirically compare PBG with other techniques for topic extraction. The strategy adopted in PBG is a combination of two traditional optimization techniques for topic extraction, namely Nonnegative Matrix Factorization (NMF) [23] and Latent Dirichlet Allocation (LDA) [7]. We combine their optimization procedures in a simple and descriptive algorithm based on a bipartite graph representation. This algorithmic approach enables the combination of heuristics, optimization procedures and the graph representation of a document collection to improve unsupervised tasks such as topic extraction and document clustering.

The rationale behind the proposed graph-based unsupervised algorithm is the association of latent information with each vertex and edge and its appropriate propagation until convergence. Such latent information is kept in a $K$ -dimensional vector, where $K$ is defined by the user as the number of topics to be extracted. The algorithm encompasses an iterative propagation procedure, in which the topic (latent) information related to a vertex influences its neighbors’ topics until convergence through a label propagation schema. Due to the characteristics of the bipartite heterogeneous graph, the documents propagate their topic information to edges, and edges propagate their topic information back to documents and terms. The agreement between topic information is optimized by maximizing an objective based on the generalized Kullback-Leibler divergence.

The main contributions of this article are threefold: we present 1) the use of a bipartite graph representation and iterative propagation for the unsupervised learning process, 2) a description of PBG as a simple implementable algorithm that combines the inference process of Latent Dirichlet Allocation (LDA) and the optimization steps of Nonnegative Matrix Factorization (NMF), and 3) a comprehensive evaluation and comparison of PBG with other topic extraction techniques.

We also discuss the behavior of the algorithms for different datasets and the empirical evaluation shows our algorithm returns consistent results, which makes it a competitive alternative and a new exploratory possibility to the state-of-the-art unsupervised algorithms for textual exploration.

The remainder of the paper is organized as follows: Section 2 reviews some basic concepts and provides an overview of the standard topic modeling approaches; Section 3 introduces a topic extraction method that uses propagation in bipartite graphs and its implementation; in Section 4 we compare the computational aspects of algorithms for LDA, NMF, and our proposed algorithm PBG; Section 5 reports results from an empirical study on a large test suite; finally, Section 6 summarizes the results and discusses other potential applications and future work.

2. Related work and background

Topic Modeling has many applications in natural language processing and information retrieval [37, 5]. The main application is related to summarizing the most important themes in a domain that pervade a large unstructured collection of documents. Topic models can organize the collection according to the discovered themes.

A topic model typically consists of a set $\mathscr{K}$ with $K$ topics, each topic $k\in\mathscr{K}$ represented by a ranked list of strongly-associated terms. The corpus $\mathscr{D}$ with $D$ documents can also be associated with one or more topics. Each document $d_{j}\in\mathscr{D}$ , $1\leqslant j\leqslant D$ , is represented by a bag of words drawn from a vocabulary $\mathscr{W}$ with $W$ words. For notational convenience, a word $w_{i}\in\mathscr{W}$ , $1\leqslant i\leqslant W$ , can also be written as $w_{i,j}^{n}$ when it occurs in the $n$ -th position of document $d_{j}$ , and $f_{j,i}$ is the frequency of word $w_{i}$ in document $d_{j}$ .

Considerable research on topic modeling has focused on the use of probabilistic methods, where a topic is viewed as a probability distribution over words, with documents being mixtures of topics [37]. Alternative non-probabilistic models, such as those based on matrix decomposition, have also been effective in the task of topic extraction [38]. In fact, each paradigm has its own strengths and weaknesses [36]. Among these topic models, we choose the two most popular methods: the non-probabilistic NMF and the probabilistic LDA. Both are traditional and well-formalized techniques for extracting topics from a corpus.

LDA is a Bayesian generative topic model for documents. The basic idea is that documents are represented as random mixtures over latent topics, where each topic $k$ is characterized by a distribution over words [7]. The intuition behind LDA is to represent the probabilistic proportion of multiple topics in a document and, likewise, to keep the probability distribution relating topics and words. It is desirable that the words most related to a topic convey its thematic interpretation to the user [7].

NMF is a matrix factorization method that factorizes a matrix whose elements are all non-negative into two matrices whose elements are also non-negative. NMF for documents factorizes a document-term matrix $\bm{F}$ into two matrices $\bm{A}$ and $\bm{B}$ such that $\bm{F}\approx\bm{AB}$ . The factor matrix $\bm{A}$ describes data clusters of related documents (it is related to the thematic structure of the documents that defines the topics), and the factor matrix $\bm{B}$ can be interpreted as the relationship between topics and words.
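To make the factorization concrete, the sketch below applies the standard multiplicative updates for NMF under the generalized KL divergence. It is an illustrative baseline, not the implementation evaluated in this paper; the names `nmf_kl` and `kl_divergence`, the random initialization and the small constant `eps` are assumptions of the example. Here $\bm{F}$ is $D\times W$, $\bm{A}$ is $D\times K$ and $\bm{B}$ is $K\times W$.

```python
import numpy as np

def kl_divergence(F, A, B, eps=1e-12):
    """Generalized KL divergence between F and the reconstruction A @ B."""
    R = A @ B + eps
    return float(np.sum(F * np.log((F + eps) / R) - F + R))

def nmf_kl(F, K, n_iter=100, seed=0, eps=1e-12):
    """Multiplicative updates for F ~ A @ B with non-negative factors."""
    rng = np.random.default_rng(seed)
    D, W = F.shape
    A = rng.random((D, K)) + 0.1   # document-topic factor
    B = rng.random((K, W)) + 0.1   # topic-word factor
    for _ in range(n_iter):
        R = A @ B + eps
        A *= ((F / R) @ B.T) / (B.sum(axis=1) + eps)
        R = A @ B + eps
        B *= (A.T @ (F / R)) / (A.sum(axis=0)[:, None] + eps)
    return A, B

# Toy document-term matrix (illustrative data).
F = np.array([[3.0, 0.0, 1.0], [0.0, 2.0, 4.0], [1.0, 1.0, 0.0], [5.0, 0.0, 2.0]])
A, B = nmf_kl(F, K=2)
```

Because the updates only multiply by non-negative ratios, both factors stay non-negative throughout, which is what allows rows of $\bm{A}$ and $\bm{B}$ to be read as topic weights.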

Later, in Section 4, we discuss the computational aspects of LDA and NMF in more detail to theoretically justify the similarity between our proposed algorithm PBG and these well-formalized techniques. Before that, we introduce PBG as a label propagation procedure in a bipartite graph.

The general idea of label propagation in a graph is well established in the literature [42, 43, 44, 21] and is closely related to techniques for message passing and inference in probabilistic models [30, 16, 27, 32]. The work in [30] investigates the approximation performance of belief propagation in an inference schema, and its experimental results suggest that propagation can yield accurate posterior marginals. The work in [27] proposes a distributed variational message passing scheme for learning conjugate models, and its theoretical analysis supports the favorable behavior observed in experiments with financial datasets. In [40], LDA is represented as a factor graph within the Markov random field framework. A factor graph is a bipartite graph representing the factorization of a probability distribution function and latent variables, with edges relating factor functions and variables [4]. In contrast to what we propose here, traditional message passing methods do not use the bipartite graph as a descriptive approach and data representation. In addition, the computational problem in the message passing approach is still based on approximate inference and parameter estimation.

Considering the computational aspects of the topic extraction problem, probabilistic topic models depend on learning algorithms to infer latent topics. Various learning algorithms have been developed, including collapsed Gibbs sampling, variational inference, maximum a posteriori estimation, and belief propagation. These algorithms, all varieties of LDA inference, are based on a generative framework for modeling high-dimensional sparse count data. The work in [2] provides an empirical and theoretical comparison of these inference algorithms and concludes that they are closely related, i.e. depending on the update equations and hyperparameters employed, their performance differences diminish. We therefore expect similar experimental results between our method and the inference algorithms for LDA. Nevertheless, the proposal presented here is a simple descriptive approach, with theoretical results equivalent to those of inference and matrix decomposition algorithms. Besides, it is suited to incorporating new heuristics and available problem information.

3. Topic extraction using propagation in bipartite graph

In this section, we present the problem formulation, the general notation, and the mathematical and computational foundations of the proposed unsupervised algorithm based on a bipartite graph. The PBG (Propagation in Bipartite Graph) algorithm is an unsupervised label propagation algorithm based on the update equations of NMF and LDA. Our proposed algorithm can be seen as a mixture of these two techniques; we discuss this issue in more detail in Section 4. We adopt the KL divergence as the measure used to optimize the agreement among the latent information of documents, terms and their links in the topic extraction process.

The collection of documents is represented by a bipartite graph. In the bipartite graph representation $G=(\mathscr{V}=\{\mathscr{D}\cup\mathscr{W}\},\mathscr{E},f)$ , the vertex set $\mathscr{V}=\{\mathscr{D}\cup\mathscr{W}\}$ corresponds to documents (vertices in $\mathscr{D}$ ) and terms (vertices in $\mathscr{W}$ ), while edges in $\mathscr{E}$ represent document-term relations, i.e. links between vertices in $\mathscr{D}$ and vertices in $\mathscr{W}$ . A non-negative value $f_{j,i}$ is associated with each edge $e_{j,i}$ , where $f_{j,i}$ is given by the frequency of occurrence of term $w_{i}$ in the document $d_{j}$ . The vertices $d_{j}\in\mathscr{D}$ and $w_{i}\in\mathscr{W}$ , and the edges $e_{j,i}\in\mathscr{E}$ , are associated respectively with the latent information vectors $\bar{A}_{j}$ , $\bar{B}_{i}$ and $\bar{C}_{j,i}$ . We denote the sets of vectors associated with vertices and edges as $\bar{A}$ , $\bar{B}$ , and $\bar{C}$ . Thus, the goal of PBG is to extract latent information and associate it with vectors $\bar{A}_{j}$ , $\bar{B}_{i}$ and $\bar{C}_{j,i}$ .

3.1 Optimizing the divergence between latent information vectors

The assumption of PBG is that the divergence between the latent information of documents in $\mathscr{D}$ , terms in $\mathscr{W}$ and edges in $\mathscr{E}$ is useful for improving the extraction of topics from documents. Our proposed algorithm propagates the latent information of terms and documents to edges, and uses the latent information of edges to infer the topics. Thus, a $K$ -dimensional vector $\bar{C}_{j,i}$ is used to store the latent information of an edge $e_{j,i}\in\mathscr{E}$ . The latent information refers to the thematic or topic structure of documents.

Figure 1.

Local propagation to vertex $d_{1}$ . It is related to Eq. (10), where the vectors associated to words (and edges) are propagated back to document vector. This procedure is described in Algorithm 2.

The rationale behind our proposed algorithm is that the larger the frequency $f_{j,i}$ , i.e. the edge weight, the larger the agreement between the latent information of vectors $\bar{A}_{j}$ and $\bar{B}_{i}$ , given by $(\bar{A}_{j}\odot\bar{B}_{i})$ , and the vector $\bar{C}_{j,i}$ (see Fig. 1). Then, calculating the Kullback-Leibler divergence between edge $e_{j,i}$ , $f_{j,i}\bar{C}_{e_{j,i},k}$ , and the vectors $\bar{A}_{j}$ and $\bar{B}_{i}$ , for all indexes of documents $d_{j}\in\mathscr{D}$ , words $w_{i}\in\mathscr{W}$ , and document-word pairs $e_{j,i}\in\mathscr{E}$ , we define the following objective function to be maximized:

$\displaystyle Q_{G}(\bar{A},\bar{B},\bar{C})_{k}=\sum_{e_{j,i}\in\mathscr{E}}\left(f_{j,i}\bar{C}_{e_{j,i},k}\log{\frac{[\bar{A}_{j}\odot\bar{B}_{i}]_{k}}{\bar{C}_{e_{j,i},k}}}\right)+\sum_{d_{j}\in\mathscr{D}}\mathscr{R}(\bar{A}_{j,k},\alpha)$ (1)

where $\mathscr{R}(\bar{A}_{j,k},\alpha)$ are regularization terms for each document $d_{j}$ , and $\alpha$ is a constant which controls the concentration of topic (or class) information in the vector.

$\displaystyle\mathscr{R}(\bar{A}_{j,k},\alpha)=(\alpha-\bar{A}_{j,k})\log{\bar{A}_{j,k}}+\bar{A}_{j,k}(\log{\bar{A}_{j,k}}-1).$ (2)

A high value of $\alpha$ means that each document likely contains a mixture of all topics rather than any single topic. A low value of $\alpha$ relaxes this constraint on a document and means that the document likely contains a mixture of just a few classes.

The latent information vectors for the whole set of vectors can be obtained by optimizing this equation for each pair of vertices linked by an edge, thus giving rise to the following cost function for the graph $G$ :

$\displaystyle Q(G)=\arg\max_{\bar{A}^{*},\bar{B}^{*},\bar{C}^{*}}\sum_{k\in\mathscr{K}}Q_{G}(\bar{A},\bar{B},\bar{C})_{k}.$ (3)

The latent information that maximizes $Q(G)$ is obtained with a gradient-based procedure. The maximum of $Q(G)$ with respect to $\bar{A}$ , $\bar{B}$ , and $\bar{C}$ , for every document $d_{j}\in\mathscr{D}$ , term $w_{i}\in\mathscr{W}$ and edge $e_{j,i}\in\mathscr{E}$ in graph $G$ , is determined by setting the gradient to zero. In order to do so, we first maximize Eq. (1) with respect to $\bar{C}_{e_{j,i}}$ . Here, we constrain the values of vector $\bar{C}_{e_{j,i},k}$ such that $\sum_{k\in\mathscr{K}}\bar{C}_{e_{j,i},k}=1$ . Then, we form the Lagrangian by isolating the terms which contain $\bar{C}_{e_{j,i}}$ and adding the appropriate Lagrange multipliers.

$\displaystyle Q_{[\bar{C}_{e_{j,i},k}]}=\left(\sum_{k\in\mathscr{K}}f_{j,i}\bar{C}_{e_{j,i},k}\log{\left(\frac{[\bar{A}_{j}\odot\bar{B}_{i}]_{k}}{\bar{C}_{e_{j,i},k}}\right)}+\lambda\left(\sum_{l\in\mathscr{K}}\bar{C}_{e_{j,i},l}-1\right)\right),$ (4)

where we have dropped the arguments of $Q$ for simplicity, and the subscript $[\bar{C}_{e_{j,i}}]$ denotes that we have retained only those terms in $Q$ that are a function of $\bar{C}_{e_{j,i}}$ . Taking derivatives with respect to $\bar{C}_{e_{j,i},k}$ , we obtain:

$\displaystyle\frac{\partial Q}{\partial\bar{C}_{e_{j,i},k}}=f_{j,i}\left(\log{([\bar{A}_{j}\odot\bar{B}_{i}]_{k})}-\log{(\bar{C}_{e_{j,i},k})}-1+\frac{\lambda}{f_{j,i}}\right)$ (5)

Setting this derivative to zero yields the maximizing value of the edges vector $\bar{C}_{e_{j,i}}$ associated to graph $G$ ,

$\displaystyle\bar{C}_{e_{j,i}}\propto\bar{A}_{j}\odot\bar{B}_{i}.$ (6)

As the sum of the values in vector $\bar{C}_{e_{j,i}}$ must equal 1, we can normalize it as

$\displaystyle\bar{C}_{e_{j,i},k}=\frac{[\bar{A}_{j}\odot\bar{B}_{i}]_{k}}{\sum_{l\in\mathscr{K}}[\bar{A}_{j}\odot\bar{B}_{i}]_{l}}.$ (7)
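A small numeric instance of Eq. (7) may help fix ideas. The vectors below are made-up values for $K=3$ topics, and the helper name `edge_vector` is an assumption of this sketch.

```python
import numpy as np

def edge_vector(A_j, B_i):
    """Eq. (7): normalized Hadamard product of a document and a word vector."""
    product = A_j * B_i
    return product / product.sum()

A_j = np.array([0.5, 0.3, 0.2])   # illustrative document vector
B_i = np.array([0.1, 0.1, 0.4])   # illustrative word vector
C_ji = edge_vector(A_j, B_i)      # Hadamard product [0.05, 0.03, 0.08], normalized by 0.16
```

The edge vector concentrates on the third topic even though the document vector favors the first, because the word vector puts most of its weight there: the edge only supports topics on which both endpoints agree.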

Next, we maximize Eq. (1) with respect to $\bar{A}_{j}$ , the vector associated with document $d_{j}\in\mathscr{D}$ . It is not necessary to use a Lagrange multiplier to constrain the vector $\bar{A}_{j}$ because it is already constrained by the regularization term. The terms containing $\bar{A}_{j}$ are:

$\displaystyle Q_{[\bar{A}_{j,k}]}=\sum_{w_{i}\in\mathscr{W}_{d_{j}}}\left(f_{j,i}\bar{C}_{e_{j,i},k}\log{\bar{A}_{j,k}}\right)+\mathscr{R}(\bar{A}_{j,k},\alpha)$ (8)

where the subset $\mathscr{W}_{d_{j}}$ indicates the set of words linked to document $d_{j}$ in the bipartite graph $G$ .

We take the derivative with respect to $\bar{A}_{j,k}$ to obtain the following update equation:

$\displaystyle\frac{\partial Q_{[\bar{A}_{j,k}]}}{\partial\bar{A}_{j,k}}=\frac{1}{\bar{A}_{j,k}}\left(\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i},k}-\bar{A}_{j,k}+\alpha\right).$ (9)

Setting this equation to zero yields a maximum at:

$\displaystyle\bar{A}_{j}=\alpha+\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i}}.$ (10)
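The step from Eq. (8) to Eqs. (9) and (10) depends on the derivative of the regularization term in Eq. (2), which is left implicit. Spelled out using only the definitions above, with the two logarithmic terms cancelling:

```latex
\begin{align*}
\frac{\partial\mathscr{R}(\bar{A}_{j,k},\alpha)}{\partial\bar{A}_{j,k}}
  &= \underbrace{-\log\bar{A}_{j,k}+\frac{\alpha-\bar{A}_{j,k}}{\bar{A}_{j,k}}}_{\text{from }(\alpha-\bar{A}_{j,k})\log\bar{A}_{j,k}}
   + \underbrace{\log\bar{A}_{j,k}}_{\text{from }\bar{A}_{j,k}(\log\bar{A}_{j,k}-1)}
   = \frac{\alpha-\bar{A}_{j,k}}{\bar{A}_{j,k}}, \\[4pt]
\frac{\partial Q_{[\bar{A}_{j,k}]}}{\partial\bar{A}_{j,k}}
  &= \frac{1}{\bar{A}_{j,k}}\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i},k}
   + \frac{\alpha-\bar{A}_{j,k}}{\bar{A}_{j,k}} = 0
  \;\Longrightarrow\;
  \bar{A}_{j,k}=\alpha+\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i},k}.
\end{align*}
```

Summing the last expression over the $K$ topics and using $\sum_{k}\bar{C}_{e_{j,i},k}=1$ gives $\sum_{k}\bar{A}_{j,k}=K\alpha+\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}$ , so the total mass of a document vector is fixed by the document length and $\alpha$ .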

Finally, we maximize Eq. (1) with respect to $\bar{B}_{i}$ , the vector associated with word $w_{i}\in\mathscr{W}$ . To maximize with respect to $\bar{B}_{i}$ , we isolate the relevant terms and add Lagrange multipliers:

$\displaystyle Q_{[\bar{B}_{i}]}=\sum_{k\in\mathscr{K}}\left(\sum_{d_{j}\in\mathscr{D}}f_{j,i}\bar{C}_{e_{j,i},k}\log{\bar{B}_{i,k}}+\lambda_{k}\left(\sum_{w_{v}\in\mathscr{W}}\bar{B}_{v,k}-1\right)\right)$ (11)

Taking the derivative of $Q_{[\bar{B}_{i}]}$ with respect to $\bar{B}_{i,k}$ , we have

$\displaystyle\frac{\partial Q_{[\bar{B}_{i}]}}{\partial\bar{B}_{i,k}}=\sum_{d_{j}\in\mathscr{D}}\frac{f_{j,i}\bar{C}_{e_{j,i},k}}{\bar{B}_{i,k}}+\lambda_{k}$ (12)

Setting this derivative to zero gives $\bar{B}_{i,k}=-\frac{1}{\lambda_{k}}\sum_{d_{j}\in\mathscr{D}}f_{j,i}\bar{C}_{e_{j,i},k}$ . Since $\lambda_{k}$ does not depend on the word index $i$ and its value is fixed by the constraint $\sum_{w_{i}\in\mathscr{W}}\bar{B}_{i,k}=1$ , we can ignore $\lambda_{k}$ and first estimate an un-normalized value of $\bar{B}_{i,k}$ :

$\displaystyle\hat{B}_{i,k}\propto\sum_{d_{j}\in\mathscr{D}}f_{j,i}\bar{C}_{e_{j,i},k}$ (13)

Normalizing the value of $\hat{B}_{i,k}$ over all words $w_{v}$ in the vocabulary, we have $\bar{B}_{i,k}=\frac{\hat{B}_{i,k}}{\sum_{w_{v}\in\mathscr{W}}\hat{B}_{v,k}}$ .
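The word update of Eq. (13) and its normalization can be written compactly for a small dense frequency matrix. The sketch below assumes a $D\times W$ array `F` of frequencies, a $D\times K$ array `A` and a $W\times K$ array `B`; the function name and the dense layout are choices made for the example, and a real corpus would normally call for a sparse edge list instead.

```python
import numpy as np

def update_word_vectors(F, A, B):
    """One word update: Eq. (7) on every pair, then Eq. (13), then normalization."""
    # C[j, i, :] is the normalized Hadamard product of A_j and B_i (Eq. 7).
    C = A[:, None, :] * B[None, :, :]
    C /= C.sum(axis=2, keepdims=True)
    # Eq. (13): pairs with f_{j,i} = 0 (no edge) contribute nothing to the sum.
    B_hat = np.einsum("ji,jik->ik", F, C)
    # Normalize every topic k over all words in the vocabulary.
    return B_hat / B_hat.sum(axis=0, keepdims=True)

# Illustrative inputs: 2 documents, 3 words, 2 topics, uniform vectors.
F = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
A = np.full((2, 2), 0.5)
B = np.full((3, 2), 1.0 / 3.0)
B_new = update_word_vectors(F, A, B)
```

With uniform starting vectors every edge vector is uniform as well, so each word simply receives a share proportional to its total frequency; the topics only separate once the vectors differ.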

The iterative updates in Eqs (7), (10) and (13) that maximize Eq. (3) are the basis for the Propagation in Bipartite Graph (PBG) algorithm described in the next section.

3.2 Propagation in Bipartite Graph (PBG)

The idea of the PBG algorithm is to propagate latent information throughout the vertices’ neighborhoods. We assume that the latent information vectors of terms and documents are randomly initialized. The iterative updates are performed in two different manners: (1) local updates, which account for propagations through the neighbourhood of each vertex, and (2) global updates, which propagate latent information throughout the entire bipartite graph and can be interpreted as a spreading of the information from local to global structures of the bipartite graph. The PBG algorithm is summarized in Algorithm 1. The local propagation is described in Algorithm 2 and the global propagation is described in Algorithm 3.

Algorithm 1: PBG algorithm

Input: bipartite graph $G$ ; set of documents $\mathscr{D}$ ; concentration parameter $\alpha$
Output: latent vectors $\bar{A},\bar{B},$ and $\bar{C}$ , i.e. the latent information assigned to each document $d_{j}\in\mathscr{D}$ , word $w_{i}\in\mathscr{W}$ , and edge $e_{j,i}\in\mathscr{E}$
    Initialize vector $\bar{A}_{j}$ for each document $d_{j}\in\mathscr{D}$
    Initialize vector $\bar{B}_{i}$ for each word $w_{i}\in\mathscr{W}$
    while not convergence do
        foreach $d_{j}\in\mathscr{D}$ do
            while $\bar{A}_{j}$ not convergence do
                $\bar{A}_{j}\leftarrow$ localPropag( $G$ , $d_{j}$ , $\bar{A}_{j}$ , $\bar{B}$ , $\mathscr{D}$ )
            end
        end
        $\bar{B}\leftarrow$ globalPropag( $G$ , $\bar{A}$ , $\bar{B}$ )
    end

Algorithm 2: Local Propagation

function localPropag( $G$ , $d_{j}$ , $\bar{A}_{j}$ , $\bar{B}$ , $\mathscr{D}$ )
    foreach edge $e_{j,i}$ incident in $d_{j}$ do
        $\bar{C}_{e_{j,i}}\leftarrow\frac{(\bar{A}_{j}\odot\bar{B}_{i})}{\sum_{k\in\mathscr{K}}(\bar{A}_{j}\odot\bar{B}_{i})_{k}}$
    end
    $\bar{A}_{j}\leftarrow\alpha+\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i}}$
    return $\bar{A}_{j}$

Algorithm 3: Global Propagation

function globalPropag( $G$ , $\bar{A}$ , $\bar{B}$ )
    foreach vertex $w_{i}\in\mathscr{W}$ do
        foreach edge $e_{j,i}$ incident in $w_{i}$ do
            $\bar{C}_{e_{j,i}}\leftarrow\frac{(\bar{A}_{j}\odot\bar{B}_{i})}{\sum_{k}(\bar{A}_{j}\odot\bar{B}_{i})_{k}}$
        end
        $\bar{B}_{i}\leftarrow\sum_{d_{j}\in\mathscr{D}}f_{j,i}\bar{C}_{e_{j,i}}$
    end
    foreach vertex $w_{i}\in\mathscr{W}$ do
        foreach $k\in\mathscr{K}$ do
            $\bar{B}_{i,k}=\frac{\bar{B}_{i,k}}{\sum_{w_{p}\in\mathscr{W}}\bar{B}_{p,k}}$
        end
    end
    return $\bar{B}$

The propagation procedure of the PBG algorithm (Algorithm 1) takes as input the set of documents $\mathscr{D}$ , a bipartite graph $G$ and a concentration parameter $\alpha$ . Initially, for each vertex $d_{j}\in\mathscr{D}$ and $w_{i}\in\mathscr{W}$ connected by an edge $e_{j,i}\in\mathscr{E}$ , the algorithm randomly initializes the corresponding latent information vectors $\bar{A}_{j}$ and $\bar{B}_{i}$ such that $\sum_{k\in\mathscr{K}}\bar{A}_{j,k}=1$ for every document $d_{j}\in\mathscr{D}$ , and $\sum_{w_{i}\in\mathscr{W}}\bar{B}_{i,k}=1$ for every topic $k\in\mathscr{K}$ . Then, the local propagation is performed for each edge $e_{j,i}$ incident to the vertex $d_{j}$ . This procedure creates a $K$ -dimensional vector $\bar{C}_{e_{j,i}}$ as the result of the Hadamard product of $\bar{A}_{j}$ and $\bar{B}_{i}$ , $\bar{C}_{e_{j,i}}=\bar{A}_{j}\odot\bar{B}_{i}$ . The latent information vector $\bar{C}_{e_{j,i}}$ is normalized such that $\sum_{k\in\mathscr{K}}\bar{C}_{e_{j,i},k}=1$ , and the values are propagated back to vector $\bar{A}_{j}$ , as described in Eq. (10). The local propagation is repeated for each vertex $d_{j}$ while the entries in $\bar{A}_{j}$ are changing. The parameter $\alpha$ controls the concentration degree of vector $\bar{A}_{j}$ . Figure 1 illustrates the local propagation to the vertex representing the document $d_{1}$ .

The global propagation is performed for every vertex $w_{i}\in\mathscr{W}$ , and for each edge $e_{j,i}$ incident on vertex $w_{i}$ . This procedure also creates a $K$ -dimensional vector $\bar{C}_{e_{j,i}}$ given by the Hadamard product of $\bar{A}_{j}$ and $\bar{B}_{i}$ . Vector $\bar{C}_{e_{j,i}}$ is normalized such that $\sum_{k\in\mathscr{K}}\bar{C}_{e_{j,i},k}=1$ and the values are propagated back to vectors $\bar{B}_{i}$ , as described in Eq. (13). The latent information vectors $\bar{B}_{i}$ are normalized over all vertices $w_{p}\in\mathscr{W}$ . Figure 2 illustrates the global propagation related to vertex $w_{1}$ of bipartite graph $G$ . We hatched the set of vertices $\{w_{1},\ldots,w_{n}\}$ to indicate that each line in matrix $B$ is normalized independently.

Figure 2.

Global propagation to word $w_{1}$ . It is related to Eq. (13), where the vectors associated to documents (and edges) are propagated back to word vector. This procedure is described in Algorithm 3.

The rationale behind the proposed algorithm is to locally concentrate the latent information of each word of a document $d_{j}$ into vector $\bar{A}_{j}$ . Then the algorithm globally concentrates the influence of all documents into vector $\bar{B}_{i}$ . When vectors $\bar{B}_{i}$ , for all words $w_{i}$ , are updated, each entry $k\in\mathscr{K}$ of $\bar{B}_{i}$ is normalized. This normalization gives the probability that the word is assigned to topic $k$ .

We apply the local and global propagations until a maximum number of iterations is reached or until the latent information of documents remains the same in two successive iterations.
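The complete procedure can be sketched in a few lines of NumPy. This is a minimal dense-matrix illustration written for this text, not the implementation used in the experiments: the function names, the default values of `alpha`, `tol` and the iteration limits, and the mean-absolute-change stopping test are all assumptions. The inner loop of Algorithm 1 is folded into `local_propagation`.

```python
import numpy as np

def local_propagation(f_j, A_j, B, alpha, max_iter=50, tol=1e-6):
    """Update one document vector with Eqs. (7) and (10) until it stops changing."""
    words = np.nonzero(f_j)[0]               # words linked to document d_j
    for _ in range(max_iter):
        C = A_j * B[words]                   # Hadamard products, one row per edge
        C /= C.sum(axis=1, keepdims=True)    # Eq. (7)
        new_A = alpha + f_j[words] @ C       # Eq. (10)
        converged = np.abs(new_A - A_j).mean() < tol
        A_j = new_A
        if converged:
            break
    return A_j

def global_propagation(F, A, B):
    """Update every word vector with Eq. (13), then normalize each topic over words."""
    new_B = np.zeros_like(B)
    for i in range(F.shape[1]):
        docs = np.nonzero(F[:, i])[0]        # documents linked to word w_i
        if docs.size == 0:
            continue
        C = A[docs] * B[i]
        C /= C.sum(axis=1, keepdims=True)    # Eq. (7)
        new_B[i] = F[docs, i] @ C            # Eq. (13)
    return new_B / new_B.sum(axis=0, keepdims=True)

def pbg(F, K, alpha=0.05, max_iter=100, tol=1e-6, seed=0):
    """Interleave local and global propagation on a D x W frequency matrix F."""
    rng = np.random.default_rng(seed)
    D, W = F.shape
    A = rng.random((D, K))
    A /= A.sum(axis=1, keepdims=True)        # each document vector sums to 1
    B = rng.random((W, K))
    B /= B.sum(axis=0, keepdims=True)        # each topic sums to 1 over words
    for _ in range(max_iter):
        old_A = A.copy()
        for j in range(D):
            A[j] = local_propagation(F[j], A[j], B, alpha)
        B = global_propagation(F, A, B)
        if np.abs(A - old_A).mean() < tol:   # document vectors no longer change
            break
    return A, B

# Toy frequency matrix: 4 documents, 4 words (illustrative data).
F = np.array([[2, 1, 0, 0], [1, 3, 0, 0], [0, 0, 2, 2], [0, 1, 1, 3]], dtype=float)
A, B = pbg(F, K=2)
```

The edge vectors are recomputed on the fly rather than stored, which keeps the memory footprint at the size of `A` and `B`; an implementation that needs $\bar{C}$ as output would keep them per edge.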

The complexity of the PBG algorithm is determined by the maximum number of local propagations $T_{\textit{local}}$ , the maximum number of interleavings between global and local propagations $T$ , the number of documents $D$ , the number of terms $W$ , the average number of terms per document $\hat{N}$ , and the number of topics $K$ . The local propagation is usually fast because it iterates over the terms of only one document. Thus, the complexity of the PBG algorithm is $O(T\times D\times K\times((T_{\textit{local}}\times\hat{N})+W))$ .
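To get a feeling for the magnitude of this bound, the expression can be evaluated for a hypothetical corpus. All parameter values below are invented for illustration and the helper name `pbg_cost` is an assumption; the result is an operation count up to a constant factor, not a running time.

```python
def pbg_cost(T, D, K, T_local, N_avg, W):
    """Evaluate T * D * K * (T_local * N_avg + W), the bound stated above."""
    return T * D * K * (T_local * N_avg + W)

# Hypothetical setting: 50 outer iterations, 1000 documents, 20 topics,
# 10 local iterations, 100 terms per document on average, 5000 vocabulary terms.
cost = pbg_cost(T=50, D=1000, K=20, T_local=10, N_avg=100, W=5000)
```

In this made-up setting the vocabulary term $W$ contributes five times more than the local term $T_{\textit{local}}\times\hat{N}$ , so under this bound the vocabulary size, not the number of local iterations, dominates the cost.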

3.3 Improving the PBG algorithm

The PBG algorithm is an approximate solution to the optimization problem established by LDA with variational inference and by NMF with KL divergence. At the same time, it is a heuristic method based on label propagation. Thus, it can take advantage of any label propagation heuristic and of the trivial representation of textual documents as a bipartite graph. Here, we show two examples of using heuristics to improve both performance and quality of results. The first improvement is the generation of an initial solution. Initial labels can be provided by any available heuristic; here they were created using clustering algorithms. The second improvement takes advantage of the local structure of the graph to process documents in parallel.

3.3.1 Initializing latent information

Here, we describe how to create initial values for the vectors $\bar{A}$ associated with the vertices representing documents. There are several algorithms in the literature for document clustering [35, 28, 15, 41, 19], and any of them can be used to improve the initial values of the vectors $\bar{A}$ .

Thus, in order to assign good initial values, we have used an initialization based on the clustering of documents. In general, the proposed initialization applied by the PBG algorithm can be described as: (1) compute the clustering $\pi$ of documents into $K$ groups, such that $\pi(d_{j})$ assigns a document $d_{j}$ to group $k$ , i.e. $\pi(d_{j})=k$ for $1\leqslant k\leqslant K$ ; (2) if $\pi(d_{j})=k$ , assign the value 1 to the $k$ -th dimension of vector $\bar{A}_{j}$ , and 0 otherwise; (3) propagate the values in vector $\bar{A}_{j}$ to the vectors $\bar{B}_{i}$ associated with each word $w_{i}$ connected to a document $d_{j}$ by the bipartite graph schema; (4) use the vectors $\bar{A}_{j}$ and $\bar{B}_{i}$ as the initial labels of the PBG algorithm.
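Steps (2) and (3) can be sketched as follows, given cluster labels produced by any clustering algorithm. The propagation rule used for step (3), $\bar{B}_{i,k}\propto\sum_{j}f_{j,i}\bar{A}_{j,k}$ normalized over words, is an assumption of this example, since the text does not fix the exact rule; the function name and the dense matrix `F` are likewise illustrative.

```python
import numpy as np

def init_from_clusters(F, labels, K):
    """Build initial A (one-hot) and B from hard cluster labels (hypothetical helper)."""
    D, W = F.shape
    A = np.zeros((D, K))
    A[np.arange(D), labels] = 1.0             # step (2): 1 in the k-th dimension, 0 otherwise
    B = F.T @ A                               # step (3), assumed rule: sum_j f_{j,i} A_{j,k}
    col_sums = B.sum(axis=0, keepdims=True)
    B = B / np.where(col_sums > 0, col_sums, 1.0)   # normalize each topic over words
    return A, B

# Illustrative data: 2 documents, 3 words, one cluster per document.
F = np.array([[2.0, 1.0, 0.0], [0.0, 1.0, 3.0]])
A0, B0 = init_from_clusters(F, labels=[0, 1], K=2)
```

A word shared by documents of different clusters, such as the second word here, starts with weight on several topics, whereas a word seen in a single cluster starts on that topic only.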

Regarding the clustering algorithm used to initialize vectors $\bar{A}_{j}$ , it is preferable to use efficient methods to not overload the whole process. Note that it is possible to assign more than one cluster to vector $\bar{A}_{j}$ , as long as the clustering algorithm is able to find overlapping clusters.

Here we adopted the $K$-means algorithm [26] and the Hierarchical Link Clustering (HLC) algorithm [1] for document clustering. These algorithms use bag-of-words and graph representations of the corpus, respectively. For the $K$-means algorithm, documents were mapped to a feature space in which each feature corresponds to a word in the vocabulary. For the HLC algorithm, a homogeneous graph was created with one vertex per document and edges linking each document to its $R$ nearest documents, which were obtained by the cosine distance between the feature vectors of the documents. In addition to the simple graph representation of documents, the HLC algorithm was chosen because it provides an efficient greedy approach to find the initial document clustering.
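The construction of the document graph used as input to HLC can be sketched as follows. This is only an illustration under stated assumptions: the function names and the toy corpus are hypothetical, edges are treated as undirected, and the nearest documents are ranked by cosine similarity (the highest similarity corresponds to the smallest cosine distance).

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two sparse bag-of-words vectors (dicts)."""
    dot = sum(f * v.get(w, 0) for w, f in u.items())
    norm_u = math.sqrt(sum(f * f for f in u.values()))
    norm_v = math.sqrt(sum(f * f for f in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor_graph(docs, R):
    """Link each document to its R most similar documents (undirected edges)."""
    edges = set()
    for j, u in docs.items():
        ranked = sorted((o for o in docs if o != j),
                        key=lambda o: cosine_similarity(u, docs[o]),
                        reverse=True)
        for o in ranked[:R]:
            edges.add(frozenset((j, o)))
    return edges


docs = {
    "d1": {"graph": 2, "vertex": 1},
    "d2": {"graph": 1, "edge": 1},
    "d3": {"topic": 2, "word": 1},
    "d4": {"topic": 1, "model": 1},
}
edges = nearest_neighbor_graph(docs, R=1)
```

The pairwise comparison is quadratic in the number of documents, so a real collection would likely need an inverted index or an approximate nearest-neighbour search.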

3.3.2 Parallel PBG for topic extraction

Another modification is a parallel algorithm for topic extraction from textual documents. The proposed parallel PBG scheme reduces the computational load of local propagation by running the PBG algorithm on subgraphs created from subsets of documents. We split the set of documents into $t$ subsets such that $\mathscr{D}=\mathscr{D}_{1}\cup\ldots\cup\mathscr{D}_{t}$, and each subset $\mathscr{D}_{r}$ induces a subgraph $G_{r}$. Thus, it is possible to apply local propagation to each subgraph $G_{r}$.

On the other hand, global propagation cannot be parallelized in the same way, since it requires one iteration over the structure of the bipartite graph representing the entire collection of documents. Therefore, global propagation is performed after all local propagation threads have finished. Before global propagation, the results of each thread have to be joined to describe the overall local propagation in the entire graph. Although this operation appears costly, the global propagation iterates only once over the entire structure of the bipartite graph. The parallel version of the PBG algorithm is summarized in Algorithm 3.3.2.

Algorithm 3.3.2. Parallel PBG algorithm

Input: bipartite graph $G$
       $\mathscr{D}$ // set of documents
       $\alpha$ // concentration parameter
       $t$ // number of threads
Output: latent vectors $\bar{A}$, $\bar{B}$, and $\bar{C}$

Initialize vectors $\bar{A}_{j}$ for each document $d_{j}\in\mathscr{D}$
Initialize vectors $\bar{B}_{i}$ for each word $w_{i}\in\mathscr{W}$
while convergence or stop criterion not achieved do
    // split the set of documents $\mathscr{D}$ in $t$ subsets
    $\mathscr{D}\leftarrow\{\mathscr{D}_{1},\ldots,\mathscr{D}_{t}\}$
    foreach $\mathscr{D}_{p}\in\mathscr{D}$ do
        let $\mathscr{A}_{p}$ be the set of vectors assigned to the documents in $\mathscr{D}_{p}$
        create the subgraph $G_{p}$ induced by the documents in $\mathscr{D}_{p}$
        run thread localPropag($G_{p}$, $\mathscr{D}_{p}$, $\mathscr{A}_{p}$, $\bar{B}$)
    end
    wait threads
    $\bar{B}\leftarrow$ globalPropag($G$, $\bar{A}$, $\bar{B}$)
end
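The control flow of the parallel scheme can be sketched in Python with a thread pool. This is a simplified illustration, not the authors' implementation. The function names, the dictionary-based graph, and the toy corpus are hypothetical. The local step is assumed to be a single pass of the edge and document updates (items a and b of Table 1), and the global step is assumed to normalize each topic over the vocabulary. In standard CPython, threads mainly illustrate the structure; a real speed-up for this pure-Python code would need processes or native code.

```python
from concurrent.futures import ThreadPoolExecutor


def local_propag(doc_ids, docs, A, B, alpha, K):
    """Local propagation for one subset of documents (reads B, returns new A and C)."""
    new_A, C = {}, {}
    for j in doc_ids:
        for w in docs[j]:
            prod = [A[j][k] * B[w][k] for k in range(K)]
            total = sum(prod) or 1.0
            C[(j, w)] = [p / total for p in prod]          # edge vector
        new_A[j] = [alpha + sum(f * C[(j, w)][k] for w, f in docs[j].items())
                    for k in range(K)]                      # document vector
    return new_A, C


def global_propag(docs, C, K):
    """Global propagation: rebuild every word vector from the edge vectors C."""
    B = {}
    for j, words in docs.items():
        for w, f in words.items():
            vec = B.setdefault(w, [0.0] * K)
            for k in range(K):
                vec[k] += f * C[(j, w)][k]
    for k in range(K):  # assumption: normalize each topic over the vocabulary
        total = sum(vec[k] for vec in B.values()) or 1.0
        for vec in B.values():
            vec[k] /= total
    return B


def parallel_pbg(docs, A, B, alpha, K, t, n_iter):
    ids = list(docs)
    subsets = [ids[p::t] for p in range(t)]                 # split D into t subsets
    C = {}
    for _ in range(n_iter):
        with ThreadPoolExecutor(max_workers=t) as pool:
            results = list(pool.map(
                lambda s: local_propag(s, docs, A, B, alpha, K), subsets))
        C = {}
        for new_A, new_C in results:                        # join the thread results
            A.update(new_A)
            C.update(new_C)
        B = global_propag(docs, C, K)                       # single global pass
    return A, B, C


docs = {
    "d1": {"graph": 2, "vertex": 1},
    "d2": {"graph": 1, "edge": 3},
    "d3": {"topic": 2, "word": 1},
    "d4": {"topic": 1, "model": 2},
}
vocab = sorted({w for words in docs.values() for w in words})
A0 = {j: [1.0, 1.0] for j in docs}
B0 = {w: [1.0 + 0.1 * i, 1.0 + 0.1 * (len(vocab) - i)] for i, w in enumerate(vocab)}
A, B, C = parallel_pbg(docs, A0, B0, alpha=0.05, K=2, t=2, n_iter=5)
```

Each worker only reads the shared vectors and returns its own results, so no locking is needed; the join happens once per outer iteration, before the global step.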

4. Comparing PBG with LDA and NMF

The PBG can be related to technical aspects of NMF and Variational Bayesian (VB) Inference algorithm for LDA. Here we discuss this equivalence.

The equivalence between NMF and PLSI (Probabilistic Latent Semantic Indexing) has been discussed in several works [9, 17]. Ding and colleagues [14] demonstrate that NMF and PLSI optimize the same objective function, although they are different algorithms and converge to different local minima. Girolami and Kabán [18] demonstrate that PLSI is a maximum-a-posteriori estimate of LDA under a uniform Dirichlet prior, so that LDA can be seen as a full Bayesian counterpart of PLSI.

Here, we extend the equivalence between NMF and LDA described in [12] and indicate the computational aspects that were used to found our proposed algorithm. In order to relate these methods, we describe NMF and LDA focusing on their algorithmic point of view and optimization procedures.

4.1 Latent Dirichlet Allocation (LDA)

LDA is a generative topic model for documents. The basic idea is that documents are represented as random mixtures over latent topics, where each topic $k$ is characterized by a distribution over words [7]. In a simplified LDA formulation, the word probabilities are parametrized by a $K\times W$ matrix $\beta$. A topic $k$ ($1\leqslant k\leqslant K$) is a discrete distribution over words with probability vector $\beta_{k}$. Each document $d_{j}$ maintains a separate distribution $\theta_{j}$ that describes the contribution of each topic. Implementations of LDA inference algorithms typically use a symmetric Dirichlet prior over $\theta=\{\theta_{1},\ldots,\theta_{D}\}$, in which the concentration parameter $\alpha$ is fixed. Note that $\theta$ can be interpreted as a $D\times K$ matrix relating documents and topics. Moreover, the topic distribution of a document $d_{j}$ and a word token $w_{i,j}^{n}$ are associated through a topic assignment variable $z_{j,w_{i,j}^{n}}$.

Given the parameters $\alpha$ and $\beta$, the joint distribution of the topic mixtures $\theta$, the topic assignments $z$, and the words $w$ is given by

$\displaystyle p(\theta,z,w|\alpha,\beta)=\prod_{d_{j}\in\mathscr{D}}p(\theta_{j}|\alpha)\prod_{n=1}^{N_{j}}p(z_{j,w_{i,j}^{n}}|\theta_{j})p(w_{i,j}^{n}|z_{j,w_{i,j}^{n}},\beta),$ (14)

where $N_{j}$ is the number of word tokens in document $d_{j}$.

A wide variety of approximate inference algorithms can be considered for LDA, including variational inference and Markov chain Monte Carlo (MCMC). Here, we describe the variational inference algorithm because it is formulated as an optimization problem, as is the NMF algorithm.

The main idea behind the variational method is to use a distribution with its own parameters in place of the posterior distribution $p(\theta,z|w,\alpha,\beta)$. This variational distribution for LDA is described as

$\displaystyle q(\theta_{j},z_{j}|\gamma_{j},\varphi_{j})=q(\theta_{j}|\gamma_{j})\prod_{n=1}^{N_{j}}q(z_{j,w_{i,j}^{n}}|\varphi_{j,n}),$ (15)

where $\gamma_{j}$ and $\varphi_{j}$ are the variational parameters corresponding, respectively, to the LDA distributions $\theta_{j}$ and $z_{j}$. The values of the variational parameters are chosen by an optimization procedure that attempts to minimize the KL-divergence between the variational distribution and the true posterior $p(\theta,z|w,\alpha,\beta)$. It can be translated directly into the following optimization problem

$\displaystyle(\gamma^{*},\varphi^{*})=\operatorname*{arg\,min}_{\gamma,\varphi}KL(q(\theta,z|\gamma,\varphi)\|p(\theta,z|w,\alpha,\beta)).$ (16)

In fact, it is not possible to minimize the KL-divergence directly. However, bounding the log likelihood of a document, $p(w|\alpha,\beta)$ , and using Jensen’s inequality [20] it is possible to show that minimizing the KL-divergence between the variational distribution and the true posterior distribution is equivalent to maximizing the Evidence Lower Bound (ELBO) with respect to variational parameters. The ELBO is defined by the difference between the variational expectation of real posterior distribution and the variational distribution [7],

$\displaystyle\mathscr{L}=E_{q}[\log{p(\theta,z,w|\alpha,\beta)}]-E_{q}[\log{q(\theta,z)}].$ (17)

The ELBO $\mathscr{L}$ can be optimized using coordinate ascent over the variational parameters (detailed derivation in [7]):

$\displaystyle\varphi_{j,i,k}\propto\beta_{k,i}\exp\left(E_{q}[\log(\theta_{j,k})|\gamma]\right),$ (18)

$\displaystyle\gamma_{j,k}=\alpha+\sum_{i=1}^{W}f_{j,i}\varphi_{j,i,k},$ (19)

$\displaystyle\beta_{k,i}\propto\sum_{j=1}^{D}f_{j,i}\varphi_{j,i,k},$ (20)

where $f_{j,i}$ is the number of occurrences of word $w_{i}$ in document $d_{j}$. The expectation in the multinomial update can be computed as

$\displaystyle E_{q}[\log(\theta_{j,k})]=\psi(\gamma_{j,k})-\psi\left(\sum_{\hat{k}=1}^{K}\gamma_{j,\hat{k}}\right),$ (21)

where $\psi$ denotes the digamma function.
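The update equations (18)–(21) can be sketched in pure Python as follows. This is a simplified illustration, not the reference LDA implementation: it performs a single pass over the three updates per call, whereas the usual variational algorithm iterates the $\varphi$ and $\gamma$ updates for each document until convergence before updating $\beta$. The function names and the toy data are hypothetical, and the digamma function is approximated by a hand-written helper based on the recurrence $\psi(x)=\psi(x+1)-1/x$ and the standard asymptotic series.

```python
import math


def digamma(x):
    """Digamma function psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # psi(x) ~ ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def lda_vb_step(F, beta, gamma, alpha):
    """One pass of Eqs. (18)-(20). F[j][i] = f_{j,i}; beta[k][i]; gamma[j][k]."""
    D, W, K = len(F), len(F[0]), len(beta)
    new_gamma = [[alpha] * K for _ in range(D)]
    new_beta = [[0.0] * W for _ in range(K)]
    for j in range(D):
        psi_sum = digamma(sum(gamma[j]))
        e_log_theta = [digamma(gamma[j][k]) - psi_sum for k in range(K)]      # Eq. (21)
        for i in range(W):
            if F[j][i] == 0:
                continue
            phi = [beta[k][i] * math.exp(e_log_theta[k]) for k in range(K)]   # Eq. (18)
            total = sum(phi)
            for k in range(K):
                phi_k = phi[k] / total
                new_gamma[j][k] += F[j][i] * phi_k                            # Eq. (19)
                new_beta[k][i] += F[j][i] * phi_k                             # Eq. (20)
    # Normalize each topic so that beta_k is a distribution over words.
    new_beta = [[b / sum(row) for b in row] for row in new_beta]
    return new_gamma, new_beta


F = [[2, 1, 0, 0], [1, 3, 0, 0], [0, 0, 2, 2]]      # toy document-term counts
beta = [[0.4, 0.3, 0.2, 0.1], [0.1, 0.2, 0.3, 0.4]]
gamma = [[1.0, 1.0] for _ in F]
for _ in range(10):
    gamma, beta = lda_vb_step(F, beta, gamma, alpha=0.5)
```

Because $\varphi_{j,i}$ sums to one over the topics, each row of `gamma` sums to $K\alpha$ plus the number of tokens in the document.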

This section gave a brief description of the derivation of LDA by the variational inference method. Since we are interested in the computational aspects of the LDA inference algorithm, we focus on the update equations in Eqs (18)–(20). In the next section, we briefly describe the derivation of the NMF update equations. Based on these descriptions, we make a comparative analysis of the two methods and justify our method based on bipartite graphs.

4.2 Nonnegative matrix factorization

The NMF method approximately factorizes a matrix with non-negative elements into two matrices that also have non-negative elements. NMF for documents factorizes a document-term matrix $\bm{F}=[f_{j,i}]$, with dimension $D\times W$, where each entry $f_{j,i}$ is the frequency of word $w_{i}$ in document $d_{j}$, into two matrices $\bm{A}$ and $\bm{B}$ such that $\bm{F}\approx\bm{AB}$, where $\bm{A}=[a_{j,k}]$ is a $D\times K$ matrix and $\bm{B}=[b_{k,i}]$ is a $K\times W$ matrix.

The factor matrices $\bm{A}$ and $\bm{B}$ are obtained by optimizing a cost function which can be set by using some distance measure. There are different types of cost functions [24]. Here, we are interested in NMF with KL-Divergence, defined as

$\displaystyle Q_{\textit{KL-NMF}}=\sum_{j,i}\left(f_{j,i}\log{\frac{f_{j,i}}{[\bm{AB}]_{j,i}}}-f_{j,i}+[\bm{AB}]_{j,i}\right),$ (22)

where $[\bm{AB}]_{j,i}$ is the entry in row $j$ and column $i$ of the product matrix $\bm{A}\bm{B}$.

The simplest technique to solve the optimization problem of Eq. (22) is gradient descent. A gradient descent-based method can be implemented by the following “multiplicative update rules”

$\displaystyle a_{j,k}=a_{j,k}\frac{\sum_{i}b_{k,i}f_{j,i}/[\bm{AB}]_{j,i}}{\sum_{q}{b_{k,q}}},$ (23)

$\displaystyle b_{k,i}=b_{k,i}\frac{\sum_{j}a_{j,k}f_{j,i}/[\bm{AB}]_{j,i}}{\sum_{p}a_{p,k}}.$ (24)

The update rules in Eqs (23) and (24) describe the computational aspects of NMF in algorithmic terms. Correspondingly, the computational aspects of the LDA inference algorithm (Eqs (18)–(20)) can be compared with NMF in terms of their update rules. This comparison is detailed in the next section, where we also demonstrate the similarity of these methods to our proposed method, PBG.

4.3 Computational aspects between PBG, NMF, and LDA

The correspondence between NMF-KL and the variational inference algorithm for LDA follows from the fact that both try to minimize the divergence between the word frequencies and the document-topic and topic-word statistics. To clarify the relationship between NMF and LDA, we describe NMF-KL as a relaxation of the variational problem. The equivalence is reached when relaxations of the functions $\log{\Gamma(\cdot)}$ and $\psi(\cdot)$ are considered in the LDA derivations. This equivalence is stated in Theorem 1, whose mathematical proof is presented in [12].

Theorem 1. The objective function of NMF with KL-divergence is an approximation of the ELBO $\mathscr{L}$ of LDA with symmetric Dirichlet priors [12].

In practice, LDA and NMF use iterative algorithms to reach a feasible solution. Theoretically, these updates are based on distinct methods and different mathematical foundations. However, we can indicate similarities between the update equations of NMF-KL, Eqs (23) and (24), and those of LDA with variational inference, Eqs (18)–(20). These similarities are the foundation of the PBG algorithm and its graph propagation procedure.

Figure 3.

Plot of the linear function $f(x)=x-0.48$ and the function $f(x)=\exp(\psi(x))$. It indicates that the exponential of the digamma function is approximately linear when $x>0.5$, i.e. $\exp(\psi(x))\approx x-0.48$ if $x>0.5$.
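The approximation shown in the figure can be checked numerically with a short script. The sketch below is only illustrative: the digamma helper is a hand-written approximation (recurrence plus the standard asymptotic series), and the function name `gap` and the sample points are arbitrary choices.

```python
import math


def digamma(x):
    """Digamma function psi(x) for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1 / 12 - inv2 * (1 / 120 - inv2 / 252))


def gap(x):
    """Absolute difference between exp(psi(x)) and the line x - 0.48."""
    return abs(math.exp(digamma(x)) - (x - 0.48))


for x in (0.5, 1.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}   exp(psi(x)) = {math.exp(digamma(x)):8.4f}   "
          f"x - 0.48 = {x - 0.48:8.4f}   gap = {gap(x):.4f}")
```

Printing the gap at a few points shows how close the two curves are over the range that matters for the variational parameters.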

Firstly, we rewrite the value of $\varphi_{j,i,k}$, which corresponds to the topic statistics of each word in each document in LDA, to obtain the same statistics in terms of NMF variables, $\varphi_{j,i,k}^{(\textit{NMF})}$. In the update rule for LDA (Eq. (18)), the exponential of the digamma function $\psi(x)$ approximates a linear function when $x>0.5$ [29], as we can see in Fig. 3. Thus, the value of $\varphi_{j,i,k}$ is similar to a normalization over the vectors of the distribution $\beta$ and the variational distribution $\gamma$. Therefore, it is possible to approximate the value of $\varphi$ with the simple operation

$\displaystyle\varphi_{j,i,k}\approx\beta_{k,i}\times\frac{\gamma_{j,k}}{\sum_{k^{*}=1}^{K}\gamma_{j,k^{*}}}.$ (25)

Thus, the value $\varphi_{j,i}$ approximates the Hadamard product of the normalized vector $\gamma_{j}$ and the $i$-th column of $\beta$. The resulting factor matrix $\bm{A}$ is closely related to the document-topic variational distribution $\gamma$, and the resulting factor matrix $\bm{B}$ is closely related to the topic-word distribution $\beta$. Thus, considering these relationships, we can approximate the update of the variational parameter $\varphi$ as

$\displaystyle\varphi_{j,i,k}\approx\varphi_{j,i,k}^{(\textit{NMF})}\propto\left(\frac{a_{j,k}b_{k,i}}{\sum_{k^{*}=1}^{K}a_{j,k^{*}}b_{k^{*},i}}\right).$ (26)

Now, we describe the equivalence between the document-topic value $a_{j,k}$ in NMF and its corresponding statistic $\gamma_{j,k}$ in LDA. Without loss of generality, we can consider a row-wise normalization of the factor matrix $\bm{B}$, such that $\sum_{i=1}^{W}b_{k,i}=1$. Then, using Eq. (26), we can rewrite the update of factor $a_{j,k}$ in Eq. (23) as

$\displaystyle a_{j,k}=\sum_{i=1}^{W}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}.$ (27)

Note that the update equation of factor $a_{j,k}$ in Eq. (27) is similar to the update equation of parameter $\gamma_{j,k}$ in Eq. (19), except for the parameter $\alpha$.
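The substitution that leads from the multiplicative rule for $a_{j,k}$ to Eq. (27) can be written out step by step. The sketch assumes the row normalization $\sum_{q}b_{k,q}=1$, reads the ratio inside the sum with the entry $[\bm{AB}]_{j,i}=\sum_{k^{*}=1}^{K}a_{j,k^{*}}b_{k^{*},i}$, and takes the proportionality in Eq. (26) as an equality:

$\displaystyle a_{j,k}\leftarrow a_{j,k}\frac{\sum_{i}b_{k,i}f_{j,i}/[\bm{AB}]_{j,i}}{\sum_{q}b_{k,q}}=\sum_{i=1}^{W}f_{j,i}\frac{a_{j,k}b_{k,i}}{\sum_{k^{*}=1}^{K}a_{j,k^{*}}b_{k^{*},i}}=\sum_{i=1}^{W}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}.$

The first equality uses the normalization of $\bm{B}$ to drop the denominator and moves $a_{j,k}$ inside the sum; the second applies the definition of $\varphi_{j,i,k}^{(\textit{NMF})}$.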

The update equation of factor $b_{k,i}$ in NMF, corresponding to the topic-word relationship, can also be rewritten in terms of the LDA variational variable $\varphi$. To achieve this equivalence, we use the approximation in Eq. (26) and the last value of $a_{j,k}$ obtained in Eq. (27), leading to the following update of $b_{k,i}$:

$\displaystyle b_{k,i}=\frac{1}{\sum_{j}a_{j,k}}\sum_{j}\frac{f_{j,i}a_{j,k}b_{k,i}}{[\bm{AB}]_{j,i}}=\frac{\sum_{j}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}}{\sum_{j}\sum_{i}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}}.$ (28)

By Eq. (28), we note that the value of $b_{k,i}$ is obtained by the statistics $\varphi$ for a specific word $w_{i}$ and topic $k$ for every document $d_{j}$ , and normalized by every word $w_{i}$ in the vocabulary. It corresponds to the topic-word distribution for a topic $k$ , represented by distribution $\beta_{k}$ in LDA.

Table 1

Equivalence between the update equations of the algorithms PBG, NMF, and LDA. Item a) relates to the topic statistics for each word in each document; item b) relates to the document-topic statistics; and item c) relates to the topic-word statistics

	PBG	NMF	LDA
a)	$\displaystyle\bar{C}_{e_{j,i},k}=\frac{[\bar{A}_{j}\odot\bar{B}_{i}]_{k}}{\displaystyle\sum_{l\in\mathscr{K}}[\bar{A}_{j}\odot\bar{B}_{i}]_{l}}$	$\displaystyle\varphi_{j,i,k}^{(\textit{NMF})}\propto\frac{a_{j,k}b_{k,i}}{\displaystyle\sum_{k^{*}=1}^{K}a_{j,k^{*}}b_{k^{*},i}}$	$\displaystyle\varphi_{j,i,k}\approx\beta_{k,i}\times\frac{\gamma_{j,k}}{\displaystyle\sum_{k^{*}=1}^{K}\gamma_{j,k^{*}}}$
b)	$\displaystyle\bar{A}_{j}=\alpha+\sum_{w_{i}\in\mathscr{W}_{d_{j}}}f_{j,i}\bar{C}_{e_{j,i}}$	$\displaystyle a_{j,k}=\sum_{i=1}^{W}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}$	$\displaystyle\gamma_{j,k}=\alpha+\sum_{i=1}^{W}f_{j,i}\varphi_{j,i,k}$
c)	$\displaystyle\hat{B}_{i,k}\propto\sum_{d_{j}\in\mathscr{D}}f_{j,i}\bar{C}_{e_{j,i},k}$	$\displaystyle b_{k,i}=\frac{\displaystyle\sum_{j}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}}{\displaystyle\sum_{j}\sum_{i}f_{j,i}\varphi_{j,i,k}^{(\textit{NMF})}}$	$\displaystyle\beta_{k,i}\propto\sum_{j}f_{j,i}\varphi_{j,i,k}$

The equivalences among the update equations of NMF, LDA, and PBG are summarized in Table 1. They extend the theoretical analysis relating the objective functions of NMF and LDA, described in [12], and indicate that the computational aspects of their update algorithms are similar to the propagation procedure of the PBG algorithm. Note that, in the update equations of NMF and LDA, it is possible to keep topic statistics for each word in each document and to propagate these statistics to the variables relating documents and topics and to the variables relating topics and words. Once the propagation of these values during topic extraction is understood, the inspiration for PBG becomes clear: it uses a suitable graph-based data structure to propagate these statistics.

5. Experimental evaluation

The vectors $\bar{A}_{j}$ and $\bar{B}_{i}$ have to be post-processed according to the different unsupervised sub-tasks. In this section, we present the experimental results of the PBG algorithm applied to the tasks of dimensionality reduction and topic extraction. We conducted the experiments using the document collections described in Table 2. From the collections described in [34], we employed those with the largest number of documents that are well suited for topic extraction. These documents were preprocessed as usual: stop-words were removed, terms were stemmed, and the word frequencies were used to weight the edges of the bipartite graph.

Table 2
Collection of documents used in experimental evaluation. The column $D$ is the number of documents, the column $W$ is the number of unique words, and the last column $\hat{W}$ is the number of terms

Name	$D$	$W$	$\hat{W}$	# classes
20ng	18808	45434	76.47	20
Dmoz-Business	18500	8303	11.93	37
classic4	7095	7749	35.28	4

A bipartite graph was created for each collection, and the PBG algorithm was applied to demonstrate its ability to extract latent patterns. The results were the vectors $\bar{A}$, $\bar{B}$, and $\bar{C}$ that optimize Eq. (1).

The documents used in this experimental evaluation have labels for the task of supervised learning. However, we have hidden the labels to fit the collection in an unsupervised context. The labels were used only to assist the evaluation of results in the task of document representativeness.

We evaluated three versions of the PBG algorithm: PBG with $K$-means initialization (pbg-init1), PBG with HLC initialization (pbg-init2), and the parallel version of PBG (pbg-parallel). The results in the tasks of topic extraction and document representativeness were compared with NMF with SVD initialization and with LDA. We observe that SVD initialization can lead to a rapid reduction of the approximation error of many NMF algorithms [8]; however, it is not clear how it could be applied to LDA, since the vectors need to be generated by a Dirichlet distribution. Hence, we report results with initialization only for the NMF and PBG algorithms.

The Normalized Pointwise Mutual Information (NPMI) was used as a measure of association between words that describe a topic. It was useful to measure the coherence of a set of words generated by a topic extraction algorithm. Another measure was the accuracy of vector $\bar{A}$ to represent documents in the task of classification.

The number of topics was set to $K\in\{50,100,150,200\}$. The LDA¹ hyperparameters were initially set to $\alpha=\frac{1}{K}$ and $\beta=\frac{1}{n}$ with symmetric priors. The problem established by NMF was solved using the gradient projection method [25].² We performed initial experiments of PBG with only 10 iterations to find the best parameter $\alpha\in\{0.5,0.05,0.005\}$. The only required parameter of PBG, $\alpha$, can be specified according to user preference to give strength to a specific topic. After the initial experiments, we adopted a standard value of $\alpha=0.05$. Another parameter, the stopping criterion, was set for all algorithms as whichever was reached first: 100 iterations or 5 hours of execution. To indicate the possible improvement of the parallel version, we used 8 threads in the parallel version of PBG.

¹ The Python implementation of LDA with variational inference and hyperparameters optimization is available at https://github.com/kzhai/PyLDA.

² The Python implementation is available at https://www.csie.ntu.edu.tw/~cjlin/nmf.

A strictly fair process of comparison between the proposed algorithms and probabilistic topic models algorithms is difficult, since probabilistic models traditionally use evaluation metrics based on likelihood and perplexity. In addition, perplexity or likelihood values are not necessarily correlated with human judgment about the semantic coherence of topics [31]. Therefore, we evaluate the topics coherence by the NPMI measure which approximates the human evaluation [31, 39].

5.1 Convergence analysis

In the absence of any general guarantees of convergence, we conducted an experimental analysis to indicate the convergence of the PBG algorithm. Figure 4 shows the relationship between the value of the objective function defined in Eq. (1), $Q(G)$, and the number of iterations. To create Fig. 4 we calculated the value of $Q(G)$, where $G$ is the bipartite graph with its associated vectors, obtained after 50 iterations or at most 2 hours of execution. For the 20ng dataset, the PBG algorithm did not need 50 iterations; it converged in only 15. The same behaviour was observed with the other datasets, for which it also converged in approximately 15 iterations.

Table 3
Best accuracy values obtained by algorithms for Dataset 20ng

	$K=$ 50	$K=$ 100	$K=$ 150	$K=$ 200
Algorithms	Accuracy (%)	Accuracy (%)	Accuracy (%)	Accuracy (%)
pbg	70.7776	72.4559	71.5424	72.8224
pbg-init1	77.5016	77.6184	77.6609	77.0130
pbg-init2	72.8808	76.7102	76.8271	77.2042
pbg-parallel	72.5409	74.6123	75.0106	75.4461
lda	69.7100	70.1137	70.3845	71.2609
svd $+$ nmf	73.5129	76.4075	77.1829	77.5760

Figure 4.

Value of the objective function (Eq. (1)) at each iteration of the PBG algorithm for datasets 20ng (left), classic4 (middle), and Dmoz-Business (right).

5.2 Evaluation of documents representativeness

In this section we evaluate how well a vector $\bar{A}_{j}$ represents the characteristics of a document $d_{j}$. This is done in the same way as LDA uses the topic-document distribution, and similarly to obtaining a feature vector of a document in the dimensionality reduction task. To measure the effectiveness of the vector representation, we used the Support Vector Machine (SVM)³ classification algorithm in a cross-validation schema with usual parameters. A high accuracy value means that the feature vector produced by PBG captured the knowledge needed to represent the document contents.

³ Weka 3: Java Data Mining Software http://www.cs.waikato.ac.nz/ml/weka/.

Tables 3–5 show the best accuracy for datasets 20ng, classic4, and Dmoz-Business, respectively. Within 100 iterations, all PBG variants capture document characteristics better than LDA on 20ng and classic4; on Dmoz-Business, the initialized variants (pbg-init1 and pbg-init2) outperform LDA, while pbg and pbg-parallel stay below it. The results indicate that PBG is an effective method for the dimensionality reduction task and hence a promising approach to tackle this problem, especially when the problem has an easy and straightforward strategy for graph construction, as in document-term graphs. It is possible to further explore this representation to include heuristic knowledge that enriches the process and yields better results.

Table 4

Best accuracy values obtained by algorithms for Dataset classic4

	$K=$ 50	$K=$ 100	$K=$ 150	$K=$ 200
Algorithms	Accuracy (%)	Accuracy (%)	Accuracy (%)	Accuracy (%)
pbg	94.7005	95.0810	95.2784	95.6025
pbg-init1	95.7858	95.9267	96.0395	96.2086
pbg-init2	95.3347	95.7153	95.4052	95.4898
pbg-parallel	94.7710	95.0106	95.2502	95.5039
lda	94.1508	94.5736	94.1931	94.4045
svd $+$ nmf	94.4891	95.1656	95.3206	95.7153

Table 5

Best accuracy values obtained by algorithms for Dataset Dmoz-Business

	$K=$ 50	$K=$ 100	$K=$ 150	$K=$ 200
Algorithms	Accuracy (%)	Accuracy (%)	Accuracy (%)	Accuracy (%)
pbg	35.8919	43.5946	45.3351	48.7135
pbg-init1	45.7189	54.3243	57.1838	57.8541
pbg-init2	46.8000	50.6649	53.4108	54.7081
pbg-parallel	35.6649	41.6649	45.8649	48.8973
lda	38.4919	44.8432	48.0703	49.5459
svd $+$ nmf	41.7459	48.6432	52.7892	55.2865

The critical difference diagram in Fig. 5 illustrates the statistical significance test obtained by comparing the results. Differences among algorithms connected by a line are not statistically significant. Thus, PBG with heuristic initialization is superior to LDA with statistical significance. This result supports the hypothesis that including heuristics in the propagation process can lead to better results. It also demonstrates that employing heuristic knowledge in PBG can be straightforward.

Figure 5.

Critical difference diagram considering the best accuracy for each algorithm.

Figures 6–8 illustrate the accuracy versus time for all algorithms and datasets. Each series illustrates the behaviour of a specific algorithm, and the series are grouped by number of topics $K$. Each point corresponds to the accuracy and time at the end of an iteration of a specific algorithm. So, the time interval between two consecutive points indicates the time spent on the corresponding iteration.

Figure 6.

Classification accuracy obtained during execution of algorithms used on this experimental evaluation. The features vectors (latent information vectors $\bar{A}_{j}$ ) extracted by algorithms were used to represent the Dataset 20ng.

Figure 7.

Figure 8.

In Figs 6–8 we note that the first point (at the beginning of the first iteration) of the series for svd $+$ nmf occurs later than for any other algorithm. The reason is the time-consuming SVD decomposition. While the HLC and $K$-means initializations are computed in a few seconds, the SVD requires considerable time to factorize the document-term matrix. On the other hand, the initial accuracy obtained with SVD is better than the initial accuracy obtained by LDA and PBG. Likewise, the accuracy obtained by PBG with $K$-means is better than that of LDA and of PBG without initialization, and $K$-means is faster than SVD. The initial vectors obtained by the HLC algorithm did not yield good initial accuracy values. However, even with worse initial values, pbg-init2 achieves good accuracy at the end of the iterations: the PBG algorithm was able to improve the clustering initially obtained by HLC and, at the end, achieves better results than LDA.

Figures 6–8 show that LDA converges faster with $K=50$ than with larger values of $K$ , when it has equal or lesser convergence time than the others. PBG with multithreading processing, pbg-parallel, has a fast convergence as expected. In some datasets, such as classic4 and Dmoz-Business, pbg-parallel was able to complete all 100 iterations, while other algorithms completed at most only 20 iterations in the same time interval.

Finally, these experiments indicate that, even with the inclusion of simple heuristics such as clustering-based initialization and parallel processing, the results of PBG can be improved in the task of feature extraction. In addition, the proposed method achieves competitive results compared to methods such as LDA and NMF.

5.3 Topics Evaluation using NPMI (Normalized Pointwise Mutual Information)

Another experimental evaluation was carried out on the topic extraction task. A topic is a set of words that frequently occur in semantically related documents and can be used to describe their thematic structure. Formally, a topic $k$ is the set of words $\textit{top}_{k}^{L}$ formed by the top $L$ words in the topic-word distribution rank. In the PBG scheme, the set of words is extracted as follows

$\displaystyle\textit{top}_{k}^{L}=\operatorname*{arg\,max}_{\textit{top}_{k}^{L*}\subset\mathscr{W}}\sum_{w_{i}\in\textit{top}_{k}^{L*}}\bar{B}_{i,k}$ (29)

such that $|\textit{top}_{k}^{L}|=L$. For the LDA algorithm, a topic is the set of $L=10$ words with the highest probability in the $\beta$ (topic-word) distribution. Similarly, for NMF and PBG, a topic $k$ is the set of the top $L=10$ words $\textit{top}_{k}^{L}$.
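Since the objective in Eq. (29) is a sum of individual word values, the maximizing set of size $L$ is obtained by simply taking the $L$ words with the largest values in dimension $k$. A minimal sketch, in which the function name `top_words` and the toy word vectors are hypothetical:

```python
def top_words(B, k, L=10):
    """Return the L words with the largest value B[word][k], as in Eq. (29)."""
    return sorted(B, key=lambda w: B[w][k], reverse=True)[:L]


# Toy word vectors B_i with K = 2 topics.
B = {
    "graph": [0.40, 0.05],
    "vertex": [0.30, 0.05],
    "edge": [0.20, 0.10],
    "topic": [0.05, 0.45],
    "word": [0.05, 0.35],
}
topic0 = top_words(B, k=0, L=3)
topic1 = top_words(B, k=1, L=2)
```

For large vocabularies, `heapq.nlargest` from the standard library would avoid sorting the whole vocabulary for each topic.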

Automatic methods to quantify the coherence of topics and their semantic interpretation have been the target of several studies [10, 31, 22]. Here, to compare the quality of the topics obtained by PBG with those obtained by LDA and NMF, we use the NPMI (Normalized Pointwise Mutual Information) metric.

NPMI is based on the association between pairs of words using external data [31, 22]. This measure is correlated with human perception of topic coherence and indicates how well the words in a topic describe a theme. As in [31], word correlations were extracted from Wikipedia pages⁴ by counting the frequency of pairs of words in the same Wikipedia page. Pairs of words in $\textit{top}_{k}^{L}$ with higher frequency indicate more semantic consistency for the topic $k$.

⁴ We use documents from Wikipedia of the year 2008. These documents are freely available at https://dumps.wikimedia.org/.
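For concreteness, the NPMI of a word pair can be computed from document-level co-occurrence counts with the standard definition $\textit{NPMI}(w_{a},w_{b})=\log\frac{p(w_{a},w_{b})}{p(w_{a})p(w_{b})}\,/\,(-\log p(w_{a},w_{b}))$. The sketch below is illustrative only: the reference corpus is a hypothetical toy list of word sets rather than Wikipedia, pairs that never co-occur are assigned the minimum value $-1$ by convention, and the coherence of a topic is assumed to be the average over all pairs of its top words.

```python
import math
from itertools import combinations


def npmi(w1, w2, reference_docs):
    """NPMI of two words from document-level co-occurrence in a reference corpus."""
    n = len(reference_docs)
    p1 = sum(w1 in d for d in reference_docs) / n
    p2 = sum(w2 in d for d in reference_docs) / n
    p12 = sum(w1 in d and w2 in d for d in reference_docs) / n
    if p12 == 0:
        return -1.0          # the pair never co-occurs: minimum value by convention
    if p12 == 1:
        return 1.0           # the pair co-occurs in every document
    return math.log(p12 / (p1 * p2)) / -math.log(p12)


def topic_coherence(words, reference_docs):
    """Average NPMI over all pairs of the top words of a topic."""
    pairs = list(combinations(words, 2))
    return sum(npmi(a, b, reference_docs) for a, b in pairs) / len(pairs)


# Hypothetical reference corpus: each document is the set of words it contains.
reference = [
    {"graph", "vertex", "edge"},
    {"graph", "edge"},
    {"topic", "word"},
    {"topic", "model", "word"},
]
coherence = topic_coherence(["graph", "edge", "vertex"], reference)
```

NPMI ranges from $-1$ (the words never appear together) to $1$ (the words always appear together), so averages from different topics and corpora are comparable.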

Table 6

Average value of NPMI obtained by algorithms used on this experimental evaluation

Dataset (num. topics)	pbg-init2	pbg-init1	lda	svd $+$ nmf	pbg	pbg-parallel
20ng (50)	0.1820	0.1680	0.1830	0.2090	0.1870	0.1820
20ng (100)	0.1870	0.1705	0.1685	0.1975	0.1725	0.1800
20ng (150)	0.1620	0.1713	0.1567	0.1897	0.1590	0.1810
20ng (200)	0.1533	0.1728	0.1512	0.1623	0.1613	0.1733
Dmoz-Business (50)	0.1250	0.1430	0.1500	0.1440	0.1100	0.1200
Dmoz-Business (100)	0.1245	0.1535	0.1195	0.1495	0.1010	0.1095
Dmoz-Business (150)	0.1247	0.1467	0.1210	0.1460	0.1090	0.1120
Dmoz-Business (200)	0.1212	0.1473	0.1182	0.1520	0.0950	0.1030
classic4 (50)	0.1690	0.1740	0.1700	0.1860	0.1790	0.1720
classic4 (100)	0.1760	0.1850	0.1885	0.1890	0.1900	0.1805
classic4 (150)	0.1633	0.1763	0.1773	0.1797	0.1713	0.1747
classic4 (200)	0.1730	0.1818	0.1765	0.1650	0.1642	0.1640

The NPMI results are shown in Table 6. The results indicate that NMF obtains more coherent topics than LDA in most settings. The NPMI values of NMF are the highest for many datasets and numbers of topics. Despite this, PBG (and its variations) achieved good results, reaching better NPMI values in some datasets. These results indicate the viability of PBG as a competitive method and a promising new technique for the topic extraction task.

6. Conclusion and future works

In this paper, we presented the Propagation in Bipartite Graph (PBG) algorithm, an unsupervised algorithm based on label propagation that considers a collection of documents modeled as a bipartite heterogeneous graph. The proposed algorithm propagates the latent information vectors associated with vertices and edges in a bipartite graph built from unlabeled documents. The mathematical background of PBG is a simplification of the update iterations of algorithms for the NMF and LDA methods. This simplification makes the PBG algorithm a descriptive and intuitive option for practitioners who want to implement it and extend its heuristics for topic extraction and unsupervised learning tasks. The proposed algorithm also captured document characteristics better than LDA and NMF, indicating that PBG is an interesting method for dimensionality reduction. In the topic extraction task, PBG obtained coherent topics, competitive with those of LDA and NMF, when evaluated by an automatic measure such as NPMI.

In future studies we intend to develop an online version of the PBG algorithm, to use clustering algorithms to determine how many topics are appropriate for a document collection, and to explore more heuristics to improve the results obtained by propagation in the bipartite graph. We also intend to incorporate other types of relations between objects, such as document-document or term-term relations, in addition to the document-term relations, and to analyze their impact on the coherence of topics.


Acknowledgments

This work has been partially supported by the State of São Paulo Research Foundation (FAPESP), grants 2015/14228-9 and 2011/23689-9, the Brazilian Federal Research Council (CNPq), grant 302645/2015-2, and the Brazilian Federal Agency for Support and Evaluation of Graduate Education (CAPES).

Table 2. Collection of documents used in experimental evaluation. The column $D$ is the number of documents, the column $W$ is the number of unique words, and the last column $\hat{W}$ is the number of terms.

Table 3. Best accuracy values obtained by algorithms for Dataset 20ng.

³ Weka 3: Java Data Mining Software, http://www.cs.waikato.ac.nz/ml/weka/.
