DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means for improving the effectiveness of text document clustering

Abstract

In this article, a new initial centroid selection for a K-means document clustering algorithm, namely, Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means), to improve the performance of text document clustering is proposed. The first centroid is the document having the minimum standard deviation of its term frequency. Each of the other subsequent centroids is selected based on the dissimilarities of the previously selected centroids. For comparing the performance of the proposed DIC-DOC-K-means algorithm, the results of the K-means, K-means++ and weighted average of terms-based initial centroid selection + K-means (Weight_Avg_Initials + K-means) clustering algorithms are considered. The results show that the proposed DIC-DOC-K-means algorithm performs significantly better than the K-means, K-means++ and Weight_Avg_Initials+ K-means clustering algorithms for Reuters-21578 and WebKB with respect to purity, entropy and F-measure for most of the cluster sizes. The cluster sizes used for Reuters-8 are 8, 16, 24 and 32 and those for WebKB are 4, 8, 12 and 16. The results of the proposed DIC-DOC-K-means give a better performance for the number of clusters that are equal to the number of classes in the data set.

Keywords

Document clustering; entropy; initial cluster centroids; purity

1. Introduction

The categorisation of text documents for a large document collection is a major problem in the areas of information retrieval and text mining [1,2]. Text document clustering and classification are among the most important techniques for organising text documents effectively [1,2]. Many clustering and classification approaches are used to organise text documents. The clustering algorithms that are unsupervised learning methodologies partition a document collection into many clusters, such that the documents within the same cluster are almost alike and dissimilar to each of the other clusters [1,3,4]. Partitioning and hierarchical clustering are the two major clustering techniques in information retrieval, data mining and text mining [1]. In recent years, it has been recognised that the partitional clustering approach is well adopted for a large-sized document collection due to its simplicity and low computational complexity [1 –3].

The best known partitioning clustering algorithm is the K-means algorithm. The K-means algorithm [2] starts with random initial cluster centroids and keeps reassigning the documents in the document collection to cluster centroids based on the similarity, or distance, between the documents and cluster centroids. The reassignment procedure will not stop until a convergence criterion is met. The procedure for choosing initial cluster centroids in the K-means clustering is very important as it has a direct impact on the formation of final clusters [1 –3,5,6]. Since clusters are separated groups, it is desirable to choose initial centroids which are well separated. It is dangerous to pick out outliers as initial centroids, since they are separate from normal samples.

In order to improve the performance of the K-means clustering algorithm, a variety of approaches have been proposed for choosing the initial centroids (seeds). Arthur and Vassilvitskii [7] proposed the K-means++ clustering method in 2007. It selects only the first centroid at random from the data set. Each subsequent initial centroid is chosen with a probability proportional to its distance from the previously selected set of centroids. The drawbacks of this approach are its sequential nature and the fact that it needs repeated scans of the complete data set. Kang and Cho [8] have proposed a seed initialisation algorithm based on three parameters: centrality, sparsity and isotropy. Khan [9] has proposed an approach to choose the initial seeds for the K-means algorithm. This approach arranges the data based on their magnitude and then selects the seeds based on the largest distances between consecutive arranged data. Karteeka et al. [10] have proposed a method, namely, the single pass seed selection (SPSS) algorithm, to initialise the first centroid (seed) and the minimum distance that separates the centroids for K-means++ based on the point which is close to the largest number of other points in the data set.

Improper initial centroids (seeds) may produce bad clusters which will directly affect the organisation of the documents [3,5,11]. Aytug Onan et al. [12] have proposed an improved ant clustering algorithm to overcome the sensitivity of initial seed selection. Jaya Mabel Rani and Latha [13] proposed an improved particle swarm optimisation (IPSO) for solving the problem of random initial selection using the K-means clustering algorithm for documents and avoid trapping in a local optimal solution. Sampath Premkumar and Hari Ganesh [14] proposed median-based initial centroid selection for K-means. The authors used a very simple data set, remarking that the technique is not suitable for large high-dimensional data sets. Sohrab Mahmud et al. [15] proposed weighted average of terms-based initial centroid selection for K-means clustering. It performs significantly better than the traditional K-means and K-means++ clustering algorithms.

Various initial centroid (seed) selection methods are used to select the initial centroids for K-means clustering [11,16 –26]. Most of them have focused only on one-dimensional data sets. Thus, it is necessary to introduce some other novel ideas for generating initial centroids for text document clustering while using the K-means algorithm.

The main objective of this work is to develop high-quality text document clusters using the K-means clustering algorithm, with the help of the proposed new initial centroid selection algorithm. The proposed algorithm considers the document having the minimum standard deviation of term frequency as the first initial centroid. The remaining initial centroids are selected based on their dissimilarity to the previously selected centroids. In the proposed method, every initial centroid is distinct from and dissimilar to each of the other centroids. The remainder of this article is organised as follows: the preliminaries are outlined in section 2, the proposed algorithm is introduced and discussed in section 3, the experimental results are presented in section 4 and the conclusion is presented in section 5.

2. Preliminaries

2.1. Document representation

Most text document categorisation methods adopt the vector space model [27 –30] to represent the documents; that is, each unique term in the vocabulary represents one dimension in the feature vector space. Therefore, a text document data set can be represented as a document-by-term matrix $D (n \times m)$ , where n and m indicate the number of documents and the number of terms occurring in the document data set, respectively. The matrix $D (n \times m)$ can be defined as in equation (1)

D (n \times m) = \begin{matrix}
 & t_{1} & t_{2} & \dots & t_{m} \\
d_{1} & < d_{1}, t_{1} > & < d_{1}, t_{2} > & \dots & < d_{1}, t_{m} > \\
d_{2} & < d_{2}, t_{1} > & < d_{2}, t_{2} > & \dots & < d_{2}, t_{m} > \\
⋮ & ⋮ & ⋮ & \dots & ⋮ \\
d_{n} & < d_{n}, t_{1} > & < d_{n}, t_{2} > & \dots & < d_{n}, t_{m} >
\end{matrix}

(1)

The value of every member of this matrix depends on the degree of relationship between its associated term and the respective document. There are many methods for measuring this relationship; the most often used are the word count method and the term frequency–inverse document frequency (TF-IDF) [31] method. In this article, every member of this matrix is calculated by either of these two methods. In the word count method, $< d_{i}, t_{j} >$ , $1 \leq i \leq n$ and $1 \leq j \leq m$ , is the total number of occurrences of the term $t_{j}$ in the document $d_{i}$ . In the TF-IDF method, $< d_{i}, t_{j} >$ is calculated using equation (2)

< d_{i}, t_{j} > = \frac{\sum_{j} t_{j}}{m} \times \left( 1 + \log \left( \frac{n}{d t_{i}} \right) \right), \quad 1 \leq i \leq n \text{ and } 1 \leq j \leq m

(2)

where $d t_{i}$ is the number of documents having the term $t_{i}$ .
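
As an illustration of the document-by-term matrix in equation (1), the following Python sketch builds a word count matrix and then multiplies each count by an inverse document frequency factor of the form $1 + \log (n / df_{j})$ . It is a minimal sketch and not the authors' implementation: the whitespace tokeniser, the function names `build_count_matrix` and `apply_idf`, and the use of the raw count as the term frequency part are assumptions made for this example, so the resulting weights need not coincide exactly with equation (2).

```python
import math
from collections import Counter


def build_count_matrix(docs):
    """Return (vocabulary, matrix); matrix[i][j] counts term j in document i."""
    tokenised = [doc.lower().split() for doc in docs]  # assumed tokeniser
    vocab = sorted({term for tokens in tokenised for term in tokens})
    column = {term: j for j, term in enumerate(vocab)}
    matrix = [[0] * len(vocab) for _ in docs]
    for i, tokens in enumerate(tokenised):
        for term, count in Counter(tokens).items():
            matrix[i][column[term]] = count
    return vocab, matrix


def apply_idf(matrix):
    """Weight each count by 1 + log(n / df_j), df_j = documents containing term j."""
    n = len(matrix)
    m = len(matrix[0]) if matrix else 0
    df = [sum(1 for row in matrix if row[j] > 0) for j in range(m)]
    return [
        [row[j] * (1 + math.log(n / df[j])) if row[j] else 0.0 for j in range(m)]
        for row in matrix
    ]


vocab, counts = build_count_matrix(["apple banana apple", "banana cherry"])
weighted = apply_idf(counts)
```

A term that occurs in every document (here `banana`) keeps its raw count, because the logarithm is zero, while rarer terms are weighted up.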

2.2. Similarity measure between two documents

Cosine similarity is one of the most popular similarity measures for searching similar documents in text document processing [27,28,31 –34]. The cosine measure computes the cosine of the angle between two feature vectors and is used frequently in information retrieval, where the vectors are very large but sparse [35]. For two documents $d_{1}$ and $d_{2}$ , the cosine similarity measure between them is given in equation (3)

\cos (d_{1}, d_{2}) = \frac{d_{1} . d_{2}}{‖ d_{1} ‖ . ‖ d_{2} ‖} = \frac{\sum_{i = 1}^{m} (d_{1 i} \cdot d_{2 i})}{\sqrt{\sum_{i = 1}^{m} d_{1 i}^{2}} \times \sqrt{\sum_{i = 1}^{m} d_{2 i}^{2}}}

(3)

where m is the number of unique terms in both documents. When the cosine similarity is 1, the two documents are identical, and if it is 0, there is nothing common between these two documents.
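
Equation (3) can be sketched in a few lines of Python. The function name `cosine_similarity` and the convention of returning 0 when one of the vectors is all zeros are assumptions of this example; the article does not state how empty documents are handled.

```python
import math


def cosine_similarity(d1, d2):
    """Cosine of the angle between two equal-length term vectors."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumed convention for an empty document
    return dot / (norm1 * norm2)


same_direction = cosine_similarity([1, 2, 0], [2, 4, 0])
no_shared_terms = cosine_similarity([1, 0, 0], [0, 3, 0])
```

Two documents whose term vectors point in the same direction score close to 1 even if their lengths differ, and documents with no shared terms score 0.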

2.3. K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms

2.3.1. K-means clustering algorithm

K-means is the most important flat clustering algorithm [1,2,32,36], which is one of the top 10 algorithms in data mining. This algorithm is used to cluster n documents into K partitions. The standard K-means clustering algorithm is given in Algorithm 1.

Algorithm 1. The standard K-means clustering algorithm

Input: D – document-by-term matrix, n– number of documents, K– number of clusters
Output: K clusters of the given data set
1. Randomly choose K documents from the document set as the initial centroids.
2. Calculate the similarity between each document and cluster centroids.
3. Assign each document of the document set to the cluster whose similarity between the document and the cluster centroid is the maximum of all the cluster centroids.
4. Recalculate the new cluster centroids using the mean value of each cluster.
5. Recalculate the similarity between each document and the newly obtained cluster centroids.
6. If no document was reassigned, then stop; otherwise, repeat from step 3.

Disadvantages. Because the initial centroids are selected randomly, the algorithm may converge to a local optimum and may need a larger number of iterations to reach convergence.
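
The steps of Algorithm 1 can be sketched as follows with cosine similarity as the assignment criterion. This is an illustrative sketch under stated assumptions rather than the implementation used in the experiments: the names `kmeans`, `initial`, `max_iter` and `seed` are hypothetical, an empty cluster simply keeps its previous centroid, and an iteration cap is added as a safeguard.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean_vector(vectors):
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(docs, k, initial=None, max_iter=100, seed=0):
    """Return (labels, centroids) for a list of equal-length term vectors."""
    rng = random.Random(seed)
    # Step 1: random initial centroids unless seeds are supplied by the caller.
    start = initial if initial is not None else rng.sample(docs, k)
    centroids = [list(c) for c in start]
    labels = None
    for _ in range(max_iter):
        # Steps 2-3: assign each document to its most similar centroid.
        new_labels = [
            max(range(k), key=lambda c: cosine(d, centroids[c])) for d in docs
        ]
        # Step 6: stop when no document was reassigned.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [d for d, lab in zip(docs, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_vector(members)
    return labels, centroids


docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(docs, 2, initial=[[1.0, 0.0], [0.0, 1.0]])
```

Passing `initial` makes it possible to plug in any of the seeding strategies discussed in this section while leaving the clustering loop unchanged.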

2.3.2. K-means++ clustering algorithm

The K-means++ algorithm [6, 37, 38] provides a way to choose initial centroids for the K-means algorithm. Let D be a set of document collection and K be the number of specified seeds for the cluster. Let $d (x)$ be the shortest distance from document x to the closest centroids. The K-means++ algorithm is described in Algorithm 2.

Algorithm 2. The K-means++ clustering algorithm
Phase I. Choose a set of K initial centres from a document set
Input: D – document-by-term matrix, n – number of documents, K – number of clusters
Output: K initial centroids
1. Randomly take one document $c_{1} \in D$ as the first centroid.
2. Take a new centroid $c_{i}$ , choosing $x \in D$ with the highest probability $d (x)^{2} / \sum_{x \in D} d (x)^{2}$ .
3. Repeat step 2 until all the K centroids are taken.
Phase II. Standard K-means clustering algorithm
Input: K initial centroids
Output: K clusters of the given data set
1. Taking the K centroids which are the output of phase I, proceed with the standard K-means algorithm for clustering.

Disadvantage. After selecting the first initial centroid, it takes $K - 1$ scans for the entire data set to select the remaining $K - 1$ centroids.
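
Phase I of Algorithm 2 can be sketched as below. The sketch samples each new centroid with probability proportional to $d (x)^{2}$ , as in the original K-means++ seeding [7]. Taking $d (x)$ as 1 minus the cosine similarity is an assumption made here for text vectors, and the names `kmeanspp_seeds` and `seed` are hypothetical. The list comprehension inside the loop makes the cost noted above visible: every new centroid requires one pass over all documents.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def kmeanspp_seeds(docs, k, seed=0):
    """Pick k seed documents by distance-weighted random sampling."""
    rng = random.Random(seed)
    centroids = [rng.choice(docs)]  # step 1: first centroid at random
    while len(centroids) < k:
        # One full scan: squared distance of every document to its closest centroid.
        d2 = [min(1 - cosine(x, c) for c in centroids) ** 2 for x in docs]
        if sum(d2) == 0:
            # Every document coincides with a chosen centroid; fall back to uniform.
            centroids.append(rng.choice(docs))
            continue
        # Step 2: sample the next centroid with probability d(x)^2 / sum of d(x)^2.
        centroids.append(rng.choices(docs, weights=d2, k=1)[0])
    return centroids


docs = [[1, 0], [1, 0], [0, 1]]
seeds = kmeanspp_seeds(docs, 2)
```

In this toy collection the second seed is always the document orthogonal to the first one, because documents identical to the first seed have zero weight.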

2.3.3. Weighted average of terms-based initial centroids for the K-means clustering algorithm

Sohrab Mahmud et al. [15] proposed weighted average of terms-based initial centroid selection for K-means. Let D be a data set consisting of n data elements $d_{1}, d_{2}, d_{3}, \dots, d_{n}$ . Each data point $d_{i}$ may contain multiple attributes $x_{1}, x_{2}, x_{3}, \dots, x_{m}$ , where m is the number of attributes. In the case of multidimensional attributes, they propose a weight factor for each attribute based on the distribution of the attributes in the entire data set and multiply each attribute by its weight factor. The weighted values are then summed for each data point, and the average is obtained by dividing the sum by m.

The average values of the entire set of data points are then sorted using merge sort. The sorted list of data points is then divided into K subsets. In each subset, the data point whose value is nearest to the subset mean becomes the initial centroid of the cluster to be constructed. In this work, the algorithm is applied to a collection of n documents, with the unique terms considered as the m attributes.

This algorithm is described in Algorithm 3 as follows.

Algorithm 3. The Weight_Avg_Initials + K-means clustering algorithm
Input: $D = \{ d_{1}, d_{2}, d_{3}, \dots, d_{n} \}$ – set of n data items, K – number of desired clusters
Output: A set of K initial centroids
Phase I. Choose a set of K initial centres from a data set
1. Calculate the average score of each data point:
(a) $d_{i} = x_{1}, x_{2}, x_{3}, \dots, x_{m}$
(b) $d_{i} (avg) = \frac{(w_{1} \times x_{1} + w_{2} \times x_{2} + \dots + w_{m} \times x_{m})}{m}$
where x is the attribute’s value, m is the number of attributes and w is the weight by which to multiply to ensure fair distribution of the cluster.
2. Sort the data based on average scores.
3. Divide the sorted data set into K subsets.
4. Calculate the mean value of each subset.
5. Take the nearest possible data point of the mean as the initial centroid for each of the data subsets.
Phase II. Standard K-means clustering algorithm
Input: K initial centroids
Output: K clusters of the given data set
1. Taking the K centroids which are the output of phase I, proceed with the standard K-means algorithm for clustering.
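
Phase I of Algorithm 3 can be sketched as follows. How the weights $w$ are derived is not specified in this article, so the sketch takes them as an optional argument and falls back to a weight of 1 for every attribute; that default, the function name `weighted_average_seeds`, the equal-sized split into K subsets and the reading of step 4 as the mean of the average scores are all assumptions of this example.

```python
def weighted_average_seeds(docs, k, weights=None):
    """Pick one seed per subset of documents sorted by weighted average score."""
    n, m = len(docs), len(docs[0])
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of documents")
    if weights is None:
        weights = [1.0] * m  # placeholder: the weighting scheme is not given here
    # Step 1: weighted average score of each data point.
    scores = [sum(w * x for w, x in zip(weights, d)) / m for d in docs]
    # Step 2: sort document indices by score.
    order = sorted(range(n), key=lambda i: scores[i])
    seeds = []
    for c in range(k):
        # Step 3: c-th of k nearly equal consecutive subsets.
        subset = order[c * n // k:(c + 1) * n // k]
        # Step 4: mean score of the subset.
        mean = sum(scores[i] for i in subset) / len(subset)
        # Step 5: the data point whose score is nearest to that mean.
        nearest = min(subset, key=lambda i: abs(scores[i] - mean))
        seeds.append(docs[nearest])
    return seeds


docs = [[1, 1], [2, 2], [3, 3], [10, 10], [11, 11], [12, 12]]
seeds = weighted_average_seeds(docs, 2)
```

Because the procedure contains no random step, repeated runs on the same data return the same seeds.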

3. The proposed Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means algorithm

Many algorithms have been introduced to select initial centroids for improving the performance of the K-means clustering algorithm. Many of them focus only on the random selection of initial centroids for clustering the given data set. Document clustering is a crucial and important application in information retrieval. Characteristics of this type of data such as high dimensionality and sparseness introduce new challenges to the clustering problem and make it harder compared with other types of data [1]. The proposed Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means) algorithm focuses on the clustering of the multidimensional data sets, such as large document collections, using dissimilarity-based centroid selection rather than the random selection of centroids. The K-means algorithm starts with allocating cluster centroids randomly and then looks for the ‘better’ solutions. The K-means++ algorithm starts with the random allocation of one cluster centroid and then searches for other centroids based on the first one. Thus, both the K-means and K-means++ algorithms use the random initialisation method for the starting centroid.

The proposed DIC-DOC-K-means uses the document having the minimum standard deviation of term frequency from the document collection as the first initial centroid. A low standard deviation means that all the term frequencies in a document are close to the mean of its term frequencies. That is, a document for which the term frequencies are almost close is selected as the first initial centroid. This idea completely avoids the random selection of the first centroid, which is used in both K-means and K-means++. The remaining $K - 1$ subsequent centroids are selected based on the dissimilarity of the previously selected centroids. The cosine similarity measure, which is widely used in information retrieval and text mining, is used to find the similarity between the selected centroids and other documents. If a document has dissimilarity with all other previously selected centroids, then the document is considered as another centroid. A control parameter $λ$ is used to control the identification of dissimilar centroids. This process is continued until K centroids are obtained. In this work, all initial centroids are dissimilar to each other. The identification of dissimilar centroids is achieved without increasing the computational complexity.

The algorithm discussed in this article consists of two phases: in phase I, a new dissimilarity-based initial centroid selection algorithm is used to determine the initial centroids; in phase II, the final clusters are formed using the standard K-means algorithm with the help of the initial centroids arrived from phase I.

The proposed DIC-DOC-K-means algorithm is described as follows.

Algorithm 4. Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means (DIC-DOC-K-means)
Phase I. Selection of K initial centroids (seeds)
Input: D – document-by-term matrix, n – number of documents, K – number of clusters, $σ$ – standard deviation of the document set, $λ$ – control parameter
Output: K initial centroids (seeds)
1. Initially, the centroid set is empty.
2. Initially, the control parameter value is set at 0, that is, $λ = 0$ .
3. The document with the least standard deviation $(σ)$ is the first centroid and placed in set c. If two or more documents have the same least $σ$ value, then any one of them is selected as the first centroid and placed in set c. In order to avoid complete random selection of the first centroid, the first occurring document in the list of least $σ$ value may be selected as the first centroid.
4. c is added to set C.
5. All the available documents are randomised and placed in set N. In order to provide the opportunity to select more dissimilar documents as centroids, the documents are randomised.
6. Repeat for all documents from set N until K centroids are selected:
(a) Select a document N_i from the set N.
i. Calculate the cosine similarity between the selected document N_i and all the selected centroids in set C;
ii. If the similarity measure between the selected document N_i and all the centroids in set C is less than or equal to the value of control parameter $λ$ , then the selected document N_i is added as a centroid to set C;
iii. If count(C) is equal to K then go to step 7.
7. If n documents are scanned and K centroids are not identified within one loop, increase the control parameter value $λ$ by 0.0005 and go to step 5. In order to find centroids with least similarity, a small value of 0.0005 is increased in each loop.
8. Output the identified set C of K dissimilar documents as K centroids.
Phase II. Standard K-means clustering algorithm
Input: Initial centroids from phase I
Output: K clusters of n documents
1. Take the initial centroids from the output of phase I.
2. Continue the standard K-means clustering algorithm.

The proposed algorithm aims to identify K completely dissimilar initial centroids from a data collection. In order to attain this, a single control parameter $λ$ is introduced and is initially set at 0. The main purpose of introducing this control parameter is to initially scan the documents and identify K centroids with the similarity score of 0 (i.e. the similarity score between each and every pair of centroids should be 0). If the same centroids are to be obtained for every run of the algorithm, the documents need not be randomised. If all the K centroids are obtained within a single scan, the algorithm proceeds to the next phase. In case K centroids with zero similarity score are not identified in a single scan, the permitted similarity score is increased by a small step, that is, $λ$ is increased by 0.0005. The next scan then tries to identify the remaining centroids; if K centroids are still not obtained, $λ$ is increased by a further 0.0005 for each consecutive scan.

The main reason for introducing the control parameter is to identify completely dissimilar K initial centroids from a data collection. In some data collections, there are no instances of having K completely dissimilar documents for identifying the K initial centroids. In such cases, increasing the similarity score by a small factor will aid in identifying the K centroids in judicious time without compromising the quality of the centroids to a large extent. Also, selecting a small number of centroids from a highly dissimilar data collection can be done easily using a very small similarity score (0). However, when a large number of centroids need to be identified, having a zero similarity score will result in heavy computational burden. Hence, the control parameter $λ$ is introduced to marginally increase the similarity value for identifying a large number of clusters. Thus, for searching a large number of initial seeds within less time, the control parameter will be used.

The proposed technique is computationally better without compromising the quality of the centroids as compared with the other methods. The K-means++ algorithm identifies K – 1 centroids except the first one, using K – 1 scans of the entire document set. This becomes computationally unreasonable when the number of initial centroids is large, or when the data collection has a large number of documents. The proposed algorithm will be able to identify the K centroids in a reasonable time for any value of K and for any type of data collection. This property of the proposed algorithm makes it robust and versatile in different application domains. The pseudocode for the proposed DIC-DOC-K-means is described as follows.

Pseudocode for phase I of the DIC-DOC-K-means algorithm
Input: D – document-by-term matrix, n – number of documents, K – number of clusters, $σ$ – standard deviation of the document set, $λ$ – control parameter
Output: K initial centroids
1. $C \leftarrow \{\}$ // Initially C is an empty set
2. $λ = 0$
3. $c \leftarrow D_{min (σ)}$ // c ← Document having min(σ)
4. $C \leftarrow c \cup C$
5. $N \leftarrow random (n)$ // Random shuffling of n documents as N
6. $for N_{i} \leftarrow 1 to count (N)$
7. $c_{1} \leftarrow 0$
8. $for j \leftarrow 1 to count (C)$
9. $s = sim (N_{i}, C_{j})$ // s ← Similarity(document N_i, document C_j)
10. $if s \leq λ$
11. $c_{1} \leftarrow c_{1} + 1$
12. $end$ // for j ← 1 to count(C)
13. $if equal (c_{1}, count (C))$
14. $c \leftarrow N_{i}$ // c ← ith document of N
15. $C \leftarrow c \cup C$
16. $end$ // if equal(c_1, count(C))
17. $if count (C) = = K$
18. $return C$ // Return the output C
19. $break$ // for N_i ← 1 to count(N)
20. $end$ // if count(C) == K
21. $end$ // for N_i ← 1 to count(N)
22. $if count (C) \neq K$
23. $λ \leftarrow λ + 0.0005$ // λ is increased by 0.0005
24. $Go to step 5$
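
The pseudocode above can be turned into a short runnable Python sketch. It is offered as an illustration under stated assumptions, not as the authors' MATLAB implementation: the names `dic_doc_seeds`, `step`, `shuffle` and `seed` are hypothetical, the population standard deviation is used for $σ$ , documents already chosen are skipped explicitly, and the final value of $λ$ is returned alongside the seeds for inspection.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def std_dev(vec):
    mean = sum(vec) / len(vec)
    return math.sqrt(sum((x - mean) ** 2 for x in vec) / len(vec))


def dic_doc_seeds(docs, k, step=0.0005, shuffle=True, seed=0):
    """Return (seeds, lam): k mutually dissimilar documents and the final threshold."""
    if not 1 <= k <= len(docs):
        raise ValueError("k must be between 1 and the number of documents")
    rng = random.Random(seed)
    # Steps 1-4: first centroid is the document with the least standard deviation
    # (min returns the first occurrence on ties).
    chosen = [min(range(len(docs)), key=lambda i: std_dev(docs[i]))]
    lam = 0.0
    while len(chosen) < k:
        order = list(range(len(docs)))
        if shuffle:
            rng.shuffle(order)  # step 5: randomise the scan order
        for i in order:
            if i in chosen:
                continue
            # Steps 6-16: accept the document if it is dissimilar to every centroid.
            if all(cosine(docs[i], docs[c]) <= lam for c in chosen):
                chosen.append(i)
                if len(chosen) == k:
                    break
        else:
            lam += step  # steps 22-24: relax the threshold and scan again
    return [docs[i] for i in chosen], lam


docs = [[3, 0, 0], [0, 1, 0], [0, 0, 2]]
seeds, lam = dic_doc_seeds(docs, 3)
```

Setting `shuffle=False` corresponds to the remark that the documents need not be randomised when the same centroids are wanted on every run.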

4. Experiments

The proposed algorithm is evaluated and compared with the three other clustering algorithms, namely, K-means, K-means++ and Weight_Avg_Initials + K-means, on two different document data sets.

4.1. Data sets

In this work, two data sets, namely, Reuters-8 and WebKB, are used to validate the quality of the cluster, as they are currently the most widely used benchmark in document clustering research. In both data sets, the documents are clearly classified in the separate classes. Hence, this classification of documents is used to evaluate and validate the clustering results of the proposed algorithm.

4.1.1. Reuters-8

The Reuters-21578 document collection [39] is employed to judge the effectiveness of cluster performance. The Reuters-21578 ModApté Split Text Categorization Test Collection contains thousands of documents collected from the Reuters newswire in 1987. It comprises 90 categories and 12,902 documents. In this work, only the 8 most frequent of the 90 categories are used, and only documents with a single topic are considered. Table 1 shows the distribution of the training and testing documents of Reuters-8 in every class.

Table 1.

Distribution of documents in each class of Reuters-8.

Class	Number of training data	Number of testing data	Subtotal of data
Acq	1596	696	2292
Crude	253	121	374
Earn	2840	1083	3923
Grain	41	10	51
Interest	190	81	271
Money-fx	206	87	293
Ship	108	36	144
Trade	251	75	326
Total	5485	2189	7674

4.1.2. WebKB

The WebKB data set [40] contains web pages collected from computer science departments of various universities by the World Wide Knowledge Base (Web → Kb) project of the CMU Text Learning Group. The documents of this data set are not predesignated as training or testing patterns. They are divided randomly into training and testing subsets. Table 2 shows the distribution of the documents in every category randomly selected for training and testing.

Table 2.

Distribution of documents in each class of WebKB.

Class	Number of training data	Number of testing data	Subtotal of data
Project	336	168	504
Course	620	310	930
Faculty	750	374	1124
Student	1097	544	1641
Total	2803	1396	4199

4.2. Validation of performance

The performance of the proposed algorithm is compared with the other clustering algorithms, namely, K-means and K-means++ clustering algorithms, based on the external measures such as purity, entropy and F-measure [41]. In general, the better clustering results have larger values of purity and F-measure and a lower value of entropy. If $n_{i}$ is the number of members of class i, $n_{j}$ is the number of members of cluster j, $n_{ij}$ is the number of members of class i in cluster j and p is the number of classes in the document collection, then the quality measures such as purity, entropy and F-measure [27,28,42,43] are described as follows:

Purity. Purity, or accuracy, is commonly used for measuring the quality of a clustering. For each cluster, it measures the share of documents that belong to the largest class in that cluster. The purity of cluster j is the fraction of its documents that carry the majority class label and is defined in equation (4)

\mathrm{Purity}(j) = \frac{1}{n_{j}} \max_{1 \le i \le p} n_{ij}

(4)

The overall purity of clustering is a weighted sum of the cluster purities. It can be defined in equation (5)

\mathrm{Purity} = \frac{1}{n} \sum_{j} n_{j} \, \mathrm{Purity}(j)

(5)
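As an illustration of equations (4) and (5), the following minimal Python sketch computes the overall purity from a class-by-cluster contingency table. The function name `purity`, the `counts[i][j]` layout (class i in rows, cluster j in columns) and the toy numbers are assumptions made for this example; they are not taken from the MATLAB implementation used in the experiments.

```python
def purity(counts):
    """Overall purity from a contingency table counts[i][j] = n_ij."""
    n = sum(sum(row) for row in counts)      # total number of documents
    num_clusters = len(counts[0])
    majority_total = 0
    for j in range(num_clusters):
        column = [row[j] for row in counts]  # class counts inside cluster j
        # n_j * Purity(j) simplifies to the size of the majority class.
        majority_total += max(column)
    return majority_total / n


# Hypothetical toy data: 2 classes (rows) and 2 clusters (columns).
toy_counts = [[8, 2],
              [1, 9]]
print(purity(toy_counts))  # (8 + 9) / 20 = 0.85
```

Because each cluster contributes only its majority count, purity tends to rise as K grows, which is why it is usually read together with entropy and F-measure.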

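Entropy. Entropy measures how strongly the classes are mixed within each cluster; it is 0 when every cluster contains documents of a single class. Entropy for clustering of documents is defined in equation (6)

\mathrm{Entropy} = \frac{1}{n \log p} \sum_{j = 1}^{K} n_{j} \left( - \sum_{i = 1}^{p} \frac{n_{ij}}{n_{j}} \log \frac{n_{ij}}{n_{j}} \right)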

(6)
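The entropy of equation (6) can be sketched in the same way. The code below is a minimal illustration that assumes a contingency table `counts[i][j]` holding the number of documents of class i in cluster j, at least two classes, and the usual convention that a zero count contributes nothing to the sum; the function name and toy tables are hypothetical.

```python
import math


def clustering_entropy(counts):
    """Entropy of a clustering, normalised by log(p), as in equation (6)."""
    p = len(counts)                          # number of classes (p >= 2)
    num_clusters = len(counts[0])
    n = sum(sum(row) for row in counts)      # total number of documents
    weighted_sum = 0.0
    for j in range(num_clusters):
        n_j = sum(row[j] for row in counts)  # size of cluster j
        if n_j == 0:
            continue                         # an empty cluster adds nothing
        h_j = 0.0
        for i in range(p):
            if counts[i][j] > 0:             # treat 0 * log(0) as 0
                frac = counts[i][j] / n_j
                h_j -= frac * math.log(frac)
        weighted_sum += n_j * h_j
    return weighted_sum / (n * math.log(p))


# Hypothetical toy tables: a perfect clustering and a fully mixed one.
print(clustering_entropy([[5, 0], [0, 5]]))  # 0.0
print(clustering_entropy([[5, 5], [5, 5]]))  # close to 1, the maximum
```

Dividing by log p keeps the value between 0 and 1, so entropies obtained on data sets with different numbers of classes remain comparable.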

F-measure. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned. Precision is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the entire collection. For class i and cluster j, the precision and recall are defined, respectively, as $P(i, j) = n_{ij} / n_{j}$ and $R(i, j) = n_{ij} / n_{i}$.

F-measure is the harmonic mean of precision and recall [1,33,42] and is mathematically defined in equation (7)

\text{F-measure}(i, j) = \frac{2 \times P(i, j) \times R(i, j)}{P(i, j) + R(i, j)}

(7)

The final F-measure for all of the clusters is calculated using equation (8)

\text{F-measure} = \sum_{i} \frac{n_{i}}{n} \times \max_{j} \text{F-measure}(i, j)

(8)
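Equations (7) and (8) can be illustrated with a short Python sketch. It assumes the same hypothetical contingency layout, `counts[i][j]` for class i and cluster j, and skips class and cluster pairs with no shared documents, since their F-measure is zero; the function name and toy numbers are not from the original implementation.

```python
def overall_f_measure(counts):
    """Class-weighted best F-measure, following equations (7) and (8)."""
    num_clusters = len(counts[0])
    n = sum(sum(row) for row in counts)
    cluster_sizes = [sum(row[j] for row in counts) for j in range(num_clusters)]
    total = 0.0
    for class_row in counts:
        n_i = sum(class_row)                      # size of class i
        best = 0.0
        for j in range(num_clusters):
            n_ij = class_row[j]
            if n_ij == 0:
                continue                          # F-measure(i, j) is 0
            precision = n_ij / cluster_sizes[j]   # P(i, j) = n_ij / n_j
            recall = n_ij / n_i                   # R(i, j) = n_ij / n_i
            f_ij = 2 * precision * recall / (precision + recall)
            best = max(best, f_ij)
        total += (n_i / n) * best                 # weight by class size
    return total


# Hypothetical toy data: 2 classes (rows) and 2 clusters (columns).
toy_counts = [[8, 2],
              [1, 9]]
print(overall_f_measure(toy_counts))  # approximately 0.8496
```

Since each class is matched with its single best cluster, splitting a class over several clusters lowers $R(i, j)$ and hence the F-measure, which is worth keeping in mind when comparing values across different K.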

4.3. Results and discussion

The K-means, K-means++ and the proposed DIC-DOC-K-means algorithms were each implemented in MATLAB 7.10 and run on the Reuters-21578 and WebKB data sets with the cosine similarity measure. The results reported below for K-means and K-means++ are averages over 10 independent runs. Weight_Avg_Initials + K-means needs only a single run because its initial centroids are the same in every run. The standard K-means, K-means++, Weight_Avg_Initials + K-means and the proposed DIC-DOC-K-means were run with different K values (K = 8, 16, 24 and 32 for Reuters-8; K = 4, 8, 12 and 16 for WebKB), and the purity (Pu), entropy (En), F-measure (F-M) and recall (Re) of the clusters were measured.
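To make the role of the cosine similarity measure concrete, the following Python sketch shows one assignment step in which every document is attached to its most similar centroid. It is a simplified illustration, not the MATLAB code used in the experiments: the function names, the dense list-of-lists vectors and the toy data are assumptions, and a zero vector is given similarity 0 by convention.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                       # convention for empty documents
    return dot / (norm_u * norm_v)


def assign_to_centroids(documents, centroids):
    """Return, for each document, the index of its most similar centroid."""
    labels = []
    for doc in documents:
        sims = [cosine_similarity(doc, c) for c in centroids]
        labels.append(sims.index(max(sims)))  # first maximum wins on ties
    return labels


# Hypothetical toy data: three documents over a three-term vocabulary.
toy_documents = [[2, 0, 1], [0, 3, 0], [1, 0, 0]]
toy_centroids = [[1, 0, 0], [0, 1, 0]]
print(assign_to_centroids(toy_documents, toy_centroids))  # [0, 1, 0]
```

Cosine similarity depends only on the direction of the term vectors, so long and short documents with the same term proportions are treated alike.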

4.3.1. Results for word count representation of documents

The word count representation of the document set produced better results than the TF-IDF document representation for document clustering. The results for the word count representation of documents on the testing data of Reuters-8 and WebKB are shown in Tables 3 and 4 and plotted in Figures 1–8.

Table 3.

Purity and entropy values for word count representation of documents.

Data sets	K	K-means		K-means++		Weight_Avg_Initials + K-means		DIC-DOC-K-means
		Pu	En	Pu	En	Pu	En	Pu	En
Reuters-8	8	0.8525	0.1948	0.8857	0.1691	0.8689	0.1648	0.9063	0.1463
	16	0.9092	0.1315	0.9128	0.1287	0.8538	0.1758	0.9313	0.1130
	24	0.9109	0.1177	0.9201	0.1165	0.9255	0.1129	0.9333	0.1043
	32	0.9228	0.1082	0.9302	0.1021	0.9292	0.1033	0.9341	0.1034
WebKB	4	0.6211	0.6652	0.6347	0.6548	0.5802	0.6062	0.6901	0.6062
	8	0.6234	0.6462	0.6377	0.6295	0.6640	0.6122	0.6721	0.6122
	12	0.6500	0.6105	0.6532	0.6109	0.6261	0.5976	0.6696	0.5976
	16	0.6448	0.6073	0.6755	0.5970	0.6612	0.5882	0.6755	0.5882

Weight_Avg_Initials + K-means: weighted average of terms-based initial centroid selection + K-means; DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means; Pu: purity; En: entropy.

Table 4.

F-measure and recall values for word count representation of documents.

Data sets	K	K-means		K-means++		Weight_Avg_Initials + K-means		DIC-DOC-K-means
		F-M	Re	F-M	Re	F-M	Re	F-M	Re
Reuters-8	8	0.5214	0.4977	0.5627	0.5930	0.5865	0.5817	0.7743	0.7790
	16	0.4192	0.3536	0.4475	0.3700	0.2737	0.2265	0.5091	0.4216
	24	0.2992	0.2375	0.3274	0.2494	0.2791	0.2143	0.3714	0.2905
	32	0.2511	0.1906	0.2711	0.1982	0.2438	0.1873	0.2899	0.2152
WebKB	4	0.6041	0.6126	0.6227	0.6298	0.5759	0.5888	0.6766	0.6829
	8	0.3727	0.2973	0.3741	0.3002	0.4037	0.3195	0.3864	0.3260
	12	0.2772	0.1983	0.2781	0.2021	0.2730	0.1859	0.2883	0.2138
	16	0.2183	0.1474	0.2239	0.1517	0.2211	0.1604	0.2233	0.1553

Figure 1.

Comparison of purity values for Reuters-8 by word count representation.

Figure 2.

Comparison of entropy values for Reuters-8 by word count representation.

Figure 3.

Comparison of F-measure values for Reuters-8 by word count representation.

Figure 4.

Comparison of recall values for Reuters-8 by word count representation.

Figure 5.

Comparison of purity values for WebKB by word count representation.

Figure 6.

Comparison of entropy values for WebKB by word count representation.

Figure 7.

Comparison of F-measure values for WebKB by word count representation.

Figure 8.

Comparison of recall values for WebKB by word count representation.

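In the word count representation, the proposed DIC-DOC-K-means gives better results than the K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms in most cases. Tables 3 and 4 show a few exceptions at larger K, for example, the entropy for Reuters-8 at K = 32 and the F-measure for WebKB at K = 8 and K = 16.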

4.3.2. Results for the TF-IDF representation of documents

The TF-IDF method is the most common document representation in the area of text mining and information retrieval. The results for the TF-IDF representation of documents on the testing data of Reuters-8 and WebKB are shown in Tables 5 and 6 and plotted in Figures 9–16.
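For readers comparing the two representations, the sketch below builds word count vectors and then reweights them with TF-IDF. It assumes the common weighting tf × log(N / df), where N is the number of documents and df is the number of documents containing the term; the exact variant used in the experiments may differ, and the function names and toy documents are hypothetical.

```python
import math


def word_counts(tokenised_docs, vocabulary):
    """Word count representation: one row of term counts per document."""
    return [[doc.count(term) for term in vocabulary] for doc in tokenised_docs]


def tf_idf(count_matrix):
    """Reweight raw counts by log(N / df); assumed TF-IDF variant."""
    num_docs = len(count_matrix)
    num_terms = len(count_matrix[0])
    # df[t]: number of documents in which term t occurs at least once.
    doc_freq = [sum(1 for row in count_matrix if row[t] > 0)
                for t in range(num_terms)]
    idf = [math.log(num_docs / df) if df > 0 else 0.0 for df in doc_freq]
    return [[row[t] * idf[t] for t in range(num_terms)] for row in count_matrix]


# Hypothetical toy corpus with a three-term vocabulary.
toy_docs = [["oil", "price", "oil"], ["price", "rise"]]
toy_vocabulary = ["oil", "price", "rise"]
toy_matrix = word_counts(toy_docs, toy_vocabulary)
print(toy_matrix)  # [[2, 1, 0], [0, 1, 1]]
toy_weights = tf_idf(toy_matrix)
print(toy_weights[0][1])  # 0.0, because "price" occurs in every document
```

Under this variant, a term that occurs in every document receives zero weight, whereas the word count representation keeps its raw frequency.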

Table 5.

Purity and entropy values for TF-IDF representation of documents.

Data sets	K	K-means		K-means++		Weight_Avg_Initials + K-means		DIC-DOC-K-means
		Pu	En	Pu	En	Pu	En	Pu	En
Reuters-8	8	0.8503	0.1947	0.8727	0.1695	0.8616	0.1855	0.8915	0.1421
	16	0.8804	0.1532	0.9013	0.1375	0.8803	0.1575	0.9052	0.1320
	24	0.8851	0.1467	0.8985	0.1400	0.8917	0.1436	0.9033	0.1303
	32	0.8892	0.1427	0.8936	0.1303	0.8575	0.1751	0.9061	0.1267
WebKB	4	0.6322	0.6264	0.6650	0.5902	0.6547	0.6042	0.5871	0.6807
	8	0.6525	0.6031	0.6558	0.5959	0.6368	0.6198	0.6688	0.5799
	12	0.6413	0.6162	0.6582	0.6017	0.6311	0.6239	0.6582	0.5914
	16	0.6435	0.6170	0.6417	0.6205	0.6361	0.6236	0.6660	0.6011

TF-IDF: term frequency–inverse document frequency; Weight_Avg_Initials + K-means: weighted average of terms-based initial centroid selection + K-means; DIC-DOC-K-means: Dissimilarity-based Initial Centroid selection for DOCument clustering using K-means; Pu: purity; En: entropy.

Table 6.

F-measure and recall values for TF-IDF representation of documents.

Data sets	K	K-means		K-means++		Weight_Avg_Initials + K-means		DIC-DOC-K-means
		F-M	Re	F-M	Re	F-M	Re	F-M	Re
Reuters-8	8	0.4859	0.4934	0.5402	0.5302	0.4943	0.4572	0.6183	0.6323
	16	0.3414	0.2939	0.3885	0.3382	0.3704	0.3221	0.4124	0.3936
	24	0.2849	0.2364	0.2838	0.2323	0.2576	0.1989	0.2994	0.2731
	32	0.2214	0.1759	0.2344	0.1908	0.2177	0.1718	0.2282	0.2750
WebKB	4	0.5645	0.5481	0.5799	0.5877	0.5954	0.5488	0.4827	0.5098
	8	0.3440	0.2814	0.3434	0.2792	0.3492	0.2711	0.3532	0.2898
	12	0.2581	0.1864	0.2587	0.1919	0.2749	0.1893	0.2648	0.2043
	16	0.2054	0.1417	0.2057	0.1412	0.1959	0.1395	0.2066	0.1525

Figure 9.

Comparison of purity values for Reuters-8 by TF-IDF representation.

Figure 10.

Comparison of entropy values for Reuters-8 by TF-IDF representation.

Figure 11.

Comparison of F-measure values for Reuters-8 by TF-IDF representation.

Figure 12.

Comparison of recall values for Reuters-8 by TF-IDF representation.

Figure 13.

Comparison of purity values for WebKB by TF-IDF representation.

Figure 14.

Comparison of entropy values for WebKB by TF-IDF representation.

Figure 15.

Comparison of F-measure values for WebKB by TF-IDF representation.

Figure 16.

Comparison of recall values for WebKB by TF-IDF representation.

It is observed that the proposed DIC-DOC-K-means provides better results for Reuters-8 for all cluster sizes. For the WebKB data set, the proposed algorithm produced significantly better results in all cases except K = 4. When the cluster size is K = 4, K-means++ and Weight_Avg_Initials + K-means give better results than the proposed DIC-DOC-K-means algorithm. When the cluster size is K = 12, Weight_Avg_Initials + K-means gives a better F-measure value than the proposed algorithm as well as the K-means and K-means++ algorithms.

It is observed that the proposed DIC-DOC-K-means algorithm produces better results compared with the standard K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms for most of the cluster sizes used in this work. In particular, when the number of clusters is equal to the number of classes of the given data set, the proposed algorithm performs significantly better compared with the existing K-means, K-means++ and Weight_Avg_Initials + K-means document clustering algorithms. In addition, the word count representation of documents gives better performance than the TF-IDF representation of documents while using the proposed DIC-DOC-K-means.

5. Conclusion

A new DIC-DOC-K-means algorithm is presented. In phase I, the proposed algorithm uses the cosine similarity measure to calculate the similarity between the previously selected centroids and the remaining documents when selecting subsequent centroids. In phase II, the cosine similarity between centroids and documents is used to assign each document to its most similar centroid. The performance is validated on the testing data sets of Reuters-8 and WebKB and compared with the performances of the K-means, K-means++ and Weight_Avg_Initials + K-means clustering algorithms. The performance in terms of purity, entropy and F-measure is calculated with K values of 8, 16, 24 and 32 for Reuters-8 and with K values of 4, 8, 12 and 16 for WebKB. On both data sets, the proposed DIC-DOC-K-means algorithm improves the performance of text document clustering for most of the K values, which suggests that it is applicable to other text document data sets. Phase I of the proposed DIC-DOC-K-means, which selects the initial centroids, is suitable not only for K-means document clustering but also for other document clustering and classification algorithms that rely on initial centroids.
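The idea behind phase I can be sketched in a few lines of Python. This is only one plausible reading, not the authors' implementation: the first seed is the document whose term frequencies have the minimum standard deviation, as stated in the abstract, and each later seed is chosen by a farthest-first rule, namely the document whose highest cosine similarity to the seeds already chosen is lowest. That farthest-first rule, the function names and the toy data are assumptions; the exact selection rule is the one given in the algorithm description of the article.

```python
import math
import statistics


def cosine_similarity(u, v):
    """Cosine similarity of two term vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_initial_centroids(documents, k):
    """Return indices of k seed documents (assumed farthest-first rule)."""
    # First seed: minimum standard deviation of the term frequencies.
    first = min(range(len(documents)),
                key=lambda d: statistics.pstdev(documents[d]))
    chosen = [first]
    while len(chosen) < k:
        best_doc, best_score = None, None
        for d in range(len(documents)):
            if d in chosen:
                continue
            # Similarity to the closest seed selected so far.
            score = max(cosine_similarity(documents[d], documents[c])
                        for c in chosen)
            if best_score is None or score < best_score:
                best_doc, best_score = d, score
        chosen.append(best_doc)              # least similar document wins
    return chosen


# Hypothetical toy data: four documents over a three-term vocabulary.
toy_documents = [[4, 1, 0], [1, 1, 1], [0, 3, 1], [5, 0, 0]]
print(select_initial_centroids(toy_documents, 3))  # [1, 3, 2]
```

Because no random numbers are involved, a rule of this kind returns the same seeds on every run for a given data set.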

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship and/or publication of this article.

ORCID iD

R Lakshmi

References

1. Han, Kamber. Data mining: concepts and techniques. 2nd ed. San Francisco, CA: Morgan Kaufmann, 2006.

2. Al-Mubaid, Umair SA. A new text categorization technique using distributional clustering and learning logic. IEEE Trans Knowl Data Eng 2006; 18(9): 1156–1165.

3. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognit Lett 2010; 31: 651–666.

4. Shetkar, Fernandes. Text categorization of documents using K-means and K-means++ clustering algorithm. Int J Recent Innov Tren Comput Commun 2016; 4(6): 485–489.

5. Pena, Lozano, Larranaga. An empirical comparison of four initialization methods for the k-means algorithm. Pattern Recognit Lett 1999; 20(10): 1027–1040.

6. Steinley, Brusco MJ. Initializing K-means batch clustering: a critical evaluation of several techniques. J Classif 2007; 24: 99–121.

7. Arthur, Vassilvitskii. K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans, Louisiana, 7–9 January 2007, pp. 1027–1035.

8. Kang, Cho. K-means clustering seeds initialization based on centrality, sparsity and isotropy. In: International conference on intelligent data engineering and automated learning, Burgos, 23–26 September 2009, pp. 109–117. Berlin: Springer.

9. Khan. An initial seed selection algorithm for K-means clustering of georeferenced data to improve replicability of cluster assignment for mapping application. Appl Soft Comput 2012; 12: 3698–3700.

10. Karteeka Pavan, Rao AVD, et al. Single pass seed selection algorithm for K-means. J Comput Sci 2010; 6(1): 60–66.

11. Agha, Ashour WM. Efficient and fast initialization algorithm for k-means clustering. Int J Intell Syst Appl 2010; 1: 21–31.

12. Onan, Bulut, Korukoglu. An improved ant algorithm with LDA-based representation for text document clustering. J Inf Sci 2017; 43(2): 275–292.

13. Jaya Mabel Rani, Latha. Clustering analysis by improved particle swarm optimization and K-means algorithm. In: International conference on sustainable energy and intelligent systems, Tiruchengode, India, 27–29 December 2012.

14. Sampath Premkumar, Hari Ganesh. A median based external initial centroid selection method for K-means clustering. In: World congress on computing and communication technologies, Tiruchirappalli, India, 2–4 February 2017.

15. Sohrab Mahmud, Mostafizer Rahman, Nasim Akhtar. Improvement of K-means clustering algorithm with better initial centroids based on weighted average. In: 2012 7th international conference on electrical and computer engineering, Dhaka, Bangladesh, 20–22 December 2012.

16. Mahesh Kumar, Rama Mohan Reddy. A fast K-means clustering using prototypes for initial cluster center selection. In: International conference on intelligent systems and control, Coimbatore, India, 9–10 January 2015.

17. Xinwu. Research on text clustering algorithm based on improved K-means. In: International conference on computer design and applications, Qinhuangdao, China, 25–27 June 2010.

18. Jaganathan, Jaiganesh. An improved K-means algorithm combined with particle swarm optimization approach for efficient web document clustering. In: International conference on green computing, communication and conservation of energy (ICGCE), Chennai, India, 12–14 December 2014.

19. Yuan, Meng, Zhang, et al. A new algorithm to get the initial centroids. In: Proceedings of international conference on machine learning and cybernetics, Shanghai, China, 26–29 August 2004.

20. Wang, Liu, Chen, et al. A new partitioning based algorithm for document clustering. In: International conference on fuzzy systems and knowledge discovery, Shanghai, China, 26–28 July 2011.

21. Katara, Choudhary. A modified version of the K-means clustering algorithm. Global J Comput Sci Technol 2015; 15(7): 1–7.

22. de Amorim, Mirkin. Minkowski metric, feature weighting and anomalous cluster initializing in K-means clustering. Pattern Recognit 2012; 45(3): 1061–1075.

23. Kumar, Sahoo. A new initialization method to originate initial clusters for K-means algorithm. Int J Adv Sci Technol 2014; 62: 43–54.

24. Celebi, Kingravi, Vela PA. A comparative study of efficient initialization methods for the k-means clustering algorithm. Expert Syst Appl 2013; 40(1): 200–210.

25. Khan, Ahmad. Cluster center initialization algorithm for k-means clustering. Pattern Recognit Lett 2004; 25(11): 1293–1302.

26. Zhang, Yang, Oja. Improving cluster analysis by co-initializations. Pattern Recognit Lett 2014; 45: 71–77.

27. Kanungo, Mount, Netanyahu, et al. An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 2002; 24(7): 881–892.

28. Aliguliyev RM. Clustering of document collection – a weighting approach. Expert Syst Appl 2009; 36: 7904–7916.

29. Chen Y-L, Chiu Y-T. Vector space model for patent documents with hierarchical class labels. J Inf Sci 2012; 38(3): 212–233.

30. Rocha, Cobo. Feature selection strategies for automated classification of digital media content. J Inf Sci 2011; 37(4): 418–428.

31. Bide, Shedge. Improved document clustering using k-means algorithm. In: 2015 IEEE international conference on electrical, computer and communication technologies (ICECCT), Coimbatore, India, 5–7 March 2015.

32. Chen C-H. Improved TFIDF in big news retrieval: an empirical study. Pattern Recognit Lett 2017; 93: 113–122.

33. Lin, Jiang, Lee SJ. A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng 2014; 26(7): 1575–1590.

34. Basu, Murthy CA. A similarity assessment technique for effective grouping of documents. Inf Sci 2015; 311: 149–162.

35. Dhillon, Modha DS. Concept decompositions for large sparse text data using clustering. Mach Learn 2001; 42(1): 143–175.

36. Zhang, Shen, Gao, et al. A density-based method for initializing the k-means clustering algorithm. In: Proceedings of 2012 international conference on network and computational intelligence, Hong Kong, China, 3–4 August 2012, pp. 46–53.

37. Aubaidan, Mohd, Albared. Comparative study of k-means and k-means++ clustering algorithms on crime domain. J Comput Sci 2014; 10(7): 1197–1206.

38. Agarwal, Jaiswal, Pal. k-means++ under approximation stability. In: International conference on theory and applications of models of computation, Hong Kong, China, 20–22 May 2013, pp. 84–95. Berlin: Springer.

39. http://www.daviddlewis.com/resources/testcollections/reuters21578/

40. http://www.cs.umb.edu/~smimarog/textmining/datasets/

41. Liu C-L, Hsaio W-H, Lee C-H, Chen C-H. Clustering tagged documents with labeled and unlabeled documents. Inf Process Manage 2013; 49: 596–606.

42. Yuan. Exploring performance of clustering methods on document sentiment analysis. J Inf Sci 2017; 43(1): 54–74.

43. D’hont, Vertommen, Verhaegen, et al. Pairwise-adaptive dissimilarity measure for document clustering. Inf Sci 2010; 180: 2341–2358.