Graph entropy-based clustering algorithm in medical brain image database

Abstract

The high incidence of brain tumor has increased significantly in recent years. It is becoming more and more concernful to discover knowledge through mining medical brain image to aid doctors’ diagnosis. Clustering medical images for Intelligent Decision Support is an important part in the field of medical image mining because there are several technical aspects which make this problem challenging. In this paper, we propose a medical brain image clustering method to find similar pathology images that can assist doctors to analyze the specific disease, discover its potential cause and make more accurate treatment. Firstly, this method represents medical brain image dataset as a weighted, undirected and complete graph. Secondly, this graph is sparsified so as to describe the similarity of medical images very well. Last but not the least, a graph entropy based clustering method for this sparsified graph is proposed to cluster these medical images. The experimental results show that this method can cluster medical images efficiently and run well in time complexity. The clustering results can better describe the similarity of medical images.

Keywords

Medical image graph entropy sparsification clustering

1 Introduction

With the constantly expanding of the powerful data storage, computing platform and mobile internet, the explosion of healthcare data and the rapid electronic digitalization are promising trends. Reports indicate that the data volume of healthcare system of United States had reached 150EB in 2011. According to the present rate ZB(10²¹GB) and YB(10²⁴GB) will be reached soon in the future [1]. Through consolidating and analyzing the explosively growing data, healthcare big data can improve the quality of healthcare, and reduce healthcare costs and risk. Therefore, healthcare big data plays an irreplaceable important role in these fields. Doctors, patients and medical organizations also can benefit from it.

The modernization of healthcare continues to be deepened based on the rapid development of science and technology. Medical imaging technologies, such as computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI), are used to help doctors to make diagnoses. As these technologies have been used widely in clinical diagnosis, a large number of medical images are produced in hospital every day. The medical image information forms rich, varied and huge storage healthcare big data after being digitized and provide enormous opportunities for computer-aided diagnosis. It is a hot issue of data mining on medical image at the moment that how to make full use of these medical images and mine valuable information from them to help doctors to diagnose. By using data mining to healthcare big data, doctors can discover similar diseases and specific diseases easily, and make better diagnoses.

Clustering is an important part of data mining, it has been widely used in pattern recognition, biology, image processing and web information retrieval, etc. In the field of image retrieval, the performance can be improved through utilizing clustering as its preprocessing part. In the field of medical image research, classification methods are usually used to divide medical images into several categories, for example [3], proposed a hierarchical nonparametric Bayesian approach for medical images gene expressions classification [4]. Used decision tree to detect tumor and classify brain MRI images. We can know whether there is a tumor in the medical image or not. However, with the constantly increasing number of medical images, brain CT images which have different pathological features also grow quickly. As a supervised learning method, classification method needs to specify the category number before executing, so it has been unable to accurately describe the features of medical images.

Therefore it is a good choice to apply clustering method to classify the medical images according to their features. Currently, there are some research production in the field of image clustering. For example [5], presented a new matrix factorization method called Hyper-graph regularized Non-negative Matrix Factorization (HNMF) to cluster images [6]. Proposed a method for image clustering with density maps derived from Self-Organizing Maps (SOM) [7]. Presented an approach for image clustering that incorporated pairwise constrains from humans [8]. Used color moments feature, histogram edge, and k-means method for clustering images. Furthermore, clustering methods have been applied in the research of image segmentation. For example [9], presented an improved intuitionistic fuzzy c-means (IIFCM) to segment medical images, which considers the local spatial information in an intuitionistic fuzzy way. MRI images can be segmented by using k-means [10]. Hybrid method of k-means and fuzzy k-means can cluster images [11, 12]. Proposed a fast MRI brain image segmentation method based on artificial bee colony (ABC) algorithm and fuzzy-c means (FCM) algorithm. However, these methods need to designate the number of clusters and they are very sensitive to the selection of parameter. Therefore, it has high practical significance and value to study the method which needn’t to specify the number of categories before clustering.

At present, some existing clustering algorithms have requirements to data distribution, and these methods concern on different clustering results. Partitioning-based and hierarchical-based clustering algorithms aim to find globular clusters, but they are difficult to find non-globular clusters. Although density-based methods can find non-globular clusters, it is very sensitive to the distribution of data density. If they are used in the data which have uneven dispersion, the result will be unsatisfactory.

At the same time, we find that it is a good way that graph theory is applied to solve clustering problems. Von Dongen proposed Markov clustering algorithm (MCL) [13]. It clustered graph by using the method of simulation stochastic flow. On this basis, Venu Satuluri and his colleagues put forward a multi-level regularized MCL(MLR-MCL) method [14]. George Karypis proposed a Kernighan-Lin method to segment images [15]. Inderjit S.Dhillon and his colleagues exploited weighted kernel k-means method to segment images [16]. Bolin Chen and his colleagues used clique seeds and graph entropy methods to identify protein complexes in protein-protein interaction networks [17]. However, with the increasing number of vertices in the graph, the efficiency of above graph clustering methods will be affected. Furthermore, these methods exist some limits in clustering weighted graph.

Data sampling is an effective method to solve the efficiency of data mining. Its main idea is that the sample data are extracted from the full data and use data mining algorithms on it, we can get the result with a relatively short time, which is similar to the result from the method on the whole data after a relatively short time. This method has been applied to the prior problems, such as data mining, machine learning, etc. [18]. Hence, during the process of graph clustering, it has became an effective method to improve the clustering results by sampling graph edges, in other words sparsifying graph edges, then cluster it. Inspired by this thought, L-spar method was proposed in [19]. It’s a graph sparsification method which prunes different number of edges for different nodes. But this method does not apply to sparsify the complete thought graph.

To solve the above problems, a medical image clustering method based on graph entropy is proposed in this paper. Firstly, each medical image is represented by a vertex, the similarity between two medical images are the edge weight between two vertices. Thereby, these medical images form a complete graph and we can get a graph model by sparsifying this complete graph. Secondly, undirected weighted graph clustering method based on graph entropy is proposed. By using this method, we can cluster sparsified graph model and get the result of medical image clustering at last. The experiment results show that our medical image clustering method based on graph entropy can cluster medical images efficiently.

The remainder of this paper is organized by follows. The second section introduces medical image preprocessing. We can get calculation method of similarity between medical images in this section. In the third section, a weighted, undirected and complete graph is established based on medical image dataset, and then the complete graph will be sparsified. In the fourth section, we propose a weighted and undirected graph clustering method based on graph entropy (WGCGE). The fifth section is experimental results and analysis. Finally, we make a conclusion about our work.

2 Medical image pre-processing

After observing a large number of brain CT images, we find that CT image is made up of a certain number of pixels, and the range of grayscale value is from 0 to 255. In digital images, a CT image is represented by a matrix, and the information of a pixel is donated by an element of the matrix. Meanwhile, when we communicate with doctors about some knowledge of brain CT image, we find the following fact: when doctors make diagnoses based on medical images, they often pay more attention to the areas where grayscale value change violently. Because these areas usually have some key features of the brain CT image, and they can depend on these key features to analyze the image and find the diagnosed images with high similarity with it. According to the above medical domain knowledge, more attention need to be paid to these features of brain CT images, so that we can cluster these images more accurately.

Given a brain CT image I consisting of m × n pixels, I (i, j) is the gray value of a pixel which is located in the coordinates (i, j) of image I. The smaller the value of I (i, j) is, the smaller the grayscale value of coordinate (i, j) of image I is. In other words, this point is much darker and the point of (i, j) is more important in image I. As original brain CT images are generally affected by noise, for ease of research, a preprocessing work is necessary step. In this preprocessing, our group’s multi-level texture extracting method is utilized. Figure 1 is an example to describe the preprocessing work. Figure 1(a) shows an original brain CT image. Firstly, ROI (Region Of Interest) of Fig. 1(a) is extracted. The extracted ROI is shown in Fig. 1(b), where the useless or disturbing region of brain CT image for medical diagnosis is removed. Secondly, multi-level array par [k] can be got by calculating the peak and the trough of the ROI area histogram. According to the k-layer classification array, we can get multi-level texture image by using canny algorithm [20]. Figure 1(c) is a multi-level texture image which has been corrected. Finally, in order to facilitate subsequent modeling, this multi-level texture image is normalized into an uniform size image whose size is Column×Row, as shown in Fig. 1(d). According to the statistics of the raw image dataset specification, Column = 161 and Row = 151 can be used in this paper.

The texture-oriented T-LBP method has been proposed in [21], and the similarity between medical images can be calculated by using this method. The smaller weight shows the more similar to each other by using this calculation method, otherwise it shows less similar to each other between medical images. Firstly, the LBP values of all the texture points in each texture image are calculated. Secondly, in order to better express medical image detail, texture image is divided into several regions. Comparing the similarity of each corresponding region can get the similarity between images. In Fig. 2, each texture image is divided into 16 regions so that every texture point can be in these areas. Then LBP histogram of each area can be obtained. Each histogram consists of 256 gray-scale values as its horizontal axis, and the number of texture points in a certain gray-scale value is regarded as the ordinates, as shown in Fig. 3. Histogram is normalized by using the Formula (1).

$Hist [i] = \frac{value (r)}{sum (number)}$ (1)

In the Formula (1), Hist [i] represents the histogram of ith area, value (r) represents the number of the texture point of each gray-scale value, the value of sum(number) is the sum number of texture points in this area.

The similarity weight w_ij is used to describe the similarity of two texture images G_i and G_j. It is calculated by summing up the all distances of corresponding partition areas between G_i and G_i. In this step, the distance of corresponding partition areas is the difference of gray-scale histogram of this partition. Formula (2) is used to calculate w_ij. $w_{ij} = \sqrt{\sum_{x = y = 1}^{x = y = num} (Dist (G_{i} (Hist [x]), G_{j} (Hist [y])))^{2}}$ (2)

In Formula (2), Dist (G_i (Hist [x]) , G_j (Hist [y])) is the distance of xth area between image G_i and image G_j, x is the label number of partition areas and its value range is from 1 to num, that is the amount of partition areas. Formula 3 is used to calculate the distance of corresponding partition areas between texture images.

$\begin{matrix} Dist (G_{i} (Hist [x]), G_{j} (Hist [y])) \\ = \sqrt{\sum_{index = 0}^{index = 255} {(r_{i, x} [index] - r_{j, y} [index])}^{2}} \end{matrix}$ (3)

In formula (3), index represents the range of gray-scale values (from 0 to 255). and r is the ordinate value of the texture point, that is normalized.

3 Sparsification of graph

In this section, a weighted undirected complete graph is established to represent a brain CT image dataset. Each brain CT image is defined as a vertex, and the relationship between images is defined as edge and weight. Secondly, the complete graph is sparsified by pruning the edges which have high weight. And the sparsified graph is used to make the follow-up clustering about medical images more efficiently.

3.1 Establishment of weighted undirected complete graph

Graphs are generally used to represent entities and their relationships in practice. In a graph model, a vertex can denote an entity, and an edge can denote the existence of the relationship between entities. Therefore, an undirected complete graph g is defined as a two tuple g = (V, E), where V is a vertex set, E is an edge set, and it belongs to V × V. However, a two-tuple graph cannot clearly exploit the relationships among entities. To better balance the degree of the relationships among entities, a weighted undirected graph is introduced as follows.

Definition 1. A weighted undirected graph G = (V, E, W) is a three tuple, in which V = {v₁, v₂, … v_n} is a vertex set, E = {e_ij = (v_i, v_j) |1 ≤ i, j ≤ n, and i, j ∈ V} is the edge set of V × V, W = {w_ij|1 ≤ i, j ≤ n, and i, j ∈ V} is the weight set of edges, w_ij represents the weight of the edge e_ij.

A weighted undirected complete graph aiming at medical textural images (denoted by WUCM graph) is established to represent the relationships of medical images. Firstly, each medical image is represented by a vertex in the WUCM graph model, and the vertex set V is created. Secondly an edge is created between any two vertices, and the edge set E is created. Thirdly, the weight w_ij of an edge e_ij is computed based on the formula (3), and it denotes the dissimilarity of two medical textual images. The weight set of edges W is thus created. Through the above third steps, a WUCM graph is established.

In the real world, any two entities of the same kind have a relationship. However, their correlation degree can vary in an enormously different range. Usually, more attention will be paid to the relationships with high correlation degree for a certain purpose. Because a high correlation degree always means a closer relationship between entities. In our WUCM graph, w_ij denotes the dissimilarity between medical images. That is, a smaller weight value indicates a higher similarity between images, and a bigger weight value indicates a lower similarity between images. A bigger weight (lower similarity) hardly makes sense of further research. For instance, if the weight between image A and image B is 2.3, the weight between image B and image C is 2.6, and the weight between image A and image C is 8.5, then more attention will be paid to the weight between A and B as well as between B and C, and the weight between A and C does not cause focus. Moreover, remaining the edge with high weight brings higher complexity of graph operations. Therefore, pruning some edges is a good choice to reduce the complexity of future research, and the pruned graph can still remain essential information of graphs.

3.2 Establishment of a sparsified weighted undirected graph

In this paper, medical image dataset is represented by a graph to describe the similarity between medical images. Each vertex is a medical image. The similarity is measured between medical images. In other words, every two vertices are connected. The graph which measures the similarity of medical images is a weighted, undirected and complete graph. For example, n medical images are represented by n vertices. According to the weight value calculation method of similarity between medical images during medical pre-processing, we calculate the similarity weight of each two vertices, then a weighted, undirected

Algorithm 1. The sparsification method for weighted,

undirected and completed graph

Input: weighted, undirected and completed graph G,

parameter e.

Output: sparsified weighted and undirected graph G’.

1. All the edges of G are sorted by their weights in

ascending order;

2. Create a queue for each vertex v_i in the graph G,

all the edges and their weights of v_i are stored in

the queue in ascending order;

3. FOR the store queue of each vertex v_i

4 Dequeue first [d^e] edges and put them into set S;

//d is the degree of vertex.

5. END FOR

6. All the repeat edges of the set S need to be found out

and put them into set S’;

7. The edges which are not in set S’ need to be deleted

from graph G;

8. RETURN the weighted and undirected graph G’;

and complete graph which have n vertices can be built. However, it is very expensive to cluster a complete graph. In order to get the clustering result in a relatively short time, we consider to sparsify this complete graph. The similarity degree between medical images has differences, and the edges which describe high similarity degree between images are more useful than low similarity degree for a graph which measure the similarity degree between medical images. For purpose of putting the images which have high similarity degree into a classification, the edges which describe high similarity degree are priority retained during sparsifying this weighted, undirected and complete graph. Algorithm 1 shows the method of weighted, undirected and complete graph sparsification.

To be sure, the weighted and undirected graph G includes n vertices, then it has m = n × (n - 1)/2 edges, each vertex v_i has d edges and d = n - 1. Here parameter e is called sparsification factor. Its range is [0, 1]. When e is 0, the value of [d^e] is 1. That means at least an edge of each vertex v_i puts into S in step 4. When e is 1, all of the edges of each vertex v_i are put into S. In the step 6, all the repeat edges of S need to be found out by using hashing method and put them into S’. The edges stored in S’ are the common sides from S. The common side means the two endpoints of it think they are similar to each other, so the edges like this are retained. If an edge is not a common side, it indicates that the one endpoint A is similar to another B, but B does not think it is similar to A. This edge can not reflect the similarity between vertices well.

When sparsification factor e is not equal to 1, That is, compared with the corresponding vertex of original graph, each vertex will reduce an edge at least. The following theorem is given to describe the above conclusion.

Theorem: When sparsification factor e ≠ 1, comparing to the corresponding vertex of original graph G, each vertex v_i of graph G’ will reduce an edge at least.

Proof: Suppose sparsification factor e ≠ 1, there is a vertex v in graph G’ which is got after sparsification. All edges of vertex v are not pruned. According to Algorithm 1, vertex v has common edge with any other vertex in graph G. Because all the edges of vertex v are selected during the edge selecting procedure according to [d^e], the sparsification factor e = 1. But it conflicts with the known condition e ≠ 1. Therefore, the assumption is invalid. The theorem is true.

In this paper, graph G’ is got by sparsifying the original graph G. In Algorithm 1, the edges in graph G are sorted by quick sort method in ascending order. If the number of edges in graph G is m, the time complexity of step 1 is O (mlogm). The time complexity of this procedure that creates a queue for each vertex, stores its weight in ascending order and then puts first [d^e] edges into set S is $O (\sqrt{m})$ . The time complexity of searching the repeat edges by hashing method is O(1). So the sum time complexity of Algorithm 1 is O (mlogm).

Exmple 1. As shown in Fig. 4(a), 10 brain CT images are represented by a weighted, undirected and complete graph, where V = {v₁, v₂, v₃, v₄, v₅, v₆, v₇, v₈, v₉, v₁₀} is a vertex set, E = {e_ij|1 ≤ i ≤ 10, 1 ≤ j ≤ 10, and i ∈ V, j ∈ V} is an edge set. By calculating the degree of similarity between images, we can get the edge weight set, $\begin{matrix} W = {w_{12} = 4.0088, w_{13} = 4.8251, w_{14} = 3.9554, \\ w_{15} = 4.2151, w_{16} = 4.5925, w_{17} = 3.8840, \\ w_{18} = 4.8045, w_{19} = 5.4477, w_{1, 10} = 4.6055, \\ w_{23} = 6.0454, w_{24} = 5.6607, w_{25} = 5.3834, \\ w_{26} = 5.5883, w_{27} = 5.327, w_{28} = 5.9271, \\ w_{29} = 6.5373, w_{2, 10} = 5.1941, w_{34} = 5.5293, \\ w_{35} = 5.3505, w_{36} = 5.1468, w_{37} = 5.5658, \\ w_{38} = 5.2673, w_{39} = 4.4938, w_{3, 10} = 4.7696, \\ w_{45} = 4.4721, w_{46} = 4.9582, w_{47} = 3.8554, \\ w_{48} = 4.6409, w_{49} = 6.4504, w_{4, 10} = 6.1606, \\ w_{56} = 4.0188, w_{57} = 3.5017, w_{58} = 4.9082, \\ w_{59} = 6.2617, w_{5, 10} = 5.9724, w_{67} = 4.2604, \\ w_{68} = 5.5589, w_{69} = 5.9774, w_{6, 10} = 5.7314, \\ w_{78} = 4.6685, w_{79} = 6.0805, w_{7, 10} = 5.8429, \\ w_{89} = 6.3307, w_{8, 10} = 6.3362, w_{9, 10} = 6.0956} . \end{matrix}$ As an example, we determine that the sparsification factor e is 0.6, then the number of edges which is retained for each vertex are 4, that is [d^e] =4. In Fig. 4(b), the solid lines represent the common edges which are retained by two endpoints. The dotted lines indicate that the edges are retained by one endpoint. Further, as shown in Fig. 4(c), it is a graph that leaves solid lines from Fig. 4(b). These edges can fully describe the similarity between medical images. In Fig. 4(d), in order to observe the structure of graph conveniently, we get the isomorphic graph of Fig. 4(c). By this procedure, the number of edges reduces from 45 to 13, the reduction rateis 71.1%.

4 Weighted and undirected graph clustering method based on graph entropy

In this section, the definition and the calculation method of vertex entropy and graph entropy are given. On this basis, we propose a weighted and undirected graph clustering method based on graph entropy (WGCGE).

4.1 The calculation method for weighted and undirected graph entropy

Graph entropy is a new definition of information theory. The modularity of graph can be evaluated by using it. A weighted and undirected graph G = (V, E, W) can be divided into several clusters through calculating its graph entropy. Each cluster G’ is a subgraph of G, and the similarity of intra-cluster vertices is higher than these inter-cluster vertices.

Given a cluster G’(V’, E’, W’) in the graph G, if any vertex v of graph G has edges to connect with the vertices of V’, we call it intra-connection of v. Likewise. We can define the outer-connection of vertex G, if any vertex G of graph G has edges to connect with the vertices out of the vertex set V’. p_i (v) represents the intra-connection probability of the vertex v. $p_{i} (v) = \frac{\sum_{in} \frac{1}{w_{ij}}}{\sum_{all} \frac{1}{w_{ij}}}$ (4)

During pre-processing of medical images, due to the smaller weight indicates the higher similarity between images, the smaller weight will be more useful. In the formula (4), $\sum_{in} \frac{1}{w_{ij}}$ represents the sum of the edge weight reciprocal between vertex v and the intra-cluster vertices. If the value is larger, the connection between vertex v and intra-cluster is closer. $\sum_{all} \frac{1}{w_{ij}}$ represents the sum of the edge weight reciprocal of all the edges. The larger intra-cluster connection probability of the vertex v indicates the vertex has greater probability to belong to this cluster. Otherwise, the probability of belonging to this cluster is smaller. For the same reason, we can use p_o (v) to represent the probability of outer-cluster connection of vertex v. $p_{o} (v) = \frac{\sum_{out} \frac{1}{w_{ij}}}{\sum_{all} \frac{1}{w_{ij}}} = 1 - p_{i} (v)$ (5)

Definition 2. Given a cluster, the entropy of a vertex v is defined based on the probability distribution. This probability distribution depicts the intra-cluster and outer-cluster connections of the vertex v. Formula (6) is used to calculate the entropy of vertex v. $e (v) = - p_{i} (v) {log}_{2} p_{i} (v) - p_{o} (v) {log}_{2} p_{o} (v)$ (6)

Definition 3. Given a cluster, the entropy of a graph G (V, E, W) is defined by the sum of all the vertices entropy in the graph G. As follows, it is used to calculate the entropy of graph G. $e (G) = \sum_{v \in V} e (v)$ (7)

Example 2. In Fig. 5(a), there is a cluster G’(V’, E’, W’) in the graph G (V, E, W). The vertices in the oval are intra-cluster vertices. The edges and their weights are given in Example 1 in the last chapter. For vertex v₁ in the vertex set V = {v₁, v₂, v₃, v₄, v₅, v₆, v₇, v₈, v₉, v₁₀}, all of its neighbor nodes are in the cluster G’. According to formula (4) and (5), p_i (v₁) =1, p₀ (v₁) =0. Then with formula (6), the entropy of vertex v₁ is 0, namely e (v₁) =0. For vertex v₃, its any neighbor node is not in cluster G’, so p_i (v₃) =0, p_o (v₃) =1, and its vertex entropy e (v₃) is 0. However, for vertex v₂, the edge e₂1 is in cluster G’, but the edge e₂, 10 is not. Through calculating the probability, p_i (v₂) is 0.5645, p_o (v₂) is 0.4355, and its vertex entropy e (v₂) is 0.988. This shows that if a vertex only has the intra-cluster connection or inter-cluster connection, the entropy of this vertex is 0. If the vertex has not only intra-cluster connection but also inter-cluster connection, the vertex entropy will be greaterthan 0.

The definition of graph entropy can efficiently measure the clustering result of weighted graph. If the graph entropy has lower value, there is higher correlation between vertices in the cluster. That is to say, there is higher similarity between vertices in the cluster. In Fig. 5(a), the graph entropy e (G) is 4.1545. It is smaller than that in Fig. 5(b). Therefore, Fig. 5(a) has a better clustering result, and the cluster will be narrowed. But it does not contain vertex v₂.

Algorithm 2. WGCGE clustering method

Input: weighted and undirected graph G’.

Output: the set of clustering result C₁ … C_n.

1. Initialize the set S of the candidate seed vertices;//Set S

is all of the vertices of graph G’

2. Choose a vertex v_s from the set S randomly as a seed

vertex, initial seed cluster C is made of the vertex v_s

and its neighbor nodes;

3. Each vertex entropy and graph entropy are calculated

on the condition that the cluster C is in graph;

4. FOR each neighbor vertex v_i of seed vertex v_s DO

5. IF v_i is deleted from cluster C, graph G’ entropy

will become smaller

6. THEN v_i is deleted from cluster C;

7. ELSE

8. Vertex v_i is still in the cluster C;

9. END FOR;

10. FOR each outer boundary vertex of the cluster C DO

11. IF vertex v_i is put into cluster C, graph G’ entropy

will become smaller;

12. THEN vertex v_i is put into cluster C;

13. ELSE

14. Vertex v_i is not able to put into cluster C;

15. END FOR;

16. The current cluster C can be a set C_i of the clustering

result;

17. The vertices of the cluster C_i are deleted from the set S;

18. Step 2∼17 are repeated until the set S is empty;

19. RETURN clustering result sets C₁ … C_n;

4.2 The WGCGE method

During graph clustering, the belonging of each vertex to a certain cluster depends on the clustering method. In this section, WGCGE method is proposed to add a vertice into a cluster or delete a vertice from a cluster, so that the graph entropy will be decrease, and the quality of cluster will be optimized. Algorithm 2 shows the description of WGCGE method.

To be sure, the initial set S of the candidate seed vertices is composed of all the vertices of the weighted and undirected graph G’. When a clustering result is got, the vertices in this cluster are deleted from S, so that these vertices are not clustered repeatedly. The neighbor node of one vertex means the vertex connects with this node, and the outer boundary vertex means the vertex connects with intra-cluster vertices. They are not in one cluster. In the step 4 and 10 of Algorithm 2, greedy algorithm is used to check whether the graph G entropy will decrease or not, when the number of vertex in the cluster increases or decreases. In this procedure, it is needn’t to recalculate the whole vertices entropy. We only need to update the vertex entropy which connects with the concerned vertices. Because this clustering method does not need to specify the parameter which is related to the number of clusters, we can get the number of clusters automatically, and we can also get any size and shape clusters.

In Algorithm 2, if weighted and undirected graph G’ has n vertices, the time complexity of calculating vertex entropy and graph entropy in step 3 is O (n) for the seed vertex selected from S every time. Step 4∼15 are the procedure of increasing and decreasing vertices from the cluster. That time complexity is O (n). The time complexity of step 17 is O(1). Since the above process loops n times at most, the worse time complexity of this method is O (n²).

5 Experimental results and analysis

In order to verify the efficiency of this medical image clustering method, this section is divided into two parts for explanation, furthermore as an example, brain CT images are used in the experiment, the other medical images also can be clustered by our clustering method. The selection of sparsification factor e during sparsifying the graph will be analyzed in part 1, then we can get the results of WGCGE method with different e. In part 2, we will discuss the medical image clustering method which we proposed and compare it with MCST and MCL methods. MCST is a medical image clustering method based on graph minimum spanning tree [17]. MCL is proposed in [8], that is a graph clustering method based on stochastic flow. Two different medical image data sets are used in our experiment. One of them is provided by hospital, which contains 2000 real brain CT images. The other one is composed of the simulated images, which contains 4000 images.

The experimental hardware environment is Intel(R) Core(TM)2 Quad CPU Q8400 @2.66 GHz, 2.0 GB RAM, Microsoft Windows 8 operating system. The experimental software development environment is Visual Studio 2012, Matlab 2009a.

5.1 The analysis of sparsification factor selection

In this section, we will discuss the selection of sparsification factor e during graph sparsification. Two weighted, undirected and complete graphs which represent two brain CT image datasets are sparsified for decreasing the time cost in such a premise. When different sparsification factor e is selected, it can be seen that the edges will be decrease violently with the e decreasing.

In order to measure the quality of the clustering results, we use the average f-score method. F-score is the comprehensive evaluation index of precision and recall, it can be used to reflect the overall index. The following is the calculation formula of f-score. $f - score = \frac{2 \times precision \times recall}{precision + recall} = \frac{2 k}{m + n}$ (8)

m is the number of the elements in a certain cluster which is got by using clustering method, n is the number of the elements in a certain cluster which has been classified in the medical images library and k is the number of the elements which is both in m and n. Therefore we can get precision = k/m, recall = k/n, and f-score = 2k/(m+n). The average f-score is the average of the sum of each cluster maximum f-score in the clustering results.

As shown in Fig. 6, it is the changing situation of the average f-score when the sparsification factor e changes. If e = 1, the weighted, undirected and complete graph is clustered by using this clustering method. According to the explanation of the clustering method, there is only one cluster in the clustering result, that is, the complete graph. Hence, we determine that the value of e is from 0 to 0.9. From Fig. 6, we can see that the average f-score of the real image set are increasing continuously when e rise from 0 to 0.4. Then the average f-score begins to make a small-scope fluctuation with e increasing, and it gets the max value when e is 0.6. In the simulated image set, with e value increasing, average f-score continue to increase. When e is 0.7, maximum value can be got.

5.2 Compare with other clustering methods

In order to verify the efficiency of the clustering method, we compare our method with MCST and MCL for time complexity and average f-score, by using average f-score, the accuracy of these clustering methods can be got.

In Fig. 7, it is the comparison results of WGCGE method with other two clustering methods. As can be seen from it, with the number of the images increasing constantly, the time consumption of these three clustering methods change obviously. However, the time consumption of the clustering method based on graph entropy is far lower than other two clustering methods. From Table 1, in real image set and simulated image set, we can see that the average f-score of WGCGE method is higher than other two clusters. The average f-score of MCST and MCL in simulated image set are higher than that in real image set, but the average f-score of WGCGE method in simulated image set is lower than that in k-mediods.

6 Conclusion

In this paper, we mainly focus on the medical image clustering method. According to the features of medical images and inspired by domain knowledge, we propose a medical image clustering method based on graph entropy. Firstly, medical images are pre-processed. We can get the calculation method of similarity between medical images. Secondly, medical image dataset is represented by weighted, undirected and complete graph, and the graph is sparsified. Last but not the least, we propose weighted and undirected graph clustering method based on graph entropy. The experimental results show that our clustering method can cluster medical images efficiently and perform well in time cost and clustering accuracy. This method can assist doctors to diagnose illnesses and provide better service to doctors.

Footnotes

Acknowledgments

The paper is partly supported by the National Natural Science Foundation of China under Grant No. 61272184, 61370084, 61202090, Natural Science Foundation of Heilongjiang Province under Grant No. F2016005, and Fundamental Research Funds for the Central Universities No. HEUCF100609, HEUCFT1202.

References

IHTT. (2013). Transforming Health Care through Big Data Strategies for leveraging big data in the health care industry, http://c4fd63cb482ce6861463-bc6183f1c18e748a49b87a25911a0555.r93.cf2.rackcdn.com/iHT2_BigData_2013.pdf

Jiawei

Han

, Micheline and Pei

Jian

, Data Mining: Concepts and techniques, 3rd ed.Beijing: China Machine Press, 2012, pp. 323–350.

Elguebaly

and Bouguila

, A hierarchical nonparametric Bayesian approach for medical images and gene expressions classification[J], Soft Computing19(1) (2015), 189–204.

Naik

and Patel

, Tumor detection and classification using decision tree in brain MRI[J], International Journal of Computer Science and Network Security (IJCSNS)14(6) (2014), 87.

Zeng

, Yu

, Li

, et al., Image clustering by hyper-graph regularized non-negative matrix factorization[J], Neurocomputing138 (2014), 209–217.

Arai

, Image clustering method based on self organization mapping: SOM derived density maps and its application for landsat thematic mapper image clustering[J], Image2(5) (2013), 10–11.

Biswas

and Jacobs

, Active image clustering with pairwise constraints from humans[J], International Journal of Computer Vision108(1-2) (2014), 133–147.

Malakar

and Mukherjee

, Image clustering using color moments, histogram, edge and K-means clustering[J], International Journal of Science and Research (IJSR), India Online ISSN (2013), 2319–7064.

Huang

C.W.

, Lin

K.P.

and Wu

M.C.

, et al., Intuitionistic fuzzy c-means clustering algorithm with neighborhood attraction in segmenting medical image[J], Soft Computing19(2) (2015), 459–470.

10.

Moftah

H.M.

, Azar

A.T.

, Al-Shammari

E.T.

, et al., Adaptive k-means clustering algorithm for MR breast image segmentation, Neural Computing and Applications24(7-8) (2014), 1917–1928.

11.

Rahmani

M.K.I.

, Pal

and Arora

, Clustering of image data using K-means and fuzzy K-means, International Journal of Advanced Computer Science and Applications (IJACSA)5(7) (2014).

12.

Menon

and Ramakrishnan

, Brain Tumor Segmentation in MRI images using unsupervised Artificial Bee Colony algorithm and FCM clustering[C]//Communications and Signal Processing (ICCSP), 2015 International Conference on IEEE, 2015, pp. 0006–0009.

13.

Van Dongen

S.M.

, Graph clustering by flow simulation, 2001.

14.

Satuluri

and Parthasarathy

, Scalable graph clustering using stochastic flows: Applications to community discovery, Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining ACM, pp. 2009, 737–746.

15.

Karypis

and Kumar

, A fast and high quality multilevel scheme for partitioning irregular graphs, SIAM Journal on scientific Computing20(1) (1998), 359–392.

16.

Dhillon

I.S.

, Guan

and Kulis

, Weighted graph cuts without eigenvectors a multilevel approach, Pattern Analysis and Machine Intelligence, IEEE Transactions on29(11) (2007), 1944–1957.

17.

Chen

, Shi

, Zhang

, et al., Identifying protein complexes in protein–protein interaction networks by using clique seeds and graph entropy, Proteomics13(2) (2013), 269–277.

18.

Provost

and Kolluri

, A survey of methods for scaling up inductive algorithms, Data Mining and Knowledge Discovery3(2) (1999), 131–169.

19.

Satuluri

, Parthasarathy

and Ruan

, Local graph sparsification for scalable clustering, Proceedings of the 2011 ACM SIGMOD International Conference on Management of data ACM, 2011, pp. 721–732.

20.

Canny

, A computational approach to edge detection, IEEE Transactions on Pattern Analysis and Machine Intelligence8(6) (1986), 679–698.

21.

Pan

, Gu

, Han

, et al., Medical image clustering algorithm based on graph model, Journal of Software24(Suppl(2) (2013), 178–187.

22.

Fitzgibbon

A.W.

, Pilu

, et al., Direct least square fitting of ellipses, IEEE Transactions on Pattern Analysis and Machine Intelligence21(5) (1999), 476–480.

23.

Zhou

, Cheng

and Yu

J.X.

, Graph clustering based on structural/attribute similarities, Proceedings of the VLDB Endowment2(1) (2009), 718–729.

24.

Satuluri

, Parthasarathy

and Ucar

, Markov clustering of protein interaction networks with improved balance and scalability, Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology ACM, pp. 2010, 247–256.

25.

Pan

, Han

, Yin

, et al., Association rule mining with domain knowledge constraint, 3rd International Conference on Intelligent System and Knowledge Engineering, Xiamen, China, 1, 2008, 273–278.

26.

Yan

and Han

, gspan: Graph-based substructure pattern mining, Proceeding of IEEE International Conference on Data Mining, 2002, pp. 721–724.

27.

, Zou

and Gao

, Mining frequent subgraphs over uncertain graph databases under probabilistic semantics, The VLDB Journal21 (2012), 753–777.

28.

Unay

, Ekin

and Jasinschi

R.S.

, Local structure-based region-of-interest retrieval in brain MR images[J], Information Technology in Biomedicine, IEEE Transactions on14(4) (2010), 897–903.

29.

Dehmer

and Emmert-Streib

, Structural information content of networks: Graph entropy based on local vertex functionals[J], Computational Biology and Chemistry32(2) (2008), 131–138.

30.

Xia

, Wang

, Sun

, et al., Steganalysis of LSB matching using differences between nonadjacent pixels[J], Multimedia Tools and Applications (2014), 1–16.

31.

Zou

, Li

, Gao

and Zhang

, Frequent subgraph pattern mining on uncertain graph data, Proceeding of the 18th ACM Conference on Information and Knowledge Management, New York, USA, 2009, 583–592.