Abstract
The high incidence of brain tumor has increased significantly in recent years. It is becoming more and more concernful to discover knowledge through mining medical brain image to aid doctors’ diagnosis. Clustering medical images for Intelligent Decision Support is an important part in the field of medical image mining because there are several technical aspects which make this problem challenging. In this paper, we propose a medical brain image clustering method to find similar pathology images that can assist doctors to analyze the specific disease, discover its potential cause and make more accurate treatment. Firstly, this method represents medical brain image dataset as a weighted, undirected and complete graph. Secondly, this graph is sparsified so as to describe the similarity of medical images very well. Last but not the least, a graph entropy based clustering method for this sparsified graph is proposed to cluster these medical images. The experimental results show that this method can cluster medical images efficiently and run well in time complexity. The clustering results can better describe the similarity of medical images.
Introduction
With the constantly expanding of the powerful data storage, computing platform and mobile internet, the explosion of healthcare data and the rapid electronic digitalization are promising trends. Reports indicate that the data volume of healthcare system of United States had reached 150EB in 2011. According to the present rate ZB(1021GB) and YB(1024GB) will be reached soon in the future [1]. Through consolidating and analyzing the explosively growing data, healthcare big data can improve the quality of healthcare, and reduce healthcare costs and risk. Therefore, healthcare big data plays an irreplaceable important role in these fields. Doctors, patients and medical organizations also can benefit from it.
The modernization of healthcare continues to be deepened based on the rapid development of science and technology. Medical imaging technologies, such as computed tomography (CT), positron emission tomography (PET) and magnetic resonance imaging (MRI), are used to help doctors to make diagnoses. As these technologies have been used widely in clinical diagnosis, a large number of medical images are produced in hospital every day. The medical image information forms rich, varied and huge storage healthcare big data after being digitized and provide enormous opportunities for computer-aided diagnosis. It is a hot issue of data mining on medical image at the moment that how to make full use of these medical images and mine valuable information from them to help doctors to diagnose. By using data mining to healthcare big data, doctors can discover similar diseases and specific diseases easily, and make better diagnoses.
Clustering is an important part of data mining, it has been widely used in pattern recognition, biology, image processing and web information retrieval, etc. In the field of image retrieval, the performance can be improved through utilizing clustering as its preprocessing part. In the field of medical image research, classification methods are usually used to divide medical images into several categories, for example [3], proposed a hierarchical nonparametric Bayesian approach for medical images gene expressions classification [4]. Used decision tree to detect tumor and classify brain MRI images. We can know whether there is a tumor in the medical image or not. However, with the constantly increasing number of medical images, brain CT images which have different pathological features also grow quickly. As a supervised learning method, classification method needs to specify the category number before executing, so it has been unable to accurately describe the features of medical images.
Therefore it is a good choice to apply clustering method to classify the medical images according to their features. Currently, there are some research production in the field of image clustering. For example [5], presented a new matrix factorization method called Hyper-graph regularized Non-negative Matrix Factorization (HNMF) to cluster images [6]. Proposed a method for image clustering with density maps derived from Self-Organizing Maps (SOM) [7]. Presented an approach for image clustering that incorporated pairwise constrains from humans [8]. Used color moments feature, histogram edge, and k-means method for clustering images. Furthermore, clustering methods have been applied in the research of image segmentation. For example [9], presented an improved intuitionistic fuzzy c-means (IIFCM) to segment medical images, which considers the local spatial information in an intuitionistic fuzzy way. MRI images can be segmented by using k-means [10]. Hybrid method of k-means and fuzzy k-means can cluster images [11, 12]. Proposed a fast MRI brain image segmentation method based on artificial bee colony (ABC) algorithm and fuzzy-c means (FCM) algorithm. However, these methods need to designate the number of clusters and they are very sensitive to the selection of parameter. Therefore, it has high practical significance and value to study the method which needn’t to specify the number of categories before clustering.
At present, some existing clustering algorithms have requirements to data distribution, and these methods concern on different clustering results. Partitioning-based and hierarchical-based clustering algorithms aim to find globular clusters, but they are difficult to find non-globular clusters. Although density-based methods can find non-globular clusters, it is very sensitive to the distribution of data density. If they are used in the data which have uneven dispersion, the result will be unsatisfactory.
At the same time, we find that it is a good way that graph theory is applied to solve clustering problems. Von Dongen proposed Markov clustering algorithm (MCL) [13]. It clustered graph by using the method of simulation stochastic flow. On this basis, Venu Satuluri and his colleagues put forward a multi-level regularized MCL(MLR-MCL) method [14]. George Karypis proposed a Kernighan-Lin method to segment images [15]. Inderjit S.Dhillon and his colleagues exploited weighted kernel k-means method to segment images [16]. Bolin Chen and his colleagues used clique seeds and graph entropy methods to identify protein complexes in protein-protein interaction networks [17]. However, with the increasing number of vertices in the graph, the efficiency of above graph clustering methods will be affected. Furthermore, these methods exist some limits in clustering weighted graph.
Data sampling is an effective method to solve the efficiency of data mining. Its main idea is that the sample data are extracted from the full data and use data mining algorithms on it, we can get the result with a relatively short time, which is similar to the result from the method on the whole data after a relatively short time. This method has been applied to the prior problems, such as data mining, machine learning, etc. [18]. Hence, during the process of graph clustering, it has became an effective method to improve the clustering results by sampling graph edges, in other words sparsifying graph edges, then cluster it. Inspired by this thought, L-spar method was proposed in [19]. It’s a graph sparsification method which prunes different number of edges for different nodes. But this method does not apply to sparsify the complete thought graph.
To solve the above problems, a medical image clustering method based on graph entropy is proposed in this paper. Firstly, each medical image is represented by a vertex, the similarity between two medical images are the edge weight between two vertices. Thereby, these medical images form a complete graph and we can get a graph model by sparsifying this complete graph. Secondly, undirected weighted graph clustering method based on graph entropy is proposed. By using this method, we can cluster sparsified graph model and get the result of medical image clustering at last. The experiment results show that our medical image clustering method based on graph entropy can cluster medical images efficiently.
The remainder of this paper is organized by follows. The second section introduces medical image preprocessing. We can get calculation method of similarity between medical images in this section. In the third section, a weighted, undirected and complete graph is established based on medical image dataset, and then the complete graph will be sparsified. In the fourth section, we propose a weighted and undirected graph clustering method based on graph entropy (WGCGE). The fifth section is experimental results and analysis. Finally, we make a conclusion about our work.
Medical image pre-processing
After observing a large number of brain CT images, we find that CT image is made up of a certain number of pixels, and the range of grayscale value is from 0 to 255. In digital images, a CT image is represented by a matrix, and the information of a pixel is donated by an element of the matrix. Meanwhile, when we communicate with doctors about some knowledge of brain CT image, we find the following fact: when doctors make diagnoses based on medical images, they often pay more attention to the areas where grayscale value change violently. Because these areas usually have some key features of the brain CT image, and they can depend on these key features to analyze the image and find the diagnosed images with high similarity with it. According to the above medical domain knowledge, more attention need to be paid to these features of brain CT images, so that we can cluster these images more accurately.
Given a brain CT image I consisting of m × n pixels, I (i, j) is the gray value of a pixel which is located in the coordinates (i, j) of image I. The smaller the value of I (i, j) is, the smaller the grayscale value of coordinate (i, j) of image I is. In other words, this point is much darker and the point of (i, j) is more important in image I. As original brain CT images are generally affected by noise, for ease of research, a preprocessing work is necessary step. In this preprocessing, our group’s multi-level texture extracting method is utilized. Figure 1 is an example to describe the preprocessing work. Figure 1(a) shows an original brain CT image. Firstly, ROI (Region Of Interest) of Fig. 1(a) is extracted. The extracted ROI is shown in Fig. 1(b), where the useless or disturbing region of brain CT image for medical diagnosis is removed. Secondly, multi-level array par [k] can be got by calculating the peak and the trough of the ROI area histogram. According to the k-layer classification array, we can get multi-level texture image by using canny algorithm [20]. Figure 1(c) is a multi-level texture image which has been corrected. Finally, in order to facilitate subsequent modeling, this multi-level texture image is normalized into an uniform size image whose size is Column×Row, as shown in Fig. 1(d). According to the statistics of the raw image dataset specification, Column = 161 and Row = 151 can be used in this paper.
The texture-oriented T-LBP method has been proposed in [21], and the similarity between medical images can be calculated by using this method. The smaller weight shows the more similar to each other by using this calculation method, otherwise it shows less similar to each other between medical images. Firstly, the LBP values of all the texture points in each texture image are calculated. Secondly, in order to better express medical image detail, texture image is divided into several regions. Comparing the similarity of each corresponding region can get the similarity between images. In Fig. 2, each texture image is divided into 16 regions so that every texture point can be in these areas. Then LBP histogram of each area can be obtained. Each histogram consists of 256 gray-scale values as its horizontal axis, and the number of texture points in a certain gray-scale value is regarded as the ordinates, as shown in Fig. 3. Histogram is normalized by using the Formula (1).
In the Formula (1), Hist [i] represents the histogram of ith area, value (r) represents the number of the texture point of each gray-scale value, the value of sum(number) is the sum number of texture points in this area.
The similarity weight w
ij
is used to describe the similarity of two texture images G
i
and G
j
. It is calculated by summing up the all distances of corresponding partition areas between G
i
and G
i
. In this step, the distance of corresponding partition areas is the difference of gray-scale histogram of this partition. Formula (2) is used to calculate w
ij
.
In Formula (2), Dist (G i (Hist [x]) , G j (Hist [y])) is the distance of xth area between image G i and image G j , x is the label number of partition areas and its value range is from 1 to num, that is the amount of partition areas. Formula 3 is used to calculate the distance of corresponding partition areas between texture images.
In formula (3), index represents the range of gray-scale values (from 0 to 255). and r is the ordinate value of the texture point, that is normalized.
In this section, a weighted undirected complete graph is established to represent a brain CT image dataset. Each brain CT image is defined as a vertex, and the relationship between images is defined as edge and weight. Secondly, the complete graph is sparsified by pruning the edges which have high weight. And the sparsified graph is used to make the follow-up clustering about medical images more efficiently.
Establishment of weighted undirected complete graph
Graphs are generally used to represent entities and their relationships in practice. In a graph model, a vertex can denote an entity, and an edge can denote the existence of the relationship between entities. Therefore, an undirected complete graph g is defined as a two tuple g = (V, E), where V is a vertex set, E is an edge set, and it belongs to V × V. However, a two-tuple graph cannot clearly exploit the relationships among entities. To better balance the degree of the relationships among entities, a weighted undirected graph is introduced as follows.
A weighted undirected complete graph aiming at medical textural images (denoted by WUCM graph) is established to represent the relationships of medical images. Firstly, each medical image is represented by a vertex in the WUCM graph model, and the vertex set V is created. Secondly an edge is created between any two vertices, and the edge set E is created. Thirdly, the weight w ij of an edge e ij is computed based on the formula (3), and it denotes the dissimilarity of two medical textual images. The weight set of edges W is thus created. Through the above third steps, a WUCM graph is established.
In the real world, any two entities of the same kind have a relationship. However, their correlation degree can vary in an enormously different range. Usually, more attention will be paid to the relationships with high correlation degree for a certain purpose. Because a high correlation degree always means a closer relationship between entities. In our WUCM graph, w ij denotes the dissimilarity between medical images. That is, a smaller weight value indicates a higher similarity between images, and a bigger weight value indicates a lower similarity between images. A bigger weight (lower similarity) hardly makes sense of further research. For instance, if the weight between image A and image B is 2.3, the weight between image B and image C is 2.6, and the weight between image A and image C is 8.5, then more attention will be paid to the weight between A and B as well as between B and C, and the weight between A and C does not cause focus. Moreover, remaining the edge with high weight brings higher complexity of graph operations. Therefore, pruning some edges is a good choice to reduce the complexity of future research, and the pruned graph can still remain essential information of graphs.
Establishment of a sparsified weighted undirected graph
In this paper, medical image dataset is represented by a graph to describe the similarity between medical images. Each vertex is a medical image. The similarity is measured between medical images. In other words, every two vertices are connected. The graph which measures the similarity of medical images is a weighted, undirected and complete graph. For example, n medical images are represented by n vertices. According to the weight value calculation method of similarity between medical images during medical pre-processing, we calculate the similarity weight of each two vertices, then a weighted, undirected
undirected and completed graph
Input: weighted, undirected and completed graph G,
parameter e.
Output: sparsified weighted and undirected graph G’.
1. All the edges of G are sorted by their weights in
ascending order;
2. Create a queue for each vertex v i in the graph G,
all the edges and their weights of v i are stored in
the queue in ascending order;
3. FOR the store queue of each vertex v i
4 Dequeue first [d e ] edges and put them into set S;
//d is the degree of vertex.
5. END FOR
6. All the repeat edges of the set S need to be found out
and put them into set S’;
7. The edges which are not in set S’ need to be deleted
from graph G;
8. RETURN the weighted and undirected graph G’;
and complete graph which have n vertices can be built. However, it is very expensive to cluster a complete graph. In order to get the clustering result in a relatively short time, we consider to sparsify this complete graph. The similarity degree between medical images has differences, and the edges which describe high similarity degree between images are more useful than low similarity degree for a graph which measure the similarity degree between medical images. For purpose of putting the images which have high similarity degree into a classification, the edges which describe high similarity degree are priority retained during sparsifying this weighted, undirected and complete graph. Algorithm 1 shows the method of weighted, undirected and complete graph sparsification.
To be sure, the weighted and undirected graph G includes n vertices, then it has m = n × (n - 1)/2 edges, each vertex v i has d edges and d = n - 1. Here parameter e is called sparsification factor. Its range is [0, 1]. When e is 0, the value of [d e ] is 1. That means at least an edge of each vertex v i puts into S in step 4. When e is 1, all of the edges of each vertex v i are put into S. In the step 6, all the repeat edges of S need to be found out by using hashing method and put them into S’. The edges stored in S’ are the common sides from S. The common side means the two endpoints of it think they are similar to each other, so the edges like this are retained. If an edge is not a common side, it indicates that the one endpoint A is similar to another B, but B does not think it is similar to A. This edge can not reflect the similarity between vertices well.
When sparsification factor e is not equal to 1, That is, compared with the corresponding vertex of original graph, each vertex will reduce an edge at least. The following theorem is given to describe the above conclusion.
In this paper, graph G’ is got by sparsifying the original graph G. In Algorithm 1, the edges in graph G are sorted by quick sort method in ascending order. If the number of edges in graph G is m, the time complexity of step 1 is O (mlogm). The time complexity of this procedure that creates a queue for each vertex, stores its weight in ascending order and then puts first [d e ] edges into set S is . The time complexity of searching the repeat edges by hashing method is O(1). So the sum time complexity of Algorithm 1 is O (mlogm).
Weighted and undirected graph clustering method based on graph entropy
In this section, the definition and the calculation method of vertex entropy and graph entropy are given. On this basis, we propose a weighted and undirected graph clustering method based on graph entropy (WGCGE).
The calculation method for weighted and undirected graph entropy
Graph entropy is a new definition of information theory. The modularity of graph can be evaluated by using it. A weighted and undirected graph G = (V, E, W) can be divided into several clusters through calculating its graph entropy. Each cluster G’ is a subgraph of G, and the similarity of intra-cluster vertices is higher than these inter-cluster vertices.
Given a cluster G’(V’, E’, W’) in the graph G, if any vertex v of graph G has edges to connect with the vertices of V’, we call it intra-connection of v. Likewise. We can define the outer-connection of vertex G, if any vertex G of graph G has edges to connect with the vertices out of the vertex set V’. p
i
(v) represents the intra-connection probability of the vertex v.
During pre-processing of medical images, due to the smaller weight indicates the higher similarity between images, the smaller weight will be more useful. In the formula (4), represents the sum of the edge weight reciprocal between vertex v and the intra-cluster vertices. If the value is larger, the connection between vertex v and intra-cluster is closer. represents the sum of the edge weight reciprocal of all the edges. The larger intra-cluster connection probability of the vertex v indicates the vertex has greater probability to belong to this cluster. Otherwise, the probability of belonging to this cluster is smaller. For the same reason, we can use p
o
(v) to represent the probability of outer-cluster connection of vertex v.
The definition of graph entropy can efficiently measure the clustering result of weighted graph. If the graph entropy has lower value, there is higher correlation between vertices in the cluster. That is to say, there is higher similarity between vertices in the cluster. In Fig. 5(a), the graph entropy e (G) is 4.1545. It is smaller than that in Fig. 5(b). Therefore, Fig. 5(a) has a better clustering result, and the cluster will be narrowed. But it does not contain vertex v2.
Input: weighted and undirected graph G’.
Output: the set of clustering result C1 … C n .
1. Initialize the set S of the candidate seed vertices;//Set S
is all of the vertices of graph G’
2. Choose a vertex v s from the set S randomly as a seed
vertex, initial seed cluster C is made of the vertex v s
and its neighbor nodes;
3. Each vertex entropy and graph entropy are calculated
on the condition that the cluster C is in graph;
4. FOR each neighbor vertex v i of seed vertex v s DO
5. IF v i is deleted from cluster C, graph G’ entropy
will become smaller
6. THEN v i is deleted from cluster C;
7. ELSE
8. Vertex v i is still in the cluster C;
9. END FOR;
10. FOR each outer boundary vertex of the cluster C DO
11. IF vertex v i is put into cluster C, graph G’ entropy
will become smaller;
12. THEN vertex v i is put into cluster C;
13. ELSE
14. Vertex v i is not able to put into cluster C;
15. END FOR;
16. The current cluster C can be a set C i of the clustering
result;
17. The vertices of the cluster C i are deleted from the set S;
18. Step 2∼17 are repeated until the set S is empty;
19. RETURN clustering result sets C1 … C n ;
During graph clustering, the belonging of each vertex to a certain cluster depends on the clustering method. In this section, WGCGE method is proposed to add a vertice into a cluster or delete a vertice from a cluster, so that the graph entropy will be decrease, and the quality of cluster will be optimized. Algorithm 2 shows the description of WGCGE method.
To be sure, the initial set S of the candidate seed vertices is composed of all the vertices of the weighted and undirected graph G’. When a clustering result is got, the vertices in this cluster are deleted from S, so that these vertices are not clustered repeatedly. The neighbor node of one vertex means the vertex connects with this node, and the outer boundary vertex means the vertex connects with intra-cluster vertices. They are not in one cluster. In the step 4 and 10 of Algorithm 2, greedy algorithm is used to check whether the graph G entropy will decrease or not, when the number of vertex in the cluster increases or decreases. In this procedure, it is needn’t to recalculate the whole vertices entropy. We only need to update the vertex entropy which connects with the concerned vertices. Because this clustering method does not need to specify the parameter which is related to the number of clusters, we can get the number of clusters automatically, and we can also get any size and shape clusters.
In Algorithm 2, if weighted and undirected graph G’ has n vertices, the time complexity of calculating vertex entropy and graph entropy in step 3 is O (n) for the seed vertex selected from S every time. Step 4∼15 are the procedure of increasing and decreasing vertices from the cluster. That time complexity is O (n). The time complexity of step 17 is O(1). Since the above process loops n times at most, the worse time complexity of this method is O (n2).
Experimental results and analysis
In order to verify the efficiency of this medical image clustering method, this section is divided into two parts for explanation, furthermore as an example, brain CT images are used in the experiment, the other medical images also can be clustered by our clustering method. The selection of sparsification factor e during sparsifying the graph will be analyzed in part 1, then we can get the results of WGCGE method with different e. In part 2, we will discuss the medical image clustering method which we proposed and compare it with MCST and MCL methods. MCST is a medical image clustering method based on graph minimum spanning tree [17]. MCL is proposed in [8], that is a graph clustering method based on stochastic flow. Two different medical image data sets are used in our experiment. One of them is provided by hospital, which contains 2000 real brain CT images. The other one is composed of the simulated images, which contains 4000 images.
The experimental hardware environment is Intel(R) Core(TM)2 Quad CPU Q8400 @2.66 GHz, 2.0 GB RAM, Microsoft Windows 8 operating system. The experimental software development environment is Visual Studio 2012, Matlab 2009a.
The analysis of sparsification factor selection
In this section, we will discuss the selection of sparsification factor e during graph sparsification. Two weighted, undirected and complete graphs which represent two brain CT image datasets are sparsified for decreasing the time cost in such a premise. When different sparsification factor e is selected, it can be seen that the edges will be decrease violently with the e decreasing.
In order to measure the quality of the clustering results, we use the average f-score method. F-score is the comprehensive evaluation index of precision and recall, it can be used to reflect the overall index. The following is the calculation formula of f-score.
m is the number of the elements in a certain cluster which is got by using clustering method, n is the number of the elements in a certain cluster which has been classified in the medical images library and k is the number of the elements which is both in m and n. Therefore we can get precision = k/m, recall = k/n, and f-score = 2k/(m+n). The average f-score is the average of the sum of each cluster maximum f-score in the clustering results.
As shown in Fig. 6, it is the changing situation of the average f-score when the sparsification factor e changes. If e = 1, the weighted, undirected and complete graph is clustered by using this clustering method. According to the explanation of the clustering method, there is only one cluster in the clustering result, that is, the complete graph. Hence, we determine that the value of e is from 0 to 0.9. From Fig. 6, we can see that the average f-score of the real image set are increasing continuously when e rise from 0 to 0.4. Then the average f-score begins to make a small-scope fluctuation with e increasing, and it gets the max value when e is 0.6. In the simulated image set, with e value increasing, average f-score continue to increase. When e is 0.7, maximum value can be got.
In order to verify the efficiency of the clustering method, we compare our method with MCST and MCL for time complexity and average f-score, by using average f-score, the accuracy of these clustering methods can be got.
In Fig. 7, it is the comparison results of WGCGE method with other two clustering methods. As can be seen from it, with the number of the images increasing constantly, the time consumption of these three clustering methods change obviously. However, the time consumption of the clustering method based on graph entropy is far lower than other two clustering methods. From Table 1, in real image set and simulated image set, we can see that the average f-score of WGCGE method is higher than other two clusters. The average f-score of MCST and MCL in simulated image set are higher than that in real image set, but the average f-score of WGCGE method in simulated image set is lower than that in k-mediods.
Conclusion
In this paper, we mainly focus on the medical image clustering method. According to the features of medical images and inspired by domain knowledge, we propose a medical image clustering method based on graph entropy. Firstly, medical images are pre-processed. We can get the calculation method of similarity between medical images. Secondly, medical image dataset is represented by weighted, undirected and complete graph, and the graph is sparsified. Last but not the least, we propose weighted and undirected graph clustering method based on graph entropy. The experimental results show that our clustering method can cluster medical images efficiently and perform well in time cost and clustering accuracy. This method can assist doctors to diagnose illnesses and provide better service to doctors.
Footnotes
Acknowledgments
The paper is partly supported by the National Natural Science Foundation of China under Grant No. 61272184, 61370084, 61202090, Natural Science Foundation of Heilongjiang Province under Grant No. F2016005, and Fundamental Research Funds for the Central Universities No. HEUCF100609, HEUCFT1202.
