Abstract
With the rapid development of social network, community detection has caught lots of researcher’s attention. However, the existing methods are difficult to solve mass data in growing large-scale networks. In this paper, a two-step method is proposed to cope with large-scale networks based on the construction of community tree and the
Keywords
Introduction
The social network could be understood as the relationship structure of the connections among the actors, which is composed of limited sets or several groups of their relationships [1, 2]. And it is also faced with mass actor data, such as online social networks, Facebook, Twitter, etc. [3]. It’s meaningful for us to find out an accurate and effective way to detect the stable communities under the large-scale social networks. Currently, most algorithms, like Modularity [4], Louvain [5] and Infomap [6], perform the advantages and drawbacks in community detection. In this big data era, although some methods such as Incremental Attribute Reduction Algorithm by Lv et al. [7] are proposed to tackle large amounts of data, it is still tough to put these algorithms into use with more complex social networks, especially which is large-scale and full of huge amounts of semantic data. Therefore, it is significant to identify the community structure in large-scale networks.
In this paper, a multi-stage community detection algorithm is proposed to solve the large amount of data based on community tree and
The construction of the community tree is efficient for detection which reveals the property that almost the whole nodes are existed in the three levels. And the structure is important for accelerating the detection. Meanwhile, it provides a local environment to research the dynamic actions of nodes. Based on the initial communities, we introduce an
In the rest of the paper, Section 2 gives the related work. We analyze the multi-stage method and A community detection algorithm based on
There are many efficient methods which solve the community detection problem effectively. Generally, these scenarios can be classified from the static or dynamic calculation of the communities. For static calculation, the final division of communities could be determined by computing some of all nodes combinations whether met the global optimization. Heuristic methods based on optimization modularity, like Simulated Annealing Method [8] and Extremal Optimization [9], were put forward to reduce the time complexity. Blondel et al. gave a hierarchy greedy algorithm which could reduce the time complexity greatly [5]. To tackle the variety of structure and properties of complex social and information networks, Khomami et al. [10] put forward an algorithm based on distributed learning automata (DLACD) that equips each vertex of network with a learning automation. For the widely using of Bayesian probability model in topic discovery and other fields, Newman proposed an approach for directed network based on the hybrid model and Expectation-Maximization [11]. However, this algorithm had to specify the number of network community in advance. In order to mine communities using concise or easily interpretable descriptions rather than only depend on structural aspects, Atzmueller et al. [12] proposed a description-oriented community detection method using exhaustive subgroup discovery. They proposed several optimistic estimates of standard community quality functions to prune search space in an exhaustive branch-and-bound algorithm effectively.
Actually, the structure of large-scale social network is dynamic and full of continuous changes. Normally, it is a valid way to take a snapshot which corresponds to the static topology at one time point for dynamic network at different time points. Next, the existing static network community discovery algorithm can be used on each snapshot independently. And dynamic calculated methods mainly could be classified to the following aspects. Firstly, Wang et al. [13] proposed a method to detect overlapping community that takes the mass difference between nodes into consideration. They used PageRank algorithm to evaluate node mass and determine the community affiliation based on nodes’ positions in the inherent peak-valley structure of the topology potential field. Another routing algorithm of social network was proposed by Jia [14], which is an autonomous method using dynamic Bayesian non-negative matrix factorization to detect overlapping communities automatically. Next, methods based on particle swarm optimization are of advantage on complex social networks. An improved particle swarm optimization algorithm which has improved the parameters of velocity formula and introduced new distance calculation formula is proposed by Wu et al. [15] and is proved more efficient for function optimization problems. To optimize the modularity density, a memetic particle swarm optimization algorithm (MPSOA) introducing particle swarm optimization-based global search operator and tabu local search operator was proposed by Zhang et al. [16]. COPRA [17], proposed by Gregory, was stronger which was suitable for overlapping structure. COPRA extended the process of label propagation so that the nodes could include more label and more information of different communities. Though LPA owned a great performance on complexity of time, it was still an uncertain way. Leung et al. [18] put forward the improving LPA way, which allocated an weight for every label and the weight will be lower with the distance of propagation. And it was effective to prevent the huge communities’ forming. The other reason for the instability of LPA was the randomness. So, Lou et al. gave the (Coherent Neighborhood Propinquity, CNP) [19] to provide a more stable way. By reason of the local structure of communities, some researches concentrate on the local links of network. Lancichinetti and Fortunato [20]proposed the LFM based on local extension optimization to obtain the final division which was composed of iterated the natural communities of all nodes. Another way was GCE [21], proposed by Lee et al., which was fit on highly overlapping structure.
To some extent, these methods could acquire the community structure of the social networks in different time. However, their resistance to network noise is very poor and they must rerun algorithm on each snapshot independently which leads to the large complexity when faced with large-scale networks.
Therefore, we proposed an approach to solve the large-scale social network which datasets are more than millions. Our method, NC-GT, composed of two steps. Firstly, we construct the community tree which performs these two advantages, the efficient stored structure of networks whose nodes are active between three levels, and the other is, on this account, it is flexible for researching the dynamic nodes with this structure. Secondly, we could detect the initial communities by the similarity easily. As for the remaining single nodes,
A novel multi-stage communit detection method
It is important to give an exact definition of the problem. The structure of the social network is defined in Definition 1. The network nodes play different roles in the social network which is sparse. Definition 2 gives the concrete meanings of the nodes.
Given an undirected network
Network node is denoted as
Equation (2) presents the effect of nodes in a certain community,
Despite the social network is complexity, dynamic and with huge amounts of node, usually, the community is local which is hardly found a node belongs to the whole communities. In the topology of the social network, it is difficult to describe the locality and found the communities. So, we could convert the social network as the community tree which is easier to detect the communities.
The Algorithm 1 gives the construction of the community tree by the Definition 3. In the algorithm, whether a node has been visited is stored in an array n_visited[]. If node
According to the construction of the community tree, we give the following three features like these:
Each node in the community tree is existed in three coherent levels which provides the convenience for the next researching like detection or evolution of community; We only make the concentration on the first emergence of the node number for avoiding the repeat and the efficiency; The storage structure of the community tree could simplify the complex graph structure which is more beneficial.
From the construction of the community tree, it is effortless to know that most of the nodes are activated in three levels. So, the community detection can be prosperous as the overlap nodes are the focus of attention.
To find correct communities, we give the following related rules:
The breath first searching start from the root node which is as the contained community. When comes to node
Initially, we give a mark of level which value is zero to every node. If the node is visited, let the value of mark change to the current level so that it is effortless to find whether visited in the pre-level. It is obvious to know there are still some contained communities after once visit. Therefore, we should consider the border node formed in these communities.
The Algorithm 2 gives the initial division of the community tree following the above rules. This is a rough division of the community tree to prepare for more precise division in the next steps.
From part A, some exact communities and overlapping nodes which are hard to judge since the nodes is in the several communities. We put forward the
The community formed by independent overlapping node makes its correlated
Now, we consider the main utility for every player is the increase or decrease of the modularity itself whatever strategy it chooses. Modularity [2] is defined as follows which is to measure the quality of the detecting communities.
Where
So, it can be used to convey the payoff function of every game part
And the formulation of
Since the community detection is to get the max average modularity, the
Now, it is vital to discuss the solution of the
Given the above definition, the said game could be expressed as the characteristic function form like:
where
So,
It is significant to analysis the following three axioms which is the establishment of the said Shapley value’s definition.
where
According to the unique Shapley value from the analysis, each component
Retrospect the above algorithms, Algorithm 1 gives the construction of the community tree mainly by one-time travel with the complex of
The main point is to divide the final communities precisely, and Algorithm 3 gives ultimate result. It is not hard to get the complexity is
Experiments
Datasets and settings
We consider evaluating our proposed method on synthetic and real-world networks to presents its performances. The synthetic networks are generated randomly based on the LFR (Lancichinetti Fortunato Radicchi) [23] which is easily to control the distribution of degree and community size of the networks and demonstrate the precise of the algorithm of the community detection. The following Table 1 gives the synthetic datasets including some parameters we use for this paper. Here, the mixing parameter
Datasets of synthetic networks
Datasets of synthetic networks
As for real-world networks, it is more significant for us to demonstrate the effectiveness of algorithm. We collect the following datasets which are available from Stanford Network Analysis Project (SNAP) [24].
Datasets of real-world networks
The algorithms were evaluated on the same environment with 8 G RAM and Inter CPU 2.40 GHz for the simulation. And the following community detection methods are used as the comparison:
Since the most popular measurement of the community detection on networks is Modularity [4]. However, it has been shown that modularity is not perfect and performs the unavailable when comes to real-world. The Eq. (4) gives the specific definition.
As for the networks which communities have been known, NMI (normalized mutual information) is used for measuring the performance of the algorithms. NMI is defined as follows:
Where
However, it is necessary to consider the number of discovered community via the algorithms. The result of the detection is always the apparent indicator to compare the precise of the algorithms.
The evaluations on the synthetic networks
The influence of mixing parameters 
The influence of average degree 
Figure 1 gives comparisons in impacts of NMI among algorithms and different scale of synthetic networks from the mixing parameter
We also evaluate the different average degrees in synthetic networks and exhibit the performances on the NMI among algorithms. With the increase of the amounts of the node, there are merely changes in the different scales. From any networks, all of algorithms excepting Infomap keep a well and stable performance. As for Infomap, it is abnormal when it reaches to 10 and it has an increase when the scale of network gets stronger. Besides, we found that Modularity and Louvain possess the good effect with the change of average degree, while they could not keep the same performance on the mixing parameter
The results of modularity on datasets.
Modularity is the most popular measurement which we evaluate on the four real-world datasets. Compared to synthetic networks, the real-world’s properties are more complex and irregular than the synthetics. Besides, NC-GT and Louvain exhibit great performance on different datasets. Compared from multi-aspects, when we concentrate on any one of the algorithms, the results of the modularity are very approximate which reach the phenomenon that the influence of datasets on community detection is slight.
The results of NMI on datasets.
We observe that Infomap and LabelRankT have a perishing performance, while the others maintain the goodness on the evaluation. However, the best point is only 0.856 which illustrates all of algorithms are worse than the synthetic networks when they come to the real-world networks. Combined with Fig. 3, the Louvain and NC-GT get the great and stable performance with the high NMI compared with synthetic networks. Infomap has exhibited the fluctuated and great performance on synthetic networks, while it pays the worst performance and which reaches at 0.2. We get the conclusion Infomap does not considerate the complex ties among real-world networks. The community detection in real-world datasets has to combine the social properties.
Runtime analysis.
It is significant to analyze the runtime of algorithms to demonstrate the efficiency. From Fig. 5, we could Modularity perform the worst with the increase of scale of networks which is fit on small-scale datasets. Our algorithm, NC-GT, is faster than modularity and a considerable to compare with Louvain, Infomap and LabelRanT. NC-GT is suitable for community detection on large-scale networks, especially the online social networks which possess the large amount of nodes and edges and complex ties of the nodes.
The increasing challenge in community detection is more and more common. In this paper, we construct the community tree to store the structure of network which could obtain the local property of nodes. And based on the unique structure, the initial detection is more efficient and easier. Meanwhile,
Footnotes
Acknowledgments
Natural Science Foundation of China (61572036), Natural Science Foundation of Anhui Province of China (1708085MF156), Science Foundation for The Excellent Youth Scholars by Anhui Province, China (2009SQRZ127), CERNET Next Generation Internet Creative Project of China (NGII20170305).
