Public opinion bunching storage model for dense graph data in social networks

Abstract

Graph data storage has promising prospects due to the surge of graph-structured data. It is especially widely used in social networks, where hot public opinions trigger network structures consisting of massively associated entities. However, current storage models suffer from slow processing speed on such densely associated graph data. Thus, we propose a new storage model for dense graph data in social networks to improve data processing efficiency. First, we identify the public opinion network formed by hot topics or events. Second, we design the germ elements and the public opinion bunching mapping relationship based on equivalence partition. Finally, the Public Opinion Bunching Storage (POBS) model is constructed to implement dense graph data storage effectively. Extensive experiments on Twitter datasets demonstrate that the proposed POBS performs favorably against state-of-the-art graph data models for storage and processing.

Keywords

Graph data storage, social networks, topic cluster, equivalent partition

1 Introduction

Graph data storage [1-3] is a prominent technical support for tasks such as social network analysis [4], aiming at efficient management of relationship-intensive graph data. These highly connected data are formed by hot topics [5, 6] that trigger related discussions. Therefore, the dense network contains a large amount of repeated data from the same public opinion [7]. However, traditional graph storage models face several challenges in storing and operating on social network data. Since hotspot data is mentioned frequently, it is refreshed in memory repeatedly. The generation of redundant data and the high probability of repeated use result in a serious waste of space and slow operation speed.

To tackle the above problems, we propose a storage model for social network graph data that improves storage space utilization and response time. Public opinion in social networks is formed by frequent interactions, i.e., attention, forwarding, replies, likes, etc. [8]. These interactions reveal redundant data of varying density around a hot topic. Therefore, segmenting topics [9, 10] and extracting reused data for independent storage can strengthen the mapping association between data, which reduces repeated storage and speeds up data reading. Most existing graph data models, such as the property graph model [11], the hypergraph model [12], and triple stores, e.g., RDF storage [13], do not store these data differentially. Because the structures are formed after the data emerges [14], a graph model usually stores all data equally. Therefore, we optimize the property graph model by developing a public opinion bunching mapping structure for social network graph data that eliminates redundancy and speeds up query operations.

Concretely, we first identify the public opinion network of hot topics by checking the size and the energy value of the data. Then, we select the representative elements and losslessly map all data to shared storage and associations. The most active and dense data is separated from the original property set and processed to form a public opinion bunching structure based on the equivalence partition of the overall data space. We obtain multiple topic nodes and design the public opinion bunching storage model to realize dense graph storage for social networks. The experimental results show that the POBS model outperforms state-of-the-art graph models in processing speed for data loading, querying, and clustering.

The main contributions of our work are summarized as follows:

We design a graph data storage model aiming at the hot public opinions in social networks.

We extract representative data and turn them into new nodes in the graph structure. These nodes compress the shared content and create a public opinion bunching mapping relationship.

We define the public opinion bunching storage model based on equivalent partition. It obtains better storage space occupancy and I/O efficiency.

We conduct experiments on Twitter data to verify the performance of POBS by comparing it with baseline graph models.

2 Related work

2.1 Graph data storage models

The graph data model refers to the way a graph database abstracts and manages data in the form of graphs [15, 16]. The widely used graph data models fall into the following three types. From the perspective of the underlying storage structure, both the hypergraph and the property graph are native graph models, while the triple store is based on the Resource Description Framework (RDF). The hypergraph describes relationships among multiple nodes, while the property graph focuses more on describing the unidirectional relationship between two nodes. Each model is introduced in detail below.

The triple store [13] is a subject-predicate-object triple data structure. It addresses search efficiency with a six-fold index, but this incurs six times the space overhead and a high cost for updates and maintenance. Since its storage engine is not optimized for storing property graphs, it is not a native graph database. In scenarios with densely correlated data, self-join operations impose a considerable load, and their performance is limited by the data scale.

The hypergraph model [12] is proposed as an extension of graphs that allows the relationship to connect any number of nodes. The property graph model allows only one association. However, the hypergraph model provides an association for contacting any number of nodes on either end. The model is suitable for many-to-many relationships that dominate the field [17]. In social networks, it is used to construct the network structure and achieve the expert-finding technique by hyper-paths. Although it can effectively conduct queries, graph partition, and reduce the number of relationships, hypergraphs cannot process the redundant data and are only applicable to the field of many-to-many relationships.

The property graph model uses native graph storage and processing to manage the graph data [11]. It consists of nodes, relationships, properties, and labels. The nodes are connected by links to form a graph with index-free adjacency [18]. The nodes and relationships as containers can store properties, and the property value exists in the form of any key-value pair. At the same time, the nodes can be labeled with one or more tags, which can organize the nodes together and indicate their role in this dataset. In addition, the relationships make the nodes more semantic by direction and name. Since the graph model is convenient for the management of the system [19], the graph structure contains much available potential information. With the prevalence of social networks, the need to convert robust relational data into graph models has emerged significantly. Neo4j and Titan are excellent implementations of the property graph model. Inspired by the advantages of the property graph, we optimize the data model to achieve the compression storage in the social networks graph.

2.2 Graph data storage and optimization methods

Existing techniques for graph data storage and optimization [20, 21] can be broadly categorized into the compression storage of general graph and query-friendly graph compression [22, 23].

General graph compression storage has been studied for a wide range of graph scenarios [24]. The idea is to encode a graph or its transitive closure into compact data structures via a node ordering determined by, e.g., document similarity, hosts, and linkage similarity [24]. These methods preserve the information of the entire graph and depend heavily on the graph type, coding mechanism, and application domain. Besides, they require decompression before querying. Therefore, the efficiency of general graph compression methods is unsatisfactory.

Query-friendly storage models aim at optimizing specific classes of queries, including neighborhood [25] and reachability queries [26]. Nelson et al. propose a queryable compression that exploits Eulerian paths and multi-position linearization. The existing methods preserve information only for a particular class of queries rather than for all types of queries. They also have to modify the query evaluation algorithms designed for the original graphs in order to answer queries on their compact structures.

Our work differs from the current compression methods in the following:

In addition to the structure of nodes and edges in the original graph, we also consider the role of the property in graph storage.

The storage technique we design supports all types of queries while preserving the information of the entire graph.

We query directly in the compressed graph data structure without decompressing.

3 Preliminaries

In this section, we briefly present the preliminaries involved in the design of our POBS graph model.

3.1 The property graph model

A property graph is denoted as G = (V, E, A), where V (G), E (G), and A (G) denote the collections of nodes, edges, and properties, respectively. Nodes and edges usually contain multiple property fields. A (G) is made up of fixed-size records that are referenced by node and relationship records. Each record contains a pointer to either an inline value or a dynamic store record. When the size of the property value is less than the record capacity, it is stored as an inline value. Otherwise, the property value is stored dynamically in the index file.

3.2 Hot public opinion in social networks

Public opinion is content that has caused widespread discussion after a hot topic or event occurs in social networks [3, 27]. It spreads across network platforms such as social networks and gains considerable attention. Within public opinion, there are abundant mutual forwarding relationships among the messages posted on Twitter. It consists of different topics, which can give rise to multiple dense network structures. These public opinion data contain discussions on the same topic, so a large quantity of duplicate nodes is stored and frequently read. The existing methods [3-7] for identifying hot topics mainly rely on techniques such as comparing the similarity of message content and clustering high-frequency hot words.

3.3 Equivalence relation

Let Q be a binary relation on a non-empty set B. If Q is reflexive, symmetric, and transitive, then Q is an equivalence relation on B.

Based on an equivalence relation, B can be divided into different equivalence classes; that is, the equivalence classes of Q constitute a partition of B. The equivalence class [b]_Q of an element b ∈ B is the collection of all elements related to b. The set of all equivalence classes is called the quotient set, B/Q = {[b]_Q | b ∈ B}. The purpose of introducing the equivalence relation [28] is to classify social network data into multiple network clusters, each of which is the collection of messages on one topic. Representative topic elements are selected within each set to manage the repetitive content jointly, thereby reducing computational complexity and storage usage.
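As a minimal illustration of the quotient set, the following Python sketch partitions a finite set into equivalence classes. The helper name `quotient_set`, the key-function formulation of the relation, and the toy messages are assumptions introduced only for illustration; they are not part of the original model.

```python
from collections import defaultdict
from typing import Callable, Hashable, Iterable, List, Tuple


def quotient_set(elements: Iterable, key: Callable[[object], Hashable]) -> List[list]:
    """Partition `elements` into equivalence classes.

    Two elements a and b are related iff key(a) == key(b). A relation defined
    this way is reflexive, symmetric, and transitive, so the resulting classes
    are pairwise disjoint and their union is the whole input set.
    """
    classes = defaultdict(list)
    for element in elements:
        classes[key(element)].append(element)
    return list(classes.values())


# Hypothetical toy data: (message id, topic label) pairs.
messages: List[Tuple[int, str]] = [
    (1, "flood"), (2, "olympics"), (3, "flood"), (4, "election"), (5, "olympics"),
]
partition = quotient_set(messages, key=lambda m: m[1])
```

Each inner list of `partition` plays the role of one equivalence class [b]_Q, and the list of classes plays the role of B/Q.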

4 The proposed POBS model

In this section, we propose a novel graph data storage model. Since the native graph storage can increase the independence of the graph data model, it ensures the separation of graph structure (nodes, edges) and graph data (property). In the light of the native node storage and relationship storage in the property graph model, we design POBS by integrating topic nodes and the public opinion bunching relationships. The difference of graph data storage between the property graph model and POBS is shown in Fig. 1. It demonstrates the node type and interaction relationship of the storage files in the property graph model and POBS, respectively.

Fig. 1

The difference between the property graph and POBS in data storage.

In the native graph storage of the property graph model, the interaction of the storage files on disk is shown in Fig. 1 (a), where the relationship store is a doubly linked list and the property store is a singly linked list. Nodes a and b are message nodes. T_n denotes the topic nodes in different topic clusters. p₁ and p₂ denote property fields. Each entity node has a pointer to its properties, which store the tweet id, the tweet content, the forwarding volume, the number of comments or likes, etc. Depending on the size of the property value, the property is stored either as an inline value or as a new record in the dynamic storage file. Each node points to a specific property p_i (such as mid and content), which forms a one-to-one mapping.

In the proposed POBS model, the public opinion bunching storage relationship is a doubly linked list, while the property store is a singly linked list, as shown in Fig. 1 (b). The dotted lines indicate that the many entities discussing the same topic point to a single dynamic storage file via the common topic node T_n. POBS dramatically reduces the number of copies of the entities' property records, e.g., the tweet content property of node b. Through the transformation of inline values and pointers, the dynamic store records that hold the common properties of the original entity nodes are shared. Hence, POBS compresses the storage space and accelerates operations on the relevant data.

4.1 Problem statement

POBS is denoted as G_T = (V, E_t, A), where V is divided into the message nodes M and the topic nodes Tn, with M ∩ Tn = ∅ and M ∪ Tn = V. A consists of the shared properties T_∼, which are the duplicate data stored in dynamic storage records, and the remainder properties A_t, with A_t ∩ T_∼ = ∅ and A_t ∪ T_∼ = A. The node set M points to the property set A_t, and the property domain corresponding to Tn is T_∼. Thus, the number of nodes is |V| = |M| + |Tn|, which covers all sample nodes in the dataset. |Tn| is usually the number of hot topics in the dataset. The graph is a public opinion bunching mapping structure formed by multiple nodes with duplicate data pointing to one topic node Tn_i ∈ Tn. The duplicate content is stored by Tn_i and shared by the associated nodes.

4.2 Identifying network structure of hot public opinion

Hot public opinion consists of different topics, which can give rise to multiple dense network structures. Since the topics that can cause clustered dense networks are diverse, we need to measure the heat of the data comprehensively. The purpose is to find the data in A (G) from which T_∼ can be constructed.

We construct a directed graph from forwarding operations; it is a simple graph without self-loops or parallel edges. The nodes represent messages, and an edge represents forwarding between messages. The nodes in the same cluster are densely connected and discuss the same topic. We define and recognize the network structure of hot public opinion that leads to high aggregation and a large amount of content redundancy.

On the one hand, when the content exceeds 32 bytes, the data cannot be encoded as an inline value and occupies additional storage space in the property graph data model [11]. Therefore, we use the size of a message to discover the data that may waste storage space. On the other hand, the numbers of forwards, comments, and likes determine the heat of a message. If the volume of interactions reaches a threshold set empirically, the message is hot data that influences the throughput of I/O operations [29]. Combining the two conditions, we define a message whose size and energy value both satisfy these requirements as a public opinion network (PON) entity. Such messages attract many forwarders to form a public opinion network, which contains massive duplicate data.
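The two conditions can be sketched as a simple predicate. This is only an illustration under stated assumptions: the text does not specify how the energy value is computed or what threshold is used, so the plain sum of interaction counts in `energy` and the value of `ENERGY_THRESHOLD` are hypothetical, as are the `Message` class and the function names.

```python
from dataclasses import dataclass

INLINE_LIMIT_BYTES = 32   # inline-value limit stated in the text [11]
ENERGY_THRESHOLD = 100    # hypothetical empirical threshold, not taken from the paper


@dataclass
class Message:
    mid: str
    content: str
    forwards: int = 0
    comments: int = 0
    likes: int = 0


def energy(msg: Message) -> int:
    """Assumed energy value: the plain sum of the interaction counts."""
    return msg.forwards + msg.comments + msg.likes


def is_pon_entity(msg: Message, threshold: int = ENERGY_THRESHOLD) -> bool:
    """True if the content cannot be inlined AND the message is hot."""
    too_large = len(msg.content.encode("utf-8")) > INLINE_LIMIT_BYTES
    return too_large and energy(msg) >= threshold


hot = Message("m1", "Flood warning issued for the Brisbane river area",
              forwards=250, comments=40, likes=300)
short = Message("m2", "ok", forwards=500)          # hot, but small enough to inline
cold = Message("m3", "A long message that nobody has interacted with so far", likes=1)
```

Only `hot` satisfies both conditions; `short` fails the size condition and `cold` fails the energy condition.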

4.3 Equivalent partition and public opinion bunching mapping

In this section, the topic nodes are extracted from the entities of hot public opinion. T_∼ is constructed from A (G) and the public opinion bunching structure is generated.

4.3.1 Equivalence relation on the hot public opinion network

In a public opinion network, each entry is a collection whose elements are mappings of key-value pairs. The property value of the message content is segmented by the flag “RT @” into shared data (i.e., the forwarded original message) and a differentiated part (i.e., opinions on the forwarded content).

We first transform each message into a vector in the same order as the property segments. According to related research on text feature analysis [30], the 3,000 most frequently used words in a dataset can fully represent the semantics of all texts. So, we select the 3,000 words with the highest frequency in the microblog dataset to compose a feature dictionary. The message is denoted as a vector in which each component corresponds to a word feature of the vocabulary, and the value of each component is the number of occurrences of that word in the message m. We further split the feature representation of a message into two components based on “RT @”. The shared content is represented as a vector $\vec{t}$, and $\vec{x}$ is a vector representing the differing part. Both are 3000-dimensional feature vectors. Thus, each message entry in a public opinion network is formulated as:

$m = g (\vec{t}, \vec{x}) = \vec{t} \oplus b \cdot \vec{x},$ (1)

where ⊕ denotes the concatenation of two vectors and b is a mask coefficient used to select the differing part of the vector. We normalize the vector representation as:

$w_{i}' = \frac{w_{i}}{\sum_{n = 1}^{3000} w_{n}}, \quad w_{i} \in m,$ (2)

where w_i is the i-th value of the message feature m. That is, we divide each value w_i by the sum of all components to obtain a normalized representation of the message feature.
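The encoding in Equations (1) and (2) can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization, a mask coefficient b fixed to 1, and that everything after the first “RT @” is the shared part; all function names are hypothetical and the toy vocabulary is far smaller than 3,000 words.

```python
from collections import Counter
from typing import List, Tuple

RT_FLAG = "RT @"


def build_vocabulary(texts: List[str], size: int = 3000) -> List[str]:
    """Return the `size` most frequent whitespace-separated tokens."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [word for word, _ in counts.most_common(size)]


def count_vector(text: str, vocabulary: List[str]) -> List[int]:
    """Number of occurrences of each vocabulary word in `text`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def split_message(text: str) -> Tuple[str, str]:
    """Split into (shared part, differing part) at the first "RT @" flag.

    A message without the flag is treated as an original message,
    i.e. everything is shared and the differing part is empty.
    """
    head, flag, tail = text.partition(RT_FLAG)
    if not flag:
        return text.strip(), ""
    return tail.strip(), head.strip()


def encode(text: str, vocabulary: List[str]) -> List[int]:
    """Eq. (1) with b = 1 (assumption): concatenate the shared and differing vectors."""
    shared, diff = split_message(text)
    return count_vector(shared, vocabulary) + count_vector(diff, vocabulary)


def normalize(vector: List[int]) -> List[float]:
    """Eq. (2): divide every component by the sum of all components."""
    total = sum(vector)
    return [w / total for w in vector] if total else [0.0] * len(vector)


texts = ["river is rising", "so sad RT @bob: river is rising"]
vocab = build_vocabulary(texts)
vec = encode(texts[1], vocab)
```

An original message yields an all-zero second half, which corresponds to $g (\vec{t}, \vec{0})$.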

Because the original messages have no different parts, they are represented as $g (\vec{t}, \vec{0})$ .

Suppose T is the space of all messages in the neighborhood of the original microblog message, where $g (\vec{t}, \vec{x})$ and $g (\vec{t}, \vec{y})$ are any two messages. If there is a binary relation ∼ ⊆ T × T that makes $g (\vec{t}, \vec{x}) \equiv g (\vec{t}, \vec{y})$ in a sufficiently small neighborhood of the original message, then ∼ is the equivalence relation. The relationship between $g (\vec{t}, \vec{x})$ and $g (\vec{t}, \vec{y})$ can be expressed as:

$\sim = \{ < g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) > \mid g (\vec{t}, \vec{x}) χ g (\vec{t}, \vec{y}) \to g (\vec{t}, \vec{0}) \}$ (3)

where χ is the operation of finding the shared values of all vectors.

According to the equivalence relation, T can be divided into several disjoint subsets. Each subset can be regarded as an equivalence class that represents all of its elements. Therefore, the elements in T can be replaced by a series of equivalence classes. The segmentation is based on equivalence classes composed of different topics. We extract the common feature of each equivalence class so that the class can be distinguished from any other equivalence class in T. We use these features to represent the equivalence classes and define them as the germ elements of the public opinion network, as stated in Definition 1.

Definition 1. (Germ Elements in PON) The equivalence relation ∼ divides T into n equivalence classes {T₁, T₂, . . . , T_n}, where T_i = {a_i1, a_i2, . . . , a_ij}. Only n equivalence classes are needed to represent all elements a_ij, where i × j ≫ n; that is, the number of entries to represent drops from |T| = i × j to n. For each equivalence class, the message formed around the shared feature part of all its elements is called the germ element:

$[\vec{t}]_{\sim} = \{ < g (\vec{t}, \vec{q}) > \mid \vec{q} \in m, \; g (\vec{t}, \vec{q}) \sim g (\vec{t}, \vec{0}) \}$ (4)

In the public opinion network space, for any $g (\vec{t}, \vec{x})$ , $g (\vec{t}, \vec{y})$ , and $g (\vec{t}, \vec{z})$ in the set T, they all have the following properties based on vector $g (\vec{t}, \vec{0})$ :

Law of Reflexivity.

For $g (\vec{t}, \vec{x})$ ∈T, based on the Equation (3), we obtain its binary relation:

$< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{x}) >$ → $g (\vec{t}, \vec{x}) χ g (\vec{t}, \vec{x})$

→ $(\vec{t} \oplus b \cdot \vec{x}) χ (\vec{t} \oplus b \cdot \vec{x})$ → $\vec{t} \to \vec{t} \oplus b \cdot \vec{0} \to g (\vec{t}, \vec{0})$ .

So $< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{x}) > \in$ ∼. For $g (\vec{t}, \vec{y})$ and $g (\vec{t}, \vec{z})$ , they have the same equivalence relation. We conclude that,

$\forall g (\vec{t}, \vec{q}) \in T$ , $g (\vec{t}, \vec{q}) \sim g (\vec{t}, \vec{q})$ .

Law of Symmetry.

For $g (\vec{t}, \vec{x})$ ∈T, $g (\vec{t}, \vec{y})$ ∈T, we get its binary relation based on the Equation (3):

$< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) >$ → $g (\vec{t}, \vec{x}) χ g (\vec{t}, \vec{y})$

→ $(\vec{t} \oplus b \cdot \vec{x}) χ (\vec{t} \oplus b \cdot \vec{y})$ → $\vec{t} \to \vec{t} \oplus b \cdot \vec{0} \to g (\vec{t}, \vec{0})$ ,

$< g (\vec{t}, \vec{y}), g (\vec{t}, \vec{x}) >$ → $g (\vec{t}, \vec{y}) χ g (\vec{t}, \vec{x})$

→ $(\vec{t} \oplus b \cdot \vec{y}) χ (\vec{t} \oplus b \cdot \vec{x})$ → $\vec{t} \to \vec{t} \oplus b \cdot \vec{0} \to g (\vec{t}, \vec{0})$ .

So $< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) >$ → $< g (\vec{t}, \vec{y}), g (\vec{t}, \vec{x}) >$ . Similarly, the same relation holds for any other pair drawn from the three entities. That is,

$\forall g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) \in T$ , $g (\vec{t}, \vec{x}) \sim g (\vec{t}, \vec{y}) \to g (\vec{t}, \vec{y}) \sim g (\vec{t}, \vec{x})$ .

Law of Transitivity.

For $g (\vec{t}, \vec{x})$ ∈T, $g (\vec{t}, \vec{y})$ ∈T, $g (\vec{t}, \vec{z})$ ∈T, we get their binary relations according to Equation (3):

$< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) >$ → $g (\vec{t}, \vec{x}) χ g (\vec{t}, \vec{y})$

→ $(\vec{t} \oplus b \cdot \vec{x}) χ (\vec{t} \oplus b \cdot \vec{y})$ → $\vec{t} \to \vec{t} \oplus b \cdot \vec{0} \to g (\vec{t}, \vec{0})$ ,

$< g (\vec{t}, \vec{y}), g (\vec{t}, \vec{z}) >$ → $g (\vec{t}, \vec{y}) χ g (\vec{t}, \vec{z})$

→ $(\vec{t} \oplus b \cdot \vec{y}) χ (\vec{t} \oplus b \cdot \vec{z})$ → $\vec{t} \to \vec{t} \oplus b \cdot \vec{0} \to g (\vec{t}, \vec{0})$ .

Based on the two derivations above, we can obtain $< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}) > \land < g (\vec{t}, \vec{y}), g (\vec{t}, \vec{z}) >$ → $< g (\vec{t}, \vec{x}), g (\vec{t}, \vec{z}) >$ . That is,

$\forall g (\vec{t}, \vec{x}), g (\vec{t}, \vec{y}), g (\vec{t}, \vec{z}) \in T$ , $g (\vec{t}, \vec{x}) \sim g (\vec{t}, \vec{y}) \land g (\vec{t}, \vec{y}) \sim g (\vec{t}, \vec{z}) \to g (\vec{t}, \vec{x}) \sim g (\vec{t}, \vec{z})$ .

To sum up, ∼ is an equivalence relation on T. The germ elements in set T can effectively cover all the data in space T.

4.3.2 Topic nodes and public opinion bunching mapping structure

The germ elements contain the public part of the equivalence class set. We regard the germ element as a special node in the graph model, which is called the topic node. The topic nodes are constructed by the equivalence reduction to store the repetitive content in their property fields.

Social networks with dense and massively relational data increase the workload of the graph database and occupy substantial system resources when their nodes are traversed. So we introduce the public opinion bunching relationships to prune the redundant data. In this way, we can segment the large-scale graph data effectively and compress each cluster structure through the mapping between the topic nodes T_n and the remaining data M. After extracting the germ elements from the property record file, we construct the public opinion bunching mapping between the topic nodes and the message nodes. This mapping ensures that no data is lost.

To generate public opinion bunching relationships, we need to discover and label the important properties. We acquire the node properties in turn by label and then establish an index on the labeled property. Based on the PON entry, we obtain the property name and the value of each object to be compressed, i.e., those that exceed the given threshold. We traverse the node set D_v connected to the node v and check the properties of each node. If the message content property p_c of a node meets the conditions of the network structure of hot public opinion, an index is created on this property of the node. This yields the index tree $I_{v}^{p}$ over the node set V and its properties. Then, we traverse the node set a second time according to $I_{v}^{p}$ . We extract the data pointed to by each index entry as an independent node and use its property value to judge whether the shared part $\vec{t}$ is repeated. If no topic node with this content exists yet, the extracted node is regarded as a new topic node and associated with the original node. Otherwise, a new association between the original node and the existing topic node is created. After the traversal, a public opinion bunching mapping relationship is formed in each cluster network structure. We delete the index and the dynamic store records of all nodes except the topic nodes and save the $\vec{x}$ of each remaining node as an inline value. Finally, the public opinion bunching mapping relationship is generated. The time complexity of the algorithm is O (|V|²). The process is outlined in Algorithm 1.

Algorithm 1 Generation of the public opinion bunching relationship
Input:
G = (V, E, A);
Output:
Topic node set T_n and public opinion bunching relationship;
for v in V do
    D_v ← traverse all nodes connected to the node v;
    for v_i in D_v do
        if p_c of v_i is not inline then
            create index on p_c;
            $I_{v}^{p} \leftarrow$ index(p_c);
        end if
    end for
end for
if $I_{v}^{p} \neq \emptyset$ then
    for I_i in $I_{v}^{p}$ do
        n_tmp ← I_i.property;
        if n_tmp is not in T_n then
            topic node ← n_tmp;
            T_n << topic node;
            createRelation(I_i.node, topic node);
        else
            createRelation(I_i.node, existing topic node equal to n_tmp);
            delete n_tmp, I_i;
        end if
    end for
end if
return T_n and public opinion bunching relationship.
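The two passes of Algorithm 1 can be sketched in Python on an in-memory dictionary. This is an illustrative sketch rather than the authors' implementation: it assumes the 32-byte inline limit from Section 4.2 as the only indexing condition, treats everything after the first “RT @” as the shared part, skips the neighbor traversal D_v, and uses a hash map for the duplicate check, which differs from the traversal-based lookup whose cost is analysed in the text. The function name `generate_bunching` and the topic ids `t0`, `t1`, ... are hypothetical.

```python
INLINE_LIMIT_BYTES = 32   # inline-value limit quoted in Section 4.2
RT_FLAG = "RT @"          # flag separating the comment from the forwarded content


def generate_bunching(nodes):
    """nodes: dict mapping node id -> message content p_c.

    Returns (topic_nodes, relations, inline_values) where
      topic_nodes   maps shared content -> topic node id,
      relations     lists (message node id, topic node id) pairs,
      inline_values maps message node id -> content kept on the node itself.
    """
    # Pass 1: index every node whose content is too large to be inlined.
    index = [nid for nid, content in nodes.items()
             if len(content.encode("utf-8")) > INLINE_LIMIT_BYTES]
    indexed = set(index)

    topic_nodes = {}
    relations = []
    # Small values stay inline on their own node, untouched.
    inline_values = {nid: c for nid, c in nodes.items() if nid not in indexed}

    # Pass 2: extract the shared part and link each node to its topic node.
    for nid in index:
        head, flag, tail = nodes[nid].partition(RT_FLAG)
        shared, diff = (tail, head) if flag else (nodes[nid], "")
        if shared not in topic_nodes:                 # first occurrence: new topic node
            topic_nodes[shared] = "t%d" % len(topic_nodes)
        relations.append((nid, topic_nodes[shared]))  # association in both cases
        inline_values[nid] = diff                     # only the differing part remains
    return topic_nodes, relations, inline_values


nodes = {
    "n1": "stay safe RT @bob: Flood warning issued for the Brisbane river",
    "n2": "so sad RT @bob: Flood warning issued for the Brisbane river",
    "n3": "hello",
    "n4": "go team RT @amy: Opening ceremony tonight in Rio de Janeiro",
}
topics, relations, inline = generate_bunching(nodes)
```

In this toy run, `n1` and `n2` share one topic node, `n4` gets its own, and `n3` is left as an ordinary inline value.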

The public opinion bunching structure is shown in Fig. 2. The nodes id_a, id_b, and id_c are the original message nodes, and tid_a, tid_b, and tid_c are the extracted topic nodes. The other nodes are forwarding nodes around the original nodes; they contain the original message content together with comment information. A retweet is an independent microblog released by a user, but it also carries a copy of the content of the original tweet. As a result, much of the same content is contained in different entity nodes, which causes multiple copies throughout the store file. The message content is thus stored repeatedly in the property of each node, which wastes storage space. We design a public opinion bunching structure in which the cluster data of each topic jointly point to a topic node. Common data within a topic is stored by the topic node and shared by all nodes of that topic. A message node stores only the content that differs in each forwarding. The public opinion bunching structure not only avoids the repeated storage of data but also preserves the integrity of the graph structure.

Fig. 2

The public opinion bunching mapping structure of POBS.

4.3.3 Public opinion bunching storage model

The data structure composed of the topic nodes and the public opinion bunching mapping relationship is further formally defined as the public opinion bunching storage model to realize the decoupling of network clusters. All topic nodes induced by the equivalence relation form the public opinion bunching set of T. We use T_∼ to represent the public opinion bunching set:

$T_{\sim} = \{ [\vec{t}]_{\sim} \mid \vec{t} \in T \}$ (5)

where T_∼ satisfies the conditions:

$[\vec{t_{a}}]_{\sim} \cap [\vec{t_{b}}]_{\sim} \cap \cdots \cap [\vec{t_{c}}]_{\sim} = \emptyset$ (6)

$[\vec{t_{a}}]_{\sim} \cup [\vec{t_{b}}]_{\sim} \cup \cdots \cup [\vec{t_{c}}]_{\sim} = T$ (7)

Definition 2. (POBS Model) The equivalence relation on the PON space T determines the construction of the public opinion bunching set T_∼, and the generated public opinion bunching set represents the contents of all coarse-grained equivalence classes, which effectively solves the equivalence partition problem of T. The public opinion bunching mapping is used to further generate the fine-grained division, so as to cover all contents and obtain better space utilization. The public opinion network space is mapped to the public opinion bunching set in a new compressed form, which is defined as the Public Opinion Bunching Storage Model:

$T / \sim = \left\{ \begin{matrix} \cdots \\ \vec{t_{i}}, [\vec{x_{i}}, \vec{y_{i}}, \vec{z_{i}}, \ldots] \\ \cdots \end{matrix} \right\}$ (8)

The public opinion bunching storage model covers all the representative elements and the differences within each equivalence class. The differing part is separated out and used as the property value of the original node in V, which is stored in A_t. Hence, the public opinion bunching set structure is constructed on the property graph model, and a new graph model G_T = (V, E_t, A) is formed. By extracting the germ elements, the data stored in the public opinion bunching set model is mapped to a bunching structure.
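The structure in Equation (8) can be read as a mapping from each shared part to the list of differing parts. The sketch below is a hedged illustration of that reading, using plain strings instead of feature vectors; the function names `bunch`, `expand`, `flat_size`, and `bunched_size` and the character-count size measure are assumptions made for the example.

```python
from collections import defaultdict


def bunch(messages):
    """messages: list of (shared, diff) pairs -> {shared: [diff, ...]} as in Eq. (8)."""
    quotient = defaultdict(list)
    for shared, diff in messages:
        quotient[shared].append(diff)
    return dict(quotient)


def expand(quotient):
    """Inverse mapping: recover every (shared, diff) pair, so nothing is lost."""
    return [(shared, diff) for shared, diffs in quotient.items() for diff in diffs]


def flat_size(messages):
    """Characters stored when every message keeps its own copy of the shared part."""
    return sum(len(shared) + len(diff) for shared, diff in messages)


def bunched_size(quotient):
    """Characters stored when each shared part is kept once per equivalence class."""
    return sum(len(shared) + sum(len(d) for d in diffs)
               for shared, diffs in quotient.items())


messages = [
    ("flood warning", ""),            # original message, empty differing part
    ("flood warning", "stay safe"),
    ("opening ceremony", "wow"),
    ("flood warning", "so sad"),
]
quotient = bunch(messages)
```

Here `expand(quotient)` returns the same pairs as `messages` up to ordering, while `bunched_size(quotient)` is smaller than `flat_size(messages)` because the shared text of each class is stored only once.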

5 Experiment and analysis

In this section, we experimentally evaluate the performance of POBS in terms of storage utilization and operating time on a set of microblog datasets. We adopt typical graph operations and describe each workload in detail to make a comprehensive comparison. Concretely, we select three benchmark models to verify the performance of POBS in four aspects: data loading, common queries, random queries, and clustering tests. The benchmark models are the triple store model, the hypergraph model, and the property graph model.

5.1 Datasets

We investigate several hot public opinion events and crawl Twitter data through the Twitter API based on relevant keywords and the area in which each event occurred. The dataset contains 215,276 tweets. Each message consists of five fields, namely the Twitter id, the tweet content, the forwarding volume, the number of comments, and the number of likes. We obtain data on three hot public opinion events: “Queensland Floods”, “Rio Olympic Games”, and “Election2016”.

To thoroughly verify the performance of POBS under different scales and numbers of topics, the Twitter data is organized into three datasets. Dataset I contains the single topic “Queensland Floods”. Dataset II merges “Queensland Floods” and “Rio Olympic Games”. Dataset III is a mixture of all three topics. Since each hot topic can generate several public opinion network structures, the number of PON entries in each dataset reflects how effectively POBS can deal with redundant data. We preprocess the data by removing punctuation and special symbols without affecting semantic comprehension. Then, we identify the complex network clusters and calculate the proportion of each dataset occupied by PON entities. The statistics of the datasets are shown in Table 1.

Table 1
Summary of dataset statistics

Dataset	#Tweets	#PON	#PON entries (share of tweets)	#Topics
Dataset I	70,347	18	18,994 (27%)	1
Dataset II	138,599	56	63,756 (46%)	2
Dataset III	215,276	129	133,471 (62%)	3

5.2 Performance evaluation on data loading

We evaluate the data loading performance of the POBS model with two types of insert operations: single insertion workloads and massive insertion workloads. We compare the proposed POBS with three graph models in data loading on the three datasets. The detailed analysis and discussion are as follows.

Single insertion workloads. We simulate real-time single insertion, in which the graph is created progressively. We assume that the growth follows the steps of the single insert operation and adds nodes and edges in turn: the process repeatedly creates a single node and a relationship to a node that has already been created. We create a graph database and then load the experimental data into it. The graph is constructed incrementally, with each object (node or edge) being inserted in turn. Figure 3 (a) shows that the execution time of POBS is the lowest. The higher the proportion of redundant data in the dataset that qualifies as PON entities, the faster the single insertion is. In contrast, the insertion time of the compared models on the three datasets increases significantly as the redundancy grows. This is because the higher the repetition rate of the dataset, the greater the proportion of data compressed by the POBS model. Therefore, the POBS model can reduce the execution time of one-by-one insertion to a certain extent.

Fig. 3

The efficiency of massive insertion workloads.

Massive insertion workloads. We simulate massive insertion of graph-structured data, that is, quickly loading existing graph data in batch mode. The workload batch-imports all of the nodes and then batch-imports all of the relationships. If the ending node of a relationship does not exist, we create a node with only the assigned node id and node property. We first create a graph database and configure it for bulk loading, and then load the three datasets into it. Meanwhile, we measure the throughput of creating the entire graph. The results are shown in Fig. 3 (b). POBS achieves a higher throughput than the other graph models. Its key advantage is the public opinion bunching structure, which reduces the number of relationships to be created and inserted. In contrast, the other models store all relationships and nodes, so their insertion time increases significantly with data volume.
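
The batch procedure above can be sketched as follows. This is a simplified illustration over plain dictionaries rather than a real graph database; the function name `bulk_load`, the input tuples, and the definition of throughput as created objects per second are assumptions for the example. Source nodes are assumed to be present in the node batch.

```python
import time


def bulk_load(nodes, relationships):
    """Batch-import all nodes, then all relationships.

    `nodes` is an iterable of (node_id, properties) pairs and `relationships`
    an iterable of (src, dst) pairs. Returns the node store, the adjacency
    map, and the throughput in created objects per second.
    """
    start = time.perf_counter()

    # Phase 1: import every node in one pass.
    node_store = {node_id: dict(props) for node_id, props in nodes}
    adjacency = {node_id: set() for node_id in node_store}

    # Phase 2: import every relationship; a missing ending node is created
    # as a placeholder that carries only its id.
    for src, dst in relationships:
        if dst not in node_store:
            node_store[dst] = {"id": dst}
            adjacency[dst] = set()
        adjacency[src].add(dst)

    elapsed = time.perf_counter() - start
    created = len(node_store) + sum(len(targets) for targets in adjacency.values())
    throughput = created / elapsed if elapsed > 0 else float("inf")
    return node_store, adjacency, throughput


store, adj, rate = bulk_load(
    nodes=[("a", {"name": "alice"}), ("b", {"name": "bob"})],
    relationships=[("a", "b"), ("a", "c")],
)
print(f"{len(store)} nodes loaded at {rate:.0f} objects/s")
```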

5.2.1 Query workload, QW

We perform two common graph queries on the four graph data models: finding neighbors (FindNeighbours) and finding the shortest path (FindShortestPath). FindNeighbours retrieves a user’s friends or followers on the social network. FindShortestPath indicates how closely two users are connected. For social networks, it is crucial that these queries run in the shortest possible time.

FindNeighbours. We use breadth-first traversal to find the neighbor nodes. In this process, only the first-order neighbors of each node are found, and the time each node needs to find its neighbors is recorded. The experimental results are shown in Fig. 4, where the query time of each graph data model is evaluated as the number of queried nodes grows in steps of 20% of the dataset.

Fig. 4 shows that the execution time of the POBS model is significantly lower than that of the benchmark models for every number of query nodes. The advantage of POBS grows as the query data volume and repetition rate increase. This is because POBS extracts redundant data and compresses the relationship structure, which yields efficient graph query performance.
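
The FindNeighbours measurement can be sketched as below. The adjacency-map representation, the helper names, and the way query batches are drawn (prefixes covering 20%, 40%, ..., 100% of the node list) are assumptions for illustration; the text does not specify how the query nodes are chosen.

```python
import time


def find_neighbours(adjacency, node):
    """Breadth-first traversal truncated at depth 1: first-order neighbours only."""
    return set(adjacency.get(node, ())) - {node}


def query_batches(nodes, step_percent=20):
    """Yield prefixes covering 20%, 40%, ..., 100% of the node list."""
    nodes = list(nodes)
    for percent in range(step_percent, 101, step_percent):
        yield nodes[: len(nodes) * percent // 100]


def time_find_neighbours(adjacency, query_nodes):
    """Return the time in seconds spent finding neighbours for each query node."""
    timings = {}
    for node in query_nodes:
        start = time.perf_counter()
        find_neighbours(adjacency, node)
        timings[node] = time.perf_counter() - start
    return timings


adjacency = {
    "u1": {"u2", "u3"},
    "u2": {"u1"},
    "u3": {"u1", "u4"},
    "u4": {"u3"},
    "u5": set(),
}
for batch in query_batches(adjacency):
    total = sum(time_find_neighbours(adjacency, batch).values())
    print(f"{len(batch)} query nodes: {total:.6f} s")
```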

Fig. 4

Query time of the FindNeighbours workload.

FindShortestPath. In this operation, we find the shortest path between a given starting node and randomly selected nodes. The number of randomly selected nodes is taken from {20, 40, 60, 80, 100}. The query times are compared in Fig. 5. The execution time of POBS is significantly lower than that of the contrast models. The increasing amount of data does not noticeably affect the query time of the POBS model, whereas the baseline graph models spend more time as the data volume increases. This shows that the public opinion bunching mapping significantly reduces the time POBS spends on round trips between queried nodes, giving high query efficiency.
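
The text does not name the shortest-path algorithm used. For an unweighted social graph, a common choice is breadth-first search, sketched below as an assumption rather than as the authors' method; the adjacency map and function name are illustrative.

```python
from collections import deque


def find_shortest_path(adjacency, start, goal):
    """Return one shortest path (fewest edges) as a list of nodes, or None."""
    if start == goal:
        return [start]
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbour in adjacency.get(current, ()):
            if neighbour in parent:
                continue
            parent[neighbour] = current
            if neighbour == goal:
                # Walk the parent links back to the start node.
                path = [goal]
                while parent[path[-1]] is not None:
                    path.append(parent[path[-1]])
                return path[::-1]
            queue.append(neighbour)
    return None


graph = {
    "a": ["b", "d"],
    "b": ["a", "c"],
    "c": ["b", "e"],
    "d": ["a", "e"],
    "e": ["d", "c"],
}
print(find_shortest_path(graph, "a", "c"))
```

Because breadth-first search visits nodes in order of hop distance, the first time the goal is reached the path uses the fewest possible edges.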

Fig. 5

Query time of the FindShortestPath workload.

5.2.2 Random query workloads, RQW

To verify the efficiency of ad-hoc queries, we run random query workloads on the graph models. Two parameters, rand and t, ensure randomness: rand bounds the number of relationship hops in a matched path, and t limits the number of returned items. The two types of random queries are as follows.

CQL1: profile match graph = (a)-[*0..rand]-(b) return graph limit t;

CQL2: profile match dire_graph = (a)-[*0..rand]->(b) return dire_graph limit t;

where a and b are the starting node and ending node. The two query statements, one for the undirected graph and one for the directed graph, are each executed seven times. They match paths with an arbitrary number of edges up to rand and output the first t items. The random query times are compared in Table 2. POBS obtains the lowest query time and outperforms the contrast models in all random cases. Among the other models, the query time ranks as Triple store > Hypergraph > Property graph. The reason is that the public opinion bunching mapping of POBS associates large amounts of related data, which reduces edge traversal during the query. Therefore, the POBS model shows better integrity and stability in queries.
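
The behaviour of the two statements can be emulated in a few lines. The sketch below assumes that they follow Cypher-style semantics, in which a variable-length pattern matches paths of 0 to rand relationships and no relationship is reused within one path. The edge-list input and the function names are illustrative, and the order of results is an artifact of this sketch rather than of any particular database.

```python
from itertools import islice


def variable_length_paths(edges, max_hops, directed=False):
    """Lazily yield paths (as node lists) that use between 0 and max_hops edges.

    No edge is used twice within a single path.
    """
    outgoing = {}
    for index, (src, dst) in enumerate(edges):
        outgoing.setdefault(src, []).append((index, dst))
        outgoing.setdefault(dst, [])
        if not directed:
            outgoing[dst].append((index, src))

    def extend(path, used):
        yield list(path)
        if len(used) == max_hops:
            return
        for index, nxt in outgoing[path[-1]]:
            if index not in used:
                path.append(nxt)
                used.add(index)
                yield from extend(path, used)
                used.remove(index)
                path.pop()

    for start in outgoing:
        yield from extend([start], set())


def random_query(edges, rand, t, directed=False):
    """Emulate `match p = (a)-[*0..rand]-(b) return p limit t`."""
    return list(islice(variable_length_paths(edges, rand, directed), t))


edges = [("a", "b"), ("b", "c")]
print(random_query(edges, rand=2, t=4))                 # undirected, like CQL1
print(random_query(edges, rand=1, t=10, directed=True))  # directed, like CQL2
```

Since the generator is consumed lazily, the limit t stops the traversal as soon as enough paths have been produced, which is why small values of t tend to be cheap even for large values of rand.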

Table 2
Comparisons of the time of random query (ms)

CQL	rand	t	Triple store	Hypergraph	Property graph	POBS
CQL1	7	575	2891	2312	1976	1027
	0	880	72702	2093	1771	1011
	5	9249	3084	2811	2250	1207
	5	389	507	382	114	103
	3	94833	6023	5401	3268	1998
	3	6926	798	617	483	407
	1	3769	612	410	244	197
CQL2	3	6864	980	727	450	290
	5	7006	589	306	272	201
	8	7060	562	313	268	198
	9	679	239	186	89	88
	0	945	408	287	138	92
	4	89554	3762	2067	1481	970
	1	1375	478	301	105	97

5.3 Performance evaluation on Clustering Workloads, CW

Clustering is a common operation for data analysis within social networks, such as topic discovery and community mining. The speed of clustering is a critical metric of graph storage performance. We select widely used clustering algorithms, including K-means [31], DBSCAN [32], Spectral Clustering [33] and Hierarchical Clustering [34], to perform clustering operations on the four graph data models and compare their execution speed.

We conduct the experiments in this section on Dataset III, because it contains the most redundant data and therefore excludes the interference of other factors and fully demonstrates the advantages of POBS. In all clustering operations, we use the vector space model to quantify the messages and convert the similarities to text distances. Based on the number of topics in Dataset III, we set the number of clusters to 26 in all clustering methods that require it. For K-means, we select the corresponding initial cluster centers. Notably, in the POBS model we use the topic nodes as the initial centroids. In DBSCAN, all core points are calculated from the given MinPts and radius Eps, and each core point is mapped to the points that lie within distance Eps of it. In particular, by the nature of DBSCAN, every topic node can serve as a core object. For Spectral Clustering, we use K-nearest neighbors to construct the graph and Ncut as the cut method. When we process the data through POBS, we treat the multiple nodes connected to each topic node as one node and do not consider the public opinion bunching relationship. For hierarchical clustering, we use the agglomerative (cohesion) method.
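
The idea of seeding K-means with topic nodes can be sketched as follows. This is a toy illustration under stated assumptions: messages are turned into term-count vectors, the text distance is taken to be one minus cosine similarity, and the topic texts stand in for topic nodes. The function names and sample messages are hypothetical and do not come from the datasets.

```python
import math
from collections import Counter


def vectorize(texts):
    """Vector space model: one term-count vector per text over a shared vocabulary."""
    vocab = sorted({word for text in texts for word in text.lower().split()})
    vectors = []
    for text in texts:
        counts = Counter(text.lower().split())
        vectors.append([float(counts[word]) for word in vocab])
    return vectors


def cosine_distance(u, v):
    """Text distance derived from cosine similarity (assumed conversion)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def kmeans(vectors, initial_centroids, max_iter=100):
    """K-means with caller-supplied initial centroids.

    Returns the labels and the number of iterations until assignments stabilise.
    """
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for iteration in range(1, max_iter + 1):
        new_labels = [
            min(range(len(centroids)), key=lambda k: cosine_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            return labels, iteration
        labels = new_labels
        for k in range(len(centroids)):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the previous centroid if a cluster is empty
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, max_iter


messages = [
    "queensland floods rescue",
    "floods in queensland",
    "olympic games opening",
    "olympic games medals",
]
topics = ["queensland floods", "olympic games"]  # stand-ins for topic nodes

all_vectors = vectorize(messages + topics)
message_vectors = all_vectors[: len(messages)]
topic_vectors = all_vectors[len(messages):]

labels, iterations = kmeans(message_vectors, topic_vectors)
print(labels, iterations)
```

When the seeds already sit close to the final cluster centers, as topic nodes are expected to, the assignments stabilise after very few iterations.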

We execute each of the four clustering algorithms 20 times on POBS and on the baseline graph models. Table 3 reports the average running time of the clustering task. Compared with the baseline graph models, POBS gives every clustering algorithm its best execution efficiency, that is, the shortest clustering time. This is due to the doubly linked list structure held by the topic node in the public opinion bunching relationship: each topic node directly identifies the cluster to which its connected nodes belong. In addition, K-means has the shortest clustering time among the four clustering algorithms. The reason is that POBS specializes in dealing with redundant data, which normally belongs to the same category. During storage, POBS aggregates forwarding relationships on the same topic and maps them to the same topic node. The topic nodes then act as the initial centroids of the K-means algorithm, and the number of iterations is greatly reduced. In summary, these structural advantages give POBS superior clustering performance.

Table 3
Comparisons of the average running time in clustering task (ms)

Models	K-means	DBSCAN	Spectral clustering	Hierarchical clustering
Hypergraph	41.08	67.76	60.12	79.76
Triple store	52.35	79.90	73.12	96.37
Property graph	46.78	73.93	65.97	89.08
POBS	17.71	26.88	19.53	39.80
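
The relative gains in Table 3 can be derived directly from the reported averages. The sketch below copies the table values and computes the ratio of each baseline's running time to that of POBS; the helper name `speedup_over` is introduced for illustration.

```python
# Average clustering time in milliseconds, copied from Table 3.
TABLE_3_MS = {
    "Hypergraph": {"K-means": 41.08, "DBSCAN": 67.76, "Spectral": 60.12, "Hierarchical": 79.76},
    "Triple store": {"K-means": 52.35, "DBSCAN": 79.90, "Spectral": 73.12, "Hierarchical": 96.37},
    "Property graph": {"K-means": 46.78, "DBSCAN": 73.93, "Spectral": 65.97, "Hierarchical": 89.08},
    "POBS": {"K-means": 17.71, "DBSCAN": 26.88, "Spectral": 19.53, "Hierarchical": 39.80},
}


def speedup_over(baseline, model="POBS"):
    """Ratio of the baseline's running time to the model's, per algorithm."""
    return {
        algorithm: TABLE_3_MS[baseline][algorithm] / TABLE_3_MS[model][algorithm]
        for algorithm in TABLE_3_MS[model]
    }


for baseline in ("Hypergraph", "Triple store", "Property graph"):
    ratios = speedup_over(baseline)
    summary = ", ".join(f"{alg}: {ratio:.2f}x" for alg, ratio in ratios.items())
    print(f"POBS vs {baseline}: {summary}")
```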

6 Conclusions

We propose a graph data storage model for hot public opinion in social networks. First, we identify the public opinion network and treat it as the target of storage management. Second, we propose the equivalent partition and germ elements in the data space. Finally, we extract the topic node that holds shared content and design the public opinion bunching mapping relationship to decouple the data. Experiments indicate that POBS effectively improves space utilization and processing speed. In future work, we will study multi-modal representation and compressed storage for multi-modal social network data.

References

1. Gustavo Cordeiro Galvão Van Erven, Rommel Novaes Carvalho, Waldeyr Mendes Cordeiro da Silva, Sérgio Lifschitz, Harley Vera Olivera, Maristela Holanda, Designing graph databases with GRAPHED, Journal of Database Management 30(1) (2019), 41–60.

2. Carlos Javier Fernández Candel, Diego Sevilla Ruiz, Jesús Joaquín García Molina, A unified metamodel for nosql and relational databases, Inf. Syst. 104 (2022), 101898.

3. Sarvani Anandarao, Sweetlin Hemalatha Chellasamy, Detection of hot topic in tweets using modified density peak clustering, Ingénierie des Systèmes d'Inf. 26(6) (2021), 523–531.

4. Qiuyang Gu, Qilian Ni, Xiangzhao Meng, Zhijiao Yang, Dynamic social privacy protection based on graph mode partition in complex social network, Pers. Ubiquitous Comput. 23(3-4) (2019), 511–519.

5. Jagrati Singh, Anil Kumar Singh, NSLPCD: Topic based tweets clustering using node significance based label propagation community detection algorithm, Ann. Math. Artif. Intell. 89(3-4) (2021), 371–407.

6. Dongha Lee, Jiaming Shen, SeongKu Kang, Susik Yoon, Jiawei Han, Hwanjo Yu, Taxocom: Topic taxonomy completion with hierarchical discovery of novel topic clusters. CoRR, abs/2201.06771, 2022.

7. Kheir Eddine Daouadi, Rim Zghal Rebaï, Ikram Amous, Optimizing semantic deep forest for tweet topic classification, Inf. Syst. 101 (2021), 101801.

8. Chang Sup Park, Barbara K. Kaye, The tweet goes on: Interconnection of twitter opinion leadership, network size, and civic engagement, Comput. Hum. Behav. 69 (2017), 174–180.

9. Ahmed Imad Aziz Al-Ghezi, Universal Workload-based Graph Partitioning and Storage Adaption for Distributed RDF Stores. PhD thesis, University of Göttingen, Germany, 2021.

10. Ali Davoudian, Liu Chen, Hongwei Tu, Mengchi Liu, A workload-adaptive streaming partitioner for distributed graph stores, Data Sci. Eng. 6(2) (2021), 163–179.

11. Robinson, Webber, Eifrem, Graph Databases. O'Reilly Media, California, 2nd edition, 2015.

12. Songlin Hu, Wantao Liu, Tilmann Rabl, Shuo Huang, Ying Liang, Zheng Xiao, Hans-Arno Jacobsen, Xubin Pei, Jiye Wang, Dualtable: A hybrid storage model for update optimization in hive. In 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13–17, 2015, pages 1340–1351. IEEE Computer Society, 2015.

13. Marcelo Arenas, Martín Ugarte, Designing a query language for RDF: Marrying open and closed worlds, ACM Trans. Database Syst. 42(4) (2017), 21:1–21:46.

14. Michael Gubanov, Polyfuse: A large-scale hybrid data fusion system. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19–22, 2017, pages 1575–1578. IEEE Computer Society, 2017.

15. Mislene Da Silva Nunes, Methanias Colaço Júnior, Gastão Florêncio Miranda Jr, Beatriz Trinchão Andrade, An approach to preprocess and cluster a BRDF database, Graph. Model. 119 (2022), 101123.

16. Humberto Luiz Razente, Maria Camila Nardini Barioni, Yasin Silva, Storing data once in m-trees and pm-trees: Revisiting the building principles of metric access methods, Inf. Syst. 104 (2022), 101896.

17. Wenfei Fan, Jianzhong Li, Xin Wang, Yinghui Wu, Query preserving graph compression. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20–24, 2012, pages 157–168. ACM, 2012.

18. Roberto De Virgilio, Smart RDF data storage in graph databases. In Proceedings of the 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017, Madrid, Spain, May 14–17, 2017, pages 872–881. IEEE Computer Society/ACM, 2017.

19. Rana Alotaibi, Chuan Lei, Abdul Quamar, Vasilis Efthymiou, Fatma Özcan, Property graph schema optimization for domain-specific knowledge graphs. In 37th IEEE International Conference on Data Engineering, ICDE 2021, Chania, Greece, April 19–22, 2021, pages 924–935. IEEE, 2021.

20. Martin Schäler, Christine Tex, Veit Köppen, David Broneske, Gunter Saake, Towards multi-purpose main-memory storage structures: Exploiting sub-space distance equalities in totally ordered data sets for exact knn queries, Inf. Syst. 101 (2021), 101791.

21. Ali Davoudian, Helios: An adaptive and query workload-driven partitioning framework for distributed graph stores. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30–July 5, 2019, pages 1820–1822. ACM, 2019.

22. Youyang Yao, Jiaqi Li, Rong Chen, Analysis and improvement of optimizer for query processing on graph store. In Proceedings of the 9th Asia-Pacific Workshop on Systems, APSys 2018, Jeju Island, Republic of Korea, August 27–28, 2018, pages 6:1–6:8. ACM, 2018.

23. Anurag Khandelwal, Zongheng Yang, Evan Ye, Rachit Agarwal, Ion Stoica, Zipg: A memory-efficient graph store for interactive queries. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14–19, 2017, pages 1149–1164. ACM, 2017.

24. Palash Goyal, Emilio Ferrara, Graph embedding techniques, applications, and performance: A survey, Knowl. Based Syst. 151 (2018), 78–94.

25. Michael Nelson, Sridhar Radhakrishnan, Amlan Chatterjee, Chandra Sekharan, Queryable compression on streaming social networks. In 2017 IEEE International Conference on Big Data (IEEE BigData 2017), Boston, MA, USA, December 11–14, 2017, pages 988–993. IEEE Computer Society, 2017.

26. Arijit Khan, Charu Aggarwal, Toward query-friendly compression of rapid graph streams, Soc. Netw. Anal. Min. 7(1) (2017), 23:1–23:19.

27. Adel Alharbi, Mohammad Hijji, Amer Aljaedi, Enhancing topic clustering for arabic security news based on k-means and topic modelling, IET Networks 10(6) (2021), 278–294.

28. Fuliang Lu, Nishad Kothari, Xing Feng, Lianzhu Zhang, Equivalence classes in matching covered graphs, Discret. Math. 343(8) (2020), 111945.

29. Bing Wei, Limin Xiao, Bingyu Zhou, Guangjun Qin, Baicheng Yan, Zhisheng Huo, Fine-grained management of I/O optimizations based on workload characteristics, Frontiers Comput. Sci. 15(3) (2021), 153102.

30. Omri Suissa, Avshalom Elmalech, Maayan Zhitomirsky-Geffet, Text analysis using deep neural networks in digital humanities and information science, J. Assoc. Inf. Sci. Technol. 73(2) (2022), 268–287.

31. Yu Zhang, Kanat Tangwongsan, Srikanta Tirthapura, Streaming k-means clustering with fast queries. In 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19–22, 2017, pages 449–460. IEEE Computer Society, 2017.

32. Erich Schubert, Jörg Sander, Martin Ester, Hans-Peter Kriegel, Xiaowei Xu, DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN, ACM Trans. Database Syst. 42(3) (2017), 19:1–19:21.

33. Arif Mahmood, Michael Small, Subspace based network community detection using sparse linear coding, IEEE Trans. Knowl. Data Eng. 28(3) (2016), 801–812.

34. Anan Liu, Yuting Su, Weizhi Nie, Mohan Kankanhalli, Hierarchical clustering multi-task learning for joint human action grouping and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 39(1) (2017), 102–114.
