Par-BF: A parallel partitioned Bloom filter for dynamic data sets

Abstract

Compared with a hash table, a Bloom filter (BF) is more space efficient and supports fast matching at the cost of a controllable and acceptable false positive probability. The size of a basic BF is predetermined from the expected number of elements to be stored. However, the required size of a BF cannot be predicted for dynamic sets. The two existing solutions, scalable BF (SBF) and dynamic BF (DBF), still struggle to handle dynamic data sets with both low memory overhead and high performance. This article presents a partitioned BF (Par-BF) for dynamic data sets. Compared with DBF and SBF, Par-BF uses a group of formulas to find a sweet spot between high performance and low overhead while supporting fast concurrent matching. Specifically, the size and the range of the false positive probability in Par-BF can be derived deliberately. In our trace-driven experiments, the input/output operations per second of Par-BF outperform those of DBF and SBF by 10× to 14× and by 3× to 8×, respectively. In addition, with our proposed garbage collection policy, Par-BF consumes less than half of the memory of SBF.

Keywords

Bloom filter, dynamic sets, Par-BF, false positive, fast matching

1. Introduction

We are now living in an era awash in data, and the amount of digital data in data centers is expected to double annually from today to 2020 (Gantz and Reinsel, 2012). Demand for high-performance data-intensive processing (e.g. scientific computing) is rising irreversibly. In particular, target objects in large-scale and complex data sets always have to be retrieved before further processing.

Indexing techniques have been widely used to reduce object query time at the cost of a certain amount of fast memory space. Building a highly effective index for data-intensive distributed computing (e.g. Hadoop) has been widely studied. With an appropriate indexing technique, we can quickly detect repeated objects, and thus reduce remote input/output (I/O) transfer time and eventually speed up the object retrieval process. Hadoop++ (Dittrich et al., 2010) is an improved Hadoop framework that provides a noninvasive, DBMS-independent indexing function, coined Trojan Index, to accelerate the query runtime of HadoopDB (Abouzeid et al., 2009). Sailfish (Rao et al., 2012) improves disk I/O performance in Hadoop by building an index in the Map/Reduce layer to optimally sort and aggregate the intermediate data.

A critical challenge in designing an effective index is to reduce space overhead as much as possible while guaranteeing indexing throughput. Compared with a hashing-based index, a Bloom filter (BF) is more space-efficient for set membership queries of the form "is element x in set S?", at the cost of a false positive probability (fpp) (Bloom, 1970). Nonetheless, the space saving often outweighs this drawback when fpp is in an acceptable range. An index entry in a standard BF consumes only about $1.44 \log_{2} (1 / fpp)$ bits on average, regardless of its key length. The use of a BF as an index for redundancy elimination, set difference, or set intersection has been widely studied (Tarkoma et al., 2012). BFs have been adopted in a wide spectrum of network applications, including distributed caching, resource routing, packet routing, and measurement infrastructure (Broder and Mitzenmacher, 2004). In data-intensive distributed computing, a BF-derived algorithm, bloomjoin, has been shown to greatly reduce the communication cost of join operations in distributed databases (Lee et al., 2012; Michael et al., 2007). Moreover, a number of BF variants have been proposed to meet the demands of specific application contexts, including counting, deletion, multisets, and space efficiency (Tarkoma et al., 2012).

One of the key problems of BFs is the false positive, which denotes that an item does not actually belong to the set but the BF falsely reports that it does. On one hand, when a false positive occurs, an extra round-trip I/O to the much slower backup index is issued to validate the result, which degrades indexing throughput sharply. On the other hand, more memory is consumed when the expected fpp is set to a smaller value. Therefore, fpp should be kept in an acceptable range to balance the trade-off between indexing throughput and memory overhead (Almeida et al., 2007).

In order to keep the actual false positive rate under the expected fpp, the maximum number of elements to be indexed in a standard BF is usually predetermined as n. However, we cannot precisely predict the number of elements in dynamic data sets, which raises the challenge of maintaining a flexible list of sub-BFs. Scalable BF (SBF) (Almeida et al., 2007) and dynamic BF (DBF) (Guo et al., 2010) are two typical designs that address this issue. Both DBF and SBF are made up of a series of one or more sub-BFs. When the current sub-BF becomes full (up to n elements), a new one is added to the list to serve new element insertions. Typically, the space of each sub-BF is much less than the overall pre-allocated shared memory space. However, SBF and DBF lack further optimization of the trade-off between indexing performance and memory cost. The main problem of DBF is that it cannot control the overall fpp, while SBF does not support the BF-based algebra operations that simplify resource management (e.g. bloomjoin).

This article presents the parallel partitioned BF (Par-BF) to support fast element membership queries for large dynamic sets. Par-BF partitions the whole required memory space into disjoint sub-BF lists. Each sub-BF list can hold at most N elements and can contain at most s_e homogeneous sub-BFs. A thread is assigned to each sub-BF list to do matching. An effective way to choose the thread number is to make it equal to the number of CPU cores (Pusukuri et al., 2011). Par-BF has two major design objectives. The first is to support useful bit vector-based algebra operations. The second is to control the fpp in each matching thread. Although DBF and SBF can also be extended to a multi-threaded environment, meeting heterogeneous performance/cost demands requires that some essential parameters of Par-BF be derived carefully, which raises the following questions.

How can we choose proper fpp values? We need to determine the following two values to control the overall fpp. One is the expected upper bound value of fpp in each sub-BF list, F_max. The other is the expected fpp of each sub-BF, fpp_e.

How can we tune the memory space of Par-BF to balance the performance requirement and memory consumption?

In Table 1, we summarize the improvement of Par-BF, compared with DBF and SBF. From our trace-driven evaluated results, the input/output operations per second (IOPS) of Par-BF outperforms that of DBF and SBF, by 10× to 14× and 3× to 8×, respectively. Besides, through our garbage collection policy, the memory overhead of Par-BF is less than half of SBF.

Table 1.

Comparison results of SBF, DBF, and Par-BF.

Property	DBF	SBF	Par-BF
Query method	Linear	Linear	Parallel
Algebra operations	Supported	Not Supported	Supported
Initializations	Empirical	Empirical	Calculated
FPP	Uncontrollable	Convergent	Controllable
Performance	Very Low	Low	High

DBF: dynamic Bloom filter; SBF: scalable Bloom filter; PAR-BF: parallel partitioned Bloom filter.

This article is organized as follows. First, we provide an overview of our design motivation. Then, the design of Par-BF is described in detail. The experimental results are given in the Evaluation section, followed by a summary of related work and finally the Conclusion section.

2. Motivations

We first list the notations used in the following discussion in Table 2.

Table 2.

Preliminary definition of essential parameters.

Symbol	Meaning
C	Maximum capacity of the shared resource
M	Space overhead of Par-BF (bits)
N	Maximum amount of elements indexed in a sub-BF list
n	Maximum amount of elements indexed in a sub-BF
S_b	Average indexing cost in a memory based sub-BF
S_q	Average indexing cost in the remote K-V container
fpp_e	Expected fpp of a sub-BF
F	Overall expected fpp of Par-BF
σ_I	Average indexing cost of Par-BF
Q_worst	Worst indexing cost of Par-BF
σ_E	Expected cost of an element access
σ_L	Average cost of an element access from local storage
σ_R	Average cost of an element access from remote storage
m	Bit vector size of each sub-BF
T	Thread number
s_e	Maximum length of each sub-BF list
s_m	Total amount of allocated sub-BFs for a set

PAR-BF: parallel partitioned Bloom filter; BF: Bloom filter.

2.1. Using an index in data-intensive distributed applications

A typical data-intensive distributed computing framework usually separates the compute and storage resources into two groups, compute nodes and storage nodes, which are interconnected by a shared network infrastructure. The cluster of separate disks is managed virtually, and a distributed file system (e.g. HDFS in Hadoop) provides a universal namespace. Following a customized schedule, the compute nodes selectively read the data to be processed from the storage nodes and then save the processed results back to the storage nodes. With indexing techniques, redundant data can be retrieved directly from the local store of a compute node rather than being read from remote storage nodes. As a simple example, a job assigned to compute node A requests three data blocks, X, Y, and Z. Block X is already stored in the local storage of A, and the index of A reports this. As a result, only blocks Y and Z are read from the remote storage.

Virtualization is a key feature of a cloud-based computing framework. It enables the aggregation of multiple decentralized physical disk devices into a shared, flexible logical volume pool for centralized and elastic management. However, dynamic resource allocation makes the expected number of elements to be stored unpredictable. For instance, as shown in Figure 1, suppose there are V disjoint and independent sets {X₁, X₂, …, X_V}. All the sets compete for a limited, shared, pre-allocated space that can store at most C elements. The major challenge is how much space should be preassigned to each X_i, 1 ≤ i ≤ V. We denote the space allocated to set X_i as C_i, which gives equation (1). The number of possible V-tuples (C₁, C₂, …, C_V) is therefore $\binom{C + V - 1}{C}$, which follows from the selection problem in combinatorics (Cameron, 1994). Consequently, the size allocated to X_i may change dynamically (either shrinking or growing).

\sum_{i = 1}^{V} C_{i} = C, \quad 0 \leq C_{i} \leq C.

(1)
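The count of possible allocations can be checked numerically. The following is a minimal sketch, assuming small illustrative values of C and V (the function names are hypothetical); it compares the closed form with a brute-force enumeration of all tuples that satisfy equation (1).

```python
from itertools import product
from math import comb


def count_allocations(C: int, V: int) -> int:
    """Closed form: number of V-tuples of non-negative integers that sum to C."""
    return comb(C + V - 1, C)


def count_allocations_brute(C: int, V: int) -> int:
    """Enumerate every tuple (C_1, ..., C_V) with 0 <= C_i <= C and keep those summing to C."""
    return sum(1 for t in product(range(C + 1), repeat=V) if sum(t) == C)


# Small illustrative values; the brute-force count is only feasible for tiny C and V.
print(count_allocations(6, 3), count_allocations_brute(6, 3))
```

Even for modest C and V the count grows quickly, which is why a fixed per-set preallocation is hard to choose in advance.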

Figure 1.

The size allocated to each X_i may change dynamically when V disjoint and independent sets $\{X_{1}, X_{2}, \dots, X_{V}\}$ compete for a limited and shared pre-allocated space.

We summarize the three indexing properties for dynamic sets in data-intensive distributed computing circumstances as follows:

Data partition: The to-be-processed data of each compute node is independent from other compute nodes. Therefore, an individual index is responsible for its corresponding compute node. Keeping the property of independent data partition can enhance data correlation for customizing a specific service.

Processing parallelism: Data partitioning makes the data boundaries between compute nodes clearer (high cohesion/low coupling) and avoids costly data synchronization. Parallelization can greatly improve indexing throughput through multiple threads, similar to SIMD in Flynn’s taxonomy (Flynn, 1972).

Heterogeneous performance/cost demand: Various data-intensive computing services, such as Web indexing, data mining, and scientific simulation, have heterogeneous indexing demands to achieve service level agreement (SLA). Moreover, a heterogeneous environment, whose storage devices are configured to meet demands of different size and I/O speed, is preferable to one that is homogeneous in many clusters (Crago et al., 2011; Madhavapeddy et al., 2010).

2.2. BFs for dynamic data sets

The hash table is the most common data structure for indexing. However, a hash table is unappealing when memory is size-constrained and expensive, since it must be over-provisioned to guarantee that all table entries fit. Moreover, hash collisions (i.e. different entries mapping to the same position) incur extra memory overhead to handle them (e.g. linear chaining) and degrade the indexing throughput (e.g. from the expected O(1) time to the worst-case O(n) time). To achieve both space and time efficiency, the perfect universal hash functions¹ of a hash table can be optimally precalculated only when the key universe of the target set is foreseeably limited (Fredman et al., 1984). In a dynamic data set whose key universe is nearly infinite, however, collisions occur unpredictably (Dietzfelbinger et al., 1994). Although improved hashing solutions such as Cuckoo Hashing (Pagh and Rodler, 2004) have been proposed for dynamic sets, collisions are still considered one of the main performance bottlenecks in hashing.

Figure 2 provides a brief example of a standard BF. A standard BF representing a data set $X = \{x_{1}, x_{2}, \dots, x_{n}\}$ is a bit array (vector) of m bits, all initially set to 0. The filter uses k independent (pseudo) random hash functions h₁, h₂, …, h_k, each with range {1, …, m}. For each element $x \in X$, the bits at positions $h_{i}(x)$ are set to 1, where 1 ≤ i ≤ k. A membership query for an item y checks whether all the bits at positions h_i(y) are nonzero. False positives are a key problem for BFs. As each bit position can be shared by many elements, a query for an item y has a certain probability of finding that all h_i(y) have already been set by other elements of X. A false positive thus denotes that an item does not actually belong to the set but the BF falsely reports that it does.

Figure 2.

An example of a BF. m is set to 16 bits, and all the bits are initialized to 0 in (1). Each element x_i is hashed k = 3 times, and each hashed position is set to 1; for example, the mapping positions of x₁ are 2, 5, and 11 in (2). The item y₁ cannot be in the set X since a 0 is found at one of its mapping positions. The mapping positions of item y₂ were previously set by x₁ and x₂, so the BF yields a false positive in (3). BF: Bloom filter.
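The insertion and query steps of a standard BF can be sketched as follows. This is a minimal illustration, not the implementation used in this article: the class name, the use of a Python integer as the bit vector, the 0-based positions, and the salted SHA-256 hashing are all assumptions made for the example.

```python
import hashlib


class BloomFilter:
    """Standard Bloom filter over an m-bit vector with k hash functions."""

    def __init__(self, m: int, k: int):
        self.m = m
        self.k = k
        self.bits = 0  # Python int used as an m-bit vector, all bits initially 0

    def _positions(self, item: str):
        # Derive k positions in [0, m) from salted SHA-256 digests
        # (an illustrative choice of hash functions).
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item: str) -> None:
        for p in self._positions(item):
            self.bits |= 1 << p  # set the mapped bit to 1

    def __contains__(self, item: str) -> bool:
        # True if every mapped bit is set; this may be a false positive.
        return all((self.bits >> p) & 1 for p in self._positions(item))


bf = BloomFilter(m=16, k=3)
bf.add("x1")
bf.add("x2")
print("x1" in bf, "x2" in bf)
```

Inserted elements are always reported as present, whereas an item that was never inserted is reported as present only when all of its positions happen to be set by other elements.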

The expected fpp of a BF can be tuned within a sufficiently small range to meet indexing requirements with high performance and low memory overhead. Equation (2) gives the mathematical relationship among the variables m, n, and fpp. For example, an inline deduplication process (Zhu et al., 2008) contains 4×10⁹ unique chunks whose average size is 1 kB (the total data size is 4 TB). Assuming that the fingerprint (FP; key) of each chunk is a 16-byte MD5 digest, a hash table consumes at least 64 GB of memory for the full index. In contrast, only 8.64 GB of memory is needed to index all of these chunks when the expected fpp of a standard BF is 1/2¹⁰ (≈0.0976%). Moreover, the performance drawback of hash collisions can be conditionally avoided in a BF. The counting BF (CBF) was proposed by Fan et al. (2000) to solve the problem that a standard BF does not support element deletions. Each mapping entry in a CBF is a small counter rather than a single flag bit. It has been shown that 4 bits per counter suffice for most applications, since the probability of counter overflow (≥16) is negligible.

m = 1.44 n \times \log_{2} (\frac{1}{fpp}) .

(2)
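Equation (2) can be evaluated directly for the deduplication example. The sketch below is only an illustration (the function name is hypothetical and 1 GB is taken as 10⁹ bytes). Evaluated as written, equation (2) gives about 7.2 GB for fpp = 2⁻¹⁰, while 8.64 GB corresponds to $\log_{2} (1 / fpp) = 12$, so the figure quoted in the text may assume a slightly different setting.

```python
from math import log2


def bloom_bits(n: int, fpp: float) -> float:
    """Equation (2): m = 1.44 * n * log2(1 / fpp), in bits."""
    return 1.44 * n * log2(1.0 / fpp)


n = 4 * 10**9                    # unique chunks in the deduplication example
hash_table_gb = n * 16 / 1e9     # one 16-byte MD5 fingerprint per chunk
bf_gb_fpp10 = bloom_bits(n, 2**-10) / 8 / 1e9   # bits -> bytes -> GB
bf_gb_fpp12 = bloom_bits(n, 2**-12) / 8 / 1e9

print(hash_table_gb, bf_gb_fpp10, bf_gb_fpp12)
```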

Neither the standard BF nor the CBF takes dynamic sets into consideration, whereas SBF and DBF are two methods that support the dynamic sets shown in Figure 1. As depicted in Figure 3, the basic idea for supporting dynamic sets is to divide the shared BF space into a certain number of sub-BF units. A new sub-BF (BF_s) is allocated to the sub-BF list of a set X_i for new insertions when all the previous sub-BFs {BF₁, …, BF_{s−1}} are full. A membership query for an element x in set X_i iterates over the corresponding list and returns true if any one of the sub-BFs reports x. We summarize the characteristics of SBF and DBF as follows.

Figure 3.

The sub-BF unit is the fundamental BF-based data structure for supporting dynamic sets. BF: Bloom filter.

2.2.1. Dynamic BF

A DBF consists of s homogeneous standard (or counting) sub-BFs. “Homogeneous” indicates both the size and the k bit-mapping hash functions are exactly the same in each sub-BF in order to support useful bit vector-based algebra operations: union, intersection, and halving (Broder and Mitzenmacher, 2004). Suppose we have two sets S₁ and S₂. Two homogenous BFs B₁ and B₂ are used to represent the element membership of S₁ and S₂, respectively. The algebra operations between S₁ and S₂ are listed as follows.

2.2.2. Union

A BF B that represents the set union $S = S_{1} \cup S_{2}$ can be created by taking the bitwise OR of the original bit vectors, $B = B_{1} \lor B_{2}$. The merged BF B representing set S can report any element belonging to either S₁ or S₂. One positive effect of the union operation is space saving when fewer than half of the bits in the vector of B are nonzero (which keeps the expected fpp under control). Thus, both S₁ and S₂ can be represented by a single BF B within the expected fpp.

2.2.3. Intersection

A BF B that represents the set intersection $S = S_{1} \cap S_{2}$ can be created by taking the bitwise AND of the original bit vectors, $B = B_{1} \land B_{2}$. The merged BF B representing set S can accurately report any element belonging to both S₁ and S₂ with probability ${(1 - 1 / m)}^{k^{2} \cdot | S_{1} \cap (\sim S_{2}) | \cdot | S_{2} \cap (\sim S_{1}) |}$, where “∼” denotes the complement of a set. The analysis of intersection is more complicated than that of union and can be found in the literature (Broder and Mitzenmacher, 2004; Guo et al., 2010). Bloomjoin (Mackert and Lohman, 1986) uses intersection operations to find the common elements of the two sets represented by B₁ and B₂ in a distributed database. It is then unnecessary to send the elements themselves from one site to the other, which is costly, especially in distributed systems (Tarkoma et al., 2012).

2.2.4. Halving

Halving is a special case of union. For example, if the size m of a BF B is divisible by 2, halving can be done by a bitwise OR between the first and second halves of B. The size of B then becomes m/2, and each mapped bit position of an element x is calculated as $h_{i}(x) \bmod (m / 2)$.
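The three algebra operations can be sketched on plain bit vectors. This is a minimal illustration that assumes homogeneous filters stored as Python integers with 0-based bit positions; the helper names are hypothetical. After halving, a bit that was at position p is found at position p mod (m/2).

```python
def make_filter(positions):
    """Build a bit vector (Python int) with the given 0-based positions set."""
    bits = 0
    for p in positions:
        bits |= 1 << p
    return bits


def union(b1: int, b2: int) -> int:
    return b1 | b2   # bitwise OR of two homogeneous filters


def intersection(b1: int, b2: int) -> int:
    return b1 & b2   # bitwise AND of two homogeneous filters


def halve(bits: int, m: int) -> int:
    """OR the upper m/2 bits onto the lower m/2 bits; m must be even."""
    if m % 2:
        raise ValueError("m must be divisible by 2")
    half = m // 2
    return (bits & ((1 << half) - 1)) | (bits >> half)


def contains(bits: int, positions, size: int) -> bool:
    """Query with positions reduced modulo the (possibly halved) filter size."""
    return all((bits >> (p % size)) & 1 for p in positions)


m = 16
b1 = make_filter([2, 5, 11])   # positions of one element, as in Figure 2
b2 = make_filter([5, 9, 14])   # positions of a second, illustrative element
halved = halve(b1, m)
print(bin(union(b1, b2)), bin(intersection(b1, b2)), bin(halved))
print(contains(halved, [2, 5, 11], m // 2))
```

These operations are only meaningful when both filters share the same size and hash functions, which is the homogeneity requirement stated above.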

However, the main problem of DBF is that it has no mechanism to control the overall false positive rate F. Let f(h) denote the fpp of a sub-BF BF_h. F is calculated according to equation (3). In DBF, F equals the probability that at least one of the sub-BFs yields a false positive during an element membership query. Assuming that each sub-BF can index at most n elements, we have $s = C / n$. If n becomes smaller, space is allocated more compactly and efficiently. However, the range of s becomes wider, which may drive F to an unacceptable value.

F = 1 - \prod_{h = 1}^{s} (1 - f(h)) \approx \sum_{h = 1}^{s} f(h) \quad (\forall h = 1, 2, \dots, s, \; f(h) \ll 1 / s),

(3)

2.2.5. Scalable BF

An SBF is made up of a series of heterogeneous sub-BFs: both m and k differ from one sub-BF to the next. The key idea is that each successive sub-BF is created with a tighter maximum error probability following a geometric progression according to equation (4), where r, 0 < r < 1, is the tightening ratio.

f (h) = r^{h - 1} \times f (1) .

(4)

Thus, the compounded probability over the whole series converges to $f(1) / (1 - r)$. Figure 4(a) compares the growth of F in DBF and SBF according to equation (3). The value of F in DBF grows linearly with s, whereas the F value of SBF converges to 0.02 even as s grows without bound. Although the impact of choosing suitable parameters on regulating sub-BF space usage and fpp in the SBF design is well studied, there are still problems in handling the heterogeneity of the sub-BFs:

Increasing the average consumed bits of a key in BF_h by $(h - 1) \cdot r$ bits, compared with that in the original sub-BF BF₁.

No support of BF-based algebra operations so as to simplify resource management.

Recalculating the mapping positions in each sub-BF iteration of an element membership query. The cost is that of computing $h(k) \bmod m_{h}$, where m_h (the size of BF_h) is not the same as that of any other sub-BF.

Figure 4.

(a) An example of F value growth comparison between DBF and SBF, f(1) = 0.01, r = 0.5 in SBF. (b) Our recommended Par-BF design, $s_{m} = C / n$ . BF: Bloom filter; PAR-BF: parallel partitioned Bloom filter.
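The growth of F in Figure 4(a) can be reproduced from equations (3) and (4). The sketch below is illustrative only (the function names are hypothetical); it uses the exact product form of equation (3), so the DBF values fall slightly below the linear approximation $s \cdot f(1)$ once s becomes large, and the SBF values stay below the bound $f(1) / (1 - r) = 0.02$.

```python
def overall_fpp(fs):
    """Equation (3), exact form: F = 1 - prod(1 - f(h))."""
    keep = 1.0
    for f in fs:
        keep *= 1.0 - f
    return 1.0 - keep


def dbf_fpps(f1: float, s: int):
    """DBF: every homogeneous sub-BF has the same fpp."""
    return [f1] * s


def sbf_fpps(f1: float, r: float, s: int):
    """SBF, equation (4): f(h) = r^(h-1) * f(1)."""
    return [f1 * r ** (h - 1) for h in range(1, s + 1)]


for s in (1, 4, 16, 64):
    f_dbf = overall_fpp(dbf_fpps(0.01, s))
    f_sbf = overall_fpp(sbf_fpps(0.01, 0.5, s))
    print(f"s={s:3d}  DBF F={f_dbf:.4f}  SBF F={f_sbf:.4f}")
```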

2.3. Performance/cost-driven initializations of a BF

As illustrated in Figure 5, a membership query for an element x has three possible states: In, Not in, and False Positive. Suppose element x is requested for further processing. The BF-based index first determines whether x exists in the local store, which is either a K-V store or a cache holding a subset of the data. The average indexing cost is denoted as σ_I. When x is In the store, it can be accessed directly from the local store at a cost of roughly σ_L. When x is Not in the store, it is accessed from the remote storage container through network links at a cost of roughly σ_R, where $σ_{L} \ll σ_{R}$. The state False Positive occurs occasionally when x does not actually exist in the local store but the index reports that it does, at a cost of roughly (σ_L+σ_R). Thus, the expected access cost of x, σ_E, is calculated using equation (5), where hr denotes the hit ratio of the local store, and $hr \cdot σ_{L}$, $(1 - hr - fpp) \cdot σ_{R}$, and $fpp \cdot (σ_{L} + σ_{R})$ denote the expected access costs of the query states In, Not in, and False Positive, respectively. The term $fpp \cdot σ_{L}$ can be neglected because it is usually much smaller (≈0) than the other terms. From the perspective of indexing performance, reducing σ_I and fpp improves σ_E at the price of more resources, such as a larger memory or faster I/O devices. In addition, the parameter initializations of a BF-based index should be flexibly configurable to fit a heterogeneous distributed computing environment.

\begin{aligned} σ_{E} &= σ_{I} + hr \cdot σ_{L} + (1 - hr - fpp) \cdot σ_{R} + fpp \cdot (σ_{L} + σ_{R}) \\ &= σ_{I} + fpp \cdot σ_{L} + σ_{R} - hr \cdot (σ_{R} - σ_{L}) \\ &\approx σ_{I} + σ_{R} - hr \cdot (σ_{R} - σ_{L}) \end{aligned}

(5)
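A short numerical check of equation (5) shows how small the neglected term is. The cost values below are hypothetical and chosen only for illustration (arbitrary time units with σ_L much smaller than σ_R); the function names are likewise assumptions of this sketch.

```python
def expected_cost(sigma_i, sigma_l, sigma_r, hr, fpp):
    """Equation (5), first line: sum over the states In, Not in, and False Positive."""
    cost_in = hr * sigma_l
    cost_not_in = (1 - hr - fpp) * sigma_r
    cost_false_positive = fpp * (sigma_l + sigma_r)
    return sigma_i + cost_in + cost_not_in + cost_false_positive


def expected_cost_approx(sigma_i, sigma_l, sigma_r, hr):
    """Equation (5), last line: the term fpp * sigma_L is neglected."""
    return sigma_i + sigma_r - hr * (sigma_r - sigma_l)


# Hypothetical costs: indexing 1, local access 100, remote access 10000.
exact = expected_cost(1.0, 100.0, 10000.0, hr=0.6, fpp=0.001)
approx = expected_cost_approx(1.0, 100.0, 10000.0, hr=0.6)
print(exact, approx, exact - approx)
```

The difference between the two values is exactly $fpp \cdot σ_{L}$, which is tiny compared with the remote access cost in this setting.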

Figure 5.

Three states of element membership query in a BF. BF: Bloom filter.

There is a trade-off between performance and memory overhead in designing a BF. The overall fpp has a negative effect on performance. As in equation (6), which is derived from equation (3) with $f(h) = 0.5^{k}$ (Bloom, 1970), larger values of n (capacity) and k (mapping functions) in each homogeneous sub-BF both reduce the F value. However, from the perspective of dynamic resource allocation, a smaller value of n gives a finer allocation granularity, which indicates better resource utilization. Moreover, as shown in equation (7), a greater k incurs more space overhead (denoted by M in bits). Table 3 gives an example of the expected $\langle F, M \rangle$ pairs of DBF with different s and k when C = 10⁹. Thus, n and k in DBF can be tuned to meet both the application performance and the cost requirement.

F = \frac{C}{n} \cdot 0.5^{k} \quad (s = \frac{C}{n}),

(6)

M = \frac{C}{n} \cdot 1.44 \cdot n \cdot k = 1.44 \cdot k \cdot C,

(7)

Table 3.

The expected $\langle F, M \rangle$ pair results of DBF with different s and k values when C = 10⁹.^a

k \ $s = C / n$	128	64	32	16	8
8	$\langle 50\%, 1.44 \rangle$	$\langle 25\%, 1.44 \rangle$	$\langle 12.5\%, 1.44 \rangle$	$\langle 6.25\%, 1.44 \rangle$	$\langle 3.12\%, 1.44 \rangle$
10	$\langle 12.5\%, 1.8 \rangle$	$\langle 6.25\%, 1.8 \rangle$	$\langle 3.12\%, 1.8 \rangle$	$\langle 1.56\%, 1.8 \rangle$	$\langle 0.78\%, 1.8 \rangle$
12	$\langle 3.12\%, 2.16 \rangle$	$\langle 1.56\%, 2.16 \rangle$	$\langle 0.78\%, 2.16 \rangle$	$\langle 0.39\%, 2.16 \rangle$	$\langle 0.20\%, 2.16 \rangle$
14	$\langle 0.78\%, 2.52 \rangle$	$\langle 0.39\%, 2.52 \rangle$	$\langle 0.20\%, 2.52 \rangle$	$\langle 0.10\%, 2.52 \rangle$	$\langle 0.05\%, 2.52 \rangle$
16	$\langle 0.20\%, 2.88 \rangle$	$\langle 0.10\%, 2.88 \rangle$	$\langle 0.05\%, 2.88 \rangle$	$\langle 0.025\%, 2.88 \rangle$	$\langle 0.012\%, 2.88 \rangle$

DBF: dynamic Bloom filter.

^a Each entry is an $\langle F, M \rangle$ pair; the unit of M is GB.
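The entries of Table 3 follow directly from equations (6) and (7). The following sketch recomputes them, assuming 1 GB = 10⁹ bytes; the function name is hypothetical.

```python
C = 10**9  # maximum capacity of the shared resource, as in Table 3


def dbf_pair(s: int, k: int):
    """Return (F, M in GB) for a DBF with s sub-BFs and k hash functions."""
    F = s * 0.5 ** k                  # equation (6) with s = C / n
    M_gb = 1.44 * k * C / 8 / 1e9     # equation (7), converted from bits to GB
    return F, M_gb


for k in (8, 10, 12, 14, 16):
    row = []
    for s in (128, 64, 32, 16, 8):
        F, M_gb = dbf_pair(s, k)
        row.append(f"<{F:.3%}, {M_gb:.2f}>")
    print(k, "  ".join(row))
```

M depends only on k and C, so it is constant along each row, whereas F halves each time s is halved.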

3. Design of Par-BF

3.1. Principles

The design of Par-BF is shown in Figure 6. The shared BF space of set X_i is divided into a certain number of sub-BF lists. The total number of sub-BF lists in X_i is at most $\lceil C_{i} / (n \cdot s_{e}) \rceil$, where X_i contains $C_{i}$ $(0 \leq C_{i} \leq C)$ elements and $s_{m} = C_{i} / n$. A thread is assigned to each sub-BF list to perform independent and parallel membership queries. The maximum number of sub-BFs is $C / n$, and the maximum number of sub-BF lists is $C / N$.

Figure 6.

The design of Par-BF. PAR-BF: parallel partitioned Bloom filter.

The expected upper bound of fpp in each sub-BF list is set to $F_{\max}$, and the expected fpp of each homogeneous sub-BF is set to $fpp_{e}$. The expected list size s_e is equal to $F_{\max} / fpp_{e}$ according to equation (3). Each sub-BF list can contain at most $n \cdot (F_{\max} / fpp_{e})$ disjoint elements, denoted as N. Each sub-BF has its own organized storage space, so its I/O operations, such as membership query, insertion, and deletion, require only the corresponding resources, which minimizes both memory overhead and disk I/O cost.

From Figure 4(b), Par-BF has the following two features: (1) the maximum fpp of each matching thread is limited to F_max and (2) the list size for matching iteration is limited so as to improve query throughput. The time cost of iterating over a list that contains s_m sub-BFs is $O (s_{m})$, whereas the time cost of Par-BF is $O (s_{e})$, a reduction by a factor of $s_{m} / s_{e}$.

3.2. Element operations

To exploit the hierarchical relationship of the different components in Par-BF, we define three element operations, as shown in Figure 7.

Figure 7.

Flow charts of the three element operations in Par-BF. In total, $\lceil C_{i} / (n \cdot s_{e}) \rceil$ threads look up element x in the target set X_i in (a). PAR-BF: parallel partitioned Bloom filter.

3.2.1. Lookup

At most $\lceil s_{m} / s_{e} \rceil$ threads perform a parallel lookup in a set X_i. For a membership query of an element x, if a sub-BF in a sub-BF list reports a match, the corresponding K-V container is checked to confirm that it really contains the element, which eliminates the effect of false positives. If the element is found, the matching thread delivers the result to the main thread, which then terminates all the other lookup threads. If the element is not found, a false positive has occurred, and the thread continues to iterate over the remaining sub-BFs in its sub-BF list. A false positive therefore costs an extra, useless I/O round. If a sub-BF reports a mismatch, the thread moves on to the next sub-BF until it reaches the tail of the list. If no thread finds x, then x is not in the set. Because element accesses may exhibit temporal locality, each matching thread checks the most recently allocated sub-BF first.
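The lookup flow can be sketched as follows. This is a simplified illustration under several assumptions: the class and function names are hypothetical, each sub-BF is represented by any object that supports the `in` operator, its K-V container is an exact Python set, and the newest sub-BF is stored at the end of each list. Python threads are used here only to show the control flow; they do not provide true CPU parallelism.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SubBF:
    """Stand-in for one sub-BF together with its K-V container."""

    def __init__(self, bf, store):
        self.bf = bf        # approximate filter; may report false positives
        self.store = store  # exact container used to verify a filter hit


def scan_list(sub_bf_list, x, found: threading.Event) -> bool:
    """Scan one sub-BF list, newest sub-BF first (temporal locality)."""
    for sub in reversed(sub_bf_list):
        if found.is_set():            # another thread has already matched
            return False
        if x in sub.bf and x in sub.store:
            found.set()               # verified hit: signal the other threads
            return True
        # Filter miss or false positive: continue toward the tail of the list.
    return False


def par_lookup(sub_bf_lists, x) -> bool:
    """Run one matching thread per sub-BF list and combine the results."""
    found = threading.Event()
    workers = max(1, len(sub_bf_lists))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda lst: scan_list(lst, x, found), sub_bf_lists))
    return any(results)


# "ghost" passes the first filter but is absent from its container (a false positive).
lists = [
    [SubBF(bf={"a", "ghost"}, store={"a"})],
    [SubBF(bf={"b"}, store={"b"}), SubBF(bf={"c"}, store={"c"})],
]
print(par_lookup(lists, "c"), par_lookup(lists, "ghost"))
```

Signalling through a shared event is one simple way to stop the remaining threads early; the article does not specify the termination mechanism.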

3.2.2. Insertion and deletion

For both insertion and deletion of an element x, a multi-threaded lookup first locates the corresponding sub-BF. On insertion, the key of x is mapped into the corresponding positions of an active sub-BF, that is, one whose number of indexed elements is less than the maximum n. If there is no active sub-BF, a new sub-BF and its corresponding data container are allocated for new element insertions. We use a CBF for each sub-BF to support element deletions. On deletion of x, the counter at each of its mapping positions in the corresponding sub-BF is decremented by 1.
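Insertion into and deletion from a counting sub-BF can be sketched as follows. The class name and the salted SHA-256 hashing are assumptions of this illustration, and the counters are plain Python integers rather than the 4-bit counters of a real CBF. The caller is assumed to have verified, through the lookup and the K-V container, that an element is really stored before deleting it.

```python
import hashlib


class CountingSubBF:
    """Counting Bloom filter used as one sub-BF with capacity n."""

    def __init__(self, m: int, k: int, n: int):
        self.m, self.k, self.n = m, k, n
        self.counters = [0] * m   # one counter per mapping entry
        self.count = 0            # number of elements currently indexed

    def _positions(self, item: str):
        # Illustrative hash functions: salted SHA-256 digests reduced modulo m.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def is_active(self) -> bool:
        return self.count < self.n   # active while below the maximum n

    def insert(self, item: str) -> None:
        if not self.is_active():
            raise OverflowError("sub-BF is full; allocate a new sub-BF")
        for p in self._positions(item):
            self.counters[p] += 1
        self.count += 1

    def delete(self, item: str) -> None:
        # Only valid for an element that was previously inserted.
        for p in self._positions(item):
            self.counters[p] -= 1
        self.count -= 1

    def __contains__(self, item: str) -> bool:
        return all(self.counters[p] > 0 for p in self._positions(item))


sub = CountingSubBF(m=64, k=3, n=3)
sub.insert("x")
print("x" in sub)
sub.delete("x")
print("x" in sub)
```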

3.2. Memory optimizations in Par-BF

Although a CBF supports element deletions, its space requirement is four times that of its original BF, because it stores a 4-bit counter rather than a 1-bit flag in each position of the mapping array. We propose building the CBFs on a relatively slow block device (compared with memory) to keep track of element updates, while keeping the transformed standard BFs in memory. As shown in Figure 8(a), each sub-CBF is transformed into the corresponding standard BF using the method in Figure 8(b). The position bits can then be synchronized between a standard sub-BF and its corresponding sub-CBF. However, this transformation is only advised for read-mostly workloads because of the cost of synchronizing element updates between the CBF and the BF.

Figure 8.

(a) Transformed standard sub-BFs instead of sub-CBFs will exist in DRAM for fast matching. (b) The transformation method. BFs: Bloom filters; DRAM: dynamic random-access memory.

To reclaim the free space after element deletions and make better use of memory space, Par-BF uses the union algebra operation to do garbage collection. A union operation mainly contains two steps.

(1) Sort the sub-BFs of a set in ascending order of the number of stored elements, giving $\{B_{1}, B_{2}, \dots\}$.

(2) If $| B_{1} | + | B_{2} | \leq n$, set $B_{2} = B_{1} \lor B_{2}$. The union operation then reclaims the sub-BF of $B_{1}$ and its container space, and the remaining sub-BFs are renumbered as the new sequence $\{B_{1}, B_{2}, \dots\}$. The procedure returns to step (1) and repeats until $| B_{1} | + | B_{2} | > n$. Figure 9 illustrates an example.

Figure 9.

An example of the union operations between two sub-BFs. Each BF is initialized as a CBF. There are 16 mapping entries (m = 16 × 4 bits), k = 3, and n = 3. BFs: Bloom filters; CBF: counting Bloom filter.
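The two garbage collection steps can be sketched as a loop. This is a simplified illustration in which each sub-BF is a hypothetical (element count, bit vector) pair and the union is a bitwise OR; merging the K-V containers and combining the counters of counting sub-BFs are omitted.

```python
def garbage_collect(sub_bfs, n: int):
    """Merge the least-filled sub-BFs while their combined element count fits in n.

    sub_bfs is a list of (element_count, bit_vector) pairs.
    """
    # Step (1): sort in ascending order of the number of stored elements.
    bfs = sorted(sub_bfs, key=lambda b: b[0])
    # Step (2): merge B1 into B2 while |B1| + |B2| <= n, then re-sort and repeat.
    while len(bfs) >= 2 and bfs[0][0] + bfs[1][0] <= n:
        (c1, v1), (c2, v2) = bfs[0], bfs[1]
        merged = (c1 + c2, v1 | v2)   # B2 = B1 OR B2; the space of B1 is reclaimed
        bfs = sorted([merged] + bfs[2:], key=lambda b: b[0])
    return bfs


# Three sub-BFs with n = 3: the two sub-BFs holding one element each are merged.
before = [(3, 0b0001), (1, 0b0010), (1, 0b0100)]
after = garbage_collect(before, n=3)
print(after)
```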

3.3. Parameter initializations

Some essential parameters of Par-BF can be tuned to meet the requirements of both performance and memory overhead. The appropriate values can be deliberately derived from the lookup performance analysis of Par-BF.

3.3.1. Lookup cost analysis

Suppose that the computing resources are fairly shared by all threads, which are treated identically. A membership lookup for an element x costs the most time when x does not exist, because the matching thread iterates over its sub-BF list from head to tail. We denote this worst time cost as Q_worst. Equation (8) shows the expected Q_worst, where the expected probability that a full sub-BF returns a wrong match is ${0.5}^{k} = fpp_{e}$. Since $F_{\max} = s_{e} \cdot fpp_{e}$, we have $F_{\max} = (Q_{worst} - s_{e} \cdot S_{b}) / S_{q}$. There is a critical trade-off point between performance and memory overhead. For example, Bufferhash (Anand et al., 2010) keeps all BFs in a super table in memory, so that S_b is negligible compared with S_q. In contrast, Bloomstore (Lu et al., 2012) keeps only a write buffer-related BF in memory and stores the other BFs in Flash to minimize memory overhead, but then S_b is at the same level as S_q.

\begin{aligned} Q_{worst} &= s_{e} \cdot ({0.5}^{k} \cdot (S_{b} + S_{q}) + (1 - {0.5}^{k}) \cdot S_{b}) \\ &= F_{\max} \cdot S_{q} + s_{e} \cdot S_{b} \end{aligned}

(8)

3.3.2. Parameter tunings

Assigning an appropriate number of threads is very important for a multi-threaded application to perform well on a multi-core system. Too few threads may not fully utilize the multi-core resources. Conversely, if we assign more threads than the hardware can support (which is called oversubscription), context switching and lock contention decrease performance. Thus, we need to create a suitable number of threads for indexing. A simple and effective choice is to set the thread number equal to the number of CPU cores (Pusukuri et al., 2011). For example, the thread library of C++11 provides std::thread::hardware_concurrency(), which returns a hint of the number of threads that can truly run concurrently for a given execution of a program, typically the number of CPU cores. This indicates that N is equal to C/T. T is also equal to the maximum number of sub-BF lists (each thread is responsible for the matching process of one sub-BF list). We can derive equation (9) since $m / n = \log_{2} e \cdot k \approx 1.44 \cdot k$ holds in each sub-BF (Bloom, 1970) when the sub-BF is a standard BF. When the sub-BF is built upon a CBF, the memory cost grows to 4M.

M = m \cdot \frac{N}{n} \cdot T \approx 1.44 \cdot k \cdot C,

(9)

As $s_{e} = N / n = F_{\max} / fpp_{e}$ is always satisfied, n can be initialized using equation (10), where F_max is calculated by equation (8).

n = \frac{N \cdot fpp_{e}}{F_{\max}} = \frac{C \cdot {0.5}^{k} \cdot S_{q}}{T \cdot (Q_{worst} - s_{e} \cdot S_{b})} .

(10)

We summarize the results for the essential parameters in Table 4. From equation (9), we conclude that if the maximum space overhead is limited to M′, the inequality $k \leq M^{'} / (1.44 \cdot C)$ must be satisfied. Besides, F_max is not less than fpp_e. Thus, the inequality $(Q_{worst} - s_{e} \cdot S_{b}) / S_{q} \geq fpp_{e} = {0.5}^{k}$ is satisfied. As a result, the range of k is restricted by equation (11):

\log_{0.5} (\frac{Q_{worst} - s_{e} \cdot S_{b}}{S_{q}}) \leq k \leq \frac{M^{'}}{1.44 \cdot C} .

(11)
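The bounds in equation (11) can be evaluated numerically. The sketch below is a minimal illustration, assuming that the memory limit M′ is given in bits and that a hypothetical `cell_bits` factor models the 4× cost of a CBF mentioned above (equation (11) itself is written for a standard BF, i.e. `cell_bits = 1`). The function name and the node-B-like inputs are assumptions for illustration only.

```python
import math


def k_bounds(q_worst: float, s_e: float, s_b: float, s_q: float,
             max_bits: float, capacity: float, cell_bits: int = 1) -> tuple[float, float]:
    """Real-valued lower and upper bounds on k from equation (11)."""
    f_max = (q_worst - s_e * s_b) / s_q
    if f_max <= 0.0:
        raise ValueError("Q_worst must exceed s_e * S_b")
    lower = -math.log2(f_max)  # equals log_{0.5}(F_max)
    # cell_bits = 4 is an assumed way to account for 4-bit CBF counters.
    upper = max_bits / (cell_bits * 1.44 * capacity)
    return lower, upper


if __name__ == "__main__":
    # Assumed inputs: Q_worst = 20 us, S_b neglected, S_q = 420 us,
    # M' = 20 GB taken as 20e9 bytes, C = 3e9 chunks, CBF-based sub-BFs.
    lo, hi = k_bounds(20.0, 12, 0.0, 420.0, 20e9 * 8, 3e9, cell_bits=4)
    print(f"{lo:.2f} <= k <= {hi:.2f}")
```

How the real-valued bounds are rounded to integer values of k is left to the caller in this sketch.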

Table 4.

Results of some essential parameters in Par-BF.

Symbol	Value
F_max	$\frac{Q_{worst} - s_{e} \cdot S_{b}}{S_{q}}$
fpp_e	${0.5}^{k}$
n	$\frac{N \cdot fpp_{e}}{F_{\max}} = \frac{C \cdot {0.5}^{k} \cdot S_{q}}{T \cdot (Q_{worst} - s_{e} \cdot S_{b})}$
m (bits)	$n \cdot 1.44 \cdot k$
M	$1.44 \cdot k \cdot C$
s_e	$\frac{M}{m \cdot T} = \frac{C}{n \cdot T}$

Par-BF: parallel partitioned Bloom filter.

4. Evaluation

We evaluate and verify the design of Par-BF using a typical distributed I/O-intensive process, network redundancy elimination (NRE). NRE aims to identify and eliminate duplicate chunks that are repeated across network links. The Par-BF-based NRE has been established in our network function virtualization prototype in OpenStack (Ge et al., 2014). Briefly, the NRE process segments a transferred data stream into data chunks according to a fixed- or variable-size chunking policy (Anand et al., 2009). Those chunks are stored in a local store for future reuse instead of being read again from the remote data servers. The efficiency of NRE depends directly on how well duplicated chunks are identified through Par-BF. Specifically, we focus on four aspects: (1) the effects of parameter tuning to meet the performance/cost trade-off in a heterogeneous environment; (2) the lookup performance of DBF, SBF, and Par-BF; (3) the efficiency of garbage collection in reducing memory overhead; and (4) the performance effect of choosing different thread numbers in Par-BF.

4.1. Experiment setup

As illustrated in Figure 10, three nodes, A, B, and C, provide NRE services, each configured with a different local store to meet the SLA of its NRE service. Each node manages a subset of chunks as a cache. The local stores of nodes A, B, and C are built upon memory only, both memory and Flash, and Flash only, respectively. LRU is set as the default cache replacement algorithm. The role of Par-BF is to quickly determine whether a requested chunk should be served from the local node or from the remote data container. CBF is the fundamental unit of Par-BF so that chunk deletions caused by replacement operations are supported. We evaluate Par-BF on an Ubuntu 13.02 64-bit server. The basic hardware environment is described in Table 5. As given in Table 6, $σ_{L}$ and $σ_{R}$ of the three nodes are measured following equation (5), while $σ^{'}_{E}$ of each node is predefined as the worst access time according to the SLA demand and hardware configuration. As a result, the corresponding Q_worst of Par-BF can be deduced on demand. The hr is recorded using 12 GB memory of node A, 24 GB memory plus 3 TB Flash of node B, and 100 GB Flash of node C as the cache, respectively. Because only one workload is adopted, the hr of each node varies only within a narrow range.

Figure 10.

Overview of the experimental simulation diagram.

Table 5.

The composition of our testing heterogeneous environment.

Component	Node A	Node B	Node C	Remote data server
CPU	2× Intel i5-4570 (16 threads)	2× Intel E5-2670 (16 threads)	2× Intel i5-4570	Intel E5-2670
RAM (GB)	16	64	4	32
DISK	×	2× 2 TB 4K-stripe RAID-0 Flash	Intel SSD 320 128 GB	4× 1 TB 8K-stripe RAID-0 HDD
NIC	10-Gigabit Ethernet

CPU: central processing unit; RAM: random-access memory; NIC: network interface card.

Table 6.

Measurements of I/O costs in our testing heterogeneous environment.

Cost	Node A	Node B	Node C
σ_L (µs)	10	64	220
σ_R (µs)	400	420	560
hr (%)	75	90	30
Q_worst (µs)	12	20	40
$σ^{'}_{E}$ (µs)	120	120	500

I/O: input/output.

We set T = 16 × 2 = 32 by default, which is equal to the number of CPU cores in each compute node. The three major experimental components are configured as follows.

4.1.1. Generating FP

We use a 128-bit variant of Murmurhash (Appleby, 2009) to generate an FP (used as the key) that represents a chunk. The first 64 bits, [0, 63], of the generated 128 bits are taken as the final key, as a trade-off between saving key space and avoiding key collisions. The FP serves as the hint for fetching the target chunk from the remote container, rather than comparing chunks directly.

4.1.2. Generating k random hash functions for each sub-BF

Kirsch and Mitzenmacher (2006) have proved that using two independent hash functions, h₁(x) and h₂(x), to simulate additional hash functions of the form $g_{i}(x) = h_{1}(x) + i \cdot h_{2}(x)$ $(1 \leq i \leq k)$ can significantly reduce the computational overhead of generating k independent hash functions. Therefore, we adopt the first 64 bits, [0, 63], of the 128-bit Murmurhash value as h₁(x) and the remaining bits, [64, 127], as h₂(x). For each element $x \in S$, the counter at each mapped position $g_{i}(x) \bmod m$ is increased by 1 in the corresponding CBF.
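The double-hashing scheme described above can be sketched in a few lines of Python. This is a minimal illustration, not the Par-BF implementation: the class name is hypothetical, counters are plain Python integers rather than packed 4-bit cells, and `hashlib.blake2b` with a 16-byte digest is used as a stand-in for the 128-bit Murmurhash, which is not part of the Python standard library.

```python
import hashlib


class CountingBloomFilter:
    """Toy CBF whose k positions come from g_i(x) = h1(x) + i * h2(x) mod m."""

    def __init__(self, m: int, k: int) -> None:
        self.m = m
        self.k = k
        self.counters = [0] * m

    def _positions(self, key: bytes) -> list[int]:
        # Assumption: blake2b replaces the 128-bit Murmurhash used in the paper.
        digest = hashlib.blake2b(key, digest_size=16).digest()
        h1 = int.from_bytes(digest[:8], "little")   # bits [0, 63]
        h2 = int.from_bytes(digest[8:], "little")   # bits [64, 127]
        return [(h1 + i * h2) % self.m for i in range(1, self.k + 1)]

    def add(self, key: bytes) -> None:
        for pos in self._positions(key):
            self.counters[pos] += 1

    def remove(self, key: bytes) -> None:
        # Only decrement when the key appears present; never go below zero.
        if key in self:
            for pos in self._positions(key):
                if self.counters[pos] > 0:
                    self.counters[pos] -= 1

    def __contains__(self, key: bytes) -> bool:
        return all(self.counters[pos] > 0 for pos in self._positions(key))


if __name__ == "__main__":
    cbf = CountingBloomFilter(m=1024, k=8)
    cbf.add(b"chunk-1")
    print(b"chunk-1" in cbf)
    cbf.remove(b"chunk-1")
    print(b"chunk-1" in cbf)
```

Only one hash computation per key is needed, regardless of k, which is the saving that the double-hashing construction aims for.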

4.1.3. The implementation of remote key-value containers

We use Berkeley DB (BDB for short) to organize the remote key-value data containers with a B-Tree index, where each key is the FP of a chunk and the value is the chunk’s content.

4.2. Data trace

We develop a software simulator to generate a key-value workload according to the principles of the Yahoo! Cloud Serving Benchmark (YCSB) (Cooper et al., 2010). YCSB uses the Monte Carlo method to generate a typical I/O access distribution. First, a certain number of chunks are chosen from the large sample space of unique chunks according to a probability distribution. Then these chunks are organized into trace files, which are formatted as lines of chunk key: chunk content: size: operation, with the colon as the delimiter. The operation is one of three operations: read, write, or delete. Finally, each trace file is formatted as lines of timestamp: file name to simulate the user’s behavior.

For simplicity, each data flow that satisfies a client request operates at whole-file granularity. A chunk is identified as a duplicate when it hits in the cache; otherwise, it is read from the remote data container. To emulate user behavior in requesting the files of data sets, we generated data requests from three distinct synthetic traces, T₁, T₂, and T₃, which simulate differentiated NRE services according to the access characteristics of practical network applications. Table 7 describes the three traces in detail. For example, the T₁ trace contains 8.25×10¹² requests for 1.2×10⁹ unique chunks with a 128 B average request size. Nodes A, B, and C use T₁, T₂, and T₃ to simulate data requests, respectively.

Table 7.

The statistics of NRE traces.

Trace	Requests	Chunk samples	Chunk size	File size
T₁	8.25 × 10¹²	1.2 × 10⁹	128 B	16 KB
T₂	6.20 × 10¹²	3.0 × 10⁹	1 KB	16 KB
T₃	4.50 × 10⁹	2.8 × 10⁹	8 KB	1 MB

NRE: network redundancy elimination.

The file access behavior of the three traces obeys a Zipf distribution. Meanwhile, the chunk requests of T₁, T₂, and T₃ are generated according to the “latest” distribution (α = 0.99) (Cooper et al., 2010), the Zipf distribution, and a random distribution, respectively.

4.3. Parameter tunings

The parameters for tuning in Par-BF are given in Table 8. Node B is used as an example to walk through the parameter tuning process in detail. We use 3 TB Flash as the local K-V store to speed up chunk reads. Thus, the overall capacity of elements (chunks), C, is 3×10⁹. The worst indexing time of Par-BF, Q_worst, is 20 µs (see Table 6). All sub-BFs are kept in memory, so we can neglect S_b. S_q is 420 µs, which equals $σ_{R}$ (see Table 6). Therefore, F_max is computed by equation (8) as 20/420 ≈ 4.8%. The maximal memory overhead of Par-BF is restricted to 20 GB. According to equation (11), $k \in [4, 9]$, and we use k = 8 by default. M is evaluated as 4× the value of equation (9) since each sub-BF is built upon a CBF. The value of n is calculated using equation (10). Thus, we have $s_{e} = C / (n \cdot T)$ and $m = M / (s_{e} \cdot T)$. After s_e is calculated, we can obtain N.

Table 8.

Preliminary definition of parameters.

Node	C	M/M′	k	F_max	fpp_e	n	m	s_e	N
A	9 × 10⁷	0.52/1 (GB)	$8 \in [5, 30]$	3%	0.4%	$3.75 \times 10^{5}$	2.2 MB	8	$3 \times 10^{6}$
B	3 × 10⁹	17.5/20 (GB)	$8 \in [4, 9]$	4.8%	0.4%	7.875 × 10⁶	45.4 MB	12	9.45 × 10⁷
C	1.25 × 10⁷	90/100 (MB)	$10 \in [3, 11]$	8%	0.1%	5460	38.5 KB	80	4.368 × 10⁵
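The tuning steps for node B can be replayed with a short script. The sketch below is an approximation under stated assumptions: S_b is neglected, the CBF factor is modeled by a hypothetical `cell_bits = 4` parameter, and unrounded intermediate values are used, so the outputs land near, but not exactly on, the entries of Table 8 (for example, n comes out at roughly 7.7 × 10⁶ rather than 7.875 × 10⁶). All names in the code are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ParBFParams:
    f_max: float    # maximal false positive budget of a sub-BF list
    fpp_e: float    # expected false positive probability of a full sub-BF
    n: float        # elements per sub-BF, equation (10)
    s_e: float      # expected sub-BFs per list
    N: float        # elements per sub-BF list (C / T)
    M_bytes: float  # total memory, equation (9) scaled by cell_bits
    m_bytes: float  # memory per sub-BF


def tune(capacity: float, k: int, threads: int, q_worst: float,
         s_q: float, cell_bits: int = 4) -> ParBFParams:
    """Derive the Table 4 parameters when S_b is neglected (assumption)."""
    fpp_e = 0.5 ** k
    f_max = q_worst / s_q                       # equation (8) with S_b = 0
    n = capacity * fpp_e / (threads * f_max)    # equation (10)
    s_e = capacity / (n * threads)
    total_bits = cell_bits * 1.44 * k * capacity
    m_bits = total_bits / (s_e * threads)
    return ParBFParams(f_max, fpp_e, n, s_e, s_e * n,
                       total_bits / 8, m_bits / 8)


if __name__ == "__main__":
    # Node B inputs from Tables 6 and 8: C = 3e9, k = 8, T = 32,
    # Q_worst = 20 us, S_q = 420 us.
    p = tune(capacity=3e9, k=8, threads=32, q_worst=20.0, s_q=420.0)
    print(f"F_max = {p.f_max:.3%}, fpp_e = {p.fpp_e:.3%}")
    print(f"n = {p.n:.4g}, s_e = {p.s_e:.2f}, N = {p.N:.4g}")
    print(f"M = {p.M_bytes / 1e9:.2f} GB, m = {p.m_bytes / 1e6:.1f} MB")
```

The small gaps relative to Table 8 appear to come from rounding of intermediate values such as fpp_e and s_e.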

As shown in Figure 11, the average indexing latency $(σ_{I})$ of node A, node B, and node C is measured as 7.24 µs, 12.25 µs, and 32.12 µs, respectively. The worst indexing latencies of node A, node B, and node C are measured as 13.66 µs, 19.92 µs, and 38.87 µs, respectively. They are close to the expected Q_worst in Table 6. Thus, we validate that parameter tuning in Par-BF can meet both the performance and the cost requirements on each node.

Figure 11.

Results of measuring $σ_{I}$ in each node.

Par-BF is also able to find a sweet spot that balances the trade-off between high performance and low overhead. Suppose that X ($X \leq s_{e}$) sub-BFs of each sub-BF list are kept in main memory. A membership query is then checked against the BDB index in the remote data server when it misses in all X sub-BFs. For instance, Figure 12 compares $σ_{I}$ of node B between the two actual recorded values and the expected value for different proportions of memory overhead $ρ$. The actual $σ_{I}$ is close to the expected $σ_{I}$ (nearly 99%), which validates our parameter tuning policy. The two actual values differ in the sub-BF replacement policy that decides which sub-BFs reside in memory. Latest keeps the most recently accessed sub-BFs in memory, like the LRU replacement policy, while Random chooses the in-memory sub-BFs uniformly at random. We can see that the actual indexing cost under Random is close to the expected Q_worst (above 98%). Meanwhile, the actual indexing cost under Latest is much less than the expected value when the proportion of memory overhead lies strictly between 0% and 100%. This is because the most recently allocated BF is queried first, which pays off when data accesses exhibit temporal locality. In this case, keeping 20% of the sub-BFs in memory can alleviate the pressure of indexing cost.

Figure 12.

Indexing cost (latency) comparisons between the actual values and the expected value when choosing different proportion of memory overhead $ρ$ . The expected $σ_{I}$ is calculated by $ρ \cdot S_{b} + (1 - ρ) \cdot S_{q}$ , where S_b is recorded as 7.24 µs and S_q is recorded as 100 µs.
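The expected value in the caption of Figure 12 is a simple weighted average, which the following sketch evaluates. The function name is hypothetical, and ρ is assumed to be expressed as a fraction in [0, 1] rather than as a percentage.

```python
def expected_indexing_cost(rho: float, s_b: float, s_q: float) -> float:
    """Expected sigma_I = rho * S_b + (1 - rho) * S_q, with rho in [0, 1]."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be a fraction between 0 and 1")
    return rho * s_b + (1.0 - rho) * s_q


if __name__ == "__main__":
    # S_b = 7.24 us and S_q = 100 us are the values given in the caption.
    for rho in (0.0, 0.2, 0.5, 1.0):
        print(rho, expected_indexing_cost(rho, 7.24, 100.0))
```

For example, with 20% of the sub-BFs in memory the expected cost under this formula is about 81.4 µs.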

4.4. The performance of Par-BF

False positives slow down the overall matching throughput. According to equation (3), fpp_e must be restricted to an acceptable range to meet the matching performance requirement. We record the average indexing throughput of nodes A, B, and C under three dynamic BF mechanisms, DBF, SBF, and our proposed Par-BF; the results are shown in Figure 13. Accordingly, we draw the following conclusions.

The IOPS of Par-BF outperforms that of DBF and SBF by 10× to 14× and by 3× to 8×, respectively. This is because Par-BF always runs T matching threads to achieve parallelism.

DBF always performs the worst, for two main reasons. One is that its list size is more than 15× that of Par-BF. The other is its unacceptable false positive value $(s_{m} \cdot fpp_{e} = 1.92)$ according to Figure 4(a).

Compared with T₃, the temporal locality of chunk accesses in T₁ and T₂ greatly increases the IOPS.

Figure 13.

Throughput comparisons between DBF, SBF, and Par-BF when running T₁, T₂, and T₃. SBF is initialized with r = 0.5, s = 2, and f(1) = fpp_e. DBF: dynamic Bloom filter; SBF: scalable Bloom filter; Par-BF: parallel partitioned Bloom filter.

4.5. Garbage collection

Figure 14 shows the memory overhead of node B without GC. The memory overhead of both DBF and Par-BF is about 0.75 GB less than that of SBF. Supporting chunk deletions is the precondition for garbage collection. As a result, CBF is advised as the basic sub-BF data structure. The space overhead of a CBF is usually about 3× more than that of the basic BF. Figure 15 shows the benefit of the GC process in NRE: the stored chunks are more valuable and waste less storage space. For instance, redundant chunks of T₂ account for only 21.76% without GC, whereas this value is 48.36% with GC. This difference indicates that the GC process makes the memory space more efficient $(21.76\% / 48.36\% \approx 0.45\times)$ compared with the space occupied without GC. The final memory overhead of Par-BF is about 8.4 GB, which is less than half of the space overhead of SBF (4.5 × 4 GB = 18 GB).

Figure 14.

The memory overhead of using data trace T₂ by the three BF policies. BF: Bloom filter.

Figure 15.

The comparison on percentage of chunk access in each trace (type “1” denotes “do GC” and type “2” denotes “do NOT GC”).

4.6. Choosing thread number

The results in Figure 16 validate the choice of setting the thread number, T, equal to the number of CPU cores. From Table 4 and equation (10), we know that $n \propto 1 / T$ when T is the only variable and k and the essential I/O costs are held constant. For example, the value of n doubles when T is changed from 32 to 16. We found that the throughput of Par-BF on each node gradually improves as the thread number increases up to 32, at which point it outperforms all other thread settings on the three nodes. However, when the thread number exceeds 32, the throughput of Par-BF may decline due to expensive thread synchronization. Thus, making the thread number equal to the number of CPU cores can not only make full use of parallelism but also avoid costly synchronization of surplus threads.

Figure 16.

Par-BF throughput comparisons for different thread numbers on the three nodes. Par-BF: parallel partitioned Bloom filter.

4.7. Trade-off between memory overhead and indexing performance

As mentioned above, the expected fpp_e of each sub-BF in Par-BF governs the trade-off between memory overhead and indexing performance. From Table 4, fpp_e is determined by k. As shown in Figure 17, we compare the memory overhead and average latency for four typical values of k on node B. We can conclude: (1) A Par-BF with a smaller k uses less memory but has a worse average indexing latency. For example, the memory overhead is 8.78 GB and the average indexing latency is 48.6 µs when k is set to 4. In contrast, the memory overhead grows to 52.1 GB, but the average indexing latency is reduced to 0.16 µs when k is set to 24. (2) When a traditional hash table based on linear chaining is used as the indexing structure, the memory overhead is 48 GB to index all FPs of chunks, while the average indexing latency is about 2.4 µs, which cannot be neglected due to hash collisions. (3) Finally, k = 8 is recommended, as given in Table 8, to balance memory overhead (17.5 GB) and indexing performance (12.25 µs).
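The memory side of this trade-off follows directly from equation (9). The sketch below evaluates it for node B, assuming a hypothetical `cell_bits = 4` factor for CBF-based sub-BFs and decimal gigabytes; the function name is illustrative. The results are close to, though slightly below, the measured overheads quoted above.

```python
def cbf_memory_gb(k: int, capacity: float, cell_bits: int = 4) -> float:
    """Total Par-BF memory in GB: cell_bits * 1.44 * k * C bits (equation (9))."""
    bits = cell_bits * 1.44 * k * capacity
    return bits / 8 / 1e9


if __name__ == "__main__":
    capacity = 3e9  # C of node B
    for k in (4, 8, 24):
        fpp_e = 0.5 ** k
        print(f"k={k:2d}  fpp_e={fpp_e:.2e}  M={cbf_memory_gb(k, capacity):.2f} GB")
```

Each additional hash function halves fpp_e while adding a fixed amount of memory (about 2.16 GB per unit of k under these assumptions), which is why a moderate k such as 8 is attractive.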

Figure 17.

Comparisons of different k values for Par-BF on node B: (a) memory overhead and (b) average latency. A traditional hash table based on linear chaining is used as the baseline. Par-BF: parallel partitioned Bloom filter.

5. Related work

Our work builds upon the following previous work.

5.1. Application

Guided by the space-efficient property of BFs, which were first introduced in 1970 (Bloom, 1970), BF-based indexing has become increasingly popular for fast matching applications. Anand et al. (2009) proposed using a BF as the index to identify FP matches in various NRE processes. Zhu et al. (2008) used a BF for backup deduplication, called the Summary Vector, as the summary data structure to test whether a data chunk is a duplicate. Debnath et al. (2010) designed a disk-presence BF in RAM to record keys destaged to hard disk so that hard disk access latencies can be avoided when lookups are done on nonexisting keys in a large-scale deduplication system. BF-based indexes are also used in IP routing (Li et al., 2011; Song et al., 2005; Yu et al., 2009). Broder and Mitzenmacher (2004) wrote an excellent survey on network applications of BFs. Meanwhile, several studies (Anand et al., 2010; Debnath et al., 2011; Lu et al., 2012) design flash-based BFs that offer a new dimension of trade-off, accepting longer BF access times to reduce RAM space usage. In particular, BF-based joins (Bloomjoin) have been widely studied (Babb, 1979; Lee et al., 2012; Michael et al., 2007; Mullin, 1990) to optimize distributed query execution in distributed databases (e.g. Hive in Map/Reduce frameworks).

5.2. Variants of BF

Many variants of the basic BF technique have been proposed in previous research. The d-left CBF (Bonomi et al., 2006) divides a hash table into d sub-tables of the same size. It offers the same functionality as a CBF but uses less space, generally saving a factor of two or more. The compressed BF (Mitzenmacher, 2002) improves delivery efficiency when a BF is passed in a message between distributed nodes. The hierarchical BF (Shanmugasundaram et al., 2004) is a data structure that supports matching multiple substrings at block-size granularity. The spectral BF (Cohen and Matias, 2003) mainly supports membership queries in multi-sets. Bloomier filters (Chazelle et al., 2004), which are implemented using a cascade of BFs, allow values to be associated with a subset of the domain elements. The weighted BF (Jehoshua et al., 2006) exploits a priori knowledge of how frequently each element is requested by varying the popularity degree of each mapping function. Hao et al. (2007) proposed a partitioned hashing technique to select hash functions that set fewer bits. Although their experimental results show a 10-fold performance improvement over standard constructs, the technique is not applicable to dynamic environments. Tarkoma et al. (2012) gave a comprehensive survey reviewing over 20 variants and discussing their applications in distributed systems. The works most closely related to ours are the DBF (Guo et al., 2010) and the SBF (Almeida et al., 2007). However, as mentioned above, both still face challenges in system performance and memory overhead when indexing a large-scale dynamic data set.

6. Conclusion and future work

Multiple independent sets contending for a restricted shared resource is a universal situation in multitasking computer systems. As shown in Table 1, both DBF and SBF still struggle to index dynamic data sets with low memory overhead while achieving high performance. To overcome this challenge, we design a new BF, called Par-BF, whose essential parameters can be tuned by a group of formulas. We validate our Par-BF design via trace-driven evaluation in a practical application. By supporting parallel fast matching with a suitable thread number, Par-BF outperforms both DBF and SBF. Moreover, Par-BF costs less memory than DBF and SBF because the union algebra operation in Par-BF reduces memory wastage.

6.1. Future work

Owing to the limitations of our current work, the interesting problem of exploiting Par-BF in hardware accelerators, such as graphics processing units and field-programmable gate arrays, is not discussed in this article. We intend to use more benchmarks to evaluate Par-BF and to validate more applications, especially for I/O-intensive scientific computing. The code of Par-BF is openly available for future study at https://github.com/ironliuyi/Par-BF.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable feedback to improve this article. We also appreciate Dr Liang Zhang and other researchers from Huawei Corporation for their support.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work is partially supported by the following NSF awards: 1439622, 1305237, 1421913, 1217569, and 1115471.

Author biographies

Yi Liu is currently a researcher in Huawei. In 2015, he received his doctoral degree in Computer Application Technology from Shenzhen Institutes of Advanced Technology, Chinese Academy of Science, China. In 2008, he received his bachelor’s degree in computer science at National University of Defense and Technology, China, and in 2011, he received his master’s degree in software engineering at Peking University, China. His current research interests include SSD storage techniques and deduplication in networks and backup systems.

Xiongzi Ge is currently a PhD student at the Department of Computer Science and Engineering, University of Minnesota, Twin Cities. He received his BE and PhD degrees in computer science from Huazhong University of Science and Technology, China, in 2005 and 2012, respectively. He was a senior software engineer in Shenzhen Cloud Computing Center from 2011 to 2013. He was a visiting scholar in the Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, from 2009 to 2011. He was a visiting student of the Digital Technology Center (DTC) at the University of Minnesota from 2008 to 2009.

David Hung-Chang Du is currently the Qwest Chair Professor of Computer Science and Engineering at the University of Minnesota, Minneapolis. He is the Center Director of the NSF multi-university I/UCRC Center of Research in Intelligent Storage (CRIS). In addition to NSF, CRIS is currently sponsored by 11 companies with 15 sponsored memberships. He was a Program Director (IPA) at the National Science Foundation CISE/CNS Division from March 2006 to August 2008. At NSF, he was responsible for the NeTS (Networking Research cluster) NOSS (Networks of Sensor Systems) Program and worked with two other colleagues, Karl Levitt and Ralph Wachter, on the Cyber Trust (Internet Security) Program. In 2008, he was also assigned to the CSR (Computer System Research) Cluster for handling research in computer systems. He received his BS degree in mathematics from National Tsing-Hua University (Taiwan) in 1974 and MS and PhD degrees from the University of Washington (Seattle) in 1980 and 1981, respectively. He joined the University of Minnesota as a faculty member in 1981. He has also been a visiting professor in Germany, Korea, Singapore, Hong Kong, and Taiwan.

Xiaoxia Huang is currently a professor in Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. She is the Deputy Director of the Center for Real-Time Monitoring and Communications Technologies (RTT). Her research interests include cognitive radio networks, smartphone applications, wireless sensor networks, and wireless communications. She received her BE and ME degrees in electrical engineering from Huazhong University of Science and Technology, China, in 2000 and 2002, respectively, and the PhD degree in electrical and computer engineering from the University of Florida in 2007.

References

Abouzeid, Bajda-Pawlikowski, Abadi, et al. (2009) HadoopDB: an architectural hybrid of Map/Reduce and DBMS technologies for analytical workloads. Proceedings of the VLDB Endowment 2(1): 922–933.

Almeida, Baquero, Preguica, et al. (2007) Scalable bloom filters. Information Processing Letters 101(6): 255–261.

Anand, Muthukrishnan, Akella, et al. (2009) Redundancy in network traffic: findings and implications. In: Proceedings of the eleventh international joint conference on measurement and modeling of computer systems, Seattle, Washington, 15–19 June 2009, pp. 37–48. New York: ACM.

Anand, Muthukrishnan, Kappes, et al. (2010) Cheap and large cams for high performance data-intensive networked systems. In: Proceedings of the 7th USENIX conference on networked systems design and implementation, Vol 10, San Jose, CA, 28–30 April 2010, pp. 29–29. Berkeley: USENIX Association.

Appleby (2009) Murmurhash 64 bits variant. Available at: https://sites.google.com/site/murmurhash/ (accessed 1 June 2013).

Babb (1979) Implementing a relational database by means of specialized hardware. ACM Transactions on Database Systems (TODS) 4(1): 1–29.

Bloom (1970) Space/time trade-offs in hash coding with allowable errors. Communications of the ACM 13(7): 422–426.

Bonomi, Mitzenmacher, Panigrahy, et al. (2006) An improved construction for counting bloom filters. In: Azar, Erlebach (eds) Algorithms–ESA. Heidelberg: Springer, pp. 684–695.

Broder, Mitzenmacher (2004) Network applications of bloom filters: a survey. Internet Mathematics 1(4): 485–509.

Cameron (1994) Combinatorics: Topics, Techniques, Algorithms. New York: Cambridge University Press.

Chazelle, Kilian, Rubinfeld, et al. (2004) The bloomier filter: an efficient data structure for static support lookup tables. In: Proceedings of the fifteenth annual ACM-SIAM symposium on discrete algorithms, New Orleans, LA, 11–13 January 2004, pp. 30–39. Philadelphia: Society for Industrial and Applied Mathematics.

Cohen, Matias (2003) Spectral bloom filters. In: Proceedings of the 2003 ACM SIGMOD international conference on Management of data, San Diego, CA, 9–12 June 2003, pp. 241–252. New York: ACM.

Cooper, Silberstein, Tam, et al. (2010) Benchmarking cloud serving systems with YCSB. In: Proceedings of the 1st ACM symposium on Cloud computing, Indianapolis, IN, 10–11 June 2010, pp. 143–154. New York: ACM.

Crago, Dunn, Eads, et al. (2011) Heterogeneous cloud computing. In: 2011 IEEE International Conference on Cluster Computing (CLUSTER), Austin, TX, 26–30 September 2011, pp. 378–385. Washington, DC: IEEE Society.

Debnath, Sengupta (2010) Flashstore: high throughput persistent key-value store. Proceedings of the VLDB Endowment, vol. 3, Singapore, 13–17 September 2010, pp. 1414–1425. New York: ACM.

Debnath, Sengupta, et al. (2011) Bloomflash: Bloom filter on flash-based storage. In: Distributed Computing Systems (ICDCS), 2011 31st International Conference on, Minneapolis, MN, 20–24 June 2011, pp. 635–644. Washington, DC: IEEE Computer Society.

Dietzfelbinger, Karlin, Mehlhorn, et al. (1994) Dynamic perfect hashing: upper and lower bounds. SIAM Journal on Computing 23(4): 738–761.

Dittrich, Quiané-Ruiz, Jindal, et al. (2010) Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). Proceedings of the VLDB Endowment, vol. 3, Singapore, 13–17 September 2010, pp. 515–529. New York: ACM.

Fan, Cao, Almeida, et al. (2000) Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking (TON) 8(3): 281–293.

Flynn (1972) Some computer organizations and their effectiveness. IEEE Transactions on Computers 100(9): 948–960.

Fredman, Komlos, Szemeredi (1984) Storing a sparse table with O(1) worst case access time. Journal of the ACM (JACM) 31(3): 538–544.

Gantz, Reinsel (2012) The digital universe in 2020: big data, bigger digital shadows, and biggest growth in the far east. IDC iView: IDC Analyze the Future 2007: 1–16.

Ge, Liu, Zhang, et al. (2014) OpenANFV: accelerating network function virtualization with a consolidated framework in openstack. In: SIGCOMM, pp. 353–354. New York: ACM.

Guo, Chen, et al. (2010) The dynamic bloom filters. IEEE Transactions on Knowledge and Data Engineering 22(1): 120–133.

Hao, Kodialam, Lakshman (2007) Building high accuracy bloom filters using partitioned hashing. ACM SIGMETRICS Performance Evaluation Review 35(1): 277–288.

Jehoshua, Jie, Anxiao (2006) Weighted bloom filter. In: 2006 IEEE International Symposium on Information Theory, Seattle, USA, 9–14 July 2006, pp. 2304–2308. Washington, DC: IEEE.

Kirsch, Mitzenmacher (2006) Less hashing, same performance: building a better bloom filter. In: Azar, Erlebach (eds) Algorithms–ESA. Heidelberg: Springer, pp. 456–467.

Lee, Kim (2012) Join processing using bloom filter in Map/Reduce. In: Proceedings of the 2012 ACM Research in Applied Computation Symposium, San Antonio, TX, 23–26 October 2012, pp. 100–105. New York: ACM.

Li et al. (2011) Exploring efficient and scalable multicast routing in future data center networks. In: Proceedings of the IEEE INFOCOM’11, Shanghai, 10–15 April 2011, pp. 1368–1376. Washington, DC: IEEE.

Lu, Nam (2012) Bloomstore: Bloom-filter based memory-efficient key-value store for indexing of data deduplication on flash. In: IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST), Santa Clara, CA, 30 April–6 May 2012, pp. 1–11. Washington, DC: IEEE.

Mackert, Lohman (1986) R* optimizer validation and performance evaluation for distributed queries. In: VLDB’86 Twelfth International Conference on Very Large Data Bases, Kyoto, Japan, 25–28 August 1986, pp. 149–159. San Francisco: Morgan Kaufmann.

Madhavapeddy, Mortier, Crowcroft, et al. (2010) Multiscale not multicore: Efficient heterogeneous cloud computing. In: Proceedings of the 2010 ACM-BCS Visions of Computer Science Conference, Edinburgh, UK, 13–16 April 2010, p. 6. Swinton, UK: British Computer Society.

Michael, Nejdl, Papapetrou, et al. (2007) Improving distributed join efficiency with extended bloom filter operations. In: 21st International Conference on Advanced Information Networking and Applications, pp. 187–194. IEEE Society.

Mitzenmacher (2002) Compressed bloom filters. IEEE/ACM Transactions on Networking (TON) 10(5): 604–612.

Mullin (1990) Optimal semijoins for distributed database systems. IEEE Transactions on Software Engineering 16(5): 558–560.

Pagh, Rodler (2004) Cuckoo hashing. Journal of Algorithms 51(2): 122–144.

Pusukuri, Gupta, Bhuyan (2011) Thread reinforcer: dynamically determining number of threads via OS level monitoring. In: 2011 IEEE International Symposium on Workload Characterization, Austin, TX, 6–8 November 2011, pp. 116–125. Washington, DC: IEEE Society.

Rao, Ramakrishnan, Silberstein, et al. (2012) Sailfish: a framework for large scale data processing. In: Proceedings of the Third ACM Symposium on Cloud Computing, San Jose, CA, 14–17 October 2012, p. 4. New York: ACM.

Shanmugasundaram, Bronnimann, Memon (2004) Payload attribution via hierarchical bloom filters. In: Proceedings of the 11th ACM conference on computer and communications security, Washington, DC, 25–29 October 2004, pp. 31–41. New York: ACM.

Song, Dharmapurikar, Turner, et al. (2005) Fast hash table lookup using extended bloom filter: an aid to network processing. In: ACM SIGCOMM Computer Communication Review, vol. 35, Philadelphia, PA, 22–26 August 2005, New York: ACM.

Tarkoma, Rothenberg, Lagerspetz (2012) Theory and practice of bloom filters for distributed systems. Communications Surveys & Tutorials, IEEE 14(1): 131–155.

Yu, Fabrikant, Rexford (2009) BUFFALO: bloom filter forwarding architecture for large organizations. In: Proceedings of the 5th international conference on Emerging networking experiments and technologies, Rome, Italy, 1–4 December 2009, pp. 313–324. New York: ACM.

Zhu, Patterson (2008) Avoiding the disk bottleneck in the data domain deduplication file system. In: Proceedings of the Fast’08, Vol 8, San Jose, CA, 26–29 February 2008, pp. 269–282. Berkeley: USENIX.