Abstract
The identification of frequent patterns plays a key role in mining association rules. FP-growth is a fundamental algorithm for frequent pattern mining. It employs a prefix tree structure (FP-Tree) and a recursive mining process to discover frequent patterns. However, the performance of FP-growth is closely related to the total number of recursive calls, which leads to poor performance when multiple conditional FP-trees are required to be constructed. This paper proposes highly compressed FP-tree (HCFP-tree). This increases prefix sharing and reduces the number of nodes in the prefix tree. Based on HCFP-tree, we design a new algorithm called HCFP-growth. This algorithm greatly reduces the number of recursive calls required to mine full frequent patterns. Experiments conducted on various types of datasets demonstrate that HCFP-growth is always among the fastest algorithms. It also consumes the least memory in many cases, and its memory consumption is comparable to that of existing algorithms in other cases.
Introduction
Frequent itemset or frequent pattern (FI) mining is employed in data mining and knowledge discovery tasks such as spanning classification [1], clustering [2], decision support and regular pattern mining [3]. It has also been applied to many different types real-world databases containing medical clinical data [4], web data [5], customer data [6], library records [7], and so forth.
Apriori [8] and FP-growth [9] are two fundamental algorithms for mining FIs. Apriori requires multiple database scans and consumes substantial memory. In contrast, FP-growth uses a prefix tree structure based on a pattern growth approach that avoids candidate generation and significantly improves performance. Therefore, FP-growth is more commonly applied to mining tasks.
Many extended algorithms based on FP-growth have been proposed to accomplish various mining tasks. FI-mining algorithms also exist for uncertain data, such as Tube-trees [10] and UF-streaming+ [11]. The following techniques are based on mining weighted frequent patterns. WIP [12] and IWS [13] are algorithms that mine weighted interesting patterns, and IWI Miner [14] is applied in mining infrequent weighted patterns. Because it is challenging for a data analyst to specify the appropriate minimum support, a top-k mining approach has been proposed that efficiently retrieves a set of top-k frequent patterns without requiring a minimum support parameter; these algorithms include FCPminer [15] and BTK [16]. Because each item is different and should not be treated identically among transaction datasets, FPME [17] was proposed to mine FI in these transaction datasets. These algorithms all rely on the tree structure of the FP-growth algorithm. Therefore, reducing the complexity of the FP-growth algorithm would greatly improve the efficiency of these algorithms.
The FP-tree in FP-growth consists of a prefix tree and a header table. As a result, FP-growth scans the database only twice. The first scan identifies all the frequent items and counts their support. Then, during the second scan, the algorithm inserts the frequent transaction items identified into the FP-tree.
A transaction dataset
A transaction dataset
FP-tree for the dataset in Table 1.
After constructing the FP-tree, FP-growth constructs a conditional pattern base and a conditional FP-tree for each item in the header table. This procedure is then applied recursively to each conditional FP-tree until the conditional FP-tree contains only one path. Subsequently, the FIs are obtained after forming all the possible combinations of the nodes in the single paths.
It is easy to see that the major cost of FP-growth is constructing the conditional FP-trees. Therefore, FP-growth efficiency is determined by the total number of recursions [18]. A smaller number of recursive calls means that the mining process will finish early and involve constructing fewer conditional trees. However, the number of recursive calls is determined by prefix sharing among patterns and the number of nodes in the tree. When there is more prefix sharing and fewer nodes in the tree, the tree is more likely to have a single path.
In this paper, we propose a novel compact tree structure named the highly compressed FP-tree (HCFP-tree) to increase prefix sharing. The implementation of HCFP-growth that utilizes HCFP-tree for frequent pattern mining has the following advantages: (1) HCFP-growth is faster than the existing algorithms because it effectively reduces the number of recursive calls and generates fewer conditional trees. (2) HCFP-growth uses relatively less memory because the number of nodes in the HCFP-tree is much smaller than that in FP-tree. (3) By reusing the HeaderTable, the number of nodes in HCFP-tree could be further reduced.
The remainder of this paper is organized as follows. Section 2 reviews the related work with respect to HCFP-tree and HCFP-growth. Section 3 employs examples to demonstrate the HCFP-growth algorithm. The experimental results are presented in Section 4. Finally, conclusions are drawn in Section 5.
Many algorithms have been proposed to improve the mining speed of the FP-growth algorithm such as DiffNodesets [19], IFP-growth [20], LP-growth [21], FP-growth+ [22], FDM [23], PrePost+ [24], NSFI [25], and CT-PRO [26].
IFP-growth utilizes a new tree structure named FP-tree+ that adds an address table and does not eliminate the infrequent nodes. Therefore, IFP-growth may increase the cost of tree traversal despite reducing the number of conditional FP-trees in sparse datasets. LP-growth applies a linear structure called Liner Prefix Node that speeds up tree traversal by reducing the number of pointers between nodes. However, LP-growth can perform poorly when the average number of child nodes exceeds 3. FP-growth+ mines frequent patterns by recursively traversing the global FP-tree; therefore, its mining efficiency is low when the global FP-tree has a large number of nodes. FDM switches between the FP-tree and DIFFset-list mining techniques according to the conditional pattern base.
Although the above algorithms can improve the mining speed, they do not efficiently decrease the total number of recursive calls and the number of nodes in the prefix-tree.
Other nonrecursive algorithms, such as CT-PRO, BISC [18], nonordfp [27], NSFI and PrePost
In CT-PRO, the authors suggested a compressed FP-tree named CFP-tree. Each node of a CFP-tree has a count array in which each array entry registers the count of transactions matched with the item subset in the path from the root of subtree to that node. Therefore, a CFP-tree maintains more prefix sharing than does an FP-tree. A CFP-tree is formed by the leftmost branch of an FP-tree and has fewer nodes than an FP-tree.
CFP-tree for the dataset shown in Table 1.
The CT-PRO algorithm applies a CFP-tree to mine FIs without recursive calls; however, it requires large amounts of memory. Meanwhile, to mine longer FIs, CT-PRO requires a procedure called RecMine. When the paths contain many infrequent items, the cost of RecMine may cause CT-PROs performance to be poorer than that of FP-growth. Thus, it is necessary to improve the performance of CT-PRO.
As shown in Fig. 2, we found a cut in the CFP-tree. The branches on the right-hand side of the cut can be compressed further, while those on the left-hand side can form a new compressed tree-based data structure, which we refer to as HCFP-tree. The HCFP-tree structure and the HCFP-growth algorithm, which applies HCFP-trees to frequent patterns mining, are described in the next section.
Our FI mining method involves two main steps. First, HCFP-growth constructs an HCFP-tree by scanning the dataset. Then, it extracts FIs by recursively traversing the HCFP-tree and constructing conditional HCFP-trees. This section first defines several key FI-mining terms before introducing HCFP-tree and describing the HCFP-growth algorithm in detail.
HCFP-tree: A highly compressed FP-tree structure
Let
An item subset
where 0
First, to make the generation of conditional transactions from HCPT-Tree more convenient, each item in each transaction is mapped to its corresponding item index.
As mentioned in Section 2, HCFP-tree aims to maximize prefix sharing and is defined as follows.
It comprises a HeaderTable and the body of the tree. The tree’s root represents the index of the item with the highest support. Each tree node is a compressed prefix node (CPN) that stores the index. Each CPN comprises an item index, path count, a pointer to the next node with the same item index, pointers to the parent CPN and CPN’s next sibling, a child pointer, and a pointer to the path count node (PCN).
Next, to compress the right branches in the CFP-tree, we introduce the concept of the first-start item index (FSI).
Transactions can thus be divided into two types, where the first type has FSIs equal to the index of their first item, while this is not true for the second type. Each CPN uses PCN to increase prefix sharing.
Compared to a CFP-tree, an HCFP-tree has the following major differences:
The CFP-PRO algorithm utilizes the CFP-tree to mine FIs nonrecursively, while HCFP-tree is proposed to mine FIs recursively. Each node in a CFP-tree boasts a count array in which many entries may be empty. In contrast, HCFP-tree creates PCNs for each node independently, and connects them by pointers; therefore, an HCFP-tree requires less memory than a CFP-tree. The count array of nodes in CFP-tree counts only one type of transactions, while the PCN of a node in HCFP-tree can count two types of transactions. Therefore, the HCFP-tree can improve prefix sharing compared to the CFP-tree.
First, the algorithm scans the dataset
Then,
The HCFP-tree construction process is depicted in the following example.
An HCFP-tree state: inserting the first transaction.
An HCFP-tree state: inserting the second transaction.
An HCFP-tree state: inserting all transactions.
Next, the index set of the second transaction is inserted, where the mapped index set is {2, 4}. Because the FSI of the second index set is 2, the path count of CPN (2:1) is increased by 1 and its corresponding PCN is updated to {2, 1, 1, null}. Subsequently, index ‘4’ is inserted into the tree. However, there is no index ‘4’ in the child node of CPN (2:2); therefore, a new CPN (4:1) is created whose new PCN is {2, 1, 0, null}, as shown in Fig. 4. Then, the remaining index sets can be inserted into HCFP-tree. Figure 5 shows the HCFP-tree after inserting all the transactions.
Because both the HCFP-tree and FP-tree employ an ordered tree structure and a recursive mining FI process, we compare HCFP-tree and FP-tree regarding the following three steps: node generation, transaction insertion, and tree traversal.
In HCFP-tree, each newly generated CPN can be seen (approximately) as a combination of a new FP-Tree node and PCN.
Rationale: Each node in the FP-tree consists of 2 integer domains and 4 pointers: item-name, count, node-link, parent pointer, child pointer, and sibling pointer. An HCFP-tree PCN consists of 3 integer values and 1 pointer: level-id, firstcount, secondcount, and next-link. The integer values can be accessed directly. However, we must confirm whether the pointer has been stored before accessing it. Therefore, accessing an integer value requires only one task, while accessing a pointer requires two tasks. In the FP-tree, generating a new node requires ten tasks, and in the HCFP-tree, creating a new PCN requires five tasks. Assuming that
Rationale: Let
For HCFP-tree, approximately
The relation
And the solution is
Therefore, in the transaction insertion phase, HCFP-tree achieves better performance than FP-tree if the CEN is greater than 50%.
Rationale: Given item
In an HCFP-tree, when all branches with item
The relation,
and the solution is
Property 3 is important, because it shows that HCFP-tree is more efficient than FP-tree when the CEN is greater than 50%.
To further reduce the number of nodes in an HCFP-tree, we propose a technique of reusing the HeaderTable.
Section 3.2 demonstrated how to construct an HCFP-tree, which first constructs the Header Table and the leftmost branch of the HCFP-tree. Because both the HeaderTable and the leftmost branch of the HCFP-tree contain all the frequent items, the HeaderTable can be reused to take the place of the leftmost branch of the HCFP-tree.
HCFP-tree construction algorithm.
The procedure of InsertHCFPTree.
To accomplish this task, two fields are added to each entry in HeaderTable: path count and a pointer to a PCN. When a transaction is inserted in to the tree, HCFP-growth performs item insertion starting from HeaderTable [FSI], and then, HeaderTable [FSI] becomes the current node. If there a HeaderTable node exists that is adjacent to the current node and matches the next index of the transaction, the current node is set to that node. Otherwise, the algorithm confirms whether the current node has one child node whose index matches the inserted index. When such a child node exists, the algorithm regards that child node as the current node. Otherwise, the algorithm creates a new child node.
The procedure of UpdatePCN.
An HCFP-tree reusing the HeaderTable.
Figures 6–8 show the pseudocode for constructing the HCFP-tree.
Figure 9 shows a HCFP-tree that reuses the HeaderTable for the database shown in Table 1. The number of nodes in the HCFP-tree is reduced by 4.
In Section 4.3, the experimental results show that the technique of reusing the HeaderTable achieves outstanding performance for sparse datasets (such as Retail) that contain a large number of items.
After constructing the HCFP-Tree, HCFP-Growth starts mining FIs from it.
HCFP-growth traverses an HCFP-tree in a bottom-up strategy and creates conditional HCFP-trees recursively. Initially, the presented algorithm starts with the least frequent item index in the HeaderTable and then searches the nodes based on that item index’s node links. The algorithm visits the nodes in the path from each linked node to a root by following parent pointer and counts their support values. HCFP-growth terminates the path-searching operation when the parent node is the root; these item indexes of the nodes in the path form a transaction substring. Assume that the number of items indexed in the path is
If a single path cannot be derived from a node
Otherwise, when
According to these conditional transactions, the algorithm creates a conditional pattern base. Subsequently, each conditional transaction is sorted in descending order of support and infrequent items are eliminated. Then, based on the sorted conditional pattern base, the algorithm generates a new conditional HCFP-tree. HCFP-growth will construct conditional HCFP-trees recursively until each node in HeaderTable of the conditional HCFP-tree derives a single path.
When
For any combination of the item indexes,
All the item index combinations are mapped to patterns. Finally, the set of FIs is generated by combining the patterns and the prefix itemset.
The conditional HCFP-tree of index 4.
For node ‘3’ in Fig. 9, the algorithm derives an FI (c: 8) and a single path {3, 2, 1}. Thus, its PCNs can be accessed and its set of FIs can be generated: {c b a: 2}, {c b: 4}, and {c a: 4}. Similarly, nodes ‘2’ and ‘1’ drive their sets of FIs: {b a: 4}, {b: 8}, and {a: 8}.
The following Property 4 shows why HCFP-growth is advantageous.
Rationale: Both FP-growth and HCFP-growth algorithm use a recursive mining method that continues until the conditional tree contains only one path. HCFP-growth utilizes a highly compressed tree structure. For the same transaction dataset, HCFP-tree supports more prefix sharing and generates fewer nodes does than a corresponding FP-tree. Therefore, HCFP-tree forms a single-path earlier than FP-tree. Thus, the number of recursions in HCFP-growth is smaller than the number of recursions in FP-growth.
Based on the above analysis, the proposed HCFP-growth algorithm is shown in Fig. 11. HCFP-growth selects each item index with
HCFP-growth algorithm.
To evaluate HCFP-growth, we compare it with other state-of-the-art algorithms based on three criteria: running time, memory consumption, and scalability. Experiments are also performed to observe the number of recursive calls and nodes in the conditional trees to demonstrate the compression efficiency of HCFP-tree.
Experimental environment and datasets
All the experiments are executed on a computer with a 2.16 GHz Intel processor with 2 GB of memory and a Windows 7 OS. In this study, HCFP-growth is compared with FP-growth, CT-PRO, LP-growth, NSFI, and MAFIA [28]. All these algorithms were implemented in C++ and compiled with Microsoft Visual Studio 2010. Several real datasets were employed to evaluate the performance of the algorithms; the characteristics of these datasets are presented in Table 2. Pumsb is provided by
Real datasets
Real datasets
Datasets for scalability test
Table 3 presents the parameters of synthetic datasets for scalability tests. These datasets are generated by the IBM data generator acquired from
When CEN is greater than 50%, HCFP-tree executes faster than does FP-growth, as discussed in Section 3.2. HCFP-growth requires fewer recursive calls and conditional HCFP-trees than does FP-growth, according to Property 4. To support these analyses, experiments are performed to compare the number of nodes required by HCFP-growth and FP-growth. Figure 12 shows the experimental results of the number of nodes resulting from six datasets. These results show that HCFP-growth requires a smaller number of nodes for all the datasets compared to the number of nodes required by FP-growth.
Notably, the CEN is the highest in the Connect dataset, while the CEN is the lowest in Retail. When minsup is 60% in Connect, HCFP-growth requires only 40.36% of the number of nodes required by FP-growth, and the CEN is 59.64%. For the Retail dataset, the number of nodes required by HCFP-growth is still 13.1% less than that required by FP-growth. Next, experiments are performed to compare the number of recursive calls made in HCFP-growth and FP-growth.
The number of nodes in HCFP-growth and FP-growth on various datasets at various support thresholds.
The number of recursive calls in HCFP-growth compared to FP-growth on various datasets at various support threshold.
Figure 13 shows results of the number of recursive calls for the various datasets. A conditional tree is constructed during a recursive call. From the experimental results, the number of conditional trees created by HCFP-growth for all the datasets is smaller than the number of trees created by FP-growth. This is especially apparent when minsup is 60% and HCFP-growth creates 33,923 trees for the Connect dataset – only 27.22% of the number of trees created by FP-growth. Even for Retail, the number of conditional trees created by HCFP-growth is still 21% smaller than the number created by FP-growth.
From these results, it can be concluded that HCFP-growth reduces the cost of conditional tree construction, and outperforms FP-growth in most cases even when the CEN is less than 50%.
Figure 14 shows the runtimes with various minsup values on the real datasets in Table 2. To improve the mining performance, MAFIA utilizes a vertical bitmap that may increase both runtime and memory consumption. Therefore, MAFIA performs the worst in most cases, especially on sparse datasets such as Retail (in Fig. 14b, the MAFIA result is not included because the runtime is too high to plot).
CT-PRO achieves outstanding runtime performance when minsup is low due to its compressed FP-tree data structure. Nevertheless, the CT-PRO algorithm requires large amounts of memory when processing all the datasets, and its runtime is longer than that of HCFP-growth when minsup is large.
Particularly, in Fig. 14c and d when the threshold was greater than 70% for the Connect, 65% for the Pumsb and 10% for the Mushroom, the CT-PRO algorithm’s performance is only slightly better than that of the MAFIA algorithm.
Runtime test.
FP-growth’s mining time exceeds that of LP-growth on all the datasets except Retail. For example, in Pumsb, when minsup is 60%, the execution time of the FP-growth algorithm is 485 seconds. However, for the same minsup, LP-growth requires only 475 seconds. LP-growth’s performance improvement occurs because it uses an LP-tree, which reduces the number of pointers between nodes.
However, LP-tree may degenerate into FP-tree when sparse datasets have large amounts of items. For example, Retail has 16,469 items; thus, LP-growth exhibits poor performance on Retail, as shown in Fig. 14b.
NSFI recursively executed the N-list intersection procedure instead of constructing conditional trees. Therefore, NSFI shows excellent runtime performance on the dense datasets. However, when there are many items but little prefix sharing in the sparse datasets, NSFI’s efficiency is poor because it must construct many N-lists.
Overall, HCFP-growth achieves the best performance on almost all the datasets. The reason is that when minsup is below 90% for Connect, the CEN is greater than 50%. Although when minsup is less than 80% for Pumsb, 90% for Chess, 70% for Accidents, and 30% for Mushroom, and the CENs are not greater than 50%, HCFP-growth still executes faster than the others because it reduces the cost of creating conditional trees. Notably, the CEN on Retail is only 13.1%. However, Retail has many items; consequently, the presented algorithm maintains good performance on Retail by reusing the HeaderTable.
In this section, we study the memory consumption of the compared algorithms on the datasets listed in Table 2.
Memory tests.
CT-PRO and MAFIA consume more memory than the other algorithms in almost all cases. As shown in Fig. 15, when minsup is 90% for Connect, 30% for Mushroom, and 80% for Accidents, CT-PRO has relatively good memory efficiency. However, as minsup increases, CT-PRO’s memory consumption increases dramatically. In contrast, MAFIA’s memory usage remains constant regardless of the minsup value. Notably, in Fig. 15f, with a threshold below 40% for the Accidents dataset, MAFIA boasts outstanding memory efficiency.
In addition to the prefix tree, NSFI must also construct N-lists; therefore, it consumes more memory, especially on sparse datasets. LP-growth consumes relatively less memory although its memory usage is still higher than that of HCFP-growth. FP-growth’s memory consumption is similar to that of our algorithm in almost all cases.
Even though HCFP-growth does not always use smallest amount of memory, as shown in Fig. 15b and f, it still uses less memory in these cases than do some of the existing algorithms, such as NSFI and CT-PRO. In addition, the proposed algorithm achieves the best experimental results regarding memory in many cases. In particular, in Fig. 15a and c–e, HCFP-growth outperforms the compared algorithms on the Connect, Mushroom, Pumsb and Chess datasets in all cases.
Scalability test.
Figure 16 shows the results of the scalability experiments by varying the number of items and transactions in the synthetic datasets listed in Table 3. Notably, MAFIA and CT-PRO perform the worst, because they have the highest memory requirements.
FP-growth exhibits outstanding scalability in these experiments as the number of transactions increases, but its performance falls as the numbers of items increase, as shown in Fig. 16c and d.
Based on its linear structure, LP-growth enjoys better memory scalability performance than FP-growth; through its runtime scalability does not. NSFI shows better scalability than MAFIA, FP-growth and CT-PRO, but its memory consumption increases faster than HCFP-growth. The presented algorithm boasts the best runtime performances in the scalability experiments as shown in Fig. 16a and c. As the number of items and transactions increase, the runtime of HCFP-growth increases slightly due to its highly compressed tree structure. Consequently, as the number of items and transactions increase, there is a slight increase in the number of nodes in HCFP-tree. HCFP-growth does not show the best memory scalability, however, its memory scalability is still better than that of MAFIA, NSFI and CT-PRO, as shown in Fig. 16d. In summary, HCFP-growth exhibits the best runtime and relatively better memory scalability compared to the other tested algorithms as the numbers of items and transactions increase.
Conclusion
In this study, a novel highly compressed tree structure (HCFP-tree) and an efficient FI-mining algorithm (HCFP-growth) are proposed. By using a new data structure called PCN, an HCFP-tree supports more prefix sharing and considerably reduces the number of nodes in the tree. The major advantage of the HCFP-growth is that it reduces the number of recursive calls by adopting the highly compressed HCFP-tree structure. Therefore, HCFP-growth markedly improves the FI-mining performance because it generates fewer conditional trees compared to FP-growth. The experimental results show that HCFP-growth executes faster than do the other algorithms and that its memory consumption is the smallest in many cases and is comparable to those of existing algorithms in other cases. The methods and strategies presented in this paper can be extended to other types of patterns mining such as weighted pattern mining, sequential pattern mining, subgraph mining, and so on. Further study along these directions could provide significant academic value.
Footnotes
Acknowledgments
This research was supported in part by the National Science and Technology Major Project of the Ministry of Science and Technology of China under grant 2018ZX10715003, the National Key R&D Program of China under grant 2017YFC1703900, the Sichuan Science and Technology Program under grants 2018GZ0192 and 2018SZ0065, and the National Natural Science Foundation of China (NSFC) under grant 81803851.
