Accelerate tree ensemble learning based on adaptive sampling

Abstract

Gradient Boosting Decision Tree (GBDT) has been used extensively in machine learning applications because of its efficiency, accuracy and interpretability. Although excellent and popular open source implementations such as XGBoost and LightGBM already exist, large data sizes still make scalable and efficient learning very difficult. Since sampling is an efficient technique for alleviating the performance issues of massive data analysis, we exploit sampling techniques to address this problem. In this paper, we propose the AdaGBDT approach, which applies an adaptive sampling method based on Massart’s inequality to build the GBDT model and draws samples in an online manner without a manually specified sample size. AdaGBDT is implemented by integrating the adaptive sampling method into LightGBM. The experimental results show that AdaGBDT keeps the sample size small and trains faster than LightGBM, while remaining subject to the constraints on estimation accuracy and confidence.

Keywords

Gradient boosting decision tree, adaptive sampling, scalable learning, Massart’s inequality

1. Introduction

Gradient Boosting Decision Tree (GBDT) [1] is one of the most popular algorithms in machine learning and is a state-of-the-art solution for many machine learning tasks. However, the increasing volume of data tends to make conventional GBDT implementations inefficient, because constructing the tree learner at each gradient boosting round requires scanning the entire training data for each feature to find the optimal split points. Many optimizations have been proposed in real application scenarios to tackle this issue and have achieved good performance. The two most representative GBDT implementations are XGBoost [2] and LightGBM [3]. Both implement the histogram-based algorithm [4, 5, 6] instead of the inefficient pre-sorted method to split data instances. Moreover, LightGBM optimizes the histogram-based algorithm by ignoring zero feature values, which reduces the cost of histogram building for a feature from O(#data) to O(#non_zero_data) [3]. However, this method is still subject to the delicate trade-off between accuracy and efficiency.

To cope with the challenge posed by the data size, one possible approach is to take a random sample of the data instances, since an approximate response is acceptable for many machine learning tasks. Sampling is a critical technique widely used in statistical analysis and computer science. In machine learning, researchers often use sampling to estimate the accuracy of a classifier or to extract a random subset of the original data in order to reduce the number of data instances. Many off-the-shelf sampling methods are weight-based and are designed for AdaBoost [7]. In order to apply sampling techniques to GBDT, the work in [3] proposed a novel technique, GOSS, which selects samples according to their gradients, since the gradients implicitly indicate the importance of data instances. The experimental results of LightGBM show that GOSS can bring a nearly 2x speed-up by using only 10%–20% of the data, and that it achieves better accuracy than Stochastic Gradient Boosting (SGB) [8] at a comparable sampling ratio. GOSS needs two parameters, top_rate and other_rate, as inputs to determine how many data instances with large gradients should be retained and how many data instances should be sampled from the rest. Choosing these values is often difficult since the appropriate sample size is typically unknown.

Adaptive sampling is a promising solution to this problem: it draws examples sequentially in an online manner without a pre-specified sample size. Compared with static sampling methods, in which the sample size is determined in advance, an adaptive sampling scheme decides when to stop drawing examples according to the random samples seen so far. This means the sample size can be determined dynamically during the training procedure. There are already many studies in this area. Chen and Xu [9] conducted extensive research on adaptive boosting algorithms and proposed an efficient adaptive sampling technique for binary classification tasks, which uses a significantly smaller sample size while maintaining competitive accuracy and confidence compared with most existing methods.

In this paper, we exploit sampling techniques to address the problem that large data sizes make scalable and efficient GBDT learning very difficult. The major novelty and technical contributions can be summarized as follows. We propose the AdaGBDT approach, which applies an adaptive sampling method based on Massart’s inequality to build the GBDT model and draws samples in an online manner without a manually specified sample size. AdaGBDT is implemented by integrating the adaptive sampling method into the popular GBDT implementation LightGBM. To achieve better usability than other works, AdaGBDT provides users with an accuracy parameter $\varepsilon$ and a confidence parameter $\delta$ instead of a sampling ratio, which significantly reduces the difficulty of parameter tuning. The remainder of this paper is organized as follows. Section 2 presents a brief review and preliminary knowledge. AdaGBDT is detailed in Section 3. Section 4 reports the experimental results, and concluding remarks are given in Section 5.

2. Background

In this section, we briefly describe related work on sampling scheme design and existing work on adaptive sampling techniques.

2.1 Related work

Currently, most existing GBDT toolkits, such as XGBoost, pGBRT and scikit-learn, provide a simple random sampling method. Such a static sampling method chooses a random subset of samples to train the base learner at each iteration. LightGBM proposes a new gradient-based sampling technique, GOSS, to obtain sample subsets, which achieves a good balance between reducing the number of data instances and keeping acceptable accuracy of the decision tree. As shown in [3], LightGBM outperforms other GBDT implementations. Therefore, we combine adaptive sampling with LightGBM for better performance and use LightGBM as the baseline in our experiments.

GOSS sorts the gradient values of the samples in descending order in each iteration, and then uses the parameters top_rate and other_rate to control the proportion of instances with different gradients that need to be retained. However, GOSS requires the user to specify the sampling rate, which is often difficult: users do not know how many samples are enough to ensure the estimation accuracy and confidence. In this context, the well-known Chernoff-Hoeffding bounds are commonly used to determine the sample size for data mining problems [12, 13, 14, 15]. They can be used in both a static and a dynamic way. In conventional batch sampling, the sample size is pre-specified and is unnecessarily large, so that the method works in every possible situation while keeping reasonably good accuracy and confidence. As a result, the sample size is overestimated in most cases of static sampling.

Adaptive sampling is used instead of traditional static sampling because of its scalability: its stopping criterion is adjusted dynamically according to the random samples seen so far. Moreover, the studies in [16, 17] show that adaptive sampling needs a smaller sample size than conventional batch sampling, which always uses the worst-case sample size even when the current hypothesis is already good enough.

Watanabe proposed the adaptive sampling technique MadaBoost [17, 18, 19, 20] and applied it to boosting algorithms. These works show that adaptive sampling techniques can substantially improve the performance and scalability of ensemble learning. The adaptive sampling method proposed in [16] is based on the Chernoff bound, which makes the bounds tighter than those of the methods in [21, 22]. Furthermore, Jianhua Chen et al. developed an efficient ensemble learning method by combining Massart’s inequality with the techniques in [23, 24]. It significantly reduces the sample size while keeping accuracy comparable to that of the work in [21]. To achieve better efficiency and scalability for GBDT, we apply the method proposed in [21] to an off-the-shelf implementation of GBDT.

In this paper, we apply this dynamic sampling technique to GBDT to speed up the training process on large-scale data sets. We use the adaptive sampling technique based on Massart’s inequality [25] to optimize LightGBM. In the next subsection, we describe the problem definition that underlies most theoretical studies of sampling scheme design.

2.2 Problem definition

The foundational problem in most sampling scheme design is how to estimate the probability $p=\text{Pr}(\text{A})$ of a random event A. Equivalently, we draw i.i.d. (independent and identically distributed) samples ${\text{X}}_{i}$ of a Bernoulli variable X and estimate the probability $p=\text{Pr}\{X=1\}$ . Let $\widehat{{\bm{p}}_{\bm{n}}}=\frac{\sum^{\bm{n}}_{i=1}{X_{i}}}{\bm{n}}$ be an estimator for $p$ , where $\bm{n}$ is the number of samples when the stopping rule is satisfied. What we want to know is how large $\bm{n}$ must be at least. Chernoff-Hoeffding bounds [10, 11] are used extensively in statistical sampling and computer science to design sampling schemes. A simple bound, which follows from Chebyshev’s inequality, states that for $\varepsilon,\delta\in(0,\ 1)$ the coverage probability $\text{Pr}\{|\widehat{{\bm{p}}_{\bm{n}}}-p|<\varepsilon\}$ is greater than $1-\delta$ for any $p\in(0,\ 1)$ if $\bm{n}>\frac{1}{4{\varepsilon}^{2}\delta}$ . The Chernoff-Hoeffding bound tightens this significantly for small $\delta$ : it asserts that the same guarantee holds provided that $\bm{n}>\frac{\text{ln}\frac{2}{\delta}}{2{\varepsilon}^{2}}$ . Here $\varepsilon$ is the margin of absolute error and $\delta$ is the confidence parameter, so that $1-\delta$ is the confidence level. To achieve a small sample size with comparable accuracy and confidence, many works use such sampling schemes for better efficiency and scalability, and quite a few research works in ensemble learning have been proposed recently [17, 18, 19, 20, 21, 22, 23, 24]. These works take the following two problems into account.
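To give a feel for the magnitudes involved, the two static sample sizes quoted above can be evaluated numerically. The following is a minimal sketch; the function names are hypothetical and chosen only for this illustration.

```python
import math


def log_bound_sample_size(eps: float, delta: float) -> int:
    """Smallest integer n with n > ln(2/delta) / (2 * eps^2)."""
    return math.floor(math.log(2.0 / delta) / (2.0 * eps ** 2)) + 1


def inverse_delta_sample_size(eps: float, delta: float) -> int:
    """Smallest integer n with n > 1 / (4 * eps^2 * delta)."""
    return math.floor(1.0 / (4.0 * eps ** 2 * delta)) + 1


if __name__ == "__main__":
    # Both sizes grow like 1/eps^2; they differ in how they depend on delta.
    for eps in (0.1, 0.05, 0.01):
        print(eps,
              log_bound_sample_size(eps, 0.05),
              inverse_delta_sample_size(eps, 0.05))
```

Neither value depends on $p$ or on the data, which is exactly why static schemes tend to draw more samples than necessary.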

2.2.1 Problem 1 – Control of absolute error

Design an adaptive sampling scheme, that is, a stopping criterion, such that for any pre-specified $\varepsilon\in(0,\ 1),\ \delta\in(0,\ 1)$ and any $p\in(0,\ 1)$ , the relative frequency $\widehat{{\bm{p}}_{\bm{n}}}$ at the termination of the sampling process guarantees:

$\text{Pr}\{|\widehat{{\bm{p}}_{\bm{n}}}-p|\geqslant\varepsilon\}\leqslant\delta$

Here $\bm{n}$ is the number of samples seen when the termination condition is satisfied, and $\varepsilon$ and $\delta$ are the a priori margin of absolute error and confidence parameter, respectively.

2.2.2 Problem 2 – Control of relative error

Design an adaptive sampling scheme, that is, a stopping criterion, such that for any pre-specified $\varepsilon\in(0,\ 1),\ \delta\in(0,\ 1)$ and any $p\in(0,\ 1)$ , the relative frequency $\widehat{{\bm{p}}_{\bm{n}}}$ at the termination of the sampling process guarantees:

$\text{Pr}\left\{\left|\frac{\widehat{{\bm{p}}_{\bm{n}}}-p}{p}\right|\geqslant\varepsilon\right\}\leqslant\delta$

Here $\bm{n}$ is the number of samples seen when the termination condition is satisfied, and $\varepsilon$ and $\delta$ are the a priori margin of relative error and confidence parameter, respectively.

It can be seen from the above definitions that a bound on the sample size can be worked out once the error margin, the confidence level and the samples seen so far are given.

3. AdaGBDT

In this section, we will present our AdaGBDT approach and detail its adaptive sampling technique [23] and implementation.

3.1 Sampling adaptively using the Massart inequality

The authors of [23] use the function $U(z,\theta)$ designed in [24] to study the sampling scheme:

$\displaystyle U\left(z,\theta\right)=\left\{\begin{array}{ll}\frac{9}{2}\frac{{(z-\theta)}^{2}}{(z+2\theta)(z+2\theta-3)}&z\in\left[0,\ 1\right],\ \theta\in(0,\ 1)\\ -\infty&z\notin\left[0,\ 1\right],\ \theta\in(0,\ 1)\end{array}\right.$

A simplified version of the Massart stopping rule in [16], combined with the function $U$ , forms the basis of the stopping condition:

Lemma 1: Let $p=E[X]$ be the expected value of the Bernoulli variable $X$ . Let $\widehat{{\bm{p}}_{\bm{n}}}$ be the relative frequency of successes in $\bm{n}$ Bernoulli trials. For any $0<z\leqslant p$ , we have $\text{Pr}\{\widehat{{\bm{p}}_{\bm{n}}}<z\mid p\}<e^{\bm{n}U(z,p)}$ . For any $p<z\leqslant 1$ , we have $\text{Pr}\{\widehat{{\bm{p}}_{\bm{n}}}>z\mid p\}<e^{\bm{n}U(z,p)}$ .
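The lemma can be checked numerically for a single setting. The sketch below assumes the Massart form $U(z,\theta)=\frac{9}{2}\frac{(z-\theta)^{2}}{(z+2\theta)(z+2\theta-3)}$ and compares the bound with the exact binomial upper tail; the helper names and the values of n, p and z are illustrative only.

```python
import math


def massart_u(z: float, theta: float) -> float:
    """U(z, theta) in the assumed Massart form; -inf when z is outside [0, 1]."""
    if not 0.0 <= z <= 1.0:
        return float("-inf")
    return 4.5 * (z - theta) ** 2 / ((z + 2.0 * theta) * (z + 2.0 * theta - 3.0))


def binomial_upper_tail(n: int, p: float, z: float) -> float:
    """Exact Pr{p_hat > z}, where p_hat = Binomial(n, p) / n."""
    return sum(math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(n + 1) if k / n > z)


if __name__ == "__main__":
    n, p, z = 200, 0.3, 0.4                      # illustrative values with p < z
    exact = binomial_upper_tail(n, p, z)
    bound = math.exp(n * massart_u(z, p))        # right-hand side of Lemma 1
    print(f"exact tail = {exact:.3e}, bound = {bound:.3e}")
```

Because $U$ is negative whenever $z\neq\theta$ , the bound decays exponentially in the number of trials, which is what makes a data-dependent stopping rule possible.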

They conduct a preliminary theoretical analysis of their method for controlling absolute and relative error. For any pre-specified $0<\varepsilon\leqslant 1$ , $0<\delta\leqslant 1$ , the sampling algorithm ${\textit{ABS}}_{M}$ , which controls the absolute error, keeps drawing samples until the following stopping condition is satisfied:

$\displaystyle\bm{n}\geqslant\frac{2\,\text{ln}\frac{2}{\delta}}{{\varepsilon}^{2}}\left[\frac{1}{4}-{\left(\left|\widehat{{\bm{p}}_{\bm{n}}}-\frac{1}{2}\right|-\frac{2}{3}\varepsilon\right)}^{2}\right]$ (1)
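As a consistency check, Eq. (1) can be recovered from Lemma 1 under two assumptions that are not spelled out in the text: that $U$ has the Massart form $U(z,\theta)=\frac{9}{2}\frac{(z-\theta)^{2}}{(z+2\theta)(z+2\theta-3)}$ , and that sampling stops once both tail bounds $e^{\bm{n}U(z,z-\varepsilon)}$ and $e^{\bm{n}U(z,z+\varepsilon)}$ are at most $\delta/2$ , where $z$ abbreviates $\widehat{{\bm{p}}_{\bm{n}}}$ . Writing $w_{\mp}=z\mp\frac{2}{3}\varepsilon$ , we have $z+2(z\mp\varepsilon)=3w_{\mp}$ , so that

$\displaystyle U(z,z\mp\varepsilon)=\frac{9}{2}\frac{{\varepsilon}^{2}}{3w_{\mp}\left(3w_{\mp}-3\right)}=-\frac{{\varepsilon}^{2}}{2w_{\mp}\left(1-w_{\mp}\right)}$

For $0<w_{\mp}<1$ this value is negative, so $\bm{n}U(z,z\mp\varepsilon)\leqslant\text{ln}\frac{\delta}{2}$ is equivalent to

$\displaystyle\bm{n}\geqslant\frac{2\,\text{ln}\frac{2}{\delta}}{{\varepsilon}^{2}}w_{\mp}\left(1-w_{\mp}\right)=\frac{2\,\text{ln}\frac{2}{\delta}}{{\varepsilon}^{2}}\left[\frac{1}{4}-{\left(z-\frac{1}{2}\mp\frac{2}{3}\varepsilon\right)}^{2}\right]$

The more demanding of the two requirements is the one with the smaller square, which is ${\left(\left|z-\frac{1}{2}\right|-\frac{2}{3}\varepsilon\right)}^{2}$ , and this gives the right-hand side of Eq. (1).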

Their theorem shows that the sample size is bounded from above by

$\displaystyle n_{0}=\text{max}\left\{\left\lceil\frac{\text{ln}\frac{\delta}{2}}{U\left(p+\varepsilon,p+2\varepsilon\right)}\right\rceil,\left\lceil\frac{\text{ln}\frac{\delta}{2}}{U\left(p-\varepsilon,p-2\varepsilon\right)}\right\rceil\right\}$

Specifically, if the true probability $p$ to be estimated satisfies $p\leqslant\frac{1}{2}-2\varepsilon$ , the sampling scheme with the criterion in Eq. (1) stops with a number of samples $\bm{n}\leqslant n_{0}$ and produces an estimate that satisfies $\widehat{{\bm{p}}_{\bm{n}}}\leqslant p+\varepsilon$ . The theorem remains valid for $p\geqslant\frac{1}{2}+2\varepsilon$ , in which case the estimate satisfies $\widehat{{\bm{p}}_{\bm{n}}}\geqslant p-\varepsilon$ .
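The absolute-error scheme can be simulated in a few lines. This is a minimal sketch of a sampler that stops according to Eq. (1), not the implementation of [23]; the function names, the Bernoulli source and the parameter values are assumptions made for the illustration.

```python
import math
import random


def abs_stop(n: int, p_hat: float, eps: float, delta: float) -> bool:
    """Stopping condition of Eq. (1) for control of the absolute error."""
    bracket = 0.25 - (abs(p_hat - 0.5) - 2.0 * eps / 3.0) ** 2
    return n >= 2.0 * math.log(2.0 / delta) / eps ** 2 * bracket


def sample_abs(p: float, eps: float, delta: float, rng: random.Random):
    """Draw Bernoulli(p) samples one at a time until Eq. (1) holds.

    Returns (p_hat, n). The loop always ends, because the bracket in Eq. (1)
    is at most 1/4, so the threshold never exceeds ln(2/delta) / (2 * eps^2).
    """
    successes, n = 0, 0
    while True:
        successes += int(rng.random() < p)
        n += 1
        if abs_stop(n, successes / n, eps, delta):
            return successes / n, n


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (0.05, 0.2, 0.5):
        p_hat, n = sample_abs(p, eps=0.05, delta=0.05, rng=rng)
        print(f"p = {p}: stopped after n = {n} samples with p_hat = {p_hat:.3f}")
```

The closer the observed frequency is to 0 or 1, the smaller the bracket in Eq. (1) becomes and the earlier the sampler stops; the static worst case is only reached when the frequency stays near 1/2.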

For any pre-specified $0<\varepsilon\leqslant 1$ , $0<\delta\leqslant 1$ , the sampling algorithm ${\textit{REL}}_{M}$ , which controls the relative error, keeps drawing samples until the following stopping condition is satisfied:

$\displaystyle\widehat{{\bm{p}}_{\bm{n}}}>0\ \text{and}\ \bm{n}\geqslant\frac{\text{ln}\frac{\delta}{2}}{U\left(\widehat{{\bm{p}}_{\bm{n}}},\frac{\widehat{{\bm{p}}_{\bm{n}}}}{1+\varepsilon}\right)}$ (2)

In Eqs (1) and (2), $\widehat{{\bm{p}}_{\bm{n}}}$ is the relative frequency of successes and $\bm{n}$ is the number of samples seen so far. Similarly, there is an upper bound $n_{1}=\left\lceil\frac{\text{ln}\frac{\delta}{2}}{U\left(p(1-\varepsilon),\frac{p(1-\varepsilon)}{1+\varepsilon}\right)}\right\rceil$ : if the true probability $p$ to be estimated satisfies $p\leqslant\frac{1}{2}-2\varepsilon$ , the sampling scheme with the criterion in Eq. (2) stops with a number of samples $\bm{n}\leqslant n_{1}$ and produces an estimate that satisfies $\widehat{{\bm{p}}_{\bm{n}}}\geqslant p(1-\varepsilon)$ . The details of the theoretical analysis of the sampling algorithms ${\textit{ABS}}_{M}$ and ${\textit{REL}}_{M}$ can be found in [24].
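The relative-error rule of Eq. (2) can be sketched in the same way. The code assumes the Massart form $U(z,\theta)=\frac{9}{2}\frac{(z-\theta)^{2}}{(z+2\theta)(z+2\theta-3)}$ ; the function names and the `max_draws` safety cap are hypothetical additions for the illustration and are not part of the scheme in [23, 24].

```python
import math
import random


def massart_u(z: float, theta: float) -> float:
    """U(z, theta) in the assumed Massart form; -inf when z is outside [0, 1]."""
    if not 0.0 <= z <= 1.0:
        return float("-inf")
    return 4.5 * (z - theta) ** 2 / ((z + 2.0 * theta) * (z + 2.0 * theta - 3.0))


def rel_stop(n: int, p_hat: float, eps: float, delta: float) -> bool:
    """Stopping condition of Eq. (2) for control of the relative error."""
    if p_hat <= 0.0:
        return False                      # Eq. (2) requires p_hat > 0
    u = massart_u(p_hat, p_hat / (1.0 + eps))
    return n >= math.log(delta / 2.0) / u  # both numerator and u are negative


def sample_rel(p: float, eps: float, delta: float, rng: random.Random,
               max_draws: int = 1_000_000):
    """Draw Bernoulli(p) samples until Eq. (2) holds or max_draws is reached."""
    successes, n = 0, 0
    for n in range(1, max_draws + 1):
        successes += int(rng.random() < p)
        if rel_stop(n, successes / n, eps, delta):
            break
    return successes / n, n


if __name__ == "__main__":
    rng = random.Random(1)
    print(sample_rel(0.3, eps=0.2, delta=0.1, rng=rng))
```

Unlike the absolute-error rule, the threshold here grows without bound as the observed frequency approaches zero, which is why a cap on the number of draws is a sensible precaution in practice.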

The method in [23] defines $U_{h,S}=P_{h,S}-1/2$ as an estimate, based on the samples $S$ , of $U_{h}=P_{h}-1/2$ , and asks whether $U_{h,S}$ and $U_{h}$ are close enough with high confidence. Here $P_{h}$ denotes the accuracy of a hypothesis $h$ and $P_{h,S}$ its accuracy measured on the samples $S$ . This definition stems from the requirement of boosting algorithms that the weak learner selected at each iteration should have accuracy greater than 1/2. Chen et al. therefore restated the problem as follows.

3.1.1 Problem 3

For any pre-specified $\varepsilon\in(0,\ 1),\ \delta\in(0,\ 1)$ , construct an adaptive sampling scheme that guarantees:

$\displaystyle\text{Pr}\left\{\left|{\bm{U}}_{\bm{h},\bm{S}}-U_{h}\right|\geqslant\varepsilon\left|U_{h}\right|\right\}\leqslant\delta$ (3)

Substituting $U_{h,S}=P_{h,S}-\frac{1}{2}$ and $U_{h}=P_{h}-\frac{1}{2}$ , we have:

$\text{Pr}\left\{\left|{\bm{P}}_{\bm{h},\bm{S}}-P_{h}\right|\geqslant\varepsilon\left|P_{h}-1/2\right|\right\}\leqslant\delta$

Replacing $\widehat{{\bm{p}}_{\bm{n}}}$ in Eq. (1) with $P_{h,S}$ and the fixed $\varepsilon$ in Eq. (1) with ${\varepsilon}^{\prime}=\frac{\varepsilon|P_{h,S}-1/2|}{1+\varepsilon}$ , we obtain a new stopping condition for the problem defined above:

$\displaystyle\bm{n}\geqslant\frac{2\,\text{ln}\frac{2}{\delta}}{{\left({\varepsilon}^{\prime}\right)}^{2}}\left[\frac{1}{4}-{\left(\left|P_{h,S}-\frac{1}{2}\right|-\frac{2}{3}{\varepsilon}^{\prime}\right)}^{2}\right]$ (4)

The justification and the proof for testing this condition can be found in [23]. Algorithm 1 describes the detailed procedure of the method. The algorithm runs once in every iteration. Let $S_{lb}$ be a lower bound on the number of samples drawn from the training data set, which makes sure that the number of samples kept is not too small when the sampling process terminates. The overhead of the algorithm is proportional to the size of the hypothesis space and to the number of samples, because all hypotheses have to be evaluated each time a sample is added in order to update $P_{h,S}$ .
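The right-hand side of Eq. (4) is easy to evaluate for a given sample accuracy. The following minimal sketch computes the smallest sample size that satisfies it; the function name is hypothetical, and returning `None` when $P_{h,S}=1/2$ (where ${\varepsilon}^{\prime}=0$ and the threshold is undefined) is a choice made for this illustration only.

```python
import math
from typing import Optional


def eq4_threshold(p_hs: float, eps: float, delta: float) -> Optional[int]:
    """Smallest integer n satisfying Eq. (4), or None when P_hS equals 1/2."""
    eps_prime = eps * abs(p_hs - 0.5) / (1.0 + eps)
    if eps_prime == 0.0:
        return None                       # Eq. (4) has no finite threshold here
    bracket = 0.25 - (abs(p_hs - 0.5) - 2.0 * eps_prime / 3.0) ** 2
    rhs = 2.0 * math.log(2.0 / delta) / eps_prime ** 2 * bracket
    return max(1, math.ceil(rhs))


if __name__ == "__main__":
    # The required sample size grows quickly as the accuracy approaches 1/2.
    for p_hs in (0.9, 0.7, 0.6, 0.55):
        print(p_hs, eq4_threshold(p_hs, eps=0.5, delta=0.05))
```

A hypothesis that is clearly better than chance therefore needs few samples to be confirmed, while a marginal one pushes the scheme towards much larger samples.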

3.2 AdaGBDT: LightGBM with adaptive sampling

To improve the performance of LightGBM on huge amounts of data, we implement AdaGBDT, which integrates the adaptive sampling algorithm introduced in the previous section into LightGBM, with a few modifications to Algorithm 1, the hypothesis selection algorithm based on the Massart inequality. We set the initial value of $P_{h,S}$ to 0.5 to calculate the sample size in the first round. The sample size of subsequent iterations is calculated from the accuracy of the previous iteration. The algorithm with adaptive sampling is shown in Algorithm 2.

Algorithm 1: HS$_{M}$
$S\leftarrow\{\}$
$\bm{n}\leftarrow 0$
Done $\leftarrow$ false
While $\bm{n}\leqslant S_{lb}$ OR Done $=$ false Do
begin
	Draw a random training example $x$
	$S\leftarrow S\cup\{x\}$
	$\bm{n}\leftarrow\bm{n}+1$
	Compute $P_{h,S}$ and $U_{h,S}=P_{h,S}-1/2$ for each $h\in H$
	If $\bm{n}>S_{lb}$ and $\bm{n}$ satisfies Eq. (4) and the classifier $h\in H$ with maximal $U_{h,S}$ has $U_{h,S}>0$
	Then Done $\leftarrow$ true
end
Output $h$ , $P_{h,S}$ and $\bm{n}$
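Algorithm 1 can be turned into a short runnable sketch. The version below is an illustration under several assumptions that are not in the text: examples are drawn with replacement, the hypothesis space is a small set of threshold classifiers on one feature, Eq. (4) is tested on the accuracy of the currently best hypothesis, and a `max_draws` cap guards against endless sampling. All names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]            # (feature value, binary label)
Hypothesis = Callable[[float], int]    # maps a feature value to a label


def eq4_satisfied(n: int, p_hs: float, eps: float, delta: float) -> bool:
    """Eq. (4) with eps' = eps * |P_hS - 1/2| / (1 + eps)."""
    eps_prime = eps * abs(p_hs - 0.5) / (1.0 + eps)
    if eps_prime == 0.0:
        return False
    bracket = 0.25 - (abs(p_hs - 0.5) - 2.0 * eps_prime / 3.0) ** 2
    return n >= 2.0 * math.log(2.0 / delta) / eps_prime ** 2 * bracket


def hs_m(data: Sequence[Example], hypotheses: Sequence[Hypothesis],
         eps: float, delta: float, s_lb: int, rng: random.Random,
         max_draws: int = 1_000_000) -> Tuple[int, float, int]:
    """Return (index of best hypothesis, its sample accuracy, samples used)."""
    correct = [0] * len(hypotheses)
    n, best, p_best = 0, 0, 0.0
    while n < max_draws:
        x, y = rng.choice(data)                    # draw one example
        n += 1
        for i, h in enumerate(hypotheses):         # update every P_hS
            correct[i] += int(h(x) == y)
        best = max(range(len(hypotheses)), key=correct.__getitem__)
        p_best = correct[best] / n
        # Stop once n > S_lb, U_hS = P_hS - 1/2 > 0 and Eq. (4) holds.
        if n > s_lb and p_best > 0.5 and eq4_satisfied(n, p_best, eps, delta):
            break
    return best, p_best, n


def make_toy_data(rng: random.Random, size: int) -> List[Example]:
    """Labels follow x > 0.5, with 10% of them flipped at random."""
    data = []
    for _ in range(size):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.1:
            y = 1 - y
        data.append((x, y))
    return data


def stump(threshold: float) -> Hypothesis:
    return lambda x: int(x > threshold)
```

The inner loop over all hypotheses is the source of the overhead mentioned above: every new example costs one evaluation per hypothesis.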

For the implementation, we create a new class AdaGBDT, which inherits the GBDT interface of LightGBM. AdaGBDT implements the TrainOneIter method and adds the adaptive sampling algorithm to it. The pseudo code is shown in Algorithm 2. Note that line 8 trains a tree, which would lead to additional overhead if samples were drawn multiple times in a round. To avoid this, we adjust the calculation of the sample size. The actual sample size is calculated as: $\textit{sampleSize}\leftarrow\textit{value of Eq. (4)}+\textit{log(sampleSize)}$ .

Algorithm 2: TrainOneIter
1	$S\leftarrow\{\}$
2	sampleSize $\leftarrow$ 0
3	Done $\leftarrow$ false
4	$P_{h,S}\leftarrow$ 0.5
5	While $\textit{Done}=\textit{false}$
6	Do
7	begin
8	treeLearner.Train()
9	Compute $P_{h,S}$ and $U_{h,S}=P_{h,S}-1/2$
10	$\textit{sampleSize}\leftarrow$ sample size computed using Eq. (4)
11	Draw sampleSize random training examples $S^{\prime}$
12	$S\leftarrow S\cup S^{\prime}$
13	If sampleSize satisfies Eq. (4) and $U_{h,S}>0$
14	Then $\textit{Done}\leftarrow\textit{true}$
15	end

LightGBM implements the histogram-based algorithm to find the best split points. For data with dense features, the complexities of building the histogram and of finding the split points are $\textit{O(\#data}\times\textit{\#feature)}$ and $\textit{O(\#bin}\times\textit{\#feature)}$ , respectively. The cost comes mainly from building the histogram, because #bin is generally very small relative to #data. The sampling step at line 11 introduces additional overhead because it needs to scan all data instances to draw random samples. The complexity of sampling is O(#data), which is determined only by the size of the original data. It follows that the sampling algorithm introduces relatively less overhead when the data set has more valid features. With our implementation, the complexity of building the histogram changes from $\textit{O(\#data}\times\textit{\#feature)}$ to $\textit{O(N}\times\textit{\#data}\times\textit{\#feature)}$ , where N is the sampling ratio. To observe the influence of the adaptive sampling algorithm on the training cost under different data sizes, we design a series of experiments that compare the average training time of the LightGBM model and of our adaptive version on data with different numbers of instances and features. The results demonstrate the advantage of our implementation on dense data sets.
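The effect of the feature count on the relative overhead can be illustrated with a toy cost model. The sketch assumes that scanning one instance for sampling costs the same as one histogram update, which is a simplification made for this example; the function names are hypothetical.

```python
def histogram_cost(n_data: int, n_feature: int, ratio: float = 1.0) -> float:
    """Histogram building cost, proportional to ratio * #data * #feature."""
    return ratio * n_data * n_feature


def sampling_cost(n_data: int) -> float:
    """Sampling cost, proportional to #data (one scan over all instances)."""
    return float(n_data)


def relative_cost(n_data: int, n_feature: int, ratio: float) -> float:
    """(sampling + sampled histogram) divided by the full histogram cost."""
    sampled = sampling_cost(n_data) + histogram_cost(n_data, n_feature, ratio)
    return sampled / histogram_cost(n_data, n_feature)


if __name__ == "__main__":
    # With a fixed sampling ratio, the O(#data) scan matters less and less
    # as the number of features grows.
    for n_feature in (10, 28, 100, 500, 1000):
        print(n_feature, round(relative_cost(100_000, n_feature, 0.15), 4))
```

Under this model the relative cost is simply the sampling ratio plus 1/#feature, so the scan overhead becomes negligible for wide, dense data.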

4. Experiments

In this section, we report the experimental results for LightGBM with adaptive sampling (i.e., AdaGBDT). Eight data sets are used in our experiments: four artificial data sets and four publicly available data sets. The numbers of data instances and features are shown in Table 1. Our experiments only run binary classification tasks and use the binary error as the measure. All experiments run on a single computer with two quad-core i7-4790 CPUs and 16 GB of memory.

4.1 Speed and accuracy comparison

In this subsection, we present a speed and accuracy comparison between LightGBM and AdaGBDT. The artificial data sets used in the experiments are generated with Matlab and cover the Gamma, Gaussian, Poisson and Uniform distributions. The four publicly available data sets are from [26].

Table 1
Datasets used in the experiments

Name	#data	#feature	Description
GammaDS	100K	500	Dense
NormDS	100K	500	Dense
PoissonDS	100K	500	Dense
UniformDS	100K	500	Dense
Higgs	10.5M	28	Sparse
News20	6K	5K	Sparse
real-sim	72309	20958	Sparse
kdda	8407752	20216830	Sparse

The four artificial data sets each consist of 500 features and 100K instances. We generate the Gamma-distributed data set GammaDS by setting the shape parameter to 2 and the scale parameter to 500. The Gaussian data set NormDS is generated with mean 0 and variance 500. We set $\lambda=$ 2 to generate the Poisson-distributed data set PoissonDS, and $a=$ 0, $b=$ 1000 to generate the uniformly distributed data set UniformDS. The experiments presented below are all run on these data sets with different distributions and sparsity.
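For readers without Matlab, feature matrices with the same distribution parameters can be drawn with the Python standard library. This is only a sketch: it is not the authors' generator, it does not produce labels (the labeling procedure is not described), and the names `GENERATORS` and `make_features` are hypothetical.

```python
import math
import random


def poisson(lam: float, rng: random.Random) -> int:
    """Knuth's multiplication method; adequate for a small rate such as 2."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


# One feature-value generator per artificial data set, using the parameters
# stated in the text (the Gaussian takes a standard deviation, sqrt(variance)).
GENERATORS = {
    "GammaDS": lambda rng: rng.gammavariate(2.0, 500.0),      # shape, scale
    "NormDS": lambda rng: rng.gauss(0.0, math.sqrt(500.0)),   # mean, std dev
    "PoissonDS": lambda rng: float(poisson(2.0, rng)),        # lambda = 2
    "UniformDS": lambda rng: rng.uniform(0.0, 1000.0),        # a = 0, b = 1000
}


def make_features(name: str, n_rows: int, n_cols: int, seed: int = 0):
    """Return an n_rows x n_cols list of lists of i.i.d. feature values."""
    rng = random.Random(seed)
    draw = GENERATORS[name]
    return [[draw(rng) for _ in range(n_cols)] for _ in range(n_rows)]


if __name__ == "__main__":
    # A small matrix for inspection; the paper uses 100K rows and 500 columns.
    for name in GENERATORS:
        print(name, make_features(name, 2, 3))
```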

Table 2

Overhead of training (seconds)

Dataset	LightGBM	AdaGBDT
GammaDS	3.401e-2	1.664e-2 (using 15.35% of data)
NormDS	3.318e-2	1.652e-2 (using 15.38% of data)
PoissonDS	2.986e-2	1.236e-2 (using 15.41% of data)
UniformDS	3.302e-2	1.553e-2 (using 15.42% of data)
Higgs	0.332	0.154 (using 0.15% of data)
News20	4.104e-2	3.589e-2 (using 76.6% of data)
real-sim	4.997e-2	3.864e-2 (using 4.91% of data)
kdda	0.966	0.247 (using 0.11% of data)

The training times are presented in Table 2. We recorded the average time cost of one iteration of the training process. The results in Table 2 show that AdaGBDT needs about 26%–50% of the time spent by LightGBM on six of the eight data sets, and about 77% and 87% on real-sim and News20, respectively, while using only 0.11%–76.6% of the data. The results demonstrate that AdaGBDT improves training performance by reducing the number of instances needed.
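The per-data-set time ratios can be recomputed directly from the values in Table 2. The snippet below is a small arithmetic check; the dictionary and function names are chosen only for this illustration.

```python
# Average time per training iteration in seconds, copied from Table 2:
# data set -> (LightGBM, AdaGBDT)
TABLE_2 = {
    "GammaDS": (3.401e-2, 1.664e-2),
    "NormDS": (3.318e-2, 1.652e-2),
    "PoissonDS": (2.986e-2, 1.236e-2),
    "UniformDS": (3.302e-2, 1.553e-2),
    "Higgs": (0.332, 0.154),
    "News20": (4.104e-2, 3.589e-2),
    "real-sim": (4.997e-2, 3.864e-2),
    "kdda": (0.966, 0.247),
}


def time_ratio(name: str) -> float:
    """AdaGBDT time divided by LightGBM time for one data set."""
    lightgbm, adagbdt = TABLE_2[name]
    return adagbdt / lightgbm


if __name__ == "__main__":
    for name in TABLE_2:
        print(f"{name:10s} {100.0 * time_ratio(name):5.1f}% of LightGBM time")
```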

Table 3

Accuracy comparison (binary error)

Dataset	LightGBM	AdaGBDT
GammaDS	0.2967	0.2872
NormDS	0.2966	0.2874
PoissonDS	0.2966	0.2927
UniformDS	0.2980	0.2889
Higgs	0.2691	0.2773
News20	0.0186	0.0343
real-sim	0.0175	0.0372
kdda	0.1348	0.1395

The results of the accuracy comparison are shown in Table 3. We set $\varepsilon=$ 0.01 and $\delta=$ 0.95 in AdaGBDT. AdaGBDT achieves nearly the same test error as LightGBM: on the four artificial data sets, the learner trained on random samples even has a slightly lower error than the one trained on the entire data set, while on the four public data sets its error is higher by at most about 0.02. This means that AdaGBDT can achieve comparable, and sometimes slightly better, accuracy than LightGBM with better training performance.

Figure 1.

Iteration-accuracy curve.

The training accuracy in each iteration on four data sets is shown in Fig. 1. Note that the binary error of AdaGBDT is slightly higher than that of LightGBM on the News20 and kdda data sets, but lower on the NormDS and PoissonDS data sets. AdaGBDT does not perform very well on data with sparse features, because very little information can be obtained from small samples in a sparse feature space. The results demonstrate that our implementation works better with dense data.

4.2 The impact of data size on performance

In this experiment, we compare the average training time of one iteration of LightGBM and AdaGBDT when running on data with different numbers of instances and features. The data sets used in the experiments follow four different distributions, whose types and configurations are the same as those of the data sets described in Section 4.1.

Our experiment compares the training time on 20 data sets for each distribution. In Fig. 2, the data sets of each distribution consist of 200 features, and the number of instances ranges from 50K to 500K in steps of 50K. In Fig. 3, the data sets of each distribution consist of 100K instances, and the number of features ranges from 100 to 1000 in steps of 100. The results in Figs 2 and 3 show that the average training time of both LightGBM and AdaGBDT grows almost linearly, but the growth rate of AdaGBDT is significantly lower than that of LightGBM. Note that the speed advantage of AdaGBDT increases as the numbers of instances and features increase. These results verify that AdaGBDT works better for massive data with dense features.

Figure 2.

Time cost of training for different numbers of instances.

Figure 3.

Time cost of training for different numbers of features.

5. Conclusions

In this paper, we integrate an adaptive sampling method into LightGBM to reduce the number of data instances in an online fashion by drawing a subset of the training data in every iteration of the training procedure. AdaGBDT provides users with an accuracy parameter $\varepsilon$ and a confidence parameter $\delta$ instead of a sampling ratio, which reduces the difficulty of parameter tuning. We conducted comprehensive experiments to evaluate the speed and accuracy of LightGBM and AdaGBDT. The experimental results show that our implementation achieves better results on dense data within the limits of the a priori error margin and confidence parameter. Moreover, AdaGBDT achieves further performance improvements as the numbers of instances and features increase, so it could be a good choice for data sets with many instances and features. In future work, we will study sampling methods suitable for sparse data and extend LightGBM with the adaptive sampling scheme to multi-class classification tasks.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (Grant No. 61462012, No. 61562010, No. U1531246) and the Innovation Team of the Data Analysis and Cloud Service of Guizhou Province (Grant No. [2015]53).

References

1. Friedman J.H., Greedy function approximation: A gradient boosting machine, Annals of Statistics 29(5) (2001), 1189–1232.

2. Chen and Guestrin, XGBoost: A scalable tree boosting system, Proc. ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794.

3. LightGBM, http://www.dmtk.io/.

4. Alsabti, Ranka and Singh, CLOUDS: A decision tree classifier for large datasets, Knowledge Discovery & Data Mining (1998).

5. Jin and Agrawal, Communication and memory efficient parallel decision tree construction, Proc. SDM 2003.

6. Burges C.J.C. and Wu, McRank: Learning to rank using multiple classification and gradient boosting, Proc. International Conference on Neural Information Processing Systems (2007), 897–904.

7. Collins, Schapire R.E. and Singer, Logistic regression, AdaBoost and Bregman distances, Machine Learning 48(1–3) (2002), 253–285.

8. Friedman J.H., Stochastic gradient boosting, Computational Statistics & Data Analysis 38(4) (2002), 367–378.

9. Chen and Xu, Sampling adaptively using the Massart inequality for scalable learning, Proc. International Conference on Machine Learning and Applications, IEEE, 2014, pp. 362–367.

10. Chernoff, A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations, Annals of Mathematical Statistics 23(4) (1952), 493–507.

11. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association 58(301) (1963), 13–30.

12. Domingo and Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms, Data Mining & Knowledge Discovery 6(2) (2002), 131–152.

13. Kivinen and Mannila, The power of sampling in knowledge discovery, Proc. Thirteenth ACM Sigact-Sigmod-Sigart Symposium on Principles of Database Systems, ACM, 1994, pp. 77–85.

14. Toivonen, Sampling large databases for association rules, Proc. International Conference on Very Large Data Bases, Morgan Kaufmann Publishers Inc., 1996, pp. 134–145.

15. Wrobel, An algorithm for multi-relational discovery of subgroups, Proc. European Symposium on Principles of Data Mining and Knowledge Discovery, Springer-Verlag, 1997, pp. 78–87.

16. Chen, A new framework of multistage estimation, Mathematics (2008).

17. Watanabe, Sequential sampling techniques for algorithmic learning theory, Proc. International Conference on Algorithmic Learning Theory, Springer-Verlag, 2000, pp. 27–40.

18. Domingo and Watanabe, Scaling up a boosting-based learner via adaptive sampling, Proc. PAKDD 2000, 317–328.

19. Domingo and Watanabe, Adaptive sampling methods for scaling up knowledge discovery algorithms, Data Mining & Knowledge Discovery 6(2) (2002), 131–152.

20. Watanabe, Simple sampling techniques for discovery science, IEICE Transactions on Information & Systems 83(1) (2000), 19–26.

21. Chen and Chen, A new method for adaptive sequential sampling for learning and parameter estimation, Proc. International Conference on Foundations of Intelligent Systems, Springer-Verlag, 2011, pp. 220–229.

22. Chen, Scalable ensemble learning by adaptive sampling, Proc. International Conference on Machine Learning and Applications, IEEE Computer Society, 2012, pp. 622–625.

23. Chen and Xu, Sampling adaptively using the Massart inequality for scalable learning, Proc. International Conference on Machine Learning and Applications, IEEE, 2014, pp. 362–367.

24. Chen, Properties of a new adaptive sampling method with applications to scalable learning, Proc. IEEE International Joint Conferences on Web Intelligence, 2013, pp. 9–15.

25. Massart, The tight constant in the Dvoretzky-Kiefer-Wolfowitz inequality, Annals of Probability 18(3) (1990), 1269–1283.

26. Libsvm Binary Classification Data, https://www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/binary.html.
