Abstract
Data streaming classification has become an essential task in many fields where real-time decisions have to be made based on incoming information. Neural networks are particularly suitable for the streaming scenario due to their incremental learning nature. However, the high computational cost of deep architectures limits their applicability to high-velocity streams, hence they have not yet been fully explored in the literature. Therefore, in this work, we aim to evaluate the effectiveness of complex deep neural networks for supervised classification in the streaming context. We propose an asynchronous deep learning framework in which training and testing are performed simultaneously in two different processes. The data stream entering the system is fed into both layers in order to concurrently provide quick predictions and update the deep learning model. This separation reduces processing time while obtaining high classification accuracy. Several time-series datasets from the UCR repository have been simulated as streams to evaluate our proposal, which has been compared to other methods such as Hoeffding trees, drift detectors, and ensemble models. The statistical analysis carried out verifies the improvement in performance achieved with our dual-pipeline deep learning framework, which is also competitive in terms of computation time.
Introduction
Knowledge discovery from data streams has recently gained importance due to the enormous amount of data that modern devices collect at high speed. Models that deal efficiently with streams of data to provide real-time predictions are necessary in many fields such as machine fault detection [1], electricity demand prediction [2], financial data prediction [3], computer security [4], and health care [5]. An important challenge in the streaming scenario is performing online classification, since the specific characteristics of this scenario prevent the use of traditional batch-learning techniques [6]. Data stream models learn incrementally using incoming data that cannot be stored, and have to be ready to assign a label whenever a new instance arrives.
Over the last decades, many novel classification algorithms have been proposed to improve the accuracy of traditional methods [7]. Promising results have been obtained by adapting these existing techniques to the streaming context. However, the recent advent of deep neural networks (DNNs) as the state of the art for many problems opens an interesting research direction towards higher performance [8]. DNNs are particularly suitable for data streaming due to their incremental learning nature and their capacity for solving dynamic non-linear problems [9]. Nevertheless, their application to high-velocity streams presents limitations due to the high computational cost of their learning procedure. Therefore, there is little research on the use of DNNs for data streaming, and these models do not usually appear among the high-performing methods in the literature [10].
Our aim in this study is to develop a framework that can use complex DNNs for data streaming classification, in which maintaining a high processing rate is essential. This paper proposes a novel Asynchronous dual-pipeline Deep Learning framework for data streaming (ADLStream) that is designed to deal with the specific requirements of this scenario. Training and classification processes work simultaneously in two separate layers, hence the system is always ready to provide predictions. At the same time, the training process constantly updates the model in order to adjust it to changes in the incoming data distribution. This division reduces the computation time needed to deal with the instances while maintaining high classification accuracy. ADLStream is a general framework that could be used with any kind of deep learning (DL) model, regardless of its architecture.
As a case study to validate the performance of the proposed framework in a complex environment, we have used convolutional neural networks (CNNs), which are especially suited to data that has spatial or temporal structure [11]. For the experiments, we have simulated as streams a large number of time-series datasets from the UCR repository [12]. The performance of our proposal is compared in terms of accuracy and processing time to other popular streaming techniques such as Hoeffding trees and ensemble models. Furthermore, we have carried out several experiments with artificial datasets to evaluate the robustness of ADLStream when the properties of the stream change significantly over time, which is known as concept drift [13].
The main contributions of this work can be summarised as follows:
ADLStream, a novel deep learning framework for data streaming classification.
An asynchronous dual-pipeline architecture that reduces the processing time of DL networks for data streaming by splitting training and classification tasks.
A thorough experimental study, comparing ADLStream to several streaming techniques over more than 30 datasets.
An analysis of the effects of different concept drifts on the performance of ADLStream.
The rest of the paper is organised as follows: Section 2 presents a review on related work; in Section 3 the materials used and the methods proposed in the study are described; Section 4 presents the experimental setup designed; Section 5 reports and discusses the results obtained; Section 6 presents the conclusions and possible future work.
Learning from data streams presents several challenges that prevent the direct use of traditional data mining algorithms. Classifiers designed for static datasets typically need to iterate several times over the instances, hence they are not suitable for dealing with data arriving at high speed. An effective data stream classification model should be able to extract the relevant information with just a single pass over the instances, using a limited amount of time and memory, which increases the complexity of the learning procedure [14]. Another important aspect when dealing with streams is the type of learning framework, which depends on the availability of labels for the incoming examples. In this work, we follow a completely supervised learning framework, where the true class of every example becomes known after it has been classified. Therefore, the studies covered in this section all work under the same assumption, which is very common in the literature. However, there are also studies considering other scenarios, such as learning with delayed labelling or semi-supervised learning (i.e. only a fraction of incoming examples have labels) [15]. Regardless of the framework considered, the classical division of batch learning techniques into training and predicting phases has to be shifted to an online approach in which both tasks are interleaved, given that the stream may be infinite. Furthermore, the update of the models has to account for concept drifts since the data distribution may change over time. These variations in the boundaries between classes need to be detected in order to carry out efficient retraining of the models [16].
Considering the above-mentioned characteristics of data streaming, there have been efforts to adapt many existing techniques to this context, such as support vector machines [17], k-nearest-neighbors [18], rule-based classification [19], and Bayesian classification [20]. However, due to their simplicity and fast processing rate, one of the most popular approaches has been to develop adaptive algorithms based on decision-tree methods [21]. The Very Fast Decision Tree (VFDT) method, based on the Hoeffding tree principle, was the earliest to be designed specifically for stream classification [22]. Hoeffding bounds help to incrementally build a model similar to the one a batch learner would produce: a node is split only when the difference between the current best attribute and the others is statistically significant. Following this concept, new proposals such as Hoeffding Option Trees (HOT) were designed. In HOT, each example can update a set of option nodes instead of a single leaf, leading to a representation of multiple trees as separate paths [23]. Other algorithms more suitable for time-changing data streams were later developed, such as Hoeffding Adaptive Trees (HAT) [24], which reduce the effect of past data by using sliding windows and replacing branches.
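The split rule based on the Hoeffding bound can be illustrated with a short Python sketch. It assumes the commonly used form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split criterion, delta the allowed error probability and n the number of examples seen at the node. The function names and the example values are hypothetical and are not taken from [22].

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Return epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split only if the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Illustrative values: the same gain difference is not significant after
    # 100 examples but becomes significant after 1000 examples.
    for n in (100, 1000):
        print(n, hoeffding_bound(1.0, 1e-7, n), should_split(0.30, 0.20, 1.0, 1e-7, n))
```

The bound shrinks as more examples reach the node, so a split that cannot be justified early on may become justified later without revisiting past instances.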
Furthermore, an important area of research on data streams has been the development of real-time monitoring techniques that detect concept drifts over incoming data [25]. There is a wide variety of concept drifts, which are usually grouped into four types [26]: abrupt drift, when there is a sudden shift from one concept to another; incremental drift, which implies going through many different intermediate concepts while drifting to a new concept; gradual drift, when the stream oscillates between two concepts before drifting completely; and recurrent drift, which happens when the stream drifts to a previously seen concept. Additionally, in real-world streams the distribution of classes may change over time, leading to an imbalance that increases the difficulty of classification.
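The four drift types can be made concrete with a small simulation sketch in Python. It is only an illustration under assumptions that do not come from [26]: two concepts labelled 0 and 1, a drift centred at a hypothetical `position`, a transition of `width` instances, and a sigmoid switching probability for gradual drift, which is one common convention in stream generators.

```python
import math
import random


def new_concept_probability(t, position, width):
    """Sigmoid probability that instance t is drawn from the new concept."""
    z = -4.0 * (t - position) / width
    z = max(min(z, 50.0), -50.0)  # clamp to keep math.exp well behaved
    return 1.0 / (1.0 + math.exp(z))


def incremental_blend(t, position, width):
    """Incremental drift: fraction of the way from the old to the new concept."""
    return max(0.0, min(1.0, (t - position) / width))


def active_concept(t, kind, position=500, width=100, rng=random):
    """Return 0 (old concept) or 1 (new concept) for the instance at time t."""
    if kind == "abrupt":
        # Sudden shift at `position`.
        return 0 if t < position else 1
    if kind == "gradual":
        # The stream oscillates between both concepts around `position`.
        return 1 if rng.random() < new_concept_probability(t, position, width) else 0
    if kind == "recurrent":
        # The new concept holds for `width` instances, then the old one returns.
        return 1 if position <= t < position + width else 0
    raise ValueError("unknown drift kind: " + kind)


if __name__ == "__main__":
    rng = random.Random(42)
    for t in (0, 450, 500, 550, 1000):
        print(t, active_concept(t, "abrupt"), active_concept(t, "gradual", rng=rng),
              active_concept(t, "recurrent"), incremental_blend(t, 500, 100))
```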
To deal with these problems, the concept drift detection method (DDM) was presented in [27], which helps control the accuracy of predictions of the learning model. DDM can be used as a wrapper on a classifier, creating a new model with recent examples whenever a significant change in the class distribution is detected. This technique proved to be independent of the underlying classification algorithm used, and other studies have developed similar proposals, such as the Early Drift Detection Method (EDDM) [28] or a non-parametric method based on Hoeffding’s bounds [29]. Although drift detection methods would allow using batch algorithms as the base learner, their combination still faces the computational cost of rebuilding models from scratch many times. Therefore, alternative approaches to handle concept drifts have considered using classifiers that adapt to changes in the data. This behaviour can be implemented by using a sliding window or by using incremental or online learners, such as neural networks, which can keep their weights updated by processing each instance only once [30].
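A minimal Python sketch of a DDM-style detector is given next. It follows the formulation in which the running error rate p and its standard deviation s = sqrt(p(1 - p)/n) are monitored, and a warning or a drift is signalled when p + s exceeds the best recorded p_min + s_min by two or three times s_min. The thresholds, the minimum of 30 instances and all identifiers are assumptions chosen for illustration and are not values confirmed by this paper.

```python
import math


class DDM:
    """Sketch of a drift detector driven by the classifier's error stream."""

    def __init__(self, min_instances=30, warning_level=2.0, drift_level=3.0):
        self.min_instances = min_instances
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """Register one prediction outcome (True if misclassified) and return
        'stable', 'warning' or 'drift'."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n
        s = math.sqrt(p * (1.0 - p) / self.n)
        if self.n < self.min_instances:
            return "stable"
        if p + s < self.p_min + self.s_min:
            self.p_min, self.s_min = p, s  # best state observed so far
        if p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()  # a wrapper would rebuild the model at this point
            return "drift"
        if p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


if __name__ == "__main__":
    detector = DDM()
    # 10% errors for 200 instances, then every prediction is wrong.
    for i in range(1, 301):
        state = detector.update(i % 10 == 0 if i <= 200 else True)
        if state == "drift":
            print("drift signalled at instance", i)
            break
```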
More recently, ensemble methods have gained relevance since they can improve the robustness of single classification models and allow an easier adaptation to variations in the distribution of the data [15]. Online approaches to traditional bagging and boosting algorithms were designed in [31], in which the incoming samples are weighted using the Poisson distribution for carrying out the model updating. Later, several studies have proposed modifications to these methods in order to improve randomization, such as: Adaptive-Size Hoeffding Trees (ASHT), which builds an ensemble of trees of different sizes [24]; ADWIN Bagging, which uses adaptive windows to detect concept drifts and eliminate ensemble members with poor performance [24]; and Leveraging Bagging, which increases resampling and uses output detection codes [32].
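The Poisson-based weighting used by online bagging can be sketched as follows. This is a simplified illustration, assuming that each ensemble member is updated k times with every incoming instance, with k drawn from a Poisson distribution of mean 1. `MajorityClass` is a hypothetical stand-in for a real incremental base learner and is not part of [31].

```python
import math
import random
from collections import Counter


def poisson(lam, rng):
    """Draw a Poisson(lam) sample with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


class MajorityClass:
    """Toy incremental learner: predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, x, y):
        self.counts[y] += 1

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None


class OnlineBagging:
    def __init__(self, n_members=5, seed=0):
        self.rng = random.Random(seed)
        self.members = [MajorityClass() for _ in range(n_members)]

    def learn(self, x, y):
        # Each member sees the instance k ~ Poisson(1) times (possibly zero).
        for member in self.members:
            for _ in range(poisson(1.0, self.rng)):
                member.learn(x, y)

    def predict(self, x):
        # Majority vote over the members that have already been trained.
        votes = Counter(p for p in (m.predict(x) for m in self.members) if p is not None)
        return votes.most_common(1)[0][0] if votes else None
```

Drawing k from Poisson(1) approximates, for an unbounded stream, the number of times an instance would appear in a bootstrap replicate of a fixed-size dataset.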
The latest studies on data stream classification have followed the approach of combining the ideas of ensemble models and concept drift detection. The Adaptive Random Forest (ARF) algorithm for classification of evolving data streams was proposed in [33]. ARF improves resampling methods to add diversity and uses adaptive operators to cope with concept drifts. Furthermore, the independence of its components allows for a parallel implementation that reduces processing time without degrading performance.
In [34] the authors propose an improvement applicable to any online ensemble that adds possible abstentions in the voting process. Only classifiers that perform above a certain confidence level are allowed to vote, which proved to be particularly useful in noisy data streams. The same authors later developed the Kappa Updated Ensemble (KUE) [16]. This algorithm is driven by the Kappa statistic and uses weighted voting from a pool of classifiers to provide predictions. In KUE, each component deals with a random dimensionality, as opposed to ARF, in which the subspace size is fixed.
Datasets used for the study
Other studies have recently proposed novel drift detection methods: [35] presents a methodology to detect different concept drifts by selecting dynamically the most competent ensemble member to classify each incoming example; and [36] develops the Enhanced Concept Profiling Framework (ECPF), which focuses on improving speed by reusing previously trained classifiers when recurrent concept drifts are detected.
With regard to the application of DL techniques to data stream classification, few studies have presented deep neural network models for this context. Existing work is limited to the use of MultiLayer Perceptron (MLP) [37] and ensemble methods using them as the base classifiers [38]. Other proposals have considered placing traditional classifiers on top of simple Deep Belief Networks (DBNs) [10]. Although more sophisticated DL models are currently state-of-the-art for many problems in the batch setting, they have not been explored in the data streaming literature. Their high computational complexity has so far been a severe restriction to make them suitable for a high-velocity stream scenario. In particular, architectures such as recurrent or convolutional networks have provided better performance than MLPs or ensemble models with grid-like data such as images or time series [39]. The above mentioned related works mostly validate their proposals using data without an inner temporal or spatial dependence. Therefore, building DL models that deal efficiently with this kind of data in the streaming context is a research area that has yet to be addressed.
Furthermore, there are recent proposals that develop asynchronous frameworks to reduce processing time in several domains: in [40] a system using coupled deep neural networks is proposed for reinforcement learning; and a genetic programming rule-based classifier for data streaming that can run asynchronously is presented in [41]. This trend calls for the development of similar proposals using deep learning, since the performance of popular streaming ensemble models can be enhanced by complex DL models if they have an adequate processing rate.
Description of datasets
A total of 29 different time-series datasets have been used for this study, which have been obtained from the public UCR repository [12]. These datasets have already been used in the literature to simulate streams for different applications such as stream clustering [42], anomaly detection [43] or density estimation of data streams [44]. All the datasets considered are composed of instances that are one-dimensional time series, hence they have an inner grid-like structure. The length indicates the number of points that each individual instance arriving at the stream has. The particular characteristics of each dataset are presented in Table 1. Only those that have a minimum of 1000 instances have been used, in order to reproduce a realistic streaming scenario. As can be seen, the datasets are different in terms of the length of the series and the number of classes considered. They cover six different domains that are the following:
Sensor: Readings from sensors in areas such as process control measurement (Wafer), weather monitoring (MoteStrain), car engines (Ford), human voice recognition (Phoneme), or animal sounds (InsectWingBeat).
ECG: Electrocardiogram records for tasks such as detecting heart problems (TwoLeadECG) or identifying different people (CinCECGTorso).
Motion: Captures of gestures generated from accelerometers (UWaveGestureLibrary) and digital pen traces (Pendigits).
Image: Outlines of images that are mapped onto a one-dimensional series, such as faces (FaceAll), hands (HandOutlines) or shapes (ShapesAll).
Device: Data from daily electrical power consumption (ElectricDevices).
Simulated: Artificially generated time series for problems such as signal processing (Mallat) or pattern recognition (TwoPatterns).
A more detailed explanation of each type, with figures, can be found in [12]. The use of time-series data allows us to evaluate the behaviour of complex and time-consuming models in our framework, such as CNNs. Moreover, the variability of the datasets selected is essential in order to assess the generalisation capacity of our proposal.
The asynchronous dual-pipeline framework presented in this section aims to provide a general deep learning-based architecture to achieve high performance in data streaming classification. This novel framework improves the processing rate and makes it possible to use deep learning models, such as convolutional or recurrent networks, efficiently for data arriving at high speed. Similarly to most of the existing literature on data streaming, in this work we consider a fully supervised scenario in which the labels are immediately available for all processed examples and can be used to update the model [15]. Accordingly, the data stream can be defined as a sequence of labelled instances
ADLStream framework proposed in the study to perform classification over data streams.
The design of the ADLStream system is illustrated in Fig. 1, which shows that the predicting and training phases are separated into two different layers. This split allows making predictions at any time while keeping the DL model constantly updated, and reduces the computation time compared to the traditional sequential scheme. The logic behind the complete system is described in Algorithm 1. Examples arriving online (i.e. instance by instance) from the stream are sent to both processes, which work asynchronously. When the predicting process receives an instance, it is instantly classified using a previously trained model. Given that DNNs are significantly faster at predicting than at training, the prediction layer is always ready to immediately classify incoming data. In contrast, the training layer is more time-consuming and works by saving the instances received and grouping them in batches. Once a specific number of batches has been collected, they are fed to the DL model in order to carry out the training procedure through back-propagation. The new set of weights obtained when the training is completed is passed to the predicting process to keep both models updated. This approach ensures that each individual example is tested before it is used to train the network. More specifically, at least
Algorithm 1. ADLStream main procedure. Input: Stream, the stream to be analysed.
Sampling procedure: decide randomly whether the incoming item is inserted into the sample; if so, delete a random item from the sample and insert the new one; update the sample weights with the decaying factor.
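As an illustration of the scheme described above, the following Python sketch mimics the two layers with a predictor loop and a trainer thread. It is a minimal sketch under several assumptions that are not part of the framework description: threads are used instead of separate processes, `MajorityModel` is a hypothetical stand-in for the DL model, and `batch_size` and `max_queue` are illustrative parameters. Instances that do not fit in the bounded training queue are discarded rather than delaying prediction.

```python
import queue
import threading


class MajorityModel:
    """Hypothetical stand-in for a DL model: predicts the most frequent label."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        return max(self.counts, key=self.counts.get) if self.counts else None

    def fit(self, batch):
        for _, y in batch:
            self.counts[y] = self.counts.get(y, 0) + 1

    def get_weights(self):
        return dict(self.counts)

    def set_weights(self, weights):
        self.counts = dict(weights)


def run_dual_pipeline(stream, batch_size=4, max_queue=1000):
    train_queue = queue.Queue(maxsize=max_queue)
    lock = threading.Lock()
    shared = {"weights": {}}  # latest weights published by the trainer

    def trainer():
        model, batch = MajorityModel(), []
        while True:
            item = train_queue.get()
            if item is None:  # sentinel: the stream has ended
                break
            batch.append(item)
            if len(batch) == batch_size:
                model.fit(batch)
                batch = []
                with lock:
                    shared["weights"] = model.get_weights()

    worker = threading.Thread(target=trainer)
    worker.start()

    predictor, predictions, dropped = MajorityModel(), [], 0
    for x, y in stream:
        with lock:
            predictor.set_weights(shared["weights"])
        predictions.append(predictor.predict(x))  # test first ...
        try:
            train_queue.put_nowait((x, y))  # ... then hand over for training
        except queue.Full:
            dropped += 1  # discard instead of blocking the prediction layer
    train_queue.put(None)
    worker.join()
    return predictions, shared["weights"], dropped
```

Because every instance is classified before it is placed in the training queue, the sketch preserves the test-then-train ordering regardless of how far the trainer lags behind.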
Although ideally both layers could deal concurrently with all received instances, the significant difference between training and predicting execution time in DNN models poses problems given the high rate of arrival of examples. The capacity of the system for processing all data and re-training the model would depend on several factors such as the stream speed, data topology or computer specifications. However, the optimal situation in which every instance is used for training would imply that an increasingly larger queue of awaiting instances would be formed in the training layer. To solve this problem we propose to specify a limited number of instances to train with, while the rest are discarded in order not to overload the buffer. When the number of examples in the queue reaches a certain value
In the streaming scenario, the distribution of the incoming data tends to evolve over time, which implies that recent instances are more relevant to describe the state of the stream than older ones [45]. A first approach to deal with this issue could be to use a sliding window with the most recent instances [46]. However, with this solution, useful information about the distant historical behaviour of the stream can be lost. For this reason, a common approach is to consider a weighted sampling algorithm to regulate the choice of instances from the stream. In our proposed framework, a biased reservoir sampling method using the A-Chao algorithm is implemented [47]. This technique maintains a reservoir of samples with associated weights and performs probabilistic insertions and deletions on arrival of new stream points. As explained in Algorithm 1, when a new item is examined, its relative weight is used to randomly decide whether it will be inserted into the reservoir. In case it is selected, one random item inside the reservoir is deleted and the new instance is added. The weights of the instances belonging to the reservoir are updated each iteration with a decaying factor, which is fixed to
CNN Architecture for time-series data streaming classification.
Maintaining a completely unbiased sample is not practical given that the evolution of the stream may lead to a reservoir filled with past irrelevant history. Therefore, it is desirable to bias the sampling to represent more recent behaviour of the stream [48]. Accordingly, newly arrived items are assigned more weight in order to increase their probability of belonging to the random sample.
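The biased reservoir idea can be sketched in Python as follows. This is a simplified illustration in the spirit of weighted reservoir sampling and not a faithful reproduction of the A-Chao algorithm [47]: the insertion probability is clipped at 1 instead of treating overweight items separately, every new item has weight 1, and the `decay` value is a hypothetical parameter, since the value used in the study is not stated in this section.

```python
import random


class BiasedReservoir:
    """Fixed-size sample of a stream that favours recent instances."""

    def __init__(self, size, decay=0.99, seed=None):
        self.size = size
        self.decay = decay        # hypothetical decaying factor in (0, 1]
        self.items = []
        self.weight_sum = 0.0     # decayed total weight of the items seen so far
        self.rng = random.Random(seed)

    def add(self, item, weight=1.0):
        """Offer one stream item; return True if it entered the reservoir."""
        # Past weight is decayed, so a new item has a larger relative weight.
        self.weight_sum = self.weight_sum * self.decay + weight
        if len(self.items) < self.size:
            self.items.append(item)
            return True
        # Insertion probability from the relative weight of the new item.
        p = min(1.0, self.size * weight / self.weight_sum)
        if self.rng.random() < p:
            # Replace one uniformly chosen item of the reservoir.
            self.items[self.rng.randrange(self.size)] = item
            return True
        return False


if __name__ == "__main__":
    reservoir = BiasedReservoir(size=50, decay=0.99, seed=7)
    for t in range(5000):
        reservoir.add(t)
    print(sorted(reservoir.items)[:5], sorted(reservoir.items)[-5:])
```

With `decay=1.0` the insertion probability reduces to size/n, which corresponds to unbiased reservoir sampling; smaller values shift the sample towards recent instances.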
With this reservoir approach, the framework keeps a fixed-size window containing a representative sample of the recent instances, which are fed to the training layer whenever an iteration is completed. In the experimental study, a grid search with different combinations has been performed in order to find suitable values for
The ADLStream framework can be used with any type of DL architecture, but as a case study for this work we have used CNNs, which are particularly suitable for dealing with temporal data [49]. The objective of the selected DL model is to perform 1D convolution over the time-series examples in order to automatically extract abstract features that represent the internal structure of the data [50]. The proposed 1D-CNN architecture, illustrated in Fig. 2, is inspired by the study presented in [51]. As can be seen, the network is composed of a block of several one-dimensional dilated convolutional layers with pooling followed by a fully-connected Multi-Layer Perceptron (MLP), which performs the classification.
In the convolutional layers, the one-dimensional input array is convolved with filters of kernel size
where
Regarding the specific number of layers and feature maps used in our model, a more detailed description of the proposed architecture is presented in Table 2. Similarly to well-known CNN architectures such as VGG [56], the number of filters in consecutive convolutional layers is increased in order to extract more detailed features from the richer representations obtained. Therefore, the convolutional layers have 32, 64, and 128 filters respectively. Due to the decreasing spatial resolution of the max-pooled feature maps, the kernel sizes of the convolutional layers are set to 7, 5 and 3 respectively, all with a dilation rate of 3. The features extracted from the convolutional block are then transferred to the fully connected block, which has two dense layers of 512 and 128 neurons, and a softmax layer that has as many neurons as the number of classes considered. Furthermore, dropout layers with a small rate (0.2) are used after the fully connected layers. Dropout has proven to be an effective regularisation method since it enhances the generalisation capacity of the network by deactivating different neurons on each training iteration [57]. Moreover, another measure that we have adopted to prevent overfitting is to avoid a very large batch size, since deep networks tend to converge to sharp minimizers when trained using large batches and lose generalisation ability [58]. For the implementation of the proposed convolutional network, the Keras framework has been used [59].
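The effect of the kernel sizes and the dilation rate on the feature maps can be checked with a short Python calculation. The sketch below uses the kernel sizes (7, 5, 3), the dilation rate (3) and the last filter count (128) given in the text; the pooling size of 2 and the padding mode are assumptions made for illustration. A dilated kernel of size k and dilation d spans d(k - 1) + 1 input points.

```python
def effective_kernel(kernel_size, dilation):
    """Number of input points spanned by a dilated 1D kernel."""
    return dilation * (kernel_size - 1) + 1


def conv_block_lengths(series_length, kernel_sizes=(7, 5, 3), dilation=3,
                       pool_size=2, padding="same"):
    """Length of the feature maps after each convolution + max-pooling stage.

    pool_size and padding are assumed values, not taken from the paper.
    """
    lengths = []
    length = series_length
    for kernel_size in kernel_sizes:
        if padding == "valid":
            # Without padding, the convolution shortens the series.
            length = length - effective_kernel(kernel_size, dilation) + 1
        if length < pool_size:
            raise ValueError("series too short for this configuration")
        length //= pool_size  # max pooling with stride equal to pool_size
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    # Example with a series of 96 points.
    for padding in ("same", "valid"):
        lengths = conv_block_lengths(96, padding=padding)
        # Flattened features passed to the dense block: last length x 128 filters.
        print(padding, lengths, lengths[-1] * 128)
```

Under these assumptions, short series only fit the three stages when padding is used, which is one reason the padding mode matters for datasets with few points per instance.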
CNN architecture. The values of
This section presents the comparative study carried out to evaluate the performance of the ADLStream framework. The experimental process is based on a statistical analysis of the results obtained for all datasets, which compares our proposal with several state-of-the-art algorithms for data stream classification. To simulate the streams, the Apache Kafka platform has been used, since it has emerged as the best stream-processing tool in terms of efficiency of data management [60]. The Kafka server allows reproducing a real data streaming scenario in which instances are constantly arriving, hence the evolution of the accuracy of the models over time can be analysed.
Models used for comparison
Numerous existing techniques have been considered for the study, with the aim of covering all families of algorithms that have been proposed in the literature for this problem. Table 3 presents the different classifiers that have been evaluated, grouped by family, together with the abbreviations that will be used throughout the paper. All selected models are implemented in popular open-source frameworks such as MOA [61] and Scikit-learn [62].
Classifiers used for the comparative study
Unlike in the traditional batch learning setup, cross-validation cannot be used as the evaluation technique in a streaming setup, since unlimited data tends to make it too computationally expensive. In the online streaming setting, the objective is to capture how the accuracy evolves over time, hence one possible alternative is to periodically perform holdout evaluation over independent sets. Although holdout could ideally provide the best estimation of the accuracy on recent data, it is not practical for real scenarios where obtaining sufficient recent data for testing may be challenging. Therefore, the most widespread solution is to maximise the use of available data with an interleaved test-then-train approach [63]. With this method, every instance is used to test the model before it is used for training. This implies that the model is always tested with unseen examples, which makes it possible to update the accuracy incrementally, taking into account all incoming instances. However, since streaming learning algorithms are supposed to evolve over time due to changes in the data distribution, more recent instances should be given more importance in order to provide a reliable error estimation. Accordingly, the predictive sequential (or prequential) evaluation method implements this idea of decreasing the relevance of past examples for the evaluation.
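The interleaved test-then-train loop with a forgetting mechanism can be sketched in Python as follows. The sketch assumes the fading-factor form of prequential accuracy, in which both the count of correct predictions and the count of instances are multiplied by a factor `alpha` before each update. The value of `alpha` and the `MajorityClass` learner are hypothetical and serve only as an illustration.

```python
from collections import Counter


class FadingAccuracy:
    """Prequential accuracy in which past outcomes are discounted by alpha."""

    def __init__(self, alpha=0.99):
        self.alpha = alpha
        self.correct = 0.0
        self.total = 0.0

    def update(self, is_correct):
        self.correct = self.alpha * self.correct + (1.0 if is_correct else 0.0)
        self.total = self.alpha * self.total + 1.0
        return self.correct / self.total


class MajorityClass:
    """Toy incremental learner used only to make the loop runnable."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


def prequential(stream, model, alpha=0.99):
    """Test each instance first, then use it for training."""
    metric = FadingAccuracy(alpha)
    score = 0.0
    for x, y in stream:
        score = metric.update(model.predict(x) == y)  # test ...
        model.learn(x, y)                             # ... then train
    return score
```

With `alpha=1.0` the estimate reduces to the plain test-then-train accuracy over all instances; values below 1 give more weight to recent predictions.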
[64] proposes the use of a forgetting mechanism for computing prequential accuracy by using sliding windows or decaying factors. When using a sliding window of size
With regard to the specific measure used for evaluation, the standard accuracy is not the most appropriate option for the streaming context, since the number of instances for each class can change and lead to an imbalanced distribution [66]. Therefore, the Kappa statistic is a more reliable measure for estimating the performance of data streaming classification algorithms. Equation (3) presents how to compute the Kappa value, where
Once the accuracy values for each method have been obtained, it is necessary to carry out a statistical analysis in order to correctly compare the performance of different classifiers. Since our study compares multiple classifiers over multiple datasets, the Friedman test is the recommended method [67]. This non-parametric test makes it possible to detect global differences and provides a ranking of the algorithms. In the case of obtaining a
In this section, we present and discuss the results obtained from the experiments carried out. The prequential Kappa results of all methods considered are reported, followed by the statistical evaluation performed. Moreover, given the importance of speed in a data streaming scenario, the computation time of all algorithms is also analysed. Finally, we present a study on the effect of different concept drifts. For all tests, we have used a computer with an Intel Core i7-7700K CPU and two NVIDIA GeForce GTX 1080 8GB GPUs.
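For reference, the Kappa statistic reported throughout these results can be computed as in the following Python sketch. It assumes the standard definition, kappa = (p0 - pc) / (1 - pc), where p0 is the observed accuracy and pc is the agreement expected by chance given the class frequencies of the true and predicted labels. The function name is hypothetical, and returning 0 in the degenerate case pc = 1 is a convention chosen here.

```python
from collections import Counter


def kappa(y_true, y_pred):
    """Cohen's Kappa between true and predicted labels."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement (plain accuracy).
    p0 = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # Chance agreement from the marginal class frequencies.
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    pc = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    if pc == 1.0:
        return 0.0  # single class everywhere: no information beyond chance
    return (p0 - pc) / (1.0 - pc)


if __name__ == "__main__":
    # A classifier that always predicts the majority class reaches 75%
    # accuracy here, yet its Kappa shows no skill beyond chance.
    print(kappa([0, 0, 0, 1], [0, 0, 0, 0]))
```

This is why Kappa is preferred over accuracy when class proportions are imbalanced or change along the stream.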
Kappa and processing time results obtained with the ADLStream framework depending on batch size and the number of batches fed. The white dot represents the chosen values for the parameters.
Instead of setting arbitrary values, a grid search has been conducted to select the training hyper-parameters of the ADLStream framework, as mentioned in Section 3.2. Figure 3 illustrates the grid search performed with values ranging from 10 to 120 for both parameters, the batch size (
Heat map of the prequential Kappa results obtained for each method over each dataset. The numbering of the datasets corresponds to the ordering presented in Table 1.
Prequential Kappa results for the best classifiers of different families over all datasets
In order to examine the behaviour of each method over each dataset, the prequential Kappa statistics using decaying factors are collected and analysed. Figure 4 presents a heat map illustrating the results of all techniques. In the map, the methods are ordered by their average overall accuracy, hence the best models are on the left-hand side of the figure. As can be seen, our ADLStream proposal achieves a high level of accuracy for almost all the datasets considered. These results demonstrate the suitability of CNNs for the time-series data streaming scenario. Moreover, the high-quality results obtained by the Multi-Layer Perceptron (MLP) also demonstrate the power of neural network-based techniques compared to the rest of the classifiers. The adaptive random forest method (ARF) was the best ensemble model, obtaining the second position in average performance. Other methods with good results are those related to the Bayesian (NB) and drift classifier (SCD) families. Concerning the decision tree family, the adaptive version ASHT provides the highest accuracy, closely followed by the standard HT. Lastly, the rest of the ensemble models (bagging and boosting) and the function classifiers show the poorest performance.
To provide a more detailed analysis, the results obtained with a subset of the top-performing classifiers of different families are reported in Table 4 and illustrated with a box plot in Fig. 5. Our proposal outperforms the rest of the methods for almost all datasets, and there are particular cases that are worth mentioning. In the last four datasets, the methods from the literature struggle to achieve good performance. As an example, for the ChlorineConcentration dataset, ADLStream leads the ranking with an accuracy of 0.948 while the second method achieves just 0.149. On the other hand, the tree-based ensemble ARF obtained a better result than ADLStream in two datasets (Mallat and HandOutlines). However, the difference in accuracy between both methods in these datasets is small, as it is less than 0.02.
Box plot summarising the Kappa statistics results obtained with the top-performing methods of each family.
Radar plot comparing the performance of different techniques with our proposed ADLStream framework.
Friedman test ranking
Another important aspect is that ADLStream using CNNs stands as the most reliable method, since it shows less variability in its results, as can be seen in the box plot. A further visual comparison between the top-performing techniques is shown in Fig. 6. The increase in performance achieved with the proposed method across the datasets supports its generalisation capacity. Using a constant architecture and parametrisation, ADLStream has achieved a high level of accuracy regardless of the different characteristics of the streams in terms of length of the series and number of classes. In ADLStream, the CNN is trained fewer times and with fewer instances than the rest of the methods, due to the grouping in batches and the examples discarded. Nevertheless, this fact does not have a detrimental effect on the performance, since CNNs can rapidly converge to an optimal set of weights even with less data.
The Kappa statistic results obtained from the experiments have to be analysed through a statistical test to correctly verify the hypothesis of improved performance of our proposal. The global ranking obtained from applying the Friedman Test is presented in Table 5. As expected, the ADLStream model leads the ranking with a high difference in score with respect to the second method, which is ARF. MLP ranks in the third position and has a similar behaviour compared to NB and SCD. Finally, decision trees and ensembles obtain a lower score given their poorer performance. The null hypothesis can be rejected since the
For carrying out the Holm’s procedure, ADLStream is set as the best method and compared to the rest of the algorithms individually. The results obtained for this step are displayed in Table 6, which reports the adjusted
Holm’s post-hoc analysis
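The statistical procedure described above (Friedman ranking followed by Holm's post-hoc comparison against a control method) can be sketched in plain Python as follows. This is a hedged illustration rather than the code used in the study: the function names are hypothetical, higher scores are assumed to be better, and the pairwise p-values assume the usual normal approximation for comparing average ranks against a control.

```python
import math


def average_ranks(scores):
    """scores[d][m] is the score of method m on dataset d (higher is better)."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: -row[m])
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2.0 + 1.0      # tied methods share the mean rank
            for pos in range(i, j + 1):
                totals[order[pos]] += shared
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    return 12.0 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0
    )


def control_p_value(rank_control, rank_other, k, n_datasets):
    """Two-sided p-value for control vs. another method (normal approximation)."""
    se = math.sqrt(k * (k + 1) / (6.0 * n_datasets))
    z = (rank_other - rank_control) / se
    return math.erfc(abs(z) / math.sqrt(2.0))


def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for step, idx in enumerate(order):
        candidate = min(1.0, (m - step) * p_values[idx])
        running_max = max(running_max, candidate)   # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted
```

A method is then declared significantly different from the control when its adjusted p-value falls below the chosen significance level.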
Processing time in milliseconds for the best methods of different families over all datasets
Bar plot comparing the performance in terms of processing time of different techniques with the proposed ADLStream framework.
Accuracy and processing time results of the ADLStream framework and a sequential CNN model
Given the high rate of arrival of instances in the streaming scenario, analysing the average time that each method takes to process an instance (hereinafter referred to as processing time) is essential. The processing time for the set of best classifiers of each family over each dataset is reported in Table 7. As can be seen in the comparison shown in Fig. 7, the processing time of the proposed ADLStream model is competitive with respect to the rest of the algorithms considered. Logically, simpler models such as MLP, NB or HT have a higher processing rate, but the small increase in processing time is compensated by significantly higher accuracy, as seen in Section 5.2. Furthermore, the processing time of ADLStream is lower than that of other ensemble methods such as LBAG and BO-AD, which are considered state-of-the-art ensemble techniques in the data streaming literature [69].
Comparison between sequential and dual-pipeline approach
Datasets used for drift analysis
Kappa accuracy of the top 6 classifiers for concept drift datasets
In this subsection, we present a comparison with a sequential approach in order to illustrate the importance of the asynchronous dual-pipeline architecture designed for the DL model. The novel ADLStream framework introduced in this study aims to tackle the processing time problem of complex DNN models in the data streaming context. Table 8 presents a comparison in terms of time and performance between a sequential CNN and the proposed ADLStream. In the sequential scheme, which is significantly slower, every instance is used to train the model after it has been given a prediction. Therefore, all examples are classified with a model recently updated with all instances seen so far. In contrast, in the dual-pipeline framework, the train and test phases work concurrently, achieving an average speed-up of 42 times over the sequential model. Thanks to the parallelisation, the time to process an instance corresponds just to the prediction time, while the model is trained in a separate layer as many times as possible. Depending on the speed of the data stream and the available computing resources, ADLStream may classify more instances with a non-updated model. Theoretically, this fact could produce a great difference in performance between both approaches. However, the results presented in Table 8 do not show that the sequential approach, in which the model is re-trained more times, significantly outperforms the proposed ADLStream system.
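The dual-pipeline idea described above can be sketched with two concurrent workers in Python. This is a simplified illustration, not the ADLStream implementation: `MajorityModel` is a toy stand-in for the CNN, the names `DualPipeline`, `process` and `close` are hypothetical, labels are assumed to arrive together with each instance, and the sketch uses an unbounded buffer instead of discarding old examples.

```python
import queue
import threading
from collections import Counter


class MajorityModel:
    """Toy stand-in for the deep network: predicts the most frequent label seen."""

    def __init__(self, counts=None):
        self.counts = Counter(counts or {})

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def trained_on(self, batch):
        # Build a new model so the prediction side never reads a half-updated one.
        updated = MajorityModel(self.counts)
        updated.counts.update(label for _, label in batch)
        return updated


class DualPipeline:
    def __init__(self, batch_size=4):
        self.batch_size = batch_size
        self.model = MajorityModel()     # snapshot used by the prediction layer
        self.updates = 0                 # number of completed model updates
        self.lock = threading.Lock()
        self.buffer = queue.Queue()      # labelled instances waiting for training
        self.trainer = threading.Thread(target=self._train_loop, daemon=True)
        self.trainer.start()

    def _train_loop(self):
        # Training layer: groups instances into batches and swaps in new models.
        batch = []
        while True:
            item = self.buffer.get()
            if item is None:             # sentinel: the stream has finished
                break
            batch.append(item)
            if len(batch) == self.batch_size:
                new_model = self.model.trained_on(batch)
                with self.lock:
                    self.model = new_model
                    self.updates += 1
                batch = []

    def process(self, x, y):
        # Prediction layer: answer with the latest available model, which may
        # lag behind the stream, then hand the instance to the training layer.
        with self.lock:
            current = self.model
        prediction = current.predict(x)
        self.buffer.put((x, y))
        return prediction

    def close(self):
        self.buffer.put(None)
        self.trainer.join()
```

The cost of `process` is dominated by the prediction call, which mirrors the observation that the per-instance processing time corresponds only to the prediction time. In a sequential scheme, by contrast, the training step would run inside `process` before the next instance could be handled.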
Friedman test ranking for concept drift experiments
Given the evolving nature of data streams, we also consider it important to carry out experiments that evaluate the effect of concept drifts on the performance of ADLStream. For this purpose, we have created a set of 15 data streams, with a million instances each, using different generators (RBF, RandomTree, Agrawal, SEA, LED) from MOA [63]. These streams cover the main types of concept drift (gradual, abrupt, incremental and recurrent) with different speeds, as well as class-imbalance drifts. Table 9 provides a more detailed description of the number of attributes, number of classes, imbalance rate (IR) and type of drift of each dataset.
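As an illustration of what an abrupt concept drift looks like in a synthetic stream, the following Python sketch produces a small stream loosely modelled on the SEA concepts, in which the label depends on whether the sum of the first two attributes is below a threshold that changes at given positions. It is not MOA's generator and does not reproduce the streams used in these experiments; the function name, its parameters and the threshold values in the usage example are assumptions chosen for the example.

```python
import random


def sea_like_stream(n_instances, drift_points, thresholds, seed=0):
    """Yield (features, label) pairs with abrupt drifts at the given positions.

    thresholds must contain one more value than drift_points: the i-th
    threshold defines the concept that is active after the i-th drift point.
    """
    if len(thresholds) != len(drift_points) + 1:
        raise ValueError("need exactly one threshold per concept")
    rng = random.Random(seed)
    concept = 0
    for i in range(n_instances):
        # Switch to the next concept as soon as a drift point is reached.
        while concept < len(drift_points) and i >= drift_points[concept]:
            concept += 1
        # Three attributes in [0, 10]; only the first two determine the label.
        x = [rng.uniform(0.0, 10.0) for _ in range(3)]
        label = int(x[0] + x[1] <= thresholds[concept])
        yield x, label


# Usage: a short stream whose decision threshold changes at position 500.
stream = list(sea_like_stream(1000, drift_points=[500], thresholds=[8.0, 9.5]))
```

A classifier trained only on the first 500 instances would start to misclassify points whose attribute sum lies between the two thresholds, which is the kind of degradation and recovery that a prequential Kappa curve makes visible.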
Holm’s post-hoc analysis for concept drift experiments
Prequential Kappa evolution for ARF, BA-AD, KUE and ADLStream over two datasets of each drift type.
As stated in previous sections, the ADLStream framework can be used with any DL model. The model should be selected depending on the characteristics of the data so that maximal accuracy can be obtained. However, these experiments aim to evaluate how the asynchronous approach of the framework recovers from concept drifts. Therefore, we have decided to keep using the model detailed in Section 3.2.1 even though CNNs may not be optimal for these types of datasets.
Table 10 presents the Kappa accuracy obtained with the top six techniques over the different concept drift datasets. As can be seen, ADLStream obtains very competitive results, similar to those obtained by the literature methods and even better in seven cases. As a result, ADLStream also leads the performance ranking for these experiments, as shown in Table 11. However, the subsequent post-hoc analysis (Table 12) concludes that there is no significant difference between the performance of ADLStream, KUE, BA-AD and HAT. Taking into account that we have not explicitly designed a concept drift detection method, the performance obtained is comparable to other state-of-the-art classifiers. These results demonstrate the robustness of ADLStream in dealing with different types of drift in the incoming data distribution. The model can adapt to changes by relying on the incremental learning nature of neural networks, which is found to be very helpful in these situations. Furthermore, the fact that we have used CNNs, which are not particularly suitable for data without a grid-like structure, further supports the strength of our proposal.
Additionally, we provide a visual comparison of the reaction of different classifiers to concept drifts. Figure 8 shows the evolution of the prequential Kappa metric with the progress of the streams. In the particular case of the fast incremental drift dataset (RBFi-fast), the concept continuously changes faster than the adaptive capability of the CNN. However, it can be seen that, for the rest of the datasets, ADLStream is able to recover satisfactorily from the drifts. In general, the figures show that our proposal offers performance similar to, and in some cases better than, that of other popular models without the need for any explicit drift detection mechanism.
In this paper, a novel asynchronous dual-pipeline deep learning framework for data stream classification is presented. The proposed system has two separate layers for training and testing that work simultaneously, in order to provide quick predictions and perform frequent updates of the model. This architecture alleviates the computational cost problem of complex deep learning models for the data streaming scenario, in which speed is essential. The results obtained using a large number of time-series datasets showed that the ADLStream framework outperforms other state-of-the-art techniques in the literature, such as Hoeffding trees or ensemble methods. Furthermore, in terms of processing time, our proposal was also found to be competitive and even faster than other extensively used bagging and boosting methods.
We also aimed to illustrate the importance of the dual-pipeline architecture for a real-time environment by comparing its performance and processing time with a sequential approach. It was seen that the layers working asynchronously provided a decisive time reduction to deal with data arriving at high speed, while maintaining a very similar predictive accuracy. In addition, other aspects with regard to the training procedure of the system were analysed, such as the impact of the batch size and the maximum number of batches used for updating the model. Furthermore, a study on the behaviour of ADLStream over datasets that simulate different concept drifts was carried out. For these experiments, our proposal showed a performance comparable to other state-of-the-art techniques. These results proved the capacity of our framework to adapt to changes in the data, without the need for any explicit drift detection method. In conclusion, our study demonstrated that deep learning is a very powerful solution for performing online classification with data of different characteristics. To the best of our knowledge, there are no scientific papers that adapt complex deep neural networks for data streaming. Therefore, the positive results of the experimental analysis carried out could be helpful in order to give further importance to the application of deep learning models in the streaming literature.
Future work should study the suitability of the framework when the label for each instance is not always immediately available. In many applications, obtaining the ground truth is a time-consuming process that often has to be done manually. Therefore, it is important to consider other scenarios such as semi-supervised learning. Moreover, future studies should consider the application of other deep learning architectures such as Long Short-Term Memory, Temporal Convolutional Networks, and others available in the DeepLearning4J library, which can be used with MOA. More powerful networks, together with a more sophisticated hyper-parameter search and advanced sampling techniques, could enhance the performance of data stream classification. Furthermore, another interesting research direction is to develop an ensemble framework containing a pool of deep learning models with different update criteria and abstaining mechanisms.
Acknowledgments
We are grateful to NVIDIA for their GPU Grant Program, which provided us with high-quality GPU devices for carrying out the study.
Funding
This research has been funded by the Spanish Ministry of Economy and Competitiveness under the project TIN2017-88209-C2-2-R.
