An improved short term load forecasting with ranker based feature selection technique

Abstract

Load forecasting is a significant task carried out by electricity utility companies to estimate the future electricity load. The proper planning, scheduling, functioning, and maintenance of the power system rely on accurate forecasting of the electricity load. In this paper, a clustering-based filter feature selection is proposed to assist forecasting models in improving short term load forecasting performance. A Recurrent Neural Network based Long Short Term Memory (LSTM) model is developed for forecasting the short term load and compared against Multilayer Perceptron (MLP), Radial Basis Function (RBF), Support Vector Regression (SVR) and Random Forest (RF). The performance of the forecasting model is improved by reducing the curse of dimensionality using filter feature selection methods, namely Fast Correlation Based Filter (FCBF), Mutual Information (MI), and RReliefF. Clustering is utilized to group similar load patterns and eliminate outliers. The feature selection identifies the features relevant to the load by taking samples from each cluster. To show its generality, the proposed model is evaluated on two different datasets from European countries. The results show that the forecasting models with selected features perform better; in particular, the LSTM with RReliefF outperformed the other models.

Keywords

Load forecasting; feature selection; clustering; deep learning; long short term memory

1 Introduction

Forecasting the short-term electricity load is a critical task for electricity utility companies in operating and managing the supply to their customers. Load forecasting plays a vital role in the power system in achieving equilibrium between power demand and supply through proper planning, design, development, distribution, and maintenance. Inaccurate forecasting may have adverse financial implications for the power system [19]. Load forecasting is divided into three major categories, namely long-term, medium-term and short-term load forecasting. Forecasting for the next one hour to one week is known as short-term load forecasting, whereas forecasting for the next one week to a month and for the next one month to a year is known as medium-term and long-term load forecasting respectively [52]. Electricity cannot be generated in excess on a large scale to satisfy future demand, so it should be generated when there is a demand for it. If the load demand is known in advance, all the operational requirements for generating the electricity can be made available without delay. Overestimation or underestimation of the load leads to complex financial implications for the power system [55]. From the application point of view, short term load forecasting is important for the proper functioning and the stable and economic operation of the power system. It supports decisions on day-to-day operational activities such as planning and scheduling of power generation, scheduling of fuel purchase, allocation of resources, unit commitment, load increment and decrement, scheduling of maintenance, energy storage optimization and preparing the dispatch schedule of the electricity load in the power system [10, 68].

The uncertain and volatile nature of the load and the presence of incomplete, noisy, redundant and irrelevant data are major barriers to accurate forecasting. The load is strongly correlated with environmental factors such as humidity, temperature, wind speed, pressure, snowfall, and cloud cover [39]. If all features are considered, the growing number of features leads to overfitting and the curse of dimensionality, so the relevant features should be identified before forecasting. Feature selection provides significant benefits in load forecasting by identifying the features relevant to the load. Filter (ranker) feature selection is an important type of feature selection that selects the relevant features at high speed by evaluating each feature against the target feature using statistical characteristics. It helps to reduce overfitting and improve the accuracy of forecasting [46]. Researchers have utilized a variety of filter feature selection methods such as Relief, ReliefF, RReliefF, MI, correlation based feature selection (CFS), and FCBF to improve forecasting performance. Much research on short-term load forecasting has been carried out over the past decades, from statistical approaches to deep learning approaches. The statistical approach uses a mathematical formula for forecasting and does not have the capacity to analyze non-linear relationships. With the advent of machine learning, large volumes of electricity data can be managed and processed efficiently, and both linear and non-linear data can be handled at reduced complexity. In the literature, a number of machine learning and deep learning methods have been adapted for load forecasting [1, 12].

In the modern era, deep learning plays an important role in accurate load forecasting. It extracts features directly from the input sequence and discovers complex non-linear dependencies. In particular, the deep learning-based recurrent neural network (RNN) handles time series data well. It produces the output by considering the past inputs and past computations at each time step, which is accomplished by embedding the information of previous events in the hidden state variables. The RNN loops the hidden state information to maintain sequence dependency. Even though the RNN works well for time series data, it has limitations due to its short term memory capacity: it cannot remember longer sequences and can remember only a few steps back. It also suffers from the vanishing gradient and exploding gradient problems. The variant of RNN called LSTM overcomes these difficulties by keeping the information for a long time using gates in each cell [27]. Instead of plain neurons, the LSTM network uses memory cells that replace the short term memory of the RNN, so the LSTM can preserve information over a long sequence. Thus, it enables an adequate understanding of the sequence dependence that exists in time series data, and it has been reported to achieve better prediction accuracy than state-of-the-art methods in time series prediction [44, 65]. The performance of the LSTM can be further enhanced by selecting the relevant features externally using feature selection. In this paper, short term load forecasting with filter feature selection improves accuracy by removing outliers using clustering, reducing the curse of dimensionality by removing irrelevant features using filter feature selection, and reducing overfitting by analyzing the complex, non-linear, uncertain time series load data using LSTM.

The rest of the paper is organized as follows. Section 2 discusses the related works investigated by various researchers in load forecasting. Section 3 highlights the methods utilized in this paper, presents the details of data preprocessing (Phase-I) and discusses the importance of feature selection. It also demonstrates the feature selection and forecasting models (Phase-II). The experimental results and performance evaluation are discussed in section 4. The conclusion is provided in section 5.

2 Related work

This section reviews related research. The literature is classified into feature selection, traditional methods and machine learning methods. Abedinia et al. developed an electricity load forecasting system in which an information theoretic-based minimum redundancy, maximum relevancy and maximum synergy feature selection was introduced to handle redundancy, relevancy and synergy. It considered only two-way interactions and did not consider higher order interactions. Only the temperature data along with the load and calendar data was considered as input; other highly correlated weather data was not considered [44]. Rana et al. utilized a combination of the RReliefF and MI feature selection to find the relevant features and compared it with autocorrelation (AC). Even though the load data is highly correlated with environmental and other features, the forecasting was based only on the historical load. Additional features and the elimination of outliers were not considered [37]. Huang et al. developed a generalized minimum redundancy maximum relevancy (GmRMR) feature selection for improved accuracy by removing irrelevant and redundant features. It utilizes the calendar and load features with different lags. The load at a very large lag does not have much correlation with the forecasting point. As the size of the data increases, the complexity also increases and becomes a hindrance to accurate forecasting [43]. Rana et al. designed an ensemble-based neural network for predicting the electricity demand interval with partial autocorrelation, CFS and MI feature selection. The authors did not consider weather factors and did not deal with outliers [39]. Koprinska [22] utilized the autocorrelation (AC), MI, CFS and Relief feature selection. The neural network with features selected by the Relief method worked well and improved the forecasting performance. However, the random samples taken by the Relief algorithm may lead to a wrong selection of features and reduce the forecasting performance. Fong et al. selected the relevant features by using the coefficients of variance of each feature. Although it works well with high dimensional data, its computation time increases as the number of features increases [50]. Senliol et al. proposed a novel Fast Correlation Based Feature selection as an extension to CFS, where each highly correlated feature gets a chance to eliminate irrelevant features with less correlation [5]. Although it works fast, it does not deal with outliers.

The load forecasting methods are classified into traditional and machine learning methods. Traditional methods such as Linear Regression [23], Auto Regressive Integrated Moving Average [25], and Kalman Filter [14] have been utilized by researchers for load forecasting. They are able to analyze linear relationships but cannot analyze the non-linear relationships that exist between the input and output. On the other hand, machine learning methods can analyze complex linear and non-linear relationships effectively and improve the accuracy of forecasting with minimal model complexity. Zeng et al. introduced a hybrid learning approach in which a single hidden layer feed-forward neural network with switched delay particle swarm optimization was utilized to forecast the short-term electricity load. The complexity of the model increases with an increasing number of hidden layers [42]. Sarhani and Afia utilized SVR with particle swarm optimization for load forecasting, where the performance is improved by identifying and removing irrelevant features using CFS. The setting of the SVR parameters is a hindrance to improving the forecasting performance [38]. Ortiz-Arroyo et al. [11] designed an ANN model for forecasting the daily peak load demand. The forecasting performance is affected by the number of hidden nodes and the number of epochs; the forecasting error increases as the number of nodes and the number of epochs increase. In recent years, electricity load forecasting has been carried out using Back Propagation Neural Network [36, 64], Multilayer Perceptron [15], Elman Neural Network [7, 67], Fuzzy Logic [41, 47], SVR [35, 69], RBF [6], Wavelet Neural Network [44, 59], RF [16], Artificial Neural Network [3, 40], Ensemble Based Neural Network [39], Support Vector Machine [64], and LSTM [27]. Even though machine learning methods work well with non-linear complex data with multiple inputs, they suffer from overfitting when the number of layers and the volume of data increase. Deep learning is a kind of machine learning that handles uncertain time series data well by analyzing the temporal dependency. In [27], Kong et al. utilized the RNN based LSTM for short term residential load forecasting and showed that the proposed model outperforms the state-of-the-art models.

Researchers have utilized a variety of filter feature selection methods such as Relief [22], ReliefF [26, 56], RReliefF [37], MI [22, 60], CFS [5, 39], FCBF [5, 29] and AC [22, 39] for identifying the relevant features and improving the performance of load forecasting. Most of the models in the literature concentrate on identifying the relevant features but fail to address outliers. Outliers do not contribute to improving the performance of load forecasting; instead, they may degrade it, so they need to be removed from the load data. Even though the weather features are highly correlated with the load demand, most of the models in the literature do not consider weather features and consider only the calendar and load features. Moreover, the feature selection methods discussed in the literature consider either all samples (MI, FCBF) or random samples (Relief, ReliefF, RReliefF) for calculating feature importance. Considering all the samples may affect the computation process of the feature selection [60], whereas considering random samples may misguide the feature selection in calculating the feature importance. In addition, the machine learning methods discussed in the literature suffer from overfitting when the number of layers and the volume of data increase. They also struggle to improve accuracy because they lack the ability to remember past information and to discover the temporal dependency for the current computation. They lack the capability to handle the complex, uncertain, non-linear, temporally dependent characteristics of time series load data.

The proposed short term load forecasting with filter feature selection removes the outliers by using clustering, reduces the curse of dimensionality and the feature space by removing the irrelevant features using the filter feature selection, and reduces overfitting by effectively handling the temporal dependency using LSTM.

3 Methodology

The proposed short term load forecasting model consists of two phases namely, data preprocessing and forecasting. The data preprocessing phase integrates the load and weather data, cleans it by replacing the missing values using the previous 24 hrs load values of the feature. After that, it extracts the ‘date’ feature for an indexing purpose from four individual calendar features, ‘year’, ‘month’, ‘day’ and ‘minutes’. Subsequently, it normalizes the data using min-max normalization. Then, the clustering-based filter feature selection is introduced to remove the outliers, irrelevant, and redundant features from the original load data. It forms the clusters and removes the outliers using k-means clustering. Consequently, it takes the stratified samples from each cluster and the relevant features that will contribute to improving the forecasting performance are identified by using the instance-based RReliefF feature selection, information theoretic-based MI feature selection and the correlation-based FCBF.

The forecasting phase reduces the overfitting of the model by effectively handling the complex, uncertain, non-linear and temporal dependency nature of the load using the LSTM. It forecasts the future load by using the list of features selected by the filter feature selection along with load as the input. The model parameters are tuned using the time series cross-validation on a rolling basis. Finally, the performance of the LSTM is compared against the popular learning models, such as MLP, RBF, SVR, and RF with all features and selected features in terms of MAPE, MAE, MSE and RMSE. The general structure of the short term load forecasting with filter feature selection is shown in Fig. 1.
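The four error measures and a rolling evaluation scheme can be sketched as follows. This is a minimal illustration using the standard definitions of MAPE, MAE, MSE and RMSE; the function names and the expanding-window split are assumptions, since the text does not specify the exact rolling scheme used for tuning.

```python
from math import sqrt


def forecast_errors(actual, predicted):
    """Return MAPE (in percent), MAE, MSE and RMSE for two equal-length sequences."""
    n = len(actual)
    abs_err = [abs(a - p) for a, p in zip(actual, predicted)]
    mae = sum(abs_err) / n
    mse = sum(e * e for e in abs_err) / n
    # MAPE is undefined when an actual value is zero; load values are assumed positive.
    mape = 100.0 * sum(e / abs(a) for e, a in zip(abs_err, actual)) / n
    return {"MAPE": mape, "MAE": mae, "MSE": mse, "RMSE": sqrt(mse)}


def rolling_origin_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices) with an expanding training window.

    Assumed scheme: every test block lies strictly after its training block,
    so no future observation leaks into training.
    """
    first_test = n_samples - n_splits * test_size
    if first_test <= 0:
        raise ValueError("not enough samples for the requested splits")
    for s in range(n_splits):
        start = first_test + s * test_size
        yield range(0, start), range(start, start + test_size)
```

Each split trains on all observations before the test block, which respects the temporal order of the load series.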

Fig. 1

Short Term Load Forecasting with Filter Feature Selection.

3.1 Phase-I: Data preprocessing

Data preprocessing is an important task to be carried out before predictive analysis. As the prediction process depends on the historical data, the quality of the predictive analysis also depends on the quality of the input. When the historical data is incomplete, it may lead to inaccurate predictions. In order to achieve an accurate prediction of the load, the historical input data should be preprocessed [35].

First, the hourly recorded electricity load and weather data are integrated and cleaned. Then, the date feature is extracted from the calendar features for indexing purposes. After that, min-max normalization is applied to make the dataset suitable for further processing. Let ‘D’ be the dataset, ‘X’ be a feature, ‘min_x’ be the minimum of ‘X’, ‘max_x’ be the maximum of ‘X’, the original interval be [min_x, max_x] and the new interval be [nmin_x, nmax_x]. Any value ‘v’ from the original interval is mapped to the new value ‘new_v’ in the new interval, here the normalized interval [0, 1], by min-max normalization as follows, $new_{v} = \frac{v - min_{x}}{max_{x} - min_{x}} (nmax_{x} - nmin_{x}) + nmin_{x}$ (1)
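As an illustration of Eq. (1), the following minimal sketch normalizes one feature column. The function name and the handling of a constant column are assumptions made for the example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Map each value v of a feature to the interval [new_min, new_max] as in Eq. (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature carries no information; map it to new_min.
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]
```

For example, `min_max_normalize([10.0, 20.0, 30.0])` maps the smallest value to 0, the largest to 1 and the midpoint to 0.5.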

3.1.1 Clustering

Clustering groups similar samples and identifies the outliers. The feature selection may consider all samples or random samples for calculating feature importance. Considering all samples may affect the computation process of the feature selection, whereas considering only random samples may misguide the feature selection process in calculating the feature importance [8, 51]. To overcome these difficulties, clustering can be applied before the feature selection. So, in this paper, k-means clustering is utilized to assist the feature selection in identifying the relevant features [20, 50]. First, similar sequences of load samples are grouped by using k-means clustering and the outliers are removed. Then, stratified samples are taken from each cluster by the filter feature selection for identifying the relevant features related to the load [21, 49]. The pseudocode for the clustering-based filter feature selection is shown in Fig. 2. The following section discusses the step-by-step procedure for the clustering-based filter feature selection.

Fig. 2

Pseudocode for Clustering-based Filter Feature Selection.

First, it finds the optimal number of clusters by using within sum of squares (WSS) metric as follows, $WSS = \sum_{i = 1}^{p} d {(m_{i}, q^{(i)})}^{2}$ (2) $WSS = \sum_{i = 1}^{p} \sum_{j = 1}^{K} {(m_{ij} - q_{j}^{(i)})}^{2}$ (3) where ‘p’ represents the number of samples in the dataset, ‘m’ represents any one sample in the given dataset, ‘q’ represents the cluster centroid, q⁽ⁱ⁾ represents the closest centroid associated with the sample ‘i’, d(m,q⁽ⁱ⁾) represents the distance between ‘m’ and ‘q⁽ⁱ⁾’ and ‘K’ represents the initial number of clusters.

Consequently, it selects the final optimal number of clusters ‘k’ using the WSS value. Then, it selects ‘k’ samples from ‘D’ as the initial cluster centroids. After that, it identifies the samples most similar to each cluster centroid in the dataset ‘D’ by using the Euclidean distance as follows, $d (m, n) = \sqrt{\sum_{i = 1}^{p} {(n_{i} - m_{i})}^{2}}$ (4)

Here ‘n’ and ‘m’ represent two samples from the dataset and d(m,n) represents the distance between them. Then, it forms the clusters from the samples that belong to each cluster centroid. After that, it updates each cluster centroid by calculating the mean sample (m_c,n_c) of the samples that belong to that cluster as follows, $(m_{c}, n_{c}) = (\frac{\sum_{i = 1}^{p} m_{i}}{p}, \frac{\sum_{j = 1}^{p} n_{j}}{p})$ (5)

This process is repeated until there is no change in the cluster formation. Then, stratified samples are taken from each cluster and the relevant features are found using the filter feature selection. The list of selected features is utilized by the forecasting models for forecasting the future load demand.
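The clustering steps above can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' implementation: the function names, the random initialization, the sampling fraction and in particular the outlier rule (dropping samples farther than the mean plus `z` standard deviations from their centroid) are assumptions, because the text does not state how outliers are detected.

```python
import numpy as np


def kmeans(D, k, n_iter=100, seed=0):
    """Plain k-means; returns labels, centroids and the WSS of Eq. (2)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct samples of D as the initial centroids.
    centroids = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Euclidean distance (Eq. 4) from every sample to every centroid.
        dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        updated = centroids.copy()
        for c in range(k):
            members = D[labels == c]
            if len(members):
                updated[c] = members.mean(axis=0)  # centroid update (Eq. 5)
        if np.allclose(updated, centroids):
            break  # no change in cluster formation
        centroids = updated
    dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    wss = float((dist.min(axis=1) ** 2).sum())
    return labels, centroids, wss


def inlier_mask(D, labels, centroids, z=3.0):
    """Assumed outlier rule: keep samples within mean + z*std of their cluster's distances."""
    dist = np.linalg.norm(D - centroids[labels], axis=1)
    keep = np.ones(len(D), dtype=bool)
    for c in np.unique(labels):
        in_c = labels == c
        keep[in_c] = dist[in_c] <= dist[in_c].mean() + z * dist[in_c].std()
    return keep


def stratified_sample(labels, frac, seed=0):
    """Draw the same fraction of sample indices from every cluster."""
    rng = np.random.default_rng(seed)
    idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        n_take = max(1, int(round(frac * len(members))))
        idx.extend(rng.choice(members, size=n_take, replace=False))
    return np.sort(np.array(idx))
```

Running `kmeans` for several values of `k` and plotting the returned WSS gives the usual elbow curve from which the number of clusters can be read off.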

3.1.2 Feature selection

Feature selection plays a prominent role in reducing the complexity of forecasting by providing reduced data in terms of relevant features. When the dataset has irrelevant features, the model becomes complicated. So, identifying the relevant features is an essential task for achieving an accurate prediction [32, 53]. It also helps to reduce the overfitting and curse of dimensionality problems during learning, which in turn improves the accuracy of forecasting. Due to the rapid growth of load data in the power system and the high correlation of the load with other environmental factors, feature selection becomes a crucial preprocessing task that needs to be done before forecasting the load.

Fast Correlation Based Filter: The FCBF is a type of filter feature selection that uses the symmetric uncertainty(SU) for measuring the relationship between features. It calculates the relationship between the feature-feature and feature-target. It also prunes the features that are redundant as they have a high correlation with other features. The useful features must have high relevance with the target, but it should not be redundant to any other relevant features. As the load has non-linear nature, the information theory-based correlation measure is utilized to find the relationship between features [29].

Let ‘S’ be the dataset with ‘n’ features f₁, f₂, f₃,..., f_n and the target feature ‘c’. First, a predefined threshold value ‘δ’ is set for identifying the relevant features. Then, the SU of each feature ‘f_i’ with the target feature ‘c’ is calculated. The SU is a normalized measure of the relationship between two features and takes values in [0, 1]: SU(f_i, c) is 1 when the value of one feature completely predicts the other, and 0 when the two features are independent. It is calculated as follows, $SU (f_{i}, c) = 2 \left[\frac{IG (f_{i} | c)}{H (f_{i}) + H (c)}\right]$ (6) where IG(f_i|c) is the information gain of ‘f_i’ given ‘c’. $IG (f_{i} | c) = H (f_{i}) - H (f_{i} | c)$ (7)

When the feature ‘f_i’ is more correlated with the target feature ‘c’ than the feature ‘f_j’, the information gain of ‘f_i’ is higher than the information gain of ‘f_j’, IG (f_i|c) > IG (f_j|c). Here H(f_i) represents the entropy of the feature ‘f_i’ and is defined as follows, $H (f_{i}) = - \sum_{k} P (f_{ik}) \log_{2} (P (f_{ik}))$ (8) where P(f_ik) represents the prior probability of the k-th value of ‘f_i’ and the sum runs over all values of ‘f_i’. Consequently, the calculated SU of each feature ‘f_i’ is compared against the predefined threshold ‘δ’ and the feature is added to the selected list ‘S_selected’ if SU(f_i, c) > δ. Then, the features in the ‘S_selected’ list are arranged in descending order of SU, the feature with the highest SU is taken as the first feature ‘f_s1’ and the feature with the next highest SU as the second feature ‘f_s2’. After that, the feature ‘f_s2’ is removed from the ‘S_selected’ list if SU(f_s1, f_s2) ≥ SU(f_s2, c). The process is repeated for all features in the ‘S_selected’ list. Finally, the ‘S_selected’ list provides the final list of relevant features.
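The FCBF procedure of Eqs. (6) to (8) can be sketched as below. The sketch assumes that every feature has already been discretized (for example by binning the normalized values), since the entropies are computed from value counts; the function names and the dictionary-based interface are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(values):
    """H(f) of Eq. (8), estimated from the value counts of a discrete feature."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(x, y):
    """H(x | y): entropy of x within each group of equal y, weighted by group size."""
    groups = {}
    for xi, yi in zip(x, y):
        groups.setdefault(yi, []).append(xi)
    n = len(x)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def symmetric_uncertainty(x, y):
    """SU of Eq. (6), with the information gain of Eq. (7)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both features are constant
    return 2.0 * (hx - conditional_entropy(x, y)) / (hx + hy)


def fcbf(features, target, delta=0.0):
    """features: dict mapping a feature name to its list of discrete values."""
    su_c = {name: symmetric_uncertainty(col, target) for name, col in features.items()}
    # Keep features above the threshold, ordered by decreasing relevance.
    ranked = sorted((f for f in su_c if su_c[f] > delta), key=lambda f: su_c[f], reverse=True)
    selected = []
    for cand in ranked:
        # Drop cand if an already selected feature s has SU(s, cand) >= SU(cand, c).
        if all(symmetric_uncertainty(features[s], features[cand]) < su_c[cand] for s in selected):
            selected.append(cand)
    return selected
```

In this sketch a duplicated feature is pruned by the redundancy check, while a feature that is independent of the target never passes the threshold `delta`.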

Mutual Information: The mutual information is a non-linear correlation-based feature selection that calculates the information shared by two or more features. Independent features do not give any information about each other. The dependency, or the amount of information shared by one feature with other features, is quantified using entropy [43]. Let ‘X’ and ‘Y’ be two features; the mutual information between them is denoted as I(X;Y). I(X;Y) is 0 when ‘X’ and ‘Y’ are independent. As the load data is continuous, the mutual information of the feature ‘X’ with the target ‘Y’ is calculated as follows, $I (X; Y) = \int_{x} \int_{y} p_{X, Y} (x, y) \log \left(\frac{p_{X, Y} (x, y)}{p_{X} (x) p_{Y} (y)}\right) dx \, dy$ (9) where p_{X,Y} is the joint probability density function of ‘X’ and ‘Y’, and ‘p_X’ and ‘p_Y’ are the marginal probability density functions of ‘X’ and ‘Y’ [50].
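In practice the densities in Eq. (9) are unknown and must be estimated from the data. The following sketch uses a simple two-dimensional histogram as the estimator; this is an assumption made for illustration (the text does not state which estimator is used), and the bin count is a hypothetical parameter.

```python
import numpy as np


def mutual_information(x, y, bins=16):
    """Histogram estimate of I(X;Y) in nats for two continuous features."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()               # joint probability per cell
    p_x = p_xy.sum(axis=1, keepdims=True)    # marginal of X, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)    # marginal of Y, shape (1, bins)
    mask = p_xy > 0                          # empty cells contribute nothing
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x @ p_y)[mask])))
```

Features can then be ranked by their estimated mutual information with the load, with values close to zero indicating features that are nearly independent of it.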

RReliefF: The RReliefF supports the non-myopic discretization of numeric features. As the target feature is continuous in regression problems, hits and misses cannot be utilized to estimate the feature weight. So, instead of hits and misses, it uses a kind of probability that the predicted values of two instances are different. This probability is modeled by using the relative distance between the predicted values of the two instances. The RReliefF estimates the weight of each feature by taking random samples from the dataset and returns all features along with their weights. The features with the highest weights are considered relevant features. The following section demonstrates the working of the RReliefF.

Let ‘x’ represent the vector of attribute values and ‘τ(x)’ represent the predicted value. Initially, the RReliefF algorithm initializes the weight value of all features to 0. Then, it selects a sample ‘R_i’ randomly from the dataset. After that, it finds the k nearest instances ‘I_j’ to ‘R_i’ and calculates the weight for different predictions (N_dC) [34]. $N_{dC} = N_{dC} + diff (τ (.), R_{i}, I_{j}) . d (i, j)$ (10) where diff(·, R_i, I_j) is the function that calculates the difference between the two instances ‘R_i’ and ‘I_j’, either in the predicted value τ(.) or in an attribute ‘A’. The d(i,j) accounts for the distance between the two instances ‘R_i’ and ‘I_j’, so that instances with small distances have greater influence on feature selection. $d (i, j) = \frac{d_{1} (i, j)}{\sum_{l = 1}^{k} d_{1} (i, l)}$ (11) $d_{1} (i, j) = e^{- {(\frac{rank (R_{i}, I_{j})}{σ})}^{2}}$ (12) where rank (R_i, I_j) denotes the rank of the instance ‘I_j’ among the instances ordered by their distance from ‘R_i’ and ‘σ’ denotes the user-defined parameter that controls the influence of the distance. Then, it calculates the weight for different attributes (N_dA) as follows, $N_{dA} [A] = N_{dA} [A] + diff (A, R_{i}, I_{j}) . d (i, j)$ (13)

Next, it calculates the weight for different predictions and attributes N_dC_&dA as follows,

$N_{dC \& dA} [A] = N_{dC \& dA} [A] + diff (τ (.), R_{i}, I_{j}) . diff (A, R_{i}, I_{j}) . d (i, j)$ (14)

Finally, it calculates the weight for each attribute as follows,

$W [A] = \frac{N_{dC \& dA} [A]}{N_{dC}} - \frac{N_{dA} [A] - N_{dC \& dA} [A]}{M - N_{dC}}$ (15)

The vector ‘W’ holds the final weight of each feature, from which the list of selected features is obtained [24].
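The weight updates of Eqs. (10) to (15) can be put together in a short NumPy sketch. Several choices here are assumptions made for illustration: the function name and default parameters are hypothetical, ‘M’ is taken to be the number of randomly sampled instances, diff is computed as the absolute difference of values scaled to [0, 1], and the neighbors are found with the Euclidean distance.

```python
import numpy as np


def rrelieff(X, y, n_iter=100, k=10, sigma=20.0, seed=0):
    """Return one RReliefF weight per column of X for a continuous target y."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, n_feat = X.shape
    k = min(k, n - 1)
    # Scale attributes and target to [0, 1] so that diff is a plain absolute difference.
    span = X.max(axis=0) - X.min(axis=0)
    span[span == 0] = 1.0
    Xs = (X - X.min(axis=0)) / span
    ys = (y - y.min()) / ((y.max() - y.min()) or 1.0)
    n_dc = 0.0
    n_da = np.zeros(n_feat)
    n_dcda = np.zeros(n_feat)
    for _ in range(n_iter):
        i = rng.integers(n)                    # random instance R_i
        dist = np.linalg.norm(Xs - Xs[i], axis=1)
        dist[i] = np.inf                       # exclude R_i itself
        nn = np.argsort(dist)[:k]              # k nearest instances I_j
        d1 = np.exp(-(np.arange(1, k + 1) / sigma) ** 2)   # rank-based term
        d = d1 / d1.sum()                      # normalized influence d(i, j)
        diff_y = np.abs(ys[nn] - ys[i])        # diff(tau(.), R_i, I_j)
        diff_a = np.abs(Xs[nn] - Xs[i])        # diff(A, R_i, I_j), shape (k, n_feat)
        n_dc += np.sum(diff_y * d)                               # Eq. (10)
        n_da += (diff_a * d[:, None]).sum(axis=0)                # Eq. (13)
        n_dcda += (diff_a * (diff_y * d)[:, None]).sum(axis=0)   # Eq. (14)
    # Eq. (15); assumes the target is not constant, so that n_dc > 0.
    return n_dcda / n_dc - (n_da - n_dcda) / (n_iter - n_dc)
```

The features would then be ranked by the returned weights, with the highest-weighted ones kept as input to the forecasting models.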

3.2 Phase-II: Forecasting models

The forecasting of the short term load demand is performed using LSTM and compared against MLP, RBF, SVR, and RF. As the real time load is complex, uncertain, volatile and non-linear time series data, statistical methods cannot handle it effectively and fail to provide accurate forecasting [28]. As the MLP [15], RBF [6], SVR [66] and RF [16] are suitable for processing the load with minimal overfitting and high accuracy, these models are selected as the competing models. Once a model is trained, it is utilized for forecasting the future load efficiently and accurately.

3.2.1 Multilayer perceptron

The multilayer perceptron is an acyclic, directed feed-forward artificial neural network. It consists of several layers: one input layer, one or more hidden layers, and one output layer [24]. In the multilayer perceptron, a number of neurons are organized in each layer, and the neurons in different layers are connected by weighted links. All the neurons except those in the input layer use a non-linear activation function. The MLP model is suitable for regression problems and is able to handle non-linear data with many features and inputs. Hence, it can easily separate data that is not linearly separable, and prediction with a trained MLP is very fast. The structure of the MLP is shown in Fig. 3.

Fig. 3

Structure of Multilayer Perceptron.

The inputs (x) provided at the input layer (X) are passed to the hidden layer (Z) and the outcome of the hidden layer is passed to the output layer (Y). At each hidden and output neuron, the sum of the inputs, each multiplied by its connection weight, gives the net activation of that neuron. The output of the neuron is then obtained by applying the transfer function, usually the sigmoid function, to the net activation.

3.2.2 Radial basis function

The radial basis function network performs supervised learning and effectively handles regression and time series prediction problems. It functions in the same way as a feed-forward network, but it uses a single hidden layer with a Gaussian function as the activation function in each hidden unit. The hidden layer acts as a tuned processor and each hidden unit in that layer acts as a pattern detector that computes a score for the match between an input vector and its connection weights. The hidden layer consists of non-linear units, each of which has its own activation function, center, parameter and width. It approximates multivariate functions locally and effectively captures the uncertainty of the model using the Gaussian function [6, 46].

3.2.3 Support vector regression

The SVR is a stable model for handling non-linear data. It has a regularization capability and handles regression problems effectively while retaining the characteristics of the maximal margin of the SVM, which makes it robust to overfitting. It uses a non-linear mapping to map the input vector ‘X’ onto an ‘m’ dimensional feature space [35, 66]. After that, the linear model, f(X,w), is developed as follows, $f (X, w) = \sum_{j = 1}^{m} w_{j} g_{j} (X) + b$ (16) where g_j(X) represents the set of non-linear transformations, w_j represents the corresponding weights and ‘b’ represents the bias. When the data is preprocessed to have zero mean, the bias can be dropped. The SVR uses the ɛ-insensitive loss function that is defined as follows, $L_{ɛ} (y, f (X, w)) = \begin{cases} 0, & \text{if } | y - f (X, w) | ⩽ ɛ \\ | y - f (X, w) | - ɛ, & \text{otherwise} \end{cases}$ (17)

The complexity of SVR can be reduced by minimizing ∥w∥². The deviation outside the ɛ-insensitive zone is measured by introducing the slack variables ξ_i and ξ_i^* for each training sample. The SVR model is then obtained by minimizing the following objective, $\frac{1}{2} {∥ w ∥}^{2} + C \sum_{i = 1}^{n} (ξ_{i} + ξ_{i}^{*})$ (18) where ‘n’ is the number of training samples and ‘C’ is the regularization constant that trades off the model complexity against the training error.

The accuracy of SVR can be improved by carefully setting ‘ɛ’ and the kernel parameters [33].
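A small numeric sketch may help to read Eqs. (17) and (18). It relies on the fact that, for a fixed weight vector, the smallest feasible value of ξ_i + ξ_i^* equals the ɛ-insensitive loss of the i-th residual; the function names and the example values are illustrative assumptions.

```python
def eps_insensitive_loss(y, y_hat, eps):
    """Loss of Eq. (17): zero inside the eps-tube, linear outside it."""
    return max(0.0, abs(y - y_hat) - eps)


def svr_objective(w, residuals, C, eps):
    """Objective of Eq. (18) with each slack sum replaced by its optimal value."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    slack = sum(eps_insensitive_loss(r, 0.0, eps) for r in residuals)
    return regularizer + C * slack
```

With `eps = 0.1`, a residual of 0.05 costs nothing while a residual of 0.5 contributes 0.4, which shows how a larger ɛ tolerates more deviation before the penalty weighted by ‘C’ applies.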

3.2.4 Random forest

The random forest is utilized as a black box model due to its built-in ensembling capability. It handles regression problems with high robustness to correlated features. It is a powerful and stable algorithm appropriate for regression problems, with the ability to reduce the overfitting of the model effectively [16]. When it is used for regression problems, it is called a random forest regressor. It is a versatile, supervised, tree-based machine learning algorithm. It builds the forest by constructing an ensemble of multiple decision trees. The random forest searches for the best feature among a random subset of the features. The functioning of the random forest depends on bagging and random feature selection [43]. The bootstrap samples are taken by sampling with replacement from the training data, and the features are selected randomly for constructing each bag. If there are ‘p’ features in the dataset, then √p features are selected for building each tree. Finally, the outcomes of all the decision trees in the ensemble are merged to improve the generalization capability of the model [2].

3.2.5 Long short term memory

The long short term memory is a variant of the RNN, developed to solve the vanishing gradient and exploding gradient problems of the simple RNN. The RNN takes a vector of the input sequence, models it in each hidden node and maintains the state information, which is used as one of the inputs for the next step at the same node. It produces the output by considering the past inputs and past computations at each time step [58]. This is accomplished by embedding the information of previous events in the hidden state variables. As the RNN has a limited memory, it can remember only short sequences. The LSTM overcomes this difficulty by keeping the information for a long period using gates in each cell. Instead of plain neurons, the LSTM network uses memory cells that replace the short term memory of the RNN. So, the LSTM can preserve information over a long sequence [9]. Thus, it enables an adequate understanding of the sequence dependence that exists in the time series load data. The architecture of the LSTM cell is given in Fig. 4. The gates in the LSTM cell regulate the flow of information and decide which relevant information is kept and which irrelevant information is forgotten. The LSTM cell uses three types of gates, namely the forget gate, the input gate, and the output gate. The forget gate identifies the information that should be discarded from the cell state. The input gate decides which new information from the input is added to the cell state when updating the memory from the old cell state ‘C_t - 1’ to the current state ‘C_t’. The output gate ‘o_t’ decides what information from the cell state is included in the output.

Fig. 4

Architecture of Long Short Term Memory cell.

Let the input sequence be {x₁, x₂, x₃, ... , x_n}, the input at time ‘t’ be ‘x_t’, the previous hidden state at time ‘t-1’ be ‘h_t - 1’ and the new hidden state at time ‘t’ be ‘h_t’. The ‘c_t - 1’, ‘c_t’ and ‘$\tilde{c}_{t}$’ are the previous cell state, the new cell state and the candidate cell state respectively. The forget gate ‘f_t’, input gate ‘i_t’ and output gate ‘o_t’ are mathematically expressed as follows,

$f_{t} = \sigma (W_{f} \cdot [h_{t - 1}, x_{t}] + b_{f})$ (19)

$\tilde{c}_{t} = \tanh (W_{c} \cdot [h_{t - 1}, x_{t}] + b_{c})$ (20)

$i_{t} = \sigma (W_{i} \cdot [h_{t - 1}, x_{t}] + b_{i})$ (21)

$c_{t} = f_{t} \odot c_{t - 1} + i_{t} \odot \tilde{c}_{t}$ (22)

$o_{t} = \sigma (W_{o} \cdot [h_{t - 1}, x_{t}] + b_{o})$ (23)

$h_{t} = o_{t} \odot \tanh (c_{t})$ (24)

where W_f, W_i, W_c and W_o represent the weight matrices of the forget gate, the input gate with the sigmoid function, the input gate with the tanh function (the candidate cell state) and the output gate respectively. The b_f, b_i, b_c and b_o represent the corresponding biases of the forget gate, the input gate with the sigmoid function, the input gate with the tanh function and the output gate respectively. The performance of the LSTM can be improved by increasing the depth of the model [27].
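A single LSTM time step following Eqs. (19)-(24) can be sketched in plain Python as below. This is only an illustration of the gate arithmetic: the parameter dictionary, the list-based vectors and the all-zero weights in the example are assumptions chosen so that the result is easy to check by hand, not values from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Compute W . v + b, where W is a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; returns the new hidden state and cell state."""
    z = h_prev + x_t  # list concatenation, i.e. [h_{t-1}, x_t]
    f_t = [sigmoid(a) for a in affine(p["W_f"], z, p["b_f"])]        # Eq. (19)
    c_tilde = [math.tanh(a) for a in affine(p["W_c"], z, p["b_c"])]  # Eq. (20)
    i_t = [sigmoid(a) for a in affine(p["W_i"], z, p["b_i"])]        # Eq. (21)
    c_t = [f * c + i * g
           for f, c, i, g in zip(f_t, c_prev, i_t, c_tilde)]         # Eq. (22)
    o_t = [sigmoid(a) for a in affine(p["W_o"], z, p["b_o"])]        # Eq. (23)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]               # Eq. (24)
    return h_t, c_t


# Hypothetical sizes: 2 hidden units, 3 input features, all-zero parameters.
hidden, inputs = 2, 3
zeros_W = [[0.0] * (hidden + inputs) for _ in range(hidden)]
params = {name: zeros_W for name in ("W_f", "W_i", "W_c", "W_o")}
params.update({name: [0.0] * hidden for name in ("b_f", "b_i", "b_c", "b_o")})

h_t, c_t = lstm_step([0.3, -0.1, 0.8], [0.0, 0.0], [1.0, -2.0], params)
```

With zero weights every gate evaluates to σ(0) = 0.5 and the candidate to tanh(0) = 0, so the cell state is simply halved. This makes the role of the forget gate in Eq. (22) visible: it scales how much of the previous cell state is carried forward.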

4 Experimental results and analysis

In the present paper, the historical load and weather data of two European countries, Switzerland (DST-I) and France (DST-II), are used to show the generality of the proposed model. The load and weather data recorded every hour from 1st January 2008 to 31st December 2012 are considered for the experiment. The load dataset consists of the load values recorded over the 24 hours of each day along with the day, month, year and minute values. The weather dataset consists of 20 meteorological features and four calendar features: day, month, year and minute. The load is highly correlated with the weather values, so the load and weather values are merged to form a dataset suitable for forecasting. The redundant calendar features from the weather data are removed manually, which leaves a dataset of 25 features and 43848 instances. A new feature named ‘date’ is extracted from the ‘day’, ‘month’, ‘year’ and ‘minute’ features to serve as an index, replacing those four features. Finally, the dataset with 22 features and 43848 instances is used for forecasting the short term load.

Samples of the load demand from the two datasets, DST-I and DST-II, are plotted in Fig. 5 and Fig. 6 respectively.

Fig. 5

Sample of load demand from DST-I.

Fig. 6

Sample of load demand from DST-II.

Datasets with different characteristics, such as load demand range and load demand trend, are considered to show the generality of the proposed model. The minimum, maximum and mean load demand of DST-I are 229 MW, 8694 MW and 5633 MW respectively. On the other hand, the minimum, maximum and mean load demand of DST-II are 30826 MW, 102098 MW and 56051 MW respectively. The load demand trend of DST-I has more fluctuations, whereas DST-II has a smoother load trend. Each dataset is partitioned into a training dataset, a validation dataset and a testing dataset. The load data from 1st January 2008 to 31st December 2010 are used as training data, the load data from 1st January 2011 to 31st December 2011 as validation data, and the load data from 1st January 2012 to 31st December 2012 as testing data.
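The size of each partition follows directly from the hourly sampling and the date ranges given above. The short sketch below reproduces the arithmetic; it assumes exactly one record per hour with no missing hours, which is consistent with the 43848 instances reported for each dataset. The function and variable names are illustrative.

```python
from datetime import datetime, timedelta


def hourly_index(start, end_inclusive):
    """List every hourly timestamp from start through end_inclusive."""
    stamps = []
    t = start
    while t <= end_inclusive:
        stamps.append(t)
        t += timedelta(hours=1)
    return stamps


stamps = hourly_index(datetime(2008, 1, 1, 0), datetime(2012, 12, 31, 23))

# Chronological split by calendar year, as described in the text.
train = [t for t in stamps if t.year <= 2010]
validation = [t for t in stamps if t.year == 2011]
test = [t for t in stamps if t.year == 2012]

print(len(stamps), len(train), len(validation), len(test))
```

Under this assumption the split contains 26304 training hours, 8760 validation hours and 8784 testing hours (2008 and 2012 are leap years). Splitting by time rather than at random keeps every validation and test hour strictly later than the training data.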

The features of the dataset are f₁-Date, f₂-Load, f₃-Temperature, f₄-Relative Humidity, f₅-Pressure, f₆-Total precipitation amount (high resolution, limited time range), f₇-Total precipitation amount (low resolution), f₈-Snowfall amount (high resolution, limited time range), f₉-Snowfall amount (low resolution), f₁₀-Total cloud cover, f₁₁-Low cloud cover, f₁₂-Medium cloud cover, f₁₃-High cloud cover, f₁₄-Sunshine Duration, f₁₅-Shortwave Radiation, f₁₆-Wind Speed [10 m above gnd], f₁₇-Wind Direction [10 m above gnd], f₁₈-Wind Speed [80 m above gnd], f₁₉-Wind Direction [80 m above gnd], f₂₀-Wind Speed [900 m above gnd], f₂₁-Wind Direction [900 m above gnd] and f₂₂-Wind Gust. The following section discusses the experimental results obtained with DST-I.

To improve the forecasting performance, the clustering-based filter feature selection methods FCBF, MI, and RReliefF are applied individually to the dataset to obtain the relevant features. After performing the feature selection on DST-I, 13 features are selected as the features relevant to the load. The features selected by each method for DST-I are listed in Table 1.

Table 1

Selected Features from DST-I

Feature Selection Methodologies	Selected Features
Fast Correlation Based Filter	f₁, f₃, f₁₆, f₁₉, f₁₅, f₂₂, f₁₈, f₄, f₂₀, f₆, f₁₄, f₁₇, f₂₁
Mutual Information	f₁, f₃, f₅, f₄, f₂₁, f₁₂, f₁₆, f₁₇, f₁₈, f₁₉, f₂₀, f₁₄, f₁₅
RReliefF	f₁, f₃, f₄, f₅, f₁₅, f₁₇, f₁₆, f₁₈, f₁₉, f₁₄, f₂₀, f₂₁, f₂₂

The LSTM model is trained using the training dataset. The load and weather data at the previous timestep are given as input, and the load at the next timestep is forecasted. The LSTM network is configured as follows: the number of inputs is 13 for DST-I and 12 for DST-II; the number of outputs is 1; the number of hidden layers is tested from 1 to 10, with 2 selected as optimal; the number of units in the hidden layers is tested from 10 to 100, with 20 selected as optimal; the number of epochs is tested from 100 to 500, with 150 selected as optimal; the optimizer is Adam; and the loss function is the mean absolute error.
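The one-step-ahead framing and the selected hyperparameters can be written down compactly as follows. This is a sketch for orientation only: the helper name `make_one_step_pairs`, the dictionary keys and the toy rows are hypothetical, and no particular deep learning library is implied. Only the numeric settings are taken from the text.

```python
def make_one_step_pairs(rows, load_index):
    """Pair the feature vector at step t with the load at step t + 1."""
    inputs = rows[:-1]
    targets = [row[load_index] for row in rows[1:]]
    return inputs, targets


# Settings reported in the text, collected in a hypothetical config dictionary.
lstm_config = {
    "n_inputs": {"DST-I": 13, "DST-II": 12},
    "n_outputs": 1,
    "hidden_layers": 2,      # searched over 1 to 10
    "units_per_layer": 20,   # searched over 10 to 100
    "epochs": 150,           # searched over 100 to 500
    "optimizer": "adam",
    "loss": "mean_absolute_error",
}

# Toy rows of [load, temperature]; the values are made up for illustration.
rows = [[100.0, 5.0], [110.0, 6.0], [105.0, 7.0]]
X, y = make_one_step_pairs(rows, load_index=0)
```

Each input row is matched with the load one hour later, so a series of length T yields T - 1 training pairs.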

Consequently, the performance of the models is enhanced by tuning their parameters using the validation dataset. Generally, traditional time series forecasting methods produce less accurate results as the interval between the training time and the forecasting time increases. In this paper, to overcome this limitation, time series cross-validation on a rolling basis is used as the validation technique. The short term load is forecasted using LSTM and compared against MLP, RBF, SVR, and RF with all features and with selected features. The comparison of the actual and forecast load of DST-I is plotted in Fig. 7. It shows that the forecasting models with the selected features produce better results than the models with all features. The LSTM with the RReliefF feature selection surpassed the other models. The LSTM is well suited to time series data: it extracts features from the input and keeps the sequence dependency for a longer period. The selected features also help the LSTM by reducing the complexity of extracting features from a large number of inputs. Generally, the performance of a forecasting model can be evaluated in many ways. In this paper, MAPE, MAE, MSE, and RMSE are used as the evaluation measures and are defined as follows,

Fig. 7

Forecasting Results of DST-I. (i) Comparison of actual and forecast load using MLP. (ii) Comparison of actual and forecast load using RBF. (iii) Comparison of actual and forecast load using SVR. (iv) Comparison of actual and forecast load using RF. (v) Comparison of actual and forecast load using LSTM.

$MAPE = \frac{1}{n} \sum_{t = 1}^{n} \left| \frac{Y_{t} - F_{t}}{Y_{t}} \times 100 \right|$ (25)

$MAE = \frac{1}{n} \sum_{t = 1}^{n} | Y_{t} - F_{t} |$ (26)

$MSE = \frac{1}{n} \sum_{t = 1}^{n} {(Y_{t} - F_{t})}^{2}$ (27)

$RMSE = \sqrt{\frac{1}{n} \sum_{t = 1}^{n} {(Y_{t} - F_{t})}^{2}}$ (28)

where ‘n’ represents the number of samples, ‘Y_t’ represents the actual observation for the period ‘t’ and ‘F_t’ represents the forecast for the same period ‘t’. The lower the error, the higher the accuracy of the forecast. Table 2 compares the performance of LSTM against MLP, RBF, SVR, and RF in terms of MAPE, MAE, MSE, and RMSE for DST-I. It demonstrates that feature selection plays an essential role in the forecasting. The relevant features have a strong impact on the target feature, whereas the irrelevant features complicate learning and may sometimes misguide the learning process, leading to inaccurate results. Table 2 shows that the LSTM with RReliefF outperformed the others by producing the least error. Similarly, each model with feature selection achieved better performance than the same model without feature selection.

Table 2

Comparison of Forecasting Results of DST-I in terms of MAPE, MAE, MSE and RMSE

MODELS	MAPE	MAE	MSE	RMSE
MLP	14.615	694.815	631782.500	794.847
MLP-FCBF	14.115	667.245	583568.688	763.917
MLP-MI	13.786	654.448	557243.750	746.488
MLP-RRELIEFF	13.471	638.949	531100.312	728.766
RBF	13.154	627.599	501105.656	707.888
RBF-FCBF	11.635	557.929	404686.125	636.149
RBF-MI	10.219	488.327	315247.031	561.469
RBF-RRELIEFF	9.798	469.590	295319.156	543.433
SVR	9.673	463.161	287249.656	535.957
SVR-FCBF	9.577	462.578	283719.250	532.653
SVR-MI	9.517	457.890	279543.562	528.719
SVR-RRELIEFF	9.494	456.140	280545.469	529.665
RF	9.451	453.223	276363.062	525.702
RF-FCBF	9.365	450.552	273707.844	523.171
RF-MI	9.190	443.093	266105.531	515.854
RF-RRELIEFF	9.139	440.289	259233.016	509.149
LSTM	9.041	434.848	256096.406	506.060
LSTM-FCBF	8.983	434.057	255441.812	505.413
LSTM-MI	8.626	417.212	234836.297	484.599
LSTM-RRELIEFF	8.140	392.801	211288.328	459.661
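The four error measures of Eqs. (25)-(28) reported in Table 2 can be computed as in the sketch below. The function names and the three-point example series are illustrative assumptions. The last lines only check that the MSE and RMSE reported for LSTM-RRELIEFF in Table 2 are consistent with each other.

```python
import math


def mape(actual, forecast):
    """Eq. (25): mean absolute percentage error (undefined if any actual value is 0)."""
    return sum(abs((y - f) / y * 100) for y, f in zip(actual, forecast)) / len(actual)


def mae(actual, forecast):
    """Eq. (26): mean absolute error."""
    return sum(abs(y - f) for y, f in zip(actual, forecast)) / len(actual)


def mse(actual, forecast):
    """Eq. (27): mean squared error."""
    return sum((y - f) ** 2 for y, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Eq. (28): root mean squared error."""
    return math.sqrt(mse(actual, forecast))


# Made-up example values, in MW.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 190.0, 400.0]
scores = {
    "MAPE": mape(actual, forecast),
    "MAE": mae(actual, forecast),
    "MSE": mse(actual, forecast),
    "RMSE": rmse(actual, forecast),
}

# Consistency check on the LSTM-RRELIEFF row of Table 2: RMSE = sqrt(MSE).
table2_mse, table2_rmse = 211288.328, 459.661
rmse_gap = abs(math.sqrt(table2_mse) - table2_rmse)
```

MAPE is scale free, which is why it is the natural measure for comparing DST-I and DST-II, whose load ranges differ by roughly an order of magnitude. MAE, MSE and RMSE are in the units of the load (or its square) and are only comparable within one dataset.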

The forecasting performance of the LSTM with RReliefF is also tested using DST-II. The following section discusses the experimental results obtained from DST-II. Like DST-I, the dataset consists of 22 features and 43848 instances. The relevant features that contribute to improving the forecasting performance on DST-II are identified using the filter feature selection methods FCBF, MI, and RReliefF. The merit scores of the features flatten after 12 features, so the top 12 features are identified as the important features related to the load. The features selected by FCBF, MI, and RReliefF are shown in Table 3.

Table 3

Selected Features from DST-II

Feature Selection Methodologies	Selected Features
Fast Correlation Based Filter	f₁, f₃, f₄, f₁₈, f₁₉, f₇, f₁₅, f₁₂, f₁₄, f₁₆, f₂₀, f₁₇
Mutual Information	f₁, f₃, f₄, f₁₆, f₂₀, f₂₁, f₅, f₁₅, f₁₈, f₁₇, f₁₉, f₁₂
RReliefF	f₁, f₃, f₄, f₂₀, f₅, f₂₂, f₁₆, f₁₇, f₂₁, f₁₉, f₁₈, f₁₅
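The text states that the top 12 features of DST-II were kept because the merit scores flatten after that point, but it does not give the exact rule. One possible way to automate such a cut-off is sketched below: rank the features by score and keep everything before the flat tail, where "flat" means that consecutive scores differ by less than a tolerance. The rule, the tolerance and the scores in the example are all assumptions made for illustration.

```python
def select_before_plateau(scores, tol):
    """Return the top-ranked feature names that precede the flat tail of the score curve."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    k = len(ranked)  # keep everything if no drop of at least tol is found
    # Walk from the bottom of the ranking up to the last drop of at least tol.
    for i in range(len(ranked) - 1, 0, -1):
        if ranked[i - 1][1] - ranked[i][1] >= tol:
            k = i
            break
    return [name for name, _ in ranked[:k]]


# Made-up merit scores for six hypothetical features.
example_scores = {
    "f3": 0.90, "f1": 0.70, "f4": 0.50,
    "f9": 0.10, "f8": 0.09, "f7": 0.08,
}
selected = select_before_plateau(example_scores, tol=0.05)
```

In this toy example the scores of the last three features differ by only 0.01, so they form the flat tail and the first three features are kept.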

The load and weather data at the previous timestep are given as input to the LSTM model, and the load at the next timestep is forecasted. The load demand of DST-II for the year 2012 is forecasted using LSTM and compared against MLP, RBF, SVR, and RF with all features and with selected features. Figure 8 shows the comparison of the actual and forecast load for DST-II. DST-II has a smoother load trend than DST-I, so the LSTM with RReliefF learns the sequence dependency in the load data more effectively. As a result, it produced a lower MAPE than on DST-I (3.249 against 8.140; see Table 2 and Table 4). This indicates that the LSTM with RReliefF also works well for DST-II.

Fig. 8

Forecasting Results of DST-II. (i) Comparison of actual and forecast load using MLP. (ii) Comparison of actual and forecast load using RBF. (iii) Comparison of actual and forecast load using SVR. (iv) Comparison of actual and forecast load using RF. (v) Comparison of actual and forecast load using LSTM.

Table 4 shows the comparison of the forecasting results of DST-II in terms of MAPE, MAE, MSE, and RMSE. It shows that every forecasting model with feature selection achieved a lower MAPE than the same model without feature selection. The LSTM with RReliefF outperformed the other models on all four measures owing to two levels of feature extraction: the first level is performed by the external feature selection, and the second by the internal feature extraction capability of the LSTM. It therefore uses the relevant features that contribute to forecasting the future load and provides better results.

Table 4

Comparison of Forecasting Results of DST-II in terms of MAPE, MAE, MSE and RMSE

MODELS	MAPE	MAE	MSE	RMSE
MLP	5.073	2665.987	9683551.000	3111.840
MLP-FCBF	4.263	2234.771	7086835.500	2662.111
MLP-MI	4.144	2194.962	7285942.000	2699.248
MLP-RRELIEFF	4.019	2137.533	6607934.500	2570.590
RBF	3.900	2090.270	6781258.500	2604.085
RBF-FCBF	3.895	2053.681	6368274.500	2523.544
RBF-MI	3.895	2055.712	6429664.000	2535.678
RBF-RRELIEFF	3.860	2041.626	6308346.000	2511.642
SVR	3.855	2041.881	6349896.500	2519.900
SVR-FCBF	3.851	2047.635	6347484.000	2519.421
SVR-MI	3.811	2028.973	6176549.000	2485.266
SVR-RRELIEFF	3.759	1986.620	6149297.000	2479.778
RF	3.745	1980.671	6078395.000	2465.440
RF-FCBF	3.709	1961.078	5993257.500	2448.113
RF-MI	3.688	1969.675	6034218.500	2456.465
RF-RRELIEFF	3.668	1942.470	6368511.500	2523.591
LSTM	3.593	1939.132	6048781.821	2459.427
LSTM-FCBF	3.319	1801.689	5767432.201	2401.548
LSTM-MI	3.294	1807.212	5911965.184	2431.453
LSTM-RRELIEFF	3.249	1774.291	5736080.084	2395.011

The time complexity of the short term load forecasting using LSTM with RReliefF is O(n/2.a+w). Finding the nearest instances for calculating the importance of the features requires O(n/2.a) steps for ‘a’ features and ‘n’ instances. The computational complexity of updating each weight in the LSTM per timestep is O(1). Hence, the overall complexity of the LSTM per timestep is O(w), where ‘w’ represents the number of weights. The forecasting errors shown in Table 2 and Table 4 support the use of the LSTM with RReliefF for producing accurate results. In addition, the training time of the LSTM with RReliefF is compared against MLP, RBF, SVR, RF and LSTM. For DST-I, the training times of MLP, RBF, SVR, RF, LSTM and LSTM with RReliefF are 194 s, 157 s, 128 s, 106 s, 31 s and 15 s respectively. For DST-II, the training times of MLP, RBF, SVR, RF, LSTM and LSTM with RReliefF are 190 s, 162 s, 104 s, 85 s, 25 s and 12 s respectively. Owing to the removal of outliers, the reduction of the dimension of the data, and the effective handling of non-linear temporal dependency, the complexity and the training time of the LSTM with RReliefF are substantially lower than those of the other models.

4.1 Forecasting accuracy significance test

The superiority of the LSTM with RReliefF is verified in terms of performance measures such as MAPE, MAE, MSE, and RMSE. In addition, two statistical tests are conducted to demonstrate the improvement in the forecasting performance of the LSTM with RReliefF. In the present paper, based on the research recommendations provided by [13] and [31], the Wilcoxon signed-rank test (pairwise comparison test) and the Friedman test (multiple comparisons test) are conducted to verify the significance of the LSTM with RReliefF model.

The Wilcoxon signed-rank test is a widely used nonparametric test conducted between two sets of data of the same size. It is used to perform the pairwise significance test between two models. Like the sign test, it is based on difference scores; however, in addition to analyzing the signs of the differences, it also takes into account their magnitudes. The null hypothesis (H₀) is that the median of the differences between the two paired samples is zero. The i^th forecasting error (e_i) is calculated from the i^th forecast values of the two models and is used to compute the statistic (W_statistic) as follows,

$W_{statistic} = \min \{r^{+}, r^{-}\}$ (29)

where ‘r⁺’ represents the sum of the ranks for which the first model is greater than the second model, and ‘r⁻’ represents the sum of the ranks for which the first model is smaller than the second one. When e_i > 0, the rank is added to ‘r⁺’; when e_i < 0, the rank is added to ‘r⁻’; and when e_i = 0, the i^th pair is excluded from the comparison and the total sample size is reduced. Then, the critical value ‘W’ is determined as follows,

$W = \frac{N (N + 1)}{4}$ (30)

where ‘N’ represents the number of data points. If the observed value of ‘W_statistic’ is less than or equal to the critical value ‘W’, the null hypothesis is rejected, which shows the improved performance of the model [18, 69]. The models used in this paper are independent of each other. The Wilcoxon signed-rank test is conducted pairwise between the LSTM with RReliefF and each other model.
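The computation of Eqs. (29) and (30) can be sketched as follows. The sketch assumes that e_i is the difference between the two models' errors on the i-th forecast and that tied absolute differences receive their average rank; both choices, like the function names and the six example values, are assumptions rather than details given in the paper.

```python
def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def wilcoxon_statistic(errors_a, errors_b):
    """Eq. (29): return (W_statistic, N) after dropping zero differences."""
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    r_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    r_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return min(r_plus, r_minus), len(diffs)


def eq30_value(n):
    """Eq. (30): N (N + 1) / 4."""
    return n * (n + 1) / 4


# Made-up errors of two models on six forecasts; the last pair is a tie.
errors_a = [5.0, 1.0, 7.0, 2.0, 9.0, 3.0]
errors_b = [4.0, 3.0, 4.0, 6.0, 4.0, 3.0]
w_stat, n_used = wilcoxon_statistic(errors_a, errors_b)

# With the 8784 hourly points of the 2012 test year, Eq. (30) gives 19,291,860.
w_2012 = eq30_value(8784)
```

In the example the differences are 1, -2, 3, -4 and 5 (the tied pair is dropped), so r⁺ = 9, r⁻ = 6 and W_statistic = 6. The value obtained from Eq. (30) for N = 8784 matches the ‘W’ listed in Table 5 and Table 6.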

The Friedman test is also a nonparametric statistical test. It determines whether there are significant differences between the forecasting errors produced by two or more models. The null hypothesis is that the forecasting errors of the compared models are the same [18, 69]. The statistic ‘F’ of the Friedman test is computed as follows,

$F = \frac{12 N}{k (k + 1)} \left[ \sum_{j = 1}^{k} R_{j}^{2} - \frac{k {(k + 1)}^{2}}{4} \right]$ (31)

where ‘k’ represents the total number of forecasting models used for comparison, ‘N’ represents the number of forecasting values and ‘R_j’ represents the average rank of the forecasting model ‘j’ [4, 68]. The ‘R_j’ is calculated as follows,

$R_{j} = \frac{1}{N} \sum_{i = 1}^{N} r_{i}^{j}$ (32)

where $r_{i}^{j}$ is the rank of the i^th forecasting error of model ‘j’ among the ‘k’ models.
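Eqs. (31) and (32) translate into a few lines of code. In the sketch below each row holds the errors of all ‘k’ models on one forecast, the errors in a row are ranked from smallest to largest, and ties share their average rank. The tie handling, the function names and the example matrix are assumptions made for illustration.

```python
def average_ranks(values):
    """Rank values in ascending order (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def friedman_statistic(errors):
    """errors[i][j] is the error of model j on forecast i; returns (F, mean ranks)."""
    n = len(errors)
    k = len(errors[0])
    rank_sums = [0.0] * k
    for row in errors:
        for j, rank in enumerate(average_ranks(row)):
            rank_sums[j] += rank
    mean_ranks = [s / n for s in rank_sums]                      # Eq. (32)
    f_stat = 12 * n / (k * (k + 1)) * (
        sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4    # Eq. (31)
    )
    return f_stat, mean_ranks


# Made-up errors of three models on three forecasts, with the same ordering each time.
errors = [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [0.5, 0.7, 0.9]]
f_stat, mean_ranks = friedman_statistic(errors)
```

Here the models are ranked identically on every forecast, so the mean ranks are 1, 2 and 3 and F reaches its maximum N(k - 1) = 6 for this table size. If all models tied on every forecast, the mean ranks would all equal (k + 1)/2 and F would be 0.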

If the statistic value of ‘F’ is larger than the Friedman critical value (which is obtained from the Friedman critical value table) and the p-value is less than ‘α’, then the null hypothesis is rejected. The results of the Wilcoxon signed-rank test and the Friedman test for DST-I and DST-II are shown in Table 5 and Table 6 respectively.

Table 5

Results of Wilcoxon signed-rank test and Friedman test obtained from DST-I

Compared Models	Wilcoxon Signed-Rank Test				Friedman Test
	α= 0.02	p-value	α= 0.05	p-value	α= 0.05
	W = 19,291,860		W = 19,291,860
LSTM-RReliefF vs. MLP	2231133.0^b	0.0000^*	2231133^b	0.0000^**	H₀: e₁ = e₂ =
LSTM-RReliefF vs. RBF	986839.0^b	0.0000^*	986839^b	0.0000^**	e₃ = e₄ = e₅ =
LSTM-RReliefF vs. SVR	4971843.5^b	0.0000^*	4971843.5^b	0.0000^**	e₆ = e₇ = e₈ =
LSTM-RReliefF vs. RF	5095021.0^b	0.0000^*	5095021^b	0.0000^**	e₉ = e₁₀ = e₁₁ =
LSTM-RReliefF vs. LSTM	6538324.0^b	0.0000^*	6538324^b	0.0000^**	e₁₂ = e₁₃ = e₁₄ =
LSTM-RReliefF vs. MLP-FCBF	3268371.0^b	0.0000^*	3268371^b	0.0000^**	e₁₅ = e₁₆ = e₁₇ =
LSTM-RReliefF vs. MLP-MI	2855260.0^b	0.0000^*	2855260^b	0.0000^**	e₁₈ = e₁₉ = e₂₀
LSTM-RReliefF vs. MLP-RRELIEFF	2380948.5^b	0.0000^*	2380948.5^b	0.0000^**
LSTM-RReliefF vs. RBF-FCBF	3183097.5^b	0.0000^*	3183097.5^b	0.0000^**
LSTM-RReliefF vs. RBF-MI	3733468.0^b	0.0000^*	3733468^b	0.0000^**	F = 59673.85
LSTM-RReliefF vs. RBF-RRELIEFF	3673605.0^b	0.0000^*	3673605^b	0.0000^**	p = 0.0000
LSTM-RReliefF vs. SVR-FCBF	5977937.0^b	0.0000^*	5977937^b	0.0000^**	(Reject H₀)
LSTM-RReliefF vs. SVR-MI	4172278.5^b	0.0000^*	4172278.5^b	0.0000^**
LSTM-RReliefF vs. SVR-RRELIEFF	4357516.5^b	0.0000^*	4357516.5^b	0.0000^**
LSTM-RReliefF vs. RF-FCBF	6488165.5^b	0.0000^*	6488165.5^b	0.0000^**
LSTM-RReliefF vs. RF-MI	4638320.5^b	0.0000^*	4638320.5^b	0.0000^**
LSTM-RReliefF vs. RF-RReliefF	6105123.5^b	0.0000^*	6105123.5^b	0.0000^**
LSTM-RReliefF vs. LSTM-FCBF	7321100.0^b	0.0000^*	7321100^b	0.0000^**
LSTM-RReliefF vs. LSTM-MI	5656479.5^b	0.0000^*	5656479.5^b	0.0000^**

^bIndicates that the LSTM with RReliefF significantly surpasses the other compared models; ^*Indicates that the null hypothesis is rejected at α= 0.02; ^**Indicates that the null hypothesis is rejected at α= 0.05.

Table 6

Results of Wilcoxon signed-rank test and Friedman test obtained from DST-II

Compared Models	Wilcoxon Signed-Rank Test				Friedman Test
	α= 0.02	p-value	α= 0.05	p-value	α= 0.05
	W = 19,291,860		W = 19,291,860
LSTM-RReliefF vs. MLP	3249.0^b	0.0000^*	3249.0^b	0.0000^**	H₀: e₁ = e₂ =
LSTM-RReliefF vs. RBF	792331.5^b	0.0000^*	792331.5^b	0.0000^**	e₃ = e₄ = e₅ =
LSTM-RReliefF vs. SVR	526.0^b	0.0000^*	526.0^b	0.0000^**	e₆ = e₇ = e₈ =
LSTM-RReliefF vs. RF	1428.5^b	0.0000^*	1428.5^b	0.0000^**	e₉ = e₁₀ = e₁₁ =
LSTM-RReliefF vs. LSTM	4633503.5^b	0.0000^*	4633503.5^b	0.0000^**	e₁₂ = e₁₃ = e₁₄ =
LSTM-RReliefF vs. MLP-FCBF	530.0^b	0.0000^*	530.0^b	0.0000^**	e₁₅ = e₁₆ = e₁₇ =
LSTM-RReliefF vs. MLP-MI	199960.5^b	0.0000^*	199960.5^b	0.0000^**	e₁₈ = e₁₉ = e₂₀
LSTM-RReliefF vs. MLP-RRELIEFF	14106.5^b	0.0000^*	14106.5^b	0.0000^**
LSTM-RReliefF vs. RBF-FCBF	560.0^b	0.0000^*	560.0^b	0.0000^**	F = 88329.33
LSTM-RReliefF vs. RBF-MI	4369.0^b	0.0000^*	4369.0^b	0.0000^**	p = 0.0000
LSTM-RReliefF vs. RBF-RRELIEFF	26752.5^b	0.0000^*	26752.5^b	0.0000^**	(Reject H₀)
LSTM-RReliefF vs. SVR-FCBF	2984.0^b	0.0000^*	2984.0^b	0.0000^**
LSTM-RReliefF vs. SVR-MI	521213.5^b	0.0000^*	521213.5^b	0.0000^**
LSTM-RReliefF vs. SVR-RRELIEFF	55590.0^b	0.0000^*	55590.0^b	0.0000^**
LSTM-RReliefF vs. RF-FCBF	21064.0^b	0.0000^*	21064.0^b	0.0000^**
LSTM-RReliefF vs. RF-MI	1306847.5^b	0.0000^*	1306847.5^b	0.0000^**
LSTM-RReliefF vs. RF-RReliefF	7890871.5^b	0.0000^*	7890871.5^b	0.0000^**
LSTM-RReliefF vs. LSTM-FCBF	18678911.5^b	0.0000^*	18678911.5^b	0.0000^**
LSTM-RReliefF vs. LSTM-MI	8822561.5^b	0.0000^*	8822561.5^b	0.0000^**

^bIndicates that the LSTM with RReliefF significantly surpasses the other compared models; ^*Indicates that the null hypothesis is rejected at α= 0.02; ^**Indicates that the null hypothesis is rejected at α= 0.05.

For DST-I, the Wilcoxon signed-rank test is conducted between the LSTM with RReliefF and each other model at the significance levels α= 0.02 and α= 0.05. In both cases, the test produces a ‘W_statistic’ value less than the critical value ‘W’, and the p-value is less than ‘α’, so the null hypothesis is rejected. The multiple comparison Friedman test is conducted with α= 0.05. The null hypothesis is rejected since the p-value is less than ‘α’ and the Friedman statistic ‘F’ is larger than the Friedman critical value. The Wilcoxon signed-rank test and the Friedman test show that the LSTM with RReliefF model is superior to the other models.

The superiority of the LSTM with RReliefF is also tested with DST-II. The Wilcoxon signed-rank test is applied at the significance levels α= 0.02 and α= 0.05, and the Friedman test is conducted with α= 0.05. In both tests, the null hypothesis is rejected. The significance test results shown in Table 5 and Table 6 indicate that the feature selection contributes significantly to improving the forecasting performance. In both case studies, DST-I and DST-II, the performance measures (MAPE, MAE, MSE and RMSE) and the statistical tests (Wilcoxon signed-rank test and Friedman test) show that the LSTM with RReliefF significantly outperformed the other models.

5 Conclusion

Electricity load forecasting is becoming one of the critical issues in addressing energy crisis problems and has become an important research area of global concern. Short term load forecasting has a crucial role in power system planning, scheduling, operation, dispatching and maintenance. In this paper, clustering-based filter feature selection was introduced to remove outliers, reduce the curse of dimensionality, and reduce overfitting in short term load forecasting. The forecasting performance was improved by removing the outliers using clustering, by removing the irrelevant features using the filter feature selection methods FCBF, MI, and RReliefF, and by reducing the overfitting using the deep recurrent neural network based LSTM. The model considers calendar, weather and load features for forecasting the short term load. The performance of the forecasting models was evaluated in terms of MAPE, MAE, MSE, and RMSE. The LSTM with RReliefF model outperformed the others by producing the least error. The generality of the LSTM with RReliefF was demonstrated using the hourly recorded historical load demand and weather data of two European countries. The computational time was also reduced substantially by removing the irrelevant features and outliers. The significance of the LSTM with RReliefF model was verified by conducting the Wilcoxon signed-rank test and the Friedman test. The results show that short term load forecasting with filter feature selection, especially the LSTM with RReliefF, surpassed the other models. In the future, hybrid feature selection can also be incorporated with machine learning and deep learning to improve the forecasting performance.

References

Jornaz

and Samaranayake

V.A.

, A Multi-Step Approach to Modeling the 24-hour Daily Profiles of Electricity Load using Daily Splines, Energies 12(21) (2019), 4169.

Lahouar

and Slama

J.B.

, Day-ahead load forecast using random forest and expert input selection, Energy Conversion and Management 103 (2015), 1040–1051.

Karmel

, Adhithiyan

and Senthil Kumar

, Machine Learning Based Approach For Pothole Detection, Int J Civ Eng Technol 9(5) (2018), 882–888.

Gensler

, Wind Power Ensemble Forecasting: Performance Measures and Ensemble Architectures for Deterministic and Probabilistic Forecasts, kassel university press GmbH, 2019.

Senliol

, Gulgezen

, Yu

and Cataltepe

, Fast Correlation Based Filter (FCBF) with a different search strategy, In Proc. 23rd International Symposium on Computer and Information Sciences, IEEE, 2008, 1–4.

Zhao

, Liang

, Gao

and Liu

, Short-Term Load Forecasting Based on RBF Neural Network. In, Journal of Physics: Conference Series, IOP Publishing 1069(1) (2018), 012091.

Bennett

, Stewart

and Lu

, Autoregressive with exogenous variables and neural network short-term load forecast models for residential low voltage distribution networks, Energies 7(5) (2014), 2938–2960.

Jayakumar

and Chellappan

, Associativity based mobility-adaptive K-clustering in mobile ad-hoc networks, In International Conference on Intelligent Information Technology, Springer, Berlin, Heidelberg, 2004, 160–168.

Tian

, Ma

, Zhang

and Zhan

, A deep neural network model for short-term load forecast based on long short-term memory network and convolutional neural network, Energies 11(12) (2018), 3493.

10.

Yeom

C.U.

and Kwak

K.C.

, Short-term electricity-load forecasting using a TSK-based extreme learning machine with knowledge representation, Energies 10(10) (2017), 1613.

11.

Ortiz-Arroyo

, Skov

M.K.

and Huynh

, Accurate electricity load forecasting with artificial neural networks, In, Computational Intelligence for Modelling, Control and Automation and International Conference on Intelligent Agents, Web Technologies and Internet Commerce, IEEE 1 (2005), 94–99.

12.

Saeed

, Gazem

and Mohammed

, A Busalim and editors, Recent Trends in Data Science and Soft Computing, In Proc. of the 3rd International Conference of Reliable Information and Communication Technology, Springer; 2018.

13.

Wilcoxon

, Individual comparisons by ranking methods, Biom Bull 1 (1945), 80–83.

14.

Zhao

and Su

, Short-term load forecasting using Kalman filter and elman neural network, In Proc. of conference on Industrial Electronics and Applications, IEEE, 2007, 1043–1047.

15.

Dudek

, Multilayer perceptron for short-term load forecasting: from global to local approach, Neural Computing and Applications 14 (2019), 1–3.

16.

Dudek

, Short-term load forecasting using random forests, In Intelligent Systems, Springer, Cham, 2015, 821–828.

17.

Swaroop

, Senthil Kumar

and Muthamil Selvan

, An efficient model for share market prediction using data mining techniques”, Int J Appl Eng Res 9(17) (2014), 3807–3812.

18.

Fan

G.F.

, Peng

L.L.

and Hong

W.C.

, Short term load forecasting based on phase space reconstruction algorithm and bi-square kernel regression model, Applied Energy 224 (2018), 13–33.

19.

Chen

, Canizares

C.A.

and Singh

, ANN-based short-term load forecasting in electricity markets, In Proc. of IEEE Power Engineering Society Winter Meeting Conference (Cat. No. 01CH37194), 2001, 411–415.

20.

Gangurde

H.D.

, Feature selection using clustering approach for big data, International Journal of Computer Applications 975 (2014), 8887.

21.

Liu

and Motoda

, Computational methods of Feature selection, Taylor & Francis Group, LLC, 2008.

22.

Koprinska

, Rana

and Agelidis

V.G.

, Correlation and instance based feature selection for electricity load forecasting, Knowledge-Based Systems 82 (2015), 29–40.

23.

Koprinska

, Rana

and Agelidis

V.G.

, Yearly and seasonal models for electricity load forecasting. In Proc. of International Joint Conference on Neural Networks, IEEE, 2011, 1474–1481.

24.

Moon

, Kim

and Son EHwang

, Hybrid Short-Term Load Forecasting Scheme Using Random Forest and Multilayer Perceptron, Energies 11(12) (2018), 3283.

25.

Taylor

J.W.

, An evaluation of methods for very short-term load forecasting using minute-by-minute British data, International Journal of Forecasting 24(4) (2008), 645–658.

26.

Wang

, Xu

, Zhang

, Guo

and Zomaya

A.Y.

, Robust Big Data Analytics for Electricity Price Forecasting in the Smart Grid, IEEE Trans Big Data 5(1) (2017), 34–45.

27.

Kong

, Dong

Z.Y.

, Jia

, Hill

D.J.

, Xu

and Zhang

, Short-term residential load forecasting based on LSTM recurrent neural network, IEEE Transactions on Smart Grid 10(1) (2017), 841–851.

28.

Barolli

, Xhafa

, Khan

Z.A.

, Odhabi

, and editors, Advances in Internet, Data and Web Technologies, In 7th International Conference on Emerging Internet, Data and Web Technologies, Springer, 2019.

29.

and Liu

, Feature selection for high-dimensional data: A fast correlation-based filter solution, In Proc. of the 20th International Conference on Machine Learning, Washington DC (2003), 856–863.

30.

De Felice

, Yao

, Short-term load forecasting with neural network ensembles: A comparative study, IEEE Computational Intelligence Magazine 6(3) (2011), 47–56.

31.

Friedman

, A comparison of alternative tests of significance for the problem of m rankings, Ann Math Stat 11 (1940), 86–92.

32.

Ghiasi

, Irani Jam

, Teimourian

, Zarrabi

and Yousefi

, A new prediction model of electricity load based on hybrid forecast engine, International Journal of Ambient Energy 4 (2017), 1–8.

33.

Rafiei

, Niknam

, Aghaei

, Shafie-Khah

and Catalão

J.P.

, Probabilistic load forecasting using an improved wavelet neural network trained by generalized extreme learning machine, IEEE Transactions on Smart Grid 9(6) (2018), 6961–6971.

34.

Robnik-Šikonja

and Kononenko

, Theoretical and empirical analysis of ReliefF and RRelief, Machine Learning 53(1-2) (2003), 23–69.

35.

Sarhani

and El Afia

, Feature selection and parameter optimization of support vector regression for electric load forecasting. In Proc. International Conference on Electrical and Information Technologies, IEEE, 2016, 288–293.

36.

Naimur Rahman

, Esmailpour

, Zhao

, Machine Learning with Big Data An Efficient Electricity Generation Forecasting System, Big Data Res 5 (2016), 9–15.

37.

Rana

, Koprinska

and Agelidis

V.G.

, Feature selection for electricity load prediction, In, Proc 19th Int Conf Neural Inf. Process. ICONIP, Springer, Berlin, Heidelberg 7664 (2012), 526–534.

38.

Sarhani

and El Afia

, Electric load forecasting using hybrid machine learning approach incorporating feature selection, In BDCA, 2015, 1–7.

39.

Rana

, Koprinska

and Khosravi

, Feature Selection for Interval Forecasting of Electricity Demand Time Series Data, Artificial Neural Networks, Springer, Cham, 2015, 445–462.

40.

Agila

and Senthil Kumar

, An Efficient Crop Identification Using Deep Learning, International Journal of Scientific & Technology Research 9(1) (2020).

41.

Siddarameshwara

, Yelamali

and Byahatti

, Electricity short term load forecasting using elman recurrent neural network. In Proc of International Conference on Advances in Recent Technologies in Communication and Computing, IEEE, 2010, 351–354.

42.

Zeng

, Zhang

, Liu

, Liang

and Alsaadi

F.E.

, A switching delayed PSO optimized extreme learning machine for short-term load forecasting, Neurocomputing 240 (2017), 175–182.

43.

Huang

, Hu

, Cai

and Yang

, Short term electrical load forecasting using mutual information based feature selection with generalized minimum-redundancy and maximum-relevance criteria, Entropy 18(9) (2016), 330.

44. Abedinia, Amjady and Zareipour, A New Feature Selection Technique for Load and Price Forecast of Electrical Power Systems, IEEE Trans Power Syst 32(1) (2017), 62–74.

45. Abbas O.A., Comparisons Between Data Clustering Algorithms, Int Arab J Inf Technol 5(3) (2008), 320–325.

46. Senthil Kumar, Improved Prediction of Wind Speed using Machine Learning, EAI Endorsed Trans Energy Web 6(23) (2019).

47. Chang P.C., Fan C.Y. and Hsieh J.C., A weighted evolving fuzzy neural network for electricity demand forecasting. In Proc. of First Asian Conference on Intelligent Information and Database Systems, IEEE, 2009, 330–335.

48. Song, Ni and Wang, A fast clustering-based feature subset selection algorithm for high-dimensional data, IEEE Transactions on Knowledge and Data Engineering 25(1) (2011), 1–4.

49. Sugumar, Rengarajan and Jayakumar, A technique to stock market prediction using fuzzy clustering and artificial neural networks, Computing and Informatics 33(5) (2015), 992–1024.

50. Fong, Biuk-Aghai R.P. and Si Y.W., Lightweight Feature Selection Methods Based on Standardized Measure of Dispersion for Mining Big Data. In Proc Computer and Information Technology (CIT), IEEE International Conference, 2016, 553–559.

51. Jancy and Jayakumar, Pivot variable location-based clustering algorithm for reducing dead nodes in wireless sensor networks, Neural Computing and Applications 31(5) (2019), 1467–1480.

52. Khatoon and Singh A.K., Effects of various factors on electric load forecasting: An overview. In Proc of Power India International Conference (PIICON), IEEE, 2014, 1–5.

53. S.K. and Lopez, A Review on Feature Selection Methods for High Dimensional Data, Intern J Eng Technol 8(2) (2016), 669–672.

54. P S.K. and Lopez, Forecasting of Wind Speed Using Feature Selection and Neural Networks, Int J Renew Energy Res 6(3) (2016).

55. P S.K., A Review of Soft Computing Techniques in Short-Term Load Forecasting, Int J Appl Eng Res 12(18) (2017), 7202–7206.

56. Kumar and Lopez, Feature Selection used for Wind Speed Forecasting with Data Driven Approaches, J Eng Sci Technol Rev 8(5) (2015), 124–127.

57. Pramono S.H., Rohmatillah, Maulana, Hasanah R.N. and Hario, Deep Learning-Based Short-Term Load Forecasting for Supporting Demand Response Program in Hybrid Energy System, Energies 12(17) (2019), 3359.

58. Kelo S.M. and Dudul S.V., Short-term Maharashtra state electrical power load prediction with special emphasis on seasonal changes using a novel focused time lagged recurrent neural network based on time delay neural network model, Expert Systems with Applications 38(3) (2011), 1554–1564.

59. Salkuti S.R., Short-term electrical load forecasting using hybrid ANN–DE and wavelet transforms approach, Electrical Engineering 100(4) (2018), 2755–2763.

60. Kiruthika V.G., Arutchudar and Senthil Kumar, Highest humidity prediction using data mining techniques, Int J Appl Eng Res 9(16) (2014), 3259–3264.

61. Yang, Wang and Wang, Research and application of a novel hybrid model based on data selection and artificial intelligence algorithm for short term load forecasting, Entropy 19(2) (2017), 52.

62. Hong W.C., Li M.W., Geng and Zhang, Novel chaotic bat algorithm for forecasting complex motion of floating platforms, Applied Mathematical Modelling 72 (2019), 425–443.

63. Hong W.-C., Hybrid Advanced Optimization Methods with Evolutionary Computation Techniques in Energy Forecasting, Energies 2018.

64. Power System Short-term Load Forecasting Based on Improved Support Vector Machine, International Symposium on Knowledge Acquisition and Modeling, IEEE, 2008, 658–662.

65. Yao, Evolving artificial neural networks, Proceedings of the IEEE 87(9) (1999), 1423–1447.

66. Chen, Xu, Chu, Li, Wu, Ni, Bao and Wang, Short-term electrical load forecasting using the Support Vector Regression (SVR) model to calculate the demand response baseline for office buildings, Applied Energy 195 (2017), 59–70.

67. Dong, Wang and Guo, Research and Application of Hybrid Forecasting Model Based on an Optimal Feature Selection System: A Case Study on Electrical Load Forecasting, Energies 10(4) (2017), 490.

68. Dong, Zhang and Hong W.C., A hybrid seasonal mechanism with a chaotic cuckoo search algorithm with a support vector regression model for electric load forecasting, Energies 11(4) (2018), 1009.

69. Zhang and Hong W.C., Electric load forecasting by complete ensemble empirical mode decomposition adaptive noise and support vector regression with quantum-based dragonfly algorithm, Nonlinear Dynamics 98(2) (2019), 1107–1136.