Abstract
The standard backpropagation algorithm has already proven its effectiveness in most of the potential problems, but the major limitation is entrapment of local minima and slow convergence rate. To address these issues, a modified backpropagation algorithm has been proposed by adding a third term called inertia, the physical component used to accelerate the network towards the convergence without getting stuck into local minima. The Chebyshev polynomial form is a convenient method for expanding a function in a linear independent term. Inertia has been expanded using Chebyshev polynomial which is used as a third term in weight updation. The performance of the proposed algorithm outperforms the standard backpropagation algorithm (SBP) and the backpropagation algorithm with momentum (SBPM). The proposed algorithm was tested with the standard benchmark problems such as XOR problem, parity checking problem and dataset from UCI machine learning repository such as iris flower classification, wheat classification, breast cancer detection and wine classification. Experimental results show that the addition of the third parameter called inertia in the backpropagation algorithm gave better performance and faster convergence rate compared to the SBP and SBPM.
Introduction
Artificial Neural Network is a logical model used to simulate the learning process of the human brain. The Artificial Neural Network basically work like actual biological neurons in the brain and it is composed of a collection of tiny processing unit called artificial neurons trained to perform the complex calculations. Since the human brain identifies and classifies those things by experience with a lot of learning process, the Artificial Neural Network also trained not to be programmed.
There are many different complex applications solved successfully by an Artificial Neural Network. Some of those are Handwritten character recognition, predicting future trends, health care, diagnosing diseases, medicine, engineering, manufacturing, ocean exploration, predicting an earthquake and so on [1–5].
There are different types of feed-forward network available. Each network has unique characteristics and designed for some specific purposes. The architecture of feed-forward neural network consists of an input, hidden and an output layer. Each and every node in a layer is connected to every other node in the neighboring layer. The layers are all usually nonlinear processing units but it can use linear activation function to fire the output neuron. The goal of the network is to produce the expected output when a set of input is fed to the network. The output of one layer is given as input to another layer.
The backpropagation neural network algorithm (SBP) is one of the most popular and robust oldest algorithms proposed by Hinton, Rumelhart, and Williams [6]. The BP algorithm actually works to minimize the cost function namely, the mean square error between the actual output and desired output of the network [7]. It is widely used in a lot of practical problems and shows its efficiency especially in the classification problem [8], but there is some drawback such as inaccurate prediction result [9]. According to the research work [10, 11], reported clearly about the poor convergence rate of the backpropagation algorithm because of local minima and flatspot problem. But still, the BP algorithm is preferred and modified to give an enhanced version by a lot of researchers because of its simplicity in both calculation and implementation compared to other algorithms.
The major part of machine learning would be having an optimization algorithm as its basis. The Gradient descent is an optimization algorithm and it works to minimize the cost function by identifying the values of its parameters. The backpropagation algorithm uses the optimization method called gradient descent learning rule to reduce the mean square error between the actual and desired output by having careful selection of free parameters of the network such as learning rate, activation function, initial weight, and bias. The wrong selection of these parameter values lead the network to have slow convergence rate which would stuck into local minima and sometimes network error also.
Apart from initial weights and bias, the important parameters which are often used for weight adjustment is learning rate. Qambar Abbas et al. [12], proposed a novel variable scheme of learning rate. The author proved that the changes in learning rate also changes the training behavior of the backpropagation algorithm [12]. Elsadek and Zahraa [13], proposed an improved error backpropagation algorithm by using cross entropy as a error function and adaptive learning rate. Instead of keeping constant learning rate, the value of learning rate modified based on weight and gradient value of previous iteration gave better accuracy and convergence rate.
Instead of using backpropagation in classification problems, Kosbatwar and S. K. Pathan [14], used for pattern recognition problem and showed better performance in the problem of character recognition. Sornam and Thangavel [15], proposed three-term optical backpropagation algorithm by introducing a nonlinear error function to speed up the convergence rate. Ashwini Sapkala and Kulkarni [16], modified the backpropagation algorithm by added the white gaussian noise in the weighted sum entity and showed that noise injected backpropagation network required less number of epoch for convergence when compared with the SBP. Automatic plate recognition system developed by Joseph Tarigan et al. [17], where they employed the genetic algorithm to effectively identify the number of hidden neurons, learning rate and momentum rate which is used for the backpropagation algorithm and showed that the proposed method 37% faster in the convergence than the SBP. Qun Dai et al. [18], proposed the new two phased and ensemble scheme integrated backpropagation algorithm (TP-ES-BP) for face recognition problem and reveals that the TP-ES-BP algorithm work better to alleviate the problem of local minima in the SBP. Florin Gorunescu and Smaranda Belciug [19], boosted up the SBP with the aid of stimulus sampling technique applied at the output neuron and showed it is more efficient approach for better convergence.
Gibb and Johndar [20], has studied different types of modification in backpropagation algorithm such as Algorithmic modifications, Heuristic change, Regularizer, Techniques on networks and Generative Networks. In algorithmic modification, the Standard BP algorithm(SBP) was modified by introducing momentum term, Quasi Newton method as a second order method, and Resillent backpropation algorithm. In heuristic change method, variation were made in the manner by using initial weight selection technique, learning rate and weight updating method. In the regularizer method, the changes were made in the error function and minimizing the architecture of network. In the category of techniques on network, the modification were made in a way by varying network operations. Example of such work were recurrent neural network and coupled neuron. In final category, generative network, variation were made in SBP which gave the ability to grow. Example of such work was pruning.
In this paper, a modified backpropagation algorithm is proposed with the third term called inertia (MBPI) which is expanded in the form of Chebyshev orthogonal polynomial added in the weight updation rule which is used to accelerate the network for faster convergence. The proposed method shows better performance when compared to standard backpropagation algorithm (SBP) and backpropagation algorithm with momentum (SBPM). The paper is organized as follows: Section 2 explains the proposed algorithm followed by the simulation result in section 3. Section 4 discusses the conclusion.
Proposed modified backpropagation algorithm with inertia (MBPI)
Chebyshev orthogonal polynomials
Chebyshevpolynomial The Chebyshev polynomials of the first kind are a set of orthogonal polynomials defined as the solutions to the Chebyshev differential equation and denoted as recurrence relation T n (x) [21]. They have applied as an approximation to the least squares, a special case of the Gegenbauer polynomial [22] with α = 0, where α is the index of the polynomial. They are also intimately connected with trigonometric multiple-angle formulae. The Chebyshev polynomials of the first kind are denoted as T n (x), and are implemented in the Wolfram Language as Chebyshev T[n, x]. They are normalized in such a way that T n (1) =1. The first few polynomials are illustrated for x ∈ [-1, 1], n=1, 2,..., 5.
The Chebyshev polynomial of the first kind can be defined by the contour integral function,
The Chebyshev polynomials are defined through the identity,
Based on the first two law of physics stated by Sir Issac Newton, An object at rest stay remain in rest An object at motion stay remain in motion
In order to overcome this, apply a force required to move an object. Unlike momentum and mass, it is not the product of mass and velocity. Hence the property of inertia is more useful for the navigation system. The amount of mass possessed by an object can be calculated by measuring the extent the object observe its inertia. This can be done by measuring the amount of force applied to produce a certain acceleration.
Inertia can be classified as Translational Inertia and Rotational Inertia.
The translational inertia [23] is simply the product of the mass (m) and its acceleration (a),
Substituting Equation (3) in Equation (5) for mass (m), we get,
The standard backpropagation network consists of one input layer, one output layer and one hidden layer between the input and output layer. These layers have to be fully connected in a feedforward fashion so that the neurons in the input unit fully connected to neurons in the hidden layers, similarly the connections in between the hidden and output layer.
The modified Backpropagation network with inertia augmented the weight connection (U) added with momentum (β) and inertia (γ) in the input to hidden layer whereas in the hidden to output layer weights (V) are added with momentum (β) alone. Figure 1 shows the structure of the proposed modified neural network,

Proposed modified neural network with inertia.
The modified algorithm is an enhanced version of standard backpropagation algorithm containing the third term inertia which is expanded in the form of Chebyshev polynomial that help to increase the convergence speed of the network. The addition of inertia term helps to get out from the chance of stuck into the local minima.
Generate random weight matrices U and V between input to hidden layer and hidden to output layer respectively. From each training pattern set T={X
i
, O
i
} where X
i
is the input vector and O
i
is the output vector, feed the input vector to the input layer. Compute net input for each hidden node using
Compute output for each hidden nodes
Compute net input for each output node
Calculate output for nodes in the output layer by applying activation function
Calculate gradient of output node
Compute error at output nodes as follows
Error at hidden neuron can be calculated as
The weight updation with momentum,
The weight updation with momentum and inertia:
Repeat Step 2 to 11 until mean square error reduced to 0.01.
The proposed system was developed by using Python (version 3.5) in the Spyder environment, 64 bit windows 10 OS with the RAM 4 GB and the processor is Intel core I5, 7 th Generation. The speed of the processor is 2.50GHz.
The performance of the proposed algorithm MBPI has been experimented using selected example benchmark problems such as (i) XOR problem, (ii) Parity Checking problem and four datasets [24] from UCI Machine Learning repository. The weights are randomly generated in the range of (-0.01, 0.01) for binary class problems and (-0.05, 0.05) for multiclass problem. For all the selected problems, the hyperbolic tangent activation function (1 - e-x)/1 - e
x
) was used to fire the hidden neurons and if the problem is binary classification, then sigmoid activation function 1/(1 + e-x) has been used to fire the output neuron, otherwise softmax function
XOR Problem
The “exclusive or” problem is one of the classic problems in artificial neural network research. It is the problem of predicting the output of XOR gate for given two binary inputs. The function returns a true value as 1 if both inputs are not same and return a false value as 0 if both inputs are same. The proposed algorithm converged by its 160 th epoch with MSE of 0.01. For the same number of epoch, the SBP and SBPM gave MSE of 0.96 and 0.41 respectively. Figure 2 reveals the implementation result of SBP, SBP with Momentum and MBP with Inertia for XOR problem by presenting learning curve of these algorithms. This clearly shows the speedy convergence for the proposed method.

Learning curve plot for XOR.
The parity checking is a method used to confirm that the data transmitted in between the communication node is correct. It is the simplest form of error detecting code. The parity bit is padded to the original binary string to indicate it as odd parity or even parity. If the number of bit 1 in the original string is odd, then it is odd parity otherwise it is termed as even parity. 25 combinations of 5-bit binary string has been generated and given as input to the network. First 16 combination of inputs were used for training and remaining 16 for testing the network. As compared with existing method SBP and SBPM which gave the classification accuracy of 95% and 97% for 500 epochs with a mean square error of 2.50 and 0.63, the proposed MBPI reached its targeted MSE 0.01 with the classification accuracy of 99.9% respectively.
Figure 3 shows the performance of algorithms by representing a learning curve for 500 epochs.

Learning curve plot for parity checking problem.
Iris is the best-known dataset [24] from UCI machine learning repository. The iris dataset contains 150 instances to classify three different type of iris flowers such as Iris Setosa, Iris Virginica, and Iris versicolor. This dataset is also called Fisher dataset. The proposed MBPI converged to the target mean square error of 0.01 by its 200
th
epoch. The MSE of SBP and SBPM were 0.05 and 0.04 respectively for the same number of epoch. The shape features for classification as follows, Sepal Length Sepal Width Petal Length Petal Width
Figure 4 shows the learning curve for comparison algorithms.

Learning curve plot for IRIS dataset.
The second major crop in India is wheat. The classification of wheat is very important and challenging task for breeders and geneticists. The wheat classification dataset [24] from UCI machine learning repository to classify three different major type of wheats such as kama, Rosa and Canadian. The dataset have 210 instances where 70 samples for each type of wheat. The geometrical parameters for predictions are, area A perimeter P compactness C = 4 * pi * A/P2
length of kernel width of kernel asymmetry coefficient length of kernel groove.
Figure 5 illustrates the performance of the MBPI in the wheat dataset by plotting the learning curve for 120 epochs with MSE of 0.01 for MBPI, 0.15 for SBP and 0.39 for SBPM.

Learning curve plot for wheat dataset.
The most common dreadful disease for the women in all over the world is breast cancer. In recent days, the number of women suffered by this disease getting increases. Different methodologies proposed for earlier detection of breast cancer. The breast cancer dataset [24] taken from UCI Machine learning repository contains 699 instances and 10 predictors to diagnose the cancer. The dataset have 458 instances for malign and 241 instances for benign. The features for predictions are Clump thickness Cell size Cell shape Marginal adhesion Single epithelial Bare nuclei Bland chromatin Normal nucleoli Mitoses.
Figure 6 shows the learning curve of the algorithms for 150 epochs with a mean square error of 0.03, 0.04 and 0.01 for SBP, SBPM and MBPI respectively.

Learning curve plot for breast cancer.
The wine dataset [24] from UCI machine learning repository contain 178 samples belonging to three different type of wine grown in specific area of Italy and it is categorized based on following 13 chemical analyses such as Alcoholic Malic Ash Alcalinity Magnesium Phenols Flavanoids Non flavanoids Proanthocyanins Color Hue Dilution Proline.
Figure 7 represents the learning curve of the algorithms for 40 epoch with MSE of 0.01 for MBPI whereas 0.05 and 0.02 for SBP and SBPM respectively which illustrate the better performance of MBPI.

Learning curve plot for wine classification.
Simulation results of selected problems
Comparison table for the proposed MBPI with SBP and SBPM
Confusion Matrix is one of the performance measure used to describe the classification capability of the classifier. It is applicable when output can be two or more classes. Table 3, depicts the results of the confusion matrix predicted by using SBP, SBPM, and MBPI. The following measures can be determined by using confusion matrix such as Recall, Precision and F1 score. These measures can be calculated based on four basic terminologies:
TP – Both actual and predicted are true.
TN – Both actual and predicted are false.
FP – Actual sample as false but predicted as true.
FN – Actual sample as true but predicted as false.
Recall is the ratio between the number of true positive and number of positive samples in the dataset. High recall value indicates the classes are correctly recognized and it can be defined by using following equation
The recall value obtained by using MBPI is around 1 for all the test dataset given in Table 3, shows that the class is at most correctly classified by the proposed classifier and the results have less number of FN.
Precision obtained by dividing the total number of positive samples correctly classified by the classifier with the number of samples predicted as positive. The precision can be expressed by the relation,
In Table 3, the precision value is almost 1 for all the experimented dataset in the case of MBPI, reveals that positive samples classified as positive are indeed positive which means there must be a less number of FP.
The F1 score is the overall measure of the model combine both recall and precision. If the value of F1 score is around one then the model is perfect and the model is failure if it is zero. The F1 score can be computed by
The F1 measure computed from the confusion matrix of the MBPI algorithm given in Table 3 is around one for all the test dataset proved that the proposed classifier is perfect.
Average metrics from confusion matrix
Cross Validation is one of the technique used to reserve few subset of data which is not given for training the model can be used as test set to cross validate the performance of the model. Different cross validation techniques are available such as Leave one out cross validation, k fold cross validation, stratified k fold cross validation, Adversarial validation and so on. In this work, k fold method used for cross validation. In this technique, the dataset are divided into k folds, and in each fold, k-1 fold of the dataset used for training and k th fold used for testing. Repeat this process until all k folds served as the test set. The average of the validation errors can be used to predict the performance of the model.
Performance of MBPI using K Fold cross validation
Performance of MBPI using K Fold cross validation
In the model machine learning era, a wide variety of learning algorithms proposed for the classification and pattern recognition problems. Hence there is a need for the statistical approach to show the significance of the proposed model. Statistical hypothesis testing is used to confidently present the results of the proposed model. Dietterich [25], concluded that 5x2 cross validation, a statistical test is best when the algorithm need to be executed ten times to analyze the performance of the algorithm and for the algorithm executed only once then McNemar’s Test is best.
McNemar’s Test is one of the non-parametric statistical test and in general used for paired nominal data. It is also called "within subject chi-squared test". In the context of machine learning, McNemar’s test is used to compare the predictive performance of the two classifiers on a single test dataset. Nowadays, this statistical test widely used to test the efficiency of the deep learning models, since it is more suitable for large datasets. The following steps involved to perform the McNemar’s test. Find the contingency table which contains the transformation of the predicted result of two classifiers. Table 5 represents the structure of the contingency table, Compute statistics (“chi-squared”) from the contingency table using the formula,
Fix the Alpha value as 0.05. H0, the Null Hypothesis means the two classifiers has the same proportion of error, so there is no significant differences between the two models. To find the significance of the model, calculate the p-value using Equation (20),
The output of the McNemar’s test can be interpreted like as follows, if p-value >alpha, then there is no significant difference between two classifiers. Both models disagree with the same amount of error then accept H0. if p-value<alpha, then there is a significant difference between the two models. Reject H0 because both classifiers have a different proportion of errors.
Structure of the contingency table
McNemar’s test results
In general, the standard backpropagation algorithm utilizes the two hyperparameters called learning rate and momentum factor to train the artificial neural network. The goal of this algorithm starts at some point in the error function which one is defined over the weights and attempts to reach the global minimum of the function. But in real time cases, the error surface of the nonlinear network is more complex compared with a linear network and contains numerous local minima. By adding the third term called Chebyshev expanded inertia in the weight updation rule changes the path which helps to reaches the global optimum result and gives fast convergence rate without getting stuck in local minima problem.
From the results given in Table 3, the Recall value of the proposed MBPI for the experimented dataset, reveals that most of the positive samples were correctly classified and these values never moved lesser than precision which makes sure that those predicted positive were indeed positive. The F1 score is 1 and around 1 for all the dataset strongly conveyed that the proposed model is the best classifier. The statistical performance metrics of the proposed algorithm derived from the confusion matrix and cross-validation techniques employed strongly reveals that the modified algorithm MBPI gave better performance and faster convergence rate when compared to the existing algorithms SBP and SBPM.
The significance of the proposed model MBPI is proved by conducting the McNemar’s test to compare with two classifiers SBP and SBPM respectively. From Table 6, the p-value is less than 0.05 for all the test dataset which strongly represents the proposed model have significant differences when compared with the other two classifiers except XOR problem, since the sample size is very low in the XOR problem. McNemar’s test mainly depends on the probability of the error of two classifiers, the result was not significant in the case of XOR problem.
Conclusion
In this paper, a modified backpropagation algorithm has been proposed by adding a third term called inertia, which is expanded in the form of Chebyshev orthogonal polynomial which makes the term as an independent component and more suitable for function approximation. Because of so many advancements, the Chebyshev polynomial is widely used in many areas of numerical analysis. The value of the cost function and the number of epochs for convergence were significantly reduced when it is compared with SBP and SBPM. The Chebyshev expanded inertia make the network move towards the speedy convergence and it is proved by implementing and tested with the standard benchmark problems of XOR problem, Parity checking problem and UCI datasets namely iris flower classification, wheat classification, breast cancer detection, and wine classification. The statistical measures derived from the confusion matrix, the results of k fold cross validation technique and the results of the McNemar’s test shows that the proposed modified algorithm MBPI gave better convergence rate compared to the existing algorithms SBP and SBPM. As the real time large dataset is imbalanced in nature, the proposed algorithm would be applied in the imbalanced dataset in the future.
