Abstract
Ensuring the quality of software products is important for them to be successful. Discovering errors and fixing defective software modules early in the project lifecycle (e.g. in the testing phase) can save resources and enhance software quality. Developers should prioritize testing procedures and continuously maintain their software projects; however, when there are few instances of a new project, it is hard to build an accurate defect prediction model. Different information about software projects is available and can be utilized through open repositories. Developers can leverage the labeled defect information to build a defect prediction model. The abundance of historical software information in similar domains can assist in transferring the knowledge gained from training this information to other domains for cross-project defect prediction models. Deep learning is a promising machine learner. Deep Belief network (DBN) is a deep learning algorithm that can discover latent relationships between input features by training them through multi-hidden layers; however, it is difficult to build a good prediction model from a dataset with few modules or instances. In this research, we utilized auxiliary datasets to initialize a DBN model and transfer the obtained knowledge to train the DBN model using a source project in a cross-project combination. The expressive features generated from the DBN model are used to build a classical classifier from the source class label and test it on other target project instances. Our evaluation of 13 open Java projects from the PROMISE repository shows that our proposed model achieves improvements based on F-measures (3.6%, 4.9%, and 5.1%) for the three settings of the DBN model measured against the best used benchmark model of TCA/ TCA+ techniques. Moreover, T_DBN and DBN_Only models achieve improvement in terms of F-measure by (11.1% and 6.2%) against the best used benchmark model of TCA/TCA+ on Relink validation dataset.
Introduction
Testing and maintenance typically consume the largest share of the budget and effort in the software development life cycle. Writing test cases with limited time and resources is a challenging and difficult process [27]. Usually, testing all possible cases is impractical [10]. When there is a limited testing budget, static code metrics can be used as predictors to indicate the possibility of fault existence, and testing resources and budget can then be focused on the modules predicted to be defective. The prediction of software defects is a quick, lightweight sampling technique which can save effort and cost, and indicate whether further investigation is necessary for quality assurance (QA) [39]. Most statistical machine learning algorithms can recognize faults effectively, but provide limited results for defect prediction [50]. Defect prediction can be performed within the same file or across two successive versions of a project, which is called “within-project defect prediction” (WPDP), where the training and testing datasets come from the same project distribution, or it can be cross-project defect prediction (CPDP), where the training set is from one domain and the testing set is from other similar domains [34]. Classical machine learning algorithms are used to determine the probability of the existence of faults in new modules, depending on previously collected information. In the cross-project case, applying a classifier directly to a source domain and a target domain with different distributions may decrease the performance of the classifier [41]. Moreover, some classical machine learners such as Naive Bayes (NB) treat the predictor features as statistically independent [16, 39].
In software engineering applications, the strategy of transfer learning can be used to overcome the lack of a sufficient label training set in one domain for testing purposes. In the case of cross-project defect prediction, the performance of the model is usually poor [34, 40]. Transfer learning enhances learning and attempts to benefit from the knowledge obtained from one domain and share it when training a new domain [35]. A domain adaptation method called transfer component analysis (TCA) can be used on traditional metrics to transfer learning from the source project to another target project, in cross-project combinations, to improve the performance of the defect prediction model [17]. Nowadays, the lack of training data of one domain is not a problem as the learning model from a related domain can be utilized to transfer the common features between the two domains [14, 21]. However, building software defect prediction for a cross-project model using transfer learning only on traditional source code metrics needs more investigation for improvement [10].
In recent years, deep learning has become a hot research topic [1, 7, 9, 46]. It is a type of machine learning that encompasses a family of neural network algorithms using multi-hidden layers for supervised and unsupervised learning such as deep belief network (DBN) [22]. Researchers have used deep learning to investigate its ability to extract expressive features that improve the performance of the prediction models [50]. However, deep learning faces some challenges when the size of the training data set for a target domain is small. Deep learning cannot fine-tune the parameters of the network easily and so can be prone to over-fitting problem [22]. Transfer learning within deep learning can solve this problem. It aims to share the learned knowledge from one domain with new domains which must have some similar parameters with others that are specific for each domain. Moreover, transfer learning exploits the correlation between the problem domains. Furthermore, transfer learning can transfer better features from the bottom generalized layers than the higher specific layers in multi-layer neural networks. The transferable features improve performance and generalization even after the model is fine-tuned on a new task [19].
This research investigated a representative learner algorithm to extract more expressive common features, to assist improving the performance of a cross-project software defect prediction model. DBN with transfer learning using traditional or static metrics from the source code are applied to cross-project combinations to improve the defect prediction model.
Background
Building defect-free software is costly, so defect prediction is used as a quick sampling policy that indicates which modules deserve more testing effort [39]. One approach is to utilize data mining algorithms on historical data to predict defects for new projects. The usefulness of source code metrics for indicating faults at different granularity levels was established early on. Some product metrics, such as static code metrics, were proposed as a metric-based model to predict defects [16]. For a new domain with limited class labels, CPDP can be used to train the model in one domain that has plenty of labeled classes, and test the model on new similar instances. However, the CPDP model still obtains poorer performance than the within-project model, because of the difference between the distributions of the two projects. Moreover, the input features that are used to build the prediction model are treated independently and the latent relationship between these features is not investigated. One solution for overcoming the difference in the distribution of the input features between the source project and the target project when building a defect prediction model is transfer component analysis (TCA) [37]. TCA transforms the source and target projects into the same subspace, and thus enhances cross-project software classification and defect prediction [13, 33, 34, 47]. An enhancement of TCA is TCA+, which selects the normalization option, since the performance of TCA depends on the type of normalization [17].
In recent years, deep learning has appeared as a new class of machine learning methods. Deep learning is based on algorithms that use artificial neural networks to learn the latent relationship between the input features. The challenge in deep learning is the difficulty in optimizing the network parameters due to the random weight initialization. Hinton et al. [11] proposed a deep belief network (DBN) model as a fast greedy learning algorithm to pre-train the network in order to obtain good network initialization. Transfer learning is widely used in daily life when we use information we have from previous experiences to learn a new skill or task. Transfer learning enhances the learning by transferring the knowledge gained from one domain to learn a new domain with little human effort [21]. The need for transfer learning in machine learning for a new domain emerges when there is no sufficient data to train the model or when the data collection process is expensive.
Problem statement
Early fault prediction in the testing process is helpful in saving effort and cost. Many machine learning algorithms can extract a statistical relationship from the source code using static metrics. Traditional techniques extract a shallow relationship between input features using a limited number of non-linear transformation layers, while assuming that the training and testing datasets have the same distribution. The DBN algorithm still needs more investigation in this field. Moreover, most defect prediction approaches work effectively for within-project combinations. When a new project lacks a training dataset, it is desirable to train the model on data from existing source projects and test it on the target project (i.e. cross-project prediction) [40]. However, the resulting performance is relatively low compared with the WPDP performance. To fill this gap, this research investigates the ability of DBN with transfer learning to enhance the software defect prediction model for the cross-project task by extracting the representative shared information between the input static code metrics of the source and target projects.
Research questions
This research attempted to answer the following questions:
Can a DBN Learner with the Transfer Learning technique outperform the TCA/TCA+ on traditional features for a cross-project defect prediction?
How will the performance of the proposed model be different with different settings?
What is the time and space cost for the proposed DBN Model with the Transfer Learning technique based on traditional metrics extraction?
This section reviews previous research works and experiments to describe source code-based defect prediction, deep learning, transfer learning and the methods of applying these concepts to improve defect prediction models.
Source code metric-based defect prediction
In recent years, some software researchers have started to use different techniques to predict defects early in the testing phase, based on different input feature metrics [13, 33, 34]. Prasad et al. [24] point to some static code metrics that have a significant relationship with the existence of faults. Kaur and Singh [2] tested 10 object-oriented metrics for coupling and cohesion such as CK metrics. Analysis by logistic regression shows that some metrics have a greater impact on predicting defects in a class, and outperform class size in terms of low severity faults. Ozturk and Zengin [30] ensured the need for deriving low-level metrics relations from existing ones, not to add new metrics, and to improve data preprocessing to treat skewness in the data before applying machine learners. In this research, we used product metrics for static source code from different domains found in the PROMISE repository and attempted to preprocess the skewness in the data before applying the defect prediction model.
Deep learning-based defect prediction
Recently, the deep learning algorithm DBN has been used to extract representative features and reconstruct input features for tasks such as recognition and discrimination.
Tzortzis and Likas [12] used the DBN model to detect spam in three email datasets, and achieved similar or better results than the support vector machine (SVM) learner. Also, Song et al. [18] used the DBN to classify Chinese text documents into different categories and obtained better results than the SVM classifier. Ri-Xian et al. [23] combined DBN with transfer learning for cross-project defect detection. They trained the DBN model on features from a source sample of image datasets to set the initial weights for a target sample from a similar domain, transferred the parameters, including structure and offset, to the target domain, and then fine-tuned the model on a sample from the target domain using the back-propagation (BP) algorithm. The DBN trained a solar cell model with four hidden layers containing 3000, 1500, 750, and 90 nodes respectively. In this research, we took the technique of using DBN with transfer learning and tried to apply it to another domain, specifically software engineering. DBN can extract latent relationships between the input features, and transfer learning can share the optimized parameters between two domains to improve the source code-based defect prediction model.
Sharma et al. [28] investigated the DBN model to detect abnormalities in a digit dataset (MNIST) and a non-digit dataset (face and handwritten characters). In the software defect prediction field, Yang et al. [50] proposed a deeper model by applying the DBN model to predict whether a change in a file is buggy or not, which is beneficial for developers as it reveals defects in real time. Wang et al. [38] proposed a DBN on abstract syntax trees (ASTs) to learn semantic features of the source code at the file level. They examined 10 open source datasets from the PROMISE repository to establish within-project and cross-project defect prediction. The researchers chose 10 hidden layers with 100 nodes each, as this setting gave the best F-measure, and 200 iterations with an error rate of about 0.098. The results showed improved average F-measure values of 64.1, 60.0, and 59.7 for the three classifiers: alternating decision tree (ADTree), Naive Bayes, and Logistic Regression respectively. In this research, we used the same dataset as that used in [38] to compare our work with the second baseline for traditional features (PROMISE features) for cross-project defect prediction. In addition, we investigated the ability of transfer learning, when combined with DBN, to predict defects in the software domain. Li et al. [15] used a convolutional neural network (CNN) to generate semantic and structural features from program ASTs to improve the defect prediction model for buggy or clean files. The model was applied to seven Java open source projects from the PROMISE repository. They concatenated the extracted semantic features with traditional static metrics, and the augmented features performed better than the DBN model used in [38].
As the addition of traditional static features improves the performance of the defect prediction model, this study encouraged us to explore the role of traditional static metrics only using DBN model with transfer learning to improve defect prediction.
Transfer learning for cross-project defect prediction
Zimmermann et al. [40] showed that cross-project defect prediction between two domains faces some challenges; the predictor model requires the collection of sufficient training data with corresponding defect data, which is not always available for a new project. They utilized the available historical data from one similar domain (source domain) and tested the model on the new project (target domain). The performance was not the same as the within-project prediction because of the difference in the characteristics of data and process between the two domains. In their analysis, only 21 out of 622 open-source cross-project combinations had precision, recall and accuracy values above 0.75. Pan et al. [37] proposed the transfer component analysis (TCA) approach in order to reduce the distance between the marginal distributions of two different but related domains. TCA transfers learning between two domains by extracting common latent features using maximum mean discrepancy in a Reproducing Kernel Hilbert Space (RKHS). Nam et al. [17] utilized the state-of-the-art unsupervised TCA for transfer learning in cross-project defect prediction and proposed TCA+, which selects the normalization option by defining a dataset characteristic vector (DCV) for each project in the source and target domains. TCA+ stipulates that the source and the target project must have the same number and type of metrics to learn latent features in order for the transformation process for cross-project defect prediction to succeed. The results showed an improvement in the performance of cross-project defect prediction. Some researchers applied different methods of transfer learning in deep learning models, and were able to achieve better results than by building models from scratch [20, 31, 42, 48, 49].
Different domains utilized transfer learning for a cross-project to enhance the performance of classification models such as natural language processing for sentiment classification [6, 36], image recognition and computer vision [5, 26], and software defect prediction [3, 17, 47]. In this research, we benefited from the TCA+ method and tried to compare the ability of transfer learning within DBN against the TCA+ as a baseline approach.
Based on the previous literature for defect prediction, we summarized the following points:
Static code metrics for products are still a significant indicator of the existence of faults in the source code; thus these metrics can still be used to build a model to predict defects.
When class-labeled instances are lacking, we can benefit from the abundance of auxiliary datasets that have enough labeled instances to build a defect prediction model and test it on instances from a similar new domain.
Most machine learning approaches treat the metrics either independently or through a shallow relationship between them when studying their effect on defect prediction. In addition, the effects of combinations of these metrics on defect prediction need more investigation. For example, x and y are two features which independently affect the dependent variable z, but combinations such as x*y may also affect z.
Transfer learning is about moving the knowledge obtained in one domain to easily learn another similar domain. It achieves good results in cross-project prediction by minimizing the differences in distribution between the source and target datasets, and can then improve the performance of the classification technique.
The DBN model achieves good results in defect prediction and classification for different domains. However, DBN in the context of software defect prediction is still immature and needs more investigation.
DBN can utilize knowledge transferred from one domain to another to overcome the shortage of training data for the target domain and improve the learning process.
Software defect prediction based on source code metrics is one of the fields that should investigate the combination of DBN with transfer learning to improve the performance of the prediction model.
The overall research approach
In this section, we introduce the methodology of our thesis approach to improve defect prediction based on static metrics using DBN and transfer learning. Our methodology comprised three phases: data collection and preprocessing, building the learning model using DBN and transfer learning and the prediction model construction phase. Figure 1 shows the phases of the proposed approach.
The proposed model for CPDP using DBN and transfer learning.
As the open source dataset is publicly available for investigation, we used 13 open source datasets from 10 projects in Java language from the PROMISE data repository via the website:
Table 1 summarizes the properties of the 13 datasets, which have 20 static code metrics containing object-oriented metrics. More information about these metrics can be found in [25].
Research phases
The proposed approach consists of three major phases: dataset collection and preprocessing, building the learning model, and prediction model construction.
Properties of 13 Java dataset
Dataset collection and preprocessing contains many steps as shown in Fig. 1.
3.3.1.1. Dataset collection
Our approach depends on the static code metrics of product metrics to predict defects at the file level. These metrics represent the input features for the proposed approach. They are quantifiable measures that are used to describe the characteristics of the source code at the design level. Table 2 shows descriptions of these metrics.
3.3.1.2. Dataset preprocessing
The datasets have to be pre-processed using different techniques: over-sampling, normalization selection and noise reduction.
Over-sampling: this technique is used to overcome the imbalanced class label. As shown in Table 1, the collected datasets have a larger majority class (non-defective) than minority class (defective) in most of the projects, with the exception of the Poi projects, where defective instances are not the minority.
Data normalization selection: the values of the measured metrics are not on the same scale, which makes it difficult to compare the impact of these metrics on the class label. In addition, the type of normalization affects the performance of the prediction model [17]. To overcome this problem, two normalization options were considered for the dataset: the min-max method and the z-score method, with variations of z-score according to the TCA+ normalization selection method, as in the study by Wang et al. [38]. The min-max method (N1) transforms all metric values into the range between 0 and 1. For example, for a feature vector $x = (x_1, \dots, x_k)$:
\[ x_i' = \frac{x_i - \min(x)}{\max(x) - \min(x)} \tag{1} \]
Z-score normalization (N2) transforms the mean of the data to 0 and the standard deviation to 1. For example, for the feature vector $x$:
\[ x_i' = \frac{x_i - \mathrm{mean}(x)}{\mathrm{std}(x)} \tag{2} \]
In the cross-project case, there are two variations of z-score according to the level of the difference in the distribution of the source and the target domains. In one z-score variation (N3) we standardized the source and target according to the mean and standard deviation of the target dataset as in Eq. (3). We use this type of normalization when the source data is either sparse or dense and has little statistical information; thus we use the target’s statistical information to scale the two datasets. The other variation of z-score (N4) standardizes the source and target according to the mean and standard deviation of the source dataset as in Eq. (4) [17]. We use this type of normalization when the target data is either sparse or dense and has little statistical information, so we use the source’s statistical information to standardize both the source and target projects.
\[ x_i' = \frac{x_i - \mathrm{mean}(x^{T})}{\mathrm{std}(x^{T})} \tag{3} \]
\[ x_i' = \frac{x_i - \mathrm{mean}(x^{S})}{\mathrm{std}(x^{S})} \tag{4} \]
where $x^{S}$ and $x^{T}$ denote the values of the feature in the source and target datasets respectively, and $x_i$ is a value of that feature in either dataset.
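The four normalization options can be illustrated with a minimal Python sketch. It is not the authors' implementation (the paper reports using Excel and Matlab for this step); the function names are hypothetical, each metric is handled as a plain list, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

def min_max(column):
    """N1: rescale one metric column into [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: nothing to rescale
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

def z_score(column, reference=None):
    """Standardize `column` with the mean/std of `reference`.

    reference=None      -> N2 (the column's own statistics)
    reference=target    -> N3 (target statistics for both datasets)
    reference=source    -> N4 (source statistics for both datasets)
    """
    ref = column if reference is None else reference
    mu, sigma = mean(ref), stdev(ref)  # sample std is an assumption
    if sigma == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

# Toy values of one metric in a source and a target project
source = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
n3_source = z_score(source, reference=target)   # N3 applied to the source
n4_target = z_score(target, reference=source)   # N4 applied to the target
```

Passing the reference dataset explicitly keeps N2, N3 and N4 in one function, which makes it clear that they differ only in whose statistics are used.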
To choose the normalization option, we have to measure the similarity between the source and target domains by defining a DCV for each project. The DCV is calculated from the Euclidean distance between each pair of instances in the dataset as follows:
\[ \mathrm{dist}(a_i, a_j) = \sqrt{\sum_{k=1}^{d} (a_{ik} - a_{jk})^2} \]
where $a_i$ and $a_j$ are two instances of the dataset, $a_{ik}$ is the value of the $k$-th metric of instance $a_i$, and $d$ is the number of metrics.
Rule 1 is used when the mean and the standard deviation of the source and target are the same (no transfer information can be used), so no normalization is applied (NON); rule 2 is used when the minimum and maximum of the source and target are different, thus we use min-max normalization (N1); rules 3 and 4 are used when the standard deviation and the number of instances are different, in other words, either the source or the target has good transferable information, so we normalize both source and target using the statistical information of one of them, (N3) and (N4) respectively; otherwise we use the normal z-score (N2).
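As an illustration of the DCV computation, the following Python sketch collects all pairwise Euclidean distances of a dataset and summarizes them. The function name and the exact set of summary statistics are assumptions based on the quantities named in the rules above (mean, standard deviation, minimum, maximum, number of instances); the similarity thresholds of the decision rules themselves are not reproduced here.

```python
from itertools import combinations
from math import dist
from statistics import mean, stdev

def dataset_characteristic_vector(instances):
    """Summarize pairwise Euclidean distances between instances.

    `instances` is a list of equal-length metric vectors; at least three
    instances are needed so that the standard deviation is defined.
    """
    distances = [dist(a, b) for a, b in combinations(instances, 2)]
    return {
        "mean": mean(distances),
        "std": stdev(distances),
        "min": min(distances),
        "max": max(distances),
        "n_instances": len(instances),
    }

# Toy project with three instances and two metrics
dcv = dataset_characteristic_vector([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
```

The pairwise loop is quadratic in the number of instances, which is acceptable for datasets of the size used here but would need sampling for much larger projects.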
Noise reduction: our projects come from open source repository and may contain noise instances; which can affect the distribution of the dataset and change the overall behavior of the dataset in any process. In this work, we used auxiliary dataset comprising all dataset except the combination of cross-project (source-target) to initialize the DBN model and transfer the knowledge from this data to another similar project. Our goal was to depend on a robust dataset to some degree when transferring the obtained learning. Inter Quartile Range (IQR) rule [34] was used to define any outliers and remove them from the dataset.
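The IQR rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the Weka filter used in the experiment; the function names are hypothetical, and the factor of 1.5 is the common textbook choice, whereas Weka's filter has its own configurable factors.

```python
from statistics import quantiles

def iqr_bounds(values, factor=1.5):
    """Return (lower, upper) bounds outside which a value is an outlier."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr

def remove_outliers(rows, factor=1.5):
    """Drop every row in which any metric lies outside its column bounds."""
    columns = list(zip(*rows))
    bounds = [iqr_bounds(col, factor) for col in columns]
    return [
        row for row in rows
        if all(lo <= x <= hi for x, (lo, hi) in zip(row, bounds))
    ]

# One metric, five instances; the last one is far from the rest
rows = [(1.0,), (2.0,), (3.0,), (4.0,), (100.0,)]
cleaned = remove_outliers(rows)
```

The bounds are computed once on the full dataset, so a single pass is made; repeating the filter on the cleaned data would be a stricter variant.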
Domain Adaptation using Transfer Component Analysis (TCA): In cross-project defect prediction, when there are no available labeled data in the target dataset, one source domain is used to train the prediction model and test the model on another similar target domain, when the two domains are similar in some way in coding. TCA is the state-of-the-art method for domain adaptation in cross-project classification task to transfer the extracted shared latent features between two domains. TCA projects the two domains into a new latent space called a reproducing kernel Hilbert space (RKHS) to minimize the difference in the two distributions and maximize the variance but keep the data properties. Pan et al. [37] propose unsupervised TCA that learns latent features without using the labeled data from the dataset. In our proposed approach we used unsupervised TCA to transform the source and target domains into latent space. The number of features can be reduced in the final transformation, but we used the same number of features as the original dataset to keep the original information when using the DBN model. Figure 1 shows domain adaptation using TCA process in phase I.
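To make the notion of "difference between the two distributions" concrete, the sketch below computes the squared maximum mean discrepancy (MMD) between a source and a target dataset under a linear kernel, where it reduces to the squared distance between the two mean vectors. This is only the quantity that TCA tries to reduce, not TCA itself (which learns a projection by solving an eigenvalue problem); the linear kernel and the function names are assumptions made for illustration.

```python
def column_means(rows):
    """Mean of every metric (column) over all instances (rows)."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def linear_mmd_squared(source, target):
    """Squared MMD with a linear kernel: ||mean(source) - mean(target)||^2."""
    mu_s = column_means(source)
    mu_t = column_means(target)
    return sum((s - t) ** 2 for s, t in zip(mu_s, mu_t))

# Toy source and target projects with two metrics each
source = [(0.0, 0.0), (2.0, 2.0)]
target = [(1.0, 1.0), (3.0, 3.0)]
gap = linear_mmd_squared(source, target)
```

A value of zero means the two datasets have identical means in this feature space; a successful domain adaptation step should make this gap smaller in the transformed space than in the original one.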
This phase contains two processes: the DBN Pre_training model and Transfer DBN weights.
3.3.2.1. DBN Pre_training Model
This step involves initializing DBN, using the auxiliary dataset as shown in Fig. 1, in the building learning phase. The DBN is a probabilistic and generative model composed of several stacked RBMs. The RBM is a generative stochastic energy-based model which can be represented as a graphical model that contains two connected layers; a visible layer which represents input features and a hidden layer which represents the extracted latent features. The visible layer and hidden layer nodes are connected symmetrically, and there are no connections between nodes in the same layer [22].
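From the definition of the RBM as an energy model, the goal of RBM training is to minimize the energy of the neural network. Assume that we have n visible nodes and m hidden nodes, with $v_i$ the state of the $i$-th visible node and $h_j$ the state of the $j$-th hidden node.
For the state $(v, h)$ we can define the energy function as
\[ E(v, h) = -\sum_{i=1}^{n} a_i v_i - \sum_{j=1}^{m} b_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i w_{ij} h_j \]
where $w_{ij}$ is the weight between visible node $i$ and hidden node $j$, $a_i$ is the bias of visible node $i$, and $b_j$ is the bias of hidden node $j$. Given the state of the visible nodes, the conditional probability of a hidden node is
\[ P(h_j = 1 \mid v) = \sigma\Big(b_j + \sum_{i=1}^{n} v_i w_{ij}\Big) \]
Furthermore, given the state of the hidden nodes, we can calculate the conditional probability of the visible layer nodes as in Eq. (9).
\[ P(v_i = 1 \mid h) = \sigma\Big(a_i + \sum_{j=1}^{m} w_{ij} h_j\Big) \tag{9} \]
where $\sigma(x) = 1/(1 + e^{-x})$ is the logistic sigmoid function.
Through training the RBM, the parameters $\theta = \{w_{ij}, a_i, b_j\}$ are adjusted so that the energy of the training data decreases.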
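The RBM quantities discussed in this subsection can be sketched in Python as follows. The sketch assumes the standard binary RBM formulation with a weight matrix W (visible by hidden), visible biases a and hidden biases b; these symbol and function names are illustrative assumptions, not taken from the Deeplearning4j implementation used in the experiment.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def energy(v, h, W, a, b):
    """E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_ij v_i W_ij h_j."""
    visible = sum(a_i * v_i for a_i, v_i in zip(a, v))
    hidden = sum(b_j * h_j for b_j, h_j in zip(b, h))
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return -(visible + hidden + interaction)

def p_hidden_given_visible(v, W, b):
    """P(h_j = 1 | v) for every hidden node j."""
    return [
        sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
        for j in range(len(b))
    ]

def p_visible_given_hidden(h, W, a):
    """P(v_i = 1 | h) for every visible node i."""
    return [
        sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
        for i in range(len(a))
    ]

# Two visible nodes, one hidden node
W = [[2.0], [-1.0]]
a = [0.5, 0.0]
b = [-1.0]
e = energy([1.0, 0.0], [1.0], W, a, b)
```

Because there are no connections inside a layer, each conditional factorizes over nodes, which is why both probability functions can be computed node by node.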
3.3.2.2. Transfer parameters within DBN model
Training DBN on a small limited dataset can hardly optimize the network parameters and can expose it to the over-fitting problem. Transfer learning can utilize the correlation between similar domains and the availability of an auxiliary dataset. In addition, in the case of training DBN, it can overcome the shortage in training dataset and transfer the structure of one domain project with the optimized parameters to initialize the weights and train the DBN on another domain project. In the DBN model, we can perform transfer learning using an auxiliary dataset and source dataset in un-supervised learning to generate optimized parameters for supervised learning in the next phase as follows (the main steps):
Using the auxiliary dataset (All) to initialize the DBN and train the model to obtain good parameters (the weight matrix and the biases).
Starting from these parameters, we trained the DBN model using the TCA_ed source dataset, then transferred the optimized DBN parameters and extracted representational features from the transformed source dataset [20], as shown in the DBN pre-training step in Fig. 1.
We can perform the experiment with and without transfer learning (without transfer means initializing the DBN model with the source dataset only, not from the auxiliary dataset) to compare and calculate the improvements. Algorithm 2 shows the steps in the DBN pre-training phase.
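The pre-training steps above can be sketched for a single RBM layer in Python. This is a simplified illustration under several assumptions: one layer instead of a stacked DBN, one-step contrastive divergence (CD-1) as the training rule, inputs scaled into [0, 1], and hypothetical function names and hyperparameters. It is not Algorithm 2 and not the Deeplearning4j code used in the experiment.

```python
import random
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def hidden_probs(v, W, b):
    return [sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v))))
            for j in range(len(b))]

def visible_probs(h, W, a):
    return [sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h))))
            for i in range(len(a))]

def init_rbm(n_visible, n_hidden, rng):
    """Small random weights, zero biases."""
    W = [[rng.gauss(0.0, 0.01) for _ in range(n_hidden)]
         for _ in range(n_visible)]
    return W, [0.0] * n_visible, [0.0] * n_hidden

def train_rbm(data, W, a, b, rng, epochs=20, lr=0.1):
    """CD-1 updates applied in place to W, a and b."""
    for _ in range(epochs):
        for v0 in data:
            ph0 = hidden_probs(v0, W, b)
            h0 = [1.0 if rng.random() < p else 0.0 for p in ph0]
            v1 = visible_probs(h0, W, a)        # reconstruction
            ph1 = hidden_probs(v1, W, b)
            for i in range(len(a)):
                for j in range(len(b)):
                    W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
                a[i] += lr * (v0[i] - v1[i])
            for j in range(len(b)):
                b[j] += lr * (ph0[j] - ph1[j])
    return W, a, b

def pretrain_with_transfer(auxiliary, source, n_hidden, seed=0):
    """Pre-train on auxiliary data (if given), then continue on the source."""
    rng = random.Random(seed)
    W, a, b = init_rbm(len(source[0]), n_hidden, rng)
    if auxiliary:                       # with transfer: initialize from 'All'
        train_rbm(auxiliary, W, a, b, rng)
    train_rbm(source, W, a, b, rng)     # continue from the current parameters
    features = [hidden_probs(v, W, b) for v in source]
    return (W, a, b), features

# Toy data: three metrics per instance, already scaled into [0, 1]
aux = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
src = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
params, feats = pretrain_with_transfer(aux, src, n_hidden=2)
```

Calling `pretrain_with_transfer(None, src, n_hidden=2)` gives the "without transfer" setting, in which the layer is initialized from the source dataset only.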
The pre-trained DBN model must be fine-tuned using a feed-forward neural network and the back-propagation algorithm to optimize the error rate. The TCA_ed source and TCA_ed target projects with class labels are needed to build the prediction model and to evaluate its performance, respectively. The SMOTE technique is used to balance the class labels of the transformed source dataset so that the prediction model will not be biased toward any class. In this phase, an output layer is added on top of the DBN model for the classification task and is trained on the balanced, TCA_ed source dataset. To evaluate the prediction model, we need to test it on the TCA_ed target dataset. In this phase we used the transferred DBN model as the discriminator model for the classification task in supervised learning. According to Zhang [20], the representational transfer in the DBN model improves the accuracy of the classification task compared to the representational transfer in the back-propagation neural network model. Moreover, when transfer learning is used within the DBN model, the parameters learned in the bottom layers are more generic than those in the higher ones [43]. Thus, they were transferred to the next phase, which includes the following steps.
To keep the learning generated from the above unsupervised steps, we removed the highest layer and froze the first hidden layer while keeping the rest, since the generality of the extracted features decreases from bottom to top, but transferred features still have a better impact on network learning than random features [29]. The frozen layer prevents the good parameters from changing. An output layer with a Softmax (logistic regression) function was added on top of the extracted features for the classification task, and then the whole transferred DBN model was fine-tuned by updating the weights of the network. Then the predicted class probabilities were obtained using the transformed source dataset only, with its class labels. The prediction model was tested on the transformed target dataset for cross-project defect prediction. The model was tested using DBN with and without transfer learning. Figure 2 shows the prediction model. The performance of the prediction model was calculated using evaluation metrics. Algorithm 3 shows the steps in the DBN fine-tune phase and prediction model construction.
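The idea of freezing a transferred layer and training an output layer on top of it can be sketched as follows. This is a deliberately reduced illustration, not Algorithm 3: it assumes a single frozen hidden layer, a binary logistic output unit in place of a two-class Softmax, and plain stochastic gradient descent; all names, weights and hyperparameters are hypothetical.

```python
from math import exp

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def frozen_features(x, W, b):
    """Frozen hidden layer: W and b are only read, never updated."""
    return [sigmoid(b[j] + sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(len(b))]

def train_output_layer(features, labels, epochs=200, lr=0.5):
    """Fit a logistic output unit on top of the frozen features."""
    w, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            p = sigmoid(bias + sum(wi * fi for wi, fi in zip(w, f)))
            err = y - p                       # gradient of the log-likelihood
            w = [wi + lr * err * fi for wi, fi in zip(w, f)]
            bias += lr * err
    return w, bias

def predict(x, W, b, w, bias, threshold=0.5):
    """Return 1 (defective) or 0 (clean) for one instance."""
    f = frozen_features(x, W, b)
    score = sigmoid(bias + sum(wi * fi for wi, fi in zip(w, f)))
    return 1 if score >= threshold else 0

# Hypothetical transferred parameters of the frozen layer
W_frozen = [[4.0], [-4.0]]
b_frozen = [0.0]

# Toy labeled source instances (1 = defective, 0 = clean)
source_x = [[1.0, 0.0], [0.0, 1.0]]
source_y = [1, 0]

feats = [frozen_features(x, W_frozen, b_frozen) for x in source_x]
w_out, bias_out = train_output_layer(feats, source_y)
```

In the cross-project setting, `predict` would then be applied to the transformed target instances, whose labels are used only for evaluation.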
Tuning # of neurons.
To evaluate the proposed approach results, different evaluation metrics were used to assess the performance of the machine classifiers. We used Recall, Precision, and f-measures that are popular metrics for defect prediction models. Moreover, we compared the performance of the different settings of the model in the experiment. Precision and recall can be calculated using confusion matrix that describes the result of the prediction model compared to the actual results in the dataset. Table 3 shows the confusion matrix.
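The precision measure represents the ratio of the instances correctly predicted as defective to all instances predicted as defective, as shown in Eq. (10).
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{10} \]
The recall measure represents the ratio of the instances correctly predicted as defective to the actual number of defective instances, as shown in Eq. (11).
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{11} \]
Here TP (true positives) is the number of defective instances predicted as defective, FP (false positives) is the number of clean instances predicted as defective, and FN (false negatives) is the number of defective instances predicted as clean.
F-measure combines precision and recall and represents the harmonic mean of these two metrics; it can be calculated as in Eq. (12).
\[ F\text{-measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{12} \]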
A high precision value indicates that the instances the prediction model flags as defective are in fact defective (few false alarms), while a high recall value indicates the ability of the prediction model to discover the defective instances in the project. F-measure summarizes the two measures [38, 50].
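The three measures can be computed from actual and predicted labels with a short Python sketch. The function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def confusion_counts(actual, predicted):
    """Count TP, FP, FN, TN with 1 = defective and 0 = clean."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-measure; 0.0 when a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy target project with six instances
actual = [1, 1, 1, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall, f_measure = precision_recall_f(tp, fp, fn)
```

True negatives are counted for completeness but do not enter any of the three measures, which is one reason F-measure is preferred over accuracy on imbalanced defect data.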
Confusion matrix
The proposed approach was implemented on a PC with the following properties: Intel Core i5 CPU at 2.5 GHz, 8 GB RAM, and Windows 10. Matlab R2017a was used to find the Euclidean distance for the DCV and to run TCA. Microsoft Excel 2010 was used to implement our version of normalization selection using decision rules as in TCA+. For DBN and transfer learning we used Deeplearning4j, an open source deep learning toolkit for Java. Deeplearning4j is widely used to build and train neural networks, especially deep learning models. More information about Deeplearning4j can be found on the website:
Experimental setup
Dataset
To evaluate the DBN model and facilitate reproducibility of the experiment, we used 13 open source projects from the PROMISE datasets repository.
Baselines of traditional features
To evaluate the performance of the DBN model against traditional static code metrics for defect prediction, we compared the features extracted by the DBN model from the traditional features with three baselines. The first baseline uses 20 traditional features from the source dataset to train a logistic regression classifier and tests it on the target dataset features. The second baseline (TCA) applies the domain adaptation approach TCA with different normalization options to the source and target traditional features; the transformed source dataset was used to train a logistic regression classifier, which we tested on the transformed target dataset. The third baseline (TCA+) is similar to the second, but the normalization option is selected by the decision rule in Fig. 4: based on the degree of similarity between the source and target DCVs, the type of normalization is selected before applying TCA, following the method of Nam et al. [17]. Logistic regression is a probabilistic linear model that is widely adopted in data mining as a machine classifier [4, 38]. We used the default settings implemented in the Weka 3.8.1 tool for the three baselines.
Cross-project defect prediction
When there are few instances in a domain, it is difficult to build an accurate machine classifier for a new dataset; we can therefore benefit from the availability of other similar domains to build a defect prediction model and test it on the new domain. This technique is called cross-project defect prediction: a machine classifier is built on the source dataset and evaluated on the target dataset. We prepared the data for cross-project defect prediction by forming 22 source-target project combinations as in [38]. When training the DBN model we used the auxiliary dataset to initialize the model. We used 11 out of 13 projects, excluding the source and target projects of the combination from the auxiliary dataset. We called them All_without_source-target (All) for each combination, so we had 11 auxiliary datasets. For example, when training ant-camel in the cross-project setting, we excluded these two projects from the auxiliary dataset and merged the remaining projects into one file. The auxiliary dataset is treated independently from the cross-project combinations. After that, we applied the interquartile range (IQR) noise reduction method using the Weka 3.8.1 implementation. IQR detects outliers and extreme instances, which we then removed from the auxiliary dataset using the Remove-With-Values filter implemented in Weka 3.8.1.
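The construction of the auxiliary dataset and the IQR-based cleaning can be sketched as follows. This is an illustration only, not the Weka filters used in the experiments: the fence factor of 3.0, the quartile method of Python's `statistics.quantiles`, and the function names are assumptions, and rows are taken to hold numeric features only.

```python
from statistics import quantiles


def auxiliary_pool(projects, source, target):
    """Merge the rows of every project except the source and the target.

    projects: dict mapping a project name to its list of feature rows.
    """
    return [row for name, rows in projects.items()
            if name not in (source, target) for row in rows]


def iqr_bounds(values, factor=3.0):
    """Lower and upper fences for one feature column (factor is assumed)."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def drop_outliers(rows, factor=3.0):
    """Keep only rows whose every feature lies inside its column's fences."""
    bounds = [iqr_bounds(col, factor) for col in zip(*rows)]
    return [row for row in rows
            if all(lo <= v <= hi for v, (lo, hi) in zip(row, bounds))]
```

For the ant-camel combination, `auxiliary_pool(projects, "ant", "camel")` would return the rows of all remaining projects, which would then be passed through `drop_outliers`.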
To give the two classes equal representation, we also applied oversampling to the auxiliary dataset using the SMOTE technique implemented in Weka 3.8.1, repeating the balancing step until the two classes were of equal size. These two steps on the auxiliary dataset were essential, as the data features should be clean and balanced when the learning is transferred to the source dataset during the construction of the DBN model. Within each source-target combination, SMOTE was applied to the source dataset only, after the domain adaptation (TCA) had been applied. We did not apply SMOTE to the target dataset, which received only the domain adaptation (TCA), in order to mimic reality.
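The core of SMOTE is interpolation between a minority-class instance and one of its nearest minority-class neighbours. The sketch below is a simplified illustration, not the Weka implementation: it picks base instances at random instead of iterating over them by percentage, and the function names, `k`, and `seed` are assumptions. It needs at least two minority rows.

```python
import math
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority rows by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base row (excluding itself)
        neighbours = sorted((r for r in minority if r is not base),
                            key=lambda r: math.dist(base, r))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position on the segment from base to mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic


def balance(majority, minority, k=5, seed=0):
    """Oversample the minority class until both classes have equal size."""
    n_new = max(0, len(majority) - len(minority))
    return minority + smote(minority, n_new, k, seed)
```

Because each synthetic row lies on a segment between two real defective instances, the balanced set stays inside the region already covered by the minority class.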
Parameter settings for the DBN model
The most important part of building a deep learning model is finding an appropriate architecture, which is a heuristic, trial-and-error task. Deep learning models have many parameters that need to be adjusted before training. Some of these hyper-parameters are related to the deep model and some to the optimizer function. They can be adjusted with automatic procedures such as grid search, random search, or hyper-parameter optimization, but all of these approaches need a lot of resources and time. Alternatively, a manual search can be carried out for the main hyper-parameters [44].
The main parameters that need tuning are the number of hidden layers, the number of hidden neurons in each layer, and the number of epochs [38]. Bengio [44] recommends picking standard model choices from other papers and conducting some search for the others. Regarding the number of hidden layers, our dataset has a small number of training instances: choosing too few hidden layers may result in poor performance, because the model may not capture all the expressive features of the training set needed to improve cross-project defect prediction, whereas choosing too many layers may take a lot of time and may cause the model to over-fit. Thus, we used three hidden layers, as many papers report that this number gives good results for the DBN model [8, 50]. To tune the other two parameters, we prepared two settings for the experiment. In the first setting, we trained the cross-project DBN model with transfer learning (using the auxiliary dataset); in the second, we trained the DBN model without transfer learning (without the auxiliary dataset). To distinguish the two settings, we named them as follows. The model that is pre-trained on the auxiliary dataset All_without_source-target, fine-tuned on the transformed source dataset, and then tested on the transformed target dataset is called T_DBN. The model that uses only the transformed source dataset for both the pre-training and fine-tuning phases and is then tested on the transformed target dataset is called DBN_Only.
To measure the effects of the two parameters, we tested the model on five cross-project combinations from the dataset. Wang et al. [38] selected five projects with two consecutive versions to adjust the DBN parameters for within-project defect prediction and validated the model on cross-project combinations. In this work we have no within-project setting, so we chose five random combinations and added a validation experiment for our model on another dataset for cross-project combinations. We tested each combination individually and took the average f-measure over the five cross-project combinations. The five cross-project combinations were: ant-camel, log4j-jEdit, xerces-xalan, ivy2-synapse12, and ant-poi3.
Moreover, to compare models with different settings and answer research question 2, we used a different set of dataset combinations to measure the effect of the two parameters on these models and obtain independent results. Thus, we were able to differentiate the performance of the two models. In this setting, we removed the IQR noise reduction step from All_without_source-target but kept the SMOTE step. The five cross-project combinations for comparison purposes were: ant-poi3, lucene-log4j, xerces-ivy2, camel-jEdit, and xerces-xalan.
4.1.4.1. Setting the number of neurons in each layer
We used the same number of neurons in every hidden layer because, according to Bengio [44], this works better than a pyramid-like or upside-down pyramid architecture, in which the number of neurons decreases or increases, respectively, through the hidden layers. We tested seven discrete values for the number of neurons (3, 5, 15, 18, 30, 50, and 60) with three hidden layers and a fixed number of epochs.
4.1.4.2. Setting the number of epochs
The number of epochs is an influential parameter when training a DBN model. In each epoch, the model makes one pass over the whole dataset, updating its weights to decrease the difference between the input data and the reconstructed data. In other words, as the number of epochs increases, the error rate decreases; having many epochs decreases the difference further but takes too much time [38]. To tune the number of epochs, we used the squared error (squared loss in the deeplearning4j library) and the f-measure. We used 50 epochs as the maximum and recorded the f-measure for the five cross-project combinations every 5 epochs, along with the squared loss. We then averaged both the f-measure and the squared error over the five combinations. Figure 3 shows the result of tuning the number of epochs: the squared error decreases as the number of epochs increases. The best average f-measure, 49.5, was obtained at 35 epochs, where the average squared error was 0.035; as a result, we set the number of epochs to 35 in the T_DBN model.
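The selection rule described above (average the f-measure over the combinations at each checkpoint, then keep the epoch with the highest average) can be sketched as follows. The data layout and the function name are assumptions made for illustration; the recorded f-measures themselves would come from the training runs.

```python
# Checkpoints every 5 epochs up to the maximum of 50, as in the tuning setup.
CHECKPOINTS = list(range(5, 51, 5))


def best_epoch(f_by_combo):
    """Pick the checkpoint with the highest average f-measure.

    f_by_combo maps a combination name to {epoch: f_measure}. Only epochs
    recorded for every combination are considered; ties go to the earliest
    epoch.
    """
    epochs = sorted(set.intersection(*(set(s) for s in f_by_combo.values())))
    means = {e: sum(s[e] for s in f_by_combo.values()) / len(f_by_combo)
             for e in epochs}
    best = max(epochs, key=lambda e: means[e])
    return best, means[best]
```

The same averaging rule applies to the number of neurons: replace the epoch keys with the candidate layer widths.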
Tuning # of epochs.
This section presents the result of the proposed model through different measures for the model and answers the research questions (RQs). It is followed by an illustration of the results and an analysis of these outputs for the different settings of the deep belief network models, with and without transfer learning, in addition to the justification of the differences in the results. Moreover, we present the general effects of some approaches on the different models.
DBN model with transfer learning results
RQ1: Can a DBN Learner with the Transfer Learning technique outperform the TCA/TCA+ on traditional features for cross-project defect prediction?
To answer this question, we used the first setting of the DBN model with transfer learning. We compared our proposed T_DBN-based cross-project defect prediction with the three baselines mentioned earlier. In baseline one, we built the logistic regression classifier on the source project and tested it on the target project. In baseline two, we used three normalization options (N2, N3, N4) with the TCA technique for domain adaptation, built logistic regression on the transformed source project, and then tested it on the transformed target project. In baseline three, we used the TCA+ technique, which selects the normalization option according to the decision rules shown in Fig. 2 before applying TCA; we then built the logistic regression on the transformed source project and tested it on the transformed target project. Guided by the results for the normalization types in baseline two, we took the option with the best average over all 22 combinations among (N2, N3, N4), which was N3 (normalize source and target using the mean and standard deviation of the source project). Thus, we normalized all 22 cross-project combinations using N3 before applying TCA. In addition, a deep belief network needs its own normalization, so we adopted N3 for all combinations before applying the T_DBN model.
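The N3 option, as described above, scales both projects with statistics computed on the source project only. A minimal sketch follows; the use of the sample standard deviation and the fallback to 1.0 for constant columns are assumptions, and the function name is hypothetical.

```python
from statistics import mean, stdev


def n3_normalize(source, target):
    """Z-score source and target rows using the source mean and std only."""
    columns = list(zip(*source))
    mu = [mean(col) for col in columns]
    # Assumed: sample standard deviation; constant columns are left unscaled.
    sd = [stdev(col) or 1.0 for col in columns]

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(row, mu, sd)] for row in rows]

    return scale(source), scale(target)
```

Because the target is scaled with the source statistics, no information about the target distribution is needed to fit the normalization.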
The All dataset was used to pre-train the DBN model and transfer the parameter (weight matrix
4.2.1.1. DBN model with transfer learning results analysis
The experimental results in Table 4 for the T_DBN model show an average f-measure of around 49, with high recall values for most of the cross-project combinations. Although precision is low for most combinations, the average recall reached 86%. For defect prediction models, we were interested in achieving a small false negative (FN) value in Eq. (11) for the recall measure. In other words, it is better for the model to miss only a small number of truly defective files than to predict a large number of files as clean when they are, in fact, defective. The low precision can be compensated for by carrying out more inspections in the testing phase. It results from a high false positive (FP) value, which means that the model predicts instances as defective when they are in fact clean.
Precision, recall, and f-measure of the T_DBN model
The results in Table 5 show that the T_DBN model returns a better average f-measure (48.6) than the three baselines: 39.8 for baseline1; 37.9 (N2), 40.1 (N3), and 37.9 (N4) for TCA (baseline2); and 36.6 for TCA+ (baseline3). Moreover, T_DBN achieves eight f-measure values out of 22 combinations (four of them above 70%) that are higher than those of the four other models, including baseline1 (TCA+) of Wang et al. [38] with the ADTree classifier. For example, T_DBN scores the best f-measure for ant1.6-poi3.0 (78.2), xalan2.5-lucene2.2 (73.7), log4j-lucene2.2 (73.7), and synapse1.2-poi3.0 (77.9), while TCA+ of Wang et al. [38] scored the best value (65.1) for synapse 1.2-poi3.0 and TCA [N3] had two best values, for jEdit-log4j (65.5) and lucene2.2-log4j (65.3). Furthermore, TCA+ did not get better results than TCA; the authors of TCA+ point out that the type of normalization cannot be generalized. Moreover, TCA with normalization type N3 had the highest average f-measure among the three baselines of our own implementation. This result is compatible with that of Nam et al. [17] for the type of normalization with TCA for cross-project combinations. However, T_DBN produced poor results in some cross-project combinations, for example camel1.4-ant1.6 (42.0), where baseline1 reached an f-measure of 59.9, and three other results were worse than those of the three baselines: lucene2.2-log4j (52.5), xerces1.3-ivy2.0 (20.4), and poi3.0-synapse1.2 (51.7). This can be attributed to the small defective rate in the target project or to the type of normalization in both processes, domain adaptation and deep learning, since we took the N3 normalization type as the best average result for all cross-project combinations.
Comparisons of F-measure (%) with the five models, with the best F-measure highlighted in bold
From the results in Section 4.2.1 we can summarize the following points:
DBN with transfer learning improves the recall measure but does not improve the precision measure; the low precision values come from high FP values, which can be addressed by carrying out more inspections.
DBN with transfer learning (T_DBN) improves cross-project defect prediction and outperforms TCA/TCA+ on average f-measure for the same classifier (logistic regression), and achieves a small improvement compared with other classifiers (ADTree for TCA+) in [38].
However, T_DBN does not improve the result for some cross-project combinations, owing either to the type of normalization or to the small defective rate in the target project.
To check whether the differences among the models are statistically significant, we used one-way analysis of variance (ANOVA) to test whether the mean F-measures of the best five models presented in Table 5 (Baseline1, TCA N3, TCA+, TCA+ with the ADTree classifier [38], and T_DBN) are equal, and to determine whether at least one mean is significantly different from the others. The hypotheses are:
Table 6 provides the ANOVA results for the best five models presented in Table 5. As shown in the table, the P value for all the models is less than 5%. Therefore,
ANOVA test
*The F-ratio value is 2.51228. The p-value is 0.046041. The result is significant at
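For reference, the F-ratio of a one-way ANOVA can be computed from the per-model F-measure lists as in the sketch below. It is an illustration of the standard formula rather than the tool used for Table 6; the function name is hypothetical, and the p-value, which requires the F distribution, is left to a statistics package.

```python
def one_way_anova(groups):
    """Return (F-ratio, df_between, df_within) for a list of samples."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between, df_within = k - 1, n - k
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within
```

With five models and, assuming one F-measure per model for each of the 22 combinations, the degrees of freedom would be 4 between groups and 105 within groups.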
RQ2: How will the performance of the proposed model be different with different settings?
4.2.2.1. DBN with and without transfer learning result
To answer this question, we prepared the second setting, DBN without transfer learning, and compared it with the first model. The second setting is DBN_Only, which trains the DBN model without an auxiliary dataset: the two phases of the DBN model are trained on a TCA_ed source dataset and the model is tested on a TCA_ed target dataset without transfer learning. We used a different set of source-target combinations to obtain results independent of the first setting. The set of combinations contains: ant-poi3, lucene-log4j, xerces-ivy2, camel-jEdit, and xerces-xalan.
Figure 4 shows the average f-measure for the five combinations for 50 epochs for transfer learning with DBN model T_DBN (A) left and for model without transfer learning DBN_Only (B) right respectively.
Average F-measure for (A) the T_DBN model (left) and (B) the DBN_Only model (right).
Figure 5 plots the T_DBN model, which uses transfer learning, as the line with small squares and the DBN_Only model as the line with large squares.
Table 7 shows the performance in terms of the best f-measure of the two models for 50 neurons and 50 epochs. The best values are highlighted in bold.
DBN with and without transfer learning result analysis
The results of the first two settings, the proposed T_DBN model with transfer learning and DBN_Only without transfer learning, are shown in Fig. 4A and B respectively. The average of the best f-measure values is more stable for T_DBN in Fig. 4A than for DBN_Only in Fig. 4B. Figure 5 shows the average best f-measure for the two models; the values are comparable, with higher averages for T_DBN. Table 7 shows that T_DBN outperforms DBN_Only in 14 out of 22 combinations, the two models have equal values in four combinations, and DBN_Only is better than T_DBN in four combinations. T_DBN, with an average f-measure of 48.6, outperforms the deep belief network without transfer learning, DBN_Only, with an average f-measure of 47.3. This means that both models, with and without transfer learning, improve defect prediction for cross-project combinations.
The best F-measure for T_DBN and DBN_Only models
Average F-measure for T_DBN and DBN_Only models.
4.2.2.2. DBN models with SMOTE technique results
Furthermore, as we wanted to show the effect of SMOTE technique on the used models, we first applied SMOTE technique on the best model of the three baselines, which is TCA with N3 normalization option, which we called TCA [N3] Smote, and we compared this with our three models. The first model was T_DBN model with SMOTE technique plus noise reduction (NR) using IQR on auxiliary dataset (All_without_source-target) which we renamed T_DBN [NR
DBN models with SMOTE technique results analysis
For the second settings of the proposed model and the effect of SMOTE technique only on the prediction model, Table 8 shows that the T_DBN [SMOTE] outperforms the TCA (N3) [Smote] with average F-measures of 48.8 and 43.7 respectively. It outperforms the other two settings of the deep belief network with average F-measure (48.6) for T_DBN [NR
F-measure for the three settings of the DBN models and TCA method
The highest average f-measure for the T_DBN [SMOTE] model illustrates that the deep belief network can deal with some degree of noise instances and gives them low weights in the higher hidden layer. Thus, they will be discarded in the classification phase. While in T_DBN [NR
4.2.2.3. DBN models on validation dataset result
We also extended our work by using the ReLink validation dataset that is used in [17]. We applied T_DBN and DBN_Only to the ReLink dataset. Table 9 shows information about the ReLink dataset.
ReLink dataset information for validation test
The number of files is small, and the dataset contains 26 complexity metrics such as average cyclomatic, average line, and maximum cyclomatic. More details on the metrics and their meaning can be found on the website of the Understand tool (footnote 3).
The best f-measure for T_DBN and DBN_Only for ReLink dataset
DBN models on validation dataset result analysis
Table 10 shows that, in the validation test on the ReLink dataset, both T_DBN and DBN_Only outperform TCA with (N2), (N3), and (N4) as well as TCA+ in our implementation of TCA/TCA+ with logistic regression. Our results are also slightly better than those of Nam et al. [17] for the same cross-project combinations: the average f-measure is 62.8 for our T_DBN model against 61.0 for TCA+ in Nam et al. [17], an improvement of 1.8%. In addition, T_DBN achieved higher f-measure values in five out of six combinations. Nam et al. [17] used LIBLINEAR as the logistic regression classifier implementation.
From the analysis of Section 4.2.2 we can summarize the following points:
DBN with transfer learning that utilizes the availability of the auxiliary dataset improves defect prediction for cross-project tasks. For example, some of the combination values (8 out of 22) for the T_DBN model are the highest among the baselines on PROMISE projects, and all the T_DBN values are higher than those for the DBN_Only model and the baseline models on the ReLink dataset.
DBN training on the source project only can extract expressive features and improve cross-project defect prediction models (such as the DBN_Only model).
The DBN model can deal with some degree of noise instances and extract expressive features to improve cross-project defect models (such as T_DBN [SMOTE] against T_DBN [NR
Balancing the class label using the SMOTE technique improves the defect prediction model for cross-project combinations (such as the T_DBN [SMOTE] model).
Our proposed model T_DBN outperforms the second baseline for Wang et al. [38] on traditional features on the PROMISE dataset. Furthermore, T_DBN outperforms [17] on the ReLink validation dataset.
RQ3: What is the time and space cost for the T_DBN based features extraction?
To answer this question, we tracked the time and memory cost for T_DBN to generate expressive features in the two phases of the DBN model, pre-training with All_without_source-target and fine-tuning with the source dataset, and to carry out prediction on the target dataset. We also recorded the time and memory cost for the setting without transfer learning, DBN_Only. We excluded pre-processing and domain adaptation from the time and space cost because they are standard procedures. Table 11 shows the time and memory cost for the T_DBN and DBN_Only models.
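A measurement of this kind can be sketched with a small wrapper that records wall-clock time and peak memory around a training or prediction call. This is a Python analogue for illustration only, not the procedure used with the Java toolkit in the experiments; `measure` is a hypothetical helper, and `tracemalloc` tracks only memory allocated by Python itself.

```python
import time
import tracemalloc


def measure(fn, *args, **kwargs):
    """Run fn and return (result, elapsed seconds, peak memory in MB)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = fn(*args, **kwargs)
    finally:
        elapsed = time.perf_counter() - start
        _, peak_bytes = tracemalloc.get_traced_memory()
        tracemalloc.stop()
    return result, elapsed, peak_bytes / (1024 * 1024)
```

Wrapping the pre-training, fine-tuning, and prediction steps in one call gives a single figure per source-target combination, comparable to one row of a cost table.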
Time and space cost for T_DBN model and DBN_Only result analysis
The result in Table 11 shows the time in minutes (m) and memory space in megabytes (MB) for the T_DBN and DBN_Only models. The time for generating expressive features, building the prediction model, and testing on the target project ranges between 6 and 12:30 minutes for the T_DBN model, with an average of about 9 minutes, whereas it ranges between nearly 0:30 and 2 minutes, with an average of about 1 minute, for the DBN_Only model. For example, the log4j-jEdit combination took about 8 minutes (7:53.2) with the T_DBN model but nearly half a minute (00:27.9) with the DBN_Only model. The extra time in the first model is due to the pre-training phase on the auxiliary dataset, whereas the source dataset log4j is fairly small, with 109 instances.
Time & Memory Space cost for T_DBN & DBN_Only (m: minute)
In general, the time taken is acceptable in practice. The memory used ranges between 10 MB and 12 MB for T_DBN and between 10 MB and 11 MB for DBN_Only. For example, xalan2.5-lucene2.2 needs 12 MB for the T_DBN model and 11 MB for the DBN_Only model.
From the analysis of the results for Section 4.2.3 we summarize the following:
DBN with its two different settings, with and without transfer learning, is acceptable and applicable for extracting expressive features, in practice, in terms of time and memory space.
Conclusion
In this study we proposed a representative learning based DBN to learn latent features from source code metrics to enhance the defect prediction process on different target projects. We leveraged the concept of transfer learning within the DBN model to improve learning between two similar projects in cross-project combinations. We benefited from the availability of the auxiliary dataset to initialize the DBN model and transfer the learned parameters to fine-tune the model using the source project generated using representative features. Then, we built a prediction model for the target project. We designed three settings for our proposed model: T_DBN, DBN_Only and T_DBN [SMOTE]. We evaluated them on open source projects from the PROMISE repository for different cross-project combinations and they outperformed baseline 1, for Wang et al. [38] by 0.7%. In this paper, the three models outperformed the three baselines and improved the defect prediction model for cross-projects on an average F-measure with 3.6%, 4.9%, and 5.1% for DBN_Only, T_DBN [NR
Future work
In the future, we would like to evaluate our DBN-based feature model on projects written in other languages, such as C/C++ and C#. In addition, we could evaluate our model at different granularity levels, such as class or package level. Moreover, we could test different transfer learning techniques for cross-project combinations across different domains and within the DBN model. Additionally, we could examine the performance of transfer learning with the DBN model using different classifiers. We could also conduct more analysis on source code metrics to select the most important features for improving the defect prediction model. Finally, we could search for a new technique to improve the precision of defect prediction models.
Threats to validity
Traditional feature selection
In our experiment we used some static code metrics that contained different features of source code but not all product metrics, and we cannot generalize these metrics to different cross-project combinations in different code languages. Adding more or different static code metrics may or may not improve the results. Moreover, the performance of the proposed model is still unknown for metrics of different source code languages and/or closed software projects.
Implementation of TCA+
Our second and third baselines depended on the TCA/TCA+ approaches. We implemented our own version of the normalization option selection for TCA+ with the help of the instructions in [17] and the assistance of the first author of that paper. We followed all the procedures defined in the paper.
Footnotes
Acknowledgments
This paper is part of the Master’s thesis submitted by Nawzat Alsmadi to the computer information systems, Faculty of IT, Yarmouk University, in April 2018.
