Default prediction of online credit loans based on mobile application usage behaviors

Abstract

Credit scoring is widely used by financial institutions for default prediction, however, a significant portion of online credit loan customers have inadequate or unverifiable credit histories, making it difficult for financial institutions to make effective credit decisions. Since the widespread use of smartphones and the popularity of mobile applications, it is worth investigating whether mobile application usage behaviors (App behaviors) of customers can effectively predict online loan defaults. This paper proposes a combined algorithm of CNN and LightGBM, and establishes credit scoring models with App behaviors to evaluate the default risk of online credit loans based on logistic regression, LightGBM, CNN and the combined algorithm, respectively. The experimental results suggest that App behaviors have an obvious effect on the default prediction of customers applying for online credit loans, and the combined model outperforms the other models in terms of the area under the curve (AUC). Furthermore, integrated credit scoring models are developed by combining App behaviors with traditional scoring features. A comparison of the integrated models and the traditional scoring model indicates that the integrated models have achieved a significant improvement in classification performance and App behaviors can be a powerful complement to the traditional credit scoring model.

Keywords

Credit scoring online credit loans mobile application usage behaviors logistic regression combined model

1 Introduction

In recent years, with the change of people’s consumption concept and the development of internet finance, not only internet financial platforms are promoting online lending business, but many traditional financial institutions are also shifting their lending business to online. As an efficiently reachable and unsecured retail financial product, the online credit loan has attracted a large number of customers due to its convenience and low cost, but it has also brought new challenges to the traditional scoring system. Traditionally, financial institutions use the basic information of customers and the data of the Credit Management Bureau as input features to establish credit scoring models [1 –4]. After years of practical experience, the credit data has proven to be very effective in predicting the credit defaults of customers. However, a considerable number of online credit loan customers’ credit information is insufficient, which makes it difficult for financial institutions to assess customers’ credit risk. In 2017, Demirguc-Kunt et al. [5] revealed that the number of adults without financial accounts is estimated to be 1.7 billion. Data-driven methods for decision making and the use of non-traditional data become future research directions for credit scoring [6]. Some researchers have attempted to explore non-traditional features to establish default prediction models, especially for the peer-to-peer lending platform. For instance, account transactional data [7 –9], online operation behavior data [10], social network information [11] and social media data [12] are new data sources for alternative features on credit scoring. Djeundje et al. [13] found that credit scoring models based on email usage, psychometric data, and demographic data can achieve better performance than models that only use demographic data. Liang and He [14] used loan-related semantic features extracted from textual information as input features to develop a credit evaluation model to evaluate the default risk of borrowers. They found that the semantic features related to loans greatly improved the predictive ability of credit evaluation models, especially for first-time loan applicants. Based on Facebook data, Cnudde et al. [15] developed a well-performing credit scoring model with an automatic scoring function for microfinance and the model can also be applied to other financial scenarios. More recently, the use of mobile phone data as non-traditional features for credit scoring has aroused the interest of researchers. There have been some credit scoring articles exploring the correlation between mobile phone usage data and default. For example, Óskarsdóttir et al. [16] extracted detailed call records from the customer’s mobile phone and established the customer’s calling network with the information obtained from the mobile operators. Roa et al. [17] tested the validity of customer data extracted from a mobile supper application for default prediction.

The proportion of population with mobile phones in the world has exceeded 95% and the number is still growing [18]. People’s lifestyle has changed by the popularity of smartphones and applications: People are used to going out without cash and credit cards, using mobile payment instead; The number of phone calls is reduced and young people prefer to communicate directly through WeChat or online videos; Most of the entertainment activities such as watching videos, listening to music, playing games and reading news are carried out on cell phones; Life behaviors such as paying bills, course learning, socializing and blog sharing are done through mobile apps. The same goes for financial activities, where the vast majority of online loans are applied by customers on mobile phones. People applying for online credit loans may not have rich credit histories or proof of stable income, but they commonly have extensive mobile app usage behavior data. However, to our knowledge, there are no studies exploring the relationship between App behaviors and online loan defaults. In this paper, we explore the feasibility of App behaviors as alternative data for credit score modeling through empirical analysis. The experimental results show that behavioral data such as installation, browsing time, usage locations and push messages of apps are available for customer behavior profiling, and that App behaviors can enable financial institutions to effectively evaluate the default risk of online loans as well as provide personalized and convenient credit services to customers who lack credit history.

With the application of non-traditional data on credit scoring, the data volume is increasing and the data forms display pluralism simultaneously, which requires the algorithms for training scoring models to handle more feature data and mine deeper information. Over the past decades, a great deal of research has been performed on individual credit scoring, and the early credit scoring models were constructed by statistical methods such as linear discriminant analysis [19], logistic regression [20] and Naive Bayes [21, 22]. Logistic regression is of epoch-making significance, and nowadays, the credit scoring model based on logistic regression is still widely used by financial institutions and academics because of its explanatory power and stability [23, 24]. However, with the popularization of big data, it is difficult for traditional statistical methods to mine high-dimensional and nonlinear information, so machine learning algorithms are widely applied in credit scoring models, for instance, support vector machine [25 –27] and neural networks [28, 29]. Furthermore, many ensemble learning algorithms have been used to enhance the performance of scoring models, like random forest [30], adaptive boost [31], and extreme gradient boosting (XGBoost) [32]. Experimental results revealed that the performance of ensemble learning classification approaches is better than individual learners [33, 34]. Recently, as deep learning has become a research hotspot in artificial intelligence for its superior performance, many scholars also established credit scoring models using deep learning algorithms such as CNN [7], long short-term memory [35] and deep belief networks [36], which have achieved desirable prediction results.

In order to sufficiently validate the default prediction effect of App behaviors, the paper proposes an innovative algorithm combining CNN and LightGBM, and applies App behaviors to establish scoring models to predict the default of online credit loans based on logistic regression, LightGBM, CNN and the combined algorithm, respectively. The performance of scoring models based on App behaviors are compared by means of the receiver operating characteristic (ROC) curve, the area under the curve (AUC) and the population stability index (PSI). Moreover, App behaviors are combined with traditional scoring features to establish integrated credit scoring models, which are used to verify the complementary effect of App behaviors on the traditional scoring model. To the best of our knowledge, we are the first to employ the customers’ various App behaviors to establish credit scoring models.

The structure of this paper is organized as follows. Section 2 describes the dataset used in this paper and how it is processed. Section 3 introduces the methods and evaluation measures for credit scoring models. Section 4 analyses the experimental results and compares the performance of the scoring models. Section 5 draws the conclusion, and section 6 discusses regulatory implications and future recommendations.

2 Data

2.1 Dataset description

The research on default prediction is in cooperation with a medium-sized commercial bank in China. The data for establishing credit scoring models is derived from the information of customers who got online credit loans from the bank between July 1, 2018 and September 30, 2019. Customers who were overdue for more than 90 days within 12 months after loan issuance are defined as default samples. The normal samples refer to customers who have not been overdue within 12 months after loans were issued. The dataset includes 19,225 default samples defined in this way. Because the number of normal samples is much more than that of default samples, the undersampling method is used to extract normal samples to alleviate the class imbalance problem. According to the normal to default samples ratio of 8:2, a corresponding proportion of normal samples are randomly selected from the monthly dataset to ensure the balance of the sample data.

The paper innovatively takes customers’ mobile application usage behaviors as input features for credit scoring models. The customers’ information used in our models is encrypted and App behaviors are processed into statistical feature series. The technical requirements and conditions for gathering data are very strict, requiring customers to authorize the bank to collect information from the lending app when they apply for credit loans. The collected information includes the types and quantity of apps, installation and uninstallation, usage frequency and duration, apps push information, geographic location change on GPS and the concentration of usage time, which are shown in Table 1.

Table 1
The feature series of mobile apps

Series Explanation Amount

Type and Quantity Types and Number of Apps 40

Install and Uninstall Installation and Uninstallation of Apps 146

Usage Duration, Frequency, and Time Period of Apps 211

GPS Geographical Location Change 15

Message Push Message Pushed by Apps 45

Series	Explanation	Amount
Type and Quantity	Types and Number of Apps	40
Install and Uninstall	Installation and Uninstallation of Apps	146
Usage	Duration, Frequency, and Time Period of Apps	211
GPS	Geographical Location Change	15
Message Push	Message Pushed by Apps	45

Apart from the features mentioned in Table 1, traditional scoring features (including basic customer information, credit records, repayment ability, loan information, account information and real estate) are also used to train models. Table 2 describes the detailed traditional features of the credit scoring model. Further, traditional features and App behaviors are combined to build scoring models for comparative analysis.

Table 2

The traditional features of credit scoring

Features	Explanation	Amount
Basic Information	Age, Gender, Marriage, Education, etc.	21
Credit	Information Processed by the Credit Report	261
Repayment Ability	Job, Income, Work Stability, etc.	24
Loan Information	Loan Amount, Loan Term, Repayment Method, etc.	15
Account Information	Deposits, Account Transactions, etc.	64
Real Estate	Situation, Quantity, Value, etc.	19

2.2 Data cleaning

To ensure the quality of the training data and the accuracy of the scoring model, data cleaning is performed on the features and samples. If the features contain too many nulls or zero values, or if the samples contain outliers, it will lead to deviation in the model results. Table 3 displays the changes in the quantity of modeling data after removing the anomalous features and samples.

Table 3
The result of data cleaning

Data Cleaning Original Quantity Retained Quantity

For App Features 457 396

For Samples 96125 94564

Data Cleaning	Original Quantity	Retained Quantity
For App Features	457	396
For Samples	96125	94564

2.3 Training and test set

To test the classification ability of the models, the dataset is divided into a training set and a test set, as shown in Table 4. The customers who applied for online credit loans from July 1, 2018 to June 30, 2019 are considered as training samples. From July 1, 2019 to September 30, 2019, customers who applied for online credit loans are regarded as test samples. At the same time, it is ensured that there is no duplication of customers in the training and test sets, so they are simultaneously independent of each other in terms of time and sample. Further, this paper adopts a 10-fold cross-validation method to randomly sample 90% of the training set for model training and 10% of it for validation, and the average of ten model validation results is taken as the final model result.

Table 4
Training samples and test samples

Dataset Default Normal Total

Train 15093 60114 75207

Test 3485 15872 19357

Dataset	Default	Normal	Total
Train	15093	60114	75207
Test	3485	15872	19357

2.4 Feature preprocessing for three algorithms

As the quality of features affects the training effect of the algorithms, feature pre-processing is a crucial step in the model establishment. Credit scoring models in the research are trained by logistic regression, LightGBM and CNN. It is necessary for different algorithms to implement corresponding feature preprocessing, which usually includes feature selection and feature encoding.

Logistic regression is a generalized linear regression model for classification prediction, so the size of input feature values has a great influence on the result of the model. Therefore, to achieve dimensionality uniformity, it is essential to encode the input features using the transformation methods like the weight of evidence (WoE) [37]. Moreover, since the model based on logistic regression shouldn’t contain too many features, it is necessary to screen features through methods such as information value (IV) [38] and correlation test.

The weight of evidence (WoE) is a popular method to encode the original features. WoE encoding starts with grouping the original features. After grouping, the calculation formula of WoE for the group i is as follows: ${WoE}_{i} = ln (\frac{G_{i} / G}{B_{i} / B}), i = 1, 2, . . ., k$ (1) where G_i and B_i represent the number of normal samples and default samples corresponding to the ith group, and G and B denote the total number of normal samples and default samples. According to Eq. (1), the larger the WoE value, the higher the proportion of normal samples of the ith group.

Information value (IV) is also known as information concentration. It can measure the overall predictive power of each feature with the calculation formula as follows:

$IV = \sum_{i = 1}^{k} [(\frac{G_{i}}{G} - \frac{B_{i}}{B})] {WoE}_{i}$ (2)

When IV is less than 0.02, it means that the feature has barely any predictive ability to the target variable and should be removed. If IV is between 0.02 and 0.1, the feature has a weak relevance to the target variable. When IV is from 0.1 to 0.3, there is a moderate correlation between the feature and the target variable. Once IV is equal to or greater than 0.3, the feature has a strong predictive ability for the target variable.

LightGBM is an ensemble learning algorithm with a boosting framework based on decision tree classifiers. The size of the input feature values does not affect the model result because of the characteristics of the decision tree algorithm. Therefore, the key preprocessing operation is to select features that have a significant impact on the model through feature importance screening. The feature importance screening refers to evaluating the predictive power of features based on the times of a feature used in decision trees classifiers and the information gain from the feature during the training process of the ensemble algorithm.

CNN is a typical feedforward neural network which requires scaling input features, as different feature dimensions can affect the results of neural network. To eliminate the dimensional influence of features, the max-min normalization method is performed to transform the original data into [0,1] through the following formula: $x_{ij}^{*} = \frac{x_{j}^{i} - x_{\min}^{i}}{x_{\max}^{i} - x_{\min}^{i}}$ (3) where $x_{j}^{i}$ is the jth sample value of the ith feature series, $x_{\max}^{i}$ and $x_{\min}^{i}$ represent the maximum and minimum of the ith feature series, respectively.

Besides, deep learning networks can automatically extract features, which means that no additional feature selection process is required.

3 Methods

This section will introduce three algorithms including CNN, LightGBM and the combination of CNN and LightGBM. Since logistic regression is a commonly used method in scoring models, it will not be described in detail, but only for modeling and performance comparison. Then the evaluation measures used to assess the classification performance and stability of the scoring models are presented, which are the ROC curve, the AUC and the PSI.

3.1 LightGBM

Light gradient boosting machine (LightGBM) is a novel gradient boosting decision tree (GBDT) algorithm proposed by Microsoft in 2017 [39]. As a member of the GBDT family, LightGBM achieves higher processing efficiency and lower memory usage than other boosting algorithms without compromising accuracy. LightGBM innovatively adopts the histogram algorithm instead of the pre-sorted algorithm to improve computational efficiency. In addition, Gradient-based One Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) are put forward in LightGBM for more efficient processing of data with large samples and features.

GOSS is a sampling optimization method that significantly improves the training speed of the model by reducing the number of data samples while ensuring the accuracy of the model. It removes samples with smaller absolute values of gradients by random sampling, and only utilizes the remaining data to calculate information gain [40]. When calculating information gain, since samples with larger gradients have a greater impact on the information gain and samples with smaller gradients are given corresponding larger weight coefficients for compensation, GOSS can obtain fairly accurate results on the remaining samples. Meanwhile, sampling will increase the diversity of the base learners, which potentially helps to improve the generalization performance.

EFB is used to solve the problem of the high time complexity of the histogram algorithm for sparse data. The core role of the EFB algorithm is to extract features of sparse data by binding mutually exclusive features that will not simultaneously be zero to reduce the number of features effectively [41]. It is an NP-Hard problem to get the best bundling mutually exclusive features, so the greedy algorithm is used to obtain a proper approximation ratio, which can reduce the number of features without greatly affecting the determination of the segmentation point.

LightGBM has also optimized the growth strategy of the base tree classifier by using Leaf-wise tree growth that is a more efficient method than Level-wise tree growth for the same number of splits. Leaf-wise can reduce more errors and get better accuracy because the leaf with the largest splitting gain can be found from all the current leaves to split each time. Since Leaf-wise may generate a deeper decision tree, LightGBM limits the depth of the tree by setting super parameters of max_depth to prevent overfitting [42].

3.2 CNN

Convolutional Neural Network (CNN) is a popular deep learning network, which has achieved excellent performance for large-scale image processing and has great success in the field of image recognition, image segmentation and pattern recognition. It was also applied to build scoring models and the experimental result showed that it could effectively predict mortgage default [7]. The principle of CNN and the network structure of our scoring model are described as follows. The input feature data in the paper is App behaviors, which are different from the picture with three dimensions, having only a one-dimensional series of 396 features. Fig. 1 is the 1-D convolutional neural network structure.

Fig. 1

The 1-D convolutional neural network structure.

Convolutional layer: The purpose of convolutional layers is to extract different features from input data. A convolutional layer contains multiple filters, and each element of the filter corresponds to a weight coefficient and a bias term. Weight sharing of the filters provides an effective way to reduce the number of parameters and shorten the training time of models. The network structure of our model contains two convolutional layers. The first layer is composed of 64 filters with the size of 11, and the second layer consists of 128 filters with the size of 9.

An activation function follows a convolutional layer to enhance the nonlinear expression ability of networks. The activation function used in the paper is the ReLU function which has numerous advantages. Firstly, the gradient can be avoided to disappear during the backpropagation. Secondly, it can reduce the interdependence of parameters and alleviate the overfitting problem. Thirdly, the computation of the backpropagation is relatively smaller than other activation functions. For feature X, the ReLU activation function implements a nonlinear transformation of its value x by the formula max (0, x).

Pooling layer: The pooling layer can simplify the complexity of the network by compressing features to remove redundant information from high-dimensional features and reduce computation, while preserving the key features and some shift invariance of the input features to prevent overfitting and improve generalization ability. The paper uses the max-pooling layer with the size of 4 after each convolutional layer to replace four features with their maximum one. After two pooling layers, the dimension of features is reduced to 1/16 of the original.

Fully connected layer: CNN applies fully connected layers for classification prediction. It can realize the dimensional transformation of data from high to low dimension while retaining valuable information. The flattened features are obtained by combining local features, and then fully connected layers perform a weighted summation of the features with matrix parameters. Subsequently, it applies a nonlinear activation function to forecast classification results. Two fully connected layers are used in our network. The first layer contains the input of 3200 features and the output of 64 features, and the second layer outputs binary classification results through the input of 64 features and the Softmax activation function.

The dropout method is used in each convolutional layer and fully connected layer, which randomly assigns zero weights to neurons in the network. Since we choose a keeping ratio of 0.7, the thirty percent of neurons will be zero weight. Through this regularization method, the network’s response to small changes in data is less sensitive. Therefore, it can prevent overfitting and improve the robustness of the model.

The CNN structure of our credit scoring model is shown in Fig. 2. It is composed of one input layer, two convolutional layers, two pooling layers, two fully connected layers and one output layer. The number of layers and hyperparameters of our model have been adjusted and optimized after extensive experiments.

Fig. 2

The CNN structure of our credit scoring model.

CNN can automatically learn multiple features from the training data. The weight sharing of filters and the sparse connectivity between layers can reduce the training parameters and decrease the complexity of the network, so it extracts features more effectively and have stronger adaptability. These advantages of CNN are mainly realized in convolutional layers and pooling layers. The role of full connection layers is to integrate features extracted by the previous layer to predict the classification. However, when the sample size is not large enough, simple nonlinear methods may fall into the local optimum situation, which is difficult to achieve the ideal result. Therefore, LightGBM may be used instead of fully connected layers for classification prediction.

3.3 Combined algorithm

To improve the classification accuracy of the model, a combined algorithm of CNN and LightGBM is proposed to build the credit scoring model. Fig. 3 shows the main procedure of the combined algorithm. In order to enhance the feature extraction performance and generalization ability of the scoring model, we apply the 10-fold cross-validation to establish ten different CNN models and fix their weight parameters. Inspired by the transfer learning and ensemble learning, convolutional layers and pooling layers of the 10 CNNs are used to extract features of App behaviors, and then the features are retrained by LightGBM to predict classification.

Fig. 3

The main procedure of the combined algorithm.

The implementation of our proposed ensemble algorithm is as follows:

Step 1: Construct the convolutional neural networks according to the structure in Section 3.2 and initialize the weights.

Step 2: Divide the training data into 10 equal-sized parts randomly and then combine them into 10 data groups. In the ith group, the part P_i is used for validation, while the rest of the parts are applied for training.

Step 3: Input the data groups obtained by Step 2 into the convolution neural network in Step 1 separately, and update the weights iteratively by backpropagation until the model converges and the loss is lower than the target threshold. As a result, 10 different convolutional neural network models are built.

Step 4: Fix the weights of the 10 CNN models trained in Step 3 and reserve the network modules before the fully connected layers.

Step 5: Input the training data into the reserved convolutional layers and pooling layers to extract feature data, and then flatten feature data into a one-dimensional feature vector.

Step 6: Apply LightGBM instead of the fully connected layer to retrain the flattened feature data for classification prediction.

3.4 Evaluation measures

The ROC curve and the AUC are widely used to evaluate the classification performance of credit scoring models because of their character agnostic to the skew dataset [22]. The ROC curve is a graphical approach to evaluate the sorting ability of scoring models. Its abscissa and the ordinates are true positive rate (TPR) and false positive rate (FPR), respectively. TPR is defined as the proportion of the default samples correctly predicted by the credit scoring model, and FPR is corresponding to the fraction of actual normal samples wrongly predicted as the default category by the credit scoring model. As the score increases, both TPR and FPR will approach 1, forming a curve starting from the coordinate point (0, 0) and ending at the point (1, 1). AUC is the area under the ROC curve with values from 0 to 1. When a model can perfectly identify default customers, the AUC will approach 1. When a classifier just performs random guessing, the AUC will be equal to 0.5.

The PSI is an important indicator to measure the stability of scoring models, which can assess the difference in the distribution of scores between training samples and test samples. Its calculating formula is:

$PSI = \sum_{i} [(p_{train}^{i} - p_{test}^{i}) \times \ln (\frac{p_{train}^{i}}{p_{test}^{i}})]$ (4)

where $p_{train}^{i}$ and $p_{test}^{i}$ represent the percentage of samples in the i-th bin of training set and test set, respectively. It is generally believed that the model is stable when PSI is less than 0.1.

4 Experiments

Our experiments have two purposes. One is to verify whether App behaviors can effectively evaluate the default risk of online credit loans and whether the combination of App behaviors and traditional scoring features can achieve the desired enhancement effect. The second is to verify whether the combined model of CNN and LightGBM can obtain better performance in the default prediction.

4.1 Credit scoring models based solely on App behaviors

The first experiment applies App behaviors to establish credit scoring models with logistic regression, CNN, LightGBM and the combined algorithm of CNN and LightGBM, respectively. Then the ROC curves are employed to compare the performance of models, as displayed in Fig. 4. The classification ability of the models is evaluated by AUC and the stability of the models is evaluated by the value of PSI, which is shown in Table 5. As seen in Table 5, the AUC value of each model is greater than 0.64, which suggests that App behaviors have a certain predictive power. The PSI values of models are much less than 0.1, illustrating that all credit scoring models are stable. The Combined model performs best on the evaluation metric of AUC and hence has a superior default prediction ability. The classification capacity of the models trained by LightGBM and CNN is close to that of the combined model. Although the LR model has the lowest AUC, the AUC values are quite close on the training and test sets and its PSI is the lowest, which indidcates that LR has reliable stability.

Fig. 4

The ROC curves of models based on the App behaviors. (a) is on the training set, (b) is on the test set.

Table 5

The AUC and PSI results of models with App behaviors

Model	Train AUC	Test AUC	AUC Change Ratio	PSI
LR	64.57%	64.21%	-0.56%	0.08%
LGB	68.63%	66.63%	-2.93%	0.72%
CNN	68.23%	65.91%	-3.40%	0.91%
Combined	69.84%	66.90%	-4.21%	0.92%

LR is a logistic regression model, LGB is a LightGBM model, CNN is a CNN model as described in section 3.2, and Combined is the model based on the combined algorithm as described in section 3.3. Combined performs best on the evaluation metric of AUC. The performance of the CNN and LGB are close to the Combined model. The AUC value of the LR model is the lowest and so is its PSI.

4.2 Correlation analysis of App behaviors and default

By analyzing the results of the credit scoring models, we can further study the characteristics of App behaviors in default prediction and find out which kind of App behaviors has a significant effect on default prediction. As observed in Fig. 4, the curves are closer to the upper left corner in the low scoring segments of the scoring models. This indicates that the scoring models have better default prediction ability in the low scoring segments than that in the high scoring segments, and therefore the App behaviors data is more suitable for identifying default behavior of ’bad’ customers.

Furthermore, the feature importance screening results of the LightGBM model and the IV values of the features are applied to determine what kind of App behavior has a significant impact on default prediction and the degree of impact, where the IV values are taken from the IV maximum of the corresponding kinds of App behavior features. Meanwhile, the Spearman correlation test is used to analyze the correlation between App behaviors and default. The main App behavior features used in the model and the correlation results are shown in Table 6.

Table 6
Mainly used App behavior features and their correlation results

Types Usage Behavior Correlation Max IV

Lending/Borrowing Installation, Usage Positive 17.55%

Financial Management Usage Negative 9.51%

Office Installation Negative 8.85%

Games/Entertainment Usage Positive 12.08%

Gambling Installation, Usage Positive 10.52%

GPS Information Frequency of Location Changes Positive 7.55%

Use Time Period Use Time at Night Positive 6.36%

Types	Usage Behavior	Correlation	Max IV
Lending/Borrowing	Installation, Usage	Positive	17.55%
Financial Management	Usage	Negative	9.51%
Office	Installation	Negative	8.85%
Games/Entertainment	Usage	Positive	12.08%
Gambling	Installation, Usage	Positive	10.52%
GPS Information	Frequency of Location Changes	Positive	7.55%
Use Time Period	Use Time at Night	Positive	6.36%

From Table 6, it can be seen that the App behaviors of Lending & Borrowing are positively correlated with default, and its IV is the highest, so the degree of influence is the greatest. In general, customers who install many lending apps or often browse lending apps are usually cash-strapped and borrow frequently, making it difficult for one or two financial institutions to meet their financing needs. Therefore, those customers are bound to have a higher default risk than normal customers. Meanwhile, excessive usage of entertainment and gambling apps also signals a higher risk of default. Customers who spend a lot of time on entertainment apps or indulge in illegal gambling apps will have a higher probability of payment difficulties or credit fraud compared to the other customers. Besides, abnormal app usage behavior is positively correlated with default as well. Customers whose geographic locations change frequently, especially those with city positioning changes, and customers who often use mobile phones at night also have a certain risk of default. Conversely, the behavior features of financial management apps and office apps are negatively correlated with default. Customers who often use financial management apps or install more office apps always have a good asset condition or stable white-collar jobs, therefore, they have a higher repayment ability and are less likely to default.

4.3 Integrated scoring models based on App behaviors and traditional features

The second experiment is to further verify the superiority of combining App behaviors and traditional scoring features for default prediction. Logistic regression is used to develop six credit scoring models, which differ only in terms of input features. The performance of the models are measured and compared by ROC, AUC and PSI, as shown in Fig. 5 and Table 7. As observed, the AUC value of the model TR using only traditional features is 82.32% on the test set. After integrating traditional features with App behaviors, the classification performance of the integrated models has significantly improved according to the AUC values. Among the five integrated models, the model TR_Com that uses the Combined model result as input feature performs best in classification with the AUC value of 86.01% on the test set. In terms of AUC values and PSI values, the classification ability of TR_CNN, TR_LGB and TR_Com is close and significantly better than that of TR_App and TR_LR. Correspondingly, the stability performance of the latter is better than that of the former.

Fig. 5

The ROC curves of models based on combined features. (a) is on the training set, (b) is on the test set.

Table 7

The AUC and PSI results of models based on combined features

Model	Train AUC	Test AUC	AUC Change Ratio	PSI
TR	82.66%	82.32%	-0.41%	0.12%
TR_App	86.14%	85.42%	-0.84%	0.22%
TR_LR	86.14%	85.17%	-1.13%	0.39%
TR_LGB	87.92%	85.96%	-2.23%	0.49%
TR_CNN	87.65%	85.61%	-2.33%	0.52%
TR_Com	88.07%	86.01%	-2.34%	0.56%

The above models are established by the logistic regression algorithm, and the only difference between the six models is the input features. TR represents the model only using traditional features. TR_App represents the model using the combination of traditional features and App behaviors. TR_LR is the model using the combination of traditional features and the LR model result. TR_LGB is the model using the combination of traditional features and the LGB model result. TR_CNN is the model using the combination of traditional features and the CNN model result. TR_Com represents the model using the combination of traditional features and the Combined model result.

5 Conclusion

In this paper, we explore the role of App behaviors in default prediction and propose a combined scoring model with excellent performance.

Firstly, the credit scoring models are established based on App behaviors using logistic regression, LightGBM, CNN, and the combined algorithm of CNN and LightGBM, respectively. The experimental result shows that App behaviors has a clear default predictive value for online credit loan and the logistic regression algorithm-based scoring model has the highest stability. Meanwhile, the combined algorithm-based model has the best classification ability and the stability of the model is within the acceptable range, hence its overall performance is outstanding.

Secondly, the result of the integrated models trained on App behaviors and traditional scoring features demonstrates that App behaviors significantly improve models’ performance and strengthen the default prediction power of credit scoring models. In particular, the integrated model that used the combined model result as the input feature has the best performance in classification. So both App behavior data and the combined algorithm of CNN and LightGBM are quite valuable for credit scoring models.

Finally, the feature importance screening results and the correlation between App behaviors and default show that App behaviors are indeed highly correlated with the credit risk of customers applying for online loans. As the analysis results of the scoring model shown, the behavior features of lending apps are highly predictive of default, and customers who frequently install and browse many lending apps have a higher risk of default. Likewise, customers who frequently use entertainment apps, gambling apps, and those with abnormal app usage behavior are also subject to a certain risk of default. Conversely, customers who often use financial management apps and office apps usually have assets and stable jobs, and rarely default on their loans. Therefore, it is of great practical significance for financial institutions to use App behaviors for customer profiling and risk identification.

6 Regulatory implications and recommendations

The collection and application of mobile App behavior data rely on the authorization of customers and the technological capability of financial institutions. In practice, the bank solicits customers to voluntarily authorize it to access application information for credit decisions by providing them with preferential credit policies. Since just part of the customers will agree to authorize financial institutions to collect application information due to their trust in financial institutions, the scoring model based on App behaviors can only be applied to these customers. Therefore, on the premise that customers voluntarily authorize and lending apps collect information based on the principle of minimum necessity, financial institutions apply App behaviors as alternative data for credit scoring modeling, which is in line with regulatory policies and can provide customers with better credit services.

Since the stability of the credit scoring model based on the combined algorithm is poorer than that of the logistic regression-based model, it is necessary for financial institutions to maintain a high frequency of iterative optimization when using the combined model. Furthermore, by observing the ROC curve in Fig. 4, it can be found that the scoring models perform better in the low-scoring segment than the high-scoring segment in terms of classification ability. It suggests that App behaviors are likely to be more suitable for identifying default risks of low-scoring customers. The App behaviors may be applied in the future to develop the fraud prediction model for so-called long-tail customers.

References

Chi

B.W.

and Hsu

C.C.

, A hybrid approach to integrate genetic algorithm into dual scoring model in enhancing the performance of credit scoring model, Expert Systems with Applications 39(3) (2012), 2650–2661.

Jones

, Johnstone

and Wilson

, An empirical evaluation of the performance of binary classifiers in the prediction of credit ratings changes, Journal of Banking & Finance 56 (2015), 72–85.

Abdou

H.A.

, Tsafack

M.D.D.

, Ntim

C.G.

and Baker

R.D.

, Predicting creditworthiness in retail banking with limited scoring data, Knowledge-Based Systems 103 (2016), 89–103.

Nalic

, Martinovic

and Zagar

, New hybrid data mining model for credit scoring based on feature selection algorithm and ensemble classifiers, Advanced Engineering Informatics 45 (2020), 101130.

Demirgüç-Kunt

, Klapper

, Singer

, Ansar

and Hess

, The Global Findex Database 2017: Measuring Financial Inclusion and the Fintech Revolution, World Bank Group, Washington, (2018).

Onay

and Öztürk

, A review of credit scoring research in the age of big data, Journal of Financial Regulation and Compliance 26(3) (2018), 382–405.

Kvamme

, Sellereite

, Aas

and Sjursen

, Predicting mortgage default using convolutional neural networks, Expert Systems with Applications 102 (2018), 207–217.

Speakman

, Sridharan

and Markus

, Three Population Covariate Shift for Mobile Phone-Based Credit Scoring, Proceedings of the 1st ACM SIGCAS Conference on Computing and Sustainable Societies (2018), 1–7.

Tobback

and Martens

, Retail credit scoring using finegrained payment data, Journal of the Royal Statistical Society Series A-Statistics in Society 182(4) (2019), 1227–1246.

10.

Berg

, Burg

, Gombovi

and Puri

, On the rise of fintechs: Credit scoring using digital footprints, Social Science Electronic Publishing 33(7) (2020), 2845–2897.

11.

Niu

, Ren

and Li

, Credit scoring using machine learning by combing social network information: Evidence from peer-topeer lending, Information 10(12) (2019), 397.

12.

Kulkarni

S.V.

and Dhage

S.N.

, Advanced credit score calculation using social media and machine learning, Journal of Intelligent & Fuzzy Systems 36(3) (2019), 2373–2380.

13.

Djeundje

V.B.

, Crook

, Calabrese

and Hamid

, Enhancing credit scoring with alternative data, Expert Systems with Applications 163 (2020), 113766.

14.

Liang

and He

, Analyzing credit risk among Chinese P2Plending businesses by integrating text-related soft information, Electronic Commerce Research and Applications 40 (2020), 100947.

15.

De Cnudde

, Moeyersoms

, Stankova

, Tobback

, Javaly

and Martens

, What does your facebook profile reveal about your creditworthiness? using alternative data for microfinance, Journal of the Operational Research Society 70 (2018), 353–363.

16.

Óskarsdóttir

, Bravo

, Sarraute

, Vanthienen

and Baesens

, The value of big data for credit scoring: Enhancing financial inclusion using mobile phone data and social network analytics, Applied Soft Computing 74 (2019), 26–39.

17.

Roa

, Correa-Bahnsen

, Suarez

, Cortés-Tejada

, Luque

M.A.

and Bravo

, Super-app behavioral patterns in credit risk models: Financial, statistical and regulatory implications, Expert Systems with Applications 169 (2021), 114486.

18.

Blumenstock

, Don’t forget people in the use of big data for development, Nature 561(7722) (2018), 170–172.

19.

Durand

, Risk Elements in Consumer Instalment Financing, National Bureau of Economic Research, Inc., (1941), 189–201.

20.

Wiginton

J.C.

, , A note on the comparison of logit and discriminant models of consumer credit behavior, The Journal of Financial Quantitative Analysis 15 (1980), 757–770.

21.

Kao

L.J.

, Chiu

C.C.

and Chiu

F.Y.

, A bayesian latent variable model with classification and regression tree approach for behavior and credit scoring, Knowledge-Based Systems 36 (2012), 245–252.

22.

Luo

, A comprehensive decision support approach for credit scoring, Industrial Management & Data Systems 120(2) (2019), 280–290.

23.

Sohn

S.Y.

, Kim

D.H.

and Yoon

J.H.

, Technology credit scoring model with fuzzy logistic regression, Applied Soft Computing 43 (2016), 150–158.

24.

Shen

, Wang

and Shen

, A cost-sensitive logistic regression credit scoring model based on multi-objective optimization approach, Technological and Economic Development of Economy 26 (2019), 405429.

25.

Sun

, Li

, Chang

P.C.

and Huang

Q.H.

, Dynamic credit scoring using b & b with incremental-SVM-ensemble, Kybernetes 44 (2015), 518–535.

26.

Maldonado

, Bravo

, Lopez

and Perez

, Integrated framework for profit-based feature selection and SVM classification in credit scoring, Decision Support Systems 104 (2017), 113–121.

27.

Pawiak

, Abdar

and Acharya

U.R.

, Application of new deep genetic cascade ensemble of SVM classifiers to predict the Australian credit scoring, Applied Soft Computing 84 (2019), 105740.

28.

West

, Neural network credit scoring models, Computers & Operations Research 27(11) (2000), 1131–1152.

29.

Marcano-Cedeno

, Marin-De-La-Barcena

, Jimenez-Trillo

, Pinuela

J.A.

and Andina

, Artificial metaplasticity neural network applied to credit scoring, International Journal of Neural Systems 21(4) (2011), 311–317.

30.

Verikas

, Gelzinis

and Bacauskiene

, Mining data with random forests: A survey and results of new tests, Pattern Recognition 44(2) (2011), 330–349.

31.

Shen

, Zhao

, Kou

and Alsaadi

F.E.

, A new deep learning ensemble credit risk evaluation model with an improved synthetic minority oversampling technique, Applied Soft Computing 98 (2021), 106852.

32.

Chen

and Guestrin

, Xgboost:AScalableTree Boosting System, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD’16, San Francisco, California, USA, (2016), 785794.

33.

Dastile

, Celik

and Potsane

, Statistical and machine learning models in credit scoring: A systematic literature survey, Applied Soft Computing 91 (2020), 106263.

34.

Koutanaei

F.N.

, Sajedi

and Khanbabaei

, A hybrid data mining model of feature selection algorithms and ensemble learning classifiers for credit scoring, Journal of Retailing and Consumer Services 27 (2015), 11–23.

35.

Liang

and Cai

, Forecasting peer-to-peer platform default rate with LSTM neural network, Electronic Commerce Research and Applications 43(8) (2020), 100997.

36.

Luo

, Wu

and Wu

, A deep learning approach for credit scoring using credit default swaps, Engineering Applications of Artificial Intelligence 65 (2017), 465–470.

37.

Weed

D.L.

, Weight of evidence: A review of concept and methods, Risk analysis 25(6) (2005), 1545–1557.

38.

Ruiz

, Gomes

, Rodrigues

and Gama

, Credit Scoring in Microfinance Using Non-traditional Data, in: Progress in Artificial Intelligence, Springer International Publishing, Cham, (2017), 447–458.

39.

, Meng

, Finley

, Wang

, Chen

, Ma

, Ye

and Liu

T.Y.

, Lightgbm: A Highly Efficient Gradient Boosting Decision Tree, in: Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc., (2017), 3146–3154.

40.

Jin

, Lu

, Qin

, Cheng

and Mao

, Swiftids: Real-time intrusion detection system based on lightgbm and parallel intrusion detection mechanism, Computers & Security 97 (2020), 101984.

41.

and Chen

, A comparative performance assessment of ensemble learning for credit scoring, Mathematics 8(10) (2020), 1756.

42.

Zeng

, Yang

, Zhang