Bank Loan Default Prediction--Predictive Modeling (Part 3)
Abstract of the project
Objective: In this data challenge, I work with 8 datasets from a bank (the data were collected in 1999). The goal is to analyze the data and train a model that predicts which customers may default on their loans.
Steps:
- Data pre-processing
- Load the data from .asc file into pd.DataFrame data structure.
- Transform the data into the ideal format and content.
- Rename and drop the columns if needed.
- Include some basic feature engineering for each table, such as encoding categorical features and pivoting.
- Analyze the data distribution by visualization.
- Feature Engineering and Dataset Preparation
- Merge some tables to create new features that may relate to the prediction.
- Merge all the tables into one modeling table, keeping only int and float columns. Drop or fill the columns with null values.
- Conduct feature selection.
- Baseline model
- Begin with Linear SVM.
- Evaluation
- Visualize the feature coefficients.
- Tune hyper-parameters for SVM.
- Try other models
- Random Forest
- AdaBoost
- Gradient Boosting
- XGBoost
- Conclusion
Note: This blog is part 3 of the whole project and covers only the predictive modeling. Please refer to the other posts in the series for the remaining parts.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(font_scale = 1.2)
%matplotlib inline
Feature Selection
Variance
The most basic feature selection is removing all zero-variance features, i.e. features that take the same value in every sample. Scikit-learn's feature selection module provides some built-in methods for this. In our case, the feature dimensionality is not very large, so I did not apply a harsh feature selection policy; I only check for zero-variance features by counting how many distinct values each feature has.
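Before the manual check below, note that scikit-learn can do this in one call; a minimal sketch using VarianceThreshold (a threshold of 0.0 drops only constant columns):
from sklearn.feature_selection import VarianceThreshold

# Drop features whose variance is exactly zero, i.e. constant columns.
selector = VarianceThreshold(threshold = 0.0)
data_reduced = selector.fit_transform(data)
print('kept %d of %d features' % (data_reduced.shape[1], data.shape[1]))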
def distinct_values(df):
    '''
    Check the number of distinct values in each column.
    INPUT:
    - df: the pandas DataFrame to be checked
    OUTPUT:
    - None
    '''
    for c in df.columns:
        print('There are %d different values in the column %s.' % (df[c].nunique(), c))
distinct_values(data)
There are 645 different values in the column amount_x.
There are 5 different values in the column duration.
There are 577 different values in the column payments.
There are 2 different values in the column status.
There are 682 different values in the column balance_before_loan_3m.
There are 653 different values in the column balance_before_loan_2w.
There are 682 different values in the column balance_before_loan_1m.
There are 169 different values in the column trans_freq_sum.
There are 203 different values in the column trans_freq_mean.
There are 3 different values in the column trans_year_count.
There are 90 different values in the column large_amount_freq.
There are 21 different values in the column low_5k_freq.
There are 3 different values in the column frequency.
There are 5 different values in the column usage_year.
There are 4 different values in the column type.
There are 649 different values in the column amount_y.
There are 227 different values in the column order_a_k_Missing_symbol.
There are 111 different values in the column order_a_k_POJISTNE.
There are 429 different values in the column order_a_k_SIPO.
There are 591 different values in the column order_a_k_UVER.
There are 2 different values in the column sex.
There are 46 different values in the column age.
There are 77 different values in the column A1.
There are 77 different values in the column A4.
There are 53 different values in the column A5.
There are 36 different values in the column A6.
There are 17 different values in the column A7.
There are 6 different values in the column A8.
There are 11 different values in the column A9.
There are 70 different values in the column A10.
There are 76 different values in the column A11.
There are 73 different values in the column A13.
There are 44 different values in the column A14.
There are 76 different values in the column A16.
There are 6 different values in the column year_loan.
Linear Correlation among data
If two variables are highly correlated, keeping both may be redundant; keeping only one helps reduce dimensionality without much loss of information.
corr = data.corr()
plt.figure(figsize = (15, 7))
sns.heatmap(corr, cmap = 'OrRd', linewidths = 0.01)
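If the heatmap reveals strongly correlated pairs, a common recipe for pruning them looks like the sketch below (the 0.9 cutoff is an illustrative choice, not a value taken from this analysis):
# Keep only the upper triangle of the absolute correlation matrix,
# then flag every column that is highly correlated with an earlier one.
upper = corr.abs().where(np.triu(np.ones(corr.shape, dtype = bool), k = 1))
to_drop = [c for c in upper.columns if (upper[c] > 0.9).any()]
print('candidates to drop:', to_drop)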
Predictive Modeling
- baseline model:
- SVM
- other models:
- RandomForest
- AdaBoost
- GradientBoost
- XGBoost
features = data.columns.to_list()
features.remove('status')
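The feature matrix and train/test split are prepared in an earlier part of the notebook; for completeness, here is a minimal sketch of that setup, assuming status is the binary default label:
from sklearn.model_selection import train_test_split

# Assumed setup: 'status' is the binary target, every other column is a feature.
X = data[features].values
y = data['status'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1 / 3)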
SVM with linear kernel
from sklearn import svm
clf = svm.SVC(kernel = 'linear', C = 1, probability = True)
clf.fit(X_train, y_train)
clf.score(X_test, y_test) # accuracy
0.8947368421052632
# check the difference between target and prediction
# if result is not 0, it means target != prediction
diff = clf.predict(X_test) - y_test
diff[diff != 0]
68 -1
143 -1
555 1
19 -1
428 -1
237 -1
371 -1
435 -1
124 -1
340 1
128 1
230 -1
597 1
155 -1
301 1
132 -1
173 1
115 -1
291 -1
281 -1
408 1
383 1
493 -1
41 -1
Name: status, dtype: int64
# predict the probability score for each class/category
clf.predict_proba(X_test)[:5]
array([[0.98444619, 0.01555381],
[0.98370535, 0.01629465],
[0.46087692, 0.53912308],
[0.92077182, 0.07922818],
[0.99238528, 0.00761472]])
# evaluation with sklearn's classification_report
from sklearn.metrics import classification_report
print(classification_report(y_test, clf.predict(X_test)))
precision recall f1-score support
0 0.92 0.96 0.94 200
1 0.60 0.43 0.50 28
accuracy 0.89 228
macro avg 0.76 0.69 0.72 228
weighted avg 0.88 0.89 0.89 228
compare the SVM coefficients
I visualized the SVM coefficient of each input feature by squaring the coefficients and dividing by the maximum value to obtain a ratio. A larger ratio indicates that the feature has a greater impact on the decision.
# process the svm weights by square
svm_weights = (clf.coef_ ** 2).sum(axis = 0)
svm_weights /= svm_weights.max() # indicates how large is the svm coeff of each feature
plt.figure(figsize = (12, 5))
sns.barplot(x = features, y = svm_weights, label = 'SVM weight', alpha = 0.6)
plt.title("Feature Selection")
plt.xticks(rotation = 90)
plt.axis('tight')
plt.legend(loc = 'upper right')
visualize the confusion matrix
The confusion matrix shows the ground-truth labels against the prediction results, so we can see exactly what kinds of mistakes the model makes. For example, a false positive means the model predicts a negative sample as positive; in our case, the true label is 0 but the prediction is 1. In this scenario, we should focus on reducing false negatives.
from sklearn.metrics import confusion_matrix
def plot_cf_matrix(y_test, y_pred):
    '''
    Plot the confusion matrix.
    INPUT:
    - y_test: ground truth labels
    - y_pred: model's predictions
    '''
    cf_matrix = confusion_matrix(y_test, y_pred)
    group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
    group_counts = ['{0:0.0f}'.format(value) for value in cf_matrix.flatten()]
    group_percentages = ['{0:.2%}'.format(value) for value in cf_matrix.flatten() / np.sum(cf_matrix)]
    labels = [f'{v1}\n{v2}\n{v3}' for v1, v2, v3 in zip(group_names, group_counts, group_percentages)]
    labels = np.asarray(labels).reshape(2, 2)
    sns.heatmap(cf_matrix, annot = labels, fmt = '', cmap = 'Blues')
y_pred = clf.predict(X_test)
plot_cf_matrix(y_test, y_pred)
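Since false negatives are the costly error in this use case, it also helps to pull them out numerically; a quick sketch reusing the fitted classifier:
# Count false negatives (true label 1, predicted 0) and compute recall directly.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print('false negatives: %d, recall: %.2f' % (fn, tp / (tp + fn)))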
SVM with normalized input
There is more than one scaler available; I adopted StandardScaler in this case. If the data contain many outliers, it is better to choose a more robust alternative such as RobustScaler. Note that in some cases, normalizing the input does not yield better results.
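As an aside, the RobustScaler alternative mentioned above would look like this (an illustrative sketch, not part of this run):
from sklearn.preprocessing import RobustScaler

# RobustScaler centers on the median and scales by the interquartile range,
# so extreme balances or transaction amounts do not dominate the scaling.
X_robust = RobustScaler().fit_transform(X)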
from sklearn.preprocessing import StandardScaler
# normalize the input data with the standard scaler
# (caveat: fitting the scaler on all of X before splitting leaks test-set
# statistics into training; fitting on X_train only is the safer practice)
sc = StandardScaler()
X = sc.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1 / 3)
clf2 = svm.SVC(kernel = 'rbf', C = 10, probability = True)
clf2.fit(X_train, y_train)
clf2.score(X_test, y_test)
0.8947368421052632
print(classification_report(y_test, clf2.predict(X_test)))
precision recall f1-score support
0 0.93 0.95 0.94 200
1 0.59 0.46 0.52 28
accuracy 0.89 228
macro avg 0.76 0.71 0.73 228
weighted avg 0.89 0.89 0.89 228
y_pred = clf2.predict(X_test)
plot_cf_matrix(y_test, y_pred)
visualize the ROC curve and precision-recall curve
The ROC curve requires either the probabilities or the non-thresholded decision values from the estimator.
from sklearn.metrics import precision_recall_curve, PrecisionRecallDisplay, roc_curve, RocCurveDisplay
y_score = clf2.decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, y_score, pos_label=clf2.classes_[1])
roc_display = RocCurveDisplay(fpr = fpr, tpr = tpr)
prec, recall, _ = precision_recall_curve(y_test, y_score,
pos_label=clf2.classes_[1])
pr_display = PrecisionRecallDisplay(precision=prec, recall=recall)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
roc_display.plot(ax = ax1)
pr_display.plot(ax = ax2)
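Since clf2 was trained with probability = True, the positive-class probability from predict_proba would serve equally well as the score for both curves:
# Equivalent alternative score: probability of the positive class.
y_score_proba = clf2.predict_proba(X_test)[:, 1]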
Optimize the hyper-parameters of SVM with GridSearch
Hyper-parameter tuning is one of the headaches in machine learning. Grid search makes this step much easier: we define the candidate values for each hyper-parameter in a dictionary, and GridSearchCV evaluates every combination with cross-validation and reports the best one.
from sklearn.model_selection import GridSearchCV
svc = svm.SVC()
param_grid = [{'C':[0.1,1,10], 'kernel':['linear']},
{'C':[0.1,1,10], 'gamma':[0.001,0.01],'kernel':['rbf']}]
scoring = 'accuracy'
clfs = GridSearchCV(svc, param_grid, scoring = scoring, cv=10)
clfs.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=SVC(),
param_grid=[{'C': [0.1, 1, 10], 'kernel': ['linear']},
{'C': [0.1, 1, 10], 'gamma': [0.001, 0.01],
'kernel': ['rbf']}],
scoring='accuracy')
print(clfs.best_estimator_)
SVC(C=0.1, kernel='linear')
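The winning hyper-parameters and their mean cross-validated accuracy are also available directly:
# Inspect the best hyper-parameter combination and its CV accuracy.
print(clfs.best_params_)
print('CV accuracy: %.3f' % clfs.best_score_)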
from sklearn.metrics import classification_report
print(classification_report(y_test, clfs.best_estimator_.predict(X_test)))
precision recall f1-score support
0 0.97 0.98 0.97 212
1 0.64 0.56 0.60 16
accuracy 0.95 228
macro avg 0.81 0.77 0.79 228
weighted avg 0.94 0.95 0.95 228
y_pred = clfs.best_estimator_.predict(X_test)
plot_cf_matrix(y_test, y_pred)
Try other models
- RandomForest
- AdaBoost
- GradientBoost
- XGBoost
script to visualize the feature importance
def visualize_coeff(model, name, feature_col = features):
    '''
    Visualize the feature importance of a fitted model.
    INPUT:
    - model: the fitted model instance
    - name: name of the model to show in the title
    - feature_col: a list of feature column names, defaulting to the global list named features
    OUTPUT:
    - None
    '''
    coeff = model.feature_importances_
    importance = pd.DataFrame({'feature': feature_col, 'importance': coeff})
    importance.sort_values(by = 'importance', ascending = False, inplace = True)
    plt.figure(figsize = (12, 5))
    plt.xticks(rotation = 90)
    sns.barplot(x = 'feature', y = 'importance', data = importance).set_title('The feature importance of %s' % name)
Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth = 2, random_state = 10)
rf.fit(X_train, y_train)
rf.score(X_test, y_test)
0.9122807017543859
print(classification_report(y_test, rf.predict(X_test)))
precision recall f1-score support
0 0.92 1.00 0.95 206
1 0.75 0.14 0.23 22
accuracy 0.91 228
macro avg 0.83 0.57 0.59 228
weighted avg 0.90 0.91 0.88 228
visualize_coeff(rf, 'RandomForestClassifier')
y_pred = rf.predict(X_test)
plot_cf_matrix(y_test, y_pred)
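The recall on the default class is quite low here (0.14); one common remedy, shown as an illustrative variant rather than a tuned model, is to re-weight the minority class:
# Illustrative variant: weight the minority (default) class more heavily.
rf_balanced = RandomForestClassifier(max_depth = 2, random_state = 10, class_weight = 'balanced')
rf_balanced.fit(X_train, y_train)
print(classification_report(y_test, rf_balanced.predict(X_test)))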
AdaBoost
from sklearn.ensemble import AdaBoostClassifier
adaboost = AdaBoostClassifier(n_estimators = 100)
adaboost.fit(X_train, y_train)
adaboost.score(X_test, y_test)
0.9254385964912281
print(classification_report(y_test, adaboost.predict(X_test)))
precision recall f1-score support
0 0.97 0.95 0.96 212
1 0.48 0.62 0.54 16
accuracy 0.93 228
macro avg 0.72 0.79 0.75 228
weighted avg 0.94 0.93 0.93 228
visualize_coeff(adaboost, 'AdaBoost')
Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
dtc = GradientBoostingClassifier(n_estimators = 200, learning_rate = 0.1).fit(X_train, y_train)
dtc.score(X_test, y_test)
0.9122807017543859
print(classification_report(y_test, dtc.predict(X_test)))
precision recall f1-score support
0 0.95 0.96 0.95 206
1 0.55 0.50 0.52 22
accuracy 0.91 228
macro avg 0.75 0.73 0.74 228
weighted avg 0.91 0.91 0.91 228
visualize_coeff(dtc, 'GradientBoostingClassifier')
XGBoost
from xgboost import XGBClassifier
xgb = XGBClassifier(colsample_bytree = 0.4603,
                    gamma = 0.0468,
                    learning_rate = 0.05,
                    max_depth = 3,
                    min_child_weight = 1.7817,
                    n_estimators = 2200,
                    reg_alpha = 0.4640,
                    reg_lambda = 0.8571,
                    subsample = 0.5213,
                    verbosity = 0,  # 'silent' was removed in recent xgboost versions
                    random_state = 7,
                    n_jobs = -1)  # 'nthread' is deprecated in favor of 'n_jobs'
xgb.fit(X_train, y_train)
xgb.score(X_test, y_test)
0.9166666666666666
print(classification_report(y_test, xgb.predict(X_test)))
precision recall f1-score support
0 0.93 0.98 0.95 200
1 0.76 0.46 0.58 28
accuracy 0.92 228
macro avg 0.85 0.72 0.77 228
weighted avg 0.91 0.92 0.91 228
visualize_coeff(xgb, 'XGBoost')
y_pred = xgb.predict(X_test)
plot_cf_matrix(y_test, y_pred)
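With 2,200 trees, early stopping on a held-out validation split is a common guard against overfitting. A sketch of that variant, where the 0.2 validation fraction and 50-round patience are illustrative choices:
# Carve a validation set out of the training data and stop adding trees
# once validation log-loss has not improved for 50 rounds.
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size = 0.2, random_state = 7)
# note: in xgboost >= 1.6 early_stopping_rounds is a constructor argument;
# older versions expect it as an argument to fit() instead
xgb_es = XGBClassifier(n_estimators = 2200, learning_rate = 0.05, max_depth = 3,
                       eval_metric = 'logloss', early_stopping_rounds = 50)
xgb_es.fit(X_tr, y_tr, eval_set = [(X_val, y_val)], verbose = False)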
Conclusion
In this project, I fully cleaned the data and engineered features to build a modeling dataset. I applied SVM as the baseline model and then tried other models such as Random Forest, AdaBoost, GradientBoost, and XGBoost, evaluating each one's performance and visualizing the feature coefficients and importances. In this use case, the most important evaluation metrics are the false negative count and recall, because the bank should not approve a loan application when the client has a high probability of defaulting. Predicting which clients may default on loans can help the bank apply damage-control policies. In addition, based on the data collected from clients' transaction histories and other information, bankers may develop more practical standards when approving loan applications.