Bank Loan Default Prediction--Feature Engineering (Part 2)

2021, Mar 15    

Abstract of the project

Objective: In this data challenge, I work with 8 datasets from a bank (collected in 1999). I analyze the data and train models to predict which customers may default on their loans.

Steps:

  1. Data pre-processing
    • Load the data from the .asc files into pd.DataFrame structures.
    • Transform the data into the desired format and content.
    • Rename and drop columns where needed.
    • Include some basic feature engineering for each table, such as encoding categorical features and pivoting.
    • Analyze the data distribution through visualization.
  2. Feature Engineering and Dataset Preparation
    • Merge some tables to create new features that may be related to the prediction target.
    • Merge all the tables into one model-ready table containing only int and float columns; drop or fill the columns with null values.
    • Conduct feature selection.
  3. Baseline model
    • Begin with a linear SVM.
    • Evaluation
    • Visualize the feature coefficients.
    • Tune hyper-parameters for the SVM.
  4. Try other models
    • Random Forest
    • AdaBoost
    • Gradient Boosting
    • XGBoost
  5. Conclusion

Note: This blog is Part 2 of the project and covers only feature engineering and dataset preparation. Please refer to the other posts in this series for the remaining parts.

import pandas as pd 
import numpy as np 

import matplotlib.pyplot as plt
import seaborn as sns 
sns.set(font_scale = 1.2)
%matplotlib inline

Feature Engineering and Dataset Preparation

Merge loan, account, and transaction tables.

  1. merge the loan and account tables on account_id
  2. merge the result with the transaction table on account_id

With the merge function, we can define the key to merge on and the join type (inner, left, right, or outer), and we can customize the suffixes applied to columns that share the same name in both tables.

loan_account = pd.merge(loan,
                        account,
                        how = 'left',
                        on = 'account_id',
                        suffixes = ('_loan','_acc'))
loan_account.head()
loan_id account_id date amount duration payments status district_id frequency usage_year
0 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6
1 5316 1801 1993-07-11 165960 36 4610.0 0 46 2 6
2 6863 9188 1993-07-28 127080 60 2118.0 0 45 2 6
3 5325 1843 1993-08-03 105804 36 2939.0 0 12 2 6
4 7240 11013 1993-09-06 274740 60 4579.0 0 1 1 6
len(loan_account)
682
loan_account_trans = pd.merge(loan_account,
                              trans,
                              how = 'left',
                              on = 'account_id',
                              suffixes = ('_loan','_trans')
                             )
# display in two separate parts just for the tight layout of the blog
display(loan_account_trans.iloc[:2, :11])
display(loan_account_trans.iloc[:2, 11:])
loan_id account_id date amount_loan duration payments status district_id frequency usage_year trans_id
0 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 523621
1 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 524054
trans_date type operation amount_trans balance k_symbol bank account trans_type large_amount
0 1993-03-22 PRIJEM VKLAD 1100.0 1100.0 missing NaN NaN 1.0 1
1 1993-04-21 PRIJEM VKLAD 9900.0 11000.0 missing NaN NaN 1.0 3

Calculate the date difference between loan and transaction

I subtract the transaction date from the loan date to obtain the difference between the loan grant date and the transaction date, then keep only the transactions made before loan approval to investigate clients' financial situation and habits. This also matches the real-world workflow: a bank analyzes past transactions to decide whether a client's loan application should be approved.

  • loan_trans_diff:
    • greater than 0: indicates the transactions before the loan approval
    • less than 0: indicates the transactions after the loan approval
loan_account_trans['loan_trans_diff'] = ((loan_account_trans['date'] - loan_account_trans['trans_date']) / np.timedelta64(1, 'D')).astype(int)
loan_account_trans = loan_account_trans[loan_account_trans['loan_trans_diff'] >= 0]
display(loan_account_trans.iloc[:2, :11])
display(loan_account_trans.iloc[:2, 11:])
loan_id account_id date amount_loan duration payments status district_id frequency usage_year trans_id
0 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 523621
1 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 524054
trans_date type operation amount_trans balance k_symbol bank account trans_type large_amount loan_trans_diff
0 1993-03-22 PRIJEM VKLAD 1100.0 1100.0 missing NaN NaN 1.0 1 105
1 1993-04-21 PRIJEM VKLAD 9900.0 11000.0 missing NaN NaN 1.0 3 75
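As a quick sanity check (not part of the original notebook, just an illustrative sketch), we can confirm that only pre-loan transactions remain after the filter:

# illustrative check: no post-loan transactions should remain after filtering
assert (loan_account_trans['loan_trans_diff'] >= 0).all()
loan_account_trans['loan_trans_diff'].describe()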

Average account balance over recent windows: 2 weeks, 1 month, and 3 months before the loan date

What was the client's average balance over the period leading up to the loan? groupby and aggregation let us extract this information. These pre-loan averages capture characteristics of the customer's transaction habits, which can help determine whether the customer is likely to default on repayment, so the bank can consider approving or rejecting the loan application.

two_week = loan_account_trans[loan_account_trans['loan_trans_diff'] <= 14]
two_week = two_week.groupby(['loan_id'], as_index = None).mean()
two_week['balance_before_loan_2w'] = two_week['balance']
two_week = two_week.loc[:, ['loan_id', 'balance_before_loan_2w']]
display(two_week.head(5))
loan_id balance_before_loan_2w
0 4959 25996.825
1 4962 35169.220
2 4967 31329.250
3 4968 32703.700
4 4973 24468.850
one_month = loan_account_trans[loan_account_trans['loan_trans_diff'] <= 30]
one_month = one_month.groupby(['loan_id'],as_index = None).mean()
one_month['balance_before_loan_1m'] = one_month['balance']
one_month = one_month.loc[:,['loan_id','balance_before_loan_1m']]
display(one_month.head(5))
loan_id balance_before_loan_1m
0 4959 29349.755556
1 4961 14062.416667
2 4962 60769.460000
3 4967 33760.620000
4 4968 31890.450000
three_month = loan_account_trans[loan_account_trans['loan_trans_diff'] <= 90]
three_month = three_month.groupby(['loan_id'],as_index = None).mean()
three_month['balance_before_loan_3m'] = three_month['balance']
three_month = three_month.loc[:,['loan_id','balance_before_loan_3m']]
display(three_month.head(5))
loan_id balance_before_loan_3m
0 4959 29830.005000
1 4961 15123.228571
2 4962 67547.706897
3 4967 20759.711111
4 4968 26212.342308
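The three blocks above repeat the same pattern with a different window size. A more compact equivalent (a sketch, assuming the same loan_account_trans frame built earlier) loops over the windows:

# sketch: compute the mean pre-loan balance for several look-back windows in one loop
# column names mirror the ones created above (balance_before_loan_2w / _1m / _3m)
windows = {'2w': 14, '1m': 30, '3m': 90}
balance_feats = {}
for label, days in windows.items():
    mask = loan_account_trans['loan_trans_diff'] <= days
    balance_feats[label] = (loan_account_trans[mask]
                            .groupby('loan_id', as_index = False)['balance']
                            .mean()
                            .rename(columns = {'balance': f'balance_before_loan_{label}'}))
balance_feats['2w'].head()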

Transaction frequency

before = loan_account_trans.copy()  # work on a copy so the merged table is not modified in place
before['trans_year'] = before['trans_date'].dt.year # extract the year from the datetime column
before.head(2)
loan_id account_id date amount_loan duration payments status district_id frequency usage_year ... operation amount_trans balance k_symbol bank account trans_type large_amount loan_trans_diff trans_year
0 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 ... VKLAD 1100.0 1100.0 missing NaN NaN 1.0 1 105 1993
1 5314 1787 1993-07-05 96396 12 8033.0 1 30 1 6 ... VKLAD 9900.0 11000.0 missing NaN NaN 1.0 3 75 1993

2 rows × 23 columns

Group by loan_id and the year of the transaction date

trans_freq = before.loc[:, ['loan_id','trans_year','balance']].groupby(['loan_id','trans_year']).count()
trans_freq.rename(columns={'balance':'trans_freq'}, inplace = True)
trans_freq.head()
trans_freq
loan_id trans_year
4959 1993 54
1994 1
4961 1995 54
1996 26
4962 1996 46
f = {'trans_freq': ['sum', 'mean'], 
     'trans_year': ['count']}

trans_freq = trans_freq.reset_index().groupby('loan_id').agg(f)
trans_freq.columns = [s1 + '_' + s2 for (s1,s2) in trans_freq.columns.to_list()]
trans_freq.reset_index(inplace = True)
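The dictionary-style aggregation above creates a MultiIndex on the columns, which then has to be flattened by hand. An equivalent sketch using pandas named aggregation (available since pandas 0.25) produces flat column names directly:

# sketch: the same two-level aggregation with named aggregation, no MultiIndex flattening needed
per_year = (before.groupby(['loan_id', 'trans_year'], as_index = False)
                  .agg(trans_freq = ('balance', 'count')))
trans_freq_alt = per_year.groupby('loan_id', as_index = False).agg(
    trans_freq_sum = ('trans_freq', 'sum'),
    trans_freq_mean = ('trans_freq', 'mean'),
    trans_year_count = ('trans_year', 'count'))
trans_freq_alt.head()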

Number of large amount transactions

We encoded the transaction amount by quartile in Part 1: if the transaction amount is greater than Q3, it is encoded as 3.
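For reference, a minimal sketch of what such a quartile encoding could look like (the actual encoding was done in Part 1; the variable name here is hypothetical, and it assumes trans still has its original amount column):

# hypothetical sketch: map each transaction amount to its quartile (0-3), so values above Q3 become 3
large_amount_demo = pd.qcut(trans['amount'], q = 4, labels = [0, 1, 2, 3]).astype(int)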

large_amount = before[before['large_amount'] == 3]
large_amount_freq = large_amount.loc[:,['loan_id','large_amount']].groupby(['loan_id']).count()
large_amount_freq.rename(columns = {'large_amount': 'large_amount_freq'}, inplace = True)
display(large_amount_freq.head())
large_amount_freq
loan_id
4959 24
4961 26
4962 83
4967 32
4968 16

Number of transactions when the balance was lower than 5k

low_5k = before[before['balance'] <= 5000] 
low_5k_freq = low_5k.loc[:,['loan_id', 'balance']].groupby(['loan_id']).count()
low_5k_freq.rename(columns = {'balance': 'low_5k_freq'},inplace = True)
display(low_5k_freq.head())
low_5k_freq
loan_id
4959 1
4961 10
4962 1
4967 3
4973 1

Combine engineered features (transaction habits)

loan_info = loan.merge(three_month,
                       how = 'left',
                       on = 'loan_id')

loan_info = loan_info.merge(two_week,
                            how = 'left',
                            on = 'loan_id')

loan_info = loan_info.merge(one_month,
                            how = 'left',
                            on = 'loan_id')

loan_info = loan_info.merge(trans_freq,
                            how = 'left',
                            on = 'loan_id')

loan_info = loan_info.merge(large_amount_freq,
                            how = 'left',
                            on = 'loan_id')

loan_info = loan_info.merge(low_5k_freq,
                            how = 'left',
                            on = 'loan_id')
loan_info.head()
loan_id account_id date amount duration payments status balance_before_loan_3m balance_before_loan_2w balance_before_loan_1m trans_freq_sum trans_freq_mean trans_year_count large_amount_freq low_5k_freq
0 5314 1787 1993-07-05 96396 12 8033.0 1 15966.666667 NaN 20100.000000 4 4.0 1 1.0 1.0
1 5316 1801 1993-07-11 165960 36 4610.0 0 56036.575862 36072.350000 33113.788889 37 37.0 1 14.0 2.0
2 6863 9188 1993-07-28 127080 60 2118.0 0 35160.929412 20272.800000 36233.066667 24 24.0 1 9.0 1.0
3 5325 1843 1993-08-03 105804 36 2939.0 0 46378.637500 34243.066667 42535.133333 25 25.0 1 13.0 1.0
4 7240 11013 1993-09-06 274740 60 4579.0 0 60804.547059 41127.900000 49872.757143 27 27.0 1 16.0 1.0
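The chain of left merges above can also be written as a single fold over the engineered feature tables; an equivalent sketch:

# sketch: merge every feature table onto the loan table in one reduce
from functools import reduce

feature_tables = [three_month, two_week, one_month, trans_freq,
                  large_amount_freq, low_5k_freq]
loan_info_alt = reduce(lambda left, right: left.merge(right, how = 'left', on = 'loan_id'),
                       feature_tables, loan)
loan_info_alt.head()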

Compare two types of clients:

  • status = 0: clients who successfully make payments
  • status = 1: clients who fail to make payments

I observe their habits of large-amount transactions.

loan_info[loan_info['status'] == 0]['large_amount_freq'].describe()
count    605.000000
mean      30.737190
std       19.470525
min        1.000000
25%       16.000000
50%       26.000000
75%       41.000000
max      104.000000
Name: large_amount_freq, dtype: float64
loan_info[loan_info['status'] == 1]['large_amount_freq'].describe()
count    76.000000
mean     28.723684
std      19.039327
min       1.000000
25%      13.750000
50%      26.000000
75%      42.250000
max      75.000000
Name: large_amount_freq, dtype: float64
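The two groups look similar in both location and spread. As a purely illustrative extra check (not in the original notebook, and assuming scipy is available), a two-sample test could quantify how different the two distributions are:

# illustrative: Mann-Whitney U test comparing large_amount_freq between the two groups
from scipy.stats import mannwhitneyu

good = loan_info.loc[loan_info['status'] == 0, 'large_amount_freq'].dropna()
bad = loan_info.loc[loan_info['status'] == 1, 'large_amount_freq'].dropna()
stat, p_value = mannwhitneyu(good, bad, alternative = 'two-sided')
print(f'Mann-Whitney U = {stat:.1f}, p = {p_value:.3f}')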
# making subplots to compare the payment habits of these two types of clients
plt.figure(figsize = (14, 5))
plt.subplot(211)
sns.histplot(loan_info[loan_info['status'] == 0]['large_amount_freq'], 
             stat = 'density', kde = True, 
             label = 'Success in making payment')
plt.legend()
plt.ylim(0, 0.03)
plt.xlim(0, 105)

plt.subplot(212)
sns.histplot(loan_info[loan_info['status'] == 1]['large_amount_freq'], 
             stat = 'density', kde = True,
             label = 'Fail in making payment')
plt.ylim(0, 0.03)
plt.xlim(0, 105)
plt.legend()

[Figure: density histograms (with KDE) of large_amount_freq for clients who succeed vs. fail in making payments]

Merge all tables

data = pd.merge(loan_info,
                account,
                how = 'left',
                on = 'account_id'
               )
data = pd.merge(data,
                disp,
                how = 'left',
                on = 'account_id'
               )
data = pd.merge(data,
                card,
                how = 'left',
                on = 'disp_id'
               )
data = pd.merge(data,
                order,
                how = 'left',
                on = 'account_id'
               )
data = pd.merge(data,
                client,
                how = 'left',
                on = 'client_id'
               )
data = pd.merge(data,
                district,
                how = 'left',
                left_on = 'district_id_x',
                right_on = 'A1'
               )
display(data.iloc[:3, :10])
display(data.iloc[:3, 10: 21])
display(data.iloc[:3, 21: 30])
display(data.iloc[:3, 30:])
loan_id account_id date amount_x duration payments status balance_before_loan_3m balance_before_loan_2w balance_before_loan_1m
0 5314 1787 1993-07-05 96396 12 8033.0 1 15966.666667 NaN 20100.000000
1 5316 1801 1993-07-11 165960 36 4610.0 0 56036.575862 36072.35 33113.788889
2 6863 9188 1993-07-28 127080 60 2118.0 0 35160.929412 20272.80 36233.066667
trans_freq_sum trans_freq_mean trans_year_count large_amount_freq low_5k_freq district_id_x frequency usage_year disp_id client_id card_id
0 4 4.0 1 1.0 1.0 30 1 6 2166 2166 NaN
1 37 37.0 1 14.0 2.0 46 2 6 2181 2181 NaN
2 24 24.0 1 9.0 1.0 45 2 6 11006 11314 NaN
type issued amount_y order_a_k_LEASING order_a_k_Missing_symbol order_a_k_POJISTNE order_a_k_SIPO order_a_k_UVER birth_number
0 NaN NaN 8033.2 0.0 0.0 0.0 0.0 8033.2 475722
1 NaN NaN 13152.0 0.0 3419.0 956.0 4167.0 4610.0 680722
2 NaN NaN 10591.3 0.0 471.0 66.0 7936.0 2118.3 360602
district_id_y sex age A1 A4 A5 A6 A7 A8 A9 A10 A11 A12 A13 A14 A15 A16
0 30 1 52 30 94812 15 13 8 2 10 81.8 9650 3.38 3.67 100 2985 2804
1 46 0 31 46 112709 48 20 7 3 10 73.5 8369 1.79 2.31 117 2854 2618
2 45 0 63 45 77917 85 19 6 1 5 53.5 8390 2.28 2.89 132 2080 2122
data['year'] = data['date'].dt.year
data['year_loan'] = 1999 - (data['year']).astype(int)

Drop unnecessary columns

data.drop(['date','year', 'account_id', 'birth_number', 'card_id',
           'client_id', 'disp_id', 'district_id_x', 'district_id_y','issued',
           'loan_id', 'A12', 'A15'], axis = 1,  inplace = True)
data.columns.unique()
Index(['amount_x', 'duration', 'payments', 'status', 'balance_before_loan_3m',
       'balance_before_loan_2w', 'balance_before_loan_1m', 'trans_freq_sum',
       'trans_freq_mean', 'trans_year_count', 'large_amount_freq',
       'low_5k_freq', 'frequency', 'usage_year', 'type', 'amount_y',
       'order_a_k_LEASING', 'order_a_k_Missing_symbol', 'order_a_k_POJISTNE',
       'order_a_k_SIPO', 'order_a_k_UVER', 'sex', 'age', 'A1', 'A4', 'A5',
       'A6', 'A7', 'A8', 'A9', 'A10', 'A11', 'A13', 'A14', 'A16', 'year_loan'],
      dtype='object')
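Missing values come mainly from accounts without a card (the type column) and from loans with no transactions inside a given look-back window. A small illustrative check of which columns still contain nulls before filling:

# illustrative: list the columns that still contain missing values and how many
data.isnull().sum().loc[lambda s: s > 0]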
data['type'].fillna(-1, inplace = True)
data.fillna(0, inplace = True)
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 682 entries, 0 to 681
Data columns (total 36 columns):
amount_x                    682 non-null int64
duration                    682 non-null int64
payments                    682 non-null float64
status                      682 non-null int64
balance_before_loan_3m      682 non-null float64
balance_before_loan_2w      682 non-null float64
balance_before_loan_1m      682 non-null float64
trans_freq_sum              682 non-null int64
trans_freq_mean             682 non-null float64
trans_year_count            682 non-null int64
large_amount_freq           682 non-null float64
low_5k_freq                 682 non-null float64
frequency                   682 non-null int64
usage_year                  682 non-null int64
type                        682 non-null float64
amount_y                    682 non-null float64
order_a_k_LEASING           682 non-null float64
order_a_k_Missing_symbol    682 non-null float64
order_a_k_POJISTNE          682 non-null float64
order_a_k_SIPO              682 non-null float64
order_a_k_UVER              682 non-null float64
sex                         682 non-null int64
age                         682 non-null int64
A1                          682 non-null int64
A4                          682 non-null int64
A5                          682 non-null int64
A6                          682 non-null int64
A7                          682 non-null int64
A8                          682 non-null int64
A9                          682 non-null int64
A10                         682 non-null float64
A11                         682 non-null int64
A13                         682 non-null float64
A14                         682 non-null int64
A16                         682 non-null int64
year_loan                   682 non-null int64
dtypes: float64(16), int64(20)
memory usage: 197.1 KB

Dataset preparation: input and output split

  • the inputs are the features
  • the output is the status: 0 / 1

I split the data into train and test sets with a ratio of 2:1.
from sklearn.model_selection import train_test_split

X = data.loc[:, data.columns != 'status']
y = data['status']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3)
len(X_train), len(X_test)
(454, 228)
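One caveat (not in the original notebook): the classes are imbalanced (roughly 600 non-default vs. 80 default loans), so a stratified split keeps the default rate comparable between train and test. A sketch, with a fixed seed for reproducibility:

# sketch: stratified, reproducible split preserving the default/non-default ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3,
                                                    stratify = y, random_state = 42)
y_train.mean(), y_test.mean()  # the default rates should now be nearly identical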