This project builds predictive models that classify loans using selected machine learning algorithms.

I. Setup and II. Pre-Processing were provided by the course. I modified a few aspects to fit my style: section labels, graphs, Python variable naming conventions, and dataframe column names.

III. Classification, IV. Scoring, and V. Appendix were coded by me.

I. Setup

1. Packages

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing

%matplotlib inline

2. About the Dataset

This dataset is about past loans. The loan_train.csv data set includes details of 346 customers whose loans are already paid off or defaulted. It includes the following fields:

Field           Description
Loan_status     Whether a loan is paid off or in collection
Principal       Basic principal loan amount at the
Terms           Origination terms, which can be a weekly (7 days), biweekly, or monthly payoff schedule
Effective_date  When the loan was originated and took effect
Due_date        Since it is a one-time payoff schedule, each loan has a single due date
Age             Age of applicant
Education       Education of applicant
Gender          Gender of applicant

3. Load Data From CSV File

import wget

url = ""
filename =
df = pd.read_csv(filename)

4. Convert to Datetime Object

df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
   Unnamed: 0  Unnamed: 0.1  loan_status  Principal  terms  effective_date    due_date  age             education  Gender
0           0             0      PAIDOFF       1000     30      2016-09-08  2016-10-07   45  High School or Below    male
1           2             2      PAIDOFF       1000     30      2016-09-08  2016-10-07   33              Bechalor  female
2           3             3      PAIDOFF       1000     15      2016-09-08  2016-09-22   27               college    male
3           4             4      PAIDOFF       1000     30      2016-09-09  2016-10-08   28               college  female
4           6             6      PAIDOFF       1000     30      2016-09-09  2016-10-08   29               college    male
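
As a small illustration of the conversion above, the sketch below parses date strings with pd.to_datetime and then does arithmetic on the result. The sample values and the names raw, effective, due, and loan_days are made up for the example, and the month/day/year format string is an assumption that fits dates written like 9/8/2016.

```python
import pandas as pd

# Hypothetical sample values, written month/day/year.
raw = pd.DataFrame({"effective_date": ["9/8/2016", "9/9/2016"],
                    "due_date": ["10/7/2016", "10/8/2016"]})

# An explicit format avoids ambiguity between day-first and month-first dates.
effective = pd.to_datetime(raw["effective_date"], format = "%m/%d/%Y")
due = pd.to_datetime(raw["due_date"], format = "%m/%d/%Y")

# Datetime columns support arithmetic; here, the number of days between the two dates.
loan_days = (due - effective).dt.days
print(loan_days.tolist())
```

Once the columns are datetime objects, accessors such as dt.dayofweek become available as well.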

II. Pre-Processing

Explore how many observations fall into each class in the data set.

1. Dependent Variable

PAIDOFF       260
Name: loan_status, dtype: int64

2. Gender

a. By Principal Amount

bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
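                  aspect = 13 / 9)
#Draw a histogram of the principal amount in each panel, colored by loan status.
g.map(plt.hist,
      "Principal",
      bins = bins,
      ec = "k")
g.set(ylabel = "Frequency")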

b. By Age

bins = np.linspace(df.age.min(), df.age.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
#Draw a histogram of age in each panel, colored by loan status.
g.map(plt.hist,
      "age",
      bins = bins,
      ec = "k")
g.set(xlabel = "Age",
      ylabel = "Frequency")
df.groupby(['Gender'])['loan_status'].value_counts(normalize = True)
Gender  loan_status
female  PAIDOFF        0.865385
        COLLECTION     0.134615
male    PAIDOFF        0.731293
        COLLECTION     0.268707
Name: loan_status, dtype: float64

About 87 % of females pay off their loans, while only 73 % of males do. Convert male to 0 and female to 1.

#Assign the result back to the column; an in-place replace on a selected column may not modify df in recent pandas versions.
df['Gender'] = df['Gender'].map({'male': 0, 'female': 1})
   Unnamed: 0  Unnamed: 0.1  loan_status  Principal  terms  effective_date    due_date  age             education  Gender
0           0             0      PAIDOFF       1000     30      2016-09-08  2016-10-07   45  High School or Below       0
1           2             2      PAIDOFF       1000     30      2016-09-08  2016-10-07   33              Bechalor       1
2           3             3      PAIDOFF       1000     15      2016-09-08  2016-09-22   27               college       0
3           4             4      PAIDOFF       1000     30      2016-09-09  2016-10-08   28               college       1
4           6             6      PAIDOFF       1000     30      2016-09-09  2016-10-08   29               college       0

3. Day of Week

df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
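                  aspect = 13 / 9)
#Draw a histogram of the origination day of week in each panel, colored by loan status.
g.map(plt.hist,
      "dayofweek",
      bins = bins,
      ec = "k")
g.set(xlabel = "Day of Week",
      ylabel = "Frequency")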

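Flag each loan by whether it originated at the end of the week. Days with dayofweek greater than 3 (Friday, Saturday, and Sunday) are coded as 1, all other days as 0.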

df['Weekend'] = df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
   Unnamed: 0  Unnamed: 0.1  loan_status  Principal  terms  effective_date    due_date  age             education  Gender  dayofweek  Weekend
0           0             0      PAIDOFF       1000     30      2016-09-08  2016-10-07   45  High School or Below       0          3        0
1           2             2      PAIDOFF       1000     30      2016-09-08  2016-10-07   33              Bechalor       1          3        0
2           3             3      PAIDOFF       1000     15      2016-09-08  2016-09-22   27               college       0          3        0
3           4             4      PAIDOFF       1000     30      2016-09-09  2016-10-08   28               college       1          4        1
4           6             6      PAIDOFF       1000     30      2016-09-09  2016-10-08   29               college       0          4        1
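
To make the day-of-week coding concrete, here is a minimal sketch on four made-up dates. The names dates, dayofweek, and weekend are illustrative only; the threshold of 3 is the one used above.

```python
import pandas as pd

# Hypothetical dates: a Thursday, a Friday, a Sunday, and a Monday.
dates = pd.Series(pd.to_datetime(["2016-09-08", "2016-09-09", "2016-09-11", "2016-09-12"]))

# dt.dayofweek counts from Monday = 0 to Sunday = 6.
dayofweek = dates.dt.dayofweek

# Same rule as above: values greater than 3 (Friday, Saturday, Sunday) become 1.
weekend = (dayofweek > 3).astype(int)
print(list(zip(dayofweek.tolist(), weekend.tolist())))
```

The vectorized comparison gives the same 0/1 coding as the apply call with a lambda.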

4. Education

Education has four categories.

df.groupby(['education'])['loan_status'].value_counts(normalize = True)
education             loan_status
Bechalor              PAIDOFF        0.750000
                      COLLECTION     0.250000
High School or Below  PAIDOFF        0.741722
                      COLLECTION     0.258278
Master or Above       COLLECTION     0.500000
                      PAIDOFF        0.500000
college               PAIDOFF        0.765101
                      COLLECTION     0.234899
Name: loan_status, dtype: float64

5. Feature Selection

Feature = df[['Principal','terms','age','Gender','Weekend']]
#Convert 'education' to dummy variables.
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis = 1)
Feature.drop(['Master or Above'], axis = 1, inplace = True)
#Rename columns.
Feature.rename(columns = {'terms': 'Terms',
                          'age': 'Age',
                          'Bechalor': 'Bachelor',
                          'college': 'College'},
               inplace = True)
   Principal  Terms  Age  Gender  Weekend  Bachelor  High School or Below  College
0       1000     30   45       0        0         0                     1        0
1       1000     30   33       1        0         1                     0        0
2       1000     15   27       0        0         0                     0        1
3       1000     30   28       1        1         0                     0        1
4       1000     30   29       0        1         0                     0        1
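
The dummy-variable step above can be seen in isolation in the sketch below. The four-row series is a made-up sample that reuses the category labels as they are spelled in the data; dtype = int is my choice here so that the indicators print as 0 and 1.

```python
import pandas as pd

# Hypothetical sample using the category labels as they are spelled in the data.
education = pd.Series(["college", "Bechalor", "Master or Above", "High School or Below"])

# One indicator column per category; dtype = int gives 0/1 instead of True/False.
dummies = pd.get_dummies(education, dtype = int)

# Dropping one category makes it the reference level: its rows have 0 in every remaining column.
dummies = dummies.drop(columns = ["Master or Above"])
print(dummies)
```

With four categories, three indicator columns are enough, because the dropped category is implied when all three are 0.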

6. Normalize Data

X = Feature
y = df['loan_status'].values
X = preprocessing.StandardScaler().fit(X).transform(X)
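
For reference, the standardization that StandardScaler performs can be written out by hand. The sketch below uses NumPy on a made-up matrix (X_demo and its values are assumptions, not the loan data) and shows that each column ends up with mean 0 and standard deviation 1.

```python
import numpy as np

# Hypothetical feature matrix: two columns on very different scales.
X_demo = np.array([[1000.0, 45.0],
                   [1000.0, 33.0],
                   [ 800.0, 27.0],
                   [ 300.0, 29.0]])

# Standardize each column: subtract its mean, then divide by its standard deviation.
# np.std defaults to the population standard deviation (ddof = 0), as StandardScaler does.
mean = X_demo.mean(axis = 0)
std = X_demo.std(axis = 0)
X_scaled = (X_demo - mean) / std

print(X_scaled.mean(axis = 0).round(6))
print(X_scaled.std(axis = 0).round(6))
```

Putting the features on a common scale matters most for distance-based models such as K-Nearest Neighbors and for the SVM.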

III. Classification

1. Packages

from sklearn.model_selection import validation_curve as VC
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression as LR

2. K-Nearest Neighbor(KNN)

For K-Nearest Neighbors, I used 5-fold cross-validation over 1 to 70 neighbors.

k_neighbors = list(range(1, 71))
train_scores, valid_scores = VC(KNC(),
                                X, y,
                                param_name = 'n_neighbors',
                                param_range = k_neighbors,
                                cv = 5)
train_knn = []
valid_knn = []

for k in range(70) :
    train_knn.append(sum(train_scores[k]) / 5)
    valid_knn.append(sum(valid_scores[k]) / 5)
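
fig, ax = plt.subplots(figsize=(12, 8))
#Plot the fold-averaged accuracies against the number of neighbors.
ax.plot(k_neighbors, train_knn,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(k_neighbors, valid_knn,
        c = 'g',
        lw = 3,
        label = "Validation Set")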

ax.set_title("Train and Validation Set Accuracies of KNN for 1 - 70 Neighbors", fontsize = 22)
ax.set_xlabel(xlabel = "Number of Neighbors", fontsize = 18)
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18)
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);

With approximately 50 or more nearest neighbors, the train and validation accuracies both converge to about 0.75.

knn_model_final = KNC(n_neighbors = 50).fit(X, y)
metrics.accuracy_score(y, knn_model_final.predict(X))
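
The fold averaging used for the validation curve can also be done with NumPy instead of an explicit loop. This is a minimal sketch on synthetic data from make_classification, not on the loan data; the names X_demo, y_demo, k_range, and best_k are made up for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the loan data.
X_demo, y_demo = make_classification(n_samples = 200, n_features = 8, random_state = 0)

k_range = list(range(1, 21))
train_scores, valid_scores = validation_curve(KNeighborsClassifier(),
                                              X_demo, y_demo,
                                              param_name = "n_neighbors",
                                              param_range = k_range,
                                              cv = 5)

# Each result has one row per value of k and one column per fold.
train_mean = train_scores.mean(axis = 1)
valid_mean = valid_scores.mean(axis = 1)

# The k with the highest mean validation accuracy.
best_k = k_range[int(np.argmax(valid_mean))]
print(train_scores.shape, best_k)
```

Averaging along axis 1 gives one mean accuracy per candidate value, which is what the plotted curves show.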

3. Decision Tree (Random Forest)

Because the training data set is small, I used a Random Forest rather than a single Decision Tree. A Random Forest is a collection of Decision Trees whose predictions are combined into a final model. A single decision tree is quick to fit but prone to overfitting; combining many trees reduces that tendency. A Random Forest also estimates the importance of each feature from the many trees used to create it.

I created a Random Forest from 30 Decision Trees.

dtrf_model_final = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 3333).fit(X, y)
metrics.accuracy_score(y, dtrf_model_final.predict(X))
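
As an illustration of the feature importances mentioned above, the sketch below fits a forest of 30 trees on synthetic data and ranks the features. The data, the feature names f0 to f4, and the variable names are assumptions for the example and say nothing about the loan features.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with hypothetical feature names.
X_demo, y_demo = make_classification(n_samples = 300, n_features = 5,
                                     n_informative = 2, n_redundant = 0,
                                     random_state = 0)
names = ["f0", "f1", "f2", "f3", "f4"]

forest = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 0).fit(X_demo, y_demo)

# feature_importances_ holds one impurity-based score per feature; the scores sum to 1.
ranked = sorted(zip(names, forest.feature_importances_), key = lambda pair: pair[1], reverse = True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The same attribute is available on a fitted forest such as dtrf_model_final, where the scores would follow the column order of the feature matrix.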

4. Support Vector Machine

I performed a grid search on a support vector machine, varying $C$ from $10^{-5}$ to $10^4$ and varying $γ$ from $10^{-7}$ to $10^{-1}$, both by powers of $10$.

Similar to the KNN model, I used the average accuracy from 5-fold cross-validation.

C_list = [10**n for n in range(-5, 5)]
gamma_list = [10**n for n in range(-7, 0)]
param_grid = {'C': C_list,  
              'gamma': gamma_list} 

grid = GridSearchCV(svm.SVC(),
                    param_grid,
                    cv = 5,
                    refit = True,
                    verbose = 0).fit(X, y)
grid.best_params_
{'C': 1e-05, 'gamma': 1e-07}

The grid search reports $C = 10^{-5}$ and $γ = 10^{-7}$ as the best hyperparameters.

However, these are the smallest values tried for both $C$ and $γ$, so they sit on the edge of the search grid rather than at a clear interior optimum.

Here is a look at the accuracy of all models from the Grid Search sorted by the hyperparameters.

df_grid = pd.DataFrame(data = grid.cv_results_['mean_test_score'].reshape(10, 7),
                       index = C_list,
                       columns = gamma_list)
             1.000000e-07  1.000000e-06  1.000000e-05  1.000000e-04  1.000000e-03  1.000000e-02  1.000000e-01
    0.00001       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.751470
    0.00010       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.751470
    0.00100       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.751470
    0.01000       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.751470
    0.10000       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.751470
    1.00000       0.75147       0.75147      0.751470      0.751470      0.751470      0.751470      0.708282
   10.00000       0.75147       0.75147      0.751470      0.751470      0.751470      0.633126      0.647867
  100.00000       0.75147       0.75147      0.751470      0.751470      0.615942      0.635901      0.647867
 1000.00000       0.75147       0.75147      0.751470      0.633333      0.627329      0.653416      0.670973
10000.00000       0.75147       0.75147      0.636232      0.639130      0.633043      0.676770      0.673954

The mean cross-validated accuracy for every combination of $C$ (rows) and $γ$ (columns) is shown above. Every cell outside the bottom right corner has the same score to 6 decimal places, that is, $0.751470$. This value is consistent with a model that predicts PAIDOFF for every loan, since 260 of the 346 loans in the train set are PAIDOFF.
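
The repeated score of 0.751470 can be checked with a little arithmetic. The sketch below computes the mean 5-fold accuracy of a model that always predicts PAIDOFF, assuming the folds are stratified (the scikit-learn default for classifiers when cv is an integer) so that each class is spread as evenly as possible. The count of 86 COLLECTION loans is derived as 346 minus 260, and all variable names are made up for the example.

```python
# Class counts in the train set: 260 PAIDOFF out of 346 loans, so 86 COLLECTION.
paidoff, total, folds = 260, 346, 5
collection = total - paidoff

# Assumption: stratified 5-fold splitting spreads each class as evenly as possible,
# so every fold holds 52 PAIDOFF loans and either 17 or 18 COLLECTION loans.
paidoff_per_fold = [paidoff // folds] * folds
collection_per_fold = [collection // folds + (1 if i < collection % folds else 0) for i in range(folds)]

# Accuracy on each fold for a model that always predicts PAIDOFF.
fold_accuracy = [p / (p + c) for p, c in zip(paidoff_per_fold, collection_per_fold)]
baseline = sum(fold_accuracy) / folds
print(round(baseline, 6))
```

Under these assumptions the baseline rounds to 0.75147, the same value that fills most of the grid, which suggests those hyperparameter combinations give a classifier that predicts the majority class only.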

I used $C = 10^{-5}$ and $γ = 10^{-7}$ for the final model.

svm_model_final = svm.SVC(C = 1e-5, gamma = 1e-7, kernel = 'rbf').fit(X, y)
metrics.accuracy_score(y, svm_model_final.predict(X))

5. Logistic Regression

For Logistic Regression, I used 5-fold cross-validation, varying $C$ from $10^{-5}$ to $10^4$ by powers of $10$ (the same C_list as for the SVM).

train_scores, valid_scores = VC(LR(),
                                X, y,
                                param_name = 'C',
                                param_range = C_list,
                                cv = 5)

train_lr = []
valid_lr = []

for c in range(len(C_list)) :
    train_lr.append(sum(train_scores[c]) / 5)
    valid_lr.append(sum(valid_scores[c]) / 5)
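
fig, ax = plt.subplots(figsize=(12, 8))
#Plot the fold-averaged accuracies against the values of C.
ax.plot(C_list, train_lr,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(C_list, valid_lr,
        c = 'g',
        lw = 3,
        label = "Validation Set")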

ax.set_title("Train and Validation Set Accuracies of Logistic Regression for C", fontsize = 22)
ax.set_xlabel(xlabel = "Hyperparameter C value", fontsize = 18)
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18, loc = 'center right')
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);

From the graph, any $C ≤ 10^{-2}$ yields an accuracy of approximately 0.75 on both the training and validation sets.

lr_model_final = LR(C = 1e-4).fit(X,y)
metrics.accuracy_score(y, lr_model_final.predict(X))

IV. Test Set Results

1. Packages

from sklearn.metrics import jaccard_score as js
from sklearn.metrics import f1_score as fs
from sklearn.metrics import log_loss as ll

2. Test Set

The pre-processing steps applied to the train set are repeated here for the test set.

a. Load

url = ""
filename =
df_test = pd.read_csv(filename)
   Unnamed: 0  Unnamed: 0.1  loan_status  Principal  terms  effective_date    due_date  age             education  Gender
0           1             1      PAIDOFF       1000     30        9/8/2016   10/7/2016   50              Bechalor  female
1           5             5      PAIDOFF        300      7        9/9/2016   9/15/2016   35       Master or Above    male
2          21            21      PAIDOFF       1000     30       9/10/2016   10/9/2016   43  High School or Below  female
3          24            24      PAIDOFF       1000     30       9/10/2016   10/9/2016   26               college    male
4          35            35      PAIDOFF        800     15       9/11/2016   9/25/2016   29              Bechalor    male

b. Pre-Process

df_test['effective_date'] = pd.to_datetime(df_test['effective_date'])
df_test['dayofweek'] = df_test['effective_date'].dt.dayofweek
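df_test['Weekend'] = df_test['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
#Assign the result back to the column; an in-place replace on a selected column may not modify df_test in recent pandas versions.
df_test['Gender'] = df_test['Gender'].map({'male': 0, 'female': 1})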

c. Feature Selection

Feature_test = df_test[['Principal','terms','age','Gender','Weekend']]
Feature_test = pd.concat([Feature_test, pd.get_dummies(df_test['education'])], axis = 1)
Feature_test.drop(['Master or Above'], axis = 1, inplace = True)
Feature_test.rename(columns = {'terms': 'Terms',
                               'age': 'Age',
                               'Bechalor': 'Bachelor',
                               'college': 'College'},
                    inplace = True)

d. Normalize Data

X_test = Feature_test
y_test = df_test['loan_status']
X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
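
The cell above fits a new scaler on the test set. A common alternative, sketched here under assumptions, is to fit the scaler once on the training features and reuse it, so that both sets are scaled with the same means and standard deviations. The arrays below are made-up stand-ins, and this is not how the scores reported in this notebook were produced.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the train and test feature matrices.
X_train_demo = np.array([[1000.0, 30.0], [800.0, 15.0], [1000.0, 7.0], [300.0, 30.0]])
X_test_demo = np.array([[1000.0, 30.0], [300.0, 7.0]])

# Fit on the training features only, then apply the same transformation to both sets.
scaler = StandardScaler().fit(X_train_demo)
X_train_scaled = scaler.transform(X_train_demo)
X_test_scaled = scaler.transform(X_test_demo)

print(scaler.mean_)
print(X_test_scaled)
```

With this approach a given raw value maps to the same scaled value in both sets, which keeps the test features on the scale the models were trained on.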

3. Scoring

Using the test set, the Jaccard score and F1 score will be calculated for each of the K-Nearest Neighbors model, Decision Tree (Random Forest) model, Support Vector Machine model, and Logistic Regression model. In addition, the log loss will be calculated for the Logistic Regression model.

#Initialize lists.
models = [knn_model_final, dtrf_model_final, svm_model_final, lr_model_final]
jaccard = []
f1s = []
log_loss = ["NA", "NA", "NA"]

#Calculate scores from each model and append to lists.
for model in models :
    y_pred = model.predict(X_test)
    jaccard.append(js(y_test, y_pred, average = 'micro'))
    f1s.append(fs(y_test, y_pred, average = 'micro'))
    if model is lr_model_final :
        log_loss.append(ll(y_test, model.predict_proba(X_test)))

#Output dataframe.
data = {"Algorithm": ["K-Nearest Neighbors", "Decision Tree/Random Forest", "Support Vector Machine", "Logistic Regression"],
        "Jaccard": jaccard,
        "F1-Score": f1s,
        "Log Loss": log_loss}

report = pd.DataFrame(data = data)
                              Jaccard  F1-Score  Log Loss
K-Nearest Neighbors          0.636364  0.777778        NA
Decision Tree/Random Forest  0.588235  0.740741        NA
Support Vector Machine       0.588235  0.740741        NA
Logistic Regression          0.588235  0.740741  0.571405
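
To show how the micro-averaged scores are computed, here is a minimal pure-Python sketch. The helper micro_scores and the label lists are hypothetical; the counts (54 loans, 40 of them PAIDOFF, and a model that always predicts PAIDOFF) are chosen only because they reproduce the values 0.588235 and 0.740741, and they are not taken from the actual predictions.

```python
def micro_scores(y_true, y_pred):
    """Return (micro Jaccard, micro F1) for single-label predictions."""
    labels = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for label in labels:
        for t, p in zip(y_true, y_pred):
            if p == label and t == label:
                tp += 1      # predicted this label and it was right
            elif p == label:
                fp += 1      # predicted this label but it was wrong
            elif t == label:
                fn += 1      # missed a true instance of this label
    jaccard = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return jaccard, f1

# Hypothetical test set of 54 loans: 40 PAIDOFF and 14 COLLECTION,
# scored against a model that predicts PAIDOFF every time.
y_true_demo = ["PAIDOFF"] * 40 + ["COLLECTION"] * 14
y_pred_demo = ["PAIDOFF"] * 54

jaccard_demo, f1_demo = micro_scores(y_true_demo, y_pred_demo)
print(round(jaccard_demo, 6), round(f1_demo, 6))
```

For single-label predictions, each wrong prediction counts once as a false positive and once as a false negative, so the micro-averaged F1 score equals plain accuracy and the micro-averaged Jaccard score equals correct / (correct + 2 × wrong).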

It is important to note that this is a very small data set, with only 346 observations in the train set and 54 observations in the test set. A larger sample would possibly give better scores and accuracy.

V. Appendix

Complete original train set.

pd.set_option("display.max_rows", None, "display.max_columns", None)

