This is a predictive modeling project that classifies loan repayment status using selected machine learning algorithms.
I. Setup and II. Pre-Processing were provided by the course. I modified a few aspects to fit my style: the section labeling, some of the graphs, the Python variable naming conventions, and the dataframe column names.
III. Classification, IV. Scoring, and V. Appendix were coded by me.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
%matplotlib inline
This dataset is about past loans. The loan_train.csv data set includes details of 346 customers whose loans have already been paid off or defaulted. It includes the following fields:
Field | Description |
---|---|
Loan_status | Whether a loan is paid off or in collection |
Principal | Basic principal loan amount at origination |
Terms | Origination terms, which can be a weekly (7-day), biweekly, or monthly payoff schedule |
Effective_date | When the loan was originated and took effect |
Due_date | Since it is a one-time payoff schedule, each loan has a single due date |
Age | Age of applicant |
Education | Education of applicant |
Gender | Gender of applicant |
import wget
url = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv"
filename = wget.download(url)
df = pd.read_csv(filename)
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()
df['loan_status'].value_counts()
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'Principal',
      bins = bins,
      ec = "k")
g.set(ylabel = "Frequency")
g.axes[-1].legend();
bins = np.linspace(df.age.min(), df.age.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'age',
      bins = bins,
      ec = "k")
g.set(xlabel = "Age",
      ylabel = "Frequency")
g.axes[-1].legend();
df.groupby(['Gender'])['loan_status'].value_counts(normalize = True)
86% of females pay off their loans, while only 73% of males do. Convert male to 0 and female to 1.
df['Gender'].replace(to_replace = ['male', 'female'],
                     value = [0, 1],
                     inplace = True)
df.head()
df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'dayofweek',
      bins = bins,
      ec = "k")
g.set(xlabel = "Day of Week",
      ylabel = "Frequency")
g.axes[-1].legend();
Categorize days by whether they fall at the end of the week; the Weekend feature flags Friday through Sunday.
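#With dt.dayofweek, Monday = 0, so x > 3 flags Friday, Saturday, and Sunday.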
df['Weekend'] = df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
df.head()
Education has four categories.
df.groupby(['education'])['loan_status'].value_counts(normalize = True)
Feature = df[['Principal', 'terms', 'age', 'Gender', 'Weekend']]
#Convert 'education' to dummy variables.
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis = 1)
Feature.drop(['Master or Above'], axis = 1, inplace = True)
#Rename columns.
Feature.rename(columns = {'terms': 'Terms',
                          'age': 'Age',
                          'Bechalor': 'Bachelor',
                          'college': 'College'},
               inplace = True)
Feature.head()
X = Feature
y = df['loan_status'].values
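#Standardize the features to zero mean and unit variance.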
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn.model_selection import validation_curve as VC
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression as LR
For K-Nearest Neighbors, I chose a 5-fold cross-validation split across 1 to 70 neighbors.
k_neighbors = list(range(1, 71))
train_scores, valid_scores = VC(KNC(),
                                X, y,
                                param_name = 'n_neighbors',
                                param_range = k_neighbors,
                                cv = 5)
train_knn = []
valid_knn = []
#Average the 5 cross-validation scores for each value of k.
for k in range(70):
    train_knn.append(sum(train_scores[k]) / 5)
    valid_knn.append(sum(valid_scores[k]) / 5)
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(k_neighbors,
        train_knn,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(k_neighbors,
        valid_knn,
        c = 'g',
        lw = 3,
        label = "Validation Set")
ax.set_title("Train and Test Set Accuracies of KNN for 1 - 70 Neighbors", fontsize = 22)
ax.set_xlabel(xlabel = "Number of Neighbors", fontsize = 18)
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18)
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);
Using approximately 50 nearest neighbors or more yields a converged accuracy of about 0.75 for the train and validation sets.
knn_model_final = KNC(n_neighbors = 50).fit(X, y)
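#In-sample accuracy of the final KNN model on the full training set.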
metrics.accuracy_score(y, knn_model_final.predict(X))
Because the training data set is small, I decided to use a Random Forest to optimize. A Random Forest is an ensemble of Decision Trees combined into a final model. A single decision tree is computed very quickly; however, it is prone to overfitting. A Random Forest also derives the importance of each feature from the many trees used to create it (a sketch of extracting these importances follows the model below).
I created a Random Forest from 30 Decision Trees.
dtrf_model_final = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 3333).fit(X, y)
metrics.accuracy_score(y, dtrf_model_final.predict(X))
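A minimal sketch of extracting the feature importances mentioned above, assuming the dtrf_model_final and Feature objects defined earlier:
#Pair each feature name with its importance from the fitted forest, sorted descending.
pd.Series(dtrf_model_final.feature_importances_,
          index = Feature.columns).sort_values(ascending = False)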
I performed a grid search on a support vector machine, varying $C$ from $10^{-5}$ to $10^{4}$ and varying $γ$ from $10^{-7}$ to $10^{-1}$, both by powers of $10$.
Similar to the KNN model, I used the average accuracy from 5-fold cross-validation.
C_list = [10**n for n in range(-5, 5)]
gamma_list = [10**n for n in range(-7, 0)]
param_grid = {'C': C_list,
              'gamma': gamma_list}
grid = GridSearchCV(svm.SVC(),
                    param_grid,
                    cv = 5,
                    refit = True,
                    verbose = 0).fit(X, y)
grid.best_params_
The grid search yields best hyperparameters of $C = 10^{-5}$ and $γ = 10^{-7}$.
However, these are the minimal values searched for both $C$ and $γ$.
Here is a look at the mean cross-validation accuracy of every model from the grid search, indexed by the hyperparameters.
df_grid = pd.DataFrame(data = grid.cv_results_['mean_test_score'].reshape(10, 7),
                       index = C_list,
                       columns = gamma_list)
df_grid
The accuracy scores for all combinations of $C$ (rows) and $γ$ (columns) are shown above. The vast majority of cells, except those in the bottom-right corner, share the same accuracy score to six decimal places: $0.751470$.
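That constant score is approximately the share of the majority class in the training set, which suggests that for those hyperparameter combinations the SVM simply predicts 'PAIDOFF' for every customer. A quick check of the class share (a sketch, using the df loaded earlier):
#Normalized class counts; the majority-class share roughly matches the constant grid-search accuracy above.
df['loan_status'].value_counts(normalize = True)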
I used $C = 10^{-5}$ and $γ = 10^{-7}$ for the final model.
svm_model_final = svm.SVC(C = 1e-5, gamma = 1e-7, kernel = 'rbf').fit(X, y)
metrics.accuracy_score(y, svm_model_final.predict(X))
For Logistic Regression, I chose a 5-fold cross-validation split, varying $C$ from $10^{-5}$ to $10^{4}$ by powers of $10$ (reusing C_list from the grid search).
train_scores, valid_scores = VC(LR(),
                                X, y,
                                param_name = 'C',
                                param_range = C_list,
                                cv = 5)
train_lr = []
valid_lr = []
#Average the 5 cross-validation scores for each value of C.
for c in range(len(C_list)):
    train_lr.append(sum(train_scores[c]) / 5)
    valid_lr.append(sum(valid_scores[c]) / 5)
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(C_list,
        train_lr,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(C_list,
        valid_lr,
        c = 'g',
        lw = 3,
        label = "Validation Set")
ax.set_title("Train and Test Set Accuracies of Logistic Regression for C", fontsize = 22)
ax.set_xlabel(xlabel = "Hyperparameter C value", fontsize = 18)
ax.set_xscale('log')
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18, loc = 'center right')
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);
Any $C ≤ 10^{-2}$ in the graph yields an accuracy of approximately 0.75 for both the training and validation sets.
lr_model_final = LR(C = 1e-4).fit(X, y)
metrics.accuracy_score(y, lr_model_final.predict(X))
from sklearn.metrics import jaccard_score as js
from sklearn.metrics import f1_score as fs
from sklearn.metrics import log_loss as ll
url = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv"
filename = wget.download(url)
df_test = pd.read_csv(filename)
df_test.head()
df_test['effective_date'] = pd.to_datetime(df_test['effective_date'])
df_test['dayofweek'] = df_test['effective_date'].dt.dayofweek
df_test['Weekend'] = df_test['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
df_test['Gender'].replace(to_replace = ['male', 'female'], value = [0, 1], inplace = True)
Feature_test = df_test[['Principal', 'terms', 'age', 'Gender', 'Weekend']]
Feature_test = pd.concat([Feature_test, pd.get_dummies(df_test['education'])], axis = 1)
Feature_test.drop(['Master or Above'], axis = 1, inplace = True)
Feature_test.rename(columns = {'terms': 'Terms',
                               'age': 'Age',
                               'Bechalor': 'Bachelor',
                               'college': 'College'},
                    inplace = True)
X_test = Feature_test
y_test = df_test['loan_status']
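#Note: this refits the scaler on the test features, mirroring the training pre-processing; reusing the scaler fitted on the training features is the more standard choice.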
X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
Using the test set, the Jaccard score and F1 score will be calculated for each model: K-Nearest Neighbors, Decision Tree (Random Forest), Support Vector Machine, and Logistic Regression. In addition, the log loss will be calculated for the Logistic Regression model.
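As a reminder, the Jaccard score measures the overlap between the true labels $y$ and the predictions $\hat{y}$, $J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}$, and the F1 score is the harmonic mean of precision $P$ and recall $R$: $F_1 = \frac{2PR}{P + R}$.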
#Initialize lists.
models = [knn_model_final, dtrf_model_final, svm_model_final, lr_model_final]
jaccard = []
f1s = []
#Log loss applies only to Logistic Regression; hold "NA" placeholders for the other three models.
log_loss = ["NA", "NA", "NA"]
#Calculate scores from each model and append to lists.
for model in models:
    y_pred = model.predict(X_test)
    jaccard.append(js(y_test, y_pred, average = 'micro'))
    f1s.append(fs(y_test, y_pred, average = 'micro'))
    if model is lr_model_final:
        log_loss.append(ll(y_test, model.predict_proba(X_test)))
#Output dataframe.
data = {"Algorithm": ["K-Nearest Neighbors", "Decision Tree/Random Forest", "Support Vector Machine", "Logistic Regression"],
"Jaccard": jaccard,
"F1-Score": f1s,
"Log Loss": log_loss}
report = pd.DataFrame(data = data)
report.set_index("Algorithm")
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.read_csv('loan_train.csv')
pd.read_csv('loan_test.csv')