This is a predictive modeling project that classifies loan repayment status using selected machine learning algorithms.
I. Setup and II. Pre-Processing were provided by the course. I modified a few aspects to fit my style: the section labeling, some of the graphs, the Python variable naming conventions, and the dataframe column names.
III. Classification, IV. Scoring, and V. Appendix were coded by me.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import preprocessing
%matplotlib inline
This dataset is about past loans. The loan_train.csv data set includes details of 346 customers whose loans have already been paid off or defaulted. It includes the following fields:
Field | Description |
---|---|
Loan_status | Whether a loan is paid off or in collection |
Principal | Basic principal loan amount at origination |
Terms | Origination terms, which can be a weekly (7-day), biweekly, or monthly payoff schedule |
Effective_date | When the loan was originated and took effect |
Due_date | Since it is a one-time payoff schedule, each loan has a single due date |
Age | Age of applicant |
Education | Education of applicant |
Gender | Gender of applicant |
import wget
url = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_train.csv"
filename = wget.download(url)
df = pd.read_csv(filename)
df['due_date'] = pd.to_datetime(df['due_date'])
df['effective_date'] = pd.to_datetime(df['effective_date'])
df.head()
df['loan_status'].value_counts()
bins = np.linspace(df.Principal.min(), df.Principal.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'Principal',
      bins = bins,
      ec = "k")
g.set(ylabel = "Frequency")
g.axes[-1].legend();
bins = np.linspace(df.age.min(), df.age.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'age',
      bins = bins,
      ec = "k")
g.set(xlabel = "Age",
      ylabel = "Frequency")
g.axes[-1].legend();
df.groupby(['Gender'])['loan_status'].value_counts(normalize = True)
86% of females pay off their loans, while only 73% of males do. Convert male to 0 and female to 1.
df['Gender'].replace(to_replace = ['male', 'female'],
                     value = [0, 1],
                     inplace = True)
df.head()
df['dayofweek'] = df['effective_date'].dt.dayofweek
bins = np.linspace(df.dayofweek.min(), df.dayofweek.max(), 10)
sns.set(font_scale = 2.5)
g = sns.FacetGrid(df,
                  col = "Gender",
                  hue = "loan_status",
                  palette = "viridis",
                  col_wrap = 2,
                  height = 9,
                  aspect = 13 / 9)
g.map(plt.hist,
      'dayofweek',
      bins = bins,
      ec = "k")
g.set(xlabel = "Day of Week",
      ylabel = "Frequency")
g.axes[-1].legend();
Categorize days by whether they fall at the end of the week; the Weekend feature flags Friday through Sunday.
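#With dt.dayofweek, Monday = 0, so x > 3 flags Friday, Saturday, and Sunday.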
df['Weekend'] = df['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
df.head()
Education has four categories.
df.groupby(['education'])['loan_status'].value_counts(normalize = True)
Feature = df[['Principal', 'terms', 'age', 'Gender', 'Weekend']]
#Convert 'education' to dummy variables.
Feature = pd.concat([Feature, pd.get_dummies(df['education'])], axis = 1)
Feature.drop(['Master or Above'], axis = 1, inplace = True)
#Rename columns.
Feature.rename(columns = {'terms': 'Terms',
                          'age': 'Age',
                          'Bechalor': 'Bachelor',
                          'college': 'College'},
               inplace = True)
Feature.head()
X = Feature
y = df['loan_status'].values
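#Standardize the features to zero mean and unit variance.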
X = preprocessing.StandardScaler().fit(X).transform(X)
from sklearn.model_selection import validation_curve as VC
from sklearn.neighbors import KNeighborsClassifier as KNC
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
from sklearn import svm
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression as LR
For K-Nearest Neighbors, I chose a 5-fold cross-validation split across 1 to 70 neighbors.
k_neighbors = list(range(1, 71))
train_scores, valid_scores = VC(KNC(),
                                X, y,
                                param_name = 'n_neighbors',
                                param_range = k_neighbors,
                                cv = 5)
train_knn = []
valid_knn = []
#Average the 5 cross-validation scores for each value of k.
for k in range(70):
    train_knn.append(sum(train_scores[k]) / 5)
    valid_knn.append(sum(valid_scores[k]) / 5)
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(k_neighbors,
        train_knn,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(k_neighbors,
        valid_knn,
        c = 'g',
        lw = 3,
        label = "Validation Set")
ax.set_title("Train and Test Set Accuracies of KNN for 1 - 70 Neighbors", fontsize = 22)
ax.set_xlabel(xlabel = "Number of Neighbors", fontsize = 18)
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18)
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);
Using approximately 50 nearest neighbors or more yields a converged accuracy of about 0.75 for the train and validation sets.
knn_model_final = KNC(n_neighbors = 50).fit(X, y)
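#In-sample accuracy of the final KNN model on the full training set.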
metrics.accuracy_score(y, knn_model_final.predict(X))
Because the training data set is small, I decided to use a Random Forest to optimize. A Random Forest is an ensemble of Decision Trees combined into a final model. A single decision tree is computed very quickly; however, it is prone to overfitting. A Random Forest also derives the importance of each feature from the many trees used to create it (a sketch of extracting these importances follows the model below).
I created a Random Forest from 30 Decision Trees.
dtrf_model_final = RandomForestClassifier(n_estimators = 30, max_depth = 10, random_state = 3333).fit(X, y)
metrics.accuracy_score(y, dtrf_model_final.predict(X))
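A minimal sketch of extracting the feature importances mentioned above, assuming the dtrf_model_final and Feature objects defined earlier:
#Pair each feature name with its importance from the fitted forest, sorted descending.
pd.Series(dtrf_model_final.feature_importances_,
          index = Feature.columns).sort_values(ascending = False)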
I performed a grid search on a support vector machine, varying $C$ from $10^{-5}$ to $10^{4}$ and varying $γ$ from $10^{-7}$ to $10^{-1}$, both by powers of $10$.
Similar to the KNN model, I used the average accuracy from 5-fold cross-validation.
C_list = [10**n for n in range(-5, 5)]
gamma_list = [10**n for n in range(-7, 0)]
param_grid = {'C': C_list,
              'gamma': gamma_list}
grid = GridSearchCV(svm.SVC(),
                    param_grid,
                    cv = 5,
                    refit = True,
                    verbose = 0).fit(X, y)
grid.best_params_
The grid search yields best hyperparameters of $C = 10^{-5}$ and $γ = 10^{-7}$.
However, these are the minimal values searched for both $C$ and $γ$.
Here is a look at the mean cross-validation accuracy of every model from the grid search, indexed by the hyperparameters.
df_grid = pd.DataFrame(data = grid.cv_results_['mean_test_score'].reshape(10, 7),
                       index = C_list,
                       columns = gamma_list)
df_grid
The accuracy scores for all combinations of $C$ (rows) and $γ$ (columns) are shown above. The vast majority of cells, except those in the bottom-right corner, share the same accuracy score to six decimal places: $0.751470$.
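That constant score is approximately the share of the majority class in the training set, which suggests that for those hyperparameter combinations the SVM simply predicts 'PAIDOFF' for every customer. A quick check of the class share (a sketch, using the df loaded earlier):
#Normalized class counts; the majority-class share roughly matches the constant grid-search accuracy above.
df['loan_status'].value_counts(normalize = True)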
I used $C = 10^{-5}$ and $γ = 10^{-7}$ for the final model.
svm_model_final = svm.SVC(C = 1e-5, gamma = 1e-7, kernel = 'rbf').fit(X, y)
metrics.accuracy_score(y, svm_model_final.predict(X))
For Logistic Regression, I chose a 5-fold cross-validation split, varying $C$ from $10^{-5}$ to $10^{4}$ by powers of $10$ (reusing C_list from the grid search).
train_scores, valid_scores = VC(LR(),
                                X, y,
                                param_name = 'C',
                                param_range = C_list,
                                cv = 5)
train_lr = []
valid_lr = []
#Average the 5 cross-validation scores for each value of C.
for c in range(len(C_list)):
    train_lr.append(sum(train_scores[c]) / 5)
    valid_lr.append(sum(valid_scores[c]) / 5)
fig, ax = plt.subplots(figsize=(12, 8))
ax.plot(C_list,
        train_lr,
        c = 'b',
        lw = 3,
        label = "Train Set")
ax.plot(C_list,
        valid_lr,
        c = 'g',
        lw = 3,
        label = "Validation Set")
ax.set_title("Train and Test Set Accuracies of Logistic Regression for C", fontsize = 22)
ax.set_xlabel(xlabel = "Hyperparameter C value", fontsize = 18)
ax.set_xscale('log')
ax.set_ylabel(ylabel = "Accuracy", fontsize = 18)
ax.tick_params(axis = 'both', labelsize = 16)
ax.legend(fontsize = 18, loc = 'center right')
plt.figtext(0.9, 0, "5-fold cross-validation values were averaged.", ha = 'right', fontsize = 12);
Any $C ≤ 10^{-2}$ in the graph yields an accuracy of approximately 0.75 for both the training and validation sets.
lr_model_final = LR(C = 1e-4).fit(X, y)
metrics.accuracy_score(y, lr_model_final.predict(X))
from sklearn.metrics import jaccard_score as js
from sklearn.metrics import f1_score as fs
from sklearn.metrics import log_loss as ll
url = "https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/loan_test.csv"
filename = wget.download(url)
df_test = pd.read_csv(filename)
df_test.head()
df_test['effective_date'] = pd.to_datetime(df_test['effective_date'])
df_test['dayofweek'] = df_test['effective_date'].dt.dayofweek
df_test['Weekend'] = df_test['dayofweek'].apply(lambda x: 1 if (x > 3) else 0)
df_test['Gender'].replace(to_replace = ['male', 'female'], value = [0, 1], inplace = True)
Feature_test = df_test[['Principal', 'terms', 'age', 'Gender', 'Weekend']]
Feature_test = pd.concat([Feature_test, pd.get_dummies(df_test['education'])], axis = 1)
Feature_test.drop(['Master or Above'], axis = 1, inplace = True)
Feature_test.rename(columns = {'terms': 'Terms',
                               'age': 'Age',
                               'Bechalor': 'Bachelor',
                               'college': 'College'},
                    inplace = True)
X_test = Feature_test
y_test = df_test['loan_status']
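#Note: this refits the scaler on the test features, mirroring the training pre-processing; reusing the scaler fitted on the training features is the more standard choice.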
X_test = preprocessing.StandardScaler().fit(X_test).transform(X_test)
Using the test set, the Jaccard score and F1 score will be calculated for each model: K-Nearest Neighbors, Decision Tree (Random Forest), Support Vector Machine, and Logistic Regression. In addition, the log loss will be calculated for the Logistic Regression model.
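As a reminder, the Jaccard score measures the overlap between the true labels $y$ and the predictions $\hat{y}$, $J(y, \hat{y}) = \frac{|y \cap \hat{y}|}{|y \cup \hat{y}|}$, and the F1 score is the harmonic mean of precision $P$ and recall $R$: $F_1 = \frac{2PR}{P + R}$.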
#Initialize lists.
models = [knn_model_final, dtrf_model_final, svm_model_final, lr_model_final]
jaccard = []
f1s = []
#Log loss applies only to Logistic Regression; hold "NA" placeholders for the other three models.
log_loss = ["NA", "NA", "NA"]
#Calculate scores from each model and append to lists.
for model in models:
    y_pred = model.predict(X_test)
    jaccard.append(js(y_test, y_pred, average = 'micro'))
    f1s.append(fs(y_test, y_pred, average = 'micro'))
    if model is lr_model_final:
        log_loss.append(ll(y_test, model.predict_proba(X_test)))
#Output dataframe.
data = {"Algorithm": ["K-Nearest Neighbors", "Decision Tree/Random Forest", "Support Vector Machine", "Logistic Regression"],
"Jaccard": jaccard,
"F1-Score": f1s,
"Log Loss": log_loss}
report = pd.DataFrame(data = data)
report.set_index("Algorithm")
pd.set_option("display.max_rows", None, "display.max_columns", None)
pd.read_csv('loan_train.csv')
pd.read_csv('loan_test.csv')