I am using the Ames, Iowa dataset from the class notes. This dataset contains 2930 observations and 82 variables related to house sales in Ames, Iowa. The variables are related to characteristics and measurements of the house, its location, and its purchase.
Refer to the codebook for a complete list of variables and labels.
I will fit four models on the same training set to predict the Sale Price of the homes. The models will be evaluated on the same test set, then compared and analyzed, and a final model will be chosen.
import math
import numpy as np
import pandas as pd
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import seaborn as sns
from sklearn.model_selection import cross_val_predict as CVP
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression as LR
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as MSErr
from sklearn.metrics import mean_absolute_error as MAErr
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split as TTS
from sklearn.preprocessing import StandardScaler as SS
df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')
#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
I am going to select a few noteworthy variables and compare the relationships among them and with Sale Price. The variables are a mix of numerical and categorical, so I plan to choose variables of common interest that also contain a minimal number of missing values.
df.dtypes.value_counts()
A summary of all variables is shown below.
df.describe().round(2)
A count of missing values by variable is shown below.
df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)
df_na = df_na.rename(columns={0: 'NA Count'})
df_na.T
Because there are plenty of variables without missing values, only complete columns will be used for the analysis.
complete_columns = df_na[df_na['NA Count'] == 0].index.to_list()
df_no_na = df[complete_columns]
df_no_na.shape
sns.set()
#Scatter Plot.
exp1 = sns.lmplot(data = df_no_na, x = 'Gr Liv Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Area and Number of Bedrooms', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Above Ground Living Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [1000, 2000, 3000, 4000, 5000, 6000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);
For the first exploration, Price is compared to Above Ground Living Area and Number of Bedrooms. Area had some right skew, so the x-axis was log scaled.
Bedrooms alone does not show a clear relationship with Price; however, Bedrooms tends to increase with Area. The regression lines for Bedrooms from $0$ to $6$ all show a positive correlation of Area with Price. As noted below, there is only one house with $8$ bedrooms.
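As a quick check of that right skew, the sample skewness of Area can be computed directly (a minimal sketch; the exact value is not shown here).
#Sample skewness of Above Ground Living Area.
df_no_na['Gr Liv Area'].skew()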
bdrm_cnt = df_no_na['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt.T
#Sort Neighborhoods by median Price in ascending order.
xlabels = df_no_na.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index
exp2 = sns.violinplot(data = df_no_na, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
palette = 'viridis_r', order = xlabels)
plt.gcf().set_size_inches(15, 6.92)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
For the second exploration, Price is compared to Neighborhood. Each Neighborhood is represented by a violin plot, sorted in ascending order by median Price.
There is much overlap in Price, as there is significant and varying spread of house values within each Neighborhood. Overall, there appears to be enough of a relationship to warrant further investigation. A frequency table of Neighborhood is shown below.
nbhd_cnt = df_no_na['Neighborhood'].value_counts().sort_index().to_frame().rename(columns = {'Neighborhood': "Count of Houses"})
nbhd_cnt.index.name = "Neighborhood"
nbhd_cnt.T
#Scatter Plot.
exp3 = sns.lmplot(data = df_no_na, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.tick_params(axis = 'x', which = 'major', labelrotation = 45)
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
ncol = 2, title_fontsize = 18, fontsize = 14);
For the third exploration, Price is compared to Year Built and Overall Quality.
Comparing these three together is quite telling. Houses built earlier are of lower overall quality and lower selling price, as seen with the yellow, green, and cyan data points at the bottom of the graph. From 1990 to 2010, there is a sharp increase in quality and selling price, as seen with the blue, indigo, and purple data points on the far right of the graph.
The regression lines for Overall Quality show a positive relationship between Year Built and Sale Price. Furthermore, there is very little intersection of these lines; they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.
A histogram of Sale Price is shown below.
df_no_na['SalePrice'].hist(bins = 8);
Sale Price is log transformed to have a more normal distribution.
np.log1p(df_no_na['SalePrice']).hist();
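To quantify the change, the skewness of Sale Price before and after the transformation can be compared (a minimal sketch; the values are not shown here).
#Skewness of Sale Price before and after the log transformation.
df_no_na['SalePrice'].skew(), np.log1p(df_no_na['SalePrice']).skew()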
Sale Price is set as the index.
In addition, Order and PID are dropped, as they are ID variables.
df_final = df_no_na.drop(['Order', 'PID'], axis = 1)
df_final['SalePrice'] = np.log1p(df_no_na['SalePrice'])
df_final.set_index('SalePrice', inplace = True)
Categorical variables are encoded.
#Code from class notebook.
one_hot_encode_cols = df_final.dtypes[df_final.dtypes == object]
one_hot_encode_cols = one_hot_encode_cols.index.tolist()
df_final = pd.get_dummies(df_final, columns = one_hot_encode_cols, drop_first=True)
df_final.shape
Check for skewed distributions.
#Code from class notebook.
float_cols = df_final.columns[df_final.dtypes == float]
skew_limit = 0.75
skew_vals = df_final[float_cols].skew()
skew_cols = (skew_vals
.sort_values(ascending = False)
.to_frame()
.rename(columns = {0: 'Skew'})
.query('abs(Skew) > {}'.format(skew_limit)))
skew_cols
df_final.head()
Data is ready for modeling.
X = df_final.reset_index().drop('SalePrice', axis = 1)
y = df_final.index
X_train, X_test, y_train, y_test = TTS(X, y, test_size = 0.3, random_state = 13131)
X_train_1 = X_train.copy()
X_test_1 = X_test.copy()
reg_1 = LR().fit(X_train_1, y_train)
y_pred = reg_1.predict(X_test_1)
Based on common conversations I've heard and read about purchasing a home in the past 30 years, I chose the following variables as predictors of Sale Price.
I think these will give a good snapshot and I am curious to see if they show up as final predictors, as well as how the overall manual model compares with the calculated models.
#Numerical predictors.
predictors2 = ['1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'Full Bath',
'Half Bath', 'Overall Cond', 'Overall Qual', 'Year Built']
#Neighborhood has been one hot encoded.
neighborhoods = [v for v in df_final.columns.to_list() if v.startswith('Neighborhood')]
X_train_2 = X_train[predictors2 + neighborhoods]
X_test_2 = X_test[predictors2 + neighborhoods]
reg_2 = LR().fit(X_train_2, y_train)
kf = KFold(shuffle = True, random_state = 13131, n_splits = 3)
#Retrieve R2 scores for different alpha for LASSO or Ridge.
def optimize_alpha(alphas, x, y, model, kf):
    #Scale and transform x.
    s = SS()
    x = s.fit_transform(x)
    #List of R2.
    r2_scores = []
    for alpha in alphas:
        reg = model(alpha = alpha, max_iter = 500000)
        y_pred = CVP(reg, x, y, cv = kf)
        score = r2_score(y, y_pred)
        r2_scores.append(score)
    return(r2_scores)
def alpha_r2_graph(alphas, R2s, xlabels, model):
    df = pd.DataFrame(data = {'alpha': alphas,
                              'R2': R2s})
    sns.set()
    #Line Plot.
    sns.lineplot(data = df,
                 x = 'alpha',
                 y = 'R2',
                 marker = 'o')
    #Size.
    plt.gcf().set_size_inches(15, 6.92)
    paper_rc = {'lines.linewidth': 2, 'lines.markersize': 6}
    #Axes.
    ax = plt.gca()
    #Title setup.
    ax.set_title("Optimizing Hyperparameter for {} Regression".format(model), fontsize = 24)
    #X-axis setup.
    ax.set_xlabel("α", fontsize = 22)
    ax.set_xscale('log')
    ax.set_xticks(xlabels)
    ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
    if model == 'Ridge':
        ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))
    #Y-axis setup.
    ax.set_ylabel("R2", fontsize = 22)
    ylabels = [0, 0.2, 0.4, 0.6, 0.8, 1]
    ax.set_yticks(ylabels)
    ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
All variables will be used as predictors with an $L_1$ regularization.
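For reference, scikit-learn's Lasso minimizes $\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$, where $n$ is the number of training samples, so a larger $\alpha$ drives more coefficients exactly to $0$.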
alphas = [m * a for a in np.geomspace(1e-5, 1e1, 7) for m in (0.5, 1, 2)]
xlabels = list(np.geomspace(1e-5, 1e1, 7))
s = SS()
X_train_3 = s.fit_transform(X_train)
#Determine R2s and graph.
r2s_l = optimize_alpha(alphas, X_train_3, y_train, Lasso, kf)
alpha_r2_graph(alphas, r2s_l, xlabels, 'LASSO')
$α = 0.005$ is optimal.
X_test_3 = s.transform(X_test)
reg_3 = Lasso(alpha = 0.005).fit(X_train_3, y_train)
All variables will be used as predictors with an $L_2$ regularization.
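For reference, scikit-learn's Ridge minimizes $\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$; coefficients are shrunk toward $0$ but rarely become exactly $0$, which is why a much larger range of $\alpha$ is searched here.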
alphas = [m * a for a in np.geomspace(1, 1e6, 7) for m in (0.5, 1, 2)]
xlabels = list(np.geomspace(1, 1e6, 7))
s = SS()
X_train_4 = s.fit_transform(X_train)
#Determine R2s and graph.
r2s_r = optimize_alpha(alphas, X_train_4, y_train, Ridge, kf)
alpha_r2_graph(alphas, r2s_r, xlabels, 'Ridge')
$α = 200$ is optimal.
X_test_4 = s.transform(X_test)
reg_4 = Ridge(alpha = 200).fit(X_train_4, y_train)
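The four models are compared on $R^2$, Adjusted $R^2$, RMSE, and MAE, all computed on the log-price scale. Adjusted $R^2$ penalizes model size: with $n$ observations and $p$ predictors, $\text{Adjusted } R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, which is the formula used in the function below.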
#Inputs regression models, predictors, and y-values, outputs DataFrame of R2, Adjusted R2, RMSE, and MAE.
def summary_df(models, Xs, Y):
    index = ['All Variables', 'My Variables', 'LASSO', 'Ridge']
    R2 = []
    ADJ_R2 = []
    RMSE = []
    MAE = []
    for i in range(4):
        y_pred = models[i].predict(Xs[i])
        #R2.
        r2 = r2_score(Y, y_pred)
        R2.append(r2)
        #Adj R2.
        adj_r2 = 1.0 - (1.0 - r2) * (len(Y) - 1.0) / (len(Y) - Xs[i].shape[1] - 1.0)
        ADJ_R2.append(adj_r2)
        #RMSE.
        rmse = math.sqrt(MSErr(Y, y_pred))
        RMSE.append(rmse)
        #MAE.
        mae = MAErr(Y, y_pred)
        MAE.append(mae)
    df = pd.DataFrame(data = {'R2': R2,
                              'Adjusted R2': ADJ_R2,
                              'RMSE': RMSE,
                              'MAE': MAE},
                      index = index)
    return(df)
models = [reg_1, reg_2, reg_3, reg_4]
X_trains = [X_train_1, X_train_2, X_train_3, X_train_4]
summary_df(models, X_trains, y_train)
X_tests = [X_test_1, X_test_2, X_test_3, X_test_4]
summary_df(models, X_tests, y_test)
Including All Variables was clearly best on the training data, but it clearly overfit and was the worst on the test data. It will be ignored from here on.
My Variables, chosen simply from a little personal insight into what people look for, did surprisingly well compared to the LASSO and Ridge models. My Variables has the lowest Adjusted R2 score, so a further investigation of the predictors is necessary.
X_train_2.columns
X_train.columns[reg_3.coef_ != 0]
X_train.columns[reg_4.coef_ != 0]
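A quick count of the surviving coefficients in each regularized model (a minimal sketch using the fitted models above).
#Count the nonzero LASSO coefficients and the Ridge coefficients that are exactly zero.
(reg_3.coef_ != 0).sum(), (reg_4.coef_ == 0).sum()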
LASSO trimmed the model down to $78$ predictors. Ridge only had $3$ coefficients go to $0$. In that sense, My Variables is very much the parsimonious model compared to the other two. It is slightly less accurate; however, the variables used as predictors are easily explainable:
When people look for a house, they want to know how much space is available on the first and second floors, along with how many bedrooms and bathrooms there are. They would like to know the overall quality and condition, what year it was built, and what the neighborhood is like.
My Variables concisely sums up 86% of the variance of the house prices in real-world terms. The regularization models cannot do that.
I would like to begin this section by stating that I am by no means advocating going "by gut instinct" and ignoring exploration and feature selection. In this particular instance, my background knowledge was helpful, and manual selection produced a better model than the two I evaluated with machine learning. In practice, that will probably happen much less often.
Using some other variables with my model may have improved R2 and Adjusted R2. For example, Heating QC, Central Air, Fireplaces, and Pool Area are amenities which may hold some predictive power.
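As a sketch of that idea (a hypothetical extension, not one of the models compared above; it assumes these columns survived the complete-columns filter and that the categorical ones were one hot encoded earlier):
#Hypothetical extension of My Variables with amenity predictors (sketch only).
amenities = ['Fireplaces', 'Pool Area']
amenity_dummies = [c for c in df_final.columns if c.startswith(('Central Air', 'Heating QC'))]
X_train_2b = X_train[predictors2 + neighborhoods + amenities + amenity_dummies]
X_test_2b = X_test[predictors2 + neighborhoods + amenities + amenity_dummies]
reg_2b = LR().fit(X_train_2b, y_train)
r2_score(y_test, reg_2b.predict(X_test_2b))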
I quickly eliminated all columns that had at least one NA value. It is possible that I removed a few columns with a low number of NAs and high predictive power. A further investigation of those variables could have been useful.
I did not use any polynomial features, which could possibly improve the model.
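A minimal sketch of that idea, restricted to the numeric manual predictors so the feature count stays manageable (degree $2$ is an arbitrary choice):
from sklearn.preprocessing import PolynomialFeatures
#Sketch: add squared and interaction terms for the manual numeric predictors only.
pf = PolynomialFeatures(degree = 2, include_bias = False)
X_train_poly = pf.fit_transform(X_train[predictors2])
X_test_poly = pf.transform(X_test[predictors2])
reg_poly = LR().fit(X_train_poly, y_train)
r2_score(y_test, reg_poly.predict(X_test_poly))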
Log transforming was a powerful tool for the target variable.
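One practical note on that point: because the target is $\log(1 + \text{SalePrice})$, predictions must be passed through np.expm1 to be read in dollars (a minimal sketch using the My Variables model).
#Back-transform log1p predictions to dollar prices and report MAE in dollars.
price_pred = np.expm1(reg_2.predict(X_test_2))
price_true = np.expm1(np.asarray(y_test))
MAErr(price_true, price_pred)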
Latitude and Longitude of the houses, that is, their specific locations, may prove to be a better indicator of Price than Neighborhood.
It is important to stress that this model is specific to Ames, IA, and would vary greatly across the US. Although this data set of nearly 3,000 observations is satisfactory for a city, a much larger sample would be necessary for the US.