Final for Supervised Learning: Regression for Machine Learning

Regression and Regularization with Housing Data for Ames, IA

Rohan Lewis

2020.12.01

I. Objective

I am using the Ames, Iowa dataset from the class notes, which contains 2930 observations and 82 variables related to house sales. The variables describe characteristics and measurements of each house, its location, and its purchase.

Refer to the codebook for a complete list of variables and labels.

I will fit four models to predict the Sale Price of the homes, all on the same training set. The models will be evaluated on the same test set, then compared and analyzed, and a final model will be chosen.

1. Packages

In [1]:
import math
import numpy as np
import pandas as pd

from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import seaborn as sns

from sklearn.model_selection import cross_val_predict as CVP
from sklearn.model_selection import KFold

from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression as LR
from sklearn.linear_model import Ridge

from sklearn.metrics import mean_squared_error as MSErr
from sklearn.metrics import mean_absolute_error as MAErr
from sklearn.metrics import r2_score

from sklearn.model_selection import train_test_split as TTS
from sklearn.preprocessing import StandardScaler as SS

2. Load Data

In [2]:
df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')

#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

II. Data Exploration

I am going to select a few noteworthy variables and compare the relationships among them and with Sale Price. The variables are a mix of numerical and categorical types, so I plan to choose variables of common interest to home buyers that also contain minimal missing values.

1. Data types

In [3]:
df.dtypes.value_counts()
Out[3]:
object     43
int64      28
float64    11
dtype: int64

2. Dataframe descriptive stats

A summary of all variables is shown below.

In [4]:
df.describe().round(2)
Out[4]:
Order PID MS SubClass Lot Frontage Lot Area Overall Qual Overall Cond Year Built Year Remod/Add Mas Vnr Area BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr TotRms AbvGrd Fireplaces Garage Yr Blt Garage Cars Garage Area Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold SalePrice
count 2930.00 2.930000e+03 2930.00 2440.00 2930.00 2930.00 2930.00 2930.00 2930.00 2907.00 2929.00 2929.00 2929.00 2929.00 2930.00 2930.00 2930.00 2930.00 2928.00 2928.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00 2771.00 2929.00 2929.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00 2930.00
mean 1465.50 7.144645e+08 57.39 69.22 10147.92 6.09 5.56 1971.36 1984.27 101.90 442.63 49.72 559.26 1051.61 1159.56 335.46 4.68 1499.69 0.43 0.06 1.57 0.38 2.85 1.04 6.44 0.60 1978.13 1.77 472.82 93.75 47.53 23.01 2.59 16.00 2.24 50.64 6.22 2007.79 180796.06
std 845.96 1.887308e+08 42.64 23.37 7880.02 1.41 1.11 30.25 20.86 179.11 455.59 169.17 439.49 440.62 391.89 428.40 46.31 505.51 0.52 0.25 0.55 0.50 0.83 0.21 1.57 0.65 25.53 0.76 215.05 126.36 67.48 64.14 25.14 56.09 35.60 566.34 2.71 1.32 79886.69
min 1.00 5.263011e+08 20.00 21.00 1300.00 1.00 1.00 1872.00 1950.00 0.00 0.00 0.00 0.00 0.00 334.00 0.00 0.00 334.00 0.00 0.00 0.00 0.00 0.00 0.00 2.00 0.00 1895.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 1.00 2006.00 12789.00
25% 733.25 5.284770e+08 20.00 58.00 7440.25 5.00 5.00 1954.00 1965.00 0.00 0.00 0.00 219.00 793.00 876.25 0.00 0.00 1126.00 0.00 0.00 1.00 0.00 2.00 1.00 5.00 0.00 1960.00 1.00 320.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 4.00 2007.00 129500.00
50% 1465.50 5.354536e+08 50.00 68.00 9436.50 6.00 5.00 1973.00 1993.00 0.00 370.00 0.00 466.00 990.00 1084.00 0.00 0.00 1442.00 0.00 0.00 2.00 0.00 3.00 1.00 6.00 1.00 1979.00 2.00 480.00 0.00 27.00 0.00 0.00 0.00 0.00 0.00 6.00 2008.00 160000.00
75% 2197.75 9.071811e+08 70.00 80.00 11555.25 7.00 6.00 2001.00 2004.00 164.00 734.00 0.00 802.00 1302.00 1384.00 703.75 0.00 1742.75 1.00 0.00 2.00 1.00 3.00 1.00 7.00 1.00 2002.00 2.00 576.00 168.00 70.00 0.00 0.00 0.00 0.00 0.00 8.00 2009.00 213500.00
max 2930.00 1.007100e+09 190.00 313.00 215245.00 10.00 9.00 2010.00 2010.00 1600.00 5644.00 1526.00 2336.00 6110.00 5095.00 2065.00 1064.00 5642.00 3.00 2.00 4.00 2.00 8.00 3.00 15.00 4.00 2207.00 5.00 1488.00 1424.00 742.00 1012.00 508.00 576.00 800.00 17000.00 12.00 2010.00 755000.00

3. Missing values

A count of missing values by variable is shown below.

In [5]:
df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)

df_na = df_na.rename(columns={0: 'NA Count'})

df_na.T
Out[5]:
Order Sale Condition Heating QC Central Air 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr Kitchen Qual Foundation TotRms AbvGrd Fireplaces Paved Drive Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold Sale Type Functional Exter Cond Heating Condition 1 House Style Bldg Type Condition 2 Exter Qual Neighborhood Land Slope Lot Config Utilities Land Contour Lot Shape Street Lot Area MS Zoning MS SubClass PID Overall Qual Overall Cond SalePrice Year Built Year Remod/Add Roof Style Roof Matl Exterior 1st Exterior 2nd Electrical BsmtFin SF 1 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF Garage Cars Garage Area Bsmt Half Bath Bsmt Full Bath Mas Vnr Area Mas Vnr Type Bsmt Qual Bsmt Cond BsmtFin Type 1 BsmtFin Type 2 Bsmt Exposure Garage Type Garage Finish Garage Cond Garage Yr Blt Garage Qual Lot Frontage Fireplace Qu Fence Alley Misc Feature Pool QC
NA Count 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 23 23 80 80 80 81 83 157 159 159 159 159 490 1422 2358 2732 2824 2917

Because there are plenty of variables without missing values, only complete columns will be used for the analysis.

In [6]:
complete_columns = df_na[df_na['NA Count'] == 0].index.to_list()

df_no_na = df[complete_columns]
df_no_na.shape
Out[6]:
(2930, 55)

4. Exploration 1

In [7]:
sns.set()

#Scatter Plot.
exp1 = sns.lmplot(data = df_no_na, x = 'Gr Liv Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Area and Number of Bedrooms', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Above Ground Living Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [1000, 2000, 3000, 4000, 5000, 6000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);

For the first exploration, Price is compared to Above Ground Living Area and Number of Bedrooms. Area had some right skew, so the x-axis was log scaled.

Bedrooms alone does not show a clear relationship with Price. However, the number of Bedrooms tends to increase with Area. The regression lines for every Bedroom count from $0$ to $6$ show a positive correlation of Area with Price. As noted below, there is only one house with $8$ bedrooms.

In [8]:
bdrm_cnt = df_no_na['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt.T
Out[8]:
Number of Bedrooms   0    1    2     3    4   5   6  8
Count of Houses      8  112  743  1597  400  48  21  1

5. Exploration 2

In [9]:
#Sort Neighborhoods by median Price in ascending order.
xlabels = df_no_na.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index

exp2 = sns.violinplot(data = df_no_na, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
                      palette = 'viridis_r', order = xlabels)

plt.gcf().set_size_inches(15, 6.92)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

For the second exploration, Price is compared to Neighborhood. Each Neighborhood is represented by a violin plot, sorted in ascending order by median Price.

There is considerable overlap in Price, as the spread of house values within each Neighborhood is large and varies widely. Overall, there appears to be enough of a relationship to warrant further investigation. A frequency table of Neighborhood is shown below.

In [10]:
nbhd_cnt = df_no_na['Neighborhood'].value_counts().sort_index().to_frame().rename(columns = {'Neighborhood': "Count of Houses"})
nbhd_cnt.index.name = "Neighborhood"
nbhd_cnt.T
Out[10]:
Neighborhood Blmngtn Blueste BrDale BrkSide ClearCr CollgCr Crawfor Edwards Gilbert Greens GrnHill IDOTRR Landmrk MeadowV Mitchel NAmes NPkVill NWAmes NoRidge NridgHt OldTown SWISU Sawyer SawyerW Somerst StoneBr Timber Veenker
Count of Houses 28 10 30 108 44 267 103 194 165 8 2 93 1 37 114 443 23 131 71 166 239 48 151 125 182 51 72 24

6. Exploration 3

In [11]:
#Scatter Plot.
exp3 = sns.lmplot(data = df_no_na, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
          labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
                    'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
          ncol = 2, title_fontsize = 18, fontsize = 14);

For the third exploration, Price is compared to Year Built and Overall Quality.

Comparing these three together is quite telling. Houses built earlier are generally of lower quality and lower selling price, as seen in the yellow, green, and cyan data points at the bottom of the graph. From 1990 to 2010, there is a sharp increase in quality and selling price, as seen in the blue, indigo, and purple data points on the far right of the graph.

The regression lines for Overall Quality show a positive relationship between Year Built and Sale Price. Furthermore, these lines rarely intersect; they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.

7. Target Variable

A histogram of Sale Price is shown below.

In [12]:
df_no_na['SalePrice'].hist(bins = 8);

Sale Price is log transformed to have a more normal distribution.

In [13]:
np.log1p(df_no_na['SalePrice']).hist();

Sale Price is set as the index.

In addition, Order and PID are dropped, as they are ID variables.

In [14]:
df_final = df_no_na.drop(['Order', 'PID'], axis = 1)
df_final['SalePrice'] = np.log1p(df_no_na['SalePrice'])
df_final.set_index('SalePrice', inplace = True)

8. One-hot encoding for dummy variables

Categorical variables are one-hot encoded.

In [15]:
#Code from class notebook.
#np.object is deprecated in newer NumPy releases; the builtin object works.
one_hot_encode_cols = df_final.dtypes[df_final.dtypes == object]
one_hot_encode_cols = one_hot_encode_cols.index.tolist()

df_final = pd.get_dummies(df_final, columns = one_hot_encode_cols, drop_first=True)
df_final.shape
Out[15]:
(2930, 191)

9. Log transforming skew variables

Check for skewed distributions.

In [16]:
#Code from class notebook.
#np.float is deprecated in newer NumPy releases; np.float64 works.
float_cols = df_final.columns[df_final.dtypes == np.float64]

skew_limit = 0.75
skew_vals = df_final[float_cols].skew()
skew_cols = (skew_vals
             .sort_values(ascending = False)
             .to_frame()
             .rename(columns = {0: 'Skew'})
             .query('abs(Skew) > {}'.format(skew_limit)))

skew_cols
Out[16]:
Skew

The table is empty: all float-typed columns contained missing values and were dropped earlier, so no remaining variables require a skew-correcting log transform.
In [17]:
df_final.head()
Out[17]:
1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr TotRms AbvGrd Fireplaces Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Misc Val Mo Sold Yr Sold Lot Area MS SubClass Overall Qual Overall Cond Year Built Year Remod/Add Sale Condition_AdjLand Sale Condition_Alloca Sale Condition_Family Sale Condition_Normal Sale Condition_Partial Heating QC_Fa Heating QC_Gd Heating QC_Po Heating QC_TA Central Air_Y Kitchen Qual_Fa Kitchen Qual_Gd Kitchen Qual_Po Kitchen Qual_TA Foundation_CBlock Foundation_PConc Foundation_Slab Foundation_Stone Foundation_Wood Paved Drive_P Paved Drive_Y Sale Type_CWD Sale Type_Con Sale Type_ConLD Sale Type_ConLI ... Street_Pave MS Zoning_C (all) MS Zoning_FV MS Zoning_I (all) MS Zoning_RH MS Zoning_RL MS Zoning_RM Roof Style_Gable Roof Style_Gambrel Roof Style_Hip Roof Style_Mansard Roof Style_Shed Roof Matl_CompShg Roof Matl_Membran Roof Matl_Metal Roof Matl_Roll Roof Matl_Tar&Grv Roof Matl_WdShake Roof Matl_WdShngl Exterior 1st_AsphShn Exterior 1st_BrkComm Exterior 1st_BrkFace Exterior 1st_CBlock Exterior 1st_CemntBd Exterior 1st_HdBoard Exterior 1st_ImStucc Exterior 1st_MetalSd Exterior 1st_Plywood Exterior 1st_PreCast Exterior 1st_Stone Exterior 1st_Stucco Exterior 1st_VinylSd Exterior 1st_Wd Sdng Exterior 1st_WdShing Exterior 2nd_AsphShn Exterior 2nd_Brk Cmn Exterior 2nd_BrkFace Exterior 2nd_CBlock Exterior 2nd_CmentBd Exterior 2nd_HdBoard Exterior 2nd_ImStucc Exterior 2nd_MetalSd Exterior 2nd_Other Exterior 2nd_Plywood Exterior 2nd_PreCast Exterior 2nd_Stone Exterior 2nd_Stucco Exterior 2nd_VinylSd Exterior 2nd_Wd Sdng Exterior 2nd_Wd Shng
SalePrice
12.278398 1656 0 0 1656 1 0 3 1 7 2 210 62 0 0 0 0 0 5 2010 31770 20 6 5 1960 1960 0 0 0 1 0 1 0 0 0 1 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0
11.561725 896 0 0 896 1 0 2 1 5 0 140 0 0 0 120 0 0 6 2010 11622 20 5 6 1961 1961 0 0 0 1 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
12.055256 1329 0 0 1329 1 1 3 1 6 0 393 36 0 0 0 0 12500 6 2010 14267 20 6 6 1958 1958 0 0 0 1 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
12.404928 2110 0 0 2110 2 1 3 1 8 2 0 0 0 0 0 0 0 4 2010 11160 20 7 5 1968 1968 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
12.154258 928 701 0 1629 2 1 3 1 6 1 212 34 0 0 0 0 0 3 2010 13830 60 5 5 1997 1998 0 0 0 1 0 0 1 0 0 1 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 ... 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0

5 rows × 191 columns

10. Train and Test Split

Data is ready for modeling.

In [18]:
X = df_final.reset_index().drop('SalePrice', axis = 1)
y = df_final.index

X_train, X_test, y_train, y_test = TTS(X, y, test_size = 0.3, random_state = 13131)

III. Modeling

1. Linear Regression

All variables will be used as predictors.

In [19]:
X_train_1 = X_train.copy()
X_test_1 = X_test.copy()

reg_1 = LR().fit(X_train_1, y_train)
y_pred = reg_1.predict(X_test_1)

2. Manual Variable Selection

Based on conversations about home buying that I have heard and read over the past 30 years, I chose the following variables as predictors of Sale Price.

  • First Floor Area and Second Floor Area
  • Bedrooms
  • Full Bathrooms and Half Bathrooms
  • Neighborhood (its corresponding categories)
  • Overall Condition and Overall Quality (their corresponding categories)
  • Year Built

I think these will give a good snapshot, and I am curious to see whether they show up as final predictors in the other models, as well as how the overall manual model compares with the machine-selected ones.

In [20]:
#Numerical predictors.
predictors2 = ['1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'Full Bath',
               'Half Bath', 'Overall Cond', 'Overall Qual', 'Year Built']

#Neighborhood has been one-hot encoded.
neighborhoods = [v for v in df_final.columns.to_list() if v.startswith('Neighborhood')] 

X_train_2 = X_train[predictors2 + neighborhoods]
X_test_2 = X_test[predictors2 + neighborhoods]

reg_2 = LR().fit(X_train_2, y_train)

3. Regularized Regression Setup

K Fold

In [21]:
kf = KFold(shuffle = True, random_state = 13131, n_splits = 3)

Optimization Function

In [22]:
#Retrieve R2 scores for different alpha for LASSO or Ridge.
def optimize_alpha(alphas, x, y, model, kf):
    
    #Scale and transform x.
    s = SS()
    x = s.fit_transform(x)
    
    #List of R2.
    r2_scores = []
    
    for alpha in alphas:
        
        reg = model(alpha = alpha, max_iter = 500000)
        y_pred = CVP(reg, x, y, cv = kf)
        score = r2_score(y, y_pred)
        r2_scores.append(score)
    
    return(r2_scores)

Graph Helper

In [23]:
def alpha_r2_graph(alphas, R2s, xlabels, model):
    
    df = pd.DataFrame(data = {'alpha': alphas,
                              'R2': R2s})
    sns.set()

    #Scatter Plot.
    sns.lineplot(data = df,
                 x = 'alpha',
                 y = 'R2',
                 marker = 'o')
    
    #Size.
    plt.gcf().set_size_inches(15, 6.92)
    
    #Axes.
    ax = plt.gca()
    
    #Title setup.
    ax.set_title("Optimizing Hyperparameter for {} Regression".format(model), fontsize = 24)

    #X-axis setup.
    ax.set_xlabel("α", fontsize = 22)
    ax.set_xscale('log')
    ax.set_xticks(xlabels)
    ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
    if model == 'Ridge':
        ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))

    #Y-axis setup.
    ax.set_ylabel("R2", fontsize = 22)
    ylabels = [0, 0.2, 0.4, 0.6, 0.8, 1]
    ax.set_yticks(ylabels)
    
    ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

4. LASSO

All variables will be used as predictors with an $L_1$ regularization.
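
For reference, scikit-learn's Lasso minimizes $\frac{1}{2n}\lVert y - X\beta \rVert_2^2 + \alpha\lVert \beta \rVert_1$, so larger values of $\alpha$ drive more coefficients exactly to $0$.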

In [24]:
alphas = list(pd.core.common.flatten([[a / 2, a, 2 * a] for a in np.geomspace(1e-5, 1e1, 7)]))
xlabels = [a for a in np.geomspace(1e-5, 1e1, 7)]

s = SS()
X_train_3 = s.fit_transform(X_train)

#Determine R2s and graph.
r2s_l = optimize_alpha(alphas, X_train_3, y_train, Lasso, kf)
alpha_r2_graph(alphas, r2s_l, xlabels, 'LASSO')

$α = 0.005$ is optimal.
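
Rather than reading the optimum off the graph, the best $α$ can also be pulled from the computed scores directly. A minimal check using the lists from above (the same applies to the Ridge search below):

#Alpha with the highest cross-validated R2.
best_alpha = alphas[int(np.argmax(r2s_l))]
best_alpha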

In [25]:
X_test_3 = s.transform(X_test)

reg_3 = Lasso(alpha = 0.005).fit(X_train_3, y_train)

5. Ridge

All variables will be used as predictors with an $L_2$ regularization.
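
For reference, scikit-learn's Ridge minimizes $\lVert y - X\beta \rVert_2^2 + \alpha\lVert \beta \rVert_2^2$; coefficients are shrunk toward $0$ but rarely become exactly $0$.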

In [26]:
alphas = list(pd.core.common.flatten([[a / 2, a, 2 * a] for a in np.geomspace(1, 1e6, 7)]))
xlabels = [a for a in np.geomspace(1, 1e6, 7)]

s = SS()
X_train_4 = s.fit_transform(X_train)

#Determine R2s and graph.
r2s_r = optimize_alpha(alphas, X_train_4, y_train, Ridge, kf)
alpha_r2_graph(alphas, r2s_r, xlabels, 'Ridge')

$α = 200$ is optimal.

In [27]:
X_test_4 = s.transform(X_test)

reg_4 = Ridge(alpha = 200).fit(X_train_4, y_train)

IV. Evaluation and Selection

1. Helper Function

In [28]:
#Takes regression models, predictor sets, and y-values; returns a DataFrame of R2, Adjusted R2, RMSE, and MAE.
def summary_df(models, Xs, Y):

    index = ['All Variables', 'My Variables', 'LASSO', 'Ridge']
    R2 = []
    ADJ_R2 = []
    RMSE = []
    MAE = []

    for i in range(4):
        y_pred = models[i].predict(Xs[i])
        
        #R2.
        r2 = r2_score(Y, y_pred)
        R2.append(r2)
        
        #Adj R2.
        adj_r2 = 1.0 - (1.0 - r2) * (len(Y) - 1.0) / (len(Y) - Xs[i].shape[1] - 1.0)
        ADJ_R2.append(adj_r2)
        
        #RMSE.
        rmse = math.sqrt(MSErr(Y, y_pred))
        RMSE.append(rmse)
                         
        #MAE.
        mae = MAErr(Y, y_pred)
        MAE.append(mae)

    df = pd.DataFrame(data = {'R2': R2,
                              'Adjusted R2': ADJ_R2,
                              'RMSE': RMSE,
                              'MAE': MAE},
                      index = index)
    return(df)
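
For reference, the Adjusted R2 computed above is $1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ is the number of predictors; it penalizes models that add predictors without improving fit.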

2. Train Data

In [29]:
models = [reg_1, reg_2, reg_3, reg_4]
X_trains = [X_train_1, X_train_2, X_train_3, X_train_4]

summary_df(models, X_trains, y_train)
Out[29]:
               R2        Adjusted R2  RMSE      MAE
All Variables  0.923746  0.915911     0.115140  0.080549
My Variables   0.857730  0.855258     0.157272  0.106099
LASSO          0.897472  0.886938     0.133510  0.091102
Ridge          0.912745  0.903780     0.123166  0.084782

3. Test Data

In [30]:
X_tests = [X_test_1, X_test_2, X_test_3, X_test_4]

summary_df(models, X_tests, y_test)
Out[30]:
               R2        Adjusted R2  RMSE      MAE
All Variables  0.827449  0.779476     0.159716  0.094183
My Variables   0.861722  0.855981     0.142977  0.100758
LASSO          0.882675  0.850057     0.131700  0.092853
Ridge          0.879854  0.846450     0.133274  0.093467
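
Since the models predict the log-transformed Sale Price, the RMSE and MAE values above are in log-dollar units. As a rough real-world reading, a test MAE of about $0.093$ corresponds to a multiplicative error of $e^{0.093} \approx 1.10$, so typical predictions land within roughly $10\%$ of the actual sale price.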

4. Summary of Metrics

Including All Variables was clearly best on the training data, but it overfit and performed worst on the test data. It will be ignored from here on.

My Variables, chosen simply from a little personal insight into what people look for, did surprisingly well compared to the LASSO and Ridge models: despite the lowest Adjusted R2 on the training data, it posts the highest Adjusted R2 on the test data. A further investigation of each model's predictors follows.

My Variable Predictors

In [31]:
X_train_2.columns
Out[31]:
Index(['1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'Full Bath', 'Half Bath',
       'Overall Cond', 'Overall Qual', 'Year Built', 'Neighborhood_Blueste',
       'Neighborhood_BrDale', 'Neighborhood_BrkSide', 'Neighborhood_ClearCr',
       'Neighborhood_CollgCr', 'Neighborhood_Crawfor', 'Neighborhood_Edwards',
       'Neighborhood_Gilbert', 'Neighborhood_Greens', 'Neighborhood_GrnHill',
       'Neighborhood_IDOTRR', 'Neighborhood_Landmrk', 'Neighborhood_MeadowV',
       'Neighborhood_Mitchel', 'Neighborhood_NAmes', 'Neighborhood_NPkVill',
       'Neighborhood_NWAmes', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SWISU', 'Neighborhood_Sawyer',
       'Neighborhood_SawyerW', 'Neighborhood_Somerst', 'Neighborhood_StoneBr',
       'Neighborhood_Timber', 'Neighborhood_Veenker'],
      dtype='object')

LASSO Predictors

In [32]:
X_train.columns[reg_3.coef_ != 0]
Out[32]:
Index(['1st Flr SF', 'Gr Liv Area', 'Full Bath', 'Kitchen AbvGr', 'Fireplaces',
       'Wood Deck SF', 'Open Porch SF', 'Screen Porch', 'Pool Area',
       'Misc Val', 'Yr Sold', 'Lot Area', 'MS SubClass', 'Overall Qual',
       'Overall Cond', 'Year Built', 'Year Remod/Add', 'Sale Condition_Normal',
       'Sale Condition_Partial', 'Heating QC_Po', 'Heating QC_TA',
       'Central Air_Y', 'Kitchen Qual_Fa', 'Kitchen Qual_TA',
       'Foundation_PConc', 'Foundation_Slab', 'Paved Drive_Y', 'Sale Type_Con',
       'Sale Type_New', 'Sale Type_Oth', 'Functional_Maj2', 'Functional_Sal',
       'Functional_Sev', 'Functional_Typ', 'Exter Cond_Fa', 'Exter Cond_Po',
       'Heating_GasW', 'Heating_Grav', 'Condition 1_Feedr', 'Condition 1_Norm',
       'Condition 1_PosN', 'House Style_2.5Fin', 'House Style_SFoyer',
       'Bldg Type_2fmCon', 'Bldg Type_Twnhs', 'Bldg Type_TwnhsE',
       'Condition 2_PosA', 'Condition 2_RRAe', 'Exter Qual_TA',
       'Neighborhood_BrDale', 'Neighborhood_ClearCr', 'Neighborhood_Crawfor',
       'Neighborhood_Edwards', 'Neighborhood_Gilbert', 'Neighborhood_GrnHill',
       'Neighborhood_MeadowV', 'Neighborhood_NoRidge', 'Neighborhood_NridgHt',
       'Neighborhood_OldTown', 'Neighborhood_SawyerW', 'Neighborhood_Somerst',
       'Neighborhood_StoneBr', 'Neighborhood_Timber', 'Land Slope_Mod',
       'Lot Config_CulDSac', 'Utilities_NoSeWa', 'Utilities_NoSewr',
       'Land Contour_HLS', 'Lot Shape_IR3', 'MS Zoning_C (all)',
       'MS Zoning_RL', 'MS Zoning_RM', 'Roof Style_Hip',
       'Exterior 1st_BrkComm', 'Exterior 1st_BrkFace', 'Exterior 1st_Stucco',
       'Exterior 2nd_CmentBd', 'Exterior 2nd_Stucco'],
      dtype='object')

Ridge Predictors

In [33]:
X_train.columns[reg_4.coef_ != 0]
Out[33]:
Index(['1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area',
       'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr',
       'TotRms AbvGrd', 'Fireplaces',
       ...
       'Exterior 2nd_HdBoard', 'Exterior 2nd_ImStucc', 'Exterior 2nd_MetalSd',
       'Exterior 2nd_Other', 'Exterior 2nd_Plywood', 'Exterior 2nd_Stone',
       'Exterior 2nd_Stucco', 'Exterior 2nd_VinylSd', 'Exterior 2nd_Wd Sdng',
       'Exterior 2nd_Wd Shng'],
      dtype='object', length=188)
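
The number of predictors each regularized model actually kept can be counted directly from the coefficient vectors; a minimal check:

#Count LASSO coefficients that survived and Ridge coefficients driven to 0.
print((reg_3.coef_ != 0).sum())
print((reg_4.coef_ == 0).sum())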

5. Model Selection

LASSO trimmed the model down to $78$ predictors, while Ridge drove only $3$ coefficients to $0$. In that sense, My Variables is by far the most parsimonious model of the three. It is slightly less accurate; however, the variables it uses as predictors are easily explainable:

When people look for a house, they want to know how much space is available in the first and second floors, along with how many bedrooms and bathrooms there are. They would like to know the overall quality and condition, what year it was built in, and what the neighborhood is like.

My Variables concisely explains 86% of the variance in house prices in real-world terms. The regularization models cannot do that.

V. Key Findings, Limitations, and Conclusion

I would like to begin this section by stating that I am by no means advocating going "by gut instinct" and ignoring exploration and feature selection. In this particular instance, my background knowledge was helpful, and manual selection produced a better model than the two regularized models. In practice, that will probably happen far less often.

Using some other variables with my model may have improved R2 and Adjusted R2. For example, Heating QC, Central Air, Fireplaces, and Pool Area are amenities which may hold some predictive power.

I quickly eliminated all columns that had at least one NA value. Some of the dropped columns may have had only a few NAs but high predictive power; a further investigation of these variables could have been useful.

I did not use any polynomial features, which could possibly have improved the model.
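
As a minimal sketch of that untried idea (the choice of columns here is illustrative only, not validated):

from sklearn.preprocessing import PolynomialFeatures

#Degree-2 terms and pairwise interactions for a few numeric predictors.
poly_cols = ['Gr Liv Area', 'Overall Qual', 'Year Built']
poly = PolynomialFeatures(degree = 2, include_bias = False)
X_train_poly = poly.fit_transform(X_train[poly_cols])
X_test_poly = poly.transform(X_test[poly_cols])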

Log transforming the target variable was a powerful tool for normalizing its distribution.

Latitude and longitude, that is, each house's specific location, may prove to be a better indicator of price than Neighborhood.

It is important to stress that this model is specific to Ames, IA, and results would vary greatly across the US. Although this dataset of nearly 3,000 observations is satisfactory for one city, a much larger sample would be necessary to model the US as a whole.