I am using the Ames, Iowa dataset from the class notes. This dataset contains 2930 observations and 82 variables related to house sales in Ames, Iowa. The variables are related to characteristics and measurements of the house, its location, and its purchase.
Refer to the codebook for a complete list of variables and labels.
I will fit four models on the same training set to predict the Sale Price of the homes. The models will be evaluated on the same test set, then compared and analyzed, and a final model will be chosen.
import math
import numpy as np
import pandas as pd
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import seaborn as sns
from sklearn.model_selection import cross_val_predict as CVP
from sklearn.model_selection import KFold
from sklearn.linear_model import Lasso
from sklearn.linear_model import LinearRegression as LR
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error as MSErr
from sklearn.metrics import mean_absolute_error as MAErr
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split as TTS
from sklearn.preprocessing import StandardScaler as SS
df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')
#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
I am going to select a few noteworthy variables and compare the relationships among them and with Sale Price. The variables are a mix of numerical and categorical, so I plan to choose variables of common interest that also contain a minimal number of missing values.
df.dtypes.value_counts()
A summary of all variables is shown below.
df.describe().round(2)
A count of missing values by variable is shown below.
df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)
df_na = df_na.rename(columns={0: 'NA Count'})
df_na.T
Because there are plenty of variables without missing values, only complete columns will be used for the analysis.
complete_columns = df_na[df_na['NA Count'] == 0].index.to_list()
df_no_na = df[complete_columns]
df_no_na.shape
sns.set()
#Scatter Plot.
exp1 = sns.lmplot(data = df_no_na, x = 'Gr Liv Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Area and Number of Bedrooms', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Above Ground Living Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [1000, 2000, 3000, 4000, 5000, 6000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);
For the first exploration, Price is compared to Above Ground Living Area and Number of Bedrooms. Area had some right skew, so the x-axis was log scaled.
Bedrooms alone does not show a clear relationship with Price; however, Bedrooms tends to increase with Area. The regression lines for Bedrooms from $0$ to $6$ all show a positive correlation of Area with Price. As noted below, there is only one house with $8$ bedrooms.
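As a quick check of that right skew, the sample skewness of Area can be computed directly (a minimal sketch; the exact value is not shown here).
#Sample skewness of Above Ground Living Area.
df_no_na['Gr Liv Area'].skew()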
bdrm_cnt = df_no_na['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt.T
#Sort Neighborhoods by median Price in ascending order.
xlabels = df_no_na.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index
exp2 = sns.violinplot(data = df_no_na, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
palette = 'viridis_r', order = xlabels)
plt.gcf().set_size_inches(15, 6.92)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
For the second exploration, Price is compared to Neighborhood. Each Neighborhood is represented by a violin plot, sorted in ascending order by median Price.
There is much overlap in Price, as there is significant and varying spread of house values within each Neighborhood. Overall, there appears to be enough of a relationship to warrant further investigation. A frequency table of Neighborhood is shown below.
nbhd_cnt = df_no_na['Neighborhood'].value_counts().sort_index().to_frame().rename(columns = {'Neighborhood': "Count of Houses"})
nbhd_cnt.index.name = "Neighborhood"
nbhd_cnt.T
#Scatter Plot.
exp3 = sns.lmplot(data = df_no_na, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.tick_params(axis = 'x', which = 'major', labelrotation = 45)
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
ncol = 2, title_fontsize = 18, fontsize = 14);
For the third exploration, Price is compared to Year Built and Overall Quality.
Comparing these three together is quite telling. Houses built earlier are of lower overall quality and lower selling price, as seen with the yellow, green, and cyan data points at the bottom of the graph. From 1990 to 2010, there is a sharp increase in quality and selling price, as seen with the blue, indigo, and purple data points on the far right of the graph.
The regression lines for Overall Quality show a positive relationship between Year Built and Sale Price. Furthermore, there is very little intersection of these lines; they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.
A histogram of Sale Price is shown below.
df_no_na['SalePrice'].hist(bins = 8);
Sale Price is log transformed to have a more normal distribution.
np.log1p(df_no_na['SalePrice']).hist();
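To quantify the change, the skewness of Sale Price before and after the transformation can be compared (a minimal sketch; the values are not shown here).
#Skewness of Sale Price before and after the log transformation.
df_no_na['SalePrice'].skew(), np.log1p(df_no_na['SalePrice']).skew()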
Sale Price is set as the index.
In addition, Order and PID are dropped, as they are ID variables.
df_final = df_no_na.drop(['Order', 'PID'], axis = 1)
df_final['SalePrice'] = np.log1p(df_no_na['SalePrice'])
df_final.set_index('SalePrice', inplace = True)
Categorical variables are encoded.
#Code from class notebook.
one_hot_encode_cols = df_final.dtypes[df_final.dtypes == object]
one_hot_encode_cols = one_hot_encode_cols.index.tolist()
df_final = pd.get_dummies(df_final, columns = one_hot_encode_cols, drop_first=True)
df_final.shape
Check for skewed distributions.
#Code from class notebook.
float_cols = df_final.columns[df_final.dtypes == float]
skew_limit = 0.75
skew_vals = df_final[float_cols].skew()
skew_cols = (skew_vals
.sort_values(ascending = False)
.to_frame()
.rename(columns = {0: 'Skew'})
.query('abs(Skew) > {}'.format(skew_limit)))
skew_cols
df_final.head()
Data is ready for modeling.
X = df_final.reset_index().drop('SalePrice', axis = 1)
y = df_final.index
X_train, X_test, y_train, y_test = TTS(X, y, test_size = 0.3, random_state = 13131)
X_train_1 = X_train.copy()
X_test_1 = X_test.copy()
reg_1 = LR().fit(X_train_1, y_train)
y_pred = reg_1.predict(X_test_1)
Based on common conversations I've heard and read about purchasing a home in the past 30 years, I chose the following variables as predictors of Sale Price.
I think these will give a good snapshot and I am curious to see if they show up as final predictors, as well as how the overall manual model compares with the calculated models.
#Numerical predictors.
predictors2 = ['1st Flr SF', '2nd Flr SF', 'Bedroom AbvGr', 'Full Bath',
'Half Bath', 'Overall Cond', 'Overall Qual', 'Year Built']
#Neighborhood has been one hot encoded.
neighborhoods = [v for v in df_final.columns.to_list() if v.startswith('Neighborhood')]
X_train_2 = X_train[predictors2 + neighborhoods]
X_test_2 = X_test[predictors2 + neighborhoods]
reg_2 = LR().fit(X_train_2, y_train)
kf = KFold(shuffle = True, random_state = 13131, n_splits = 3)
#Retrieve R2 scores for different alpha for LASSO or Ridge.
def optimize_alpha(alphas, x, y, model, kf):
    #Scale and transform x.
    s = SS()
    x = s.fit_transform(x)
    #List of R2.
    r2_scores = []
    for alpha in alphas:
        reg = model(alpha = alpha, max_iter = 500000)
        y_pred = CVP(reg, x, y, cv = kf)
        score = r2_score(y, y_pred)
        r2_scores.append(score)
    return(r2_scores)
def alpha_r2_graph(alphas, R2s, xlabels, model):
    df = pd.DataFrame(data = {'alpha': alphas,
                              'R2': R2s})
    sns.set()
    #Line Plot.
    sns.lineplot(data = df,
                 x = 'alpha',
                 y = 'R2',
                 marker = 'o')
    #Size.
    plt.gcf().set_size_inches(15, 6.92)
    paper_rc = {'lines.linewidth': 2, 'lines.markersize': 6}
    #Axes.
    ax = plt.gca()
    #Title setup.
    ax.set_title("Optimizing Hyperparameter for {} Regression".format(model), fontsize = 24)
    #X-axis setup.
    ax.set_xlabel("α", fontsize = 22)
    ax.set_xscale('log')
    ax.set_xticks(xlabels)
    ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
    if model == 'Ridge':
        ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))
    #Y-axis setup.
    ax.set_ylabel("R2", fontsize = 22)
    ylabels = [0, 0.2, 0.4, 0.6, 0.8, 1]
    ax.set_yticks(ylabels)
    ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
All variables will be used as predictors with an $L_1$ regularization.
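For reference, scikit-learn's Lasso minimizes $\frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1$, where $n$ is the number of training samples, so a larger $\alpha$ drives more coefficients exactly to $0$.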
alphas = [m * a for a in np.geomspace(1e-5, 1e1, 7) for m in (0.5, 1, 2)]
xlabels = list(np.geomspace(1e-5, 1e1, 7))
s = SS()
X_train_3 = s.fit_transform(X_train)
#Determine R2s and graph.
r2s_l = optimize_alpha(alphas, X_train_3, y_train, Lasso, kf)
alpha_r2_graph(alphas, r2s_l, xlabels, 'LASSO')
$α = 0.005$ is optimal.
X_test_3 = s.transform(X_test)
reg_3 = Lasso(alpha = 0.005).fit(X_train_3, y_train)
All variables will be used as predictors with an $L_2$ regularization.
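For reference, scikit-learn's Ridge minimizes $\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2$; coefficients are shrunk toward $0$ but rarely become exactly $0$, which is why a much larger range of $\alpha$ is searched here.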
alphas = [m * a for a in np.geomspace(1, 1e6, 7) for m in (0.5, 1, 2)]
xlabels = list(np.geomspace(1, 1e6, 7))
s = SS()
X_train_4 = s.fit_transform(X_train)
#Determine R2s and graph.
r2s_r = optimize_alpha(alphas, X_train_4, y_train, Ridge, kf)
alpha_r2_graph(alphas, r2s_r, xlabels, 'Ridge')
$α = 200$ is optimal.
X_test_4 = s.transform(X_test)
reg_4 = Ridge(alpha = 200).fit(X_train_4, y_train)
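The four models are compared on $R^2$, Adjusted $R^2$, RMSE, and MAE, all computed on the log-price scale. Adjusted $R^2$ penalizes model size: with $n$ observations and $p$ predictors, $\text{Adjusted } R^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, which is the formula used in the function below.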
#Inputs regression models, predictors, and y-values, outputs DataFrame of R2, Adjusted R2, RMSE, and MAE.
def summary_df(models, Xs, Y):
    index = ['All Variables', 'My Variables', 'LASSO', 'Ridge']
    R2 = []
    ADJ_R2 = []
    RMSE = []
    MAE = []
    for i in range(4):
        y_pred = models[i].predict(Xs[i])
        #R2.
        r2 = r2_score(Y, y_pred)
        R2.append(r2)
        #Adj R2.
        adj_r2 = 1.0 - (1.0 - r2) * (len(Y) - 1.0) / (len(Y) - Xs[i].shape[1] - 1.0)
        ADJ_R2.append(adj_r2)
        #RMSE.
        rmse = math.sqrt(MSErr(Y, y_pred))
        RMSE.append(rmse)
        #MAE.
        mae = MAErr(Y, y_pred)
        MAE.append(mae)
    df = pd.DataFrame(data = {'R2': R2,
                              'Adjusted R2': ADJ_R2,
                              'RMSE': RMSE,
                              'MAE': MAE},
                      index = index)
    return(df)
models = [reg_1, reg_2, reg_3, reg_4]
X_trains = [X_train_1, X_train_2, X_train_3, X_train_4]
summary_df(models, X_trains, y_train)
X_tests = [X_test_1, X_test_2, X_test_3, X_test_4]
summary_df(models, X_tests, y_test)
Including All Variables was clearly best on the training data, but it clearly overfit and was the worst on the test data. It will be ignored from here on.
My Variables, chosen simply from a little personal insight into what people look for, did surprisingly well compared to the LASSO and Ridge models. My Variables has the lowest Adjusted R2 score, so a further investigation of the predictors is necessary.
X_train_2.columns
X_train.columns[reg_3.coef_ != 0]
X_train.columns[reg_4.coef_ != 0]
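A quick count of the surviving coefficients in each regularized model (a minimal sketch using the fitted models above).
#Count the nonzero LASSO coefficients and the Ridge coefficients that are exactly zero.
(reg_3.coef_ != 0).sum(), (reg_4.coef_ == 0).sum()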
LASSO trimmed the model down to $78$ predictors. Ridge only had $3$ coefficients go to $0$. In that sense, My Variables is very much the parsimonious model compared to the other two. It is slightly less accurate; however, the variables used as predictors are easily explainable:
When people look for a house, they want to know how much space is available on the first and second floors, along with how many bedrooms and bathrooms there are. They would like to know the overall quality and condition, what year it was built, and what the neighborhood is like.
My Variables concisely sums up 86% of the variance of the house prices in real-world terms. The regularization models cannot do that.
I would like to begin this section by stating that I am by no means advocating going "by gut instinct" and ignoring exploration and feature selection. In this particular instance, my background knowledge was helpful, and manual selection produced a better model than the two I evaluated with machine learning. In practice, that will probably happen much less often.
Using some other variables with my model may have improved R2 and Adjusted R2. For example, Heating QC, Central Air, Fireplaces, and Pool Area are amenities which may hold some predictive power.
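As a sketch of that idea (a hypothetical extension, not one of the models compared above; it assumes these columns survived the complete-columns filter and that the categorical ones were one hot encoded earlier):
#Hypothetical extension of My Variables with amenity predictors (sketch only).
amenities = ['Fireplaces', 'Pool Area']
amenity_dummies = [c for c in df_final.columns if c.startswith(('Central Air', 'Heating QC'))]
X_train_2b = X_train[predictors2 + neighborhoods + amenities + amenity_dummies]
X_test_2b = X_test[predictors2 + neighborhoods + amenities + amenity_dummies]
reg_2b = LR().fit(X_train_2b, y_train)
r2_score(y_test, reg_2b.predict(X_test_2b))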
I quickly eliminated all columns that had at least one NA value. It is possible that I removed a few columns with a low number of NAs and high predictive power. A further investigation of those variables could have been useful.
I did not use any polynomial features, which could possibly improve the model.
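A minimal sketch of that idea, restricted to the numeric manual predictors so the feature count stays manageable (degree $2$ is an arbitrary choice):
from sklearn.preprocessing import PolynomialFeatures
#Sketch: add squared and interaction terms for the manual numeric predictors only.
pf = PolynomialFeatures(degree = 2, include_bias = False)
X_train_poly = pf.fit_transform(X_train[predictors2])
X_test_poly = pf.transform(X_test[predictors2])
reg_poly = LR().fit(X_train_poly, y_train)
r2_score(y_test, reg_poly.predict(X_test_poly))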
Log transforming was a powerful tool for the target variable.
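One practical note on that point: because the target is $\log(1 + \text{SalePrice})$, predictions must be passed through np.expm1 to be read in dollars (a minimal sketch using the My Variables model).
#Back-transform log1p predictions to dollar prices and report MAE in dollars.
price_pred = np.expm1(reg_2.predict(X_test_2))
price_true = np.expm1(np.asarray(y_test))
MAErr(price_true, price_pred)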
Latitude and Longitude of the houses, that is, their specific locations, may prove to be a better indicator of Price than Neighborhood.
It is important to stress that this model is specific to Ames, IA, and would vary greatly across the US. Although this data set of nearly 3,000 observations is satisfactory for a city, a much larger sample would be necessary for the US.