Final for Exploratory Data Analysis for Machine Learning

Exploratory Data Analysis with Housing in Ames, IA

Rohan Lewis

2020.11.30

I. Data Set Description

I am using the Ames, Iowa dataset from the class notes. This dataset contains 2930 observations and 82 variables related to house sales in Ames, Iowa. The variables are related to characteristics and measurements of the house, its location, and its purchase.

Refer to the codebook for a complete list of variables and labels.

1. Packages

In [1]:
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression as LR
from sklearn.feature_selection import f_regression as fr
import statsmodels.api as sm

2. Load Data

In [2]:
df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')

#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

df.head()
Out[2]:
Order PID MS SubClass MS Zoning Lot Frontage Lot Area Street Alley Lot Shape Land Contour Utilities Lot Config Land Slope Neighborhood Condition 1 Condition 2 Bldg Type House Style Overall Qual Overall Cond Year Built Year Remod/Add Roof Style Roof Matl Exterior 1st Exterior 2nd Mas Vnr Type Mas Vnr Area Exter Qual Exter Cond Foundation Bsmt Qual Bsmt Cond Bsmt Exposure BsmtFin Type 1 BsmtFin SF 1 BsmtFin Type 2 BsmtFin SF 2 Bsmt Unf SF Total Bsmt SF Heating Heating QC Central Air Electrical 1st Flr SF 2nd Flr SF Low Qual Fin SF Gr Liv Area Bsmt Full Bath Bsmt Half Bath Full Bath Half Bath Bedroom AbvGr Kitchen AbvGr Kitchen Qual TotRms AbvGrd Functional Fireplaces Fireplace Qu Garage Type Garage Yr Blt Garage Finish Garage Cars Garage Area Garage Qual Garage Cond Paved Drive Wood Deck SF Open Porch SF Enclosed Porch 3Ssn Porch Screen Porch Pool Area Pool QC Fence Misc Feature Misc Val Mo Sold Yr Sold Sale Type Sale Condition SalePrice
0 1 526301100 20 RL 141.0 31770 Pave NaN IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 5 1960 1960 Hip CompShg BrkFace Plywood Stone 112.0 TA TA CBlock TA Gd Gd BLQ 639.0 Unf 0.0 441.0 1080.0 GasA Fa Y SBrkr 1656 0 0 1656 1.0 0.0 1 0 3 1 TA 7 Typ 2 Gd Attchd 1960.0 Fin 2.0 528.0 TA TA P 210 62 0 0 0 0 NaN NaN NaN 0 5 2010 WD Normal 215000
1 2 526350040 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub Inside Gtl NAmes Feedr Norm 1Fam 1Story 5 6 1961 1961 Gable CompShg VinylSd VinylSd None 0.0 TA TA CBlock TA TA No Rec 468.0 LwQ 144.0 270.0 882.0 GasA TA Y SBrkr 896 0 0 896 0.0 0.0 1 0 2 1 TA 5 Typ 0 NaN Attchd 1961.0 Unf 1.0 730.0 TA TA Y 140 0 0 0 120 0 NaN MnPrv NaN 0 6 2010 WD Normal 105000
2 3 526351010 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 6 6 1958 1958 Hip CompShg Wd Sdng Wd Sdng BrkFace 108.0 TA TA CBlock TA TA No ALQ 923.0 Unf 0.0 406.0 1329.0 GasA TA Y SBrkr 1329 0 0 1329 0.0 0.0 1 1 3 1 Gd 6 Typ 0 NaN Attchd 1958.0 Unf 1.0 312.0 TA TA Y 393 36 0 0 0 0 NaN NaN Gar2 12500 6 2010 WD Normal 172000
3 4 526353030 20 RL 93.0 11160 Pave NaN Reg Lvl AllPub Corner Gtl NAmes Norm Norm 1Fam 1Story 7 5 1968 1968 Hip CompShg BrkFace BrkFace None 0.0 Gd TA CBlock TA TA No ALQ 1065.0 Unf 0.0 1045.0 2110.0 GasA Ex Y SBrkr 2110 0 0 2110 1.0 0.0 2 1 3 1 Ex 8 Typ 2 TA Attchd 1968.0 Fin 2.0 522.0 TA TA Y 0 0 0 0 0 0 NaN NaN NaN 0 4 2010 WD Normal 244000
4 5 527105010 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub Inside Gtl Gilbert Norm Norm 1Fam 2Story 5 5 1997 1998 Gable CompShg VinylSd VinylSd None 0.0 TA TA PConc Gd TA No GLQ 791.0 Unf 0.0 137.0 928.0 GasA Gd Y SBrkr 928 701 0 1629 0.0 0.0 2 1 3 1 TA 6 Typ 1 TA Attchd 1997.0 Fin 2.0 482.0 TA TA Y 212 34 0 0 0 0 NaN MnPrv NaN 0 3 2010 WD Normal 189900

II. Data Exploration Plan

I am going to select a few variables that are noteworthy and compare relationships between them and to Sale Price. The variables are varied between numerical and categorical, so I plan to choose interesting variables that would be of common interest that also contain a minimal number of missing values.

1. Data types

In [3]:
df.dtypes.value_counts()
Out[3]:
object     43
int64      28
float64    11
dtype: int64

2. Missing values

A count of missing values by variable is shown below.

In [4]:
df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)

df_na.rename(columns={0: 'NA Count'}, inplace=True)

df_na[df_na['NA Count'] == 0].index
Out[4]:
Index(['Order', 'Sale Condition', 'Heating QC', 'Central Air', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Full Bath',
       'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual',
       'Foundation', 'TotRms AbvGrd', 'Fireplaces', 'Paved Drive',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
       'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold',
       'Sale Type', 'Functional', 'Exter Cond', 'Heating', 'Condition 1',
       'House Style', 'Bldg Type', 'Condition 2', 'Exter Qual', 'Neighborhood',
       'Land Slope', 'Lot Config', 'Utilities', 'Land Contour', 'Lot Shape',
       'Street', 'Lot Area', 'MS Zoning', 'MS SubClass', 'PID', 'Overall Qual',
       'Overall Cond', 'SalePrice', 'Year Built', 'Year Remod/Add',
       'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd'],
      dtype='object')

III. Data Cleaning

I am looking at Sale Price, Lot Area, Number of Bedrooms, Neighborhood, Year Built, and Overall Quality. There are no missing values for this subset.

IV. Key Findings

1. Exploration 1

In [5]:
sns.set()

#Scatter Plot.
exp1 = sns.lmplot(data = df, x = 'Lot Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Lot Area and Number of Bedrooms', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Lot Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [2500, 5000, 10000, 25000, 50000, 100000, 250000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);

For the first exploration, Price is compared to Lot Area and Number of Bedrooms. Lot Area had a considerable right skew, so the x-axis was log scaled.

Number of Bedrooms alone does not indicate a clear relationship with price. However, all regression lines for Number of Bedrooms from $0$ to $6$ show an positive correlation of Lot Area with Price. As noted below, there are only a few houses with $6$ or $8$ bedrooms.

In [6]:
bdrm_cnt = df['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt
Out[6]:
Count of Houses
Number of Bedrooms
0 8
1 112
2 743
3 1597
4 400
5 48
6 21
8 1

2. Exploration 2

In [7]:
#Sort Neighborhoods by median Price in ascending order.
xlabels = df.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index

exp2 = sns.violinplot(data = df, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
                      palette = 'viridis_r', order = xlabels)

plt.gcf().set_size_inches(15, 6.92)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

For the second exploration, Price is compared to Neighborhood. Neighborhood is represented by violin plots, sort in ascending order by median Price.

There is much overlap in Price, as there is significant and varying spread of house values within each Neighborhood. Overall there seems to be some correlation that warrants a further investigation.

3. Exploration 3

In [8]:
#Scatter Plot.
exp3 = sns.lmplot(data = df, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
          labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
                    'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
          ncol = 2, title_fontsize = 18, fontsize = 14);

For the third exploration, Price is compared to Year Built and Overall Quality.

Comparing these three together is quite telling. Houses built earlier are of overall lower quality and lower selling price, as seen with the yellow, green, and cyan data points in the bottom of the graph. From 1990 - 2010, there is a sharp increase in quality and selling price, as seen with the blue, indigo, and purple data points on the far right of the graph.

The regression lines for Overall Quality show a positive relationship for Year Built and Sale Price. Furthurmore, there is very little intersection of these lines, they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.

V. Hypotheses

1. Hypothesis 1

An increase in Lot Area will increase Sale Price.

2. Hypothesis 2

Homes located in 'Green Hills', 'Northridge', 'Northridge Heights', and 'Stone Brook' have a greater Sale Price than homes located in other Neighborhoods

3. Hypothesis 3

An increase in Overall Quality will increase Sale Price.

VI. Signifiance Test

H0: Overall Quality does not affect Sale Price.

HA: Overall Quality affects Sale Price.

In [9]:
x = df['Overall Qual'].to_numpy().reshape(-1, 1)
y = df['SalePrice'].to_numpy().reshape(-1, )

reg = LR().fit(x, y)

r_sq = reg.score(x, y)
p_val = fr(x, y)[1][0]

print("The R-squared is {}.".format(round(r_sq, 3)))
print("The p_value is {}.".format(p_val))
The R-squared is 0.639.
The p_value is 0.0.

63.9% of the variance in Sale Price can be explained by Overall Condition. This is significant at the 95% confidence level. H0 is therefore rejected.

VII. Future Analysis

Determining other variables with a high level of correlation, doing a multiple regression to explain more variance in Sale Price, and achieving significant results would be the next steps in this analysis.

VIII. Data Summary

Although there are some missing values, the vast majority of this data is complete and of good quality.

Latitude and Longitude of houses, that is, there specific locations, may show to be a better indicator of price than Neighborhood.

It is important to stress that this model is specific to Ames, IA, and would vary greatly across the US. Although this data set of nearly 3,000 observations is satisfactory for a city, a much larger sample would be necessary for the US.