Final for Exploratory Data Analysis for Machine Learning¶

Exploratory Data Analysis with Housing in Ames, IA¶

Rohan Lewis¶

2020.11.30¶

I. Data Set Description¶

I am using the Ames, Iowa dataset from the class notes. This dataset contains 2930 observations and 82 variables related to house sales in Ames, Iowa. The variables are related to characteristics and measurements of the house, its location, and its purchase.

Refer to the codebook for a complete list of variables and labels.

1. Packages¶

from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression as LR
from sklearn.feature_selection import f_regression as fr
import statsmodels.api as sm

2. Load Data¶

df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')

#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100

df.head()

II. Data Exploration Plan¶

I am going to select a few variables that are noteworthy and compare relationships between them and to Sale Price. The variables are varied between numerical and categorical, so I plan to choose interesting variables that would be of common interest that also contain a minimal number of missing values.

1. Data types¶

df.dtypes.value_counts()

object     43
int64      28
float64    11
dtype: int64

2. Missing values¶

A count of missing values by variable is shown below.

df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)

df_na.rename(columns={0: 'NA Count'}, inplace=True)

df_na[df_na['NA Count'] == 0].index

Index(['Order', 'Sale Condition', 'Heating QC', 'Central Air', '1st Flr SF',
       '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Full Bath',
       'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual',
       'Foundation', 'TotRms AbvGrd', 'Fireplaces', 'Paved Drive',
       'Wood Deck SF', 'Open Porch SF', 'Enclosed Porch', '3Ssn Porch',
       'Screen Porch', 'Pool Area', 'Misc Val', 'Mo Sold', 'Yr Sold',
       'Sale Type', 'Functional', 'Exter Cond', 'Heating', 'Condition 1',
       'House Style', 'Bldg Type', 'Condition 2', 'Exter Qual', 'Neighborhood',
       'Land Slope', 'Lot Config', 'Utilities', 'Land Contour', 'Lot Shape',
       'Street', 'Lot Area', 'MS Zoning', 'MS SubClass', 'PID', 'Overall Qual',
       'Overall Cond', 'SalePrice', 'Year Built', 'Year Remod/Add',
       'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd'],
      dtype='object')

III. Data Cleaning¶

I am looking at Sale Price, Lot Area, Number of Bedrooms, Neighborhood, Year Built, and Overall Quality. There are no missing values for this subset.

IV. Key Findings¶

1. Exploration 1¶

sns.set()

#Scatter Plot.
exp1 = sns.lmplot(data = df, x = 'Lot Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Lot Area and Number of Bedrooms', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Lot Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [2500, 5000, 10000, 25000, 50000, 100000, 250000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);

For the first exploration, Price is compared to Lot Area and Number of Bedrooms. Lot Area had a considerable right skew, so the x-axis was log scaled.

Number of Bedrooms alone does not indicate a clear relationship with price. However, all regression lines for Number of Bedrooms from $0$ to $6$ show an positive correlation of Lot Area with Price. As noted below, there are only a few houses with $6$ or $8$ bedrooms.

bdrm_cnt = df['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt

2. Exploration 2¶

#Sort Neighborhoods by median Price in ascending order.
xlabels = df.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index

exp2 = sns.violinplot(data = df, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
                      palette = 'viridis_r', order = xlabels)

plt.gcf().set_size_inches(15, 6.92)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

For the second exploration, Price is compared to Neighborhood. Neighborhood is represented by violin plots, sort in ascending order by median Price.

There is much overlap in Price, as there is significant and varying spread of house values within each Neighborhood. Overall there seems to be some correlation that warrants a further investigation.

3. Exploration 3¶

#Scatter Plot.
exp3 = sns.lmplot(data = df, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
                  palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)

#Axes.
ax = plt.gca()

#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)

#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')

#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))

ax.tick_params(axis = 'both', which = 'major', labelsize = 14)

#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
          labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
                    'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
          ncol = 2, title_fontsize = 18, fontsize = 14);

For the third exploration, Price is compared to Year Built and Overall Quality.

Comparing these three together is quite telling. Houses built earlier are of overall lower quality and lower selling price, as seen with the yellow, green, and cyan data points in the bottom of the graph. From 1990 - 2010, there is a sharp increase in quality and selling price, as seen with the blue, indigo, and purple data points on the far right of the graph.

The regression lines for Overall Quality show a positive relationship for Year Built and Sale Price. Furthurmore, there is very little intersection of these lines, they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.

V. Hypotheses¶

1. Hypothesis 1¶

An increase in Lot Area will increase Sale Price.

2. Hypothesis 2¶

Homes located in 'Green Hills', 'Northridge', 'Northridge Heights', and 'Stone Brook' have a greater Sale Price than homes located in other Neighborhoods

3. Hypothesis 3¶

An increase in Overall Quality will increase Sale Price.

VI. Signifiance Test¶

H₀: Overall Quality does not affect Sale Price.

H_A: Overall Quality affects Sale Price.

x = df['Overall Qual'].to_numpy().reshape(-1, 1)
y = df['SalePrice'].to_numpy().reshape(-1, )

reg = LR().fit(x, y)

r_sq = reg.score(x, y)
p_val = fr(x, y)[1][0]

print("The R-squared is {}.".format(round(r_sq, 3)))
print("The p_value is {}.".format(p_val))

The R-squared is 0.639.
The p_value is 0.0.

63.9% of the variance in Sale Price can be explained by Overall Condition. This is significant at the 95% confidence level. H₀ is therefore rejected.

VII. Future Analysis¶

Determining other variables with a high level of correlation, doing a multiple regression to explain more variance in Sale Price, and achieving significant results would be the next steps in this analysis.

VIII. Data Summary¶

Although there are some missing values, the vast majority of this data is complete and of good quality.

Latitude and Longitude of houses, that is, there specific locations, may show to be a better indicator of price than Neighborhood.

It is important to stress that this model is specific to Ames, IA, and would vary greatly across the US. Although this data set of nearly 3,000 observations is satisfactory for a city, a much larger sample would be necessary for the US.

	Order	PID	MS SubClass	MS Zoning	Lot Frontage	Lot Area	Street	Alley	Lot Shape	Land Contour	Utilities	Lot Config	Land Slope	Neighborhood	Condition 1	Condition 2	Bldg Type	House Style	Overall Qual	Overall Cond	Year Built	Year Remod/Add	Roof Style	Roof Matl	Exterior 1st	Exterior 2nd	Mas Vnr Type	Mas Vnr Area	Exter Qual	Exter Cond	Foundation	Bsmt Qual	Bsmt Cond	Bsmt Exposure	BsmtFin Type 1	BsmtFin SF 1	BsmtFin Type 2	BsmtFin SF 2	Bsmt Unf SF	Total Bsmt SF	Heating	Heating QC	Central Air	Electrical	1st Flr SF	2nd Flr SF	Gr Liv Area	Bsmt Full Bath	Full Bath	Half Bath	Bedroom AbvGr	Kitchen AbvGr	Kitchen Qual	TotRms AbvGrd	Functional	Fireplaces	Fireplace Qu	Garage Type	Garage Yr Blt	Garage Finish	Garage Cars	Garage Area	Garage Qual	Garage Cond	Paved Drive	Wood Deck SF	Open Porch SF	Screen Porch	Pool QC	Fence	Misc Feature	Misc Val	Mo Sold	Yr Sold	Sale Type	Sale Condition	SalePrice
0	1	526301100	20	RL	141.0	31770	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	5	1960	1960	Hip	CompShg	BrkFace	Plywood	Stone	112.0	TA	TA	CBlock	TA	Gd	Gd	BLQ	639.0	Unf	0.0	441.0	1080.0	GasA	Fa	Y	SBrkr	1656	0	1656	1.0	1	0	3	1	TA	7	Typ	2	Gd	Attchd	1960.0	Fin	2.0	528.0	TA	TA	P	210	62	0	NaN	NaN	NaN	0	5	2010	WD	Normal	215000
1	2	526350040	20	RH	80.0	11622	Pave	NaN	Reg	Lvl	AllPub	Inside	Gtl	NAmes	Feedr	Norm	1Fam	1Story	5	6	1961	1961	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	CBlock	TA	TA	No	Rec	468.0	LwQ	144.0	270.0	882.0	GasA	TA	Y	SBrkr	896	0	896	0.0	1	0	2	1	TA	5	Typ	0	NaN	Attchd	1961.0	Unf	1.0	730.0	TA	TA	Y	140	0	120	NaN	MnPrv	NaN	0	6	2010	WD	Normal	105000
2	3	526351010	20	RL	81.0	14267	Pave	NaN	IR1	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	6	6	1958	1958	Hip	CompShg	Wd Sdng	Wd Sdng	BrkFace	108.0	TA	TA	CBlock	TA	TA	No	ALQ	923.0	Unf	0.0	406.0	1329.0	GasA	TA	Y	SBrkr	1329	0	1329	0.0	1	1	3	1	Gd	6	Typ	0	NaN	Attchd	1958.0	Unf	1.0	312.0	TA	TA	Y	393	36	0	NaN	NaN	Gar2	12500	6	2010	WD	Normal	172000
3	4	526353030	20	RL	93.0	11160	Pave	NaN	Reg	Lvl	AllPub	Corner	Gtl	NAmes	Norm	Norm	1Fam	1Story	7	5	1968	1968	Hip	CompShg	BrkFace	BrkFace	None	0.0	Gd	TA	CBlock	TA	TA	No	ALQ	1065.0	Unf	0.0	1045.0	2110.0	GasA	Ex	Y	SBrkr	2110	0	2110	1.0	2	1	3	1	Ex	8	Typ	2	TA	Attchd	1968.0	Fin	2.0	522.0	TA	TA	Y	0	0	0	NaN	NaN	NaN	0	4	2010	WD	Normal	244000
4	5	527105010	60	RL	74.0	13830	Pave	NaN	IR1	Lvl	AllPub	Inside	Gtl	Gilbert	Norm	Norm	1Fam	2Story	5	5	1997	1998	Gable	CompShg	VinylSd	VinylSd	None	0.0	TA	TA	PConc	Gd	TA	No	GLQ	791.0	Unf	0.0	137.0	928.0	GasA	Gd	Y	SBrkr	928	701	1629	0.0	2	1	3	1	TA	6	Typ	1	TA	Attchd	1997.0	Fin	2.0	482.0	TA	TA	Y	212	34	0	NaN	MnPrv	NaN	0	3	2010	WD	Normal	189900

	Count of Houses
Number of Bedrooms
0	8
1	112
2	743
3	1597
4	400
5	48
6	21
8	1