I am using the Ames, Iowa dataset from the class notes. This dataset contains 2930 observations and 82 variables related to house sales in Ames, Iowa. The variables are related to characteristics and measurements of the house, its location, and its purchase.
Refer to the codebook for a complete list of variables and labels.
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter as FF
from matplotlib.ticker import StrMethodFormatter as SMF
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression as LR
from sklearn.feature_selection import f_regression as fr
import statsmodels.api as sm
df = pd.read_csv('data/Ames_Housing_Data.tsv', sep = '\t')
#Set display options.
pd.options.display.max_rows = 100
pd.options.display.max_columns = 100
df.head()
I am going to select a few variables that are noteworthy and compare relationships between them and to Sale Price. The variables are varied between numerical and categorical, so I plan to choose interesting variables that would be of common interest that also contain a minimal number of missing values.
df.dtypes.value_counts()
A count of missing values by variable is shown below.
df_na = df.isna().sum().to_frame().sort_values(by = 0, axis = 0)
df_na.rename(columns={0: 'NA Count'}, inplace=True)
df_na[df_na['NA Count'] == 0].index
I am looking at Sale Price, Lot Area, Number of Bedrooms, Neighborhood, Year Built, and Overall Quality. There are no missing values for this subset.
sns.set()
#Scatter Plot.
exp1 = sns.lmplot(data = df, x = 'Lot Area', y = 'SalePrice', hue = 'Bedroom AbvGr',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Lot Area and Number of Bedrooms', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Lot Area (sq. ft.)", fontsize = 22)
ax.set_xscale('log')
xlabels = [2500, 5000, 10000, 25000, 50000, 100000, 250000]
ax.set_xticks(xlabels)
ax.set_xticklabels(xlabels, rotation = 45, ha = 'right')
ax.get_xaxis().set_major_formatter(FF(lambda x, p: format(int(x), ',')))
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp1._legend.remove()
ax.legend(loc = 'upper left', title = 'Bedrooms', ncol = 2, title_fontsize = 18, fontsize = 14);
For the first exploration, Price is compared to Lot Area and Number of Bedrooms. Lot Area had a considerable right skew, so the x-axis was log scaled.
Number of Bedrooms alone does not indicate a clear relationship with price. However, all regression lines for Number of Bedrooms from $0$ to $6$ show an positive correlation of Lot Area with Price. As noted below, there are only a few houses with $6$ or $8$ bedrooms.
bdrm_cnt = df['Bedroom AbvGr'].value_counts().sort_index().to_frame().rename(columns = {'Bedroom AbvGr': "Count of Houses"})
bdrm_cnt.index.name = "Number of Bedrooms"
bdrm_cnt
#Sort Neighborhoods by median Price in ascending order.
xlabels = df.groupby(['Neighborhood'])['SalePrice'].median().sort_values().index
exp2 = sns.violinplot(data = df, x = 'Neighborhood', y = 'SalePrice', scale = 'width', width = 0.6,
palette = 'viridis_r', order = xlabels)
plt.gcf().set_size_inches(15, 6.92)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Neighborhood', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Neighborhood", fontsize = 22)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
For the second exploration, Price is compared to Neighborhood. Neighborhood is represented by violin plots, sort in ascending order by median Price.
There is much overlap in Price, as there is significant and varying spread of house values within each Neighborhood. Overall there seems to be some correlation that warrants a further investigation.
#Scatter Plot.
exp3 = sns.lmplot(data = df, x = 'Year Built', y = 'SalePrice', hue = 'Overall Qual',
palette = 'viridis_r', ci = None, height = 6, aspect = 13 / 6)
#Axes.
ax = plt.gca()
#Title setup.
ax.set_title('Price vs Year Built and Overall Quality', fontsize = 24)
#X-axis setup.
ax.set_xlabel("Year Built", fontsize = 22)
ax.set_xlim(1870, 2015)
ax.set_xticklabels(ax.get_xticklabels(), rotation = 45, ha = 'right')
#Y-axis setup.
ax.set_ylabel("Price", fontsize = 22)
ax.set_ylim(0, 800000)
ax.yaxis.set_major_formatter(SMF('${x:,.0f}'))
ax.tick_params(axis = 'both', which = 'major', labelsize = 14)
#Legend setup.
exp3._legend.remove()
ax.legend(loc = 'upper left', title = 'Overall House Quality',
labels = ['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average',
'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'],
ncol = 2, title_fontsize = 18, fontsize = 14);
For the third exploration, Price is compared to Year Built and Overall Quality.
Comparing these three together is quite telling. Houses built earlier are of overall lower quality and lower selling price, as seen with the yellow, green, and cyan data points in the bottom of the graph. From 1990 - 2010, there is a sharp increase in quality and selling price, as seen with the blue, indigo, and purple data points on the far right of the graph.
The regression lines for Overall Quality show a positive relationship for Year Built and Sale Price. Furthurmore, there is very little intersection of these lines, they are clearly separated. The exception is 'Very Poor' and 'Poor' at the bottom center of the graph.
Homes located in 'Green Hills', 'Northridge', 'Northridge Heights', and 'Stone Brook' have a greater Sale Price than homes located in other Neighborhoods
An increase in Overall Quality will increase Sale Price.
H0: Overall Quality does not affect Sale Price.
HA: Overall Quality affects Sale Price.
x = df['Overall Qual'].to_numpy().reshape(-1, 1)
y = df['SalePrice'].to_numpy().reshape(-1, )
reg = LR().fit(x, y)
r_sq = reg.score(x, y)
p_val = fr(x, y)[1][0]
print("The R-squared is {}.".format(round(r_sq, 3)))
print("The p_value is {}.".format(p_val))
63.9% of the variance in Sale Price can be explained by Overall Condition. This is significant at the 95% confidence level. H0 is therefore rejected.
Determining other variables with a high level of correlation, doing a multiple regression to explain more variance in Sale Price, and achieving significant results would be the next steps in this analysis.
Although there are some missing values, the vast majority of this data is complete and of good quality.
Latitude and Longitude of houses, that is, there specific locations, may show to be a better indicator of price than Neighborhood.
It is important to stress that this model is specific to Ames, IA, and would vary greatly across the US. Although this data set of nearly 3,000 observations is satisfactory for a city, a much larger sample would be necessary for the US.