Refer to the codebook for the Ames, Iowa dataset for more information.
See the Appendix for code.
Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.
The above histogram displays the spread of the ages for 1000 homes. Notice that the data is skewed right, and approximately half the homes are younger than 45 years and half are older. The mean age of the houses is 48 years. Over 25% of the homes are less than 20 years old.
The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.
Table with Summary | ||||||
House Price Statistics | ||||||
---|---|---|---|---|---|---|
Count | Minimum | Median | Mean | Maximum | Std. dev. | |
Neighborhood | ||||||
Blmngtn | 11 | 172500 | 191000.0 | 198634.5 | 246990 | 26454.9 |
Blueste | 3 | 116500 | 123900.0 | 125800.0 | 137000 | 10381.2 |
BrDale | 10 | 83000 | 98750.0 | 98930.0 | 125500 | 13337.6 |
BrkSide | 41 | 39300 | 124000.0 | 122472.6 | 207000 | 37309.9 |
ClearCr | 13 | 107500 | 185000.0 | 193153.8 | 277000 | 48068.7 |
CollgCr | 85 | 110000 | 195800.0 | 196950.9 | 475000 | 52786.1 |
Crawfor | 29 | 96500 | 205000.0 | 204196.6 | 392500 | 71267.6 |
Edwards | 60 | 61500 | 127250.0 | 136322.0 | 415000 | 54851.6 |
Gilbert | 49 | 133000 | 183500.0 | 193328.0 | 377500 | 41190.4 |
Greens | 4 | 155000 | 212625.0 | 198562.5 | 214000 | 29063.4 |
GrnHill | 2 | 230000 | 280000.0 | 280000.0 | 330000 | 70710.7 |
IDOTRR | 35 | 34900 | 99500.0 | 97620.7 | 150909 | 31530.4 |
Landmrk | ||||||
MeadowV | 16 | 73000 | 85750.0 | 92946.9 | 129500 | 18939.8 |
Mitchel | 44 | 93500 | 156500.0 | 165013.6 | 285000 | 39682.9 |
NAmes | 155 | 76500 | 139900.0 | 141356.0 | 277500 | 27268.0 |
NoRidge | 28 | 235000 | 290000.0 | 295844.6 | 405000 | 35889.0 |
NPkVill | 4 | 123000 | 142100.0 | 139275.0 | 149900 | 11958.4 |
NridgHt | 57 | 173000 | 336860.0 | 333646.7 | 615000 | 105088.9 |
NWAmes | 41 | 82500 | 185000.0 | 194093.9 | 278000 | 41340.5 |
OldTown | 71 | 12789 | 120000.0 | 120225.6 | 265979 | 36429.7 |
Sawyer | 61 | 63900 | 136000.0 | 139312.7 | 219000 | 21216.2 |
SawyerW | 46 | 67500 | 182500.0 | 183101.0 | 320000 | 48354.4 |
Somerst | 74 | 139000 | 221650.0 | 234595.9 | 468000 | 65199.5 |
StoneBr | 20 | 180000 | 340691.5 | 339316.0 | 591587 | 123459.1 |
SWISU | 12 | 80000 | 134000.0 | 130619.5 | 169000 | 27375.8 |
Timber | 19 | 175000 | 232500.0 | 265192.2 | 425000 | 84029.6 |
Veenker | 10 | 150000 | 205750.0 | 233650.0 | 385000 | 72545.4 |
The above box and violin plot shows the house prices. They are sorted in ascending order by median price of their neighborhood.
First, note that the number of homes in each neighborhood varies from 2 to 155.
Next, neither the most expensive nor least expensive house for each neighborhood is uniformly increasing as the medians are. It is also clear that the spread of prices in each neighborhood varies greatly.
The mean and median prices for each neighborhood tend to be quite similar.
Some general trends and labels can be made, however, there is no clear most expensive or least expensive neighborhood, as there is no neighborhood which has the highest or lowest prices for the metrics of minimum, median, mean, and maximum. The number of houses and standard deviation also makes the qualification unclear.
StoneBr and NridgHt have the largest spreads in terms of standard deviation. They are also the most expensive neighborhoods by median, mean, and maximum house prices. However, one must also consider the geographic distribution of houses in a neighborhood. Is a home that is significantly more or less expensive than the median or mean priced home in a neighborhood close to the center, or is it close to the border of another neighborhood? This metric is not included in the data, but would possibly provide some interesting insight.
Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.
Pool.QC Misc.Feature Alley Fence Fireplace.Qu
997 971 933 798 491
Lot.Frontage Garage.Yr.Blt Garage.Qual Garage.Cond Garage.Type
167 48 47 47 46
The column names have been sorted in descending order by number of missing values. Pool.QC has 997 missing values. This corresponds to 997 houses that do not have a swimming pool, which is reasonable and appropriate for data of 1000 houses.
We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.
price Lot.Area Land.Slope Year.Built Year.Remod.Add
Min. : 12789 Min. : 1470 Gtl:962 Min. :1872 Min. :1950
1st Qu.:129763 1st Qu.: 7314 Mod: 33 1st Qu.:1955 1st Qu.:1966
Median :159467 Median : 9317 Sev: 5 Median :1975 Median :1992
Mean :181190 Mean : 10352 Mean :1972 Mean :1984
3rd Qu.:213000 3rd Qu.: 11650 3rd Qu.:2001 3rd Qu.:2004
Max. :615000 Max. :215245 Max. :2010 Max. :2010
Bedroom.AbvGr
Min. :0.000
1st Qu.:2.000
Median :3.000
Mean :2.806
3rd Qu.:3.000
Max. :6.000
A summary of the subsetted data and correlation plot are shown above. Lot.Slope has been removed because it it is categorical. Collinearity does not seem to be relevant for this problem.
Predictor_Variables Adjusted_R2
1 Initial Model 0.5598345
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr 0.5220090
3 Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr 0.5526062
4 Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr 0.4473663
5 Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr 0.4922384
6 Lot.Area + Land.Slope + Year.Built + Year.Remod.Add 0.5314859
I used the backward selection method, selecting predictors by their Adjusted R2. I am choosing this method because a reliable overall prediction model is my priority. I am not as concerned about the significance of the individual predictors, and thus not using p-values.
The backward selection method will compare the Adjusted R2 of the initial model with five predictor variables to each of the models including four of those variables. As we can see from the above output, none of the models with four predictors has a greater Adjusted R2 than the initial model.
Thus, we include all five given variables. A summary of the model is shown below.
Call:
lm(formula = log(price) ~ ., data = q4)
Residuals:
Min 1Q Median 3Q Max
-2.0878 -0.1651 -0.0211 0.1657 0.9945
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.371e+01 8.574e-01 -15.996 < 2e-16 ***
Lot.Area 1.028e-05 1.106e-06 9.296 < 2e-16 ***
Land.SlopeMod 1.384e-01 4.991e-02 2.773 0.00565 **
Land.SlopeSev -4.567e-01 1.514e-01 -3.016 0.00263 **
Year.Built 6.049e-03 3.788e-04 15.968 < 2e-16 ***
Year.Remod.Add 6.778e-03 5.468e-04 12.395 < 2e-16 ***
Bedroom.AbvGr 8.686e-02 1.077e-02 8.063 2.12e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared: 0.5625, Adjusted R-squared: 0.5598
F-statistic: 212.8 on 6 and 993 DF, p-value: < 2.2e-16
Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?
428 998 645 66 918
4.3591298 1.0516498 1.0007075 0.9890399 0.9457524
# A tibble: 1 x 6
price Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
<int> <int> <fct> <int> <int> <int>
1 12789 9656 Gtl 1923 1970 2
[1] 12789
House 428 has the highest squared residual by far. The price, $12,789, seemed extremely low, so I also filtered the dataset to confirm that house 428 is indeed the lowest priced house in the dataset. In the Neighborhood vs Price plot above, house 428 corresponds to the minimum point in OldTown. This explains the high squared residual of house 428.
Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?
Predictor_Variables Adjusted_R2
1 Initial Model 0.6032018
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr 0.5220090
3 Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr 0.6014831
4 Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr 0.4932282
5 Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr 0.5363921
6 Lot.Area + Land.Slope + Year.Built + Year.Remod.Add 0.5910888
Similar to Question 4, we can see from the above output, none of the models with four predictors has a greater adjusted R2 than the initial model.
Thus, we again include all five given variables. A summary of the model is shown below.
Call:
lm(formula = log(price) ~ ., data = q6)
Residuals:
Min 1Q Median 3Q Max
-2.14050 -0.15650 -0.01561 0.15350 0.90854
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.553e+01 8.197e-01 -18.947 < 2e-16 ***
Lot.Area 2.442e-01 1.708e-02 14.297 < 2e-16 ***
Land.SlopeMod 1.151e-01 4.734e-02 2.431 0.0152 *
Land.SlopeSev -6.554e-02 1.222e-01 -0.536 0.5917
Year.Built 5.981e-03 3.597e-04 16.628 < 2e-16 ***
Year.Remod.Add 6.734e-03 5.190e-04 12.975 < 2e-16 ***
Bedroom.AbvGr 5.909e-02 1.055e-02 5.599 2.8e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2649 on 993 degrees of freedom
Multiple R-squared: 0.6056, Adjusted R-squared: 0.6032
F-statistic: 254.1 on 6 and 993 DF, p-value: < 2.2e-16
Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.
It is indeed better to log transform Lot.Area
Upon transforming Lot Area to log(Lot Area), the spread of points is much less congested. Looking at the output summaries from Questions 4 and 6, the respective Adjusted R2 values are 0.560 and 0.603. Using a log transformation on Lot Area explains a little more variance, which can be confirmed numerically and graphically.
All code is included here.
knitr::opts_chunk$set(comment = NA)
load("ames_train.Rdata")
library(MASS)
library(dplyr)
library(expss)
library(forcats)
library(GGally)
library(ggplot2)
library(grid)
library(cowplot)
library(lemon)
library(scales)
library(tidyr)
library(viridis)
#Median and Mean Age.
q1_median = 2020 - median(ames_train$Year.Built)
q1_mean = 2020 - mean(ames_train$Year.Built)
#Histogram.
g1 <- ggplot(ames_train, aes(x = (2020 - Year.Built)))
g1 <- g1 + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
g1 <- g1 + ggtitle("Distribution of Houses by Age")
#X-axis
g1 <- g1 + scale_x_continuous("Age of House (years)", expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Frequency of Age", expand = c(0,0))
#Median Line.
g1 <- g1 + geom_vline(xintercept = q1_median, size = 1.5, color = viridis(7)[2])
#Mean Line.
g1 <- g1 + geom_vline(xintercept = q1_mean, size = 1.5, color = viridis(7)[6])
#Labels.
median_label <- grobTree(textGrob(paste("Median =", q1_median, "years"),
x = 0.28, y = 0.85, hjust = 0,
gp = gpar(col = viridis(7)[2], fontsize = 15)))
mean_label <- grobTree(textGrob(paste("Mean = approx.", round(q1_mean, 0), "years"),
x = 0.3, y = 0.80, hjust = 0,
gp = gpar(col = viridis(7)[6], fontsize = 15)))
g1 <- g1 + annotation_custom(median_label) + annotation_custom(mean_label)
#Modify labels and text.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
g1
g2 <- ggplot(ames_train, aes(x = fct_reorder(Neighborhood, price, .fun = 'median'),
y = price))
#Violin and Box Plot.
g2 <- g2 + geom_violin(color = viridis(7)[5], width = 1.3) + geom_boxplot(color = viridis(7)[3], width = 0.2)
#Title.
g2 <- g2 + ggtitle("Neighborhood vs Price")
#X-axis
g2 <- g2 + scale_x_discrete(name = "Neighborhood")
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Price", labels = dollar)
#Modify labels and text.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"),
legend.position = "none")
g2
ames_train %>%
apply_labels(Neighborhood = "Neighborhood", price = "") %>%
tab_rows(Neighborhood) %>%
tab_cells(price) %>%
tab_cols(total(label = "House Price Statistics")) %>%
tab_stat_fun("Count" = w_n,
"Minimum" = w_min,
"Median" = w_median,
"Mean" = w_mean,
"Maximum" = w_max,
"Std. dev." = w_sd,
method = list) %>%
tab_pivot() %>%
set_caption("Table with Summary") %>%
htmlTable(css.cell = c("width: 100px", rep("width: 80px", ncol(.) - 1)))
sort(colSums(is.na(ames_train)), decreasing = TRUE)[0:10]
q4 <- ames_train %>%
select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)
summary(q4)
ggcorr(q4[-3]) + scale_fill_viridis(direction = -1)
q4_model <- lm(log(price) ~ ., q4)
new_R2 <- summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = q4))$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2
for (x in c(2:6)) {
Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q4[-c(1,x)])), sep = " + ")))
temp_df <- q4 %>%
select(colnames(q4[-x]))
new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
summary(q4_model)
sort(residuals(q4_model) ^2, decreasing = TRUE)[1:5]
print(q4[428,])
print(min(q4$price))
q6 <- q4 %>%
mutate(Lot.Area = log(Lot.Area))
q6_model <- lm(log(price) ~ ., q6)
new_R2 <- summary(lm(log(price) ~ Lot.Area + Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr, data = q6))$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2
for (x in c(2:6)) {
Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q6[-c(1,x)])), sep = " + ")))
temp_df <- q6 %>%
select(colnames(q6[-x]))
new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
summary(q6_model)
#Data for Log(House Price) by Lot Area.
q7_1 <- q4 %>%
mutate(Observed = log(price)) %>%
select(Lot.Area, Observed)
q7_1$Predicted = predict(q4_model)
q7_1 <- q7_1 %>%
gather(y_type, price, c(Predicted, Observed))
#Data for Log(House Price) by Logt(Lot Area).
q7_2 <- q6 %>%
mutate(Observed = log(price)) %>%
select(Lot.Area, Observed)
q7_2$Predicted = predict(q6_model)
q7_2 <- q7_2 %>%
gather(y_type, price, c(Predicted, Observed))
#Scatter, log(price) by Lot.Area.
g7_1 <- ggplot(q7_1, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_1 <- g7_1 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_1 <- g7_1 + ggtitle(label = "")
#X-axis.
g7_1 <- g7_1 + scale_x_continuous("Lot Area", labels = comma)
#Y-axis.
g7_1 <- g7_1 + scale_y_continuous("Log(House Price)")
#Viridis Color Scheme.
g7_1 <- g7_1 + scale_color_manual(name = "Log(House Price)",
values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_1 <- g7_1 + scale_shape_manual(name = "Log(House Price)",
values = c(3, 4))
#Modify labels and text. Remove legend from left plot.
g7_1 <- g7_1 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
#Scatter, log(price) by log(Lot.Area).
g7_2 <- ggplot(q7_2, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_2 <- g7_2 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_2 <- g7_2 + ggtitle(label = "Log(House Price) by Lot Area and Log(Lot Area)")
#X-axis.
g7_2 <- g7_2 + scale_x_continuous("Log(Lot Area)")
#Y-axis.
g7_2 <- g7_2 + scale_y_continuous("Log(House Price)", position = "right")
#Viridis Color Scheme.
g7_2 <- g7_2 + scale_color_manual(name = "Log(House Price)",
values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_2 <- g7_2 + scale_shape_manual(name = "Log(House Price)",
values = c(3, 4))
#Modify labels and text.
g7_2 <- g7_2 + theme(plot.title = element_text(hjust = 2.1, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Modify legend text. Save as object.
g7_legend <- g_legend(g7_2 + theme(legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"),
legend.key.width = unit(1.5, "cm"),
legend.position = "bottom"))
#Remove legend from right plot.
g7_2 <- g7_2 + theme(legend.position = "none")
#Align two plots horizontally across one row.
g7 <- plot_grid(g7_1, g7_2, align = "h", nrow = 1)
#Include legend at the bottom.
plot_grid(g7, g7_legend, ncol = 1, rel_heights = c(0.90, 0.10))