Refer to the codebook for the Ames, Iowa dataset for more information.

See the Appendix for code.


1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.


The histogram above displays the distribution of ages for the 1,000 homes in the data set. The distribution is skewed right: the median age is 45 years, so roughly half the homes are younger than 45 years and half are older, while the mean age is 48 years; the mean exceeding the median is consistent with the right skew. Over 25% of the homes are less than 20 years old.
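These figures can be checked directly from the data; a minimal sketch, assuming the same 2020 reference year used for the histogram code in the Appendix:

summary(2020 - ames_train$Year.Built)   #Five-number summary and mean of house age.
mean(2020 - ames_train$Year.Built < 20) #Proportion of homes under 20 years old.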


2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

House Price Statistics by Neighborhood (prices in US$)

 Neighborhood  Count  Minimum    Median      Mean  Maximum  Std. dev.
 Blmngtn          11   172500  191000.0  198634.5   246990    26454.9
 Blueste           3   116500  123900.0  125800.0   137000    10381.2
 BrDale           10    83000   98750.0   98930.0   125500    13337.6
 BrkSide          41    39300  124000.0  122472.6   207000    37309.9
 ClearCr          13   107500  185000.0  193153.8   277000    48068.7
 CollgCr          85   110000  195800.0  196950.9   475000    52786.1
 Crawfor          29    96500  205000.0  204196.6   392500    71267.6
 Edwards          60    61500  127250.0  136322.0   415000    54851.6
 Gilbert          49   133000  183500.0  193328.0   377500    41190.4
 Greens            4   155000  212625.0  198562.5   214000    29063.4
 GrnHill           2   230000  280000.0  280000.0   330000    70710.7
 IDOTRR           35    34900   99500.0   97620.7   150909    31530.4
 Landmrk
 MeadowV          16    73000   85750.0   92946.9   129500    18939.8
 Mitchel          44    93500  156500.0  165013.6   285000    39682.9
 NAmes           155    76500  139900.0  141356.0   277500    27268.0
 NoRidge          28   235000  290000.0  295844.6   405000    35889.0
 NPkVill           4   123000  142100.0  139275.0   149900    11958.4
 NridgHt          57   173000  336860.0  333646.7   615000   105088.9
 NWAmes           41    82500  185000.0  194093.9   278000    41340.5
 OldTown          71    12789  120000.0  120225.6   265979    36429.7
 Sawyer           61    63900  136000.0  139312.7   219000    21216.2
 SawyerW          46    67500  182500.0  183101.0   320000    48354.4
 Somerst          74   139000  221650.0  234595.9   468000    65199.5
 StoneBr          20   180000  340691.5  339316.0   591587   123459.1
 SWISU            12    80000  134000.0  130619.5   169000    27375.8
 Timber           19   175000  232500.0  265192.2   425000    84029.6
 Veenker          10   150000  205750.0  233650.0   385000    72545.4

The violin and box plot above shows the distribution of house prices within each neighborhood, with neighborhoods sorted in ascending order of median price.

First, note that the number of homes in each neighborhood varies from 2 to 155.

Next, neither the maximum nor the minimum price in each neighborhood increases uniformly with the neighborhood medians. It is also clear that the spread of prices within neighborhoods varies greatly.

The mean and median prices for each neighborhood tend to be quite similar.

Some general observations can be made; however, no single neighborhood has the highest or lowest prices on every metric (minimum, median, mean, and maximum), and the widely varying house counts and standard deviations further complicate any ranking. For this reason, I use the median as the measure of typical price, since it is robust to skew and to the small sample sizes in some neighborhoods, and the standard deviation as the measure of heterogeneity.

StoneBr and NridgHt have the largest spreads in terms of standard deviation ($123,459.10 and $105,088.90, respectively), making StoneBr the most heterogeneous neighborhood. They are also the most expensive neighborhoods by median, mean, and maximum price, with StoneBr's median of $340,691.50 slightly above NridgHt's $336,860. At the other end, MeadowV is the least expensive neighborhood by both median ($85,750) and mean ($92,946.90) price. One must also consider the geographic distribution of houses within a neighborhood: is a home priced far above or below its neighborhood's typical price located near the center, or near the border with another neighborhood? That information is not included in the data, but it could provide some interesting insight.
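For reference, the per-neighborhood statistics above can also be reproduced with a short dplyr summary (a sketch, not the expss code used in the Appendix, and assuming dplyr is loaded as in the setup chunk):

#Per-neighborhood price summaries, sorted by median price.
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(count = n(),
            median_price = median(price),
            mean_price = mean(price),
            sd_price = sd(price)) %>%
  arrange(desc(median_price))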


3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

      Pool.QC  Misc.Feature         Alley         Fence  Fireplace.Qu 
          997           971           933           798           491 
 Lot.Frontage Garage.Yr.Blt   Garage.Qual   Garage.Cond   Garage.Type 
          167            48            47            47            46 

The column names above are sorted in descending order by number of missing values. Pool.QC has the most missing values, 997. Each missing value corresponds to a house without a swimming pool, so it is entirely plausible that 997 of the 1,000 houses in the data set have no pool.


4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

     price           Lot.Area      Land.Slope   Year.Built   Year.Remod.Add
 Min.   : 12789   Min.   :  1470   Gtl:962    Min.   :1872   Min.   :1950  
 1st Qu.:129763   1st Qu.:  7314   Mod: 33    1st Qu.:1955   1st Qu.:1966  
 Median :159467   Median :  9317   Sev:  5    Median :1975   Median :1992  
 Mean   :181190   Mean   : 10352              Mean   :1972   Mean   :1984  
 3rd Qu.:213000   3rd Qu.: 11650              3rd Qu.:2001   3rd Qu.:2004  
 Max.   :615000   Max.   :215245              Max.   :2010   Max.   :2010  
 Bedroom.AbvGr  
 Min.   :0.000  
 1st Qu.:2.000  
 Median :3.000  
 Mean   :2.806  
 3rd Qu.:3.000  
 Max.   :6.000  


A summary of the subsetted data and a correlation plot are shown above. Land.Slope has been excluded from the correlation plot because it is categorical. The correlation plot shows no strong pairwise correlations between the numeric predictors, so collinearity does not appear to be a concern for this problem.
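As a quick numeric check of that claim, the pairwise correlations can also be printed directly; a minimal sketch, assuming the q4 data frame created in the Appendix:

#Pairwise correlations of the numeric variables in q4 (Land.Slope excluded);
#values near +/-1 between predictors would indicate collinearity.
round(cor(q4[, sapply(q4, is.numeric)]), 2)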


                                       Predictor_Variables Adjusted_R2
1                                            Initial Model   0.5598345
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5220090
3   Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5526062
4   Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr   0.4473663
5       Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr   0.4922384
6      Lot.Area + Land.Slope + Year.Built + Year.Remod.Add   0.5314859

I used the backward selection method, eliminating predictors based on adjusted R2. I chose this method because a reliable overall prediction model is my priority; I am less concerned with the significance of individual predictors, so I did not select on p-values.

Backward selection starts from the initial model containing all five predictor variables, fits each model that drops exactly one of them, and compares adjusted R2 values; if any four-predictor model improves on the current model, the corresponding predictor is removed and the process repeats on the reduced model. As the output above shows, none of the models with four predictors has a greater adjusted R2 than the initial model, so the procedure stops after the first round.
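For completeness, the full iterative procedure can be sketched as a small helper function; backward_adjr2 below is hypothetical and not part of the Appendix code, which only needs the single elimination round shown above.

#Hypothetical sketch: iterative backward elimination by adjusted R2.
#Assumes `dat` has a `price` column plus the candidate predictors.
backward_adjr2 <- function(dat) {
  adj_r2 <- function(preds) {
    rhs <- if (length(preds) > 0) paste(preds, collapse = " + ") else "1"
    summary(lm(as.formula(paste("log(price) ~", rhs)), data = dat))$adj.r.squared
  }
  preds <- setdiff(names(dat), "price")
  repeat {
    current <- adj_r2(preds)
    dropped <- sapply(preds, function(p) adj_r2(setdiff(preds, p)))
    if (max(dropped) <= current) break                 #No single drop improves the fit.
    preds <- setdiff(preds, names(which.max(dropped))) #Remove the weakest predictor.
  }
  lm(as.formula(paste("log(price) ~", paste(preds, collapse = " + "))), data = dat)
}
#e.g. summary(backward_adjr2(q4)) should match the model selected here.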

Thus, we include all five given variables. A summary of the model is shown below.



Call:
lm(formula = log(price) ~ ., data = q4)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0878 -0.1651 -0.0211  0.1657  0.9945 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared:  0.5625,    Adjusted R-squared:  0.5598 
F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16

5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

      428       998       645        66       918 
4.3591298 1.0516498 1.0007075 0.9890399 0.9457524 
# A tibble: 1 x 6
  price Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
  <int>    <int> <fct>           <int>          <int>         <int>
1 12789     9656 Gtl              1923           1970             2
[1] 12789

House 428 has the largest squared residual by far (about 4.36, versus roughly 1.05 for the next largest). Its price of $12,789 seemed extremely low, so I also filtered the data set to confirm that house 428 is indeed the lowest-priced home in the data; in the Neighborhood vs Price plot above, it is the minimum point in OldTown. Its other characteristics (built in 1923, remodeled in 1970, two bedrooms above grade, a 9,656-square-foot lot on a gentle slope) are unremarkable, so the model predicts a far higher log price than was observed; the resulting residual of about -2.09 is the minimum shown in the Question 4 residual summary, and squaring it gives the largest squared residual.
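A minimal way to identify and inspect this observation, equivalent to the Appendix code:

#Index of the observation with the largest squared residual.
worst <- which.max(residuals(q4_model)^2)
worst
#Inspect that home across all variables in the data set.
ames_train[worst, ]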


6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

                                       Predictor_Variables Adjusted_R2
1                                            Initial Model   0.6032018
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5220090
3   Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.6014831
4   Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr   0.4932282
5       Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr   0.5363921
6      Lot.Area + Land.Slope + Year.Built + Year.Remod.Add   0.5910888

As in Question 4, we can see from the above output that none of the models with four predictors has a greater adjusted R2 than the initial model. (Note that the Lot.Area rows in the table above now refer to log(Lot.Area).)

Thus, we arrive at the same set of predictors: all five variables are again included. A summary of the model is shown below.



Call:
lm(formula = log(price) ~ ., data = q6)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.14050 -0.15650 -0.01561  0.15350  0.90854 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.553e+01  8.197e-01 -18.947  < 2e-16 ***
Lot.Area        2.442e-01  1.708e-02  14.297  < 2e-16 ***
Land.SlopeMod   1.151e-01  4.734e-02   2.431   0.0152 *  
Land.SlopeSev  -6.554e-02  1.222e-01  -0.536   0.5917    
Year.Built      5.981e-03  3.597e-04  16.628  < 2e-16 ***
Year.Remod.Add  6.734e-03  5.190e-04  12.975  < 2e-16 ***
Bedroom.AbvGr   5.909e-02  1.055e-02   5.599  2.8e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2649 on 993 degrees of freedom
Multiple R-squared:  0.6056,    Adjusted R-squared:  0.6032 
F-statistic: 254.1 on 6 and 993 DF,  p-value: < 2.2e-16

7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.


It is indeed better to log-transform Lot.Area.

After transforming Lot.Area to log(Lot.Area), the points are spread much more evenly along the horizontal axis instead of being bunched at small lot areas with a long right tail, and the predicted values track the observed log prices more closely. Quantitatively, comparing the output summaries from Questions 4 and 6, the adjusted R2 rises from 0.560 to 0.603 and the residual standard error falls from 0.279 to 0.265, so the model with log(Lot.Area) explains more of the variance in log home price.
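These figures can be pulled straight from the two fitted models; a short sketch using the q4_model and q6_model objects from the Appendix:

#Adjusted R2 for the Lot.Area and log(Lot.Area) models.
c(Lot.Area = summary(q4_model)$adj.r.squared,
  log.Lot.Area = summary(q6_model)$adj.r.squared)
#Residual standard errors.
c(Lot.Area = summary(q4_model)$sigma,
  log.Lot.Area = summary(q6_model)$sigma)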


Appendix

All code is included here.

Setup & Packages

knitr::opts_chunk$set(comment = NA)
load("ames_train.Rdata")

library(MASS)
library(dplyr)
library(expss)
library(forcats)
library(GGally)
library(ggplot2)
library(grid)
library(cowplot)
library(lemon)
library(scales)
library(tidyr)
library(viridis)

1

Histogram

#Median and Mean Age.
q1_median <- 2020 - median(ames_train$Year.Built)
q1_mean <- 2020 - mean(ames_train$Year.Built)
#Histogram.
g1 <- ggplot(ames_train, aes(x = (2020 - Year.Built)))
g1 <- g1 + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
g1 <- g1 + ggtitle("Distribution of Houses by Age")
#X-axis
g1 <- g1 + scale_x_continuous("Age of House (years)", expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Frequency of Age", expand = c(0,0))
#Median Line.
g1 <- g1 + geom_vline(xintercept = q1_median, size = 1.5, color = viridis(7)[2])
#Mean Line.
g1 <- g1 + geom_vline(xintercept = q1_mean, size = 1.5, color = viridis(7)[6])
#Labels.
median_label <- grobTree(textGrob(paste("Median =", q1_median, "years"),
                                  x = 0.28,  y = 0.85, hjust = 0,
                                  gp = gpar(col = viridis(7)[2], fontsize = 15)))
mean_label <- grobTree(textGrob(paste("Mean = approx.", round(q1_mean, 0), "years"),
                                x = 0.3,  y = 0.80, hjust = 0,
                                gp = gpar(col = viridis(7)[6], fontsize = 15)))
g1 <- g1 + annotation_custom(median_label) + annotation_custom(mean_label)

#Modify labels and text.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(size = 12),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"))
g1

2

Neighborhood Violin & Box Plot.

g2 <- ggplot(ames_train, aes(x = fct_reorder(Neighborhood, price, .fun = 'median'),
                             y = price))
#Violin and Box Plot.
g2 <- g2 + geom_violin(color = viridis(7)[5], width = 1.3) + geom_boxplot(color = viridis(7)[3], width = 0.2)
#Title. 
g2 <- g2 + ggtitle("Neighborhood vs Price")
#X-axis
g2 <- g2 + scale_x_discrete(name = "Neighborhood")
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Price", labels = dollar)
#Modify labels and text.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
                 axis.title.x = element_text(size = 12, face = "bold"),
                 axis.text.y = element_text(size = 10),
                 axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"),
                 legend.position = "none")
g2

Neighborhood Table Summary.

ames_train  %>%
  apply_labels(Neighborhood = "Neighborhood", price = "") %>%
  tab_rows(Neighborhood) %>%
  tab_cells(price) %>%
  tab_cols(total(label = "House Price Statistics")) %>%
  tab_stat_fun("Count" = w_n,
               "Minimum" = w_min,
               "Median" = w_median,
               "Mean" = w_mean,
               "Maximum" = w_max,
               "Std. dev." = w_sd,
               method = list) %>%
  tab_pivot() %>%
  set_caption("Table with Summary") %>% 
  htmlTable(css.cell = c("width: 100px", rep("width: 80px", ncol(.) - 1)))

3

Missing Values.

sort(colSums(is.na(ames_train)), decreasing = TRUE)[1:10]

4

Data Summary.

q4 <- ames_train %>%
  select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)

summary(q4)

ggcorr(q4[-3]) + scale_fill_viridis(direction = -1)

Backward Selection, Part I.

q4_model <- lm(log(price) ~ ., q4)

new_R2 <- summary(q4_model)$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2

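#Drop each predictor (columns 2 to 6) in turn, refit, and record the adjusted R2.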
for (x in c(2:6)) {
  Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q4[-c(1,x)])), sep = " + ")))
  temp_df <- q4 %>%
    select(colnames(q4[-x]))

  new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
  R2 <- c(R2, new_R2)
}

R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values

Model Summary, Part I.

summary(q4_model)

5

House 428.

sort(residuals(q4_model) ^2, decreasing = TRUE)[1:5]
print(q4[428,])
print(min(q4$price))

6

Backward Selection, Part II.

q6 <- q4 %>%
  mutate(Lot.Area = log(Lot.Area))
q6_model <- lm(log(price) ~ ., q6)

new_R2 <- summary(q6_model)$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2

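#Same elimination loop as Part I, now with log(Lot.Area) in place of Lot.Area.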
for (x in c(2:6)) {
  Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q6[-c(1,x)])), sep = " + ")))
  temp_df <- q6 %>%
    select(colnames(q6[-x]))

  new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
  R2 <- c(R2, new_R2)
}

R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values

Model Summary, Part II.

summary(q6_model)

7

Scatter Plots.

#Data for Log(House Price) by Lot Area.
q7_1 <- q4 %>%
  mutate(Observed = log(price)) %>%
  select(Lot.Area, Observed)

q7_1$Predicted = predict(q4_model)

q7_1 <- q7_1 %>%
  gather(y_type, price, c(Predicted, Observed))

#Data for Log(House Price) by Log(Lot Area).
q7_2 <- q6 %>%
  mutate(Observed = log(price)) %>%
  select(Lot.Area, Observed)

q7_2$Predicted = predict(q6_model)

q7_2 <- q7_2 %>%
  gather(y_type, price, c(Predicted, Observed))

#Scatter, log(price) by Lot.Area.
g7_1 <- ggplot(q7_1, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_1 <- g7_1 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_1 <- g7_1 + ggtitle(label = "")
#X-axis.
g7_1 <- g7_1 + scale_x_continuous("Lot Area", labels = comma)
#Y-axis.
g7_1 <- g7_1 + scale_y_continuous("Log(House Price)")
#Viridis Color Scheme.
g7_1 <- g7_1 + scale_color_manual(name = "Log(House Price)",
                                  values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_1 <- g7_1 + scale_shape_manual(name = "Log(House Price)",
                                  values = c(3, 4))
#Modify labels and text. Remove legend from left plot.
g7_1 <- g7_1 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
                     axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                     axis.title.x = element_text(size = 14, face = "bold"),
                     axis.text.y = element_text(size = 12),
                     axis.title.y = element_text(size = 14, face = "bold"),
                     legend.position = "none")

#Scatter, log(price) by log(Lot.Area).
g7_2 <- ggplot(q7_2, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_2 <- g7_2 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_2 <- g7_2 + ggtitle(label = "Log(House Price) by Lot Area and Log(Lot Area)")
#X-axis.
g7_2 <- g7_2 + scale_x_continuous("Log(Lot Area)")
#Y-axis.
g7_2 <- g7_2 + scale_y_continuous("Log(House Price)", position = "right")
#Viridis Color Scheme.
g7_2 <- g7_2 + scale_color_manual(name = "Log(House Price)",
                                  values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_2 <- g7_2 + scale_shape_manual(name = "Log(House Price)",
                                  values = c(3, 4))
#Modify labels and text.
g7_2 <- g7_2 + theme(plot.title = element_text(hjust = 2.1, size = 16, face = "bold"),
                     axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                     axis.title.x = element_text(size = 14, face = "bold"),
                     axis.text.y = element_text(size = 12),
                     axis.title.y = element_text(size = 14, face = "bold"))

#Modify legend text.  Save as object.
g7_legend <- g_legend(g7_2 + theme(legend.text = element_text(size = 12),
                                   legend.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                                   legend.key.width = unit(1.5, "cm"),
                                   legend.position = "bottom"))
#Remove legend from right plot.
g7_2 <- g7_2 + theme(legend.position = "none")

#Align two plots horizontally across one row.
g7 <- plot_grid(g7_1, g7_2, align = "h", nrow = 1)
#Include legend at the bottom.
plot_grid(g7, g7_legend, ncol = 1, rel_heights = c(0.90, 0.10))