Refer to the codebook for the Ames, Iowa dataset for more information.

See the Appendix for code.


1

Make a labeled histogram (with 30 bins) of the ages of the houses in the data set, and describe the distribution.


The histogram above displays the distribution of ages for the 1,000 homes in the data set. The distribution is skewed right: the median age is 45 years, so roughly half the homes are younger than 45 years and half are older, while the mean age is 48 years; the mean exceeding the median is consistent with the right skew. Over 25% of the homes are less than 20 years old.
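These figures can be checked directly from the data; a minimal sketch, assuming the same 2020 reference year used for the histogram code in the Appendix:

summary(2020 - ames_train$Year.Built)   #Five-number summary and mean of house age.
mean(2020 - ames_train$Year.Built < 20) #Proportion of homes under 20 years old.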


2

The mantra in real estate is “Location, Location, Location!” Make a graphical display that relates a home price to its neighborhood in Ames, Iowa. Which summary statistics are most appropriate to use for determining the most expensive, least expensive, and most heterogeneous (having the most variation in housing price) neighborhoods? Report which neighborhoods these are based on the summary statistics of your choice. Report the value of your chosen summary statistics for these neighborhoods.

House Price Statistics by Neighborhood (prices in US$)

 Neighborhood  Count  Minimum    Median      Mean  Maximum  Std. dev.
 Blmngtn          11   172500  191000.0  198634.5   246990    26454.9
 Blueste           3   116500  123900.0  125800.0   137000    10381.2
 BrDale           10    83000   98750.0   98930.0   125500    13337.6
 BrkSide          41    39300  124000.0  122472.6   207000    37309.9
 ClearCr          13   107500  185000.0  193153.8   277000    48068.7
 CollgCr          85   110000  195800.0  196950.9   475000    52786.1
 Crawfor          29    96500  205000.0  204196.6   392500    71267.6
 Edwards          60    61500  127250.0  136322.0   415000    54851.6
 Gilbert          49   133000  183500.0  193328.0   377500    41190.4
 Greens            4   155000  212625.0  198562.5   214000    29063.4
 GrnHill           2   230000  280000.0  280000.0   330000    70710.7
 IDOTRR           35    34900   99500.0   97620.7   150909    31530.4
 Landmrk
 MeadowV          16    73000   85750.0   92946.9   129500    18939.8
 Mitchel          44    93500  156500.0  165013.6   285000    39682.9
 NAmes           155    76500  139900.0  141356.0   277500    27268.0
 NoRidge          28   235000  290000.0  295844.6   405000    35889.0
 NPkVill           4   123000  142100.0  139275.0   149900    11958.4
 NridgHt          57   173000  336860.0  333646.7   615000   105088.9
 NWAmes           41    82500  185000.0  194093.9   278000    41340.5
 OldTown          71    12789  120000.0  120225.6   265979    36429.7
 Sawyer           61    63900  136000.0  139312.7   219000    21216.2
 SawyerW          46    67500  182500.0  183101.0   320000    48354.4
 Somerst          74   139000  221650.0  234595.9   468000    65199.5
 StoneBr          20   180000  340691.5  339316.0   591587   123459.1
 SWISU            12    80000  134000.0  130619.5   169000    27375.8
 Timber           19   175000  232500.0  265192.2   425000    84029.6
 Veenker          10   150000  205750.0  233650.0   385000    72545.4

The violin and box plot above shows the distribution of house prices within each neighborhood, with neighborhoods sorted in ascending order of median price.

First, note that the number of homes in each neighborhood varies from 2 to 155.

Next, neither the maximum nor the minimum price in each neighborhood increases uniformly with the neighborhood medians. It is also clear that the spread of prices within neighborhoods varies greatly.

The mean and median prices for each neighborhood tend to be quite similar.

Some general observations can be made; however, no single neighborhood has the highest or lowest prices on every metric (minimum, median, mean, and maximum), and the widely varying house counts and standard deviations further complicate any ranking. For this reason, I use the median as the measure of typical price, since it is robust to skew and to the small sample sizes in some neighborhoods, and the standard deviation as the measure of heterogeneity.

StoneBr and NridgHt have the largest spreads in terms of standard deviation ($123,459.10 and $105,088.90, respectively), making StoneBr the most heterogeneous neighborhood. They are also the most expensive neighborhoods by median, mean, and maximum price, with StoneBr's median of $340,691.50 slightly above NridgHt's $336,860. At the other end, MeadowV is the least expensive neighborhood by both median ($85,750) and mean ($92,946.90) price. One must also consider the geographic distribution of houses within a neighborhood: is a home priced far above or below its neighborhood's typical price located near the center, or near the border with another neighborhood? That information is not included in the data, but it could provide some interesting insight.
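For reference, the per-neighborhood statistics above can also be reproduced with a short dplyr summary (a sketch, not the expss code used in the Appendix, and assuming dplyr is loaded as in the setup chunk):

#Per-neighborhood price summaries, sorted by median price.
ames_train %>%
  group_by(Neighborhood) %>%
  summarise(count = n(),
            median_price = median(price),
            mean_price = mean(price),
            sd_price = sd(price)) %>%
  arrange(desc(median_price))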


3

Which variable has the largest number of missing values? Explain why it makes sense that there are so many missing values for this variable.

      Pool.QC  Misc.Feature         Alley         Fence  Fireplace.Qu 
          997           971           933           798           491 
 Lot.Frontage Garage.Yr.Blt   Garage.Qual   Garage.Cond   Garage.Type 
          167            48            47            47            46 

The column names above are sorted in descending order by number of missing values. Pool.QC has the most missing values, 997. Each missing value corresponds to a house without a swimming pool, so it is entirely plausible that 997 of the 1,000 houses in the data set have no pool.


4

We want to predict the natural log of the home prices. Candidate explanatory variables are lot size in square feet (Lot.Area), slope of property (Land.Slope), original construction date (Year.Built), remodel date (Year.Remod.Add), and the number of bedrooms above grade (Bedroom.AbvGr). Pick a model selection or model averaging method covered in the Specialization, and describe how this method works. Then, use this method to find the best multiple regression model for predicting the natural log of the home prices.

     price           Lot.Area      Land.Slope   Year.Built   Year.Remod.Add
 Min.   : 12789   Min.   :  1470   Gtl:962    Min.   :1872   Min.   :1950  
 1st Qu.:129763   1st Qu.:  7314   Mod: 33    1st Qu.:1955   1st Qu.:1966  
 Median :159467   Median :  9317   Sev:  5    Median :1975   Median :1992  
 Mean   :181190   Mean   : 10352              Mean   :1972   Mean   :1984  
 3rd Qu.:213000   3rd Qu.: 11650              3rd Qu.:2001   3rd Qu.:2004  
 Max.   :615000   Max.   :215245              Max.   :2010   Max.   :2010  
 Bedroom.AbvGr  
 Min.   :0.000  
 1st Qu.:2.000  
 Median :3.000  
 Mean   :2.806  
 3rd Qu.:3.000  
 Max.   :6.000  


A summary of the subsetted data and a correlation plot are shown above. Land.Slope has been excluded from the correlation plot because it is categorical. The correlation plot shows no strong pairwise correlations between the numeric predictors, so collinearity does not appear to be a concern for this problem.
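As a quick numeric check of that claim, the pairwise correlations can also be printed directly; a minimal sketch, assuming the q4 data frame created in the Appendix:

#Pairwise correlations of the numeric variables in q4 (Land.Slope excluded);
#values near +/-1 between predictors would indicate collinearity.
round(cor(q4[, sapply(q4, is.numeric)]), 2)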


                                       Predictor_Variables Adjusted_R2
1                                            Initial Model   0.5598345
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5220090
3   Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5526062
4   Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr   0.4473663
5       Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr   0.4922384
6      Lot.Area + Land.Slope + Year.Built + Year.Remod.Add   0.5314859

I used the backward selection method, eliminating predictors based on adjusted R2. I chose this method because a reliable overall prediction model is my priority; I am less concerned with the significance of individual predictors, so I did not select on p-values.

Backward selection starts from the initial model containing all five predictor variables, fits each model that drops exactly one of them, and compares adjusted R2 values; if any four-predictor model improves on the current model, the corresponding predictor is removed and the process repeats on the reduced model. As the output above shows, none of the models with four predictors has a greater adjusted R2 than the initial model, so the procedure stops after the first round.
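For completeness, the full iterative procedure can be sketched as a small helper function; backward_adjr2 below is hypothetical and not part of the Appendix code, which only needs the single elimination round shown above.

#Hypothetical sketch: iterative backward elimination by adjusted R2.
#Assumes `dat` has a `price` column plus the candidate predictors.
backward_adjr2 <- function(dat) {
  adj_r2 <- function(preds) {
    rhs <- if (length(preds) > 0) paste(preds, collapse = " + ") else "1"
    summary(lm(as.formula(paste("log(price) ~", rhs)), data = dat))$adj.r.squared
  }
  preds <- setdiff(names(dat), "price")
  repeat {
    current <- adj_r2(preds)
    dropped <- sapply(preds, function(p) adj_r2(setdiff(preds, p)))
    if (max(dropped) <= current) break                 #No single drop improves the fit.
    preds <- setdiff(preds, names(which.max(dropped))) #Remove the weakest predictor.
  }
  lm(as.formula(paste("log(price) ~", paste(preds, collapse = " + "))), data = dat)
}
#e.g. summary(backward_adjr2(q4)) should match the model selected here.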

Thus, we include all five given variables. A summary of the model is shown below.



Call:
lm(formula = log(price) ~ ., data = q4)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0878 -0.1651 -0.0211  0.1657  0.9945 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.371e+01  8.574e-01 -15.996  < 2e-16 ***
Lot.Area        1.028e-05  1.106e-06   9.296  < 2e-16 ***
Land.SlopeMod   1.384e-01  4.991e-02   2.773  0.00565 ** 
Land.SlopeSev  -4.567e-01  1.514e-01  -3.016  0.00263 ** 
Year.Built      6.049e-03  3.788e-04  15.968  < 2e-16 ***
Year.Remod.Add  6.778e-03  5.468e-04  12.395  < 2e-16 ***
Bedroom.AbvGr   8.686e-02  1.077e-02   8.063 2.12e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.279 on 993 degrees of freedom
Multiple R-squared:  0.5625,    Adjusted R-squared:  0.5598 
F-statistic: 212.8 on 6 and 993 DF,  p-value: < 2.2e-16

5

Which home has the largest squared residual in the previous analysis (Question 4)? Looking at all the variables in the data set, can you explain why this home stands out from the rest (what factors contribute to the high squared residual and why are those factors relevant)?

      428       998       645        66       918 
4.3591298 1.0516498 1.0007075 0.9890399 0.9457524 
# A tibble: 1 x 6
  price Lot.Area Land.Slope Year.Built Year.Remod.Add Bedroom.AbvGr
  <int>    <int> <fct>           <int>          <int>         <int>
1 12789     9656 Gtl              1923           1970             2
[1] 12789

House 428 has the largest squared residual by far (about 4.36, versus roughly 1.05 for the next largest). Its price of $12,789 seemed extremely low, so I also filtered the data set to confirm that house 428 is indeed the lowest-priced home in the data; in the Neighborhood vs Price plot above, it is the minimum point in OldTown. Its other characteristics (built in 1923, remodeled in 1970, two bedrooms above grade, a 9,656-square-foot lot on a gentle slope) are unremarkable, so the model predicts a far higher log price than was observed; the resulting residual of about -2.09 is the minimum shown in the Question 4 residual summary, and squaring it gives the largest squared residual.
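A minimal way to identify and inspect this observation, equivalent to the Appendix code:

#Index of the observation with the largest squared residual.
worst <- which.max(residuals(q4_model)^2)
worst
#Inspect that home across all variables in the data set.
ames_train[worst, ]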


6

Use the same model selection method you chose in Question 4 to again find the best multiple regression model to predict the natural log of home prices, but this time replacing Lot.Area with log(Lot.Area). Do you arrive at a model including the same set of predictors?

                                       Predictor_Variables Adjusted_R2
1                                            Initial Model   0.6032018
2 Land.Slope + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.5220090
3   Lot.Area + Year.Built + Year.Remod.Add + Bedroom.AbvGr   0.6014831
4   Lot.Area + Land.Slope + Year.Remod.Add + Bedroom.AbvGr   0.4932282
5       Lot.Area + Land.Slope + Year.Built + Bedroom.AbvGr   0.5363921
6      Lot.Area + Land.Slope + Year.Built + Year.Remod.Add   0.5910888

As in Question 4, we can see from the above output that none of the models with four predictors has a greater adjusted R2 than the initial model. (Note that the Lot.Area rows in the table above now refer to log(Lot.Area).)

Thus, we arrive at the same set of predictors: all five variables are again included. A summary of the model is shown below.



Call:
lm(formula = log(price) ~ ., data = q6)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.14050 -0.15650 -0.01561  0.15350  0.90854 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -1.553e+01  8.197e-01 -18.947  < 2e-16 ***
Lot.Area        2.442e-01  1.708e-02  14.297  < 2e-16 ***
Land.SlopeMod   1.151e-01  4.734e-02   2.431   0.0152 *  
Land.SlopeSev  -6.554e-02  1.222e-01  -0.536   0.5917    
Year.Built      5.981e-03  3.597e-04  16.628  < 2e-16 ***
Year.Remod.Add  6.734e-03  5.190e-04  12.975  < 2e-16 ***
Bedroom.AbvGr   5.909e-02  1.055e-02   5.599  2.8e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2649 on 993 degrees of freedom
Multiple R-squared:  0.6056,    Adjusted R-squared:  0.6032 
F-statistic: 254.1 on 6 and 993 DF,  p-value: < 2.2e-16

7

Do you think it is better to log transform Lot.Area, in terms of assumptions for linear regression? Make graphs of the predicted values of log home price versus the true values of log home price for the regression models selected for Lot.Area and log(Lot.Area). Referencing these two plots, provide a written support that includes a quantitative justification for your answer in the first part of question 7.


It is indeed better to log-transform Lot.Area.

After transforming Lot.Area to log(Lot.Area), the points are spread much more evenly along the horizontal axis instead of being bunched at small lot areas with a long right tail, and the predicted values track the observed log prices more closely. Quantitatively, comparing the output summaries from Questions 4 and 6, the adjusted R2 rises from 0.560 to 0.603 and the residual standard error falls from 0.279 to 0.265, so the model with log(Lot.Area) explains more of the variance in log home price.
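These figures can be pulled straight from the two fitted models; a short sketch using the q4_model and q6_model objects from the Appendix:

#Adjusted R2 for the Lot.Area and log(Lot.Area) models.
c(Lot.Area = summary(q4_model)$adj.r.squared,
  log.Lot.Area = summary(q6_model)$adj.r.squared)
#Residual standard errors.
c(Lot.Area = summary(q4_model)$sigma,
  log.Lot.Area = summary(q6_model)$sigma)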


Appendix

All code is included here.

Setup & Packages

knitr::opts_chunk$set(comment = NA)
load("ames_train.Rdata")

library(MASS)
library(dplyr)
library(expss)
library(forcats)
library(GGally)
library(ggplot2)
library(grid)
library(cowplot)
library(lemon)
library(scales)
library(tidyr)
library(viridis)

1

Histogram

#Median and Mean Age.
q1_median <- 2020 - median(ames_train$Year.Built)
q1_mean <- 2020 - mean(ames_train$Year.Built)
#Histogram.
g1 <- ggplot(ames_train, aes(x = (2020 - Year.Built)))
g1 <- g1 + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
g1 <- g1 + ggtitle("Distribution of Houses by Age")
#X-axis
g1 <- g1 + scale_x_continuous("Age of House (years)", expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Frequency of Age", expand = c(0,0))
#Median Line.
g1 <- g1 + geom_vline(xintercept = q1_median, size = 1.5, color = viridis(7)[2])
#Mean Line.
g1 <- g1 + geom_vline(xintercept = q1_mean, size = 1.5, color = viridis(7)[6])
#Labels.
median_label <- grobTree(textGrob(paste("Median =", q1_median, "years"),
                                  x = 0.28,  y = 0.85, hjust = 0,
                                  gp = gpar(col = viridis(7)[2], fontsize = 15)))
mean_label <- grobTree(textGrob(paste("Mean = approx.", round(q1_mean, 0), "years"),
                                x = 0.3,  y = 0.80, hjust = 0,
                                gp = gpar(col = viridis(7)[6], fontsize = 15)))
g1 <- g1 + annotation_custom(median_label) + annotation_custom(mean_label)

#Modify labels and text.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(size = 12),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(size = 12),
                 axis.title.y = element_text(size = 14, face = "bold"))
g1

2

Neighborhood Violin & Box Plot.

g2 <- ggplot(ames_train, aes(x = fct_reorder(Neighborhood, price, .fun = 'median'),
                             y = price))
#Violin and Box Plot.
g2 <- g2 + geom_violin(color = viridis(7)[5], width = 1.3) + geom_boxplot(color = viridis(7)[3], width = 0.2)
#Title. 
g2 <- g2 + ggtitle("Neighborhood vs Price")
#X-axis
g2 <- g2 + scale_x_discrete(name = "Neighborhood")
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Price", labels = dollar)
#Modify labels and text.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
                 axis.title.x = element_text(size = 12, face = "bold"),
                 axis.text.y = element_text(size = 10),
                 axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"),
                 legend.position = "none")
g2

Neighborhood Table Summary.

ames_train  %>%
  apply_labels(Neighborhood = "Neighborhood", price = "") %>%
  tab_rows(Neighborhood) %>%
  tab_cells(price) %>%
  tab_cols(total(label = "House Price Statistics")) %>%
  tab_stat_fun("Count" = w_n,
               "Minimum" = w_min,
               "Median" = w_median,
               "Mean" = w_mean,
               "Maximum" = w_max,
               "Std. dev." = w_sd,
               method = list) %>%
  tab_pivot() %>%
  set_caption("Table with Summary") %>% 
  htmlTable(css.cell = c("width: 100px", rep("width: 80px", ncol(.) - 1)))

3

Missing Values.

sort(colSums(is.na(ames_train)), decreasing = TRUE)[1:10]

4

Data Summary.

q4 <- ames_train %>%
  select(price, Lot.Area, Land.Slope, Year.Built, Year.Remod.Add, Bedroom.AbvGr)

summary(q4)

ggcorr(q4[-3]) + scale_fill_viridis(direction = -1)

Backward Selection, Part I.

q4_model <- lm(log(price) ~ ., q4)

new_R2 <- summary(q4_model)$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2

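#Drop each predictor (columns 2 to 6) in turn, refit, and record the adjusted R2.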
for (x in c(2:6)) {
  Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q4[-c(1,x)])), sep = " + ")))
  temp_df <- q4 %>%
    select(colnames(q4[-x]))

  new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
  R2 <- c(R2, new_R2)
}

R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values

Model Summary, Part I.

summary(q4_model)

5

House 428.

sort(residuals(q4_model) ^2, decreasing = TRUE)[1:5]
print(q4[428,])
print(min(q4$price))

6

Backward Selection, Part II.

q6 <- q4 %>%
  mutate(Lot.Area = log(Lot.Area))
q6_model <- lm(log(price) ~ ., q6)

new_R2 <- summary(q6_model)$adj.r.squared
Predictors <- ("Initial Model")
R2 <- new_R2

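#Same elimination loop as Part I, now with log(Lot.Area) in place of Lot.Area.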
for (x in c(2:6)) {
  Predictors <- c(Predictors, do.call(paste, c(as.list(colnames(q6[-c(1,x)])), sep = " + ")))
  temp_df <- q6 %>%
    select(colnames(q6[-x]))

  new_R2 <- summary(lm(log(price) ~ ., data = temp_df))$adj.r.squared
  R2 <- c(R2, new_R2)
}

R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values

Model Summary, Part II.

summary(q6_model)

7

Scatter Plots.

#Data for Log(House Price) by Lot Area.
q7_1 <- q4 %>%
  mutate(Observed = log(price)) %>%
  select(Lot.Area, Observed)

q7_1$Predicted = predict(q4_model)

q7_1 <- q7_1 %>%
  gather(y_type, price, c(Predicted, Observed))

#Data for Log(House Price) by Log(Lot Area).
q7_2 <- q6 %>%
  mutate(Observed = log(price)) %>%
  select(Lot.Area, Observed)

q7_2$Predicted = predict(q6_model)

q7_2 <- q7_2 %>%
  gather(y_type, price, c(Predicted, Observed))

#Scatter, log(price) by Lot.Area.
g7_1 <- ggplot(q7_1, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_1 <- g7_1 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_1 <- g7_1 + ggtitle(label = "")
#X-axis.
g7_1 <- g7_1 + scale_x_continuous("Lot Area", labels = comma)
#Y-axis.
g7_1 <- g7_1 + scale_y_continuous("Log(House Price)")
#Viridis Color Scheme.
g7_1 <- g7_1 + scale_color_manual(name = "Log(House Price)",
                                  values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_1 <- g7_1 + scale_shape_manual(name = "Log(House Price)",
                                  values = c(3, 4))
#Modify labels and text. Remove legend from left plot.
g7_1 <- g7_1 + theme(plot.title = element_text(hjust = 1, size = 16, face = "bold"),
                     axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                     axis.title.x = element_text(size = 14, face = "bold"),
                     axis.text.y = element_text(size = 12),
                     axis.title.y = element_text(size = 14, face = "bold"),
                     legend.position = "none")

#Scatter, log(price) by log(Lot.Area).
g7_2 <- ggplot(q7_2, aes(x = Lot.Area, y = price, color = y_type, shape = y_type))
g7_2 <- g7_2 + geom_point(size = 4, stroke = 1.5, alpha = 0.75)
#Title.
g7_2 <- g7_2 + ggtitle(label = "Log(House Price) by Lot Area and Log(Lot Area)")
#X-axis.
g7_2 <- g7_2 + scale_x_continuous("Log(Lot Area)")
#Y-axis.
g7_2 <- g7_2 + scale_y_continuous("Log(House Price)", position = "right")
#Viridis Color Scheme.
g7_2 <- g7_2 + scale_color_manual(name = "Log(House Price)",
                                  values = c(viridis(7)[6], viridis(7)[2]))
#Shape Scheme.
g7_2 <- g7_2 + scale_shape_manual(name = "Log(House Price)",
                                  values = c(3, 4))
#Modify labels and text.
g7_2 <- g7_2 + theme(plot.title = element_text(hjust = 2.1, size = 16, face = "bold"),
                     axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
                     axis.title.x = element_text(size = 14, face = "bold"),
                     axis.text.y = element_text(size = 12),
                     axis.title.y = element_text(size = 14, face = "bold"))

#Modify legend text.  Save as object.
g7_legend <- g_legend(g7_2 + theme(legend.text = element_text(size = 12),
                                   legend.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                                   legend.key.width = unit(1.5, "cm"),
                                   legend.position = "bottom"))
#Remove legend from right plot.
g7_2 <- g7_2 + theme(legend.position = "none")

#Align two plots horizontally across one row.
g7 <- plot_grid(g7_1, g7_2, align = "h", nrow = 1)
#Include legend at the bottom.
plot_grid(g7, g7_legend, ncol = 1, rel_heights = c(0.90, 0.10))