As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.
Refer to the codebook for the Ames, Iowa dataset for more information.
See the Appendix for code.
In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set; the others will be loaded and used later.
Use the code block below to load any necessary packages.
When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.
Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.
After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).
The Test and Validation sets only have observations with Sale.Condition Normal, so the Training set was filtered to houses with Sale.Condition Normal. This reduces the observations in ames_train from 1000 to 834.
Variables can be broadly grouped into categorical and numerical. Several of the categorical variables contain NAs; these were noted and will be addressed later in the analysis. See the Appendix for a full summary of the categorical and numerical variables.
PID was removed as it is not a predictor. Utilities was removed as all observations have the same value. Lot.Frontage, Mas.Vnr.Area, and Garage.Yr.Blt were removed as numerical variables with NAs. See Appendix for details.
After exploring the variables, I found the following graphs interesting enough to warrant further exploration.
Price would be expected to increase with the area of the house. In addition, one would expect larger houses to have more bedrooms. The first graph provides support for these assumptions.
Different neighborhoods in an urbanized area would be expected to vary in worth. The price of the houses by neighborhood, ascending by median, demonstrates that this appears to be true.
Lastly, I explored the year homes were built and their overall quality measurement. Although these relationships appear to be complicated, there seems to be a sharp increase in price and quality of homes after 1990.
In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.)
Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).
area, Bedroom.AbvGr, Full.Bath, Half.Bath, Neighborhood, Overall.Qual, Year.Built, and Year.Remod.Add were chosen as predictors for price. These variables seem to adequately convey what would affect the price of a home, based on common knowledge of the housing market and the Exploratory Data Analysis.
Call:
lm(formula = price ~ ., data = ames_train_prelim)
Residuals:
Min 1Q Median 3Q Max
-87397 -13313 597 11760 173401
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1604837.751 180849.802 -8.874 < 0.0000000000000002 ***
area 83.740 3.747 22.348 < 0.0000000000000002 ***
Bedroom.AbvGr -7337.244 1616.362 -4.539 0.000006518937916 ***
Full.Bath -9695.931 2938.903 -3.299 0.001013 **
Half.Bath -8926.653 2342.261 -3.811 0.000149 ***
NeighborhoodBlueste -12393.056 17699.167 -0.700 0.484005
NeighborhoodBrDale -16176.646 14157.432 -1.143 0.253539
NeighborhoodBrkSide 19145.526 11793.960 1.623 0.104916
NeighborhoodClearCr 42247.995 12769.088 3.309 0.000980 ***
NeighborhoodCollgCr 14972.693 10097.912 1.483 0.138538
NeighborhoodCrawfor 39613.160 11759.113 3.369 0.000792 ***
NeighborhoodEdwards 5535.183 10932.775 0.506 0.612792
NeighborhoodGilbert 9692.894 10711.870 0.905 0.365807
NeighborhoodGreens 7354.592 16207.875 0.454 0.650121
NeighborhoodGrnHill 99678.462 20194.272 4.936 0.000000972644229 ***
NeighborhoodIDOTRR 4923.579 12084.320 0.407 0.683798
NeighborhoodMeadowV -17429.166 12241.862 -1.424 0.154917
NeighborhoodMitchel 20693.640 10688.497 1.936 0.053215 .
NeighborhoodNAmes 20224.599 10601.972 1.908 0.056801 .
NeighborhoodNoRidge 34267.303 10986.333 3.119 0.001880 **
NeighborhoodNPkVill 4842.707 16142.031 0.300 0.764251
NeighborhoodNridgHt 36571.239 10694.927 3.419 0.000659 ***
NeighborhoodNWAmes 19002.271 10864.610 1.749 0.080677 .
NeighborhoodOldTown 2882.063 11674.701 0.247 0.805078
NeighborhoodSawyer 17428.674 10878.849 1.602 0.109539
NeighborhoodSawyerW 6338.329 10631.014 0.596 0.551205
NeighborhoodSomerst 14154.945 10324.047 1.371 0.170743
NeighborhoodStoneBr 26380.306 12155.808 2.170 0.030290 *
NeighborhoodSWISU 9118.996 13672.985 0.667 0.505008
NeighborhoodTimber 38421.753 11397.163 3.371 0.000785 ***
NeighborhoodVeenker 37130.883 12982.241 2.860 0.004346 **
Overall.Qual2 9000.734 27906.884 0.323 0.747138
Overall.Qual3 26190.108 26817.781 0.977 0.329068
Overall.Qual4 23441.471 25761.728 0.910 0.363134
Overall.Qual5 30288.255 25603.313 1.183 0.237171
Overall.Qual6 35857.873 25633.323 1.399 0.162241
Overall.Qual7 46986.134 25767.668 1.823 0.068612 .
Overall.Qual8 75643.005 26019.802 2.907 0.003749 **
Overall.Qual9 150784.489 26620.988 5.664 0.000000020679126 ***
Overall.Qual10 206006.064 29141.245 7.069 0.000000000003423 ***
Year.Built 551.884 73.495 7.509 0.000000000000161 ***
Year.Remod.Add 276.969 59.742 4.636 0.000004151533789 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25050 on 792 degrees of freedom
Multiple R-squared: 0.8858, Adjusted R-squared: 0.8799
F-statistic: 149.8 on 41 and 792 DF, p-value: < 0.00000000000000022
Based on the simple linear model of our preliminary variables, the R2 is 0.8858, which means approximately 89% of the variability in price can be explained by these eight variables for the 834 houses in the filtered training data set. Furthermore, looking at p-values, five of the six numerical variables are significant at α = 0.001, and the sixth (Full.Bath, p ≈ 0.001) is just over that threshold.
For Neighborhood and Overall Quality, the two categorical variables selected, several individual levels are significant at α = 0.01 and α = 0.001. Northridge Heights, Northridge, Green Hills, and Timber, four of the top 5 neighborhoods by price, are notable in the initial linear regression model as significant. Several other neighborhoods are also significant at various levels. Overall Quality of 8, 9, and 10 have the highest significance as predictors. This numerical analysis agrees with observations in the graphs from the Exploratory Data Analysis.
Adjusted R2 and p-values will be explored further in the upcoming Model Selection process.
Now, using either BAS or another stepwise selection procedure, choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?
Choosing backward selection by p-value, we note from the above summary that all eight predictor variables are significant at α = 0.01. No variable is eliminated using this method.
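Backward elimination by p-value is not coded in the appendix; a minimal sketch of a single pass, using per-term F tests from drop1() rather than the level-wise t tests read from the summary above (an assumption, not necessarily the exact procedure used):
#One pass of backward elimination: test the effect of dropping each whole term.
#No term has a large p-value here, so nothing is eliminated.
drop1(prelim_model, test = "F")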
We will compare this to forward selection by Adjusted R2.
Predictor_Variables Adjusted_R2
1 Overall.Qual 0.6886230
2 Overall.Qual + area 0.8092515
3 Previous Predictors + Neighborhood 0.8614690
4 Previous Predictors + Year.Built 0.8700999
5 Previous Predictors + Bedroom.AbvGr 0.8747203
6 Previous Predictors + Year.Remod.Add 0.8773952
7 Previous Predictors + Half.Bath 0.8783981
8 Previous Predictors + Full.Bath 0.8798951
Since every variable in the initial selection increases the Adjusted R2, it is left in the model.
Backward selection by p-values emphasizes the significance of the predictors, and forward selection by Adjusted R2 emphasizes the variability of price that can be explained by these predictors. Both selection processes yield the same result, which signifies a good starting point for the rest of the analysis.
One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.
The residual plots for all numerical predictors are shown below. For Area, Number of Bedrooms, and Number of Bathrooms, the residuals appear to be randomly dispersed about y = 0. The vast majority of houses have 2-4 bedrooms, 1 or 2 full bathrooms, and 1 or 2 half bathrooms, so there is a higher concentration and spread at those values. For Year Built and Remodel Year, an interesting pattern of heteroscedasticity may be emerging: as the year built or remodel year increases, so does the spread of the residuals. This seems to be more pronounced for Year Built than for Remodel Year. It will be addressed in the revised model.
Using the definition of outliers as residuals that are outside of three times the standard deviation of the residuals, there are nine positive outliers and two negative outliers. This is about 1.3% of the remaining 834 observations. There is no need to remove these observations (highlighted in blue), but they will be noted as the analysis continues.
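The tally itself is not shown in the appendix code; a minimal sketch, assuming the fitted prelim_model from above:
#Count residuals beyond three standard deviations on each side.
res <- residuals(prelim_model)
c(positive = sum(res > 3 * sd(res)), negative = sum(res < -3 * sd(res)))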
Calculate the RMSE of your initial model. You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).
The RMSE of the training data using the initial model is $24,407.13. This represents the spread, or noise, of the residuals.
[1] 24407.13
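This calculation does not appear in the appendix; a minimal sketch, assuming the fitted prelim_model:
#RMSE of the training data, in dollars, computed directly from the residuals.
sqrt(mean(residuals(prelim_model)^2))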
The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.
Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?
First, it is important to note that one observation was removed. In the test data, there is a house in the Neighborhood Landmark. Landmark was not in the training data, and is thus not an individual level in the preliminary prediction model.
# A tibble: 1 x 9
price area Bedroom.AbvGr Half.Bath Full.Bath Neighborhood Overall.Qual
<int> <int> <int> <int> <int> <fct> <fct>
1 137000 1320 3 1 2 Landmrk 6
# ... with 2 more variables: Year.Built <int>, Year.Remod.Add <int>
The RMSE of the rest of the testing data using the initial model is $25,674.43. This is about $1,200 more than that of the training data. Since there is more spread, or noise, the preliminary prediction model is more accurate on the training data than the testing data, signifying that overfitting may be an issue.
[1] 25674.43
Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.
Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.
Carefully document the process that you used to come up with your final model, so that you can answer the questions below.
Did you decide to transform any variables? Why or why not? Explain in a few sentences.
First, the dependent variable, price, is log transformed. It is clear from ames_train and ames_test that price is right skewed.
To account for the possible heteroscedasticity observed in the Initial Model Residuals, Year.Built and Year.Remod.Add were converted to ages by subtracting from 2020, and then log transformed. Several other variables were log transformed to see if an improvement occurred.
Predictor_Variables R2_of_Predictor R2_of_log_Predictor
1 Year.Built 0.36720772 0.42388630
2 Year.Remod.Add 0.33864569 0.31292128
3 area 0.55743101 0.57449584
4 Lot.Area 0.06387691 0.15652010
5 Garage.Area 0.41015147 0.20452499
6 Total.Bsmt.SF 0.44791613 0.15621816
7 X1st.Flr.SF 0.42528235 0.42101737
8 X2nd.Flr.SF 0.10275480 0.03124951
Of the numerical variables shown above, House.Age (derived from Year.Built), area, and Lot.Area show an increase in R2 as a single predictor upon being log transformed.
Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.
X1st.Flr.SF and X2nd.Flr.SF were temporarily combined (see Variable Selection below), but not for the final model, as a coefficient for their sum is already implied by the model by definition. Just as there are several variables related to area, there are several variables related to the number and types of rooms; any sum would be unnecessary.
Yr.Sold and Mo.Sold were combined into a Date.Sold variable. The upcoming analysis will demonstrate whether or not it is a useful predictor.
Many categorical variables had 75% or more of the observations in one level. These often overlapped with the majority level in another categorical variable. However, no combinations were made.
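A minimal sketch of how that level dominance can be checked, assuming the factor-converted ames_train from the appendix:
#Share of observations in the most common level of each factor variable.
dominance <- sapply(Filter(is.factor, ames_train), function(f) max(table(f)) / length(f))
sort(dominance, decreasing = TRUE)[1:10]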
What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.
The correlations of several variables related to area and number of rooms are charted below. Notice that the combined area of X1st.Flr.SF and X2nd.Flr.SF is virtually identical to area, so area will be removed. Garage.Cars is removed as it is very similar to Garage.Area. A few other correlated variables will be kept in mind as the analysis continues.
The following output is the first twenty predictors from forward selection by Adjusted R2. These will be used to select the final model.
Predictor_Variables Adjusted_R2
1 Overall.Qual 0.6751866
2 Overall.Qual + Neighborhood 0.7696939
3 Previous Predictors + X1st.Flr.SF 0.8184783
4 Previous Predictors + X2nd.Flr.SF 0.8716027
5 Previous Predictors + BsmtFin.Type.1 0.8932844
6 Previous Predictors + Overall.Cond 0.9072078
7 Previous Predictors + House.Age 0.9185515
8 Previous Predictors + Lot.Area 0.9292977
9 Previous Predictors + MS.Zoning 0.9333688
10 Previous Predictors + BsmtFin.SF.1 0.9376955
11 Previous Predictors + Central.Air 0.9406146
12 Previous Predictors + Condition.1 0.9427471
13 Previous Predictors + BsmtFin.SF.2 0.9448202
14 Previous Predictors + Exterior.1st 0.9461936
15 Previous Predictors + Garage.Area 0.9474239
16 Previous Predictors + Bsmt.Unf.SF 0.9486238
17 Previous Predictors + Fireplace.Qu 0.9501178
18 Previous Predictors + Kitchen.Qual 0.9511929
19 Previous Predictors + Functional 0.9526140
20 Previous Predictors + Year.Remod.Add 0.9534796
How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.
Similar to Section 2, several observations were removed from ames_test. In the test data, there is a house in the neighborhood Landmark; Landmark was not in the training data, and is thus not an individual level in the model. In addition, two observations with level AsphShn in Exterior.1st were removed from ames_test, as no observations in ames_train had that level.
# A tibble: 2 x 6
PID price Exterior.1st Neighborhood Lot.Area Year.Built
<int> <dbl> <fct> <fct> <dbl> <int>
1 908102130 11.5 AsphShn Edwards 9.19 1965
2 903481100 11.3 AsphShn IDOTRR 9.24 1915
As more predictors are included, the RMSE of ames_train decreases. The RMSE of ames_test shows a general decreasing trend; however, it increases with the addition of the 9th variable, and the difference in RMSE between ames_train and ames_test increases to over $2,000.
Number_of_Predictors Root_Mean_Square_Error Dataset
7 19696.70 Ames Train
7 21068.62 Ames Test
8 17890.80 Ames Train
8 19206.40 Ames Test
9 17720.38 Ames Train
9 19447.75 Ames Test
10 16269.88 Ames Train
10 18440.39 Ames Test
11 16221.88 Ames Train
11 18306.83 Ames Test
I selected the first eight variables from above.
The eight predictors, in order, are Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1, Overall.Cond, the log transform of House.Age (from Year.Built), and the log transform of Lot.Area.
Provide the summary table for your model.
Call:
lm(formula = price ~ ., data = ames_train_final3)
Residuals:
Min 1Q Median 3Q Max
-0.77174 -0.05640 0.00396 0.06037 0.30304
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.33262639 0.17149534 60.250 < 0.0000000000000002 ***
Overall.Qual2 0.16811555 0.12215389 1.376 0.169136
Overall.Qual3 0.20168669 0.11470901 1.758 0.079098 .
Overall.Qual4 0.27234093 0.11172534 2.438 0.015008 *
Overall.Qual5 0.33478896 0.11120087 3.011 0.002691 **
Overall.Qual6 0.38751159 0.11172788 3.468 0.000552 ***
Overall.Qual7 0.45213153 0.11232258 4.025 0.000062454025060 ***
Overall.Qual8 0.52873248 0.11339681 4.663 0.000003671059772 ***
Overall.Qual9 0.69849569 0.11585908 6.029 0.000000002546870 ***
Overall.Qual10 0.75439072 0.12555296 6.009 0.000000002870934 ***
NeighborhoodBlueste -0.01050263 0.07339958 -0.143 0.886257
NeighborhoodBrDale -0.09695169 0.05849983 -1.657 0.097861 .
NeighborhoodBrkSide -0.02796181 0.04870488 -0.574 0.566061
NeighborhoodClearCr 0.02560836 0.05603476 0.457 0.647792
NeighborhoodCollgCr -0.06617316 0.04252900 -1.556 0.120125
NeighborhoodCrawfor 0.07528602 0.04950458 1.521 0.128718
NeighborhoodEdwards -0.10236415 0.04562363 -2.244 0.025134 *
NeighborhoodGilbert -0.06436173 0.04536566 -1.419 0.156376
NeighborhoodGreens 0.15927139 0.06689129 2.381 0.017503 *
NeighborhoodGrnHill 0.44547165 0.08355449 5.332 0.000000127652051 ***
NeighborhoodIDOTRR -0.13623368 0.05000315 -2.725 0.006584 **
NeighborhoodMeadowV -0.13399780 0.05061391 -2.647 0.008274 **
NeighborhoodMitchel -0.03096376 0.04528050 -0.684 0.494291
NeighborhoodNAmes -0.03214116 0.04452960 -0.722 0.470638
NeighborhoodNoRidge 0.01098184 0.04663098 0.236 0.813878
NeighborhoodNPkVill -0.00198758 0.06645671 -0.030 0.976148
NeighborhoodNridgHt -0.01047609 0.04447277 -0.236 0.813834
NeighborhoodNWAmes -0.07938304 0.04610364 -1.722 0.085495 .
NeighborhoodOldTown -0.09886232 0.04799814 -2.060 0.039757 *
NeighborhoodSawyer -0.04839336 0.04572718 -1.058 0.290244
NeighborhoodSawyerW -0.09970701 0.04455661 -2.238 0.025518 *
NeighborhoodSomerst -0.00422347 0.04260388 -0.099 0.921058
NeighborhoodStoneBr 0.01085850 0.05052940 0.215 0.829906
NeighborhoodSWISU -0.03610013 0.05592865 -0.645 0.518814
NeighborhoodTimber -0.03593131 0.04812336 -0.747 0.455500
NeighborhoodVeenker 0.04563645 0.05468488 0.835 0.404235
X1st.Flr.SF 0.00039566 0.00001555 25.443 < 0.0000000000000002 ***
X2nd.Flr.SF 0.00028604 0.00001153 24.811 < 0.0000000000000002 ***
BsmtFin.Type.1BLQ -0.02336231 0.01480578 -1.578 0.114990
BsmtFin.Type.1GLQ 0.01218316 0.01290092 0.944 0.345276
BsmtFin.Type.1LwQ -0.07221175 0.01865225 -3.871 0.000117 ***
BsmtFin.Type.1Rec -0.05011208 0.01433919 -3.495 0.000501 ***
BsmtFin.Type.1Unf -0.09061384 0.01247227 -7.265 0.000000000000904 ***
BsmtFin.Type.1NA -0.25408661 0.02748213 -9.246 < 0.0000000000000002 ***
Overall.Cond2 0.32091864 0.13144029 2.442 0.014845 *
Overall.Cond3 0.18378116 0.08620238 2.132 0.033321 *
Overall.Cond4 0.32280192 0.07994626 4.038 0.000059297735739 ***
Overall.Cond5 0.39726544 0.07946006 5.000 0.000000710019770 ***
Overall.Cond6 0.44944305 0.07948361 5.655 0.000000021928515 ***
Overall.Cond7 0.50071436 0.07961647 6.289 0.000000000531573 ***
Overall.Cond8 0.52777254 0.08055204 6.552 0.000000000103105 ***
Overall.Cond9 0.52776568 0.08545613 6.176 0.000000001058321 ***
House.Age -0.18550638 0.01475237 -12.575 < 0.0000000000000002 ***
Lot.Area 0.11868952 0.01084815 10.941 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1015 on 780 degrees of freedom
Multiple R-squared: 0.9338, Adjusted R-squared: 0.9293
F-statistic: 207.6 on 53 and 780 DF, p-value: < 0.00000000000000022
For your final model, create and briefly interpret an informative plot of the residuals.
For all four numerical predictors, the residuals appear to be randomly scattered around 0. Outliers are highlighted in blue.
There is a slight left skew, but the majority of the residuals fit a normal distribution.
From both plots, it is clear there are about 15-20 fitted values whose residuals deviate from the others. Considering this is only approximately 1.7-2.3% of the data in the model, the residuals otherwise appear to have constant variability.
The houses were chosen as “individual residential properties sold in Ames, IA from 2006 to 2010”. The houses selected, and thus residuals, appear to be independent of one another.
For your final model, calculate and briefly comment on the RMSE.
The RMSE of ames_train is $17,890.80, and the RMSE of ames_test is $19,201.56. The RMSE of ames_test has changed from above because the two observations with level AsphShn in Exterior.1st that were previously removed from ames_test are now included, as Exterior.1st is no longer a predictor.
The RMSE of the test set is about $1,300 more than that of the training set; using more variables as predictors shows signs of overfitting. The analyzed model, however, has much less spread than the initial model from Part 2, where both RMSE values are greater than $24,000.
[1] 17890.8
[1] 19201.56
What are some strengths and weaknesses of your model?
The final model has an R2 of 0.9338: approximately 93% of the variance in price is explained by eight of the 81 variables. Furthermore, each of the eight predictors, whether numerical or through at least one level of a categorical variable, is significant at the α = 0.001 level.
The model is a very good fit for most of the data. However, the model is fit only for homes with Sale.Condition of Normal. While the initial training set was a standard 1000 houses, filtering left only 834 to create the model, a small fraction (approximately 3.3%) of all the homes in Ames. It is important to stress that this model is specific to Ames, and its predictive power would vary greatly across the US.
Many categorical variables have some levels with a large proportion of the observations, and other levels with few or none. Neighborhood and Overall.Qual are two examples, and also the first two predictors identified by forward selection. It is possible that more houses in the sparsely populated levels could improve the model.
Instead of Neighborhood, I think latitude and longitude values for each house, or perhaps zoning map sections, would be a more accurate predictor.
Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.
You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, and be sure to mention:
The RMSE of ames_validation is $19,397.47, which is extremely similar to the RMSE of ames_test at $19,201.56. This consistency is a good indicator that the model is a reasonable fit.
[1] 19397.47
The proportion of ames_test prices that fall within the 95% prediction interval is 94.9%. The proportion of ames_validation prices that fall within the 95% prediction interval is 94.6%.
Because both of these values are extremely close to 95%, the final model properly reflects uncertainty.
[1] 0.9485294
[1] 0.9462647
Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.
The model provided by this analysis accurately predicts the price of approximately 95% of the sample houses in Ames, IA to within about $39,000 (roughly twice the validation RMSE).
The most surprising predictor to me was BsmtFin.Type.1. I skipped over it in the preliminary model without any hesitation.
During my steps, I modified the data and model several times. I removed outliers, removed levels, changed the order of predictors, and included extra predictors. I found that the model chosen is quite robust and parsimonious.
While not collinear, many of the variables show some significant relationship with one another. I found it interesting how quickly the RMSE converged and how little it varied after fifteen variables. Most of the variables added very little to the model.
An interesting aspect of the data is that the observations were generally not uniformly distributed across variables, specifically the categorical variables. For example, over 25% of homes are less than 20 years old, and over 50% are less than 50 years old. Well over 90% of houses have an overall condition around average; very few are poor or exceptional. 99% of houses do not have a pool. 80% of houses have no fence. Over 75% of houses have a gable roof style. There are many other examples. This seems to be a reason why many variables added little to no change to the model.
Finally, I learned to not underestimate log transforming variables, including the dependent variable.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
load("ames_train.Rdata")
library(MASS)
library(BAS)
library(broom)
library(dplyr)
library(expss)
library(forcats)
library(GGally)
library(ggplot2)
library(grid)
library(gridExtra)
library(Metrics)
library(scales)
library(statsr)
library(tidyr)
library(viridis)
#Numerical categories were converted to factor variables.
ames_train$Overall.Qual <- as.factor(ames_train$Overall.Qual)
ames_train$Overall.Cond <- as.factor(ames_train$Overall.Cond)
Only level Normal is considered for Sale.Condition.
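The filtering step and the ames_train_EDA working copy used below are not shown; a plausible reconstruction:
ames_train <- ames_train %>%
filter(Sale.Condition == "Normal")
#Working copy for the EDA summaries below.
ames_train_EDA <- ames_train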
All Numerical Variables.
PID area price MS.SubClass
Min. : 526302030 Min. : 334 Min. : 39300 Min. : 20.00
1st Qu.: 531458660 1st Qu.:1077 1st Qu.:129000 1st Qu.: 20.00
Median : 535454145 Median :1383 Median :155500 Median : 50.00
Mean : 713844613 Mean :1450 Mean :174622 Mean : 57.86
3rd Qu.: 907199140 3rd Qu.:1718 3rd Qu.:205000 3rd Qu.: 70.00
Max. :1007100110 Max. :3672 Max. :615000 Max. :190.00
Lot.Frontage Lot.Area Year.Built Year.Remod.Add
Min. : 21.00 Min. : 1470 Min. :1872 Min. :1950
1st Qu.: 56.00 1st Qu.: 7206 1st Qu.:1954 1st Qu.:1965
Median : 69.00 Median : 9202 Median :1972 Median :1990
Mean : 68.51 Mean : 10228 Mean :1970 Mean :1983
3rd Qu.: 80.00 3rd Qu.: 11419 3rd Qu.:1997 3rd Qu.:2002
Max. :313.00 Max. :215245 Max. :2009 Max. :2009
NA's :156
Mas.Vnr.Area BsmtFin.SF.1 BsmtFin.SF.2 Bsmt.Unf.SF
Min. : 0.00 Min. : 0.0 Min. : 0.00 Min. : 0.0
1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.: 209.0
Median : 0.00 Median : 402.0 Median : 0.00 Median : 433.0
Mean : 95.62 Mean : 453.3 Mean : 52.44 Mean : 517.7
3rd Qu.: 143.00 3rd Qu.: 747.0 3rd Qu.: 0.00 3rd Qu.: 747.8
Max. :1290.00 Max. :2257.0 Max. :1526.00 Max. :2336.0
NA's :4
Total.Bsmt.SF X1st.Flr.SF X2nd.Flr.SF Low.Qual.Fin.SF
Min. : 0 Min. : 334 Min. : 0.0 Min. : 0.00
1st Qu.: 784 1st Qu.: 864 1st Qu.: 0.0 1st Qu.: 0.00
Median : 974 Median :1058 Median : 0.0 Median : 0.00
Mean :1023 Mean :1127 Mean : 318.2 Mean : 4.52
3rd Qu.:1248 3rd Qu.:1337 3rd Qu.: 689.0 3rd Qu.: 0.00
Max. :2846 Max. :2696 Max. :1836.0 Max. :1064.00
Bsmt.Full.Bath Bsmt.Half.Bath Full.Bath Half.Bath
Min. :0.0000 Min. :0.00000 Min. :0.000 Min. :0.0000
1st Qu.:0.0000 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.0000
Median :0.0000 Median :0.00000 Median :1.000 Median :0.0000
Mean :0.4365 Mean :0.06115 Mean :1.505 Mean :0.3777
3rd Qu.:1.0000 3rd Qu.:0.00000 3rd Qu.:2.000 3rd Qu.:1.0000
Max. :2.0000 Max. :2.00000 Max. :4.000 Max. :2.0000
Bedroom.AbvGr Kitchen.AbvGr TotRms.AbvGrd Fireplaces
Min. :0.000 Min. :1.000 Min. : 2.000 Min. :0.0000
1st Qu.:2.000 1st Qu.:1.000 1st Qu.: 5.000 1st Qu.:0.0000
Median :3.000 Median :1.000 Median : 6.000 Median :1.0000
Mean :2.808 Mean :1.038 Mean : 6.272 Mean :0.5923
3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.: 7.000 3rd Qu.:1.0000
Max. :6.000 Max. :2.000 Max. :13.000 Max. :4.0000
Garage.Yr.Blt Garage.Cars Garage.Area Wood.Deck.SF
Min. :1900 Min. :0.000 Min. : 0.0 Min. : 0.00
1st Qu.:1960 1st Qu.:1.000 1st Qu.: 308.0 1st Qu.: 0.00
Median :1977 Median :2.000 Median : 472.5 Median : 0.00
Mean :1976 Mean :1.724 Mean : 460.9 Mean : 93.18
3rd Qu.:1999 3rd Qu.:2.000 3rd Qu.: 573.5 3rd Qu.:168.00
Max. :2009 Max. :5.000 Max. :1356.0 Max. :857.00
NA's :36
Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch
Min. : 0.00 Min. : 0 Min. : 0.000 Min. : 0.00
1st Qu.: 0.00 1st Qu.: 0 1st Qu.: 0.000 1st Qu.: 0.00
Median : 25.00 Median : 0 Median : 0.000 Median : 0.00
Mean : 45.31 Mean : 24 Mean : 3.191 Mean : 14.87
3rd Qu.: 71.50 3rd Qu.: 0 3rd Qu.: 0.000 3rd Qu.: 0.00
Max. :341.00 Max. :432 Max. :508.000 Max. :440.00
Pool.Area Misc.Val Mo.Sold Yr.Sold
Min. : 0.000 Min. : 0.00 Min. : 1.000 Min. :2006
1st Qu.: 0.000 1st Qu.: 0.00 1st Qu.: 5.000 1st Qu.:2007
Median : 0.000 Median : 0.00 Median : 6.000 Median :2008
Mean : 1.754 Mean : 53.12 Mean : 6.159 Mean :2008
3rd Qu.: 0.000 3rd Qu.: 0.00 3rd Qu.: 8.000 3rd Qu.:2009
Max. :800.000 Max. :15500.00 Max. :12.000 Max. :2010
All Factor Variables.
MS.Zoning Street Alley Lot.Shape Land.Contour Utilities
A (agr): 0 Grvl: 2 Grvl: 28 IR1:286 Bnk: 29 AllPub:834
C (all): 5 Pave:832 Pave: 25 IR2: 24 HLS: 29 NoSeWa: 0
FV : 35 NA's:781 IR3: 2 Low: 19 NoSewr: 0
I (all): 1 Reg:522 Lvl:757
RH : 6
RL :656
RM :131
Lot.Config Land.Slope Neighborhood Condition.1 Condition.2
Corner :145 Gtl:800 NAmes :140 Norm :732 Norm :825
CulDSac: 63 Mod: 31 CollgCr: 75 Feedr : 46 Feedr : 6
FR2 : 32 Sev: 3 OldTown: 62 Artery : 19 Artery : 1
FR3 : 4 Sawyer : 57 PosN : 9 PosN : 1
Inside :590 Edwards: 50 RRAn : 9 RRNn : 1
Somerst: 45 RRAe : 8 PosA : 0
(Other):405 (Other): 11 (Other): 0
Bldg.Type House.Style Overall.Qual Overall.Cond Roof.Style
1Fam :687 1Story :422 5 :268 5 :441 Flat : 6
2fmCon: 18 2Story :239 6 :209 6 :170 Gable :654
Duplex: 26 1.5Fin : 85 7 :168 7 :123 Gambrel: 8
Twnhs : 33 SLvl : 38 8 : 88 8 : 46 Hip :163
TwnhsE: 70 SFoyer : 32 4 : 58 4 : 30 Mansard: 3
2.5Unf : 10 9 : 24 3 : 11 Shed : 0
(Other): 8 (Other): 19 (Other): 13
Roof.Matl Exterior.1st Exterior.2nd Mas.Vnr.Type Exter.Qual Exter.Cond
CompShg:822 VinylSd:262 VinylSd:259 : 4 Ex: 20 Ex: 4
Tar&Grv: 8 HdBoard:145 HdBoard:134 BrkCmn : 6 Fa: 8 Fa: 16
WdShngl: 2 MetalSd:130 MetalSd:132 BrkFace:263 Gd:266 Gd:103
Metal : 1 Wd Sdng:125 Wd Sdng:119 CBlock : 0 TA:540 Po: 0
WdShake: 1 Plywood: 66 Plywood: 85 None :516 TA:711
ClyTile: 0 BrkFace: 32 CmentBd: 30 Stone : 45
(Other): 0 (Other): 74 (Other): 75
Foundation Bsmt.Qual Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2
BrkTil: 88 : 0 : 0 : 1 GLQ :236 Unf :711
CBlock:381 Ex : 52 Ex : 2 Av :128 Unf :217 Rec : 28
PConc :351 Fa : 22 Fa : 19 Gd : 75 ALQ :144 LwQ : 25
Slab : 11 Gd :350 Gd : 29 Mn : 72 Rec : 92 BLQ : 23
Stone : 3 Po : 1 Po : 1 No :538 BLQ : 80 ALQ : 16
Wood : 0 TA :389 TA :763 NA's: 20 (Other): 45 (Other): 11
NA's: 20 NA's: 20 NA's : 20 NA's : 20
Heating Heating.QC Central.Air Electrical Kitchen.Qual Functional
Floor: 0 Ex:407 N: 45 : 0 Ex: 41 Typ :777
GasA :824 Fa: 18 Y:789 FuseA: 44 Fa: 19 Min2 : 21
GasW : 7 Gd:145 FuseF: 12 Gd:327 Min1 : 16
Grav : 1 Po: 1 FuseP: 2 Po: 1 Mod : 14
OthW : 1 TA:263 Mix : 0 TA:446 Maj1 : 4
Wall : 1 SBrkr:776 Maj2 : 2
(Other): 0
Fireplace.Qu Garage.Type Garage.Finish Garage.Qual Garage.Cond Paved.Drive
Ex : 10 2Types : 8 : 1 : 0 : 0 N: 61
Fa : 22 Attchd :505 Fin :185 Ex : 1 Ex : 1 P: 28
Gd :169 Basment: 8 RFn :231 Fa : 32 Fa : 19 Y:745
Po : 18 BuiltIn: 43 Unf :382 Gd : 5 Gd : 5
TA :201 CarPort: 1 NA's: 35 Po : 3 Po : 5
NA's:414 Detchd :234 TA :757 TA :768
NA's : 35 NA's: 36 NA's: 36
Pool.QC Fence Misc.Feature Sale.Type
Ex : 1 GdPrv: 39 Elev: 0 WD :797
Fa : 1 GdWo : 33 Gar2: 2 COD : 16
Gd : 1 MnPrv:105 Othr: 1 ConLD : 5
TA : 0 MnWw : 2 Shed: 24 ConLw : 5
NA's:831 NA's :655 TenC: 1 Con : 4
NA's:806 ConLI : 3
(Other): 4
Numerical Variables with NAs.
ames_train_NAs <- ames_train_EDA[colSums(is.na(ames_train_EDA)) != 0]
summary(ames_train_NAs[!(sapply(ames_train_NAs, is.factor))])
Lot.Frontage Mas.Vnr.Area Garage.Yr.Blt
Min. : 21.00 Min. : 0.00 Min. :1900
1st Qu.: 56.00 1st Qu.: 0.00 1st Qu.:1960
Median : 69.00 Median : 0.00 Median :1977
Mean : 68.51 Mean : 95.62 Mean :1976
3rd Qu.: 80.00 3rd Qu.: 143.00 3rd Qu.:1999
Max. :313.00 Max. :1290.00 Max. :2009
NA's :156 NA's :4 NA's :36
Remove Variables.
ames_train_EDA <- ames_train_EDA %>%
select(-PID, -Utilities, -Lot.Frontage, -Mas.Vnr.Area, -Garage.Yr.Blt)
Factor Variables with NAs.
ames_train_NAs <- ames_train_EDA[colSums(is.na(ames_train_EDA)) != 0]
sort(colSums(is.na(ames_train_NAs)), decreasing = TRUE)
Pool.QC Misc.Feature Alley Fence Fireplace.Qu
831 806 781 655 414
Garage.Qual Garage.Cond Garage.Type Garage.Finish Bsmt.Qual
36 36 35 35 20
Bsmt.Cond Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2
20 20 20 20
Add NA as a factor level.
ames_train$Alley <- addNA(ames_train$Alley)
ames_train$Bsmt.Qual <- addNA(ames_train$Bsmt.Qual)
ames_train$Bsmt.Cond <- addNA(ames_train$Bsmt.Cond)
ames_train$Bsmt.Exposure <- addNA(ames_train$Bsmt.Exposure)
ames_train$BsmtFin.Type.1 <- addNA(ames_train$BsmtFin.Type.1)
ames_train$BsmtFin.Type.2 <- addNA(ames_train$BsmtFin.Type.2)
ames_train$Fireplace.Qu <- addNA(ames_train$Fireplace.Qu)
ames_train$Garage.Type <- addNA(ames_train$Garage.Type)
ames_train$Garage.Finish <- addNA(ames_train$Garage.Finish)
ames_train$Garage.Qual <- addNA(ames_train$Garage.Qual)
ames_train$Garage.Cond <- addNA(ames_train$Garage.Cond)
ames_train$Fence <- addNA(ames_train$Fence)
ames_train$Misc.Feature <- addNA(ames_train$Misc.Feature)
ames_train$Pool.QC <- addNA(ames_train$Pool.QC)
Price vs Area and Number of Bedrooms.
g1 <- ggplot(ames_train, aes(x = area, y = price, color = as.factor(Bedroom.AbvGr)))
#Scatter and Regression Lines.
g1 <- g1 + geom_point(size = 2, alpha = 0.7) + geom_smooth(formula = y ~ x, method = 'lm', se = FALSE)
#Title.
g1 <- g1 + ggtitle("Price vs Area and Number of Bedrooms")
#X-axis
g1 <- g1 + scale_x_continuous(name = "Area (sq. ft.)", labels = comma)
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Price", labels = dollar)
#Viridis Color Scheme.
g1 <- g1 + scale_color_viridis("Bedrooms", discrete = TRUE)
#Modify labels and text.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"))
g1
Price vs Neighborhood.
g2 <- ggplot(ames_train, aes(x = fct_reorder(Neighborhood, price, .fun = 'median'), y = price))
#Violin and Box Plot.
g2 <- g2 + geom_violin(color = viridis(7)[5], width = 1.3) + geom_boxplot(color = viridis(7)[3], width = 0.2)
#Title.
g2 <- g2 + ggtitle("Price vs Neighborhood")
#X-axis
g2 <- g2 + scale_x_discrete(name = "Neighborhood")
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Price", labels = dollar)
#Modify labels and text.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"),
legend.position = "none")
g2
Price vs Year Built and Overall Quality.
g3 <- ggplot(data = ames_train, mapping = aes(x = Year.Built, y = price, color = Overall.Qual))
#Violin & Box Plot.
g3 <- g3 + geom_point(size = 2, alpha = 0.7)
#Title.
g3 <- g3 + ggtitle("Price vs Year Built and Overall Quality")
#X-axis
g3 <- g3 + scale_x_continuous("Year Built")
#Y-axis.
g3 <- g3 + scale_y_continuous("Price", labels = dollar)
#Viridis Color Scheme.
g3 <- g3 + scale_color_viridis("Overall Quality", discrete = TRUE,
labels = c("Very Poor", "Poor", "Fair", "Below Average", "Average",
"Above Average", "Good", "Very Good", "Excellent", "Very Excellent"))
#Modify labels and text.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(hjust = 0.5, size = 12, face = "bold"))
g3
Subset preliminary model.
ames_train_prelim <- ames_train %>%
select(price,
area,
Bedroom.AbvGr,
Full.Bath,
Half.Bath,
Neighborhood,
Overall.Qual,
Year.Built,
Year.Remod.Add)
prelim_model <- lm(price ~ ., data = ames_train_prelim)
summary(prelim_model)
#Takes a data frame with price in the first column and greedily chooses, one step at a
#time, the next predictor that yields the highest Adjusted R2.
#Returns a data frame listing the selected variables, along with previously chosen
#variables, and the corresponding Adjusted R squared at each step.
forward_r2 <- function(df) {
#Number of variables.
num_of_vars <- dim(df)[2]
#Number of predictors calculated.
predictor_num <- 1
#Adjusted R2 to be greater than.
prev_R2 <- 0
#Loop flag: continue while the Adjusted R2 improves.
improvement <- TRUE
#Accumulators for the chosen predictors and their Adjusted R2 values.
final_predictors <- NULL
final_r2 <- NULL
#Iterate while Adjusted R2 increases.
while (improvement) {
#Lists are initialized as empty
predictors <- NULL
positions <- NULL
r2 <- NULL
for (x in c((predictor_num + 1):num_of_vars)) {
positions = c(positions, x)
#If we are selecting the first predictor, only include that one.
#Otherwise, include all predictors chosen thus far and the selected.
if (predictor_num == 1) {
predictors <- c(predictors, colnames(df[x]))
} else if (predictor_num == 2) {
predictors <- c(predictors, paste(colnames(df[2]),
colnames(df[x]),
sep = " + "))
} else {
predictors <- c(predictors, paste("Previous Predictors",
colnames(df[x]),
sep = " + "))
}
#Make a temp dataframe with only the chosen predictors and the current selected one.
temp_df <- df %>%
select(colnames(df[c(1:predictor_num, x)]))
#Retrieve the Adjusted R squared from that model.
temp_r2 <- summary(lm(price ~ ., data = temp_df))$adj.r.squared
r2 <- c(r2, temp_r2)
}
#Create a dataframe from the three lists.
r2_values <- data.frame("Predictor_Variables" = predictors,
"Positions" = positions,
"Adjusted_R2" = r2) %>%
arrange(desc(r2))
high_pred <- r2_values$Predictor_Variables[1]
pred_pos <- r2_values$Positions[1]
high_r2 <- r2_values$Adjusted_R2[1]
if (high_r2 > prev_R2) {
#Add the predictor(s) with the highest Adjusted R2 thus far.
final_predictors <- c(final_predictors, high_pred)
final_r2 <- c(final_r2, high_r2)
#Column 1 is price, so at most num_of_vars - 1 predictors can be chosen.
if (predictor_num < num_of_vars - 1) {
if (pred_pos == num_of_vars) {
#Reorder the df for the next iteration.
df <- df[c(1 : predictor_num, pred_pos, (predictor_num + 1) : (pred_pos - 1))]
} else if ((predictor_num + 1) < pred_pos) {
#Reorder the df for the next iteration.
df <- df[c(1 : predictor_num, pred_pos, (predictor_num + 1) : (pred_pos - 1), (pred_pos + 1) : num_of_vars)]
}
#Increase the number of predictors for next iteration.
predictor_num <- predictor_num + 1
#Increase the Adjusted R2 for next iteration.
prev_R2 <- high_r2
} else {
#No more variables remaining.
improvement <- FALSE
}
} else {
#Adjusted R2 did not increase.
improvement <- FALSE
}
}
model_selection <- data.frame("Predictor_Variables" = final_predictors,
"Adjusted_R2" = final_r2)
model_selection
}
#Model Selection.
forward_r2(ames_train_prelim)
Define outliers and graph helper.
#Include columns for residuals and outliers.
ames_train_prelim <- ames_train_prelim %>%
mutate(price_resid = residuals(prelim_model),
outliers = ifelse((price_resid >= 0) & (price_resid >= 3*sd(price_resid)), "yes",
ifelse((price_resid <= 0) & (price_resid <= -3*sd(price_resid)), "yes", "no")))
#Appends identical attributes to residual plots.
residual_gg_helper <- function(gg_object) {
#Scatter
gg_object <- gg_object + geom_point(aes(color = outliers, alpha = outliers), size = 2)
#Best Fit.
gg_object <- gg_object + stat_smooth(formula = y ~ x, method = "lm", color = viridis(8)[1], size = 0.6, se = FALSE)
#Viridis Color Scheme
gg_object <- gg_object + scale_color_manual(values = c(viridis(8)[5], viridis(8)[3]))
gg_object <- gg_object + scale_alpha_manual(values = c(0.3, 1))
#Modify labels and text.
gg_object <- gg_object + theme(plot.title = element_text(hjust = 0.5, size = 10, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 8, angle = 55),
axis.title.x = element_text(size = 8, face = "bold"),
axis.text.y = element_text(size = 8),
axis.title.y = element_text(hjust = 0.5, size = 8, face = "bold"),
legend.position = "none")
return(gg_object)
}
Residual Values vs Area.
g4 <- ggplot(ames_train_prelim, aes(x = area, y = price_resid))
#Title.
g4 <- g4 + ggtitle("Residual Value vs Area")
#X-axis & Y-axis.
g4 <- g4 + scale_x_continuous("Area (sq. ft.)", labels = comma) + ylab("Residual Value")
#Additional attributes.
g4 <- residual_gg_helper(g4)
Residual Values vs Number of Bedrooms.
g5 <- ggplot(ames_train_prelim, aes(x = Bedroom.AbvGr, y = price_resid))
#Title.
g5 <- g5 + ggtitle("Residual Value vs Number of Bedrooms")
#X-axis & Y-axis.
g5 <- g5 + scale_x_continuous("Number of Bedrooms") + ylab("Residual Value")
#Additional attributes.
g5 <- residual_gg_helper(g5)
Residual Values vs Year Built.
g6 <- ggplot(ames_train_prelim, aes(x = Year.Built, y = price_resid))
#Title.
g6 <- g6 + ggtitle("Residual Value vs Year Built")
#X-axis & Y-axis.
g6 <- g6 + scale_x_continuous("Year Built") + ylab("Residual Value")
#Additional attributes.
g6 <- residual_gg_helper(g6)
Residual Values vs Remodel Year.
g7 <- ggplot(ames_train_prelim, aes(x = Year.Remod.Add, y = price_resid))
#Title.
g7 <- g7 + ggtitle("Residual Value vs Remodel Year")
#X-axis & Y-axis.
g7 <- g7 + scale_x_continuous("Remodel Year") + ylab("Residual Value")
#Additional attributes.
g7 <- residual_gg_helper(g7)
Residual Values vs Number of Full Bathrooms.
g8 <- ggplot(ames_train_prelim, aes(x = Full.Bath, y = price_resid))
#Title.
g8 <- g8 + ggtitle("Residual Value vs Number of Full Bathrooms")
#X-axis & Y-axis.
g8 <- g8 + scale_x_continuous("Number of Bathrooms") + ylab("Residual Value")
#Additional attributes.
g8 <- residual_gg_helper(g8)
Residual Values vs Number of Half Bathrooms.
g9 <- ggplot(ames_train_prelim, aes(x = Full.Bath, y = price_resid))
#Title.
g9 <- g9 + ggtitle("Residual Value vs Number of Half Bathrooms")
#X-axis & Y-axis.
g9 <- g9 + scale_x_continuous("Number of Bathrooms") + ylab("Residual Value")
#Additional attributes.
g9 <- residual_gg_helper(g9)
Create Grid.
grid.arrange(g4, g5, g6, g7, g8, g9,
layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2),
c(3, 3, 3, 3, 3, NA, 4, 4, 4, 4, 4),
c(5, 5, 5, 5, 5, NA, 6, 6, 6, 6, 6)),
ncol = 11))
Load ames_test.
load("ames_test.Rdata")
#Numerical categories were converted to factor variables.
ames_test$Overall.Qual <- as.factor(ames_test$Overall.Qual)
ames_test$Overall.Cond <- as.factor(ames_test$Overall.Cond)
Subset preliminary model.
ames_test_prelim <- ames_test %>%
select(price,
area,
Bedroom.AbvGr,
Half.Bath,
Full.Bath,
Neighborhood,
Overall.Qual,
Year.Built,
Year.Remod.Add)
ames_test_prelim[ames_test_prelim$Neighborhood == "Landmrk", ]
RMSE of ames_test.
ames_test_prelim <- ames_test_prelim[ames_test_prelim$Neighborhood != "Landmrk", ]
prelim_test <- predict(prelim_model, ames_test_prelim)
rmse(ames_test_prelim$price, prelim_test)
Distribution of Houses by Price: histograms for ames_train and ames_test.
price_train <- ggplot(ames_train, aes(x = price))
price_train <- price_train + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
price_train <- price_train + ggtitle("")
#X-axis
price_train <- price_train + scale_x_continuous(name = "House Prices (Training Set)", expand = c(0,0), labels = dollar)
#Y-axis.
price_train <- price_train + scale_y_continuous(name = "Frequency of Price", limits = c(0, 180), expand = c(0,0))
#Modify labels and text.
price_train <- price_train + theme(plot.title = element_text(hjust = 0, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
price_test <- ggplot(ames_test, aes(x = price))
price_test <- price_test + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
price_test <- price_test + ggtitle("Distribution of Houses by Price")
#X-axis
price_test <- price_test + scale_x_continuous(name = "House Prices (Testing Set)", expand = c(0,0), labels = dollar)
#Y-axis.
price_test <- price_test + scale_y_continuous(name = "", limits = c(0, 180), expand = c(0,0))
#Modify labels and text.
price_test <- price_test + theme(plot.title = element_text(hjust = -2.7, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
grid.arrange(price_train, price_test,
layout_matrix = matrix(c(1, 2),
ncol = 2))
Log transformations.
#Select only relevant variables.
ames_train_transf <- ames_train %>%
select(price, Year.Built, Year.Remod.Add, area, Lot.Area, Garage.Area, Total.Bsmt.SF, X1st.Flr.SF, X2nd.Flr.SF)
predictor <- NULL
r2 <- NULL
log_r2 <- NULL
for (p in colnames(ames_train_transf)[-1]) {
temp_r2 <- summary(lm(as.formula(paste("price ~ ", p, sep = "")),
data = ames_train_transf))$r.squared
if (p %in% c("Year.Built", "Year.Remod.Add")) {
temp_log_r2 <- summary(lm(as.formula(paste("price ~ log(2020 - ", p, ")", sep = "")),
data = ames_train_transf))$r.squared
} else {
temp_log_r2 <- summary(lm(as.formula(paste("price ~ log(", p, " + 1)", sep = "")),
data = ames_train_transf))$r.squared
}
predictor <- c(predictor, p)
r2 <- c(r2, temp_r2)
log_r2 <- c(log_r2, temp_log_r2)
}
transformations <- data.frame("Predictor_Variables" = predictor,
"R2_of_Predictor" = r2,
"R2_of_log_Predictor" = log_r2)
transformations
Transform House.Age (from Year.Built) and Lot.Area in ames_train and ames_test.
#Transform. The response, price, is also log transformed here, as described in Section 3.1
#and assumed by rmse_log below.
ames_train_final <- ames_train %>%
mutate(price = log(price),
House.Age = log(2020 - Year.Built),
Lot.Area = log(Lot.Area))
ames_test_final <- ames_test %>%
mutate(price = log(price),
House.Age = log(2020 - Year.Built),
Lot.Area = log(Lot.Area))
Create Date.Sold from Yr.Sold and Mo.Sold.
ames_train_final <- ames_train_final %>%
mutate(Date.Sold = ifelse((Mo.Sold > 9),
as.Date(paste(Yr.Sold,
Mo.Sold,
"01",
sep = "/"),
"%Y/%m/%d"),
as.Date(paste(Yr.Sold,
paste("0",
Mo.Sold,
sep = ""),
"01",
sep = "/"),
"%Y/%m/%d")))
Correlation plot of selected numerical variables.
ggcorr(ames_train_final %>%
mutate(X1st.X2nd.SF = X1st.Flr.SF + X2nd.Flr.SF) %>%
select(Lot.Area, area, X1st.Flr.SF, X2nd.Flr.SF, Garage.Area,
Garage.Cars, X1st.X2nd.SF, Total.Bsmt.SF, TotRms.AbvGrd,
Kitchen.AbvGr, Bedroom.AbvGr, Full.Bath, Half.Bath),
label = TRUE) + scale_fill_viridis(direction = -1)
Remove area, Garage.Cars, and Year.Built.
Forward selection by Adjusted R2.
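The code for these two steps is not shown; a plausible reconstruction, assuming forward_r2 from Section 2.2 (which expects price in the first column):
#Drop identifier, single-valued, NA-heavy, and redundant columns, then rerun
#forward selection by Adjusted R2 (slow with this many candidate predictors).
ames_train_sel <- ames_train_final %>%
select(-PID, -Utilities, -Lot.Frontage, -Mas.Vnr.Area, -Garage.Yr.Blt,
-Sale.Condition, -area, -Garage.Cars, -Year.Built) %>%
select(price, everything())
forward_r2(ames_train_sel)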
Two observations with level AsphShn in Exterior.1st.
ames_test_final %>%
filter(Exterior.1st == "AsphShn") %>%
select(PID, price, Exterior.1st, Neighborhood, Lot.Area, Year.Built)
Remove new levels (3 observations) and add NAs as levels.
ames_test_final <- ames_test_final %>%
filter(Neighborhood != "Landmrk",
Exterior.1st != "AsphShn") %>%
mutate(BsmtFin.Type.1 = addNA(BsmtFin.Type.1))
The top 20 predictors (plus price) from ames_train and ames_test are subset.
ames_train_final2 <- ames_train_final %>%
select(price, Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1,
Overall.Cond, House.Age, Lot.Area, MS.Zoning, BsmtFin.SF.1, Central.Air,
Condition.1, BsmtFin.SF.2, Exterior.1st, Garage.Area, Bsmt.Unf.SF,
Fireplace.Qu, Kitchen.Qual, Functional, Year.Remod.Add)
ames_test_final2 <- ames_test_final2 %>%
select(price, Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1,
Overall.Cond, House.Age, Lot.Area, MS.Zoning, BsmtFin.SF.1, Central.Air,
Condition.1, BsmtFin.SF.2, Exterior.1st, Garage.Area, Bsmt.Unf.SF,
Fireplace.Qu, Kitchen.Qual, Functional, Year.Remod.Add)
Create 20 different models, in order, and calculate and compare the RMSE of ames_train and ames_test.
num_of_pred <- NULL
rmse <- NULL
dataset <- NULL
#The dependent variable, price, has been log transformed. Change it back for RMSE calculation.
rmse_log <- function(obs, pred) {
obs <- exp(obs)
pred <- exp(pred)
return(rmse(obs, pred))
}
for (i in 2:21) {
temp_train_df <- ames_train_final2[, 1:i]
temp_test_df <- ames_test_final2[, 1:i]
temp_model <- lm(price ~ ., data = temp_train_df)
temp_predicted <- predict(temp_model, temp_train_df)
temp_tested <- predict(temp_model, temp_test_df)
num_of_pred <- c(num_of_pred, i-1, i-1)
rmse <- c(rmse,
rmse_log(temp_train_df$price, temp_predicted),
rmse_log(temp_test_df$price, temp_tested))
dataset <- c(dataset, "Ames Train", "Ames Test")
}
rmse_df <- data.frame("Number_of_Predictors" = num_of_pred,
"Root_Mean_Square_Error" = rmse,
"Dataset" = dataset)
RMSE vs Number of Variables for ames_train and ames_test.
#RMSE vs Number of Variables for ames_train and ames_test.
g10 <- ggplot(rmse_df, aes(x = Number_of_Predictors, y = Root_Mean_Square_Error, color = Dataset))
g10 <- g10 + geom_point(size = 7)
#Title.
g10 <- g10 + ggtitle("Number of Predictors vs RMSE for Training and Test Data")
#X-axis & Y-axis.
g10 <- g10 + scale_x_continuous("Number of Predictors") + scale_y_continuous("RMSE", labels = dollar)
#Viridis Color Scheme
g10 <- g10 + scale_color_manual(values = c(viridis(8)[6], viridis(8)[2]))
#Modify labels and text.
g10 <- g10 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
g10
RMSE for 7 - 11 variables.
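Neither the table above nor the final model fit appears in code; a minimal reconstruction, assuming the first eight predictors (columns 1 through 9 of ames_train_final2, with price first):
#RMSE table for 7 - 11 predictors.
rmse_df[rmse_df$Number_of_Predictors %in% 7:11, ]
#Fit the final model on the first eight predictors.
ames_train_final3 <- ames_train_final2[, 1:9]
final_model <- lm(price ~ ., data = ames_train_final3)
summary(final_model)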
Define outliers and graph helper.
#Include columns for residuals and outliers.
ames_train_diagnostics <- ames_train_final3 %>%
mutate(price_resid = residuals(final_model),
outliers = ifelse((price_resid >= 0) & (price_resid >= 3*sd(price_resid)), "yes",
ifelse((price_resid <= 0) & (price_resid <= -3*sd(price_resid)), "yes", "no")))
#Appends identical attributes to residual plots.
#Use residual_gg_helper from 2.3.
Residual Values vs First Floor Area.
g11 <- ggplot(ames_train_diagnostics, aes(x = X1st.Flr.SF, y = price_resid))
#Title.
g11 <- g11 + ggtitle("Residual Value vs 1st Floor Area")
#X-axis & Y-axis.
g11 <- g11 + scale_x_continuous("Area (sq. ft.)", labels = comma) + ylab("Residual Value")
#Additional attributes.
g11 <- residual_gg_helper(g11)
Residual Values vs 2nd Floor Area.
g12 <- ggplot(ames_train_diagnostics, aes(x = X2nd.Flr.SF, y = price_resid))
#Title.
g12 <- g12 + ggtitle("Residual Value vs 2nd Floor Area")
#X-axis & Y-axis.
g12 <- g12 + scale_x_continuous("Area (sq. ft.)", labels = comma) + ylab("Residual Value")
#Additional attributes.
g12 <- residual_gg_helper(g12)
Residual Values vs House Age.
g14 <- ggplot(ames_train_diagnostics, aes(x = House.Age, y = price_resid))
#Title.
g14 <- g14 + ggtitle("Residual Value vs ln(House Age)")
#X-axis & Y-axis.
g14 <- g14 + scale_x_continuous("ln(House Age)") + ylab("Residual Value")
#Additional attributes.
g14 <- residual_gg_helper(g14)
Residual Values vs Lot Area.
g15 <- ggplot(ames_train_diagnostics, aes(x = Lot.Area, y = price_resid))
#Title.
g15 <- g15 + ggtitle("Residual Value vs ln(Lot Area)")
#X-axis & Y-axis.
g15 <- g15 + scale_x_continuous("ln(Lot Area)") + ylab("Residual Value")
#Additional attributes.
g15 <- residual_gg_helper(g15)
Create Grid.
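The grid code is not shown; presumably analogous to the grid.arrange call in Section 2.3:
grid.arrange(g11, g12, g14, g15, ncol = 2)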
Histogram
g16 <- ggplot(final_model, mapping = aes(x = .resid)) + geom_histogram(bins = 30, fill = viridis(5)[3])
#Title.
g16 <- g16 + ggtitle("Distribution of Residuals")
#X-axis & Y-axis.
g16 <- g16 + scale_x_continuous("Residual Value") + ylab("Frequency")
#Modify labels and text.
g16 <- g16 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
Normal-QQ Plot
g17 <- ggplot(data = final_model)
g17 <- g17 + stat_qq(data = final_model, aes(sample = .resid), color = viridis(5)[3], size = 3, alpha = 0.4)
g17 <- g17 + stat_qq_line(data = final_model, mapping = aes(sample = .resid), color = viridis(5)[1], size = 1)
#Title.
g17 <- g17 + ggtitle("Normal-QQ Plot")
#X-axis & Y-axis.
g17 <- g17 + scale_x_continuous("Theoretical Quantile") + scale_y_continuous("Standardized Residual")
#Modify labels and text.
g17 <- g17 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
Create Grid.
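Again, the grid code is not shown; presumably:
grid.arrange(g16, g17, ncol = 2)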
Scatter of Residuals vs. Fitted.
g18 <- ggplot(final_model, aes(x = .fitted, y = .resid))
g18 <- g18 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g18 <- g18 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g18 <- g18 + ggtitle("Residuals vs Fitted")
#X-axis & Y-axis.
g18 <- g18 + scale_x_log10("Fitted Value") + scale_y_continuous("Residual Value")
#Modify labels and text.
g18 <- g18 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
Scatter of Absolute Value of Residuals vs. Fitted.
g19 <- ggplot(final_model, aes(x = .fitted, y = abs(.resid)))
g19 <- g19 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g19 <- g19 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g19 <- g19 + ggtitle("Absolute Value of Residuals vs Fitted")
#X-axis & Y-axis.
g19 <- g19 + scale_x_log10("Fitted Value") + scale_y_continuous("Absolute Value of Residual")
#Modify labels and text.
g19 <- g19 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 55),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 12, face = "bold"))
Create Grid.
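Presumably:
grid.arrange(g18, g19, ncol = 2)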
RMSE of ames_train and ames_test.
#Training Set.
final_train <- predict(final_model, ames_train_final3)
rmse_log(ames_train_final3$price, final_train)
#Test Set.
ames_test_final3 <- ames_test_final %>%
filter(Neighborhood != "Landmrk") %>%
mutate(BsmtFin.Type.1 = addNA(BsmtFin.Type.1)) %>%
select(price, Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1,
Overall.Cond, House.Age, Lot.Area)
final_test <- predict(final_model, ames_test_final3)
rmse_log(ames_test_final3$price, final_test)
RMSE of ames_validation.
load("ames_validation.Rdata")
ames_valid_final <- ames_validation %>%
mutate(price = log(price),
Overall.Qual = as.factor(Overall.Qual),
BsmtFin.Type.1 = addNA(BsmtFin.Type.1),
Overall.Cond = as.factor(Overall.Cond),
House.Age = log(2020 - Year.Built),
Lot.Area = log(Lot.Area)) %>%
select(price, Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1,
Overall.Cond, House.Age, Lot.Area)
final_validation <- predict(final_model, ames_valid_final)
rmse_log(ames_valid_final$price, final_validation)
Prediction Interval
#Retrieve prediction interval. Find the percentage of price values in the prediction interval.
#Ames Test.
predict_test <- exp(predict(final_model, ames_test_final3, interval = "prediction"))
mean(exp(ames_test_final3$price) > predict_test[,"lwr"] & exp(ames_test_final3$price) < predict_test[,"upr"])
#Ames Validation.
predict_valid <- exp(predict(final_model, ames_valid_final, interval = "prediction"))
mean(exp(ames_valid_final$price) > predict_valid[,"lwr"] &
exp(ames_valid_final$price) < predict_valid[,"upr"])