Background

As a statistical consultant working for a real estate investment firm, your task is to develop a model to predict the selling price of a given home in Ames, Iowa. Your employer hopes to use this information to help assess whether the asking price of a house is higher or lower than the true value of the house. If the home is undervalued, it may be a good investment for the firm.

Refer to the codebook for the Ames, Iowa dataset for more information.

See the Appendix for code.


Training Data and relevant packages

In order to better assess the quality of the model you will produce, the data have been randomly divided into three separate pieces: a training data set, a testing data set, and a validation data set. For now we will load the training data set, the others will be loaded and used later.

Use the code block below to load any necessary packages.


Part 1 : Exploratory Data Analysis (EDA)

When you first get your data, it’s very tempting to immediately begin fitting models and assessing how they perform. However, before you begin modeling, it’s absolutely essential to explore the structure of the data and the relationships between the variables in the data set.

Do a detailed EDA of the ames_train data set, to learn about the structure of the data and the relationships between the variables in the data set (refer to Introduction to Probability and Data, Week 2, for a reminder about EDA if needed). Your EDA should involve creating and reviewing many plots/graphs and considering the patterns and relationships you see.

After you have explored completely, submit the three graphs/plots that you found most informative during your EDA process, and briefly explain what you learned from each (why you found each informative).


The Test and Validation sets only have observations with Sale.Condition Normal, so the Training set was filtered only for houses with Sale.Condition Normal. This reduces the observations in ames_train from 1000 to 834.

Variables can be broadly grouped into categorical and numerical. Several of the categorical variables have several NAs. These were noted and will be addressed later in the analysis. See Appendix for full summary of categorical and numerical variables.

PID was removed as it is not a predictor. Utilities was removed as all observations have the same value. Lot.Frontage, Mas.Vnr.Area, and Garage.Yr.Blt were removed as numerical variables with NAs. See Appendix for details.

After exploring the variables, I found the following graphs interesting to warrant further exploration.

Price would be expected to increase as area of the house does. In addition, one would expect larger houses to have more bedrooms. The first graph provides support to these assumptions.

Different neighborhoods in an urbanized area would be expected to vary in worth. The price of the houses by neighborhood, ascending by median, demonstrates that this appears to be true.

Lastly, I explored the year homes were built and their overall quality measurement. Although these relationships appear to be complicated, there seems to be a sharp increase in price and quality of homes after 1990.


Part 2 - Development and assessment of an initial model, following a semi-guided process of analysis

2.1 An Initial Model

In building a model, it is often useful to start by creating a simple, intuitive initial model based on the results of the exploratory data analysis. (Note: The goal at this stage is not to identify the “best” possible model but rather to choose a reasonable and understandable starting point. Later you will expand and revise this model to create your final model.

Based on your EDA, select at most 10 predictor variables from ames_train and create a linear model for price (or a transformed version of price) using those variables. Provide the R code and the summary output table for your model, a brief justification for the variables you have chosen, and a brief discussion of the model results in context (focused on the variables that appear to be important predictors and how they relate to sales price).


area, Bedroom.AbvGr, Full.Bath, Half.Bath, Neighborhood, Overall.Qual, Year.Built, and Year.Remod.Add were chosen as predictors for price. These variables seem to adequately convey what would affect the price of a home, based on common conversations and the Exploratory Data Analysis.


Call:
lm(formula = price ~ ., data = ames_train_prelim)

Residuals:
   Min     1Q Median     3Q    Max 
-87397 -13313    597  11760 173401 

Coefficients:
                        Estimate   Std. Error t value             Pr(>|t|)    
(Intercept)         -1604837.751   180849.802  -8.874 < 0.0000000000000002 ***
area                      83.740        3.747  22.348 < 0.0000000000000002 ***
Bedroom.AbvGr          -7337.244     1616.362  -4.539    0.000006518937916 ***
Full.Bath              -9695.931     2938.903  -3.299             0.001013 ** 
Half.Bath              -8926.653     2342.261  -3.811             0.000149 ***
NeighborhoodBlueste   -12393.056    17699.167  -0.700             0.484005    
NeighborhoodBrDale    -16176.646    14157.432  -1.143             0.253539    
NeighborhoodBrkSide    19145.526    11793.960   1.623             0.104916    
NeighborhoodClearCr    42247.995    12769.088   3.309             0.000980 ***
NeighborhoodCollgCr    14972.693    10097.912   1.483             0.138538    
NeighborhoodCrawfor    39613.160    11759.113   3.369             0.000792 ***
NeighborhoodEdwards     5535.183    10932.775   0.506             0.612792    
NeighborhoodGilbert     9692.894    10711.870   0.905             0.365807    
NeighborhoodGreens      7354.592    16207.875   0.454             0.650121    
NeighborhoodGrnHill    99678.462    20194.272   4.936    0.000000972644229 ***
NeighborhoodIDOTRR      4923.579    12084.320   0.407             0.683798    
NeighborhoodMeadowV   -17429.166    12241.862  -1.424             0.154917    
NeighborhoodMitchel    20693.640    10688.497   1.936             0.053215 .  
NeighborhoodNAmes      20224.599    10601.972   1.908             0.056801 .  
NeighborhoodNoRidge    34267.303    10986.333   3.119             0.001880 ** 
NeighborhoodNPkVill     4842.707    16142.031   0.300             0.764251    
NeighborhoodNridgHt    36571.239    10694.927   3.419             0.000659 ***
NeighborhoodNWAmes     19002.271    10864.610   1.749             0.080677 .  
NeighborhoodOldTown     2882.063    11674.701   0.247             0.805078    
NeighborhoodSawyer     17428.674    10878.849   1.602             0.109539    
NeighborhoodSawyerW     6338.329    10631.014   0.596             0.551205    
NeighborhoodSomerst    14154.945    10324.047   1.371             0.170743    
NeighborhoodStoneBr    26380.306    12155.808   2.170             0.030290 *  
NeighborhoodSWISU       9118.996    13672.985   0.667             0.505008    
NeighborhoodTimber     38421.753    11397.163   3.371             0.000785 ***
NeighborhoodVeenker    37130.883    12982.241   2.860             0.004346 ** 
Overall.Qual2           9000.734    27906.884   0.323             0.747138    
Overall.Qual3          26190.108    26817.781   0.977             0.329068    
Overall.Qual4          23441.471    25761.728   0.910             0.363134    
Overall.Qual5          30288.255    25603.313   1.183             0.237171    
Overall.Qual6          35857.873    25633.323   1.399             0.162241    
Overall.Qual7          46986.134    25767.668   1.823             0.068612 .  
Overall.Qual8          75643.005    26019.802   2.907             0.003749 ** 
Overall.Qual9         150784.489    26620.988   5.664    0.000000020679126 ***
Overall.Qual10        206006.064    29141.245   7.069    0.000000000003423 ***
Year.Built               551.884       73.495   7.509    0.000000000000161 ***
Year.Remod.Add           276.969       59.742   4.636    0.000004151533789 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25050 on 792 degrees of freedom
Multiple R-squared:  0.8858,    Adjusted R-squared:  0.8799 
F-statistic: 149.8 on 41 and 792 DF,  p-value: < 0.00000000000000022

Based on the simple linear model of our preliminary variables, the R2 is 0.8858, which means approximately 89% of the variability in price can be explained by eight variables for the 1000 observed houses in the training data set. Furthermore, looking at p-values, five numerical variables are significant at α = 0.001, and the sixth is very close.

For Neighborhood and Overall Quality, the two categorical variables selected, several individual levels are significant at α = 0.01 and α = 0.001. Northridge Heights, Northridge, Green Hills, and Timber, four of the top 5 neighborhoods by price, are notable in the initial linear regression model as significant. Several other neighborhoods are also significant at various levels. Overall Quality of 8, 9, and 10 have the highest significance as predictors. This numerical analysis agrees with observations in the graphs from the Exploratory Data Analysis.

Adjusted R2 and p-values will explored further in the upcoming Model Selection process.


2.2 Model Selection

Now either using BAS another stepwise selection procedure choose the “best” model you can, using your initial model as your starting point. Try at least two different model selection methods and compare their results. Do they both arrive at the same model or do they disagree? What do you think this means?


Choosing backward selection by p-value, we note from the above summary that all eight predictor variables are significant at α = 0.01. No variable is eliminated using this method.

We will compare this to forward selection by Adjusted R2.

                   Predictor_Variables Adjusted_R2
1                         Overall.Qual   0.6886230
2                  Overall.Qual + area   0.8092515
3   Previous Predictors + Neighborhood   0.8614690
4     Previous Predictors + Year.Built   0.8700999
5  Previous Predictors + Bedroom.AbvGr   0.8747203
6 Previous Predictors + Year.Remod.Add   0.8773952
7      Previous Predictors + Half.Bath   0.8783981
8      Previous Predictors + Full.Bath   0.8798951

Since every variable in the initial selection increases the Adjusted R2, it is left in the model.

Backward selection by p-values emphasizes the significance of the predictors, and forward selection by Adjusted R2 emphasizes the variability of price that can be explained by these predictors. Both selection processes yield the same result, which signifies a good starting point for the rest of the analysis.


2.3 Initial Model Residuals

One way to assess the performance of a model is to examine the model’s residuals. In the space below, create a residual plot for your preferred model from above and use it to assess whether your model appears to fit the data well. Comment on any interesting structure in the residual plot (trend, outliers, etc.) and briefly discuss potential implications it may have for your model and inference / prediction you might produce.


The Residual Plots for all numerical predictors are shown below. For Area, Number of Bedrooms, and Number of Bathrooms, the residuals appear to be randomly dispersed about y = 0. The vast majority of houses have 2-4 bedrooms, 1 or 2 full bathrooms, and 1 or 2 half bathrooms, so there is a higher concentration and spread at those values. For Year Built and Remodel Year, an interesting pattern of heteroscedasticity may be emerging. That is, as the year built or remodel increases, so does the spread of residuals. This seems to be more pronounced for Year Built than Remodel Year. This will be addressed in the revised model.

Using the definition of outliers as residuals that are outside of three times the standard deviation of the residuals, there are nine positive outliers and two negative outliers. This is about 1.3% of the remaining 834 observations. There is no need to remove these observations (highlighted in blue), but they will be noted as the analysis continues.


2.4 Initial Model RMSE

You can calculate it directly based on the model output. Be specific about the units of your RMSE (depending on whether you transformed your response variable). The value you report will be more meaningful if it is in the original units (dollars).


The RMSE of the training data using the initial model is $24,407.13. This represents the spread, or noise, of the residuals.

[1] 24407.13

2.5 Overfitting

The process of building a model generally involves starting with an initial model (as you have done above), identifying its shortcomings, and adapting the model accordingly. This process may be repeated several times until the model fits the data reasonably well. However, the model may do well on training data but perform poorly out-of-sample (meaning, on a dataset other than the original training data) because the model is overly-tuned to specifically fit the training data. This is called “overfitting.” To determine whether overfitting is occurring on a model, compare the performance of a model on both in-sample and out-of-sample data sets. To look at performance of your initial model on out-of-sample data, you will use the data set ames_test.

Use your model from above to generate predictions for the housing prices in the test data set. Are the predictions significantly more accurate (compared to the actual sales prices) for the training data than the test data? Why or why not? Briefly explain how you determined that (what steps or processes did you use)?


First, it is important to note that one observation was removed. In the test data, there is a house in the Neighborhood Landmark. Landmark was not in the training data, and is thus not an individual level in the preliminary prediction model.

# A tibble: 1 x 9
   price  area Bedroom.AbvGr Half.Bath Full.Bath Neighborhood Overall.Qual
   <int> <int>         <int>     <int>     <int> <fct>        <fct>       
1 137000  1320             3         1         2 Landmrk      6           
# ... with 2 more variables: Year.Built <int>, Year.Remod.Add <int>

The RMSE of the rest of the testing data using the initial model is $25.674.43. This is about $1,200 more than that of the training data. Since there is more spread, or noise, the preliminary prediction model is more accurate on the training data than the testing data, signifying overfitting may be an issue.

[1] 25674.43

Note to the learner: If in real-life practice this out-of-sample analysis shows evidence that the training data fits your model a lot better than the test data, it is probably a good idea to go back and revise the model (usually by simplifying the model) to reduce this overfitting. For simplicity, we do not ask you to do this on the assignment, however.

Part 3 Development of a Final Model

Now that you have developed an initial model to use as a baseline, create a final model with at most 20 variables to predict housing prices in Ames, IA, selecting from the full array of variables in the dataset and using any of the tools that we introduced in this specialization.

Carefully document the process that you used to come up with your final model, so that you can answer the questions below.


3.1 Transformation

Did you decide to transform any variables? Why or why not? Explain in a few sentences.


First, the dependent variable, price is log transformed. It is clear from ames_train and ames_test that price is right skewed.

To account for the possible heteroscedascity observed in Initial Model Residuals, Year.Built and Year.Remod.Add were converted to age by subtracting from 2020, and then log transformed. Several other variables were log transformed to see if an improvement occured.

  Predictor_Variables R2_of_Predictor R2_of_log_Predictor
1          Year.Built      0.36720772          0.42388630
2      Year.Remod.Add      0.33864569          0.31292128
3                area      0.55743101          0.57449584
4            Lot.Area      0.06387691          0.15652010
5         Garage.Area      0.41015147          0.20452499
6       Total.Bsmt.SF      0.44791613          0.15621816
7         X1st.Flr.SF      0.42528235          0.42101737
8         X2nd.Flr.SF      0.10275480          0.03124951

Of the numerical variables shown above, House.Age (derived from Year.Built), area, and Lot.Area show an increase of R2 as a single predictor upon being transformed by log.


3.2 Variable Interaction

Did you decide to include any variable interactions? Why or why not? Explain in a few sentences.


X1st.Flr.SF and X2nd.Flr.SF were temporarily combined (see Variable Selection below), but not for the final model, as the coefficient for their sum will be in the model by defintion. As there are several different variables that are related to area, there are also several variables related to number and types of rooms. Any sum would be unnecessary.

Yr.Sold and Mo.Sold were combined into a Date.Sold variable. The upcoming analysis will demonstrate if it is a useful predictor or not.

Many categorical variables had 75% or more of the observations in one level. These often overlapped with the majority level in another categorical variable. However, no combinations were made.


3.3 Variable Selection

What method did you use to select the variables you included? Why did you select the method you used? Explain in a few sentences.


The correlation of several variables related to area and number of rooms are charted below. Notice that the created combined area of the X1st.Flr.SF and X2nd.Flr.SF floor is virtually identical to area. area will be removed. Garage.Cars is removed as it is very similar to Garage.Area. There are some other correlated variables that will be kept note of.

The following output is the first twenty predictors from forward selection by Adjusted R2. These will be used to select the final model.

                    Predictor_Variables Adjusted_R2
1                          Overall.Qual   0.6751866
2           Overall.Qual + Neighborhood   0.7696939
3     Previous Predictors + X1st.Flr.SF   0.8184783
4     Previous Predictors + X2nd.Flr.SF   0.8716027
5  Previous Predictors + BsmtFin.Type.1   0.8932844
6    Previous Predictors + Overall.Cond   0.9072078
7       Previous Predictors + House.Age   0.9185515
8        Previous Predictors + Lot.Area   0.9292977
9       Previous Predictors + MS.Zoning   0.9333688
10   Previous Predictors + BsmtFin.SF.1   0.9376955
11    Previous Predictors + Central.Air   0.9406146
12    Previous Predictors + Condition.1   0.9427471
13   Previous Predictors + BsmtFin.SF.2   0.9448202
14   Previous Predictors + Exterior.1st   0.9461936
15    Previous Predictors + Garage.Area   0.9474239
16    Previous Predictors + Bsmt.Unf.SF   0.9486238
17   Previous Predictors + Fireplace.Qu   0.9501178
18   Previous Predictors + Kitchen.Qual   0.9511929
19     Previous Predictors + Functional   0.9526140
20 Previous Predictors + Year.Remod.Add   0.9534796

3.4 Model Testing

How did testing the model on out-of-sample data affect whether or how you changed your model? Explain in a few sentences.


Similar to Section 2, several observations were removed removed from ames_test. In the test data, there is a house in the Neighborhood Landmark. Landmark was not in the training data, and is thus not an individual level in the preliminary prediction model. In addition, two observations with level AsphShn from Exterior.1st were removed from ames_test. There were no observations that had these levels in ames_train.

# A tibble: 2 x 6
        PID price Exterior.1st Neighborhood Lot.Area Year.Built
      <int> <dbl> <fct>        <fct>           <dbl>      <int>
1 908102130  11.5 AsphShn      Edwards          9.19       1965
2 903481100  11.3 AsphShn      IDOTRR           9.24       1915

As more predictors are included, the RMSE of ames_train decreases. ames_test shows a general trend of decreasing, however, it increases with the addition of the 9th variable, and the difference in RMSE of ames_train and ames_test increases to over $2,000.

 Number_of_Predictors Root_Mean_Square_Error    Dataset
                    7               19696.70 Ames Train
                    7               21068.62  Ames Test
                    8               17890.80 Ames Train
                    8               19206.40  Ames Test
                    9               17720.38 Ames Train
                    9               19447.75  Ames Test
                   10               16269.88 Ames Train
                   10               18440.39  Ames Test
                   11               16221.88 Ames Train
                   11               18306.83  Ames Test

I selected the first eight variables from above.

The eight predictors, in order, are Overall.Qual, Neighborhood, X1st.Flr.SF, X2nd.Flr.SF, BsmtFin.Type.1, Overall.Cond, log transform of House.Age (from Year.Built), and the log transform of Lot.Area.


3.5 Final Model

Provide the summary table for your model.


Call:
lm(formula = price ~ ., data = ames_train_final3)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.77174 -0.05640  0.00396  0.06037  0.30304 

Coefficients:
                       Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)         10.33262639  0.17149534  60.250 < 0.0000000000000002 ***
Overall.Qual2        0.16811555  0.12215389   1.376             0.169136    
Overall.Qual3        0.20168669  0.11470901   1.758             0.079098 .  
Overall.Qual4        0.27234093  0.11172534   2.438             0.015008 *  
Overall.Qual5        0.33478896  0.11120087   3.011             0.002691 ** 
Overall.Qual6        0.38751159  0.11172788   3.468             0.000552 ***
Overall.Qual7        0.45213153  0.11232258   4.025    0.000062454025060 ***
Overall.Qual8        0.52873248  0.11339681   4.663    0.000003671059772 ***
Overall.Qual9        0.69849569  0.11585908   6.029    0.000000002546870 ***
Overall.Qual10       0.75439072  0.12555296   6.009    0.000000002870934 ***
NeighborhoodBlueste -0.01050263  0.07339958  -0.143             0.886257    
NeighborhoodBrDale  -0.09695169  0.05849983  -1.657             0.097861 .  
NeighborhoodBrkSide -0.02796181  0.04870488  -0.574             0.566061    
NeighborhoodClearCr  0.02560836  0.05603476   0.457             0.647792    
NeighborhoodCollgCr -0.06617316  0.04252900  -1.556             0.120125    
NeighborhoodCrawfor  0.07528602  0.04950458   1.521             0.128718    
NeighborhoodEdwards -0.10236415  0.04562363  -2.244             0.025134 *  
NeighborhoodGilbert -0.06436173  0.04536566  -1.419             0.156376    
NeighborhoodGreens   0.15927139  0.06689129   2.381             0.017503 *  
NeighborhoodGrnHill  0.44547165  0.08355449   5.332    0.000000127652051 ***
NeighborhoodIDOTRR  -0.13623368  0.05000315  -2.725             0.006584 ** 
NeighborhoodMeadowV -0.13399780  0.05061391  -2.647             0.008274 ** 
NeighborhoodMitchel -0.03096376  0.04528050  -0.684             0.494291    
NeighborhoodNAmes   -0.03214116  0.04452960  -0.722             0.470638    
NeighborhoodNoRidge  0.01098184  0.04663098   0.236             0.813878    
NeighborhoodNPkVill -0.00198758  0.06645671  -0.030             0.976148    
NeighborhoodNridgHt -0.01047609  0.04447277  -0.236             0.813834    
NeighborhoodNWAmes  -0.07938304  0.04610364  -1.722             0.085495 .  
NeighborhoodOldTown -0.09886232  0.04799814  -2.060             0.039757 *  
NeighborhoodSawyer  -0.04839336  0.04572718  -1.058             0.290244    
NeighborhoodSawyerW -0.09970701  0.04455661  -2.238             0.025518 *  
NeighborhoodSomerst -0.00422347  0.04260388  -0.099             0.921058    
NeighborhoodStoneBr  0.01085850  0.05052940   0.215             0.829906    
NeighborhoodSWISU   -0.03610013  0.05592865  -0.645             0.518814    
NeighborhoodTimber  -0.03593131  0.04812336  -0.747             0.455500    
NeighborhoodVeenker  0.04563645  0.05468488   0.835             0.404235    
X1st.Flr.SF          0.00039566  0.00001555  25.443 < 0.0000000000000002 ***
X2nd.Flr.SF          0.00028604  0.00001153  24.811 < 0.0000000000000002 ***
BsmtFin.Type.1BLQ   -0.02336231  0.01480578  -1.578             0.114990    
BsmtFin.Type.1GLQ    0.01218316  0.01290092   0.944             0.345276    
BsmtFin.Type.1LwQ   -0.07221175  0.01865225  -3.871             0.000117 ***
BsmtFin.Type.1Rec   -0.05011208  0.01433919  -3.495             0.000501 ***
BsmtFin.Type.1Unf   -0.09061384  0.01247227  -7.265    0.000000000000904 ***
BsmtFin.Type.1NA    -0.25408661  0.02748213  -9.246 < 0.0000000000000002 ***
Overall.Cond2        0.32091864  0.13144029   2.442             0.014845 *  
Overall.Cond3        0.18378116  0.08620238   2.132             0.033321 *  
Overall.Cond4        0.32280192  0.07994626   4.038    0.000059297735739 ***
Overall.Cond5        0.39726544  0.07946006   5.000    0.000000710019770 ***
Overall.Cond6        0.44944305  0.07948361   5.655    0.000000021928515 ***
Overall.Cond7        0.50071436  0.07961647   6.289    0.000000000531573 ***
Overall.Cond8        0.52777254  0.08055204   6.552    0.000000000103105 ***
Overall.Cond9        0.52776568  0.08545613   6.176    0.000000001058321 ***
House.Age           -0.18550638  0.01475237 -12.575 < 0.0000000000000002 ***
Lot.Area             0.11868952  0.01084815  10.941 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1015 on 780 degrees of freedom
Multiple R-squared:  0.9338,    Adjusted R-squared:  0.9293 
F-statistic: 207.6 on 53 and 780 DF,  p-value: < 0.00000000000000022

Part 4 : Final Model Assessment

4.1 Final Model Residual

For your final model, create and briefly interpret an informative plot of the residuals.


Linear Relationships Between x and y

For all four numerical predictors, the residuals appear to be randomly scattered around 0. Outliers are highlighted in blue.

Nearly Normal Residuals

There is a slight left skew, but the majority of the residuals fit a normal distribution.

Constant Variability of Residuals

From both plots, it is clear there are about 15 - 20 fitted values whose residuals deviate from the others. Considering this is approximately 1.7 - 2.3% of the data in the model, the residuals seem to be of constant variability.

Independence of Residuals

The houses were chosen as “individual residential properties sold in Ames, IA from 2006 to 2010”. The houses selected, and thus residuals, appear to be independent of one another.


4.2 Final Model RMSE

For your final model, calculate and briefly comment on the RMSE.


The RMSE of ames_train is $17,890.80, and the RMSE of ames_test is $19,201.56. The RMSE of ames_test has changed from above because the two observations with level AsphShn from Exterior.1st that were removed from are now included, as Exterior.1st is no longer a predictor.

The spread of the actual price values of the test set is about $1,300 more than that of the training set. Using more variables as predictors shows signs of overfitting. The analyzed model, however, has a spread that is much less than the initial model from part 2, where both RMSE values are greater than $24,000.

[1] 17890.8
[1] 19201.56

4.3 Final Model Evaluation

What are some strengths and weaknesses of your model?


The final model has an R2 of 0.9338. 93% of the variance of price is explained by eight of the 81 variables. Furthermore, each of the eight predictors, whether it be numerical or at least one level in a categorical, is significant at the α = 0.001 level.

The model is a very good fit for most of the data. However, the model is fit for only those homes with Sale.Condition that is Normal. While the initial training set was a standard 1000 houses, filtering left only 834 to create the model. This is approximately 3.3% of the total. It is important to stress that this model is specific to Ames, and would vary greatly in predictive power across the US.

Many categorical variables have some levels with a large proportion of the observations, and some levels with a few or none. Neighborhood and Overall.Qual are two examples, and also the first two predictors identified by forward selection. It is possible that more houses in the levels with smaller totals could improve the model.

Instead of Neighborhood, I think latitude and longitude values of each house, perhaps even zoning map sections, would be a more accurate predictor.


4.4 Final Model Validation

Testing your final model on a separate, validation data set is a great way to determine how your model will perform in real-life practice.

You will use the ames_validation dataset to do some additional assessment of your final model. Discuss your findings, be sure to mention:

  • What is the RMSE of your final model when applied to the validation data?
  • How does this value compare to that of the training data and/or testing data?
  • What percentage of the 95% predictive confidence (or credible) intervals contain the true price of the house in the validation data set?
  • From this result, does your final model properly reflect uncertainty?

The RMSE of ames_validation is $19,397.47, which is extremely similar to the RMSE of ames_test at $19,201.56. This consistency is a good indicator that the model is a reasonable fit.

[1] 19397.47

The true proportion of ames_test prices that fall within the 95% prediction interval is 94.9%. The true proportion of ames_validation prices that fall within the 95% prediction interval is 94.6%.

Because both of these values are extremely close to 95%, the final model properly reflects uncertainty.

[1] 0.9485294
[1] 0.9462647

Part 5 : Conclusion

Provide a brief summary of your results, and a brief discussion of what you have learned about the data and your model.


The model provided by this analysis accurately predicts the price of approximately 95% of the sample houses in Ames, IA to within $39,000.

The most surprising predictor to me was BsmtFin.Type.1. I skipped over it in the preliminary model without any hesitation.

During my steps, I modified the data and model several times. I removed outliers, removed levels, changed the order of predictors, and included extra predictors. I found that the model chosen is quite robust and parsimonious.

While not collinear, I found many of the variables to show some significant relationship with others. I found it interesting how the RMSE quickly converged and did not vary much after fifteen variables. Most of the variables added very little to the model.

And interesting aspect of the data is that the observations were generally not uniformly distributed by variables, specifically the categorical variables. For example, over 25% of homes are less than 20 years old, and over 50% are less than 50 years old. Well over 90% of houses have an overall condition around average, there are very few that are poor or exceptional. 99% of houses do not have a pool. 80% of houses have no fence. Over 75% of houses have a gable roof style. There are many other examples. This seems to be a reason why many variables added very little to no change to the model.

Finally, I learned to not underestimate log transforming variables, including the dependent variable.


Appendix


Part 1 : Exploratory Data Analysis (EDA)

All Numerical Variables.

      PID                  area          price         MS.SubClass    
 Min.   : 526302030   Min.   : 334   Min.   : 39300   Min.   : 20.00  
 1st Qu.: 531458660   1st Qu.:1077   1st Qu.:129000   1st Qu.: 20.00  
 Median : 535454145   Median :1383   Median :155500   Median : 50.00  
 Mean   : 713844613   Mean   :1450   Mean   :174622   Mean   : 57.86  
 3rd Qu.: 907199140   3rd Qu.:1718   3rd Qu.:205000   3rd Qu.: 70.00  
 Max.   :1007100110   Max.   :3672   Max.   :615000   Max.   :190.00  
                                                                      
  Lot.Frontage       Lot.Area        Year.Built   Year.Remod.Add
 Min.   : 21.00   Min.   :  1470   Min.   :1872   Min.   :1950  
 1st Qu.: 56.00   1st Qu.:  7206   1st Qu.:1954   1st Qu.:1965  
 Median : 69.00   Median :  9202   Median :1972   Median :1990  
 Mean   : 68.51   Mean   : 10228   Mean   :1970   Mean   :1983  
 3rd Qu.: 80.00   3rd Qu.: 11419   3rd Qu.:1997   3rd Qu.:2002  
 Max.   :313.00   Max.   :215245   Max.   :2009   Max.   :2009  
 NA's   :156                                                    
  Mas.Vnr.Area      BsmtFin.SF.1     BsmtFin.SF.2      Bsmt.Unf.SF    
 Min.   :   0.00   Min.   :   0.0   Min.   :   0.00   Min.   :   0.0  
 1st Qu.:   0.00   1st Qu.:   0.0   1st Qu.:   0.00   1st Qu.: 209.0  
 Median :   0.00   Median : 402.0   Median :   0.00   Median : 433.0  
 Mean   :  95.62   Mean   : 453.3   Mean   :  52.44   Mean   : 517.7  
 3rd Qu.: 143.00   3rd Qu.: 747.0   3rd Qu.:   0.00   3rd Qu.: 747.8  
 Max.   :1290.00   Max.   :2257.0   Max.   :1526.00   Max.   :2336.0  
 NA's   :4                                                            
 Total.Bsmt.SF   X1st.Flr.SF    X2nd.Flr.SF     Low.Qual.Fin.SF  
 Min.   :   0   Min.   : 334   Min.   :   0.0   Min.   :   0.00  
 1st Qu.: 784   1st Qu.: 864   1st Qu.:   0.0   1st Qu.:   0.00  
 Median : 974   Median :1058   Median :   0.0   Median :   0.00  
 Mean   :1023   Mean   :1127   Mean   : 318.2   Mean   :   4.52  
 3rd Qu.:1248   3rd Qu.:1337   3rd Qu.: 689.0   3rd Qu.:   0.00  
 Max.   :2846   Max.   :2696   Max.   :1836.0   Max.   :1064.00  
                                                                 
 Bsmt.Full.Bath   Bsmt.Half.Bath      Full.Bath       Half.Bath     
 Min.   :0.0000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000  
 1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000  
 Median :0.0000   Median :0.00000   Median :1.000   Median :0.0000  
 Mean   :0.4365   Mean   :0.06115   Mean   :1.505   Mean   :0.3777  
 3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :2.0000   Max.   :2.00000   Max.   :4.000   Max.   :2.0000  
                                                                    
 Bedroom.AbvGr   Kitchen.AbvGr   TotRms.AbvGrd      Fireplaces    
 Min.   :0.000   Min.   :1.000   Min.   : 2.000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:1.000   1st Qu.: 5.000   1st Qu.:0.0000  
 Median :3.000   Median :1.000   Median : 6.000   Median :1.0000  
 Mean   :2.808   Mean   :1.038   Mean   : 6.272   Mean   :0.5923  
 3rd Qu.:3.000   3rd Qu.:1.000   3rd Qu.: 7.000   3rd Qu.:1.0000  
 Max.   :6.000   Max.   :2.000   Max.   :13.000   Max.   :4.0000  
                                                                  
 Garage.Yr.Blt   Garage.Cars     Garage.Area      Wood.Deck.SF   
 Min.   :1900   Min.   :0.000   Min.   :   0.0   Min.   :  0.00  
 1st Qu.:1960   1st Qu.:1.000   1st Qu.: 308.0   1st Qu.:  0.00  
 Median :1977   Median :2.000   Median : 472.5   Median :  0.00  
 Mean   :1976   Mean   :1.724   Mean   : 460.9   Mean   : 93.18  
 3rd Qu.:1999   3rd Qu.:2.000   3rd Qu.: 573.5   3rd Qu.:168.00  
 Max.   :2009   Max.   :5.000   Max.   :1356.0   Max.   :857.00  
 NA's   :36                                                      
 Open.Porch.SF    Enclosed.Porch  X3Ssn.Porch       Screen.Porch   
 Min.   :  0.00   Min.   :  0    Min.   :  0.000   Min.   :  0.00  
 1st Qu.:  0.00   1st Qu.:  0    1st Qu.:  0.000   1st Qu.:  0.00  
 Median : 25.00   Median :  0    Median :  0.000   Median :  0.00  
 Mean   : 45.31   Mean   : 24    Mean   :  3.191   Mean   : 14.87  
 3rd Qu.: 71.50   3rd Qu.:  0    3rd Qu.:  0.000   3rd Qu.:  0.00  
 Max.   :341.00   Max.   :432    Max.   :508.000   Max.   :440.00  
                                                                   
   Pool.Area          Misc.Val           Mo.Sold          Yr.Sold    
 Min.   :  0.000   Min.   :    0.00   Min.   : 1.000   Min.   :2006  
 1st Qu.:  0.000   1st Qu.:    0.00   1st Qu.: 5.000   1st Qu.:2007  
 Median :  0.000   Median :    0.00   Median : 6.000   Median :2008  
 Mean   :  1.754   Mean   :   53.12   Mean   : 6.159   Mean   :2008  
 3rd Qu.:  0.000   3rd Qu.:    0.00   3rd Qu.: 8.000   3rd Qu.:2009  
 Max.   :800.000   Max.   :15500.00   Max.   :12.000   Max.   :2010  
                                                                     

All Factor Variables.

   MS.Zoning    Street     Alley     Lot.Shape Land.Contour  Utilities  
 A (agr):  0   Grvl:  2   Grvl: 28   IR1:286   Bnk: 29      AllPub:834  
 C (all):  5   Pave:832   Pave: 25   IR2: 24   HLS: 29      NoSeWa:  0  
 FV     : 35              NA's:781   IR3:  2   Low: 19      NoSewr:  0  
 I (all):  1                         Reg:522   Lvl:757                  
 RH     :  6                                                            
 RL     :656                                                            
 RM     :131                                                            
   Lot.Config  Land.Slope  Neighborhood  Condition.1   Condition.2 
 Corner :145   Gtl:800    NAmes  :140   Norm   :732   Norm   :825  
 CulDSac: 63   Mod: 31    CollgCr: 75   Feedr  : 46   Feedr  :  6  
 FR2    : 32   Sev:  3    OldTown: 62   Artery : 19   Artery :  1  
 FR3    :  4              Sawyer : 57   PosN   :  9   PosN   :  1  
 Inside :590              Edwards: 50   RRAn   :  9   RRNn   :  1  
                          Somerst: 45   RRAe   :  8   PosA   :  0  
                          (Other):405   (Other): 11   (Other):  0  
  Bldg.Type    House.Style   Overall.Qual  Overall.Cond   Roof.Style 
 1Fam  :687   1Story :422   5      :268   5      :441   Flat   :  6  
 2fmCon: 18   2Story :239   6      :209   6      :170   Gable  :654  
 Duplex: 26   1.5Fin : 85   7      :168   7      :123   Gambrel:  8  
 Twnhs : 33   SLvl   : 38   8      : 88   8      : 46   Hip    :163  
 TwnhsE: 70   SFoyer : 32   4      : 58   4      : 30   Mansard:  3  
              2.5Unf : 10   9      : 24   3      : 11   Shed   :  0  
              (Other):  8   (Other): 19   (Other): 13                
   Roof.Matl    Exterior.1st  Exterior.2nd  Mas.Vnr.Type Exter.Qual Exter.Cond
 CompShg:822   VinylSd:262   VinylSd:259          :  4   Ex: 20     Ex:  4    
 Tar&Grv:  8   HdBoard:145   HdBoard:134   BrkCmn :  6   Fa:  8     Fa: 16    
 WdShngl:  2   MetalSd:130   MetalSd:132   BrkFace:263   Gd:266     Gd:103    
 Metal  :  1   Wd Sdng:125   Wd Sdng:119   CBlock :  0   TA:540     Po:  0    
 WdShake:  1   Plywood: 66   Plywood: 85   None   :516              TA:711    
 ClyTile:  0   BrkFace: 32   CmentBd: 30   Stone  : 45                        
 (Other):  0   (Other): 74   (Other): 75                                      
  Foundation  Bsmt.Qual  Bsmt.Cond  Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2
 BrkTil: 88       :  0       :  0       :  1      GLQ    :236    Unf    :711   
 CBlock:381   Ex  : 52   Ex  :  2   Av  :128      Unf    :217    Rec    : 28   
 PConc :351   Fa  : 22   Fa  : 19   Gd  : 75      ALQ    :144    LwQ    : 25   
 Slab  : 11   Gd  :350   Gd  : 29   Mn  : 72      Rec    : 92    BLQ    : 23   
 Stone :  3   Po  :  1   Po  :  1   No  :538      BLQ    : 80    ALQ    : 16   
 Wood  :  0   TA  :389   TA  :763   NA's: 20      (Other): 45    (Other): 11   
              NA's: 20   NA's: 20                 NA's   : 20    NA's   : 20   
  Heating    Heating.QC Central.Air Electrical  Kitchen.Qual   Functional 
 Floor:  0   Ex:407     N: 45            :  0   Ex: 41       Typ    :777  
 GasA :824   Fa: 18     Y:789       FuseA: 44   Fa: 19       Min2   : 21  
 GasW :  7   Gd:145                 FuseF: 12   Gd:327       Min1   : 16  
 Grav :  1   Po:  1                 FuseP:  2   Po:  1       Mod    : 14  
 OthW :  1   TA:263                 Mix  :  0   TA:446       Maj1   :  4  
 Wall :  1                          SBrkr:776                Maj2   :  2  
                                                             (Other):  0  
 Fireplace.Qu  Garage.Type  Garage.Finish Garage.Qual Garage.Cond Paved.Drive
 Ex  : 10     2Types :  8       :  1          :  0        :  0    N: 61      
 Fa  : 22     Attchd :505   Fin :185      Ex  :  1    Ex  :  1    P: 28      
 Gd  :169     Basment:  8   RFn :231      Fa  : 32    Fa  : 19    Y:745      
 Po  : 18     BuiltIn: 43   Unf :382      Gd  :  5    Gd  :  5               
 TA  :201     CarPort:  1   NA's: 35      Po  :  3    Po  :  5               
 NA's:414     Detchd :234                 TA  :757    TA  :768               
              NA's   : 35                 NA's: 36    NA's: 36               
 Pool.QC      Fence     Misc.Feature   Sale.Type  
 Ex  :  1   GdPrv: 39   Elev:  0     WD     :797  
 Fa  :  1   GdWo : 33   Gar2:  2     COD    : 16  
 Gd  :  1   MnPrv:105   Othr:  1     ConLD  :  5  
 TA  :  0   MnWw :  2   Shed: 24     ConLw  :  5  
 NA's:831   NA's :655   TenC:  1     Con    :  4  
                        NA's:806     ConLI  :  3  
                                     (Other):  4  

Numerical Variables with NAs.

  Lot.Frontage     Mas.Vnr.Area     Garage.Yr.Blt 
 Min.   : 21.00   Min.   :   0.00   Min.   :1900  
 1st Qu.: 56.00   1st Qu.:   0.00   1st Qu.:1960  
 Median : 69.00   Median :   0.00   Median :1977  
 Mean   : 68.51   Mean   :  95.62   Mean   :1976  
 3rd Qu.: 80.00   3rd Qu.: 143.00   3rd Qu.:1999  
 Max.   :313.00   Max.   :1290.00   Max.   :2009  
 NA's   :156      NA's   :4         NA's   :36    

Remove Variables.

Factor Variables with NAs.

       Pool.QC   Misc.Feature          Alley          Fence   Fireplace.Qu 
           831            806            781            655            414 
   Garage.Qual    Garage.Cond    Garage.Type  Garage.Finish      Bsmt.Qual 
            36             36             35             35             20 
     Bsmt.Cond  Bsmt.Exposure BsmtFin.Type.1 BsmtFin.Type.2 
            20             20             20             20 

Add NA as a factor level.

Price vs Area and Number of Bedrooms.

Price vs Neighborhood.

Price vs Year Built and Overall Quality.


Part 2 : Development and assessment of an initial model, following a semi-guided process of analysis

2.2 Model Selection

#Takes a data frame and number of predictor to next choose for the highest Adjusted R2.
#Returns a dataframe with list of selected variables along with previously chosen variables
#and corresponding Adjusted R squared.
forward_r2 <- function(df) {
  
  #Number of variables.
  num_of_vars <- dim(df)[2]
  #Number of predictors calculated.
  predictor_num <- 1
  #Adjusted R2 to be greater than.
  prev_R2 <- 0
  #Iterator
  improvement <- TRUE
  #
  final_predictors <- NULL
  final_r2 <- NULL
  
  
  #Iterate while Adjusted R2 increases.
  while (improvement) {
    
    #Lists are initialized as empty
    predictors <- NULL
    positions <- NULL
    r2 <- NULL
    
    for (x in c((predictor_num + 1):num_of_vars)) {

      positions = c(positions, x)
      #If we are selecting the first predictor, only include that one.
      #Otherwise, include all predictors chosen thus far and the selected.
      if (predictor_num == 1) {
        predictors <- c(predictors, colnames(df[x]))  
      } else if (predictor_num == 2) {
        predictors <- c(predictors, paste(colnames(df[2]),
                                          colnames(df[x]),
                                          sep = " + "))
      } else {
        predictors <- c(predictors, paste("Previous Predictors",
                                          colnames(df[x]),
                                          sep = " + "))
      }
      
      #Make a temp dataframe with only the chosen predictors and the current selected one.
      temp_df <- df %>%
        select(colnames(df[c(1:predictor_num, x)]))
      
      #Retrieve the Adjusted R squared from that model.
      temp_r2 <- summary(lm(price ~ ., data = temp_df))$adj.r.squared
      r2 <- c(r2, temp_r2)
    }
    
    #Create a dataframe from the three lists.
    r2_values <- data.frame("Predictor_Variables" = predictors,
                            "Positions" = positions,
                            "Adjusted_R2" = r2) %>%
      arrange(desc(r2))
    
    high_pred <- r2_values$Predictor_Variables[1]
    pred_pos <- r2_values$Positions[1]
    high_r2 <- r2_values$Adjusted_R2[1]
    
    if (high_r2 > prev_R2) {
      
      #Add the predictor(s) with the highest Adjusted R2 thus far.
      final_predictors <- c(final_predictors, high_pred)
      final_r2 <- c(final_r2, high_r2)
      
      #One of the variables is the price.
      if (predictor_num < num_of_vars - 1) {

        if (pred_pos == num_of_vars) {
          #Reorder the df for the next iteration.
          df <- df[c(1 : predictor_num, pred_pos, (predictor_num + 1) : (pred_pos - 1))]
          
        } else if ((predictor_num + 1) < pred_pos) {
          #Reorder the df for the next iteration.
          df <- df[c(1 : predictor_num, pred_pos, (predictor_num + 1) : (pred_pos - 1), (pred_pos + 1) : num_of_vars)]
        }
                
        #Increase the number of predictors for next iteration.
        predictor_num <- predictor_num + 1
        #Increase the Adjusted R2 for next iteration.
        prev_R2 <- high_r2
        
      } else {
        #No more variables remaining.
        improvement <- FALSE
      }
      
    } else {
      #Adjusted R2 did not increase.
      improvement <- FALSE
    }
  }
  model_selection <- data.frame("Predictor_Variables" = final_predictors,
                                "Adjusted_R2" = final_r2)
  model_selection
}

#Model Selection.
forward_r2(ames_train_prelim)

2.3 Initial Model Residuals

Define outliers and graph helper.

Residual Values vs Area.

Residual Values vs Number of Bedrooms.

Residual Values vs Year Built.

Residual Values vs Remodel Year.

Residual Values vs Number of Full Bathrooms.

Residual Values vs Number of Half Bathrooms.

Create Grid.


Part 3 Development of a Final Model


3.1 Transformation

Distribution of Houses by Price Histograms for ames_train and ames_test.

price_train <- ggplot(ames_train, aes(x = price))
price_train <- price_train + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
price_train <- price_train + ggtitle("")
#X-axis
price_train <- price_train + scale_x_continuous(name = "House Prices (Training Set)", expand = c(0,0), labels = dollar)
#Y-axis.
price_train <- price_train + scale_y_continuous(name = "Frequency of Price", limits = c(0, 180), expand = c(0,0))
#Modify labels and text.
price_train <- price_train + theme(plot.title = element_text(hjust = 0, size = 14, face = "bold"),
                                 axis.text.x = element_text(hjust = 1, size = 10, angle = 55), 
                                 axis.title.x = element_text(size = 12, face = "bold"),
                                 axis.text.y = element_text(size = 10),
                                 axis.title.y = element_text(size = 12, face = "bold"))

price_test <- ggplot(ames_test, aes(x = price))
price_test <- price_test + geom_histogram(bins = 30, fill = viridis(7)[4])
#Title.
price_test <- price_test + ggtitle("Distribution of Houses by Price")
#X-axis
price_test <- price_test + scale_x_continuous(name = "House Prices (Testing Set)", expand = c(0,0), labels = dollar)
#Y-axis.
price_test <- price_test + scale_y_continuous(name = "", limits = c(0, 180), expand = c(0,0))
#Modify labels and text.
price_test <- price_test + theme(plot.title = element_text(hjust = -2.7, size = 14, face = "bold"),
                                 axis.text.x = element_text(hjust = 1, size = 10, angle = 55), 
                                 axis.title.x = element_text(size = 12, face = "bold"),
                                 axis.text.y = element_text(size = 10),
                                 axis.title.y = element_text(size = 12, face = "bold"))

grid.arrange(price_train, price_test,
             layout_matrix = matrix(c(1, 2),
                                    ncol = 2))

Log transformations.

Transform House.Age (from Year.Built) and Lot.Area in ames_train and ames_test.


3.4 Model Testing

Two observations with level AsphShn in Exterior.1st.

Remove new levels (3 observations) and add NAs as levels.

Top 20 columns from ames_train and ames_test are subset.

Create 20 different models, in order, and calculate and compare the RMSE of ames_train and ames_test.

RMSE vs Number of Variables for ames_train and ames_test.

RMSE for 7 - 11 variables.


Part 4 : Final Model Assessment

4.1 Final Model Residuals

Linear Relationships Between x and y

Define outliers and graph helper.

Residual Values vs First Floor Area.

Residual Values vs 2nd Floor Area.

Residual Values vs House Age.

Residual Values vs Lot Area.

Create Grid.

Constant Variability of Residuals

Scatter of Residuals vs. Fitted.

Scatter of Absolute Value of Residuals vs. Fitted.

Create Grid.