Part 1: Data

From the Codebook: “The data set is comprised of 651 randomly sampled movies produced and released before 2016.”

Because the Codebook provides no further details about the sampling method, I plotted a few variables to get a sense of how representative the sample is.

First, as the plot above shows, the movies were released in theaters between 1970 and 2014. More movies have been released in recent years, so a left-skewed distribution of release years is not unexpected.

I also took a look at the Oscar categories. The number of best actor wins (93) and best actress wins (72) is quite high for 651 movies. This leads me to believe that better-known, heavily advertised, blockbuster films were more likely to be chosen for the sample. I would have to investigate the sampling method in more detail to draw a firmer conclusion, but that is not the purpose of this project; I merely wanted a more in-depth summary of the movies chosen.

 best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
 no :629      no :644      no :558        no :579          no :608     
 yes: 22      yes:  7      yes: 93        yes: 72          yes: 43     
 top200_box
 no :636   
 yes: 15   

This was an observational study: no hypothesis, controls, or confounding variables were specified beforehand. We can use these data to generalize and identify trends in movies of the past 50 years, but not to make definitive causal arguments.


Part 2: Data manipulation

Cleaning :

Here is a count of the NAs in each of the predictors.
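
A sketch of how such a count could be produced (“movies_sub” is a hypothetical name for a data frame holding just these columns):

#Sketch: number of NAs in each selected column.
colSums(is.na(movies_sub))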

           title       title_type            genre          runtime 
               0                0                0                1 
     mpaa_rating    thtr_rel_year   thtr_rel_month      imdb_rating 
               0                0                0                0 
  imdb_num_votes    critics_score   audience_score     best_pic_nom 
               0                0                0                0 
    best_pic_win   best_actor_win best_actress_win     best_dir_win 
               0                0                0                0 
      top200_box 
               0 
# A tibble: 1 x 3
  title              thtr_rel_year runtime
  <chr>                      <dbl>   <dbl>
1 The End of America          2008      NA

There is a single NA in runtime, for the 2008 documentary “The End of America”. IMDb lists its runtime as 74 minutes.
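
Presumably that value was used to fill in the missing entry; a minimal sketch (assuming the working data frame is named movies, as in the Appendix code):

#Sketch, not the original code: fill the single missing runtime with 74 minutes.
movies <- movies %>%
  mutate(runtime = ifelse(is.na(runtime), 74, runtime))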

Title Type : Feature Film

The “feature_film” category is created from “title_type”.
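
The recoding itself is not shown inline; a minimal sketch (assuming the data frame is named movies, and keeping the “ yes”/“no” labels, leading space included, that appear in the output below). The “drama” and “mpaa_rating_R” indicators further down follow the same pattern:

#Sketch, not the original code: flag feature films; the drama and
#mpaa_rating_R indicators below follow the same pattern on genre and mpaa_rating.
#The " yes" label keeps the leading space seen in the output below.
movies <- movies %>%
  mutate(feature_film = ifelse(title_type == "Feature Film", " yes", "no"))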

# A tibble: 3 x 2
  title_type       n
  <fct>        <int>
1 Documentary     55
2 Feature Film   591
3 TV Movie         5
# A tibble: 2 x 2
  feature_film     n
  <chr>        <int>
1 " yes"         591
2 "no"            60

Genre : Drama

The “drama” category is created from “genre”.

# A tibble: 11 x 2
   genre                         n
   <fct>                     <int>
 1 Action & Adventure           65
 2 Animation                     9
 3 Art House & International    14
 4 Comedy                       87
 5 Documentary                  52
 6 Drama                       305
 7 Horror                       23
 8 Musical & Performing Arts    12
 9 Mystery & Suspense           59
10 Other                        16
11 Science Fiction & Fantasy     9
# A tibble: 2 x 2
  drama      n
  <chr>  <int>
1 " yes"   305
2 "no"     346

MPAA Rating : R

The “mpaa_rating_R” category is created from “mpaa_rating”.

# A tibble: 6 x 2
  mpaa_rating     n
  <fct>       <int>
1 G              19
2 NC-17           2
3 PG            118
4 PG-13         133
5 R             329
6 Unrated        50
# A tibble: 2 x 2
  mpaa_rating_R     n
  <chr>         <int>
1 " yes"          329
2 "no"            322

Theatrical Release Month : Oscar Season

The “oscar_season” category flags movies whose “thtr_rel_month” is October, November, or December.
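
A minimal sketch of this recoding (same assumptions as above); “summer_season” below uses months 5 through 8 in the same way:

#Sketch, not the original code: flag October-December releases;
#summer_season below uses months 5 through 8 the same way.
movies <- movies %>%
  mutate(oscar_season = ifelse(thtr_rel_month %in% c(10, 11, 12), " yes", "no"))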

# A tibble: 12 x 2
   thtr_rel_month     n
            <dbl> <int>
 1              1    69
 2              2    34
 3              3    51
 4              4    45
 5              5    44
 6              6    72
 7              7    48
 8              8    44
 9              9    53
10             10    70
11             11    51
12             12    70
# A tibble: 2 x 2
  oscar_season     n
  <chr>        <int>
1 " yes"         191
2 "no"           460

Theatrical Release Month : Summer Season

The “summer_season” category flags movies whose “thtr_rel_month” is May, June, July, or August.

# A tibble: 12 x 2
   thtr_rel_month     n
            <dbl> <int>
 1              1    69
 2              2    34
 3              3    51
 4              4    45
 5              5    44
 6              6    72
 7              7    48
 8              8    44
 9              9    53
10             10    70
11             11    51
12             12    70
# A tibble: 2 x 2
  summer_season     n
  <chr>         <int>
1 " yes"          208
2 "no"            443

Part 3: Exploratory data analysis

For each of the five new variables, a distribution graph, a table of summary statistics, and a write-up are included. Each density ridge is scaled by height within its own level, so the relative areas of the ridges do not reflect the group sizes within a variable.

For example, for the first new variable, “feature_film”, 591 of the movies are feature films and only 60 are not, yet the two ridges are drawn at comparable sizes.

Vertical lines mark Q1, the median, and Q3. The table provides exact values to complement the graph.


Title Type : Feature Film

Table with Summary Statistics.
                              Movie Type
                    #Total    Feature Film   Documentary or TV Movie
 Audience Score
   Count             651.0           591.0                      60.0
   Min.               11.0            11.0                      19.0
   Median             65.0            62.0                      85.5
   Max.               97.0            97.0                      96.0
   Mean               62.4            60.5                      81.0
   Std. dev.          20.2            19.8                      13.6

Although most movies in the sample are Feature Films, the few Documentaries and TV Movies have a very high average Audience Score. The Audience Scores of Feature Films are spread widely, with two slight peaks at ~45 and ~80.


Genre : Drama

Table with Summary Statistics.
                              Genre
                    #Total    Drama   Other Genre
 Audience Score
   Count             651.0    305.0         346.0
   Min.               11.0     13.0          11.0
   Median             65.0     70.0          61.0
   Max.               97.0     95.0          97.0
   Mean               62.4     65.3          59.7
   Std. dev.          20.2     18.5          21.3

Almost half the movies in the sample are of genre Drama. Their Audience Scores are spread widely, with a prominent peak at ~80. Movies of other genres are also spread widely, with two slight peaks at ~40 and ~85.


MPAA Rating : R

Table with Summary Statistics.
                              MPAA Rating
                    #Total    Rated R   Other Rating
 Audience Score
   Count             651.0      329.0          322.0
   Min.               11.0       14.0           11.0
   Median             65.0       64.0           65.5
   Max.               97.0       97.0           96.0
   Mean               62.4       62.0           62.7
   Std. dev.          20.2       20.2           20.3

Similar to Drama, approximately half the movies in the sample have an MPAA Rating of R. The Audience Scores of Rated-R movies are spread widely, with a noticeable peak at ~40 and a prominent peak at ~80. Movies with other MPAA Ratings are also spread widely, with a noticeable peak at ~80.


Theatrical Release Month : Oscar Season

Table with Summary Statistics.
                              Theatrical Release Month (I)
                    #Total    Oscar Month   Other Month
 Audience Score
   Count             651.0          191.0         460.0
   Min.               11.0           13.0          11.0
   Median             65.0           69.0          64.0
   Max.               97.0           97.0          96.0
   Mean               62.4           63.7          61.8
   Std. dev.          20.2           20.5          20.1

The distribution of Audience Scores for movies during Oscar Months and non-Oscar Months is very similar to that of movies Rated R and those that are not. Movies during Oscar Months have a noticeable peak at ~45 and a prominent peak at ~80. Movies released during other months have a noticeable peak at ~80.


Theatrical Release Month : Summer Season

Table with Summary Statistics.
                              Theatrical Release Month (II)
                    #Total    Summer Month   Other Month
 Audience Score
   Count             651.0           208.0         443.0
   Min.               11.0            11.0          13.0
   Median             65.0            65.0          66.0
   Max.               97.0            94.0          97.0
   Mean               62.4            61.8          62.6
   Std. dev.          20.2            19.9          20.4

Almost a third of the movies were released during the summer months. In my opinion, summer movies tend to be heavily advertised action/adventure films, often highly anticipated sequels. They tend to make money and be popular, since children are not in school, but they are not necessarily ranked highly as favorite movies. Notice that the prominent peak for Summer Months is at ~70, a bit lower than the prominent peak for the other variables. Non-summer months have a slight peak at ~50 and a noticeable peak at ~80.

Summary of New Variables.

The new variables have similar spreads regardless of the value taken. Almost all have an interquartile range of roughly 45 to 80, and almost all have a median and mean around 60 to 70. People tend to vote for movies they like, which explains the medium-to-prominent peak on the right of most distributions. This lack of separation is not ideal for regression.

The noted exception is that Documentaries and TV Movies (non-feature films) tend to rank very high on Audience Score; the prominent peak and the noticeably higher Q1, median, mean, and Q3 attest to this.

Numerical Predictors.

Looking at the correlations among the numerical variables, the Rotten Tomatoes Audience Score and Critics Score sometimes disagree but are highly correlated. Summer blockbusters with heavy action and advertising are sometimes rated highly by audiences while critics are more particular about the acting and writing; conversely, very well made movies that critics rate highly are sometimes dismissed as boring by audiences.

It is also apparent that the Rotten Tomatoes Audience Score and the IMDb Rating are very highly correlated. For all practical purposes, they measure the same quantity on different scales on different websites.

The Critics Score and the IMDb Rating are highly correlated as well. To avoid collinearity, I am removing the IMDb Rating from the model, along with the number of IMDb votes. In effect, I am modeling as if IMDb did not exist; how people perceive a movie should not change because of that.
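
The correlation plot is not reproduced here; a quick numerical check could look like the following sketch (assuming the rating columns are still present in the movies data frame):

#Sketch: pairwise correlations among the rating variables.
movies %>%
  select(audience_score, critics_score, imdb_rating, imdb_num_votes) %>%
  cor() %>%
  round(2)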


Part 4: Modeling

Variables for Model:

Currently the predictors being considered for Audience Score are:

 [1] "feature_film"     "drama"            "runtime"          "mpaa_rating_R"   
 [5] "thtr_rel_year"    "oscar_season"     "summer_season"    "critics_score"   
 [9] "best_pic_nom"     "best_pic_win"     "best_actor_win"   "best_actress_win"
[13] "best_dir_win"     "top200_box"      

Here is the full model from a standard linear regression.


Call:
lm(formula = audience_score ~ ., data = movies_2)

Residuals:
    Min      1Q  Median      3Q     Max 
-39.731  -9.352   0.274   9.555  41.790 

Coefficients:
                      Estimate Std. Error t value             Pr(>|t|)    
(Intercept)          59.978601 104.439129   0.574             0.565973    
feature_filmno        7.933782   2.276353   3.485             0.000525 ***
dramano              -2.263441   1.231165  -1.838             0.066462 .  
runtime               0.069850   0.032905   2.123             0.034159 *  
mpaa_rating_Rno      -0.007311   1.152706  -0.006             0.994941    
thtr_rel_year        -0.015650   0.052080  -0.301             0.763892    
oscar_seasonno        0.609028   1.414150   0.431             0.666856    
summer_seasonno       0.665105   1.347795   0.493             0.621847    
critics_score         0.451313   0.022602  19.968 < 0.0000000000000002 ***
best_pic_nomyes       9.866923   3.697231   2.669             0.007808 ** 
best_pic_winyes      -1.978728   6.465119  -0.306             0.759658    
best_actor_winyes    -1.897722   1.676590  -1.132             0.258106    
best_actress_winyes  -2.483406   1.855222  -1.339             0.181178    
best_dir_winyes      -0.505184   2.456630  -0.206             0.837137    
top200_boxyes         3.811390   3.815805   0.999             0.318251    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 14.2 on 636 degrees of freedom
Multiple R-squared:  0.5179,    Adjusted R-squared:  0.5073 
F-statistic:  48.8 on 14 and 636 DF,  p-value: < 0.00000000000000022

Model Selection Choice and Process:

Many of the coefficients in the full regression model above are not statistically significant. Therefore, I fit a Bayesian Adaptive Sampling (BAS) linear regression model with a Zellner-Siow Cauchy prior on the coefficients, a uniform prior over models, and Markov chain Monte Carlo (MCMC) sampling. I used the ZS-null prior because it is widely used with BAS. I used MCMC because the posterior model probabilities are not known in closed form, yet the ratio between any two posterior model probabilities can be calculated, and MCMC remains practical when there are too many candidate models to enumerate them all.
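
The fitting call is not shown in the report; the following is a minimal sketch of how such a model could be fit with the BAS package (the data frame name movies_2 matches the lm() call above, and the MCMC iteration count is an assumption):

#Sketch, not the original code: BAS regression with a Zellner-Siow Cauchy prior,
#a uniform prior over models, and MCMC sampling of the model space.
library(BAS)
movies.ZS <- bas.lm(audience_score ~ .,
                    data = movies_2,
                    prior = "ZS-null",
                    modelprior = uniform(),
                    method = "MCMC",
                    MCMC.iterations = 10^6)   #Iteration count is an assumption.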

Model diagnostics

On the Posterior Inclusion Probabilities plot on the left, the renormalized marginal likelihoods times prior are virtually identical to the relative Monte Carlo frequencies of sampled models for each variable.

Similarly, on the Posterior Model Probabilities plot on the right, the renormalized marginal likelihoods times prior are virtually identical to the relative Monte Carlo frequencies of sampled models for each model.

Since both plots exhibit convergence, it can be safely assumed that the MCMC simulation has been run a sufficient number of times.

From the Residuals vs Fitted Plot, the spread is reasonably constant over the domain of fitted values by Bayesian Model Averaging. The three outliers are highlighted, which we will ignore in this exploration. The regression line is also displayed, which is virtually identical to the horizontal line y = 0.

From the Cumulative Model Probabilities Plot, there are about 1,900 unique models with MCMC sampling. About 100 models contribute to 90% of the probability, with the rest leveling off and individually not contributing much to the total.

From the Model Complexity Plot, the highest Bayes Factors coincide with four or five predictors.

From the Inclusion Probabilities Plot, it is evident that the Intercept, Feature Film, Critics Score, and Best Picture Nomination have PIPs greater than 0.5, indicating they are consistently important in models for predicting the Audience Score. Runtime and Drama are both between 0.25 and 0.5.

Above is a visual of the top 20 models and the respective variables that are included in each model. Below is a numerical summary of the top 5 models.

                        P(B != 0 | Y)  model 1     model 2     model 3     model 4     model 5
    Intercept              1.00000000   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
    feature_filmno         0.86018677   1.0000   1.0000000   1.0000000   1.0000000   0.0000000
    dramano                0.25423584   0.0000   0.0000000   1.0000000   0.0000000   0.0000000
    runtime                0.39245911   0.0000   1.0000000   0.0000000   1.0000000   0.0000000
    mpaa_rating_Rno        0.05830688   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    thtr_rel_year          0.05672302   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    oscar_seasonno         0.05698547   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    summer_seasonno        0.06524963   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    critics_score          0.99994812   1.0000   1.0000000   1.0000000   1.0000000   1.0000000
    best_pic_nomyes        0.73618469   1.0000   1.0000000   1.0000000   0.0000000   1.0000000
    best_pic_winyes        0.06446533   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    best_actor_winyes      0.08132935   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    best_actress_winyes    0.09598389   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    best_dir_winyes        0.05783386   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    top200_boxyes          0.08627014   0.0000   0.0000000   0.0000000   0.0000000   0.0000000
    BF                             NA   1.0000   0.5445015   0.4466619   0.3593476   0.2289054
    PostProbs                      NA   0.1578   0.0882000   0.0720000   0.0575000   0.0372000
    R2                             NA   0.5086   0.5121000   0.5118000   0.5071000   0.5017000
    dim                            NA   4.0000   5.0000000   5.0000000   4.0000000   3.0000000
    logmarg                        NA 219.8217 219.2137713 219.0157025 218.7981907 218.3472093

The first column of the summary above gives the Marginal Inclusion Probabilities. The other columns indicate which variables are included in each of the first five models. Notice that model 1 includes only the four variables with Marginal Inclusion Probabilities greater than 0.5. Apart from the Intercept, these are the most significant predictors from the initial full model. The next two variables by Marginal Inclusion Probability, dramano and runtime, are included in some of models 2, 3, and 4.

Interpretation of Model Coefficients

                              2.5%        97.5%          beta
Intercept           61.24614069118 63.429658884 62.3625192012
feature_filmno       0.00000000000 10.183174961  5.7545004165
dramano             -3.59079535793  0.000000000 -0.5856351040
runtime             -0.00001117767  0.108028794  0.0272086605
mpaa_rating_Rno     -0.52980008227  0.010233574  0.0007352816
thtr_rel_year       -0.01493934665  0.001098655 -0.0006987118
oscar_seasonno       0.00000000000  0.000000000 -0.0015487517
summer_seasonno     -0.04975483944  0.444459098  0.0424006288
critics_score        0.42080134984  0.517140013  0.4675224048
best_pic_nomyes      0.00000000000 14.149633800  6.8208995639
best_pic_winyes     -0.37932049922  3.437513609  0.0431449185
best_actor_winyes   -1.97428237482  0.000000000 -0.1184197050
best_actress_winyes -2.61508485011  0.001278899 -0.1873369991
best_dir_winyes     -0.49290380591  0.154044331 -0.0246473552
top200_boxyes       -0.24298264177  4.204191831  0.3003693049
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"

The 95% credible interval for the intercept of the regression model is 61.25 to 63.43. This is the interval for the Audience Score of a movie that: 1) is a Feature Film, 2) is of genre Drama, 3) is 0 minutes long, 4) has an MPAA Rating of R, 5) was released in year 0, 6) was released during Summer Season, 7) has a Rotten Tomatoes Critics Score of 0, 8) was neither nominated for nor won Best Picture, 9) has no actors who have won Best Actor, 10) has no actresses who have won Best Actress, 11) has a director who has not won Best Director, and 12) is not in the Top 200 Box Office Movies of All Time.

Most Significant Predictors

These are the predictors that have Marginal Inclusion Probabilities greater than 0.5 and are included in most of the top 20 models.

If a movie is not a Feature Film, its Audience Score increases by 0.00 to 10.18 points with 95% credibility, all else being equal.

For each 1-point increase in the Rotten Tomatoes Critics Score, the Audience Score increases by 0.42 to 0.52 points with 95% credibility.

If a movie was nominated for Best Picture, its Audience Score increases by 0.00 to 14.15 points with 95% credibility.

Slightly Significant Predictors

These are the predictors that have Marginal Inclusion Probabilities slightly less than 0.5 and are included in some of the top 20 models.

If a movie is not a Drama, its Audience Score decreases by 0.00 to 3.59 points with 95% credibility.

For each additional minute of runtime, the Audience Score increases by 0.00 to 0.11 points with 95% credibility.

Other Predictors.

These are the predictors that have Marginal Inclusion Probabilities much less than 0.5 and are included in very few if any of the top 20 models.

If a movie is not Rated R, with 95% credibility its Audience Score changes by anywhere from a 0.53 decrease to a 0.01 increase.

For each year later a movie is released, with 95% credibility its Audience Score changes by anywhere from a 0.015 decrease to a 0.0011 increase.

Oscar Season appears to have no effect.

If a movie was not released during Summer Season, with 95% credibility its Audience Score changes by anywhere from a 0.050 decrease to a 0.44 increase.

If a movie won Best Picture, with 95% credibility its Audience Score changes by anywhere from a 0.38 decrease to a 3.44 increase.

If an actor in the movie has ever won Best Actor, with 95% credibility its Audience Score decreases by 0 to 1.97 points.

If an actress in the movie has ever won Best Actress, with 95% credibility its Audience Score changes by anywhere from a 2.62 decrease to a 0.0013 increase.

If the director of the movie has ever won Best Director, with 95% credibility its Audience Score changes by anywhere from a 0.49 decrease to a 0.15 increase.

If a movie is in the Top 200 Box Office Movies of All Time, with 95% credibility its Audience Score changes by anywhere from a 0.24 decrease to a 4.20 increase.


Part 5: Prediction

Deadpool, which has an Audience Score of 90%, was a 108-minute, Rated-R, Action & Adventure (non-Drama) Feature Film released in 2016 during neither Summer nor Oscar Season. It has a Critics Score of 85%, was never nominated for any Oscars, but is ranked #171 at the box office.

         2.5%    97.5%     pred
[1,] 46.32591 102.6517 74.48878
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"
[1] "Intercept"       "dramano"         "mpaa_rating_Rno" "critics_score"  
[5] "best_pic_nomyes" "best_dir_winyes"

Using the Best Predictive Model (BPM), the 95% credible interval for the Audience Score of any 108-minute, Rated-R Feature Film released in 2016, with a Critics Score of 85% and no Best Picture nomination, is between 46.33 and 100 (capped at 100, the highest possible score). Deadpool, at 90%, lies in this interval. Notice that the numbers below for the Median Probability Model (MPM) are virtually identical.

         2.5%    97.5%    pred
[1,] 46.29436 102.1638 74.2291
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"
[1] "Intercept"       "feature_filmno"  "critics_score"   "best_pic_nomyes"

Part 6: Conclusion

Including the intercept, there are four main predictors, two somewhat important predictors, and several that add very little to the model average.

The credible intervals are extremely wide. More data and relevant predictors would be necessary to tighten the predicted values.

I have already mentioned reservations about the sampling. More movies across a larger span of time could possibly provide a better sample for prediction.

It is important to note that we did not differentiate among the movies that are not Feature Films, among the genres other than Drama, or among the MPAA ratings other than R. We grouped the release months into seasons but did not look at months individually.

I think directors and studios could possibly influence the rating, but they were not included in the model.

Box Office Earnings could possibly influence rating. That metric was not included in the original dataset.

It is important to note that none of these variables has been shown to causally affect the Audience Score. We can only predict an interval of scores based on the characteristics of the movie that showed trends in our sample of 651 movies.

Appendix

All code for each Part is included here.

Part 2: Data manipulation

Part 3: Exploratory data analysis

Title Type : Feature Film

Output : Plot and Table Summary for Feature Film

Genre : Drama

Output : Plot and Table Summary for Drama

MPAA Rating : R

Output : Plot and Table Summary for MPAA Rating R

Theatrical Release Month : Oscar Season

Output : Plot and Table Summary for Oscar Season

#Oscar Season.

#Distribution Plot.
g6 <- ggplot(movies, aes(x = audience_score, y = oscar_season))
#Density Ridges.
g6 <- g6 + stat_density_ridges(aes(fill = ..x..),
                               geom = "density_ridges_gradient",
                               quantile_lines = TRUE,
                               rel_min_height = 0.005,
                               scale = 1.25,
                               size = 0.5)
#Title.
g6 <- g6 + ggtitle("Movie Distribution by Audience Score and Theatrical Release Month (I)")
#X-axis
g6 <- g6 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g6 <- g6 + scale_y_discrete(name = "", limits = c("no", " yes"),
                            labels = c("Other Month", "Oscar Month"))
#Viridis Color Scheme.
g6 <- g6 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
                              guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(size = 12),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.text = element_text(size = 12),
                 legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
#Footnote.
g6 <- g6 + labs(caption = "October, November, and December are Oscar Months.")
g6

#Summary Statistics.
movies_eda %>% 
    tab_cells(audience_score) %>%
    tab_cols(total(), oscar_season) %>% 
    tab_stat_unweighted_valid_n(label = "Count") %>%
    tab_stat_min() %>%
    tab_stat_median() %>%
    tab_stat_max() %>%
    tab_stat_mean() %>%
    tab_stat_sd() %>%
    tab_pivot() %>% 
    set_caption("Table with Summary Statistics.")

Theatrical Release Month : Summer Season

Output : Plot and Table Summary for Summer Season

#Summer Season.

#Distribution Plot.
g7 <- ggplot(movies, aes(x = audience_score, y = summer_season))
#Density Ridges.
g7 <- g7 + stat_density_ridges(aes(fill = ..x..),
                               geom = "density_ridges_gradient",
                               quantile_lines = TRUE,
                               rel_min_height = 0.005,
                               scale = 1.25,
                               size = 0.5)
#Title. 
g7 <- g7 + ggtitle("Movie Distribution by Audience Score and Theatrical Release Month (II)")
#X-axis
g7 <- g7 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g7 <- g7 + scale_y_discrete(name = "", limits = c("no", " yes"),
                            labels = c("Other Month", "Summer Month"))
#Viridis Color Scheme.
g7 <- g7 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
                              guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
                 axis.text.x = element_text(size = 12),
                 axis.title.x = element_text(size = 14, face = "bold"),
                 axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
                 axis.title.y = element_text(size = 14, face = "bold"),
                 legend.text = element_text(size = 12),
                 legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
#Footnote.
g7 <- g7 + labs(caption = "May, June, July, and August are Summer Months.")
g7

#Summary Statistics.
movies_eda %>% 
    tab_cells(audience_score) %>%
    tab_cols(total(), summer_season) %>% 
    tab_stat_unweighted_valid_n(label = "Count") %>%
    tab_stat_min() %>%
    tab_stat_median() %>%
    tab_stat_max() %>%
    tab_stat_mean() %>%
    tab_stat_sd() %>%
    tab_pivot() %>% 
    set_caption("Table with Summary Statistics.")

Part 4: Modeling

Model diagnostics

Bayesian Regression

Plots : Diagnostics 1

#Convergence Plot.
#Equivalent of diagnostics(movies.ZS, ask = F).
g8 <- ggplot(mapping = aes(x = movies.ZS$probne0.RN, y = movies.ZS$probne0.MCMC))
#Scatter Plot.
g8 <- g8 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#y = x.
g8 <- g8 + geom_segment(aes(x = 0, y = 0, xend = 1.05, yend = 1.05), color = viridis(5)[1])
#Title. 
g8 <- g8 + ggtitle("Convergence Plot: Posterior Inclusion Probabilities")
#X-axis
g8 <- g8 + scale_x_continuous("pip (renormalized)",
                              breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
                              expand = c(0,0))
#Y-axis.
g8 <- g8 + scale_y_continuous("pip (MCMC)",
                              breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
                              expand = c(0,0))
#Modify labels and text.
g8 <- g8 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                 axis.text.x = element_text(size = 10),
                 axis.title.x = element_text(size = 12, face = "bold"),
                 axis.text.y = element_text(hjust = 1, size = 10, angle = 35),
                 axis.title.y = element_text(size = 12, face = "bold"))

#Convergence Plot.
g9 <- ggplot(mapping = aes(movies.ZS$postprobs.RN, y = movies.ZS$postprobs.MCMC))
#Scatter Plot.
g9 <- g9 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#y = x.
g9 <- g9 + geom_segment(aes(x = -0.005, y = -0.005, xend = 0.2, yend = 0.2), color = viridis(5)[1])
#Title. 
g9 <- g9 + ggtitle("Convergence Plot: Posterior Model Probabilities")
#X-axis
g9 <- g9 + scale_x_continuous("p(M | Y) (renormalized)",
                              breaks = c(0, 0.05, 0.10, 0.15, 0.20),
                              expand = c(0,0))
#Y-axis.
g9 <- g9 + scale_y_continuous("p(M | Y) (MCMC)",
                              breaks = c(0, 0.05, 0.10, 0.15, 0.20),
                              expand = c(0,0))
#Modify labels and text.
g9 <- g9 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                 axis.text.x = element_text(size = 10),
                 axis.title.x = element_text(size = 12, face = "bold"),
                 axis.text.y = element_text(hjust = 1, size = 10, angle = 35),
                 axis.title.y = element_text(size = 12, face = "bold"))


grid.arrange(g8, g9, ncol = 2)

Plots : Diagnostics 2

#Residuals and Fitted Data.
test <- predict(movies.ZS, movies_2)
temp_df <- data.frame(test$fit) %>%
  mutate(residuals = movies_2$audience_score - test$fit) %>%
  `colnames<-`(c("fitted", "residuals")) %>%
  mutate(outliers = ifelse(residuals >= 2.9*sd(residuals), "yes", "no"))

#Residuals vs Fitted Plot.
#Equivalent of plot(movies.ZS, which = 1, ask = F).
g10 <- ggplot(data = temp_df, mapping = aes(x = fitted, y = residuals))
#Scatter Plot.
g10 <- g10 + geom_point(aes(color = outliers, alpha = outliers), size = 4)
#Best Fit.
g10 <- g10 + stat_smooth(method = "lm", color = viridis(5)[1], se = FALSE)
#Title. 
g10 <- g10 + ggtitle("Residuals vs Fitted")
#X-axis
g10 <- g10 + scale_x_continuous("Predictions under BMA\nbas.lm(audience_score ~.)")
#Y-axis.
g10 <- g10 + scale_y_continuous("Residuals")
#Viridis Color Scheme
g10 <- g10 + scale_color_manual(values = c(viridis(5)[3], viridis(7)[3]))
g10 <- g10 + scale_alpha_manual(values = c(0.2, 1))
#Modify labels and text.
g10 <- g10 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                   axis.text.x = element_text(size = 8),
                   axis.title.x = element_text(size = 10, face = "bold"),
                   axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
                   axis.title.y = element_text(size = 10, face = "bold"),
                   legend.position = "none")

#Model Probabilities Data.
temp_df <- data.frame("Probability" = movies.ZS$postprobs) %>%
  mutate(n = 1:length(movies.ZS$postprobs),
         Cumulative = cumsum(Probability))

logistic_model <- SummarizeGrowth(temp_df$n, temp_df$Cumulative)
df.pred <-  data.frame(n = 1:length(movies.ZS$postprobs),
                       Cumulative = predict(logistic_model$model))

#Model Probabilities.
#Equivalent of plot(movies.ZS, which = 2, ask = F).
g11 <- ggplot(data = temp_df, mapping = aes(x = n, y = Cumulative))
#Scatter Plot.
g11 <- g11 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#Best Fit.
g11 <- g11 + geom_line(df.pred, mapping = aes(y = df.pred[, 2]), color = viridis(5)[1], size = 1.2)

#Title. 
g11 <- g11 + ggtitle("Model Probabilities")
#X-axis
g11 <- g11 + scale_x_continuous("Model Search Order by Probability\nbas.lm(audience_score ~.)")
#Y-axis.
g11 <- g11 + scale_y_continuous("Cumulative Probability")
#Modify labels and text.
g11 <- g11 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                   axis.text.x = element_text(size = 8),
                   axis.title.x = element_text(size = 10, face = "bold"),
                   axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
                   axis.title.y = element_text(size = 10, face = "bold"))

#Model Complexity Data.
temp_df <- data.frame("logmarg" = movies.ZS$logmarg,
                      "size" = movies.ZS$size)
#Model Complexity.
#Equivalent of plot(movies.ZS, which = 3, ask = F).
g12_1 <- ggplot(data = temp_df %>% filter(logmarg > 75), mapping = aes(x = size, y = logmarg))
g12_2 <- ggplot(data = temp_df %>% filter(logmarg < 75), mapping = aes(x = size, y = logmarg))

#Scatter Plot.
g12_1 <- g12_1 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.3)
g12_2 <- g12_2 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.3)
#Title. 
g12_1 <- g12_1 + ggtitle("Model Complexity")
#X-axis.
g12_1 <- g12_1 + scale_x_continuous(limits = c(0, 11),
                                    breaks = c(0, 2, 4, 6, 8, 10))
g12_2 <- g12_2 + scale_x_continuous("Model Dimension\nbas.lm(audience_score ~.)",
                                    limits = c(0, 11),
                                    breaks = c(0, 2, 4, 6, 8, 10))
#Y-axis.
g12_1 <- g12_1 + scale_y_continuous("log(Marginal)")
g12_2 <- g12_2 + scale_y_continuous("", breaks = c(0, 30, 60))
#Modify labels and text.
g12_1 <- g12_1 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                       axis.text.x = element_blank(),
                       axis.ticks.x = element_blank(),
                       axis.title.x = element_blank(),
                       axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
                       axis.title.y = element_text(hjust = 0, size = 10, face = "bold"))

g12_2 <- g12_2 + theme(plot.title = element_blank(),
                       axis.text.x = element_text(size = 8),
                       axis.title.x = element_text(size = 10, face = "bold"),
                       axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
                       panel.grid.minor.y = element_blank(),
                       axis.title.y = element_text(size = 10, face = "bold"))

g12 <- arrangeGrob(g12_1, g12_2, layout_matrix = matrix(c(1,1,2), ncol = 1))

#Inclusion Probabilities Data.
temp_df <- data.frame("Predictor" = factor(movies.ZS$names),
                      "Probability" = movies.ZS$probne0) %>%
   mutate(Predictor = fct_rev(fct_relevel(Predictor, movies.ZS$names)),
         Highlight = ifelse(Probability > 0.6, "yes", "no"))
#Inclusion Probabilities.
#Equivalent of plot(movies.ZS, which = 4, ask = F).
g13 <- ggplot(data = temp_df, mapping = aes(x = Probability, y = Predictor))
#Horizontal Lines.
g13 <- g13 + geom_segment(aes(x = 0, y = Predictor,
                              xend = Probability, yend = Predictor,
                              color = Highlight, size = Highlight))
#Title. 
g13 <- g13 + ggtitle("Inclusion Probabilities")
#X-axis
g13 <- g13 + scale_x_continuous("Marginal Inclusion Probability")
#Y-axis.
g13 <- g13 + scale_y_discrete("bas.lm(audience_score ~.)")
#Viridis Color Scheme.
g13 <- g13 + scale_color_manual(values = c(viridis(5)[1], viridis(5)[3]))
#Scale Size.
g13 <- g13 + scale_size_manual(values = c(1, 3))
#Y-axis labels.
fontcolor <- rev(ifelse(temp_df$Highlight == "yes", viridis(5)[3], viridis(5)[1]))
fontface <- rev(ifelse(temp_df$Highlight == "yes", "bold", "plain"))
#Modify labels and text.
g13 <- g13 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
                   axis.text.x = element_text(size = 8),
                   axis.title.x = element_text(size = 10, face = "bold"),
                   axis.text.y = element_text(hjust = 1, size = 8, angle = 35,
                                              color = fontcolor, face = fontface),
                   axis.title.y = element_text(size = 10, face = "bold"),
                   legend.position = "none")

grid.arrange(g10, g11, g12, g13,
             layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2),
                                          c(3, 3, 3, 3, 3, NA, 4, 4, 4, 4, 4)),
                                    ncol = 11))

Image : Top 20 Models
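
The image itself is not reproduced here; a minimal sketch of the call that would generate it from the fitted movies.ZS object:

#Sketch: variables included in each of the top 20 models.
image(movies.ZS, top.models = 20, rotate = FALSE)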

Output : Marginal Inclusion Probabilities and Top 5 Models
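
A minimal sketch of the call that produces the marginal inclusion probabilities and the top 5 models shown in Part 4:

#Sketch: marginal inclusion probabilities and top 5 models.
summary(movies.ZS)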

Interpretation of Model Coefficients

Output : Coefficients and Intervals
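
A minimal sketch of the calls that produce the coefficient posterior means and 95% credible intervals shown in Part 4:

#Sketch: posterior means and 95% credible intervals of the coefficients.
movies.coef <- coef(movies.ZS)
confint(movies.coef)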