From the Codebook: “The data set is comprised of 651 randomly sampled movies produced and released before 2016.”
Because no further details were provided, I decided to plot a few variables to see how random this sample is.
First, as seen above, the distribution of movies runs from 1970 to 2014. More movies have been released in recent years, so the left skew is not unexpected.
I also took a look at the Oscar categories. The number of Best Actor wins (93) and Best Actress wins (72) is quite high for 651 movies. This leads me to believe that well-known, heavily advertised, blockbuster films were more likely to be chosen for the sample. I would have to investigate the sampling method in more detail to draw a firmer conclusion, but that is not the purpose of this project; I merely wanted a more in-depth summary of the movies chosen.
best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
no :629 no :644 no :558 no :579 no :608
yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
top200_box
no :636
yes: 15
This was an observational study, as no hypotheses, controls, or confounding variables were specified beforehand. We can use this data to generalize and identify trends in movies of the past 50 years, but not to make definitive causal arguments.
Here is an output of all NAs in the predictors.
title title_type genre runtime
0 0 0 1
mpaa_rating thtr_rel_year thtr_rel_month imdb_rating
0 0 0 0
imdb_num_votes critics_score audience_score best_pic_nom
0 0 0 0
best_pic_win best_actor_win best_actress_win best_dir_win
0 0 0 0
top200_box
0
# A tibble: 1 x 3
title thtr_rel_year runtime
<chr> <dbl> <dbl>
1 The End of America 2008 NA
There exists a single NA in runtime for the 2008 Documentary “The End of America”. IMDb provides a runtime of 74 minutes.
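A minimal sketch of how that missing value could be filled in with the IMDb figure, using an illustrative two-row stand-in for the real `movies` data frame:

```r
# Illustrative stand-in for the real `movies` data frame.
movies_toy <- data.frame(
  title   = c("The End of America", "Another Movie"),
  runtime = c(NA, 120)
)

# Replace the single missing runtime with IMDb's reported 74 minutes.
movies_toy$runtime[movies_toy$title == "The End of America"] <- 74
```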
The “feature_film” category is created from “title_type”.
# A tibble: 3 x 2
title_type n
<fct> <int>
1 Documentary 55
2 Feature Film 591
3 TV Movie 5
# A tibble: 2 x 2
feature_film n
<chr> <int>
1 " yes" 591
2 "no" 60
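A minimal sketch of how such a two-level variable can be derived (the leading space in " yes" matches the labels in the tables; `title_type_toy` is illustrative data, and the same pattern applies to "drama" and "mpaa_rating_R" below):

```r
# Illustrative title types; the real data has 651 rows.
title_type_toy <- c("Feature Film", "Documentary", "TV Movie", "Feature Film")

# " yes" if the movie is a feature film, "no" otherwise.
feature_film <- ifelse(title_type_toy == "Feature Film", " yes", "no")
table(feature_film)
```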
The “drama” category is created from “genre”.
# A tibble: 11 x 2
genre n
<fct> <int>
1 Action & Adventure 65
2 Animation 9
3 Art House & International 14
4 Comedy 87
5 Documentary 52
6 Drama 305
7 Horror 23
8 Musical & Performing Arts 12
9 Mystery & Suspense 59
10 Other 16
11 Science Fiction & Fantasy 9
# A tibble: 2 x 2
drama n
<chr> <int>
1 " yes" 305
2 "no" 346
The “mpaa_rating_R” category is created from “mpaa_rating”.
# A tibble: 6 x 2
mpaa_rating n
<fct> <int>
1 G 19
2 NC-17 2
3 PG 118
4 PG-13 133
5 R 329
6 Unrated 50
# A tibble: 2 x 2
mpaa_rating_R n
<chr> <int>
1 " yes" 329
2 "no" 322
The “oscar_season” category is created from “thtr_rel_month”: October, November, and December count as the Oscar season.
# A tibble: 12 x 2
thtr_rel_month n
<dbl> <int>
1 1 69
2 2 34
3 3 51
4 4 45
5 5 44
6 6 72
7 7 48
8 8 44
9 9 53
10 10 70
11 11 51
12 12 70
# A tibble: 2 x 2
oscar_season n
<chr> <int>
1 " yes" 191
2 "no" 460
The “summer_season” category is created from “thtr_rel_month”: May, June, July, and August count as the summer season.
# A tibble: 12 x 2
thtr_rel_month n
<dbl> <int>
1 1 69
2 2 34
3 3 51
4 4 45
5 5 44
6 6 72
7 7 48
8 8 44
9 9 53
10 10 70
11 11 51
12 12 70
# A tibble: 2 x 2
summer_season n
<chr> <int>
1 " yes" 208
2 "no" 443
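Both season flags can be derived directly from the numeric release month; a sketch with illustrative months:

```r
# Illustrative release months (1 = January, ..., 12 = December).
months_toy <- c(1, 5, 7, 10, 12)

# October-December are Oscar months; May-August are summer months.
oscar_season  <- ifelse(months_toy %in% 10:12, " yes", "no")
summer_season <- ifelse(months_toy %in% 5:8,   " yes", "no")
```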
For each of the five new variables, a relative distribution graph, table summary, and write-up are included. Within each value, the density curve is scaled relative to its own height; heights are not comparable across the values of a category.
For example, for the first new variable, “feature_film”, 591 of the movies are feature films and only 60 are not. The relative sizes of the two plots do not reflect that.
Vertical lines representing Q1, the Median, and Q3 are also shown. The table summary provides specific values to complement the graph.
Table with Summary Statistics: Audience Score by Movie Type.

| | #Total | Feature Film | Documentary or TV Movie |
| --- | --- | --- | --- |
| Count | 651.0 | 591.0 | 60.0 |
| Min. | 11.0 | 11.0 | 19.0 |
| Median | 65.0 | 62.0 | 85.5 |
| Max. | 97.0 | 97.0 | 96.0 |
| Mean | 62.4 | 60.5 | 81.0 |
| Std. dev. | 20.2 | 19.8 | 13.6 |
Although most movies in the sample are Feature Films, the few Documentaries and TV Movies have a very high average Audience Score. The Audience Score of Feature Films is spread wide, with two slight peaks at ~45 and ~80.
Table with Summary Statistics: Audience Score by Genre.

| | #Total | Drama | Other Genre |
| --- | --- | --- | --- |
| Count | 651.0 | 305.0 | 346.0 |
| Min. | 11.0 | 13.0 | 11.0 |
| Median | 65.0 | 70.0 | 61.0 |
| Max. | 97.0 | 95.0 | 97.0 |
| Mean | 62.4 | 65.3 | 59.7 |
| Std. dev. | 20.2 | 18.5 | 21.3 |
Almost half the movies in the sample are of genre Drama. The Audience Scores are spread wide, and there is a prominent peak at ~80. Movies of other genres are also spread wide, with two slight peaks at ~40 and ~85.
Table with Summary Statistics: Audience Score by MPAA Rating.

| | #Total | Rated R | Other Rating |
| --- | --- | --- | --- |
| Count | 651.0 | 329.0 | 322.0 |
| Min. | 11.0 | 14.0 | 11.0 |
| Median | 65.0 | 64.0 | 65.5 |
| Max. | 97.0 | 97.0 | 96.0 |
| Mean | 62.4 | 62.0 | 62.7 |
| Std. dev. | 20.2 | 20.2 | 20.3 |
Similar to Drama, approximately half the movies in the sample have an MPAA Rating of R. The Audience Scores are spread wide for movies that are Rated R, with a noticeable peak at ~40 and a prominent peak at ~80. Movies with other MPAA Ratings are also spread wide, with a noticeable peak at ~80.
Table with Summary Statistics: Audience Score by Theatrical Release Month (I).

| | #Total | Oscar Month | Other Month |
| --- | --- | --- | --- |
| Count | 651.0 | 191.0 | 460.0 |
| Min. | 11.0 | 13.0 | 11.0 |
| Median | 65.0 | 69.0 | 64.0 |
| Max. | 97.0 | 97.0 | 96.0 |
| Mean | 62.4 | 63.7 | 61.8 |
| Std. dev. | 20.2 | 20.5 | 20.1 |
The distribution of Audience Scores for movies during Oscar Months and non-Oscar Months is very similar to that of movies Rated R and those that are not. Movies during Oscar Months have a noticeable peak at ~45 and a prominent peak at ~80. Movies released during other months have a noticeable peak at ~80.
Table with Summary Statistics: Audience Score by Theatrical Release Month (II).

| | #Total | Summer Month | Other Month |
| --- | --- | --- | --- |
| Count | 651.0 | 208.0 | 443.0 |
| Min. | 11.0 | 11.0 | 13.0 |
| Median | 65.0 | 65.0 | 66.0 |
| Max. | 97.0 | 94.0 | 97.0 |
| Mean | 62.4 | 61.8 | 62.6 |
| Std. dev. | 20.2 | 19.9 | 20.4 |
Almost a third of the movies were released during the summer months. In my opinion, summer movies tend to be heavily advertised action/adventure films, often highly anticipated sequels. They tend to be popular and make money, as children are not in school; however, they are not necessarily ranked highly as favorite movies. Notice that the prominent peak for Summer Months is at ~70, a bit lower than the prominent peaks for the other variables. Non-summer months have a slight peak at ~50 and a noticeable peak at ~80.
The new variables have similar spreads, regardless of the value taken. Almost all have an interquartile range of roughly 45 to 80, and almost all have a median and mean around 60 to 70. People tend to vote for movies they like, which explains the usually medium-to-prominent peak on the right. This lack of difference is not optimal for regression.
The noted exception is that Documentaries and TV Movies (non-feature films) tend to rank very high on Audience Score, as the prominent peak and the noticeably higher Q1, Median, Mean, and Q3 values attest.
Looking at the correlation of the numerical variables, the Rotten Tomatoes Audience Score and Critics Score do disagree at times, but have a high degree of correlation. Summer blockbusters with action and heavy advertising are sometimes rated highly by audiences, while critics are more particular about the acting and writing. Conversely, very well made movies that critics rate highly are sometimes dismissed as boring by audiences.
It is also apparent that the Rotten Tomatoes Audience Score and the IMDb Rating are very highly correlated. For all practical purposes, they measure the same quantity on different scales on different websites.
The Critics Score and IMDb Rating are highly correlated as well. To reduce collinearity, I am eliminating the IMDb Rating from the model, along with the number of IMDb votes; in effect, I am modeling as if IMDb did not exist. How people perceive a movie should not change because of that.
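The kind of collinearity this screens for can be illustrated with simulated scores (the variable names here are stand-ins, not the real data): two measures of the same underlying opinion correlate strongly, so keeping both adds little information to a regression.

```r
set.seed(1)
# Simulated audience opinion on a 0-100 scale.
audience_toy <- runif(200, 0, 100)
# An IMDb-style 0-10 rating measuring the same opinion with small noise.
imdb_toy <- audience_toy / 10 + rnorm(200, sd = 0.3)

# A correlation near 1 signals collinearity; drop one of the pair.
cor(audience_toy, imdb_toy)
```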
Currently the predictors being considered for Audience Score are:
[1] "feature_film" "drama" "runtime" "mpaa_rating_R"
[5] "thtr_rel_year" "oscar_season" "summer_season" "critics_score"
[9] "best_pic_nom" "best_pic_win" "best_actor_win" "best_actress_win"
[13] "best_dir_win" "top200_box"
Here is the full model from a standard linear regression.
Call:
lm(formula = audience_score ~ ., data = movies_2)
Residuals:
Min 1Q Median 3Q Max
-39.731 -9.352 0.274 9.555 41.790
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 59.978601 104.439129 0.574 0.565973
feature_filmno 7.933782 2.276353 3.485 0.000525 ***
dramano -2.263441 1.231165 -1.838 0.066462 .
runtime 0.069850 0.032905 2.123 0.034159 *
mpaa_rating_Rno -0.007311 1.152706 -0.006 0.994941
thtr_rel_year -0.015650 0.052080 -0.301 0.763892
oscar_seasonno 0.609028 1.414150 0.431 0.666856
summer_seasonno 0.665105 1.347795 0.493 0.621847
critics_score 0.451313 0.022602 19.968 < 0.0000000000000002 ***
best_pic_nomyes 9.866923 3.697231 2.669 0.007808 **
best_pic_winyes -1.978728 6.465119 -0.306 0.759658
best_actor_winyes -1.897722 1.676590 -1.132 0.258106
best_actress_winyes -2.483406 1.855222 -1.339 0.181178
best_dir_winyes -0.505184 2.456630 -0.206 0.837137
top200_boxyes 3.811390 3.815805 0.999 0.318251
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 14.2 on 636 degrees of freedom
Multiple R-squared: 0.5179, Adjusted R-squared: 0.5073
F-statistic: 48.8 on 14 and 636 DF, p-value: < 0.00000000000000022
Many coefficients of the predictors in the above full regression model are not statistically significant. Therefore, I used a Bayesian Adaptive Sampling (BAS) linear regression model with a Zellner-Siow Cauchy prior on the coefficients, a uniform prior over models, and Markov Chain Monte Carlo (MCMC) sampling. I used the ZS-null prior as it is widely used with BAS. I used MCMC because we do not know the posterior model probabilities directly, yet we can calculate the ratio between two posterior model probabilities; MCMC also handles the large number of candidate models that cannot all be enumerated.
On the Posterior Inclusion Probabilities Plot on the left, the renormalized marginal likelihoods times prior probabilities are virtually identical to the relative Monte Carlo frequencies of sampled models for each variable.
Similarly, on the Posterior Model Probabilities Plot on the right, the renormalized marginal likelihoods times prior probabilities are virtually identical to the relative Monte Carlo frequencies of sampled models for each model.
Since both plots exhibit convergence, it can be safely assumed that the MCMC simulation has been run a sufficient number of times.
From the Residuals vs Fitted Plot, the spread is reasonably constant over the domain of fitted values by Bayesian Model Averaging. The three outliers are highlighted, which we will ignore in this exploration. The regression line is also displayed, which is virtually identical to the horizontal line y = 0.
From the Cumulative Model Probabilities Plot, there are about 1,900 unique models with MCMC sampling. About 100 models contribute to 90% of the probability, with the rest leveling off and individually not contributing much to the total.
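The idea behind that cumulative curve can be sketched with a toy probability vector (illustrative numbers, not the actual ~1,900 sampled model probabilities): sort the posterior model probabilities, take the cumulative sum, and find how many models it takes to reach 90%.

```r
# Toy posterior model probabilities, sorted in decreasing order.
probs <- sort(c(0.4, 0.3, 0.15, 0.1, 0.05), decreasing = TRUE)

# Index of the first model at which cumulative probability reaches 90%.
n_models_90 <- which(cumsum(probs) >= 0.9)[1]
```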
From the Model Complexity Plot, the highest Bayes Factors coincide with four or five predictors.
From the Inclusion Probabilities Plot, it is evident that the Intercept, Feature Film, Critics Score, and Best Picture Nomination have PIPs greater than 0.5, indicating they are consistently important in models for predicting the Audience Score. Runtime and Drama are both between 0.25 and 0.5.
Above is a visual of the top 20 models and the respective variables that are included in each model. Below is a numerical summary of the top 5 models.
P(B != 0 | Y) model 1 model 2 model 3 model 4 model 5
Intercept 1.00000000 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
feature_filmno 0.86018677 1.0000 1.0000000 1.0000000 1.0000000 0.0000000
dramano 0.25423584 0.0000 0.0000000 1.0000000 0.0000000 0.0000000
runtime 0.39245911 0.0000 1.0000000 0.0000000 1.0000000 0.0000000
mpaa_rating_Rno 0.05830688 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
thtr_rel_year 0.05672302 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
oscar_seasonno 0.05698547 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
summer_seasonno 0.06524963 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
critics_score 0.99994812 1.0000 1.0000000 1.0000000 1.0000000 1.0000000
best_pic_nomyes 0.73618469 1.0000 1.0000000 1.0000000 0.0000000 1.0000000
best_pic_winyes 0.06446533 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
best_actor_winyes 0.08132935 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
best_actress_winyes 0.09598389 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
best_dir_winyes 0.05783386 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
top200_boxyes 0.08627014 0.0000 0.0000000 0.0000000 0.0000000 0.0000000
BF NA 1.0000 0.5445015 0.4466619 0.3593476 0.2289054
PostProbs NA 0.1578 0.0882000 0.0720000 0.0575000 0.0372000
R2 NA 0.5086 0.5121000 0.5118000 0.5071000 0.5017000
dim NA 4.0000 5.0000000 5.0000000 4.0000000 3.0000000
logmarg NA 219.8217 219.2137713 219.0157025 218.7981907 218.3472093
The first column of the above summary gives the Marginal Inclusion Probabilities. The other columns indicate which variables are included in each of the first five models. Notice that model 1 includes only the four variables with Marginal Inclusion Probabilities > 0.5. Apart from the Intercept, these are the most significant predictors from the initial full model. The next two variables by Marginal Inclusion Probability are dramano and runtime, which are included in some of models 2, 3, and 4.
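That selection rule can be sketched directly: given marginal inclusion probabilities (values copied from the summary above, rounded, intercept omitted), keeping those above 0.5 recovers exactly the non-intercept variables of model 1.

```r
# Marginal inclusion probabilities from the summary (rounded).
mip <- c(feature_filmno  = 0.860,  dramano         = 0.254,
         runtime         = 0.392,  mpaa_rating_Rno = 0.058,
         critics_score   = 0.9999, best_pic_nomyes = 0.736)

# Variables with MIP > 0.5 -- the model 1 predictors.
names(mip[mip > 0.5])
```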
2.5% 97.5% beta
Intercept 61.24614069118 63.429658884 62.3625192012
feature_filmno 0.00000000000 10.183174961 5.7545004165
dramano -3.59079535793 0.000000000 -0.5856351040
runtime -0.00001117767 0.108028794 0.0272086605
mpaa_rating_Rno -0.52980008227 0.010233574 0.0007352816
thtr_rel_year -0.01493934665 0.001098655 -0.0006987118
oscar_seasonno 0.00000000000 0.000000000 -0.0015487517
summer_seasonno -0.04975483944 0.444459098 0.0424006288
critics_score 0.42080134984 0.517140013 0.4675224048
best_pic_nomyes 0.00000000000 14.149633800 6.8208995639
best_pic_winyes -0.37932049922 3.437513609 0.0431449185
best_actor_winyes -1.97428237482 0.000000000 -0.1184197050
best_actress_winyes -2.61508485011 0.001278899 -0.1873369991
best_dir_winyes -0.49290380591 0.154044331 -0.0246473552
top200_boxyes -0.24298264177 4.204191831 0.3003693049
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"
The intercept of the regression model is 61.25 to 63.43, which is the 95% credible interval for the Audience Score of a movie that: 1) is a Feature Film, 2) is of genre Drama, 3) is 0 minutes long, 4) has an MPAA Rating of R, 5) was released in year 0, 6) was released during both Oscar Season and Summer Season (the reference levels, though mutually exclusive in practice), 7) has a Rotten Tomatoes Critics Score of 0, 8) was neither nominated for nor won Best Picture, 9) has no actor who has won Best Actor, 10) has no actress who has won Best Actress, 11) has a director who has not won Best Director, and 12) is not in the Top 200 Box Office Movies of All Time.
These are the predictors that have Marginal Inclusion Probabilities greater than 0.5 and are included in most of the top 20 models.
If a movie is not a Feature Film, 95% of those corresponding Audience Scores would increase by 0.00 - 10.18.
For each 1 point increase of Rotten Tomatoes Critics Score, 95% of those corresponding Audience Scores would increase by 0.42 - 0.52.
If a movie was nominated for Best Picture, 95% of those corresponding Audience Scores would increase by 0.00 - 14.15.
These are the predictors that have Marginal Inclusion Probabilities slightly less than 0.5 and are included in some of the top 20 models.
If a movie is not a Drama, 95% of those corresponding Audience Scores would decrease by 0.00 - 3.59.
For each minute in the runtime of the movie, 95% of those corresponding Audience Scores would increase by 0.00 - 0.11.
These are the predictors that have Marginal Inclusion Probabilities much less than 0.5 and are included in very few if any of the top 20 models.
If a movie is not Rated R, 95% of those corresponding Audience Scores would decrease by 0.53 to increase by 0.10.
For every year the movie is released after year 0, 95% of those corresponding Audience Scores would decrease by 0.015 to increase by 0.0011.
Oscar Season seems to have no effect.
If a movie was not released during Summer Season, 95% of those corresponding Audience Scores would decrease by 0.050 to increase by 0.44.
If a movie won Best Picture, 95% of those corresponding Audience Scores would decrease by 0.38 to increase by 3.44.
If an actor in the movie had won Best Actor at any time, 95% of those corresponding Audience Scores would decrease by 0 to 1.97.
If an actress in the movie had won Best Actress at any time, 95% of those corresponding Audience Scores would decrease by 2.62 to increase by 0.0013.
If the director of the movie had won Best Director at any time, 95% of those corresponding Audience Scores would decrease by 0.49 to increase by 0.15.
If movie is in the Top 200 Box Office Movies of All Time, 95% of those corresponding Audience Scores would decrease by 0.24 to increase by 4.20.
Deadpool, with an Audience Score of 90%, was a 108-minute, Rated-R, Action & Adventure (non-Drama) Feature Film released in 2016 during neither the Summer nor the Oscar Season, with a Critics Score of 85%; it was never nominated for any Oscars, but is ranked #171 at the box office.
2.5% 97.5% pred
[1,] 46.32591 102.6517 74.48878
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"
[1] "Intercept" "dramano" "mpaa_rating_Rno" "critics_score"
[5] "best_pic_nomyes" "best_dir_winyes"
Using the Best Prediction Model, the 95% credible interval for the Audience Score of any 108-minute, Rated-R Feature Film released in 2016, with a Critics Score of 85%, that has never won Best Picture, is between 46.33 and 100 (capped at 100, the highest possible score). Deadpool, at 90%, lies in this interval. Notice that these numbers are virtually identical to those from the Median Prediction Model.
2.5% 97.5% pred
[1,] 46.29436 102.1638 74.2291
attr(,"Probability")
[1] 0.95
attr(,"class")
[1] "confint.bas"
[1] "Intercept" "feature_filmno" "critics_score" "best_pic_nomyes"
Including the intercept, there are four main predictors, two somewhat important predictors, and several that add very little to the model average.
The credible intervals are extremely wide. More data and relevant predictors would be necessary to tighten the predicted values.
I have already mentioned reservations about the sampling. More movies across a larger span of time could possibly provide a better sample for prediction.
It is important to note that we did not differentiate among the movies that are not Feature Films. The same is true for the genres other than Drama, and for the MPAA ratings other than R. We grouped the months but did not look at them individually.
I think the director and studio could also influence the rating, but they were not included in the model.
Box Office Earnings could possibly influence rating. That metric was not included in the original dataset.
It is important to note that none of these variables has been shown to causally affect the Audience Score. We can only predict an interval of scores based on characteristics of the movie that exhibit trends in our sample of 651 movies.
All code for each Part is included here.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
years <- c(1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010)
Plot : Distribution by Year of 651 Randomly Sampled Movies
#Histogram.
sample_length <- max(movies$thtr_rel_year) - min(movies$thtr_rel_year) + 1
g1 <- ggplot(movies, aes(x = thtr_rel_year, fill = cut(thtr_rel_year, sample_length)))
g1 <- g1 + geom_histogram(bins = sample_length)
#Title.
g1 <- g1 + ggtitle(label = "Distribution by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g1 <- g1 + scale_fill_viridis_d()
#X-axis.
g1 <- g1 + scale_x_continuous(name = "Theatrical Release Year",
breaks = years,
labels = years,
expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Number of Movies")
#Modify labels and text. Remove legend from left plot.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g1
Summary: Oscar Categories and Top 200 Boxoffice
summary(movies %>%
select(best_pic_nom,
best_pic_win,
best_actor_win,
best_actress_win,
best_dir_win,
top200_box))
#Temp dataframe for summary statistics.
movies_eda <- movies %>%
select(audience_score, feature_film, drama, mpaa_rating_R, oscar_season, summer_season) %>%
apply_labels(audience_score = "Audience Score",
feature_film = "Movie Type",
feature_film = c("Feature Film" = " yes",
"Documentary or TV Movie" = "no"),
drama = "Genre",
drama = c("Drama" = " yes",
"Other Genre" = "no"),
mpaa_rating_R = "MPAA Rating",
mpaa_rating_R = c("Rated R" = " yes",
"Other Rating" = "no"),
oscar_season = "Theatrical Release Month (I)",
oscar_season = c("Oscar Month" = " yes",
"Other Month" = "no"),
summer_season = "Theatrical Release Month (II)",
summer_season = c("Summer Month" = " yes",
"Other Month" = "no"))
Output : Plot and Table Summary for Feature Film
#Feature Film.
#Distribution Plot.
g3 <- ggplot(movies, aes(x = audience_score, y = feature_film))
#Density Ridges.
g3 <- g3 + stat_density_ridges(aes(fill = ..x..),
geom = "density_ridges_gradient",
quantile_lines = TRUE,
rel_min_height = 0.005,
scale = 1.25,
size = 0.5)
#Title.
g3 <- g3 + ggtitle("Movie Distribution by Audience Score and Film Type")
#X-axis
g3 <- g3 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g3 <- g3 + scale_y_discrete(name = "", limits = c("no", " yes"),
labels = c("Documentary\nor TV Movie", "Feature Film"))
#Viridis Color Scheme.
g3 <- g3 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
axis.title.y = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
g3
#Summary Statistics.
movies_eda %>%
tab_cells(audience_score) %>%
tab_cols(total(), feature_film) %>%
tab_stat_unweighted_valid_n(label = "Count") %>%
tab_stat_min() %>%
tab_stat_median() %>%
tab_stat_max() %>%
tab_stat_mean() %>%
tab_stat_sd() %>%
tab_pivot() %>%
set_caption("Table with Summary Statistics.")
Output : Plot and Table Summary for Feature Film
#Drama.
#Distribution Plot.
g4 <- ggplot(movies, aes(x = audience_score, y = drama))
#Density Ridges.
g4 <- g4 + stat_density_ridges(aes(fill = ..x..),
geom = "density_ridges_gradient",
quantile_lines = TRUE,
rel_min_height = 0.005,
scale = 1.25,
size = 0.5)
#Title.
g4 <- g4 + ggtitle("Movie Distribution by Audience Score and Genre")
#X-axis
g4 <- g4 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g4 <- g4 + scale_y_discrete(name = "", limits = c("no", " yes"),
labels = c("Other Genre", "Drama"))
#Viridis Color Scheme.
g4 <- g4 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g4 <- g4 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
axis.title.y = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
g4
#Summary Statistics.
movies_eda %>%
tab_cells(audience_score) %>%
tab_cols(total(), drama) %>%
tab_stat_unweighted_valid_n(label = "Count") %>%
tab_stat_min() %>%
tab_stat_median() %>%
tab_stat_max() %>%
tab_stat_mean() %>%
tab_stat_sd() %>%
tab_pivot() %>%
set_caption("Table with Summary Statistics.")
Output : Plot and Table Summary for MPAA Rating R
#Rated-R.
#Distribution Plot.
g5 <- ggplot(movies, aes(x = audience_score, y = mpaa_rating_R))
#Density Ridges.
g5 <- g5 + stat_density_ridges(aes(fill = ..x..),
geom = "density_ridges_gradient",
quantile_lines = TRUE,
rel_min_height = 0.005,
scale = 1.25,
size = 0.5)
g5 <- g5 + ggtitle("Movie Distribution by Audience Score and MPAA Rating")
#X-axis
g5 <- g5 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g5 <- g5 + scale_y_discrete(name = "", limits = c("no", " yes"),
labels = c("Other Rating", "Rated R"))
#Viridis Color Scheme.
g5 <- g5 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g5 <- g5 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
axis.title.y = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
g5
#Summary Statistics.
movies_eda %>%
tab_cells(audience_score) %>%
tab_cols(total(), mpaa_rating_R) %>%
tab_stat_unweighted_valid_n(label = "Count") %>%
tab_stat_min() %>%
tab_stat_median() %>%
tab_stat_max() %>%
tab_stat_mean() %>%
tab_stat_sd() %>%
tab_pivot() %>%
set_caption("Table with Summary Statistics.")
Output : Plot and Table Summary for Oscar Season
#Oscar Season.
#Distribution Plot.
g6 <- ggplot(movies, aes(x = audience_score, y = oscar_season))
#Density Ridges.
g6 <- g6 + stat_density_ridges(aes(fill = ..x..),
geom = "density_ridges_gradient",
quantile_lines = TRUE,
rel_min_height = 0.005,
scale = 1.25,
size = 0.5)
#Title.
g6 <- g6 + ggtitle("Movie Distribution by Audience Score and Theatrical Release Month (I)")
#X-axis
g6 <- g6 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g6 <- g6 + scale_y_discrete(name = "", limits = c("no", " yes"),
labels = c("Other Month", "Oscar Month"))
#Viridis Color Scheme.
g6 <- g6 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
axis.title.y = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
#Footnote.
g6 <- g6 + labs(caption = "October, November, and December are Oscar Months.")
g6
#Summary Statistics.
movies_eda %>%
tab_cells(audience_score) %>%
tab_cols(total(), oscar_season) %>%
tab_stat_unweighted_valid_n(label = "Count") %>%
tab_stat_min() %>%
tab_stat_median() %>%
tab_stat_max() %>%
tab_stat_mean() %>%
tab_stat_sd() %>%
tab_pivot() %>%
set_caption("Table with Summary Statistics.")
Output : Plot and Table Summary for Summer Season
#Summer Season.
#Distribution Plot.
g7 <- ggplot(movies, aes(x = audience_score, y = summer_season))
#Density Ridges.
g7 <- g7 + stat_density_ridges(aes(fill = ..x..),
geom = "density_ridges_gradient",
quantile_lines = TRUE,
rel_min_height = 0.005,
scale = 1.25,
size = 0.5)
#Title.
g7 <- g7 + ggtitle("Movie Distribution by Audience Score and Theatrical Release Month (II)")
#X-axis
g7 <- g7 + scale_x_continuous("Rotten Tomatoes Audience Score", breaks = c(0, 25, 50, 75, 100))
#Y-axis.
g7 <- g7 + scale_y_discrete(name = "", limits = c("no", " yes"),
labels = c("Other Month", "Summer Month"))
#Viridis Color Scheme.
g7 <- g7 + scale_fill_viridis(name = "Audience\nScore", breaks = c(0, 25, 50, 75, 100),
guide = guide_colorbar(label.position = "left", hjust = 1))
#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 35),
axis.title.y = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"))
#Footnote.
g7 <- g7 + labs(caption = "May, June, July, and August are Summer Months.")
g7
#Summary Statistics.
movies_eda %>%
tab_cells(audience_score) %>%
tab_cols(total(), summer_season) %>%
tab_stat_unweighted_valid_n(label = "Count") %>%
tab_stat_min() %>%
tab_stat_median() %>%
tab_stat_max() %>%
tab_stat_mean() %>%
tab_stat_sd() %>%
tab_pivot() %>%
set_caption("Table with Summary Statistics.")
Plot : Correlation of Numerical Predictors
numericals <- movies %>%
select(audience_score, critics_score, imdb_rating, imdb_num_votes, runtime, thtr_rel_year)
ggcorr(numericals) + scale_fill_viridis(direction = -1)
Output : Predictors
movies_2 <- movies %>%
select(audience_score, feature_film, drama, runtime,
mpaa_rating_R, thtr_rel_year, oscar_season,
summer_season, critics_score, best_pic_nom,
best_pic_win, best_actor_win, best_actress_win,
best_dir_win, top200_box)
colnames(movies_2)[-1]
Standard Linear Regression
Bayesian Regression
#Regression Model.
movies.ZS <- bas.lm(audience_score ~ .,
data = movies_2,
prior = "ZS-null",
modelprior = uniform(),
method = "MCMC")
Plots : Diagnostics 1
#Convergence Plot.
#Equivalent of diagnostics(movies.ZS, ask = F).
g8 <- ggplot(mapping = aes(x = movies.ZS$probne0.RN, y = movies.ZS$probne0.MCMC))
#Scatter Plot.
g8 <- g8 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#y = x.
g8 <- g8 + geom_segment(aes(x = 0, y = 0, xend = 1.05, yend = 1.05), color = viridis(5)[1])
#Title.
g8 <- g8 + ggtitle("Convergence Plot: Posterior Inclusion Probabilities")
#X-axis
g8 <- g8 + scale_x_continuous("pip (renormalized)",
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
expand = c(0,0))
#Y-axis.
g8 <- g8 + scale_y_continuous("pip (MCMC)",
breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0),
expand = c(0,0))
#Modify labels and text.
g8 <- g8 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 10, angle = 35),
axis.title.y = element_text(size = 12, face = "bold"))
#Convergence Plot.
g9 <- ggplot(mapping = aes(movies.ZS$postprobs.RN, y = movies.ZS$postprobs.MCMC))
#Scatter Plot.
g9 <- g9 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#y = x.
g9 <- g9 + geom_segment(aes(x = -0.005, y = -0.005, xend = 0.2, yend = 0.2), color = viridis(5)[1])
#Title.
g9 <- g9+ ggtitle("Convergence Plot: Posterior Model Probabilities")
#X-axis
g9 <- g9 + scale_x_continuous("p(M | Y) (renormalized)",
breaks = c(0, 0.05, 0.10, 0.15, 0.20),
expand = c(0,0))
#Y-axis.
g9 <- g9 + scale_y_continuous("p(M | Y) (MCMC)",
breaks = c(0, 0.05, 0.10, 0.15, 0.20),
expand = c(0,0))
#Modify labels and text.
g9 <- g9 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_text(size = 10),
axis.title.x = element_text(size = 12, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 10, angle = 35),
axis.title.y = element_text(size = 12, face = "bold"))
grid.arrange(g8, g9, ncol = 2)
Plots : Diagnostics 2
#Residuals and Fitted Data.
test <- predict(movies.ZS, movies_2)
temp_df <- data.frame(test$fit) %>%
mutate(residuals = movies_2$audience_score - test$fit) %>%
`colnames<-`(c("fitted", "residuals")) %>%
mutate(outliers = ifelse(abs(residuals) >= 2.9*sd(residuals), "yes", "no"))
#Residuals vs Fitted Plot.
#Equivalent of plot(movies.ZS, which = 1, ask = F).
g10 <- ggplot(data = temp_df, mapping = aes(x = fitted, y = residuals))
#Scatter Plot.
g10 <- g10 + geom_point(aes(color = outliers, alpha = outliers), size = 4)
#Best Fit.
g10 <- g10 + stat_smooth(method = "lm", color = viridis(5)[1], se = FALSE)
#Title.
g10 <- g10 + ggtitle("Residuals vs Fitted")
#X-axis
g10 <- g10 + scale_x_continuous("Predictions under BMA\nbas.lm(audience_score ~.)")
#Y-axis.
g10 <- g10 + scale_y_continuous("Residuals")
#Viridis Color Scheme
g10 <- g10 + scale_color_manual(values = c(viridis(5)[3], viridis(7)[3]))
g10 <- g10 + scale_alpha_manual(values = c(0.2, 1))
#Modify labels and text.
g10 <- g10 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 10, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
axis.title.y = element_text(size = 10, face = "bold"),
legend.position = "none")
#Model Probabilities Data.
temp_df <- data.frame("Probability" = movies.ZS$postprobs) %>%
mutate(n = 1:length(movies.ZS$postprobs),
Cumulative = cumsum(Probability))
logistic_model <- SummarizeGrowth(temp_df$n, temp_df$Cumulative)
df.pred <- data.frame(n = 1:length(movies.ZS$postprobs),
Cumulative = predict(logistic_model$model))
#Model Probabilities.
#Equivalent of plot(movies.ZS, which = 2, ask = F).
g11 <- ggplot(data = temp_df, mapping = aes(x = n, y = Cumulative))
#Scatter Plot.
g11 <- g11 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.5)
#Best Fit.
g11 <- g11 + geom_line(data = df.pred, mapping = aes(x = n, y = Cumulative), color = viridis(5)[1], size = 1.2)
#Title.
g11 <- g11 + ggtitle("Model Probabilities")
#X-axis
g11 <- g11 + scale_x_continuous("Model Search Order by Probability\nbas.lm(audience_score ~.)")
#Y-axis.
g11 <- g11 + scale_y_continuous("Cumulative Probability")
#Modify labels and text.
g11 <- g11 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 10, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
axis.title.y = element_text(size = 10, face = "bold"))
#Model Complexity Data.
temp_df <- data.frame("logmarg" = movies.ZS$logmarg,
"size" = movies.ZS$size)
#Model Complexity.
#Equivalent of plot(movies.ZS, which = 3, ask = F).
g12_1 <- ggplot(data = temp_df %>% filter(logmarg > 75), mapping = aes(x = size, y = logmarg))
g12_2 <- ggplot(data = temp_df %>% filter(logmarg < 75), mapping = aes(x = size, y = logmarg))
#Scatter Plot.
g12_1 <- g12_1 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.3)
g12_2 <- g12_2 + geom_point(color = viridis(5)[3], size = 4, alpha = 0.3)
#Title.
g12_1 <- g12_1 + ggtitle("Model Complexity")
#X-axis.
g12_1 <- g12_1 + scale_x_continuous(limits = c(0, 11),
breaks = c(0, 2, 4, 6, 8, 10))
g12_2 <- g12_2 + scale_x_continuous("Model Dimension\nbas.lm(audience_score ~.)",
limits = c(0, 11),
breaks = c(0, 2, 4, 6, 8, 10))
#Y-axis.
g12_1 <- g12_1 + scale_y_continuous("log(Marginal)")
g12_2 <- g12_2 + scale_y_continuous("", breaks = c(0, 30, 60))
#Modify labels and text.
g12_1 <- g12_1 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.x = element_blank(),
axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
axis.title.y = element_text(hjust = 0, size = 10, face = "bold"))
g12_2 <- g12_2 + theme(plot.title = element_blank(),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 10, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 8, angle = 35),
panel.grid.minor.y = element_blank(),
axis.title.y = element_text(size = 10, face = "bold"))
g12 <- arrangeGrob(g12_1, g12_2, layout_matrix = matrix(c(1,1,2), ncol = 1))
#Inclusion Probabilities Data.
temp_df <- data.frame("Predictor" = factor(movies.ZS$names),
"Probability" = movies.ZS$probne0) %>%
mutate(Predictor = fct_rev(fct_relevel(Predictor, movies.ZS$names)),
Highlight = ifelse(Probability > 0.6, "yes", "no"))
#Inclusion Probabilities.
#Equivalent of plot(movies.ZS, which = 4, ask = F).
g13 <- ggplot(data = temp_df, mapping = aes(x = Probability, y = Predictor))
#Horizontal Lines.
g13 <- g13 + geom_segment(aes(x = 0, y = Predictor,
xend = Probability, yend = Predictor,
color = Highlight, size = Highlight))
#Title.
g13 <- g13 + ggtitle("Inclusion Probabilities")
#X-axis
g13 <- g13 + scale_x_continuous("Marginal Inclusion Probability")
#Y-axis.
g13 <- g13 + scale_y_discrete("bas.lm(audience_score ~.)")
#Viridis Color Scheme.
g13 <- g13 + scale_color_manual(values = c(viridis(5)[1], viridis(5)[3]))
#Scale Size.
g13 <- g13 + scale_size_manual(values = c(1, 3))
#Y-axis labels.
fontcolor <- rev(ifelse(temp_df$Highlight == "yes", viridis(5)[3], viridis(5)[1]))
fontface <- rev(ifelse(temp_df$Highlight == "yes", "bold", "plain"))
#Modify labels and text.
g13 <- g13 + theme(plot.title = element_text(hjust = 0.5, size = 12, face = "bold"),
axis.text.x = element_text(size = 8),
axis.title.x = element_text(size = 10, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 8, angle = 35,
color = fontcolor, face = fontface),
axis.title.y = element_text(size = 10, face = "bold"),
legend.position = "none")
grid.arrange(g10, g11, g12, g13,
layout_matrix = rbind(c(1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2),
c(3, 3, 3, 3, 3, NA, 4, 4, 4, 4, 4)))
Image : Top 20 Models
Output : Marginal Inclusion Probabilities and Top 5 Models
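Both collapsed outputs can be regenerated directly from the fitted object with BAS's own methods; a sketch:

```r
# Top models, shaded by log posterior odds; column order follows
# the marginal inclusion probabilities.
image(movies.ZS, rotate = FALSE)
# Marginal inclusion probabilities plus the highest-probability models.
summary(movies.ZS)
```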
Deadpool data
#Deadpool data from https://www.rottentomatoes.com/m/deadpool.
deadpool <- data.frame("audience_score" = 90,
"feature_film" = " yes",
"drama" = "no",
"runtime" = 108,
"mpaa_rating_R" = " yes",
"thtr_rel_year" = 2016,
"oscar_season" = "no",
"summer_season" = "no",
"critics_score" = 85,
"best_pic_nom" = "no",
"best_pic_win" = "no",
"best_actor_win" = "no",
"best_actress_win" = "no",
"best_dir_win" = "no",
"top200_box" = "yes")
BPM Output
BPM <- predict(movies.ZS, newdata = deadpool, estimator = "BPM", interval = "prediction", se.fit = TRUE)
confint(BPM)
variable.names(BPM)
HPM Output