From the Codebook: “The data set is comprised of 651 randomly sampled movies produced and released before 2016.”
Because no details about the sampling were given, I plotted a few variables to gauge how random this sample is.
First, as seen above, the movies span theatrical release years from 1970 to 2014. More movies have been released in recent years, so a left-skewed distribution is not unexpected.
I also took a look at the Oscar categories. The number of best actor wins (93) and best actress wins (72) is quite high for 651 movies. This leads me to believe that well-known, heavily advertised, blockbuster films were more likely to be chosen for the sample. I would have to investigate the sampling method in more detail to draw a firmer conclusion, but that is not the purpose of this project; I merely wanted a more in-depth summary of the movies chosen.
best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
no :629 no :644 no :558 no :579 no :608
yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
top200_box
no :636
yes: 15
This was an observational study, as no hypothesis, controls, or confounding variables were specified beforehand. We can use this data to generalize and identify trends in movies of the past 50 years, but not to make definitive causal claims.
As someone who has watched many documentaries, movies, and TV shows, I have regularly used IMDb for over 20 years. I would like to see whether the IMDb rating is reasonably predictable from characteristics of the movie.
I first unselected imdb_url and rt_url, as these merely identify the movie uniquely. I then unselected thtr_rel_day, dvd_rel_year, dvd_rel_month, and dvd_rel_day. I think thtr_rel_year and thtr_rel_month would have a greater impact on the number of votes, and since a DVD release usually follows 4 - 12 months after the theatrical release, including both would raise collinearity issues. I next unselected all Rotten Tomatoes variables, that is, critics_rating, critics_score, audience_rating, and audience_score: since I am focused on IMDb, I am ignoring the ratings and scores from Rotten Tomatoes.
Finally, I unselected actor1, actor2, actor3, actor4, and actor5. These would be too difficult to organize and categorize into a linear model, at least for the scope of this class. I included best_pic_nom, best_pic_win, best_actor_win, best_actress_win, and top200_box, as I think these characteristics would affect the rating.
Using title for identification purposes, the remaining variables are title_type, genre, runtime, mpaa_rating, studio, thtr_rel_year, thtr_rel_month, imdb_rating (dependent variable in this exploration), imdb_num_votes, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, top200_box and director.
Can the Rating on IMDb be predicted from these variables?
Here is a summary of the preliminary variables.
title title_type genre runtime
Length:651 Documentary : 55 Drama :305 Min. : 39.0
Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
Mystery & Suspense: 59 Mean :105.8
Documentary : 52 3rd Qu.:115.8
Horror : 23 Max. :267.0
(Other) : 60 NA's :1
mpaa_rating studio thtr_rel_year
G : 19 Paramount Pictures : 37 Min. :1970
NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
PG :118 Sony Pictures Home Entertainment: 27 Median :2000
PG-13 :133 Universal Pictures : 23 Mean :1998
R :329 Warner Home Video : 19 3rd Qu.:2007
Unrated: 50 (Other) :507 Max. :2014
NA's : 8
thtr_rel_month imdb_rating imdb_num_votes best_pic_nom best_pic_win
Min. : 1.00 Min. :1.900 Min. : 180 no :629 no :644
1st Qu.: 4.00 1st Qu.:5.900 1st Qu.: 4546 yes: 22 yes: 7
Median : 7.00 Median :6.600 Median : 15116
Mean : 6.74 Mean :6.493 Mean : 57533
3rd Qu.:10.00 3rd Qu.:7.300 3rd Qu.: 58301
Max. :12.00 Max. :9.000 Max. :893008
best_actor_win best_actress_win top200_box director
no :558 no :579 no :636 Length:651
yes: 93 yes: 72 yes: 15 Class :character
Mode :character
There seems to be a wide range of genres, and they are distributed reasonably well over the years. I do have some reservations about using genre, however, as a movie can belong to multiple genres.
imdb_num_votes imdb_rating
Min. : 180 Min. :1.900
1st Qu.: 4546 1st Qu.:5.900
Median : 15116 Median :6.600
Mean : 57533 Mean :6.493
3rd Qu.: 58301 3rd Qu.:7.300
Max. :893008 Max. :9.000
I decided to look at the number of votes and the ratings for the movies in the sample. A range of 1.9 - 9.0, with most ratings in 5.0 - 8.0, covers the scale well. The heavy density on the left and the pronounced right skew of the vote counts are to be expected.
First, over 400 directors have directed only one movie in our sample. I am therefore unselecting this variable from the model.
As we can see, over 100 studios have produced only one movie in our sample. Moreover, many studios appear under several different names in the column. For example, “20th Century Fox” appears as all of the following:
[1] "20th Century Fox"
[2] "20th Century Fox Film Corporat"
[3] "Twentieth Century Fox Home Entertainment"
[4] "Fox Searchlight Pictures"
[5] "Fox Atomic"
[6] "20th Century Fox Film Corporation"
[7] "Fox Searchlight"
[8] "Fox"
There are several others. For full details of the cleaning, see the Appendix.
After omitting eight movies with no studio, consolidating the different names for the same studio, and filtering out studios with only a single movie, the distribution is as follows. Note that over 400 movies in the original sample were produced by only 11 studios.
Currently the predictors being considered for IMDb Rating are:
[1] "title_type" "genre" "runtime" "mpaa_rating"
[5] "studio" "thtr_rel_year" "thtr_rel_month" "imdb_num_votes"
[9] "best_pic_nom" "best_pic_win" "best_actor_win" "best_actress_win"
See the Appendix for model details.
[1] "The Adjusted R-Squared from the above variables is 0.4957 ."
[1] "The F-Statistic from the above variables is 6.6 ."
[1] "The p-Value from the above F-Statistic on 72 and 482 degrees of freedom is ~ 0 ."
H0 : The slopes of all variables in the regression model are zero.
HA : At least one slope is different from zero.
Since the p-value from the F-test above is < 0.05, we reject H0.
I will be using the forward selection method, choosing predictors by their adjusted R2. I chose this method because a reliable overall prediction model is my priority; I am not as concerned with the significance of the individual predictors.
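A single round of this forward selection can be sketched in R as follows. This is a minimal sketch using the built-in mtcars data as a stand-in for movies_model; the response and candidate names are illustrative, not the project's actual variables.

```r
# One round of forward selection by adjusted R-squared.
# mtcars stands in for movies_model; mpg stands in for imdb_rating.
data(mtcars)
response   <- "mpg"
selected   <- character(0)                  # predictors already chosen
candidates <- c("cyl", "disp", "hp", "wt")  # predictors still available

adj_r2 <- sapply(candidates, function(v) {
  f <- reformulate(c(selected, v), response = response)
  summary(lm(f, data = mtcars))$adj.r.squared
})

best     <- names(which.max(adj_r2))        # highest adjusted R-squared wins
selected <- c(selected, best)
```

Each subsequent round repeats the loop over the remaining candidates, stopping when no addition raises the adjusted R2.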
Predictor_Variables Adjusted_R2
1 title_type 0.0911757925
2 genre 0.2058854544
3 runtime 0.0930769772
4 mpaa_rating 0.0544662807
5 studio 0.0250511730
6 thtr_rel_year -0.0001270223
7 thtr_rel_month 0.0069380522
8 imdb_num_votes 0.1336030480
9 best_pic_nom 0.0540743214
10 best_pic_win 0.0194410075
11 best_actor_win 0.0034698482
12 best_actress_win 0.0038613392
13 top200_box 0.0084224665
The first round of selection yields genre as having the highest adjusted R2, which is 0.20589.
Predictor_Variables Adjusted_R2
1 genre + title_type 0.2142243
2 genre + runtime 0.2638900
3 genre + mpaa_rating 0.2204991
4 genre + studio 0.2116931
5 genre + thtr_rel_year 0.2085660
6 genre + thtr_rel_month 0.2120711
7 genre + imdb_num_votes 0.3630692
8 genre + best_pic_nom 0.2482785
9 genre + best_pic_win 0.2254451
10 genre + best_actor_win 0.2068519
11 genre + best_actress_win 0.2072701
12 genre + top200_box 0.2201771
The second round of selection yields genre + imdb_num_votes as having the highest adjusted R2, which is 0.36307.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.3748242
2 selected explanatory variables + runtime 0.3722187
3 selected explanatory variables + mpaa_rating 0.3869285
4 selected explanatory variables + studio 0.3741239
5 selected explanatory variables + thtr_rel_year 0.3841490
6 selected explanatory variables + thtr_rel_month 0.3637500
7 selected explanatory variables + best_pic_nom 0.3694659
8 selected explanatory variables + best_pic_win 0.3619455
9 selected explanatory variables + best_actor_win 0.3629337
10 selected explanatory variables + best_actress_win 0.3621119
11 selected explanatory variables + top200_box 0.3622158
The third round of selection yields genre + imdb_num_votes + mpaa_rating as having the highest adjusted R2, which is 0.38693.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.3956541
2 selected explanatory variables + runtime 0.4001415
3 selected explanatory variables + studio 0.3893247
4 selected explanatory variables + thtr_rel_year 0.4033181
5 selected explanatory variables + thtr_rel_month 0.3870388
6 selected explanatory variables + best_pic_nom 0.3935230
7 selected explanatory variables + best_pic_win 0.3857890
8 selected explanatory variables + best_actor_win 0.3872100
9 selected explanatory variables + best_actress_win 0.3861547
10 selected explanatory variables + top200_box 0.3858422
The fourth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year as having the highest adjusted R2, which is 0.40332.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.4124202
2 selected explanatory variables + runtime 0.4101359
3 selected explanatory variables + studio 0.4043084
4 selected explanatory variables + thtr_rel_month 0.4031649
5 selected explanatory variables + best_pic_nom 0.4078319
6 selected explanatory variables + best_pic_win 0.4024575
7 selected explanatory variables + best_actor_win 0.4031490
8 selected explanatory variables + best_actress_win 0.4023141
9 selected explanatory variables + top200_box 0.4022237
The fifth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type as having the highest adjusted R2, which is 0.41242.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + runtime 0.4196435
2 selected explanatory variables + studio 0.4130270
3 selected explanatory variables + thtr_rel_month 0.4122719
4 selected explanatory variables + best_pic_nom 0.4169521
5 selected explanatory variables + best_pic_win 0.4115879
6 selected explanatory variables + best_actor_win 0.4124188
7 selected explanatory variables + best_actress_win 0.4114617
8 selected explanatory variables + top200_box 0.4113380
The sixth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime as having the highest adjusted R2, which is 0.41964.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4210571
2 selected explanatory variables + thtr_rel_month 0.4187164
3 selected explanatory variables + best_pic_nom 0.4221565
4 selected explanatory variables + best_pic_win 0.4193633
5 selected explanatory variables + best_actor_win 0.4186905
6 selected explanatory variables + best_actress_win 0.4185631
7 selected explanatory variables + top200_box 0.4186287
The seventh round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom as having the highest adjusted R2, which is 0.42216.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4227173
2 selected explanatory variables + thtr_rel_month 0.4210947
3 selected explanatory variables + best_pic_win 0.4238089
4 selected explanatory variables + best_actor_win 0.4211023
5 selected explanatory variables + best_actress_win 0.4211543
6 selected explanatory variables + top200_box 0.4211215
The eighth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win as having the highest adjusted R2, which is 0.42381.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4248530
2 selected explanatory variables + thtr_rel_month 0.4227314
3 selected explanatory variables + best_actor_win 0.4227339
4 selected explanatory variables + best_actress_win 0.4227907
5 selected explanatory variables + top200_box 0.4227849
The ninth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win + studio as having the highest adjusted R2, which is 0.42485.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + thtr_rel_month 0.4236725
2 selected explanatory variables + best_actor_win 0.4238159
3 selected explanatory variables + best_actress_win 0.4236672
4 selected explanatory variables + top200_box 0.4237895
No additional predictor is a satisfactory explanatory variable, as every adjusted R2 value is less than 0.42485, so the selection stops here.
The three numerical explanatory variables included in the model are imdb_num_votes, thtr_rel_year, and runtime.
For Number of IMDb Votes, the residuals seem to be randomly scattered around 0; as noted above, the heavy density on the left and pronounced right skew of the vote counts are expected. For Theatrical Release Year, the residuals are randomly scattered around 0. I am not certain whether year is better treated as numerical or categorical, so I kept it numerical. Finally, for Runtime, the plot shows a random scatter around 0 for the vast majority of movies, that is, movies between 90 and 150 minutes long.
There is a slight left skew, but the majority of the residuals fit a normal distribution.
From both plots, it is clear that about 10 - 15 fitted values have residuals that deviate from the others. Since this is approximately 1.8 - 2.7% of the data in the model, the residuals seem to have roughly constant variability.
We do not know the exact method or sampling, nor the order in which the movies were chosen, but the movies selected, and thus residuals, appear to be independent of one another.
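The diagnostics discussed above can be reproduced with base R plots. This is a minimal sketch on a toy lm fit of the built-in mtcars data, standing in for the final model:

```r
# Residual diagnostics for a linear model: scatter around 0,
# near-normality, and roughly constant variance.
fit <- lm(mpg ~ wt + hp, data = mtcars)     # toy stand-in for the final model
res <- resid(fit)

par(mfrow = c(1, 3))
plot(fitted(fit), res,                      # residuals vs. fitted values:
     xlab = "Fitted values",                # look for random scatter around 0
     ylab = "Residuals")
abline(h = 0, lty = 2)
hist(res, main = "Residuals", xlab = "")    # roughly normal, centered at 0
qqnorm(res); qqline(res)                    # tails reveal any skew
```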
For full details of the final regression model, see the Appendix.
The intercept of the regression model is 33.46, which would be the IMDb rating of a movie that: 1) is of genre Action & Adventure, 2) received 0 votes on IMDb, 3) is rated G, 4) was released in theaters in year 0, 5) is a Documentary, 6) is 0 minutes long, 7) has neither been nominated for nor won Best Picture, and 8) was produced by A24.
There are 10 genre coefficients, which would be expected to change the IMDb rating from -0.05 (Animation) to +1.37 (Musical & Performing Arts).
For every 1,000 people who vote for the movie, the expected change in the IMDb rating would be +0.0036. If 400,000 people vote for a movie, for example, the expected change would be +1.44.
There are 4 MPAA rating coefficients, which would be expected to change the IMDb rating from -0.62 (PG-13) to -0.18 (Unrated).
For each year the movie is released in theaters after year 0, the expected change to the IMDb Rating would be -0.0136. A movie released in 1980, for example, would be expected to change the IMDb Rating by -26.9.
For each minute of runtime of the movie, the expected change to the IMDb Rating would be +0.006487. A 120 minute movie, for example, would be expected to change the IMDb Rating by +0.78.
There are two additional title types. The vast majority of movies are Feature Films, which would be expected to change the IMDb Rating by -1.04. TV Movies would be expected to change the IMDb Rating by +0.15.
If a movie has received a Best Picture Nomination, the expected change to the IMDb Rating would be +0.45.
If a movie has won the Best Picture, the expected change to the IMDb Rating would be -0.69.
There are 47 studio coefficients, which would be expected to change the IMDb rating from -0.67 (Hollywood Pictures) to +1.13 (Republic Pictures).
Deadpool, with an IMDb rating of 8.0, was a 108-minute-long, Action & Adventure, R-rated Feature Film released in 2016 by 20th Century Fox that has received 872,845 votes on IMDb and was never nominated for Best Picture (and hence did not win, either).
fit lwr upr
1 8.555389 6.838782 10.272
Our 95% prediction interval for the IMDb rating of any movie that is a 108-minute-long, Action & Adventure, R-rated Feature Film released in 2016 by 20th Century Fox, has received 872,845 votes on IMDb, and was never nominated for Best Picture (and hence did not win, either) is between 6.84 and 10 (as 10 is the maximum score). Deadpool, at 8.0, lies in this interval.
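An interval like this comes from predict() with interval = "prediction". The sketch below uses a toy mtcars fit; with the final model, the newdata row would instead hold the movie's characteristics (genre, runtime, votes, and so on).

```r
# 95% prediction interval for a single new observation.
fit <- lm(mpg ~ wt + hp, data = mtcars)    # toy stand-in for the final model
new_obs <- data.frame(wt = 3.0, hp = 150)  # hypothetical new data point
pi <- predict(fit, newdata = new_obs,
              interval = "prediction", level = 0.95)
pi  # columns: fit (point prediction), lwr and upr (interval bounds)
```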
[1] "The R-squared value is 0.4954"
This means the final regression model of nine variables explains only about 50% of the variability in the IMDb rating.
I have already mentioned reservations about the sampling, as well as how different genre types are not mutually exclusive for a movie. In addition, I purposely excluded actors and actresses from the model due to the complexity, but I think they would affect the rating, as people can be biased towards favorites and rate them higher.
I think directors could possibly influence rating, but more data per director would have to be available.
Box Office Earnings could possibly influence rating. That metric was not included in the original dataset.
It is important to note that none of these variables has been shown to causally affect the IMDb rating. We can only predict an interval of ratings based on characteristics of the movie that show trends in our reduced sample of 555 movies.
The data is also not static; it is dynamic. For example, take Django Unchained, the movie with the highest number of votes in the dataset.
title imdb_rating imdb_num_votes
1 Django Unchained 8.5 893008
In reality, Django Unchained has a rating of 8.4 from over 1,299,000 votes, as new votes come in regularly on IMDb, even for older movies. The final project dataset, like many other aspects of this course, has not been updated.
All code for each Part is included here.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
years <- c(1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010)
Plot : Distribution by Year of 651 Randomly Sampled Movies
#Histogram.
sample_length <- max(movies$thtr_rel_year) - min(movies$thtr_rel_year) + 1
g1 <- ggplot(movies, aes(x = thtr_rel_year, fill = cut(thtr_rel_year, sample_length)))
g1 <- g1 + geom_histogram(bins = sample_length)
#Title.
g1 <- g1 + ggtitle(label = "Distribution by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g1 <- g1 + scale_fill_viridis_d()
#X-axis.
g1 <- g1 + scale_x_continuous(name = "Theatrical Release Year",
breaks = years,
labels = years,
expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Number of Movies")
#Modify labels and text. Remove legend from left plot.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g1
Summary : Oscar Categories and Top 200 Box Office
summary(movies %>%
select(best_pic_nom,
best_pic_win,
best_actor_win,
best_actress_win,
best_dir_win,
top200_box))
Summary : Preliminary Model Variables
movies_model <- movies %>% select(title, title_type, genre,
runtime, mpaa_rating, studio,
thtr_rel_year, thtr_rel_month,
imdb_rating, imdb_num_votes,
best_pic_nom, best_pic_win,
best_actor_win, best_actress_win,
top200_box, director)
summary(movies_model)
Plot : Distribution of Genres by Year of 651 Randomly Sampled Movies
#Boxplot.
g2 <- ggplot(movies_model, aes(x = genre, y = thtr_rel_year))
g2 <- g2 + geom_boxplot(aes(color = genre), size = 1.3)
#Title.
g2 <- g2 + ggtitle(label = "Distribution of Genres by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g2 <- g2 + scale_color_viridis_d()
#X-axis.
g2 <- g2 + scale_x_discrete(name = "Genre",
expand = c(0,0))
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Theatrical Release Year")
#Modify labels and text. Remove legend from left plot.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g2
Summary: IMDb Information by Year of 651 Randomly Sampled Movies
Plot : IMDb Information by Year of 651 Randomly Sampled Movies
#Scatter plot.
g3 <- ggplot(movies_model %>% filter(imdb_num_votes <= 100000), aes(x = imdb_num_votes, y = imdb_rating))
g3 <- g3 + geom_point(aes(color = thtr_rel_year), size = 5, alpha = 0.5)
#Title.
g3 <- g3 + ggtitle(label = "")
#Viridis Color Scheme.
g3 <- g3 + scale_color_viridis()
#X-axis.
g3 <- g3 + scale_x_continuous(name = "Number of Votes on IMDb",
labels = comma,
expand = c(0.005,0.005))
#Y-axis.
g3 <- g3 + scale_y_continuous(name = "Rating on IMDb",
limits = c(1.9, 9),
breaks = c(2, 4, 6, 8),
labels = c("2.0", "4.0", "6.0", "8.0"))
#Modify labels and text. Remove legend from left plot.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(hjust = 1, size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g4 <- ggplot(movies_model %>% filter(imdb_num_votes >= 100000), aes(x = imdb_num_votes, y = imdb_rating))
g4 <- g4 + geom_point(aes(color = thtr_rel_year), size = 5, alpha = 0.5)
#Title.
g4 <- g4 + ggtitle(label = "IMDb Information by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g4 <- g4 + scale_color_viridis(name = "Theatrical Release Year",
limits = c(1970, 2014),
breaks = c(1975, 1985, 1995, 2005))
g4 <- g4 + guides(color = guide_colorbar(title.position = "top"))
#X-axis.
g4 <- g4 + scale_x_continuous(name = "",
labels = comma,
expand = c(0,0))
#Y-axis.
g4 <- g4 + scale_y_continuous(name = "", position = "right",
limits = c(5, 9.3),
breaks = c(6, 8),
labels = c("6.0", "8.0"),
expand = c(0,0))
#Modify labels and text. Remove legend from left plot.
g4 <- g4 + theme(plot.title = element_text(hjust = 1.25, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_blank())
#Modify legend text. Save as object.
legend3 <- g_legend(g4 + theme(legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"),
legend.key.width = unit(1.5, "cm"),
legend.position = "bottom"))
#Remove legend from right plot.
g4 <- g4 + theme(legend.position = "none")
#Arrange left plot, right plot, and legend on the bottom.
grid.arrange(g3, g4, legend3, layout_matrix = matrix(rbind(c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, NA, NA),
c(1, 1, 1, 3, 3),
c(1, 1, 1, NA, NA)),
ncol = 5))
Plot : Distribution of 651 Movies by Studio and Distribution of 651 Movies by Director
#Studio Tally.
studio_df <- data.frame(movies_model %>% group_by(studio) %>% tally())
#Distribution of Movies by Studio.
g5 <- ggplot(studio_df, aes(x = n)) + geom_histogram(bins = 37, fill = viridis(4)[2])
#Title.
g5 <- g5 + ggtitle("Distribution of 651 Movies by Studio")
#X-axis
g5 <- g5 + scale_x_continuous("Number of Movies by Studio")
#Y-axis.
g5 <- g5 + scale_y_continuous(name = "Frequency of Studios")
#Modify labels and text.
g5 <- g5 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Director Tally.
director_df <- data.frame(movies_model %>% group_by(director) %>% tally())
#Distribution of Movies by Director.
g6 <- ggplot(director_df, aes(x = n)) + geom_histogram(bins = 4, fill = viridis(4)[3])
#Title.
g6 <- g6 + ggtitle("Distribution of 651 Movies by Director")
#X-axis
g6 <- g6 + scale_x_continuous("Number of Movies by Director")
#Y-axis.
g6 <- g6 + scale_y_continuous(name = "Frequency of Directors")
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g6, g5, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2)),
ncol = 13))
Output : 20th Century Fox names.
movies_model$studio <- as.character(movies_model$studio)
#20th Century Fox Output.
unique(movies_model[grepl("Fox", movies_model$studio),]$studio)
Cleaning : Consolidating Similar Studio Names.
#Studios with repeats.
studios <- c("A24", "Anchor Bay", "Buena Vista", "Embassy", "Gravitas", "HBO", "IFC",
"Indomina", "Miramax", "National Geographic", "New Line", "Newmarket", "Nordisk",
"Orion", "Paramount", "Roadside", "Sony", "Touchstone",
"Trimark", "Universal", "Weinstein")
for (studio_n in studios) {
movies_model[grepl(studio_n, movies_model$studio),]$studio <- studio_n
}
#Columbia/TriStar, Walt Disney, DreamWorks, 20th Century Fox, LionsGate, Magnet/Magnolia, MGM/UA, Warner
movies_model[grepl("Columbia|TriStar", movies_model$studio),]$studio <- "Columbia Tristar Pictures"
movies_model[grepl("Disney", movies_model$studio),]$studio <- "Walt Disney"
movies_model[grepl("Dream", movies_model$studio),]$studio <- "DreamWorks"
movies_model[grepl("Fox", movies_model$studio),]$studio <- "20th Century Fox"
movies_model[grepl("Lions", movies_model$studio),]$studio <- "LionsGate"
movies_model[grepl("Magnet|Magnolia", movies_model$studio),]$studio <- "Magnolia Pictures"
movies_model[grepl("Goldwyn|MGM|UA|United Artists", movies_model$studio),]$studio <- "MGM/UA"
movies_model[grepl("Warner|WARNER", movies_model$studio),]$studio <- "Warner Brothers"
#New Studio Tally and filter.
studio_df <- data.frame(movies_model %>%
filter(!(is.na(studio))) %>%
group_by(studio) %>%
tally()) %>%
mutate(studio = as.factor(studio)) %>%
filter(n >= 2)
Plots : Distribution of 555 Movies by Studio
#Distribution of Movies by Studio.
g7 <- ggplot(studio_df, aes(x = n)) + geom_histogram(bins = 70, fill = viridis(4)[2])
#Title.
g7 <- g7 + ggtitle("")
#X-axis
g7 <- g7 + scale_x_continuous("")
#Y-axis.
g7 <- g7 + scale_y_continuous(name = "Frequency of Studios")
#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Distribution of Movies by Studio.
g8 <- ggplot(studio_df %>% filter(n > 15), aes(x = n, y = reorder(studio, n)))
g8 <- g8 + geom_bar(stat = "identity", fill = viridis(4)[1], orientation = "y")
#Title.
g8 <- g8 + ggtitle(paste("Distribution of", sum(studio_df$n), "Movies by Studio"))
#X-axis
g8 <- g8 + scale_x_continuous(name = "Number of Movies by Studio")
#Y-axis.
#g8 <- g8 + scale_y_discrete("Studios")
#Modify labels and text.
g8 <- g8 + theme(plot.title = element_text(hjust = -2.9, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(hjust = -1.25, size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 25),
axis.title.y = element_blank())
grid.arrange(g7, g8, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)),
ncol = 13))
Output : Current Predictors
movies_model <- movies_model %>%
filter(studio %in% studio_df$studio) %>%
select(c(9, 2:8, 10:15))
colnames(movies_model)[2:13]
#Regression model with all preliminary variables.
prelim_reg <- summary(lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year
+ thtr_rel_month + studio + imdb_num_votes + best_pic_nom + best_pic_win
+ best_actor_win + best_actress_win + top200_box,
data = movies_model))
#R-Squared.
ars <- round(prelim_reg$adj.r.squared, 4)
#F-Statistic
fstat <- prelim_reg$fstatistic
#p-Value.
pVal <- round(pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE), 4)
print(paste("The Adjusted R-Squared from the above variables is", ars, "."))
print(paste("The F-Statistic from the above variables is", round(fstat[1], 1), "."))
print(paste("The p-Value from the above F-Statistic on",
round(fstat[2], 1), "and",
round(fstat[3], 1), "degrees of freedom is ~", pVal, "."))
Summary: Full Preliminary Model
Call:
lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
thtr_rel_year + thtr_rel_month + studio + imdb_num_votes +
best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
top200_box, data = movies_model)
Residuals:
Min 1Q Median 3Q Max
-3.9881 -0.4175 0.0546 0.4730 2.0034
Coefficients:
Estimate Std. Error t value
(Intercept) 33.0920647282 8.9796994549 3.685
title_typeFeature Film -1.0397740139 0.3955394783 -2.629
title_typeTV Movie 0.1569482647 0.7716563918 0.203
genreAnimation -0.0427513898 0.3449180413 -0.124
genreArt House & International 0.8237201801 0.3162499180 2.605
genreComedy 0.1067288104 0.1466946206 0.728
genreDocumentary 1.1086552693 0.4113052952 2.695
genreDrama 0.7848739190 0.1285059802 6.108
genreHorror 0.0996561542 0.2374034042 0.420
genreMusical & Performing Arts 1.3732533822 0.3033648767 4.527
genreMystery & Suspense 0.3906291051 0.1683403723 2.320
genreOther 0.5741568102 0.2693976098 2.131
genreScience Fiction & Fantasy -0.0149705049 0.3339902503 -0.045
runtime 0.0066409354 0.0026380976 2.517
mpaa_ratingPG -0.3996056197 0.2453697349 -1.629
mpaa_ratingPG-13 -0.6145918371 0.2630071888 -2.337
mpaa_ratingR -0.3183288548 0.2542493338 -1.252
mpaa_ratingUnrated -0.1771660433 0.3555901792 -0.498
thtr_rel_year -0.0134267072 0.0044947647 -2.987
thtr_rel_month 0.0006803687 0.0110708432 0.061
studioA24 -0.0118077000 0.4991055127 -0.024
studioAnchor Bay 0.3488829130 0.5098301262 0.684
studioArtisan Entertainment 0.3156037544 0.5960848194 0.529
studioBuena Vista 0.0651250867 0.2340550741 0.278
studioColumbia Tristar Pictures 0.2805988756 0.2767933755 1.014
studioCowboy Pictures 0.1760188295 0.6108859028 0.288
studioDreamWorks -0.0698553513 0.6066163605 -0.115
studioEmbassy 0.7988596336 0.6023212550 1.326
studioFilmDistrict 0.4385913468 0.5972519193 0.734
studioFirst Run Features -0.3683156401 0.4421053139 -0.833
studioFocus Features 0.2439760377 0.4312608928 0.566
studioGravitas 0.4573032239 0.6114470485 0.748
studioHBO -0.1884186341 0.3547275139 -0.531
studioHollywood Pictures -0.6761898877 0.4892543212 -1.382
studioIFC 0.3325683127 0.2584900055 1.287
studioImage Entertainment -0.0023122818 0.5087688957 -0.005
studioIndependent Pictures -0.0458056070 0.5691689725 -0.080
studioIndomina -0.3372124267 0.6053969672 -0.557
studioLionsGate -0.2676692262 0.2231946749 -1.199
studioLive Home Video -0.1519203660 0.5959082579 -0.255
studioMagnolia Pictures 0.2528784405 0.2884313161 0.877
studioMGM/UA 0.0539310824 0.1813265095 0.297
studioMiramax 0.0935375321 0.2058216185 0.454
studioMusic Box Films 1.0857080278 0.6113035762 1.776
studioNational Geographic 0.2854352063 0.5402568747 0.528
studioNelson Entertainment -0.4772252575 0.6063003738 -0.787
studioNew Line -0.4011971216 0.2319546896 -1.730
studioNew Yorker Films 0.8142511125 0.5012171781 1.625
studioNewmarket 0.0844994970 0.6107747531 0.138
studioNordisk 0.9131957062 0.6103857406 1.496
studioOrion -0.0332412388 0.2783928875 -0.119
studioOverture Films 0.5818489073 0.5989521956 0.971
studioParamount -0.0911573044 0.1611513905 -0.566
studioRepublic Pictures Home Video 1.1194647737 0.5944461941 1.883
studioRoadside 0.4190381879 0.5038369516 0.832
studioScreen Gems 0.3187952270 0.4903305809 0.650
studioSony 0.0857688064 0.1629073214 0.526
studioStrand Releasing 0.8165781624 0.6577240461 1.242
studioSummit Entertainment 0.8088711682 0.5162408464 1.567
studioThinkFilm 0.3218437238 0.5033122526 0.639
studioTouchstone 0.3437929033 0.3968596527 0.866
studioTrimark 0.6428141299 0.4913450418 1.308
studioUniversal 0.3358562362 0.1785847796 1.881
studioUSA Films 0.2392586118 0.5945411010 0.402
studioWalt Disney 0.5226832631 0.3457106508 1.512
studioWarner Brothers 0.0837387292 0.1519591673 0.551
studioWeinstein -0.1152029489 0.2729688309 -0.422
imdb_num_votes 0.0000035949 0.0000003944 9.116
best_pic_nomyes 0.4635658556 0.2278296436 2.035
best_pic_winyes -0.7013182602 0.4187856116 -1.675
best_actor_winyes -0.0397421630 0.1083895160 -0.367
best_actress_winyes 0.0001336515 0.1149282736 0.001
top200_boxyes 0.0786191591 0.2383003162 0.330
Pr(>|t|)
(Intercept) 0.000254 ***
title_typeFeature Film 0.008844 **
title_typeTV Movie 0.838915
genreAnimation 0.901409
genreArt House & International 0.009481 **
genreComedy 0.467238
genreDocumentary 0.007275 **
genreDrama 0.00000000208 ***
genreHorror 0.674836
genreMusical & Performing Arts 0.00000755615 ***
genreMystery & Suspense 0.020732 *
genreOther 0.033573 *
genreScience Fiction & Fantasy 0.964267
runtime 0.012148 *
mpaa_ratingPG 0.104054
mpaa_ratingPG-13 0.019859 *
mpaa_ratingR 0.211165
mpaa_ratingUnrated 0.618549
thtr_rel_year 0.002959 **
thtr_rel_month 0.951022
studioA24 0.981135
studioAnchor Bay 0.494107
studioArtisan Entertainment 0.596729
studioBuena Vista 0.780942
studioColumbia Tristar Pictures 0.311211
studioCowboy Pictures 0.773366
studioDreamWorks 0.908370
studioEmbassy 0.185368
studioFilmDistrict 0.463093
studioFirst Run Features 0.405204
studioFocus Features 0.571842
studioGravitas 0.454883
studioHBO 0.595550
studioHollywood Pictures 0.167586
studioIFC 0.198858
studioImage Entertainment 0.996376
studioIndependent Pictures 0.935890
studioIndomina 0.577779
studioLionsGate 0.231015
studioLive Home Video 0.798879
studioMagnolia Pictures 0.381066
studioMGM/UA 0.766270
studioMiramax 0.649703
studioMusic Box Films 0.076355 .
studioNational Geographic 0.597512
studioNelson Entertainment 0.431604
studioNew Line 0.084336 .
studioNew Yorker Films 0.104913
studioNewmarket 0.890023
studioNordisk 0.135283
studioOrion 0.905005
studioOverture Films 0.331814
studioParamount 0.571886
studioRepublic Pictures Home Video 0.060274 .
studioRoadside 0.405994
studioScreen Gems 0.515896
studioSony 0.598791
studioStrand Releasing 0.215017
studioSummit Entertainment 0.117806
studioThinkFilm 0.522833
studioTouchstone 0.386766
studioTrimark 0.191404
studioUniversal 0.060622 .
studioUSA Films 0.687549
studioWalt Disney 0.131212
studioWarner Brothers 0.581847
studioWeinstein 0.673186
imdb_num_votes < 0.0000000000000002 ***
best_pic_nomyes 0.042427 *
best_pic_winyes 0.094652 .
best_actor_winyes 0.714033
best_actress_winyes 0.999073
top200_boxyes 0.741607
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8168 on 482 degrees of freedom
Multiple R-squared: 0.4957, Adjusted R-squared: 0.4204
F-statistic: 6.58 on 72 and 482 DF, p-value: < 0.00000000000000022
Select the first variable that yields the highest adjusted R2.
#Selection 1.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(2:14)]) {
Predictors <- c(Predictors, x)
temp_df <- movies_model %>%
select(imdb_rating, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "depen"))
new_R2 <- summary(lm(imdb_rating ~ depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the second variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 2.
movies_model <- movies_model %>%
select(1, 3, 2, 4:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(3:14)]) {
Predictors <- c(Predictors, paste("genre + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the third variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 3.
movies_model <- movies_model %>%
select(1, 2, 9, 3:8, 10:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(4:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the fourth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 4.
movies_model <- movies_model %>%
select(1:3, 6, 4, 5, 7:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(5:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the fifth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 5.
movies_model <- movies_model %>%
select(1:4, 8, 5:7, 9:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(6:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year" , "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the sixth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 6.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(7:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the seventh variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 7.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(8:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the eighth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 8.
movies_model <- movies_model %>%
select(1:7, 10, 8, 9, 11:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(9:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the ninth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 9.
movies_model <- movies_model %>%
select(1:8, 11, 9, 10, 12:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(10:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, best_pic_win, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "best_pic_win", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + best_pic_win + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Show that no additional variable, when added to the developing model, increases the adjusted R2.
#Selection 10.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(11:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, best_pic_win, studio, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "best_pic_win", "studio", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + best_pic_win + studio + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
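The ten selection blocks above all repeat the same select/rename/fit pattern. As a hypothetical refactor (not part of the original analysis), one step of this forward selection by adjusted R2 can be wrapped in a single helper that builds the formula from character vectors, which also avoids the column-renaming workaround:

```r
# Hypothetical helper: one step of forward selection by adjusted R^2.
# For each remaining candidate, fit lm(response ~ chosen + candidate)
# and record the adjusted R^2 of the fit.
forward_step <- function(data, response, chosen, candidates) {
  adj_r2 <- sapply(candidates, function(x) {
    rhs <- paste(c(chosen, x), collapse = " + ")
    summary(lm(as.formula(paste(response, "~", rhs)), data = data))$adj.r.squared
  })
  data.frame(Predictor_Variables = candidates, Adjusted_R2 = adj_r2,
             row.names = NULL)
}
```

For example, Selection 3 above would become `forward_step(movies_model, "imdb_rating", chosen = c("genre", "imdb_num_votes"), candidates = setdiff(colnames(movies_model), c("imdb_rating", "genre", "imdb_num_votes")))`, with no column reordering needed between steps.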
Plots: Residual vs IMDB Number of Votes, Residual vs Theatrical Release Year, and Residual vs Runtime
final_model <- movies_model %>%
select(1:10)
#Regression Model.
final_reg <- lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type
+ runtime + best_pic_nom + best_pic_win + studio,
data = final_model)
#Scatter of Residual Values vs IMDB Number of Votes.
g9 <- ggplot(final_reg, aes(x = imdb_num_votes, y = .resid))
g9 <- g9 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g9 <- g9 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g9 <- g9 + ggtitle("Residual Values vs Votes")
#X-axis & Y-axis.
g9 <- g9 + scale_x_continuous("IMDB Number of Votes", labels = comma) + ylab("Residual Value")
#Modify labels and text.
g9 <- g9 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Residual Values vs Theatrical Release Year.
g10 <- ggplot(final_reg, aes(x = thtr_rel_year , y = .resid))
g10 <- g10 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g10 <- g10 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g10 <- g10 + ggtitle("Residual Values vs Year")
#X-axis & Y-axis.
g10 <- g10 + xlab("Theatrical Release Year") + ylab("Residual Value")
#Modify labels and text.
g10 <- g10 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Residual Values vs Runtime of Movie.
g11 <- ggplot(final_reg, aes(x = runtime, y = .resid))
g11 <- g11 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g11 <- g11 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g11 <- g11 + ggtitle("Residual Values vs Runtime")
#X-axis & Y-axis.
g11 <- g11 + xlab("Runtime of Movie (minutes)") + ylab("Residual Value")
#Modify labels and text.
g11 <- g11 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g9, g10, g11, layout_matrix = matrix(rbind(c(1, 1, 2, 2),
c(NA, 3, 3, NA)),
ncol = 4))
Plots: Distribution of Residual Values and Normal-QQ
#Histogram
g12 <- ggplot(final_reg, aes(x = .resid)) + geom_histogram(bins = 20, fill = viridis(5)[3])
#Title.
g12 <- g12 + ggtitle("Distribution of Residual Values")
#X-axis & Y-axis.
g12 <- g12 + xlab("Residual Value") + ylab("Frequency")
#Modify labels and text.
g12 <- g12 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#QQ line through the first and third quartiles (the same line qqline() draws).
res <- final_reg$residuals
m <- (quantile(res, 0.75) - quantile(res, 0.25)) / (qnorm(0.75) - qnorm(0.25))
b <- quantile(res, 0.25) - m * qnorm(0.25)
qq_line <- data.frame(intercept = b,
slope = m)
#Normal-QQ Plot
g13 <- ggplot(data = final_reg)
g13 <- g13 + stat_qq(aes(sample = res), color = viridis(5)[3], size = 3, alpha = 0.4)
g13 <- g13 + geom_abline(data = qq_line, aes(intercept = intercept, slope = slope), color = viridis(5)[1], size = 1)
#Title.
g13 <- g13 + ggtitle("Normal-QQ Plot")
#X-axis & Y-axis.
g13 <- g13 + xlab("Theoretical Quantile") + ylab("Sample Quantile")
#Modify labels and text.
g13 <- g13 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g12, g13, layout_matrix = matrix(c(1, 2),
ncol = 2))
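In ggplot2 3.0.0 and later, stat_qq_line() computes the same robust line through the 0.25 and 0.75 quantile pairs, so the manual slope/intercept calculation above can be dropped. A minimal sketch, assuming the same final_reg fit and viridis palette:

```r
library(ggplot2)

# Sketch: stat_qq_line() defaults to a line through the first- and
# third-quartile pairs, matching the manual m and b computed above.
g13_alt <- ggplot(mapping = aes(sample = final_reg$residuals)) +
  stat_qq(color = viridis(5)[3], size = 3, alpha = 0.4) +
  stat_qq_line(color = viridis(5)[1], size = 1) +
  ggtitle("Normal-QQ Plot") +
  xlab("Theoretical Quantile") + ylab("Sample Quantile")
```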
Plots: Residual vs Fitted Values and Absolute Value of Residuals vs Fitted Values
#Scatter of Residuals vs. Fitted.
g14 <- ggplot(final_reg, aes(x = .fitted, y = .resid))
g14 <- g14 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g14 <- g14 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g14 <- g14 + ggtitle("Residual vs Fitted Values")
#X-axis & Y-axis.
g14 <- g14 + xlab("Fitted Value") + ylab("Residual Value")
#Modify labels and text.
g14 <- g14 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Absolute Value of Residuals vs. Fitted.
g15 <- ggplot(final_reg, aes(x = .fitted, y = abs(.resid)))
g15 <- g15 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g15 <- g15 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g15 <- g15 + ggtitle("Absolute Value of Residuals vs Fitted Values")
#X-axis & Y-axis.
g15 <- g15 + xlab("Fitted Value") + ylab("Absolute Value of Residual")
#Modify labels and text.
g15 <- g15 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g14, g15, layout_matrix = matrix(c(1, 2),
ncol = 2))
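As a quick cross-check of the custom plots above, base R draws its standard diagnostic set for any lm fit in one call (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage):

```r
# Base-R diagnostic plots for the final model, in a 2x2 grid.
par(mfrow = c(2, 2))
plot(final_reg)
par(mfrow = c(1, 1))  # restore the default layout
```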
Output: Final Regression Model
Call:
lm(formula = imdb_rating ~ genre + imdb_num_votes + mpaa_rating +
thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win +
studio, data = movies_model)
Residuals:
Min 1Q Median 3Q Max
-3.9834 -0.4210 0.0501 0.4831 2.0092
Coefficients:
Estimate Std. Error t value
(Intercept) 33.466623723 8.893593000 3.763
genreAnimation -0.054576319 0.341477488 -0.160
genreArt House & International 0.821902649 0.314460459 2.614
genreComedy 0.103447373 0.144643594 0.715
genreDocumentary 1.102576913 0.409228155 2.694
genreDrama 0.779740751 0.126251654 6.176
genreHorror 0.095550056 0.236081543 0.405
genreMusical & Performing Arts 1.369524721 0.301723109 4.539
genreMystery & Suspense 0.380272489 0.165101836 2.303
genreOther 0.570172760 0.266941993 2.136
genreScience Fiction & Fantasy -0.006956664 0.332273711 -0.021
imdb_num_votes 0.000003637 0.000000378 9.622
mpaa_ratingPG -0.405807833 0.243784352 -1.665
mpaa_ratingPG-13 -0.620720506 0.260200287 -2.386
mpaa_ratingR -0.325248967 0.251558577 -1.293
mpaa_ratingUnrated -0.183475741 0.352573003 -0.520
thtr_rel_year -0.013599557 0.004452674 -3.054
title_typeFeature Film -1.044387195 0.393503213 -2.654
title_typeTV Movie 0.154541942 0.765411443 0.202
runtime 0.006487380 0.002480862 2.615
best_pic_nomyes 0.453182941 0.221183322 2.049
best_pic_winyes -0.692966281 0.413507058 -1.676
studioA24 -0.001808897 0.496349157 -0.004
studioAnchor Bay 0.359235698 0.506885308 0.709
studioArtisan Entertainment 0.325650479 0.592174035 0.550
studioBuena Vista 0.076394711 0.231289211 0.330
studioColumbia Tristar Pictures 0.272555110 0.274622534 0.992
studioCowboy Pictures 0.180038946 0.605770472 0.297
studioDreamWorks -0.058437674 0.599481428 -0.097
studioEmbassy 0.780944544 0.596657180 1.309
studioFilmDistrict 0.434419405 0.594431817 0.731
studioFirst Run Features -0.362773615 0.439249685 -0.826
studioFocus Features 0.249836129 0.428749917 0.583
studioGravitas 0.464452949 0.608731855 0.763
studioHBO -0.183561514 0.352235847 -0.521
studioHollywood Pictures -0.670688014 0.486790164 -1.378
studioIFC 0.340400065 0.256811732 1.325
studioImage Entertainment 0.007286210 0.506221283 0.014
studioIndependent Pictures -0.055275263 0.565441320 -0.098
studioIndomina -0.332292366 0.602611897 -0.551
studioLionsGate -0.262525609 0.220986695 -1.188
studioLive Home Video -0.163367100 0.590907733 -0.276
studioMagnolia Pictures 0.253886585 0.284691867 0.892
studioMGM/UA 0.054282170 0.180102679 0.301
studioMiramax 0.091622940 0.204326834 0.448
studioMusic Box Films 1.096406842 0.608221321 1.803
studioNational Geographic 0.270789354 0.536668849 0.505
studioNelson Entertainment -0.472088904 0.603612170 -0.782
studioNew Line -0.400111068 0.230759449 -1.734
studioNew Yorker Films 0.819809272 0.496611280 1.651
studioNewmarket 0.076672297 0.605013926 0.127
studioNordisk 0.923446869 0.606720405 1.522
studioOrion -0.039276164 0.276607551 -0.142
studioOverture Films 0.584999973 0.592257029 0.988
studioParamount -0.084601888 0.159770957 -0.530
studioRepublic Pictures Home Video 1.130731858 0.590845635 1.914
studioRoadside 0.430897748 0.500559476 0.861
studioScreen Gems 0.324935812 0.487935143 0.666
studioSony 0.083083166 0.161825309 0.513
studioStrand Releasing 0.823843876 0.651305719 1.265
studioSummit Entertainment 0.784638894 0.508943732 1.542
studioThinkFilm 0.305328758 0.497871784 0.613
studioTouchstone 0.315897381 0.386628620 0.817
studioTrimark 0.632995803 0.487862650 1.297
studioUniversal 0.331624385 0.177343065 1.870
studioUSA Films 0.223905977 0.590340500 0.379
studioWalt Disney 0.517178787 0.342174247 1.511
studioWarner Brothers 0.083163595 0.151329790 0.550
studioWeinstein -0.113686626 0.271284146 -0.419
Pr(>|t|)
(Intercept) 0.000188 ***
genreAnimation 0.873086
genreArt House & International 0.009235 **
genreComedy 0.474836
genreDocumentary 0.007298 **
genreDrama 0.00000000139 ***
genreHorror 0.685852
genreMusical & Performing Arts 0.00000713306 ***
genreMystery & Suspense 0.021686 *
genreOther 0.033183 *
genreScience Fiction & Fantasy 0.983305
imdb_num_votes < 0.0000000000000002 ***
mpaa_ratingPG 0.096634 .
mpaa_ratingPG-13 0.017436 *
mpaa_ratingR 0.196648
mpaa_ratingUnrated 0.603028
thtr_rel_year 0.002380 **
title_typeFeature Film 0.008213 **
title_typeTV Movie 0.840074
runtime 0.009201 **
best_pic_nomyes 0.041008 *
best_pic_winyes 0.094415 .
studioA24 0.997094
studioAnchor Bay 0.478843
studioArtisan Entertainment 0.582624
studioBuena Vista 0.741316
studioColumbia Tristar Pictures 0.321461
studioCowboy Pictures 0.766436
studioDreamWorks 0.922385
studioEmbassy 0.191198
studioFilmDistrict 0.465245
studioFirst Run Features 0.409270
studioFocus Features 0.560360
studioGravitas 0.445843
studioHBO 0.602512
studioHollywood Pictures 0.168906
studioIFC 0.185633
studioImage Entertainment 0.988522
studioIndependent Pictures 0.922166
studioIndomina 0.581599
studioLionsGate 0.235425
studioLive Home Video 0.782306
studioMagnolia Pictures 0.372945
studioMGM/UA 0.763242
studioMiramax 0.654055
studioMusic Box Films 0.072063 .
studioNational Geographic 0.614086
studioNelson Entertainment 0.434533
studioNew Line 0.083572 .
studioNew Yorker Films 0.099424 .
studioNewmarket 0.899208
studioNordisk 0.128652
studioOrion 0.887145
studioOverture Films 0.323768
studioParamount 0.596687
studioRepublic Pictures Home Video 0.056239 .
studioRoadside 0.389755
studioScreen Gems 0.505765
studioSony 0.607896
studioStrand Releasing 0.206509
studioSummit Entertainment 0.123797
studioThinkFilm 0.539986
studioTouchstone 0.414297
studioTrimark 0.195079
studioUniversal 0.062090 .
studioUSA Films 0.704644
studioWalt Disney 0.131324
studioWarner Brothers 0.582879
studioWeinstein 0.675351
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8136 on 486 degrees of freedom
Multiple R-squared: 0.4954, Adjusted R-squared: 0.4249
F-statistic: 7.018 on 68 and 486 DF, p-value: < 0.00000000000000022
#Deadpool data from https://www.imdb.com/title/tt1431045/.
deadpool <- data.frame("imdb_rating" = 8.0,
"genre" = "Action & Adventure",
"imdb_num_votes" = 872845,
"mpaa_rating" = "R",
"thtr_rel_year" = 2016,
"title_type" = "Feature Film",
"runtime" = 108,
"best_pic_nom" = "no",
"best_pic_win" = "no",
"studio" = "20th Century Fox")
#Predicted IMDb rating and prediction interval for Deadpool.
predict(final_reg, deadpool, interval = "prediction")
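predict.lm returns a one-row matrix with columns fit, lwr, and upr, using a 95% interval by default. A sketch of pulling those values out for reporting, assuming the final_reg fit and deadpool data frame from above:

```r
# Point prediction with an explicit 95% prediction interval,
# then extract the bounds by column name for a readable summary.
pred <- predict(final_reg, deadpool, interval = "prediction", level = 0.95)
cat(sprintf("Predicted IMDb rating: %.2f (95%% PI: %.2f to %.2f)\n",
            pred[1, "fit"], pred[1, "lwr"], pred[1, "upr"]))
```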