From the Codebook: “The data set is comprised of 651 randomly sampled movies produced and released before 2016.”
Because no details about the sampling were given, I plotted a few variables to gauge how random this sample is.
First, as seen above, the movies span theatrical release years from 1970 to 2014. More movies have been released in recent years, so a left-skewed distribution is not unexpected.
I also took a look at the Oscar categories. The number of best actor wins (93) and best actress wins (72) is quite high for 651 movies. This leads me to believe that well-known, heavily advertised, blockbuster films were more likely to be chosen for the sample. I would have to investigate the sampling method in more detail to draw a firmer conclusion, but that is not the purpose of this project; I merely wanted a more in-depth summary of the movies chosen.
best_pic_nom best_pic_win best_actor_win best_actress_win best_dir_win
no :629 no :644 no :558 no :579 no :608
yes: 22 yes: 7 yes: 93 yes: 72 yes: 43
top200_box
no :636
yes: 15
This was an observational study, as no hypothesis, controls, or confounding variables were specified beforehand. We can use this data to generalize and identify trends in movies of the past 50 years, but not to make definitive causal claims.
As someone who has watched many documentaries, movies, and TV shows, I have regularly used IMDb for over 20 years. I would like to see whether the IMDb rating is reasonably predictable from characteristics of the movie.
I first unselected imdb_url and rt_url, as these merely identify the movie uniquely. I then unselected thtr_rel_day, dvd_rel_year, dvd_rel_month, and dvd_rel_day. I think thtr_rel_year and thtr_rel_month would have a greater impact on the number of votes, and since a DVD release usually follows 4 - 12 months after the theatrical release, including both would raise collinearity issues. I next unselected all Rotten Tomatoes variables, that is, critics_rating, critics_score, audience_rating, and audience_score: since I am focused on IMDb, I am ignoring the ratings and scores from Rotten Tomatoes.
Finally, I unselected actor1, actor2, actor3, actor4, and actor5. These would be too difficult to organize and categorize into a linear model, at least for the scope of this class. I included best_pic_nom, best_pic_win, best_actor_win, best_actress_win, and top200_box, as I think these characteristics would affect the rating.
Using title for identification purposes, the remaining variables are title_type, genre, runtime, mpaa_rating, studio, thtr_rel_year, thtr_rel_month, imdb_rating (dependent variable in this exploration), imdb_num_votes, best_pic_nom, best_pic_win, best_actor_win, best_actress_win, top200_box and director.
Can the Rating on IMDb be predicted from these variables?
Here is a summary of the preliminary variables.
title title_type genre runtime
Length:651 Documentary : 55 Drama :305 Min. : 39.0
Class :character Feature Film:591 Comedy : 87 1st Qu.: 92.0
Mode :character TV Movie : 5 Action & Adventure: 65 Median :103.0
Mystery & Suspense: 59 Mean :105.8
Documentary : 52 3rd Qu.:115.8
Horror : 23 Max. :267.0
(Other) : 60 NA's :1
mpaa_rating studio thtr_rel_year
G : 19 Paramount Pictures : 37 Min. :1970
NC-17 : 2 Warner Bros. Pictures : 30 1st Qu.:1990
PG :118 Sony Pictures Home Entertainment: 27 Median :2000
PG-13 :133 Universal Pictures : 23 Mean :1998
R :329 Warner Home Video : 19 3rd Qu.:2007
Unrated: 50 (Other) :507 Max. :2014
NA's : 8
thtr_rel_month imdb_rating imdb_num_votes best_pic_nom best_pic_win
Min. : 1.00 Min. :1.900 Min. : 180 no :629 no :644
1st Qu.: 4.00 1st Qu.:5.900 1st Qu.: 4546 yes: 22 yes: 7
Median : 7.00 Median :6.600 Median : 15116
Mean : 6.74 Mean :6.493 Mean : 57533
3rd Qu.:10.00 3rd Qu.:7.300 3rd Qu.: 58301
Max. :12.00 Max. :9.000 Max. :893008
best_actor_win best_actress_win top200_box director
no :558 no :579 no :636 Length:651
yes: 93 yes: 72 yes: 15 Class :character
Mode :character
There seems to be a wide range of genres, and they are distributed reasonably well over the years. I do have some reservations about using genre, however, as a movie can belong to multiple genres.
imdb_num_votes imdb_rating
Min. : 180 Min. :1.900
1st Qu.: 4546 1st Qu.:5.900
Median : 15116 Median :6.600
Mean : 57533 Mean :6.493
3rd Qu.: 58301 3rd Qu.:7.300
Max. :893008 Max. :9.000
I decided to look at the number of votes and the ratings for the movies in the sample. A range of 1.9 - 9.0, with most ratings in 5.0 - 8.0, covers the scale well. The heavy density on the left and the pronounced right skew of the vote counts are to be expected.
First, over 400 directors have directed only one movie in our sample. I am therefore unselecting this variable from the model.
As we can see, over 100 studios have produced only one movie in our sample. Moreover, many studios appear under several different names in the column. For example, “20th Century Fox” appears as all of the following:
[1] "20th Century Fox"
[2] "20th Century Fox Film Corporat"
[3] "Twentieth Century Fox Home Entertainment"
[4] "Fox Searchlight Pictures"
[5] "Fox Atomic"
[6] "20th Century Fox Film Corporation"
[7] "Fox Searchlight"
[8] "Fox"
There are several others. For full details of the cleaning, see the Appendix.
After omitting eight movies with no studio, consolidating the different names for the same studio, and filtering out studios with only a single movie, the distribution is as follows. Note that over 400 movies in the original sample were produced by only 11 studios.
Currently the predictors being considered for IMDb Rating are:
[1] "title_type" "genre" "runtime" "mpaa_rating"
[5] "studio" "thtr_rel_year" "thtr_rel_month" "imdb_num_votes"
[9] "best_pic_nom" "best_pic_win" "best_actor_win" "best_actress_win"
See the Appendix for model details.
[1] "The Adjusted R-Squared from the above variables is 0.4957 ."
[1] "The F-Statistic from the above variables is 6.6 ."
[1] "The p-Value from the above F-Statistic on 72 and 482 degrees of freedom is ~ 0 ."
H0 : The slopes of all variables in the regression model are zero.
HA : At least one slope is different from zero.
Since the p-value from the F-test above is < 0.05, we reject H0.
I will be using the forward selection method, choosing predictors by their adjusted R2. I chose this method because a reliable overall prediction model is my priority; I am not as concerned with the significance of the individual predictors.
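A single round of this forward selection can be sketched in R as follows. This is a minimal sketch using the built-in mtcars data as a stand-in for movies_model; the response and candidate names are illustrative, not the project's actual variables.

```r
# One round of forward selection by adjusted R-squared.
# mtcars stands in for movies_model; mpg stands in for imdb_rating.
data(mtcars)
response   <- "mpg"
selected   <- character(0)                  # predictors already chosen
candidates <- c("cyl", "disp", "hp", "wt")  # predictors still available

adj_r2 <- sapply(candidates, function(v) {
  f <- reformulate(c(selected, v), response = response)
  summary(lm(f, data = mtcars))$adj.r.squared
})

best     <- names(which.max(adj_r2))        # highest adjusted R-squared wins
selected <- c(selected, best)
```

Each subsequent round repeats the loop over the remaining candidates, stopping when no addition raises the adjusted R2.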
Predictor_Variables Adjusted_R2
1 title_type 0.0911757925
2 genre 0.2058854544
3 runtime 0.0930769772
4 mpaa_rating 0.0544662807
5 studio 0.0250511730
6 thtr_rel_year -0.0001270223
7 thtr_rel_month 0.0069380522
8 imdb_num_votes 0.1336030480
9 best_pic_nom 0.0540743214
10 best_pic_win 0.0194410075
11 best_actor_win 0.0034698482
12 best_actress_win 0.0038613392
13 top200_box 0.0084224665
The first round of selection yields genre as having the highest adjusted R2, which is 0.20589.
Predictor_Variables Adjusted_R2
1 genre + title_type 0.2142243
2 genre + runtime 0.2638900
3 genre + mpaa_rating 0.2204991
4 genre + studio 0.2116931
5 genre + thtr_rel_year 0.2085660
6 genre + thtr_rel_month 0.2120711
7 genre + imdb_num_votes 0.3630692
8 genre + best_pic_nom 0.2482785
9 genre + best_pic_win 0.2254451
10 genre + best_actor_win 0.2068519
11 genre + best_actress_win 0.2072701
12 genre + top200_box 0.2201771
The second round of selection yields genre + imdb_num_votes as having the highest adjusted R2, which is 0.36307.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.3748242
2 selected explanatory variables + runtime 0.3722187
3 selected explanatory variables + mpaa_rating 0.3869285
4 selected explanatory variables + studio 0.3741239
5 selected explanatory variables + thtr_rel_year 0.3841490
6 selected explanatory variables + thtr_rel_month 0.3637500
7 selected explanatory variables + best_pic_nom 0.3694659
8 selected explanatory variables + best_pic_win 0.3619455
9 selected explanatory variables + best_actor_win 0.3629337
10 selected explanatory variables + best_actress_win 0.3621119
11 selected explanatory variables + top200_box 0.3622158
The third round of selection yields genre + imdb_num_votes + mpaa_rating as having the highest adjusted R2, which is 0.38693.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.3956541
2 selected explanatory variables + runtime 0.4001415
3 selected explanatory variables + studio 0.3893247
4 selected explanatory variables + thtr_rel_year 0.4033181
5 selected explanatory variables + thtr_rel_month 0.3870388
6 selected explanatory variables + best_pic_nom 0.3935230
7 selected explanatory variables + best_pic_win 0.3857890
8 selected explanatory variables + best_actor_win 0.3872100
9 selected explanatory variables + best_actress_win 0.3861547
10 selected explanatory variables + top200_box 0.3858422
The fourth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year as having the highest adjusted R2, which is 0.40332.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + title_type 0.4124202
2 selected explanatory variables + runtime 0.4101359
3 selected explanatory variables + studio 0.4043084
4 selected explanatory variables + thtr_rel_month 0.4031649
5 selected explanatory variables + best_pic_nom 0.4078319
6 selected explanatory variables + best_pic_win 0.4024575
7 selected explanatory variables + best_actor_win 0.4031490
8 selected explanatory variables + best_actress_win 0.4023141
9 selected explanatory variables + top200_box 0.4022237
The fifth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type as having the highest adjusted R2, which is 0.41242.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + runtime 0.4196435
2 selected explanatory variables + studio 0.4130270
3 selected explanatory variables + thtr_rel_month 0.4122719
4 selected explanatory variables + best_pic_nom 0.4169521
5 selected explanatory variables + best_pic_win 0.4115879
6 selected explanatory variables + best_actor_win 0.4124188
7 selected explanatory variables + best_actress_win 0.4114617
8 selected explanatory variables + top200_box 0.4113380
The sixth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime as having the highest adjusted R2, which is 0.41964.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4210571
2 selected explanatory variables + thtr_rel_month 0.4187164
3 selected explanatory variables + best_pic_nom 0.4221565
4 selected explanatory variables + best_pic_win 0.4193633
5 selected explanatory variables + best_actor_win 0.4186905
6 selected explanatory variables + best_actress_win 0.4185631
7 selected explanatory variables + top200_box 0.4186287
The seventh round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom as having the highest adjusted R2, which is 0.42216.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4227173
2 selected explanatory variables + thtr_rel_month 0.4210947
3 selected explanatory variables + best_pic_win 0.4238089
4 selected explanatory variables + best_actor_win 0.4211023
5 selected explanatory variables + best_actress_win 0.4211543
6 selected explanatory variables + top200_box 0.4211215
The eighth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win as having the highest adjusted R2, which is 0.42381.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + studio 0.4248530
2 selected explanatory variables + thtr_rel_month 0.4227314
3 selected explanatory variables + best_actor_win 0.4227339
4 selected explanatory variables + best_actress_win 0.4227907
5 selected explanatory variables + top200_box 0.4227849
The ninth round of selection yields genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win + studio as having the highest adjusted R2, which is 0.42485.
Predictor_Variables Adjusted_R2
1 selected explanatory variables + thtr_rel_month 0.4236725
2 selected explanatory variables + best_actor_win 0.4238159
3 selected explanatory variables + best_actress_win 0.4236672
4 selected explanatory variables + top200_box 0.4237895
No additional predictor is a satisfactory explanatory variable, as every adjusted R2 value is less than 0.42485, so the selection stops here.
The three numerical explanatory variables included in the model are imdb_num_votes, thtr_rel_year, and runtime.
For Number of IMDb Votes, the residuals seem to be randomly scattered around 0; as noted above, the heavy density on the left and pronounced right skew of the vote counts are expected. For Theatrical Release Year, the residuals are randomly scattered around 0. I am not certain whether year is better treated as numerical or categorical, so I kept it numerical. Finally, for Runtime, the plot shows a random scatter around 0 for the vast majority of movies, that is, movies between 90 and 150 minutes long.
There is a slight left skew, but the majority of the residuals fit a normal distribution.
From both plots, it is clear that about 10 - 15 fitted values have residuals that deviate from the others. Since this is approximately 1.8 - 2.7% of the data in the model, the residuals seem to have roughly constant variability.
We do not know the exact method or sampling, nor the order in which the movies were chosen, but the movies selected, and thus residuals, appear to be independent of one another.
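The diagnostics discussed above can be reproduced with base R plots. This is a minimal sketch on a toy lm fit of the built-in mtcars data, standing in for the final model:

```r
# Residual diagnostics for a linear model: scatter around 0,
# near-normality, and roughly constant variance.
fit <- lm(mpg ~ wt + hp, data = mtcars)     # toy stand-in for the final model
res <- resid(fit)

par(mfrow = c(1, 3))
plot(fitted(fit), res,                      # residuals vs. fitted values:
     xlab = "Fitted values",                # look for random scatter around 0
     ylab = "Residuals")
abline(h = 0, lty = 2)
hist(res, main = "Residuals", xlab = "")    # roughly normal, centered at 0
qqnorm(res); qqline(res)                    # tails reveal any skew
```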
For full details of the final regression model, see the Appendix.
The intercept of the regression model is 33.46, which would be the IMDb rating of a movie that: 1) is of genre Action & Adventure, 2) received 0 votes on IMDb, 3) is rated G, 4) was released in theaters in year 0, 5) is a Documentary, 6) is 0 minutes long, 7) has neither been nominated for nor won Best Picture, and 8) was produced by A24.
There are 10 genre coefficients, which would be expected to change the IMDb rating from -0.05 (Animation) to +1.37 (Musical & Performing Arts).
For every 1,000 people who vote for the movie, the expected change in the IMDb rating would be +0.0036. If 400,000 people vote for a movie, for example, the expected change would be +1.44.
There are 4 MPAA rating coefficients, which would be expected to change the IMDb rating from -0.62 (PG-13) to -0.18 (Unrated).
For each year the movie is released in theaters after year 0, the expected change to the IMDb Rating would be -0.0136. A movie released in 1980, for example, would be expected to change the IMDb Rating by -26.9.
For each minute of runtime of the movie, the expected change to the IMDb Rating would be +0.006487. A 120 minute movie, for example, would be expected to change the IMDb Rating by +0.78.
There are two additional title types. The vast majority of movies are Feature Films, which would be expected to change the IMDb Rating by -1.04. TV Movies would be expected to change the IMDb Rating by +0.15.
If a movie has received a Best Picture Nomination, the expected change to the IMDb Rating would be +0.45.
If a movie has won the Best Picture, the expected change to the IMDb Rating would be -0.69.
There are 47 studio coefficients, which would be expected to change the IMDb rating from -0.67 (Hollywood Pictures) to +1.13 (Republic Pictures).
Deadpool, with an IMDb rating of 8.0, was a 108-minute-long, Action & Adventure, R-rated Feature Film released in 2016 by 20th Century Fox that has received 872,845 votes on IMDb and was never nominated for Best Picture (and hence did not win, either).
fit lwr upr
1 8.555389 6.838782 10.272
Our 95% prediction interval for the IMDb rating of any movie that is a 108-minute-long, Action & Adventure, R-rated Feature Film released in 2016 by 20th Century Fox, has received 872,845 votes on IMDb, and was never nominated for Best Picture (and hence did not win, either) is between 6.84 and 10 (as 10 is the maximum score). Deadpool, at 8.0, lies in this interval.
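An interval like this comes from predict() with interval = "prediction". The sketch below uses a toy mtcars fit; with the final model, the newdata row would instead hold the movie's characteristics (genre, runtime, votes, and so on).

```r
# 95% prediction interval for a single new observation.
fit <- lm(mpg ~ wt + hp, data = mtcars)    # toy stand-in for the final model
new_obs <- data.frame(wt = 3.0, hp = 150)  # hypothetical new data point
pi <- predict(fit, newdata = new_obs,
              interval = "prediction", level = 0.95)
pi  # columns: fit (point prediction), lwr and upr (interval bounds)
```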
[1] "The R-squared value is 0.4954"
This means the final regression model of nine variables explains only about 50% of the variability in the IMDb rating.
I have already mentioned reservations about the sampling, as well as how different genre types are not mutually exclusive for a movie. In addition, I purposely excluded actors and actresses from the model due to the complexity, but I think they would affect the rating, as people can be biased towards favorites and rate them higher.
I think directors could possibly influence rating, but more data per director would have to be available.
Box Office Earnings could possibly influence rating. That metric was not included in the original dataset.
It is important to note that none of these variables has been shown to causally affect the IMDb rating. We can only predict an interval of ratings based on characteristics of the movie that show trends in our reduced sample of 555 movies.
The data is also not static; it is dynamic. For example, take Django Unchained, the movie with the highest number of votes in the dataset.
title imdb_rating imdb_num_votes
1 Django Unchained 8.5 893008
In reality, Django Unchained has a rating of 8.4 from over 1,299,000 votes, as new votes come in regularly on IMDb, even for older movies. The final project dataset, like many other aspects of this course, has not been updated.
All code for each Part is included here.
knitr::opts_chunk$set(comment = NA)
options("scipen" = 100)
years <- c(1970, 1975, 1980, 1985, 1990, 1995, 2000, 2005, 2010)
Plot : Distribution by Year of 651 Randomly Sampled Movies
#Histogram.
sample_length <- max(movies$thtr_rel_year) - min(movies$thtr_rel_year) + 1
g1 <- ggplot(movies, aes(x = thtr_rel_year, fill = cut(thtr_rel_year, sample_length)))
g1 <- g1 + geom_histogram(bins = sample_length)
#Title.
g1 <- g1 + ggtitle(label = "Distribution by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g1 <- g1 + scale_fill_viridis_d()
#X-axis.
g1 <- g1 + scale_x_continuous(name = "Theatrical Release Year",
breaks = years,
labels = years,
expand = c(0,0))
#Y-axis.
g1 <- g1 + scale_y_continuous(name = "Number of Movies")
#Modify labels and text. Remove legend from left plot.
g1 <- g1 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g1
Summary : Oscar Categories and Top 200 Box Office
summary(movies %>%
select(best_pic_nom,
best_pic_win,
best_actor_win,
best_actress_win,
best_dir_win,
top200_box))
Summary : Preliminary Model Variables
movies_model <- movies %>% select(title, title_type, genre,
runtime, mpaa_rating, studio,
thtr_rel_year, thtr_rel_month,
imdb_rating, imdb_num_votes,
best_pic_nom, best_pic_win,
best_actor_win, best_actress_win,
top200_box, director)
summary(movies_model)
Plot : Distribution of Genres by Year of 651 Randomly Sampled Movies
#Boxplot.
g2 <- ggplot(movies_model, aes(x = genre, y = thtr_rel_year))
g2 <- g2 + geom_boxplot(aes(color = genre), size = 1.3)
#Title.
g2 <- g2 + ggtitle(label = "Distribution of Genres by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g2 <- g2 + scale_color_viridis_d()
#X-axis.
g2 <- g2 + scale_x_discrete(name = "Genre",
expand = c(0,0))
#Y-axis.
g2 <- g2 + scale_y_continuous(name = "Theatrical Release Year")
#Modify labels and text. Remove legend from left plot.
g2 <- g2 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 10, angle = 45),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g2
Summary: IMDb Information by Year of 651 Randomly Sampled Movies
Plot : IMDb Information by Year of 651 Randomly Sampled Movies
#Scatter plot.
g3 <- ggplot(movies_model %>% filter(imdb_num_votes <= 100000), aes(x = imdb_num_votes, y = imdb_rating))
g3 <- g3 + geom_point(aes(color = thtr_rel_year), size = 5, alpha = 0.5)
#Title.
g3 <- g3 + ggtitle(label = "")
#Viridis Color Scheme.
g3 <- g3 + scale_color_viridis()
#X-axis.
g3 <- g3 + scale_x_continuous(name = "Number of Votes on IMDb",
labels = comma,
expand = c(0.005,0.005))
#Y-axis.
g3 <- g3 + scale_y_continuous(name = "Rating on IMDb",
limits = c(1.9, 9),
breaks = c(2, 4, 6, 8),
labels = c("2.0", "4.0", "6.0", "8.0"))
#Modify labels and text. Remove legend from left plot.
g3 <- g3 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_text(hjust = 1, size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"),
legend.position = "none")
g4 <- ggplot(movies_model %>% filter(imdb_num_votes >= 100000), aes(x = imdb_num_votes, y = imdb_rating))
g4 <- g4 + geom_point(aes(color = thtr_rel_year), size = 5, alpha = 0.5)
#Title.
g4 <- g4 + ggtitle(label = "IMDb Information by Year of 651 Randomly Sampled Movies")
#Viridis Color Scheme.
g4 <- g4 + scale_color_viridis(name = "Theatrical Release Year",
limits = c(1970, 2014),
breaks = c(1975, 1985, 1995, 2005))
g4 <- g4 + guides(color = guide_colorbar(title.position = "top"))
#X-axis.
g4 <- g4 + scale_x_continuous(name = "",
labels = comma,
expand = c(0,0))
#Y-axis.
g4 <- g4 + scale_y_continuous(name = "", position = "right",
limits = c(5, 9.3),
breaks = c(6, 8),
labels = c("6.0", "8.0"),
expand = c(0,0))
#Modify labels and text. Remove legend from left plot.
g4 <- g4 + theme(plot.title = element_text(hjust = 1.25, size = 16, face = "bold"),
axis.text.x = element_text(hjust = 1, size = 12, angle = 45),
axis.title.x = element_blank(),
axis.text.y = element_text(size = 12),
axis.title.y = element_blank())
#Modify legend text. Save as object.
legend3 <- g_legend(g4 + theme(legend.text = element_text(size = 12),
legend.title = element_text(hjust = 0.5, size = 14, face = "bold"),
legend.key.width = unit(1.5, "cm"),
legend.position = "bottom"))
#Remove legend from right plot.
g4 <- g4 + theme(legend.position = "none")
#Arrange left plot, right plot, and legend on the bottom.
grid.arrange(g3, g4, legend3, layout_matrix = matrix(rbind(c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, 2, 2),
c(1, 1, 1, NA, NA),
c(1, 1, 1, 3, 3),
c(1, 1, 1, NA, NA)),
ncol = 5))
Plot : Distribution of 651 Movies by Studio and Distribution of 651 Movies by Director
#Studio Tally.
studio_df <- data.frame(movies_model %>% group_by(studio) %>% tally())
#Distribution of Movies by Studio.
g5 <- ggplot(studio_df, aes(x = n)) + geom_histogram(bins = 37, fill = viridis(4)[2])
#Title.
g5 <- g5 + ggtitle("Distribution of 651 Movies by Studio")
#X-axis
g5 <- g5 + scale_x_continuous("Number of Movies by Studio")
#Y-axis.
g5 <- g5 + scale_y_continuous(name = "Frequency of Studios")
#Modify labels and text.
g5 <- g5 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Director Tally.
director_df <- data.frame(movies_model %>% group_by(director) %>% tally())
#Distribution of Movies by Director.
g6 <- ggplot(director_df, aes(x = n)) + geom_histogram(bins = 4, fill = viridis(4)[3])
#Title.
g6 <- g6 + ggtitle("Distribution of 651 Movies by Director")
#X-axis
g6 <- g6 + scale_x_continuous("Number of Movies by Director")
#Y-axis.
g6 <- g6 + scale_y_continuous(name = "Frequency of Directors")
#Modify labels and text.
g6 <- g6 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g6, g5, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 1, NA, 2, 2, 2, 2, 2, 2)),
ncol = 13))
Output : 20th Century Fox names.
movies_model$studio <- as.character(movies_model$studio)
#20th Century Fox Output.
unique(movies_model[grepl("Fox", movies_model$studio),]$studio)
Cleaning : Consolidating Similar Studio Names.
#Studios with repeats.
studios <- c("A24", "Anchor Bay", "Buena Vista", "Embassy", "Gravitas", "HBO", "IFC",
"Indomina", "Miramax", "National Geographic", "New Line", "Newmarket", "Nordisk",
"Orion", "Paramount", "Roadside", "Sony", "Touchstone",
"Trimark", "Universal", "Weinstein")
for (studio_n in studios) {
movies_model[grepl(studio_n, movies_model$studio),]$studio <- studio_n
}
#Columbia/TriStar, Walt Disney, DreamWorks, 20th Century Fox, LionsGate, Magnet/Magnolia, MGM/UA, Warner
movies_model[grepl("Columbia|TriStar", movies_model$studio),]$studio <- "Columbia Tristar Pictures"
movies_model[grepl("Disney", movies_model$studio),]$studio <- "Walt Disney"
movies_model[grepl("Dream", movies_model$studio),]$studio <- "DreamWorks"
movies_model[grepl("Fox", movies_model$studio),]$studio <- "20th Century Fox"
movies_model[grepl("Lions", movies_model$studio),]$studio <- "LionsGate"
movies_model[grepl("Magnet|Magnolia", movies_model$studio),]$studio <- "Magnolia Pictures"
movies_model[grepl("Goldwyn|MGM|UA|United Artists", movies_model$studio),]$studio <- "MGM/UA"
movies_model[grepl("Warner|WARNER", movies_model$studio),]$studio <- "Warner Brothers"
#New Studio Tally and filter.
studio_df <- data.frame(movies_model %>%
filter(!(is.na(studio))) %>%
group_by(studio) %>%
tally()) %>%
mutate(studio = as.factor(studio)) %>%
filter(n >= 2)
Plots : Distribution of 555 Movies by Studio
#Distribution of Movies by Studio.
g7 <- ggplot(studio_df, aes(x = n)) + geom_histogram(bins = 70, fill = viridis(4)[2])
#Title.
g7 <- g7 + ggtitle("")
#X-axis
g7 <- g7 + scale_x_continuous("")
#Y-axis.
g7 <- g7 + scale_y_continuous(name = "Frequency of Studios")
#Modify labels and text.
g7 <- g7 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Distribution of Movies by Studio.
g8 <- ggplot(studio_df %>% filter(n > 15), aes(x = n, y = reorder(studio, n)))
g8 <- g8 + geom_bar(stat = "identity", fill = viridis(4)[1], orientation = "y")
#Title.
g8 <- g8 + ggtitle(paste("Distribution of", sum(studio_df$n), "Movies by Studio"))
#X-axis
g8 <- g8 + scale_x_continuous(name = "Number of Movies by Studio")
#Y-axis.
#g8 <- g8 + scale_y_discrete("Studios")
#Modify labels and text.
g8 <- g8 + theme(plot.title = element_text(hjust = -2.9, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(hjust = -1.25, size = 14, face = "bold"),
axis.text.y = element_text(hjust = 1, size = 12, angle = 25),
axis.title.y = element_blank())
grid.arrange(g7, g8, layout_matrix = matrix(rbind(c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2)),
ncol = 13))
Output : Current Predictors
movies_model <- movies_model %>%
filter(studio %in% studio_df$studio) %>%
select(c(9, 2:8, 10:15))
colnames(movies_model)[2:13]
#Regression model with all preliminary variables.
prelim_reg <- summary(lm(imdb_rating ~ title_type + genre + runtime + mpaa_rating + thtr_rel_year
+ thtr_rel_month + studio + imdb_num_votes + best_pic_nom + best_pic_win
+ best_actor_win + best_actress_win + top200_box,
data = movies_model))
#R-Squared.
ars <- round(prelim_reg$adj.r.squared, 4)
#F-Statistic
fstat <- prelim_reg$fstatistic
#p-Value.
pVal <- round(pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE), 4)
print(paste("The Adjusted R-Squared from the above variables is", ars, "."))
print(paste("The F-Statistic from the above variables is", round(fstat[1], 1), "."))
print(paste("The p-Value from the above F-Statistic on",
round(fstat[2], 1), "and",
round(fstat[3], 1), "degrees of freedom is ~", pVal, "."))
Summary: Full Preliminary Model
Call:
lm(formula = imdb_rating ~ title_type + genre + runtime + mpaa_rating +
thtr_rel_year + thtr_rel_month + studio + imdb_num_votes +
best_pic_nom + best_pic_win + best_actor_win + best_actress_win +
top200_box, data = movies_model)
Residuals:
Min 1Q Median 3Q Max
-3.9881 -0.4175 0.0546 0.4730 2.0034
Coefficients:
Estimate Std. Error t value
(Intercept) 33.0920647282 8.9796994549 3.685
title_typeFeature Film -1.0397740139 0.3955394783 -2.629
title_typeTV Movie 0.1569482647 0.7716563918 0.203
genreAnimation -0.0427513898 0.3449180413 -0.124
genreArt House & International 0.8237201801 0.3162499180 2.605
genreComedy 0.1067288104 0.1466946206 0.728
genreDocumentary 1.1086552693 0.4113052952 2.695
genreDrama 0.7848739190 0.1285059802 6.108
genreHorror 0.0996561542 0.2374034042 0.420
genreMusical & Performing Arts 1.3732533822 0.3033648767 4.527
genreMystery & Suspense 0.3906291051 0.1683403723 2.320
genreOther 0.5741568102 0.2693976098 2.131
genreScience Fiction & Fantasy -0.0149705049 0.3339902503 -0.045
runtime 0.0066409354 0.0026380976 2.517
mpaa_ratingPG -0.3996056197 0.2453697349 -1.629
mpaa_ratingPG-13 -0.6145918371 0.2630071888 -2.337
mpaa_ratingR -0.3183288548 0.2542493338 -1.252
mpaa_ratingUnrated -0.1771660433 0.3555901792 -0.498
thtr_rel_year -0.0134267072 0.0044947647 -2.987
thtr_rel_month 0.0006803687 0.0110708432 0.061
studioA24 -0.0118077000 0.4991055127 -0.024
studioAnchor Bay 0.3488829130 0.5098301262 0.684
studioArtisan Entertainment 0.3156037544 0.5960848194 0.529
studioBuena Vista 0.0651250867 0.2340550741 0.278
studioColumbia Tristar Pictures 0.2805988756 0.2767933755 1.014
studioCowboy Pictures 0.1760188295 0.6108859028 0.288
studioDreamWorks -0.0698553513 0.6066163605 -0.115
studioEmbassy 0.7988596336 0.6023212550 1.326
studioFilmDistrict 0.4385913468 0.5972519193 0.734
studioFirst Run Features -0.3683156401 0.4421053139 -0.833
studioFocus Features 0.2439760377 0.4312608928 0.566
studioGravitas 0.4573032239 0.6114470485 0.748
studioHBO -0.1884186341 0.3547275139 -0.531
studioHollywood Pictures -0.6761898877 0.4892543212 -1.382
studioIFC 0.3325683127 0.2584900055 1.287
studioImage Entertainment -0.0023122818 0.5087688957 -0.005
studioIndependent Pictures -0.0458056070 0.5691689725 -0.080
studioIndomina -0.3372124267 0.6053969672 -0.557
studioLionsGate -0.2676692262 0.2231946749 -1.199
studioLive Home Video -0.1519203660 0.5959082579 -0.255
studioMagnolia Pictures 0.2528784405 0.2884313161 0.877
studioMGM/UA 0.0539310824 0.1813265095 0.297
studioMiramax 0.0935375321 0.2058216185 0.454
studioMusic Box Films 1.0857080278 0.6113035762 1.776
studioNational Geographic 0.2854352063 0.5402568747 0.528
studioNelson Entertainment -0.4772252575 0.6063003738 -0.787
studioNew Line -0.4011971216 0.2319546896 -1.730
studioNew Yorker Films 0.8142511125 0.5012171781 1.625
studioNewmarket 0.0844994970 0.6107747531 0.138
studioNordisk 0.9131957062 0.6103857406 1.496
studioOrion -0.0332412388 0.2783928875 -0.119
studioOverture Films 0.5818489073 0.5989521956 0.971
studioParamount -0.0911573044 0.1611513905 -0.566
studioRepublic Pictures Home Video 1.1194647737 0.5944461941 1.883
studioRoadside 0.4190381879 0.5038369516 0.832
studioScreen Gems 0.3187952270 0.4903305809 0.650
studioSony 0.0857688064 0.1629073214 0.526
studioStrand Releasing 0.8165781624 0.6577240461 1.242
studioSummit Entertainment 0.8088711682 0.5162408464 1.567
studioThinkFilm 0.3218437238 0.5033122526 0.639
studioTouchstone 0.3437929033 0.3968596527 0.866
studioTrimark 0.6428141299 0.4913450418 1.308
studioUniversal 0.3358562362 0.1785847796 1.881
studioUSA Films 0.2392586118 0.5945411010 0.402
studioWalt Disney 0.5226832631 0.3457106508 1.512
studioWarner Brothers 0.0837387292 0.1519591673 0.551
studioWeinstein -0.1152029489 0.2729688309 -0.422
imdb_num_votes 0.0000035949 0.0000003944 9.116
best_pic_nomyes 0.4635658556 0.2278296436 2.035
best_pic_winyes -0.7013182602 0.4187856116 -1.675
best_actor_winyes -0.0397421630 0.1083895160 -0.367
best_actress_winyes 0.0001336515 0.1149282736 0.001
top200_boxyes 0.0786191591 0.2383003162 0.330
Pr(>|t|)
(Intercept) 0.000254 ***
title_typeFeature Film 0.008844 **
title_typeTV Movie 0.838915
genreAnimation 0.901409
genreArt House & International 0.009481 **
genreComedy 0.467238
genreDocumentary 0.007275 **
genreDrama 0.00000000208 ***
genreHorror 0.674836
genreMusical & Performing Arts 0.00000755615 ***
genreMystery & Suspense 0.020732 *
genreOther 0.033573 *
genreScience Fiction & Fantasy 0.964267
runtime 0.012148 *
mpaa_ratingPG 0.104054
mpaa_ratingPG-13 0.019859 *
mpaa_ratingR 0.211165
mpaa_ratingUnrated 0.618549
thtr_rel_year 0.002959 **
thtr_rel_month 0.951022
studioA24 0.981135
studioAnchor Bay 0.494107
studioArtisan Entertainment 0.596729
studioBuena Vista 0.780942
studioColumbia Tristar Pictures 0.311211
studioCowboy Pictures 0.773366
studioDreamWorks 0.908370
studioEmbassy 0.185368
studioFilmDistrict 0.463093
studioFirst Run Features 0.405204
studioFocus Features 0.571842
studioGravitas 0.454883
studioHBO 0.595550
studioHollywood Pictures 0.167586
studioIFC 0.198858
studioImage Entertainment 0.996376
studioIndependent Pictures 0.935890
studioIndomina 0.577779
studioLionsGate 0.231015
studioLive Home Video 0.798879
studioMagnolia Pictures 0.381066
studioMGM/UA 0.766270
studioMiramax 0.649703
studioMusic Box Films 0.076355 .
studioNational Geographic 0.597512
studioNelson Entertainment 0.431604
studioNew Line 0.084336 .
studioNew Yorker Films 0.104913
studioNewmarket 0.890023
studioNordisk 0.135283
studioOrion 0.905005
studioOverture Films 0.331814
studioParamount 0.571886
studioRepublic Pictures Home Video 0.060274 .
studioRoadside 0.405994
studioScreen Gems 0.515896
studioSony 0.598791
studioStrand Releasing 0.215017
studioSummit Entertainment 0.117806
studioThinkFilm 0.522833
studioTouchstone 0.386766
studioTrimark 0.191404
studioUniversal 0.060622 .
studioUSA Films 0.687549
studioWalt Disney 0.131212
studioWarner Brothers 0.581847
studioWeinstein 0.673186
imdb_num_votes < 0.0000000000000002 ***
best_pic_nomyes 0.042427 *
best_pic_winyes 0.094652 .
best_actor_winyes 0.714033
best_actress_winyes 0.999073
top200_boxyes 0.741607
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8168 on 482 degrees of freedom
Multiple R-squared: 0.4957, Adjusted R-squared: 0.4204
F-statistic: 6.58 on 72 and 482 DF, p-value: < 0.00000000000000022
Select the first variable that yields the highest adjusted R2.
#Selection 1.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(2:14)]) {
Predictors <- c(Predictors, x)
temp_df <- movies_model %>%
select(imdb_rating, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "depen"))
new_R2 <- summary(lm(imdb_rating ~ depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the second variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 2.
movies_model <- movies_model %>%
select(1, 3, 2, 4:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(3:14)]) {
Predictors <- c(Predictors, paste("genre + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the third variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 3.
movies_model <- movies_model %>%
select(1, 2, 9, 3:8, 10:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(4:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the fourth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 4.
movies_model <- movies_model %>%
select(1:3, 6, 4, 5, 7:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(5:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the fifth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 5.
movies_model <- movies_model %>%
select(1:4, 8, 5:7, 9:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(6:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year" , "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the sixth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 6.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(7:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the seventh variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 7.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(8:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the eighth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 8.
movies_model <- movies_model %>%
select(1:7, 10, 8, 9, 11:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(9:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Select the ninth variable that, when added to the developing model, yields the highest adjusted R2.
#Selection 9.
movies_model <- movies_model %>%
select(1:8, 11, 9, 10, 12:14)
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(10:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, best_pic_win, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "best_pic_win", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + best_pic_win + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
Show that no additional variable, when added to the developing model, increases the adjusted R2.
#Selection 10.
#No change in order.
Predictors <- NULL
R2 <- NULL
for (x in colnames(movies_model)[c(11:14)]) {
Predictors <- c(Predictors, paste("selected explanatory variables + ", x))
temp_df <- movies_model %>%
select(imdb_rating, genre, imdb_num_votes, mpaa_rating, thtr_rel_year, title_type,
runtime, best_pic_nom, best_pic_win, studio, all_of(x)) %>%
`colnames<-`(c("imdb_rating", "genre", "imdb_num_votes", "mpaa_rating", "thtr_rel_year", "title_type",
"runtime", "best_pic_nom", "best_pic_win", "studio", "depen"))
new_R2 <- summary(lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type +
runtime + best_pic_nom + best_pic_win + studio + depen,
data = temp_df))$adj.r.squared
R2 <- c(R2, new_R2)
}
R2_values <- data.frame("Predictor_Variables" = Predictors, "Adjusted_R2" = R2)
R2_values
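The ten selection blocks above all repeat the same select/rename/fit pattern. As a hypothetical refactor (not part of the original analysis), one step of this forward selection by adjusted R2 can be wrapped in a single helper that builds the formula from character vectors, which also avoids the column-renaming workaround:

```r
# Hypothetical helper: one step of forward selection by adjusted R^2.
# For each remaining candidate, fit lm(response ~ chosen + candidate)
# and record the adjusted R^2 of the fit.
forward_step <- function(data, response, chosen, candidates) {
  adj_r2 <- sapply(candidates, function(x) {
    rhs <- paste(c(chosen, x), collapse = " + ")
    summary(lm(as.formula(paste(response, "~", rhs)), data = data))$adj.r.squared
  })
  data.frame(Predictor_Variables = candidates, Adjusted_R2 = adj_r2,
             row.names = NULL)
}
```

For example, Selection 3 above would become `forward_step(movies_model, "imdb_rating", chosen = c("genre", "imdb_num_votes"), candidates = setdiff(colnames(movies_model), c("imdb_rating", "genre", "imdb_num_votes")))`, with no column reordering needed between steps.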
Plots: Residual vs IMDB Number of Votes, Residual vs Theatrical Release Year, and Residual vs Runtime
final_model <- movies_model %>%
select(1:10)
#Regression Model.
final_reg <- lm(imdb_rating ~ genre + imdb_num_votes + mpaa_rating + thtr_rel_year + title_type
+ runtime + best_pic_nom + best_pic_win + studio,
data = final_model)
#Scatter of Residual Values vs IMDB Number of Votes.
g9 <- ggplot(final_reg, aes(x = imdb_num_votes, y = .resid))
g9 <- g9 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g9 <- g9 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g9 <- g9 + ggtitle("Residual Values vs Votes")
#X-axis & Y-axis.
g9 <- g9 + scale_x_continuous("IMDB Number of Votes", labels = comma) + ylab("Residual Value")
#Modify labels and text.
g9 <- g9 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Residual Values vs Theatrical Release Year.
g10 <- ggplot(final_reg, aes(x = thtr_rel_year , y = .resid))
g10 <- g10 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g10 <- g10 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g10 <- g10 + ggtitle("Residual Values vs Year")
#X-axis & Y-axis.
g10 <- g10 + xlab("Theatrical Release Year") + ylab("Residual Value")
#Modify labels and text.
g10 <- g10 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Residual Values vs Runtime of Movie.
g11 <- ggplot(final_reg, aes(x = runtime, y = .resid))
g11 <- g11 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.2)
#Horizontal line at y = 0.
g11 <- g11 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g11 <- g11 + ggtitle("Residual Values vs Runtime")
#X-axis & Y-axis.
g11 <- g11 + xlab("Runtime of Movie (minutes)") + ylab("Residual Value")
#Modify labels and text.
g11 <- g11 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g9, g10, g11, layout_matrix = matrix(rbind(c(1, 1, 2, 2),
c(NA, 3, 3, NA)),
ncol = 4))
Plots: Distribution of Residual Values and Normal-QQ
#Histogram
g12 <- ggplot(final_reg, aes(x = .resid)) + geom_histogram(bins = 20, fill = viridis(5)[3])
#Title.
g12 <- g12 + ggtitle("Distribution of Residual Values")
#X-axis & Y-axis.
g12 <- g12 + xlab("Residual Value") + ylab("Frequency")
#Modify labels and text.
g12 <- g12 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#QQ line through the first and third quartiles (the same line qqline() draws).
res <- final_reg$residuals
m <- (quantile(res, 0.75) - quantile(res, 0.25)) / (qnorm(0.75) - qnorm(0.25))
b <- quantile(res, 0.25) - m * qnorm(0.25)
qq_line <- data.frame(intercept = b,
slope = m)
#Normal-QQ Plot
g13 <- ggplot(data = final_reg)
g13 <- g13 + stat_qq(aes(sample = res), color = viridis(5)[3], size = 3, alpha = 0.4)
g13 <- g13 + geom_abline(data = qq_line, aes(intercept = intercept, slope = slope), color = viridis(5)[1], size = 1)
#Title.
g13 <- g13 + ggtitle("Normal-QQ Plot")
#X-axis & Y-axis.
g13 <- g13 + xlab("Theoretical Quantile") + ylab("Sample Quantile")
#Modify labels and text.
g13 <- g13 + theme(plot.title = element_text(hjust = 0.5, size = 16, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g12, g13, layout_matrix = matrix(c(1, 2),
ncol = 2))
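In ggplot2 3.0.0 and later, stat_qq_line() computes the same robust line through the 0.25 and 0.75 quantile pairs, so the manual slope/intercept calculation above can be dropped. A minimal sketch, assuming the same final_reg fit and viridis palette:

```r
library(ggplot2)

# Sketch: stat_qq_line() defaults to a line through the first- and
# third-quartile pairs, matching the manual m and b computed above.
g13_alt <- ggplot(mapping = aes(sample = final_reg$residuals)) +
  stat_qq(color = viridis(5)[3], size = 3, alpha = 0.4) +
  stat_qq_line(color = viridis(5)[1], size = 1) +
  ggtitle("Normal-QQ Plot") +
  xlab("Theoretical Quantile") + ylab("Sample Quantile")
```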
Plots: Residual vs Fitted Values and Absolute Value of Residuals vs Fitted Values
#Scatter of Residuals vs. Fitted.
g14 <- ggplot(final_reg, aes(x = .fitted, y = .resid))
g14 <- g14 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g14 <- g14 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g14 <- g14 + ggtitle("Residual vs Fitted Values")
#X-axis & Y-axis.
g14 <- g14 + xlab("Fitted Value") + ylab("Residual Value")
#Modify labels and text.
g14 <- g14 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
#Scatter of Absolute Value of Residuals vs. Fitted.
g15 <- ggplot(final_reg, aes(x = .fitted, y = abs(.resid)))
g15 <- g15 + geom_point(color = viridis(5)[3], size = 3, alpha = 0.4)
##Horizontal Line.
g15 <- g15 + geom_hline(yintercept = 0, col = viridis(5)[1], linetype = "dashed")
#Title.
g15 <- g15 + ggtitle("Absolute Value of Residuals vs Fitted Values")
#X-axis & Y-axis.
g15 <- g15 + xlab("Fitted Value") + ylab("Absolute Value of Residual")
#Modify labels and text.
g15 <- g15 + theme(plot.title = element_text(hjust = 0.5, size = 14, face = "bold"),
axis.text.x = element_text(size = 12),
axis.title.x = element_text(size = 14, face = "bold"),
axis.text.y = element_text(size = 12),
axis.title.y = element_text(size = 14, face = "bold"))
grid.arrange(g14, g15, layout_matrix = matrix(c(1, 2),
ncol = 2))
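As a quick cross-check of the custom plots above, base R draws its standard diagnostic set for any lm fit in one call (residuals vs fitted, normal Q-Q, scale-location, and residuals vs leverage):

```r
# Base-R diagnostic plots for the final model, in a 2x2 grid.
par(mfrow = c(2, 2))
plot(final_reg)
par(mfrow = c(1, 1))  # restore the default layout
```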
Output: Final Regression Model
Call:
lm(formula = imdb_rating ~ genre + imdb_num_votes + mpaa_rating +
thtr_rel_year + title_type + runtime + best_pic_nom + best_pic_win +
studio, data = movies_model)
Residuals:
Min 1Q Median 3Q Max
-3.9834 -0.4210 0.0501 0.4831 2.0092
Coefficients:
Estimate Std. Error t value
(Intercept) 33.466623723 8.893593000 3.763
genreAnimation -0.054576319 0.341477488 -0.160
genreArt House & International 0.821902649 0.314460459 2.614
genreComedy 0.103447373 0.144643594 0.715
genreDocumentary 1.102576913 0.409228155 2.694
genreDrama 0.779740751 0.126251654 6.176
genreHorror 0.095550056 0.236081543 0.405
genreMusical & Performing Arts 1.369524721 0.301723109 4.539
genreMystery & Suspense 0.380272489 0.165101836 2.303
genreOther 0.570172760 0.266941993 2.136
genreScience Fiction & Fantasy -0.006956664 0.332273711 -0.021
imdb_num_votes 0.000003637 0.000000378 9.622
mpaa_ratingPG -0.405807833 0.243784352 -1.665
mpaa_ratingPG-13 -0.620720506 0.260200287 -2.386
mpaa_ratingR -0.325248967 0.251558577 -1.293
mpaa_ratingUnrated -0.183475741 0.352573003 -0.520
thtr_rel_year -0.013599557 0.004452674 -3.054
title_typeFeature Film -1.044387195 0.393503213 -2.654
title_typeTV Movie 0.154541942 0.765411443 0.202
runtime 0.006487380 0.002480862 2.615
best_pic_nomyes 0.453182941 0.221183322 2.049
best_pic_winyes -0.692966281 0.413507058 -1.676
studioA24 -0.001808897 0.496349157 -0.004
studioAnchor Bay 0.359235698 0.506885308 0.709
studioArtisan Entertainment 0.325650479 0.592174035 0.550
studioBuena Vista 0.076394711 0.231289211 0.330
studioColumbia Tristar Pictures 0.272555110 0.274622534 0.992
studioCowboy Pictures 0.180038946 0.605770472 0.297
studioDreamWorks -0.058437674 0.599481428 -0.097
studioEmbassy 0.780944544 0.596657180 1.309
studioFilmDistrict 0.434419405 0.594431817 0.731
studioFirst Run Features -0.362773615 0.439249685 -0.826
studioFocus Features 0.249836129 0.428749917 0.583
studioGravitas 0.464452949 0.608731855 0.763
studioHBO -0.183561514 0.352235847 -0.521
studioHollywood Pictures -0.670688014 0.486790164 -1.378
studioIFC 0.340400065 0.256811732 1.325
studioImage Entertainment 0.007286210 0.506221283 0.014
studioIndependent Pictures -0.055275263 0.565441320 -0.098
studioIndomina -0.332292366 0.602611897 -0.551
studioLionsGate -0.262525609 0.220986695 -1.188
studioLive Home Video -0.163367100 0.590907733 -0.276
studioMagnolia Pictures 0.253886585 0.284691867 0.892
studioMGM/UA 0.054282170 0.180102679 0.301
studioMiramax 0.091622940 0.204326834 0.448
studioMusic Box Films 1.096406842 0.608221321 1.803
studioNational Geographic 0.270789354 0.536668849 0.505
studioNelson Entertainment -0.472088904 0.603612170 -0.782
studioNew Line -0.400111068 0.230759449 -1.734
studioNew Yorker Films 0.819809272 0.496611280 1.651
studioNewmarket 0.076672297 0.605013926 0.127
studioNordisk 0.923446869 0.606720405 1.522
studioOrion -0.039276164 0.276607551 -0.142
studioOverture Films 0.584999973 0.592257029 0.988
studioParamount -0.084601888 0.159770957 -0.530
studioRepublic Pictures Home Video 1.130731858 0.590845635 1.914
studioRoadside 0.430897748 0.500559476 0.861
studioScreen Gems 0.324935812 0.487935143 0.666
studioSony 0.083083166 0.161825309 0.513
studioStrand Releasing 0.823843876 0.651305719 1.265
studioSummit Entertainment 0.784638894 0.508943732 1.542
studioThinkFilm 0.305328758 0.497871784 0.613
studioTouchstone 0.315897381 0.386628620 0.817
studioTrimark 0.632995803 0.487862650 1.297
studioUniversal 0.331624385 0.177343065 1.870
studioUSA Films 0.223905977 0.590340500 0.379
studioWalt Disney 0.517178787 0.342174247 1.511
studioWarner Brothers 0.083163595 0.151329790 0.550
studioWeinstein -0.113686626 0.271284146 -0.419
Pr(>|t|)
(Intercept) 0.000188 ***
genreAnimation 0.873086
genreArt House & International 0.009235 **
genreComedy 0.474836
genreDocumentary 0.007298 **
genreDrama 0.00000000139 ***
genreHorror 0.685852
genreMusical & Performing Arts 0.00000713306 ***
genreMystery & Suspense 0.021686 *
genreOther 0.033183 *
genreScience Fiction & Fantasy 0.983305
imdb_num_votes < 0.0000000000000002 ***
mpaa_ratingPG 0.096634 .
mpaa_ratingPG-13 0.017436 *
mpaa_ratingR 0.196648
mpaa_ratingUnrated 0.603028
thtr_rel_year 0.002380 **
title_typeFeature Film 0.008213 **
title_typeTV Movie 0.840074
runtime 0.009201 **
best_pic_nomyes 0.041008 *
best_pic_winyes 0.094415 .
studioA24 0.997094
studioAnchor Bay 0.478843
studioArtisan Entertainment 0.582624
studioBuena Vista 0.741316
studioColumbia Tristar Pictures 0.321461
studioCowboy Pictures 0.766436
studioDreamWorks 0.922385
studioEmbassy 0.191198
studioFilmDistrict 0.465245
studioFirst Run Features 0.409270
studioFocus Features 0.560360
studioGravitas 0.445843
studioHBO 0.602512
studioHollywood Pictures 0.168906
studioIFC 0.185633
studioImage Entertainment 0.988522
studioIndependent Pictures 0.922166
studioIndomina 0.581599
studioLionsGate 0.235425
studioLive Home Video 0.782306
studioMagnolia Pictures 0.372945
studioMGM/UA 0.763242
studioMiramax 0.654055
studioMusic Box Films 0.072063 .
studioNational Geographic 0.614086
studioNelson Entertainment 0.434533
studioNew Line 0.083572 .
studioNew Yorker Films 0.099424 .
studioNewmarket 0.899208
studioNordisk 0.128652
studioOrion 0.887145
studioOverture Films 0.323768
studioParamount 0.596687
studioRepublic Pictures Home Video 0.056239 .
studioRoadside 0.389755
studioScreen Gems 0.505765
studioSony 0.607896
studioStrand Releasing 0.206509
studioSummit Entertainment 0.123797
studioThinkFilm 0.539986
studioTouchstone 0.414297
studioTrimark 0.195079
studioUniversal 0.062090 .
studioUSA Films 0.704644
studioWalt Disney 0.131324
studioWarner Brothers 0.582879
studioWeinstein 0.675351
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8136 on 486 degrees of freedom
Multiple R-squared: 0.4954, Adjusted R-squared: 0.4249
F-statistic: 7.018 on 68 and 486 DF, p-value: < 0.00000000000000022
#Deadpool data from https://www.imdb.com/title/tt1431045/.
deadpool <- data.frame("imdb_rating" = 8.0,
"genre" = "Action & Adventure",
"imdb_num_votes" = 872845,
"mpaa_rating" = "R",
"thtr_rel_year" = 2016,
"title_type" = "Feature Film",
"runtime" = 108,
"best_pic_nom" = "no",
"best_pic_win" = "no",
"studio" = "20th Century Fox")
#Predicted IMDb rating and prediction interval for Deadpool.
predict(final_reg, deadpool, interval = "prediction")
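predict.lm returns a one-row matrix with columns fit, lwr, and upr, using a 95% interval by default. A sketch of pulling those values out for reporting, assuming the final_reg fit and deadpool data frame from above:

```r
# Point prediction with an explicit 95% prediction interval,
# then extract the bounds by column name for a readable summary.
pred <- predict(final_reg, deadpool, interval = "prediction", level = 0.95)
cat(sprintf("Predicted IMDb rating: %.2f (95%% PI: %.2f to %.2f)\n",
            pred[1, "fit"], pred[1, "lwr"], pred[1, "upr"]))
```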