Project 1: Explore and Prepare Data (2023)

Note: This project involves getting data ready for analysis and doing some preliminary investigations. Project 2 will involve modeling and predictions, and will be released at a later date. Both projects will have equal weightage towards your grade.

In this project, you will explore a dataset that contains information about movies, including ratings, budget, gross revenue and other attributes. It was prepared by Dr.Guy Lebanon, and here is his description of the dataset:

The file movies_merged contains a dataframe with the same name that has 40K rows and 39 columns. Each row represents a movie title and each column represents a descriptor such as Title, Actors, and Budget. I collected the data by querying IMDb’s API (see and joining it with a separate dataset of movie budgets and gross earnings (unknown to you). The join key was the movie title. This data is available for personal use, but IMDb’s terms of service do not allow it to be used for commercial purposes or for creating a competing repository.

Your goal is to investigate the relationship between the movie descriptors and the box office success of movies, as represented by the variable Gross. This task is extremely important as it can help a studio decide which titles to fund for production, how much to bid on produced movies, when to release a title, how much to invest in marketing and PR, etc. This information is most useful before a title is released, but it is still very valuable after the movie is already released to the public (for example it can affect additional marketing spend or how much a studio should negotiate with on-demand streaming companies for a second window streaming rights).

This is an R Markdown Notebook. Open this file in RStudio to get started.

When you execute code within the notebook, the results appear beneath the code. Try executing this chunk by clicking the Run button within the chunk or by placing your cursor inside it and pressing Cmd+Shift+Enter.

x = 1:10print(x^2)
 [1] 1 4 9 16 25 36 49 64 81 100

Plots appear inline too:

plot(x, x^2, 'o')

Add a new chunk by clicking the Insert Chunk button on the toolbar or by pressing Cmd+Option+I.

When you save the notebook, an HTML file containing the code and output will be saved alongside it (click the Preview button or press Cmd+Shift+K to preview the HTML file).

Please complete the tasks below and submit this R Markdown file (as pr1.Rmd) as well as a PDF export of it (as pr1.pdf). Both should contain all the code, output, plots and written responses for each task.

Make sure you’ve downloaded the movies_merged file and it is in the current working directory. Now load it into memory:


This creates an object of the same name (movies_merged). For convenience, you can copy it to df and start using it:

df = movies_mergedcat("Dataset has", dim(df)[1], "rows and", dim(df)[2], "columns", end="\n", file="")
Dataset has 40789 rows and 39 columns 
 [1] "Title" "Year" "Rated" "Released" "Runtime" "Genre" "Director" "Writer" [9] "Actors" "Plot" "Language" "Country" "Awards" "Poster" "Metascore" "imdbRating" [17] "imdbVotes" "imdbID" "Type" "tomatoMeter" "tomatoImage" "tomatoRating" "tomatoReviews" "tomatoFresh" [25] "tomatoRotten" "tomatoConsensus" "tomatoUserMeter" "tomatoUserRating" "tomatoUserReviews" "tomatoURL" "DVD" "BoxOffice" [33] "Production" "Website" "Response" "Budget" "Domestic_Gross" "Gross" "Date" 

Load any R packages that you will need to use. You can come back to this chunk, edit it and re-run to load any additional packages later.

package SnowballC was built under R version 3.3.2

If you are loading any non-standard packages (ones that have not been discussed in class or explicitly allowed for this project), please mention them below. Include any special instructions if they cannot be installed using the regular install.packages('') command.

Non-standard packages used: None

Each task below is worth 10 points, and is meant to be performed sequentially, step 2 after you have processed the data as described in step 1. Total points: 100

Complete each task by implementing code chunks as described by TODO comments, and by responding to questions (“Q:”) with written answers (“A:”). If you are unable to find a meaningful or strong relationship in any of the cases when requested, explain why not by referring to appropriate plots/statistics.

It is OK to handle missing values below by omission, but please omit as little as possible. It is worthwhile to invest in reusable and clear code as you may need to use it or modify it in project 2.

The variable Type captures whether the row is a movie, a TV series, or a game. Remove all rows from df that do not correspond to movies.

# TODO: Remove all rows from df that do not correspond to moviesdf = df[df$Type=="movie",]nrow(df)
[1] 40000

Q: How many rows are left after removal? Enter your response below.

A: 4000

The variable Runtime represents the length of the title as a string. Write R code to convert it to a numeric value (in minutes) and replace df$Runtime with the new numeric column.

# TODO: Replace df$Runtime with a numeric column containing the runtime in minutesRuntimeToMinutes = function(str){ v = strsplit(str, " ")[[1]] res = 0 if(length(v) < 2){ print(v) return(0) } for (i in seq(2, length(v)+1, 2)) { if (v[[i]] == "min") res = res + as.integer(v[i-1]) if (v[[i]] == "h") res = res + as.integer(v[i-1]) * 60 } return(res)}# Let's remove NAsdf = df[df$Runtime != "N/A" & !$Runtime),]# Convert Runtime variable to numericdf$Runtime = sapply(df$Runtime, RuntimeToMinutes)

Now investigate the distribution of Runtime values and how it changes over years (variable Year, which you can bucket into decades) and in relation to the budget (variable Budget). Include any plots that illustrate.

Let’s first bucket movies into decades.

YearToDecade = function(year){ if(year >= 1880 && year < 1890) return("1880s") if(year >= 1890 && year < 1900) return("1890s") if(year >= 1900 && year < 1910) return("1900s") if(year >= 1910 && year < 1920) return("1910s") if(year >= 1920 && year < 1930) return("1920s") if(year >= 1930 && year < 1940) return("1930s") if(year >= 1940 && year < 1950) return("1940s") if(year >= 1950 && year < 1960) return("1950s") if(year >= 1960 && year < 1970) return("1960s") if(year >= 1970 && year < 1980) return("1970s") if(year >= 1980 && year < 1990) return("1980s") if(year >= 1990 && year < 2000) return("1990s") if(year >= 2000 && year < 2010) return("2000s") if(year >= 2010) return("2010s") return("Other")}df$Decade = factor(sapply(df$Year, YearToDecade), levels=c("1880s","1890s","1900s","1910s","1920s","1930s","1940s","1950s","1960s","1970s","1980s","1990s","2000s","2010s"))

Now let’s see how Runtime density look like over the decades.

chart 0){ #print(paste(as.character(df[i,]$AwardsNum[[1]][1]) , "nomination", sep=" ")) df[i,]$Nominations = as.numeric(df[i,]$AwardsNum[[1]][1]) } } # 2 numbers found : most probably the first is win and the second is nomination if(length(df[i,]$AwardsNum[[1]])==2){ if((length(grep("win",df[i,]$Awards))>0) || (length(grep("won",df[i,]$Awards))>0)|| (length(grep("Won",df[i,]$Awards))>0)){ #print(paste(as.character(df[i,]$AwardsNum[[1]][1]) , "Win", sep=" ")) df[i,]$Wins = as.numeric(df[i,]$AwardsNum[[1]][1]) } if(length(grep("nomination",df[i,]$Awards))>0){ #print(paste(as.character(df[i,]$AwardsNum[[1]][2]) , "nomination", sep=" ")) df[i,]$Nominations = as.numeric(df[i,]$AwardsNum[[1]][2]) } } # 3 numbers found : most probably the first is total or another award, the second is win, and the third is nomination if(length(df[i,]$AwardsNum[[1]])==3){ if((length(grep("win",df[i,]$Awards))>0) || (length(grep("won",df[i,]$Awards))>0)|| (length(grep("Won",df[i,]$Awards))>0)){ #print(paste(as.character(df[i,]$AwardsNum[[1]][2]) , "Win", sep=" ")) df[i,]$Wins = as.numeric(df[i,]$AwardsNum[[1]][2]) } if(length(grep("nomination",df[i,]$Awards))>0){ #print(paste(as.character(df[i,]$AwardsNum[[1]][3]) , "nomination", sep=" ")) df[i,]$Nominations = as.numeric(df[i,]$AwardsNum[[1]][3]) } } # no numbers found : set both win and nomination to 0, increment the rowsWithInvalidAwards counter if(length(df[i,]$AwardsNum[[1]])==0){ df[i,]$Wins = 0 df[i,]$Nominations = 0 rowsWithInvalidAwards = rowsWithInvalidAwards + 1 } if(length(unlist(df[i,]$Wins))==1) df[i,]$Wins = as.numeric(unlist(df[i,]$Wins)) else print(paste(df[i,]$Awards, df[i,]$Wins, sep = " ==> " )) if(length(unlist(df[i,]$Nominations))==1) df[i,]$Nominations = as.numeric(unlist(df[i,]$Nominations)[[1]]) else print(paste(df[i,]$Nominations, df[i,]$Nominations, sep = " ==> " ))}

Number of rows that have non-zero wins nominations is 675

print(nrow(df) - rowsWithInvalidAwards)
[1] 675

Q: How did you construct your conversion mechanism? How many rows had valid/non-zero wins or nominations?

A: First I counted how many numbers exist in the Awards column using regular expressions. Based on number of regular expression matches (each match represent one number within the Awards column) we decide the value for Wins and Nominations column. If there is one number: use regular expression to decide if this number is for Wins or Nominations. If there are two numbers: use regular expressions to make sure the first one goes to Wins if any of the following exist(“Win”, “win”, “won”), use regular expression to make sure the second one goes to Nominations if (“nomination”) exists. If there are three numbers: use regular expressions to make sure the second one goes to Wins if any of the following exist(“Win”, “win”, “won”), use regular expression to make sure the third one goes to Nominations if (“nomination”) exists. If there are no numbers: count this row as Invalid row. This counter will be used to answer the second part of this question. I found 674 rows with valid non-zero wins or nominations.

# TODO: Plot Gross revenue against wins and nominationsggplot(df, aes(Gross)) + geom_point(data=df, aes(y=Wins, color="Wins")) + geom_point(data=df, aes(y=Nominations, color="Nominations")) + labs(title="Figure 11 - Nominations/Wins vs. Gross", x="Gross", y="Nominations / Wins")

Let’s see Wins alone against Gross

ggplot(df) + geom_point(data=df, aes(x=Wins, y=Gross)) + labs(title="Figure 12 - Wins vs. Gross", x="Wins", y="Gross")

Let’s see Nominations alone against Gross

ggplot(df) + geom_point(data=df, aes(x=Nominations, y=Gross)) + labs(title="Figure 13 - Nominations vs. Gross", x="Nominations", y="Gross")

Q: How does the gross revenue vary by number of awards won and nominations received?

A: We can notice from Figures 11, Figure 12, and Figure 13 that neither wins nor nominations correlate with Gross. Both wins and nominations fall below 50 regardless to Gross.

There are several variables that describe ratings, including IMDb ratings (imdbRating represents average user ratings and imdbVotes represents the number of user ratings), and multiple Rotten Tomatoes ratings (represented by several variables pre-fixed by tomato). Read up on such ratings on the web (for example and

Investigate the pairwise relationships between these different descriptors using graphs.

# TODO: Illustrate how ratings from IMDb and Rotten Tomatoes are relatedchart