BUS5DWR Data Wrangling and R - La Trobe University
Overview
Assignment Requirements
Part 1
The given data files Movie.csv, Rating.csv and Continent.csv record information about IMDb movie ratings.
Write R code in an Rmd file to answer the following questions. Each question should be presented in one code chunk:
Load the datasets from the given files into three dataframes called Movie, Rating, and Continent. Rename the columns to remove any spaces. (Hint: use str_replace_all to do this automatically for all columns.) Remove the Writer column from the Movie dataframe. Display the summary of each dataframe.
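A minimal sketch of the renaming step hinted at above, run on a toy dataframe rather than the real files. The column names "Movie Title" and "Release Year" are assumptions made for illustration; the actual columns in Movie.csv may differ.

```r
library(stringr)

# Toy dataframe standing in for Movie.csv; these column names are hypothetical.
movie <- data.frame(
  "Movie Title" = c("A", "B"),
  "Release Year" = c(2001, 2005),
  "Writer" = c("X", "Y"),
  check.names = FALSE,       # keep the spaces so there is something to remove
  stringsAsFactors = FALSE
)

# Remove every space from every column name in one step.
names(movie) <- str_replace_all(names(movie), " ", "")

# Drop the Writer column by name.
movie$Writer <- NULL

summary(movie)
```

When reading a real CSV with base R, read.csv(..., check.names = FALSE) keeps spaces in the header so that the same renaming step applies.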
How many movies produced by 'Universal Pictures' have the actor 'Arnold Schwarzenegger'?
Display the five most-reviewed movies that belong to both Action and Drama. Display only the Title and the number of reviews.
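One possible way to filter on two genres at once and keep the most-reviewed rows, sketched on invented data. The columns Genre (a comma-separated string) and Reviews are assumptions; the real Movie dataframe may store genres and review counts differently.

```r
library(dplyr)
library(stringr)

# Hypothetical movie table; Genre and Reviews are assumed column names.
movie <- data.frame(
  Title = c("A", "B", "C", "D"),
  Genre = c("Action, Drama", "Drama", "Drama, Action, Thriller", "Action"),
  Reviews = c(120, 400, 300, 250),
  stringsAsFactors = FALSE
)

top_reviewed <- movie %>%
  # Keep rows whose genre string mentions both genres.
  filter(str_detect(Genre, "Action"), str_detect(Genre, "Drama")) %>%
  # Keep at most the five rows with the largest review counts.
  slice_max(Reviews, n = 5) %>%
  select(Title, Reviews)

top_reviewed
```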
Display movie rating information including Title, average rating and two new columns: (1) 'TotalVote', showing the total votes from both males and females, and (2) 'Popular', showing 'Male' for movies whose MalesTotalVotes is greater than FemalesTotalVotes and 'Female' otherwise. (Hint: see Workshop 9 exercise.) Show only the ten movies with the highest average rating.
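The two derived columns can be sketched with mutate and if_else. The data below is invented, and the name AverageRating is an assumption; only MalesTotalVotes and FemalesTotalVotes come from the task description.

```r
library(dplyr)

# Hypothetical rating table; AverageRating is an assumed column name.
rating <- data.frame(
  Title = c("A", "B", "C"),
  AverageRating = c(8.1, 7.4, 9.0),
  MalesTotalVotes = c(1200, 300, 800),
  FemalesTotalVotes = c(900, 450, 800),
  stringsAsFactors = FALSE
)

result <- rating %>%
  mutate(
    TotalVote = MalesTotalVotes + FemalesTotalVotes,
    # 'Male' only when male votes are strictly greater; ties fall to 'Female'.
    Popular = if_else(MalesTotalVotes > FemalesTotalVotes, "Male", "Female")
  ) %>%
  select(Title, AverageRating, TotalVote, Popular) %>%
  arrange(desc(AverageRating)) %>%
  head(10)

result
```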
Display the number of Comedy movies and their average rating from each continent.
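A sketch of a per-continent summary, assuming the Movie dataframe carries a Country column and that Continent.csv maps Country to Continent. Both the join key and every column name here are assumptions.

```r
library(dplyr)
library(stringr)

# Hypothetical tables; all column names are assumed for illustration.
movie <- data.frame(
  Title = c("A", "B", "C", "D"),
  Genre = c("Comedy, Drama", "Action", "Comedy", "Comedy, Romance"),
  Country = c("USA", "USA", "France", "Canada"),
  AverageRating = c(7.0, 6.5, 8.0, 6.0),
  stringsAsFactors = FALSE
)
continent <- data.frame(
  Country = c("USA", "France", "Canada"),
  Continent = c("North America", "Europe", "North America"),
  stringsAsFactors = FALSE
)

comedy_by_continent <- movie %>%
  filter(str_detect(Genre, "Comedy")) %>%
  inner_join(continent, by = "Country") %>%
  group_by(Continent) %>%
  summarise(Movies = n(), MeanRating = mean(AverageRating), .groups = "drop")

comedy_by_continent
```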
Analyse the distribution of the average rating of all the movies after the year 2000. (Hint: draw a boxplot and a histogram, and write a short paragraph of fewer than 100 words to describe your insights.)
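A sketch of the two plots using ggplot2 on invented data. The column names Year and AverageRating are assumptions, and "after the year 2000" is read here as Year > 2000, which is one interpretation.

```r
library(dplyr)
library(ggplot2)

# Hypothetical data; Year and AverageRating are assumed column names.
movie <- data.frame(
  Title = c("A", "B", "C", "D", "E"),
  Year = c(1995, 2001, 2010, 2000, 2015),
  AverageRating = c(7.2, 6.8, 8.1, 5.9, 7.5),
  stringsAsFactors = FALSE
)

# Assumption: "after 2000" excludes the year 2000 itself.
recent <- movie %>% filter(Year > 2000)

# Boxplot and histogram of the same variable; print() the objects to draw them.
p_box <- ggplot(recent, aes(y = AverageRating)) + geom_boxplot()
p_hist <- ggplot(recent, aes(x = AverageRating)) + geom_histogram(binwidth = 0.5)

summary(recent$AverageRating)
```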
Part 2
The given Spotify.xlsx file records the summary of Australia's top 200 daily-streamed songs (or tracks) in the first three months of 2017 and 2018. The Data worksheet records the total streams and the highest position of each song in each month. You will see that the data is far from being ready for analysis and needs to be 'wrangled'. The given Artist.csv file records the artists who perform the songs. You are required to write R code to perform the following steps.
Load the data from the Spotify worksheet into a dataframe named Spotify. Replace the spaces in the column names with underscores ("_"). Show the structure of Spotify.
You can see that most column names contain the month information, which should be placed as row values. Perform the following steps:
• Use pivot_longer to transform the dataframe into four columns, namely Artist_ID, Track_Name, Month, and Value.
• Drop all rows having NA in Value.
• Split the Month column into Month and Year.
• Display the number of columns and rows.
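The reshaping steps above can be sketched with tidyr on a toy dataframe. The month column names (such as Jan_2017) and the cell contents are assumptions; the real worksheet may label its months differently, in which case the separator passed to separate would need to change.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the Spotify dataframe; the month columns and values are hypothetical.
spotify <- data.frame(
  Artist_ID = c(1, 2),
  Track_Name = c("Song A", "Song B"),
  Jan_2017 = c("150000/3", NA),
  Feb_2017 = c("120000/5", "90000/40"),
  stringsAsFactors = FALSE
)

long <- spotify %>%
  # Every column except the two identifiers becomes a (Month, Value) pair.
  pivot_longer(
    cols = -c(Artist_ID, Track_Name),
    names_to = "Month",
    values_to = "Value"
  ) %>%
  drop_na(Value) %>%
  # Assumes the month and year are joined by an underscore.
  separate(Month, into = c("Month", "Year"), sep = "_")

dim(long)
```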
You can see that the data in the Value column contains both the total streams and the highest position of the song in the corresponding month. Note that the smaller the position value, the higher the ranking.
• Split the Value column into two columns with appropriate names.
• For each month-year, show the total streams and the number of songs appearing in the daily top 200.
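A sketch of splitting a combined column and summarising per month-year. The "streams/position" format with a slash separator is purely an assumption, as are the new column names Total_Streams and Highest_Position; inspect the real Value column before choosing a separator.

```r
library(dplyr)
library(tidyr)

# Hypothetical long-format data; the "/" separator in Value is an assumption.
long <- data.frame(
  Track_Name = c("Song A", "Song A", "Song B"),
  Month = c("Jan", "Feb", "Feb"),
  Year = c("2017", "2017", "2017"),
  Value = c("150000/3", "120000/5", "90000/40"),
  stringsAsFactors = FALSE
)

monthly <- long %>%
  # convert = TRUE turns the two new text columns into numbers.
  separate(Value, into = c("Total_Streams", "Highest_Position"),
           sep = "/", convert = TRUE) %>%
  group_by(Year, Month) %>%
  summarise(
    Streams = sum(Total_Streams),
    Songs = n_distinct(Track_Name),
    .groups = "drop"
  )

monthly
```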
Find all tracks that appeared in all six months with more than 100,000 streams in each month. Display their names, total streams and highest positions. Export the result to a CSV file.
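One way to express "present in all six months, each above a threshold" is a grouped filter. The sketch below uses invented data and assumed column names (Month_Year, Total_Streams, Highest_Position), and writes to a temporary file rather than a fixed path.

```r
library(dplyr)

months <- c("Jan_2017", "Feb_2017", "Mar_2017", "Jan_2018", "Feb_2018", "Mar_2018")

# Hypothetical data: Song A qualifies, Song B dips below 100,000 once,
# and Song C is missing one month.
streams <- data.frame(
  Track_Name = c(rep("Song A", 6), rep("Song B", 6), rep("Song C", 5)),
  Month_Year = c(months, months, months[1:5]),
  Total_Streams = c(rep(150000, 6),
                    200000, 90000, 130000, 140000, 150000, 160000,
                    rep(300000, 5)),
  Highest_Position = c(3, 2, 4, 6, 5, 7,
                       1, 50, 9, 8, 7, 6,
                       1, 1, 2, 2, 3),
  stringsAsFactors = FALSE
)

consistent <- streams %>%
  group_by(Track_Name) %>%
  # Keep a track only if it has six distinct months and every month clears the bar.
  filter(n_distinct(Month_Year) == 6, all(Total_Streams > 100000)) %>%
  summarise(
    Total_Stream = sum(Total_Streams),
    Highest_Position = min(Highest_Position),  # smaller number = higher ranking
    .groups = "drop"
  )

out_file <- file.path(tempdir(), "consistent_tracks.csv")
write.csv(consistent, out_file, row.names = FALSE)
```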
Load the data from the Artist.csv file into a new dataframe. Rename the columns to remove spaces. How many artists do not have songs listed in the Spotify dataframe?
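Counting rows in one table with no match in another is a natural fit for anti_join. The sketch assumes both dataframes share an Artist_ID key; the Artist_Name column and all values are invented.

```r
library(dplyr)

# Hypothetical tables sharing an assumed Artist_ID key.
artist <- data.frame(
  Artist_ID = c(1, 2, 3, 4),
  Artist_Name = c("A", "B", "C", "D"),
  stringsAsFactors = FALSE
)
spotify <- data.frame(
  Artist_ID = c(1, 1, 3),
  Track_Name = c("x", "y", "z"),
  stringsAsFactors = FALSE
)

# Rows of artist that have no matching Artist_ID in spotify.
missing_artists <- anti_join(artist, spotify, by = "Artist_ID")

nrow(missing_artists)
```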
Draw a bar chart to compare the artists of the songs/tracks returned in Q2.4 (the tracks that appeared in all six months) based on their total streams. Order the bars from the highest to the lowest total streams. Write a short paragraph describing the insights you gained from this chart.
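A sketch of a descending bar chart in ggplot2, using reorder to sort the bars. The totals dataframe and its column names are assumptions standing in for the joined result of the earlier steps.

```r
library(ggplot2)

# Hypothetical per-artist totals; names and values are invented.
totals <- data.frame(
  Artist = c("A", "B", "C"),
  Total_Stream = c(500000, 900000, 700000),
  stringsAsFactors = FALSE
)

# reorder() sorts ascending, so negating the value puts the largest bar first.
p <- ggplot(totals, aes(x = reorder(Artist, -Total_Stream), y = Total_Stream)) +
  geom_col() +
  labs(x = "Artist", y = "Total streams")
```

Calling print(p) draws the chart, with artist B leftmost in this toy data.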