Develop and assess your skills in R programming

Reference no: EM132315931

Assignment

The purpose of this assignment is to develop and assess your skills in R programming including wrangling, summarising and plotting data. Using the tidyverse package is recommended but not compulsory. Please read through the entire assignment and understand the submission format and marking rubrics before starting.

Part 1

The spreadsheet titled ‘censusdata.xlsx’ contains information about the number of bedrooms in occupied private dwellings for local government areas in Melbourne for the years 2011 and 2016. You will see that it is far from being ready for analysis and needs to be ‘wrangled’. Additionally, a few errors have been deliberately introduced into the first two columns, so these will need to be found and corrected through initial analysis.

1. Explain why the data in its current form is not considered to be in ‘tidy’ format.

2. Write R code to read in the data (readxl package), manipulate it and output it to a single csv file having the following header row.

region,year,br_count_0,br_count_1,br_count_2,br_count_3,br_count_4_or_more,br_count_unstated,av_per_dwelling,av_per_household

Your code will have the following sections (not necessarily in the order given and the process may be iterative as you find more things to do). Please include comments in the code to separate each segment and explain your steps.

• Read the data sets into two data frames, df2011 and df2016.
• Compare the layout of each of the two data frames, then remove appropriate rows of one data frame to match the format of the other.
• Write a function that takes in a table of the original form and outputs a table in the desired form with columns specified above.
  o Remove unwanted rows or columns
  o Split values into multiple columns to make them atomic
  o Appropriately transform the data into the desired form
  o Rename columns
• Apply the function to each table to create two tables in the desired format.
• Do a summary of each table to look for unusual values.
• Correct those values until the two tables have the same dimensions and format.
• Merge the two tables into a single table so that we see data in the form

Banyule,2011,78,1287,8457,21865,11366,645,3.1,2.6
Banyule,2016,...
Bayside,2011,...
Bayside,2016,...
...
Victoria,2011,...
Victoria,2016,...
Australia,2011,...
Australia,2016,...

(listed alphabetically by region, then by year, with Victoria and Australia at the end)

• Write the result to a csv file (it should have 65 rows including the header).
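The merge, ordering and output steps above could be sketched as follows. This is only an illustration in base R under stated assumptions: the two small tables and their values are invented stand-ins for the wrangled 2011 and 2016 tables, only one count column is shown, and the output file name is hypothetical.

```r
# Toy stand-ins for the two wrangled tables (values are invented, not census data).
t2011 <- data.frame(region = c("Victoria", "Bayside", "Australia", "Banyule"),
                    year = 2011,
                    br_count_0 = c(4, 2, 5, 1),
                    stringsAsFactors = FALSE)
t2016 <- data.frame(region = c("Banyule", "Australia", "Victoria", "Bayside"),
                    year = 2016,
                    br_count_0 = c(2, 6, 5, 3),
                    stringsAsFactors = FALSE)

# Stack the two years into a single table.
combined <- rbind(t2011, t2016)

# Rank 0 for ordinary regions, 1 for Victoria, 2 for Australia,
# so the two aggregate rows sort after the alphabetical list.
end_rank <- match(combined$region, c("Victoria", "Australia"), nomatch = 0)

# Order by rank, then region name, then year.
combined <- combined[order(end_rank, combined$region, combined$year), ]
rownames(combined) <- NULL

# Write without row names or quotes so the first line is the plain header row.
out_file <- file.path(tempdir(), "census_tidy_example.csv")  # hypothetical name
write.csv(combined, out_file, row.names = FALSE, quote = FALSE)
```

With the real data, a quick check such as nrow(combined) == 64 would confirm that the file has 65 lines once the header is included.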

3. Which region(s) (ignoring Victoria and Australia) had the largest increase in the number of occupied dwellings with 3 or more bedrooms between 2011 and 2016? (Ignore the unstated counts.)
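One possible way to approach this question is sketched below in base R. It assumes that "increase" means the absolute change in the combined br_count_3 and br_count_4_or_more counts; the region names and numbers are invented for illustration and are not census values.

```r
# Toy table in the Part 1 output format (only the columns needed here).
census <- data.frame(
  region = rep(c("Region A", "Region B", "Region C", "Victoria"), each = 2),
  year = rep(c(2011, 2016), times = 4),
  br_count_3 = c(100, 130, 200, 210, 50, 80, 1000, 1500),
  br_count_4_or_more = c(40, 60, 90, 95, 20, 40, 500, 700),
  stringsAsFactors = FALSE
)

# Drop the aggregate rows, as the question asks.
census <- census[!census$region %in% c("Victoria", "Australia"), ]

# Dwellings with 3 or more bedrooms (unstated counts are ignored).
census$br_3_plus <- census$br_count_3 + census$br_count_4_or_more

# Put the two years side by side, one row per region.
y2011 <- census[census$year == 2011, c("region", "br_3_plus")]
y2016 <- census[census$year == 2016, c("region", "br_3_plus")]
wide <- merge(y2011, y2016, by = "region", suffixes = c("_2011", "_2016"))
wide$increase <- wide$br_3_plus_2016 - wide$br_3_plus_2011

# Keep every region that ties for the maximum increase.
top_regions <- wide$region[wide$increase == max(wide$increase)]
```

Comparing against max() rather than taking the first sorted row keeps all tied regions, which is why the question is phrased as "region(s)".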

Part 2

The online hospitality company Airbnb has made publicly available a number of datasets. This part of the assignment makes use of the listings.csv dataset which is available at

It consists of a number of parameters related to properties available for lodging in the Melbourne metropolitan area and can be visualised

Write R code to answer the following.

1. Give the five neighbourhoods with the most listings (list them along with the counts in descending order).
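A minimal sketch of the counting step, using an invented vector in place of the real data. The column name neighbourhood is an assumption about listings.csv and should be checked against the actual file.

```r
# Toy listings table; the neighbourhood labels are placeholders.
listings <- data.frame(
  neighbourhood = rep(c("N1", "N2", "N3", "N4", "N5", "N6"),
                      times = c(3, 6, 1, 4, 2, 5)),
  stringsAsFactors = FALSE
)

# Count listings per neighbourhood and sort from most to fewest.
counts <- sort(table(listings$neighbourhood), decreasing = TRUE)

# Keep the five largest counts.
top5 <- head(counts, 5)
```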

2. How many listings contain the following words (upper or lower case or mixed) in the name column?
a. Beautiful
b. Quiet
c. Amazing
d. <another adjective of your choice with at least 200 instances>
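Case-insensitive matching of this kind could be done as sketched below. The listing names are invented, and the helper count_listings_with is a hypothetical name. Note the assumption that a substring match is acceptable (so "beautifully" would also count); a whole-word match would need a pattern with word boundaries.

```r
# Toy stand-in for the name column, including a missing value.
listing_names <- c("Beautiful loft in the city", "QUIET room near park",
                   "beautiful and Quiet studio", "Amazing view",
                   NA, "Sunny apartment")

# Number of listings whose name contains the word, ignoring case.
# grepl() returns FALSE for NA, so missing names are simply not counted.
count_listings_with <- function(word, names) {
  sum(grepl(word, names, ignore.case = TRUE))
}

words <- c("beautiful", "quiet", "amazing")
counts <- sapply(words, count_listings_with, names = listing_names)
```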

3. How many listings are there with last review in 2016? Give month by month counts for the year 2016.
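The year filter and month-by-month tally might look like the sketch below. It assumes the column is called last_review and holds dates as yyyy-mm-dd text; both are assumptions to verify against the real file, and the dates shown are invented.

```r
# Toy stand-in for the last_review column.
listings <- data.frame(
  last_review = c("2016-01-15", "2016-01-30", "2016-07-04", "2017-03-01",
                  "2015-12-31", NA, "2016-12-25"),
  stringsAsFactors = FALSE
)

# Parse the text as dates (ISO yyyy-mm-dd format assumed).
dates <- as.Date(listings$last_review)

# Listings whose last review fell in 2016; missing dates are excluded.
in_2016 <- !is.na(dates) & format(dates, "%Y") == "2016"
total_2016 <- sum(in_2016)

# Month-by-month counts; fixing the factor levels keeps months with zero reviews.
month_2016 <- factor(format(dates[in_2016], "%m"),
                     levels = sprintf("%02d", 1:12))
monthly <- table(month_2016)
```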

4. Create a new column of the table which gives, for each listing, the number of ids that correspond to its host_id. Your answer will match the calculated_host_listings_count column (only use this column to check your answer).
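One way to compute a per-host count in base R is with ave(), as sketched here on invented rows. The new column name host_listing_count is hypothetical, and the sketch assumes each id appears only once in the table.

```r
# Toy listings table with a reference column to check against.
listings <- data.frame(
  id = 1:6,
  host_id = c(10, 20, 10, 30, 10, 20),
  calculated_host_listings_count = c(3, 2, 3, 1, 3, 2)
)

# For every row, count how many ids share that row's host_id.
listings$host_listing_count <- ave(listings$id, listings$host_id, FUN = length)

# The reference column is used only to verify the result.
matches_reference <- all(listings$host_listing_count ==
                           listings$calculated_host_listings_count)
```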

5. Write a function that inputs a listing id and outputs a score that is the sum of points according to the following criteria:
a. Points for the neighbourhood: (average number of bedrooms per dwelling in 2016) × 50 (this comes from the data set in Part 1)
b. Points for the room type: 200 for Entire home/apt, 100 for Private room, 0 for Shared room
c. Points for minimum nights: 50 for 1 night, 25 for 2 nights, 0 for 3 or more nights
d. Points for availability: (availability_365) divided by 5
e. Points for review frequency: 50 × (reviews per month), but no more than 100
f. Points for price: (300 minus price)
Which id (ids if more than one) has the highest score according to the above system?
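The scoring rules could be combined into a function along the lines of the sketch below. Everything in the toy data is invented, and several things are assumptions rather than facts about the real file: the column names other than availability_365, a numeric price column, treating a missing reviews-per-month value as zero, and neighbourhood names that match the Part 1 region names exactly. The lookup av_br_2016 stands in for the 2016 av_per_dwelling values from Part 1.

```r
# Toy listings table (column names assumed, values invented).
listings <- data.frame(
  id = c(101, 102, 103),
  neighbourhood = c("Region A", "Region B", "Region A"),
  room_type = c("Entire home/apt", "Private room", "Shared room"),
  minimum_nights = c(1, 2, 5),
  availability_365 = c(365, 100, 0),
  reviews_per_month = c(3, 0.5, NA),
  price = c(250, 80, 40),
  stringsAsFactors = FALSE
)

# Hypothetical lookup: 2016 average bedrooms per dwelling by region.
av_br_2016 <- c("Region A" = 3.0, "Region B" = 2.5)

score_listing <- function(listing_id, listings, av_br) {
  row <- listings[listings$id == listing_id, ]
  stopifnot(nrow(row) == 1)  # the id must identify exactly one listing

  nb_pts <- av_br[[row$neighbourhood]] * 50                      # criterion a
  room_pts <- switch(row$room_type,                              # criterion b
                     "Entire home/apt" = 200,
                     "Private room" = 100,
                     "Shared room" = 0,
                     0)
  nights_pts <- if (row$minimum_nights == 1) 50 else             # criterion c
                if (row$minimum_nights == 2) 25 else 0
  avail_pts <- row$availability_365 / 5                          # criterion d
  rpm <- if (is.na(row$reviews_per_month)) 0 else row$reviews_per_month
  review_pts <- min(50 * rpm, 100)                               # criterion e, capped
  price_pts <- 300 - row$price                                   # criterion f

  nb_pts + room_pts + nights_pts + avail_pts + review_pts + price_pts
}

# Score every listing and keep all ids that tie for the highest score.
scores <- sapply(listings$id, score_listing,
                 listings = listings, av_br = av_br_2016)
best_ids <- listings$id[scores == max(scores)]
```

In the real file the price may be stored as text with a currency symbol, in which case it would need converting to a number before the subtraction.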

Part 3

Write a short report summarising the variables in the two (processed) datasets from parts 1 and 2 through tables and plots with R including the following:
• A histogram showing the distribution of a variable of interest
• A plot of one or more variables with time on the x axis (e.g. month, year or date)
• A word cloud of the words in the name column of the listings table.

• A map showing the price of listings by colour (e.g. a dot plot or heat map - you will need to use an R package that can map geospatial data)
Point out any interesting patterns (e.g. trends) you see from your plots and summaries.

Submission
Your submission to this assignment will consist of two files.

1) A single .R file with all the code used for this assignment (all parts), including comments that contain the answers for parts 1 and 2.
2) A document (pdf or docx) for part 3 (including code here is optional, only code in the R file will be assessed). Do not include the answers for parts 1 or 2 here.

Note: please keep the data files in the same directory as your scripts so that you do not specify directories in your code. This will make your R code easier to assess.

Attachment:- Data wrangling.rar

BUS5DWR - DATA WRANGLING AND R - La Trobe University

Reviews

len2315931

6/4/2019 1:45:10 AM

In part 3, 2 marks will be deducted if there is little or no explanation of what is being plotted, or if no patterns in the data are noted. In part 3, 1 or more marks will be deducted for each plot that lacks colour or is displayed inappropriately (e.g. missing axis labels, messy view). Overall 2 marks will be deducted if the R code is poorly commented with lack of explanation.

len2315931

6/4/2019 1:45:03 AM

Please note the following as it shows how marks may be deducted. For question 1 of part 1, up to 1 mark will be deducted for any incomplete explanation. Marks will be deducted if the R code does not work easily on the marker’s RStudio installation and you need to be offered an opportunity to show the marker that it does work on your installation. This means all references to directories have to be removed and packages being used are to be specified clearly. It will be assumed that the tidyverse, readxl and packages used for the word cloud as per the link above have been installed. For each coding question, 0.5 marks will be deducted for each minor mistake in the code and 1 mark for each major mistake, up to the full number allocated for that question.



All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd