Summarising and plotting data

Reference no: EM132324979

DATA WRANGLING AND R Assignment

The purpose of this assignment is to develop and assess your skills in R programming including wrangling, summarising and plotting data. Using the tidyverse package is recommended but not compulsory. Please read through the entire assignment.

Part 1 - The spreadsheet titled 'censusdata.xlsx' contains information about the number of bedrooms in occupied private dwellings for local government areas in Melbourne for the years 2011 and 2016. You will see that it is far from being ready for analysis and needs to be 'wrangled'. Additionally, a few errors have been deliberately introduced into the first two columns, so these will need to be found and corrected during initial analysis.

1. Explain why the data in its current form is not considered to be in 'tidy' format.

2. Write R code to read in the data (readxl package), manipulate it and output it to a single csv file having the following header row.

region,year,br_count_0,br_count_1,br_count_2,br_count_3,br_count_4_or_more,br_count_unstated,av_per_dwelling,av_per_household

Your code will have the following sections (not necessarily in the order given and the process may be iterative as you find more things to do). Please include comments in the code to separate each segment and explain your steps.

Read the data sets into two data frames, df2011 and df2016.

Compare the layout of each of the two data frames, then remove appropriate rows of one data frame to match the format of the other.

Write a function that takes in a table of the original form and outputs a table in the desired form with columns specified above.

  • Remove unwanted rows or columns.
  • Split values into multiple columns to make them atomic.
  • Appropriately transform the data into the desired form.
  • Rename columns.
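The steps above can be sketched on a toy table. This is only an illustration under stated assumptions: the raw column names (Region, None, One, Total, Averages), the combined "a / b" cell format and the Bayside numbers are invented, since the real layout of censusdata.xlsx is not shown here. The function name tidy_census is also hypothetical, and only two of the bedroom-count columns are included to keep it short.

```r
# Toy stand-in for one census sheet. Column names and the "a / b" cell format
# are assumptions for illustration, not the real censusdata.xlsx layout.
raw <- data.frame(
  Region   = c("Banyule", "Bayside", NA),      # last row mimics a blank footer row
  None     = c(78, 40, NA),
  One      = c(1287, 900, NA),
  Total    = c(1365, 940, NA),                 # unwanted column
  Averages = c("3.1 / 2.6", "3.3 / 2.5", NA),  # two values in one cell
  stringsAsFactors = FALSE
)

# Hypothetical helper: takes a table in the raw form and a year,
# returns a table with atomic, renamed columns.
tidy_census <- function(df, year) {
  # Remove unwanted rows (here: rows without a region name).
  df <- df[!is.na(df$Region), ]

  # Split the combined cell into two atomic values.
  parts  <- strsplit(df$Averages, "/", fixed = TRUE)
  first  <- vapply(parts, function(p) trimws(p[1]), character(1))
  second <- vapply(parts, function(p) trimws(p[2]), character(1))

  # Build the output with the target column names; Total is simply not copied.
  data.frame(
    region           = df$Region,
    year             = year,
    br_count_0       = df$None,
    br_count_1       = df$One,
    av_per_dwelling  = as.numeric(first),
    av_per_household = as.numeric(second),
    stringsAsFactors = FALSE
  )
}

tidy2011 <- tidy_census(raw, 2011)
print(tidy2011)
```

Passing the year as an argument lets the same function serve both sheets, which is what makes the later merge straightforward.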

Apply the function to each table to create two tables in the desired format.

Do a summary of each table to look for unusual values.

Correct those values until the two tables have the same dimensions and format.
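One possible way to hunt for unusual values is to combine summary() with a comparison of the two tables' dimensions and region names. The toy tables, the misspelt region and the negative count below are invented for illustration; the actual errors in the assignment data are not known here.

```r
# Toy tables with deliberately planted problems (all values are made up).
t2011 <- data.frame(
  region     = c("Banyule", "Bayside"),
  br_count_1 = c(1287, 900),
  stringsAsFactors = FALSE
)
t2016 <- data.frame(
  region     = c("Banyule", "Baysdie"),   # misspelt region name
  br_count_1 = c(1350, -950),             # impossible negative count
  stringsAsFactors = FALSE
)

# Numeric summaries make out-of-range values such as a negative minimum visible.
print(summary(t2016))

# Same shape does not guarantee the same regions, so compare the names too.
same_shape    <- identical(dim(t2011), dim(t2016))
only_in_2016  <- setdiff(t2016$region, t2011$region)
negative_rows <- t2016$region[t2016$br_count_1 < 0]
print(only_in_2016)
```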

Merge the two tables into a single table so that we see data in the form

Banyule,2011,78,1287,8457,21865,11366,645,3.1,2.6

Banyule,2016,...

Bayside,2011,...

Bayside,2016,...

...

Victoria,2011,...

Victoria,2016,...

Australia,2011,...

Australia,2016,...

(listed alphabetically by region, then by year, with Victoria and Australia at the end) (2 marks)
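The ordering described above (alphabetical by region, then by year, with Victoria and Australia last) can be sketched in base R as follows. The toy tables hold only the region and year columns, and the output file is a temporary file rather than any particular file name.

```r
# Toy tidy tables; a real table would also carry the count and average columns.
t2011 <- data.frame(region = c("Australia", "Banyule", "Victoria", "Bayside"),
                    year = 2011, stringsAsFactors = FALSE)
t2016 <- data.frame(region = c("Australia", "Banyule", "Victoria", "Bayside"),
                    year = 2016, stringsAsFactors = FALSE)

combined <- rbind(t2011, t2016)

# Sort key: 0 for ordinary regions, 1 for Victoria, 2 for Australia.
special   <- c("Victoria", "Australia")
tail_rank <- match(combined$region, special, nomatch = 0)

combined <- combined[order(tail_rank, combined$region, combined$year), ]
rownames(combined) <- NULL
print(combined)

# Write without row names so the file starts with the header row.
out_file <- tempfile(fileext = ".csv")
write.csv(combined, out_file, row.names = FALSE)
n_lines <- length(readLines(out_file))  # header + one line per data row
```

As a sanity check on the real output, 65 lines including the header means 64 data rows, that is 32 regions (Victoria and Australia included) for each of the two years.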

Write the result to a csv file (it should have 65 rows including the header).

3. Which region(s) (ignoring Victoria and Australia) had the largest increase in the number of occupied dwellings with 3 or more bedrooms between 2011 and 2016? (Ignore the unstated counts.)
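A minimal sketch of this calculation, assuming the merged table from question 2 and reading "largest increase" as the largest absolute increase (a percentage reading is equally possible and would need a division step). All 2016 values and all Bayside and Victoria values are made up.

```r
# Toy merged table with the two columns that matter here.
census <- data.frame(
  region             = c("Banyule", "Banyule", "Bayside", "Bayside", "Victoria", "Victoria"),
  year               = c(2011, 2016, 2011, 2016, 2011, 2016),
  br_count_3         = c(21865, 22000, 15000, 15400, 900000, 950000),
  br_count_4_or_more = c(11366, 12000, 12000, 12900, 500000, 560000),
  stringsAsFactors = FALSE
)

# Drop the state and national totals, then add up 3 and 4-or-more bedrooms.
lga <- census[!census$region %in% c("Victoria", "Australia"), ]
lga$three_plus <- lga$br_count_3 + lga$br_count_4_or_more

# Put the two years side by side, one row per region.
y2011 <- lga[lga$year == 2011, c("region", "three_plus")]
y2016 <- lga[lga$year == 2016, c("region", "three_plus")]
both  <- merge(y2011, y2016, by = "region", suffixes = c("_2011", "_2016"))
both$increase <- both$three_plus_2016 - both$three_plus_2011

# Keep every region that ties for the maximum.
winners <- both$region[both$increase == max(both$increase)]
print(both)
print(winners)
```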

Part 2 - The online hospitality company Airbnb has made publicly available a number of datasets. This part of the assignment makes use of the listings.csv dataset.

The dataset describes properties available for lodging in the Melbourne metropolitan area through a number of variables, and it can be visualised.

Write R code to answer the following.

1. Give the five neighbourhoods with the most listings (list them along with the counts in descending order).
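One way to get such a ranking in base R is to tabulate the column and sort the table. The toy listings table and its counts are invented; only the column name neighbourhood is taken from the assignment text.

```r
# Toy listings table: six neighbourhoods with made-up listing counts.
listings <- data.frame(
  neighbourhood = rep(
    c("Melbourne", "Port Phillip", "Yarra", "Stonnington", "Moreland", "Darebin"),
    times = c(6, 5, 4, 3, 2, 1)
  ),
  stringsAsFactors = FALSE
)

# Count listings per neighbourhood, sort descending, keep the first five.
counts <- sort(table(listings$neighbourhood), decreasing = TRUE)
top5   <- head(counts, 5)
print(top5)
```

With real data, ties at the fifth position would need a decision on whether to report more than five neighbourhoods.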

2. How many listings contain the following words (upper or lower case or mixed) in the name column?

a. Beautiful

b. Quiet

c. Amazing

d. <another adjective of your choice with at least 200 instances>
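A case-insensitive count can be sketched with grepl(). The helper name count_listings_with and the toy names are hypothetical. Whether "Quietly" should count as containing "Quiet" is not specified in the question, so the sketch exposes both readings through a whole_word argument.

```r
# Hypothetical helper: number of listings whose name contains the word.
# Each listing is counted once, however often the word occurs in its name.
count_listings_with <- function(listing_names, word, whole_word = FALSE) {
  pattern <- if (whole_word) paste0("\\b", word, "\\b") else word
  # grepl() returns FALSE for NA names, so missing names are never counted.
  sum(grepl(pattern, listing_names, ignore.case = TRUE, perl = TRUE))
}

toy_names <- c(
  "Beautiful loft in the CBD",
  "QUIET room near the park",
  "Amazing views, beautiful garden",
  "Quietly located studio",
  NA
)

n_beautiful   <- count_listings_with(toy_names, "beautiful")
n_quiet       <- count_listings_with(toy_names, "quiet")
n_quiet_whole <- count_listings_with(toy_names, "quiet", whole_word = TRUE)
n_amazing     <- count_listings_with(toy_names, "amazing")
print(c(beautiful = n_beautiful, quiet = n_quiet, amazing = n_amazing))
```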

3. How many listings are there with last review in 2016? Give month by month counts for the year 2016.
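A possible base R approach is to parse the dates, filter on the year, and tabulate the month. The sketch assumes a last_review column stored as text in yyyy-mm-dd form; that format and the toy dates are assumptions.

```r
# Toy column of last-review dates, including a missing value.
listings <- data.frame(
  last_review = c("2016-01-15", "2016-01-20", "2016-07-03",
                  "2017-02-11", NA, "2015-12-31"),
  stringsAsFactors = FALSE
)

review_date <- as.Date(listings$last_review)  # assumes ISO yyyy-mm-dd text
in_2016     <- !is.na(review_date) & format(review_date, "%Y") == "2016"
total_2016  <- sum(in_2016)

# Fix the factor levels so that months with no reviews still show a zero.
review_month <- factor(format(review_date[in_2016], "%m"),
                       levels = sprintf("%02d", 1:12))
monthly <- table(review_month)
print(total_2016)
print(monthly)
```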

4. Create a new column of the table which calculates the number of ids that correspond to the given host_id. Your answer will match the calculated_host_listings_count column (only use this column to check your answer).
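A group-wise count like this can be sketched with ave(), which applies a function within each group and returns a vector aligned with the original rows. The new column name host_listings and the toy values are made up, and the sketch assumes each id appears only once.

```r
# Toy listings table; calculated_host_listings_count is kept only for checking.
listings <- data.frame(
  id      = 1:6,
  host_id = c(10, 10, 20, 30, 10, 30),
  calculated_host_listings_count = c(3, 3, 1, 2, 3, 2)
)

# For every row, count how many ids share that row's host_id.
listings$host_listings <- ave(listings$id, listings$host_id, FUN = length)

# Use the provided column only to verify the result.
matches <- all(listings$host_listings == listings$calculated_host_listings_count)
print(listings)
```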

5. Write a function that inputs a listing id and outputs a score that is the sum of points according to the following criteria:

a. Points for the neighbourhood: (average number of bedrooms per dwelling in 2016) × 50 (this comes from the data set in Part 1)

b. Points for the room type: 200 for Entire home/apt, 100 for Private room, 0 for Shared room

c. Points for minimum nights: 50 for 1 night, 25 for 2 nights, 0 for 3 or more nights

d. Points for availability: (availability_365) divided by 5

e. Points for review frequency: 50 × (reviews per month), but no more than 100

f. Points for price: (300 minus price)

Which id (ids if more than one) has the highest score according to the above system?
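The scoring rules above might be sketched as below. Several things are assumptions rather than part of the brief: the function name score_listing, passing the tables as arguments, treating a missing reviews_per_month as zero, a numeric price column, and every value in the toy data (including the 2016 averages).

```r
# Toy inputs; all values are made up for illustration.
listings <- data.frame(
  id                = c(101, 102, 103),
  neighbourhood     = c("Banyule", "Bayside", "Banyule"),
  room_type         = c("Entire home/apt", "Private room", "Shared room"),
  minimum_nights    = c(1, 2, 5),
  availability_365  = c(365, 100, 0),
  reviews_per_month = c(3, 0.5, NA),
  price             = c(150, 80, 400),
  stringsAsFactors = FALSE
)
av_bedrooms_2016 <- c(Banyule = 3.1, Bayside = 3.3)  # would come from Part 1

score_listing <- function(listing_id, listings, av_bedrooms) {
  row <- listings[listings$id == listing_id, ]
  stopifnot(nrow(row) == 1)  # exactly one listing per id

  nb_pts     <- av_bedrooms[[row$neighbourhood]] * 50               # (a)
  room_pts   <- switch(row$room_type,                               # (b)
                       "Entire home/apt" = 200,
                       "Private room"    = 100,
                       0)
  nights_pts <- if (row$minimum_nights == 1) 50 else
                if (row$minimum_nights == 2) 25 else 0              # (c)
  avail_pts  <- row$availability_365 / 5                            # (d)
  rpm        <- if (is.na(row$reviews_per_month)) 0 else row$reviews_per_month
  review_pts <- min(50 * rpm, 100)                                  # (e), capped
  price_pts  <- 300 - row$price                                     # (f), may be negative

  nb_pts + room_pts + nights_pts + avail_pts + review_pts + price_pts
}

scores <- vapply(listings$id, score_listing, numeric(1),
                 listings = listings, av_bedrooms = av_bedrooms_2016)
best_ids <- listings$id[scores == max(scores)]  # keeps ties
print(data.frame(id = listings$id, score = scores))
```

On the full dataset a vectorised version over whole columns would be much faster than calling the function once per id, but the per-id form matches the wording of the question.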

Part 3 - Write a short report summarising the variables in the two (processed) datasets from parts 1 and 2 through tables [2 marks] and plots with R including the following:

  • A histogram showing the distribution of a variable of interest.
  • A plot of one or more variables with time on the x axis (e.g. month, year or date).
  • A word cloud of the words in the name column of the listings table. You may follow the instructions and use the packages referred to.
  • A map showing the price of listings by colour (e.g. a dot plot or heat map - you will need to use an R package that can map geospatial data).
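As preparation for the word cloud, the name column can be reduced to a word-frequency table in base R. This is only a sketch: the toy names and the small stop-word list are invented, and the plotting call in the final comment assumes the wordcloud package, which may not be the one the assignment refers to.

```r
# Toy listing names.
toy_names <- c(
  "Beautiful loft in the CBD",
  "Quiet room near the CBD",
  "Beautiful, quiet garden flat"
)

# Lower-case, split on anything that is not a letter, drop empty tokens.
words <- unlist(strsplit(tolower(toy_names), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Remove filler words (a hand-picked list for this toy example).
stop_words <- c("in", "the", "near")
words <- words[!words %in% stop_words]

freq <- sort(table(words), decreasing = TRUE)
print(freq)

# With the wordcloud package installed (an assumption), the table could be plotted as:
# wordcloud::wordcloud(names(freq), as.vector(freq), min.freq = 1)
```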

Point out any interesting patterns (e.g. trends) you see from your plots and summaries.


Verified Expert

This paper covers data wrangling, plotting and summarising in R. Two data sets are provided: a census data set, which contains bedroom counts for occupied private dwellings in Melbourne, and a listings data set, which contains Airbnb listings in Melbourne. The solution includes an R function that takes the two census data frames (2011 and 2016), processes them and merges them into a single data frame with the bedroom counts and the per-dwelling and per-household averages.

Reviews

len2324979

6/19/2019 4:23:25 AM

There will be a different set of data for everyone. Need to aggregate data carefully. Please note the following as it shows how marks may be deducted. For question 1 of part 1, up to 1 mark will be deducted for any incomplete explanation. Marks will be deducted if the R code does not work easily on the marker's RStudio installation and you need to be offered an opportunity to show the marker that it does work on your installation. This means all references to directories have to be removed and packages being used are to be specified clearly. It will be assumed that the tidyverse, readxl and packages used for the word cloud as per the link above have been installed.

len2324979

6/19/2019 4:23:16 AM

For each coding question, 0.5 marks will be deducted for each minor mistake in the code, 1 mark deducted for each major mistake up to the full number allocated for that question. In part 3, 2 marks will be deducted if there is little or no explanation of what is being plotted, or if no patterns in the data are noted. In part 3, 1 or more marks will be deducted for each plot that lacks colour or is displayed inappropriately (e.g. missing axis labels, messy view). Overall 2 marks will be deducted if the R code is poorly commented with lack of explanation.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd