Write python code to integrate several datasets

Reference no: EM132318947

For this assessment, you are required to write Python (Python 2/3) code to integrate several datasets into one single schema and find and fix possible problems in the data. Input and output of this assessment are shown below:

Table 1. The input and output of the task

Inputs: vic_suburb_boundary.zip, gtfs.zip, Crimebylocation.xlsx, <student_no>.csv

Output: <student_no>_solution.csv

Jupyter notebook: <student_no>_ass3.ipynb

You are given multiple datasets in various formats and the task is about creating housing information in Victoria, Australia. Your assessment is to perform the following tasks.

Task 1: Data Integration

In this task, you are required to integrate these datasets into one with the following schema.

Table 2. Description of the final schema

COLUMN: DESCRIPTION

ID: A unique id for the property

Address: The property address

Suburb (20/100): The property suburb. The suburb must only be calculated using vic_suburb_boundary.zip. Default value: "not available"

Price: The property price

Type: The type of property

Date: The date the property was sold

Rooms: Number of bedrooms

Bathroom: Number of bathrooms

Car: The number of parking spaces of the property

LandSize: The area of the property

Age: The age of the property at the time of selling

Latitude: The latitude of the property

Longitude: The longitude of the property

train_station_id (15/100): The closest train station to the property that has a direct trip to the Southern Cross Railway Station. A direct trip is a trip with no connections (transfers) between the origin and the destination. Default value: 0

distance_to_train_station (5/100): The direct distance from the property to the closest train station that has a direct trip to the Southern Cross Railway Station. Default value: 0

travel_min_to_CBD (20/100): The average travel time (minutes) from the closest train station (regional/metropolitan) that has a direct trip to the "Southern Cross Railway Station" on weekdays (i.e. Monday-Friday), departing between 7:00 and 9:30 am. For example, if there are 3 direct trips departing from the closest train station to the Southern Cross Railway Station on weekdays between 7:00 and 9:30 am and they take 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3 = 7. Default value: 0

over_priced? (10/100): A boolean feature indicating whether or not the price of the property is higher than the median price of similar properties (with respect to the bedrooms, bathrooms, parking_space, and property_type attributes) in the same suburb in the year of selling. Default value: -1

crime_A_average (7/100): The average of type A crime for the three years prior to selling in the same suburb as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type A for 2013, 2014 and 2015. Default value: -1

crime_B_average (7/100): The average of type B crime for the three years prior to selling in the same suburb as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type B for 2013, 2014 and 2015. Default value: -1

crime_C_average (6/100): The average of type C crime for the three years prior to selling in the same suburb as the property. For example, if a property is sold in 2016, then you should calculate the average of crime type C for 2013, 2014 and 2015. Default value: -1
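The three-year averaging rule used by the crime_A_average, crime_B_average and crime_C_average columns can be sketched as follows. This is only an illustration: the layout of Crimebylocation.xlsx is not described here, so the sketch assumes the counts for one crime type have already been aggregated into a dictionary keyed by (suburb, year). The names `crime_counts` and `crime_average` are hypothetical, and returning the default when any of the three years is missing is an assumption rather than a stated rule.

```python
def crime_average(crime_counts, suburb, sold_year, default=-1):
    """Average the yearly crime counts of the three years before sold_year.

    crime_counts is a hypothetical dict: (suburb, year) -> total count
    for a single crime type (A, B or C).
    """
    years = [sold_year - 3, sold_year - 2, sold_year - 1]
    try:
        totals = [crime_counts[(suburb, year)] for year in years]
    except KeyError:
        # Assumption: fall back to the default if a year has no data.
        return default
    return sum(totals) / 3


# Made-up counts, for illustration only.
crime_counts = {
    ("CLAYTON", 2013): 120,
    ("CLAYTON", 2014): 90,
    ("CLAYTON", 2015): 150,
}

# A property sold in 2016 uses 2013, 2014 and 2015.
print(crime_average(crime_counts, "CLAYTON", 2016))
```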

Task 2: Data Reshaping

In this task, you need to study the effect of different normalization/transformation methods (i.e. standardization, min-max normalization, log, power, and root transformation) on the Rooms, crime_C_average, travel_min_to_CBD, and property_age attributes. Observe and explain their effect, assuming that we want to build a linear model on price using these attributes as predictors, and recommend which method(s) you think would work better on this data. When building the linear model, the same normalization/transformation method can be applied to each of these attributes.
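The five methods named in this task can be sketched with the standard library alone. This is a minimal illustration on a made-up list of values, not a recommendation of any method; the function names are hypothetical, and the exponent used for the power transformation is an arbitrary choice.

```python
import math
import statistics


def standardize(values):
    """Z-score: subtract the mean, divide by the population std deviation."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # a constant column (sd == 0) would fail
    return [(v - mean) / sd for v in values]


def min_max(values):
    """Rescale to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Natural log; only defined for strictly positive values."""
    return [math.log(v) for v in values]


def power_transform(values, exponent=2):
    """Raise each value to a power (2 is an arbitrary example)."""
    return [v ** exponent for v in values]


def root_transform(values):
    """Square root; only defined for non-negative values."""
    return [math.sqrt(v) for v in values]


rooms = [1, 2, 3, 4, 5]  # made-up values
for transform in (standardize, min_max, log_transform,
                  power_transform, root_transform):
    print(transform.__name__, transform(rooms))
```

Note that rows still holding a default value such as 0 or -1 would break the log and root transformations, so they would need to be handled before transforming.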

Task 3: Documentation and Methodology

The main focus of the documentation is the quality of your explanation of how you completed these tasks. Your notebook file should be well formatted, with proper sections and subsections.

Note 1: the output csv file must have exactly the same columns as specified in the schema. If you decide not to calculate any of the required attributes, you must still include a column for that attribute in your final data-frame, with the default value in every row. Please note that an output file which is not in the correct format, as specified in the integrated schema, won't be marked.

Note 2: the radius of the earth is still 6378 km!
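One common way to compute a direct distance from an Earth radius is the haversine formula. The brief does not name a formula, so treating haversine as the intended method is an assumption; the function name `haversine_m` is hypothetical. The result is returned in metres, the unit requested in the tutor clarifications quoted in the reviews.

```python
import math

EARTH_RADIUS_KM = 6378  # value given in Note 2


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * 1000 * math.asin(math.sqrt(a))


# Distance between the two Southern Cross stop entries (20043 and 22180)
# listed in the tutor clarifications; roughly a hundred metres apart.
print(haversine_m(-37.818334, 144.952525, -37.817936, 144.951411))
```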

Note 3: In table 2, numbers in front of some of the rows in the format of (a/b) are the allocated mark associated with that attribute. For example, the "suburb" attribute carries 20% of the total mark of task 1. Please note that 10% of the total marks for task 1 is marked on any other issue that may occur during the data integration process.

Note 4: You can only use the vic_suburb_boundary.zip file to extract the suburb name of the property. Using other external datasets or packages (e.g., geopy) to directly get the suburb information will be penalized (this will result in 0 marks for the suburb attribute).
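Finding a suburb from boundary polygons comes down to a point-in-polygon test. The sketch below uses the ray-casting technique on toy rectangles. It assumes the suburb boundaries have already been read from the shapefile in vic_suburb_boundary.zip into plain lists of (longitude, latitude) vertices; the `suburbs` dictionary, its names and its coordinates are all made up. Real suburb polygons may have several parts or holes, which this sketch ignores.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how many polygon edges a ray going east crosses.

    ring is a list of (lon, lat) vertices; an odd number of crossings
    means the point is inside.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Only edges that straddle the point's latitude can be crossed.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def find_suburb(lon, lat, suburbs, default="not available"):
    """Return the first suburb whose polygon contains the point."""
    for name, ring in suburbs.items():
        if point_in_polygon(lon, lat, ring):
            return name
    return default


# Hypothetical rectangular "suburbs", for illustration only.
suburbs = {
    "SUBURB_A": [(144.0, -38.0), (144.1, -38.0), (144.1, -37.9), (144.0, -37.9)],
    "SUBURB_B": [(144.1, -38.0), (144.2, -38.0), (144.2, -37.9), (144.1, -37.9)],
}

print(find_suburb(144.05, -37.95, suburbs))
print(find_suburb(150.0, -37.95, suburbs))
```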

Attachment:- Data Wrangling.rar

Verified Expert

In this assignment, data integration, data reshaping, and data wrangling are carried out in Python using a Jupyter notebook, an environment that is smooth and clear to work with.

Reviews

inf2318947

7/30/2019 3:27:50 AM

Thanks for the work done. One last request before submitting the assignment: could you please generate a final Turnitin report of the file I will send you? I have made lots of changes, so I need to be on the safe side before submitting. Any help would be appreciated. Please find the attached file; it is only a few KB, so it won't take long. 33675675_1submission 2.ipynb Please try to create a Turnitin report of the .ipynb file itself. I sent an email to you with the file attached. No worries, I just need a report. I am happy with that.

inf2318947

7/30/2019 3:24:29 AM

crime_A_average, crime_B_average, crime_C_average, over_priced? Calculate travel_min_to_CBD for Southern Cross, not for Flinders Street. I want your expert to start coding the points below: 1) travel_min_to_CBD (remember, I don't want to see Flinders anywhere in the file; they asked for Southern Cross, so do it for the same) 2) crime_A_average 3) crime_B_average 4) crime_C_average 5) over_priced?

inf2318947

7/30/2019 3:23:10 AM

Find attached screenshots for your reference: 33675649_1mistake1.PNG 33675624_2mistake2.PNG 33675698_3mistake3.PNG I want her to write the whole code for calculating "Travel minute to CBD" with proper comments and an explanation of each chunk of code. Make sure she uses my attached files.

inf2318947

7/30/2019 3:22:21 AM

I want her to calculate "Travel Minute to CBD" again for Southern Cross. Also, use my files, which I am attaching here, and include comments for every chunk of code. I want it back in the next 2-3 hours. 33675640_1FIT5196-S1-2019 assessment 3.pdf 33675640_2gtfs.zip 3367568_329869579.csv

inf2318947

7/30/2019 3:21:37 AM

The average travel time (minutes) from the closest train station (regional/metropolitan) that has a direct trip to the “Southern Cross Railway Station” on weekdays (i.e. Monday-Friday) departing between 7 to 9:30 am. For example, if there are 3 direct trips departing from the closest train station to the Southern Cross Railway Station on weekdays between 7-9:30 am and each takes 6, 7, and 8 minutes respectively, then the value of this column for the property should be (6+7+8)/3.

inf2318947

7/30/2019 3:21:29 AM

Could you please send me a Turnitin report of the .ipynb file? Thanks. I need a Turnitin report of the 29869579_ass3.ipynb file. Please send it ASAP. Sure, I will send you some chunks of code in the next 30 mins. Thanks. Please find the attachment and send me a Turnitin report. Thanks, pfa.

inf2318947

7/30/2019 3:15:46 AM

Could you please send me a draft of the work done until now? I just want to make sure everything is going in a proper way or not. Could you please send me a draft of the work done ASAP? Send me the work done until now in a .ipynb file; I will go through it and check whether it is being done as per my requirements or not. Send it ASAP. Save the .ipynb file with the work done until now and share it here. The assignment should be 100% as per the requirements and the files I shared. Any anomaly in the file or failing in doing Attached is the error I am getting while reading it. 33675647_1error5.PNG

inf2318947

7/30/2019 3:11:59 AM

Note 4: Task 2 is an open-ended question. Please study different normalization/transformation methods on the variables (Rooms, crime_C_average, travel_min_to_CBD, and property_age attributes), summarize your observations, and recommend which methods/attributes are best for building a linear regression model on this particular dataset. You don't need to apply normalization to the target variable (Price).
Note 5: You are given the radius of the Earth to calculate the direct distance between two points.
Note 6: For the average time to CBD, the specifications say that we need to consider the trips to Southern Cross on weekdays, but does it have to be on all of the weekdays? ==> Yes. It has to be on all the weekdays. For example, if a trip runs only on Monday and Tuesday, do we need to consider it? ==> No.
Note 7: Please keep the same order of columns as the order in the description.

inf2318947

7/30/2019 3:11:50 AM

Share this with the tutor and tell him to stick on these requirements Please. Note 1: The output CSV file must have the exact same columns as specified on the schema. Please note that output file which is not in a correct format, as specified in the integrated schema, WON'T BE MARKED. Note 2: In table 2, numbers in front of some of the rows in the format of (a/b) are the allocated mark associated with that attribute. For example, the “suburb” attribute carries 20% of the total mark of task 1. Please note that 10% of the total marks for task 1 is marked on any other issue that may occur during the data integration process. Note 3: You can only use the vic_suburb_boundary.zip file to extract the suburb name of the property. Using other external datasets or packages (e.g., geopy) to directly get the suburb information will be penalized (this will result in 0 marks for the suburb attribute).

inf2318947

7/30/2019 3:11:19 AM

See examples below: If the nearest station you get is Caulfield Railway Station (Caulfield East), then you can set either 19943 or 22248 as train_station_id, and calculate the distance from the property to this station. For travel_min_to_CBD, the following three values are acceptable: 1) get all the direct trips from 19943 to SC, and calculate the average travel time. 2) get all the direct trips from 22248 to SC, and calculate the average travel time, or 3) get all the direct trips from either 19943 or 22248 to SC, and calculate the average travel time. I hope these clarifications address your concerns

inf2318947

7/30/2019 3:10:44 AM

19943 Caulfield Railway Station (Caulfield East) -37.877459 145.042525
22248 Caulfield Railway Station (Caulfield East) -37.877459 145.042525
19915 Clayton Railway Station (Clayton) -37.924683 145.120534
22249 Clayton Railway Station (Clayton) -37.924683 145.120534
22254 Broadmeadows Railway Station (Broadmeadows) -37.683049 144.919613
20030 Broadmeadows Railway Station (Broadmeadows) -37.683049 144.919613
A: For these cases, as long as there are direct trips from these stations to the SC station, you can randomly choose the station id.

inf2318947

7/30/2019 3:10:08 AM

Q10: If it is not overpriced, what is the value?
A: The value should be 0. Please note that -1 (the default value) means that you don't want to calculate this attribute.
Average of Crime Type
Q11: How to calculate the Crime Type A for 2017?
A: You need to sum up all crime A records for each of 2016, 2015, and 2014, and calculate the average. Suppose the total numbers of crime A records in the previous 3 years are a, b, and c. Then the average should be (a+b+c)/3.
Same station name, different id
Q12: I have found stations with the same station_name, latitude, and longitude but with different station_id values. How should I deal with this case? Examples are given as follows.

inf2318947

7/30/2019 3:09:52 AM

c) Q7: Station C is the nearest station to the property and there is a direct trip on weekends only. Is this the answer?
A: No. In any case, two conditions must be met: 1) there are direct trips to Southern Cross on ALL weekdays; 2) the trips depart between 7:00 and 9:30 in the morning.
Q8: What unit of measurement should we use (m, km, or something else)?
A: The distance should be in metres (m).
Overprice
Q9: If there is only one record (the property itself), is it overpriced?
A: No. If there is only one record, the property cannot be overpriced.
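The over_priced? rule, including the single-record case in Q9, can be sketched as a grouped median comparison. This is a minimal illustration under stated assumptions: each property is a dictionary with hypothetical keys (including a `Year` field already extracted from the sale date), and the property itself is counted in its own group when the median is computed, which the brief does not state explicitly. The function names are hypothetical.

```python
from statistics import median


def similarity_key(prop):
    """Group by suburb, year of sale, and the four 'similar property' attributes."""
    return (prop["Suburb"], prop["Year"], prop["Rooms"],
            prop["Bathroom"], prop["Car"], prop["Type"])


def over_priced_flags(properties):
    """Return 1 if a price is above its group's median, otherwise 0."""
    groups = {}
    for prop in properties:
        groups.setdefault(similarity_key(prop), []).append(prop["Price"])
    # A group of one has a median equal to its own price, so the flag is 0.
    return [1 if prop["Price"] > median(groups[similarity_key(prop)]) else 0
            for prop in properties]


# Made-up records, for illustration only.
base = {"Suburb": "CLAYTON", "Year": 2016, "Rooms": 3,
        "Bathroom": 1, "Car": 1, "Type": "h"}
properties = [
    dict(base, Price=500000),
    dict(base, Price=600000),
    dict(base, Price=900000),
    dict(base, Rooms=5, Price=2000000),  # the only record in its group
]

print(over_priced_flags(properties))
```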

inf2318947

7/30/2019 3:09:29 AM

A: It means a trip that has Southern Cross Railway Station as a stop.
Train station (train_station_id)
Q4: Which station should I find for the train_station_id attribute?
A: The train station you need to get is the one which has direct trips to either station 20043 or 22180 on weekdays (all weekdays), from 7 to 9:30. Please see the following cases for further clarification:
a) Q5: Station A is the nearest station to the property, but there is no direct trip during weekdays. Is this the answer?
A: No.
b) Q6: Station B is the nearest station to the property and has a direct trip to Southern Cross (SC) Railway Station on Monday/Tuesday only. Is this the answer?
A: No. The station should have direct trips to SC on every weekday.
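The weekday and time-window conditions can be sketched on a simplified, in-memory stand-in for the GTFS data. Each trip is a dictionary holding its service calendar and its ordered stops; this structure, the trip data, and the function names are hypothetical, and joining the real GTFS files into this shape is not shown. The sketch also makes one simplifying assumption: a trip only counts if its own service runs on all five weekdays.

```python
SOUTHERN_CROSS_IDS = {"20043", "22180"}
WEEKDAYS = ("monday", "tuesday", "wednesday", "thursday", "friday")
WINDOW_START, WINDOW_END = 7 * 60, 9 * 60 + 30  # 7:00 to 9:30, in minutes


def to_minutes(hhmmss):
    """Convert an 'HH:MM:SS' time string to minutes after midnight."""
    hours, minutes, seconds = (int(part) for part in hhmmss.split(":"))
    return hours * 60 + minutes + seconds / 60


def average_travel_minutes(origin_id, trips, default=0):
    """Average duration of qualifying direct trips from origin_id to Southern Cross."""
    durations = []
    for trip in trips:
        if not all(trip["calendar"].get(day) == 1 for day in WEEKDAYS):
            continue  # service does not run on every weekday
        depart = None
        for stop_id, arrival, departure in trip["stops"]:
            if stop_id == origin_id and depart is None:
                depart = to_minutes(departure)
            elif stop_id in SOUTHERN_CROSS_IDS and depart is not None:
                # Southern Cross is reached after the origin on the same trip.
                if WINDOW_START <= depart <= WINDOW_END:
                    durations.append(to_minutes(arrival) - depart)
                break
    return sum(durations) / len(durations) if durations else default


def make_trip(depart, arrive, runs_weekdays=True):
    """Build a made-up two-stop trip from stop 19943 to stop 20043."""
    calendar = {day: 1 if runs_weekdays else 0 for day in WEEKDAYS}
    return {"calendar": calendar,
            "stops": [("19943", depart, depart), ("20043", arrive, arrive)]}


trips = [
    make_trip("07:10:00", "07:16:00"),                       # 6 minutes
    make_trip("08:00:00", "08:07:00"),                       # 7 minutes
    make_trip("09:20:00", "09:28:00"),                       # 8 minutes
    make_trip("10:00:00", "10:06:00"),                       # outside the window
    make_trip("07:30:00", "07:36:00", runs_weekdays=False),  # not a weekday service
]

print(average_travel_minutes("19943", trips))
```

With the three qualifying trips taking 6, 7 and 8 minutes, the function reproduces the (6+7+8)/3 example from the schema description.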

inf2318947

7/30/2019 3:09:05 AM

Southern Cross (SC) Railway Station
Q2: Which Southern Cross Railway Station(s) should be used?
A: Please only use the following stations for calculation:
20043 Southern Cross Railway Station (Melbourne City) -37.818334 144.952525
22180 Southern Cross Railway Station (Melbourne City) -37.817936 144.951411
Q3: When you say "a direct trip to the Southern Cross Railway Station", do you mean a trip that terminates at Southern Cross Railway Station or a trip that has Southern Cross Railway Station as a stop?

inf2318947

7/30/2019 3:08:36 AM

Packages
Q1: What packages could I use?
A: You can use many packages. But for the suburb attribute extraction, you cannot use packages (such as geopy) which directly return the suburb name given a lat/long. If you use such packages, you will not get the mark for this attribute. You can use packages to handle shapefiles and retrieve the suburb information from the shapefile provided.

inf2318947

7/30/2019 3:08:16 AM

Could you please share the updates below, which our tutor shared with us, and tell the expert to please stick to these points. As you may know, data wrangling is a time-consuming task. We may come across different problems when we pre-process the data. The datasets for Assessment 3 were collected from different sources. If you have questions, please read the following clarification before you raise new questions in the forum.

inf2318947

7/30/2019 3:07:02 AM

I request you guys please help me in getting HD as this is the only hope to pass this unit.I am totally relied on you guys please help me in getting HD. I promise that I will refer each of my friend about your services and will give all my assignments to you guys. Please send these notes to tutor and tell him to read it carefully, Most importantly "Using other external datasets or packages (e.g., geopy) to directly get the suburb information will be penalized (this will result in 0 marks for the suburb attribute)."

inf2318947

7/30/2019 3:05:37 AM

I paid whole amount in one shot. Please start assignment ASAP and try to deliver it before time so I can ask for revision if required. Also, i am attaching Marking rubric for this assignment so make sure the assignment should be of HD (HIGH distinction) level. 33675651_1112364 1Rubric-A3-FIT5196.pdf

inf2318947

7/30/2019 3:05:13 AM

Payment is done. Please start assignment ASAP. Please note that I want HD grade in this assignment so make sure it should be 100% as per the requirements and upto mark with high quality of work. This is my last assignment and I need to achieve 100% in this assignment to pass this unit so I request you to put all your efforts to make this assignment above expectations.

inf2318947

7/30/2019 3:04:22 AM

I am doing half payment right now. Once it will get done will do remaining payment. Will that be alright? Please find attached Zip file for the Data Wrangling assignment. It includes a specification file which gives you information on how to perform the assignment and there is one submission details file which will let you know extra details about submission and assignment. Rest of the files are related to the assignment. Let me know if you have any questions. I am expecting this assignment as per requirements and with high standard as this is my last assignment and I want to score really well in this assignment. Please try to make it up to mark and as per the requirements. 33675670_1Data Wrangling as3.zip

len2318947

6/9/2019 10:29:29 PM

Please find attached file for the Data Wrangling assignment. It is a specification file which gives you information on how to perform the assignment and there is one submission details file which will let you know extra details about submission and assignment. Let me know if you have any questions. I am expecting this assignment as per requirements and with high standard as this is my last assignment and I want to score really well in this assignment. Please try to make it up to mark and as per the requirements.
