Analyze the vancouvers housing price

Assignment Help Basic Statistics
Reference no: EM131850933

Assignment: EDA and Bootstrapping

Objective

Statistics play a vital role in data science for (at least) two reasons. First, it can be used to gain a deep understanding of data. This process is known as Exploratory Data Analysis (EDA).

Second, it can be used to infer the relationship between a sample and the population. This process is known as inference. In this assignment, you will learn about EDA and statistical inference through the analysis of a very interesting dataset - property tax report data. Specifically, you will learn the followings:

1. Be able to perform EDA on a single column (i.e., univariate analysis)
2. Be able to perform EDA on multiple columns (i.e., multivariate analysis)
3. Be able to extract insights from visualizations
4. Be able to ask critical questions about data
5. Be able to estimate a population parameter based on a sample
6. Be able to use the bootstrap to quantify the uncertainty of an estimated value

In this assignment, you can use pandas or PySpark to manipulate data, and use matplotlib or seaborn to make plots.

Part 1. EDA

Imagine you are a data scientist working at a real-estate company. In this week, your job is to analyze the Vancouver's housing price. You first download a dataset from property_tax_report_2018.zip (property_tax_report_2018.zip). The dataset contains information on properties from BC Assessment (BCA) and City sources in 2018. You can find the schema information of the dataset from attached file. But this is not enough. You still know little about the data. That's why you need to do EDA in order to get a better and deeper understanding of the data.

We first load the data as a DataFrame. To make this analysis more interesting, I added two new columns to the data: CURRENT_PRICE represents the property price in 2018; PREVIOUS_PRICE represents the property price in 2017.

In [ ]:import pandas as pd

df = pd.read_csv("property_tax_report_2018.csv")

df['CURRENT_PRICE'] = df.apply(lambda x: x['CURRENT_LAND_VALUE ']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)

df['PREVIOUS_PRICE'] = df.apply(lambda x: x['PREVIOUS_LAND_VAL UE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)

Now let's start the EDA process.

Question 1. Look at some example rows

Print the first five rows of the data:

Question 2. Get summary statistics

From the above output, you will know that the data has 28 columns. Please use the describe() function to get the summary statistics of each column.

Please look at the above output carefully, and make sure that you understand the meanings of each row (e.g., std, 25% percentile).

Question 3. Examine missing values

Now we are going to perform EDA on a single column (i.e., univariate analysis). We chose YEAR_BUILT, which represents in which year a property was built. We first check whether the column has any missing value.

Missing values are very common in real-world datasets. In practice, you should always be aware of the impact of the missing values on your downstream analysis results.

Question 4. Plot a line chart

We now start investigating the values in the YEAR_BUILT column. Suppose we want to know: "How many properties were built in each year (from 1900 to 2018)?" Please plot a line chart to answer the question.

Please write down the two most interesting findings that you draw from the plot. For example, you can say: "Vancouver has about 6300 properties built in 1996 alone, which is more than any other year". For each finding, please write no more than 2 sentences.
Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 5. Plot a bar chart

Next, we want to find that, between 1900 and 2018, which years have the most number of properties been built? Plot a bar chart to show the top 20 years.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 6. Plot a histogram

What's the distribution of the number of properties built between 1900 and 2018? Please plot a histogram to answer this question.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 7. Make a scatter plot

Suppose we are interested in those years which built more than 2000 properties. Make a scatter plot to examine whether there is a relationship between the number of built properties and the year?

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 8. PDF and CDF

Can you believe that you have already drawn 8 interesting findings by exploring a single column! This is the power of EDA combined with critical thinking. Now we are moving to multivariate analysis.
Suppose you want to compare the housing price between this year and last year, i.e., CURRENT_PRICE vs. PREVIOUS_PRICE. You can plot their distributions, and make the comparison. There are two ways to define a distribution: Probabilistic Distribution Function and Cumulative Distribution Function (CDF).

In the following, please make two plots and put them side-by-side.

In the first plot, use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE.
In the second plot, use histograms to plot the cumulative distributions of CURRENT_PRICE and PREVIOUS_PRICE.
There are a few properties which are way more expensive than the others. For both plots, please exclude those properties by setting xlim = (0, 5Million).

In [ ]:

Please write down the two most interesting findings that you draw from the plots.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 9. Use EDA to answer an interesting question (1)

In the above plots, we found that the overall housing price has increased, but we do not which type of property has increased more.
Now we add another variable LEGAL_TYPE (e.g., STRATA, LAND) to the analysis, and consider three variables (LEGAL_TYPE, CURRENT_PRICE, PREVIOUS_PRICE) in total.

In the following, please make two plots and put them side-by-side.

In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "STRATA".

In the first plot, please use histograms to plot the probabilistic distributions of CURRENT_PRICE and PREVIOUS_PRICE for LEGAL_TYPE = "LAND".

In [ ]:

Please write down the two most interesting findings that you draw from the plots.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 10. Use EDA to answer interesting questions

Although the housing price of the entire Vancouver area is increasing, there might be some areas whose housing price is decreasing. To answer this question, we need to consider another column
-- PROPERTY_POSTAL_CODE.

PROPERTY_POSTAL_CODE (e.g., "V5A 1S6") is a six-character string with a space separating the third and fourth characters. We use the first three characters to represent an area.

We first filter out the areas which have less than 10 properties. For each of the remaining areas, we calculate the percentage of the properties whose price has decreased compared to the last year. For example, if an area "V5A" has 50 properties, and 30 of them have decreased, then the percentage is 60%.

Please write code to find the top-10 areas with the highest percentages. Create a bar chart to visualize them.

In [ ]:

Please write down the two most interesting findings that you draw from the plot.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Question 11. Come up with your own question.

You need to complete the following three tasks.

Firstly, please come up with an interesting question on your own (like Q9 and Q10).

A short description of the question: [ADD TEXT]

Secondly, please write code so that the output of your code can answer the question.

In [ ]:

Thirdly, please write the two most important findings.

Findings

1. [ADD TEXT]
2. [ADD TEXT]

Part 2. Bootstrapping

In Part 1, we run our analysis over the full dataset. In reality, however, you may not be that lucky. It is more often than not that you can only collect a sample of the data. Whenever you derive a conclusion from a sample (e.g., The Vancouver's housing price has increased by 10\% since last year), you should ALWAYS ask yourself: "CAN I TRUST IT?". In other words, you want to know that if the same analysis was conducted on the full data, would the same conclusion be derived? In Part 2, you will learn how to use bootstrapping to answer this question.

Please download the sample dataset property_tax_report_2018_sample.zip (property_tax_report_2018_sample.zip), and load it as a DataFrame.

In [ ]:df_sample = pd.read_csv("property_tax_report_sample.csv")

df_sample['CURRENT_PRICE'] = df_sample.apply(lambda x: x['CURR ENT_LAND_VALUE']+x['CURRENT_IMPROVEMENT_VALUE'], axis = 1)

df_sample['PREVIOUS_PRICE'] = df_sample.apply(lambda x: x['PRE VIOUS_LAND_VALUE']+x['PREVIOUS_IMPROVEMENT_VALUE'], axis = 1)

df_sample = df_sample[df_sample['LEGAL_TYPE'] == 'STRATA']

Task 1. Analysis Result Without Bootstrapping

Please compute the median of PREVIOUS_PRICE and CURRENT_PRICE, respectively, and compare them in a bar chart.

In [ ]:

Task 2. Analysis Result With Bootstrapping

From the above chart, we find that the median of PREVIOUS_PRICE is about 0.6 M, and the median of CURRENT_PRICE is about 0.7 M. Since the numbers were obtained from the sample, "CAN WE TRUST THESE NUMBERS?"

In the following, please implement the bootstrap by yourself, and compute a 95%-confidence interval for each number. You can find the description of the algorithm in Section 7.

Attachment:- Assignment.rar

Reference no: EM131850933

Questions Cloud

Review the accounts payable process flowchart : Review the Accounts Payable Process flowchart given to you by the owner of ABC Co. Based on the AICPA Code of Professional Conduct, prepare a 1-2 page summary.
Forgive student loan debt : Every few weeks now a petition pops up in Facebook newsfeeds urging the government to forgive all student debt. The comment from the person
How would you advise to approach development of a position : Tracy, one of your friends, is in the early stages of conducting a feasibility analysis for a small business consulting company.
Detailed work in managerial economics : ABC Company has been producing its own widgets that are used in manufacturing its final product. The cost of manufacturing 10,000 widgets is summarized below.
Analyze the vancouvers housing price : Imagine you are a data scientist working at a real-estate company. In this week, your job is to analyze the Vancouver's housing price.
What are the qualifications required to be president : What are the qualifications required to be president? Describe at least one other qualification you feel should also be required and explain why.
What is the value price elasticity of? demand : What is the value price elasticity of? demand, expressed as a positive? number?
How many disbursements were not authorized : You should have the IDEA software already loaded on your computer from Auditing 1. Complete the assigned problem(s)for the week and upload your results.
Discuss how diversity might influence such abuse : Explain the ethical implications of cultural abuse of leader/manager power. Briefly discuss how diversity might influence (hinder or aid) such abuse.

Reviews

Write a Review

Basic Statistics Questions & Answers

  Identify response variable and populations to be compared

Describing the ANOVA model, continued. For each of the following situations, identify the response variable and the populations to be compared, and give I, the ni, and N.

  What proportion of sample means will have values in range

if random sample size is n=25 are selected from this population, what proportion of sample means will have values between 37 and 43.

  What type of error have commited if null hypothesis is false

If you were to accept the null hypothesis, even though the null hypothesis was false, what type of error would you have commited?

  Index of respiratory muscle strength

Maximal static inspiratory pressure (PImax) is an index of respiratory muscle strength. The following data show the measure of PImax (cm H2O) for 15 cystic ?brosis patients.

  Estimate the number of batteries with lifetimes

The mean lifetime was 3.2 hours with a standard deviation of 0.1 hours. The lifetimes are approximately bell-shaped. Estimate the number of batteries with lifetimes between 3.0 hours and 3.4 hours.

  The resulting life data from the batteries had a mean of

a popular flashlight that uses d size batteries was selected and severa of the same models were purchased to test the

  Compute the exponential smoothing forecast

Construct a time series plot. What type of pattern exists in the data and show the four-quarter and centered moving average values for this time series.

  Find the value of the constant in terms of k

For a constant k, two discrete random variables have a joint PMF- Find the value of the constant in terms of K.- Find the marginal PMFs.

  Noise level independent of flight phase

There are three flight phases. At α = 0.05, is the cockpit noise level independent of flight phase?

  Suppose that when certain geological conditions exist there

suppose that when certain geological conditions exist there is a 20 chance of striking oil. a drilling company finds

  Determining building effort-design work and testing effort

Data on distribution of cost for building effort, design work and testing effort.

  In the state of florida non-personalized automobile license

in the state of florida non-personalized automobile license plates use 7 characters. the first 4 charactersare letters

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd