Write a full report on your analysis of your chosen dataset

Reference no: EM133566942

Exploratory Data Analysis and Visualisation

Task description:

This project aims to bring together all the skills you have learnt over the semester and give you an opportunity to apply them to a dataset. You may choose the dataset yourself; it can come from finance, insurance, medicine, health or a similar area. We provide the Ames Housing dataset (an American housing dataset) if you would prefer it. A brief description of the variables in the dataset is at the end of this file.

In this project, you will need to write a full report on your analysis of your chosen dataset, from the beginning of the data science methodology, where you establish your problems of interest/exploration, to the end of the Further Preprocessing stage. You will then train a simple linear model on the training dataset and predict values for the test dataset. Finally, you will evaluate your model against the test dataset using the RMSE metric, plot the residuals (similar to the plot shown in Week 10), and draw your final conclusions.

Your report should obtain insights and aim to make an impact, and its structure should follow the data science methodology. The main purpose of this project is to conduct your analysis using EDA and visualisation.

For example, if you choose the housing dataset, put yourself in the shoes of a real estate analyst who wants to obtain insights from this dataset in order to predict house prices. The dataset already has a lot of reports written on it - find them here. Be inspired by them for EDA, but do not focus too much on their modelling.

Download the datasets labelled train and test from Canvas. As you are a real estate analyst, your target variable is SalePrice. Note for most of the report, you will only use the train dataset. This includes preprocessing, EDA, and everything else up to and including the creation of a linear model.

The linear model will then be trained on the train dataset. You will then predict a set of SalePrice values based on the variable information in the test dataset. You can then compare your predicted values to the 'real' values in the test dataset. Therefore, the test dataset is only needed for the "Evaluation" section of the report.

Report structure

Your report needs to include the following sections. In each section you will need to give a very brief explanation of what the section is about, what its purpose is and/or describe the key pieces of information in your general approach. For example, in the "Data preprocessing" section, you would explain what exactly data preprocessing is, why you need to clean the data, and describe the key ideas in your approach, e.g. fill in missing values with the median based on external controls.

0. Title and abstract:
On the first page, you should have:
- A suitable title for your report
- Your student ID
- An abstract/executive summary outlining your problems, analysis and findings.

1. Problem identification:
You should conduct some background research into the Ames Housing dataset and:
- Give some information on the dataset.
- Gather and list points of domain expertise to help you make better decisions and shape your report (e.g. creating a variable similar to Week 7/9's SeasonSold requires knowing which months correspond to which seasons, as the dataset is American).
- Seek to understand the variables here.
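As an illustration of the SeasonSold example above, the following is a minimal sketch that maps MoSold to a season. It assumes Northern Hemisphere meteorological seasons (December to February is winter, and so on); the function name `season_sold` and the sample months are hypothetical, and the definition used in the Week 7/9 material may differ.

```python
# Sketch: derive a SeasonSold-style label from MoSold (1-12).
# Assumption: Northern Hemisphere meteorological seasons, because Ames is in the USA.

def season_sold(month: int) -> str:
    """Return the season name for a month number between 1 and 12."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be in 1..12, got {month}")
    if month in (12, 1, 2):
        return "Winter"
    if month in (3, 4, 5):
        return "Spring"
    if month in (6, 7, 8):
        return "Summer"
    return "Autumn"


# Hypothetical MoSold values, for illustration only.
mo_sold = [2, 5, 7, 11]
print([season_sold(m) for m in mo_sold])
# prints ['Winter', 'Spring', 'Summer', 'Autumn']
```

An analyst working from Australia would get the opposite mapping by default, which is the kind of domain knowledge this section asks you to record.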

Problem identification and understanding is crucial in any data science project. You should:
- Think about (after gaining domain expertise) a few questions of interest, which you will then translate into data science problems to solve within your report (if you get stuck look at a few examples from the Melbourne dataset slides with problems of interest).
- Provide a list of these data science problems. You will need to address and interpret your corresponding findings later on in the body of your report.

Note that examples of problems for you to find and solve can be:
- Identify which suburb/location had the biggest growth in SalePrice by plotting and examining the sale prices across different suburbs;
- Analyse a possible pattern of SalePrice vs YrSold/MoSold, LotArea and/or some other variables which can reasonably be included;
- Use predictions from your final model to compare suburbs which have shown varying growth. Or, to identify which suburbs have been growing the most over the last few years.
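One way to approach the first example problem is to compare the median SalePrice per Neighborhood between two sale years. The sketch below uses only the Python standard library; the helper names `median_price_by` and `relative_growth` are hypothetical and the rows are made-up toy values, not figures from the Ames data.

```python
from collections import defaultdict
from statistics import median


def median_price_by(rows, keys, target="SalePrice"):
    """Group rows by the given key columns and return the median target per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[target])
    return {group: median(prices) for group, prices in groups.items()}


def relative_growth(medians, neighborhood, first_year, last_year):
    """Relative change in median price between two years for one neighbourhood."""
    start = medians[(neighborhood, first_year)]
    end = medians[(neighborhood, last_year)]
    return (end - start) / start


# Toy rows for illustration only (not real Ames values).
sales = [
    {"Neighborhood": "A", "YrSold": 2006, "SalePrice": 100000},
    {"Neighborhood": "A", "YrSold": 2006, "SalePrice": 120000},
    {"Neighborhood": "A", "YrSold": 2010, "SalePrice": 132000},
    {"Neighborhood": "B", "YrSold": 2006, "SalePrice": 200000},
    {"Neighborhood": "B", "YrSold": 2010, "SalePrice": 210000},
]

medians = median_price_by(sales, ["Neighborhood", "YrSold"])
for name in ("A", "B"):
    print(name, round(relative_growth(medians, name, 2006, 2010), 3))
```

The median is used here rather than the mean because a few very expensive sales can distort a small group; that is a design choice, not a requirement of the brief.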

UG students (unit 11374): Generate and address at least five problems.

G students (unit 11517): Generate and address at least seven problems, including the last problem listed above, which uses predictions from your final model. For example, find a way to compare the predictions (perhaps their median) between suburbs (possibly the top 5) that have shown varying growth in your time series plots of growth over time.

2. Data preprocessing:
In this section you should:
- Preprocess your data, treat missing values etc.
- Note at least one key observation. For example, you may identify possible missing values or outliers for a particular area/suburb or year (e.g. prices in 2016 are significantly higher), or find that one column is missing more than 50% of its values.
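The two checks mentioned above (how much of a column is missing, and filling gaps with a median) can be sketched as follows. This is a minimal standard-library illustration with hypothetical helper names and toy rows; it assumes missing values are represented as `None`.

```python
from statistics import median


def missing_fraction(rows, column):
    """Fraction of rows in which the column is missing (None or absent)."""
    missing = sum(1 for row in rows if row.get(column) is None)
    return missing / len(rows)


def fill_with_median(rows, column):
    """Return new rows with missing values in a numeric column replaced by its median."""
    observed = [row[column] for row in rows if row.get(column) is not None]
    fill_value = median(observed)
    return [
        {**row, column: fill_value if row.get(column) is None else row[column]}
        for row in rows
    ]


# Toy rows for illustration only (not real Ames values).
houses = [
    {"LotFrontage": 60.0, "PoolQC": None},
    {"LotFrontage": None, "PoolQC": None},
    {"LotFrontage": 80.0, "PoolQC": "Gd"},
    {"LotFrontage": 70.0, "PoolQC": None},
]

print(missing_fraction(houses, "PoolQC"))       # share of rows without a PoolQC value
filled = fill_with_median(houses, "LotFrontage")
print([row["LotFrontage"] for row in filled])
```

Before filling or dropping a column, check data_description.txt: for some categorical variables a missing value may mean that the feature is absent (for example, no pool) rather than that the value is unknown.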

3. EDA:
In this section you should:
- Include tasks such as determining which variables are significant, which observations may be outliers etc., and other EDA goals.
- Find as much insight as possible to support your modelling decisions later on.
- Use data visualisation techniques taught in the unit to answer your chosen problems of interest.
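A simple numeric companion to a scatter plot is the Pearson correlation between a candidate predictor and SalePrice. The sketch below implements it by hand with the standard library; the function name `pearson` and the toy values are assumptions for illustration, and a weak linear correlation does not by itself prove a variable is useless.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy values for illustration only (not real Ames values).
gr_liv_area = [1000, 1500, 2000, 2500]
sale_price = [100000, 150000, 200000, 250000]
unrelated = [1, 2, 2, 1]

print(round(pearson(gr_liv_area, sale_price), 3))  # strongly related in this toy data
print(round(pearson(unrelated, sale_price), 3))    # no linear relationship in this toy data
```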

4. Further preprocessing:
In this section you should:
- Select the final variables for your model based on your EDA (essentially, remove the non-significant variables).
- Create any new variables which you think may help based on your EDA in this section.
- Justify your decisions and provide EDA evidence as to how a variable is insignificant (e.g. no observable relationship to target variable in scatter plot).

5. Modelling:
In this section you should:
- Fit and evaluate a linear model to describe the relationship between your target variable and a number of selected significant predictors.
- Use your model to predict the prices of properties described by your test dataset.

Alternatively, you may use another, more advanced model of your choice. If you do use a linear model, remember what it favours, such as an approximately normally distributed target variable.
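A common way to make a right-skewed target such as a sale price closer to normal is to model its logarithm and convert predictions back with the exponential. The sketch below fits a one-predictor least-squares line by hand; the helper names, the choice of a single predictor and the toy numbers are all assumptions for illustration, and a real report would typically use several predictors and a library implementation.

```python
from math import exp, log


def fit_simple_ols(xs, ys):
    """Least-squares slope and intercept for a single predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("predictor is constant; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(x, slope, intercept):
    """Predict on the log scale, then transform back to dollars."""
    return exp(intercept + slope * x)


# Toy training values for illustration only (not real Ames values).
train_area = [1000.0, 1500.0, 2000.0, 2500.0]
train_price = [110000.0, 150000.0, 205000.0, 280000.0]

# Fit on log(SalePrice) rather than SalePrice itself.
log_price = [log(p) for p in train_price]
slope, intercept = fit_simple_ols(train_area, log_price)

# Hypothetical test-set living area.
print(round(predict_price(1800.0, slope, intercept)))
```

If you transform the target, state clearly in the report whether your RMSE is computed on the log scale or on the original dollar scale, since the two are not comparable.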

6. Evaluation:
You should:
- Evaluate your model using the RMSE metric, given the actual values in the test dataset.
- Plot the residuals similar to the plot shown in the Week 10 slides. Pick a suitable cut-off value for the red dots.

The data science methodology is an iterative process. Try to minimise your RMSE, so always go back and think about what improvements can be made, then fit another model, and find your second RMSE, and so on, noting what works and what does not. Compare at least two different models you considered, noting their differences.
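The RMSE is the square root of the mean squared difference between actual and predicted values. The sketch below computes it and flags residuals whose absolute value exceeds a cut-off, which is presumably what the red dots in the Week 10 plot mark (an assumption; check the slides). The function names, the cut-off and the numbers are hypothetical, and the plotting itself is left out.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


def flag_large_residuals(actual, predicted, cutoff):
    """Return (residual, is_large) pairs, where is_large means |residual| > cutoff."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    return [(r, abs(r) > cutoff) for r in residuals]


# Toy values for illustration only (not real Ames values).
actual_price = [200000, 150000, 320000]
model_a_pred = [195000, 160000, 250000]
model_b_pred = [198000, 155000, 300000]

# Compare two candidate models on the same test rows.
print("model A RMSE:", round(rmse(actual_price, model_a_pred)))
print("model B RMSE:", round(rmse(actual_price, model_b_pred)))

# Residuals beyond the (hypothetical) cut-off would be drawn in red.
for residual, is_large in flag_large_residuals(actual_price, model_a_pred, cutoff=50000):
    print(residual, "large" if is_large else "ok")
```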

7. Recommendations and final conclusions:
You should:
- Summarise your findings and present the solutions you found to your problems of interest. Note anything you found particularly interesting or useful to your project.
- State the best RMSE you obtained and why/how (i.e. what variables you used, any applied transformations etc.).
- State any improvements you could make and why/how you could achieve such improvements in future works.

8. References:
You should:
- Include a reference list and cite your references via in-text referencing or footnotes.

Variables in the Ames Housing dataset:

Below, please find a brief description of the variables within the dataset. For more detail, look inside the data_description.txt file.

• SalePrice: The property's sale price in dollars. This is the target variable that you're trying to predict.
• MSSubClass: The building class
• MSZoning: The general zoning classification
• LotFrontage: Linear feet of street connected to property
• LotArea: Lot size in square feet
• Street: Type of road access
• Alley: Type of alley access
• LotShape: General shape of property
• LandContour: Flatness of the property
• Utilities: Type of utilities available
• LotConfig: Lot configuration
• LandSlope: Slope of property
• Neighborhood: Physical locations within Ames city limits
• Condition1: Proximity to main road or railroad
• Condition2: Proximity to main road or railroad (if a second is present)
• BldgType: Type of dwelling
• HouseStyle: Style of dwelling
• OverallQual: Overall material and finish quality
• OverallCond: Overall condition rating
• YearBuilt: Original construction date
• YearRemodAdd: Remodel date
• RoofStyle: Type of roof
• RoofMatl: Roof material
• Exterior1st: Exterior covering on house
• Exterior2nd: Exterior covering on house (if more than one material)
• MasVnrType: Masonry veneer type
• MasVnrArea: Masonry veneer area in square feet
• ExterQual: Exterior material quality
• ExterCond: Present condition of the material on the exterior
• Foundation: Type of foundation
• BsmtQual: Height of the basement
• BsmtCond: General condition of the basement
• BsmtExposure: Walkout or garden level basement walls
• BsmtFinType1: Quality of basement finished area
• BsmtFinSF1: Type 1 finished square feet
• BsmtFinType2: Quality of second finished area (if present)
• BsmtFinSF2: Type 2 finished square feet
• BsmtUnfSF: Unfinished square feet of basement area
• TotalBsmtSF: Total square feet of basement area
• Heating: Type of heating
• HeatingQC: Heating quality and condition
• CentralAir: Central air conditioning
• Electrical: Electrical system
• 1stFlrSF: First Floor square feet
• 2ndFlrSF: Second floor square feet
• LowQualFinSF: Low quality finished square feet (all floors)
• GrLivArea: Above grade (ground) living area square feet
• BsmtFullBath: Basement full bathrooms
• BsmtHalfBath: Basement half bathrooms
• FullBath: Full bathrooms above grade
• HalfBath: Half baths above grade
• Bedroom: Number of bedrooms above basement level
• Kitchen: Number of kitchens
• KitchenQual: Kitchen quality
• TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
• Functional: Home functionality rating
• Fireplaces: Number of fireplaces
• FireplaceQu: Fireplace quality
• GarageType: Garage location
• GarageYrBlt: Year garage was built
• GarageFinish: Interior finish of the garage
• GarageCars: Size of garage in car capacity
• GarageArea: Size of garage in square feet
• GarageQual: Garage quality
• GarageCond: Garage condition
• PavedDrive: Paved driveway
• WoodDeckSF: Wood deck area in square feet
• OpenPorchSF: Open porch area in square feet
• EnclosedPorch: Enclosed porch area in square feet
• 3SsnPorch: Three season porch area in square feet
• ScreenPorch: Screen porch area in square feet
• PoolArea: Pool area in square feet
• PoolQC: Pool quality
• Fence: Fence quality
• MiscFeature: Miscellaneous feature not covered in other categories
• MiscVal: Dollar value of miscellaneous feature
• MoSold: Month Sold
• YrSold: Year Sold
• SaleType: Type of sale
• SaleCondition: Condition of sale

Write a full report on your analysis of your chosen dataset : 11517 Exploratory Data Analysis and Visualisation G, University of Canberra - write a full report on your analysis of your chosen dataset

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd