COSC3337 Data Science Assignment - Exploratory Data Analysis

COSC 3337 Data Science Assignment - Exploratory Data Analysis and Data Visualization, University of Houston, USA

Part A - Exploratory Data Analysis for a Vehicle Silhouettes Dataset

Download the Statlog (Vehicle Silhouettes) Data Set and limit your analysis in tasks 1-5 below to the following subset of its attributes; use all examples to create the subset:

i. If your last name starts with A-K, you analyze the COMPACTNESS ((average perim)**2/area), ELONGATEDNESS (area/(shrink width)**2), and SCALED VARIANCE ALONG MAJOR AXIS ((2nd order moment about minor axis)/area) attributes (the 1st, 8th, and 11th attributes) and the class variable.

ii. If your last name starts with L-Z, you analyze the COMPACTNESS ((average perim)**2/area), CIRCULARITY ((average radius)**2/area), and HOLLOWS RATIO ((area of hollows)/(area of bounding polygon)) attributes (the 1st, 2nd, and 18th attributes) and the class variable.

Apply the following exploratory data analysis techniques using R to your dataset:

1. Compute the covariance matrix for the three numerical attributes you are analyzing; also compute the correlation for each of the three pairs of attributes. Interpret the statistical findings!
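As a hedged illustration of what task 1 computes (the assignment itself asks for R; Python is used here only to show the arithmetic), the sketch below builds a sample covariance matrix and the three pairwise Pearson correlations. It assumes the A-K attribute subset and uses only the five example rows listed in this assignment, so the numbers are not representative of the full dataset. The helper names `covariance` and `correlation` are illustrative.

```python
from math import sqrt

# A-K subset taken from the five example rows listed in the assignment:
# COMPACTNESS (1st), ELONGATEDNESS (8th), SCALED VARIANCE ALONG MAJOR AXIS (11th).
columns = {
    "compactness": [96, 89, 99, 104, 101],
    "elongatedness": [32, 57, 36, 31, 32],
    "scaled_variance_major": [227, 137, 202, 225, 227],
}

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Sample covariance (divides by n - 1)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(xs, ys) / sqrt(covariance(xs, xs) * covariance(ys, ys))

names = list(columns)

# 3 x 3 covariance matrix; the diagonal holds the variances.
cov_matrix = [[covariance(columns[a], columns[b]) for b in names] for a in names]
for name, row in zip(names, cov_matrix):
    print(name, [round(value, 2) for value in row])

# Correlation for each of the three attribute pairs.
for i, a in enumerate(names):
    for b in names[i + 1:]:
        print(a, "vs", b, round(correlation(columns[a], columns[b]), 3))
```

Correlation is unit-free and bounded by -1 and 1, which makes it easier to compare across pairs than the raw covariances, whose size depends on the scale of each attribute.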

2. Create a scatter plot for the last two numerical attributes of your dataset. Interpret the scatter plot!

3. Create histograms for each of the first 2 numerical attributes, once for the whole dataset and once for the instances of each of the 4 classes. That is, you create 10 histograms. Interpret the obtained displays!

4. Create box plots for the COMPACTNESS attribute for the instances of each class and a fifth box plot for all instances in the dataset. Interpret and compare the 5 box plots!
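The numbers a box plot displays can be checked by hand. The sketch below (in Python for illustration only; the assignment asks for R) computes a five-number summary of COMPACTNESS for all instances and per class, using Tukey's hinges for the quartiles. It uses only the five example rows listed in this assignment, so the per-class groups contain just one or two values; the function name `five_number_summary` is illustrative.

```python
from statistics import median

# (COMPACTNESS, class) pairs from the five example rows in the assignment.
examples = [(96, "opel"), (89, "van"), (99, "bus"), (104, "saab"), (101, "opel")]

def five_number_summary(values):
    """Return (min, lower hinge, median, upper hinge, max) using Tukey's hinges."""
    ordered = sorted(values)
    n = len(ordered)
    half = (n + 1) // 2  # both halves include the middle value when n is odd
    return (
        ordered[0],
        median(ordered[:half]),
        median(ordered),
        median(ordered[n - half:]),
        ordered[-1],
    )

# Group the COMPACTNESS values by class, plus one group for all instances.
groups = {"all": [value for value, _ in examples]}
for value, label in examples:
    groups.setdefault(label, []).append(value)

summaries = {label: five_number_summary(values) for label, values in groups.items()}
for label, summary in summaries.items():
    print(label, summary)
```

Comparing the medians and the hinge-to-hinge spread across the five groups is one way to put the visual comparison of the box plots into words.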

5. Create 3 supervised scatter plots, each using 2 of the 3 attributes and the class variable; use different colors for the class variable. Interpret the scatter plots!

6. Fit two linear models, one predicting the dependent variable B and one predicting the dependent variable V, each using all 18 numerical attributes as independent variables, for a dataset VS-Mod which is created as follows from the complete raw Vehicle Silhouette Dataset:

a. Z-score the 18 numerical attributes

b. Add an attribute B to the dataset that is 1 if the example is a bus, and 0 otherwise.

c. Add a variable V to the dataset that is 1 if the example is a van and 0 otherwise.

Report the R² of each obtained linear model and the coefficients of each attribute in the two obtained regression functions. Next, interpret the results! What do the coefficients tell you about the importance of the 18 attributes for the two prediction problems? What about negative and positive coefficients? Also assess to what extent the coefficients of the two regression functions agree with each other.
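Steps a-c can be sketched as follows. This is a minimal Python illustration (the assignment asks for R) that builds VS-Mod from the five example rows listed in this assignment only. It assumes z-scoring with the sample standard deviation (n - 1 denominator), and the names `parse`, `z_score_columns`, and `r_squared` are illustrative. Fitting the 18-attribute regressions themselves is left out, since five rows cannot support them.

```python
from statistics import mean, stdev

# The five example rows from the assignment: 18 numerical attributes + class.
RAW = """\
96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel
89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van
99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus
104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab
101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel
"""

def parse(text):
    """Split each row into 18 floats and a class label."""
    features, labels = [], []
    for line in text.strip().splitlines():
        *values, label = line.split()
        features.append([float(v) for v in values])
        labels.append(label)
    return features, labels

def z_score_columns(rows):
    """Step a: z-score each column (assumption: sample standard deviation)."""
    scaled_columns = []
    for column in zip(*rows):
        m, s = mean(column), stdev(column)
        scaled_columns.append([(v - m) / s if s else 0.0 for v in column])
    return [list(row) for row in zip(*scaled_columns)]

def r_squared(actual, predicted):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    m = mean(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - m) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

features, labels = parse(RAW)
z_rows = z_score_columns(features)
B = [1 if label == "bus" else 0 for label in labels]  # step b
V = [1 if label == "van" else 0 for label in labels]  # step c

# VS-Mod: 18 z-scored attributes followed by the indicators B and V.
vs_mod = [z + [b, v] for z, b, v in zip(z_rows, B, V)]
print(len(vs_mod), "rows,", len(vs_mod[0]), "columns")
```

Because every attribute is on the same scale after z-scoring, the magnitudes of the fitted coefficients can be compared across attributes directly.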

7. Using the dataset VS-Mod from the previous task, create 3 different decision tree models that predict the class attribute B based on the 18 numerical attributes and have 20 or fewer nodes, and create 3 different decision tree models that predict the class attribute V based on the 18 numerical attributes and have 20 or fewer nodes. Explain how the 3 decision tree models were obtained. Report the training accuracy and the testing accuracy for each decision tree; interpret the learnt decision trees: what do they tell you about the importance of the 18 attributes in the dataset for the classification problem? Assess the training accuracy obtained. Also compare your findings with the findings you obtained for task 6!

8. Write a conclusion (at most 18 sentences!) summarizing the most important findings of tasks 1-7: what did we learn about the dataset? In particular, address the findings related to predicting buses, vans, and all 4 classes using the attributes in the dataset. Also assess the difficulty of your classification task. (6 points)

9. Are there any other interesting observations about your dataset?

Remark: About 40% of the Assignment 1 points will be allocated to interpreting statistical findings and visualizations! A few extra points will be allocated for really good answers to the questions in green!

5 Examples in the raw Vehicle Silhouette Dataset:

96 55 103 201 65 9 204 32 23 166 227 624 246 74 6 2 186 194 opel

89 36 51 109 52 6 118 57 17 129 137 206 125 80 2 14 181 185 van

99 41 77 197 69 6 177 36 21 139 202 485 151 72 4 10 198 199 bus

104 54 100 186 61 10 216 31 24 173 225 686 220 74 5 11 185 195 saab

101 56 100 215 69 10 208 32 24 169 227 651 223 74 6 5 186 193 opel

Attribute Information for the Vehicle Silhouette Dataset:

1. COMPACTNESS: (average perim)**2/area

2. CIRCULARITY: (average radius)**2/area

3. DISTANCE CIRCULARITY: area/(av.distance from border)**2

4. RADIUS RATIO: (max.rad-min.rad)/av.radius

5. PR.AXIS ASPECT RATIO: (minor axis)/(major axis)

6. MAX.LENGTH ASPECT RATIO: (length perp. max length)/(max length)

7. SCATTER RATIO: (inertia about minor axis)/(inertia about major axis)

8. ELONGATEDNESS: area/(shrink width)**2

9. PR.AXIS RECTANGULARITY: area/(pr.axis length*pr.axis width)

10. MAX.LENGTH RECTANGULARITY: area/(max.length*length perp. to this)

11. SCALED VARIANCE ALONG MAJOR AXIS: (2nd order moment about minor axis)/area

12. SCALED VARIANCE ALONG MINOR AXIS: (2nd order moment about major axis)/area

13. SCALED RADIUS OF GYRATION: (mavar+mivar)/area

14. SKEWNESS ABOUT MAJOR AXIS: (3rd order moment about major axis)/sigma_min**3

15. SKEWNESS ABOUT MINOR AXIS: (3rd order moment about minor axis)/sigma_maj**3

16. KURTOSIS ABOUT MINOR AXIS: (4th order moment about major axis)/sigma_min**4

17. KURTOSIS ABOUT MAJOR AXIS: (4th order moment about minor axis)/sigma_maj**4

18. HOLLOWS RATIO: (area of hollows)/(area of bounding polygon)

Here sigma_maj**2 is the variance along the major axis, sigma_min**2 is the variance along the minor axis, and area of hollows = (area of bounding polygon) - (area of object).

The area of the bounding polygon is found as a side result of the computation to find the maximum length. Each individual length computation yields a pair of calipers to the object orientated at every 5 degrees. The object is propagated into an image containing the union of these calipers to obtain an image of the bounding polygon.

CLASSES (4): OPEL, SAAB, BUS, VAN

Part B - Density-Based Crime Analysis and Data Visualization

In Part B of this assignment we will use the 4 datasets depicted below:

[Figure 1238_figure.png: the 4 crime datasets; image not available]

Each dataset contains longitude-latitude pairs of the crime of the mentioned category which occurred in a particular time interval; e.g. Harassment12-17.csv contains locations of harassment crimes which occurred in time slots 12 through 17.

10. Create "heatmap"-style density plots for the 4 datasets! Use the same bandwidth for each display; experiment with different values for the bandwidth and try to identify the most suitable bandwidth for density plots for the 4 datasets. Report how and why you chose the particular bandwidth for your 4 displays.
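To make the role of the bandwidth concrete, the sketch below evaluates a two-dimensional Gaussian kernel density estimate on a regular grid for two bandwidth values. It is a Python illustration only (the assignment asks for R). The points are hypothetical planar coordinates, not values from the crime datasets, and treating longitude-latitude pairs as planar is itself an assumption. The names `linspace` and `gaussian_kde_grid` are illustrative.

```python
from math import exp, pi

# Hypothetical coordinates standing in for longitude-latitude pairs;
# they are NOT taken from the assignment's crime datasets.
points = [(0.20, 0.20), (0.25, 0.30), (0.30, 0.20), (0.80, 0.70)]

def linspace(start, stop, count):
    """Return `count` evenly spaced values from start to stop inclusive."""
    step = (stop - start) / (count - 1)
    return [start + i * step for i in range(count)]

def gaussian_kde_grid(points, bandwidth, xs, ys):
    """Evaluate a 2-D Gaussian KDE with one shared bandwidth at every grid node."""
    norm = 1.0 / (len(points) * 2.0 * pi * bandwidth ** 2)
    two_h2 = 2.0 * bandwidth ** 2
    return [
        [
            norm * sum(exp(-((x - px) ** 2 + (y - py) ** 2) / two_h2)
                       for px, py in points)
            for x in xs
        ]
        for y in ys
    ]

xs = linspace(-1.0, 2.0, 61)
ys = linspace(-1.0, 2.0, 61)
cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])

for bandwidth in (0.05, 0.20):
    grid = gaussian_kde_grid(points, bandwidth, xs, ys)
    peak = max(max(row) for row in grid)
    mass = sum(sum(row) for row in grid) * cell_area
    print(f"bandwidth={bandwidth}: peak density={peak:.2f}, total mass={mass:.3f}")
```

A small bandwidth produces sharp, isolated peaks around individual points, while a large one merges them into broad blobs. Using the same grid and bandwidth for all 4 datasets keeps the resulting heatmaps comparable.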

11. Create density contour plots for the Harassment12-17.csv and Harassment6-11.csv datasets. Interpret the obtained density plots!

12. Summarize to what extent harassment and petit larceny crimes are collocated in time slots 6-11! Try to produce an analysis method and/or visualization method that reports the strength of the collocation between the two crime types in the given time interval. Next, summarize to what extent the 2 crime types are anti-collocated! Try to produce an analysis method and/or visualization method that reports the strength of the anti-collocation between the two crime types. Interpret the display/analysis results!
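One possible collocation measure, offered only as a sketch and not as the method the assignment prescribes, is to overlay a grid on the study area, count the crimes of each type per cell, and correlate the two count vectors. Values near 1 suggest collocation and values near -1 suggest anti-collocation. The Python below (the assignment asks for R) uses hypothetical point sets in the unit square, and the names `cell_counts` and `pearson` are illustrative.

```python
from math import sqrt

def cell_counts(points, bounds, n_cells):
    """Count points per cell of an n_cells x n_cells grid, flattened row by row."""
    x_min, x_max, y_min, y_max = bounds
    counts = [0] * (n_cells * n_cells)
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the study area
        col = min(int((x - x_min) / (x_max - x_min) * n_cells), n_cells - 1)
        row = min(int((y - y_min) / (y_max - y_min) * n_cells), n_cells - 1)
        counts[row * n_cells + col] += 1
    return counts

def pearson(xs, ys):
    """Pearson correlation of two equally long count vectors (non-constant)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical point sets; NOT taken from the assignment's crime datasets.
bounds = (0.0, 1.0, 0.0, 1.0)
crime_a = [(0.10, 0.10), (0.20, 0.20), (0.15, 0.30), (0.80, 0.90)]
crime_b = [(0.30, 0.10), (0.40, 0.40), (0.90, 0.80)]  # mostly where crime_a is
crime_c = [(0.90, 0.10), (0.80, 0.20), (0.10, 0.90)]  # mostly where crime_a is not

counts_a = cell_counts(crime_a, bounds, 2)
collocated = pearson(counts_a, cell_counts(crime_b, bounds, 2))
anti_collocated = pearson(counts_a, cell_counts(crime_c, bounds, 2))
print("a vs b:", round(collocated, 3))
print("a vs c:", round(anti_collocated, 3))
```

The result depends on the cell size, so it is worth reporting the score for several grid resolutions. Cells that are empty for both crime types also inflate the correlation and may need to be excluded.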

13. Summarize to what extent the distribution of harassment crimes changed between time slots 0-5 and 6-11 and between time slots 6-11 and 12-17. Develop an analysis method and/or visualization method that summarizes the strength and direction of change for a given pair of datasets. Alternatively, you might design and implement a method that identifies and visualizes regions of increased and decreased density with respect to a given pair of datasets. Interpret the obtained displays/analysis results!
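As one hedged way to quantify such a change (an illustrative choice, not the assignment's required method), each dataset can be converted into per-cell shares of its crimes and the two share vectors subtracted. The sign of each difference gives the direction of change in that cell, and half the sum of absolute differences gives an overall strength between 0 (identical distributions) and 1 (no overlap). The Python sketch below (the assignment asks for R) uses hypothetical points, and the names `cell_shares` and `density_change` are illustrative.

```python
def cell_shares(points, bounds, n_cells):
    """Fraction of the in-bounds points that falls into each grid cell."""
    x_min, x_max, y_min, y_max = bounds
    counts = [0] * (n_cells * n_cells)
    for x, y in points:
        if not (x_min <= x <= x_max and y_min <= y <= y_max):
            continue  # ignore points outside the study area
        col = min(int((x - x_min) / (x_max - x_min) * n_cells), n_cells - 1)
        row = min(int((y - y_min) / (y_max - y_min) * n_cells), n_cells - 1)
        counts[row * n_cells + col] += 1
    total = sum(counts)
    return [count / total for count in counts]

def density_change(before, after, bounds, n_cells):
    """Return per-cell share differences (after - before) and an overall strength."""
    shares_before = cell_shares(before, bounds, n_cells)
    shares_after = cell_shares(after, bounds, n_cells)
    diff = [a - b for a, b in zip(shares_after, shares_before)]
    strength = 0.5 * sum(abs(d) for d in diff)  # 0 = no change, 1 = no overlap
    return diff, strength

# Hypothetical point sets for two time slots; NOT the assignment's data.
bounds = (0.0, 1.0, 0.0, 1.0)
slot_early = [(0.10, 0.10), (0.20, 0.20), (0.80, 0.80), (0.90, 0.90)]
slot_late = [(0.10, 0.20), (0.70, 0.80), (0.80, 0.90), (0.90, 0.70)]

diff, strength = density_change(slot_early, slot_late, bounds, 2)
print("per-cell change:", diff)
print("overall strength:", strength)
```

Working with shares rather than raw counts separates a change in where crimes occur from a change in how many occur. The difference in totals can be reported alongside the strength.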

More sophisticated analysis and visualization approaches for tasks 12 and 13 will receive higher scores and may be awarded extra points. Moreover, "alternative" approaches to finding good data visualizations for those two tasks are welcome.

Remarks: Points allocated to a particular task are preliminary and subject to change. Points will be deducted for incomplete submissions.

Submission Guidelines: Create a folder and name it LastName_StudentId_HW1. The HW1 folder should include:

a. The R code for the tasks.

b. The data files needed to run the R code.

c. The assignment report containing all the plots and results along with the interpretations, one question at a time.

Submit the LastName_StudentId_HW1 folder as a zipped file (.zip only; no .rar, .7z, …) through Blackboard.
