Estimate the age of abalone using the information in the data

Reference no: EM132618122

Problem Set

Directions
Show all work/steps/calculations using a combination of code and Markdown.
You may also consult Stackoverflow, etc. This is not by any means "closed book" or anything like that.
Add whatever markdown or code cells you need in each part to explain and calculate your answers. Don't just provide answers but explain them as well. Explain and interpret your results.
Abalone
This is a problem about Abalone...but not really. You should be able to use common knowledge to work with the data in this problem.

Description of fields in abalone data.

Name            Data Type   Meas.  Description
----            ---------   -----  -----------
Sex             nominal            M, F, and I (infant)
Length          continuous  mm     Longest shell measurement
Diameter        continuous  mm     perpendicular to length
Height          continuous  mm     with meat in shell
Whole weight    continuous  grams  whole abalone
Shucked weight  continuous  grams  weight of meat
Viscera weight  continuous  grams  gut weight (after bleeding)
Shell weight    continuous  grams  after being dried
Rings           integer            +1.5 gives the age in years
The target variable is Rings because it is a predictor of age. You can take this as your problem, "How do we estimate the age of an abalone from the available data?"

At a high level you have:

Question/Problem
ETL
EDA
Statistical and Mathematical Modeling.
Of necessity, there is very little ETL here except to verify that the data has loaded correctly and with the correct types. For the EDA and Modeling parts, follow the guidelines in Fundamentals. Do not use regression, as it has not been covered yet; you may only use single value models (mean) or mathematical distributions.

Begin

Part 1. Question/Problem
The question is to estimate the age of abalone using the information in the data. We are going to understand the mathematical distributions of the variables in the data and come up with a model that can help in predicting the age of any new abalone.

Part 2. Extract, Transform, Load (ETL)
Reading DataSet 'abalone_dataset.csv'
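
The ETL check described in the directions (the data loaded correctly and with the correct types) could be sketched as follows. This is only a minimal illustration: the column names are assumed from the field description, the two sample rows are illustrative stand-ins for the real file, and load_abalone is a hypothetical helper, not part of the assignment.

```python
import csv
import io

# Illustrative two-row sample standing in for abalone_dataset.csv.
# Column names are an assumption based on the field description.
SAMPLE_CSV = """Sex,Length,Diameter,Height,Whole weight,Shucked weight,Viscera weight,Shell weight,Rings
M,91,73,19,102.8,44.9,20.2,30.0,15
I,66,51,16,41.0,17.9,7.9,11.0,7
"""

FLOAT_COLUMNS = [
    "Length", "Diameter", "Height", "Whole weight",
    "Shucked weight", "Viscera weight", "Shell weight",
]


def load_abalone(handle):
    """Parse rows and convert every field to its documented type."""
    rows = []
    for raw in csv.DictReader(handle):
        # int() and float() raise ValueError on malformed fields,
        # which is the type check we want during ETL.
        row = {"Sex": raw["Sex"], "Rings": int(raw["Rings"])}
        for name in FLOAT_COLUMNS:
            row[name] = float(raw[name])
        if row["Sex"] not in {"M", "F", "I"}:
            raise ValueError(f"unexpected Sex value: {row['Sex']!r}")
        rows.append(row)
    return rows


rows = load_abalone(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows loaded")
```

With the real file, the same function would be called on an open file handle, for example open('abalone_dataset.csv', newline=''). In a notebook, a data-frame library would do the same job with less code.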

Part 3. Exploratory Data Analysis

Part 4. Single Variable Analysis

It is clear that males outnumber the other two categories. Because the counts of females and infants differ only slightly, let us compare them using relative density.

From the above result, we can clearly see that the categories of the 'sex' variable rank by density as Male > Infant > Female, even though female and infant have almost the same density.
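
A relative-frequency comparison of this kind can be sketched as below. The labels are made-up values for illustration only, not the real data, and relative_frequencies is a hypothetical helper name.

```python
from collections import Counter


def relative_frequencies(values):
    """Return the share of each category as a fraction of all values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


# Made-up labels for illustration; the real column would be used instead.
sexes = ["M", "M", "I", "F", "M", "I", "F", "M"]
freqs = relative_frequencies(sexes)

# Print categories from most to least frequent.
for category, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{category}: {share:.3f}")
```

Comparing shares instead of raw counts makes small differences between categories easier to read.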
2. Numerical Data
Quartile-based CV
Quartile-based coefficient of variation (QCV): a robust measure of the spread of a distribution relative to its center.
QCV = IQR/median
IQR is the interquartile range, which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile) of the data.
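
The QCV formula above can be computed as in this minimal sketch. The quartile convention (linear interpolation, the "inclusive" method) is an assumption; other conventions give slightly different quartiles. The sample values are made up.

```python
import statistics


def qcv(values):
    """Quartile-based coefficient of variation: IQR / median."""
    # "inclusive" interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / statistics.median(values)


# Made-up sample: Q1 = 4, Q3 = 8, median = 6, so QCV = 4 / 6.
sample = [2, 4, 6, 8, 10]
print(round(qcv(sample), 4))
```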

From the above distributions, as we expected:
1. Length has a left-skewed distribution.
2. Diameter has a left-skewed distribution.
3. Height has no noticeable skew.
4. All weights have right-skewed distributions, and it is worth noting that the shucked, viscera, and shell weight distributions are similar to each other and to whole weight as well.
5. Rings has a distribution that is not skewed, with a mean around 10, so the age of a typical abalone would be around 11.5 years (Rings + 1.5).
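
The step from a mean of about 10 rings to an age of about 11.5 years is the single value (mean) model the directions allow. A minimal sketch, using made-up ring counts chosen so that the mean is exactly 10:

```python
import statistics

# From the field description: Rings + 1.5 gives the age in years.
AGE_OFFSET_YEARS = 1.5


def mean_model_age(rings):
    """Single value model: predict the mean ring count plus the offset."""
    return statistics.mean(rings) + AGE_OFFSET_YEARS


# Made-up ring counts for illustration (mean is 10).
rings = [7, 9, 10, 10, 11, 13]
print(mean_model_age(rings))
```

The model predicts the same age for every new abalone, so its usefulness depends on how widely the ring counts spread around the mean.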

Understanding similarities in weights distributions
Necessity: We need to understand this because we can expect whole weight to be highly correlated with the other weight variables, which would make them redundant in predicting age.
Now we add a new feature, 'loss_weight', to check whether that is how whole weight is made up and to see whether there are outliers in the weight values. We expect loss_weight to range from zero to positive, as it is the mass that was lost during the shucking process.
We know that, ideally, Whole weight = Shucked weight + Viscera weight + Shell weight
Therefore, loss_weight = Whole weight - (Shucked weight + Viscera weight + Shell weight), as this calculates the error in the weight observations. A negative value means the parts weigh more than the whole, which points to a measurement error.
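
The loss_weight feature could be computed as in the sketch below. The two rows are made-up values for illustration, with the column names assumed from the field description; the second row is deliberately inconsistent to show how a negative value surfaces.

```python
def loss_weight(row):
    """Whole weight minus the sum of the three component weights."""
    parts = row["Shucked weight"] + row["Viscera weight"] + row["Shell weight"]
    return row["Whole weight"] - parts


# Made-up rows for illustration only.
rows = [
    {"Whole weight": 100.0, "Shucked weight": 45.0,
     "Viscera weight": 20.0, "Shell weight": 30.0},
    {"Whole weight": 40.0, "Shucked weight": 25.0,
     "Viscera weight": 10.0, "Shell weight": 12.0},
]

# Rows whose parts outweigh the whole are candidates for review.
suspicious = [row for row in rows if loss_weight(row) < 0]
print([loss_weight(row) for row in rows], len(suspicious))
```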

Identifying outliers using boxplot

1. In the Length box plot, outliers lie in the range 20 to 40, presumably due to the presence of infants or a particularly small species of abalone (consistent with the left-tailed distribution we found earlier, whose minimum is 15).
2. In the Height box plot, we can see the maximum of 226, which is an outlier; we would need to investigate the provenance of the data to understand where the minimum and maximum anomalies come from.
3. All other variables have few noticeable outliers.
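
Box plots conventionally flag points lying more than 1.5 x IQR beyond the quartiles. The sketch below applies that rule directly; the height values are made up, apart from 0 and 226, which echo the anomalies discussed in this analysis. The quartile convention is an assumption and may differ slightly from what a plotting library uses.

```python
import statistics


def tukey_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Made-up heights in mm; 0 and 226 mimic the anomalies noted in the text.
heights = [0, 22, 25, 27, 28, 30, 31, 33, 35, 226]
print(tukey_outliers(heights))
```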

We can observe that there is a high correlation between most of the variables (correlation coefficient > 0.7).
We already expect, from the univariate analysis, that the weights are strongly correlated.
We also expect length, height, and diameter to be linearly related.
Setting aside what we already know, we can observe that whole weight is strongly correlated with length, diameter, and height, and that the correlation between rings and the other variables is weak.
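
Pairwise correlation coefficients like those discussed above can be computed with the Pearson formula. This is a sketch on made-up columns, so the printed numbers say nothing about the real data; the 0.7 threshold is the one used in the text.

```python
import itertools
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up columns for illustration only.
columns = {
    "Length": [60, 70, 80, 90, 100],
    "Whole weight": [30, 50, 80, 120, 170],
    "Rings": [9, 7, 12, 8, 11],
}

# Correlation for every unordered pair of columns.
correlations = {
    (a, b): pearson(columns[a], columns[b])
    for a, b in itertools.combinations(columns, 2)
}
for pair, r in correlations.items():
    label = "high" if abs(r) > 0.7 else "weak"
    print(pair, round(r, 2), label)
```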
Pairwise Analysis
Building on the previous results and conclusions, let us do a pairwise analysis of the most and least correlated variables.
Highly correlated variables:
1. Whole weight and Height
2. Whole weight and Length
3. Whole weight and Diameter

From the above plot, we can see an exponentially increasing relationship, i.e., whole weight increases as height increases. However, a height of 0 paired with a very high weight must be an error, as must a very large height such as 226 paired with a weight lower than we would expect. Also note another outlier near height 100 which, unlike the outliers at heights 0 and 226, has a fairly large weight to go with it.

The plot again shows an exponentially increasing relationship and, as expected, the length of the shell increases with weight as the abalone adds new growth rings to its shell.

Exploring the relation between the target variable (Rings) and height, whole weight, length, and diameter.
From the previous conclusions, we can say that as length, whole weight, diameter, and height increase, rings should also increase to reflect the increase in age.

Clearly, there is a significant difference in dimensions depending on sex: infants < female <= male, which tells us we can even distinguish infants from adult abalones.
We can conclude that shell weight and height are good predictors of age, which we could model using linear regression.
We noticed outliers in the height and loss_weight variables, which we can remove because they would affect model accuracy.
There is multicollinearity among the weights (whole weight, shell weight, viscera weight, shucked weight).
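
Since the directions allow only single value models, one way to use the difference between sexes noted above is a separate mean model per sex. This is a sketch under that assumption, not the author's method; the rows are made up and mean_by_group is a hypothetical helper.

```python
import statistics
from collections import defaultdict

AGE_OFFSET_YEARS = 1.5  # Rings + 1.5 gives the age in years


def mean_by_group(rows, group_key, value_key):
    """Mean of value_key for each distinct value of group_key."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: statistics.mean(vals) for group, vals in groups.items()}


# Made-up rows for illustration only.
rows = [
    {"Sex": "I", "Rings": 6}, {"Sex": "I", "Rings": 8},
    {"Sex": "F", "Rings": 11}, {"Sex": "F", "Rings": 13},
    {"Sex": "M", "Rings": 10}, {"Sex": "M", "Rings": 12},
]

mean_rings = mean_by_group(rows, "Sex", "Rings")
predicted_age = {sex: r + AGE_OFFSET_YEARS for sex, r in mean_rings.items()}
print(predicted_age)
```

A new abalone would then be assigned the predicted age of its sex category in place of one overall mean.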
