Estimate the age of abalone using the information

Reference no: EM132618122

Problem Set

Directions
Show all work/steps/calculations using a combination of code and Markdown.
You may also consult Stack Overflow and similar resources. This is by no means "closed book".
Add whatever markdown or code cells you need in each part to explain and calculate your answers. Don't just provide answers but explain them as well. Explain and interpret your results.
Abalone
This is a problem about Abalone...but not really. You should be able to use common knowledge to work with the data in this problem.

Description of fields in abalone data.

Name            Data Type   Meas.  Description
----            ---------   -----  -----------
Sex             nominal            M, F, and I (infant)
Length          continuous  mm     Longest shell measurement
Diameter        continuous  mm     Perpendicular to length
Height          continuous  mm     With meat in shell
Whole weight    continuous  grams  Whole abalone
Shucked weight  continuous  grams  Weight of meat
Viscera weight  continuous  grams  Gut weight (after bleeding)
Shell weight    continuous  grams  After being dried
Rings           integer            +1.5 gives the age in years
The target variable is Rings because it is a predictor of age. You can take this as your problem, "How do we estimate the age of an abalone from the available data?"

At a high level you have:

Question/Problem
ETL
EDA
Statistical and Mathematical Modeling.
Of necessity, there is very little ETL here except to verify that the data has loaded correctly and with the correct types. For the EDA and Modeling parts, follow the guidelines in Fundamentals. Do not use regression, as it has not been covered yet; you may use only single-value models (the mean) or mathematical distributions.

Begin

Part 1. Question/Problem
The question is how to estimate the age of an abalone using the information in the data. We will examine the distributions of the variables in the data and build a model that can help predict the age of any new abalone.

Part 2. Extract, Transform, Load (ETL)
Reading the dataset 'abalone_dataset.csv'
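A minimal sketch of the load-and-verify step, using only the Python standard library. It assumes the CSV has a header row with the column names from the field table above; the actual header of 'abalone_dataset.csv' is not shown here, so these names are assumptions. The two rows in SAMPLE are made up for illustration and stand in for the real file.

```python
import csv
import io

# Assumed column names, taken from the field description table.
NUMERIC_COLUMNS = ["Length", "Diameter", "Height", "Whole weight",
                   "Shucked weight", "Viscera weight", "Shell weight"]

def load_abalone(handle):
    """Parse rows, converting measurements to float and Rings to int."""
    rows = []
    for raw in csv.DictReader(handle):
        row = {"Sex": raw["Sex"].strip()}
        for name in NUMERIC_COLUMNS:
            row[name] = float(raw[name])
        row["Rings"] = int(raw["Rings"])
        rows.append(row)
    return rows

def check_rows(rows):
    """Return a list of problems; an empty list means the load looks sane."""
    problems = []
    for i, row in enumerate(rows):
        if row["Sex"] not in {"M", "F", "I"}:
            problems.append(f"row {i}: unexpected Sex {row['Sex']!r}")
        if row["Rings"] < 0:
            problems.append(f"row {i}: negative Rings")
    return problems

# Made-up rows for illustration; with the real data, pass
# open("abalone_dataset.csv", newline="") instead of the StringIO object.
SAMPLE = io.StringIO(
    "Sex,Length,Diameter,Height,Whole weight,Shucked weight,"
    "Viscera weight,Shell weight,Rings\n"
    "M,91,73,19,102.8,44.9,20.2,30.0,15\n"
    "X,70,53,18,45.1,19.9,9.7,14.0,7\n"
)
rows = load_abalone(SAMPLE)
print(check_rows(rows))
```

The type conversions raise a ValueError on malformed cells, which is itself a useful check that the data loaded with the correct types.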

Part 3. Exploratory Data Analysis

Part 4. Single Variable Analysis

Males clearly outnumber the other two categories. Because the female and infant counts are close, we compare the categories using relative density.

From the above result, the ranking of the 'sex' categories by relative density is Male > Infant > Female, although the female and infant densities are almost equal.
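The count and relative-density comparison for a categorical variable can be sketched as follows. The labels in `sexes` are toy values chosen for illustration, not the real column; the helper name `relative_frequencies` is hypothetical.

```python
from collections import Counter

def relative_frequencies(labels):
    """Map each category to its share of all observations, largest first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.most_common()}

# Toy labels for illustration only; replace with the Sex column of the data.
sexes = ["M", "M", "M", "M", "M", "I", "I", "I", "F", "F"]
freqs = relative_frequencies(sexes)
for label, share in freqs.items():
    print(f"{label}: {share:.2f}")
```

Relative frequencies make near-equal categories easier to compare than raw counts because they sum to 1 regardless of sample size.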
2. Numerical Data
Quartile-Based CV
Quartile-based coefficient of variation (QCV): a robust measure of the spread of a distribution relative to its center.
QCV = IQR / median
IQR is the interquartile range, the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile) of the data.
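As a rough sketch, QCV can be computed with the standard library's `statistics.quantiles`. The choice of `method="inclusive"` (linear interpolation between data points) is an assumption; other quartile conventions give slightly different values. The sample values are arbitrary.

```python
import statistics

def qcv(values):
    """Quartile-based CV: interquartile range divided by the median."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (q3 - q1) / median

# Arbitrary illustrative values: Q1 = 20, median = 30, Q3 = 40.
sample = [10, 20, 30, 40, 50]
print(round(qcv(sample), 4))
```

Because it relies on quartiles rather than the mean and standard deviation, QCV is less sensitive to extreme values such as the height outliers discussed later.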

From the above distributions, as expected:
1. Length has a left-skewed distribution.
2. Diameter has a left-skewed distribution.
3. Height shows no noticeable skew.
4. All weights have right-skewed distributions; note that the shucked, viscera, and shell weight distributions resemble each other and the whole weight distribution as well.
5. Rings has a distribution with no noticeable skew and a mean of around 10, so the typical age of an abalone is around 11.5 years (10 + 1.5).
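The step from mean rings to estimated age is a single-value (mean) model, which is what the directions allow. A minimal sketch, with toy ring counts chosen so that the mean is 10; the function names are hypothetical.

```python
import statistics

def mean_model(rings):
    """Single-value model: predict the mean ring count for every abalone."""
    return statistics.fmean(rings)

def rings_to_age(rings):
    """Rings + 1.5 gives the age in years (from the field description)."""
    return rings + 1.5

# Toy ring counts for illustration only.
rings = [7, 9, 10, 10, 11, 13]
prediction = mean_model(rings)
spread = statistics.stdev(rings)  # how far individual abalone typically deviate
print(prediction, rings_to_age(prediction), round(spread, 2))
```

The standard deviation indicates how much error to expect when the same mean is predicted for every abalone.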

Understanding similarities in weight distributions
Necessity: we expect whole weight to be highly correlated with the other weight variables, which would make them redundant when predicting age.
Now we add a new feature, 'loss_weight', to check whether this is how whole weight is composed and to see if there are outliers in the weight values. We expect loss_weight to be zero or positive, since it is the mass lost during the shucking process.
Ideally, Whole weight = Shucked weight + Viscera weight + Shell weight.
Therefore, loss_weight = Whole weight - (Shucked weight + Viscera weight + Shell weight), which measures the discrepancy in the weight observations.
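The loss_weight feature can be sketched as below. The dictionary keys follow the field table and are assumed to match the real column names; the two records are made up so that one has a plausible (positive) loss and one has an impossible (negative) loss.

```python
def loss_weight(row):
    """Whole weight minus the sum of its recorded components."""
    parts = row["Shucked weight"] + row["Viscera weight"] + row["Shell weight"]
    return row["Whole weight"] - parts

# Made-up records for illustration only.
records = [
    {"Whole weight": 100.0, "Shucked weight": 40.0,
     "Viscera weight": 20.0, "Shell weight": 30.0},
    {"Whole weight": 50.0, "Shucked weight": 30.0,
     "Viscera weight": 15.0, "Shell weight": 10.0},
]

# A negative loss means the parts outweigh the whole: a likely recording error.
suspicious = [r for r in records if loss_weight(r) < 0]
print([loss_weight(r) for r in records], len(suspicious))
```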

Identifying outliers using boxplot

1. In the Length box plot, outliers lie in the range 20 to 40, presumably due to infants or a particularly small species of abalone (consistent with the left tail found earlier, which has a minimum of 15).
2. In the Height box plot, the maximum of 226 is an outlier; we would need to investigate the provenance of the data to understand where the minimum and maximum anomalies come from.
3. All other variables have few noticeable outliers.
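Box plots conventionally flag points more than 1.5 × IQR beyond the quartiles. The sketch below applies that rule directly; whether the plots here used exactly this convention is an assumption. The heights are toy values that include a 0 and a 226 to mirror the anomalies discussed above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Toy heights in mm for illustration only.
heights = [0, 20, 25, 28, 30, 32, 35, 226]
outliers = iqr_outliers(heights)
print(outliers)
```

Flagged points are candidates for investigation rather than automatic removal, since some may be genuine small or large specimens.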

There is a high correlation between most of the variables (correlation coefficient > 0.7).
We already know from the univariate analysis that the weights are strongly correlated.
We also expect length, height, and diameter to be linearly related.
Beyond what we already knew, whole weight is strongly correlated with length, diameter, and height, while rings is only weakly correlated with the other variables.
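A small sketch of how the strongly correlated pairs could be listed, using a hand-written Pearson coefficient and the 0.7 threshold mentioned above. The column values are toy numbers for illustration, and `strongly_correlated_pairs` is a hypothetical helper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def strongly_correlated_pairs(columns, threshold=0.7):
    """List (name_a, name_b, r) for every pair with |r| above the threshold."""
    names = list(columns)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(columns[a], columns[b])
            if abs(r) > threshold:
                pairs.append((a, b, round(r, 3)))
    return pairs

# Toy columns for illustration only.
columns = {
    "Length": [60, 80, 100, 120, 140],
    "Whole weight": [20, 45, 90, 160, 250],
    "Rings": [9, 7, 12, 8, 11],
}
pairs = strongly_correlated_pairs(columns)
print(pairs)
```

Pearson correlation measures linear association only, so a curved relationship such as weight against length can still score high without being linear.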
Pairwise Analysis
Based on the earlier results and conclusions, we run a pairwise analysis on the most and least correlated variables.
Highly correlated variables:
1. Whole weight and Height
2. Whole weight and Length
3. Whole weight and Diameter

The plot above shows an exponentially increasing relationship: as height increases, whole weight increases. Some points are clearly erroneous, however: a height of 0 recorded for abalone whose weights are high, and a very large height of 226 with a weight lower than we would expect. Another outlier near height 100 also has a fairly large weight, unlike the other outliers at heights 0 and 226.

The plot shows an exponentially increasing relationship and, as expected, the length of the shell increases with weight as the abalone adds new growth rings to its shell.

Exploring the relationship between the target variable (Rings) and height, whole weight, length, and diameter.
From the previous conclusions, as length, whole weight, diameter, and height increase, rings should also increase to reflect the increase in age.

Clearly, the dimensions differ significantly by sex: infants < females <= males, which suggests we can even distinguish infants from adult abalone.
We can conclude that shell weight and height are good predictors of age, which we can model using linear regression.
We noticed outliers in the height and loss_weight variables, which we can remove because they would affect model accuracy.
There is multicollinearity among the weights (whole weight, shell weight, viscera weight, shucked weight).
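The comparison of dimensions by sex amounts to computing one mean per category. A minimal sketch with made-up lengths; `group_means` is a hypothetical helper, and the same function applied to Rings would give a per-sex single-value model.

```python
from collections import defaultdict
import statistics

def group_means(rows, key, value):
    """Mean of `value` for each distinct `key` category."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: statistics.fmean(v) for k, v in groups.items()}

# Made-up rows for illustration only.
rows = [
    {"Sex": "I", "Length": 60}, {"Sex": "I", "Length": 70},
    {"Sex": "F", "Length": 110}, {"Sex": "F", "Length": 120},
    {"Sex": "M", "Length": 112}, {"Sex": "M", "Length": 122},
]
means = group_means(rows, "Sex", "Length")
print(means)
```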

Attachment:- Problem Set.rar
