Problem Set
Directions
Show all work/steps/calculations using a combination of code and Markdown.
You may also consult Stack Overflow, etc. This is not by any means "closed book" or anything like that.
Add whatever markdown or code cells you need in each part to explain and calculate your answers. Don't just provide answers but explain them as well. Explain and interpret your results.
Abalone
This is a problem about Abalone...but not really. You should be able to use common knowledge to work with the data in this problem.
Description of fields in abalone data.
Name            Data Type   Meas.   Description
----            ---------   -----   -----------
Sex             nominal             M, F, and I (infant)
Length          continuous  mm      Longest shell measurement
Diameter        continuous  mm      perpendicular to length
Height          continuous  mm      with meat in shell
Whole weight    continuous  grams   whole abalone
Shucked weight  continuous  grams   weight of meat
Viscera weight  continuous  grams   gut weight (after bleeding)
Shell weight    continuous  grams   after being dried
Rings           integer             +1.5 gives the age in years
The target variable is Rings because it is a predictor of age. You can take this as your problem, "How do we estimate the age of an abalone from the available data?"
At a high level you have:
Question/Problem
ETL
EDA
Statistical and Mathematical Modeling.
Of necessity, there is very little ETL here except to verify that the data has loaded correctly and with the correct types. For the EDA and Modeling parts, follow the guidelines in Fundamentals. Do not use regression, as it has not been covered yet; you may only use single-value models (mean) or mathematical distributions.
Begin
Part 1. Question/Problem
The question is how to estimate the age of an abalone using the information in the data. We are going to study the mathematical distributions of the variables in the data and come up with a model that can help predict the age of any new abalone.
Part 2. Extract, Transform, Load (ETL)
Reading the dataset 'abalone_dataset.csv'
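The load-and-verify step can be sketched as follows. This is a minimal sketch, not the notebook's original cell: the column names are assumed to match the field table above, `check_types` is a hypothetical helper, and the three inline rows are made up so the example runs without the file.

```python
import io
import pandas as pd

# Toy stand-in for the file so the sketch runs on its own. These rows are
# made up for illustration; in the notebook you would call
# pd.read_csv("abalone_dataset.csv") instead.
toy_csv = io.StringIO(
    "Sex,Length,Diameter,Height,Whole weight,Shucked weight,"
    "Viscera weight,Shell weight,Rings\n"
    "M,91,73,19,102.8,44.9,20.2,30.0,15\n"
    "F,106,84,27,135.4,51.3,28.3,42.0,9\n"
    "I,66,51,16,41.0,17.9,7.9,11.0,7\n"
)
abalone = pd.read_csv(toy_csv)

# Assumed column names: every measurement column should be numeric.
expected_numeric = ["Length", "Diameter", "Height", "Whole weight",
                    "Shucked weight", "Viscera weight", "Shell weight"]

def check_types(df):
    """Return a list of problems found; an empty list means the load looks right."""
    problems = []
    if not set(df["Sex"].unique()) <= {"M", "F", "I"}:
        problems.append("unexpected Sex category")
    for col in expected_numeric:
        if not pd.api.types.is_numeric_dtype(df[col]):
            problems.append(f"{col} is not numeric")
    if not pd.api.types.is_integer_dtype(df["Rings"]):
        problems.append("Rings is not integer")
    if df.isna().any().any():
        problems.append("missing values present")
    return problems

print(abalone.dtypes)
print(check_types(abalone))
```

An empty list from `check_types` means the columns loaded with the types the field description leads us to expect.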
Part 3. Exploratory Data Analysis
Part 4. Single Variable Analysis
Males clearly outnumber the other two categories. Because the female and infant counts are close, let us compare them using relative density (each category's share of the total).
From the result above, the categories of the 'Sex' variable rank by relative density as Male > Infant > Female, although female and infant have almost the same density.
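Counts and relative densities for a categorical variable can be computed along these lines. The labels below are made up to illustrate the calls and are not the real Sex column; only the use of `value_counts` is the point.

```python
import pandas as pd

# Made-up Sex labels for illustration only; not the real data.
sex = pd.Series(["M"] * 5 + ["I"] * 3 + ["F"] * 2, name="Sex")

# Absolute counts per category, sorted from most to least frequent.
counts = sex.value_counts()

# Relative density: each category's share of all rows (sums to 1).
relative = sex.value_counts(normalize=True)

print(counts)
print(relative)
```

Relative densities make close categories easier to compare than raw counts, because they are on a common 0 to 1 scale.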
2. Numerical Data
Quartile-Based CV
Quartile-based coefficient of variation (QCV): a robust measure of the spread of a distribution relative to its centre.
QCV = IQR / median
IQR is the interquartile range, which is the difference between the 3rd quartile (75th percentile) and the 1st quartile (25th percentile) of the data.
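The QCV formula above translates directly into a few lines of pandas. This is a sketch: `qcv` is a hypothetical helper name, and it relies on pandas' default linear interpolation for quantiles.

```python
import pandas as pd

def qcv(values):
    """Quartile-based coefficient of variation: IQR divided by the median."""
    s = pd.Series(values, dtype="float64")
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    return (q3 - q1) / median

# Small made-up sample: Q1 = 2, median = 3, Q3 = 4, so QCV = 2 / 3.
print(qcv([1, 2, 3, 4, 5]))
```

Because it uses quartiles rather than the mean and standard deviation, the QCV is much less sensitive to extreme values.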
From the distributions above, as expected:
1. Length has a left-skewed distribution.
2. Diameter has a left-skewed distribution.
3. Height has no noticeable skew.
4. All weights have right-skewed distributions. Note that the shucked, viscera, and shell weight distributions are similar in shape to each other and to whole weight.
5. Rings has a distribution that is not skewed, with a mean around 10, so a typical abalone would be around 11.5 years old (rings + 1.5).
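The skew directions read off the plots can be cross-checked numerically, and the mean of Rings gives the single-value age model. The samples below are made up to show each shape and are not the real columns; `Series.skew` returns the sample skewness (negative for a left tail, positive for a right tail).

```python
import pandas as pd

# Made-up samples chosen only to show the three shapes; not the real data.
samples = {
    "left_skewed": pd.Series([1, 8, 9, 9, 10, 10]),
    "symmetric": pd.Series([7, 9, 10, 10, 11, 13]),
    "right_skewed": pd.Series([1, 1, 2, 2, 3, 10]),
}

for name, s in samples.items():
    # A mean below the median also hints at a left tail, and vice versa.
    print(name, "skew:", round(s.skew(), 2),
          "mean:", round(s.mean(), 2), "median:", s.median())

# Single-value (mean) model: predict the same age for every abalone.
rings = samples["symmetric"]       # stand-in for the Rings column
mean_rings = rings.mean()
estimated_age = mean_rings + 1.5   # age in years = rings + 1.5
print(estimated_age)
```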
Understanding similarities in weights distributions
Necessity: We need to understand this because we can expect whole weight to be highly correlated with the other weight variables, which would make them redundant in predicting age.
Now we add a new feature, 'loss_weight', to check whether whole weight is made up of the other three weights and to look for outliers in the weight values. We expect loss_weight to be zero or positive, as it is the mass lost during the shucking process.
We expect that Whole weight = Shucked weight + Viscera weight + Shell weight, up to any mass lost during shucking.
Therefore, loss_weight = Whole weight - (Shucked weight + Viscera weight + Shell weight). A negative value indicates an error in the weight observations.
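A minimal sketch of the loss_weight feature, assuming the column names from the field table. The three rows are made up so that one is positive, one is zero, and one is negative.

```python
import pandas as pd

# Made-up weights in grams for illustration only; not the real data.
weights = pd.DataFrame({
    "Whole weight":   [100.0, 80.0, 50.0],
    "Shucked weight": [45.0, 40.0, 30.0],
    "Viscera weight": [20.0, 15.0, 12.0],
    "Shell weight":   [30.0, 25.0, 10.0],
})

parts = ["Shucked weight", "Viscera weight", "Shell weight"]

# Mass unaccounted for by the three component weights.
weights["loss_weight"] = weights["Whole weight"] - weights[parts].sum(axis=1)

# The parts cannot outweigh the whole, so negative values are suspect rows.
suspect = weights[weights["loss_weight"] < 0]

print(weights["loss_weight"].tolist())
print(suspect.index.tolist())
```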
Identifying outliers using boxplot
1. In the Length box plot, outliers lie between 20 and 40, presumably because of infants or a particularly small species of abalone (we previously found a left-tailed distribution with a minimum of 15).
2. The maximum height of 226 is an outlier; we would need to investigate the provenance of the data to understand where the minimum and maximum anomalies come from.
3. All other variables have few noticeable outliers.
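The points a box plot draws as outliers can also be listed explicitly with the usual 1.5 × IQR fences. This is a sketch: `iqr_outliers` is a hypothetical helper, and the heights are made up except for the 0 and 226 values, which are the ones discussed in this write-up.

```python
import pandas as pd

def iqr_outliers(s, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR], the box plot whisker rule."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return s[(s < lower) | (s > upper)]

# Illustrative heights in mm; only 0 and 226 come from the write-up.
height = pd.Series([0, 20, 22, 25, 27, 28, 30, 32, 35, 226], name="Height")

outliers = iqr_outliers(height)
print(outliers.tolist())
```

Listing the flagged rows makes it easier to decide, case by case, whether a value is a measurement error or a genuinely unusual animal.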
We can observe a high correlation between most of the variables (correlation coefficient > 0.7).
We already know from the univariate analysis that the weights are strongly correlated.
We also expect length, height, and diameter to be linearly related.
Setting aside what we already know, we can observe that whole weight is strongly correlated with length, diameter, and height, and that there is only a weak correlation between rings and the other variables.
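The correlation matrix and the pairs above the 0.7 threshold can be pulled out as sketched below. The three columns of values are made up so that one pair is strongly correlated and Rings is only weakly correlated; they are not the real data.

```python
import pandas as pd

# Made-up values for illustration only; not the real data.
toy = pd.DataFrame({
    "Length":       [60, 80, 100, 120, 140],
    "Whole weight": [30, 70, 130, 220, 340],
    "Rings":        [9, 7, 12, 8, 11],
})

# Pearson correlation between every pair of numeric columns.
corr = toy.corr()

# Collect each pair once (upper triangle) whose |r| exceeds the threshold.
threshold = 0.7
strong = []
cols = list(corr.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if abs(corr.loc[a, b]) > threshold:
            strong.append((a, b, round(corr.loc[a, b], 2)))

print(corr.round(2))
print(strong)
```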
Pair wise Analysis
Based on the results and conclusions so far, let us do a pairwise analysis of the most and least correlated variables.
Highly correlated variables:
1. Whole weight and Height
2. Whole weight and Length
3. Whole weight and Diameter
From the plot above, we can see an exponentially increasing relationship: as height increases, the whole weight of the abalone increases. However, the records with a height of 0 and a very high weight are errors, as is the record with a very large height (226) and a weight lower than we might expect. Note also that another outlier, near a height of 100, has a fairly large weight to go with it, unlike the outliers at heights 0 and 226.
The plot again shows an exponentially increasing relationship. As expected, the length of the shell increases with weight as the abalone adds new growth rings to its shell.
Exploring the relationship between the target variable (Rings) and height, whole weight, length, and diameter.
From the previous conclusions, we can say that as length, whole weight, diameter, and height increase, rings should also increase to reflect the increase in age.
Clearly, there is a significant difference in dimensions depending on sex: infants < female <= male. This tells us we can even distinguish infants from adult abalones.
We can conclude that shell weight and height are good predictors of age, which we can model using linear regression.
We noticed outliers in the height and loss_weight variables, which we can remove because they would affect model accuracy.
There is multicollinearity among the weights (whole weight, shell weight, viscera weight, shucked weight).
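The comparison across sexes can be sketched with a group-wise mean, which also gives a per-group single-value age model (mean rings + 1.5) without using regression. The rows below are made up to mirror the infants < female <= male ordering and are not the real data.

```python
import pandas as pd

# Made-up rows for illustration only; not the real data.
toy = pd.DataFrame({
    "Sex":    ["I", "I", "F", "F", "M", "M"],
    "Length": [60, 70, 110, 120, 112, 122],
    "Rings":  [6, 8, 10, 12, 10, 12],
})

# Mean of each measurement within each sex category.
by_sex = toy.groupby("Sex")[["Length", "Rings"]].mean()

# Per-group single-value model: age in years = mean rings + 1.5.
by_sex["estimated_age"] = by_sex["Rings"] + 1.5

print(by_sex)
```

Splitting the mean model by sex is one simple way to use the infant versus adult difference while staying within single-value models.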