Conduct hierarchical clustering on the data

Reference no: EM132167047

Statistical Programming for Data Science Assignment -

Question 1 - It's All in the Taste - Experts vs Amateurs

Who is better at discerning the tastes of supermarket chocolate? Do you really need training to know if you like it? Or does it all just taste really good?

The Experts battle it out against a group of dedicated chocolate-eating Amateurs!

The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in supermarkets. Two groups were asked to rate the qualities of the chocolates: the first group contained a panel of sensometric experts with responses recorded over 9 different tasting sessions. The accompanying data is in chocolate_experts.csv.

The second group contained a panel of volunteers chosen to represent 'regular shoppers' who underwent a three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting sessions. The accompanying data is in chocolate_amateurs.csv.

The responses were recorded over a continuous scale from 0 to 10 with 0 indicating the absence of the sensometric quality and 10 indicating fully present. It is of interest to determine if experts perceive supermarket chocolate differently to non-experts (the amateurs) using 14 sensometric variables (Chocolate Aroma through to Granular Texture in the data files).

For this question you need to randomly obtain two session ids for the expert responses only by making a call to sample as shown below. The two numbers that are returned are your session ids that you need to extract for your analysis.

For the expert data you will only need to analyse the responses corresponding to the two randomly selected session ids. Amateur data needs to be used in full.
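The session selection described above can be sketched as follows. This is a minimal illustration, not the call supplied with the exam script: it assumes the nine expert sessions are numbered 1 to 9, and the column name `session`, the seed and the toy data frame are all hypothetical.

```r
# Hypothetical stand-in for chocolate_experts.csv: 9 sessions, 4 ratings each.
set.seed(1)  # assumed seed, used only so this sketch is reproducible
experts <- data.frame(
  session = rep(1:9, each = 4),
  chocolate_aroma = runif(36, min = 0, max = 10)
)

# Draw two distinct session ids; sample() draws without replacement by default.
session_ids <- sample(1:9, size = 2)

# Keep only the expert responses recorded in the two selected sessions.
experts_subset <- experts[experts$session %in% session_ids, ]
```

The amateur data would be read in and used without any such filtering.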

You are asked to compare the responses between the two groups as requested in each part below. A partially written R script is available as part of the exam package. You must use this script for your analysis and follow the instructions therein. Any lines marked with requires you to change that line of code to suit your purposes. Further details are provided in the code comments around that line.

For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data. Include your session IDs for the expert responses, and any data manipulation performed prior to analysis should you do so.

ii) Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for the expert responses from your selected session ids, the other for the full set of amateur responses. You may present the analyses side by side or in sequence, whichever you believe is best. For each Exploratory Factor Analysis you only need to include the following:

If appropriate, Cronbach's alpha output and a short discussion (2-3 lines) of whether the data is trustworthy and why.
Correlation output of your choosing (graphical and/or numerical) with an accompanying discussion (3-4 lines). If numerical, round the correlations to 2 digits.
A single paragraph explaining the outcome of the determinant test, Bartlett's test of sphericity and the KMO statistic for both data sets. Do not include R output.
Your decision regarding the number of factors to estimate (scree plot may be shown, do not show the R console output).
The FINAL factor solution. You do not need to discuss results of any of the other solutions, however you should justify your final factor solution, including loadings, and name the factors in each analysis. You should also include up to two sentences indicating whether the test of residuals was passed and whether the factors are correlated.
All factors should be named and an explanation as to how you come up with these names should be included.
Based on the factor analysis results and your chosen factor names, discuss the factors that have emerged from the study. What types of differences (if any) exist between the expert and amateur sensometric ratings?
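Two of the adequacy checks listed above, the determinant test and Bartlett's test of sphericity, can be computed in base R from the correlation matrix. The sketch below uses simulated ratings with made-up variable names, not the chocolate data. Bartlett's statistic is computed from its standard formula, chi-squared = -((n - 1) - (2p + 5) / 6) * ln|R| with p(p - 1) / 2 degrees of freedom. Cronbach's alpha and the KMO statistic are not shown; if the psych package is available, `psych::alpha()` and `psych::KMO()` provide them.

```r
# Simulated ratings standing in for the sensometric data (hypothetical values).
set.seed(42)
n <- 100
latent <- rnorm(n)
ratings <- data.frame(
  aroma  = latent + rnorm(n, sd = 0.5),
  sweet  = latent + rnorm(n, sd = 0.5),
  bitter = -latent + rnorm(n, sd = 0.5),
  grainy = rnorm(n)
)

R <- cor(ratings)   # correlation matrix of the items
p <- ncol(ratings)  # number of items

# Determinant test: a determinant very close to 0 points to multicollinearity.
det_R <- det(R)

# Bartlett's test of sphericity: H0 is that R is an identity matrix.
chi_sq  <- -((n - 1) - (2 * p + 5) / 6) * log(det_R)
df      <- p * (p - 1) / 2
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)
```

A small p-value here indicates that the items are correlated enough for a factor analysis to be worth attempting.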

iii) Conclusions: write 2 paragraphs of conclusions based on your analysis.

Question 2 - Are We There Yet? Clustering Cities Around the World

The data for this question are distances between cities in different regions of the world.

You will need to use the data set individually assigned to you. The file cities.xlsx on the Assignments page indicates the continent assigned to each student.

Each data set contains a distance matrix and can be found on the assignments page, in a file of the form RegionCitiesClustering.dat. For example, for the European data the file will be called EuropeanCitiesClustering.dat. For this question, you are asked to conduct clustering analysis using both hierarchical and partitional clustering techniques.

For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:

i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data including any data manipulation performed prior to clustering.

ii) Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-based method: single, complete or average linkage, or Ward's method. Ensure you justify your choice in your write-up and include the resulting dendrogram, as well as a discussion of the outcomes of hierarchical clustering on your data.
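A minimal sketch of agglomerative clustering on a distance matrix, using base R's `hclust()`. The five cities and their distances are invented for illustration; the assigned .dat file is not reproduced here, and average linkage is chosen only as an example, not as the recommended method.

```r
# Hypothetical 5-city distance matrix (km).
cities <- c("A", "B", "C", "D", "E")
d <- matrix(c(
     0,  100,  150, 1200, 1250,
   100,    0,  120, 1150, 1220,
   150,  120,    0, 1100, 1180,
  1200, 1150, 1100,    0,   90,
  1250, 1220, 1180,   90,    0
), nrow = 5, byrow = TRUE, dimnames = list(cities, cities))

# hclust() expects a "dist" object, so convert the symmetric matrix first.
city_dist <- as.dist(d)

# Agglomerative clustering with average linkage;
# "single", "complete" and "ward.D2" are the other methods named in the brief.
hc <- hclust(city_dist, method = "average")

# Draw the dendrogram, then cut it into two groups.
plot(hc, main = "Average linkage", xlab = "", sub = "")
groups <- cutree(hc, k = 2)
```

Re-running with a different `method` argument and comparing the dendrograms is one way to support the justification the brief asks for.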

iii) Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain and include any relevant R output (including graphics) supporting your choice of k, the number of clusters.
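K-means works on coordinates rather than on a distance matrix, so one possible approach (an assumption here, not necessarily the method taught in the course) is to recover coordinates with classical multidimensional scaling via `cmdscale()` and then run `kmeans()` on them. The toy distances below are built from made-up planar points so that the sketch is self-contained.

```r
# Hypothetical distance matrix built from six invented points.
pts <- rbind(A = c(0, 0),   B = c(1, 0),   C = c(0, 1),
             D = c(10, 10), E = c(11, 10), F = c(10, 11))
city_dist <- dist(pts)

# Classical MDS: recover 2-D coordinates that reproduce the distances.
coords <- cmdscale(city_dist, k = 2)

# Total within-cluster sum of squares for k = 1..5, to support the choice of k.
set.seed(123)
k_values <- 1:5
wss <- sapply(k_values, function(k) {
  kmeans(coords, centers = k, nstart = 20)$tot.withinss
})
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Total within-cluster SS")

# Fit the chosen solution (k = 2 for this toy data).
km <- kmeans(coords, centers = 2, nstart = 20)
```

The value of k where the plotted curve flattens (the "elbow") is a common informal guide to the number of clusters.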

iv) Discussion: (1-2 paragraphs) of your results.

v) Validation: as a form of cluster validation, consider the following:

If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of the new value for k.

If there are no obvious outliers/distances that should be removed, then explain this conclusion with justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that used in part iii) above. Include justification of your choice for the new value for k.

vi) Conclusions: write 2 paragraphs of conclusions based on your analysis including a statement regarding which clustering solution is the better one and why.

Attachment:- Assignment Files.rar

Reviews

len2167047

11/15/2018 3:08:44 AM

To obtain the maximum available marks you should aim to: Code all requested components. Use a clear style of code presentation. Code clarity is an important part of your submission, so you should choose meaningful variable names and use comments. You don't need to comment every single line, as this will affect readability, but you should aim to comment at least each section of code. Have the code run successfully. Output the information in a presentable manner and present your written analysis of the output.

11/15/2018 3:08:37 AM

For the purpose of this exam, a "paragraph" is considered to consist of approximately 6-8 lines. You are welcome to exceed this amount. This exam appears longer than it actually is: explanations are given to help you understand the requested analyses, and I have also provided hints. You do not need to write specialised code as you did for the assignments. You should be able to find nearly all the code you need in the R files provided throughout the course, via case studies and other examples. If you copy/paste code from the R code I have provided, this should give you nearly 100% of the code needed for this exam, with a few alterations on your behalf (e.g. filenames, variable names etc).

11/15/2018 3:08:30 AM

Hints: To make the correlation matrix more readable, use the round() command in R, e.g. round(cor(df), 2) will compute the correlation matrix of the data in df, rounded to two decimal places. You can use this tip for any other matrices too. The best solution may or may not be the rotated solution, based on your randomly selected sessions. Choose your solution based on the principles of a good Exploratory Factor Analysis (EFA). If items are not loading on to a factor, one reason could be that you have not extracted enough factors from the data. Reconsider your analysis if necessary; however, this may not solve the problem. Use the principles of EFA to make your final decision.
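As a small illustration of the rounding hint, the sketch below uses an invented three-column data frame. The number of digits is the second argument of round(), not of cor(), so the closing parenthesis of cor() must come before the 2.

```r
# Hypothetical data frame; the values are made up for illustration.
df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2, 4, 5, 4, 5),
  z = c(5, 3, 4, 2, 1)
)

# Correlation matrix rounded to two decimal places.
cor_rounded <- round(cor(df), 2)
print(cor_rounded)
```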

11/15/2018 3:08:24 AM

While it is desirable to have no split loadings in EFA, a small number may be unavoidable. Again, you should ultimately choose your final solution based on the principles of what constitutes a good Exploratory Factor Analysis. If the correlations between factors suggest an oblique rotation is required, simply note this in your discussion; do not re-run the analysis. For hierarchical clustering, ensure you define the height of the dendrogram according to the size of the values in the output.
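One reading of the dendrogram-height hint is that any cut height should be chosen on the scale of the merge heights reported by the clustering, rather than at an arbitrary number. The sketch below shows this with invented points and complete linkage; both are assumptions made purely for illustration.

```r
# Hypothetical points forming two well-separated pairs.
pts <- rbind(A = c(0, 0), B = c(1, 0), C = c(10, 10), D = c(11, 10))
hc <- hclust(dist(pts), method = "complete")

# Merge heights are on the same scale as the input distances.
merge_heights <- hc$height

# Cut halfway between the last two merges, i.e. across the largest jump here.
cut_height <- mean(tail(merge_heights, 2))
groups <- cutree(hc, h = cut_height)
```

Inspecting `merge_heights` before choosing a cut keeps the chosen height consistent with the units of the distance data.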
