Reference no: EM132167047
Statistical Programming for Data Science Assignment
Question 1 - It's All in the Taste - Experts vs Amateurs
Who is better at discerning the tastes of supermarket chocolate? Do you really need training to know if you like it? Or does it all just taste really good?
The Experts battle it out against a group of dedicated chocolate-eating Amateurs!
The data for this question are the responses to the sensometric qualities of chocolate that can be purchased in supermarkets. Two groups were asked to rate the qualities of the chocolates: the first group contained a panel of sensometric experts with responses recorded over 9 different tasting sessions. The accompanying data is in chocolate_experts.csv.
The second group contained a panel of volunteers chosen to represent 'regular shoppers' who underwent a three-hour sensometric training session before rating the qualities of the chocolate over 2 different tasting sessions. The accompanying data is in chocolate_amateurs.csv.
The responses were recorded over a continuous scale from 0 to 10 with 0 indicating the absence of the sensometric quality and 10 indicating fully present. It is of interest to determine if experts perceive supermarket chocolate differently to non-experts (the amateurs) using 14 sensometric variables (Chocolate Aroma through to Granular Texture in the data files).
For this question you need to randomly obtain two session ids for the expert responses only by making a call to sample as shown below. The two numbers that are returned are your session ids that you need to extract for your analysis.
For the expert data you will only need to analyse the responses corresponding to the two randomly selected session ids. Amateur data needs to be used in full.
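The draw-and-filter step described above can be illustrated as follows. This is only a Python sketch of the idea, not the call supplied with the exam package (the assignment itself must be completed in R). The seed value, the `session` field name and the toy rows are assumptions made for illustration.

```python
import random

def draw_session_ids(seed, n_sessions=9, n_draws=2):
    """Draw n_draws distinct session ids from 1..n_sessions without replacement."""
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return sorted(rng.sample(range(1, n_sessions + 1), n_draws))

def keep_sessions(rows, session_ids, key="session"):
    """Keep only the rows whose session id is one of the drawn ids."""
    wanted = set(session_ids)
    return [row for row in rows if row[key] in wanted]

# Toy expert rows: the field names and values are hypothetical,
# not taken from chocolate_experts.csv.
expert_rows = [{"session": s, "chocolate_aroma": 5.0 + 0.1 * s} for s in range(1, 10)]

session_ids = draw_session_ids(seed=2024)  # the seed is an arbitrary assumption
selected = keep_sessions(expert_rows, session_ids)
print(session_ids, len(selected))
```

Recording the seed alongside the two ids lets the draw be reproduced when the analysis is re-run.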
You are asked to compare the responses between the two groups as requested in each part below. A partially written R script is available as part of the exam package. You must use this script for your analysis and follow the instructions therein. Any marked line requires you to change that line of code to suit your purposes. Further details are provided in the code comments around that line.
For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:
i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data. Include your session IDs for the expert responses, and any data manipulation performed prior to analysis should you do so.
ii) Exploratory Factor Analysis: conduct two separate exploratory factor analyses: the first for the expert responses from your selected session ids, the other for the full set of amateur responses. You may present the analyses side-by-side or in sequence, whichever you believe is best. For each Exploratory Factor Analysis you only need to include the following:
- If appropriate, Cronbach's alpha output and a short discussion (2-3 lines) of whether the data are trustworthy and why.
- Correlation output of your choosing (graphical and/or numerical) with an accompanying discussion (3-4 lines). If numerical, round the correlations to 2 digits.
- A single paragraph explaining the outcome of the determinant test, Bartlett's test of sphericity and the KMO statistic for both data sets. Do not include R output.
- Your decision regarding the number of factors to estimate (a scree plot may be shown; do not show the R console output).
- The FINAL factor solution. You do not need to discuss results of any of the other solutions, however you should justify your final factor solution, including loadings, and name the factors in each analysis. You should also include up to two sentences indicating whether the test of residuals was passed and whether the factors are correlated.
- All factors should be named and an explanation as to how you come up with these names should be included.
- Based on the factor analysis results and your chosen factor names, discuss the factors that have emerged from the study. What types of differences (if any) exist between the expert and amateur sensometric ratings?
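For the Cronbach's alpha item in the list above, it can help to see what the statistic measures. The sketch below computes alpha from its standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), on made-up ratings. It is a Python illustration only; the assignment's R script remains the required tool, and the toy numbers are not drawn from either data file.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(ratings[0])
    if k < 2 or len(ratings) < 2:
        raise ValueError("need at least two items and two respondents")
    # Sample variance of each item (column) across respondents.
    item_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings on a 0-10 scale: 5 respondents x 3 sensometric items.
toy = [[2, 3, 3], [5, 5, 6], [7, 8, 8], [4, 4, 5], [9, 9, 10]]
alpha = cronbach_alpha(toy)
print(round(alpha, 3))
```

Values near 1 indicate that the items move together. A commonly quoted rule of thumb treats values above roughly 0.7 as acceptable, although that threshold is a convention rather than a formal test.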
iii) Conclusions: write 2 paragraphs of conclusions based on your analysis.
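The determinant check and Bartlett's test of sphericity requested in part ii) both start from the correlation matrix R. As a hedged Python illustration (not the R workflow the exam requires), the sketch below computes the determinant of R and the Bartlett statistic, chi-square = -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom. The 3 x 3 matrix and the sample size are invented for the example.

```python
import math

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-15:
            return 0.0  # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det

def bartlett_sphericity(corr, n_obs):
    """Return (chi-square statistic, degrees of freedom) for Bartlett's test."""
    p = len(corr)
    det = determinant(corr)
    if det <= 0:
        raise ValueError("correlation matrix must have a positive determinant")
    stat = -(n_obs - 1 - (2 * p + 5) / 6) * math.log(det)
    dof = p * (p - 1) // 2
    return stat, dof

# Hypothetical correlation matrix for three variables and 100 observations.
corr = [[1.0, 0.6, 0.3],
        [0.6, 1.0, 0.4],
        [0.3, 0.4, 1.0]]
det_r = determinant(corr)
stat, dof = bartlett_sphericity(corr, n_obs=100)
print(det_r, stat, dof)
```

A determinant very close to zero points to multicollinearity, while a large Bartlett statistic relative to its degrees of freedom suggests the correlation matrix is not an identity matrix. The p-value lookup is omitted here because the Python standard library has no chi-square distribution.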
Question 2 - Are We There Yet? Clustering Cities Around the World
The data for this question are distances between cities in different regions of the world.
You will need to use the data set individually assigned to you. The file cities.xlsx on the Assignments page indicates the continent assigned to each student.
Each data set contains a distance matrix and can be found on the assignments page, in a file of the form RegionCitiesClustering.dat. For example, for the European data the file will be called EuropeanCitiesClustering.dat. For this question, you are asked to conduct clustering analysis using both hierarchical and partitional clustering techniques.
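Because the data arrive as a distance matrix, agglomerative (bottom-up) clustering can operate on them directly. The following Python sketch shows the general idea for single, complete and average linkage on an invented four-city matrix. It is not the AGNES implementation available in R, it omits Ward's method, and the city labels and distances are hypothetical.

```python
def agglomerative(dist, labels, linkage="average"):
    """Merge clusters bottom-up; return a list of (members_a, members_b, height)."""
    combine = {
        "single": min,                                   # closest pair
        "complete": max,                                 # farthest pair
        "average": lambda values: sum(values) / len(values),
    }[linkage]
    clusters = [[i] for i in range(len(labels))]  # start with singletons
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters with the smallest linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pair_dists = [dist[i][j] for i in clusters[a] for j in clusters[b]]
                height = combine(pair_dists)
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((
            sorted(labels[i] for i in clusters[a]),
            sorted(labels[i] for i in clusters[b]),
            height,
        ))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# Hypothetical symmetric distance matrix (km) for four made-up cities.
labels = ["A", "B", "C", "D"]
dist = [[0, 100, 900, 1000],
        [100, 0, 850, 950],
        [900, 850, 0, 120],
        [1000, 950, 120, 0]]

merges = agglomerative(dist, labels, linkage="average")
for left, right, height in merges:
    print(left, right, height)
```

The merge heights are what a dendrogram plots on its vertical axis, so comparing them across linkage choices shows how each method treats the gap between groups.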
For the purposes of this exam a paragraph is 8-12 lines of text. Specifically, your analysis should include:
i) Initial Data Discussion: Write a short explanation (approximately 1 paragraph) of the analysis to be performed and an explanation of the data including any data manipulation performed prior to clustering.
ii) Hierarchical clustering: conduct hierarchical clustering on the data, choosing an appropriate AGNES-based method using single, complete or average linkage, or Ward's method. Ensure you justify your choice in your write-up and include the resulting dendrogram, as well as a discussion of the outcomes of hierarchical clustering on your data.
iii) Partitional clustering: conduct a partitional clustering of your data using K-means. Ensure you explain and include any relevant R output (including graphics) supporting your choice of k, the number of clusters.
iv) Discussion: discuss your results (1-2 paragraphs).
v) Validation: as a form of cluster validation, consider the following:
If there are obvious outliers or distances that should be removed, identify these in your write-up and re-run your chosen Partitional Clustering algorithm, adjusting k if necessary. Include justification of your choice of the new value for k.
If there are no obvious outliers/distances that should be removed, then explain this conclusion with justification. In this case re-run your chosen Partitional Clustering algorithm for a different value of k to that used in part iii) above. Include justification of your choice for the new value for k.
vi) Conclusions: write 2 paragraphs of conclusions based on your analysis including a statement regarding which clustering solution is the better one and why.
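Choosing k for K-means, and re-running with a different k during validation, usually rests on comparing the total within-cluster sum of squares (WSS) across candidate values. The Python sketch below illustrates this with a plain Lloyd's algorithm and random restarts. It assumes each city's row of distances is used as its feature vector, which is one possible simplification (another is to embed the cities with multidimensional scaling first). The matrix is invented, and the analysis itself should still be carried out in R.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))

def kmeans(points, k, seed=0, n_starts=10, max_iter=100):
    """Lloyd's algorithm with random restarts; returns (labels, total WSS)."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_starts):
        centers = [list(p) for p in rng.sample(points, k)]
        labels = None
        for _ in range(max_iter):
            new_labels = [min(range(k), key=lambda c: sq_dist(p, centers[c]))
                          for p in points]
            if new_labels == labels:
                break  # assignments stopped changing
            labels = new_labels
            for c in range(k):
                members = [p for p, lab in zip(points, labels) if lab == c]
                if members:  # keep the old centre if a cluster empties
                    centers[c] = [sum(col) / len(members) for col in zip(*members)]
        wss = sum(sq_dist(p, centers[lab]) for p, lab in zip(points, labels))
        if best is None or wss < best[1]:
            best = (labels, wss)
    return best

# Hypothetical distance matrix (km) for four made-up cities; rows act as features.
rows = [[0, 100, 900, 1000],
        [100, 0, 850, 950],
        [900, 850, 0, 120],
        [1000, 950, 120, 0]]

wss_by_k = {k: kmeans(rows, k)[1] for k in range(1, 5)}
labels_k2, wss_k2 = kmeans(rows, 2)
for k, wss in wss_by_k.items():
    print(k, round(wss, 1))
```

Plotting WSS against k and looking for the point where the decrease levels off (the "elbow") is one common way to justify the chosen number of clusters. Comparing the solutions for two different k values on the same plot supports the validation step.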
Attachment: Assignment Files.rar