Reference no: EM133145042
CHM9008M Analytical Data Analysis - University of Lincoln
ASSIGNMENT BRIEF
The aim of the assignment is to answer two personalised problems in different aspects of data analysis, showing the work developed to solve them and explaining the basis of this work (see the problems at the end of this cover sheet). The student will be able to retrieve from Blackboard the problems and any extra files they need to complete the task within two weeks.
ASSESSMENT CRITERIA
The student is requested to be critical at all times, to provide critical evaluations of the techniques and the validity of their use, and to show their work justifying their answers.
The learning outcomes students are expected to cover in this assignment (as taken from the module handbook) are:
1. Demonstrate knowledge and understanding of statistical concepts beyond Bachelor's level that provide a basis or opportunity for originality in developing and/or applying ideas, often within a research context.
2. Apply knowledge and understanding, and problem-solving abilities, in new or unfamiliar environments within broader (or multidisciplinary) contexts related to statistical problems.
3. Show ability to explore and organize data for analysis.
4. Demonstrate proficiency in analyzing data from forensic investigations.
5. Demonstrate ability to select the appropriate methodologies for analysis based on properties of particular data sets.
6. Demonstrate ability to use multiple statistical software packages and use appropriate statistical software for data analysis.
7. Demonstrate ability to use formal mathematical arguments in the context of probability and statistics.
Problem 1. The aim of this exercise is to design an experiment of your choice in any area of experimental sciences, toxicology, analytical sciences or forensic science in which one response variable is measured and two independent variables are involved. Using a factorial experimental design and a design of experiments based on a surface response model (SRM), the student is requested to (i) explain the rationale for the problem and state the research question to be addressed, (ii) produce a dataset containing the values for each variable, (iii) plot the response surface obtained, (iv) provide the equation of the surface, (v) comment on the correlation coefficient, (vi) explain potential interactions between variables (interaction plots) and (vii) discuss the significance of the different variables used in the experimental design using the right tools (e.g. a Pareto chart).
The theoretical dataset you have generated needs to be included in the response as an appendix. The student is also requested to explain each step followed to answer the previous questions and to comment critically on the results.
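As a rough illustration of the kind of workflow Problem 1 asks for, the sketch below builds a two-factor, three-level full factorial design, fits a full quadratic surface by least squares, and derives the coefficient of determination and a Pareto-style ranking of standardised effects. Everything in it is an assumption made for illustration: the coded factor names x1 and x2, the "true" coefficients, the noise level and the choice of NumPy are not part of the brief, and the simulated response is not real data.

```python
import itertools
import numpy as np

# Hypothetical 3-level full factorial design for two factors, in coded units.
levels = [-1.0, 0.0, 1.0]
design = np.array(list(itertools.product(levels, levels)))  # 9 runs
x1, x2 = design[:, 0], design[:, 1]

def model_matrix(x1, x2):
    """Columns: intercept, x1, x2, x1*x2, x1^2, x2^2."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1 * x2, x1**2, x2**2])

terms = ["intercept", "x1", "x2", "x1*x2", "x1^2", "x2^2"]

# Simulated response: these coefficients and the noise level are invented.
rng = np.random.default_rng(0)
true_beta = np.array([50.0, 4.0, -3.0, 2.0, -5.0, -1.5])
X = model_matrix(x1, x2)
y = X @ true_beta + rng.normal(scale=0.5, size=len(x1))

# Least-squares fit of the quadratic surface.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta

# Coefficient of determination (R squared).
ss_res = float(np.sum((y - fitted) ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
r_squared = 1.0 - ss_res / ss_tot

# Standardised effects (|t| values), the quantity usually shown in a Pareto chart.
dof = len(y) - X.shape[1]
sigma2 = ss_res / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_abs = np.abs(beta / std_err)
pareto = sorted(zip(terms[1:], t_abs[1:]), key=lambda t: t[1], reverse=True)

print("Surface: y = " + " + ".join(f"({b:.2f})*{t}" for t, b in zip(terms, beta)))
print(f"R^2 = {r_squared:.4f}")
for name, t in pareto:
    print(f"{name:6s} |t| = {t:.2f}")
```

The x1*x2 coefficient is the term that captures the interaction between the two variables. With only nine runs and six fitted terms, three residual degrees of freedom remain, so any significance statement drawn from such a small design deserves critical comment.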
Problem 2. You are a data scientist working for the NHS. A dataset containing a series of variables associated with liver health has been handed to you. The dataset has been anonymised, following ethical rules, and each patient is denoted by a number. An evaluation of the status of the liver health is given in the column 'category'. This column is your dependent variable, which establishes whether a patient is prone to liver disease, and the type of disease (1=hepatitis, 2=fibrosis or 3=cirrhosis), or is not suffering from any liver condition (0=healthy). You may want to use these full names as your class column if you are to perform a cross validation of your results. The dataset is given in the file "liver health", which can be found in the assessment section together with this assessment brief. The internal medicine unit of the County Hospital in Lincoln has asked you to evaluate the data and to determine whether a selection of variables can predict, or be associated with, liver conditions. You will have to select a minimum of 6 independent variables from the dataset to perform your analysis, using the patient number and the status as fixed variables for your supervised analysis. You have 180 patient records to analyse.
You are also specifically tasked to perform the following multivariate analyses:
(a) Using the data set, select 6 (or more) variables and use these to produce a principal component analysis (PCA) using all, or a selection, of the patient records given. Comment on the results obtained.
(b) By selecting the different principal components, produce three graphs showing the clustering obtained for the different principal components and comment on the results (e.g. PC1 vs PC2, PC1 vs PC3, PC2 vs PC3).
(c) Produce a cross validation table (confusion matrix) for each of the graphs produced in (b), indicating the % of error obtained for each cluster, and comment on the results.
(d) Produce a correlation scatter plot for each of the graphs produced in section (b) and comment on the significance of the effect the different compounds have on the clustering.
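Tasks (a) to (d) could be approached along the lines of the sketch below, which autoscales the data, obtains scores, loadings and explained variance from a singular value decomposition, and builds a leave-one-out confusion matrix for each pair of principal components. It rests on several assumptions not stated in the brief: the data are randomly generated stand-ins rather than the "liver health" file, NumPy is used instead of a dedicated statistics package, and the nearest-centroid rule is only one possible way of assigning patients to classes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-in data: 4 classes x 15 patients x 6 variables.
n_per_class, n_vars, n_classes = 15, 6, 4
labels = np.repeat(np.arange(n_classes), n_per_class)
class_means = rng.normal(scale=3.0, size=(n_classes, n_vars))
X = class_means[labels] + rng.normal(size=(labels.size, n_vars))

# Autoscale (mean 0, unit variance) because the variables have different units.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# PCA via singular value decomposition of the autoscaled matrix.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s                    # patient coordinates on each PC
loadings = Vt.T                   # rows = variables, columns = PCs
explained = s**2 / np.sum(s**2)   # fraction of total variance per PC

def loo_confusion(points, labels, n_classes):
    """Leave-one-out nearest-centroid classification.
    Rows of the returned matrix are true classes, columns are predicted."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    idx = np.arange(len(points))
    for i in idx:
        keep = idx != i  # hold out patient i when computing centroids
        centroids = np.array(
            [points[keep & (labels == c)].mean(axis=0) for c in range(n_classes)]
        )
        pred = int(np.argmin(np.linalg.norm(centroids - points[i], axis=1)))
        cm[labels[i], pred] += 1
    return cm

print("Explained variance:", np.round(explained, 3))
for a, b in [(0, 1), (0, 2), (1, 2)]:
    cm = loo_confusion(scores[:, [a, b]], labels, n_classes)
    error_pct = 100.0 * (1.0 - np.diag(cm) / cm.sum(axis=1))
    print(f"PC{a + 1} vs PC{b + 1}: % error per class = {np.round(error_pct, 1)}")
```

Plotting two columns of `loadings` against each other shows which variables drive the separation seen in the matching score plot. Note that the PCA here is fitted on all patients before the leave-one-out step, so the reported error is slightly optimistic; refitting the PCA inside the loop would be the stricter choice.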
Also, using the same data file you are requested to:
(e) produce a Hierarchical Cluster Analysis (HCA) based on the patient records and independent variables and comment on the results.
(f) using the characterisation of variables, explain the relevance and weight of each of the variables used in the clustering.
(g) explain the advantages and disadvantages of using each technique (PCA and HCA) for this dataset, based on the results obtained.
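To make the idea behind task (e) concrete, the following is a minimal sketch of agglomerative clustering with average linkage and Euclidean distance, written out by hand so each merge step is visible. The linkage method, the distance metric, the function name `average_linkage` and the two artificial groups of records are all assumptions chosen for illustration; the brief does not prescribe any of them, and in practice a library routine such as `scipy.cluster.hierarchy.linkage` would normally be used.

```python
import numpy as np

def average_linkage(Z, n_clusters):
    """Naive agglomerative clustering (average linkage, Euclidean distance).
    Returns a cluster index per row and the distance at each merge."""
    clusters = [[i] for i in range(len(Z))]
    # Pairwise Euclidean distances between all records.
    dist = np.linalg.norm(Z[:, None, :] - Z[None, :, :], axis=2)
    merge_heights = []
    while len(clusters) > n_clusters:
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance between all members of the two clusters.
                d = dist[np.ix_(clusters[i], clusters[j])].mean()
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merge_heights.append(float(d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    assignment = np.empty(len(Z), dtype=int)
    for k, members in enumerate(clusters):
        assignment[members] = k
    return assignment, merge_heights

# Hypothetical records: two artificial, well-separated groups of 10 patients each.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 1.0, size=(10, 6)),
               rng.normal(10.0, 1.0, size=(10, 6))])
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # autoscale before clustering

assignment, merge_heights = average_linkage(Z, n_clusters=2)
print("Cluster per record:", assignment)
print("Last merge distances:", np.round(merge_heights[-3:], 2))
```

The recorded merge distances are the heights at which branches join in a dendrogram, so a large jump between consecutive values suggests a natural number of clusters. Unlike PCA, the result depends on the chosen linkage and distance, which is worth discussing under task (g).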
Annex to the question:
1) X (Patient ID/No.)
2) Category (diagnosis) (values: '0=healthy', '1=Hepatitis', '2=Fibrosis', '3=Cirrhosis')
3) Age (in years)
4) Sex (f, m); you may want to transform this into a numerical variable by assigning 1 or 0 to each sex.
Attributes 5 to 14 refer to laboratory data:
5) ALB - albumin
6) ALP - alkaline phosphatase
7) ALT - alanine aminotransferase
8) AST - aspartate aminotransferase
9) BIL - bilirubin
10) CHE - acetylcholinesterase
11) CHOL - cholesterol
12) CREA - creatinine
13) GGT - gamma-glutamyl transferase
14) PROT - protein
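Before any multivariate analysis, the attributes listed above need to be turned into numbers. The sketch below shows one possible way to split the category label into a code and a class name, encode sex as 0 or 1, and set aside records with missing laboratory values. The rows are made-up placeholders and are not taken from the "liver health" file; the exact column spellings, the "0=healthy" label format (inferred from the annex) and the possibility of blank cells are all assumptions.

```python
# Made-up placeholder rows for illustration only (NOT real patient data).
rows = [
    {"X": "1", "Category": "0=healthy", "Age": "32", "Sex": "m", "ALB": "38.5", "ALT": "7.7"},
    {"X": "2", "Category": "1=Hepatitis", "Age": "45", "Sex": "f", "ALB": "41.0", "ALT": "60.2"},
    {"X": "3", "Category": "3=Cirrhosis", "Age": "59", "Sex": "m", "ALB": "", "ALT": "12.4"},
]

# Arbitrary coding choice: the brief only asks for 1 or 0 to be assigned to sex.
SEX_CODES = {"f": 0, "m": 1}

def parse_row(row, variables):
    """Convert one text record into numeric fields plus a class label."""
    code, name = row["Category"].split("=", 1)
    values = [float(row[v]) if row[v] != "" else None for v in variables]
    return {
        "patient": int(row["X"]),
        "class_code": int(code),
        "class_name": name,
        "sex": SEX_CODES[row["Sex"]],
        "age": int(row["Age"]),
        "values": values,
    }

variables = ["ALB", "ALT"]  # a real analysis would select at least six variables
parsed = [parse_row(r, variables) for r in rows]

# Keep only records with no missing laboratory values.
complete = [p for p in parsed if None not in p["values"]]
print(f"{len(complete)} of {len(parsed)} records are complete")
```

Dropping incomplete records is only one option; whichever treatment of missing values is chosen, it should be stated and justified in the report.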
Attachment: Analytical Data Analysis.rar