What are some limitations of multivariate methods generally

Reference no: EM132320255, Length: 15

Multivariate Analysis for High-Dimensional Data Assignment - Final Project

Instructions - There are two parts to this assessment. Part A is the project analysis of data as described in detail on pages 2-4 of this document. Your submission will be no more than 6-9 pages for Part A. Part B is the completion of a set of 6 questions on pages 5-6 of this document. Additionally, you will submit one R script file for your work in Part A.

Part A -

Objective: The purpose of this project is to provide you with an opportunity to demonstrate an advanced level of synthesis, understanding and communication of the concepts, statistical methods and practical analyses within R that you have learnt throughout this course.

The Data: This data comes from a water quality study [1] where samples were taken from sites on different European rivers during a period of approximately one year. These samples were analysed for various chemical substances and, in parallel, algae samples were collected to determine the algae population distributions for seven algal species.

The impact on the environment of toxic waste, from a wide variety of manufacturing processes, is well known. It has also become clear that the subtler effects of nutrient level and chemical balance changes arising from farming land run-off and sewage water treatment have a serious, but indirect, effect on the states of rivers, lakes and even the sea. Research has focused on the relationships among groups of chemical variables, among algae species, and between chemical and biological variables. The influence of season, river size and river flow rate is also considered important.

There are a total of 200 cases (rows/rivers), each containing 18 values (columns/variables). The first 3 variables in the data set are the season, river size, and fluid velocity of the river. The next eight variables are the chemical concentrations that should be relevant to the algae distribution: pH (measured on a scale of 1 to 14) and nitrogen, nitrates, nitrites, ammonia, phosphate, oxygen and chloride (all measured in mg/L). The last seven values of each row are the amounts of the different kinds of algae.

These 7 kinds are only a very small part of the whole algae community. The value 0.0 means that the frequency is very low. The data set also contains some missing data which are labelled with the string XXXXX.
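As a minimal sketch of how the XXXXX marker might be handled on import, the R snippet below reads a tiny inline stand-in for the data file. The column names and values are hypothetical (the real headers of river.csv are not given here), and dropping incomplete cases is only one possible cleaning choice, not necessarily the expected one.

```r
# Hypothetical miniature of the data file: column names and values are
# assumptions for illustration, not the real contents of river.csv.
csv_text <- "season,river_size,river_vel,pH,nitrates
winter,small,medium,8.00,6.238
spring,small,medium,XXXXX,1.288
autumn,large,high,8.10,XXXXX
summer,medium,low,7.90,3.330"

# na.strings tells read.csv to treat the literal string XXXXX as missing (NA),
# so the chemical columns are parsed as numbers rather than as text.
river <- read.csv(text = csv_text, na.strings = "XXXXX",
                  stringsAsFactors = TRUE)

# Count missing values per column before deciding how to clean.
missing_per_column <- colSums(is.na(river))
print(missing_per_column)

# One possible cleaning choice (an assumption): keep only complete cases.
river_clean <- na.omit(river)
print(nrow(river_clean))
```

With the real file, the same `na.strings` argument would presumably be passed to `read.csv("river.csv", ...)`; whatever cleaning rule is chosen should be stated so that the analysis can be repeated.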

A consultancy firm has asked you to explore this data and address three specific aspects of interest (Tasks 1, 2 and 3 below) for their client, and then report your process (what you have done and why) and findings in a written report. Before beginning the Tasks, you may need to do some data cleaning due to missing data or outliers. All analysis for the following tasks should be based on your cleaned data set and the structure/characteristics of your final data set should be well defined.

For this exercise assume that the data meets any required MVN assumptions (do not test for UVN or MVN).

The Tasks:

Task 1 - The client would like to know the number of rivers in the sample after cleaning. In addition, they require the number of rivers measured in each season and in each river size category, along with appropriate summary statistics/plots for each of the 8 chemical variables individually.

Action: Clean the data as you think necessary and then provide a frequency table of the number of rivers measured in each season, and then each river size. Determine appropriate summary statistics for each of the 8 chemical variables and the best way to present this information. Interpret interesting aspects of this data summary.

They would also like to know how the combinations of river size and velocity relate to one another based on the chemical variables. Which groups are most similar, and which are most different?

Action: First, create a new variable called 'river_size_vel' with categories that are combinations of the 3 River_Size and 3 River_Vel categories, and provide a frequency table of the number of rivers in each new category. To show the multivariate relationships among the categories of the new 'river_size_vel' variable, present the dendrogram and MDS plot that best represent the relationships. As part of your interpretation, explain which types of distances and clustering methods you used and why.
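One way the combined variable could be built is sketched below in R. The toy data frame and its category labels are hypothetical, and `interaction()` is just one reasonable option (pasting the two labels together would also work).

```r
# Hypothetical toy data: the category labels are assumptions for illustration.
toy <- data.frame(
  River_Size = factor(c("small", "small", "medium", "large", "large", "medium")),
  River_Vel  = factor(c("low", "high", "medium", "low", "low", "medium"))
)

# interaction() builds one factor level per size-by-velocity combination;
# drop = TRUE discards combinations that never occur in the data.
toy$river_size_vel <- interaction(toy$River_Size, toy$River_Vel,
                                  sep = "_", drop = TRUE)

# Frequency table of the combined categories.
size_vel_counts <- table(toy$river_size_vel)
print(size_vel_counts)
```

With `drop = TRUE`, combinations that never occur are left out of the frequency table; whether empty combinations should be reported is a judgement call to explain in the report.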

Task 2 - The client would like to know if there are significant differences among the four seasons in terms of average river health as indicated by all of the chemical and algae variables.

Action: Select the best method (from those covered in this course only) to explore this question, perform the analysis and interpret. Include in your answer appropriate p-values for all significance tests performed.

Task 3 - The client would like to know if the season can be predicted based on all chemical and algae variables. The client is only interested in data related to Autumn, Spring and Summer.

Action: Select the best method (from those covered in this course only) to explore this question, perform the analysis, explain all relevant details of your process and interpret the results.

Instructions -

You will submit ONE pdf file for your project which will include Part A of your report addressing Task 1, 2 and 3 and Part B. All sections of Part A should be clearly labelled. You will also submit ONE R script file with all analysis code for Part A clearly labelled and commented.

You should not include any R code in your pdf report submission.

For each of Task 1, 2 and 3 your report should be no more than 2-3 pages (i.e. no more than 9 pages total). Any additional pages for Part A will not be marked. This means you will need to be very concise and clear in the information you provide in your final report.

Things to include for each of Task 1, 2 and 3:

  • Was any data cleaning necessary (removal of cases with missing data, outliers, etc.)?
  • Did you need to subset, rearrange or summarise the data in order to complete your analysis?
  • What analysis was performed and what specific choices were made in the analysis process that would be necessary to know for the analysis to be repeated?
  • What are the important aspects of the results to convey and did the analysis successfully address the aims of the company management?
  • Are there any caveats/limitations you would place on the results or suggestions for future analysis?
  • Do the results for each Task relate to previous Tasks?

Part B -

Include your responses to these Part B questions at the end of your Part A pdf submission (i.e. only one pdf file is to be submitted for this whole project assessment item).

Question 1 - Recreate and complete the table below by indicating which features are relevant to each method.

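Feature                                           | MANOVA | PCA | FA | DFA | CCA | CA | MDS
Eigen analysis                                    |        |     |    |     |     |    |
Distance matrix                                   |        |     |    |     |     |    |
Data/Dimension reduction                          |        |     |    |     |     |    |
Classification                                    |        |     |    |     |     |    |
Can be used to identify group structure/clusters  |        |     |    |     |     |    |
Need independent a priori categorical variable(s) |        |     |    |     |     |    |
Ordination method                                 |        |     |    |     |     |    |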

Question 2 - Construct, by hand, a simple nearest-neighbour dendrogram from the distance matrix below. Do not produce the dendrogram in R. Use the distances to 'sketch' the relationships.

 

  | 1        | 2        | 3        | 4
2 | 1.912370 |          |          |
3 | 5.382450 | 7.120542 |          |
4 | 3.385996 | 5.059430 | 2.138709 |
5 | 1.512238 | 3.190303 | 4.575420 | 2.910661
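For reference, nearest-neighbour (single-linkage) clustering merges the two closest clusters at each step, then takes the distance from the merged cluster to any other cluster as the smaller of the two original distances. In the expression below, A, B and C are generic cluster labels introduced only for illustration; they do not refer to particular rows of the matrix.

```latex
d(A \cup B,\; C) = \min\{\, d(A, C),\; d(B, C) \,\}
```

The height at which two clusters join in the dendrogram is the distance at which they were merged.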

Question 3 - Calculate by hand the Euclidean distance between individuals 1 and 2 for variables X1 and X2. Show all working.

 

  | X1    | X2
1 | -0.46 | -0.46
2 | -1.41 | -1.79
3 |  1.78 |  1.48
4 |  0.60 |  0.55
5 |  0.13 |  0.31
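As a reminder of the general form, the Euclidean distance between two individuals i and j measured on two variables can be written as below. The subscript notation is an assumption made for illustration: x_{i1} is the X1 value and x_{i2} the X2 value of individual i.

```latex
d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2}
```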

Question 4 - What are some limitations or disadvantages of multivariate methods generally? (no more than 300 words).

Question 5 - Explain your understanding of eigenvectors and eigenvalues (your answer must be in your own words and will be checked using a plagiarism checker) (no more than 300 words).

Question 6 - Based on the Parallel Analysis table below, how many factors would you interpret? Explain your answer.

Factor | Actual eigenvalue | 95th percentile
1      | 2.45              | 1.99
2      | 1.98              | 1.89
3      | 1.13              | 1.14
4      | 1.02              | 1.08
5      | 0.89              | 1.03


Please read this document fully and carefully. You should complete lecture and tutorial examples before completing this project. There are two parts to this assessment. Part A is the project analysis of data as described in detail on pages 2-4 of this document. Your submission will be no more than 6-9 pages for Part A. Part B is the completion of a set of 6 questions on pages 5-6 of this document. Part A and B should be submitted together in one pdf file to the link provided on the StudyDesk.

Additionally, you will submit one R script file for your work in Part A. Your pdf file and your R file should be named “your name STA8005 project.pdf” and “your name STA8005 project.R” respectively. You will lose presentation marks if you do not follow this naming structure. Marks distribution: Part A: 65 marks total - Task 1: 30 marks, Task 2: 20 marks and Task 3: 15 marks. Part B: 30 marks total - 6 Questions: 5 marks each. Presentation: 5 marks.

Do not include an Appendix.

To help succinctly convey results you can include tables or figures of results within the page limit, but you must also explain them in text. You can also use dot-points to itemise presentation of information if that helps. Do not perform any transformations on any of the variables to help normalise them for any analysis. For the purpose of this exercise assume that the data meets univariate and MVN requirements. Your R script file should reference only one input data file – river.csv, provided to you on the Study Desk.

All sub-setting and data manipulation must be done in R. Do not change the data file river.csv before importing it into R. Do not be misled by the fact that you will submit no more than 6-9 pages for Part A. These analyses will require a time-consuming trial-and-error approach to ensure all data has been cleaned, subset and/or restructured correctly for the analysis, and then time for you to consider all options and choose the final analyses that you think will address each task best. To correctly address the tasks, you will need to spend time ensuring you have the correct R code. You have been given examples of all R code needed to correctly clean, subset and/or reorganise your data and perform the necessary analyses. Remember that there are many helpful websites, findable via Google, to help you solve R coding issues.
