Reference no: EM132375108
Homework
Answer every question in each problem and include a discussion of how your results address the question.
Submit your .rmd file with the knitted PDF (or knitted Word Document saved as a PDF). If you are having trouble with .rmd, let us know and we will help you, but both the .rmd and the PDF are required.
This file can be used as a skeleton document for your code/write up. Please follow the instructions found under Content for Formatting and Guidelines. No code should be in your PDF write-up unless stated otherwise.
Please do the following problems from the textbook R Handbook, as stated.
1. The galaxies data from MASS contains the velocities of 82 galaxies from six well-separated conic sections of space (Postman et al., 1986; Roeder, 1990). The data are intended to shed light on whether or not the observable universe contains superclusters of galaxies surrounded by large voids. The evidence for the existence of superclusters would be the multimodality of the distribution of velocities.
a) Construct histograms using the following functions:
-hist() and ggplot()+geom_histogram()
-truehist() and ggplot()+geom_histogram() (pay attention to the y-axis!)
-qplot()
Comment on the shape and distribution of the variable based on the three plots. (Hint: Also play around with binning)
b) Create a new variable loggalaxies = log(galaxies). Construct histograms using the functions in part a) and comment on the shape and differences.
c) Construct kernel density estimates using two different choices of kernel function and three choices of bandwidth (one that is too large and "oversmooths," one that is too small and "undersmooths," and one that appears appropriate). You should therefore have six different kernel density estimate plots. Discuss your results. You can use the log scale or the original scale for the variable.
d) What is your conclusion about the possible existence of superclusters of galaxies? How many superclusters (1, 2, 3, ...)?
e) Fit a model-based clustering (Mclust) to the data. How many clusters did it find? Does this match your answer from (d) above? Report the parameter estimates and BIC of the best model.
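As a starting point for parts a) and c), the following is a minimal sketch, not a model answer. It assumes only base R plus the MASS package. The rescaling to units of 1000 km/s, the number of bins, and the three bandwidth values are illustrative guesses chosen for this sketch; you are expected to experiment and justify your own choices.

```r
library(MASS)  # provides the galaxies data and truehist()

# Rescale to units of 1000 km/s for readable axes (an assumption of this sketch).
vel <- galaxies / 1000

# Part a): hist() plots counts by default, truehist() plots densities,
# so the two y-axes are on different scales.
op <- par(mfrow = c(1, 2))
hist(vel, breaks = 20, main = "hist(): counts", xlab = "velocity (1000 km/s)")
truehist(vel, nbins = 20, xlab = "velocity (1000 km/s)")
title(main = "truehist(): density")
par(op)

# Part c): two kernels x three bandwidths = six density estimates.
# The bandwidth values below are illustrative, not recommended settings.
kernels <- c("gaussian", "epanechnikov")
bandwidths <- c(undersmoothed = 0.2, moderate = 1, oversmoothed = 5)

fits <- list()
op <- par(mfrow = c(2, 3))
for (k in kernels) {
  for (b in names(bandwidths)) {
    fit <- density(vel, bw = bandwidths[[b]], kernel = k)
    fits[[paste(k, b)]] <- fit
    plot(fit, main = paste0(k, ", bw = ", bandwidths[[b]], " (", b, ")"))
    rug(vel)  # mark the raw observations along the x-axis
  }
}
par(op)
```

In ggplot2, a histogram on the density scale can be obtained by mapping y to after_stat(density) inside geom_histogram(), which makes it comparable with truehist().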
2. The birthdeathrates data from HSAUR3 gives the birth and death rates for 69 countries (from Hartigan, 1975).
a) Produce a scatterplot of the data and overlay a contour plot of the estimated bivariate density.
b) Does the plot give you any interesting insights into the possible structure of the data?
c) Construct the perspective plot (persp() in R; ggplot is not required for this question).
d) Fit a model-based clustering (Mclust). Provide plots summarizing your fit (BIC, classification, uncertainty, and density).
e) Discuss the results (structure of data, outliers, etc.). Write a discussion in the context of the problem.
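One possible way to set up parts a), c), and d) is sketched below. It assumes the MASS, mclust, and HSAUR3 packages are installed, and it uses MASS::kde2d() for the bivariate density estimate, which is one option among several rather than a required choice. The grid size and viewing angles are arbitrary values picked for this sketch.

```r
library(MASS)    # kde2d() for the bivariate kernel density estimate
library(mclust)  # Mclust() for model-based clustering

data("birthdeathrates", package = "HSAUR3")  # columns: birth, death

# Bivariate kernel density on a 50 x 50 grid (grid size is an arbitrary choice).
dens <- kde2d(birthdeathrates$birth, birthdeathrates$death, n = 50)

# Part a): scatterplot with the density contours overlaid.
plot(birthdeathrates$birth, birthdeathrates$death,
     xlab = "birth rate", ylab = "death rate")
contour(dens, add = TRUE)

# Part c): perspective plot of the same estimate (angles chosen by eye).
persp(dens$x, dens$y, dens$z, theta = -35, phi = 30,
      xlab = "birth rate", ylab = "death rate", zlab = "density")

# Part d): Mclust selects the covariance model and number of components by BIC.
fit <- Mclust(birthdeathrates)
print(summary(fit, parameters = TRUE))

# The four summary plots requested in part d).
op <- par(mfrow = c(2, 2))
for (w in c("BIC", "classification", "uncertainty", "density")) {
  plot(fit, what = w)
}
par(op)
```

The uncertainty plot is a natural place to look for countries that sit between clusters or far from all of them when writing the discussion for part e).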
3. A sex difference in the age of onset of schizophrenia was noted by Kraepelin (1919). Subsequent epidemiological studies of the disorder have consistently shown an earlier onset in men than in women. One model that has been suggested to explain this observed difference is known as the subtype model, which postulates two types of schizophrenia: one characterized by early onset, typical symptoms, and poor premorbid competence, and the other by late onset, atypical symptoms, and good premorbid competence. The early onset type is assumed to be largely a disorder of men and the late onset type largely a disorder of women. Fit finite mixtures of normal densities separately to the onset data for men and women given in the schizophrenia data from HSAUR3. See if you can produce some evidence for or against the subtype model.
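A minimal sketch of the separate mixture fits follows, assuming the mclust and HSAUR3 packages are installed and that Mclust is an acceptable tool for fitting the normal mixtures (the problem does not name a specific function). Letting BIC choose the number of components is a choice made for this sketch; fixing two components per group with G = 2 would test the subtype model more directly.

```r
library(mclust)

data("schizophrenia", package = "HSAUR3")  # columns: age, gender

# Fit a univariate normal mixture to the onset ages of each gender separately.
# By default Mclust compares 1 to 9 components and picks the best model by BIC.
fits <- list()
for (g in unique(as.character(schizophrenia$gender))) {
  age_g <- schizophrenia$age[schizophrenia$gender == g]
  fits[[g]] <- Mclust(age_g)
}

# Report the selected model, component means, and mixing proportions per group.
for (g in names(fits)) {
  cat("\n==", g, "==\n")
  print(summary(fits[[g]], parameters = TRUE))
}
```

Comparing the estimated component means and mixing proportions between the two groups is one way to look for evidence: under the subtype model, an early-onset component would be expected to carry more weight for men and a late-onset component more weight for women.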