Perform an initial exploratory data analysis using plots

Reference no: EM13837259

Portfolio task 1 - Regression

In the real world, you will be expected to communicate the results from a statistical analysis you perform to non-statisticians, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.

Task 1 The following dataset is taken from a paper by Stanley and Miller (1979) regarding the values of various variables for 22 US aircraft. The variables are as follows:

FFD first flight date, in months after January 1940. This variable is a proxy for how technologically advanced the aircraft is.
SPR specific power, proportional to power per unit weight

RGF flight range factor, categorised into three levels: low: flight range factor < 4.2; med: 4.2 ≤ flight range factor ≤ 4.87; high: flight range factor > 4.87

PLF payload as a fraction of gross weight of aircraft

SLF sustained load factor

CAR a binary variable which takes the value 1 if the aircraft can land on a carrier, 0 otherwise.

The data are available in the files jet2.csv, jet2.mtw and jet2.sav.

You are required to analyse these data using appropriate regression techniques.

a) Perform an initial exploratory data analysis using plots and the usual descriptive statistical methods. This should include individual plots of FFD against the three continuous variables, with the data grouped by the two categorical variables.

As a consequence, discuss the suitability of the data for regression analysis.

b) Fit a regression model using all the main effects (you will need to produce suitable dummy variables), and in addition, the following 2-way interactions:

RGF×SLF; RGF×SPR; RGF×PLF; RGF×CAR; CAR×SPR; CAR×PLF; CAR×SLF

N.B. Interactions involving RGF require two dummy variable levels.
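The dummy coding mentioned in the N.B. can be sketched as follows. This is a minimal illustration, assuming "low" is chosen as the baseline level; the rows, the helper name add_rgf_dummies and the column names are all hypothetical and are not taken from jet2.csv.

```python
# Illustrative rows only; these are NOT values from jet2.csv.
rows = [
    {"RGF": "low", "SLF": 1.5},
    {"RGF": "med", "SLF": 2.0},
    {"RGF": "high", "SLF": 3.0},
]


def add_rgf_dummies(row, baseline="low"):
    """Return a copy of row with two RGF dummies and the RGF x SLF products."""
    out = dict(row)
    for level in ("low", "med", "high"):
        if level == baseline:
            continue  # the baseline level is absorbed by the intercept
        dummy = 1 if row["RGF"] == level else 0
        out[f"RGF_{level}"] = dummy
        # A three-level factor crossed with one continuous variable
        # contributes two interaction columns, one per dummy.
        out[f"RGF_{level}_x_SLF"] = dummy * row["SLF"]
    return out


coded = [add_rgf_dummies(r) for r in rows]
for r in coded:
    print(r)
```

The same pattern would apply to RGF×SPR, RGF×PLF and RGF×CAR; choosing a different baseline changes the interpretation of the coefficients but not the fitted values.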

Fully discuss your results.

Produce an overlay plot of the Studentized and deleted residuals against the predicted values. Discuss your results. scatterplot and correlation coefficient using Minitab, and use these to assess the suitability for regression analysis.

c) Attempt to reduce this model by comparing R2 and adjusted R2 results for the best model of each size. (I recommend using a Best Subsets procedure and plotting your results in Excel.) Is this appropriate? What difficulties do you encounter?

d) Use your results above and your knowledge of other aspects of regression modelling techniques to explain why the approach of fitting all interactions is inappropriate for these data.
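One ingredient for parts c) and d) is a simple count of the terms in the model from part b) against the 22 observations. The sketch below shows that arithmetic and the usual adjusted R2 formula. It assumes RGF is coded with two dummy variables; the R2 value of 0.90 is a made-up number used only to show the size of the adjustment, not a result from the data.

```python
def adjusted_r2(r2, n, p):
    """Adjusted R2 for a model with p predictors (excluding the intercept)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)


n = 22  # aircraft in the dataset

main_effects = 4 + 2       # SPR, PLF, SLF, CAR plus two RGF dummies
rgf_interactions = 4 * 2   # RGF x {SLF, SPR, PLF, CAR}, two dummies each
car_interactions = 3       # CAR x {SPR, PLF, SLF}

p = main_effects + rgf_interactions + car_interactions
residual_df = n - p - 1    # minus one more for the intercept

print(p, residual_df)
# Hypothetical R2 of 0.90, purely to illustrate the penalty.
print(adjusted_r2(0.90, n, p))
```

With so few residual degrees of freedom, R2 can look high while adjusted R2 falls away sharply, which is one reason to question fitting every interaction here.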

e) Find a suitable model for these data using only the main list of variables (SPR, RGF, PLF, SLF and CAR). Use appropriate methodology to identify this model.

This should include:

• Use of Standard Regression output and appropriate interpretation

• Residual diagnostics

• Influence diagnostics

• Other appropriate diagnostics

• Use of transformations

• Use of selection methods [R2, Cp, stepwise etc.]

• Use of suitable overall approach

f) Use your model to find the predicted FFD of a US aircraft with SPR = 4.0, RGF = medium, PLF = 0.16, SLF = 3.0 and CAR = 0.

Calculate a suitable confidence interval for your prediction.
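For reference, the two standard intervals for a least-squares fit are written out below. The notation is an assumption introduced here, not taken from the brief: x_0 is the row of predictor values for the new aircraft (including the leading 1 for the intercept and the dummy coding for RGF = medium), X is the design matrix, s is the residual standard error, n is the number of observations and k is the number of estimated coefficients including the intercept.

```latex
% Interval for the mean FFD of all aircraft with characteristics x_0
\hat{y}_0 \pm t_{1-\alpha/2,\; n-k}\; s\,\sqrt{x_0^{\top}(X^{\top}X)^{-1}x_0}

% Interval for the FFD of one new aircraft with characteristics x_0
\hat{y}_0 \pm t_{1-\alpha/2,\; n-k}\; s\,\sqrt{1 + x_0^{\top}(X^{\top}X)^{-1}x_0}
```

The second interval is wider because it includes the variability of an individual aircraft around the fitted mean. If the final model uses a transformed response, the endpoints would need to be back-transformed to the FFD scale.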

Portfolio tasks 2 and 3

Task 2 involves using Analysis of Variance to explore the factors which are likely to affect the value of customer transactions made to a bank in the Czech Republic.

You are required to investigate both main effects of the explanatory variables, and their interactions.

To complete it, you will need to refer to material from the third and fourth lectures and the relevant chapters of Field (2013).

Task 3 is a logistic regression exercise, based on lectures 7 and 8, as well as the appropriate chapter from Field (2013).

Task 2 The credit dataset contains data on transactions by account holders at a bank in the Czech Republic. The data are adapted from a dataset issued for a data mining competition held ahead of the third international conference on Principles and Practice of Knowledge Discovery in Databases (PKDD). This conference was held in Prague in 1999. The variables are:

Tcredit a transformation of the average value of credits made per day

Sex F/M

Second Y if there is a second account holder; N if there is not a second account holder

Loan yes if the account holder has a loan with the bank; no if the account holder does not have a loan with the bank

Card yes if the account holder has a credit card with this bank; no if the account holder does not have a credit card with this bank

Region one of: Prague; south Bohemia; north Bohemia; west Bohemia; central Bohemia; east Bohemia; north Moravia; south Moravia.

The data are available in the files credit.csv, credit.mtw and credit.sav. There are 4,500 observations.

a) Produce suitable plots to investigate the relationship between each of the explanatory variables and Tcredit. Comment on your results.

b) Fit models in order to predict Tcredit using:

i. The model containing all main effects.

ii. The model containing all main effects and all two-way interactions.

iii. The model containing all main effects, all two-way interactions and all three-way interactions.

iv. The model containing all main effects, all two-way interactions, all three-way interactions and all four-way interactions.

c) Carry out suitable tests to compare the models fitted above. In particular, you should compare:

i. model iii) with model iv)

ii. model ii) with model iii)

iii. model i) with model ii)

Hence, explain which of the four models fitted in b) above is the most appropriate.
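One common test for comparing nested ANOVA/regression models such as these is the extra-sum-of-squares (partial) F test. The sketch below computes the statistic only; the function name and the residual sums of squares and degrees of freedom in the example are hypothetical placeholders, to be replaced by the values in your own software output.

```python
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    """F statistic for a reduced model nested within a fuller model.

    rss_* are residual sums of squares; df_* are residual degrees of freedom.
    The result is referred to an F distribution on
    (df_reduced - df_full, df_full) degrees of freedom.
    """
    extra_df = df_reduced - df_full
    numerator = (rss_reduced - rss_full) / extra_df
    denominator = rss_full / df_full
    return numerator / denominator


# Hypothetical numbers, not output from credit.csv.
f_stat = partial_f(rss_reduced=120.0, df_reduced=100, rss_full=100.0, df_full=90)
print(f_stat)
```

A small F statistic (large p-value) suggests the extra interaction terms in the fuller model are not needed, so the comparison moves down to the next pair of models.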

d) Using the model you selected in c) above, reduce it by removing components one at a time (e.g. one interaction term) as appropriate.
Continue with this reduction until it is inappropriate to further reduce the model.

e) Validate the final model you found in d) above. Produce suitable residual plots/analysis. (NOTE: No further analyses, e.g. influence analyses or transformations, are required for these data.)

Comment on all your results.

f) Fully discuss all your findings and the appropriateness of your final model.

Task 3 The following data come from a survey exploring the factors that influence the pattern of consumption of psychotropic drugs. They are an extract from this survey, published in Murray et al. (1981).

The variables are as follows:

Sex 0 male; 1 female

Agegr age group (one of 16-29; 30-44; 45-64; 65-74; >74)

GHQ score on the General Health Questionnaire: 0 low; 1 high

Drug number taking drugs

Tot total number in each variable combination

The data are available as psy.csv, psy.mtw and psy.sav.

N.B. For SPSS the data have been stacked in the required form with Drug being the number either taking or not taking the drug and Resp being the response (=1 if taken drugs, = 0 if not taken drugs)
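The stacked layout described in the N.B. can be sketched as below, under the assumption that each Drug/Tot row becomes two rows, one with Resp = 1 and one with Resp = 0. The function name and the counts in the example are hypothetical and are not taken from psy.csv.

```python
def stack_cells(cells):
    """Turn one Drug/Tot row per cell into two weighted rows per cell."""
    stacked = []
    for cell in cells:
        # Keep the explanatory variables (Sex, Agegr, GHQ) unchanged.
        base = {k: v for k, v in cell.items() if k not in ("Drug", "Tot")}
        # Resp = 1: number taking drugs in this cell.
        stacked.append({**base, "Resp": 1, "Drug": cell["Drug"]})
        # Resp = 0: number not taking drugs in this cell.
        stacked.append({**base, "Resp": 0, "Drug": cell["Tot"] - cell["Drug"]})
    return stacked


# Hypothetical counts for a single cell.
example = [{"Sex": 0, "Agegr": "16-29", "GHQ": 0, "Drug": 3, "Tot": 50}]
stacked = stack_cells(example)
for row in stacked:
    print(row)
```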

Printouts of the data in the CSV/Minitab and SPSS formats are given in the appendices.

a) Explain why the assumption of a binomial distribution is suitable to use in modelling these data. Also, without any calculation, explain in what way you might expect each of the variables Sex, Agegr and GHQ to affect the dependent variable.

b) We wish to investigate how the variables (i.e. age group, sex and position on general health (GHQ)) may affect the chance of being a drug user. For this purpose you should fit various models to the Drug variable (Resp in SPSS) using appropriate software.

You should begin by fitting the following models:

i. A model with all main effects, two-way interactions and three-way interactions (i.e. the saturated model)

ii. A model with all main effects and all two-way interactions

Utilising the output from these models, explain why we can conclude that the three-way interaction is not needed.
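The comparison of models i and ii rests on some simple bookkeeping, sketched below under the level counts listed in the variable definitions (2 sexes, 5 age groups, 2 GHQ levels). Because the saturated model reproduces the data exactly, the residual deviance of model ii is itself the test statistic for the three-way interaction. The 5% critical value for 4 degrees of freedom is a standard chi-squared table value; the deviance passed in the example call is hypothetical.

```python
import math

levels = {"Sex": 2, "Agegr": 5, "GHQ": 2}

# Number of distinct covariate combinations (rows in psy.csv form).
n_cells = math.prod(levels.values())

# Degrees of freedom carried by the three-way interaction.
three_way_df = math.prod(k - 1 for k in levels.values())

CHI2_CRIT_4DF_5PCT = 9.488  # upper 5% point of chi-squared on 4 df


def three_way_needed(deviance_two_way_model):
    """True if dropping the three-way term worsens the fit significantly."""
    return deviance_two_way_model > CHI2_CRIT_4DF_5PCT


print(n_cells, three_way_df)
print(three_way_needed(3.2))  # 3.2 is a made-up deviance for illustration
```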

c) Starting with model b) ii, remove each two-way interaction (not each individual term) one at a time.

Utilising the output from these models, explain which if any such terms may be removed.

Continue reducing the model until no more terms may be removed.

d) Using the best model that you selected in part c, produce suitable residual plots to investigate the adequacy of your model.
Clearly discuss your findings.

e) Using the best model that you selected in c), interpret clearly all the parameter estimates in the model.

State clearly whether they are as you would expect.

f) Use your model to predict the probability that an individual who is female, scored high on the GHQ and is aged 48 will be a psychotropic drug user.

Comment on the validity of your prediction in this case.
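The mechanics of such a prediction can be sketched as follows, assuming for illustration a main-effects-only logistic model with male, age 16-29 and low GHQ as reference categories. Every coefficient below is a hypothetical placeholder, not an estimate from psy.csv; an age of 48 falls in the 45-64 group, so that group's coefficient is the one used.

```python
import math


def inv_logit(eta):
    """Convert a linear predictor on the logit scale to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients, for illustration only.
coef = {
    "intercept": -2.0,
    "female": 0.5,
    "age_45_64": 0.8,
    "ghq_high": 1.2,
}

# Female, aged 48 (group 45-64), high GHQ: all three indicators equal 1.
eta = coef["intercept"] + coef["female"] + coef["age_45_64"] + coef["ghq_high"]
prob = inv_logit(eta)
print(round(prob, 4))
```

If the final model from part c) retains any interactions, the corresponding product terms would have to be added to the linear predictor before applying the inverse logit.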

Portfolio task 4 - Loglinear modelling

In the real world, you will be expected to communicate the results from a statistical analysis you perform to non-statisticians, so you should conclude each task with a brief explanation of your results, presented in terms a lay person would understand.

All data are in SPSS format. N.B. Minitab doesn't perform loglinear analysis.

Note that in the lecture we followed Field's use of the Model Selection function within Analyze → Loglinear. This allowed us to evaluate the significance of the different levels of interaction available. However, it didn't allow us to refine the model further.

In this coursework you may begin with an overall assessment of the saturated model using this option, but you will need to use the General method to explore and refine the model to produce one that best fits the data.

Statistical Modelling Gary Hearne (adapted from Dr. Theresa Brunsdon)

2014 / 2015 School of Science and Technology

Task 4 A manufacturer of VLSI chips has been acting as exclusive supplier to three computer assembler firms (M1, M2 and M3). Over the last three years, faulty chips in operational machines have been returned via the assemblers to the manufacturers, together with information regarding the usage of the machine (low to average, above average).

Chips tend to fail because of one of two problems: insulation breakdown or circuit fracture. The numbers of returned faulty chips, categorised by type of fault, numerically coded variables.

a) Load the data into SPSS. Perform an exploratory analysis on the structure of the data and the relationships within it using chi-squared tests. Carefully note your findings.
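For readers who want to check what SPSS reports, the Pearson chi-squared statistic for a two-way table can be computed by hand as in the sketch below. The function name and the 2×2 table of counts are hypothetical and do not come from the chip data.

```python
def pearson_chi2(table):
    """Pearson chi-squared statistic and degrees of freedom for a 2-D table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Hypothetical counts: rows = fault type, columns = usage level.
stat, df = pearson_chi2([[10, 20], [20, 10]])
print(stat, df)
```

The statistic is compared with a chi-squared distribution on the returned degrees of freedom; the approximation is usually considered unreliable when expected counts are very small.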

b) Use Analyze → Loglinear → Model selection to build a saturated model. Interpret the output, in particular the table of K-Way and Higher-Order Effects to determine whether the model is a good fit to the data. Explain your findings.

c) Repeat the procedure, but this time using Analyze → Loglinear → General. Ensure that in the Options menu you select Estimates.

Note that for this procedure, you will not have to declare the value ranges for the variables, but you will have to use the Weight Cases function.

Use the Parameter Estimates table to check that the output matches the result for the Model selection procedure.

d) Repeat the procedure, removing insignificant terms hierarchically until you have the model which best fits the data. Use this model to explain the relationships and interactions within the data.
