Reference no: EM13969886
Question 1:
A large department store chain has 40 branches throughout the UK. A senior manager for the company is responsible for the sales of female cosmetics and perfumes. She was concerned that in the run-up to Christmas, when many of the shoppers for these products could be men (purchasing presents for wives and girlfriends), the company may not be maximising its selling potential. She therefore proposed that the salespersons concerned should undergo a short training course specifically targeted at selling to men.
A pilot study was therefore initiated to investigate whether the proposed training course would be effective. Two cosmetics salespersons were randomly selected from each branch, and one of each such pair was randomly selected to undergo the training course. The total sales receipts or "take" were recorded for each of the salespersons in this study from the sales logs on the shop floor tills from a busy Saturday in early December. The relevant data can be found in the SAS data set Cosmetics.sas7bdat, which contains the following variables:
ID Store identification number (1,...,40)
Take1 Total value of sales (£) made by the "trained" salesperson in the store
Take2 Total value of sales (£) made by the "untrained" salesperson in the store
(i) Obtain the sample mean and standard deviation of the sales receipts for the trained and untrained salespersons, and briefly interpret your findings.
(ii) It is planned to conduct a statistical test to compare the mean sales receipts of the trained and untrained salespersons. What distributional assumption must be made in order for a parametric statistical test to be valid? Investigate the tenability of this assumption via an appropriate graphical technique (i.e. a data plot or plots).
(iii) Formulate and conduct a suitable parametric test. If appropriate, give an associated estimate and a corresponding 95% confidence interval. Carefully interpret all your findings.
(iv) Formulate and conduct the equivalent non-parametric test to that employed in (iii). Critically compare the outcomes of the two tests.
(v) Hence advise the manager regarding the effectiveness of the proposed training course. What else would influence the decision on whether to implement the course for all cosmetic salespersons in future years?
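Because the two salespersons in each pair come from the same store, one natural reading of parts (ii) to (iv) is an analysis of the within-store differences Take1 - Take2. The sketch below shows that idea in Python rather than SAS. The function name `paired_t_statistic` and the five pairs of values are made up for illustration and are not taken from Cosmetics.sas7bdat.

```python
import math
import statistics


def paired_t_statistic(trained, untrained):
    """Return (mean difference, sd of differences, t statistic, degrees of freedom)."""
    diffs = [a - b for a, b in zip(trained, untrained)]
    n = len(diffs)
    mean_d = statistics.mean(diffs)
    sd_d = statistics.stdev(diffs)      # sample standard deviation (n - 1 divisor)
    se = sd_d / math.sqrt(n)            # standard error of the mean difference
    return mean_d, sd_d, mean_d / se, n - 1


# Illustrative values only (NOT the assignment data): one pair per store.
take1 = [520.0, 610.0, 480.0, 700.0, 565.0]
take2 = [500.0, 590.0, 495.0, 640.0, 560.0]

mean_d, sd_d, t_stat, df = paired_t_statistic(take1, take2)
print(mean_d, round(sd_d, 2), round(t_stat, 3), df)
```

The p-value and the 95% confidence interval need quantiles of Student's t distribution with n - 1 degrees of freedom, which the Python standard library does not provide; SAS output or a statistics package would supply them. The same list of differences is also the input to the signed-rank test in part (iv).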
Question 2:
A bank recently offered to the market two highly competitive, fixed rate savings bonds, namely a taxable bond and a tax-free cash ISA. Both bonds required a minimum investment of £25,000 and were only available to investors who provided new funds to the bank. Following the launch of these bonds, the Customer Services department of the bank began to receive a significant number of complaints from potential investors in both types of bond regarding the incorrect processing of their applications.
As part of its investigation into this problem, the bank initially wished to investigate whether the error rate in processing applications differed between the two bond types. A random sample of 400 previously processed applications was taken for each bond type, and each was carefully checked for errors. The numbers of applications observed to be correctly and incorrectly processed for each bond type were as follows:
Bond Type | Correct | Incorrect | Totals
Taxable Bond | 374 | 26 | 400
Cash ISA | 356 | 44 | 400
Set up an appropriate data set in frequency distribution form that contains the variables:
BondType (1 = Taxable Bond, 2 = Cash ISA)
Error (0 if correctly processed, 1 if incorrectly processed)
Count (the relevant frequency)
Obtain the sample proportions of incorrectly processed applications for the two bond types. What do your findings suggest?
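As a rough illustration of the frequency-distribution layout described above, the sketch below holds the four cells of the table as (BondType, Error, Count) records and derives the two sample proportions from them. It is written in Python rather than SAS, and the helper name `error_proportion` is hypothetical.

```python
# Frequency-form records: (BondType, Error, Count), using the coding given in the question.
records = [
    (1, 0, 374),  # Taxable Bond, correctly processed
    (1, 1, 26),   # Taxable Bond, incorrectly processed
    (2, 0, 356),  # Cash ISA, correctly processed
    (2, 1, 44),   # Cash ISA, incorrectly processed
]


def error_proportion(rows, bond_type):
    """Proportion of incorrectly processed applications for one bond type."""
    total = sum(count for b, _, count in rows if b == bond_type)
    errors = sum(count for b, e, count in rows if b == bond_type and e == 1)
    return errors / total


p_taxable = error_proportion(records, 1)  # 26 / 400
p_isa = error_proportion(records, 2)      # 44 / 400
print(p_taxable, p_isa)
```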
Set up, justify and conduct an appropriate statistical test regarding the "true" proportions of incorrectly processed applications for each bond type, carefully interpreting your findings.
If appropriate, estimate the difference between these two "true" proportions and give a corresponding 95% confidence interval, briefly justifying your decision regarding whether or not to present this additional information. Carefully interpret any stated results.
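One possible approach to the test and interval asked for above is a large-sample two-proportion z test with a Wald confidence interval. The sketch below uses the counts from the table and the Python standard library only. The choice of this particular test, and the variable names, are assumptions for illustration and not a prescribed solution.

```python
import math
from statistics import NormalDist

# Observed counts from the table in the question.
n_taxable, err_taxable = 400, 26
n_isa, err_isa = 400, 44

p1 = err_taxable / n_taxable
p2 = err_isa / n_isa
diff = p2 - p1  # Cash ISA minus Taxable Bond

# Test of H0: equal error rates, using the pooled proportion in the standard error.
pooled = (err_taxable + err_isa) / (n_taxable + n_isa)
se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_taxable + 1 / n_isa))
z = diff / se_pooled
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided

# 95% Wald interval for the difference, using the unpooled standard error.
se_unpooled = math.sqrt(p1 * (1 - p1) / n_taxable + p2 * (1 - p2) / n_isa)
z_crit = NormalDist().inv_cdf(0.975)
ci_lower = diff - z_crit * se_unpooled
ci_upper = diff + z_crit * se_unpooled

print(round(z, 3), round(p_value, 4), round(ci_lower, 4), round(ci_upper, 4))
```

The square of this z statistic equals the Pearson chi-square statistic for the 2 x 2 table without continuity correction, so the two formulations lead to the same two-sided conclusion.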
Question 3:
A large 'Pub and Restaurant' group owes its financial success to maximising the ground floor area of its licensed properties (by situating toilets and kitchens either in the basement or on the first floor) and by providing both food and drinks at competitive prices. Due mainly to the current economic climate, a large number of desirable suburban licensed properties have recently come to the market at competitive prices. The Chairman of the group has thus declared his ambition to expand the company's portfolio of such properties.
To fully investigate the potential of every one of these properties would be prohibitively expensive, as each such investigation would require a structural survey and detailed consultation with an architect concerning the structural changes required. It has thus been decided to model each property's potential value (as measured by average weekly income in £000's), using a variety of physical measurements coupled with accessible local socio-economic and geographical data as explanatory variables. To achieve this, a random sample of 40 suburban properties was drawn from the company's current portfolio. In addition to the (known) average weekly income of each property, the values of the various potential explanatory variables were also added to the data set. The variables of interest are, therefore:
Value Average weekly income (£000's)
Ratevalue Rateable value of property (£000's)
Footprint Total 'ground floor' area (m²)
Otherarea Total useable area off ground floor (does not include beer storage area or living accommodation)
House Average cost of a house within ? mile radius of property (£000's)
Pubs Number of other public houses within ? mile radius
Food Number of other restaurants within ? mile radius
Social Percentage of domestic housing classified as 'social' within ? mile radius (%)
Employ Percentage of unemployed (best local measure available) (%)
Car Maximum number of (hard) car parking spaces currently available
Garden Area of accessible outside space (m²) (including 'hard' space for cars plus green space, e.g. garden)
The data set also contains an identification number for each property in the sample:
ID Property identification number (1,...,40)
These data may be found in the permanent SAS data set Pubvalues.sas7bdat.
(i) Fully investigate and discuss any issues of multicollinearity between the ten potential explanatory variables listed above.
(ii) Fit appropriate multiple regression models to these data using both forward selection and backward elimination procedures, in each case using an associated significance level of 5%. For each procedure, state and briefly justify the decision taken at each step, and identify the explanatory variables present in the final reduced model.
Briefly discuss the composition of the final models obtained with these two different approaches in the light of the collinearity analysis conducted in part (i).
(iii) Further investigate model selection using the Cp method for comparing all possible models. Comment specifically on the two reduced models identified in (ii) above.
Hence, using the data alone, choose a final, reduced model, justifying your choice. Are any other models of the selected size also reasonable candidates for selection? Explain.
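For reference, one common form of Mallows' statistic for a candidate model is given below. Here p counts the fitted parameters including the intercept, n = 40 is the sample size, and the error variance is estimated from the full model containing all ten explanatory variables. The notation is an assumption and should be checked against the course notes and the SAS output.

```latex
C_p = \frac{\mathrm{SSE}_p}{\hat{\sigma}^2_{\text{full}}} - (n - 2p),
\qquad
\hat{\sigma}^2_{\text{full}} = \mathrm{MSE} \text{ of the full model}
```

Under this definition, candidate models with a small value of C_p that is close to p are usually taken to show little evidence of bias.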
(iv) Consider again the reduced model obtained by backward elimination. State the fitted regression equation for this model. Interpret and briefly discuss each of the partial regression coefficients and the relevant related information output by SAS.
(v) Investigate the overall validity of the reduced model obtained by backward elimination by undertaking suitable diagnostic analyses involving the studentised and/or deleted residuals and/or the fitted values.
(vi) Considering again the reduced model obtained by backward elimination, identify any potential influential observations (use the value of the variable ID to specify each such observation).
Further investigate the two most extreme of these potential influential observations with respect to their effect on the model by considering the corresponding values of the leverage H, the deleted residual, the covariance ratio C and their DFBETAS values. Note that you are not required to construct a plot of observed versus fitted values for these observations.
(vii) Based on a comprehensive business analysis, the company will fully investigate a property as a potential purchase if it can reasonably be anticipated that the total income for one year (52 weeks) would be greater than or equal to the property's (freehold) price.
"The Good King James" has just come onto the market with an advertised freehold price of £515,000. Its rateable value is £13,000, local unemployment is running at 10% and the property has an accessible outside space of 100 m2.
Given that the company has satisfactorily investigated the various influential points identified in (vi) and found all the data to be completely valid, use the reduced model obtained by backward elimination to advise the company as to whether it should further investigate "The Good King James", commenting on the reliability of your analysis.
What is the maximum price that the company should consider offering (assuming the necessary further investigations are satisfactory) if it wishes to have 95% assurance that its potential investment in the property is well founded?
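The decision rule in (vii) reduces to simple arithmetic once Value is read as average weekly income in £000's: 52 weeks of income must be at least the freehold price. The sketch below works through that conversion. The function names are hypothetical, and the interval limit in the usage lines is a made-up placeholder, not a fitted result.

```python
WEEKS_PER_YEAR = 52
FREEHOLD_PRICE_K = 515.0  # advertised price of "The Good King James", in £000's


def breakeven_weekly_income(price_k, weeks=WEEKS_PER_YEAR):
    """Average weekly income (£000's) needed for one year's income to equal the price."""
    return price_k / weeks


def max_price_from_weekly_income(weekly_income_k, weeks=WEEKS_PER_YEAR):
    """Largest price (£000's) covered by one year of the given weekly income."""
    return weekly_income_k * weeks


threshold = breakeven_weekly_income(FREEHOLD_PRICE_K)
print(round(threshold, 3))  # weekly income (£000's) the model would need to predict

# Placeholder only: a lower limit of 9.0 (£000's) is invented for illustration.
hypothetical_lower_limit = 9.0
print(max_price_from_weekly_income(hypothetical_lower_limit))
```

For the final part, the weekly income fed into `max_price_from_weekly_income` would be the lower limit of whichever 95% interval for Value is judged appropriate for a single new property.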
Question 4:
An insurance company wished to consider the possibility of introducing a flat rate travel insurance plan, for the over 50's, with an upper age limit of 85. As part of their investigation of the claim costs of such a plan, a random sample of thirty claims was selected from the company's recent records (on their current travel insurance scheme) for each claimant age from 50 to 85 inclusive and the mean claim amount (£) was calculated for each such age. The resulting data can be found in the SAS data set Travel.sas7bdat, which contains the following variables:
Age Claimant age in years (50,...,85)
Claim Mean claim amount (£)
LnClaim Natural logarithm of mean claim amount
(i) Fit the bivariate regression model of Claim on Age. Obtain:
• A scatterplot of Claim against Age with the fitted regression line superimposed
• A plot of studentised residuals against fitted values
Critically discuss these two plots and draw appropriate conclusions regarding the adequacy of the systematic component of the fitted model.
(ii) Now fit the bivariate regression model of LnClaim on Age. Obtain:
• A scatterplot of LnClaim against Age with the fitted regression line superimposed
• A plot of studentised residuals from this regression against the corresponding fitted values
• A histogram of the studentised residuals
• A normal probability plot of the studentised residuals
Carefully explaining your methodology, investigate the adequacy of the new fitted regression model. Which model would you recommend for these data and why?
(iii) Write down the fitted model obtained in (ii) above, commenting briefly on the value of the estimated slope. Explain how the model can be used to estimate the mean value of a travel insurance claim for an over-50's claimant from his/her age.
Obtain and print out the estimated mean values of LnClaim for each observation in the data set, together with corresponding 95% confidence limits. Use this model to estimate the mean value claimed for a 70-year-old claimant. Give and interpret an associated 95% confidence interval.
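Since the model in (ii) is fitted on the log scale, estimates and confidence limits for LnClaim have to be exponentiated to return to pounds. The sketch below illustrates that back-transformation. The function names are hypothetical, and the intercept, slope and limits are made-up placeholders, not values fitted to Travel.sas7bdat.

```python
import math


def back_transform(ln_estimate, ln_lower, ln_upper):
    """Convert an estimate and its confidence limits from the log scale to pounds."""
    return math.exp(ln_estimate), math.exp(ln_lower), math.exp(ln_upper)


def percent_change_per_year(slope):
    """Percentage change in the back-transformed claim for each extra year of age."""
    return (math.exp(slope) - 1) * 100


# Placeholder coefficients and limits, invented for illustration only.
b0, b1 = 4.0, 0.03
age = 70
ln_fit = b0 + b1 * age                  # fitted LnClaim at the chosen age
ln_lower, ln_upper = ln_fit - 0.05, ln_fit + 0.05

estimate, lower, upper = back_transform(ln_fit, ln_lower, ln_upper)
print(round(estimate, 2), round(lower, 2), round(upper, 2))
print(round(percent_change_per_year(b1), 2))
```

Exponentiating the mean of LnClaim estimates the geometric mean of Claim (the median, if LnClaim is normally distributed), not the arithmetic mean, which is worth stating when interpreting the interval.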
What other information about travel insurance claims by over-50's should the company also investigate in order to be able to complete its claim costings?