MIS770 Foundation Skills in Data Analysis - Deakin University
Assessment Task - Analysis of US Health Insurance data
Description
The purpose of this assignment is to investigate a dataset using the knowledge learned in Modules 1 and 2. This will enable conclusions to be drawn that ultimately assist in decision making.
The assignment requires you to analyse a given dataset, interpret the results, and then draw conclusions such that you are able to reply to specific questions being asked of you in the form of a business report. (These questions are asked in the following email).
The aims of the assignment are to:
• provide you with some examples of the application of data analysis
• test your understanding of the material presented in the relevant topics
• test your ability to analyse data and interpret your results
• test your ability to effectively communicate your results to others
Before attempting the assignment, make sure that you have prepared yourself well by reading the relevant sections of the prescribed textbook and reviewing the materials provided in Modules 1 and 2 (i.e. Topics 1 to 7).
Specific Requirements
The UnitedHealth Group is America's most prominent health insurance provider. They want to better understand certain population characteristics that might contribute to the high medical costs being billed to insurance providers. They have access to a random sample of US Health Insurance data containing 1338 insured personnel with their Age, Gender, Body Mass Index (BMI), Number of Children, Smoking status, Region and Charges.
You are a Data Analyst working for UnitedHealth Group. Your Manager, Daisy Pearce, has asked you to conduct a preliminary analysis. In particular, you are expected to apply a series of statistical techniques and produce a report based on your findings.
Daisy's email is reproduced on the next page.
Email from Daisy Pearce
Hi,
As per our conversation, I have spoken with our reporting team and we have the following questions relating to the US health insurance data (contained in the file Insurance.xlsx). Please complete the required analysis and prepare a report for me containing answers to the following questions:
Q1. An Overall View of both "Charges" and "Smoking"
Can you provide me with overall summaries of:
a) Individual medical cost billed by health insurance
b) Smoking status
Q2. Relationships
a) Is there a relationship between the age of the primary beneficiary, their body mass index (BMI), number of children and medical cost?
b) We would also like to know whether there is a gender bias in the smoking behaviour of beneficiaries.
c) Can you further analyse whether a beneficiary's residential area/region in the US affects the medical costs billed by health insurance providers?
I realise that the US Health Insurance data contain a random sample of 1338 insured personnel, and that this information can be used to draw inferences about the specific attributes of the whole insured population and the charges billed by health insurance providers. With that in mind, please provide me with answers to the following questions:
Q3. The UnitedHealth Group would like estimates of the following.
a) Average medical cost for an older beneficiary (older adulthood: 56 years and older)
b) Proportion of smokers who are obese (BMI of at least 30)
Q4. The UnitedHealth Group would like a comparison between this year's medical cost and the industry average.
a) The industry average medical cost for a single adult (i.e. without children) is at least $10,000. Is there any evidence to support this assertion?
b) Based on the industry average, less than 50% of beneficiaries are female. Can this claim also be substantiated?
Q5. Appropriate Sample Size
One of the company's overall goals is to estimate the average medical cost for all insured personnel to within $1000 (±$1000) and the proportion of all insured smokers to within 3%. Will a sample size of 1338 be large enough? If not, what size sample should be taken? What other factors should be taken into account when sampling?
Business Report Requirements
• Your report should be no longer than 4 pages and should not include any charts, tables, or appendices in the report. Charts/graphics and tables are only to be placed in the Data Analysis file i.e. the Excel spreadsheet and not reproduced in the report.
• Suggested formatting for the report: single-line spacing; no smaller than 10-point font; page margins approx. 25mm; and good use of white space.
• Your report must have a cover sheet containing your particulars and Unit details.
• The report is to be written as a stand-alone document (assume Daisy will only read your report). Thus, you should not have any references in the report to your data analysis output, e.g. "According to Table 1 in the analysis...".
• Your report must contain an executive summary that explains in plain language the purpose of the report and summarises the main findings. The executive summary should be no more than 300 words long.
• The body of your report must be set out in the same order as in the originating email from Daisy, with each section (question) clearly marked.
• Use plain language and succinct explanations. Avoid the use of technical or statistical jargon as Daisy cannot be expected to understand statistical terminology. As a guide to the meaning of "Plain Language", imagine you are explaining your findings to a person without any statistical training (e.g. someone who has not studied this unit). What type of language would you use in this case?
• Marks will be lost if you use unexplained technical terms, include irrelevant material, or have poor presentation/organization.
Data Analysis Instructions
In order to prepare a reply to Daisy's email, you will need to examine and analyse the dataset Insurance.xlsx thoroughly.
Daisy has asked a number of questions, and your Data Analysis output (i.e. your charts/tables/graphs) should be structured so that you answer each question on the separate tab/worksheet provided in your Excel document. There are also three extra tabs in Insurance.xlsx called CI, HT and SampleSize, and you should use the templates contained in these tabs when arriving at your "Confidence Interval", "Hypothesis" and "Sample Size" answers.
Q1. An overall summary of Charges (in dollars) and summary of Smoking status
You are required to comprehensively describe the variable 'Charges' by itself and the variable 'Smoking' by itself using the most appropriate techniques from Module 1.
Your analysis should include numerical summaries, graphs and tables. The importance of other variables is considered in other questions. You should thoroughly investigate relevant summary measures (and their reliability) for these two variables. Also, there may well be suitable tables and charts/graphs that will illustrate more clearly other important features of charges and smoking. (See Topics 1-3 learning materials)
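The assignment itself is completed in Excel, but the kind of numerical summary and frequency table described above can be sketched in a few lines of Python for anyone who wants to cross-check their spreadsheet. This is a minimal illustration only: the values below are hypothetical and are NOT taken from Insurance.xlsx, and the function names `describe` and `frequency_table` are made up for this sketch. The quartiles use the "exclusive" method, which may differ slightly from the method used in your Excel workbook.

```python
import statistics
from collections import Counter

# Hypothetical values for illustration only; NOT taken from Insurance.xlsx.
charges = [1200.0, 2500.0, 3100.0, 4800.0, 9500.0, 12000.0, 38000.0]
smoker = ["no", "no", "yes", "no", "no", "yes", "no"]


def describe(values):
    """Numerical summary of one numeric variable."""
    # statistics.quantiles defaults to the 'exclusive' method.
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),  # sample standard deviation
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "min": min(values),
        "max": max(values),
    }


def frequency_table(labels):
    """Count and relative frequency for each category of one variable."""
    counts = Counter(labels)
    total = len(labels)
    return {label: (count, count / total) for label, count in counts.items()}


summary = describe(charges)
smoking_table = frequency_table(smoker)
print(summary)
print(smoking_table)
```

A mean well above the median, as in this toy data, is one sign of a right-skewed variable, in which case the median and interquartile range are usually the more reliable summary measures.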
Q2. Descriptive measures and insights
Your course notes (Module One) give methods (numerical summaries/tables/graphs/charts) for summarising a single variable and for investigating the relationships (dependencies) between two variables. For example:
• Pie/Bar charts
• Summary/Frequency Distribution tables
• Comparative summary measures including quartiles and percentiles
• Scatter diagrams
• Coefficient of correlation, r value
• Contingency tables/Cross tabs
• Stacked bar charts, side-by-side bar charts
• Histograms/Frequency polygons/Ogives
• Single/Multiple box and whisker plots etc. (See Module One learning materials)
Use whatever techniques you have studied in Module 1 to investigate the associations/relationships. Generate suitable visualisations (Tables/Graphs/Charts) and numerical measure(s) demonstrating the existence or otherwise of a relationship. Remember to provide a brief overall summary when concluding these questions.
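As a rough illustration of two of the numerical measures listed above, the sketch below computes a coefficient of correlation for two numeric variables and a contingency table with row percentages for two categorical variables. The data are hypothetical and NOT taken from Insurance.xlsx, and the helper names (`pearson_r`, `crosstab`, `row_shares`) are assumptions made for this example; the Excel equivalents are what the assignment expects.

```python
import math
from collections import Counter

# Hypothetical values for illustration only; NOT taken from Insurance.xlsx.
age = [19, 27, 35, 44, 52, 61]
charges = [1800.0, 3200.0, 5100.0, 7400.0, 10900.0, 13500.0]
gender = ["female", "male", "male", "female", "male", "female"]
smoker = ["no", "yes", "no", "no", "yes", "no"]


def pearson_r(x, y):
    """Pearson coefficient of correlation between two numeric variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def crosstab(row_labels, col_labels):
    """Contingency table: count of each (row, column) combination."""
    return Counter(zip(row_labels, col_labels))


def row_shares(table):
    """Convert counts to proportions of each row total."""
    row_totals = Counter()
    for (row, _col), count in table.items():
        row_totals[row] += count
    return {
        (row, col): count / row_totals[row]
        for (row, col), count in table.items()
    }


r = pearson_r(age, charges)
table = crosstab(gender, smoker)
shares = row_shares(table)
print(round(r, 3))
print(dict(table))
print(shares)
```

Comparing the row proportions (for example, the share of smokers within each gender) is what reveals an association; raw counts alone can mislead when the groups differ in size.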
Q3-Q4. The analysis required involves inferential statistics, which are covered in Module 2. Use the relevant Excel templates (CI and HT) provided in the Data file.
These questions will require you to complete either a confidence interval or a hypothesis test. Go through each of the questions asked by Daisy and decide which technique is the most appropriate. Below are some hints regarding the most appropriate technique:
• Do we have to make an estimate, and therefore need a confidence interval?
• Are we testing a theory or claim, or comparing values, and therefore need a hypothesis test?
So decide which you think is the most appropriate technique (tutorials for topics 6 and 7 help here).
• You can assume that a 95% confidence level is appropriate.
• Use 5% significance in any hypothesis tests you perform, and provide a summary of your conclusions.
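To make the distinction between the two techniques concrete, the following is a minimal sketch of a confidence interval and a one-sample hypothesis test for both a mean and a proportion. It is an assumption-laden illustration, not the method in the CI and HT templates: it uses the normal (z) approximation throughout because the Python standard library has no t distribution, whereas the templates may use t for means. All function names are hypothetical, and the numbers in the usage lines are made up rather than drawn from Insurance.xlsx.

```python
import math
from statistics import NormalDist

Z = NormalDist()  # standard normal distribution


def mean_ci(sample_mean, sample_sd, n, confidence=0.95):
    """Approximate confidence interval for a population mean."""
    z = Z.inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sample_sd / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin


def proportion_ci(successes, n, confidence=0.95):
    """Approximate confidence interval for a population proportion."""
    p_hat = successes / n
    z = Z.inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin


def p_value(z, tail):
    """p-value for a z statistic; tail is 'lower', 'upper' or 'two'."""
    if tail == "lower":
        return Z.cdf(z)
    if tail == "upper":
        return 1 - Z.cdf(z)
    if tail == "two":
        return 2 * (1 - Z.cdf(abs(z)))
    raise ValueError("tail must be 'lower', 'upper' or 'two'")


def z_test_mean(sample_mean, sample_sd, n, mu0, tail):
    """One-sample test of a claimed mean mu0 (normal approximation)."""
    z = (sample_mean - mu0) / (sample_sd / math.sqrt(n))
    return z, p_value(z, tail)


def z_test_proportion(successes, n, p0, tail):
    """One-sample test of a claimed proportion p0."""
    p_hat = successes / n
    z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
    return z, p_value(z, tail)


# Made-up inputs, purely to show the calls.
print(mean_ci(100.0, 10.0, 100))
print(z_test_proportion(45, 100, 0.5, "lower"))
```

In a test at 5% significance, the null hypothesis is rejected when the p-value is below 0.05. Choosing the correct tail depends on how the claim is worded, so that decision is left to the analyst.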
Q5. Use the relevant Excel templates provided in the Data file.
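For reference, the standard sample-size formulas for estimating a mean and a proportion to within a given margin of error can be sketched as below. This is an illustration under stated assumptions, not a reproduction of the SampleSize template: it uses the z value for the chosen confidence level, the standard deviation passed to `sample_size_mean` is a hypothetical planning value (in practice the sample standard deviation of Charges would be used), and `p=0.5` is the conservative default when no estimate of the proportion is available. The function names are invented for this example.

```python
import math
from statistics import NormalDist


def sample_size_mean(margin, sigma, confidence=0.95):
    """Smallest n that estimates a mean to within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / margin) ** 2)


def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """Smallest n that estimates a proportion to within +/- margin."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# sigma=12000 is a hypothetical planning value, NOT from Insurance.xlsx.
n_mean = sample_size_mean(margin=1000, sigma=12000)
n_prop = sample_size_proportion(margin=0.03)
print(n_mean, n_prop)
```

Both results are always rounded up, since rounding down would give a margin of error slightly wider than the target. When two estimates must both meet their targets, the larger of the two sample sizes is the one that governs.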
Learning Outcome 1: Manipulate and summarise data that accurately represents real world problems
Learning Outcome 2: Interpret and appraise statistical output to assist in real-world decision making
Learning Outcome 3: Critical thinking: evaluating information using critical and analytical thinking and judgment.
Attachment:- Analysis of US Health Insurance data.rar