Conduct a sensitivity analysis to show that your conclusions


Reference no: EM132743650

Assignment 1:

There are 3 parts to hand in.

1. A PDF report compiled by Rmarkdown of no more than 20 pages, including figures (which must be at least 1/3 of a page in size each) and key code blocks. Note the expectation is that there is not much more than 3-4 pages of actual mathematics and text, and that almost all space is taken up by figures, code and other R output.

2. An electronic R markdown file entitled with your anonymous student code and then '.Rmd' used to generate the report (a link and submission instructions will appear on ELE in due course). The R in this script may be run to help assess your mark, so please ensure that all code blocks run in a blank R session and the markdown knits in a folder containing only the data given to you and any Stan files you used.

3. Any Stan files you have used.

1. The data USelectionData.rds contains the results of the 2020 US general election by state as well as the number of Covid-19 cases and deaths on election day in each state, and a range of population demographics. State breakdowns of sex and race of the population are given as percentages of the total population (TotalPop). So, for example, the first row states that Alabama is 48.46% male, 4.00% Hispanic, etc. The percentage of the state population that are US citizens is given by the variable Citizen and the percentage in poverty by Poverty. It is not clear how poverty is defined, nor whether there is a state-dependence for the definition (e.g. that is tied to minimum wage and living costs which are not given and vary by state). The number Employed and the income per capita per state are given. Finally, the variables Professional, Service, Office, Construction, Production give percentages of each state's employed population that work in each of these areas. Professional represents, e.g. Doctors, Lawyers, Teachers etc. Service represents service industries such as retail and hospitality. Office represents office workers, Construction those in the building industry (including plumbers, electricians etc.) and finally Production represents factory work.

As the purpose of this coursework is not to test your R coding ability, the file SomeDataWrangling.R attempts to shape the data in some of the different ways that you may wish to in order to answer the question. There is no "one correct way" to conduct your investigation, and each attempt will be marked on its own merits, hence any of the manipulations presented may be useful. The code may also be amended if you think of some other way you would like to shape the data for analysis. If you are having issues with data shaping, I am willing to help, as long as I don't cross the line into suggesting which models you should/should not fit.

Filtering the data to only consider the race between candidates Biden and Trump, conduct a Bayesian analysis to answer the question: How important might Covid-19 have been in influencing state results during the 2020 US presidential election?

Brief: Your analysis must
• Select and justify appropriate Bayesian models to fit. Your model should predict some variable representing the results using some of the other variables including those related to Covid-19. Note that the small size of the data means that you cannot use all of the variables at once, so the choices you make require careful justification and sensitivity analysis (see later).
• Fit any Bayesian models using Stan.
• Establish the convergence of any models you fit.
• Check the quality of your models using appropriate out of sample data.
• Conduct a sensitivity analysis to show that your conclusions are not sensitive to your model.
• Use Monte Carlo with accurate Monte Carlo error bounds to make probabilistic inferential statements. These must be used both to assess posterior probabilities for effects directly related to Covid-19, and to compare the size of these effects with others in your model.
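The last bullet asks for probabilistic statements with Monte Carlo error bounds. A minimal sketch of the idea follows, written in Python purely for illustration (the coursework itself expects R and Stan). The draws, the name `beta_covid`, and the helper `mc_probability` are all hypothetical stand-ins; the binomial standard error assumes roughly independent draws, so for autocorrelated MCMC output the sample size would need replacing with an effective sample size.

```python
import math
import random


def mc_probability(draws, event):
    """Estimate P(event) from draws, with a Monte Carlo standard error."""
    n = len(draws)
    hits = sum(1 for d in draws if event(d))
    p_hat = hits / n
    # Binomial standard error; only valid for (approximately) independent draws.
    se = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat, se


random.seed(1)
# Hypothetical stand-in for posterior draws of a Covid-19 coefficient.
beta_covid = [random.gauss(0.3, 0.2) for _ in range(20000)]

p_hat, se = mc_probability(beta_covid, lambda b: b > 0)
lower, upper = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"P(beta > 0) is about {p_hat:.3f}, approx. 95% MC interval ({lower:.3f}, {upper:.3f})")
```

The same helper can compare effect sizes by passing pairs of draws and an event such as "the Covid-19 coefficient is larger in magnitude than another coefficient".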

Important: It is possible to spend far too much time completing this coursework by trying to perfect your model. It is possible to achieve top marks by choosing just one model with careful justification and, as long as it converges and looks reasonable (it won't look perfect), to explore just a very small number of alternatives when exploring sensitivity and to critically evaluate it well (why it might not look reasonable). There are no extra marks for finding a "best" model, or for exploring all possible alternatives to the choices you made in your sensitivity analysis, so please focus your efforts accordingly.

Assignment 2:

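1. Suppose items can be classified into 1 of K distinct groups, where the probability of being placed in group j is $\theta_j$ and $\sum_{i=1}^{K} \theta_i = 1$. Suppose there are N such items and that the K-vector y represents the number of items in each group. So $y_i$ is the number of items in group i, and $\sum_{i=1}^{K} y_i = N$. Then $y \mid \theta_1, \ldots, \theta_K$ has a multinomial distribution, written
$$y \mid \theta_1, \ldots, \theta_K \sim \mathrm{Multin}(N, \theta_1, \ldots, \theta_K),$$
with pmf
$$P(Y = y \mid \theta_1, \ldots, \theta_K) = \frac{N!}{\prod_{i=1}^{K} y_i!} \prod_{i=1}^{K} \theta_i^{y_i}.$$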
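As a quick numerical illustration of the multinomial mass function for counts y and group probabilities theta, the sketch below evaluates it on the log scale, which is how it would normally be handled for state-sized vote counts. It is written in Python for illustration only; the function name `multinomial_log_pmf` and the example numbers are hypothetical.

```python
import math


def multinomial_log_pmf(y, theta):
    """Log of N! / (prod y_i!) * prod theta_i^y_i, with N = sum(y)."""
    n = sum(y)
    # lgamma(k + 1) = log(k!), which avoids overflow for large counts.
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(c + 1) for c in y)
    # Groups with a zero count contribute nothing, so skip them.
    log_prob = sum(c * math.log(t) for c, t in zip(y, theta) if c > 0)
    return log_coef + log_prob


# Three items, three groups: 3!/(2! 1! 0!) * 0.5^2 * 0.3^1 = 0.225
example = math.exp(multinomial_log_pmf([2, 1, 0], [0.5, 0.3, 0.2]))
print(example)
```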

2. Of course, the probabilities for each candidate (the $\theta$s) will not be the same for all states. Similar to logistic regression, we can model the probabilities in a multinomial regression as functions of covariates (I would recommend reading about generalised linear models in Chapter 16 of Bayesian Data Analysis 3, and this model in 16.6). As with any generalised linear model, the idea is to have a linear predictor
$$\eta = X\beta,$$
where matrix $\beta$ represents regression coefficients and matrix X contains the covariates. A link function then maps the $\eta$ to the parameters in the model, in our case to the multinomial probabilities $\theta$. The mapping used is the 'Softmax' function, where for some vector v,
$$\mathrm{softmax}(v)_i = \frac{e^{v_i}}{\sum_{j} e^{v_j}}.$$
So, if our X represents state dependent covariates, the probability that a random voter votes candidate k in state s is
$$\theta_{sk} = \frac{e^{\eta_{sk}}}{\sum_{j=1}^{K} e^{\eta_{sj}}}.$$
Whilst this ensures that the probabilities sum to 1, this model is problematic (statisticians say unidentifiable) because we know that $\sum_k \theta_{sk} = 1$. The remedy is to fix one vector of $\eta$ values, by default $\eta_{i1}$ for all i, to be 0. Show that this remedy and the use of Softmax implies
$$\eta_{sk} = \log\left(\frac{\theta_{sk}}{\theta_{s1}}\right)$$
for each state s.
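A small numerical check can make the identifiability issue concrete before attempting the algebra. The Python sketch below (illustrative only; the linear predictor values are hypothetical) shows that adding a constant to every entry of the linear predictor leaves the softmax probabilities unchanged, and that once the first entry is fixed at 0 the remaining entries equal log-ratios of probabilities against the first group. It is a sanity check, not a proof.

```python
import math


def softmax(v):
    """Softmax of a list, shifted by max(v) for numerical stability."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical linear predictor for one state, first entry fixed at 0.
eta = [0.0, 1.2, -0.7, 0.4]
theta = softmax(eta)

# Shifting every entry by the same constant gives identical probabilities,
# which is why the unconstrained model is unidentifiable.
shifted = softmax([x + 5.0 for x in eta])

# With the first entry fixed at 0, each entry is a log-ratio against group 1.
log_ratios = [math.log(t / theta[0]) for t in theta]
print(theta, log_ratios)
```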

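3. Let X be the $N_s \times p$ matrix of covariates to include (one row per state), and $\beta$ the $p \times K$ matrix of unknown coefficients with $\beta_{i1} = 0$ for all i. Assume all other $\beta$ values have independent Normal priors with constant variance $\sigma^2$, show that the log posterior density for $\beta$ can be written as
$$C - \frac{1}{2\sigma^2} \sum_{i=1}^{p} \sum_{j=2}^{K} \beta_{ij}^2 + \sum_{s=1}^{N_s} \sum_{k=1}^{K} y_{sk} \left( \sum_{j=1}^{p} X_{sj}\beta_{jk} - \log\left( \sum_{l=1}^{K} \exp\left( \sum_{j=1}^{p} X_{sj}\beta_{jl} \right) \right) \right),$$
where C is a constant that we can ignore during sampling.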
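For readers who want to see the shape of such a log posterior in code, here is a minimal sketch in Python (for illustration only; an R or Stan version would follow the same structure). It assumes zero-mean Normal priors with standard deviation `sigma`, a coefficient matrix whose first column is held at 0, and drops the constant C. The function name `log_posterior` and the argument layout are hypothetical choices, not part of the brief.

```python
import math


def log_posterior(beta, X, Y, sigma):
    """Unnormalised log posterior for a softmax (multinomial) regression.

    beta: p x K nested list, column 0 assumed fixed at 0 by the caller.
    X:    S x p nested list of covariates (one row per state).
    Y:    S x K nested list of vote counts.
    """
    p, K = len(beta), len(beta[0])
    # Zero-mean Normal prior on every free coefficient (columns 1..K-1).
    log_prior = -sum(beta[i][j] ** 2 for i in range(p) for j in range(1, K)) / (2.0 * sigma ** 2)
    log_lik = 0.0
    for x_s, y_s in zip(X, Y):
        # Linear predictor for each candidate in this state.
        eta = [sum(x_s[i] * beta[i][k] for i in range(p)) for k in range(K)]
        # Log-sum-exp, shifted by the maximum for numerical stability.
        m = max(eta)
        lse = m + math.log(sum(math.exp(e - m) for e in eta))
        log_lik += sum(y * (e - lse) for y, e in zip(y_s, eta))
    return log_prior + log_lik


# Tiny hypothetical example: one covariate, two candidates, one state.
value = log_posterior([[0.0, 1.0]], [[1.0]], [[1, 1]], sigma=1.0)
print(value)
```

Working with the log-sum-exp form avoids overflow when the linear predictor is large, which matters when covariates are on very different scales.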

4. Write an MCMC sampler that will draw samples from this posterior distribution for any choices of X. There are 2 main options. Option 1 is to use Stan. Stan has the advantage that once the code is correct, the sampler will tune itself to be efficient in most cases, meaning that if your model is good, it will ultimately converge if you let it warm up for long enough and you will have an easier time of analysis. The disadvantage is that it can be difficult to debug Stan code and it is sometimes not intuitive. If you choose the Stan route, you will find the user manuals relevant to your particular version of Stan critical to success: https://github.com/stan-dev/stan/releases. Softmax and the multinomial mass functions are given in the Stan function reference. Multi-logit regression is discussed in 1.6 of the user's guide (but note it is not, as described, exactly the same as our set up, because each of our observations is not a single y with K possible values, but a draw from a size N multinomial where N is the number of voters in a state). Option 2 is to code up the log posterior and either to write your own Metropolis-Hastings or (better) to use the adaptMCMC package. Whilst option 2 will probably be easier to set up, it may be much harder to get a good MCMC.
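To illustrate what a hand-written sampler for Option 2 involves, the sketch below shows a plain random-walk Metropolis sampler in Python (illustrative only; it is neither Stan nor the adaptive scheme of the R package mentioned above). The function name `random_walk_metropolis`, the fixed step size, and the toy one-dimensional Normal target are all assumptions; in practice the target would be the log posterior for the free coefficients and the step size would need tuning.

```python
import math
import random


def random_walk_metropolis(log_target, init, step_sd, n_iter, rng):
    """Random-walk Metropolis with independent Gaussian proposals per coordinate."""
    current = list(init)
    current_lp = log_target(current)
    samples, accepted = [], 0
    for _ in range(n_iter):
        proposal = [c + rng.gauss(0.0, step_sd) for c in current]
        proposal_lp = log_target(proposal)
        # Symmetric proposal, so the acceptance ratio is just the density ratio.
        # 1 - random() lies in (0, 1], which keeps the log finite.
        if math.log(1.0 - rng.random()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
            accepted += 1
        samples.append(list(current))
    return samples, accepted / n_iter


# Toy target (hypothetical): a Normal density with mean 2 and variance 1.
def toy_log_target(b):
    return -0.5 * (b[0] - 2.0) ** 2


rng = random.Random(42)
samples, rate = random_walk_metropolis(toy_log_target, [0.0], 1.0, 20000, rng)
kept = [s[0] for s in samples[5000:]]  # discard early iterations as burn-in
post_mean = sum(kept) / len(kept)
print(f"acceptance rate {rate:.2f}, posterior mean {post_mean:.2f}")
```

A fixed step size that works for one parameter rarely suits all of them, which is one reason a hand-written sampler can be harder to get mixing well than an adaptive or Stan-based one.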

5. Though there are millions of voters in our data set, we only really have 51 data points (the vote distributions amongst the candidates in the different states). With 5 candidates and 1 row fixed at 0, each covariate we introduce brings 4 parameters, so only a very small number should be selected and you will not need an intercept. With those ideas in mind, create a matrix of covariates X designed to understand what types of people vote for Kanye West and how he might affect other candidates' vote shares. Explain your choice of covariates and then run your MCMC until you feel it has converged. If you are having convergence issues, you may need to choose fewer X columns, and you may need to run any adaptation (e.g. in Stan) for longer. Demonstrate the convergence of your MCMC. Hint: You might use an indicator variable to indicate whether Kanye ran (or didn't run). You might also have to consider how to make sure that the probability of voting for Kanye when he didn't run is 0.
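One possible way to read the hint about forcing a zero probability is to drop the absent candidate from the softmax normalisation in states where he was not on the ballot. The Python sketch below illustrates that idea only; it is an assumption about one workable construction, not the approach the brief prescribes, and the name `masked_softmax` and the example values are hypothetical.

```python
import math


def masked_softmax(eta, on_ballot):
    """Softmax over candidates on the ballot; others get probability exactly 0.

    Assumes at least one candidate is on the ballot.
    """
    m = max(e for e, ok in zip(eta, on_ballot) if ok)
    exps = [math.exp(e - m) if ok else 0.0 for e, ok in zip(eta, on_ballot)]
    total = sum(exps)
    return [x / total for x in exps]


# Hypothetical state where the third candidate did not run.
probs = masked_softmax([0.0, 0.5, -1.0], [True, True, False])
print(probs)
```

Under this construction the relative odds between the remaining candidates are unchanged by the mask, which is itself a modelling assumption worth examining in the critical evaluation.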

6. Use your Bayesian analysis to infer anything you can about Kanye's voters, show which candidates Kanye's voters might be coming from, and if Kanye had run in all 51 states, what might have happened? You may give your answer as if your Bayesian analysis had converged and fitted the data well (even if neither); the critical evaluation is in the next part.

7. Give a critical evaluation of your model and your conclusions. Say which features of the model you have chosen may be causing any issues you see. What further data could be used to strengthen your Bayesian analysis and conclusions, and how would that data feed into your analysis (there will always only be 51 states' worth of data in the likelihood, so how can any problems with overfitting be overcome)?

Attachment:- part1 hw.rar
