7089CEM Introduction to Statistical Methods for Data Science - Coventry University
Coursework - Modelling brain signals using nonlinear regression
Learning Outcome 1: Demonstrate knowledge of underlying concepts in probability and statistics used in Data Science.
Learning Outcome 2: Select and apply appropriate statistical methods or techniques to solve problems or analyse data sets.
Learning Outcome 3: Use modern software to solve real world problems and analyse large data sets.
Learning Outcome 4: Interpret the results of their analyses and communicate those results accurately.
Task and Mark distribution:
Coursework Description:
The aim of this assignment is to select the regression model (from a candidate set of nonlinear regression models) that best describes the brain response to a sound signal. The ‘simulated’ data are assumed to have been collected during a neuromarketing experiment in which a participant listens to an advertisement while their brain response is recorded using magnetoencephalography (MEG). MEG is a widely used non-invasive method for recording brain activity. Specifically, MEG is recorded from the amygdala, a brain region involved in emotion processing. For the first 10 seconds the participant listens to an advertisement narrated by a neutral voice; during the next 10 seconds another advertisement, narrated by an emotional voice, is played. The regression model you are asked to identify will measure the auditory-brain interaction and the effect of the emotional narration. The researchers hypothesise that the emotional narration will evoke an increased brain response.
Data:
The ‘simulated’ MEG time-series data and the sound signal are provided in separate CSV files. The file X.csv contains the input sound signal x1 and the categorical variable x2 denoting which audio category is being played (i.e. x2 = 0 when the neutral audio is played, x2 = 1 when the emotional audio is played). The file y.csv contains the output MEG signal y. The file time.csv contains the sampling time of both signals in seconds. The output MEG signal is subject to additive noise, assumed to be independent and identically distributed (i.i.d.) Gaussian with zero mean and unknown variance, caused by distortions during recording.
Task 1: Preliminary data analysis
You should first perform an initial exploratory data analysis, by investigating:
• Time series plots (of input audio and output MEG signal).
• Distribution for each (input & output) signal.
• Correlation and scatter plots (between the audio input and output brain signal) to examine their dependencies.
• Boxplots of the output brain signal to examine the effect of the sound categories.
• You can perform the above preliminary data analysis for each type of input sound signal separately (i.e. when x2 = 0, and x2 = 1).
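The numerical side of the checks above can be sketched in Python as follows. This is only an illustration: the arrays are synthetic stand-ins generated from a made-up model, because the column layout of X.csv and y.csv is not shown here, and the helper name `five_number_summary` is hypothetical.

```python
import numpy as np

# Synthetic stand-in data (an assumption): the real x1, x2 and y come from
# X.csv and y.csv, whose exact column layout is not shown here.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)                    # input audio signal
x2 = (np.arange(n) >= n // 2).astype(int)  # 0 = neutral, 1 = emotional
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(scale=0.5, size=n)  # made-up response

def five_number_summary(values):
    """Min, quartiles and max: the quantities a boxplot draws."""
    return np.percentile(values, [0, 25, 50, 75, 100])

# Pearson correlation between input and output, overall and per category.
corr_all = np.corrcoef(x1, y)[0, 1]
corr_by_category = {c: np.corrcoef(x1[x2 == c], y[x2 == c])[0, 1] for c in (0, 1)}

# Boxplot statistics of the output for each audio category.
summary_by_category = {c: five_number_summary(y[x2 == c]) for c in (0, 1)}

print("correlation (all):", round(corr_all, 3))
for c in (0, 1):
    print(f"x2={c}: corr={corr_by_category[c]:.3f}, "
          f"summary={summary_by_category[c].round(2)}")
```

The same arrays can be passed to a plotting library for the time-series plots, histograms, scatter plots and boxplots.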
Task 2: Regression - modelling the brain response (MEG) to a sound signal
We would like to determine a suitable mathematical model to explain the relationship between the input audio signal and the output brain signal, and how this relationship changes with the content of the input audio signal (i.e. neutral versus emotional), assuming such a relationship can be described by a polynomial regression model. Below are 5 candidate nonlinear polynomial regression models, and only one of them can ‘truly’ describe such a relationship. The objective is to identify this ‘true’ model from the candidate models by following Tasks 2.1 - 2.6.
Task 2.1:
Estimate the model parameters $\theta = \{\theta_1, \theta_2, \cdots, \theta_{bias}\}^T$ for every candidate model using least squares, $\hat{\theta} = (x^T x)^{-1} x^T y$, with the provided sound input and output MEG datasets (use all the data for training).
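A minimal Python sketch of the least squares step is given below. The candidate models are not reproduced in this text, so the polynomial in `design_matrix` is a hypothetical example, and the data are synthetic and noise-free so that the estimate can be checked against known values.

```python
import numpy as np

def design_matrix(x1, x2):
    """Columns for a hypothetical model y = t1*x1 + t2*x1^2 + t3*x2 + bias."""
    return np.column_stack([x1, x1**2, x2, np.ones_like(x1)])

def least_squares(X, y):
    """Solve the normal equations (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Synthetic, noise-free data (an assumption) so the estimate can be checked.
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = (np.arange(100) >= 50).astype(float)
true_theta = np.array([0.8, -0.3, 1.5, 2.0])
X = design_matrix(x1, x2)
y = X @ true_theta

theta_hat = least_squares(X, y)
print(theta_hat)
```

Solving the normal equations with `np.linalg.solve` gives the same estimate as forming the inverse explicitly and is usually better conditioned numerically.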
Task 2.2:
Based on the estimated model parameters, compute the residual sum of squares (RSS), i.e. the sum of squared model errors, for every candidate model.
$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - x_i \hat{\theta})^2$
Here $x_i$ denotes the $i$th row (the $i$th data sample) of the input data matrix $x$, and $\hat{\theta}$ is a column vector.
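The sum can be written as a dot product of the residual vector with itself. The sketch below uses made-up numbers small enough to verify by hand; the parameter vector is arbitrary rather than a least squares estimate.

```python
import numpy as np

def rss(X, y, theta_hat):
    """Residual sum of squares: sum_i (y_i - x_i theta_hat)^2."""
    residuals = y - X @ theta_hat
    return float(residuals @ residuals)

# Tiny hand-checkable example (made-up numbers).
X = np.array([[1.0, 1.0], [2.0, 1.0], [3.0, 1.0]])  # columns: x, bias
y = np.array([2.0, 4.0, 7.0])
theta_hat = np.array([2.0, 0.0])                    # predictions: 2, 4, 6
print(rss(X, y, theta_hat))  # residuals are 0, 0, 1, so RSS = 1.0
```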
Task 2.3:
Compute the log-likelihood function for every candidate model:
$\ln p(D \mid \hat{\theta}) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\,\mathrm{RSS}$
Here $\sigma^2 = \mathrm{RSS}/(n - 1)$ is the variance of the model's residuals (prediction errors), with $n$ the number of data samples.
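As a sketch, the log-likelihood can be computed directly from the RSS and the sample count. The function name and the input values below are illustrative only.

```python
import math

def log_likelihood(rss, n):
    """Gaussian log-likelihood with sigma^2 = RSS / (n - 1), as defined above."""
    sigma2 = rss / (n - 1)
    return (-n / 2 * math.log(2 * math.pi)
            - n / 2 * math.log(sigma2)
            - rss / (2 * sigma2))

# Made-up values: RSS = 50 over n = 201 samples.
print(log_likelihood(50.0, 201))
```

With this choice of variance the last term always equals -(n - 1)/2, so for a fixed n the models differ only through the ln(σ²) term.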
Task 2.4:
Compute the Akaike information criterion (AIC) and Bayesian information criterion (BIC) for every candidate model:
$\mathrm{AIC} = 2k - 2\ln p(D \mid \hat{\theta})$
$\mathrm{BIC} = k\ln(n) - 2\ln p(D \mid \hat{\theta})$
Here $\ln p(D \mid \hat{\theta})$ is the log-likelihood obtained in Task 2.3 for each model, and $k$ is the number of estimated parameters in each candidate model.
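Both criteria follow directly from the log-likelihood. In the sketch below the model names, log-likelihoods and parameter counts are made up for illustration; lower values indicate the preferred model.

```python
import math

def aic(log_lik, k):
    """Akaike information criterion."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion."""
    return k * math.log(n) - 2 * log_lik

# Made-up log-likelihoods and parameter counts for two hypothetical models.
n = 201
candidates = {"model_a": (-150.0, 3), "model_b": (-149.5, 5)}
for name, (log_lik, k) in candidates.items():
    print(name, aic(log_lik, k), round(bic(log_lik, k, n), 2))
```

In this made-up case the slightly better fit of `model_b` does not compensate for its two extra parameters, so both criteria favour `model_a`.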
Task 2.5:
Check the distribution of the model prediction errors (residuals) for each candidate model. Plot the error distributions and evaluate whether they are close to Normal/Gaussian (as the output MEG signal has additive Gaussian noise), e.g. by using a Q-Q plot.
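The points of a normal Q-Q plot can be computed without a plotting library, as in the sketch below. The residuals are synthetic Gaussian draws (an assumption), and the plotting positions (i - 0.5)/n are one common convention among several.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantiles for a normal Q-Q plot."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    probs = (np.arange(1, n + 1) - 0.5) / n          # plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    sample = (r - r.mean()) / r.std(ddof=1)          # standardised residuals
    return theoretical, sample

# Synthetic residuals (an assumption) drawn from a Gaussian.
rng = np.random.default_rng(2)
theoretical, sample = qq_points(rng.normal(scale=0.3, size=500))

# Points close to the line sample = theoretical suggest Gaussian residuals;
# the correlation of the two sets of quantiles is a rough numeric summary.
print(np.corrcoef(theoretical, sample)[0, 1])
```

Plotting `sample` against `theoretical` gives the Q-Q plot; systematic curvature away from the diagonal points to skewed or heavy-tailed residuals.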
Task 2.6:
Select the ‘best’ regression model from the 5 candidate models according to the AIC, the BIC and the distribution of the model residuals, and explain why you chose this specific model.
Task 2.7:
Split the input (sound) and output (MEG) dataset (x and y) into two parts: one part used to train the model, the other used for testing (e.g. 70% for training, 30% for testing). For the selected ‘best’ model, 1) estimate the model parameters using the training dataset; 2) compute the model's output/prediction on the testing data; and 3) compute the 95% (model prediction) confidence intervals and plot them (with error bars) together with the model prediction and the testing data samples.
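One possible way to carry out these three steps is sketched below. Several choices are assumptions rather than requirements of the brief: the data and the polynomial model are synthetic and hypothetical, the split is random, the interval is for the mean prediction, and the normal quantile 1.96 is used in place of a t quantile.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x1 = rng.normal(size=n)
x2 = (np.arange(n) >= n // 2).astype(float)
X = np.column_stack([x1, x1**2, x2, np.ones(n)])   # hypothetical model
y = X @ np.array([0.8, -0.3, 1.5, 2.0]) + rng.normal(scale=0.5, size=n)

# Random 70/30 split.
idx = rng.permutation(n)
n_train = int(0.7 * n)
train, test = idx[:n_train], idx[n_train:]

# 1) Fit on the training part.
XtX_inv = np.linalg.inv(X[train].T @ X[train])
theta_hat = XtX_inv @ X[train].T @ y[train]

# 2) Predict on the testing part.
y_pred = X[test] @ theta_hat

# 3) 95% confidence interval for the mean prediction at each test row:
#    y_pred +/- z * sqrt(sigma2 * x0^T (X^T X)^-1 x0), with z = 1.96.
resid = y[train] - X[train] @ theta_hat
sigma2 = resid @ resid / (n_train - 1)             # variance as defined in Task 2.3
se = np.sqrt(sigma2 * np.einsum("ij,jk,ik->i", X[test], XtX_inv, X[test]))
lower, upper = y_pred - 1.96 * se, y_pred + 1.96 * se
print(y_pred[:3], lower[:3], upper[:3])
```

The arrays `y_pred`, `lower` and `upper` are what an error-bar plot would take as input, alongside the held-out `y[test]` values.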
Task 3: Approximate Bayesian Computation (ABC)
Use the ‘rejection ABC’ method to compute the posterior distributions of the parameters of the regression model selected in Task 2.
1) You only need to compute 2 parameter posterior distributions: those of the 2 parameters with the largest absolute values in your least squares estimation (Task 2.1) of the selected model. Fix all the other parameters in your model as constants, using the estimated values from Task 2.1.
2) Use a Uniform distribution as prior, centred around the estimated values of those 2 parameters (from Task 2.1). You will need to determine the range of the prior distribution.
3) Draw samples from the above Uniform prior, and perform rejection ABC for those 2 parameters.
4) Plot the joint and marginal posterior distribution for those 2 parameters.
5) Explain your results.
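The steps above can be sketched in Python as follows. This is one possible reading, not a prescribed implementation: the data and model are synthetic and hypothetical, the prior half-width of 50% of each estimate is an assumption, and so are the distance measure (the RSS of the model prediction) and the acceptance tolerance `epsilon`.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = (np.arange(n) >= n // 2).astype(float)
X = np.column_stack([x1, x1**2, x2, np.ones(n)])   # hypothetical model
y = X @ np.array([0.8, -0.3, 1.5, 2.0]) + rng.normal(scale=0.5, size=n)

theta_hat = np.linalg.solve(X.T @ X, X.T @ y)      # Task 2.1 estimate
rss_hat = np.sum((y - X @ theta_hat) ** 2)

# 1) The two parameters with the largest absolute estimates; others stay fixed.
i, j = np.argsort(np.abs(theta_hat))[-2:]

# 2) Uniform prior around each estimate (the +/-50% width is an assumption).
half_width = 0.5 * np.abs(theta_hat[[i, j]])
low, high = theta_hat[[i, j]] - half_width, theta_hat[[i, j]] + half_width

# 3) Rejection ABC: keep draws whose RSS is within a tolerance of the best fit.
n_draws = 20000
draws = rng.uniform(low, high, size=(n_draws, 2))
thetas = np.tile(theta_hat, (n_draws, 1))
thetas[:, [i, j]] = draws
distances = np.sum((y[None, :] - thetas @ X.T) ** 2, axis=1)
epsilon = 1.05 * rss_hat                           # tolerance (an assumption)
accepted = draws[distances <= epsilon]

# 4) The accepted samples approximate the joint posterior; each column is a marginal.
print(len(accepted), accepted.mean(axis=0), theta_hat[[i, j]])
```

A 2-D scatter or histogram of `accepted` gives the joint posterior, and a histogram of each column gives the marginals. A smaller `epsilon` tightens the approximation at the cost of fewer accepted samples.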
Attachment:- Statistical Methods for Data Science.rar