Reference no: EM133180774
Big Data Analytics
Repeat Assignment Worth 60% of Module Grade
1 Description and Submission Format
In this assignment you are tasked with analysing a data set using the R language. Submit a PDF document generated from your R Markdown file.
2 Data set
The data set to be used for the analysis is the Student Performance Data Set.
Tasks
Task 1
Your first task is to perform an exploratory analysis of the data set, which should give you a basic understanding of the data. First load your data from a file, then clean it as far as possible so that further analysis is easier. Finally, explore the data by visualising and summarising it. You should also examine the relationships between variables and check the "strength" of those relationships. Your report should include some of the plots and summaries, with explanations.
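The load, clean, summarise and relate steps described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data are synthetic, every column name (`sex`, `studytime`, `absences`, `G1`, `G3`) is a placeholder, and the `;` separator is a guess that should be checked against your own copy of the file.

```r
# Synthetic stand-in for the real file. Every column name here is an
# assumption for illustration, not taken from the assignment brief.
set.seed(42)
n <- 200
students <- data.frame(
  sex       = sample(c("F", "M"), n, replace = TRUE),
  studytime = sample(1:4, n, replace = TRUE),
  absences  = rpois(n, 5),
  G1        = round(pmin(pmax(rnorm(n, 11, 3), 0), 20))
)
students$G3 <- round(pmin(pmax(students$G1 + rnorm(n, 0, 2), 0), 20))
students$G3[sample(n, 5)] <- NA          # inject a few missing grades

# 1. Load: write the frame to a temporary CSV and read it back.
path <- tempfile(fileext = ".csv")
write.table(students, path, sep = ";", row.names = FALSE)
raw <- read.csv(path, sep = ";", stringsAsFactors = TRUE)

# 2. Clean: drop rows with missing values, then drop duplicated rows.
clean <- unique(na.omit(raw))

# 3. Summarise: structure and per-column summaries.
str(clean)
print(summary(clean))

# 4. Visualise. pdf(NULL) draws to a null device so the sketch also runs
#    non-interactively; omit it (and dev.off()) inside an R Markdown chunk.
pdf(NULL)
hist(clean$G3, main = "Final grade", xlab = "G3")
boxplot(G3 ~ sex, data = clean, ylab = "G3")
plot(clean$G1, clean$G3, xlab = "G1", ylab = "G3")
invisible(dev.off())

# 5. Relationships: pairwise correlations of the numeric columns, plus a
#    test of the strength of one relationship (Pearson by default).
cor_mat <- cor(clean[sapply(clean, is.numeric)])
print(round(cor_mat, 2))
ct <- cor.test(clean$G1, clean$G3)
print(ct$estimate)
print(ct$p.value)
```

With real data, dropping every incomplete row may discard too much. Whether to drop or impute is a judgement call that the report should explain.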
Task 2
The second task is quite open. You have done a preliminary exploration of the data set. At this point you should understand the domain of your data set, and you should have seen how different attributes of the data look. Your final goal is to report some findings (or the lack of them), supported by evidence that they are statistically sound. The following points are just hints of what might be interesting to do or look at:
• Take a look at plots you have created in the first part - what conclusions can be drawn based on them? These could be your hypotheses.
• The data contains categorical variables - is there a difference between instances belonging to one category and the other? Even if you do not see clear differences, you could perform a statistical test to check whether some properties change across categories.
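As one possible reading of the hints above, the sketch below tests whether a numeric property differs between two categories and whether two categorical variables are associated. The data are synthetic and the column names (`sex`, `school`, `G3`) are assumptions. The choice of tests is illustrative too, since the brief does not prescribe any.

```r
# Synthetic stand-in; column names are assumptions for illustration only.
set.seed(1)
n <- 200
students <- data.frame(
  sex    = factor(sample(c("F", "M"), n, replace = TRUE)),
  school = factor(sample(c("GP", "MS"), n, replace = TRUE)),
  G3     = round(pmin(pmax(rnorm(n, 11, 3), 0), 20))
)

# Hypothesis: mean final grade differs between the two sex categories.
# t.test() with a formula runs Welch's two-sample t-test by default.
tt <- t.test(G3 ~ sex, data = students)
print(tt)

# Non-parametric alternative if the grades look clearly non-normal.
# exact = FALSE requests the normal approximation, which copes with ties.
wt <- wilcox.test(G3 ~ sex, data = students, exact = FALSE)
print(wt$p.value)

# Association between two categorical variables: chi-squared test on
# their contingency table.
ch <- chisq.test(table(students$sex, students$school))
print(ch)
```

A large p-value here is itself a reportable result: it is the "lack of findings" case mentioned above, provided the report states which test was used and why.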
Task 3
• Perform linear regression with multiple variables to predict the student grade. Normalize the data and repeat the regression on the normalized data. Highlight the difference in prediction accuracy between the two data sets.
• Perform classification to predict an appropriate categorical variable. Normalize the data and repeat the classification on the normalized data. Highlight the difference in prediction accuracy between the two data sets.
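A minimal sketch of both comparisons follows, under several assumptions: the data are synthetic, the column names and the pass mark of 10 are placeholders, min-max scaling stands in for "normalize", and `knn_predict` is a small hypothetical helper written only to keep the example in base R (the `class` package provides `knn()` for real use).

```r
# Synthetic stand-in; column names and the pass mark are assumptions.
set.seed(7)
n <- 300
d <- data.frame(
  studytime = sample(1:4, n, replace = TRUE),
  failures  = sample(0:3, n, replace = TRUE, prob = c(.7, .15, .1, .05)),
  absences  = rpois(n, 6)
)
d$G3 <- pmin(pmax(round(10 + 1.2 * d$studytime - 2 * d$failures -
                        0.1 * d$absences + rnorm(n, 0, 2)), 0), 20)
d$pass <- factor(ifelse(d$G3 >= 10, "yes", "no"))

# Hold out 30% of the rows for testing.
idx   <- sample(n, size = round(0.7 * n))
train <- d[idx, ]
test  <- d[-idx, ]
preds <- c("studytime", "failures", "absences")

# Min-max scaling using the training ranges, applied to both splits.
mins <- sapply(train[preds], min)
rngs <- sapply(train[preds], max) - mins
scale01 <- function(df) {
  df[preds] <- scale(df[preds], center = mins, scale = rngs)
  df
}
train_s <- scale01(train)
test_s  <- scale01(test)

# Linear regression with multiple variables, raw vs. normalised.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
f <- G3 ~ studytime + failures + absences
rmse_raw  <- rmse(test$G3,   predict(lm(f, data = train),   newdata = test))
rmse_norm <- rmse(test_s$G3, predict(lm(f, data = train_s), newdata = test_s))

# Classification with a hypothetical base-R k-nearest-neighbours helper.
knn_predict <- function(train_x, train_y, test_x, k = 5) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d2    <- colSums((t(train_x) - row)^2)   # squared Euclidean distances
    votes <- train_y[order(d2)[seq_len(k)]]  # labels of the k nearest rows
    names(which.max(table(votes)))           # majority vote
  })
}
acc <- function(actual, predicted) mean(as.character(actual) == predicted)
acc_raw  <- acc(test$pass,
                knn_predict(train[preds], train$pass, test[preds]))
acc_norm <- acc(test_s$pass,
                knn_predict(train_s[preds], train_s$pass, test_s[preds]))

print(c(rmse_raw = rmse_raw, rmse_norm = rmse_norm))
print(c(acc_raw = acc_raw, acc_norm = acc_norm))
```

For ordinary least squares the two RMSE values should agree up to rounding, because rescaling the predictors changes the coefficients but not the fitted values. Distance-based classifiers such as k-nearest neighbours depend on the scale of each variable, so their accuracy may change after normalisation.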
Submission
Write your code in an R Markdown document to present your preliminary data analysis in the form of a report. Do not put all of the plots in the report; decide which are useful and what might be interesting to explore. Use multidimensional plots to present multiple variables.
• You can also get up to 10 points for clarity and quality of the report and the source code.
• Acceptable file format: knit your R Markdown document to PDF output. Use the submission link on Moodle to upload your final PDF report.