Predictive Analytics Project
Project Overview
This assessment involves writing a report that summarizes a statistical learning-related investigation that you have conducted on the data provided.
The data set is provided, and an objective is very broadly defined, but it is up to you to clarify in your report exactly what you will investigate, your methods, and your results. It is quite possible that different students might come up with very different yet still valid analyses based on the same data set. You don't necessarily have to use all of the data, and you don't necessarily need to provide an exhaustive analysis that extracts every possible shred of information from this data, but you do need to clearly document your targeted investigation and how your results relate to the broad objective provided. You don't need to use every possible method, but there should be some justification for the methods that you do use.
We don't expect a Nobel prize-winning analysis, but your report should demonstrate that:
You have grasped important concepts associated with this course.
You can communicate your investigation in a formal written manner.
Your investigation should use R or Python to analyse data using methods from at least two of the following course content areas:
Classification (using LDA, QDA, KNN, or logistic regression)
Linear regression, possibly including regularisation
Decision trees and/or ensembles for classification or regression
Principal Component Analysis (PCA)
Cluster Analysis
Outlier Detection
Support Vector Machines
Formal Quantitative Assessment of Results (e.g., cross-validation for model assessment and selection in regression or classification)
Qualitative Assessment through Visualisation of Results (e.g. visualisation and interpretation of clustering hierarchies, visual interpretation of PCA, etc.)
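As an illustration of the formal quantitative assessment mentioned in the list above, the following is a minimal sketch of k-fold cross-validation used to choose the number of neighbours for a KNN classifier. It uses only the Python standard library and synthetic two-class data; the function names (make_data, knn_predict, cv_accuracy), the candidate values of k, and the data itself are all assumptions made for illustration and are not part of the project data or requirements. In practice a library such as scikit-learn would usually be used instead.

```python
import random
from collections import Counter


def make_data(n=120, seed=0):
    """Synthetic two-class data: two Gaussian blobs in 2-D (illustrative only)."""
    rng = random.Random(seed)
    data = []
    for i in range(n):
        label = i % 2
        centre = 0.0 if label == 0 else 2.0
        point = (rng.gauss(centre, 1.0), rng.gauss(centre, 1.0))
        data.append((point, label))
    return data


def sq_dist(a, b):
    """Squared Euclidean distance between two 2-D points."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def knn_predict(train, point, k):
    """Majority vote among the k training points nearest to `point`."""
    nearest = sorted(train, key=lambda row: sq_dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def cv_accuracy(data, k, n_folds=5, seed=1):
    """Mean held-out accuracy of k-NN over n_folds cross-validation folds."""
    rows = data[:]
    random.Random(seed).shuffle(rows)
    folds = [rows[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test in enumerate(folds):
        # Train on every fold except the one currently held out.
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        correct = sum(knn_predict(train, p, k) == y for p, y in test)
        scores.append(correct / len(test))
    return sum(scores) / n_folds


data = make_data()
# Odd values of k avoid tied votes in a two-class problem.
results = {k: cv_accuracy(data, k) for k in (1, 5, 15)}
best_k = max(results, key=results.get)
for k, acc in results.items():
    print(f"k={k:2d}  CV accuracy={acc:.3f}")
print("selected k:", best_k)
```

The same folds are reused for every candidate k (fixed shuffle seed), so the comparison between models is not confounded by different random splits.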
If you need to clarify the project requirements or if you have technical issues, you may post in the appropriate Discussion Forum, but your post should not give any indication of your planned investigation, methods or results.
Your submission must consist of your own work in accordance with the Academic Integrity Policy.
For students who prepare their work using Python: You need to submit a source code file (.ipynb) and a PDF report. The markers may NOT necessarily run your code, but they may refer to this file while marking your report, if some clarification is needed.
The report should have the following sections marked clearly:
Title: In today's busy world, it is very important to make the most of your title. Make the title concise (fewer than 20 words) and 'eye-catching', yet an informative and accurate representation of the contents of the report. The author should be listed below the title.
Abstract: The abstract provides a brief overview of the report contents in around 200-300 words. An abstract typically consists of:
Introductory statement: background to the study, important issue(s) the report addresses. (approximately 1 to 2 sentences)
Purpose of the report: state the objectives (1-2 sentences)
Methodological approach: overview the data and methods (2-3 sentences)
Findings or Achievements: list one or two of the main findings or achievements from your investigation (1-2 sentences)
Conclusions and Implications: what conclusions can be drawn from your investigation? How can the findings/achievements in your report deliver a benefit to people, things, systems or processes? (1-2 sentences)
Introduction: The introduction sets the scene for the investigative efforts. It provides motivation for the work and relevant background information and references that will enable the reader to put in context the key objectives and achievements in your report. Address the important issues that have motivated your investigation. At the end of the introduction clearly state the objectives of the report. Do not put any results from your investigation in the introduction. Do not discuss details about the data and methods in this section. Do not discuss your conclusions or key findings in the introduction.
Data: This section should enable the reader to understand how the data was obtained, pre-processed (if applicable) and what the data represent.
Methods: This section should summarise the statistical learning methods that were used to process and to analyse the data. The methods should be appropriate to ensure that the objectives of the report are met. You are strongly encouraged to interleave your text with key calls to R functions that generate relevant results that you may want to highlight, just like the weekly course notes and labs. This can be achieved straightforwardly using R Markdown. You can use the R Markdown chunk option echo=FALSE to hide chunks of code that you judge less relevant, but these must still be present in the source code for verification, if necessary. Your code will be assessed for correctness, organisation and clarity as part of the Methods section. However, there should be enough detail in your textual description for your methodology to be repeated by an independent person, without having to refer to and decipher your code.
Results and Discussion: The results are explained correctly, clearly, and in sufficient detail. The results and discussion clearly follow from the data collection and the methods. (In fact, although it is more usual to have separate sections, you may find it more appropriate to merge the Results and Discussion with your explanation of the methodology.) The discussion centres on the outputs from the statistical learning procedures that you have performed. For example, what are the main outcomes? Why are they useful and what for? How are they interesting and why? What are the main achievements and their implications?
Conclusions: Final remarks about the key achievements of the investigations and what makes them "interesting" or "useful". Achievements or findings should be linked with the original objectives or hypotheses of the project. Are there any recommended actions from your analysis? Make sure that you mention any limitations of your work here. Limit the conclusion to no more than two or three paragraphs.
References: List any sources your investigation has drawn from. Note that all references should be referred to in the text.
Appendices: Appendices can be useful when the incorporation of material in the body of the work would make it poorly structured or too long and detailed. Appendices may be used for helpful, supporting or essential material that would otherwise clutter, break up or be distracting to the text.
In addition to the marks allocated to each section, there are an extra 10 marks associated with the general quality of the writing and presentation. For example, in a high-quality report:
The material is coherently organized and the logic is easy to follow.
There are no spelling or grammatical errors and terminology is clearly defined.
Writing is clear, concise and persuasive.
Each Figure/Table is numbered, followed by a caption, and referred to in the body of the text, most noticeably in the results and/or discussion section.
The Figures/Tables provided reinforce the most relevant achievements of the work.
Any references are listed at the end of the report, with citations in the body of the report.
Appendices are appropriately used.
The report proper (i.e. not including references or appendices) should be no more than about 10 single-column pages when printed in A4 format using R Markdown default settings for font, font size, line spacing, margins, etc. If you exceed this recommendation (including through inappropriate use of appendices to include extra material) then the marker may stop reading (and marking) after 10 pages.
World Bank Climate Change Data
Climate change is the critically important issue of our time and affects everyone on the planet. It is also a subject where there is a huge amount of data collected, analysed, written about and argued about. The wbcc bc.csv file is available on Canvas. The variables (columns) correspond to the country characteristics or indicators collected by the World Bank that it has categorised as relevant to climate change. Each record (row) corresponds to a particular country. Although much climate change analysis focusses on how things are changing over time, this data set is cross-sectional rather than longitudinal. It is a snapshot of a single recent value of these characteristics for each country. Table 3 lists the indicators and their descriptions.
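Because each row is a country and each column an indicator, a natural first step is to check how complete each indicator is before choosing which ones to analyse. The sketch below shows one way this might be done with the Python standard library. The sample text, the column names (country, co2_per_capita, forest_area_pct, renewable_energy_pct) and the set of missing-value codes are all invented for illustration; the real file's layout and coding of missing values should be checked directly.

```python
import csv
import io

# Tiny made-up stand-in for the real file. Country labels, indicator names
# and values are invented purely for illustration.
SAMPLE = """country,co2_per_capita,forest_area_pct,renewable_energy_pct
A,15.2,17.4,9.1
B,1.8,,35.0
C,..,31.2,
D,7.5,33.9,12.3
"""

# Assumed missing-value codes; verify against the actual data file.
MISSING = {"", "..", "NA"}


def load_indicators(handle, id_column="country"):
    """Return (countries, {indicator: [float or None, ...]}) from a CSV handle."""
    reader = csv.DictReader(handle)
    indicators = [name for name in reader.fieldnames if name != id_column]
    countries = []
    columns = {name: [] for name in indicators}
    for row in reader:
        countries.append(row[id_column])
        for name in indicators:
            raw = row[name].strip()
            columns[name].append(None if raw in MISSING else float(raw))
    return countries, columns


def missing_fraction(columns):
    """Fraction of countries with no recorded value, per indicator."""
    return {
        name: sum(value is None for value in values) / len(values)
        for name, values in columns.items()
    }


countries, columns = load_indicators(io.StringIO(SAMPLE))
missing = missing_fraction(columns)
for name, frac in missing.items():
    print(f"{name}: {frac:.0%} missing")
```

To run this against a real file, the io.StringIO(SAMPLE) argument would be replaced by an open file handle, for example one obtained from open(path, newline="").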
Your task is to explore this data set and report on any interesting and relevant relationships you may discover between these indicators. The lack of an expected relationship might also be interesting. Do these climate change related indicators suggest particular groupings of countries? How does Australia compare to other countries in the world based on these indicators? Are there any findings that you think would be relevant to highlight to Australian leaders or to world leaders at the UN Climate Change Conference later this month in Glasgow?
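One possible way to approach the questions about country groupings and about how a single country compares with others is to standardise the indicators and then apply a clustering method. The sketch below uses a plain k-means (Lloyd's algorithm) written with the Python standard library. The six countries, their two indicator values, the choice of two clusters and the starting centres are all hypothetical and chosen only to make the example small and deterministic; it is not a recommended analysis of the actual data.

```python
import math

# Invented values of two indicators for six hypothetical countries (not real data).
DATA = {
    "Country A": [16.0, 17.0],
    "Country B": [15.0, 20.0],
    "Country C": [14.5, 15.0],
    "Country D": [1.0, 45.0],
    "Country E": [2.0, 50.0],
    "Country F": [1.5, 40.0],
}


def standardise(rows):
    """Z-score each column so indicators on different scales are comparable."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((x - m) ** 2 for x in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(x - m) / s for x, m, s in zip(row, means, sds)] for row in rows]


def kmeans(points, start_centres, max_iter=50):
    """Lloyd's algorithm from given starting centres; returns one label per point."""
    centres = [list(c) for c in start_centres]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centre.
        new_labels = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centre to the mean of its members.
        for j in range(len(centres)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster empties
                centres[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


names = list(DATA)
z = standardise(list(DATA.values()))
# Deterministic start: the first and fourth countries (an arbitrary choice).
labels = kmeans(z, [z[0], z[3]])

focus = "Country A"  # hypothetical country of interest
focus_label = labels[names.index(focus)]
peers = [n for n, lab in zip(names, labels) if lab == focus_label and n != focus]
print(f"{focus} is grouped with: {', '.join(peers)}")
```

Standardising first matters because k-means relies on Euclidean distance, so an indicator measured on a larger scale would otherwise dominate the grouping. In a real analysis the number of clusters and the starting centres would need justification, for example by comparing several choices.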