Reference no: EM132931146
5011CEM Big Data Programming Project - Coventry University
Learning Outcome 1: COMPUTATIONAL THINKING: develop and understand algorithms to solve problems; measure and optimise algorithm complexity; appreciate the limits of what may be done algorithmically in reasonable time or at all.
Learning Outcome 2: PROGRAMMING: create working solutions to a variety of computational and real-world problems using multiple programming languages chosen as appropriate for the task.
Learning Outcome 3: DATA SCIENCE: work with (potentially large) datasets; using appropriate storage technology; applying statistical analysis to draw meaningful conclusions; and using modern machine learning tools to discover hidden patterns.
Learning Outcome 4: SOFTWARE DEVELOPMENT: develop a product from the initial stage of requirement / analysis all the way through development to its final stages of testing / evaluation.
Learning Outcome 5: PROFESSIONAL PRACTICE: understand professional practices of the modern IT industry which include those technical (e.g. version control / automated testing) but also social, ethical & legal responsibilities.
Learning Outcome 6: TRANSFERABLE SKILLS: apply a wide variety of degree-level transferable skills including time management, team working, written and verbal presentation to both experts and non-experts, and critical reflection on own and others' work.
Learning Outcome 7: ADVANCED WORK: apply the above to advanced topics selected according to the interests of individual students.
Assessment Overview
Over the course of this module you have been introduced to a range of techniques that may be used for programming a big data project. This assessment allows you to pull together these techniques in a realistic scenario to complete a big data analysis project.
Below is a realistic project scenario. By using the techniques presented during class you are to carry out the project and write a final project report for your client.
Project Scenario
You have been approached by a client who analyses atmospheric science and climate model data. They have developed a new analysis technique, but it currently takes too long to run to be usable. They have asked you to investigate the use of big data techniques to reduce the processing time.
They have a large volume of data to process, and the analysis needs to be repeated frequently. They have the following basic requirements:
1. Current analysis time is approximately 2.5 hours to analyse the climate model output data for a 1-hour time period.
2. The data for a single day of model output is approximately 250 MB. However, they have over 100 years' worth of data to analyse, making a total of over 9 TB.
3. Each day, they need to analyse the new data set for that day, so they wish to complete the analysis of the data for a 24-hour period (25 data sets) in under 2 hours.
4. It is not possible to hold all of this data in memory at one time, so the new process should load only 1 hour of data for processing at a time. If parallel processing is used, then 1 hour of data per worker can be loaded as needed.
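The figures in the requirements above can be checked with some quick arithmetic. The following is a minimal sketch that uses only the numbers quoted in the brief; the constant names are illustrative, a 365-day year and decimal units (1 TB = 1,000,000 MB) are assumed, and the worker count assumes ideal linear scaling, which real code rarely achieves.

```python
import math

# Figures quoted in the brief
HOURS_PER_DATASET = 2.5   # current analysis time for one hourly data set
DATASETS_PER_DAY = 25     # data sets covering one 24-hour period
TARGET_HOURS = 2.0        # required time for a full day's analysis
DAILY_MB = 250            # size of one day of model output
YEARS = 100               # length of the archive

# Assumption: 365-day years and decimal units (1 TB = 1,000,000 MB)
total_tb = DAILY_MB * 365 * YEARS / 1_000_000

# Time to process one day of data on a single processor
sequential_hours = HOURS_PER_DATASET * DATASETS_PER_DAY

# Speedup needed to meet the target, and the matching worker count
# under ideal linear scaling (an optimistic lower bound)
required_speedup = sequential_hours / TARGET_HOURS
min_workers = math.ceil(required_speedup)

print(f"Archive size:     {total_tb:.3f} TB")            # 9.125 TB
print(f"Sequential time:  {sequential_hours:.1f} hours")  # 62.5 hours
print(f"Required speedup: {required_speedup:.2f}x")       # 31.25x
print(f"Minimum workers:  {min_workers}")                 # 32

# Note: a single hourly data set currently takes longer than the whole
# 2-hour target, so on these figures the work inside each hour would
# also need to be sped up or split, not just spread across hours.
```

On these figures the lower bound is 32 workers. Measured timings from the actual code should replace the ideal-scaling assumption before any figure is reported to the client.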
You have been tasked with investigating the use of parallel processing to achieve the analysis speed required, with the following expectations:
1. Test and compare the processing speed of sequential and parallel processing.
2. Extrapolate your findings to indicate the number of processors required to achieve the target processing time.
3. Test how your code responds to common errors, e.g. data that is text instead of numeric, use of NaN in the data as an error code.
4. Run automated tests that allow your client to set the tests running and return later to see the results, without user intervention.
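Expectations 1 and 3 above might be prototyped along the following lines. This is a minimal sketch using only the Python standard library, not the client's method: `analyse_hour` is a hypothetical stand-in for the real analysis (here just a mean), and `clean_values` and `time_run` are illustrative names. It assumes each hourly data set arrives as a sequence of raw readings.

```python
import math
import time
from concurrent.futures import ProcessPoolExecutor


def clean_values(raw_values):
    """Convert raw readings to floats, rejecting text and NaN error codes."""
    cleaned = []
    for index, raw in enumerate(raw_values):
        try:
            value = float(raw)
        except (TypeError, ValueError):
            raise ValueError(
                f"non-numeric value {raw!r} at position {index}"
            ) from None
        if math.isnan(value):
            raise ValueError(f"NaN error code at position {index}")
        cleaned.append(value)
    return cleaned


def analyse_hour(raw_values):
    """Hypothetical placeholder for the analysis of one hour of data."""
    values = clean_values(raw_values)
    if not values:
        raise ValueError("empty data set")
    return sum(values) / len(values)


def time_run(hourly_data, workers=1, executor_cls=ProcessPoolExecutor):
    """Analyse each hourly data set and return (results, elapsed seconds).

    workers=1 runs sequentially; otherwise the data sets are distributed
    across a pool, one hourly data set per task.
    """
    start = time.perf_counter()
    if workers == 1:
        results = [analyse_hour(hour) for hour in hourly_data]
    else:
        with executor_cls(max_workers=workers) as pool:
            results = list(pool.map(analyse_hour, hourly_data))
    return results, time.perf_counter() - start
```

Calling `time_run(hourly_data, workers=1)` and then `time_run(hourly_data, workers=4)` on the same data gives two timings to compare, and repeating the call for several worker counts gives the points needed for the extrapolation in expectation 2. When using the default process pool, the calls should sit under an `if __name__ == "__main__":` guard so that worker processes can import the script safely. A data set containing text or NaN raises a `ValueError` that names the offending position, which an unattended test run can log before moving on.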
Attachment:- project_report_brief.rar