Reference no: EM133114188
Cluster and Cloud Computing Assignment
Problem Description
Your task in this programming assignment is to implement a simple, parallelized application that leverages the University of Melbourne HPC facility SPARTAN. Your application will use a large Twitter dataset and a grid/mesh for Sydney to identify the languages used in tweets. Your objective is to count, for each of the given cells, the number of different languages used for tweets and the number of tweets in each of those languages, and hence to quantify the multicultural nature of Sydney!
You should be able to log in to SPARTAN by running the attached command:
If you are a Windows user then you may need to install an application such as PuTTY to run ssh. (If you are connecting from elsewhere with different firewall rules, then you may need to use a VPN.)
The files to be used in this assignment are accessible at:
You should make a symbolic link to these files, i.e. you should run the following commands at the Unix prompt from your own user directory on SPARTAN:
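The exact commands and shared directory are given in the hand-out; purely as a sketch, with a hypothetical shared path, creating the symbolic links might look like this:

```shell
# Hypothetical shared path - substitute the actual directory from the hand-out.
SHARED=/data/projects/COMP90024
mkdir -p "$HOME/assignment1" && cd "$HOME/assignment1"

# Symlink the large data files rather than copying them into your home directory.
ln -sf "$SHARED/bigTwitter.json" bigTwitter.json
ln -sf "$SHARED/sydGrid.json" sydGrid.json

ls -l bigTwitter.json sydGrid.json   # confirm where the links point
```

Symlinking avoids duplicating a multi-gigabyte file into your (quota-limited) user directory.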
The sydGrid.json file includes the latitudes and longitudes of a range of gridded boxes as illustrated in the figure below, i.e., the latitude and longitude of each of the corners of the boxes is given in the file.
Your assignment is to (eventually!) search the large Twitter dataset (bigTwitter.json) and, using the language of each tweet and the tweet location (lat/long), count the total number of tweets in a given cell that are made in each language. The final result will be a score for each cell in the following format, where the numbers are purely representative.
Here cell A1 has 11,111 tweets in total, with 11 different languages used; the most popular is English (9,000 tweets), followed by Chinese (555 tweets) and French (444 tweets), with the 10th most popular being Greek (66 tweets). Cell A2 has 22 languages used for tweeting, with the most popular being English (21,000 tweets), then Turkish (77 tweets) and Swedish (66 tweets), and French being the 10th most popular language (2 tweets).
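A per-cell summary of this shape can be produced from a mapping of language to tweet count. A minimal sketch using Python's `collections.Counter` (the function name and the sample counts are my own, purely for illustration):

```python
from collections import Counter

def summarise_cell(cell, lang_counts):
    """Return (cell, total tweets, number of languages, top-10 languages)."""
    top10 = lang_counts.most_common(10)
    return cell, sum(lang_counts.values()), len(lang_counts), top10

# Made-up counts purely for illustration.
a1 = Counter({"English": 9000, "Chinese": 555, "French": 444, "Greek": 66})
cell, total, nlangs, top = summarise_cell("A1", a1)
print(f"{cell}: {total} total tweets, {nlangs} languages, "
      + ", ".join(f"{lang} ({n})" for lang, n in top))
```

`Counter.most_common(10)` gives the ten most popular languages in descending order, which matches the ranking described above.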
You may treat Simplified Chinese (zh-cn) and Traditional Chinese (zh-tw) as both being Chinese. Tweets with a null or undefined (und) language attribute can be ignored. Further information on the languages that might be used for tweeting is given in the linked reference. Tweets with no location information can be ignored, as can tweets made outside of the grid.
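These filtering rules can be captured in a small normalisation helper; the sketch below is one way to do it (the function name is my own):

```python
def normalise_language(code):
    """Map a tweet's language code per the assignment rules.

    Returns a canonical code, or None if the tweet should be ignored.
    """
    if code is None or code == "und":
        return None          # null/undefined languages are ignored
    if code in ("zh-cn", "zh-tw"):
        return "zh"          # both Chinese variants count as Chinese
    return code
```

Applying this before counting means zh-cn and zh-tw fall into a single bucket, and ignorable tweets are dropped early.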
If a tweet occurs exactly on the border of two cells, e.g. on the B1/B2 cell border, then assume the tweet occurs in B1 (i.e., the cell on the left). If a tweet occurs exactly on the border between B2/C2, then assume the tweet occurs in C2 (i.e., the cell below). If a tweet occurs anywhere else on the boundary of a cell, e.g. the upper or leftmost border of A1, then it can be regarded as being in cell A1.
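These tie-break rules can be implemented with a binary search over the grid's boundary coordinates. The sketch below assumes the grid can be reduced to ascending lists of latitude and longitude edges (the actual corner layout in sydGrid.json may differ), with rows labelled A, B, ... from north to south and columns 1, 2, ... from west to east:

```python
from bisect import bisect_left

def assign_cell(lat, lon, lat_edges, lon_edges):
    """Assign a point to a grid cell under the assignment's tie-break rules.

    lat_edges/lon_edges are ascending cell-boundary coordinates.  A point on a
    vertical border belongs to the cell on its left; a point on a horizontal
    border belongs to the cell below.  Returns e.g. "B2", or None if the point
    falls outside the grid.
    """
    if not (lon_edges[0] <= lon <= lon_edges[-1]
            and lat_edges[0] <= lat <= lat_edges[-1]):
        return None
    # Column: a point exactly on an internal edge goes to the left cell.
    col = max(1, bisect_left(lon_edges, lon))
    # Row band counted from the south: a point exactly on an internal edge
    # goes to the band below (south of) that edge.
    band = max(0, bisect_left(lat_edges, lat) - 1)
    row = chr(ord("A") + (len(lat_edges) - 2 - band))  # top-down row letter
    return f"{row}{col}"
```

With a 4x4 unit grid (`lat_edges = lon_edges = [0, 1, 2, 3, 4]`), a point exactly on the B1/B2 border lands in B1, and a point exactly on the B2/C2 border lands in C2, as required.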
Your application should allow a given number of nodes and cores to be utilized. Specifically, your application should be run once to search the bigTwitter.json file on each of the following resources:
• 1 node and 1 core;
• 1 node and 8 cores;
• 2 nodes and 8 cores (with 4 cores per node).
The resources should be set when submitting the search application with the appropriate SLURM options. Note that you should run a single SLURM job once on each of the three resource configurations given here; you should not, for example, run the same job 3 times on 1 node and 1 core to benchmark the application. (This is a shared facility, and the many COMP90024 students will consume a lot of resources!)
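As an illustrative sketch only, a SLURM submission script for the 2-node/8-core configuration might look like the following; the module names and script name are assumptions, so check the SPARTAN documentation for the actual values:

```shell
#!/bin/bash
# Sketch only - job name, modules and script name are placeholders.
#SBATCH --job-name=twitter-lang
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=4     # 2 nodes x 4 cores = 8 cores in total
#SBATCH --time=00:30:00

# module load foss mpi4py       # site-specific; adjust to SPARTAN's modules
mpirun -np 8 python3 count_languages.py bigTwitter.json sydGrid.json
```

The other two configurations change only the resource lines: `--nodes=1 --ntasks-per-node=1` for 1 node/1 core, and `--nodes=1 --ntasks-per-node=8` for 1 node/8 cores.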
You can implement your solution using any routines that you wish from existing libraries; however, it is strongly recommended that you follow the guidelines provided on access and use of the SPARTAN cluster. Do not, for example, assume that the job scheduler/SPARTAN automatically parallelizes your code - it doesn't! You may wish to use the pre-existing MPI libraries that have been installed for C, C++ or Python. You should feel free to make use of the Internet to identify which JSON processing libraries you might use.
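One common pattern for parallelizing this kind of scan over a line-oriented JSON file - an approach you could take, not the prescribed one - is to split the file into one byte range per MPI rank and have each rank count languages over only the complete lines in its range. The helpers below sketch that idea in plain Python (the `doc.lang` field path is an assumption about bigTwitter.json's schema):

```python
import json
from collections import Counter

def byte_ranges(total_size, n_ranks):
    """Split a file of total_size bytes into n_ranks contiguous (start, end) ranges."""
    chunk = total_size // n_ranks
    ranges = []
    for rank in range(n_ranks):
        start = rank * chunk
        end = total_size if rank == n_ranks - 1 else start + chunk
        ranges.append((start, end))
    return ranges

def count_in_chunk(lines):
    """Count language codes over an iterable of JSON tweet lines (one object per line)."""
    counts = Counter()
    for line in lines:
        line = line.strip().rstrip(",")      # tolerate trailing commas in a JSON array dump
        if not line or line in ("[", "]"):
            continue
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue
        lang = tweet.get("doc", {}).get("lang")   # assumed field location
        if lang:
            counts[lang] += 1
    return counts
```

With mpi4py, each rank would seek to its start offset, skip forward to the next newline so it only handles whole lines, process until its end offset, and then the per-rank `Counter`s would be combined on rank 0 (e.g. via `comm.reduce`).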
Your application should return the final results and the time taken to run the job itself, i.e. the time from the first task starting on a given SPARTAN node to the time the last task completes. You may ignore the queuing time. The focus of this assignment is not to optimize the application to run faster, but to learn about HPC, how basic benchmarking of applications on an HPC facility can be achieved, and the lessons learned in doing this on a shared resource.
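That definition of job time - earliest task start to latest task completion - reduces to a max-minus-min over per-rank timestamps. A minimal sketch (gathering the timestamps to one rank, e.g. via mpi4py's `comm.gather`, is an assumption about the chosen approach):

```python
def job_elapsed(starts, ends):
    """Elapsed wall time from the earliest task start to the latest task end.

    starts/ends are per-rank timestamps in seconds (e.g. from time.time()
    or MPI.Wtime()), collected onto a single rank before this is called.
    """
    return max(ends) - min(starts)

# Illustrative per-rank timestamps (seconds).
starts = [100.0, 100.5, 100.2]
ends = [180.0, 179.5, 181.2]
print(f"job took {job_elapsed(starts, ends):.1f} s")   # prints "job took 81.2 s"
```

Note that summing per-rank durations would be wrong here: ranks run concurrently, so only the span from first start to last finish reflects the job's wall-clock time.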
Attachment:- Cluster and Cloud Computing Assignment.rar