Reference no: EM132919660
MITS6005 Big Data
Answer the following questions about big data and the tools and technologies that help businesses grow and make appropriate decisions.
Question 1 The variety of big data refers to the heterogeneous sources and nature of the data. There are three types of data, namely structured, semi-structured and unstructured.
How do these types of data facilitate decision-making in various businesses? Illustrate your answer with an appropriate case, explaining the roles they may play in the analysis of big data sets for large companies.
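To make the three data types concrete, the following minimal sketch reads one tiny sample of each. The order records, JSON fields and review text are hypothetical values invented for illustration and are not part of the assignment data.

```python
import csv
import io
import json
import re

# Structured: fixed columns, as in a relational table or CSV export.
structured = io.StringIO("order_id,amount\n1001,25.50\n1002,40.00\n")
total_sales = sum(float(row["amount"]) for row in csv.DictReader(structured))

# Semi-structured: self-describing JSON whose fields may vary per record.
semi_structured = '{"order_id": 1001, "tags": ["gift", "express"]}'
tags = json.loads(semi_structured).get("tags", [])

# Unstructured: free text with no schema; here a crude keyword match.
unstructured = "Delivery was late but the product quality is great."
mentions_delay = bool(re.search(r"\blate\b", unstructured))

print(total_sales, tags, mentions_delay)
```

Each type needs a different amount of work before it can support a decision: the structured rows can be summed directly, the JSON must be parsed and checked for optional fields, and the free text needs some form of text analysis.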
Question 2 A MapReduce job usually splits the input dataset into independent chunks that are processed by the map tasks in a completely parallel manner. The MapReduce framework has several phases: the map phase transforms the input into intermediate key-value pairs, and the sort phase orders those pairs by key before they reach the reducers. Discuss the different phases of the MapReduce framework and demonstrate how they work with an appropriate example.
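As an illustration of the phases, here is a minimal single-process sketch of a word count. It only imitates the data flow in plain Python and does not use the Hadoop API; the function names and the two input splits are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(split):
    """Map: emit an intermediate (word, 1) pair for every word in a split."""
    return [(word.lower(), 1) for word in split.split()]

def shuffle_and_sort(pairs):
    """Shuffle/sort: group values by key and order the keys."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)

def word_count(splits):
    intermediate = []
    for split in splits:  # in Hadoop these map tasks would run in parallel
        intermediate.extend(map_phase(split))
    return dict(reduce_phase(k, v) for k, v in shuffle_and_sort(intermediate))

# Two hypothetical input splits.
splits = ["big data big insight", "data drives insight"]
counts = word_count(splits)
print(counts)
```

In a real cluster the splits live on different nodes and the shuffle moves intermediate pairs across the network, so that all values for one key arrive at the same reducer.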
Question 3 An Australia-based higher education institute, "Victorian Institute of Technology", has international students from ten countries in Asia. It has campuses in all ten countries and stores the students' details in a Hive database. Each record stored in Hive includes fields such as student_id, student_name, student_school, home_state, home_country, and enrollment_year. Each table in Hive can have one or more partition columns to organize and optimize the table.
Analyze the given table and propose the column(s) that could be chosen as its partition key in Hive. Justify your selection.
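One way to reason about a candidate partition key is to look at how many partitions it creates and how many rows land in each. The sketch below groups a few hypothetical student rows into Hive-style partition directories of the form column=value. The sample rows, the table name "students" and the choice of home_country plus enrollment_year are assumptions made for illustration, not a prescribed answer.

```python
from collections import Counter

# Hypothetical sample rows in the column order given in the question.
students = [
    ("S001", "Asha", "IT", "Kerala", "India", 2020),
    ("S002", "Binh", "Business", "Hanoi", "Vietnam", 2021),
    ("S003", "Chen", "IT", "Selangor", "Malaysia", 2020),
    ("S004", "Dev", "IT", "Punjab", "India", 2021),
    ("S005", "Esha", "Business", "Gujarat", "India", 2020),
]

def partition_path(row, table="students"):
    """Build a Hive-style partition directory (column=value per level)."""
    home_country, enrollment_year = row[4], row[5]
    return f"{table}/home_country={home_country}/enrollment_year={enrollment_year}"

rows_per_partition = Counter(partition_path(r) for r in students)
for path, n in sorted(rows_per_partition.items()):
    print(path, n)
```

A column with few distinct values, such as home_country with ten countries, yields a small number of well-filled partitions, whereas a unique column such as student_id would create one tiny partition per student.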
Question 4 The Hadoop Distributed File System (HDFS) is designed to store and transfer massive data sets reliably and to tolerate faults. Discuss how HDFS achieves fault tolerance by replicating big data.
Further, suppose Steve has a Hadoop cluster in which a file of size 514 MB is stored in HDFS (Hadoop 2.x) using the default block size and the default replication factor. Calculate the number of blocks that need to be generated for the file and find the size of each block stored in Hadoop.
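The block arithmetic can be sketched as follows, assuming the Hadoop 2.x defaults of a 128 MB block size and a replication factor of 3. The function name and its return shape are hypothetical helpers for this example.

```python
def hdfs_blocks(file_mb, block_mb=128, replication=3):
    """Return (block sizes in MB, total block replicas, total MB stored)."""
    full, remainder = divmod(file_mb, block_mb)
    # The last block only occupies the space that is left over.
    sizes = [block_mb] * full + ([remainder] if remainder else [])
    return sizes, len(sizes) * replication, file_mb * replication

sizes, replicas, stored_mb = hdfs_blocks(514)
print(sizes)      # [128, 128, 128, 128, 2]
print(replicas)   # 15
print(stored_mb)  # 1542
```

Under these assumptions the 514 MB file is split into five blocks, four of 128 MB and one of 2 MB, and replication turns them into 15 block replicas across the cluster.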
Question 5 NoSQL is an alternative to the traditional relational database system. The use of NoSQL databases has grown significantly, particularly in big companies.
Answer the following questions in relation to NoSQL.
a) Why is NoSQL considered better than a relational database for big data? Compare and contrast relational databases and NoSQL databases. Your discussion should touch on performance, operational workloads and scale. Compare the circumstances under which you would use one over the other and provide contrasting examples.
b) Which guarantee (consistency, availability or partition tolerance) can be relaxed for each of the following use cases?
1. A banking application should respond accurately to customers' queries.
2. An online store wants to function 24/7, so that shoppers can make purchases exactly when they need to.
3. A distributed system shares data across different regions without failure.
Attachment: Big Data.rar