Implement an efficient data layout and retrieval strategy


Reference no: EM133099787

SYSTEMS - ASSIGNMENT

Title: Implement an efficient data layout and retrieval strategy for a Hadoop Cluster

Overview & background:

A database may have some common data repeated across records. For example, in the attached CSV file (exported from a database), some column values are the same across multiple rows. These common values are stored repeatedly in the database records, which increases storage cost but reduces retrieval time for analytical queries.

However, we need to lay out this kind of dataset on a Hadoop cluster at a reduced storage cost. To do so, we need to understand which values are common across records and create a data layout that avoids duplicate values. At the same time, the layout must allow retrieval of a complete data record from storage, given a record identifier.

Input: CSV data with a flat schema, containing multiple records and features.

Description:

1. STORAGE:

Each storage node stores data according to the conditions below.
a. Mutually exclusive feature data (a column value) that is not common across records (rows): private node.
b. Feature data common to two records: 2-way shared node.
c. Feature data common to four records: 4-way shared node.
d. Feature data common to eight records: 8-way shared node.
Note: The private node and the 2-, 4-, and 8-way shared nodes are storage nodes. The shared nodes store feature values that are common to 2, 4, and 8 records respectively.
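
The classification above can be sketched in Python as follows. This is only an illustration under stated assumptions: the statement names the exact counts 2, 4, and 8, so the handling of other counts (3, 5, 9, and so on) is an assumption here, namely that a value goes to the largest tier that does not exceed its count. The function names, the node labels `shared2`/`shared4`/`shared8`, and the sample rows are hypothetical and are not taken from the attached data.

```python
import csv
import io
from collections import Counter


def tier_for(count):
    """Pick a storage tier from the number of records sharing a value.

    Assumption: counts between the named tiers fall back to the largest
    tier that does not exceed the count (3 -> 2-way, 5..7 -> 4-way, ...).
    """
    if count >= 8:
        return "shared8"
    if count >= 4:
        return "shared4"
    if count >= 2:
        return "shared2"
    return "private"


def classify(rows, columns):
    """Return {(column, value): tier} for every distinct value per column."""
    counts = Counter((col, row[col]) for row in rows for col in columns)
    return {key: tier_for(n) for key, n in counts.items()}


# Hypothetical sample data, for illustration only.
SAMPLE = """id,city,plan
1,Pune,gold
2,Pune,gold
3,Delhi,gold
4,Agra,gold
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
tiers = classify(rows, ["city", "plan"])
```

Counting per column, rather than across the whole file, keeps identical strings in unrelated columns from being merged. Whether that matches the intended layout is a design choice the statement leaves open.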

2. METADATA
Maintain metadata, keyed by record ID, about the storage deployment above. It should explain how the feature values of each record are stored across the storage nodes. The metadata can be stored on a specific node.
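
One possible shape for this metadata is a mapping from record ID to one pointer per column, where each pointer names a storage node and a key on that node. The sketch below uses in-memory dictionaries as stand-ins for the storage nodes. `build_layout`, the node labels, the integer keys, and the tier bucketing are all assumptions made for illustration, not part of the assignment.

```python
from collections import Counter


def tier_for(count):
    # Assumed bucketing: largest of 8, 4, 2 not exceeding the count.
    if count >= 8:
        return "shared8"
    if count >= 4:
        return "shared4"
    return "shared2" if count >= 2 else "private"


def build_layout(rows, id_col):
    """Store each distinct (column, value) once and record pointers to it.

    Returns (nodes, metadata):
      nodes:    {node_name: {key: value}}, one dict per storage node
      metadata: {record_id: {column: (node_name, key)}}
    """
    columns = [c for c in rows[0] if c != id_col]
    counts = Counter((c, r[c]) for r in rows for c in columns)
    nodes = {"private": {}, "shared2": {}, "shared4": {}, "shared8": {}}
    key_of = {}   # (column, value) -> integer key, assigned on first sight
    metadata = {}
    for r in rows:
        pointers = {}
        for c in columns:
            node = tier_for(counts[(c, r[c])])
            key = key_of.setdefault((c, r[c]), len(key_of))
            nodes[node][key] = r[c]   # repeated values overwrite themselves
            pointers[c] = (node, key)
        metadata[r[id_col]] = pointers
    return nodes, metadata


# Hypothetical rows, for illustration only.
rows = [
    {"id": "1", "city": "Pune"},
    {"id": "2", "city": "Pune"},
    {"id": "3", "city": "Delhi"},
]
nodes, metadata = build_layout(rows, "id")
```

Records 1 and 2 end up with the same pointer, so the value `Pune` is stored once. On a real cluster the dictionaries would be replaced by files or tables on the respective nodes.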

3. RETRIEVAL:

For a given record ID, retrieval refers to the metadata from step 2 to fetch all the required features (column values) from the respective storage nodes and reassemble the original record.
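
A minimal retrieval sketch, assuming the metadata maps each record ID to `{column: (node_name, key)}` pointers and each storage node behaves like a key-value store. The `NODES` and `METADATA` literals, the node labels, and the function name `retrieve` are hypothetical.

```python
# Hypothetical stand-ins for the storage nodes and the metadata node.
NODES = {
    "private": {1: "Delhi"},
    "shared2": {0: "Pune"},
    "shared4": {},
    "shared8": {},
}
METADATA = {
    "1": {"city": ("shared2", 0)},
    "2": {"city": ("shared2", 0)},
    "3": {"city": ("private", 1)},
}


def retrieve(record_id, metadata, nodes, id_col="id"):
    """Rebuild one record by following its metadata pointers.

    Raises KeyError if the record ID is unknown.
    """
    pointers = metadata[record_id]
    record = {id_col: record_id}
    for column, (node, key) in pointers.items():
        record[column] = nodes[node][key]
    return record


record = retrieve("2", METADATA, NODES)
```

Each lookup touches the metadata once and then one node per column, so retrieval cost grows with the number of columns rather than with the number of records.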

NOTE: You can apply different techniques, such as normalization, standardization, or vectorization, to understand the similarity of feature values.
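
As one simple reading of "normalization", cell values could be mapped to a canonical key before they are compared, so that `1.0` and `1`, or `Pune` and `pune `, count as the same value. The sketch below is an assumption about what such a step might look like. The function name `normalize_value` and its rules are hypothetical, and the note above may equally intend numeric scaling or feature vectors.

```python
from decimal import Decimal, InvalidOperation


def normalize_value(raw):
    """Map a raw CSV cell to a canonical key used only for comparison."""
    text = raw.strip()
    try:
        number = Decimal(text)
    except InvalidOperation:
        # Not numeric: compare case-insensitively with collapsed spaces.
        return " ".join(text.casefold().split())
    if number.is_finite():
        # Drop trailing zeros so "1.0" and "1" share one key.
        return str(number.normalize())
    return text.casefold()   # NaN / Infinity are kept as text keys
```

If values are grouped by this key, the stored value may differ in spelling from the original cell. Keeping the raw value alongside the key is one way to preserve exact reconstruction of the record.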

1. Python / Java / Spark code that enables
a. the given CSV data to be written, using the distributed storage layout strategy described, to reduce duplicate data, and
b. retrieval of any record given the record ID from the distributed storage.

2. Report the compression ratio achieved using the above approach, i.e. how much storage reduction the de-duplicated data layout achieves on the cluster.

3. You can use a Hadoop cluster, a plain cluster of a set of nodes, or any BigData storage framework to demonstrate your data storage and retrieval code. Describe your setup in detail.

4. You should provide clear instructions to reproduce the submission on the Evaluator's setup.

5. Your code and results should be reproducible.

6. The implementation should be general purpose for any other CSV input file.
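
For the compression ratio in item 2, one common convention is original size divided by stored size, with the metadata counted as part of the stored size. The statement does not fix a formula, so both the convention and the helper names below are assumptions. The byte counts in the example are made-up inputs, not measured results.

```python
def compression_ratio(original_bytes, stored_bytes):
    """Original size divided by de-duplicated size (higher is better)."""
    if stored_bytes <= 0:
        raise ValueError("stored_bytes must be positive")
    return original_bytes / stored_bytes


def storage_reduction_pct(original_bytes, stored_bytes):
    """Percentage of the original size saved by the new layout."""
    if original_bytes <= 0:
        raise ValueError("original_bytes must be positive")
    return 100.0 * (original_bytes - stored_bytes) / original_bytes


# Hypothetical sizes in bytes: the stored total should include every
# storage node AND the metadata, otherwise the ratio is overstated.
original = 1000
stored = 300 + 100   # data nodes + metadata
ratio = compression_ratio(original, stored)
saved = storage_reduction_pct(original, stored)
```

On a Hadoop cluster the two sizes could be read with `hdfs dfs -du` on the original file and on the layout's directories. Measuring both in the same way (for example, both before replication) keeps the comparison fair.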

Attachment:- Assignment-Problem-Statement.rar

Attachment:- Data.rar
