What do you propose to learn from the data

Assignment Help Computer Engineering
Reference no: EM133582736

Assignment: Data ETL

Goal: This project will be used to integrate concepts developed from the assignment in the second half of this class, specifically. You will identify a data driven business problem that requires preparation of the data. This preparation involves Extracting data (from 3 or more sources), Transforming (or cleaning) the data before Loading it into a database for analysis. In other words, you will experience, first-hand, the ETL process of Data management - preparing the data for further analyses.

Options: You can take this project in one of two directions: (i) Identify a large file, clean the data and normalize it into three or more tables OR (ii) Identify three or more large data sources, clean the data and merge them into a denormalized table for analysis. In either case, you will need to identify what you plan to learn from the cleaned and loaded data. BOTTOM LINE: Can you do the analyses WITHOUT going through this ETL process. If so, what's the point?!

In preparation for your project this term, I need you to do some digging to identify sources and ideas for a decent project.

There are a couple of decisions that have to be made. And so, I am making part of the project a "deliverable" so you can begin mulling over it. Most ETL tasks involve cleaning and integration. For integration, it is vital that you have an attribute that is common across all three data sets

Cleaning

Cleaning is one of the most important steps as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:

I. Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to standard Male/Female/Unknown)

II. Convert null values into standardized Not Available/Not Provided value

III. Convert phone numbers, ZIP codes to a standardized form

IV. Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str

V. Validate address fields against each other (State/Country, City/State, City/ZIP code, City/Street).

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes

I. Converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.

II. generating surrogate keys or FKs so that you can join data from several sources,

III. generating aggregates

IV. deriving new calculated values,

V. Adding columns to create PKs and/or FKs

Data Integration

It is at this stage that you get the most value for the project. This typically means you are adding some attribute from a related set that adds 'Color' to the data. Perhaps Census data to labor data or other demographic data. The challenge is to locate data that are relatable.

Project direction: You will need to complete a datamart with significant pre-processing (ETL) activities.

Task

I. Problem being solved: What do you propose to learn from this data? List several of these business questions and show how your project solution (data set) could answer them.

II. Tools: You must complete the entire project using Visual Studio. OR you can do this with some other tool of your choice (ETL) like Power BI or tableau.

III. Volume: Total result data set must add up to at least 5k records, but not more than 100k.

IV. Destination: SQL server table(s). Depending on the direction you are taking, you can move all the data to a single CSV file and dump it into SQL server at the end or direct the final destination tables to SQL server.

V. Transformation: it must include TWO new columns (for each final destination) that is populated by (a) the current date and time so you know when that data was brought into the final dataset and (b) a second one to know where the data came from (source file name). This may be done through SSIS or in SQL server.

Note: Filename capturing works only when the source is a flat file. So, if your source is NOT a flat file, you may want to make a CSV file an intermediate destination and then use this file as the source (Hint: Use derived column transformation to add a column)

In addition it must include at least 3 of the following transformations: data conversion, derived column, data split, lookup, merge, merge join, multicast, union all, fuzzy lookup or any of the transforms not covered in class.

Data sources: You are welcome to use datasets from work that has been sufficiently "anonymized". In fact this itself is a valuable transformation task that you can then use to protect your data and make it available for additional analysis/exploration. There are many public data sets that can be used (see "data sources" tab)

Project ideas & Data sets [for ETL]

Goal: Explore various datasets to see what is missing in any of the data and how you can enhance it by combining info from other seemingly unconnected data (industry, education, poverty and liquor shops?). The links below serve as a starting point for your exploration. Get started!'

Expectation: You can take this project in one of two directions: (i) Identify three or more large data sources, clean the data and merge them into a denormalized table for analysis. OR (ii) Identify a large file, clean the data and normalize it into three or more tables so that when you rejoin them, you get more accurate answers to your questions. Sometimes this process may require you to get "reference sources" so your dimension tables (destinations in Model Y above) are more complete/accurate.

In either case, you will need to identify what you plan to learn from the cleaned and loaded data.

There are two main ideas to keep in mind: (i) Cleaning badly prepared data and (ii) integrating data from multiple sources. An ETL project usually involves BOTH of these.

When integrating data from more than one source, you need to make sure that they can be linked in the first place. In other words, is there something in common between the two data sets? Some kind of identifier like we use as PK and FK? If not, can you create it?

Reference no: EM133582736

Questions Cloud

How does vance openly acknowledge his lack : How does Vance openly acknowledge his lack of automatic ethos in this passage? What claim does Vance make that helps him establish credibility for writing
How completed swot analysis will used in selection process : How the completed SWOT analysis will be used in the selection process? Why you have selected this technology?
Identify three peer-reviewed articles that collected samples : Using these resources, create a report that can be used to carry out a needs assessment improving the services in your community.
What effect does a point-of-view shot have on the narrative : What effect does a Point-of-View shot have on the narrative and the audience? Use an example from a film. c). Why are Long-Takes used in some films
What do you propose to learn from the data : What do you propose to learn from this data? List several of these business questions and show how your project solution (data set) could answer them.
What is the most appropriate initial plan for mr. corcoran : What is the MOST appropriate initial plan for Mr. Corcoran? Mr. Corcoran is a 75-year-old gentleman with a medical history significant for HTN, OA
How will you use this information to meet the needs : Discuss the implications that these articles and their findings have on students with exceptional education needs and their parents, as well as reflect
What decade does the author identify : What decade does the author identify as showing "substantial improvement in residential integration? in segregation in america article by the data team
Explore the role rhetoric plays in james green : Explore the role rhetoric plays in James Green's "Equal Pay Bill Has a Serious Downside." First, define rhetoric and describe its components

Reviews

Write a Review

Computer Engineering Questions & Answers

  Define a constructor to accept and initialize instance data

Write a class called Dog that contains instance data that represents the dog's name and age. Define the Dog constructor to accept and initialize instance data.

  Holism versus reductionism based on information technology

Holism versus reductionism and Content of the paper should be based on Information technology

  Explain the application of sampling in mead procedures

Describe each incorrect assumption, statement, and inappropriate application of sampling in Mead's procedures in the following.

  Make an architectural diagram which shows how the virtual

global organizations have branches that are located in multiple countries. some of these organizations develop software

  Discuss the input-process-output model

Software engineering should always be preceded with some sort of analysis and design tactics. So far this session, you have learned about many considerations.

  The pros and cons of using sessions

What are the advantages and disadvantages of using cookies? What are the advantages and disadvantages of using sessions? Is one more secure than the other and why?

  Find the number of students having greater score

Find the average marks if the class and also find the number of students having greater score then average score of class.

  Describe business case recommendations to senior management

Describe your business case recommendations to senior management for developing a forensic lab. Include a description of the physical layout requirements.

  Define performance architecture

BSACB/531 IT Architecture in the Age of Cloud Assignment - Retail Case Study Summary, University of Phoenix - Define performance architecture

  What factors other than the variables associated

ADTA 5130 University of North Texas What factors other than the variables associated with house prices should you consider when buying rental property in a coll

  Write a program that accepts any number of homework scores

Write a program that accepts any number of homework scores ranging in value from 0 through 10. Prompt the user for a new score if they enter a value.

  What are the commands that you have to carry

What are the commands that you have to carry out in order to turn a Java source code file into a running program?

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd