How many rows would you be processing in all

Assignment Help Computer Engineering
Reference no: EM133565857

Homework: Project

Tasks to complete

Goal: This project will be used to integrate concepts developed from all the homework in the second half of this class, specifically. You will identify a data driven business problem that requires preparation of the data. This preparation involves Extracting data (from 3 or more sources), Transforming (or cleaning) the data before Loading it into a database for analysis. In other words, you will experience, first-hand, the ETL process of Data management.

Options: You can take this project in one of two directions: (I) Identify a large file, clean the data and normalize it into three or more tables OR (II) Identify three or more large data sources, clean the data and merge them into a denormalized table for analysis. In both cases, you will need to identify what you plan to learn from the cleaned and loaded data.

Resource: This article

In preparation for your project this term, I need you to do some digging to identify sources and ideas for a decent project.

There are a couple of decisions that have to be made. And so, I am making part of the project a "deliverable" so you can begin mulling over it. Most ETL tasks involve cleaning and integration. For integration, it is vital that you have an attribute that is common across all three data sets

Cleaning

Cleaning is one of the most important steps as it ensures the quality of the data in the data warehouse. Cleaning should perform basic data unification rules, such as:

• Making identifiers unique (sex categories Male/Female/Unknown, M/F/null, Man/Woman/Not Available are translated to standard Male/Female/Unknown)

• Convert null values into standardized Not Available/Not Provided value

• Convert phone numbers, ZIP codes to a standardized form

• Validate address fields, convert them into proper naming, e.g. Street/St/St./Str./Str

• Validate address fields against each other (State/Country, City/State, City/ZIP code, City/Street).

Transform

The transform step applies a set of rules to transform the data from the source to the target. This includes

• converting any measured data to the same dimension (i.e. conformed dimension) using the same units so that they can later be joined.
• generating surrogate keys or FKs so that you can join data from several sources,
• generating aggregates
• deriving new calculated values,
• Adding columns to create PKs and/or FKs

Data Integration

It is at this stage that you get the most value for the project. This typically means you are adding some attribute from a related set that adds 'Color' to the data. Perhaps Census data to labor data or other demographic data. The challenge is to locate data that are relatable.

Project direction: You will need to complete a datamart with significant pre-processing (ETL) activities.

Task

I. Problem being solved: What do you propose to learn from this data? List several of these business questions and show how your project solution (data set) could answer them.

II. Tools: You must complete the entire project using Visual Studio. OR you can do this with some other tool of your choice (ETL) like Power BI or tableau.

III. Volume: Total result data set must add up to at least 5k records, but not more than 100k.

IV. Destination: SQL server table(s). Depending on the direction you are taking, you can move all the data to a single CSV file and dump it into SQL server at the end or direct the final destination tables to SQL server.

V. Transformation: it must include TWO new columns (for each final destination) that is populated by (a) the current date and time so you know when that data was brought into the final dataset and (b) a second one to know where the data came from (source file name). This may be done through SSIS or in SQL server.

Note: Filename capturing works only when the source is a flat file. So, if your source is NOT a flat file, you may want to make a CSV file an intermediate destination and then use this file as the source (Hint: Use derived column transformation to add a column)

In addition it must include at least 3 of the following transformations: data conversion, derived column, data split, lookup, merge, merge join, multicast, union all, fuzzy lookup or any of the transforms not covered in class.

Data sources: You are welcome to use datasets from work that has been sufficiently "anonymized". In fact this itself is a valuable transformation task that you can then use to protect your data and make it available for additional analysis/exploration. There are many public data sets that can be used (see "data sources" tab)

Submit: Use the text area to submit a project "proposal" that addresses the following points; I am not looking for an elaborate write up, but use these 4 prompts to develop 4 well-written paragraphs free from language/grammar errors. Please do not write it in Q&A format!

A. Be sure to give a meaningful title.

B. Motivation for the project: what insights do you anticipate getting from this ETL project?

C. what problems do you anticipate during the ETL process? Cleaning? Transforming?

D. What data will you be using? where will you be getting these? How many rows would you be processing in all? What are the keys (Pk/fk)?

And finally, include

A. What type of decision support do you expect this project to provide? Would this have been possible with Excel? Why is this approach an improvement?

B. An ERD showing how the data sets are related to each other (either source or destination tables - see "options" above)

C. A subset of data from EACH file (5-10 rows and 5-10 columns) that shows the kind of data you are dealing with. For each file, be sure to identify what you would consider as a primary key. These can be included as a screenshot with column headers.

D. The datasets, properly named. Its best that you create a folder called {myGateID}_Project and save your datafiles there.

Reference no: EM133565857

Questions Cloud

Articulate the importance of closely monitoring side effects : Explain your findings and articulate the importance of closely monitoring side effects. When a client has discontinued psychotropics, clinical judgment is param
What a mental health provider can do to increase the success : What a mental health provider can do to increase the success of adherence to the therapy? patients drop out of psychotherapy or other mental health tretament
How balanced with other necessary leadership approaches : Overall, the impact of this management style would depend on how effectively it is implemented and balanced with other necessary leadership approaches.
The nurse is caring for client with COPD : The nurse is caring for a client with COPD. During the morning assessment the nurse notices the client oxygen saturation is 90%.
How many rows would you be processing in all : What data will you be using? where will you be getting these? How many rows would you be processing in all? What are the keys (Pk/fk)?
What he says in sessions might come back to haunt him : Would you be inclined to reassure Stan that you would not use what he tells you against him? Why or why not? what he says in sessions might come back to haunt
Why do problems stay undetected for so long : Why do problems stay undetected for so long? Does anyone like to be wrong? How do you react when someone informs you of a mistake?
Identify any underlying thythm : Analyze this VVI pacemaker strip. Identify any underlying thythm and how the pacemaker is functioning.
Provide an Entity Relationship Diagram : MIS605 Systems Analysis and Design, Torrens University - understanding of a system thus they are an integral part of system analysis. Process modelling models

Reviews

Write a Review

Computer Engineering Questions & Answers

  What is a cpu and what are its primary components

What shortcomings of electrical computation will optical computing devices address? What is a CPU? What are its primary components? What are registers?

  Create a dynamic nxn matrix for n passed as a parameter

Write a C++ program that uses class Matrix for dealing with the square matrices (n x n tables of integers)

  Discuss how these business rules could be set up

How can the above business rules be set up in a database and reinforced using DCL.

  What are pros and cons of each regarding data visualization

What are the qualities of each language regarding data visualization? What are the pros and cons of each regarding data visualization?

  Symptoms of group think would they most likely display

which of the following symptoms of Groupthink would they most likely display?

  What hardening means in cloud

What hardening means in Cloud and Security Cloud Based Solutions? a twelve minute presentation on a topic from the list I provided or an alternative approved.

  Why is it significant to define the primary keys of a file

your supervisor has approved your normalized versions of the database. your team members were given copies. one of the

  Explain ways to strongly authenticate a remote machine

In NOT less than 150 words, respond to the following discussion questions: List 5 ways to strongly authenticate a remote machine. Justify your answers.

  Create an additional array called less_array

Declare a single dimensional array of 65 characters. Convert each character into an integer and store it in a linked list.

  Write program that reads integerthat is your cwidand breaks

CSCI 428- Write a program that reads an integerthat is your CWIDand breaks it into a sequence of individualdigits.

  Explain the different types of computing careers

INDUSTRIAL TRAINING-Level 3-National Council for Higher Education- BACHELOR OF SCIENCE IN SOFTWARE ENGINEERING.Explain the different types of Computing careers

  Describe the type of cryptography or encryption

Select one type of cryptography or encryption and explain it in detail. Include the benefits as well as the limitations of this type of encryption.

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd