Preparing data for further analyses

Reference no: EM133022319

771768 Introduction to Programming for Artificial Intelligence and Data Science

Customer Data Pre-processing

Context
This assignment is designed to evaluate your ability to read and write file formats of common types used in Data Science, and to manipulate complex data into different representations. The tasks provided here are indicative of Data Pre-processing workloads which are common to all Data Science projects. The techniques learned and evaluated in this module will prepare you for further theoretical and applied topics later on in the programme, where you will further develop your skills.

This assignment makes use of an extensive collection of mocked data. These have been generated with some resemblance to real world values and distributions, including some relations between data elements.

Although teaching for this module, both asynchronous and synchronous, stops by Teaching Week 4, ad-hoc support will be available on the MS Teams site until the assignment is submitted.

Background
You have been given a collection of data from a company wishing to process its customer records for business purposes (acw_user_data.csv). The existing systems in place at the company only export to a CSV file, which is not in an appropriate format for analysis. You have been given the task of preparing this data for further analyses by your colleagues within the company, including representation changes, filtering, and deriving some new attributes / metrics for them.

These data include attributes such as first name, second name, credit card number, and marital status, and even include data on the customer's car.

The number of records provided is significant, and therefore it is expected that solutions are robust to varying types of data, and varying values, offering a programmatic solution.

Tasks
Data Processing
Using standard Python (no pandas / seaborn) with default libraries (os, sys, time, json, csv, ...), you have been given the following tasks:

1. Read in the provided ACW Data using the CSV library.

2. As a CSV file is an entirely flat file structure, we need to convert our data back into its rich structure. Convert all flat structures into nested structures. These are notably:
a. Vehicle - consists of make, model, year, and type
b. Credit Card - consists of start date, end date, number, security code, and IBAN.
c. Address - consists of the main address, city, and postcode.
For this task, it may be worthwhile to inspect the CSV headers to see which data columns correspond to the structures above.
Note: Ensure that the values read in are appropriately cast to their respective types.
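
One possible shape for Tasks 1 and 2 is sketched below, using only the csv module. The header names ("Vehicle Make", "Address Street", and so on), the nested key names, and the choice of integer columns are all assumptions made for illustration; the real headers in acw_user_data.csv need to be inspected and mapped accordingly. A Credit Card group could be added to the same mapping in the same way.

```python
import csv
import io

# Assumed header -> nested key mappings; check them against the real CSV headers.
NESTED_GROUPS = {
    "vehicle": {"Vehicle Make": "make", "Vehicle Model": "model",
                "Vehicle Year": "year", "Vehicle Type": "type"},
    "address": {"Address Street": "street", "Address City": "city",
                "Address Postcode": "postcode"},
}
# Assumed columns whose text values should be cast to int.
INT_COLUMNS = {"Age", "Vehicle Year"}


def cast(column, value):
    """Cast a raw CSV string to int when the column is known to be numeric."""
    return int(value) if column in INT_COLUMNS else value


def nest_row(row):
    """Return a nested copy of one flat CSV row (a dict of strings)."""
    grouped = {col for mapping in NESTED_GROUPS.values() for col in mapping}
    # Columns outside any group stay at the top level.
    record = {col: cast(col, val) for col, val in row.items() if col not in grouped}
    # Grouped columns move into their own sub-dictionary.
    for group, mapping in NESTED_GROUPS.items():
        record[group] = {key: cast(col, row[col]) for col, key in mapping.items()}
    return record


def read_records(handle):
    """Read every row from an open CSV file handle into nested records."""
    return [nest_row(row) for row in csv.DictReader(handle)]


# A one-row, made-up sample standing in for the real file.
sample = io.StringIO(
    "First Name,Age,Vehicle Make,Vehicle Model,Vehicle Year,Vehicle Type,"
    "Address Street,Address City,Address Postcode\n"
    "Ada,36,Ford,Focus,2015,Hatchback,1 High Street,Hull,HU1 1AA\n"
)
records = read_records(sample)
```

With the real data, the StringIO sample would be replaced by a handle from open("acw_user_data.csv", newline="").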

3. The client informs you that they have had difficulty with errors in the dependants column. Some entries are empty (i.e. " " or ""), which may hinder your conversion from Task 2. These should be changed into something meaningful when encountered.

Print a list of the rows where such error corrections take place.

E.g. Problematic rows for dependants: [16, 58, 80, 98]
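
A minimal sketch of this correction step follows. The column name "Dependants", the replacement value of 0, and the use of zero-based row indices are assumptions; the brief only asks for "something meaningful", so another default (such as None) may suit the data better.

```python
def clean_dependants(rows, column="Dependants", default=0):
    """Cast the dependants value of each row to int, replacing blank entries.

    Returns the indices of the rows that had to be corrected.
    """
    problem_rows = []
    for index, row in enumerate(rows):
        if row[column].strip() == "":
            # Empty or whitespace-only entry: substitute the assumed default.
            row[column] = default
            problem_rows.append(index)
        else:
            row[column] = int(row[column])
    return problem_rows


# Made-up rows standing in for the CSV data.
rows = [{"Dependants": "2"}, {"Dependants": " "}, {"Dependants": ""}, {"Dependants": "1"}]
problem_rows = clean_dependants(rows)
print(f"Problematic rows for dependants: {problem_rows}")
```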

4. Write all records to a processed.json file in the JSON data format. This should be a list of dictionaries, where each element of the list is a dictionary representing a single person.

5. You should create two additional file outputs, retired.json and employed.json. These should contain all retired customers (as indicated by the retired field in the CSV) and all employed customers (as indicated by the employer field in the CSV) respectively, and be in the JSON data format.
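
The three output files could be produced along the following lines. The key names and the rule that an empty employer value means "not employed" are assumptions for this sketch; the real data may mark unemployment differently. A temporary directory stands in for the real output folder.

```python
import json
import os
import tempfile


def write_json(records, path):
    """Write a list of dictionaries to path as a JSON array."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(records, handle, indent=2)


# Two made-up customers; the key names are assumptions.
customers = [
    {"first_name": "Ada", "retired": True, "employer": ""},
    {"first_name": "Bob", "retired": False, "employer": "Acme Ltd"},
]

retired = [person for person in customers if person["retired"]]
# Assumption: an empty employer value means the customer is not employed.
employed = [person for person in customers if person["employer"].strip()]

output_dir = tempfile.mkdtemp()  # stand-in for the real output folder
write_json(customers, os.path.join(output_dir, "processed.json"))
write_json(retired, os.path.join(output_dir, "retired.json"))
write_json(employed, os.path.join(output_dir, "employed.json"))
```

Reusing one write_json helper for every output keeps the file handling in a single place, in line with the brief's request to limit duplicated code.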

6. The client states that there may be some issues with credit card entries. Any customers that have more than 10 years between their start and end date need to be written to a separate file, called remove_ccard.json, in the JSON data format. The client will manually deal with these later based on your output. They request that you write a function to help perform this, which accepts a single row from the CSV data and outputs whether the row should be flagged. This can then be used when determining whether to write the current person to the remove_ccard file.
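
A flagging function of this kind might look like the sketch below. The column names and the "MM/YY" date format are assumptions and should be checked against the actual CSV. The gap is measured in whole months, so a card lasting exactly 10 years is not flagged.

```python
from datetime import datetime

# Assumed date format (e.g. "08/18"); verify against the real data.
DATE_FORMAT = "%m/%y"


def should_flag_card(row,
                     start_key="Credit Card Start Date",
                     end_key="Credit Card Expiry Date",
                     max_years=10):
    """Return True when the card's start and end dates are more than max_years apart.

    row is one CSV row as a dictionary; the default key names are hypothetical.
    """
    start = datetime.strptime(row[start_key], DATE_FORMAT)
    end = datetime.strptime(row[end_key], DATE_FORMAT)
    months_apart = (end.year - start.year) * 12 + (end.month - start.month)
    return months_apart > max_years * 12


# Made-up rows: the first spans over 11 years, the second exactly 10.
long_card = {"Credit Card Start Date": "08/18", "Credit Card Expiry Date": "11/29"}
ten_year_card = {"Credit Card Start Date": "01/20", "Credit Card Expiry Date": "01/30"}
flags = [should_flag_card(long_card), should_flag_card(ten_year_card)]
```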

7. You have been tasked with calculating some additional metrics which will be used for ranking customers. You should create a new data attribute for our customers called "Salary-Commute". Reading in from processed.json:
a. Add, and calculate appropriately, this new attribute. It should represent the Salary that a customer earns, per mile of their commute.
i. Note: If a person travels 1 or fewer commute miles, then their Salary-Commute would be just their salary.
b. Sort these records by that new metric, in ascending order.
c. Store the output in JSON format, in a file named commute.json.
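
The metric and sort could be sketched as follows. The key names "salary" and "commute_miles" are assumptions, and a JSON string stands in for the contents of processed.json so the example is self-contained.

```python
import json


def add_salary_commute(person):
    """Add the "Salary-Commute" attribute: salary earned per commute mile."""
    miles = person["commute_miles"]
    salary = person["salary"]
    # For commutes of 1 mile or fewer, the metric is just the salary.
    person["Salary-Commute"] = salary if miles <= 1 else salary / miles
    return person


# Made-up records standing in for the contents of processed.json.
processed_text = """[
    {"salary": 30000, "commute_miles": 0.5},
    {"salary": 40000, "commute_miles": 10},
    {"salary": 24000, "commute_miles": 1}
]"""
people = json.loads(processed_text)

ranked = sorted((add_salary_commute(person) for person in people),
                key=lambda person: person["Salary-Commute"])
commute_text = json.dumps(ranked, indent=2)  # would be written to commute.json
```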

Data Visualisation
Using Pandas and Seaborn

Your client wishes to understand the data they have on their customers a bit more by use of visualisations. Using Pandas and Seaborn, read in the original CSV file provided with the assignment.

1. Obtain the Data Series for Salary, and Age, and calculate the following:
a. Mean Salary
b. Median Age
2. Perform univariate plots of the following data attributes:
a. Age, calculating how many bins would be required for a bin_width of 5.
b. Dependants, fixing data errors with seaborn itself.
c. Age (with default bins), conditioned on Marital Status
3. Perform multivariate plots with the following data attributes:
a. Commute distance against Salary
b. Age against Salary
c. Age against Salary conditioned by Dependants
4. Your client would like the ability to save the plots which you have produced. Provide a Notebook cell which can do this.
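
For the bin calculation in item 2a, one reasonable reading is that the number of bins is the range of the data divided by the bin width, rounded up. The sketch below uses plain Python and made-up ages; with the real data, the values would come from the Age series, and the resulting count could be passed to the `bins` parameter of seaborn's `histplot`.

```python
import math


def count_bins(values, bin_width):
    """Return how many bins of bin_width are needed to cover the range of values."""
    lowest, highest = min(values), max(values)
    # Round up so the highest value is covered; always use at least one bin.
    return max(1, math.ceil((highest - lowest) / bin_width))


ages = [18, 23, 41, 67, 90]  # made-up ages for illustration
age_bins = count_bins(ages, bin_width=5)
```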

Code Presentation
The code you produce to solve the above tasks should make good use of structure, logic, and commenting to be clear and robust. Variable names should be thoughtfully considered, and appropriate for purpose. Ensure that scope conflicts are avoided, and that variable names don't leak into other areas of code. You should consider how the structures covered throughout the module may be used. Similarly, you should be mindful of error handling where appropriate.

Use of Markdown Cells is advised to keep clear distinction between Tasks. When writing your algorithms for solving the above, it may be useful to keep in mind reusability of code, reducing the amount of boilerplate code, and duplicated code.

Attachment:- Data analysis.rar
