Reference no: EM133022319
771768 Introduction to Programming for Artificial Intelligence and Data Science
Customer Data Pre-processing
Context
This assignment is designed to evaluate your ability to read and write file formats of common types used in Data Science, and to manipulate complex data into different representations. The tasks provided here are indicative of Data Pre-processing workloads which are common to all Data Science projects. The techniques learned and evaluated in this module will prepare you for further theoretical and applied topics later on in the programme, where you will further develop your skills.
This assignment makes use of an extensive collection of mocked data. These have been generated to resemble real-world values and distributions, including some relations between data elements.
Although teaching for this module, both asynchronous and synchronous, stops by Teaching Week 4, ad-hoc support will be available on the MS Teams site until the assignment is submitted.
Background
You have been given a collection of data from a company wishing to process its customer records for business purposes (acw_user_data.csv). The existing systems in place at the company export only to a CSV file, which is not an appropriate format for analysis. You have been tasked with preparing these data for further analysis by your colleagues within the company, including representation changes, filtering, and deriving some new attributes / metrics for them.
These data include attributes such as first name, second name, credit card number, and marital status, and even contain data on the customer's car.
The number of records provided is significant; solutions are therefore expected to be programmatic and robust to varying data types and values.
Tasks
Data Processing
Using standard Python (no pandas / seaborn) with default libraries (os, sys, time, json, csv, ...), you have been given the following tasks:
1. Read in the provided ACW Data using the CSV library.
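A minimal sketch of Task 1, using csv.DictReader so that each row arrives as a dictionary keyed by the header line. The column names in the sample below are assumptions, and an in-memory StringIO stands in for the real acw_user_data.csv:

```python
import csv
import io

# Task 1 sketch: read the ACW data with the standard csv library.
# DictReader keys each row by the header line, which makes the later
# restructuring steps easier than positional indexing.
def read_acw_data(file_obj):
    """Return a list of row dictionaries from an open CSV file object."""
    return list(csv.DictReader(file_obj))

# Illustrative stand-in for acw_user_data.csv (column names are assumptions).
sample_csv = io.StringIO(
    "First Name,Last Name,Age,Salary\n"
    "Ada,Lovelace,36,52000\n"
    "Alan,Turing,41,61000\n"
)
rows = read_acw_data(sample_csv)
print(len(rows))        # 2
print(rows[0]["Age"])   # '36' -- DictReader yields strings, so cast later
```

Against the real file this would be `with open("acw_user_data.csv", newline="") as f: rows = read_acw_data(f)`.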
2. As a CSV file is an entirely flat file structure, we need to convert our data back into its rich structure. Convert all flat structures into nested structures. These are notably:
a. Vehicle - consists of make, model, year, and type
b. Credit Card - consists of start date, end date, number, security code, and IBAN.
c. Address - consists of the main address, city, and postcode.
For this task, it may be worthwhile inspecting the CSV headers to see which data columns correspond to the structures above.
Note: Ensure that the values read in are appropriately cast to their respective types.
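The restructuring in Task 2 can be sketched as a single function applied per row. Every column name below is an assumption and must be checked against the real CSV headers, as the brief advises; numeric fields are cast as the note requires:

```python
# Task 2 sketch: fold flat CSV columns into nested dictionaries.
# All column names here are assumptions -- verify them against the real
# headers of acw_user_data.csv before relying on this.
def nest_record(row):
    """Convert one flat CSV row (a dict of strings) into a nested dict."""
    return {
        "first_name": row["First Name"],
        "last_name": row["Last Name"],
        "age": int(row["Age"]),  # cast numeric fields appropriately
        "vehicle": {
            "make": row["Vehicle Make"],
            "model": row["Vehicle Model"],
            "year": int(row["Vehicle Year"]),
            "type": row["Vehicle Type"],
        },
        "credit_card": {
            "start_date": row["Credit Card Start Date"],
            "end_date": row["Credit Card End Date"],
            "number": row["Credit Card Number"],
            "security_code": row["Credit Card CVV"],
            "iban": row["Bank IBAN"],
        },
        "address": {
            "street": row["Address Street"],
            "city": row["Address City"],
            "postcode": row["Address Postcode"],
        },
    }

flat = {
    "First Name": "Ada", "Last Name": "Lovelace", "Age": "36",
    "Vehicle Make": "Ford", "Vehicle Model": "Focus",
    "Vehicle Year": "2018", "Vehicle Type": "Hatchback",
    "Credit Card Start Date": "01/18", "Credit Card End Date": "01/25",
    "Credit Card Number": "4111111111111111", "Credit Card CVV": "123",
    "Bank IBAN": "GB29NWBK60161331926819",
    "Address Street": "1 Main St", "Address City": "Hull",
    "Address Postcode": "HU6 7RX",
}
person = nest_record(flat)
print(person["vehicle"]["year"])  # 2018
```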
3. The client informs you that they have had difficulty with errors in the dependants column. Some entries are empty (i.e. " " or ""), which may hinder your conversion from Task 2. These should be changed into something meaningful when encountered.
Print a list where all such error corrections take place.
E.g. Problematic rows for dependants: [16, 58, 80, 98]
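One possible shape for the Task 3 clean-up, assuming the column is named "Dependants" and that 0 is an acceptable "meaningful" replacement for a missing count (any documented choice would do):

```python
# Task 3 sketch: repair empty "dependants" entries, recording which rows
# needed it. The column name and the 0 default are assumptions.
def clean_dependants(rows):
    """Cast dependants to int in place; return indices of problematic rows."""
    problematic = []
    for index, row in enumerate(rows):
        value = row.get("Dependants", "").strip()
        if value == "":
            row["Dependants"] = 0  # meaningful default for a count
            problematic.append(index)
        else:
            row["Dependants"] = int(value)
    return problematic

sample = [
    {"Dependants": "2"},
    {"Dependants": " "},  # whitespace-only: treated as an error
    {"Dependants": ""},
    {"Dependants": "1"},
]
errors = clean_dependants(sample)
print(f"Problematic rows for dependants: {errors}")  # [1, 2]
```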
4. Write all records to a processed.json file in the JSON data format. This should be a list of dictionaries, where each index of the list is a dictionary representing a singular person.
5. You should create two additional file outputs, retired.json and employed.json, these should contain all retired customers (as indicated by the retired field in the CSV), and all employed customers respectively (as indicated by the employer field in the CSV) and be in the JSON data format.
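Tasks 4 and 5 can be sketched together, since all three outputs are JSON lists of dictionaries written the same way. The field names, and the convention that an empty employer string means "not employed", are assumptions about the data; a temporary directory stands in for the working directory:

```python
import json
import os
import tempfile

# Tasks 4 and 5 sketch: serialise the full record list, then the retired
# and employed subsets, as JSON lists of dictionaries.
def write_json(records, path):
    with open(path, "w") as f:
        json.dump(records, f, indent=4)

people = [
    {"first_name": "Ada", "retired": False, "employer": "UoH"},
    {"first_name": "Alan", "retired": True, "employer": ""},
]

out_dir = tempfile.mkdtemp()  # stands in for the working directory
write_json(people, os.path.join(out_dir, "processed.json"))
write_json([p for p in people if p["retired"]],
           os.path.join(out_dir, "retired.json"))
# Assumption: an empty employer string means "not employed".
write_json([p for p in people if p["employer"]],
           os.path.join(out_dir, "employed.json"))

with open(os.path.join(out_dir, "retired.json")) as f:
    print(json.load(f))  # [{'first_name': 'Alan', ...}]
```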
6. The client states that there may be some issues with credit card entries. Any customers with more than 10 years between their credit card start and end dates must be written to a separate file, remove_ccard.json, in the JSON data format. The client will deal with these manually later, based on your output. They request that you write a helper function which accepts a single row from the CSV data and outputs whether that row should be flagged. This can then be used when deciding whether to write the current person to the remove_ccard file.
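A sketch of the Task 6 flagging function. Both the MM/YY date format and the column names are assumptions that would need verifying against the real CSV:

```python
from datetime import datetime

# Task 6 sketch: flag rows whose card was valid for more than 10 years.
# The MM/YY format and the column names are assumptions about the CSV.
def should_remove_card(row):
    """Return True when the card's start-to-end span exceeds 10 years."""
    start = datetime.strptime(row["Credit Card Start Date"], "%m/%y")
    end = datetime.strptime(row["Credit Card End Date"], "%m/%y")
    years = (end.year - start.year) + (end.month - start.month) / 12
    return years > 10

ok_row = {"Credit Card Start Date": "01/15", "Credit Card End Date": "01/24"}
bad_row = {"Credit Card Start Date": "01/10", "Credit Card End Date": "06/22"}
print(should_remove_card(ok_row))   # False (9 years)
print(should_remove_card(bad_row))  # True (about 12.4 years)
```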
7. You have been tasked with calculating some additional metrics which will be used for ranking customers. You should create a new data attribute for our customers called "Salary-Commute". Reading in from processed.json:
a. Add, and calculate appropriately, this new attribute. It should represent the Salary that a customer earns, per mile of their commute.
i. Note: If a person travels 1 or fewer commute miles, then their salary-commute would be just their salary.
b. Sort these records by that new metric, in ascending order.
c. Store the output in the JSON data format, as a commute.json file.
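The Task 7 metric and sort might be sketched as follows, with assumed key names for the salary and commute-distance fields in processed.json:

```python
# Task 7 sketch: derive "Salary-Commute" (salary per commuted mile) and
# sort ascending by it. Key names are assumptions about processed.json.
def add_salary_commute(people):
    for person in people:
        miles = person["distance_commuted"]
        if miles <= 1:
            # Per the brief: short commutes use the salary itself.
            person["Salary-Commute"] = person["salary"]
        else:
            person["Salary-Commute"] = person["salary"] / miles
    people.sort(key=lambda p: p["Salary-Commute"])
    return people

people = [
    {"name": "Ada", "salary": 52000, "distance_commuted": 13.0},
    {"name": "Alan", "salary": 61000, "distance_commuted": 0.5},
    {"name": "Grace", "salary": 45000, "distance_commuted": 30.0},
]
ranked = add_salary_commute(people)
print([p["name"] for p in ranked])  # ['Grace', 'Ada', 'Alan']
```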
Data Visualisation
Using Pandas and Seaborn
Your client wishes to understand their customer data a little better through visualisations. Using Pandas and Seaborn, read in the original CSV file provided with the assignment.
1. Obtain the Data Series for Salary and Age, and calculate the following:
a. Mean Salary
b. Median Age
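Each summary statistic reduces to a single pandas call. A tiny inline frame stands in here for pd.read_csv("acw_user_data.csv"), and the column names are assumptions:

```python
import pandas as pd

# Visualisation task 1 sketch: pull out the Salary and Age Series and
# summarise them. Replace the inline frame with the real CSV read.
df = pd.DataFrame({"Salary": [30000, 45000, 60000], "Age": [25, 40, 61]})

mean_salary = df["Salary"].mean()
median_age = df["Age"].median()
print(mean_salary)  # 45000.0
print(median_age)   # 40.0
```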
2. Perform univariate plots of the following data attributes:
a. Age, calculating how many bins would be required for a bin_width of 5.
b. Dependants, fixing data errors with Seaborn itself.
c. Age (of default bins), conditioned on Marital Status
3. Perform multivariate plots with the following data attributes:
a. Commuted distance against salary.
b. Age against Salary
c. Age against Salary conditioned by Dependants
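Each multivariate plot is a single seaborn call, with the hue parameter handling the conditioning in (c). The column names and the inline sample frame are assumptions about the CSV:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display required
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Visualisation task 3 sketch: multivariate scatter plots.
df = pd.DataFrame({
    "Distance Commuted": [2.0, 10.5, 25.0, 7.5],
    "Salary": [30000, 45000, 60000, 38000],
    "Age": [25, 40, 61, 33],
    "Dependants": [0, 2, 1, 3],
})

plt.figure()
sns.scatterplot(data=df, x="Distance Commuted", y="Salary")        # (a)
plt.figure()
sns.scatterplot(data=df, x="Age", y="Salary")                      # (b)
plt.figure()
ax = sns.scatterplot(data=df, x="Age", y="Salary", hue="Dependants")  # (c)
print(ax.get_xlabel())  # Age
```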
4. Your client would like the ability to save the plots which you have produced. Provide a Notebook cell which can do this.
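One straightforward shape for the saving cell is to call savefig on each figure object. The output directory and file name below are purely illustrative (a temp directory, so the sketch runs anywhere):

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen rendering; no display required
import matplotlib.pyplot as plt

# Visualisation task 4 sketch: save a produced plot to disk.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [4, 5, 6])

out_dir = tempfile.mkdtemp()  # illustrative; use a real folder in the notebook
out_path = os.path.join(out_dir, "plot_1.png")
fig.savefig(out_path, dpi=150, bbox_inches="tight")
print(os.path.exists(out_path))  # True
```

In a notebook, looping over plt.get_fignums() and saving each figure in turn is one way to cover every plot produced by the earlier cells.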
Code Presentation
The code you produce to solve the above tasks should make good use of structure, logic, and commenting to be clear and robust. Variable names should be thoughtfully considered, and appropriate for purpose. Ensure that scope conflicts are avoided, and that variable names don't leak into other areas of code. You should consider how the structures covered throughout the module may be used. Similarly, you should be mindful of error handling where appropriate.
Use of Markdown Cells is advised to keep clear distinction between Tasks. When writing your algorithms for solving the above, it may be useful to keep in mind reusability of code, reducing the amount of boilerplate code, and duplicated code.
Attachment:- Data analysis.rar