
BDA601 Big Data and Analytics - Laureate International Universities

Assessment - Design Data Pipeline

Learning Outcome 1: Explain and evaluate the V's of Big Data (volume, velocity, variety, veracity, valence, and value)
Learning Outcome 2: Identify best practices in data collection and storage, including data security and privacy principles; and
Learning Outcome 3: Effectively report and communicate findings to an appropriate audience.

Task Summary
Critically analyse the online retail business case (see below) and write a 1,500-word report that:
a) Identifies various sources of data to build an effective data pipeline;
b) Identifies challenges in integrating the data from the sources and formulates a strategy to address those challenges; and
c) Describes a design for a storage and retrieval system for the data lake that uses commercial and/or open-source big data tools.

Context
A modern data-driven organisation must be able to collect and process large volumes of data and perform analytics at scale on that data. Thus, the establishment of a data pipeline is an essential first step in building a data-driven organisation. A data pipeline ingests data from various sources, integrates it and stores it in a ‘data lake', making it available to everyone in the organisation.

This Assessment prepares you to identify potential sources of data, address challenges in integrating data and design an efficient ‘data lake' using the big data principles, practices and technologies covered in the learning materials.

Case Study

Big Retail is an online retail shop in Adelaide, Australia. Its website, where users can explore different products and promotions and place orders, receives more than 100,000 visitors per month. During checkout, each customer has three options: 1) log in to an existing account; 2) create a new account if they have not already registered; or 3) check out as a guest. Customers' account information is maintained by both the sales and marketing departments in their separate databases. The sales department maintains records of the transactions in their database. The information technology (IT) department maintains the website.

Every month, the marketing team releases a catalogue and promotions, which are made available on the website and emailed to the registered customers. The website is static; that is, all the customers see the same content, irrespective of their location, login status or purchase history.

Recently, Big Retail has experienced a significant slump in sales, despite its having a cost advantage over its competitors. A significant reduction in the number of visitors to the website and the conversion rate (i.e., the percentage of visitors who ultimately buy something) has also been observed. To regain its market share and increase its sales, the management team at Big Retail has decided to adopt a data-driven strategy. Specifically, the management team wants to use big data analytics to enable a customised customer experience through targeted campaigns, a recommender system and product association.

The first step in moving towards the data-driven approach is to establish a data pipeline. The essential purpose of the data pipeline is to ingest data from various sources, integrate the data and store the data in a ‘data lake' that can be readily accessed by both the management team and the data scientists.

Task Instructions

Critically analyse the above case study and write a 1,500-word report. In your report, ensure that you:

• Identify the potential data sources that align with the objectives of the organisation's data-driven strategy. You should consider both internal and external data sources. For each data source identified, describe its characteristics. Make reasonable assumptions about the fields and format of the data for each of the sources (an illustrative record is sketched after this list);

• Identify the challenges that will arise in integrating the data from different sources and that must be resolved before the data are stored in the ‘data lake.' Articulate the steps necessary to address these issues;

• Describe the ‘data lake' that you designed to store the integrated data and make the data available for efficient retrieval by both the management team and the data scientists. The system should be designed using commercial and/or open-source databases, tools and frameworks. Demonstrate how the ‘data lake' meets big data storage and retrieval requirements; and

• Provide a schematic of the overall data pipeline. The schematic should clearly depict the data sources, data integration steps, the components of the ‘data lake' and the interactions among all the entities.
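To illustrate the kind of assumption the first bullet asks for, a website clickstream source might emit semi-structured JSON events. The record below is a purely hypothetical sketch; every field name and value is an illustrative assumption, not something given in the case study.

```python
# Hypothetical clickstream event from the Big Retail website.
# All field names and values are illustrative assumptions.
clickstream_event = {
    "event_id": "e-102938",
    "timestamp": "2021-03-05T09:14:32+10:30",  # Adelaide local time
    "session_id": "s-55821",
    "customer_id": None,                       # None for guest visitors
    "page": "/products/garden/item-1234",
    "action": "add_to_cart",
    "referrer": "email_campaign_march",
    "user_agent": "Mozilla/5.0",
}
```

Describing each source at this level of detail (fields, types, structure, arrival rate) makes the later integration and ‘data lake' design decisions concrete.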

Assessment - Visualisation and Model Development

Learning Outcome 1: Apply data science principles to the cleaning, manipulation, and visualisation of data
Learning Outcome 2: Design analytical models based on a given problem; and
Learning Outcome 3: Effectively report and communicate findings to an appropriate audience.

Task Summary
Customer churn, also known as customer attrition, refers to the movement of customers from one service provider to another. It is well known that attracting new customers costs significantly more than retaining existing customers. Additionally, long-term customers are found to be less costly to serve and less sensitive to competitors' marketing activities. Thus, predicting customer churn is valuable to telecommunication industries, utility service providers, paid television channels, insurance companies and other business organisations providing subscription-based services. Customer-churn prediction allows for targeted retention planning.

In this Assessment, you will build a machine learning (ML) model to predict customer churn using the principles of ML and big data tools.
As part of this Assessment, you will write a 1,000-word report that will include the following:
a) A predictive model from a given dataset that follows data mining principles and techniques;
b) Explanations as to how to handle missing values in a dataset; and
c) An interpretation of the outcomes of the customer churn analysis.
Please refer to the Task Instructions (below) for details on how to complete this task.

Task Instructions
1. Dataset Construction

The Kaggle Telco Customer Churn dataset is a sample dataset from IBM, containing 21 attributes of approximately 7,043 telecommunication customers. In this Assessment, you are required to work with a modified version of this dataset (the dataset can be found at the URL provided below). Modify the dataset by removing the following attributes: MonthlyCharges, OnlineSecurity, StreamingTV, InternetService and Partner.
As the dataset is in .csv format, any spreadsheet application, such as Microsoft Excel or OpenOffice Calc, can be used to modify it. You will use the resulting dataset, which should comprise 7,043 observations and 16 attributes, to complete the subsequent tasks. The ‘Churn' attribute (i.e., the last attribute in the dataset) is the target of your churn analysis.
Kaggle.com. (2020). Telco customer churn-IBM sample data sets.
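If you prefer to script the modification rather than edit the file in a spreadsheet, a minimal pandas sketch is shown below. The input filename is an assumption based on the usual Kaggle download; adjust it to match your copy.

```python
import pandas as pd

# Filename assumed from the Kaggle download; adjust to match your copy
df = pd.read_csv("WA_Fn-UseC_-Telco-Customer-Churn.csv")

# Remove the five attributes listed in the task instructions
df = df.drop(columns=["MonthlyCharges", "OnlineSecurity", "StreamingTV",
                      "InternetService", "Partner"])

# The result should have 7,043 observations and 16 attributes
assert df.shape == (7043, 16)
df.to_csv("telco_churn_modified.csv", index=False)
```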

2. Model Development
From the dataset constructed in the previous step, present appropriate data visualisations and descriptive statistics, then develop a ‘decision-tree' model to predict customer churn. The model can be developed in a Jupyter Notebook using Python and Spark's Machine Learning Library (PySpark MLlib). You can use any other platform if you find it more efficient. The notebook should include the following sections:
a) Problem Statement
In this section, briefly state the context and the problem you will solve in the notebook.
b) Exploratory Data Analysis
In this section, perform both a visual and statistical exploratory analysis to gain insights about the dataset.
c) Data Cleaning and Feature Selection
In this section, perform data pre-processing and feature selection for the model, which you will build in the next section.
d) Model Building
In this section, use the pre-processed data and the selected features to build a ‘decision-tree' model to predict customer churn.
In the notebook, the code should be well documented, the graphs and charts should be neatly labelled, the narrative text should clearly state the objectives and a logical justification for each of the steps should be provided.
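A minimal PySpark MLlib sketch of the model-building step is shown below, assuming the 16-attribute file produced earlier. The two illustrative features, the indexed categorical column and the hyperparameters are assumptions that you should replace with the outcome of your own exploratory analysis and feature selection.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("TelcoChurn").getOrCreate()

# The 16-attribute file from the dataset-construction step
df = spark.read.csv("telco_churn_modified.csv", header=True, inferSchema=True)

# Index the string target and one illustrative categorical feature
label_indexer = StringIndexer(inputCol="Churn", outputCol="label")
contract_indexer = StringIndexer(inputCol="Contract", outputCol="ContractIdx")

# Assemble numeric and indexed columns into a single feature vector
assembler = VectorAssembler(
    inputCols=["tenure", "SeniorCitizen", "ContractIdx"],
    outputCol="features",
)

tree = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                              maxDepth=5)
pipeline = Pipeline(stages=[label_indexer, contract_indexer, assembler, tree])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)
predictions = model.transform(test)

accuracy = MulticlassClassificationEvaluator(
    labelCol="label", predictionCol="prediction", metricName="accuracy"
).evaluate(predictions)
print(f"Test accuracy: {accuracy:.3f}")
```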
3. Handling Missing Values
The given dataset has very few missing values; however, in a real-world scenario, data scientists often need to work with datasets with many missing values. If an attribute is important to building an effective model and has significant missing values, then data scientists need to come up with strategies to handle those missing values.
From the ‘decision-tree' model built in the previous step, identify the most important attribute. If a significant number of values were missing in the most important attribute's column, implement a method to replace the missing values and describe that method in your report.
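One way to carry this out, sketched under the assumption that the pipeline above was used: read the feature-importance vector off the fitted tree, then impute the chosen column. The use of ‘tenure' and of median imputation here are illustrative assumptions, not the required method.

```python
from pyspark.ml.feature import Imputer

# The fitted DecisionTreeClassificationModel is the last pipeline stage
tree_model = model.stages[-1]
ranked = sorted(
    zip(assembler.getInputCols(), tree_model.featureImportances.toArray()),
    key=lambda kv: -kv[1],
)
print(ranked)  # most important attribute first

# Illustrative only: if 'tenure' were most important and had many missing
# values, median imputation is one simple replacement strategy
imputer = Imputer(inputCols=["tenure"], outputCols=["tenure_imputed"],
                  strategy="median")
df_imputed = imputer.fit(df).transform(df)
```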

4. Interpretation of Churn Analysis
Modelling churn is difficult because there is inherent uncertainty when measuring churn. Thus, it is important not only to understand any limitations associated with a churn analysis but also to be able to interpret the outcomes of a churn analysis.
In your report, interpret and describe the key findings that you were able to discover as part of your churn analysis. Describe the following facts with supporting details:
• The effectiveness of your churn analysis: What percentage of the time did your analysis correctly identify churn? Can this be considered a satisfactory outcome? Explain why or why not;
• Who is churning: Describe the attributes of the customers who are churning and explain what is driving the churn; and
• Improving the accuracy of your churn analysis: Describe the effects that the previous steps (model development and the handling of missing values) had on the outcome of your churn analysis and how its accuracy could be improved.

Assessment - Model Evaluation

Learning Outcome 1: Apply data science principles to the cleaning, manipulation and visualisation of data;
Learning Outcome 2: Design analytical models based on a given problem; and
Learning Outcome 3: Effectively report and communicate findings to an appropriate audience.

Task Summary
Any enterprise-level big data analytics project aimed at solving a real-world problem will generally comprise three phases:
1. Data preparation;
2. Data analysis and visualisation; and
3. Making decisions based on the analysis or insights.
In this Assessment, you will help the global community in its fight against COVID-19 by discovering meaningful insights in a dataset compiled by the Johns Hopkins University Center for Systems Science and Engineering.
Given the significance of the issue, you will slice and dice the data using different methods and drill down to gain insights that will help the individuals concerned make the right decisions.
Please refer to the Task Instructions (below) for details on how to complete this task.

Task Instructions

1. Dataset Preparation
The Johns Hopkins University COVID-19 dataset is a time-series dataset that officially began recording the global numbers of confirmed infections, deaths and recovered patients on 22 January 2020. The fields available in the dataset include the Province/State, the Country/Region, the Latitude and Longitude of a country and the dates. The data period runs from 22 January 2020 to the present.

In this Assessment, you are required to work with the latest version of this dataset (the version you use will depend on the day you download it). The dataset can be found at the URL provided below.
For this Assessment, you are only required to download the dataset related to confirmed infection numbers (i.e., only download the file named: time_series_covid19_confirmed_global.csv).
All of the analyses for this Assessment should be conducted on the confirmed infection numbers. You should use the dataset as it is without making any modifications to the downloaded file.
Humdata.org. (2020). Novel Coronavirus (Covid-19) cases data.

2. Data Analysis and Visualisation
Using the dataset downloaded in the previous step, undertake a data analysis and visualisation of the top three infected countries.
The top three infected countries should be selected based on the total count of infected people from 22 January 2020 to the latest date in your file.
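A hedged pandas sketch of this selection step, assuming the column layout of the downloaded file (Province/State, Country/Region, Lat, Long, then one column per date) and that the counts are cumulative, so the latest date column carries each country's total:

```python
import pandas as pd

df = pd.read_csv("time_series_covid19_confirmed_global.csv")

# Aggregate provinces/states up to the country level
by_country = df.groupby("Country/Region").sum(numeric_only=True)

# Counts are cumulative, so the last date column is each country's total
top3 = by_country.iloc[:, -1].nlargest(3)
print(top3)
```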
The analysis and the visualisation can be completed using the Python libraries of your choice (e.g., PySpark MLlib). You can use any other platform if you find it more efficient. The analysis and the visualisation should address the following sections collectively:

a) Predictive Modelling
In this section, fit a linear regression model to the time-series data for each of the three countries under the assumption that the infection rate has been increasing since the official record started. In this model, your dependent variable will be the count of infections and your independent variable will be the week number.
Please note, you should convert the time-series data and represent the dates in the form of a week number. For example, 22 January 2020 to 28 January 2020 will be Week 1, 29 January 2020 to 4 February 2020 will be Week 2, etc.
Once all three linear regression models are ready, analyse the models thoroughly and identify the model with the highest variance. Select that country and its linear regression model and move to the next step.
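A sketch of the week-number conversion and the per-country fit, assuming scikit-learn and the `by_country` frame from the earlier step (treating the first two summed columns as Lat and Long is an assumption about the file layout):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def weekly_series(daily):
    """Collapse a daily cumulative series starting 22 Jan 2020 into weeks,
    taking the count observed on the last day of each full week."""
    daily = np.asarray(daily, dtype=float)
    counts = daily[6::7]                    # day 7 of week 1, week 2, ...
    weeks = np.arange(1, len(counts) + 1)
    return weeks, counts

models = {}
for country in top3.index:
    # Skip the summed Lat/Long columns; the rest are the date columns
    weeks, counts = weekly_series(by_country.loc[country].iloc[2:])
    X = weeks.reshape(-1, 1)
    fit = LinearRegression().fit(X, counts)
    models[country] = fit
    print(country, "slope:", fit.coef_[0], "R^2:", fit.score(X, counts))
```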

b) Clustering
In this section, perform a K-Means clustering on the dataset used in the previous step for the country that had the highest amount of variance.
In the previous step, one of the assumptions was that the infection rate has been increasing since the official record started. Clustering should help you to validate that assumption and, most importantly, should help you discover a trend in the infection count over a period.
Determine the best value of K for K-Means clustering through iteration. Once the clusters stabilise, analyse the clusters thoroughly and observe the trend over time.
For example, consider whether you had cluster(s) at the top of the graph in the first weeks of the record, whether the cluster(s) came back down in the graphs in the following weeks and whether the cluster(s) went up again. You will use these observations in the next step.
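A minimal scikit-learn sketch of the iteration over K, assuming the (week, count) pairs for the chosen country from the previous step; inertia (within-cluster sum of squares) is recorded so an elbow plot can guide the choice of K:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Placeholder: the country selected from the regression step
chosen = top3.index[0]
weeks, counts = weekly_series(by_country.loc[chosen].iloc[2:])

# Scale so the large infection counts do not dominate the week dimension
points = StandardScaler().fit_transform(np.column_stack([weeks, counts]))

# Iterate over candidate K values; record inertia for an elbow plot
inertias = {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(points)
    inertias[k] = km.inertia_
print(inertias)  # look for the 'elbow' where the decrease flattens
```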

c) Graph Analytics
In this section, perform graph analytics and show the relationship between the country in question in the previous step and its neighbouring countries based on the weekly count of infection. Assume that the neighbouring countries do not share any borders with each other.
To determine the neighbouring countries, you can either use the latitude and longitude information from the dataset or your own knowledge of geography and present a graphical view.
As part of this analysis, assume that the neighbouring countries may also display similar cluster trends over a period (as seen in the previous step). In your video presentation, you will make recommendations to these neighbouring countries in relation to possible trends.
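A hedged networkx sketch of the graph view described above; the country and neighbour names and the weekly counts are placeholders to be replaced with your chosen country, its actual neighbours and their latest weekly infection counts:

```python
import networkx as nx
import matplotlib.pyplot as plt

# Placeholder names and weekly counts; replace with your chosen country
# and its geographic neighbours
G = nx.Graph()
for neighbour, weekly_count in {"NeighbourA": 12000,
                                "NeighbourB": 8500}.items():
    G.add_edge("ChosenCountry", neighbour, weight=weekly_count)

pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_color="lightblue")
nx.draw_networkx_edge_labels(
    G, pos, edge_labels=nx.get_edge_attributes(G, "weight"))
plt.show()
```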

d) Visualisation
In this section, you are required to visualise your analytical findings (that you derived using the above steps).
In big data and analytics projects, visualisation is an integral part of any analysis and often brings the analysis to life. Thus, ensure that you produce a high-quality visualisation, which you can use to tell stories and drill down from the raw data to the decision-making process.
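As one example of the kind of view this section asks for, a matplotlib sketch that overlays the fitted regression line on the weekly counts for one country (assuming `weekly_series`, `by_country`, `top3` and the `models` dict from the earlier sketches):

```python
import matplotlib.pyplot as plt

country = top3.index[0]  # illustrative choice
weeks, counts = weekly_series(by_country.loc[country].iloc[2:])

plt.scatter(weeks, counts, label="Weekly confirmed count")
plt.plot(weeks, models[country].predict(weeks.reshape(-1, 1)),
         color="red", label="Linear fit")
plt.xlabel("Week number")
plt.ylabel("Cumulative confirmed infections")
plt.title(f"{country}: confirmed infections by week")
plt.legend()
plt.show()
```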

3. Video Presentation
After completing the whole data analysis and visualisation process, the outcomes need to be communicated to the neighbouring countries as identified in the previous step. Thus, you should prepare a video presentation summarising the insights discovered in the previous step. You should use 8-10 slides in your presentation and your presentation should be no longer than 10 minutes.

This video presentation is related to the big data and analytics project phase ‘making decisions based on the analysis and insights' (as described above). Thus, the contents of this video should be extremely helpful to the neighbouring countries as they make decisions about their COVID-19 policies.

Consequently, as you communicate possible trends of infection, ensure that you support your findings with any insights that you discovered through predictive modelling, clustering, graph analytics and visualisation. Tell a story to your listeners by presenting drilled-down views of your discoveries and by relating all the outcomes from the analysis that you completed in the previous steps: predictive modelling, clustering, graph analytics and visualisation.
