Implement a classification system with logistic regression

Reference no: EM132646367

ICT707 Data Science Practice - University of the Sunshine Coast

Assignment Task

This assignment consists of two deliverables, being:
• One code implementation. This requires a zip file which should include:
o The code file in Jupyter Notebook format.
o Relevant data set files.
o A PDF or HTML file printed/converted from your Notebook after all cells have been executed.
• A report. The report must be uploaded as a separate file.

Part I - PySpark source code

Important Note:
• For reproducibility, your code must be self-contained. That is, it should not require any libraries other than the PySpark environment we have used during the semester, and the data files must be packaged with your code file.
• The data sets used in the lecture slides must not be used as the data set for the assignment. Doing so will result in a mark of 0 for the coding component.

In this component, you need to use Python 3 and PySpark to complete the following data analysis tasks:
1. Exploratory data analysis
2. Recommendation engine
3. Classification

You need to choose a dataset from Kaggle (https://www.kaggle.com/datasets) to complete these tasks. Remember to include the data set file in your source code submission.
Note: In your notebook, please use a Heading 1 Markdown cell to separate each subtask.

Task I.1: Exploratory data analysis

This subtask requires you to explore your dataset by:
• reporting its number of rows and columns,
• cleaning the data (handling missing values or duplicated records) if necessary, and
• selecting 3 columns and drawing 1 plot (e.g. bar chart, histogram, boxplot) for each to summarise it.
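
The exploration steps above can be sketched in plain Python to show what each one computes. This is only an illustration under stated assumptions: the inline CSV, its column names (`user_id`, `genre`, `rating`) and the helper functions are hypothetical, and the submission itself must use PySpark, where DataFrame methods such as `count()`, `dropDuplicates()` and `na.drop()` cover the same ground. The value counts at the end are the data a bar chart of one column would display.

```python
import csv
import io
from collections import Counter

# Hypothetical sample standing in for a Kaggle CSV file.
RAW_CSV = """user_id,genre,rating
1,drama,4
2,comedy,5
2,comedy,5
3,drama,
4,action,3
5,drama,2
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Drop rows with an empty field, then drop exact duplicates (order kept)."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(row.items())
        if "" in row.values() or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


rows = load_rows(RAW_CSV)
cleaned = clean(rows)
n_rows, n_cols = len(rows), len(rows[0])

# Value counts for one column: the numbers behind a bar chart.
genre_counts = Counter(row["genre"] for row in cleaned)

print(f"raw shape: {n_rows} rows x {n_cols} columns")
print(f"after cleaning: {len(cleaned)} rows")
print("genre counts:", dict(genre_counts))
```

Reporting the shape before and after cleaning makes it easy to state in the notebook how many records were removed and why.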

Task I.2: Recommendation engine

This subtask requires you to implement a recommender system based on collaborative filtering with the Alternating Least Squares (ALS) algorithm. You need to include:
• Model training and predictions
• Model evaluation using mean squared error (MSE)
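
To make the idea behind the algorithm concrete, here is a minimal from-scratch sketch of ALS with a single latent factor, followed by an MSE calculation. It is not PySpark's implementation: the ratings, the function names and the rank-1 simplification are assumptions chosen for readability, and the MSE is computed on the training ratings only, whereas a real evaluation would use a held-out split. In the assignment itself, the corresponding PySpark classes are `pyspark.ml.recommendation.ALS` and `pyspark.ml.evaluation.RegressionEvaluator` (with `metricName="mse"`).

```python
def als_rank1(ratings, n_iters=10, reg=0.1):
    """Alternate closed-form updates of scalar user and item factors.

    ratings is a list of (user, item, rating) triples. A rating is
    predicted as user_factor[user] * item_factor[item].
    """
    user_factor = {u: 1.0 for u, _, _ in ratings}
    item_factor = {i: 1.0 for _, i, _ in ratings}
    for _ in range(n_iters):
        # Hold item factors fixed; solve each user's ridge regression.
        for u in user_factor:
            seen = [(item_factor[i], r) for uu, i, r in ratings if uu == u]
            user_factor[u] = sum(f * r for f, r in seen) / (
                reg + sum(f * f for f, _ in seen))
        # Hold user factors fixed; solve each item's ridge regression.
        for i in item_factor:
            seen = [(user_factor[u], r) for u, ii, r in ratings if ii == i]
            item_factor[i] = sum(f * r for f, r in seen) / (
                reg + sum(f * f for f, _ in seen))
    return user_factor, item_factor


def mse(ratings, user_factor, item_factor):
    """Mean squared error between actual and predicted ratings."""
    errors = [(r - user_factor[u] * item_factor[i]) ** 2
              for u, i, r in ratings]
    return sum(errors) / len(errors)


# Hypothetical (user, item, rating) triples.
RATINGS = [
    (0, 0, 5.0), (0, 1, 3.0),
    (1, 0, 4.0), (1, 2, 1.0),
    (2, 1, 2.0), (2, 2, 1.0),
]

# n_iters=0 returns the untrained factors (all ones) as a baseline.
baseline_mse = mse(RATINGS, *als_rank1(RATINGS, n_iters=0))
trained_mse = mse(RATINGS, *als_rank1(RATINGS, n_iters=10))

print(f"baseline MSE: {baseline_mse:.3f}")
print(f"trained MSE:  {trained_mse:.3f}")
```

Comparing the trained model against a simple baseline, as done here, gives the MSE figure some context when you discuss it in the report.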

Task I.3: Classification

This subtask requires you to implement a classification system with logistic regression. You need to include:
• Logistic Regression model training
• Model evaluation
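
The two steps above, training and evaluation, can be illustrated with a small from-scratch sketch in plain Python. It assumes a hypothetical one-feature data set and uses batch gradient descent with accuracy as the evaluation metric; none of this is prescribed by the task. In the assignment the model would instead come from `pyspark.ml.classification.LogisticRegression`, with an evaluator such as `BinaryClassificationEvaluator`.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def score(x, w, b):
    """Linear score w.x + b for one feature vector."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_logistic(X, y, lr=0.1, epochs=500):
    """Batch gradient descent on the mean log-loss; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(score(xi, w, b)) - yi
            for j in range(d):
                grad_w[j] += err * xi[j] / n
            grad_b += err / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(X, w, b):
    """Label 1 when the predicted probability is at least 0.5."""
    return [1 if sigmoid(score(xi, w, b)) >= 0.5 else 0 for xi in X]


def accuracy(y_true, y_pred):
    """Fraction of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical centred feature; the label is 1 for positive values.
X_train = [[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]]
y_train = [0, 0, 0, 0, 1, 1, 1, 1]
X_test = [[-1.2], [-0.3], [0.8], [1.7]]
y_test = [0, 0, 1, 1]

w, b = train_logistic(X_train, y_train)
test_accuracy = accuracy(y_test, predict(X_test, w, b))
print(f"weights={w}, bias={b:.4f}, test accuracy={test_accuracy:.2f}")
```

Evaluating on examples that were not used for training, as with `X_test` here, is what makes the reported metric meaningful.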

Part II - Report

You are required to write a report with the following content:
• Provide a high-level survey on the advances of data science in the past 2 years.
• Compare the features of Spark version 2.4 that we used this semester and the new version 3.0.
• Explain your design and implementation of the machine learning parts in your code, including the following topics:
o Background of your selected data set
o For each task, which learning algorithm is used, what its key parameters are, and how you set them
o For each task, comments on and an evaluation of the model learnt

Your report should use the following template:

Table of Contents

1.0 Advancement of Data Science (500 words)

2.0 Comparison of Spark 2.4 and 3.0 (250 words)

3.0 Machine Learning Implementation (250 words)
Data set
Collaborative filtering
Features of the model, key parameters and configuration
Evaluation
Logistic regression
Features of the model, key parameters and configuration
Evaluation

References

Assignment Advice

This assignment will take several weeks to complete and will require a good understanding of machine learning and PySpark for successful completion. It is imperative that students take heed of the following points in relation to doing this assignment:

1. Ensure that you clearly understand the requirements for the assignment - what must be done and what are the deliverables.
2. If you do not understand any of the assignment requirements, please ask your tutor.
3. Each time you work on any aspect of the assignment, reread the assignment requirements to ensure that what is required is clearly understood.
4. We have practised nearly all of the coding tasks in DataCamp before. If you have any difficulty, redoing the DataCamp exercises is recommended.
5. Prior to submitting your code, you should ensure not only that it executes as required, but also that it looks professional. You are expected to adhere to Python standards for naming and indenting. All methods should be documented well enough that another programmer examining your code will readily understand what it is doing.

Attachment:- Data Science Practice.rar

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd