Implement regression algorithms to predict the hourly value

Reference no: EM131687794

Machine Learning Assignment

1) Regression: Consider the Bike Share dataset from the UCI machine learning repository. The dataset contains three files: day.csv, hour.csv, and Readme.txt. Both data files (day.csv and hour.csv) contain a combination of integer-valued features (e.g., season, weekday or not) and real-valued features (e.g., temperature, windspeed). The dataset is described in the Data Characteristics section of the README (Readme.txt). Just like the previous assignment, spend some time understanding the structure of the dataset: how the instances are organized, how the features are organized, what the various features mean (see the README), which features are useful for the task at hand, and so on. Do not attempt to run any machine learning algorithm before understanding the structure of the dataset.

Note, in particular, the last three fields in the data, viz. casual (denotes casual riders), registered (registered riders), and cnt (total ridership count).

1. Using only hour.csv, implement regression algorithms (both linear and k-nearest neighbors) to predict the hourly values for:

a. the number of casual riders

b. the number of registered riders

c. total ridership count

2. Using only day.csv, implement regression algorithms (both linear and k-nearest neighbors) to predict the daily values for:

a. the number of casual riders

b. the number of registered riders

c. total ridership count

Therefore, for each dataset, you are reporting 3 models for linear regression and 3 models for KNN regression.

Note: Remember that using one of the target values as a feature when predicting any of the counts (casual, registered, or total) would defeat the purpose of the learning algorithm, because it makes the problem too easy. That is, the number of casual riders cannot be used as a feature to predict the number of registered riders or total ridership. Similarly, the number of registered riders cannot be used to predict the other two values, and so on. The learning algorithm may use only features such as season, temperature, real-feel, etc.

For instance, suppose you are using the hour.csv dataset and you want to predict the number of casual riders. You have to remove the columns for the number of registered riders and total ridership first, and only then start training/testing your model. Similarly, when you are building the prediction model for total ridership, you have to remove the columns for casual riders and registered riders first, and only then start training/testing your model.
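The column-removal step described above could be sketched as follows. This is a minimal illustration, not a prescribed solution: the helper name `make_xy` is hypothetical, and the small frame stands in for hour.csv with made-up values. The real files may also contain identifier or date columns that need separate handling, so check the README before reusing this.

```python
import pandas as pd

# Count columns named in the assignment; any one of them may be the target.
TARGET_COLUMNS = ["casual", "registered", "cnt"]


def make_xy(df: pd.DataFrame, target: str):
    """Return (X, y) where X has all three count columns removed."""
    if target not in TARGET_COLUMNS:
        raise ValueError(f"target must be one of {TARGET_COLUMNS}")
    y = df[target]
    # Drop every count column, not just the target, to avoid leaking the answer.
    X = df.drop(columns=TARGET_COLUMNS)
    return X, y


# Tiny made-up frame standing in for hour.csv (values are illustrative only).
demo = pd.DataFrame({
    "season": [1, 1, 2],
    "temp": [0.24, 0.22, 0.50],
    "windspeed": [0.0, 0.1, 0.2],
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "cnt": [16, 40, 32],
})
X, y = make_xy(demo, "casual")
```

Calling `make_xy` once per target gives the three feature/target pairs needed for each dataset.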

As before, you will need to separate the data into training set and test set (decide on the proportion of splits yourself). Evaluate the performance of your regression using suitable measures. Report on the performance results and which model(s) worked best (and why in your opinion).
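One possible shape for the split-fit-evaluate loop is sketched below with scikit-learn. Several choices here are assumptions rather than requirements of the assignment: the 80/20 split, k = 5 neighbors, standardizing features before KNN, and RMSE and R² as the "suitable measures". The function name `evaluate_models` is hypothetical, and the data is synthetic so the sketch runs without the CSV files.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def evaluate_models(X, y, test_size=0.2, n_neighbors=5, seed=0):
    """Fit linear and KNN regressors on one split; return RMSE and R^2 per model."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=seed
    )
    models = {
        "linear": LinearRegression(),
        # KNN is distance-based, so features are standardized first (an assumption).
        "knn": make_pipeline(
            StandardScaler(), KNeighborsRegressor(n_neighbors=n_neighbors)
        ),
    }
    results = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        pred = model.predict(X_test)
        rmse = float(np.sqrt(mean_squared_error(y_test, pred)))
        results[name] = {"rmse": rmse, "r2": float(r2_score(y_test, pred))}
    return results


# Synthetic stand-in data: 200 rows, 3 features, noisy linear target.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)
scores = evaluate_models(X_demo, y_demo)
```

Running the same function on each of the three targets yields the six models per dataset, with comparable scores for the written report.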

2) Clustering: Consider the Seeds dataset from the UCI machine learning repository. The dataset comprises features from three different types of wheat kernels. Seven features (area, perimeter, compactness, length, width, asymmetry coefficient, and length of kernel groove) describe each data point. (Note that the dataset has an eighth column, containing class information with labels 1, 2, and 3, which we will use as ground truth to verify our clustering results.)

Using the k-means algorithm, cluster this dataset into three clusters based on the seven features at your disposal. Demonstrate the effectiveness of your implementation by comparing the results against the ground truth. Follow the steps in the k-means demo video from the lectures.
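A minimal k-means sketch with scikit-learn is shown below. It does not reproduce the steps of the lecture demo video, which are not given here. Standardizing the features, `n_init=10`, and the helper name `cluster_kernels` are assumptions, and the three synthetic blobs merely stand in for the seven Seeds features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def cluster_kernels(features, n_clusters=3, seed=0):
    """Standardize the features and return one k-means cluster id per row."""
    scaled = StandardScaler().fit_transform(features)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    return km.fit_predict(scaled)


# Synthetic stand-in: three well-separated blobs, 30 rows each, 7 features.
rng = np.random.default_rng(0)
centers = np.array([[0.0] * 7, [5.0] * 7, [10.0] * 7])
features = np.vstack(
    [c + rng.normal(scale=0.3, size=(30, 7)) for c in centers]
)
labels = cluster_kernels(features)
```

The returned ids are arbitrary integers starting at 0, so they still have to be matched to the class labels before any comparison with the ground truth.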

Also, note that the default label values in scikit-learn start from 0, whereas the labels in this dataset start from 1. When evaluating your implementation's effectiveness, be sure to account for this discrepancy.

As a performance measure, compare the clusters identified by k-means w.r.t. the ground truth data and make observations.
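Because k-means numbers its clusters arbitrarily, adding 1 to each cluster id is generally not enough to line them up with the labels 1, 2, and 3. One simple option, sketched here as an assumption rather than the required method, is to map each cluster to the most common true label among its members and then count agreements. The function names and the toy label lists are hypothetical.

```python
from collections import Counter


def map_clusters_to_classes(cluster_ids, true_labels):
    """Map each cluster id to the most common true label among its members."""
    mapping = {}
    for cid in set(cluster_ids):
        members = [t for c, t in zip(cluster_ids, true_labels) if c == cid]
        mapping[cid] = Counter(members).most_common(1)[0][0]
    return mapping


def clustering_accuracy(cluster_ids, true_labels):
    """Fraction of points whose mapped cluster label equals the true label."""
    mapping = map_clusters_to_classes(cluster_ids, true_labels)
    hits = sum(mapping[c] == t for c, t in zip(cluster_ids, true_labels))
    return hits / len(true_labels)


# Toy example: cluster ids start at 0, ground-truth labels start at 1.
clusters = [2, 2, 0, 0, 1, 1, 1, 0]
truth = [1, 1, 2, 2, 3, 3, 2, 2]
mapping = map_clusters_to_classes(clusters, truth)
accuracy = clustering_accuracy(clusters, truth)
```

In the toy data, seven of the eight points agree with the ground truth after mapping. Note that a majority vote can assign two clusters to the same class when the clustering is poor, which is itself a useful observation to report.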

Attachment: Assignment Files.rar


Reviews

len1687794

10/23/2017 7:17:30 AM

What to submit: a zipped file containing a PDF document summarizing your answers to questions 1 and 2, plus either hyperlinks to or actual attachments of your data files and your iPython notebook(s). In the PDF, distill your lessons and experiences succinctly.


All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd