Plot the data from each dataset using a scatter plot

Reference no: EM132230053, Length: 1000 words

Assignment - Introduction to Machine Learning

The report must be written as an R Markdown document.

Data sets included: binary-classifier-data.csv and trinary-classifier-data.csv

Regression algorithms are used to predict a numeric quantity, while classification algorithms predict categorical outcomes. A spam filter is an example use case for a classification algorithm. The input dataset is a set of emails labeled as either spam (i.e. junk emails) or ham (i.e. good emails). The classification algorithm uses features extracted from the emails to learn which emails fall into which category.

In this problem, you will use the nearest neighbors algorithm to fit a model on two simplified datasets. The first dataset (found in binary-classifier-data.csv) contains three variables: label, x, and y. The label variable is either 0 or 1 and is the output we want to predict using the x and y variables. The second dataset (found in trinary-classifier-data.csv) is similar to the first, except that the label variable can be 0, 1, or 2.

Note that in real-world datasets, your labels are usually not numbers, but text-based descriptions of the categories (e.g. spam or ham). In practice, you will encode categorical variables into numeric values.
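As a small illustration of that encoding step, the sketch below maps text labels to integer codes using base R's `factor()`. The `labels` vector is made up for the example and is not part of the assignment data.

```r
# Hypothetical text labels; the assignment files already store numeric labels.
labels <- c("spam", "ham", "ham", "spam")

# factor() records the distinct categories; the levels are given explicitly
# so that the coding does not depend on the order of the data.
label_factor <- factor(labels, levels = c("ham", "spam"))

# as.integer() returns 1-based codes, so subtract 1 to get 0 for ham and 1 for spam.
label_codes <- as.integer(label_factor) - 1L
print(label_codes)
```

Fixing the levels by hand keeps the mapping stable if a category happens to be missing from a given sample.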

a. Plot the data from each dataset using a scatter plot.
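One possible way to approach part (a) in base R is sketched below. The helper name `plot_labeled` and the synthetic `demo` data frame are assumptions made for illustration; with the real files, the data frame would come from `read.csv()` instead.

```r
# Draw a scatter plot of x against y, coloring each point by its label.
# `df` is expected to have the columns label, x, and y described above.
plot_labeled <- function(df, title) {
  classes <- sort(unique(df$label))
  # match() maps each label to a position 1, 2, ... used as a color index.
  point_colors <- match(df$label, classes)
  plot(df$x, df$y, col = point_colors, pch = 19,
       xlab = "x", ylab = "y", main = title)
  legend("topright", legend = classes, col = seq_along(classes),
         pch = 19, title = "label")
  invisible(point_colors)
}

# With the real files this would be, for example:
#   binary <- read.csv("binary-classifier-data.csv")
#   plot_labeled(binary, "Binary classifier data")

# Synthetic stand-in data (an assumption) so the sketch runs without the files.
set.seed(1)
demo <- data.frame(label = rep(c(0, 1), each = 20),
                   x = c(rnorm(20, mean = 25), rnorm(20, mean = 75)),
                   y = c(rnorm(20, mean = 25), rnorm(20, mean = 75)))
demo_colors <- plot_labeled(demo, "Synthetic demo data")
```

The same function works unchanged for the trinary dataset, since the color index is derived from whatever labels are present.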

b. The k nearest neighbors algorithm categorizes an input value by looking at the labels for the k nearest points and assigning a category based on the most common label. In this problem, you will determine which points are nearest by calculating the Euclidean distance between two points. As a refresher, the Euclidean distance between two points $p_1 = (x_1, y_1)$ and $p_2 = (x_2, y_2)$ is $d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
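To make the distance calculation and the majority vote concrete, here is a minimal from-scratch sketch in base R. The function names `euclidean` and `knn_predict_one` and the six-point `train` data frame are hypothetical and chosen only for illustration.

```r
# Euclidean distance from one query point (qx, qy) to every training point.
euclidean <- function(train_x, train_y, qx, qy) {
  sqrt((train_x - qx)^2 + (train_y - qy)^2)
}

# Predict the label of one query point by majority vote among its k nearest neighbors.
# The predicted label is returned as a character string.
knn_predict_one <- function(train, qx, qy, k) {
  d <- euclidean(train$x, train$y, qx, qy)
  nearest <- order(d)[seq_len(k)]          # indices of the k smallest distances
  votes <- table(train$label[nearest])     # count each label among the neighbors
  # which.max() takes the first label on a tie; other tie-breaking rules are possible.
  names(votes)[which.max(votes)]
}

# Tiny hand-made training set (an assumption, not the assignment data).
train <- data.frame(label = c(0, 0, 0, 1, 1, 1),
                    x = c(1, 2, 1, 8, 9, 8),
                    y = c(1, 1, 2, 8, 8, 9))

print(euclidean(0, 0, 3, 4))                 # a 3-4-5 right triangle
print(knn_predict_one(train, 2, 2, k = 3))   # query near the label-0 cluster
print(knn_predict_one(train, 7, 8, k = 3))   # query near the label-1 cluster
```

Sorting all distances for every query is fine for small datasets like these; library implementations typically do the same thing more efficiently.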

Fitting a model means using the input data to create a predictive model. There are various metrics you can use to determine how well your model fits the data. You will learn more about these metrics in later lessons. For this problem, you will focus on a single metric: accuracy. Accuracy is simply the percentage of predictions for which the model gives the correct result. If the model always predicts the correct result, it is 100% accurate. If the model always predicts the incorrect result, it is 0% accurate.
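The accuracy definition above translates directly into a one-line computation. The sketch below assumes two equal-length vectors of labels; the function name `accuracy` and the example vectors are made up for illustration.

```r
# Accuracy: the share of predictions that match the true labels, as a percentage.
accuracy <- function(predicted, actual) {
  stopifnot(length(predicted) == length(actual))
  100 * mean(predicted == actual)
}

actual    <- c(0, 1, 1, 0, 1)
predicted <- c(0, 1, 0, 0, 1)   # made-up predictions containing one mistake
print(accuracy(predicted, actual))
```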

Fit a k nearest neighbors model for each dataset for k=3, k=5, k=10, k=15, k=20, and k=25. Compute the accuracy of the resulting models for each value of k. Plot the results in a graph where the x-axis is the different values of k and the y-axis is the accuracy of the model.
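One way this experiment could be set up is sketched below with `knn()` from the `class` package, which is included with most R installations. The synthetic data frame `dat` stands in for the real CSV file, and the 80/20 train/test split is an assumption, since the assignment does not say which data the accuracy should be measured on.

```r
library(class)  # provides knn()

# Synthetic two-class data standing in for binary-classifier-data.csv (an assumption).
set.seed(42)
n <- 200
dat <- data.frame(label = rep(c(0, 1), each = n / 2),
                  x = c(rnorm(n / 2, mean = 40, sd = 15), rnorm(n / 2, mean = 60, sd = 15)),
                  y = c(rnorm(n / 2, mean = 40, sd = 15), rnorm(n / 2, mean = 60, sd = 15)))

# Hold out 20% of the rows for testing; the split ratio is an assumption.
test_idx <- sample(nrow(dat), size = 0.2 * nrow(dat))
train <- dat[-test_idx, ]
test  <- dat[test_idx, ]

k_values <- c(3, 5, 10, 15, 20, 25)

# For each k, classify the held-out points and record the percentage predicted correctly.
acc <- sapply(k_values, function(k) {
  pred <- knn(train = train[, c("x", "y")], test = test[, c("x", "y")],
              cl = factor(train$label), k = k)
  100 * mean(as.character(pred) == as.character(test$label))
})

results <- data.frame(k = k_values, accuracy = acc)
print(results)

# Accuracy on the y-axis against k on the x-axis, as the assignment asks.
plot(results$k, results$accuracy, type = "b",
     xlab = "k", ylab = "Accuracy (%)", main = "k-NN accuracy by k")
```

Repeating the same loop with the trinary dataset only requires swapping the data frame, because `knn()` handles any number of classes in `cl`.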

c. In later lessons, you will learn about linear classifiers. These algorithms work by defining a decision boundary that separates the different categories.

[Figure: 295_figure.png]

Looking back at the plots of the data, do you think a linear classifier would work well on these datasets?

Attachment: Assignment Files.rar






All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd