Draw scatter plots for the given data with initial centroids

Reference no: EM132892094

Programming Assignment

1. Naïve Bayes

(1) Complete the uploaded Python code.
First, binarize the training set (trainX) of the MAGIC Gamma Telescope data set. Each column is converted to a binary variable based on its average value: if a value is greater than the column average, set it to 1; otherwise, set it to 0. Then, using the binarized data set, calculate p_ij = P(x_j = 1 | y = i) (i = class, j = feature).

Feature      | Class g | Class h
P(x_1=1)     |         |
P(x_2=1)     |         |
P(x_3=1)     |         |
P(x_4=1)     |         |
P(x_5=1)     |         |
P(x_6=1)     |         |
P(x_7=1)     |         |
P(x_8=1)     |         |
P(x_9=1)     |         |
P(x_10=1)    |         |
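
The binarization and the per-class estimate of p_ij described above can be sketched as follows. This is a minimal illustration, not the uploaded code: the helper names `binarize` and `estimate_p` are hypothetical, and the small arrays stand in for trainX and its labels. No smoothing is applied, so a p_ij of exactly 0 or 1 is possible.

```python
import numpy as np

def binarize(X, thresholds):
    """Return 1 where a value exceeds its column threshold, else 0."""
    return (X > thresholds).astype(int)

def estimate_p(X_bin, y, classes):
    """p[i, j] = fraction of class-i rows whose binarized feature j equals 1."""
    return np.vstack([X_bin[y == c].mean(axis=0) for c in classes])

# Toy stand-in for trainX and its labels (hypothetical values, not the MAGIC data).
trainX = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [6.0, 40.0]])
trainY = np.array(["g", "g", "h", "h"])

col_means = trainX.mean(axis=0)            # thresholds come from the training set
trainX_bin = binarize(trainX, col_means)
p = estimate_p(trainX_bin, trainY, classes=["g", "h"])
print(p)                                   # row 0 = class g, row 1 = class h
```

The same `col_means` would be reused to binarize testX, since the test set must be thresholded with the training means.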

(3) Based on the calculated p_ij, calculate the probability of class g for each test sample (testX), and calculate the accuracy on testX for varying cutoffs. (To binarize testX, use the means of trainX.) The prior probabilities of the classes are proportional to the class ratios in the training set. cutoff ∈ {0.1, 0.15, 0.2, 0.25, ..., 0.95}. Draw a line plot (x = cutoff, y = accuracy).
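
One possible way to turn p_ij into a class-g probability and sweep the cutoff is sketched below, assuming a Bernoulli naive Bayes model over the binarized features. The function names, the clipping constant `eps`, and the toy inputs are assumptions made for illustration; the assignment does not prescribe them.

```python
import numpy as np

def posterior_g(X_bin, p_g, p_h, prior_g, eps=1e-9):
    """P(y=g | x) under a Bernoulli naive Bayes model with binary features."""
    # Clipping avoids log(0); eps is an assumption, not part of the assignment.
    p_g = np.clip(p_g, eps, 1 - eps)
    p_h = np.clip(p_h, eps, 1 - eps)
    log_g = np.log(prior_g) + X_bin @ np.log(p_g) + (1 - X_bin) @ np.log(1 - p_g)
    log_h = np.log(1 - prior_g) + X_bin @ np.log(p_h) + (1 - X_bin) @ np.log(1 - p_h)
    # Normalize in log space for numerical stability.
    m = np.maximum(log_g, log_h)
    num = np.exp(log_g - m)
    return num / (num + np.exp(log_h - m))

def accuracy_by_cutoff(prob_g, y_true, cutoffs):
    """Predict 'g' when P(g|x) >= cutoff; return the accuracy for each cutoff."""
    return np.array([np.mean(np.where(prob_g >= c, "g", "h") == y_true)
                     for c in cutoffs])

# Toy inputs (hypothetical): two binarized test rows and made-up p_ij values.
testX_bin = np.array([[1, 1], [0, 0]])
testY = np.array(["g", "h"])
prob = posterior_g(testX_bin, p_g=np.array([0.8, 0.6]),
                   p_h=np.array([0.2, 0.3]), prior_g=0.5)

cutoffs = np.linspace(0.10, 0.95, 18)      # 0.10, 0.15, ..., 0.95
acc = accuracy_by_cutoff(prob, testY, cutoffs)
print(prob, acc)
```

Passing `cutoffs` and `acc` to a line-plot call such as matplotlib's `plt.plot(cutoffs, acc)` would give the requested figure.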

(4) Explain why the plot in Question 1-(3) has the shape it does.

2. Decision Tree
The aim of the given data set is to predict annual income of people based on the following factors.
age: the age of an individual
capital-gain: capital gains for an individual
capital-loss: capital loss for an individual
hours-per-week: the hours an individual has reported to work per week
sex: 1 if male, 0 if female
native-country: 1 if USA, 0 if others
workclass_[#]: 1 if an individual belongs to workclass #, otherwise 0 (e.g., Workclass_Private is 1 if an individual works for a private company)
education_[#]: 1 if an individual's education level is #, otherwise 0 (education levels: Graduate > 4-year university > "<4-year university" > High school > "<High school" > Preschool)
marital-status_[#]: 1 if an individual's marital status is #, otherwise 0 (Married-civ-spouse corresponds to a civilian spouse, while Married-AF-spouse is a spouse in the Armed Forces)
occupation_[#]: 1 if an individual's occupation is # otherwise 0.
race_[#]: 1 if an individual's race is #, otherwise 0

The target is 'income' (">50K" or "<=50K").
fnlwgt represents the number of people the census believes the entry represents; it is not used in training.
(1) Train a decision tree with max_depth=3, min_samples_split=100, and min_samples_leaf=50, using entropy as the split criterion. Then, calculate the overall accuracy, the accuracy of class ">50K", and the accuracy of class "<=50K".
overall accuracy | accuracy of class ">50K" | accuracy of class "<=50K"
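
A minimal sketch of this setting with scikit-learn's `DecisionTreeClassifier` follows. The data here is synthetic and merely stands in for the income data set (with fnlwgt already dropped), and "accuracy of a class" is interpreted as the fraction of that class's samples predicted correctly, which is an assumption about what the question intends. The helper `per_class_accuracy` is hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def per_class_accuracy(y_true, y_pred, label):
    """Fraction of samples whose true class is `label` that are predicted as `label`."""
    mask = y_true == label
    return np.mean(y_pred[mask] == label)

# Synthetic stand-in for the income data (hypothetical features and labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = np.where(X[:, 0] + 0.5 * rng.normal(size=1000) > 0.8, ">50K", "<=50K")

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3,
                              min_samples_split=100, min_samples_leaf=50,
                              random_state=0)
tree.fit(X, y)
pred = tree.predict(X)

overall = np.mean(pred == y)
acc_high = per_class_accuracy(y, pred, ">50K")
acc_low = per_class_accuracy(y, pred, "<=50K")
print(overall, acc_high, acc_low)
```

Reporting the two per-class figures next to the overall accuracy makes it easier to see whether the model favors the majority class.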

(2) Based on the answer of Question 2-(1), describe the limitations of the trained decision tree model.

(3) Draw the trained tree.

(4) Explain the rule for class ">50K" that contains the most cases.

(5) Explain the rule for class "<=50K" that contains the most cases with an accuracy of 0.7 or higher.

(6) Train a new tree by changing the split criterion from entropy to Gini impurity, and compare the two models in terms of their performance and the rules they generate.
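
One way to put the two criteria side by side is to fit both trees with identical settings and print their rules as text with scikit-learn's `export_text` (`plot_tree` from the same module gives a graphical drawing instead). The sketch below uses synthetic data and placeholder feature names `f0`, `f1`, `f2`, all of which are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the income data (hypothetical features and labels).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.where(X[:, 0] + X[:, 1] > 1.0, ">50K", "<=50K")
feature_names = ["f0", "f1", "f2"]

params = dict(max_depth=3, min_samples_split=100,
              min_samples_leaf=50, random_state=0)

results = {}
for criterion in ("entropy", "gini"):
    model = DecisionTreeClassifier(criterion=criterion, **params).fit(X, y)
    results[criterion] = {
        "accuracy": model.score(X, y),                        # accuracy on the fitted data
        "rules": export_text(model, feature_names=feature_names),
    }
    print(criterion, results[criterion]["accuracy"])
    print(results[criterion]["rules"])
```

Each root-to-leaf path in the printed text corresponds to one rule, so the two outputs can be compared line by line.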

3. k-means clustering
This problem uses the data generated from 4 normal distributions for applying k-means clustering.

k-means as implemented in scikit-learn can take initial centroids through the 'init' parameter. When init is set to a c-by-p array (c = the number of clusters, p = the number of features), each row is used as a centroid.
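
As a small illustration of passing an array to `init`, the sketch below builds hypothetical 2-D data from four normal distributions and uses four of its rows as the initial centroids. The data, the chosen row indices, and `n_init=1` (a single run from the supplied centroids) are assumptions for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 4 normal distributions in 2-D, 50 points each.
rng = np.random.default_rng(0)
means = np.array([[0, 0], [5, 0], [0, 5], [5, 5]], dtype=float)
X = np.vstack([rng.normal(loc=m, scale=0.5, size=(50, 2)) for m in means])

# c-by-p array: 4 clusters (rows) by 2 features (columns).
init = X[[0, 50, 100, 150]]

km = KMeans(n_clusters=4, init=init, n_init=1).fit(X)
print(km.cluster_centers_)
```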

(1) Randomly select 4 samples from the given data set and use them as initial centroids. Repeat this procedure 100 times. Then, calculate the average silhouette coefficient and the average adjusted Rand index over the 100 iterations.

silhouette coefficient | adjusted Rand index
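
The 100-trial procedure could look roughly like the following sketch, which uses scikit-learn's `silhouette_score` and `adjusted_rand_score`. The synthetic data and the `true_labels` array (which distribution generated each point) are assumptions standing in for the given data set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical data: 4 normal distributions in 2-D, 50 points each.
rng = np.random.default_rng(0)
means = np.array([[0, 0], [4, 0], [0, 4], [4, 4]], dtype=float)
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(50, 2)) for m in means])
true_labels = np.repeat(np.arange(4), 50)

sil, ari = [], []
for _ in range(100):
    # Draw 4 distinct samples and use them as the initial centroids.
    idx = rng.choice(len(X), size=4, replace=False)
    km = KMeans(n_clusters=4, init=X[idx], n_init=1).fit(X)
    sil.append(silhouette_score(X, km.labels_))
    ari.append(adjusted_rand_score(true_labels, km.labels_))

print(np.mean(sil), np.mean(ari))
```

Keeping the per-trial lists `sil` and `ari`, rather than only running averages, also makes it possible to locate the worst trial later.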

(2) Randomly select one sample from each normal distribution and use them as initial centroids. Repeat this procedure 100 times. Then, calculate the average silhouette coefficient and the average adjusted Rand index over the 100 iterations. (5pts)
silhouette coefficient | adjusted Rand index
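
Picking one sample per distribution needs to know which distribution each point came from. The sketch below assumes that information is available as a `groups` label array; the helper `one_sample_per_group` and the toy values are hypothetical.

```python
import numpy as np

def one_sample_per_group(X, groups, rng):
    """Pick one random row of X from each distinct group label."""
    picks = [rng.choice(np.flatnonzero(groups == g)) for g in np.unique(groups)]
    return X[picks], np.array(picks)

# Toy stand-in: 8 points, 2 from each of 4 distributions (hypothetical values).
rng = np.random.default_rng(0)
X = np.arange(16, dtype=float).reshape(8, 2)
groups = np.array([0, 0, 1, 1, 2, 2, 3, 3])

init, picks = one_sample_per_group(X, groups, rng)
print(picks, init)
```

The returned `init` array has one row per distribution, so it can be passed directly as the `init` argument of scikit-learn's `KMeans`.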

(3) Draw scatter plots for the given data with the initial centroids and final centroids for the worst cases among the 100 trials in Question 3-(1), in terms of silhouette coefficient and adjusted Rand index, respectively. The initial centroids should be marked with a red 'X' and the final centroids with a blue 'X'.
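
A possible plotting helper is sketched below with matplotlib. The function name `plot_centroids`, the example scores, and the placeholder centroids are assumptions; in practice the initial and final centroids would come from the trial with the lowest score for each metric.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; remove this line to show windows interactively
import matplotlib.pyplot as plt
import numpy as np

def plot_centroids(X, init_centroids, final_centroids, title):
    """Scatter the data, marking initial centroids in red and final ones in blue."""
    fig, ax = plt.subplots()
    ax.scatter(X[:, 0], X[:, 1], s=10, c="gray", alpha=0.6)
    ax.scatter(init_centroids[:, 0], init_centroids[:, 1],
               marker="x", c="red", s=100, label="initial centroids")
    ax.scatter(final_centroids[:, 0], final_centroids[:, 1],
               marker="x", c="blue", s=100, label="final centroids")
    ax.set_title(title)
    ax.legend()
    return fig

# Hypothetical per-trial silhouette scores; the worst trial has the lowest score.
scores = np.array([0.61, 0.35, 0.58])
worst = int(np.argmin(scores))

# Placeholder data and centroids, only to exercise the helper.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2))
init_c = X[:4]
final_c = init_c + 0.1
fig = plot_centroids(X, init_c, final_c, f"Worst trial by silhouette (trial {worst})")
fig.savefig("worst_silhouette.png")
```

The same helper can be called a second time with the trial that minimizes the adjusted Rand index.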

(4) Draw scatter plots for the worst case of Question 3-(2) in the same way as in Question 3-(3).

(5) Based on the results of the 100 trials for each case, compare the two methods of determining initial centroids.

Attachment:- Programming Assignment.rar
