Programming Assignment
1. Naïve Bayes
(1) Complete the uploaded Python code.
First, binarize the training set (trainX) of the MAGIC Gamma Telescope data set. Each column is converted to a binary variable using its mean: if a value is greater than the column mean, set it to 1; otherwise, set it to 0. Then, using the binarized data set, calculate p_ij = P(x_j = 1 | y = i) (i = class, j = feature).
Probability | Class g | Class h
P(x_1=1)
P(x_2=1)
P(x_3=1)
P(x_4=1)
P(x_5=1)
P(x_6=1)
P(x_7=1)
P(x_8=1)
P(x_9=1)
P(x_10=1)
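The binarization and the per-class probabilities described above can be sketched as follows. This is a minimal illustration, not the uploaded code: the helper names `binarize` and `conditional_probs` and the toy arrays are assumptions introduced here, and the real trainX would replace the stand-in values.

```python
import numpy as np

def binarize(X, thresholds):
    """Return 1 where a value is strictly greater than its column threshold, else 0."""
    return (X > thresholds).astype(int)

def conditional_probs(X_bin, y, classes):
    """Row i holds P(x_j = 1 | y = classes[i]) for every feature j."""
    return np.vstack([X_bin[y == c].mean(axis=0) for c in classes])

# Toy stand-in for trainX / trainY (hypothetical values, not the MAGIC data).
trainX = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0],
                   [6.0, 20.0]])
trainY = np.array(["g", "g", "h", "h"])

col_means = trainX.mean(axis=0)            # thresholds come from the training set only
trainX_bin = binarize(trainX, col_means)
p = conditional_probs(trainX_bin, trainY, classes=["g", "h"])
print(p)  # row 0 = class g, row 1 = class h
```

The same `col_means` would be reused later to binarize testX, so that no test-set statistics leak into the model.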
(3) Based on the calculated p_ij, calculate the probability of class g for each test sample (testX), then calculate the accuracy on testX for varying cutoffs. (To binarize testX, use the column means of trainX.) The prior probabilities of the classes are proportional to the class ratios in the training set. cutoff ∈ {0.1, 0.15, 0.2, 0.25, ..., 0.95}. Draw a line plot (x = cutoff, y = accuracy).
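One way to compute the class-g posterior and sweep the cutoff is sketched below. The function names, the toy probabilities, and the convention "predict g when the posterior is at least the cutoff" are all assumptions made for illustration; the assignment does not state whether the comparison is strict.

```python
import numpy as np

def posterior_g(X_bin, p_g, p_h, prior_g):
    """P(y = g | x) under a Bernoulli Naive Bayes model, computed in log space."""
    eps = 1e-12  # guards against log(0) when a probability is exactly 0 or 1
    p_g = np.clip(p_g, eps, 1 - eps)
    p_h = np.clip(p_h, eps, 1 - eps)
    log_g = np.log(prior_g) + (X_bin * np.log(p_g) + (1 - X_bin) * np.log(1 - p_g)).sum(axis=1)
    log_h = np.log(1 - prior_g) + (X_bin * np.log(p_h) + (1 - X_bin) * np.log(1 - p_h)).sum(axis=1)
    return 1.0 / (1.0 + np.exp(log_h - log_g))

def accuracy_by_cutoff(post_g, y_true, cutoffs):
    """Assumed rule: predict 'g' when the posterior is >= cutoff, otherwise 'h'."""
    return np.array([np.mean(np.where(post_g >= c, "g", "h") == y_true) for c in cutoffs])

# Hypothetical toy values standing in for the trained p_ij and the binarized testX.
p_g = np.array([0.8, 0.3])      # P(x_j = 1 | y = g)
p_h = np.array([0.2, 0.6])      # P(x_j = 1 | y = h)
prior_g = 0.65                  # share of class g in the training set
testX_bin = np.array([[1, 0], [0, 1], [1, 1]])
testY = np.array(["g", "h", "g"])

cutoffs = np.linspace(0.1, 0.95, 18)   # 0.1, 0.15, ..., 0.95
post = posterior_g(testX_bin, p_g, p_h, prior_g)
acc = accuracy_by_cutoff(post, testY, cutoffs)
```

The arrays `cutoffs` and `acc` can then be passed to any plotting routine to draw the requested line plot.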
(4) Explain why the figure in Question 1-(3) has the shape it does.
2. Decision Tree
The aim is to predict the annual income of individuals in the given data set based on the following factors.
age: the age of an individual
capital-gain: capital gains for an individual
capital-loss: capital loss for an individual
hours-per-week: the hours an individual has reported to work per week
sex: 1 if male, 0 if female
native-country: 1 if USA, 0 if others
workclass_[#]: 1 if an individual belongs to workclass #, otherwise 0 (e.g. Workclass_Private is 1 if an individual works for a private company)
education_[#]: 1 if an individual's education level is #, otherwise 0 (education level: Graduate > 4-year university > "<4-year university" > High school > "<High school" > Preschool)
marital-status_[#]: 1 if an individual's marital status is #, otherwise 0 (Married-civ-spouse corresponds to a civilian spouse, while Married-AF-spouse is a spouse in the Armed Forces)
occupation_[#]: 1 if an individual's occupation is #, otherwise 0
race_[#]: 1 if an individual's race is #, otherwise 0
The target is 'income' (">50K" or "<=50K").
fnlwgt is the number of people the census believes the entry represents; it is not used in training.
(1) Train a decision tree with max_depth=3, min_samples_split=100, and min_samples_leaf=50, using entropy as the split criterion. Then calculate the overall accuracy, the accuracy of class ">50K", and the accuracy of class "<=50K".
overall accuracy | accuracy of class ">50K" | accuracy of class "<=50K"
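A sketch of the training and scoring step with scikit-learn follows. The synthetic data is a hypothetical stand-in for the census data, and "accuracy of class c" is interpreted here as the fraction of true class-c samples that are predicted correctly (per-class recall); that interpretation is an assumption. The sketch also scores the training data itself, because the assignment does not say which split to evaluate on.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, imbalanced stand-in for the income data (hypothetical features).
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 4))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = np.where(score > 1.0, ">50K", "<=50K")

tree = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=3,
    min_samples_split=100,
    min_samples_leaf=50,
    random_state=0,
)
tree.fit(X, y)
pred = tree.predict(X)

overall = accuracy_score(y, pred)
# average=None returns one recall value per label, in the order given by `labels`.
acc_high, acc_low = recall_score(y, pred, labels=[">50K", "<=50K"], average=None)
print(f"overall={overall:.3f}  >50K={acc_high:.3f}  <=50K={acc_low:.3f}")
```

Reporting the two per-class values next to the overall accuracy makes it visible when a high overall score is driven mainly by the majority class.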
(2) Based on the answer to Question 2-(1), describe the limitations of the trained decision tree model.
(3) Draw the trained tree.
(4) Explain the rule for class ">50K" that contains the most cases.
(5) Explain the rule for class "<=50K" that contains the most cases among those with an accuracy of 0.7 or higher.
(6) Train a new tree by changing the split criterion from entropy to Gini impurity, and compare the two models in terms of performance and the generated rules.
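The rule inspection and the entropy/Gini comparison in the questions above could be approached as in the sketch below. The synthetic data, the feature names `f0` to `f3`, and the helper `leaf_summary` are hypothetical; `export_text` and `apply` are existing scikit-learn calls.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the income data (hypothetical features f0..f3).
rng = np.random.default_rng(0)
n = 2000
feature_names = ["f0", "f1", "f2", "f3"]
X = rng.normal(size=(n, 4))
score = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n)
y = np.where(score > 1.0, ">50K", "<=50K")

params = dict(max_depth=3, min_samples_split=100, min_samples_leaf=50, random_state=0)
models = {c: DecisionTreeClassifier(criterion=c, **params).fit(X, y)
          for c in ("entropy", "gini")}

def leaf_summary(tree, X, y):
    """List (leaf_id, predicted_class, n_samples, leaf_accuracy) for every leaf."""
    leaf_ids = tree.apply(X)                 # index of the leaf each sample falls into
    rows = []
    for leaf in np.unique(leaf_ids):
        mask = leaf_ids == leaf
        predicted = tree.predict(X[mask][:1])[0]   # all samples in a leaf share one prediction
        rows.append((int(leaf), str(predicted), int(mask.sum()),
                     float(np.mean(y[mask] == predicted))))
    return rows

leaves = leaf_summary(models["entropy"], X, y)
rules = export_text(models["entropy"], feature_names=feature_names)
print(rules)

# Fraction of samples on which the two criteria give the same prediction.
agreement = np.mean(models["entropy"].predict(X) == models["gini"].predict(X))
```

Filtering `leaves` by predicted class and leaf accuracy, then sorting by sample count, identifies the leaf whose path in `rules` needs to be explained.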
3. k-means clustering
This problem uses the data generated from 4 normal distributions for applying k-means clustering.
The k-means implementation in scikit-learn accepts initial centroids through the 'init' parameter. When init is set to a c-by-p array (c = the number of clusters, p = the number of features), each row is used as an initial centroid.
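As a minimal sketch of passing explicit centroids, the example below builds a 4-by-2 `init` array from rows of the data. The synthetic data, its centres, and the chosen row indices are assumptions for illustration. `n_init=1` is set explicitly because only one initialization is meaningful when the centroids are given.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: 4 well-separated 2-D normal distributions, 50 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])

# c-by-p array: one row per cluster (c = 4), one column per feature (p = 2).
init = X[[0, 50, 100, 150]]

km = KMeans(n_clusters=4, init=init, n_init=1).fit(X)
print(km.cluster_centers_)
```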
(1) Randomly select 4 samples from the given data set and use them as initial centroids. Repeat this procedure 100 times. Then calculate the average silhouette coefficient and the average adjusted Rand index over the 100 iterations.
silhouette coefficient | adjusted Rand index
(2) Randomly select one sample from each normal distribution and use them as initial centroids. Repeat this procedure 100 times. Then calculate the average silhouette coefficient and the average adjusted Rand index over the 100 iterations. (5pts)
silhouette coefficient | adjusted Rand index
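The two initialization schemes can be run through one shared trial loop, as sketched below. The synthetic data, the helper names, and the seeds are assumptions; the real data set and its distribution labels would replace `X` and `labels_true`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

# Hypothetical data: 4 normal distributions, 50 points each, with known origin labels.
data_rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0], [5.0, 5.0]])
X = np.vstack([data_rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in centers])
labels_true = np.repeat(np.arange(4), 50)

def random_indices(rng):
    """Any 4 distinct samples from the whole data set."""
    return rng.choice(len(X), size=4, replace=False)

def one_per_distribution(rng):
    """Exactly one sample drawn from each generating distribution."""
    return np.array([rng.choice(np.flatnonzero(labels_true == k)) for k in range(4)])

def run_trials(pick_indices, n_trials=100, seed=0):
    """Return per-trial silhouette coefficients and adjusted Rand indices."""
    rng = np.random.default_rng(seed)
    sil, ari = [], []
    for _ in range(n_trials):
        init = X[pick_indices(rng)]
        labels = KMeans(n_clusters=4, init=init, n_init=1).fit_predict(X)
        sil.append(silhouette_score(X, labels))
        ari.append(adjusted_rand_score(labels_true, labels))
    return np.array(sil), np.array(ari)

sil_r, ari_r = run_trials(random_indices)
sil_s, ari_s = run_trials(one_per_distribution)
print("random init:       ", sil_r.mean(), ari_r.mean())
print("one per component: ", sil_s.mean(), ari_s.mean())
```

Keeping the per-trial arrays instead of only the means makes it possible to locate the worst trial afterwards with `np.argmin`; storing each trial's `init` alongside would be needed to plot its initial centroids.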
(3) Draw scatter plots of the given data with the initial and final centroids for the worst cases among the 100 trials in Question 3-(1), in terms of silhouette coefficient and adjusted Rand index, respectively. The initial centroids should be marked with a red 'X' and the final centroids with a blue 'X'.
(4) Draw scatter plots for the worst case of Question 3-(2) in the same way as in Question 3-(3).
(5) Based on the results of the 100 trials for each case, compare the two methods of determining initial centroids.
Attachment:- Programming Assignment.rar