Unsupervised Learning - kMeans and Fuzzy Clustering
INSTRUCTIONS:
You will need to write several distinct programs beyond k-Means to accomplish the tasks set out below. Provide pseudocode for each in the appropriate sections of your report.
Part I. Visualizing (a slice of) a WCSSE 5-D Error-Surface
Use the 17 data-instances from the demo kmeans Excel file seen in class (HW3_S20_kmeans.xls) for this first exercise (i.e. cells B3:C19 in kmeans tab). The first datapoint should be: (0.62, 0.98). If you have overwritten the original data in the course of your explorations, re-download the file to use the same dataset (and get the same results) as everyone else!
Generate a 4D array of size 11 x 11 x 11 x 11. Note: indices represent locations of the two cluster-heads!
Each of the four integer indices should range in value from 0 to 10, mapping to the 0 - 1 range (i.e. in steps of size 0.10) of the k-means graph-axes in the Excel demo. The following generic code-example illustrates saving a WCSSE value of 9.83 for a 2-cluster solution consisting of cluster-heads (0.1, 0.2) and (0.8, 0.5) to such an array:
surfArray[1][2][8][5] = 9.83
Write a function (possibly named: getWCSSE(data[], cheads[])) that computes the Within-Cluster Sum of Squared Errors (WCSSE) for a given pair of cluster-heads, and use it to fill in the WCSSE value for every one of the 11^4 possible 2-cluster solutions afforded by this array.
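For orientation, here is a minimal Python sketch of one way this could be structured. The CSV filename, the helper name get_wcsse, and the use of NumPy are assumptions for illustration only; any equivalent structure is fine.

import numpy as np

def get_wcsse(data, cheads):
    """Within-Cluster Sum of Squared Errors (WCSSE) for a fixed set of cluster-heads.

    Each datapoint contributes the squared Euclidean distance to its
    nearest cluster-head; the function returns the sum over all datapoints.
    """
    total = 0.0
    for point in data:
        total += min(sum((p - c) ** 2 for p, c in zip(point, head)) for head in cheads)
    return total

# The 17 two-dimensional instances from cells B3:C19 of the Excel demo,
# assumed here to have been exported to a CSV file (the filename is hypothetical).
data = np.loadtxt("kmeans_demo_points.csv", delimiter=",")   # data[0] should be (0.62, 0.98)

# Fill the 11 x 11 x 11 x 11 error surface; index i maps to coordinate i / 10.
surfArray = np.zeros((11, 11, 11, 11))
for x1 in range(11):
    for y1 in range(11):
        for x2 in range(11):
            for y2 in range(11):
                heads = [(x1 / 10, y1 / 10), (x2 / 10, y2 / 10)]
                surfArray[x1][y1][x2][y2] = get_wcsse(data, heads)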
Question 1. Display an Excel 3D Surface chart of the error-surface function w.r.t. indices (x1, y1), keeping the other centroid constant at (0,0)
• That is, for all values surfArray[x1][y1][0][0]
Question 2. Display an Excel 3D Surface chart of the error-surface function w.r.t. indices (x2, y2), keeping the other centroid constant at (10, 10)
• That is, for all values surfArray[10][10][x2][y2]
Explain the meaning of these two surface-graphs. How are they different?
Are they symmetric with respect to each other? Why or why not?
Do either of them hint at a possible solution near (0.2, 0.6) or (0.8, 0.56)?
Question 3. Display an Excel 3D Surface chart of the error-surface function w.r.t. indices (x1, y1) keeping the other centroid constant at (8, 6)
• That is, for all values surfArray[x1][y1][8][6]
Question 4. Display an Excel 3D Surface chart of the error-surface function w.r.t. indices (x2, y2) keeping the other centroid constant at (2, 6)
• That is, for all values surfArray[2][6][x2][y2]
Explain the meaning of these surface-graphs.
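If you compute the array outside of Excel, one convenient route to these charts is to dump a slice as an 11 x 11 grid that Excel can read and plot as a 3D Surface chart. The sketch below writes the surfArray[x1][y1][0][0] slice from Question 1; the output filename is hypothetical, and surfArray refers to the array filled in the earlier sketch.

import csv

# Write the 11 x 11 slice surfArray[x1][y1][0][0] as a grid for an Excel 3D Surface
# chart: the first row/column hold the centroid coordinates, the body holds WCSSE values.
with open("surface_q1.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow([""] + [x1 / 10 for x1 in range(11)])   # header row: x1 coordinates
    for y1 in range(11):
        writer.writerow([y1 / 10] + [surfArray[x1][y1][0][0] for x1 in range(11)])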
To give you a sense of what to expect, here are screenshots of what two of these WCSSE surfaces should look like, in case that is helpful:
[Screenshot: WCSSE surface slice for (0, 0, x2, y2)]
[Screenshot: WCSSE surface slice for (10, 10, x2, y2)]
Part II. Clustering an Anonymous Dataset
Find the HW3_data.csv file in the class INFO folder. It contains 120 4-D instances to be clustered by your k-Means implementation.
Question 1. Implement the k-Means Algorithm as demonstrated by the in-class Excel demo.
Possible interface: kmeans(data[], cheads[])
Don't forget the heuristic for dealing with empty cluster sets. Your algorithm should output a clustering solution for the input data provided, as well as the final coordinates of the K centroids, and the WCSSE value of the solution.
-- provide pseudocode for your k-Means algorithm in your report
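For reference, a hedged Python sketch of one possible shape for such an implementation follows. The empty-cluster heuristic shown (re-seeding an empty centroid on a random datapoint) is only an assumption; substitute whichever heuristic was demonstrated in class.

import random

def kmeans(data, cheads, max_iters=100):
    """Lloyd-style k-Means: returns (assignments, final centroids, WCSSE)."""
    centroids = [list(c) for c in cheads]
    k = len(centroids)
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: sum((p - c) ** 2
                                            for p, c in zip(point, centroids[j])))
            for point in data
        ]
        if new_assignments == assignments:
            break                                  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [data[i] for i, a in enumerate(assignments) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
            else:
                # Empty-cluster heuristic (assumption): re-seed on a random datapoint.
                centroids[j] = list(random.choice(data))
    wcsse = sum(sum((p - c) ** 2 for p, c in zip(data[i], centroids[a]))
                for i, a in enumerate(assignments))
    return assignments, centroids, wcsse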
Question 2. Determine the most natural value of K (i.e. BestK) to be used for this dataset. Use the average WCSSE of at least 10 trials for each value of K from 1 to 10.
-- show a graph of your trial results (average WCSSE vs. K) to support your choice of BestK
-- provide a solid argument for why your chosen BestK value seems appropriate for this dataset
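One way to organize these trials is sketched below. It reuses the kmeans() sketch above; the uniform random initializer and the helper names are assumptions.

import random

def random_centroids(data, k):
    """Initialize k centroids uniformly at random within each attribute's observed range."""
    columns = list(zip(*data))
    return [[random.uniform(min(col), max(col)) for col in columns] for _ in range(k)]

def average_wcsse_by_k(data, k_values=range(1, 11), trials=10):
    """Average the final WCSSE of several random restarts for each candidate K."""
    averages = {}
    for k in k_values:
        finals = [kmeans(data, random_centroids(data, k))[2] for _ in range(trials)]
        averages[k] = sum(finals) / len(finals)
    return averages   # plot K vs. average WCSSE and look for the 'elbow'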
Question 3. Generate clustering solutions for at least 10 randomly-initialized sets of centroids using your BestK value determined previously.
-- display the WCSSE values in a table, highlighting the best one
-- display a 2-D scatterplot of your best k-means clustering solution (you might use attributes attC and attD for the axes, though you may choose whichever two look best to you), including final centroid positions. Color-code the datapoints according to their cluster.
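For the scatterplot, a matplotlib sketch along the following lines may help; the column positions of attC and attD and the variable names (data, assignments, centroids from your best kmeans() run) are assumptions.

import matplotlib.pyplot as plt

# Assumes columns 2 and 3 of each instance hold attC and attD, and that
# assignments / centroids come from the best kmeans() run above.
plt.scatter([row[2] for row in data], [row[3] for row in data],
            c=assignments, cmap="viridis", s=25)                    # points colored by cluster
plt.scatter([c[2] for c in centroids], [c[3] for c in centroids],
            c="red", marker="x", s=120, label="final centroids")
plt.xlabel("attC")
plt.ylabel("attD")
plt.legend()
plt.show()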
Question 4. Find the HW3_train.csv file, which has the actual classifications of the datapoints (i.e. the Variable class column). If you did not find K=3 to be your BestK previously, generate your best K=3 clustering solution now. Using only your final set of three centroid coordinates, conduct a nearest-neighbor classification of the training set (ignoring the Variable column for this process).
-- assuming you group instances into clusters named Cluster_1, Cluster_2, etc., determine which of my classes (A, B, etc.) each cluster corresponds to by the class the majority of its members fall into. For example, if Cluster_x has 50 members, 30 of which are actually class A, 18 are B, and 2 are C -- then your Cluster_x has gravitated toward the concept of my class A. Show me the translations of your clusters to my classes (Note: this mapping will differ from one student's results to another's)
-- determine the accuracy-rate of your clustering solution against the actual classifications of the training set.
Find the HW3_test.csv file. Classify the 30 instances therein according to their nearest centroid in your K=3 solution.
-- determine the accuracy-rate of your clustering solution against the actual classifications of the testing set
-- what do your accuracy results suggest about the effectiveness of the k-Means algorithm for this particular classification task? Do you think it is possible for k-Means to correctly classify these instances at the 100% level? 90% level? 80%? Explain.
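A possible sketch of the nearest-centroid classification, the cluster-to-class translation, and the accuracy check is shown below; the majority-vote mapping via collections.Counter and the helper names are assumptions about how you might organize it.

from collections import Counter

def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])))

def cluster_to_class_map(points, labels, centroids):
    """Translate each cluster index to the majority class among its members."""
    votes = {j: Counter() for j in range(len(centroids))}
    for point, label in zip(points, labels):
        votes[nearest_centroid(point, centroids)][label] += 1
    return {j: counter.most_common(1)[0][0] for j, counter in votes.items() if counter}

def accuracy(points, labels, centroids, mapping):
    """Fraction of instances whose cluster's class matches the actual class."""
    hits = sum(1 for point, label in zip(points, labels)
               if mapping.get(nearest_centroid(point, centroids)) == label)
    return hits / len(points)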
Question 5. Implement the Fuzzy Clustering (FC) algorithm as described in class and as illustrated within the kMeans Excel file example. Repeat Question 4 above, but now using the FC algorithm. Use the BestK determined from k-Means above as the number of clusters for this solution (Note: you should use the "more-complicated" of the two Expectation formulas from the class PowerPoint slide)
-- state the pseudocode of your implementation of this algorithm. What is your stopping condition? Why do you think this is reasonable?
-- when run multiple times, is there more or less variety in the solutions you generate (show me in a table)? Why is this so?
-- are your training/testing accuracy-rates better? Why do you suppose this is (or is not) so?
Graph and color-code your best FC solution for the training set.
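Because the exact Expectation formula comes from the class slides, the sketch below uses the standard fuzzy c-means membership update with fuzzifier m = 2 as a stand-in, stopping when no centroid moves by more than a small tolerance; adapt the formula and stopping condition to what was presented in class.

def fuzzy_cmeans(data, centroids, m=2.0, max_iters=100, tol=1e-4):
    """Standard fuzzy c-means (a stand-in for the FC algorithm from class).

    Returns (membership matrix, final centroids); stops once the largest
    centroid movement between iterations falls below `tol`.
    """
    k, n, dim = len(centroids), len(data), len(data[0])
    centroids = [list(c) for c in centroids]
    memberships = []
    for _ in range(max_iters):
        # Expectation: degree of membership of each point in each cluster.
        memberships = []
        for point in data:
            dists = [max(sum((p - c) ** 2 for p, c in zip(point, cen)) ** 0.5, 1e-12)
                     for cen in centroids]
            memberships.append([1.0 / sum((dists[j] / dists[l]) ** (2.0 / (m - 1.0))
                                          for l in range(k))
                                for j in range(k)])
        # Maximization: centroids become membership^m-weighted means of the data.
        new_centroids = []
        for j in range(k):
            weights = [memberships[i][j] ** m for i in range(n)]
            total = sum(weights)
            new_centroids.append([sum(weights[i] * data[i][d] for i in range(n)) / total
                                  for d in range(dim)])
        shift = max(sum((a - b) ** 2 for a, b in zip(old, new)) ** 0.5
                    for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return memberships, centroids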
Attachment: Data Mining.rar