Reference no: EM132528571
Assignment 1 - Fundamentals of Data Mining
Q1. Describe how hierarchical clustering methods work.
Q2. Produce a hierarchical clustering (COBWEB) model for Iris data. How many clusters did it produce? Why? Did you expect that outcome? Describe your reasoning.
Switch the cluster mode to "Classes to clusters evaluation". What can you conclude from it?
Use the acuity and cutoff parameters in order to produce a model that clusters major Iris types together. What values of parameters worked the best? Examine your findings/understanding of the produced results.
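As a point of reference for Q1, the following is a minimal sketch of one common hierarchical approach: bottom-up (agglomerative) clustering with single linkage on one-dimensional points. It is not COBWEB, which builds its tree incrementally using category utility. The function name, the sample points, and the stopping rule (a target number of clusters) are assumptions made for illustration only.

```python
def single_linkage(points, num_clusters):
    """Repeatedly merge the two closest clusters until num_clusters remain."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    merges = []                       # record of (left, right, distance) per merge
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters], merges


# Hypothetical sample points, not taken from any of the assignment data sets.
clusters, merges = single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], num_clusters=2)
print(clusters)
for left, right, dist in merges:
    print(left, "+", right, "at distance", round(dist, 2))
```

The order of the recorded merges is what a dendrogram would display; cutting the merge sequence earlier or later yields more or fewer clusters.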
Q3. Describe how the EM clustering method works.
Q4. Use the EM clustering method on either the basketball or the cloud data set. How many clusters did the algorithm decide to make? Describe the model produced.
How does it compare to the COBWEB results?
If you change from "Use training set" to "Percentage split" (66% train and 34% test), how does the evaluation change? Discuss your findings/understanding of the produced results with respect to the specific dataset.
Q5. Describe the Association learning method. How are the frequent item sets created in an efficient way?
Q6. Use the Association rule learner APRIORI method to find association rules in the Weather.nominal data set. How many rules did it produce? How large are the item sets? What was the largest one? What happened when you increased/decreased the confidence level? What about the number of rules? What happens when you increase the confidence parameter to 2? Why?
Q7. Use the Association rule learner APRIORI method to find association rules in the supermarket data set. What is the size of the largest item set? What was the highest confidence level produced? How many rules with that confidence? Any interesting rules you found?
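For the frequent item set generation that Q5 asks about, here is a minimal sketch of the level-wise Apriori idea: candidates of size k are built only from frequent item sets of size k-1, and any candidate with an infrequent subset is pruned before its support is counted. The function name, the toy baskets, and the use of an absolute support count are assumptions made for illustration; this is not Weka's implementation.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support count} for itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)
    k = 2
    while current:
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Count support only for the surviving candidates.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy baskets, not taken from the supermarket data set.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
result = frequent_itemsets(baskets, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The efficiency comes from the pruning: an item set can only be frequent if all of its subsets are frequent, so most large candidates are never counted against the data.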
Assignment 2 - Fundamentals of Data Mining
1. Use the Decision tree method (Classify Tab, "trees" folder, J48) to analyze the iris data (iris.arff can be found in Weka's Data folder or in Blackboard under Resources):
Give a brief description of the Decision Tree model.
Discuss what you learned about the Iris dataset from the J48 classifier.
How did the Decision tree method perform? (We will cover the evaluation techniques in more detail later in the class. You can choose any of the available options for now. However, please specify which option you chose: training data set, cross-validation, or % split.)
How did the Decision tree method give you insight into your data set/rules/patterns, and why?
2. Data preparation is an essential step in data mining. How the training data set is presented to a method can drastically affect the produced model's performance. Use the J48 Decision tree-learning scheme to analyze the weather.numeric.arff and weather.nominal.arff data sets (they come with the Weka installation in the Weka/data folder). Make predictions for the 'temperature' attribute for both data sets.
Try to use J48 on weather.numeric.arff with no modifications to the dataset. Did you get an error? The method only works with a nominal class attribute, so apply the DiscretizeFilter (Unsupervised-Attribute-Discretize) filter in the Preprocess tab before applying the learning method. Be sure to note how you discretized the dataset and take a moment to consider why you made that choice. Did you discretize all the attributes? How many bins did you discretize each attribute into?
Analyze the output of the model that learned the discretized attribute 'temperature'. What was the performance, and can you improve it? What did the model tell you about the data? (Hint: you can modify the number of bins in the discretize filter in an attempt to improve the model performance or mimic the nominal dataset.)
Analyze the output of the model that learned the nominal attribute 'temperature'. What was the performance, and can you improve it? What did the model tell you about the data? How do the results differ from the model produced on the discretized version of the same attribute?
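To make the discretization step in question 2 concrete, the sketch below shows equal-width binning, one common unsupervised scheme: the range between the minimum and maximum value is split into a fixed number of intervals of the same width. The function name and the temperature readings are hypothetical and are not taken from weather.numeric.arff; the filter in Weka may label or bound its bins differently.

```python
def equal_width_bins(values, num_bins):
    """Map each numeric value to a bin index in 0..num_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so everything falls into the first bin.
        return [0 for _ in values]
    labels = []
    for v in values:
        idx = int((v - lo) / width)
        # The maximum value would land one past the last bin, so clamp it.
        labels.append(min(idx, num_bins - 1))
    return labels


# Hypothetical temperature readings for illustration only.
temperatures = [64, 68, 70, 72, 75, 80, 85]
labels = equal_width_bins(temperatures, num_bins=3)
print(list(zip(temperatures, labels)))
```

Changing num_bins changes how many distinct class values the tree has to predict, which is the lever the hint in question 2 refers to.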
3. Use the J48 Decision tree learning scheme to analyze the bolts data (bolts.arff without the TIME attribute). The dataset describes the time needed by a machine to produce and count 20 bolts. (More details can be found in the file containing the dataset; you can open the file in a text editor to read the comments.)
Why should you ignore the TIME attribute?
Analyze the model produced. What adjustments (if you were to make any) would have the greatest effect on the time to count 20 bolts (attribute: T20Bolt) (i.e. what is the most important/selective attribute/value pair in the tree)?
According to the classifier, how would you adjust the machine (the other attributes) to get the shortest time to count 20 bolts?
Need Assignment 1 and ONLY question number 2 in Assignment 2.
Attachment:- Fundamentals of Data Mining Assignment Files.rar