Reference no: EM133122018
Data Mining for Business Analytics and Cyber Security
Assignment
Introduction
This assignment allows you to demonstrate your knowledge and understanding of the capabilities of data mining. You will use the RapidMiner tool to show the technical competence you gained from lab work. It is also an opportunity to show the knowledge you have gained from lectures and your readings, and to relate theory to practice. The purpose of this assignment is to give you an appreciation of the benefits that various data mining techniques bring to a data domain.
• Use the data files provided, which include three datasets: one for prediction using a decision tree, one for clustering using k-means, and one for association rule analysis.
• The assignment should be in report form, answering each question of the case study.
• Apply at least three different models (for example: decision tree induction, clustering analysis, association analysis).
• Answer each question in the case study for each model appropriately and succinctly.
• While you may like to go into extreme detail about a step in generating association rules, you will not have the space to do so. Rather, write down the important points and attach the important screenshots (export your process model and save the result as an image file) to show that you have thought the matter through.
Assignment Task
ASSIGNMENT 1 DESCRIPTION AND TASK LIST
Assignment requirements:
Association rule mining or shopping basket analysis
The dataset used for this task is Groceries.csv (attached to your email). It contains 10,000 receipts. Each receipt represents a transaction with the items that were purchased. In the dataset, each line is a transaction and each column in a row holds one item.
Step 1 - Import the Groceries.csv data set into your RapidMiner data repository. Save it with the name Groceries.
Step 2 - Drag your Groceries data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as MBA_Process.
Question 1: Conduct Data Understanding and Data Preparation activities on the data set. Do all of your variables have consistent data, and are their data types appropriate for the FP-Growth operator? If yes, explain why you think your data types are appropriate. If no, what data preparation activities need to be done before moving to the next step?
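As a rough illustration of the data preparation this question points at, the sketch below turns one-receipt-per-line data into a table with one True/False column per item, which is the kind of layout frequent-itemset mining typically expects. The sample rows and the function names are hypothetical and are not taken from Groceries.csv; in RapidMiner the equivalent work is done with operators rather than code.

```python
import csv
import io

# Hypothetical sample in the same ragged layout as the assignment describes:
# one receipt per line, one purchased item per column.
SAMPLE = """bread,milk,butter
milk,yogurt
bread
milk,bread,coffee
"""


def read_transactions(text):
    """Parse ragged CSV rows into a list of item sets, skipping blank lines."""
    reader = csv.reader(io.StringIO(text))
    return [{item.strip() for item in row if item.strip()} for row in reader if row]


def to_binary_table(transactions):
    """Return (sorted item names, one row of True/False flags per transaction)."""
    items = sorted(set().union(*transactions))
    rows = [[item in t for item in items] for t in transactions]
    return items, rows


transactions = read_transactions(SAMPLE)
items, table = to_binary_table(transactions)

print(items)
for row in table:
    print(row)
```

Each row of the result has the same width regardless of how many items the receipt contained, which is the consistency the question asks you to check for.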
Step 3 - Generate association rules for your data set. Use the default confidence and support values first, then modify them in order to identify the most suitable levels.
Question 2: From step 3, what rules did you find? What attributes are most strongly associated with one another? Are there products that are frequently connected that surprise you? Why do you think this might be?
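To make the roles of support and confidence in Step 3 concrete, here is a minimal sketch that computes both for single-item rules on a tiny hypothetical basket list and keeps only the rules that pass both thresholds. It is a brute-force enumeration for illustration, not the FP-Growth algorithm, and the function names and threshold values are assumptions.

```python
from itertools import combinations

# Hypothetical transactions, not taken from the assignment data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "coffee"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions)


def support(itemset, transactions):
    """Fraction of transactions that contain the whole itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(premise, conclusion, transactions):
    """Of the transactions containing the premise, the share that also contain the conclusion."""
    both = set(premise) | set(conclusion)
    return count(both, transactions) / count(premise, transactions)


def find_rules(transactions, min_support, min_confidence):
    """Enumerate single-item -> single-item rules that pass both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        for premise, conclusion in ((a, b), (b, a)):
            s = support({premise, conclusion}, transactions)
            if s < min_support:
                continue
            c = confidence({premise}, {conclusion}, transactions)
            if c >= min_confidence:
                found.append((premise, conclusion, s, c))
    return found


found = find_rules(transactions, min_support=0.4, min_confidence=0.7)
for premise, conclusion, s, c in found:
    print(f"{premise} -> {conclusion}: support={s:.2f}, confidence={c:.2f}")
```

Raising either threshold shrinks the rule list and lowering it grows the list, which is the trade-off you explore when you adjust the values in Step 3.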
Step 4 - Look at the support and confidence and the other measures of rule strength such as LaPlace or Conviction.
Question 3: How many different support and confidence values did you have to test before you found some association rules? Were any of your association rules good enough that you would base decisions on them? Why or why not?
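The additional measures mentioned in Step 4 can be sketched in a few lines. The code below uses commonly cited textbook definitions of lift, Laplace and conviction on hypothetical transactions. Laplace in particular has several variants, so the formula and the default laplace_k=2 here are assumptions, and RapidMiner's own calculation may differ.

```python
import math

# Hypothetical transactions, not taken from the assignment data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "coffee"},
]


def rule_measures(premise, conclusion, transactions, laplace_k=2):
    """Strength measures for the rule premise -> conclusion.

    laplace_k is an assumed smoothing constant; tools differ on its value.
    """
    premise, conclusion = set(premise), set(conclusion)
    n = len(transactions)
    n_premise = sum(premise <= t for t in transactions)
    n_conclusion = sum(conclusion <= t for t in transactions)
    n_both = sum((premise | conclusion) <= t for t in transactions)
    if n_premise == 0:
        raise ValueError("premise never occurs, so the rule is undefined")

    conf = n_both / n_premise
    conclusion_support = n_conclusion / n
    return {
        "support": n_both / n,
        "confidence": conf,
        # Lift above 1 means the premise makes the conclusion more likely.
        "lift": conf / conclusion_support,
        # Smoothed confidence: penalises rules backed by very few transactions.
        "laplace": (n_both + 1) / (n_premise + laplace_k),
        # Conviction is infinite for rules that are never violated.
        "conviction": math.inf if conf == 1 else (1 - conclusion_support) / (1 - conf),
    }


bread_milk = rule_measures({"bread"}, {"milk"}, transactions)
butter_bread = rule_measures({"butter"}, {"bread"}, transactions)
print(bread_milk)
print(butter_bread)
```

In this toy data the rule bread -> milk has a fairly high confidence but a lift below 1, a reminder that confidence alone can make a rule look stronger than it is when the conclusion is already very common.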
Clustering analysis (use k-means model)
A student evaluation survey dataset (turkiye-student-evaluation.csv) will be used for clustering analysis. The goal is to group the students based on the similarity of their answers to the survey. This data set contains a total of 5,820 evaluation scores provided by students from Gazi University in Ankara (Turkey). There are 28 course-specific questions and 5 additional attributes.
Attribute Information:
instr: Instructor's identifier; values taken from {1,2,3}
class: Course code (descriptor); values taken from {1-13}
repeat: Number of times the student is taking this course; values taken from {0,1,2,3,...}
attendance: Code of the level of attendance; values from {0, 1, 2, 3, 4}
difficulty: Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}
Q1: The semester course content, teaching method and evaluation system were provided at the start.
Q2: The course aims and objectives were clearly stated at the beginning of the period.
Q3: The course was worth the amount of credit assigned to it.
Q4: The course was taught according to the syllabus announced on the first day of class.
Q5: The class discussions, homework assignments, applications and studies were satisfactory.
Q6: The textbook and other courses resources were sufficient and up to date.
Q7: The course allowed field work, applications, laboratory, discussion and other studies.
Q8: The quizzes, assignments, projects and exams contributed to helping the learning.
Q9: I greatly enjoyed the class and was eager to actively participate during the lectures.
Q10: My initial expectations about the course were met at the end of the period or year.
Q11: The course was relevant and beneficial to my professional development.
Q12: The course helped me look at life and the world with a new perspective.
Q13: The Instructor's knowledge was relevant and up to date.
Q14: The Instructor came prepared for classes.
Q15: The Instructor taught in accordance with the announced lesson plan.
Q16: The Instructor was committed to the course and was understandable.
Q17: The Instructor arrived on time for classes.
Q18: The Instructor has a smooth and easy to follow delivery/speech.
Q19: The Instructor made effective use of class hours.
Q20: The Instructor explained the course and was eager to be helpful to students.
Q21: The Instructor demonstrated a positive approach to students.
Q22: The Instructor was open and respectful of the views of students about the course.
Q23: The Instructor encouraged participation in the course.
Q24: The Instructor gave relevant homework assignments/projects, and helped/guided students.
Q25: The Instructor responded to questions about the course inside and outside of the course.
Q26: The Instructor's evaluation system (midterm and final questions, projects, assignments, etc.) effectively measured the course objectives.
Q27: The Instructor provided solutions to exams and discussed them with students.
Q28: The Instructor treated all students in a right and objective manner.
Q1-Q28 are all Likert-type, meaning that the values are taken from {1,2,3,4,5}
Step 1 - Import the turkiye-student-evaluation.csv data set into your RapidMiner data repository. Save it with the name StudentEvaluation.
Step 2 - Drag your StudentEvaluation data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as Clustering_Process.
Question 1: Conduct any data preparation that you need for the data set.
Are there any inconsistent data or missing values? Do you need to change data types for some attributes? Why or why not? Do you need to remove some attributes? Why or why not?
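One way to approach the consistency check is to compare every value against the ranges listed under Attribute Information. The sketch below does this for a reduced, hypothetical set of columns and three made-up rows; the column names follow the attribute list above, but the headers in the real file may be spelled differently.

```python
import csv
import io

# Allowed values per column, copied from the attribute list above
# (only a subset of the columns is shown to keep the sketch short).
ALLOWED = {
    "instr": range(1, 4),
    "class": range(1, 14),
    "attendance": range(0, 5),
    "difficulty": range(1, 6),
    "Q1": range(1, 6),
    "Q2": range(1, 6),
}

# Hypothetical rows; the last one has a missing and an out-of-range value.
SAMPLE = """instr,class,repeat,attendance,difficulty,Q1,Q2
1,2,1,0,4,3,3
3,13,1,4,5,5,5
2,7,2,,3,4,6
"""


def audit(text, allowed):
    """Return (line number, column, problem) for every missing or invalid cell."""
    problems = []
    # Data starts on line 2 because line 1 is the header.
    for line_no, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        for column, valid in allowed.items():
            raw = (row.get(column) or "").strip()
            if raw == "":
                problems.append((line_no, column, "missing"))
                continue
            try:
                value = int(raw)
            except ValueError:
                problems.append((line_no, column, f"not an integer: {raw!r}"))
                continue
            if value not in valid:
                problems.append((line_no, column, f"out of range: {value}"))
    return problems


problems = audit(SAMPLE, ALLOWED)
for problem in problems:
    print(problem)
```

An empty result would suggest the checked columns are complete and within range; it says nothing about whether an attribute such as the instructor identifier is meaningful as a clustering input, which is the separate judgement the question asks for.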
Step 3 - Connect a k-Means operator to your data set and configure its parameters. You may need other operators as well, e.g., Select Attributes. Since the number of clusters (groups) of students is not known in advance, try different k values (e.g., k = 2, 3, 4 or 5) to decide the best "natural" number of groups for this dataset. Then run your model.
Question 2: Investigate your Centroid Table, Folder View, and the other evaluation tools. Report your findings for your clusters. Which k value gives the best "natural" number of groups?
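The idea of trying several k values can be sketched with a small, self-contained k-means (Lloyd's algorithm) that reports the within-cluster sum of squares (WCSS) for each k. The toy points, the function names and the deterministic farthest-point seeding are assumptions made for reproducibility; RapidMiner's k-Means operator uses its own initialisation and settings.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    return tuple(sum(column) / len(points) for column in zip(*points))


def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly add
    the point that is farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def k_means(points, k, max_iterations=100):
    """Return (labels, centroids, WCSS) after running Lloyd's algorithm."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the centroids are final
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = mean(members)
    wcss = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, wcss


# Hypothetical 2-D points forming two visibly separate groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]

results = {k: k_means(points, k) for k in (1, 2, 3)}
for k, (labels, centroids, wcss) in results.items():
    print(f"k={k}: WCSS={wcss:.3f}, labels={labels}")
```

WCSS generally keeps falling as k grows, so the smallest value is not the answer by itself; a common heuristic is to look for the k after which the improvement becomes small. In this toy data the drop from k = 1 to k = 2 is large and the drop from k = 2 to k = 3 is minor.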
Question 3: Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast. How are they different from your original model? Did the use of these operators change your clusters, and if so, how?
Predictive model using decision trees
The dataset used for the decision tree model is bill_authentication.csv. The data was extracted from images taken from genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.
The goal of this task is to predict whether a bank note is authentic or fake depending upon the four different attributes of the image of the note. The attributes are variance of wavelet transformed image, curtosis of the image, entropy, and skewness of the image.
Attribute Information:
1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)
Step 1 - Import the bill_authentication.csv data set into your RapidMiner data repository. Save it with the name BillAuthentication.
Step 2 - Drag your BillAuthentication data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as DecisionTree_Process.
Step 3 - Split the dataset into training and testing sets using the Split Data operator. The partition parameter specifies the ratio of each subset: use it to put 80% of the data into the training set and 20% into the test set. Sampling type: stratified sampling.
Question 1: Are there any inconsistent data or missing values? Do you need to change data types for some attributes? Why or why not? Do you need to remove some attributes? Why or why not?
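The stratified 80/20 split requested in Step 3 keeps the class proportions roughly equal in both subsets. A minimal sketch of that idea follows, using hypothetical rows of the form (id, class); the function name, the seed and the rounding rule are assumptions and need not match what the Split Data operator does internally.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_ratio=0.2, seed=42):
    """Split rows into (train, test) so each class contributes about
    test_ratio of its rows to the test set."""
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical data: 10 rows of class 0 and 5 rows of class 1.
rows = [(i, 0) for i in range(10)] + [(i, 1) for i in range(10, 15)]

train, test = stratified_split(rows, label_of=lambda row: row[1])
print(len(train), len(test))
print(sum(1 for r in train if r[1] == 1), sum(1 for r in test if r[1] == 1))
```

Here one third of the rows belong to class 1 in the full data, in the training set and in the test set, which a purely random split would not guarantee on a sample this small.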
Step 4 - Build a Decision Tree model for predicting whether a bank note is authentic or fake. Run your model with the default parameter values and gain_ratio for the decision tree.
Question 2: Examine the tree results. How many nodes and leaves are in the tree? View the tree. Which variable was used for the first split?
Step 5 - Re-run the model with different parameter values, e.g., using gini_index or information_gain and changing leaf and split sizes.
Question 3: Report the differences in the tree's structure (e.g., nodes, leaves, first node, etc.). Analyse and report your results.
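Trees built with gain_ratio, information_gain and gini_index can differ because each criterion scores a candidate split differently. The sketch below computes the three scores from their usual textbook definitions for two hypothetical splits of class labels; the function names are assumptions, and RapidMiner's implementation details (for example how it chooses thresholds for continuous attributes) are not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())


def score_split(parent, children):
    """Score a split of the parent labels into the given child label lists."""
    n = len(parent)
    weights = [len(child) / n for child in children]
    info_gain = entropy(parent) - sum(
        w * entropy(child) for w, child in zip(weights, children)
    )
    gini_decrease = gini(parent) - sum(
        w * gini(child) for w, child in zip(weights, children)
    )
    # Split information penalises splits into many small branches.
    split_info = -sum(w * log2(w) for w in weights if w > 0)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    return {
        "information_gain": info_gain,
        "gain_ratio": gain_ratio,
        "gini_decrease": gini_decrease,
    }


# A split that separates the classes perfectly.
pure = score_split([0, 0, 1, 1], [[0, 0], [1, 1]])
# A split that leaves one child mixed.
uneven = score_split([0, 0, 0, 1, 1, 1], [[0, 0, 0, 1], [1, 1]])

print(pure)
print(uneven)
```

Because the criteria weigh branch sizes and impurity differently, two candidate splits can be ranked in a different order depending on the criterion, which is one reason the first node or the number of leaves may change between runs in Step 5.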
Attachment:- Data Mining for Business Analytics.rar