Conduct any data preparation that you need for the data set

Reference no: EM132623062

CIS5206 Data Mining for Business Analytics and Cyber Security - University of Southern Queensland

Case Study - assignment specification

Assignment requirements:

The dataset used for this task is Groceries.csv, which contains 10,000 receipts. Each receipt is a transaction listing the items that were purchased. In the dataset, each row is one transaction and each column in that row holds one purchased item.

Step 1 - Import the Groceries.csv data set into your RapidMiner data repository. Save it with the name Groceries.

Step 2 - Drag your Groceries data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as MBA_Process.

Question 1: Conduct Data Understanding and Data Preparation activities on the data set. Do all of your variables have consistent data, and are their data types appropriate for the FP-Growth operator? If yes, please explain why you think your data types are appropriate. If no, what data preparation activities need to be done before moving to the next step?
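
As an illustration of the preparation step, the sketch below turns basket-style rows (one receipt per line, one item per cell) into a true/false table with one column per item, which is the shape frequent-itemset operators such as FP-Growth are commonly fed. The sample rows are hypothetical stand-ins, not the contents of Groceries.csv, and the function names are made up for this example; check the FP-Growth operator's help in your RapidMiner version for the input format it actually expects.

```python
import csv
import io

# Hypothetical sample standing in for a few rows of Groceries.csv;
# the real file's items and layout may differ.
SAMPLE = """milk,bread,butter
bread,,
milk,bread,
"""


def read_transactions(handle):
    """Return one set of item names per receipt, skipping empty cells."""
    transactions = []
    for row in csv.reader(handle):
        items = {cell.strip() for cell in row if cell.strip()}
        if items:
            transactions.append(items)
    return transactions


def to_binominal(transactions):
    """Build a true/false table: one row per receipt, one column per item."""
    columns = sorted(set().union(*transactions))
    rows = [[item in t for item in columns] for t in transactions]
    return columns, rows


transactions = read_transactions(io.StringIO(SAMPLE))
columns, rows = to_binominal(transactions)
print(columns)
for row in rows:
    print(row)
```

Empty cells are dropped rather than treated as an item, since receipts of different lengths leave trailing blanks in a column-per-position file.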

Step 3 - Generate association rules for your data set. Use the default confidence and support first, then modify your confidence and support values in order to identify their most ideal levels.

Question 2: From step 3, what rules did you find? What attributes are most strongly associated with one another? Are there products that are frequently connected that surprise you? Why do you think this might be?
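
To make the two thresholds concrete, here is a minimal sketch of how support and confidence are usually defined for a rule "premise -> conclusion". The four receipts are invented for illustration and the function names are assumptions of this example, not RapidMiner identifiers.

```python
# Hypothetical receipts, not taken from Groceries.csv.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of receipts that contain every item in the itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(premise, conclusion, transactions):
    """Of the receipts containing the premise, the share that also contain the conclusion."""
    both = support(set(premise) | set(conclusion), transactions)
    return both / support(premise, transactions)


print("support(bread, butter):", support({"bread", "butter"}, transactions))
print("confidence(bread -> butter):", confidence({"bread"}, {"butter"}, transactions))
print("confidence(butter -> bread):", confidence({"butter"}, {"bread"}, transactions))
```

Raising the minimum support removes rules about rarely bought items; raising the minimum confidence removes rules whose conclusion often fails to follow from the premise.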

Step 4 - Look at the support and confidence and the other measures of rule strength such as LaPlace or Conviction.

Question 3: How much did you have to test different support and confidence values before you found some association rules? Were any of your association rules good enough that you would base decisions on them? Why or why not?
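
The additional measures can be sketched from their common textbook definitions: conviction as (1 - support(conclusion)) / (1 - confidence), and a Laplace-corrected confidence as (count(premise and conclusion) + 1) / (count(premise) + k). The value k = 2 and the sample receipts are assumptions of this example; RapidMiner's own formulas and parameters may differ, so treat this as an aid to interpretation rather than a reproduction of the tool's output.

```python
import math

# Hypothetical receipts, not taken from Groceries.csv.
transactions = [
    {"milk", "bread", "butter"},
    {"bread", "butter"},
    {"milk", "bread"},
    {"milk"},
]


def count(itemset, transactions):
    """Number of receipts that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def conviction(premise, conclusion, transactions):
    """(1 - support(conclusion)) / (1 - confidence); infinite when the rule never fails."""
    both = count(set(premise) | set(conclusion), transactions)
    conf = both / count(premise, transactions)
    supp_conclusion = count(conclusion, transactions) / len(transactions)
    if conf == 1.0:
        return math.inf
    return (1 - supp_conclusion) / (1 - conf)


def laplace(premise, conclusion, transactions, k=2):
    """Laplace-corrected confidence; k = 2 is an assumed default for this sketch."""
    both = count(set(premise) | set(conclusion), transactions)
    return (both + 1) / (count(premise, transactions) + k)


print("conviction(bread -> butter):", conviction({"bread"}, {"butter"}, transactions))
print("conviction(butter -> bread):", conviction({"butter"}, {"bread"}, transactions))
print("laplace(bread -> butter):", laplace({"bread"}, {"butter"}, transactions))
```

A conviction near 1 suggests the premise and conclusion are close to independent, while larger values suggest the rule fails less often than chance would predict.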

Clustering analysis (use k-means model)

A student evaluation survey dataset (i.e., turkiye-student-evaluation.csv) will be used for clustering analysis. The goal is to group the students based on the similarity of their answers to the survey. This data set contains a total of 5,820 evaluation scores provided by students from Gazi University in Ankara (Turkey). There are 28 course-specific questions and 5 additional attributes.

Attribute Information:
instr: Instructor's identifier; values taken from {1,2,3}
class: Course code (descriptor); values taken from {1-13}
repeat: Number of times the student is taking this course; values taken from {0,1,2,3,...}
attendance: Code of the level of attendance; values from {0, 1, 2, 3, 4}
difficulty: Level of difficulty of the course as perceived by the student; values taken from {1,2,3,4,5}

Q1: The semester course content, teaching method and evaluation system were provided at the start.
Q2: The course aims and objectives were clearly stated at the beginning of the period.
Q3: The course was worth the amount of credit assigned to it.
Q4: The course was taught according to the syllabus announced on the first day of class.
Q5: The class discussions, homework assignments, applications and studies were satisfactory.
Q6: The textbook and other courses resources were sufficient and up to date.
Q7: The course allowed field work, applications, laboratory, discussion and other studies.
Q8: The quizzes, assignments, projects and exams contributed to helping the learning.
Q9: I greatly enjoyed the class and was eager to actively participate during the lectures.
Q10: My initial expectations about the course were met at the end of the period or year.
Q11: The course was relevant and beneficial to my professional development.
Q12: The course helped me look at life and the world with a new perspective.
Q13: The Instructor's knowledge was relevant and up to date.
Q14: The Instructor came prepared for classes.
Q15: The Instructor taught in accordance with the announced lesson plan.
Q16: The Instructor was committed to the course and was understandable.
Q17: The Instructor arrived on time for classes.
Q18: The Instructor has a smooth and easy to follow delivery/speech.
Q19: The Instructor made effective use of class hours.
Q20: The Instructor explained the course and was eager to be helpful to students.
Q21: The Instructor demonstrated a positive approach to students.
Q22: The Instructor was open and respectful of the views of students about the course.
Q23: The Instructor encouraged participation in the course.
Q24: The Instructor gave relevant homework assignments/projects, and helped/guided students.
Q25: The Instructor responded to questions about the course inside and outside of the course.
Q26: The Instructor's evaluation system (midterm and final questions, projects, assignments, etc.) effectively measured the course objectives.
Q27: The Instructor provided solutions to exams and discussed them with students.
Q28: The Instructor treated all students in a right and objective manner.

Q1-Q28 are all Likert-type, meaning that the values are taken from {1,2,3,4,5}.


Step 1 - Import the turkiye-student-evaluation.csv data set into your RapidMiner data repository. Save it with the name StudentEvaluation.
Step 2 - Drag your StudentEvaluation data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as Clustering_Process.

Question 1: Conduct any data preparation that you need for the data set.
Is there any inconsistent data or are there any missing values? Do you need to change data types for some attributes? Why or why not? Do you need to remove some attributes? Why or why not?
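
One way to look for missing or inconsistent values outside RapidMiner is to compare every cell against the value ranges listed under Attribute Information. The sketch below does this for a hypothetical three-row sample with only two of the 28 question columns; the column names follow the list above and may not match the headers in the actual CSV file.

```python
import csv
import io

# Hypothetical sample with deliberate problems; not real survey rows.
SAMPLE = """instr,class,repeat,attendance,difficulty,Q1,Q2
1,2,1,0,4,3,3
3,13,1,4,5,5,
2,14,0,2,3,6,1
"""

# Allowed values taken from the Attribute Information list.
ALLOWED = {
    "instr": range(1, 4),
    "class": range(1, 14),
    "attendance": range(0, 5),
    "difficulty": range(1, 6),
}


def allowed_values(column):
    """Return the allowed range for a column, or None if only non-negativity applies."""
    if column in ALLOWED:
        return ALLOWED[column]
    if column.startswith("Q"):
        return range(1, 6)  # Likert scale
    return None  # e.g. repeat: any non-negative integer


def find_problems(handle):
    """List (file line, column, problem) for missing, non-integer or out-of-range cells."""
    problems = []
    for line_no, row in enumerate(csv.DictReader(handle), start=2):
        for column, raw in row.items():
            raw = (raw or "").strip()
            if raw == "":
                problems.append((line_no, column, "missing"))
                continue
            try:
                value = int(raw)
            except ValueError:
                problems.append((line_no, column, "not an integer"))
                continue
            allowed = allowed_values(column)
            if allowed is not None and value not in allowed:
                problems.append((line_no, column, "out of range"))
            elif allowed is None and value < 0:
                problems.append((line_no, column, "out of range"))
    return problems


problems = find_problems(io.StringIO(SAMPLE))
for problem in problems:
    print(problem)
```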
Step 3 - You may need other operators, e.g., Select Attributes. Connect a k-Means operator to your data set and configure its parameters. Because the number of student clusters (groups) is not known in advance, you may need to try different k values (e.g., k = 2, 3, 4 or 5) to decide the best "natural" number of groups for this dataset. Then run your model.
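
The idea of trying several k values can be sketched with a plain implementation of the standard k-means loop (assign each point to its nearest centroid, recompute centroids, repeat), reporting the within-cluster sum of squared distances for each k. The points below are invented three-question answer vectors, the seed and function names are assumptions, and RapidMiner's k-Means operator may initialise and score clusters differently.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centroids, clusters, within-cluster sum of squares)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum(squared_distance(p, centroids[i]) for i, c in enumerate(clusters) for p in c)
    return centroids, clusters, sse


# Hypothetical Likert answers to three questions; not real survey rows.
points = [
    (1, 1, 2), (1, 2, 1), (2, 1, 1),
    (3, 3, 3), (3, 3, 4), (3, 4, 3),
    (5, 5, 5), (5, 4, 5), (4, 5, 5),
]

for k in (2, 3, 4, 5):
    centroids, clusters, sse = k_means(points, k)
    print(k, [len(c) for c in clusters], round(sse, 3))
```

The sum of squares generally shrinks as k grows, so a common heuristic is to look for the k after which the improvement flattens out rather than for the smallest value.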

Question 2: Investigate your Centroid Table, Folder View, and the other evaluation tools. Report your findings for your clusters. Which number of clusters (which k value) gives the best "natural" grouping?

Question 3: Experiment with the other k-Means operators in RapidMiner, such as Kernel or Fast. How are they different from your original model? Did the use of these operators change your clusters, and if so, how?

Predictive model using decision trees

The dataset that will be used for the decision tree model is bill_authentication.csv. The data was extracted from images taken of genuine and forged banknote-like specimens. For digitization, an industrial camera usually used for print inspection was used. The final images have 400 x 400 pixels. Due to the object lens and the distance to the investigated object, gray-scale pictures with a resolution of about 660 dpi were obtained. A Wavelet Transform tool was used to extract features from the images.

The goal of this task is to predict whether a bank note is authentic or fake depending upon the four different attributes of the image of the note. The attributes are variance of wavelet transformed image, curtosis of the image, entropy, and skewness of the image.

Attribute Information:

1. variance of Wavelet Transformed image (continuous)
2. skewness of Wavelet Transformed image (continuous)
3. curtosis of Wavelet Transformed image (continuous)
4. entropy of image (continuous)
5. class (integer)

Step 1 - Import the bill_authentication.csv data set into your RapidMiner data repository. Save it with the name BillAuthentication.

Step 2 - Drag your BillAuthentication data set into a new process window in RapidMiner, and run the model in order to inspect the data. When running the model, if prompted, save the process as DecisionTree_Process.

Step 3 - Split the dataset into training and testing sets using the Split Data operator. The partition parameter specifies the split ratios: assign 80% of the data to the training set and 20% to the test set. Sampling type: stratified sampling.
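
Stratified sampling means the 80/20 split is made within each class, so both subsets keep roughly the original class proportions. A minimal sketch of that idea follows; the rows, the seed and the function name are hypothetical, and this is not how the Split Data operator is implemented internally.

```python
import random
from collections import defaultdict


def stratified_split(rows, label_of, test_ratio=0.2, seed=42):
    """Shuffle each class separately and move test_ratio of it to the test set."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    train, test = [], []
    for label in sorted(by_class):
        group = by_class[label][:]
        rng.shuffle(group)
        n_test = round(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Hypothetical rows of (id, class): 60 of class 0 and 40 of class 1.
rows = [(i, 0) for i in range(60)] + [(i, 1) for i in range(60, 100)]
train, test = stratified_split(rows, label_of=lambda r: r[1])

print(len(train), len(test))
print("class 1 share in test:", sum(1 for r in test if r[1] == 1) / len(test))
```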

Question 1: Is there any inconsistent data or are there any missing values? Do you need to change data types for some attributes? Why or why not? Do you need to remove some attributes? Why or why not?

Step 4 - Build a Decision Tree model for predicting whether a bank note is authentic or fake. Run your model with the default parameter values and gain_ratio for the decision tree.
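
For intuition about the gain_ratio criterion, the sketch below computes the textbook gain ratio (information gain divided by split information) for a binary split of one numeric attribute at a threshold. The attribute values, labels and thresholds are invented, and RapidMiner's Decision Tree may compute the criterion with additional details not shown here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Information gain of the split 'value <= threshold' divided by its split information."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # the threshold does not separate anything
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    remainder = weights[0] * entropy(left) + weights[1] * entropy(right)
    information_gain = entropy(labels) - remainder
    split_information = -sum(w * math.log2(w) for w in weights)
    return information_gain / split_information


# Hypothetical 'variance' values and class labels; not rows from the bank note data.
variance = [-3.1, -1.2, -0.4, 0.9, 2.3, 3.8]
labels = [1, 1, 1, 0, 0, 0]

print("threshold 0.0:", gain_ratio(variance, labels, 0.0))
print("threshold 2.0:", gain_ratio(variance, labels, 2.0))
```

The attribute and threshold with the highest score is the one a greedy tree builder would typically pick for the first split.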

Question 2: Examine the tree results. How many nodes and leaves are in the tree? View the tree. Which variable was used for the first split?
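
Counting nodes and leaves can be thought of as a simple traversal of the tree. In the sketch below the tree is a hand-written nested dictionary with made-up splits and labels, not output from RapidMiner, and "nodes" is taken to mean all nodes including leaves; state which convention you use when you report your counts.

```python
# Hypothetical tree: split texts and labels are invented for illustration.
tree = {
    "split": "variance <= 0.3",
    "children": [
        {
            "split": "skewness <= 5.0",
            "children": [{"label": "1"}, {"label": "0"}],
        },
        {"label": "0"},
    ],
}


def count_nodes_and_leaves(node):
    """Return (total nodes, leaves) for a nested-dict tree."""
    children = node.get("children", [])
    if not children:
        return 1, 1  # a leaf counts as one node and one leaf
    nodes, leaves = 1, 0
    for child in children:
        child_nodes, child_leaves = count_nodes_and_leaves(child)
        nodes += child_nodes
        leaves += child_leaves
    return nodes, leaves


nodes, leaves = count_nodes_and_leaves(tree)
print("nodes:", nodes, "leaves:", leaves, "first split:", tree["split"])
```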

Step 5 - Re-run the model with different parameter values, e.g., using gini_index or information_gain and changing leaf and split sizes.
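
When switching the criterion to gini_index, the tree scores candidate splits by impurity instead of entropy. A minimal sketch of the usual Gini impurity, 1 minus the sum of squared class shares, and of choosing the threshold that minimises the weighted impurity is given below. The values and labels are invented and the function names are assumptions of this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_after_split(values, labels, threshold):
    """Size-weighted Gini impurity of the two sides of 'value <= threshold'."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    total = len(labels)
    return sum(len(part) / total * gini(part) for part in (left, right) if part)


def best_threshold(values, labels):
    """Try the midpoints between consecutive distinct values and keep the purest split."""
    candidates = sorted(set(values))
    midpoints = [(a + b) / 2 for a, b in zip(candidates, candidates[1:])]
    return min(midpoints, key=lambda t: gini_after_split(values, labels, t))


# Hypothetical attribute values and class labels; not rows from the bank note data.
values = [-3.0, -1.0, 1.0, 3.0]
labels = [1, 1, 0, 0]

print("impurity before split:", gini(labels))
print("best threshold:", best_threshold(values, labels))
```

Different criteria can rank the same candidate splits differently, which is one reason the tree's first node and size may change between runs.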

Question 3: Report differences in the tree's structure (e.g., nodes, leaves, first node, etc.). Analyse and report your results.
