Data mining project

Reference no: EM13999715

The project must be carried out using any programming language or one of the suggested platforms and libraries listed below; references to them are also available on Blackboard.

·         KNIME, open source Data Mining platform (https://www.knime.org).

·         Weka, open source ML library in Java (https://www.cs.waikato.ac.nz/ml/weka).

·         R, free programming language for statistical computing (https://www.r-project.org).

The following data files are required for this coursework and are provided on Blackboard:

·         wine.csv (data file for tasks 1 and 2)

·         training100Ku.csv (data file for task 3)

·         test1K.csv (data file for task 3)

Wine dataset for Task #1 and Task #2

The data set (wine.csv) is obtained from a chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. The analysis determined the quantities of 13 chemical constituents found in each wine. Each data record contains the cultivar ID (1, 2 or 3) and 13 numerical attributes.

Task #1 – Data Exploration and Clustering

You are required to perform a clustering analysis of the wine data set described above. This task has to be carried out twice: with and without normalisation.

Task 1.1: Clustering without normalisation

Apply Principal Component Analysis (PCA) to generate two-dimensional coordinates and a 2D plot (plot1) of the records. The data points in plot1 should be coloured according to their class label.

Apply a clustering algorithm to the data set to generate three partitions. Generate a 2D plot (plot2) based on the same PCA projection as plot1, with the colour now associated with the cluster ID (use colours different from those in plot1), and compare it with plot1.

For the records in each cluster, generate a 2D plot (plot3a, plot3b, plot3c) coloured by class label (same colours as in plot1), and visually verify the distribution of class labels in each cluster.
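
The clustering and class-distribution check described above can be sketched in plain Python. This is only an illustrative sketch: the brief does not prescribe k-means, the farthest-first initialisation is an assumption chosen to keep the result deterministic, and the names `kmeans` and `cluster_vs_class` are hypothetical. The PCA projection and the plots are not covered here, and the toy points stand in for the rows of wine.csv.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=100):
    """Plain k-means with a deterministic farthest-first initialisation."""
    # Assumption: start from the first record, then repeatedly add the
    # record that lies farthest from all centroids chosen so far.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignment = [None] * len(points)
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no record changed cluster: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


def cluster_vs_class(assignment, labels):
    """Count how many records of each class label fall in each cluster."""
    table = {}
    for cluster, label in zip(assignment, labels):
        counts = table.setdefault(cluster, {})
        counts[label] = counts.get(label, 0) + 1
    return table


# Toy stand-in for the wine records: three well-separated groups.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
              (20, 0), (20, 1), (21, 0)]
toy_labels = [1, 1, 1, 2, 2, 2, 3, 3, 3]
toy_assignment, _ = kmeans(toy_points, k=3)
print(cluster_vs_class(toy_assignment, toy_labels))
```

The table returned by `cluster_vs_class` is the numerical counterpart of plot3a, plot3b and plot3c: a cluster dominated by one class label shows up as a row with a single large count.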

Select, describe and apply at least one cluster validity measure, and include the results in the report.

Task 1.2: Clustering with normalisation

Apply a normalisation pre-processing step to the data set and repeat the steps of Task 1.1. Compare the new plots and the cluster validity measure with the previous ones.
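
One way to carry out this comparison is sketched below, under two assumptions that the brief leaves open: z-score standardisation is used as the normalisation, and the mean silhouette coefficient is used as the cluster validity measure. The function names are hypothetical, and `silhouette` needs at least two clusters.

```python
import math


def standardise(rows):
    """Z-score each column: subtract its mean, divide by its std deviation."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 to avoid
    # dividing by zero.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    return [
        [(x - m) / s for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, assignment):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for index, cluster in enumerate(assignment):
        clusters.setdefault(cluster, []).append(index)

    scores = []
    for i, cluster in enumerate(assignment):
        own = [j for j in clusters[cluster] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != cluster
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)
```

Computing `silhouette` once on the raw attributes and once on the output of `standardise`, each with its own cluster assignment, gives two numbers that can be compared directly. Without normalisation, attributes measured on large scales dominate the distances, so the two results can differ noticeably.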

The submission for Task #1 must contain two components:

·         a report section dedicated to your solution for Task #1,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

 

Task #2 – Comparison of Classification Models

You are required to learn and test classification models for the wine data set. For this task you need to carry out a performance comparison of TWO different classification algorithms. You should use a 10-fold cross-validation method to estimate the generalisation error.

In the report you should briefly describe the two algorithms and the method used to compare the two algorithms.
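
The 10-fold comparison can be sketched as follows. The two classifiers here (1-nearest-neighbour and nearest-centroid) are placeholders picked only because they fit in a few lines; they are not the algorithms the brief expects, and all function names are hypothetical. The point illustrated is that both algorithms are evaluated on the same folds, so that their error estimates are directly comparable.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Shuffle the record indices and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_nearest_neighbour(train_X, train_y):
    """1-NN: predict the label of the closest training record."""
    def predict(x):
        best = min(range(len(train_X)),
                   key=lambda i: squared_distance(x, train_X[i]))
        return train_y[best]
    return predict


def fit_nearest_centroid(train_X, train_y):
    """Predict the label whose class mean is closest."""
    groups = {}
    for x, y in zip(train_X, train_y):
        groups.setdefault(y, []).append(x)
    centroids = {
        y: [sum(col) / len(rows) for col in zip(*rows)]
        for y, rows in groups.items()
    }

    def predict(x):
        return min(centroids, key=lambda y: squared_distance(x, centroids[y]))
    return predict


def cross_validated_error(fit, X, y, folds):
    """Fraction of records misclassified when each fold is held out once."""
    errors = 0
    for test_indices in folds:
        held_out = set(test_indices)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        predict = fit(train_X, train_y)
        errors += sum(predict(X[i]) != y[i] for i in test_indices)
    return errors / len(X)
```

A typical use is to build the folds once with `k_fold_indices(len(X))` and pass the same list to `cross_validated_error` for each algorithm. Any normalisation should be fitted inside each training fold rather than on the full data set, otherwise information from the held-out fold leaks into training.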

The submission for Task #2 must contain two components:

·         a report section dedicated to your solution for Task #2,

·         any KNIME workflow(*) and source code used (a zip/jar archive).

 

Task #3 – The Search for the God Particle: a Binary Classification Challenge

CERN’s Large Hadron Collider (LHC) typically produces approximately 10^11 collisions per hour, and about 300 (0.0000003%) of these collisions result in a Higgs boson, the so-called God particle. Detecting when interesting particles are produced is an important challenge, which is typically studied through simulations. The data set for this task comes from simulations of collision events, which can be used to train a classification model to distinguish between collisions producing particles of interest (signal) and those producing other particles (background).
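
The quoted percentage can be checked with exact arithmetic, reading the collision rate as 10^11 per hour (the value consistent with the 0.0000003% figure). The variable names below are illustrative only.

```python
from fractions import Fraction

collisions_per_hour = 10 ** 11  # assumed reading of the collision rate
higgs_per_hour = 300

# Exact share of collisions that yield a Higgs boson, as a percentage.
percentage = Fraction(higgs_per_hour, collisions_per_hour) * 100
print(float(percentage))  # 3e-07, i.e. 0.0000003%
```

The ratio illustrates how rare signal events are in the real experiment.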

Two data files are provided: the training set (training100Ku.csv) and the test set (test1K.csv). The training set file has 100,000 records, each containing, in this order, 21 numerical low-level attributes, 7 high-level attributes and the class label (signal/background). The low-level attributes are kinematic properties measured by the particle detectors in the accelerator during the experiment. The high-level attributes are computed after the experiment by a complex model as a function of the low-level attributes (feature transformation).

The test set has 1,000 records, each containing a unique record identifier and 21 numerical low-level attributes (the same measurements in the same order as in the training set). The 7 high-level attributes and the class label are not present.
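
Because the test records lack the 7 high-level attributes, a model trained on all 28 training columns cannot be applied to them directly. A minimal sketch of one way to handle this is to train on the 21 low-level columns only. The helper names are hypothetical, and it is an assumption that the record identifier is the first column of the test file and that rows arrive as lists of strings (for example from `csv.reader`) with any header row already skipped.

```python
N_LOW_LEVEL = 21   # low-level attributes, present in both files
N_HIGH_LEVEL = 7   # high-level attributes, training file only


def split_training_row(row):
    """Return (low_level_features, label) from one training record."""
    expected = N_LOW_LEVEL + N_HIGH_LEVEL + 1
    if len(row) != expected:
        raise ValueError(f"expected {expected} fields, got {len(row)}")
    features = [float(value) for value in row[:N_LOW_LEVEL]]
    label = row[-1]  # class label is the last field
    return features, label


def split_test_row(row):
    """Return (record_id, low_level_features) from one test record."""
    if len(row) != N_LOW_LEVEL + 1:
        raise ValueError(f"expected {N_LOW_LEVEL + 1} fields, got {len(row)}")
    record_id = row[0]  # assumption: the identifier comes first
    features = [float(value) for value in row[1:]]
    return record_id, features
```

An alternative, not shown here, is to learn a model that approximates the high-level attributes from the low-level ones and apply it to the test records.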

Your task is to predict the class label for the records of the test set. The resulting predictions must be submitted as a single file (CSV format) with only two columns: the record ID and the predicted class label (signal/background).
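
The two-column output file could be produced along the following lines. The brief does not say whether a header row is expected or what the columns should be called, so the header names below are assumptions, as is the function name `write_predictions`.

```python
import csv
import io

VALID_LABELS = ("signal", "background")


def write_predictions(handle, predictions):
    """Write (record_id, label) pairs as a two-column CSV."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["ID", "Prediction"])  # hypothetical header names
    for record_id, label in predictions:
        if label not in VALID_LABELS:
            raise ValueError(f"unexpected label: {label!r}")
        writer.writerow([record_id, label])


# Demonstration on an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_predictions(buffer, [(1, "signal"), (2, "background")])
print(buffer.getvalue())
```

To write the actual file, the same function can be called with `open("Task3-predictions.csv", "w", newline="")` as the handle.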

You must also include a section in the report describing the method used to generate the submitted predictions, together with an estimate of the following performance indices: accuracy, F-measure, precision and recall.
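
The four indices can be computed from the confusion counts as sketched below. Treating "signal" as the positive class is an assumption, and `binary_metrics` is a hypothetical helper name. Since the test set carries no labels, the estimates would have to come from data held out of the training set (for example a validation split or cross-validation).

```python
def binary_metrics(actual, predicted, positive="signal"):
    """Accuracy, precision, recall and F-measure for a binary problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators (no predicted or actual positives).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (
        2 * precision * recall / (precision + recall)
        if precision + recall else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For example, with 3 actual signal records of which 2 are predicted correctly, plus 1 background record wrongly predicted as signal, precision and recall are both 2/3, and so is the F-measure.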

In summary, the submission for Task #3 must contain three components:

·         a report section dedicated to your solution for Task #3,

·         any KNIME workflow(*) and/or source code used (a zip/jar archive) and

·         the file “Task3-predictions.csv”.

(*) Important: do not include data when you export a KNIME workflow as a zip archive.
