K-nearest neighbor for text classification, Computer Engineering


Assignment 2: K-nearest neighbor for text classification.

The goal of text classification is to identify the topic of a piece of text (news article, web-blog, etc.). Text classification has obvious utility in the age of information overload, and it has become a popular turf for applying machine learning algorithms. In this project, you will have the opportunity to implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

1. Download the dataset from my website, which is created from the original collection and contains a training file, a test file, the topics, and the format for train/test.

2. Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try the following distance or similarity measures with their corresponding representations.

a. Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

b. Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

c. Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.
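The three measures in (a) to (c) can be sketched in a few lines of Python. This is only an illustration under some assumptions: documents are already converted to equal-length lists over a shared vocabulary, the function names are made up for this example, and returning 0.0 for an all-zero vector in the cosine case is a choice the assignment does not specify.

```python
import math

def hamming_distance(a, b):
    """Count the positions where two boolean vectors differ."""
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))

def euclidean_distance(a, b):
    """Square root of the summed squared differences of the word counts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: an empty (all-zero) document is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

# Toy vocabulary ["oil", "price", "wheat"] (hypothetical data).
counts_1 = [2, 1, 0]
counts_2 = [0, 1, 3]
bits_1 = [c > 0 for c in counts_1]
bits_2 = [c > 0 for c in counts_2]

print(hamming_distance(bits_1, bits_2))
print(euclidean_distance(counts_1, counts_2))
print(cosine_similarity(counts_1, counts_2))
```

For the real collection the vocabulary is large and most entries are zero, so a sparse representation (for example a dict from word to value) would probably be more practical than dense lists.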

3. Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the number in the vector in (b)). TF stands for term frequency, and TF(d,w)=N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency, and IDF(d,w)=log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w; note that this value depends only on w, not on d. The base of the logarithm is irrelevant, since changing it rescales every weight by the same constant and leaves cosine similarity unchanged; you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(d,w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details about it online or in a standard IR textbook.
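As a rough sketch of these definitions, the following Python computes TF-IDF vectors for a small list of tokenized documents. The function name `tfidf_vectors` and the toy documents are hypothetical, and the sketch assumes D and C(w) are counted over exactly the documents passed in; how to weight test documents (for instance, reusing the IDF values from the training set) is not specified in the assignment.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs is a list of token lists; returns (vocab, TF-IDF vectors)."""
    vocab = sorted({w for doc in docs for w in doc})
    num_docs = len(docs)                      # D
    # C(w): number of documents that contain w at least once.
    doc_freq = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log(num_docs / doc_freq[w]) for w in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)                 # N(d, w)
        total = len(doc)                      # W(d)
        vectors.append(
            [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]
        )
    return vocab, vectors

docs = [
    ["the", "oil", "price"],
    ["the", "wheat", "crop"],
    ["the", "oil", "oil"],
]
vocab, vectors = tfidf_vectors(docs)
for w, weight in zip(vocab, vectors[2]):
    print(w, round(weight, 4))
```

In this toy input "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes, which is the discounting effect described above.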

4. You should try k = 1, k = 3 and k = 5 with each of the representations above. Note that with a distance measure, the k nearest neighbors are the ones with the smallest distance from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
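The difference between ranking by distance and ranking by similarity can be illustrated with a minimal k-nearest-neighbor sketch. The names `knn_predict` and `higher_is_closer`, the majority vote, and the toy data are assumptions made for this example; the assignment does not say how ties between topics should be broken.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, query, k, score, higher_is_closer=False):
    """Predict a label by majority vote among the k closest training points.

    score(vec, query) is a distance (smaller is closer) unless
    higher_is_closer is True, in which case it is a similarity.
    """
    scored = [(score(vec, query), label)
              for vec, label in zip(train_vectors, train_labels)]
    # Sort ascending for distances, descending for similarities.
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_closer)
    top_labels = [label for _, label in scored[:k]]
    # Majority vote over the k neighbors' labels.
    return Counter(top_labels).most_common(1)[0][0]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical two-dimensional training points with topic labels.
train = [[0, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
labels = ["grain", "grain", "crude", "crude", "crude"]
query = [1, 1]

for k in (1, 3, 5):
    print(k, knn_predict(train, labels, query, k, euclidean))
```

On this toy data the prediction is "grain" for k = 1 and k = 3 but flips to "crude" for k = 5, because the three more distant "crude" points outvote the two nearby "grain" points once all five are included.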

 

 

