K-nearest neighbor for text classification, Computer Engineering


Assignment 2: K-nearest neighbor for text classification.

The goal of text classification is to identify the topic of a piece of text (news article, web-blog, etc.). Text classification has obvious utility in the age of information overload, and it has become a popular turf for applying machine learning algorithms. In this project, you will have the opportunity to implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

1.       Download the dataset from my website. It was created from the original collection and contains a training file, a test file, the topics, and a description of the format of the train/test files.

2.       Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try the following distance or similarity measures with their corresponding representations.

a.        Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

b.       Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

c.         Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.
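As an illustration of representations (a) and (b), here is a minimal Python sketch. It assumes each document has already been tokenized into a list of words; the helper names (`build_vocab`, `boolean_vector`, `count_vector`) are hypothetical and not part of the assignment. Words that are not in the vocabulary are simply ignored, which is one possible choice rather than a requirement.

```python
import math
from collections import Counter

def build_vocab(docs):
    """Return the sorted list of distinct words; position = vector index."""
    return sorted({w for d in docs for w in d})

def boolean_vector(doc, vocab):
    """Representation (a): 1 if the word appears in the document, else 0."""
    present = set(doc)
    return [1 if w in present else 0 for w in vocab]

def count_vector(doc, vocab):
    """Representation (b): number of occurrences of each word (may be 0)."""
    counts = Counter(doc)
    return [counts[w] for w in vocab]

def hamming_distance(u, v):
    """Number of positions at which two boolean vectors differ."""
    return sum(a != b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Square root of the summed squared differences of two count vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy example (made-up documents, not taken from the dataset).
docs = [["oil", "price", "oil"], ["oil", "trade"]]
vocab = build_vocab(docs)
h = hamming_distance(boolean_vector(docs[0], vocab), boolean_vector(docs[1], vocab))
e = euclidean_distance(count_vector(docs[0], vocab), count_vector(docs[1], vocab))
```

For a real vocabulary the vectors are mostly zeros, so a sparse representation (for example a dict from word to count) would likely be faster; dense lists are used here only to keep the sketch short.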

3.        Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the number in the vector in (b)). TF stands for term frequency, and TF(d,w)=N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency, and IDF(d,w)=log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w; note that this value depends only on w, not on d. The base of the logarithm is irrelevant; you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(d,w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details about it online or in a standard IR textbook.
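The TF-IDF definition and the cosine similarity from (c) can be sketched as follows. This is a rough illustration under two assumptions the assignment does not state: D and C(w) are computed over the training documents only, and a word that appears in no training document gets weight 0. The function names are hypothetical.

```python
import math
from collections import Counter

def document_frequencies(docs):
    """C(w): number of documents that contain each word at least once."""
    df = Counter()
    for d in docs:
        df.update(set(d))  # set() so each document counts a word once
    return df

def tfidf_vector(doc, vocab, df, num_docs):
    """TF(d,w) * IDF(d,w) for every word w in vocab."""
    counts = Counter(doc)
    total = len(doc)  # W(d): total number of words in the document
    vec = []
    for w in vocab:
        tf = counts[w] / total if total else 0.0
        # Assumption: unseen words (C(w) = 0) get weight 0.
        idf = math.log(num_docs / df[w]) if df[w] else 0.0
        vec.append(tf * idf)
    return vec

def cosine_similarity(u, v):
    """Dot product divided by the product of the norms (0 if a norm is 0)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy example (made-up documents, not taken from the dataset).
train_docs = [["oil", "price", "oil"], ["oil", "trade"], ["grain", "trade"]]
tfidf_vocab = sorted({w for d in train_docs for w in d})
df = document_frequencies(train_docs)
vectors = [tfidf_vector(d, tfidf_vocab, df, len(train_docs)) for d in train_docs]
```

Note that a word occurring in every document has IDF = log(1) = 0, so it contributes nothing to the similarity; this is the discounting of frequent words described above.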

4.       You should try k = 1, k = 3 and k = 5 with each of the representations above. Notice that with a distance measure, the k nearest neighbors are the training documents with the smallest distances from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
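The difference between ranking by distance and ranking by similarity can be captured with a single flag. The sketch below is one possible shape for the prediction step, not a prescribed solution: `knn_predict` and `higher_is_closer` are hypothetical names, and the tie-breaking rule (prefer the tied topic whose neighbor ranks nearest) is an assumption, since the assignment does not specify one.

```python
from collections import Counter

def knn_predict(query, train_vectors, train_labels, k, score, higher_is_closer=False):
    """Predict a label by majority vote among the k nearest training vectors.

    score(query, x) is a distance (smaller = closer) by default, or a
    similarity (larger = closer) when higher_is_closer is True.
    """
    scored = [(score(query, x), label)
              for x, label in zip(train_vectors, train_labels)]
    # Ascending for distances, descending for similarities.
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_closer)
    top_labels = [label for _, label in scored[:k]]
    votes = Counter(top_labels)
    best = max(votes.values())
    # Assumed tie-break: among tied labels, take the one ranked nearest.
    for label in top_labels:
        if votes[label] == best:
            return label

# Toy example with boolean vectors and made-up labels.
def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

train = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
labels = ["oil", "oil", "grain", "grain"]
by_distance = {k: knn_predict([1, 0, 0], train, labels, k, hamming) for k in (1, 3)}
by_similarity = {k: knn_predict([0, 1, 1], train, labels, k, dot, higher_is_closer=True)
                 for k in (1, 3)}
```

Passing the scoring function as an argument means the same routine can be reused for Hamming distance, Euclidean distance, and cosine similarity; only the flag changes for the similarity case.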

 

 

