K-nearest neighbor for text classification, Computer Engineering


Assignment 2: K-nearest neighbor for text classification.

The goal of text classification is to identify the topic of a piece of text (news article, web blog, etc.). Text classification has obvious utility in the age of information overload, and it has become a popular testbed for machine learning algorithms. In this project, you will implement k-nearest neighbor and apply it to text classification on the well-known Reuters news collection.

1. Download the dataset from my website. It was created from the original collection and contains a training file, a test file, the topics, and a description of the format of the training and test files.

2. Implement the k-nearest neighbor algorithm for text classification. Your goal is to predict the topic for each news article in the test set. Try the following distance or similarity measures with their corresponding representations.

a. Hamming distance: each document is represented as a boolean vector, where each bit represents whether the corresponding word appears in the document.

b. Euclidean distance: each document is represented as a numeric vector, where each number represents how many times the corresponding word appears in the document (it could be zero).

c. Cosine similarity with TF-IDF weights (a popular metric in information retrieval): each document is represented by a numeric vector as in (b). However, now each number is the TF-IDF weight for the corresponding word (as defined below). The similarity between two documents is the dot product of their corresponding vectors, divided by the product of their norms.
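As a rough illustration of the three measures in (a) to (c), here is a minimal Python sketch that operates on plain lists of equal length. The function names are hypothetical, and returning 0.0 for an all-zero vector in the cosine case is an assumption made here, since the assignment does not say how to treat empty documents.

```python
import math


def hamming_distance(a, b):
    """Count the positions at which two boolean vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(1 for x, y in zip(a, b) if bool(x) != bool(y))


def euclidean_distance(a, b):
    """Square root of the sum of squared component differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the vectors divided by the product of their norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Assumption: similarity with an all-zero vector is defined as 0.
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, `hamming_distance([1, 0, 1, 1], [1, 1, 0, 1])` counts two differing positions, and `cosine_similarity` of two vectors pointing in the same direction is 1 regardless of their lengths.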

3. Let w be a word, d be a document, and N(d,w) be the number of occurrences of w in d (i.e., the number in the vector in (b)). TF stands for term frequency: TF(d,w) = N(d,w)/W(d), where W(d) is the total number of words in d. IDF stands for inverse document frequency: IDF(d,w) = log(D/C(w)), where D is the total number of documents and C(w) is the number of documents that contain the word w. The base of the logarithm is irrelevant, because changing it only rescales every weight by the same constant, which leaves cosine similarity unchanged; you can use e or 2. The TF-IDF weight for w in d is TF(d,w)*IDF(d,w); this is the number you should put in the vector in (c). TF-IDF is a clever heuristic that takes into account the "information content" each word conveys, so that frequent words like "the" are discounted and document-specific ones are amplified. You can find more details about it online or in a standard IR textbook.
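The definitions above can be sketched in Python as follows. This is only an illustration under stated assumptions: `tfidf_vectors` is a hypothetical helper, documents are assumed to be already split into word lists, and D and C(w) are computed over whatever collection is passed in. The assignment does not say how to weight words in a test document that never occur in the training file, so that case is not handled here.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenized documents."""
    vocab = sorted({w for doc in docs for w in doc})
    num_docs = len(docs)  # D

    # C(w): number of documents that contain w (each document counts once).
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))

    # IDF(w) = log(D / C(w)); natural log here, the base does not matter.
    idf = {w: math.log(num_docs / doc_freq[w]) for w in vocab}

    vectors = []
    for doc in docs:
        counts = Counter(doc)  # N(d, w)
        total = len(doc)       # W(d)
        vectors.append(
            [(counts[w] / total) * idf[w] if total else 0.0 for w in vocab]
        )
    return vocab, vectors


# Toy documents, made up for illustration only.
docs = [
    ["the", "oil", "price"],
    ["the", "wheat", "crop"],
    ["the", "oil", "oil", "export"],
]
vocab, vectors = tfidf_vectors(docs)
```

In this toy collection "the" occurs in every document, so its IDF is log(3/3) = 0 and its weight vanishes in all three vectors, which is the discounting effect described above.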

4. You should try k = 1, k = 3 and k = 5 with each of the representations above. Note that with a distance measure, the k nearest neighbors are the ones with the smallest distances from the test point, whereas with a similarity measure, they are the ones with the highest similarity scores.
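One possible shape for the prediction step is sketched below in Python. The function `knn_predict` and its `higher_is_closer` flag are hypothetical names, and two choices are assumptions made here because the assignment does not specify them: the predicted topic is the majority label among the k neighbors, and ties go to the tied label whose neighbor ranks closest. The small `hamming` helper and the toy data are included only to keep the example self-contained.

```python
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k, score,
                higher_is_closer=False):
    """Predict a label for `query` by majority vote over its k neighbors.

    Set higher_is_closer=False for a distance (smallest values win) and
    True for a similarity (largest values win).
    """
    if not train_vectors:
        raise ValueError("training set is empty")
    scored = [(score(vec, query), label)
              for vec, label in zip(train_vectors, train_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=higher_is_closer)
    top_labels = [label for _, label in scored[:k]]
    votes = Counter(top_labels)
    best = max(votes.values())
    # Assumed tie-break: first label in rank order that has the top count.
    for label in top_labels:
        if votes[label] == best:
            return label


def hamming(a, b):
    """Number of positions at which two boolean vectors differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


# Toy boolean vectors and labels, made up for illustration only.
train = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
labels = ["grain", "grain", "crude", "crude"]

predictions = {k: knn_predict(train, labels, [1, 0, 0], k, hamming)
               for k in (1, 3, 5)}
```

With a similarity measure such as cosine similarity, the same function can be called with `higher_is_closer=True`, so that the neighbors with the highest scores are selected instead.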

 

 

