Exercise
An n-gram is a sequence of elements or tokens that appear together in a document or a longer sequence of tokens. In this structure, n is the sequence size. For instance, in the sentence:
"a n-grama is a sequence of tokens"
"Un n-grama es una secuencia de tokens"
The 2-grams (bigrams) that would conform the sentence are (Un, n-grama), (n-grama, es), (es, una), (una, secuencia), (secuencia, de), (de tokens). The 3-gramas (trigrams) that make the sentence are: (Un, n-grama, es), (n-grama, es, una), (es, una, secuencia), (una, secuencia, de), (secuencia, de, tokens).
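The sliding-window extraction illustrated above can be sketched as follows (the class and method names here are illustrative, not part of the exercise skeleton):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramDemo {
    // Slide a window of size n over the token list and collect each window.
    static List<List<String>> ngrams(List<String> tokens, int n) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(tokens.subList(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("An", "n-gram", "is", "a", "sequence", "of", "tokens");
        System.out.println(ngrams(tokens, 2)); // the 6 bigrams
        System.out.println(ngrams(tokens, 3)); // the 5 trigrams
    }
}
```

Note that a sentence of m tokens yields m - n + 1 n-grams, which is why the example sentence has six bigrams but only five trigrams.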
N-grams are very popular structures, mainly because of their use in natural language processing, a branch of computer science that aims to enable machines to process human language adequately. Their popularity stems from their ability to detect common patterns in related documents. For instance, in sports texts we commonly find bigrams like (the, player) or (the, team), whereas in other kinds of texts, such as fantasy novels, we are far more likely to find bigrams like (the, damsel) or (a, castle).
In this exercise we will design our own n-gram model which will help us score texts based on their similarity to different reference texts. To do so:
In the edu.uoc.mecm.eda.ngram.NgramFrequencyScorer class you will have to implement the train() method. This method takes a path to a system folder as input and reads all files ending in .txt; these files become the training set for our text scoring model. For each file, the method extracts all the tokens in the text. You will have to complete the method so that it calculates the relative frequency of every n-gram that appears in the training set (globally, across all texts). The n-gram size is specified when the class is initialized (the numWords attribute). Your code must be generic and work with n-grams of any size.
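As a rough illustration of what train() has to do, the following standalone sketch pools n-gram counts over all .txt files in a folder and normalises by the total count. Only the numWords attribute comes from the exercise; the class name, tokenizer (lowercase, whitespace split), and the getFrequency helper are assumptions for demonstration, not the actual class skeleton:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class NgramFrequencyScorerSketch {
    private final int numWords;                              // n-gram size (from the exercise)
    private final Map<String, Double> frequencies = new HashMap<>();

    public NgramFrequencyScorerSketch(int numWords) {
        this.numWords = numWords;
    }

    // Read every .txt file in the folder and compute the relative frequency
    // of each n-gram over the whole training set (all files pooled together).
    public void train(String folder) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        try (Stream<Path> files = Files.list(Paths.get(folder))) {
            for (Path p : files.filter(f -> f.toString().endsWith(".txt")).toArray(Path[]::new)) {
                // Assumed tokenizer: lowercase, split on whitespace.
                String[] tokens = new String(Files.readAllBytes(p)).toLowerCase().trim().split("\\s+");
                for (int i = 0; i + numWords <= tokens.length; i++) {
                    String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + numWords));
                    counts.merge(key, 1L, Long::sum);
                    total++;
                }
            }
        }
        // Relative frequency = count of the n-gram / total n-grams seen.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
    }

    public double getFrequency(String ngram) {
        return frequencies.getOrDefault(ngram, 0.0);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ngrams");
        Files.write(dir.resolve("a.txt"), "the player the player".getBytes());
        NgramFrequencyScorerSketch scorer = new NgramFrequencyScorerSketch(2);
        scorer.train(dir.toString());
        System.out.println(scorer.getFrequency("the player")); // 2 of the 3 bigrams
    }
}
```

Keeping counts and the total in one pass, then normalising at the end, avoids re-reading the files; the real class may structure this differently.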
Once we have calculated the relative frequencies of the n-grams in the training set, we are ready to score other texts. Texts similar to the training set will receive a higher score than less similar texts. The relative frequency of an n-gram can be seen as its probability of appearing in the training texts (p(x), where x is an n-gram and p(x) is its relative frequency). Therefore, if we assume independence between the n-grams of a text, we can evaluate a text with our model using the expression:
getScore(X) = Σ_{x ∈ X} log(p(x))
where X is the set of all n-grams that compose the text. Complete the getScore() method, which takes a text file as input parameter and returns the text's score according to our n-gram model.
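One possible shape for this scoring step, shown here as a standalone function over an already-tokenized text: it sums log(p(x)) over all n-grams of the input. The floor value for unseen n-grams is an illustrative workaround for log(0), not something the exercise prescribes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NgramScoreSketch {
    // Sum log(p(x)) over every n-gram x of the token sequence.
    static double getScore(String[] tokens, Map<String, Double> freq, int n) {
        final double FLOOR = 1e-10; // assumption: avoids log(0) for unseen n-grams
        double score = 0.0;
        for (int i = 0; i + n <= tokens.length; i++) {
            String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            score += Math.log(freq.getOrDefault(key, FLOOR));
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy "trained" model with two known bigrams.
        Map<String, Double> freq = new HashMap<>();
        freq.put("the player", 0.5);
        freq.put("player scored", 0.5);
        String[] similar = {"the", "player", "scored"};
        String[] unrelated = {"a", "castle", "appeared"};
        // A text built from frequent training n-grams scores higher.
        System.out.println(getScore(similar, freq, 2) > getScore(unrelated, freq, 2)); // true
    }
}
```

Since every p(x) ≤ 1, each log term is ≤ 0 and the score is a negative number; "higher" here means closer to zero.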
If you take a look at the tests implemented in the edu.uoc.mecm.eda.tests.NgramFrequencyScorerTest class, you will see that the first text scores higher than the second, and the second scores higher than the third. Explain why this happens; you may need to analyse both the training texts and the evaluation texts.