Exercise
An n-gram is a sequence of elements or tokens that appear together in a document or a longer sequence of tokens. In this structure, n is the sequence size. For instance, in the sentence:
"a n-grama is a sequence of tokens"
"Un n-grama es una secuencia de tokens"
The 2-grams (bigrams) that would conform the sentence are (Un, n-grama), (n-grama, es), (es, una), (una, secuencia), (secuencia, de), (de tokens). The 3-gramas (trigrams) that make the sentence are: (Un, n-grama, es), (n-grama, es, una), (es, una, secuencia), (una, secuencia, de), (secuencia, de, tokens).
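The sliding-window extraction illustrated above can be sketched as follows (the class and method names here are illustrative, not part of the exercise skeleton):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class NgramDemo {
    // Slide a window of size n over the token list and collect each window.
    static List<List<String>> ngrams(List<String> tokens, int n) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i + n <= tokens.size(); i++) {
            result.add(tokens.subList(i, i + n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens =
                Arrays.asList("An", "n-gram", "is", "a", "sequence", "of", "tokens");
        System.out.println(ngrams(tokens, 2)); // the 6 bigrams
        System.out.println(ngrams(tokens, 3)); // the 5 trigrams
    }
}
```

Note that a sentence of m tokens yields m - n + 1 n-grams, which is why the example sentence has six bigrams but only five trigrams.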
N-grams are very popular structures, mainly because of their use in natural language processing, a branch of computer science that aims to enable machines to process human language adequately. Their popularity stems from their ability to detect common patterns in related documents. For instance, in sports texts we commonly find bigrams like (the, player) or (the, team), whereas in other kinds of texts, such as fantasy novels, we are far more likely to find bigrams like (the, damsel) or (a, castle).
In this exercise we will design our own n-gram model which will help us score texts based on their similarity to different reference texts. To do so:
In the edu.uoc.mecm.eda.ngram.NgramFrequencyScorer class you will have to implement the train() method. This method takes a path to a system folder as input and reads all files ending in .txt; these files become the training set for our text scoring model. For each file, the method extracts all the tokens in the text. You will have to complete the method so that it calculates the relative frequency of every n-gram that appears in the training set (globally, across all texts). The n-gram size is specified when the class is initialized (the numWords attribute). Your code must be generic and work with n-grams of any size.
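As a rough illustration of what train() has to do, the following standalone sketch pools n-gram counts over all .txt files in a folder and normalises by the total count. Only the numWords attribute comes from the exercise; the class name, tokenizer (lowercase, whitespace split), and the getFrequency helper are assumptions for demonstration, not the actual class skeleton:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class NgramFrequencyScorerSketch {
    private final int numWords;                              // n-gram size (from the exercise)
    private final Map<String, Double> frequencies = new HashMap<>();

    public NgramFrequencyScorerSketch(int numWords) {
        this.numWords = numWords;
    }

    // Read every .txt file in the folder and compute the relative frequency
    // of each n-gram over the whole training set (all files pooled together).
    public void train(String folder) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        long total = 0;
        try (Stream<Path> files = Files.list(Paths.get(folder))) {
            for (Path p : files.filter(f -> f.toString().endsWith(".txt")).toArray(Path[]::new)) {
                // Assumed tokenizer: lowercase, split on whitespace.
                String[] tokens = new String(Files.readAllBytes(p)).toLowerCase().trim().split("\\s+");
                for (int i = 0; i + numWords <= tokens.length; i++) {
                    String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + numWords));
                    counts.merge(key, 1L, Long::sum);
                    total++;
                }
            }
        }
        // Relative frequency = count of the n-gram / total n-grams seen.
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
    }

    public double getFrequency(String ngram) {
        return frequencies.getOrDefault(ngram, 0.0);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("ngrams");
        Files.write(dir.resolve("a.txt"), "the player the player".getBytes());
        NgramFrequencyScorerSketch scorer = new NgramFrequencyScorerSketch(2);
        scorer.train(dir.toString());
        System.out.println(scorer.getFrequency("the player")); // 2 of the 3 bigrams
    }
}
```

Keeping counts and the total in one pass, then normalising at the end, avoids re-reading the files; the real class may structure this differently.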
Once we have calculated the relative frequencies of the n-grams in the training set, we are ready to score other texts. Texts similar to the training set will receive a higher score than less similar texts. The relative frequency of an n-gram can be seen as its probability of appearing in the training texts (p(x), where x is an n-gram and p(x) is its relative frequency). Therefore, if we assume independence between the n-grams of a text, we can evaluate a text with our model using the expression:
getScore(X) = Σ_{x ∈ X} log(p(x))
where X is the set of all n-grams that compose the text. Complete the getScore() method, which takes a text file as input parameter and returns the text's score according to our n-gram model.
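One possible shape for this scoring step, shown here as a standalone function over an already-tokenized text: it sums log(p(x)) over all n-grams of the input. The floor value for unseen n-grams is an illustrative workaround for log(0), not something the exercise prescribes:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class NgramScoreSketch {
    // Sum log(p(x)) over every n-gram x of the token sequence.
    static double getScore(String[] tokens, Map<String, Double> freq, int n) {
        final double FLOOR = 1e-10; // assumption: avoids log(0) for unseen n-grams
        double score = 0.0;
        for (int i = 0; i + n <= tokens.length; i++) {
            String key = String.join(" ", Arrays.copyOfRange(tokens, i, i + n));
            score += Math.log(freq.getOrDefault(key, FLOOR));
        }
        return score;
    }

    public static void main(String[] args) {
        // Toy "trained" model with two known bigrams.
        Map<String, Double> freq = new HashMap<>();
        freq.put("the player", 0.5);
        freq.put("player scored", 0.5);
        String[] similar = {"the", "player", "scored"};
        String[] unrelated = {"a", "castle", "appeared"};
        // A text built from frequent training n-grams scores higher.
        System.out.println(getScore(similar, freq, 2) > getScore(unrelated, freq, 2)); // true
    }
}
```

Since every p(x) ≤ 1, each log term is ≤ 0 and the score is a negative number; "higher" here means closer to zero.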
If you take a look at the tests implemented in the edu.uoc.mecm.eda.tests.NgramFrequencyScorerTest class, you will see that the first text scores higher than the second, and the second scores higher than the third. Explain why this happens; you may need to analyse both the training texts and the evaluation texts.