Design our own n-gram model

Reference no: EM132412761

Exercise

An n-gram is a sequence of elements or tokens that appear together in a document or in a longer sequence of tokens, where n is the size of the sequence. For instance, take the sentence:

"An n-gram is a sequence of tokens"

The 2-grams (bigrams) that make up the sentence are (An, n-gram), (n-gram, is), (is, a), (a, sequence), (sequence, of), (of, tokens). The 3-grams (trigrams) that make up the sentence are: (An, n-gram, is), (n-gram, is, a), (is, a, sequence), (a, sequence, of), (sequence, of, tokens).
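The sliding-window idea behind these lists can be sketched in Java. This is only an illustration: the class name NgramExtractionSketch, the method ngrams() and the sample tokens are hypothetical and are not part of the assignment's code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, used here only to illustrate n-gram extraction.
final class NgramExtractionSketch {

    // Returns every run of n consecutive tokens, in order of appearance.
    static List<List<String>> ngrams(List<String> tokens, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        List<List<String>> result = new ArrayList<>();
        // A sequence of k tokens yields k - n + 1 n-grams (none if k < n).
        for (int i = 0; i + n <= tokens.size(); i++) {
            // Copy the window so the result does not depend on the input list.
            result.add(new ArrayList<>(tokens.subList(i, i + n)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample tokens.
        List<String> tokens = Arrays.asList("the", "team", "won", "the", "match");
        // prints [[the, team], [team, won], [won, the], [the, match]]
        System.out.println(ngrams(tokens, 2));
        System.out.println(ngrams(tokens, 3));
    }
}
```

Because the window size is a parameter, the same loop covers bigrams, trigrams and any larger n.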

N-grams are very popular structures, mainly because of their use in natural language processing, a branch of computer science that aims to have machines process human natural language adequately. Their popularity lies in their ability to detect common patterns in related documents. For instance, in sports texts we commonly find bigrams like (the, player) or (the, team), whereas in other kinds of texts, such as fantasy novels, we are more likely to find bigrams like (the, damsel), (a, castle), etc.

In this exercise we will design our own n-gram model which will help us score texts based on their similarity to different reference texts. To do so:

In the edu.uoc.mecm.eda.ngram.NgramFrequencyScorer class you will have to implement the train() method. This method takes the path to a folder on the file system and reads all files ending with .txt. These files become the training set for our text scoring model. For each file, the method extracts all the tokens in the text. You will have to complete the method so that it calculates the relative frequency of every n-gram that appears in the training set (globally, across all texts). The size of the n-grams is specified when the class is initialized (attribute numWords in the class). Your code must be generic and work with n-grams of any size.
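One possible way to compute global relative frequencies is sketched below. It is not the actual NgramFrequencyScorer API: the class NgramFrequencySketch and its method are hypothetical, file reading and tokenization are left out (each document is assumed to be an in-memory list of tokens), and an n-gram is assumed to be representable as its tokens joined by a single space, which only works if tokens contain no spaces. A hash map is used because counting needs one lookup and one update per n-gram.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real train() method reads .txt files from a folder.
final class NgramFrequencySketch {

    // Counts n-grams over all documents, then divides each count by the
    // total number of n-grams seen in the whole training set.
    static Map<String, Double> relativeFrequencies(List<List<String>> documents, int numWords) {
        Map<String, Integer> counts = new HashMap<>();
        long total = 0;
        for (List<String> tokens : documents) {
            // Each document is windowed on its own, so no n-gram spans two files.
            for (int i = 0; i + numWords <= tokens.size(); i++) {
                // Assumption: tokens contain no spaces, so the joined key is unambiguous.
                String key = String.join(" ", tokens.subList(i, i + numWords));
                counts.merge(key, 1, Integer::sum);
                total++;
            }
        }
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            frequencies.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return frequencies;
    }
}
```

For example, with the two made-up documents [the, team, won] and [the, team, lost] and numWords = 2, there are four bigrams in total, so "the team" gets a relative frequency of 2/4 and each of the other two gets 1/4.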

Once we have calculated the relative frequency of the n-grams of the training set, we are ready to score other texts. All texts that are similar to the training set will have a better score than other less similar texts. N-gram relative frequency can be seen as the probability of appearance of an n-gram in the training texts (p(x) where x is an n-gram and p(x) is its relative frequency). Therefore, if we assume independence between n-grams of the same text, we can evaluate a text with our model using the expression:

$$\mathrm{getScore}(X) = \sum_{x \in X} \log\big(p(x)\big)$$

where X is the set of all n-grams that compose the text. Complete the getScore() method, which takes a text file as its input parameter and returns the text's score based on our n-gram model.
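A minimal sketch of the scoring sum follows, under several assumptions that the exercise statement does not fix: the text is already tokenized, n-grams are keyed as space-joined strings, the sum runs over every n-gram occurrence in the text, the natural logarithm is used, and n-grams that never appeared in training receive a caller-supplied probability (the hypothetical parameter unseenProbability). The class and method names are also hypothetical, and the provided unit tests may expect a different treatment of unseen n-grams.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch; the real getScore() method takes a text file as input.
final class NgramScoreSketch {

    // Sums log(p(x)) over every n-gram occurrence x in the token list.
    static double score(List<String> tokens, Map<String, Double> frequencies,
                        int numWords, double unseenProbability) {
        double score = 0.0;
        for (int i = 0; i + numWords <= tokens.size(); i++) {
            String key = String.join(" ", tokens.subList(i, i + numWords));
            // Assumption: an unseen n-gram is given a small fixed probability.
            // Without it, p(x) = 0 and Math.log(0.0) is -Infinity.
            double p = frequencies.getOrDefault(key, unseenProbability);
            score += Math.log(p);
        }
        return score;
    }
}
```

Since every probability is at most 1, every term is at most 0, so scores are never positive and a higher score means a score closer to zero.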

If you take a look at the tests implemented in the edu.uoc.mecm.eda.tests.NgramFrequencyScorerTest class, you will see that the score of the first text is higher than that of the second, which in turn is higher than that of the third. Explain why this happens. You may need to analyse both the training texts and the evaluation texts.
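As a purely illustrative calculation (the probabilities are made up, the natural logarithm is assumed, and these are not the values from the actual tests), compare two texts of three n-grams each. In the first, every n-gram has relative frequency 0.1 in the training set; in the second, every n-gram has relative frequency 0.01:

$$\ln 0.1 + \ln 0.1 + \ln 0.1 \approx -6.91, \qquad \ln 0.01 + \ln 0.01 + \ln 0.01 \approx -13.82$$

The text whose n-grams are more frequent in the training set gets the higher score (closer to zero). Note also that, because each term is at most zero, a text with more n-grams accumulates more negative terms, so text length influences the score as well.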


Reviews

len2412761

12/6/2019 11:56:38 PM

The solution must use Java and run in IntelliJ IDEA. Exercises 3 and 4 are worth 30% each. In these exercises, the following will be evaluated: the correctness of the source code (passing all available unit tests without changing them in any way), the choice of the most appropriate data structure, the justification for that choice, and the code's legibility. section 3 searching link to website book



All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd