Homophone disambiguation: unigram and bigram models

The BBC hosts a homophone quiz on its website; your task for this lab is to develop an automatic method for completing the quiz, with the aim being to get as high a score as possible. Since this is the fourth lab, and you have no doubt become language processing experts, we have chosen the advanced quiz!

The questions are as follows (the blank indicates the position of the disputed word, and the words appearing in brackets at the end are the possible options).

1. I don't know ____ to go out or not (weather/whether)

2. Houses were being built on this ____ (site/sight)

3. We went ____ the door to get inside (through/threw)

4. I really want a ____ car (new/knew)

5. They all had a ____ of the cake (piece/peace)

6. She had to go to ____ to prove she was innocent (caught/court)

7. We were only ____ to visit at certain times (allowed/aloud)

8. We had to ____ a car while we were on holiday (hire/higher)

9. Tip the jug and ____ lots of cream on the strawberries (poor/pour/paw)

10. She went back to ____ she had locked the door (check/cheque)

For the purposes of this exercise, you will use a simple language model to estimate the probability that each of the candidate words is correct. To do this you will need to compute the frequency of each word in a large corpus. Additionally, you will need a count of all bigrams (two-word sequences) in the corpus. With NLTK both counts are straightforward to obtain. For this lab we will use the entire Brown corpus to get these counts.
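A minimal sketch of the counting step, assuming the corpus is available as a flat list of tokens. The toy token list is only a stand-in; in the lab the tokens would come from the Brown corpus (for example via `nltk.corpus.brown.words()`, which needs the corpus data to be installed). Lower-casing the tokens is an assumption, not something the assignment prescribes, and `count_ngrams` is an illustrative name.

```python
from collections import Counter

def count_ngrams(tokens):
    """Return (unigram_counts, bigram_counts) for a sequence of tokens."""
    words = [t.lower() for t in tokens]        # assumption: case-insensitive counts
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))   # adjacent word pairs
    return unigrams, bigrams

# Toy stand-in for the corpus; swap in the Brown corpus tokens for the lab.
toy_tokens = "We went through the door and then through the gate".split()
unigrams, bigrams = count_ngrams(toy_tokens)
print(unigrams["through"], bigrams[("through", "the")])
```

A `Counter` returns 0 for words or pairs it has never seen, which keeps later lookups free of special cases.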

1 Unigram Model

The first part of this lab is to use a very simple model to select the word which goes in the blank: simply pick the most frequent word (using the unigram frequencies above). You should write a Python program to read in the sample sentences available.

Your program should then output for every sentence the candidate word it thinks should go in the blank.
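One possible shape for the unigram chooser, assuming each input line ends with the candidates in brackets separated by slashes, as in the questions listed earlier. The names `parse_candidates` and `pick_unigram` and the toy counts are illustrative, not part of the assignment.

```python
import re
from collections import Counter

CANDIDATES = re.compile(r"\(([^()]*)\)\s*$")   # trailing "(a/b/c)" group

def parse_candidates(line):
    """Extract the slash-separated candidate words from the end of a line."""
    match = CANDIDATES.search(line)
    if match is None:
        raise ValueError("no candidate list found: " + line)
    return [w.strip().lower() for w in match.group(1).split("/")]

def pick_unigram(line, unigrams):
    """Return the candidate with the highest corpus frequency."""
    return max(parse_candidates(line), key=lambda w: unigrams[w])

# Made-up counts purely for illustration; real counts come from the corpus.
toy_counts = Counter({"new": 50, "knew": 20})
print(pick_unigram("4. I really want a car (new/knew)", toy_counts))
```

Note that this model ignores the sentence entirely: the same candidate wins wherever the blank appears.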

2 Bigram Model

The second method you should attempt is to make use of the bigram counts to determine which of the potential candidates makes the whole sentence more probable (i.e. you should develop a basic language model). If one is willing to make certain assumptions, the probability of a sequence of words $w_1, w_2, w_3, \ldots, w_n$ is given by:

$$P(w_1, w_2, \ldots, w_n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, \ldots, w_{n-1})$$

When using a bigram language model, we approximate the above probability by conditioning each word only on the previous word:

$$P(w_1, w_2, \ldots, w_n) \approx P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_2) \cdots P(w_n \mid w_{n-1})$$

You should think about the entire calculation you need to make, and which parts of it are common to all possible choices in the blank space for the homophone disambiguation task.
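Every candidate is inserted into the same sentence, so all factors that do not involve the blank are identical across candidates and cancel when the candidates are compared. Only the bigrams touching the blank matter. A sketch under that observation, where `bigram_prob(prev, word)` is a hypothetical callable supplying P(word | prev) and the probabilities are made up for illustration:

```python
def score_candidate(left, candidate, right, bigram_prob):
    """Product of the two bigram factors that depend on the candidate.

    left and right are the words immediately before and after the blank.
    """
    return bigram_prob(left, candidate) * bigram_prob(candidate, right)

def pick_bigram(left, candidates, right, bigram_prob):
    """Return the candidate that gives the highest candidate-dependent score."""
    return max(candidates,
               key=lambda c: score_candidate(left, c, right, bigram_prob))

# Made-up probabilities purely for illustration.
toy_probs = {("a", "piece"): 0.02, ("piece", "of"): 0.6,
             ("a", "peace"): 0.01, ("peace", "of"): 0.05}

def toy_prob(prev, word):
    return toy_probs.get((prev, word), 0.0)

print(pick_bigram("a", ["piece", "peace"], "of", toy_prob))
```

When the blank is the last word of the sentence there is no right-hand neighbour unless an end-of-sentence marker is counted; how to treat that case is left open here.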

We estimate the bigram probabilities in the equation above using counts from a large corpus.

The standard way to estimate bigram probabilities is:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i)}{\mathrm{count}(w_{i-1})}$$
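The count-ratio estimate, count(prev, word) / count(prev), might be written as below. The function name and the toy sentence are illustrative, and returning 0 for a history word that never occurs is an assumption.

```python
from collections import Counter

def mle_bigram_prob(prev, word, unigrams, bigrams):
    """Unsmoothed estimate: count(prev, word) / count(prev)."""
    if unigrams[prev] == 0:
        return 0.0            # assumption: unseen history gets probability 0
    return bigrams[(prev, word)] / unigrams[prev]

# Toy data standing in for the corpus counts.
words = "a piece of cake and a piece of pie and a slice".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
print(mle_bigram_prob("a", "piece", unigrams, bigrams))
print(mle_bigram_prob("a", "peace", unigrams, bigrams))
```

Any bigram that never occurs in the corpus gets probability 0 under this estimate, which makes the score of a candidate 0 as soon as one of its bigrams is unseen.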

3 Smoothing

Results for the task can be improved using smoothing. Implement the "Plus One Bi-gram Smoothing" that was described in lecture. The bigram probabilities are estimated as:

$$P(w_i \mid w_{i-1}) = \frac{\mathrm{count}(w_{i-1}, w_i) + 1}{\mathrm{count}(w_{i-1}) + V}$$

where V is the number of distinct words in the training corpus (i.e. the number of word types).
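A sketch of the plus-one estimate, (count(prev, word) + 1) / (count(prev) + V), taking V as the number of distinct keys in the unigram counts. The function name and toy sentence are illustrative only.

```python
from collections import Counter

def add_one_bigram_prob(prev, word, unigrams, bigrams):
    """Plus-one estimate: (count(prev, word) + 1) / (count(prev) + V)."""
    vocab_size = len(unigrams)            # V = number of word types
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)

# Toy data standing in for the corpus counts (7 word types).
words = "a piece of cake and a piece of pie and a slice".split()
unigrams = Counter(words)
bigrams = Counter(zip(words, words[1:]))
print(add_one_bigram_prob("a", "piece", unigrams, bigrams))
print(add_one_bigram_prob("a", "peace", unigrams, bigrams))
```

Unseen bigrams now receive a small non-zero probability, so a single unseen pair no longer eliminates a candidate outright.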

4 Hand-in

Hand in four files:

1. A Python program called lab4a.py that reads from standard input a file of sentences in the format of the supplied test file and writes to standard output one word per line. The word on the k-th line should be the homophone, from the candidates at the end of the k-th input sentence, that the unigram model (section 1 above) predicts is most probable to fill the blank in that sentence.

2. A Python program called lab4b.py which is the same as lab4a.py, except that the words proposed should be the homophones deemed most probable by the bigram model (section 2 above).

3. A Python program called lab4c.py which is the same as lab4b.py, except that the words proposed should be the homophones deemed most probable by the bigram model with plus-one smoothing (section 3 above).

4. A brief report (maximum 1 side of A4; half a side is fine) called lab4-report (.doc or .pdf) that:

- Describes how your programs work and reports the result for each.

- Discusses why you get the results you get.
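The input/output contract shared by the three programs could be wrapped as below. This is only a skeleton: `choose` is a hypothetical placeholder for whichever model is being handed in, `choose_first` is a dummy model, and the bracketed-candidates line format is assumed from the questions listed earlier.

```python
import io
import re

def candidates_of(line):
    """Return the slash-separated words from the trailing '(a/b)' group."""
    match = re.search(r"\(([^()]*)\)\s*$", line)
    return [w.strip() for w in match.group(1).split("/")] if match else []

def run(infile, outfile, choose):
    """Write one chosen word per input sentence, in input order."""
    for line in infile:
        line = line.strip()
        if not line:
            continue                      # skip blank lines between sentences
        options = candidates_of(line)
        if options:
            outfile.write(choose(line, options) + "\n")

def choose_first(line, options):
    """Dummy model: always proposes the first candidate."""
    return options[0]

# Demonstration on in-memory text instead of real standard input.
demo_in = io.StringIO(
    "4. I really want a car (new/knew)\n"
    "\n"
    "10. She went back to she had locked the door (check/cheque)\n"
)
demo_out = io.StringIO()
run(demo_in, demo_out, choose_first)
print(demo_out.getvalue(), end="")
```

In the actual programs the call would be `run(sys.stdin, sys.stdout, ...)` with the unigram, bigram, or smoothed bigram chooser passed in place of the dummy.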

