COMP3220 Document Processing and the Semantic Web Assignment

Reference no: EM132817932

COMP3220 Document Processing and the Semantic Web - Macquarie University

Assignment - Python for Text Processing

Objectives of this assignment

In this assignment you will practise using Python for text processing. The assignment consists of 5 independent tasks, each worth 5 marks.

The deadline of this assignment is before census date, so that it can serve as a diagnostic test and you can determine whether you want to remain in the unit or to withdraw without academic penalty.

You are provided with a template that contains the definitions of the functions that you need to implement in each of the tasks below. The template includes simple Python doctests that you can use to check the correctness of the code. These tests are there to help you, but note that we will use the unittest framework with a separate set of tests when we assess your submission. It is your responsibility to run your own tests, in addition to the doctests provided.
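As an illustration of the difference between the two kinds of test, the sketch below attaches a doctest to a function and then checks the same function with the unittest framework. The function count_tokens and the test class are hypothetical; they are not part of the assignment template.

```python
import unittest


def count_tokens(text):
    """Count whitespace-separated tokens (hypothetical helper, not from the template).

    >>> count_tokens("to be or not to be")
    6
    """
    return len(text.split())


class CountTokensTest(unittest.TestCase):
    """Extra checks that go beyond the single doctest above."""

    def test_empty_string(self):
        self.assertEqual(count_tokens(""), 0)

    def test_extra_whitespace(self):
        self.assertEqual(count_tokens("  a   b "), 2)
```

Assuming the code is saved as example.py (a hypothetical file name), the doctest can be run with "python -m doctest -v example.py" and the unittest cases with "python -m unittest example.py".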

Each function will process a document from the NLTK Gutenberg corpus. This document is specified as an input argument to the function.

The Tasks

1. Find the top stems

Implement a function get_top_stems that returns a list of the n most frequent stems that are not in NLTK's list of stop words. To determine whether a word is a stop word, remember to lowercase the word. The list must be sorted by frequency in descending order and the words must preserve their original casing. The input arguments of the function are:

document: The name of the Gutenberg document, e.g. "austen-emma.txt".
n: The number of stems to return.

To produce the correct results, the function must do this:

Use the NLTK libraries to find the tokens and the stems.
Use NLTK's sentence tokeniser before NLTK's word tokeniser.
Use NLTK's list of stop words, and compare your words with those of the list after lowercasing.
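The steps listed for this task can be sketched as follows. This is one possible reading, not the reference solution. It assumes the Porter stemmer (the task does not name a stemmer), filters stop words on the token rather than on the stem, and keeps punctuation tokens because the task does not say to remove them. The helper name top_stems is hypothetical.

```python
from collections import Counter


def top_stems(tokenised_sentences, stop_words, stem, n):
    """Return the n most frequent stems of the tokens that are not stop words.

    tokenised_sentences: list of sentences, each a list of tokens.
    stop_words: set of lowercase stop words.
    stem: function mapping a token to its stem.
    """
    counts = Counter(
        stem(token)
        for sentence in tokenised_sentences
        for token in sentence
        if token.lower() not in stop_words  # compare after lowercasing
    )
    return [item for item, _ in counts.most_common(n)]


def get_top_stems(document, n):
    # Imported here so that top_stems can be tried without the NLTK data installed.
    import nltk
    from nltk.corpus import gutenberg, stopwords
    from nltk.stem import PorterStemmer  # assumption: the task does not name a stemmer

    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    # Sentence tokenisation first, then word tokenisation of each sentence.
    sentences = [
        nltk.word_tokenize(sentence)
        for sentence in nltk.sent_tokenize(gutenberg.raw(document))
    ]
    return top_stems(sentences, stop_words, stemmer.stem, n)
```

Recent NLTK versions of PorterStemmer.stem lowercase their output by default, so how the casing requirement interacts with the chosen stemmer should be checked against the template's doctests.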

2. Find the top PoS bigrams
Implement a function get_top_pos_bigrams that returns a list with the n most frequent bigrams of parts of speech. Do not remove stop words. The list of bigrams must be sorted by frequency in descending order. The input arguments are:

document: The name of the Gutenberg document, e.g. "austen-emma.txt".
n: The number of bigrams to return.

To produce the correct results, the function must do this:

Use NLTK's pos_tag_sents instead of pos_tag.
Use NLTK's "universal" PoS tagset.
When computing bigrams, do not consider parts of speech of words that are in different sentences. For example, if we have this text: "Sentence 1. And sentence 2" the bigrams are: ('NOUN','NUM'), ('NUM','.'), ('CONJ','NOUN'), ('NOUN','NUM'). Note that ('.','CONJ') would not be a valid bigram, since the punctuation mark and the word "And" are in different sentences.
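A minimal sketch of the sentence-boundary rule: bigrams are built inside each tagged sentence separately, so no pair ever spans two sentences. The helper name top_pos_bigrams is hypothetical, and returning bare bigrams (without their counts) is an assumption about the expected output format.

```python
from collections import Counter


def top_pos_bigrams(tagged_sentences, n):
    """Return the n most frequent PoS bigrams, never crossing a sentence boundary.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        # Pair each tag with its successor within the same sentence only.
        counts.update(zip(tags, tags[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]


def get_top_pos_bigrams(document, n):
    # Imported here so that top_pos_bigrams can be tried without the NLTK data installed.
    import nltk
    from nltk.corpus import gutenberg

    sentences = [
        nltk.word_tokenize(sentence)
        for sentence in nltk.sent_tokenize(gutenberg.raw(document))
    ]
    tagged = nltk.pos_tag_sents(sentences, tagset="universal")
    return top_pos_bigrams(tagged, n)
```

Applied to the hand-tagged example from the task, [[("Sentence", "NOUN"), ("1", "NUM"), (".", ".")], [("And", "CONJ"), ("sentence", "NOUN"), ("2", "NUM")]], the helper counts ('NOUN','NUM') twice and never produces ('.','CONJ').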

3. Find the distribution of frequencies of parts of speech after a given word
Implement a function get_pos_after that returns the distribution of the parts of speech of the words that follow a word given as an input to the function. The result must be returned in descending order of frequency. The input arguments of the function are:

document: The name of the Gutenberg document, e.g. "austen-emma.txt".
word: The word.

To produce the correct results, the function must do this:

First do sentence tokenisation, then word tokenisation.
Do not consider words that occur in different sentences. Thus, if a word ends a sentence, there are no words following it.
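The two rules for this task can be sketched as below. Several details are assumptions, since the task leaves them open: the "universal" tagset is reused from task 2, the word is matched with its exact casing, and the distribution is returned as a list of (tag, count) pairs. The helper name pos_after is hypothetical.

```python
from collections import Counter


def pos_after(tagged_sentences, word):
    """Return (tag, count) pairs for the tags that follow `word`, most frequent first.

    tagged_sentences: list of sentences, each a list of (word, tag) pairs.
    """
    counts = Counter()
    for sentence in tagged_sentences:
        # zip stops at the last token, so a sentence-final word has no follower.
        for (current, _), (_, next_tag) in zip(sentence, sentence[1:]):
            if current == word:  # assumption: exact, case-sensitive match
                counts[next_tag] += 1
    return counts.most_common()


def get_pos_after(document, word):
    # Imported here so that pos_after can be tried without the NLTK data installed.
    import nltk
    from nltk.corpus import gutenberg

    sentences = [
        nltk.word_tokenize(sentence)
        for sentence in nltk.sent_tokenize(gutenberg.raw(document))
    ]
    # Assumption: the task does not name a tagset; "universal" is borrowed from task 2.
    tagged = nltk.pos_tag_sents(sentences, tagset="universal")
    return pos_after(tagged, word)
```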

4. Get the words with highest tf.idf
In this exercise you will implement a simple approach to find keywords in a document.

Implement a function get_top_word_tfidf that returns the list of the n words with the highest tf.idf. The result must be returned in descending order of tf.idf. The input arguments are:

document: The name of the Gutenberg document, e.g. "austen-emma.txt".
n: The number of words to return.

To produce the correct results, the function must do this:

Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the documents of the NLTK Gutenberg corpus.

5. Get the sentences with highest average of tf.idf
In this exercise you will implement a simple document summariser that returns the most important sentences based on the average tf.idf.

Implement a function get_top_sentence_tfidf that returns the positions of the sentences which have the largest average tf.idf. The list of sentence positions must be returned in the order of occurrence in the document. The input arguments are:

document: The name of the Gutenberg document, e.g. "austen-emma.txt".
n: The number of sentence positions to return.

The reason for returning the sentence positions in the order of occurrence, and not in order of average tf.idf, is that this is what document summarisers normally do.

To produce the correct results, the function must do this:

Use Scikit-learn's TfidfVectorizer.
Fit the tf.idf vectorizer using the sentences of the documents of the NLTK Gutenberg corpus. This differs from task 4: here you compute the tf.idf of sentences, not of documents.
Use NLTK's sentence tokeniser to find the sentences.
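The sketch below shows one way to rank sentences. The task does not define "average tf.idf" precisely; here it is assumed to mean the sum of a sentence's tf.idf weights divided by the number of distinct vocabulary terms it contains. Positions are assumed to be 0-based, and the helper name top_sentence_positions is hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


def top_sentence_positions(corpus_sentences, document_sentences, n):
    """Return the positions of the n sentences with the highest average tf.idf.

    The vectorizer is fitted on corpus_sentences (one row per sentence) and
    applied to document_sentences; positions come back in document order.
    """
    vectorizer = TfidfVectorizer()  # assumption: default settings
    vectorizer.fit(corpus_sentences)
    matrix = vectorizer.transform(document_sentences)  # CSR, one row per sentence

    sums = np.asarray(matrix.sum(axis=1)).ravel()
    term_counts = np.diff(matrix.indptr)  # stored (non-zero) terms per row
    # Assumption: average over the terms present; empty sentences score 0.
    averages = np.divide(sums, term_counts,
                         out=np.zeros_like(sums), where=term_counts > 0)

    # Stable sort so that ties keep the earlier sentence.
    best = np.argsort(-averages, kind="stable")[:n]
    return sorted(int(i) for i in best)


def get_top_sentence_tfidf(document, n):
    # Imported here so that the helper can be tried without the NLTK data installed.
    import nltk
    from nltk.corpus import gutenberg

    corpus_sentences = [
        sentence
        for fileid in gutenberg.fileids()
        for sentence in nltk.sent_tokenize(gutenberg.raw(fileid))
    ]
    document_sentences = nltk.sent_tokenize(gutenberg.raw(document))
    return top_sentence_positions(corpus_sentences, document_sentences, n)
```

With the default L2 normalisation, a sentence containing a single vocabulary term gets an average of 1.0 under this definition, so very short sentences tend to rank highly. Averaging over the whole vocabulary instead is another plausible reading and would give different results.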

Submission
The submission must be a single Python file. Do not submit several files or a zip file since the automarker would not know what to do with your submission. Do not submit a Jupyter notebook.

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd