CITS1401 Computational Thinking with Python Project 2 Semester 2 2019
Project: Using Stylometry to Verify Authorship
To be done individually.
You should construct a Python 3 program containing your solution to the following problem and submit your program electronically on LMS. No other method of submission is allowed.
You are expected to have read and understood the University's guidelines on academic conduct. In accordance with this policy, you may discuss with other students the general principles required to understand this project, but the work you submit must be the result of your own effort. Plagiarism detection, and other systems for detecting potential malpractice, will therefore be used. Besides, if what you submit is not your own work then you will have learnt little and will therefore, likely, fail the final exam.
You must submit your project before the submission deadline listed above. Following UWA policy, a late penalty of 10% will be deducted for each day (or part day), after the deadline, that the assignment is submitted. However, in order to facilitate marking of the assignments in a timely manner, no submissions will be allowed after 7 days following the deadline.
Overview
UWA, like every university around the country (probably around the galaxy), is very worried about ghost-written submissions for assignments. This is also known as contract cheating. Whatever you call it, ghost-writing is about getting someone else to do your work, but submitting it as if it were only your work. In this case we are concerned with essays. The incidence is believed to be low, but it's clearly not a good thing.
Coming from a different angle, debates have raged at various times about whether different authors' works were actually by those authors. For example, were all the works attributed to William Shakespeare actually by him? One approach to examining both of these issues is to use stylometry. That is, rather than looking directly at the content of texts, as one does when looking for suspected plagiarism, stylometry looks for stylistic similarities. In other words, it looks for similarities in the ways a particular author uses language, rather than similarities in the actual words on the page, on the assumption that an author will use a similar style for similar sorts of content, fiction, non-fiction, etc.
What you will do for this Project is write a program that reads in either one or two text files containing the works to be analysed and builds a profile for each. Then either the profile is listed, or if there are two text files, the two profiles are compared, returning a score which reflects the distance between the two works in terms of their style; low scores, down to 0, imply that the same author is likely responsible for both works, while large scores imply different authors.
Specification: What your program will need to do
Input:
Your program must define the function main with the following signature:
def main(textfile1, textfile2, feature)
The first and second arguments are the names of the text files, each containing a work to be analysed. The third argument is the type of feature that will be used to compare the document profiles. The allowed feature names are: "punctuation", "unigrams", "conjunctions" and "composite".
Output:
The function is required to return the following outputs in the order provided below:
• the score from a pairwise comparison rounded to four decimal places,
• the dictionary containing the profile of the first file (textfile1), and
• the dictionary containing the profile of the second file (textfile2)
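A minimal skeleton of this interface might look as follows. It is a sketch only: build_profile and distance are hypothetical placeholder helpers (not named in the specification), and rejecting an unknown feature name with ValueError is an assumption, since the specification does not say how invalid input should be handled.

```python
ALLOWED_FEATURES = ("punctuation", "unigrams", "conjunctions", "composite")


def build_profile(filename, feature):
    """Hypothetical placeholder: a real solution would read the file
    and count the items required by the chosen feature."""
    return {}


def distance(profile1, profile2):
    """Hypothetical placeholder for the pairwise comparison."""
    return 0.0


def main(textfile1, textfile2, feature):
    # Assumption: an unknown feature name is treated as an error.
    if feature not in ALLOWED_FEATURES:
        raise ValueError("unknown feature: " + repr(feature))
    profile1 = build_profile(textfile1, feature)
    profile2 = build_profile(textfile2, feature)
    # The score is rounded to four decimal places, as required.
    score = round(distance(profile1, profile2), 4)
    # Outputs are returned in the required order.
    return score, profile1, profile2
```

Returning the three values as a tuple keeps them in the order listed above.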
A more detailed specification
• For the purposes of this project, a sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character). Thus:
This is some text. This is yet more text
contains one sentence followed by the start of another sentence.
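The sentence rule above can be sketched with a regular expression. This is only an illustration: count_sentences is a hypothetical helper name, treating both ' and " as quotation marks is an assumption, and so is accepting the end of the text in place of trailing white space.

```python
import re

# A full-stop, question mark or exclamation mark that is followed by
# a quotation mark, white space, or (assumption) the end of the text.
SENTENCE_END = re.compile(r"""[.?!](?=["'\s]|$)""")


def count_sentences(text):
    """Count the sentence terminators that satisfy the rule above."""
    return len(SENTENCE_END.findall(text))
```

For the example text above, count_sentences returns 1, because the second run of words has no terminator. The full-stop inside a number such as 3.142 is not counted, because it is followed by a digit.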
• You are required to create the profile of each input file using a dictionary. The profile for each document will contain the number of occurrences of certain words (case insensitive) and pieces of punctuation.
• The words or punctuation marks counted depend on the input feature, which can be "punctuation", "unigrams", "conjunctions" or "composite".
• For conjunctions: your program is required to count the number of occurrences of the following words:
"also", "although", "and", "as", "because", "before", "but", "for", "if", "nor", "of",
"or", "since", "that", "though", "until", "when", "whenever", "whereas", "which", "while", "yet"
• For unigrams: your program is required to count the number of occurrences of each word in the files. Consider the following three lines of text contained in a file:
This is a Document.
This is only a document
A test should not cause problem
The word count will be: "a":3, "document":2, "this":2, "is":2, "only":1,
"test":1, "should":1, "not":1, "cause":1, "problem":1
• For punctuation: your program should count certain pieces of punctuation: comma and semicolon. In addition, your program should also count single-quote and hyphen, but only under certain circumstances. Specifically, your program should count single-quote marks, but only when they appear as apostrophes surrounded by letters, i.e. indicating a contraction such as "shouldn't" or "won't". (Apostrophe is being included as an indication of more informal writing, perhaps direct speech.) Your program should count dash (minus) signs, but only when they are surrounded by letters, indicating a compound word, such as "compound-word". Any other punctuation or letters, e.g. '.' when not at the end of a sentence, should be regarded as white space, and so serve to end words. For these purposes, strings of digits are also words, as they convey information. Therefore, in the unlikely event that a floating point number, such as 3.142, appears, it is regarded as two words.
Note: Some of the texts we will use include double hyphen, i.e. "--". This is to be regarded as a space character.
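One way to sketch the punctuation rules is with lookbehind and lookahead assertions, so that an apostrophe or hyphen is counted only when it has a letter on each side. The helper name punctuation_profile and the choice of the punctuation characters themselves as dictionary keys are assumptions made for illustration.

```python
import re


def punctuation_profile(text):
    """Count commas, semicolons, in-word apostrophes and in-word hyphens."""
    # A double hyphen is regarded as a space character.
    text = text.replace("--", " ")
    return {
        ",": text.count(","),
        ";": text.count(";"),
        # Apostrophe with a letter immediately before and after it.
        "'": len(re.findall(r"(?<=[A-Za-z])'(?=[A-Za-z])", text)),
        # Hyphen with a letter immediately before and after it.
        "-": len(re.findall(r"(?<=[A-Za-z])-(?=[A-Za-z])", text)),
    }
```

With this sketch, quotation marks around a word and a hyphen between digits (as in 3-4) are not counted, because neither is surrounded by letters.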
• For composite: the profile should contain the number of occurrences of the punctuation marks (as explained above) and of the conjunctions. In addition, your program should add to the profile two further parameters relating to the text: the average number of words per sentence and the average number of sentences per paragraph, where a paragraph is any number of sentences followed by a blank line or by the end of the text.
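The two averages could be computed along the following lines. This is a sketch under stated assumptions: sentence_averages is a hypothetical helper, a contraction or compound word is counted as a single word, the end of the text is accepted as ending a sentence, and the values are returned as a tuple because the specification above does not name the dictionary keys to use for them.

```python
import re

# Sentence terminator followed by a quote, white space or end of text.
SENTENCE_END = re.compile(r"""[.?!](?=["'\s]|$)""")
# Assumption: apostrophes and hyphens inside a word do not split it.
WORD = re.compile(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*")


def sentence_averages(text):
    """Return (words per sentence, sentences per paragraph)."""
    # Paragraphs are separated by blank lines; empty pieces are ignored.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = len(SENTENCE_END.findall(text))
    words = len(WORD.findall(text))
    # Guard against division by zero for empty or unterminated text.
    words_per_sentence = words / sentences if sentences else 0.0
    sentences_per_paragraph = sentences / len(paragraphs) if paragraphs else 0.0
    return words_per_sentence, sentences_per_paragraph
```

For a text of two paragraphs holding three sentences and nine words in total, this gives 3.0 words per sentence and 1.5 sentences per paragraph.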
• Each of the words and punctuation symbols should be placed, together with their respective counts, in a dictionary, which is called a profile.
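As an illustration of building such a profile for words, the sketch below counts unigrams and conjunctions. The helper names are hypothetical, the tokenizer is an assumption (it keeps a contraction or compound word as one word), and so is the decision to list every conjunction in the profile even when its count is 0.

```python
import re
from collections import Counter

CONJUNCTIONS = [
    "also", "although", "and", "as", "because", "before", "but", "for",
    "if", "nor", "of", "or", "since", "that", "though", "until", "when",
    "whenever", "whereas", "which", "while", "yet",
]


def words_in(text):
    """Assumed tokenizer: lowercase words made of letters and digits,
    optionally joined by a single apostrophe or hyphen."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


def unigram_profile(text):
    """Map every word in the text to its number of occurrences."""
    return dict(Counter(words_in(text)))


def conjunction_profile(text):
    """Map each listed conjunction to its count (0 if absent)."""
    counts = Counter(words_in(text))
    return {word: counts[word] for word in CONJUNCTIONS}
```

Lowercasing the text before counting makes the counts case insensitive, so "Document" and "document" share one key.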
• The first output by the main function is the distance between the corresponding profiles which should be computed using the standard distance formula:
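The formula itself does not appear in this copy of the specification. Assuming that "standard distance formula" means the Euclidean distance, i.e. the square root of the sum over all keys of (count in profile 1 - count in profile 2) squared, a sketch could look like this. Treating a key that is missing from one profile as a count of 0 is a further assumption, and distance is a hypothetical helper name.

```python
import math


def distance(profile1, profile2):
    """Euclidean distance between two profiles, rounded to 4 places."""
    # Assumption: compare over the union of keys, with missing keys as 0.
    keys = set(profile1) | set(profile2)
    total = sum(
        (profile1.get(key, 0) - profile2.get(key, 0)) ** 2 for key in keys
    )
    return round(math.sqrt(total), 4)
```

Under these assumptions, identical profiles give a score of 0.0, and the score grows as the counts diverge.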
• The second and third outputs returned by the main function are the profiles corresponding to the first and second text files respectively. The profiles are returned as dictionaries in which each word is a key and its value is the number of occurrences of that key, such as {"also":10, "got": 6}, where "also" and "got" are the keys and have occurred 10 and 6 times respectively.