Reference no: EM132618110
COMP 5070 Statistical Programming for Data Science - University of South Australia
Assignment
The task is dedicated to the memory of the project We Feel Fine. This project aimed to provide a data collection engine that periodically scoured the Internet to record human emotions or feelings from blogs and from sources such as LiveJournal, Blogger, Flickr, Technorati, Feedster, Ice Rocket and Google.
We Feel Fine scanned blog posts for occurrences of the phrases "I feel" and "I am feeling". Once a sentence was found to contain either of these key phrases, the full sentence was saved and scanned to see if it included one of about 5,000 pre-identified feelings. The full list of feelings (link given in (3) below) contains these valid feelings, as well as the total count of each feeling and the colour assigned to each feeling. For example, the first few lines of the file are:
The first line can be discarded. The second line contains the feeling (better), the count (872884) and the associated colour in hexadecimal (FFA401). I have provided code in the hints section that will show you how to use this colour.
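As an illustration of the line layout just described, here is a minimal sketch of parsing one line into a feeling, a count and a usable colour. It assumes the three fields are separated by tabs, which this brief does not state, and the names `Feeling`, `hex_to_rgb`, `parse_feeling_line` and `parse_feelings` are hypothetical helpers, not the code from the hints section.

```python
from typing import NamedTuple


class Feeling(NamedTuple):
    name: str
    count: int
    rgb: tuple  # (r, g, b), each scaled to the range 0.0-1.0


def hex_to_rgb(hex_colour: str) -> tuple:
    """Convert a 6-digit hex string such as 'FFA401' to floats in 0..1."""
    hex_colour = hex_colour.strip().lstrip("#")
    return tuple(int(hex_colour[i:i + 2], 16) / 255 for i in (0, 2, 4))


def parse_feeling_line(line: str) -> Feeling:
    """Parse one data line. ASSUMPTION: fields are tab-separated."""
    name, count, colour = line.rstrip("\n").split("\t")
    return Feeling(name.strip().lower(), int(count), hex_to_rgb(colour))


def parse_feelings(lines) -> dict:
    """Discard the first line and index the remaining lines by feeling name."""
    rows = [ln for ln in list(lines)[1:] if ln.strip()]
    return {f.name: f for f in map(parse_feeling_line, rows)}
```

Most plotting libraries accept either the `(r, g, b)` tuple or the original hex string prefixed with `#`, so only one of the two forms needs to be kept.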
The project website no longer works, and you do not need it for the assignment. You are provided with a zip file, countries.zip, containing an archived folder countries with the feelings collected for different countries.
For this assignment you are asked to write a program that will analyse and visualise mined feelings from the We Feel Fine data sets based on a default search and then user-driven searches. Note: You do NOT have to search for the phrases "I feel" and "I am feeling" as We Feel Fine have already done this work for you. We are going to analyse what they found.
There are five components in this job: user prompt, data loading, data analysis, plotting, report.
Specifically you need to:
Prompt the user for the country which will be mined. If the user chooses not to provide this information, then assume a default search of the United States. Try to make your communication with the user as friendly as possible, that is, as unrestrictive as possible about how the user enters countries. E.g. make no distinction between lower and upper case, and accept some common abbreviations, like US or USA for United States, or UK for United Kingdom.
If an illegal value is entered (e.g. 'new transavia' for country), you can ask again or try to fix it - google for the Levenshtein distance. Then ask the user to confirm your fix or change it to the right one. If your program fails to fix the illegal value for the country name, then do not include it in the data loading routine.
You may wish to use a text list of all countries in the world to help define valid countries. Note that the We Feel Fine data set does not necessarily cover all of the countries in this list.
Please don't be overwhelmed with complexity of this part, start with basic prompt and then gradually increase functionality. Suggested features are desirable but not compulsory.
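The country-matching ideas above can be sketched as follows. This is one possible approach, not a required design: the alias table, the `max_distance` threshold of 2 and the function names are all assumptions made for illustration, and `valid` stands for whatever non-empty list of lower-case country names you choose to use.

```python
ALIASES = {"us": "united states", "usa": "united states", "uk": "united kingdom"}
DEFAULT_COUNTRY = "united states"


def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def resolve_country(raw: str, valid: list, max_distance: int = 2):
    """Return (country, needs_confirmation); country is None if no fix is found."""
    text = " ".join(raw.lower().split())  # ignore case and stray whitespace
    if not text:
        return DEFAULT_COUNTRY, False     # empty input means the default search
    text = ALIASES.get(text, text)
    if text in valid:
        return text, False
    best = min(valid, key=lambda c: levenshtein(text, c))
    if levenshtein(text, best) <= max_distance:
        return best, True                 # a guess: ask the user to confirm it
    return None, False
```

When `needs_confirmation` is true, the program would show the suggested name to the user and accept either a confirmation or a corrected entry; a result of `None` means the entry is skipped.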
Allow the user a maximum of 5 countries to be successfully mined, although they are also allowed to enter fewer than 5 countries. Load the corresponding data files from the folder countries. Successful mining occurs when the feelings for each country have been recorded and returned to your program.
For each feeling in the full list of over 5000 feelings and their frequencies determine the number of times each feeling appears in the mined text, for each country. For any counts that are larger than 0, you will need to retain the third column of information which is the hexadecimal equivalent of the colour of the prescribed feeling.
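The counting step above can be sketched as below. The layout of the country files is not described in this brief, so the sketch assumes the mined feelings have already been extracted into an iterable of words; `count_feelings`, `palette` and `limit` are hypothetical names. The optional `limit` shows one way to restrict a query to the first N mined feelings.

```python
from collections import Counter
from itertools import islice


def count_feelings(mined, palette, limit=None):
    """Count occurrences of known feelings among the mined words.

    mined   -- iterable of feeling words found for one country
    palette -- dict mapping each valid feeling to its hex colour
    limit   -- if given, only the first `limit` mined words are used
    Returns {feeling: (count, hex_colour)} for counts larger than 0.
    """
    words = (w.strip().lower() for w in islice(mined, limit))
    counts = Counter(w for w in words if w in palette)
    return {f: (n, palette[f]) for f, n in counts.items()}
```

Because a `Counter` only holds feelings that were actually seen, every entry in the result has a count larger than 0 and carries its colour with it.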
For each country, produce a plot of ellipses where each ellipse represents a feeling, has a size proportional to the frequency of its occurrence, and is coloured based on the full list of feelings referenced above. Ellipse positions can be random. The code for this component is provided and explained below; however, you will need to make a number of adjustments to it.
Run the base query of data file World.txt to determine the first 1500 feelings mined by We Feel Fine from anywhere in the world. We will compare these mined feelings with the chosen countries. There is a substantial hint below explaining how you need to do this.
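For the ellipse plot described in the steps above, the geometry can be prepared separately from the drawing. The following is a rough sketch and not the provided plotting code: it assumes the counts are held as `{feeling: (count, hex_colour)}`, scales the width linearly with the count, and uses an arbitrary 0.6 aspect ratio and canvas size chosen only for illustration.

```python
import random


def ellipse_specs(counts, canvas=(100.0, 100.0), max_width=20.0, seed=None):
    """Turn {feeling: (count, hex_colour)} into drawable ellipse parameters.

    The most frequent feeling gets width `max_width`; the others scale
    linearly with their count. Height is fixed at 60% of the width.
    """
    if not counts:
        return []
    rng = random.Random(seed)  # a seed makes the random layout repeatable
    largest = max(n for n, _ in counts.values())
    specs = []
    for feeling, (n, colour) in counts.items():
        width = max_width * n / largest
        specs.append({
            "feeling": feeling,
            "xy": (rng.uniform(0, canvas[0]), rng.uniform(0, canvas[1])),
            "width": width,
            "height": 0.6 * width,
            "colour": "#" + colour,
        })
    return specs
```

If matplotlib is used for drawing, each entry could be passed to `matplotlib.patches.Ellipse(xy, width, height, facecolor=colour)` and added to the axes with `ax.add_patch`. Scaling the width by the square root of the count instead would make the area, rather than the width, proportional to the frequency.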
When your code is ready you have to choose any five countries and provide the following output:
The constructed path to load a data file for the country selected by the user or yourself.
The most popular feeling across the 5 countries you have chosen to explore plus the base query. If no feelings were mined, then report this fact.
A plot for each country of the ellipses generated by that country's feelings, as well as a plot of the results of the base query from Step 5.
Assuming darker colours and blues correspond to negative feelings and lighter, happier colours correspond to positive feelings, write a short description summarising the nature of each country as being generally optimistic or pessimistic. This description is to be written by yourself (not your program, unless you want to be REALLY fancy!) and at most two paragraphs will be sufficient. For the purpose of this assignment, one paragraph is 6-8 lines long.
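The "most popular feeling" output listed above can be computed by pooling the per-source counts. This sketch assumes the intended reading is a single combined total over the chosen countries and the base query; if a per-country answer is wanted, the same call can be applied to each source on its own. `most_popular` and `per_source` are hypothetical names.

```python
from collections import Counter


def most_popular(per_source):
    """per_source maps a source name (country or base query) to {feeling: count}.

    Returns (feeling, total_count), or None when nothing was mined.
    """
    totals = Counter()
    for counts in per_source.values():
        totals.update(counts)  # adds counts feeling by feeling
    return totals.most_common(1)[0] if totals else None
```

A return value of `None` is the cue to report that no feelings were mined.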
Analysis of supermarket transaction data
The data file supermarket.csv contains supermarket transaction data. Most variables are self-explanatory:
Department - department in supermarket.
Product - product name (encoded).
Checkout - checkout number.
Date - date of the transaction.
Time - time of the transaction.
Basket - basket ID, which is in fact the receipt number. Due to technical issues in the supermarket this ID is not unique: numbers can repeat in a later week. Hence, you cannot use this variable as the only identifier for a shopping trip, but you can combine it with other variables, like Date or Time.
Total - price paid for the product.
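Since Basket alone does not identify a shopping trip, one option is a composite key. The sketch below groups rows by the pair (Date, Basket); it assumes one row per purchased item and that a basket ID does not repeat within a single day, neither of which is guaranteed by the description above. `summarise_baskets` is a hypothetical helper.

```python
from collections import defaultdict


def summarise_baskets(rows):
    """Group product rows into shopping trips keyed by (Date, Basket).

    rows -- iterable of dicts with at least 'Date', 'Basket' and 'Total',
            for example as produced by csv.DictReader.
    Returns {(date, basket): {"items": n, "value": total_paid}}.
    """
    baskets = defaultdict(lambda: {"items": 0, "value": 0.0})
    for row in rows:
        key = (row["Date"], row["Basket"])
        baskets[key]["items"] += 1                  # one row = one item
        baskets[key]["value"] += float(row["Total"])
    return dict(baskets)
```

With the real file, the rows could come from `csv.DictReader(open("supermarket.csv", newline=""))`, provided the header names match the variable names listed above.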
You have to execute an analysis of the data and provide answers for the following research questions which are of great interest for many people in the retail industry:
Study the distribution of basket sizes measured by the number of items in a basket. What basket size is the most popular?
Study the relation between the number of items in the basket and the dollar value of the basket. Considering the different "popularity" of different basket sizes from question 1, how much money does the store get from each basket size? What kind of customers are more important - light (small baskets) or heavy (large baskets)?
What day of the week is the busiest for the supermarket in terms of the number of shopping trips (one basket = one shopping trip)? What day is the most profitable? For the last question, please consider total revenue or total sales as a proxy for profit.
For each question you have to provide an appropriate graph (or multiple graphs) and a brief discussion to present your findings, answer the research question and explain your graph.
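The numerical side of the three research questions can be sketched from per-basket summaries. This is only an outline under stated assumptions: baskets are held as `{(date_string, basket_id): {"items": n, "value": total}}`, there is at least one basket, and dates are written as YYYY-MM-DD (the actual format in supermarket.csv may differ, so `date_format` is a parameter). The function name and result keys are hypothetical.

```python
from collections import Counter, defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def analyse_baskets(baskets, date_format="%Y-%m-%d"):
    """Summarise trips and revenue by basket size and by day of the week."""
    trips_by_size = Counter()
    revenue_by_size = defaultdict(float)
    trips_by_day = Counter()
    revenue_by_day = defaultdict(float)
    for (date, _basket_id), info in baskets.items():
        day = DAYS[datetime.strptime(date, date_format).weekday()]
        trips_by_size[info["items"]] += 1
        revenue_by_size[info["items"]] += info["value"]
        trips_by_day[day] += 1
        revenue_by_day[day] += info["value"]
    return {
        "most_popular_size": trips_by_size.most_common(1)[0][0],
        "revenue_by_size": dict(revenue_by_size),
        "busiest_day": trips_by_day.most_common(1)[0][0],
        "top_revenue_day": max(revenue_by_day, key=revenue_by_day.get),
    }
```

The intermediate dictionaries are the natural inputs for the graphs: trips per basket size for question 1, revenue per basket size for question 2, and trips and revenue per weekday for question 3.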
The output of your program should be, at the minimum, the following information:
Brief introduction with a presentation of your data.
Answers to the research questions with relevant numerical and graphical outputs.
Short description summarising your findings.
Attachment:- Assignment.rar