Provide a data collection engine

Reference no: EM132618110

COMP 5070 Statistical Programming for Data Science - University of South Australia

Assignment

The task is dedicated to the memory of the project We Feel Fine. This project aimed to provide a data collection engine that periodically scoured the Internet to record human emotions or feelings from blogs hosted on sources such as LiveJournal, Blogger, Flickr, Technorati, Feedster, Ice Rocket and Google.

We Feel Fine scanned blog posts for occurrences of the phrases "I feel" and "I am feeling". Once a sentence was found to contain either of these key phrases, the full sentence was saved and scanned to see if it included one of about 5,000 pre-identified feelings. The full list of feelings (link given in (3) below) contains these valid feelings, as well as the total count of each feeling and the colour assigned to each feeling. For example, the first few lines of the file are:

The first line can be discarded. The second line contains the feeling (better), the count (872884) and the associated colour in hexadecimal (FFA401). I have provided code in the hints section that will show you how to use this colour.
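The line layout described above can be sketched as a small parser. This is only an illustration: it assumes the three columns are separated by whitespace, which the brief does not state, and the function name parse_feelings and the header text in the sample are made up. The values on the second sample line are the ones quoted above.

```python
def parse_feelings(lines):
    """Parse the feelings list into {feeling: (count, "#RRGGBB")}.

    Assumption: columns are whitespace-separated (feeling, count, hex colour).
    The first line is discarded, as the brief instructs.
    """
    feelings = {}
    for line in lines[1:]:
        parts = line.split()
        # Skip blank or malformed lines rather than failing on them.
        if len(parts) != 3 or not parts[1].isdigit():
            continue
        name, count, colour = parts
        feelings[name.lower()] = (int(count), "#" + colour.upper())
    return feelings


# Hypothetical sample: an invented header line plus the line quoted in the brief.
sample = [
    "header line to discard",
    "better 872884 FFA401",
]
parsed = parse_feelings(sample)
print(parsed)
```

Prefixing the colour with "#" gives a string that most plotting libraries accept directly as a colour value.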

The project website no longer works, and you do not need it for the assignment. You are provided with a zip file, countries.zip, containing an archived folder countries with the feelings collected for different countries.

For this assignment you are asked to write a program that will analyse and visualise mined feelings from the We Feel Fine data sets based on a default search and then user-driven searches. Note: You do NOT have to search for the phrases "I feel" and "I am feeling" as We Feel Fine have already done this work for you. We are going to analyse what they found.

There are five components in this job: user prompt, data loading, data analysis, plotting and report.

Specifically you need to:

Prompt the user for the country to be mined. If the user chooses not to provide this information, assume a default search of the United States. Try to make communication with the user as friendly as possible, that is, place as few restrictions as possible on how the user enters countries. For example, ignore differences between lowercase and uppercase, and accept common abbreviations such as US or USA for United States, or UK for United Kingdom.

If an illegal value is entered (e.g. 'new transavia' for country), you can ask again or try to fix it (search for the Levenshtein distance). Then ask the user to confirm your fix or change it to the right one. If your program fails to fix the illegal country name, do not include it in the data loading routine.

You may wish to use a text list of all countries in the world to help define valid countries. Note that the We Feel Fine data set does not necessarily cover all of the countries in this list.

Please don't be overwhelmed by the complexity of this part. Start with a basic prompt and then gradually increase its functionality. The suggested features are desirable but not compulsory.
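One way to approach the country prompt is sketched below, under stated assumptions. The names ALIASES, KNOWN, levenshtein and resolve_country are hypothetical, the country list is a tiny stand-in for a real list of valid countries, and the cut-off of 3 edits is an arbitrary choice. The sketch normalises case and spacing, maps abbreviations, and falls back to the closest name by Levenshtein distance.

```python
# Hypothetical abbreviation table and a tiny stand-in for a full country list.
ALIASES = {"us": "United States", "usa": "United States", "uk": "United Kingdom"}
KNOWN = ["United States", "United Kingdom", "Australia", "New Zealand", "Canada"]


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def resolve_country(raw, known=KNOWN, max_distance=3):
    """Return (country, exact). country is None when no fix is found."""
    text = " ".join(raw.strip().lower().split())
    if not text:
        return "United States", True  # default search from the brief
    if text in ALIASES:
        return ALIASES[text], True
    by_lower = {name.lower(): name for name in known}
    if text in by_lower:
        return by_lower[text], True
    # Otherwise propose the closest known name if it is near enough.
    best = min(known, key=lambda name: levenshtein(text, name.lower()))
    if levenshtein(text, best.lower()) <= max_distance:
        return best, False
    return None, False


print(resolve_country("  USA "))
print(resolve_country("austrlia"))
print(resolve_country("new transavia"))
```

When the second element of the result is False, the program would ask the user to confirm the suggested name before loading any data; when the first element is None, the entry would be skipped or the user asked again.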

Allow the user a maximum of 5 countries to be successfully mined, although they are also allowed to enter fewer than 5 countries. Load the corresponding data files from the folder countries. Successful mining occurs when the feelings for each country have been recorded and returned to your program.

For each feeling in the full list of over 5000 feelings and their frequencies, determine the number of times that feeling appears in the mined text, for each country. For any counts that are larger than 0, you will need to retain the third column of information, which is the hexadecimal equivalent of the colour of the prescribed feeling.
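The counting step might look like the sketch below. It assumes the mined data for a country can be treated as plain text and that every feeling is a single word; neither is confirmed by the brief. The function names are hypothetical and the colour for "sad" is invented for the example.

```python
import re
from collections import Counter


def count_feelings(mined_text, colours):
    """Count known feelings in the text.

    colours maps feeling -> hex colour. Only feelings with a count above 0
    are returned, each paired with its colour: {feeling: (count, colour)}.
    """
    words = re.findall(r"[a-z']+", mined_text.lower())
    counts = Counter(word for word in words if word in colours)
    return {name: (n, colours[name]) for name, n in counts.items()}


def most_popular(counts):
    """Return the feeling with the highest count, or None if nothing was mined."""
    if not counts:
        return None
    return max(counts, key=lambda name: counts[name][0])


# Hypothetical inputs: the colour for "sad" is made up.
colours = {"better": "#FFA401", "sad": "#3366CC"}
text = "I feel better today. I am feeling much better, not sad at all, just tired."
counts = count_feelings(text, colours)
print(counts, most_popular(counts))
```

Returning None from most_popular gives the caller a simple way to report that no feelings were mined.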

For each country, produce a plot of ellipses where each ellipse represents a feeling, has a size proportional to the frequency of its occurrence, and is coloured based on the full list of feelings referenced above. Ellipse positions can be random. The code for this component is provided and explained below; however, you will need to make a number of adjustments to it.
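The following is not the provided plotting code, only a rough sketch of how ellipse parameters could be derived from the counts. The function name ellipse_specs, the canvas size, the maximum size and the 0.6 height ratio are all arbitrary assumptions. Width and height are scaled linearly with frequency here; scaling by the square root instead would make the area proportional.

```python
import random


def ellipse_specs(counts, canvas=(100.0, 100.0), max_size=20.0, seed=None):
    """Turn {feeling: (count, colour)} into a list of ellipse descriptions.

    Positions are random; width and height grow linearly with the count,
    with the most frequent feeling drawn at max_size.
    """
    if not counts:
        return []
    rng = random.Random(seed)
    largest = max(n for n, _ in counts.values())
    specs = []
    for name, (n, colour) in counts.items():
        scale = n / largest
        specs.append({
            "feeling": name,
            "x": rng.uniform(0, canvas[0]),
            "y": rng.uniform(0, canvas[1]),
            "width": max_size * scale,
            "height": 0.6 * max_size * scale,  # arbitrary aspect ratio
            "colour": colour,
        })
    return specs


# Hypothetical counts; the colour for "calm" is made up.
specs = ellipse_specs({"better": (4, "#FFA401"), "calm": (1, "#AABBCC")}, seed=1)
for spec in specs:
    print(spec)
```

Each dictionary could then be passed to a plotting library, for example as the centre, width, height and face colour of a matplotlib.patches.Ellipse.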

Run the base query of data file World.txt to determine the first 1500 feelings mined by We Feel Fine from anywhere in the world. We will compare these mined feelings with the chosen countries. There is a substantial hint below explaining how you need to do this.

When your code is ready you have to choose any five countries and provide the following output:

The constructed path to load a data file for the country selected by the user or yourself.

The most popular feeling across the 5 countries you have chosen to explore plus the base query. If no feelings were mined, then report this fact.

A plot for each country of the ellipses generated by each country's feelings, as well as a plot of the results of the base query from Step 5.

Assuming darker colours and blues correspond to negative feelings and lighter, happier colours correspond to positive feelings, write a short description summarising the nature of each country as being generally optimistic or pessimistic. This description is to be written by yourself (not your program, unless you want to be REALLY fancy!) and at most two paragraphs will be sufficient. For the purpose of this assignment, one paragraph is 6-8 lines long.

Analysis of supermarket transaction data

The data file supermarket.csv contains supermarket transaction data. Most variables are self-explanatory:

Department - department in supermarket.

Product - product name (encoded).

Checkout - checkout number.

Date - date of the transaction.

Time - time of the transaction.

Basket - basket ID, in effect the receipt number. Due to technical issues in the supermarket this ID is not unique: the same numbers can reappear in another week. Hence, you cannot use this variable as the only identifier for a shopping trip, but you can combine it with other variables, such as Date or Time.

Total - price paid for the product.

You have to execute an analysis of the data and provide answers for the following research questions which are of great interest for many people in the retail industry:

Study the distribution of basket sizes measured by the number of items in a basket. What basket size is the most popular?

Study the relation between the number of items in the basket and the dollar value of the basket. Considering the different "popularity" of different basket sizes from question 1, how much money does the store get from each basket size? What kind of customers are more important - light (small baskets) or heavy (large baskets)?

What day of the week is the busiest for the supermarket in terms of the number of shopping trips (one basket = one shopping trip)? What day is the most profitable? For the last question, please consider total revenue or total sales as a proxy for the profit.

For each question you have to provide an appropriate graph (or multiple graphs) and a brief discussion to present your findings, answer the research question and explain your graph.
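The three questions above share one aggregation, sketched here with the standard library. The sample rows are invented, the date format YYYY-MM-DD is an assumption about supermarket.csv, and the function name summarise is hypothetical. A shopping trip is identified by the pair (Date, Basket), since Basket alone is not unique.

```python
import csv
import io
from collections import Counter, defaultdict
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Invented sample rows using the column names from the brief.
SAMPLE = """Department,Product,Checkout,Date,Time,Basket,Total
Dairy,P1,3,2020-03-02,10:15,101,2.50
Dairy,P2,3,2020-03-02,10:15,101,1.50
Bakery,P3,1,2020-03-03,09:00,102,4.00
Bakery,P4,2,2020-03-09,17:30,101,3.00
"""


def summarise(rows, date_format="%Y-%m-%d"):
    """Aggregate rows into per-trip sizes and values, then by size and weekday."""
    items = Counter()            # trip -> number of items
    value = defaultdict(float)   # trip -> dollar value
    for row in rows:
        trip = (row["Date"], row["Basket"])  # composite key for one trip
        items[trip] += 1
        value[trip] += float(row["Total"])

    size_counts = Counter(items.values())    # basket size -> number of trips
    revenue_by_size = defaultdict(float)     # basket size -> total revenue
    trips_by_day = Counter()                 # weekday -> number of trips
    revenue_by_day = defaultdict(float)      # weekday -> total revenue
    for trip, n in items.items():
        revenue_by_size[n] += value[trip]
        day = DAYS[datetime.strptime(trip[0], date_format).weekday()]
        trips_by_day[day] += 1
        revenue_by_day[day] += value[trip]
    return size_counts, dict(revenue_by_size), trips_by_day, dict(revenue_by_day)


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
sizes, rev_size, trips_day, rev_day = summarise(rows)
print(sizes.most_common(1), rev_size, trips_day.most_common(1), rev_day)
```

In the sample, basket 101 appears on two different dates and is counted as two trips, which is the behaviour the note on the Basket variable asks for. If two trips can share a basket ID on the same date, Time could be added to the key.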

The output of your program should be, at the minimum, the following information:

Brief introduction with a presentation of your data.

Answers to the research questions with relevant numerical and graphical outputs.

Short description summarising your findings.

Attachment:- Assignment.rar
