Big Data Assignment -
This assignment includes two parts. For Part I, you are given two options. Option 1 involves writing an essay on big data techniques, and Option 2 involves writing a program for movie sentiment classification. For Part II, you will present your Part I work in the lab class with PowerPoint slides.
Part I - Essay/Program
You should pick one and only one option below. Moreover, your Part II presentation will be based on what you did in Part I. Hence you have to choose carefully and wisely based on your experience, expertise and preference.
Option 1 - Essay
Task Description - You are required to write an open-topic essay that discusses one of the important techniques used in big data, for example, Relational Database Management Systems (RDBMS), NoSQL databases, web APIs and data mining, cloud computing, MapReduce, Hadoop, predictive modelling, etc.
Note that your essay should focus on a single topic area in big data, and provide an in-depth and comprehensive discussion on important issues in the area of your focus. Depth is much more important than coverage for this assignment. Hence concentrating on a small topic with your own insight is much better than briefly touching everything superficially. For example, an essay talking about MongoDB or any other specific NoSQL database system and its usefulness in big data applications makes a good topic, whereas one that goes through all different types of general database systems does not, provided that both essays are about the same length.
Doing research is an important step for essay writing, as everything starts with proper reading and finding ideas from the reading materials. In the beginning, you can read the lecture notes to identify an area of your interest for the topic of your essay. However, to gain more understanding and insight, you should not confine your reading to the teaching materials alone. Instead, you need to reach out and look further into the issues of interest by checking out relevant literature, including but not limited to, websites and various online resources, reference books, published journal articles, etc.
Your essay should also be well organised and structured into sections with headings. Each section should focus on a single main point, e.g. introduction, current techniques and issues, possible future development, summary, etc. A section can usually be further divided into paragraphs depending on the points being covered. You should also include a bibliography or reference section at the end, containing references to all articles, books and online links cited in the main article. A size 14 font and 1.5 line spacing are required. Your essay should contain between 2,000 and 3,000 words, excluding the reference section. Importantly, your essay should not just report facts and findings from the literature; it should also include your own understanding and opinions on the topics discussed, backed up by your readings.
Option 2 - Program
Task Description - For this task, you will create a complete program to perform sentiment classification for movie reviews. You will use a large movie review dataset containing a set of 25,000 movie reviews for training, and 25,000 for testing.
Download the dataset and unzip it to a local directory. Inside the aclImdb/ directory created by the archive, you will find five items:
train/ - feature files and raw text files for the training set
test/ - feature files and raw text files for the testing set
imdb.vocab - text tokens for each feature index
imdbEr.txt - expected rating for each token
README - the readme file for more information on the dataset
For the purpose of this task, you only need the following files under the train/ and test/ directories.
train/labeledBow.feat - feature vector file for the training set
test/labeledBow.feat - feature vector file for the testing set
Each feature vector file contains 25,000 lines; each line holds the label value and the feature vector of word occurrences for the corresponding movie review in the training/testing set. For example, the following is the first line of train/labeledBow.feat:
9 0:9 1:1 2:4 ... 47304:1
This means that the first review has a rating of 9. The entry 0:9 indicates 9 occurrences of the word "the", 1:1 indicates 1 occurrence of the word "and", and 2:4 indicates 4 occurrences of the word "a", where "the", "and" and "a" are the first three tokens in the imdb.vocab file. The last entry, 47304:1, indicates one occurrence of the word "pettiness", the 47305th token in imdb.vocab. In other words, the input vector records the number of occurrences of each word/token appearing in the raw text. The features are highly sparse, i.e. the majority of the entries are zero, with only a few non-zero values.

Your program should be able to read the training/testing files and parse them into the label vector and feature vectors for all 25,000 input examples. The input files (*.feat) are in the libsvm/svmlight format and can be read into a matrix using the sklearn.datasets.load_svmlight_files function. To then use the data for training a classification model, you need to apply feature normalisation to the feature vectors. A recommended normalisation scheme for text data is the TF-IDF scheme. More information can be found in the lecture notes or online resources.
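A minimal sketch of the loading and normalisation step is shown below, assuming the aclImdb/ archive has been unzipped into the current working directory (adjust the file paths for your own setup). Loading both .feat files in a single call keeps the train and test feature spaces aligned.

from sklearn.datasets import load_svmlight_files
from sklearn.feature_extraction.text import TfidfTransformer

# Load the labelled bag-of-words files for both splits together so that
# the train and test matrices share the same feature indexing.
X_train, y_train, X_test, y_test = load_svmlight_files(
    ["aclImdb/train/labeledBow.feat", "aclImdb/test/labeledBow.feat"]
)

# Convert raw word counts into TF-IDF weights; fit on the training set only
# and reuse the fitted transformer on the test set.
tfidf = TfidfTransformer()
X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)

print(X_train.shape, X_test.shape)  # expect 25,000 rows in each matrix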
After reading the files, parsing the data and computing the TF-IDF weights, you need to train a classification model that differentiates between movies with positive and negative feedback based on the reviews and ratings. This is a standard binary classification problem: treat all reviews with rating scores >5 as the positive class and those with rating scores <=5 as the negative class. You can use any classification model you prefer, including but not limited to the Support Vector Machine (SVM), Decision Tree, Random Forest, K-Nearest Neighbour (K-NN) and Naïve Bayes classifiers. Due to the size of the training and testing sets, you are best advised to employ an efficient classification model (e.g. Linear SVM, Decision Tree, Random Forest, Naïve Bayes). All of these models are implemented in the scikit-learn Python package; check the lecture notes and the scikit-learn online documentation for further information. Note that the performance of any classification model depends heavily on the choice of parameters that control the bias-variance trade-off, e.g. the regularisation parameter C for linear SVM, the tree depth for Decision Tree, and the depth and number of trees for Random Forest. Hence, in addition to classifier training, your program also needs a proper parameter selection routine that chooses the optimal parameters for the classification problem. This can be done with the cross validation procedure discussed in the lectures; you can also find online resources discussing parameter selection and cross validation.
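As one possible illustration, the sketch below trains a linear SVM with the regularisation parameter C chosen by cross validation. It assumes the X_train, y_train, X_test and y_test variables from the loading step above; the grid of C values is illustrative only and is not prescribed by the assignment.

from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV

# Ratings > 5 form the positive class, ratings <= 5 the negative class.
y_train_bin = (y_train > 5).astype(int)
y_test_bin = (y_test > 5).astype(int)

# Select the regularisation parameter C by 5-fold cross validation on the
# training set, then refit the best model on all training data.
grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
)
grid.fit(X_train, y_train_bin)

print("best C:", grid.best_params_["C"])
print("test accuracy:", grid.score(X_test, y_test_bin))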
To get full marks for this task, your program needs to implement at least two different types of classification models and compare their predictive performance on the movie review dataset.
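A minimal sketch of such a comparison is given below, again assuming the TF-IDF matrices and binary labels defined above. The parameter values are placeholders; in the full program each model's parameters should be tuned by cross validation as described earlier.

from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Two example model families; any two of the classifiers listed above would do.
models = {
    "Linear SVM": LinearSVC(C=1.0),
    "Naive Bayes": MultinomialNB(),
}

for name, model in models.items():
    model.fit(X_train, y_train_bin)
    acc = accuracy_score(y_test_bin, model.predict(X_test))
    print(f"{name}: test accuracy = {acc:.3f}")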
Part II - Presentation
For Part II, you will create PowerPoint slides to present the work you did in Part I during week 14's lecture and practical classes. You will be allocated 5 minutes, including question time, to go through the content of your essay or to explain the design and results of your program to the audience (the tutor and the rest of the class), and to answer their questions. It is important that your work for Part I and Part II is consistent and covers the same content.
Attachment: Assignment File.rar