Data clustering using k-means

Reference no: EM13889468

Project

Project Title: Data Clustering using K-means

In this project, students are required to cluster Amazon product reviews that belong to four product categories: books, electronic appliances, DVDs, and kitchen appliances. Each category is further divided into positive-sentiment reviews and negative-sentiment reviews. In total, the attached data file "data.txt" contains reviews that belong to 4 × 2 = 8 categories.

The format of the data file is as follows. Each line of the data file corresponds to one review. The first element in the line is the label of the instance (e.g. kitchen-positive indicates that the review is a positive-sentiment review about some kitchen appliance). The remaining elements in the line, separated by spaces, are the unigram and bigram features extracted from the review. Note that the two words in a bigram feature are connected by two underscores. Reviews are represented using binary-valued features (i.e. a feature that occurs in a review is listed exactly once on that review's line).
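The line format described above can be parsed with a few lines of Python. The following is a minimal sketch, not a reference solution: the function names `parse_review` and `load_reviews` are hypothetical, the file is assumed to be UTF-8 encoded, and the example line is made up for illustration rather than taken from data.txt.

```python
from typing import List, Set, Tuple


def parse_review(line: str) -> Tuple[str, Set[str]]:
    """Split one line into (label, set of features).

    The first whitespace-separated token is the label; every other
    token is a unigram or a bigram (two words joined by "__").
    """
    tokens = line.split()
    return tokens[0], set(tokens[1:])


def load_reviews(path: str) -> List[Tuple[str, Set[str]]]:
    """Read every non-empty line of the file into memory."""
    # Assumption: the file is UTF-8 text; adjust the encoding if needed.
    with open(path, encoding="utf-8") as handle:
        return [parse_review(line) for line in handle if line.strip()]


# Made-up example line in the described format.
label, features = parse_review("kitchen-positive sharp sharp__knife knife")
```

Storing the features as a set matches the binary-valued representation: a feature is either present in a review or absent.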

Questions

(1) Write a program to load the data instances into memory from the provided file data.txt.

(2) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. Make sure that you normalize each feature vector to unit L2 length before computing Euclidean distances.

(3) Instead of using the mean of a cluster as its center,

i. select the instance that is closest to the mean as the cluster center when performing k-means clustering, and

ii. use the k-medoids method to perform the clustering.

(4) Evaluate the clusters obtained in steps 2 and 3 using the cross-validation evaluation method.

(5) Briefly discuss which clustering method is best for this data, and why.
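As an illustration of questions (2) and (3)(i), the sketch below normalizes each binary feature vector to unit L2 length and runs k-means with Euclidean distance. It is one possible outline under stated assumptions, not the expected solution: all names (`unit_vector`, `squared_distance`, `mean_vector`, `kmeans`, `snap_to_instance`) are hypothetical, initial centers are drawn at random from the instances, an empty cluster simply keeps its previous center, and the iteration limit is arbitrary. The k-medoids method of question (3)(ii) is not shown.

```python
import math
import random
from typing import Dict, List, Set

Vector = Dict[str, float]  # sparse vector: feature -> weight


def unit_vector(features: Set[str]) -> Vector:
    """Binary features scaled so the vector has L2 length 1."""
    if not features:
        return {}
    weight = 1.0 / math.sqrt(len(features))
    return {feature: weight for feature in features}


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in a.keys() | b.keys())


def mean_vector(members: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of sparse vectors."""
    total: Vector = {}
    for vec in members:
        for key, value in vec.items():
            total[key] = total.get(key, 0.0) + value
    return {key: value / len(members) for key, value in total.items()}


def kmeans(vectors: List[Vector], k: int, iterations: int = 20,
           snap_to_instance: bool = False, seed: int = 0) -> List[int]:
    """Return one cluster index per input vector."""
    rng = random.Random(seed)
    centers = rng.sample(vectors, k)  # assumption: random instances as seeds
    assignment: List[int] = []
    for _ in range(iterations):
        # Assignment step: nearest center by Euclidean distance.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(vec, centers[c]))
            for vec in vectors
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: recompute each center from its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if not members:
                continue  # assumption: an empty cluster keeps its old center
            center = mean_vector(members)
            if snap_to_instance:
                # Question (3)(i): use the member closest to the mean.
                center = min(members, key=lambda m: squared_distance(m, center))
            centers[c] = center
    return assignment


# Toy data: two groups that share no features.
toy = [unit_vector(s) for s in ({"a", "b"}, {"a", "b", "c"},
                                {"x", "y"}, {"x", "y", "z"})]
plain = kmeans(toy, k=2)
snapped = kmeans(toy, k=2, snap_to_instance=True)
```

Comparing squared distances gives the same nearest center as comparing the distances themselves, so the square root is skipped. The dictionary-based vectors keep the example short; a real data set of this kind would likely call for a sparse matrix library.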

Submission Instructions

• Submit

(a) the source code for all your programs,

(b) a README file (plain text) describing how to compile and run your code to produce the various results, and

(c) a PDF file providing the answers to all of the above questions.

Compress all of the above files into a single zip/rar file and name it with your registration number.

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd