How to binarize categorical variables in scikit-learn

Reference no: EM131309135

Data Mining Assignment: Clustering and Basic Classification

OBJECTIVES

Learn some of the clustering features of Scikit-learn to do partitioning and hierarchical clustering (k-Means and hierarchical clustering algorithms);

Learn about document clustering, and document similarity scoring using TFIDF;

Use the built-in k-nearest neighbors implementation in Scikit-learn and interpret its output - this is an extension of the implementation you did last time;

Learn how to binarize categorical variables in Scikit-learn;

Learn how to use DecisionTreeClassifier to build a basic classifier.
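As a minimal sketch of what binarizing a categorical variable can look like, the snippet below turns a single categorical column into 0/1 indicator columns in two ways. The toy DataFrame and its column names (genre, runtime) are hypothetical and are not part of the assignment data; the assignment itself may expect a different approach.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column and one numeric column.
df = pd.DataFrame({
    "genre": ["drama", "comedy", "drama", "horror"],
    "runtime": [120, 95, 140, 88],
})

# Option 1: pandas builds one 0/1 indicator column per category.
dummies = pd.get_dummies(df, columns=["genre"], dtype=int)

# Option 2: OneHotEncoder learns the categories during fit, so the
# same encoding can be reapplied to new data later.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[["genre"]]).toarray()

print(dummies.columns.tolist())
print(list(encoder.categories_[0]))
print(encoded)
```

The pandas route is convenient for quick exploration, while the encoder object is usually the safer choice when the same transformation has to be applied to both training and test data.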

DATA FOR PART 1

We will be using data from IMDB and working with movie data. IMDB is a movie database that is widely used to learn about (and rate) movies. Much of the work around movies focuses on predicting ratings - for example, the Netflix Prize contest was designed to encourage developers to explore better algorithms for rating movies. Instead of predicting ratings, we will cluster the plots of movies. Data will come from the OMDB API, which allows a developer to extract information from IMDB programmatically, since there is no open public API directly published by IMDB.

You can view the notebook here to see how the data was extracted, but you can skip that step and look directly at the file which is the output of that step. You can find the data for this assignment in the data directory, where you will see a TSV file called

data/top1000_movie_summaries.tsv.

BACKGROUND FOR PART 1

Document clustering is a common task in text mining and has broad applications in a variety of contexts. In the unsupervised context, such clustering provides insights into a set of documents and the common features they share. In the supervised context, such clustering allows one to train and subsequently classify documents. For example, if one wanted to determine whether a document is of a certain kind (e.g. legal, academic), one could use labeled instances to learn the features that allow unlabeled/unseen instances to be discriminated.
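To make the unsupervised case concrete, here is a small sketch that clusters four made-up plot summaries with k-Means and with hierarchical (agglomerative) clustering. The plots, the choice of two clusters, and the use of TfidfVectorizer to turn text into vectors are all assumptions for illustration; the real assignment works on the movie summaries file instead.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical plot summaries: two about space, two about cooking.
plots = [
    "spaceship crew explores distant planet",
    "astronaut crew stranded on distant planet",
    "chef opens restaurant in paris",
    "young chef cooks in paris restaurant",
]

# Turn each plot into a weighted term vector (sparse matrix).
X = TfidfVectorizer().fit_transform(plots)

# Partitioning clustering: k-Means with k=2.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(X)

# Hierarchical clustering: requires a dense array as input.
agglo = AgglomerativeClustering(n_clusters=2)
agglo_labels = agglo.fit_predict(X.toarray())

print("k-Means labels:      ", kmeans_labels)
print("Agglomerative labels:", agglo_labels)
```

The cluster numbers themselves are arbitrary; what matters is which documents end up sharing a label.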

There are several good resources in information retrieval that you may want to bookmark for future reference in text mining and information retrieval generally:

Manning, C.D., Raghavan, P. and Schütze, H. (2008) Introduction to Information Retrieval. doi: https://dx.doi.org/10.1017/CBO9780511809071; Available at: Stanford NLP - Information Retrieval.

DOCUMENT ANALYSIS: TERM FREQUENCY (TF) AND INVERSE DOCUMENT FREQUENCY (IDF)

The intuition behind analyzing words in documents hinges on the following:

  • terms that are frequent within a document are given higher importance than those that are infrequent
  • terms that are frequent across many documents are not considered as important

That is, words that are common across the entire corpus are discounted, while those that are common within a document are boosted.
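The weighting described above can be sketched with Scikit-learn's TfidfVectorizer. The three short documents are invented for illustration, and note that Scikit-learn applies a smoothed variant of IDF by default, so the exact numbers may differ from a textbook formula.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus: "the" occurs in every document,
# "detective" in two of them, "killer" in only one.
docs = [
    "the detective hunts the killer",
    "the detective questions the witness",
    "the robot learns to love",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)

# Map each term to its learned IDF weight (higher = rarer in the corpus).
idf = {term: vectorizer.idf_[col] for term, col in vectorizer.vocabulary_.items()}
print({term: round(idf[term], 3) for term in ("the", "detective", "killer")})

# Pairwise document similarity scores based on the TF-IDF vectors.
similarity = cosine_similarity(tfidf)
print(similarity.round(3))
```

In this sketch the corpus-wide word "the" receives the lowest IDF weight, and the two detective documents score as more similar to each other than either does to the robot document.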

Part 2 - Classification With k-Nearest Neighbors

In HW1 you learned about and used the k-NN algorithm. You computed k neighbors based on actual data. This algorithm is what is called a lazy learner because it defers its work to the testing phase instead of building a model in the training phase. This has performance costs of its own, since all the stored data has to be examined for each query. It can, nonetheless, be used as a way to do supervised classification, since it keeps the class labels of all the training instances.

You will first start with a warm-up using the Nearest Neighbors algorithm already implemented in Scikit-learn.
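A warm-up along these lines might look like the sketch below. The two-dimensional points, the labels, and the choice of k=3 are made up for illustration and do not come from the assignment.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

# Hypothetical training data: two well-separated groups of points.
X_train = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # class 0
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],   # class 1
])
y_train = np.array([0, 0, 0, 1, 1, 1])
query = np.array([[1.05, 1.0]])

# Unsupervised lookup: which stored points are closest to the query?
nn = NearestNeighbors(n_neighbors=3).fit(X_train)
distances, indices = nn.kneighbors(query)
print("neighbor indices:", indices[0])
print("distances:       ", distances[0].round(3))

# Supervised use: the majority label among the k neighbors is predicted.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
predicted = clf.predict(query)
print("predicted class: ", predicted[0])
```

Here kneighbors returns, for each query row, the distances to the k closest training points and their row indices in X_train, sorted from nearest to farthest.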

Part 3 - Classification With Decision Trees

As we learned, decision trees are a powerful way to build classifiers, especially since the output is interpretable. Using impurity measures such as entropy (the basis of information gain) and the Gini index, splits can be chosen that divide the data in meaningful ways, so that each leaf node provides the label reached by following the attribute test at each decision point.
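The following sketch fits a small tree with DecisionTreeClassifier on a toy table whose categorical column is binarized first. The weather-style data, the column names, and the choice of the entropy criterion are assumptions for illustration only.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical data: one categorical attribute, one binary attribute, one label.
data = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy",
                "rainy", "overcast", "sunny", "rainy"],
    "windy":   [0, 1, 0, 0, 1, 1, 0, 1],
    "play":    ["no", "no", "yes", "yes", "no", "yes", "no", "no"],
})

# Binarize the categorical column; the tree needs numeric inputs.
X = pd.get_dummies(data[["outlook", "windy"]], columns=["outlook"], dtype=int)
y = data["play"]

# criterion can be "entropy" or "gini" (the default).
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)

# Print the learned decision rules as indented text.
print(export_text(tree, feature_names=list(X.columns)))

# Classify a new overcast, windy day using the same column layout.
new_day = pd.DataFrame([{"windy": 1, "outlook_overcast": 1,
                         "outlook_rainy": 0, "outlook_sunny": 0}])[X.columns]
prediction = tree.predict(new_day)[0]
print("prediction:", prediction)
```

The printed rules are what makes the model interpretable: each path from the root to a leaf reads as a sequence of attribute tests ending in a label.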

Attachment:- Assignment.rar


Reviews

len1309135

12/12/2016 5:10:59 AM

In this homework you will apply some of what you have learned about clustering and classification through the extended use of Scikit-learn, Numpy and Pandas. Learn some of the clustering features of Scikit-learn to do partitioning and hierarchical clustering (k-Means and hierarchical clustering algorithms)
