Prepare heritage data for classification learning

Reference no: EM131231115

Assignment 1:

1. Using heritage data (release 1) in SQL

a. Find support for all single itemsets

b. List all itemsets with 2 elements and support of at least 0.2

c. List all itemsets with 3 elements and support at least 0.2
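As a rough illustration of what "support" means in items a to c, the sketch below counts itemset support in plain Python. The transactions are made-up placeholders (not heritage data) and the name `itemset_support` is hypothetical; the assignment itself asks for SQL, so this only shows the arithmetic that the queries need to reproduce.

```python
from collections import Counter
from itertools import combinations


def itemset_support(transactions, size):
    """Return {itemset: support} for every itemset of the given size.

    Support is the fraction of transactions containing all items of the itemset.
    """
    counts = Counter()
    for items in transactions:
        # sorted() gives every itemset a single canonical ordering
        for combo in combinations(sorted(set(items)), size):
            counts[combo] += 1
    n = len(transactions)
    return {combo: count / n for combo, count in counts.items()}


# Hypothetical toy transactions (NOT heritage data)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

min_support = 0.2
singles = itemset_support(transactions, 1)
pairs = {k: v for k, v in itemset_support(transactions, 2).items() if v >= min_support}
triples = {k: v for k, v in itemset_support(transactions, 3).items() if v >= min_support}

for itemset, value in sorted(pairs.items()):
    print(itemset, round(value, 2))
```

In SQL the same counts would typically come from self-joins of the transaction table followed by a `GROUP BY`, divided by the total number of transactions.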

2. In Weka

a. Load heritage data (release 1)

b. Apply at least two association rule generation algorithms and compare results

c. Apply FPTree algorithm with at least two measures of rule metrics
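Item c asks for more than one rule metric. As a hedged illustration, the sketch below computes support, confidence, lift and leverage for a single rule from their standard definitions. The transactions are invented and `rule_metrics` is a hypothetical helper; this is not Weka's API, only a way to sanity-check the numbers a tool reports.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Standard metrics for the rule antecedent -> consequent."""
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    s_ac = support(transactions, set(antecedent) | set(consequent))
    if s_a == 0 or s_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    confidence = s_ac / s_a          # P(consequent | antecedent)
    lift = confidence / s_c          # > 1 suggests positive association
    leverage = s_ac - s_a * s_c      # difference from independence
    return {"support": s_ac, "confidence": confidence,
            "lift": lift, "leverage": leverage}


# Hypothetical toy transactions (NOT heritage data)
transactions = [
    {"A", "B", "C"},
    {"A", "B"},
    {"A", "C"},
    {"B", "C"},
    {"A", "B", "C"},
]

metrics = rule_metrics(transactions, {"A"}, {"B"})
print({name: round(value, 4) for name, value in metrics.items()})
```

Ranking the same rules by two different metrics (for example confidence and lift) usually produces different orderings, which is the comparison the item is after.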

Assignment 2:

1. In SQL/Weka:

a. Prepare heritage data for classification learning

b. Load heritage data release 3 (preprocessed to binary representation, including demographics and output attribute(s))

c. Perform exploratory analysis

d. Create at least three classification models for predicting hospitalization based on Year 1 data.

e. Which model performs the best on year 2 data?

f. Create regression model for predicting hospitalization days.

g. What is the difference between regression and classification models?

h. Present your results in the form of a short report that includes screenshots, tables, and any needed description.

Assignment 3:

Classification Part 2

1. Using heritage release 3 data prepared in the last assignment

a. Include drug information into data

b. Include laboratory information into data

c. Import newly created data into Weka and run classification algorithms

d. Does inclusion of the information improve predictions?

There are many ways to complete question 4, so you need to make different decisions.

Try not to overcomplicate the problem.

2. In Weka using heritage 3 dataset

a. Apply the k-means algorithm for k = 2, 3, 5, 10

b. Apply EM algorithm. What is the optimal number of clusters obtained by EM?

c. Compare the created clusters to classification based on hospitalization in year 2.
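To make clear what the k-means step in item a does, here is a minimal sketch of Lloyd's algorithm in plain Python. It is not Weka's implementation: the initial centroids are simply the first k points (an assumption made to keep the example deterministic), the function name `k_means` is hypothetical, and the four 2-D points are invented.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, k, max_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    # Simplifying assumption: the first k points serve as initial centroids
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids


# Hypothetical toy points (NOT heritage data)
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, centroids = k_means(points, k=2)
print(labels, centroids)
```

For item c, the resulting cluster labels can be cross-tabulated against the year 2 hospitalization flag to see how well the clusters line up with the classes.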

Assignment 4:

Using the data table shown below.

a. Calculate distance between all points in 1-norm, 2-norm and infinity-norm. Show dissimilarity matrix.

b. Is there any need to preprocess the data to be more suitable for clustering? If so, describe the operations and show the resulting data table.

c. Apply k-means clustering algorithm with k=2.

ID   Age   BMI   Gender   Total Cholesterol
1    30    24    M        180
2    70    19    M        190
3    65    26    M        220
4    40    32    F        260
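The distance calculations in item a and the preprocessing question in item b can be sketched as follows. Two assumptions are made here that the assignment does not state: Gender is encoded as M = 0 and F = 1 so it can enter a distance, and min-max scaling is used as one possible preprocessing step. The function names are hypothetical.

```python
# Rows from the table: ID -> (Age, BMI, Gender, Total Cholesterol)
rows = {
    1: (30, 24, "M", 180),
    2: (70, 19, "M", 190),
    3: (65, 26, "M", 220),
    4: (40, 32, "F", 260),
}


def encode(row):
    """Assumption: Gender is encoded as M=0, F=1."""
    age, bmi, gender, chol = row
    return (float(age), float(bmi), 1.0 if gender == "F" else 0.0, float(chol))


def distance(p, q, norm):
    """1-norm, 2-norm or infinity-norm distance between two points."""
    diffs = [abs(a - b) for a, b in zip(p, q)]
    if norm == 1:
        return sum(diffs)
    if norm == 2:
        return sum(d * d for d in diffs) ** 0.5
    if norm == "inf":
        return max(diffs)
    raise ValueError("norm must be 1, 2 or 'inf'")


def dissimilarity_matrix(points, norm):
    """Square matrix of pairwise distances, ordered by ID."""
    ids = sorted(points)
    return [[distance(points[i], points[j], norm) for j in ids] for i in ids]


def min_max_scale(points):
    """One possible preprocessing step: rescale every column to [0, 1]."""
    ids = sorted(points)
    columns = list(zip(*(points[i] for i in ids)))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return {
        i: tuple((v - lo) / (hi - lo) if hi > lo else 0.0
                 for v, lo, hi in zip(points[i], lows, highs))
        for i in ids
    }


points = {i: encode(row) for i, row in rows.items()}
d1 = dissimilarity_matrix(points, 1)
d2 = dissimilarity_matrix(points, 2)
dinf = dissimilarity_matrix(points, "inf")
scaled = min_max_scale(points)

# Patients 1 and 2 differ by 40 (Age), 5 (BMI), 0 (Gender), 10 (Cholesterol),
# so their 1-norm distance is 55 and their infinity-norm distance is 40.
print(d1[0][1], dinf[0][1])
```

On the raw values, Age and Total Cholesterol dominate every distance because of their larger ranges, which is the usual argument for rescaling before clustering.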

Assignment 5:

Text Mining

1. Write regular expressions to:

a. Detect zip codes in text

b. Find last names of all patients whose first name is John (note that regular expressions may have some false positives/false negatives).
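One possible pair of patterns for items a and b is sketched below using Python's `re` module. The patterns assume US-style zip codes (five digits, optionally followed by a four-digit extension) and capitalized last names that directly follow "John"; the sample note is invented. As the item warns, both patterns can produce false positives and false negatives.

```python
import re

# Five digits, optionally followed by "-" and four more digits (ZIP+4)
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")

# "John" followed by whitespace and a capitalized word; the word is captured
JOHN_PATTERN = re.compile(r"\bJohn\s+([A-Z][A-Za-z'-]+)")

# Hypothetical sample text (not real patient data)
note = (
    "John Smith (zip 22030) was seen with John O'Neil, zip 22030-4444. "
    "John was discharged. Transferred from St. John Medical Center."
)

zip_codes = ZIP_PATTERN.findall(note)
last_names = JOHN_PATTERN.findall(note)

print(zip_codes)
print(last_names)
```

In this sample, "Medical" is returned as a last name because "St. John Medical Center" fits the pattern, which is an example of a false positive. Any other five-digit number in a note, such as an identifier, would likewise be reported as a zip code.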

2. List challenges in automatically retrieving ICD-9 codes from clinical notes. Search the literature to find relevant published work. Also, include your own observations and comments.

3. Using the SMS data

a. Split data into training (80%) and testing (20%) sets

b. Build naïve Bayes classifier for detecting spam based on bag of words

i. List all words in the documents

ii. Count occurrences in spam and ham

iii. Assign likelihoods P(word|spam) and P(word|ham) for all words

iv. Convert test data into a list of words. For each message you need two columns: message id and word

v. Classify test data. This can be done by a series of joins with the data prepared in (iii).

vi. Evaluate your model (accuracy, precision, recall)
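Steps a and i to vi can be sketched in plain Python as shown below. Several details are assumptions that the assignment does not specify: add-one (Laplace) smoothing for the likelihoods, log-probabilities instead of raw products, skipping test words that never appeared in training, and a simple lowercase tokenizer. The messages are invented, not the SMS data, and all function names are hypothetical.

```python
import math
import random
import re
from collections import Counter


def split(messages, train_fraction=0.8, seed=0):
    """Step a: shuffle and split into training and testing sets."""
    shuffled = messages[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def tokenize(text):
    """Steps i and iv: turn a message into a list of lowercase words."""
    return re.findall(r"[a-z']+", text.lower())


def train(messages):
    """Step ii: count word occurrences per class. messages = [(label, text)]."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    doc_counts = Counter()
    for label, text in messages:
        doc_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    return word_counts, doc_counts, vocab


def log_likelihood(word, label, word_counts, vocab):
    """Step iii: log P(word | label) with add-one smoothing (an assumption)."""
    total = sum(word_counts[label].values())
    return math.log((word_counts[label][word] + 1) / (total + len(vocab)))


def classify(text, model):
    """Step v: choose the class with the larger log-posterior score."""
    word_counts, doc_counts, vocab = model
    n_docs = sum(doc_counts.values())
    scores = {}
    for label in ("spam", "ham"):
        score = math.log(doc_counts[label] / n_docs)
        for word in tokenize(text):
            if word in vocab:  # words unseen in training are skipped
                score += log_likelihood(word, label, word_counts, vocab)
        scores[label] = score
    return max(scores, key=scores.get)


def evaluate(test_messages, model):
    """Step vi: accuracy, precision and recall with spam as the positive class."""
    tp = fp = fn = tn = 0
    for label, text in test_messages:
        predicted = classify(text, model)
        if predicted == "spam":
            tp, fp = tp + (label == "spam"), fp + (label == "ham")
        else:
            fn, tn = fn + (label == "spam"), tn + (label == "ham")
    accuracy = (tp + tn) / len(test_messages)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical toy messages (NOT the SMS data), split by hand for clarity
train_set = [
    ("spam", "win free prize now"),
    ("spam", "free cash prize"),
    ("ham", "are we meeting now"),
    ("ham", "see you at lunch"),
]
test_set = [("spam", "free prize"), ("ham", "lunch meeting")]

model = train(train_set)
accuracy, precision, recall = evaluate(test_set, model)
print(accuracy, precision, recall)
```

With the real data, `split` would be applied to the full list of labeled messages first. The per-word likelihood table from step iii corresponds to what the SQL version would join against the (message id, word) table from step iv.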
