Find all customer names who have at least one account

Assignment Help Database Management System
Reference no: EM131507011

Assignment: Data Minig and Machine Learning

1. Assume you have 3 documents with the following terms:

• D1 = "computer", "web", "storage", "options"
• D2 = "computer", "game", "development"
• D3 = "web", "development", "frameworks"

If the query Q is composed of terms "computer" and "development", what is the relevance of each document to the query using the TF.IDF measure?

2. Explain and write the pseudocode for a Mapper/Reducer that takes as input a large file (possibly split into chucks) of integers and outputs:

a. The sum of the squares of each integer
b. The maximum integer

3. Assume you have a database with two relations (i.e. tables): customers and accounts.

The schema for customers is composed of the following attributes:

• customerID (integer)
• name (string)
• address (string)
• phone (string)

The schema for accounts is composed of the following attributes:

• customerID (integer)
• accountNumber (integer)
• balance (float)

Write the following queries using SQL:

A. Find the customer name with account number 12345.

B. Find all customer names who have at least one account with balance >$100,000

Write the following queries using Relational Algebra:

C. Find all account numbers with balance <$20,000

D. Calculate the sum of all accounts for customer with ID=432

4. Compute the Jaccard similarities of each pair of the following three sets:

{1, 2, 3, 4}, {2, 3, 5, 7}, and {2, 4, 6}.

5. List the first ten 3-shingles in the following sentence:

"In this section, we introduce the simplest and most common approach, shingling, as well as an interesting variation."

6. Fill in the signature matrix using the following input matrix of shingles of documents and the given permutations:

Input Matrix

Document 1

Document 2

Document 3

1

1

0

0

1

1

1

0

1

1

1

0

Permutations:

π1=(2,4,3,1)

π2=(1,2,4,3)

π3=(4,3,2,1)

Document 1

Document 2

Document 3










7. Using your answers from problem 3, compute the column/column and signature/signature similarities of the following document pairs:

a) 1-2

b) 1-3

c) 2-3

8. Perform a hierarchical clustering of the one-dimensional set of points 1, 4, 9, 16, 25, 36, 49, 64, 81, assuming clusters are represented by their centroid (average), and at each step the clusters with the closest centroids are merged. (show each step)

9. Use the k-means algorithm and Euclidean distance to cluster the following 5 examples into 2 clusters:

p1=(2,10), p2=(2,5), p3=(8,4), p4=(5,8), p5=(7,4).

Grocery Data

10. Given the following itemsets:

{A,B,C}

{B,C}

{A,B,C,D}

{A,C,D}

{C,D}

{B,C}

{B,D}

a) What is the support of {A}, {A,C}, {B}, {B,C}, {A,C}, {A,C,B}, {B,C,D}, {A,B,C,D} ?

b) What is the confidence of {A}->C, {B}->C, {A,C}->B, {B,C}->D ?

c) What is the interest of {A}->C, {B}->C, {A,C}->B, {B,C}->D ?

11. Apply the A-Priori Algorithm with support threshold 3 to the itemsets from problem 1.

12. If we use a triangular matrix to count pairs, and n, the number of items, is 20, what pair's count is in a[100]?

13. Assume you are given the following collection of twelve baskets, each of which containing three of the six items 1 through 6:

{1, 2, 3} {2, 3, 4} {3, 4, 5} {4, 5, 6}

{1, 3, 5} {2, 4, 6} {1, 3, 4} {2, 4, 5}

{3, 5, 6} {1, 2, 4} {2, 3, 5} {3, 4, 6}

Suppose the support threshold is 4. On the first pass of the PCY Algorithm we use a hash table with 11 buckets, and the set {i, j} is hashed to bucket i×j mod 11.

a) By any method, compute the support for each item and each pair of items.

b) Which pairs hash to which buckets?

c) Which buckets are frequent?

d) Which pairs are counted on the second pass of the PCY Algorithm?

14. For this question, you will need to use the Orange Canvas data mining software. You can find it at:

Be sure to read the documentation and especially the following:

Then download the file grocery-data.txt (from BlackBoard) and use it with the Orange Canvas software to answer the following questions:

a) Which rule has the most confidence?

b) If we consider rules with any confidence, for what support value do we arrive with only 10 rules?

c) Make a screenshot of the Orange schema (i.e. your workspace) you used to answer questions (a) and (b) and insert it.

15. Given the data set below, apply the k-Nearest Neighbor algorithm to classify the test data for k=1 and k=3. Use the Euclidean distance metric.

Training Set

#

x1

x2

true label

1

0.453705

-0.0106

1

2

3.258589

0.169734

1

3

3.184656

-0.83691

0

4

-0.42561

1.385033

0

5

0.658765

-1.87715

0

6

-0.40507

-1.9574

0

7

-4.52775

4.123102

1

8

2.538689

-1.5386

1

9

-1.04649

-3.59664

1

10

2.967113

0.505111

0

Testing Set

#

x1

x2

true label

predicted label

11

-4.69237

-4.77898

1

 

12

-2.1147

-1.81277

0

 

13

4.277164

-4.83136

1

 

14

-1.33862

-0.93995

0

 

15

-4.02728

-4.96129

1

 

16

4.968125

3.757161

1

 

17

-2.19987

-3.48712

0

 

18

2.849136

-3.33965

0

 

19

-4.30273

2.530094

1

 

20

4.690116

-0.36379

1

 

16. Compute the confusion matrix, accuracy, precision, recall, and F1 measures given your answers to problem 1

17. Assume you have the data set given below, which provides hypothetical examples of instances when people did or did not get hired for a job. It consists of three categorical attributes and a label that indicates "hired" or "not hired". Using this data, induce a decision tree using information gain for splitting the nodes, showing the calculations at each step.

Training Set

#

Experience (EXP)

Sufficient Qualifications? (QUAL)

Opinions of References (REFOP)

true label

1

good

Yes

favorable

1

2

excellent

Yes

favorable

1

3

none

No

favorable

0

4

good

No

not favorable

0

5

good

Yes

not favorable

0

6

excellent

Yes

not favorable

0

7

excellent

Yes

favorable

1

8

good

Yes

favorable

1

9

none

Yes

favorable

1

10

none

Yes

not favorable

0

18. Download and install the WEKA data mining toolkit. It is available through this link:

Then use the Explorer GUI interface and open the "credit-g.arff" data set that is included in WEKA (data directory). This is a data set from Germany describing credit-worthiness (good or bad) of customers based on 20 different attributes. Go to the classify tab, select "Percentage Split" and enter 50%.

Then run each of the following classification algorithms:

• trees.J48
• trees.SimpleCart
• trees.RandomForests
• meta.AdaBoostM1

a) For each algorithm, report the accuracy, precision, and recall values with default parameters

b) Adjust the parameters of each algorithm to maximize the accuracy. Report the algorithm and parameter settings that maximized the accuracy measure and provide the maximum accuracy value.

19 . Given the following utility matrix, representing the ratings, on a 1-5 star scale, of eight items, a through h, by three users A, B, and C:


a

b

c

d

e

f

g

h

A

4

5


5

1


3

2

B


3

4

3

1

2

1


C

2


1

3


4

5

3

• Compute the following from the data of this matrix.

(a) Normalize the matrix by subtracting from each nonblank entry the average value for its user.

(b) Using the normalized matrix from Part (a), compute the cosine distance between each pair of users.

20. 24. Three computers, A, B, and C, have the numerical features listed below:

Feature

A

B

C

Processor Speed

3.06

2.68

2.92

Disk Size

500

320

640

Main-Memory Size

6

4

6

We may imagine these values as defining a vector for each computer; for instance, A's vector is [3.06, 500, 6]. We can compute the cosine distance between any two of the vectors, but if we do not scale the components, then the disk size will dominate and make differences in the other components essentially invisible. Let us use 1 as the scale factor for processor speed, α for the disk size, and β for the main memory size.

(a) In terms of α and β, compute the cosines of the angles between the vectors for each pair of the three computers.

(b) What are the angles between the vectors if α = β = 1?

(c) What are the angles between the vectors if α = 0.01 and β = 0.5?

• 2. A certain user has rated the three computers of Problem 1 as follows: A: 4 stars, B: 2 stars, C: 5 stars.

(a) Normalize the ratings for this user.

(b) Compute a user profile for the user, with components forprocessor speed, disk size, and main memory size, based on the data of Problem 1.

21. Assume you have the adjacency matrix given in problem 2. Use the PageRank equation with to compute the rank vector using the power iteration method. Do 3 iterations and show your work.

22. Assume you have the following set of preferences of people over seats at a table:

{"John", {Seat1, Seat2, Seat 3},

"Mary",{Seat2, Seat4},

"Bob", {Seat1, Seat3, Seat4},

"Alice", {Seat3, Seat5},

"Cindy", {Seat1, Seat2, Seat3}}

What is the competitive ratio of the greedy algorithm vs. the optimal seating arrangement?

Explain in your own words, why is it hard to measure the ClickThrough Rate?

23. Consider the following scenario:

There are three advertisers A, B, and C

A bids on query x, B bids on x and y, and C bids on x, y, and z

All have budgets of $3.

Given the following query stream:

x x x y y y z z z

a) What are the sequences of choices for both the greedy algorithm assuming the worst case scenario and the BALANCE algorithm?

b) What are the competitive ratios for both algorithms in this scenario?

Attachment:- Attachments.rar

Verified Expert

The solution file is prepared in ms word which answered all questions related to data mining and machine learning. The topics covered in this are Mapper/Reducer,queries using SQL,Jaccard similarities,3-shingles,signature matrix,column/column and signature/signature similarities,hierarchical clustering,k--means algorithm and Euclidean distance,A-Priori Algorithm,triangular matrix to count pairs,Orange Canvas data mining software,k-Nearest Neighbor algorithm to classify the test data,Compute the confusion matrix, accuracy, precision, recall, and F1 measures and use WEKA data mining toolkit to analysis the data.

Reference no: EM131507011

Questions Cloud

Define five mutually exclusive alternatives for stf : Five mutually exclusive alternatives are being considered for providing a sewage-treatment facility (STF).
Develop companys branding pricing and distribution strategy : Develop the company's branding, pricing, and distribution strategy. Establish whether the company's intention is to be a leader or follower within the industry.
What about cloud backups : What about Cloud Backups? Do you think that the Cloud is secure enough to trust it with our sensitive data? Why or why not? How can we tell?
Step-by-step explanation of your algorithm : Write a description and step-by-step explanation of your algorithm. You may include a flowchart.
Find all customer names who have at least one account : Find all customer names who have at least one account with balance >$100,000. Calculate the sum of all accounts for customer with ID=432.
Pros and cons of the windows firewall : Do a bit of web research on the pros and cons of the Windows Firewall. Then in your discussion post, address the following.
Describe the rising needs for contingency planning : Describe the rising needs for contingency planning at the federal, state, and local level for projected crises scenarios.
Discussion question-working with linux : In this lesson, you were introduced to Linux and its installation process and file systems.
What percentage change in the b-c ratio of each plan : In the development of a publicly owned, commercial waterfront area, three possible independent plans are being considered.

Reviews

inf1507011

7/31/2017 5:20:12 AM

Gracious!!! It is awesome!!! Much obliged to you so much - in the event that I might I be able to would give you an embrace... a digital one will need to do. Much obliged to you over and over and once more... Have an awesome night!! I will rest great today evening time on account of you!!!

inf1507011

7/19/2017 5:50:48 AM

I need all the questions should be answered with their calculations. My professor will not accept any solution without calculation.

Write a Review

Database Management System Questions & Answers

  Database concurrency control and data security

Database concurrency control and data security and describe how the data is displayed and extracted.

  Describe a business scenario

Constraints Business Scenario: Describe a business scenario and specify the types of constraints that would be appropriate to ensure the integrity of the database. Be sure to include every constraint discussed .

  Dependencies can you infer does not hold over relation s

From the JD, the set of relation schemes SP, PJ, and JS is a lossless-join decomposition of SPJ. Construct an instance of SPJ to illustrate that no two of these schemes su?ce.

  By using the instructions given by your instructor install

using the instructions provided by your instructor install mysql and connect from netbeans. create a user account that

  Create a database

Create a database that implements the proposed data warehouse schema.

  Formulate the problem of determining the optimal locations

Consider the University of the Far West described in Example. Suppose that the university administration has decided that two audiovisual centers are needed.

  Relational algebra operations

List the sequence of relational algebra operations needed to implement the queries.

  Design a relational database so that it is at least in 3nf

Explain typical situations when denormalizing a table is acceptable. Provide one (1) example of denormalizing a database table to justify your response.

  Determine the functional dependencies

After determining the functional dependencies, convert this table to an equivalent collection of tables that are in third normal form.

  Goals and implement an effective deployment plan

To establish project goals and implement an effective deployment plan, To gain insight into your company's transactions with data mining

  Edinburgh airport case study

Discuss why a relational database would be suitable for the organisation in the case study and provide three reasons for this recommendation and discuss using three advantages that a database management system provides to an organisation.

  Define a structural model

Define a structural model. Why should a systems analyst create one? Give an example of class cohesion for a class named SUPPLIER for your example that supplies car parts. List some of its attributes and at least two OPERATIONS (methods) that would..

Free Assignment Quote

Assured A++ Grade

Get guaranteed satisfaction & time on delivery in every assignment order you paid with us! We ensure premium quality solution document along with free turntin report!

All rights reserved! Copyrights ©2019-2020 ExpertsMind IT Educational Pvt Ltd