Solution-Describe what happens to the training error

Describe what happens to the training error

Assignment Help Basic Statistics

Reference no: EM131096442

10-701 Machine Learning - Spring 2012 - Problem Set 1

Q1. Basic probability and statistics

1.1 Probability distribution-

Mickey is a novice in probability theory and is facing a seeming paradox. Consider a uniform probability distribution of variable X on the interval [0, ½], namely f(x) = 2, for x ∈ [0, ½]. Given that probabilities under a distribution should sum to 1, Mickey is puzzled that f(x) exceeds their sum. Explain the paradox. In other words, what is confusing Mickey?

1.2 Mean and variance-

Compute the mean and variance of X described in Section 1.1. from first principles (i.e. write down the definitions and compute their values).

1.3 Sampling-

Mickey knows how to sample from a canonical uniform distribution g(y) = 1, for y ∈ [0, 1]. Explain how Mickey can simulate the distribution described in Section 1.1 based on samples of Y, i.e. simulate a uniform distribution on [0, ½].

Show analytically that with your proposal, the new distribution would be identical in mean and variance as obtained in Section 1.2-in other words, prove that the simulated distribution converges to the desired distribution (hint: use transformation of variables).

1.4 More on expectations-

The odds that a particular face of a fair six-sided die is thrown can be approximated by a discrete uniform distribution. Equipped with his knowledge in probability, Mickey goes to gamble and hedges his bet on throwing a 6. What is the expected number of times he must throw the die to get a 6? Show your workings in clear steps (hint: use recursion).

1.5 Detective Bayes-

Mickey tosses a die multiple times, hoping for a 6. The sequence of his 10 tosses is 1, 3, 4, 2, 3, 3, 2, 5, 1, 6. Mickey is suspicious whether the die is biased towards 3 (or fair). Conduct a simple analysis based on the Bayes theorem to inform on Mickey-to what degree is the die biased? Explain your reasoning.

Assume in general every 100 dice contain 5 unfair dice that are biased towards 3 with the probability distribution of the six faces (1, 2, 3, 4, 5, 6) as P = [0.1, 0.1, 0.5, 0.1, 0.1, 0.1].

Q2. Linear regression

In regression, the input variable X is mapped to the response variable Y via a function Y = f(X) + E subject to some noise E. Linear regression assumes a linear form of the function and finds weight W that minimizes the sum of squared errors:

min_i=1∑ⁿ(y_i - wx_i)²(1)

2.1 Probabilistic interpretation-

Suppose there are n indenpendent pairs of input and response {(x₁, y₁), ...,(x_n, y_n)}. Assume these are linearly related subject to standard Gaussian noise y_i = wx_i + ∈_i, ∈_i ∼ N(0, 1). Using maximum likelihood, show that maximizing the log-likelihood function L(Y|X) = log(P(Y|X)) is equivalent to minimizing the sum of squared errors in Equation 1 (hint: first derive the likelihood expression).

2.2 Geometric interpretation-

Now think of the responses {y₁,..., y_n} as a vector y = [y₁, ..., y_n] in n-dimensional space, similarly for the estimates in linear regression yˆ = [wx₁, ..., wx_n]. What does Equation 1 mean geometrically to vectors y and yˆ? Provide a simple proof.

2.3 Regularization-

Real-world problems often involve input variables with multiple dimensions and limited data samples. In these scenarios, a common technique is often used to regularize the weights in linear regression and hence provides smoothing and improves generalization performance. Using matrix notations, the sum of squared errors in Equation 1 can be re-written as

arg min ||y - Xw|| + λ||w||²(2)

where X ∈ R^n×m, y ∈ R^m×1 and λ regularizes the norm of the weights. Derive an analytical expression that minimizes the term in Equation 2 with respect to w using vector derivatives. Explain briefly what happens to the weights w when λ is increased.

Q3. Density estimation

In K-nearest neighbors (KNN), the classification is achieved by majority vote in the vicinity of data. Suppose there are two classes of data each of n/2 points overlapped to some extent in a 2-dimensional space.

1. Describe what happens to the training error (using all available data) when the neighbor size k varies from n to 1.

2. Predict and explain with a sketch how the generalization error (e.g. holding out some data for testing) would change when k varies? Explain your reasoning.

3. Propose a method to determine an appropriate value for k.

4. In KNN, once k is determined, all neighbors within the perimeter of k are weighted equally in deciding the class label. Suggest a modification to the algorithm that improves this caveat.

5. Give two reasons why KNN may be undesirable when the input dimension is high.

Q4. Naive Bayes and classification

In this section, you will implement the full and naive Bayes classifiers and compare their performances on several data sets.

4.1 The rationale of "naive"-

Suppose you are modeling some N-dimensional data X with a full Gaussian distribution X ∼ N(µ, Σ), what is the number of parameters to be estimated? Now instead using the naive Bayes assumption, what is the number of parameters to be estimated? Show your calculations clearly.

4.2 Programming task-

First, simulate 10 data sets each containing two classes of data from the 2-dimensional Gaussian distributions specified in Table 1. Plot separately sets 5 and 10 on a 2-D plane with the two classes coded in different symbols and colors. Save the simulated 10 sets.

Parameter 1: µ₀ = [-2 0]; Σ₀ = [2 0; 0 3]; µ₁ = [2 0]; Σ₁ = [2 0; 0 3].

Parameter 2: µ₀ = [-2 0]; Σ₀ = [2 - 1.8; -1.8 3]; µ₁ = [2 0]; Σ₁ = [2 - 1.8; -1.8 3].

Set	# data per class	Gaussian parameters
1	20	Parameter 1
2	40	Parameter 1
3	100	Parameter 1
4	300	Parameter 1
5	1200	Parameter 1
6	20	Parameter 2
7	40	Parameter 2
8	100	Parameter 2
9	300	Parameter 2
10	1200	Parameter 2
Table 1 - Specification for the data sets

Now write two separate routines that implement the full Gaussian and naïve Bayes classifiers respectively. Apply each of these classifiers to the simulated 10 data sets using 10-fold cross validated classification-use equal number of data points from class 0 and 1 as training or testing (held-out) data in each fold by writing a generic routine that performs classification with k-fold cross validation. Save the learned parameters of each classifier in each fold (for full Gaussian Bayes, these are the means and covariances; for naive Bayes, these are the means and variances at each dimension).

1. Plot the cross-validated classification accuracies with error bars under Parameter 1 (sets 1 - 5) and Parameter 2 (sets 6 - 10) on separate figures. For each figure, overlay the accuracies of the two classifiers for comparison.

2. Carefully examine and report the parameters learned for the two classifiers under Parameter 1 and 2. How do these vary when the data size increases from 20 to 1200? For full Gaussian and naive Bayes, how close are the estimated means and covariances to the ground truth-you may use the parameters averaged across 10-fold validation as the estimated parameters.

3. Compare and comment on the performance of these classifiers under different parameterizations. What is your prediction about their performances? Do the results reflect your prediction? Explain your reasoning.

4. Submit the complete (MATLAB) code in the appendix of your write-up in compact format (e.g. double columns). These include 1) the routines for the two classifiers 2) the routine for k-fold cross validation. Do not attach or print out the simulated data sets.

Q5. Bonus question: a small real-world challenge

The US postal service uses machine learning methods to robustly identify zip codes. In this section, you will design and implement a classifier that does a simpler job-classifying hand-written digits of "0"s and "1"s. You can load these digits in MATLAB with the command "load -ascii digitX" where X= 0 or 1. Each of the ".mat" files contains 700 samples of digits in 8 × 8 pixels. To get a feel for the data, you can visualize these samples with the command "imagesc(reshape(digitX(i,:),8,8)); colorbar gray", where i is the row index.

1. Your task is to perform a leave-one-out cross validated classification on the digit data-we recommend the Naive Bayes classifier although you are free to develop your own algorithm (we do not recommend you spending too much time on developing an over-complex model). In any case, specify clearly the assumptions and formulations of your algorithm. For example, for the Naive Bayes classifier, you should specify the prior, the likelihood function and any manipulation or transformation on the data.

2. Implement and apply your classifier on the data. Report the cross-validated accuracy (this should be a single number). Did your classifier work well-explain why or why not?

3. Submit your MATLAB code in compact format and attach it in the write-up.

Attachment:- Data.rar

Reference no: EM131096442

Questions Cloud

Conducting a cost benefit analysis : 1. What are the elements of a decision? As an MBA, do you see similarities with conducting a cost benefit analysis? 2. What is opportunity loss or regret and how might it be used in everyday life?

Choose one publicly traded company : Identify two different topics covered in the course and examine two ways an organization can use financial statements to make business decisions related to your selected topics.

What are wmds and what are their main function : What are WMDs and what are their main function? What is the difference between fission and fusion? When there is discussion of nuclear weapons, which kind are they usually talking about? What does first-strike mean? Second-strike? MAD?

Develop tasks and responsibilities for their approved job : Second, create a company, or use an existing company in which there is a fit for your approved job selection choice. In your paper you will provide an overview of the company and discuss how the job fits in the organization.

Describe what happens to the training error : 10-701 Machine Learning - Spring 2012 - Problem Set 1. In K-nearest neighbors (KNN), the classification is achieved by majority vote in the vicinity of data. Describe what happens to the training error (using all available data) when the neighbor siz..

How does neoliberalism differ from liberalism in ir : How does neoliberalism differ from liberalism in IR? Define reciprocity and EXPLAIN how it promotes cooperation. How can reciprocity result in more conflict? Explain. Define international regimes. How do they facilitate cooperation?

What is the role of reason in relation to a good will : Why are talents, gifts of fortune and happiness not good in themselves? How do Kant's views compare to Aristotle's? What is the role of reason in relation to a good will?

What are the main assumptions of realism : What are the main assumptions of realism? Identify two of the thinkers that have contributed to realism and describe their contribution. [multiple answers]. What is the most common definition of power? What are its flaws?

Create the logic for the dice game pig : In this version of the game, when the computer does not roll a 1 and can choose whether to roll again, generate a new random value of 1 or 2. Use this value to decide whether the computer will continue to play or to pass the turn to the player.

User Account

All Pages

Describe what happens to the training error

Reference no: EM131096442

Reference no: EM131096442

Questions Cloud

Reviews

Write a Review

Basic Statistics Questions & Answers

Statistics-probability assignment

What is the least number

Determine the value of k

What is the probability

Binomial distributions

Caselet on mcdonald’s vs. burger king - waiting time

Generate descriptive statistics

Sampling variability and standard error

Estimate the population mean

Conduct a marketing experiment

Find out the probability

Linear programming models

Assured A++ Grade

Academics

Major Subjects

Majors

Get In Touch

TERMS & POLICIES

HELP & SUPPORT