10701 Machine Learning - Spring 2012 - Final Examinations

Q1. Decision trees and KNNs

In the following questions we use Euclidean distance for the KNN.

1.1 Assume we have a decision tree to classify binary vectors of length 100 (that is, each input is of size 100). Can we specify a 1-NN that would result in exactly the same classification as our decision tree? If so, explain why. If not, either explain why or provide a counter example.

1.2

[Figure 1: a decision tree over two-dimensional inputs (X1, X2)]

Assume we have the decision tree in Figure 1, which classifies two-dimensional vectors (X1, X2) whose coordinates take values in R\{A, B}. In other words, the values A and B never appear in the inputs. Can this decision tree be implemented as a 1-NN? If so, explicitly state the values you use for the 1-NN (you should use the minimal number possible). If not, either explain why or provide a counter example.

Q2. Neural Networks

Consider neural networks where each neuron has a linear activation function, i.e., each neuron's output is given by g(x) = c + b · (1/n) ∑_{i=1}^{n} W_i x_i, where b and c are two fixed real numbers and n is the number of incoming links to that neuron.

1. Suppose you have a single neuron with a linear activation function g() as above, input x = x_0, . . . , x_n, and weights W = W_0, . . . , W_n. Write down the squared error function for this input if the true output is a scalar y, then write down the weight update rule for the neuron based on gradient descent on this error function.

2. Now consider a network of linear neurons with one hidden layer of m units, n input units, and one output unit. For a given set of weights w_{k,j} in the input-hidden layer and W_j in the hidden-output layer, write down the equation for the output unit as a function of w_{k,j}, W_j, and input x. Show that there is a single-layer linear network with no hidden units that computes the same function (a numeric sketch of this collapse appears after this list).

3. Now assume that the true output is a vector y of length o. Consider a network of linear neurons with one hidden layer of m units, n input units, and o output units. If o > m, can a single-layer linear network with no hidden units be trained to compute the same function? Briefly explain why or why not.

4. The model in 3) combines dimensionality reduction with regression. One could also reduce the dimensionality of the inputs (e.g. with PCA) and then use a linear network to predict the outputs. Briefly explain why this might not be as effective as 3) on some data sets.
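To make item 2 concrete, here is a minimal numeric sketch, assuming b = 1 and c = 0 in the activation above so that each neuron computes a scaled weighted sum; the variable names (w_hidden, W_out) are illustrative, not from the exam.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 5, 3                          # input units, hidden units

x = rng.normal(size=n)               # one input vector
w_hidden = rng.normal(size=(m, n))   # input-to-hidden weights w_{k,j}
W_out = rng.normal(size=m)           # hidden-to-output weights W_j

# Two-layer linear network, with g(x) = (1/n) * sum_i W_i x_i (b = 1, c = 0):
hidden = (w_hidden @ x) / n          # each hidden unit has n inputs
y_two_layer = (W_out @ hidden) / m   # the output unit has m inputs

# Equivalent single-layer network: fold both weight matrices into one map.
w_single = (W_out @ w_hidden) / (n * m)
y_single = w_single @ x

print(np.isclose(y_two_layer, y_single))  # True: the hidden layer collapses
```

The composition of linear maps is itself linear, which is why depth adds no representational power in this setting.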

Q3. Gaussian mixture models

3.1 The E-step in estimating a GMM infers the probabilistic membership of each data point in each component Z: P(Z_j|X_i), i = 1, ..., n, j = 1, ..., k, where i indexes data and j indexes components. Suppose a GMM has two components with known variances and equal priors:

0.5 · N(µ1, 1) + 0.5 · N(µ2, 1)

The observed data point is x1 = 2, and the current estimates of µ1 and µ2 are 2 and 1 respectively. Compute the component memberships of this observed data point for the next E-step (hint: the normal densities of a standardized variable y (µ = 0, σ = 1) at y = 0, 0.5, 1, 1.5, 2 are approximately 0.4, 0.35, 0.24, 0.13, 0.05).
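A minimal sketch of the E-step arithmetic for this setup, using exact normal densities rather than the rounded values in the hint; the function name responsibilities is my own, not the exam's.

```python
from math import exp, pi, sqrt

def normal_pdf(y: float) -> float:
    """Standard normal density N(0, 1) evaluated at y."""
    return exp(-0.5 * y * y) / sqrt(2 * pi)

def responsibilities(x: float, mu1: float, mu2: float) -> tuple[float, float]:
    """E-step memberships P(Z1|x), P(Z2|x) for equal priors and unit variances."""
    d1 = 0.5 * normal_pdf(x - mu1)   # prior * likelihood, component 1
    d2 = 0.5 * normal_pdf(x - mu2)   # prior * likelihood, component 2
    total = d1 + d2                  # normalizing constant P(x)
    return d1 / total, d2 / total

# x1 = 2 with current estimates mu1 = 2, mu2 = 1: the standardized
# distances are 0 and 1, with densities ~0.4 and ~0.24 as in the hint.
print(responsibilities(2.0, 2.0, 1.0))  # ~ (0.622, 0.378); with the hint's
                                        # rounded densities, 0.4/0.64 = 0.625
```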

3.2 The Gaussian mixture model (GMM) and the k-means algorithm are closely related: the latter is a special case of GMM. The likelihood of a GMM with Z denoting the latent components is typically expressed as

P(X) = ∑_Z P(X|Z) P(Z)

where P(X|Z) is the (multivariate) Gaussian likelihood conditioned on the mixture component and P(Z) is the prior on the components. Such a likelihood formulation can also be used to describe a k-means clustering model. Which of the following statements are true? Choose all correct options if there are multiple ones. (A small numeric sketch of the limiting behavior follows the options.)

a) P(Z) is uniform in k-means but this is not necessarily true in GMM.

b) The values in the covariance matrix in P(X|Z) tend towards zero in k-means but this is not so in GMM.

c) The values in the covariance matrix in P(X|Z) tend towards infinity in k-means but this is not so in GMM.

d) The covariance matrix in P(X|Z) in k-means is diagonal but this is not necessarily the case in GMM.
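To see how the E-step behaves as the covariances shrink, here is a small sketch, assuming equal priors and a shared spherical covariance σ²I (my simplification, not the exam's): the soft memberships harden toward 0/1 as σ² → 0, recovering the hard nearest-centroid assignments of k-means.

```python
import numpy as np

def membership_in_component_1(x, mu1, mu2, sigma2):
    """E-step membership of x in component 1 (equal priors, variance sigma2)."""
    d1 = np.exp(-((x - mu1) ** 2) / (2 * sigma2))
    d2 = np.exp(-((x - mu2) ** 2) / (2 * sigma2))
    return d1 / (d1 + d2)

x, mu1, mu2 = 1.8, 2.0, 1.0            # a point closer to component 1
for sigma2 in (1.0, 0.1, 0.01):
    print(sigma2, membership_in_component_1(x, mu1, mu2, sigma2))
# 1.0   -> ~0.57   (soft assignment)
# 0.1   -> ~0.95
# 0.01  -> ~1.00   (hard assignment: the k-means limit)
```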

Q4. Semi-Supervised learning

4.1 Assume we are trying to classify stocks to predict whether the stock will increase (class 1) or decrease (class 0) based on the stock closing value in the last 5 days (so our input is a vector with 5 values). We would like to use logistic regression for this task. We have both labeled and unlabeled data for the learning task. For each of the 4 semi-supervised learning methods we discussed in class, say whether it can be used for this classification problem or not. If it can, briefly explain if and how you should change the logistic regression target function to perform the algorithm. If it cannot, explain why.

1. Re-weighting

2. Co-Training

3. EM based

4. Minimizing overfitting

4.2 Assume we are using co-training to classify the rectangles and circles in Figure 2 a). The ? represents unlabeled data. For our co-training we use linear classifiers where the first classifier uses only the x coordinate values and the second only the y axis values. Choose from the answers below the number of iterations that will be performed until our co-training converges for this problem and briefly explain:

1. 1

2. 2

3. more than 2

4. Impossible to tell.

[Figure 2: (a) labeled and unlabeled points for co-training; (b) points for boosting]

4.3 Now assume that we are using boosting (with linear classifiers, so each classifier is a line in the two-dimensional plane) to classify the points in Figure 2 b). We terminate the boosting algorithm once we reach a t such that ε_t = 0 or after 100 iterations. How many iterations do we need to perform until the algorithm converges? Briefly explain.

1. 1

2. 2

3. more than 2

4. Impossible to tell.

Q5. Bayesian networks

5.1 1. Show that a ⊥ {b, c} | d (a is conditionally independent of {b, c} given d) implies a ⊥ b | d (a is conditionally independent of b given d).

2. Define the skeleton of a BN G over variables X as the undirected graph over X that contains an edge {X, Y} for every edge (X, Y) in G. Show that if two BNs G and G' over the same set of variables X have the same skeleton and the same set of v-structures, then they encode the same set of independence assumptions. A v-structure is a structure of 3 nodes X, Y, Z such that X → Y ← Z.

Hints: It suffices to show that for any independence assumption A ⊥ B | C (A, B, C are mutually exclusive sets of variables), it is encoded by G if and only if it is also encoded by G'. Show this via d-separation.

5.2 For each of the following pairs of BNs in Figure 3, determine if the two BNs are equivalent, i.e., they have the same set of independence assumptions. When they are equivalent, state one independence/conditional independence assumption shared by them. When they are not equivalent, state one independence/conditional independence assumption satisfied by one but not the other.

[Figure 3: pairs of Bayesian networks]

5.3 We refer to the Figure 4 in this question.

[Figure 4: a Bayesian network over variables including A, S, D, H, C, J]

1. What is the Markov blanket of {A, S}?

2. (T/F) For each of the following independence assumptions, please state whether it is true or false:

(a) D ⊥ S.

(b) D ⊥ S | H.

(c) C ⊥ J | H.

(d) C ⊥ J | A.

Q6. Hidden Markov Models

6.1 Assume we have temporal data from two classes (for example, 10 days of closing prices for stocks that increased/decreased on the following day). How can we use HMMs to classify this data?

6.2 Derive an expression for the probability:

P(o_1, . . . , o_t, q_{t-1} = s_v, q_t = s_j)

You may use any of the model parameters (starting, emission and transition probabilities) and the following constructs (defined and derived in class) as part of your derivation:

p_t(i) = P(q_t = s_i)

α_t(i) = P(o_1, . . . , o_t, q_t = s_i)

δ_t(i) = max_{q_1, . . . , q_{t-1}} P(q_1, . . . , q_{t-1}, q_t = s_i, o_1, . . . , o_t)

Note that you may not need to use all of these constructs to fully define the function listed above.
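For reference, a minimal implementation of the forward recursion that computes α, sketched under assumed array conventions (the names pi, A, B are mine, standing for the starting, transition, and emission probabilities respectively); the quantity in 6.2 factors through α as indicated in the closing comment.

```python
import numpy as np

def forward(pi, A, B, obs):
    """Forward variables: alpha[t, i] = P(o_1 .. o_{t+1}, q_{t+1} = s_i).

    pi:  (S,)    starting probabilities P(q_1 = s_i)
    A:   (S, S)  transition probabilities A[v, j] = P(q_{t+1} = s_j | q_t = s_v)
    B:   (S, O)  emission probabilities  B[j, o] = P(o | q = s_j)
    obs: observation indices o_1 .. o_T
    """
    S, T = len(pi), len(obs)
    alpha = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]                      # base case, t = 1
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]  # sum over previous states
    return alpha

# Under these conventions the quantity in 6.2 corresponds to
#   alpha[t - 2, v] * A[v, j] * B[j, obs[t - 1]]
# (0-indexed arrays versus 1-indexed time): the forward mass at s_v,
# times the transition v -> j, times the emission of o_t from s_j.
```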

6.3 The following questions refer to Figure 5. In that figure we present an HMM and specify both the transition and emission probabilities. Let p_t^A be the probability of being in state A at time t; define p_t^B and p_t^C similarly.

1. What is p_3^C?

2. Define p_t^2 as the probability of observing 2 at time point t. Express p_t^2 as a function of p_t^B and p_t^C (you can also use any of the model parameters defined in the figure if you need them).

[Figure 5: an HMM with states A, B, C and its transition and emission probabilities]

Q7. Dimension Reduction

7.1 Which of the following unit vectors expressed in coordinates (X1, X2) correspond to theoretically correct directions of the 1st (p) and 2nd (q) principal components (via linear PCA) respectively for the data shown in Figure 6? Choose all correct options if there are multiple ones.

a) i) p(1, 0) q(0, 1) ii) p(1/√(2), 1/√(2)) q(1/√(2), -1/√(2))

b) i) p(1, 0) q(0, 1) ii) p(1/√(2), -1/√(2)) q(1/√(2), 1/√(2))

c) i) p(0, 1) q(1, 0) ii) p(1/√(2), 1/√(2)) q(-1/√(2), 1/√(2))

d) i) p(0, 1) q(1, 0) ii) p(1/√(2), 1/√(2)) q(-1/√(2), -1/√(2))

e) All of the above are correct

7.2 In linear PCA, the covariance matrix of the data, C = XᵀX, is decomposed into a weighted sum of its eigenvalues λ_i and eigenvectors p_i:

C = ∑_i λ_i p_i p_iᵀ

[Figure 6: two-dimensional data, referenced in 7.1]

Prove mathematically that the first eigenvalue λ_1 is identical to the variance obtained by projecting the data onto the first principal component p_1 (hint: PCA maximizes variance by projecting data onto its principal components).
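A quick numeric check of this identity (a sketch, not the requested proof): with C = XᵀX as defined above, the sum of squared projections onto a unit-norm p_1 equals p_1ᵀCp_1 = λ_1.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 3)) @ rng.normal(size=(3, 3))
X = X - X.mean(axis=0)                   # center the data

C = X.T @ X                              # covariance matrix as in 7.2
eigvals, eigvecs = np.linalg.eigh(C)     # ascending eigenvalues
lam1, p1 = eigvals[-1], eigvecs[:, -1]   # top eigenpair

scores = X @ p1                          # projections onto p1 (unit norm)
print(np.isclose(scores @ scores, lam1)) # True: p1' C p1 = lam1
```

Note that with C defined as XᵀX (no 1/n factor), "variance" here means the sum of squared projections of the centered data; dividing both sides by n recovers the usual sample variance.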

7.3 The key assumption of a naive Bayes (NB) classifier is that features are independent, which does not always hold. Suppose that linear principal component analysis (PCA) is first used to transform the features, and NB is then used to classify data in this low-dimensional space. Is the following statement true? Justify your answer.

The independence assumption of NB would now be valid with PCA-transformed features because all principal components are orthogonal and hence uncorrelated.
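A small sketch of the distinction at stake (my addition): PCA removes linear correlation, but uncorrelated variables need not be independent. Below, z and z² − 1 are (nearly) uncorrelated, yet one is a deterministic function of the other.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=100_000)
w = z ** 2 - 1                    # mean-zero, deterministic function of z

# Correlation is ~0 because E[z * (z^2 - 1)] = E[z^3] - E[z] = 0
# for a symmetric distribution.
print(np.corrcoef(z, w)[0, 1])    # close to 0

# Yet knowing z fixes w exactly, so the two are strongly dependent:
# orthogonal/uncorrelated is a much weaker property than independent.
```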

Q8. Markov Decision Processes

Consider the following Markov Decision Process (MDP), describing a simple robot grid world. The values of the immediate rewards R are written next to the transitions; transitions with no value have an immediate reward of 0. Note that the action "go south" from state S5 results in one of two outcomes. With probability p the robot succeeds in transitioning to state S6 and receives immediate reward 100. However, with probability (1 − p) it gets stuck in sand and remains in state S5 with zero immediate reward. Assume the discount factor γ = 0.8 and the probability p = 0.9.

[Figure: grid-world MDP showing the states (including S5 and S6) and the immediate rewards on the transitions]
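Since the figure is not reproduced here, the following sketch encodes only the dynamics stated in the text (the S5 "go south" action: success probability p to S6 with reward 100, otherwise stay in S5 with reward 0); everything else about the grid is hypothetical. It shows the kind of calculation questions 1 and 4 call for.

```python
# Value of repeatedly attempting "go south" from S5, assuming (hypothetically)
# that S6 is absorbing with V(S6) = 0. The Bellman equation is then
#   V(S5) = p * (100 + GAMMA * 0) + (1 - p) * (0 + GAMMA * V(S5)),
# which solves to V(S5) = 100 * p / (1 - GAMMA * (1 - p)).
GAMMA = 0.8

def v_s5_go_south(p: float) -> float:
    """Fixed-point value of S5 under the always-go-south policy."""
    return 100.0 * p / (1.0 - GAMMA * (1.0 - p))

print(v_s5_go_south(0.9))   # ~97.8 at p = 0.9
# Question 4 asks for the threshold p below which the value of the
# alternative route ("go west", whose value depends on the figure)
# exceeds this quantity.
```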

1. Mark the state-action transition arrows that correspond to one optimal policy. If there is a tie, always choose the state with the smallest index.

2. Is it possible to change the value for γ so that the optimal policy is changed? If yes, give a new value for γ and describe the change in policy that it causes. Otherwise briefly explain why this is impossible.

3. Is it possible to change the immediate reward function so that the value function V changes but the optimal policy π remains unchanged? If yes, give such a change and describe the resulting changes to V. Otherwise briefly explain why this is impossible.

4. How sticky does the sand have to get before the robot will prefer to completely avoid it? Answer this question by giving a probability for p below which the optimal policy chooses actions that completely avoid the sand, even choosing the action "go west" over "go south" when the robot is in state S5.

Q9. Boosting

9.1 AdaBoost can be understood as an optimizer that minimizes an exponential loss function E = ∑_{i=1}^{N} exp(−y_i f(x_i)), where y = +1 or −1 is the class label, x is the data, and f(x) is the weighted sum of weak learners. Show that the loss function E is strictly greater than, and hence an upper bound on, the 0-1 loss function E_{0-1} = ∑_{i=1}^{N} 1 · (y_i f(x_i) < 0) (hint: E_{0-1} is a step function that assigns value 1 if the classifier predicts incorrectly and 0 otherwise).
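A pointwise numeric check of the per-example inequality the proof rests on (my addition): for every margin m = y · f(x), exp(−m) > 1 when m < 0 and exp(−m) > 0 when m ≥ 0, so the exponential loss strictly dominates the 0-1 loss term by term.

```python
import numpy as np

margins = np.linspace(-3, 3, 13)           # m = y * f(x)
exp_loss = np.exp(-margins)                # exponential loss per example
zero_one = (margins < 0).astype(float)     # 0-1 loss per example

# exp(-m) > 1 for m < 0 (misclassified) and exp(-m) > 0 for m >= 0
# (correct), so the bound holds strictly at every point, and summing
# over examples gives E > E_{0-1}.
print(np.all(exp_loss > zero_one))         # True
```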

9.2 The AdaBoost algorithm has two caveats. Answer the following questions regarding these.

a) Show mathematically why a weak learner with < 50% predictive accuracy presents a problem to AdaBoost.

b) AdaBoost is susceptible to outliers. Suggest a simple heuristic that relieves this.

9.3 Figure 7 illustrates the decision boundary (the middle intersecting line) after the first iteration of an AdaBoost classifier with decision stumps as the weak learner. The square points are from class −1 and the circles are from class 1. Draw (roughly) in a solid line the decision boundary at the second iteration. Draw in a dashed line the ensemble decision boundary based on the decisions at iterations 1 and 2. State your reasoning.

[Figure 7: squares (class −1) and circles (class 1) with the decision boundary after the first boosting iteration]
