Question 1. Rewrite the objective in the form
f(x) = (1/(2n)) ||y - Ax||^2
where A is the data matrix and n is the number of data points.
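For concreteness, here is a minimal sketch of how one might set up this objective in Python. The data matrix A and target vector y are assumed to come from the attached data set; since that file is not reproduced here, random stand-in data is used, and the helper name f is purely illustrative.

```python
import numpy as np

# Sketch only: A (n x 3 data matrix) and y (length-n target vector) are
# assumed to come from the attached data set. Random stand-ins are used here.
rng = np.random.default_rng(0)
n = 100
A = rng.normal(size=(n, 3))
y = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

def f(x):
    """Objective f(x) = (1/(2n)) ||y - Ax||^2."""
    r = y - A @ x
    return r @ r / (2 * n)
```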
Question 2. Implement the gradient descent (GD) method, which consists of the iterations:
x_{k+1} = x_k - α_k ∇f(x_k)
where
∇f(x) = -(1/n) A^T (y - Ax),
and we choose α_k = a > 0 for some constant a. Tune your algorithm by trying different values of a. You can stop your algorithm when ||∇f(x_k)|| ≤ ε with ε = 10^{-2}. For which values of a do you see divergence or convergence in your algorithm?
How close is the solution you compute with GD to the actual solution
x* = [x*_1, x*_2, x*_3], where x* = (A^T A)^{-1} A^T y?
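A possible implementation sketch of the GD loop with a constant stepsize and the stopping test above, reusing the A, y, and f defined in the earlier sketch; the names grad_f, gd, and x_star are illustrative, not part of the assignment.

```python
def grad_f(x):
    """Full gradient: grad f(x) = -(1/n) A^T (y - Ax)."""
    return -A.T @ (y - A @ x) / n

def gd(a, x0=None, eps=1e-2, max_iter=10_000):
    """Gradient descent with constant stepsize alpha_k = a.
    Stops when ||grad f(x_k)|| <= eps; returns the last iterate and history."""
    x = np.zeros(A.shape[1]) if x0 is None else x0.copy()
    history = [x.copy()]
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        x = x - a * g
        history.append(x.copy())
    return x, np.array(history)

# Closed-form least-squares solution x* = (A^T A)^{-1} A^T y
x_star = np.linalg.solve(A.T @ A, A.T @ y)

for a in [0.01, 0.1, 0.5, 1.0, 2.0]:   # example constant stepsizes to try
    x_gd, hist = gd(a)
    print(f"a = {a}: {len(hist)} iterations, ||x_gd - x*|| = "
          f"{np.linalg.norm(x_gd - x_star):.2e}")
```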
Question 3. Plot the level sets of the objective f(x_1, x_2, x_3) when x_1 = x*_1. In other words, plot the level sets of f(x*_1, x_2, x_3) versus x_2 and x_3. Mark the iterates of gradient descent on the level set plots and verify that gradient descent moves perpendicular to the level sets. Visualize the function f(x*_1, x_2, x_3) in 3D as a function of x_2 and x_3. Does it look like a "cereal bowl"?
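One way to produce these plots with matplotlib, assuming the gd routine, f, and x_star from the previous sketches; the grid ranges are arbitrary choices.

```python
import matplotlib.pyplot as plt

# Run GD once to get iterates to overlay on the level sets (stepsize is an example).
_, hist = gd(a=0.1)

x2 = np.linspace(x_star[1] - 2, x_star[1] + 2, 200)
x3 = np.linspace(x_star[2] - 2, x_star[2] + 2, 200)
X2, X3 = np.meshgrid(x2, x3)

# Evaluate f(x*_1, x_2, x_3) on the grid
F = np.array([[f(np.array([x_star[0], u, v])) for u in x2] for v in x3])

fig, ax = plt.subplots()
cs = ax.contour(X2, X3, F, levels=20)
ax.clabel(cs, inline=True, fontsize=7)
# Mark the (x_2, x_3) coordinates of the GD iterates on the level sets
ax.plot(hist[:, 1], hist[:, 2], "o-", color="red", markersize=3)
ax.set_xlabel("x2"); ax.set_ylabel("x3")

# 3D "cereal bowl" view of f(x*_1, x_2, x_3)
fig2 = plt.figure()
ax3d = fig2.add_subplot(projection="3d")
ax3d.plot_surface(X2, X3, F, cmap="viridis", alpha=0.8)
ax3d.set_xlabel("x2"); ax3d.set_ylabel("x3"); ax3d.set_zlabel("f")
plt.show()
```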
Question 4. Repeat the previous question with different choices of the stepsize α_k in your implementation (a sketch covering all three rules follows part (c) below):
(a) α_k = a > 0 is a constant. How does the performance change when we vary a? For which values of a do you see divergence or convergence? How large should a be for you to observe divergence of GD?
(b) Repeat part (a) with α_k = a/k, varying the values of a.
(c) Repeat part (a) with α_k = a/√k. Based on these experiments, what is the best stepsize rule, in your opinion, to get the best performance?
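A sketch of the three stepsize rules in one routine, again reusing A, grad_f, and x_star from the earlier sketches; the particular values of a in the loop are only examples to try.

```python
def gd_with_rule(a, rule="const", eps=1e-2, max_iter=10_000):
    """Gradient descent with stepsize rules: constant a, a/k, or a/sqrt(k)."""
    x = np.zeros(A.shape[1])
    for k in range(1, max_iter + 1):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        if rule == "const":
            alpha = a
        elif rule == "1/k":
            alpha = a / k
        else:                      # "1/sqrt(k)"
            alpha = a / np.sqrt(k)
        x = x - alpha * g
    return x, k

for rule in ["const", "1/k", "1/sqrt(k)"]:
    for a in [0.1, 0.5, 1.0]:
        x_out, iters = gd_with_rule(a, rule)
        print(f"{rule:10s} a={a}: {iters} iterations, "
              f"error {np.linalg.norm(x_out - x_star):.2e}")
```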
Question 5. Note that
f(x) = (1/n) Σ_{i=1}^{n} f_i(x), where f_i(x) = (1/2) (y_i - a_i^T x)^2
is the loss of data point i and a_i^T is the i-th row of the data matrix A. Note that each row comes from a single data point, and the gradient of f_i is
∇f_i(x) = -a_i (y_i - a_i^T x).
At step k, choose a data point index i_k uniformly at random from the set of all indices {1, 2, . . . , n} and consider the iterations
x_{k+1} = x_k - α_k ∇f_{i_k}(x_k),
where we replace the gradient in the GD method with a random estimate of it, i.e. ∇f_{i_k}. This is called the stochastic gradient descent (SGD) method.
Repeat Questions 3 and 4 for stochastic gradient descent. Can SGD be faster than GD for different values of the target accuracy ε = 10^{-1}, 10^{-2}, 10^{-3}, etc.? Please explain.
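A minimal SGD sketch under the same assumptions (A, y, grad_f, and x_star as defined in the earlier sketches); how often the full-gradient stopping test is checked is an arbitrary choice.

```python
def sgd(a, rule="const", eps=1e-2, max_iter=200_000, seed=0):
    """Stochastic gradient descent: at each step pick i_k uniformly at random
    and step along -grad f_{i_k}(x) = a_{i_k} (y_{i_k} - a_{i_k}^T x)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(A.shape[1])
    for k in range(1, max_iter + 1):
        i = rng.integers(n)                 # i_k uniform over {0, ..., n-1}
        g_i = -A[i] * (y[i] - A[i] @ x)     # stochastic gradient estimate
        alpha = a if rule == "const" else a / np.sqrt(k)
        x = x - alpha * g_i
        # The stopping test still uses the full gradient; checking it at every
        # step would be expensive, so it is only checked occasionally.
        if k % 100 == 0 and np.linalg.norm(grad_f(x)) <= eps:
            break
    return x, k

x_sgd, iters = sgd(a=0.1, rule="1/sqrt(k)")
print(iters, np.linalg.norm(x_sgd - x_star))
```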
Attachment:- Question.rar