Problem 1:
One question I've gotten here is - should the noise values for a given component be the same across all data points? No. Every time you generate a data point, you need to generate new noise terms. The distribution across data points is the same, but each data point gets different random noise values.
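As a minimal sketch of what I mean (numpy; the function name `generate_data` and the uniform-x / Gaussian-noise choices here are placeholders - use whatever distributions your assignment actually specifies):

```python
import numpy as np

def generate_data(w_true, n_points, noise_std=1.0, seed=None):
    """Generate (X, y) where each data point gets its OWN noise draw.

    The noise *distribution* is the same for every point, but the
    sampled values differ from point to point.
    """
    rng = np.random.default_rng(seed)
    d = len(w_true)
    X = rng.uniform(-1.0, 1.0, size=(n_points, d))
    noise = rng.normal(0.0, noise_std, size=n_points)  # fresh draw per point
    y = X @ w_true + noise
    return X, y
```

The key line is the `noise` draw: it samples `n_points` separate values, rather than one value reused everywhere.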
Problem 3/4:
One question I am getting a lot for these problems - when do you stop training?
Remember that our goal is to minimize the loss. The minimum of the loss may not be zero, so we can't rely on the loss hitting zero.
If the training loss /stops decreasing/ (or at least stops decreasing meaningfully), however, there's probably no point in going further.
Note that we are trying to find a point where the gradient of the (training) loss is zero. If you evaluated the gradient and found out how big it is (norm of the vector), you could use how close it is to zero as a threshold for stopping as well.
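Here's a rough sketch of that gradient-norm stopping rule, assuming a plain linear model with mean-squared-error loss (the function name, tolerance, and defaults are mine, not part of the assignment):

```python
import numpy as np

def train_until_flat(X, y, alpha=0.01, grad_tol=1e-6, max_iters=100_000):
    """Gradient descent on MSE, stopping once the gradient norm falls
    below grad_tol - i.e. once we're near a point where the gradient
    of the training loss is (close to) zero."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_iters):
        grad = 2.0 / n * X.T @ (X @ w - y)   # gradient of mean squared error
        if np.linalg.norm(grad) < grad_tol:  # "close enough to zero"
            break
        w -= alpha * grad
    return w
```

The loss-plateau criterion works the same way: keep the previous loss value around and stop when the improvement per step drops below some threshold.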
The third thing we've discussed is monitoring the testing error: if the testing error has plateaued or has started to increase, there's no real point in going further either.
Note that if you are doing batch or stochastic gradient descent (which is fine), you may get more random fluctuations in your loss over time.
Additional point: many people are asking "can't I just run training a fixed number of times?" To which I would ask - are you sure it is converging? How do you know that the loss isn't still going down when you reach the end of these iterations? You can't really know unless you've done some experiments and watched the loss decrease.
Another question I've been getting - how do you know when an alpha is good? In general, if the loss is diverging to infinity, your alpha is definitely too large. Additionally, we can argue that if the alpha is small enough, we get a decrease in loss (even if it is slow). You could use this as the basis of a sort of binary search - if the loss is diverging, decrease alpha; if it is converging, increase alpha - bounce back and forth as needed.
But in general, if you're worried about how big alpha is - one thing to look at would be the size (norm) of the gradients, w.grad. If these values are all very large, then you need a smaller alpha. If these values are all very small, you can use a larger alpha. (We talked about something similar tonight.)
One thing you may see here - if your initial weights are large, your initial gradients may also be very large, which means that you may need a very small alpha at the start. One thing to test here is that if you initialize the weights of your model smaller, you get smaller gradients, and can get convergence with larger alpha.
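One quick way to see this effect, assuming MSE loss on a linear model (the function and data here are purely illustrative):

```python
import numpy as np

def initial_grad_norm(X, y, init_scale, seed=0):
    """Norm of the FIRST MSE gradient, for weights drawn at a given scale.

    Larger initial weights -> larger initial residuals -> larger gradient,
    which forces a smaller alpha at the start of training.
    """
    rng = np.random.default_rng(seed)
    w0 = init_scale * rng.standard_normal(X.shape[1])
    grad = 2.0 / len(y) * X.T @ (X @ w0 - y)
    return np.linalg.norm(grad)
```

Comparing `initial_grad_norm(X, y, 100.0)` against `initial_grad_norm(X, y, 0.01)` on the same data makes the point: the large-init gradient is orders of magnitude bigger.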
How does the fit model compare to the 'actual' model? The 'actual model' here is the set of weights actually used to generate the y, as noted earlier in the file.
Last thing I'll say for this problem - you should expect the training and testing loss to be /different/. But you should expect them to be close. Because the data is linear in its features, and the model is linear in the data, this is about as simple as it gets and you shouldn't see dramatic divergence between training and testing loss.
Problem 6
Notice that when lambda = 0, the training and testing loss should match previous experiments.
Sanity check: notice that when I ask you to add the ridge penalty, I am not including w0.
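As a sanity-check sketch, assuming w[0] is the bias/intercept weight (adjust the indexing to match your own setup - this convention is my assumption, not the assignment's):

```python
import numpy as np

def ridge_loss(w, X, y, lam):
    """MSE plus the ridge penalty; note the penalty skips w[0] (the bias)."""
    mse = np.mean((X @ w - y) ** 2)
    penalty = lam * np.sum(w[1:] ** 2)   # w[0] is NOT penalized
    return mse + penalty
```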
When lambda > 0, minimizing the RidgeLoss on the training data isn't necessarily going to produce the smallest Loss (comparing y and ypred) on the training data. So you should see that as lambda increases, Loss(w) on the training set should increase. As lambda gets very big, though, this loss should plateau out to a constant, as the w vector is essentially converging to zero.
However, the Loss on the /testing/ set may give more interesting results. You may actually see that TestingLoss(w) (with no ridge penalty) starts by /decreasing/ as lambda increases, hitting a minimum, then starting to increase. Note that as lambda increases, w is converging to zero, and thus we see that TestingLoss will also plateau out to a constant as the w converges.
This lambda where TestingLoss is minimized is optimal, giving a model that generalizes better to novel data than without the ridge penalty term.
Note - you may find (based on the randomness in the data) that there is no initial dip, and the best lambda is zero. This is possible. But one thing I'd encourage you to do is zoom in, check small lambda (and check that you are actually converging to minimum ridgeloss in training), and make sure you didn't miss it.
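Putting the lambda sweep together, here is one possible sketch (gradient descent with my own default alpha/iteration counts, and again assuming w[0] is the unpenalized bias; the important detail is that the *test* evaluation uses the plain, penalty-free MSE):

```python
import numpy as np

def ridge_gd(X, y, lam, alpha=0.01, iters=5000):
    """Minimize MSE + lam * ||w[1:]||^2 by gradient descent."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        grad = 2.0 / n * X.T @ (X @ w - y)
        grad[1:] += 2.0 * lam * w[1:]      # no penalty gradient on the bias
        w -= alpha * grad
    return w

def best_lambda(Xtr, ytr, Xte, yte, lambdas):
    """Train at each lambda; return the lambda with the smallest *plain*
    test MSE (no penalty term when evaluating), plus all the test MSEs."""
    test_mse = [np.mean((Xte @ ridge_gd(Xtr, ytr, lam) - yte) ** 2)
                for lam in lambdas]
    return lambdas[int(np.argmin(test_mse))], test_mse
```

When you zoom in on small lambda, make sure the inner training run has actually converged at each lambda - otherwise you're comparing half-trained models, not the lambdas themselves.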
Problem 7
I'll note two main things here. First, I think you'll probably get the most interesting behavior for beta between zero and 5. Second, because of the complication introduced by the absolute value (the non-differentiable point at the origin), convergence from gradient descent is a lot harder. I got my best results (see my previous announcement on the subject) by decreasing alpha further and further over time.
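A sketch of the decreasing-alpha idea, using plain mean absolute error as a stand-in for whatever your beta-dependent loss actually is (subgradient descent, since the kink means the loss isn't differentiable at zero residual; the decay schedule and constants are my choices):

```python
import numpy as np

def train_mae_decaying(X, y, alpha0=0.5, decay=0.999, iters=10_000):
    """Subgradient descent on mean absolute error with a shrinking step
    size. With a FIXED alpha, the iterates tend to bounce around the
    minimum forever because of the kink in |.|; decaying alpha lets
    them settle."""
    n, d = X.shape
    w = np.zeros(d)
    alpha = alpha0
    for _ in range(iters):
        g = X.T @ np.sign(X @ w - y) / n   # subgradient of MAE
        w -= alpha * g
        alpha *= decay                      # alpha -> 0 over time
    return w
```

Geometric decay is just one option; alpha0 / (1 + t) style schedules work too.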
Problem 8
I don't think there's much to say here. I'm asking you to transform your data set by multiplying every point by A, to make each point in the training and testing sets k-dimensional. A is randomly generated, and you should probably run a lot of experiments to get a sense of the average behavior for a given value of k. I think the trajectory of the test loss here is potentially surprising.
NOTE: In case it is an issue, we are treating A as /fixed/, not something that is trainable.
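A sketch of the transformation, emphasizing that A is generated once and then held /fixed/ for both sets (names and the standard-normal choice for A are mine):

```python
import numpy as np

def project_data(X_train, X_test, k, seed=0):
    """Map every point to k dimensions via the SAME fixed random A (d x k).

    A is generated once and reused for both training and testing data;
    it is never updated during training.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((X_train.shape[1], k))
    return X_train @ A, X_test @ A
```

If you regenerate A between the training and testing sets, the two sets live in "different" feature spaces and the test loss will be garbage - that's the bug this note is warning about.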
You could theoretically use the closed-form solution to the problem here if you wanted to. This avoids the problem of alpha and learning rates etc, but again the closed-form solution isn't going to generalize.
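If you do go the closed-form route, it's essentially one line (I'm using `lstsq` rather than explicitly inverting X^T X, purely for numerical stability - both compute the least-squares solution):

```python
import numpy as np

def closed_form_ols(X, y):
    """Least-squares weights, i.e. the solution of the normal equations
    w = (X^T X)^{-1} X^T y, with no alpha or learning-rate tuning needed."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w
```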
Problem 9
Not much to say here that wasn't said in the previous problem - I'll note that the closed-form solution isn't going to work here though since the matrix you need to invert won't be invertible (hence why regularization is needed).
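For contrast, the ridge term is exactly what restores invertibility: adding lambda * I to a singular X^T X makes it invertible. A minimal sketch (note this simple version penalizes /every/ weight, including w[0], unlike Problem 6 - adjust to match your setup):

```python
import numpy as np

def closed_form_ridge(X, y, lam):
    """Ridge solution w = (X^T X + lam * I)^{-1} X^T y.

    When X has more columns than rows (e.g. after projecting up to a
    large k), X^T X is singular and plain OLS inversion fails; the
    lam * I term makes the matrix invertible for any lam > 0."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```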