How does the fit model compare to the actual model

Reference no: EM133703594

Problem 1:
One question I've gotten here is - should the noise values for a given component be the same across all data points? No. Every time you generate a data point, you need to generate new noise terms. The distribution across data points is the same, but each data point gets different random noise values.
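A minimal sketch of the point about noise, assuming a hypothetical setup where y is a linear function of the features plus Gaussian noise. The real generating process is the one given in the assignment file; `w_true`, `sigma`, `make_point`, and `make_dataset` are illustrative names, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "actual" weights; the assignment file defines the real ones.
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1  # assumed noise standard deviation


def make_point(rng):
    """Generate one (x, y) pair, drawing a NEW noise term on every call."""
    x = rng.normal(size=w_true.shape[0])
    noise = rng.normal(scale=sigma)  # fresh draw, same distribution each time
    y = x @ w_true + noise
    return x, y


def make_dataset(n, rng):
    """Stack n independently generated points into X (n, d) and y (n,)."""
    points = [make_point(rng) for _ in range(n)]
    X = np.array([p[0] for p in points])
    y = np.array([p[1] for p in points])
    return X, y


X, y = make_dataset(5, rng)
residual_noise = y - X @ w_true  # recovers the per-point noise terms
print(residual_noise)
```

Each entry of `residual_noise` comes out different, which is the behavior you want: one distribution, a new draw per data point.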
Problem 3/4:
One question I am getting a lot for these problems - when do you stop training?
Remember that our goal is to minimize the loss. The minimum of the loss may not be zero, so we can't rely on the loss hitting zero.
If the training loss /stops decreasing/ however, (or at least 'meaningfully' decreasing), there's probably no point in going further.
Note that we are trying to find a point where the gradient of the (training) loss is zero. If you evaluated the gradient and found out how big it is (norm of the vector), you could use how close it is to zero as a threshold for stopping as well.
The third thing that we've discussed is monitoring the testing error. Because if the testing error has plateaued or has started to increase, then there's no real point in going further either.
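Note that if you are doing mini-batch or stochastic gradient descent (which is fine), you may see more random fluctuation in your loss over time, so judge whether it has stopped decreasing from the trend over many steps rather than from one step to the next.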
Additional point: many people are asking "can't I just run training a fixed number of times?" To which I would ask - are you sure it is converging? How do you know that the loss isn't still going down when you reach the end of these iterations? You can't really know unless you've done some experiments and watched the loss decrease.
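The stopping rules above can be sketched roughly as follows, assuming full-batch gradient descent on a mean squared error loss. The function names, the thresholds `loss_tol` and `grad_tol`, and the toy data are all assumptions for illustration; `max_iters` is only a safety net, and the returned reason tells you whether it was hit. Monitoring the testing loss would be a third check added in the same loop.

```python
import numpy as np


def mse_loss(w, X, y):
    r = X @ w - y
    return float(r @ r) / len(y)


def mse_grad(w, X, y):
    return 2.0 * X.T @ (X @ w - y) / len(y)


def train(X, y, alpha=0.05, loss_tol=1e-10, grad_tol=1e-6, max_iters=100_000):
    """Gradient descent that reports WHY it stopped."""
    w = np.zeros(X.shape[1])
    prev = mse_loss(w, X, y)
    history = [prev]
    for _ in range(max_iters):
        g = mse_grad(w, X, y)
        # Criterion 1: the gradient is (nearly) zero.
        if np.linalg.norm(g) < grad_tol:
            return w, history, "gradient norm below threshold"
        w = w - alpha * g
        cur = mse_loss(w, X, y)
        history.append(cur)
        # Criterion 2: the loss is no longer meaningfully decreasing.
        if prev - cur < loss_tol:
            return w, history, "loss stopped decreasing"
        prev = cur
    return w, history, "hit max_iters (may not have converged)"


# Toy data, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=200)

w, history, reason = train(X, y)
print(reason, len(history), history[-1])
```

If the reason that comes back is the `max_iters` one, that is your signal that a fixed number of iterations was not enough.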
Another question I've been getting - how do you know when an alpha is good? In general, if the loss is diverging to infinity - your alpha is definitely too large. Additionally, we can argue that if the alpha is small enough, we get a decrease in loss (even if it is slow). You could use this as a basis of a sort of binary search - if you have diverging loss, decrease alpha. If it is converging, increase alpha - bounce back and forth as needed.
But in general, if you're worried about how big alpha is - one thing to look at would be the size (norm) of the gradients, w.grad. If these values are all very large, then you need a smaller alpha. If these values are all very small, you can use a larger alpha. (We talked about something similar tonight.)
One thing you may see here - if your initial weights are large, your initial gradients may also be very large, which means that you may need a very small alpha at the start. One thing to test here is that if you initialize the weights of your model smaller, you get smaller gradients, and can get convergence with larger alpha.
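Here is one way the bounce-back-and-forth search for alpha could look, as a sketch only. It uses NumPy with a hand-written gradient rather than `w.grad`, and `converges`, `search_alpha`, the trial length, and the starting bracket are my own illustrative choices. It assumes the low end of the bracket converges and the high end diverges.

```python
import numpy as np


def loss_and_grad(w, X, y):
    r = X @ w - y
    return float(r @ r) / len(y), 2.0 * X.T @ r / len(y)


def converges(alpha, X, y, w0, steps=200):
    """Run a short trial; report whether the loss ended lower than it started."""
    w = w0.copy()
    start, _ = loss_and_grad(w, X, y)
    for _ in range(steps):
        loss, g = loss_and_grad(w, X, y)
        # Bail out early if the loss is blowing up.
        if not np.isfinite(loss) or loss > 1e6 * max(start, 1.0):
            return False
        w = w - alpha * g
    final, _ = loss_and_grad(w, X, y)
    return bool(np.isfinite(final) and final < start)


def search_alpha(X, y, w0, lo=1e-6, hi=10.0, rounds=20):
    """Bisect on a log scale between a converging alpha and a diverging one."""
    for _ in range(rounds):
        mid = float(np.sqrt(lo * hi))
        if converges(mid, X, y, w0):
            lo = mid   # converging: try a larger alpha
        else:
            hi = mid   # diverging: try a smaller alpha
    return lo


rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

best_alpha = search_alpha(X, y, w0=np.zeros(3))

# Larger initial weights give larger initial gradients.
_, g_small = loss_and_grad(0.01 * np.ones(3), X, y)
_, g_large = loss_and_grad(100.0 * np.ones(3), X, y)
print(best_alpha, np.linalg.norm(g_small), np.linalg.norm(g_large))
```

The value returned is the largest alpha seen to decrease the loss in the short trial, which sits near the edge of stability. In practice you would likely back off from it by some factor.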
How does the fit model compare to the 'actual' model? The 'actual model' here is the set of weights actually used to generate the y, as noted earlier in the file.
Last thing I'll say for this problem - you should expect the training and testing loss to be /different/. But you should expect them to be close. Because the data is linear in its features, and the model is linear in the data, this is about as simple as it gets and you shouldn't see dramatic divergence between training and testing loss.
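A small sketch of both comparisons, under assumed data: the distance between the fit weights and the weights used to generate y, and the gap between training and testing loss. The weights, noise level, and sample sizes are made up, and `np.linalg.lstsq` stands in for a converged gradient-descent fit.

```python
import numpy as np

rng = np.random.default_rng(1)
w_true = np.array([2.0, -1.0, 0.5])   # hypothetical generating weights
sigma = 0.3                           # assumed noise level


def sample(n):
    X = rng.normal(size=(n, 3))
    return X, X @ w_true + rng.normal(scale=sigma, size=n)


X_train, y_train = sample(500)
X_test, y_test = sample(500)

# Stand-in for a converged gradient-descent fit.
w_fit, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))


weight_error = float(np.linalg.norm(w_fit - w_true))
train_loss = mse(w_fit, X_train, y_train)
test_loss = mse(w_fit, X_test, y_test)
print(weight_error, train_loss, test_loss)
```

The two losses will not be identical, but in a linear-data, linear-model setup like this one they should land close to each other.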
Problem 6:
Notice that when lambda = 0, the training and testing loss should match previous experiments.
Sanity check: notice that when I ask you to add the ridge penalty, I am not including w0.
When lambda > 0, minimizing the RidgeLoss on the training data won't necessarily produce the smallest Loss (comparing y and ypred) on the training data. So as lambda increases, you should see Loss(w) on the training set increase. As lambda gets very big, though, this loss should plateau at a constant, as the penalized weights in w essentially converge to zero.
However, the Loss on the /testing/ set may give more interesting results. You may actually see that TestingLoss(w) (with no ridge penalty) starts by /decreasing/ as lambda increases, hitting a minimum, then starting to increase. Note that as lambda increases, w is converging to zero, and thus we see that TestingLoss will also plateau out to a constant as the w converges.
This lambda where TestingLoss is minimized is optimal, giving a model that generalizes better to novel data than without the ridge penalty term.
Note - you may find (based on the randomness in the data) that there is no initial dip, and the best lambda is zero. This is possible. But one thing I'd encourage you to do is zoom in, check small lambda (and check that you are actually converging to minimum ridgeloss in training), and make sure you didn't miss it.
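A sketch of a lambda sweep, assuming the RidgeLoss is mean squared error plus lambda times the sum of squared weights, with w0 left out of the penalty. To keep the example short it solves for the ridge minimizer in closed form instead of running gradient descent. The data sizes, noise level, and lambda grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test, d = 30, 1000, 20
w_true = rng.normal(size=d)            # hypothetical generating weights
w0_true = 1.5                          # hypothetical intercept


def sample(n):
    X = rng.normal(size=(n, d))
    y = w0_true + X @ w_true + rng.normal(scale=2.0, size=n)
    return X, y


def add_bias(X):
    return np.hstack([np.ones((len(X), 1)), X])


def ridge_fit(X, y, lam):
    """Minimize mean squared error + lam * sum(w[1:]**2); w[0] is unpenalized."""
    Xb = add_bias(X)
    penalty = np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0                # do not shrink the intercept w0
    A = Xb.T @ Xb / len(y) + lam * penalty
    return np.linalg.solve(A, Xb.T @ y / len(y))


def loss(w, X, y):
    """Plain Loss comparing y and ypred, with NO ridge penalty."""
    return float(np.mean((add_bias(X) @ w - y) ** 2))


X_train, y_train = sample(n_train)
X_test, y_test = sample(n_test)

lambdas = [0.0, 0.01, 0.1, 1.0, 10.0, 100.0, 1e4]
fits = [ridge_fit(X_train, y_train, lam) for lam in lambdas]
train_losses = [loss(w, X_train, y_train) for w in fits]
test_losses = [loss(w, X_test, y_test) for w in fits]

for lam, tr, te in zip(lambdas, train_losses, test_losses):
    print(f"lambda={lam:g}  train={tr:.3f}  test={te:.3f}")
```

The training column should never go down as lambda grows. Whether the testing column dips before rising depends on the random data, so a finer grid near zero is worth trying before concluding the best lambda is zero.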
Problem 7:
Two main things to note here. First, I think you'll probably get the most interesting behavior for beta between 0 and 5. Second, because of the complication introduced by the absolute value (the non-differentiable point at the origin), convergence from gradient descent is a lot harder. I got my best results (see my previous announcement on the subject) by decreasing alpha further and further over time.
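A sketch of what a shrinking alpha could look like, assuming the Problem 7 loss is mean squared error plus beta times the sum of |w_j|. The schedule alpha0 / sqrt(t + 1) is my assumption, not necessarily the one from the earlier announcement, and for simplicity this example has no separate w0 term. Because the loss can bounce around near the kink at zero, the loop keeps the best weights seen so far.

```python
import numpy as np


def lasso_loss(w, X, y, beta):
    return float(np.mean((X @ w - y) ** 2)) + beta * float(np.sum(np.abs(w)))


def lasso_subgrad(w, X, y, beta):
    # np.sign(0) == 0, which is a valid subgradient of |w| at the origin.
    return 2.0 * X.T @ (X @ w - y) / len(y) + beta * np.sign(w)


def train_lasso(X, y, beta, alpha0=0.1, iters=5000):
    """Subgradient descent with a step size that decays over time."""
    w = np.zeros(X.shape[1])
    best_w, best_loss = w.copy(), lasso_loss(w, X, y, beta)
    for t in range(iters):
        alpha = alpha0 / np.sqrt(t + 1.0)   # assumed decay schedule
        w = w - alpha * lasso_subgrad(w, X, y, beta)
        cur = lasso_loss(w, X, y, beta)
        if cur < best_loss:                  # loss is not monotone; keep the best
            best_w, best_loss = w.copy(), cur
    return best_w, best_loss


# Toy data with a sparse hypothetical weight vector.
rng = np.random.default_rng(6)
X = rng.normal(size=(200, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=200)

beta = 0.5
w_lasso, final_loss = train_lasso(X, y, beta)
w_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(w_lasso, final_loss, lasso_loss(w_ols, X, y, beta))
```

With a fixed alpha, the weights that should be zero tend to oscillate around zero with an amplitude on the order of alpha times beta, which is why shrinking alpha helps.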
Problem 8:
I don't think there's much to say here. I'm asking you to transform your data set by multiplying every point by A, to make each point in the training and testing sets k-dimensional. A is randomly generated, and you should probably run a lot of experiments to get a sense of the average for a given value of k. I think the trajectory of the test loss here is potentially surprising.
NOTE: In case it is an issue, we are treating A as /fixed/, not something that is trainable.
You could theoretically use the closed-form solution to the problem here if you wanted to. This avoids the problem of alpha and learning rates etc., but again, the closed-form approach isn't going to carry over to every problem.
Problem 9:
Not much to say here that wasn't said in the previous problem - I'll note that the closed-form solution isn't going to work here though since the matrix you need to invert won't be invertible (hence why regularization is needed).
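To see the invertibility issue concretely, here is a sketch assuming the problem arises because k is larger than the number of independent directions in the data. The sizes below are hypothetical, picked only to force that situation, and `lam` is an arbitrary small ridge strength.

```python
import numpy as np

rng = np.random.default_rng(5)
n, d, k = 20, 5, 50                    # hypothetical sizes with k > n
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

A = rng.normal(size=(k, d))            # fixed random transform
Z = X @ A.T                            # transformed data, shape (n, k)

# The matrix the closed form would invert: k-by-k, but rank at most min(n, d).
gram = Z.T @ Z
rank = int(np.linalg.matrix_rank(gram))
print(f"gram is {gram.shape[0]}x{gram.shape[1]} with rank {rank}")

# Adding lam * I makes the system solvable again.
lam = 1e-2
w_ridge = np.linalg.solve(gram / n + lam * np.eye(k), Z.T @ y / n)
print(w_ridge.shape, float(np.mean((Z @ w_ridge - y) ** 2)))
```

The rank comes out well below k, so the unregularized matrix has no inverse, while the regularized system has a unique solution.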
