1. MDL with multiplicative error bound
In deriving the error bound for MDL in class, we used Hoeffding's inequality and obtained the bound
L_D(h) ≤ L_S(h) + √((|h| + log(2/δ)) / (2m))
Instead of using Hoeffding's inequality, we will use the following inequality, due to Haussler. Consider m i.i.d. random variables Y_1, . . . , Y_m in the range [0, M] with expected value µ. Let µ̂ = (1/m) Σ_{i=1}^{m} Y_i. Assume ε > 0 and 0 < α < 1. Then

P(|µ̂ − µ| > α(ε + µ̂ + µ)) ≤ 2 exp(−α²εm/M)
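As a quick numerical sanity check (not part of the exercise), the following Python sketch estimates the deviation probability in Haussler's inequality by simulation for Bernoulli samples; the values of m, µ, α, and ε are illustrative choices, not taken from the problem.

```python
import numpy as np

# Monte Carlo check of Haussler's inequality for Bernoulli(mu) samples,
# which have range [0, M] with M = 1. All parameter values below are
# illustrative choices, not from the problem statement.
rng = np.random.default_rng(0)
m, M, mu = 200, 1.0, 0.3           # sample size, range, true mean
alpha, eps = 0.5, 0.05             # 0 < alpha < 1, eps > 0
trials = 100_000

Y = rng.binomial(1, mu, size=(trials, m))
mu_hat = Y.mean(axis=1)

# Empirical frequency of the deviation event vs. the stated bound.
lhs = np.mean(np.abs(mu_hat - mu) > alpha * (eps + mu_hat + mu))
rhs = 2 * np.exp(-alpha**2 * eps * m / M)
print(f"empirical P = {lhs:.4f}  <=  bound = {rhs:.4f}")
```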
(a) Rederive the MDL-type bound using Haussler's inequality instead of the Hoeffding bound (i.e. obtain a bound for the expected loss in terms of the empirical loss and the hypothesis length). Use the MDL weight function w(h) = 2^(−|h|) for a hypothesis of length |h|.
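For orientation, here is a hedged sketch of the weighting step only, assuming the 0-1 loss (so M = 1) and the confidence-splitting recipe from class; the remaining algebra is left for the exercise.

```latex
% Apply Haussler's inequality per hypothesis with mu = L_D(h) and
% muhat = L_S(h), allocating confidence delta * w(h) to each h:
\Pr\Bigl[\exists h :\ |L_S(h) - L_D(h)| > \alpha\bigl(\epsilon_h + L_S(h) + L_D(h)\bigr)\Bigr]
  \le \sum_h 2\, e^{-\alpha^2 \epsilon_h m} .
% Choosing eps_h so that each term equals delta * w(h) = delta * 2^{-|h|}:
2\, e^{-\alpha^2 \epsilon_h m} = \delta\, 2^{-|h|}
  \quad\Longrightarrow\quad
  \epsilon_h = \frac{|h|\ln 2 + \ln(2/\delta)}{\alpha^2 m} .
```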
(b) What is the regularized cost function suggested by the new bound?
(c) As m increases, compare the new bound with the bound obtained using Hoeffding's inequality for the case when L_S(h) is small or zero, and for the case when L_S(h) is large.
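The comparison in (c) turns on the rates sketched below: a √(·/m) Hoeffding-style slack versus a 1/m-type slack when L_S(h) = 0. The constants here are schematic, not the exercise's exact ones.

```python
import numpy as np

# Schematic comparison of the two slack terms as m grows, for a fixed
# hypothesis length |h| and confidence delta (illustrative values).
h_len, delta = 50, 0.05
c = h_len * np.log(2) + np.log(2 / delta)
for m in [100, 1_000, 10_000, 100_000]:
    hoeffding = np.sqrt(c / (2 * m))   # additive sqrt(1/m) rate
    haussler = c / m                   # schematic 1/m rate when L_S(h) = 0
    print(f"m = {m:>6}: Hoeffding ~ {hoeffding:.4f}, Haussler ~ {haussler:.4f}")
```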
2. Margin vs. Network Size
In this question, we will look at the error bounds provided by the VC-dimension of a single hidden layer neural network when the inputs are points in the plane.
(a) Argue that the VC-dimension of convex polygons with k or fewer vertices is at least k (where points on the inside and boundary of the polygon are labeled positive).
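A brute-force check of the shattering claim in (a), for illustration only: place k points on a circle and use the convex hull of the positive points as the polygon. The helper names and the choice k = 5 are ours.

```python
import numpy as np
from itertools import product

k = 5
pts = [np.array([np.cos(2 * np.pi * i / k), np.sin(2 * np.pi * i / k)])
       for i in range(k)]

def cross(a, b):
    return a[0] * b[1] - a[1] * b[0]

def in_polygon(p, poly):
    """Membership test for a convex polygon given in CCW vertex order."""
    if len(poly) <= 2:                  # empty set, single point, or segment
        return any(np.allclose(p, q) for q in poly)
    return all(cross(q2 - q1, p - q1) >= -1e-12
               for q1, q2 in zip(poly, poly[1:] + poly[:1]))

# For every labeling, the hull of the positives (a polygon with <= k
# vertices) contains exactly the positive points.
shattered = all(
    in_polygon(p, [q for q, lab in zip(pts, labels) if lab]) == pos
    for labels in product([False, True], repeat=k)
    for p, pos in zip(pts, labels)
)
print("all", 2 ** k, "labelings realized:", shattered)
```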
(b) A convex polygon with k vertices can be represented as the intersection of k halfspaces. Using this, argue that the VC-dimension of a single hidden layer neural network with k linear threshold hidden units is at least k when the inputs are points in the plane.
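A minimal sketch of the construction in (b), assuming the polygon's vertices are given in counterclockwise order: each hidden threshold unit tests one halfspace, and the output unit implements an AND of the k units.

```python
import numpy as np

def polygon_network(vertices):
    """One-hidden-layer threshold network whose k hidden units are the
    halfspaces of a convex polygon (CCW vertex order); the output unit
    is a threshold unit implementing an AND. A construction, not a
    trained model."""
    V = np.asarray(vertices, dtype=float)
    k = len(V)
    W, b = [], []
    for i in range(k):
        p, q = V[i], V[(i + 1) % k]
        # Inward normal of edge p -> q for a CCW polygon.
        n = np.array([-(q[1] - p[1]), q[0] - p[0]])
        W.append(n)
        b.append(-n @ p)
    W, b = np.array(W), np.array(b)

    def predict(x):
        hidden = (W @ np.asarray(x, dtype=float) + b >= 0)  # k threshold units
        return 1 if hidden.sum() >= k else -1               # AND as a threshold
    return predict

f = polygon_network([(0, 0), (1, 0), (1, 1), (0, 1)])  # unit square
print(f((0.5, 0.5)), f((2.0, 0.5)))                    # 1 -1
```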
(c) Bound the Rademacher complexity of single hidden layer neural networks with an arbitrary number of linear threshold hidden units, when the ℓ1 norm of the output weights is 1 and the inputs are points in the plane.
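One standard route for (c), sketched below; the exact constants depend on which versions of the convex-hull and growth-function lemmas were used in class.

```latex
% With H the class of linear threshold units on R^2, the l1 constraint on
% the output weights puts the network in the absolute convex hull of H, so
R_m\bigl(\{\, x \mapsto \textstyle\sum_j w_j h_j(x) \ :\ \|w\|_1 \le 1,\ h_j \in H \,\}\bigr)
  = R_m(H \cup -H) ,
% and since negating a threshold unit yields another threshold unit,
% the VC dimension 3 of halfspaces in the plane gives, via Massart's lemma,
R_m(H \cup -H) \le \sqrt{\frac{2\, d \ln(em/d)}{m}}, \qquad d = 3 .
```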
(d) Assume that the margin is known to be at least γ when the ℓ1 norm of the network's output weights is equal to 1. Provide a bound on the error of an algorithm that maximizes the margin of the network. For simplicity, use the bounds without doing structural risk minimization in this question. (Hint: It may be useful to upper bound the 0-1 loss with the hinge loss for the analysis.)
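A hedged sketch of the margin bound the hint points at; here c is an unspecified absolute constant.

```latex
% Upper bound the 0-1 loss by the margin-rescaled hinge loss, which is
% (1/gamma)-Lipschitz, then apply contraction with the Rademacher bound
% from part (c):
L_D^{0\text{-}1}(h) \le L_S^{\gamma}(h) + \frac{2}{\gamma}\, R_m(\mathcal H)
  + c \sqrt{\frac{\ln(2/\delta)}{m}} ,
% where L_S^{gamma} is the empirical hinge loss at scale gamma; it is zero
% when every training margin is at least gamma.
```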
(e) Compare the bound using the margin with the bound using the VC-dimension. When is each bound better?
3. Estimation Error for ℓ1 vs ℓ2 regularization
In the example discussed in class, we want to find a sequence of characters in a file that indicates whether the file contains a virus (a "signature"). Consider a variant where the string indicating the positive class has length exactly d. That is, the target function we want to learn is f_v(x) = 1 iff v is a substring of x, and f_v(x) = -1 otherwise, where v is a string of length d. Naturally, the string v is unknown; otherwise there is no learning problem. We will use linear classifiers to try to learn the target function. As features, we will use indicator functions ψ_u, where ψ_u(x) takes the value 1 if u is a substring of x and 0 otherwise. Here u ranges over all possible strings of length d. Let the file length be F. For simplicity, use the bounds without doing structural risk minimization.
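For concreteness, a small Python sketch of the feature map ψ_u; the helper name psi and the toy alphabet are ours, and materializing all C^d coordinates is for illustration only.

```python
from itertools import product

def psi(x: str, d: int, alphabet: str) -> dict:
    """Indicator features over all length-d strings u:
    psi_u(x) = 1 iff u is a substring of x, else 0."""
    present = {x[i:i + d] for i in range(len(x) - d + 1)}
    return {''.join(u): int(''.join(u) in present)
            for u in product(alphabet, repeat=d)}

features = psi("abcab", d=2, alphabet="abc")
print(features["ab"], features["cc"])   # 1 0
```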
(a) Give the estimation error bound of using the hard support vector machine. That is, describe how you can represent the target function as a linear function with magnitude at least 1 on all inputs and ||w||_2 ≤ B for the weights. Let H be the class of linear functions satisfying ||w||_2 ≤ B. Then compute an upper bound for L_D(h), assuming h belongs to H and has magnitude at least 1. For simplicity, consider the homogeneous case (bias implemented by having a feature that always has value 1). (Hint: It may be useful to upper bound the 0-1 loss with the hinge loss for the analysis.)
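One possible representation for (a), sketched below; the specific weights (2 on the coordinate of v, -1 on the bias feature) are an illustrative choice, not the only one.

```latex
% h(x) = <w, psi(x)> with weight 2 on coordinate v and -1 on the bias:
\langle w, \psi(x)\rangle = 2\,\psi_v(x) - 1 \in \{+1, -1\},
  \qquad \|w\|_2 = \sqrt{2^2 + 1^2} = \sqrt{5} ,
% and a file of length F has at most F - d + 1 length-d substrings, so
\|\psi(x)\|_2 \le \sqrt{(F - d + 1) + 1} = \sqrt{F - d + 2} .
% With B = ||w||_2 and R = max_x ||psi(x)||_2, the standard bound for
% l2-bounded homogeneous linear classes yields a slack of order
% B R / sqrt(m), i.e. roughly sqrt(F/m) here.
```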
(b) Now describe how you can represent the target function as a linear function with magnitude at least 1 on all inputs and ||w||_1 ≤ B for the weights. Let H be the class of linear functions satisfying ||w||_1 ≤ B. Then compute an upper bound for L_D(h), assuming h belongs to H and achieves magnitude at least 1. For simplicity, consider the homogeneous case (bias implemented by having a feature that always has value 1). Assume that the number of possible characters is C. Which bound is better, compared to the hard SVM case?
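A matching sketch for (b), under the same illustrative representation.

```latex
% The same weight vector has
\|w\|_1 = 2 + 1 = 3 ,
% and for l1-bounded linear classes over N features with values in [0, 1],
% the standard bound gives a slack of order
B \sqrt{\frac{2 \ln(2N)}{m}}, \qquad N = C^d + 1 ,
% i.e. roughly sqrt(d log C / m), to be compared with the sqrt(F/m)
% scaling of the l2 case above.
```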
(c) Consider the case when many substrings may be correct, i.e. the function should output 1 if any of the substrings in a subset S appears, and -1 otherwise. In this case, which bound is better, assuming |S| is much larger than F?
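A hedged sketch of the natural extension of the same representation to a subset S, giving the norms the comparison hinges on.

```latex
% Weight 2 on each u in S and -1 on the bias feature: a positive file
% containing j >= 1 strings from S gets output 2j - 1 >= 1, a negative
% file gets -1, and
\|w\|_2 = \sqrt{4|S| + 1}, \qquad \|w\|_1 = 2|S| + 1 ,
% so the two slacks scale like sqrt(|S| F / m) and |S| sqrt(d log C / m)
% respectively.
```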