Variable Importance
Compared to many other machine learning models, a logistic model can be easily interpreted. The sign of a coefficient tells us whether the corresponding input variable has a positive or negative effect on the prediction of the output class. The magnitude of a coefficient is related to the importance of the variable for the prediction. However, wᵢ > wⱼ (using the previous notation, wᵢ is the i-th component of w) does not simply imply that the i-th input variable is more important than the j-th input variable. Obviously, this also depends on the scaling of the input variables. For example, the importance of an input variable measuring a distance in the physical world should be independent of whether the associated unit is meters or millimeters.
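To make the scaling issue concrete, here is a minimal sketch (the synthetic data and the use of scikit-learn are illustrative assumptions, not taken from the notebook): the same distance expressed in millimeters instead of meters gets a coefficient roughly 1000 times smaller, even though the underlying relationship is unchanged.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_m = rng.normal(size=(500, 1))              # distance measured in meters
y = (X_m[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
X_mm = X_m * 1000.0                          # the same distance in millimeters

# C is set very large to approximate an unregularized fit.
w_m = LogisticRegression(C=1e10, max_iter=1000).fit(X_m, y).coef_[0, 0]
w_mm = LogisticRegression(C=1e10, max_iter=1000).fit(X_mm, y).coef_[0, 0]
print(w_m, w_mm, w_m / w_mm)                 # w_mm is roughly w_m / 1000

So raw coefficient magnitudes are not comparable across differently scaled inputs, which is exactly why the section turns to the z-statistic next.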
The notebook Variable importance using logistic regression.ipynb demonstrates how to analyze the importance of the variables in a logistic regression model. Please have a close look. (Note that for a non-linear model the importances and their ranking may be different.) As argued above, we do not consider the magnitude of a coefficient directly, but the corresponding z-statistic, which in our case is the coefficient divided by its standard error. The z-statistic of a coefficient is invariant under linear rescaling of the corresponding input variable.
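As a sketch of the computation (using statsmodels is an assumption here; the notebook may obtain the standard errors differently), the z-statistic of each coefficient is its estimate divided by its standard error:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
X[:, 1] *= 1000.0                            # second input on a much larger scale
y = (X[:, 0] + X[:, 1] / 1000.0 + rng.normal(size=500) > 0).astype(int)

result = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(result.params / result.bse)            # z-statistics: coefficient / standard error
print(result.tvalues)                        # the same values, reported directly

Despite the two inputs living on very different scales, their z-statistics come out comparable, in line with the invariance claim above.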
In the example in the notebook, we have to deal with a categorical variable. A categorical variable takes values that correspond to a particular category (class, concept, object), for example {Orange, Apple, Banana}, and these categories are not necessarily ordered in a meaningful way. Such a variable needs to be encoded before a (generic) machine learning system can process the data. Simply encoding {Orange, Apple, Banana} by {0, 1, 2} and treating the variable as measured on an interval scale (i.e., treating the categories as numbers) does not make sense: a banana is not two times an apple.
You already heard about the most popular encoding for output categorical variables, the one-hot encoding. A one-hot encoding of {Orange, Apple, Banana} is {(1,0,0)ᵀ, (0,1,0)ᵀ, (0,0,1)ᵀ}, that is, C = 3 classes are encoded by C (output) variables. In the notebook, however, C - 1 ("dummy") variables are used for the categorical input variable.
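A small sketch of the difference (assuming pandas; the fruit example just mirrors the one above):

import pandas as pd

fruit = pd.Series(["Orange", "Apple", "Banana", "Apple"])
print(pd.get_dummies(fruit))                  # one-hot: C = 3 indicator columns
print(pd.get_dummies(fruit, drop_first=True)) # dummy: C - 1 = 2 columns

With drop_first=True, the dropped category becomes the baseline that is implied when all remaining dummy columns are zero.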
In this question of the assignment, you should concisely explain why C - 1 variables are used instead of the one-hot encoding. Your submission should include answers to the following questions: How many solutions (i.e., optimal values for the coefficients) would the linear regression optimization problem (without regularization) have if the one-hot encoding was used? Why? Why would it be difficult to interpret the variable importance if the one-hot encoding was used?

Recall that when you add independent normally distributed random variables, their variances add up, that is, if x₁ ∼ N(0, σ₁²) and x₂ ∼ N(0, σ₂²) are independent, then x₁ + x₂ ∼ N(0, σ₁² + σ₂²). And remember that if x ∼ N(0, 1), then σx ∼ N(0, σ²).
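A quick numerical check of both facts (a sketch, not part of the required submission):

import numpy as np

rng = np.random.default_rng(0)
s1, s2 = 2.0, 3.0
x1 = rng.normal(0.0, s1, size=1_000_000)
x2 = rng.normal(0.0, s2, size=1_000_000)
print(np.var(x1 + x2), s1**2 + s2**2)        # both close to 13
print(np.var(s1 * rng.normal(size=1_000_000)), s1**2)  # both close to 4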
You can compute the covariance in several ways. One basic way to derive it is the following. Consider the standard normally distributed random vector x̂ ∼ N(0, I) and its transformation x = Ax̂. We have x ∼ N(0, AAᵀ). For the exercise, find A, and then the off-diagonal elements of AAᵀ give you the covariance between x₁ and x₂.
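A sketch of this derivation in code (the matrix A below is an arbitrary illustrative choice, not the A you need to find for the exercise):

import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.0],
              [0.5, 1.0]])                   # illustrative A, not the exercise's
x_hat = rng.normal(size=(2, 1_000_000))      # columns are samples of x̂ ~ N(0, I)
x = A @ x_hat                                # x ~ N(0, A Aᵀ)
print(A @ A.T)                               # analytic covariance matrix of x
print(np.cov(x))                             # empirical estimate, close to A @ A.T

The off-diagonal entries of both printed matrices agree, which is exactly the covariance between x₁ and x₂ that the exercise asks for.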