Rationale

  • Logistic regression works similarly to linear regression, except that the outcome is binary.
  • We need a different function from the one used in linear regression: a function that, given any input, produces an output between 0 and 1.
  • The binary decision equation used in logistic regression is \(f(x) = g(XA)\), where \(g(z) = \frac{1}{1+e^{-z}}\), \(A\) is the vector of coefficients, and \(0 < f(x) < 1\).
  • The prediction function, after replacing \(z\) is \(\hat{y} = f(x) = \frac{1}{1+e^{-(XA)}}\).
  • \(g(z)\) is the logistic (sigmoid) function and looks like:

    Shape of the logistic function.
  • Since the output of the logistic function is between 0 and 1, it can be read as the estimated probability that an outcome belongs to one category. Outputs larger than 0.5 are assigned to that category and those smaller than 0.5 to the other (see the code sketch after this list).
  • \(z=0\) is the decision boundary.
  • Using higher order polynomials leads to a non-linear decision boundary:
    Linear vs nonlinear decision boundary.
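
  A minimal sketch, in Julia, of the prediction function described above. The names g, predict_proba, and predict_class are illustrative; X is assumed to be an n×m design matrix (with an intercept column of ones) and A the m-element coefficient vector.

    # Logistic (sigmoid) function: maps any real z to a value strictly between 0 and 1.
    g(z) = 1 / (1 + exp(-z))

    # ŷ = g(XA), applied element-wise over the n samples.
    predict_proba(X, A) = g.(X * A)

    # Decision boundary at z = XA = 0, i.e. ŷ = 0.5:
    # probabilities of 0.5 or more go to class 1, the rest to class 0.
    predict_class(X, A) = Int.(predict_proba(X, A) .>= 0.5)

    # Example: two samples, an intercept column plus one feature.
    X = [1.0  2.0;
         1.0 -1.0]
    A = [0.5, 1.0]
    predict_proba(X, A)   # probabilities strictly between 0 and 1
    predict_class(X, A)   # [1, 0]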

Cost function

  • The cost function of linear regression does not work for logistic regression: it becomes non-convex, meaning there are local optima in which the gradient descent (GD) algorithm can get trapped.
  • We want a cost function that penalizes heavily when we make a completely wrong prediction (the outcome is 1 and we predict 0), is zero when we predict the correct outcome with 100% confidence, penalizes slightly when we make the correct prediction with confidence below 100% (e.g., the outcome is 1 and the logistic regression output is 0.9), and penalizes more as our confidence in the right answer decreases.
  • The following function satisfies the conditions we set: \(C(A) = \begin{cases} -log(\hat{y}), & \text{if } y=1\\ -log(1-\hat{y}), & \text{if } y=0 \end{cases}\)
  • The cost function looks like this:

    Shape of the cost function for y=1 and y=0.
  • The cost function can be written in one line: \(C(A) = -y log(\hat{y}) - (1-y)log(1-\hat{y})\).
  • The function for all samples looks like: \(C(A) = -\frac{1}{n} \sum_{i=1}^{n} [y_i log(\hat{y_i}) + (1-y_i)log(1-\hat{y_i})]\).
  • The derivative of the cost function is \(\frac{\partial C}{\partial a_j} = \frac{1}{n}\sum_{i=1}^{n} (\hat{y}^i - y^i)x_j^i\).
  • The derivative of the cost function for logistic regression looks identical to the derivative of the cost function of multivariate linear regression. The only difference is how we compute \(\hat{y}_i\), i.e. \(f(x)\): for linear regression it was \(f(x) = XA\), while for logistic regression it is \(f(x) = \frac{1}{1+e^{-(XA)}}\) (see the sketch after this list).
  • Logistic regression, like linear regression, needs feature scaling.
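
  A minimal sketch of the cost function and its gradient from the bullets above, again in Julia. The ϵ clamp is an extra numerical guard against log(0) and is not part of the formulas; all names are illustrative.

    g(z) = 1 / (1 + exp(-z))   # sigmoid, repeated so the sketch stands alone

    # C(A) = -(1/n) Σ [ y_i log(ŷ_i) + (1 - y_i) log(1 - ŷ_i) ]
    function cost(X, y, A; ϵ = 1e-15)
        ŷ = clamp.(g.(X * A), ϵ, 1 - ϵ)   # keep the logs away from 0
        return -sum(y .* log.(ŷ) .+ (1 .- y) .* log.(1 .- ŷ)) / length(y)
    end

    # ∂C/∂a_j = (1/n) Σ (ŷ_i - y_i) x_j^i, written for all j at once as Xᵀ(ŷ - y)/n.
    gradient(X, y, A) = X' * (g.(X * A) .- y) / length(y)

    # One gradient-descent update with learning rate α; the loop is the same as
    # for multivariate linear regression, only the computation of ŷ differs.
    gd_step(X, y, A, α) = A .- α .* gradient(X, y, A)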

Multiclass classification

  • To detect multiple classes, one option is the one-vs-rest algorithm.
  • Train n different binary classifiers, one per class, each detecting that class against all the other classes.
  • To make a prediction for a single sample, we run all n classifiers and accept the prediction of the classifier with the highest score (see the sketch after this list).
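
  A minimal sketch of one-vs-rest on top of the binary pieces above. Here train_binary is a placeholder for any routine that fits a coefficient vector on a binary problem (for instance, gradient descent on the cost sketched earlier); it is an assumption, not a library function.

    g(z) = 1 / (1 + exp(-z))   # sigmoid, as before

    # Train one binary classifier per class: "this class" vs. "all the rest".
    one_vs_rest(X, y, classes, train_binary) =
        Dict(c => train_binary(X, Float64.(y .== c)) for c in classes)

    # Score a sample under every classifier and keep the most confident one.
    function predict_ovr(X, models)
        classes = collect(keys(models))
        scores = hcat([g.(X * models[c]) for c in classes]...)   # n × k matrix of probabilities
        return [classes[argmax(scores[i, :])] for i in 1:size(scores, 1)]
    end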

Exercise

  1. Detect hand-written numbers using logistic regression.

    using ScikitLearn
    @sk_import datasets: load_digits   # scikit-learn's 8×8 hand-written digits dataset

    digits = load_digits();
    X = digits["data"];     # n × 64 matrix of flattened 8×8 pixel intensities
    y = digits["target"];   # labels: the digits 0–9
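
  A possible continuation of the exercise, building on the snippet above. This is one sketch among many, using scikit-learn's LogisticRegression and train_test_split through ScikitLearn.jl; the hold-out fraction and max_iter value are arbitrary choices.

    @sk_import linear_model: LogisticRegression
    @sk_import model_selection: train_test_split

    # hold out a quarter of the samples to estimate generalization accuracy
    X_train, X_test, y_train, y_test =
        train_test_split(X, y, test_size=0.25, random_state=0);

    model = LogisticRegression(max_iter=1000)   # scikit-learn handles the multiclass details internally
    fit!(model, X_train, y_train)

    ŷ = predict(model, X_test)
    accuracy = sum(ŷ .== y_test) / length(y_test)
    println("test accuracy: ", accuracy)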