Ridge and LASSO regressions | Applied Machine Learning

Ridge

A ridge regression is similar to a linear regression, but the coefficients must satisfy an additional constraint besides fitting the data well: they are penalized for being large.
The new cost function is \(C = \frac{1}{2m} (\sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \lambda \sum_{j=1}^{n} a_j^2)\), where \(\lambda\) is the regularization parameter and \(\lambda \sum_{j=1}^{n} a_j^2\) is the regularization term. Here, \(m\) is the number of samples and \(n\) is the number of features.
This kind of regularization is called \(L_2\) regularization.
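As a minimal sketch, the cost function above can be written in Julia for the general case of \(n\) features. The function name `ridge_cost`, the matrix layout (one row of `X` per sample), and the toy numbers are assumptions made for illustration, not part of the original notes.

```julia
# Ridge cost for m samples and n features, following the formula above.
# X is an m×n matrix, y a length-m vector, a holds the n feature coefficients,
# a0 is the intercept (left out of the penalty), λ is the regularization parameter.
function ridge_cost(X, y, a, a0, λ)
    m = length(y)
    residuals = y .- (X * a .+ a0)          # y_i minus the prediction for sample i
    return (sum(abs2, residuals) + λ * sum(abs2, a)) / (2m)
end

# Toy data, made up for this example
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = [1.0, 2.0, 3.0]
a = [0.5, -0.5]

cost_plain = ridge_cost(X, y, a, 0.0, 0.0)   # λ = 0: ordinary least-squares cost
cost_ridge = ridge_cost(X, y, a, 0.0, 2.0)   # λ = 2: same fit plus the L2 penalty
```

With the same coefficients, the two costs differ only by the penalty \(\frac{\lambda}{2m} \sum_j a_j^2\).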
The derivative of the cost function, to be used in gradient descent (GD), is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m} a_j\).
Then, the update function for each parameter is: \(a_{j\_new} = a_j - \alpha [\frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m} a_j]\), where \(\alpha\) is the learning rate.
Collecting the terms in \(a_j\), we can write the update equation as: \(a_{j\_new} = a_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i\).
The term \((1 - \alpha \frac{\lambda}{m})\) is typically slightly smaller than one when \(m\) is large and \(\alpha\) is small.
As long as \(0 < \alpha \frac{\lambda}{m} < 1\), this factor shrinks \(a_j\) at every step. Therefore, it diminishes the effect of \(a_j\).
Having smaller coefficients results in a simpler model. This helps avoid over-fitting.
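The rewriting of the update as a shrink factor followed by the usual data-fit step can be checked numerically. The sketch below assumes the same matrix layout as a multivariate model (one row of `X` per sample); `ridge_gradient` and the toy values of `α` and `λ` are hypothetical choices for illustration.

```julia
# Gradient of the ridge cost with respect to the feature coefficients a
# (the intercept a0 is treated as fixed here and is not penalized).
function ridge_gradient(X, y, a, a0, λ)
    m = length(y)
    residuals = y .- (X * a .+ a0)
    return -(X' * residuals) ./ m .+ (λ / m) .* a
end

# Toy data, made up for this example
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = [1.0, 2.0, 3.0]
a = [0.5, -0.5]
a0, λ, α = 0.0, 2.0, 0.01
m = length(y)

# Form 1: subtract α times the full regularized gradient
step_direct = a .- α .* ridge_gradient(X, y, a, a0, λ)

# Form 2: shrink a by (1 - αλ/m), then subtract α times the unregularized gradient
data_gradient = ridge_gradient(X, y, a, a0, 0.0)
shrink_factor = 1 - α * λ / m
step_shrink = shrink_factor .* a .- α .* data_gradient
```

Both forms give the same new coefficients, and with these values the shrink factor sits just below one.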
The smaller λ is, the closer the model is to ordinary linear regression.
The larger λ is, the more severely the coefficients are penalized, which suppresses irrelevant features. If λ is too large, all parameters will be nearly zero, which results in under-fitting.
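The effect of λ can be illustrated with a small gradient-descent run on synthetic data. This is only a sketch under stated assumptions: `ridge_gd`, the learning rate, the iteration count, and the data-generating coefficients are all made up for the example. The learning rate has to stay small enough for the largest λ used, because the penalty makes the cost surface steeper.

```julia
using LinearAlgebra: norm
using Random

# Plain gradient descent on the ridge cost; the intercept a0 is not penalized.
function ridge_gd(X, y, λ; α = 0.1, iterations = 2000)
    m, n = size(X)
    a, a0 = zeros(n), 0.0
    for _ in 1:iterations
        residuals = y .- (X * a .+ a0)
        grad_a = -(X' * residuals) ./ m .+ (λ / m) .* a
        grad_a0 = -sum(residuals) / m
        a = a .- α .* grad_a
        a0 -= α * grad_a0
    end
    return a, a0
end

# Synthetic data: the third feature is irrelevant (true coefficient 0)
Random.seed!(1)
X = randn(50, 3)
y = X * [2.0, -1.0, 0.0] .+ 0.1 .* randn(50)

# Length of the fitted coefficient vector for increasing λ
lambdas = (0.0, 10.0, 100.0)
norms = [norm(first(ridge_gd(X, y, λ))) for λ in lambdas]
```

The coefficient norm decreases as λ grows; with λ = 0 the fit is ordinary linear regression.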
Note that, conventionally, we do not regularize the bias coefficient \(a_0\) (its input is always 1, so it only shifts the predictions); this is why the penalty sum starts at \(j = 1\).
\(L_2\) regularization pushes all coefficients toward zero. This means that each feature should have a small effect on the outcome, while the model still predicts well.
Ridge regression is especially useful when there are more features than samples.
It helps avoid over-fitting by shrinking the coefficients, which leaves less informative features with little influence.
Moreover, ridge regression can handle correlated features.
The more samples you have, the less important regularization is.

LASSO

The only difference in LASSO regression is that the regularization term uses absolute values: \(C = \frac{1}{2m} \sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \frac{\lambda}{m} \sum_{j=1}^{n} \lvert a_j \rvert\).
The derivative of the cost function is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m \lvert a_j \rvert} a_j\). (The rule for differentiating an absolute value is \(\lvert x \rvert' = \frac{x}{\lvert x \rvert}\) for \(x \neq 0\).)
This small difference has a large impact. It allows irrelevant features to be set to zero, instead of just making them small.
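The derivative above can be sketched in Julia as follows. For \(a_j \neq 0\), the factor \(a_j / |a_j|\) is just the sign of \(a_j\). The function name `lasso_gradient`, the matrix layout, and the toy numbers are assumptions for illustration, and using `sign(0.0) == 0.0` at \(a_j = 0\), where the absolute value has no derivative, is a convention chosen for this sketch.

```julia
# Gradient of the LASSO cost with respect to the feature coefficients a.
# sign.(a) plays the role of a_j / |a_j|; at a_j == 0 it returns 0 (a convention).
function lasso_gradient(X, y, a, a0, λ)
    m = length(y)
    residuals = y .- (X * a .+ a0)
    return -(X' * residuals) ./ m .+ (λ / m) .* sign.(a)
end

# Toy data, made up for this example
X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = [1.0, 2.0, 3.0]
a = [0.5, -0.5]

# Isolate the contribution of the penalty by comparing λ = 3 with λ = 0
penalty_part = lasso_gradient(X, y, a, 0.0, 3.0) .- lasso_gradient(X, y, a, 0.0, 0.0)
```

Here `penalty_part` equals `(λ/m) .* sign.(a)`: the pull toward zero has the same size for every nonzero coefficient, however small, whereas the ridge term `(λ/m) .* a` fades as a coefficient approaches zero. Plain gradient steps like this one do not by themselves land exactly on zero; dedicated LASSO solvers handle the point \(a_j = 0\) explicitly.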

Exercises

Use the California housing dataset. Predict a house’s price from the features that are explained here. (4 points)
```
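using ScikitLearn
@sk_import datasets: fetch_california_housing

# Load the dataset (downloaded on first use) and separate features from the target
house = fetch_california_housing()
X = house["data"]    # feature matrix, one row per sample
y = house["target"]  # value to predict for each sample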
```
1. Modify the multivariate linear regression code that you had written before and add regularization to it to build a Ridge regression algorithm. You will only have to change the cost function and the gradient function.
2. Separate a random 20% of the data for testing. Train your model with the remaining 80% of the data and test its performance on the test data.
3. Test higher degree polynomials.
4. Compare your results with the results of ScikitLearn’s built-in functions.
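For step 2, a random 80/20 split can be sketched with a shuffled index vector. The helper name `train_test_split_rows`, its keyword argument, and the small demo arrays are hypothetical and only stand in for the real `X` and `y`.

```julia
using Random

# Hypothetical helper: shuffle the row indices and hold out a fraction for testing.
function train_test_split_rows(X, y; test_fraction = 0.2)
    m = size(X, 1)
    shuffled = randperm(m)                    # random permutation of 1:m
    n_test = round(Int, test_fraction * m)
    test_idx = shuffled[1:n_test]
    train_idx = shuffled[n_test+1:end]
    return X[train_idx, :], y[train_idx], X[test_idx, :], y[test_idx]
end

# Demo arrays standing in for the real data
X_demo = reshape(collect(1.0:40.0), 20, 2)
y_demo = collect(1.0:20.0)

Random.seed!(42)                              # fixed seed so the split is repeatable
X_train, y_train, X_test, y_test = train_test_split_rows(X_demo, y_demo)
```

Every sample ends up in exactly one of the two sets, so the test data never influences training.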
Check the diabetes dataset from sklearn:
```
using ScikitLearn
import ScikitLearn: fit!, predict
@sk_import datasets: load_diabetes
all_data = load_diabetes()

X = all_data["data"]    # 442 patients, 10 baseline variables
y = all_data["target"]  # disease progression one year after baseline
```
“Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499

The following column names match the description above, respectively: age sex bmi bp s1 s2 s3 s4 s5 s6.
1. Write code for a ridge/lasso regression model to predict disease progression.
2. Find the most important features.
