## Ridge

• Ridge regression is similar to linear regression, but it places an additional constraint on the size of the coefficients besides fitting the data well.
• The new cost function is $$C = \frac{1}{2m} \left(\sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \lambda \sum_{j=1}^{n} a_j^2\right)$$, where $$\lambda$$ is the regularization parameter and $$\lambda \sum_{j=1}^{n} a_j^2$$ is the regularization term. Here, $$m$$ is the number of samples and $$n$$ is the number of features.
• This kind of regularization is $$L_2$$ regularization.
• The derivative of the cost function, to be used in gradient descent (GD), is $$\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0)\,x_i + \frac{\lambda}{m} a_j$$.
• Then, the update rule for each parameter is $$a_{j\_new} = a_j - \alpha \left[\frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0)\,x_i + \frac{\lambda}{m} a_j\right]$$.
• We can rewrite the update as $$a_{j\_new} = a_j\left(1 - \alpha \frac{\lambda}{m}\right) + \alpha \frac{1}{m} \sum_{i=1}^{m}(y_i - a_1x_i - a_0)\,x_i$$.
• Since $$\alpha$$, $$\lambda$$, and $$m$$ are positive, the factor $$(1 - \alpha \frac{\lambda}{m})$$ is slightly smaller than one, especially when $$m$$ is large and $$\alpha$$ is small.
• Multiplying by this factor at every step always shrinks $$a_j$$ toward zero, diminishing that coefficient's effect on the prediction.
• Smaller coefficient values give a simpler model, which helps avoid over-fitting.
• The smaller $$\lambda$$ is, the closer the model is to ordinary linear regression.
• The larger $$\lambda$$ is, the more severely irrelevant features are penalized. If $$\lambda$$ is too large, all parameters are pushed to nearly zero, which results in under-fitting.
• Note that, conventionally, we do not regularize the bias coefficient $$a_0$$ (its corresponding input is the constant 1, so shrinking it does not simplify the model).
• $$L_2$$ regularization shrinks all coefficients as close to zero as possible while still predicting well, so each feature ends up with a small effect on the outcome.
• Ridge regression is especially useful when the number of features exceeds the number of samples.
• It helps avoid over-fitting by assigning different importances to different features.
• Moreover, ridge regression can handle correlated features.
• The more samples you have, the less important is regularization.
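The update rule above can be sketched in plain Julia for the single-feature case. This is a minimal illustration, not an optimized implementation; the function name `ridge_gd` and the default hyperparameters are illustrative assumptions.

```julia
# Minimal sketch of single-feature ridge regression trained with the
# gradient-descent update above (names and defaults are illustrative).
function ridge_gd(x, y; λ=0.1, α=0.05, iters=2000)
    m = length(y)
    a0, a1 = 0.0, 0.0                       # intercept a_0 and slope a_1
    for _ in 1:iters
        r = y .- a1 .* x .- a0              # residuals y_i - a_1*x_i - a_0
        a0 += α * sum(r) / m                # bias is not regularized
        a1 = a1 * (1 - α * λ / m) + α * sum(r .* x) / m
    end
    return a0, a1
end

x = collect(0.0:0.1:5.0)
y = 2.0 .* x .+ 1.0                         # noiseless line y = 2x + 1
a0, a1 = ridge_gd(x, y; λ=0.0)              # λ = 0 reduces to ordinary GD
a0r, a1r = ridge_gd(x, y; λ=50.0)           # a strong λ shrinks the slope
```

With $$\lambda = 0$$ the iterations converge to the ordinary least-squares line ($$a_0 \approx 1$$, $$a_1 \approx 2$$), while a large $$\lambda$$ visibly shrinks the slope, as the $$(1 - \alpha\frac{\lambda}{m})$$ factor predicts.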

## LASSO

• The only difference in LASSO regression is that the regularization term uses absolute values: $$C = \frac{1}{2m} \sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \frac{\lambda}{m} \sum_{j=1}^{n} |a_j|$$.
• The derivative of the cost function is $$\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0)\,x_i + \frac{\lambda}{m}\,\mathrm{sign}(a_j)$$. (The derivative of the absolute value is $$|x|' = \frac{x}{|x|} = \mathrm{sign}(x)$$ for $$x \neq 0$$.)
• This small difference has a large impact: it allows the coefficients of irrelevant features to be set exactly to zero, instead of just being made small.
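The ridge sketch carries over to LASSO by replacing the $$L_2$$ shrinkage with the sign-based term from the derivative above. Again a minimal single-feature illustration with illustrative names; note that Julia's `sign(0) = 0` conveniently handles the non-differentiable point as a subgradient.

```julia
# Minimal sketch of single-feature LASSO trained with subgradient descent,
# following the derivative above (names and defaults are illustrative).
function lasso_gd(x, y; λ=1.0, α=0.05, iters=2000)
    m = length(y)
    a0, a1 = 0.0, 0.0                       # intercept and slope
    for _ in 1:iters
        r = y .- a1 .* x .- a0              # residuals
        a0 += α * sum(r) / m                # bias is not regularized
        a1 += α * (sum(r .* x) / m - (λ / m) * sign(a1))
    end
    return a0, a1
end

x = collect(0.0:0.1:5.0)
y = 2.0 .* x .+ 1.0                         # noiseless line y = 2x + 1
_, a1 = lasso_gd(x, y; λ=0.0)               # ≈ ordinary least squares
_, a1r = lasso_gd(x, y; λ=50.0)             # the L1 penalty shrinks the slope
```

Unlike the ridge update, the shrinkage step here has constant size $$\alpha\frac{\lambda}{m}$$ regardless of how small $$a_j$$ already is, which is why LASSO can drive irrelevant coefficients all the way to zero.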

## Exercises

1. Use the California housing dataset. Predict a house’s price from the features that are explained here. (4 points)

```julia
using ScikitLearn
@sk_import datasets: fetch_california_housing

house = fetch_california_housing()
X = house["data"]
y = house["target"]
```

1. Modify the multivariate linear regression code that you had written before and add regularization to it to build a Ridge regression algorithm. You will only have to change the cost function and the gradient function.
2. Separate a random 20% of the data for testing. Train your model with the remaining 80% of the data and test its performance on the test data.
3. Test higher degree polynomials.
4. Compare your results with the results of ScikitLearn’s built-in functions.
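For the random 80/20 split in task 2, one option is to shuffle the row indices in base Julia. A minimal sketch; the random `X` and `y` below are only stand-ins for the housing arrays loaded above.

```julia
using Random

Random.seed!(42)                        # arbitrary seed, for reproducibility
X = randn(100, 3)                       # stand-in for house["data"]
y = randn(100)                          # stand-in for house["target"]

n = size(X, 1)
perm = randperm(n)                      # random permutation of row indices
cut = floor(Int, 0.8n)                  # first 80% train, last 20% test
train_idx, test_idx = perm[1:cut], perm[cut+1:end]
X_train, y_train = X[train_idx, :], y[train_idx]
X_test,  y_test  = X[test_idx, :],  y[test_idx]
```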
2. Check the diabetes dataset from sklearn.

```julia
using ScikitLearn
import ScikitLearn: fit!, predict
@sk_import datasets: load_diabetes

all_data = load_diabetes()
X = all_data["data"]
y = all_data["target"]
```


“Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”

Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004), "Least Angle Regression," Annals of Statistics (with discussion), 407–499.

The following column names match the description above, respectively: `age sex bmi bp s1 s2 s3 s4 s5 s6`.

1. Write code for a ridge/lasso regression model to predict disease progression. (3 points)
2. Which features are the most important? (1 point)