Ridge
- Ridge regression is similar to linear regression, but it adds a constraint on the coefficients in addition to fitting the data as well as possible.
- The new cost function is \(C = \frac{1}{2m} \left( \sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \lambda \sum_{j=1}^{n} a_j^2 \right)\), where \(\lambda\) is the regularization parameter and \(\lambda \sum_{j=1}^{n} a_j^2\) is the regularization term. Here, \(m\) is the number of samples and \(n\) is the number of features.
- This kind of regularization is known as \(L_2\) regularization.
- The derivative of the cost function, to be used in gradient descent, is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m} a_j\).
- Then, the update rule for each parameter is \(a_{j\_new} = a_j - \alpha \left[ \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m} a_j \right]\).
- We can rewrite the update equation as \(a_{j\_new} = a_j \left(1 - \alpha \frac{\lambda}{m}\right) + \alpha \frac{1}{m} \sum_{i=1}^{m}(y_i - a_1x_i - a_0) \cdot x_i\) (see the sketch after this list).
- The term \((1 - \alpha \frac{\lambda}{m})\) is slightly smaller than one when \(m\) is large and \(\alpha\) is small.
- Multiplying by this factor shrinks \(a_j\) at every update, which diminishes its effect.
- Smaller coefficient values result in a simpler model, which helps avoid over-fitting.
- The smaller \(\lambda\) is, the closer the model is to ordinary linear regression.
- The larger \(\lambda\) is, the more severely the coefficients are penalized. If \(\lambda\) is too large, all parameters shrink to nearly zero, which results in under-fitting.
- Note that, conventionally, we do not regularize the bias coefficient \(a_0\) (the input it multiplies is the constant 1, so it does not scale any feature).
- \(L_2\) regularization pushes all coefficients as close to zero as possible, so that each feature has only a small effect on the outcome while the model still predicts well.
- Ridge regression is especially useful when the number of features exceeds the number of samples.
- It helps avoid over-fitting by giving different importance to different features.
- Moreover, ridge regression can handle correlated features.
- The more samples you have, the less important regularization becomes.
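To make the update rule above concrete, here is a minimal gradient-descent sketch of ridge regression for the single-feature model \(y \approx a_1 x + a_0\). The function name, learning rate, and iteration count are illustrative choices, not part of the notes.

```julia
# Minimal ridge regression via gradient descent for y ≈ a1*x + a0.
# λ, α, and the iteration count below are illustrative choices.
function ridge_gd(x, y; λ=0.1, α=0.01, iters=10_000)
    m = length(y)
    a0, a1 = 0.0, 0.0
    for _ in 1:iters
        r = y .- (a1 .* x .+ a0)           # residuals
        grad0 = -sum(r) / m                # bias term: not regularized
        grad1 = -sum(r .* x) / m + (λ / m) * a1
        a0 -= α * grad0
        a1 -= α * grad1
    end
    return a0, a1
end

# On toy data, the fitted slope is shrunk slightly toward zero as λ grows.
x = collect(0.0:0.1:10.0)
y = 3.0 .* x .+ 2.0 .+ 0.5 .* randn(length(x))
a0, a1 = ridge_gd(x, y; λ=1.0)
```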
LASSO
- The only difference in LASSO regression is in the regularization term, which uses absolute values: \(C = \frac{1}{2m} \sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \frac{\lambda}{m} \sum_{j=1}^{n} |a_j|\).
- The derivative of the cost function is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0) \cdot x_i + \frac{\lambda}{m} \frac{a_j}{|a_j|}\). (The derivative of the absolute value is \(|x|' = \frac{x}{|x|} = \mathrm{sign}(x)\) for \(x \neq 0\).)
- This small difference has a large impact: it allows the coefficients of irrelevant features to be driven exactly to zero, instead of just being made small (see the gradient sketch below).
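As a sketch of how the code changes for LASSO: only the regularization part of the gradient differs, with \(\frac{\lambda}{m} a_j\) replaced by \(\frac{\lambda}{m}\,\mathrm{sign}(a_j)\) (taking 0 at \(a_j = 0\), where the absolute value is not differentiable). The function name and defaults are illustrative.

```julia
# LASSO gradient for y ≈ a1*x + a0 (illustrative sketch).
# Relative to ridge, (λ/m)*a1 becomes (λ/m)*sign(a1);
# sign(0) = 0 serves as the subgradient at the kink of |a1|.
function lasso_gradient(x, y, a0, a1; λ=0.1)
    m = length(y)
    r = y .- (a1 .* x .+ a0)
    grad0 = -sum(r) / m                  # bias is not regularized
    grad1 = -sum(r .* x) / m + (λ / m) * sign(a1)
    return grad0, grad1
end
```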
Exercises
- Use the California housing dataset. Predict a house’s price from the features that are explained here. (4 points)
```julia
using ScikitLearn
@sk_import datasets: fetch_california_housing
house = fetch_california_housing()
X = house["data"]
y = house["target"]
```
- Modify the multivariate linear regression code that you wrote before and add regularization to it to build a ridge regression algorithm. You will only need to change the cost function and the gradient function.
- Separate a random 20% of the data for testing. Train your model with the remaining 80% of the data and test its performance on the test data.
- Test higher-degree polynomials.
- Compare your results with the results of ScikitLearn’s built-in functions (a starting sketch follows this list).
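One possible starting point for the split-and-compare steps, assuming `X` and `y` from the snippet above; the 80/20 indexing and the `alpha` value are illustrative choices, not prescribed by the exercise.

```julia
using ScikitLearn, Random, Statistics
import ScikitLearn: fit!, predict
@sk_import linear_model: Ridge

# Illustrative random 80/20 train/test split.
n = size(X, 1)
idx = shuffle(1:n)
split = floor(Int, 0.8n)
train, test = idx[1:split], idx[split+1:end]

model = Ridge(alpha=1.0)                 # sklearn's alpha plays the role of λ
fit!(model, X[train, :], y[train])
ŷ = predict(model, X[test, :])
mse = mean((y[test] .- ŷ) .^ 2)          # compare against your own implementation
```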
- Check the diabetes dataset from sklearn.
```julia
using ScikitLearn
import ScikitLearn: fit!, predict
@sk_import datasets: load_diabetes
all_data = load_diabetes()
X = all_data["data"]
y = all_data["target"]
```
“Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”
Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004), “Least Angle Regression,” Annals of Statistics (with discussion), 32(2), 407–499.
The following column names match the description above, respectively: age, sex, bmi, bp, s1, s2, s3, s4, s5, s6.
- Write code for a ridge/lasso regression model to predict disease progression.
- Find the most important features (a sketch follows).
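A sketch for the last two items, assuming `X` and `y` from the snippet above: fit sklearn’s Lasso and rank the features by coefficient magnitude; coefficients driven to zero mark features LASSO considers irrelevant. The `alpha` value is illustrative.

```julia
using ScikitLearn
import ScikitLearn: fit!
@sk_import linear_model: Lasso

feature_names = ["age", "sex", "bmi", "bp", "s1", "s2", "s3", "s4", "s5", "s6"]

model = Lasso(alpha=0.1)                 # illustrative regularization strength
fit!(model, X, y)

# Rank features by |coefficient|; zeroed coefficients mark dropped features.
coefs = model.coef_
for j in sortperm(abs.(coefs), rev=true)
    println(feature_names[j], " => ", round(coefs[j], digits=3))
end
```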