• Ridge regression is similar to linear regression, but the coefficients are chosen under an additional constraint besides fitting the data as well as possible: they should also stay small.
  • The new cost function is \(C = \frac{1}{2m} (\sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \lambda \sum_{j=1}^{n} a_j^2)\), where \(\lambda\) is the regularization parameter and \(\lambda \sum_{j=1}^{n} a_j^2\) is the regularization term. Here, \(m\) is the number of samples and \(n\) is the number of features.
  • This kind of regularization is \(L_2\) regularization.
  • The derivative of the cost function, to be used in the GD is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0).x_i + \frac{\lambda}{m} a_j\).
  • Then, the update function for each parameter is: \(a_{j\_new} = a_j - \alpha [\frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0).x_i + \frac{\lambda}{m} a_j]\).
  • Collecting the \(a_j\) terms, we can write the update equation as: \(a_{j\_new} = a_j(1 - \alpha \frac{\lambda}{m}) - \alpha \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0).x_i\).
  • The term \((1 - \alpha \frac{\lambda}{m})\) is usually slightly smaller than one, because \(m\) is large and \(\alpha\) is small.
  • Multiplying by this factor shrinks \(a_j\) a little at every step, before the usual gradient correction is applied. It therefore diminishes the effect of \(a_j\).
  • Smaller coefficient values result in a simpler model, which helps avoid over-fitting.
  • The smaller \(\lambda\) is, the closer the model is to ordinary linear regression; \(\lambda = 0\) recovers it exactly.
  • The larger \(\lambda\) is, the more heavily large coefficients are penalized, which mostly suppresses irrelevant features. If \(\lambda\) is too large, all parameters will be nearly zero, which results in under-fitting.
  • Note that, conventionally, we do not regularize the bias coefficient \(a_0\) (the input it multiplies is always 1, so it only shifts the predictions). This is why the regularization sum starts at \(j = 1\).
  • \(L_2\) regularization pushes all coefficients toward zero. This means that each feature should have a small effect on the outcome, while the model still predicts well.
  • Ridge regression is especially useful when the number of features is larger than the number of samples.
  • It helps avoid “over-fitting” by giving different importance to different features.
  • Moreover, ridge regression can handle correlated features.
  • The more samples you have, the less important regularization becomes (the shrinkage factor depends on \(\frac{\lambda}{m}\)).
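The update rule above can be sketched in Julia as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `ridge_gd`, the synthetic data, and the default values of `λ`, `α`, and `iters` are all chosen here for demonstration. The code generalizes the single-feature notation to a feature matrix `X` with one row per sample, and it leaves the bias `a0` unregularized.

```julia
using Random

# Ridge regression by batch gradient descent (illustrative sketch).
# X is m×n (samples × features); y has length m.
# a0 is the bias and is NOT regularized; a holds a_1 ... a_n.
function ridge_gd(X, y; λ=1.0, α=0.1, iters=5_000)
    m, n = size(X)
    a0 = 0.0
    a = zeros(n)
    for _ in 1:iters
        r = y .- (X * a .+ a0)                     # residuals y_i - ŷ_i
        grad0 = -sum(r) / m                        # ∂C/∂a_0, no penalty term
        grad = -(X' * r) ./ m .+ (λ / m) .* a      # ∂C/∂a_j, including the L2 term
        a0 -= α * grad0
        a .-= α .* grad
    end
    return a0, a
end

# Synthetic data (an assumption for this demo): only the first two features matter.
Random.seed!(1)
X = randn(200, 5)
y = 3.0 .* X[:, 1] .- 2.0 .* X[:, 2] .+ 1.0 .+ 0.1 .* randn(200)

a0_small, a_small = ridge_gd(X, y; λ=0.0)      # ordinary linear regression
a0_large, a_large = ridge_gd(X, y; λ=500.0)    # strong shrinkage

println("λ = 0:   ", round.(a_small; digits=3))
println("λ = 500: ", round.(a_large; digits=3))
```

Comparing the two printed coefficient vectors shows the shrinkage effect: with the larger `λ`, every coefficient is pulled toward zero, but none is expected to become exactly zero.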


  • LASSO regression differs in the regularization term, which uses absolute values: \(C = \frac{1}{2m} \sum_{i=1}^{m} (y_i - a_1x_i - a_0)^2 + \lambda \sum_{j=1}^{n} \lvert a_j \rvert\).
  • The derivative of the cost function is \(\frac{\partial C}{\partial a_j} = \frac{1}{m} \sum_{i=1}^{m} -(y_i - a_1x_i - a_0).x_i + \lambda \frac{a_j}{\lvert a_j \rvert}\) for \(a_j \neq 0\). (The rule for differentiating an absolute value is \(\lvert x \rvert' = \frac{x}{\lvert x \rvert}\) for \(x \neq 0\).)
  • This small difference has a large impact. It allows the coefficients of irrelevant features to be set to exactly zero, instead of just making them small.
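To see coefficients become exactly zero, here is a hedged Julia sketch. Note that it swaps in a different technique from the plain derivative above: because the absolute value is not differentiable at zero, the sketch uses proximal gradient descent (often called ISTA), which takes a gradient step on the squared-error part and then applies soft-thresholding for the \(L_1\) part. The names `soft_threshold` and `lasso_ista`, the synthetic data, and the parameter values are assumptions made for this example. The cost being minimized is the squared-error term divided by \(2m\) plus \(\lambda\) times the sum of absolute coefficients, with the bias left unpenalized.

```julia
using Random

# Soft-thresholding: moves z toward zero by t and returns 0 when |z| ≤ t.
soft_threshold(z, t) = sign(z) * max(abs(z) - t, 0.0)

# LASSO by proximal gradient descent (ISTA), an illustrative sketch.
# X is m×n (samples × features); y has length m; the bias a0 is unpenalized.
function lasso_ista(X, y; λ=0.1, α=0.1, iters=5_000)
    m, n = size(X)
    a0 = 0.0
    a = zeros(n)
    for _ in 1:iters
        r = y .- (X * a .+ a0)               # residuals y_i - ŷ_i
        a0 += α * sum(r) / m                 # plain gradient step for the bias
        z = a .+ α .* (X' * r) ./ m          # gradient step on the squared-error part
        a = soft_threshold.(z, α * λ)        # proximal step for the L1 penalty
    end
    return a0, a
end

# Synthetic data (an assumption for this demo): features 3 to 5 are irrelevant.
Random.seed!(1)
X = randn(200, 5)
y = 3.0 .* X[:, 1] .- 2.0 .* X[:, 2] .+ 1.0 .+ 0.1 .* randn(200)

a0, a = lasso_ista(X, y; λ=0.1)
println("coefficients: ", round.(a; digits=3))
println("number of exact zeros: ", count(iszero, a))
```

The soft-thresholding step is what produces exact zeros: any coefficient whose gradient-updated value falls within `α * λ` of zero is set to zero rather than merely made small.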


  1. Use the California housing dataset. Predict a house’s price from the features that are explained here. (4 points)

    using ScikitLearn
    @sk_import datasets: fetch_california_housing
    house = fetch_california_housing()
    X = house["data"]
    y = house["target"]
    1. Modify the multivariate linear regression code that you wrote earlier and add regularization to it to build a Ridge regression algorithm. You will only have to change the cost function and the gradient function.
    2. Separate a random 20% of the data for testing. Train your model with the remaining 80% of the data and test its performance on the test data.
    3. Test higher degree polynomials.
    4. Compare your results with the results of ScikitLearn’s built-in functions.
  2. Check the diabetes dataset from sklearn:

    using ScikitLearn
    import ScikitLearn: fit!, predict
    @sk_import datasets: load_diabetes
    all_data = load_diabetes()
    X = all_data["data"]
    y = all_data["target"]

    “Ten baseline variables, age, sex, body mass index, average blood pressure, and six blood serum measurements were obtained for each of n = 442 diabetes patients, as well as the response of interest, a quantitative measure of disease progression one year after baseline.”

    Bradley Efron, Trevor Hastie, Iain Johnstone and Robert Tibshirani (2004) “Least Angle Regression,” Annals of Statistics (with discussion), 407-499

    The following column names match the description above, respectively: age sex bmi bp s1 s2 s3 s4 s5 s6.

    1. Write code for a ridge/lasso regression model to predict disease progression. (3 points)
    2. Find the most important features. (1 point)
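For the random 80%/20% split requested in the first exercise, one possible helper is sketched below in plain Julia. The names `train_test_split` and `mse` are hypothetical helpers introduced here, not part of ScikitLearn.jl, and the sketch assumes `X` is an ordinary Julia matrix with one row per sample and `y` is a vector of the same length.

```julia
using Random

# Hypothetical helper: randomly split the rows of X (and entries of y)
# into a training set and a test set.
function train_test_split(X, y; test_fraction=0.2, rng=Random.default_rng())
    m = size(X, 1)
    idx = randperm(rng, m)                      # random permutation of 1:m
    n_test = round(Int, test_fraction * m)
    test_idx = idx[1:n_test]
    train_idx = idx[n_test+1:end]
    return X[train_idx, :], y[train_idx], X[test_idx, :], y[test_idx]
end

# Mean squared error between targets y and predictions ŷ.
mse(y, ŷ) = sum(abs2, y .- ŷ) / length(y)

# Usage on placeholder data (random numbers, only to show the shapes).
X = randn(100, 3)
y = randn(100)
X_train, y_train, X_test, y_test = train_test_split(X, y; rng=MersenneTwister(0))
println(size(X_train), " ", size(X_test))
```

Passing an explicit `rng` makes the split reproducible, which is convenient when comparing your own model against a library implementation on the same test rows.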