Linear regression | Applied Machine Learning

Ordinary linear regression

Least squares is a widely used method for finding the “line of the best fit”.
Fitting a line through the cloud of points will result in the line not passing through many of the points.
What is the best way to find parameters of a linear model \(y = ax + b\).
A typical approach is choosing \(a\) and \(b\) such that the difference between the actual dependent variables \(y_i\) and predicted variables \(\hat{y}_i\) is minimum.
There are different ways to measure the difference between \(y_i\) and \(\hat{y}_i\). Two common methods is \(L_1 norm\) and \(L_2 norm\) (Euclidean): \(L_1 = \sum_{i=1}^{n} | \hat{y}_i - y_i |\) , \(L_2 = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2\).
These methods are based on errors or residuals, where each error \(e_i = \hat{y}_i - y_i\).
To find the best \(a\) and \(b\), we want to minimize the loss measures \(L_l\).
\(l = 2\) is the most commonly used and simplest method. Because \(L_1\) has two main drawbacks. First, it does not have an analytical solution. Second, it sometimes provides multiple solutions, whereas \(L_2\) always returns a single optimal solution.
To find the best parameters of the line fitting a set of data, we can use a simple Monte Carlo algorithm. Although this approach is not optimal.
In the Monte Carlo algorithm, we choose two random values for \(a\) and \(b\). Then measure the loss. Next, we choose two other random values and measure the loss again. If the new loss is less than the previous one, we update \(a\) and \(b\). We repeat updating \(a\) and \(b\) until the loss stops changing much.

Linear regression best fit formula

There is a simpler and more common way for minimizing the cost function for a regression problem.
Deriving the formula to directly find the best fitting line is informative.
Let’s say \(x_i\) denotes each independent variable (input) and \(y_i\) denotes each dependent variable (output).
The line of best fit is \(\hat{y}_i = ax_i + b\).
The cost function is \(C = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2\)
We now replace \(\hat{y}_i\) with its definition: \(C = \sum_{i=1}^{n} (y_i - ax_i - b)^2\).
Minimizing \(C\) invovles finding its first derivative and finding at which values of \(a\) and \(b\) it is 0.
Partial derivative of \(C\) with respect to \(b\):
- \(\frac{\partial C}{\partial b}[\sum_{i=1}^{n} (y_i - ax_i - b)^2]\).
- Use the chain rule to take the derivative: \(0 = \sum_{i=1}^{n} -2(y_i - ax_i - b)\).
- Divide both sides by \(-2\): \(0 = \sum_{i=1}^{n} (y_i - ax_i - b)\)
- Break the summation into three parts and take \(a\) out: \(0 = \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} b\).
- Sum \(b\) to \(n\) is \(nb\) : \(0 = \sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i - nb\).
- Add \(nb\) to both sides and divide by \(n\): \(b = \frac{\sum_{i=1}^{n} y_i - a \sum_{i=1}^{n} x_i}{n}\).
- Summations divided by \(n\) is mean. So it all simplifies to: \(b = \bar{y} - a\bar{x}\).
Partial derivative of \(C\) with respect to \(a\):
- \(\frac{\partial C}{\partial a}[\sum_{i=1}^{n} (y_i - ax_i - b)^2]\).
- Use the chain rule: \(0 = \sum_{i=1}^{n} -2x_i(y_i - ax_i - b)\).
- Divide both sides by \(-2\): \(0 = \sum_{i=1}^{n} x_i(y_i - ax_i - b)\).
- Distribute the \(x\): \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - bx_i)\).
- Substitute \(b = \bar{y} - a\bar{x}\) from above: \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - (\bar{y} - a\bar{x})x_i)\).
- Distribute the minus sign and \(x\): \(0 = \sum_{i=1}^{n} (x_iy_i - ax_i^2 - \bar{y}x_i + a\bar{x}x_i)\).
- Split up the sum into two sums: \(0 = \sum_{i=1}^{n} (x_iy_i - \bar{y}x_i) + \sum_{i=1}^{n}(a\bar{x}x_i - ax_i^2)\).
- Take \(a\) out of the summation: \(0 = \sum_{i=1}^{n} (x_iy_i - \bar{y}x_i) - a \sum_{i=1}^{n}(x_i^2 - \bar{x}x_i)\).
- Isolate \(a\): \(a = \frac{\sum_{i=1}^{n} (x_iy_i - \bar{y}x_i)}{\sum_{i=1}^{n}(x_i^2 - \bar{x}x_i)}\).
You can now calculate \(a\), then substitute it in the formula for \(b\).
How different are \(a\) and \(b\) values from your Monte Carlo algorithm from the analytical results?

Multivariate regression

When the number of independent variables increase, analytical finding of the line of best fit becomes increasingly difficult.
A common algorithm to solve such problems is Gradient Descent (GD).
GD is an algorithm for minimizing any function, here the cost function.
GD outline:
1. Start with some random (or all zero) parameter values.
2. Take the derivative of the cost function with respect to each parameter
3. Subtract the derivative from the original parameter values.
4. Before subtraction, you can multiply the gradient by a fixed factor, the learning rate α. This controls the “step size” of the algorithm.
5. New parameter values are the ones from which the gradient (times the learning rate) has been subtracted.
More formally, a GD algorithm is repeating the following until convergence:
- \(a_i = a_i - \alpha \frac{\partial C}{\partial a_i}\) for \(i =1...n\), where \(a_i\) is parameter \(i\), \(C\) is the cost function, and α is the learning rate.
Since the derivative gives the slope of a line, substracting it from the original parameter value moves us in the direction of lowering the cost function.
If α is too small, the algorithm will be slow.
If α is too big, the algorithm will fail to converge or even diverge.
Note that all parameters should be updated simultaneously.
The cost function can be normalized by dividing it by number of samples \(C = \frac{1}{2n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2\).
- The \(\frac{1}{2}\) helps cancel the \(2\) in the derivation of the function.

Derivative of the cost function

Derivative of the bias term is \(\frac{\partial C}{\partial b} = \frac{1}{n}\sum_{i=1}^{n} -(y_i - \hat{y}_i)\).
Derivative of the coefficients is \(\frac{\partial C}{\partial a_i} = \frac{1}{n}\sum_{i=1}^{n} -(y_i - \hat{y}_i)x_i\).

How to choose α

Choosing a good learning rate is important for both converging to a minimum and for having an optimal learning time.
To find which learning rate is best, you can use two kind of plots:
1. How cost changes per iteration.
2. How learning time changes per iteration.
Ideally, cost should be decreasing and eventually becoming flat.
You want a learning rate the minimizes the training time.

Matrix operations

To make the code easier to write and faster to run, we can calculate the derivatives of all variables with matrix multiplication.
Let \(A\) be a vector of all coefficients \(a_1 ... a_n\).
To make computations easier, we can add a new feautre \(x_0\) and equal it to 1.
Thus, vector \(A\) also includes intercept \(b\).
The equation for the linear line is then \(Y = XA\), where X is the vector of all variables.
The cost function becomes \(C = \frac{1}{2n} \sum (XA - Y)^2\).
Derivative of the cost function of each variable is then \(\frac{\partial C_A}{\partial a_i} = \frac{\partial}{\partial a_i} (\frac{1}{2n} (XA - Y)^2) = \frac{1}{n} (XA - Y) \times X\).

Batch vs stochastic GD

Batch GD is when we update model parameters using all training data.
In case the data is very large, using batch GD can be too slow.
In stochastic GD, we update the coefficients with a random fraction of the training set at each iteration.
stochastic GD, as the name implies, does not always move in the direction of minimizing the cost function, but when the cost function is convex, it can find the minimum¹.

Exercises

Monte Carlo regression. Write an algorithm to find the best fitting line to the following two lists. This will be a simple algorithm that checks the loss for a random combination of a and b variables. If a new combination is better than the old one, then they are selected. Run the algorithm for \(N = 10^6\) steps. Compare it with the built-in functions for finding the best fitting line (see below). Test your algorithm with \(L_1\) and \(L_2\) loss functions.

More specifically, you first write two functions called predict and cost. predict gives you your predicted \(y\) from x, a and b. cost gives you the cost the predicted y. After that you write a function that accepts parameters x and y. Inside the function you write a for loop that pick random values for a and b (from range -1:0.01:1, for example). At each iteration, it checks whether the cost of the new parameters are less than the cost of previous parameters. If so, update the parameter values.

 using Random
 using VegaLite

 x = collect(1:10) .+ randn(10)
 y = collect(1:10) .+ randn(10)

 @vlplot(
   mark = {:point, color=:black, filled=true},
   x = {x, axis={title="Independent variable"}},
   y = {y, axis={title="Dependent variable"}}
 )

Univariate linear regression

Ordinary least square from Scikit-learn:

 using ScikitLearn
 import ScikitLearn: fit!, predict
 @sk_import linear_model: LinearRegression
 model = LinearRegression()
 x = reshape(x, 10, 1)  # needed for the fit! function
 fit!(model, x, y)
 ypred = predict(model, x)
 intercept, coef = model.intercept_, model.coef_[1]

 # plotting
 using DataFrames
 d = DataFrame(:x => x, :y => y, :ypred => ypred);
 p2 = @vlplot(data=d)+
   @vlplot(
     mark = {:point, color=:black, filled=true},
     x = {:x, axis={title="Independent variable"}},
     y = {:y, axis={title="Dependent variable"}}
   ) + 
   @vlplot(
     mark = {:line, color=:red},
     x = :x,
     y = :ypred
 )

Univariate linear regression

Univariate linear regression. Write a function to calculate the line of best fit to the following data.
```
using RDatasets
trees = dataset("datasets", "trees");
x = trees[!, :Girth];
y = trees[!, :Height];
```
Do so in two different ways: a) using gradient descent, b) using the analytical solution.
To run a linear regression with gradient descent, create the following functions:
- a) a function called univariate_predict, that accepts x, a, and b (all of them numbers), and return a prediction (number). If X is an array, the same function that you wrote can be used with broadcasting (predict.(X, a, b)).
- b) a function called univariate_gradient that accepts X (array), Y (array), a (number), and b (number) and then returns new values for a and b. To that end, it should value of the derivative of the cost function at X.
- c) a function called univariate_cost that accepts X, Y, a, and b and calculates the cost.
- d) Finally build a function called linear_regression for gradient descent that accepts X, Y, learning rate α, and iterations for number of iterations, and returns the final a and b.\
- Now you can put all these functions in a file and create a module called ML. During the course, we create mode machine learning algorithms and you can add them to you ML package. By the end of the course, you will have your own machine learning package. You may download from here an empty ML file to fill out the function definitions.
- As a best practice for writing reproducible code, add a Project.toml file to your directory that manages the version of packages you are using.
- Try different learning rates. Plot the change of cost vs time for different learning rates and choose a learning rate accordingly.
Multivariate linear regression. Write a function to find the line of best fit to the following data using gradient descent. The logic of the algorithm is the same as univariate linear regression, except that you will work with matrices this time. (3 points)
```
X = Matrix(trees[!, [:Girth,:Height]]);
y = trees[!, :Volume];
```

Bottou, Léon; Online Learning and Stochastic Approximations. (PDF) ↩

Tags: