• It means reducing the number of input variables for a model.
• It involves projecting (remember this term from linear algebra?) the data from its original dimensions onto a smaller number of new dimensions.
• It is similar to estimating the position of students in a class by their shadows on a wall.

Benefits

• Increase the speed of ML algorithms.
• Reduce the required memory to keep and analyze the data.
• Make high-dimensional data possible to visualize.

Principal component analysis (PCA)

• The most common algorithm for dimensionality reduction.
• It tries to find a k-dimensional plane onto which to project the n-dimensional data such that the sum of squared projection errors is minimized, where $$k < n$$.
• Specifically, we want to find $$k$$ vectors onto which to project the data.
• The difference between linear regression and PCA is that in linear regression, the errors are the vertical distances between the real values and the predictions, whereas in PCA, the errors are the shortest (orthogonal) distances between the samples and the projection plane. The two errors generally have different values. Moreover, in PCA, there is no prediction.
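
The difference between the two error definitions can be seen on a single point. The sketch below is only an illustration: the line y = x, the sample point, and all variable names are assumptions chosen to keep the arithmetic simple.

```julia
using LinearAlgebra

# Assumed setup: the line y = x (unit direction u) and one sample point p.
u = [1.0, 1.0] / sqrt(2)
p = [2.0, 0.0]

# Linear-regression-style error: vertical distance between the observed y
# and the value the line predicts at the same x.
y_pred = p[1] * (u[2] / u[1])
vertical_error = abs(p[2] - y_pred)       # 2.0

# PCA-style error: shortest (orthogonal) distance from the point to the line.
projection = dot(p, u) * u                # closest point on the line, ≈ [1.0, 1.0]
orthogonal_error = norm(p - projection)   # ≈ 1.414

println("vertical error:   ", vertical_error)
println("orthogonal error: ", orthogonal_error)
```

The orthogonal error is never larger than the vertical one, because the projection is by definition the closest point on the line.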

Algorithm

• Data should be mean normalized and scaled before PCA, so that every feature has zero mean and the same scale. You may apply a feature scaler after the mean normalization step.
• Compute the covariance matrix $$\Sigma$$. In Julia, the Statistics.cov function calculates the covariance matrix.
• In a covariance matrix, element i,j shows the covariance of feature i and feature j.
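• Compute the eigenvectors of $$\Sigma$$, for example with the eigvecs command from the LinearAlgebra package in Julia.
• Sort the eigenvectors by decreasing eigenvalue. The first $$k$$ eigenvectors in that order are the vectors onto which we want to project the data; they form the columns of $$u_k$$. Note that for a symmetric matrix, Julia returns the eigenvalues in ascending order, so the eigenvectors with the largest eigenvalues come last.
• Then the lower dimensional representation z of our original dataset is $$z = x_{scaled} \times u_k$$.
• Each eigenvalue of $$\Sigma$$ equals the variance of the data along its eigenvector, so the eigenvalues show how much of the variance each new feature explains.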
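
The steps above can be sketched in Julia as follows. This is a minimal illustration, not a reference implementation: the toy matrix X (samples in rows, features in columns) and the variable names are assumptions made for the example.

```julia
using Statistics, LinearAlgebra

# Toy data, made up for illustration: 6 samples (rows) x 3 features (columns).
X = [2.5 2.4 1.2;
     0.5 0.7 0.3;
     2.2 2.9 1.1;
     1.9 2.2 0.9;
     3.1 3.0 1.6;
     2.3 2.7 1.0]

# Step 1: mean-normalize and scale every feature (column).
mu = mean(X, dims=1)
sigma = std(X, dims=1)
X_scaled = (X .- mu) ./ sigma

# Step 2: covariance matrix of the features (n x n).
Sigma = cov(X_scaled, dims=1)

# Step 3: eigen-decomposition. For a Symmetric matrix the eigenvalues come
# back in ascending order, so reorder them to put the largest first.
E = eigen(Symmetric(Sigma))
order = sortperm(E.values, rev=true)
lambda = E.values[order]
U = E.vectors[:, order]

# Step 4: keep the top k eigenvectors and project the scaled data.
k = 2
U_k = U[:, 1:k]        # n x k
Z = X_scaled * U_k     # m x k, the lower dimensional representation

println(size(Z))
```

With samples stored in rows, the product `X_scaled * U_k` matches $$z = x_{scaled} \times u_k$$; if your data has samples in columns, the transposes change accordingly.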

Choosing the number of components

• Choose the number of components $$k$$ as the smallest value for which the new features explain at least, say, 99% of the variance.
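
One possible way to pick $$k$$ is to accumulate the explained variance and stop at the threshold. In the sketch below the eigenvalues are hypothetical numbers, assumed to be already sorted in descending order.

```julia
# Hypothetical eigenvalues of Sigma, sorted from largest to smallest.
lambda = [4.2, 1.1, 0.5, 0.15, 0.05]

explained = lambda ./ sum(lambda)    # fraction of variance per component
cumulative = cumsum(explained)       # variance explained by the first 1, 2, ... components

threshold = 0.99
k = findfirst(>=(threshold), cumulative)   # smallest k that reaches the threshold

println("k = ", k)   # k = 4
```

Here the first three components explain about 96.7% of the variance, so a fourth one is needed to pass 99%.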

Calculating the explained variance by each feature

• Compute the variance per feature.
• Explained variance per feature is the feature's variance divided by the total variance (the sum of all feature variances).
• This can be done for both the original data and the PCA features.
• To compute the explained variance of the PCA features correctly, first create all $$n$$ components, calculate their explained variance, and then choose the top $$k$$ components. Otherwise the total variance in the denominator is incomplete.
• There is an easier way too. The explained variance of each component is proportional to the corresponding eigenvalue of $$\Sigma$$: it equals that eigenvalue divided by the sum of all eigenvalues.
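
Both routes can be checked against each other. The sketch below reuses a made-up toy matrix (samples in rows) and illustrative variable names; it computes the explained variance once from the variances of all $$n$$ PCA features and once from the eigenvalues.

```julia
using Statistics, LinearAlgebra

# Toy data, made up for illustration: 6 samples x 3 features.
X = [2.5 2.4 1.2;
     0.5 0.7 0.3;
     2.2 2.9 1.1;
     1.9 2.2 0.9;
     3.1 3.0 1.6;
     2.3 2.7 1.0]
X_scaled = (X .- mean(X, dims=1)) ./ std(X, dims=1)

E = eigen(Symmetric(cov(X_scaled, dims=1)))
order = sortperm(E.values, rev=true)       # largest eigenvalue first
lambda = E.values[order]
Z_all = X_scaled * E.vectors[:, order]     # all n components, none dropped

# Route 1: variance of each PCA feature divided by the total variance.
v = vec(var(Z_all, dims=1))
explained_from_variance = v ./ sum(v)

# Route 2: each eigenvalue divided by the sum of all eigenvalues.
explained_from_eigen = lambda ./ sum(lambda)

println(explained_from_variance)
println(explained_from_eigen)
```

The two vectors agree up to floating-point error, which is why the eigenvalue route is usually the more convenient one.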

Some points in using PCA

• PCA should be fitted on the training set only. The same scaling parameters and projection vectors should then be applied, unchanged, to the CV and test sets.
• PCA is not a good way to prevent over-fitting. It might work, but a better way to address over-fitting is regularization. PCA is unsuitable for this task because it does not see the output data, and thus might throw away useful information.
• Always test your ML algorithms without PCA. Only use PCA if there is a good reason for it.
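
The first point can be sketched as two small helpers. The names `fit_pca` and `apply_pca` are hypothetical, introduced only for this illustration, and the data is made up; the idea is simply that the mean, scale, and projection vectors come from the training set and are reused as they are on other splits.

```julia
using Statistics, LinearAlgebra

# Hypothetical helper: learn scaling parameters and projection vectors
# from the training data only (samples in rows).
function fit_pca(X_train::AbstractMatrix, k::Integer)
    mu = mean(X_train, dims=1)
    sigma = std(X_train, dims=1)
    Sigma = cov((X_train .- mu) ./ sigma, dims=1)
    E = eigen(Symmetric(Sigma))
    order = sortperm(E.values, rev=true)          # largest eigenvalue first
    return (mu=mu, sigma=sigma, U_k=E.vectors[:, order[1:k]])
end

# Hypothetical helper: apply the training-set parameters to any split.
apply_pca(model, X) = ((X .- model.mu) ./ model.sigma) * model.U_k

# Made-up training and test data with 3 features each.
X_train = [2.5 2.4 1.2;
           0.5 0.7 0.3;
           2.2 2.9 1.1;
           1.9 2.2 0.9;
           3.1 3.0 1.6]
X_test = [2.3 2.7 1.0;
          1.0 1.1 0.5]

model = fit_pca(X_train, 2)
Z_train = apply_pca(model, X_train)
Z_test = apply_pca(model, X_test)    # no statistics are recomputed on the test set

println(size(Z_train), " ", size(Z_test))
```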

Exercise

1. Write a PCA function that accepts two arguments: data X and k for the reduced number of features. The output is the data in the new dimensions and the total explained variance by the new dimensions. (3 points)
2. Reduce the dimensions of the dataset below to 2D for visualization.
3. Reduce the dimensions of the dataset below to k dimensions such that those k new features explain 99% of the variance in the data.
using ScikitLearn