• Anomaly detection is an unsupervised learning technique.
  • It detects unexpected samples in a data set.
  • It should be used instead of classification when:
    1. there are too few samples in the positive (anomalous) class. In that case, we want to reserve all the positive samples for testing and cross-validation.
    2. there are many different possible “types” of anomalies, and each new one looks nothing like the previous ones.
  • An example use case is fault detection in manufacturing. Take a company that produces a complicated product like a car. Many things can go wrong when producing such a complex object. With outlier detection, we can take measurements of car function, such as noise, fuel consumption, heat, etc., and decide whether any single car behaves too differently from the rest. Other use cases include finding unusual network traffic, ecosystem disturbances, or simply identifying data points to be cleaned.


  • We build a model \(p(x)\) that assigns a probability to each data point \(x\), given all the data.
  • By computing the probability of every data point, we can pick out the outliers: those whose probability falls below a threshold \(\epsilon\).
  • We can use the Gaussian or normal distribution to assign a probability to each data point.
  • The Gaussian probability density function (PDF) \(= \frac{1}{\sigma \sqrt{2 \pi}} e ^ {- \frac{1}{2} (\frac{x - \mu}{\sigma})^2 }\), where \(\mu\) is the mean and \(\sigma\) is the standard deviation.
  • To compute the probability of a sample across all its features, we multiply its per-feature probabilities: \(p(x) = \prod_{j=1}^{n} p(x_j; \mu_j, \sigma_j^2) = \prod_{j=1}^{n} \frac{1}{\sigma_j \sqrt{2 \pi}} e ^ {- \frac{1}{2} (\frac{x_j - \mu_j}{\sigma_j})^2 }\), where \(n\) is the number of features. This treats the features as independent, which tends to work well enough in practice even when they are not.
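The model above can be sketched in a few lines of plain Julia using only the Statistics standard library. This is a minimal illustration, not a library API: the names `gauss_pdf`, `fit_gaussians`, `p`, and the toy data are ours, and the threshold \(\epsilon\) is chosen by eye for this example.

```julia
using Statistics

# Gaussian PDF for a single value, given mean μ and standard deviation σ
gauss_pdf(x, μ, σ) = exp(-0.5 * ((x - μ) / σ)^2) / (σ * sqrt(2π))

# Fit a per-feature mean and standard deviation on a samples × features matrix
fit_gaussians(X) = (vec(mean(X, dims=1)), vec(std(X, dims=1)))

# p(x) for one sample: product of the per-feature Gaussian densities
p(x, μs, σs) = prod(gauss_pdf.(x, μs, σs))

# Toy data: 2 features, mostly near zero, plus one obvious outlier (row 5)
X = [0.1 0.2; -0.2 0.1; 0.0 -0.1; 0.1 0.0; 5.0 5.0]
μs, σs = fit_gaussians(X)
probs = [p(X[i, :], μs, σs) for i in 1:size(X, 1)]

ϵ = 0.01                           # probability threshold, picked for this toy data
outliers = findall(<(ϵ), probs)    # indices of samples with p(x) < ϵ  →  [5]
```

Note that the outlier itself inflates the fitted means and standard deviations; with a small fraction of anomalies this is usually tolerable, which is why the method still works when the model is fit on the full (unlabeled) data set.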

Choosing a set of features

  • It is best to choose numerical features that are normally distributed. Even if their distribution is different, though, the algorithm usually still works.
  • To bring a feature closer to a normal distribution, apply a transformation such as the \(\log\) transformation or a root (\(x_i ^ {1/2}\), \(x_i ^ {1/3}\), etc.).
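A quick illustration of why these transformations help (the `skew` helper below is our own one-liner, not a library function): a right-skewed feature becomes markedly more symmetric after a log or cube-root transform.

```julia
using Statistics

# Skewness as the mean cubed z-score; 0 means symmetric, > 0 means a long right tail
skew(v) = mean(((v .- mean(v)) ./ std(v)) .^ 3)

x = [0.5, 1.0, 1.5, 2.0, 3.0, 5.0, 9.0, 20.0]   # right-skewed toy feature

x_log  = log.(x)     # log transform; requires strictly positive values
x_cbrt = cbrt.(x)    # cube root: weaker compression, also handles zeros and negatives

abs(skew(x_log))  < abs(skew(x))   # true: the log-transformed feature is less skewed
abs(skew(x_cbrt)) < abs(skew(x))   # true: the cube root also reduces skew
```

In practice you would plot a histogram of each candidate transformation and keep the one that looks closest to a bell curve.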


  1. Write a function that accepts two parameters, data X and a threshold σ, and returns the outliers in the data.

    using ScikitLearn
    using VegaLite
    @sk_import covariance: EllipticEnvelope
    @sk_import datasets: make_blobs
    # Example settings
    n_samples = 300
    outliers_fraction = 0.05
    n_outliers = round(Int64, outliers_fraction * n_samples)
    n_inliers = n_samples - n_outliers
    # Inliers: two coincident Gaussian blobs around the origin
    X, y = make_blobs(centers=[[0, 0], [0, 0]], cluster_std=0.5, random_state=1, n_samples=n_inliers, n_features=2)
    # Outliers: uniform noise on [-4, 4] × [-4, 4], appended to the inliers
    X = vcat(X, 8 .* rand(n_outliers, 2) .- 4)
    # Scatter plot of the full data set
    @vlplot(mark=:point, x=X[:, 1], y=X[:, 2])
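One way to start on exercise 1, as a sketch only: here we interpret the threshold σ as a number of standard deviations and flag a sample when any of its features has a z-score beyond it. The function name `find_outliers` and the toy matrix are ours, not part of the exercise.

```julia
using Statistics

# Flag row i of X as an outlier if any feature of row i lies more than
# σ standard deviations from that feature's mean (per-feature z-scores).
function find_outliers(X, σ)
    μs  = mean(X, dims=1)
    sds = std(X, dims=1)
    z = abs.((X .- μs) ./ sds)          # per-feature z-scores
    findall(vec(any(z .> σ, dims=2)))   # rows with at least one extreme feature
end

X = [ 0.1  0.2;
     -0.1  0.0;
      0.2 -0.1;
      0.0  0.1;
     -0.2  0.0;
      0.1 -0.2;
      0.0  0.1;
     -0.1 -0.1;
      0.2  0.0;
      8.0  0.1]            # row 10 is an obvious outlier in the first feature

find_outliers(X, 2.5)      # → [10]
```

For the generated blob data above, the fitted `EllipticEnvelope` (already imported) is the more robust tool, since it accounts for correlations between the two features rather than thresholding each one separately.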
  2. Use this dataset (uncompressed here) of credit card transactions to find fraudulent ones. Note that only about 0.1% of transactions are fraudulent, which makes this task unsuitable for supervised learning and a natural fit for anomaly detection.