• The most popular clustering algorithm.
• Scales well with many samples.
• The algorithm initially places $$k$$ points as cluster centroids, where $$k$$ is chosen by the user.
• It goes through every sample and assigns each sample to the closest cluster centroid.
• Then, it calculates the mean of each cluster and moves each centroid to the mean of its cluster. For multidimensional data, the mean is taken per feature (column), so the mean has the same dimensionality as the samples.
• At this point, it assigns samples to each centroid again based on the distance between the samples and centroids.
• Upon repeating the above steps, the algorithm converges to the point where the centroids stop moving.
• If one of the centroids does not get any samples assigned to it, you can remove it, in which case you will have $$k-1$$ clusters. Another, less common, solution is to reinitialize the centroids.
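The steps above can be sketched in a few lines of Julia. This is a minimal illustration, not an optimized implementation; the function and variable names (`my_kmeans`, `X`, `assignments`) are made up for this example:

```julia
using Random, Statistics

# A minimal k-means sketch. X is an n×d matrix of samples; k is the number of clusters.
function my_kmeans(X, k; max_iter=100)
    n = size(X, 1)
    centroids = X[randperm(n)[1:k], :]   # randomly pick k samples as initial centroids
    assignments = zeros(Int, n)
    for _ in 1:max_iter
        # Assignment step: each sample goes to its closest centroid.
        for i in 1:n
            assignments[i] = argmin([sum((X[i, :] .- centroids[j, :]) .^ 2) for j in 1:k])
        end
        # Update step: move every centroid to the mean of its assigned samples.
        new_centroids = copy(centroids)
        for j in 1:k
            members = X[assignments .== j, :]
            size(members, 1) > 0 && (new_centroids[j, :] = vec(mean(members, dims=1)))
        end
        new_centroids == centroids && break   # converged: centroids stopped moving
        centroids = new_centroids
    end
    return centroids, assignments
end
```

Because the assignments eventually stop changing, the centroid means stop moving as well, which is the convergence check used above.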

## The optimization function

• The optimization function of the k-means algorithm minimizes the average squared distance between the samples and the cluster centroids to which they are assigned.
• Formally, the cost function we want to minimize is $$C = \frac{1}{n}\sum_{i=1}^{n} \| x_i - \mu_{c_i} \|^2$$, where $$x_i$$ is the location of sample $$i$$, and $$\mu_{c_i}$$ is the location of the cluster centroid to which sample $$i$$ is assigned.
• The cluster assignment step of the k-means algorithm minimizes the cost function with respect to the assignment of samples to clusters.
• The centroid movement step minimizes the cost function with respect to the locations of the centroids.
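The cost function can be computed directly. A small illustrative helper (the names `X`, `centroids`, and `assignments` are assumptions for this sketch: the samples as an n×d matrix, the centroid locations as a k×d matrix, and the cluster index of each sample):

```julia
# Average squared distance between each sample and its assigned centroid.
function kmeans_cost(X, centroids, assignments)
    n = size(X, 1)
    return sum(sum((X[i, :] .- centroids[assignments[i], :]) .^ 2) for i in 1:n) / n
end

# Two samples, both assigned to a single centroid midway between them:
X = [0.0 0.0; 2.0 0.0]
centroids = [1.0 0.0]
kmeans_cost(X, centroids, [1, 1])   # each sample is at squared distance 1, so the cost is 1.0
```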

## How to randomly initialize the cluster centroids

• There are different ways you can do this, but there is a recommended method that helps avoid getting trapped in local minima.
• Choose $$k < n$$.
• Randomly pick $$k$$ samples as the cluster centroids.
• Even so, k-means can still get stuck in local optima.
• To increase the chance of finding the best solution, run the algorithm many times (e.g. 50 to 1000) with different initial values, and then choose the solution with the lowest cost.
• Random initialization matters most when you have a small number of clusters. The more clusters you have, the less likely the algorithm is to get stuck in a local optimum.
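The multiple-restart strategy can be sketched as follows. `kmeans_once` is a compact single run written only for this illustration; the restart loop simply keeps the lowest-cost result:

```julia
using Random, Statistics

# One k-means run from a random initialization; returns centroids, assignments, and cost.
function kmeans_once(X, k; max_iter=100)
    n = size(X, 1)
    centroids = X[randperm(n)[1:k], :]   # random samples as initial centroids
    assignments = zeros(Int, n)
    for _ in 1:max_iter
        for i in 1:n
            assignments[i] = argmin([sum((X[i, :] .- centroids[j, :]) .^ 2) for j in 1:k])
        end
        for j in 1:k
            members = X[assignments .== j, :]
            size(members, 1) > 0 && (centroids[j, :] = vec(mean(members, dims=1)))
        end
    end
    cost = sum(sum((X[i, :] .- centroids[assignments[i], :]) .^ 2) for i in 1:n) / n
    return (centroids=centroids, assignments=assignments, cost=cost)
end

# Run many times with different random initializations; keep the lowest-cost solution.
function kmeans_restarts(X, k; n_init=50)
    results = [kmeans_once(X, k) for _ in 1:n_init]
    return argmin(r -> r.cost, results)
end
```

Note that scikit-learn's `KMeans` implements the same idea through its `n_init` parameter.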

## How to choose $$k$$

• The most common way is to choose $$k$$ manually by looking at a visualization of the data.
• Another method is the “elbow method”.
• The “elbow method” involves plotting the minimized cost of the algorithm over a range of $$k$$ values. The cost should decrease as $$k$$ increases. Initially, the cost decreases fast, and then at a specific $$k$$ it starts to decrease more slowly. That specific $$k$$ can be a good candidate for the number of clusters.
• The elbow method is not commonly used because many times you do not get an “elbow” in your cost curve.
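Still, the curve is cheap to produce. A sketch using ScikitLearn's `KMeans` on fabricated blob data (`inertia_` is scikit-learn's name for the total within-cluster cost; the range `1:10` is an arbitrary choice for this example):

```julia
using PyPlot
using ScikitLearn
@sk_import cluster: KMeans
@sk_import datasets: make_blobs

# Fabricated data with three true clusters (same generator as the exercise below).
features, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.55, random_state=0)

# Fit k-means for a range of k and record the minimized cost (inertia_).
ks = 1:10
costs = [fit!(KMeans(n_clusters=k, n_init=10, random_state=0), features).inertia_ for k in ks]

# Plot the cost against k and look for the bend ("elbow") in the curve.
plot(ks, costs, marker="o")
xlabel("k")
ylabel("minimized cost")
```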

## Exercise

1. Cluster the following fabricated data using the k-means algorithm. The data are in the `features` variable. Note that here we already have the labels in the `labels` variable, but we will not use them. (2 points)

```julia
using PyPlot
using ScikitLearn
@sk_import datasets: make_blobs

features, labels = make_blobs(n_samples=500, centers=3, cluster_std=0.55, random_state=0)

scatter(features[:, 1], features[:, 2], color="black")
```

2. Use the k-means algorithm to categorize human faces in this dataset. This kind of learning can be used later: when we encounter a new face, we can determine which group it belongs to. It may be a good idea to reduce the dimensionality of the data and scale the features to the same ranges. Exclude people who have fewer than 20 pictures; otherwise the predictions would be too random. Reserve a fraction of the images for testing. (Optional. 3 points)

You can read in the images with the Images.jl library:

```julia
# Install required packages
using Pkg; Pkg.add("Images")
using Images
```

The images are in color, but it is easier to work with grayscale. You can either manually change the images to grayscale, or do it with the `Gray` function from the `ColorTypes` package (`Gray.(img)`).