• KNN is a nonparametric learning algorithm, i.e., it does not make any assumptions about the structure of the data.
  • It is a distance-based majority-vote algorithm: it checks the categories of the k nearest neighbors of a data sample and assigns the sample to the category held by the majority of those neighbors.
  • Distance can be measured in different ways, such as Euclidean distance.
  • For regression tasks, it uses the mean of the k nearest neighbors as the prediction.

Pros

  • It assumes no specific structure of the data.
  • Easy to understand and interpret.
  • Can be used for both classification and regression.
  • Easy to tune, with only one parameter, k.
  • Fast when the number of samples and features is not too large.

Cons

  • Not powerful when there are many features (hundreds or more), due to the curse of dimensionality.
  • If most of the features are zero most of the time (sparse data), it performs poorly.
  • Cannot extrapolate; it is unreliable for queries outside the range of the training data.
  • Computationally demanding when number of samples or features is large.
  • Not good for very large datasets, because it stores all the training data and therefore uses a lot of memory.
  • Also, with large data, predictions can be slower than with other algorithms.
  • Needs irrelevant features to be removed.

Algorithm

  1. Choose a k value, which is the number of nearest neighbors that will be used for making a prediction.
  2. Calculate the distance between the data point you want to classify and every sample in your training set.
  3. Sort the distances.
  4. Choose the first k samples with the least distance.
  5. For classification, assign the most common label among the k neighbors to the query.
  6. For regression, assign the mean of the k neighbors to the query (a minimal Julia sketch of these steps follows this list).
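
As a rough illustration, the steps above can be sketched in plain Julia. This is a minimal, unoptimized sketch: X is assumed to be an n_samples × n_features matrix, y a vector of labels (or targets), query a feature vector, and k the number of neighbors.

# steps 2-4: compute distances from the query to every training row, sort, keep the k nearest
function knn_indices(X, query, k)
    dists = [sqrt(sum((X[i, :] .- query) .^ 2)) for i in 1:size(X, 1)]
    return sortperm(dists)[1:k]
end

# step 5 (classification): majority vote among the k nearest labels
function knn_classify(X, y, query, k)
    counts = Dict{eltype(y), Int}()
    for label in y[knn_indices(X, query, k)]
        counts[label] = get(counts, label, 0) + 1
    end
    return findmax(counts)[2]   # the label with the most votes
end

# step 6 (regression): mean of the k nearest targets
knn_regress(X, y, query, k) = sum(y[knn_indices(X, query, k)]) / k

Ties in the vote are broken arbitrarily here; a more careful implementation would handle them explicitly.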

General advice

  • To choose the best k, test your model's performance over a range of values (see the sketch after this list).
  • If k is too small, the algorithm will be too sensitive to noise.
  • If k is too large, the algorithm will oversmooth and fail to capture local patterns.
  • For binary classification, choose an odd k so that there is always a clear majority.
  • As with other distance-based algorithms, feature scaling improves performance.
  • KNN is sensitive to irrelevant features, so dimensionality reduction and feature selection are important.
  • KNN does not extrapolate. Your query should be within the same range as the training data.
  • If data are sparse, KNN will have problems making predictions.
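
For example, here is one rough way to scan a range of k values on a held-out validation set, reusing the knn_classify sketch from the Algorithm section. X_train, y_train, X_val, and y_val are assumed to be prepared already, with features scaled.

# try several (odd) k values and record validation accuracy for each
function choose_k(X_train, y_train, X_val, y_val, ks)
    accuracies = Dict{Int, Float64}()
    for k in ks
        preds = [knn_classify(X_train, y_train, X_val[i, :], k) for i in 1:size(X_val, 1)]
        accuracies[k] = count(preds .== y_val) / length(y_val)
    end
    return accuracies
end

choose_k(X_train, y_train, X_val, y_val, 1:2:15)

Then pick the k with the highest accuracy, or, for small datasets, average the accuracies over several random splits or cross-validation folds.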

Exercise

Here is a database of first names and the gender associated with each. Create a KNN classifier that can tell whether a name is male or female.

(There is a commercial service that does just this task.)

For each name, you can create a list of features. Some suggestions are the following: number of letters, number of vowels (a, e, i, o, u), number of hard consonants (b, d, g, p, t, k), number of each letter in the name, last letter, and first letter.
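
One possible way to turn a name into such a feature vector is sketched below. The function name and its particular feature choices are only illustrative (the per-letter counts suggested above are left out for brevity), and plain lowercase ASCII names are assumed.

# length, vowel count, hard-consonant count, and the first and last letters
# encoded as positions in the alphabet (a = 1, ..., z = 26)
function name_features(name::AbstractString)
    s = lowercase(name)
    vowels = count(c -> c in "aeiou", s)
    hard = count(c -> c in "bdgptk", s)
    first_letter = Float64(first(s) - 'a' + 1)
    last_letter = Float64(last(s) - 'a' + 1)
    return [length(s), vowels, hard, first_letter, last_letter]
end

name_features("Maria")   # [5.0, 3.0, 0.0, 13.0, 1.0]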

To calculate the distance between two points, there are different metrics. A common one is the Euclidean distance: \(d = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}\), where q and p are two vectors of size n.

You can implement the distance function yourself easily. For example:

distance(p, q) = sqrt(sum((p .- q) .^ 2))

Alternatively, there is a package named Distances.jl, which implements the Euclidean and many other distances.

using Distances
# x and y are two vectors of the same length
r = evaluate(Euclidean(), x, y)
r = Euclidean()(x, y)   # equivalent: the metric object can be called directly

Here is how KNN is used through ScikitLearn.jl (see the KNeighborsClassifier and KNeighborsRegressor documentation). It provides some extra options. For example, after finding the k nearest neighbors, we can weight their votes by their distance to the query. In our implementation above, all k nearest neighbors carry the same weight in determining the value of the query; weighting them by distance can give different results.

ScikitLearn also provides other algorithms for finding the nearest neighbors. Our implementation above was a brute-force search. Here we can choose the BallTree or KDTree algorithm instead, or let ScikitLearn pick the best algorithm for us based on the input.

using ScikitLearn
import ScikitLearn: fit!, predict

@sk_import neighbors: KNeighborsClassifier
# @sk_import neighbors: KNeighborsRegressor # for regression

# 3 nearest neighbors, votes weighted by distance; "auto" lets ScikitLearn pick the search algorithm
model = KNeighborsClassifier(n_neighbors=3, weights="distance", algorithm="auto", metric="minkowski")

# a single query must be a 1 × n_features matrix
query = reshape([4.3, 2.0, 7.0, 1.0], 1, 4)

# X is the n_samples × n_features training matrix, y the vector of labels
fit!(model, X, y)
predict(model, query)
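
For example, here is a small sketch comparing the uniform and distance weighting options and two of the neighbor-search algorithms, reusing the X, y, and query from above; the specific choices are only illustrative.

# uniform weights: every neighbor's vote counts equally; KD-tree speeds up the neighbor search
uniform_model = KNeighborsClassifier(n_neighbors=3, weights="uniform", algorithm="kd_tree")
fit!(uniform_model, X, y)
predict(uniform_model, query)

# distance weights: closer neighbors count more, so predictions near class boundaries can differ
weighted_model = KNeighborsClassifier(n_neighbors=3, weights="distance", algorithm="ball_tree")
fit!(weighted_model, X, y)
predict(weighted_model, query)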