• Naive Bayes is a probabilistic algorithm that treats the features as independent of each other given the class, hence the term naive.
  • Generally, it is similar to linear models, but it is faster to train and usually generalizes somewhat worse.
  • There are three kinds of NB: Gaussian, Bernoulli, and multinomial.
  • Gaussian NB accepts any continuous data, whereas Bernoulli NB expects binary data and multinomial NB expects count data.
  • Bernoulli and multinomial NB are mostly used in text data classification.

Conditional probability

  • What is the probability of a die showing 6, assuming that it is fair?
  • What is the probability of drawing a King from a deck of playing cards, if the card is a heart?
  • In the examples above, the event is conditioned on another event: a die showing a 6 on the condition that it is fair, and drawing a King on the condition that the card drawn is a heart.
  • That is conditional probability: the probability of an event, given another event.
  • Mathematically, it is \(P(A \lvert B) = \frac{P(A \cap B)}{P(B)}\), where A can be seeing a King, B is seeing hearts, and \(P(A \cap B)\) is the joint probability of both A and B.
  • In the cards example, the probability of drawing a card that is both a King and a heart is 1/52 (\(P(A \cap B)\)). The probability of a heart is 13/52 (\(P(B)\)). So the probability of a King, given that the card is a heart, is (1/52) / (13/52) = 1/13.
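
The card example can be checked by brute-force enumeration. The snippet below is a minimal sketch: the names `deck`, `prob`, `is_king`, and `is_heart` are made up for this illustration, and exact rational arithmetic (`//`) is used so the result prints as a fraction.

```julia
# Enumerate a standard 52-card deck as (rank, suit) pairs.
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = [(r, s) for r in ranks for s in suits]

# Probability of an event = number of matching cards / number of cards.
prob(event) = count(event, deck) // length(deck)

is_king(card) = card[1] == "K"
is_heart(card) = card[2] == "hearts"

p_joint = prob(c -> is_king(c) && is_heart(c))  # P(A ∩ B)
p_heart = prob(is_heart)                        # P(B)
p_king_given_heart = p_joint / p_heart          # P(A | B)

println(p_king_given_heart)
```

The same three quantities appear in the formula above: the joint probability, the probability of the conditioning event, and their ratio.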

Bayes’ rule

  • In machine learning we train models with some training data. From that data we can estimate the probability of the inputs given the output, \(P(X\lvert Y)\). We aim to predict the probability of each output given a new input, \(P(Y\lvert X)\).
  • Bayes' rule can be derived from conditional probability and is \(P(Y \lvert X) = \frac{P(X \lvert Y) \times P(Y)}{P(X)}\).
  • When there are multiple features, as is usually the case, we can extend Bayes' rule by assuming independence between the features (naivety).
  • The Naive Bayes rule is \(P(Y_k \lvert X_1, ..., X_n) = \frac{P(X_1 \lvert Y_k) \times P(X_2 \lvert Y_k) ... P(X_n \lvert Y_k) \times P(Y_k)}{P(X_1) \times P(X_2) ... P(X_n)}\), where \(k\) is a class of \(Y\). Strictly, the independence assumption holds within each class, so the product in the denominator is only an approximation of \(P(X_1, ..., X_n)\). Because the denominator is the same for every class, it does not affect which class gets the highest probability.
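
The derivation mentioned above is short enough to write out. As a sketch, using the same symbols as the formulas in this section, write the definition of conditional probability in both directions and solve for the joint probability:

\[P(Y \lvert X) = \frac{P(X \cap Y)}{P(X)}, \qquad P(X \lvert Y) = \frac{P(X \cap Y)}{P(Y)} \;\Rightarrow\; P(X \cap Y) = P(X \lvert Y) \times P(Y)\]

Substituting the joint probability into the first expression gives Bayes' rule:

\[P(Y \lvert X) = \frac{P(X \lvert Y) \times P(Y)}{P(X)}\]

The naive step then replaces the likelihood of all features together with a product of per-feature likelihoods:

\[P(X_1, ..., X_n \lvert Y_k) = \prod_{i=1}^{n} P(X_i \lvert Y_k)\]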

Gaussian Naive Bayes

  • This is used when the features X are continuous rather than categorical.
  • If we assume X follows a certain distribution (here Gaussian), then we can compute the likelihood of each feature.
  • To compute the likelihood, all we need is the mean \(\mu_c\) and variance \(\sigma_c^2\) of X for a given class \(c\) of Y: \(P(X\lvert Y_c) = \frac{1}{\sqrt{2\pi \sigma_c^2}} e^{\frac{-(x-\mu_c)^2}{2\sigma_c^2}}\)
  • We can transform the data to make them more Gaussian-like. A power transformation is one option that helps.
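
The Gaussian likelihood above translates almost directly into code. This is a minimal sketch: `gaussian_likelihood` and `feature_values` are hypothetical names, the numbers are invented for illustration, and `var` returns the sample variance (dividing by n - 1), which is an assumption about how the variance is estimated.

```julia
using Statistics  # mean and var from the standard library

# Gaussian density of x for a class with mean mu and variance sigma2.
gaussian_likelihood(x, mu, sigma2) =
    exp(-(x - mu)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Hypothetical values of one feature for the samples of a single class.
feature_values = [4.9, 5.1, 5.0, 5.2, 4.8]

mu_c = mean(feature_values)     # class mean
sigma2_c = var(feature_values)  # class variance (sample variance, n - 1)

# Likelihood of a new observation x = 5.0 under this class.
println(gaussian_likelihood(5.0, mu_c, sigma2_c))
```

Values close to the class mean get a high likelihood, and the likelihood falls off quickly as the observation moves away from it.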


Example 1

The nodal dataset contains binary predictors of whether the lymph nodes of patients with prostate cancer are affected by cancer or not.

using RDatasets
df = dataset("boot","nodal")

As an example here, we take into account three predictors: “Aged”, in which 1 indicates that the person is over 60; “Stage”, in which 1 indicates a more serious case of cancer; and “Acid”, in which 1 indicates high levels of acid phosphatase. “R” is the output, indicating whether nodes are involved or not.

The counts from the above data are as below:

│ Row │ R     │ Acid_sum │ Stage_sum │ Aged_sum │ Total │
│     │ Int64 │ Int64    │ Int64     │ Int64    │       │
│ 1   │ 1     │ 16       │ 15        │ 7        │ 20    │
│ 2   │ 0     │ 14       │ 12        │ 17       │ 33    │

If we know whether the cancer is at a serious stage, whether the patient is aged, and whether the acid levels are high, can we predict whether the nodes are involved?

We need to calculate the probability that the nodes are involved and the probability that the nodes are not involved. Whichever probability is higher, we choose that.

To that end, we will use the formula \(P(Y \lvert X) = \frac{P(X \lvert Y) \times P(Y)}{P(X)}\).

P(Y), the priors, are the fraction of each output category among all data:

  • \[P(Y_1) = 20/53\]
  • \[P(Y_0) = 33/53\]

P(X), the probability of evidence, is the fraction of all patients for whom each predictor equals 1.

  • \[P(X_{acid}) = 30/53\]
  • \[P(X_{stage}) = 27/53\]
  • \[P(X_{aged}) = 24/53\]

\(P(X \lvert Y)\), the likelihood of evidence, is the fraction of patients within each category of the output for whom the predictor equals 1.

  • \[P(X_{acid} \lvert Y_1) = 16/20\]
  • \[P(X_{stage} \lvert Y_1) = 15/20\]
  • \[P(X_{aged} \lvert Y_1) = 7/20\]

The probability that an aged person with high acid phosphatase and a serious stage cancer has involved nodes is:

\[P(Y_1 \lvert X) = \frac{P(X_{acid} \lvert Y_1) \times P(X_{stage} \lvert Y_1) \times P(X_{aged} \lvert Y_1) \times P(Y_1)}{P(X_{acid}) \times P(X_{stage}) \times P(X_{aged})} = \frac{0.0792}{0.1306} \approx 0.61\]

Which means that the nodes are more likely to be involved than not.
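
The text says to compute the probability for both outcomes and pick the larger one. The sketch below does that from the counts in the table. The names `totals`, `ones_per_class`, and `class_score` are invented for this illustration. One deliberate difference from the formula above: instead of dividing by the product of the marginals, it divides each class score by the sum of the two class scores, so that the two posteriors add up to 1.

```julia
# Counts taken from the table above: patients per class and, within each
# class, how many have Acid = 1, Stage = 1, and Aged = 1.
totals = Dict(1 => 20, 0 => 33)
ones_per_class = Dict(
    1 => (acid = 16, stage = 15, aged = 7),
    0 => (acid = 14, stage = 12, aged = 17),
)
n = sum(values(totals))  # 53 patients in total

# Prior of class y times the product of the three likelihoods.
function class_score(y)
    prior = totals[y] / n
    likelihood = prod(c / totals[y] for c in ones_per_class[y])
    return prior * likelihood
end

scores = Dict(y => class_score(y) for y in keys(totals))

# Normalize so that the two posteriors sum to 1.
posteriors = Dict(y => s / sum(values(scores)) for (y, s) in scores)

println("P(involved | X)     = ", posteriors[1])
println("P(not involved | X) = ", posteriors[0])
```

With this normalization the posterior for involved nodes comes out at roughly 0.62, and the conclusion is the same: involved nodes are the more likely outcome for this patient.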

Example 2

“This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.”

using RDatasets
df = dataset("datasets","iris")

To use these continuous variables in the Naive Bayes algorithm, we assume that they are normally distributed. However, this assumption should be tested, and variables that are not normally distributed should be transformed toward a normal distribution.

The probability that a flower with certain sepal and petal lengths and widths is setosa is:

\[P(Y_{setosa} \lvert X) = \frac{P(X \lvert Y_{setosa}) \times P(Y_{setosa})}{P(X \lvert Y_{setosa})P(Y_{setosa}) + P(X \lvert Y_{not-setosa})P(Y_{not-setosa})}\]

The prior is calculated as in the previous example: \(P(Y_{setosa}) = 50/150\).

The likelihood of the evidence \(P(X \lvert Y_{setosa})\) is calculated using \(P(X\lvert Y_c) = \frac{1}{\sqrt{2\pi \sigma_c^2}} e^{\frac{-(x-\mu_c)^2}{2\sigma_c^2}}\). In this case, \(Y_c = Y_{setosa}\).

For each input variable, we calculate the mean and variance of that variable over the samples that belong to \(Y_{setosa}\). That is, we calculate the mean and variance of SepalLength, SepalWidth, PetalLength, and PetalWidth for the setosa flowers. We then plug each mean and variance into the equation above to get the likelihood of the corresponding measurement of our new sample, and multiply the four likelihoods together.
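
The procedure just described can be sketched in a few lines of Julia. Everything here is illustrative: `fit_gaussian_nb`, `posterior`, and `gaussian_pdf` are hypothetical helper names, and the small matrix `X` holds invented measurements for two features rather than rows of the iris data, so that the block runs without any package besides the standard library.

```julia
using Statistics

# Gaussian density, as in the likelihood formula above.
gaussian_pdf(x, mu, sigma2) = exp(-(x - mu)^2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

# Hypothetical measurements (rows = flowers, columns = two features).
# These numbers are invented for illustration; they are not iris rows.
X = [1.4 0.2; 1.5 0.3; 1.3 0.2; 4.5 1.5; 4.7 1.4; 4.4 1.3]
y = ["setosa", "setosa", "setosa", "other", "other", "other"]

# Per class: prior, per-feature means, and per-feature variances.
function fit_gaussian_nb(X, y)
    params = Dict{String, NamedTuple}()
    for c in unique(y)
        rows = X[y .== c, :]
        params[c] = (prior = size(rows, 1) / size(X, 1),
                     mu = vec(mean(rows, dims = 1)),
                     sigma2 = vec(var(rows, dims = 1)))
    end
    return params
end

# Posterior of each class for one new sample x, normalized over the classes.
function posterior(params, x)
    scores = Dict(c => p.prior * prod(gaussian_pdf.(x, p.mu, p.sigma2))
                  for (c, p) in params)
    total = sum(values(scores))
    return Dict(c => s / total for (c, s) in scores)
end

params = fit_gaussian_nb(X, y)
post = posterior(params, [1.4, 0.25])
println(post)
```

With many features, the product of small densities can underflow, so practical implementations usually add log-likelihoods instead of multiplying likelihoods. The direct product is kept here because it mirrors the formula.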

Using Scikit-learn

Full documentation of the Naive Bayes algorithms in Scikit-learn can be found in the Scikit-learn user guide.

Here is an example of its implementation with ScikitLearn in Julia. If you have some expert knowledge about the data, you can give priors to the model with the priors argument of GaussianNB. If you do so, the priors are not calculated from the data.

using ScikitLearn
import ScikitLearn: fit!, predict
@sk_import naive_bayes: GaussianNB  # for Gaussian NB
@sk_import naive_bayes: MultinomialNB  # for Multinomial NB

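@sk_import datasets: load_iris

# Load the iris data bundled with Scikit-learn
iris = load_iris()

X = iris["data"]    # feature matrix
y = iris["target"]  # species labels encoded as integers

# Fit a Gaussian NB model, then predict the class of each training sample
model = GaussianNB()
fit!(model, X, y)
predict(model, X)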

General advice

  • Transform the data to make them normal.
  • Identify and remove correlated features.
  • Apply Laplace correction.
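    • When a feature value never occurs together with an output category in the training data, its likelihood is zero, so the Bayes probability for that category will be zero.
    • To make such cases non-zero, we usually add 1 to the counts in the numerators, so that each is at least 1.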
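
The Laplace correction can be sketched as a one-line function. The name `smoothed_likelihood` and its keyword arguments are hypothetical. The formula adds a pseudo-count `alpha` to the numerator and `alpha` times the number of possible feature values to the denominator, which is one common way to keep the estimate a valid probability.

```julia
# Smoothed estimate of P(X = 1 | Y) for a categorical feature.
#   count_ones:  samples of the class with the feature equal to 1
#   class_total: all samples of the class
#   alpha:       pseudo-count (alpha = 1 is the Laplace correction)
#   n_values:    number of values the feature can take (2 for binary)
smoothed_likelihood(count_ones, class_total; alpha = 1, n_values = 2) =
    (count_ones + alpha) / (class_total + alpha * n_values)

# A feature value never seen in a class of 20 samples is no longer zero.
println(smoothed_likelihood(0, 20))

# A count from Example 1 (Acid = 1 in 16 of 20 patients) moves only slightly.
println(smoothed_likelihood(16, 20))
```

Setting `alpha = 0` recovers the plain fraction used in Example 1.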