

Introduction

Naive Bayes is a classification algorithm based on Bayes’ theorem with strong (naive) independence assumptions between the features. The model takes data \(X\) as input (generally discrete) and outputs a discrete value \(Y\).


Bayes’ theorem

Bayes’ theorem is:

\[p(y \vert x) = \frac{p(x \vert y)p(y)}{p(x)}\]

It is just a rearrangement of the definition of conditional probability:

\[p(y \vert x) p(x) = p(x \cap y) = p(x \vert y)p(y)\]

Another formulation is:

\[posterior = \frac{prior \times likelihood}{evidence}\]
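
As a quick numeric sketch of the formula (all probabilities below are made up for illustration):

```python
# A tiny numeric illustration of Bayes' theorem; all probabilities are made up.
prior = 0.3         # p(y): prior probability of the class
likelihood = 0.8    # p(x | y): probability of the observation given the class
evidence = 0.4      # p(x): overall probability of the observation

posterior = likelihood * prior / evidence   # p(y | x)
print(posterior)    # ≈ 0.6
```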


Assumptions

The Naive Bayes model makes the assumption that the features are independent given the output.

Let’s take the same example as in Andrew Ng’s CS229 course to explain this. In spam email classification, if one knows that an email is spam (\(Y=1\)), then knowing whether the word ‘buy’ appears in the email won’t change one’s belief about the appearance of the word ‘price’ in the email.

The normal behaviour would be that if the word ‘buy’ appears in the email, the probability that the word ‘price’ also appears is higher. That is why the assumption is said to be naive.

More formally, the Naive Bayes assumption for two features \(X_1\) and \(X_2\) is:

\[\begin{eqnarray} p(X_1, X_2 \vert Y) &&= p(X_1 \vert Y) p(X_2 \vert Y, X_1) \\ \text{Using Naive Bayes assumption }\Rightarrow &&= p(X_1 \vert Y) p(X_2 \vert Y) \end{eqnarray}\]

And for \(d\) features \(X_1, \ldots, X_d\) we get:

\[\begin{eqnarray} p(X_1, \ldots, X_d \vert Y) &&= p(X_1 \vert Y) p(X_2 \vert Y, X_1) \ldots p(X_d \vert Y, X_{d-1}, \ldots, X_1)\\ \text{Using Naive Bayes assumption }\Rightarrow &&= p(X_1 \vert Y) p(X_2 \vert Y) \ldots p(X_d \vert Y) \\ &&= \prod_{i=1}^d p(X_i \vert Y) \end{eqnarray}\]

Equivalently we have:

\[\begin{eqnarray} p(X_1, \ldots, X_d, Y) &&= p(Y) p(X_1 \vert Y) p(X_2 \vert Y, X_1) \ldots p(X_d \vert Y, X_{d-1}, \ldots, X_1)\\ \text{Using Naive Bayes assumption }\Rightarrow &&= p(Y) p(X_1 \vert Y) p(X_2 \vert Y) \ldots p(X_d \vert Y) \\ &&= p(Y) \prod_{i=1}^d p(X_i \vert Y) \end{eqnarray}\]

Using Bayes’ theorem we know that \(p(Y \vert X) = \frac{p(X \vert Y)p(Y)}{p(X)} = \frac{p(X,Y)}{p(X)}\). When \(X\) is observed, the denominator is just a constant and thus:

\[\begin{eqnarray} p(Y \vert X_1, \ldots, X_d) && \varpropto p(X_1, \ldots, X_d \vert Y)p(Y) \\ && \varpropto p(Y) \prod_{i=1}^d p(X_i \vert Y) \end{eqnarray}\]
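
To make this proportionality concrete, here is a minimal Python sketch of the resulting decision rule \(\hat{y} = \arg\max_k \; p(Y=k) \prod_{i=1}^d p(X_i \vert Y=k)\), assuming hypothetical, already-estimated probability tables for two classes and three binary features (computed in log-space for numerical stability):

```python
import numpy as np

# A minimal sketch of the decision rule: p(Y=k | X) is proportional to
# p(Y=k) * prod_i p(X_i | Y=k). Parameters below are hypothetical.
log_prior = np.log(np.array([0.7, 0.3]))     # log p(Y=k) for k = 0, 1
cond = np.array([[[0.9, 0.1],                # cond[k, i, v] = p(X_i = v | Y = k)
                  [0.8, 0.2],
                  [0.95, 0.05]],
                 [[0.2, 0.8],
                  [0.4, 0.6],
                  [0.7, 0.3]]])

def predict(x):
    """Return argmax_k of log p(Y=k) + sum_i log p(X_i = x_i | Y=k)."""
    log_cond = np.log(cond[:, np.arange(len(x)), x])   # shape: (n_classes, n_features)
    return int(np.argmax(log_prior + log_cond.sum(axis=1)))

print(predict(np.array([1, 1, 0])))   # predicts class 1
```

The evidence \(p(X)\) is never computed: it is the same for every class, so it does not affect the \(\arg\max\).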


Formula

We will express the formulas using the notation \(k\) for the class of \(Y\). In the binary case, \(k \in \{0, 1\}\) and in the multiclass case \(k \in \{1, \ldots, K\}\).

Bernoulli Naive Bayes

In the Bernoulli Naive Bayes, the inputs \(X_i\) of the model are Bernoulli variables that tell us whether feature \(i\) is present or not.

Hence each \(X_i \sim \mathcal{B}(\phi_{i, k})\) when \(Y\) is of class \(k\), and:

\[p(X \vert Y=k) = \prod_{i=1}^d \phi_{i, k}^{X_i} (1 - \phi_{i, k})^{(1-X_i)}\]
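
A minimal sketch of this likelihood in Python, assuming a hypothetical parameter matrix \(\phi\) with entries \(\phi_{i,k}\):

```python
import numpy as np

# A sketch of the Bernoulli likelihood, with a hypothetical parameter matrix
# phi where phi[k, i] = phi_{i,k} = p(X_i = 1 | Y = k).
phi = np.array([[0.1, 0.2, 0.05],
                [0.8, 0.6, 0.3]])

def bernoulli_likelihood(x, k):
    """p(X = x | Y = k) = prod_i phi_{i,k}^{x_i} * (1 - phi_{i,k})^{1 - x_i}."""
    return float(np.prod(phi[k] ** x * (1 - phi[k]) ** (1 - x)))

print(bernoulli_likelihood(np.array([1, 1, 0]), 1))   # 0.8 * 0.6 * 0.7 ≈ 0.336
```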


Multinomial Naive Bayes

In the Multinomial Naive Bayes, the inputs of the model are assumed to have been generated by multinomial distributions, each with its own parameters: \(M\), the number of possible outcomes, and \(p_1, \ldots, p_M\), where \(p_m\) is the probability that outcome \(m\) occurs.

Each \(X_i\) can be viewed as one of the columns of a histogram drawn from a multinomial distribution: it counts how many times the corresponding outcome occurred.

\[\begin{eqnarray} X &&= \left(X_{1, 1}, \ldots, X_{1, M_1}, X_{2, 1}, \ldots, X_{2, M_2}, \ldots, X_{d, 1}, \ldots, X_{d, M_d}\right) \\ &&= \left(X_1, \ldots, X_n\right) \end{eqnarray}\]

Where \(n = M_1 + \ldots + M_d\), \(M_i\) being the number of possible outcomes of the \(i\)-th multinomial.

And \(Y \sim \mathcal{B}(\phi_Y)\).

\[p(X \vert Y=k) = \frac{(\sum_{i=1}^n X_i)!} {\prod_{i=1}^n X_i!} \prod_{i=1}^n {p_{i, k}}^{X_i}\]
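
A minimal sketch of this likelihood, assuming hypothetical per-class outcome probabilities \(p_{i,k}\) (note that the multinomial coefficient does not depend on \(k\), so it cancels when comparing classes):

```python
import math
import numpy as np

# A sketch of the multinomial likelihood, with hypothetical per-class outcome
# probabilities p where p[k, i] = p_{i,k} (each row sums to 1).
p = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.2, 0.7]])

def multinomial_likelihood(x, k):
    """p(X = x | Y = k) for a vector of counts x."""
    coef = math.factorial(int(x.sum())) / math.prod(math.factorial(int(c)) for c in x)
    return coef * float(np.prod(p[k] ** x))

print(multinomial_likelihood(np.array([2, 1, 0]), 0))   # 3!/(2! 1! 0!) * 0.5^2 * 0.3 = 0.225
```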


Gaussian Naive Bayes

In the Gaussian Naive Bayes, the input \(X\) of the model is assumed to have been generated by a Gaussian distribution.

Hence the input data \(X \sim \mathcal{N}(\mu_k, \Sigma_k)\) when \(Y\) is of class \(k\), where \(\mu_k \in \mathbb{R}^{d}\) is a mean vector and \(\Sigma_k \in \mathbb{R}^{d \times d}\) is a diagonal covariance matrix with entries \(\sigma_{i,k}^2\) (the naive assumption makes the features independent given the class), and for each feature:

\[p(X_i=x_i \vert Y=k)=\frac{1}{\sqrt{2 \pi \sigma_{i,k}^2}} \exp \left(- \frac{(x_i - \mu_{i,k})^2}{2 \sigma_{i,k}^2}\right)\]
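
A minimal sketch of this likelihood with a diagonal covariance, assuming hypothetical per-class means and variances:

```python
import numpy as np

# A sketch of the Gaussian likelihood with diagonal covariance, using hypothetical
# per-class means mu[k, i] = mu_{i,k} and variances var[k, i] = sigma_{i,k}^2.
mu = np.array([[0.0, 1.0],
               [3.0, -1.0]])
var = np.array([[1.0, 0.5],
                [2.0, 1.0]])

def gaussian_likelihood(x, k):
    """p(X = x | Y = k) = prod_i N(x_i; mu_{i,k}, sigma_{i,k}^2)."""
    densities = np.exp(-(x - mu[k]) ** 2 / (2 * var[k])) / np.sqrt(2 * np.pi * var[k])
    return float(np.prod(densities))

x = np.array([0.5, 0.8])
print(gaussian_likelihood(x, 0), gaussian_likelihood(x, 1))   # larger under class 0 here
```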


Calibration

For all types of Naive Bayes, the calibration is done using maximum likelihood estimation (MLE):

\[\begin{eqnarray} L(\phi_y, \phi_{i,Y=0}, \phi_{i,Y=1}) &&= \prod_{j=1}^{n_{pop}} p(X^{(j)}, Y^{(j)}) \\ &&= \prod_{j=1}^{n_{pop}} p(X^{(j)} \vert Y^{(j)}) p(Y^{(j)}) \end{eqnarray}\]

Applying the \(\log\) (a strictly increasing function), we get:

\[\begin{eqnarray} l(\phi_y, \phi_{i,Y=0}, \phi_{i,Y=1}) &&= \log L(\phi_y, \phi_{i,y=0}, \phi_{i,y=1}) \\ &&= \log \prod_{j=1}^{n_{pop}} p(X^{(j)} \vert Y^{(j)}) p(Y^{(j)}) \\ &&= \sum_{j=1}^{n_{pop}} \log \left[p(X^{(j)} \vert Y^{(j)}) p(Y^{(j)}) \right] \end{eqnarray}\]


Bernoulli Naive Bayes

Now, replacing \(p(X^{(j)} \vert Y^{(j)})\) by its distribution, we get:

\[\begin{eqnarray} l(\phi_y, \phi_{i,Y=0}, \phi_{i,Y=1}) &&= \sum_{j=1}^{n_{pop}} \log \left(p(X^{(j)} \vert Y^{(j)}) p(Y^{(j)}) \right) \\ &&= \sum_{j=1}^{n_{pop}} \log \left[ \prod_{i=1}^d \phi_{i,Y^{(j)}}^{X_i^{(j)}} (1 - \phi_{i,Y^{(j)}})^{1-X_i^{(j)}} (\phi_Y)^{Y^{(j)}} (1-\phi_Y)^{1-Y^{(j)}} \right]\\ &&= \sum_{j=1}^{n_{pop}} \left( \sum_{i=1}^d \left[X_i^{(j)} \log \phi_{i,Y^{(j)}} + (1-X_i^{(j)}) \log (1 - \phi_{i,Y^{(j)}})\right] + Y^{(j)} \log \phi_Y + (1-Y^{(j)}) \log (1-\phi_Y) \right) \end{eqnarray}\]

Setting the derivative with respect to each parameter to 0, we obtain:

\[\phi_{i,Y=0} = \frac{\sum_{j=1}^{n_{pop}} \mathbf{1}_{X_i^{(j)}=1, Y^{(j)}=0}} {\sum_{j=1}^{n_{pop}} \mathbf{1}_{Y^{(j)}=0}}\]

\[\phi_{i,Y=1} = \frac{\sum_{j=1}^{n_{pop}} \mathbf{1}_{X_i^{(j)}=1, Y^{(j)}=1}} {\sum_{j=1}^{n_{pop}} \mathbf{1}_{Y^{(j)}=1}}\]

\[\phi_Y = \frac{\sum_{j=1}^{n_{pop}} \mathbf{1}_{Y^{(j)}=1}}{n_{pop}}\]
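
These closed-form estimates are simple counts; a minimal sketch on a tiny hypothetical dataset:

```python
import numpy as np

# The closed-form estimates above are just counts; the dataset is hypothetical.
X = np.array([[1, 0, 1],    # X[j, i] = X_i^{(j)}
              [1, 1, 0],
              [0, 0, 1],
              [0, 1, 0]])
Y = np.array([1, 1, 0, 0])  # Y[j] = Y^{(j)}

phi_y = Y.mean()                   # fraction of examples with Y = 1
phi_1 = X[Y == 1].mean(axis=0)     # fraction of class-1 examples with X_i = 1
phi_0 = X[Y == 0].mean(axis=0)     # fraction of class-0 examples with X_i = 1
print(phi_y, phi_1, phi_0)         # 0.5, [1. 0.5 0.5], [0. 0.5 0.5]
```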


Multinomial Naive Bayes

The same steps can be applied to find the parameters.
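
As a sketch of what these steps yield (shown below on a hypothetical dataset of count vectors), the MLE of \(p_{i,k}\) amounts to normalizing the per-class outcome counts; in practice, Laplace smoothing is often added to avoid zero probabilities:

```python
import numpy as np

# A sketch of the count-based MLE in the multinomial case, on a hypothetical
# dataset of count vectors; p_hat[k, i] estimates p_{i,k}.
X = np.array([[2, 1, 0],
              [1, 0, 3],
              [0, 2, 2]])
Y = np.array([0, 1, 1])
n_classes = 2

p_hat = np.zeros((n_classes, X.shape[1]))
for k in range(n_classes):
    counts = X[Y == k].sum(axis=0)     # total count of each outcome within class k
    p_hat[k] = counts / counts.sum()   # normalize into a probability vector
print(p_hat)   # ≈ [[0.667, 0.333, 0.], [0.125, 0.25, 0.625]]
```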


Gaussian Naive Bayes

The same steps can be applied to find the parameters.
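
In the Gaussian case the MLE for each class reduces to the per-class sample mean and (biased) sample variance of each feature; a minimal sketch on a hypothetical dataset:

```python
import numpy as np

# A sketch of the MLE in the Gaussian case, on a hypothetical dataset: for each
# class the estimates are the per-feature sample mean and (biased) sample variance.
X = np.array([[1.0, 2.0],
              [1.2, 1.8],
              [3.0, 0.0],
              [3.4, 0.4]])
Y = np.array([0, 0, 1, 1])

for k in (0, 1):
    mu_k = X[Y == k].mean(axis=0)     # estimate of mu_k
    var_k = X[Y == k].var(axis=0)     # estimate of sigma_k^2 (np.var divides by n)
    print(k, mu_k, var_k)
```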

