May 14, 2018

Machine Learning for the Masses: Naive Bayes (pt.1)


Machine learning doesn’t have to be or seem complicated. It can be as straightforward as counting and multiplying, which is the case with Naive Bayes. This tutorial aims to offer a mostly non-academic (but without sacrificing mathematical rigor) overview of Naive Bayes, along with references on how to take it to the next level. It assumes basic knowledge of probability theory. Part 1 presents the basic ideas of the algorithm using a form of Naive Bayes called Multinomial Naive Bayes. In part 2 we take it a bit further, briefly presenting other forms such as Multivariate Bernoulli Naive Bayes and some more exotic but still useful variants.


Naive Bayes is a generative classifier which is really good at classifying text. Its most popular application is the spam filter. It is based (in fact, this is all it is, along with some tricks) on Bayes' rule, so let’s start from there. Feel free to skip the math and go directly to the practical part, but it isn't that hard and it will help you get a deeper understanding of what you are doing.

Conditional probability

A key concept for a bottom-up explanation of Bayes' rule and NB is conditional probability. We usually write it as \(P(A|B)\), which can be read as: the probability of A given that B has occurred. In general the formula is: $$P(A|B) = {P(A,B) \over P(B)}$$ Note that the conditional probability can easily be found with the help of a contingency table: it’s the joint probability of the two events divided by the marginal probability of the conditioning event, exactly what the formula above states.
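To make this concrete, here is a tiny sketch that reads \(P(A|B)\) off a contingency table of raw counts (the numbers are made up purely for illustration):

```python
# Hypothetical contingency table of 100 trials, counting how often
# events A and B did or did not occur together.
counts = {
    ("A", "B"): 30,
    ("A", "not B"): 10,
    ("not A", "B"): 20,
    ("not A", "not B"): 40,
}
total = sum(counts.values())

# Joint P(A,B) and marginal P(B), both estimated from the counts.
p_a_and_b = counts[("A", "B")] / total
p_b = (counts[("A", "B")] + counts[("not A", "B")]) / total

# P(A|B) = P(A,B) / P(B)
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 0.6
```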

Now is a good time to say that not all events are dependent. For example, the outcome of rolling a die couldn't care less about the outcome of any previous rolls. If the first roll is event A and the second roll is event B, then we say that A and B are independent. In this case: $$P(B|A)={P(B)}$$

figure 1 - Bayes' rule

Bayes' rule

Feel free to skip this and go directly to the actual rule, but it's super easy to derive. Taking the formula for the conditional probability and solving for the joint probability, we have the following: $${P(A|B)P(B)=P(A,B)}$$ If we swap A and B, we get: $${P(B|A)P(A)=P(B,A)}$$ And finally, since \(P(A,B)=P(B,A)\): $${P(A|B)P(B)=P(B|A)P(A)}$$ Solving for \(P(A|B)\) gives Bayes' rule: $$P(A|B)={P(B|A)P(A) \over P(B)}$$ In Bayesian terms, \(P(A|B)\) is called the posterior probability, \(P(B|A)\) the likelihood, \(P(A)\) the prior probability and \(P(B)\) the marginal probability.
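The rule itself is a one-liner in code; the sketch below plugs in arbitrary (but internally consistent) probabilities just to show the mechanics:

```python
def bayes(likelihood: float, prior: float, marginal: float) -> float:
    """Bayes' rule: posterior P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / marginal

# Made-up numbers: P(B|A) = 0.5, P(A) = 0.4, P(B) = 0.25
posterior = bayes(0.5, 0.4, 0.25)
print(posterior)  # 0.8
```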


NB is a supervised algorithm, therefore we need some kind of training examples to get started. Assume we were given some manually classified sentences, i.e. people were shown each sentence and marked it with the class they considered more relevant: geography or economics. (In reality these seven training examples are by no means sufficient to make NB work properly.)

Text                                                         Class
Capital land and labor are the three factors of production   Economics
Athens is the capital of Greece                              Geography
The wealth of nations                                        Economics
The Amazon river is also home to the piranha                 Geography
Nile is the longest river on earth                           Geography
The Process of Production of Capital                         Economics
Australia is wider than the moon                             Geography

table 1 - training examples

Now suppose you were given the sentence: "Widest river in the world". Before classifying it as either geography or economics, we need to perform the training process using the training examples listed above. We can describe training in two steps:

  1. For each word \(w\) in the dictionary (i.e. all words in the training examples) and for each class \(c\), calculate the probability \(P(w|c)\). These are the likelihoods in Bayesian terms.
  2. For each class \(c\), calculate the probability \(P(c)\). This is the prior in Bayesian terms.

We will see how to calculate these after presenting a quick but important step: preprocessing.


Preprocessing is a prerequisite to training (you could skip it, but it's not recommended) and can be as simple as mapping words to their lowercased or stemmed forms, e.g. Capital→capital (lowercasing) or Capitals→Capit (according to the Porter stemmer). We do this in order to make sure we group the words that should be grouped. For simplicity, we are applying a lowercase filter only.

Feature selection

Furthermore, we need to create our dictionary, which is going to consist of all the words we are going to consider. We could keep every single word that appears in the documents, but besides inflating the feature space, 'bad' features can also hurt our results.

Intuitively, words such as 'the', 'is', 'of', etc. are very common and don’t add much value, since they don’t distinguish one topic from another (in contrast with something more context-specific such as “oceanic”). We call these “stopwords” and we remove them completely.

A couple of more formal ways to choose the most informative features would be ranking them by Information Gain or even performing PCA. These are slightly more advanced techniques and are out of the scope of this tutorial.

The next step is to apply the lowercase filter to each word and also remove the common words 'the', 'of', 'is', 'to', 'on', 'than', 'are', 'also' and 'and'. After doing this, we simply list each word in the dictionary along with its number of occurrences.
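The preprocessing described above is a couple of lines of code. A minimal sketch (the stopword set here includes 'and' and 'in' on top of the listed words, so that the worked examples in this article come out as shown; it is not a general-purpose stopword list):

```python
# Stopwords used in this article; 'and' and 'in' are included so the
# worked examples (the frequency table and "widest river world") match.
STOPWORDS = {"the", "of", "is", "to", "on", "than", "are", "also", "and", "in"}

def preprocess(sentence: str) -> list[str]:
    """Lowercase a sentence and drop the stopwords."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

print(preprocess("Nile is the longest river on earth"))
# ['nile', 'longest', 'river', 'earth']
```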

figure 2 - training example


It's time to do the counting part - as mentioned in the beginning, it's going to be counting and multiplying only. We are working with a form of NB called Multinomial NB. All you need to know for now is that this version takes into account the frequency of each word, e.g. that a word appears \(n\) times in a class. It's trivial to create the word/frequency mapping depicted below with basic programming skills, and we are going to use it to train our model:

Geography              Economics
Word        Frequency  Word        Frequency
river       2          production  2
wider       1          capital     2
piranha     1          wealth      1
nile        1          three       1
moon        1          process     1
longest     1          nations     1
home        1          land        1
greece      1          labor       1
earth       1          factors     1
capital     1
australia   1
athens      1
amazon      1

table 2 - counting word frequencies
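With basic programming skills indeed: the table above can be built in a few lines with collections.Counter. The preprocess helper below is assumed to do the lowercasing and stopword removal described earlier:

```python
from collections import Counter

STOPWORDS = {"the", "of", "is", "to", "on", "than", "are", "also", "and", "in"}

TRAINING = [
    ("Capital land and labor are the three factors of production", "economics"),
    ("Athens is the capital of Greece", "geography"),
    ("The wealth of nations", "economics"),
    ("The Amazon river is also home to the piranha", "geography"),
    ("Nile is the longest river on earth", "geography"),
    ("The Process of Production of Capital", "economics"),
    ("Australia is wider than the moon", "geography"),
]

def preprocess(sentence):
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

# One word -> frequency Counter per class, i.e. table 2.
freqs = {"geography": Counter(), "economics": Counter()}
for text, label in TRAINING:
    freqs[label].update(preprocess(text))

print(freqs["geography"]["river"])    # 2
print(freqs["economics"]["capital"])  # 2
```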

Training (cont.)

Now that we have the frequencies for each word in each class, we can resume the training process. We started describing it in two steps, so let's take it from there:

  1. If you scroll up, this step requires doing a calculation for every word in the dictionary and for every class. It's exactly the same work for any word and any class, so let's do it for a single pair. Pick the word river and the class geography: $$P(river|geography)={count(river, geography) \over count(geography)}= {2 \over 13}$$ In words, that is the number of occurrences of the word river in the geography class divided by the number of all words in the geography class.
  2. The prior is even easier: it's just the number of words in the geography class divided by the number of words in the dictionary. That is: $$P(geography)={count(geography) \over count(vocabulary)}={13 \over 21}$$
Without doing much, we are halfway there! How would this data classify the sentence "Widest river in the world"?


You might be wondering why exactly we calculated these probabilities during the training process. Hopefully it will become much clearer now. As we said, we need to classify the sentence "Widest river in the world". First we should apply the exact same filters and feature selection applied to the training data, i.e. not much: lowercase everything and remove stopwords. This preprocessing step leaves us with "widest river world". At last, it's time to put Bayes' rule into action!

figure 3 - classification example

We have already said what needs to be done: calculate the probability of each category given this text. Let's get started. Using Bayes' rule, we need to calculate \(P(geography|'widest,river,world')\) and \(P(economics|'widest,river,world')\): $${P('widest,river,world'|geography)P(geography) \over P('widest,river,world')}$$ and $${P('widest,river,world'|economics)P(economics) \over P('widest,river,world')}$$ Why is that? Because the question is: given the sentence "Widest river in the world", what is the probability that it talks about geography? The answer is simply Bayes' rule. You can go back and check, it should make sense.

Now, whichever of these two probabilities is higher, that class is our winner. Let's start working with geography. Note that since we are only comparing these probabilities and the denominator is the same for both classes, we can omit it. So \(P(geography|'widest,river,world')\) becomes proportional to: $${P('widest,river,world'|geography)P(geography)}$$ or, in Bayesian terms: $$posterior \propto likelihood \times prior$$

We already have the prior from the training! We could calculate the likelihood by applying the probability chain rule, but this is inefficient, as we would need to keep track of the co-occurrences of all of these words in the class we are examining. Instead we make a simplification: assume that each word is independent of every other word, given the category. This is the key point of Naive Bayes and where the algorithm got its "naiveness" (I don't really like the term "naive"; other names such as Independence Bayes make a bit more sense to me).

This makes our lives a lot easier as the likelihood becomes: $${P('widest'|geography) P('river'|geography) P('world'|geography)}$$

That is, the product of the likelihoods of each word, independently of the other words. This is the conditional independence assumption.

Note that some of these have already been calculated during training! What we don't have are the words we have never seen before, such as the word widest (bonus: had we used an aggressive enough stemmer, we might already cover this word, because we have seen wider).

The problem with previously unseen words is that \(P('widest'|geography) = 0\), which makes the whole product, and therefore the final result, zero. This is clearly wrong, because it cancels out other words which we might have seen and which yield non-zero probabilities.

In order to work around this issue we just add a constant \(k\) to the numerator and \(k|V|\) to the denominator, where \(|V|\) is the size of the vocabulary (i.e. all the distinct words across all classes). Usually \(k=1\). This is formally known as Laplace smoothing. With it, any unseen word such as the word widest becomes: $$P('widest'|geography) = {0 + 1 \over 13 + 21} = {1 \over 34}$$ Also, training slightly changes to: $$P(river|geography)={count(river, geography) + 1 \over count(geography) + count(vocabulary)}= {2 + 1 \over 13 + 21} = {3 \over 34}$$
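The smoothed likelihood is a one-liner; plugging in the numbers above reproduces the same fractions:

```python
def smoothed_likelihood(word_count: int, class_count: int,
                        vocab_size: int, k: int = 1) -> float:
    """Laplace-smoothed P(word|class) = (count + k) / (class total + k*|V|)."""
    return (word_count + k) / (class_count + k * vocab_size)

print(smoothed_likelihood(0, 13, 21))  # 1/34: unseen word, e.g. 'widest' in geography
print(smoothed_likelihood(2, 13, 21))  # 3/34: 'river' in geography
```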

We are going to use Laplace smoothing everywhere, not only for unseen words.

Putting it all together: $${P(geography|'widest,river,world')} \propto {1 \over 34} * {3 \over 34} * {1 \over 34} * {13 \over 21} \approx {0.0000473}$$ $${P(economics|'widest,river,world')} \propto {1 \over 29} * {1 \over 29} * {1 \over 29} * {8 \over 21} \approx {0.0000156}$$

We're done! The winner is the one with the highest probability, hence geography.


It might seem like a fairly long process but after doing it a couple of times it is essentially counting and multiplying. Let's summarize what we did:

  1. Decide which NB we are going to use. We chose Multinomial NB here.
  2. Apply preprocessing to the training examples.
  3. Training: create a word to frequency mapping. This can be implemented using a Dictionary in most programming languages. Use Laplace smoothing.
  4. Classification: given an incoming sentence, use the mapping from step 3. Don't forget to apply Laplace smoothing for any previously unseen words. Use Bayes' rule to put it all together and calculate the posteriors. The class with the highest probability wins.
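To close, the four steps above fit in one short script. This is a sketch under a couple of assumptions: the prior here is taken as the fraction of word tokens per class, so the intermediate fractions may differ slightly from the hand-worked numbers above (the predicted class is the same), and all names (preprocess, classify, TRAINING) are my own:

```python
from collections import Counter

STOPWORDS = {"the", "of", "is", "to", "on", "than", "are", "also", "and", "in"}

TRAINING = [
    ("Capital land and labor are the three factors of production", "economics"),
    ("Athens is the capital of Greece", "geography"),
    ("The wealth of nations", "economics"),
    ("The Amazon river is also home to the piranha", "geography"),
    ("Nile is the longest river on earth", "geography"),
    ("The Process of Production of Capital", "economics"),
    ("Australia is wider than the moon", "geography"),
]

def preprocess(sentence):
    """Step 2: lowercase and remove stopwords."""
    return [w for w in sentence.lower().split() if w not in STOPWORDS]

# Step 3 (training): per-class word -> frequency mapping.
freqs = {}
for text, label in TRAINING:
    freqs.setdefault(label, Counter()).update(preprocess(text))

vocab_size = len(set().union(*freqs.values()))
total_tokens = sum(sum(c.values()) for c in freqs.values())

def classify(sentence):
    """Step 4: pick the class with the highest (unnormalized) posterior."""
    words = preprocess(sentence)
    posteriors = {}
    for label, counts in freqs.items():
        class_total = sum(counts.values())
        prior = class_total / total_tokens
        likelihood = 1.0
        for w in words:
            # Laplace smoothing handles previously unseen words.
            likelihood *= (counts[w] + 1) / (class_total + vocab_size)
        posteriors[label] = likelihood * prior
    return max(posteriors, key=posteriors.get)

print(classify("Widest river in the world"))  # geography
```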