Machine Learning for the Masses: Naive Bayes (pt.2)
I hope you enjoyed part 1 and got a good intuition on the subject. In part 2, we are going to explore it a bit more so that if you understand both parts you can safely add it in your skillset and check the advanced box. The main subject of this post are the underlying distributions of the features we choose for our model. We are going to present and explain the differences between six main Naive Bayes variations: multivariate Bernoulli, multinomial, binarized multinomial, multivariate Gaussian, Flexible and Complement Naive Bayes.
The main purpose of this part is for anyone reading it to see that there is not a single way to create a model. You are free to experiment and use a healthy amount of imagination to create a new one. Or even combine the existing ones. Endless possibilities.
You might have noticed and pondered about these weird prefixes of Naive Bayes, like Multinomial or Bernoulli. Worry not. They are nothing more than the underlying distributions of the features we choose to construct our model with, i.e. what distributions our features are drawn from. In part 1, we chose as features the frequencies of each unigram (i.e. single word) which allowed us to say that we were using Multinomial NB. We could have chosen the distinct occurrences of trigrams in a class (e.g. whether trigram “I am going” occurs in a class or not). Whatever. The important part is to understand that you get to do the feature engineering and nothing is written in stone.
While we could jump right at the NB variations, I think it’s best to briefly present some well known distributions so that you get an idea. It’s mostly going to be plain English plus some formalities to make sure we are on the same page. Before doing that I am going to briefly explain what a random variable is - the basis to explain distributions.
It’s not the right post to extensively discuss about random variables but since they are essential to explain our four distributions we can give a basic definition. A random variable is any quantity that depends on the outcome of a random experiment.
Examples make everything clearer: let a random variable X which depends on the outcome of the dice roll and takes the values 1 to 6 (we normally represent these values with the lower case version of the random variable, i.e. x). The probability distribution of each random variable is the probability that each one of these values is having. In the dice example, each value has the same probability of ⅙. Similarly a random variable Y can depend on the outcome of the coin toss and takes the values y=0 (head) or y=1 (tail) with probability ½, in case of a fair coin.
In other words it’s just a practical way to organize our random events. If that sounds weird, I am preparing a post with the very basics of probability theory soon. Now we are going to use the building block we called a random variable and create the notion of a distribution.
The simplest random variable is the one that takes two values, 0 or 1 and we say that this random variable follows a Bernoulli distribution. In the general case, you can think of it as the success/fail or go/no go random variable.
This is a generalization of Bernoulli distribution. Suppose we are having multiple Bernoulli random variables and we are interested at the number of the successful ones. We can assign this random experiment to a binomial random variable. An example of this would be the probability that a coin would yield exactly 4 tails out of 5 tosses, regardless of the order. That’s it, just lots of Bernoulli trials. We can say that it generalizes the number of Bernoulli trials.
Multivariate Bernoulli distribution
Similar to the Binomial distribution, this one cares about the n independent Bernoulli trials. So, multiple Bernoulli random variables again.
This is also a generalization of Bernoulli distribution. The difference from the Binomial distribution is that it generalizes the number of outcomes (not the number of trials which was the Binomial distribution case). If we just replace the coin toss (0 and 1) example with a dice roll (1 to 6), we’ve got a Multivariate Bernoulli distribution. It is also called multinoulli or categorical distribution.
Yet another generalization of Bernoulli distribution in both trials and outcomes. It is more powerful, as we can find the probability that rolling a dice three times we are going to get one 5 and 6 twice. Practically this is the one we used in our previous part.
Multivariate Gaussian Distribution
It is the generalization of the univariate Gaussian distribution (duh). The key difference here is that all the previous distributions were discrete but this one is a continuous distribution - which is going to allow for real values (e.g. average length of hair in a dog breed)! Bonus: It is commonly used for Anomaly Detection.
The above distributions are going to help you understand subtleties in the different variants of Naive Bayes. For example a common misconception is that the Multivariate NB is a Multinomial NB with Boolean attributes (i.e. term frequency rounded to 1 if it’s larger than zero). In fact the latter is a whole new variant by itself and we are going to describe both shortly.
The different naive Bayes classifiers differ mainly by the assumptions they make regarding the distribution of \(P(Xi|Y)\). Let’s see what that means practically.
Naive Bayes Variants
Multinomial Naive Bayes
This is the variant we used in part 1. It is based on the multinomial distribution which is a generalization of Bernoulli distribution in two dimensions: trials and outcomes. Key point to remember: to get the likelihood of a term in a class, divide the number of occurrences of a word by the occurrences of all the words in the class.
- This variant is powerful as it entails a lot of information.
- In general performs better than the multivariate Bernoulli NB in text classification 
- Likelihood calculated as \(occurrences\ of\ term\ i\ in\ class\ j\ \over occurrences\ of\ all\ terms\ in\ class\ j\)
Binarized (or Boolean) Multinomial Naive Bayes
Word occurrence may matter more than word frequency in some cases (e.g. in sentiment analysis, it suffices for a word to occur once in order to register an emotion, arguably even if you want to track down the intensity: “I feel happy happy happy” and “I feel happy” would both be classified as joy). The only difference of this variant with the Multinomial NB is that first you remove all duplicates. Then you just apply the multinomial NB.
- Remove duplicates and it is exactly the same with multinomial NB.
- It might even outperform multinomial NB with term frequencies!
- Useful when multiple occurrences don’t contribute to meaning e.g. sentiment analysis
Multivariate Bernoulli Naive Bayes
The key difference with the binarized multinomial NB is the denominator. In this case we are interested at the number of documents instead of the number of tokens. Therefore the nominator is the same with the binarized multinomial NB’s but we divide with the number of documents of a class to get the likelihood. It explicitly penalizes the non-occurrence of a feature (multinomial NB ignores non-occurrences). It may assign an entire book to the class China because of a single occurrence of the term China, as mentioned in this remarkable text.
- Likelihood calculation as \(distinct\ occurrences\ of\ term\ i\ in\ class\ j \over documents\ in\ class\ j\)
Multivariate Gaussian Naive Bayes
As attributes here we apply the term frequency (as in multinomial NB) divided by the number of tokens in a document. Apart from the indepence assumption we also assume that the attributes follow normal distributions per class.
Flexible Naive Bayes
It is also common to use a Gaussian distribution with Naive Bayes. As mentioned earlier, this is the first continuous distribution we are seeing, all the previous ones being discrete. Instead of categorical data which was the case with multivariate Bernoulli distribution, here we are having real values such as 0.349 (continuous attributres). Flexible Naive Bayes simply takes this a bit further. The distribution of each attribute in each category can be taken to be the average of several normal distributions, one for every different value the attribute has in the training data of that category .
Complement Naive Bayes
Exactly what a regular Naive Bayes (any variant) would do but instead of calculating the likelihood of a word occuring in a class, we calculate the likelihood that it occurs in other classes. It is particularly helpful in unbalanced training sets (i.e. having significantly more training examples in certain classes).
The scaffold of this post was inspired by the excellent paper which can be found at the references below.