THE NAIVE BAYES GUIDE
What is Naive Bayes?
Section 1: Defining the Model

What is the Algorithm?
Naive Bayes (NB) is a supervised machine learning algorithm. Its purpose is to predict the class of a query sample by relying on labeled input data that are separated into classes. The "naive" in the name comes from the algorithm's assumption that the features are independent of one another, and "Bayes" comes from its use of a statistical classification technique called Bayes' Theorem.
How Does the Algorithm Work?
Step 1: Calculate the Prior Probability for the given class labels in the training data.
Step 2: Obtain the Likelihood Probability of each feature attribute for each class.
Step 3: Calculate the Posterior Probability using Bayes' Theorem.

Bayes' Theorem: P(A|B) = [P(B|A) x P(A)] / P(B)

- P(A|B) — the probability of event A occurring, given event B has occurred [Posterior Probability]
- P(B|A) — the probability of event B occurring, given event A has occurred [Likelihood Probability]
- P(A) — the probability of event A [Prior Probability of A]
- P(B) — the probability of event B [Prior Probability of B]
Step 4: Return the class label with the higher Posterior Probability → this is the prediction for the query sample!
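To make the four steps concrete, here is a minimal sketch using scikit-learn's GaussianNB. The tiny arrays are made-up placeholder data, not the mock dataframe used later in this guide:

```python
# Minimal sketch of the four steps with scikit-learn's GaussianNB.
# The toy arrays below are placeholders purely for illustration.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X_train = np.array([[1.0, 2.0], [1.2, 1.8], [7.0, 8.0], [6.8, 8.3]])  # labeled features
y_train = np.array(["A", "A", "B", "B"])                              # class labels

model = GaussianNB()
model.fit(X_train, y_train)        # Steps 1-2: learn priors and per-feature Gaussians

query = np.array([[6.5, 8.1]])
print(model.predict(query))        # Steps 3-4: compare posteriors -> predicted class
print(model.predict_proba(query))  # normalized posterior probability per class
```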
Example of the Algorithm
Let’s rewrite Bayes' Theorem in terms of the Naive Bayes (Gaussian) equation…

P(Class|Data) = [P(Data|Class) x P(Class)] / P(Data)

- P(Class) represents the prior probability of the class (y output).
- P(Data) represents the prior probability of the predictor (X features).
- P(Data|Class) represents the likelihood probability of the predictor given the class.
- P(Class|Data) represents the posterior probability of the class given the predictor.
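Because P(Data) is the same for every class, classification only needs the numerator of this equation. Here is a tiny sketch of that idea (the numbers are made up for illustration):

```python
# The (unnormalized) posterior for one class is its prior times the product of
# its per-feature likelihoods; P(Data) only rescales every class equally.
import numpy as np

def unnormalized_posterior(prior, likelihoods):
    """P(Class|Data) ∝ P(Class) * product of P(feature_i | Class)."""
    return prior * np.prod(likelihoods)

# Made-up numbers, purely for illustration.
print(unnormalized_posterior(0.5, [0.2, 0.7, 0.1]))  # 0.007
```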
Let’s start with a mock dataframe with…
- 4 columns (3 features [X: x₁, x₂, x₃] and 1 output [y — Class A or Class B])
- 10 rows (observations) where 4 of them belong to Class A and 6 of them belong to Class B
The goal of this example is to predict the class (A or B) of a query sample with input values of 11, 7, and 22 for Feature 1, Feature 2, and Feature 3, respectively.
Note: Since we are examining Gaussian Naive Bayes, plot the normal (Gaussian) distribution curve of each class for each feature using its mean (μ) and standard deviation (σ).
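Reading a likelihood off one of those curves is just evaluating the Gaussian density, built from that class's mean and standard deviation, at the query's feature value. A sketch of that step (the μ and σ below are hypothetical, not taken from the mock dataframe):

```python
# Gaussian density for one feature of one class; the likelihood of a query value
# is this density evaluated at that value. mu and sigma here are hypothetical.
from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    """Normal (Gaussian) density with mean mu and standard deviation sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# e.g. Feature 2 = 7 for some class, assuming mu = 7.5 and sigma = 0.6
print(gaussian_pdf(7, mu=7.5, sigma=0.6))
```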

Let’s calculate the prior probability of P(Class) using the count of each class…
- P(Class=A) → [4 /(4+6)] = 0.40
- P(Class=B) → [6 /(6+4)] = 0.60
Let’s consider the prior probability of P(Data)…
- P(Data) is not calculated in this example: it is the same for every class, so it only rescales the posteriors and does not change which class has the higher value
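In code, the priors fall straight out of the class counts in the mock dataframe:

```python
# Priors from the class counts: 4 observations in Class A, 6 in Class B.
from collections import Counter

y = ["A"] * 4 + ["B"] * 6                  # the 10 observed class labels
priors = {c: n / len(y) for c, n in Counter(y).items()}
print(priors)                              # {'A': 0.4, 'B': 0.6}
```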
Let’s calculate the likelihood probability of P(Data|Class) by reading each query value off the normal distribution curves above, one per feature for each class…
- L(Feature 1=11|Class=A) → closer to 0
- L(Feature 2=7|Class=A) → 0.65
- L(Feature 3=22|Class=A) → 0.05
- L(Feature 1=11|Class=B) → 0.35
- L(Feature 2=7|Class=B) → 0.20
- L(Feature 3=22|Class=B) → closer to 0
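Multiplying these raw likelihoods per class shows why the log trick used below helps: the near-zero factors push the products toward numerical underflow. (The "closer to 0" values are kept as tiny placeholders chosen to match the log terms used later in this example.)

```python
# Raw likelihood products per class; with more features or even smaller
# densities these products can underflow to 0.0, which motivates working in logs.
import math

lik_A = [math.exp(-101.71), 0.65, 0.05]   # Features 1, 2, 3 given Class A
lik_B = [0.35, 0.20, math.exp(-94.84)]    # Features 1, 2, 3 given Class B

print(math.prod(lik_A))   # vanishingly small
print(math.prod(lik_B))   # also extremely small
```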
Let’s calculate the posterior probability of P(Class|Data) using the Naive Bayes Equation…
- P(Class=A|Data)
= P(Class=A) x (L(Feature 1=11|Class=A) x L(Feature 2=7|Class=A) x L(Feature 3=22|Class=A))
= [0.40 x ((closer to 0) x 0.65 x 0.05)]
Note: If the probabilities being multiplied are very small (close to 0), take the logₑ() of the calculation to avoid underflow.
logₑ(P(Class=A|Data)) = logₑ[0.40 x ((closer to 0) x 0.65 x 0.05)]
= logₑ(0.40) + logₑ(closer to 0) + logₑ(0.65) + logₑ(0.05)
= -0.92 + -101.71 + -0.43 + -3.00 = -106.06
- P(Class=B|Data)
= P(Class=B) x (L(Feature 1=11|Class=B) x L(Feature 2=7|Class=B) x L(Feature 3=22|Class=B))
= [0.60 x (0.35 x 0.20 x (closer to 0))]
Similarly, take the logₑ() of the calculation to avoid underflow.
logₑ(P(Class=B|Data)) = logₑ[0.60 x (0.35 x 0.20 x (closer to 0))]
= logₑ(0.60) + logₑ(0.35) + logₑ(0.20) + logₑ(closer to 0)
= -0.51 + -1.05 + -1.61 + -94.84 = -98.01
Since Class B has the higher log posterior probability (-98.01) compared to the log posterior probability of Class A (-106.06), the query sample is predicted to belong to Class B!
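The whole comparison, using the example's priors and likelihood values, fits in a few lines:

```python
# Sum the log prior and log likelihoods per class, then pick the larger total.
import numpy as np

priors = {"A": 0.40, "B": 0.60}
likelihoods = {
    "A": [np.exp(-101.71), 0.65, 0.05],   # "closer to 0" kept as e^-101.71
    "B": [0.35, 0.20, np.exp(-94.84)],    # "closer to 0" kept as e^-94.84
}

log_posteriors = {
    c: np.log(priors[c]) + np.sum(np.log(likelihoods[c])) for c in priors
}
print(log_posteriors)                               # roughly {'A': -106.06, 'B': -98.01}
print(max(log_posteriors, key=log_posteriors.get))  # 'B'
```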