Naive Bayes

Cornell CS 3/5780 · Spring 2026


1. Estimating Distributions


  • Training data: \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}\) drawn i.i.d. from \(P(X,Y)\)
  • MLE estimate of joint distribution by counting occurrences: maximizing the likelihood \(P(D) = \prod_{i=1}^n P(\mathbf{x}_i, y_i)\) yields $$ \hat{P}(\mathbf{x}, y) = \frac{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n} $$
  • Conditional distribution: For classification, estimate \(P(Y|X)\) directly
  • Estimating each distribution by counting:
    • \(\hat{P}(y) = \frac{1}{n}\sum_{i=1}^n I(y_i = y)\)
    • \(\hat{P}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})\)
    • \(\hat{P}(\mathbf{x},y) = \frac{1}{n}\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})I( y_i = y)\)
  • Conditional probability (see the counting sketch below): $$ \hat{P}(y|\mathbf{x}) = \frac{\hat{P}(y, \mathbf{x})}{\hat{P}(\mathbf{x})} = \frac{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x} )I( y_i = y)}{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})} $$
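
A minimal sketch of these counting estimates, assuming a small categorical dataset (the data below is made up for illustration):

```python
import numpy as np

# Toy dataset: each row of X is a feature vector x_i, y holds the labels y_i.
X = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 1])

def p_hat_y_given_x(X, y, x_query, y_query):
    """P-hat(y|x) = sum_i I(x_i = x and y_i = y) / sum_i I(x_i = x)."""
    match_x = np.all(X == x_query, axis=1)      # I(x_i = x)
    match_xy = match_x & (y == y_query)         # I(x_i = x) I(y_i = y)
    if match_x.sum() == 0:
        raise ValueError("no training point matches x exactly")
    return match_xy.sum() / match_x.sum()

print(p_hat_y_given_x(X, y, np.array([1, 0]), 1))   # 2/3 on this toy data
```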

2. Curse of dimensionality

What is \(P(y \mid \mathbf x)\) for \(\mathbf x = \) [Yes, Game of Thrones, No, Souvlaki House, 9]?

  • MLE estimate: \(\hat{P}(y|\mathbf{x}) = \frac{|C|}{|B|}\), where \(|B|\) is the number of training points with \(\mathbf{x}_i = \mathbf{x}\) and \(|C|\) is the number with \(\mathbf{x}_i = \mathbf{x}\) and \(y_i = y\)
  • Problem: This requires many training points with features identical to \(\mathbf{x}\), which basically never happens in high dimensions or in continuous spaces
    • As the dimension grows, \(|B| \to 0\) (and \(|C| \to 0\)), so the estimate becomes unreliable (or undefined when the denominator is zero)
  • Idea: Use Bayes rule to flip the problem into a "generative" approach $$ P(y|\mathbf{x}) \propto P(\mathbf{x}|y)P(y) $$
    • \(P(y)\) is easy to estimate by counting classes (like coin tossing)
    • \(P(\mathbf{x}|y)\) groups data by class, but is still high-dimensional

3. Naive Bayes

Running example: \(\mathbf x=\) electives taken, \(y=\) major

  • Naive Bayes Assumption: Feature values are independent given the label $$ P(\mathbf{x}|y) = \prod_{\alpha=1}^d P( x_\alpha|y) $$
  • Conditional independence assumption means we only need to estimate \(P(x_\alpha|y)\) for each dimension \(\alpha\) independently!
  • Naive Bayes Classifier: $$ \begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_y P(y|\mathbf{x}) \\ &= \operatorname*{argmax}_y \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} \\ &= \operatorname*{argmax}_y P(\mathbf{x}|y)P(y) \\ &= \operatorname*{argmax}_y \textstyle \prod_{\alpha=1}^d P(x_\alpha|y) P(y) \\ &= \operatorname*{argmax}_y \textstyle \sum_{\alpha=1}^d \log(P(x_\alpha|y)) + \log(P(y)) \end{align} $$
  • Question: Explain each step of the above derivation (a log-space prediction sketch follows below)
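
A minimal sketch of the last line of the derivation (prediction as an argmax of summed log-probabilities), assuming the per-feature estimates \(\hat P(x_\alpha \mid y)\) and the class priors have already been looked up for a specific query \(\mathbf x\); the numbers are purely illustrative:

```python
import numpy as np

# Illustrative values for 2 classes and d = 3 features:
# log_px_given_y[c, a] = log P-hat(x_a | y = c), evaluated at the observed x_a.
log_px_given_y = np.log(np.array([[0.9, 0.2, 0.5],
                                  [0.3, 0.7, 0.5]]))
log_py = np.log(np.array([0.6, 0.4]))   # log P-hat(y = c)

# h(x) = argmax_c  sum_a log P-hat(x_a | c) + log P-hat(c)
scores = log_px_given_y.sum(axis=1) + log_py
print(scores.argmax())                  # predicted class index
```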

4. Categorical Features

Example: P(ML=Y | Major=CS) \(\approx [\hat \theta_{Y,CS}]_{ML} = \frac{9}{10}\), and P(Major=CS) \(\approx \hat \pi_{CS} = \frac{10}{15}\)

  • Features: \(x_\alpha \in \{f_1, f_2, \ldots, f_{K_\alpha}\}\) (example: demographic data)
  • Model: Each feature follows a categorical distribution $$ P(x_\alpha = j | y = c) = [\theta_{jc}]_\alpha\quad\text{where}\quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_\alpha = 1 $$
  • Generative story: For each class, we roll \(d\) dice (one per feature)
  • MLE estimate: $$ [\hat{\theta}_{jc}]_\alpha = \frac{\sum_{i=1}^n I(y_i = c) I(x_{i\alpha} = j) }{\sum_{i=1}^n I(y_i = c) } $$
  • MAP estimate: add smoothing counts \(l_j \) to the numerator and \(\sum_{j=1}^{K_\alpha}l_j \) to the denominator, where e.g. \(l_j=1\) corresponds to Laplace (add-one) smoothing
  • Prediction (see the sketch below): for \(\hat{\pi}_c= {\sum_{i=1}^n I( y_i = c)}/{n}\), we predict $$\operatorname*{argmax}_c \hat{\pi}_c \prod_{\alpha=1}^d [\hat{\theta}_{x_\alpha c}]_\alpha $$
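
A sketch of the categorical model, assuming features coded as integers \(0, \ldots, K_\alpha - 1\) and using Laplace smoothing \(l_j = 1\); the dataset and class/feature sizes below are invented for illustration:

```python
import numpy as np

def fit_categorical_nb(X, y, K, n_classes, l=1.0):
    """theta[a][j, c] estimates P(x_a = j | y = c) with add-l smoothing; pi[c] estimates P(y = c)."""
    n, d = X.shape
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    theta = []
    for a in range(d):
        t = np.zeros((K[a], n_classes))
        for c in range(n_classes):
            counts = np.bincount(X[y == c, a], minlength=K[a]) + l   # smoothed counts
            t[:, c] = counts / counts.sum()
        theta.append(t)
    return pi, theta

def predict(x, pi, theta):
    # argmax_c  log pi_c + sum_a log theta[a][x_a, c]
    log_post = np.log(pi) + sum(np.log(t[x[a]]) for a, t in enumerate(theta))
    return int(np.argmax(log_post))

# Invented data: 2 features taking 2 and 3 possible values, 2 classes.
X = np.array([[0, 2], [0, 1], [1, 0], [1, 2], [0, 0]])
y = np.array([0, 0, 1, 1, 0])
pi, theta = fit_categorical_nb(X, y, K=[2, 3], n_classes=2)
print(predict(np.array([0, 2]), pi, theta))
```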

5. Multinomial Features

  • Features: counts \(x_\alpha \in \{0, 1, \ldots, m\}\) where \(m = \sum_{\alpha=1}^d x_\alpha\) and a higher count \(\implies\) a stronger signal
  • Example: Baby name classification, where \(x_\alpha\) is the count of letter \(\alpha\) in the name, \(m\) is the total letter count, \(d=26\) is the vocabulary size, and \(y\) is gender
  • Example: Document classification, \(x_\alpha\) is count of word \(\alpha\) in document, \(m\) is total word count and \(d\) is vocabulary size
  • Model: multinomial distribution $$ P(\mathbf{x}| y=c) \propto  \prod_{\alpha=1}^d (\theta_{\alpha c})^{x_\alpha}, \quad \text{where}\quad \sum_{\alpha=1}^d \theta_{\alpha c} = 1$$
  • Parameter estimation: MLE/MAP on multinomial distribution $$ \hat{\theta}_{\alpha c} = \frac{\sum_{i=1}^n I(y_i = c) x_{i\alpha} + l_\alpha}{\sum_{i=1}^n I(y_i = c) m_i + \sum_{\alpha=1}^d l_\alpha} $$
  • Prediction (see the sketch below): for \(\hat{\pi}_c= {\sum_{i=1}^n I( y_i = c)}/{n}\), we predict $$ \operatorname*{argmax}_c \hat{\pi}_c \prod_{\alpha=1}^d \hat{\theta}_{\alpha c}^{x_\alpha} $$
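
A sketch of the multinomial case for word-count features, with add-one smoothing \(l_\alpha = 1\); the toy count matrix is invented:

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes, l=1.0):
    """theta[a, c] estimates P(word a | class c); pi[c] estimates P(y = c)."""
    n, d = X.shape
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    theta = np.zeros((d, n_classes))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0) + l     # sum_i I(y_i = c) x_{i,a} + l_a
        theta[:, c] = counts / counts.sum()    # normalize over the vocabulary
    return pi, theta

def predict(x, pi, theta):
    # argmax_c  log pi_c + sum_a x_a log theta[a, c]
    return int(np.argmax(np.log(pi) + x @ np.log(theta)))

# Invented word counts over a vocabulary of d = 4 words.
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])
pi, theta = fit_multinomial_nb(X, y, n_classes=2)
print(predict(np.array([1, 0, 0, 2]), pi, theta))
```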

6. Gaussian Naive Bayes


  • Features: \(x_\alpha \in \mathbb{R}\) (continuous real values)
  • Model: Each feature follows a Gaussian distribution $$ P(x_\alpha|y=c) = \mathcal{N}(\mu_{\alpha c}, \sigma^2_{\alpha c}) = \frac{1}{\sqrt{2\pi}\sigma_{\alpha c}} \exp\left({-\frac{1}{2}\left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2}\right) $$
  • Full distribution: \(P(\mathbf{x}|y) \sim \mathcal{N}(\boldsymbol{\mu}_y, \Sigma_y)\), where \(\Sigma_y\) is diagonal (independence assumption) with values \(\sigma^2_{\alpha,y}\)
  • Mean estimation: for \(n_c = \sum_{i=1}^n I(y_i = c)\) $$ \hat\mu_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^n I(y_i = c) x_{i\alpha} $$
  • Variance estimation: \(\displaystyle \hat\sigma^2_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^n I(y_i = c)(x_{i\alpha} - \hat\mu_{\alpha c})^2 \) (see the Gaussian sketch below)
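
A sketch of Gaussian Naive Bayes with the estimates above (per-class, per-feature means and MLE variances); the data is invented and assumes each class has enough spread that no variance is zero:

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Per-class priors, means, and (diagonal) variances."""
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])   # mu[c, a]
    var = np.array([X[y == c].var(axis=0) for c in range(n_classes)])   # sigma^2[c, a] (MLE, 1/n_c)
    return pi, mu, var

def predict(x, pi, mu, var):
    # sum over features of log N(x_a; mu_{a c}, sigma^2_{a c}), plus log prior
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(np.log(pi) + log_lik))

# Invented continuous features.
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]])
y = np.array([0, 0, 1, 1])
pi, mu, var = fit_gaussian_nb(X, y, n_classes=2)
print(predict(np.array([1.1, 1.9]), pi, mu, var))
```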

7. Naive Bayes is a Linear Classifier

Logistic (sigmoid) function: $$f(u) = \frac{1}{1+\exp(-u)}$$

  • For many common cases, Naive Bayes produces a linear decision boundary!

  • For multinomial features with \(y \in \{-1, +1\}\), we can derive $$P(y|\mathbf x) = \frac{1}{1+\exp(-y( \mathbf w^\top \mathbf x+b) )}$$ where the weight \(\mathbf w\) and bias \(b\) are defined in terms of \(\theta\) and \(\pi\) (verified numerically in the sketch after this list)
    • weights: \(w_\alpha = \log[\theta_{\alpha+}]-\log[\theta_{\alpha-}]\)
    • bias: \(b = \log[P(Y=+1)]- \log[P(Y=-1)]\)
  • Gaussian features with shared per-class variance and \(y \in \{-1, +1\}\)
    • Similar derivation and expression, but with \({w}_\alpha = ({\mu}_{\alpha,+} - {\mu}_{\alpha,-})/\sigma_\alpha^2\) (a scaled difference of means), as long as \(\sigma_{\alpha,+1} = \sigma_{\alpha,-1} = \sigma_\alpha\) for all \(\alpha\)
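
A quick numerical check of the multinomial case, assuming invented parameters \(\theta\) and \(\pi\): build \(\mathbf w\) and \(b\) from them and confirm that the sigmoid form matches the posterior computed directly from Bayes rule:

```python
import numpy as np

# Invented multinomial parameters for classes +1 and -1 over d = 3 words.
theta_pos = np.array([0.5, 0.3, 0.2])
theta_neg = np.array([0.2, 0.3, 0.5])
pi_pos, pi_neg = 0.6, 0.4

w = np.log(theta_pos) - np.log(theta_neg)   # w_a = log theta_{a+} - log theta_{a-}
b = np.log(pi_pos) - np.log(pi_neg)         # b = log P(Y=+1) - log P(Y=-1)

x = np.array([2, 1, 0])                     # a word-count vector

# Posterior via the linear / sigmoid form.
p_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Posterior directly from Bayes rule (the multinomial coefficient cancels).
lik_pos = pi_pos * np.prod(theta_pos ** x)
lik_neg = pi_neg * np.prod(theta_neg ** x)
p_bayes = lik_pos / (lik_pos + lik_neg)

print(p_sigmoid, p_bayes)                   # both ~0.904: the two forms agree
```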

8. Linear Classifier Proof (Multinomial)

  • Question: Explain the following steps, where we let \({w}_{\alpha}^+=\log[\theta_{\alpha+}]\)

$$ \begin{align} \log\left[P(\mathbf{x}|Y=+1)\right] &=\log\left[\prod_{\alpha=1}^d P(x_\alpha | Y=+1) \right]\\ &=\sum_{\alpha=1}^d x_\alpha\log[\theta_{\alpha+}] \\ &=\sum_{\alpha=1}^d x_\alpha w^+_\alpha=\mathbf{x}^\top \mathbf{w}_+. \end{align} $$

  • A similar argument shows \(\log\left[P(\mathbf{x}|Y=-1)\right]=\mathbf{x}^\top \mathbf{w}_-\)
  • Question: Explain the following steps

$$\begin{align} P(Y=+1\ |\ \mathbf{x}) &=\frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x})}\\&= \frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)+P(\mathbf{x} \ |\ Y= -1)P(Y=-1)}\end{align}$$
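
One hint for the second equality: the denominator expands \(P(\mathbf{x})\) using the law of total probability over the two classes,

$$ P(\mathbf{x}) = \sum_{y' \in \{-1,+1\}} P(\mathbf{x}\ |\ Y = y')\, P(Y = y'). $$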

9. Linear Classifier Proof Cont. (Multinomial)

  • Question: Explain the following steps, recalling \(\mathbf{w} = \mathbf{w}_+ - \mathbf{w}_-\) and \(b = \log[P(Y=+1)]- \log[P(Y=-1)]\)

$$\begin{align} P(Y=+1\ |\ \mathbf{x}) &=\frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)+P(\mathbf{x} \ |\ Y= -1)P(Y=-1)}\\ &=\frac{e^{\mathbf{x}^\top \mathbf{w}_+}P(Y=+1)} {e^{\mathbf{x}^\top \mathbf{w}_+}P(Y=+1)+e^{\mathbf{x}^\top \mathbf{w}_-}P(Y=-1)}\\ &=\frac{1}{1+e^{-\mathbf{x}^\top \mathbf{w}}\frac{P(Y=-1)}{P(Y=+1)} }\\ &=\frac{1}{1+e^{-(\mathbf{x}^\top \mathbf{w}+b)}} \end{align}$$
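
One hint for the last two equalities: divide the numerator and denominator by \(e^{\mathbf{x}^\top \mathbf{w}_+} P(Y=+1)\), write \(\mathbf{w} = \mathbf{w}_+ - \mathbf{w}_-\), and note that by the definition of \(b\),

$$ \frac{P(Y=-1)}{P(Y=+1)} = e^{-b}, \quad\text{so}\quad e^{-\mathbf{x}^\top \mathbf{w}}\,\frac{P(Y=-1)}{P(Y=+1)} = e^{-(\mathbf{x}^\top \mathbf{w} + b)}. $$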

  • A similar argument shows \(P(Y=-1\ |\ \mathbf{x}) = \frac{1}{1+e^{(\mathbf{x}^\top \mathbf{w}+b)}} \), and combining the two expressions gives \(P(y\ |\ \mathbf{x}) = \frac{1}{1+e^{-y(\mathbf{x}^\top \mathbf{w}+b)}}\), completing the proof.

Naive Bayes

By Sarah Dean