Naive Bayes

Cornell CS 3/5780 · Spring 2026


1. Estimating Distributions


  • Training data: \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}\) drawn i.i.d. from \(P(X,Y)\)
  • MLE estimate of joint distribution by counting occurrences: maximizing the likelihood \(P(D) = \prod_{i=1}^n P(\mathbf{x}_i, y_i)\) yields $$ \hat{P}(\mathbf{x}, y) = \frac{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x} \wedge y_i = y)}{n} $$
  • Conditional distribution: For classification, estimate \(P(Y|X)\) directly
  • Estimating each distribution by counting:
    • \(\hat{P}(y) = \frac{1}{n}\sum_{i=1}^n I(y_i = y)\)
    • \(\hat{P}(\mathbf{x}) = \frac{1}{n}\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})\)
    • \(\hat{P}(\mathbf{x},y) = \frac{1}{n}\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})I( y_i = y)\)
  • Conditional probability (see the counting sketch below): $$ \hat{P}(y|\mathbf{x}) = \frac{\hat{P}(y, \mathbf{x})}{\hat{P}(\mathbf{x})} = \frac{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x} )I( y_i = y)}{\sum_{i=1}^n I(\mathbf{x}_i = \mathbf{x})} $$
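
A minimal sketch of these counting estimates, assuming a small categorical dataset (the data below is made up for illustration):

```python
import numpy as np

# Toy dataset: each row of X is a feature vector x_i, y holds the labels y_i.
X = np.array([[1, 0], [1, 0], [0, 1], [1, 0], [0, 0]])
y = np.array([1, 1, 0, 0, 1])

def p_hat_y_given_x(X, y, x_query, y_query):
    """P-hat(y|x) = sum_i I(x_i = x and y_i = y) / sum_i I(x_i = x)."""
    match_x = np.all(X == x_query, axis=1)      # I(x_i = x)
    match_xy = match_x & (y == y_query)         # I(x_i = x) I(y_i = y)
    if match_x.sum() == 0:
        raise ValueError("no training point matches x exactly")
    return match_xy.sum() / match_x.sum()

print(p_hat_y_given_x(X, y, np.array([1, 0]), 1))   # 2/3 on this toy data
```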

2. Curse of dimensionality

What is \(P(y \mid \mathbf x)\) for \(\mathbf x = \) [Yes, Game of Thrones, No, Souvlaki House, 9]?

  • MLE estimate: \(\hat{P}(y|\mathbf{x}) = \frac{|C|}{|B|}\), where \(|B|\) is the number of training points with \(\mathbf{x}_i = \mathbf{x}\) and \(|C|\) is the number with \(\mathbf{x}_i = \mathbf{x}\) and \(y_i = y\)
  • Problem: This requires many training points with features identical to \(\mathbf{x}\), which basically never happens in high dimensions or in continuous spaces
    • As the dimension grows, \(|B| \to 0\) (and \(|C| \to 0\)), so the estimate becomes unreliable (or undefined when the denominator is zero)
  • Idea: Use Bayes rule to flip the problem into a "generative" approach $$ P(y|\mathbf{x}) \propto P(\mathbf{x}|y)P(y) $$
    • \(P(y)\) is easy to estimate by counting classes (like coin tossing)
    • \(P(\mathbf{x}|y)\) groups data by class, but is still high-dimensional

3. Naive Bayes

Running example: \(\mathbf x=\) electives taken, \(y=\) major

  • Naive Bayes Assumption: Feature values are independent given the label $$ P(\mathbf{x}|y) = \prod_{\alpha=1}^d P( x_\alpha|y) $$
  • Conditional independence assumption means we only need to estimate \(P(x_\alpha|y)\) for each dimension \(\alpha\) independently!
  • Naive Bayes Classifier: $$ \begin{align} h(\mathbf{x}) &= \operatorname*{argmax}_y P(y|\mathbf{x}) \\ &= \operatorname*{argmax}_y \frac{P(\mathbf{x}|y)P(y)}{P(\mathbf{x})} \\ &= \operatorname*{argmax}_y P(\mathbf{x}|y)P(y) \\ &= \operatorname*{argmax}_y \textstyle \prod_{\alpha=1}^d P(x_\alpha|y) P(y) \\ &= \operatorname*{argmax}_y \textstyle \sum_{\alpha=1}^d \log(P(x_\alpha|y)) + \log(P(y)) \end{align} $$
  • Question: Explain each step of the above derivation (a log-space prediction sketch follows below)
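
A minimal sketch of the last line of the derivation (prediction as an argmax of summed log-probabilities), assuming the per-feature estimates \(\hat P(x_\alpha \mid y)\) and the class priors have already been looked up for a specific query \(\mathbf x\); the numbers are purely illustrative:

```python
import numpy as np

# Illustrative values for 2 classes and d = 3 features:
# log_px_given_y[c, a] = log P-hat(x_a | y = c), evaluated at the observed x_a.
log_px_given_y = np.log(np.array([[0.9, 0.2, 0.5],
                                  [0.3, 0.7, 0.5]]))
log_py = np.log(np.array([0.6, 0.4]))   # log P-hat(y = c)

# h(x) = argmax_c  sum_a log P-hat(x_a | c) + log P-hat(c)
scores = log_px_given_y.sum(axis=1) + log_py
print(scores.argmax())                  # predicted class index
```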

4. Categorical Features

Example: P(ML=Y | Major=CS) \(\approx [\hat \theta_{Y,CS}]_{ML} = \frac{9}{10}\), and P(Major=CS) \(\approx \hat \pi_{CS} = \frac{10}{15}\)

  • Features: \(x_\alpha \in \{f_1, f_2, \ldots, f_{K_\alpha}\}\) (example: demographic data)
  • Model: Each feature follows a categorical distribution $$ P(x_\alpha = j | y = c) = [\theta_{jc}]_\alpha\quad\text{where}\quad \sum_{j=1}^{K_\alpha} [\theta_{jc}]_\alpha = 1 $$
  • Generative story: For each class, we roll \(d\) dice (one per feature)
  • MLE estimate: $$ [\hat{\theta}_{jc}]_\alpha = \frac{\sum_{i=1}^n I(y_i = c) I(x_{i\alpha} = j) }{\sum_{i=1}^n I(y_i = c) } $$
  • MAP estimate: add smoothing counts \(l_j \) to the numerator and \(\sum_{j=1}^{K_\alpha}l_j \) to the denominator, where e.g. \(l_j=1\) corresponds to Laplace (add-one) smoothing
  • Prediction (see the sketch below): for \(\hat{\pi}_c= {\sum_{i=1}^n I( y_i = c)}/{n}\), we predict $$\operatorname*{argmax}_c \hat{\pi}_c \prod_{\alpha=1}^d [\hat{\theta}_{x_\alpha c}]_\alpha $$
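
A sketch of the categorical model, assuming features coded as integers \(0, \ldots, K_\alpha - 1\) and using Laplace smoothing \(l_j = 1\); the dataset and class/feature sizes below are invented for illustration:

```python
import numpy as np

def fit_categorical_nb(X, y, K, n_classes, l=1.0):
    """theta[a][j, c] estimates P(x_a = j | y = c) with add-l smoothing; pi[c] estimates P(y = c)."""
    n, d = X.shape
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    theta = []
    for a in range(d):
        t = np.zeros((K[a], n_classes))
        for c in range(n_classes):
            counts = np.bincount(X[y == c, a], minlength=K[a]) + l   # smoothed counts
            t[:, c] = counts / counts.sum()
        theta.append(t)
    return pi, theta

def predict(x, pi, theta):
    # argmax_c  log pi_c + sum_a log theta[a][x_a, c]
    log_post = np.log(pi) + sum(np.log(t[x[a]]) for a, t in enumerate(theta))
    return int(np.argmax(log_post))

# Invented data: 2 features taking 2 and 3 possible values, 2 classes.
X = np.array([[0, 2], [0, 1], [1, 0], [1, 2], [0, 0]])
y = np.array([0, 0, 1, 1, 0])
pi, theta = fit_categorical_nb(X, y, K=[2, 3], n_classes=2)
print(predict(np.array([0, 2]), pi, theta))
```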

5. Multinomial Features

  • Features: counts \(x_\alpha \in \{0, 1, \ldots, m\}\) where \(m = \sum_{\alpha=1}^d x_\alpha\) and a higher count \(\implies\) a stronger signal
  • Example: Baby name classification, where \(x_\alpha\) is the count of letter \(\alpha\) in the name, \(m\) is the total letter count, \(d=26\) is the vocabulary size, and \(y\) is gender
  • Example: Document classification, \(x_\alpha\) is count of word \(\alpha\) in document, \(m\) is total word count and \(d\) is vocabulary size
  • Model: multinomial distribution $$ P(\mathbf{x}| y=c) \propto  \prod_{\alpha=1}^d (\theta_{\alpha c})^{x_\alpha}, \quad \text{where}\quad \sum_{\alpha=1}^d \theta_{\alpha c} = 1$$
  • Parameter estimation: MLE/MAP on multinomial distribution $$ \hat{\theta}_{\alpha c} = \frac{\sum_{i=1}^n I(y_i = c) x_{i\alpha} + l_\alpha}{\sum_{i=1}^n I(y_i = c) m_i + \sum_{\alpha=1}^d l_\alpha} $$
  • Prediction (see the sketch below): for \(\hat{\pi}_c= {\sum_{i=1}^n I( y_i = c)}/{n}\), we predict $$ \operatorname*{argmax}_c \hat{\pi}_c \prod_{\alpha=1}^d \hat{\theta}_{\alpha c}^{x_\alpha} $$
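
A sketch of the multinomial case for word-count features, with add-one smoothing \(l_\alpha = 1\); the toy count matrix is invented:

```python
import numpy as np

def fit_multinomial_nb(X, y, n_classes, l=1.0):
    """theta[a, c] estimates P(word a | class c); pi[c] estimates P(y = c)."""
    n, d = X.shape
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    theta = np.zeros((d, n_classes))
    for c in range(n_classes):
        counts = X[y == c].sum(axis=0) + l     # sum_i I(y_i = c) x_{i,a} + l_a
        theta[:, c] = counts / counts.sum()    # normalize over the vocabulary
    return pi, theta

def predict(x, pi, theta):
    # argmax_c  log pi_c + sum_a x_a log theta[a, c]
    return int(np.argmax(np.log(pi) + x @ np.log(theta)))

# Invented word counts over a vocabulary of d = 4 words.
X = np.array([[3, 0, 1, 0], [2, 1, 0, 0], [0, 2, 0, 3], [0, 1, 1, 4]])
y = np.array([0, 0, 1, 1])
pi, theta = fit_multinomial_nb(X, y, n_classes=2)
print(predict(np.array([1, 0, 0, 2]), pi, theta))
```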

6. Gaussian Naive Bayes


  • Features: \(x_\alpha \in \mathbb{R}\) (continuous real values)
  • Model: Each feature follows a Gaussian distribution $$ P(x_\alpha|y=c) = \mathcal{N}(\mu_{\alpha c}, \sigma^2_{\alpha c}) = \frac{1}{\sqrt{2\pi}\sigma_{\alpha c}} \exp\left({-\frac{1}{2}\left(\frac{x_\alpha - \mu_{\alpha c}}{\sigma_{\alpha c}}\right)^2}\right) $$
  • Full distribution: \(P(\mathbf{x}|y) \sim \mathcal{N}(\boldsymbol{\mu}_y, \Sigma_y)\), where \(\Sigma_y\) is diagonal (independence assumption) with values \(\sigma^2_{\alpha,y}\)
  • Mean estimation: for \(n_c = \sum_{i=1}^n I(y_i = c)\) $$ \hat\mu_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^n I(y_i = c) x_{i\alpha} $$
  • Variance estimation: \(\displaystyle \hat\sigma^2_{\alpha c} \leftarrow \frac{1}{n_c} \sum_{i=1}^n I(y_i = c)(x_{i\alpha} - \hat\mu_{\alpha c})^2 \) (see the Gaussian sketch below)
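
A sketch of Gaussian Naive Bayes with the estimates above (per-class, per-feature means and MLE variances); the data is invented and assumes each class has enough spread that no variance is zero:

```python
import numpy as np

def fit_gaussian_nb(X, y, n_classes):
    """Per-class priors, means, and (diagonal) variances."""
    pi = np.array([(y == c).mean() for c in range(n_classes)])
    mu = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])   # mu[c, a]
    var = np.array([X[y == c].var(axis=0) for c in range(n_classes)])   # sigma^2[c, a] (MLE, 1/n_c)
    return pi, mu, var

def predict(x, pi, mu, var):
    # sum over features of log N(x_a; mu_{a c}, sigma^2_{a c}), plus log prior
    log_lik = -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var).sum(axis=1)
    return int(np.argmax(np.log(pi) + log_lik))

# Invented continuous features.
X = np.array([[1.0, 2.0], [1.2, 1.8], [3.0, 0.5], [3.2, 0.7]])
y = np.array([0, 0, 1, 1])
pi, mu, var = fit_gaussian_nb(X, y, n_classes=2)
print(predict(np.array([1.1, 1.9]), pi, mu, var))
```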

7. Naive Bayes is a Linear Classifier

Logistic (sigmoid) function: $$f(u) = \frac{1}{1+\exp(-u)}$$

  • For many common cases, Naive Bayes produces a linear decision boundary!

  • For multinomial features with \(y \in \{-1, +1\}\), we can derive $$P(y|\mathbf x) = \frac{1}{1+\exp(-y( \mathbf w^\top \mathbf x+b) )}$$ where the weight \(\mathbf w\) and bias \(b\) are defined in terms of \(\theta\) and \(\pi\) (verified numerically in the sketch after this list)
    • weights: \(w_\alpha = \log[\theta_{\alpha+}]-\log[\theta_{\alpha-}]\)
    • bias: \(b = \log[P(Y=+1)]- \log[P(Y=-1)]\)
  • Gaussian features with shared per-class variance and \(y \in \{-1, +1\}\)
    • Similar derivation and expression, but with \({w}_\alpha = ({\mu}_{\alpha,+} - {\mu}_{\alpha,-})/\sigma_\alpha^2\) (a scaled difference of means), as long as \(\sigma_{\alpha,+1} = \sigma_{\alpha,-1} = \sigma_\alpha\) for all \(\alpha\)
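
A quick numerical check of the multinomial case, assuming invented parameters \(\theta\) and \(\pi\): build \(\mathbf w\) and \(b\) from them and confirm that the sigmoid form matches the posterior computed directly from Bayes rule:

```python
import numpy as np

# Invented multinomial parameters for classes +1 and -1 over d = 3 words.
theta_pos = np.array([0.5, 0.3, 0.2])
theta_neg = np.array([0.2, 0.3, 0.5])
pi_pos, pi_neg = 0.6, 0.4

w = np.log(theta_pos) - np.log(theta_neg)   # w_a = log theta_{a+} - log theta_{a-}
b = np.log(pi_pos) - np.log(pi_neg)         # b = log P(Y=+1) - log P(Y=-1)

x = np.array([2, 1, 0])                     # a word-count vector

# Posterior via the linear / sigmoid form.
p_sigmoid = 1.0 / (1.0 + np.exp(-(w @ x + b)))

# Posterior directly from Bayes rule (the multinomial coefficient cancels).
lik_pos = pi_pos * np.prod(theta_pos ** x)
lik_neg = pi_neg * np.prod(theta_neg ** x)
p_bayes = lik_pos / (lik_pos + lik_neg)

print(p_sigmoid, p_bayes)                   # both ~0.904: the two forms agree
```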

8. Linear Classifier Proof (Multinomial)

  • Question: Explain the following steps, where we let \({w}_{\alpha}^+=\log[\theta_{\alpha+}]\)

$$ \begin{align} \log\left[P(\mathbf{x}|Y=+1)\right] &=\log\left[\prod_{\alpha=1}^d P(x_\alpha | Y=+1) \right]\\ &=\sum_{\alpha=1}^d x_\alpha\log[\theta_{\alpha+}] \\ &=\sum_{\alpha=1}^d x_\alpha w^+_\alpha=\mathbf{x}^\top \mathbf{w}_+. \end{align} $$

  • A similar argument shows \(\log\left[P(\mathbf{x}|Y=-1)\right]=\mathbf{x}^\top \mathbf{w}_-\)
  • Question: Explain the following steps

$$\begin{align} P(Y=+1\ |\ \mathbf{x}) &=\frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x})}\\&= \frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)+P(\mathbf{x} \ |\ Y= -1)P(Y=-1)}\end{align}$$
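
One hint for the second equality: the denominator expands \(P(\mathbf{x})\) using the law of total probability over the two classes,

$$ P(\mathbf{x}) = \sum_{y' \in \{-1,+1\}} P(\mathbf{x}\ |\ Y = y')\, P(Y = y'). $$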

9. Linear Classifier Proof Cont. (Multinomial)

  • Question: Explain the following steps, recalling \(\mathbf{w} = \mathbf{w}_+ - \mathbf{w}_-\) and \(b = \log[P(Y=+1)]- \log[P(Y=-1)]\)

$$\begin{align} P(Y=+1\ |\ \mathbf{x}) &=\frac{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)}{P(\mathbf{x} \ |\ Y= +1)P(Y=+1)+P(\mathbf{x} \ |\ Y= -1)P(Y=-1)}\\ &=\frac{e^{\mathbf{x}^\top \mathbf{w}_+}P(Y=+1)} {e^{\mathbf{x}^\top \mathbf{w}_+}P(Y=+1)+e^{\mathbf{x}^\top \mathbf{w}_-}P(Y=-1)}\\ &=\frac{1}{1+e^{-\mathbf{x}^\top \mathbf{w}}\frac{P(Y=-1)}{P(Y=+1)} }\\ &=\frac{1}{1+e^{-(\mathbf{x}^\top \mathbf{w}+b)}} \end{align}$$
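
One hint for the last two equalities: divide the numerator and denominator by \(e^{\mathbf{x}^\top \mathbf{w}_+} P(Y=+1)\), write \(\mathbf{w} = \mathbf{w}_+ - \mathbf{w}_-\), and note that by the definition of \(b\),

$$ \frac{P(Y=-1)}{P(Y=+1)} = e^{-b}, \quad\text{so}\quad e^{-\mathbf{x}^\top \mathbf{w}}\,\frac{P(Y=-1)}{P(Y=+1)} = e^{-(\mathbf{x}^\top \mathbf{w} + b)}. $$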

  • A similar argument shows \(P(Y=-1\ |\ \mathbf{x}) = \frac{1}{1+e^{(\mathbf{x}^\top \mathbf{w}+b)}} \), and combining the two expressions gives \(P(y\ |\ \mathbf{x}) = \frac{1}{1+e^{-y(\mathbf{x}^\top \mathbf{w}+b)}}\), completing the proof.

Naive Bayes

By Sarah Dean