Bias-Variance Tradeoff

Cornell CS 3/5780 · Spring 2026


1. Setting

  • Training data \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}\) drawn i.i.d. from \(P(X,Y)\)
  • Regression: \(y \in \mathbb{R}\) with squared loss
  • Today's question is about generalization: after training on \(D\), what is my expected test error?
  • Definition: The expected label given \(\mathbf{x} \in \mathbb{R}^d\) (recall Bayes optimal prediction) $$ \bar{y}(\mathbf{x}) = E_{y|\mathbf{x}}[Y] = \int\limits_y y \, \Pr(y|\mathbf{x}) \partial y $$
  • Question: is \(\bar{y}(\mathbf{x})\) a perfect prediction? When is it better/worse?
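
To ground the question above: even the Bayes optimal prediction \(\bar{y}(\mathbf{x})\) is not perfect, because the label itself is random. A minimal numerical sketch, assuming a toy distribution (names and constants here are illustrative, not from the lecture) where \(y = \sin(x) + \varepsilon\), so that \(\bar{y}(x) = \sin(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy distribution (an assumption for illustration): x ~ Uniform[0, 3],
# y = sin(x) + eps with eps ~ N(0, 0.3^2), so ybar(x) = E[y | x] = sin(x).
n = 100_000
x = rng.uniform(0, 3, size=n)
y = np.sin(x) + rng.normal(0, 0.3, size=n)

# Squared error of the Bayes optimal prediction ybar(x) = sin(x):
err_bayes = np.mean((np.sin(x) - y) ** 2)

# Squared error of a slightly perturbed predictor:
err_other = np.mean((np.sin(x) + 0.1 - y) ** 2)

print(f"Bayes optimal error: {err_bayes:.4f}  (~ noise variance 0.09)")
print(f"Perturbed predictor: {err_other:.4f}  (strictly worse)")
```

Under squared loss, the Bayes optimal predictor's error is exactly the noise variance; any other predictor can only add to it.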


2. Expected Test Error (Given \(h\))

  • Apply machine learning algorithm \(\mathcal{A}\) to learn a hypothesis \(h\)
  • Notation: \(h_D = \mathcal{A}(D)\)
  • For a specific hypothesis \(h_D\) learned on dataset \(D\): $$ E_{(\mathbf{x},y) \sim P} \left[ (h_D(\mathbf{x}) - y)^2 \right] = \int\limits_x  \int\limits_y (h_D(\mathbf{x}) - y)^2 \Pr(\mathbf{x},y) \partial y \partial \mathbf{x} $$
  • This measures: How well does this particular hypothesis generalize?
  • Key observation: \(h_D\) is a random variable!
  • Different training sets \(D\) lead to different hypotheses
  • The hypothesis depends on which data points were sampled
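
To see that \(h_D\) really is a random variable, a quick sketch (reusing the hypothetical sin-plus-noise distribution above): fitting a least-squares line on three independently drawn training sets yields a different hypothesis each time.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_dataset(n=20):
    # Same hypothetical distribution as above: y = sin(x) + noise.
    x = rng.uniform(0, 3, size=n)
    y = np.sin(x) + rng.normal(0, 0.3, size=n)
    return x, y

# A = ordinary least squares line fit; h_D is the fitted (slope, intercept).
for trial in range(3):
    x, y = draw_dataset()
    slope, intercept = np.polyfit(x, y, deg=1)
    print(f"D_{trial}: h_D(x) = {slope:+.3f} * x {intercept:+.3f}")
```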

3. Expected Test Error (Given \(\mathcal{A}\))

  • Taking expectation over both test data and training data: $$ E_{\substack{(\mathbf{x},y) \sim P\\ D \sim P^n}} \left[(h_D(\mathbf{x}) - y)^2\right] = \int_D \int_{\mathbf{x}} \int_y (h_D(\mathbf{x}) - y)^2 \Pr(\mathbf{x},y) \Pr(D) \, \partial y \, \partial \mathbf{x} \, \partial D $$
  • Evaluates the quality of algorithm \(\mathcal{A}\) given the distribution \(P(X,Y)\)
  • Note: \(D\) = training points and \((\mathbf{x}, y)\) = test point
  • It is also useful to compute the average hypothesis over all possible training sets: $$ \bar{h}(\mathbf{x}) = E_{D \sim P^n}[h_D(\mathbf{x})] = \int\limits_D h_D(\mathbf{x}) \Pr(D)  \partial D $$
  • "Average predictor" across all possible training datasets (weighted average with weight = probability)
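
The integral over datasets is rarely available in closed form, but \(\bar{h}\) can be approximated by Monte Carlo: draw many training sets, fit a hypothesis on each, and average their predictions. A sketch, again in the illustrative toy setup:

```python
import numpy as np

rng = np.random.default_rng(2)
x_test = np.linspace(0, 3, 5)

# Approximate hbar(x) = E_D[h_D(x)] by averaging the predictions of many
# hypotheses, each trained on a fresh dataset (Monte Carlo over D ~ P^n).
preds = []
for _ in range(2000):
    x = rng.uniform(0, 3, size=20)
    y = np.sin(x) + rng.normal(0, 0.3, size=20)
    coeffs = np.polyfit(x, y, deg=1)          # h_D: least-squares line
    preds.append(np.polyval(coeffs, x_test))  # h_D evaluated at test points

h_bar = np.mean(preds, axis=0)  # "average predictor" at each test point
print(np.round(h_bar, 3))
```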


4. Decomposition part 1

$$\textbf{Goal:     }E_{\mathbf{x}, y, D} \left[ \left( h_{D}(\mathbf{x}) - y \right)^{2} \right] = \underbrace{E_{\mathbf{x}, D} \left[ \left(h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x}) \right)^{2} \right]}_{\text{Variance}} + E_{\mathbf{x}, y}\left[ \left( \bar{h}(\mathbf{x}) - y \right)^{2} \right]$$


Fill in the steps below: add and subtract, then expand, then simplify the cross term

$$E_{\mathbf{x},y,D}\left[(h_D(\mathbf{x}) - y)^2\right] = E_{\mathbf{x},y,D}\left[\left[(h_D(\mathbf{x}) - \qquad) + ( \qquad - y)\right]^2\right] $$


$$ = E_{\mathbf{x},D}\left[(h_D(\mathbf{x}) - \qquad)^2\right] + 2 E_{\mathbf{x},y,D}\left[(h_D(\mathbf{x}) - \qquad)( \qquad - y)\right] + E_{\mathbf{x},y}\left[( \qquad - y)^2\right] $$


$$\begin{aligned} E_{\mathbf{x}, y, D} \left[\left(h_{D}(\mathbf{x}) - \qquad\right) \left(\qquad - y\right)\right] &= E_{\underline{\qquad}} \left[E_{\underline{\qquad}} \left[ h_{D}(\mathbf{x}) - \qquad\right] \left(\qquad - y\right) \right] \\ &= E_{\underline{\qquad}} \left[ \left( E_{\underline{\qquad}} \left[ h_{D}(\mathbf{x}) \right] - \qquad \right) \left(\qquad - y \right)\right] \\ &= E_{\underline{\qquad}} \left[ \left(\qquad - \qquad\right) \left(\qquad - y \right)\right]\ \\ &= 0 \end{aligned}$$
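
For reference, one consistent way to complete the cross-term argument (each blank is \(\bar{h}(\mathbf{x})\); the key fact is that \(\bar{h}(\mathbf{x}) - y\) does not depend on \(D\), so the expectation over \(D\) acts only on the first factor):

$$\begin{aligned} E_{\mathbf{x}, y, D} \left[\left(h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x})\right) \left(\bar{h}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x}, y} \left[E_{D} \left[ h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x})\right] \left(\bar{h}(\mathbf{x}) - y\right) \right] \\ &= E_{\mathbf{x}, y} \left[ \left( E_{D} \left[ h_{D}(\mathbf{x}) \right] - \bar{h}(\mathbf{x}) \right) \left(\bar{h}(\mathbf{x}) - y \right)\right] \\ &= E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - \bar{h}(\mathbf{x})\right) \left(\bar{h}(\mathbf{x}) - y \right)\right] \\ &= 0 \end{aligned}$$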

5. Decomposition part 2

$$\textbf{Goal:     } E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - y \right)^{2}\right] =  \underbrace{E_{\mathbf{x}, y} \left[\left(\bar{y}(\mathbf{x}) - y\right)^{2}\right]}_{\text{Noise}} + \underbrace{E_{\mathbf{x}} \left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^{2}\right]}_{\text{Bias}^2} $$


Fill in the steps below: add and subtract, then expand, then simplify the cross term

$$\begin{aligned} &E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - y \right)^{2}\right] = E_{\mathbf{x}, y} \left[\left[(\bar{h}(\mathbf{x}) - \qquad) + (\qquad - y)\right]^{2}\right] \\ &=E_{\mathbf{x}, y} \left[\left(\qquad - y\right)^{2}\right] + E_{\mathbf{x}} \left[\left(\bar{h}(\mathbf{x}) - \qquad\right)^{2}\right] + 2 E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - \qquad\right)\left(\qquad - y\right)\right] \end{aligned}$$


$$\begin{aligned}E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - \qquad\right)\left(\qquad - y\right)\right] &=  \\ &=  \\ &= 0 \end{aligned}$$
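
Analogously, each blank here is \(\bar{y}(\mathbf{x})\); conditioning on \(\mathbf{x}\) lets the inner expectation over \(y\) act only on the second factor, and \(E_{y|\mathbf{x}}[y] = \bar{y}(\mathbf{x})\) by definition:

$$\begin{aligned} E_{\mathbf{x}, y} \left[ \left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - y\right)\right] &= E_{\mathbf{x}} \left[ \left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right) E_{y|\mathbf{x}}\left[\bar{y}(\mathbf{x}) - y\right]\right] \\ &= E_{\mathbf{x}} \left[ \left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\left(\bar{y}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)\right] \\ &= 0 \end{aligned}$$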


6. Bias-Variance Decomposition

$$\underbrace{E_{\mathbf{x}, y, D} \left[\left(h_{D}(\mathbf{x}) - y\right)^{2}\right]}_{\text{Expected Test Error}} = \underbrace{E_{\mathbf{x}, D}\left[\left(h_{D}(\mathbf{x}) - \bar{h}(\mathbf{x})\right)^{2}\right]}_{\text{Variance}} + \underbrace{E_{\mathbf{x}, y}\left[\left(\bar{y}(\mathbf{x}) - y\right)^{2}\right]}_{\text{Noise}} + \underbrace{E_{\mathbf{x}}\left[\left(\bar{h}(\mathbf{x}) - \bar{y}(\mathbf{x})\right)^{2}\right]}_{\text{Bias}^2}$$

  1. Variance: How much does \(h_D\) vary across different training sets?
    • Model "overfits" to particular training examples
    • Cause: Model is too complex relative to amount of training data
  2. Bias: How far is the average hypothesis from the true expected label?
    • Model "underfits" the data, model class is not expressive enough
    • Cause: Model is too simple to capture the true pattern
  3. Noise: Inherent unpredictability in the labels
    • Performance of the Bayes optimal hypothesis
    • Cause: inherent uncertainty and/or uninformative features
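
A minimal Monte Carlo check of the decomposition in the illustrative sin-plus-noise setup from earlier (constants and sample sizes are arbitrary choices): estimate each of the four quantities by sampling, and confirm that variance, noise, and squared bias approximately add up to the expected test error.

```python
import numpy as np

rng = np.random.default_rng(3)
SIGMA, N_TRAIN, N_SETS = 0.3, 20, 2000

def f(x):               # true expected label ybar(x), known by construction
    return np.sin(x)

x_test = rng.uniform(0, 3, size=5000)
y_test = f(x_test) + rng.normal(0, SIGMA, size=x_test.shape)

# Fit h_D on many independent training sets; record predictions at x_test.
preds = np.empty((N_SETS, x_test.size))
for i in range(N_SETS):
    x = rng.uniform(0, 3, size=N_TRAIN)
    y = f(x) + rng.normal(0, SIGMA, size=N_TRAIN)
    preds[i] = np.polyval(np.polyfit(x, y, deg=1), x_test)

h_bar = preds.mean(axis=0)                         # hbar(x) at test points

test_error = np.mean((preds - y_test) ** 2)        # E[(h_D(x) - y)^2]
variance   = np.mean((preds - h_bar) ** 2)         # E[(h_D(x) - hbar(x))^2]
noise      = np.mean((f(x_test) - y_test) ** 2)    # E[(ybar(x) - y)^2]
bias_sq    = np.mean((h_bar - f(x_test)) ** 2)     # E[(hbar(x) - ybar(x))^2]

print(f"expected test error        {test_error:.4f}")
print(f"variance + noise + bias^2  {variance + noise + bias_sq:.4f}")
```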

7. Bias-Variance Tradeoffs

Insight: tune the model complexity to trade off Variance and Bias

Diagnosis: where does poor performance come from? (Here \(\epsilon\) denotes the error we are willing to tolerate.)

  1. High Variance (Regime 1) $$\text{train err} < \epsilon < \text{test err}$$ indicates overfitting
  2. High Bias (Regime 2) $$\epsilon < \text{train err} \approx \text{test err}$$ indicates underfitting
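
A sketch of this diagnosis in the same toy setup (the polynomial degrees are arbitrary illustrations): a degree-12 fit on 20 points lands in Regime 1, while a constant fit lands in Regime 2.

```python
import numpy as np

rng = np.random.default_rng(4)

x_tr = rng.uniform(0, 3, size=20)
y_tr = np.sin(x_tr) + rng.normal(0, 0.3, size=20)
x_te = rng.uniform(0, 3, size=5000)
y_te = np.sin(x_te) + rng.normal(0, 0.3, size=5000)

for deg, regime in [(12, "Regime 1: high variance (overfit)"),
                    (0,  "Regime 2: high bias (underfit)")]:
    coeffs = np.polyfit(x_tr, y_tr, deg=deg)
    tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)  # train error
    te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)  # test error
    print(f"degree {deg:2d}: train {tr:.3f}  test {te:.3f}  -> {regime}")
```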

8. Summary

Bias-Variance Decomposition: $$ \text{Expected Test Error} = \text{Bias}^2 + \text{Variance} + \text{Noise} $$

  • Bias: Error from wrong model assumptions
  • Variance: Error from sensitivity to training data
  • Noise: Irreducible error from label randomness
  • Tradeoff: Complex models have low bias but high variance
  • Goal: Find the sweet spot that minimizes total error

By Sarah Dean