Learning Theory

Cornell CS 3/5780 · Spring 2026


0. Can You Convince Me of Your Psychic Abilities?

  • Game: I think of \(n\) bits. Can you guess them?
  • If somebody in the class guesses my bit sequence, that person clearly has telepathic abilities – right?
  • Call the students \(H = \{h_1, \ldots, h_{|H|}\}\); any non-psychic student guesses a single bit correctly with probability \(1-p\). $$P(h_i \text{ correct} \mid h_i \text{ nonpsychic}) =\qquad\qquad \qquad\qquad \qquad\qquad \qquad\qquad $$
  • How likely is it that at least one student is correct? $$P(h_1 \text{ correct} \vee \ldots \vee h_{|H|} \text{ correct} \mid \text{all nonpsychic}) =\qquad\qquad \qquad\qquad \qquad\qquad \qquad\qquad $$


     
  • How large would \(n\) need to be? Given some small \(\delta\), find \(n\) such that the probability above is less than \(\delta\): $$n > \qquad\qquad \qquad\qquad \qquad\qquad \qquad\qquad$$
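
Not part of the handout, but a quick Monte Carlo sketch may make the answer concrete. The parameters below (n_bits, n_students, and the per-bit success probability q) are illustrative assumptions, with q = 1/2 corresponding to uniformly random guessing.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bits = 10          # length of my bit sequence
n_students = 300     # |H|, the number of (non-psychic) guessers
n_trials = 10_000    # Monte Carlo repetitions
q = 0.5              # per-bit success probability for a non-psychic guesser

# A single non-psychic student guesses the whole sequence with prob q**n_bits
p_single = q ** n_bits

# Each trial: does at least one of the |H| students guess all n bits?
someone_correct = (rng.random((n_trials, n_students)) < p_single).any(axis=1)

print(f"single-student success prob:  {p_single:.2e}")
print(f"empirical P(someone correct): {someone_correct.mean():.4f}")
print(f"union bound |H| * q^n:        {min(1.0, n_students * p_single):.4f}")
```

With these numbers the union bound is about 0.29, so a 10-bit game against a 300-person class would not be convincing evidence of telepathy.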

1. Setting

2. Generalization Error of fixed \(h\)

3. Generalization Error of ERM (finite \(H\))

4. Interpretation: Tradeoff

5. Infinite Hypothesis Spaces

6. Generalization Error Bound: Infinite \(H\)

1. Setting

  • Training data \(D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}\) drawn i.i.d. from \(P(X,Y)\), binary classification, the 0-1 loss function \(\ell(\hat y, y)=\mathbf 1\{\hat y\neq y\}\), and a hypothesis class \(H\)
  • Notation: expected test error of hypothesis \(h\) on distribution \(P\): $$\text{err}_{P}(h) = E_{(\mathbf{x},y) \sim P} \left[\ell(h(\mathbf{x}), y)\right]$$
  • Learning Goal: find \(h\) with small expected error \(\text{err}_{P}(h)\)
  • Define: sample error of hypothesis \(h\) on sample \(D\): $$\text{err}_{D}(h) = \frac{1}{n} \sum_{i=1}^{n} \ell(h(\mathbf{x}_i), y_i)$$
  • Learning Algorithm: empirical risk minimization $${h}_D = \arg\min_{h \in H} \text{err}_{D}(h)$$
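
As a concrete illustration of this setting, here is a minimal ERM sketch over a finite class of 1-D threshold classifiers; the data-generating process and the hypothesis class are illustrative assumptions, not part of the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: 1-D features, labels thresholded at 0 with 10% label noise
n = 200
x = rng.uniform(-1, 1, size=n)
y = ((x > 0.0) ^ (rng.random(n) < 0.1)).astype(int)

# Finite hypothesis class H: threshold classifiers h_t(x) = 1{x > t}
thresholds = np.linspace(-1, 1, 21)

def err_D(t):
    """Sample error err_D(h_t) under the 0-1 loss."""
    return np.mean((x > t).astype(int) != y)

# Empirical risk minimization: pick the hypothesis with smallest sample error
t_hat = min(thresholds, key=err_D)
print(f"ERM threshold: {t_hat:+.2f}, training error: {err_D(t_hat):.3f}")
```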


2. Generalization Error of fixed \(h\)

  • Define the generalization error as \( |\text{err}_{P}(h)  - \text{err}_{D}(h)|\)
  • Hoeffding/Chernoff Bound: For any distribution \(P(U)\) where \(U\in\{0,1\}\) and \(E[U]=p\), the average of i.i.d. samples deviates from the mean by more than \(\epsilon\) with bounded probability $$P\left(\left|\frac{1}{n}\sum_{i=1}^{n} u_i - p\right| > \epsilon\right) \leq 2e^{-2n\epsilon^2}$$ I.e., the average concentrates around the mean with high probability.
     
  • Apply Hoeffding with \(u_i=\)____________ to bound

    $$P\left(\left|\text{err}_{D}(h) - \text{err}_{P}(h)\right| > \epsilon\right) \leq \qquad\qquad\qquad\qquad\qquad\qquad$$
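
A small simulation can sanity-check the Hoeffding bound; the values of p, n, and ε below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)

p, n, eps = 0.3, 100, 0.1
n_trials = 100_000

# Sample means of n i.i.d. Bernoulli(p) draws, repeated n_trials times
means = rng.binomial(n, p, size=n_trials) / n

empirical = np.mean(np.abs(means - p) > eps)
hoeffding = 2 * np.exp(-2 * n * eps**2)

print(f"empirical deviation prob: {empirical:.4f}")   # ~0.02 here
print(f"Hoeffding bound:          {hoeffding:.4f}")   # ~0.27, valid but loose
```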


3. Generalization Error of ERM (finite \(H\))

  • Explain or fill in each step (hint: use union bound) $$\begin{align*} P\left(\left|\text{err}_{D}(h_D) - \text{err}_{P}(h_D)\right| > \epsilon\right) &\leq P\left(\max_{h\in H}\left|\text{err}_{D}(h) - \text{err}_{P}(h)\right| > \epsilon\right) \\ \\ &=  \\ \\  &\leq \sum_{h\in H} P\left(\left|\text{err}_{D}(h) - \text{err}_{P}(h)\right| > \epsilon\right) \\ &\leq \end{align*} $$
  • Fix the probability to be less than \(\delta\), and derive expression for \(\epsilon\) $$ \epsilon= \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad \qquad $$
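
Not on the slide, but the chain above can be checked by simulation under illustrative assumptions: for 1-D threshold classifiers on uniform data with label-noise rate \(\eta\), the true error has the closed form \(\text{err}_P(h_t) = \eta + |t - 0.5|(1-2\eta)\), so the maximal deviation is directly measurable. The \(\epsilon\) used below is the standard finite-class value \(\epsilon = \sqrt{\log(2|H|/\delta)/(2n)}\), one way to fill in the final blank.

```python
import numpy as np

rng = np.random.default_rng(3)

n, eta, delta = 500, 0.1, 0.05
thresholds = np.linspace(0, 1, 51)      # finite H: h_t(x) = 1{x > t}
n_trials = 2_000

# Closed-form true error for this setup (uniform X on [0,1], noise rate eta)
err_P = eta + np.abs(thresholds - 0.5) * (1 - 2 * eta)

max_dev = np.empty(n_trials)
for i in range(n_trials):
    x = rng.uniform(0, 1, size=n)
    y = ((x > 0.5) ^ (rng.random(n) < eta)).astype(int)
    preds = (x[None, :] > thresholds[:, None]).astype(int)   # |H| x n
    err_D = np.mean(preds != y[None, :], axis=1)
    max_dev[i] = np.max(np.abs(err_D - err_P))

# Finite-class bound: with prob >= 1 - delta, the max deviation is <= eps
eps = np.sqrt(np.log(2 * len(thresholds) / delta) / (2 * n))
print(f"bound eps: {eps:.3f}")
print(f"fraction of trials exceeding eps: {np.mean(max_dev > eps):.4f}")
```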

[Figure: training error, generalization error, and the upper bound]


4. Interpretation: Tradeoff

  • Derive the following bound: $$\begin{align*}\text{err}_{P}(h) &= \\ & \leq  \text{err}_{D}(h) + |\text{err}_{D}(h)-\text{err}_{P}(h)| \\  \end{align*}$$
  • Conclude (previous slide) that with probability at least \(1-\delta\):

    $$\begin{align*}\text{err}_{P}(h_D)& \leq  \underbrace{\text{err}_{D}(h_D)}_{(a)} + \underbrace{\qquad\qquad\qquad\qquad\qquad\qquad}_{(b)} \end{align*}$$
  • This PAC ("probably approximately correct") bound reflects the trade-off between
    • (a) Training error (smaller when \(H\) is larger)
    • (b) Complexity of \(H\) (larger when \(H\) is larger; a numeric sketch follows this list)
  • Occam's Razor: Prefer the simplest hypothesis that fits the data.
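
To get a feel for the trade-off, the snippet below evaluates the complexity term (b) in its finite-\(H\) form, \(\sqrt{\log(2|H|/\delta)/(2n)}\), for a few illustrative class and sample sizes.

```python
import numpy as np

delta = 0.05
for H_size in (10, 1_000, 1_000_000):
    for n in (100, 10_000):
        b = np.sqrt(np.log(2 * H_size / delta) / (2 * n))
        print(f"|H| = {H_size:>9,}   n = {n:>6,}   term (b) = {b:.3f}")
```

Term (b) grows only logarithmically in \(|H|\) but shrinks as \(1/\sqrt{n}\): modest amounts of extra data can pay for a much richer hypothesis class.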


5. Infinite Hypothesis Spaces

  • If \(H\) is the set of all linear classifiers, how big is \(H\)?
  • New idea: effective number of hypotheses measures all the ways \(H\) can label the training data \(D\)
  • \(H[D]\) = the set of all possible predictions on training data: $$H[D] = \{ (h(\mathbf{x}_1), h(\mathbf{x}_2), h(\mathbf{x}_3), \ldots, h(\mathbf{x}_n)) \mid h\in H\}$$
  • Question: what are the maximum and minimum possible sizes of \(H[D]\) for \(n\) training data points?
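
A concrete (illustrative) example for this question: for 1-D threshold classifiers, the sketch below enumerates \(H[D]\) directly. Even though \(H\) is infinite, only \(n+1\) of the \(2^n\) conceivable labelings are realized.

```python
import numpy as np

rng = np.random.default_rng(4)

# Effective hypotheses for 1-D threshold classifiers h_t(x) = 1{x > t}:
# enumerate the distinct labelings H[D] induced on n sample points.
n = 8
x = np.sort(rng.uniform(0, 1, size=n))

# One threshold below all points, one in each gap, one above all points
cands = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
labelings = {tuple((x > t).astype(int)) for t in cands}

print(f"|H[D]| = {len(labelings)}  (thresholds realize only n+1 = {n + 1} labelings)")
print(f"2^n    = {2 ** n} (the maximum possible size of H[D])")
```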

6. Generalization Error Bound: Infinite \(H\)

  • General upper bound in terms of the Vapnik-Chervonenkis (VC) dimension \(d_{VC}(H)\) of a hypothesis class: for \(n \geq d_{VC}(H)\), $$ \max_{|D|=n} |H[D]|\leq (ne/d_{VC}(H))^{d_{VC}(H)} $$
  • VC Dimension is well known for many \(H\) 
    • Linear classifiers on \(d\) features: \(d_{VC} = d\)
    • Linear classifiers with bias: \(d_{VC} = d+1\)
    • Linear classifiers with margin \(\gamma\) on data with \(\|\mathbf{x}_i\|\leq R\): \(d_{VC} \leq R^2/\gamma^2\)
  • Can derive a PAC bound analogous to the finite-\(H\) case
  • For \(H\) with VC dimension \(d_{VC}\), given \(n\) training data points \(D\),  with probability at least \(1-\delta\):

$$\begin{align*}\text{err}_{P}(h_D)& \leq  \underbrace{\text{err}_{D}(h_D)}_{(a)} + \underbrace{\sqrt{\frac{d_{VC}\log(2n/d_{VC}) + 1 +\log(1/\delta)}{4n}}}_{(b)} \end{align*}$$
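
Plugging illustrative numbers into the bound as printed above: the snippet evaluates term (b) for linear classifiers with bias on \(d = 10\) features (so \(d_{VC} = 11\), per the list above) at \(\delta = 0.05\).

```python
import numpy as np

def vc_term(d_vc, n, delta):
    """Term (b) of the VC bound, exactly as printed on the slide."""
    return np.sqrt((d_vc * np.log(2 * n / d_vc) + 1 + np.log(1 / delta)) / (4 * n))

for n in (100, 1_000, 100_000):
    print(f"n = {n:>7,}:  term (b) = {vc_term(d_vc=11, n=n, delta=0.05):.3f}")
```

The term shrinks roughly as \(\sqrt{d_{VC}\log n / n}\), so the bound only becomes tight when \(n\) is large relative to \(d_{VC}\).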

Learning Theory

By Sarah Dean
