Statistical learning theory and the VC dimension

 

Book club, 30.07.2021

 Claudia Merger, Alexandre René

The Nature of Statistical Learning Theory: ideas, examples, context

Statistical Learning Theory: proofs, definitions, "clean" textbook

Today's topics: Chapters 1-3

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

Empirical Risk Minimization Inductive Principle

Learning machine: given samples

z_1, \dots, z_l

fit some function

f_{\alpha}(z),\quad \alpha \in \Lambda

to minimize some risk

R_{\mathrm{emp.}}(\alpha) = \frac{1}{l}\sum_i \underbrace{Q(\alpha, z_i)}_{\mathrm{Loss}}

If the learning machine obeys the ERM inductive principle for any given set of observations, we call it a learning process.

Only minimize the empirical risk? NO.
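A minimal sketch of the ERM mechanics in Python (the linear model, squared loss, and parameter grid below are my own illustration, not from the slides): compute R_emp(alpha) on the sample and pick the minimizing alpha.

import numpy as np
rng = np.random.default_rng(0)

# Hypothetical setup: z_i = (x_i, y_i), model f_alpha(x) = alpha * x, squared-error loss.
l = 50
x = rng.uniform(-1, 1, l)
y = 2.0 * x + 0.1 * rng.standard_normal(l)               # data generated with true slope 2

def Q(alpha, x, y):                                      # loss Q(alpha, z)
    return (y - alpha * x) ** 2

alphas = np.linspace(-5, 5, 1001)                        # the index set Lambda (here a grid)
R_emp = np.array([Q(a, x, y).mean() for a in alphas])    # R_emp(alpha) = (1/l) sum_i Q(alpha, z_i)
alpha_hat = alphas[R_emp.argmin()]                       # ERM: pick the empirical minimizer
print(f"alpha_hat = {alpha_hat:.2f}, R_emp(alpha_hat) = {R_emp.min():.4f}")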

Example

Task: learn the data p.d.f.

"Solution": mixture with one component per data point:

R_{\mathrm{emp.}}(\alpha) = 0
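A sketch of why this "solution" is useless (my own illustration; the slides do not specify the loss, so negative log-likelihood is used as a stand-in): one narrow Gaussian component per training point drives the training loss arbitrarily low, while the loss on fresh samples from the true p.d.f. deteriorates.

import numpy as np
rng = np.random.default_rng(0)

train = rng.standard_normal(20)        # samples z_1, ..., z_l from the true p.d.f. N(0,1)
test = rng.standard_normal(10_000)     # fresh samples, used as a proxy for the true risk

def mixture_neg_loglik(z, centers, sigma):
    """Negative log-likelihood under an equal-weight Gaussian mixture, one component per center."""
    z = np.atleast_1d(z)[:, None]
    comp = np.exp(-0.5 * ((z - centers) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.mean(axis=1) + 1e-300)           # tiny floor avoids log(0)

for sigma in (1.0, 0.1, 0.001):                          # shrink the component width
    r_emp = mixture_neg_loglik(train, train, sigma).mean()
    r_true = mixture_neg_loglik(test, train, sigma).mean()
    print(f"sigma={sigma:7.3f}  train loss={r_emp:8.2f}  fresh-sample loss={r_true:8.2f}")
# The training loss keeps falling (without bound), while the out-of-sample loss blows up.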

Empirical Risk Minimization Inductive Principle

What we really want to minimize:

R(\alpha) = \int Q(\alpha, z)\,dF(z)

according to the true distribution F(z), but we don't have F(z).

How do we know when we're close?
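For one fixed alpha the relation between the two risks is simple: R_emp(alpha) is a sample average whose expectation is R(alpha). A small sketch (the loss and the distribution are my own choices for illustration):

import numpy as np
rng = np.random.default_rng(1)

def Q(alpha, z):                       # illustrative loss: squared distance to alpha
    return (z - alpha) ** 2

alpha = 0.3
# Monte Carlo stand-in for R(alpha) = integral of Q(alpha, z) dF(z), with F = N(0,1)
R_true = Q(alpha, rng.standard_normal(1_000_000)).mean()
for l in (10, 100, 10_000):
    R_emp = Q(alpha, rng.standard_normal(l)).mean()
    print(f"l={l:6d}  R_emp={R_emp:.3f}  R={R_true:.3f}")
# R_emp fluctuates around R and concentrates as l grows; the subtlety is that ERM
# evaluates R_emp at a data-dependent alpha_l, where this pointwise argument no longer suffices.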

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

 

Consistency of the learning process

For each sample of size l, pick \alpha_l such that

R_{\mathrm{emp.}}(\alpha_l) = \min_{\alpha}R_{\mathrm{emp.}}(\alpha)

The learning process is consistent iff

R(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(choose the right } \alpha\text{)}

and

R_{\mathrm{emp.}}(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(get correct risk)}
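A small simulation of what consistency looks like (the finite class of threshold classifiers and the noisy labels below are my own example, chosen because ERM over a finite class is consistent): the true risk of the ERM pick approaches the best achievable risk as l grows.

import numpy as np
rng = np.random.default_rng(2)

thresholds = np.linspace(-2, 2, 41)                     # the index set Lambda
def risk(theta, x, y):                                  # 0/1 loss of the classifier 1{x > theta}
    return ((x > theta).astype(int) != y).mean()

def sample(l):                                          # true model: y = 1{x > 0.5}, labels flipped w.p. 0.1
    x = rng.uniform(-2, 2, l)
    y = ((x > 0.5).astype(int) ^ (rng.random(l) < 0.1)).astype(int)
    return x, y

x_big, y_big = sample(200_000)                          # large sample: proxy for the true risk R(alpha)
best = min(risk(t, x_big, y_big) for t in thresholds)   # inf_alpha R(alpha), roughly 0.1
for l in (10, 100, 10_000):
    x, y = sample(l)
    alpha_l = min(thresholds, key=lambda t: risk(t, x, y))   # ERM pick on the sample
    print(f"l={l:6d}  R(alpha_l)={risk(alpha_l, x_big, y_big):.3f}  inf_alpha R={best:.3f}")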

Example, revisited

Task: learn the data p.d.f. "Solution": mixture with one component per data point,

R_{\mathrm{emp.}}(\alpha) = 0

→ inconsistent

Example

Suppose Q(\alpha, z ) \geq 0 \quad \forall z, \alpha \in \Lambda, and for some \alpha_0 \in \Lambda

Q(\alpha_0, z ) < Q(\alpha, z) \quad \forall z, \alpha

Then trivially

\inf_{\alpha} R(\alpha) = R(\alpha_0), \qquad \min_{\alpha} R_{\mathrm{emp.}}(\alpha) = R_{\mathrm{emp.}}(\alpha_0)

→ trivial consistency

In particular, if for some \alpha_0 \in \Lambda

Q(\alpha_0, z ) = 0 \quad \forall z, \qquad Q(\alpha, z ) \geq 0 \quad \forall z, \alpha \in \Lambda

then trivially

\inf_{\alpha} R(\alpha) = 0, \qquad \min_{\alpha} R_{\mathrm{emp.}}(\alpha) = 0 \quad \forall l

→ trivial consistency
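A toy check (the loss functions below are hypothetical): once one alpha_0 has uniformly smaller loss than every other alpha, ERM returns alpha_0 for any sample, so consistency holds for reasons that say nothing about the rest of the function class.

import numpy as np
rng = np.random.default_rng(3)

losses = {                                   # Q(alpha, z) for a few made-up alphas
    "alpha_1": lambda z: (z - 1.0) ** 2,
    "alpha_2": lambda z: np.abs(z) + 0.5,
    "alpha_0": lambda z: np.zeros_like(z),   # uniformly below everything else
}
for l in (5, 50, 500):
    z = rng.standard_normal(l)
    pick = min(losses, key=lambda a: losses[a](z).mean())
    print(f"l={l:4d}  ERM picks {pick}")     # always alpha_0, regardless of the data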

Exclude trivial consistency

Non-trivial consistency: consistency on any subset

\Lambda(c) := \left\{\alpha \in \Lambda \,\bigg|\, R(\alpha) > c \right\}

In the following, consistent = non-trivially consistent.

proof: Statistical Learning Theory, pp. 89-92

What are the necessary and sufficient conditions?

For some l:

\sup_{\alpha} \left( R(\alpha) - R_{\mathrm{emp.}}(\alpha) \right) \leq \, \underbrace{ ? }_{\text{dist. independent}}

How does the bound scale with l?

Which properties must \left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\} have?

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

Falsifiability

K. Popper: A theory is scientific if it is falsifiable.

Example: "What goes up must come down." - Falsifiable? Yes: a single object that rises and never comes down would refute it.

Example: "Whatever will be, will be." - Falsifiable? No: no conceivable observation could contradict it.

Example, revisited

Task: learn the data p.d.f. "Solution": mixture with one component per data point,

R_{\mathrm{emp.}}(\alpha) = 0

→ not falsifiable (it fits any data set perfectly)

For consistency, the set of functions

\left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\}

must not be too "flexible".

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

VC Entropy of a set of functions

Given a set of functions \left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\} and samples z_1, \dots , z_l, construct the set of vectors

\mathcal{Q} = \left\{ q(\alpha) = \begin{pmatrix} Q(\alpha, z_1 ) \\ \vdots \\ Q(\alpha, z_l )\end{pmatrix} \Bigg| \alpha \in \Lambda \right\}

Minimal \epsilon\text{-net } \mathcal{Q}^{\epsilon}: the minimal number N^{\Lambda} (\epsilon, z_1, \dots , z_l) of vectors such that every q(\alpha) \in \mathcal{Q} has some p \in \mathcal{Q}^{\epsilon} at distance at most \epsilon.

VC entropy:

H^{\Lambda} (\epsilon, l ) = \bigg\langle \ln N^{\Lambda} (\epsilon, z_1, \dots, z_l) \bigg\rangle
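For indicator functions (see the next example) and any epsilon < 1, distinct 0/1 vectors are at distance at least 1, so the minimal epsilon-net is just the set of distinct vectors and N^Lambda reduces to counting the labelings realized on the sample. A sketch (the threshold class is my own illustration):

import numpy as np

z = np.array([-0.7, 0.1, 0.4, 1.3])                      # a sample z_1, ..., z_l
thetas = np.linspace(-2, 2, 401)                          # index set Lambda: thresholds
# q(alpha) = (Q(alpha, z_1), ..., Q(alpha, z_l)) with Q(theta, z) = 1{z > theta}
vectors = {tuple((z > t).astype(int)) for t in thetas}
N = len(vectors)                                          # N^Lambda(z_1, ..., z_l)
print(N, np.log(N))                                       # here N = l + 1 = 5, ln N ~ 1.61
# Averaging ln N over random samples of size l gives the VC entropy H^Lambda(l).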

VC Entropy of a set of functions

H^{\Lambda}(\epsilon, l): expectation of the diversity of the set of functions on a sample of size l.

The condition

\lim_{l \rightarrow \infty} \frac{H^{\Lambda}(\epsilon, l)}{l} = 0 \quad \forall \epsilon > 0 \qquad (2.6)

is sufficient for consistency.

Example: indicator functions

Q(\alpha,z) \in \{0,1\}\quad \forall \alpha, z

\mathcal{Q} = \left\{ q(\alpha) = \begin{pmatrix} Q(\alpha, z_1 ) \\ \vdots \\ Q(\alpha, z_l )\end{pmatrix} \Bigg| \alpha \in \Lambda \right\}

[Figure: three points z_1, z_2, z_3 in the input space, labeled in different ways by the indicator functions]

1 \leq N^{\Lambda}( z_1, \dots, z_l) \leq 2^l

In the figure, all 2^3 labelings are realized: N^{\Lambda} (z_1,z_2, z_3)=8

Example: indicator functions

Q(\alpha,z) \in \{0,1\}\quad \forall \alpha, z, \qquad 1 \leq N^{\Lambda}( z_1, \dots, z_l) \leq 2^l

Worst case: for some \Lambda, all 2^l labelings are realized for every l, i.e. N^{\Lambda}(z_1, \dots, z_l) = 2^l, so

\lim_{l \rightarrow \infty } \frac{H^{\Lambda} (l)}{l} = \lim_{l \rightarrow \infty } \frac{ \big\langle \ln N^{\Lambda}(z_1, \dots, z_l) \big\rangle }{l} = \ln 2 \neq 0

Not falsifiable + inconsistent
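One-line check of the worst-case value, using the natural log from the definition of H^{\Lambda} above:

H^{\Lambda}(l) = \big\langle \ln N^{\Lambda}(z_1, \dots, z_l) \big\rangle = \big\langle \ln 2^{l} \big\rangle = l \ln 2 \quad \Longrightarrow \quad \frac{H^{\Lambda}(l)}{l} = \ln 2 > 0,

so the sufficient entropy condition fails.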

Problems

  • VC entropy depends on distribution
  • How fast is the convergence of (2.6)? How many samples are needed?

VC dimension of a set of functions

For indicator functions Q(\alpha,z) \in \{0,1\}: the VC dimension h is the largest number of samples z_1, \dots, z_h that the set of functions can label in all possible ways, i.e. for which

N^{\Lambda}(z_1, \dots, z_h) = 2^{h}

(In the example above, N^{\Lambda} (z_1,z_2, z_3)=8 = 2^3: the three points can be labeled in every possible way.)

For real-valued Q(\alpha, z), the VC dimension is defined via the set of indicators

I\big( Q(\alpha, z) - \beta > 0 \big), \quad \beta \in \mathbb{R}
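A brute-force shattering check (the threshold and interval classes are my own illustrations, not from the slides): a class has VC dimension at least h if some set of h points realizes all 2^h labelings.

from itertools import product

def dichotomies(points, functions):
    """Set of distinct 0/1 labelings the function class produces on `points`."""
    return {tuple(f(x) for x in points) for f in functions}

def shatters(points, functions):
    return len(dichotomies(points, functions)) == 2 ** len(points)

# Class 1: thresholds on the line, Q(theta, x) = 1{x > theta}
thresholds = [lambda x, t=t: int(x > t) for t in [i / 10 for i in range(-20, 21)]]
# Class 2: intervals, Q((a,b), x) = 1{a <= x <= b}
grid = [i / 10 for i in range(-20, 21)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a, b in product(grid, grid) if a <= b]

print(shatters([0.0], thresholds), shatters([0.0, 1.0], thresholds))          # True, False -> h = 1
print(shatters([0.0, 1.0], intervals), shatters([0.0, 1.0, 2.0], intervals))  # True, False -> h = 2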

VC dimension of a set of functions

For some l:

\sup_{\alpha} \left( R(\alpha) - R_{\mathrm{emp.}}(\alpha) \right) \leq \, \underbrace{ ? }_{\text{dist. independent}}

How does the bound scale with l?

For a pair (p, \tau) with p > 2, the VC dimension h, and a confidence parameter 0 < \eta < 1, one defines a distribution-independent bound on this deviation. When the bound is small, R(\alpha) and R_{\mathrm{emp.}}(\alpha) are close, and the learning process picks the right \alpha.
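For bounded (e.g. indicator) losses, one commonly quoted bound of this type, due to Vapnik (not necessarily the exact expression used on the slide), is: with probability at least 1 - \eta, simultaneously for all \alpha \in \Lambda,

R(\alpha) \;\leq\; R_{\mathrm{emp.}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}

where h is the VC dimension; the capacity term is distribution independent and shrinks as l/h grows.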

 

 

 

 

 

 

Recall: the learning process is consistent iff

R(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(choose the right } \alpha\text{)}

and

R_{\mathrm{emp.}}(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(get correct risk)}

All we need to know about learning?

No.

If \frac{\text{sample size}}{\text{VC dimension}} is small:

tradeoff between empirical risk minimization and VC dimension → add regularization
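A sketch of that tradeoff in code (entirely my own illustration: sign-of-polynomial classifiers with an assumed VC dimension h = degree + 1, and the standard VC-type penalty used as a stand-in for "regularization"): increasing capacity lowers R_emp but inflates the capacity term, and one picks the class minimizing the sum.

import numpy as np
rng = np.random.default_rng(4)

def vc_penalty(h, l, eta=0.05):
    """Standard VC-type capacity term, used here as an illustrative regularizer."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 40
x = rng.uniform(-1, 1, l)
y = ((x > 0.1).astype(int) ^ (rng.random(l) < 0.15)).astype(int)    # noisy threshold labels

for degree in (1, 3, 5, 9):
    # crude classifier: sign of a least-squares polynomial fit to labels mapped to {-1,+1}
    coeffs = np.polyfit(x, 2 * y - 1, degree)
    pred = (np.polyval(coeffs, x) > 0).astype(int)
    r_emp = float((pred != y).mean())
    bound = r_emp + vc_penalty(degree + 1, l)                        # assumed h = degree + 1
    print(f"degree={degree:2d}  R_emp={r_emp:.3f}  R_emp + capacity term={bound:.3f}")
# Typically R_emp falls with degree while the capacity term rises; the best guarantee
# comes from balancing the two (structural risk minimization flavor).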

Thank you!

 
