Statistical learning theory and the VC dimension

 

Book club, 30.07.2021

 Claudia Merger, Alexandre René

The Nature of Statistical Learning Theory: ideas, examples, context

Statistical Learning Theory: proofs, definitions, "clean" textbook

Today's topics: Chapters 1-3

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

Empirical Risk Minimization Inductive Principle

Learning machine: given samples

z_1, \dots, z_l

fit some function

f_{\alpha}(z),\quad \alpha \in \Lambda

to minimize some risk

R_{\mathrm{emp.}}(\alpha) = \frac{1}{l}\sum_i \underbrace{Q(\alpha, z_i)}_{\mathrm{Loss}}

If the learning machine obeys the ERM inductive principle for any given set of observations, we call it a learning process.

Only minimize the empirical risk? NO.
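A minimal sketch of the ERM mechanics in Python (the linear model, squared loss, and parameter grid below are my own illustration, not from the slides): compute R_emp(alpha) on the sample and pick the minimizing alpha.

import numpy as np
rng = np.random.default_rng(0)

# Hypothetical setup: z_i = (x_i, y_i), model f_alpha(x) = alpha * x, squared-error loss.
l = 50
x = rng.uniform(-1, 1, l)
y = 2.0 * x + 0.1 * rng.standard_normal(l)               # data generated with true slope 2

def Q(alpha, x, y):                                      # loss Q(alpha, z)
    return (y - alpha * x) ** 2

alphas = np.linspace(-5, 5, 1001)                        # the index set Lambda (here a grid)
R_emp = np.array([Q(a, x, y).mean() for a in alphas])    # R_emp(alpha) = (1/l) sum_i Q(alpha, z_i)
alpha_hat = alphas[R_emp.argmin()]                       # ERM: pick the empirical minimizer
print(f"alpha_hat = {alpha_hat:.2f}, R_emp(alpha_hat) = {R_emp.min():.4f}")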

Example

Task: learn the data p.d.f.

"Solution": mixture with one component per data point:

R_{\mathrm{emp.}}(\alpha) = 0
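A sketch of why this "solution" is useless (my own illustration; the slides do not specify the loss, so negative log-likelihood is used as a stand-in): one narrow Gaussian component per training point drives the training loss arbitrarily low, while the loss on fresh samples from the true p.d.f. deteriorates.

import numpy as np
rng = np.random.default_rng(0)

train = rng.standard_normal(20)        # samples z_1, ..., z_l from the true p.d.f. N(0,1)
test = rng.standard_normal(10_000)     # fresh samples, used as a proxy for the true risk

def mixture_neg_loglik(z, centers, sigma):
    """Negative log-likelihood under an equal-weight Gaussian mixture, one component per center."""
    z = np.atleast_1d(z)[:, None]
    comp = np.exp(-0.5 * ((z - centers) / sigma) ** 2) / (np.sqrt(2 * np.pi) * sigma)
    return -np.log(comp.mean(axis=1) + 1e-300)           # tiny floor avoids log(0)

for sigma in (1.0, 0.1, 0.001):                          # shrink the component width
    r_emp = mixture_neg_loglik(train, train, sigma).mean()
    r_true = mixture_neg_loglik(test, train, sigma).mean()
    print(f"sigma={sigma:7.3f}  train loss={r_emp:8.2f}  fresh-sample loss={r_true:8.2f}")
# The training loss keeps falling (without bound), while the out-of-sample loss blows up.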

Empirical Risk Minimization Inductive Principle

What we really want to minimize:

R(\alpha) = \int Q(\alpha, z)\,dF(z)

according to the true distribution F(z), but we don't have F(z).

How do we know when we're close?
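For one fixed alpha the relation between the two risks is simple: R_emp(alpha) is a sample average whose expectation is R(alpha). A small sketch (the loss and the distribution are my own choices for illustration):

import numpy as np
rng = np.random.default_rng(1)

def Q(alpha, z):                       # illustrative loss: squared distance to alpha
    return (z - alpha) ** 2

alpha = 0.3
# Monte Carlo stand-in for R(alpha) = integral of Q(alpha, z) dF(z), with F = N(0,1)
R_true = Q(alpha, rng.standard_normal(1_000_000)).mean()
for l in (10, 100, 10_000):
    R_emp = Q(alpha, rng.standard_normal(l)).mean()
    print(f"l={l:6d}  R_emp={R_emp:.3f}  R={R_true:.3f}")
# R_emp fluctuates around R and concentrates as l grows; the subtlety is that ERM
# evaluates R_emp at a data-dependent alpha_l, where this pointwise argument no longer suffices.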

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

 

Consistency of the learning process

For each sample of size l, pick \alpha_l such that

R_{\mathrm{emp.}}(\alpha_l) = \min_{\alpha}R_{\mathrm{emp.}}(\alpha)

The learning process is consistent iff

R(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(choose the right } \alpha\text{)}

and

R_{\mathrm{emp.}}(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(get correct risk)}
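A small simulation of what consistency looks like (the finite class of threshold classifiers and the noisy labels below are my own example, chosen because ERM over a finite class is consistent): the true risk of the ERM pick approaches the best achievable risk as l grows.

import numpy as np
rng = np.random.default_rng(2)

thresholds = np.linspace(-2, 2, 41)                     # the index set Lambda
def risk(theta, x, y):                                  # 0/1 loss of the classifier 1{x > theta}
    return ((x > theta).astype(int) != y).mean()

def sample(l):                                          # true model: y = 1{x > 0.5}, labels flipped w.p. 0.1
    x = rng.uniform(-2, 2, l)
    y = ((x > 0.5).astype(int) ^ (rng.random(l) < 0.1)).astype(int)
    return x, y

x_big, y_big = sample(200_000)                          # large sample: proxy for the true risk R(alpha)
best = min(risk(t, x_big, y_big) for t in thresholds)   # inf_alpha R(alpha), roughly 0.1
for l in (10, 100, 10_000):
    x, y = sample(l)
    alpha_l = min(thresholds, key=lambda t: risk(t, x, y))   # ERM pick on the sample
    print(f"l={l:6d}  R(alpha_l)={risk(alpha_l, x_big, y_big):.3f}  inf_alpha R={best:.3f}")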

Example, revisited

Task: learn the data p.d.f. "Solution": mixture with one component per data point,

R_{\mathrm{emp.}}(\alpha) = 0

→ inconsistent

Example

Suppose Q(\alpha, z ) \geq 0 \quad \forall z, \alpha \in \Lambda, and for some \alpha_0 \in \Lambda

Q(\alpha_0, z ) < Q(\alpha, z) \quad \forall z, \alpha

Then trivially

\inf_{\alpha} R(\alpha) = R(\alpha_0), \qquad \min_{\alpha} R_{\mathrm{emp.}}(\alpha) = R_{\mathrm{emp.}}(\alpha_0)

→ trivial consistency

In particular, if for some \alpha_0 \in \Lambda

Q(\alpha_0, z ) = 0 \quad \forall z, \qquad Q(\alpha, z ) \geq 0 \quad \forall z, \alpha \in \Lambda

then trivially

\inf_{\alpha} R(\alpha) = 0, \qquad \min_{\alpha} R_{\mathrm{emp.}}(\alpha) = 0 \quad \forall l

→ trivial consistency
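A toy check (the loss functions below are hypothetical): once one alpha_0 has uniformly smaller loss than every other alpha, ERM returns alpha_0 for any sample, so consistency holds for reasons that say nothing about the rest of the function class.

import numpy as np
rng = np.random.default_rng(3)

losses = {                                   # Q(alpha, z) for a few made-up alphas
    "alpha_1": lambda z: (z - 1.0) ** 2,
    "alpha_2": lambda z: np.abs(z) + 0.5,
    "alpha_0": lambda z: np.zeros_like(z),   # uniformly below everything else
}
for l in (5, 50, 500):
    z = rng.standard_normal(l)
    pick = min(losses, key=lambda a: losses[a](z).mean())
    print(f"l={l:4d}  ERM picks {pick}")     # always alpha_0, regardless of the data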

Exclude trivial consistency

Non-trivial consistency: consistency on any subset

\Lambda(c) := \left\{\alpha \in \Lambda \,\bigg|\, R(\alpha) > c \right\}

In the following, consistent = non-trivially consistent.

proof: Statistical Learning Theory, pp. 89-92

What are the necessary and sufficient conditions?

For some l:

\sup_{\alpha} \left( R(\alpha) - R_{\mathrm{emp.}}(\alpha) \right) \leq \, \underbrace{ ? }_{\text{dist. independent}}

How does the bound scale with l?

Which properties must \left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\} have?

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

Falsifiability

K. Popper: A theory is scientific if it is falsifiable.

Example: "What goes up must come down." - Falsifiable? Yes: a single object that rises and never comes down would refute it.

Example: "Whatever will be, will be." - Falsifiable? No: no conceivable observation could contradict it.

Example, revisited

Task: learn the data p.d.f. "Solution": mixture with one component per data point,

R_{\mathrm{emp.}}(\alpha) = 0

→ not falsifiable (it fits any data set perfectly)

For consistency, the set of functions

\left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\}

must not be too "flexible".

Main topics

  1. Empirical Risk Minimization Inductive Principle
  2. Consistency of Learning Process
  3. Falsifiability
  4. VC Entropy of a set of functions and the VC dimension

VC Entropy of a set of functions

Given a set of functions \left\{ Q(z,\alpha) \,\bigg|\, \alpha \in \Lambda \right\} and samples z_1, \dots , z_l, construct the set of vectors

\mathcal{Q} = \left\{ q(\alpha) = \begin{pmatrix} Q(\alpha, z_1 ) \\ \vdots \\ Q(\alpha, z_l )\end{pmatrix} \Bigg| \alpha \in \Lambda \right\}

Minimal \epsilon\text{-net } \mathcal{Q}^{\epsilon}: the minimal number N^{\Lambda} (\epsilon, z_1, \dots , z_l) of vectors such that every q(\alpha) \in \mathcal{Q} has some p \in \mathcal{Q}^{\epsilon} at distance at most \epsilon.

VC entropy:

H^{\Lambda} (\epsilon, l ) = \bigg\langle \ln N^{\Lambda} (\epsilon, z_1, \dots, z_l) \bigg\rangle
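For indicator functions (see the next example) and any epsilon < 1, distinct 0/1 vectors are at distance at least 1, so the minimal epsilon-net is just the set of distinct vectors and N^Lambda reduces to counting the labelings realized on the sample. A sketch (the threshold class is my own illustration):

import numpy as np

z = np.array([-0.7, 0.1, 0.4, 1.3])                      # a sample z_1, ..., z_l
thetas = np.linspace(-2, 2, 401)                          # index set Lambda: thresholds
# q(alpha) = (Q(alpha, z_1), ..., Q(alpha, z_l)) with Q(theta, z) = 1{z > theta}
vectors = {tuple((z > t).astype(int)) for t in thetas}
N = len(vectors)                                          # N^Lambda(z_1, ..., z_l)
print(N, np.log(N))                                       # here N = l + 1 = 5, ln N ~ 1.61
# Averaging ln N over random samples of size l gives the VC entropy H^Lambda(l).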

VC Entropy of a set of functions

H^{\Lambda}(\epsilon, l): expectation of the diversity of the set of functions on a sample of size l.

The condition

\lim_{l \rightarrow \infty} \frac{H^{\Lambda}(\epsilon, l)}{l} = 0 \quad \forall \epsilon > 0 \qquad (2.6)

is sufficient for consistency.

Example: indicator functions

Q(\alpha,z) \in \{0,1\}\quad \forall \alpha, z

\mathcal{Q} = \left\{ q(\alpha) = \begin{pmatrix} Q(\alpha, z_1 ) \\ \vdots \\ Q(\alpha, z_l )\end{pmatrix} \Bigg| \alpha \in \Lambda \right\}

[Figure: three points z_1, z_2, z_3 in the input space, labeled in different ways by the indicator functions]

1 \leq N^{\Lambda}( z_1, \dots, z_l) \leq 2^l

In the figure, all 2^3 labelings are realized: N^{\Lambda} (z_1,z_2, z_3)=8

Example: indicator functions

Q(\alpha,z) \in \{0,1\}\quad \forall \alpha, z, \qquad 1 \leq N^{\Lambda}( z_1, \dots, z_l) \leq 2^l

Worst case: for some \Lambda, all 2^l labelings are realized for every l, i.e. N^{\Lambda}(z_1, \dots, z_l) = 2^l, so

\lim_{l \rightarrow \infty } \frac{H^{\Lambda} (l)}{l} = \lim_{l \rightarrow \infty } \frac{ \big\langle \ln N^{\Lambda}(z_1, \dots, z_l) \big\rangle }{l} = \ln 2 \neq 0

Not falsifiable + inconsistent
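One-line check of the worst-case value, using the natural log from the definition of H^{\Lambda} above:

H^{\Lambda}(l) = \big\langle \ln N^{\Lambda}(z_1, \dots, z_l) \big\rangle = \big\langle \ln 2^{l} \big\rangle = l \ln 2 \quad \Longrightarrow \quad \frac{H^{\Lambda}(l)}{l} = \ln 2 > 0,

so the sufficient entropy condition fails.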

Problems

  • VC entropy depends on distribution
  • How fast is the convergence of (2.6)? How many samples are needed?

VC dimension of a set of functions

For indicator functions Q(\alpha,z) \in \{0,1\}: the VC dimension h is the largest number of samples z_1, \dots, z_h that the set of functions can label in all possible ways, i.e. for which

N^{\Lambda}(z_1, \dots, z_h) = 2^{h}

(In the example above, N^{\Lambda} (z_1,z_2, z_3)=8 = 2^3: the three points can be labeled in every possible way.)

For real-valued Q(\alpha, z), the VC dimension is defined via the set of indicators

I\big( Q(\alpha, z) - \beta > 0 \big), \quad \beta \in \mathbb{R}
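A brute-force shattering check (the threshold and interval classes are my own illustrations, not from the slides): a class has VC dimension at least h if some set of h points realizes all 2^h labelings.

from itertools import product

def dichotomies(points, functions):
    """Set of distinct 0/1 labelings the function class produces on `points`."""
    return {tuple(f(x) for x in points) for f in functions}

def shatters(points, functions):
    return len(dichotomies(points, functions)) == 2 ** len(points)

# Class 1: thresholds on the line, Q(theta, x) = 1{x > theta}
thresholds = [lambda x, t=t: int(x > t) for t in [i / 10 for i in range(-20, 21)]]
# Class 2: intervals, Q((a,b), x) = 1{a <= x <= b}
grid = [i / 10 for i in range(-20, 21)]
intervals = [lambda x, a=a, b=b: int(a <= x <= b) for a, b in product(grid, grid) if a <= b]

print(shatters([0.0], thresholds), shatters([0.0, 1.0], thresholds))          # True, False -> h = 1
print(shatters([0.0, 1.0], intervals), shatters([0.0, 1.0, 2.0], intervals))  # True, False -> h = 2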

VC dimension of a set of functions

For some l:

\sup_{\alpha} \left( R(\alpha) - R_{\mathrm{emp.}}(\alpha) \right) \leq \, \underbrace{ ? }_{\text{dist. independent}}

How does the bound scale with l?

For a pair (p, \tau) with p > 2, the VC dimension h, and a confidence parameter 0 < \eta < 1, one defines a distribution-independent bound on this deviation. When the bound is small, R(\alpha) and R_{\mathrm{emp.}}(\alpha) are close, and the learning process picks the right \alpha.
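For bounded (e.g. indicator) losses, one commonly quoted bound of this type, due to Vapnik (not necessarily the exact expression used on the slide), is: with probability at least 1 - \eta, simultaneously for all \alpha \in \Lambda,

R(\alpha) \;\leq\; R_{\mathrm{emp.}}(\alpha) + \sqrt{\frac{h\left(\ln\frac{2l}{h} + 1\right) - \ln\frac{\eta}{4}}{l}}

where h is the VC dimension; the capacity term is distribution independent and shrinks as l/h grows.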

 

 

 

 

 

 

Recall: the learning process is consistent iff

R(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(choose the right } \alpha\text{)}

and

R_{\mathrm{emp.}}(\alpha_l) \xrightarrow[l \rightarrow \infty]{P} \inf_{\alpha} R(\alpha) \qquad \text{(get correct risk)}

All we need to know about learning?

No.

If \frac{\text{sample size}}{\text{VC dimension}} is small:

tradeoff between empirical risk minimization and VC dimension → add regularization
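A sketch of that tradeoff in code (entirely my own illustration: sign-of-polynomial classifiers with an assumed VC dimension h = degree + 1, and the standard VC-type penalty used as a stand-in for "regularization"): increasing capacity lowers R_emp but inflates the capacity term, and one picks the class minimizing the sum.

import numpy as np
rng = np.random.default_rng(4)

def vc_penalty(h, l, eta=0.05):
    """Standard VC-type capacity term, used here as an illustrative regularizer."""
    return np.sqrt((h * (np.log(2 * l / h) + 1) - np.log(eta / 4)) / l)

l = 40
x = rng.uniform(-1, 1, l)
y = ((x > 0.1).astype(int) ^ (rng.random(l) < 0.15)).astype(int)    # noisy threshold labels

for degree in (1, 3, 5, 9):
    # crude classifier: sign of a least-squares polynomial fit to labels mapped to {-1,+1}
    coeffs = np.polyfit(x, 2 * y - 1, degree)
    pred = (np.polyval(coeffs, x) > 0).astype(int)
    r_emp = float((pred != y).mean())
    bound = r_emp + vc_penalty(degree + 1, l)                        # assumed h = degree + 1
    print(f"degree={degree:2d}  R_emp={r_emp:.3f}  R_emp + capacity term={bound:.3f}")
# Typically R_emp falls with degree while the capacity term rises; the best guarantee
# comes from balancing the two (structural risk minimization flavor).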

Thank you!

 
