Asymptotic Learning Curves of Kernel Methods

arXiv:1905.10843

Stefano Spigler, Mario Geiger, Matthieu Wyart

Why and how does deep supervised learning work?
Learn from examples: how many are needed?
Typical tasks:
- Regression (fitting functions)
- Classification

Supervised deep learning

Performance is evaluated through the generalization error \(\epsilon\)
Learning curves decay with the number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)
\(\beta\) depends on the dataset and on the algorithm

Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]

Learning curves

\(\epsilon\sim n^{-\beta}\)
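As a rough illustration of how an exponent \(\beta\) can be read off a learning curve, the sketch below fits a straight line to \(\log\epsilon\) versus \(\log n\). The data points are synthetic and hypothetical (generated from an exact power law), not measurements from the talk, and the function name is only illustrative.

```python
import numpy as np

def fit_learning_curve_exponent(n_values, errors):
    """Least-squares fit of log(eps) = log(c) - beta * log(n); returns (beta, c)."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(errors), deg=1)
    return -slope, np.exp(intercept)

# Synthetic, noise-free curve eps = 2 * n^(-0.4) (hypothetical numbers).
n_values = np.array([100.0, 200.0, 400.0, 800.0, 1600.0, 3200.0])
errors = 2.0 * n_values ** -0.4

beta, prefactor = fit_learning_curve_exponent(n_values, errors)
print(f"fitted beta = {beta:.3f}, prefactor = {prefactor:.3f}")
```

On real curves the fit should be restricted to the large-\(n\) regime, since a power law is only expected asymptotically.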

Link with kernel learning

We lack a theory for \(\beta\) for deep networks
Performance increases with overparametrization [Neyshabur et al. 2017, 2018, Advani and Saxe 2017] [Belkin et al. 2018, Spigler et al. 2018, Geiger et al. 2019]

[Figure: \(\epsilon\) versus \(h\)]

\(\longrightarrow\) study the infinite-width limit! [Jacot et al. 2018]

With a specific scaling, infinite-width limit \(\to\) kernel learning [Mei et al. 2017, Rotskoff and Vanden-Eijnden 2018, Jacot et al. 2018, Chizat and Bach 2018, ...]
Some kernels achieve almost the performance of deep networks [Bruna and Mallat 2013, Arora et al. 2019]

What are the learning curves of kernels like?

Kernel methods

Kernel methods learn non-linear functions or boundaries
Data are mapped to a feature space, where the problem is treated linearly

data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features

only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)

Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{\sigma^2}\right)\)

Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)\)

E.g. kernel regression:

Target function \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)
Build an estimator \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)
Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
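The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the target function, the sample sizes, and the helper names are hypothetical, the coefficients \(c_\mu\) are obtained by solving the linear system that sets the training MSE to zero, and the generalization error is estimated on fresh test points.

```python
import numpy as np

def laplace_kernel(a, b, sigma=1.0):
    """K(x, x') = exp(-||x - x'|| / sigma) for all pairs of rows of a and b."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def gaussian_kernel(a, b, sigma=1.0):
    """K(x, x') = exp(-||x - x'||^2 / sigma^2) for all pairs of rows of a and b."""
    sq_dists = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / sigma**2)

def fit_coefficients(x_train, z_train, kernel):
    """Solve K c = Z, so that the estimator interpolates the training data."""
    return np.linalg.solve(kernel(x_train, x_train), z_train)

def predict(x_new, x_train, coeffs, kernel):
    """Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x)."""
    return kernel(x_new, x_train) @ coeffs

def target(x):
    """Hypothetical smooth target function, chosen only for illustration."""
    return np.sin(3.0 * x[:, 0]) * np.cos(2.0 * x[:, 1])

rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=(200, 2))
x_test = rng.uniform(-1.0, 1.0, size=(1000, 2))

coeffs = fit_coefficients(x_train, target(x_train), laplace_kernel)
train_mse = np.mean((predict(x_train, x_train, coeffs, laplace_kernel) - target(x_train)) ** 2)
test_mse = np.mean((predict(x_test, x_train, coeffs, laplace_kernel) - target(x_test)) ** 2)
print(f"train MSE = {train_mse:.2e}, estimated generalization error = {test_mse:.2e}")
```

The Laplace kernel is used in the example because its Gram matrix is usually well conditioned; with `gaussian_kernel` a small ridge term may be needed for numerical stability.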

Kernel regression

A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with norm

\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}\underline{x} \mathrm{d} \underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)

where \(K^{-1}(\underline{x},\underline{y})\) is such that

\(\int \mathrm{d} \underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)

\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)

Reproducing Kernel Hilbert Space

Regression: performance depends on the target function!

If only assumed to be Lipschitz, then \(\beta=\frac1d\) (curse of dimensionality!)
If assumed to be in the RKHS, then \(\beta\) does not depend on \(d\)
Yet, RKHS is a very strong assumption on the smoothness of the target function (see later on)

[Luxburg and Bousquet 2004]

[Smola et al. 1998, Rudi and Rosasco 2017]

[Bach 2017]

Previous works

\(d\) = dimension of the input space

Datasets and algorithms

We apply kernel methods on
- MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd
- CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5

We perform
- regression \(\longrightarrow\) kernel regression
- classification \(\longrightarrow\) margin SVM

Real exponents

MNIST: \(\beta\approx0.4\)
CIFAR10: \(\beta\approx0.1\)

Same exponent for regression and classification
Same exponent for Gaussian and Laplace kernel
MNIST and CIFAR10 display exponents \(\beta\) different from \(\frac12,\frac1d\)

We need a new framework!

Teacher-Student: simulation

Controlled setting: Teacher-Student regression
Training data are sampled from a Gaussian Process:

\(Z(\underline{x}_1),\dots,Z(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\) \(\longrightarrow\) \(\mathbb{E} Z(\underline{x}_\mu) = 0\), \(\mathbb{E} Z(\underline{x}_\mu) Z(\underline{x}_\nu) = K_T(\underline{x}_\mu-\underline{x}_\nu)\)
\(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere
Regression is done with another kernel \(K_S\)

[Figure: generalization error versus \(n\), with fitted exponent \(-\beta\)]

Can we understand these curves?

Teacher-Student: analytical

Regression: the solution can be written explicitly

\(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}\)

where

\((\underline{Z})_\mu = Z(\underline{x}_\mu)\)

\((\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x})\)

\((\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)\)

Compute the generalization error \(\epsilon\) and how it scales with \(n\)

\(\epsilon = \mathbb{E}_T \int\mathrm{d}\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z(\underline{x}) \right]^2 \sim n^{-\beta}\)
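A small numerical version of this Teacher-Student setup might look as follows. It is only a sketch under assumptions: points are drawn uniformly on the unit sphere embedded in `dim` ambient dimensions, both Teacher and Student are Laplace kernels with \(\sigma=1\) (a Gaussian Teacher tends to give an ill-conditioned covariance), a tiny `jitter` is added before the Cholesky factorization, and the integral over \(\underline{x}\) is replaced by an average over held-out points. All function names and sizes are hypothetical.

```python
import numpy as np

def random_on_sphere(n, dim, rng):
    """n points drawn uniformly on the unit sphere embedded in R^dim."""
    x = rng.standard_normal((n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_kernel(a, b, sigma=1.0):
    """K(x, x') = exp(-||x - x'|| / sigma) for all pairs of rows of a and b."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def teacher_student_error(n, n_test, dim, kernel_t, kernel_s, rng, jitter=1e-10):
    """One Teacher-Student run: sample Z ~ N(0, K_T), regress with K_S, return test MSE."""
    x = random_on_sphere(n + n_test, dim, rng)
    cov = kernel_t(x, x) + jitter * np.eye(n + n_test)
    z = np.linalg.cholesky(cov) @ rng.standard_normal(n + n_test)  # Teacher field
    x_train, z_train = x[:n], z[:n]
    x_test, z_test = x[n:], z[n:]
    coeffs = np.linalg.solve(kernel_s(x_train, x_train), z_train)  # K_S^{-1} Z
    z_hat = kernel_s(x_test, x_train) @ coeffs                     # k_S(x) . K_S^{-1} Z
    return np.mean((z_hat - z_test) ** 2)

rng = np.random.default_rng(0)
errors = {}
for n in (50, 400):
    runs = [teacher_student_error(n, 200, 3, laplace_kernel, laplace_kernel, rng)
            for _ in range(10)]
    errors[n] = np.mean(runs)  # average over Teacher realizations (E_T)
    print(f"n = {n:4d}: estimated generalization error = {errors[n]:.4f}")
```

Repeating this over a wider range of \(n\) and fitting the slope on a log-log scale gives a numerical estimate of \(\beta\).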

Teacher-Student: analytical

To compute the generalization error:

We look at the problem in the frequency domain
We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)
SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!

Then we can show that, for \(n\gg1\),

\(\epsilon \sim n^{-\beta}\) with \(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)

(details: arXiv:1905.10843)

Teacher-Student

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details
\(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)
If \(\alpha_S\) is large enough (\(2\alpha_S \geq \alpha_T - d\)), \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)
If \(\alpha_S\) is smaller (\(2\alpha_S < \alpha_T - d\)), \(\beta=\frac{2\alpha_S}d\)

Teacher-Student

What is the prediction for our simulations?

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac{\alpha_T-d}d = \frac1d\) (curse of dimensionality!)
If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}d = 2+\frac2d\)
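The two predictions can be checked by evaluating the formula directly. The helper below is a hypothetical illustration; it only encodes \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) together with the values \(\alpha=d+1\) (Laplace) and \(\alpha=\infty\) (Gaussian) quoted in the talk.

```python
import math

def predicted_beta(alpha_t, alpha_s, d):
    """beta = min(alpha_T - d, 2 * alpha_S) / d, as quoted in the talk."""
    return min(alpha_t - d, 2 * alpha_s) / d

def laplace_alpha(d):
    """Decay exponent of the Laplace kernel spectrum: alpha = d + 1."""
    return d + 1

GAUSSIAN_ALPHA = math.inf  # the Gaussian spectrum decays faster than any power law

d = 5  # example input dimension (arbitrary choice)
beta_laplace_laplace = predicted_beta(laplace_alpha(d), laplace_alpha(d), d)
beta_gaussian_laplace = predicted_beta(GAUSSIAN_ALPHA, laplace_alpha(d), d)
print(beta_laplace_laplace)   # equals 1/d
print(beta_gaussian_laplace)  # equals 2 + 2/d
```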

Teacher-Student: comparison

[Figure: numerical learning curves (points on hypersphere) compared with the predicted exponent \(-\beta\)]

Our result matches the numerical simulations
There are finite size effects (small \(n\))

Same result with points on regular lattice or random hypersphere?

What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)

Nearest-neighbor distance

In both cases \(\delta\sim n^{-\frac1d}\)

Finite size effects: asymptotic scaling only when \(n\) is large enough

(conjecture)

What about real data?

\(\longrightarrow\) assume they are instances of some Gaussian process \(K_T\)

Real data

Such instances are \(s\)-times (mean-square) differentiable with \(s=\frac{\alpha_T-d}2\)
Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\) (since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) is independent of \(\alpha_S\))

\(\longrightarrow\) \(s=\frac12 \beta d\), \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)

This number is unreasonably large!
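For reference, the arithmetic behind these smoothness estimates is spelled out below, using only the fitted exponents and the input dimensions quoted in the talk (the slide rounds the results down).

```latex
s = \tfrac12\,\beta\, d:
\qquad
\text{MNIST: } \tfrac12 \times 0.4 \times 784 = 156.8,
\qquad
\text{CIFAR10: } \tfrac12 \times 0.1 \times 3072 = 153.6
```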

Effective dimension

Measure NN-distance \(\delta\): \(\delta\sim n^{-\mathrm{some\ exponent}}\)
\(\longrightarrow\) Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)

MNIST: \(\beta\approx0.4\), \(d=784\)
CIFAR10: \(\beta\approx0.1\), \(d=3072\)

\(d_\mathrm{eff}\) is much smaller than \(d\) \(\longrightarrow\) \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\) is more reasonable
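One way to estimate such an effective dimension is sketched below: measure the mean nearest-neighbor distance on random subsets of increasing size, fit the slope of \(\log\delta\) versus \(\log n\), and set \(d_\mathrm{eff} = -1/\text{slope}\). The data here are synthetic and hypothetical (a 2-dimensional sheet linearly embedded in 20 dimensions), so the estimate should come out close to 2 rather than 20; the subset sizes and function names are illustrative assumptions, not the procedure used in the talk.

```python
import numpy as np

def mean_nn_distance(points):
    """Mean Euclidean distance from each point to its nearest neighbor."""
    sq_norms = np.sum(points**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # exclude self-distances
    return np.mean(np.sqrt(np.maximum(sq_dists.min(axis=1), 0.0)))

def effective_dimension(points, sizes, rng):
    """Fit delta ~ n^(-1/d_eff) on random subsets of the given sizes."""
    deltas = []
    for n in sizes:
        subset = points[rng.choice(len(points), size=n, replace=False)]
        deltas.append(mean_nn_distance(subset))
    slope, _ = np.polyfit(np.log(sizes), np.log(deltas), deg=1)
    return -1.0 / slope

rng = np.random.default_rng(0)
latent = rng.uniform(size=(1600, 2))       # intrinsic 2-dim coordinates
embedding = rng.standard_normal((2, 20))   # linear map into 20 ambient dimensions
points = latent @ embedding

d_eff = effective_dimension(points, sizes=[200, 400, 800, 1600], rng=rng)
print(f"ambient dimension = {points.shape[1]}, estimated d_eff = {d_eff:.2f}")
```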

RKHS and smoothness

Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that
- is an instance of a Teacher \(K_T\) (\(\alpha_T\))
- lies in the RKHS of a Student \(K_S\) (\(\alpha_S\))

\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty\) \(\Longrightarrow\) \(\alpha_T > \alpha_S + d\)

\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) \(\Longrightarrow\) \(\alpha_S > d\)

Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!) \(\longrightarrow \beta > \frac12\)
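The two inequalities can be motivated by a power-counting argument in the frequency domain. The sketch below assumes translation-invariant kernels, so that \(K_S^{-1}\) acts as division by \(\tilde{K}_S(\underline{w})\) in Fourier space, ignores overall volume factors, and keeps only the large-\(|\!|\underline{w}|\!|\) behavior \(\tilde{K}\sim|\!|\underline{w}|\!|^{-\alpha}\); it is a heuristic outline, not the full derivation.

```latex
\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2
  \;\propto\; \int \mathrm{d}^d\underline{w}\,
      \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})}
  \;\sim\; \int^{\infty} \mathrm{d}w\; w^{d-1}\, w^{\alpha_S-\alpha_T} < \infty
  \iff \alpha_T - \alpha_S > d

K_S(\underline{0})
  \;\propto\; \int \mathrm{d}^d\underline{w}\, \tilde{K}_S(\underline{w})
  \;\sim\; \int^{\infty} \mathrm{d}w\; w^{d-1}\, w^{-\alpha_S} < \infty
  \iff \alpha_S > d

\alpha_T > \alpha_S + d > 2d
  \;\Longrightarrow\;
  s = \frac{\alpha_T - d}{2} > \frac{d}{2}
```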

Conclusion

MNIST and CIFAR10 display power laws in the learning curves, with exponents \(\beta\approx 0.4,0.1\) (resp.) \(\gg\frac1d\)
\(\beta\) is the same for regression and classification tasks with Gaussian and Laplace kernels
We introduced a new framework that allows for different degrees of smoothness in the data, where we can compute \(\beta\)
We defined an effective dimension for real data (\(\ll d\)), that is linked to an effective smoothness \(s\)

MNIST: \(\beta\approx0.4\), \(d=784\)
CIFAR10: \(\beta\approx0.1\), \(d=3072\)
\(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\)

arXiv:1905.10843


spigler.net/stefano
