# Asymptotic Learning Curves of Kernel Methods

Stefano Spigler, Mario Geiger, Matthieu Wyart

• Why and how does deep supervised learning work?

• Learn from examples: how many are needed?

• Regression (fitting functions)

• Classification

## Supervised deep learning

• Performance is evaluated through the generalization error $$\epsilon$$

• Learning curves decay with the number of examples $$n$$, often as $$\epsilon\sim n^{-\beta}$$

• $$\beta$$ depends on the dataset and on the algorithm

• Deep networks: $$\beta\sim 0.07$$-$$0.35$$ [Hestness et al. 2017]

## Learning curves

• We lack a theory for $$\beta$$ for deep networks

• Performance increases with overparametrization [Neyshabur et al. 2017, 2018, Advani and Saxe 2017, Belkin et al. 2018, Spigler et al. 2018, Geiger et al. 2019]

$$\longrightarrow$$ study the infinite-width limit! [Jacot et al. 2018]

[figure: generalization error $$\epsilon$$ vs. network width $$h$$]

• With a specific scaling, the infinite-width limit $$\to$$ kernel learning [Mei et al. 2017, Rotskoff and Vanden-Eijnden 2018, Jacot et al. 2018, Chizat and Bach 2018, ...]

• Some kernels achieve almost the performance of deep networks [Bruna and Mallat 2013, Arora et al. 2019]

• What are the learning curves of kernels like? (next slide)

## Kernel methods

• Kernel methods learn non-linear functions or boundaries

• Data are mapped to a feature space, where the problem is treated linearly:

data $$\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow$$ use a linear combination of features

• Only scalar products are needed: $$\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow$$ kernel $$K(\underline{x},\underline{x}^\prime)$$

• Gaussian: $$K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{\lVert\underline{x}-\underline{x}^\prime\rVert^2}{\sigma^2}\right)$$

• Laplace: $$K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{\lVert\underline{x}-\underline{x}^\prime\rVert}{\sigma}\right)$$

## Kernel regression

E.g. kernel regression:

• Target function: $$\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n$$

• Build an estimator $$\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})$$

• Minimize the training MSE $$= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2$$

• Estimate the generalization error $$\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2$$
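A minimal numpy sketch of this procedure on a synthetic target (the Gaussian kernel, the bandwidth, and the tiny jitter added for numerical stability are illustrative choices, not from the slides):

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2), computed for all pairs
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / sigma**2)

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.standard_normal((n, d))
Z = np.sin(X.sum(axis=1))                        # synthetic target function

# Minimizing the training MSE gives the interpolating coefficients c = K^{-1} Z
K = gaussian_kernel(X, X, sigma=2.0)
c = np.linalg.solve(K + 1e-10 * np.eye(n), Z)    # jitter for numerical stability

# Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x), evaluated on fresh test points
X_test = rng.standard_normal((5000, d))
Z_hat = gaussian_kernel(X_test, X, sigma=2.0) @ c
eps = np.mean((Z_hat - np.sin(X_test.sum(axis=1))) ** 2)
print(f"generalization error = {eps:.4f}")
```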

## Reproducing Kernel Hilbert Space

A kernel $$K$$ induces a corresponding Hilbert space $$\mathcal{H}_K$$ with squared norm

$$\lVert Z \rVert_K^2 = \int \mathrm{d}\underline{x}\, \mathrm{d}\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})$$

where $$K^{-1}(\underline{x},\underline{y})$$ is such that

$$\int \mathrm{d}\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x}-\underline{z})$$

$$\mathcal{H}_K$$ is called the Reproducing Kernel Hilbert Space (RKHS)
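A side note used later (a standard fact about translation-invariant kernels, not stated on the slides): in Fourier space the RKHS norm becomes diagonal,

$$\lVert Z \rVert_K^2 \propto \int \mathrm{d}\underline{w}\, \frac{|\tilde{Z}(\underline{w})|^2}{\tilde{K}(\underline{w})}$$

so $$Z\in\mathcal{H}_K$$ requires $$|\tilde{Z}(\underline{w})|^2/\tilde{K}(\underline{w})$$ to be integrable, i.e. $$\tilde{Z}$$ must decay fast at the high frequencies where $$\tilde{K}$$ does.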

## Previous works

($$d$$ = dimension of the input space)

Regression: performance depends on the target function!

• If the target is only assumed to be Lipschitz, then $$\beta=\frac1d$$ $$\longrightarrow$$ curse of dimensionality! [Luxburg and Bousquet 2004]

• If it is assumed to be in the RKHS, then $$\beta$$ does not depend on $$d$$ (typically $$\beta=\frac12$$) [Smola et al. 1998, Rudi and Rosasco 2017]

• Yet, lying in the RKHS is a very strong assumption on the smoothness of the target function (see later on) [Bach 2017]

## Datasets and algorithms

We apply kernel methods on:

| Dataset | Pictures | Dimension | Classes |
|---|---|---|---|
| MNIST | 70000 28x28 b/w | $$d = 784$$ | 2 classes: even/odd |
| CIFAR10 | 60000 32x32 RGB | $$d = 3072$$ | 2 classes: first 5/last 5 |

We perform:

• regression $$\longrightarrow$$ kernel regression

• classification $$\longrightarrow$$ margin SVM
## Real exponents

• Same exponent for regression and classification

• Same exponent for Gaussian and Laplace kernels

• MNIST and CIFAR10 display exponents $$\beta$$ different from $$\frac12$$ and $$\frac1d$$: $$\beta\approx0.4$$ (MNIST) and $$\beta\approx0.1$$ (CIFAR10)

$$\longrightarrow$$ We need a new framework!

## Teacher-Student: simulation

• Controlled setting: Teacher-Student regression

• Training data are sampled from a Gaussian process: $$Z(\underline{x}_1),\dots,Z(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)$$, i.e. $$\mathbb{E} Z(\underline{x}_\mu) = 0$$ and $$\mathbb{E} Z(\underline{x}_\mu) Z(\underline{x}_\nu) = K_T(\underline{x}_\mu-\underline{x}_\nu)$$

• The points $$\underline{x}_\mu$$ are random on a $$d$$-dim hypersphere

• Regression is done with another (Student) kernel $$K_S$$

## Teacher-Student: simulation

[figure: generalization error $$\epsilon$$ vs. $$n$$, with the fitted exponent $$-\beta$$]
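A minimal sketch of this simulation (Laplace Teacher = Student, small $$d$$; the sizes and the jitter term are illustrative assumptions):

```python
import numpy as np

def laplace_kernel(X1, X2, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    dist = np.linalg.norm(X1[:, None, :] - X2[None, :, :], axis=-1)
    return np.exp(-dist / sigma)

def random_hypersphere(n, d, rng):
    # n points uniform on the unit sphere in d dimensions
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_test = 3, 256
ns, errors = [64, 128, 256, 512, 1024], []
for n in ns:
    # Sample the Teacher GP jointly on train + test points: Z ~ N(0, K_T)
    X = random_hypersphere(n + n_test, d, rng)
    Z = rng.multivariate_normal(np.zeros(n + n_test), laplace_kernel(X, X))
    X_tr, Z_tr, X_te, Z_te = X[:n], Z[:n], X[n:], Z[n:]
    # Student: kernel regression with K_S on the training points
    K_S = laplace_kernel(X_tr, X_tr)
    c = np.linalg.solve(K_S + 1e-10 * np.eye(n), Z_tr)
    Z_hat = laplace_kernel(X_te, X_tr) @ c
    errors.append(np.mean((Z_hat - Z_te) ** 2))

# Fit the exponent: eps ~ n^{-beta}; the theory below predicts beta = 1/d here
beta = -np.polyfit(np.log(ns), np.log(errors), 1)[0]
print(f"fitted beta = {beta:.2f}")
```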

Can we understand these curves?

## Teacher-Student: analytical

Regression: the solution can be written explicitly as

$$\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}$$

where

$$(\underline{Z})_\mu = Z(\underline{x}_\mu), \qquad (\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x}), \qquad (\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)$$

Compute the generalization error $$\epsilon$$ and how it scales with $$n$$:

$$\epsilon = \mathbb{E}_T \int\mathrm{d}\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z(\underline{x}) \right]^2 \sim n^{-\beta}$$
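Where this closed form comes from (a one-line derivation, not spelled out on the slide): the training predictions are $$\hat{Z}_S(\underline{x}_\mu) = (\mathbb{K}_S \underline{c})_\mu$$, so setting the gradient of the training MSE with respect to $$\underline{c}$$ to zero gives

$$\frac{\partial}{\partial \underline{c}}\, \frac1n \lVert \mathbb{K}_S \underline{c} - \underline{Z} \rVert^2 = \frac2n\, \mathbb{K}_S \left( \mathbb{K}_S \underline{c} - \underline{Z} \right) = 0 \;\Longrightarrow\; \underline{c} = \mathbb{K}_S^{-1} \underline{Z}$$

(assuming $$\mathbb{K}_S$$ is invertible), and hence $$\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}$$.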

## Teacher-Student: analytical

To compute the generalization error:

• We look at the problem in the frequency domain

• We assume that $$\tilde{K}_S(\underline{w}) \sim \lVert\underline{w}\rVert^{-\alpha_S}$$ and $$\tilde{K}_T(\underline{w}) \sim \lVert\underline{w}\rVert^{-\alpha_T}$$ as $$\lVert\underline{w}\rVert\to\infty$$ (e.g. Laplace has $$\alpha=d+1$$ and Gaussian has $$\alpha=\infty$$)

• Simplifying assumption: we take the $$n$$ points $$\underline{x}_\mu$$ on a regular $$d$$-dim lattice!

Then we can show that, for $$n\gg1$$,

$$\epsilon \sim n^{-\beta} \qquad \text{with} \qquad \beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$

(details: arXiv:1905.10843)
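For concreteness, the prediction can be evaluated directly (a trivial helper mirroring the formula above; the printed values match the worked cases two slides below):

```python
def predicted_beta(alpha_T, alpha_S, d):
    # beta = (1/d) * min(alpha_T - d, 2 * alpha_S)
    return min(alpha_T - d, 2 * alpha_S) / d

d = 3
print(predicted_beta(d + 1, d + 1, d))         # Laplace/Laplace: 1/d = 0.33...
print(predicted_beta(float("inf"), d + 1, d))  # Gaussian/Laplace: 2 + 2/d = 2.66...
```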

## Teacher-Student

$$\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$

• Large $$\alpha$$ $$\rightarrow$$ fast decay at high frequencies $$\rightarrow$$ indifference to local details

• $$\alpha_T$$ is intrinsic to the data (Teacher), $$\alpha_S$$ depends on the algorithm (Student)

• If $$\alpha_S$$ is large enough ($$2\alpha_S \geq \alpha_T - d$$), $$\beta$$ takes the largest possible value $$\frac{\alpha_T - d}{d}$$ (optimal learning)

• If $$\alpha_S$$ is small enough ($$2\alpha_S < \alpha_T - d$$), $$\beta=\frac{2\alpha_S}d$$

## Teacher-Student

What does $$\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$ predict for our simulations?

• If Teacher = Student = Laplace ($$\alpha_T=\alpha_S=d+1$$): $$\beta=\frac{\alpha_T-d}d = \frac1d$$ (curse of dimensionality!)

• If Teacher = Gaussian, Student = Laplace ($$\alpha_T=\infty$$, $$\alpha_S=d+1$$): $$\beta=\frac{2\alpha_S}d = 2+\frac2d$$

## Teacher-Student: comparison

[figure: fitted exponent $$-\beta$$ vs. the prediction, for points on the hypersphere]

• Our result matches the numerical simulations

• There are finite-size effects (at small $$n$$)

## Nearest-neighbor distance

Why the same result with points on a regular lattice and random on the hypersphere? (conjecture)

• What matters is how the nearest-neighbor distance $$\delta$$ scales with $$n$$

• In both cases $$\delta\sim n^{-\frac1d}$$

• Finite-size effects: the asymptotic scaling sets in only when $$n$$ is large enough

## Real data

$$\longrightarrow$$ assume real data are instances of some Gaussian process $$K_T$$

• Such instances are $$s$$-times (mean-square) differentiable with $$s=\frac{\alpha_T-d}2$$

• Fitted exponents are $$\beta\approx0.4$$ (MNIST) and $$\beta\approx0.1$$ (CIFAR10), regardless of the Student; since $$\beta=\frac1d\min(\alpha_T-d,2\alpha_S)$$ is independent of $$\alpha_S$$, this forces $$\beta=\frac{\alpha_T-d}d$$

• $$\longrightarrow$$ $$s=\frac12 \beta d$$, i.e. $$s\approx 0.2d\approx156$$ (MNIST) and $$s\approx0.05d\approx153$$ (CIFAR10)

• This number is unreasonably large!

## Effective dimension

• Measure the nearest-neighbor distance $$\delta$$

• $$\delta\sim n^{-\mathrm{some\ exponent}}$$ $$\longrightarrow$$ define the effective dimension via $$\delta \sim n^{-\frac1{d_\mathrm{eff}}}$$

| | $$\beta$$ | $$d$$ | $$d_\mathrm{eff}$$ | $$s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor$$ |
|---|---|---|---|---|
| MNIST | 0.4 | 784 | 15 | 3 |
| CIFAR10 | 0.1 | 3072 | 35 | 1 |

$$\longrightarrow$$ $$d_\mathrm{eff}$$ is much smaller than $$d$$ $$\longrightarrow$$ the smoothness $$s$$ is more reasonable
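A sketch of how such a measurement can be done (on synthetic uniform data instead of MNIST/CIFAR10; the scipy nearest-neighbor query is an implementation choice, not the authors' pipeline):

```python
import numpy as np
from scipy.spatial import cKDTree

def mean_nn_distance(X):
    # Mean distance from each point to its nearest neighbor
    dist, _ = cKDTree(X).query(X, k=2)  # k=2: the first hit is the point itself
    return dist[:, 1].mean()

rng = np.random.default_rng(0)
d = 10
ns = [500, 1000, 2000, 4000, 8000]
deltas = [mean_nn_distance(rng.random((n, d))) for n in ns]

# delta ~ n^{-1/d_eff}  =>  d_eff = -1 / (slope of log delta vs log n)
slope = np.polyfit(np.log(ns), np.log(deltas), 1)[0]
print(f"estimated d_eff = {-1.0 / slope:.1f} (true d = {d})")
```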

## RKHS and smoothness

• Indeed: what happens if we consider a field $$Z_T(\underline{x})$$ that

  • is an instance of a Teacher $$K_T$$ (exponent $$\alpha_T$$)
  • lies in the RKHS of a Student $$K_S$$ (exponent $$\alpha_S$$)?

$$\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d\underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d\underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty \;\Longrightarrow\; \alpha_T > \alpha_S + d$$

$$K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty \;\Longrightarrow\; \alpha_S > d$$

Combining the two conditions gives $$\alpha_T > 2d$$, so the smoothness must be $$s = \frac{\alpha_T-d}2 > \frac{d}2$$ (it scales with $$d$$!) $$\longrightarrow \beta > \frac12$$
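Both conditions are transparent in Fourier space (using the diagonal form of the RKHS norm noted earlier, together with $$\mathbb{E}_T |\tilde{Z}_T(\underline{w})|^2 = \tilde{K}_T(\underline{w})$$ for the Teacher GP):

$$\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2 \propto \int \mathrm{d}\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})} < \infty \;\Leftrightarrow\; \alpha_T - \alpha_S > d, \qquad \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty \;\Leftrightarrow\; \alpha_S > d$$

since the integrands decay as $$\lVert\underline{w}\rVert^{\alpha_S-\alpha_T}$$ and $$\lVert\underline{w}\rVert^{-\alpha_S}$$ respectively at high frequency (convergence at low frequency is assumed benign).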

## Conclusion

• MNIST and CIFAR10 display power laws in their learning curves, with exponents $$\beta\approx 0.4$$ and $$0.1$$ (resp.), $$\gg\frac1d$$

• $$\beta$$ is the same for regression and classification tasks, with Gaussian and Laplace kernels

• We introduced a new framework that allows for different degrees of smoothness in the data, in which we can compute $$\beta$$

• We defined an effective dimension for real data ($$\ll d$$), which is linked to an effective smoothness $$s$$:

| | $$\beta$$ | $$d$$ | $$d_\mathrm{eff}$$ | $$s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor$$ |
|---|---|---|---|---|
| MNIST | 0.4 | 784 | 15 | 3 |
| CIFAR10 | 0.1 | 3072 | 35 | 1 |
