Asymptotic Learning Curves of Kernel Methods

Stefano Spigler, Mario Geiger, Matthieu Wyart

Supervised deep learning

  • Why and how does deep supervised learning work?

  • Learn from examples: how many are needed?

  • Typical tasks:

    • Regression (fitting functions)

    • Classification

Learning curves

  • Performance is evaluated through the generalization error \(\epsilon\)

  • Learning curves decay with the number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)

  • \(\beta\) depends on the dataset and on the algorithm

Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]

We lack a theory for \(\beta\) for deep networks
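As an illustration of how an exponent \(\beta\) can be read off a learning curve, the sketch below fits a straight line to \(\log\epsilon\) versus \(\log n\). The function name `fit_power_law_exponent` and the synthetic data are hypothetical and are not taken from the slides.

```python
import math

def fit_power_law_exponent(ns, errors):
    """Least-squares slope of log(error) versus log(n).

    Returns beta such that error ~ n^(-beta).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

# Synthetic learning curve with a known exponent (hypothetical data).
ns = [100, 200, 400, 800, 1600]
errors = [3.0 * n ** -0.4 for n in ns]
beta = fit_power_law_exponent(ns, errors)
```

On real learning curves the fit should be restricted to the range of \(n\) where the curve is actually a straight line in log-log scale.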

Link with kernel learning

  • Performance increases with overparametrization [Neyshabur et al. 2017, 2018, Advani and Saxe 2017, Belkin et al. 2018, Spigler et al. 2018, Geiger et al. 2019]

      \(\longrightarrow\) study the infinite-width limit! [Mei et al. 2017, Rotskoff and Vanden-Eijnden 2018, Jacot et al. 2018, Chizat and Bach 2018, ...]

[Figure: generalization error \(\epsilon\) versus network width \(h\)]

  • With a specific scaling, infinite-width limit \(\to\) kernel learning [Jacot et al. 2018] (next slide)

  • Some kernels achieve almost the performance of deep networks [Bruna and Mallat 2013, Arora et al. 2019]

What are the learning curves of kernels like?

Kernel methods

  • Kernel methods learn non-linear functions or boundaries

  • Data are mapped to a feature space, where the problem is treated linearly

data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features

only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)

Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{\sigma^2}\right)\)

Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)\)
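The two kernels can be written down directly from their definitions. The following is a minimal sketch; the function names and the default \(\sigma=1\) are illustrative choices, not part of the slides.

```python
import math

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / sigma^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / sigma ** 2)

def laplace_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y|| / sigma)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / sigma)

# Two points at Euclidean distance 5; with sigma = 5 both kernels give exp(-1).
k_gauss = gaussian_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)
k_laplace = laplace_kernel([0.0, 0.0], [3.0, 4.0], sigma=5.0)
```

The Gaussian kernel is smooth at \(\underline{x}=\underline{x}^\prime\), whereas the Laplace kernel has a cusp there; this difference is what the exponent \(\alpha\) introduced later quantifies.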

E.g. kernel regression:

  • Target function  \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)

     
  • Build an estimator  \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)

     
  • Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)

     
  • Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
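The four steps above can be sketched with NumPy as follows. This is only an illustration under stated assumptions: the target function, the Laplace kernel with \(\sigma=1\), the input distribution and the small `jitter` added to the Gram matrix for numerical stability are all hypothetical choices that do not come from the slides.

```python
import numpy as np

def laplace_gram(A, B, sigma=1.0):
    """Laplace kernel matrix between the rows of A (m, d) and the rows of B (n, d)."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def fit_kernel_regression(X, Z, sigma=1.0, jitter=1e-10):
    """Coefficients c_mu that interpolate the training targets (zero training MSE).

    `jitter` is an assumed numerical safeguard, not part of the method.
    """
    K = laplace_gram(X, X, sigma)
    return np.linalg.solve(K + jitter * np.eye(len(X)), Z)

def predict(X_train, coeffs, X_new, sigma=1.0):
    """Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x)."""
    return laplace_gram(X_new, X_train, sigma) @ coeffs

def target(X):
    """Hypothetical target function, used only for this illustration."""
    return np.sin(3.0 * X[:, 0]) + X[:, 1]

rng = np.random.default_rng(0)
X_train = rng.uniform(-1.0, 1.0, size=(50, 2))
coeffs = fit_kernel_regression(X_train, target(X_train))

# Training MSE, then a Monte Carlo estimate of the generalization error.
train_mse = np.mean((predict(X_train, coeffs, X_train) - target(X_train)) ** 2)
X_test = rng.uniform(-1.0, 1.0, size=(2000, 2))
test_mse = np.mean((predict(X_train, coeffs, X_test) - target(X_test)) ** 2)
```

Repeating the last lines for increasing training set sizes gives one learning curve \(\epsilon(n)\).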

Kernel regression

Reproducing Kernel Hilbert Space

A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with norm

\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}\underline{x} \mathrm{d} \underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)

where \(K^{-1}(\underline{x},\underline{y})\) is such that

\(\int \mathrm{d} \underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)

\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)

Previous works

Regression: performance depends on the target function! (\(d\) = dimension of the input space)

  • If only assumed to be Lipschitz, then \(\beta=\frac1d\) \(\longrightarrow\) curse of dimensionality! [Luxburg and Bousquet 2004]

  • If assumed to be in the RKHS, then \(\beta\) does not depend on \(d\) [Smola et al. 1998, Rudi and Rosasco 2017]

  • Yet, RKHS is a very strong assumption on the smoothness of the target function (see later on) [Bach 2017]

Datasets and algorithms

We apply kernel methods on

  • MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd

  • CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5

We perform

  • regression \(\longrightarrow\) kernel regression

  • classification \(\longrightarrow\) margin SVM
Real exponents

  • Same exponent for regression and classification

  • Same exponent for Gaussian and Laplace kernel

  • MNIST and CIFAR10 display exponents \(\beta\) different from \(\frac12,\frac1d\): \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10)

We need a new framework!

Teacher-Student: simulation

  • Controlled setting: Teacher-Student regression

  • Training data are sampled from a Gaussian Process:

          \(Z(\underline{x}_1),\dots,Z(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\), i.e. \(\mathbb{E} Z(\underline{x}_\mu) = 0\) and \(\mathbb{E} Z(\underline{x}_\mu) Z(\underline{x}_\nu) = K_T(\underline{x}_\mu-\underline{x}_\nu)\)
          \(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere

  • Regression is done with another kernel \(K_S\)
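A Teacher sample of this kind can be drawn with a Cholesky factorization of the Teacher Gram matrix. The sketch below assumes a Laplace Teacher, takes the \(d\)-dim hypersphere to be the unit sphere embedded in \(\mathbb{R}^{d+1}\), and adds a small `jitter` for numerical stability; these conventions and the function names are assumptions made for illustration.

```python
import numpy as np

def sample_on_sphere(n, d, rng):
    """n points uniform on the unit d-sphere, embedded in R^(d+1) (assumed convention)."""
    X = rng.standard_normal((n, d + 1))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def sample_teacher(X, rng, sigma=1.0, jitter=1e-10):
    """One draw of (Z(x_1), ..., Z(x_n)) ~ N(0, K_T) for a Laplace Teacher kernel."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    K_T = np.exp(-dists / sigma)
    # If L L^T = K_T and g ~ N(0, I), then L g has covariance K_T.
    L = np.linalg.cholesky(K_T + jitter * np.eye(len(X)))
    return L @ rng.standard_normal(len(X))

rng = np.random.default_rng(0)
X = sample_on_sphere(200, 3, rng)
Z = sample_teacher(X, rng)
```

To evaluate a Student on held-out points, training and test inputs should be stacked into a single array before sampling, so that both sets of values come from the same Teacher draw.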

[Figure: Teacher-Student simulation. Panels: generalization error; exponent \(-\beta\)]

Can we understand these curves?

Teacher-Student: analytical

Regression: the solution can be written explicitly

\(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}\)

where

\((\underline{Z})_\mu = Z(\underline{x}_\mu)\)
\((\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x})\)
\((\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)\)

Compute the generalization error \(\epsilon\) and how it scales with \(n\)

\(\epsilon = \mathbb{E}_T \int\mathrm{d}\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z(\underline{x}) \right]^2 \sim n^{-\beta}\)
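Because the predictor is linear in \(\underline{Z}\), the Teacher average can be taken in closed form for a fixed set of training points: \(\mathbb{E}_T[\hat{Z}_S(\underline{x}) - Z(\underline{x})]^2 = \underline{w}\cdot\mathbb{K}_T\underline{w} - 2\,\underline{w}\cdot\underline{k}_T(\underline{x}) + K_T(\underline{x},\underline{x})\) with \(\underline{w}=\mathbb{K}_S^{-1}\underline{k}_S(\underline{x})\), where \(\mathbb{K}_T\) and \(\underline{k}_T\) are defined like their Student counterparts. The sketch below evaluates this expression numerically; the function names, the Laplace kernels, the choice \(d=1\) and the sphere embedded in \(\mathbb{R}^{d+1}\) are assumptions made for illustration, and the integral over \(\underline{x}\) is replaced by an average over random test points.

```python
import numpy as np

def sphere_points(n, d, rng):
    """n points uniform on the unit d-sphere, embedded in R^(d+1) (assumed convention)."""
    X = rng.standard_normal((n, d + 1))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

def laplace(A, B, sigma=1.0):
    """Laplace kernel matrix between the rows of A and the rows of B."""
    dists = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def expected_error(X, X_test, k_teacher, k_student):
    """Teacher-averaged squared error of the Student predictor, averaged over X_test."""
    # Column j of W is K_S^{-1} k_S(x_j) for the j-th test point.
    W = np.linalg.solve(k_student(X, X), k_student(X, X_test))
    K_T = k_teacher(X, X)
    k_T = k_teacher(X, X_test)
    prior = np.diag(k_teacher(X_test, X_test))  # K_T(x, x) at each test point
    pointwise = np.sum(W * (K_T @ W), axis=0) - 2.0 * np.sum(W * k_T, axis=0) + prior
    return float(np.mean(pointwise))

rng = np.random.default_rng(0)
X_test = sphere_points(300, 1, rng)
errors = {
    n: expected_error(sphere_points(n, 1, rng), X_test, laplace, laplace)
    for n in (16, 64, 256)
}
```

Plotting `errors` against \(n\) in log-log scale gives a numerical learning curve without sampling noise from the Teacher.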

Teacher-Student: analytical

To compute the generalization error:

  • We look at the problem in the frequency domain

  • We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)

      E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)

  • SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!

Then we can show that, for \(n\gg1\),

\(\epsilon \sim n^{-\beta}\) with \(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

(details: arXiv:1905.10843)
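The quoted values of \(\alpha\) can be checked by hand in the simplest case \(d=1\), where the Fourier transforms of both kernels are standard integrals. This is only a one-dimensional sanity check, not the general-\(d\) computation.

```latex
% d = 1: Fourier transforms of the Laplace and Gaussian kernels
\[
\tilde{K}_{\mathrm{Laplace}}(w)
  = \int_{-\infty}^{\infty} e^{-|x|/\sigma}\, e^{-i w x}\, \mathrm{d}x
  = \frac{2\sigma}{1+\sigma^2 w^2}
  \;\sim\; |w|^{-2} = |w|^{-(d+1)}
  \qquad (|w|\to\infty)
\]
\[
\tilde{K}_{\mathrm{Gauss}}(w)
  = \int_{-\infty}^{\infty} e^{-x^2/\sigma^2}\, e^{-i w x}\, \mathrm{d}x
  = \sigma\sqrt{\pi}\; e^{-\sigma^2 w^2/4}
\]
```

The Gaussian transform decays faster than any power of \(|w|\), which is the sense in which \(\alpha=\infty\).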

Teacher-Student

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

  • Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details

  • \(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)

  • If \(\alpha_S\) is large enough (\(2\alpha_S \geq \alpha_T - d\)), \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)

  • As soon as \(\alpha_S\) is small enough (\(2\alpha_S < \alpha_T - d\)), \(\beta=\frac{2\alpha_S}d\)

Teacher-Student

What is the prediction for our simulations? \(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

  • If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\))

      \(\beta=\frac{\alpha_T-d}d = \frac1d\) (curse of dimensionality!)

  • If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\))

      \(\beta=\frac{2\alpha_S}d = 2+\frac2d\)
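Both predictions follow from plugging the values of \(\alpha\) into the formula for \(\beta\). A short check, with hypothetical helper names and an arbitrary choice \(d=4\):

```python
import math

def predicted_beta(alpha_teacher, alpha_student, d):
    """beta = min(alpha_T - d, 2 * alpha_S) / d."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

def laplace_alpha(d):
    """Decay exponent of the Laplace kernel in dimension d."""
    return d + 1

GAUSSIAN_ALPHA = math.inf  # the Gaussian kernel decays faster than any power law

d = 4
# Teacher = Student = Laplace: the Teacher term wins, beta = 1/d.
beta_laplace_laplace = predicted_beta(laplace_alpha(d), laplace_alpha(d), d)
# Teacher = Gaussian, Student = Laplace: the Student term wins, beta = 2 + 2/d.
beta_gaussian_laplace = predicted_beta(GAUSSIAN_ALPHA, laplace_alpha(d), d)
```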

Teacher-Student: comparison

[Figure: exponent \(-\beta\), prediction versus simulations (on hypersphere)]

  • Our result matches the numerical simulations

  • There are finite size effects (small \(n\))

Nearest-neighbor distance

Same result with points on regular lattice or random hypersphere?

What matters is how nearest-neighbor distance \(\delta\) scales with \(n\) (conjecture)

In both cases \(\delta\sim n^{-\frac1d}\)

Finite size effects: asymptotic scaling only when \(n\) is large enough
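The scaling of \(\delta\) with \(n\) is easy to check numerically. The sketch below compares a regular lattice with uniformly random points; for simplicity it uses the unit square (\(d=2\)) rather than a hypersphere, which is an assumption of this illustration, as are the function names.

```python
import numpy as np

def mean_nn_distance(X):
    """Average distance from each row of X to its nearest neighbor."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)  # exclude each point's distance to itself
    return float(np.mean(np.sqrt(np.maximum(D2.min(axis=1), 0.0))))

def lattice_points(m):
    """m x m regular lattice in the unit square, with spacing 1/m."""
    g = np.arange(m) / m
    return np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

rng = np.random.default_rng(0)
sides = [10, 20, 40]
ns = [m * m for m in sides]
delta_lattice = [mean_nn_distance(lattice_points(m)) for m in sides]
delta_random = [mean_nn_distance(rng.uniform(size=(n, 2))) for n in ns]

# Slopes of log(delta) versus log(n); both should be close to -1/d = -0.5.
slope_lattice = np.polyfit(np.log(ns), np.log(delta_lattice), 1)[0]
slope_random = np.polyfit(np.log(ns), np.log(delta_random), 1)[0]
```

The prefactors of the two curves differ, but only the exponent enters the argument.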

Real data

What about real data?

\(\longrightarrow\) assume they are instances of some Gaussian process \(K_T\)

  • Such instances are \(s\)-times (mean-square) differentiable with \(s=\frac{\alpha_T-d}2\)

  • Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\)
      (since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) indep. of \(\alpha_S \longrightarrow \beta=\frac{\alpha_T-d}d\))

\(\longrightarrow\) \(s=\frac12 \beta d\), \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)

This number is unreasonably large!

Effective dimension

  • Measure NN-distance \(\delta\)

  • \(\delta\sim n^{-\mathrm{some\ exponent}} \)

\(\longrightarrow\) Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)
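One way to estimate \(d_\mathrm{eff}\) from a dataset is to subsample it at several sizes, measure the mean nearest-neighbor distance at each size, and fit the slope in log-log scale. The sketch below does this on hypothetical synthetic data (a 2-sphere rotated into \(\mathbb{R}^{50}\)), chosen so that the expected answer is known; the function names and subsample sizes are illustrative assumptions, and this is not necessarily the exact procedure used for MNIST and CIFAR10.

```python
import numpy as np

def mean_nn_distance(X):
    """Average distance from each row of X to its nearest neighbor."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D2, np.inf)  # exclude each point's distance to itself
    return float(np.mean(np.sqrt(np.maximum(D2.min(axis=1), 0.0))))

def effective_dimension(X, sizes, rng):
    """Fit delta ~ n^(-1/d_eff) on random subsamples of X and return d_eff."""
    deltas = []
    for n in sizes:
        subset = X[rng.choice(len(X), size=n, replace=False)]
        deltas.append(mean_nn_distance(subset))
    slope = np.polyfit(np.log(sizes), np.log(deltas), 1)[0]
    return -1.0 / slope

rng = np.random.default_rng(0)
# Hypothetical data: points on a 2-sphere, isometrically embedded in R^50.
P = rng.standard_normal((2000, 3))
P /= np.linalg.norm(P, axis=1, keepdims=True)
Q, _ = np.linalg.qr(rng.standard_normal((50, 3)))  # 50 x 3, orthonormal columns
X = P @ Q.T

d_eff = effective_dimension(X, [125, 250, 500, 1000, 2000], rng)
```

Here the ambient dimension is 50 while the estimate comes out close to 2, which illustrates how \(d_\mathrm{eff}\) can be much smaller than \(d\).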

           \(\beta\)    \(d\)     \(d_\mathrm{eff}\)    \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\)
MNIST      0.4          784       15                    3
CIFAR10    0.1          3072      35                    1

\(\longrightarrow\) \(d_\mathrm{eff}\) is much smaller

\(\longrightarrow\) \(s\) is more reasonable
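The arithmetic behind the smoothness estimates can be reproduced in a few lines, using the values of \(\beta\), \(d\) and \(d_\mathrm{eff}\) quoted in the slides. The variable names are illustrative.

```python
import math

# (beta, d, d_eff) for each dataset, as quoted in the slides.
datasets = {
    "MNIST": (0.4, 784, 15),
    "CIFAR10": (0.1, 3072, 35),
}

smoothness = {}
for name, (beta, d, d_eff) in datasets.items():
    s_ambient = 0.5 * beta * d                    # using the input dimension d
    s_effective = math.floor(0.5 * beta * d_eff)  # using the effective dimension
    smoothness[name] = (s_ambient, s_effective)
```

With the input dimension the implied smoothness is above 150 for both datasets, while with the effective dimension it drops to 3 and 1.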

RKHS and smoothness

  • Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that

    • is an instance of a Teacher \(K_T\) (\(\alpha_T\))
    • lies in the RKHS of a Student \(K_S\) (\(\alpha_S\))

\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty\) \(\Longrightarrow\) \(\alpha_T > \alpha_S + d\)

\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) \(\Longrightarrow\) \(\alpha_S > d\)

Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!)

\(\longrightarrow \beta > \frac12\)
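The two inequalities can be motivated by power counting in Fourier space. The sketch below assumes translation-invariant kernels, so that both integrals become diagonal in frequency, and assumes that only the large-\(|\!|\underline{w}|\!|\) tail can cause a divergence.

```latex
% Power counting at large ||w||, with \tilde{K}_{T,S}(w) ~ ||w||^{-\alpha_{T,S}}
\[
\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2
  \;\propto\; \int \mathrm{d}^d\underline{w}\,
     \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})}
  \;\sim\; \int^{\infty} \mathrm{d}w\; w^{d-1}\, w^{\alpha_S-\alpha_T}
  < \infty
  \iff \alpha_T > \alpha_S + d
\]
\[
K_S(\underline{0})
  \;\propto\; \int \mathrm{d}^d\underline{w}\, \tilde{K}_S(\underline{w})
  \;\sim\; \int^{\infty} \mathrm{d}w\; w^{d-1-\alpha_S}
  < \infty
  \iff \alpha_S > d
\]
\[
\alpha_T > \alpha_S + d > 2d
  \;\Longrightarrow\;
  s = \frac{\alpha_T-d}{2} > \frac{d}{2}
\]
```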

Conclusion

  • MNIST and CIFAR10 display power laws in the learning curves, with exponents \(\beta\approx 0.4,0.1\) (resp.) \(\gg\frac1d\)

     
  • \(\beta\) is the same for regression and classification tasks with Gaussian and Laplace kernels

     
  • We introduced a new framework that allows for different degrees of smoothness in the data, where we can compute \(\beta\)

     
  • We defined an effective dimension for real data (\(\ll d\)), that is linked to an effective smoothness \(s\)

           \(\beta\)    \(d\)     \(d_\mathrm{eff}\)    \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\)
MNIST      0.4          784       15                    3
CIFAR10    0.1          3072      35                    1

Asymptotic learning curves of kernel methods

By Stefano Spigler
