Kernel methods and the curse of dimensionality

Stefano Spigler

Jonas Paccolat, Mario Geiger, Matthieu Wyart

  • Why and how does deep supervised learning work?


     
  • Learn from examples: how many are needed?


     
  • Typical tasks:
     
    • Regression (fitting functions)
       
    • Classification

  supervised deep learning

  • Performance is evaluated through the generalization error \(\epsilon\)

  • Learning curves decay with the number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)

  • \(\beta\) depends on the dataset and on the algorithm

Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]

We lack a theory for \(\beta\) for deep networks!

  learning curves

  • Performance increases with overparametrization


      \(\longrightarrow\) study the infinite-width limit!





     

[Jacot et al. 2018]

[Bruna and Mallat 2013, Arora et al. 2019]

What are the learning curves of kernels like?

  link with kernel learning

(next slides)

[Figure: plot with axes \(h\) and \(\epsilon\)]

[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]

[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]

  • With a specific scaling, infinite-width limit \(\to\) kernel learning

[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...] 

Neural Tangent Kernel

  • Very brief introduction to kernel methods and real data


     

  • Gaussian data: Teacher-Student regression


     
  • Gaussian approximation: smoothness and effective dimension


     
  • Role of invariance in the task?

  outline

  • Kernel methods learn non-linear functions or boundaries
     
  • Map data to a feature space, where the problem is linear

data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features

only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)

  kernel methods

Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{\sigma^2}\right)\)

Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)\)

E.g. kernel regression:

  • Target function  \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)

     
  • Build an estimator  \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)

     
  • Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)

     
  • Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)

  kernel regression
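The regression steps above can be sketched in Python with NumPy. This is a minimal illustration, not the code used in the talk: the function names (`gaussian_kernel`, `laplace_kernel`, `fit_coefficients`, `predict`), the toy target function and the sample sizes are all assumptions. With an invertible Gram matrix and no regularization, minimizing the training MSE amounts to interpolating the training points, so the coefficients \(c_\mu\) solve a linear system.

```python
import numpy as np

def pairwise_distances(A, B):
    """Euclidean distances between the rows of A (m, d) and the rows of B (n, d)."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def gaussian_kernel(A, B, sigma=1.0):
    return np.exp(-pairwise_distances(A, B) ** 2 / sigma ** 2)

def laplace_kernel(A, B, sigma=1.0):
    return np.exp(-pairwise_distances(A, B) / sigma)

def fit_coefficients(X_train, z_train, kernel):
    """Solve Gram @ c = z, which drives the training MSE to zero."""
    gram = kernel(X_train, X_train)
    return np.linalg.solve(gram, z_train)

def predict(X_new, X_train, coeffs, kernel):
    """Estimator Z_K(x) = sum_mu c_mu K(x_mu, x)."""
    return kernel(X_new, X_train) @ coeffs

def toy_target(X):
    # Hypothetical target function, chosen only for illustration.
    return np.sin(X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(0)
X_train = rng.standard_normal((50, 2))
z_train = toy_target(X_train)
coeffs = fit_coefficients(X_train, z_train, laplace_kernel)

# Monte Carlo estimate of the generalization error on fresh points.
X_test = rng.standard_normal((1000, 2))
test_mse = np.mean((predict(X_test, X_train, coeffs, laplace_kernel) - toy_target(X_test)) ** 2)
print(f"estimated generalization error: {test_mse:.4f}")
```

Repeating the last step for increasing \(n\) and plotting `test_mse` against \(n\) on a log-log scale gives an empirical learning curve.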

A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with norm

 

\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x} \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)

 

where \(K^{-1}(\underline{x},\underline{y})\) is such that

 

\(\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)

 

 

\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)

  reproducing kernel hilbert space (rkhs)

Regression: performance depends on the target function!

  • If only assumed to be Lipschitz, then \(\beta=\frac1d\) \(\longrightarrow\) curse of dimensionality! [Luxburg and Bousquet 2004]

  • If assumed to be in the RKHS, then \(\beta\geq\frac12\) does not depend on \(d\) [Smola et al. 1998, Rudi and Rosasco 2017]

  • Yet, RKHS is a very strong assumption on the smoothness of the target function [Bach 2017]

(\(d\) = dimension of the input space)

  previous works

We apply kernel methods on

  • MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd

  • CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5

We perform

  • regression \(\longrightarrow\) kernel regression

  • classification \(\longrightarrow\) margin SVM

  real data and algorithms

  • Same exponent for regression and classification
     
  • Same exponent for Gaussian and Laplace kernel
     
  • MNIST and CIFAR10 display exponents \(\beta\gg\frac1d\) but \(<\frac12\)

  real data: exponents

\(\beta\approx0.4\) (MNIST), \(\beta\approx0.1\) (CIFAR10)

We need a new framework!

  • Controlled setting: Teacher-Student regression


     
  • Training data are sampled from a Gaussian Process:

          \(Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\)
          \(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere


     
  • Regression is done with another kernel \(K_S\)

  kernel teacher-student framework

\(\mathbb{E} Z_T(\underline{x}_\mu) = 0\)

\(\mathbb{E} Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)\)
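A minimal sketch of how such Teacher data could be generated, in Python with NumPy. The helper names, the choice of a Laplace Teacher, the small `jitter` added for numerical stability, and the convention that the points lie on the unit sphere embedded in \(\mathbb{R}^{\mathrm{dim}}\) are assumptions made for illustration.

```python
import numpy as np

def sample_on_sphere(n, dim, rng):
    """n points drawn uniformly on the unit sphere embedded in R^dim."""
    x = rng.standard_normal((n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_covariance(X, sigma=1.0):
    """Teacher covariance K_T(||x_mu - x_nu||) for a Laplace kernel."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def sample_teacher(X, rng, sigma=1.0, jitter=1e-10):
    """Draw Z_T(x_1), ..., Z_T(x_n) ~ N(0, K_T) via a Cholesky factor."""
    cov = laplace_covariance(X, sigma)
    chol = np.linalg.cholesky(cov + jitter * np.eye(len(X)))
    return chol @ rng.standard_normal(len(X))

rng = np.random.default_rng(0)
X = sample_on_sphere(200, 3, rng)
Z_T = sample_teacher(X, rng)
print(Z_T[:5])
```

If `chol @ chol.T` equals the covariance, then `chol @ g` with standard normal `g` has zero mean and covariance \(K_T\), which is the pair of conditions stated above.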

  teacher-student: simulations

Generalization error

Exponent \(-\beta\)

Can we understand these curves?

  teacher-student: regression

Regression:

\(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)

Minimize the training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)

Explicit solution:

\(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T\)

where

  • kernel overlap: \((\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x})\)
  • Gram matrix: \((\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)\)
  • training data: \((\underline{Z}_T)_\mu = Z_T(\underline{x}_\mu)\)

Compute the generalization error \(\epsilon\) and how it scales with \(n\):

\(\epsilon = \mathbb{E}_T \int\mathrm{d}^d\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z_T(\underline{x}) \right]^2 \sim n^{-\beta}\)

  teacher-student: theorem (1/2)

To compute the generalization error:
 

  • We look at the problem in the frequency domain
     
  • We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)



     
  • SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!

Then we can show that, for \(n\gg1\),

\(\epsilon \sim n^{-\beta}\) with \(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)

(details: arXiv:1905.10843)

  teacher-student: theorem (2/2)

  • Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details

     
  • \(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)

     
  • If \(\alpha_S\) is large enough, \(\beta\)  takes the largest possible value \(\frac{\alpha_T - d}{d}\)

     
  • As soon as \(\alpha_S\) is small enough, \(\beta=\frac{2\alpha_S}d\)

(optimal learning)

\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)
What is the prediction for our simulations?

  • If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac{\alpha_T-d}d = \frac1d\) (curse of dimensionality!)

  • If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}d = 2+\frac2d\)
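The two predictions can be checked numerically with a few lines of Python. The function name `predicted_beta` and the choice \(d=4\) are illustrative assumptions; the formula is the one stated in the theorem, with the Gaussian kernel represented by \(\alpha=\infty\).

```python
import math

def predicted_beta(alpha_teacher, alpha_student, d):
    """Exponent beta = (1/d) * min(alpha_T - d, 2 * alpha_S)."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

d = 4
alpha_laplace = d + 1        # Laplace kernel: alpha = d + 1
alpha_gaussian = math.inf    # Gaussian kernel: alpha = infinity

# Teacher = Student = Laplace: beta = 1/d
beta_laplace_laplace = predicted_beta(alpha_laplace, alpha_laplace, d)

# Teacher = Gaussian, Student = Laplace: beta = 2 + 2/d
beta_gaussian_laplace = predicted_beta(alpha_gaussian, alpha_laplace, d)

print(beta_laplace_laplace, beta_gaussian_laplace)
```

For \(d=4\) this gives \(\frac14\) in the first case and \(2+\frac24\) in the second, so only the first exponent shrinks as \(d\) grows.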

  teacher-student: comparison (1/2)

Exponent \(-\beta\)

  • Our result matches the numerical simulations
     
  • There are finite size effects (small \(n\))

(on hypersphere)

  teacher-student: comparison (2/2)

  teacher-student: Matérn teacher

Matérn kernels: \(K_T(\underline x) = \frac{2^{1-\nu}}{\Gamma(\nu)} z^\nu \mathcal K_\nu(z), \quad z = \sqrt{2\nu} \frac{|\!|\underline x|\!|}\sigma, \quad \alpha = d+2\nu\)

Laplace student, \(K_S(\underline x) = \exp\left(-\frac{|\!|\underline x|\!|}\sigma\right)\), \(d=1\): \(\beta=\min(2\nu,4)\)

[Figure: plot versus \(n\)]

Same result with points on regular lattice or random hypersphere?

 

What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)

  nearest-neighbor distance

In both cases  \(\delta\sim n^{-\frac1d}\)

Finite size effects: asymptotic scaling only when \(n\) is large enough

(conjecture)
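The scaling of the nearest-neighbor distance with \(n\) can be measured directly. The sketch below, in Python with NumPy, draws random points on a sphere and fits the slope of \(\log\delta\) against \(\log n\), which should be close to \(-\frac1d\). The helper names, the sample sizes, and the convention that a \(d\)-dimensional sphere is embedded in \(\mathbb{R}^{d+1}\) are assumptions made for this illustration.

```python
import numpy as np

def sample_sphere(n, dim, rng):
    """n uniform points on a dim-dimensional unit sphere (embedded in R^(dim+1))."""
    x = rng.standard_normal((n, dim + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbor."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    np.fill_diagonal(sq_dists, np.inf)  # exclude the distance of a point to itself
    nearest = np.sqrt(np.maximum(sq_dists.min(axis=1), 0.0))
    return nearest.mean()

def nn_exponent(dim, sizes, seed=0):
    """Fitted slope of log(delta) versus log(n)."""
    rng = np.random.default_rng(seed)
    deltas = [mean_nn_distance(sample_sphere(n, dim, rng)) for n in sizes]
    slope, _intercept = np.polyfit(np.log(sizes), np.log(deltas), 1)
    return slope

slope = nn_exponent(dim=2, sizes=[200, 400, 800, 1600])
print(f"fitted exponent: {slope:.2f}, expected about {-1 / 2:.2f}")
```

At small \(n\) the fitted slope can deviate from \(-\frac1d\), which is the finite size effect mentioned above.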

What about real data?

\(\longrightarrow\) second order approximation with a Gaussian process \(K_T\):

does it capture some aspects?

  back to real data

  • Gaussian processes are \(s\)-times (mean-square) differentiable, with \(s=\frac{\alpha_T-d}2\)
     
  • Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\)

\(\longrightarrow\) \(s=\frac12 \beta d\), \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)
 

This number is unreasonably large!

(since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) indep. of \(\alpha_S \longrightarrow \beta=\frac{\alpha_T-d}d\))

  effective dimension

  • Measure NN-distance \(\delta\)


     
  • \(\delta\sim n^{-\mathrm{some\ exponent}} \)

Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)

\(\longrightarrow\) \(d_\mathrm{eff}\) is much smaller \(\longrightarrow\) \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\) is more reasonable!

            \(\beta\)    \(d\)    \(d_\mathrm{eff}\)    \(s\)
MNIST       0.4          784      15                    3
CIFAR10     0.1          3072     35                    1
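The arithmetic behind these smoothness values can be reproduced with a short Python sketch. The dictionary layout and the function name `smoothness` are assumptions for illustration; the numbers are the fitted exponents, ambient dimensions and effective dimensions quoted in the talk.

```python
import math

# Values quoted in the talk for the two datasets.
datasets = {
    "MNIST": {"beta": 0.4, "d": 784, "d_eff": 15},
    "CIFAR10": {"beta": 0.1, "d": 3072, "d_eff": 35},
}

def smoothness(beta, dimension):
    """s = beta * dimension / 2, from beta = (alpha_T - d) / d and s = (alpha_T - d) / 2."""
    return beta * dimension / 2

results = {}
for name, values in datasets.items():
    s_ambient = smoothness(values["beta"], values["d"])
    s_effective = math.floor(smoothness(values["beta"], values["d_eff"]))
    results[name] = (s_ambient, s_effective)
    print(f"{name}: s with ambient d = {s_ambient:.1f}, s with d_eff = {s_effective}")
```

Using the ambient dimension gives \(s\) above 150 for both datasets, while the effective dimension gives \(s=3\) and \(s=1\).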

  curse of dimensionality (1/2)

  • Loosely speaking, the (optimal) exponent is

\(\beta \approx \frac{\text{smoothness}\ \ \alpha_T-d = 2s}{\text{manifold dimension}\ \ d}\)

  • To avoid the curse of dimensionality (\(\beta\sim\frac1d\)):

    • either the dimension of the manifold is small

    • or the data are extremely smooth

  curse of dimensionality (2/2)

  • Assume that the data are not smooth enough and live in \(d\) large

     
  • Dimensionality reduction in the task rather than in the data?

     
  • E.g. the \(n\) points \(\underline x_\mu\) live in \(\mathbb R^d\), but the target function is such that

\(Z_T(\underline x) = Z_T(\underline x_\parallel) \equiv Z_T(x_1,\dots,x_{d_\parallel}), \quad d_\parallel < d\)

  • Can kernels understand the lower dimensional structure?

Similar setting studied in Bach 2017

  task invariance: kernel regression (1/2)

Theorem (informal formulation):

in the described setting with \(d_\parallel \leq d\), for \(n\gg1\),

\(\epsilon \sim n^{-\beta}\) with \(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

Regardless of \(d_\parallel\)!

Two reasons contribute to this result:

 

  • the nearest-neighbor distance always scales as \(\delta \sim n^{-\frac1d}\)
     
  • \(\alpha_T(d) - d\) only depends on the function \(K_T(z)\) and not on \(d\)

Similar result in Bach 2017

  task invariance: kernel regression (2/2)

Teacher = Matérn (with parameter \(\nu\)),    Student = Laplace,    \(d\)=4

\(n\)

  task invariance: classification (1/2)

Classification with the margin SVM algorithm:

\hat y(\underline x) = \mathrm{sign}\left[ \sum_{\mu=1}^n c_\mu K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) + b \right]

find \(\{c_\mu\},b\) by minimizing some function

We consider a very simple setting:

  • the label is \(y(\underline x) = y(x_1) \ \longrightarrow \ d_\parallel=1\)

[Figure: example labels \(y(x_1)=\pm\) along the \(x_1\) axis for two interfaces: hyperplane and band]

Non-Gaussian data!

  task invariance: classification (2/2)

Vary kernel scale \(\sigma\) \(\longrightarrow\) two regimes!

  • \(\sigma\ll\delta\): then the estimator is tantamount to a nearest-neighbor algorithm \(\longrightarrow\) curse of dimensionality \(\beta=\frac1d\)

  • \(\sigma\gg\delta\): important correlations in \(c_\mu\) due to the long-range kernel. For the hyperplane with \(d_\parallel=1\) we find \(\beta = \mathcal O(d^0)\)! No curse of dimensionality!

  kernel correlations (1/2)

When \(\sigma\gg\delta\) we can expand the kernel overlaps:

\(K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) \approx K(0) - \mathrm{const} \times \left(\frac{|\!|\underline x - \underline x^\mu|\!|}\sigma\right)^\xi\)

(the exponent \(\xi\) is linked to the smoothness of the kernel)

We can derive some scaling arguments that lead to an exponent

\(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\)

Idea:

  • support vectors (\(c_\mu\neq0\)) are close to the interface
  • we impose that the decision boundary has \(\mathcal{O}(1)\) spatial fluctuations on a scale proportional to \(\delta\)
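To see why this exponent escapes the curse of dimensionality, it helps to evaluate it for a few dimensions and compare it with the nearest-neighbor value \(\frac1d\). The Python sketch below does that; the function name `svm_beta` and the chosen dimensions are illustrative assumptions, while the formula is the one quoted above.

```python
def svm_beta(d, xi):
    """Exponent beta = (d + xi - 1) / (3d + xi - 3) from the scaling argument."""
    return (d + xi - 1) / (3 * d + xi - 3)

xi_laplace = 1  # Laplace kernel: xi = 1

for d in (1, 2, 10, 100, 1000):
    print(f"d = {d:4d}: beta = {svm_beta(d, xi_laplace):.3f}, 1/d = {1 / d:.3f}")

beta_d1 = svm_beta(1, xi_laplace)
beta_large_d = svm_beta(10 ** 6, xi_laplace)
```

As \(d\) grows the exponent approaches \(\frac13\) instead of vanishing, consistent with \(\beta=\mathcal O(d^0)\).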

  kernel correlations (2/2)

\(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\) in all these cases!

[Figure panels versus \(n\): hyperplane and band interfaces; Laplace kernel (\(\xi=1\)) and Matérn kernels (\(\xi = \min(2\nu,2)\)); \(d=1\)]

  conclusion

  • Learning curves of real data decay as power laws with exponents \(\frac1d \ll \beta < \frac12\)

  • We introduce a new framework that links the exponent \(\beta\) to the degree of smoothness of Gaussian random data

  • We justify how different kernels can lead to the same exponent \(\beta\)

  • We show that the effective dimension of real data is \(\ll d\). It can be linked to a (small) effective smoothness \(s\)

  • We show that kernel regression is not able to capture invariants in the task, while kernel classification can (in some regime and for smooth interfaces)

arXiv:1905.10843 + paper to be released soon!

  • Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that

    • is an instance of a Teacher \(K_T\) (\(\alpha_T\))
    • lies in the RKHS of a Student \(K_S\) (\(\alpha_S\))

\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty\) \(\Longrightarrow\) \(\alpha_T > \alpha_S + d\)

\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) \(\Longrightarrow\) \(\alpha_S > d\)

Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!) \(\longrightarrow \beta > \frac12\)

  rkhs & smoothness

  the nearest-neighbor limit

hyperplane interface, using a Laplace kernel and varying the dimension \(d\): \(\beta=\frac1d\)

[Figure: plot versus \(n\)]

  kernel correlations: hypersphere

What about other interfaces?

boundary = hypersphere: \(y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)\) (\(d_\parallel=1\))

Laplace kernels (\(\xi=1\)): \(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\) (same exponent!)

(similar scaling arguments apply, provided \(R\gg\delta\))

[Figure: plot versus \(n\); sketch of the labels \(\pm\) as a function of \(|\!|\underline x|\!|\), with axes \(x_1\) and \(\underline{x}_\perp\)]

Kernel methods and the curse of dimensionality

By Stefano Spigler


Talk given at the Courant Institute, NY, March 2020
