Kernel methods and the curse of dimensionality

arXiv:1905.10843

Stefano Spigler

Jonas Paccolat, Mario Geiger, Matthieu Wyart

Why and how does deep supervised learning work?
Learn from examples: how many are needed?
Typical tasks:
- Regression (fitting functions)
- Classification

supervised deep learning

Performance is evaluated through the generalization error \(\epsilon\)
Learning curves decay with the number of examples \(n\), often as a power law \(\epsilon\sim n^{-\beta}\)
The exponent \(\beta\) depends on the dataset and on the algorithm

Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]

learning curves

We lack a theory for \(\beta\) for deep networks!

Performance increases with overparametrization

\(\longrightarrow\) study the infinite-width limit!

[Jacot et al. 2018]

[Bruna and Mallat 2013, Arora et al. 2019]

What are the learning curves of kernels like?

link with kernel learning

(next slides)

[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]

[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]

[Figure: generalization error \(\epsilon\) versus \(h\)]

With a specific scaling, infinite-width limit \(\to\) kernel learning

[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]

Neural Tangent Kernel

Very brief introduction to kernel methods and real data
Gaussian data: Teacher-Student regression
Gaussian approximation: smoothness and effective dimension
Role of invariance in the task?

outline

Kernel methods learn non-linear functions or boundaries
Map data to a feature space, where the problem is linear

data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features

only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)

Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{\sigma^2}\right)\)

Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)\)

kernel methods

E.g. kernel regression:

Target function \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)
Build an estimator \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)
Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
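The four steps above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the code behind the talk: the Laplace kernel with \(\sigma=1\), the toy target `sin(2 pi x)` in one dimension, and the helper names `laplace_kernel`, `solve`, `fit` and `predict` are all assumptions made for the example. With a positive-definite kernel and distinct points, minimizing the unregularized training MSE amounts to solving the linear system \(\mathbb{K}\,\underline{c} = \underline{Z}\).

```python
import math

def laplace_kernel(x, y, sigma=1.0):
    """Laplace kernel exp(-||x - y|| / sigma) for points given as tuples."""
    return math.exp(-math.dist(x, y) / sigma)

def solve(a, b):
    """Solve a @ c = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    c = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][k] * c[k] for k in range(r + 1, n))
        c[r] = (m[r][n] - s) / m[r][r]
    return c

def fit(xs, zs, kernel=laplace_kernel):
    """Coefficients c_mu that make the estimator interpolate the training data."""
    gram = [[kernel(a, b) for b in xs] for a in xs]
    return solve(gram, zs)

def predict(x, xs, coeffs, kernel=laplace_kernel):
    """Estimator: sum over mu of c_mu * K(x_mu, x)."""
    return sum(c * kernel(xm, x) for c, xm in zip(coeffs, xs))

def target(x):
    """Toy target function (an assumption for this example)."""
    return math.sin(2 * math.pi * x[0])

# Training set: 11 equally spaced points in [0, 1]
xs = [(i / 10,) for i in range(11)]
zs = [target(x) for x in xs]
coeffs = fit(xs, zs)

train_mse = sum((predict(x, xs, coeffs) - z) ** 2 for x, z in zip(xs, zs)) / len(xs)

# Crude estimate of the generalization error on held-out midpoints
test_xs = [((i + 0.5) / 10,) for i in range(10)]
test_mse = sum((predict(x, xs, coeffs) - target(x)) ** 2 for x in test_xs) / len(test_xs)
print(train_mse, test_mse)
```

The training error vanishes up to round-off because the estimator interpolates the data, so all the information about performance is in the held-out error.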

kernel regression

A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with (squared) norm

\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x} \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)

where \(K^{-1}(\underline{x},\underline{y})\) is such that

\(\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)

\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)
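For a translation-invariant kernel, \(K(\underline{x},\underline{y}) = K(\underline{x}-\underline{y})\), the definition above becomes diagonal in Fourier space. The form below is a sketch that assumes this invariance and drops normalization constants, which depend on the Fourier convention:

```latex
\lvert\!\lvert Z \rvert\!\rvert_K^2 \;\propto\; \int \mathrm{d}^d\underline{w}\,
\frac{|\tilde{Z}(\underline{w})|^2}{\tilde{K}(\underline{w})}
```

A function therefore belongs to \(\mathcal{H}_K\) only if its spectrum decays fast enough at high frequency compared to \(\tilde{K}(\underline{w})\).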

reproducing kernel hilbert space (rkhs)

Regression: performance depends on the target function!

If only assumed to be Lipschitz, then \(\beta=\frac1d\) \(\longrightarrow\) Curse of dimensionality! [Luxburg and Bousquet 2004]
If assumed to be in the RKHS, then \(\beta\geq\frac12\) does not depend on \(d\) [Smola et al. 1998, Rudi and Rosasco 2017]
Yet, RKHS is a very strong assumption on the smoothness of the target function [Bach 2017]

(\(d\) = dimension of the input space)

previous works

We apply kernel methods on
- MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd
- CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5

We perform
- regression \(\longrightarrow\) kernel regression
- classification \(\longrightarrow\) margin SVM

real data and algorithms

Same exponent for regression and classification
Same exponent for Gaussian and Laplace kernel
MNIST and CIFAR10 display exponents \(\beta\gg\frac1d\) but \(<\frac12\): \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10)

We need a new framework!

real data: exponents

Controlled setting: Teacher-Student regression
Training data are sampled from a Gaussian Process:

\(Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\)
\(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere
Regression is done with another kernel \(K_S\)

kernel teacher-student framework

\(\mathbb{E} Z_T(\underline{x}_\mu) = 0\)

\(\mathbb{E} Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)\)
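Drawing such a training set can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the points are taken uniformly on the unit sphere embedded in \(\mathbb{R}^d\) (one reading of "\(d\)-dim hypersphere"), the Teacher is a Laplace kernel with \(\sigma=1\), and the helpers `random_on_sphere`, `cholesky` and `sample_teacher` are hypothetical names. The Gaussian vector is obtained from a Cholesky factor \(L\) of the Teacher Gram matrix, since \(L\underline{g}\) with standard normal \(\underline{g}\) has covariance \(LL^T\).

```python
import math
import random

def random_on_sphere(n, d, rng):
    """n points drawn uniformly on the unit sphere in R^d (normalized Gaussians)."""
    pts = []
    for _ in range(n):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in v))
        pts.append(tuple(c / norm for c in v))
    return pts

def cholesky(a):
    """Lower-triangular L with L L^T = a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def sample_teacher(points, kernel, rng):
    """One draw of (Z_T(x_1), ..., Z_T(x_n)) from N(0, K_T)."""
    gram = [[kernel(math.dist(p, q)) for q in points] for p in points]
    low = cholesky(gram)
    g = [rng.gauss(0.0, 1.0) for _ in points]
    return [sum(low[i][k] * g[k] for k in range(i + 1)) for i in range(len(points))]

rng = random.Random(0)
points = random_on_sphere(50, 3, rng)
# Laplace Teacher K_T(r) = exp(-r), an assumption for this example
z_teacher = sample_teacher(points, lambda r: math.exp(-r), rng)
print(z_teacher[:3])
```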

teacher-student: simulations

[Figure: generalization error, with exponent \(-\beta\)]

Can we understand these curves?

teacher-student: regression

Regression:

\(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)

Minimize the training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)

Explicit solution:

\(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T\)

where
- kernel overlap: \((\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x})\)
- Gram matrix: \((\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)\)
- training data: \((\underline{Z}_T)_\mu = Z_T(\underline{x}_\mu)\)

Compute the generalization error \(\epsilon\) and how it scales with \(n\)

\(\epsilon = \mathbb{E}_T \int\mathrm{d}^d\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z_T(\underline{x}) \right]^2 \sim n^{-\beta}\)
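Because the Teacher is Gaussian with zero mean, the average over the Teacher can be carried out explicitly once the estimator is written as \(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T\). The notation \(\mathbb{K}_T\) and \(\underline{k}_T(\underline{x})\) below is an assumption introduced here, defined like the Student quantities but with \(K_T\) in place of \(K_S\):

```latex
\mathbb{E}_T \left[ \hat{Z}_S(\underline{x}) - Z_T(\underline{x}) \right]^2
= K_T(\underline{x},\underline{x})
- 2\, \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{k}_T(\underline{x})
+ \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \mathbb{K}_T \mathbb{K}_S^{-1} \underline{k}_S(\underline{x})
```

The three terms come from \(\mathbb{E}_T Z_T(\underline{x})^2\), \(\mathbb{E}_T\, \underline{Z}_T Z_T(\underline{x}) = \underline{k}_T(\underline{x})\) and \(\mathbb{E}_T\, \underline{Z}_T \underline{Z}_T^{\,T} = \mathbb{K}_T\). When Teacher and Student coincide the expression reduces to \(K(\underline{x},\underline{x}) - \underline{k}(\underline{x}) \cdot \mathbb{K}^{-1} \underline{k}(\underline{x})\).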

teacher-student: theorem (1/2)

To compute the generalization error:

We look at the problem in the frequency domain
We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)
SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!

Then we can show that \(\epsilon \sim n^{-\beta}\) for \(n\gg1\), with

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)

(details: arXiv:1905.10843)

teacher-student: theorem (2/2)

Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details
\(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)
If \(\alpha_S\) is large enough, \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)
As soon as \(\alpha_S\) is small enough, \(\beta=\frac{2\alpha_S}d\)

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

What is the prediction for our simulations?

If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac{\alpha_T-d}d = \frac1d\) (curse of dimensionality!)
If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}d = 2+\frac2d\)
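The two predictions can be checked by evaluating the formula directly. The snippet is only a convenience sketch: the function names `predicted_beta` and `laplace_alpha` and the constant `GAUSSIAN_ALPHA` are made up for this example, and the Gaussian kernel is encoded as \(\alpha=\infty\) as stated above.

```python
import math

def predicted_beta(d, alpha_teacher, alpha_student):
    """beta = min(alpha_T - d, 2 * alpha_S) / d, the lattice prediction."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

def laplace_alpha(d):
    """Laplace kernel: spectrum decays with exponent alpha = d + 1."""
    return d + 1

GAUSSIAN_ALPHA = math.inf  # Gaussian kernel: faster than any power law

for d in (1, 2, 5, 10):
    both_laplace = predicted_beta(d, laplace_alpha(d), laplace_alpha(d))
    gaussian_teacher = predicted_beta(d, GAUSSIAN_ALPHA, laplace_alpha(d))
    print(f"d={d}: Laplace/Laplace beta={both_laplace:.3f}, "
          f"Gaussian/Laplace beta={gaussian_teacher:.3f}")
```

In the first case the exponent is limited by the Teacher and decays as \(\frac1d\); in the second it is limited by the Student and stays above 2 for every \(d\).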

teacher-student: comparison (1/2)

Our result matches the numerical simulations
There are finite size effects (small \(n\))

[Figure: exponent \(-\beta\) (on hypersphere)]
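A small self-contained check of the prediction is possible in \(d=1\) with Teacher = Student = Laplace, where the formula gives \(\beta=\min(1,4)=1\). The sketch below is not the simulation shown in the talk. It assumes a regular lattice on \([0,1)\), \(\sigma=1\), test points drawn inside the span of the lattice to avoid edge effects, and it uses the fact that for identical Teacher and Student the Teacher-averaged error at \(x\) is \(K(0) - \underline{k}(x)\cdot\mathbb{K}^{-1}\underline{k}(x)\), so no sampling of the Teacher is needed. The helper names are hypothetical.

```python
import math
import random

def laplace(r, sigma=1.0):
    """Laplace kernel as a function of the separation r."""
    return math.exp(-abs(r) / sigma)

def invert(a):
    """Inverse of a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col and m[r][col] != 0.0:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]

def expected_error(n, unit_samples):
    """Teacher-averaged squared error, averaged over test points, for n lattice points."""
    train = [mu / n for mu in range(n)]          # regular lattice on [0, 1)
    inv = invert([[laplace(a - b) for b in train] for a in train])
    span = (n - 1) / n                           # keep test points between lattice ends
    total = 0.0
    for u in unit_samples:
        x = u * span
        k = [laplace(x - a) for a in train]
        inv_k = [sum(row[j] * k[j] for j in range(n)) for row in inv]
        total += laplace(0.0) - sum(ki * wi for ki, wi in zip(k, inv_k))
    return total / len(unit_samples)

rng = random.Random(0)
unit_samples = [rng.random() for _ in range(200)]
sizes = [8, 16, 32, 64]
errors = [expected_error(n, unit_samples) for n in sizes]

# Least-squares slope of log(error) against log(n)
lx = [math.log(n) for n in sizes]
ly = [math.log(e) for e in errors]
mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
slope = sum((a - mx) * (b - my) for a, b in zip(lx, ly)) / sum((a - mx) ** 2 for a in lx)
beta_fit = -slope
print(beta_fit)
```

The fitted exponent should come out close to 1, up to the noise from the finite number of test points.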

teacher-student: comparison (2/2)

teacher-student: Matérn teacher

Matérn kernels: \(K_T(\underline x) = \frac{2^{1-\nu}}{\Gamma(\nu)} z^\nu \mathcal K_\nu(z), \quad z = \sqrt{2\nu} \frac{|\!|\underline x|\!|}\sigma, \quad \alpha = d+2\nu\)

Laplace student, \(K_S(\underline x) = \exp\left(-\frac{|\!|\underline x|\!|}\sigma\right)\)

\(d=1\): \(\beta=\min(2\nu,4)\)

[Figure: learning curves as a function of \(n\)]

Same result with points on regular lattice or random hypersphere?

What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)

nearest-neighbor distance

In both cases \(\delta\sim n^{-\frac1d}\)
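A heuristic way to see this scaling, assuming the \(n\) points are spread with roughly uniform density over a region of fixed volume \(V\) in \(d\) dimensions (\(V\) is a symbol introduced here): each point then occupies a cell of volume \(\sim\delta^d\), and the cells fill the region.

```latex
n\,\delta^d \sim V
\quad\Longrightarrow\quad
\delta \sim \left(\frac{V}{n}\right)^{\frac1d} \sim n^{-\frac1d}
```

The argument does not care whether the points sit on a lattice or are drawn at random, which is why both cases share the same exponent.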

Finite size effects: asymptotic scaling only when \(n\) is large enough

(conjecture)

What about real data?

\(\longrightarrow\) second order approximation with a Gaussian process \(K_T\):

does it capture some aspects?

back to real data

Gaussian processes are \(s\)-times (mean-square) differentiable, \(s=\frac{\alpha_T-d}2\)
Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\)
(since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) indep. of \(\alpha_S \longrightarrow \beta=\frac{\alpha_T-d}d\))

\(\longrightarrow\) \(s=\frac12 \beta d\), \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)

This number is unreasonably large!
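The arithmetic behind these two numbers is short enough to spell out. The snippet simply evaluates \(s=\frac12\beta d\) with the fitted exponents and input dimensions quoted in the talk; the function name `implied_smoothness` is an assumption of this example.

```python
import math

def implied_smoothness(beta, d):
    """s = beta * d / 2, from beta = (alpha_T - d) / d and s = (alpha_T - d) / 2."""
    return 0.5 * beta * d

# (fitted beta, input dimension d) as quoted for each dataset
datasets = {"MNIST": (0.4, 784), "CIFAR10": (0.1, 3072)}

smoothness = {name: implied_smoothness(beta, d) for name, (beta, d) in datasets.items()}
for name, s in smoothness.items():
    print(f"{name}: s = {s:.1f}, integer part {math.floor(s)}")
```

A Gaussian field that is differentiable more than 150 times is far smoother than one would expect of images, which is what motivates replacing \(d\) by a smaller effective dimension.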

effective dimension

Measure NN-distance \(\delta\)
\(\delta\sim n^{-\mathrm{some\ exponent}} \)

\(\longrightarrow\) Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)

Table columns: dataset, \(\beta\), \(d_\mathrm{eff}\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\), \(d\)
MNIST: \(\beta = 0.4\), \(d = 784\)
CIFAR10: \(\beta = 0.1\), \(d = 3072\)

\(\longrightarrow\) \(d_\mathrm{eff}\) is much smaller, \(s\) is more reasonable!
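The definition of \(d_\mathrm{eff}\) translates into a simple fitting procedure, sketched here on synthetic data rather than on MNIST or CIFAR10. Everything specific in the snippet is an assumption for illustration: the toy data are a flat 2-dimensional square padded with zeros into \(\mathbb{R}^{10}\), the nearest-neighbor search is brute force, and the names `mean_nn_distance`, `effective_dimension` and `sample` are hypothetical.

```python
import math
import random

def mean_nn_distance(points):
    """Average distance from each point to its nearest neighbor (brute force)."""
    total = 0.0
    for i, p in enumerate(points):
        total += min(math.dist(p, q) for j, q in enumerate(points) if j != i)
    return total / len(points)

def effective_dimension(sample, sizes):
    """Fit delta ~ n^(-1/d_eff) by least squares on log-log axes."""
    logs_n = [math.log(n) for n in sizes]
    logs_delta = [math.log(mean_nn_distance(sample(n))) for n in sizes]
    mean_n = sum(logs_n) / len(sizes)
    mean_delta = sum(logs_delta) / len(sizes)
    slope = (sum((a - mean_n) * (b - mean_delta) for a, b in zip(logs_n, logs_delta))
             / sum((a - mean_n) ** 2 for a in logs_n))
    return -1.0 / slope

rng = random.Random(0)

def sample(n, intrinsic=2, ambient=10):
    """Toy data: n points on a 2-d square, embedded in R^10 by zero padding."""
    return [tuple([rng.random() for _ in range(intrinsic)] + [0.0] * (ambient - intrinsic))
            for _ in range(n)]

d_eff = effective_dimension(sample, [100, 200, 400, 800])
print(d_eff)
```

On this toy set the estimate should land near the intrinsic dimension 2 rather than the ambient dimension 10, which is the effect the table above reports for real data.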

curse of dimensionality (1/2)

Loosely speaking, the (optimal) exponent is

\(\beta \approx \frac{\text{smoothness}\ \ \alpha_T-d = 2s}{\text{manifold dimension}\ \ d}\)

To avoid the curse of dimensionality (\(\beta\sim\frac1d\)):
- either the dimension of the manifold is small
- or the data are extremely smooth

curse of dimensionality (2/2)

Assume that the data are not smooth enough and live in large dimension \(d\)
Dimensionality reduction in the task rather than in the data?
E.g. the \(n\) points \(\underline x_\mu\) live in \(\mathbb R^d\), but the target function is such that

\(Z_T(\underline x) = Z_T(\underline x_\parallel) \equiv Z_T(x_1,\dots,x_{d_\parallel}), \quad d_\parallel < d\)

Can kernels understand the lower dimensional structure?

Similar setting studied in Bach 2017

task invariance: kernel regression (1/2)

Theorem (informal formulation): in the described setting with \(d_\parallel \leq d\), \(\epsilon \sim n^{-\beta}\) for \(n\gg1\), with

\(\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)\)

Regardless of \(d_\parallel\)!

Two reasons contribute to this result:
- the nearest-neighbor distance always scales as \(\delta \sim n^{-\frac1d}\)
- \(\alpha_T(d) - d\) only depends on the function \(K_T(z)\) and not on \(d\)
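As a worked instance of the second point, take a Matérn Teacher, for which \(\alpha = d+2\nu\), together with a Laplace Student, for which \(\alpha_S = d+1\). Both values are the ones quoted earlier in the talk; the combination is only an illustration:

```latex
\alpha_T(d) = d + 2\nu
\;\Longrightarrow\;
\alpha_T(d) - d = 2\nu ,
\qquad
\beta = \frac1d \min\left(2\nu,\ 2(d+1)\right)
```

The numerator \(2\nu\) is the same whether the Teacher is defined on all \(d\) coordinates or only on \(d_\parallel\) of them, while the denominator stays equal to the full dimension \(d\) because it comes from the nearest-neighbor distance.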

task invariance: kernel regression (2/2)

Teacher = Matérn (with parameter \(\nu\)), Student = Laplace, \(d\)=4

[Figure: learning curves as a function of \(n\)]

task invariance: classification (1/2)

Classification with the margin SVM algorithm:

\(\hat y(\underline x) = \mathrm{sign}\left[ \sum_{\mu=1}^n c_\mu K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) + b \right]\)

find \(\{c_\mu\},b\) by minimizing some function

We consider a very simple setting: the label is \(y(\underline x) = y(x_1) \ \longrightarrow \ d_\parallel=1\)

Non-Gaussian data!

[Figure: \(y(x_1)\) as a function of \(x_1\) for two interfaces, hyperplane and band]

task invariance: classification (2/2)

Vary kernel scale \(\sigma\) \(\longrightarrow\) two regimes!

\(\sigma\ll\delta\): then the estimator is tantamount to a nearest-neighbor algorithm \(\longrightarrow\) curse of dimensionality \(\beta=\frac1d\)
\(\sigma\gg\delta\): important correlations in \(c_\mu\) due to the long-range kernel. For the hyperplane with \(d_\parallel=1\) we find \(\beta = \mathcal O(d^0)\)! No curse of dimensionality!

kernel correlations (1/2)

When \(\sigma\gg\delta\) we can expand the kernel overlaps:

\(K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) \approx K(0) - \mathrm{const} \times \left(\frac{|\!|\underline x - \underline x^\mu|\!|}\sigma\right)^\xi\)

(the exponent \(\xi\) is linked to the smoothness of the kernel)

We can derive some scaling arguments that lead to an exponent

\(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\)

Idea:
- support vectors (\(c_\mu\neq0\)) are close to the interface
- we impose that the decision boundary has \(\mathcal{O}(1)\) spatial fluctuations on a scale proportional to \(\delta\)
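To get a feeling for this exponent, it helps to tabulate it next to the nearest-neighbor value \(\frac1d\). The snippet only evaluates the two formulas quoted in the talk; the function names `svm_beta` and `nearest_neighbor_beta` are assumptions of this example, and \(\xi=1\) corresponds to the Laplace kernel.

```python
def svm_beta(d, xi):
    """Scaling prediction beta = (d + xi - 1) / (3d + xi - 3) for sigma >> delta."""
    return (d + xi - 1) / (3 * d + xi - 3)

def nearest_neighbor_beta(d):
    """Exponent in the sigma << delta regime, beta = 1 / d."""
    return 1 / d

for d in (1, 2, 5, 10, 100):
    print(f"d={d}: long-range kernel beta={svm_beta(d, 1):.3f}, "
          f"nearest-neighbor beta={nearest_neighbor_beta(d):.3f}")
```

For \(d=1\) the formula gives \(\beta=1\) for any \(\xi\), and for large \(d\) it tends to \(\frac13\) instead of vanishing, which is the sense in which \(\beta=\mathcal O(d^0)\).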

kernel correlations (2/2)

\(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\)

\(d=1\)

Laplace kernel \(\xi=1\)
Matérn kernels \(\xi = \min(2\nu,2)\)

[Figures: learning curves as a function of \(n\) for the hyperplane and band interfaces]

in all these cases!

conclusion

Learning curves of real data decay as power laws with exponents \(\frac1d \ll \beta < \frac12\)
We introduce a new framework that links the exponent \(\beta\) to the degree of smoothness of Gaussian random data
We justify how different kernels can lead to the same exponent \(\beta\)
We show that the effective dimension of real data is \(\ll d\). It can be linked to a (small) effective smoothness \(s\)
We show that kernel regression is not able to capture invariants in the task, while kernel classification can (in some regime and for smooth interfaces)

arXiv:1905.10843 + paper to be released soon!

Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that
- is an instance of a Teacher \(K_T\) (\(\alpha_T\))
- lies in the RKHS of a Student \(K_S\) (\(\alpha_S\))

\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty\) \(\Longrightarrow\) \(\alpha_T > \alpha_S + d\)

\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) \(\Longrightarrow\) \(\alpha_S > d\)

Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!)

\(\longrightarrow \beta > \frac12\)
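The two inequalities follow from counting powers of \(w = |\!|\underline{w}|\!|\) at high frequency. The sketch below assumes translation-invariant kernels with the power-law tails \(\tilde{K}_T \sim w^{-\alpha_T}\) and \(\tilde{K}_S \sim w^{-\alpha_S}\) used throughout, and drops all constants:

```latex
\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2
\;\propto\; \int \mathrm{d}^d\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})}
\;\sim\; \int^{\infty} \mathrm{d}w\, w^{d-1}\, w^{\alpha_S-\alpha_T} < \infty
\iff \alpha_T - \alpha_S > d

K_S(\underline{0})
\;\propto\; \int \mathrm{d}^d\underline{w}\, \tilde{K}_S(\underline{w})
\;\sim\; \int^{\infty} \mathrm{d}w\, w^{d-1-\alpha_S} < \infty
\iff \alpha_S > d
```

Chaining them gives \(\alpha_T - d > \alpha_S > d\), hence \(s = \frac{\alpha_T-d}2 > \frac d2\).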

rkhs & smoothness

the nearest-neighbor limit

hyperplane interface, using a Laplace kernel and varying the dimension \(d\): \(\beta=\frac1d\)

[Figure: learning curves as a function of \(n\)]

kernel correlations: hypersphere

What about other interfaces?

boundary = hypersphere: \(y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)\) (\(d_\parallel=1\))

Laplace kernels (\(\xi=1\))

\(\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }\) (same exponent!)

(similar scaling arguments apply, provided \(R\gg\delta\))

[Figure: learning curves as a function of \(n\); sketch of the interface with axes \(|\!|\underline x|\!|\), \(x_1\), \(\underline{x}_\perp\)]

Talk given at the Courant Institute, NY, March 2020

spigler.net/stefano
