curse of dimensionality

Stefano Spigler

Jonas Paccolat, Mario Geiger, Matthieu Wyart

• Why and how does deep supervised learning work?

• Learn from examples: how many are needed?

• Regression (fitting functions)

• Classification

supervised deep learning

• Performance is evaluated through the generalization error $$\epsilon$$

• Learning curves decay with the number of examples $$n$$, often as $$\epsilon\sim n^{-\beta}$$

• $$\beta$$ depends on the dataset and on the algorithm

Deep networks: $$\beta\sim 0.07$$-$$0.35$$ [Hestness et al. 2017]

learning curves

We lack a theory for $$\beta$$ for deep networks!

• Performance increases with overparametrization

$$\longrightarrow$$ study the infinite-width limit!

[Jacot et al. 2018]

[Bruna and Mallat 2013, Arora et al. 2019]

What are the learning curves of kernels like?

(next slides)

[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]

[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]

[Figure: generalization error $$\epsilon$$ as a function of $$h$$]

• With a specific scaling, infinite-width limit $$\to$$ kernel learning

[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]

Neural Tangent Kernel

• Very brief introduction to kernel methods and real data

• Gaussian data: Teacher-Student regression

• Gaussian approximation: smoothness and effective dimension

• Role of invariance in the task?

outline

• Kernel methods learn non-linear functions or boundaries

• Map data to a feature space, where the problem is linear

data $$\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow$$ use linear combination of features

only scalar products are needed: $$\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow$$ kernel $$K(\underline{x},\underline{x}^\prime)$$

kernel methods

Gaussian: $$K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{\sigma^2}\right)$$

Laplace: $$K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)$$
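As a minimal sketch of the two kernels above, the following NumPy snippet builds their Gram matrices on a few random points and checks that they are symmetric with non-negative spectrum, as expected for scalar products in a feature space. The helper names, the random points, and the choice $$\sigma=1$$ are illustrative assumptions, not part of the talk.

```python
import numpy as np

def pairwise_distances(X, Y):
    """Euclidean distances between rows of X (n, d) and rows of Y (m, d)."""
    diff = X[:, None, :] - Y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def gaussian_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / sigma^2)
    return np.exp(-pairwise_distances(X, Y) ** 2 / sigma ** 2)

def laplace_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    return np.exp(-pairwise_distances(X, Y) / sigma)

# Hypothetical data: 20 random points in d = 3.
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))

gram_gauss = gaussian_kernel(X, X)
gram_laplace = laplace_kernel(X, X)

# Smallest eigenvalues: non-negative (up to round-off) for a valid kernel.
min_eig_gauss = np.linalg.eigvalsh(gram_gauss).min()
min_eig_laplace = np.linalg.eigvalsh(gram_laplace).min()
```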

E.g. kernel regression:

• Target function  $$\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n$$

• Build an estimator  $$\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})$$

• Minimize training MSE $$= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2$$

• Estimate the generalization error $$\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2$$
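The four steps above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the target function, the sample sizes, the Laplace kernel with $$\sigma=1$$, and the tiny ridge term (added for numerical stability) are all hypothetical choices, and the generalization error is estimated on a finite test set.

```python
import numpy as np

def laplace_kernel(X, Y, sigma=1.0):
    dist = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    return np.exp(-dist / sigma)

def fit_kernel_regression(X_train, z_train, kernel, ridge=1e-10):
    """Coefficients c_mu that make the training MSE (numerically) zero."""
    gram = kernel(X_train, X_train)
    return np.linalg.solve(gram + ridge * np.eye(len(X_train)), z_train)

def predict(X_train, coeffs, X_new, kernel):
    """Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x)."""
    return kernel(X_new, X_train) @ coeffs

def target(X):
    # Hypothetical target function Z(x), chosen only for illustration.
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
X_test = rng.uniform(-1, 1, size=(2000, 2))

coeffs = fit_kernel_regression(X_train, target(X_train), laplace_kernel)

train_pred = predict(X_train, coeffs, X_train, laplace_kernel)
test_pred = predict(X_train, coeffs, X_test, laplace_kernel)
train_mse = np.mean((train_pred - target(X_train)) ** 2)
test_mse = np.mean((test_pred - target(X_test)) ** 2)  # estimate of epsilon
```

Repeating the last lines for increasing training-set sizes gives an empirical learning curve $$\epsilon(n)$$.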

kernel regression

A kernel $$K$$ induces a corresponding Hilbert space $$\mathcal{H}_K$$ with norm

$$\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x} \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})$$

where $$K^{-1}(\underline{x},\underline{y})$$ is such that

$$\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})$$

$$\mathcal{H}_K$$ is called the Reproducing Kernel Hilbert Space (RKHS)

reproducing kernel hilbert space (rkhs)

Regression: performance depends on the target function!

• If only assumed to be Lipschitz, then $$\beta=\frac1d$$

• If assumed to be in the RKHS, then $$\beta\geq\frac12$$ does not depend on $$d$$

• Yet, RKHS is a very strong assumption on the smoothness of the target function

Curse of dimensionality!

[Luxburg and Bousquet 2004]

[Smola et al. 1998, Rudi and Rosasco 2017]

[Bach 2017]

previous works

$$d$$ = dimension of the input space

$$\longrightarrow$$

We apply kernel methods on

real data and algorithms

• MNIST: 70000 28x28 b/w pictures, dimension $$d = 784$$, 2 classes: even/odd

• CIFAR10: 60000 32x32 RGB pictures, dimension $$d = 3072$$, 2 classes: first 5/last 5

We perform

• regression $$\longrightarrow$$ kernel regression

• classification $$\longrightarrow$$ margin SVM
• Same exponent for regression and classification

• Same exponent for Gaussian and Laplace kernel

• MNIST and CIFAR10 display exponents $$\beta\gg\frac1d$$ but $$<\frac12$$

exponents

We need a new framework!

MNIST: $$\beta\approx0.4$$

CIFAR10: $$\beta\approx0.1$$

• Controlled setting: Teacher-Student regression

• Training data are sampled from a Gaussian Process:

$$Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)$$
$$\underline{x}_\mu$$ are random on a $$d$$-dim hypersphere

• Regression is done with another kernel $$K_S$$

kernel teacher-student framework

$$\mathbb{E} Z_T(\underline{x}_\mu) = 0$$

$$\mathbb{E} Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)$$
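A minimal sketch of how such Teacher data could be generated: points are drawn uniformly on a hypersphere and the values $$Z_T(\underline{x}_\mu)$$ are sampled from a centered Gaussian with covariance $$K_T$$ through a Cholesky factor. Assumptions made here for illustration: the $$d$$-dimensional sphere is taken as the unit sphere embedded in $$\mathbb{R}^{d+1}$$, the Teacher is a Laplace kernel with $$\sigma=1$$, and a small jitter is added to the diagonal for numerical stability.

```python
import numpy as np

def sample_on_sphere(n, d, rng):
    """n points uniform on the d-dimensional unit sphere embedded in R^(d+1)."""
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_kernel(X, Y, sigma=1.0):
    dist = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    return np.exp(-dist / sigma)

def sample_teacher(X, kernel, rng, n_samples=1, jitter=1e-10):
    """Draw Z_T(x_1..x_n) ~ N(0, K_T); returns an array of shape (n, n_samples)."""
    cov = kernel(X, X) + jitter * np.eye(len(X))
    chol = np.linalg.cholesky(cov)
    return chol @ rng.standard_normal((len(X), n_samples))

rng = np.random.default_rng(0)
X = sample_on_sphere(10, 2, rng)
Z_T = sample_teacher(X, laplace_kernel, rng, n_samples=4000)

# Empirical covariance over the samples; it should approach K_T.
empirical_cov = Z_T @ Z_T.T / Z_T.shape[1]
```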

teacher-student: simulations

[Figures: generalization error $$\epsilon$$; exponent $$-\beta$$]

Can we understand these curves?

teacher-student: regression

Regression:

$$\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})$$

Minimize the training MSE $$= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2$$

Explicit solution:

$$\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T$$

where

• kernel overlap: $$(\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x}_\mu, \underline{x})$$

• Gram matrix: $$(\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu, \underline{x}_\nu)$$

• training data: $$(\underline{Z}_T)_\mu = Z_T(\underline{x}_\mu)$$

Compute the generalization error $$\epsilon$$ and how it scales with $$n$$:

$$\epsilon = \mathbb{E}_T \int\mathrm{d}^d\underline{x}\, \left[ \hat{Z}_S(\underline{x}) - Z_T(\underline{x}) \right]^2 \sim n^{-\beta}$$
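Because the Student predictor is linear in the Teacher values, the average over the Teacher can be taken in closed form: with $$\underline{a}(\underline{x}) = \mathbb{K}_S^{-1}\underline{k}_S(\underline{x})$$, one has $$\mathbb{E}_T[\hat{Z}_S(\underline{x}) - Z_T(\underline{x})]^2 = K_T(\underline{x},\underline{x}) - 2\,\underline{a}\cdot\underline{k}_T(\underline{x}) + \underline{a}\cdot\mathbb{K}_T\,\underline{a}$$. The sketch below evaluates this expression and averages it over random test points. The concrete setting (Laplace Teacher and Student with $$\sigma=1$$, points on a circle, the sample sizes, the jitter) is an assumption chosen to keep the example small, not the setup used for the figures.

```python
import numpy as np

def sample_on_sphere(n, d, rng):
    """n points uniform on the d-dimensional unit sphere embedded in R^(d+1)."""
    x = rng.standard_normal((n, d + 1))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_kernel(X, Y, sigma=1.0):
    dist = np.sqrt(((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1))
    return np.exp(-dist / sigma)

def expected_test_error(X_train, X_test, k_teacher, k_student, jitter=1e-10):
    """Teacher-averaged squared error of the Student, averaged over X_test."""
    n = len(X_train)
    gram_S = k_student(X_train, X_train) + jitter * np.eye(n)
    # Column j of A is K_S^{-1} k_S(x_j) for the j-th test point.
    A = np.linalg.solve(gram_S, k_student(X_train, X_test))
    var_target = np.diag(k_teacher(X_test, X_test))        # E[Z_T(x)^2]
    cross = (A * k_teacher(X_train, X_test)).sum(axis=0)   # E[Z_hat(x) Z_T(x)]
    var_pred = (A * (k_teacher(X_train, X_train) @ A)).sum(axis=0)  # E[Z_hat(x)^2]
    return float(np.mean(var_target - 2 * cross + var_pred))

rng = np.random.default_rng(0)
X_test = sample_on_sphere(500, 1, rng)
sizes = [50, 100, 200, 400]
errors = [
    expected_test_error(sample_on_sphere(n, 1, rng), X_test,
                        laplace_kernel, laplace_kernel)
    for n in sizes
]

# Fitted exponent of epsilon ~ n^(-beta) over this (small) range of n.
beta_fit = -np.polyfit(np.log(sizes), np.log(errors), 1)[0]
```

With so few sizes the fitted value is only indicative; finite-size effects at small $$n$$ can bias it.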

teacher-student: theorem (1/2)

To compute the generalization error:

• We look at the problem in the frequency domain

• We assume that $$\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}$$ and $$\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}$$ as $$|\!|\underline{w}|\!|\to\infty$$

• SIMPLIFYING ASSUMPTION: We take the $$n$$ points $$\underline{x}_\mu$$ on a regular $$d$$-dim lattice!

Then we can show that, for $$n\gg1$$,

$$\epsilon \sim n^{-\beta}$$

with

$$\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$

E.g. Laplace has $$\alpha=d+1$$ and Gaussian has $$\alpha=\infty$$

(details: arXiv:1905.10843)
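As a quick sanity check of the Laplace exponent in the simplest case $$d=1$$: the Fourier transform of $$\exp(-|x|/\sigma)$$ is $$2\sigma/(1+\sigma^2 w^2)$$, which decays as $$w^{-2}$$, i.e. $$\alpha = d+1 = 2$$. The sketch below fits the decay exponent on a log-log scale; the frequency range and $$\sigma=1$$ are arbitrary illustrative choices.

```python
import numpy as np

def laplace_spectrum_1d(w, sigma=1.0):
    """Closed-form Fourier transform of exp(-|x|/sigma) in d = 1."""
    return 2 * sigma / (1 + (sigma * w) ** 2)

# Fit the power-law decay at high frequency.
w = np.logspace(2, 4, 50)
slope = np.polyfit(np.log(w), np.log(laplace_spectrum_1d(w)), 1)[0]
alpha_fit = -slope  # expected to be close to d + 1 = 2
```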

teacher-student: theorem (2/2)

$$\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$

• Large $$\alpha \rightarrow$$ fast decay at high freq $$\rightarrow$$ indifference to local details

• $$\alpha_T$$ is intrinsic to the data (T), $$\alpha_S$$ depends on the algorithm (S)

• If $$\alpha_S$$ is large enough, $$\beta$$ takes the largest possible value $$\frac{\alpha_T - d}{d}$$ (optimal learning)

• As soon as $$\alpha_S$$ is small enough, $$\beta=\frac{2\alpha_S}d$$

What is the prediction for our simulations?

• If Teacher=Student=Laplace ($$\alpha_T=\alpha_S=d+1$$): $$\beta=\frac{\alpha_T-d}d = \frac1d$$ (curse of dimensionality!)

• If Teacher=Gaussian, Student=Laplace ($$\alpha_T=\infty, \alpha_S=d+1$$): $$\beta=\frac{2\alpha_S}d = 2+\frac2d$$
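The two predictions follow directly from the formula for $$\beta$$; a short sketch makes the arithmetic explicit. The function name and the example dimension $$d=5$$ are illustrative choices.

```python
import math

def predicted_beta(alpha_T, alpha_S, d):
    """beta = (1/d) * min(alpha_T - d, 2 * alpha_S)."""
    return min(alpha_T - d, 2 * alpha_S) / d

d = 5  # example dimension

# Teacher = Student = Laplace: alpha_T = alpha_S = d + 1, so beta = 1/d.
beta_laplace_laplace = predicted_beta(d + 1, d + 1, d)

# Teacher = Gaussian (alpha_T = infinity), Student = Laplace: beta = 2 + 2/d.
beta_gaussian_laplace = predicted_beta(math.inf, d + 1, d)
```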

teacher-student: comparison (1/2)

Exponent $$-\beta$$

• Our result matches the numerical simulations

• There are finite size effects (small $$n$$)

(on hypersphere)

teacher-student: Matérn TEACHER

Matérn kernels:

$$K_T(\underline x) = \frac{2^{1-\nu}}{\Gamma(\nu)} z^\nu \mathcal K_\nu(z), \quad z = \sqrt{2\nu} \frac{|\!|\underline x|\!|}\sigma, \quad \alpha = d+2\nu$$

Laplace student, $$K_S(\underline x) = \exp\left(-\frac{|\!|\underline x|\!|}\sigma\right)$$, $$d=1$$:

$$\beta=\min(2\nu,4)$$
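For half-integer $$\nu$$ the Matérn kernel reduces to an exponential times a polynomial, which avoids Bessel functions. The sketch below implements the standard closed forms for $$\nu = 1/2, 3/2, 5/2$$ (with $$z = \sqrt{2\nu}\,r/\sigma$$ as above) and tabulates the predicted exponent $$\beta=\min(2\nu,4)$$ for a Laplace student in $$d=1$$. Restricting to these three values of $$\nu$$ is an assumption made for simplicity.

```python
import numpy as np

def matern_kernel(r, nu, sigma=1.0):
    """Matern kernel at distance r for nu in {0.5, 1.5, 2.5} (closed forms)."""
    z = np.sqrt(2 * nu) * np.asarray(r, dtype=float) / sigma
    if nu == 0.5:
        poly = 1.0                      # coincides with the Laplace kernel
    elif nu == 1.5:
        poly = 1.0 + z
    elif nu == 2.5:
        poly = 1.0 + z + z ** 2 / 3.0
    else:
        raise ValueError("only nu in {0.5, 1.5, 2.5} is handled in this sketch")
    return poly * np.exp(-z)

r = np.linspace(0.0, 3.0, 7)
values = {nu: matern_kernel(r, nu) for nu in (0.5, 1.5, 2.5)}

# Predicted exponent for a Laplace student in d = 1: beta = min(2 nu, 4).
predicted_beta = {nu: min(2 * nu, 4) for nu in (0.5, 1.5, 2.5)}
```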

Same result with points on regular lattice or random hypersphere?

What matters is how nearest-neighbor distance $$\delta$$ scales with $$n$$

nearest-neighbor distance

In both cases  $$\delta\sim n^{-\frac1d}$$

Finite size effects: asymptotic scaling only when $$n$$ is large enough

(conjecture)

$$\longrightarrow$$ second order approximation with a Gaussian process $$K_T$$:

does it capture some aspects?

back to real data

• Gaussian processes are $$s$$-times (mean-square) differentiable, with $$s=\frac{\alpha_T-d}2$$

• Fitted exponents are $$\beta\approx0.4$$ (MNIST) and $$\beta\approx0.1$$ (CIFAR10), regardless of the Student $$\longrightarrow \beta=\frac{\alpha_T-d}d$$ (since $$\beta=\frac1d\min(\alpha_T-d,2\alpha_S)$$ is independent of $$\alpha_S$$)

$$\longrightarrow$$ $$s=\frac12 \beta d$$, i.e. $$s\approx 0.2d\approx156$$ (MNIST) and $$s\approx0.05d\approx153$$ (CIFAR10)

This number is unreasonably large!

effective dimension

• Measure NN-distance $$\delta$$

• $$\delta\sim n^{-\mathrm{some\ exponent}}$$

Define effective dimension as $$\delta \sim n^{-\frac1{d_\mathrm{eff}}}$$
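A minimal sketch of how $$d_\mathrm{eff}$$ could be estimated from this definition: measure the mean nearest-neighbor distance on random subsets of increasing size and fit the slope of $$\log\delta$$ versus $$\log n$$. The data here are synthetic (a 2-dimensional square embedded isometrically in $$\mathbb{R}^{10}$$), chosen as a hypothetical stand-in so that the expected answer is known; subset sizes and helper names are likewise illustrative.

```python
import numpy as np

def mean_nn_distance(X):
    """Mean Euclidean distance from each point to its nearest neighbor."""
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(d2, np.inf)            # exclude the point itself
    return np.sqrt(np.maximum(d2.min(axis=1), 0.0)).mean()

def effective_dimension(X, sizes, rng):
    """Fit delta ~ n^(-1/d_eff) on random subsets of the given sizes."""
    deltas = [
        mean_nn_distance(X[rng.choice(len(X), size=n, replace=False)])
        for n in sizes
    ]
    slope = np.polyfit(np.log(sizes), np.log(deltas), 1)[0]
    return -1.0 / slope

rng = np.random.default_rng(0)

# Synthetic data: uniform points on a 2-dim square, embedded in R^10
# through a matrix with orthonormal columns (distances are preserved).
U = rng.uniform(0.0, 1.0, size=(2000, 2))
Q = np.linalg.qr(rng.standard_normal((10, 2)))[0]   # shape (10, 2)
X = U @ Q.T                                          # shape (2000, 10)

d_eff = effective_dimension(X, sizes=[250, 500, 1000, 2000], rng=rng)
```

The estimate should come out close to 2 rather than 10, since the nearest-neighbor distance only sees the intrinsic dimension of the data.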

| | $$\beta$$ | $$d$$ | $$d_\mathrm{eff}$$ | $$s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor$$ |
|---|---|---|---|---|
| MNIST | 0.4 | 784 | 15 | 3 |
| CIFAR10 | 0.1 | 3072 | 35 | 1 |

$$\longrightarrow$$ $$d_\mathrm{eff}$$ is much smaller than $$d$$

$$\longrightarrow$$ $$s$$ is more reasonable!
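The smoothness values quoted for the two datasets can be reproduced from $$s=\frac12\beta d$$, using either the input dimension $$d$$ or the effective dimension $$d_\mathrm{eff}$$. The snippet below only redoes that arithmetic with the numbers given in the slides.

```python
import math

# Fitted exponents and dimensions as quoted in the slides.
datasets = {
    "MNIST":   {"beta": 0.4, "d": 784,  "d_eff": 15},
    "CIFAR10": {"beta": 0.1, "d": 3072, "d_eff": 35},
}

def smoothness(beta, dim):
    """s = beta * dim / 2."""
    return 0.5 * beta * dim

results = {}
for name, v in datasets.items():
    results[name] = {
        "s_with_d": smoothness(v["beta"], v["d"]),
        "s_with_d_eff": math.floor(smoothness(v["beta"], v["d_eff"])),
    }
    print(name, results[name])
```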

curse of dimensionality (1/2)

• Loosely speaking, the (optimal) exponent is

$$\beta \approx \frac{\text{smoothness}\ \ (\alpha_T-d = 2s)}{\text{manifold dimension}\ \ (d)}$$

• To avoid the curse of dimensionality ($$\beta\sim\frac1d$$):

• either the dimension of the manifold is small

• or the data are extremely smooth

curse of dimensionality (2/2)

• Assume that the data are not smooth enough and live in large dimension $$d$$

• Dimensionality reduction in the task rather than in the data?

• E.g. the $$n$$ points $$\underline x_\mu$$ live in $$\mathbb R^d$$, but the target function is such that

$$Z_T(\underline x) = Z_T(\underline x_\parallel) \equiv Z_T(x_1,\dots,x_{d_\parallel}), \quad d_\parallel < d$$

• Can kernels understand the lower dimensional structure?

Similar setting studied in Bach 2017

Theorem (informal formulation): in the described setting with $$d_\parallel \leq d$$, for $$n\gg1$$,

$$\epsilon \sim n^{-\beta}$$

with

$$\beta=\frac1d \min(\alpha_T - d, 2\alpha_S)$$

Regardless of $$d_\parallel$$!

Two reasons contribute to this result:

• the nearest-neighbor distance always scales as $$\delta \sim n^{-\frac1d}$$

• $$\alpha_T(d) - d$$ only depends on the function $$K_T(z)$$ and not on $$d$$

Similar result in Bach 2017

[Figure: Teacher = Matérn (with parameter $$\nu$$), Student = Laplace, $$d=4$$; plotted against $$n$$]

Classification with the margin SVM algorithm:

$$\hat y(\underline x) = \mathrm{sign}\left[ \sum_{\mu=1}^n c_\mu K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) + b \right]$$

find $$\{c_\mu\},b$$ by minimizing some function

We consider a very simple setting:

• the label is $$y(\underline x) = y(x_1) \ \longrightarrow \ d_\parallel=1$$

[Figure: labels $$y(x_1)=\pm$$ along $$x_1$$ for two interfaces: hyperplane and band]

Non-Gaussian data!

Vary kernel scale $$\sigma$$ $$\longrightarrow$$ two regimes!

• $$\sigma\ll\delta$$: then the estimator is tantamount to a nearest-neighbor algorithm $$\longrightarrow$$ curse of dimensionality $$\beta=\frac1d$$

• $$\sigma\gg\delta$$: important correlations in $$c_\mu$$ due to the long-range kernel. For the hyperplane with $$d_\parallel=1$$ we find $$\beta = \mathcal O(d^0)$$! No curse of dimensionality!

kernel correlations (1/2)

When $$\sigma\gg\delta$$ we can expand the kernel overlaps:

$$K\left(\frac{|\!|\underline x - \underline x^\mu|\!|}{\sigma}\right) \approx K(0) - \mathrm{const} \times \left(\frac{|\!|\underline x - \underline x^\mu|\!|}\sigma\right)^\xi$$

(the exponent $$\xi$$ is linked to the smoothness of the kernel)

We can derive some scaling arguments that lead to an exponent

$$\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }$$

Idea:

• support vectors ($$c_\mu\neq0$$) are close to the interface
• we impose that the decision boundary has $$\mathcal{O}(1)$$ spatial fluctuations on a scale proportional to $$\delta$$
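To see that this exponent does not vanish at large dimension, one can simply evaluate $$\beta = \frac{d+\xi-1}{3d+\xi-3}$$ for growing $$d$$ and compare it with the nearest-neighbor value $$\frac1d$$. The sketch below does this for $$\xi=1$$ (the Laplace value quoted in the talk); the list of dimensions is an arbitrary choice.

```python
def beta_svm(d, xi):
    """Scaling prediction beta = (d + xi - 1) / (3 d + xi - 3)."""
    return (d + xi - 1) / (3 * d + xi - 3)

dims = (2, 5, 10, 100, 1000)
table = {d: (beta_svm(d, xi=1), 1 / d) for d in dims}

for d, (beta_kernel, beta_nn) in table.items():
    print(f"d={d:5d}  beta={beta_kernel:.4f}  1/d={beta_nn:.4f}")
```

As $$d$$ grows the first column approaches $$\frac13$$, while $$\frac1d$$ goes to zero.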

kernel correlations (2/2)

$$\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }$$ in all these cases!

[Figures: plots against $$n$$ for the hyperplane and band interfaces, with Laplace kernel ($$\xi=1$$) and Matérn kernels ($$\xi = \min(2\nu,2)$$); panel label $$d=1$$]

conclusion

• Learning curves of real data decay as power laws with exponents $$\frac1d \ll \beta < \frac12$$

• We introduce a new framework that links the exponent $$\beta$$ to the degree of smoothness of Gaussian random data

• We justify how different kernels can lead to the same exponent $$\beta$$

• We show that the effective dimension of real data is $$\ll d$$. It can be linked to a (small) effective smoothness $$s$$

• We show that kernel regression is not able to capture invariants in the task, while kernel classification can (in some regime and for smooth interfaces)

arXiv:1905.10843 + paper to be released soon!

• Indeed, what happens if we consider a field $$Z_T(\underline{x})$$ that

• is an instance of a Teacher $$K_T$$ ($$\alpha_T$$)
• lies in the RKHS of a Student $$K_S$$ ($$\alpha_S$$)

$$\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty$$ $$\Longrightarrow$$ $$\alpha_T > \alpha_S + d$$

$$K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty$$ $$\Longrightarrow$$ $$\alpha_S > d$$

Therefore the smoothness must be $$s = \frac{\alpha_T-d}2 > \frac{d}2$$ (it scales with $$d$$!) $$\longrightarrow \beta > \frac12$$

the nearest-neighbor limit

Hyperplane interface, using a Laplace kernel and varying the dimension $$d$$: $$\beta=\frac1d$$

[Figure: plotted against $$n$$]

kernel correlations: hypersphere

boundary = hypersphere: $$y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)$$ ($$d_\parallel=1$$)

Laplace kernels ($$\xi=1$$): $$\beta = \frac{ d + \xi - 1 }{ 3d + \xi - 3 }$$ (same exponent!)

(similar scaling arguments apply, provided $$R\gg\delta$$)

[Figure: $$\pm$$ labels inside and outside the hypersphere; axes $$x_1$$, $$\underline{x}_\perp$$ and $$|\!|\underline x|\!|$$; plot against $$n$$]

Kernel methods and the curse of dimensionality

By Stefano Spigler


Talk given at the Courant Institute, NY, March 2020
