Kernel methods and the curse of dimensionality
Stefano Spigler
Jonas Paccolat, Mario Geiger, Matthieu Wyart

- Why and how does deep supervised learning work?
- Learn from examples: how many are needed?
- Typical tasks:
- Regression (fitting functions)
- Classification
supervised deep learning
- Performance is evaluated through the generalization error \(\epsilon\)
- Learning curves decay with the number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)
- \(\beta\) depends on the dataset and on the algorithm
Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]
learning curves
We lack a theory for \(\beta\) for deep networks!
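A power law \(\epsilon\sim n^{-\beta}\) is a straight line in log-log scale, so \(\beta\) can be read off as minus the slope. The sketch below assumes a plain least-squares fit; the function name `fit_exponent` and the synthetic data are illustrative and not taken from the talk.

```python
import math

def fit_exponent(ns, errors):
    """Least-squares slope of log(error) vs log(n); returns beta for error ~ n^(-beta)."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    cov = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var = sum((x - x_mean) ** 2 for x in xs)
    return -cov / var

# Synthetic learning curve with a known exponent (illustrative data, not a measurement).
ns = [100, 200, 400, 800, 1600]
errors = [3.0 * n ** -0.25 for n in ns]
beta = fit_exponent(ns, errors)
print(f"fitted beta = {beta:.3f}")
```

On measured curves the fit is only meaningful in the range of \(n\) where the decay is already a power law.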
Performance increases with overparametrization
\(\longrightarrow\) study the infinite-width limit!
[Jacot et al. 2018]
[Bruna and Mallat 2013, Arora et al. 2019]
What are the learning curves of kernels like?
link with kernel learning
(next slides)
[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]
[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]
With a specific scaling, infinite-width limit \(\to\) kernel learning
[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]
Neural Tangent Kernel
- Very brief introduction to kernel methods and real data
- Gaussian data: Teacher-Student regression
- Gaussian approximation: smoothness and effective dimension
- Role of invariance in the task?
- Kernel methods learn non-linear functions or boundaries
- Map data to a feature space, where the problem is linear
data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features
only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \longrightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)
kernel methods
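A small check of the statement that only scalar products of features are needed. The quadratic kernel \(K(\underline{x},\underline{x}^\prime)=(\underline{x}\cdot\underline{x}^\prime)^2\) is chosen here only because its feature map is finite and easy to write down; it is an illustrative assumption and not a kernel used in the talk.

```python
def phi(x):
    """Explicit feature map: all pairwise products x_i * x_j."""
    return [xi * xj for xi in x for xj in x]

def dot(u, v):
    """Scalar product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def kernel(x, xp):
    """Quadratic kernel K(x, x') = (x . x')^2, computed without building the features."""
    return dot(x, xp) ** 2

x = [1.0, 2.0, -0.5]
xp = [0.3, -1.0, 4.0]
lhs = dot(phi(x), phi(xp))  # scalar product in feature space
rhs = kernel(x, xp)         # same number from the kernel alone
print(lhs, rhs)
```

For Gaussian or Laplace kernels the feature space is infinite-dimensional, which is why one works with \(K\) directly.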
E.g. kernel regression:
Target function \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)
- Build an estimator \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
- Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)
- Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
kernel regression
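The three steps of kernel regression (build the estimator, minimize the training MSE, estimate \(\epsilon\)) can be sketched as follows. This is a minimal sketch assuming a Laplace kernel, a tiny ridge term for numerical stability, and a made-up target function. None of these choices, nor the helper names, come from the talk.

```python
import numpy as np

def laplace_kernel(a, b, sigma=1.0):
    """K(x, x') = exp(-||x - x'|| / sigma) for all pairs of rows of a and b."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def fit_coefficients(x_train, z_train, sigma=1.0, ridge=1e-10):
    """Solve K c = Z, which drives the training MSE to (numerically) zero."""
    gram = laplace_kernel(x_train, x_train, sigma)
    return np.linalg.solve(gram + ridge * np.eye(len(x_train)), z_train)

def predict(x_train, coeffs, x_new, sigma=1.0):
    """Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x)."""
    return laplace_kernel(x_new, x_train, sigma) @ coeffs

def target(x):
    """Illustrative target function Z (an assumption for this example)."""
    return np.sin(3.0 * x[:, 0])

rng = np.random.default_rng(0)
x_train = rng.uniform(-1, 1, size=(50, 2))
x_test = rng.uniform(-1, 1, size=(2000, 2))

coeffs = fit_coefficients(x_train, target(x_train))
train_mse = float(np.mean((predict(x_train, coeffs, x_train) - target(x_train)) ** 2))
# Monte Carlo estimate of the generalization error epsilon on fresh points.
test_mse = float(np.mean((predict(x_train, coeffs, x_test) - target(x_test)) ** 2))
print(train_mse, test_mse)
```

Repeating the last lines for increasing training-set sizes gives one learning curve.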
A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with norm
\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x} \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)
where \(K^{-1}(\underline{x},\underline{y})\) is such that
\(\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)
\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)
reproducing kernel hilbert space (rkhs)
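On a finite set of points the double integral becomes a quadratic form with the inverse Gram matrix, \(Z^\top \mathbb{K}^{-1} Z\). The sketch below checks this discrete analogue on a function of the form \(Z=\sum_\nu c_\nu K(\underline{x}_\nu,\cdot)\), whose squared RKHS norm is \(c^\top \mathbb{K} c\). The Laplace kernel and the random points are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(30, 3))                                   # sample points
dists = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
gram = np.exp(-dists)                                          # Laplace kernel Gram matrix

c = rng.normal(size=30)
z = gram @ c                                                   # Z(x_mu) = sum_nu c_nu K(x_nu, x_mu)

norm_sq_inverse = float(z @ np.linalg.solve(gram, z))          # Z^T K^{-1} Z
norm_sq_coeffs = float(c @ gram @ c)                           # c^T K c
print(norm_sq_inverse, norm_sq_coeffs)
```

Both numbers agree up to round-off, since \(Z^\top \mathbb{K}^{-1} Z = c^\top \mathbb{K}\,\mathbb{K}^{-1}\mathbb{K}\, c\).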
Regression: performance depends on the target function!
- If only assumed to be Lipschitz, then \(\beta=\frac1d\): curse of dimensionality! [Luxburg and Bousquet 2004]
- If assumed to be in the RKHS, then \(\beta\geq\frac12\), which does not depend on \(d\) [Smola et al. 1998, Rudi and Rosasco 2017]
- Yet, RKHS is a very strong assumption on the smoothness of the target function [Bach 2017]
previous works
\(d\) = dimension of the input space
real data and algorithms
We apply kernel methods on
- MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd
- CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5
We perform
- regression \(\longrightarrow\) kernel regression
- classification \(\longrightarrow\) margin SVM
- Same exponent for regression and classification
- Same exponent for Gaussian and Laplace kernel
- MNIST and CIFAR10 display exponents \(\beta\gg\frac1d\) but \(<\frac12\)
real data:
We need a new framework!
- Controlled setting: Teacher-Student regression
- Training data are sampled from a Gaussian Process:
\(Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\)
\(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere
- Regression is done with another kernel \(K_S\)
kernel teacher-student framework
\(\mathbb{E} Z_T(\underline{x}_\mu) = 0\)
\(\mathbb{E} Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)\)
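A minimal sketch of the Teacher-Student setup: draw one instance of the Teacher Gaussian process on random points, then regress with the Student kernel. Several choices here are assumptions made for illustration: both kernels are Laplace with unit scale, the points lie on the unit sphere embedded in \(\mathbb{R}^d\), and a small jitter is added before factorizing.

```python
import numpy as np

def sphere_points(n, d, rng):
    """n points drawn uniformly on the unit sphere embedded in R^d."""
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_gram(a, b):
    """Laplace kernel exp(-||x - x'||) between all rows of a and b."""
    return np.exp(-np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1))

rng = np.random.default_rng(0)
n, n_test, d = 200, 200, 3
x = sphere_points(n + n_test, d, rng)

# One draw of the Teacher field on all points: Z_T ~ N(0, K_T).
k_teacher = laplace_gram(x, x) + 1e-10 * np.eye(n + n_test)
z = np.linalg.cholesky(k_teacher) @ rng.normal(size=n + n_test)
x_tr, z_tr, x_te, z_te = x[:n], z[:n], x[n:], z[n:]

# Student regression: c = K_S^{-1} Z_T on the training points.
k_student = laplace_gram(x_tr, x_tr) + 1e-10 * np.eye(n)
coeffs = np.linalg.solve(k_student, z_tr)
z_hat = laplace_gram(x_te, x_tr) @ coeffs
gen_error = float(np.mean((z_hat - z_te) ** 2))
print(gen_error)
```

Sampling train and test values jointly keeps them consistent with a single realization of the Teacher. Averaging `gen_error` over many draws and several \(n\) would give the learning curve.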
teacher-student: simulations
[Figure: generalization error, with exponent \(-\beta\)]
Can we understand these curves?
teacher-student: regression
Compute the generalization error \(\epsilon\) and how it scales with \(n\)
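\(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)
Minimize the training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)
Explicit solution: \(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T\), with
- kernel overlap \(k_{S,\mu}(\underline{x}) = K_S(\underline{x}_\mu,\underline{x})\)
- Gram matrix \(\mathbb{K}_{S,\mu\nu} = K_S(\underline{x}_\mu,\underline{x}_\nu)\)
- training data \(Z_{T,\mu} = Z_T(\underline{x}_\mu)\)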
teacher-student: theorem (1/2)
To compute the generalization error:
- We look at the problem in the frequency domain
- We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)
- SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!
Then we can show that, for \(n\gg1\),
\(\epsilon\sim n^{-\beta}\) with \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\)
E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)
(details: arXiv:1905.10843)
teacher-student: theorem (2/2)
- Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details
- \(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)
- If \(\alpha_S\) is large enough, \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)
- As soon as \(\alpha_S\) is small enough, \(\beta=\frac{2\alpha_S}d\)
What is the prediction for our simulations?
- If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac1d\) (curse of dimensionality!)
- If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}d=2+\frac2d\)
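The two predictions can be evaluated directly from the exponent \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) stated in this talk, with \(\alpha=d+1\) for Laplace and \(\alpha=\infty\) for Gaussian. The helper names below are illustrative.

```python
import math

def predicted_beta(alpha_teacher, alpha_student, d):
    """beta = min(alpha_T - d, 2 * alpha_S) / d."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

def alpha_laplace(d):
    """Decay exponent of the Laplace kernel in Fourier space."""
    return d + 1

ALPHA_GAUSSIAN = math.inf  # the Gaussian kernel decays faster than any power law

for d in (2, 5, 10):
    both_laplace = predicted_beta(alpha_laplace(d), alpha_laplace(d), d)
    gaussian_teacher = predicted_beta(ALPHA_GAUSSIAN, alpha_laplace(d), d)
    print(d, both_laplace, gaussian_teacher)
```

With a Laplace Teacher the minimum is set by \(\alpha_T-d=1\), giving \(\beta=\frac1d\). With a Gaussian Teacher it is set by the Student, giving \(\beta=2+\frac2d\).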
teacher-student: comparison (1/2)
Exponent \(-\beta\)
- Our result matches the numerical simulations
- There are finite size effects (small \(n\))
(on hypersphere)
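teacher-student: Matérn teacher
Matérn kernels (with parameter \(\nu\)): \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-(d+2\nu)}\), i.e. \(\alpha_T = d+2\nu\)
Laplace student, \(\alpha_S=d+1 \longrightarrow \beta=\frac1d\min(2\nu,2(d+1))\)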
Same result with points on regular lattice or random hypersphere?
What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)
nearest-neighbor distance
In both cases \(\delta\sim n^{-\frac1d}\)
Finite size effects: asymptotic scaling only when \(n\) is large enough
What about real data?
\(\longrightarrow\) second order approximation with a Gaussian process \(K_T\):
does it capture some aspects?
back to real data
- Gaussian processes are \(s\)-times (mean-square) differentiable, with \(s=\frac{\alpha_T-d}2\)
- Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\)
\(\longrightarrow\) \(s=\frac12 \beta d\), \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)
This number is unreasonably large!
(since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) indep. of \(\alpha_S \longrightarrow \beta=\frac{\alpha_T-d}d\))
effective dimension
Measure NN-distance \(\delta\)
- \(\delta\sim n^{-\mathrm{some\ exponent}} \)
Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)
\(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor\)
\(d_\mathrm{eff}\) is much smaller
\(s\) is more reasonable!
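The definition \(\delta\sim n^{-\frac1{d_\mathrm{eff}}}\) suggests a simple estimate: measure the mean nearest-neighbor distance on subsamples of increasing size and fit the slope in log-log scale. The sketch below runs this on synthetic data (a 2-dimensional plane embedded in 50 ambient dimensions), which is an illustrative assumption. It is not the procedure used on MNIST or CIFAR10 in the talk.

```python
import numpy as np

def mean_nn_distance(x):
    """Mean distance from each point to its nearest neighbor."""
    sq_norms = np.sum(x ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    np.fill_diagonal(sq_dists, np.inf)  # exclude each point from its own search
    return float(np.sqrt(np.maximum(sq_dists.min(axis=1), 0.0)).mean())

def effective_dimension(data, sizes, rng):
    """Fit delta ~ n^(-1/d_eff) from the slope of log(delta) vs log(n)."""
    deltas = []
    for n in sizes:
        subset = data[rng.choice(len(data), size=n, replace=False)]
        deltas.append(mean_nn_distance(subset))
    slope = np.polyfit(np.log(sizes), np.log(deltas), 1)[0]
    return float(-1.0 / slope)

rng = np.random.default_rng(0)
latent = rng.uniform(size=(2000, 2))   # intrinsic coordinates
embedding = rng.normal(size=(2, 50))   # linear map into 50 dimensions
data = latent @ embedding
d_eff = effective_dimension(data, sizes=[250, 500, 1000, 2000], rng=rng)
print(d_eff)
```

The estimate comes out close to the intrinsic dimension 2 rather than the ambient dimension 50.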
curse of dimensionality (1/2)
- Loosely speaking, the (optimal) exponent is \(\beta\approx\frac{2s}{d_\mathrm{eff}}\) (smoothness over effective dimension)
- To avoid the curse of dimensionality (\(\beta\sim\frac1d\)):
- either the dimension of the manifold is small
- or the data are extremely smooth
curse of dimensionality (2/2)
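- Assume that the data are not smooth enough and live in large dimension \(d\)
Dimensionality reduction in the task rather than in the data?
- E.g. the \(n\) points \(\underline x_\mu\) live in \(\mathbb R^d\), but the target function is such that \(Z_T(\underline x) = Z_T(\underline x_\parallel)\), where \(\underline x_\parallel\) collects only \(d_\parallel \leq d\) of the components of \(\underline x\)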
- Can kernels understand the lower dimensional structure?
Similar setting studied in Bach 2017
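A sketch of such a task, assuming the target depends only on the first \(d_\parallel\) of the \(d\) components (in the same spirit as the label \(y(\underline x)=y(x_1)\) used later for classification). The Laplace Teacher, the Gaussian input points and the helper name are illustrative assumptions.

```python
import numpy as np

def teacher_covariance(x, d_par):
    """Laplace Teacher evaluated on the first d_par components only."""
    x_par = x[:, :d_par]
    dists = np.linalg.norm(x_par[:, None, :] - x_par[None, :, :], axis=-1)
    return np.exp(-dists)

rng = np.random.default_rng(0)
n, d, d_par = 100, 6, 2
x = rng.normal(size=(n, d))

# One instance of the target on the n points: Z_T ~ N(0, K_T(x_parallel)).
cov = teacher_covariance(x, d_par)
z = np.linalg.cholesky(cov + 1e-10 * np.eye(n)) @ rng.normal(size=n)

# Moving the points along the other d - d_par directions leaves the task unchanged.
x_shifted = x.copy()
x_shifted[:, d_par:] += rng.normal(size=(n, d - d_par))
cov_shifted = teacher_covariance(x_shifted, d_par)
max_diff = float(np.max(np.abs(cov - cov_shifted)))
print(max_diff)
```

An isotropic Student kernel still measures distances in all \(d\) directions, which is where the question of whether it can exploit the invariance comes from.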
task invariance: kernel regression (1/2)
Theorem (informal formulation):
in the described setting with \(d_\parallel \leq d\), for \(n\gg1\),
\(\epsilon\sim n^{-\beta}\) with \(\beta=\frac1d\min(\alpha_T(d)-d,2\alpha_S(d))\)
Regardless of \(d_\parallel\)!
Two reasons contribute to this result:
- the nearest-neighbor distance always scales as \(\delta \sim n^{-\frac1d}\)
- \(\alpha_T(d) - d\) only depends on the function \(K_T(z)\) and not on \(d\)
Similar result in Bach 2017
task invariance: kernel regression (2/2)
Teacher = Matérn (with parameter \(\nu\)), Student = Laplace, \(d\)=4
task invariance: classification (1/2)
Classification with the margin SVM algorithm:
find \(\{c_\mu\},b\) by minimizing some function
We consider a very simple setting:
- the label is \(y(\underline x) = y(x_1) \ \longrightarrow \ d_\parallel=1\)
Non-Gaussian data!
task invariance: classification (2/2)
Vary kernel scale \(\sigma\) \(\longrightarrow\) two regimes!
- \(\sigma\ll\delta\): then the estimator is tantamount to a nearest-neighbor algorithm \(\longrightarrow\) curse of dimensionality \(\beta=\frac1d\)
- \(\sigma\gg\delta\): important correlations in \(c_\mu\) due to the long-range kernel. For the hyperplane with \(d_\parallel=1\) we find \(\beta = \mathcal O(d^0)\): no curse of dimensionality!
kernel correlations (1/2)
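When \(\sigma\gg\delta\) we can expand the kernel overlaps: \(K(\underline x_\mu,\underline x) \approx K(0) - \mathrm{const}\times\left(\frac{|\!|\underline x_\mu-\underline x|\!|}{\sigma}\right)^\xi\)
(the exponent \(\xi\) is linked to the smoothness of the kernel)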
We can derive some scaling arguments that lead to an exponent
- support vectors (\(c_\mu\neq0\)) are close to the interface
- we impose that the decision boundary has \(\mathcal{O}(1)\) spatial fluctuations on a scale proportional to \(\delta\)
kernel correlations (2/2)
Laplace kernel \(\xi=1\)
Matérn kernels \(\xi = \min(2\nu,2)\)
in all these cases!
- Learning curves of real data decay as power laws with exponents \(\frac1d\ll\beta<\frac12\)
- We introduce a new framework that links the exponent \(\beta\) to the degree of smoothness of Gaussian random data
We justify how different kernels can lead to the same exponent \(\beta\)
- We show that the effective dimension of real data is \(\ll d\). It can be linked to a (small) effective smoothness \(s\)
- We show that kernel regression is not able to capture invariants in the task, while kernel classification can (in some regime and for smooth interfaces)
arXiv:1905.10843 + paper to be released soon!
- Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that
- is an instance of a Teacher \(K_T\)
- lies in the RKHS of a Student \(K_S\)
\(\alpha_T > \alpha_S + d\)
\(\alpha_S > d\)
\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) \textcolor{red}{< \infty}\)
\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) \textcolor{red}{< \infty}\)
Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\)
(it scales with \(d\)!)
\(\longrightarrow \beta > \frac12\)
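The two conditions on \(\alpha_T\) and \(\alpha_S\) can be recovered by power counting in Fourier space. The sketch below assumes translation-invariant kernels (so that \(K_S^{-1}\) is diagonal in frequency), keeps only the large-\(|\!|\underline{w}|\!|\) tails, and drops constant and volume factors.

```latex
\begin{align*}
\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2
  &\propto \int \mathrm{d}^d\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})}
  \sim \int^{\infty} \mathrm{d}w\, w^{d-1}\, w^{\alpha_S-\alpha_T} < \infty
  &&\Longleftrightarrow\quad \alpha_T > \alpha_S + d, \\
K_S(\underline{0})
  &\propto \int \mathrm{d}^d\underline{w}\, \tilde{K}_S(\underline{w})
  \sim \int^{\infty} \mathrm{d}w\, w^{d-1}\, w^{-\alpha_S} < \infty
  &&\Longleftrightarrow\quad \alpha_S > d, \\
&\Longrightarrow\quad \alpha_T > 2d
  \quad\Longrightarrow\quad s = \frac{\alpha_T-d}{2} > \frac{d}{2}.
\end{align*}
```

Each radial integral converges at large \(w\) only if the total power of \(w\) is below \(-1\), which gives the two inequalities.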
rkhs & smoothness
the nearest-neighbor limit
using a Laplace kernel
varying the dimension \(d\):
hyperplane interface
kernel correlations: hypersphere
What about other interfaces?
boundary = hypersphere: \(y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)\)
Laplace kernels (\(\xi=1\)): same exponent!
(similar scaling arguments apply, provided \(R\gg\delta\))
Kernel methods and the curse of dimensionality
By Stefano Spigler
Talk given at the Courant Institute, NY, March 2020