Kernel methods and the curse of dimensionality
Stefano Spigler
Jonas Paccolat, Mario Geiger, Matthieu Wyart
- Why and how does deep supervised learning work?
- Learn from examples: how many are needed?
- Typical tasks:
- Regression (fitting functions)
- Classification
supervised deep learning
- Performance is evaluated through the generalization error \(\epsilon\)
- Learning curves decay with the number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)
- \(\beta\) depends on the dataset and on the algorithm
Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]
learning curves
We lack a theory for \(\beta\) for deep networks!
- Performance increases with overparametrization
\(\longrightarrow\) study the infinite-width limit!
[Jacot et al. 2018]
[Bruna and Mallat 2013, Arora et al. 2019]
What are the learning curves of kernels like?
link with kernel learning
(next slides)
[Figure: \(\epsilon\) versus \(h\)]
[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]
[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]
- With a specific scaling, infinite-width limit \(\to\) kernel learning
[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]
Neural Tangent Kernel
- Very brief introduction to kernel methods and real data
- Gaussian data: Teacher-Student regression
- Gaussian approximation: smoothness and effective dimension
- Role of invariance in the task?
outline
- Kernel methods learn non-linear functions or boundaries
- Map data to a feature space, where the problem is linear
data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features
only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)
kernel methods
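Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|^2}{2\sigma^2}\right)\)
Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\left(-\frac{|\!|\underline{x}-\underline{x}^\prime|\!|}{\sigma}\right)\)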
E.g. kernel regression:
- Target function \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)
- Build an estimator \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
- Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)
- Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
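The kernel regression steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code used for the talk: the 1-dimensional target `target`, the sample sizes, the kernel scale `sigma` and the small `ridge` term are all assumptions chosen for the example.

```python
import numpy as np

def laplace_kernel(a, b, sigma=1.0):
    """Laplace kernel exp(-||x - x'|| / sigma) for all pairs of rows of a and b."""
    dists = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-dists / sigma)

def fit_kernel_regression(x_train, z_train, kernel, ridge=1e-10):
    """Coefficients c_mu minimizing the training MSE (tiny ridge for numerical stability)."""
    gram = kernel(x_train, x_train)
    return np.linalg.solve(gram + ridge * np.eye(len(x_train)), z_train)

def predict(x_new, x_train, coeffs, kernel):
    """Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x)."""
    return kernel(x_new, x_train) @ coeffs

rng = np.random.default_rng(0)
target = lambda x: np.sin(3.0 * x[:, 0])       # hypothetical target function Z(x)
x_train = rng.uniform(-1, 1, size=(50, 1))     # n = 50 training points
x_test = rng.uniform(-1, 1, size=(1000, 1))    # fresh points to estimate epsilon

coeffs = fit_kernel_regression(x_train, target(x_train), laplace_kernel)
train_mse = np.mean((predict(x_train, x_train, coeffs, laplace_kernel) - target(x_train)) ** 2)
test_mse = np.mean((predict(x_test, x_train, coeffs, laplace_kernel) - target(x_test)) ** 2)
print(f"training MSE = {train_mse:.2e}, estimated generalization error = {test_mse:.2e}")
```

The generalization error is estimated here by a Monte Carlo average over fresh test points, which stands in for the expectation over \(\underline{x}\).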
kernel regression
A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with (squared) norm
\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x} \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)
where \(K^{-1}(\underline{x},\underline{y})\) is such that
\(\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)
\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)
reproducing kernel hilbert space (rkhs)
Regression: performance depends on the target function!
- If only assumed to be Lipschitz, then \(\beta=\frac1d\)
- If assumed to be in the RKHS, then \(\beta\geq\frac12\) does not depend on \(d\)
- Yet, RKHS is a very strong assumption on the smoothness of the target function
Curse of dimensionality!
[Luxburg and Bousquet 2004]
[Smola et al. 1998, Rudi and Rosasco 2017]
[Bach 2017]
previous works
\(d\) = dimension of the input space
real data and algorithms
We apply kernel methods on
- MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes: even/odd
- CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes: first 5/last 5
We perform
- regression \(\longrightarrow\) kernel regression
- classification \(\longrightarrow\) margin SVM
- Same exponent for regression and classification
- Same exponent for Gaussian and Laplace kernel
- MNIST and CIFAR10 display exponents \(\beta\gg\frac1d\) but \(<\frac12\): \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10)
real data: exponents
We need a new framework!
- Controlled setting: Teacher-Student regression
- Training data are sampled from a Gaussian Process:
\(Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\)
\(\underline{x}_\mu\) are random on a \(d\)-dim hypersphere
- Regression is done with another kernel \(K_S\)
kernel teacher-student framework
\(\mathbb{E} Z_T(\underline{x}_\mu) = 0\)
\(\mathbb{E} Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)\)
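A minimal sketch of this Teacher-Student setup, under stated assumptions: points are drawn uniformly on the unit sphere in \(\mathbb{R}^d\), both Teacher and Student are Laplace kernels, and the helper names (`sample_hypersphere`, `teacher_student_error`), the `jitter` term and the sample sizes are illustrative choices rather than the settings behind the slides.

```python
import numpy as np

def sample_hypersphere(n, d, rng):
    """n points drawn uniformly on the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def pairwise_dist(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

def laplace(r, sigma=1.0):
    return np.exp(-r / sigma)

def teacher_student_error(n, d, k_teacher, k_student, n_test=200, seed=0, jitter=1e-8):
    """Test MSE of Student kernel regression on one Gaussian-process Teacher sample."""
    rng = np.random.default_rng(seed)
    x = sample_hypersphere(n + n_test, d, rng)
    # Teacher: Z_T ~ N(0, K_T), sampled jointly on training and test points
    cov = k_teacher(pairwise_dist(x, x)) + jitter * np.eye(n + n_test)
    z = np.linalg.cholesky(cov) @ rng.standard_normal(n + n_test)
    x_tr, z_tr, x_te, z_te = x[:n], z[:n], x[n:], z[n:]
    # Student: kernel regression with K_S on the n training points
    gram = k_student(pairwise_dist(x_tr, x_tr)) + jitter * np.eye(n)
    coeffs = np.linalg.solve(gram, z_tr)
    z_hat = k_student(pairwise_dist(x_te, x_tr)) @ coeffs
    return np.mean((z_hat - z_te) ** 2)

errors = {n: teacher_student_error(n, d=3, k_teacher=laplace, k_student=laplace)
          for n in (32, 128, 512)}
for n, err in errors.items():
    print(n, err)
```

Averaging over several seeds and plotting the error against \(n\) on a log-log scale would give an empirical learning curve. A very smooth Teacher such as a Gaussian kernel makes the covariance matrix ill-conditioned, so the Cholesky step would likely need a larger `jitter`.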
teacher-student: simulations
[Figure: generalization error versus \(n\), with fitted exponent \(-\beta\)]
Can we understand these curves?
teacher-student: regression
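Regression:
\(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)
Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)
Explicit solution:
\(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x}) \cdot \mathbb{K}_S^{-1} \underline{Z}_T\)
where
- \(\underline{k}_S(\underline{x})\), with components \(K_S(\underline{x}_\mu,\underline{x})\), is the kernel overlap
- \(\mathbb{K}_S\), with entries \(K_S(\underline{x}_\mu,\underline{x}_\nu)\), is the Gram matrix
- \(\underline{Z}_T\), with components \(Z_T(\underline{x}_\mu)\), is the training data
Compute the generalization error \(\epsilon\) and how it scales with \(n\)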
teacher-student: theorem (1/2)
To compute the generalization error:
- We look at the problem in the frequency domain
- We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)
- SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!
Then we can show that \(\epsilon\sim n^{-\beta}\) for \(n\gg1\), with \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\)
E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)
(details: arXiv:1905.10843)
teacher-student: theorem (2/2)
- Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details
- \(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)
- If \(\alpha_S\) is large enough, \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)
- As soon as \(\alpha_S\) is small enough, \(\beta=\frac{2\alpha_S}d\)
What is the prediction for our simulations?
- If Teacher=Student=Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac1d\) (curse of dimensionality!)
- If Teacher=Gaussian, Student=Laplace (\(\alpha_T=\infty, \alpha_S=d+1\)): \(\beta=\frac{2(d+1)}d=2+\frac2d\)
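The exponent \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) can be evaluated directly for the two Teacher-Student pairs considered here. The short sketch below does only that arithmetic; the function name `predicted_beta` and the choice \(d=10\) are illustrative assumptions.

```python
import math

def predicted_beta(alpha_teacher, alpha_student, d):
    """beta = min(alpha_T - d, 2 * alpha_S) / d."""
    return min(alpha_teacher - d, 2 * alpha_student) / d

d = 10  # illustrative input dimension

# Laplace kernel: alpha = d + 1; Gaussian kernel: alpha = infinity
beta_laplace_laplace = predicted_beta(d + 1, d + 1, d)        # equals 1/d
beta_gaussian_laplace = predicted_beta(math.inf, d + 1, d)    # equals 2(d+1)/d

print(beta_laplace_laplace, beta_gaussian_laplace)
```

With a Laplace Teacher the minimum is set by \(\alpha_T-d=1\), so the exponent decays as \(\frac1d\). With a Gaussian Teacher the Student is the bottleneck and the exponent stays of order one.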
teacher-student: comparison (1/2)
Exponent \(-\beta\)
- Our result matches the numerical simulations
- There are finite size effects (small \(n\))
(on hypersphere)
teacher-student: comparison (2/2)
teacher-student: Matérn teacher
Matérn kernels:
Laplace student,
Same result with points on regular lattice or random hypersphere?
What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)
nearest-neighbor distance
In both cases \(\delta\sim n^{-\frac1d}\)
Finite size effects: asymptotic scaling only when \(n\) is large enough
(conjecture)
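The scaling of the nearest-neighbor distance with \(n\) can be checked numerically. In this sketch the points are random on the unit sphere in \(\mathbb{R}^3\), whose intrinsic dimension is 2, so the fitted slope should come out close to \(-\frac12\). The sample sizes and the helper name `mean_nn_distance` are assumptions made for the example.

```python
import numpy as np

def mean_nn_distance(n, dim, rng):
    """Average nearest-neighbor distance of n random points on the unit sphere in R^dim."""
    x = rng.standard_normal((n, dim))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    # For unit vectors ||x - y||^2 = 2 - 2 x.y ; clip tiny negative rounding errors
    sq_dists = np.clip(2.0 - 2.0 * (x @ x.T), 0.0, None)
    np.fill_diagonal(sq_dists, np.inf)   # exclude each point's distance to itself
    return np.sqrt(sq_dists.min(axis=1)).mean()

rng = np.random.default_rng(0)
dim = 3                                   # sphere in R^3: intrinsic dimension 2
ns = np.array([100, 200, 400, 800, 1600])
deltas = np.array([mean_nn_distance(n, dim, rng) for n in ns])

# Fit delta ~ n^slope on a log-log scale
slope = np.polyfit(np.log(ns), np.log(deltas), 1)[0]
print(f"fitted exponent: {slope:.3f}")
```

Applied to real data, the same log-log fit is what defines an effective dimension from the measured exponent.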
What about real data?
\(\longrightarrow\) second order approximation with a Gaussian process \(K_T\): does it capture some aspects?
back to real data
- Gaussian processes are \(s\)-times (mean-square) differentiable, with \(s=\frac{\alpha_T-d}2\)
- Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student \(\longrightarrow \beta=\frac{\alpha_T-d}d\) (since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) is independent of \(\alpha_S\))
\(\longrightarrow\) \(s=\frac12 \beta d\): \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)
This number is unreasonably large!
effective dimension
- Measure NN-distance \(\delta\): \(\delta\sim n^{-\mathrm{some\ exponent}}\)
- Define effective dimension as \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\)
- MNIST: \(\beta\approx0.4\), \(d=784\), \(d_\mathrm{eff}\approx15\) \(\longrightarrow\) \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=3\)
- CIFAR10: \(\beta\approx0.1\), \(d=3072\), \(d_\mathrm{eff}\approx35\) \(\longrightarrow\) \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=1\)
\(\longrightarrow\) \(d_\mathrm{eff}\) is much smaller than \(d\), and \(s\) is more reasonable!
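The smoothness estimates quoted for MNIST and CIFAR10 follow from \(s=\frac12\beta d\), evaluated once with the ambient dimension \(d\) and once with the effective dimension \(d_\mathrm{eff}\). The sketch below redoes that arithmetic with exact fractions; the dictionary layout and the helper name `smoothness` are illustrative.

```python
from fractions import Fraction
from math import floor

# Values quoted in the slides: fitted exponent, ambient and effective dimension
datasets = {
    "MNIST":   {"beta": Fraction("0.4"), "d": 784,  "d_eff": 15},
    "CIFAR10": {"beta": Fraction("0.1"), "d": 3072, "d_eff": 35},
}

def smoothness(beta, dim):
    """s = beta * dim / 2, kept as an exact fraction."""
    return beta * dim / 2

results = {}
for name, v in datasets.items():
    results[name] = {
        "s_ambient": float(smoothness(v["beta"], v["d"])),         # uses d
        "s_effective": floor(smoothness(v["beta"], v["d_eff"])),   # uses d_eff
    }
    print(name, results[name])
```

Using \(d\) gives a smoothness above 150 for both datasets, while using \(d_\mathrm{eff}\) brings it down to 3 and 1.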
curse of dimensionality (1/2)
- Loosely speaking, the (optimal) exponent is \(\beta\approx\frac{2s}{d_\mathrm{eff}}\)
- To avoid the curse of dimensionality (\(\beta\sim\frac1d\)):
- either the dimension of the manifold is small
- or the data are extremely smooth
curse of dimensionality (2/2)
- Assume that the data are not smooth enough and live in \(d\) large
- Dimensionality reduction in the task rather than in the data?
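- E.g. the \(n\) points \(\underline x_\mu\) live in \(\mathbb R^d\), but the target function is such that \(Z_T(\underline x) = Z_T(\underline x_\parallel)\), with \(\underline x_\parallel \equiv (x_1,\dots,x_{d_\parallel})\)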
- Can kernels understand the lower dimensional structure?
Similar setting studied in Bach 2017
task invariance: kernel regression (1/2)
Theorem (informal formulation):
in the described setting with \(d_\parallel \leq d\), \(\epsilon\sim n^{-\beta}\) for \(n\gg1\), with \(\beta=\frac1d\min(\alpha_T(d)-d,2\alpha_S(d))\)
Regardless of \(d_\parallel\)!
Two reasons contribute to this result:
- the nearest-neighbor distance always scales as \(\delta \sim n^{-\frac1d}\)
- \(\alpha_T(d) - d\) only depends on the function \(K_T(z)\) and not on \(d\)
Similar result in Bach 2017
task invariance: kernel regression (2/2)
Teacher = Matérn (with parameter \(\nu\)), Student = Laplace, \(d\)=4
task invariance: classification (1/2)
Classification with the margin SVM algorithm:
find \(\{c_\mu\},b\) by minimizing some function
We consider a very simple setting:
- the label is \(y(\underline x) = y(x_1) \ \longrightarrow \ d_\parallel=1\)
[Figure: two configurations of \(+\) and \(-\) labelled points, with the interface given by a hyperplane and by a band]
Non-Gaussian data!
task invariance: classification (2/2)
Vary kernel scale \(\sigma\) \(\longrightarrow\) two regimes!
- \(\sigma\ll\delta\): then the estimator is tantamount to a nearest-neighbor algorithm \(\longrightarrow\) curse of dimensionality \(\beta=\frac1d\)
- \(\sigma\gg\delta\): important correlations in \(c_\mu\) due to the long-range kernel. For the hyperplane with \(d_\parallel=1\) we find \(\beta = \mathcal O(d^0)\)! No curse of dimensionality!
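The \(\sigma\ll\delta\) regime can be illustrated numerically. As a loudly stated assumption, the sketch below replaces the margin SVM with plain kernel interpolation of the labels using a Laplace kernel, which is simpler to write in NumPy; it then compares the sign of that estimator with a 1-nearest-neighbor classifier on a hyperplane interface \(y(\underline x)=\mathrm{sign}(x_1)\). The dimension, sample sizes and the ratio \(\sigma/\delta=0.1\) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, n_test = 2, 400, 2000
x_train = rng.uniform(-1, 1, size=(n, d))
x_test = rng.uniform(-1, 1, size=(n_test, d))
y_train = np.sign(x_train[:, 0])            # hyperplane interface, d_parallel = 1

def dists(a, b):
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

train_d = dists(x_train, x_train)
# Mean nearest-neighbor distance delta (diagonal masked with +inf)
delta = (train_d + np.diag(np.full(n, np.inf))).min(axis=1).mean()

sigma = 0.1 * delta                          # kernel scale much smaller than delta
gram = np.exp(-train_d / sigma)              # Laplace kernel Gram matrix
coeffs = np.linalg.solve(gram, y_train)      # interpolation stand-in for the SVM

test_d = dists(x_test, x_train)
kernel_pred = np.sign(np.exp(-test_d / sigma) @ coeffs)
nn_pred = y_train[test_d.argmin(axis=1)]     # 1-nearest-neighbor prediction

agreement = np.mean(kernel_pred == nn_pred)
print(f"delta = {delta:.3f}, agreement with 1-NN = {agreement:.3f}")
```

For such a short-range kernel each prediction is dominated by the closest training point, so the two classifiers should agree on almost all test points.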
kernel correlations (1/2)
When \(\sigma\gg\delta\) we can expand the kernel overlaps:
(the exponent \(\xi\) is linked to the smoothness of the kernel)
We can derive some scaling arguments that lead to an exponent
Idea:
- support vectors (\(c_\mu\neq0\)) are close to the interface
- we impose that the decision boundary has \(\mathcal{O}(1)\) spatial fluctuations on a scale proportional to \(\delta\)
kernel correlations (2/2)
[Figures: learning curves versus \(n\) for the hyperplane and band interfaces, with the Laplace kernel (\(\xi=1\)) and Matérn kernels (\(\xi = \min(2\nu,2)\))]
in all these cases!
conclusion
- Learning curves of real data decay as power laws with exponents \(\frac1d\ll\beta<\frac12\)
- We introduce a new framework that links the exponent \(\beta\) to the degree of smoothness of Gaussian random data
- We justify how different kernels can lead to the same exponent \(\beta\)
- We show that the effective dimension of real data is \(\ll d\). It can be linked to a (small) effective smoothness \(s\)
- We show that kernel regression is not able to capture invariants in the task, while kernel classification can (in some regime and for smooth interfaces)
arXiv:1905.10843 + paper to be released soon!
- Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that
- is an instance of a Teacher \(K_T\) (\(\alpha_T\))
- lies in the RKHS of a Student \(K_S\) (\(\alpha_S\))
\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x} \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) \textcolor{red}{< \infty}\) \(\Longrightarrow\) \(\alpha_T > \alpha_S + d\)
\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) \textcolor{red}{< \infty}\) \(\Longrightarrow\) \(\alpha_S > d\)
Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!) \(\longrightarrow \beta > \frac12\)
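The two conditions on \(\alpha_T\) and \(\alpha_S\) can be recovered by power counting in Fourier space. The following is a sketch under two assumptions: the kernels are translation invariant, so that \(K_S^{-1}\) acts as \(1/\tilde{K}_S(\underline{w})\), and only the large-\(|\!|\underline{w}|\!|\) behaviour \(\tilde{K}\sim|\!|\underline{w}|\!|^{-\alpha}\) matters for convergence.

```latex
% Assumptions: translation-invariant kernels; convergence is controlled by the
% large-w tails \tilde{K}_T \sim w^{-\alpha_T} and \tilde{K}_S \sim w^{-\alpha_S}.
\begin{align*}
\mathbb{E}_T \lVert Z_T \rVert_{K_S}^2
  &\propto \int \mathrm{d}^d\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})}
  \sim \int^{\infty} \mathrm{d}w\, w^{d-1}\, w^{\alpha_S-\alpha_T} < \infty
  &&\Longleftrightarrow\quad \alpha_T > \alpha_S + d, \\
K_S(\underline{0})
  &\propto \int \mathrm{d}^d\underline{w}\, \tilde{K}_S(\underline{w})
  \sim \int^{\infty} \mathrm{d}w\, w^{d-1}\, w^{-\alpha_S} < \infty
  &&\Longleftrightarrow\quad \alpha_S > d, \\
&\Longrightarrow\quad \alpha_T > 2d
  \quad\Longrightarrow\quad s = \frac{\alpha_T - d}{2} > \frac{d}{2}.
\end{align*}
```

Each radial integral converges at large \(w\) only if the total power of \(w\) is below \(-1\), which gives the two inequalities.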
rkhs & smoothness
the nearest-neighbor limit
hyperplane interface, using a Laplace kernel and varying the dimension \(d\): \(\beta=\frac1d\)
kernel correlations: hypersphere
What about other interfaces?
boundary = hypersphere: \(y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)\) (\(d_\parallel=1\))
Laplace kernels (\(\xi=1\)): same exponent! (similar scaling arguments apply, provided \(R\gg\delta\))
[Figure: \(+\) and \(-\) labelled points separated by a hypersphere; learning curves versus \(n\)]
Talk given at the Courant Institute, NY, March 2020