Asymptotic Learning Curves of Kernel Methods
Stefano Spigler, Mario Geiger, Matthieu Wyart
- Why and how does deep supervised learning work?
- Learn from examples: how many are needed?
- Typical tasks:
- Regression (fitting functions)
- Classification
Supervised deep learning
- Performance is evaluated through the generalization error \(\epsilon\)
- Learning curves decay with number of examples \(n\), often as \(\epsilon\sim n^{-\beta}\)
- \(\beta\) depends on the dataset and on the algorithm
- Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]
- We lack a theory for \(\beta\) for deep networks
Learning curves
- Performance increases with overparametrization [Neyshabur et al. 2017, 2018, Advani and Saxe 2017] [Belkin et al. 2018, Spigler et al. 2018, Geiger et al. 2019]
  \(\longrightarrow\) study the infinite-width limit! [Jacot et al. 2018]
- With a specific scaling, infinite-width limit \(\to\) kernel learning [Mei et al. 2017, Rotskoff and Vanden-Eijnden 2018, Jacot et al. 2018, Chizat and Bach 2018, ...]
- Some kernels achieve almost the performance of deep networks [Bruna and Mallat 2013, Arora et al. 2019]
- What are the learning curves of kernels like? (next slide)
[Plot: generalization error \(\epsilon\) vs network width \(h\)]
Link with kernel learning
- Kernel methods learn non-linear functions or boundaries
- Data are mapped to a feature space, where the problem is treated linearly
  data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features
- Only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \longrightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)
- E.g. Gaussian: \(K(\underline{x},\underline{x}^\prime) = e^{-|\!|\underline{x}-\underline{x}^\prime|\!|^2/(2\sigma^2)}\); Laplace: \(K(\underline{x},\underline{x}^\prime) = e^{-|\!|\underline{x}-\underline{x}^\prime|\!|/\sigma}\) (sketch below)
Kernel methods
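A minimal illustration of these two kernels in NumPy (the length scale \(\sigma\) below is an arbitrary illustrative choice):

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def laplace_kernel(x, y, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    return np.exp(-np.linalg.norm(x - y) / sigma)

# Only kernel values are ever needed, never the feature map itself
x, xp = np.random.randn(10), np.random.randn(10)
print(gaussian_kernel(x, xp), laplace_kernel(x, xp))
```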
E.g. kernel regression:
- Target function \(\underline{x}_\mu \to Z(\underline{x}_\mu),\ \ \mu=1,\dots,n\)
- Build an estimator \(\hat{Z}_K(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
- Minimize training MSE \(= \frac1n \sum_{\mu=1}^n \left[ \hat{Z}_K(\underline{x}_\mu) - Z(\underline{x}_\mu) \right]^2\)
- Estimate the generalization error \(\epsilon = \mathbb{E}_{\underline{x}} \left[ \hat{Z}_K(\underline{x}) - Z(\underline{x}) \right]^2\)
Kernel regression
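A minimal sketch of the kernel regression procedure above, in NumPy; the target function, the Laplace kernel width and the tiny ridge added for numerical stability are illustrative choices, not taken from the slides:

```python
import numpy as np

def laplace_gram(X, Y, sigma=1.0):
    # Gram matrix K[i, j] = exp(-||X_i - Y_j|| / sigma)
    return np.exp(-np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) / sigma)

rng = np.random.default_rng(0)
d, n, n_test = 5, 500, 1000
X = rng.standard_normal((n, d))                 # training points x_mu
Z = np.sin(X.sum(axis=1))                       # example target Z(x_mu)

# Coefficients c_mu minimizing the training MSE: solve K c = Z
K = laplace_gram(X, X)
c = np.linalg.solve(K + 1e-10 * np.eye(n), Z)   # tiny ridge for numerical stability

# Estimator Z_hat(x) = sum_mu c_mu K(x_mu, x), evaluated on fresh points
X_test = rng.standard_normal((n_test, d))
Z_hat = laplace_gram(X_test, X) @ c
eps = np.mean((Z_hat - np.sin(X_test.sum(axis=1))) ** 2)
print(f"generalization error (test MSE): {eps:.4f}")
```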
A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with (squared) norm
\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}\underline{x}\, \mathrm{d} \underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)
where the inverse kernel \(K^{-1}(\underline{x},\underline{y})\) is such that
\(\int \mathrm{d} \underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x}-\underline{z})\)
\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)
Reproducing Kernel Hilbert Space
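For the translation-invariant kernels used here, the same squared norm can equivalently be written in Fourier space (up to constants depending on the Fourier convention):
\(\lvert\!\lvert Z \rvert\!\rvert_K^2 \propto \int \mathrm{d}\underline{w}\, \frac{|\tilde{Z}(\underline{w})|^2}{\tilde{K}(\underline{w})}\)
so belonging to \(\mathcal{H}_K\) requires \(\tilde{Z}(\underline{w})\) to decay fast wherever \(\tilde{K}(\underline{w})\) does; this form is used again later on.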
Regression: performance depends on the target function! (\(d\) = dimension of the input space)
- If it is only assumed to be Lipschitz, then \(\beta=\frac1d\) \(\longrightarrow\) curse of dimensionality! [Luxburg and Bousquet 2004]
- If it is assumed to be in the RKHS, then \(\beta\) does not depend on \(d\) (one finds \(\beta=\frac12\)) [Smola et al. 1998, Rudi and Rosasco 2017]
- Yet, RKHS is a very strong assumption on the smoothness of the target function (see later on) [Bach 2017]
Previous works
We apply kernel methods on
- MNIST: 70000 28x28 b/w pictures, 2 classes (even/odd digits), dimension \(d = 784\)
- CIFAR10: 60000 32x32 RGB pictures, 2 classes (first 5/last 5 categories), dimension \(d = 3072\)
We perform
- regression \(\longrightarrow\) kernel regression
- classification \(\longrightarrow\) margin SVM
Datasets and algorithms
- Same exponent for regression and classification
- Same exponent for Gaussian and Laplace kernel
- MNIST and CIFAR10 display exponents \(\beta\) different from \(\frac12,\frac1d\): \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10)
We need a new framework!
Real exponents
- Controlled setting: Teacher-Student regression
- Training data are sampled from a Gaussian Process:
  \(Z(\underline{x}_1),\dots,Z(\underline{x}_n)\ \sim\ \mathcal{N}(0, K_T)\), i.e. \(\mathbb{E}\, Z(\underline{x}_\mu) = 0\) and \(\mathbb{E}\, Z(\underline{x}_\mu) Z(\underline{x}_\nu) = K_T(\underline{x}_\mu-\underline{x}_\nu)\)
  where the \(\underline{x}_\mu\) are random points on a \(d\)-dim hypersphere
- Regression is done with another kernel \(K_S\) (minimal sketch below)
Teacher-Student: simulation
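A minimal sketch of such a Teacher-Student simulation (Laplace Teacher and Student, a few values of \(n\); the kernel widths, jitter and sizes are illustrative choices):

```python
import numpy as np

def laplace_gram(X, Y, sigma=1.0):
    # Gram matrix of the Laplace kernel, K[i, j] = exp(-||X_i - Y_j|| / sigma)
    return np.exp(-np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1) / sigma)

def sphere_points(m, d, rng):
    # m points uniformly distributed on the unit sphere in R^d
    X = rng.standard_normal((m, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
d, n_test = 8, 1000

for n in [125, 250, 500, 1000]:
    X_all = sphere_points(n + n_test, d, rng)

    # Teacher: draw Z ~ N(0, K_T) jointly on training and test points
    K_T = laplace_gram(X_all, X_all)
    Z_all = np.linalg.cholesky(K_T + 1e-8 * np.eye(n + n_test)) @ rng.standard_normal(n + n_test)

    # Student: kernel regression with K_S on the first n points
    X, Z = X_all[:n], Z_all[:n]
    c = np.linalg.solve(laplace_gram(X, X) + 1e-8 * np.eye(n), Z)
    Z_hat = laplace_gram(X_all[n:], X) @ c
    print(f"n = {n:5d}   eps = {np.mean((Z_hat - Z_all[n:]) ** 2):.4f}")

# The slope of log(eps) vs log(n) estimates the exponent -beta
```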
[Plot: generalization error \(\epsilon\) vs \(n\); the fitted slope gives the exponent \(-\beta\)]
Can we understand these curves?
Teacher-Student: analytical
Regression: the solution can be written explicitly,
\(\hat{Z}_S(\underline{x}) = \sum_{\mu,\nu=1}^n K_S(\underline{x},\underline{x}_\mu)\, \left(\mathbb{K}_S^{-1}\right)_{\mu\nu} Z(\underline{x}_\nu)\)
where \(\left(\mathbb{K}_S\right)_{\mu\nu} = K_S(\underline{x}_\mu,\underline{x}_\nu)\) is the Gram matrix of the training points
Compute the generalization error \(\epsilon\) and how it scales with \(n\)
Teacher-Student: analytical
To compute the generalization error:
- We look at the problem in the frequency domain
- We assume that \(\tilde{K}_S(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_S}\) and \(\tilde{K}_T(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_T}\) as \(|\!|\underline{w}|\!|\to\infty\)
- SIMPLIFYING ASSUMPTION: We take the \(n\) points \(\underline{x}_\mu\) on a regular \(d\)-dim lattice!
Then we can show that, for \(n\gg1\),
\(\epsilon \sim n^{-\beta}\) with \(\beta = \frac1d \min(\alpha_T - d,\, 2\alpha_S)\)
E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\)
(details: arXiv:1905.10843)
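A one-line translation of this prediction (the function name is just for illustration), evaluated on the two Teacher/Student pairs used in the simulations:

```python
def predicted_beta(alpha_T, alpha_S, d):
    # epsilon ~ n^(-beta) with beta = min(alpha_T - d, 2 * alpha_S) / d
    return min(alpha_T - d, 2 * alpha_S) / d

d = 8
print(predicted_beta(d + 1, d + 1, d))          # Laplace Teacher and Student: 1/d
print(predicted_beta(float("inf"), d + 1, d))   # Gaussian Teacher, Laplace Student: 2(d+1)/d
```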
Teacher-Student
- Large \(\alpha \rightarrow\) fast decay at high freq \(\rightarrow\) indifference to local details
- \(\alpha_T\) is intrinsic to the data (T), \(\alpha_S\) depends on the algorithm (S)
- If \(\alpha_S\) is large enough (\(2\alpha_S \ge \alpha_T - d\)), \(\beta\) takes the largest possible value \(\frac{\alpha_T - d}{d}\) (optimal learning)
- If instead \(\alpha_S\) is too small (\(2\alpha_S < \alpha_T - d\)), then \(\beta=\frac{2\alpha_S}d\)
Teacher-Student
What is the prediction for our simulations?
- If Teacher = Student = Laplace (\(\alpha_T=\alpha_S=d+1\)): \(\beta = \frac1d\) (curse of dimensionality!)
- If Teacher = Gaussian, Student = Laplace (\(\alpha_T=\infty,\ \alpha_S=d+1\)): \(\beta = \frac{2(d+1)}d\)
Teacher-Student: comparison
[Plot: fitted exponent \(-\beta\) for simulations on the hypersphere, compared with the prediction]
- Our result matches the numerical simulations
- There are finite size effects (small \(n\))
Same result with points on a regular lattice or random on a hypersphere?
What matters is how the nearest-neighbor distance \(\delta\) scales with \(n\) (conjecture)
In both cases \(\delta\sim n^{-\frac1d}\)
Finite size effects: asymptotic scaling only when \(n\) is large enough
Nearest-neighbor distance
What about real data? \(\longrightarrow\) assume they are instances of some Gaussian process \(K_T\)
- Such instances are \(s\)-times (mean-square) differentiable with \(s=\frac{\alpha_T-d}2\)
- Fitted exponents are \(\beta\approx0.4\) (MNIST) and \(\beta\approx0.1\) (CIFAR10), regardless of the Student
  (since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\), independence of \(\alpha_S\) implies \(\beta=\frac{\alpha_T-d}d\))
- \(\longrightarrow\) \(s=\frac12 \beta d\), i.e. \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)
- This number is unreasonably large!
Real data
Effective dimension
- Measure the NN-distance \(\delta\): \(\delta\sim n^{-\mathrm{some\ exponent}}\)
- Define the effective dimension through \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\) (sketch below)

MNIST: \(\beta\approx0.4\), \(d=784\), \(d_\mathrm{eff}\approx15\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=3\)
CIFAR10: \(\beta\approx0.1\), \(d=3072\), \(d_\mathrm{eff}\approx35\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=1\)

\(\longrightarrow\) \(d_\mathrm{eff}\) is much smaller than \(d\) \(\longrightarrow\) \(s\) is more reasonable
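A minimal sketch of how \(d_\mathrm{eff}\) can be estimated from the nearest-neighbor scaling (shown on synthetic Gaussian data of known dimension; applying it to MNIST/CIFAR10 vectors would follow the same steps):

```python
import numpy as np

def mean_nn_distance(X):
    # Mean distance from each point to its nearest neighbor,
    # using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b to avoid a huge intermediate array
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.fill_diagonal(D2, np.inf)
    return np.sqrt(np.maximum(D2.min(axis=1), 0)).mean()

rng = np.random.default_rng(0)
d = 10                                    # true dimension of the synthetic data
ns = np.array([250, 500, 1000, 2000])
deltas = np.array([mean_nn_distance(rng.standard_normal((n, d))) for n in ns])

# delta ~ n^(-1/d_eff): fit the slope of log(delta) vs log(n)
slope = np.polyfit(np.log(ns), np.log(deltas), 1)[0]
print("estimated d_eff:", -1 / slope)     # roughly recovers d, up to finite-size effects
```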
RKHS and smoothness
- Indeed, what happens if we consider a field \(Z_T(\underline{x})\) that
  - is an instance of a Teacher \(K_T\) (exponent \(\alpha_T\))
  - lies in the RKHS of a Student \(K_S\) (exponent \(\alpha_S\))?
- The RKHS norm must be finite:
  \(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) < \infty\)
  \(\Longrightarrow\ \alpha_T > \alpha_S + d\)
- Moreover \(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) \(\Longrightarrow\ \alpha_S > d\)
- Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!) \(\longrightarrow \beta > \frac12\)
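A quick way to see the first implication, assuming translation-invariant kernels and working up to constants: in Fourier space
\(\int \mathrm{d}^d\underline{x}\, \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) \propto \int \mathrm{d}\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})} \sim \int \mathrm{d}\underline{w}\, |\!|\underline{w}|\!|^{\alpha_S-\alpha_T}\)
which converges at large \(|\!|\underline{w}|\!|\) only if \(\alpha_T - \alpha_S > d\); likewise \(\int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) < \infty\) requires \(\alpha_S > d\).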
Conclusion
- MNIST and CIFAR10 display power laws in the learning curves, with exponents \(\beta\approx 0.4,0.1\) (resp.) \(\gg\frac1d\)
- \(\beta\) is the same for regression and classification tasks with Gaussian and Laplace kernels
- We introduced a new framework that allows for different degrees of smoothness in the data, where we can compute \(\beta\)
- We defined an effective dimension for real data (\(\ll d\)), that is linked to an effective smoothness \(s\)
MNIST: \(\beta\approx0.4\), \(d=784\), \(d_\mathrm{eff}\approx15\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=3\)
CIFAR10: \(\beta\approx0.1\), \(d=3072\), \(d_\mathrm{eff}\approx35\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=1\)