Stefano Spigler
Jonas Paccolat, Mario Geiger, Matthieu Wyart
Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]
\(\epsilon\sim n^{-\beta}\)
We lack a theory for \(\beta\) for deep networks!
Performance increases with overparametrization
\(\longrightarrow\) study the infinite-width limit!
[Jacot et al. 2018]
[Bruna and Mallat 2013, Arora et al. 2019]
What are the learning curves of kernels like?
(next slides)
[Figure: test error \(\epsilon\) vs network width \(h\).]
[Neyshabur et al. 2017, 2018, Advani and Saxe 2017; Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]
With a specific scaling, infinite-width limit \(\to\) kernel learning
[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]
Neural Tangent Kernel
Very brief introduction to kernel methods and real data
data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use linear combination of features
only scalar products are needed: \(\underline{\phi}(\underline{x}) \cdot \underline{\phi}(\underline{x}^\prime) = K(\underline{x},\underline{x}^\prime)\) \(\longrightarrow\) the kernel \(K(\underline{x},\underline{x}^\prime)\)
Gaussian: \(K(\underline{x},\underline{x}^\prime) = e^{-|\!|\underline{x}-\underline{x}^\prime|\!|^2/(2\sigma^2)}\)
Laplace: \(K(\underline{x},\underline{x}^\prime) = e^{-|\!|\underline{x}-\underline{x}^\prime|\!|/\sigma}\)
E.g. kernel regression: \(\hat{Z}(\underline{x}) = \sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x})\)
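As a concrete illustration of the two kernels and of the kernel-regression predictor above, here is a minimal NumPy sketch; the toy target, the bandwidth \(\sigma\) and the small ridge term are illustrative assumptions, not the settings used in the talk.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'||^2 / (2 sigma^2))
    d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma**2))

def laplace_kernel(X, Y, sigma=1.0):
    # K(x, x') = exp(-||x - x'|| / sigma)
    dist = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    return np.exp(-dist / sigma)

def kernel_regression_fit(X_train, z_train, kernel, ridge=1e-10):
    # Solve for the coefficients c_mu in Z_hat(x) = sum_mu c_mu K(x_mu, x);
    # a tiny ridge term keeps the Gram matrix invertible.
    gram = kernel(X_train, X_train)
    return np.linalg.solve(gram + ridge * np.eye(len(X_train)), z_train)

def kernel_regression_predict(X_test, X_train, c, kernel):
    return kernel(X_test, X_train) @ c

# toy usage: learn a smooth target in d = 5 dimensions
rng = np.random.default_rng(0)
d, n = 5, 200
X = rng.standard_normal((n, d))
z = np.cos(X.sum(axis=1))                       # placeholder target
c = kernel_regression_fit(X, z, laplace_kernel)
X_new = rng.standard_normal((10, d))
print(kernel_regression_predict(X_new, X, c, laplace_kernel))
```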
A kernel \(K\) induces a corresponding Hilbert space \(\mathcal{H}_K\) with norm given by
\(\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d\underline{y}\, Z(\underline{x}) K^{-1}(\underline{x},\underline{y}) Z(\underline{y})\)
where \(K^{-1}(\underline{x},\underline{y})\) is such that
\(\int \mathrm{d}^d\underline{y}\, K^{-1}(\underline{x},\underline{y}) K(\underline{y},\underline{z}) = \delta(\underline{x},\underline{z})\)
\(\mathcal{H}_K\) is called the Reproducing Kernel Hilbert Space (RKHS)
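A standard reformulation (not spelled out on the slides) makes the smoothness interpretation explicit: for a translation-invariant kernel \(K(\underline{x},\underline{x}^\prime)=K(\underline{x}-\underline{x}^\prime)\), the inverse is diagonal in Fourier space and, up to \((2\pi)\) factors,
\[
\lvert\!\lvert Z \rvert\!\rvert_K^2 = \int \mathrm{d}^d\underline{w}\, \frac{|\tilde{Z}(\underline{w})|^2}{\tilde{K}(\underline{w})},
\qquad
\tilde{K}(\underline{w}) = \int \mathrm{d}^d\underline{x}\, K(\underline{x})\, e^{-i\underline{w}\cdot\underline{x}} ,
\]
so \(Z\in\mathcal{H}_K\) forces \(|\tilde{Z}(\underline{w})|^2\) to decay faster than \(\tilde{K}(\underline{w})\) at large frequencies: membership in the RKHS is a smoothness condition.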
Regression: performance depends on the target function!
If only assumed to be Lipschitz, then \(\beta=\frac1d\)
If assumed to be in the RKHS, then \(\beta\geq\frac12\) does not depend on \(d\)
Yet, RKHS is a very strong assumption on the smoothness of the target function
Curse of dimensionality!
[Luxburg and Bousquet 2004]
[Smola et al. 1998, Rudi and Rosasco 2017]
[Bach 2017]
\(d\) = dimension of the input space
We apply kernel methods on:
MNIST: 70000 28x28 b/w pictures, dimension \(d = 784\), 2 classes (even/odd)
CIFAR10: 60000 32x32 RGB pictures, dimension \(d = 3072\), 2 classes (first 5/last 5)
We perform:
regression \(\longrightarrow\) kernel regression
classification \(\longrightarrow\) margin SVM
[Figure: generalization error \(\epsilon\) vs \(n\) on MNIST (\(\beta\approx0.4\)) and CIFAR10 (\(\beta\approx0.1\)); the slope gives the exponent \(-\beta\).]
Can we understand these curves? We need a new framework!
Teacher-Student setting: the target is a Gaussian random field \(Z_T\) (the Teacher) with
\(\mathbb{E}\, Z_T(\underline{x}_\mu) = 0\)
\(\mathbb{E}\, Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(|\!|\underline{x}_\mu-\underline{x}_\nu|\!|)\)
Regression: \(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)
Minimize \(\frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)
Explicit solution: \(\hat{Z}_S(\underline{x}) = \sum_{\mu,\nu=1}^n K_S(\underline{x},\underline{x}_\mu)\, [\mathbb{K}_S^{-1}]_{\mu\nu}\, Z_T(\underline{x}_\nu)\), where \(K_S(\underline{x},\underline{x}_\mu)\) is the kernel overlap, \([\mathbb{K}_S]_{\mu\nu} = K_S(\underline{x}_\mu,\underline{x}_\nu)\) is the Gram matrix, and \(Z_T(\underline{x}_\nu)\) are the training data.
Compute the generalization error \(\epsilon\) and how it scales with \(n\)
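A minimal sketch of this Teacher-Student regression experiment, assuming points on the hypersphere and Laplace kernels for both Teacher and Student; the sample sizes, jitter terms and number of seeds are illustrative choices, not the ones behind the figures.

```python
import numpy as np

def laplace_kernel(X, Y, sigma=1.0):
    dist = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    return np.exp(-dist / sigma)

def teacher_student_error(n, n_test=200, d=3, sigma=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # sample train + test points on the unit hypersphere
    X = rng.standard_normal((n + n_test, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    # Teacher: Gaussian random field with covariance K_T (here a Laplace kernel)
    K_T = laplace_kernel(X, X, sigma) + 1e-10 * np.eye(n + n_test)
    Z = rng.multivariate_normal(np.zeros(n + n_test), K_T)
    X_tr, X_te, Z_tr, Z_te = X[:n], X[n:], Z[:n], Z[n:]
    # Student: explicit kernel-regression solution via the Gram matrix inverse
    gram = laplace_kernel(X_tr, X_tr, sigma) + 1e-10 * np.eye(n)
    c = np.linalg.solve(gram, Z_tr)
    Z_hat = laplace_kernel(X_te, X_tr, sigma) @ c
    return np.mean((Z_hat - Z_te) ** 2)

# estimate beta from the slope of log(eps) vs log(n)
ns = np.array([64, 128, 256, 512, 1024])
eps = [np.mean([teacher_student_error(n, seed=s) for s in range(5)]) for n in ns]
beta = -np.polyfit(np.log(ns), np.log(eps), 1)[0]
print(f"measured beta ~ {beta:.2f}  (prediction for Laplace/Laplace: 1/d = {1/3:.2f})")
```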
To compute the generalization error, average over the Teacher Gaussian process. Then we can show that, for \(n\gg1\),
\(\epsilon \sim n^{-\beta}\) with \(\beta = \frac1d \min(\alpha_T - d,\, 2\alpha_S)\),
where \(\alpha_T, \alpha_S\) characterize the decay of the kernels' Fourier transforms, \(\tilde{K}(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha}\). E.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\).
When the min selects \(\alpha_T - d\), the exponent does not depend on the Student (optimal learning).
(details: arXiv:1905.10843)
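The prediction is easy to evaluate; a tiny helper (the function name is mine) applied to the kernels just mentioned:

```python
def predicted_beta(alpha_T, alpha_S, d):
    # beta = (1/d) * min(alpha_T - d, 2 * alpha_S)
    return min(alpha_T - d, 2 * alpha_S) / d

d = 3
print(predicted_beta(d + 1, d + 1, d))         # Laplace Teacher, Laplace Student: 1/d
print(predicted_beta(float("inf"), d + 1, d))  # Gaussian Teacher, Laplace Student: 2(d+1)/d
```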
What is the prediction for our simulations? (Teacher-Student data on the hypersphere, Laplace Student)
Laplace Teacher (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac1d\) (curse of dimensionality!)
Gaussian Teacher (\(\alpha_T=\infty,\ \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}{d}=\frac{2(d+1)}{d}\)
[Figure: generalization error vs \(n\) on the hypersphere; Laplace student, Matérn kernels; the slopes give the exponent \(-\beta\).]
Same result with points on regular lattice or random hypersphere?
What matters is how nearest-neighbor distance \(\delta\) scales with \(n\)
In both cases \(\delta\sim n^{-\frac1d}\)
Finite size effects: asymptotic scaling only when \(n\) is large enough
(conjecture)
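A short numerical check of this claim, and of the effective-dimension estimator used below for real data: measure the median nearest-neighbor distance \(\delta(n)\) and read the dimension off the slope of \(\log\delta\) vs \(\log n\). The uniform-cube sampler is a stand-in; the same estimator can be fed MNIST or CIFAR10 vectors.

```python
import numpy as np
from scipy.spatial import cKDTree

def median_nn_distance(X):
    # k=2: the closest hit of each point is the point itself, so take the second
    dist, _ = cKDTree(X).query(X, k=2)
    return np.median(dist[:, 1])

def effective_dimension(sample, ns):
    # fit delta ~ n^(-1/d_eff) on a log-log scale
    deltas = [median_nn_distance(sample(n)) for n in ns]
    slope, _ = np.polyfit(np.log(ns), np.log(deltas), 1)
    return -1.0 / slope

rng = np.random.default_rng(0)
d = 5
uniform_cube = lambda n: rng.random((n, d))   # uniform points in the d-dimensional cube
print(effective_dimension(uniform_cube, [500, 1000, 2000, 4000, 8000]))   # ~ d
```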
What about real data?
\(\longrightarrow\) second order approximation with a Gaussian process \(K_T\):
does it capture some aspects?
\(\longrightarrow\) it would imply a Teacher smoothness \(s=\frac12 \beta d\): \(s\approx 0.2d\approx156\) (MNIST) and \(s\approx0.05d\approx153\) (CIFAR10)
(since \(\beta=\frac1d\min(\alpha_T-d,2\alpha_S)\) is here indep. of \(\alpha_S\) \(\longrightarrow\) \(\beta=\frac{\alpha_T-d}d\), i.e. \(s=\frac{\alpha_T-d}2=\frac12\beta d\))
This number is unreasonably large!
Define the effective dimension from the nearest-neighbor distance: \(\delta \sim n^{-\frac1{d_\mathrm{eff}}}\) \(\longrightarrow\)
MNIST: \(\beta=0.4\), \(d=784\), \(d_\mathrm{eff}=15\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=3\)
CIFAR10: \(\beta=0.1\), \(d=3072\), \(d_\mathrm{eff}=35\), \(s=\left\lfloor\frac12 \beta d_\mathrm{eff}\right\rfloor=1\)
\(d_\mathrm{eff}\) is much smaller \(\longrightarrow\) \(s\) is more reasonable!
Similar setting studied in Bach 2017
Theorem (informal formulation): in the described setting with \(d_\parallel \leq d\), for \(n\gg1\) the generalization error decays as \(\epsilon\sim n^{-\beta}\) with an exponent \(\beta\) that is the same regardless of \(d_\parallel\)!
Two reasons contribute to this result.
Similar result in Bach 2017.
[Figure: generalization error vs \(n\) for Teacher = Matérn (with parameter \(\nu\)), Student = Laplace, \(d=4\).]
Classification with the margin SVM algorithm: \(\hat{y}(\underline{x}) = \mathrm{sign}\left(\sum_{\mu=1}^n c_\mu K(\underline{x}_\mu,\underline{x}) + b\right)\), where \(\{c_\mu\}, b\) are found by minimizing some function.
We consider a very simple setting:
[Figure: points labeled + and \(-\) in the plane, separated either by a hyperplane or by a band.]
Non-Gaussian data!
Vary kernel scale \(\sigma\) \(\longrightarrow\) two regimes!
No curse of dimensionality!
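A minimal sketch of this classification experiment: labels \(y(\underline{x})=\mathrm{sign}(x_1)\) (hyperplane interface, \(d_\parallel=1\)), a margin SVM with a precomputed Laplace Gram matrix, and the exponent read off the slope of the test error vs \(n\). The uniform sampling, \(\sigma\), \(C\) and the sample sizes are illustrative assumptions, not the settings used for the figures.

```python
import numpy as np
from sklearn.svm import SVC

def laplace_gram(X, Y, sigma=2.0):
    dist = np.sqrt(np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1))
    return np.exp(-dist / sigma)

def svm_test_error(n, n_test=2000, d=5, sigma=2.0, seed=0):
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n + n_test, d))
    y = np.sign(X[:, 0])                      # hyperplane interface: y(x) = sign(x_1)
    X_tr, X_te, y_tr, y_te = X[:n], X[n:], y[:n], y[n:]
    clf = SVC(C=1e6, kernel="precomputed")    # large C ~ (hard-)margin SVM
    clf.fit(laplace_gram(X_tr, X_tr, sigma), y_tr)
    y_hat = clf.predict(laplace_gram(X_te, X_tr, sigma))
    return np.mean(y_hat != y_te)

ns = np.array([64, 128, 256, 512, 1024])
eps = [np.mean([svm_test_error(n, seed=s) for s in range(5)]) for n in ns]
beta = -np.polyfit(np.log(ns), np.log(eps), 1)[0]
print(f"measured classification exponent beta ~ {beta:.2f}")
```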
When \(\sigma\gg\delta\) we can expand the kernel overlaps: \(K(z) \simeq K(0) - c\,(z/\sigma)^\xi + \dots\) for \(z\ll\sigma\) (the exponent \(\xi\) is linked to the smoothness of the kernel)
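For instance, for the Laplace kernel (the \(c\,(z/\sigma)^\xi\) form above is an assumed parametrization, consistent with the \(\xi\) values quoted below):
\[
e^{-z/\sigma} = 1 - \frac{z}{\sigma} + O\!\left(\frac{z^2}{\sigma^2}\right) \;\Longrightarrow\; \xi = 1,
\qquad
K_{\mathrm{Mat\acute{e}rn}}(0) - K_{\mathrm{Mat\acute{e}rn}}(z) \sim \left(\frac{z}{\sigma}\right)^{\min(2\nu,2)} \;\Longrightarrow\; \xi = \min(2\nu,2).
\]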
We can derive some scaling arguments that lead to the exponent \(\beta\).
Laplace kernel: \(\xi=1\); Matérn kernels: \(\xi = \min(2\nu,2)\).
[Figure: test error vs \(n\) for the hyperplane and band interfaces with Laplace and Matérn kernels; the measured exponents agree with the scaling prediction in all these cases!]
No curse of dimensionality for classification (in some regime and for smooth interfaces).
arXiv:1905.10843 + paper to be released soon!
Why is the RKHS assumption so strong? For the Teacher (\(\alpha_T\)) to lie in the RKHS of the Student (\(\alpha_S\)):
\(\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 = \mathbb{E}_T \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d \underline{y}\, Z_T(\underline{x}) K_S^{-1}(\underline{x},\underline{y}) Z_T(\underline{y}) = \int \mathrm{d}^d\underline{x}\, \mathrm{d}^d \underline{y}\, K_T(\underline{x},\underline{y}) K_S^{-1}(\underline{x},\underline{y}) \textcolor{red}{< \infty} \Longrightarrow \alpha_T > \alpha_S + d\)
\(K_S(\underline{0}) \propto \int \mathrm{d}\underline{w}\, \tilde{K}_S(\underline{w}) \textcolor{red}{< \infty} \Longrightarrow \alpha_S > d\)
Therefore the smoothness must be \(s = \frac{\alpha_T-d}2 > \frac{d}2\) (it scales with \(d\)!) \(\longrightarrow \beta > \frac12\)
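The first implication follows from a one-line computation in Fourier space (per unit volume and up to constant factors; this step is implicit in the slide), using \(\tilde{K}_{T,S}(\underline{w}) \sim |\!|\underline{w}|\!|^{-\alpha_{T,S}}\) at large \(|\!|\underline{w}|\!|\):
\[
\mathbb{E}_T \lvert\!\lvert Z_T \rvert\!\rvert_{K_S}^2 \propto \int \mathrm{d}^d\underline{w}\, \frac{\tilde{K}_T(\underline{w})}{\tilde{K}_S(\underline{w})} \sim \int \mathrm{d}^d\underline{w}\, |\!|\underline{w}|\!|^{\,\alpha_S-\alpha_T} < \infty \;\Longleftrightarrow\; \alpha_T - \alpha_S > d .
\]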
[Figure: test error vs \(n\) for the hyperplane interface (\(d_\parallel=1\)), using a Laplace kernel and varying the dimension \(d\); annotation \(\beta=\frac1d\).]
What about other interfaces? Boundary = hypersphere: \(y(\underline x) = \mathrm{sign}(|\!|\underline x|\!|-R)\), Laplace kernels (\(\xi=1\)): same exponent! (Similar scaling arguments apply, provided \(R\gg\delta\).)
[Figure: points labeled + and \(-\) separated by a hyperspherical boundary.]