Deep networks: \(\beta\sim 0.07\)-\(0.35\) [Hestness et al. 2017]
\(\epsilon\sim n^{-\beta}\)
We lack a theory of \(\beta\) for deep networks!
Performance increases with overparametrization
\(\longrightarrow\) study the infinite-width limit!
[Jacot et al. 2018]
What are the learning curves of kernels like?
(next slides)
[Plots: generalization error \(\epsilon\) vs. network width \(h\)]
[Neyshabur et al. 2017, 2018, Advani and Saxe 2017]
[Spigler et al. 2018, Geiger et al. 2019, Belkin et al. 2019]
With a specific scaling, infinite-width limit \(\to\) kernel learning
[Rotskoff and Vanden-Eijnden 2018, Mei et al. 2017, Jacot et al. 2018, Chizat and Bach 2018, ...]
Neural Tangent Kernel
Very brief introduction to kernel methods and real data
data \(\underline{x} \longrightarrow \underline{\phi}(\underline{x}) \longrightarrow \) use a linear combination of the features
Only scalar products are needed: \(\underline{\phi}(\underline{x})\cdot\underline{\phi}(\underline{x}^\prime) \rightarrow\) kernel \(K(\underline{x},\underline{x}^\prime)\)
Gaussian: \(K(\underline{x},\underline{x}^\prime) = \exp\!\left(-\frac{\|\underline{x}-\underline{x}^\prime\|^2}{2\sigma^2}\right)\)
Laplace: \(K(\underline{x},\underline{x}^\prime) = \exp\!\left(-\frac{\|\underline{x}-\underline{x}^\prime\|}{\sigma}\right)\)
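As a concrete illustration, a minimal sketch (not from the slides; the bandwidth \(\sigma\) and the toy data are arbitrary choices) of evaluating these two Gram matrices:
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist

def gram_matrix(X, Xp, kernel="laplace", sigma=1.0):
    """Gram matrix K(x, x') for all pairs of rows of X (n x d) and Xp (m x d)."""
    dist = cdist(X, Xp)                       # Euclidean distances ||x - x'||
    if kernel == "gaussian":
        return np.exp(-dist**2 / (2 * sigma**2))
    if kernel == "laplace":
        return np.exp(-dist / sigma)
    raise ValueError(kernel)

# toy usage: 5 random points in d = 3 dimensions
X = np.random.default_rng(0).standard_normal((5, 3))
K = gram_matrix(X, X, kernel="gaussian", sigma=1.0)
print(K.shape)   # (5, 5)
\end{verbatim}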
Regression: performance depends on the target function!
With the weakest hypotheses, \(\beta=\frac1d\) \(\longrightarrow\) curse of dimensionality! [Luxburg and Bousquet 2004]
With strong smoothness assumptions, \(\beta\geq\frac12\), independent of \(d\) [Smola et al. 1998, Rudi and Rosasco 2017, Bach 2017]
(\(d\) = dimension of the input space)
Kernel regression on:
MNIST: 70000 28x28 b/w pictures, 2 classes (even/odd), dimension \(d = 784\) \(\longrightarrow\) \(\beta\approx0.37\)
CIFAR10: 60000 32x32 RGB pictures, 2 classes (first 5 / last 5), dimension \(d = 3072\) \(\longrightarrow\) \(\beta\approx0.08\)
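A minimal sketch of how such an exponent can be measured (assumptions not in the slides: scikit-learn's fetch_openml for MNIST, a Laplace kernel with Euclidean distance, \(\pm1\) labels, a small ridge for numerical stability, and a simple log-log fit; the original experiments may differ in these details):
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
y = 2.0 * (y.astype(int) % 2) - 1.0            # 2 classes, even/odd -> +/-1

rng = np.random.default_rng(0)
perm = rng.permutation(len(X))
test, pool = perm[:2000], perm[2000:]
sigma = np.sqrt(X.shape[1])                    # arbitrary bandwidth choice

ns, errs = [500, 1000, 2000, 4000, 8000], []
for n in ns:
    tr = pool[:n]
    K = np.exp(-cdist(X[tr], X[tr]) / sigma) + 1e-8 * np.eye(n)   # Gram + ridge
    c = np.linalg.solve(K, y[tr])
    pred = np.exp(-cdist(X[test], X[tr]) / sigma) @ c
    errs.append(np.mean((pred - y[test]) ** 2))

beta = -np.polyfit(np.log(ns), np.log(errs), 1)[0]   # slope of the learning curve
print(f"beta ~ {beta:.2f}")
\end{verbatim}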
Teacher: the target is a centered Gaussian random field \(Z_T\) (artificial, synthetic data), with
\(\mathbb{E}\, Z_T(\underline{x}_\mu) = 0, \qquad \mathbb{E}\, Z_T(\underline{x}_\mu) Z_T(\underline{x}_\nu) = K_T(\|\underline{x}_\mu-\underline{x}_\nu\|)\)
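A minimal sketch of how such a synthetic target can be drawn (assumptions: points uniform on the unit hypersphere, a Laplace teacher kernel, and a small jitter for a stable Cholesky factorization):
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist

def sample_teacher(n, d, kernel_T, rng):
    """Sample Z_T at n points on the unit sphere in d dimensions:
    centered Gaussian field with covariance K_T(||x_mu - x_nu||)."""
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)      # project onto the sphere
    C = kernel_T(cdist(X, X)) + 1e-10 * np.eye(n)      # covariance matrix (+ jitter)
    Z_T = np.linalg.cholesky(C) @ rng.standard_normal(n)
    return X, Z_T

rng = np.random.default_rng(0)
laplace_T = lambda r: np.exp(-r)                       # Laplace teacher kernel
X, Z_T = sample_teacher(n=1000, d=5, kernel_T=laplace_T, rng=rng)
\end{verbatim}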
[Plot: generalization error vs. \(n\) for the synthetic data, with power-law exponent \(-\beta\)]
Can we understand these curves?
Regression: \(\hat{Z}_S(\underline{x}) = \sum_{\mu=1}^n c_\mu K_S(\underline{x}_\mu,\underline{x})\)
Minimize the training loss \(\frac1n \sum_{\mu=1}^n \left[ \hat{Z}_S(\underline{x}_\mu) - Z_T(\underline{x}_\mu) \right]^2\)
Explicit solution: \(\hat{Z}_S(\underline{x}) = \underline{k}_S(\underline{x})^\top\, \mathbb{K}_S^{-1}\, \underline{Z}_T\),
where \((\underline{k}_S(\underline{x}))_\mu = K_S(\underline{x},\underline{x}_\mu)\) is the kernel overlap with the training points, \((\mathbb{K}_S)_{\mu\nu} = K_S(\underline{x}_\mu,\underline{x}_\nu)\) is the Gram matrix, and \(\underline{Z}_T = (Z_T(\underline{x}_1),\dots,Z_T(\underline{x}_n))\) collects the training data.
Goal: compute the generalization error \(\epsilon\) and how it scales with \(n\).
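The explicit solution can be checked numerically before any asymptotic analysis. A minimal sketch (assumptions: Laplace teacher and student, data uniform on the unit hypersphere, and a Monte-Carlo estimate of \(\epsilon\) on held-out points drawn from the same field):
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(1)
d, n, n_test = 5, 1000, 500
K_T = lambda r: np.exp(-r)                       # Laplace teacher
K_S = lambda r: np.exp(-r)                       # Laplace student

# draw the teacher field jointly on training and test points (unit hypersphere)
X = rng.standard_normal((n + n_test, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
C = K_T(cdist(X, X)) + 1e-10 * np.eye(n + n_test)
Z = np.linalg.cholesky(C) @ rng.standard_normal(n + n_test)
X_tr, Z_tr, X_te, Z_te = X[:n], Z[:n], X[n:], Z[n:]

# explicit solution: Z_hat_S(x) = k_S(x)^T K_S^{-1} Z_T
G = K_S(cdist(X_tr, X_tr)) + 1e-10 * np.eye(n)   # Gram matrix (+ jitter)
c = np.linalg.solve(G, Z_tr)
Z_hat = K_S(cdist(X_te, X_tr)) @ c               # kernel overlap with training data

eps = np.mean((Z_hat - Z_te) ** 2)               # generalization error estimate
print(f"epsilon ~ {eps:.3e}")
\end{verbatim}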
To compute the generalization error: characterize each kernel by the power-law decay of its Fourier transform, \(\tilde{K}(\underline{w}) \sim \|\underline{w}\|^{-\alpha}\) for large \(\|\underline{w}\|\); e.g. Laplace has \(\alpha=d+1\) and Gaussian has \(\alpha=\infty\).
Then we can show that, for \(n\gg1\),
\(\epsilon \sim n^{-\beta}\) with \(\beta = \frac{1}{d}\,\min\!\left(\alpha_T - d,\; 2\alpha_S\right)\)
(details: arXiv:1905.10843)
The rate is optimal when the student is smooth enough, \(2\alpha_S \geq \alpha_T - d\) (optimal learning).
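A tiny helper implementing the exponent as reconstructed above (a sketch; it only encodes this formula and the \(\alpha\) values quoted for the Laplace and Gaussian kernels):
\begin{verbatim}
import numpy as np

def predicted_beta(alpha_T, alpha_S, d):
    # beta = min(alpha_T - d, 2 * alpha_S) / d  (formula as reconstructed above)
    return min(alpha_T - d, 2 * alpha_S) / d

d = 5
print(predicted_beta(d + 1, d + 1, d))      # Laplace teacher & student: 1/d
print(predicted_beta(np.inf, d + 1, d))     # Gaussian teacher, Laplace student: 2(d+1)/d
\end{verbatim}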
What is the prediction for our simulations? (points on the hypersphere)
Laplace teacher, Laplace student (\(\alpha_T=\alpha_S=d+1\)): \(\beta=\frac1d\) (curse of dimensionality!)
Gaussian teacher, Laplace student (\(\alpha_T=\infty,\ \alpha_S=d+1\)): \(\beta=\frac{2\alpha_S}{d}=\frac{2(d+1)}{d}\)
[Plot: generalization error vs. \(n\), with the predicted exponent \(-\beta\) indicated]
Same result with points on a regular lattice or random points on the hypersphere?
What matters is how the nearest-neighbor distance \(\delta\) scales with \(n\)
In both cases \(\delta\sim n^{-\frac1d}\)
Finite-size effects: the asymptotic scaling holds only when \(n\) is large enough
(conjecture)
1. Effective dimension is much smaller: \(\delta\sim n^{-\frac1{d_\mathrm{eff}}}\) (see the sketch after this list)
2. We find the same exponent regardless of the student:
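A minimal sketch of estimating \(d_\mathrm{eff}\) from the nearest-neighbor scaling referenced in point 1 (assumptions: MNIST via scikit-learn, Euclidean distances, and a simple log-log fit; the original analysis may use a different estimator):
\begin{verbatim}
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.neighbors import NearestNeighbors

X, _ = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
rng = np.random.default_rng(0)

ns, deltas = [1000, 2000, 4000, 8000, 16000], []
for n in ns:
    idx = rng.choice(len(X), n, replace=False)
    nn = NearestNeighbors(n_neighbors=2).fit(X[idx])
    dist, _ = nn.kneighbors(X[idx])          # column 0 is the point itself
    deltas.append(np.mean(dist[:, 1]))       # mean nearest-neighbor distance

# delta ~ n^{-1/d_eff}  =>  d_eff = -1 / slope of log(delta) vs log(n)
slope = np.polyfit(np.log(ns), np.log(deltas), 1)[0]
print(f"d_eff ~ {-1.0 / slope:.1f}")
\end{verbatim}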
Assuming this formula holds
Guess: measure \(\alpha_T\) in real data from this projection!
\(\frac{\alpha_T}d = 1 + \frac{2s}d\)
\(= 1+\frac1d\,\) for Laplace
\(=1\) for Gaussian
Measure effective smoothness in real data
Fit \(c=\frac{\alpha_T}{d}\) from the projection
[Plot: squared projections \(q_\rho^2\) vs. eigenmode rank \(\rho\)]
MNIST: \(\beta\approx0.36\), \(2s\approx5.4\)
CIFAR10: \(\beta\approx0.07\), \(2s\approx2.45\)
(Eigenmodes are extracted from the Gram matrix of a larger training set of size \(\tilde{n}\).)
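A minimal sketch of this measurement (assumptions: MNIST with \(\pm1\) even/odd labels, a Laplace kernel with Euclidean distance, \(\tilde{n}=5000\), and a power-law fit of \(q_\rho^2\) over an intermediate range of ranks; the original procedure may differ in these choices):
\begin{verbatim}
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.datasets import fetch_openml

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X = X / 255.0
y = 2.0 * (y.astype(int) % 2) - 1.0                 # even/odd -> +/-1

rng = np.random.default_rng(0)
n_tilde = 5000
idx = rng.choice(len(X), n_tilde, replace=False)
sigma = np.sqrt(X.shape[1])                         # arbitrary bandwidth choice

K = np.exp(-cdist(X[idx], X[idx]) / sigma)          # Gram matrix of size n_tilde
eigval, eigvec = np.linalg.eigh(K)
order = np.argsort(eigval)[::-1]                    # rank eigenmodes by eigenvalue
q2 = (eigvec[:, order].T @ y[idx]) ** 2             # squared projections q_rho^2

rho = np.arange(1, n_tilde + 1)
sel = (rho > 10) & (rho < 1000)                     # intermediate range of ranks
c = -np.polyfit(np.log(rho[sel]), np.log(q2[sel]), 1)[0]
print(f"c = alpha_T / d ~ {c:.2f}")                 # then 2s = (c - 1) * d_eff
\end{verbatim}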