M. Geiger, A. Jacot, S. d’Ascoli, M. Baity-Jesi,
L. Sagun, G. Biroli, C. Hongler, M. Wyart
Stefano Spigler
arXiv: 1901.01608; 1810.09665; 1809.09349
[Diagram: fully-connected network of depth \(L\) and width \(\color{red}h\), parameters \(W_\mu\), output \(f(\mathbf{x};\mathbf{W})\), with the two classes at \(<-1\) and \(>+1\)]
MNIST: 70k images of digits \(0,\dots,9\); use parity as the label
\(\pm1=\) cats/dogs, yes/no, even/odd...
Vary network size \(\color{red}N\) (\(\sim\color{red}h^2\)) through the width \(h\)
Find parameters \(\mathbf{W}\) such that \(\mathrm{sign} f(\mathbf{x}_i; \mathbf{W}) = y_i\) for \(i\in\) train set
Minimize some loss!
\(\mathcal{L}(\mathbf{W}) = 0\) if and only if \(y_i f(\mathbf{x}_i;\mathbf{W}) > 1\) for all patterns
(classified correctly with some margin)
Binary classification: \(y_i = \pm1\)
Hinge loss:
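As a sketch (the normalization and the exponent here are my assumptions, not read off the slides), a hinge-type loss that vanishes exactly when every pattern satisfies \(y_i f(\mathbf{x}_i;\mathbf{W}) > 1\):
\[
\mathcal{L}(\mathbf{W}) \;=\; \frac{1}{P}\sum_{i=1}^{P} \ell\big(y_i\, f(\mathbf{x}_i;\mathbf{W})\big),
\qquad
\ell(u) \;=\; \tfrac12\,\max(0,\,1-u)^2 .
\]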
Minimize loss \(\longleftrightarrow\) gradient descent
Start with random initial conditions!
Random, high dimensional, not convex landscape!
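A minimal PyTorch sketch of this setup: a fully-connected net of width \(h\) trained by plain gradient descent on a hinge-type loss, starting from a random initialization (the data, sizes, learning rate and number of steps are placeholders, not the experiments of the papers):

```python
import torch

torch.manual_seed(0)

# toy stand-ins: input dim, width h, depth L, dataset size P (placeholder values)
d, h, L, P = 20, 32, 3, 200
X = torch.randn(P, d)                              # placeholder data (not MNIST parity)
y = (torch.randint(0, 2, (P,)) * 2 - 1).float()    # random +/-1 labels

# fully-connected net of depth L and width h with scalar output f(x; W)
layers = [torch.nn.Linear(d, h), torch.nn.ReLU()]
for _ in range(L - 2):
    layers += [torch.nn.Linear(h, h), torch.nn.ReLU()]
layers += [torch.nn.Linear(h, 1)]
net = torch.nn.Sequential(*layers)                 # random initial conditions

def hinge_loss(f, y):
    # zero iff every pattern is classified with margin: y_i * f(x_i) > 1
    return 0.5 * torch.relu(1.0 - y * f).pow(2).mean()

opt = torch.optim.SGD(net.parameters(), lr=0.1)    # plain gradient descent
for step in range(2000):
    loss = hinge_loss(net(X).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    f = net(X).squeeze(-1)
    print("final loss:", hinge_loss(f, y).item(),
          "| unsatisfied patterns:", int((y * f <= 1).sum()))
```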
bad local minimum?
Soudry, Hoffer '17; Sagun et al. '17; Cooper '18; Baity-Jesi et al. '18 - arXiv:1803.06969
in practical settings:
Upon increasing density \(\to\) transition
sharp transition with finite-range interactions
Random packing:
this is why we use the hinge loss!
Shallow networks \(\longleftrightarrow\) packings of spheres: Franz and Parisi, '16
Deep nets \(\longleftrightarrow\) packings of ellipsoids!
(if signals propagate through the net)
\(\color{red}N^\star < c_0 P\), typically \(c_0=\mathcal{O}(1)\)
[Plot: jamming threshold \(\color{red}N^\star\) (network size) vs dataset size]
Geiger et al. '18 - arXiv:1809.09349;
Spigler et al. '18 - arXiv:1810.09665
No local minima are found when overparametrized!
[Phase diagram: network size \(N\) vs dataset size \(P\), with the jamming line \(\color{red}N^\star\) satisfying \(\color{red}N^\star < c_0 P\)]
Spectrum of the Hessian (eigenvalues)
We don't find local minima when overparametrized... what does the shape of the landscape look like?
Geiger et al. '18 - arXiv:1809.09349
Local curvature:
second order approximation
Information captured by the Hessian matrix: \(\mathcal{H}_{\mu\nu} = \frac{\partial^2 \mathcal{L}(\mathbf{W})}{\partial W_\mu\,\partial W_\nu}\) (second derivatives w.r.t. the parameters \(W\))
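A hedged sketch of how such a spectrum can be probed on a tiny model, using `torch.autograd.functional.hessian` on a flattened parameter vector (the one-hidden-layer model, sizes and data are placeholders; the papers' experiments use much larger networks):

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
d, h, P = 5, 8, 30                     # tiny placeholder sizes
X = torch.randn(P, d)
y = (torch.randint(0, 2, (P,)) * 2 - 1).float()

n1 = d * h                             # number of first-layer weights
N = n1 + h                             # total number of parameters

def f(W):
    W1, w2 = W[:n1].reshape(h, d), W[n1:]
    return torch.relu(X @ W1.T) @ w2   # one-hidden-layer net, scalar output

def loss(W):                           # hinge-type loss, as above
    return 0.5 * torch.relu(1.0 - y * f(W)).pow(2).mean()

W = torch.randn(N) / N ** 0.5          # a random point in parameter space
H = hessian(loss, W)                   # H_{mu nu} = d^2 L / dW_mu dW_nu, shape (N, N)
eig = torch.linalg.eigvalsh(H)         # its spectrum
print("N =", N, "| min eigenvalue:", eig.min().item(), "| max:", eig.max().item())
```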
[Hessian spectra (eigenvalue densities) from numerical simulations, in three regimes:
Over-parametrized, \(N>N^\star\): \(\mathcal{L}=0\), flat
At jamming, \(N\approx N^\star\): almost flat
Under-parametrized, \(N<N^\star\): negative eigenvalues, scale \(\sim\sqrt{\mathcal{L}}\)
At the transition the spectrum develops Dirac \(\delta\)'s]
Geiger et al. '18 - arXiv:1809.09349
[Recap: vary network size \(\color{red}N\) (\(\sim\color{red}h^2\)) through the width \(h\), at given depth]
Yes, deep networks fit all data if \(N>N^*\ \longrightarrow\) jamming transition
Spigler et al. '18 - arXiv:1810.09665
Ok, so just crank up \(N\) and fit everything?
Generalization? \(\to\) Compute test error \(\epsilon\)
But wait... what about overfitting?
[Sketch: train error and test error \(\epsilon\) vs \(N\), with overfitting expected around \(N^*\)]
example: polynomial fitting
\(N \sim \mathrm{polynomial\ degree}\)
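A minimal numpy illustration of this classical overfitting picture, with the polynomial degree playing the role of \(N\) (the target function, noise level and degrees are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
P = 15                                    # training points
x_train = np.sort(rng.uniform(-1, 1, P))
y_train = np.sin(3 * x_train) + 0.2 * rng.standard_normal(P)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(3 * x_test)

for degree in (1, 3, 8, 14):              # degree ~ number of parameters N
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train {train_err:.3f}  test {test_err:.3f}")
```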
Spigler et al. '18 - arXiv:1810.09665
Advani and Saxe '17;
Spigler et al. '18 - arXiv:1810.09665;
Geiger et al. '19 - arXiv:1901.01608
"Double descent"
[Plot: test error vs \(N/N^*\), with a peak at the transition and decreasing after the peak; phase diagram: network size \(N\) vs dataset size \(P\)]
We know why: Fluctuations!
ensemble average over \(n\) instances:
Three trained instances predict \(f_N(\mathbf{W}_1)\to{\color{red}-1}\), \(f_N(\mathbf{W}_2)\to{\color{red}-1}\), \(f_N(\mathbf{W}_3)\to{\color{blue}+1}\); the ensemble average is \(\bar f_N = \frac{{\color{red}-1-1}{\color{blue}+1}}{3} = -\frac13\), so the ensemble predicts \(-1\)!
Explained in a few slides
Define some norm over the output functions and the ensemble variance (at fixed \(n\)); see the sketch after the remark below.
Remark:
Geiger et al. '19 - arXiv:1901.01608
normal average: \(\{f(\mathbf{x};\mathbf{W}_\alpha)\} \to \left\langle\epsilon_N\right\rangle\), the average test error
ensemble average: \(\bar f^n_N(\mathbf{x}) \to \bar\epsilon_N\), the test error of the ensemble average
\(\bar\epsilon_N \neq \left\langle\epsilon_N\right\rangle\)
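For concreteness, a sketch of these objects: the ensemble average follows the slides, while taking the norm to be the mean square over test inputs is my assumption, not necessarily the exact definition of arXiv:1901.01608:
\[
\bar f^n_N(\mathbf{x}) = \frac1n\sum_{\alpha=1}^{n} f(\mathbf{x};\mathbf{W}_\alpha),
\qquad
\|f\|^2 = \frac{1}{P_{\mathrm{test}}}\sum_{\mathbf{x}\in\mathrm{test}} f(\mathbf{x})^2,
\qquad
\mathrm{Var}\,f = \Big\langle \big\|f(\cdot;\mathbf{W}_\alpha)-\bar f^n_N\big\|^2 \Big\rangle_\alpha .
\]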
[Plots: test error \(\epsilon\) vs \(N\) for the normal average and the ensemble average, on CIFAR-10 regrouped in 2 classes and on MNIST parity]
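A toy numpy sketch of the comparison between the two errors, assuming per-instance test outputs that are independently noisy (everything here is a placeholder, not the CIFAR-10/MNIST experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
n, P_test = 10, 1000
y = rng.choice([-1.0, 1.0], size=P_test)           # true test labels
# toy per-instance outputs: noisy versions of the labels, one row per W_alpha
outputs = y + 1.5 * rng.standard_normal((n, P_test))

# average test error <eps_N>: mean of the individual error rates
eps_each = (np.sign(outputs) != y).mean(axis=1)
avg_eps = eps_each.mean()

# test error of the ensemble average: classify with the averaged output
f_bar = outputs.mean(axis=0)
ens_eps = (np.sign(f_bar) != y).mean()

print(f"average test error <eps_N>      = {avg_eps:.3f}")
# smaller here, since the toy noise is independent across instances
print(f"ensemble test error bar(eps_N)  = {ens_eps:.3f}")
```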
Geiger et al. '19 - arXiv:1901.01608
Smoothness of the test error as a function of the decision boundary + symmetry:
[Sketch: decision boundaries of the normal average vs the ensemble average]
Neal '96; Williams '98; Lee et al. '18; Schoenholz et al. '16
[Diagram: network of width \(\color{red}h\), input dimension \(\color{red}d\), parameters \(W_\mu\), output \(f(\mathbf{x};\mathbf{W})\); first-layer weights \(W^{(1)}\sim d^{-\frac12}\mathcal{N}(0,1)\); take \(h\to\infty\)]
Jacot et al. '18
For an input \(\mathbf{x}\), the function \(f(\mathbf{x};\mathbf{W})\) lives on a curved manifold
The manifold becomes linear!
Lazy learning:
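In formulas (a sketch in my notation; the slide's own expression is not recoverable): the linearized, "lazy" picture and the tangent kernel of Jacot et al. '18,
\[
f(\mathbf{x};\mathbf{W}) \;\approx\; f(\mathbf{x};\mathbf{W}^0) + \nabla_{\mathbf{W}} f(\mathbf{x};\mathbf{W}^0)\cdot(\mathbf{W}-\mathbf{W}^0),
\qquad
\Theta^t(\mathbf{x},\mathbf{x}') \;=\; \sum_{\mu} \frac{\partial f(\mathbf{x};\mathbf{W}^t)}{\partial W_\mu}\,
\frac{\partial f(\mathbf{x}';\mathbf{W}^t)}{\partial W_\mu}.
\]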
The formula for the kernel \(\Theta^t\) is useless, unless...
Theorem. (informal)
Deep learning \(=\) learning with a kernel as \(h\to\infty\)
Jacot et al. '18
convolution with a kernel
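Spelling this out (the standard gradient-flow result of Jacot et al. '18, in my notation; the exact statement shown on the slide is not recoverable): training evolves the output by
\[
\partial_t f(\mathbf{x};\mathbf{W}^t) \;=\; -\sum_{i=1}^{P} \Theta^t(\mathbf{x},\mathbf{x}_i)\,
\frac{\partial \mathcal{L}}{\partial f(\mathbf{x}_i;\mathbf{W}^t)},
\]
a convolution with the kernel \(\Theta^t\), which becomes deterministic and constant in time as \(h\to\infty\).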
Geiger et al. '19 - arXiv:1901.01608;
Hanin and Nica '19;
Dyer and Gur-Ari '19
Then:
The output function fluctuates similarly to the kernel
\(\Delta\Theta^{t=0} \sim 1/\sqrt{h} \sim N^{-\frac14}\)
at \(t=0\)
\(|\!|\Theta^t - \Theta^{t=0}|\!|_F \sim 1/h \sim N^{-\frac12}\)
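A hedged PyTorch sketch of how these fluctuations at \(t=0\) could be probed numerically: compute the Gram matrix of output gradients for a few inputs and compare its spread over random initializations at two widths (architecture, sizes and PyTorch's default \(1/\sqrt{\mathrm{fan\text{-}in}}\) initialization are my choices, not the papers' setup):

```python
import torch

def make_net(d, h):
    # placeholder architecture; default init scales weights ~ 1/sqrt(fan-in)
    return torch.nn.Sequential(
        torch.nn.Linear(d, h), torch.nn.ReLU(),
        torch.nn.Linear(h, h), torch.nn.ReLU(),
        torch.nn.Linear(h, 1),
    )

def tangent_kernel(net, X):
    # Theta_{ij} = sum_mu  df(x_i)/dW_mu * df(x_j)/dW_mu  at the current parameters
    params = list(net.parameters())
    rows = []
    for x in X:
        out = net(x.unsqueeze(0)).squeeze()
        g = torch.autograd.grad(out, params)
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(rows)                  # (n_inputs, n_params) Jacobian
    return J @ J.T

torch.manual_seed(0)
d, n_inputs = 10, 5
X = torch.randn(n_inputs, d)

for h in (32, 256):
    kernels = []
    for seed in range(5):                  # a few random initializations (t = 0)
        torch.manual_seed(seed)
        kernels.append(tangent_kernel(make_net(d, h), X))
    K = torch.stack(kernels)
    rel = (K.std(0) / K.mean(0).abs().clamp(min=1e-8)).mean().item()
    print(f"width h = {h:4d}: relative kernel fluctuation across inits ~ {rel:.3f}")
```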
1. Can networks fit all the \(\color{red}P\) training data?
2. Can networks overfit? Can \(\color{red}N\) be too large?
3. How does the test error scale with \(\color{red}P\)?
check Geiger et al. '19 - arXiv:1906.08034 and Spigler et al. '19 - arXiv:1905.10843 for more!
\(\to\) Long term goal: how to choose \(\color{red}N\)?
(tentative) Right after jamming, and do ensemble averaging!