M. Geiger, A. Jacot, S. d’Ascoli, M. Baity-Jesi,
L. Sagun, G. Biroli, C. Hongler, M. Wyart
Stefano Spigler
arXiv: 1901.01608, 1810.09665, 1809.09349
Les Houches 2020 - Recent progress in glassy systems
[Diagram: fully-connected network of depth L and width h, with parameters W_μ, computing an output function f(x;W); the output f(x) is compared with the thresholds −1 and +1.]
MNIST: 70k pictures, digits 0,…,9; use parity as label.
Binary labels: ±1 = cats/dogs, yes/no, even/odd...
Vary the network size N (∼ h²) by changing the width h.
Find parameters W such that sign f(x_i;W) = y_i for i ∈ train set
Minimize some loss!
Binary classification: y_i = ±1
Hinge loss: L(W) = 0 if and only if y_i f(x_i;W) > 1 for all patterns
(classified correctly with some margin)
Minimize loss ⟷ gradient descent
Start with random initial conditions!
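As a concrete illustration of this recipe, here is a minimal sketch (not the authors' code) that trains a small fully-connected network by gradient descent from random initial conditions; the toy random dataset and the quadratic form of the hinge loss are assumptions of the example.

```python
# Minimal sketch (not the authors' code): gradient descent on a hinge-type loss
# for a small fully-connected network, starting from random initial conditions.
# Assumptions of the example: PyTorch, a toy random dataset in place of MNIST
# parity, and the quadratic hinge (1/2) max(0, 1 - y f)^2; the plain hinge
# max(0, 1 - y f) behaves similarly.
import torch

torch.manual_seed(0)
d, h, P = 20, 64, 200                     # input dim, width, number of patterns

X = torch.randn(P, d)                     # toy inputs
y = torch.randint(0, 2, (P,)).float() * 2 - 1   # random labels y_i = +/-1

# Fully-connected network with two hidden layers of width h, scalar output f(x;W)
model = torch.nn.Sequential(
    torch.nn.Linear(d, h), torch.nn.ReLU(),
    torch.nn.Linear(h, h), torch.nn.ReLU(),
    torch.nn.Linear(h, 1),
)

def hinge_loss(f, y):
    # Vanishes exactly when y_i f(x_i; W) > 1 for every pattern (margin condition)
    return 0.5 * torch.relu(1 - y * f).pow(2).mean()

opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(5000):
    loss = hinge_loss(model(X).squeeze(-1), y)
    opt.zero_grad()
    loss.backward()
    opt.step()

unsatisfied = (y * model(X).squeeze(-1) <= 1).sum().item()
print(f"final loss {loss.item():.2e}, patterns still inside the margin: {unsatisfied}")
```

If the network is large enough, gradient descent typically drives this loss all the way to zero, which is the regime discussed next.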
bad local minimum?
Soudry, Hoffer '17; Sagun et al. '17; Cooper '18; Baity-Jesi et al. '18 - arXiv:1803.06969
in practical settings:
Random packing: upon increasing the density → jamming transition
The transition is sharp when the interactions are finite-range
A finite-range interaction ⟷ a loss that vanishes beyond the margin: this is why we use the hinge loss!
Shallow networks (Franz and Parisi '16) and 2-layer committee machines (Franz et al. '18) ⟷ packings of spheres
Deep nets ⟷ packings of ellipsoids!
Geiger et al. '18 - arXiv:1809.09349;
Spigler et al. '18 - arXiv:1810.09665
No local minima are found when overparametrized!
[Phase diagram: network size N vs dataset size P.]
The jamming threshold satisfies N⋆ < c₀ P.
Vary the network size N (∼ h²) by changing the width h:
Yes, deep networks fit all the data if N > N⋆ ⟶ jamming transition
Spigler et al. '18 - arXiv:1810.09665
Ok, so just crank up N and fit everything?
Generalization? → Compute test error ϵ
But wait... what about overfitting?
[Schematic: train and test error ϵ vs network size N; the classic expectation is overfitting beyond N⋆.]
Example: polynomial fitting, with N ∼ polynomial degree (see the sketch below).
Spigler et al. '18 - arXiv:1810.09665
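A toy numerical illustration of this classic picture (my own sketch, with an assumed smooth target): as the polynomial degree grows, the train error keeps dropping while the test error eventually grows.

```python
# Toy illustration (my own, not from the papers): classic overfitting in
# polynomial fitting, where the number of parameters plays the role of N
# (N ~ polynomial degree). Assumptions: numpy, noisy samples of sin(3x).
import numpy as np

rng = np.random.default_rng(0)
P = 20                                            # training set size
x_train = np.sort(rng.uniform(-1, 1, P))
y_train = np.sin(3 * x_train) + 0.1 * rng.standard_normal(P)
x_test = np.linspace(-1, 1, 500)
y_test = np.sin(3 * x_test)

for degree in (1, 3, 10, 15):
    coeffs = np.polyfit(x_train, y_train, degree)           # least-squares fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_err:.2e}, test MSE {test_err:.2e}")
# The train error keeps decreasing with the degree, while the test error
# typically grows again at large degree: the textbook overfitting picture.
```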
Advani and Saxe '17;
Spigler et al. '18 - arXiv:1810.09665;
Geiger et al. '19 - arXiv:1901.01608
"Double descent"
[Plot: test error vs N/N⋆; the test error keeps decreasing after the peak.]
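A hedged sketch of the kind of width sweep behind such curves: toy teacher data with partially corrupted training labels stands in for the MNIST/CIFAR experiments, and the architecture and hyperparameters are assumptions of the example.

```python
# Hedged sketch of a width sweep: train networks of increasing width h on the
# same P training points and record train/test error as functions of N ~ h^2.
# Toy teacher data with 20% corrupted training labels is an assumption of the
# example, standing in for the real datasets used in the papers.
import torch

torch.manual_seed(0)
d, P, P_test = 15, 200, 2000
teacher = torch.nn.Linear(d, 1)                   # toy target rule

def data(P):
    X = torch.randn(P, d)
    return X, torch.sign(teacher(X)).squeeze(-1)

X, y = data(P)
y = torch.where(torch.rand(P) < 0.2, -y, y)       # corrupt 20% of training labels
X_te, y_te = data(P_test)                         # clean test set

def run(h, steps=3000, lr=0.1):
    net = torch.nn.Sequential(torch.nn.Linear(d, h), torch.nn.ReLU(),
                              torch.nn.Linear(h, 1))
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):                        # full-batch gradient descent
        loss = 0.5 * torch.relu(1 - y * net(X).squeeze(-1)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    err = lambda A, b: (torch.sign(net(A).squeeze(-1)) != b).float().mean().item()
    return err(X, y), err(X_te, y_te)

for h in (2, 4, 8, 16, 32, 64, 128):
    tr, te = run(h)
    print(f"h = {h:3d} (N ~ h^2): train error {tr:.3f}, test error {te:.3f}")
# Small widths cannot fit the corrupted training set; past the fitting threshold
# the train error should reach zero. On real data the test error peaks near N*
# and then decreases: the "double descent" shape sketched above.
```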
We know why: Fluctuations!
ensemble average over n instances:
For a given input x, train several instances from different random initializations: e.g. three networks f_N(x;W₁), f_N(x;W₂), f_N(x;W₃) may predict −1, −1, +1.
The ensemble average is f̄_N(x) ≈ (−1 −1 +1)/3, so the ensembled prediction is sign f̄_N(x) = −1.
Explained in a few slides
Define some norm over the output functions, and the ensemble variance (at fixed n) of f(x;W) across instances.
x → {f(x;W_α)} → ⟨ϵ_N⟩ : average test error of the individual networks
Remark:
Geiger et al. '19 - arXiv:1901.01608
x → f̄_N^n(x) → ϵ̄_N : test error of the ensemble average, to be compared with the average test error ⟨ϵ_N⟩
[Plots: test error ϵ vs N for the "normal average" and the "ensemble average", on CIFAR-10 regrouped into 2 classes and on MNIST parity.]
Geiger et al. '19 - arXiv:1901.01608
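A minimal sketch of the two quantities just defined (toy teacher data, PyTorch and n = 5 instances are assumptions of the example): train n instances from different random initializations, then compare the average test error ⟨ϵ_N⟩ of the individual networks with the test error ϵ̄_N of the ensembled prediction sign f̄_N.

```python
# Minimal sketch of the two quantities above: average test error <eps_N> of the
# individual networks vs the test error eps_bar_N of the ensembled prediction.
# Toy teacher task, architecture and n = 5 instances are assumptions.
import torch

torch.manual_seed(0)
d, h, P, P_test, n = 10, 32, 100, 1000, 5
teacher = torch.nn.Linear(d, 1)

def data(P):
    X = torch.randn(P, d)
    return X, torch.sign(teacher(X)).squeeze(-1)

X, y = data(P)
X_te, y_te = data(P_test)

def train_one(seed):
    torch.manual_seed(seed)                       # different random initialization
    net = torch.nn.Sequential(torch.nn.Linear(d, h), torch.nn.ReLU(),
                              torch.nn.Linear(h, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.1)
    for _ in range(2000):
        loss = 0.5 * torch.relu(1 - y * net(X).squeeze(-1)).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return net

nets = [train_one(seed) for seed in range(n)]
with torch.no_grad():
    outs = torch.stack([net(X_te).squeeze(-1) for net in nets])     # n x P_test

eps_individual = (torch.sign(outs) != y_te).float().mean(dim=1)     # one per instance
eps_ensemble = (torch.sign(outs.mean(dim=0)) != y_te).float().mean()
print("average test error <eps_N> :", eps_individual.mean().item())
print("test error of the ensemble average eps_bar_N:", eps_ensemble.item())
```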
decision boundaries:
Smoothness of the test error as a function of the decision boundary + symmetry ⟹ fluctuations of the boundary around the ensemble-averaged one can only increase the test error.
[Sketch: decision boundaries of the normal average vs the ensemble average.]
Jacot et al. '18
Lazy learning:
For an input x, the function f(x;W) lives on a curved manifold as W varies.
As h → ∞ the manifold becomes linear!
The formula for the kernel Θ_t is useless, unless...
Theorem (informal):
Deep learning = learning with a kernel as h → ∞
Jacot et al. '18
[Illustration: learning with a kernel, i.e. the prediction at a point x involves a convolution of the data with a kernel.]
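A minimal sketch (not from the papers) of the object behind the theorem: the empirical tangent kernel Θ(x, x′) = ∇_W f(x) · ∇_W f(x′) of a small PyTorch network, evaluated at initialization; the architecture and sizes are assumptions of the example.

```python
# Minimal sketch (not from the papers) of the empirical tangent kernel
# Theta(x, x') = grad_W f(x) . grad_W f(x') of a small network at initialization.
import torch

torch.manual_seed(0)
d, h = 10, 256
net = torch.nn.Sequential(torch.nn.Linear(d, h), torch.nn.ReLU(),
                          torch.nn.Linear(h, 1))
params = list(net.parameters())

def grad_vector(x):
    # Flattened gradient of the scalar output f(x; W) with respect to all W_mu
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(d), torch.randn(d)
theta_12 = grad_vector(x1) @ grad_vector(x2)      # one entry Theta(x1, x2)
print("Theta(x1, x2) =", theta_12.item())
# Jacot et al. '18: with the proper 1/sqrt(h) scaling, as h -> infinity this
# kernel concentrates and stays constant during training, so deep learning
# reduces to kernel learning with Theta.
```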
Geiger et al. '19 - arXiv:1901.01608;
Hanin and Nica '19;
Dyer and Gur-Ari '19
Then:
$\Delta\Theta_{t=0} \sim 1/\sqrt{h} \sim N^{-1/4}$ : fluctuations of the kernel at initialization (t = 0)
$\|\Theta_t - \Theta_{t=0}\|_F \sim 1/h \sim N^{-1/2}$ : change of the kernel during training
The output function fluctuates similarly to the kernel.
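A hedged numerical check of the first scaling (one hidden layer, standard PyTorch initialization and a diagonal kernel entry are all assumptions of the example): the relative instance-to-instance fluctuation of Θ at t = 0 should shrink roughly like 1/√h.

```python
# Hedged check of the first scaling: the relative fluctuation of an empirical
# kernel entry over random initializations should shrink roughly like 1/sqrt(h),
# i.e. N^(-1/4) with N ~ h^2. One hidden layer and the diagonal entry
# Theta(x, x) are assumptions of the example.
import torch

torch.manual_seed(0)
d, n_seeds = 10, 50
x = torch.randn(d)

def ntk_diag(h, seed):
    torch.manual_seed(seed)                       # a fresh random initialization
    net = torch.nn.Sequential(torch.nn.Linear(d, h), torch.nn.ReLU(),
                              torch.nn.Linear(h, 1))
    out = net(x.unsqueeze(0)).squeeze()
    grads = torch.autograd.grad(out, list(net.parameters()))
    g = torch.cat([gr.reshape(-1) for gr in grads])
    return (g @ g).item()                         # Theta(x, x) = |grad_W f(x)|^2

for h in (64, 256, 1024):
    vals = torch.tensor([ntk_diag(h, seed) for seed in range(n_seeds)])
    print(f"h = {h:4d}: Delta Theta / Theta ~ {(vals.std() / vals.mean()).item():.3f}")
# Expect roughly a factor 1/2 each time h is quadrupled. Checking the second
# scaling, ||Theta_t - Theta_0||_F ~ 1/h, additionally requires a training loop.
```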
1. Can networks fit all the P training data?
2. Can networks overfit? Can N be too large?
3. How does the test error scale with P?
check Geiger et al. '19 - arXiv:1906.08034 for more!
check Spigler et al. '19 - arXiv:1905.10843 !
→ Long-term goal: how to choose N?
(tentative) Right after jamming, and do ensemble averaging!