Loss Landscape and
Performance in Deep Learning
M. Geiger, A. Jacot, S. d’Ascoli, M. Baity-Jesi,
L. Sagun, G. Biroli, C. Hongler, M. Wyart
Stefano Spigler
arXiv: 1901.01608; 1810.09665; 1809.09349
Les Houches 2020 - Recent progress in glassy systems
(Supervised) Deep Learning - Classification
(figure: a classifier network; the sign of the output separates the two classes, f(x) < −1 vs f(x) > +1)
- Learning from examples: training set
- Is then able to predict on unseen data: test set
- It is not understood why it works so well!
- What network size?
Set-up: Simple Fully-Connected architecture
- Deep net f(x;W) with N ∼ h²L parameters (a minimal sketch follows below)
(figure: fully-connected network of depth L and width h, weights W_μ, output f(x;W))
- Alternating linear and nonlinear operations!
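A minimal sketch of such an architecture, assuming PyTorch; the function name fully_connected, the input dimension, and the ReLU nonlinearity are illustrative choices, not the exact setup of the papers.

```python
import torch.nn as nn

def fully_connected(d_in, h, L):
    """Fully-connected net of width h and depth L with scalar output f(x; W)."""
    layers = [nn.Linear(d_in, h), nn.ReLU()]       # input layer
    for _ in range(L - 1):                         # L - 1 hidden-to-hidden layers
        layers += [nn.Linear(h, h), nn.ReLU()]
    layers.append(nn.Linear(h, 1))                 # scalar readout f(x; W)
    return nn.Sequential(*layers)

net = fully_connected(d_in=28 * 28, h=64, L=5)
N = sum(p.numel() for p in net.parameters())       # N grows like h^2 L for large h
```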
Set-up: Dataset
- P training points: x_1, …, x_P
- Binary classification: x_i → label y_i = ±1
- Independent test set to evaluate performance
Example - MNIST (parity):
70k pictures, digits 0,…,9; use parity as label (relabelling sketched below)
±1 = cats/dogs, yes/no, even/odd, ...
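A sketch of the parity relabelling, assuming torchvision is available; the dataset path and preprocessing are illustrative.

```python
from torchvision import datasets, transforms

# 60k training images (plus 10k test images) of the digits 0,...,9
mnist = datasets.MNIST(root="data", train=True, download=True,
                       transform=transforms.ToTensor())
# Binary label y_i = ±1 from the parity of the digit: even -> +1, odd -> -1
labels = [+1 if digit % 2 == 0 else -1 for digit in mnist.targets.tolist()]
```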
Outline
Vary network size N (∼ h²):
- Can networks fit all the P training data?
- Can networks overfit? Can N be too large?
→ Long term goal: how to choose N?
Learning
- Find parameters W such that sign f(x_i; W) = y_i for all i in the training set (binary classification: y_i = ±1)
- Minimize some loss! Hinge loss (written out below):
- L(W) = 0 if and only if y_i f(x_i; W) > 1 for all patterns (classified correctly with some margin)
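The loss formula itself did not survive extraction; a quadratic hinge consistent with the condition above and with the soft-sphere analogy of the next slides (the exact normalization is an assumption) reads

$$ L(\mathbf{W}) \;=\; \frac{1}{P} \sum_{i=1}^{P} \frac{1}{2}\, \max\!\bigl(0,\; 1 - y_i f(x_i;\mathbf{W})\bigr)^{2}, $$

which vanishes exactly when every training pattern reaches the unit margin y_i f(x_i; W) ≥ 1.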
Learning dynamics = descent in loss landscape
- Minimize loss ⟷ gradient descent (sketched below)
- Start with random initial conditions!
- Random, high-dimensional, non-convex landscape!
- Why does training not get stuck in bad local minima?
- What is the landscape geometry?
- In practical settings, many flat directions are found!
Soudry, Hoffer '17; Sagun et al. '17; Cooper '18; Baity-Jesi et al. '18 - arXiv:1803.06969
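A sketch of full-batch gradient descent on the (assumed) quadratic hinge loss, reusing fully_connected from the earlier sketch; the learning rate and step count are illustrative.

```python
import torch

def quadratic_hinge(f, y):
    # L(W) = mean over patterns of 0.5 * max(0, 1 - y_i f(x_i; W))^2
    return 0.5 * torch.clamp(1.0 - y * f, min=0.0).pow(2).mean()

def train(net, x, y, lr=0.1, steps=5000):
    opt = torch.optim.SGD(net.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = quadratic_hinge(net(x).squeeze(-1), y)
        if loss.item() == 0.0:     # all margins satisfied: training data fitted (L = 0)
            break
        loss.backward()
        opt.step()
    return loss.item()             # stays > 0 if the net cannot fit all the data
```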
Analogy with granular matter: Jamming
Upon increasing the density → sharp transition (for finite-range interactions)
- random initial conditions
- minimize energy L
- either find L=0 or L>0
Random packing:
this is why we use the hinge loss!
- Shallow networks (Franz and Parisi '16) and 2-layer committee machines (Franz et al. '18) ⟷ packings of spheres
- Deep nets ⟷ packings of ellipsoids! (see the note below on the hinge-loss ↔ soft-sphere correspondence)
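A hedged note on why the hinge loss enables the analogy, using the quadratic hinge assumed earlier: each pattern contributes to the loss only through its gap,

$$ \Delta_i = 1 - y_i f(x_i;\mathbf{W}), \qquad L(\mathbf{W}) = \frac{1}{P}\sum_{i=1}^{P} \frac{1}{2}\,\Delta_i^{2}\,\theta(\Delta_i), $$

with θ the Heaviside step, so a pattern whose margin is satisfied (Δ_i ≤ 0) exerts no force, just like two soft spheres that no longer overlap; unsatisfied patterns play the role of overlapping particles with a harmonic, finite-range repulsion.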
Empirical tests: MNIST (parity)
Geiger et al. '18 - arXiv:1809.09349;
Spigler et al. '18 - arXiv:1810.09665
- Above N∗ we have L=0
- Solid line is the bound N∗<c0P
Also...
- Hypostatic at jamming
- Critical spectrum of the Hessian (many flat directions)
- (Non universal) critical exponents
No local minima are found when overparametrized!
(figure: jamming phase diagram in the plane of dataset size P vs network size N; the transition line satisfies N⋆ < c0 P)
Outline
Vary network size N (∼ h²):
- Can networks fit all the P training data?
- Can networks overfit? Can N be too large?
→ Long term goal: how to choose N?
Yes, deep networks fit all data if N>N∗ ⟶ jamming transition
Generalization
Spigler et al. '18 - arXiv:1810.09665
Ok, so just crank up N and fit everything?
Generalization? → Compute test error ϵ
But wait... what about overfitting?
(figure: the classical overfitting picture — train error keeps decreasing with N while the test error ϵ eventually grows past some N; example: polynomial fitting, with N ∼ polynomial degree)
Overfitting?
Spigler et al. '18 - arXiv:1810.09665
- Test error decreases monotonically with N! (see the width-sweep sketch below)
- Cusp at the jamming transition
Advani and Saxe '17;
Spigler et al. '18 - arXiv:1810.09665;
Geiger et al. '19 - arXiv:1901.01608
"Double descent"
(figures: test error vs N/N∗ showing the peak/cusp at the jamming transition and the monotonic decrease after the peak; phase diagram of dataset size P vs network size N)
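A sketch of the width sweep behind these curves, reusing fully_connected and train from the previous sketches; the grid of widths and the 0/1 test error are illustrative choices.

```python
import torch

def error_01(net, x, y):
    with torch.no_grad():
        return (torch.sign(net(x).squeeze(-1)) != y).float().mean().item()

def sweep_widths(x_tr, y_tr, x_te, y_te, widths=(4, 8, 16, 32, 64, 128)):
    curve = {}
    for h in widths:                                   # N ~ h^2 L grows along the sweep
        net = fully_connected(d_in=x_tr.shape[1], h=h, L=5)
        train(net, x_tr, y_tr)
        curve[h] = (error_01(net, x_tr, y_tr),         # train error
                    error_01(net, x_te, y_te))         # test error
    return curve
```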
We know why: Fluctuations!
Ensemble average
- Random initialization → output function f_N is stochastic
- Fluctuations: quantified by average and variance
ensemble average over n instances: f̄_N^n(x) = (1/n) Σ_{α=1}^{n} f(x; W_α)
(figure: three independently initialized networks evaluated on the same input x predict f_N(W_1) = −1, f_N(W_2) = −1, f_N(W_3) = +1; their ensemble average f̄_N = (−1 −1 +1)/3 predicts −1)
Explained in a few slides
- Define some norm ‖·‖ over the output functions
- Ensemble variance (fixed n): ⟨‖f_N − f̄_N^n‖²⟩ over initializations (a code sketch of the ensemble follows below)
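A sketch of the ensemble average f̄_N^n over n independently initialized and trained networks, reusing fully_connected and train from the sketches above; the hyperparameters are illustrative.

```python
import torch

def ensemble_predict(x_te, x_tr, y_tr, n=10, h=64, L=5):
    outputs = []
    for _ in range(n):                               # n instances, different random inits
        net = fully_connected(d_in=x_tr.shape[1], h=h, L=L)
        train(net, x_tr, y_tr)
        outputs.append(net(x_te).squeeze(-1).detach())
    f_bar = torch.stack(outputs).mean(dim=0)         # ensemble average of the outputs
    return torch.sign(f_bar)                         # labels predicted by the ensemble
```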
Fluctuations increase error
Geiger et al. '19 - arXiv:1901.01608
- Test error increases with fluctuations
- Ensemble test error is nearly flat after N∗!
Remark:
- normal average: {f(x; W_α)} → ⟨ϵ_N⟩, the average test error of single networks
- ensemble average: f̄_N^n(x) → ϵ̄_N, the test error of the ensemble average
(figures: test error ϵ vs N for the normal average and for the ensemble average, on MNIST parity and on CIFAR-10 regrouped into 2 classes)
Scaling argument!
Geiger et al. '19 - arXiv:1901.01608
- Smoothness of the test error as a function of the decision boundary + symmetry (result written below):
(figure: decision boundaries of the normal average vs the ensemble average)
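The quantitative outcome of the argument is not spelled out above; as a hedged reconstruction of the Geiger et al. '19 result: the decision boundary is perturbed by δf = f_N − f̄_N, the linear term averages out by symmetry, so the test-error excess is quadratic in the fluctuations,

$$ \langle \epsilon_N \rangle - \bar{\epsilon}_N \;\sim\; \bigl\langle \lVert f_N - \bar{f}_N \rVert^{2} \bigr\rangle \;\sim\; \frac{1}{h} \;\sim\; N^{-1/2}, $$

consistent, once squared, with the N^{−1/4} scaling of the output/kernel fluctuations quoted on the "Finite N asymptotics?" slide.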
Infinitely-wide networks
Jacot et al. '18
- For small width h: ∇_W f evolves during training
- For large width h: ∇_W f is constant during training
For an input x the function f(x;W) lives on a curved manifold
The manifold becomes linear!
Lazy learning:
- weights don't change much,
- yet enough to change the output f by ∼ O(1) (the change δf ≈ ∂_W f · δW sums many small contributions)!
Neural Tangent Kernel
- Gradient descent implies: ∂_t f_t(x) = −(1/P) Σ_i Θ_t(x, x_i) ∂ℓ/∂f(x_i), with kernel Θ_t(x, x′) = ∇_W f(x; W_t) · ∇_W f(x′; W_t) (empirical computation sketched below)
The formula for the kernel Θ_t is useless, unless...
Theorem. (informal)
Deep learning = learning with a kernel as h→∞
Jacot et al. '18
(figure: in the h → ∞ limit the trained predictor acts as a convolution with a kernel over the data)
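A sketch of the empirical Neural Tangent Kernel at initialization, Θ_{t=0}(x, x′) = ∇_W f(x; W) · ∇_W f(x′; W), reusing fully_connected from above; the helper name ntk_entry is an illustrative choice.

```python
import torch

def ntk_entry(net, x1, x2):
    """Θ(x1, x2): dot product of the parameter gradients of the scalar output."""
    def grad_vector(x):
        net.zero_grad()
        net(x.unsqueeze(0)).squeeze().backward()     # fills p.grad with ∇_W f(x; W)
        return torch.cat([p.grad.flatten().clone() for p in net.parameters()])
    return torch.dot(grad_vector(x1), grad_vector(x2)).item()
```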
Finite N asymptotics?
Geiger et al. '19 - arXiv:1901.01608;
Hanin and Nica '19;
Dyer and Gur-Ari '19
- Evolution in time is small: ‖Θ_t − Θ_{t=0}‖_F ∼ 1/h ∼ N^{−1/2}
- Fluctuations at t = 0 are much larger: ΔΘ_{t=0} ∼ 1/√h ∼ N^{−1/4}
Then: the output function fluctuates similarly to the kernel
Conclusion
1. Can networks fit all the P training data?
- Yes, deep networks fit all data if N > N∗ ⟶ jamming transition
2. Can networks overfit? Can N be too large?
- Initialization induces fluctuations in output that increase test error
- No overfitting: error keeps decreasing past N∗ because fluctuations diminish
- check Geiger et al. '19 - arXiv:1906.08034 for more!
3. How does the test error scale with P?
- check Spigler et al. '19 - arXiv:1905.10843!
→ Long-term goal: how to choose N?
(tentative) Right after jamming, and do ensemble averaging!
Loss Landscape and Generalization in Deep Learning
By Stefano Spigler
Talk given in Les Houches (https://sites.google.com/view/leshouches2020)