Loss landscape and symmetries in Neural Networks
Mario Geiger
Professor: Matthieu Wyart
Physics of Complex Systems Laboratory
...now a postdoc at MIT
with Prof. Tess Smidt
Paradoxes of deep learning
Landscape vs Parameterization
Data Symmetries
Table of contents
Introduction to deep learning
Color code
Questions
Open problems
Classification with deep learning
model: \(w\in \mathbb{R}^N\)
dataset: \(\{(x_i, y_i)\}_{i=1}^{P}\)
loss function: \(\displaystyle \mathcal{L} = \frac1P \sum_{i=1}^{P} \ell(f(w, x_i),y_i)\)
gradient descent: \( w^{t+1} = w^t - \nabla_w \mathcal{L}(w^t) \)
In this talk I will use the following notation:
\(N\) = number of parameters
\(P\) = number of training samples
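A minimal full-batch gradient-descent sketch of this setup (illustrative only, not the talk's code; the architecture and data here are placeholders):

```python
import torch

# Minimal gradient-descent sketch of the setup above.
model = torch.nn.Sequential(torch.nn.Linear(10, 100), torch.nn.ReLU(), torch.nn.Linear(100, 1))
x = torch.randn(1000, 10)                            # P = 1000 training inputs
y = torch.randint(0, 2, (1000, 1)).float()           # binary labels
loss_fn = torch.nn.BCEWithLogitsLoss()               # plays the role of ell
lr = 0.1

for step in range(1000):
    loss = loss_fn(model(x), y)                      # L = (1/P) sum_i ell(f(w, x_i), y_i)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for w in model.parameters():                 # w^{t+1} = w^t - lr * grad_w L
            w -= lr * w.grad
```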
Paradoxes
Landscape vs Parameterization
Data Symmetries
Non-convex \(\mathcal{L}\)
Overparameterization
Curse of dimensionality
- No guarantee of convexity
- High-dimensional space (\(N\))
Why are we not stuck in a local minimum?
- Over-parameterized: \(\mathcal{L} \to 0\) and \(\mathcal{L}\) is connected at the bottom
Freeman and Bruna (2017); Soudry and Carmon (2016); Cooper (2018)
- Under-parameterized: \(\mathcal{L}\) glassy
Baity-Jesi et al. (2018)
What happens at the transition between the under- and over-parameterized regimes?
- State-of-the-art neural networks: \(P \ll N\)
\(\epsilon\): test/generalization error; \(N\): number of parameters
Neyshabur et al. (2017, 2018); Bansal et al. (2018); Advani et al. (2020)
Why does the test error decrease with \(N\)?
\(P =\) size of trainset
\(\delta =\) distance to closest neighbor
Bach (2017)
Hestness et al. (2017)
regression of a Lipschitz-continuous function
Luxburg and Bousquet (2004)
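To make the curse concrete, here is the standard nearest-neighbor counting behind such results (a sketch of the reasoning, not the exact statement of the cited bound): with \(P\) samples roughly uniform in dimension \(d\) and a Lipschitz target \(f^*\),
\(\displaystyle \delta \sim P^{-1/d} \quad\Rightarrow\quad \epsilon \lesssim \mathrm{Lip}(f^*)\,\delta \sim P^{-1/d} \quad\Rightarrow\quad P \sim \epsilon^{-d},\)
so the number of samples needed to reach a fixed error grows exponentially with the dimension.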
- Kernel methods: the sample complexity is divided by the number of invariant transformations
Bietti et al. (2021)
- Convolutional Neural Networks (CNNs) are invariant under translations
(sample complexity = the \(P\) required to reach a test error \(\epsilon\))
Figure from Geometric Deep Learning (Bronstein, Bruna, Cohen, and Veličković)
- Translations form only a small subset of the space of image transformations
- Labels are invariant under smooth deformations (diffeomorphisms) of small magnitude
Bruna and Mallat (2013); Mallat (2016)
Is stability to diffeomorphisms responsible for beating the curse of dimensionality?
- Recent empirical results revealed that even small image shifts can change the output substantially
Azulay and Weiss (2018); Dieleman et al. (2016); Zhang (2019)
Empirical evidence?
Can we generalize CNN invariance to other symmetries?
- CNNs perform much better than fully-connected networks
- Many tasks involve symmetries such as rotations or mirror reflections
Landscape vs Parameterization
Data Symmetries
Jamming
Double descent
Phase Diagram
Paradoxes
What happens at the transition between the under- and over-parameterized regimes?
The same question was asked in the physics of glasses
\(\displaystyle U = \sum_{ij} U_{ij}\)
Van der Waals
Finite range interaction
SAT/UNSAT problem
num. of constraints \(N_\Delta\)
Sharp transition: Jamming
Low density: flat directions, \(N_\Delta = 0\)
High density: glassy, \(N_\Delta \geq N\)
Sharp jamming transition at \(N_\Delta = N\)
Prediction for Neural Networks
cross entropy loss (long range)
quadratic hinge loss (finite range)
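Concretely, for a binary label \(y=\pm1\) and network output \(f\) (standard definitions, written out for reference):
\(\displaystyle \ell_{\mathrm{CE}}(f, y) = \ln\!\left(1 + e^{-yf}\right) > 0\) for every training point, so each point always pulls on the weights, like a long-range Van der Waals attraction;
\(\displaystyle \ell_{\mathrm{hinge}}(f, y) = \tfrac12 \max(0,\, 1 - yf)^2 = 0\) as soon as \(yf \geq 1\), so a point interacts only while it is below the margin, like a finite-range repulsion between overlapping particles.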
| Particles | Neural Networks |
|---|---|
| energy | loss |
| overlapping particles | point below margin |
| density | trainset size / parameters = \(P/N\) |
Fully connected, 5 hidden layers, ReLU, ADAM; MNIST (PCA d=10)
Geiger, M., Spigler, S., d’Ascoli, S., Sagun, L., Baity-Jesi, M., Biroli, G., and Wyart, M. (2019)
What happens at the transition between the under- and over-parameterized regimes?
There is a sharp jamming transition, in the same universality class as ellipses.
Why are we not stuck in a local minimum?
Crank up the number of parameters and you are not stuck.
What about overfitting?
Double Descent
Why does the test error decrease with \(N\)?
P=5000; fully connected, 2 hidden layers, swish activation, soft-hinge loss; MNIST (PCA d=10)
Cusp at jamming?
Slow decrease?
We show that the predictor explodes at jamming, and we predict the exponent.
Mario Geiger et al J. Stat. Mech. (2020) 023401
We need to understand what happens as \(N\to \infty\).
Neural Tangent Kernel
Arthur Jacot et al., NeurIPS (2018)
In the space of functions \(X\to Y\), the network defines a manifold \(f:W\to (X\to Y)\) (curry notation); \(f(w_0)\) is the function at initialization and the dynamics stays close to its tangent space.
In the limit \(N\to\infty\), the parameters change only a little.
\(\displaystyle \Theta(w, x_1, x_2) = \nabla_w f(w,x_1) \cdot \nabla_w f(w, x_2)\)
\(\displaystyle \frac{d}{dt}f(w, \cdot) \propto \sum_i \ell_i'\, \Theta(w, x_i, \cdot)\)
Well-defined behavior at \(N=\infty\)
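A minimal sketch of how the empirical kernel \(\Theta\) can be computed for a scalar-output network (illustrative PyTorch code, not the implementation used in the papers):

```python
import torch

# Illustrative empirical NTK for a scalar-output network:
# Theta(x1, x2) = grad_w f(w, x1) . grad_w f(w, x2)
def empirical_ntk(model, x1, x2):
    def flat_grad(x):
        out = model(x.unsqueeze(0)).squeeze()                       # scalar f(w, x)
        grads = torch.autograd.grad(out, tuple(model.parameters()))
        return torch.cat([g.reshape(-1) for g in grads])
    return flat_grad(x1) @ flat_grad(x2)

model = torch.nn.Sequential(torch.nn.Linear(10, 256), torch.nn.ReLU(), torch.nn.Linear(256, 1))
theta = empirical_ntk(model, torch.randn(10), torch.randn(10))
```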
- \(\| \Theta - \langle \Theta \rangle \|^2 \sim N^{-1/2}\)
- \(\Rightarrow \mathrm{Var}(f) \sim N^{-1/2}\)
- \(\Rightarrow \epsilon(N) - \epsilon_\infty \sim N^{-1/2}\)
Mario Geiger et al J. Stat. Mech. (2020) 023401
Noisy convergence toward a well-defined dynamics
Why does the test error decrease with \(N\)?
Can we remove these fluctuations without sending \(N\) to \(\infty\)?
Ensemble average
Generalization error of the average of 20 networks' outputs
\(\sim N^{-1/2}\)
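A hedged sketch of this ensemble-average experiment; `make_model` and `train` are hypothetical stand-ins for whatever architecture and training procedure are used:

```python
import torch

# Hedged sketch of the ensemble average: train several independently initialized
# networks and average their *outputs* (not their weights).
def ensemble_predict(make_model, train, x_test, n_members=20):
    preds = []
    for seed in range(n_members):
        torch.manual_seed(seed)            # independent initialization per member
        model = train(make_model())        # train on the same dataset
        with torch.no_grad():
            preds.append(model(x_test))
    # averaging suppresses the output fluctuations responsible for the ~N^(-1/2) excess error
    return torch.stack(preds).mean(dim=0)
```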
Noisy convergence toward a well-defined dynamics...
...described by the NTK in which the weights do not move!
no feature learning?
Neural networks do learn features
Quoc V. Le, IEEE (2013)
Mean-Field limit
Mei et al. (2018); Rotskoff and Vanden-Eijnden (2018); Chizat and Bach (2018); Sirignano and Spiliopoulos (2020b); Mei et al. (2019); Nguyen (2019); Sirignano and Spiliopoulos (2020a), with recent development for deeper nets, see e.g. Nguyen and Pham (2020)
Another limit, different from the NTK, in which the neural network can learn features
NTK vs. Mean-Field
Q1: How can we quantify whether a network is in the Mean-Field or the NTK regime?
Q2: Which of the two limits performs better?
NTK: \(\displaystyle f(w, x) = \frac1{\sqrt{h}} w^{L+1} \phi\!\left(\frac1{\sqrt{h}} \dots\right)\)
Mean-Field: \(\displaystyle \tilde f(w, x) = \frac1h w^{L+1} \phi\!\left(\frac1{\sqrt{h}} \dots\right)\)
\(\tilde\alpha (\tilde f(w, x) - \tilde f(w_0, x))\)
(subtract the output at initialization and rescale by \(\tilde\alpha\), so that small weight changes can have an impact; \(\tilde\alpha\) interpolates between the two regimes)
Chizat et al., NeurIPS (2019)
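A minimal sketch of this rescaled predictor, in the spirit of Chizat et al. (2019); the wrapper name and details are illustrative, not the authors' code:

```python
import copy
import torch

# Illustrative wrapper for the rescaled predictor alpha * (f(w, x) - f(w0, x)).
class Rescaled(torch.nn.Module):
    def __init__(self, model, alpha):
        super().__init__()
        self.model = model
        self.alpha = alpha
        self.model0 = copy.deepcopy(model)          # frozen copy at initialization
        for p in self.model0.parameters():
            p.requires_grad_(False)

    def forward(self, x):
        # large alpha: tiny weight changes already move the output by O(1) -> lazy / NTK-like
        # small alpha: weights must move a lot, features change            -> mean-field-like
        return self.alpha * (self.model(x) - self.model0(x))
```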
Q1: How can we quantify whether a network is in the Mean-Field or the NTK regime?
\(\Delta\Theta \equiv \Theta(w) - \Theta(w_0)\)
Its level sets delimit a smooth transition between the NTK and Mean-Field regimes.
- Double descent is also present in feature learning
Fully connected, 2 hidden layers; \(h\) = number of neurons per layer
Q2: Which of the two limits performs better?
gradient flow
FC: lazy > feature
CNN: feature > lazy
Mario Geiger et al J. Stat. Mech. (2020) 113301
Mario Geiger et al Physics Reports V924 (2021)
Open problem: it is not well understood which regime performs better
Jaehoon Lee et al. (2020)
Bonus: Phase Diagram of SGD
Fully connected, MNIST, \(P=1024\)
Paradoxes
- Why are we not stuck in a local minimum? Crank up \(N\).
- What happens at the transition between the under- and over-parameterized regimes? There is a sharp jamming transition, in the same universality class as ellipses.
- Why does the test error decrease with \(N\)? Noisy convergence toward a well-defined dynamics.
- Q1: How can we quantify whether a network is in the Mean-Field or the NTK regime? The change in the NTK delimits a smooth transition controlled by \(\tilde\alpha\).
- Q2: Which of the two limits performs better? FC: lazy > feature; CNN: feature > lazy; still an open problem.
Data Symmetries
The curse-of-dimensionality paradox has not been answered so far.
One needs to take into account the structure of the data.
Data contain symmetries.
Landscape vs Parameterization
Data Symmetries
Diffeomorphisms
Euclidean Neural Network
Paradoxes
Is stability to diffeomorphisms responsible for beating the curse of dimensionality?
Aim: measure the stability toward diffeomorphisms
Maximum entropy distribution
diffeomorphisms are controlled by a temperature parameter
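A hedged sketch of how such temperature-controlled diffeomorphisms can be sampled; the exact parameterization used by Petrini et al. may differ, here the displacement field is a sum of sine modes whose variance is set by a temperature \(T\):

```python
import numpy as np

# Hedged sketch: a smooth random displacement field u(x) on an n x n grid, with sine-mode
# amplitudes whose variance is proportional to a temperature T and decays with the mode index.
def random_displacement_field(n, T, cutoff=4, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    xs = np.linspace(0, 1, n)
    X, Y = np.meshgrid(xs, xs, indexing="ij")
    u = np.zeros((2, n, n))                                          # (u_x, u_y) per pixel
    for i in range(1, cutoff + 1):
        for j in range(1, cutoff + 1):
            mode = np.sin(np.pi * i * X) * np.sin(np.pi * j * Y)     # vanishes on the boundary
            c = rng.normal(0.0, np.sqrt(T / (i**2 + j**2)), size=2)  # higher modes are weaker
            u += c[:, None, None] * mode
    return u   # warp the image by resampling it at x + u(x)
```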
Sensitivity to diffeomorphisms
\(D_f \propto |f(\tau x) - f(x)|\)
- averaged over inputs
- averaged over diffeomorphisms of the same temperature
- normalized by the amplitude of \(f\)
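A hedged sketch of how \(D_f\) can be estimated; `apply_diffeo` is a hypothetical helper that warps an image with a diffeomorphism sampled at temperature \(T\) (for example, built from the displacement field above):

```python
import torch

# Hedged sketch of estimating D_f.
def diffeo_sensitivity(f, images, T, apply_diffeo, n_diffeos=8):
    diffs, norms = [], []
    with torch.no_grad():
        for x in images:                                   # average over inputs
            fx = f(x.unsqueeze(0))
            norms.append((fx ** 2).mean())
            for _ in range(n_diffeos):                     # average over diffeos at fixed T
                ftx = f(apply_diffeo(x, T).unsqueeze(0))
                diffs.append(((ftx - fx) ** 2).mean())
    return torch.stack(diffs).mean() / torch.stack(norms).mean()   # normalized by the size of f
```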
Relative sensitivity
\(\displaystyle R_f = \frac{D_f}{G_f}\), where \(G_f\) is the same sensitivity measured for additive noise of matched amplitude
Leonardo Petrini et al., arXiv:2105.02468
(Relative) sensitivity to diffeomorphisms is key to beating the curse of dimensionality
Euclidean Neural Network
Can we generalize CNN invariance to other symmetries?
So far we have looked at 2D images
Equivariance
\(f(w, {\color{blue}D(g)} x) = {\color{blue}D'(g)} f(w, x)\)
| Task | \(D(g)\) | \(D'(g)\) |
|---|---|---|
| Classification with a CNN | translation | identity |
| Segmentation with a CNN | translation | translation |
| Aim | rotation | rotation |
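A minimal numerical check of this equivariance property for the translation case, using a standard PyTorch convolution with circular padding so that the shift is an exact symmetry:

```python
import torch

# f(D(g) x) = D'(g) f(x) with D(g) = D'(g) = a circular shift, f = a convolution.
conv = torch.nn.Conv2d(1, 4, kernel_size=3, padding=1, padding_mode="circular")
x = torch.randn(1, 1, 16, 16)

shift = lambda t: torch.roll(t, shifts=(3, 5), dims=(-2, -1))        # a group element g
print(torch.allclose(conv(shift(x)), shift(conv(x)), atol=1e-5))     # True: conv is translation-equivariant
```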
Daniel Worrall et al., CVPR (2017)
CNNs are not equivariant to rotations
slide taken from Tess Smidt
CNNs are easy to implement; common machine learning libraries provide the necessary functions
Equivariant neural networks for 3D rotations are harder to implement and require specialized functions
slide taken from Tess Smidt
e3nn
an open-source library for PyTorch and JAX (e3nn-jax)
Efficient code for:
- spherical harmonics
- tensor products between any irreps of O(3) (see the sketch below)
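A hedged usage sketch of these building blocks with e3nn's PyTorch API (argument names may differ slightly between versions):

```python
import torch
from e3nn import o3

# Spherical harmonics of a batch of 3D vectors, degrees l = 0, 1, 2 (output width 1+3+5 = 9)
pos = torch.randn(10, 3)
sh = o3.spherical_harmonics([0, 1, 2], pos, normalize=True)

# Tensor product of two vector (l=1, odd) inputs, keeping all allowed output irreps
tp = o3.FullTensorProduct("1x1o", "1x1o")     # 1o x 1o -> 0e + 1e + 2e
x1, x2 = torch.randn(10, 3), torch.randn(10, 3)
out = tp(x1, x2)

# Equivariance check: rotating the inputs rotates the output by the matching representation
R = o3.rand_matrix()
D_in = o3.Irreps("1x1o").D_from_matrix(R)
D_out = tp.irreps_out.D_from_matrix(R)
assert torch.allclose(tp(x1 @ D_in.T, x2 @ D_in.T), out @ D_out.T, atol=1e-4)
```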
e3nn is modular and flexible
The same building blocks allow implementing:
- message passing on graphs
- voxel convolution
- SE(3)-Transformers
- SphericalCNN
Cohen, Geiger, et al. (arXiv:1801.10130)
Fabian Fuchs et al. (arXiv:2006.10503)
Weiler, Geiger, et al. (arXiv:1807.02547)
Tess Smidt et al. (arXiv:1802.08219)
One example of results obtained using e3nn:
Ab initio molecular dynamics
Simon Batzner et al. (2021), arXiv:2101.03164
Is stability to diffeomorphisms responsible for beating the curse of dimensionality?
- We see a strong correlation between the relative stability and the test error
- It is acquired during training
- How is it learned?
- Is there a bound relating \(\epsilon\) and \(R_f\)?
Can we generalize CNN invariance to other symmetries?
- e3nn is equivariant to translations, rotations, and mirror reflections
- Why do equivariant neural networks perform better than invariant ones?
Thank you for your time
Slides for Ian's lab meeting
By Mario Geiger