Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. Cohen, Kaur, Li, Kolter, Talwalkar. arxiv.org/abs/2103.00065


Problem:    5K subset of CIFAR-10
Loss:       Quadratic (0-1)
Model:      2-hidden-layer MLP, tanh
Optimizer:  Full-batch GD
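A minimal sketch of this setup, not the paper's code (assumes torchvision for the data, one-hot 0-1 targets for the quadratic loss, and width 200 as in the later MLP slides):

```python
# Toy setup sketch: 5K CIFAR-10 subset, 2-hidden-layer tanh MLP,
# quadratic (squared) loss on one-hot 0-1 targets, full-batch GD.
import torch
import torch.nn as nn
from torchvision import datasets, transforms

data = datasets.CIFAR10("./data", train=True, download=True,
                        transform=transforms.ToTensor())
X = torch.stack([data[i][0] for i in range(5000)]).flatten(1)   # 5K subset
Y = nn.functional.one_hot(
        torch.tensor([data[i][1] for i in range(5000)]), 10).float()
model = nn.Sequential(nn.Linear(3072, 200), nn.Tanh(),
                      nn.Linear(200, 200), nn.Tanh(),
                      nn.Linear(200, 10))
loss = ((model(X) - Y) ** 2).mean()   # quadratic (0-1) loss, full batch
```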


Sharpness = \lambda_{\max}(\nabla^2 \mathcal{L}(w))

Figure 1:  The sharpness (largest Hessian eigenvalue) grows until it reaches 2/𝜂


(this is weird)

  • Architectures: VGG-11, ResNet-32
  • Run: Full-batch GD (with gradient accumulation)
  • Batch-Norm*: Ghost batch norm
  • Eigenvalue approximated on 10% of the data (see the sketch below)
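The eigenvalue estimate only needs Hessian-vector products. The paper uses Lanczos; the power iteration below is a simpler stand-in built on the same primitive, sketched in PyTorch:

```python
# Minimal sketch: estimate the sharpness lambda_max(∇²L(w)) by power
# iteration on Hessian-vector products (the paper uses Lanczos on a
# subset of the data; same primitive, simpler iteration).
import torch

def sharpness(loss, params, iters=50):
    """Power-iteration estimate of the largest Hessian eigenvalue."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat = torch.cat([g.reshape(-1) for g in grads])
    v = torch.randn_like(flat)
    v /= v.norm()
    lam = 0.0
    for _ in range(iters):
        # Hessian-vector product via double backprop: Hv = d(grad · v)/dw
        hv = torch.autograd.grad(flat @ v, params, retain_graph=True)
        hv = torch.cat([h.reshape(-1) for h in hv])
        lam = (v @ hv).item()        # Rayleigh quotient v^T H v
        v = hv / hv.norm()
    return lam
```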

Intuition from quadratics

Most of optimization assumes     𝜆ₘₐₓ ≤ L
(https://www.cs.ubc.ca/~schmidtm/Courses/5XX-S20/S1.pdf?page=26)

For quadratics:          𝜆ₘₐₓ ≤ L   ⇒   𝜂 < 2/L   works

For deep learning:          𝜂 = 2/L      ?⇒?      𝜆ₘₐₓ ≈ L
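The quadratic claim is easy to check numerically. A minimal sketch on the 1-D quadratic f(w) = (L/2)w², whose only Hessian eigenvalue is L:

```python
# GD on f(w) = (L/2) w^2: the update w <- w - eta*L*w multiplies w by
# (1 - eta*L), which contracts iff eta < 2/L; at exactly eta = 2/L the
# iterate oscillates between +-w0 forever, and beyond it diverges.
L = 4.0
for eta in [0.4, 0.5, 0.6]:          # 2/L = 0.5
    w = 1.0
    for _ in range(100):
        w -= eta * L * w             # w <- (1 - eta*L) * w
    print(f"eta={eta}: |w| = {abs(w):.2e}")
# eta=0.4 converges, eta=0.5 oscillates (|w| stays 1), eta=0.6 diverges
```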

"sharpness vs generalization"*

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Keskar, Mudigere, Nocedal, Smelyanskiy. Tang. https://arxiv.org/abs/1609.04836

"sharpness vs generalization"*

Three Factors Influencing Minima in SGD. Jastrzębski, Kenton, Arpit, Ballas, Fischer, Bengio, Storkey. ​arxiv.org/abs/1711.04623

It's not just on a toy dataset

What's happening?

All trajectories start the same

(for similar step-sizes)

Logistic loss


Checked against gradient flow, integrated with Runge-Kutta (RK4); see the sketch below
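A minimal sketch of that check: integrate gradient flow dw/dt = −∇f(w) with classical RK4 to get the continuous-time reference trajectory (the quadratic example is an illustration, not the slides' network):

```python
# RK4 integration of gradient flow dw/dt = -grad f(w), the
# continuous-time trajectory that small-step GD is compared against.
import numpy as np

def rk4_gradient_flow(grad, w, dt, steps):
    for _ in range(steps):
        k1 = -grad(w)
        k2 = -grad(w + 0.5 * dt * k1)
        k3 = -grad(w + 0.5 * dt * k2)
        k4 = -grad(w + dt * k3)
        w = w + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return w

# Example on a quadratic f(w) = 0.5 * w^T A w, so grad f(w) = A w:
A = np.diag([4.0, 1.0])
w_end = rk4_gradient_flow(lambda w: A @ w, np.array([1.0, 1.0]),
                          dt=0.01, steps=1000)
```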

Caveats

 

Tracks less well for ReLU/Max-Pooling, but sometimes works

What happens when 𝜆ₘₐₓ = 2/𝜂?

(Oscillations are not due to stochasticity)
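A minimal deterministic illustration: on a 2-D quadratic with one eigenvalue pinned exactly at 2/𝜂, full-batch GD oscillates along the sharp direction (no noise anywhere) while still converging along the flat one:

```python
# GD on f(w) = 0.5 * (lam1*w1^2 + lam2*w2^2) with lam1 = 2/eta: the
# sharp coordinate flips sign forever (pure oscillation, fully
# deterministic), while the flat coordinate keeps converging.
import numpy as np

eta = 0.5
lams = np.array([2 / eta, 0.2])      # lam1 sits exactly at 2/eta
w = np.array([1.0, 1.0])
for _ in range(200):
    w = w - eta * lams * w           # full-batch GD, no stochasticity
print(w)                             # w[0] = +-1 forever, w[1] -> 0
```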

Key empirical observation

  • The largest eigenvalue of the Hessian increases during training
    ("progressive sharpening")

  • The maximum "stable" value depends on the step-size
    (larger step-sizes lead to smaller 𝜆ₘₐₓ)

  • GD oscillates when it hits that threshold and sharpening stops
    (but still makes progress; see the sketch below)
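Putting the pieces together, a minimal end-to-end sketch (assumed toy regression task, not the paper's setup; reuses the `sharpness` power-iteration estimator sketched earlier) to check the three bullets above:

```python
# Train a small tanh MLP with full-batch GD and track the sharpness;
# progressive sharpening predicts lambda_max rising toward 2/eta and
# then hovering there. Random-data task assumed, for illustration only.
import torch
import torch.nn as nn

torch.manual_seed(0)
X, y = torch.randn(256, 10), torch.randn(256, 1)
model = nn.Sequential(nn.Linear(10, 64), nn.Tanh(),
                      nn.Linear(64, 64), nn.Tanh(),
                      nn.Linear(64, 1))
params = list(model.parameters())
eta = 0.05
for step in range(2001):
    loss = nn.functional.mse_loss(model(X), y)
    if step % 200 == 0:
        lam = sharpness(loss, params)   # estimator sketched earlier
        print(f"step {step}: loss {loss.item():.4f}, "
              f"lambda_max {lam:.1f} (2/eta = {2 / eta:.1f})")
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= eta * g                 # full-batch GD step
```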

"What if",        validation,         NTK regime

What if: Momentum

"𝜂 < 2/L" equivalent for momentum on quadratics


What if: Drop the learning rate

What if: Use 𝜂 = 1/𝜆ₜ
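A minimal sketch of this experiment (hypothetical, not the paper's code; reuses `model`, `X`, `Y` from the setup sketch and the `sharpness` estimator):

```python
# Re-estimate the sharpness every step and take eta_t = 1/lambda_t,
# i.e. a step-size that tracks the current top curvature.
import torch

params = list(model.parameters())
for step in range(1000):
    loss = ((model(X) - Y) ** 2).mean()
    lam = sharpness(loss, params)            # current lambda_max
    grads = torch.autograd.grad(loss, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= (1.0 / lam) * g             # eta_t = 1 / lambda_t
```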

Validation

  • Simplest architecture that does this?
     
  • What changes don't affect behavior?
     
  • What changes do?

Simplest architecture with this behavior

Deep (20-layer) linear network
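A minimal sketch of that architecture (dimensions are assumptions: width 200 matches the MLP setup, input/output sized for CIFAR-10):

```python
# 20-layer purely linear network: no activations anywhere, so the
# function is just a product of matrices, yet per the slides it still
# shows progressive sharpening / edge-of-stability behavior.
import torch.nn as nn

depth, width = 20, 200
hidden = [nn.Linear(width, width, bias=False) for _ in range(depth - 2)]
model = nn.Sequential(nn.Linear(3072, width, bias=False), *hidden,
                      nn.Linear(width, 10, bias=False))
```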

Changes that don't affect behavior

  • Loss function: Squared loss, Logistic loss 
  • Activation function: Tanh, Softplus, ELU, ReLU
  • Block: MLP or Convolution

Starting from the toy model on CIFAR-10 (5K)

MLP, Tanh, Squared loss

MLP, Tanh, Logistic

MLP, ELU, Squared Loss

MLP, ELU, Logistic Loss

CNN, ReLU, Squared Loss

CNN, ReLU, Logistic Loss

Changes that matter

Last week: NTK and Infinite width

gradient flow → kernel solution, with the kernel frozen at initialization
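As a formula (standard NTK result for squared loss on the training set; Θ₀ is the empirical NTK at initialization):

```latex
% Infinite-width gradient flow on the training outputs f_t:
\frac{d f_t}{dt} = -\Theta_0\,(f_t - y)
\quad\Longrightarrow\quad
f_t = y + e^{-\Theta_0 t}\,(f_0 - y)
% The kernel, and hence the curvature the optimizer sees, is frozen
% at initialization: no progressive sharpening in this regime.
```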

How does this match with "Edge of stability"?

Progressive Sharpening

Initialization:  Less sharpening with NTK init
Width:           Wider ⇒ Less sharpening
Depth:           Deeper ⇒ More sharpening
Data:            More data ⇒ More sharpening

Changes that matter

  • Start: 2-hidden-layer MLP, 200 neurons, tanh
  • Optimizer: Gradient flow (RK4)
  • 5 restarts

Changes that matter: Width

(standard initialization)

Changes that matter: Width

(NTK initialization)

Changes that matter: Depth

(probably width/depth or width/data ratios)

Changes that matter: Dataset Size

Glossed over: BatchNorm, Transformers, SGD, Generalization

  • Great* empirical paper

  • NNs are not quadratics
  • GD is not monotonic
  • 𝜆ₘₐₓ increases during training

  • In normal training, we're closer to this regime than to gradient flow / NTK