Copy of brain_march

Gradient Descent and the Edge of Stability

Amin

April 2023

Gradient Descent and the Edge of Stability

Amin

April 2023

Stability?

Let \(\ell: \mathbb{R}^P \to \mathbb{R}\) be continuously differentiable and \(L\)-smooth.
Consider gradient descent on \(\ell\) with a step size of \(\eta\).
Descent Lemma says:
Proof: When the function is L-smooth, max eigenvalue of hessian is always smaller (or equal) than L.

\ell(\theta_{t+1}) \le \ell(\theta_t) - \frac{\eta (2 - \eta L)}{2} || \nabla \ell(\theta_t)||^2

Self-Stabilization:

Gradient Descent Implicitly Solves:
By evolving around the constrained trajectory
where \(\mathcal{M}\) is:

\mathcal{M} = \{ \theta: S(\theta) \le 2 / \eta \wedge \nabla L(\theta) \cdot u(\theta) = 0 \}

Self-Stabilization

Self-Stabilization: Sketch

Assumption 1: Existence of progressive sharpening \(\alpha = \nabla L \cdot \nabla S > 0 \).
Assumption 2: Eigengap in hessian!

Stage 1

We can Taylor expand the sharpness around the reference point:

Stage 2

Note: \(u \cdot \nabla L(\theta_t) \approx S(\theta_t) x_t\) because:

u \cdot \nabla L(\theta^\star) \approx u \cdot \frac{\eta}{2} S(\theta_t) \nabla L(\theta^\star) \approx \frac{1}{2} S(\theta_t) u \cdot \eta \nabla L(\theta^\star)

(I don't understand how they ignore the
factor of 2)

Stage 3

Hence, adding the cubic term back in the Taylor expansion for \(y_t\), we have:
Therefore, when \(x_t > \sqrt{2\alpha/\beta}\), sharpness
begins to decrease.

Stage 4

Hence the full dynamics can be described as:

Self-Stabilization

Main Result

Takeaway 1: GD's behaviour is not monotonic over short timescales, but is over long timescales.
Takeaway 2: GD is guaranteed to not increase training loss over long timescales, no matter how non-convex the loss is.

Experiments

Limitations

Assumption 5 in the paper is certainly not true for almost all the practical models. But it can be broken down to multiple more-manageable pieces.

Assumption 4 is extremely strong, and it directly hints at self-stabilization.

Proof Idea

The proof idea is to analyze the the constrained update's trajectory along the directions of velocity of sharpness and top eigenvector:
And bounding the displacement using a cubic Taylor-expansion (and strong assumptions to make it work :p)

Gradient Descent and the Edge of Stability

Gradient Descent and the Edge of Stability

Stability?

Self-Stabilization:

Self-Stabilization

Self-Stabilization: Sketch

Stage 1

Stage 2

Stage 3

Stage 4

Self-Stabilization

Main Result

Experiments

Experiments

Limitations

Proof Idea

Extension: Multiple Unsabilities

Thank You!