Gradient Descent and the Edge of Stability
Amin
April 2023



Stability?
- Let \(\ell: \mathbb{R}^P \to \mathbb{R}\) be continuously differentiable and \(L\)-smooth.
- Consider gradient descent on \(\ell\) with a step size of \(\eta\).
- The Descent Lemma says:
\[\ell(\theta_{t+1}) \le \ell(\theta_t) - \frac{\eta (2 - \eta L)}{2} \|\nabla \ell(\theta_t)\|^2\]
- Proof idea: when the function is \(L\)-smooth, the maximum eigenvalue of the Hessian is at most \(L\), so \(\ell\) is upper-bounded by a quadratic with curvature \(L\); plugging the GD update into that bound gives the inequality.
- In particular, the loss decreases whenever \(\eta < 2/L\), i.e. whenever the sharpness stays below \(2/\eta\): this threshold is the "edge of stability".
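For completeness, a sketch of the standard derivation from the quadratic upper bound that \(L\)-smoothness gives (the usual textbook argument, not specific to the paper):
\begin{align*}
\ell(\theta_{t+1}) &\le \ell(\theta_t) + \nabla \ell(\theta_t) \cdot (\theta_{t+1} - \theta_t) + \frac{L}{2} \|\theta_{t+1} - \theta_t\|^2 \\
&= \ell(\theta_t) - \eta \|\nabla \ell(\theta_t)\|^2 + \frac{\eta^2 L}{2} \|\nabla \ell(\theta_t)\|^2 \\
&= \ell(\theta_t) - \frac{\eta (2 - \eta L)}{2} \|\nabla \ell(\theta_t)\|^2.
\end{align*}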

Self-Stabilization
- Gradient descent implicitly solves the constrained problem \(\min_\theta L(\theta)\) subject to \(S(\theta) \le 2/\eta\),
- by evolving around the constrained trajectory that stays in \(\mathcal{M}\),
- where \(\mathcal{M}\) is:


\[\mathcal{M} = \{\, \theta : S(\theta) \le 2/\eta \ \wedge\ \nabla L(\theta) \cdot u(\theta) = 0 \,\}\]
- Here \(L\) is the training loss, \(S(\theta)\) is the sharpness (the top eigenvalue of \(\nabla^2 L(\theta)\)), and \(u(\theta)\) is the corresponding unit eigenvector.
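As a concrete handle on these quantities, a minimal numerical sketch (assuming a user-supplied gradient function `grad_fn`; the names `hvp`, `sharpness_and_top_eigvec`, and `in_M` are illustrative, not from the paper):

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Finite-difference Hessian-vector product H(theta) @ v."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

def sharpness_and_top_eigvec(grad_fn, theta, iters=200, seed=0):
    """Estimate S(theta) and u(theta) by power iteration on the Hessian."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=theta.shape)
    u /= np.linalg.norm(u)
    for _ in range(iters):
        Hu = hvp(grad_fn, theta, u)
        u = Hu / np.linalg.norm(Hu)
    return u @ hvp(grad_fn, theta, u), u  # Rayleigh quotient, eigenvector

def in_M(grad_fn, theta, eta, tol=1e-3):
    """Check the two defining conditions of M at theta."""
    S, u = sharpness_and_top_eigvec(grad_fn, theta)
    return S <= 2 / eta + tol and abs(grad_fn(theta) @ u) <= tol
```

Power iteration returns the eigenvector of largest absolute eigenvalue; near the edge of stability the top Hessian eigenvalue is positive and well separated (Assumption 2 below), so this is adequate for a sketch.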

Self-Stabilization: Sketch

- Assumption 1: progressive sharpening exists, i.e. \(\alpha := -\nabla L(\theta^\star) \cdot \nabla S(\theta^\star) > 0\) (a gradient step increases the sharpness).
- Assumption 2: an eigengap in the Hessian, so the top eigenvalue is well separated from the second and \(u(\theta)\) is well defined.

Stage 1

- We can Taylor expand the sharpness around the reference point \(\theta^\star\) on the constrained trajectory (where \(S(\theta^\star) \approx 2/\eta\)):
\[S(\theta_t) \approx S(\theta^\star) + \nabla S(\theta^\star) \cdot (\theta_t - \theta^\star)\]
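To fix notation for the stages below (a simplified form of the paper's decomposition: the displacements tracked along the top eigenvector and the sharpness gradient):
\[x_t := u(\theta^\star) \cdot (\theta_t - \theta^\star), \qquad y_t := S(\theta_t) - \frac{2}{\eta} \approx \nabla S(\theta^\star) \cdot (\theta_t - \theta^\star)\]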


Stage 2

- Note: \(u \cdot \nabla L(\theta_t) \approx S(\theta_t)\, x_t\), because expanding the gradient around \(\theta^\star\) gives
\[u \cdot \nabla L(\theta_t) \approx u \cdot \nabla L(\theta^\star) + u^\top \nabla^2 L(\theta^\star)(\theta_t - \theta^\star) = 0 + S(\theta^\star)\, x_t \approx S(\theta_t)\, x_t,\]
using \(u \cdot \nabla L(\theta^\star) = 0\) on \(\mathcal{M}\), that \(u\) is the top eigenvector, and \(S(\theta^\star) \approx S(\theta_t)\) to leading order.
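Plugging this into the gradient-descent update gives the blow-up dynamics of Stage 2 (a one-line consequence of the note above):
\[x_{t+1} = u \cdot (\theta_t - \eta \nabla L(\theta_t) - \theta^\star) = x_t - \eta\, u \cdot \nabla L(\theta_t) \approx (1 - \eta S(\theta_t))\, x_t\]
While \(S(\theta_t) > 2/\eta\), the multiplier satisfies \(|1 - \eta S(\theta_t)| > 1\), so \(|x_t|\) grows geometrically.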

Stage 3

- Hence, adding the cubic term back into the Taylor expansion, the update for \(y_t\) becomes
\[y_{t+1} \approx y_t + \eta \alpha - \frac{\eta \beta}{2}\, x_t^2, \qquad \beta := \|\nabla S(\theta^\star)\|^2.\]
- Therefore, when \(x_t > \sqrt{2\alpha/\beta}\), the sharpness begins to decrease.
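The threshold is immediate from the sign of the \(y_t\) increment:
\[y_{t+1} - y_t \approx \eta \Big( \alpha - \frac{\beta}{2} x_t^2 \Big) < 0 \iff x_t^2 > \frac{2\alpha}{\beta}.\]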


Stage 4



- Hence the full dynamics can be described by the coupled system (substituting \(S(\theta_t) = 2/\eta + y_t\) into the Stage 2 update):
\[x_{t+1} = -(1 + \eta y_t)\, x_t, \qquad y_{t+1} = y_t + \eta \alpha - \frac{\eta \beta}{2}\, x_t^2\]
- A numerical sketch of these dynamics follows below.
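A minimal simulation of this two-variable system (the values of \(\eta\), \(\alpha\), \(\beta\) and the initial condition are illustrative, not taken from any experiment):

```python
import numpy as np

eta, alpha, beta = 0.01, 1.0, 1.0   # illustrative constants
x, y = 1e-4, 0.0                    # tiny displacement along u; sharpness at 2/eta
xs, ys = [x], [y]
for t in range(5000):
    # Simultaneous update: both right-hand sides use the old (x, y).
    x, y = -(1 + eta * y) * x, y + eta * (alpha - 0.5 * beta * x**2)
    xs.append(x)
    ys.append(y)

xs, ys = np.array(xs), np.array(ys)
# y_t = S(theta_t) - 2/eta oscillates around 0: the sharpness hovers at the
# edge 2/eta while |x_t| repeatedly blows up and collapses (the EoS cycle).
print(f"y range: [{ys.min():.3f}, {ys.max():.3f}], max |x|: {np.abs(xs).max():.3f}")
```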


Main Result

- Takeaway 1: the training loss under GD is non-monotone over short timescales, but decreases monotonically over long timescales.
- Takeaway 2: GD is guaranteed not to increase the training loss over long timescales, no matter how non-convex the loss is.
Experiments


Limitations

- Assumption 5 in the paper almost certainly fails for practical models, but it can be broken down into several more manageable pieces.
- Assumption 4 is extremely strong, and it directly hints at self-stabilization.

Proof Idea
- The idea is to analyze the constrained trajectory's update along the two key directions, the sharpness gradient \(\nabla S\) (the "velocity" of the sharpness) and the top eigenvector \(u\),
- and to bound the displacement from it using a cubic Taylor expansion (plus strong assumptions to make it work :p); a sketch of that expansion follows below.
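The cubic expansion in question (the identity \(\nabla^3 L[u,u] = \nabla S\) is the standard first-order perturbation formula for the top eigenvalue, valid under the eigengap of Assumption 2):
\[\nabla L(\theta^\star + \delta) \approx \nabla L(\theta^\star) + \nabla^2 L(\theta^\star)\, \delta + \frac{1}{2} \nabla^3 L(\theta^\star)[\delta, \delta], \qquad \nabla^3 L(\theta^\star)[u, u] = \nabla S(\theta^\star),\]
so for \(\delta \approx x_t u\) the dominant cubic contribution is \(\frac{x_t^2}{2} \nabla S(\theta^\star)\), which is exactly the term driving Stage 3.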


Extension: Multiple Instabilities

Thank You!