Gradient Descent and the Edge of Stability

Amin

April 2023


Stability?

  • Let \(\ell: \mathbb{R}^P \to \mathbb{R}\) be continuously differentiable and \(L\)-smooth.
  • Consider gradient descent on \(\ell\) with step size \(\eta\).
  • The Descent Lemma says:
\ell(\theta_{t+1}) \le \ell(\theta_t) - \frac{\eta (2 - \eta L)}{2} \| \nabla \ell(\theta_t)\|^2
  • Proof idea: when the function is \(L\)-smooth, the maximum eigenvalue of the Hessian is at most \(L\), so a second-order Taylor expansion of \(\ell\) around \(\theta_t\) yields the bound. Note that the bound guarantees a decrease in loss precisely when \(0 < \eta < 2/L\) (see the quick check below).
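  • As a sanity check (my own sketch, not from the deck), the \(2/L\) threshold can be seen on a one-dimensional quadratic, where the single Hessian eigenvalue equals the smoothness constant:

    # Toy quadratic: l(theta) = 0.5 * L_smooth * theta^2, so the Hessian
    # eigenvalue (and the smoothness constant) is exactly L_smooth.
    L_smooth = 4.0
    loss = lambda theta: 0.5 * L_smooth * theta**2
    grad = lambda theta: L_smooth * theta

    def run_gd(eta, theta0=1.0, steps=20):
        """Run gradient descent and return the final loss."""
        theta = theta0
        for _ in range(steps):
            theta -= eta * grad(theta)
        return loss(theta)

    # The Descent Lemma guarantees progress only for eta < 2 / L_smooth:
    print("eta = 0.9 * 2/L:", run_gd(0.9 * 2 / L_smooth))  # loss shrinks toward 0
    print("eta = 1.1 * 2/L:", run_gd(1.1 * 2 / L_smooth))  # iterates oscillate and diverge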

Self-Stabilization:

  • Gradient Descent implicitly solves the constrained problem
\min_{\theta} L(\theta) \quad \text{s.t.} \quad S(\theta) \le 2/\eta
  • by evolving around the constrained trajectory \(\theta^\star_t\): the projected gradient flow that stays on the feasible set \(\mathcal{M}\),
  • where \(\mathcal{M}\) is:
\mathcal{M} = \{ \theta: S(\theta) \le 2 / \eta \wedge \nabla L(\theta) \cdot u(\theta) = 0 \}
  • Here \(S(\theta)\) is the sharpness, the top eigenvalue of the Hessian \(\nabla^2 L(\theta)\), and \(u(\theta)\) is the corresponding top eigenvector (made concrete in the sketch below).
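  • To make these objects concrete, here is a minimal sketch (my own illustration, using a hypothetical toy loss on \(\mathbb{R}^2\)) that computes \(S(\theta)\) and \(u(\theta)\) from an explicit Hessian and checks the two conditions defining \(\mathcal{M}\):

    import numpy as np

    # Hypothetical toy loss: L(a, b) = 0.5 * b * a^2 + 0.5 * (b - 1)^2.
    def grad_L(theta):
        a, b = theta
        return np.array([b * a, 0.5 * a**2 + (b - 1.0)])

    def hess_L(theta):
        a, b = theta
        return np.array([[b, a],
                         [a, 1.0]])

    def sharpness_and_u(theta):
        """S(theta): top Hessian eigenvalue; u(theta): its eigenvector."""
        evals, evecs = np.linalg.eigh(hess_L(theta))
        return evals[-1], evecs[:, -1]

    eta = 2.0 / 1.5                     # so the constraint reads S(theta) <= 1.5
    theta = np.array([0.0, 1.2])
    S, u = sharpness_and_u(theta)
    in_M = S <= 2.0 / eta and np.isclose(grad_L(theta) @ u, 0.0)
    print(f"S(theta) = {S:.3f}, 2/eta = {2.0/eta:.3f}, theta in M: {in_M}")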

Self-Stabilization

Self-Stabilization: Sketch

  • Assumption 1: Progressive sharpening exists: \(\alpha = -\nabla L \cdot \nabla S > 0\), i.e. the sharpness increases along the negative gradient direction.
  • Assumption 2: An eigengap in the Hessian: the top eigenvalue is well separated from the rest of the spectrum.

Stage 1

  • We can Taylor expand the sharpness around the reference point \(\theta^\star\) on the constrained trajectory:
S(\theta_t) \approx S(\theta^\star) + \nabla S(\theta^\star) \cdot (\theta_t - \theta^\star)
  • Since \(S(\theta^\star) \approx 2/\eta\) at the edge of stability, the gap \(y_t = \nabla S(\theta^\star) \cdot (\theta_t - \theta^\star)\) measures how far the sharpness sits above \(2/\eta\) (checked numerically below).
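  • As a quick numerical check (my own sketch, reusing a hypothetical Hessian of the same shape as above):

    import numpy as np

    # Hypothetical Hessian H(a, b) = [[b, a], [a, 1]]; the sharpness is its top eigenvalue.
    def sharpness(theta):
        a, b = theta
        return np.linalg.eigvalsh(np.array([[b, a], [a, 1.0]]))[-1]

    theta_star = np.array([0.0, 1.2])
    grad_S_star = np.array([0.0, 1.0])   # gradient of the sharpness at theta*, by hand

    theta_t = theta_star + np.array([0.05, 0.1])
    print("exact S(theta_t):      ", sharpness(theta_t))
    print("first-order expansion: ", sharpness(theta_star) + grad_S_star @ (theta_t - theta_star))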

Stage 2

  • Note: \(u \cdot \nabla L(\theta_t) \approx S(\theta_t)\, x_t\), where \(x_t = u \cdot (\theta_t - \theta^\star)\), because Taylor expanding the gradient around \(\theta^\star\) gives:
u \cdot \nabla L(\theta_t) \approx u \cdot \nabla L(\theta^\star) + u \cdot \nabla^2 L(\theta^\star) (\theta_t - \theta^\star) = S(\theta^\star)\, x_t \approx S(\theta_t)\, x_t
  • using that \(\nabla L(\theta^\star) \cdot u = 0\) on \(\mathcal{M}\) and that \(u\) is an eigenvector: \(\nabla^2 L(\theta^\star)\, u = S(\theta^\star)\, u\).

Stage 3

  • Hence, adding the cubic term back into the Taylor expansion, the sharpness gap \(y_t\) evolves as:
y_{t+1} \approx y_t + \eta \left( \alpha - \frac{\beta}{2} x_t^2 \right)
  • where \(\alpha\) is the progressive-sharpening rate from Assumption 1 and \(\beta\) is the coefficient of the cubic term.
  • Therefore, when \(x_t > \sqrt{2\alpha/\beta}\), the cubic term dominates and the sharpness begins to decrease.

Stage 4

  • Hence the full dynamics can be described by the coupled system (simulated in the sketch below):
x_{t+1} \approx -(1 + \eta y_t)\, x_t
y_{t+1} \approx y_t + \eta \left( \alpha - \frac{\beta}{2} x_t^2 \right)
  • While \(y_t > 0\), the displacement \(x_t\) oscillates with growing amplitude; once \(x_t\) passes the threshold, \(y_t\) falls, which in turn damps \(x_t\): a self-stabilizing cycle.
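  • As an illustration (my own sketch, with hypothetical values for \(\eta\), \(\alpha\), and \(\beta\)), simulating this two-variable system exhibits the cycle: \(y_t\) ramps up, \(|x_t|\) blows up past the threshold, \(y_t\) crashes, and \(|x_t|\) is damped back down:

    import numpy as np

    # Reduced (x_t, y_t) dynamics from Stage 4; all constants are hypothetical.
    eta, alpha, beta = 0.05, 1.0, 50.0
    x, y = 0.01, 0.1                    # small displacement, small positive sharpness gap
    threshold = np.sqrt(2 * alpha / beta)

    for t in range(200):
        x, y = -(1 + eta * y) * x, y + eta * (alpha - 0.5 * beta * x**2)
        if t % 25 == 0:
            print(f"t={t:3d}  |x_t|={abs(x):.4f}  y_t={y:+.4f}  (threshold={threshold:.3f})")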

Self-Stabilization

Main Result

  • Takeaway 1: GD's behaviour is non-monotonic over short timescales, but monotonic over long timescales.
  • Takeaway 2: GD is guaranteed not to increase the training loss over long timescales, no matter how non-convex the loss is.

Experiments

Experiments

Limitations

  • Assumption 5 in the paper does not hold for almost any practical model, but it can be broken down into multiple more manageable pieces.
  • Assumption 4 is extremely strong, and it directly hints at self-stabilization.

Proof Idea

  • The proof idea is to analyze the constrained update's trajectory along two directions, the top eigenvector \(u(\theta^\star)\) and the direction of change of the sharpness \(\nabla S(\theta^\star)\), i.e. to track:
x_t = u(\theta^\star) \cdot (\theta_t - \theta^\star) \qquad y_t = \nabla S(\theta^\star) \cdot (\theta_t - \theta^\star)
  • and to bound the displacement using a third-order Taylor expansion (and strong assumptions to make it work :p)

Extension: Multiple Instabilities

Thank You!
