**Victor Sanches Portella**

September 2023

cs.ubc.ca/~victorsp

joint with **Frederik Kunstner**, **Nick Harvey**, and **Mark Schmidt**

**2023 SIAM PNW Conference**

**Gradient descent** is usually good enough

\(O(d)\) **time and space per iteration** is preferable

Finding good **step-sizes/hyperparameters** is **costly**

We want methods that are "**parameter-free/adaptive**"

What does **adaptive** mean?

AdaGrad

Adam

RMSProp

BFGS

Barzilai-Borwein

Line-search

Hypergradient Descent

Newton's Method

Training an ML model is usually cast as

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

\displaystyle \alpha = \tfrac{1}{L} \implies f(x_t) - f(x_*) \lesssim \left( 1 - \frac{1}{\kappa} \right)^t

\displaystyle \kappa = L/\mu

**What if we don't know \(L\)?**

**Line-search!**

"Halve your step-size if too big"

f(x_t) - f(x_*) \displaystyle \lesssim \left( 1 - \frac{1}{2} \cdot \frac{1}{\kappa} \right)^t

\(f\) is **\(L\)-smooth** and **\(\mu\)-strongly convex**

\mu \cdot I \preceq \nabla^2 f(x) \preceq L \cdot I

"Easy to optimize"

x_{t+1} = x_t - \alpha \nabla f(x_t)

**Gradient Descent**

We can often do better if we use a (diagonal) **matrix preconditioner** \(P\)

x_{t+1} = x_t - P \nabla f(x_t)

What is a **good** \(P\)?

**(Quasi-)Newton Methods**: super-linear convergence close to opt; may need 2nd-order information.

**Hypergradient Methods**: hyperparameter tuning as an opt problem; unstable and no theory/guarantees.

**Online Learning**: formally adapts to adversarial and changing inputs; too conservative in this case (e.g., AdaGrad); "fixes" (e.g., Adam) have few guarantees.

*"AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees.(...) Nowadays, [adaptive methods] seems to denote any kind of coordinate-wise learning rates that does not guarantee anything in particular."*

**Orabona, F.** (2019). A modern introduction to online learning.

In **Smooth** and **Strongly Convex** optimization, adaptive methods only guarantee (globally)

f(x_t) - f(x_*) \displaystyle \lesssim \left( 1 - O\Big(\frac{1}{\kappa}\Big) \right)^t

**Should be better** if there is a good preconditioner \(P\)

Can we get a **line-search analog** for **diagonal preconditioners**?

*"AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees.(...) Nowadays, [adaptive methods] seems to denote any kind of coordinate-wise learning rates that does not guarantee anything in particular."*

**Orabona, F.** (2019). A modern introduction to online learning.

|  | **Online Learning** | **Smooth Optimization** |
| --- | --- | --- |
| **1 step-size** | Scalar AdaGrad (and others) | Backtracking Line-search |
| **\(d\) step-sizes** (diagonal preconditioner) | Diagonal AdaGrad (and others) | **Multidimensional Backtracking** |

\(f\) is **\(L\)-smooth** and **\(\mu\)-strongly convex**:

\mu \cdot I \preceq \nabla^2 f(x) \preceq L \cdot I

Equivalently, with \(P = \tfrac{1}{L} I\) and \(\kappa = \tfrac{L}{\mu}\):

\displaystyle \frac{1}{\kappa} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

**Optimal step-size**: biggest that guarantees progress

**Optimal preconditioner**: **biggest (??)** that guarantees progress

**Optimal Diagonal Preconditioner**

**Over diagonal matrices**, \(P_*\) minimizes \(\kappa_*\) such that

\displaystyle \frac{1}{\kappa_*} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

\(\kappa_* \leq \kappa\), hopefully \(\kappa_* \ll \kappa\)
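A quick numerical sketch of this definition (illustrative only; `kappa_of_preconditioner` and the quadratic below are ours, not from the paper): for a fixed Hessian \(H\) and diagonal \(P\), the best \(\kappa\) achievable by \(P\) up to rescaling is the condition number of \(P^{1/2} H P^{1/2}\).

```python
import numpy as np

def kappa_of_preconditioner(H, p):
    """Smallest kappa such that (1/kappa) P^{-1} <= H <= P^{-1} holds for P = diag(p),
    after rescaling P by a constant.  (Illustrative helper, not from the paper.)"""
    P_sqrt = np.diag(np.sqrt(p))
    eigs = np.linalg.eigvalsh(P_sqrt @ H @ P_sqrt)  # eigenvalues of P^{1/2} H P^{1/2}
    return eigs.max() / eigs.min()

H = np.diag([1000.0, 0.001])                          # badly conditioned Hessian, kappa = 10^6
print(kappa_of_preconditioner(H, np.ones(2)))         # no preconditioning: ~1e6
print(kappa_of_preconditioner(H, 1.0 / np.diag(H)))   # best diagonal here: kappa_* = 1
```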

**Line-search**

\displaystyle f(x_t) - f(x_*) \leq \Big(1 - \frac{1}{2} \cdot \frac{1}{\kappa}\Big)^t (f(x_0) - f(x_*))

\# backtracks \(\leq \log(\alpha_{0} \cdot L)\)

**Multidimensional Backtracking** (our algorithm)

\displaystyle f(x_t) - f(x_*) \leq \Big(1 - \frac{1}{\sqrt{2d}} \cdot \frac{1}{\kappa_*}\Big)^t (f(x_0) - f(x_*))

\# backtracks \(\lesssim d \cdot \log(\alpha_0 \cdot L)\)

Worth it if \(\sqrt{2d}\, \kappa_* \ll 2 \kappa\)

**Line-search**: test if step-size \(\alpha_{\max}/2\) makes enough progress:

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{\alpha_{\max}}{2} \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

**Armijo condition**: "Progress if \(f\) were \(\frac{2}{\alpha_{\max}}\)-smooth"

Candidate step-sizes: interval \([0, \alpha_{\max}]\)

If this fails, **cut out** everything bigger than \(\alpha_{\max}/2\)

[Figure: interval of candidate step-sizes from \(0\) to \(\alpha_{\max}\), with marks at \(\alpha_{\max}/4\), \(\alpha_{\max}/2\), and \(1/L\)]

**Preconditioner search**: test if preconditioner \(P\) makes enough progress:

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_P^2

"Progress if \(f\) were \(P\)-smooth"

Candidate preconditioners: diagonals in a box/ellipsoid

If this fails, **cut out** everything bigger than \(P\)

**Main technical idea**

If \(P\) does not yield **sufficient progress**, which preconditioners can be thrown out?

\displaystyle h(P) = f(x - P \nabla f(x)) - \big(f(x) - \tfrac{1}{2} \lVert \nabla f(x) \rVert_P^2\big)

\(P\) yields sufficient progress \(\iff\) \(h(P) \leq 0\)

Convexity \(\implies\) \(\nabla h(P)\) induces a **separating hyperplane!**

**"Hypergradient"**
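A minimal sketch of this idea for diagonal preconditioners \(P = \mathrm{Diag}(p)\) (the quadratic and the helper below are ours, for illustration): \(h(p) \leq 0\) is exactly the sufficient-progress test, and \(\nabla h(p)\) is the "hypergradient" defining the cut, since convexity of \(h\) implies that every \(p'\) with \(\langle \nabla h(p), p' - p\rangle \geq 0\) also fails the test.

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x (our example).
A = np.diag([1000.0, 0.001])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def h_and_hypergradient(x, p):
    """h(p) = f(x - p * g) - (f(x) - 0.5 * sum_i p_i g_i^2), with g = grad f(x).
    P = Diag(p) makes sufficient progress  <=>  h(p) <= 0."""
    g = grad_f(x)
    x_next = x - p * g
    h = f(x_next) - (f(x) - 0.5 * np.sum(p * g**2))
    hypergrad = -g * grad_f(x_next) + 0.5 * g**2   # d h / d p_i
    return h, hypergrad

x = np.array([1.0, 1.0])
print(h_and_hypergradient(x, 1.0 / np.diag(A)))      # inverse-Hessian diagonal: h = 0, enough progress
print(h_and_hypergradient(x, np.array([0.1, 0.1])))  # too big in coordinate 1: h > 0, cut
```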

**Box case**: query point needs to be **too close to the origin**

\displaystyle f(x_t) - f(x_*) \lesssim\Big(1 - \frac{1}{d} \cdot \frac{1}{\kappa_*}\Big)^t

Volume decrease \(\implies\) query points close to the origin

Good convergence rate \(\implies\) query point close to the boundary

**Ellipsoid method **might be better.

We want to use the **Ellipsoid method** as our cutting-plane method

\(\Omega(d^3)\) **time per iteration**

We can exploit symmetry!

\(O(d)\) time per iteration

**Constant volume decrease** on each CUT

\displaystyle f(x_t) - f(x_*) \lesssim\Big(1 - \frac{1}{\sqrt{2d}} \cdot \frac{1}{\kappa_*}\Big)^t

**Query point** \(1/\sqrt{2d}\) away from boundary

\kappa \approx 10^{13} \quad \text{vs.} \quad \kappa_* \approx 10^{2}

**Theoretically principled** adaptive optimization method for smooth strongly convex optimization

A **theoretically informed** use of "hypergradients"

ML Optimization meets **Cutting Plane methods**

arxiv.org/abs/2306.02527

Training/fitting an ML model is often cast as an **(unconstrained) optimization problem**

Usually in ML, models tend to be BIG

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

**\(d\) is BIG**

Running time and space **\(O(d)\)** is usually **the most we can afford**

First-order (i.e., gradient based) methods fit the bill

(stochastic even more so)

Usually \(O(d)\) time and space per iteration

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

More conditions on \(f\) for rates of convergence:

**\(f\) is convex**

Not the case with Neural Networks

Still quite useful in theory and practice

**\(L\)-smooth**

\displaystyle f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\lVert y - x \rVert_2^2

**\(\mu\)-strongly convex**

\displaystyle f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\lVert y - x \rVert_2^2

\displaystyle x_{t+1} = x_t - \alpha \nabla f(x_t)

Which step-size \(\alpha\) should we pick?

\displaystyle \alpha = \frac{1}{L} \implies f(x_t) - f(x_*) \leq \left( 1 - \frac{\mu}{L} \right)^t (f(x_0) - f(x_*))

Condition number: \(\kappa = \frac{L}{\mu}\)

\(\kappa\) big \(\implies\) hard function

If we know \(L\), picking \(1/L\) always works

**and is worst-case optimal**
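As a baseline, a minimal sketch of gradient descent with the fixed step-size \(1/L\) (the toy quadratic is ours):

```python
import numpy as np

# f(x) = 0.5 x^T A x, so the Hessian is A: L = 1000, mu = 0.001.
A = np.diag([1000.0, 0.001])
grad_f = lambda x: A @ x
L = np.max(np.diag(A))

x = np.array([1.0, 1.0])
for _ in range(100):
    x = x - (1.0 / L) * grad_f(x)   # x_{t+1} = x_t - (1/L) grad f(x_t)
print(x)  # first coordinate is 0 after one step, second has barely moved: rate (1 - mu/L)^t
```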

What if we do not know \(L\)?

Locally flat \(\implies\) we can pick bigger step-sizes

\displaystyle x_{t+1} = x_t - \tfrac{1}{L} \nabla f(x_t)

If \(f\) is \(L\) smooth, we have

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{L} \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

**"Descent Lemma"**

**Idea:** Pick \(\eta\) big and see if the "descent condition" holds

(Locally \(1/\eta\)-smooth)
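A minimal sketch of this backtracking rule (the halving factor and the toy problem are assumptions of ours, not the exact pseudocode from the talk):

```python
import numpy as np

A = np.diag([1000.0, 0.001])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x

def gd_with_backtracking(x, eta=1.0, steps=50):
    for _ in range(steps):
        g = grad_f(x)
        # descent condition: the progress f would make if it were (1/eta)-smooth
        while f(x - eta * g) > f(x) - 0.5 * eta * (g @ g):
            eta = eta / 2.0          # halve the step-size if it is too big
        x = x - eta * g
    return x, eta

x, eta = gd_with_backtracking(np.array([1.0, 1.0]))
print(x, eta)  # eta settles just below 1/L = 0.001
```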

\displaystyle f(x) = \tfrac{1}{2} x^T A x

\displaystyle A =
\begin{pmatrix}
1000 & 0 \\
0 & 0.001
\end{pmatrix}

\displaystyle \kappa = 10^{6}

\displaystyle x_{t+1} = x_t -
\begin{pmatrix}
0.001 & 0 \\
0 & 1000
\end{pmatrix}
\nabla f(x_t)

**Converges in 1 step**
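A quick numerical check of this claim (using the \(\tfrac{1}{2} x^T A x\) convention above, so the ideal preconditioner is \(A^{-1}\); the snippet is ours):

```python
import numpy as np

A = np.diag([1000.0, 0.001])
grad_f = lambda x: A @ x        # gradient of f(x) = 0.5 x^T A x

P = np.diag([0.001, 1000.0])    # = A^{-1}, the ideal (diagonal) preconditioner here
x = np.array([3.0, -7.0])
x = x - P @ grad_f(x)           # one preconditioned step
print(x)                        # [0. 0.]: converges in 1 step
```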


\(O(d)\) space and time \(\implies\) \(P\) diagonal (or sparse)

Can we find a good \(P\) automatically?

**"Adapt to \(f\)"**

**Preconditioner \(P\)**

\displaystyle x_{t+1} = x_t - P_t \cdot \nabla f(x_t)

Preconditioner at round \(t\)

**AdaGrad from Online Learning**

\displaystyle P_t = \Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2} \quad \text{or} \quad P_t = \mathrm{Diag}\Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2}
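A minimal sketch of the diagonal variant (the global step-size `eta` and the small `eps` are assumptions matching common implementations, not the exact form used in the analysis):

```python
import numpy as np

def diagonal_adagrad(grad_f, x0, eta=1.0, eps=1e-8, steps=100):
    """Per-coordinate step-sizes: divide by the root of the running sum of squared gradients."""
    x = x0.copy()
    sum_sq = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        sum_sq += g**2
        x = x - eta * g / (np.sqrt(sum_sq) + eps)
    return x

A = np.diag([1000.0, 0.001])
print(diagonal_adagrad(lambda x: A @ x, np.array([1.0, 1.0])))  # toy quadratic f = 0.5 x^T A x
```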

Better guarantees if **functions are easy**

while preserving optimal worst-case guarantees in Online Learning

Attains **linear rate in classical convex opt** (proved later)

In OL, functions change every iteration **adversarially**

But... Online Learning is **too adversarial**, AdaGrad is **"conservative"**

"**Fixes**": Adam, RMSProp, and other workarounds

"AdaGrad inspired anincredible number of clones, most of them withsimilar, worse, or no regret guarantees.(...) Nowadays, [adaptive] seems to denoteany kind of coordinate-wise learning rates that does not guarantee anything in particular."

**Francesco Orabona** in "A Modern Introduction to Online Learning", Sec. 4.3

**Idea:** look at step-size/preconditioner choice as an optimization problem

Gradient descent on the hyperparameters

How to pick the step-size of this? Well...

Little/no theory

Unpredictable

... and popular?!
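A sketch of the basic heuristic on a single step-size (this is the standard hypergradient-descent update from the literature, e.g. Baydin et al.; the constants and the clipping at zero are our choices):

```python
import numpy as np

A = np.diag([10.0, 1.0])
grad_f = lambda x: A @ x   # toy quadratic f = 0.5 x^T A x

def hypergradient_descent(x, alpha=0.01, beta=1e-4, steps=100):
    """Gradient descent on x, plus gradient descent on the step-size alpha:
    x_t = x_{t-1} - alpha * grad f(x_{t-1})  =>  d f(x_t)/d alpha = -<grad f(x_t), grad f(x_{t-1})>."""
    g_prev = grad_f(x)
    x = x - alpha * g_prev
    for _ in range(steps):
        g = grad_f(x)
        alpha = max(alpha + beta * (g @ g_prev), 0.0)  # "hypergradient" step on alpha (clipped at 0)
        x = x - alpha * g
        g_prev = g
    return x, alpha

# Behaviour is quite sensitive to beta: it can track a good step-size,
# but there are no guarantees -- which is the point of this slide.
print(hypergradient_descent(np.array([1.0, 1.0])))
```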

P_t = \nabla^2 f(x_t)^{-1}

Newton's method

is usually a great preconditioner

**Superlinear** convergence

...when \(\lVert x_t - x_*\rVert\) small

**Newton **may diverge otherwise

Using a step-size with Newton and quasi-Newton methods ensures convergence away from \(x_*\)

**Worse than GD**

\displaystyle f(x_t) - f(x_*) \leq \left( 1 - \frac{1}{\kappa^2} \right)^t (f(x_0) - f(x_*))


\(\nabla^2 f(x)\) is usually expensive to compute

P_t \approx \nabla^2 f(x_t)^{-1}

should also help

Quasi-Newton Methods, e.g. BFGS
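A sketch of "Newton with a step-size" (backtracking on top of the Newton direction; the test function and constants are illustrative, not from the talk):

```python
import numpy as np

def damped_newton(f, grad_f, hess_f, x, steps=20):
    """Full Newton steps give superlinear convergence near x_*;
    the backtracking step-size keeps the method from diverging far from x_*."""
    for _ in range(steps):
        g = grad_f(x)
        d = np.linalg.solve(hess_f(x), g)            # Newton direction, i.e. P_t = Hessian^{-1}
        t = 1.0
        while f(x - t * d) > f(x) - 0.5 * t * (g @ d):
            t = t / 2.0                              # backtrack until sufficient decrease
        x = x - t * d
    return x

# Strongly convex test function (ours): f(x) = sum_i cosh(x_i), minimized at 0.
f = lambda x: np.sum(np.cosh(x))
grad_f = lambda x: np.sinh(x)
hess_f = lambda x: np.diag(np.cosh(x))
print(damped_newton(f, grad_f, hess_f, np.array([5.0, -3.0])))  # -> approximately [0, 0]
```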

**(Quasi-)Newton**: needs Hessian, can be slower than GD

**Hypergradient methods**: purely heuristic, unstable

**Online Learning Algorithms**: Good but pessimistic theory

at least for smooth optimization it seems pessimistic...

|  | **Online Learning** | **Smooth Optimization** |
| --- | --- | --- |
| **1 step-size** | Scalar AdaGrad, Coin-Betting | Backtracking Line-search |
| **\(d\) step-sizes** (diagonal preconditioner) | Diagonal AdaGrad, Coordinate-wise Coin Betting (non-smooth opt?) | **Multidimensional Backtracking** |

**What does it mean for a method to be adaptive?**