### Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking

Victor Sanches Portella

joint with Frederik Kunstner, Nick Harvey, and Mark Schmidt

Theory Student Seminar @ University of Toronto, September 2023

cs.ubc.ca/~victorsp

## Gradient Descent and Line Search

### Why first-order optimization?

Training/fitting an ML model is often cast as an (unconstrained) optimization problem

In ML, models tend to be BIG

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

$$d$$ is BIG

Running time and space $$O(d)$$ is usually the most we can afford

First-order (i.e., gradient based) methods fit the bill

(stochastic even more so)

Usually $$O(d)$$ time and space per iteration

### Convex Optimization Setting

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

$$f$$ is convex

Not the case with Neural Networks

Still quite useful in theory and practice

More conditions on $$f$$ for rates of convergence:

$$L$$-smooth

\displaystyle f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\lVert y - x \rVert_2^2

$$\mu$$-strongly convex

\displaystyle f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\lVert y - x \rVert_2^2

\displaystyle x_{t+1} = x_t - \alpha \nabla f(x_t)

Which step-size $$\alpha$$ should we pick?

With step-size $$\alpha = \frac{1}{L}$$:

\displaystyle f(x_t) - f(x_*) \leq \left( 1 - \frac{\mu}{L} \right)^t (f(x_0) - f(x_*))

Condition number $$\kappa = \frac{L}{\mu}$$: big $$\kappa$$ $$\implies$$ hard function
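For concreteness, a minimal sketch (a toy quadratic assumed here, not from the talk) of gradient descent with the step-size $$1/L$$ on an ill-conditioned problem:

```python
import numpy as np

A = np.diag([1000.0, 0.001])      # Hessian of f(x) = 0.5 * x^T A x; kappa = 10**6
L, mu = 1000.0, 0.001             # smoothness and strong-convexity constants

x = np.array([1.0, 1.0])
for t in range(1000):
    grad = A @ x                  # gradient of the quadratic
    x = x - (1.0 / L) * grad      # worst-case optimal step-size alpha = 1/L
# per-step progress factor is only (1 - mu/L) = (1 - 1/kappa): slow when kappa is big
```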

### What Step-Size to Pick?

If we know $$L$$, picking $$1/L$$ always works

and is worst-case optimal

What if we do not know $$L$$?

Locally flat $$\implies$$ we can pick bigger step-sizes

\displaystyle x_{t+1} = x_t - \tfrac{1}{L} \nabla f(x_t)

If $$f$$ is $$L$$-smooth, we have

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{L} \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

"Descent Lemma"

Idea: Pick $$\eta$$ big and see if the "descent condition" holds

(Locally $$1/\eta$$-smooth)

### Backtracking Line-Search

Start with $$\alpha_{\max} > 2/L$$

While $$t \leq T$$:

$$\alpha \gets \alpha_{\max}/2$$

If the Armijo condition $$f(x_t - \alpha \nabla f(x_t)) \leq f(x_t) - \alpha \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2$$ holds, take the step and set $$t \gets t+1$$

Else, halve the candidate space: $$\alpha_{\max} \gets \alpha_{\max}/2$$

Guarantee: the accepted step-size is at least $$\tfrac{1}{2} \cdot \tfrac{1}{L}$$
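A minimal Python sketch of this loop (hypothetical helper names; `f`, `grad`, `x0`, and `alpha_max` are assumed inputs):

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha_max=1e3, T=100):
    """Gradient descent with backtracking line-search (Armijo constant 1/2).

    Sketch only: f and grad are callables, alpha_max bounds the candidate step-sizes.
    """
    x, t = x0.astype(float), 0
    while t <= T:
        g = grad(x)
        alpha = alpha_max / 2
        x_trial = x - alpha * g
        # Armijo / sufficient-progress condition from the slide
        if f(x_trial) <= f(x) - 0.5 * alpha * np.dot(g, g):
            x, t = x_trial, t + 1       # accept the step
        else:
            alpha_max /= 2              # halve the candidate space and retry
    return x
```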

### Beyond Line-Search?

\displaystyle f(x) = \tfrac{1}{2} x^T A x
\displaystyle A = \begin{pmatrix} 1000 & 0 \\ 0 & 0.001 \end{pmatrix}
\displaystyle \kappa = 10^{6}
\displaystyle x_{t+1} = x_t - \begin{pmatrix} 0.001 & 0 \\ 0 & 1000 \end{pmatrix} \nabla f(x_t)

Converges in 1 step: the matrix $$P = A^{-1}$$ is the ideal preconditioner here

$$O(d)$$ space and time $$\implies$$ $$P$$ diagonal (or sparse)

Can we find a good $$P$$ automatically? ("Adapt to $$f$$")

Preconditioned gradient descent, with preconditioner $$P_t$$ at round $$t$$:

\displaystyle x_{t+1} = x_t - P_t \cdot \nabla f(x_t)

e.g. AdaGrad:

\displaystyle P_t = \Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2} \quad \text{or} \quad \mathrm{Diag}\Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2}
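A minimal sketch of the diagonal variant (an assumed form of diagonal AdaGrad with a global step-size `eta`, not the slides' exact notation):

```python
import numpy as np

def adagrad_diag(grad, x0, eta=1.0, eps=1e-8, T=100):
    """Diagonal AdaGrad sketch: per-coordinate step-sizes eta / sqrt(G_i)."""
    x = x0.astype(float)
    G = np.zeros_like(x)                   # running sum of squared gradients
    for _ in range(T):
        g = grad(x)
        G += g * g                         # accumulate per coordinate
        x -= eta * g / (np.sqrt(G) + eps)  # diagonal preconditioner step
    return x
```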

Better guarantees if functions are easy, while preserving optimal worst-case guarantees in Online Learning

Attains linear rate in classical convex opt (proved later)

In OL, functions change every iteration adversarially

"Fixes": Adam, RMSProp, and other workarounds

"AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees.(...) Nowadays, [adaptive] seems to denote any kind of coordinate-wise learning rates that does not guarantee anything in particular."

Francesco Orabona in "A Modern Introduction to Online Learning", Sec. 4.3

Idea: look at step-size/preconditioner choice as an optimization problem

How to pick the step-size for this optimization problem? Well...

Little/no theory

Unpredictable

... and popular?!

### Second-order Methods

$$P_t = \nabla^2 f(x_t)^{-1}$$ (Newton's method) is usually a great preconditioner

Superlinear convergence... when $$\lVert x_t - x_*\rVert$$ is small

Newton may diverge otherwise

Using a step-size with Newton and quasi-Newton methods ensures convergence away from $$x_*$$, but the worst-case rate is worse than GD:

\displaystyle f(x_t) - f(x_*) \leq \left( 1 - \frac{1}{\kappa^2} \right)^t (f(x_0) - f(x_*))

$$\nabla^2 f(x)$$ is usually expensive to compute

Using $$P_t \approx \nabla^2 f(x_t)^{-1}$$ should also help: Quasi-Newton methods, e.g. BFGS
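A minimal sketch of this combination (a generic damped-Newton loop assumed here, not a specific method from the talk): the Newton direction plus an Armijo backtracking step-size:

```python
import numpy as np

def damped_newton(f, grad, hess, x, iters=20):
    """Newton direction with a backtracking step-size (sketch)."""
    for _ in range(iters):
        g = grad(x)
        d = np.linalg.solve(hess(x), g)   # Newton direction H(x)^{-1} g
        alpha = 1.0
        for _ in range(50):               # backtrack until sufficient decrease
            if f(x - alpha * d) <= f(x) - 0.5 * alpha * np.dot(g, d):
                break
            alpha /= 2
        x = x - alpha * d
    return x
```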

### State of Affairs

(Quasi-)Newton: needs Hessian, can be slower than GD

Online Learning Algorithms: Good but pessimistic theory

at least for smooth optimization it seems pessimistic...

|  | Online Learning (non-smooth opt?) | Smooth Optimization |
| --- | --- | --- |
| 1 step-size | Coin Betting | Backtracking Line-search |
| $$d$$ step-sizes (diagonal preconditioner) | Coordinate-wise Coin-Betting | Multidimensional Backtracking |
What does it mean for a method to be adaptive?

## Preconditioner Search

### Optimal (Diagonal) Preconditioner

$$L$$-smooth

\displaystyle f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\lVert y - x \rVert_2^2

$$\mu$$-strongly convex

\displaystyle f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\lVert y - x \rVert_2^2

\displaystyle \mu I \preceq \nabla^2 f(x) \preceq L I
\displaystyle \kappa = \tfrac{L}{\mu}

With $$P = \tfrac{1}{L} I$$ this reads

\displaystyle \frac{1}{\kappa} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

Optimal step-size: biggest that guarantees progress

Optimal preconditioner: biggest (??) that guarantees progress

The optimal diagonal preconditioner $$P_*$$ minimizes $$\kappa_*$$ over diagonal matrices such that

\displaystyle \frac{1}{\kappa_*} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1} \quad \text{for all } x
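For intuition, a small numeric check (assumed toy example): for a quadratic, the best $$\kappa_*$$ achievable with a given diagonal $$P$$ (up to rescaling $$P$$) is the condition number of $$P^{1/2} \nabla^2 f \, P^{1/2}$$:

```python
import numpy as np

A = np.diag([1000.0, 0.001])              # Hessian of f(x) = 0.5 x^T A x
P = np.diag([1.0 / 1000.0, 1.0 / 0.001])  # candidate diagonal preconditioner

# (1/kappa) P^{-1} <= A <= P^{-1}  iff  eig(P^{1/2} A P^{1/2}) lies in [1/kappa, 1];
# after rescaling P, the best achievable kappa is the condition number below.
M = np.sqrt(P) @ A @ np.sqrt(P)
eigs = np.linalg.eigvalsh(M)
print(eigs.max() / eigs.min())            # 1.0: this P preconditions f perfectly
```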

### From Line-search to Preconditioner Search

Line-search: step-size is at least $$1/2$$ of the optimum $$1/L$$, with number of backtracks $$\leq \log(\alpha_{0} L)$$, and

\displaystyle f(x_t) - f(x_*) \leq\Big(1 - \frac{1}{2 \cdot \kappa}\Big)^t (f(x_0) - f(x_*))

Multidimensional Backtracking: effective condition number is at most $$\sqrt{2d}$$ times the optimal $$\kappa_*$$, with number of backtracks $$\lesssim d \cdot \log(\alpha_0 \cdot L)$$, and

\displaystyle f(x_t) - f(x_*) \leq\Big(1 - \frac{1}{\sqrt{2d} \cdot \kappa_*}\Big)^t (f(x_0) - f(x_*))

Worth it if $$\sqrt{2d} \, \kappa_* \ll 2 \kappa$$

## Multidimensional Backtracking

### Why Naive Search does not Work

Line-search: test if step-size $$\alpha_{\max}/2$$ makes enough progress:

\displaystyle f\big(x_t - \tfrac{\alpha_{\max}}{2} \nabla f(x_t)\big) \leq f(x_t) - \tfrac{\alpha_{\max}}{2} \cdot \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

Armijo condition

If this fails, cut out everything bigger than $$\alpha_{\max}/2$$

Preconditioner search:

[Figure: number line of candidate step-sizes from $$0$$ to $$\alpha_{0}$$, halved to $$\tfrac{\alpha_{0}}{2}$$, $$\tfrac{\alpha_{0}}{4}$$, ..., bracketing $$\tfrac{1}{L}$$]

Test if preconditioner $$P$$ makes enough progress:

Candidate preconditioners $$\mathcal{S}$$: diagonals in a box

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_P^2, \qquad \text{where } \lVert \nabla f(x_t) \rVert_P^2 = \langle \nabla f(x_t), P \nabla f(x_t) \rangle

If this fails, cut out everything bigger than $$P$$


### Convexity to the Rescue

$$P$$ does not yield sufficient progress

Which preconditioners can be thrown out?

All $$Q$$ such that $$P \preceq Q$$ can be thrown out, but this cut is too weak: in the worst case it removes only a tiny corner of the candidate set

\displaystyle h(P) \coloneqq f(x - P \nabla f(x)) - f(x) + \tfrac{1}{2} \lVert \nabla f(x) \rVert_P^2

$$P$$ does not yield sufficient progress $$\iff$$ $$h(P) > 0$$

Convexity of $$h$$ $$\implies$$

\displaystyle h(Q) \geq h(P) + \langle \nabla h(P), Q - P \rangle

So

\displaystyle h(P) + \langle \nabla h(P), Q - P \rangle > 0 \implies h(Q) > 0 \implies Q \text{ is invalid}

A separating hyperplane! ($$P$$ lies in this half-space)
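A minimal sketch of this cut (assumed helper names, restricted to diagonal preconditioners $$P = \mathrm{Diag}(p)$$; not the authors' code):

```python
import numpy as np

def progress_gap_and_cut(f, grad, x, p):
    """h(p) = f(x - diag(p) g) - f(x) + 0.5 <g, diag(p) g>, and its gradient.

    h is convex in p, so if h(p) > 0 every q with
    h(p) + <grad_h, q - p> > 0 can be cut from the candidate set.
    """
    g = grad(x)
    x_trial = x - p * g                        # diag(p) g is an elementwise product
    h = f(x_trial) - f(x) + 0.5 * np.dot(g, p * g)
    grad_h = -g * grad(x_trial) + 0.5 * g * g  # chain rule, coordinate-wise
    return h, grad_h

# toy usage on the ill-conditioned quadratic from before (hypothetical example)
A = np.diag([1000.0, 0.001])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
h, grad_h = progress_gap_and_cut(f, grad, np.array([1.0, 1.0]), np.array([0.1, 0.1]))
# h > 0 here, so every q in the half-space {q : h + <grad_h, q - p> > 0} is invalid
```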

### Smallest Axis-Aligned Ellipsoid

Contraction of $$1/\sqrt{2d}$$ from the boundary $$\implies$$ constant volume contraction

### Conclusions

Theoretically principled adaptive optimization method for strongly convex smooth optimization

ML Optimization meets Cutting Plane methods

Stochastic case?

Heuristics for non-convex case?

Other cutting-plane methods?

### Questions?

## Backup Slides
