## Optimal Per-Coordinate Step-Sizes with Multidimensional Backtracking

Victor Sanches Portella

April 2024

cs.ubc.ca/~victorsp

Joint work with Frederik Kunstner, Nick Harvey, and Mark Schmidt

SNAIL 🐌, IME - USP

## Gradient Descent and Line Search

### Why first-order optimization?

Training/fitting an ML model is often cast as an (unconstrained) optimization problem

Usually in ML, models tend to be BIG

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

$$d$$ is BIG

First-order (i.e., gradient based) methods fit the bill

(stochastic even more so)

$$O(d)$$ time and space per iteration is preferable

### Convex Optimization Setting

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

$$f$$ is convex

Not the case with Neural Networks

Still quite useful in theory and practice

More conditions on $$f$$ for rates of convergence: $$f$$ is $$L$$-smooth and $$\mu$$-strongly convex,

\displaystyle \mu \cdot I \preceq \nabla^2 f(x) \preceq L \cdot I

"Easy to optimize"

\displaystyle x_{t+1} = x_t - \alpha \nabla f(x_t)

Which step-size $$\alpha$$ should we pick?

\displaystyle \alpha = \frac{1}{L} \implies f(x_t) - f(x_*) \leq \left( 1 - \frac{\mu}{L} \right)^t (f(x_0) - f(x_*))

Condition number:

\displaystyle \kappa = \frac{L}{\mu}

$$\kappa$$ big $$\implies$$ hard function
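The rate above can be checked numerically. The following is a minimal sketch, assuming an illustrative diagonal quadratic (the matrix entries, starting point, and helper names are not from the talk); it runs gradient descent with step-size $$1/L$$ and compares the optimality gap with the bound $$(1 - \mu/L)^t (f(x_0) - f(x_*))$$.

```python
def gradient_descent(grad, x0, step, num_steps):
    """Run x_{t+1} = x_t - step * grad(x_t) and return all iterates."""
    xs = [list(x0)]
    for _ in range(num_steps):
        g = grad(xs[-1])
        xs.append([xi - step * gi for xi, gi in zip(xs[-1], g)])
    return xs


# Illustrative problem (an assumption): f(x) = 0.5 * sum_i a_i * x_i^2.
A_DIAG = [10.0, 1.0]                  # Hessian eigenvalues, so L = 10 and mu = 1
L, MU = max(A_DIAG), min(A_DIAG)


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


iterates = gradient_descent(grad, x0=[1.0, 1.0], step=1.0 / L, num_steps=50)
gaps = [f(x) for x in iterates]       # f(x_*) = 0 is attained at x_* = 0
bounds = [(1 - MU / L) ** t * gaps[0] for t in range(len(gaps))]

for t in (0, 10, 50):
    print(t, gaps[t], bounds[t])
```

On this example the gap stays below the bound at every iteration, and the slow coordinate (the one with curvature $$\mu$$) is what limits progress.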

### What Step-Size to Pick?

If we know $$L$$, picking $$1/L$$ always works and is worst-case optimal

\displaystyle x_{t+1} = x_t - \tfrac{1}{L} \nabla f(x_t)

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{L} \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

"Descent Lemma"

What if we do not know $$L$$?

Idea: Pick a big step-size $$\alpha$$ and see if the "descent condition" holds

(If it does, $$f$$ is locally $$1/\alpha$$-smooth)

### Backtracking Line Search

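Line-search: test if step-size $$\alpha_{\max}/2$$ makes enough progress, i.e., whether $$x_{t+1} = x_t - \tfrac{\alpha_{\max}}{2} \nabla f(x_t)$$ satisfies

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{\alpha_{\max}}{2} \cdot \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2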

Armijo condition

If this fails, cut out everything bigger than $$\alpha_{\max}/2$$

(Figure: number line of candidate step-sizes, marking $$0$$, $$\tfrac{\alpha_{0}}{4}$$, $$\tfrac{\alpha_{0}}{2}$$, $$\alpha_{0}$$, and the unknown $$\tfrac{1}{L}$$.)

### Backtracking Line-Search

Start with $$\alpha_{\max} > 2/L$$

While $$t \leq T$$:

- $$\alpha \gets \alpha_{\max}/2$$
- If $$f(x_t - \alpha \nabla f(x_t)) \leq f(x_t) - \alpha \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2$$ (makes enough progress?): $$x_{t+1} \gets x_t - \alpha \nabla f(x_t)$$ and $$t \gets t+1$$
- Else: $$\alpha_{\max} \gets \alpha_{\max}/2$$ (halve candidate space)

Guarantee: step-size will be at least $$\tfrac{1}{2} \cdot \tfrac{1}{L}$$
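The loop above can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the talk: the function name `backtracking_gd`, the test quadratic, and the initial value `alpha_0 = 8.0` are assumptions chosen for the example.

```python
import math


def backtracking_gd(f, grad, x0, alpha_max, num_steps):
    """Gradient descent with the halving line-search: try alpha_max / 2,
    accept it if the Armijo condition holds, otherwise halve alpha_max."""
    x = list(x0)
    accepted, backtracks, t = [], 0, 0
    while t < num_steps:
        g = grad(x)
        sq_norm = sum(gi * gi for gi in g)
        if sq_norm == 0.0:                       # already at a stationary point
            break
        alpha = alpha_max / 2
        candidate = [xi - alpha * gi for xi, gi in zip(x, g)]
        if f(candidate) <= f(x) - alpha * 0.5 * sq_norm:
            x = candidate                        # enough progress: take the step
            accepted.append(alpha)
            t += 1
        else:
            alpha_max /= 2                       # halve the candidate space
            backtracks += 1
    return x, accepted, backtracks


# Illustrative problem (an assumption): f(x) = 0.5 * (10 * x_1^2 + x_2^2), so L = 10.
A_DIAG = [10.0, 1.0]
L = max(A_DIAG)


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


alpha_0 = 8.0
x_final, accepted, backtracks = backtracking_gd(f, grad, [1.0, 1.0], alpha_0, 50)
print(min(accepted), backtracks, f(x_final))
```

Since `alpha_max` is never increased, every backtrack is paid for once, which is where a bound of the form $$\log(\alpha_0 \cdot L)$$ on the number of backtracks comes from.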

### Beyond Line-Search?

\displaystyle f(x) = \tfrac{1}{2} x^T A x
\displaystyle A = \begin{pmatrix} 1000 & 0 \\ 0 & 0.001 \end{pmatrix}
\displaystyle \kappa = 10^{6}
\displaystyle x_{t+1} = x_t - P \nabla f(x_t), \qquad P = \begin{pmatrix} 0.001 & 0 \\ 0 & 1000 \end{pmatrix}

Converges in 1 step

Preconditioner $$P$$

Can we find a good $$P$$ automatically?

"Adapt to $$f$$"
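A quick numerical check of the one-step claim, as a sketch under one assumption: the quadratic is taken with the $$\tfrac{1}{2}$$ convention, $$f(x) = \tfrac{1}{2} x^T A x$$, so that $$\nabla f(x) = A x$$ and $$P = A^{-1}$$. The starting point and helper names are illustrative.

```python
A_DIAG = [1000.0, 0.001]   # diagonal of A
P_DIAG = [0.001, 1000.0]   # diagonal of the preconditioner P = A^{-1}


def grad(x):
    """Gradient of f(x) = 0.5 * x^T A x for the diagonal matrix A."""
    return [a * xi for a, xi in zip(A_DIAG, x)]


def preconditioned_step(x, p_diag):
    """One step x - P * grad(x) with a diagonal preconditioner P = Diag(p_diag)."""
    return [xi - p * gi for xi, p, gi in zip(x, p_diag, grad(x))]


x0 = [3.0, -2.0]
x_precond = preconditioned_step(x0, P_DIAG)                 # per-coordinate step-sizes
x_gd = preconditioned_step(x0, [1.0 / max(A_DIAG)] * 2)     # plain GD with step 1/L

print(x_precond)   # numerically at the minimizer x_* = 0
print(x_gd)        # the second coordinate has barely moved
```

With a single step-size $$1/L$$, the flat coordinate shrinks by a factor of about $$1 - 10^{-6}$$ per step, which is the $$\kappa = 10^{6}$$ slowdown.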

### Second-order Methods

Newton's method:

\displaystyle P_t = \nabla^2 f(x_t)^{-1}

is usually a great preconditioner

Superlinear convergence ...when $$\lVert x_t - x_*\rVert$$ is small

Newton may diverge otherwise

Using a step-size with Newton and quasi-Newton (QN) methods ensures convergence away from $$x_*$$, but the guaranteed rate is worse than GD:

\displaystyle f(x_t) - f(x_*) \leq \left( 1 - \frac{1}{\kappa^2} \right)^t (f(x_0) - f(x_*))

$$\nabla^2 f(x)$$ is usually expensive to compute

Quasi-Newton Methods, e.g. BFGS:

\displaystyle P_t \approx \nabla^2 f(x_t)^{-1}

should also help

### Non-Smooth (why?) Optimization

Toy Problem: Minimizing the absolute value function

\displaystyle x_{t+1} = x_t - P_t \cdot \nabla f(x_t)

Preconditioner at round $$t$$

\displaystyle P_t = \mathrm{Diag}\Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2}

Convergence guarantees

Attains linear rate in classical convex opt (proved later)

Also... "approximates the Hessian" is not quite true
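To make the preconditioner formula concrete, here is a minimal sketch of the update on the toy problem $$f(x) = \lvert x \rvert$$ in one dimension, where $$P_t$$ reduces to $$1/\sqrt{\sum_{i \leq t} g_i^2}$$. The starting point `0.9`, the choice of subgradient at zero, and the function name are assumptions for illustration.

```python
import math


def adagrad_abs(x0, num_steps):
    """Scalar version of the update x_{t+1} = x_t - P_t * g_t on f(x) = |x|,
    with P_t = (sum of squared subgradients so far)^(-1/2)."""
    x, sum_sq = x0, 0.0
    xs = [x]
    for _ in range(num_steps):
        g = (x > 0) - (x < 0)        # a subgradient of |x|; we pick 0 at x = 0
        if g == 0:
            break                    # reached the minimizer exactly
        sum_sq += g * g
        x -= g / math.sqrt(sum_sq)   # effective step-size is 1 / sqrt(t)
        xs.append(x)
    return xs


xs = adagrad_abs(x0=0.9, num_steps=100)
print(abs(xs[-1]), 1 / math.sqrt(len(xs) - 1))
```

Because every subgradient has magnitude 1, the effective step-size decays like $$1/\sqrt{t}$$ without any tuning, and the iterates oscillate around $$0$$ with shrinking amplitude.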

"Fixes": Adam, RMSProp, and other workarounds

RMSProp:

\displaystyle G_t = (1 - \rho) \cdot \nabla f(x_t) \nabla f(x_t)^T + \rho \cdot G_{t-1}
\displaystyle P_t = G_t^{-1/2}

Adam: uses "momentum" (weighted sum of gradients), with a similar preconditioner to the above

"AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees.(...) Nowadays, [adaptive] seems to denote any kind of coordinate-wise learning rates that does not guarantee anything in particular."

Francesco Orabona in "A Modern Introduction to Online Learning", Sec. 4.3

Idea: look at step-size/preconditioner choice as an optimization problem

How to pick the step-size of this? Well...

Little/no theory

Unpredictable

... and popular?!

### State of Affairs

What does it mean for a method to be adaptive?

(Quasi-)Newton Methods: super-linear convergence close to opt; may need 2nd-order information.

Hyperparameter tuning as an opt problem: unstable and no theory/guarantees

Online Learning: "Fixes" (e.g., Adam) have few guarantees

### State of Affairs

In smooth and strongly convex optimization, the only (global) guarantee is

\displaystyle f(x_t) - f(x_*) \lesssim \left( 1 - O\Big(\frac{1}{\kappa}\Big) \right)^t

Should be better if there is a good Preconditioner $$P$$

Online Learning

Smooth Optimization

1 step-size

$$d$$ step-sizes

(diagonal preconditioner)

Backtracking Line-search

Multidimensional Backtracking

(and others)

(and others)

(non-smooth optimization)

## Preconditioner Search

### Optimal (Diagonal) Preconditioner

$$f$$ is $$L$$-smooth and $$\mu$$-strongly convex:

\displaystyle \mu \cdot I \preceq \nabla^2 f(x) \preceq L \cdot I

Equivalently, with $$P = \tfrac{1}{L} I$$ and $$\kappa = \tfrac{L}{\mu}$$:

\displaystyle \frac{1}{\kappa} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

Optimal step-size: biggest that guarantees progress

Optimal preconditioner: biggest (??) that guarantees progress

Optimal Diagonal Preconditioner: over diagonal matrices $$P$$, $$P_*$$ minimizes $$\kappa_*$$ such that

\displaystyle \frac{1}{\kappa_*} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

$$\kappa_* \leq \kappa$$, hopefully $$\kappa_* \ll \kappa$$
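For a quadratic with constant Hessian $$H$$, the sandwich condition above holds for some rescaling of a diagonal $$P$$ exactly when $$\kappa$$ is at least the condition number of $$P^{1/2} H P^{1/2}$$. The sketch below, on an illustrative $$2 \times 2$$ Hessian that is not from the talk, compares the condition number of $$H$$ with the one obtained from the diagonal choice $$P = \mathrm{Diag}(1/H_{11}, 1/H_{22})$$. This particular $$P$$ is only one candidate; the code does not claim it is the optimal $$P_*$$.

```python
import math


def cond_2x2(a, b, c):
    """Condition number of the symmetric positive definite matrix [[a, b], [b, c]]."""
    mean, half_diff = (a + c) / 2, (a - c) / 2
    lam_max = mean + math.hypot(half_diff, b)
    lam_min = (a * c - b * b) / lam_max   # det / lam_max avoids cancellation
    return lam_max / lam_min


def precondition_2x2(a, b, c, p1, p2):
    """Entries of P^{1/2} H P^{1/2} for H = [[a, b], [b, c]] and P = Diag(p1, p2)."""
    return p1 * a, math.sqrt(p1 * p2) * b, p2 * c


# Illustrative Hessian (an assumption): badly scaled, mildly coupled coordinates.
H = (1000.0, 1.0, 0.01)

kappa = cond_2x2(*H)
kappa_diag = cond_2x2(*precondition_2x2(*H, 1.0 / H[0], 1.0 / H[2]))

print(kappa, kappa_diag)
```

Since $$\kappa_*$$ is the best value over all diagonal matrices, it is at most the condition number reached by this candidate, which is orders of magnitude below $$\kappa$$ on this example.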

### Optimal (Diagonal) Preconditioner

\displaystyle P_*, \kappa_*

attain

### From Line-search to Preconditioner Search

Line-search:

\displaystyle f(x_t) - f(x_*) \leq \Big(1 - \frac{1}{2} \cdot \frac{1}{\kappa}\Big)^t (f(x_0) - f(x_*))

Number of backtracks $$\leq$$

\displaystyle \log\Big(\alpha_{0} \cdot L \Big)

Multidimensional Backtracking (our algorithm):

\displaystyle f(x_t) - f(x_*) \leq \Big(1 - \frac{1}{\sqrt{2d}} \cdot \frac{1}{\kappa_*}\Big)^t (f(x_0) - f(x_*))

Number of backtracks $$\lesssim$$

\displaystyle d \cdot \log\Big( \alpha_0 \cdot L\Big)

Worth it if $$\sqrt{2d} \kappa_* \ll 2 \kappa$$
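The "worth it" comparison follows from turning both rates into iteration counts. A short derivation, using only the inequality $$1 - x \leq e^{-x}$$ and a target accuracy $$\varepsilon$$ (a symbol introduced here for illustration), is sketched below. It compares upper bounds only, so it indicates when the guarantee is better rather than predicting actual running times.

```latex
% A rate of the form (1 - 1/c)^t reaches relative accuracy \varepsilon once
\Big(1 - \frac{1}{c}\Big)^t \leq e^{-t/c} \leq \varepsilon
\quad \Longleftarrow \quad
t \geq c \log\frac{1}{\varepsilon}.

% Line-search has c = 2\kappa; multidimensional backtracking has c = \sqrt{2d}\,\kappa_*:
t_{\text{line-search}} \approx 2 \kappa \log\frac{1}{\varepsilon},
\qquad
t_{\text{multidim.}} \approx \sqrt{2d}\, \kappa_* \log\frac{1}{\varepsilon}.

% Hence the second bound is smaller exactly when
\sqrt{2d}\, \kappa_* < 2 \kappa .
```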

## Multidimensional Backtracking

### Why Naive Search does not Work

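Line-search: test if step-size $$\alpha_{\max}/2$$ makes enough progress, i.e., whether $$x_{t+1} = x_t - \tfrac{\alpha_{\max}}{2} \nabla f(x_t)$$ satisfies

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{\alpha_{\max}}{2} \cdot \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2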

Armijo condition

If this fails, cut out everything bigger than $$\alpha_{\max}/2$$

(Figure: number line of candidate step-sizes, marking $$0$$, $$\tfrac{\alpha_{0}}{4}$$, $$\tfrac{\alpha_{0}}{2}$$, $$\alpha_{0}$$, and the unknown $$\tfrac{1}{L}$$.)

Preconditioner search:

Candidate preconditioners $$\mathcal{S}$$: diagonals in a box

Test if preconditioner $$P$$ makes enough progress:

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_P^2, \qquad \lVert \nabla f(x_t) \rVert_P^2 = \langle \nabla f(x_t), P \nabla f(x_t) \rangle

If this fails, cut out everything bigger than $$P$$


### Convexity to the Rescue

$$P$$ does not yield sufficient progress

Which preconditioners can be thrown out?

Throwing out all $$Q$$ such that $$P \preceq Q$$ works, but it is too weak

\displaystyle h(P) \coloneqq f(x - P \nabla f(x)) - f(x) + \tfrac{1}{2} \lVert \nabla f(x) \rVert_P^2

$$P$$ does not yield sufficient progress $$\iff$$ $$h(P) > 0$$

Convexity $$\implies$$

\displaystyle h(Q) \geq h(P) + \langle \nabla h(P), Q - P \rangle

\displaystyle h(P) + \langle \nabla h(P), Q - P \rangle > 0

$$\implies$$ $$Q$$ is invalid

A separating hyperplane!
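The cut can be sketched for diagonal preconditioners $$P = \mathrm{Diag}(p)$$. The code below is an illustration under stated assumptions: it takes the update to be $$x - P \nabla f(x)$$, consistent with the progress test $$f(x_{t+1}) \leq f(x_t) - \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_P^2$$, and it uses a small quadratic and hand-picked points `p`, `q_invalid`, and `q_unknown` that are not from the talk.

```python
def h_and_grad(f, grad_f, x, p):
    """h(P) = f(x - P g) - f(x) + 0.5 * ||g||_P^2 for P = Diag(p), g = grad_f(x),
    together with the gradient of h with respect to the diagonal p."""
    g = grad_f(x)
    x_new = [xi - pi * gi for xi, pi, gi in zip(x, p, g)]
    g_new = grad_f(x_new)
    value = f(x_new) - f(x) + 0.5 * sum(pi * gi * gi for pi, gi in zip(p, g))
    # Chain rule: d/dp_i f(x - p * g) = -g_i * grad_f(x_new)_i
    gradient = [-gi * gni + 0.5 * gi * gi for gi, gni in zip(g, g_new)]
    return value, gradient


def is_cut(h_p, grad_h_p, p, q):
    """True if q lies in the half-space h(P) + <grad h(P), Q - P> > 0,
    which by convexity of h certifies that q also fails the progress test."""
    return h_p + sum(d * (qi - pi) for d, qi, pi in zip(grad_h_p, q, p)) > 0


# Illustrative problem (an assumption): f(x) = 0.5 * (10 * x_1^2 + x_2^2).
A_DIAG = [10.0, 1.0]


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(A_DIAG, x))


def grad_f(x):
    return [a * xi for a, xi in zip(A_DIAG, x)]


x = [1.0, 1.0]
p = [0.2, 0.5]                      # first step-size too large: fails the test
h_p, grad_h_p = h_and_grad(f, grad_f, x, p)

q_invalid = [0.15, 0.1]             # not larger than p, yet removed by the cut
q_unknown = [0.05, 0.5]             # survives the cut

print(h_p, grad_h_p)
print(is_cut(h_p, grad_h_p, p, q_invalid), is_cut(h_p, grad_h_p, p, q_unknown))
```

Here `q_invalid` is smaller than `p` in both coordinates, so the rule "throw out everything bigger than $$P$$" would keep it, while the hyperplane removes it.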

$$P$$ in this half-space

that maximizes

### Ellipsoid Method to the Rescue

We want to use the Ellipsoid method as our cutting plane method

$$\Omega(d^3)$$ time per iteration

We can exploit symmetry!

$$O(d)$$ time per iteration

Constant volume decrease on each CUT

Query point $$1/\sqrt{2d}$$ away from boundary

\displaystyle f(x_t) - f(x_*) \lesssim \Big(1 - \frac{1}{\sqrt{2d}} \cdot \frac{1}{\kappa_*}\Big)^t

### Experiments

$$\kappa \approx 10^{13}$$, $$\kappa_* \approx 10^{2}$$

### Conclusions

Theoretically principled adaptive optimization method for strongly convex smooth optimization

ML Optimization meets Cutting Plane methods

Stochastic case?

Heuristics for non-convex case?

Other cutting-plane methods?