Victor Sanches Portella
April 2024
cs.ubc.ca/~victorsp
joint work with Frederik Kunstner, Nick Harvey, and Mark Schmidt
IME - USP
SNAIL 🐌
Training/fitting an ML model is often cast as an (unconstrained) optimization problem
Usually in ML, models tend to be BIG
\(d\) is BIG
First-order (i.e., gradient-based) methods fit the bill
(stochastic even more so)
\(O(d)\) time and space per iteration is preferable
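As a minimal sketch of why first-order methods fit this budget (the function names and the toy quadratic below are illustrative, not from the talk):

```python
import numpy as np

def gradient_descent(grad, x0, step_size, num_iters):
    """Minimal gradient descent sketch: O(d) work and memory per iteration."""
    x = x0.copy()
    for _ in range(num_iters):
        x -= step_size * grad(x)   # one O(d) update per iteration
    return x

# Toy usage: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_final = gradient_descent(grad=lambda x: x, x0=np.ones(5), step_size=0.5, num_iters=20)
```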
\(f\) is convex
Not the case with Neural Networks
Still quite useful in theory and practice
More conditions on \(f\) for rates of convergence
\(L\)-smooth
\(\mu\)-strongly convex
"Easy to optimize"
Which step-size \(\alpha\) should we pick?
Condition number \(\kappa = L/\mu\)
Big \(\kappa\) \(\implies\) hard function
If we know \(L\), picking \(1/L\) always works
and is optimal
What if we do not know \(L\)?
"Descent Lemma"
Idea: Pick \(\eta\) big and see if the "descent condition" holds
(Locally \(1/\eta\)-smooth, instead of relying on the worst-case global \(L\))
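Concretely, the descent lemma behind this test (a standard statement, spelled out here for completeness). For \(L\)-smooth \(f\) and any \(\eta > 0\):
\[
f(x - \eta \nabla f(x)) \;\leq\; f(x) - \eta\Big(1 - \tfrac{\eta L}{2}\Big)\lVert \nabla f(x)\rVert^2 ,
\]
so whenever \(\eta \leq 1/L\) the step satisfies the "descent condition"
\[
f(x - \eta \nabla f(x)) \;\leq\; f(x) - \tfrac{\eta}{2}\lVert \nabla f(x)\rVert^2 .
\]
Checking this last inequality for a candidate \(\eta\) only certifies that \(f\) behaves as if it were \(1/\eta\)-smooth around the current point, which is all the analysis needs.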
Line-search: test if step-size \(\alpha_{\max}/2\) makes enough progress:
Armijo condition
If this fails, cut out everything bigger than \(\alpha_{\max}/2\)
Backtracking Line-Search
Start with \(\alpha_{\max}\) large (e.g., \(\alpha_{\max} > 2/L\))
While \(t \leq T\):
    \(\alpha \gets \alpha_{\max}/2\)
    If the step \(x_t - \alpha \nabla f(x_t)\) makes enough progress (Armijo condition): take it, \(t \gets t+1\)
    Else: halve the candidate space, \(\alpha_{\max} \gets \alpha_{\max}/2\)
Guarantee: the accepted step-size will be at least \(\tfrac{1}{2} \cdot \tfrac{1}{L}\)
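A minimal runnable sketch of the scheme above (Armijo constant \(1/2\); the function names and the toy quadratic are illustrative):

```python
import numpy as np

def backtracking_gd(f, grad, x0, alpha_max, num_steps):
    """Gradient descent with backtracking line-search (Armijo constant 1/2).

    alpha_max is only ever halved, so for L-smooth f the accepted step-size
    stays above 1/(2L), assuming alpha_max starts large enough.
    """
    x = x0.copy()
    for _ in range(num_steps):
        g = grad(x)
        while True:
            alpha = alpha_max / 2
            x_new = x - alpha * g
            # Armijo / sufficient-progress test
            if f(x_new) <= f(x) - (alpha / 2) * (g @ g):
                x = x_new
                break
            alpha_max /= 2      # cut out all step-sizes bigger than alpha_max/2
    return x

# Toy usage: quadratic f(x) = 0.5 * x @ A @ x with A = diag(1, 10).
A = np.diag([1.0, 10.0])
x_sol = backtracking_gd(lambda x: 0.5 * x @ A @ x, lambda x: A @ x,
                        x0=np.array([1.0, 1.0]), alpha_max=100.0, num_steps=50)
```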
With a good preconditioner \(P\), gradient descent \(x_{t+1} = x_t - P\,\nabla f(x_t)\) converges in 1 step (e.g., on a quadratic with \(P = (\nabla^2 f)^{-1}\))
Can we find a good \(P\) automatically?
"Adapt to \(f\)"
Preconditioner \(P\)
Newton's method: \(\nabla^2 f(x_t)^{-1}\) is usually a great preconditioner
Superlinear convergence
...when \(\lVert x_t - x_*\rVert\) small
Newton may diverge otherwise
Using a step-size with Newton and quasi-Newton methods ensures convergence away from \(x_*\)
Worse than GD
\(\nabla^2 f(x)\) is usually expensive to compute
Quasi-Newton Methods (e.g., BFGS) should also help
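For illustration only, a sketch of Newton's method safeguarded by a backtracking step-size (damped Newton), assuming the Hessian is available; this is not the talk's method, and the names are illustrative:

```python
import numpy as np

def damped_newton(f, grad, hess, x0, num_steps):
    """Newton's method with a backtracking step-size for global convergence."""
    x = x0.copy()
    for _ in range(num_steps):
        g = grad(x)
        direction = np.linalg.solve(hess(x), g)   # the expensive O(d^3) part
        alpha = 1.0
        # Backtrack until the Armijo condition holds along the Newton direction.
        while f(x - alpha * direction) > f(x) - 0.5 * alpha * (g @ direction):
            alpha /= 2
        x -= alpha * direction
    return x
```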
Toy Problem: Minimizing the absolute value function
AdaGrad (from Online Learning): preconditioner at round \(t\) is \(P_t = \mathrm{diag}\big(\sum_{s \leq t} g_s g_s^\top\big)^{-1/2}\), i.e., per-coordinate step \(\propto 1/\sqrt{\sum_{s \leq t} g_{s,i}^2}\)
Convergence guarantees
"adapt" to the function itself
Attains linear rate in classical convex opt (proved later)
But... Online Learning is too adversarial, AdaGrad is "conservative"
Also... "approximates the Hessian" is not quite true
"Fixes": Adam, RMSProp, and other workarounds
RMSProp
Adam
Uses "momentum" (weighted sum of gradients)
Similar preconditioner to the above
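A minimal sketch of these diagonal preconditioners (AdaGrad accumulates squared gradients; RMSProp swaps the sum for an exponential moving average; the constants and `eps` are illustrative):

```python
import numpy as np

def adagrad(grad, x0, lr, num_steps, eps=1e-8):
    """Diagonal AdaGrad: per-coordinate step lr / sqrt(sum of squared grads)."""
    x = x0.copy()
    accum = np.zeros_like(x)                  # running sum of squared gradients
    for _ in range(num_steps):
        g = grad(x)
        accum += g * g
        x -= lr * g / (np.sqrt(accum) + eps)  # diagonal preconditioner step
    return x

def rmsprop(grad, x0, lr, num_steps, beta=0.9, eps=1e-8):
    """RMSProp: same idea, but an exponential moving average of squared grads."""
    x = x0.copy()
    avg = np.zeros_like(x)
    for _ in range(num_steps):
        g = grad(x)
        avg = beta * avg + (1 - beta) * g * g
        x -= lr * g / (np.sqrt(avg) + eps)
    return x
```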
"AdaGrad inspired an incredible number of clones, most of them with similar, worse, or no regret guarantees.(...) Nowadays, [adaptive] seems to denote any kind of coordinate-wise learning rates that does not guarantee anything in particular."
Francesco Orabona in "A Modern Introduction to Online Learning", Sec. 4.3
Idea: look at step-size/preconditioner choice as an optimization problem
Gradient descent on the hyperparameters
How to pick the step-size of this? Well...
Little/no theory
Unpredictable
... and popular?!
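One common instantiation of this idea, sketched under the assumption of a Baydin-et-al.-style update on a scalar step-size; note that the meta step-size `beta` is itself a new hyperparameter, which is exactly the catch above:

```python
import numpy as np

def hypergradient_gd(grad, x0, alpha0, beta, num_steps):
    """Adapt the step-size alpha by gradient descent on f(x_t) w.r.t. alpha.

    Since x_t = x_{t-1} - alpha * grad(x_{t-1}), the derivative of f(x_t)
    with respect to alpha is -<grad(x_t), grad(x_{t-1})>.
    """
    x = x0.copy()
    alpha = alpha0
    g_prev = grad(x)
    for _ in range(num_steps):
        x = x - alpha * g_prev
        g = grad(x)
        alpha += beta * (g @ g_prev)   # hypergradient step on the step-size
        g_prev = g
    return x
```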
What does it mean for a method to be adaptive?
(Quasi-)Newton Methods
Super-linear convergence close to opt
May need 2nd-order information.
Hypergradient Methods
Hyperparameter tuning as an opt problem
Unstable and no theory/guarantees
Online Learning
Formally adapts to adversarial and changing inputs
Too conservative in this case (e.g., AdaGrad)
"Fixes" (e.g., Adam) have few guarantees
Goal: an adaptive method with a (global) guarantee in Smooth and Strongly Convex optimization
Should be better if there is a good Preconditioner \(P\)
Online Learning (non-smooth optimization):
    1 step-size: Scalar AdaGrad (and others)
    \(d\) step-sizes (diagonal preconditioner): Diagonal AdaGrad (and others)
Smooth Optimization:
    1 step-size: Backtracking Line-search
    \(d\) step-sizes (diagonal preconditioner): Multidimensional Backtracking
Optimal step-size: biggest that guarantees progress
Optimal preconditioner: biggest (??) that guarantees progress
\(f\) is \(L\)-smooth and \(\mu\)-strongly convex
Optimal Diagonal Preconditioner \(P_*\): over diagonal matrices, minimizes the condition number \(\kappa_*\) that the preconditioned problem can attain
\(\kappa_* \leq \kappa\), hopefully \(\kappa_* \ll \kappa\)
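One way to make this precise (a sketch consistent with the \(\kappa_*\) notation above; the exact normalization used in the paper may differ):
\[
P_* \in \operatorname*{argmin}_{P \succ 0,\ P \text{ diagonal}} \kappa_*(P),
\qquad
\kappa_*(P) := \min\Big\{ \kappa \;:\; \tfrac{1}{\kappa}\, P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1} \ \text{ for all } x \Big\}.
\]
Gradient descent preconditioned by such a \(P\) then makes progress at a rate governed by \(\kappa_*(P)\) rather than \(\kappa\).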
Line-search: # backtracks \(\leq\)
Multidimensional Backtracking (our algorithm): # backtracks \(\lesssim\)
Worth it if \(\sqrt{2d}\, \kappa_* \ll 2 \kappa\)
Line-search: test if step-size \(\alpha_{\max}/2\) makes enough progress:
Armijo condition
If this fails, cut out everything bigger than \(\alpha_{\max}/2\)
Preconditioner search:
Test if preconditioner \(P\) makes enough progress:
Candidate preconditioners \(\mathcal{S}\): diagonals in a box
If this fails, cut out everything bigger than \(P\)
\(P\) does not yield sufficient progress
Which preconditioners can be thrown out?
Cutting out all \(Q\) such that \(P \preceq Q\) works, but it is too weak a cut
\(P \) does not yield sufficient progress \(\iff\) \(h(P) > 0\)
Convexity of \(h\) \(\implies\) \(h(Q) \geq h(P) + \langle \nabla h(P), Q - P \rangle\)
So \(\langle \nabla h(P), Q - P \rangle \geq 0\) \(\implies\) \(h(Q) > 0\) \(\implies\) \(Q\) is invalid
A separating hyperplane!
Pick the next query point \(P\) in this half-space that maximizes ...
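In symbols (a sketch consistent with the sufficient-progress test above, writing \(g = \nabla f(x_t)\)):
\[
h(P) \;=\; f(x_t - P g) - f(x_t) + \tfrac{1}{2}\langle g, P g\rangle ,
\]
and \(h\) is convex in \(P\) (a convex function composed with an affine map of \(P\), plus a linear term), so
\[
h(Q) \;\geq\; h(P) + \langle \nabla h(P),\, Q - P\rangle .
\]
If \(h(P) > 0\), every \(Q\) with \(\langle \nabla h(P), Q - P\rangle \geq 0\) also has \(h(Q) > 0\) and can be discarded; the hyperplane \(\{Q : \langle \nabla h(P), Q - P\rangle = 0\}\) is the separating hyperplane, and \(\nabla h(P)\) is exactly a "hypergradient".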
We want to use the Ellipsoid method as our cutting plane method
\(\Omega(d^3)\) time per iteration
We can exploit symmetry!
\(O(d)\) time per iteration
Constant volume decrease on each CUT
Query point \(1/\sqrt{2d}\) away from boundary
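A heavily simplified sketch of the resulting preconditioner-search loop: it uses the sufficient-progress test and the hypergradient direction from above, but replaces the carefully scaled box/ellipsoid cuts (and the \(1/\sqrt{2d}\) query rule) with a naive box-halving heuristic, so it does not carry the guarantees of the actual algorithm; all names are illustrative:

```python
import numpy as np

def multidim_backtracking_sketch(f, grad, x0, b0, num_steps):
    """Simplified diagonal-preconditioner search (illustrative only).

    Maintains a box [0, b] of candidate diagonal preconditioners. Each step
    queries a conservative candidate inside the box; if it fails the
    sufficient-progress test, the hypergradient of h tells us which
    coordinates were too aggressive, and the box is shrunk there.
    """
    x = x0.copy()
    b = b0.copy()                       # upper bounds of the candidate box
    d = x.size
    for _ in range(num_steps):
        g = grad(x)
        p = b / (2 * np.sqrt(d))        # conservative candidate inside the box
        x_try = x - p * g
        # Sufficient-progress test: h(p) <= 0 means p is a valid preconditioner.
        h = f(x_try) - f(x) + 0.5 * np.sum(p * g * g)
        if h <= 0:
            x = x_try                   # accept the preconditioned step
        else:
            # Hypergradient of h w.r.t. the diagonal entries of P.
            u = -grad(x_try) * g + 0.5 * g * g
            b = np.where(u > 0, b / 2, b)   # naive cut: shrink aggressive coords
    return x
```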
Theoretically principled adaptive optimization method for strongly convex smooth optimization
A theoretically-informed use of "hypergradients"
ML Optimization meets Cutting Plane methods
Stochastic case?
Heuristics for non-convex case?
Other cutting-plane methods?