Searching for Optimal Per-Coordinate Step-sizes with Multidimensional Backtracking

Victor Sanches Portella

joint with Frederik Kunstner, Nick Harvey, and Mark Schmidt

September 2023

Theory Student Seminar @ University of Toronto

cs.ubc.ca/~victorsp

Gradient Descent and Line Search

Why first-order optimization?

Training/fitting an ML model is often cast as an (unconstrained) optimization problem

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

Usually in ML, models tend to be BIG

$d$ is BIG

Running time and space $O(d)$ is usually the most we can afford

First-order (i.e., gradient based) methods fit the bill

(stochastic even more so)

Usually $O(d)$ time and space per iteration

Convex Optimization Setting

\displaystyle \min_{x \in \mathbb{R}^{d}}~f(x)

$f$ is convex

Not the case with Neural Networks

Still quite useful in theory and practice

More conditions on $f$ for rates of convergence

$L$ -smooth

\displaystyle f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\lVert y - x \rVert_2^2

$\mu$ -strongly convex

\displaystyle f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\lVert y - x \rVert_2^2

Gradient Descent

\displaystyle x_{t+1} = x_t - \alpha \nabla f(x_t)

Which step-size $\alpha$ should we pick?

\displaystyle \alpha = \frac{1}{L} \implies f(x_t) - f(x_*) \leq \left( 1 - \frac{\mu}{L} \right)^t (f(x_0) - f(x_*))

Condition number

\displaystyle \kappa = \frac{L}{\mu}

$\kappa$ big $\implies$ hard function
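The rate above can be checked numerically. The following is a minimal sketch, assuming a diagonal quadratic $f(x) = \tfrac{1}{2}\sum_i a_i x_i^2$ (so $L = \max_i a_i$, $\mu = \min_i a_i$, and $f(x_*) = 0$); the function names and the test problem are illustrative choices, not taken from the slides.

```python
def f(x, a):
    """Diagonal quadratic f(x) = 1/2 * sum_i a_i * x_i^2, minimized at x = 0."""
    return 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))


def grad(x, a):
    """Gradient of the diagonal quadratic: (a_1 x_1, ..., a_d x_d)."""
    return [ai * xi for ai, xi in zip(a, x)]


def gradient_descent(x0, a, steps):
    """Run gradient descent with step-size 1/L and record f(x_t) at every step."""
    L = max(a)  # smoothness constant of the quadratic
    x = list(x0)
    values = [f(x, a)]
    for _ in range(steps):
        g = grad(x, a)
        x = [xi - gi / L for xi, gi in zip(x, g)]
        values.append(f(x, a))
    return values


a = [10.0, 1.0]  # assumed example: L = 10, mu = 1, kappa = 10
values = gradient_descent([1.0, 1.0], a, 50)
rate = 1.0 - min(a) / max(a)  # the factor (1 - mu / L)

# Check f(x_t) - f(x_*) <= (1 - mu/L)^t (f(x_0) - f(x_*)) with f(x_*) = 0.
bound_holds = all(values[t] <= rate ** t * values[0] + 1e-12
                  for t in range(len(values)))
```

On this example the bound holds at every iteration, and the slow coordinate (the one with curvature $\mu$) is what determines the rate.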

What Step-Size to Pick?

If we know $L$ , picking $1/L$ always works

and is worst-case optimal

What if we do not know $L$ ?

Locally flat $\implies$ we can pick bigger step-sizes

\displaystyle x_{t+1} = x_t - \tfrac{1}{L} \nabla f(x_t)

If $f$ is $L$-smooth, we have

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{L} \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

"Descent Lemma"

Idea: Pick a step-size $\eta$ big and see if the "descent condition" holds with $\eta$ in place of $1/L$

(Locally $1/\eta$ -smooth)

Backtracking Line-Search

Start with $\alpha_{\max} > 2/L$

While $t \leq T$:

$\alpha \gets \alpha_{\max}/2$

If $x_{t+1} = x_t - \alpha \nabla f(x_t)$ satisfies the Armijo condition

\displaystyle f(x_{t+1}) \leq f(x_t) - \alpha \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

then $t \gets t+1$

Else

\displaystyle \alpha_{\max} \gets \alpha_{\max}/2

Halve candidate space

Guarantee: step-size will be at least $\tfrac{1}{2} \cdot \tfrac{1}{L}$
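The loop above can be sketched in a few lines of Python. This is a minimal illustration under assumptions, not the authors' implementation: the function name `backtracking_gd`, the counter of backtracks, and the quadratic test problem are all hypothetical choices made for the example.

```python
def backtracking_gd(f, grad, x0, alpha_max, steps):
    """Gradient descent with backtracking line-search (halving alpha_max)."""
    x = list(x0)
    t = 0
    backtracks = 0
    while t < steps:
        alpha = alpha_max / 2
        g = grad(x)
        g_sq = sum(gi * gi for gi in g)
        x_new = [xi - alpha * gi for xi, gi in zip(x, g)]
        # Armijo condition: f(x_new) <= f(x) - alpha * 1/2 * ||grad f(x)||^2
        if f(x_new) <= f(x) - alpha * 0.5 * g_sq:
            x = x_new
            t += 1
        else:
            alpha_max /= 2  # halve the candidate space and retry from the same x
            backtracks += 1
    return x, alpha_max / 2, backtracks


# Assumed test problem: f(x) = 1/2 (4 x_1^2 + x_2^2), so L = 4.
a = [4.0, 1.0]
L = max(a)
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x_final, alpha_final, backtracks = backtracking_gd(f, grad, [1.0, 1.0], 8.0, 20)
```

Starting from $\alpha_{\max} = 8 > 2/L$, the accepted step-size never drops below $\tfrac{1}{2} \cdot \tfrac{1}{L}$, matching the guarantee, and each failed test costs one extra function evaluation.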

Beyond Line-Search?

\displaystyle f(x) = \tfrac{1}{2} x^T A x

\displaystyle A = \begin{pmatrix} 1000 & 0 \\ 0 & 0.001 \end{pmatrix}

\displaystyle \kappa = 10^{6}

Preconditioner $P$

\displaystyle x_{t+1} = x_t - \begin{pmatrix} 0.001 & 0 \\ 0 & 1000 \end{pmatrix} \nabla f(x_t)

Converges in 1 step

$O(d)$ space and time $\implies$ $P$ diagonal (or sparse)

Can we find a good $P$ automatically?

"Adapt to $f$ "
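A quick numerical check of the one-step claim, as a minimal sketch: it assumes the quadratic is scaled as $f(x) = \tfrac{1}{2} x^T A x$, so that $\nabla f(x) = A x$, and it stores the diagonal matrices as plain lists. The starting point and variable names are illustrative.

```python
a = [1000.0, 0.001]   # diagonal of A
p = [0.001, 1000.0]   # diagonal of the preconditioner P = A^{-1}


def grad(x):
    """Gradient of f(x) = 1/2 x^T A x for diagonal A."""
    return [ai * xi for ai, xi in zip(a, x)]


x0 = [3.0, -2.0]  # arbitrary starting point

# One preconditioned step: x_1 = x_0 - P grad f(x_0).
x_precond = [xi - pi * gi for xi, pi, gi in zip(x0, p, grad(x0))]

# One plain gradient step with step-size 1/L, for comparison.
L = max(a)
x_plain = [xi - gi / L for xi, gi in zip(x0, grad(x0))]
```

The preconditioned step lands on the minimizer $x_* = 0$ (up to floating-point rounding), while the plain step with $1/L$ fixes the first coordinate but leaves the second coordinate almost unchanged.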

"Adaptive" Optimization Methods

Adaptive and Parameter-Free Methods

\displaystyle x_{t+1} = x_t - P_t \cdot \nabla f(x_t)

Preconditioner at round $t$

AdaGrad from Online Learning

\displaystyle P_t = \Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2}

or its diagonal version

\displaystyle P_t = \mathrm{Diag}\Big( \sum_{i \leq t} \nabla f(x_i) \nabla f(x_i)^T \Big)^{-1/2}

Better guarantees if functions are easy

while preserving optimal worst-case guarantees in Online Learning

Attains linear rate in classical convex opt (proved later)

But... Online Learning is too adversarial, AdaGrad is "conservative"

In OL, functions change every iteration adversarially
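For concreteness, here is a minimal sketch of the diagonal AdaGrad update, assuming the common form with an extra scalar step-size `eta` and a small constant `eps` to avoid division by zero. Both parameters, the function name, and the test problem are assumptions of this example rather than details from the slides.

```python
import math


def diagonal_adagrad(grad, x0, eta, steps, eps=1e-12):
    """Diagonal AdaGrad: coordinate i uses step eta / sqrt(sum of past g_i^2)."""
    x = list(x0)
    sum_sq = [0.0] * len(x)  # running sum of squared gradients, per coordinate
    for _ in range(steps):
        g = grad(x)
        sum_sq = [s + gi * gi for s, gi in zip(sum_sq, g)]
        x = [xi - eta * gi / (math.sqrt(s) + eps)
             for xi, gi, s in zip(x, g, sum_sq)]
    return x


# Assumed test problem: the badly conditioned quadratic f(x) = 1/2 x^T A x.
a = [1000.0, 0.001]
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x0 = [1.0, -2.0]
x_final = diagonal_adagrad(grad, x0, eta=0.5, steps=200)
```

The per-coordinate steps only ever shrink, because the sum of squared gradients never decreases. This is one way to read the "conservative" remark above.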

Second-order Methods

Newton's method

P_t = (\nabla^2 f(x_t))^{-1}

is usually a great preconditioner

Superlinear convergence

...when $\lVert x_t - x_*\rVert$ is small

Newton may diverge otherwise

Using a step-size with Newton and QN methods ensures convergence away from $x_*$

\displaystyle f(x_t) - f(x_*) \leq \left( 1 - \frac{1}{\kappa^2} \right)^t (f(x_0) - f(x_*))

Worse than GD

$\nabla^2 f(x)$ is usually expensive to compute

Quasi-Newton Methods, e.g. BFGS

P_t \approx (\nabla^2 f(x_t))^{-1}

should also help

State of Affairs

(Quasi-)Newton: needs Hessian, can be slower than GD

Hypergradient methods: purely heuristic, unstable

Online Learning Algorithms: Good but pessimistic theory

at least for smooth optimization it seems pessimistic...

Online Learning (non-smooth opt?), 1 step-size: Scalar AdaGrad, Coin-Betting

Online Learning (non-smooth opt?), $d$ step-sizes (diagonal preconditioner): Diagonal AdaGrad, Coordinate-wise Coin-Betting

Smooth Optimization, 1 step-size: Backtracking Line-search

Smooth Optimization, $d$ step-sizes (diagonal preconditioner): Multidimensional Backtracking

What does it mean for a method to be adaptive?

Optimal (Diagonal) Preconditioner

Optimal step-size: biggest that guarantees progress

$L$ -smooth

\displaystyle f(y) \leq f(x) + \langle \nabla f(x), y - x\rangle + \frac{L}{2}\lVert y - x \rVert_2^2

$\mu$ -strongly convex

\displaystyle f(y) \geq f(x) + \langle \nabla f(x), y - x\rangle + \frac{\mu}{2}\lVert y - x \rVert_2^2

Together, for twice-differentiable $f$:

\displaystyle \mu I \preceq \nabla^2 f(x) \preceq L I

Optimal preconditioner: biggest (??) that guarantees progress

\displaystyle \frac{1}{\kappa} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}

A single step-size $1/L$ is the special case

\displaystyle P = \tfrac{1}{L} I, \quad \kappa = \tfrac{L}{\mu}

$P_*$ is the diagonal matrix $P$ that minimizes $\kappa_*$ such that

\displaystyle \frac{1}{\kappa_*} P^{-1} \preceq \nabla^2 f(x) \preceq P^{-1}
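When both the Hessian and the preconditioner are diagonal, the matrix inequalities reduce to per-coordinate conditions $1/\kappa \leq p_i h_i \leq 1$. The sketch below computes the resulting $\kappa$ under that assumption (a quadratic with constant diagonal Hessian); the function name is hypothetical.

```python
def preconditioned_condition_number(p, h, tol=1e-12):
    """Smallest kappa with (1/kappa) P^{-1} <= H <= P^{-1}, for diagonal P and H.

    p and h hold the positive diagonal entries of P and H. Returns None when
    the upper bound H <= P^{-1} fails, i.e. some p_i * h_i exceeds 1.
    """
    ratios = [pi * hi for pi, hi in zip(p, h)]
    if max(ratios) > 1.0 + tol:
        return None
    return 1.0 / min(ratios)


h = [1000.0, 0.001]  # Hessian diagonal of the earlier quadratic example

# P = (1/L) I recovers the usual condition number kappa = L / mu.
kappa_scalar = preconditioned_condition_number([0.001, 0.001], h)

# P = H^{-1} is the best diagonal choice here and gives kappa_* = 1.
kappa_best = preconditioned_condition_number([0.001, 1000.0], h)

# A preconditioner that is too big violates H <= P^{-1}.
kappa_invalid = preconditioned_condition_number([0.002, 1000.0], h)
```

For this example the gap between $\kappa = L/\mu$ and $\kappa_*$ is six orders of magnitude.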

From Line-search to Preconditioner Search

Line-search

step-size is at least $1/2$ the optimum $1/L$

# backtracks $\leq \log\Big(\alpha_{0} L \Big)$

\displaystyle f(x_t) - f(x_*) \leq\Big(1 - \frac{1}{2 \cdot \kappa}\Big)^t (f(x_0) - f(x_*))

Multidimensional Backtracking

Condition number is within a factor $\sqrt{2d}$ of the optimum $\kappa_*$

# backtracks $\lesssim d \cdot \log\Big( \alpha_0 \cdot L\Big)$

\displaystyle f(x_t) - f(x_*) \leq\Big(1 - \frac{1}{\sqrt{2d} \cdot \kappa_*}\Big)^t (f(x_0) - f(x_*))

Worth it if $\sqrt{2d} \kappa_* \ll 2 \kappa$
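To get a feel for the comparison, the two rates can be turned into iteration counts. This is a rough sketch that only uses the upper bounds above: the values $d = 2$, $\kappa = 10^6$, and $\kappa_* = 1$ are taken from the earlier quadratic example, and the target accuracy `eps` is an arbitrary choice.

```python
import math


def iterations_to_accuracy(contraction, eps):
    """Smallest t with (1 - contraction)^t <= eps, for 0 < contraction < 1."""
    return math.ceil(math.log(eps) / math.log(1.0 - contraction))


d = 2
kappa = 1e6       # condition number of the quadratic example
kappa_star = 1.0  # its optimal diagonally preconditioned condition number
eps = 1e-6        # assumed target reduction of f(x_t) - f(x_*)

# Line-search bound: contraction factor 1 / (2 kappa).
line_search_iters = iterations_to_accuracy(1.0 / (2.0 * kappa), eps)

# Multidimensional backtracking bound: 1 / (sqrt(2d) kappa_*).
multidim_iters = iterations_to_accuracy(
    1.0 / (math.sqrt(2 * d) * kappa_star), eps)
```

Here $\sqrt{2d}\,\kappa_* = 2$ while $2\kappa = 2 \cdot 10^6$, so the bound for multidimensional backtracking needs about 20 iterations where the line-search bound needs tens of millions. These are worst-case bounds, not measured runtimes.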

Why Naive Search does not Work

Line-search: test if step-size $\alpha_{\max}/2$ makes enough progress:

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{\alpha_{\max}}{2} \cdot \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_2^2

Armijo condition

If this fails, cut out everything bigger than $\alpha_{\max}/2$

[Figure: number line of candidate step-sizes with marks at $0$, $\tfrac{\alpha_{0}}{4}$, $\tfrac{\alpha_{0}}{2}$, and $\alpha_{0}$, and the optimum $\tfrac{1}{L}$]

Preconditioner search:

Candidate preconditioners $\mathcal{S}$ : diagonals in a box

Test if preconditioner $P$ makes enough progress:

\displaystyle f(x_{t+1}) \leq f(x_t) - \tfrac{1}{2} \lVert \nabla f(x_t) \rVert_P^2, \quad \lVert \nabla f(x_t) \rVert_P^2 = \langle \nabla f(x_t), P \nabla f(x_t) \rangle

If this fails, cut out everything bigger than $P$
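The progress test for a diagonal preconditioner costs one extra function evaluation, just like the Armijo test. A minimal sketch, assuming $P$ is stored as a list of per-coordinate step-sizes and using a hypothetical function name and test problem:

```python
def sufficient_progress(f, grad, x, p):
    """Check f(x - P g) <= f(x) - 1/2 <g, P g> for diagonal P = diag(p)."""
    g = grad(x)
    x_new = [xi - pi * gi for xi, pi, gi in zip(x, p, g)]
    norm_sq_P = sum(pi * gi * gi for pi, gi in zip(p, g))  # ||g||_P^2
    return f(x_new) <= f(x) - 0.5 * norm_sq_P


# Assumed test problem: f(x) = 1/2 (4 x_1^2 + x_2^2).
a = [4.0, 1.0]
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x = [1.0, 1.0]
ok_inverse_hessian = sufficient_progress(f, grad, x, [0.25, 1.0])  # P = A^{-1}
ok_too_big = sufficient_progress(f, grad, x, [0.5, 1.0])  # first entry too big
```

On this example $P = A^{-1}$ passes the test, while doubling its first entry fails it.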

Convexity to the Rescue

$P$ does not yield sufficient progress

Which preconditioners can be thrown out?

Throwing out all $Q$ such that $P \preceq Q$ works, but it is too weak

\displaystyle h(P) \coloneqq f(x - P \nabla f(x)) - f(x) + \tfrac{1}{2} \lVert \nabla f(x) \rVert_P^2

$P$ does not yield sufficient progress $\iff$ $h(P) > 0$

Convexity $\implies$

\displaystyle h(Q) \geq h(P) + \langle \nabla h(P), Q - P \rangle

\displaystyle h(P) + \langle \nabla h(P), Q - P \rangle > 0

$\implies$ $Q$ is invalid

A separating hyperplane!

$P$ in this half-space

$\nabla h(P)$: Hypergradient
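The cut can be made concrete for diagonal preconditioners. The sketch below assumes $P = \mathrm{diag}(p)$ and the update $x - P \nabla f(x)$, so that $h(p) = f(x - p \odot g) - f(x) + \tfrac{1}{2}\sum_i p_i g_i^2$ with $g = \nabla f(x)$, and the chain rule gives $\partial h / \partial p_i = -g_i \, [\nabla f(x - p \odot g)]_i + \tfrac{1}{2} g_i^2$. Function names and the test problem are illustrative.

```python
def h_and_hypergradient(f, grad, x, p):
    """Return h(p) and its gradient with respect to the diagonal entries p."""
    g = grad(x)
    x_new = [xi - pi * gi for xi, pi, gi in zip(x, p, g)]
    g_new = grad(x_new)
    h = f(x_new) - f(x) + 0.5 * sum(pi * gi * gi for pi, gi in zip(p, g))
    dh = [-gi * gni + 0.5 * gi * gi for gi, gni in zip(g, g_new)]
    return h, dh


def is_cut(h, dh, p, q):
    """True if the hyperplane certifies h(q) > 0, so q can be discarded."""
    return h + sum(di * (qi - pi) for di, pi, qi in zip(dh, p, q)) > 0


# Assumed test problem: f(x) = 1/2 (4 x_1^2 + x_2^2), queried at x = (1, 1).
a = [4.0, 1.0]
f = lambda x: 0.5 * sum(ai * xi * xi for ai, xi in zip(a, x))
grad = lambda x: [ai * xi for ai, xi in zip(a, x)]

x = [1.0, 1.0]
p = [0.5, 1.0]  # fails the progress test, so h(p) > 0
h, dh = h_and_hypergradient(f, grad, x, p)

cut_smaller = is_cut(h, dh, p, [0.45, 0.9])   # smaller than p in every entry
cut_optimal = is_cut(h, dh, p, [0.25, 1.0])   # P = A^{-1}, a valid choice
```

In this example the hyperplane discards $q = (0.45, 0.9)$ even though it is smaller than $p$ in every coordinate, which the rule "cut out everything bigger than $P$" would keep, and it does not discard the valid preconditioner $A^{-1}$.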
