Follow The Regularized Leader

The Algorithm with a Thousand Faces

Victor Sanches Portella

PhD Student in Computer Science @ UBC

October, 2019

 Online Convex Optimization

Online Convex Optimization (OCO)

At each round, simultaneously:

  Player chooses a point x \in X
  Enemy chooses a convex function f

Player suffers the loss f(x).

After the round, Player and Enemy both see f~\text{and}~x.

The Enemy may be adversarial.

Formalizing Online Convex Optimization

An Online Convex Optimization problem is a pair \mathcal{C} = (X, \mathcal{F}) where

  X is a convex set, and
  \mathcal{F} is a set of convex functions.

On each of the rounds t = 1, \dotsc, T:

  Player chooses x_t \in X
  Enemy chooses f_t \in \mathcal{F}
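The protocol is easy to state as a loop. Below is a minimal sketch in Python; the player/enemy objects and their methods are hypothetical stand-ins for concrete strategies, and only the interaction pattern comes from the definition above.

```python
def play_oco(player, enemy, T):
    """Run T rounds of Online Convex Optimization; return the Player's total loss."""
    total_loss = 0.0
    for t in range(T):
        x_t = player.choose_point()    # Player commits to a point x_t in X
        f_t = enemy.choose_function()  # Enemy picks f_t simultaneously
        total_loss += f_t(x_t)         # Player suffers the loss f_t(x_t)
        player.observe(f_t)            # both sides see the round's choices
        enemy.observe(x_t)
    return total_loss
```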

Expert's Problem

  Player chooses probabilities p \in \Delta_E over the experts
  Enemy chooses costs y \in [-1,1]^E, one per expert

f(p) = y^{T}p = \mathbb{E}_{e \sim p}[y_e]

Example with four experts: probabilities p = (0.5, 0.1, 0.3, 0.1), costs y = (1, 0, -1, 1).
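One round of this game with the example numbers above, as a quick numpy check that the loss is just the inner product y^T p:

```python
import numpy as np

p = np.array([0.5, 0.1, 0.3, 0.1])   # Player's distribution over experts
y = np.array([1.0, 0.0, -1.0, 1.0])  # Enemy's costs, each in [-1, 1]

assert np.isclose(p.sum(), 1.0) and np.all(p >= 0)  # p lies in the simplex
loss = y @ p  # f(p) = y^T p = 0.5 + 0.0 - 0.3 + 0.1 = 0.3
print(loss)
```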

Online Regression

Online Linear Regression

  Player chooses a regression function r_t(x) = \langle w_t,x \rangle, i.e., a vector w_t
  Enemy chooses a query & answer pair (x_t, y_t)
  Player suffers the loss |r_t(x_t) - y_t|, i.e., f_t(w) = |\langle w, x_t \rangle - y_t |

We want to predict the answer based on the query.

Regret

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)

The first sum is the Player's loss; the second is the cost of always choosing u.

\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

Goal: sublinear Regret

\displaystyle \lim_{T \to \infty} \frac{1}{T}\mathrm{Regret}_T( U) = 0

Player Strategies

  Sublinear regret under mild conditions
  Hopefully efficiently implementable
  A unified view of the algorithms via FTRL

Focus of this talk: algorithms for the Player.
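Empirical regret is straightforward to measure. A small sketch, using a finite candidate set U as a stand-in for the supremum:

```python
def regret(losses, plays, U):
    """losses[t] is the callable f_t, plays[t] is x_t, U is a finite set of comparators."""
    player_loss = sum(f(x) for f, x in zip(losses, plays))
    best_fixed = min(sum(f(u) for f in losses) for u in U)  # cost of the best fixed u
    return player_loss - best_fixed
```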

Motivation

"Why should I care?"

OCO in Practice

  Optimization for Big Data: Stochastic Gradient Descent, Adaptive Gradient Descent (AdaGrad)
  Web Ad Placement (Bandit - limited feedback)
  Deep Nets Training [Large Scale Distributed Deep Networks, Dean et al. '12]

Applications of OCO in Other Areas

  Computational Complexity: QIP = PSPACE [QIP = PSPACE, Jain et al. '09]
  Approximately Maximum Flow [Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs, Christiano et al. '11]
  Robust Optimization: "Boosting"
  Competitive Analysis: k-server problem [k-server via multiscale entropic regularization, Bubeck et al. '17]
  Linear Spectral Sparsification: ~\Omega(n^4) \to O(n^{2 + \varepsilon})~ [Spectral Sparsification and Regret Minimization Beyond Matrix Multiplicative Updates, Allen-Zhu, Liao, and Orecchia '16]
  SDP Solver [A Combinatorial, Primal-Dual Approach to Semidefinite Programs, Arora and Kale '07]

Adaptive FTRL

Cumulative Loss of the Experts

           Expert 1   Expert 2   Expert 3   Expert 4
  t = 1      0           1         0.5        1
  t = 2      1          1.5        0.5        1
  t = 3     1.5          2          1        1.5
  t = 4     2.5          3          2        1.5

Follow the Leader

Enemy plays f_1, f_2, f_3, f_4, \dotsc; Player responds with

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in X} \sum_{i = 1}^t f_{i}(x)

UNSTABLE!
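The instability is easy to reproduce. A sketch with two experts and the classic alternating costs: the leader flips every round, so FTL keeps picking the expert that is about to suffer cost 1.

```python
import numpy as np

T = 20
cum = np.zeros(2)    # cumulative losses of the two experts
ftl_loss = 0.0
for t in range(T):
    leader = int(np.argmin(cum))  # FTL plays the current leader
    if t == 0:
        y = np.array([0.5, 0.0])  # opening cost that seeds the oscillation
    else:
        y = np.array([1.0, 0.0]) if t % 2 == 0 else np.array([0.0, 1.0])
    ftl_loss += y[leader]
    cum += y
print(ftl_loss, cum.min())  # FTL's loss grows like T; the best expert's like T/2
```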

Adding Regularization

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R(x)

This is FTRL with a fixed regularizer R.

Regret Upper-Bounds for Experts

Experts setting with d experts: X = \Delta_d and f_t(p) = y^T p,~\text{where}~y \in [-1,1]^d.

R(x) = \frac{1}{2} \lVert x \rVert_2^2 \Rightarrow \displaystyle \mathrm{Regret}_T \leq \sqrt{2 d T}

R(x) = \sum x_i \ln x_i \Rightarrow \displaystyle \mathrm{Regret}_T \leq 2 \sqrt{(\ln d) T}

Can we do better?

Adding Adaptive Regularization

At round t, use regularizer R_t:

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R_{t+1}(x)

How do we get from R_t to R_{t+1}? Add a convex regularizer increment r_{t+1}:

R_{t+1} = R_t + r_{t+1}, \qquad \text{so} \qquad R_{t+1} = r_1 + r_2 + \dotsc + r_{t+1}

AdaFTRL

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + \sum_{i = 1}^{t+1} r_{i}(x)

Efficiently computable? Not clear in general.
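In the experts setting, however, FTRL with an entropic regularizer and linearized losses has a closed form: a softmax of the negative cumulative costs. A sketch, where the step size eta is a parameter not fixed by the slides:

```python
import numpy as np

def ftrl_entropic(cost_vectors, eta=0.1):
    """FTRL on the simplex with R(x) = (1/eta) * sum_i x_i ln x_i and f_t(p) = y_t^T p."""
    d = len(cost_vectors[0])
    cum = np.zeros(d)                    # cumulative costs
    plays = []
    for y in cost_vectors:
        z = -eta * cum
        z -= z.max()                     # shift for numerical stability
        p = np.exp(z) / np.exp(z).sum()  # arg min of <cum, x> + R(x) over the simplex
        plays.append(p)
        cum += np.asarray(y, dtype=float)
    return plays
```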

Online Mirror Descent

Online (sub)Gradient Descent

At round t + 1, step from x_t in the direction - \nabla f_t(x_t) and project back onto X:

x_{t+1} = \mathrm{Proj}_X(x_t - \nabla f_t(x_t))
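A sketch of this update on the Euclidean unit ball, where the projection has a closed form; the step size eta and the gradient oracles are assumptions of the sketch:

```python
import numpy as np

def proj_l2_ball(y, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(y)
    return y if norm <= radius else (radius / norm) * y

def ogd(grads, x1, eta=0.1):
    """grads[t](x) returns a (sub)gradient of f_t at x."""
    x = np.asarray(x1, dtype=float)
    plays = [x]
    for grad_f in grads:
        x = proj_l2_ball(x - eta * grad_f(x))  # gradient step, then project onto X
        plays.append(x)
    return plays
```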

Another Perspective

What is \nabla f_t(x_t)? It is the representation of the derivative of f_t at x_t, which is a linear functional (Riesz Representation Theorem): for a direction u,

[Df_t(x_t)](u) = \langle \nabla f_t(x_t), u \rangle

is the directional derivative of f_t at x_t in the direction u.

The Online Gradient Descent update, revisited:

  x_t - \nabla f_t(x_t)~: a point minus the representation of a functional;
  \langle x_t, \cdot \rangle - Df_t(x_t)~: a functional minus a functional.

Points such as x_t live in the primal space; functionals such as Df_t(x_t) live in the dual space.

Avoiding Inner-Product

R(x) = \frac{1}{2} \langle x, x\rangle \implies \nabla R(x) = x

so the functional identity \langle x_t, \cdot \rangle - Df_t(x_t) = D R (x_t) - Df_t(x_t) reads, in representations,

x_t - \nabla f_t(x_t) = \nabla R (x_t) - \nabla f_t(x_t)

What if we make choices for R(x) other than \frac{1}{2}\lVert x\rVert_2^2?

How do we make projections w.r.t. R(x)?

Avoiding Inner-Product

\nabla R (x_t) - \nabla f_t(x_t)~: the gradient step now happens in the dual space.

  Primal: x_t
  Dual: \nabla R(x_t) \;\mapsto\; \nabla R(x_t) - \nabla f_t(x_t)

The map \nabla R takes us to the dual; to come back we need its inverse \nabla R^{-1}, which is \nabla R^*. But does the resulting point lie in X?

Bregman Divergence

B_{R}(x,y) = R(x) -(R(y) + \langle \nabla R (y), x - y\rangle)

i.e., R(x) minus the 1st-order Taylor expansion of R around y.

Bregman Projection

\Pi_X^R(y) = \mathrm{arg}\,\mathrm{min}_{x \in X} B_{R}(x,y)
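Two standard instances, as a sketch: for R = \frac{1}{2}\lVert x \rVert_2^2 the divergence is half the squared Euclidean distance, and for the negative entropy it is the (unnormalized) KL divergence.

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """B_R(x, y) = R(x) - R(y) - <grad R(y), x - y>."""
    return R(x) - R(y) - np.dot(grad_R(y), x - y)

half_sq = lambda x: 0.5 * np.dot(x, x)
neg_entropy = lambda x: np.sum(x * np.log(x))

x, y = np.array([0.2, 0.8]), np.array([0.5, 0.5])
print(bregman(half_sq, lambda z: z, x, y))                  # 0.5 * ||x - y||_2^2
print(bregman(neg_entropy, lambda z: np.log(z) + 1, x, y))  # KL(x || y) on the simplex
```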

Online Mirror Descent

  Primal: x_t \in X \cap \mathrm{int}(\mathrm{dom}~R)
  Dual: y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)
  Back to the primal via \nabla R^*, then the Bregman Projection: x_{t+1} = \Pi_X^R(\nabla R^*(y_{t+1}))

Adaptive? Use a round-dependent mirror map: y_{t+1} = \nabla R_{t+1}(x_t) - \nabla f_t(x_t), then \nabla R_{t+1}^* and \Pi_X^{R_{t+1}}. Adaptive!
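One round of (eager) OMD is short enough to write generically. In this sketch grad_R, grad_R_star, and bregman_proj stand for \nabla R, \nabla R^*, and \Pi_X^R; all three are parameters of the sketch:

```python
def omd_step(x_t, g_t, grad_R, grad_R_star, bregman_proj):
    """One eager OMD round: primal -> dual -> gradient step -> primal -> projection."""
    y_next = grad_R(x_t) - g_t           # move to the dual and step there
    x_unprojected = grad_R_star(y_next)  # map back to the primal
    return bregman_proj(x_unprojected)   # Bregman-project onto X
```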

Lazy Online Mirror Descent

Same picture, except the dual step starts from y_t instead of \nabla R(x_t):

  Dual: y_{t+1} = y_t - \nabla f_t(x_t)
  Bregman Projection back onto X \cap \mathrm{int}(\mathrm{dom}~R): x_{t+1} = \Pi_X^R(\nabla R^*(y_{t+1}))

Classic Online Mirror Descent

Eager:

  First round: x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R(x)
  For t = 1, \dotsc, T:
    y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)
    x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))

Lazy:

  First round: x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R(x)
  For t = 1, \dotsc, T:
    y_{t+1} = y_t - \nabla f_t(x_t)
    x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))

Let us use FTRL to unify them.

Only proof sketch of the talk.
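Both variants side by side, on the simplex with the negative-entropy mirror map, where \nabla R(x) = \ln x + 1, \nabla R^*(y) = e^{y-1}, and the Bregman projection is a normalization. The step size eta and the initialization y_1 = \nabla R(x_1) are assumptions of the sketch.

```python
import numpy as np

def entropy_omd(grads, d, eta=0.1, lazy=False):
    """Eager or lazy OMD on the simplex with R(x) = sum_i x_i ln x_i."""
    x = np.full(d, 1.0 / d)        # x_1 minimizes R over the simplex
    y = np.log(x) + 1.0            # y_1 = grad R(x_1)
    for grad_f in grads:
        if lazy:
            y = y - eta * grad_f(x)                  # lazy: accumulate in the dual
        else:
            y = (np.log(x) + 1.0) - eta * grad_f(x)  # eager: restart from grad R(x_t)
        x = np.exp(y - 1.0)        # grad R^*(y) = exp(y - 1)
        x = x / x.sum()            # Bregman projection onto the simplex
    return x
```

In this particular instance the two variants produce the same plays (the normalization only shifts the dual point by a multiple of the all-ones vector), a preview of the eager = lazy discussion below.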

LOMD as FTRL

Unrolling the lazy dual update:

y_{t+1} = y_t - \nabla f_t(x_t) = y_{t-1} -\nabla f_{t-1}(x_{t-1}) - \nabla f_t(x_t) = \dotsb = \displaystyle -\sum_{i = 1}^{t} \nabla f_i(x_i)

Let R_X = R inside X and + \infty outside. Then

\nabla R_X^*(y) = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \langle - y, x\rangle + R_X(x) = \Pi_X^R(\nabla R^*(y))

so x_{t+1} = \Pi_X^R(\nabla R^*(y_{t+1})) becomes

\displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

that is, FTRL with linearized losses and regularizer R_X.

EOMD as FTRL

Now the dual update is y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t), and \nabla R(x_t) need not equal y_t, so the sum does not telescope directly. Two facts help. First, the subgradients of R_X split as

\partial R_X(x_t) = \nabla R(x_t) + N_X(x_t)

where N_X(x_t) is the normal cone of X at x_t: for example, at a point x_t on the boundary of an interval X \subseteq \mathbb{R} one can have N_X(x_t) = [0, +\infty). Second, as before,

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

EOMD as FTRL

Since x_t minimizes the FTRL objective, y_t \in \partial R_X(x_t) = \nabla R(x_t) + N_X(x_t). Unrolling y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t) therefore yields, for some p_1 \in N_X(x_1), p_2 \in N_X(x_2), \dotsc, p_t \in N_X(x_t),

\displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i, x \rangle + R_X(x)

that is, FTRL with the linearized losses shifted by normal-cone vectors.

EOMD vs LOMD

Lazy: \displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

Eager: \displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i, x \rangle + R_X(x)

Normal cones are trivial away from the boundary: at a boundary point z_1 of X the cone N_X(z_1) can be nontrivial, while at a point z_2 in the relative interior N_X(z_2) = \{0\}. Hence

x_i \in \mathrm{int}(\mathrm{dom}~R)~\text{and}~\mathrm{int}(\mathrm{dom}~R) \subseteq \mathrm{ri}~X \implies p_i = 0~\text{for every}~i \implies \text{Eager = Lazy}

A Genealogy of Algorithms

A Bird's-eye View

Connection Among the Main Algorithms

With R(x) = \frac{1}{2} \lVert x \rVert_2^2, OMD is projected gradient descent:

x_{t+1} = \mathrm{Proj}_X(x_t - \nabla f_t(x_t))

With R(x) = \sum x_i \ln x_i and \mathrm{dom}~R = \Delta_d, OMD is the multiplicative weights update:

\displaystyle (x_{t+1})(i) = \frac{x_t(i) e^{-\nabla f_t(x_t)(i)}}{K}

where K normalizes the weights to sum to 1.
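The two special cases as one-line updates, with step size 1 as on the slide; proj_X (a Euclidean projection onto X) is an assumed helper:

```python
import numpy as np

def euclidean_step(x_t, g_t, proj_X):
    """R = (1/2)||x||_2^2: projected gradient descent."""
    return proj_X(x_t - g_t)

def multiplicative_weights_step(x_t, g_t):
    """R = sum_i x_i ln x_i on the simplex: multiplicative weights."""
    w = x_t * np.exp(-g_t)  # reweight each coordinate
    return w / w.sum()      # K = sum of the new weights
```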

AdaReg

y_{t+1} = x_t - H_{t+1} \nabla f_t(x_t)
x_{t+1} = \mathrm{Proj}_{H_{t+1}^{-1}}(y_{t+1})

where the matrix H_{t+1} \succ 0 is built from the past gradients. With

\displaystyle G_{t} = \sum_{i = 1}^t \nabla f_i(x_i) \nabla f_i(x_i)^{\intercal}

natural choices are \displaystyle H_{t+1} \approx G_t^{-1} and \displaystyle H_{t+1} \approx G_t^{-\frac{1}{2}}. More generally,

\displaystyle H_{t+1} \in \mathrm{arg}\,\mathrm{min}_{H \succ 0}~ \langle H, G_t\rangle + \Phi(H)

with, e.g., \Phi(H) = - \ln \det H or \Phi(H) = \mathrm{Tr}(H^{-1}). Since

\displaystyle \langle H, G_t\rangle = \sum_{i = 1}^t \mathrm{Tr}(H g_i ^{}g_i^T) = \sum_{i = 1}^t \lVert g_i \rVert_H^2

this choice trades off minimizing the size of the (sub)gradients against minimizing the "complexity" of H:

\displaystyle H_{t+1} \in \mathrm{arg}\,\mathrm{min}_{H \succ 0}~ \sum_{i = 1}^t \lVert g_i \rVert_H^2 + \Phi(H)
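The diagonal version of the H_{t+1} \approx G_t^{-1/2} choice is the familiar AdaGrad-style update. A sketch; the step size eta and the damping eps are assumptions for numerical safety:

```python
import numpy as np

def adagrad_diagonal(grads, x1, eta=0.1, eps=1e-8):
    """Unconstrained diagonal AdaGrad: H_{t+1} ~ diag(G_t)^{-1/2}."""
    x = np.asarray(x1, dtype=float)
    diag_G = np.zeros_like(x)              # running diagonal of sum_i g_i g_i^T
    for grad_f in grads:
        g = grad_f(x)
        diag_G += g * g
        H = eta / (np.sqrt(diag_G) + eps)  # per-coordinate H_{t+1}
        x = x - H * g                      # no projection: X is all of R^d here
    return x
```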


Algorithms We Shall See

\text{AdaFTRL}, \text{AdaOMD}, \text{AdaDA}, \text{AdaReg}, \text{FTRL}, \text{EOMD}, \text{LOMD}

Generalizations and Special Cases

  Limited Feedback: Bandit, two-point Bandit feedback
  Special Cases: Combinatorial and other specific settings (e.g., Player restricted to the Hypercube or the L2-Ball)
  Drop or Add Hypotheses: Convexity, adversarial enemies, side information
  Change Metric: Policy Regret, Raw Loss

Mirror Maps

What if we make other choices for R(x)? We require R to be a mirror map:

(i) R is strictly convex and differentiable on \mathrm{int}(\mathrm{dom}~R);

(ii) for every y there is \bar{y} \in \mathrm{int}(\mathrm{dom}~R) such that y = \nabla R(\bar{y}); that is, \nabla R^{-1}(y) = \bar{y}, and \nabla R^{-1} = \nabla R^*;

(iii) Bregman projections onto X are attained in \mathrm{int}(\mathrm{dom}~R):

\Pi_X^R(y) \in \mathrm{int}(\mathrm{dom}~R) \quad \forall y \in \mathrm{int}(\mathrm{dom}~R)

Adaptive Online Mirror Descent

First round: x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R_1(x)

Round t+1, for t = 1, \dotsc, T:

y_{t+1} = \nabla R_{t+1}(x_t) - \nabla f_t(x_t)
x_{t+1} = \Pi_X^{R_{t+1}} (\nabla R^*_{t+1}(y_{t+1}))

with mirror map increments r_i:

R_{t+1} = R_t + r_{t+1} = r_1 + \dotsc + r_t + r_{t+1}
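A concrete sketch of this template on the simplex, where the increments simply rescale the negative entropy: R_{t+1} = c_{t+1} \sum_i x_i \ln x_i with c_{t+1} = c_t + \delta. The increment rule delta is an assumption; any convex increments r_{t+1} fit the template.

```python
import numpy as np

def ada_omd_entropy(grads, d, c1=1.0, delta=0.5):
    """Adaptive OMD on the simplex with growing entropic mirror maps."""
    x = np.full(d, 1.0 / d)   # x_1 minimizes R_1 over the simplex
    c = c1
    for grad_f in grads:
        c += delta                             # R_{t+1} = R_t + r_{t+1}
        y = c * (np.log(x) + 1.0) - grad_f(x)  # dual step with grad R_{t+1}
        x = np.exp(y / c - 1.0)                # grad R_{t+1}^*(y) = exp(y/c - 1)
        x /= x.sum()                           # Bregman projection onto the simplex
    return x
```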
