Online Convex Optimization

Online Convex Optimization (OCO)

At each round:

Player chooses a point $x \in X$

Enemy chooses a CONVEX function $f$

Both choices are made SIMULTANEOUSLY!

Player suffers a loss $f(x)$

Player and Enemy see $f~\text{and}~x$

Enemy may be Adversarial

Formalizing Online Convex Optimization

An Online Convex Optimization Problem

$\mathcal{C} = (X, \mathcal{F})$, where $X$ is a convex set and $\mathcal{F}$ is a set of convex functions

Rounds $t = 1, \dotsc, T$:

Player chooses $x_t \in X$

Enemy chooses $f_t \in \mathcal{F}$

Experts' Problem

Player chooses Probabilities over the Experts, $p \in \Delta_E$ (for example 0.5, 0.1, 0.3, 0.1)

Enemy chooses Costs for the Experts, $y \in [-1,1]^E$ (for example 1, 0, -1, 1)

f(p) = y^{T}p = \mathbb{E}_{e \sim p}[y_e]
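
As a worked instance of the loss $f(p) = y^{T}p$, the sketch below assumes the four slide values 0.5, 0.1, 0.3, 0.1 are the Player's probabilities and 1, 0, -1, 1 are the Enemy's costs (this pairing is an assumption read off the slide layout). The variable names are illustrative only.

```python
import numpy as np

# Assumed reading of the slide: probabilities over four experts and their costs.
p = np.array([0.5, 0.1, 0.3, 0.1])   # point in the simplex Delta_E
y = np.array([1.0, 0.0, -1.0, 1.0])  # costs in [-1, 1]^E

# p must be a probability vector.
assert np.all(p >= 0) and np.isclose(p.sum(), 1.0)

# f(p) = y^T p, the expected cost of an expert drawn from p.
loss = float(y @ p)  # 0.5*1 + 0.1*0 + 0.3*(-1) + 0.1*1
print(loss)
```

By hand, the sum is 0.5 + 0 - 0.3 + 0.1 = 0.3, up to floating-point rounding.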

Online Regression

Online Linear Regression

We want to predict the answer based on the query

Player chooses a Regression Function $r_t$, here $r_t(x) = \langle w_t,x \rangle$, so the Player really chooses $w_t$

Enemy chooses a Query & Answer $(x_t, y_t)$

Loss: $|r_t(x_t) - y_t|$, that is,

f_t(w) = |\langle w, x_t \rangle - y_t |
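
To make the regression loss concrete, here is a minimal sketch of $f_t(w) = |\langle w, x_t \rangle - y_t|$ together with one valid subgradient. The function names are hypothetical, and returning the zero vector at a kink is just one admissible choice.

```python
import numpy as np

def abs_loss(w, x, y):
    """Loss f_t(w) = |<w, x_t> - y_t| of the linear predictor w on (x_t, y_t)."""
    return abs(float(np.dot(w, x)) - y)

def abs_loss_subgradient(w, x, y):
    """One subgradient of f_t at w: sign(<w, x_t> - y_t) * x_t.

    np.sign(0) is 0, which is a valid subgradient when the residual vanishes.
    """
    residual = float(np.dot(w, x)) - y
    return np.sign(residual) * x

w = np.array([1.0, 2.0])
x_t = np.array([3.0, -1.0])
y_t = 0.5
loss_value = abs_loss(w, x_t, y_t)            # |3 - 2 - 0.5|
g = abs_loss_subgradient(w, x_t, y_t)         # residual is positive, so g = x_t
```

The subgradient is what a first-order Player strategy would consume in place of a gradient, since $f_t$ is not differentiable where the prediction is exact.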

Regret

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)

The first sum is the Player's Loss; the second is the Cost of always choosing $u$

\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

Goal: sublinear Regret

\displaystyle \lim_{T \to \infty} \frac{1}{T}\mathrm{Regret}_T( U) = 0
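
The regret definitions can be evaluated directly once the plays and losses are known. The sketch below assumes linear losses $f_t(p) = \langle y_t, p \rangle$ on the simplex, as in the experts setting. The helper names and the toy data are illustrative assumptions.

```python
import numpy as np

def regret_vs_point(costs, plays, u):
    """Regret_T(u) = sum_t f_t(x_t) - sum_t f_t(u) for f_t(p) = <y_t, p>."""
    player_loss = float(np.sum(costs * plays))
    comparator_loss = float(np.sum(costs @ u))
    return player_loss - comparator_loss

def regret_vs_simplex(costs, plays):
    """Regret_T(U) for U the simplex.

    A linear function is minimized over the simplex at a vertex, so the
    supremum over u is attained by the best single expert in hindsight.
    """
    player_loss = float(np.sum(costs * plays))
    return player_loss - float(costs.sum(axis=0).min())

# Toy data: 3 rounds, 2 experts, Player always plays the uniform distribution.
costs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
plays = np.full((3, 2), 0.5)
r_first_expert = regret_vs_point(costs, plays, np.array([1.0, 0.0]))
r_simplex = regret_vs_simplex(costs, plays)
```

Here the Player's total loss is 1.5, the first expert's is 2, and the second expert's is 1, so the regret is -0.5 against the first expert and 0.5 against the whole simplex.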

Player Strategies

Sublinear regret under mild conditions

Focus of this talk: algorithms for the Player

Hopefully efficiently implementable

Unified view of the algorithms from FTRL

Motivation

"Why should I care?"

OCO in Practice

Optimization for Big Data

Stochastic Gradient Descent

Adaptive Gradient Descent (AdaGrad)

Web Ad Placement

(Bandit - Limited Feedback)

Deep Nets Training

[Large Scale Distributed Deep Networks, Dean et al. '12]

Applications of OCO in Other Areas

Computational Complexity

Approximately Maximum Flow

Robust Optimization

Competitive Analysis

Linear Spectral Sparsification

SDP Solver

QIP = PSPACE

k-server problem

~\Omega(n^4) \to O(n^{2 + \varepsilon})~

\}

"Boosting"

[QIP = PSPACE, Jain et al. '09]

[k-server via multiscale entropic regularization, Bubeck et al. '17]

[Spectral Sparsification and Regret Minimization Beyond Matrix Multiplicative Updates, Allen-Zhu, Liao, and Orecchia '16]

[A Combinatorial, Primal-Dual approach to Semidefinite Programs, Arora and Kale '07]

[Electrical Flows, Laplacian Systems, and Faster Approximation of Maximum Flow in Undirected Graphs, Christiano et al. '11]

Adaptive

FTRL

Cumulative Loss

Cumulative loss of each of the four Experts after round t:

t = 1: 0, 1, 0.5, 1

t = 2: 1, 1.5, 0.5, 1

t = 3: 1.5, 2, 1, 1.5

t = 4: 2.5, 3, 2, 1.5
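
To see who the leader is at each round, the sketch below assumes the slide's numbers are cumulative losses of four experts, grouped four per round for t = 1 to 4. That grouping is an assumption, although it is consistent with every column being nondecreasing. Expert indices are 0-based.

```python
import numpy as np

# Assumed reading of the slide: row t-1 holds the cumulative losses after round t.
cumulative = np.array([
    [0.0, 1.0, 0.5, 1.0],   # t = 1
    [1.0, 1.5, 0.5, 1.0],   # t = 2
    [1.5, 2.0, 1.0, 1.5],   # t = 3
    [2.5, 3.0, 2.0, 1.5],   # t = 4
])

# The leader after round t is the expert with the smallest cumulative loss.
leaders = cumulative.argmin(axis=1)

# Per-round losses are recovered by differencing consecutive rows.
per_round = np.diff(cumulative, axis=0, prepend=np.zeros((1, 4)))
print(leaders.tolist())
```

Under this reading the leader changes from expert 0 to expert 2 and then to expert 3, which hints at why blindly following the leader can be erratic.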

Follow the Leader

Enemy plays $f_1, f_2, f_3, f_4$

Player plays $x_1, x_2, x_3, x_4$

x_{t+1} = \displaystyle \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in X} \sum_{i = 1}^t f_{i}(x)

UNSTABLE!
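
A standard way to see the instability of Follow the Leader is a sequence of alternating linear losses on an interval. The sketch below is an illustration under assumptions not on the slide: $X = [-1, 1]$, $f_t(x) = g_t x$ with $g_1 = 0.5$ and then $g_t$ alternating between $-1$ and $+1$, and $x_1 = 0$ chosen arbitrarily.

```python
import numpy as np

def adversarial_slopes(T):
    """Slopes g_1 = 0.5, then -1, +1, -1, ... (an assumed adversarial sequence)."""
    return np.array([0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)])

def follow_the_leader(slopes):
    """FTL on X = [-1, 1] for linear losses f_t(x) = g_t * x."""
    plays = np.zeros(len(slopes))   # plays[0] = x_1 = 0, an arbitrary start
    cum = 0.0                       # running sum of slopes seen so far
    for t, g in enumerate(slopes):
        if t > 0:
            # argmin over [-1, 1] of cum * x is the endpoint opposite in sign to cum
            plays[t] = -np.sign(cum)
        cum += g
    return plays

T = 100
slopes = adversarial_slopes(T)
plays = follow_the_leader(slopes)
player_loss = float(slopes @ plays)
best_fixed_loss = -abs(float(slopes.sum()))  # min over u in [-1, 1] of (sum of g_t) * u
ftl_regret = player_loss - best_fixed_loss
```

On this sequence the Player jumps between the two endpoints and pays 1 in every round after the first, so the regret grows linearly in T.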

Adding Regularization

Enemy plays $f_1, f_2, f_3, f_4$

Player plays $x_1, x_2, x_3, x_4$

x_{t+1} = \displaystyle \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R(x)

FTRL with a Fixed Regularizer $R$
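
To illustrate how a regularizer stabilizes the iterates, the sketch below runs FTRL on alternating linear losses over $X = [-1, 1]$. Everything specific here is an assumption for illustration: the loss sequence, the regularizer $R(x) = x^2/(2\eta)$ plus the constraint $x \in X$, and the step size $\eta = 1/\sqrt{T}$.

```python
import numpy as np

def ftrl_quadratic(slopes, eta):
    """FTRL on X = [-1, 1] with linear losses g_t * x and R(x) = x^2 / (2 * eta).

    The unconstrained minimizer of cum * x + x^2 / (2 * eta) is -eta * cum;
    in one dimension the constrained minimizer is its clipping to [-1, 1].
    """
    plays = np.zeros(len(slopes))
    cum = 0.0
    for t, g in enumerate(slopes):
        plays[t] = np.clip(-eta * cum, -1.0, 1.0)
        cum += g
    return plays

T = 100
# Assumed adversarial slopes: 0.5, then -1, +1, -1, ...
slopes = np.array([0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)])
eta = 1.0 / np.sqrt(T)
plays = ftrl_quadratic(slopes, eta)
player_loss = float(slopes @ plays)
best_fixed_loss = -abs(float(slopes.sum()))
ftrl_regret = player_loss - best_fixed_loss
```

With these choices the iterates stay within $\pm 0.05$ instead of jumping between the endpoints, and the regret is about 5.45 for T = 100, below $\sqrt{T} = 10$.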

Regret Upper-Bounds for Experts

Setting: $d$ Experts, $X = \Delta_d$, and $f_t(p) = y^T p,~\text{where}~y \in [-1,1]^d$

R(x) = \frac{1}{2} \lVert x \rVert_2^2 \Rightarrow \displaystyle \mathrm{Regret}_T \leq \sqrt{2 d T}

R(x) = \sum x_i \ln x_i \Rightarrow \displaystyle \mathrm{Regret}_T \leq 2 \sqrt{(\ln d) T}

Can we do better?
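
For the entropic regularizer on the simplex, FTRL has a closed form: the play is proportional to the exponential of the negated cumulative costs (often called Hedge or exponential weights). The sketch below assumes the regularizer is scaled by $1/\eta$ with $\eta = \sqrt{(\ln d)/T}$, a scaling not shown on the slide, and checks the $2\sqrt{(\ln d) T}$ bound on random costs. The function name and the random data are illustrative.

```python
import numpy as np

def hedge(costs, eta):
    """FTRL with (1/eta) * sum_i x_i ln x_i on the simplex, for linear costs."""
    T, d = costs.shape
    cum = np.zeros(d)
    plays = np.empty((T, d))
    for t in range(T):
        z = -eta * cum
        z -= z.max()              # shift for numerical stability
        w = np.exp(z)
        plays[t] = w / w.sum()    # first play is uniform, the minimizer of R
        cum += costs[t]
    return plays

rng = np.random.default_rng(0)
T, d = 1000, 4
costs = rng.uniform(-1.0, 1.0, size=(T, d))   # y_t in [-1, 1]^d
eta = np.sqrt(np.log(d) / T)
plays = hedge(costs, eta)

player_loss = float(np.sum(plays * costs))
best_expert_loss = float(costs.sum(axis=0).min())
regret = player_loss - best_expert_loss
bound = 2.0 * np.sqrt(np.log(d) * T)
```

Random costs are far from the worst case, so the measured regret is typically well below the bound; the bound itself holds for any cost sequence in $[-1,1]^d$ with this choice of $\eta$.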

Adding Adaptive Regularization

At round $t$ use regularizer $R_t$

x_{t+1} = \displaystyle \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R_{t+1}(x)

How do we get from $R_t$ to $R_{t+1}$? Add a Regularizer Increment $r_{t+1}$, which is a Convex Function:

R_{t+1} = R_t + r_{t+1}

R_{t+1} = r_1 + r_2 + \dotsc + r_{t+1}

AdaFTRL

x_{t+1} = \displaystyle \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R_{t+1}(x), \quad \text{where}~R_{t+1}(x) = \sum_{i = 1}^{t+1} r_{i}(x)

Efficiently computable? Not clear in general

Online Mirror Descent

Online (sub)Gradient Descent

Round $t + 1$: from $x_t \in X$, step along $- \nabla f_t(x_t)$ and then apply the projection onto $X$

x_{t+1} = \mathrm{Proj}_X(x_t - \nabla f_t(x_t))
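
A minimal sketch of the projected (sub)gradient step, assuming $X$ is the Euclidean ball of a given radius so that the projection has a closed form. The step size `eta` is an added assumption (the slide's update corresponds to `eta = 1`), and the function names are hypothetical.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd_step(x, grad, eta=1.0, radius=1.0):
    """One round: x_{t+1} = Proj_X(x_t - eta * grad f_t(x_t))."""
    return project_l2_ball(x - eta * grad, radius)

x_t = np.zeros(2)
grad = np.array([-3.0, -4.0])
x_next = ogd_step(x_t, grad)   # unprojected point (3, 4) is scaled back to norm 1
```

For other feasible sets only `project_l2_ball` would need to change; the update itself stays the same.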

Another Perspective

What is $\nabla f_t(x_t)$? It is the Representation of the derivative:

[Df_t(x_t)](u) = \langle \nabla f_t(x_t), u \rangle

This is the Directional derivative of $f_t$ at $x_t$ in the direction $u$

Online Gradient Descent Update

x_t - \nabla f_t(x_t)

x_t - Df_t(x_t)

Here $x_t$ is a point (primal) while $Df_t(x_t)$ is a functional (dual)

\langle x_t, \cdot \rangle - Df_t(x_t)

Here $\langle x_t, \cdot \rangle$ is a functional (dual) as well (Riesz Repr. Theorem)

Avoiding Inner-Product

R(x) = \frac{1}{2} \langle x, x\rangle \implies \nabla R(x) = x

\langle x_t, \cdot \rangle - Df_t(x_t) = D R (x_t) - Df_t(x_t)

x_t - \nabla f_t(x_t) = \nabla R (x_t) - \nabla f_t(x_t)

What if we make other choices for $R(x)$, instead of $\frac{1}{2}\lVert x\rVert_2^2$?

How to make projections w.r.t. $R(x)$?

Avoiding Inner-Product

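[Diagram: $\nabla R$ maps $x_t$ from the Primal space to $\nabla R(x_t)$ in the Dual space, where the step $- \nabla f_t(x_t)$ gives $\nabla R (x_t) - \nabla f_t(x_t)$, the representation of $D R (x_t) - Df_t(x_t)$. How do we map back to the Primal space? Through $\nabla R^{-1}$, also written $\nabla R^*$. Is the resulting point $\in X$?]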

Bregman Divergence

B_{R}(x,y) = R(x) -(R(y) + \langle \nabla R (y), x - y\rangle)

The term in parentheses is the 1st-order Taylor approximation of $R$ around $y$, evaluated at $x$

Bregman Projection

\Pi_X^R(y) = \mathop{\arg\,\min}_{x \in X} B_{R}(x,y)
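
The Bregman divergence formula can be checked numerically on two familiar regularizers: with $R(x) = \frac{1}{2}\lVert x \rVert_2^2$ it is half the squared Euclidean distance, and with $R(x) = \sum x_i \ln x_i$ on the simplex it is the Kullback-Leibler divergence. A sketch, with hypothetical helper names:

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """B_R(x, y) = R(x) - (R(y) + <grad R(y), x - y>)."""
    return R(x) - (R(y) + float(np.dot(grad_R(y), x - y)))

def half_sq_norm(x):
    return 0.5 * float(np.dot(x, x))

def grad_half_sq_norm(x):
    return x

def neg_entropy(x):
    return float(np.sum(x * np.log(x)))

def grad_neg_entropy(x):
    return np.log(x) + 1.0

# Two points in the interior of the simplex.
x = np.array([0.2, 0.3, 0.5])
y = np.array([0.4, 0.4, 0.2])

b_euclid = bregman(half_sq_norm, grad_half_sq_norm, x, y)
b_entropy = bregman(neg_entropy, grad_neg_entropy, x, y)
kl = float(np.sum(x * np.log(x / y)))
```

The divergence is not symmetric in general: swapping `x` and `y` in the entropic case gives a different value.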

Online Mirror Descent

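[Diagram: $\nabla R$ maps $x_t \in X$ from the Primal space to $\nabla R(x_t)$ in the Dual space; the step $- \nabla f_t(x_t)$ gives $y_{t+1}$; $\nabla R^*$ maps $y_{t+1}$ back into $\mathrm{int}(\mathrm{dom} R)$; the Bregman Projection $\Pi_X^R$ onto $X$ gives $x_{t+1}$.]

Adaptive? Use $R_{t+1}$ in place of $R$, for example $\nabla R_{t+1}(x_t)$: Adaptive!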

Lazy Online Mirror Descent

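[Diagram: as in Online Mirror Descent, except that the Dual iterates $y_{t-1}, y_t, y_{t+1}$ are accumulated directly, with the step $- \nabla f_t(x_t)$ taken from $y_t$ rather than from $\nabla R(x_t)$; $\nabla R^*$ maps $y_{t+1}$ into $\mathrm{int}(\mathrm{dom} R)$ and the Bregman Projection $\Pi_X^R$ onto $X$ gives $x_{t+1}$.]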

Classic Online Mirror Descent

Eager

First round: x_1 \in \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in X}~R(x)

For $t = 1, \dotsc, T$:

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))

Lazy

First round: x_1 \in \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in X}~R(x)

For $t = 1, \dotsc, T$:

y_{t+1} = y_t - \nabla f_t(x_t)

x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))

Let us use FTRL to unify them

Only proof sketch of the talk
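
The difference between the eager and lazy updates shows up as soon as a projection is active. The sketch below assumes the simplest mirror map, $R(x) = \frac{1}{2}x^2$ on $X = [-1, 1]$, so that $\nabla R$ and $\nabla R^*$ are the identity and the Bregman projection is clipping. It also assumes $y_1 = 0$ for the lazy variant and uses an illustrative gradient sequence.

```python
import numpy as np

def project(x):
    """Bregman projection onto X = [-1, 1] for R(x) = x^2 / 2 (plain clipping)."""
    return float(np.clip(x, -1.0, 1.0))

def eager_omd(grads):
    """Eager: y_{t+1} = grad R(x_t) - g_t, restarting from the projected point."""
    x = 0.0                 # x_1 minimizes R over X
    xs = [x]
    for g in grads:
        y = x - g           # grad R is the identity
        x = project(y)      # grad R^* is the identity, then project
        xs.append(x)
    return xs

def lazy_omd(grads):
    """Lazy: y_{t+1} = y_t - g_t, accumulating in the dual without projecting."""
    y = 0.0                 # assumed starting dual point
    xs = [0.0]
    for g in grads:
        y = y - g
        xs.append(project(y))
    return xs

grads = [-2.0, 1.0]
eager_points = eager_omd(grads)
lazy_points = lazy_omd(grads)
```

After the first step both variants sit at the boundary point 1. The eager variant then moves back to 0, while the lazy variant remembers that its dual iterate overshot to 2 and stays at 1.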

LOMD as FTRL

y_{t+1} = y_t - \nabla f_t(x_t) = y_{t-1} -\nabla f_{t-1}(x_{t-1}) - \nabla f_t(x_t) = \dotsb = \displaystyle -\sum_{i = 1}^{t} \nabla f_i(x_i)

\Pi_X^R(\nabla R^*(y_{t+1})) = ?

R_X = \begin{cases} R & \text{inside}~X \\ + \infty & \text{outside} \end{cases}

\nabla R_X^*(y) = \Pi_X^R(\nabla R^*(y))

\nabla R_X^*(y) = \mathop{\argmin}_{x \in \mathbb{E}} \langle - y, x\rangle + R_X(x)

FTRL:

\displaystyle x_{t+1} = \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

EOMD as FTRL

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

R_X = \begin{cases} R & \text{inside}~X \\ + \infty & \text{outside} \end{cases}

\nabla R_X (x_t) = ?

Subgradients:

\partial R_X(x_t) = \nabla R(x_t) + N_X(x_t)

where $N_X(x_t)$ is the Normal Cone of $X$ at $x_t$

EOMD as FTRL

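[Figure: graph of $R$ over an interval $X$ with $x_t$ at an endpoint, where $N_X(x_t) = [0, +\infty)$; the slopes $\nabla R(x_t)$ and $\nabla R(x_t) + N_X(x_t)$ are marked at $x_t$.]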

EOMD as FTRL

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

R_X = \begin{cases} R & \text{inside}~X \\ + \infty & \text{outside} \end{cases}

\partial R_X(x_t) = \nabla R(x_t) + N_X(x_t)

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

FTRL, for some

p_1 \in N_X(x_1), p_2 \in N_X(x_2), \dotsc, p_t \in N_X(x_t)

\displaystyle x_{t+1} = \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i, x \rangle + R_X(x)

EOMD vs LOMD

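Lazy:

\displaystyle x_{t+1} = \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

Eager:

\displaystyle x_{t+1} = \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i , x \rangle + R_X(x)

[Figure: the set $X$ with a boundary point $z_1$ and its normal cone $N_X(z_1)$, and an interior point $z_2$ with $N_X(z_2) = \{0\}$.]

x_i \in \mathrm{int}(\mathrm{dom}~R)~\text{and}~\mathrm{int}(\mathrm{dom}~R) \subseteq \mathrm{ri}~X \implies \text{the}~p_i~\text{terms do not affect the minimizer}

Eager = Lazy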

A Genealogy of Algorithms

A Bird's-eye View

Connection Among the Main Algorithms

With $R(x) = \frac{1}{2} \lVert x \rVert_2^2$:

x_{t+1} = \mathrm{Proj}_X(x_t - \nabla f_t(x_t))

With $R(x) = \sum x_i \ln x_i$ and $\mathrm{dom}~R = \Delta_d$:

\displaystyle (x_{t+1})(i) = \frac{x_t(i) e^{-\nabla f_t(x_t)(i)}}{K}

where $K$ is the normalizing constant that keeps $x_{t+1}$ in $\Delta_d$
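
For the entropic regularizer on the simplex, iterating the multiplicative update from the uniform distribution gives the same point as exponentiating the negated sum of all gradients and normalizing once, which is one concrete instance where the eager and lazy views coincide. A small sketch with illustrative gradient values and hypothetical function names:

```python
import numpy as np

def multiplicative_step(x, grad):
    """x_{t+1}(i) = x_t(i) * exp(-grad(i)) / K, with K the normalizer."""
    w = x * np.exp(-grad)
    return w / w.sum()

def softmax_of_negative_sum(grads):
    """Closed form: x proportional to exp(-sum_t grad_t), starting from uniform."""
    z = -np.sum(grads, axis=0)
    w = np.exp(z - z.max())    # shift for numerical stability
    return w / w.sum()

grads = np.array([[0.3, -0.2, 1.0],
                  [-0.5, 0.4, 0.1]])   # illustrative gradients for two rounds

x = np.full(3, 1.0 / 3.0)              # uniform start
for g in grads:
    x = multiplicative_step(x, g)

x_closed_form = softmax_of_negative_sum(grads)
```

The normalizers from each round collapse into a single one, which is why the two computations agree.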

AdaReg

y_{t+1} = x_t - H_{t+1} \nabla f_t(x_t)

x_{t+1} = \mathrm{Proj}_{H_{t+1}^{-1}}(y_{t+1})

\displaystyle G_{t} = \sum_{i = 1}^t \nabla f_i(x_i) \nabla f_i(x_i)^{\intercal}

\displaystyle H_{t+1} \in \mathop{\argmin}_{H \succ 0} \langle H, G_t\rangle + \Phi(H)

With $\Phi(H) = - \ln \det H$: \displaystyle H_{t+1} \approx G_t^{-1}

With $\Phi(H) = \mathrm{Tr}(H^{-1})$: \displaystyle H_{t+1} \approx G_t^{-\frac{1}{2}}

AdaReg

\displaystyle H_{t+1} \in \mathop{\argmin}_{H \succ 0} \langle H, G_t\rangle + \Phi(H)

\displaystyle \langle H, G_t\rangle = \sum_{i = 1}^t \mathrm{Tr}(H g_i g_i^T) = \sum_{i = 1}^t \lVert g_i \rVert_H^2

Minimize size of (sub)gradients VS Minimize "complexity" of $H$

FTRL:

\displaystyle H_{t+1} \in \mathop{\argmin}_{H \succ 0} \sum_{i = 1}^t \lVert g_i \rVert_H^2 + \Phi(H)
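
The two choices of $\Phi$ can be checked numerically through first-order optimality. The gradient of $\langle H, G\rangle - \ln\det H$ is $G - H^{-1}$, which vanishes at $H = G^{-1}$, and the gradient of $\langle H, G\rangle + \mathrm{Tr}(H^{-1})$ is $G - H^{-2}$, which vanishes at $H = G^{-1/2}$. The sketch below assumes $G$ is positive definite (true here for random gradients) and uses a hypothetical helper `sym_power`.

```python
import numpy as np

def sym_power(A, p):
    """A^p for a symmetric positive definite matrix, via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(A)
    return (vecs * vals ** p) @ vecs.T

rng = np.random.default_rng(0)
t, d = 10, 3
grads = rng.normal(size=(t, d))   # illustrative (sub)gradients g_1, ..., g_t
G = grads.T @ grads               # G_t = sum_i g_i g_i^T

# Candidate minimizers for the two regularizers.
H_logdet = sym_power(G, -1.0)     # Phi(H) = -ln det H
H_trace = sym_power(G, -0.5)      # Phi(H) = Tr(H^{-1})

# Stationarity residuals: both should be numerically zero.
resid_logdet = G - np.linalg.inv(H_logdet)
H_trace_inv = np.linalg.inv(H_trace)
resid_trace = G - H_trace_inv @ H_trace_inv

# <H, G_t> equals the sum of squared H-norms of the gradients.
inner = float(np.trace(H_trace @ G))
sum_sq_norms = float(np.einsum("ti,ij,tj->", grads, H_trace, grads))
```

Both objectives are convex over positive definite matrices, so a stationary point is a minimizer.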

Algorithms We Shall See

\text{AdaFTRL}

\text{AdaOMD}

\text{AdaDA}

\text{AdaReg}

\text{FTRL}

\text{EOMD}

\text{LOMD}

Generalizations and Special Cases

Limited Feedback: Bandit, two-point Bandit feedback

Special Cases: Combinatorial, other specific settings

Drop or Add Hypotheses: Convexity, adversarial enemies, side information

Change Metric: Policy Regret, Raw Loss

[Figure labels: Player, Hypercube, L2-Ball]

Mirror Maps

What if we make other choices for $R(x)$?

(i) $R$ is strictly convex and differentiable on $\mathrm{int}(\mathrm{dom} R)$

(ii) For every $y$ there is $\bar{y} \in \mathrm{int}(\mathrm{dom} R)$ such that $y = \nabla R(\bar{y})$ $\implies$ $\nabla R^{-1}(y) = \bar{y}$

(iii) Bregman Projections onto $X$ attained by $\mathrm{int}(\mathrm{dom} R)$: the Bregman Projector satisfies $\Pi_X^R(y) \in \mathrm{int}(\mathrm{dom} R)$ $\forall y \in \mathrm{int}(\mathrm{dom} R)$

Adaptive Online Mirror Descent

Mirror Map Increments:

R_{t+1} = R_t + r_{t+1}

R_{t+1} = r_1 + \dotsc + r_t + r_{t+1}

First round: x_1 \in \mathop{\mathrm{arg}\,\mathrm{min}}_{x \in X}~R_1(x)

Round $t+1$, for $t = 1, \dotsc, T$:

y_{t+1} = \nabla R_{t+1}(x_t) - \nabla f_t(x_t)

x_{t+1} = \Pi_X^{R_{t+1}} (\nabla R^*_{t+1}(y_{t+1}))
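
One simple instance of the adaptive update, given here as a sketch under assumptions that are not on the slides: take $R_{t+1}(x) = \frac{\sigma_{t+1}}{2}\lVert x \rVert_2^2$ with $\sigma_{t+1} = \sqrt{\sum_{i \le t} \lVert g_i \rVert_2^2}$, and let $X$ be the unit Euclidean ball. The increments are nonnegative multiples of $\frac{1}{2}\lVert x \rVert_2^2$, hence convex, and the update reduces to projected gradient descent with step size $1/\sigma_{t+1}$. The names `ada_omd` and `grad_fn` are hypothetical, and $x_1 = 0$ is a convention because $R_1$ is zero in this scheme.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the ball, which is also the Bregman projection
    for any positive multiple of the squared Euclidean norm."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ada_omd(grad_fn, T, dim):
    """Adaptive OMD with R_{t+1}(x) = (sigma_{t+1} / 2) * ||x||^2 (assumed choice)."""
    x = np.zeros(dim)                 # x_1 = 0 by convention
    sum_sq = 0.0                      # running sum of squared gradient norms
    xs = [x]
    for t in range(T):
        g = grad_fn(x, t)
        sum_sq += float(g @ g)
        sigma = np.sqrt(sum_sq)
        if sigma > 0:
            y = sigma * x - g                  # grad R_{t+1}(x_t) - grad f_t(x_t)
            x = project_l2_ball(y / sigma)     # grad R*_{t+1}(y) = y / sigma, then project
        xs.append(x)
    return np.array(xs)

# Illustrative fixed loss f_t(x) = 0.5 * ||x - c||^2 with c inside the ball.
c = np.array([0.3, -0.4])
iterates = ada_omd(lambda x, t: x - c, T=50, dim=2)
```

On this toy problem the gradients shrink as the iterates approach `c`, so $\sigma_t$ levels off and the iterates settle at `c`.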

Follow The Regularized Leader

Online Convex Optimization