Learning, Duality, and Algorithms
Victor Sanches Portella
Advisor: Marcel K. de Carli Silva
IME - USP
May, 2019
At each round, SIMULTANEOUSLY:
- Player chooses a point (in a convex set)
- Enemy chooses a convex function
- Player suffers a loss: the Enemy's function evaluated at the Player's point
After the round, Player and Enemy see each other's choices.
An Online Convex Optimization Problem is given by:
- a convex set (the Player's possible points)
- a set of convex functions (the Enemy's possible choices)
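The round-by-round protocol can be sketched as a loop. This is a toy sketch of my own, not code from the talk; `FixedPlayer` and `SquareEnemy` are illustrative stand-ins for arbitrary strategies.

```python
import numpy as np

def play_oco(player, enemy, rounds):
    """One Online Convex Optimization game.

    player.choose() -> point x_t; enemy.choose() -> convex function f_t.
    The two choose() calls model simultaneous play: neither sees the
    other's current-round choice.  Both then observe the round.
    """
    total_loss = 0.0
    for t in range(rounds):
        x = player.choose()      # Player picks a point in the convex set
        f = enemy.choose()       # Enemy picks a convex function
        loss = f(x)              # Player suffers the loss f_t(x_t)
        player.observe(f)        # afterwards, each side sees the round
        enemy.observe(x)
        total_loss += loss
    return total_loss

class FixedPlayer:
    """Toy Player that always plays the same point."""
    def __init__(self, x):
        self.x = x
    def choose(self):
        return self.x
    def observe(self, f):
        pass

class SquareEnemy:
    """Toy Enemy that always plays the convex loss f(x) = (x - 1)^2."""
    def choose(self):
        return lambda x: (x - 1.0) ** 2
    def observe(self, x):
        pass
```

A fixed player at 0 against this enemy suffers loss 1 per round.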
Experts
The Player plays a probability distribution over the experts; the Enemy assigns a cost to each expert.
  Expert:         1     2     3     4
  Probabilities:  0.5   0.1   0.3   0.1
  Costs:          1     0    -1     1
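A standard Player strategy for the experts setting is multiplicative weights (Hedge). The sketch below runs one round on the numbers from the example above; the learning rate eta = 0.5 is an arbitrary choice of mine, not from the talk.

```python
import numpy as np

def hedge_update(p, costs, eta=0.5):
    """One multiplicative-weights (Hedge) step: reweight each expert by
    exp(-eta * cost) and renormalize back to a probability vector."""
    w = p * np.exp(-eta * np.asarray(costs, dtype=float))
    return w / w.sum()

# The numbers from the slide's example.
p = np.array([0.5, 0.1, 0.3, 0.1])   # Player's probabilities over 4 experts
c = np.array([1.0, 0.0, -1.0, 1.0])  # Enemy's costs for each expert

expected_cost = float(p @ c)          # Player's expected loss this round
p_next = hedge_update(p, c)           # mass shifts toward cheap experts
```

Here the expected loss is 0.5·1 + 0.1·0 + 0.3·(−1) + 0.1·1 = 0.3, and the update moves probability mass toward expert 3, whose cost was negative.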
Online Linear Regression
- Player chooses a regression function
- Enemy chooses a query and its answer
- Loss: the error of the Player's regression function on the query-answer pair
Regret: the Player's loss minus the cost of always choosing the best fixed point in hindsight.
Sublinear regret is attainable under mild conditions.
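Regret, as defined above, compares the Player's cumulative loss with the cost of the single best fixed choice in hindsight. A minimal check for the experts setting (a toy of mine, with made-up numbers):

```python
import numpy as np

def regret(player_losses, cost_matrix):
    """Regret = Player's cumulative loss minus the cumulative cost of
    the best single expert chosen in hindsight.

    player_losses: length-T array of the Player's per-round losses.
    cost_matrix:   T x n array, per-round cost of each of the n experts.
    """
    player_total = float(np.sum(player_losses))
    best_fixed = float(np.min(np.sum(cost_matrix, axis=0)))
    return player_total - best_fixed

# Two rounds, two experts: the Player loses 0.5 each round, while each
# expert costs 1 in one round and 0 in the other.
r = regret([0.5, 0.5], np.array([[1.0, 0.0],
                                 [0.0, 1.0]]))
```

In this example both fixed experts total cost 1.0, matching the Player's total, so the regret is 0.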
Focus of this talk: algorithms for the Player
Hopefully efficiently implementable
Fixed Regularizer: use the same regularizer at every round.
Regularizer Increments: at each round, add a new convex increment to the regularizer.
Efficiently implementable? Not clear in general.
Online Gradient Descent Update: step from the current point in the direction opposite the gradient, then project back onto the convex set.
What is the gradient? The derivative at a point is a linear functional; by the Riesz Representation Theorem, that functional is represented by a point, which gives the step direction.
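In Euclidean space the update above can be sketched directly; the unit-ball feasible set, the quadratic loss, and the step size here are illustrative choices of mine.

```python
import numpy as np

def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the L2 ball: rescale if outside."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def ogd_step(x, grad, eta, project=project_l2_ball):
    """Online Gradient Descent update: step against the gradient of the
    just-revealed loss, then project back onto the convex set."""
    return project(x - eta * grad)

# One round on f(x) = ||x - c||^2 over the unit ball, with c outside it.
c = np.array([2.0, 0.0])
x = np.array([0.0, 0.0])
grad = 2.0 * (x - c)             # gradient of f at the current point
x_next = ogd_step(x, grad, eta=0.5)
```

The unprojected step lands at (2, 0), outside the ball, and the projection pulls it back to the boundary point (1, 0).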
Mirror map: a strictly convex function, differentiable on the interior of its domain.
Bregman divergence: the mirror map at x minus its linear approximation at y, built from the directional derivative of the mirror map at y.
For every point there is a closest point of the set in Bregman divergence: Bregman projections onto the set are attained by the Bregman projector.
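A sketch of the Bregman divergence built from a mirror map R, under the definition above. The negative-entropy example is a standard illustration: there the divergence between probability vectors reduces to the Kullback-Leibler divergence.

```python
import numpy as np

def bregman_divergence(R, grad_R, x, y):
    """B_R(x, y) = R(x) - R(y) - <grad R(y), x - y>: how much R at x
    exceeds its linear approximation taken at y.  Nonnegative when R
    is convex."""
    return R(x) - R(y) - float(grad_R(y) @ (x - y))

# Mirror map R = negative entropy on the positive orthant.
neg_entropy = lambda p: float(np.sum(p * np.log(p)))
grad_neg_entropy = lambda p: np.log(p) + 1.0

p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
d = bregman_divergence(neg_entropy, grad_neg_entropy, p, q)
kl = float(np.sum(p * np.log(p / q)))   # KL divergence, for comparison
```

For probability vectors the linear terms cancel, so `d` and `kl` agree.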
Mirror Map Increments
First round: start at the minimizer of the mirror map.
Round t: gradient step in the mirror (dual) space, then Bregman projection back onto the set.
(Figure: the update in the cases where the iterate lands inside vs. outside the set.)
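On the probability simplex with the negative-entropy mirror map (the standard exponentiated-gradient special case), the gradient step acts multiplicatively and the Bregman projection back onto the simplex is just renormalization. A minimal sketch, with an illustrative gradient and step size of my own:

```python
import numpy as np

def md_simplex_step(x, grad, eta):
    """One mirror-descent step on the probability simplex with the
    negative-entropy mirror map: multiplicative gradient step in the
    dual (log) space, then Bregman projection = renormalization."""
    w = x * np.exp(-eta * grad)   # gradient step through the mirror map
    return w / w.sum()            # Bregman projection onto the simplex

x1 = np.full(3, 1.0 / 3.0)        # first round: minimizer of neg. entropy
g = np.array([1.0, 0.0, 0.0])     # gradient of the revealed loss
x2 = md_simplex_step(x1, g, eta=1.0)
```

The coordinate with the largest gradient loses mass; the other two stay tied.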
Variations on the problem:
- Limited Feedback: Bandit and two-point Bandit feedback
- Special Cases: Combinatorial and other specific settings (e.g. the Hypercube, the L2-Ball)
- Drop or Add Hypotheses: convexity, adversarial enemies, side information
- Change the Metric: Policy Regret, Raw Loss

Applications: Quantum Computing, Approximate Maximum Flow, Robust Optimization, Competitive Analysis, Spectral Sparsification, SDP Solvers, Oracle Boosting.

Directions: new ideas, new settings, a variational perspective.