Online Convex Optimization

Learning, Duality, and Algorithms

Victor Sanches Portella

Advisor: Marcel K. de Carli Silva

IME - USP

May, 2019

 Online Convex Optimization

Online Convex Optimization (OCO)

At each round

Player chooses a point x \in X

Enemy chooses a CONVEX function f

Player and Enemy choose SIMULTANEOUSLY

Player suffers a loss f(x)

Player and Enemy see f and x

Formalizing Online Convex Optimization

An Online Convex Optimization Problem \mathcal{C} = (X, \mathcal{F})

X: convex set

\mathcal{F}: set of convex functions

Rounds t = 1, \dotsc, T

Player: x_t \in X

Enemy: f_t \in \mathcal{F}

Expert's Problem

Player: Probabilities p \in \Delta_E over the Experts, e.g. p = (0.5, 0.1, 0.3, 0.1)

Enemy: Costs y \in [-1,1]^E of the Experts, e.g. y = (1, 0, -1, 1)

f(p) = y^{T}p = \mathbb{E}_{e \sim p}[y_e]
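As a small illustration, the expected cost f(p) = y^{T}p can be evaluated for the four-expert numbers shown on this slide. This is only a sketch: the function name `expert_loss` is hypothetical, and reading the first four numbers as probabilities and the last four as costs is an assumption based on the slide labels.

```python
# Expected cost of a randomized choice among four experts.
p = [0.5, 0.1, 0.3, 0.1]   # Player's probabilities over the experts (sum to 1)
y = [1.0, 0.0, -1.0, 1.0]  # Enemy's costs, each in [-1, 1]


def expert_loss(p, y):
    """Return f(p) = y^T p, the expected cost of sampling an expert from p."""
    return sum(pe * ye for pe, ye in zip(p, y))


loss = expert_loss(p, y)
print(loss)
```

The loss is linear in p, so its gradient with respect to p is just the cost vector y.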

Online Regression

Online Linear Regression

Player: Regression Function r_t, with r_t(x) = \langle w_t,x \rangle for a chosen w_t

Enemy: Query & Answer (x_t, y_t)

Loss: |r_t(x_t) - y_t|

f_t(w) = |\langle w, x_t \rangle - y_t |
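The loss f_t(w) = |\langle w, x_t \rangle - y_t| is convex but not differentiable where the prediction is exact, so gradient-based players work with a subgradient. A minimal sketch follows; the helper names are hypothetical and the choice of subgradient 0 at the kink is one valid option among many.

```python
def predict(w, x):
    """r(x) = <w, x>."""
    return sum(wi * xi for wi, xi in zip(w, x))


def abs_loss(w, x, y):
    """f(w) = |<w, x> - y|."""
    return abs(predict(w, x) - y)


def abs_loss_subgradient(w, x, y):
    """One subgradient of f at w: sign(<w, x> - y) * x (0 is used at the kink)."""
    residual = predict(w, x) - y
    sign = (residual > 0) - (residual < 0)
    return [sign * xi for xi in x]


w = [1.0, -2.0]
query, answer = [3.0, 1.0], 0.5
print(abs_loss(w, query, answer), abs_loss_subgradient(w, query, answer))
```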

Regret

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)

First sum: Player's Loss. Second sum: Cost of always choosing u.

\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

Goal: sublinear Regret

\displaystyle \lim_{T \to \infty} \frac{1}{T}\mathrm{Regret}_T( U) = 0
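The regret definitions translate directly into code. The sketch below assumes a finite comparator set U so that the supremum becomes a maximum, and uses made-up one-dimensional quadratic losses purely for illustration.

```python
def regret(losses, plays, u):
    """Regret_T(u): Player's total loss minus the cost of always choosing u."""
    player_loss = sum(f(x) for f, x in zip(losses, plays))
    comparator_loss = sum(f(u) for f in losses)
    return player_loss - comparator_loss


def regret_sup(losses, plays, comparators):
    """Regret_T(U) for a finite set U (assumption: sup taken over a finite list)."""
    return max(regret(losses, plays, u) for u in comparators)


# Illustrative losses f_t(x) = (x - c_t)^2; the targets c_t are hypothetical.
targets = [1.0, 0.0, 1.0]
losses = [lambda x, c=c: (x - c) ** 2 for c in targets]
plays = [0.0, 0.5, 0.5]

print(regret(losses, plays, 0.5))
print(regret_sup(losses, plays, [0.0, 0.5, 1.0]))
```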

Player Strategies

Sublinear regret under mild conditions

Focus of this talk: algorithms for the Player

Hopefully efficiently implementable

Algorithms We Shall See

\text{AdaFTRL}
\text{AdaOMD}
\text{AdaDA}
\text{AdaReg}
\text{FTRL}
\text{EOMD}
\text{LOMD}

Adaptive FTRL

Cumulative Loss

Cumulative loss of each of the four Experts after round t:

t = 1: 0, 1, 0.5, 1

t = 2: 1, 1.5, 0.5, 1

t = 3: 1.5, 2, 1, 1.5

t = 4: 2.5, 3, 2, 1.5
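To make the numbers concrete, the sketch below picks, after each round, the expert with the smallest cumulative loss. It assumes the slide's numbers are read as one row of four expert losses per round (each row listed just before its "t =" label), and it breaks ties toward the lowest index, which is an arbitrary choice.

```python
# Cumulative loss of each expert after round t, as read from the slide.
cumulative = {
    1: [0.0, 1.0, 0.5, 1.0],
    2: [1.0, 1.5, 0.5, 1.0],
    3: [1.5, 2.0, 1.0, 1.5],
    4: [2.5, 3.0, 2.0, 1.5],
}


def leader(losses):
    """Index of the expert with the smallest cumulative loss (first one on ties)."""
    return min(range(len(losses)), key=losses.__getitem__)


leaders = {t: leader(row) for t, row in cumulative.items()}
print(leaders)
```

Under this reading the best expert so far changes from round to round, which is what a player that simply follows the current best expert has to cope with.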

Follow the Leader

Enemy: f_1, f_2, f_3, f_4

Player: x_1, x_2, x_3, x_4

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in X} \sum_{i = 1}^t f_{i}(x)

UNSTABLE!
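A standard way to see the instability is a one-dimensional example with linear losses. The sketch below assumes X = [-1, 1], losses f_t(x) = c_t x with c_1 = 0.5 followed by alternating -1 and +1, and an arbitrary first play x_1 = 0; none of these choices come from the slides.

```python
def ftl_linear_interval(coeffs, x1=0.0):
    """Follow the Leader on X = [-1, 1] with linear losses f_t(x) = c_t * x."""
    plays = [x1]
    total = 0.0
    for c in coeffs[:-1]:
        total += c
        # The minimizer of total * x over [-1, 1] is an endpoint of the interval.
        plays.append(-1.0 if total > 0 else (1.0 if total < 0 else 0.0))
    return plays


T = 10
coeffs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
plays = ftl_linear_interval(coeffs)

player_loss = sum(c * x for c, x in zip(coeffs, plays))
best_fixed_loss = -abs(sum(coeffs))  # best fixed point in [-1, 1] in hindsight
ftl_regret = player_loss - best_fixed_loss
print(plays, ftl_regret)
```

The plays jump between the two endpoints and the Player pays 1 in every round after the first, so in this example the regret grows linearly with T.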

Adding Regularization

Enemy: f_1, f_2, f_3, f_4

Player: x_1, x_2, x_3, x_4

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R(x)

R: Fixed Regularizer

FTRL
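The effect of the regularizer can be checked on a one-dimensional example. This sketch assumes X = [-1, 1], linear losses f_t(x) = c_t x with c_1 = 0.5 followed by alternating -1 and +1, and R(x) = x^2 / (2 \eta) restricted to X with a hypothetical \eta = 0.1; for this R the FTRL minimizer is the unconstrained minimizer clipped to X.

```python
def ftrl_quadratic_interval(coeffs, eta):
    """FTRL on X = [-1, 1] with f_t(x) = c_t * x and R(x) = x^2 / (2 * eta)."""
    plays = [0.0]  # x_1 minimizes R over X
    total = 0.0
    for c in coeffs[:-1]:
        total += c
        # argmin over [-1, 1] of total * x + x^2 / (2 * eta) is clip(-eta * total).
        plays.append(max(-1.0, min(1.0, -eta * total)))
    return plays


T = 10
coeffs = [0.5] + [(-1.0 if t % 2 == 0 else 1.0) for t in range(2, T + 1)]
plays = ftrl_quadratic_interval(coeffs, eta=0.1)

player_loss = sum(c * x for c, x in zip(coeffs, plays))
ftrl_regret = player_loss - (-abs(sum(coeffs)))
largest_move = max(abs(a - b) for a, b in zip(plays, plays[1:]))
print(ftrl_regret, largest_move)
```

On this loss sequence the iterates move by at most about 0.1 per round, so the regularized player is much more stable than one that always jumps to the current minimizer.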

Adding Adaptive Regularization

At round t use regularizer R_t

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R_{t+1}(x)

From R_t to R_{t+1}?

R_{t+1} = R_t + r_{t+1}

R_{t+1} = r_1 + r_2 + \dotsc + r_{t+1}

r_{t+1}: Regularizer Increment, a Convex Function

AdaFTRL

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}} \sum_{i = 1}^t f_{i}(x) + R_{t+1}(x), where R_{t+1}(x) = \displaystyle \sum_{i = 1}^{t+1} r_{i}(x)

Efficiently computable?

Not clear in general
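In special cases the AdaFTRL update does have a closed form. The sketch below is one such case, under assumptions not taken from the slides: X = [-1, 1], linear losses f_t(x) = c_t x, and quadratic increments r_i(x) = (\sigma_i / 2) x^2 with \sigma_1 = 1 and \sigma_{i} = \sqrt{i} - \sqrt{i-1}, so that R_{t+1}(x) = (\sqrt{t+1} / 2) x^2 on X.

```python
import math


def adaftrl_interval(coeffs):
    """AdaFTRL sketch on X = [-1, 1] with f_t(x) = c_t * x and quadratic increments."""
    plays = [0.0]        # x_1 minimizes R_1(x) = (sigma_1 / 2) * x^2 over X
    total = 0.0          # running sum of the loss coefficients c_1 + ... + c_t
    strength = 1.0       # sigma_1 + ... + sigma_t, starts at sigma_1 = 1
    for t, c in enumerate(coeffs[:-1], start=1):
        total += c
        strength += math.sqrt(t + 1) - math.sqrt(t)  # add increment r_{t+1}
        # argmin over [-1, 1] of total * x + (strength / 2) * x^2
        plays.append(max(-1.0, min(1.0, -total / strength)))
    return plays


plays = adaftrl_interval([1.0, 1.0, 1.0, 1.0])
print(plays)
```

Each increment is convex because \sigma_i > 0, which matches the requirement that the regularizer increments be convex functions.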

Adaptive Online Mirror Descent

Online (sub)Gradient Descent

Round t: from x_t \in X, move in the direction - \nabla f_t(x_t), then project back onto X to get x_{t+1}

x_{t+1} = \mathrm{Proj}_X(x_t - \nabla f_t(x_t))
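One round of the update can be sketched for a concrete set. The code assumes X is a Euclidean ball centered at the origin and exposes a step size `eta`, which the slide does not show; `eta=1.0` recovers the update exactly as written.

```python
import math


def project_l2_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x||_2 <= radius}."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    if norm <= radius:
        return list(x)
    return [radius * xi / norm for xi in x]


def ogd_step(x, grad, eta=1.0, radius=1.0):
    """x_{t+1} = Proj_X(x_t - eta * grad), where grad is a (sub)gradient of f_t at x_t."""
    return project_l2_ball([xi - eta * gi for xi, gi in zip(x, grad)], radius)


x_next = ogd_step([0.5, 0.0], [-1.0, 0.0])
print(x_next)
```

In the usage line the unprojected point (1.5, 0) lies outside the unit ball, so the projection pulls it back to the boundary.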

Another Perspective

What is \nabla f_t(x_t)?

Representation of derivative: [Df_t(x_t)](u) = \langle \nabla f_t(x_t), u \rangle

Directional derivative of f_t at x_t in the direction u

Online Gradient Descent Update

x_t - \nabla f_t(x_t)

x_t - Df_t(x_t): point minus functional

\langle x_t, \cdot \rangle - Df_t(x_t): functional minus functional (Riesz Repr. Theorem)

Avoiding Inner-Product

R(x) = \frac{1}{2} \langle x, x\rangle = \frac{1}{2}\lVert x\rVert_2^2 \implies \nabla R(x) = x

\langle x_t, \cdot \rangle - Df_t(x_t) = D R (x_t) - Df_t(x_t)

x_t - \nabla f_t(x_t) = \nabla R (x_t) - \nabla f_t(x_t)

What if we make other choices for R(x)?

Mirror Maps

What if we make other choices for R(x)?

(i) R strictly convex and differentiable on \mathrm{int}(\mathrm{dom} R)

(ii) For every y there is \bar{y} \in \mathrm{int}(\mathrm{dom} R) such that y = \nabla R(\bar{y}) \implies \nabla R^{-1}(y) = \nabla R^*(y) = \bar{y}

(iii) Bregman Projections onto X attained by \mathrm{int}(\mathrm{dom} R): \Pi_X^R(y) \in \mathrm{int}(\mathrm{dom} R) \forall y \in \mathrm{int}(\mathrm{dom} R)

\Pi_X^R: Bregman Projector
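A classical choice other than the squared norm is the negative entropy R(x) = \sum_i x_i \log x_i on the positive orthant, with X the probability simplex. The sketch below is illustrative only and relies on standard facts that the slides do not state: \nabla R(x) = 1 + \log x, \nabla R^*(y) = \exp(y - 1), and the Bregman projection onto the simplex is a normalization.

```python
import math


def grad_R(x):
    """Gradient of the negative entropy: primal point -> dual point."""
    return [1.0 + math.log(xi) for xi in x]


def grad_R_star(y):
    """Gradient of the conjugate: dual point -> primal point (inverse of grad_R)."""
    return [math.exp(yi - 1.0) for yi in y]


def bregman_project_simplex(z):
    """Bregman projection of a positive vector onto the simplex for this R."""
    total = sum(z)
    return [zi / total for zi in z]


x = [0.5, 0.1, 0.3, 0.1]
round_trip = grad_R_star(grad_R(x))
projected = bregman_project_simplex([2.0, 1.0, 1.0])
print(round_trip, projected)
```

The round trip returning x illustrates condition (ii), and the projected point has strictly positive entries, in line with condition (iii).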

Online Mirror Descent

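Primal to Dual: x_t \mapsto \nabla R(x_t), via \nabla R

Dual step: y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

Dual to Primal: \nabla R^*(y_{t+1}) \in \mathrm{int}(\mathrm{dom} R), via \nabla R^*

Bregman Projection: x_{t+1} = \Pi_X^R(\nabla R^*(y_{t+1})) \in X

Adaptive?

Replace R by R_{t+1}: \nabla R_{t+1}(x_t), \nabla R_{t+1}^*, \Pi_X^{R_{t+1}}

Adaptive!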

Adaptive Online Mirror Descent

First round

x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R_1(x)

Round t+1, for t = 1, \dotsc, T

y_{t+1} = \nabla R_{t+1}(x_t) - \nabla f_t(x_t)

x_{t+1} = \Pi_X^{R_{t+1}} (\nabla R^*_{t+1}(y_{t+1}))

Mirror Map Increments

R_{t+1} = R_t + r_{t+1}

R_{t+1} = r_1 + \dotsc + r_t + r_{t+1}
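The rounds above can be sketched for the experts setting. The code assumes X is the probability simplex, linear losses with gradient g_t (the cost vector), and mirror maps R_t(x) = (1 / \eta_t) \sum_i x_i \log x_i with the hypothetical schedule \eta_t = 1 / \sqrt{t}; under these assumptions the update reduces to a multiplicative-weights step with a time-varying step size.

```python
import math


def ada_omd_entropy(cost_vectors):
    """Adaptive OMD sketch on the simplex with scaled negative-entropy mirror maps."""
    n = len(cost_vectors[0])
    x = [1.0 / n] * n  # x_1 minimizes R_1 over the simplex (uniform distribution)
    plays = [x]
    for t, g in enumerate(cost_vectors, start=1):
        eta = 1.0 / math.sqrt(t + 1)  # R_{t+1} = (1 / eta) * negative entropy
        # Dual step: y_{t+1} = grad R_{t+1}(x_t) - grad f_t(x_t)
        y = [(1.0 + math.log(xi)) / eta - gi for xi, gi in zip(x, g)]
        # Back to the primal: grad R_{t+1}^*(y) = exp(eta * y - 1)
        z = [math.exp(eta * yi - 1.0) for yi in y]
        # Bregman projection onto the simplex is a normalization for this mirror map
        total = sum(z)
        x = [zi / total for zi in z]
        plays.append(x)
    return plays


plays = ada_omd_entropy([[1.0, 0.0, -1.0, 1.0]] * 3)
print(plays[-1])
```

Since 1 / \eta_t is nondecreasing, each increment r_{t+1} = R_{t+1} - R_t is a nonnegative multiple of the negative entropy and hence convex.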

Lazy Online Mirror Descent

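Lazy: the dual step starts from y_t instead of \nabla R(x_t)

Dual step: y_{t+1} = y_t - \nabla f_t(x_t)

Dual to Primal: \nabla R^*(y_{t+1}) \in \mathrm{int}(\mathrm{dom} R), via \nabla R^*

Bregman Projection: x_{t+1} = \Pi_X^R(\nabla R^*(y_{t+1})) \in X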

Classic Online Mirror Descent

Eager

First round: x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R(x)

For t = 1, \dotsc, T

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))

Lazy

First round: x_1 \in \mathrm{arg}\,\mathrm{min}_{x \in X}~R(x)

For t = 1, \dotsc, T

y_{t+1} = y_t - \nabla f_t(x_t)

x_{t+1} = \Pi_X^R (\nabla R^*(y_{t+1}))
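The two variants differ only once the projection becomes active. A minimal sketch, assuming X = [-1, 1] and R(x) = \frac{1}{2} x^2 (so \nabla R and \nabla R^* are the identity and the Bregman projection is clipping), with y_1 = 0 taken as the starting dual point for the lazy variant:

```python
def clip(v, lo=-1.0, hi=1.0):
    """Euclidean projection onto the interval [lo, hi]."""
    return max(lo, min(hi, v))


def eager_omd(grads):
    """Eager: the dual step restarts from grad R(x_t) = x_t."""
    x = 0.0
    plays = [x]
    for g in grads:
        x = clip(x - g)
        plays.append(x)
    return plays


def lazy_omd(grads):
    """Lazy: the dual step continues from the unprojected y_t."""
    y = 0.0
    plays = [0.0]
    for g in grads:
        y = y - g
        plays.append(clip(y))
    return plays


grads = [-2.0, -2.0, 3.0]
print(eager_omd(grads))
print(lazy_omd(grads))
```

The lazy variant remembers how far the dual point overshot the set, so after the third gradient it is still at 1 while the eager variant has already moved to -1.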

LOMD as FTRL

y_{t+1} = y_t - \nabla f_t(x_t)

= y_{t-1} -\nabla f_{t-1}(x_{t-1}) - \nabla f_t(x_t)

= \dotsb = \displaystyle -\sum_{i = 1}^{t} \nabla f_i(x_i)

R_X = \begin{cases} R & \text{inside } X \\ + \infty & \text{outside} \end{cases}

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

\displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

FTRL
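The identity can be checked numerically in one dimension. This sketch assumes X = [-1, 1], R(x) = \frac{1}{2} x^2, and y_1 = 0, and it solves the FTRL problem by brute-force search over a grid, so the two sides agree only up to the grid spacing.

```python
def clip(v, lo=-1.0, hi=1.0):
    return max(lo, min(hi, v))


def lazy_omd_plays(grads):
    """x_2, ..., x_{T+1} from lazy OMD: y_{t+1} = y_t - g_t, x_{t+1} = clip(y_{t+1})."""
    y = 0.0
    plays = []
    for g in grads:
        y -= g
        plays.append(clip(y))
    return plays


def ftrl_plays_by_grid(grads, grid_size=2001):
    """x_{t+1} = argmin over a grid of [-1, 1] of (g_1 + ... + g_t) * x + x^2 / 2."""
    grid = [-1.0 + 2.0 * k / (grid_size - 1) for k in range(grid_size)]
    plays = []
    total = 0.0
    for g in grads:
        total += g
        plays.append(min(grid, key=lambda x: total * x + 0.5 * x * x))
    return plays


grads = [-2.0, -2.0, 3.0, 0.5]
lazy = lazy_omd_plays(grads)
ftrl = ftrl_plays_by_grid(grads)
print(lazy, ftrl)
```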

EOMD as FTRL

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

In LOMD the first term was y_t = y_{t-1} -\nabla f_{t-1}(x_{t-1}); in EOMD it is \nabla R(x_t)

R_X = \begin{cases} R & \text{inside } X \\ + \infty & \text{outside} \end{cases}

\partial R_X(x_t) = \nabla R(x_t) + N_X(x_t)

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

[Figure: the set X, a point x_t, and N_X(x_t)]

EOMD as FTRL

[Figure: R over a set X with x_t on the boundary of X, showing \nabla R(x_t), N_X(x_t) = [0, +\infty), and \nabla R(x_t) + N_X(x_t)]

EOMD as FTRL

y_{t+1} = \nabla R(x_t) - \nabla f_t(x_t)

R_X = \begin{cases} R & \text{inside } X \\ + \infty & \text{outside} \end{cases}

\partial R_X(x_t) = \nabla R(x_t) + N_X(x_t)

\nabla R_X^*(y_{t+1}) = \Pi_X^R(\nabla R^*(y_{t+1}))

\displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i, x \rangle + R_X(x)

p_1 \in N_X(x_1), p_2 \in N_X(x_2), \dotsc, p_t \in N_X(x_t)

FTRL
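The role of the vectors p_i can be traced in one dimension. The sketch assumes X = [-1, 1] and R(x) = \frac{1}{2} x^2, and takes p_{t+1} = y_{t+1} - x_{t+1}, the part of the dual point removed by the projection, as the element of N_X(x_{t+1}); with p_1 = 0 this choice gives y_{t+1} = -\sum_{i \le t} (\nabla f_i(x_i) + p_i).

```python
def clip(v, lo=-1.0, hi=1.0):
    return max(lo, min(hi, v))


def eager_with_normals(grads):
    """Eager OMD plays x_1, ..., x_{T+1} together with p_i in N_X(x_i)."""
    x = 0.0
    plays, normals = [x], [0.0]  # x_1 = 0 is interior, so p_1 = 0
    for g in grads:
        y = x - g                # y_{t+1} = grad R(x_t) - g_t with grad R(x) = x
        x = clip(y)
        plays.append(x)
        normals.append(y - x)    # p_{t+1}: what the projection removed
    return plays, normals


grads = [-2.0, -2.0, 3.0]
plays, normals = eager_with_normals(grads)

# FTRL form: x_{t+1} = argmin over [-1, 1] of sum_i (g_i + p_i) * x + x^2 / 2
ftrl_plays = [0.0] + [
    clip(-sum(g + p for g, p in zip(grads[:t], normals[:t])))
    for t in range(1, len(grads) + 1)
]
print(plays, normals, ftrl_plays)
```

In this run each p_i is nonnegative when x_i = 1 and nonpositive when x_i = -1, as membership in the normal cone of the interval requires.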

EOMD vs LOMD

Lazy: \displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i), x \rangle + R_X(x)

Eager: \displaystyle x_{t+1} = \mathrm{arg}\,\mathrm{min}_{x \in \mathbb{E}}~ \sum_{i=1}^t \langle \nabla f_i(x_i) + p_i , x \rangle + R_X(x)

[Figure: the set X with a point z_1 and its cone N_X(z_1), and a point z_2 with N_X(z_2) = \{0\}]

x_i \in \mathrm{int}(\mathrm{dom}~R) and \mathrm{int}(\mathrm{dom}~R) \subseteq \mathrm{ri}~X \implies the p_i terms drop out \implies Eager = Lazy

A Genealogy of Algorithms

Connection Among the Main Algorithms

AdaReg

y_{t+1} = x_t - H_{t+1} \nabla f_t(x_t)

x_{t+1} = \mathrm{Proj}_{H_{t+1}^{-1}}(y_{t+1})

\displaystyle G_{t} = \sum_{i = 1}^t \nabla f_i(x_i) \nabla f_i(x_i)^{\intercal}

\displaystyle H_{t+1} \approx G_t^{-1} or \displaystyle H_{t+1} \approx G_t^{-\frac{1}{2}}

Second Order Algorithms?
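A simplified instance of this update keeps only the diagonal of G_t. The sketch below is an assumption-heavy illustration, not the AdaReg algorithm itself: it uses the choice H_{t+1} \approx G_t^{-1/2} restricted to the diagonal, takes X to be a box, adds a small `eps` to avoid division by zero, and uses the fact that for a diagonal H the projection onto a box in the H^{-1} norm is coordinate-wise clipping.

```python
import math


def adagrad_diag_step(x, grad, sq_sum, lo=-1.0, hi=1.0, eps=1e-12):
    """One diagonal step: y = x - H * grad with H = diag(G)^(-1/2), then clip to the box."""
    # Diagonal of G_t = sum_i g_i g_i^T is the running sum of squared gradients.
    new_sq_sum = [s + g * g for s, g in zip(sq_sum, grad)]
    x_next = []
    for xi, gi, si in zip(x, grad, new_sq_sum):
        h = 1.0 / (math.sqrt(si) + eps)  # diagonal entry of H_{t+1}
        x_next.append(max(lo, min(hi, xi - h * gi)))
    return x_next, new_sq_sum


x_next, sq_sum = adagrad_diag_step([0.0, 0.0], [2.0, 0.5], [0.0, 0.0], lo=-5.0, hi=5.0)
print(x_next, sq_sum)
```

In the usage line both coordinates move by about the same amount even though their gradients differ by a factor of four, which is the per-coordinate rescaling that the matrix H_{t+1} provides.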

A Bird's-eye View

Future Directions

Generalizations and Special Cases

Limited Feedback: Bandit, two-point Bandit feedback

Special Cases: Combinatorial, other specific settings

Drop or Add Hypotheses: Convexity, adversarial enemies, side information

Change Metric: Policy Regret, Raw Loss

[Figure: Player, Hypercube, L2-Ball]

OCO in Other Areas

Quantum Computing

Approximately Maximum Flow

Robust Optimization

Competitive Analysis

Spectral Sparsification

SDP Solver

Oracle Boosting

Ideas

New Setting

Variational Perspective
