## Online Convex Optimization

Learning, Duality, and Algorithms

Victor Sanches Portella

Supervisor: Marcel K. de Carli Silva

IME - USP

August, 2018

# Learning

### Online Learning (OL)

At each round, simultaneously:

- Nature reveals a query
- Player makes a prediction
- Player suffers a loss

Player and Enemy see each other's choices after the round.

### Examples of OL

Spam Filtering:

- Nature reveals emails
- Player is the spam filter
- Enemy is the user
- Loss of 1 if the filter is wrong

### Examples of OL

Experts:

- Player chooses an expert
- Enemy chooses the costs of the experts
- Loss is the cost of the chosen expert

### Statistical Learning and OL Comparison

Statistical learning: a probability distribution over Enemy and Nature, a training set, and the goal of high expected accuracy.

Online learning: the data arrives online, and the goal is low cumulative loss.

### Formalizing Online Learning

An Online Learning problem is a tuple

\mathcal{P} = (X, Y, D, L)

- X: query set
- Y: label set
- D: decision set
- L: loss function

| Problem | X | Y | D | L |
|---|---|---|---|---|
| Spam | Emails | Yes/No | Yes/No | Binary |
| Experts | — | [-1,1]^E | E | Expert's cost |

### Minimizing the Loss

Minimizing the loss directly is impossible: whatever the Player predicts (X), the Enemy can answer the opposite (Not X), forcing \text{LOSS} = T, the number of rounds.

Idea: minimize the regret against hypotheses h \colon \{\text{Questions}\} \to \{\text{Predictions}\}:

\mathrm{Regret}_T( h) = \displaystyle \sum_{t = 1}^T L(d_t, y_t) - \sum_{t = 1}^T L(h(x_t), y_t)
\mathrm{Regret}_T(\mathcal{H}) = \displaystyle \sup_{h \in \mathcal{H}} \mathrm{Regret}_T( h)

The second sum is the cost of the strategy h.

### Minimizing the Loss

Attaining sublinear regret is impossible in general (Cover, '67).

Idea: allow the Player to randomize her choices, drawing the decision d \sim \mathcal{D} from a probability distribution \mathcal{D}. The Enemy does not know the outcomes of the "coin flips". We then bound the regret with high probability or in expectation.

### Online Convex Optimization (OCO)

At each round, simultaneously:

- Player chooses a point x \in X
- Enemy chooses a convex function f
- Player suffers a loss f(x)

Player and Enemy see f~\text{and}~x after the round.

### Formalizing Online Convex Optimization

An Online Convex Optimization problem is a pair

\mathcal{C} = (X, \mathcal{F})

where X is a convex set and \mathcal{F} is a set of convex functions.

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)
\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

The second sum is the cost of always choosing u.

### OCO as a Special Case of OL

OCO is a special case of OL, and low-regret algorithms for OCO translate back to OL problems.


### From OL to OCO

An OL problem: Experts. Player chooses e \in E; Enemy chooses y \in [-1,1]^E.

Randomizing the Player's choice,

e \sim p \in \Delta_E \Longrightarrow \mathbb{E}[L(e,y)] = p^Ty

so we obtain an OCO problem on the simplex: Player chooses p \in \Delta_E and Enemy chooses f_y(p) = p^T y.
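The reduction above rests on the identity \mathbb{E}[L(e,y)] = p^T y. A quick numerical sanity check (not from the slides; the distribution and costs are randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_experts = 5
p = rng.dirichlet(np.ones(n_experts))   # Player's distribution over experts
y = rng.uniform(-1.0, 1.0, n_experts)   # Enemy's costs in [-1, 1]^E

# Monte Carlo estimate of E[L(e, y)] with e ~ p, where L(e, y) = y_e
draws = rng.choice(n_experts, size=200_000, p=p)
mc_estimate = y[draws].mean()

# The linear function the Enemy effectively plays in the OCO formulation
exact = p @ y
```

The two values agree up to Monte Carlo error, which is exactly why the randomized Experts problem becomes OCO with linear losses on the simplex.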

# Duality

### Function and Epigraph

f \colon \mathbb{E} \to \mathbb{R} \cup \{\pm \infty\}
\text{Convex if}~\mathrm{epi} (f)~\text{is convex}
\text{Epigraph}~\mathrm{epi} (f)

### Conjugate function

f^*(x^*) = \displaystyle \sup_{x \in \mathbb{E}}( \langle x^*, x \rangle - f(x))

and f^{**} = f for closed convex f.

Subgradients: g \in \partial f(\bar{x}) \Leftrightarrow f(z) \geq f(\bar{x}) + \langle g, z - \bar{x} \rangle for all z; geometrically, (g, -1) is a normal vector to \mathrm{epi}(f) at (\bar{x}, f(\bar{x})).

y \in \partial f(x) \Leftrightarrow x \in \partial f^*(y)
\Leftrightarrow x~\text{attains}~\sup_{z \in \mathbb{E}} \langle z, y \rangle - f(z) = f^*(y)

### Strongly convex and Strongly smooth Functions

Strongly convex (with respect to an arbitrary norm):

f(z) \geq f(x) + \langle u, z - x \rangle + \frac{\sigma}{2} \lVert x - z \rVert^2 \quad \forall z,~(u \in \partial f (x))

Strongly smooth:

f(z) \leq f(x) + \langle \nabla f(x), z - x \rangle + \frac{\beta}{2} \lVert x - z \rVert^2 \quad \forall z

### Strongly convex/Strongly smooth Duality

Theorem:

f~\sigma\text{-strongly convex with respect to}~\lVert\cdot \rVert \Leftrightarrow f^*~\frac{1}{\sigma}\text{-strongly smooth with respect to}~\lVert \cdot \rVert_*

where the dual norm is

\lVert x\rVert_* = \displaystyle \sup_{u \in \mathbb{E} \colon \lVert u \rVert \leq 1} \langle x, u\rangle

### Bregman Divergence and Projection

B_{\psi}(x,y) = \psi(x) -(\psi(y) + \langle \nabla \psi (y), x - y\rangle)

Bregman Divergence

Bregman Projection

\Pi_X^\psi(y) = \arg\,\min B_{\psi}(x,y)

1st-order Taylor

x \in X
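A minimal sketch of these definitions, assuming the negative-entropy regularizer (under which the Bregman divergence between distributions is the KL divergence, and the Bregman projection of a positive vector onto the simplex reduces to normalization):

```python
import numpy as np

def bregman_divergence(psi, psi_grad, x, y):
    # B_psi(x, y) = psi(x) - ( psi(y) + <grad psi(y), x - y> )
    return psi(x) - psi(y) - psi_grad(y) @ (x - y)

# Negative entropy on the positive orthant
psi = lambda x: np.sum(x * np.log(x))
psi_grad = lambda x: np.log(x) + 1.0

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.5, 0.25, 0.25])

# For distributions, B_psi coincides with the KL divergence
kl = np.sum(x * np.log(x / y))
b = bregman_divergence(psi, psi_grad, x, y)

# Bregman projection onto the simplex under negative entropy: normalization
z = np.array([0.3, 0.9, 0.6])
proj = z / z.sum()
```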

# Algorithms

### Follow the Regularized Leader (FTRL)

Given f_1, \ldots, f_t and a strongly convex regularizer R,

\mathrm{FTRL}_R(f_1, \ldots, f_t)\colon \quad x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_x \sum_{i = 1}^t f_{i}(x) + R(x)
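A minimal FTRL sketch (an illustrative special case, not the general algorithm): for linear losses f_i(x) = ⟨g_i, x⟩ with quadratic regularizer R(x) = ‖x‖²/(2η) and unconstrained X = ℝⁿ, the argmin has the closed form x_{t+1} = −η Σ g_i.

```python
import numpy as np

def ftrl_quadratic(gradients, eta):
    """FTRL for linear losses f_i(x) = <g_i, x> with R(x) = ||x||^2 / (2*eta),
    unconstrained: the minimizer of sum_i <g_i, x> + ||x||^2/(2*eta) is
    x_{t+1} = -eta * sum_i g_i."""
    return -eta * np.sum(gradients, axis=0)

rng = np.random.default_rng(1)
gs = rng.normal(size=(10, 3))   # subgradients g_1, ..., g_10
eta = 0.1
x = ftrl_quadratic(gs, eta)

# First-order optimality check: the FTRL objective's gradient vanishes at x
grad_at_x = np.sum(gs, axis=0) + x / eta
```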

### Bounding the Regret of FTRL

\displaystyle \mathrm{Regret}_T( u)\leq \sum_{t = 1}^T (f_t(x_t) - f_t(x_{t+1})) + R(u) - R(x_1)

Lemma (Kalai e Vempala, '05):

\leq \langle g_t, x_t - x_{t+1} \rangle \leq \lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)
\displaystyle \mathrm{Regret}_T(u)\leq \sum_{t = 1}^T\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert + R(u) - R(x_1)

Corollary:

### Bounding the Regret - Lipschitz Continuity

\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)

Stability between rounds

?

Lemma:

f~\text{is}~\rho\text{-Lipschitz continuous if}~\lvert f(x) - f(y)\rvert \leq \rho \lVert x - y\rVert
f~\text{is}~\rho\text{-Lipschitz continuous}\\ \Leftrightarrow \text{there is}~g \in \partial f(x)~\text{s.t.}~\lVert g\rVert_* \leq \rho


### Bounding the Regret - Using Duality

\psi~\sigma\text{-strongly-convex} \Rightarrow \psi^*~\frac{1}{\sigma}\text{-strongly-smooth}
x_{t} = \nabla \psi^* (-g_t)

(McMahan, '17)

\displaystyle\psi(x) \coloneqq \sum_{i =1}^t f_i(x) + R(x) - \langle g_t, x\rangle
\lVert \nabla \psi^*(0) - \nabla\psi^*(-g_t)\rVert \leq \frac{1}{\sigma}\lVert g_t\rVert_*
\mathrm{FTRL}_R(f_1, \ldots, f_t)
x_{t+1} = \nabla \psi^*(0)

### Bounding the Regret

Theorem:

\displaystyle \mathrm{Regret}_T( X) \leq \sigma \theta + \frac{1}{\sigma} \sum_{t = 1}^T \lVert g_t\rVert_*^2

where the "diameter" term is \theta = \displaystyle \sup_{u,v \in X} \frac{R(u) - R(v)}{\sigma}.

If f_1, \ldots, f_T are \rho\text{-Lipschitz}, a careful choice of \sigma yields

\displaystyle \mathrm{Regret}_T( X) \leq 2 \rho \sqrt{\theta T}

Theorem (Abernethy, Bartlett, Rakhlin, Tewari, '08):

\displaystyle \mathrm{Regret}_T( X) \ge\Omega(\sqrt{T})
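The "careful choice of σ" can be made explicit: bounding each ‖g_t‖_* by ρ and minimizing over σ (a one-line calculus step) gives

```latex
\min_{\sigma > 0}\ \sigma\theta + \frac{\rho^2 T}{\sigma}
\quad\text{is attained at}\quad
\sigma^* = \rho\sqrt{T/\theta},
\qquad\text{giving}\qquad
\sigma^*\theta + \frac{\rho^2 T}{\sigma^*} = 2\rho\sqrt{\theta T}.
```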

### Bounding the Regret -  Experts Example

\theta = \displaystyle \sup_{u,v \in X} \frac{R(u) - R(v)}{\sigma}
\mathcal{C} = (\Delta_E, \mathcal{F})
\Delta_E = \{p \in [0,1]^E \colon \sum p_e = 1 \}
f_t(p) = y^T p~\text{with}~y \in [-1,1]^E
U = \{e_i \colon i \in E\}
\displaystyle \mathrm{Regret}_T(U) \leq 2 \rho \sqrt{\theta T}

Randomized Experts

Best expert in hindsight

### Bounding the Regret -  Experts Example

\displaystyle \mathrm{Regret}_T(U) \leq 2 \rho \sqrt{\theta T}
R(x) = \frac{1}{2} \lVert x \rVert_2^2
\Rightarrow
\rho = \sqrt{\lvert E \rvert }
\lVert \cdot \rVert = \lVert \cdot \rVert_2 = \lVert \cdot \rVert_*
\displaystyle \mathrm{Regret}_T \leq \sqrt{2 \lvert E\rvert T}
R(x) = \sum x_i \ln x_i
\rho = 1
\displaystyle \mathrm{Regret}_T \leq 2 \sqrt{\ln \lvert E \rvert T}
\displaystyle\theta \leq \frac{1}{2}
\lVert \cdot \rVert = \lVert \cdot \rVert_1
\lVert \cdot \rVert_* = \lVert \cdot \rVert_{\infty}
\theta \leq \ln \lvert E \rvert
\Rightarrow
\Rightarrow
\Rightarrow
X = \Delta_E
f_t(p) = y^T p,~\text{where}~y \in [-1,1]^E
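As a sanity check (not from the slides), a short simulation of the entropic case, whose FTRL iterates are the exponential-weights (Hedge) update; the step size η = √(ln|E|/T) and the random costs are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
E, T = 10, 1000
eta = np.sqrt(np.log(E) / T)       # one standard tuning of the step size

weights = np.ones(E)
cum_costs = np.zeros(E)            # cumulative cost of each expert
player_loss = 0.0

for t in range(T):
    p = weights / weights.sum()        # entropic-FTRL iterate on the simplex
    y = rng.uniform(-1.0, 1.0, E)      # Enemy's costs in [-1, 1]^E
    player_loss += p @ y
    cum_costs += y
    weights *= np.exp(-eta * y)        # multiplicative weights update

regret = player_loss - cum_costs.min()  # vs. best expert in hindsight
bound = 2 * np.sqrt(np.log(E) * T)
```

The standard Hedge analysis guarantees regret ≤ ln|E|/η + ηT/2, which with this η is below the 2√(ln|E|·T) bound from the slide, whatever costs the Enemy plays.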

### FTRL is not Perfect

\mathrm{FTRL}_R(f_1, \ldots, f_t)\colon \quad x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min}_x \sum_{i = 1}^t f_{i}(x) + R(x)

Each FTRL step needs to solve an optimization problem. It would be desirable to have an algorithm that is clearly efficient to compute.

### Online Mirror Descent - Intuition

x_{t+1} = \Pi_X^{\psi}(\nabla \psi^*(y_{t+1}))
\text{Regularizer:}~\psi
y_{t+1} = \nabla \psi(x_t) - g_t, \text{where}~g_t \in \partial f_t(x_t)
\mathbb{E}
\mathbb{E}^*
x_{t}
\nabla \psi
\nabla \psi (x_t)
-g_t
y_{t+1}
\nabla \psi^* = (\nabla \psi)^{-1}
x_{t+1}
\Pi_X^\psi
X
\text{Linear forms in}~\mathbb{E}

### Online Mirror Descent

x_{t+1} = \Pi_X^{\psi}(\nabla \psi^*(y_{t+1}))
f_1, \ldots, f_t
\psi

Strongly Convex Regularizer

\mathrm{EOMD}_R(f_1, \ldots, f_T)
y_{t+1} = \nabla \psi(x_t) - g_t, \text{where}~g_t \in \partial f_t(x_t)
y_1 = 0
\text{for}~t = 1, \ldots, T-1

### OMD Examples

\displaystyle \psi(x) = \frac{1}{\eta 2} \lVert x \rVert_2^2
\lVert \cdot \rVert = \lVert \cdot \rVert_2 = \lVert \cdot \rVert_*
x_{t+1} = x_t - \eta g_t
\psi(x) = \displaystyle \frac{1}{\eta}\sum x_i \ln x_i
\lVert \cdot \rVert = \lVert \cdot \rVert_1 \\\lVert \cdot \rVert_* = \lVert \cdot \rVert_{\infty}
x_{t+1}(i) = x_t(i) e^{-\eta g_t(i)}
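A sketch of one step of each instantiation (assumptions: the Euclidean step is shown unprojected, i.e. X = ℝⁿ, and the entropic step on the simplex includes the normalization coming from the Bregman projection):

```python
import numpy as np

def ogd_step(x, g, eta):
    # psi(x) = ||x||^2 / (2*eta): the mirror step is plain gradient descent
    return x - eta * g

def entropic_step(p, g, eta):
    # psi(x) = (1/eta) * sum x_i ln x_i on the simplex: multiplicative update,
    # then the Bregman (KL) projection onto the simplex = normalization
    p = p * np.exp(-eta * g)
    return p / p.sum()

x = ogd_step(np.array([1.0, -2.0]), np.array([0.5, 0.5]), eta=0.1)
p = entropic_step(np.ones(3) / 3, np.array([1.0, 0.0, -1.0]), eta=0.5)
# Lower cost -> higher weight: p[2] > p[1] > p[0]
```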

[Algorithm map: online methods (Hedge, FTRL, LOMD, EOMD, FTPL, Online GD, Online Newton) and their offline counterparts (Mirror descent, Proximal methods, Quasi-Newton, Newton, Linear Coupling, Accelerated GD).]

# Future Directions


### Suggestion - Bandit Convex Optimization (BCO)

At each round, simultaneously:

- Player chooses a point x \in X
- Enemy chooses a convex function f
- Player suffers a loss f(x)

Unlike OCO, the Player observes only the loss value f(x), not the function f.

### Multi-armed Bandit Problem

At each round:

- Player chooses a machine
- Enemy chooses the costs
- Player suffers the loss of the machine he has chosen

This is the bandit version of the Experts problem, exhibiting the exploration vs. exploitation trade-off.

"Slot machine" icons by Freepik at www.flaticon.com

### Bandit Regret Bounds

For dimension d and T rounds:

Linear losses:

\tilde{O}(d\sqrt{T})~\text{and}~\tilde{\Omega}(d\sqrt{T})

(Dani, Hayes, Kakade, '12; Bubeck, Cesa-Bianchi, Kakade, '12)

General case:

\tilde{O}(d^{9.5}\sqrt{T}) (Bubeck, Eldan, Lee, '17)
\tilde{\Omega}(d\sqrt{T}) (Dani, Hayes, Kakade, '12)
\tilde{\Theta}(d^{1.5}\sqrt{T}) (Bubeck, Eldan, Lee, '17)

### Suggestion - Boosting

- A set of low-accuracy learners: weak learners
- Their combination generates a good model: a strong learner
- Usually performed in an incremental fashion
- Idea: use boosting outside of learning

### Example - Approximate Maximum Flows

- Electrical flows can be computed quickly: solve the Laplacian system Lx = b with a nearly-linear time Laplacian solver (Spielman and Teng, '04)
- Electrical flows may not respect capacities
- Idea: compute many electrical flows, penalizing violated edges via the Multiplicative Weights Update method


### Oracles

\mathrm{PLAYER} \colon \mathrm{Seq}(X) \times \mathrm{Seq}(Y) \to D
\mathrm{NATURE} \colon \mathbb{N} \to X
\mathrm{ENEMY} \colon \mathrm{Seq}(X) \times \mathrm{Seq}(D) \to Y

### Algorithm

Given \mathcal{P} = (X, Y, D, L):

\text{For}~t = 1, \ldots, T,~\text{do}
x_t \coloneqq \text{NATURE}(t)
d_t \coloneqq \text{PLAYER}((x_1,\ldots, x_{t}), (y_1,\ldots, y_{t-1}) )
y_t \coloneqq \text{ENEMY}((x_1,\ldots, x_{t}), (d_1,\ldots, d_{t-1}) )
l_t \coloneqq L(d_t, y_t)
\text{Return}~(x,d,y)
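The oracle protocol above can be sketched directly in Python. The toy oracles at the bottom are illustrative assumptions (a parity-guessing game), not part of the formalization:

```python
def run_online_learning(T, nature, player, enemy, loss):
    """The protocol above: at round t, PLAYER sees x_1..x_t and y_1..y_{t-1},
    ENEMY sees x_1..x_t and d_1..d_{t-1}."""
    xs, ds, ys, total = [], [], [], 0
    for t in range(1, T + 1):
        xs.append(nature(t))
        ds.append(player(tuple(xs), tuple(ys)))
        ys.append(enemy(tuple(xs), tuple(ds[:-1])))
        total += loss(ds[-1], ys[-1])
    return xs, ds, ys, total

# Toy instantiation (assumed for illustration): binary queries, 0/1 loss
nature = lambda t: t % 2                  # query is the round's parity
player = lambda xs, ys: xs[-1]            # predict the query itself
enemy = lambda xs, ds: xs[-1]             # true label equals the query
loss = lambda d, y: 0 if d == y else 1

xs, ds, ys, total = run_online_learning(10, nature, player, enemy, loss)
```

With these toy oracles the Player is always right, so the cumulative loss is 0; an adversarial ENEMY oracle would instead react to the Player's past decisions.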

### Indo de OL para OCO

\mathcal{P} = (\mathbb{R}^n, \mathbb{R}, \mathbb{R}, L)
\mathcal{H}

Um problema de OL: Regressão linear

conjunto das funções lineares

h(x) = \langle w, x\rangle

Inimigo

Natureza

x \in \mathbb{R}^n
d \in \mathbb{R}
y \in \mathbb{R}

## OCO

Inimigo

w \in \mathbb{R}
L(\langle\cdot, x \rangle, y)
\langle w,x \rangle =
L(d, y)

### Separation Theorem

Theorem: there exists a \in \mathbb{E}\setminus\{0\} such that

\displaystyle \sup_{s \in S} \langle a, s\rangle \leq \inf_{t \in T} \langle a, t\rangle

that is, a hyperplane \langle a, x \rangle = \beta separating the convex sets S and T.

### Limitando o regret - Usando dualidade

\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)

Lipschitz

?

Lema:

F, f~\text{convexas tais que}~F + f~\text{é}~\sigma\text{-strongly-convex}.
\text{Se}~\bar{x} \in \mathrm{arg}\,\mathrm{min}~F(x)~\text{e}~\bar{y} \in \mathrm{arg}\,\mathrm{min}~F(x) + f(x),~\text{então}
\displaystyle \lVert \bar{x} - \bar{y} \rVert \leq \frac{1}{\sigma} \lVert g \rVert_*
\forall g \in \partial f(\bar{x})
\psi~\sigma\text{-strongly-convex} \Rightarrow \psi^*~\frac{1}{\sigma}\text{-strongly-smooth},
\bar{x} = \nabla \psi^*(0), \bar{y} = \nabla \psi^* (b)

### Examples of LOMD

For \mathcal{C} = (X, \mathcal{F}), take the regularizer

\displaystyle R(x) = \begin{cases}\psi(x), & \text{if}~x \in X\\ +\infty, & \text{if}~x \not\in X \end{cases}

and iterate

y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \nabla R^*(y_{t+1})

Note that \nabla R^*(x^*)~\text{attains}~\displaystyle \sup_{x \in \mathbb{E}} \langle x^*, x\rangle - R(x) = \sup_{x \in X} \langle x^*, x\rangle - \psi(x) (which differs from \psi^*(x^*)), so \nabla R^*(x^*) \in X.

### Examples of LOMD

By the Lemma, \nabla R^*(x) = \Pi_X^\psi(\nabla \psi^*(x)). With \psi(x) = \frac{1}{2\eta} \lVert x\rVert_2^2, we get \nabla \psi^*(x) = \eta x and \Pi_X^\psi(x) = \displaystyle \arg\,\min_{z \in X} \lVert x - z\rVert_2, so LOMD becomes lazy projected gradient descent:

y_{t+1} = -\displaystyle\sum_{i = 1}^t g_i, \text{where}~g_i \in \partial f_i(x_i)
x_{t+1} = \displaystyle\arg\,\min_{x \in X} \lVert x - \eta y_{t+1}\rVert_2
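A one-step sketch of this lazy Euclidean instantiation. To keep the projection trivial I assume X = [0,1]^n, where the Euclidean projection is a coordinate-wise clip (an illustrative choice; a general X needs its own projection routine):

```python
import numpy as np

def lazy_omd_euclidean(grads, eta):
    """Lazy OMD with psi(x) = ||x||^2 / (2*eta) over X = [0, 1]^n:
    accumulate y = -sum_i g_i in the dual, map back via
    nabla psi*(y) = eta * y, then Euclidean-project (= clip) onto X."""
    y = -np.sum(grads, axis=0)
    return np.clip(eta * y, 0.0, 1.0)

# Single subgradient g_1 = (2, -3, 0.5): the unprojected point (-2, 3, -0.5)
# is clipped to the box, giving (0, 1, 0)
x = lazy_omd_euclidean(np.array([[2.0, -3.0, 0.5]]), eta=1.0)
```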

### Minimizing the Loss

Attaining sublinear regret is impossible in general (Cover, '67): with the two constant hypotheses h_0 (always "No") and h_1 (always "Yes"), the Enemy can always answer the opposite of the Player's prediction, so the Player's loss is T while the better of h_0, h_1 loses at most T/2, giving

\mathrm{Regret}_T(\mathcal{H}) \geq \displaystyle \frac{T}{2}

Idea: allow the player to randomize her choices. The Enemy does not know the outcomes of the "coin flips", and we want bounds with high probability or on the expectation.

### Algorithms for OCO

- There are algorithms for OCO with guaranteed sublinear regret
- They take inspiration from classic optimization
- They use the concepts of conjugate functions and subgradients
- The intuition depends on concepts from convex analysis

Theorem. The following are equivalent:

(a) y \in \partial f(x)
(b) x~\text{attains}~\sup_{z \in \mathbb{E}} \langle z, y \rangle - f(z) = f^*(y)
(c) x \in \partial f^*(y)
(d) y~\text{attains}~\sup_{z \in \mathbb{E}} \langle z, x \rangle - f^*(z) = f^{**}(x)


### Algorithm

\text{For}~t = 1, \ldots, T,~\text{do}
x_t \coloneqq \text{PLAYER}((f_1,\ldots, f_{t-1}) )
f_t \coloneqq \text{ENEMY}( (x_1,\ldots, x_{t-1}))
l_t \coloneqq f_t(x_t)

### Formalizing Online Convex Optimization

Oracles of Online Convex Optimization:

\mathrm{PLAYER} \colon \mathrm{Seq}(\mathcal{F}) \to X
\mathrm{ENEMY} \colon \mathrm{Seq}(X) \to \mathcal{F}

An Online Convex Optimization problem is a pair \mathcal{C} = (X, \mathcal{F}), where X is a convex set and \mathcal{F} is a set of convex functions.

### Online Mirror Descent - Intuition

The lazy version keeps the iterate in the dual space: with regularizer \psi and

R(x) = \begin{cases} \psi(x) &x \in X\\ +\infty &\text{o.w.} \end{cases}

the updates are

y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \Pi_X^\psi(\nabla\psi^*(y_{t+1})) = \nabla R^*(y_{t+1})

with \nabla \psi^* = (\nabla \psi)^{-1} and \Pi_X^\psi the Bregman projection onto X.

### Online Mirror Descent

x_{t+1} = \nabla R^*(y_{t+1})
f_1, \ldots, f_t
R

Regularizer

\mathrm{LOMD}_R(f_1, \ldots, f_t)
y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
y_1 = 0
\text{for}~t = 1, \ldots, T-1
