Online Convex Optimization

Learning, Duality, and Algorithms

Victor Sanches Portella

Supervisor: Marcel K. de Carli Silva

IME - USP

August, 2018

Learning

Motivation

Yes

No

Spam?

Adaptive

Online Learning (OL)

At each round

Nature reveals a query

Player makes a prediction

Enemy picks the "true answer"

Player suffers a loss

(Player and Enemy choose SIMULTANEOUSLY)

Player and Enemy see each other's choices

[Diagram: Nature poses the query (?); Player and Enemy each respond (!)]

Examples of OL

Spam Filtering

Nature reveals emails

Player is the spam filter

Enemy is the user

[Diagram: Nature poses the query (?); Player and Enemy each respond (!)]

Loss of 1 if the filter is wrong

Examples of OL

Prediction with Expert Advice

Nature reveals the experts' advice

Player chooses an expert

Enemy chooses the costs of the experts

[Diagram: Nature poses the query (?) using the Experts; Player and Enemy each respond (!)]

Loss is the cost of the chosen expert

Statistical Learning and OL Comparison

Statistical Learning: probability distribution over enemy and nature; training set; expected accuracy

Online Learning: adversarial enemy and nature; online; cumulative loss

Formalizing Online Learning

An Online Learning Problem

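\mathcal{P} = (X, Y, D, L)

X:~\text{query set}
Y:~\text{label set}
D:~\text{decision set}
L:~\text{loss function}

\text{Spam:}~X = \text{Emails},~Y = \text{Yes/No},~D = \text{Yes/No},~L = \text{Binary}
\text{Experts:}~X = \{\text{Advice}\}^E,~Y = [-1,1]^E,~D = E,~L = \text{Expert's cost}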

Minimizing the Loss

Minimizing the cost is impossible

[Diagram: whenever the Player predicts X, the Enemy answers Not X, so \text{LOSS} = T]

Idea: minimize the regret

\mathrm{Regret}_T( h) = \displaystyle \sum_{t = 1}^T L(d_t, y_t) - \sum_{t = 1}^T L(h(x_t), y_t)
\mathrm{Regret}_T(\mathcal{H}) = \displaystyle \sup_{h \in \mathcal{H}} \mathrm{Regret}_T( h)

T:~\text{\# of rounds}
h \colon \{\text{Questions}\} \to \{\text{Predictions}\}:~\text{hypothesis}
\sum_{t = 1}^T L(h(x_t), y_t):~\text{cost of the strategy}~h
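A minimal sketch of the two regret definitions, assuming a finite hypothesis class so that the sup becomes a max. The names `regret`, `regret_class`, `zero_one`, and the toy data are illustrative and not from the slides.

```python
def regret(decisions, labels, queries, hypothesis, loss):
    """Regret_T(h): player's cumulative loss minus the cumulative loss of h."""
    player_loss = sum(loss(d, y) for d, y in zip(decisions, labels))
    h_loss = sum(loss(hypothesis(x), y) for x, y in zip(queries, labels))
    return player_loss - h_loss


def regret_class(decisions, labels, queries, hypotheses, loss):
    """Regret_T(H) for a finite class H: the sup is a max."""
    return max(regret(decisions, labels, queries, h, loss) for h in hypotheses)


def zero_one(d, y):
    """Binary loss: 1 if the prediction is wrong, 0 otherwise."""
    return 0 if d == y else 1


# Toy run (hypothetical data): 4 rounds, labels 1 = "Yes", 0 = "No".
queries = ["a", "b", "c", "d"]
labels = [1, 0, 1, 1]
decisions = [0, 0, 1, 0]          # the player errs in rounds 1 and 4
always_yes = lambda x: 1
always_no = lambda x: 0

class_regret = regret_class(decisions, labels, queries,
                            [always_yes, always_no], zero_one)
print(class_regret)
```

The regret is measured against the best fixed hypothesis in hindsight, so it can be small even when the cumulative loss itself is large.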

Minimizing the Loss

Attaining sublinear regret is impossible in general  (Cover '67)

Idea: allow the player to randomize his choices

Enemy does not know the outcomes of the "coin flips"

Bounds on regret with high probability or on the expectation

[Diagram: the Player samples d \sim \mathcal{D} from a probability distribution \mathcal{D}; the Enemy's Simulated Player knows \mathcal{D} but not the sampled d]

Online Convex Optimization (OCO)

At each round

\text{Player chooses a point}~x \in X

\text{Enemy chooses a CONVEX function}~f

\text{Player suffers a loss}~f(x)

(Player and Enemy choose SIMULTANEOUSLY)

\text{Player and Enemy see}~f~\text{and}~x

[Diagram: Player and Enemy each respond (!)]

Formalizing Online Convex Optimization

An Online Convex Optimization Problem

\mathcal{C} = (X, \mathcal{F})
X:~\text{convex set}
\mathcal{F}:~\text{set of convex functions}

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)
\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

\sum_{t = 1}^T f_t(u):~\text{cost of always choosing}~u

OCO and OL Relations

Online Convex Optimization

Online Learning

Special Case of OL

Low-regret Algorithms

From OL to OCO

An OL problem: Experts

OL

\text{Nature:}~x = \text{experts' advice}
\text{Player:}~e \in E
\text{Enemy:}~y \in [-1,1]^E

Randomized

e \sim p \in \Delta_E
\Longrightarrow
\mathbb{E}[L(e,y)] = p^Ty

OCO

\text{Player:}~p \in \Delta_E~\text{(simplex)}
\text{Enemy:}~f_y(p) = p^T y
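The reduction above relies on the expected cost of a randomized expert choice being the linear function p^T y. A small sketch checks this by sampling; the vectors `p` and `y`, the seed, and the sample size are arbitrary illustrative choices.

```python
import random


def expected_loss(p, y):
    """f_y(p) = p^T y: expected cost when expert e is drawn with probability p[e]."""
    return sum(pe * ye for pe, ye in zip(p, y))


def sampled_loss(p, y, rng):
    """One randomized OL round: draw e ~ p, then pay the cost y[e]."""
    e = rng.choices(range(len(p)), weights=p, k=1)[0]
    return y[e]


p = [0.5, 0.25, 0.25]      # a point of the simplex Delta_E (hypothetical)
y = [1.0, -1.0, 0.0]       # costs in [-1, 1]^E (hypothetical)
rng = random.Random(0)
n = 100_000
empirical = sum(sampled_loss(p, y, rng) for _ in range(n)) / n
print(expected_loss(p, y), empirical)
```

Because f_y is linear in p, it is convex, which is what lets the randomized experts problem be treated as an OCO problem over the simplex.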

Duality

Function and Epigraph

f \colon \mathbb{E} \to \mathbb{R} \cup \{\pm \infty\}
\text{Convex if}~\mathrm{epi} (f)~\text{is convex}
\text{Epigraph}~\mathrm{epi} (f)

Conjugate function

f^*(x^*) = \displaystyle \sup_{x \in \mathbb{E}}( \langle x^*, x \rangle - f(x))
f^{**} = f
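As a numerical illustration, the conjugate can be approximated by maximizing over a finite grid. The sketch below assumes the one-dimensional function f(x) = x^2/2, whose conjugate is f^*(y) = y^2/2; the grid and the helper name `conjugate` are illustrative. The identity f^{**} = f holds for closed convex functions such as this one.

```python
def conjugate(f, grid):
    """Approximate f*(y) = sup_x (y * x - f(x)) by a max over a finite grid."""
    return lambda y: max(y * x - f(x) for x in grid)


grid = [i / 20 for i in range(-100, 101)]    # 201 points in [-5, 5]
f = lambda x: 0.5 * x * x
f_star = conjugate(f, grid)                  # approximates y^2 / 2 on the grid
f_star_star = conjugate(f_star, grid)        # approximates f again

print(f(1.5), f_star(1.5), f_star_star(1.5))
```

The approximation is only meaningful for arguments whose maximizer lies inside the grid; outside that range the grid maximum underestimates the true supremum.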

Subgradients

g \in \partial f(\bar{x}) \Leftrightarrow f(z) \geq f(\bar{x}) + \langle g, z - \bar{x} \rangle~\forall z \in \mathbb{E}

[Figure: hyperplanes supporting \mathrm{epi}(f), with normals g \oplus -1 and \nabla f(\bar{y}) \oplus -1]

Conjugate Functions and Subgradients

y \in \partial f(x) \Leftrightarrow x \in \partial f^*(y)
\Leftrightarrow x~\text{attains}\sup_{z \in \mathbb{E}} \langle z, y \rangle - f(z) = f^*(y)

Strongly convex and Strongly smooth Functions

Strongly convex

\forall z,~f(z) \geq f(x) + \langle u, z - x \rangle + \frac{\sigma}{2} \lVert x - z \rVert^2 \quad (u \in \partial f (x))

Strongly smooth

\forall z,~f(z) \leq f(x) + \langle \nabla f(x), z - x \rangle + \frac{\beta}{2} \lVert x - z \rVert^2

ARBITRARY

Strongly convex/Strongly smooth Duality

Theorem

f~\sigma\text{-strongly convex with respect to}~\lVert\cdot \rVert
\Leftrightarrow f^*~\frac{1}{\sigma}\text{-strongly smooth with respect to}~\lVert \cdot \rVert_*

Dual norm

\lVert x\rVert_* = \displaystyle \sup_{u \in \mathbb{E} \colon \lVert u \rVert \leq 1} \langle x, u\rangle

Bregman Divergence and Projection

Bregman Divergence

B_{\psi}(x,y) = \psi(x) -(\psi(y) + \langle \nabla \psi (y), x - y\rangle)

\psi(y) + \langle \nabla \psi (y), x - y\rangle:~\text{1st-order Taylor expansion of}~\psi~\text{at}~y

Bregman Projection

\Pi_X^\psi(y) = \arg\,\min_{x \in X} B_{\psi}(x,y)
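A short sketch of the Bregman divergence for two common choices of \psi. With \psi(x) = \frac{1}{2}\lVert x \rVert_2^2 it equals half the squared Euclidean distance, and with \psi(x) = \sum x_i \ln x_i on the simplex it equals the KL divergence. The helper names and test points are illustrative.

```python
import math


def bregman(psi, grad_psi, x, y):
    """B_psi(x, y) = psi(x) - psi(y) - <grad psi(y), x - y>."""
    inner = sum(g * (xi - yi) for g, xi, yi in zip(grad_psi(y), x, y))
    return psi(x) - psi(y) - inner


# psi(x) = 1/2 ||x||_2^2
sq = lambda x: 0.5 * sum(v * v for v in x)
grad_sq = lambda x: list(x)

# psi(x) = sum_i x_i ln x_i (negative entropy), defined for positive x
negent = lambda x: sum(v * math.log(v) for v in x)
grad_negent = lambda x: [math.log(v) + 1.0 for v in x]

x = [0.7, 0.2, 0.1]     # two points of the simplex (hypothetical)
y = [0.2, 0.5, 0.3]

half_sq_dist = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
kl = sum(a * math.log(a / b) for a, b in zip(x, y))
print(bregman(sq, grad_sq, x, y), half_sq_dist)
print(bregman(negent, grad_negent, x, y), kl)
```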

Algorithms

Follow the Regularized Leader

\mathrm{FTRL}_R(f_1, \ldots, f_t)

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min} \sum_{i = 1}^t f_{i}(x) + R(x)

f_1, \ldots, f_t:~\text{already-seen functions}
R:~\text{strongly convex regularizer}
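In general each FTRL step is an optimization problem, but some cases have a closed form. The sketch below assumes linear losses f_i(x) = \langle g_i, x \rangle, the simplex as domain, and the scaled entropic regularizer R(x) = \frac{1}{\eta}\sum x_i \ln x_i; under these assumptions the arg min is proportional to \exp(-\eta \sum_i g_i). The function names and the parameter `eta` are illustrative.

```python
import math


def ftrl_entropic(loss_vectors, n, eta):
    """FTRL iterate on the simplex for linear losses <g_i, x> and
    R(x) = (1/eta) * sum_i x_i ln x_i (closed-form arg min)."""
    cumulative = [sum(g[i] for g in loss_vectors) for i in range(n)]
    shift = min(cumulative)                       # for numerical stability only
    weights = [math.exp(-eta * (c - shift)) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def ftrl_objective(x, loss_vectors, eta):
    """sum_i f_i(x) + R(x) for the same losses and regularizer."""
    linear = sum(sum(gi * xi for gi, xi in zip(g, x)) for g in loss_vectors)
    return linear + sum(xi * math.log(xi) for xi in x) / eta


seen = [[1.0, 0.0, -1.0], [0.5, 0.5, -0.5]]       # hypothetical g_1, g_2
x_next = ftrl_entropic(seen, n=3, eta=0.5)
print(x_next, ftrl_objective(x_next, seen, eta=0.5))
```

Coordinates with lower cumulative cost receive more weight, and with no losses seen the iterate is the uniform distribution.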

Bounding the Regret of FTRL

Lemma (Kalai and Vempala, '05):

\displaystyle \mathrm{Regret}_T( u)\leq \sum_{t = 1}^T (f_t(x_t) - f_t(x_{t+1})) + R(u) - R(x_1)

f_t(x_t) - f_t(x_{t+1}) \leq \langle g_t, x_t - x_{t+1} \rangle \leq \lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert,~\text{where}~g_t \in \partial f_t(x_t)

Corollary:

\displaystyle \mathrm{Regret}_T(u)\leq \sum_{t = 1}^T\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert + R(u) - R(x_1)

Bounding the Regret - Lipschitz Continuity

\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)

Stability between rounds

?

Lemma:

f~\text{is}~\rho\text{-Lipschitz continuous if}~\lvert f(x) - f(y)\rvert \leq \rho \lVert x - y\rVert
f~\text{is}~\rho\text{-Lipschitz continuous}\\ \Leftrightarrow \text{there is}~g \in \partial f(x)~\text{s.t.}~\lVert g\rVert_* \leq \rho

Bounding the Regret

\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)

Stability between rounds

Lipschitz

?

Bounding the Regret - Using Duality

(McMahan, '17)

\displaystyle\psi(x) \coloneqq \sum_{i =1}^t f_i(x) + R(x) - \langle g_t, x\rangle
x_{t} = \nabla \psi^*(0)
x_{t+1} = \nabla \psi^* (-g_t) = \mathrm{FTRL}_R(f_1, \ldots, f_t)
\psi~\sigma\text{-strongly-convex} \Rightarrow \psi^*~\frac{1}{\sigma}\text{-strongly-smooth}
\lVert \nabla \psi^*(0) - \nabla\psi^*(-g_t)\rVert \leq \frac{1}{\sigma}\lVert g_t\rVert_*

Bounding the Regret

\displaystyle \mathrm{Regret}_T( X) \leq \sigma \theta + \frac{1}{\sigma} \sum_{t = 1}^T \lVert g_t\rVert_*^2
\theta = \displaystyle \sup_{u,v \in X} \frac{R(u) - R(v)}{\sigma} \quad \text{(diameter)}

f_1, \ldots, f_T~\rho\text{-Lipschitz and careful choice of}~\sigma \Rightarrow
\displaystyle \mathrm{Regret}_T( X) \leq 2 \rho \sqrt{\theta T}

Theorem (Abernethy, B, R, T, '08):

\displaystyle \mathrm{Regret}_T( X) \geq \Omega(\sqrt{T})

Bounding the Regret -  Experts Example

\theta = \displaystyle \sup_{u,v \in X} \frac{R(u) - R(v)}{\sigma}
\text{Randomized Experts:}~\mathcal{C} = (\Delta_E, \mathcal{F})
\Delta_E = \{p \in [0,1]^E \colon \sum p_e = 1 \}
f_t(p) = y^T p~\text{with}~y \in [-1,1]^E
\text{Best expert in hindsight:}~U = \{e_i \colon i \in E\}
\displaystyle \mathrm{Regret}_T(U) \leq 2 \rho \sqrt{\theta T}

Bounding the Regret -  Experts Example

X = \Delta_E
f_t(p) = y^T p,~\text{where}~y \in [-1,1]^E
\displaystyle \mathrm{Regret}_T(U) \leq 2 \rho \sqrt{\theta T}

R(x) = \frac{1}{2} \lVert x \rVert_2^2,~\lVert \cdot \rVert = \lVert \cdot \rVert_2 = \lVert \cdot \rVert_*
\Rightarrow \rho = \sqrt{\lvert E \rvert },~\displaystyle\theta \leq \frac{1}{2}
\Rightarrow \displaystyle \mathrm{Regret}_T \leq \sqrt{2 \lvert E\rvert T}

R(x) = \sum x_i \ln x_i,~\lVert \cdot \rVert = \lVert \cdot \rVert_1,~\lVert \cdot \rVert_* = \lVert \cdot \rVert_{\infty}
\Rightarrow \rho = 1,~\theta \leq \ln \lvert E \rvert
\Rightarrow \displaystyle \mathrm{Regret}_T \leq 2 \sqrt{T \ln \lvert E \rvert}
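To see how the two bounds compare, the sketch below simply evaluates \sqrt{2 \lvert E \rvert T} and 2\sqrt{T \ln \lvert E \rvert} for a few values of \lvert E \rvert. The function names and the chosen values of \lvert E \rvert and T are illustrative.

```python
import math


def bound_euclidean(num_experts, T):
    """sqrt(2 |E| T): the bound for R(x) = 1/2 ||x||_2^2."""
    return math.sqrt(2 * num_experts * T)


def bound_entropic(num_experts, T):
    """2 sqrt(T ln |E|): the bound for R(x) = sum_i x_i ln x_i."""
    return 2 * math.sqrt(T * math.log(num_experts))


T = 10_000
for n in (2, 10, 1000):
    print(n, bound_euclidean(n, T), bound_entropic(n, T))
```

The Euclidean bound grows like \sqrt{\lvert E \rvert} while the entropic one grows like \sqrt{\ln \lvert E \rvert}, so the gap widens as the number of experts increases.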

FTRL is not Perfect

x_{t+1} = \displaystyle \mathrm{arg}\,\mathrm{min} \sum_{i = 1}^t f_{i}(x) + R(x)

Each FTRL step needs to solve an optimization problem

\mathrm{FTRL}_R(f_1, \ldots, f_t)

It would be desirable to have an algorithm whose steps are clearly efficient to compute

Online Mirror Descent - Intuition

\text{Regularizer:}~\psi
y_{t+1} = \nabla \psi(x_t) - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \Pi_X^{\psi}(\nabla \psi^*(y_{t+1}))
\nabla \psi^* = (\nabla \psi)^{-1}

[Diagram: \nabla \psi maps x_t \in \mathbb{E} to \nabla \psi (x_t) \in \mathbb{E}^* (linear forms in \mathbb{E}); adding -g_t gives y_{t+1}; \nabla \psi^* maps y_{t+1} back to \mathbb{E}; \Pi_X^\psi projects onto X, giving x_{t+1}]

Online Mirror Descent

\mathrm{EOMD}_R(f_1, \ldots, f_T)~\text{(Eager)}

f_1, \ldots, f_t:~\text{already-seen functions}
\psi:~\text{strongly convex regularizer}

y_1 = 0
\text{for}~t = 1, \ldots, T-1
\quad y_{t+1} = \nabla \psi(x_t) - g_t, \text{where}~g_t \in \partial f_t(x_t)
\quad x_{t+1} = \Pi_X^{\psi}(\nabla \psi^*(y_{t+1}))

OMD Examples

Online (Sub)Gradient Descent

\displaystyle \psi(x) = \frac{1}{2\eta} \lVert x \rVert_2^2
\lVert \cdot \rVert = \lVert \cdot \rVert_2 = \lVert \cdot \rVert_*
x_{t+1} = x_t - \eta g_t

Hedge

\psi(x) = \displaystyle \frac{1}{\eta}\sum x_i \ln x_i
\lVert \cdot \rVert = \lVert \cdot \rVert_1,~\lVert \cdot \rVert_* = \lVert \cdot \rVert_{\infty}
x_{t+1}(i) = x_t(i) e^{-\eta g_t(i)}
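A sketch of the two updates in code. For Hedge on the simplex, the multiplicative step is followed by a normalization, which is the Bregman projection onto \Delta_E for the entropic regularizer. The toy simulation, the seed, and the step size \eta = \sqrt{\ln n / T} are illustrative assumptions, not choices made in the slides.

```python
import math
import random


def ogd_step(x, g, eta):
    """Online (sub)gradient descent step, before any projection onto X."""
    return [xi - eta * gi for xi, gi in zip(x, g)]


def hedge_step(p, g, eta):
    """Hedge: multiplicative update, then normalize back onto the simplex."""
    w = [pi * math.exp(-eta * gi) for pi, gi in zip(p, g)]
    total = sum(w)
    return [wi / total for wi in w]


def run_hedge(costs, eta):
    """Play Hedge on a fixed cost sequence; return regret vs. the best expert."""
    n = len(costs[0])
    p = [1.0 / n] * n
    total = 0.0
    for y in costs:
        total += sum(pi * yi for pi, yi in zip(p, y))   # f_t(p) = p^T y
        p = hedge_step(p, y, eta)                       # gradient of f_t is y
    best = min(sum(y[i] for y in costs) for i in range(n))
    return total - best


rng = random.Random(1)
T, n = 2000, 5
costs = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(T)]
eta = math.sqrt(math.log(n) / T)
hedge_regret = run_hedge(costs, eta)
print(hedge_regret, 2 * math.sqrt(T * math.log(n)))
```

The printed pair compares the observed regret on this toy sequence with the 2\sqrt{T \ln \lvert E \rvert} bound from the experts example.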

Connections Among Algorithms

[Diagram relating the algorithms: FTRL, LOMD, EOMD, Online Newton, Hedge, Newton, Online GD, Gradient, Mirror, AdaGrad, FTPL, Linear Coupling, Adaptive FTRL, Proximal, Adaptive Prox-FTRL, Quasi Newton, ?, Adaptive OMD, Accelerated GD]

Future Directions

Suggestion - Bandit Convex Optimization (BCO)

At each round

\text{Player chooses a point}~x \in X

\text{Enemy chooses a CONVEX function}~f

\text{Player suffers a loss}~f(x)

(Player and Enemy choose SIMULTANEOUSLY)

\text{Player and Enemy see}~f~\text{and}~x

[Diagram: Player and Enemy each respond (!)]

Multi-armed Bandit Problem

"Slot machine" icons by Freepik at www.flaticon.com

At each round

Player chooses a machine

Enemy chooses the costs

Player suffers the loss of the machine he has chosen

[Figure: slot machines with different costs ($$, $, $$$, $)]

EXPERTS

EXPLORATION VS EXPLOITATION

History of BCO - Regret Bounds

d:~\text{dimension},\quad T:~\text{rounds}

Linear functions

\tilde{O}(d\sqrt{T})~\text{(Bubeck, C-B, K, '12)}
\tilde{\Omega}(d\sqrt{T})~\text{(Dani, H, K, '12)}

General case

\tilde{O}(d^{9.5}\sqrt{T})~\text{(Bubeck, E, L, '17)}
\tilde{\Omega}(d\sqrt{T})~\text{(Dani, H, K, '12)}
\tilde{\Theta}(d^{1.5}\sqrt{T})~\text{(Bubeck, E, L, '17)}

Suggestion - Boosting

Set of low accuracy learners - Weak Learners

Combination generates a good model - Strong Learner

Usually performed in an incremental fashion

Idea: Use boosting outside of learning

Example - Approximate maximum flows

Electrical flows can be computed quickly

Nearly-linear time Laplacian solver (Spielman and Teng, '04)

Electrical flows may not respect capacities

Idea: Compute many electrical flows, penalizing violated edges

L x = b

Multiplicative Weights Update Method

Online Convex Optimization

Algorithms, Learning, and Duality

Victor Sanches Portella

Supervisor: Marcel K. de Carli Silva

IME - USP

August, 2018

Formalizing Online Learning

Oracles

\mathrm{PLAYER} \colon \mathrm{Seq}(X) \times \mathrm{Seq}(Y) \to D
\mathrm{NATURE} \colon \mathbb{N} \to X
\mathrm{ENEMY} \colon \mathrm{Seq}(X) \times \mathrm{Seq}(D) \to Y

Formalizing Online Learning

Algorithm

\mathcal{P} = (X, Y, D, L)
\text{For}~t = 1, \ldots, T,~\text{do}
\quad x_t \coloneqq \text{NATURE}(t)
\quad d_t \coloneqq \text{PLAYER}((x_1,\ldots, x_{t}), (y_1,\ldots, y_{t-1}) )
\quad y_t \coloneqq \text{ENEMY}((x_1,\ldots, x_{t}), (d_1,\ldots, d_{t-1}) )
\quad l_t \coloneqq L(d_t, y_t)
\text{Return}~(x,d,y)
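The protocol above can be sketched as a loop over three oracle callbacks. The toy oracles below (a player that repeats the last label, an enemy that answers with the parity of the query) are hypothetical and only serve to exercise the loop; the sketch also returns the per-round losses for convenience.

```python
def play(player, nature, enemy, loss, T):
    """Run T rounds of the online learning protocol and return (x, d, y, losses)."""
    xs, ds, ys, losses = [], [], [], []
    for t in range(1, T + 1):
        xs.append(nature(t))                   # x_t := NATURE(t)
        d = player(tuple(xs), tuple(ys))       # sees x_1..x_t and y_1..y_{t-1}
        y = enemy(tuple(xs), tuple(ds))        # sees x_1..x_t and d_1..d_{t-1}
        ds.append(d)
        ys.append(y)
        losses.append(loss(d, y))              # l_t := L(d_t, y_t)
    return xs, ds, ys, losses


# Hypothetical oracles for a toy run.
nature = lambda t: t
enemy = lambda xs, ds: xs[-1] % 2
player = lambda xs, ys: ys[-1] if ys else 0
zero_one = lambda d, y: int(d != y)

xs, ds, ys, losses = play(player, nature, enemy, zero_one, T=4)
print(ds, ys, losses)
```

Neither the player nor the enemy receives the other's choice for the current round, which models the simultaneous move.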

From OL to OCO

An OL problem: Linear regression

\mathcal{P} = (\mathbb{R}^n, \mathbb{R}, \mathbb{R}, L)
\mathcal{H}:~\text{set of linear functions}~h(x) = \langle w, x\rangle

OL

\text{Nature:}~x \in \mathbb{R}^n
\text{Player:}~d \in \mathbb{R}
\text{Enemy:}~y \in \mathbb{R}

OCO

\text{Player:}~w \in \mathbb{R}^n
\text{Enemy:}~L(\langle\cdot, x \rangle, y)
\text{Loss:}~L(\langle w,x \rangle, y) = L(d, y)~\text{with}~d = \langle w,x \rangle

Separation theorem

\text{Theorem: there exists}~a \in \mathbb{E}\setminus\{0\}~\text{such that}
\displaystyle \sup_{s \in S} \langle a, s\rangle \leq \inf_{t \in T} \langle a, t\rangle

[Figure: sets S and T separated by the hyperplane \langle a, x \rangle = \beta]

Bounding the Regret - Using Duality

\lVert g_t \rVert_*\lVert x_t - x_{t+1} \rVert
g_t \in \partial f_t(x_t)

Stability between rounds

Lipschitz

?

Lemma:

F, f~\text{convex such that}~F + f~\text{is}~\sigma\text{-strongly-convex}.
\text{If}~\bar{x} \in \mathrm{arg}\,\mathrm{min}~F(x)~\text{and}~\bar{y} \in \mathrm{arg}\,\mathrm{min}~F(x) + f(x),~\text{then}
\displaystyle \lVert \bar{x} - \bar{y} \rVert \leq \frac{1}{\sigma} \lVert g \rVert_* \quad \forall g \in \partial f(\bar{x})
\psi~\sigma\text{-strongly-convex} \Rightarrow \psi^*~\frac{1}{\sigma}\text{-strongly-smooth},
\bar{x} = \nabla \psi^*(0), \bar{y} = \nabla \psi^* (b)

LOMD Examples

\mathcal{C} = (X, \mathcal{F})
\displaystyle R(x) = \begin{cases}\psi(x), & \text{if}~x \in X\\ +\infty, & \text{if}~x \not\in X \end{cases}
y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \nabla R^*(y_{t+1})
\nabla R^*(x^*)~\text{attains}~\displaystyle \sup_{x \in \mathbb{E}} \langle x^*, x\rangle - R(x) = \sup_{x \in X} \langle x^*, x\rangle - \psi(x) \neq \psi^*(x^*)
\nabla R^*(x^*) \in X

LOMD Examples

\mathcal{C} = (X, \mathcal{F})
\displaystyle R(x) = \begin{cases}\psi(x), & \text{if}~x \in X\\ +\infty, & \text{if}~x \not\in X \end{cases}
y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \nabla R^*(y_{t+1})

Lemma

\nabla R^*(x) = \Pi_X^\psi(\nabla \psi^*(x))

\psi(x) = \frac{1}{2\eta} \lVert x\rVert_2^2
\Rightarrow
\nabla \psi^*(x) = \eta x
\Pi_X^\psi(x) = \arg\,\min_{z \in X} \lVert x - z\rVert_2
y_{t+1} = \sum_{i = 1}^t - g_i, \text{where}~g_i \in \partial f_i(x_i)
x_{t+1} = \arg\,\min_{x \in X} \lVert x - \eta y_{t+1}\rVert_2

Minimizing the Loss

Attaining sublinear regret is impossible in general (Cover '67)

[Diagram: whenever the Player predicts X, the Enemy answers Not X]

h_0 \equiv \text{No},\quad h_1 \equiv \text{Yes}
\mathrm{Regret}_T(\mathcal{H}) \geq \displaystyle \frac{T}{2}

Idea: allow the player to randomize her choices

Enemy does not know the outcomes of the "coin flips"

We want bounds with high probability or on the expectation
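The impossibility argument can be sketched in code: against a deterministic player, the enemy simulates the player and answers the opposite, so the player errs in every round while one of the two constant hypotheses errs at most T/2 times. The `majority_player` strategy is a hypothetical example of a deterministic player.

```python
def adversarial_labels(player, T):
    """Enemy that simulates a deterministic player and always answers the opposite."""
    ds, ys = [], []
    for _ in range(T):
        d = player(tuple(ys))      # deterministic, so the enemy can compute it too
        ds.append(d)
        ys.append(1 - d)           # labels: 1 = "Yes", 0 = "No"
    return ds, ys


def majority_player(history):
    """Hypothetical deterministic strategy: predict the most frequent past label."""
    return 1 if 2 * sum(history) > len(history) else 0


T = 100
ds, ys = adversarial_labels(majority_player, T)
player_loss = sum(d != y for d, y in zip(ds, ys))
# Errors of h_0 (always "No") and h_1 (always "Yes"); the better one errs <= T/2.
best_constant = min(sum(ys), T - sum(ys))
print(player_loss, best_constant, player_loss - best_constant)
```

Randomization breaks this construction because the enemy can no longer predict the realized decision, only its distribution.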

Algorithms for OCO

There are OCO algorithms with guaranteed sublinear regret

Take inspiration from classic optimization

Use concepts of conjugate functions and subgradients

Intuition depends on concepts from convex analysis

Conjugate Functions and Subgradients

Theorem. The following are equivalent:

\text{(a)}~y \in \partial f(x)
\text{(b)}~x~\text{attains}~\sup_{z \in \mathbb{E}} \langle z, y \rangle - f(z) = f^*(y)
\text{(c)}~x \in \partial f^*(y)
\text{(d)}~y~\text{attains}~\sup_{z \in \mathbb{E}} \langle z, x \rangle - f^*(z) = f^{**}(x)

Formalizing Online Convex Optimization

\mathrm{Regret}_T( u) = \displaystyle \sum_{t = 1}^T f_t(x_t) - \sum_{t = 1}^T f_t(u)
\mathrm{Regret}_T( U) = \displaystyle \sup_{u \in U} \mathrm{Regret}_T( u)

Algorithm

\text{For}~t = 1, \ldots, T,~\text{do}
x_t \coloneqq \text{PLAYER}((f_1,\ldots, f_{t-1}) )
f_t \coloneqq \text{ENEMY}( (x_1,\ldots, x_{t-1}))
l_t \coloneqq f_t(x_t)

\sum_{t = 1}^T f_t(u):~\text{cost of always choosing}~u

Formalizing Online Convex Optimization

Oracles of Online Convex Optimization

\mathrm{PLAYER} \colon \mathrm{Seq}(\mathcal{F}) \to X
\mathrm{ENEMY} \colon \mathrm{Seq}(X) \to \mathcal{F}

An Online Convex Optimization Problem

\mathcal{C} = (X, \mathcal{F})
X:~\text{convex set}
\mathcal{F}:~\text{set of convex functions}

Online Mirror Descent - Intuition

\text{Regularizer:}~\psi
R(x) = \begin{cases} \psi(x) &x \in X\\ +\infty &\text{o.w.} \end{cases}
y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
x_{t+1} = \Pi_X^\psi(\nabla\psi^*(y_{t+1})) = \nabla R^*(y_{t+1})
\nabla \psi^* = (\nabla \psi)^{-1}

[Diagram: in \mathbb{E}^*, adding -g_t to y_t gives y_{t+1}; \nabla R^* (that is, \nabla \psi^* followed by \Pi_X^\psi) maps y_{t+1} to x_{t+1} \in X \subseteq \mathbb{E}. For comparison, adding -g_t to \nabla \psi (x_t) gives y'_{t+1}]

Online Mirror Descent

\mathrm{LOMD}_R(f_1, \ldots, f_t)~\text{(Lazy)}

f_1, \ldots, f_t:~\text{already-seen functions}
R:~\text{regularizer}

y_1 = 0
\text{for}~t = 1, \ldots, T-1
\quad y_{t+1} = y_t - g_t, \text{where}~g_t \in \partial f_t(x_t)
\quad x_{t+1} = \nabla R^*(y_{t+1})
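A minimal sketch of the lazy variant, assuming \psi(x) = \frac{1}{2\eta}\lVert x \rVert_2^2 and the box X = [-1, 1]^n as domain, so that \nabla R^*(y) = \Pi_X(\eta y) is coordinate-wise clipping. The box, the fixed linear loss, and all function names are illustrative choices.

```python
def clip(v, lo=-1.0, hi=1.0):
    """Euclidean projection of one coordinate onto [lo, hi]."""
    return max(lo, min(hi, v))


def lazy_omd(subgradient, dim, eta, T):
    """LOMD with psi(x) = ||x||_2^2 / (2 * eta) on the box X = [-1, 1]^dim."""
    y = [0.0] * dim                              # y_1 = 0
    xs = []
    for t in range(1, T + 1):
        x = [clip(eta * yi) for yi in y]         # x_t = grad R^*(y_t) = Pi_X(eta * y_t)
        xs.append(x)
        g = subgradient(t, x)                    # g_t in the subdifferential of f_t at x_t
        y = [yi - gi for yi, gi in zip(y, g)]    # y_{t+1} = y_t - g_t
    return xs


# Hypothetical losses f_t(x) = <c, x> with a fixed c; the subgradient is c itself.
c = [1.0, -2.0]
xs = lazy_omd(lambda t, x: c, dim=2, eta=0.1, T=50)
print(xs[0], xs[1], xs[-1])
```

The dual iterate y accumulates all past subgradients without ever being projected, which is what distinguishes the lazy variant from the eager one.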
