When Online Learning meets Stochastic Calculus
joint work with Nick Harvey (UBC) and Christopher Liaw (Google)


Victor Sanches Portella
ime.usp.br/~victorsp


Prediction with Experts' Advice
Player
Adversary
\(n\) Experts
0.5
0.1
0.3
0.1
Probabilities
1
-1
0.5
-0.3
Costs
Player's loss:
Adversary knows the strategy of the player
Measuring the Player's Performance
Total player's loss
Can always be \(T\)
Compare with offline optimum
Almost the same as Attempt #1
Restrict the offline optimum
Attempt #1
Attempt #2
Attempt #3
Loss of Best Expert
Player's Loss
Goal:
Follow the Leader
Idea: Pick the best expert at each round
where \(i\) minimizes
Can fail badly
Player loses \(T - 1\)
Best expert loses \(T/2\)
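The failure above can be checked numerically. A minimal sketch (the function name and the specific alternating cost sequence are illustrative choices, not from the talk):

```python
import numpy as np

def ftl_vs_alternating(T):
    """Follow the Leader against the classic alternating-cost adversary
    with 2 experts: after a small tie-breaking cost in round 0, the
    leader is always the expert about to incur cost 1."""
    totals = np.zeros(2)                  # cumulative losses of the experts
    player_loss = 0.0
    for t in range(T):
        leader = int(np.argmin(totals))   # FTL: follow the best expert so far
        if t == 0:
            cost = np.array([0.5, 0.0])   # initial round breaks the tie
        elif t % 2 == 1:
            cost = np.array([0.0, 1.0])
        else:
            cost = np.array([1.0, 0.0])
        player_loss += cost[leader]       # player pays the leader's cost
        totals += cost
    return player_loss, totals.min()
```

For \(T = 100\) this gives a player loss of 99.5 against a best-expert loss of 49.5, matching the \(T - 1\) vs \(T/2\) picture on the slide.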
Gradient Descent
\(\eta_t\): step-size at round \(t\)
\(\ell_t\): loss vector at round \(t\)
Sublinear Regret!
Optimal dependency on \(T\)
Can we improve the dependency on \(n\)?
Yes, and by a lot
Multiplicative Weights Update Method
Normalization
Optimal!
For random \(\pm 1\) costs
Multiplicative Weights Update:
(Hedge)
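A sketch of the Hedge update (the step size \(\eta\) and uniform initialization are the standard choices, assumed here rather than read off the slide):

```python
import numpy as np

def hedge(costs, eta):
    """Multiplicative Weights Update (Hedge) on a T x n array of costs
    in [-1, 1]. Returns the player's total loss and the best expert's loss."""
    T, n = costs.shape
    w = np.ones(n)                 # uniform initial weights
    player_loss = 0.0
    for cost in costs:
        p = w / w.sum()            # normalization: weights -> probabilities
        player_loss += p @ cost    # expected loss under p
        w *= np.exp(-eta * cost)   # multiplicative update
    return player_loss, costs.sum(axis=0).min()
```

With \(\eta \approx \sqrt{\ln(n)/T}\), the regret (player's loss minus the best expert's loss) is \(O(\sqrt{T \ln n})\), the optimal dependence mentioned on the slide.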
MWU is also Mirror Descent
Potential based players
LogSumExp
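Concretely (a sketch; \(\eta\) denotes the step size and \(R\) the regret vector), the LogSumExp potential recovers the MWU weights as its gradient:

```latex
\Phi(R) \;=\; \frac{1}{\eta}\,\ln\!\Big(\sum_{i=1}^{n} e^{\eta R_i}\Big),
\qquad
p(i) \;=\; \frac{\partial \Phi}{\partial R_i}
      \;=\; \frac{e^{\eta R_i}}{\sum_{j=1}^{n} e^{\eta R_j}} .
```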

Why Learning with Experts?
Boosting in ML
Understanding sequential prediction & online learning
Universal Optimization

TCS, Learning theory, SDPs...



Quantile Regret
Best Expert
Best Experts
\(\varepsilon\)-fraction
MWU:
Needs knowledge of \(\varepsilon\)
We design an algorithm with \(\sqrt{T \ln(1/\varepsilon)}\) quantile regret
for all \(\varepsilon\) and best known leading constant
Loss of
top \(\varepsilon n \) expert
\(\varepsilon\)-Quantile Regret
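In symbols (a sketch of the definition as I read the slide), writing \(i_{\lceil \varepsilon n\rceil}\) for the expert with the \(\lceil \varepsilon n\rceil\)-th smallest cumulative loss:

```latex
\mathrm{Regret}_{\varepsilon}(T)
  \;=\; \underbrace{\sum_{t=1}^{T}\langle p_t, \ell_t\rangle}_{\text{player's loss}}
  \;-\; \underbrace{L_T\big(i_{\lceil \varepsilon n\rceil}\big)}_{\text{loss of top-}\varepsilon n\text{ expert}} .
```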
Continuous OL via Stochastic Calculus

Algorithms design guided by
PDEs and (Stochastic) Calculus tools
Main Goal of this Talk: describe the main ideas of the
continuous time model and tools
Continuous Experts' Problem
Modeling Online Learning in Continuous Time

Analysis often becomes clean
Sandbox for design of optimization algorithms
Gradient flow is useful for smooth optimization
Key Question: How to model non-smooth (online) optimization in continuous time?
Why go to continuous time?
Modeling Adversarial Costs in Continuous Time
Total loss of expert \(i\):

Useful perspective:
Discrete time: \(L(i)\) is a realization of a random walk
Continuous time: \(L(i)\) is a realization of a Brownian Motion
Probability 1 = Worst-case
The Continuous Time Model
Discrete time
Continuous time
Cumulative loss
Player's cumulative loss
Player's loss per round
[Freund '09]
Regret Vector
Regret
Goal: Prob. 1 bounds on Regret
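A sketch of the dictionary, in the spirit of [Freund '09] (notation mine): expert \(i\)'s cumulative loss becomes a Brownian motion, the player's loss a stochastic integral, and the regret their difference:

```latex
L_i(t) = B_i(t), \qquad
L_{\mathrm{player}}(t) = \int_0^t \langle p_s,\, \mathrm{d}L(s)\rangle, \qquad
\mathrm{Regret}(t) = L_{\mathrm{player}}(t) - \min_i L_i(t).
```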
MWU in Continuous Time
Potential based players
Multiplicative Weights Update
LogSumExp
NormalHedge
First algorithm for quantile regret

Very clean Continuous time analysis
[Freund '09]
A Peek Into the Analysis
Ito's Lemma
(Fundamental Theorem of Stochastic Calculus)
\(B(t)\) is very non-smooth \(\implies\) second-order terms matter
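For a twice-differentiable \(\Phi\) and a Brownian motion \(B_t\), Ito's Lemma reads (one-dimensional form):

```latex
\mathrm{d}\,\Phi(t, B_t)
  \;=\; \partial_t \Phi\,\mathrm{d}t
  \;+\; \partial_x \Phi\,\mathrm{d}B_t
  \;+\; \tfrac{1}{2}\,\partial_{xx} \Phi\,\mathrm{d}t ,
```

the last term being exactly the second-order correction that the ordinary chain rule misses.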
Ito's Lemma
Idea: Pick \(\Phi\) so as to make Ito's Lemma simpler for
Idea: Use stochastic calculus to guide the algorithm design
Potential based players
Smooth
Non-smooth
Using Ito's Lemma for Potential Based Players
Using Ito's Lemma on potential \(\Phi(t, R_t)\) for 1 dimension*
\(=0 \) if \(p_t \propto \partial_x \Phi(t, R_t)\)
Potential does not change if this \(= 0\)
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
* Simplified, not quite correct
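That is (a simplified reading of the slide), one looks for potentials \(\Phi\) satisfying the Backwards Heat Equation

```latex
\partial_t \Phi \;+\; \tfrac{1}{2}\,\partial_{xx} \Phi \;=\; 0 ,
```

which cancels the \(\mathrm{d}t\) terms in Ito's Lemma; combined with the choice \(p_t \propto \partial_x \Phi(t, R_t)\), the remaining terms vanish as well and the potential does not grow.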
Going to Higher Dimensions
Using Ito's Lemma on potential \(\Phi(t, R_t)\) for \(d\) dimensions
\(=0 \) if \(p_t \propto \nabla_x \Phi(t, R_t)\)
"Covariance" of \(R_i\) and \(R_j\)
Do dependencies between \(L_i\) and \(L_j\) matter?
YES, and it is hard (perhaps impossible) to discretize otherwise
Different intuition from the discrete case (?)
Beyond i.i.d. Experts
A Peek Into the Analysis
Potential based players
For all \(\varepsilon\)
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
Using this potential*, we get
Best leading constant
Discrete time analysis is IDENTICAL to continuous time analysis
Discrete Ito's Lemma
*(with a slightly bigger constant in the BHE)
Other Results Using
Stochastic Calculus
Fixed Time vs Anytime Regret
Question:
Is the minimax regret with and without knowledge of \(T\) different?
fixed-time
anytime
[Harvey, Liaw, Perkins, Randhawa '23]
n = 2
anytime
fixed-time
[Cover '67]
Back. Heat Eq.
Efficient version via SC
<
[Greenstreet, VSP, Harvey '20]
Heat Eq.
?
In Continuous Time, both are equal if Loss Processes are independent.
[VSP, Liaw, Harvey '22]
Large n
What about expected regret?
Question:
What is the expected regret in the anytime setting
even without independent experts?
[VSP, Liaw, Harvey '25]:
High expected regret \(\implies\) lower bound
In the language of martingales:
Nearly tight bounds, asymptotically!
For a martingale \(X_t\), find upper and lower bounds to \(\sup_{\tau} \mathbb{E}\big[\lVert X_\tau \rVert_\infty\big]\), where \(\tau\) is a stopping time
Evidence that
anytime = fixed-time
Online Linear Optimization
Player
Adversary
Unconstrained
Linear functions
Player's loss:
Loss of Fixed \(u\)
Player's Loss
Parameter-Free Online Linear Optimization
Goal:
No knowledge of \(\lVert u \rVert\)
Small regret if \(\lVert g_t\rVert\) small
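For orientation (this shape is a common target in the parameter-free literature, not a formula taken from the slide): with plays \(x_t\) and linear losses \(\langle g_t, x_t\rangle\), one wants, simultaneously for every comparator \(u\),

```latex
\mathrm{Regret}_T(u)
  \;=\; \sum_{t=1}^{T} \langle g_t,\, x_t - u\rangle
  \;\lesssim\; \lVert u \rVert \sqrt{T \,\ln\!\big(1 + \lVert u \rVert\, T\big)} ,
```

without knowing \(\lVert u\rVert\) in advance, and with \(T\) replaced by \(\sum_t \lVert g_t\rVert^2\) in the adaptive versions.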
[Zhang, Yang, Cutkosky, Paschalidis '24]:
Parameter-free and Adaptive algorithm
Backwards Heat Equation
Parameter free and adaptive algorithms matching lower bounds
(even up to leading constant)
Potential based player satisfying
+ refined discretization
Conclusion and Open Questions
Continuous Time Model for Experts and OLO
Thanks!
[VSP, Liaw, Harvey '22] Continuous prediction with experts' advice.
[Zhang, Yang, Cutkosky, Paschalidis '24] Improving adaptive online learning using refined discretization.
[Freund '09] A method for hedging in continuous time.
[Harvey, Liaw, Perkins, Randhawa '23] Optimal anytime regret with two experts.
[Greenstreet, VSP, Harvey '22] Efficient and optimal fixed-time regret with two experts.
[Harvey, Liaw, VSP '22] On the expected infinity-norm of high-dimensional martingales.
Improve LB for anytime experts? Or better upper bounds?
High-dim continuous time OLO?
Hopefully this model can be helpful in more developments in OL and optimization!
Application to offline non-smooth optimization?
Performance Measure - Regret
Loss of Best Expert
Player's Loss
Optimal!
For random \(\pm 1\) costs
Multiplicative Weights Update:
(Hedge)
Motivating Problem - Fixed Time vs Anytime
MWU regret
when \(T\) is known
when \(T\) is not known
anytime
fixed-time
Does knowing \(T\) give the player an advantage?
With stochastic calculus:
Optimal anytime lower bound 2 experts + optimal algorithm
[Harvey, Liaw, Perkins, Randhawa '23]
Continuous anytime algorithms for independent experts
+ improved algorithms for quantile regret!
[VSP, Liaw, Harvey '22]
MWU in Continuous Time
Potential based players
MWU!
Same regret bound as discrete time!
Idea: Use stochastic calculus to guide the algorithm design
LogSumExp
Regret bounds
when \(T\) is known
when \(T\) is not known
anytime
fixed-time
with prob. 1
The Joys of Stochastic Calculus
Optimal anytime lower bound 2 experts + optimal algorithm
[Harvey, Liaw, Perkins, Randhawa '23]
Best known algorithms for quantile regret
+ better anytime algorithms in continuous time
[VSP, Liaw, Harvey '22]
Efficient optimal algorithms for fixed time 2 experts
[Greenstreet, VSP, Harvey '20]
Optimal parameter-free algorithms for online linear optimization
[Zhang, Yang, Cutkosky, Paschalidis '24]
Simple continuous time analysis of NormalHedge
[Freund '09]
A Peek Into the Analysis
Potential based players
Matches fixed-time!
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
This new anytime algorithm has good regret!
Does not translate easily to discrete time
need correlation between experts
Takeaway: Anytime lower bounds for (continuous) experts
need dependent experts
A One Dimensional Continuous Time Model
Discrete Regret
Continuous Regret
Theorem:
If \(\Phi\) satisfies the BHE and
Going to higher dim:
Continuous time analogue
of
Learn direction and scale separately
Use refined discretization
Discretizing:
Why Continuous Time?
INRIA
By Victor Sanches Portella