When Online Learning meets Stochastic Calculus
joint work with Nick Harvey (UBC) and Christopher Liaw (Google)


Victor Sanches Portella
ime.usp.br/~victorsp


Prediction with Experts' Advice
Player
Adversary
\(n\) Experts
0.5
0.1
0.3
0.1
Probabilities
1
-1
0.5
-0.3
Costs
Player's loss:
Adversary knows the strategy of the player
Measuring the Player's Performance
Total player's loss
Can always be \(T\)
Compare with offline optimum
Almost the same as Attempt #1
Restrict the offline optimum
Attempt #1
Attempt #2
Attempt #3
Loss of Best Expert
Player's Loss
Goal:
Follow the Leader
Idea: Pick the best expert at each round
where \(i\) minimizes
Can fail badly
Player loses \(T - 1\)
Best expert loses \(T/2\)
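The failure above can be checked numerically. A minimal sketch (the function name and the specific alternating cost sequence are illustrative choices, not from the talk):

```python
import numpy as np

def ftl_vs_alternating(T):
    """Follow the Leader against the classic alternating-cost adversary
    with 2 experts: after a small tie-breaking cost in round 0, the
    leader is always the expert about to incur cost 1."""
    totals = np.zeros(2)                  # cumulative losses of the experts
    player_loss = 0.0
    for t in range(T):
        leader = int(np.argmin(totals))   # FTL: follow the best expert so far
        if t == 0:
            cost = np.array([0.5, 0.0])   # initial round breaks the tie
        elif t % 2 == 1:
            cost = np.array([0.0, 1.0])
        else:
            cost = np.array([1.0, 0.0])
        player_loss += cost[leader]       # player pays the leader's cost
        totals += cost
    return player_loss, totals.min()
```

For \(T = 100\) this gives a player loss of 99.5 against a best-expert loss of 49.5, matching the \(T - 1\) vs \(T/2\) picture on the slide.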
Gradient Descent
\(\eta_t\): step-size at round \(t\)
\(\ell_t\): loss vector at round \(t\)
Sublinear Regret!
Optimal dependency on \(T\)
Can we improve the dependency on \(n\)?
Yes, and by a lot
Multiplicative Weights Update Method
Normalization
Optimal!
For random \(\pm 1\) costs
Multiplicative Weights Update:
(Hedge)
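A sketch of the Hedge update (the step size \(\eta\) and uniform initialization are the standard choices, assumed here rather than read off the slide):

```python
import numpy as np

def hedge(costs, eta):
    """Multiplicative Weights Update (Hedge) on a T x n array of costs
    in [-1, 1]. Returns the player's total loss and the best expert's loss."""
    T, n = costs.shape
    w = np.ones(n)                 # uniform initial weights
    player_loss = 0.0
    for cost in costs:
        p = w / w.sum()            # normalization: weights -> probabilities
        player_loss += p @ cost    # expected loss under p
        w *= np.exp(-eta * cost)   # multiplicative update
    return player_loss, costs.sum(axis=0).min()
```

With \(\eta \approx \sqrt{\ln(n)/T}\), the regret (player's loss minus the best expert's loss) is \(O(\sqrt{T \ln n})\), the optimal dependence mentioned on the slide.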
MWU is also Mirror Descent
Potential based players
LogSumExp
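Concretely (a sketch; \(\eta\) denotes the step size and \(R\) the regret vector), the LogSumExp potential recovers the MWU weights as its gradient:

```latex
\Phi(R) \;=\; \frac{1}{\eta}\,\ln\!\Big(\sum_{i=1}^{n} e^{\eta R_i}\Big),
\qquad
p(i) \;=\; \frac{\partial \Phi}{\partial R_i}
      \;=\; \frac{e^{\eta R_i}}{\sum_{j=1}^{n} e^{\eta R_j}} .
```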

Why Learning with Experts?
Boosting in ML
Understanding sequential prediction & online learning
Universal Optimization

TCS, Learning theory, SDPs...



Quantile Regret
Best Expert
Best Experts
\(\varepsilon\)-fraction
MWU:
Needs knowledge of \(\varepsilon\)
We design an algorithm with \(\sqrt{T \ln(1/\varepsilon)}\) quantile regret
for all \(\varepsilon\) and best known leading constant
Loss of
top \(\varepsilon n \) expert
\(\varepsilon\)-Quantile Regret
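In symbols (a sketch of the definition as I read the slide), writing \(i_{\lceil \varepsilon n\rceil}\) for the expert with the \(\lceil \varepsilon n\rceil\)-th smallest cumulative loss:

```latex
\mathrm{Regret}_{\varepsilon}(T)
  \;=\; \underbrace{\sum_{t=1}^{T}\langle p_t, \ell_t\rangle}_{\text{player's loss}}
  \;-\; \underbrace{L_T\big(i_{\lceil \varepsilon n\rceil}\big)}_{\text{loss of top-}\varepsilon n\text{ expert}} .
```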
Continuous OL via Stochastic Calculus

Algorithms design guided by
PDEs and (Stochastic) Calculus tools
Main Goal of this Talk: describe the main ideas of the
continuous time model and tools
Continuous Experts' Problem
Modeling Online Learning in Continuous Time

Analysis often becomes clean
Sandbox for design of optimization algorithms
Gradient flow is useful for smooth optimization
Key Question: How to model non-smooth (online) optimization in continuous time?
Why go to continuous time?
Modeling Adversarial Costs in Continuous Time
Total loss of expert \(i\):

Useful perspective:
Discrete time: \(L(i)\) is a realization of a random walk
Continuous time: \(L(i)\) is a realization of a Brownian Motion
Probability 1 = Worst-case
The Continuous Time Model
Discrete time
Continuous time
Cumulative loss
Player's cumulative loss
Player's loss per round
[Freund '09]
Regret Vector
Regret
Goal: Prob. 1 bounds on Regret
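A sketch of the dictionary, in the spirit of [Freund '09] (notation mine): expert \(i\)'s cumulative loss becomes a Brownian motion, the player's loss a stochastic integral, and the regret their difference:

```latex
L_i(t) = B_i(t), \qquad
L_{\mathrm{player}}(t) = \int_0^t \langle p_s,\, \mathrm{d}L(s)\rangle, \qquad
\mathrm{Regret}(t) = L_{\mathrm{player}}(t) - \min_i L_i(t).
```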
MWU in Continuous Time
Potential based players
Multiplicative Weights Update
LogSumExp
NormalHedge
First algorithm for quantile regret

Very clean Continuous time analysis
[Freund '09]
A Peek Into the Analysis
Ito's Lemma
(Fundamental Theorem of Stochastic Calculus)
\(B(t)\) is very non-smooth \(\implies\) second-order terms matter
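For a twice-differentiable \(\Phi\) and a Brownian motion \(B_t\), Ito's Lemma reads (one-dimensional form):

```latex
\mathrm{d}\,\Phi(t, B_t)
  \;=\; \partial_t \Phi\,\mathrm{d}t
  \;+\; \partial_x \Phi\,\mathrm{d}B_t
  \;+\; \tfrac{1}{2}\,\partial_{xx} \Phi\,\mathrm{d}t ,
```

the last term being exactly the second-order correction that the ordinary chain rule misses.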
Ito's Lemma
Idea: Pick \(\Phi\) so as to make Ito's Lemma simpler for
Idea: Use stochastic calculus to guide the algorithm design
Potential based players
Smooth
Non-smooth
Using Ito's Lemma for Potential Based Players
Using Ito's Lemma on potential \(\Phi(t, R_t)\) for 1 dimension*
\(=0 \) if \(p_t \propto \partial_x \Phi(t, R_t)\)
Potential does not change if this \(= 0\)
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
* Simplified, not quite correct
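That is (a simplified reading of the slide), one looks for potentials \(\Phi\) satisfying the Backwards Heat Equation

```latex
\partial_t \Phi \;+\; \tfrac{1}{2}\,\partial_{xx} \Phi \;=\; 0 ,
```

which cancels the \(\mathrm{d}t\) terms in Ito's Lemma; combined with the choice \(p_t \propto \partial_x \Phi(t, R_t)\), the remaining terms vanish as well and the potential does not grow.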
Going to Higher Dimensions
Using Ito's Lemma on potential \(\Phi(t, R_t)\) for \(d\) dimensions
\(=0 \) if \(p_t \propto \nabla_x \Phi(t, R_t)\)
"Covariance" of \(R_i\) and \(R_j\)
Do dependencies between \(L_i\) and \(L_j\) matter?
YES, and it is hard (perhaps impossible) to discretize otherwise
Different intuition from the discrete case (?)
Beyond i.i.d. Experts
A Peek Into the Analysis
Potential based players
For all \(\varepsilon\)
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
Using this potential*, we get
Best leading constant
Discrete time analysis is IDENTICAL to continuous time analysis
Discrete Ito's Lemma
*(with a slightly bigger constant in the BHE)
Other Results Using
Stochastic Calculus
Fixed Time vs Anytime Regret
Question:
Is the minimax regret with and without knowledge of \(T\) different?
fixed-time
anytime
[Harvey, Liaw, Perkins, Randhawa '23]
n = 2
anytime
fixed-time
[Cover '67]
Back. Heat Eq.
Efficient version via SC
<
[Greenstreet, VSP, Harvey '20]
Heat Eq.
?
In Continuous Time, both are equal if Loss Processes are independent.
[VSP, Liaw, Harvey '22]
Large n
What about expected regret?
Question:
What is the expected regret in the anytime setting
even without independent experts?
[VSP, Liaw, Harvey '25]:
High expected regret \(\implies\) lower bound
In the language of martingales:
Nearly tight bounds, asymptotically!
For a martingale \(X_t\), find upper and lower bounds to \(\sup_{\tau} \mathbb{E}\big[\lVert X_\tau \rVert_\infty\big]\), where \(\tau\) is a stopping time
Evidence that
anytime = fixed-time
Online Linear Optimization
Player
Adversary
Unconstrained
Linear functions
Player's loss:
Loss of Fixed \(u\)
Player's Loss
Parameter-Free Online Linear Optimization
Goal:
No knowledge of \(\lVert u \rVert\)
Small regret if \(\lVert g_t\rVert\) small
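For orientation (this shape is a common target in the parameter-free literature, not a formula taken from the slide): with plays \(x_t\) and linear losses \(\langle g_t, x_t\rangle\), one wants, simultaneously for every comparator \(u\),

```latex
\mathrm{Regret}_T(u)
  \;=\; \sum_{t=1}^{T} \langle g_t,\, x_t - u\rangle
  \;\lesssim\; \lVert u \rVert \sqrt{T \,\ln\!\big(1 + \lVert u \rVert\, T\big)} ,
```

without knowing \(\lVert u\rVert\) in advance, and with \(T\) replaced by \(\sum_t \lVert g_t\rVert^2\) in the adaptive versions.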
[Zhang, Yang, Cutkosky, Paschalidis '24]:
Parameter-free and Adaptive algorithm
Backwards Heat Equation
Parameter free and adaptive algorithms matching lower bounds
(even up to leading constant)
Potential based player satisfying
+ refined discretization
Conclusion and Open Questions
Continuous Time Model for Experts and OLO
Thanks!
[VSP, Liaw, Harvey '22] Continuous prediction with experts' advice.
[Zhang, Yang, Cutkosky, Paschalidis '24] Improving adaptive online learning using refined discretization.
[Freund '09] A method for hedging in continuous time.
[Harvey, Liaw, Perkins, Randhawa '23] Optimal anytime regret with two experts.
[Greenstreet, VSP, Harvey '22] Efficient and optimal fixed-time regret with two experts.
[Harvey, Liaw, VSP '22] On the expected infinity-norm of high-dimensional martingales.
Improve LB for anytime experts? Or better upper bounds?
High-dim continuous time OLO?
Hopefully this model can be helpful in more developments in OL and optimization!
Application to offline non-smooth optimization?
Performance Measure - Regret
Loss of Best Expert
Player's Loss
Optimal!
For random \(\pm 1\) costs
Multiplicative Weights Update:
(Hedge)
Motivating Problem - Fixed Time vs Anytime
MWU regret
when \(T\) is known
when \(T\) is not known
anytime
fixed-time
Does knowing \(T\) give the player an advantage?
With stochastic calculus:
Optimal anytime lower bound 2 experts + optimal algorithm
[Harvey, Liaw, Perkins, Randhawa '23]
Continuous anytime algorithms for independent experts
+ improved algorithms for quantile regret!
[VSP, Liaw, Harvey '22]
MWU in Continuous Time
Potential based players
MWU!
Same regret bound as discrete time!
Idea: Use stochastic calculus to guide the algorithm design
LogSumExp
Regret bounds
when \(T\) is known
when \(T\) is not known
anytime
fixed-time
with prob. 1
The Joys of Stochastic Calculus
Optimal anytime lower bound 2 experts + optimal algorithm
[Harvey, Liaw, Perkins, Randhawa '23]
Best known algorithms for quantile regret
+ better anytime algorithms in continuous time
[VSP, Liaw, Harvey '22]
Efficient optimal algorithms for fixed time 2 experts
[Greenstreet, VSP, Harvey '20]
Optimal parameter-free algorithms for online linear optimization
[Zhang, Yang, Cutkosky, Paschalidis '24]
Simple continuous time analysis of NormalHedge
[Freund '09]
A Peek Into the Analysis
Potential based players
Matches fixed-time!
Ito's Lemma suggests \(\Phi\) that satisfy the Backwards Heat Equation
This new anytime algorithm has good regret!
Does not translate easily to discrete time
need correlation between experts
Takeaway: Anytime lower bounds for (continuous) experts
need dependent experts
A One Dimensional Continuous Time Model
Discrete Regret
Continuous Regret
Theorem:
If \(\Phi\) satisfies the BHE and
Going to higher dim:
Continuous time analogue
of
Learn direction and scale separately
Use refined discretization
Discretizing:
Why Continuous Time?
INRIA
By Victor Sanches Portella