Tour of Online Learning via Prediction with Experts' Advice
Victor Sanches Portella
November 2023
Experts' Problem and Online Learning
Prediction with Experts' Advice
Player
Adversary
\(n\) Experts
Probabilities: \(p_t = (0.5,\ 0.1,\ 0.3,\ 0.1)\)
Costs: \(\ell_t = (1,\ 0,\ 0.5,\ 0.3)\)
Player's loss: \(p_t^T \ell_t\)
Adversary knows the strategy of the player
Picking a random expert vs picking a probability vector
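The two views agree in expectation, which a quick numerical check can illustrate. This is only a sketch: it reuses the probabilities and costs shown on this slide, and pairing them in the listed order is an assumption.

```python
import numpy as np

# Numbers from the slide; pairing them in this order is an assumption.
p = np.array([0.5, 0.1, 0.3, 0.1])     # player's probabilities over n = 4 experts
cost = np.array([1.0, 0.0, 0.5, 0.3])  # adversary's costs for this round

# Loss when the player commits to the probability vector p.
expected_loss = float(p @ cost)

# Loss when the player samples one expert from p, averaged over many draws.
rng = np.random.default_rng(0)
draws = rng.choice(len(p), size=200_000, p=p)
sampled_loss = float(cost[draws].mean())

print(expected_loss, sampled_loss)
```

The sampled average concentrates around the inner product, so analysing the probability vector is enough to reason about the randomized player in expectation.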
Measuring Player's Performance
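Attempt #1: Total player's loss
Can be \(= T\) always
Attempt #2: Compare with offline optimum
Almost the same as Attempt #1
Attempt #3: Restrict the offline optimum
\(\mathrm{Regret}(T) = \sum_{t=1}^{T} p_t^T \ell_t - \min_{i} \sum_{t=1}^{T} \ell_t(i)\) (Player's Loss minus Loss of Best Expert)
Goal: \(\mathrm{Regret}(T)/T \to 0\), that is, sublinear regret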
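The comparison against the best expert can be made concrete with a small helper. This is a minimal sketch; the function name `regret` and the toy data are hypothetical and only meant to show the bookkeeping.

```python
import numpy as np

def regret(plays, losses):
    """Regret of the played distributions against the best fixed expert.

    plays:  (T, n) array, row t is the player's probability vector
    losses: (T, n) array, row t is the loss vector of that round
    """
    player_loss = float(np.sum(plays * losses))
    best_expert_loss = float(losses.sum(axis=0).min())
    return player_loss - best_expert_loss

# Hypothetical toy data: 3 rounds, 2 experts, player always plays uniform.
losses = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
plays = np.full((3, 2), 0.5)
toy_regret = regret(plays, losses)
print(toy_regret)
```

Here the player loses 1.5 in total while the second expert loses 1, so the regret is 0.5.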
Example (plots): cumulative loss of each expert
General Online Learning
Player
Adversary
Player's loss: \(f_t(x_t)\), where \(x_t\) is the player's point at round \(t\)
Convex
The player sees \(f_t\)
Some usual settings:
Experts' problem: simplex, linear functions
Why Online Learning?
Traditional ML optimization makes stochastic assumptions on the data
OL strips away the stochastic layer
Fewer assumptions \(\implies\) Weaker guarantees
Fewer assumptions \(\implies\) More robust
Adaptive algorithms
AdaGrad
Adam...?
Parameter Free algorithms
Coin Betting
TCS applications, solving SDPs, learning theory, etc.
Algorithms
Follow the Leader
Idea: at each round, pick the expert with the smallest cumulative loss so far
\(p_t = e_i\) where \(i\) minimizes \(\sum_{s < t} \ell_s(i)\)
Player loses \(T -1\)
Best expert loses \(T/2\)
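The failure mode can be reproduced with two experts. The loss sequence below is one assumed construction (a small first-round loss that breaks the tie, then alternating losses); it is not necessarily the sequence used on the slide.

```python
import numpy as np

def follow_the_leader(losses):
    """Play Follow the Leader on a (T, n) loss array; return the total loss."""
    T, n = losses.shape
    cumulative = np.zeros(n)
    total = 0.0
    for t in range(T):
        leader = int(np.argmin(cumulative))  # ties go to the lowest index
        total += losses[t, leader]
        cumulative += losses[t]
    return total

def bad_sequence(T):
    """Assumed adversarial sequence: the current leader always loses 1."""
    losses = np.zeros((T, 2))
    losses[0] = (0.0, 0.5)
    for t in range(1, T):
        losses[t] = (1.0, 0.0) if t % 2 == 1 else (0.0, 1.0)
    return losses

T = 100
losses = bad_sequence(T)
ftl_loss = follow_the_leader(losses)
best_expert_loss = float(losses.sum(axis=0).min())
print(ftl_loss, best_expert_loss)
```

With \(T = 100\) the player loses \(T - 1 = 99\) while the best expert loses about \(T/2\), so the regret grows linearly.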
Works very well for quadratic losses
* picking distributions instead of best expert
Gradient Descent
\(\eta_t\): step-size at round \(t\)
\(\ell_t\): loss vector at round \(t\)
Sublinear Regret!
Optimal dependency on \(T\)
Can we improve the dependency on \(n\)?
Yes, and by a lot
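A minimal sketch of projected online gradient descent on the simplex follows. The step-size schedule \(\eta_t = 1/\sqrt{t}\), the sort-based projection routine, and the random losses are assumptions made for illustration, not choices taken from the slides.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    n = v.size
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, n + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def online_gradient_descent(losses):
    """Return the (T, n) array of distributions played on linear losses."""
    T, n = losses.shape
    p = np.full(n, 1.0 / n)
    plays = np.empty((T, n))
    for t in range(T):
        plays[t] = p
        eta = 1.0 / np.sqrt(t + 1)  # assumed step-size schedule
        p = project_to_simplex(p - eta * losses[t])
    return plays

rng = np.random.default_rng(0)
T, n = 1000, 5
losses = rng.uniform(size=(T, n))  # hypothetical losses in [0, 1]
plays = online_gradient_descent(losses)
ogd_regret = float(np.sum(plays * losses) - losses.sum(axis=0).min())
print(ogd_regret)
```

With this untuned schedule the standard analysis gives regret at most \((1 + n)\sqrt{T}\) for losses in \([0, 1]\), which is sublinear in \(T\) but polynomial in \(n\).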
Multiplicative Weights Update Method
Normalization
Exponential improvement on \(n\)
Optimal
Other methods had clearer "optimization views"
Rediscovered many times in different fields
This one has an optimization view as well!
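The Multiplicative Weights update fits in a few lines. The sketch below assumes losses in \([0, 1]\) and the textbook step-size \(\eta = \sqrt{8 \ln n / T}\), for which the regret is at most \(\sqrt{T \ln n / 2}\); the slides may use different constants.

```python
import numpy as np

def multiplicative_weights(losses, eta):
    """Return the (T, n) array of distributions played by MWU."""
    T, n = losses.shape
    log_w = np.zeros(n)  # log-weights, kept in log space for stability
    plays = np.empty((T, n))
    for t in range(T):
        w = np.exp(log_w - log_w.max())
        plays[t] = w / w.sum()        # normalization
        log_w -= eta * losses[t]      # multiplicative update, in log space
    return plays

rng = np.random.default_rng(0)
T, n = 500, 10
losses = rng.uniform(size=(T, n))     # hypothetical losses in [0, 1]
eta = np.sqrt(8 * np.log(n) / T)      # assumed step-size
plays = multiplicative_weights(losses, eta)
mwu_regret = float(np.sum(plays * losses) - losses.sum(axis=0).min())
bound = float(np.sqrt(T * np.log(n) / 2))
print(mwu_regret, bound)
```

The bound depends on \(n\) only through \(\ln n\), which is the exponential improvement over gradient descent.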
Follow the Regularized Leader
Regularizer
"Stabilizes the algorithm"
"Lazy" Gradient Descent
Multiplicative Weights Update
Good choice of \(R\) depends on the functions and the feasible set
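For the entropic regularizer \(R(p) = \sum_i p_i \ln p_i\) on the simplex, the regularized leader has a closed form, and it coincides with the Multiplicative Weights iterates. The check below is a sketch with a hypothetical step-size and random losses.

```python
import numpy as np

def ftrl_entropy(cumulative_loss, eta):
    """argmin over the simplex of <L, p> + (1/eta) * sum_i p_i ln p_i."""
    z = -eta * cumulative_loss
    w = np.exp(z - z.max())
    return w / w.sum()

rng = np.random.default_rng(0)
T, n, eta = 50, 4, 0.3                 # hypothetical sizes and step-size
losses = rng.uniform(size=(T, n))

p = np.full(n, 1.0 / n)                # MWU iterate, starts uniform
max_gap = 0.0
for t in range(T):
    ftrl_p = ftrl_entropy(losses[:t].sum(axis=0), eta)
    max_gap = max(max_gap, float(np.abs(p - ftrl_p).max()))
    p = p * np.exp(-eta * losses[t])   # MWU update
    p /= p.sum()                       # normalization
print(max_gap)
```

The two sequences agree up to floating-point error, which is one way to see the optimization view of Multiplicative Weights.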
Online Mirror Descent
projection
Online Mirror Descent
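(Diagram: primal and dual spaces, with a Bregman projection back onto the feasible set)
Regularizer
GD: \(R(x) = \tfrac{1}{2} \lVert x \rVert_2^2\)
MWU: \(R(x) = \sum_i x_i \ln x_i\)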
OMD and FTRL are quite general
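One Online Mirror Descent step with the negative-entropy regularizer can be written out explicitly: map to the dual, take a gradient step, map back, and project. The sketch below assumes a strictly positive starting point and a hypothetical step-size, and compares the result with the Multiplicative Weights update.

```python
import numpy as np

def omd_step_entropy(p, loss, eta):
    """One OMD step on the simplex with R(p) = sum_i p_i ln p_i."""
    dual = 1.0 + np.log(p)          # map to the dual space: grad R(p)
    dual = dual - eta * loss        # gradient step in the dual space
    primal = np.exp(dual - 1.0)     # map back: inverse of grad R
    return primal / primal.sum()    # Bregman (KL) projection onto the simplex

p = np.array([0.5, 0.1, 0.3, 0.1])
loss = np.array([1.0, 0.0, 0.5, 0.3])
eta = 0.5                           # hypothetical step-size

omd_next = omd_step_entropy(p, loss, eta)
mwu_next = p * np.exp(-eta * loss)
mwu_next /= mwu_next.sum()
print(omd_next, mwu_next)
```

With \(R(x) = \tfrac{1}{2}\lVert x \rVert_2^2\) the two maps are the identity and the Bregman projection is the Euclidean one, which recovers projected gradient descent.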
Applications
Approximately Solving Zero-Sum Games
Payoff matrix \(A\) of row player
Row player: strategy \(p = (0.1~~0.9)\)
Column player: strategy \(q = (0.3~~0.7)\)
Von Neumann min-max Theorem: \(\max_p \min_q p^T A q = \min_q \max_p p^T A q\)
Row player picks row \(i\) with probability \(p_i\)
Column player picks column \(j\) with probability \(q_j\)
Row player gets \(A_{ij}\)
Column player gets \(-A_{ij}\)
Approximately Solving Zero-Sum Games
Main idea: make each row of \(A\) be an expert
\(p_1 =\) uniform distribution
For \(t = 1, \dotsc, T\)
\(q_t = e_j\) where \(j\) minimizes \(p_t^T A e_j\) (the column player's best response)
Loss vector \(\ell_t\) is the \(j\)-th column of \(-A\)
Get \(p_{t+1}\) via Multiplicative Weights
\(\bar{p} = \tfrac{1}{T} \sum_{t} p_t\) and \(\bar{q} = \tfrac{1}{T} \sum_{t} q_t\)
Thm:
Cor:
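The loop can be sketched as follows, assuming the row player receives \(A_{ij}\) (so the column player best-responds with the column minimizing \(p_t^T A e_j\)), payoffs in \([-1, 1]\), and step-size \(\eta = \sqrt{2 \ln n / T}\). The rock-paper-scissors matrix and the function name `solve_zero_sum` are hypothetical choices for illustration.

```python
import numpy as np

def solve_zero_sum(A, T):
    """Approximate equilibrium strategies via MWU against best responses."""
    n, m = A.shape
    eta = np.sqrt(2 * np.log(n) / T)   # assumed step-size for payoffs in [-1, 1]
    log_w = np.zeros(n)
    p_bar, q_bar = np.zeros(n), np.zeros(m)
    for _ in range(T):
        w = np.exp(log_w - log_w.max())
        p = w / w.sum()
        j = int(np.argmin(p @ A))      # column player's best response to p
        p_bar += p / T
        q_bar[j] += 1.0 / T
        log_w -= eta * (-A[:, j])      # loss vector: j-th column of -A
    return p_bar, q_bar

# Rock-paper-scissors payoffs for the row player (game value 0).
A = np.array([[0.0, -1.0, 1.0],
              [1.0, 0.0, -1.0],
              [-1.0, 1.0, 0.0]])
T = 2000
p_bar, q_bar = solve_zero_sum(A, T)
gap = float((A @ q_bar).max() - (p_bar @ A).min())
print(p_bar, q_bar, gap)
```

The quantity `gap` measures how far the averaged strategies are from an equilibrium; under the assumptions above it is at most the average regret, \(\sqrt{2 \ln n / T}\).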
Boosting
Training set
Hypothesis class
of functions
Weak learner:
such that
Question: Can we get, with high probability, a hypothesis* \(h^*\) such that \(h^*(x_i) \neq y_i\) only on an \(\varepsilon\)-fraction of \(S\)?
Generalization follows if \(\mathcal{H}\) is simple (and other conditions)
Boosting
\(p_1 =\) uniform distribution
For \(t = 1, \dotsc, T\)
\(h_t = \mathrm{WL}(p_t, S)\)
\(\ell_t(i) = 1 - 2|h_t(x_i) - y_i|\)
Get \(p_{t+1}\) via Multiplicative Weights (with the right step-size)
\(\bar{h} = \mathrm{Majority}(h_1, \dotsc, h_T)\)
Theorem
If \(T \geq (2/\gamma^2) \ln(1/\varepsilon)\), then \(\bar{h}\) errs on at most an \(\varepsilon\)-fraction of \(S\)
Main ideas:
Regret only against distributions on examples where \(\bar{h}\) errs
Due to the WL property, the loss of the player is \(\geq 2 T \gamma\)
\(\ln n\) becomes \(\ln (n/\mathrm{\# mistakes})\)
Cost of any distribution of this type is \(\leq 0\)
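A toy run of the scheme is sketched below. Everything specific in it is an assumption for illustration: labels in \(\{0, 1\}\) that form an interval on the line, a weak learner that picks the best threshold classifier (whose weighted error is then at most \(1/3\), so \(\gamma = 1/6\)), and step-size \(\eta = 2\gamma\).

```python
import numpy as np

def weak_learner(p, x, y):
    """Threshold classifier with the smallest p-weighted error on (x, y)."""
    xs = np.sort(x)
    thresholds = np.concatenate(
        ([xs[0] - 1.0], (xs[:-1] + xs[1:]) / 2, [xs[-1] + 1.0]))
    best_err, best_theta, best_flip = np.inf, None, None
    for theta in thresholds:
        for flip in (0, 1):
            pred = (x > theta).astype(int) ^ flip
            err = float(p @ (pred != y))
            if err < best_err:
                best_err, best_theta, best_flip = err, theta, flip
    return lambda z: (z > best_theta).astype(int) ^ best_flip

def boost(x, y, gamma, T):
    """Majority-vote predictions on the training points after T rounds."""
    n = len(x)
    eta = 2 * gamma                       # assumed step-size
    log_w = np.zeros(n)
    votes = np.zeros(n)
    for _ in range(T):
        w = np.exp(log_w - log_w.max())
        p = w / w.sum()
        pred = weak_learner(p, x, y)(x)
        loss = 1 - 2 * np.abs(pred - y)   # +1 if correct, -1 if wrong
        log_w -= eta * loss               # MWU: wrong examples gain weight
        votes += pred
    return (votes > T / 2).astype(int)

x = np.arange(30, dtype=float)
y = ((x >= 10) & (x < 20)).astype(int)    # no single threshold fits this
gamma = 1.0 / 6.0
T = int(np.ceil(2 / gamma**2 * np.log(len(x) + 1)))
mistakes = int(np.sum(boost(x, y, gamma, T) != y))
print(T, mistakes)
```

Taking \(\varepsilon\) just below \(1/n\) in the theorem forces zero training mistakes, which is what the run reports even though each individual threshold classifier errs on many points.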
From Electrical Flows to Maximum Flow
Goal:
Route as much flow as possible from \(s\) to \(t\) while respecting the edges' capacities
We can compute it in time \(O(|V| \cdot |E|)\)
This year there was a paper with an \(O(|E|^{1 + o(1)})\) algorithm...
What if we want something faster, even if only approximate?
From Electrical Flows to Maximum Flow
Fast Laplacian system solvers (Spielman & Teng '13)
Solves \(L x = b\) in \(\tilde{O}(|E|)\) time, where \(L\) is the Laplacian matrix of \(G\)
We can compute electrical flows by solving this system
Electrical flows may not respect edge capacities!
Main idea: Use electrical flows as a "weak learner", and boost it using MWU!
Edges = Experts
Cost = flow/capacity
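To make the "weak learner" concrete, the sketch below computes a unit electrical flow on a small graph by solving the Laplacian system with a dense pseudo-inverse. This is for illustration only: it is not a fast Laplacian solver, and the graph, resistances, and capacities are hypothetical.

```python
import numpy as np

def electrical_flow(n, edges, resistances, s, t):
    """Unit s-t electrical flow; flow[e] > 0 means flow from u to v."""
    m = len(edges)
    B = np.zeros((m, n))                       # edge-vertex incidence matrix
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = 1.0, -1.0
    conductance = 1.0 / np.asarray(resistances, dtype=float)
    L = B.T @ (conductance[:, None] * B)       # Laplacian matrix of the graph
    b = np.zeros(n)
    b[s], b[t] = 1.0, -1.0                     # one unit in at s, out at t
    phi = np.linalg.pinv(L) @ b                # vertex potentials
    return conductance * (B @ phi)             # Ohm's law on each edge

edges = [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]
flow = electrical_flow(4, edges, np.ones(len(edges)), s=0, t=3)

capacity = np.array([1.0, 1.0, 1.0, 0.25, 1.0])  # hypothetical capacities
congestion = np.abs(flow) / capacity              # cost = flow / capacity
print(flow, congestion)
```

In this example edge \((1, 3)\) carries 0.5 units against a capacity of 0.25, so its cost exceeds 1: the electrical flow ignores capacities, and these per-edge costs are what the experts' algorithm would react to.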
Other applications (beyond experts) in TCS
Solving Packing linear systems with oracle access
Approximating multicommodity flow
Approximately solve some semidefinite programs
Spectral graph sparsification
Computational complexity (QIP = PSPACE)
Research Topics
Solving SDPs
Idea: Update \(\mathbf{X}_t\) using \(y\) via Online Learning
Led to many improved approximation algorithms
Similar ideas are still used in many problems in TCS
Goal: Find \(X\) with \(C \bullet X > \alpha\), or a dual solution certifying OPT \(\leq (1 + \delta) \alpha\)
Dual Oracle: Given a candidate primal solution \(\mathbf{X}_t\), find a vector \(y\) that certifies primal infeasibility or objective value \(\leq \alpha\)
Parameter-free and Adaptive Algorithms
Online Learning algorithms usually need knowledge of two parameters to get optimal regret bounds
Lipschitz Continuity constant
Distance to comparator
Can we design algorithms that do not need to know these parameters and still achieve similar performance?
AdaGrad
Coin Betting
Impossible to adapt to both at the same time (in general)
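As an illustration of the coin-betting idea, here is a one-dimensional sketch using the Krichevsky-Trofimov betting fraction, one standard instance; the slides do not say which variant is meant. It assumes gradients in \([-1, 1]\), and no step-size or comparator distance is supplied.

```python
import numpy as np

def kt_coin_betting(gradients, initial_wealth=1.0):
    """1-D parameter-free learner: bet a signed fraction of current wealth."""
    wealth = initial_wealth
    sum_coins = 0.0                    # running sum of coin outcomes -g_s
    plays = []
    for t, g in enumerate(gradients, start=1):
        beta = sum_coins / t           # KT betting fraction, |beta| < 1
        w = beta * wealth              # the point played this round
        plays.append(w)
        wealth -= g * w                # gain or lose the bet
        sum_coins += -g
    return np.array(plays), wealth

# Hypothetical gradients that always point the same way.
plays, wealth = kt_coin_betting(-np.ones(50))
print(plays[:3], wealth)
```

When the gradients keep pointing the same way, the wealth and hence the iterates grow quickly, so the learner reaches far-away comparators without being told how far they are.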
Differential Privacy meets Online Learning
PAC DP-Learning is equivalent to Online Learnability
PAC DP-Learning: finite sample complexity for PAC learning with a differentially private algorithm
Online Learnability: finite Littlestone dimension
Online Learning: an algorithm achieves low regret if it is robust to adversarial data
Differentially Private: an algorithm is robust to small changes to its input
Other Topics
Bandit Feedback
Mirror Descent and Applications
Saddle-Point Optimization and Games
Online Boosting
Non-stochastic Control
Resources
Surveys:
A Modern Introduction to Online Learning - Francesco Orabona
Amazing references and historical discussion
Introduction to Online Convex Optimization - Elad Hazan
Online Learning and Online Convex Optimization - Shai Shalev-Shwartz
Introduction to Online Optimization - Sébastien Bubeck
A bit lighter to read IMO (fewer details and feels shorter)
Great sections on parameter-free OL
Covers other topics not covered by Orabona
A bit old but covers Online Learnability
Some of the nasty details of convex analysis are covered