Victor Sanches Portella
November 2023
Player
Adversary
\(n\) Experts
[Diagram: the player's probability vector over the experts, e.g. \((0.5,\ 0.1,\ 0.3,\ 0.1)\), and the adversary's cost vector, e.g. \((1,\ 0,\ 0.5,\ 0.3)\)]
Player's loss: \(\sum_i p_i \ell_i\) (expected cost under \(p\))
Adversary knows the strategy of the player
Picking a random expert
vs
Picking a probability vector
Total player's loss
Can always be forced to equal \(T\)
Compare with offline optimum
Almost the same as Attempt #1
Restrict the offline optimum
Attempt #1
Attempt #2
Attempt #3
Goal: Player's Loss \(-\) Loss of Best Expert (the regret) is small
Player
Adversary
Player's loss: \(f_t(x_t)\)
Convex
The player sees \(f_t\)
Simplex
Linear functions
Some usual settings:
Experts' problem
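The protocol above can be sketched in code; a minimal sketch for the simplex setting with linear losses (the projection routine, step size, and random loss vectors are illustrative assumptions, not from the slides):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ogd_simplex(loss_vectors, eta):
    """Online projected gradient descent with linear losses f_t(x) = <l_t, x>.

    Start at the uniform distribution; each round, suffer f_t(x_t),
    then take a gradient step and project back onto the simplex.
    """
    n = loss_vectors.shape[1]
    x = np.full(n, 1.0 / n)
    total = 0.0
    for l in loss_vectors:
        total += float(l @ x)             # player's loss this round
        x = project_simplex(x - eta * l)  # step on grad f_t = l_t, project
    return total
```

With \(\eta \sim 1/\sqrt{T}\) this guarantees regret \(O(\sqrt{T})\) against any fixed point of the simplex.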
Traditional ML optimization makes stochastic assumptions on the data
OL strips away the stochastic layer
Fewer assumptions \(\implies\) Weaker guarantees
Fewer assumptions \(\implies\) More robust
Adaptive algorithms
AdaGrad
Adam...?
Parameter Free algorithms
Coin Betting
TCS applications: solving SDPs, learning theory, etc.
Idea: Pick the best expert at each round
where \(i\) minimizes the cumulative loss \(\sum_{s < t} \ell_s(i)\)
Player loses \(T - 1\)
Best expert loses \(T/2\)
Works very well for quadratic losses
* picking distributions instead of best expert
\(\eta_t\): step-size at round \(t\)
\(\ell_t\): loss vector at round \(t\)
Sublinear Regret!
Optimal dependency on \(T\)
Can we improve the dependency on \(n\)?
Yes, and by a lot
Normalization
Exponential improvement on \(n\)
Optimal
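A minimal sketch of the exponential-weights update described above (the random losses and the specific choice \(\eta = \sqrt{2\ln(n)/T}\) are illustrative assumptions):

```python
import numpy as np

def hedge(losses, eta):
    """Multiplicative weights / exponential weights over n experts.

    losses: (T, n) array with losses[t, i] in [0, 1].
    eta:    step size; eta ~ sqrt(ln(n) / T) gives O(sqrt(T ln n)) regret.
    Returns (total player loss, total loss of the best expert).
    """
    T, n = losses.shape
    w = np.ones(n)                           # one weight per expert
    player_loss = 0.0
    for t in range(T):
        p = w / w.sum()                      # play the normalized weights
        player_loss += float(p @ losses[t])  # expected loss this round
        w *= np.exp(-eta * losses[t])        # multiplicative update
    best = float(losses.sum(axis=0).min())
    return player_loss, best

rng = np.random.default_rng(0)
T, n = 1000, 10
eta = np.sqrt(2 * np.log(n) / T)
player, best = hedge(rng.random((T, n)), eta)
# regret = player - best grows like sqrt(T ln n): sublinear in T
```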
Other methods had clearer "optimization views"
Rediscovered many times in different fields
This one has an optimization view as well!
Regularizer
"Stabilizes the algorithm"
"Lazy" Gradient Descent
Multiplicative Weights Update
Good choice of \(R\) depends on the functions and the feasible set
projection
Bregman
Projection
Regularizer
GD:
MWU:
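Both updates fit the same mirror-descent template with different regularizers \(R\); a minimal sketch of the two steps (the Euclidean case omits the projection, an illustrative simplification):

```python
import numpy as np

def md_step_gd(x, g, eta):
    """Mirror descent with R(x) = ||x||^2 / 2: the mirror map is the
    identity, so the step is an ordinary gradient-descent step."""
    return x - eta * g

def md_step_mwu(p, g, eta):
    """Mirror descent with the negative-entropy regularizer
    R(p) = sum_i p_i ln p_i on the simplex: the step is multiplicative,
    and the Bregman projection back onto the simplex is a renormalization."""
    w = p * np.exp(-eta * g)
    return w / w.sum()
```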
Payoff matrix \(A\) of row player
Row player
Column player
Strategy \(p = (0.1~~0.9)\)
Strategy \(q = (0.3~~0.7)\)
von Neumann min-max Theorem:
Row player
picks row \(i\) with probability \(p_i\)
Column player
picks column \(j\) with probability \(q_j\)
Row player
gets \(A_{ij}\)
Column player
gets \(-A_{ij}\)
Main idea: make each row of \(A\) be an expert
For \(t = 1, \dotsc, T\)
\(p_1 =\) uniform distribution
Loss vector \(\ell_t\) is the \(j\)-th col. of \(-A\)
Get \(p_{t+1}\) via Multiplicative Weights
Thm:
Cor:
where \(j\) maximizes the column player's payoff \(p_t^T (-A) e_j\)
\(q_t = e_j\)
\(\bar{p} = \tfrac{1}{T} \sum_{t} p_t\)
\(\bar{q} = \tfrac{1}{T} \sum_{t} q_t\)
and
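The loop above can be sketched as follows; the payoff matrix (matching pennies), horizon, and step size are illustrative assumptions:

```python
import numpy as np

def solve_zero_sum(A, T=2000):
    """Approximate equilibrium of the zero-sum game with row-player payoff A.

    The row player runs Multiplicative Weights on loss vectors -A[:, j];
    the column player best-responds to p_t each round. The time-averaged
    strategies (p_bar, q_bar) form an approximate equilibrium.
    """
    n, m = A.shape
    eta = np.sqrt(np.log(n) / T)
    w = np.ones(n)
    p_sum, q_sum = np.zeros(n), np.zeros(m)
    for _ in range(T):
        p = w / w.sum()
        j = int(np.argmin(p @ A))   # best response: maximizes -p^T A e_j
        w *= np.exp(eta * A[:, j])  # MWU step on the loss vector -A[:, j]
        p_sum += p
        q_sum[j] += 1.0
    return p_sum / T, q_sum / T

# Matching pennies: value 0, unique equilibrium (1/2, 1/2) for both players.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p_bar, q_bar = solve_zero_sum(A)
```

The averaged strategies are within \(O(\sqrt{\ln(n)/T})\) of the value of the game.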
Training set
Hypothesis class
of functions
Weak learner: \(\mathrm{WL}(p, S)\) outputs \(h \in \mathcal{H}\) such that \(\Pr_{i \sim p}[h(x_i) = y_i] \geq \tfrac{1}{2} + \gamma\)
Question: Can we get with high probability a hypothesis* \(h^*\) that errs
only on an \(\varepsilon\)-fraction of \(S\)?
Generalization follows if \(\mathcal{H}\) is simple (and other conditions)
For \(t = 1, \dotsc, T\)
\(p_1 =\) uniform distribution
\(\ell_t(i) = 1 - 2|h_t(x_i) - y_i|\)
Get \(p_{t+1}\) via Multiplicative Weights (with right step-size)
\(h_t = \mathrm{WL}(p_t, S)\)
\(\bar{h} = \mathrm{Majority}(h_1, \dotsc, h_T)\)
Theorem
If \(T \geq (2/\gamma^2) \ln(1/\varepsilon)\), then \(\bar{h}\) makes at most \(\varepsilon\) mistakes in \(S\)
Main ideas:
Regret only against distributions on the examples where \(\bar{h}\) errs
Due to WL property, loss of the player is \(\geq 2 T \gamma\)
\(\ln n\) becomes \(\ln (n/\mathrm{\# mistakes}) \)
Cost of any distribution of this type is \(\leq 0\)
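A minimal sketch of this boosting loop on a toy dataset (the majority-of-3-bits data, coordinate weak learner, step size, and number of rounds are illustrative assumptions):

```python
import numpy as np
from itertools import product

# Toy data: x in {0,1}^3 and label y = majority bit. Each single coordinate
# h_i(x) = x_i is a weak learner here: under any distribution over the
# 8 points, the best of the three coordinates is correct with prob >= 2/3.
X = np.array(list(product([0, 1], repeat=3)))
y = (X.sum(axis=1) >= 2).astype(int)

def weak_learner(p, X, y):
    """Return the coordinate with the highest p-weighted accuracy."""
    accs = [(p * (X[:, i] == y)).sum() for i in range(X.shape[1])]
    return int(np.argmax(accs))

def boost(X, y, T=40, eta=0.2):
    """Boosting via MWU: keep a distribution over examples; examples the
    current weak hypothesis classifies correctly lose weight."""
    w = np.ones(len(X))
    hs = []
    for _ in range(T):
        p = w / w.sum()
        i = weak_learner(p, X, y)        # h_t = WL(p_t, S)
        hs.append(i)
        correct = (X[:, i] == y).astype(float)
        w *= np.exp(-eta * correct)      # MWU step: downweight solved examples
    votes = np.mean([X[:, i] for i in hs], axis=0)
    return (votes >= 0.5).astype(int)    # majority vote of weak hypotheses

h_bar = boost(X, y)
```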
Goal:
Route as much flow as possible from \(s\) to \(t\) while respecting the edges' capacities
We can compute a maximum flow in time \(O(|V| \cdot |E|)\)
This year there was a paper with an \(O(|E|^{1 + o(1)})\) algorithm...
What if we want something faster even if approx.?
Fast Laplacian system solvers (Spielman & Teng '13)
We can compute electrical flows by solving this system
Electrical flows may not respect edge capacities!
Solves \(Lx = b\) in \(\tilde{O}(|E|)\) time, where \(L\) is the Laplacian matrix of \(G\)
Main idea: Use electrical flows as a "weak learner", and boost it using MWU!
Edges = Experts
Cost = flow/capacity
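A minimal sketch of computing an electrical flow by solving the Laplacian system (a dense least-squares solve stands in for a fast \(\tilde{O}(|E|)\) solver; the 4-node graph and unit resistances are illustrative assumptions):

```python
import numpy as np

# Small graph: 4 nodes, unit resistances; route 1 unit of flow from s=0 to t=3.
edges = [(0, 1), (0, 2), (1, 3), (2, 3), (1, 2)]
n = 4
B = np.zeros((len(edges), n))              # edge-vertex incidence matrix
for e, (u, v) in enumerate(edges):
    B[e, u], B[e, v] = 1.0, -1.0
L = B.T @ B                                # graph Laplacian L = B^T B
b = np.zeros(n)
b[0], b[3] = 1.0, -1.0                     # demand vector: s = 0, t = 3
phi = np.linalg.lstsq(L, b, rcond=None)[0] # potentials: solve L phi = b
flow = B @ phi                             # electrical flow on each edge
# By symmetry the flow splits 0.5 / 0.5 across the two s-t paths and the
# cross edge (1, 2) carries no current.
```

In the boosting view, each edge's cost flow/capacity would feed an MWU update over the edges, reweighting the resistances between electrical-flow calls.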
Solving Packing linear systems with oracle access
Approximating multicommodity flow
Approximately solve some semidefinite programs
Spectral graph sparsification
Approximating multicommodity flow
Computational complexity (QIP = PSPACE)
Idea: Update \(\mathbf{X}_t\) using \(y\) via Online Learning
Led to many improved approximation algorithms
Similar ideas are still used in many problems in TCS
Goal
Find \(X\) with \(C \bullet X > \alpha \)
or dual solution certifying OPT \(\leq (1 + \delta) \alpha\)
Dual Oracle: Given a candidate primal solution \(\mathbf{X}_t\), find a vector \(y\) that certifies
primal infeasibility
or
objective value \(\leq \alpha\)
Online Learning algorithms usually need knowledge of two parameters to get optimal regret bounds
Lipschitz Continuity constant
Distance to comparator
Can we design algorithms that do not need to know these parameters and still achieve similar performance?
AdaGrad
CoinBetting
Impossible to adapt to both at the same time (in general)
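One standard parameter-free construction is coin betting via the Krichevsky-Trofimov estimator; a minimal 1-D sketch (the interface and variable names are illustrative assumptions):

```python
import numpy as np

def kt_coin_betting(gs):
    """Krichevsky-Trofimov coin betting: a parameter-free 1-D online learner.

    gs: sequence of outcomes g_t in [-1, 1] (e.g. negative gradients).
    No step size is tuned: the betting fraction adapts to the data.
    """
    wealth = 1.0
    g_sum = 0.0                   # running sum of past outcomes
    bets = []
    for t, g in enumerate(gs, start=1):
        beta = g_sum / t          # KT betting fraction
        w = beta * wealth         # bet (= the prediction at round t)
        bets.append(w)
        wealth += g * w           # gain/lose proportionally to the outcome
        g_sum += g
    return np.array(bets), wealth
```

On a biased "coin" (e.g. all outcomes \(+1\)) the wealth grows nearly exponentially, which is exactly what translates into regret guarantees without tuning.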
PAC DP-Learning
is equivalent to
Online Learnability
Finite sample complexity for PAC Learning with a differentially private algorithm
Finite Littlestone Dimension
An algorithm achieves low regret if it is robust to adversarial data
Online Learning
An algorithm is robust to small changes to its input
Differentially Private
Bandit Feedback
Mirror Descent and Applications
Saddle-Point Optimization and Games
Online Boosting
Non-stochastic Control
Surveys:
A Modern Introduction to Online Learning - Francesco Orabona
Amazing references and historical discussion
Introduction to Online Convex Optimization - Elad Hazan
Online Learning and Online Convex Optimization
- Shai Shalev-Shwartz
Introduction to Online Optimization - Sébastien Bubeck
A bit lighter to read IMO (less details and feels shorter)
Great sections on parameter-free OL
Covers other topics not covered by Orabona
A bit old but covers Online Learnability
Some of the nasty details of convex analysis are covered