Tour of Online Learning via Prediction with Experts' Advice

Victor Sanches Portella

November 2023

Experts' Problem and Online Learning

Prediction with Experts' Advice

Player

picks a probability vector over the \(n\) experts, e.g. \(x_t = (0.5,~0.1,~0.3,~0.1)\)

Adversary

picks a cost vector with one cost per expert, e.g. \(\ell_t = (1,~0,~0.5,~0.3)\)

Player's loss:

\langle \ell_t, x_t \rangle
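Plugging in the example vectors above (pairing entries in the order shown), the player's loss this round works out to

\displaystyle \langle \ell_t, x_t \rangle = 1 \cdot 0.5 + 0 \cdot 0.1 + 0.5 \cdot 0.3 + 0.3 \cdot 0.1 = 0.68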

Adversary knows the strategy of the player

Picking a random expert

vs

Picking a probability vector

Measuring the Player's Performance

Attempt #1: total player's loss

\displaystyle \sum_{t = 1}^T \langle \ell_t, x_t \rangle

Can be \(= T\) always

Attempt #2: compare with the offline optimum

\displaystyle \sum_{t = 1}^T \langle \ell_t, x_t \rangle - \sum_{t = 1}^T \min_{i = 1, \dotsc, n} \ell_t(i)

Almost the same as Attempt #1

Attempt #3: restrict the offline optimum

\displaystyle \mathrm{Regret}(T) = \underbrace{\sum_{t = 1}^T \langle \ell_t, x_t \rangle}_{\text{Player's loss}} - \underbrace{\min_{i = 1, \dotsc, n} \sum_{t = 1}^T \ell_t(i)}_{\text{Loss of best expert}}

Goal:

\displaystyle \frac{\mathrm{Regret}(T) }{T} \to 0

Example

Cumulative loss of each expert after each round:

          Expert 1   Expert 2   Expert 3   Expert 4
t = 1        0          1         0.5         1
t = 2        1         1.5        0.5         1
t = 3       1.5         2          1         1.5
t = 4       2.5         3          2         1.5


General Online Learning

Player

picks \(x_t \in \mathcal{X}\)

Adversary

picks a convex function \(f_t(\cdot)\)

Player's loss: \(f_t(x_t)\)

The player sees \(f_t\)

Some usual settings:

Experts' problem: \(\mathcal{X} = \Delta_n\) (simplex) and \(f_t(x) = \langle \ell_t, x \rangle\) (linear functions)

\(\mathcal{X} = \mathbb{R}^n\) or \(\mathcal{X} = \mathrm{Ball}\), with \(f_t(x) = \lVert Ax - b \rVert^2\) or \(f_t(x) = - \log \langle a, x \rangle\)

Why Online Learning?

Traditional ML optimization makes stochastic assumptions on the data

OL strips away the stochastic layer

Fewer assumptions \(\implies\) weaker guarantees

Fewer assumptions \(\implies\) more robust

Adaptive algorithms

AdaGrad

Adam...?

Parameter-free algorithms

Coin Betting

TCS applications: solving SDPs, learning theory, etc.

Algorithms

Follow the Leader

Idea: Pick the best expert at each round

x_t = e_i = \begin{pmatrix}0\\ 0 \\ 1 \\ 0 \end{pmatrix}

where \(i\) minimizes

\displaystyle \sum_{s = 1}^{t-1} \ell_s(i)

Bad example with 2 experts: the loss vectors alternate between \((0~~1)\) and \((1~~0)\), so (with adversarial tie-breaking) FTL keeps picking the expert that is about to incur loss 1.

Player loses \(T -1\)

Best expert loses \(T/2\)

Works very well for quadratic losses

* picking distributions instead of best expert
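A minimal sketch reproducing this bad example (my own code, not from the slides): with ties broken adversarially, FTL here incurs loss in every round (the \(T-1\) count on the slide corresponds to a slightly different tie-breaking or first loss vector), while the best expert loses only half the rounds.

```python
import numpy as np

def ftl_on_alternating_losses(T=8):
    """Follow the Leader with 2 experts and alternating loss vectors.

    Losses alternate between (0, 1) and (1, 0).  Ties are broken toward the
    second expert, an adversarial (but legal) choice that makes FTL pick the
    expert about to incur loss 1 in every round.
    """
    cumulative = np.zeros(2)      # cumulative loss of each expert
    player_loss = 0.0
    for t in range(T):
        # FTL: play the expert with the smallest cumulative loss so far.
        # np.argmin breaks ties toward index 0, so we look at the reversed
        # array to break ties toward the second expert instead.
        leader = 1 - np.argmin(cumulative[::-1])
        loss_vector = np.array([0.0, 1.0]) if t % 2 == 0 else np.array([1.0, 0.0])
        player_loss += loss_vector[leader]
        cumulative += loss_vector
    return player_loss, cumulative.min()

if __name__ == "__main__":
    player, best = ftl_on_alternating_losses(T=8)
    print(f"FTL loss: {player}, best expert loss: {best}")  # 8.0 vs 4.0
```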

Gradient Descent

\displaystyle x_{t+1} = \mathrm{Proj}_{\Delta_n} (x_t - \eta_t \ell_t)

\(\eta_t\): step-size at round \(t\)

\(\ell_t\): loss vector at round \(t\) (\(= \nabla f_t(x)\) for the linear losses of the experts' problem)

\displaystyle \mathrm{Regret}(T) \leq \sqrt{2 T n}

Sublinear Regret!

Optimal dependency on \(T\)
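A minimal sketch of this update, assuming losses in \([0,1]^n\) and using the standard sorting-based Euclidean projection onto the simplex (my own code, not from the slides):

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    n = len(v)
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    # Largest k such that u_k + (1 - sum of k largest entries) / k > 0
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, n + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def online_gradient_descent(losses, eta):
    """Projected online gradient descent on the simplex.

    losses: array of shape (T, n), one loss vector per round.
    Returns the total player loss sum_t <l_t, x_t>.
    """
    T, n = losses.shape
    x = np.full(n, 1.0 / n)                       # start at the uniform distribution
    total = 0.0
    for ell in losses:
        total += ell @ x                          # player's loss this round
        x = project_to_simplex(x - eta * ell)     # gradient step + projection
    return total

# Toy usage: 4 experts, random binary losses, eta ~ sqrt(2 / (T n)).
rng = np.random.default_rng(0)
losses = rng.integers(0, 2, size=(100, 4)).astype(float)
print(online_gradient_descent(losses, eta=np.sqrt(2 / (100 * 4))))
```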

Can we improve the dependency on \(n\)?

Yes, and by a lot

Multiplicative Weights Update Method

\displaystyle x_{t+1}(i) \propto x_t(i) \cdot e^{- \eta \ell_t(i)}

or

\displaystyle x_{t+1}(i) \propto x_t(i)\cdot (1 - \eta \ell_t(i))

(\(\propto\) hides the normalization that keeps \(x_{t+1} \in \Delta_n\))

\displaystyle \mathrm{Regret}(T) \leq \sqrt{2 T \ln n}

Exponential improvement on \(n\)

Optimal
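A minimal sketch of the exponential-weights variant (my own code, not from the slides), with the step size \(\eta = \sqrt{2 \ln n / T}\) that matches the bound above under losses in \([0,1]\):

```python
import numpy as np

def multiplicative_weights(losses):
    """Multiplicative Weights Update (exponential weights) on the simplex.

    losses: array of shape (T, n) with entries in [0, 1].
    Returns (total player loss, loss of the best expert).
    """
    T, n = losses.shape
    eta = np.sqrt(2 * np.log(n) / T)   # step size matching the sqrt(2 T ln n) bound
    weights = np.ones(n)
    total = 0.0
    for ell in losses:
        x = weights / weights.sum()    # normalize the weights into a distribution
        total += ell @ x
        weights *= np.exp(-eta * ell)  # exponential (multiplicative) update
    best = losses.sum(axis=0).min()
    return total, best

rng = np.random.default_rng(0)
losses = rng.random((1000, 16))
total, best = multiplicative_weights(losses)
print(total - best, np.sqrt(2 * 1000 * np.log(16)))  # realized regret vs. the bound
```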

Other methods had clearer "optimization views"

Rediscovered many times in different fields

This one has an optimization view as well!

Follow the Regularized Leader

p_{t+1} = \displaystyle \operatorname*{arg\,min}_{p \in \Delta_n} \; \sum_{s = 1}^t \langle \ell_s, p \rangle + R(p)

Regularizer

"Stabilizes the algorithm"

R(x) = \frac{1}{2\eta} \lVert x \rVert_2^2
\implies

"Lazy" Gradient Descent

R(x) = \sum_i x_i \ln x_i
\implies

Multiplicative Weights Update

Good choice of \(R\) depends on the functions and the feasible set
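A quick sanity check of the entropy case (scaling the regularizer by \(1/\eta\), which the slides leave implicit): solving the FTRL problem with a Lagrange multiplier for the simplex constraint gives exactly the MWU update.

\displaystyle p_{t+1} = \operatorname*{arg\,min}_{p \in \Delta_n} \; \sum_{s = 1}^t \langle \ell_s, p \rangle + \frac{1}{\eta} \sum_i p_i \ln p_i
\implies
\displaystyle p_{t+1}(i) \propto \exp\Big(- \eta \sum_{s = 1}^t \ell_s(i)\Big) \propto p_t(i) \, e^{- \eta \ell_t(i)}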

Online Mirror Descent

[Figure: gradient step from \(x_t\) along \(- \nabla f_t(x_t)\), followed by a projection back onto \(\mathcal{X}\), giving \(x_{t+1}\).]

Online Mirror Descent

[Figure: map \(x_t\) from the primal to the dual space via \(\nabla R\), take the step \(- \nabla f_t(x_t)\) there, map back via \((\nabla R)^{-1}\), and apply the Bregman projection \(\Pi_{\mathcal{X}}^R\) onto \(\mathcal{X}\) to obtain \(x_{t+1}\).]

Regularizer

GD: \(R(x) = \frac{1}{2} \lVert x \rVert_2^2\)

MWU: \(R(x) = \sum_i x_i \ln x_i\)

OMD and FTRL are quite general

Applications

Approximately Solving Zero-Sum Games

\begin{pmatrix} -1 & 0.7 \\ 1 & -0.5 \end{pmatrix}

Payoff matrix \(A\) of the row player

Row player: strategy \(p = (0.1~~0.9)\); picks row \(i\) with probability \(p_i\) and gets \(A_{ij}\)

Column player: strategy \(q = (0.3~~0.7)\); picks column \(j\) with probability \(q_j\) and gets \(-A_{ij}\)

Expected payoff of the row player:

\mathbb{E}[A_{ij}] = p^{T} A q

von Neumann min-max Theorem:

\displaystyle \max_p \min_q p^T A q = \min_q \max_p p^T A q = \mathrm{OPT}

Approximately Solving Zero-Sum Games

Main idea: make each row of \(A\) an expert

\(p_1 =\) uniform distribution

For \(t = 1, \dotsc, T\)

\(q_t = e_j\), where \(j\) minimizes \(p_t^T A e_j\) (equivalently, maximizes the column player's payoff \(-p_t^T A e_j\))

Loss vector \(\ell_t\) is the \(j\)-th col. of \(-A\)

Get \(p_{t+1}\) via Multiplicative Weights

Let \(\bar{p} = \tfrac{1}{T} \sum_{t} p_t\) and \(\bar{q} = \tfrac{1}{T} \sum_{t} q_t\)

Thm:

\displaystyle \mathrm{OPT} - \tfrac{1}{T}\mathrm{Regret}(T) \leq \bar{p}^T A \bar{q} \leq \mathrm{OPT} + \tfrac{1}{T}\mathrm{Regret}(T)

Cor:

\displaystyle T \geq \frac{2 \ln n}{\varepsilon^2} \implies \mathrm{OPT} - \varepsilon \leq \bar{p}^T A \bar{q} \leq \mathrm{OPT} + \varepsilon
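A minimal sketch of this loop on the \(2 \times 2\) payoff matrix from the previous slide (my own code, not from the slides; the step size is a heuristic choice, since the losses here live in \([-1, 1]\) rather than \([0, 1]\)):

```python
import numpy as np

def solve_zero_sum_game(A, T=2000):
    """Approximate the value of the zero-sum game with payoff matrix A
    (payoffs of the row player) by running MWU over the rows."""
    n, m = A.shape
    eta = np.sqrt(np.log(n) / T)       # heuristic step size; losses lie in [-1, 1]
    weights = np.ones(n)
    p_bar = np.zeros(n)
    q_bar = np.zeros(m)
    for _ in range(T):
        p = weights / weights.sum()
        j = np.argmin(p @ A)           # column player's best response to p
        loss = -A[:, j]                # loss vector for the row experts
        weights *= np.exp(-eta * loss)
        p_bar += p / T
        q_bar += np.eye(m)[j] / T
    return p_bar, q_bar, p_bar @ A @ q_bar

A = np.array([[-1.0, 0.7],
              [ 1.0, -0.5]])
p_bar, q_bar, value = solve_zero_sum_game(A)
print(p_bar, q_bar, value)             # value should be close to the game's OPT
```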

Boosting

Training set

\displaystyle S = \{(x_1, y_1), (x_2, y_2), \dotsc, (x_n, y_n)\}

Hypothesis class

\displaystyle \mathcal{H}

of functions

\displaystyle \mathcal{X} \to \{0,1\}
\displaystyle x_i \in \mathcal{X}
\displaystyle y_i \in \{0,1\}

Weak learner:

\displaystyle \mathrm{WL}(p, S) = h \in \mathcal{H}

such that

\displaystyle \mathbb{P}_{i \sim p}[h(x_i) = y_i] \geq \frac{1}{2} + \gamma

Question: Can we get with high probability a hypothesis* \(h^*\) such that

\displaystyle h^*(x_i) \neq y_i

only on an \(\varepsilon\)-fraction of \(S\)?

Generalization follows if \(\mathcal{H}\) is simple (and other conditions)

Boosting

\(p_1 =\) uniform distribution

For \(t = 1, \dotsc, T\)

\(h_t = \mathrm{WL}(p_t, S)\)

\(\ell_t(i) = 1 - 2|h_t(x_i) - y_i|\)

Get \(p_{t+1}\) via Multiplicative Weights (with the right step-size)

\(\bar{h} = \mathrm{Majority}(h_1, \dotsc, h_T)\)

Theorem

If \(T \geq (2/\gamma^2) \ln(1/\varepsilon)\), then \(\bar{h}\) errs on at most an \(\varepsilon\)-fraction of \(S\)

Main ideas:

Regret only against distributions on examples on which \(\bar{h}\) errs

Due to the WL property, the loss of the player is \(\geq 2 T \gamma\)

Cost of any distribution of this type is \(\leq 0\)

\(\ln n\) becomes \(\ln (n/\mathrm{\# mistakes}) \)
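A minimal sketch of this boosting loop (my own code, not from the slides); `weak_learner` is a hypothetical stand-in for \(\mathrm{WL}(p, S)\), assumed to return a classifier mapping inputs to \(\{0,1\}\), and the step size \(\eta\) is left as a parameter (the "right step-size" in the theorem depends on \(\gamma\)).

```python
import numpy as np

def boost_via_mwu(X, y, weak_learner, T, eta):
    """Boosting by running Multiplicative Weights over the n training examples.

    weak_learner(p, X, y) should return a classifier h with h(X) in {0,1}^n
    that is correct with probability >= 1/2 + gamma under the distribution p.
    """
    n = len(y)
    weights = np.ones(n)
    hypotheses = []
    for _ in range(T):
        p = weights / weights.sum()            # current distribution over examples
        h = weak_learner(p, X, y)
        hypotheses.append(h)
        # Loss is +1 on examples h classifies correctly and -1 on mistakes,
        # so MWU shifts weight toward the examples the weak learners get wrong.
        loss = 1 - 2 * np.abs(h(X) - y)
        weights = weights * np.exp(-eta * loss)

    def majority_vote(X_query):
        """Final hypothesis: majority vote of the T weak hypotheses."""
        votes = np.mean([h(X_query) for h in hypotheses], axis=0)
        return (votes >= 0.5).astype(int)

    return majority_vote
```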

From Electrical Flows to Maximum Flow

Goal:

Route as much flow as possible from \(s\) to \(t\) while respecting the edges' capacities

We can compute a maximum flow in time \(O(|V| \cdot |E|)\)

This year there was a paper with an \(O(|E|^{1 + o(1)})\) algorithm...

What if we want something faster, even if approximate?

From Electrical Flows to Maximum Flow

Fast Laplacian system solvers (Spielman & Teng '13)

Solve \(L x = b\) in \(\tilde{O}(|E|)\) time, where \(L\) is the Laplacian matrix of \(G\)

We can compute electrical flows by solving this system

Electrical flows may not respect edge capacities!

Main idea: Use electrical flows as a "weak learner", and boost it using MWU!

Edges = Experts

Cost = flow/capacity

Other applications (beyond experts) in TCS

Solving packing linear programs with oracle access

Approximating multicommodity flow

Approximately solving some semidefinite programs

Spectral graph sparsification

Computational complexity (QIP = PSPACE)

Research Topics

Solving SDPs

Goal

Find \(X\) with \(C \bullet X > \alpha \)

or a dual solution certifying OPT \(\leq (1 + \delta) \alpha\)

Dual Oracle: Given a candidate primal solution \(\mathbf{X}_t\), find a vector \(y\) that certifies

primal infeasibility

or

objective value \(\leq \alpha\)

Idea: Update \(\mathbf{X}_t\) using \(y\) via Online Learning

Led to many improved approximation algorithms

Similar ideas are still used in many problems in TCS

Parameter-free and Adaptive Algorithms

Online Learning algorithms usually need knowledge of two parameters to get optimal regret bounds:

\(L\): Lipschitz continuity constant

\(D\): distance to the comparator

Can we design algorithms that do not need to know these parameters and still achieve similar performance?

AdaGrad

CoinBetting

Impossible to adapt to both at the same time (in general)

Differential Privacy meets Online Learning

PAC DP-Learning

is equivalent to

Online Learnability

Finite sample complexity for PAC Learning with a differentially private algorithm

Finite Littlestone Dimension

An Online Learning algorithm achieves low regret if it is robust to adversarial data

A Differentially Private algorithm is robust to small changes to its input

Other Topics

Bandit Feedback

Mirror Descent and Applications

Saddle-Point Optimization and Games

Online Boosting

Non-stochastic Control

Resources

Surveys:

A Modern Introduction to Online Learning - Francesco Orabona

Amazing references and historical discussion

Introduction to Online Convex Optimization - Elad Hazan

Online Learning and Online Convex Optimization - Shai Shalev-Shwartz

Introduction to Online Optimization - Sébastien Bubeck

A bit lighter to read IMO (fewer details and feels shorter)

Great sections on parameter-free OL

Covers other topics not covered by Orabona

A bit old but covers Online Learnability

Some of the nasty details of convex analysis are covered
