## Tour of Online Learning via Prediction with Experts' Advice

Victor Sanches Portella

November 2023

## Experts' Problem and Online Learning

At each round, the Player assigns probabilities to $$n$$ experts, i.e., picks a probability vector $$x_t \in \Delta_n$$, e.g.

\displaystyle x_t = (0.5,\ 0.1,\ 0.3,\ 0.1)

An adversary then reveals a vector of costs, e.g.

\displaystyle \ell_t = (1,\ 0,\ 0.5,\ 0.3)

Player's loss:

\langle \ell_t, x_t \rangle

The adversary knows the strategy of the player, so picking a random expert and picking a probability vector are equivalent (in expectation).

### Measuring Player's Performance

Attempt #1: the total player's loss

\displaystyle \sum_{t = 1}^T \langle \ell_t, x_t \rangle

Can be $$= T$$ always, so it says little on its own.

Attempt #2: compare with the offline optimum

\displaystyle \sum_{t = 1}^T \langle \ell_t, x_t \rangle - \sum_{t = 1}^T \min_{i = 1, \dotsc, n} \ell_t(i)

Almost the same as Attempt #1: the per-round optimum is too strong a benchmark.

Attempt #3: restrict the offline optimum to a single fixed expert

\displaystyle \mathrm{Regret}(T) = \underbrace{\sum_{t = 1}^T \langle \ell_t, x_t \rangle}_{\text{Player's loss}} - \underbrace{\min_{i = 1, \dotsc, n} \sum_{t = 1}^T \ell_t(i)}_{\text{Loss of best expert}}

Goal:

\displaystyle \frac{\mathrm{Regret}(T)}{T} \to 0

Example with four experts (entries are cumulative losses):

| | Expert 1 | Expert 2 | Expert 3 | Expert 4 |
|---|---|---|---|---|
| $$t = 1$$ | 0 | 1 | 0.5 | 1 |
| $$t = 2$$ | 1 | 1.5 | 0.5 | 1 |
| $$t = 3$$ | 1.5 | 2 | 1 | 1.5 |
| $$t = 4$$ | 2.5 | 3 | 2 | 1.5 |

After $$T = 4$$ rounds the loss of the best expert is $$1.5$$ (Expert 4).

### General Online Learning

At each round, the Player picks a point $$x_t \in \mathcal{X}$$, where $$\mathcal{X}$$ is a convex feasible set. The Player then sees the convex function $$f_t$$.

Player's loss:

f_t(x_t)

The experts' problem is the special case $$\mathcal{X} = \Delta_n$$ (the simplex) with linear functions $$f_t(x) = \langle \ell_t, x \rangle$$.

Some usual settings:

- $$\mathcal{X} = \mathbb{R}^n$$, or $$\mathcal{X} = \mathrm{Ball}$$
- $$f_t(x) = \lVert Ax - b \rVert^2$$
- $$f_t(x) = - \log \langle a, x \rangle$$

### Why Online Learning?

Traditional ML optimization makes stochastic assumptions on the data; OL strips away the stochastic layer.

Fewer assumptions $$\implies$$ weaker guarantees

Fewer assumptions $$\implies$$ more robust

Related topics and applications: parameter-free algorithms, coin betting, TCS applications (solving SDPs, learning theory, etc.).

## Algorithms

Idea: pick the best expert at each round (Follow the Leader):

\displaystyle x_t = e_i = \begin{pmatrix}0\\ 0 \\ 1 \\ 0 \end{pmatrix}

where $$i$$ minimizes

\displaystyle \sum_{s = 1}^{t-1} \ell_s(i)

This fails in the worst case. With two experts whose costs alternate,

| | Expert 1 | Expert 2 |
|---|---|---|
| $$t = 1$$ | 0 | 1 |
| $$t = 2$$ | 1 | 0 |
| $$t = 3$$ | 0 | 1 |
| $$t = 4$$ | 1 | 0 |

the leader (with adversarial tie-breaking) is always the expert about to suffer cost 1. The Player loses $$T - 1$$ while the best expert loses $$T/2$$, so the regret is linear in $$T$$. (Follow the Leader does work very well for quadratic losses, though.)
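The bad example above is easy to reproduce, since the adversary knows the player's strategy. Below is a minimal sketch (function names are mine, not from the talk): the adversary simply charges cost 1 to whichever expert the deterministic Follow-the-Leader rule is about to pick, so the player pays every round while the cumulative losses split evenly between the two experts.

```python
import numpy as np

def ftl_pick(cum_losses):
    # Follow the Leader: pick the expert with the smallest cumulative loss,
    # breaking ties by lowest index.
    return int(np.argmin(cum_losses))

def adversarial_ftl_run(T, n=2):
    # The adversary knows the (deterministic) strategy: each round it charges
    # cost 1 to exactly the expert FTL is about to pick and 0 to the rest.
    cum = np.zeros(n)
    player_loss = 0.0
    for _ in range(T):
        i = ftl_pick(cum)
        player_loss += 1.0   # the picked expert is always the one charged
        cum[i] += 1.0
    return player_loss, cum.min()

player, best = adversarial_ftl_run(100)
print(player, best)   # the player pays every round; the best expert pays T/2
```

This matches the linear-regret separation above (up to one round): player loss $$T$$ versus best-expert loss $$T/2$$.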

Fix: pick distributions instead of committing to a single best expert, via projected gradient descent:

\displaystyle x_{t+1} = \mathrm{Proj}_{\Delta_n} (x_t - \eta_t \ell_t)

where $$\eta_t$$ is the step-size at round $$t$$ and $$\ell_t$$ is the loss vector at round $$t$$; for linear losses, $$\ell_t = \nabla f_t(x)$$.

With a suitable step-size,
\displaystyle \mathrm{Regret}(T) \leq \sqrt{2 T n}

Sublinear Regret!

Optimal dependency on $$T$$

Can we improve the dependency on $$n$$?

Yes, and by a lot
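As a concrete illustration, here is a minimal sketch of projected online gradient descent for the experts' problem (all names are mine; the simplex projection is the standard sort-based routine):

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex
    # (standard sort-based algorithm, O(n log n)).
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1.0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def ogd_experts(losses, eta):
    # Projected online gradient descent on the simplex:
    # x_{t+1} = Proj(x_t - eta * l_t); for linear losses the gradient is l_t.
    T, n = losses.shape
    x = np.full(n, 1.0 / n)
    total = 0.0
    for t in range(T):
        total += losses[t] @ x
        x = project_simplex(x - eta * losses[t])
    return total

rng = np.random.default_rng(0)
losses = rng.random((100, 2))   # T = 100 rounds, n = 2 experts
regret = ogd_experts(losses, eta=0.1) - losses.sum(axis=0).min()
print(regret)   # comfortably below sqrt(2 * T * n) ~ 20
```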

### Multiplicative Weights Update Method

\displaystyle x_{t+1}(i) \propto x_t(i) \cdot e^{- \eta \ell_t(i)} \approx x_t(i)\cdot (1 - \eta \ell_t(i))

(followed by normalization, so that $$x_{t+1}$$ is a probability vector)

\displaystyle \mathrm{Regret}(T) \leq \sqrt{2 T \ln n}

Exponential improvement on $$n$$

Optimal

Rediscovered many times in different fields. Other methods had clearer "optimization views", but this one has an optimization view as well!

\displaystyle p_{t+1} = \mathop{\mathrm{arg\,min}}_{p \in \Delta_n} \; \sum_{s = 1}^t \langle\ell_s, p \rangle + R(p)

where $$R$$ is a regularizer that "stabilizes the algorithm":

\displaystyle R(x) = \frac{1}{2\eta} \lVert x \rVert_2^2 \implies \text{projected gradient descent}

\displaystyle R(x) = \frac{1}{\eta}\sum_i x_i \ln x_i \implies \text{Multiplicative Weights Update}

Good choice of $$R$$ depends on the functions and the feasible set
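A minimal sketch of MWU on the experts' problem (names mine), using the step-size $$\eta = \sqrt{2 \ln n / T}$$ behind the $$\sqrt{2 T \ln n}$$ bound for losses in $$[0,1]$$:

```python
import numpy as np

def mwu(losses, eta):
    # Multiplicative Weights Update: keep a weight per expert,
    # w(i) *= exp(-eta * l_t(i)), and normalize to get the distribution x_t.
    T, n = losses.shape
    w = np.ones(n)
    total = 0.0
    for t in range(T):
        x = w / w.sum()
        total += losses[t] @ x
        w *= np.exp(-eta * losses[t])
    return total

rng = np.random.default_rng(1)
T, n = 200, 10
losses = rng.random((T, n))
eta = np.sqrt(2 * np.log(n) / T)   # step-size behind the bound above
regret = mwu(losses, eta) - losses.sum(axis=0).min()
print(regret)   # well below sqrt(2 * T * ln n) ~ 30
```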

### Online Mirror Descent

One step of Online Mirror Descent from $$x_t$$ to $$x_{t+1}$$:

1. Map the primal point to the dual space: $$x_t \mapsto \nabla R(x_t)$$
2. Take the gradient step $$- \eta \nabla f_t(x_t)$$ in the dual space
3. Map back to the primal via $$\nabla R^{-1}$$
4. Apply the Bregman projection $$\Pi_{\mathcal{X}}^R$$ onto $$\mathcal{X}$$ to obtain $$x_{t+1}$$

The regularizer determines the geometry:

GD: $$R(x) = \frac{1}{2} \lVert x \rVert_2^2$$ (the dual map is the identity and the Bregman projection is the Euclidean one)

MWU: $$R(x) = \sum_i x_i \ln x_i$$
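One can sanity-check numerically that the entropic regularizer reproduces the MWU update; a sketch with my own function names:

```python
import numpy as np

def omd_entropy_step(x, grad, eta):
    # OMD with regularizer R(x) = sum_i x_i ln x_i: map to the dual via
    # grad R, take the gradient step there, map back with (grad R)^{-1},
    # then Bregman-(KL-)project onto the simplex, which is just normalization.
    y = np.log(x) - eta * grad   # dual step (constants in grad R cancel)
    z = np.exp(y)                # back to the primal
    return z / z.sum()           # KL projection onto the simplex

def mwu_step(x, loss, eta):
    w = x * np.exp(-eta * loss)
    return w / w.sum()

x = np.array([0.4, 0.3, 0.3])
loss = np.array([1.0, 0.0, 0.5])
print(np.allclose(omd_entropy_step(x, loss, 0.1), mwu_step(x, loss, 0.1)))  # True
```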

## Applications

### Approximately Solving Zero-Sum Games

Payoff matrix $$A$$ of the row player, e.g.

\displaystyle A = \begin{pmatrix} -1 & 0.7 \\ 1 & -0.5 \end{pmatrix}

The row player picks row $$i$$ with probability $$p_i$$ (strategy $$p = (0.1~~0.9)$$, say); the column player picks column $$j$$ with probability $$q_j$$ (strategy $$q = (0.3~~0.7)$$, say). The row player gets $$A_{ij}$$ and the column player gets $$-A_{ij}$$, so

\mathbb{E}[A_{ij}] = p^{T} A q

Von Neumann's min-max theorem:

\displaystyle \max_p \min_q p^T A q = \min_q \max_p p^T A q = \mathrm{OPT}

### Approximately Solving Zero-Sum Games

Main idea: make each row of $$A$$ be an expert.

$$p_1 =$$ uniform distribution

For $$t = 1, \dotsc, T$$:

- $$q_t = e_j$$, where $$j$$ minimizes $$p_t^T A e_j$$ (the column player best-responds to $$p_t$$)
- The loss vector $$\ell_t$$ is the $$j$$-th column of $$-A$$
- Get $$p_{t+1}$$ via Multiplicative Weights

Let $$\bar{p} = \tfrac{1}{T} \sum_{t} p_t$$ and $$\bar{q} = \tfrac{1}{T} \sum_{t} q_t$$.

Thm:

\displaystyle \mathrm{OPT} - \tfrac{1}{T}\mathrm{Regret}(T) \leq \bar{p}^T A \bar{q} \leq \mathrm{OPT} + \tfrac{1}{T}\mathrm{Regret}(T)

Cor:

\displaystyle T \geq \frac{2 \ln n}{\varepsilon^2} \implies \mathrm{OPT} - \varepsilon \leq \bar{p}^T A \bar{q} \leq \mathrm{OPT} + \varepsilon
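Putting the loop together for the $$2 \times 2$$ matrix above (a sketch with my own names; for this particular $$A$$, equalizing the two rows and columns shows the game value is $$1/16 = 0.0625$$):

```python
import numpy as np

A = np.array([[-1.0, 0.7], [1.0, -0.5]])   # row player's payoff matrix

def solve_zero_sum(A, T, eta):
    # Row player runs MWU over the rows; column player best-responds each round.
    n_rows, n_cols = A.shape
    w = np.ones(n_rows)
    p_avg = np.zeros(n_rows)
    q_avg = np.zeros(n_cols)
    for _ in range(T):
        p = w / w.sum()
        j = int(np.argmin(p @ A))          # column player's best response
        w *= np.exp(-eta * (-A[:, j]))     # loss vector = j-th column of -A
        p_avg += p
        q_avg[j] += 1.0
    return p_avg / T, q_avg / T

p_bar, q_bar = solve_zero_sum(A, T=20000, eta=0.01)
print(p_bar @ A @ q_bar)   # close to the game value
```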

### Boosting

Training set

\displaystyle S = \{(x_1, y_1), (x_2, y_2), \dotsc, (x_n, y_n)\}

with $$x_i \in \mathcal{X}$$ and $$y_i \in \{0,1\}$$, and a hypothesis class $$\mathcal{H}$$ of functions $$\mathcal{X} \to \{0,1\}$$.

Weak learner:

\displaystyle \mathrm{WL}(p, S) = h \in \mathcal{H}

such that

\displaystyle \mathbb{P}_{i \sim p}[h(x_i) = y_i] \geq \frac{1}{2} + \gamma

Question: can we get, with high probability, a hypothesis $$h^*$$ such that

\displaystyle h^*(x_i) \neq y_i

only on an $$\varepsilon$$-fraction of $$S$$?

Generalization follows if $$\mathcal{H}$$ is simple (and other conditions hold).

### Boosting

$$p_1 =$$ uniform distribution

For $$t = 1, \dotsc, T$$:

- $$h_t = \mathrm{WL}(p_t, S)$$
- $$\ell_t(i) = 1 - 2|h_t(x_i) - y_i|$$ (correct examples get loss $$1$$, mistakes get loss $$-1$$)
- Get $$p_{t+1}$$ via Multiplicative Weights (with the right step-size)

Output $$\bar{h} = \mathrm{Majority}(h_1, \dotsc, h_T)$$.

Theorem: if $$T \geq (2/\gamma^2) \ln(1/\varepsilon)$$, then $$\bar{h}$$ makes mistakes on at most an $$\varepsilon$$-fraction of $$S$$.

Main ideas:

- Bound the regret only against distributions on examples where $$\bar{h}$$ errs
- Due to the WL property, the loss of the player is $$\geq 2 T \gamma$$
- The cost of any distribution of this type is $$\leq 0$$
- $$\ln n$$ becomes $$\ln (n/\#\text{mistakes})$$
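A small end-to-end sketch under assumptions of mine: the weak learner is a best-threshold stump on 1-D data (playing the role of $$\mathrm{WL}$$; on this toy set it always achieves weighted accuracy $$\geq 2/3$$, i.e., $$\gamma = 1/6$$), and the loss is exactly $$\ell_t(i) = 1 - 2|h_t(x_i) - y_i|$$:

```python
import numpy as np

def weak_learner(p, X, y):
    # Hypothetical WL: exhaustively pick the 1-D threshold stump with the
    # highest accuracy under the distribution p.
    best, best_acc = None, -1.0
    for thr in np.unique(X):
        for sign in (0, 1):
            pred = np.where(X >= thr, sign, 1 - sign)
            acc = p @ (pred == y)
            if acc > best_acc:
                best_acc, best = acc, (thr, sign)
    thr, sign = best
    return lambda Z: np.where(Z >= thr, sign, 1 - sign)

def boost(X, y, T, eta):
    n = len(X)
    w = np.ones(n)
    hyps = []
    for _ in range(T):
        p = w / w.sum()
        h = weak_learner(p, X, y)
        hyps.append(h)
        # loss is +1 on examples h gets right and -1 on mistakes,
        # so MWU shifts weight toward the hard examples
        loss = 1.0 - 2.0 * np.abs(h(X) - y)
        w *= np.exp(-eta * loss)
    votes = sum(h(X) for h in hyps)        # majority vote over all rounds
    return np.where(2 * votes >= T, 1, 0)

X = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0, 0, 1, 0, 1, 1])           # not classifiable by one stump
pred = boost(X, y, T=400, eta=1 / 6)
print((pred == y).mean())
```

No single stump classifies this $$y$$, but the majority vote of the reweighted stumps does much better, in line with the theorem above.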

### From Electrical Flows to Maximum Flow

Goal: route as much flow as possible from $$s$$ to $$t$$ while respecting the edges' capacities.

We can compute a maximum flow in time $$O(|V| \cdot |E|)$$, and this year there was a paper with an $$O(|E|^{1 + o(1)})$$ algorithm...

What if we want something faster, even if approximate?

### From Electrical Flows to Maximum Flow

Fast Laplacian system solvers (Spielman & Teng '13) solve

\displaystyle L x = b

in $$\tilde{O}(|E|)$$ time, where $$L$$ is the Laplacian matrix of $$G$$. We can compute electrical flows by solving this system. But electrical flows may not respect edge capacities!

Main idea: use electrical flows as a "weak learner", and boost them using MWU:

- Edges = Experts
- Cost = flow/capacity

### Other applications (beyond experts) in TCS

- Solving packing linear systems with oracle access
- Approximating multicommodity flow
- Approximately solving some semidefinite programs
- Spectral graph sparsification
- Computational complexity ($$\mathrm{QIP} = \mathrm{PSPACE}$$)

## Research Topics

### Solving SDPs

Goal: find $$X$$ with $$C \bullet X > \alpha$$, or a dual solution certifying $$\mathrm{OPT} \leq (1 + \delta) \alpha$$.

Dual Oracle: given a candidate primal solution $$\mathbf{X}_t$$, find a vector $$y$$ that certifies primal infeasibility or objective value $$\leq \alpha$$.

Idea: update $$\mathbf{X}_t$$ using $$y$$ via Online Learning.

This led to many improved approximation algorithms, and similar ideas are still used in many problems in TCS.

### Parameter-Free Algorithms

Online Learning algorithms usually need knowledge of two parameters to get optimal regret bounds:

- $$L$$: Lipschitz continuity constant
- $$D$$: distance to the comparator

Can we design algorithms that do not need to know these parameters and still achieve similar performance? Yes, e.g., via Coin Betting, though it is impossible to adapt to both at the same time (in general).

### Differential Privacy meets Online Learning

PAC DP-Learning is equivalent to Online Learnability: finite sample complexity for PAC learning with a differentially private algorithm corresponds to finite Littlestone dimension.

An Online Learning algorithm achieves low regret if it is robust to adversarial data; a Differentially Private algorithm is robust to small changes to its input.

### Other Topics

Bandit Feedback

Mirror Descent and Applications

Online Boosting

Non-stochastic Control

### Resources

Surveys:

- *A Modern Introduction to Online Learning* - Francesco Orabona
  - Amazing references and historical discussion; great sections on parameter-free OL
- *Introduction to Online Convex Optimization* - Elad Hazan
  - Covers other topics not covered by Orabona
- *Online Learning and Online Convex Optimization* - Shai Shalev-Shwartz
  - A bit old, but covers Online Learnability; some of the nasty details of convex analysis are covered
- *Introduction to Online Optimization* - Sébastien Bubeck
  - A bit lighter to read IMO (less detail and feels shorter)

#### Waterloo - OL via the Experts' problem