## Tour of **Online Learning** via Prediction with **Experts' Advice**

**Victor Sanches Portella**

November 2023

## **Experts'** Problem and **Online Learning**

### Prediction with Experts' Advice

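Each round, the **Player** picks **probabilities** over the **\(n\)** experts and the **Adversary** picks their **costs**

Example: probabilities \(p = (0.5, 0.1, 0.3, 0.1)\) and costs \(\ell = (1, 0, 0.5, 0.3)\)

**Player's loss:** \(\langle p, \ell \rangle = \sum_{i = 1}^n p(i)\, \ell(i)\)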

Adversary **knows** the strategy of the player

Picking a **random expert** vs. picking a **probability vector**
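Picking a random expert according to a probability vector and being charged the expected cost are the same thing on average. A minimal sketch of that computation, using the probabilities and costs shown on this slide (the function name `expected_loss` is ours, not part of the talk):

```python
def expected_loss(p, costs):
    """Expected cost when expert i is picked with probability p[i]."""
    return sum(pi * ci for pi, ci in zip(p, costs))


# Numbers from the slide: probabilities over 4 experts and their costs.
p = [0.5, 0.1, 0.3, 0.1]
costs = [1.0, 0.0, 0.5, 0.3]
loss = expected_loss(p, costs)  # approximately 0.68
```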

### Measuring Player's Performance

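**Attempt #1:** Total player's loss

Can be \(= T\) always

**Attempt #2:** Compare with offline optimum

Almost the same as Attempt #1

**Attempt #3:** Restrict the offline optimum

\(\mathrm{Regret}(T) = \sum_{t = 1}^T \langle p_t, \ell_t \rangle - \min_{i} \sum_{t = 1}^T \ell_t(i)\), that is, Player's Loss minus Loss of Best Expert

**Goal:** regret that grows sublinearly in \(T\)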
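To make the comparison with the best expert concrete, here is a small sketch that computes the regret of a sequence of plays. It assumes the player's loss in each round is the inner product of the probability vector and the loss vector; the function name `regret` is ours.

```python
def regret(plays, losses):
    """Player's total expected loss minus the total loss of the best fixed expert.

    plays[t] is the probability vector used at round t,
    losses[t] is the loss vector revealed at round t.
    """
    n = len(losses[0])
    player_loss = sum(
        sum(p_i * l_i for p_i, l_i in zip(p, loss))
        for p, loss in zip(plays, losses)
    )
    # Cumulative loss of each single expert over all rounds.
    expert_totals = [sum(loss[i] for loss in losses) for i in range(n)]
    return player_loss - min(expert_totals)
```

For example, playing the uniform distribution over two experts while the first expert always costs 1 and the second always costs 0 gives regret 0.5 per round.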

### Example

[Figure: Cumulative Loss of the Experts]

### General Online Learning

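**Player** picks a point \(x_t\) from a **convex** set \(\mathcal{X}\)

**Adversary** picks a **convex** function \(f_t\)

**Player's loss:** \(f_t(x_t)\)

**The player sees \(f_t\)**

Some usual settings:

Experts' problem: \(\mathcal{X}\) is the simplex and the \(f_t\) are linear functions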

### Why Online Learning?

Traditional ML optimization makes **stochastic assumptions on the data**

OL **strips away** the stochastic layer

Fewer assumptions \(\implies\) **Weaker guarantees**

Fewer assumptions \(\implies\) **More robust**

Adaptive algorithms

**AdaGrad**

**Adam...?**

Parameter Free algorithms

**Coin Betting**

Applications in TCS, solving SDPs, learning theory, etc.

## Algorithms

### Follow the Leader

**Idea**: Pick the best expert so far at each round

\(p_{t+1} = e_i\) **where \(i\) minimizes** \(\sum_{s = 1}^{t} \ell_s(i)\)

Player loses \(T -1\)

Best expert loses \(T/2\)

Works **very well** for quadratic losses

* picking distributions instead of best expert
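The gap between \(T - 1\) and \(T/2\) can be reproduced with two experts. The sketch below is one possible adversary, not necessarily the one from the talk: it assumes Follow the Leader breaks ties toward the lowest index, that the first loss vector is \((0, 0.5)\), and that from round 2 on the adversary puts loss 1 on whichever expert is currently the leader.

```python
def ftl_worst_case(T):
    """Simulate Follow the Leader against an adversary that punishes the leader.

    Returns (player's total loss, total loss of the best expert).
    """
    cumulative = [0.0, 0.0]
    player_loss = 0.0
    for t in range(1, T + 1):
        # FTL: pick the expert with the smallest cumulative loss so far
        # (ties go to expert 0, an assumption of this sketch).
        leader = 0 if cumulative[0] <= cumulative[1] else 1
        if t == 1:
            loss = [0.0, 0.5]  # small first loss to break the symmetry
        else:
            loss = [0.0, 0.0]
            loss[leader] = 1.0  # the adversary knows the player's strategy
        player_loss += loss[leader]
        cumulative[0] += loss[0]
        cumulative[1] += loss[1]
    return player_loss, min(cumulative)
```

With `T = 10` the player loses 9 while the best expert loses 4.5.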

### Gradient Descent

\(\eta_t\): step-size at round \(t\)

\(\ell_t\): loss vector at round \(t\)

**Sublinear** Regret!

**Optimal** dependency on \(T\)

Can we improve the dependency on \(n\)?

**Yes, and by a lot**
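For the experts' problem the gradient of the player's loss is the loss vector itself, so a gradient step presumably has the form \(p_{t+1} = \Pi(p_t - \eta_t \ell_t)\), with \(\Pi\) the Euclidean projection onto the simplex. A minimal sketch under those assumptions, with a fixed step-size `eta` for simplicity (the function names and the sort-based projection routine are our choices):

```python
def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based)."""
    u = sorted(v, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for k, u_k in enumerate(u, start=1):
        running_sum += u_k
        candidate = (running_sum - 1.0) / k
        if u_k - candidate > 0:
            theta = candidate  # keep the threshold of the last valid k
    return [max(x - theta, 0.0) for x in v]


def online_gradient_descent(losses, eta):
    """Play projected gradient descent on the simplex against linear losses."""
    n = len(losses[0])
    p = [1.0 / n] * n
    played = []
    for loss in losses:
        played.append(p)
        # Gradient of <p, loss> with respect to p is the loss vector.
        p = project_to_simplex([p_i - eta * l_i for p_i, l_i in zip(p, loss)])
    return played, p
```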

### Multiplicative Weights Update Method

Normalization

**Exponential** improvement on \(n\)

**Optimal**

Other methods had clearer "optimization views"

Rediscovered many times in different fields

This one has an optimization view as well!
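In its usual form, Multiplicative Weights scales each expert's probability by \(\exp(-\eta\, \ell_t(i))\) and then normalizes, which gives regret of order \(\sqrt{T \ln n}\) for a suitable step-size. A minimal sketch, assuming a fixed step-size `eta` (the function names are ours):

```python
import math


def mwu_update(p, loss, eta):
    """One Multiplicative Weights step: reweight by exp(-eta * loss), then normalize."""
    weights = [p_i * math.exp(-eta * l_i) for p_i, l_i in zip(p, loss)]
    total = sum(weights)  # normalization constant
    return [w / total for w in weights]


def run_mwu(losses, eta):
    """Start from the uniform distribution and update after each loss vector."""
    n = len(losses[0])
    p = [1.0 / n] * n
    played = []
    for loss in losses:
        played.append(p)
        p = mwu_update(p, loss, eta)
    return played, p
```

With two experts, `eta = ln 2`, and loss vector `(1, 0)`, the uniform distribution moves to approximately `(1/3, 2/3)`.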

### Follow the Regularized Leader

Regularizer \(R\): "stabilizes the algorithm"

Special cases: "Lazy" Gradient Descent and Multiplicative Weights Update

A good choice of \(R\) depends on the functions and the feasible set
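As an illustration of how the regularizer determines the algorithm: on the simplex with linear losses, Follow the Regularized Leader with the negative entropy \(R(p) = \sum_i p_i \ln p_i\) has the closed form \(p_{t+1}(i) \propto \exp(-\eta \sum_{s \leq t} \ell_s(i))\), which matches Multiplicative Weights with a fixed step-size. The sketch below checks this numerically; it assumes the FTRL objective \(\eta \sum_{s \leq t} \langle \ell_s, p \rangle + R(p)\), and the function names are ours.

```python
import math


def ftrl_entropy(past_losses, n, eta):
    """Closed-form FTRL iterate on the simplex with negative-entropy regularizer."""
    cumulative = [sum(loss[i] for loss in past_losses) for i in range(n)]
    weights = [math.exp(-eta * c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]


def mwu_iterate(past_losses, n, eta):
    """Multiplicative Weights iterate obtained by updating round by round."""
    p = [1.0 / n] * n
    for loss in past_losses:
        weights = [p_i * math.exp(-eta * l_i) for p_i, l_i in zip(p, loss)]
        total = sum(weights)
        p = [w / total for w in weights]
    return p


history = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.0, 1.0, 1.0]]
a = ftrl_entropy(history, 3, 0.3)
b = mwu_iterate(history, 3, 0.3)
same = all(abs(x - y) < 1e-12 for x, y in zip(a, b))  # the two iterates coincide
```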

### Online Mirror Descent

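[Figure: one step of Online Mirror Descent, moving between the **Primal** and **Dual** spaces and returning to the feasible set with a **Bregman Projection** defined by the regularizer]

Choice of regularizer:

**GD:** \(R(x) = \tfrac{1}{2} \lVert x \rVert_2^2\)

**MWU:** \(R(x) = \sum_i x_i \ln x_i\)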
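A sketch of one Online Mirror Descent step on the simplex, assuming the negative-entropy regularizer \(R(x) = \sum_i x_i \ln x_i\) and a linear loss. With this choice the map to the dual space is \(\nabla R(x)_i = 1 + \ln x_i\), and the Bregman projection back onto the simplex is a plain normalization, so the step coincides with the Multiplicative Weights update. The function name is ours, and the sketch requires every entry of `p` to be strictly positive.

```python
import math


def omd_entropy_step(p, loss, eta):
    """One mirror descent step with the negative-entropy mirror map."""
    # Primal -> dual: apply the gradient of the regularizer.
    theta = [1.0 + math.log(p_i) for p_i in p]
    # Gradient step in the dual space (the gradient of <p, loss> is loss).
    theta = [th - eta * l_i for th, l_i in zip(theta, loss)]
    # Dual -> primal: invert the gradient map.
    y = [math.exp(th - 1.0) for th in theta]
    # Bregman (KL) projection onto the simplex is a normalization.
    total = sum(y)
    return [y_i / total for y_i in y]
```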

### OMD and FTRL are quite general

## Applications

### Approximately Solving Zero-Sum Games

Payoff matrix \(A\) of row player

**Row player** picks row \(i\) with probability \(p_i\), e.g. strategy \(p = (0.1~~0.9)\)

**Column player** picks column \(j\) with probability \(q_j\), e.g. strategy \(q = (0.3~~0.7)\)

Row player gets \(A_{ij}\), column player gets \(-A_{ij}\)

**Von Neumann min-max Theorem:** \(\max_{p} \min_{q} p^T A q = \min_{q} \max_{p} p^T A q\)

### Approximately Solving Zero-Sum Games

**Main idea:** make each row of \(A\) be an expert

\(p_1 =\) uniform distribution

For \(t = 1, \dotsc, T\):

\(q_t = e_j\) where \(j\) minimizes \(p_t^T A e_j\) (best response of the column player)

Loss vector \(\ell_t\) is the \(j\)-th col. of \(-A\)

Get \(p_{t+1}\) via Multiplicative Weights

Output \(\bar{p} = \tfrac{1}{T} \sum_{t} p_t\) and \(\bar{q} = \tfrac{1}{T} \sum_{t} q_t\)

**Thm:**

**Cor:**
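The loop on this slide can be sketched as follows. Assumptions of this sketch, not taken from the talk: the entries of \(A\) lie in \([-1, 1]\), the step-size is \(\eta = \sqrt{\ln(n)/T}\), and the column player best-responds by picking the column that minimizes the row player's expected payoff. The function name `solve_zero_sum` is ours.

```python
import math


def solve_zero_sum(A, T):
    """Approximate equilibrium strategies (p_bar, q_bar) via Multiplicative Weights."""
    n, m = len(A), len(A[0])
    eta = math.sqrt(math.log(n) / T)  # assumed step-size
    p = [1.0 / n] * n
    p_sum = [0.0] * n
    q_sum = [0.0] * m
    for _ in range(T):
        # Column player's best response to the current row strategy p.
        payoff = [sum(p[i] * A[i][j] for i in range(n)) for j in range(m)]
        j = min(range(m), key=lambda c: payoff[c])
        for i in range(n):
            p_sum[i] += p[i]
        q_sum[j] += 1.0
        # Loss of expert (row) i is -A[i][j], so the weight gets exp(+eta * A[i][j]).
        weights = [p[i] * math.exp(eta * A[i][j]) for i in range(n)]
        total = sum(weights)
        p = [w / total for w in weights]
    return [s / T for s in p_sum], [s / T for s in q_sum]


# Matching pennies: the value of the game is 0.
pennies = [[1.0, -1.0], [-1.0, 1.0]]
p_bar, q_bar = solve_zero_sum(pennies, 2000)
```

On matching pennies both averaged strategies should guarantee a payoff close to the game value 0, up to an error that shrinks roughly like \(\sqrt{\ln(n)/T}\).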

### Boosting

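**Training set** \(S = \{(x_1, y_1), \dotsc, (x_n, y_n)\}\)

**Hypothesis class** \(\mathcal{H}\) of functions

**Weak learner:** \(\mathrm{WL}(p, S) \in \mathcal{H}\) such that \(h = \mathrm{WL}(p, S)\) satisfies \(\Pr_{i \sim p}[h(x_i) \neq y_i] \leq \tfrac{1}{2} - \gamma\)

**Question:** Can we get with high probability a hypothesis* \(h^*\) such that \(h^*(x_i) \neq y_i\) **only on an \(\varepsilon\)-fraction of \(S\)?**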

**Generalization** follows if \(\mathcal{H}\) is simple (and other conditions)

### Boosting

\(p_1 =\) uniform distribution

For \(t = 1, \dotsc, T\):

\(h_t = \mathrm{WL}(p_t, S)\)

\(\ell_t(i) = 1 - 2|h_t(x_i) - y_i|\)

Get \(p_{t+1}\) via Multiplicative Weights (with right step-size)

\(\bar{h} = \mathrm{Majority}(h_1, \dotsc, h_T)\)

**Theorem**

If \(T \geq (2/\gamma^2) \ln(1/\varepsilon)\), then \(\bar{h}\) errs on at most an \(\varepsilon\)-fraction of \(S\)

Main ideas:

Due to **WL** property, loss of the player is \(\geq 2 T \gamma\)

Regret only against distributions on examples where \(\bar{h}\) errs

Cost of any distribution of this type is \(\leq 0\)

\(\ln n\) becomes \(\ln (n/\mathrm{\# mistakes}) \)
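A small end-to-end sketch of this boosting loop. Everything specific here is an illustrative assumption: labels are in \(\{0, 1\}\), the inputs are numbers, the weak learner returns the best threshold rule ("stump") under the current distribution, and the step-size `eta` is passed in by the caller. The names `best_stump` and `boost` are ours.

```python
import math


def make_stump(theta, s):
    """Rule that predicts s when x > theta and 1 - s otherwise."""
    return lambda x: s if x > theta else 1 - s


def best_stump(p, xs, ys):
    """Hypothetical weak learner: the stump with the smallest p-weighted error."""
    candidates = [make_stump(theta, s)
                  for theta in [min(xs) - 1] + sorted(set(xs))
                  for s in (0, 1)]

    def weighted_error(h):
        return sum(p_i for p_i, x, y in zip(p, xs, ys) if h(x) != y)

    return min(candidates, key=weighted_error)


def boost(xs, ys, T, eta):
    """Boosting via Multiplicative Weights over the training examples."""
    n = len(xs)
    p = [1.0 / n] * n
    hypotheses = []
    for _ in range(T):
        h = best_stump(p, xs, ys)
        hypotheses.append(h)
        # Loss is +1 on correctly classified examples and -1 on mistakes,
        # so the distribution shifts toward the examples h gets wrong.
        losses = [1 - 2 * abs(h(x) - y) for x, y in zip(xs, ys)]
        weights = [p_i * math.exp(-eta * l) for p_i, l in zip(p, losses)]
        total = sum(weights)
        p = [w / total for w in weights]
    # Majority vote of the weak hypotheses.
    return lambda x: 1 if 2 * sum(h(x) for h in hypotheses) > len(hypotheses) else 0


xs = [0, 1, 2, 3, 4, 5]
ys = [1, 1, 0, 0, 1, 1]  # no single stump fits this labeling
h_bar = boost(xs, ys, T=200, eta=1.0 / 6.0)
mistakes = sum(1 for x, y in zip(xs, ys) if h_bar(x) != y)
```

On this toy data the best stump always has weighted error at most \(1/3\), so \(\gamma = 1/6\), and the majority vote ends up fitting every training example even though no single stump does.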

### From Electrical Flows to Maximum Flow

**Goal:**

Route **as much flow as possible** from \(s\) to \(t\) while respecting the **edges' capacities**

We can compute in time \(O(|V| \cdot |E|)\)

This year there was a paper with an \(O(|E|^{1 + o(1)})\) alg...

What if we want something faster even if approx.?

### From Electrical Flows to Maximum Flow

**Fast Laplacian system solvers** (Spielman & Teng '13)

Solve a linear system given by the Laplacian matrix of \(G\) in \(\tilde{O}(|E|)\) time

We can compute **electrical flows** by solving this system

Electrical flows may not respect edge capacities!

**Main idea:** Use electrical flows as a "weak learner", and boost it using MWU!

**Edges = Experts**

**Cost = flow/capacity**

### Other applications (beyond experts) in TCS

Solving Packing linear systems with oracle access

Approximating multicommodity flow

Approximately solve some semidefinite programs

Spectral graph sparsification

Computational complexity (QIP = PSPACE)

## Research Topics

### Solving SDPs

**Goal**

Find \(X\) with \(C \bullet X > \alpha \) or a dual solution certifying OPT \(\leq (1 + \delta) \alpha\)

**Dual Oracle**: Given a candidate primal solution \(\mathbf{X}_t\), find a vector \(y\) that certifies **primal infeasibility** or **objective value** \(\leq \alpha\)

**Idea:** Update \(\mathbf{X}_t\) using \(y\) via Online Learning

Led to many improved approximation algorithms

Similar ideas are still used in many problems in TCS

### Parameter-free and Adaptive Algorithms

Online Learning algorithms usually need knowledge of **two parameters** to get **optimal regret bounds**

**Lipschitz Continuity constant**

**Distance to comparator**

Can we design algorithms that do not need to know these parameters and still achieve similar performance?

**AdaGrad**

**CoinBetting**

**Impossible to adapt to both at the same time (in general)**

### Differential Privacy meets Online Learning

**PAC DP-Learning** (finite sample complexity for PAC Learning with a differentially private algorithm)

is equivalent to

**Online Learnability** (finite Littlestone Dimension)

**Online Learning**: An algorithm achieves low regret if it is robust to adversarial data

**Differentially Private**: An algorithm is robust to small changes to its input

### Other Topics

**Bandit Feedback**

**Mirror Descent and Applications**

**Saddle-Point Optimization and Games**

**Online Boosting**

**Non-stochastic Control**

### Resources

**Surveys:**

A Modern Introduction to Online Learning - Francesco Orabona

Amazing references and historical discussion

Introduction to Online Convex Optimization - Elad Hazan

Online Learning and Online Convex Optimization - Shai Shalev-Shwartz

Introduction to Online Optimization - Sébastien Bubeck

A bit lighter to read IMO (less details and feels shorter)

Great sections on parameter-free OL

Covers other topics not covered by Orabona

A bit old but covers Online Learnability

Some of the nasty details of convex analysis are covered


#### Waterloo - OL via the Expert's problem

By Victor Sanches Portella
