Prof Sarah Dean
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
policy
observation
action
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
observation
prediction
sampled i.i.d. from \(\mathcal D\)
\(x\sim\mathcal D_{x}\)
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
Fundamental Theorem of Supervised Learning:
Empirical risk minimization
$$\hat f = \arg\min_{f\in\mathcal F} \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$$
empirical risk \(\mathcal R_n(f)\)
1. Representation
2. Optimization
3. Generalization
Generalization: under the fixed design generative model, \(\{x_i\}_{i=1}^n\) are fixed and
\(y_i = \theta_\star^\top x_i + v_i\) with \(v_i\) i.i.d. with mean \(0\) and variance \(\sigma^2\)
\(\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]\)
Claim: when the features \(\{x_i\}_{i=1}^n\) span \(\mathbb R^d\), the expected excess risk \(\mathbb E[\mathcal R(\hat\theta)] -\mathcal R(\theta_\star) =\frac{\sigma^2 d}{n}\), where the expectation is over the training noise \(v_i\)
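A quick Monte Carlo check of this claim (a minimal sketch, not from the notes; the design matrix and variable names below are arbitrary choices): average the excess risk of least-squares ERM over repeated draws of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, trials = 200, 5, 0.5, 2000
X = rng.standard_normal((n, d))          # fixed design (spans R^d with high probability)
theta_star = rng.standard_normal(d)

excess = []
for _ in range(trials):
    y = X @ theta_star + sigma * rng.standard_normal(n)   # y_i = theta_*^T x_i + v_i
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares ERM
    # excess risk under the fixed design: (1/n) * ||X (theta_hat - theta_star)||^2
    excess.append(np.mean((X @ (theta_hat - theta_star)) ** 2))

print(np.mean(excess), sigma**2 * d / n)   # both approximately sigma^2 d / n = 0.00625
```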
Exercises:
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
observation
prediction
sampled i.i.d. from \(\mathcal D\)
\(x\sim\mathcal D_{x}\)
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
model
\(f_t:\mathcal X\to\mathcal Y\)
observation
prediction
\(x_t\)
Goal: cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)
accumulate
\(\{(x_t, y_t)\}\)
Online Learning
e.g. rainfall prediction, online advertising, election forecasting, ...
Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
The regret of an algorithm which plays \((\hat y_t)_{t=1}^T\) in an environment playing \((x_t, y_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T \ell(y_t, \hat y_t) - \min_{f\in\mathcal F} \sum_{t=1}^T \ell(y_t, f(x_t)) $$
We want to design prediction rules so that \(\displaystyle \sum_{t=1}^T \ell(y_t, \hat y_t)\) is small
"best in hindsight"
often consider worst-case regret, i.e. \(\sup_{(x_t, y_t)_{t=1}^T} R(T)\)
"adversarial"
Consider the squared loss, take \(\mathcal F\) to be linear functions, and predict \(\hat y_t = \theta_t^\top x_t\)
How to choose \(\theta_t\) based on \((x_k, y_k)_{k=1}^{t-1}\)?
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2$$
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
"best in hindsight"
Online Optimization
A more general framework captures problems with convex losses
The regret of an algorithm playing \((\theta_t)_{t=1}^T\) in an environment playing \((g_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T g_t(\theta_t) - \min_{\theta\in\Theta} \sum_{t=1}^T g_t(\theta) $$
For differentiable and convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta)$$
Definition: \(g\) is Lipschitz continuous if there exists \(L\) so that for all \(\theta,\theta'\in\Theta\), $$ |g(\theta)-g(\theta')| \leq L\|\theta-\theta'\|_2 $$
A differentiable function \(g\) is \(L\) Lipschitz if \(\|\nabla g(\theta)\|_2 \leq L\) for all \(\theta\in\Theta\).
For differentiable and \(\gamma\)-strongly convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta) + \frac{\gamma}{2}\|\theta'-\theta\|_2^2$$
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} g_k(\theta) + r(\theta)$$
Theorem (2.12): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(r(\theta) = \frac{L\sqrt{T}}{\sqrt{2}B}\|\theta\|_2^2\). Then $$ R(T) \leq BL\sqrt{2T}.$$
Why regularize? Consider \(\Theta=[-1,1]\), \(g_t(\theta) = z_t\theta\) with \(z_t=\begin{cases}-0.5& t=1 \\ 1 & t~\text{even}\\-1 & t>1~\text{and odd}\end{cases}\)
Without regularization, the leader alternates between \(\theta_t=1\) and \(\theta_t=-1\) and incurs loss close to \(1\) every round, while the fixed choice \(\theta=0\) incurs zero loss, so Follow the Leader suffers regret linear in \(T\) (see the sketch below).
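A small simulation of this sequence (a sketch, with hypothetical variable names) illustrates the linear regret of the unregularized leader:

```python
import numpy as np

T = 100
# z_t as on the slide: z_1 = -0.5, z_t = +1 for even t, z_t = -1 for odd t > 1
z = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

theta, cum_z, ftl_loss = 0.0, 0.0, 0.0
for t in range(T):
    ftl_loss += z[t] * theta      # loss g_t(theta_t) = z_t * theta_t
    cum_z += z[t]
    theta = -np.sign(cum_z)       # unregularized leader: argmin over [-1, 1] of cum_z * theta

# losses are linear in theta, so the best fixed theta in [-1, 1] is an endpoint (or 0)
best_fixed = min(cum_z * th for th in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)      # roughly T - 1: linear regret
```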
GD locally minimizes first order approximation \(g(\theta) \approx g(\theta_0) + \nabla g(\theta_0)^\top (\theta-\theta_0)\)
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha \nabla g_{t-1}(\theta_{t-1})$$
Theorem (2.7): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(\alpha = \frac{B}{L\sqrt{2T}}\). Then $$ R(T) \leq BL\sqrt{2T}$$
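A generic sketch of this update with the theorem's step size (hypothetical; `grad_fns[t]` is assumed to be a callable returning \(\nabla g_t\) at a given point):

```python
import numpy as np

def online_gradient_descent(grad_fns, d, B, L):
    """Run OGD from theta_1 = 0 with alpha = B / (L * sqrt(2T)) as in the theorem.
    grad_fns[t](theta) returns the gradient of g_t at theta; returns the played iterates."""
    T = len(grad_fns)
    alpha = B / (L * np.sqrt(2 * T))
    theta = np.zeros(d)
    iterates = []
    for t in range(T):
        iterates.append(theta.copy())               # play theta_t, then g_t is revealed
        theta = theta - alpha * grad_fns[t](theta)  # theta_{t+1} = theta_t - alpha * grad g_t(theta_t)
    return iterates
```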
The OGD update can be written as \(\theta_t = \arg\min_\theta \nabla g_{t-1}(\theta_{t-1})^\top (\theta-\theta_{t-1}) +\frac{1}{2\alpha}\|\theta-\theta_{t-1}\|_2^2\)
because \(\displaystyle \theta_{t+1} = \arg\min_\theta\sum_{k=1}^{t} \nabla g_k(\theta_k)^\top\theta + \frac{1}{2\alpha}\|\theta\|_2^2\)
because \(g_t(\theta) \geq g_t(\theta_t) + \nabla g_t(\theta_t)^\top (\theta-\theta_t)\)
For any fixed \(\theta\in\Theta\),
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta)$$
Can show by induction that for \(\theta_1=0\),
$$ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta) \leq \frac{1}{2\alpha}\|\theta\|_2^2+ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1})$$
Each term on the right satisfies \(\nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1}) = \alpha\|\nabla g_t(\theta_t)\|_2^2 \leq \alpha L_t^2\), since \(\theta_t - \theta_{t+1} = \alpha\nabla g_t(\theta_t)\)
Putting it all together, $$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq \frac{1}{2\alpha}\|\theta_\star\|_2^2+ \alpha\sum_{t=1}^T L_t^2$$
exercise: fill in the details
Lemma (2.1, 2.3): For any fixed \(\theta\in\Theta\), under FTRL
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq r(\theta)-r(\theta_1)+ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_{t+1})$$
Lemma (2.10): For \(r\) \(\gamma\)-strongly convex and \(g_t\) convex and \(L_t\) Lipschitz,
$$ g_t(\theta_t)-g_t(\theta_{t+1}) \leq \frac{L_t^2}{\gamma}$$
$$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq r(\theta_\star) +TL^2/\gamma$$
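One way to see how Theorem 2.12 follows from these two lemmas: take \(r(\theta)=\frac{\gamma}{2}\|\theta\|_2^2\), which is \(\gamma\)-strongly convex and has \(r(\theta_1)=0\) for \(\theta_1=0\), so that
$$ R(T) \leq \frac{\gamma}{2}\|\theta_\star\|_2^2 + \frac{TL^2}{\gamma} \leq \frac{\gamma}{2}B^2 + \frac{TL^2}{\gamma}. $$
Choosing \(\gamma = \frac{L\sqrt{2T}}{B}\) balances the two terms and gives \(R(T)\leq BL\sqrt{2T}\); this \(\gamma\) corresponds exactly to the regularizer \(r(\theta) = \frac{L\sqrt{T}}{\sqrt 2 B}\|\theta\|_2^2\) in the theorem.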
For linear losses, FTRL (with quadratic regularization) is the same as OGD! In general, the arguments are similar.
exercise: fill in the details
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2 = \underbrace{\Big(\sum_{k=1}^{t-1}x_k x_k^\top + \lambda I\Big)^{-1}}_{A_{t-1}^{-1}}\underbrace{\sum_{k=1}^{t-1}x_ky_k }_{b_{t-1}}$$
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
Sherman-Morrison formula: \(\displaystyle (A+uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1+v^\top A^{-1}u} \)
Recursive FTRL
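A sketch of the recursion (assuming the same `xs`, `ys` layout as in the earlier sketch): rather than re-solving the ridge regression each round, maintain \(A_t^{-1}\) and \(b_t\) and update the inverse with Sherman-Morrison, at \(O(d^2)\) cost per round.

```python
import numpy as np

def recursive_ftrl(xs, ys, lam=1.0):
    """Recursive FTRL for online ridge regression: maintain A_t^{-1} and b_t,
    where A_t = sum_{k<=t} x_k x_k^T + lam*I and b_t = sum_{k<=t} x_k y_k,
    and predict with theta_t = A_{t-1}^{-1} b_{t-1}."""
    T, d = xs.shape
    A_inv = np.eye(d) / lam                # A_0^{-1} = (lam * I)^{-1}
    b = np.zeros(d)
    preds = np.zeros(T)
    for t in range(T):
        theta = A_inv @ b                  # theta_t = A_{t-1}^{-1} b_{t-1}
        preds[t] = theta @ xs[t]
        # Sherman-Morrison update for A_t = A_{t-1} + x_t x_t^T
        Ax = A_inv @ xs[t]
        A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + xs[t] @ Ax)
        b = b + ys[t] * xs[t]
    return preds
```

Under these assumptions the predictions match the batch FTRL computation above, while avoiding a \(d\times d\) solve every round.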
Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
Next time: how to model an environment that changes over time?