Online Learning

ML in Feedback Sys #3

Prof Sarah Dean

Announcements

  • Sign up to scribe and rank preferences for paper presentations by September 7th
  • Email me about first paper presentations 9/12 and 9/14
    • HSNL18 Fairness Without Demographics in Repeated Loss Minimization
    • PZMH20 Performative Prediction
  • Required: meet with Atul at least 2 days before you are scheduled to present
  • Working in pairs/groups, self-assessment

ML in Feedback Systems

[Diagram: training data \(\{(x_i, y_i)\}\) is used to fit a model \(f:\mathcal X\to\mathcal Y\) or a policy, which maps observations to actions]

Supervised learning

[Diagram: training data \(\{(x_i, y_i)\}\) sampled i.i.d. from \(\mathcal D\) is used to fit a model \(f:\mathcal X\to\mathcal Y\), which maps an observation \(x\sim\mathcal D_{x}\) to a prediction]

Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)

Sample vs. population

Fundamental Theorem of Supervised Learning:

  • The risk is bounded by the empirical risk plus the generalization error. $$ \mathcal R(f) \leq \mathcal R_N(f) + |\mathcal R(f) - \mathcal R_N(f)|$$

Empirical risk minimization

$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$

1. Representation

2. Optimization

3. Generalization

Case study: linear regression

Generalization: under the fixed design generative model, \(\{x_i\}_{i=1}^n\) are fixed and

\(y_i = \theta_\star^\top x_i + v_i\), where the \(v_i\) are i.i.d. with mean \(0\) and variance \(\sigma^2\)

\(\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]\)

Claim: when the features \(\{x_i\}\) span \(\mathbb R^d\), the expected excess risk \(\mathbb E[\mathcal R(\hat\theta)] -\mathcal R(\theta_\star) =\frac{\sigma^2 d}{n}\)

  • First, \(\mathcal R(\theta) = \frac{1}{n}\|X(\theta-\theta_\star)\|_2^2 + \sigma^2\)
  • Then, \(\|X(\hat\theta-\theta_\star)\|_2^2 = v^\top X(X^\top X)^{-1}X^\top v\)
  • Then take the expectation: \(\mathbb E\left[v^\top X(X^\top X)^{-1}X^\top v\right] = \sigma^2\operatorname{tr}\left(X(X^\top X)^{-1}X^\top\right) = \sigma^2 d\), and divide by \(n\).
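
A quick numerical check of this claim (a minimal sketch; the dimensions, noise level, and number of Monte Carlo trials are arbitrary choices, not from the lecture):

```python
import numpy as np

# Numerical check: fixed-design excess risk of least squares is sigma^2 * d / n.
rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 0.5
X = rng.normal(size=(n, d))       # fixed design; the features x_i span R^d
theta_star = rng.normal(size=d)

excess = []
for _ in range(2000):             # Monte Carlo over the noise v
    y = X @ theta_star + sigma * rng.normal(size=n)
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # R(theta) - R(theta_star) = (1/n) * ||X (theta - theta_star)||_2^2
    excess.append(np.sum((X @ (theta_hat - theta_star)) ** 2) / n)

print(np.mean(excess), sigma**2 * d / n)   # the two values should be close
```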

Case study: linear regression

Exercises:

  • For the same generative model and a new fixed \(x_{n+1}\), what is the expected loss \(\mathbb E_y[(\hat \theta^\top x_{n+1} - y_{n+1})^2]\)? Can you interpret the quantities?
  • In the random design setting, we take each \(x_i\) to be drawn i.i.d. from \(\mathcal N(0,\Sigma)\) and assume that \(v_i\) is also Gaussian. The risk is then $$\mathcal R(\theta) = \mathbb E_{x, y}\left[(x^\top \theta - y)^2\right].$$ What is the excess risk of \(\hat\theta\) in terms of \(X^\top X\) and \(\Sigma\)? What is the excess risk in terms of \(\sigma^2, n, d\) (ref 1, 2)?


Online learning

[Diagram: in contrast to supervised learning, the model \(f_t:\mathcal X\to\mathcal Y\) is updated over time: each observation \(x_t\) yields a prediction, and the labeled data \(\{(x_t, y_t)\}\) accumulate]

Goal: cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)

Online Learning

  • for \(t=1,2,...\)
    • receive \(x_t\)
    • predict \(\hat y_t\)
    • receive true \(y_t\)
    • suffer loss \(\ell(y_t,\hat y_t)\)

ex - rainfall prediction, online advertising, election forecasting, ...

Cumulative loss and regret

Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"

We want to design prediction rules so that the cumulative loss \(\displaystyle \sum_{t=1}^T \ell(y_t, \hat y_t)\) is small

The regret of an algorithm which plays \((\hat y_t)_{t=1}^T\) in an environment playing \((x_t, y_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T \ell(y_t, \hat y_t) - \underbrace{\min_{f\in\mathcal F} \sum_{t=1}^T \ell(y_t, f(x_t))}_{\text{best in hindsight}} $$

We often consider the worst-case regret, i.e. \(\sup_{(x_t, y_t)_{t=1}^T} R(T)\), where the environment is "adversarial"

Case study: online linear regression

Consider the squared loss, take \(\mathcal F\) to be linear functions, and predict \(\hat y_t = \theta_t^\top x_t\)

How to choose \(\theta_t\) based on \((x_k, y_k)_{k=1}^{t-1}\)?

Follow the (Regularized) Leader

$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 +  \lambda\|\theta\|_2^2$$

i.e. fit the "best in hindsight" parameter on the data so far, plus regularization

Online Gradient Descent

$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$

Online convex optimization

A more general framework captures problems with convex losses.

Online Optimization

  • for \(t=1,2,...\)
    • select \(\theta_t\in\Theta\)
    • receive function \(g_t:\Theta\to\mathbb R\)
    • suffer loss \(g_t(\theta_t)\)

  • ex - in online learning, we have \(g_t(\theta) =\ell(y_t, f_\theta(x_t))\)
  • ex - allocating a resource among \(d\) entities, \(\Theta=\Delta^d\), with varying returns \(r_t\in\mathbb R^d\), so that \(g_t (\theta)= r_t^\top \theta\)

The regret of an algorithm playing \((\theta_t)_{t=1}^T\) in an environment playing \((g_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T g_t(\theta_t) - \min_{\theta\in\Theta} \sum_{t=1}^T g_t(\theta) $$

Convexity & continuity


For differentiable and convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta)$$

Definition: \(g\) is \(L\)-Lipschitz continuous if for all \(\theta,\theta'\in\Theta\), $$ |g(\theta)-g(\theta')| \leq L\|\theta-\theta'\|_2 $$

A differentiable function \(g\) is \(L\) Lipschitz if \(\|\nabla g(\theta)\|_2 \leq L\) for all \(\theta\in\Theta\).


For differentiable and \(\gamma\)-strongly convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta) + \frac{\gamma}{2}\|\theta'-\theta\|_2^2$$


Follow the (Regularized) Leader

$$\theta_t = \arg\min_{\theta\in\Theta} \sum_{k=1}^{t-1} g_k(\theta) +  r(\theta)$$

Follow the Regularized Leader

Theorem (2.12): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(r(\theta) = \frac{L\sqrt{T}}{\sqrt{2}B}\|\theta\|_2^2\). Then $$  R(T) \leq BL\sqrt{2T}.$$

Why regularize? Consider \(\Theta=[-1,1]\),  \(g_t(\theta) = z_t\theta\) with \(z_t=\begin{cases}-0.5& t=1 \\ 1 & t~\text{even}\\-1 & t>1~\text{and odd}\end{cases}\)
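
A small simulation of this example makes the contrast concrete (a sketch; I use the regularization weight suggested by Theorem 2.12 with \(B=L=1\), and break ties for the unregularized leader at \(0\)):

```python
import numpy as np

# FTL vs. FTRL on the example above: Theta = [-1, 1], g_t(theta) = z_t * theta.
T = 1000
z = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

c = np.sqrt(T) / np.sqrt(2)       # regularizer weight from Theorem 2.12 with B = L = 1
loss_ftl = loss_ftrl = 0.0
S = 0.0                            # running sum z_1 + ... + z_{t-1}
for t in range(T):
    theta_ftl = 0.0 if S == 0 else -np.sign(S)        # minimizer of S*theta over [-1, 1]
    theta_ftrl = np.clip(-S / (2 * c), -1.0, 1.0)     # minimizer of S*theta + c*theta^2
    loss_ftl += z[t] * theta_ftl
    loss_ftrl += z[t] * theta_ftrl
    S += z[t]

best = min(S * theta for theta in (-1.0, 1.0))        # best fixed theta in hindsight
print(loss_ftl - best, loss_ftrl - best)              # FTL regret ~ T, FTRL regret ~ sqrt(T)
```

Without regularization the leader jumps between \(\pm 1\) and suffers loss roughly \(1\) every round, while the regularized leader stays near \(0\) and its regret grows only like \(\sqrt T\).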

Online Gradient Descent

GD locally minimizes the first-order approximation \(g(\theta) \approx g(\theta_0) + \nabla g(\theta_0)^\top (\theta-\theta_0)\):

$$\theta_t = \theta_{t-1} - \alpha \nabla g_{t-1}(\theta_{t-1}) = \arg\min_\theta\, \nabla g_{t-1}(\theta_{t-1})^\top (\theta-\theta_{t-1}) +\frac{1}{2\alpha}\|\theta-\theta_{t-1}\|_2^2$$

Theorem (2.7): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(\alpha = \frac{B}{L\sqrt{2T}}\). Then $$  R(T) \leq BL\sqrt{2T}.$$
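
As a minimal generic sketch of the update (the `project` argument for keeping \(\theta_t\in\Theta\) is my addition; the slides state the unconstrained step):

```python
import numpy as np

def online_gradient_descent(gradient_oracles, theta1, alpha, project=lambda th: th):
    """Play theta_t, then receive g_t (via its gradient at theta_t) and take a step.

    gradient_oracles: iterable of callables; the t-th callable returns the gradient of g_t.
    Returns the list of played iterates theta_1, ..., theta_T.
    """
    theta = np.asarray(theta1, dtype=float)
    played = []
    for grad in gradient_oracles:
        played.append(theta.copy())                    # select theta_t
        theta = project(theta - alpha * grad(theta))   # gradient step on g_t
    return played
```

For the online regression losses \(g_t(\theta)=(\theta^\top x_t - y_t)^2\), the gradient oracle is \(\theta \mapsto 2(\theta^\top x_t - y_t)x_t\), which recovers the update from the online linear regression slide up to the factor of \(2\) absorbed into \(\alpha\).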

Online Gradient Descent: Proof Sketch

For any fixed \(\theta\in\Theta\),

$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta)$$

because \(g_t(\theta) \geq g_t(\theta_t) + \nabla g_t(\theta_t)^\top (\theta-\theta_t)\)

Can show by induction that for \(\theta_1=0\),

$$ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta) \leq \frac{1}{2\alpha}\|\theta\|_2^2+ \sum_{t=1}^T \underbrace{\nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1})}_{\alpha\|\nabla g_t(\theta_t)\|_2^2}$$

because \(\displaystyle \theta_{t+1} = \arg\min_\theta\sum_{k=1}^{t} \nabla g_k(\theta_k)^\top\theta + \frac{1}{2\alpha}\|\theta\|_2^2\)

Putting it all together, $$ R(T) =  \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq \frac{1}{2\alpha}\|\theta_\star\|_2^2+ \alpha\sum_{t=1}^T L_t^2$$

exercise: fill in the details

Follow the Regularized Leader

Lemma (2.1, 2.3): For any fixed \(\theta\in\Theta\), under FTRL

$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq r(\theta)-r(\theta_1)+  \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_{t+1})$$

Lemma (2.10): For \(r\) \(\gamma\)-strongly convex and \(g_t\) convex and \(L_t\) Lipschitz,

$$  g_t(\theta_t)-g_t(\theta_{t+1}) \leq \frac{L_t^2}{\gamma}$$

Putting it all together, $$ R(T) =  \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq r(\theta_\star)  + \frac{TL^2}{\gamma}$$

For linear functions, FTRL is the same as OGD! In general, similar arguments apply.

exercise: fill in the details
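
As a check (my own calculation, not on the slides), plugging the regularizer from Theorem 2.12 into this bound recovers the same rate: \(r(\theta) = \frac{L\sqrt T}{\sqrt 2 B}\|\theta\|_2^2\) is \(\gamma\)-strongly convex with \(\gamma = \frac{\sqrt 2 L\sqrt T}{B}\), so for \(\|\theta_\star\|_2\leq B\),

$$ R(T) \leq \frac{L\sqrt T}{\sqrt 2 B}B^2 + \frac{TL^2 B}{\sqrt 2 L\sqrt T} = \frac{BL\sqrt T}{\sqrt 2} + \frac{BL\sqrt T}{\sqrt 2} = BL\sqrt{2T}. $$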

Recursive least-squares

Follow the (Regularized) Leader

$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 +  \lambda\|\theta\|_2^2 = \underbrace{\Big(\sum_{k=1}^{t-1}x_k x_k^\top  + \lambda I\Big)^{-1}}_{A_{t-1}^{-1}}\underbrace{\sum_{k=1}^{t-1}x_ky_k }_{b_{t-1}}$$

Online Gradient Descent

$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$

Rather than re-solving the linear system at every step, FTRL can be implemented recursively via the Sherman-Morrison formula: \(\displaystyle (A+uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1+v^\top A^{-1}u} \)

Recursive FTRL

  • set \(M_0=\frac{1}{\lambda}I\) and \(b_0 = 0\)
  • for \(t=1,2,...\)
    • \(\theta_t = M_{t-1}b_{t-1}\)
    • \(M_t = M_{t-1} - \frac{M_{t-1}x_t x_t^\top M_{t-1}}{1+x_t^\top M_{t-1}x_t}\)
    • \(b_t = b_{t-1}+x_ty_t\)
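
A minimal sketch of this recursion, checking that after \(T\) steps it matches the direct regularized least-squares solve (the synthetic data and \(\lambda\) are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
T, d, lam = 200, 4, 1.0
theta_true = rng.normal(size=d)
X = rng.normal(size=(T, d))
Y = X @ theta_true + 0.1 * rng.normal(size=T)

M = np.eye(d) / lam        # M_0 = (lam * I)^{-1}
b = np.zeros(d)            # b_0 = 0
cumulative_loss = 0.0

for t in range(T):
    theta_t = M @ b                          # play theta_t = M_{t-1} b_{t-1}
    x, y = X[t], Y[t]
    cumulative_loss += (theta_t @ x - y) ** 2
    Mx = M @ x
    M -= np.outer(Mx, Mx) / (1.0 + x @ Mx)   # Sherman-Morrison: M_t = (A_{t-1} + x x^T)^{-1}
    b += x * y

# After T steps, M @ b equals the batch ridge solution on all the data.
theta_batch = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
print(cumulative_loss, np.max(np.abs(M @ b - theta_batch)))  # second number ~ 0
```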
Recap

  • Excess risk of fixed design least squares
  • From online learning to online optimization
  • Two online optimization algorithms:
    • Follow the Regularized Leader
    • Online Gradient Descent
  • Recursive least squares

Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"

Next time: how to model an environment that changes over time?
