Online Learning
ML in Feedback Sys #3
Prof Sarah Dean
Announcements
- Sign up to scribe and rank preferences for paper presentations by September 7th
- Email me about first paper presentations:
  - 9/12: HSNL18, Fairness Without Demographics in Repeated Loss Minimization
  - 9/14: PZMH20, Performative Prediction
- Required: meet with Atul at least 2 days before you are scheduled to present
- Working in pairs/groups, self-assessment
ML in Feedback Systems
[Diagram: training data \(\{(x_i, y_i)\}\) is used to fit a model \(f:\mathcal X\to\mathcal Y\), which defines a policy mapping each observation to an action]
Supervised learning
[Diagram: training data \(\{(x_i, y_i)\}\) sampled i.i.d. from \(\mathcal D\) is used to fit a model \(f:\mathcal X\to\mathcal Y\), which maps an observation \(x\sim\mathcal D_x\) to a prediction]
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
Sample vs. population
Fundamental Theorem of Supervised Learning:
- The risk is bounded by the empirical risk plus the generalization error. $$ \mathcal R(f) \leq \mathcal R_N(f) + |\mathcal R(f) - \mathcal R_N(f)|$$
Empirical risk minimization
$$\hat f = \arg\min_{f\in\mathcal F} \underbrace{\frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))}_{\mathcal R_N(f)}$$
1. Representation
2. Optimization
3. Generalization
Case study: linear regression
Generalization: under the fixed design generative model, \(\{x_i\}_{i=1}^n\) are fixed and
\(y_i = \theta_\star^\top x_i + v_i\) with \(v_i\) i.i.d. with mean \(0\) and variance \(\sigma^2\)
\(\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]\)
Claim: when features span \(\mathbb R^d\), the expected excess risk \(\mathbb E_v\left[\mathcal R(\hat\theta)\right] -\mathcal R(\theta_\star) =\frac{\sigma^2 d}{n}\)
- First, \(\mathcal R(\theta) = \frac{1}{n}\|X(\theta-\theta_\star)\|_2^2 + \sigma^2\)
- Then, \(\|X(\hat\theta-\theta_\star)\|_2^2 = v^\top X(X^\top X)^{-1}X^\top v\)
- Then take expectation.
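The claim can be checked numerically by averaging the excess risk over many noise draws with the design held fixed. A minimal sketch (the dimensions, noise level, and variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 5, 0.5

# fixed design: the feature matrix X is drawn once and then held fixed
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=d)

# average the excess risk (1/n)||X(theta_hat - theta_star)||^2 over noise draws
trials = 2000
excess = 0.0
for _ in range(trials):
    v = sigma * rng.normal(size=n)
    y = X @ theta_star + v
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]  # least squares fit
    excess += np.sum((X @ (theta_hat - theta_star)) ** 2) / n
excess /= trials

print(excess, sigma**2 * d / n)  # the two values should be close
```

The empirical average concentrates around \(\sigma^2 d/n\), matching the claim.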
Case study: linear regression
Exercises:
- For the same generative model and a new fixed \(x_{n+1}\), what is the expected loss \(\mathbb E_y[(\hat \theta^\top x_{n+1} - y_{n+1})^2]\)? Can you interpret the quantities?
- In the random design setting, we take each \(x_i\) to be drawn i.i.d. from \(\mathcal N(0,\Sigma)\) and assume that \(v_i\) is also Gaussian. The risk is then $$\mathcal R(\theta) = \mathbb E_{x, y}\left[(x^\top \theta - y)^2\right].$$ What is the excess risk of \(\hat\theta\) in terms of \(X^\top X\) and \(\Sigma\)? What is the excess risk in terms of \(\sigma^2, n, d\) (ref 1, 2)?
Supervised learning vs. online learning
[Diagram: in supervised learning, a model \(f:\mathcal X\to\mathcal Y\) is trained once on data \(\{(x_i,y_i)\}\) sampled i.i.d. from \(\mathcal D\), and the goal is that for a new sample \(x,y\sim\mathcal D\) the prediction \(\hat y = f(x)\) is close to the true \(y\); in online learning, the model \(f_t:\mathcal X\to\mathcal Y\) is updated as data \(\{(x_t, y_t)\}\) accumulates over time]
Goal: cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)
Online Learning
- for \(t=1,2,...\)
- receive \(x_t\)
- predict \(\hat y_t\)
- receive true \(y_t\)
- suffer loss \(\ell(y_t,\hat y_t)\)
ex - rainfall prediction, online advertising, election forecasting, ...
Online learning
Cumulative loss and regret
We want to design prediction rules so that \(\displaystyle \sum_{t=1}^T \ell(y_t, \hat y_t)\) is small
The regret of an algorithm which plays \((\hat y_t)_{t=1}^T\) in an environment playing \((x_t, y_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T \ell(y_t, \hat y_t) - \min_{f\in\mathcal F} \sum_{t=1}^T \ell(y_t, f(x_t)) $$ ("best in hindsight")
often consider worst-case regret, i.e. \(\sup_{(x_t, y_t)_{t=1}^T} R(T)\) ("adversarial")
Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
Case study: online linear regression
Consider the squared loss, \(\mathcal F\) to be linear functions, and \(\hat y_t = \theta_t^\top x_t\)
How to choose \(\theta_t\) based on \((x_k, y_k)_{k=1}^{t-1}\)?
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2$$
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
"best in hindsight"
Online convex optimization
A more general framework captures problems with convex losses
Online Optimization
- for \(t=1,2,...\)
- select \(\theta_t\in\Theta\)
- receive function \(g_t:\Theta\to\mathbb R\)
- suffer loss \(g_t(\theta_t)\)
- ex - in online learning, we have \(g_t(\theta) =\ell(y_t, f_\theta(x_t))\)
- ex - allocating \(d\) resources \(\Theta=\Delta^d\) among entities with varying returns \(r_t\in\mathbb R^d\) so that \(g_t (\theta)= r_t^\top \theta\)
The regret of an algorithm playing \((\theta_t)_{t=1}^T\) in an environment playing \((g_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T g_t(\theta_t) - \min_{\theta\in\Theta} \sum_{t=1}^T g_t(\theta) $$
Convexity & continuity
For differentiable and convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta)$$
Definition: \(g\) is \(L\)-Lipschitz continuous if for all \(\theta,\theta'\in\Theta\), $$ |g(\theta)-g(\theta')| \leq L\|\theta-\theta'\|_2 $$
A differentiable function \(g\) is \(L\) Lipschitz if \(\|\nabla g(\theta)\|_2 \leq L\) for all \(\theta\in\Theta\).
Convexity & continuity
For differentiable and \(\gamma\)-strongly convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta) + \frac{\gamma}{2}\|\theta'-\theta\|_2^2$$
Follow the (Regularized) Leader
$$\theta_t = \arg\min_{\theta\in\Theta} \sum_{k=1}^{t-1} g_k(\theta) + r(\theta)$$
Follow the Regularized Leader
Theorem (2.12): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(r(\theta) = \frac{L\sqrt{T}}{\sqrt{2}B}\|\theta\|_2^2\). Then $$ R(T) \leq BL\sqrt{2T}.$$
Why regularize? Consider \(\Theta=[-1,1]\), \(g_t(\theta) = z_t\theta\) with \(z_t=\begin{cases}-0.5& t=1 \\ 1 & t~\text{even}\\-1 & t>1~\text{and odd}\end{cases}\)
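Running Follow the Leader (no regularization) on this sequence makes the failure concrete: the leader flips sign every round and suffers loss 1 at every step after the first, so regret grows linearly in \(T\). A minimal sketch (code structure is illustrative; the sequence is as on the slide):

```python
import numpy as np

def z(t):
    # the adversarial sequence from the slide (t starts at 1)
    if t == 1:
        return -0.5
    return 1.0 if t % 2 == 0 else -1.0

def ftl_theta(t):
    # Follow the Leader on Theta = [-1, 1]: minimize (sum_{k<t} z_k) * theta
    s = sum(z(k) for k in range(1, t))
    return -np.sign(s) if s != 0 else 0.0

T = 100
ftl_loss = sum(z(t) * ftl_theta(t) for t in range(1, T + 1))
zs = [z(t) for t in range(1, T + 1)]
# best fixed theta in hindsight: linear objective, so check the endpoints
best_loss = min(th * sum(zs) for th in (-1.0, 1.0))
print(ftl_loss - best_loss)  # regret grows linearly in T for FTL
```

Adding the quadratic regularizer keeps \(\theta_t\) near 0 on this sequence, and the theorem above guarantees regret growing only like \(\sqrt{T}\).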
GD locally minimizes first order approximation \(g(\theta) \approx g(\theta_0) + \nabla g(\theta_0)^\top (\theta-\theta_0)\)
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha \nabla g_{t-1}(\theta_{t-1})$$
Online Gradient Descent
Theorem (2.7): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(\alpha = \frac{B}{L\sqrt{2T}}\). Then $$ R(T) \leq BL\sqrt{2T}$$
\(= \arg\min_\theta \nabla g_{t-1}(\theta_{t-1})^\top (\theta-\theta_{t-1}) +\frac{1}{2\alpha}\|\theta-\theta_{t-1}\|_2^2\)
Online Gradient Descent: Proof Sketch
For any fixed \(\theta\in\Theta\), because \(g_t(\theta) \geq g_t(\theta_t) + \nabla g_t(\theta_t)^\top (\theta-\theta_t)\),
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta)$$
Because \(\displaystyle \theta_{t+1} = \arg\min_\theta\sum_{k=1}^{t} \nabla g_k(\theta_k)^\top\theta + \frac{1}{2\alpha}\|\theta\|_2^2\), can show by induction that for \(\theta_1=0\),
$$ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta) \leq \frac{1}{2\alpha}\|\theta\|_2^2+ \sum_{t=1}^T \underbrace{\nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1})}_{\alpha\|\nabla g_t(\theta_t)\|_2^2}$$
Putting it all together, $$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq \frac{1}{2\alpha}\|\theta_\star\|_2^2+ \alpha\sum_{t=1}^T L_t^2$$
exercise: fill in the details
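The bound can also be checked numerically with linear losses \(g_t(\theta)=z_t^\top\theta\) (unit-norm \(z_t\), so \(L_t=1\)). A minimal sketch with illustrative constants:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, B = 500, 4, 2.0
Z = rng.normal(size=(T, d))                     # linear losses g_t(theta) = z_t . theta
Z /= np.linalg.norm(Z, axis=1, keepdims=True)   # normalize so each L_t = 1
L = 1.0

alpha = B / (L * np.sqrt(2 * T))    # step size from the theorem
theta = np.zeros(d)                 # theta_1 = 0
ogd_loss = 0.0
for zt in Z:
    ogd_loss += zt @ theta
    theta = theta - alpha * zt      # OGD step: the gradient of z . theta is z

# best fixed theta with ||theta|| <= B minimizes (sum_t z_t) . theta
best_loss = -B * np.linalg.norm(Z.sum(axis=0))
regret = ogd_loss - best_loss
print(regret, B * L * np.sqrt(2 * T))  # regret stays below the theoretical bound
```

Since the losses are linear, the comparator minimum over the ball \(\|\theta\|_2\leq B\) has the closed form used above.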
Follow the Regularized Leader
Lemma (2.1, 2.3): For any fixed \(\theta\in\Theta\), under FTRL
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq r(\theta)-r(\theta_1)+ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_{t+1})$$
Lemma (2.10): For \(r\) \(\gamma\)-strongly convex and \(g_t\) convex and \(L_t\) Lipschitz,
$$ g_t(\theta_t)-g_t(\theta_{t+1}) \leq \frac{L_t^2}{\gamma}$$
Putting it all together, $$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq r(\theta_\star) +TL^2/\gamma$$
For linear functions, FTRL is the same as OGD! In general, similar arguments.
exercise: fill in the details
$$\theta_t = \underbrace{\Big(\sum_{k=1}^{t-1}x_k x_k^\top + \lambda I\Big)^{-1}}_{A_{t-1}^{-1}}\underbrace{\sum_{k=1}^{t-1}x_ky_k }_{b_{t-1}}$$
Recursive least-squares
Sherman-Morrison formula: \(\displaystyle (A+uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1+v^\top A^{-1}u} \)
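A quick numerical check of the formula in the symmetric rank-one case \(u=v=x\) used here (the setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 4
C = rng.normal(size=(d, d))
A = C.T @ C + np.eye(d)        # symmetric positive definite, safely invertible
x = rng.normal(size=d)

Ainv = np.linalg.inv(A)
# Sherman-Morrison with u = v = x: rank-one update of the inverse,
# no new matrix inversion required (denominator 1 + x' A^{-1} x > 1 here)
sm = Ainv - np.outer(Ainv @ x, Ainv @ x) / (1 + x @ Ainv @ x)
direct = np.linalg.inv(A + np.outer(x, x))
print(np.max(np.abs(sm - direct)))  # agrees up to floating-point error
```

This is exactly the update needed when a new term \(x_t x_t^\top\) is added to \(A_{t-1}\).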
Recursive FTRL
- set \(M_0=\frac{1}{\lambda}I\) and \(b_0 = 0\)
- for \(t=1,2,...\)
- \(\theta_t = M_{t-1}b_{t-1}\)
- \(M_t = M_{t-1} - \frac{M_{t-1}x_t x_t^\top M_{t-1}}{1+x_t^\top M_{t-1}x_t}\)
- \(b_t = b_{t-1}+x_ty_t\)
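As a check, the recursion should reproduce the batch ridge solution at every step. A minimal sketch on synthetic data (dimensions, \(\lambda\), and the data stream are illustrative), with the \(1 + x_t^\top M_{t-1} x_t\) denominator from Sherman-Morrison:

```python
import numpy as np

rng = np.random.default_rng(4)
T, d, lam = 200, 3, 1.0
X = rng.normal(size=(T, d))
y = X @ np.array([0.5, -1.0, 2.0]) + 0.1 * rng.normal(size=T)

# recursive FTRL: maintain M_t = (sum_k x_k x_k^T + lam I)^{-1}, b_t = sum_k x_k y_k
M = np.eye(d) / lam
b = np.zeros(d)
for x, yt in zip(X, y):
    theta = M @ b                            # theta_t = M_{t-1} b_{t-1}, used to predict
    Mx = M @ x
    M -= np.outer(Mx, Mx) / (1 + x @ Mx)     # Sherman-Morrison rank-one update
    b += x * yt

# the final iterate should match batch ridge regression on the full stream
theta_batch = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
print(np.max(np.abs(M @ b - theta_batch)))  # matches up to numerical precision
```

Each round costs \(O(d^2)\), versus \(O(d^3)\) for re-solving the regularized least-squares problem from scratch.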
Recap
- Excess risk of fixed design least squares
- From online learning to online optimization
- Two online optimization algorithms:
  - Follow the Regularized Leader
  - Online Gradient Descent
- Recursive least squares

Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
Next time: how to model an environment that changes over time?