Prof Sarah Dean
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
policy
observation
action
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
observation
prediction
sampled i.i.d. from \(\mathcal D\)
\(x\sim\mathcal D_{x}\)
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
Fundamental Theorem of Supervised Learning:
Empirical risk minimization
$$\hat f = \arg\min_{f\in\mathcal F} \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$$
empirical risk \(\mathcal R_n(f)\)
1. Representation
2. Optimization
3. Generalization
Generalization: under the fixed design generative model, \(\{x_i\}_{i=1}^n\) are fixed and
\(y_i = \theta_\star^\top x_i + v_i\) with \(v_i\) i.i.d. with mean \(0\) and variance \(\sigma^2\)
\(\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]\)
Claim: when the features \(\{x_i\}_{i=1}^n\) span \(\mathbb R^d\), the expected excess risk \(\mathbb E[\mathcal R(\hat\theta)] -\mathcal R(\theta_\star) =\frac{\sigma^2 d}{n}\), where the expectation is over the training noise \(v_i\)
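A quick Monte Carlo check of this claim (a minimal sketch, not from the notes; the design matrix and variable names below are arbitrary choices): average the excess risk of least-squares ERM over repeated draws of the noise.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, trials = 200, 5, 0.5, 2000
X = rng.standard_normal((n, d))          # fixed design (spans R^d with high probability)
theta_star = rng.standard_normal(d)

excess = []
for _ in range(trials):
    y = X @ theta_star + sigma * rng.standard_normal(n)   # y_i = theta_*^T x_i + v_i
    theta_hat = np.linalg.lstsq(X, y, rcond=None)[0]       # least-squares ERM
    # excess risk under the fixed design: (1/n) * ||X (theta_hat - theta_star)||^2
    excess.append(np.mean((X @ (theta_hat - theta_star)) ** 2))

print(np.mean(excess), sigma**2 * d / n)   # both approximately sigma^2 d / n = 0.00625
```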
Exercises:
training data
\(\{(x_i, y_i)\}\)
model
\(f:\mathcal X\to\mathcal Y\)
observation
prediction
sampled i.i.d. from \(\mathcal D\)
\(x\sim\mathcal D_{x}\)
Goal: for new sample \(x,y\sim \mathcal D\), prediction \(\hat y = f(x)\) is close to true \(y\)
model
\(f_t:\mathcal X\to\mathcal Y\)
observation
prediction
\(x_t\)
Goal: cumulatively over time, predictions \(\hat y_t = f_t(x_t)\) are close to true \(y_t\)
accumulate
\(\{(x_t, y_t)\}\)
Online Learning
e.g. rainfall prediction, online advertising, election forecasting, ...
Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
The regret of an algorithm which plays \((\hat y_t)_{t=1}^T\) in an environment playing \((x_t, y_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T \ell(y_t, \hat y_t) - \min_{f\in\mathcal F} \sum_{t=1}^T \ell(y_t, f(x_t)) $$
We want to design prediction rules so that \(\displaystyle \sum_{t=1}^T \ell(y_t, \hat y_t)\) is small
"best in hindsight"
often consider worst-case regret, i.e. \(\sup_{(x_t, y_t)_{t=1}^T} R(T)\)
"adversarial"
Consider the squared loss, take \(\mathcal F\) to be linear functions, and predict \(\hat y_t = \theta_t^\top x_t\)
How to choose \(\theta_t\) based on \((x_k, y_k)_{k=1}^{t-1}\)?
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2$$
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
"best in hindsight"
Online Optimization
A more general framework captures problems with convex losses
The regret of an algorithm playing \((\theta_t)_{t=1}^T\) in an environment playing \((g_t)_{t=1}^T\) is $$ R(T) = \sum_{t=1}^T g_t(\theta_t) - \min_{\theta\in\Theta} \sum_{t=1}^T g_t(\theta) $$
For differentiable and convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta)$$
Definition: \(g\) is Lipschitz continuous if there exists \(L\) so that for all \(\theta,\theta'\in\Theta\), $$ |g(\theta)-g(\theta')| \leq L\|\theta-\theta'\|_2 $$
A differentiable function \(g\) is \(L\) Lipschitz if \(\|\nabla g(\theta)\|_2 \leq L\) for all \(\theta\in\Theta\).
For differentiable and \(\gamma\)-strongly convex \(g:\Theta\to\mathbb R\), $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta) + \frac{\gamma}{2}\|\theta'-\theta\|_2^2$$
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} g_k(\theta) + r(\theta)$$
Theorem (2.12): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(r(\theta) = \frac{L\sqrt{T}}{\sqrt{2}B}\|\theta\|_2^2\). Then $$ R(T) \leq BL\sqrt{2T}.$$
Why regularize? Consider \(\Theta=[-1,1]\), \(g_t(\theta) = z_t\theta\) with \(z_t=\begin{cases}-0.5& t=1 \\ 1 & t~\text{even}\\-1 & t>1~\text{and odd}\end{cases}\)
Without regularization, the leader alternates between \(\theta_t=1\) and \(\theta_t=-1\) and incurs loss close to \(1\) every round, while the fixed choice \(\theta=0\) incurs zero loss, so Follow the Leader suffers regret linear in \(T\) (see the sketch below).
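A small simulation of this sequence (a sketch, with hypothetical variable names) illustrates the linear regret of the unregularized leader:

```python
import numpy as np

T = 100
# z_t as on the slide: z_1 = -0.5, z_t = +1 for even t, z_t = -1 for odd t > 1
z = np.array([-0.5] + [1.0 if t % 2 == 0 else -1.0 for t in range(2, T + 1)])

theta, cum_z, ftl_loss = 0.0, 0.0, 0.0
for t in range(T):
    ftl_loss += z[t] * theta      # loss g_t(theta_t) = z_t * theta_t
    cum_z += z[t]
    theta = -np.sign(cum_z)       # unregularized leader: argmin over [-1, 1] of cum_z * theta

# losses are linear in theta, so the best fixed theta in [-1, 1] is an endpoint (or 0)
best_fixed = min(cum_z * th for th in (-1.0, 0.0, 1.0))
print(ftl_loss - best_fixed)      # roughly T - 1: linear regret
```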
GD locally minimizes first order approximation \(g(\theta) \approx g(\theta_0) + \nabla g(\theta_0)^\top (\theta-\theta_0)\)
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha \nabla g_{t-1}(\theta_{t-1})$$
Theorem (2.7): Suppose each \(g_t\) is convex and \(L_t\) Lipschitz and let \(\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2\). Let \(\|\theta_\star\|_2\leq B\) and set \(\alpha = \frac{B}{L\sqrt{2T}}\). Then $$ R(T) \leq BL\sqrt{2T}$$
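A generic sketch of this update with the theorem's step size (hypothetical; `grad_fns[t]` is assumed to be a callable returning \(\nabla g_t\) at a given point):

```python
import numpy as np

def online_gradient_descent(grad_fns, d, B, L):
    """Run OGD from theta_1 = 0 with alpha = B / (L * sqrt(2T)) as in the theorem.
    grad_fns[t](theta) returns the gradient of g_t at theta; returns the played iterates."""
    T = len(grad_fns)
    alpha = B / (L * np.sqrt(2 * T))
    theta = np.zeros(d)
    iterates = []
    for t in range(T):
        iterates.append(theta.copy())               # play theta_t, then g_t is revealed
        theta = theta - alpha * grad_fns[t](theta)  # theta_{t+1} = theta_t - alpha * grad g_t(theta_t)
    return iterates
```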
The OGD update can be written as \(\theta_t = \arg\min_\theta \nabla g_{t-1}(\theta_{t-1})^\top (\theta-\theta_{t-1}) +\frac{1}{2\alpha}\|\theta-\theta_{t-1}\|_2^2\)
because \(\displaystyle \theta_{t+1} = \arg\min_\theta\sum_{k=1}^{t} \nabla g_k(\theta_k)^\top\theta + \frac{1}{2\alpha}\|\theta\|_2^2\)
because \(g_t(\theta) \geq g_t(\theta_t) + \nabla g_t(\theta_t)^\top (\theta-\theta_t)\)
For any fixed \(\theta\in\Theta\),
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta)$$
Can show by induction that for \(\theta_1=0\),
$$ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta) \leq \frac{1}{2\alpha}\|\theta\|_2^2+ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1})$$
Each term on the right satisfies \(\nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1}) = \alpha\|\nabla g_t(\theta_t)\|_2^2 \leq \alpha L_t^2\), since \(\theta_t - \theta_{t+1} = \alpha\nabla g_t(\theta_t)\)
Putting it all together, $$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq \frac{1}{2\alpha}\|\theta_\star\|_2^2+ \alpha\sum_{t=1}^T L_t^2$$
exercise: fill in the details
Lemma (2.1, 2.3): For any fixed \(\theta\in\Theta\), under FTRL
$$ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq r(\theta)-r(\theta_1)+ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_{t+1})$$
Lemma (2.10): For \(r\) \(\gamma\)-strongly convex and \(g_t\) convex and \(L_t\) Lipschitz,
$$ g_t(\theta_t)-g_t(\theta_{t+1}) \leq \frac{L_t^2}{\gamma}$$
$$ R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq r(\theta_\star) +TL^2/\gamma$$
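One way to see how Theorem 2.12 follows from these two lemmas: take \(r(\theta)=\frac{\gamma}{2}\|\theta\|_2^2\), which is \(\gamma\)-strongly convex and has \(r(\theta_1)=0\) for \(\theta_1=0\), so that
$$ R(T) \leq \frac{\gamma}{2}\|\theta_\star\|_2^2 + \frac{TL^2}{\gamma} \leq \frac{\gamma}{2}B^2 + \frac{TL^2}{\gamma}. $$
Choosing \(\gamma = \frac{L\sqrt{2T}}{B}\) balances the two terms and gives \(R(T)\leq BL\sqrt{2T}\); this \(\gamma\) corresponds exactly to the regularizer \(r(\theta) = \frac{L\sqrt{T}}{\sqrt 2 B}\|\theta\|_2^2\) in the theorem.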
For linear losses, FTRL (with quadratic regularization) is the same as OGD! In general, the arguments are similar.
exercise: fill in the details
Follow the (Regularized) Leader
$$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2 = \underbrace{\Big(\sum_{k=1}^{t-1}x_k x_k^\top + \lambda I\Big)^{-1}}_{A_{t-1}^{-1}}\underbrace{\sum_{k=1}^{t-1}x_ky_k }_{b_{t-1}}$$
Online Gradient Descent
$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
Sherman-Morrison formula: \(\displaystyle (A+uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1+v^\top A^{-1}u} \)
Recursive FTRL
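A sketch of the recursion (assuming the same `xs`, `ys` layout as in the earlier sketch): rather than re-solving the ridge regression each round, maintain \(A_t^{-1}\) and \(b_t\) and update the inverse with Sherman-Morrison, at \(O(d^2)\) cost per round.

```python
import numpy as np

def recursive_ftrl(xs, ys, lam=1.0):
    """Recursive FTRL for online ridge regression: maintain A_t^{-1} and b_t,
    where A_t = sum_{k<=t} x_k x_k^T + lam*I and b_t = sum_{k<=t} x_k y_k,
    and predict with theta_t = A_{t-1}^{-1} b_{t-1}."""
    T, d = xs.shape
    A_inv = np.eye(d) / lam                # A_0^{-1} = (lam * I)^{-1}
    b = np.zeros(d)
    preds = np.zeros(T)
    for t in range(T):
        theta = A_inv @ b                  # theta_t = A_{t-1}^{-1} b_{t-1}
        preds[t] = theta @ xs[t]
        # Sherman-Morrison update for A_t = A_{t-1} + x_t x_t^T
        Ax = A_inv @ xs[t]
        A_inv = A_inv - np.outer(Ax, Ax) / (1.0 + xs[t] @ Ax)
        b = b + ys[t] * xs[t]
    return preds
```

Under these assumptions the predictions match the batch FTRL computation above, while avoiding a \(d\times d\) solve every round.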
Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"
Next time: how to model an environment that changes over time?