Prof Sarah Dean

## Announcements

• Sign up to scribe and rank preferences for paper presentations by September 7th
• Email me about first paper presentations 9/12 and 9/14
  • HSNL18 Fairness Without Demographics in Repeated Loss Minimization
  • PZMH20 Performative Prediction
• Required: meet with Atul at least 2 days before you are scheduled to present
• Working in pairs/groups, self-assessment

[Slide diagram: training data $$\{(x_i, y_i)\}$$ → model $$f:\mathcal X\to\mathcal Y$$; in a feedback system, a policy maps observations to actions.]

## ML in Feedback Systems

[Slide diagram: training data $$\{(x_i, y_i)\}$$ → model $$f:\mathcal X\to\mathcal Y$$, which maps observations to predictions.]

## The data distribution $$\mathcal D$$

Training data are sampled i.i.d. from $$\mathcal D$$, with features distributed as $$x\sim\mathcal D_{x}$$.

Goal: for new sample $$x,y\sim \mathcal D$$, prediction $$\hat y = f(x)$$ is close to true $$y$$

## Sample vs. population

Fundamental Theorem of Supervised Learning:

• The risk is bounded by the empirical risk plus the generalization error: $$\mathcal R(f) \leq \mathcal R_n(f) + |\mathcal R(f) - \mathcal R_n(f)|$$

Empirical risk minimization

$$\hat f = \arg\min_{f\in\mathcal F} \frac{1}{n} \sum_{i=1}^n \ell(y_i, f(x_i))$$

The minimized objective is the empirical risk $$\mathcal R_n(f)$$.

1. Representation

2. Optimization

3. Generalization

## Case study: linear regression

Generalization: under the fixed design generative model, $$\{x_i\}_{i=1}^n$$ are fixed and

$$y_i = \theta_\star^\top x_i + v_i$$ with $$v_i$$ i.i.d. with mean $$0$$ and variance $$\sigma^2$$

$$\mathcal R(\theta) = \frac{1}{n}\sum_{i=1}^n \mathbb E_{y_i}\left[(x_i^\top \theta - y_i)^2\right]$$

Claim: when the features span $$\mathbb R^d$$, the least-squares estimator $$\hat\theta$$ has expected excess risk $$\mathbb E\left[\mathcal R(\hat\theta)\right] -\mathcal R(\theta_\star) =\frac{\sigma^2 d}{n}$$

• First, $$\mathcal R(\theta) = \frac{1}{n}\|X(\theta-\theta_\star)\|_2^2 + \sigma^2$$
• Then, $$\|X(\hat\theta-\theta_\star)\|_2^2 = v^\top X(X^\top X)^{-1}X^\top v$$
• Then take expectation.
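The claim can be checked numerically. The sketch below (dimensions, noise level, and trial count are arbitrary choices) averages the excess risk $$\frac{1}{n}\|X(\hat\theta-\theta_\star)\|_2^2$$ over noise draws and compares it to $$\sigma^2 d/n$$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 5, 1.0        # arbitrary choices for the experiment
X = rng.standard_normal((n, d))  # fixed design: held constant across trials
theta_star = rng.standard_normal(d)

trials = 2000
excess = 0.0
for _ in range(trials):
    v = sigma * rng.standard_normal(n)
    y = X @ theta_star + v
    theta_hat = np.linalg.solve(X.T @ X, X.T @ y)  # least squares
    # excess risk: R(theta_hat) - R(theta_star) = (1/n) ||X(theta_hat - theta_star)||^2
    excess += np.linalg.norm(X @ (theta_hat - theta_star)) ** 2 / n

print(excess / trials, sigma**2 * d / n)  # the two should nearly agree
```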

## Case study: linear regression

Exercises:

• For the same generative model and a new fixed $$x_{n+1}$$, what is the expected loss $$\mathbb E_y[(\hat \theta^\top x_{n+1} - y_{n+1})^2]$$? Can you interpret the quantities?
• In the random design setting, we take each $$x_i$$ to be drawn i.i.d. from $$\mathcal N(0,\Sigma)$$ and assume that $$v_i$$ is also Gaussian. The risk is then $$\mathcal R(\theta) = \mathbb E_{x, y}\left[(x^\top \theta - y)^2\right].$$ What is the excess risk of $$\hat\theta$$ in terms of $$X^\top X$$ and $$\Sigma$$? What is the excess risk in terms of $$\sigma^2, n, d$$ (ref 1, 2)?


[Slide diagram: at each round $$t$$, a model $$f_t:\mathcal X\to\mathcal Y$$ maps the current observation to a prediction.]

## Online learning


Goal: cumulatively over time, predictions $$\hat y_t = f_t(x_t)$$ are close to true $$y_t$$

The observed pairs $$\{(x_t, y_t)\}$$ accumulate over time into training data.

Online Learning

• for $$t=1,2,...$$
  • receive $$x_t$$
  • predict $$\hat y_t$$
  • receive true $$y_t$$
  • suffer loss $$\ell(y_t,\hat y_t)$$

ex - rainfall prediction, online advertising, election forecasting, ...

## Online learning

Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"

## Cumulative loss and regret

We want to design prediction rules so that the cumulative loss $$\displaystyle \sum_{t=1}^T \ell(y_t, \hat y_t)$$ is small

The regret of an algorithm which plays $$(\hat y_t)_{t=1}^T$$ in an environment playing $$(x_t, y_t)_{t=1}^T$$ is $$R(T) = \sum_{t=1}^T \ell(y_t, \hat y_t) - \min_{f\in\mathcal F} \sum_{t=1}^T \ell(y_t, f(x_t))$$ where the minimizer is the "best in hindsight"

We often consider worst-case regret, i.e. $$\sup_{(x_t, y_t)_{t=1}^T} R(T)$$
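As an illustration of the definition, a small sketch with squared loss; the comparator class here is constant predictions, a hypothetical stand-in for $$\mathcal F$$, and the algorithm predicts the running mean of past labels:

```python
import numpy as np

def regret(ys, preds, comparators, loss):
    """Cumulative loss of the played predictions minus the best fixed comparator."""
    alg = sum(loss(y, yh) for y, yh in zip(ys, preds))
    best = min(sum(loss(y, c) for y in ys) for c in comparators)
    return alg - best

sq = lambda y, yh: (y - yh) ** 2
ys = [1.0, 0.0, 1.0, 1.0]
# a simple algorithm: predict the running mean of past labels (0 on round one)
preds, past = [], []
for y in ys:
    preds.append(float(np.mean(past)) if past else 0.0)
    past.append(y)

r = regret(ys, preds, comparators=np.linspace(0, 1, 101), loss=sq)
print(r)
```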

## Case study: online linear regression

Consider the squared loss, $$\mathcal F$$ to be linear functions, and $$\hat y_t = \theta_t^\top x_t$$

How to choose $$\theta_t$$ based on $$(x_k, y_k)_{k=1}^{t-1}$$?

$$\theta_t = \arg\min \sum_{k=1}^{t-1} (\theta^\top x_k-y_k)^2 + \lambda\|\theta\|_2^2$$

$$\theta_t = \theta_{t-1} - \alpha (\theta_{t-1}^\top x_{t-1}-y_{t-1})x_{t-1}$$
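The two choices above — refitting the regularized least-squares objective on all past data versus taking a single gradient step on the latest pair — can be run side by side. A sketch on synthetic data (the dimension, horizon, noise level, step size $$\alpha$$, and regularization $$\lambda$$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, T, lam, alpha = 3, 500, 1.0, 0.01     # arbitrary hyperparameters
theta_star = np.array([1.0, -2.0, 0.5])  # hypothetical ground truth
xs = rng.standard_normal((T, d))
ys = xs @ theta_star + 0.1 * rng.standard_normal(T)

theta_rls = np.zeros(d)  # refit ridge regression on all past data each round
theta_gd = np.zeros(d)   # single gradient step on the most recent pair
loss_rls = loss_gd = 0.0
for t in range(T):
    loss_rls += (theta_rls @ xs[t] - ys[t]) ** 2
    loss_gd += (theta_gd @ xs[t] - ys[t]) ** 2
    X, y = xs[: t + 1], ys[: t + 1]
    theta_rls = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
    theta_gd = theta_gd - alpha * (theta_gd @ xs[t] - ys[t]) * xs[t]

print(loss_rls, loss_gd)  # cumulative losses; both rules track theta_star
```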

## Online convex optimization

A more general framework captures problems with convex losses.

Online Optimization

• for $$t=1,2,...$$
  • select $$\theta_t\in\Theta$$
  • receive function $$g_t:\Theta\to\mathbb R$$
  • suffer loss $$g_t(\theta_t)$$

• ex - in online learning, we have $$g_t(\theta) =\ell(y_t, f_\theta(x_t))$$
• ex - allocating $$d$$ resources $$\Theta=\Delta^d$$ among entities with varying returns $$r_t\in\mathbb R^d$$, so that the loss is $$g_t (\theta)= -r_t^\top \theta$$

The regret of an algorithm playing $$(\theta_t)_{t=1}^T$$ in an environment playing $$(g_t)_{t=1}^T$$ is $$R(T) = \sum_{t=1}^T g_t(\theta_t) - \min_{\theta\in\Theta} \sum_{t=1}^T g_t(\theta)$$

## Convexity & continuity

For differentiable and convex $$g:\mathcal \Theta\to\mathbb R$$, $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta)$$

Definition: $$g$$ is $$L$$-Lipschitz continuous if for all $$\theta,\theta'\in\Theta$$, $$|g(\theta)-g(\theta')| \leq L\|\theta-\theta'\|_2$$

A differentiable function $$g$$ is $$L$$ Lipschitz if $$\|\nabla g(\theta)\|_2 \leq L$$ for all $$\theta\in\Theta$$.

## Convexity & continuity

For differentiable and $$\gamma$$-strongly convex $$g:\mathcal \Theta\to\mathbb R$$, $$g(\theta') \geq g(\theta) + \nabla g(\theta)^\top (\theta'-\theta) + \frac{\gamma}{2}\|\theta'-\theta\|_2^2$$


Follow the Regularized Leader (FTRL): $$\theta_t = \arg\min_\theta \sum_{k=1}^{t-1} g_k(\theta) + r(\theta)$$

Theorem (2.12): Suppose each $$g_t$$ is convex and $$L_t$$ Lipschitz and let $$\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2$$. Let $$\|\theta_\star\|_2\leq B$$ and set $$r(\theta) = \frac{L\sqrt{T}}{\sqrt{2}B}\|\theta\|_2^2$$. Then $$R(T) \leq BL\sqrt{2T}.$$

Why regularize? Consider $$\Theta=[-1,1]$$,  $$g_t(\theta) = z_t\theta$$ with $$z_t=\begin{cases}-0.5& t=1 \\ 1 & t~\text{even}\\-1 & t>1~\text{and odd}\end{cases}$$
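This example can be simulated: follow-the-leader (no regularizer) oscillates and suffers linear regret, while FTRL with the regularization implied by the theorem (here $$B = L = 1$$) stays sublinear. A sketch:

```python
import numpy as np

T = 100
# the adversarial sequence from the example, on Theta = [-1, 1]
z = np.array([-0.5 if t == 1 else (1.0 if t % 2 == 0 else -1.0)
              for t in range(1, T + 1)])

def play(eta):
    """FTRL with r(theta) = ||theta||^2 / (2 eta); eta = inf recovers FTL."""
    S, total = 0.0, 0.0
    for t in range(T):
        if eta == np.inf:
            theta = 0.0 if S == 0 else -np.sign(S)   # unregularized leader
        else:
            theta = np.clip(-eta * S, -1.0, 1.0)     # regularized leader
        total += z[t] * theta
        S += z[t]
    return total

best = min(z.sum() * th for th in (-1.0, 1.0))  # best fixed theta in hindsight
eta = 1.0 / np.sqrt(2 * T)                      # B = L = 1 in the theorem
print("FTL regret: ", play(np.inf) - best)      # grows linearly with T
print("FTRL regret:", play(eta) - best)         # stays below B L sqrt(2T)
```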

## Online gradient descent

GD locally minimizes the first order approximation $$g(\theta) \approx g(\theta_0) + \nabla g(\theta_0)^\top (\theta-\theta_0)$$

$$\theta_t = \theta_{t-1} - \alpha \nabla g_{t-1}(\theta_{t-1})$$

Theorem (2.7): Suppose each $$g_t$$ is convex and $$L_t$$ Lipschitz and let $$\frac{1}{T}\sum_{t=1}^T L_t^2\leq L^2$$. Let $$\|\theta_\star\|_2\leq B$$ and set $$\alpha = \frac{B}{L\sqrt{2T}}$$. Then $$R(T) \leq BL\sqrt{2T}$$
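The theorem's guarantee can be checked directly on linear losses $$g_t(\theta)=z_t^\top\theta$$; a sketch in which the loss vectors, horizon, and dimension are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(2)
T, d, B = 1000, 4, 1.0
zs = rng.uniform(-1, 1, size=(T, d))      # g_t(theta) = z_t . theta
L = max(np.linalg.norm(zv) for zv in zs)  # Lipschitz constants L_t <= L

alpha = B / (L * np.sqrt(2 * T))  # step size from the theorem
theta = np.zeros(d)
total = 0.0
for zv in zs:
    total += zv @ theta
    theta = theta - alpha * zv    # OGD step: the gradient of z . theta is z

# the best fixed comparator with ||theta||_2 <= B is -B * (sum z_t)/||sum z_t||
best = -B * np.linalg.norm(zs.sum(axis=0))
ogd_regret = total - best
print(ogd_regret, B * L * np.sqrt(2 * T))  # regret stays within the bound
```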

The GD update can equivalently be written as $$\theta_t = \arg\min_\theta \nabla g_{t-1}(\theta_{t-1})^\top (\theta-\theta_{t-1}) +\frac{1}{2\alpha}\|\theta-\theta_{t-1}\|_2^2$$

Moreover, OGD coincides with FTRL on the linearized losses, because $$\displaystyle \theta_{t+1} = \arg\min_\theta\sum_{k=1}^{t} \nabla g_k(\theta_k)^\top\theta + \frac{1}{2\alpha}\|\theta\|_2^2$$

For any fixed $$\theta\in\Theta$$, convexity gives $$g_t(\theta) \geq g_t(\theta_t) + \nabla g_t(\theta_t)^\top (\theta-\theta_t)$$, so

$$\sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta)$$

Can show by induction that for $$\theta_1=0$$,

$$\sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta) \leq \frac{1}{2\alpha}\|\theta\|_2^2+ \sum_{t=1}^T \nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1})$$

Each term in the final sum satisfies $$\nabla g_t(\theta_t)^\top (\theta_t-\theta_{t+1}) = \alpha\|\nabla g_t(\theta_t)\|_2^2 \leq \alpha L_t^2$$, since $$\theta_{t+1}=\theta_t - \alpha\nabla g_t(\theta_t)$$.

Putting it all together, $$R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq \frac{1}{2\alpha}\|\theta_\star\|_2^2+ \alpha\sum_{t=1}^T L_t^2$$

## Online Gradient Descent: Proof Sketch

exercise: fill in the details

Lemma (2.1, 2.3): For any fixed $$\theta\in\Theta$$, under FTRL

$$\sum_{t=1}^T g_t(\theta_t)-g_t(\theta) \leq r(\theta)-r(\theta_1)+ \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_{t+1})$$

Lemma (2.10): For $$r$$ $$\gamma$$-strongly convex and $$g_t$$ convex and $$L_t$$ Lipschitz,

$$g_t(\theta_t)-g_t(\theta_{t+1}) \leq \frac{L_t^2}{\gamma}$$

$$R(T) = \sum_{t=1}^T g_t(\theta_t)-g_t(\theta_\star) \leq r(\theta_\star) +TL^2/\gamma$$

For linear functions, FTRL is the same as OGD! In general, similar arguments.

exercise: fill in the details

$$\theta_t = \underbrace{\Big(\sum_{k=1}^{t-1}x_k x_k^\top + \lambda I\Big)^{-1}}_{A_{t-1}^{-1}}\underbrace{\sum_{k=1}^{t-1}x_ky_k }_{b_{t-1}}$$


## Recursive least-squares

Sherman-Morrison formula: $$\displaystyle (A+uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1+v^\top A^{-1}u}$$
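A quick numerical sanity check of the identity (the matrix and vectors below are arbitrary choices):

```python
import numpy as np

A = np.diag([2.0, 3.0, 4.0, 5.0])   # invertible base matrix
u = np.ones(4)
v = np.array([1.0, 0.0, 1.0, 0.0])

Ainv = np.linalg.inv(A)
# Sherman-Morrison: rank-one update of the inverse
lhs = np.linalg.inv(A + np.outer(u, v))
rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / (1 + v @ Ainv @ u)
print(np.allclose(lhs, rhs))  # True
```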


Recursive FTRL

• set $$M_0=\frac{1}{\lambda}I$$ and $$b_0 = 0$$
• for $$t=1,2,...$$
  • $$\theta_t = M_{t-1}b_{t-1}$$
  • $$M_t = M_{t-1} - \frac{M_{t-1}x_t x_t^\top M_{t-1}}{1+x_t^\top M_{t-1}x_t}$$
  • $$b_t = b_{t-1}+x_ty_t$$
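The recursion can be checked against the batch ridge solution; a sketch (the data and hyperparameters are arbitrary), using the Sherman-Morrison update with denominator $$1 + x_t^\top M_{t-1}x_t$$:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, lam = 3, 50, 1.0                # arbitrary sizes
xs = rng.standard_normal((T, d))
ys = rng.standard_normal(T)

M = np.eye(d) / lam                   # M_0 = (lam I)^{-1}
b = np.zeros(d)
for x, y in zip(xs, ys):
    Mx = M @ x
    M = M - np.outer(Mx, Mx) / (1 + x @ Mx)  # Sherman-Morrison rank-one update
    b = b + x * y
theta_rec = M @ b

# batch ridge regression on the same data, for comparison
theta_batch = np.linalg.solve(xs.T @ xs + lam * np.eye(d), xs.T @ ys)
print(np.allclose(theta_rec, theta_batch))  # True
```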

Recap:

• Excess risk of fixed design least squares
• From online learning to online optimization
• Two online optimization algorithms: FTRL and online gradient descent
• Recursive least squares

Reference: Ch 1&2 of Shalev-Shwartz "Online Learning and Online Convex Optimization"

Next time: how to model an environment that changes over time?
