Prof Sarah Dean

## Reminders

• Final project proposal due October 7
• Upcoming paper presentations starting 10/24
• Meet with Atul in advance

## Prediction in a streaming world

[Diagram: a model $$f_t:\mathcal X\to\mathcal Y$$ maps each observation $$x_t$$ to a prediction $$\hat y_{t}$$, while the learner accumulates data $$\{(x_t, y_t)\}$$]

## Action in a streaming world

[Diagram: a policy $$\pi_t:\mathcal X\to\mathcal A$$ maps each observation $$x_t$$ to an action $$a_{t}$$, while the learner accumulates data $$\{(x_t, a_t, r_t)\}$$]

Linear Contextual Bandits

• for $$t=1,2,...$$
  • receive context $$x_t$$
  • take action $$a_t\in\mathcal A$$
  • receive reward $$\mathbb E[r_t] = \theta_\star^\top \varphi(x_t, a_t)$$

## Linear Contextual Bandits


• for $$t=1,2,...$$
  • receive action set $$\mathcal A_t$$
  • take action $$\varphi_t\in\mathcal A_t$$
  • receive reward $$\mathbb E[r_t] = \theta_\star^\top \varphi_t$$

Related Goals:

• choose best action: $$\qquad\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi$$
• predict reward of action: $$\quad\hat r_t \approx \theta_\star^\top \varphi_t$$
• estimate reward function: $$\quad\hat \theta \approx \theta_\star$$

## Prediction errors

How much should I trust my predicted reward $$\hat\theta^\top \varphi$$ if observed rewards $$r_k = \theta_\star^\top\varphi_k + \varepsilon_k$$ are corrupted by Gaussian noise $$\varepsilon_k\sim\mathcal N(0,\sigma^2)$$?

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\sum_{k=1}^t\varphi_k\varphi_k^\top$$

The prediction error $$(\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi \sim \mathcal N(0,\sigma^2\underbrace{\varphi^\top V_t^{-1}\varphi}_{ \|\varphi\|^2_{V_t^{-1}} } )$$

With probability $$1-\delta$$, we have $$|(\hat\theta_t-\theta_\star)^\top \varphi| \leq \sigma\sqrt{2\log(2/\delta) }\|\varphi\|_{V_t^{-1}}$$
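As a sanity check, this high-probability bound can be verified by simulation; a minimal sketch with numpy, where the design, noise level, and trial count are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma, delta = 3, 200, 0.5, 0.05
theta_star = rng.normal(size=d)
phi = rng.normal(size=d)                  # fixed test direction

features = rng.normal(size=(t, d))        # fixed design {phi_k}
V = features.T @ features                 # V_t = sum_k phi_k phi_k^T
V_inv = np.linalg.inv(V)
width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(phi @ V_inv @ phi)

trials, violations = 2000, 0
for _ in range(trials):
    r = features @ theta_star + sigma * rng.normal(size=t)   # noisy rewards
    theta_hat = V_inv @ (features.T @ r)                     # least squares
    if abs((theta_hat - theta_star) @ phi) > width:
        violations += 1
print(violations / trials)                # empirical rate is at most delta
```

The empirical violation rate comes out well below $$\delta$$, as expected since the bound is not tight for Gaussian noise.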

Correction: last lecture, expressions like $$\|Mx\|_2$$ should have instead been $$\sqrt{x^\top Mx} = \|M^{1/2}x\|_2 =: \|x\|_M$$

For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that $$M=V\Lambda V^\top$$

Here the notation $$M^{1/2}$$ means a matrix such that $$(M^{1/2})^\top M^{1/2} = M$$

So the "square root" of a PSD matrix is $$M^{1/2} = \Lambda^{1/2}V^\top$$, where for a diagonal matrix $$D^{1/2}=\mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})$$
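These identities are easy to check numerically; a short sketch with numpy (the random PSD matrix is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
M = A.T @ A                           # a PSD matrix

lam, V = np.linalg.eigh(M)            # eigendecomposition M = V diag(lam) V^T
M_half = np.diag(np.sqrt(lam)) @ V.T  # M^{1/2} = Lambda^{1/2} V^T

# check the defining property (M^{1/2})^T M^{1/2} = M
assert np.allclose(M_half.T @ M_half, M)

# the weighted norm ||x||_M = sqrt(x^T M x) = ||M^{1/2} x||_2
x = rng.normal(size=4)
assert np.isclose(np.sqrt(x @ M @ x), np.linalg.norm(M_half @ x))
```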

## Confidence Ellipsoids

How much should I trust the estimate $$\hat \theta$$?

Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\}$$

For the right choice of $$\beta_t$$, it is possible to guarantee $$\theta_\star\in\mathcal C_t$$ with high probability

Exercise: For a fixed action $$\varphi$$, show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$

$$\min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
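The exercise can be checked numerically: the closed-form maximizer attains exactly $$\hat\theta^\top\varphi + \sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$, and no other point of the ellipsoid exceeds it. A sketch with numpy, with all matrices and constants illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta = 3, 2.0
A = rng.normal(size=(d, d))
V = A.T @ A + np.eye(d)                    # a positive definite V_t
V_inv = np.linalg.inv(V)
theta_hat = rng.normal(size=d)
phi = rng.normal(size=d)

norm_phi = np.sqrt(phi @ V_inv @ phi)      # ||phi||_{V^{-1}}
bound = theta_hat @ phi + np.sqrt(beta) * norm_phi

# the maximizer over the ellipsoid {||theta - theta_hat||_V^2 <= beta}
theta_best = theta_hat + np.sqrt(beta) * (V_inv @ phi) / norm_phi
assert np.isclose((theta_best - theta_hat) @ V @ (theta_best - theta_hat), beta)
assert np.isclose(theta_best @ phi, bound)

# no sampled boundary point of the ellipsoid exceeds the bound
lam, Q = np.linalg.eigh(V)
V_inv_half = Q @ np.diag(lam ** -0.5) @ Q.T    # symmetric V^{-1/2}
for _ in range(2000):
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    theta = theta_hat + np.sqrt(beta) * V_inv_half @ u
    assert theta @ phi <= bound + 1e-9
```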

example: $$K=2$$ and we've pulled the arms 2 and 1 times respectively, so the observed features are $$\left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \}$$ and $$V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}$$

The confidence set is an ellipse centered at $$(\hat \mu_1,\hat\mu_2)$$: $$2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t$$

Pulling arm 1: $$\hat \mu_1 \pm \sqrt{\beta_t/2}$$

Pulling arm 2: $$\hat \mu_2 \pm \sqrt{\beta_t}$$
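The two interval widths follow directly from $$\sqrt{\beta_t\, e_i^\top V_t^{-1} e_i}$$; a quick numerical check with numpy ($$\beta_t = 1$$ is an illustrative choice):

```python
import numpy as np

beta = 1.0
phis = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)  # arm 1 twice, arm 2 once
V = phis.T @ phis                                       # diag(2, 1)
V_inv = np.linalg.inv(V)

e1, e2 = np.eye(2)
w1 = np.sqrt(beta * e1 @ V_inv @ e1)   # sqrt(beta/2): tighter, arm 1 pulled more
w2 = np.sqrt(beta * e2 @ V_inv @ e2)   # sqrt(beta)
print(w1, w2)
```

The arm pulled more often gets the narrower confidence interval, as the slide indicates.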

example: $$d=2$$ linear bandits with observed features $$\left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \}$$, so

$$V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix},\quad V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}$$

Trying action $$a=[0,1]^\top$$:

• $$\hat \theta^\top a\pm \sqrt{\beta_t a^\top V_t^{-1}a}$$
• $$= \hat \theta_2 \pm \sqrt{3\beta_t/8}$$
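The arithmetic in this example can be verified in a few lines of numpy:

```python
import numpy as np

phis = np.array([[1, 1], [-1, 1], [-1, -1]], dtype=float)
V = phis.T @ phis
print(V)                   # [[3, 1], [1, 3]]

V_inv = np.linalg.inv(V)   # (1/8) [[3, -1], [-1, 3]]
a = np.array([0.0, 1.0])
print(a @ V_inv @ a)       # 3/8
```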

Exercise: For fixed $$\varphi$$, show that best/worst case elements of $$\mathcal C_t$$ are given by $$\theta = \hat\theta_t \pm\frac{\sqrt{\beta_t} V_t^{-1}\varphi}{\|\varphi\|_{V_t^{-1}}}$$

## Regularization

To handle cases where $$\{\varphi_k\}$$ are not full rank, we consider regularized least squares

$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$

Now we have

$$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top$$

The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\}$$
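The equivalence between the closed-form ridge estimator and the regularized objective can be checked by solving the augmented least-squares system; a sketch with numpy (dimensions, noise, and $$\lambda$$ are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, t, lam = 3, 50, 0.1
theta_star = rng.normal(size=d)
phis = rng.normal(size=(t, d))
r = phis @ theta_star + 0.1 * rng.normal(size=t)

# closed form: theta_hat = V^{-1} sum phi_k r_k, V = lam I + sum phi_k phi_k^T
V = lam * np.eye(d) + phis.T @ phis
theta_hat = np.linalg.solve(V, phis.T @ r)

# the same minimizer from the regularized objective, via the augmented system
# [phis; sqrt(lam) I] theta = [r; 0]
X_aug = np.vstack([phis, np.sqrt(lam) * np.eye(d)])
y_aug = np.concatenate([r, np.zeros(d)])
theta_lstsq, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)
assert np.allclose(theta_hat, theta_lstsq)
```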

## From Estimation to Action

How does estimation error affect suboptimality?

• Define the optimal action $$\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi$$
• Suppose we choose $$\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top \varphi$$

The suboptimality is $$\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t$$

• $$\leq \max_{\theta\in\mathcal C}\theta^\top \varphi_t^\star - \min_{\theta\in\mathcal C}\theta^\top \hat\varphi_t\qquad$$ since we don't know $$\theta_\star$$
• $$\leq \hat\theta^\top\varphi_t^\star +\sqrt{\beta}\|\varphi_t^\star \|_{V^{-1}}-(\hat\theta^\top\hat\varphi_t -\sqrt{\beta}\|\hat\varphi_t\|_{V^{-1}})\quad$$ from previous slide
• $$\leq \underbrace{\hat\theta^\top\varphi_t^\star - \hat\theta^\top\hat\varphi_t }_{\leq 0} +\sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})\quad$$ by choice of $$\hat\theta$$

## From Estimation to Action

The suboptimality is $$\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t$$

• $$\leq \sqrt{\beta}(\|\varphi_t^\star \|_{V^{-1}}+\|\hat\varphi_t\|_{V^{-1}})$$
• $$\leq 2\sqrt{\beta \max_{\varphi\in\mathcal A_t} \varphi^\top V^{-1}\varphi }$$
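The chain of inequalities above holds deterministically whenever $$\theta_\star\in\mathcal C$$; a numerical sketch with numpy, choosing $$\beta$$ as the smallest value putting $$\theta_\star$$ in the ellipsoid (all other constants illustrative):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 3
actions = rng.normal(size=(20, d))        # a finite action set A_t
A_mat = rng.normal(size=(d, d))
V = A_mat.T @ A_mat + np.eye(d)           # a positive definite V
V_inv = np.linalg.inv(V)
theta_star = rng.normal(size=d)

worst = -np.inf
for _ in range(200):
    theta_hat = theta_star + 0.3 * rng.normal(size=d)
    # smallest beta such that theta_star lies in the confidence ellipsoid
    beta = (theta_hat - theta_star) @ V @ (theta_hat - theta_star)
    phi_star = actions[np.argmax(actions @ theta_star)]   # optimal action
    phi_hat = actions[np.argmax(actions @ theta_hat)]     # greedy w.r.t. estimate
    subopt = theta_star @ phi_star - theta_star @ phi_hat
    bound = 2 * np.sqrt(beta * max(a @ V_inv @ a for a in actions))
    worst = max(worst, subopt - bound)
print(worst <= 1e-9)   # True: the bound holds in every trial
```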

This perspective motivates techniques in experiment design

• e.g. select $$\{\varphi_k\}$$ to ensure that $$V$$ has a large minimum eigenvalue (so $$V^{-1}$$ will have small eigenvalues)

When $$\{\varphi_k\}_{k=1}^N$$ are chosen at random from "nice" distribution

$$\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}$$ with high probability.

## From Estimation to Action

Informal Theorem: Let the norm of all $$\varphi\in\mathcal A_t$$ be bounded by $$B$$ for all $$t$$. Suppose that $$\{\varphi_k\}_{k=1}^N$$ are chosen at random from a "nice" distribution.

Then with high probability, the suboptimality

• $$\theta_\star^\top \varphi^\star - \theta_\star^\top \hat\varphi \lesssim 2B\sqrt{\beta_t \frac{d}{N} }$$

## Cumulative sub-optimality: regret

For a fixed interaction horizon $$T$$, how to trade-off exploration and exploitation?

Design algorithms with low regret $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t)$$

$$R(T) = \sum_{t=1}^T \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t$$

## Explore-then-commit

ETC

• For $$t=1,\dots,N$$
  • play $$\varphi_t$$ at random
• Estimate $$\hat\theta$$ with least squares
• For $$t=N+1,\dots,T$$
  • play $$\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t} \hat\theta^\top\varphi$$

The regret has two components

$$R(T) = \underbrace{\sum_{t=1}^{N} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
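The ETC steps above can be sketched as a short simulation; everything here (fixed action set, $$\theta_\star$$, noise level) is an illustrative assumption, not part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, sigma = 2, 2000, 0.1
N = int(T ** (2 / 3))                     # exploration length
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta_star = np.array([1.0, 0.5])
best = np.max(actions @ theta_star)

regret, feats, rewards = 0.0, [], []
theta_hat = np.zeros(d)
for t in range(1, T + 1):
    if t <= N:
        phi = actions[rng.integers(len(actions))]       # explore at random
    else:
        phi = actions[np.argmax(actions @ theta_hat)]   # commit to greedy
    r = phi @ theta_star + sigma * rng.normal()
    regret += best - phi @ theta_star
    feats.append(phi)
    rewards.append(r)
    if t == N:                                          # least-squares estimate
        Phi = np.array(feats)
        theta_hat = np.linalg.solve(Phi.T @ Phi, Phi.T @ np.array(rewards))
print(regret)   # far below the worst-case linear growth
```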

## Explore-then-commit

Suppose that $$B$$ bounds the norm of every $$\varphi$$ and $$\|\theta_\star\|\leq 1$$. Then each per-step gap satisfies

• $$\theta_\star^\top \varphi_\star - \theta_\star^\top \varphi \leq 2B$$

so we have $$R_1\leq 2BN$$

## Explore-then-commit

Suppose that $$\max_t\beta_t\leq \beta$$. Using the sub-optimality result, with high probability $$R_2 \lesssim (T-N)2B\sqrt{\beta \frac{d}{N} }$$

## Explore-then-commit

The regret is bounded with high probability by

$$R(T) \lesssim 2BN + 2BT\sqrt{\beta \frac{d}{N} }$$

Choosing $$N=T^{2/3}$$ leads to sublinear regret

• $$R(T) \lesssim T^{2/3}$$
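The choice $$N=T^{2/3}$$ balances the two regret terms; dropping constants, minimizing $$N + T/\sqrt N$$ over $$N$$ recovers the same order. A quick numerical check:

```python
import numpy as np

T = 10_000
Ns = np.arange(1, T)
bound = Ns + T / np.sqrt(Ns)   # shape of 2BN + 2BT sqrt(beta d / N), constants dropped
N_best = Ns[np.argmin(bound)]
print(N_best, T ** (2 / 3))    # both on the order of T^{2/3}
```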

## Upper Confidence Bound

Adaptive perspective: optimism in the face of uncertainty

Instead of exploring randomly, focus exploration on promising actions

UCB

• Initialize $$V_0=\lambda I$$, $$b_0=0$$
• For $$t=1,\dots,T$$
  • play $$\displaystyle \varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top \varphi$$
  • update $$V_t = V_{t-1}+\varphi_t\varphi_t^\top$$ and $$b_t = b_{t-1}+r_t\varphi_t$$
  • $$\hat\theta_t = V_t^{-1}b_t$$
  • $$\mathcal C_t = \{\theta \mid \|\theta-\hat\theta_t \|_{V_t}^2\leq \beta_t\}$$
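The UCB loop above can be sketched in numpy using the closed-form optimistic index $$\hat\theta^\top\varphi + \sqrt{\beta}\|\varphi\|_{V^{-1}}$$; the fixed action set, $$\theta_\star$$, and constant $$\beta$$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
d, T, lam, sigma, beta = 2, 1000, 1.0, 0.1, 4.0   # beta held constant for simplicity
actions = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
theta_star = np.array([1.0, 0.5])
best = np.max(actions @ theta_star)

V, b, regret = lam * np.eye(d), np.zeros(d), 0.0
for t in range(T):
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b
    # optimistic index: theta_hat^T phi + sqrt(beta) ||phi||_{V^{-1}}
    widths = np.sqrt(np.einsum("ij,jk,ik->i", actions, V_inv, actions))
    phi = actions[np.argmax(actions @ theta_hat + np.sqrt(beta) * widths)]
    r = phi @ theta_star + sigma * rng.normal()
    V += np.outer(phi, phi)
    b += r * phi
    regret += best - phi @ theta_star
print(regret)   # far below the worst-case linear growth
```

Note that exploration is automatic: an under-sampled action has a large width $$\|\varphi\|_{V^{-1}}$$, so its optimistic index stays high until it is tried.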

## Sub-optimality of optimism

• Define the optimal action $$\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t} \theta_\star^\top \varphi$$
• Suppose we choose optimistic $$\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}} \theta^\top \varphi$$
• let $$\tilde\theta_t = \arg\max_{\theta\in\mathcal C_{t-1}} \theta^\top \hat\varphi_t$$

The suboptimality is $$\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t$$

• $$\leq \tilde\theta_t^\top \hat\varphi_t - \theta_\star^\top \hat\varphi_t\quad$$ by choice of $$\hat\varphi_t$$
• $$= (\tilde \theta_t - \theta_\star)^\top \hat\varphi_t$$
• $$= (V_{t-1}^{1/2}(\tilde \theta_t - \theta_\star))^\top(V_{t-1}^{-1/2}\hat\varphi_t )$$
• $$\leq \|\tilde \theta_t - \theta_\star\|_{V_{t-1}} \|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad$$ by Cauchy-Schwarz
• $$\leq 2\sqrt{\beta_{t-1}}\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad$$ since both $$\tilde\theta_t$$ and $$\theta_\star$$ lie in the confidence ellipsoid $$\mathcal C_{t-1}$$

## Upper Confidence Bound

Proof Sketch: $$R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t$$

• $$\leq 2\sqrt{\beta}\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}\quad$$ from previous slide
• $$\leq 2\sqrt{T\beta\sum_{t=1}^T\|\hat\varphi_t\|_{V_{t-1}^{-1}}^2}\quad$$ by Cauchy-Schwarz
• $$\lesssim \sqrt{T}$$ following Lemma 19.4 in Bandit Algorithms
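The key fact behind the last step (the "elliptical potential" argument of Lemma 19.4) is that $$\sum_{t=1}^T\|\varphi_t\|^2_{V_{t-1}^{-1}}$$ grows only like $$d\log T$$, not $$T$$. A numerical sketch with random unit-norm features (dimensions and horizon illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d, T = 5, 5000
V = np.eye(d)                  # V_0 = I from regularization
total = 0.0
for t in range(T):
    phi = rng.normal(size=d)
    phi /= np.linalg.norm(phi)              # unit-norm feature
    total += phi @ np.linalg.solve(V, phi)  # ||phi_t||^2 in the V_{t-1}^{-1} norm
    V += np.outer(phi, phi)
print(total)                   # on the order of d log T, far smaller than T
```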

The regret is bounded with high probability by

$$R(T) \lesssim \sqrt{T}$$

After fall break: action in a dynamical world (optimal control)

## Recap

• Linear contextual bandits
• Confidence bounds
• Sub-optimality & regret
• ETC ($$T^{2/3}$$) and UCB ($$\sqrt{T}$$)

Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvári
