Prof Sarah Dean
[Recap figures]
Prediction: a model \(f_t:\mathcal X\to\mathcal Y\) maps each observation \(x_t\) to a prediction \(\hat y_{t}\), and we accumulate data \(\{(x_t, y_t)\}\).
Action: a policy \(\pi_t:\mathcal X\to\mathcal A\) maps each observation \(x_t\) to an action \(a_{t}\), and we accumulate data \(\{(x_t, a_t, r_t)\}\).
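As a minimal illustration of the action loop, here is a sketch in Python; the `env` and `policy` callables are hypothetical stand-ins and not part of the lecture.

```python
# Minimal sketch of the action feedback loop (names are illustrative).
def run_action_loop(env, policy, T):
    """env.observe() -> context, env.reward(x, a) -> reward, policy(x, data) -> action."""
    data = []
    for t in range(T):
        x_t = env.observe()           # observation / context
        a_t = policy(x_t, data)       # action chosen from accumulated data
        r_t = env.reward(x_t, a_t)    # reward feedback
        data.append((x_t, a_t, r_t))  # accumulate {(x_t, a_t, r_t)}
    return data
```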
Linear Contextual Bandits
Related Goals:
How much should I trust my predicted reward \(\hat\theta^\top \varphi\) if observed rewards are corrupted by Gaussian noise, i.e. \(r_k = \theta_\star^\top\varphi_k + \varepsilon_k\) with \(\varepsilon_k\sim\mathcal N(0,\sigma^2)\)?
The least-squares estimate is $$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\sum_{k=1}^t\varphi_k\varphi_k^\top $$
The prediction error \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi \sim \mathcal N(0,\sigma^2\underbrace{\varphi^\top V_t^{-1}\varphi}_{ \|\varphi\|^2_{V_t^{-1}} } )\)
With probability \(1-\delta\), we have \(|(\hat\theta_t-\theta_\star)^\top \varphi| \leq \sigma\sqrt{2\log(2/\delta) }\|\varphi\|_{V_t^{-1}}\)
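A quick numerical sanity check of this bound (a sketch: the dimension, horizon, noise level \(\sigma\), and \(\delta\) below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d, t, sigma, delta = 3, 200, 0.5, 0.05
theta_star = rng.normal(size=d)
Phi = rng.normal(size=(t, d))                      # features phi_1, ..., phi_t
rewards = Phi @ theta_star + sigma * rng.normal(size=t)

V = Phi.T @ Phi                                    # V_t = sum_k phi_k phi_k^T
theta_hat = np.linalg.solve(V, Phi.T @ rewards)    # least-squares estimate

phi = rng.normal(size=d)                           # fixed test direction
err = abs((theta_hat - theta_star) @ phi)
width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(phi @ np.linalg.solve(V, phi))
print(err <= width)                                # holds with probability >= 1 - delta
```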
Correction: last lecture, expressions like \(\|Mx\|_2\) should have instead been $$\sqrt{x^\top Mx} = \|M^{1/2}x\|_2 =: \|x\|_M$$
where the notation \(M^{1/2}\) means a matrix such that $$(M^{1/2})^\top M^{1/2} = M$$
For symmetric matrices with non-negative eigenvalues (i.e. "positive semi-definite matrices"), we know that \(M=V\Lambda V^\top\) with \(V\) orthogonal and \(\Lambda\) diagonal
For a diagonal matrix, \(\Lambda^{1/2}=\mathrm{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})\)
So the "square root" of a PSD matrix is $$ M^{1/2} = \Lambda^{1/2}V^\top $$
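A small numpy check of this construction (the matrix below is arbitrary):

```python
import numpy as np

A = np.random.default_rng(1).normal(size=(3, 3))
M = A.T @ A                                   # a PSD matrix
lam, V = np.linalg.eigh(M)                    # M = V diag(lam) V^T
M_half = np.diag(np.sqrt(lam)) @ V.T          # M^{1/2} = Lambda^{1/2} V^T
print(np.allclose(M_half.T @ M_half, M))      # (M^{1/2})^T M^{1/2} = M

x = np.array([1.0, -2.0, 0.5])
print(np.isclose(np.sqrt(x @ M @ x), np.linalg.norm(M_half @ x)))  # ||x||_M two ways
```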
How much should I trust the estimate \(\hat \theta\)?
Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
For the right choice of \(\beta_t\), it is possible to guarantee \(\theta_\star\in\mathcal C_t\) with high probability
Exercise: For a fixed action \(\varphi\), show that $$\max_{\theta\in\mathcal C_t} \theta^\top \varphi \leq \hat\theta^\top \varphi +\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
$$\min_{\theta\in\mathcal C_t}\theta^\top \varphi \geq \hat\theta^\top\varphi -\sqrt{\beta_t}\|\varphi\|_{V_t^{-1}}$$
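A numerical check of this exercise (a sketch; \(V_t\), \(\beta_t\), \(\hat\theta\), and \(\varphi\) below are arbitrary choices): sample points on the boundary of \(\mathcal C_t\) and compare against the closed-form upper bound.

```python
import numpy as np

rng = np.random.default_rng(2)
d, beta = 2, 4.0
V = np.array([[3.0, 1.0], [1.0, 3.0]])
theta_hat = rng.normal(size=d)
phi = np.array([0.0, 1.0])

# Points on the boundary of C_t = {||theta - theta_hat||_V^2 <= beta}:
# theta = theta_hat + sqrt(beta) * S u with ||u|| = 1 and S^T V S = I.
S = np.linalg.inv(np.linalg.cholesky(V)).T
u = rng.normal(size=(100000, d))
u /= np.linalg.norm(u, axis=1, keepdims=True)
thetas = theta_hat + np.sqrt(beta) * u @ S.T

empirical_max = (thetas @ phi).max()
bound = theta_hat @ phi + np.sqrt(beta) * np.sqrt(phi @ np.linalg.solve(V, phi))
print(empirical_max, bound)   # empirical max approaches (and never exceeds) the bound
```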
example: \(K=2\) and we've pulled the arms 2 and 1 times respectively, so the observed features are $$ \left\{\begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 1\\ 0\end{bmatrix}, \begin{bmatrix} 0\\ 1\end{bmatrix}\right \} $$ and \(V_t = \begin{bmatrix} 2& \\& 1\end{bmatrix}\)
The confidence set is an ellipse centered at \((\hat \mu_1,\hat\mu_2)\): \(2(\mu_1-\hat\mu_1)^2 + (\mu_2-\hat\mu_2)^2 \leq \beta_t\)
Pulling arm 1: \(\hat \mu_1 \pm \sqrt{\beta_t/2}\)
Pulling arm 2: \(\hat \mu_2 \pm \sqrt{\beta_t}\)
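The same interval widths, computed directly (a sketch with \(\beta_t=1\) chosen for concreteness):

```python
import numpy as np

beta = 1.0                                   # arbitrary choice for illustration
Phi = np.array([[1, 0], [1, 0], [0, 1]])     # arms pulled: 1, 1, 2
V = Phi.T @ Phi                              # diag(2, 1)
widths = np.sqrt(beta * np.diag(np.linalg.inv(V)))
print(widths)                                # [sqrt(beta/2), sqrt(beta)]
```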
example: \(d=2\) linear bandits with observed features $$ \left\{\begin{bmatrix} 1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ 1\end{bmatrix}, \begin{bmatrix} -1\\ -1\end{bmatrix}\right \} $$ so that \(V_t = \begin{bmatrix} 3&1 \\1& 3\end{bmatrix}\) and \(V_t^{-1} = \frac{1}{8} \begin{bmatrix} 3&-1 \\-1& 3\end{bmatrix}\)
The confidence set is an ellipse centered at \(\hat \theta\)
Trying action \(a=[0,1]\): predicted reward lies in \(\hat\theta^\top a \pm \sqrt{\beta_t}\|a\|_{V_t^{-1}}\), with \(\|a\|^2_{V_t^{-1}} = 3/8\)
Exercise: For fixed \(\varphi\), show that the best/worst case elements of \(\mathcal C_t\) are given by $$\theta = \hat\theta_t \pm\frac{\sqrt{\beta_t} V_t^{-1}\varphi}{\|\varphi\|_{V_t^{-1}}}$$
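A sketch verifying the exercise on this example (with placeholder values \(\beta_t=1\) and \(\hat\theta=0\), which are not specified in the lecture):

```python
import numpy as np

beta = 1.0                                        # arbitrary for illustration
Phi = np.array([[1.0, 1.0], [-1.0, 1.0], [-1.0, -1.0]])
V = Phi.T @ Phi                                   # [[3, 1], [1, 3]]
V_inv = np.linalg.inv(V)                          # (1/8) [[3, -1], [-1, 3]]

theta_hat = np.zeros(2)                           # placeholder estimate
phi = np.array([0.0, 1.0])                        # trying action a = [0, 1]
offset = np.sqrt(beta) * V_inv @ phi / np.sqrt(phi @ V_inv @ phi)
best, worst = theta_hat + offset, theta_hat - offset
print(best @ phi, worst @ phi)   # theta_hat^T phi +/- sqrt(beta) ||phi||_{V^{-1}}
```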
To handle cases where \(\{\varphi_k\}\) are not full rank, we consider regularized LS $$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2 + \lambda\|\theta\|_2^2$$
Now we have $$\hat\theta_t =V_t^{-1}\sum_{k=1}^t \varphi_k r_k,\quad V_t=\lambda I+\sum_{k=1}^t\varphi_k\varphi_k^\top $$
The confidence ellipsoid takes the same form $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
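A minimal implementation sketch of the regularized estimator (the function name is illustrative):

```python
import numpy as np

def ridge_estimate(Phi, rewards, lam=1.0):
    """Regularized least-squares estimate theta_hat and design matrix V_t."""
    d = Phi.shape[1]
    V = lam * np.eye(d) + Phi.T @ Phi          # V_t = lambda I + sum_k phi_k phi_k^T
    theta_hat = np.linalg.solve(V, Phi.T @ rewards)
    return theta_hat, V

# even a single, rank-deficient observation yields a well-defined estimate
theta_hat, V = ridge_estimate(np.array([[1.0, 0.0]]), np.array([0.7]))
print(theta_hat)
```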
How does estimation error affect suboptimality?
The suboptimality of the greedy action \(\hat\varphi_t = \arg\max_{\varphi\in\mathcal A_t}\hat\theta_t^\top\varphi\) is \(\theta_\star^\top \varphi_t^\star - \theta_\star^\top \hat\varphi_t\), where \(\varphi_t^\star = \arg\max_{\varphi\in\mathcal A_t}\theta_\star^\top\varphi\)
When \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution, \(\|V^{-1}\| = \frac{1}{\lambda_{\min}(V)} \lesssim \frac{d}{N}\) with high probability.
Informal Theorem: Let the norm of all \(\varphi\in\mathcal A_t\) be bounded by \(B\) for all \(t\). Suppose that \(\{\varphi_k\}_{k=1}^N\) are chosen at random from a "nice" distribution. Then with high probability, the suboptimality is \(\lesssim 2B\sqrt{\beta_N \frac{d}{N}}\).
This perspective motivates techniques in experiment design
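A simulation sketch of this scaling, assuming standard Gaussian features as the "nice" distribution (an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma = 5, 0.5
theta_star = rng.normal(size=d)
for N in [100, 1000, 10000]:
    Phi = rng.normal(size=(N, d))             # isotropic feature distribution
    r = Phi @ theta_star + sigma * rng.normal(size=N)
    V = Phi.T @ Phi
    theta_hat = np.linalg.solve(V, Phi.T @ r)
    print(N, 1 / np.linalg.eigvalsh(V)[0], np.linalg.norm(theta_hat - theta_star))
    # 1/lambda_min(V) shrinks like 1/N here (so certainly <= d/N);
    # the estimation error shrinks like sqrt(d/N)
```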
For a fixed interaction horizon \(T\), how to trade-off exploration and exploitation?
Design algorithms with low regret $$R(T) = \sum_{t=1}^T\max_{a\in\mathcal A} r(x_t, a) - r(x_t, a_t) $$
$$R(T) = \sum_{t=1}^T \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t $$
ETC (Explore-Then-Commit): explore with randomly chosen actions for the first \(N\) rounds, then commit to the greedy action \(\hat\varphi_t=\arg\max_{\varphi\in\mathcal A_t}\hat\theta_N^\top\varphi\) for the remaining rounds
The regret has two components
$$R(T) = \underbrace{\sum_{t=1}^{N} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_1} + \underbrace{\sum_{t=N+1}^{T} \max_{\varphi\in\mathcal A_t}\theta_\star^\top \varphi - \theta_\star^\top \varphi_t }_{R_2}$$
Suppose that \(B\) bounds the norm of \(\varphi\) and \(\|\theta_\star\|\leq 1\).
Then we have \(R_1\leq 2BN\)
Suppose that \(\max_t\beta_t\leq \beta\).
Using the sub-optimality result, with high probability \(R_2 \lesssim (T-N)2B\sqrt{\beta \frac{d}{N} }\)
The regret is bounded with high probability by
$$R(T) \lesssim 2BN + 2BT\sqrt{\beta \frac{d}{N} }$$
Choosing \(N=T^{2/3}\) balances the two terms and leads to sublinear regret \(R(T)\lesssim T^{2/3}\)
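A simulation sketch of explore-then-commit on a toy linear bandit; the random unit-norm action sets and noise level below are illustrative assumptions, not specified in the lecture:

```python
import numpy as np

rng = np.random.default_rng(4)
d, T, sigma = 3, 5000, 0.1
N = int(T ** (2 / 3))                          # exploration length
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)       # ||theta_star|| <= 1

def action_set(K=10):
    """Toy action set: K random unit-norm feature vectors per round."""
    A = rng.normal(size=(K, d))
    return A / np.linalg.norm(A, axis=1, keepdims=True)

regret, V, s = 0.0, np.eye(d), np.zeros(d)     # lambda = 1 regularization
theta_hat = np.zeros(d)
for t in range(T):
    A = action_set()
    if t < N:                                  # explore: pick a random action
        phi = A[rng.integers(len(A))]
    else:                                      # commit: greedy w.r.t. theta_hat
        phi = A[np.argmax(A @ theta_hat)]
    r = phi @ theta_star + sigma * rng.normal()
    regret += (A @ theta_star).max() - phi @ theta_star
    if t < N:                                  # only exploration data is used
        V += np.outer(phi, phi)
        s += phi * r
        theta_hat = np.linalg.solve(V, s)
print(N, regret)                               # total regret; theory: ~ T^{2/3}
```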
Adaptive perspective: optimism in the face of uncertainty
Instead of exploring randomly, focus exploration on promising actions
UCB (Upper Confidence Bound): at each round, choose the optimistic action $$\varphi_t = \arg\max_{\varphi\in\mathcal A_t} \max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi = \arg\max_{\varphi\in\mathcal A_t}\ \hat\theta_{t-1}^\top\varphi + \sqrt{\beta_{t-1}}\|\varphi\|_{V_{t-1}^{-1}}$$
Proof Sketch: \(R(T) = \sum_{t=1}^T \theta_\star^\top \varphi_t^\star - \theta_\star^\top \varphi_t\). When \(\theta_\star\in\mathcal C_{t-1}\), optimism gives \(\theta_\star^\top\varphi_t^\star \leq \max_{\theta\in\mathcal C_{t-1}}\theta^\top\varphi_t\), so each per-step suboptimality is at most \(2\sqrt{\beta_t}\|\varphi_t\|_{V_{t-1}^{-1}}\); summing over \(t\) and applying Cauchy-Schwarz with the elliptical potential lemma gives the bound.
The regret is bounded with high probability by
$$R(T) \lesssim \sqrt{T}$$
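A sketch of the optimistic (LinUCB-style) strategy on the same toy setup as the ETC sketch above; the fixed \(\beta\) is an illustrative simplification (in the analysis \(\beta_t\) grows logarithmically):

```python
import numpy as np

rng = np.random.default_rng(5)
d, T, sigma, lam, beta = 3, 5000, 0.1, 1.0, 4.0   # fixed beta for illustration
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)

regret, V, s = 0.0, lam * np.eye(d), np.zeros(d)
for t in range(T):
    A = rng.normal(size=(10, d))
    A /= np.linalg.norm(A, axis=1, keepdims=True)   # toy action set, ||phi|| = 1
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ s
    ucb = A @ theta_hat + np.sqrt(beta) * np.sqrt(np.sum((A @ V_inv) * A, axis=1))
    phi = A[np.argmax(ucb)]                         # optimistic action
    r = phi @ theta_star + sigma * rng.normal()
    V += np.outer(phi, phi)
    s += phi * r
    regret += (A @ theta_star).max() - phi @ theta_star
print(regret)                                       # total regret; theory: ~ sqrt(T)
```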
After fall break: action in a dynamical world (optimal control)
Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari