Prof Sarah Dean
Reference: Giannakis, Henriksen, Tropp, Ward, Learning to Forecast Dynamical Systems from Streaming Data
Prediction in Unit 2: Action
filter \(\mathcal Y^t \to \mathcal Y\) (or \(\to \mathcal S\)): observe \(y_t\), accumulate \(\{y_t\}\), identify \(\hat F\) and maintain internal state \(s\), predict \(\hat y_{t+1}\) or \(\hat s_t\)
model \(f_t:\mathcal X\to\mathcal Y\): observe \(x_t\), accumulate \(\{(x_t, y_t)\}\), predict \(\hat y_{t}\)
policy \(\pi_t:\mathcal X\to\mathcal A\): observe \(x_t\), accumulate \(\{(x_t, a_t, r_t)\}\), act \(a_{t}\)
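For concreteness, the three interaction patterns above might be sketched as interfaces like the following; this is only a sketch, and the class and method names are illustrative rather than from the lecture.

```python
from typing import Protocol, Sequence
import numpy as np

class Filter(Protocol):
    """Y^t -> Y (or -> S): from the history y_1, ..., y_t, predict y_{t+1} or a state s_t."""
    def predict(self, ys: Sequence[float]) -> float: ...

class Model(Protocol):
    """f_t: X -> Y, fit from accumulated pairs {(x_k, y_k)}."""
    def predict(self, x_t: np.ndarray) -> float: ...

class Policy(Protocol):
    """pi_t: X -> A, updated from accumulated triples {(x_k, a_k, r_k)}."""
    def act(self, x_t: np.ndarray) -> int: ...
    def update(self, x_t: np.ndarray, a_t: int, r_t: float) -> None: ...
```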
Contextual Bandits
e.g. traffic routing, online advertising, campaign planning, experiment design, ...
Whittle, Peter (1979), "Discussion of Dr Gittins' paper", Journal of the Royal Statistical Society, Series B, 41 (2): 148–177
A Stochastic Model with Applications to Learning, 1953, cites the problem as arising from "Professor Merrill Flood, then at the RAND Corporation," who encountered it in experimental work on learning models.
An Elementary Survey of Statistical Decision Theory, 1954
Related Goals:
Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = r(x_t, a_t) + \varepsilon_t,\qquad \mathbb E[\varepsilon_t]=0,~~\mathbb E[\varepsilon_t^2] = \sigma^2$$
e.g. the machine model affects rewards, so the context \(x = (\bullet, \dots, \bullet)\) includes it among its features
e.g. betting on horseracing
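A minimal simulation of the reward model above, assuming a \(K\)-armed action set with the linear mean rewards introduced just below; the dimensions, noise level, and uniformly random action choice are placeholders, not part of the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, sigma = 5, 3, 0.1                       # context dimension, number of arms, noise std
Theta = rng.standard_normal((K, d))           # row a is the (unknown) parameter of arm a

def reward(x, a):
    """Noisy reward r(x, a) + eps_t with E[eps_t] = 0 and E[eps_t^2] = sigma^2."""
    return Theta[a] @ x + sigma * rng.standard_normal()

history = []                                   # accumulate {(x_t, a_t, r_t)}
for t in range(100):
    x_t = rng.standard_normal(d)               # environment reveals context
    a_t = int(rng.integers(K))                 # placeholder policy: uniform random arm
    r_t = reward(x_t, a_t)                     # observe noisy reward
    history.append((x_t, a_t, r_t))
```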
Taking action \(a_t\in\mathcal A\) in context \(x_t\) yields reward $$r_t = \langle\theta_\star, \varphi(x_t, a_t)\rangle + \varepsilon_t$$
\(\iff\) Equivalent perspective: define \(\mathcal A_t = \{\varphi(x_t, a) \mid a\in\mathcal A\}\) and choose actions \(\varphi_t\in\mathcal A_t\) directly.
\(K\)-armed contextual bandits where \(\mathbb E[r(x, a)] = \tilde \theta_a^\top x\) \(\iff\) choose \(\varphi_t\in \mathcal A_t = \{e_1\otimes x_t,\dots, e_K \otimes x_t\}\) and observe rewards determined by \(\theta_\star = \begin{bmatrix} \tilde \theta_1\\ \vdots \\ \tilde\theta_K\end{bmatrix}\)
Linear contextual bandits where \(\mathbb E[r(x, a)] = ( \Theta_\star x)^\top a\) \(\iff\) choose \(\varphi_t\in \mathcal A_t = \{\mathrm{vec}(ax^\top)\mid \|a\|\leq 1\}\) and observe rewards determined by \(\theta_\star = \mathrm{vec}(\Theta_\star)\)
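As a sanity check, here is a sketch of the two feature maps above so the equivalences can be verified numerically; the dimensions and the specific check are illustrative choices.

```python
import numpy as np

def phi_k_armed(x, a, K):
    """K-armed case: phi(x, a) = e_a (x) x, so <theta_star, phi(x, a)> = tilde_theta_a^T x."""
    return np.kron(np.eye(K)[a], x)

def phi_linear_action(x, a):
    """Linear-action case: phi(x, a) = vec(a x^T), so <vec(Theta_star), phi(x, a)> = (Theta_star x)^T a."""
    return np.outer(a, x).reshape(-1)

# Numerical check of the K-armed identification, with arbitrary dimensions.
rng = np.random.default_rng(0)
d, K = 4, 3
tilde_thetas = rng.standard_normal((K, d))      # rows tilde_theta_1, ..., tilde_theta_K
theta_star = tilde_thetas.reshape(-1)           # stacked as on the slide
x, a = rng.standard_normal(d), 1
assert np.isclose(theta_star @ phi_k_armed(x, a, K), tilde_thetas[a] @ x)
```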
Exercise: Formulate the following in the contextual bandit framework.
Taking action \(\varphi_t\in\mathcal A_t\) yields reward $$\mathbb E[r_t] = \langle\theta_\star, \varphi_t\rangle $$
Strategy:
Estimate \(\hat\theta\) from data \(\{(\varphi_k, r_k)\}_{k=1}^t\)
$$\hat\theta_t = \arg\min_\theta \sum_{k=1}^t (\theta^\top \varphi_k - r_k)^2$$
$$\hat\theta_t ={\underbrace{ \left(\sum_{k=1}^t \varphi_k\varphi_k^\top\right)}_{V_t}}^{-1}\sum_{k=1}^t \varphi_k r_k $$
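In code, this closed form is a single linear solve; a minimal sketch, where the tiny ridge term is my addition so that \(V_t\) is always invertible (the formula on the slide has no regularizer).

```python
import numpy as np

def least_squares_estimate(phis, rewards, reg=1e-6):
    """theta_hat_t = V_t^{-1} sum_k phi_k r_k, where V_t = sum_k phi_k phi_k^T (+ tiny ridge)."""
    Phi = np.asarray(phis)                                  # shape (t, d): row k is phi_k^T
    r = np.asarray(rewards)                                 # shape (t,)
    V_t = Phi.T @ Phi + reg * np.eye(Phi.shape[1])          # V_t = sum_k phi_k phi_k^T
    theta_hat = np.linalg.solve(V_t, Phi.T @ r)             # V_t^{-1} sum_k phi_k r_k
    return theta_hat, V_t
```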
Recall that \(\hat\theta_t-\theta_\star = V_t^{-1}\sum_{k=1}^t \varepsilon_k \varphi_k\): substituting \(r_k = \varphi_k^\top\theta_\star + \varepsilon_k\) into the closed form gives \(\hat\theta_t = V_t^{-1}\sum_{k=1}^t \varphi_k(\varphi_k^\top\theta_\star + \varepsilon_k) = \theta_\star + V_t^{-1}\sum_{k=1}^t \varepsilon_k\varphi_k\).
Therefore, \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi\)
How much should I trust my predicted reward \(\hat\theta^\top \varphi\)?
Since \((\hat\theta_t-\theta_\star)^\top \varphi =\sum_{k=1}^t \varepsilon_k (V_t^{-1}\varphi_k)^\top \varphi\), if \((\varepsilon_k)_{k\leq t}\) are Gaussian and independent from \((\varphi_k)_{k\leq t}\), then this error is distributed as \(\mathcal N\Big(0, \sigma^2\sum_{k=1}^t ((V_t^{-1}\varphi_k)^\top \varphi)^2\Big) = \mathcal N(0, \sigma^2\,\varphi^\top V_t^{-1}\varphi)\).
Therefore, with probability \(1-\delta\), we have \(|(\hat\theta_t-\theta_\star)^\top \varphi| \leq \sigma\sqrt{2\log(2/\delta) }\underbrace{\sqrt{\varphi^\top V_t^{-1}\varphi}}_{\|\varphi\|_{V_t^{-1}}}\)
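The width of this interval is just a quadratic form in \(V_t^{-1}\); a minimal sketch under the Gaussian, independent-noise assumption above (function and argument names are mine).

```python
import numpy as np

def predicted_reward_interval(theta_hat, V_t, phi, sigma, delta):
    """Point prediction theta_hat^T phi and the half-width
    sigma * sqrt(2 log(2/delta)) * ||phi||_{V_t^{-1}} from the bound above."""
    width = sigma * np.sqrt(2 * np.log(2 / delta)) * np.sqrt(phi @ np.linalg.solve(V_t, phi))
    return theta_hat @ phi, width
```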
How much should I trust the estimate \(\hat \theta\)?
Define the confidence ellipsoid $$\mathcal C_t = \{\theta\in\mathbb R^d \mid \|\theta-\hat\theta_t\|_{V_t}^2\leq \beta_t\} $$
For the right choice of \(\beta_t\), we can show that with high probability, \(\theta_\star\in\mathcal C_t\)
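A sketch of the membership test defining \(\mathcal C_t\); choosing \(\beta_t\) so that \(\theta_\star\in\mathcal C_t\) with high probability is the subtle part and is covered in the reference below, so no particular \(\beta_t\) is hard-coded here.

```python
import numpy as np

def in_confidence_ellipsoid(theta, theta_hat, V_t, beta_t):
    """Membership test ||theta - theta_hat||_{V_t}^2 <= beta_t for the ellipsoid C_t."""
    diff = theta - theta_hat
    return float(diff @ V_t @ diff) <= beta_t
```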
Reference: Ch 19-20 in Bandit Algorithms by Lattimore & Szepesvari
Next time: low regret algorithms