CS 4/5789: Introduction to Reinforcement Learning
Lecture 21
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Model-Based RL with Exploration
2. UCB Value Iteration Algorithm
3. UCB-VI Analysis
Announcements
5789 Paper Review Assignment (weekly pace suggested)
HW 3 released Monday, due 4/25
Final exam Monday 5/16 at 7pm


Recap: Contextual Bandit
A (less) simplified setting for studying exploration
ex - machine make and model affect rewards, so context \(x=(\bullet, \bullet, \bullet, \bullet, \bullet, \bullet, \bullet, \bullet)\)
- See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\); in the linear case, \(\mu_a(x)=\theta_a^\top x\)
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
LinUCB Algorithm
For \(t=1,...,T\):
- Observe context \(x_t\)
- Pull \(a_t = \arg\max_a (\widehat \theta^a_t)^\top x_t + \alpha \sqrt{x_t^\top (A_t^a)^{-1} x_t}\)
- Update \(\widehat \theta^{a_t}\) and \(A^{a_t}\)
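The loop above can be sketched in Python as follows. This is a minimal simulation, not the course's reference implementation. It assumes the usual least-squares update \(A^{a} \leftarrow A^{a} + x_t x_t^\top\), \(b^{a} \leftarrow b^{a} + r_t x_t\), \(\widehat\theta^{a} = (A^{a})^{-1} b^{a}\), which the slide does not spell out. The function name `linucb`, the ridge term `lam` (used so that \(A^a\) is invertible from the start), and the Gaussian reward noise are illustrative assumptions.

```python
import numpy as np

def linucb(contexts, true_theta, alpha=1.0, noise_std=0.1, lam=1.0, seed=0):
    """Simulate LinUCB on a linear contextual bandit.

    contexts:   (T, d) array whose rows are the contexts x_t
    true_theta: (K, d) array of hypothetical true parameters theta_a
    Returns the chosen arms and the cumulative expected regret.
    """
    rng = np.random.default_rng(seed)
    T, d = contexts.shape
    K = true_theta.shape[0]
    # Assumption: A^a starts at lam * I so it is invertible before any pulls.
    A = np.stack([lam * np.eye(d) for _ in range(K)])
    b = np.zeros((K, d))  # running sum of r * x for each arm
    arms = np.zeros(T, dtype=int)
    regret = np.zeros(T)
    for t in range(T):
        x = contexts[t]
        ucb = np.empty(K)
        for a in range(K):
            A_inv = np.linalg.inv(A[a])
            theta_hat = A_inv @ b[a]
            # Estimated mean reward plus exploration bonus.
            ucb[a] = theta_hat @ x + alpha * np.sqrt(x @ A_inv @ x)
        a_t = int(np.argmax(ucb))
        # Simulated reward: linear mean plus Gaussian noise (assumption).
        r = true_theta[a_t] @ x + noise_std * rng.standard_normal()
        # Update only the statistics of the arm that was pulled.
        A[a_t] += np.outer(x, x)
        b[a_t] += r * x
        means = true_theta @ x
        regret[t] = means.max() - means[a_t]
        arms[t] = a_t
    return arms, np.cumsum(regret)
```

Calling `linucb(contexts, true_theta)` with a `(T, d)` context array and a `(K, d)` parameter array returns the arm pulled in each round together with the running sum of \(\mu^*(x_t) - \mu_{a_t}(x_t)\), which can be plotted against \(t\) to inspect how regret grows.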
Contextual Bandit
- Claim: with high probability, \(\theta_a^\top x \leq (\widehat \theta^a_t)^\top x + \alpha \sqrt{x^\top (A_t^a)^{-1} x}\)
- Chebychev's: If \(\mathbb E[u]=0\) then with probability at least \(1-1/\beta^2\), $$|u| \leq \beta \sqrt{\mathbb E[u^2]}$$
Proof of Claim:
- Strategy: apply Chebychev's with \(u=\theta_a^\top x - (\widehat \theta^a_t)^\top x\)
- Last lecture: showed \(\mathbb E[(\theta_a - \widehat \theta^a_t)^\top x]=0\)
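One way the remaining step might go, sketched under assumptions that are not stated on this slide: treat the contexts \(x_i\) on which arm \(a\) was pulled before round \(t\) as fixed, assume rewards \(r_i = \theta_a^\top x_i + \varepsilon_i\) with independent zero-mean noise of variance \(\sigma^2\) (the symbols \(\varepsilon_i\) and \(\sigma\) are introduced here for illustration), and take the unregularized least-squares estimate \(\widehat\theta^a_t = (A_t^a)^{-1}\sum_i x_i r_i\) with \(A_t^a = \sum_i x_i x_i^\top\). Then the second moment of \(u\) is

$$\begin{aligned}
\widehat\theta^a_t - \theta_a &= (A_t^a)^{-1}\sum_i x_i \varepsilon_i \\
\mathbb E[u^2] &= x^\top (A_t^a)^{-1}\Big(\sum_i \sigma^2 x_i x_i^\top\Big)(A_t^a)^{-1} x = \sigma^2\, x^\top (A_t^a)^{-1} x
\end{aligned}$$

Chebychev's inequality in the form \(\mathbb P\big(|u| \geq \beta\sqrt{\mathbb E[u^2]}\big) \leq 1/\beta^2\) then gives \(|u| \leq \beta\sigma\sqrt{x^\top (A_t^a)^{-1} x}\) with probability at least \(1-1/\beta^2\), which has the form of the claim with \(\alpha = \beta\sigma\).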
Recap: Tabular MBRL
Algorithm:
- Query each \((s,a)\) pair \(\frac{N}{SA}\) times, record sample \(s'\sim P(s,a)\)
- Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
- Design \(\widehat \pi\) as if \(\widehat P\) is true
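The three steps above can be sketched in Python. This is an illustrative sketch only: it assumes a known reward table `r`, an infinite-horizon discounted objective with discount `gamma`, and value iteration as the planner, none of which are fixed by the slide. The function names and the uniform fallback for unvisited \((s,a)\) pairs are likewise hypothetical.

```python
import numpy as np

def sample_uniformly(P, n_per_pair, rng):
    """Query every (s, a) pair n_per_pair times and record s' ~ P(s, a)."""
    S, A, _ = P.shape
    return [(s, a, rng.choice(S, p=P[s, a]))
            for s in range(S) for a in range(A) for _ in range(n_per_pair)]

def estimate_transitions(samples, S, A):
    """Count-based estimate of P(s' | s, a) from (s, a, s') triples."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Assumption: unvisited (s, a) pairs fall back to a uniform distribution.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)

def value_iteration(P, r, gamma=0.9, iters=500):
    """Plan in the model P (shape S x A x S) with rewards r (shape S x A)."""
    V = np.zeros(P.shape[0])
    for _ in range(iters):
        Q = r + gamma * (P @ V)  # Bellman backup, result has shape (S, A)
        V = Q.max(axis=1)
    return Q.argmax(axis=1), V   # greedy policy and its value estimate
```

Passing the output of `estimate_transitions` to `value_iteration` yields a policy \(\widehat\pi\) that is optimal for \(\widehat P\), that is, designed as if \(\widehat P\) were the true model.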
Analysis: \(\widehat \pi\) vs. \(\pi^*\)
- Compare \(\widehat P\) and \(P\) (Hoeffding's)
- Compare \(\widehat V^\pi\) and \(V^\pi\) (Simulation Lemma)
- Compare \(\widehat V^{\widehat \pi}\) and \(V^{\pi^*}\)
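A standard way to chain these comparisons, which may differ in detail from the argument given in lecture, is to split the suboptimality of \(\widehat\pi\) into three differences:

$$V^{\pi^*} - V^{\widehat\pi} = \underbrace{\big(V^{\pi^*} - \widehat V^{\pi^*}\big)}_{\text{Simulation Lemma}} + \underbrace{\big(\widehat V^{\pi^*} - \widehat V^{\widehat\pi}\big)}_{\leq 0} + \underbrace{\big(\widehat V^{\widehat\pi} - V^{\widehat\pi}\big)}_{\text{Simulation Lemma}}$$

The middle term is nonpositive because \(\widehat\pi\) is optimal for the estimated model \(\widehat P\). The outer terms each compare the value of one fixed policy under \(\widehat P\) and under \(P\), so they are controlled by the Simulation Lemma together with the Hoeffding bound on \(\widehat P - P\).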