CS 4/5789: Introduction to Reinforcement Learning
Lecture 22: Exploration in MDPs
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 7 due Monday
- PA 4 due May 3
- Final exam is Saturday 5/13 at 2pm
- WICCxURMC Survey
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis

Recap: Bandits
A simplified setting for studying exploration
ex - machine make and model affect rewards, so the context \(x\) is a feature vector encoding them
- See context \(x_t\sim \mathcal D\), pull "arm" \(a_t\) and get reward \(r_t\sim r(x_t, a_t)\) with \(\mathbb E[r(x, a)] = \mu_a(x)\), which in the linear case equals \(\theta_a^\top x\)
- Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)] $$
Recap: UCB
UCB-type Algorithms
- for \(t=1,2,...,T\)
- Observe context \(x_t\)
- Pull arm \(a_t\) optimistically
- MAB: \(\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}\)
- Linear: \(\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}\)
- update estimates and confidence intervals
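The MAB rule above can be written in a few lines; here is a minimal sketch, where the function name and the constant \(C\) are placeholders rather than anything from the lecture:

```python
# Minimal UCB arm selection for the multi-armed bandit recap above.
import numpy as np

def ucb_action(mu_hat, counts, C=2.0):
    """Pick the arm maximizing estimated mean plus the confidence width sqrt(C / N_a)."""
    counts = np.maximum(counts, 1e-12)  # arms never pulled get an enormous bonus, so they are tried first
    return int(np.argmax(mu_hat + np.sqrt(C / counts)))
```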
Recap: Tabular MBRL
Algorithm:
- Query each \((s,a)\) pair \(\frac{N}{SA}\) times, record sample \(s'\sim P(s,a)\)
- Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
- Design \(\widehat \pi\) as if \(\widehat P\) is true
Analysis: \(\widehat \pi\) vs. \(\pi^*\)
- Compare \(\widehat P\) and \(P\) (Hoeffding's)
- Compare \(\widehat V^\pi\) and \(V^\pi\) (Simulation Lemma)
- Compare \(\widehat V^{\widehat \pi}\) and \(V^{\pi^*}\)
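As a reminder of how the last comparison typically combines the first two (a standard decomposition, sketched here rather than the lecture's exact derivation): $$ V^{\pi^\star} - V^{\widehat\pi} = \underbrace{V^{\pi^\star} - \widehat V^{\pi^\star}}_{\text{Simulation Lemma}} + \underbrace{\widehat V^{\pi^\star} - \widehat V^{\widehat\pi}}_{\leq 0 \text{ since } \widehat\pi \text{ is optimal for } \widehat P} + \underbrace{\widehat V^{\widehat\pi} - V^{\widehat\pi}}_{\text{Simulation Lemma}} $$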
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
MBRL with Exploration
- Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
- Finite states \(|\mathcal S| = S\) and actions \(|\mathcal A|=A\)
- Transition probabilities \(P\) unknown
- for simplicity, assume the reward function is noiseless, so it is perfectly known once \((s,a)\) is visited
- Unlike in the previous unit, the initial state is fixed!
- No longer possible to query \(s'\sim P(s,a)\) for any \(s,a\)
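A minimal container for this tuple might look as follows; the field names and array layout (tabular reward of shape \((S,A)\), transitions of shape \((S,A,S)\)) are illustrative assumptions, not notation from the slides:

```python
# Container for a finite-horizon tabular MDP M = {S, A, r, P, H, s0}.
from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    S: int           # number of states
    A: int           # number of actions
    r: np.ndarray    # reward table, shape (S, A); known once (s, a) is visited
    P: np.ndarray    # transition probabilities, shape (S, A, S); unknown to the learner
    H: int           # horizon
    s0: int          # fixed initial state
```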
Example
- Consider the deterministic chain MDP
- \(\mathcal S = \{0,\dots,H-1\}\) and \(\mathcal A = \{1, \dots, A\}\)
- \(s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}\) except at endpoints
- Suppose \(s_0=0\), finite horizon \(H\), and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
- Uniformly random policy for exploration, observe \((r_t)_{0\leq t\leq H-1}\)
- \(\mathbb P\{r_t = 0 ~\forall t\} = 1-1/A^{H-1}\), since reaching \(s=H-1\) requires \(a_0=\dots=a_{H-2}=1\)
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left (staying in place at the endpoints).]
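A quick Monte Carlo check of the probability claim above, using the chain dynamics from this slide (the function and its defaults are illustrative):

```python
# Estimate the probability that a uniformly random policy sees no reward in the chain MDP.
import numpy as np

def prob_no_reward(H=5, A=2, trials=200_000, seed=0):
    rng = np.random.default_rng(seed)
    no_reward = 0
    for _ in range(trials):
        s = 0
        seen_goal = False
        for t in range(H):                         # rewards r_0, ..., r_{H-1}
            if s == H - 1:                         # r(s, a) = 1{s = H-1}
                seen_goal = True
                break
            a = rng.integers(1, A + 1)             # uniformly random action in {1, ..., A}
            s = max(s + (1 if a == 1 else -1), 0)  # chain dynamics, clipped at the left endpoint
        no_reward += not seen_goal
    return no_reward / trials

print(prob_no_reward())  # roughly 1 - 1/2**4 = 0.9375 for H=5, A=2
```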
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
UCB-VI
- Initialize \(\hat P_0\), \(\hat r_0=0\), and \(b_0\)
- for \(i=1,2,...,T\)
- Design optimistic policy \(\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)\)
- Rollout \(\hat\pi^i\) and observe trajectory
- Update \(\hat P_{i+1}\), \(\hat r_{i+1}\), and \(b_{i+1}\)
Optimistic MBRL
- Design a reward bonus to incentivize exploration of unseen states and actions
Model Estimation
- Using the dataset at iteration \(i\): $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} $$
- Number of times we took action \(a\) in state \(s\) $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} $$
- Number of times we transitioned to \(s'\) after \(s,a\) $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\} $$
- Estimate transition probability \(\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}\)
- Reward of \(s,a\): \(\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k\)
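A minimal sketch of these counting estimators, assuming the dataset is stored as a list of trajectories, each a list of 0-indexed \((s, a, r, s')\) tuples (this data layout is an assumption, not something from the slides):

```python
# Count-based estimates of the transition model and reward from logged trajectories.
import numpy as np

def estimate_model(trajectories, S, A):
    N_sa = np.zeros((S, A))          # N_i(s, a)
    N_sas = np.zeros((S, A, S))      # N_i(s, a, s')
    r_sum = np.zeros((S, A))         # running sum of rewards observed at (s, a)
    for traj in trajectories:
        for (s, a, r, s_next) in traj:
            N_sa[s, a] += 1
            N_sas[s, a, s_next] += 1
            r_sum[s, a] += r
    visited = N_sa > 0
    P_hat = np.zeros((S, A, S))      # unvisited (s, a) are left at zero; handling them is a design choice
    P_hat[visited] = N_sas[visited] / N_sa[visited][:, None]
    r_hat = np.zeros((S, A))
    r_hat[visited] = r_sum[visited] / N_sa[visited]
    return P_hat, r_hat, N_sa
```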
Example
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Uniformly random policy:
- e.g. \(a_{0:8} = (1,3,2,2,3,1,1,3,2)\)
- then \(s_{0:9} = (0,1,0,0,0,0,1,2,1,0)\)
- \(N(s,a) = ?\)
- \(\hat P(s'|s,a)=\) PollEV
- \(\hat r(s,a)=0\)
Reward Bonus
- Using the dataset at iteration \(i\): \(\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1} \)
- Number of times \(s,a\): \(N_i(s,a)\)
- Number of times \(s,a\to s'\): \(N_i(s,a, s')\)
- Reward of \(s,a\): \(\hat r_i(s,a)\)
- Reward bonus $$ b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}} $$
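As a helper, the bonus above is one line; \(\alpha\) is left as a parameter since the slide only says it depends on \(S, A, H\) and the failure probability:

```python
# Count-based exploration bonus b_i(s, a) = H * sqrt(alpha / N_i(s, a)).
import numpy as np

def bonus(N_sa, H, alpha):
    # Treating unvisited pairs as N = 1 (so they receive the largest bonus) is an
    # implementation choice, not something specified on the slide.
    return H * np.sqrt(alpha / np.maximum(N_sa, 1))
```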
Optimistic DP
- In the finite-horizon setting, VI is just DP
- DP:
- Initialize \(\hat V^i_H(s)=0\)
- for \(t=H-1, ..., 0\)
- \(\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]\)
- \(\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)\)
- \(\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s) )\)
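A sketch of the backward recursion above on the estimated model, with the bonus added to the estimated reward; clipping \(\hat V\) at \(H\) is an implementation detail that the slide does not show:

```python
# Optimistic DP: finite-horizon value iteration on (P_hat, r_hat + b).
import numpy as np

def optimistic_dp(P_hat, r_hat, b, H):
    """P_hat: (S, A, S); r_hat, b: (S, A). Returns a time-indexed greedy policy of shape (H, S)."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)                       # \hat V^i_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        Q = r_hat + b + P_hat @ V         # (S, A); P_hat @ V is E_{s' ~ P_hat(s,a)}[V(s')]
        pi[t] = np.argmax(Q, axis=1)      # greedy action \hat\pi^i_t(s)
        V = np.minimum(Q.max(axis=1), H)  # \hat V^i_t(s), clipped at the maximum possible return H
    return pi
```

Putting the pieces together, the UCB-VI loop from the earlier slide might look like the following, where `env_step(s, a) -> (r, s_next)` is a hypothetical sampler for the unknown MDP and `estimate_model` and `bonus` are the sketches above:

```python
def ucbvi(env_step, s0, S, A, H, T, alpha):
    data = []                                      # all trajectories collected so far
    for i in range(1, T + 1):
        P_hat, r_hat, N_sa = estimate_model(data, S, A)
        b = bonus(N_sa, H, alpha)
        pi = optimistic_dp(P_hat, r_hat, b, H)     # optimistic policy \hat\pi^i
        s, traj = s0, []
        for t in range(H):                         # roll out \hat\pi^i for one episode
            a = pi[t, s]
            r, s_next = env_step(s, a)
            traj.append((s, a, r, s_next))
            s = s_next
        data.append(traj)
    return pi
```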
Example
[Figure: chain MDP with states \(0, 1, 2, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Iteration 1: uniformly random policy for exploration
- trajectory contains \(s,a\) with probability \(\propto 1/A^s\)
- Iteration 2: reward bonus incentivizes upward exploration
- Eventually reach \(s=H-1\) and converge to \(\pi^\star_t(s) = 1\)
Agenda
1. Recap: Bandits & MBRL
2. MBRL with Exploration
3. UCB Value Iteration
4. UCB-VI Analysis
Analysis: Two Key Facts
- The exploration bonus bounds, with high probability, the difference $$ |\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~~s,a $$
- similar to confidence intervals bounding \(|\mu_a-\hat\mu_a|\)
- The exploration bonus leads to optimism $$ \hat V_t^i(s) \geq V_t^\star(s) $$
- similar to showing that \(\hat\mu_{\hat a_\star} \geq \mu_\star\)
Regret in RL Setting
- These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right] $$
- Above, we define regret as cumulative sub-optimality (over episodes)
- now sub-optimality is itself a cumulative reward over time (i.e. value)
- Argument follows the UCB proof structure:
- By optimism, \(V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)\)
- Simulation Lemma to compare \(\hat V_0^i(s_0)\) & \(V^{\pi^i}_0(s_0)\)
- Regret proof is out of scope for this class (see 6789)
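The first step of this argument is just the optimism inequality above summed over episodes (a sketch; the simulation-lemma step is the part deferred to 6789): $$ R(T) = \sum_{i=1}^T \mathbb E\left[V_0^\star(s_0) - V_0^{\pi^i}(s_0)\right] \leq \sum_{i=1}^T \mathbb E\left[\hat V_0^i(s_0) - V_0^{\pi^i}(s_0)\right] $$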
Example
[Figure: chain MDP with states \(0, 1, \dots, H-1\); action \(1\) moves right, actions \(\neq 1\) move left.]
- Reward bonus enables the "right amount" of exploration

Analysis: Exploration Bonus
- Lemma: For any fixed function \(V:\mathcal S\to [0,H]\), with high probability, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where \(\alpha\) depends on \(S,A,H\) and the failure probability
- Proof: \(|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|\)
- \(=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a) - P(s'|s,a) )V(s')\right|\)
- \(\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) | |V(s')|\)
- \(\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) |\right)\max_{s'} |V(s')| \)
- \(\leq \sqrt{\frac{\alpha}{N_i(s,a)}} \underbrace{\max_{s'} |V(s')|}_{\leq H}\) using result from Lecture 11
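For reference, the Lecture 11 result invoked in the last step is an \(\ell_1\) concentration bound of roughly the following form (stated here as an assumption, with \(\alpha\) absorbing logarithmic factors): with high probability, for all \(s,a\), $$ \sum_{s'\in\mathcal S} \left|\hat P_i(s'\mid s,a) - P(s'\mid s,a)\right| \leq \sqrt{\frac{\alpha}{N_i(s,a)}} $$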
Analysis: Optimism
- Lemma: as long as \(r(s,a)\in[0,1]\), \( \hat V_t^i(s) \geq V_t^\star(s)\) for all \(t,i,s\)
- Proof: By induction.
- Base case \(\hat V_H^i(s)=0=V_H^\star(s)\)
- Assume that \( \hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)\)
- Then \(\hat Q_t^i(s,a) - Q_t^\star (s,a)\)
- \( = \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \)
- \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (noiseless \(r_t\))
- \( \geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')] \) (inductive hypothesis)
- \( \geq b_i(s,a)-b_i(s,a) = 0 \) (previous Lemma)
- Thus \(\hat Q_t^i(s,a) \geq Q_t^\star (s,a)\) which implies \(\hat V_t^i(s) \geq V_t^\star (s)\) (exercise)
Recap
- PSet due Monday
- Exploration problem
- UCB-VI Algorithm
- Next lecture: Revisiting imitation learning
Sp23 CS 4/5789: Lecture 22
By Sarah Dean