Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• 5789 Paper Reviews due weekly on Mondays
• PSet 7 due Monday
• PA 4 due May 3
• Final exam is Saturday 5/13 at 2pm
• WICCxURMC Survey

## Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

## Recap: Bandits

A simplified setting for studying exploration

ex - machine make and model affect rewards, so the context $$x$$ encodes these machine features

• See context $$x_t\sim \mathcal D$$, pull "arm" $$a_t$$ and get reward $$r_t\sim r(x_t, a_t)$$ with $$\mathbb E[r(x, a)] = \mu_a(x)$$, in the linear case $$=\theta_a^\top x$$
• Regret: $$R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi_\star(x_t))-r(x_t,a_t) \right] = \sum_{t=1}^T\mathbb E[\mu_\star(x_t) - \mu_{a_t}(x_t)]$$

## Recap: UCB

UCB-type Algorithms

• for $$t=1,2,...,T$$
• Observe context $$x_t$$
• Pull arm $$a_t$$ optimistically
• MAB: $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
• Linear: $$\arg\max_{a\in[K]} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
• update estimates and confidence intervals
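As a concrete sketch of the MAB rule, assuming rewards in $$[0,1]$$ (the function name and the confidence constant `C` are my own choices, not from the lecture):

```python
import numpy as np

# Sketch of UCB arm selection for a multi-armed bandit (no context).
def ucb_arm(means_hat, counts, C=2.0):
    """Pick arg max_a  mu_hat_a + sqrt(C / N_a); pull unpulled arms first."""
    counts = np.asarray(counts, dtype=float)
    if np.any(counts == 0):            # each arm must be pulled once
        return int(np.argmax(counts == 0))
    bonus = np.sqrt(C / counts)
    return int(np.argmax(np.asarray(means_hat) + bonus))
```

For example, `ucb_arm([0.5, 0.45], [100, 2])` returns `1`: the rarely pulled arm's bonus outweighs its slightly lower point estimate.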

## Recap: Tabular MBRL

Algorithm:

1. Query each $$(s,a)$$ pair $$\frac{N}{SA}$$ times, record sample $$s'\sim P(s,a)$$
2. Fit transition model by counting: $$\widehat P(s'\mid s,a) = \frac{\sum_{i=1}^N \mathbb 1\{(s_i, a_i, s_i') = (s, a, s')\}}{\sum_{i=1}^N \mathbb 1\{(s_i, a_i) = (s, a)\}}$$
3. Design $$\widehat \pi$$ as if $$\widehat P$$ is true
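Step 2 in code, as a minimal sketch (assuming the recorded samples are given as $$(s, a, s')$$ triples; the function name is mine):

```python
import numpy as np

# Fit the transition model by counting, given (s, a, s') samples.
def fit_transitions(samples, S, A):
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    visits = counts.sum(axis=2, keepdims=True)
    # avoid 0/0 for unvisited (s, a) pairs; leave those rows at zero
    return np.divide(counts, visits,
                     out=np.zeros_like(counts), where=visits > 0)

P_hat = fit_transitions([(0, 0, 1), (0, 0, 1), (0, 0, 0)], S=2, A=1)
# P_hat[0, 0] is [1/3, 2/3]
```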

Analysis: $$\widehat \pi$$ vs. $$\pi^*$$

• Compare $$\widehat P$$ and $$P$$ (Hoeffding's)
• Compare $$\widehat V^\pi$$ and $$V^\pi$$ (Simulation Lemma)
• Compare $$\widehat V^{\widehat \pi}$$ and $$V^{\pi^*}$$

## Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

## MBRL with Exploration

• Finite horizon tabular MDP with given initial state $$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, s_0\}$$
• Finite states $$|\mathcal S| = S$$ and actions $$|\mathcal A|=A$$
• Transition probabilities $$P$$ unknown
• for simplicity assume reward function is noiseless, thus perfectly known once $$s,a$$ is visited
• Unlike in previous Unit, initial state is fixed!
• No longer possible to query $$s'\sim P(s,a)$$ for any $$s,a$$

## Example

• Consider the deterministic chain MDP
• $$\mathcal S = \{0,\dots,H-1\}$$ and $$\mathcal A = \{1, \dots, A\}$$
• $$s_{t+1} = s_t + \mathbf 1\{a_t=1\} - \mathbf 1\{a_t\neq 1\}$$ except at endpoints
• Suppose $$s_0=0$$, finite horizon $$H$$, and reward function $$r(s,a) = \mathbf 1\{s=H-1\}$$
• Uniformly random policy for exploration, observe $$(r_t)_{0\leq t\leq H-1}$$
• $$\mathbb P\{r_t= 0~\forall t\} = 1-1/A^{H-1}$$
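This can be checked by simulation — a small sketch assuming the dynamics as described (clipped at the endpoints); a nonzero reward requires $$a_0=\dots=a_{H-2}=1$$, so the hit rate is about $$1/A^{H-1}$$:

```python
import random

# Chain example: a = 1 moves right, any other action moves left
# (clipped at s = 0); reward r(s, a) = 1{s = H-1}. The function name
# and the Monte Carlo setup are my own.
def rollout_reward(H, A, rng):
    s, total = 0, 0
    for _ in range(H):
        total += 1 if s == H - 1 else 0   # observe r(s_t, a_t)
        a = rng.randint(1, A)             # uniformly random policy
        s = min(H - 1, s + 1) if a == 1 else max(0, s - 1)
    return total

rng = random.Random(0)
H, A, n = 6, 2, 100_000
hit = sum(rollout_reward(H, A, rng) > 0 for _ in range(n)) / n
# nonzero reward needs a_0 = ... = a_{H-2} = 1, so hit is near 1/2^5 = 1/32
```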

(Figure: the chain MDP — states $$0, 1, 2, \dots, H-1$$ in a row; action $$1$$ moves right, actions $$\neq 1$$ move left.)

## Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

## UCB-VI

• Initialize $$\hat P_0$$, $$\hat r_0=0$$, and $$b_0$$
• for $$i=1,2,...,T$$
• Design optimistic policy $$\hat \pi^i = VI(\hat P_i, \hat r_i+b_i)$$
• Rollout $$\hat\pi^i$$ and observe trajectory
• Update $$\hat P_{i+1}$$, $$\hat r_{i+1}$$, and $$b_{i+1}$$

## Optimistic MBRL

• Design a reward bonus to incentivize exploration of unseen states and actions

## Model Estimation

• Using the dataset at iteration $$i$$: $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1}$$
• Number of times we took action $$a$$ in state $$s$$ $$N_i(s,a) = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\}$$
• Number of times we transitioned to $$s'$$ after $$s,a$$ $$N_i(s,a, s') = \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a, s^k_{t+1}=s'\}$$
• Estimate transition probability $$\hat P_i(s'|s,a) = \frac{N_i(s,a,s')}{N_i(s,a)}$$
• Reward of $$s,a$$: $$\hat r_i(s,a) = \frac{1}{N_i(s,a) } \sum_{k=1}^{i-1} \sum_{t=0}^{H-1} \mathbf 1\{s_t^k=s, a_t^k=a\} r_t^k$$
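The counts and estimators above can be computed from logged episodes roughly as follows (a sketch; the episode format and names are my own assumptions):

```python
import numpy as np

# Compute N_i(s,a), P_hat_i, r_hat_i from episodes, each given as
# (states, actions, rewards) with len(states) = len(actions) + 1.
def estimate_model(episodes, S, A):
    N_sas = np.zeros((S, A, S))
    r_sum = np.zeros((S, A))
    for states, actions, rewards in episodes:
        for t in range(len(actions)):
            s, a, s_next = states[t], actions[t], states[t + 1]
            N_sas[s, a, s_next] += 1
            r_sum[s, a] += rewards[t]
    N_sa = N_sas.sum(axis=2)
    # leave unvisited (s, a) rows at zero instead of dividing 0/0
    P_hat = np.divide(N_sas, N_sa[:, :, None],
                      out=np.zeros_like(N_sas), where=N_sa[:, :, None] > 0)
    r_hat = np.divide(r_sum, N_sa, out=np.zeros_like(r_sum), where=N_sa > 0)
    return N_sa, P_hat, r_hat
```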


• Uniformly random policy:
• e.g. $$a_{0:8} = (1,3,2,2,3,1,1,3,2)$$
• then $$s_{0:9} = (0,1,0,0,0,0,1,2,1,0)$$
• $$N(s,a) = ?$$
• $$\hat P(s'|s,a)=$$ PollEV
• $$\hat r(s,a)=0$$

## Reward Bonus

• Using the dataset at iteration $$i$$: $$\{\{s_t^k, a_t^k, r_t^k\}_{t=0}^{H-1}\}_{k=1}^{i-1}$$
• Number of times $$s,a$$: $$N_i(s,a)$$
• Number of times $$s,a\to s'$$: $$N_i(s,a, s')$$
• Reward of $$s,a$$: $$\hat r_i(s,a)$$
• Reward bonus $$b_i(s,a) = H\sqrt{\frac{\alpha}{N_i(s,a)}}$$

DP

• Initialize $$\hat V^i_H(s)=0$$
• for $$t=H-1, ..., 0$$
• $$\hat Q^i_t(s,a)=\hat r_i(s,a)+b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
• $$\hat\pi^i_t(s) = \arg\max_a \hat Q^i_t(s,a)$$
• $$\hat V^i_{t}(s)=\hat Q^i_t(s,\hat\pi^i_t(s))$$
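The DP loop above, written out (a sketch assuming $$\hat P$$ is stored as an $$S \times A \times S$$ array; function and variable names are mine):

```python
import numpy as np

# Optimistic finite-horizon DP: Q_t = r_hat + b + E_{s'~P_hat}[V_{t+1}],
# with the greedy policy at each step.
def optimistic_dp(P_hat, r_hat, b, H):
    S, A = r_hat.shape
    V = np.zeros((H + 1, S))              # V_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r_hat + b + P_hat @ V[t + 1]  # (S, A) array of Q_t values
        pi[t] = np.argmax(Q, axis=1)
        V[t] = Q[np.arange(S), pi[t]]
    return V, pi
```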

## Optimistic DP

• In finite horizon setting, VI is just DP


• Iteration 1: uniformly random policy for exploration
• trajectory contains $$s,a$$ with probability $$\propto 1/A^s$$
• Iteration 2: reward bonus incentivizes upward exploration
• Eventually reach $$s=H-1$$ and converge to $$\pi^\star_t(s) = 1$$
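Putting the pieces together, a small end-to-end sketch of UCB-VI on the chain example. Everything here (constants, tie-breaking, array layout) is my own choice, not the lecture's reference implementation; note that `argmax` ties in the very first plan break toward "action 1" (index 0), so this run finds the goal immediately, and with an adversarial tie-break it would be the bonus that forces discovery.

```python
import numpy as np

# End-to-end sketch of UCB-VI on the deterministic chain MDP.
# Action index 0 plays the role of "action 1" (move right); index 1 moves left.
H = 4                      # horizon; states are {0, ..., H-1}
S, A = H, 2
ALPHA = 0.5                # bonus constant (arbitrary choice)

N_sa = np.zeros((S, A))    # visit counts N_i(s, a)
N_sas = np.zeros((S, A, S))
r_hat = np.zeros((S, A))   # noiseless reward, known once (s, a) is visited

def plan():
    """Optimistic DP: value iteration on (P_hat, r_hat + b)."""
    b = H * np.sqrt(ALPHA / np.maximum(N_sa, 1))
    b[N_sa == 0] = H                      # maximal bonus for unvisited pairs
    P_hat = np.divide(N_sas, N_sa[:, :, None],
                      out=np.zeros_like(N_sas), where=N_sa[:, :, None] > 0)
    V = np.zeros(S)                       # V_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r_hat + b + P_hat @ V
        pi[t] = Q.argmax(axis=1)
        V = Q[np.arange(S), pi[t]]
    return pi

rewards = []                              # per-episode reward (0 or 1 here)
for _ in range(200):
    pi = plan()
    s, ep = 0, 0.0
    for t in range(H):
        a = int(pi[t, s])
        r = float(s == S - 1)             # r(s, a) = 1{s = H-1}
        s_next = min(S - 1, s + 1) if a == 0 else max(0, s - 1)
        N_sa[s, a] += 1
        N_sas[s, a, s_next] += 1
        r_hat[s, a] = r
        ep += r
        s = s_next
    rewards.append(ep)
```

Inspecting `rewards` episode by episode shows the bonus trading off visits to rarely seen pairs against returning to the rewarding state.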

## Agenda

1. Recap: Bandits & MBRL

2. MBRL with Exploration

3. UCB Value Iteration

4. UCB-VI Analysis

## Analysis: Two Key Facts

1. The exploration bonus bounds, with high probability, the difference $$|\mathbb E_{s'\sim\hat P(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \quad\forall ~s,a$$
• similar to confidence intervals bounding $$|\mu_a-\hat\mu_a|$$
2. The exploration bonus leads to optimism: $$\hat V_t^i(s) \geq V_t^\star(s)$$
• similar to showing that $$\hat\mu_{\hat a_\star} \geq \mu_\star$$

• These two properties are key to proving a regret bound $$R(T) = \mathbb E\left[\sum_{i=1}^T V_0^\star (s_0)-V_0^{\pi^i}(s_0) \right]$$
• Here, regret is defined as cumulative sub-optimality over episodes
• and the sub-optimality at each episode is itself a cumulative reward over time (i.e. value)
• Argument follows UCB proof structure:
1. By optimism, $$V^\star_0(s_0) - V_0^{\pi^i}(s_0) \leq \hat V_0^i(s_0) - V^{\pi^i}_0(s_0)$$
2. Simulation Lemma to compare $$\hat V_0^i(s_0)$$ & $$V^{\pi^i}_0(s_0)$$
• Regret proof is out of scope for this class (see 6789)

## Regret in RL Setting


• Reward bonus enables the "right amount" of exploration

## Analysis: Exploration Bonus

• Lemma: For any fixed function $$V:\mathcal S\to [0,H]$$, whp, $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]| \leq H\sqrt{\frac{\alpha}{N_i(s,a)}}=b_i(s,a)$$ where $$\alpha$$ depends on $$S,A,H$$ and the failure probability
• Proof: $$|\mathbb E_{s'\sim\hat P_i(s,a)}[V(s')]-\mathbb E_{s'\sim P(s,a)}[V(s')]|$$
• $$=\left|\sum_{s'\in\mathcal S} (\hat P_i(s'|s,a) - P(s'|s,a) )V(s')\right|$$
• $$\leq \sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) | |V(s')|$$
• $$\leq \left(\sum_{s'\in\mathcal S} |\hat P_i(s'|s,a) - P(s'|s,a) |\right)\max_{s'} |V(s')|$$
• $$\leq \sqrt{\frac{\alpha}{N_i(s,a)}}\, \underbrace{\max_{s'} |V(s')|}_{\leq H} = b_i(s,a)$$ using the $$\ell_1$$ bound on $$\hat P_i - P$$ from Lecture 11

## Analysis: Optimism

• Lemma: as long as $$r(s,a)\in[0,1]$$, $$\hat V_t^i(s) \geq V_t^\star(s)$$ for all $$t,i,s$$
• Proof: By induction.
1. Base case $$\hat V_H^i(s)=0=V_H^\star(s)$$
2. Assume that $$\hat V_{t+1}^i(s) \geq V_{t+1}^\star(s)$$
• Then $$\hat Q_t^i(s,a) - Q_t^\star (s,a)$$
• $$= \hat r_i(s,a) + b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]- r(s,a) - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
• $$\geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$ (noiseless $$r_t$$, so $$\hat r_i = r$$)
• $$\geq b_i(s,a)+\mathbb E_{s'\sim \hat P_i(s,a)}[V^\star_{t+1}(s')] - \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$ (inductive assumption)
• $$\geq b_i(s,a)-b_i(s,a) = 0$$ (previous Lemma)
• Thus $$\hat Q_t^i(s,a) \geq Q_t^\star (s,a)$$ which implies $$\hat V_t^i(s) \geq V_t^\star (s)$$ (exercise)

## Recap

• PSet due Monday

• Exploration problem
• UCB-VI Algorithm

• Next lecture: Revisiting imitation learning

By Sarah Dean
