Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
  • PSet 4 released, due Monday
  • 5789 Paper Reviews due weekly starting Monday
• Midterm 3/15 during lecture
  • Let us know conflicts/accommodations ASAP! (EdStem)
  • Review Lecture on Monday 3/13 (last year's slides/recording)
  • Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
    • also: equation sheet (on Canvas), 2023 notes, PAs
  • I will monitor Exams/Prelim tag on EdStem for questions

## Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

## Feedback in RL

1. Control feedback: between states and actions
2. Data feedback: between data and policy

(figure: the agent-environment loop with action $$a_t$$, state $$s_t$$, and reward $$r_t$$, plus the data loop in which experience $$(s_t,a_t,r_t)$$ is used to update the policy $$\pi$$; the transitions $$P,f$$ are unknown in Unit 2)

## Sampling procedure for $$Q^\pi$$

Algorithm: Data collection

• For $$i=1,\dots,N$$:
  • Sample $$s_0\sim\mu_0$$ and horizons $$h_1, h_2\sim$$ Geometric$$(1-\gamma)$$ (the discount distribution)
  • Roll in $$h_1$$ steps: set $$(s_i,a_i)=(s_{h_1},a_{h_1})$$
  • Roll out $$h_2$$ steps: set $$y_i=\sum_{t=h_1}^{h_1+h_2} r_t$$

Proposition: The resulting dataset $$\{(s_i,a_i), y_i\}_{i=1}^N$$ satisfies:

1. Drawn from discounted state distribution  $$s_i \sim d_{\mu_0}^\pi$$
2. Unbiased labels $$\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)$$

Proof in future PSet
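
A minimal Python sketch of this procedure, assuming a generic `env` with `reset()` and `step(a)` methods and a `policy(s)` that samples actions (these names and the geometric-shift convention are assumptions, not from the slides):

```python
import numpy as np

def collect_dataset(env, policy, gamma, N, seed=0):
    """Roll-in/roll-out data collection for estimating Q^pi.

    Assumes a minimal interface: env.reset() samples s_0 ~ mu_0 and returns a
    state, env.step(a) returns (next_state, reward), and policy(s) samples an
    action a ~ pi(.|s). These names are illustrative, not the course's code.
    """
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(N):
        # Horizons from the discount (geometric) distribution; shifted so the
        # support starts at 0 (a convention choice).
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1

        s = env.reset()                      # s_0 ~ mu_0
        for _ in range(h1):                  # roll in h1 steps under pi
            s, _ = env.step(policy(s))
        a = policy(s)                        # (s_i, a_i) = (s_{h1}, a_{h1})

        y, state, action = 0.0, s, a
        for _ in range(h2 + 1):              # roll out: y = sum_{t=h1}^{h1+h2} r_t
            state, r = env.step(action)
            y += r
            action = policy(state)

        data.append((s, a, y))               # label y is unbiased for Q^pi(s, a)
    return data
```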

Algorithm: Approximate Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$i=0,\dots,T-1$$:
  • Roll in/out $$\pi^i$$ to collect dataset $$\{s_j,a_j,y_j\}_{j=1}^N$$, then compute $$\hat Q^{\pi_i}$$ with supervised learning: $$\hat Q^{\pi_i} = \arg\min_{Q\in\mathcal Q} \sum_{j=1}^N (Q(s_j,a_j)-y_j)^2$$
  • Policy Improvement: $$\forall s$$, $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
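
The inner step above, fitting $$\hat Q^{\pi_i}$$ by least squares and then acting greedily, might be sketched as follows; `phi` is an assumed feature map over state-action pairs and `actions` a finite action set (illustrative names, not the course's code):

```python
import numpy as np

def fit_q(dataset, phi):
    """Least-squares fit: Q_hat = argmin_Q sum_j (Q(s_j, a_j) - y_j)^2 over
    linear functions Q(s, a) = w . phi(s, a). `phi` is an assumed feature map."""
    X = np.array([phi(s, a) for s, a, _ in dataset])   # N x d design matrix
    y = np.array([y_j for _, _, y_j in dataset])       # regression targets
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: phi(s, a) @ w

def greedy_policy(q_hat, actions):
    """Policy improvement: pi_{i+1}(s) = argmax_a Q_hat(s, a) over a finite action set."""
    return lambda s: max(actions, key=lambda a: q_hat(s, a))
```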

## Approximate Policy Iteration

• Even if we assume that supervised learning succeeds: $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
  • or the stronger assumption: $$\max_a \mathbb E_{s\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
• Approx PI does not necessarily improve, because $$d_{\mu_0}^{\pi_{i+1}}$$ might differ from $$d_{\mu_0}^{\pi_i}$$


## Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

Performance Difference Lemma: For two policies, $$V^{\pi_{i+1}}(s_0) - V^{\pi_{i}}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{s_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right] \right]$$

where we define the advantage function $$A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)$$

## Improvement in PI

• The PDL measures the advantage of $$\pi_{i+1}$$ over $$\pi_i$$, averaged over the state distribution of $$\pi_{i+1}$$
• Notice that Policy Iteration's greedy update satisfies $$\pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)=\arg\max_a A^{\pi_i}(s,a)$$
• Idea: keep $$d_{\mu_0}^{\pi_{i+1}}$$ close to $$d_{\mu_0}^{\pi_{i}}$$

Algorithm: Conservative Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\Delta(\mathcal A)$$
  • equivalently, $$\pi(a|s) : \mathcal A\times \mathcal S \to [0,1]$$
• For $$i=0,\dots,T-1$$:
  • Roll in/out $$\pi^i$$ to collect dataset $$\{s_j,a_j,y_j\}_{j=1}^N$$, then compute $$\hat Q^{\pi_i}$$ with supervised learning
  • Policy Improvement: $$\forall s$$, $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
  • Incremental update: a policy which follows $$\bar \pi$$ with probability $$\alpha$$ (and otherwise $$\pi_{i}$$): $$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi(a|s)$$
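
A minimal sketch of the incremental update, assuming `pi_i(s)` returns a probability vector over a finite action set `actions` (illustrative interface, not the course's code):

```python
import numpy as np

def cpi_update(pi_i, q_hat, actions, alpha):
    """Incremental update: pi_{i+1}(.|s) = (1 - alpha) * pi_i(.|s) + alpha * bar_pi(.|s),
    where bar_pi is greedy with respect to q_hat. pi_i(s) returns a probability
    vector indexed consistently with `actions` (interface is an assumption)."""
    def pi_next(s):
        probs = (1 - alpha) * np.asarray(pi_i(s), dtype=float)   # keep most of pi_i's mass
        greedy = int(np.argmax([q_hat(s, a) for a in actions]))  # bar_pi(s)
        probs[greedy] += alpha                                   # move alpha mass to the greedy action
        return probs
    return pi_next
```

Sampling an action from the new policy would then be, e.g., `np.random.choice(len(actions), p=pi_next(s))`.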

## Conservative Policy Iteration

(figure: two-state MDP with states $$0$$ and $$1$$; transitions labeled stay: $$1$$ and switch: $$1$$ at state $$0$$, and stay: $$p_1$$ / $$1-p_1$$ and switch: $$p_2$$ / $$1-p_2$$ at state $$1$$)

## Example

Case: $$p_1$$ and $$p_2$$ both close to $$1$$

• $$\pi^\star(1)=$$ switch
• reward: $$+1$$ if $$s=0$$ and $$-\frac{1}{2}$$ if $$a=$$ switch

(figure: $$\pi^i(\cdot|1)$$ and the estimates $$\hat Q^{\pi^i}(1,\text{stay})$$, $$\hat Q^{\pi^i}(1,\text{switch})$$ plotted over iterations $$i=0,1,2$$, along with the greedy policy $$\bar\pi$$)

• The new policy follows $$\bar \pi$$ with probability $$\alpha$$ (and otherwise $$\pi_{i}$$): $$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi(a|s)$$
• For any $$s$$, $$\|\pi^{i+1}(\cdot |s)- \pi^{i}(\cdot |s)\|_1$$
  • $$= \| \pi^{i}(\cdot |s) -(1-\alpha) \pi^{i}(\cdot|s) - \alpha \bar \pi(\cdot |s)\|_1$$
  • $$= \| \alpha \pi^{i}(\cdot |s) - \alpha \bar \pi(\cdot |s)\|_1$$
  • $$\leq \alpha \| \pi^{i}(\cdot |s) \|_1 + \alpha \| \bar \pi(\cdot |s)\|_1 = 2\alpha$$
• Lemma: For any $$\pi$$ and $$\pi'$$, $$\|d^\pi_{\mu_0} - d^{\pi'}_{\mu_0}\|_1 \leq \frac{\gamma \max_s \|\pi(\cdot|s)-\pi'(\cdot|s)\|_1}{1-\gamma}$$ (proof in future PSet)
• Thus in Conservative PI, $$d_{\mu_0}^{\pi_{i+1}}$$ is $$\frac{2\gamma\alpha}{1-\gamma}$$-close to $$d_{\mu_0}^{\pi_{i}}$$ in $$\ell_1$$
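
As a quick numeric sanity check of the $$2\alpha$$ bound (a made-up two-action example, not from the slides):

```python
import numpy as np

# Per-state policy change under the conservative mixture update.
alpha = 0.1
pi_i   = np.array([0.5, 0.5])   # pi_i(.|s)
bar_pi = np.array([0.0, 1.0])   # greedy bar_pi(.|s)

pi_next = (1 - alpha) * pi_i + alpha * bar_pi
print(np.abs(pi_next - pi_i).sum())   # 0.1  <=  2 * alpha = 0.2
```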


## Example

(figure: $$\pi^i(\cdot|1)$$ and the discounted state distributions $$d^{\pi^i}_1(0)$$, $$d^{\pi^i}_1(1)$$ plotted over iterations $$i=0,1,2$$)

## Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

## CPI Improvement

• Let $$\mathbb A_i = \mathbb E_{s\sim d_{\mu_0}^{\pi_i}}[ \max_a A^{\pi_i}(s,a)]$$ be the PI improvement.
• Conservative policy iteration improves in expectation: $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}\mathbb A_i^2\geq 0$$
  • when the step size is set to $$\alpha_i = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma }$$.

Assumptions:

1. Supervised learning works perfectly $$\max_a \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}[(\hat Q^{\pi^i}(s,a)-Q^{\pi^i}(s,a))^2] = 0$$
2. Reward is bounded between $$0$$ and $$1$$

Both assumptions can be relaxed; improvement would then depend on $$\epsilon$$ and reward bounds
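
To get a feel for the magnitudes, a quick numeric check with illustrative values of $$\gamma$$ and $$\mathbb A_i$$ (made up, not from the slides):

```python
gamma, A_i = 0.9, 0.5   # illustrative discount factor and PI improvement

alpha_i = (1 - gamma) ** 2 * A_i / (4 * gamma)       # conservative step size
improvement = (1 - gamma) / (8 * gamma) * A_i ** 2   # guaranteed per-iteration gain

print(f"alpha_i     ~  {alpha_i:.4f}")       # ~ 0.0014: a very cautious mixing weight
print(f"improvement >= {improvement:.4f}")   # ~ 0.0035 in expected value
```

The guaranteed gain per iteration is small precisely because the step size is conservative.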


## Example

(figure: $$\pi^i(\cdot|1)$$, $$Q^{\pi^i}(1,\text{stay})$$, $$Q^{\pi^i}(1,\text{switch})$$, $$V^{\pi^i}(1)$$, the greedy policy $$\bar\pi$$, and the resulting $$\mathbb A_i$$ in the two-state example)

## CPI Improvement Proof

$$\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]$$

• $$=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right]\right]$$ (PDL)
• $$=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \alpha A^{\pi_i}(s,\bar \pi(s)) + (1-\alpha)\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] \right]$$ (CPI step 2)
• PollEv: $$\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] = \mathbb E_{a\sim \pi_{i}(s)}[Q^{\pi_i}(s,a)] - V^{\pi_i}(s) =0$$
• $$=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]$$

## CPI Improvement Proof

CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$

1. $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
2. $$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)$$

$$\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]$$

• $$=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]$$
• $$=\frac{\alpha}{1-\gamma} \Big(\mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] - \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\Big)$$
• $$=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\right)$$

## CPI Improvement Proof

1. $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
2. $$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)$$

CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$

$$\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]$$

• $$=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) +\mathbb A_i\right)$$
• $$\geq \frac{\alpha}{1-\gamma} \left( -\left|\sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))\right| + \mathbb A_i\right)$$
• $$\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)||A^{\pi_i}(s,\bar \pi(s))| + \mathbb A_i\right)$$
• $$\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)|\frac{1}{1-\gamma} + \mathbb A_i\right)$$ (using $$|A^{\pi_i}(s,\bar\pi(s))|\leq \frac{1}{1-\gamma}$$ since rewards are in $$[0,1]$$)
• $$\geq \frac{1}{1-\gamma} \left( -\frac{2\gamma\alpha^2}{(1-\gamma)^2} +\mathbb A_i \alpha\right) =\frac{\mathbb A_i^2(1-\gamma)}{8 \gamma}$$ (using $$\|d^{\pi_{i+1}}_{\mu_0} - d^{\pi_{i}}_{\mu_0}\|_1\leq \frac{2\gamma\alpha}{1-\gamma}$$ and the step size $$\alpha = \alpha_i = \frac{(1-\gamma)^2\mathbb A_i}{4\gamma}$$)
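
To see where the step size comes from, maximize the final bound over $$\alpha$$ (a quick check consistent with the theorem statement):

$$g(\alpha) = \frac{1}{1-\gamma}\left(\mathbb A_i\,\alpha - \frac{2\gamma\alpha^2}{(1-\gamma)^2}\right), \qquad g'(\alpha)=0 \;\Rightarrow\; \alpha_i = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma},$$

$$g(\alpha_i) = \frac{1}{1-\gamma}\left(\frac{(1-\gamma)^2\mathbb A_i^2}{4\gamma} - \frac{(1-\gamma)^2\mathbb A_i^2}{8\gamma}\right) = \frac{(1-\gamma)\mathbb A_i^2}{8\gamma}.$$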


## Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

## Summary: Key Ideas

1. Constructing dataset for supervised learning
   • features $$(s,a)\sim d^\pi_{\mu_0}$$ ("roll in")
   • labels $$y$$ with $$\mathbb E[y|s,a]= Q^\pi(s,a)$$ ("roll out")
2. Incremental updates to control distribution shift
   • mixture of current and greedy policy
   • parameter $$\alpha$$ controls the distribution shift



## Summary: Constructing Labels

Labels via rollouts of $$\pi$$:

• Method: $$y = \sum_{t=h_1}^{h_1+h_2} r_t$$ for $$h_2\sim$$Geometric$$(1-\gamma)$$
• Motivation: definition of
$$Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right]$$
• On future PSet, you will show
• $$\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})$$
• i.e., this label is unbiased
• How much variance will labels have?
• Many sources of randomness: all $$h_2$$ transitions


## Preview: Constructing Labels

Other key equations can inspire labels:

• So far: $$Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right]$$
• We also know:
• Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
• Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
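
As a preview of how such equations could be used: a minimal sketch of a one-step bootstrapped label based on the Bellman expectation equation (the interface `env.sample_transition` and the other names are assumptions, not the course's code):

```python
def td_label(s, a, env, policy, q_hat, gamma):
    """One-step bootstrapped label inspired by the Bellman expectation equation:
    y = r(s, a) + gamma * Q_hat(s', a'), with s' ~ P(s, a) and a' ~ pi(s').
    env.sample_transition(s, a) is an assumed helper returning (s', r(s, a))."""
    s_next, r = env.sample_transition(s, a)
    a_next = policy(s_next)
    return r + gamma * q_hat(s_next, a_next)
```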


## Recap

• PSet 4
• Prelim in class 3/15