CS 4/5789: Introduction to Reinforcement Learning
Lecture 13: Conservative Policy Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PSet 4 released, due Monday
- 5789 Paper Reviews due weekly starting Monday
- Midterm 3/15 during lecture
- Let us know conflicts/accommodations ASAP! (EdStem)
- Review Lecture on Monday 3/13 (last year's slides/recording)
- Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
- also: equation sheet (on Canvas), 2023 notes, PAs
- I will monitor Exams/Prelim tag on EdStem for questions
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Feedback in RL


(Diagram: agent-environment loop with action \(a_t\), state \(s_t\), reward \(r_t\); data \((s_t,a_t,r_t)\) and experience feed into the policy \(\pi\); transitions \(P,f\) unknown in Unit 2)
- Control feedback: between states and actions
- Data feedback: between data and policy
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1, h_2\sim\)Geometric\((1-\gamma)\) (the discount distribution)
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\) satisfies:
- Drawn from discounted state distribution \(s_i \sim d_{\mu_0}^\pi\)
- Unbiased labels \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\)
Proof in future PSet
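As a concrete reference, here is a minimal sketch of this sampling procedure, assuming a hypothetical simulator interface (`env.reset()` returning \(s_0\sim\mu_0\), `env.step(a)` returning a next state and reward) and a policy `pi(s)` returning an action; the names are illustrative, not from the course codebase.

```python
import numpy as np

def collect_q_dataset(env, pi, gamma, N, seed=0):
    """Roll-in / roll-out data collection for estimating Q^pi.

    Hypothetical interface: env.reset() -> s0 ~ mu_0,
    env.step(a) -> (next_state, reward); pi(s) returns an action.
    """
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(N):
        # Geometric(1-gamma) on {0, 1, 2, ...}: P(h = t) = (1-gamma) * gamma**t
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1

        s = env.reset()                      # s_0 ~ mu_0
        for _ in range(h1):                  # roll in: h1 steps under pi
            s, _ = env.step(pi(s))
        s_i, a_i = s, pi(s)                  # (s_i, a_i) = (s_{h1}, a_{h1})

        y, s, a = 0.0, s_i, a_i
        for _ in range(h2 + 1):              # roll out: rewards r_{h1}, ..., r_{h1+h2}
            s, r = env.step(a)
            y += r                           # undiscounted sum; h2's distribution does the discounting
            a = pi(s)
        data.append((s_i, a_i, y))
    return data
```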
Approximate Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(i=0,\dots,T-1\):
- Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning $$\hat Q^{\pi_i}(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{j=1}^N (Q(s_j,a_j)-y_j)^2 $$
- Policy Improvement: \(\forall s\), $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
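A hedged sketch of the full loop, reusing the data-collection helper above and assuming a finite action set and a generic regression routine `fit_q` standing in for the supervised-learning step (all names are illustrative):

```python
def approximate_policy_iteration(env, actions, gamma, T, N, fit_q, pi0):
    """Approximate PI: fitted Q-evaluation followed by greedy improvement.

    fit_q(dataset) is any least-squares regressor returning a function
    q_hat(s, a) ~= Q^{pi_i}(s, a); pi0 is the initial policy (s -> action).
    """
    pi = pi0
    for i in range(T):
        # Policy evaluation: roll in/out pi_i, then supervised learning
        dataset = collect_q_dataset(env, pi, gamma, N)   # sketched above
        q_hat = fit_q(dataset)                           # minimizes sum_j (Q(s_j,a_j) - y_j)^2

        # Policy improvement: greedy with respect to the estimate
        def pi_next(s, q_hat=q_hat):                     # bind q_hat now to avoid late binding
            return max(actions, key=lambda a: q_hat(s, a))
        pi = pi_next
    return pi
```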
Approximate Policy Iteration
- Even if we assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- or the stronger assumption: $$\max_a \mathbb E_{s\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- Approx PI does not necessarily improve because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
Approximate Policy Iteration
(Figure: two copies of a \(4\times 4\) gridworld with states numbered 0-15)
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Performance Difference Lemma: For two policies, $$V^{\pi_{i+1}}(s_0) - V^{\pi_{i}}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{s_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)
Improvement in PI
- The improvement is the expected advantage of \(\pi_{i+1}\) over \(\pi_i\), averaged over the discounted state distribution of \(\pi_{i+1}\)
- Notice that Policy Iteration's greedy updates $$ \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)=\arg\max_a A^{\pi_i}(s,a)$$
- Idea: keep \(d_{\mu_0}^{\pi_{i+1}}\) close to \(d_{\mu_0}^{\pi_{i}}\)
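Before making the update conservative, a one-line check (using only the definitions above) of why exact greedy PI never decreases value: under the greedy update, every term inside the PDL is nonnegative,
$$\mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a)\right] = A^{\pi_i}\!\left(s,\arg\max_a Q^{\pi_i}(s,a)\right) = \max_a Q^{\pi_i}(s,a) - V^{\pi_i}(s) \geq \mathbb E_{a\sim \pi_i(s)}\left[Q^{\pi_i}(s,a)\right] - V^{\pi_i}(s) = 0.$$
With an estimate \(\hat Q^{\pi_i}\), however, the greedy action can have negative true advantage on states visited by \(d_{\mu_0}^{\pi_{i+1}}\) but not by \(d_{\mu_0}^{\pi_i}\), which motivates the conservative update below.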
Conservative Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\Delta(\mathcal A)\)
- equivalently, \(\pi(a|s) : \mathcal A\times \mathcal S \to [0,1]\)
- For \(i=0,\dots,T-1\):
- Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning
- Policy Improvement: \(\forall s\), $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
- Incremental update: a policy which follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\))$$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s) $$
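A minimal sketch of this incremental update, assuming policies are represented as functions mapping a state to a dictionary of action probabilities (an illustrative choice, not the course's code):

```python
def cpi_update(pi_i, q_hat, actions, alpha):
    """One conservative improvement step.

    pi_i(s) returns a dict {action: probability}; q_hat approximates Q^{pi_i}.
    Returns pi_{i+1}(a|s) = (1 - alpha) * pi_i(a|s) + alpha * bar_pi(a|s).
    """
    def bar_pi(s):
        greedy = max(actions, key=lambda a: q_hat(s, a))
        return {a: float(a == greedy) for a in actions}   # deterministic greedy policy

    def pi_next(s):
        p_old, p_greedy = pi_i(s), bar_pi(s)
        return {a: (1 - alpha) * p_old[a] + alpha * p_greedy[a] for a in actions}

    return pi_next
```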
Conservative Policy Iteration

(Diagram: two-state MDP with states \(0\) and \(1\); transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\))
Example
Case: \(p_1\) and \(p_2\) both close to \(1\)
- \(\pi^\star(1)=\) switch
- reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
(Plot: \(\pi^i(\cdot|1)\), \(\hat Q^{\pi^i}(1,\) switch\()\), and \(\hat Q^{\pi^i}(1,\) stay\()\) over iterations \(i=0,1,2\), together with the greedy policy \(\bar\pi\))
Incremental Updates
- Policy follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\))$$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s) $$
- For any \(s\), \(\|\pi^{i+1}(\cdot |s)- \pi^{i}(\cdot |s)\|_1 \)
- \(= \| \pi^{i}(\cdot |s) -(1-\alpha) \pi^{i}(\cdot|s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
- \(= \| \alpha \pi^{i}(\cdot |s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
- \(\leq \alpha \| \pi^{i}(\cdot |s) \|_1 + \alpha \| \bar \pi^{}(\cdot |s)\|_1 = 2\alpha \)
- Lemma: For any \(\pi\) and \(\pi'\), \(\|d^\pi_{\mu_0} - d^{\pi'}_{\mu_0}\|_1 \leq \frac{\gamma \max_s \|\pi(\cdot|s)-\pi'(\cdot|s)\|_1}{1-\gamma}\) (proof in future PSet)
- Thus in Conservative PI, \(d_{\mu_0}^{\pi_{i+1}}\) is \(\frac{2\gamma\alpha}{1-\gamma}\) close to \(d_{\mu_0}^{\pi_{i}}\)
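A quick numerical sanity check of the lemma and the resulting bound on a small random tabular MDP (an assumed illustration, not the lecture's example); it uses the exact formula \(d^\pi_{\mu_0} = (1-\gamma)\,\mu_0^\top (I-\gamma P_\pi)^{-1}\):

```python
import numpy as np

def discounted_state_dist(mu0, P, pi, gamma):
    """d^pi_{mu0} = (1-gamma) mu0^T (I - gamma P_pi)^{-1} for a tabular MDP.

    P has shape (S, A, S); pi has shape (S, A) with rows summing to 1.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)           # state-to-state kernel under pi
    S = P.shape[0]
    return (1 - gamma) * mu0 @ np.linalg.inv(np.eye(S) - gamma * P_pi)

# Illustration on a random MDP (assumed setup):
rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))          # random transition probabilities
mu0 = rng.dirichlet(np.ones(S))
pi_i = rng.dirichlet(np.ones(A), size=S)            # current stochastic policy
bar_pi = np.eye(A)[rng.integers(A, size=S)]         # some deterministic "greedy" policy
pi_next = (1 - alpha) * pi_i + alpha * bar_pi       # conservative update

shift = np.abs(discounted_state_dist(mu0, P, pi_next, gamma)
               - discounted_state_dist(mu0, P, pi_i, gamma)).sum()
bound = gamma * np.abs(pi_next - pi_i).sum(axis=1).max() / (1 - gamma)
print(shift, "<=", bound, "<=", 2 * gamma * alpha / (1 - gamma))
```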
(Diagram: \(\pi^{i}\), \(\bar\pi\), and the mixture \(\pi^{i+1}\))
Example
(Plots: the two-state MDP again; \(\pi^i(\cdot|1)\) and the discounted state distributions \(d^{\pi^i}_1(0)\), \(d^{\pi^i}_1(1)\) over iterations \(i=0,1,2\))
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
CPI Improvement
- Let \(\mathbb A_i = \mathbb E_{s\sim d_{\mu_0}^{\pi_i}}[ \max_a A^{\pi_i}(s,a)]\) be the PI improvement.
- Conservative policy iteration improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}\geq 0 $$
- when step size set to \(\alpha_i = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma }\).
Assumptions:
- Supervised learning works perfectly $$\max_a \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}[(\hat Q^{\pi^i}(s,a)-Q^{\pi^i}(s,a))^2] = 0$$
- Reward is bounded between \(0\) and \(1\)
Both assumptions can be relaxed; improvement would then depend on \(\epsilon\) and reward bounds

Example
(Plots: the two-state MDP again; \(\pi^i(\cdot|1)\), \(Q^{\pi^i}(1,\) switch\()\), \(Q^{\pi^i}(1,\) stay\()\), \(V^{\pi^i}(1)\), the greedy policy \(\bar\pi\), and the resulting \(\mathbb A_i\))
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right]\right] \) (PDL)
- \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \alpha A^{\pi_i}(s,\bar \pi(s)) + (1-\alpha)\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] \right] \) (CPI step 2)
- PollEv: \(\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] = \mathbb E_{a\sim \pi_{i}(s)}[Q^{\pi_i}(s,a)] - V^{\pi_i}(s) =0\)
- \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] \)
CPI Improvement Proof
CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$
- \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] \)
- \(=\frac{\alpha}{1-\gamma} \Big(\mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] - \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\Big) \)
- \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\right) \)
CPI Improvement Proof
- \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)
CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) +\mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\left|\sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))\right| + \mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)||A^{\pi_i}(s,\bar \pi(s))| + \mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)|\frac{1}{1-\gamma} + \mathbb A_i\right) \)
- \(\geq \frac{1}{1-\gamma} \left( -\frac{2\gamma\alpha^2}{1-\gamma} \frac{1}{1-\gamma} +\mathbb A_i \alpha\right) =\frac{\mathbb A_i^2(1-\gamma)}{8 \gamma}\)
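The last equality plugs in the stated step size: with \(\alpha = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma}\), which maximizes the quadratic in \(\alpha\),
$$\frac{1}{1-\gamma}\left(\mathbb A_i\alpha - \frac{2\gamma\alpha^2}{(1-\gamma)^2}\right) = \frac{1}{1-\gamma}\left(\frac{(1-\gamma)^2\mathbb A_i^2}{4\gamma} - \frac{(1-\gamma)^2\mathbb A_i^2}{8\gamma}\right) = \frac{(1-\gamma)\mathbb A_i^2}{8\gamma}.$$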
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Summary: Key Ideas
- Constructing dataset for supervised learning
- features \((s,a)\sim d^\pi_{\mu_0}\) ("roll in")
- labels \(y\) with \(\mathbb E[y|s,a]= Q^\pi(s,a)\) ("roll out")
- Incremental updates to control distribution shift
- mixture of current and greedy policy
- parameter \(\alpha\) controls the distribution shift


Summary: Constructing Labels
Labels via rollouts of \(\pi\):
- Method: \(y = \sum_{t=h_1}^{h_1+h_2} r_t \) for \(h_2\sim\)Geometric\((1-\gamma)\)
- Motivation: definition of \( Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
- On a future PSet, you will show
- \(\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})\)
- i.e., this label is unbiased
- How much variance will labels have?
- Many sources of randomness: all \(h_2\) transitions
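A tiny numerical illustration of both points, on an assumed one-state MDP with constant reward \(1\) (so \(Q^\pi = \frac{1}{1-\gamma}\)): the label is unbiased, but the random roll-out length alone contributes variance \(\frac{\gamma}{(1-\gamma)^2}\).

```python
import numpy as np

# Toy check of unbiasedness and variance of the roll-out label (assumed setup):
# a single-state MDP with constant reward 1, so Q^pi(s, a) = 1 / (1 - gamma).
gamma, n_samples = 0.9, 100_000
rng = np.random.default_rng(0)

h2 = rng.geometric(1 - gamma, size=n_samples) - 1   # Geometric(1-gamma) on {0, 1, 2, ...}
y = h2 + 1.0                                        # sum of (h2 + 1) unit rewards

print(y.mean())   # ~= 10.0 = 1 / (1 - gamma): unbiased
print(y.var())    # ~= gamma / (1 - gamma)^2 = 90: horizon randomness alone is large
```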
Preview: Constructing Labels
Other key equations can inspire labels:
- So far: \(Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
- We also know:
- Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
- Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
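For instance, the Bellman expectation equation suggests a bootstrapped one-step label in place of a full roll-out. A sketch under assumed names (`sample_next` and `q_current` are hypothetical; this is a preview, not an algorithm defined in this lecture):

```python
def bootstrapped_label(s, a, sample_next, pi, gamma, q_current):
    """One-step label inspired by the Bellman expectation equation:
    y = r(s, a) + gamma * Q^pi(s', a'), with s' ~ P(s, a) and a' ~ pi(s').

    sample_next(s, a) -> (s_next, r) is an assumed one-step simulator;
    q_current(s, a) is an existing estimate of Q^pi (both hypothetical).
    """
    s_next, r = sample_next(s, a)      # s' ~ P(s, a), r = r(s, a)
    a_next = pi(s_next)                # a' ~ pi(s')
    return r + gamma * q_current(s_next, a_next)
```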
Recap
- PSet 4
- Prelim in class 3/15
- Incremental Updates
- Guaranteed Improvement
- Next lecture: Prelim Review
Sp23 CS 4/5789: Lecture 13
By Sarah Dean