CS 4/5789: Introduction to Reinforcement Learning
Lecture 13: Conservative Policy Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PSet 4 released, due Monday
- 5789 Paper Reviews due weekly starting Monday
- Midterm 3/15 during lecture
- Let us know conflicts/accommodations ASAP! (EdStem)
- Review Lecture on Monday 3/13 (last year's slides/recording)
- Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
- also: equation sheet (on Canvas), 2023 notes, PAs
- I will monitor Exams/Prelim tag on EdStem for questions
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Feedback in RL


(Diagram: agent-environment loop with action \(a_t\), state \(s_t\), reward \(r_t\); data \((s_t,a_t,r_t)\) and experience feed into the policy \(\pi\); transitions \(P,f\) unknown in Unit 2)
- Control feedback: between states and actions
- Data feedback: between data and policy
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1, h_2\sim\)Geometric\((1-\gamma)\) (the discount distribution)
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\) satisfies:
- Drawn from discounted state distribution \(s_i \sim d_{\mu_0}^\pi\)
- Unbiased labels \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\)
Proof in future PSet
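As a concrete reference, here is a minimal sketch of this sampling procedure, assuming a hypothetical simulator interface (`env.reset()` returning \(s_0\sim\mu_0\), `env.step(a)` returning a next state and reward) and a policy `pi(s)` returning an action; the names are illustrative, not from the course codebase.

```python
import numpy as np

def collect_q_dataset(env, pi, gamma, N, seed=0):
    """Roll-in / roll-out data collection for estimating Q^pi.

    Hypothetical interface: env.reset() -> s0 ~ mu_0,
    env.step(a) -> (next_state, reward); pi(s) returns an action.
    """
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(N):
        # Geometric(1-gamma) on {0, 1, 2, ...}: P(h = t) = (1-gamma) * gamma**t
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1

        s = env.reset()                      # s_0 ~ mu_0
        for _ in range(h1):                  # roll in: h1 steps under pi
            s, _ = env.step(pi(s))
        s_i, a_i = s, pi(s)                  # (s_i, a_i) = (s_{h1}, a_{h1})

        y, s, a = 0.0, s_i, a_i
        for _ in range(h2 + 1):              # roll out: rewards r_{h1}, ..., r_{h1+h2}
            s, r = env.step(a)
            y += r                           # undiscounted sum; h2's distribution does the discounting
            a = pi(s)
        data.append((s_i, a_i, y))
    return data
```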
Approximate Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(i=0,\dots,T-1\):
- Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning $$\hat Q^{\pi_i}(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{j=1}^N (Q(s_j,a_j)-y_j)^2 $$
- Policy Improvement: \(\forall s\), $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
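A hedged sketch of the full loop, reusing the data-collection helper above and assuming a finite action set and a generic regression routine `fit_q` standing in for the supervised-learning step (all names are illustrative):

```python
def approximate_policy_iteration(env, actions, gamma, T, N, fit_q, pi0):
    """Approximate PI: fitted Q-evaluation followed by greedy improvement.

    fit_q(dataset) is any least-squares regressor returning a function
    q_hat(s, a) ~= Q^{pi_i}(s, a); pi0 is the initial policy (s -> action).
    """
    pi = pi0
    for i in range(T):
        # Policy evaluation: roll in/out pi_i, then supervised learning
        dataset = collect_q_dataset(env, pi, gamma, N)   # sketched above
        q_hat = fit_q(dataset)                           # minimizes sum_j (Q(s_j,a_j) - y_j)^2

        # Policy improvement: greedy with respect to the estimate
        def pi_next(s, q_hat=q_hat):                     # bind q_hat now to avoid late binding
            return max(actions, key=lambda a: q_hat(s, a))
        pi = pi_next
    return pi
```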
Approximate Policy Iteration
- Even if we assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- or the stronger assumption: $$\max_a \mathbb E_{s\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- Approx PI does not necessarily improve because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
Approximate Policy Iteration
(Figure: two copies of a \(4\times 4\) gridworld with states numbered 0-15)
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Performance Difference Lemma: For two policies, $$V^{\pi_{i+1}}(s_0) - V^{\pi_{i}}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{s_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)
Improvement in PI
- The improvement is the expected advantage of \(\pi_{i+1}\) over \(\pi_i\), averaged over the discounted state distribution of \(\pi_{i+1}\)
- Notice that Policy Iteration's greedy updates $$ \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)=\arg\max_a A^{\pi_i}(s,a)$$
- Idea: keep \(d_{\mu_0}^{\pi_{i+1}}\) close to \(d_{\mu_0}^{\pi_{i}}\)
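Before making the update conservative, a one-line check (using only the definitions above) of why exact greedy PI never decreases value: under the greedy update, every term inside the PDL is nonnegative,
$$\mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a)\right] = A^{\pi_i}\!\left(s,\arg\max_a Q^{\pi_i}(s,a)\right) = \max_a Q^{\pi_i}(s,a) - V^{\pi_i}(s) \geq \mathbb E_{a\sim \pi_i(s)}\left[Q^{\pi_i}(s,a)\right] - V^{\pi_i}(s) = 0.$$
With an estimate \(\hat Q^{\pi_i}\), however, the greedy action can have negative true advantage on states visited by \(d_{\mu_0}^{\pi_{i+1}}\) but not by \(d_{\mu_0}^{\pi_i}\), which motivates the conservative update below.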
Conservative Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\Delta(\mathcal A)\)
- equivalently, \(\pi(a|s) : \mathcal A\times \mathcal S \to [0,1]\)
- For \(i=0,\dots,T-1\):
- Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning
- Policy Improvement: \(\forall s\), $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
- Incremental update: a policy which follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\))$$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s) $$
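A minimal sketch of this incremental update, assuming policies are represented as functions mapping a state to a dictionary of action probabilities (an illustrative choice, not the course's code):

```python
def cpi_update(pi_i, q_hat, actions, alpha):
    """One conservative improvement step.

    pi_i(s) returns a dict {action: probability}; q_hat approximates Q^{pi_i}.
    Returns pi_{i+1}(a|s) = (1 - alpha) * pi_i(a|s) + alpha * bar_pi(a|s).
    """
    def bar_pi(s):
        greedy = max(actions, key=lambda a: q_hat(s, a))
        return {a: float(a == greedy) for a in actions}   # deterministic greedy policy

    def pi_next(s):
        p_old, p_greedy = pi_i(s), bar_pi(s)
        return {a: (1 - alpha) * p_old[a] + alpha * p_greedy[a] for a in actions}

    return pi_next
```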
Conservative Policy Iteration

(Diagram: two-state MDP with states \(0\) and \(1\); transitions labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\))
Example
Case: \(p_1\) and \(p_2\) both close to \(1\)
- \(\pi^\star(1)=\) switch
- reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch
(Plot: \(\pi^i(\cdot|1)\), \(\hat Q^{\pi^i}(1,\) switch\()\), and \(\hat Q^{\pi^i}(1,\) stay\()\) over iterations \(i=0,1,2\), together with the greedy policy \(\bar\pi\))
Incremental Updates
- Policy follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\))$$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s) $$
- For any \(s\), \(\|\pi^{i+1}(\cdot |s)- \pi^{i}(\cdot |s)\|_1 \)
- \(= \| \pi^{i}(\cdot |s) -(1-\alpha) \pi^{i}(\cdot|s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
- \(= \| \alpha \pi^{i}(\cdot |s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
- \(\leq \alpha \| \pi^{i}(\cdot |s) \|_1 + \alpha \| \bar \pi^{}(\cdot |s)\|_1 = 2\alpha \)
- Lemma: For any \(\pi\) and \(\pi'\), \(\|d^\pi_{\mu_0} - d^{\pi'}_{\mu_0}\|_1 \leq \frac{\gamma \max_s \|\pi(\cdot|s)-\pi'(\cdot|s)\|_1}{1-\gamma}\) (proof in future PSet)
- Thus in Conservative PI, \(d_{\mu_0}^{\pi_{i+1}}\) is \(\frac{2\gamma\alpha}{1-\gamma}\) close to \(d_{\mu_0}^{\pi_{i}}\)
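A quick numerical sanity check of the lemma and the resulting bound on a small random tabular MDP (an assumed illustration, not the lecture's example); it uses the exact formula \(d^\pi_{\mu_0} = (1-\gamma)\,\mu_0^\top (I-\gamma P_\pi)^{-1}\):

```python
import numpy as np

def discounted_state_dist(mu0, P, pi, gamma):
    """d^pi_{mu0} = (1-gamma) mu0^T (I - gamma P_pi)^{-1} for a tabular MDP.

    P has shape (S, A, S); pi has shape (S, A) with rows summing to 1.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)           # state-to-state kernel under pi
    S = P.shape[0]
    return (1 - gamma) * mu0 @ np.linalg.inv(np.eye(S) - gamma * P_pi)

# Illustration on a random MDP (assumed setup):
rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))          # random transition probabilities
mu0 = rng.dirichlet(np.ones(S))
pi_i = rng.dirichlet(np.ones(A), size=S)            # current stochastic policy
bar_pi = np.eye(A)[rng.integers(A, size=S)]         # some deterministic "greedy" policy
pi_next = (1 - alpha) * pi_i + alpha * bar_pi       # conservative update

shift = np.abs(discounted_state_dist(mu0, P, pi_next, gamma)
               - discounted_state_dist(mu0, P, pi_i, gamma)).sum()
bound = gamma * np.abs(pi_next - pi_i).sum(axis=1).max() / (1 - gamma)
print(shift, "<=", bound, "<=", 2 * gamma * alpha / (1 - gamma))
```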
(Diagram: \(\pi^{i}\), \(\bar\pi\), and the mixture \(\pi^{i+1}\))
Example
(Plots: the two-state MDP again; \(\pi^i(\cdot|1)\) and the discounted state distributions \(d^{\pi^i}_1(0)\), \(d^{\pi^i}_1(1)\) over iterations \(i=0,1,2\))
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
CPI Improvement
- Let \(\mathbb A_i = \mathbb E_{s\sim d_{\mu_0}^{\pi_i}}[ \max_a A^{\pi_i}(s,a)]\) be the PI improvement.
- Conservative policy iteration improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}\geq 0 $$
- when step size set to \(\alpha_i = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma }\).
Assumptions:
- Supervised learning works perfectly $$\max_a \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}[(\hat Q^{\pi^i}(s,a)-Q^{\pi^i}(s,a))^2] = 0$$
- Reward is bounded between \(0\) and \(1\)
Both assumptions can be relaxed; improvement would then depend on \(\epsilon\) and reward bounds

Example
(Plots: the two-state MDP again; \(\pi^i(\cdot|1)\), \(Q^{\pi^i}(1,\) switch\()\), \(Q^{\pi^i}(1,\) stay\()\), \(V^{\pi^i}(1)\), the greedy policy \(\bar\pi\), and the resulting \(\mathbb A_i\))
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right]\right] \) (PDL)
- \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \alpha A^{\pi_i}(s,\bar \pi(s)) + (1-\alpha)\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] \right] \) (CPI step 2)
- PollEv: \(\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] = \mathbb E_{a\sim \pi_{i}(s)}[Q^{\pi_i}(s,a)] - V^{\pi_i}(s) =0\)
- \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] \)
CPI Improvement Proof
CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$
- \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] \)
- \(=\frac{\alpha}{1-\gamma} \Big(\mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] - \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\Big) \)
- \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\right) \)
CPI Improvement Proof
- \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
- \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)
CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$
\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)
- \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s)) +\mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\left|\sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))\right| + \mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)||A^{\pi_i}(s,\bar \pi(s))| + \mathbb A_i\right) \)
- \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s) - d^{\pi_{i}}_{\mu_0}(s)|\frac{1}{1-\gamma} + \mathbb A_i\right) \)
- \(\geq \frac{1}{1-\gamma} \left( -\frac{2\gamma\alpha^2}{1-\gamma} \frac{1}{1-\gamma} +\mathbb A_i \alpha\right) =\frac{\mathbb A_i^2(1-\gamma)}{8 \gamma}\)
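The last equality plugs in the stated step size: with \(\alpha = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma}\), which maximizes the quadratic in \(\alpha\),
$$\frac{1}{1-\gamma}\left(\mathbb A_i\alpha - \frac{2\gamma\alpha^2}{(1-\gamma)^2}\right) = \frac{1}{1-\gamma}\left(\frac{(1-\gamma)^2\mathbb A_i^2}{4\gamma} - \frac{(1-\gamma)^2\mathbb A_i^2}{8\gamma}\right) = \frac{(1-\gamma)\mathbb A_i^2}{8\gamma}.$$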
Agenda
1. Recap
2. Conservative Policy Iteration
3. Guarantees on Improvement
4. Summary
Summary: Key Ideas
- Constructing dataset for supervised learning
- features \((s,a)\sim d^\pi_{\mu_0}\) ("roll in")
- labels \(y\) with \(\mathbb E[y|s,a]= Q^\pi(s,a)\) ("roll out")
- Incremental updates to control distribution shift
- mixture of current and greedy policy
- parameter \(\alpha\) controls the distribution shift


Summary: Constructing Labels
Labels via rollouts of \(\pi\):
- Method: \(y = \sum_{t=h_1}^{h_1+h_2} r_t \) for \(h_2\sim\)Geometric\((1-\gamma)\)
- Motivation: definition of \( Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
- On a future PSet, you will show
- \(\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})\)
- i.e., this label is unbiased
- How much variance will labels have?
- Many sources of randomness: all \(h_2\) transitions
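A tiny numerical illustration of both points, on an assumed one-state MDP with constant reward \(1\) (so \(Q^\pi = \frac{1}{1-\gamma}\)): the label is unbiased, but the random roll-out length alone contributes variance \(\frac{\gamma}{(1-\gamma)^2}\).

```python
import numpy as np

# Toy check of unbiasedness and variance of the roll-out label (assumed setup):
# a single-state MDP with constant reward 1, so Q^pi(s, a) = 1 / (1 - gamma).
gamma, n_samples = 0.9, 100_000
rng = np.random.default_rng(0)

h2 = rng.geometric(1 - gamma, size=n_samples) - 1   # Geometric(1-gamma) on {0, 1, 2, ...}
y = h2 + 1.0                                        # sum of (h2 + 1) unit rewards

print(y.mean())   # ~= 10.0 = 1 / (1 - gamma): unbiased
print(y.var())    # ~= gamma / (1 - gamma)^2 = 90: horizon randomness alone is large
```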
Preview: Constructing Labels
Other key equations can inspire labels:
- So far: \(Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
- We also know:
- Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
- Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
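For instance, the Bellman expectation equation suggests a bootstrapped one-step label in place of a full roll-out. A sketch under assumed names (`sample_next` and `q_current` are hypothetical; this is a preview, not an algorithm defined in this lecture):

```python
def bootstrapped_label(s, a, sample_next, pi, gamma, q_current):
    """One-step label inspired by the Bellman expectation equation:
    y = r(s, a) + gamma * Q^pi(s', a'), with s' ~ P(s, a) and a' ~ pi(s').

    sample_next(s, a) -> (s_next, r) is an assumed one-step simulator;
    q_current(s, a) is an existing estimate of Q^pi (both hypothetical).
    """
    s_next, r = sample_next(s, a)      # s' ~ P(s, a), r = r(s, a)
    a_next = pi(s_next)                # a' ~ pi(s')
    return r + gamma * q_current(s_next, a_next)
```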
Recap
- PSet 4
- Prelim in class 3/15
- Incremental Updates
- Guaranteed Improvement
- Next lecture: Prelim Review
Sp23 CS 4/5789: Lecture 13
By Sarah Dean