CS 4/5789: Introduction to Reinforcement Learning
Lecture 12: Approximate Policy Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PSet 4 released tonight, due next Monday
- 5789 Paper Reviews - on Canvas - due weekly starting Monday
- Midterm 3/15 during lecture
- Let us know conflicts/accommodations ASAP! (EdStem)
- Review Lecture on Monday 3/13 (last year's slides/recording)
- Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
- also: equation sheet (on Canvas), 2023 notes, PAs
- Mid-semester feedback survey: email “Associate Dean Alan Zehnder <invitation@surveys.mail.cornell.edu>”
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Recap: Model-based RL
Model-based RL with Queries
- Sample and record \(s_i'\sim P(s_i, a_i)\)
- Estimate \(\widehat P\) from \(\{(s_i',s_i, a_i)\}_{i=1}^N\)
- Design \(\widehat \pi\) from \(\widehat P\)
Tabular MBRL
- Sample: each state-action pair evenly, \(\frac{N}{SA}\) times each
- Estimate: by counting $$\hat P(s'|s,a) = \frac{\text{\#~times~}s'_i=s'\text{~when~}s_i=s,a_i=a}{\text{\#~times~}s_i=s,a_i=a} $$
- Design: policy iteration
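A minimal numpy sketch of this sample-and-count recipe (my own illustration, not from the slides; the toy sizes, the random ground-truth `P_true`, and the function name `estimate_transitions` are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 2                                          # toy tabular MDP sizes
P_true = rng.dirichlet(np.ones(S), size=(S, A))      # ground-truth transitions, unknown to the learner

def estimate_transitions(n_per_pair=100):
    """Estimate P(s'|s,a) by querying each (s,a) pair evenly and counting outcomes."""
    counts = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_per_pair, p=P_true[s, a])  # samples s' ~ P(s, a)
            for s_next in next_states:
                counts[s, a, s_next] += 1
    return counts / counts.sum(axis=2, keepdims=True)  # normalize counts into probabilities

P_hat = estimate_transitions()
print(np.abs(P_hat - P_true).sum(axis=2).max())        # max_{s,a} L1 error of the estimate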
Recap: Sample Complexity
Theorem: Tabular MBRL with \(N \gtrsim \frac{S^2 A}{\epsilon^2}\) is \(\epsilon\) sub-optimal with high probability
- Simulation Lemma: $$|\hat V^\pi(s_0) - V^\pi(s_0)| \lesssim \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
- Estimation Lemma: With high probability, $$\max_{s,a }\|P(\cdot |s,a)-\hat P(\cdot |s,a)\|_1 \lesssim \sqrt{\frac{S^2A}{N}} $$
Recap: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Compute \(Q^{\pi_t}\) with Policy Evaluation
- Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a)$$
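For reference, a minimal tabular sketch of this recap (my own illustration; it assumes a known transition tensor \(P\) and reward table \(r\), and does policy evaluation by solving the linear Bellman system exactly):

```python
import numpy as np

def policy_iteration(P, r, gamma, T=50):
    """Tabular policy iteration. P: (S, A, S) transition tensor, r: (S, A) rewards."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                                 # initialize pi_0 arbitrarily
    for _ in range(T):
        P_pi = P[np.arange(S), pi]                              # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # exact policy evaluation
        Q = r + gamma * P @ V                                   # Q^{pi_t}(s, a) for all (s, a)
        pi_new = Q.argmax(axis=1)                               # greedy policy improvement
        if np.array_equal(pi_new, pi):                          # PI has converged
            break
        pi = pi_new
    return pi

# Example usage on a random toy MDP
rng = np.random.default_rng(0)
S, A = 5, 2
pi_opt = policy_iteration(rng.dirichlet(np.ones(S), size=(S, A)), rng.uniform(size=(S, A)), gamma=0.9)
```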
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Feedback in RL
[Diagram: the policy \(\pi\) maps state \(s_t\) to action \(a_t\); the transitions \(P,f\) map \((s_t,a_t)\) to the next state and reward \(r_t\)]
Control feedback
- between states and actions
- "reaction"
- studied in control theory ("automatic feedback control")
- our focus for Unit 1
Feedback in RL
[Diagram: the control loop between the policy \(\pi\) and the transitions \(P,f\), plus an outer loop in which experience produces data \((s_t,a_t,r_t)\) that updates the policy]
- Control feedback: between states and actions
Data feedback
- between data and policy
- "adaptation"
- connection to machine learning
- our new focus in Unit 2
Feedback in RL
[Diagram: the control loop between the policy \(\pi\) and the transitions \(P,f\) (unknown in Unit 2), plus the data feedback loop from experience \((s_t,a_t,r_t)\) to the policy]
- Control feedback: between states and actions
- Data feedback: between data and policy
Supervised Learning for MDPs
- Supervised learning: features \(x\) and labels \(y\)
- Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
- Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
- Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
- Important functions in MDPs
- Transitions \(P(s'|s,a)\)
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
- Optimal policy \(\pi^\star(s)\)
Supervised Learning for MDPs
- Supervised learning: features \(x\) and labels \(y\)
- Important functions in MDPs
- Transitions \(P(s'|s,a)\)
- features \(s,a,s'\), sampled outcomes observed
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
- features \(s\) or \(s,a\), labels ?
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
- features \(s\) or \(s,a\), labels ?
- Optimal policy \(\pi^\star(s)\)
- features \(s\), labels from expert, otherwise ?
Supervised Learning for MDPs
- Transitions \(P(s'|s,a)\): MBRL (Lec 11)
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\): this week
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\): after prelim
- Optimal policy \(\pi^\star(s)\): after prelim
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Data for Learning \(Q^\pi(s,a)\)
- How to construct labels for $$ Q^\pi(s,a) = \mathbb E_{P,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
- Label via rollout:
- start at \((s,a)\), then follow \(\pi\) and observe rewards \(r_0,r_1,\dots\)
- Let label \(y=\sum_{t=0}^\infty \gamma^t r_t\)
- Unbiased because \(\mathbb E[y|s,a] = Q^\pi(s,a)\)
Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
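To make the labeling step concrete, a small sketch (my own, on a randomly generated toy MDP with deterministic rewards) that rolls out \(\pi\) from a given \((s,a)\) and forms the discounted-return label, truncating the infinite sum at a finite horizon:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transitions
r = rng.uniform(size=(S, A))                 # toy (deterministic) rewards
pi = rng.integers(A, size=S)                 # the policy being evaluated

def rollout_label(s, a, horizon=200):
    """Label y = sum_t gamma^t r_t from (s, a), following pi, truncated at `horizon` steps."""
    y = 0.0
    for t in range(horizon):                 # truncation error is at most gamma**horizon / (1 - gamma)
        y += gamma**t * r[s, a]
        s = rng.choice(S, p=P[s, a])         # s_{t+1} ~ P(s_t, a_t)
        a = pi[s]                            # a_{t+1} = pi(s_{t+1})
    return y

print(rollout_label(s=0, a=1))               # one sample whose expectation is (approximately) Q^pi(0, 1)
```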
Example
[Diagram: two-state MDP with states 0 and 1 and actions stay/switch; edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
- Consider \(\pi(s)=\)stay
Rollouts with
- \(s=0\) and \(a=\)stay
- \(s=0\) and \(a=\)switch
- \(s=1\) and \(a=\)stay
- \(s=1\) and \(a=\)switch
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1\) and \(h_2\) from the geometric distribution with parameter \(1-\gamma\)
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
Timestep sampling with discount/geometric distribution: set \(h_i=h\geq 0\) with probability \((1-\gamma)\gamma^h\)
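A possible implementation of this data-collection loop (my own sketch on a toy tabular MDP; the names `sample_horizon` and `collect` are illustrative, and the geometric sampling of \(h_1, h_2\) follows the note above, \(\Pr[h=k]=(1-\gamma)\gamma^k\) for \(k\geq 0\)):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # toy transitions (stand-in for the real MDP)
r = rng.uniform(size=(S, A))                     # toy rewards
mu0 = np.ones(S) / S                             # initial state distribution
pi = rng.integers(A, size=S)                     # deterministic policy to evaluate

def sample_horizon():
    # numpy's geometric distribution starts at 1; subtracting 1 gives Pr[h=k] = (1-gamma) * gamma**k
    return rng.geometric(1 - gamma) - 1

def collect(N):
    data = []
    for _ in range(N):
        h1, h2 = sample_horizon(), sample_horizon()
        s = rng.choice(S, p=mu0)
        for _ in range(h1):                      # roll in h1 steps under pi
            s = rng.choice(S, p=P[s, pi[s]])
        s_i, a_i = s, pi[s]                      # so (s_i, a_i) ~ d^pi_{mu0}
        y, s, a = 0.0, s_i, a_i
        for _ in range(h2 + 1):                  # roll out: sum the h2 + 1 rewards r_{h1}, ..., r_{h1+h2}
            y += r[s, a]
            s = rng.choice(S, p=P[s, a])
            a = pi[s]
        data.append((s_i, a_i, y))               # label y is an unbiased sample of Q^pi(s_i, a_i)
    return data

dataset = collect(N=1000)
```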
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1\) and \(h_2\) from the discount distribution
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\) satisfies:
- States drawn from the discounted state distribution: \(s_i \sim d_{\mu_0}^\pi\)
- Unbiased labels: \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\), since \(\Pr[h_2\geq t]=\gamma^t\), so the truncated undiscounted sum has expectation \(\sum_{t\geq 0}\gamma^t\,\mathbb E[r_t\mid s_i,a_i]\)
- Using sampled data, $$\hat Q^\pi(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
- Assumption: Supervised learning works, i.e. $$\mathbb E_{s,a\sim d_{\mu_0}^\pi }[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon $$
- In practice, often sample data from a single long rollout (rather than resetting to \(\mu_0\) every time)
- This is called "Monte Carlo" supervision
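One way the least-squares step above could look in code (my own sketch; the feature map `phi`, the function name `fit_q`, and the synthetic dataset are hypothetical placeholders for the rollout data and the function class \(\mathcal Q\)):

```python
import numpy as np

def fit_q(dataset, phi):
    """Least-squares fit over linear functions of the features phi(s, a)."""
    X = np.stack([phi(s, a) for s, a, _ in dataset])      # (N, d) feature matrix
    y = np.array([y_i for _, _, y_i in dataset])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)             # argmin_w ||X w - y||^2
    return lambda s, a: phi(s, a) @ w                     # the fitted Q-hat

# Example with one-hot (tabular) features, where the fit reduces to per-(s,a) averaging
S, A = 4, 2
def phi(s, a):
    e = np.zeros(S * A)
    e[s * A + a] = 1.0
    return e

rng = np.random.default_rng(3)
fake_dataset = [(rng.integers(S), rng.integers(A), rng.normal()) for _ in range(100)]
Q_hat = fit_q(fake_dataset, phi)
print(Q_hat(0, 1))
```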
Supervision via Rollouts
Notation note: \(s,a\sim d_{\mu_0}^\pi\) is a compact way of writing \(s\sim d_{\mu_0}^\pi\) and \(a\sim\pi\).
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Approximate Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(i=0,\dots,T-1\):
- Rollout \(\pi_i\) and collect a dataset \(\{s_j,a_j,y_j\}_{j=1}^N\), then compute \(\hat Q^{\pi_i}\) with supervised learning
- Policy Improvement: \(\forall s\), $$\pi_{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
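A self-contained sketch of this loop on a toy tabular MDP (my own illustration; `mc_q` stands in for the rollout + supervised-learning step). One simplification, flagged in the comments: the starting \((s,a)\) pair is drawn uniformly (exploring starts) rather than via the roll-in procedure above, so that every entry of \(\hat Q^{\pi_i}\) receives labels even though the policy is deterministic:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))        # toy MDP, same style as the earlier sketches
r = rng.uniform(size=(S, A))

def mc_q(pi, n_samples=500, horizon=100):
    """Monte Carlo estimate of Q^pi: average truncated discounted returns per (s, a)."""
    Q_sum, Q_cnt = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(n_samples):
        s0, a0 = rng.integers(S), rng.integers(A)  # uniform start pair (exploring-starts simplification)
        s, a, ret = s0, a0, 0.0
        for t in range(horizon):
            ret += gamma**t * r[s, a]
            s = rng.choice(S, p=P[s, a])
            a = pi[s]                              # after the first step, follow pi
        Q_sum[s0, a0] += ret
        Q_cnt[s0, a0] += 1
    return Q_sum / np.maximum(Q_cnt, 1)

pi = rng.integers(A, size=S)                       # pi_0
for i in range(5):                                 # approximate policy iteration
    Q_hat = mc_q(pi)                               # approximate policy evaluation from rollouts
    pi = Q_hat.argmax(axis=1)                      # greedy policy improvement
print("final policy:", pi)
```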
Approximate Policy Iteration
- This is an "on-policy" method because it uses data collected with the current policy \(\pi_i\)
- Recall that PI guarantees monotonic improvement $$Q^{\pi_{i+1}}(s,a)\geq Q^{\pi_i}(s,a)$$
- We assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- Does Approx PI monotonically improve (approximately)?
- Not necessarily, because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
- Greedy improvement with an approximate Q function can lead to oscillation!
Approximate Policy Iteration
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Example
- Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
- \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
- \(\pi_2\) ends up going right first and then down
- \(\pi_3\) will oscillate back to \(\pi_1\)!
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
Performance Difference
The advantage function:
- "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
- What can we say about \(A^{\pi^\star}\)? PollEV
- Recall the state distribution for a policy \(\pi\) (Lecture 2)
- \( d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\} \)
- We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
- The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}} $$
- When the initial state is fixed to a known \(s_0\), i.e. \(\mu_0=e_{s_0}\) we write \(d_{s_0,t}^{\pi}\)
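Aside: since \(\gamma<1\), the geometric series defining \(d_{\mu_0}^{\pi}\) above can be summed in closed form, which gives a convenient way to compute it exactly in the tabular setting: $$d_{\mu_0}^{\pi} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t (P_\pi^t)^\top \mu_0 = (1-\gamma)\,(I-\gamma P_\pi^\top)^{-1}\mu_0$$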
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
Proof of PDL
- \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
- \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
- \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]\)
- Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
- Statement follows by letting \(k\to\infty\): the first term vanishes since \(\gamma^k\to 0\) and the values are bounded, and the remaining sum equals \(\frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}\left[\mathbb E_{a\sim\pi(s)}[A^{\pi'}(s,a)]\right]\) by definition of the discounted state distribution.
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
Performance Difference
- Notice that \(\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)\)
- Can use PDL to show that PI has monotonic improvement (sketch below)
- Next time, we use insights from PDL to develop a better algorithm
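Sketch of that argument (assuming exact policy evaluation, as in PI): since \(\pi_{t+1}(s)=\arg\max_a Q^{\pi_t}(s,a)\), we have \(A^{\pi_t}(s,\pi_{t+1}(s)) = \max_a Q^{\pi_t}(s,a) - V^{\pi_t}(s)\geq 0\) for every \(s\). Applying the PDL with \(\pi=\pi_{t+1}\) and \(\pi'=\pi_t\) gives, for every \(s_0\), $$V^{\pi_{t+1}}(s_0)-V^{\pi_t}(s_0) = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d^{\pi_{t+1}}_{s_0}}\left[A^{\pi_t}(s,\pi_{t+1}(s))\right]\geq 0$$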
Recap
- PSet 4
- Prelim in class 3/15
- Supervision via Rollouts
- Performance Difference Lemma
- Next lecture: Conservative Policy Iteration
Sp23 CS 4/5789: Lecture 12
By Sarah Dean