CS 4/5789: Introduction to Reinforcement Learning
Lecture 12: Approximate Policy Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
 Homework
 PSet 4 released tonight, due next Monday
 5789 Paper Reviews (on Canvas) due weekly starting Monday
 Midterm 3/15 during lecture
 Let us know conflicts/accommodations ASAP! (EdStem)
 Review Lecture on Monday 3/13 (last year's slides/recording)
 Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
 also: equation sheet (on Canvas), 2023 notes, PAs
 Midsemester feedback survey: email “Associate Dean Alan Zehnder <invitation@surveys.mail.cornell.edu>”
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Recap: Modelbased RL
Modelbased RL with Queries
 Sample and record \(s_i'\sim P(s_i, a_i)\)
 Estimate \(\widehat P\) from \(\{(s_i',s_i, a_i)\}_{i=1}^N\)
 Design \(\widehat \pi\) from \(\widehat P\)
Tabular MBRL
 Sample: query each state-action pair evenly, \(\frac{N}{SA}\) times each
 Estimate: by counting $$\hat P(s'\mid s,a) = \frac{\#\text{ times } s'_i=s' \text{ when } s_i=s,\ a_i=a}{\#\text{ times } s_i=s,\ a_i=a} $$
 Design: policy iteration
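The estimate-by-counting step can be sketched in Python (a minimal illustration; the helper name `estimate_transitions` is not from the lecture):

```python
import numpy as np

def estimate_transitions(data, S, A):
    """Estimate P_hat(s'|s,a) by counting observed transitions.

    data: list of (s, a, s_next) tuples with integer states/actions.
    Returns an (S, A, S) array; unvisited (s, a) rows default to uniform.
    """
    counts = np.zeros((S, A, S))
    for s, a, s_next in data:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Divide each row by its visit count; fall back to 1/S where unvisited.
    return np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
```

With evenly allocated queries, each row is an empirical distribution over \(\frac{N}{SA}\) samples, matching the counting estimator above.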
Recap: Sample Complexity
Theorem: Tabular MBRL with \(N \gtrsim \frac{S^2 A}{\epsilon^2}\) samples returns an \(\epsilon\)-suboptimal policy with high probability
 Simulation Lemma: $$|\hat V^\pi(s_0) - V^\pi(s_0)| \lesssim \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot\mid s,\pi(s)) - P(\cdot\mid s,\pi(s))\|_1\right]$$
 Estimation Lemma: With high probability, $$\max_{s,a}\|P(\cdot\mid s,a)-\hat P(\cdot\mid s,a)\|_1 \lesssim \sqrt{\frac{S^2A}{N}} $$
Recap: Policy Iteration
Policy Iteration
 Initialize \(\pi_0:\mathcal S\to\mathcal A\)
 For \(t=0,\dots,T-1\):
 Compute \(Q^{\pi_t}\) with Policy Evaluation
 Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a)$$
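The loop above can be sketched for the tabular setting (assuming known \(P\) and \(r\); exact policy evaluation via a linear solve; the helper name `policy_iteration` is illustrative):

```python
import numpy as np

def policy_iteration(P, r, gamma, T=50):
    """Tabular policy iteration. P: (S, A, S) transitions, r: (S, A) rewards."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(T):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(S), pi]                      # (S, S)
        r_pi = r[np.arange(S), pi]                      # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^{pi_t}.
        Q = r + gamma * P @ V                           # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):                  # converged
            break
        pi = new_pi
    return pi, Q
```

Policy iteration converges in finitely many steps for finite MDPs, so the early exit on a repeated policy is safe.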
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Feedback in RL
[Diagram: the policy \(\pi\) maps state \(s_t\) to action \(a_t\); the transitions \(P,f\) map \((s_t,a_t)\) to the next state and reward \(r_t\)]
Control feedback
 between states and actions
 "reaction"
 studied in control theory as "automatic feedback control"
 our focus for Unit 1
Feedback in RL
[Diagram: the control loop as above, with data \((s_t,a_t,r_t)\) from experience feeding back into the policy \(\pi\)]
 Control feedback: between states and actions
Data feedback
 between data and policy
 "adaptation"
 connection to machine learning
 our new focus in Unit 2
Feedback in RL
[Diagram: both feedback loops together; the transitions \(P,f\) are unknown in Unit 2]
 Control feedback: between states and actions
 Data feedback: between data and policy
 Supervised learning: features \(x\) and labels \(y\)
 Goal: predict labels with \(\hat f(x)\approx \mathbb E[y\mid x]\)
 Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
 Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
 Important functions in MDPs
 Transitions \(P(s'\mid s,a)\)
 Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
 Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
 Optimal policy \(\pi^\star(s)\)
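For a linear function class, the least-squares method above might look like the following (a minimal sketch; the helper name `fit_least_squares` is illustrative):

```python
import numpy as np

def fit_least_squares(X, y):
    """Empirical risk minimization over linear functions f(x) = w @ x.

    Minimizes sum_i (f(x_i) - y_i)^2; lstsq handles rank deficiency.
    X: (N, d) feature matrix, y: (N,) labels.
    """
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x: x @ w
```

Richer classes (kernels, neural networks) play the same role: minimize squared error over \(\mathcal F\) on the dataset.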
Supervised Learning for MDPs
 Supervised learning: features \(x\) and labels \(y\)
 Important functions in MDPs
 Transitions \(P(s'\mid s,a)\)
 features \(s,a\); labels: sampled outcomes \(s'\) are observed
 Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
 features \(s\) or \(s,a\), labels ?
 Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
 features \(s\) or \(s,a\), labels ?
 Optimal policy \(\pi^\star(s)\)
 features \(s\), labels from expert, otherwise ?
Supervised Learning for MDPs
 Transitions \(P(s'\mid s,a)\): MBRL (Lec 11)
 Value/Q of a policy: this week
 Optimal Value/Q: after prelim
 Optimal policy: after prelim
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Data for Learning \(Q^\pi(s,a)\)
 How to construct labels for $$ Q^\pi(s,a) = \mathbb E_{P,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
 Label via rollout:
 start at \((s,a)\) and then observe rewards \(r_0,r_1,\dots\)
 Let label \(y=\sum_{t=0}^\infty \gamma^t r_t\)
 Unbiased because \(\mathbb E[y\mid s,a] = Q^\pi(s,a)\)
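The label construction can be sketched as follows (truncated at a finite horizon, as any real rollout must be; the name `rollout_label` is illustrative):

```python
import numpy as np

def rollout_label(rewards, gamma):
    """Label y = sum_t gamma^t r_t from one observed reward sequence.

    Truncating at horizon H biases the label by at most
    gamma^H * r_max / (1 - gamma), which vanishes as H grows.
    """
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))
```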
Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
Example
[Two-state MDP diagram: states \(0\) and \(1\) with actions stay/switch; transition probabilities involve \(p_1\), \(1-p_1\), \(p_2\), \(1-p_2\), with rewards labeled on each state-action pair]
 Consider \(\pi(s)=\)stay

Rollouts with
 \(s=0\) and \(a=\)stay
 \(s=0\) and \(a=\)switch
 \(s=1\) and \(a=\)stay
 \(s=1\) and \(a=\)switch
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
 For \(i=1,\dots,N\):
 Sample \(s_0\sim\mu_0\), and \(h_1\) and \(h_2\) from the geometric distribution with parameter \(1-\gamma\)
 Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
 Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Rollout:
\(s_t\)
\(a_t\sim \pi(s_t)\)
\(r_t\sim r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
\(a_{t+1}\sim \pi(s_{t+1})\)
...
Timestep sampling with discount/geometric distribution: set \(h_i=h\geq 0\) with probability \((1-\gamma)\gamma^h\)
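Sampling from this distribution can be sketched in one line (numpy's `geometric` counts trials starting at 1, hence the shift; the helper name is illustrative):

```python
import numpy as np

def sample_horizon(gamma, rng, size=None):
    """Sample h >= 0 with probability (1 - gamma) * gamma**h.

    Equivalent to flipping a coin with success probability (1 - gamma)
    each step and counting the failures before the first success.
    E[h] = gamma / (1 - gamma).
    """
    return rng.geometric(1.0 - gamma, size=size) - 1
```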
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
 For \(i=1,\dots,N\):
 Sample \(s_0\sim\mu_0\) and \(h_1\) and \(h_2\) from discount distribution
 Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
 Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\)
 Drawn from discounted state distribution \(s_i \sim d_{\mu_0}^\pi\)
 Unbiased labels \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\)
 Using sampled data, $$\hat Q^\pi(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
 Assumption: Supervised learning works, i.e. $$\mathbb E_{s,a\sim d_{\mu_0}^\pi }[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon $$
 In practice, often sample data from a single long rollout (rather than resetting to \(\mu_0\) every time)
 This is called "Monte Carlo" supervision
Supervision via Rollouts
Notation note: \(s,a\sim d_{\mu_0}^\pi\) is a compact way of writing \(s\sim d_{\mu_0}^\pi\) and \(a\sim\pi\).
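For the tabular function class, the least-squares fit of \(\hat Q^\pi\) reduces to per-pair label averaging; a minimal sketch (hypothetical helper `fit_q`, data as produced by the collection procedure above):

```python
import numpy as np

def fit_q(data, S, A):
    """Fit Q_hat by least squares over the tabular class.

    data: list of ((s, a), y) pairs.  With one-hot features, the
    least-squares minimizer at each (s, a) is the sample mean of
    the labels observed there; unvisited pairs default to 0.
    """
    sums = np.zeros((S, A))
    counts = np.zeros((S, A))
    for (s, a), y in data:
        sums[s, a] += y
        counts[s, a] += 1
    return sums / np.maximum(counts, 1)
```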
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Approximate Policy Iteration
 Initialize \(\pi_0:\mathcal S\to\mathcal A\)
 For \(i=0,\dots,T-1\):
 Rollout \(\pi_i\), collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\), then compute \(\hat Q^{\pi_i}\) with supervised learning
 Policy Improvement: \(\forall s\), $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
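A self-contained sketch of the loop, with Monte Carlo rollouts standing in for the supervised-learning step (for simplicity, starts are enumerated over all \((s,a)\) rather than drawn from \(d^{\pi_i}_{\mu_0}\); the helper name `approx_pi` is illustrative):

```python
import numpy as np

def approx_pi(P, r, gamma, n_rollouts=100, horizon=40, iters=3, seed=0):
    """Approximate PI: estimate Q^{pi_i} from rollouts, then act greedily."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        Q_hat = np.zeros((S, A))
        for s in range(S):
            for a in range(A):
                for _ in range(n_rollouts):
                    # One truncated rollout: start at (s, a), then follow pi.
                    ret, disc, cs, ca = 0.0, 1.0, s, a
                    for _ in range(horizon):
                        ret += disc * r[cs, ca]
                        disc *= gamma
                        cs = rng.choice(S, p=P[cs, ca])
                        ca = pi[cs]
                    Q_hat[s, a] += ret / n_rollouts
        pi = Q_hat.argmax(axis=1)           # greedy improvement
    return pi
```

Because \(\hat Q^{\pi_i}\) is only an estimate, the greedy step can oscillate rather than improve monotonically, as discussed below.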
Approximate Policy Iteration
 This is an "on-policy" method because it uses data collected with the current policy \(\pi_i\)
 Recall that PI guarantees monotonic improvement $$Q^{\pi_{i+1}}(s,a)\geq Q^{\pi_i}(s,a)$$
 We assume that supervised learning succeeds: $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
 Does Approx PI monotonically improve (approximately)?
 Not necessarily, because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
 Greedy improvement when Q function is approximate could lead to oscillation!
Approximate Policy Iteration
[4x4 gridworld diagram, states numbered 0-15]
Example
 Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
 \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
 \(\pi_2\) ends up going right first and then down
 \(\pi_3\) will oscillate back to \(\pi_1\)!
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) = Q^{\pi'}(s,a) - V^{\pi'}(s)\)
Performance Difference
The advantage function:
 "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
 What can we say about \(A^{\pi^\star}\)? PollEV
 Recall the state distribution for a policy \(\pi\) (Lecture 2)
 \( d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\} \)
 We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
 The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}} $$
 When the initial state is fixed to a known \(s_0\), i.e. \(\mu_0=e_{s_0}\) we write \(d_{s_0,t}^{\pi}\)
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
Proof of PDL
 \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
 \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) - V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
 \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1) - V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) - V^{\pi'}(s_0) \right]\)
 Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k) - V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) - V^{\pi'}(s_\ell) \right]$$
 Statement follows by letting \(k\to\infty\): the first term vanishes since values are bounded, and the sum becomes \(\frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}\left[A^{\pi'}(s,\pi(s))\right]\) by definition of the discounted distribution.
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) = Q^{\pi'}(s,a) - V^{\pi'}(s)\)
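The lemma can be checked numerically on a small tabular MDP; a sketch for deterministic policies (the helper name `check_pdl` is hypothetical):

```python
import numpy as np

def check_pdl(P, r, gamma, pi, pi_prime, s0):
    """Compute both sides of the PDL exactly on a tabular MDP.

    P: (S, A, S) transitions, r: (S, A) rewards; pi, pi_prime are
    integer arrays giving deterministic policies.
    """
    S, A = r.shape
    idx = np.arange(S)

    def values(policy):
        P_pol = P[idx, policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pol, r[idx, policy])
        return V, r + gamma * P @ V

    V_pi, _ = values(pi)
    V_pp, Q_pp = values(pi_prime)
    A_pp = Q_pp - V_pp[:, None]                  # advantage of pi'
    # Discounted state distribution of pi from s0:
    # d = (1 - gamma) * (I - gamma * P_pi^T)^{-1} e_{s0}
    P_pi = P[idx, pi]
    d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])
    lhs = V_pi[s0] - V_pp[s0]
    rhs = (1.0 / (1 - gamma)) * d @ A_pp[idx, pi]
    return lhs, rhs
```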
Performance Difference
 Notice that \(\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)\)
 Can use PDL to show that PI has monotonic improvement
 Next time, we use insights from PDL to develop a better algorithm
Recap
 PSet 4
 Prelim in class 3/15
 Supervision via Rollouts
 Performance Difference Lemma
 Next lecture: Conservative Policy Iteration
Sp23 CS 4/5789: Lecture 12
By Sarah Dean