CS 4/5789: Introduction to Reinforcement Learning
Lecture 12: Approximate Policy Iteration
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- PSet 4 released tonight, due next Monday
- 5789 Paper Reviews - on Canvas - due weekly starting Monday
- Midterm 3/15 during lecture
- Let us know conflicts/accommodations ASAP! (EdStem)
- Review Lecture on Monday 3/13 (last year's slides/recording)
- Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
- also: equation sheet (on Canvas), 2023 notes, PAs
- Mid-semester feedback survey: email “Associate Dean Alan Zehnder <invitation@surveys.mail.cornell.edu>”
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Recap: Model-based RL
Model-based RL with Queries
- Sample and record \(s_i'\sim P(s_i, a_i)\)
- Estimate \(\widehat P\) from \(\{(s_i',s_i, a_i)\}_{i=1}^N\)
- Design \(\widehat \pi\) from \(\widehat P\)
Tabular MBRL
- Sample: each state-action pair evenly, \(\frac{N}{SA}\) times each
- Estimate: by counting $$\hat P(s'|s,a) = \frac{\text{\#~times~}s'_i=s'\text{~when~}s_i=s,a_i=a}{\text{\#~times~}s_i=s,a_i=a} $$
- Design: policy iteration
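A minimal numpy sketch of this sample-and-count recipe (my own illustration, not from the slides; the toy sizes, the random ground-truth `P_true`, and the function name `estimate_transitions` are assumptions for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 5, 2                                          # toy tabular MDP sizes
P_true = rng.dirichlet(np.ones(S), size=(S, A))      # ground-truth transitions, unknown to the learner

def estimate_transitions(n_per_pair=100):
    """Estimate P(s'|s,a) by querying each (s,a) pair evenly and counting outcomes."""
    counts = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            next_states = rng.choice(S, size=n_per_pair, p=P_true[s, a])  # samples s' ~ P(s, a)
            for s_next in next_states:
                counts[s, a, s_next] += 1
    return counts / counts.sum(axis=2, keepdims=True)  # normalize counts into probabilities

P_hat = estimate_transitions()
print(np.abs(P_hat - P_true).sum(axis=2).max())        # max_{s,a} L1 error of the estimate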
Recap: Sample Complexity
Theorem: Tabular MBRL with \(N \gtrsim \frac{S^2 A}{\epsilon^2}\) is \(\epsilon\) sub-optimal with high probability
- Simulation Lemma: $$|\hat V^\pi(s_0) - V^\pi(s_0)| \lesssim \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
- Estimation Lemma: With high probability, $$\max_{s,a }\|P(\cdot |s,a)-\hat P(\cdot |s,a)\|_1 \lesssim \sqrt{\frac{S^2A}{N}} $$
Recap: Policy Iteration
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Compute \(Q^{\pi_t}\) with Policy Evaluation
- Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a)$$
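For reference, a minimal tabular sketch of this recap (my own illustration; it assumes a known transition tensor \(P\) and reward table \(r\), and does policy evaluation by solving the linear Bellman system exactly):

```python
import numpy as np

def policy_iteration(P, r, gamma, T=50):
    """Tabular policy iteration. P: (S, A, S) transition tensor, r: (S, A) rewards."""
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                                 # initialize pi_0 arbitrarily
    for _ in range(T):
        P_pi = P[np.arange(S), pi]                              # (S, S) transition matrix under pi
        r_pi = r[np.arange(S), pi]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)     # exact policy evaluation
        Q = r + gamma * P @ V                                   # Q^{pi_t}(s, a) for all (s, a)
        pi_new = Q.argmax(axis=1)                               # greedy policy improvement
        if np.array_equal(pi_new, pi):                          # PI has converged
            break
        pi = pi_new
    return pi

# Example usage on a random toy MDP
rng = np.random.default_rng(0)
S, A = 5, 2
pi_opt = policy_iteration(rng.dirichlet(np.ones(S), size=(S, A)), rng.uniform(size=(S, A)), gamma=0.9)
```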
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Feedback in RL
[Diagram: the policy \(\pi\) maps state \(s_t\) to action \(a_t\); the transitions \(P,f\) map \((s_t,a_t)\) to the next state and reward \(r_t\)]
Control feedback
- between states and actions
- "reaction"
- studied in control theory ("automatic feedback control")
- our focus for Unit 1
Feedback in RL
[Diagram: the control loop between the policy \(\pi\) and the transitions \(P,f\), plus an outer loop in which experience produces data \((s_t,a_t,r_t)\) that updates the policy]
- Control feedback: between states and actions
Data feedback
- between data and policy
- "adaptation"
- connection to machine learning
- our new focus in Unit 2
Feedback in RL
[Diagram: the control loop between the policy \(\pi\) and the transitions \(P,f\) (unknown in Unit 2), plus the data feedback loop from experience \((s_t,a_t,r_t)\) to the policy]
- Control feedback: between states and actions
- Data feedback: between data and policy
Supervised Learning for MDPs
- Supervised learning: features \(x\) and labels \(y\)
- Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
- Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
- Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
- Important functions in MDPs
- Transitions \(P(s'|s,a)\)
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
- Optimal policy \(\pi^\star(s)\)
Supervised Learning for MDPs
- Supervised learning: features \(x\) and labels \(y\)
- Important functions in MDPs
- Transitions \(P(s'|s,a)\)
- features \(s,a,s'\), sampled outcomes observed
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
- features \(s\) or \(s,a\), labels ?
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
- features \(s\) or \(s,a\), labels ?
- Optimal policy \(\pi^\star(s)\)
- features \(s\), labels from expert, otherwise ?
Supervised Learning for MDPs
- Transitions \(P(s'|s,a)\): MBRL (Lec 11)
- Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\): this week
- Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\): after prelim
- Optimal policy \(\pi^\star(s)\): after prelim
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Data for Learning \(Q^\pi(s,a)\)
- How to construct labels for $$ Q^\pi(s,a) = \mathbb E_{P,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
- Label via rollout:
- start at \((s,a)\), then follow \(\pi\) and observe rewards \(r_0,r_1,\dots\)
- Let label \(y=\sum_{t=0}^\infty \gamma^t r_t\)
- Unbiased because \(\mathbb E[y|s,a] = Q^\pi(s,a)\)
Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
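To make the labeling step concrete, a small sketch (my own, on a randomly generated toy MDP with deterministic rewards) that rolls out \(\pi\) from a given \((s,a)\) and forms the discounted-return label, truncating the infinite sum at a finite horizon:

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # toy transitions
r = rng.uniform(size=(S, A))                 # toy (deterministic) rewards
pi = rng.integers(A, size=S)                 # the policy being evaluated

def rollout_label(s, a, horizon=200):
    """Label y = sum_t gamma^t r_t from (s, a), following pi, truncated at `horizon` steps."""
    y = 0.0
    for t in range(horizon):                 # truncation error is at most gamma**horizon / (1 - gamma)
        y += gamma**t * r[s, a]
        s = rng.choice(S, p=P[s, a])         # s_{t+1} ~ P(s_t, a_t)
        a = pi[s]                            # a_{t+1} = pi(s_{t+1})
    return y

print(rollout_label(s=0, a=1))               # one sample whose expectation is (approximately) Q^pi(0, 1)
```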
Example
[Diagram: two-state MDP with states 0 and 1 and actions stay/switch; edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\)]
- Consider \(\pi(s)=\)stay
Rollouts with
- \(s=0\) and \(a=\)stay
- \(s=0\) and \(a=\)switch
- \(s=1\) and \(a=\)stay
- \(s=1\) and \(a=\)switch
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1\) and \(h_2\) from the geometric distribution with parameter \(1-\gamma\)
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Rollout: \(s_t,\;\; a_t\sim \pi(s_t),\;\; r_t\sim r(s_t, a_t),\;\; s_{t+1}\sim P(s_t, a_t),\;\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
Timestep sampling with discount/geometric distribution: set \(h_i=h\geq 0\) with probability \((1-\gamma)\gamma^h\)
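A possible implementation of this data-collection loop (my own sketch on a toy tabular MDP; the names `sample_horizon` and `collect` are illustrative, and the geometric sampling of \(h_1, h_2\) follows the note above, \(\Pr[h=k]=(1-\gamma)\gamma^k\) for \(k\geq 0\)):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # toy transitions (stand-in for the real MDP)
r = rng.uniform(size=(S, A))                     # toy rewards
mu0 = np.ones(S) / S                             # initial state distribution
pi = rng.integers(A, size=S)                     # deterministic policy to evaluate

def sample_horizon():
    # numpy's geometric distribution starts at 1; subtracting 1 gives Pr[h=k] = (1-gamma) * gamma**k
    return rng.geometric(1 - gamma) - 1

def collect(N):
    data = []
    for _ in range(N):
        h1, h2 = sample_horizon(), sample_horizon()
        s = rng.choice(S, p=mu0)
        for _ in range(h1):                      # roll in h1 steps under pi
            s = rng.choice(S, p=P[s, pi[s]])
        s_i, a_i = s, pi[s]                      # so (s_i, a_i) ~ d^pi_{mu0}
        y, s, a = 0.0, s_i, a_i
        for _ in range(h2 + 1):                  # roll out: sum the h2 + 1 rewards r_{h1}, ..., r_{h1+h2}
            y += r[s, a]
            s = rng.choice(S, p=P[s, a])
            a = pi[s]
        data.append((s_i, a_i, y))               # label y is an unbiased sample of Q^pi(s_i, a_i)
    return data

dataset = collect(N=1000)
```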
Sampling procedure for \(Q^\pi\)
Algorithm: Data collection
- For \(i=1,\dots,N\):
- Sample \(s_0\sim\mu_0\), and sample \(h_1\) and \(h_2\) from the discount distribution
- Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
- Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)
Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\) satisfies:
- States drawn from the discounted state distribution: \(s_i \sim d_{\mu_0}^\pi\)
- Unbiased labels: \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\), since \(\Pr[h_2\geq t]=\gamma^t\), so the truncated undiscounted sum has expectation \(\sum_{t\geq 0}\gamma^t\,\mathbb E[r_t\mid s_i,a_i]\)
- Using sampled data, $$\hat Q^\pi(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
- Assumption: Supervised learning works, i.e. $$\mathbb E_{s,a\sim d_{\mu_0}^\pi }[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon $$
- In practice, often sample data from a single long rollout (rather than resetting to \(\mu_0\) every time)
- This is called "Monte Carlo" supervision
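One way the least-squares step above could look in code (my own sketch; the feature map `phi`, the function name `fit_q`, and the synthetic dataset are hypothetical placeholders for the rollout data and the function class \(\mathcal Q\)):

```python
import numpy as np

def fit_q(dataset, phi):
    """Least-squares fit over linear functions of the features phi(s, a)."""
    X = np.stack([phi(s, a) for s, a, _ in dataset])      # (N, d) feature matrix
    y = np.array([y_i for _, _, y_i in dataset])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)             # argmin_w ||X w - y||^2
    return lambda s, a: phi(s, a) @ w                     # the fitted Q-hat

# Example with one-hot (tabular) features, where the fit reduces to per-(s,a) averaging
S, A = 4, 2
def phi(s, a):
    e = np.zeros(S * A)
    e[s * A + a] = 1.0
    return e

rng = np.random.default_rng(3)
fake_dataset = [(rng.integers(S), rng.integers(A), rng.normal()) for _ in range(100)]
Q_hat = fit_q(fake_dataset, phi)
print(Q_hat(0, 1))
```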
Supervision via Rollouts
Notation note: \(s,a\sim d_{\mu_0}^\pi\) is a compact way of writing \(s\sim d_{\mu_0}^\pi\) and \(a\sim\pi\).
Agenda
1. Recap: MBRL
2. Feedback & Supervision
3. Supervision via Rollouts
4. Approximate Policy Iteration
Approximate Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(i=0,\dots,T-1\):
- Rollout \(\pi_i\) and collect a dataset \(\{s_j,a_j,y_j\}_{j=1}^N\), then compute \(\hat Q^{\pi_i}\) with supervised learning
- Policy Improvement: \(\forall s\), $$\pi_{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
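A self-contained sketch of this loop on a toy tabular MDP (my own illustration; `mc_q` stands in for the rollout + supervised-learning step). One simplification, flagged in the comments: the starting \((s,a)\) pair is drawn uniformly (exploring starts) rather than via the roll-in procedure above, so that every entry of \(\hat Q^{\pi_i}\) receives labels even though the policy is deterministic:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))        # toy MDP, same style as the earlier sketches
r = rng.uniform(size=(S, A))

def mc_q(pi, n_samples=500, horizon=100):
    """Monte Carlo estimate of Q^pi: average truncated discounted returns per (s, a)."""
    Q_sum, Q_cnt = np.zeros((S, A)), np.zeros((S, A))
    for _ in range(n_samples):
        s0, a0 = rng.integers(S), rng.integers(A)  # uniform start pair (exploring-starts simplification)
        s, a, ret = s0, a0, 0.0
        for t in range(horizon):
            ret += gamma**t * r[s, a]
            s = rng.choice(S, p=P[s, a])
            a = pi[s]                              # after the first step, follow pi
        Q_sum[s0, a0] += ret
        Q_cnt[s0, a0] += 1
    return Q_sum / np.maximum(Q_cnt, 1)

pi = rng.integers(A, size=S)                       # pi_0
for i in range(5):                                 # approximate policy iteration
    Q_hat = mc_q(pi)                               # approximate policy evaluation from rollouts
    pi = Q_hat.argmax(axis=1)                      # greedy policy improvement
print("final policy:", pi)
```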
Approximate Policy Iteration
- This is an "on-policy" method because it uses data collected with the current policy \(\pi_i\)
- Recall that PI guarantees monotonic improvement $$Q^{\pi_{i+1}}(s,a)\geq Q^{\pi_i}(s,a)$$
- We assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
- Does Approx PI monotonically improve (approximately)?
- Not necessarily, because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
- Greedy improvement with an approximate Q function can lead to oscillation!
Approximate Policy Iteration
0 | 1 | 2 | 3 |
4 | 5 | 6 | 7 |
8 | 9 | 10 | 11 |
12 | 13 | 14 | 15 |
Example
- Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
- \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
- \(\pi_2\) ends up going right first and then down
- \(\pi_3\) will oscillate back to \(\pi_1\)!
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
Performance Difference
The advantage function:
- "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
- What can we say about \(A^{\pi^\star}\)? PollEV
- Recall the state distribution for a policy \(\pi\) (Lecture 2)
- \( d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\} \)
- We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
- The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}} $$
- When the initial state is fixed to a known \(s_0\), i.e. \(\mu_0=e_{s_0}\) we write \(d_{s_0,t}^{\pi}\)
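Aside: since \(\gamma<1\), the geometric series defining \(d_{\mu_0}^{\pi}\) above can be summed in closed form, which gives a convenient way to compute it exactly in the tabular setting: $$d_{\mu_0}^{\pi} = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t (P_\pi^t)^\top \mu_0 = (1-\gamma)\,(I-\gamma P_\pi^\top)^{-1}\mu_0$$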
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
Proof of PDL
- \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
- \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
- \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]\)
- Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
- Statement follows by letting \(k\to\infty\): the first term vanishes since \(\gamma^k\to 0\) and the values are bounded, and the remaining sum equals \(\frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}\left[\mathbb E_{a\sim\pi(s)}[A^{\pi'}(s,a)]\right]\) by definition of the discounted state distribution.
Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right] $$
where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)
Performance Difference
- Notice that \(\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)\)
- Can use PDL to show that PI has monotonic improvement (sketch below)
- Next time, we use insights from PDL to develop a better algorithm
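Sketch of that argument (assuming exact policy evaluation, as in PI): since \(\pi_{t+1}(s)=\arg\max_a Q^{\pi_t}(s,a)\), we have \(A^{\pi_t}(s,\pi_{t+1}(s)) = \max_a Q^{\pi_t}(s,a) - V^{\pi_t}(s)\geq 0\) for every \(s\). Applying the PDL with \(\pi=\pi_{t+1}\) and \(\pi'=\pi_t\) gives, for every \(s_0\), $$V^{\pi_{t+1}}(s_0)-V^{\pi_t}(s_0) = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d^{\pi_{t+1}}_{s_0}}\left[A^{\pi_t}(s,\pi_{t+1}(s))\right]\geq 0$$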
Recap
- PSet 4
- Prelim in class 3/15
- Supervision via Rollouts
- Performance Difference Lemma
- Next lecture: Conservative Policy Iteration
Sp23 CS 4/5789: Lecture 12
By Sarah Dean