CS 4/5789: Introduction to Reinforcement Learning

Lecture 14: Fitted Policy Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • Friday: PSet 4 due, PSet 5 released
    • PA 3 released this week, due in two weeks
    • 5789: Paper assignments posted on Canvas
  • Prelims
    • Regrade requests open Thurs-Mon
    • Corrections: assigned in late April, graded like a PSet
      • for each problem, your final score will be calculated as
            initial score \(+~ \alpha \times (\)corrected score \( - \) initial score\()_+\)

Prelim Grades

  • Raw scores (out of 76 points) \(\neq\) letter grades!
    • A range: 65+ points
    • B range: 50+ points
    • C range: 40+ points
  • Will take into consideration "outlier" exams when determining final overall letter grade

Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Fitted Q-Value Iteration

  • Input: dataset \(\tau \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (mapping \(\mathcal S\times \mathcal A\to \mathbb R\))
  • For \(k=0,\dots,K-1\): $$Q^{k+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a))\right)^{2}$$
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^K(s,a)\) for all \(s\)

Fixed point iteration using a fixed dataset and supervised learning

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathscr F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • For VI, we construct such a dataset from \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\):

Supervised Learning for VI

  • features \(x=(s_0,a_0)\), label \(y=r(s_0,a_0)+\gamma \max _{a} Q^k (s_{1}, a)\)
  • features \(x=(s_1,a_1)\), label \(y=r(s_1,a_1)+\gamma \max _{a} Q^k (s_{2}, a)\)
  • features \(x=(s_2,a_2)\), label \(y=r(s_2,a_2)+\gamma \max _{a} Q^k (s_{3}, a)\)
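A minimal sketch of how this loop might look in code, assuming a linear function class \(Q(s,a)=\phi(s,a)^\top w\); the dataset `data`, feature map `phi`, and action set `actions` are hypothetical placeholders, and `np.linalg.lstsq` stands in for whatever regression routine the class \(\mathscr F\) corresponds to.

```python
import numpy as np

# Minimal sketch of Fitted Q-Value Iteration with a linear function class
# Q(s, a) = phi(s, a) @ w.  The offline dataset `data`, feature map `phi`, and
# finite action set `actions` are assumed inputs.

def fitted_q_iteration(data, phi, actions, gamma=0.99, K=50):
    """data: list of (s, a, r, s_next) tuples collected by some behavior policy."""
    X = np.array([phi(s, a) for s, a, _, _ in data])          # regression features x_i
    w = np.zeros(X.shape[1])                                   # parameters of Q^0
    for _ in range(K):
        # labels y_i = r(s_i, a_i) + gamma * max_a' Q^k(s_{i+1}, a')
        y = np.array([r + gamma * max(phi(s_next, a2) @ w for a2 in actions)
                      for _, _, r, s_next in data])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)              # least-squares refit
    # greedy policy w.r.t. the final Q estimate
    return lambda s: max(actions, key=lambda a: phi(s, a) @ w)
```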

Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Recall: Policy Iteration

Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\)
  • For \(k=0,\dots,K-1\):
    • Compute \(V^{\pi^k}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi^{k+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi^k}(s')]$$

Approximate Policy Evaluation:

  • Initialize \(V_0\). For \(j=0,1,\dots, M\):
    • \(V_{j+1} = R^{\pi} + \gamma P^{\pi} V_j\)

Exact Policy Evaluation:

  • \(V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}\)
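For reference, a tabular sketch of both policy-evaluation routines above (exact via the linear solve, approximate via fixed-point iteration); the arrays `P`, `r`, and `pi` are assumed inputs.

```python
import numpy as np

# Tabular sketch of both routines above.  P has shape (S, A, S), r has shape
# (S, A), and pi has shape (S, A) with rows summing to one; all are assumed inputs.

def exact_policy_eval(P, r, pi, gamma):
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)     # P^pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    R_pi = (pi * r).sum(axis=1)               # R^pi[s]     = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # (I - gamma P^pi)^{-1} R^pi

def approx_policy_eval(P, r, pi, gamma, M=100):
    P_pi = np.einsum("sa,sat->st", pi, P)
    R_pi = (pi * r).sum(axis=1)
    V = np.zeros(r.shape[0])
    for _ in range(M):                        # V_{j+1} = R^pi + gamma P^pi V_j
        V = R_pi + gamma * P_pi @ V
    return V
```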

Q-Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\). For \(k=0,\dots,K-1\):
    • Compute \(Q^{\pi^k}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi^{k+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi^k}(s,a)$$

Policy Evaluation:

  • Recall definition \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s, a_0=a,P,\pi]\)
  • Bellman Consistency Equation: \(Q^{\pi}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)}\atop{ a'\sim \pi(s')}}{\mathbb E}[Q^{\pi}(s',a')]\)
  • Exact: solve the BCE for all \(s,a\) (a system of \(SA\) linear equations)
  • Approximate: initialize \(Q^0\). For \(j=0,1,\dots, M\):
    • \(Q^{j+1}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)},\, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]\)
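A short tabular sketch of the "exact" option, treating the Bellman consistency equations as one linear system in the \(SA\) unknowns \(Q^\pi(s,a)\); `P`, `r`, and `pi` are the same assumed arrays as before.

```python
import numpy as np

# Exact Q^pi evaluation: solve the S*A Bellman consistency equations at once.
# P (S, A, S), r (S, A), and pi (S, A) are assumed inputs.

def exact_q_policy_eval(P, r, pi, gamma):
    S, A = r.shape
    # M[(s,a), (s',a')] = P(s'|s, a) * pi(a'|s')
    M = np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * M, r.reshape(S * A))
    return q.reshape(S, A)                     # Q^pi as an (S, A) table
```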

Brainstorm!

Fitted Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Fitted Policy Evaluation: find \(\hat Q^{\pi_i}\) with supervised learning
    • Incremental Policy Improvement: \(\forall s\), with greedy policy \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\):  $$\pi_{i+1}(a\mid s) = (1-\alpha) \pi_{i}(a\mid s) + \alpha \bar \pi(a\mid s)$$


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

 3. Fitted Approx. Policy Evaluation 

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Fitted Approx. Policy Eval.

Approximate Policy Evaluation:

  • Input: policy \(\pi\). Initialize \(Q^0\).
  • For \(j=0,1,\dots, M\):
    • \(Q^{j+1}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)},\, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]\)
  • Return \(Q^{M}(s,a)\)

  • We can't compute the expectation if the model is unknown or intractable
  • Instead, we can "roll out" \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\sim \rho_{\pi}$$

Fitted Approx. Policy Eval.

  • We can't compute the expectation if the model is unknown or intractable
  • Instead, we can "roll out" \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\sim \rho_{\pi}$$
  • Ideally, we want to have $$Q^{j+1}(s, a) \approx  r(s,a) + \gamma \underset{{s'\sim P(s,a)}, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]$$
  • Note that the RHS can also be written as $$\mathbb{E}\left[r\left(s, a\right)+\gamma Q^j\left(s', a^{\prime}\right) \mid s, a\right]$$
  • How to choose \(x\) and \(y\) for supervised learning?
    • \(x=\left(s_i, a_i\right)\) and \(y=r\left(s_i, a_i\right)+\gamma Q^j\left(s_{i+1}, a_{i+1}\right)\)
  • Then the least-squares fit gives \(Q^{j+1}(s, a)=f(x)\approx\mathbb{E}[y \mid x]\)

Fitted Approx. Policy Eval.

Rollout of \(\pi\): \(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\ \dots\)

  • This is an "on policy" method because it uses data collected with the policy \(\pi\)
  • Also called "temporal difference" learning

Fitted Approximate Policy Evaluation (TD):

  • Initialize \(Q^0\).
  • Roll out \(\pi\) to get data \(\tau\sim \rho_{\pi}\)
  • For \(j=0,1,\dots, M\):
    • \(Q^{j+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma Q^j (s_{i+1}, a_{i+1}))\right)^{2}\)
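A minimal sketch of this TD loop with a linear function class \(Q(s,a)=\phi(s,a)^\top w\); the rollout `traj`, the rewards, and the feature map `phi` are assumed inputs.

```python
import numpy as np

# Minimal sketch of fitted approximate policy evaluation (TD) with a linear
# function class Q(s, a) = phi(s, a) @ w.  The rollout
# traj = [(s_0, a_0), ..., (s_N, a_N)], the rewards r(s_i, a_i), and the
# feature map `phi` are assumed inputs.

def fitted_td_policy_eval(traj, rewards, phi, gamma=0.99, M=50):
    X = np.array([phi(s, a) for s, a in traj[:-1]])       # features for (s_i, a_i)
    X_next = np.array([phi(s, a) for s, a in traj[1:]])   # features for (s_{i+1}, a_{i+1})
    r = np.asarray(rewards)[:-1]
    w = np.zeros(X.shape[1])                              # parameters of Q^0
    for _ in range(M):
        y = r + gamma * X_next @ w                        # TD targets using Q^j
        w, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares refit -> Q^{j+1}
    return lambda s, a: phi(s, a) @ w                     # estimate of Q^pi
```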


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Direct Supervision of \(Q^\pi\)

  • In policy evaluation, what we really want to compute is $$ Q^\pi(s,a) = \mathbb E\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
  • We roll out \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots \right\}\sim \rho_{\pi}$$
  • Why not just set
    • \(x=\left(s_i, a_i\right)\) and \(y=\sum_{\ell=i}^\infty \gamma^{\ell-i} r\left(s_\ell, a_\ell\right)\)
  • Would require infinitely long data collection!

  • New idea: \(x_i=\left(s_i, a_i\right)\)
    • Sample timestep \(h\) w.p. \((1-\gamma)\gamma^h\)
    • \(y_i=\sum_{\ell=i}^{i+h} r\left(s_\ell, a_\ell\right)\)
  • Claim: this is a good label, i.e. \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\) (a short check follows this list)
  • "On policy" method because it uses data collected with the policy \(\pi\)
  • Also called "Monte Carlo" sampling
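To see why the claim holds, note that for a geometric horizon \(h\) with \(\Pr(h=k)=(1-\gamma)\gamma^k\) we have \(\Pr(h\ge k)=\gamma^k\), so
$$\mathbb E_h\Big[\sum_{\ell=i}^{i+h} r(s_\ell,a_\ell)\Big] = \sum_{k=0}^{\infty}\Pr(h\ge k)\, r(s_{i+k},a_{i+k}) = \sum_{k=0}^{\infty}\gamma^{k}\, r(s_{i+k},a_{i+k}),$$
and taking the expectation over the rollout of \(\pi\) gives \(Q^\pi(s_i,a_i)\).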

Fitted Direct Policy Evaluation (MC):

  • Roll out \(\pi\) to get data \(\tau\sim \rho_{\pi}\)
  • For \(i=0,\dots,N-1\), sample \(h_i\sim \mathrm{Geom}(1-\gamma)\)
  • Return \(\displaystyle \hat Q^{\pi}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \Big(f(s_i,a_i)-\sum_{\ell=i}^{i+h_i} r\left(s_\ell, a_\ell\right)\Big)^{2}\)
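A minimal sketch of the MC variant, with the same linear function class and the geometric horizon sampling above; `traj`, `rewards`, and `phi` are assumed inputs.

```python
import numpy as np

# Minimal sketch of fitted direct policy evaluation (MC).  `traj`, `rewards`,
# and `phi` are assumed inputs; the rollout should be long enough that
# i + h_i stays in range (labels that run off the end are dropped).

def fitted_mc_policy_eval(traj, rewards, phi, gamma=0.99, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    rewards = np.asarray(rewards)
    X, y = [], []
    for i, (s, a) in enumerate(traj):
        h = rng.geometric(1 - gamma) - 1            # P(h = k) = (1 - gamma) * gamma^k
        if i + h >= len(rewards):                   # drop labels that run off the rollout
            continue
        X.append(phi(s, a))
        y.append(rewards[i : i + h + 1].sum())      # y_i = sum_{l=i}^{i+h} r(s_l, a_l)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return lambda s, a: phi(s, a) @ w               # estimate of Q^pi
```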


Fitted Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Fitted Policy Evaluation: find \(\hat Q^{\pi_i}\) with supervised learning (TD or MC)
      • Note: this requires rolling out \(\pi_i\)!
    • Incremental Policy Improvement: \(\forall s\), with greedy policy \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\):  $$\pi_{i+1}(a\mid s) = (1-\alpha) \pi_{i}(a\mid s) + \alpha \bar \pi(a\mid s)$$
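Putting the pieces together, a sketch of the full loop; the `rollout` routine, the evaluation routine (either sketch above), `phi`, and a finite set of hashable states and actions are all assumptions.

```python
import numpy as np
from collections import defaultdict

# Sketch of the fitted PI loop with incremental (soft) policy improvement.
# `rollout(pi, N)` -> (traj, rewards), the evaluation routine (TD or MC sketch
# above), `phi`, and hashable discrete states/actions are assumed.

def fitted_policy_iteration(rollout, fitted_policy_eval, phi, actions,
                            alpha=0.1, T=20, N=1000):
    A = len(actions)
    # pi[s] is a dict action -> probability; start from the uniform policy
    pi = defaultdict(lambda: {a: 1.0 / A for a in actions})

    for _ in range(T):
        traj, rewards = rollout(pi, N)                    # on-policy data from pi_i
        Q_hat = fitted_policy_eval(traj, rewards, phi)    # fitted evaluation of pi_i
        for s in set(s for s, _ in traj):                 # states visited by pi_i
            greedy = max(actions, key=lambda a: Q_hat(s, a))
            for a in actions:
                # incremental improvement: pi_{i+1} = (1 - alpha) pi_i + alpha * greedy
                pi[s][a] = (1 - alpha) * pi[s][a] + alpha * (1.0 if a == greedy else 0.0)
    return pi
```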


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Example: Fitted PI Failure

4×4 gridworld, states numbered:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

  • Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
  • \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
  • Greedy \(\pi_2\) ends up going right first and then down
  • \(\pi_3\) will oscillate back to \(\pi_1\)!

Why incremental updates in Fitted PI?

Advantage Function

  • Recall that PI guarantees monotonic improvement
  • Let's understand more about why
  • Define: the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)

    • "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
    • What can we say about \(A^{\pi^\star}\)? PollEV
  • Notice that the Policy Improvement step can be written in terms of the advantage function (since \(V^\pi(s)\) does not depend on \(a\)): $$\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)$$

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)

Performance Difference

  • Advantage of \(\pi\) over \(\pi'\) on distribution of \(\pi\)
  • Can use PDL to show that PI has monotonic improvement
    • Set \(\pi'=\pi^k\) and \(\pi=\pi^{k+1}\): since \(\pi^{k+1}\) is greedy w.r.t. \(Q^{\pi^k}\), every advantage term \(A^{\pi^k}(s,\pi^{k+1}(s))\ge 0\), so \(V^{\pi^{k+1}}(s_0)\ge V^{\pi^k}(s_0)\)
  • Fitted PI does not have the same guarantee
  • Incremental policy updates keep \(d_{\mu_0}^{\pi_{k+1}}\) close to \(d_{\mu_0}^{\pi_{k}}\)
  • Advantage function helps us understand value difference

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

Proof of PDL

  • \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]\)
  • Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
  • Statement follows by letting \(k\to\infty\): the first term vanishes, and \(\sum_{\ell=0}^{\infty}\gamma^\ell \,\mathbb E_{s_\ell\sim d^\pi_{s_0,\ell}}[\cdot] = \frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}[\cdot]\) since \(d^\pi_{s_0} = (1-\gamma)\sum_{\ell=0}^\infty \gamma^\ell d^\pi_{s_0,\ell}\).
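A quick numerical sanity check of the PDL on a randomly generated tabular MDP (all quantities below are synthetic, not taken from the lecture):

```python
import numpy as np

# Numerical sanity check of the Performance Difference Lemma on a random MDP.

rng = np.random.default_rng(0)
S, A, gamma, s0 = 5, 3, 0.9, 0

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)     # transitions
r = rng.random((S, A))                                            # rewards
pi  = rng.random((S, A)); pi  /= pi.sum(axis=1, keepdims=True)    # policy pi
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)   # policy pi'

def evaluate(pol):
    P_pol = np.einsum("sa,sat->st", pol, P)
    R_pol = (pol * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pol, R_pol)         # exact V^pol
    Q = r + gamma * P @ V                                          # Q^pol(s, a)
    return V, Q

V_pi, _ = evaluate(pi)
V_pi2, Q_pi2 = evaluate(pi2)
A_pi2 = Q_pi2 - V_pi2[:, None]                                     # advantage of pi'

# discounted state distribution of pi from s0: d = (1-gamma) e_{s0}^T (I - gamma P^pi)^{-1}
P_pi = np.einsum("sa,sat->st", pi, P)
e0 = np.zeros(S); e0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)

lhs = V_pi[s0] - V_pi2[s0]
rhs = (1 / (1 - gamma)) * d @ (pi * A_pi2).sum(axis=1)
print(np.isclose(lhs, rhs))                                        # True: PDL holds
```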

Recap

  • PSet 4 due Friday
  • Prelim grades released

 

  • Fitted PI via Fitted PE
  • Performance Difference Lemma

 

  • Next lecture: from Learning to Optimization