CS 4/5789: Introduction to Reinforcement Learning

Lecture 14: Fitted Policy Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • Friday: PSet 4 due, PSet 5 released
    • PA 3 released this week, due in two weeks
    • 5789: Paper assignments posted on Canvas
  • Prelims
    • Regrade requests open Thurs-Mon
    • Corrections: assigned in late April, graded like a PSet
      • for each problem, your final score will be calculated as
            initial score \(+~ \alpha \times (\)corrected score \( - \) initial score\()_+\)

Prelim Grades

  • Raw scores (out of 76 points) \(\neq\) letter grades!
    • A range: 65+ points
    • B range: 50+ points
    • C range: 40+ points
  • Will take into consideration "outlier" exams when determining final overall letter grade

Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Fitted Q-Value Iteration

  • Input: dataset \(\tau \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (mapping \(\mathcal S\times \mathcal A\to \mathbb R\))
  • For \(k=0,\dots,K-1\): $$Q^{k+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a))\right)^{2}$$
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^K(s,a)\) for all \(s\)

Fixed point iteration using a fixed dataset and supervised learning

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathscr F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • For VI, we construct such a dataset from \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\):

Supervised Learning for VI

  • features \(x=(s_0,a_0)\), label \(y=r(s_0,a_0)+\gamma \max _{a} Q^k (s_{1}, a)\)
  • features \(x=(s_1,a_1)\), label \(y=r(s_1,a_1)+\gamma \max _{a} Q^k (s_{2}, a)\)
  • features \(x=(s_2,a_2)\), label \(y=r(s_2,a_2)+\gamma \max _{a} Q^k (s_{3}, a)\)
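A minimal sketch of how this loop might look in code, assuming a linear function class \(Q(s,a)=\phi(s,a)^\top w\); the dataset `data`, feature map `phi`, and action set `actions` are hypothetical placeholders, and `np.linalg.lstsq` stands in for whatever regression routine the class \(\mathscr F\) corresponds to.

```python
import numpy as np

# Minimal sketch of Fitted Q-Value Iteration with a linear function class
# Q(s, a) = phi(s, a) @ w.  The offline dataset `data`, feature map `phi`, and
# finite action set `actions` are assumed inputs.

def fitted_q_iteration(data, phi, actions, gamma=0.99, K=50):
    """data: list of (s, a, r, s_next) tuples collected by some behavior policy."""
    X = np.array([phi(s, a) for s, a, _, _ in data])          # regression features x_i
    w = np.zeros(X.shape[1])                                   # parameters of Q^0
    for _ in range(K):
        # labels y_i = r(s_i, a_i) + gamma * max_a' Q^k(s_{i+1}, a')
        y = np.array([r + gamma * max(phi(s_next, a2) @ w for a2 in actions)
                      for _, _, r, s_next in data])
        w, *_ = np.linalg.lstsq(X, y, rcond=None)              # least-squares refit
    # greedy policy w.r.t. the final Q estimate
    return lambda s: max(actions, key=lambda a: phi(s, a) @ w)
```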

Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Recall: Policy Iteration

Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\)
  • For \(k=0,\dots,K-1\):
    • Compute \(V^{\pi^k}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi^{k+1}(s)=\arg\max_{a\in\mathcal A} r(s,a)+\gamma \mathbb E_{s'\sim P(s,a)}[V^{\pi^k}(s')]$$

Approximate Policy Evaluation:

  • Initialize \(V_0\). For \(j=0,1,\dots, M\):
    • \(V_{j+1} = R^{\pi} + \gamma P^{\pi} V_j\)

Exact Policy Evaluation:

  • \(V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}\)
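For reference, a tabular sketch of both policy-evaluation routines above (exact via the linear solve, approximate via fixed-point iteration); the arrays `P`, `r`, and `pi` are assumed inputs.

```python
import numpy as np

# Tabular sketch of both routines above.  P has shape (S, A, S), r has shape
# (S, A), and pi has shape (S, A) with rows summing to one; all are assumed inputs.

def exact_policy_eval(P, r, pi, gamma):
    S = r.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)     # P^pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    R_pi = (pi * r).sum(axis=1)               # R^pi[s]     = sum_a pi(a|s) r(s, a)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # (I - gamma P^pi)^{-1} R^pi

def approx_policy_eval(P, r, pi, gamma, M=100):
    P_pi = np.einsum("sa,sat->st", pi, P)
    R_pi = (pi * r).sum(axis=1)
    V = np.zeros(r.shape[0])
    for _ in range(M):                        # V_{j+1} = R^pi + gamma P^pi V_j
        V = R_pi + gamma * P_pi @ V
    return V
```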

Q-Policy Iteration

  • Initialize \(\pi^0:\mathcal S\to\mathcal A\). For \(k=0,\dots,K-1\):
    • Compute \(Q^{\pi^k}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi^{k+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi^k}(s,a)$$

Policy Evaluation:

  • Recall definition \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s, a_0=a,P,\pi]\)
  • Bellman Consistency Equation: \(Q^{\pi}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)}\atop{ a'\sim \pi(s')}}{\mathbb E}[Q^{\pi}(s',a')]\)
  • Exact: solve the BCE for all \(s,a\) (a system of \(SA\) linear equations)
  • Approximate: initialize \(Q^0\). For \(j=0,1,\dots, M\):
    • \(Q^{j+1}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)},\, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]\)
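A short tabular sketch of the "exact" option, treating the Bellman consistency equations as one linear system in the \(SA\) unknowns \(Q^\pi(s,a)\); `P`, `r`, and `pi` are the same assumed arrays as before.

```python
import numpy as np

# Exact Q^pi evaluation: solve the S*A Bellman consistency equations at once.
# P (S, A, S), r (S, A), and pi (S, A) are assumed inputs.

def exact_q_policy_eval(P, r, pi, gamma):
    S, A = r.shape
    # M[(s,a), (s',a')] = P(s'|s, a) * pi(a'|s')
    M = np.einsum("sat,tb->satb", P, pi).reshape(S * A, S * A)
    q = np.linalg.solve(np.eye(S * A) - gamma * M, r.reshape(S * A))
    return q.reshape(S, A)                     # Q^pi as an (S, A) table
```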

Brainstorm!

Fitted Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Fitted Policy Evaluation: find \(\hat Q^{\pi_i}\) with supervised learning
    • Incremental Policy Improvement: \(\forall s\), with greedy policy \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\):  $$\pi_{i+1}(a\mid s) = (1-\alpha) \pi_{i}(a\mid s) + \alpha \bar \pi(a\mid s)$$


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

 3. Fitted Approx. Policy Evaluation 

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Fitted Approx. Policy Eval.

Approximate Policy Evaluation:

  • Input: policy \(\pi\). Initialize \(Q^0\).
  • For \(j=0,1,\dots, M\):
    • \(Q^{j+1}(s,a) = r(s,a) + \gamma \underset{{s'\sim P(s,a)},\, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]\)
  • Return \(Q^{M}(s,a)\)

  • We can't compute the expectation if the model is unknown or intractable
  • Instead, we can "roll out" \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\sim \rho_{\pi}$$

Fitted Approx. Policy Eval.

  • We can't compute the expectation if the model is unknown or intractable
  • Instead, we can "roll out" \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots, s_N,a_N \right\}\sim \rho_{\pi}$$
  • Ideally, we want to have $$Q^{j+1}(s, a) \approx  r(s,a) + \gamma \underset{{s'\sim P(s,a)}, { a'\sim \pi(s')}}{\mathbb E}[Q^{j}(s',a')]$$
  • Note that the RHS can also be written as $$\mathbb{E}\left[r\left(s, a\right)+\gamma Q^j\left(s', a^{\prime}\right) \mid s, a\right]$$
  • How to choose \(x\) and \(y\) for supervised learning?
    • \(x=\left(s_i, a_i\right)\) and \(y=r\left(s_i, a_i\right)+\gamma Q^j\left(s_{i+1}, a_{i+1}\right)\)
  • Then the least-squares fit gives \(Q^{j+1}(s, a)=f(x)\approx\mathbb{E}[y \mid x]\)

Fitted Approx. Policy Eval.

Rollout of \(\pi\): \(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\ \dots\)

  • This is an "on policy" method because it uses data collected with the policy \(\pi\)
  • Also called "temporal difference" learning

Fitted Approximate Policy Evaluation (TD):

  • Initialize \(Q^0\).
  • Roll out \(\pi\) to get data \(\tau\sim \rho_{\pi}\)
  • For \(j=0,1,\dots, M\):
    • \(Q^{j+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma Q^j (s_{i+1}, a_{i+1}))\right)^{2}\)
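A minimal sketch of this TD loop with a linear function class \(Q(s,a)=\phi(s,a)^\top w\); the rollout `traj`, the rewards, and the feature map `phi` are assumed inputs.

```python
import numpy as np

# Minimal sketch of fitted approximate policy evaluation (TD) with a linear
# function class Q(s, a) = phi(s, a) @ w.  The rollout
# traj = [(s_0, a_0), ..., (s_N, a_N)], the rewards r(s_i, a_i), and the
# feature map `phi` are assumed inputs.

def fitted_td_policy_eval(traj, rewards, phi, gamma=0.99, M=50):
    X = np.array([phi(s, a) for s, a in traj[:-1]])       # features for (s_i, a_i)
    X_next = np.array([phi(s, a) for s, a in traj[1:]])   # features for (s_{i+1}, a_{i+1})
    r = np.asarray(rewards)[:-1]
    w = np.zeros(X.shape[1])                              # parameters of Q^0
    for _ in range(M):
        y = r + gamma * X_next @ w                        # TD targets using Q^j
        w, *_ = np.linalg.lstsq(X, y, rcond=None)         # least-squares refit -> Q^{j+1}
    return lambda s, a: phi(s, a) @ w                     # estimate of Q^pi
```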


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Direct Supervision of \(Q^\pi\)

  • In policy evaluation, what we really want to compute is $$ Q^\pi(s,a) = \mathbb E\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
  • We roll out \(\pi\) to collect data $$\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots \right\}\sim \rho_{\pi}$$
  • Why not just set
    • \(x=\left(s_i, a_i\right)\) and \(y=\sum_{\ell=i}^\infty \gamma^{\ell-i} r\left(s_\ell, a_\ell\right)\)
  • Would require infinitely long data collection!

  • New idea: \(x_i=\left(s_i, a_i\right)\)
    • Sample timestep \(h\) w.p. \((1-\gamma)\gamma^h\)
    • \(y_i=\sum_{\ell=i}^{i+h} r\left(s_\ell, a_\ell\right)\)
  • Claim: this is a good label, i.e. \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\) (a short check follows this list)
  • "On policy" method because it uses data collected with the policy \(\pi\)
  • Also called "Monte Carlo" sampling
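To see why the claim holds, note that for a geometric horizon \(h\) with \(\Pr(h=k)=(1-\gamma)\gamma^k\) we have \(\Pr(h\ge k)=\gamma^k\), so
$$\mathbb E_h\Big[\sum_{\ell=i}^{i+h} r(s_\ell,a_\ell)\Big] = \sum_{k=0}^{\infty}\Pr(h\ge k)\, r(s_{i+k},a_{i+k}) = \sum_{k=0}^{\infty}\gamma^{k}\, r(s_{i+k},a_{i+k}),$$
and taking the expectation over the rollout of \(\pi\) gives \(Q^\pi(s_i,a_i)\).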

Fitted Direct Policy Evaluation (MC):

  • Roll out \(\pi\) to get data \(\tau\sim \rho_{\pi}\)
  • For \(i=0,\dots,N-1\), sample \(h_i\sim \mathrm{Geom}(1-\gamma)\)
  • Return \(\displaystyle \hat Q^{\pi}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \Big(f(s_i,a_i)-\sum_{\ell=i}^{i+h_i} r\left(s_\ell, a_\ell\right)\Big)^{2}\)
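A minimal sketch of the MC variant, with the same linear function class and the geometric horizon sampling above; `traj`, `rewards`, and `phi` are assumed inputs.

```python
import numpy as np

# Minimal sketch of fitted direct policy evaluation (MC).  `traj`, `rewards`,
# and `phi` are assumed inputs; the rollout should be long enough that
# i + h_i stays in range (labels that run off the end are dropped).

def fitted_mc_policy_eval(traj, rewards, phi, gamma=0.99, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    rewards = np.asarray(rewards)
    X, y = [], []
    for i, (s, a) in enumerate(traj):
        h = rng.geometric(1 - gamma) - 1            # P(h = k) = (1 - gamma) * gamma^k
        if i + h >= len(rewards):                   # drop labels that run off the rollout
            continue
        X.append(phi(s, a))
        y.append(rewards[i : i + h + 1].sum())      # y_i = sum_{l=i}^{i+h} r(s_l, a_l)
    w, *_ = np.linalg.lstsq(np.array(X), np.array(y), rcond=None)
    return lambda s, a: phi(s, a) @ w               # estimate of Q^pi
```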


Fitted Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Fitted Policy Evaluation: find \(\hat Q^{\pi_i}\) with supervised learning (TD or MC)
      • Note: this requires rolling out \(\pi_i\)!
    • Incremental Policy Improvement: \(\forall s\), with greedy policy \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\):  $$\pi_{i+1}(a\mid s) = (1-\alpha) \pi_{i}(a\mid s) + \alpha \bar \pi(a\mid s)$$
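Putting the pieces together, a sketch of the full loop; the `rollout` routine, the evaluation routine (either sketch above), `phi`, and a finite set of hashable states and actions are all assumptions.

```python
import numpy as np
from collections import defaultdict

# Sketch of the fitted PI loop with incremental (soft) policy improvement.
# `rollout(pi, N)` -> (traj, rewards), the evaluation routine (TD or MC sketch
# above), `phi`, and hashable discrete states/actions are assumed.

def fitted_policy_iteration(rollout, fitted_policy_eval, phi, actions,
                            alpha=0.1, T=20, N=1000):
    A = len(actions)
    # pi[s] is a dict action -> probability; start from the uniform policy
    pi = defaultdict(lambda: {a: 1.0 / A for a in actions})

    for _ in range(T):
        traj, rewards = rollout(pi, N)                    # on-policy data from pi_i
        Q_hat = fitted_policy_eval(traj, rewards, phi)    # fitted evaluation of pi_i
        for s in set(s for s, _ in traj):                 # states visited by pi_i
            greedy = max(actions, key=lambda a: Q_hat(s, a))
            for a in actions:
                # incremental improvement: pi_{i+1} = (1 - alpha) pi_i + alpha * greedy
                pi[s][a] = (1 - alpha) * pi[s][a] + alpha * (1.0 if a == greedy else 0.0)
    return pi
```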


Agenda

1. Recap: Fitted VI

2. Fitted Policy Iteration

3. Fitted Approx. Policy Evaluation

4. Fitted Direct Policy Evaluation

5. Performance Difference Lemma

Example: Fitted PI Failure

4×4 gridworld, states numbered:

0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15

  • Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
  • \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
  • Greedy \(\pi_2\) ends up going right first and then down
  • \(\pi_3\) will oscillate back to \(\pi_1\)!

Why incremental updates in Fitted PI?

Advantage Function

  • Recall that PI guarantees monotonic improvement
  • Let's understand more about why
  • Define: the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)

    • "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
    • What can we say about \(A^{\pi^\star}\)? PollEV
  • Notice that the Policy Improvement step can be written in terms of the advantage function (since \(V^\pi(s)\) does not depend on \(a\)): $$\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)$$

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)

Performance Difference

  • Advantage of \(\pi\) over \(\pi'\) on distribution of \(\pi\)
  • Can use PDL to show that PI has monotonic improvement
    • Set \(\pi'=\pi^k\) and \(\pi=\pi^{k+1}\): since \(\pi^{k+1}\) is greedy w.r.t. \(Q^{\pi^k}\), every advantage term \(A^{\pi^k}(s,\pi^{k+1}(s))\ge 0\), so \(V^{\pi^{k+1}}(s_0)\ge V^{\pi^k}(s_0)\)
  • Fitted PI does not have the same guarantee
  • Incremental policy updates keep \(d_{\mu_0}^{\pi_{k+1}}\) close to \(d_{\mu_0}^{\pi_{k}}\)
  • Advantage function helps us understand value difference

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

Proof of PDL

  • \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]\)
  • Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
  • Statement follows by letting \(k\to\infty\): the first term vanishes, and \(\sum_{\ell=0}^{\infty}\gamma^\ell \,\mathbb E_{s_\ell\sim d^\pi_{s_0,\ell}}[\cdot] = \frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}[\cdot]\) since \(d^\pi_{s_0} = (1-\gamma)\sum_{\ell=0}^\infty \gamma^\ell d^\pi_{s_0,\ell}\).
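A quick numerical sanity check of the PDL on a randomly generated tabular MDP (all quantities below are synthetic, not taken from the lecture):

```python
import numpy as np

# Numerical sanity check of the Performance Difference Lemma on a random MDP.

rng = np.random.default_rng(0)
S, A, gamma, s0 = 5, 3, 0.9, 0

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)     # transitions
r = rng.random((S, A))                                            # rewards
pi  = rng.random((S, A)); pi  /= pi.sum(axis=1, keepdims=True)    # policy pi
pi2 = rng.random((S, A)); pi2 /= pi2.sum(axis=1, keepdims=True)   # policy pi'

def evaluate(pol):
    P_pol = np.einsum("sa,sat->st", pol, P)
    R_pol = (pol * r).sum(axis=1)
    V = np.linalg.solve(np.eye(S) - gamma * P_pol, R_pol)         # exact V^pol
    Q = r + gamma * P @ V                                          # Q^pol(s, a)
    return V, Q

V_pi, _ = evaluate(pi)
V_pi2, Q_pi2 = evaluate(pi2)
A_pi2 = Q_pi2 - V_pi2[:, None]                                     # advantage of pi'

# discounted state distribution of pi from s0: d = (1-gamma) e_{s0}^T (I - gamma P^pi)^{-1}
P_pi = np.einsum("sa,sat->st", pi, P)
e0 = np.zeros(S); e0[s0] = 1.0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, e0)

lhs = V_pi[s0] - V_pi2[s0]
rhs = (1 / (1 - gamma)) * d @ (pi * A_pi2).sum(axis=1)
print(np.isclose(lhs, rhs))                                        # True: PDL holds
```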

Recap

  • PSet 4 due Friday
  • Prelim grades released

 

  • Fitted PI via Fitted PE
  • Performance Difference Lemma

 

  • Next lecture: from Learning to Optimization