CS 4/5789: Introduction to Reinforcement Learning

Lecture 13: Conservative Policy Iteration

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PSet 4 released, due Monday
    • 5789 Paper Reviews due weekly starting Monday
  • Midterm 3/15 during lecture
    • Let us know about conflicts/accommodations ASAP! (EdStem)
    • Review Lecture on Monday 3/13 (last year's slides/recording)
    • Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
      • also: equation sheet (on Canvas), 2023 notes, PAs
    • I will monitor Exams/Prelim tag on EdStem for questions

Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

Feedback in RL

[Figure: RL feedback loop — the policy maps state \(s_t\) to action \(a_t\) and receives reward \(r_t\); experience/data \((s_t,a_t,r_t)\) is used to update the policy \(\pi\); the transitions \(P,f\) are unknown in Unit 2.]

  1. Control feedback: between states and actions
  2. Data feedback: between data and policy

Sampling procedure for \(Q^\pi\)

Algorithm: Data collection

  • For \(i=1,\dots,N\):
    • Sample \(s_0\sim\mu_0\), and sample \(h_1\) and \(h_2\) from the discount distribution (Geometric\((1-\gamma)\))
    • Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
    • Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)

Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\)

  1. Drawn from discounted state distribution  \(s_i \sim d_{\mu_0}^\pi\)
  2. Unbiased labels \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\)

Proof in future PSet
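
A minimal sketch of this data-collection loop in Python, assuming a hypothetical environment interface with `reset()` returning \(s_0\sim\mu_0\) and `step(a)` returning `(next_state, reward)`, and a policy function `pi(s)` returning an action (these names are illustrative, not part of the course code):

```python
import numpy as np

def collect_q_dataset(env, pi, gamma, N, seed=0):
    """Roll-in/roll-out sampling: (s_i, a_i) ~ d_{mu_0}^pi and E[y_i | s_i, a_i] = Q^pi(s_i, a_i)."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(N):
        # h1, h2 drawn from the discount (geometric) distribution, support {0, 1, 2, ...}
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1
        s = env.reset()                      # s_0 ~ mu_0
        for _ in range(h1):                  # roll in h1 steps under pi
            s, _ = env.step(pi(s))
        s_i, a_i = s, pi(s)                  # record the roll-in state-action pair
        y, a = 0.0, a_i
        for _ in range(h2 + 1):              # roll out, summing rewards r_{h1}, ..., r_{h1+h2}
            s, r = env.step(a)
            y += r
            a = pi(s)
        data.append((s_i, a_i, y))
    return data
```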

Approximate Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning $$\hat Q^{\pi_i}(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{j=1}^N (Q(s_j,a_j)-y_j)^2 $$
    • Policy Improvement: \(\forall s\), $$\pi^{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
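
A rough sketch of the full Approximate PI loop, assuming a finite action set `actions`, the `collect_q_dataset` sampler above, and a regression routine `fit_q` standing in for the supervised-learning step (all hypothetical names):

```python
def approx_policy_iteration(env, actions, gamma, T, N, fit_q, pi0):
    """Approximate PI: estimate Q^{pi_i} from roll-in/roll-out data, then update greedily."""
    pi = pi0
    for i in range(T):
        data = collect_q_dataset(env, pi, gamma, N)        # {(s_j, a_j, y_j)}_{j=1}^N
        Q_hat = fit_q(data)                                # least-squares fit approximating Q^{pi_i}
        # greedy policy improvement: pi_{i+1}(s) = argmax_a Q_hat(s, a)
        pi = lambda s, Q=Q_hat: max(actions, key=lambda a: Q(s, a))
    return pi
```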

Approximate Policy Iteration

  • Even if we assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
    • or the stronger assumption: $$\max_a \mathbb E_{s\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
  • Approx PI does not necessarily improve because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)

Approximate Policy Iteration

[Figure: 4×4 gridworld example (states numbered 0–15).]

Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

Performance Difference Lemma: For two policies, $$V^{\pi_{i+1}}(s_0) - V^{\pi_{i}}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{s_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right]  \right] $$

where we define the advantage function \(A^{\pi}(s,a) =Q^{\pi}(s,a) - V^{\pi}(s)\)

Improvement in PI

  • PDL: the expected advantage of \(\pi_{i+1}\) over \(\pi_i\), measured under the state distribution of \(\pi_{i+1}\)
  • Notice that Policy Iteration's greedy updates $$ \pi_{i+1}(s) = \arg\max_a Q^{\pi_i}(s,a)=\arg\max_a A^{\pi_i}(s,a)$$
  • Idea: keep \(d_{\mu_0}^{\pi_{i+1}}\) close to \(d_{\mu_0}^{\pi_{i}}\)

Conservative Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\Delta(\mathcal A)\)
    • equivalently, \(\pi(a|s) : \mathcal A\times \mathcal S \to [0,1]\)
  • For \(i=0,\dots,T-1\):
    • Roll in/out \(\pi^i\) to collect dataset \(\{s_j,a_j,y_j\}_{j=1}^N\) then compute \(\hat Q^{\pi_i}\) with supervised learning
    • Policy Improvement: \(\forall s\), $$\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
    • Incremental update: a policy which follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\)) $$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi(a|s)$$
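
A minimal sketch of one CPI update, assuming tabular policies stored as \(S\times A\) arrays of action probabilities (illustrative code, not the official pseudocode):

```python
import numpy as np

def cpi_update(pi_i, Q_hat, alpha):
    """Return pi_{i+1}(a|s) = (1 - alpha) * pi_i(a|s) + alpha * bar_pi(a|s).

    pi_i  : (S, A) array, each row a distribution over actions
    Q_hat : (S, A) array, the estimated Q^{pi_i}
    """
    S, A = pi_i.shape
    bar_pi = np.zeros((S, A))
    bar_pi[np.arange(S), Q_hat.argmax(axis=1)] = 1.0   # deterministic greedy policy bar_pi
    return (1 - alpha) * pi_i + alpha * bar_pi          # convex mixture, rows still sum to 1
```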

Example

[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch. Transition labels — at state \(0\): stay: \(1\), switch: \(1\); at state \(1\): stay: \(p_1\) / \(1-p_1\), switch: \(p_2\) / \(1-p_2\).]

Reward: \(+1\) if \(s=0\) and \(-\frac{1}{2}\) if \(a=\) switch

Case: \(p_1\) and \(p_2\) both close to \(1\)

  • \(\pi^\star(1)=\) switch

[Plot: \(\pi^i(\cdot|1)\), \(\hat Q^{\pi^i}(1,\text{switch})\), and \(\hat Q^{\pi^i}(1,\text{stay})\) across iterations \(i=0,1,2\), with the greedy policy \(\bar\pi\) marked.]

Incremental Updates

  • Policy follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi_{i}\))$$\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)  $$
  • For any \(s\), \(\|\pi^{i+1}(\cdot |s)- \pi^{i}(\cdot |s)\|_1  \)
    • \(= \| \pi^{i}(\cdot |s) -(1-\alpha) \pi^{i}(\cdot|s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
    • \(= \| \alpha \pi^{i}(\cdot |s) - \alpha \bar \pi^{}(\cdot |s)\|_1\)
    • \(\leq \alpha \| \pi^{i}(\cdot |s) \|_1 + \alpha \| \bar \pi^{}(\cdot |s)\|_1 = 2\alpha \)
  • Lemma: For any \(\pi\) and \(\pi'\), \(\|d^\pi_{\mu_0} - d^{\pi'}_{\mu_0}\|_1 \leq \frac{\gamma \max_s \|\pi(\cdot|s)-\pi'(\cdot|s)\|_1}{1-\gamma}\) (proof in future PSet)
  • Thus in Conservative PI, \(d_{\mu_0}^{\pi_{i+1}}\) is \(\frac{2\gamma\alpha}{1-\gamma}\) close to \(d_{\mu_0}^{\pi_{i}}\)
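
A quick numerical sanity check of this lemma and the resulting \(\frac{2\gamma\alpha}{1-\gamma}\) bound on a small random MDP (a sketch; the sizes, seed, and random kernel below are made up for illustration):

```python
import numpy as np

def discounted_dist(P, pi, mu0, gamma):
    """d^pi_{mu0}(s) = (1-gamma) sum_t gamma^t Pr(s_t = s); P has shape (S, A, S), pi shape (S, A)."""
    S = P.shape[0]
    P_pi = np.einsum('sa,sat->st', pi, P)            # state-to-state kernel under pi
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu0)

rng = np.random.default_rng(0)
S, A, gamma, alpha = 5, 3, 0.9, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))           # random transition kernel
mu0 = np.ones(S) / S
pi = rng.dirichlet(np.ones(A), size=S)               # current policy pi_i
bar_pi = np.eye(A)[rng.integers(A, size=S)]          # some deterministic greedy bar_pi
pi_next = (1 - alpha) * pi + alpha * bar_pi          # conservative update

lhs = np.abs(discounted_dist(P, pi_next, mu0, gamma) - discounted_dist(P, pi, mu0, gamma)).sum()
rhs = gamma * np.abs(pi_next - pi).sum(axis=1).max() / (1 - gamma)   # <= 2*gamma*alpha/(1-gamma)
print(lhs, "<=", rhs)
```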

Example

[Figure: the same two-state MDP. Plots show the mixture of \(\pi^{i}\) and \(\bar\pi\) forming \(\pi^{i+1}\), i.e. \(\pi^i(\cdot|1)\), together with the discounted state distributions \(d^{\pi^i}_1(0)\) and \(d^{\pi^i}_1(1)\), across iterations \(i=0,1,2\).]

Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

CPI Improvement

  • Let \(\mathbb A_i = \mathbb E_{s\sim d_{\mu_0}^{\pi_i}}[ \max_a A^{\pi_i}(s,a)]\) be the PI improvement.
  • Conservative policy iteration improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}\geq 0 $$
  • when the step size is set to \(\alpha_i = \frac{(1-\gamma)^2 \mathbb A_i}{4\gamma}\).

Assumptions:

  1. Supervised learning works perfectly $$\max_a \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}[(\hat Q^{\pi^i}(s,a)-Q^{\pi^i}(s,a))^2] = 0$$
  2. Reward is bounded between \(0\) and \(1\)

Both assumptions can be relaxed; improvement would then depend on \(\epsilon\) and reward bounds
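
As a concrete (made-up) numerical example of these quantities, with \(\gamma = 0.9\) and \(\mathbb A_i = 0.5\):

```python
gamma, A_i = 0.9, 0.5                                  # illustrative values only
alpha_i = (1 - gamma) ** 2 * A_i / (4 * gamma)         # conservative step size
improvement = (1 - gamma) / (8 * gamma) * A_i ** 2     # guaranteed improvement per iteration
print(alpha_i)      # ~0.0014: the update is very conservative
print(improvement)  # ~0.0035: small but strictly positive improvement
```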

Example

[Figure: the same two-state MDP. Plots show \(\pi^i(\cdot|1)\), \(Q^{\pi^i}(1,\text{switch})\), \(Q^{\pi^i}(1,\text{stay})\), \(V^{\pi^i}(1)\), the greedy policy \(\bar\pi\), and the resulting \(\mathbb A_i\).]

\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)

  • \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \mathbb E_{a\sim \pi_{i+1}(s)}\left[A^{\pi_i}(s,a) \right]\right] \) (PDL)
  • \(=\frac{1}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[ \alpha A^{\pi_i}(s,\bar \pi(s)) + (1-\alpha)\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] \right]  \) (CPI step 2)
    • PollEv: \(\mathbb E_{a\sim \pi_{i}(s)}[A^{\pi_i}(s,a)] = \mathbb E_{a\sim \pi_{i}(s)}[Q^{\pi_i}(s,a)] - V^{\pi_i}(s) =0\)
  • \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s))  \right]  \)

CPI Improvement Proof

CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$

  1. \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
  2. \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)

\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)

  • \(=\frac{\alpha}{1-\gamma} \mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s))  \right]  \)
  • \(=\frac{\alpha}{1-\gamma} \Big(\mathbb E_{s\sim d^{\pi_{i+1}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] - \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right] + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\Big) \)
  • \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))  + \mathbb E_{s\sim d^{\pi_{i}}_{\mu_0}}\left[A^{\pi_i}(s,\bar \pi(s)) \right]\right) \)

CPI Improvement Proof

  1. \(\bar \pi(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)\)
  2. \(\pi^{i+1}(a|s) = (1-\alpha) \pi^{i}(a|s) + \alpha \bar \pi^{}(a|s)\)

CPI improves in expectation $$\mathbb E_{s\sim \mu_0}[V^{\pi_{i+1}}(s) - V^{\pi_{i}}(s)]\geq \frac{1-\gamma}{8\gamma}{\mathbb A_i^2}$$

\(\mathbb E_{s\sim\mu_0}[V^{\pi^{i+1}}(s) - V^{\pi^i}(s)]\)

  • \(=\frac{\alpha}{1-\gamma} \left( \sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))  +\mathbb A_i\right) \)
  • \(\geq \frac{\alpha}{1-\gamma} \left( -\left|\sum_{s} (d^{\pi_{i+1}}_{\mu_0} (s) - d^{\pi_{i}}_{\mu_0} (s)) A^{\pi_i}(s,\bar \pi(s))\right| + \mathbb A_i\right) \)
  • \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s)  - d^{\pi_{i}}_{\mu_0}(s)||A^{\pi_i}(s,\bar \pi(s))|  + \mathbb A_i\right) \)
  • \(\geq \frac{\alpha}{1-\gamma} \left( -\sum_{s} |d^{\pi_{i+1}}_{\mu_0}(s)  - d^{\pi_{i}}_{\mu_0}(s)|\frac{1}{1-\gamma}  + \mathbb A_i\right) \)
  • \(\geq \frac{1}{1-\gamma} \left( -\frac{2\gamma\alpha^2}{1-\gamma} \frac{1}{1-\gamma}  +\mathbb A_i \alpha\right) =\frac{\mathbb A_i^2(1-\gamma)}{8 \gamma}\)
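
The final equality uses the step size from the theorem: the lower bound is a concave quadratic in \(\alpha\), maximized exactly at that choice. Explicitly, $$\max_{\alpha}\ \frac{1}{1-\gamma}\left(\mathbb A_i\,\alpha - \frac{2\gamma}{(1-\gamma)^2}\,\alpha^2\right) \quad\text{is attained at}\quad \alpha = \frac{(1-\gamma)^2\mathbb A_i}{4\gamma},$$ giving value \(\frac{1}{1-\gamma}\cdot\frac{(1-\gamma)^2\mathbb A_i^2}{8\gamma} = \frac{(1-\gamma)\mathbb A_i^2}{8\gamma}\).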


Agenda

1. Recap

2. Conservative Policy Iteration

3. Guarantees on Improvement

4. Summary

Summary: Key Ideas

  1. Constructing dataset for supervised learning
    • features \((s,a)\sim d^\pi_{\mu_0}\) ("roll in")
    • labels \(y\) with \(\mathbb E[y|s,a]= Q^\pi(s,a)\) ("roll out")
  2. Incremental updates to control distribution shift
    • mixture of current and greedy policy
    • parameter \(\alpha\) controls the distribution shift


Summary: Constructing Labels

Labels via rollouts of \(\pi\):

  • Method: \(y = \sum_{t=h_1}^{h_1+h_2} r_t \) for \(h_2\sim\)Geometric\((1-\gamma)\)
  • Motivation: definition of
    \( Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
  • On future PSet, you will show
    • \(\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})\)
    • i.e., this label is unbiased
  • How much variance will labels have?
    • Many sources of randomness: all \(h_2\) transitions


Preview: Constructing Labels

Other key equations can inspire labels:

  • So far: \(Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
  • We also know:
    • Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
    • Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
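
For instance, the Bellman expectation equation suggests a bootstrapped label built from a single transition rather than a full roll-out; a hypothetical sketch (names like `Q_hat` and `pi_probs` are placeholders, not course code):

```python
def bellman_label(r, s_next, pi_probs, Q_hat, gamma):
    """One-step label y = r + gamma * E_{a' ~ pi(.|s')}[Q_hat(s', a')] from one transition (s, a, r, s')."""
    return r + gamma * sum(p * Q_hat(s_next, a) for a, p in enumerate(pi_probs))
```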


Recap

  • PSet 4
  • Prelim in class 3/15

 

  • Incremental Updates
  • Guaranteed Improvement

 

  • Next lecture: Prelim Review