CS 4/5789: Introduction to Reinforcement Learning

Lecture 12: Approximate Policy Iteration

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • PSet 4 released tonight, due next Monday
    • 5789 Paper Reviews - on Canvas - due weekly starting Monday
  • Midterm 3/15 during lecture
    • Let us know conflicts/accommodations ASAP! (EdStem)
    • Review Lecture on Monday 3/13 (last year's slides/recording)
    • Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
      • also: equation sheet (on Canvas), 2023 notes, PAs
  • Mid-semester feedback survey: sent by email from “Associate Dean Alan Zehnder <invitation@surveys.mail.cornell.edu>”

Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

Recap: Model-based RL

Model-based RL with Queries

  1. Sample and record \(s_i'\sim P(s_i, a_i)\)
  2. Estimate \(\widehat P\) from \(\{(s_i',s_i, a_i)\}_{i=1}^N\)
  3. Design \(\widehat \pi\) from \(\widehat P\)

Tabular MBRL

  1. Sample: query each state-action pair evenly, \(\frac{N}{SA}\) times each
  2. Estimate: count transitions (see the sketch after this list) $$\hat P(s'|s,a) = \frac{\#\text{ times } s'_i=s' \text{ when } s_i=s,\, a_i=a}{\#\text{ times } s_i=s,\, a_i=a} $$
  3. Design: policy iteration
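A minimal sketch of the counting estimate in step 2, assuming the recorded queries are stored as integer-indexed \((s, a, s')\) triples (the array layout and the uniform fallback for unvisited pairs are illustrative choices, not part of the lecture):

```python
import numpy as np

def estimate_transitions(triples, S, A):
    """Count-based estimate of P(s'|s,a) from recorded (s, a, s') triples."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in triples:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=-1, keepdims=True)
    # Unvisited (s, a) pairs get a uniform distribution to avoid dividing by zero.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat  # shape (S, A, S); each P_hat[s, a] sums to 1
```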

Recap: Sample Complexity

Theorem: Tabular MBRL with \(N \gtrsim \frac{S^2 A}{\epsilon^2}\) returns an \(\epsilon\)-suboptimal policy with high probability

  • Simulation Lemma: $$|\hat V^\pi(s_0) - V^\pi(s_0)| \lesssim \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
  • Estimation Lemma: With high probability, $$\max_{s,a }\|P(\cdot |s,a)-\hat P(\cdot |s,a)\|_1 \lesssim \sqrt{\frac{S^2A}{N}} $$

Recap: Policy Iteration

Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(t=0,\dots,T-1\):
    • Compute \(Q^{\pi_t}\) with Policy Evaluation
    • Policy Improvement: \(\forall s\), $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a)$$
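For reference, a minimal sketch of tabular policy iteration with exact policy evaluation, assuming known transitions `P` (shape `(S, A, S)`) and rewards `r` (shape `(S, A)`); the array conventions are assumptions for illustration:

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                     # initialize pi_0 arbitrarily
    for _ in range(T):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly
        P_pi = P[np.arange(S), pi]                  # (S, S)
        r_pi = r[np.arange(S), pi]                  # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^{pi_t}
        Q = r + gamma * P @ V                       # (S, A)
        pi = Q.argmax(axis=1)
    return pi
```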

Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

Feedback in RL

[Diagram: control feedback loop between the policy \(\pi\) (producing action \(a_t\)) and the transitions \(P,f\) (producing state \(s_t\) and reward \(r_t\))]

Control feedback

  • between states and actions
    • "reaction"
  • studied in control theory "automatic feedback control"
  • our focus for Unit 1


Feedback in RL


  1. Control feedback: between states and actions

[Diagram: in addition to the control loop, a data feedback loop in which experience data \((s_t,a_t,r_t)\) is used to update the policy \(\pi\)]

Data feedback

  • between data and policy
    • "adaptation"
  • connection to machine learning
  • our new focus in Unit 2

 

Feedback in RL


  1. Control feedback: between states and actions
  2. Data feedback: between data and policy

[Diagram: control and data feedback loops; the transitions \(P,f\) are unknown in Unit 2]

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\) (see the sketch after this list)
  • Important functions in MDPs
    • Transitions \(P(s'|s,a)\)
    • Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
    • Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
    • Optimal policy \(\pi^\star(s)\)
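A minimal sketch of the least-squares method above for a linear function class (the choice of \(\mathcal F\) and the array shapes are illustrative assumptions, not the course's reference code):

```python
import numpy as np

def fit_least_squares(X, y):
    """Fit f(x) = w @ x by minimizing sum_i (f(x_i) - y_i)^2.
    X has shape (N, d) of features, y has shape (N,) of labels."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x: x @ w
```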

Supervised Learning for MDPs

  • Supervised learning: features \(x\) and labels \(y\)
  • Important functions in MDPs
    • Transitions \(P(s'|s,a)\)
      • features \(s,a,s'\), sampled outcomes observed
    • Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
      • features \(s\) or \(s,a\), labels ?
    • Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
      • features \(s\) or \(s,a\), labels ?
    • Optimal policy \(\pi^\star(s)\)
      • features \(s\), labels from expert, otherwise ?

Supervised Learning for MDPs

[Slide annotations: transitions \(P\) covered by MBRL (Lec 11); value/Q of a policy this week; optimal value/Q after the prelim; optimal policy after the prelim]

Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

Data for Learning \(Q^\pi(s,a)\)

  • How to construct labels for $$ Q^\pi(s,a) = \mathbb E_{P,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big] $$
  • Label via rollout:
    • Start at \((s,a)\), then follow \(\pi\) and observe rewards \(r_0,r_1,\dots\)
    • Let label \(y=\sum_{t=0}^\infty \gamma^t r_t\)
    • Unbiased because \(\mathbb E[y|s,a] = Q^\pi(s,a)\)

Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)

Example

[Diagram: two-state MDP with states \(0\) and \(1\); actions stay and switch, with transition probabilities involving \(p_1\) and \(p_2\)]

  • Consider \(\pi(s)=\)stay
  • Rollouts with
    • \(s=0\) and \(a=\)stay
    • \(s=0\) and \(a=\)switch
    • \(s=1\) and \(a=\)stay
    • \(s=1\) and \(a=\)switch

Sampling procedure for \(Q^\pi\)

Algorithm: Data collection

  • For \(i=1,\dots,N\):
    • Sample \(s_0\sim\mu_0\), and \(h_1\) and \(h_2\) from the geometric distribution with parameter \(1-\gamma\)
    • Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
    • Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)

Rollout: \(s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)

Timestep sampling with discount/geometric distribution: set \(h_i=h\geq 0\) with probability \((1-\gamma)\gamma^h\)
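Quick check that the geometric horizon reproduces the discount (with time relabeled so the roll-out starts at \(t=0\), and assuming the horizon \(h\) is drawn independently of the trajectory): since \(\Pr\{h\geq t\}=\gamma^t\),

$$\mathbb E\Big[\sum_{t=0}^{h} r_t \,\Big|\, s_0=s, a_0=a\Big] = \sum_{t=0}^{\infty}\Pr\{h\geq t\}\,\mathbb E[r_t\mid s_0=s, a_0=a] = \sum_{t=0}^{\infty}\gamma^t\,\mathbb E[r_t\mid s_0=s, a_0=a] = Q^\pi(s,a)$$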

Sampling procedure for \(Q^\pi\)

Algorithm: Data collection

  • For \(i=1,\dots,N\):
    • Sample \(s_0\sim\mu_0\), and \(h_1\) and \(h_2\) from the discount distribution
    • Roll in \(h_1\) steps: set \((s_i,a_i)=(s_{h_1},a_{h_1})\)
    • Roll out \(h_2\) steps: set \(y_i=\sum_{t=h_1}^{h_1+h_2} r_t\)

Proposition: The resulting dataset \(\{(s_i,a_i), y_i\}_{i=1}^N\) satisfies:

  1. States are drawn from the discounted state distribution: \(s_i \sim d_{\mu_0}^\pi\)
  2. Labels are unbiased: \(\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)\)
  • Using sampled data, $$\hat Q^\pi(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
  • Assumption: Supervised learning works, i.e. $$\mathbb E_{s,a\sim d_{\mu_0}^\pi }[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon $$
  • In practice, often sample data from a single long rollout (rather than resetting to \(\mu_0\) every time)
    • This is called "Monte Carlo" supervision
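A minimal sketch of the data-collection procedure above, assuming a hypothetical environment with `reset()` and `step(s, a)` methods and a policy that can be sampled as `pi(s)`; all interface names here are illustrative, not a specific library API:

```python
import numpy as np

def collect_q_dataset(env, pi, gamma, N, rng):
    data = []
    for _ in range(N):
        # Discount (geometric) distribution: P(h) = (1 - gamma) * gamma**h.
        # numpy's geometric starts at 1, so shift by 1 to start at 0.
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1
        s = env.reset()                     # s_0 ~ mu_0
        a = pi(s)
        for _ in range(h1):                 # roll in h1 steps
            _, s = env.step(s, a)           # step(s, a) -> (reward, next state)
            a = pi(s)
        s_i, a_i = s, a                     # (s_i, a_i) = (s_{h1}, a_{h1})
        y = 0.0
        for _ in range(h2 + 1):             # roll out: undiscounted sum of h2 + 1 rewards
            r, s = env.step(s, a)
            y += r
            a = pi(s)
        data.append((s_i, a_i, y))
    return data
```

The regression step \(\hat Q^\pi = \arg\min_{Q\in\mathcal Q}\sum_i (Q(s_i,a_i)-y_i)^2\) can then be run on `data` with any supervised learner (e.g. `rng = np.random.default_rng()` and a least-squares fit as sketched earlier).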

Supervision via Rollouts

Notation note: \(s,a\sim d_{\mu_0}^\pi\) is a compact way of writing \(s\sim d_{\mu_0}^\pi\) and \(a\sim\pi(s)\).

Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

Approximate Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(i=0,\dots,T-1\):
    • Roll out \(\pi_i\), collect dataset \(\{(s_j,a_j), y_j\}_{j=1}^N\), then compute \(\hat Q^{\pi_i}\) with supervised learning
    • Policy Improvement: \(\forall s\), $$\pi_{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
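Putting the pieces together, a sketch of this loop using the hypothetical `collect_q_dataset` helper from the previous section; `env.actions`, `fit_q` (the supervised-learning step, returning a function `Q_hat(s, a)`), and the rest of the interface are assumed stand-ins:

```python
def approximate_policy_iteration(env, pi0, gamma, T, N, fit_q, rng):
    pi = pi0
    for _ in range(T):
        data = collect_q_dataset(env, pi, gamma, N, rng)   # on-policy rollouts of pi_i
        Q_hat = fit_q(data)                                # regression estimate of Q^{pi_i}
        # Greedy policy improvement with respect to the estimated Q function
        pi = lambda s, Q=Q_hat: max(env.actions, key=lambda a: Q(s, a))
    return pi
```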

Approximate Policy Iteration

  • This is an "on-policy" method because it uses data collected with the current policy \(\pi_i\)
  • Recall that PI guarantees monotonic improvement $$Q^{\pi_{i+1}}(s,a)\geq Q^{\pi_i}(s,a)$$
  • We assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
  • Does Approx PI monotonically improve (approximately)?
    • Not necessarily, because \(d_{\mu_0}^{\pi_{i+1}}\) might be different from \(d_{\mu_0}^{\pi_i}\)
  • Greedy improvement with an approximate Q function can lead to oscillation!

Approximate Policy Iteration

[Diagram: \(4\times 4\) gridworld with states numbered 0-15]

Example

  • Suppose \(\pi_1\) is a policy which goes down first and then sometimes right
  • \(\hat Q^{\pi_1}\) is only reliable on the left and bottom parts of the grid
  • \(\pi_2\) ends up going right first and then down
  • \(\pi_3\) will oscillate back to \(\pi_1\)!

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)

Performance Difference

The advantage function:

  • "Advantage" of taking action \(a\) in state \(s\) rather than \(\pi(s)\)
  • What can we say about \(A^{\pi^\star}\)? PollEV
  • Recall the state distribution for a policy \(\pi\) (Lecture 2)
    • \( d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\} \)
  • We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top  d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
  • The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma)  \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}} $$
  • When the initial state is fixed to a known \(s_0\), i.e. \(\mu_0=e_{s_0}\), we write \(d_{s_0,t}^{\pi}\)

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

Proof of PDL

  • \(V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]\)
  • \(= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]\)
  • Iterate \(k\) times: \(V^\pi(s_0) - V^{\pi'}(s_0) =\) $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
  • Statement follows by letting \(k\to\infty\): the first term vanishes since values are bounded and \(\gamma<1\), and the sum becomes \(\frac{1}{1-\gamma}\mathbb E_{s\sim d^\pi_{s_0}}\left[\mathbb E_{a\sim\pi(s)}[A^{\pi'}(s,a)]\right]\) by definition of the discounted state distribution.

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right]  \right] $$

where we define the advantage function \(A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)\)

Performance Difference

  • Notice that \(\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)\)
  • Can use PDL to show that PI has monotonic improvement (sketch below)
  • Next time, we use insights from PDL to develop a better algorithm
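For instance, applying the PDL with \(\pi=\pi_{t+1}\) and \(\pi'=\pi_t\) gives a sketch of the monotonic improvement claim:

$$V^{\pi_{t+1}}(s_0) - V^{\pi_t}(s_0) = \frac{1}{1-\gamma}\mathbb E_{s\sim d^{\pi_{t+1}}_{s_0}}\left[A^{\pi_t}(s,\pi_{t+1}(s))\right] \geq 0$$

since \(\pi_{t+1}(s)=\arg\max_a Q^{\pi_t}(s,a)\) implies \(A^{\pi_t}(s,\pi_{t+1}(s)) = \max_a Q^{\pi_t}(s,a) - V^{\pi_t}(s) \geq Q^{\pi_t}(s,\pi_t(s)) - V^{\pi_t}(s) = 0\).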

Recap

  • PSet 4
  • Prelim in class 3/15

 

  • Supervision via Rollouts
  • Performance Difference Lemma

 

  • Next lecture: Conservative Policy Iteration
