Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PSet 4 released tonight, due next Monday
• 5789 Paper Reviews - on Canvas - due weekly starting Monday
• Midterm 3/15 during lecture
• Let us know about conflicts/accommodations ASAP! (EdStem)
• Review Lecture on Monday 3/13 (last year's slides/recording)
• Materials: slides (Lectures 1-10, some of 11-13), PSets 1-4 (solutions on Canvas)
• also: equation sheet (on Canvas), 2023 notes, PAs
• Mid-semester feedback survey: email “Associate Dean Alan Zehnder <invitation@surveys.mail.cornell.edu>”

## Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

## Recap: Model-based RL

Model-based RL with Queries

1. Sample and record $$s_i'\sim P(s_i, a_i)$$
2. Estimate $$\widehat P$$ from $$\{(s_i',s_i, a_i)\}_{i=1}^N$$
3. Design $$\widehat \pi$$ from $$\widehat P$$

Tabular MBRL

1. Sample: query each state-action pair evenly, $$\frac{N}{SA}$$ times each
2. Estimate: by counting $$\hat P(s'|s,a) = \frac{\text{\#~times~}s'_i=s'\text{~when~}s_i=s,a_i=a}{\text{\#~times~}s_i=s,a_i=a}$$
3. Design: policy iteration
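
As a concrete illustration of the counting estimator above, a minimal sketch; the array layout and function name are illustrative, not from the course code:

```python
import numpy as np

def estimate_transitions(samples, S, A):
    """Count-based estimate of P(s'|s,a) from a list of (s, a, s') samples."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Normalize each (s, a) row; unvisited pairs fall back to a uniform guess.
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat  # shape (S, A, S)
```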

## Recap: Sample Complexity

Theorem: Tabular MBRL with $$N \gtrsim \frac{S^2 A}{\epsilon^2}$$ samples returns an $$\epsilon$$-suboptimal policy with high probability

• Simulation Lemma: $$|\hat V^\pi(s_0) - V^\pi(s_0)| \lesssim \mathbb E_{s\sim d^{\pi}_{s_0}}\left[ \|\hat P(\cdot |s,\pi(s)) - P(\cdot|s,\pi(s))\|_1\right]$$
• Estimation Lemma: With high probability, $$\max_{s,a }\|P(\cdot |s,a)-\hat P(\cdot |s,a)\|_1 \lesssim \sqrt{\frac{S^2A}{N}}$$

## Recap: Policy Iteration

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
• Compute $$Q^{\pi_t}$$ with Policy Evaluation
• Policy Improvement: $$\forall s$$, $$\pi_{t+1}(s)=\arg\max_{a\in\mathcal A} Q^{\pi_t}(s,a)$$
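
A compact sketch of tabular policy iteration with exact policy evaluation, assuming arrays `P[s, a, s']` and `r[s, a]` (hypothetical names):

```python
import numpy as np

def policy_iteration(P, r, gamma, T):
    """Tabular PI. P: (S, A, S) transition probabilities, r: (S, A) rewards."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                  # initial policy pi_0
    for _ in range(T):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(S), pi]               # (S, S) rows P(. | s, pi(s))
        r_pi = r[np.arange(S), pi]               # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: greedy with respect to Q^{pi_t}.
        Q = r + gamma * P @ V                    # (S, A)
        pi = Q.argmax(axis=1)
    return pi
```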

## Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

## Feedback in RL

[Diagram: the policy $$\pi$$ maps states $$s_t$$ to actions $$a_t$$; the transitions $$P,f$$ return the next state and reward $$r_t$$, closing the loop]

Control feedback

• between states and actions
• "reaction"
• studied in control theory as "automatic feedback control"
• our focus for Unit 1

## Feedback in RL

[Diagram: the control loop between policy $$\pi$$ and transitions $$P,f$$ (actions $$a_t$$, states $$s_t$$, rewards $$r_t$$); experience is recorded as data $$(s_t,a_t,r_t)$$ that feeds back into the policy]

1. Control feedback: between states and actions

Data feedback

• between data and policy
• connection to machine learning
• our new focus in Unit 2

## Feedback in RL

1. Control feedback: between states and actions
2. Data feedback: between data and policy

[Diagram: the control loop between policy $$\pi$$ and transitions $$P,f$$ (unknown in Unit 2); experience is recorded as data $$(s_t,a_t,r_t)$$ that feeds back into the policy]

## Supervised Learning for MDPs

• Supervised learning: features $$x$$ and labels $$y$$
• Goal: predict labels with $$\hat f(x)\approx \mathbb E[y|x]$$
• Requirements: dataset $$\{x_i,y_i\}_{i=1}^N$$
• Method: $$\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2$$
• Important functions in MDPs
• Transitions $$P(s'|s,a)$$
• Value/Q of a policy $$V^\pi(s)$$ and $$Q^\pi(s,a)$$
• Optimal Value/Q $$V^\star(s)$$ and $$Q^\star(s,a)$$
• Optimal policy $$\pi^\star(s)$$
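
For reference, a minimal instance of the least-squares method above with linear features; all names and data here are illustrative:

```python
import numpy as np

# Dataset {x_i, y_i}: noisy labels with E[y | x] = 2*x_1 - x_2 (made-up example).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=100)

# Least-squares fit over linear functions: f_hat = argmin_f sum_i (f(x_i) - y_i)^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
f_hat = lambda x: x @ w   # f_hat(x) approximates E[y | x]
```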

## Supervised Learning for MDPs

• Supervised learning: features $$x$$ and labels $$y$$
• Important functions in MDPs
• Transitions $$P(s'|s,a)$$ (MBRL, Lec 11)
• features $$s,a$$; labels are sampled outcomes $$s'$$
• Value/Q of a policy $$V^\pi(s)$$ and $$Q^\pi(s,a)$$ (this week)
• features $$s$$ or $$s,a$$, labels ?
• Optimal Value/Q $$V^\star(s)$$ and $$Q^\star(s,a)$$ (after prelim)
• features $$s$$ or $$s,a$$, labels ?
• Optimal policy $$\pi^\star(s)$$ (after prelim)
• features $$s$$, labels from expert, otherwise ?

## Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

## Data for Learning $$Q^\pi(s,a)$$

• How to construct labels for $$Q^\pi(s,a) = \mathbb E_{P,\pi}\Big[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \Big]$$
• Label via rollout:
• start at $$(s,a)$$ and then observe rewards $$r_0,r_1,\dots$$
• Let label $$y=\sum_{t=0}^\infty \gamma^t r_t$$
• Unbiased because $$\mathbb E[y|s,a] = Q^\pi(s,a)$$

Rollout: $$s_t,\ \ a_t\sim \pi(s_t),\ \ r_t\sim r(s_t, a_t),\ \ s_{t+1}\sim P(s_t, a_t),\ \ a_{t+1}\sim \pi(s_{t+1}),\ \dots$$
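
A minimal sketch of the label construction, assuming a finite observed reward sequence from the rollout (so the infinite sum is truncated):

```python
def discounted_return(rewards, gamma):
    """Label y = sum_t gamma^t * r_t for rewards observed along a rollout from (s, a)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# e.g. a (truncated) reward sequence observed while following pi from (s, a)
y = discounted_return([1.0, 0.0, 1.0, 1.0], gamma=0.9)
```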

## Example

[Diagram: two-state MDP with states $$0$$ and $$1$$ and actions stay/switch; transition probabilities labeled $$1$$, $$p_1$$, $$1-p_1$$, $$p_2$$, $$1-p_2$$]

• Consider $$\pi(s)=$$stay
• Rollouts with
• $$s=0$$ and $$a=$$stay
• $$s=0$$ and $$a=$$switch
• $$s=1$$ and $$a=$$stay
• $$s=1$$ and $$a=$$switch

## Sampling procedure for $$Q^\pi$$

Algorithm: Data collection

• For $$i=1,\dots,N$$:
• Sample $$s_0\sim\mu_0$$, and sample horizons $$h_1$$ and $$h_2$$ from the geometric distribution with parameter $$1-\gamma$$
• Roll in $$h_1$$ steps: set $$(s_i,a_i)=(s_{h_1},a_{h_1})$$
• Roll out $$h_2$$ steps: set $$y_i=\sum_{t=h_1}^{h_1+h_2} r_t$$

Rollout: $$s_t,\ \ a_t\sim \pi(s_t),\ \ r_t\sim r(s_t, a_t),\ \ s_{t+1}\sim P(s_t, a_t),\ \ a_{t+1}\sim \pi(s_{t+1}),\ \dots$$

Timestep sampling with discount/geometric distribution: set $$h_i=h\geq 0$$ with probability $$(1-\gamma)\gamma^h$$
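
A sketch of this data-collection procedure, assuming a simulator with a `reset()` returning $$s_0\sim\mu_0$$ and a `step(s, a)` returning $$(s', r)$$, and a policy callable `pi(s)`; this interface is an assumption, not part of the slides:

```python
import numpy as np

def collect_q_dataset(sim, pi, gamma, N, seed=0):
    """Collect {(s_i, a_i), y_i} with roll-in/roll-out horizons ~ Geometric(1 - gamma)."""
    rng = np.random.default_rng(seed)
    data = []
    for _ in range(N):
        # numpy's geometric is supported on {1, 2, ...}; subtract 1 so h = 0 is possible,
        # giving P(h) = (1 - gamma) * gamma**h as on the slide.
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1
        s = sim.reset()                          # s_0 ~ mu_0
        for _ in range(h1):                      # roll in h1 steps following pi
            s, _ = sim.step(s, pi(s))
        s_i, a_i = s, pi(s)                      # (s_i, a_i) = (s_{h1}, a_{h1})
        y, a = 0.0, a_i
        for _ in range(h2 + 1):                  # roll out: y_i = sum of r_{h1}, ..., r_{h1+h2}
            s, r = sim.step(s, a)
            y += r
            a = pi(s)
        data.append(((s_i, a_i), y))
    return data
```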

## Sampling procedure for $$Q^\pi$$

Algorithm: Data collection

• For $$i=1,\dots,N$$:
• Sample $$s_0\sim\mu_0$$ and $$h_1$$ and $$h_2$$ from discount distribution
• Roll in $$h_1$$ steps: set $$(s_i,a_i)=(s_{h_1},a_{h_1})$$
• Roll out $$h_2$$ steps: set $$y_i=\sum_{t=h_1}^{h_1+h_2} r_t$$

Proposition: The resulting dataset $$\{(s_i,a_i), y_i\}_{i=1}^N$$

1. Drawn from discounted state distribution  $$s_i \sim d_{\mu_0}^\pi$$
2. Unbiased labels $$\mathbb E[y_i\mid s_i,a_i] = Q^\pi(s_i,a_i)$$
• Using sampled data, $$\hat Q^\pi(s,a) = \arg\min_{Q\in\mathcal Q} \sum_{i=1}^N (Q(s_i,a_i)-y_i)^2$$
• Assumption: Supervised learning works, i.e. $$\mathbb E_{s,a\sim d_{\mu_0}^\pi }[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
• In practice, often sample data from a single long rollout (rather than resetting to $$\mu_0$$ every time)
• This is called "Monte Carlo" supervision
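
A minimal sketch of the supervised-learning step on the collected data, assuming a linear function class over a feature map `phi(s, a)` (the slides leave the class $$\mathcal Q$$ abstract):

```python
import numpy as np

def fit_q_hat(data, phi):
    """Least-squares fit of Q_hat over linear functions of the features phi(s, a)."""
    X = np.array([phi(s, a) for (s, a), _ in data])
    y = np.array([y_i for _, y_i in data])
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: phi(s, a) @ w
```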

## Supervision via Rollouts

Notation note: $$s,a\sim d_{\mu_0}^\pi$$ is a compact way of writing $$s\sim d_{\mu_0}^\pi$$ and $$a\sim\pi$$.

## Agenda

1. Recap: MBRL

2. Feedback & Supervision

3. Supervision via Rollouts

4. Approximate Policy Iteration

## Approximate Policy Iteration

Approximate Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$i=0,\dots,T-1$$:
• Rollout $$\pi_i$$, collect dataset $$\{s_j,a_j,y_j\}_{j=1}^N$$, then compute $$\hat Q^{\pi_i}$$ with supervised learning
• Policy Improvement: $$\forall s$$, $$\pi_{i+1}(s)=\arg\max_{a\in\mathcal A} \hat Q^{\pi_i}(s,a)$$
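
Putting the pieces together, a sketch of the loop, reusing the hypothetical `collect_q_dataset` and `fit_q_hat` sketches from earlier (all interfaces are assumptions):

```python
def approximate_policy_iteration(sim, actions, phi, gamma, T, N, pi_0):
    """Alternate Monte Carlo estimation of Q^{pi_i} and greedy improvement."""
    pi = pi_0                                          # initial policy: a callable pi(s)
    for _ in range(T):
        data = collect_q_dataset(sim, pi, gamma, N)    # on-policy rollouts of pi_i
        q_hat = fit_q_hat(data, phi)                   # supervised learning step
        # Greedy improvement with respect to the *estimated* Q function.
        pi = lambda s, q=q_hat: max(actions, key=lambda a: q(s, a))
    return pi
```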

## Approximate Policy Iteration

• This is an "on-policy" method because it uses data collected with the current policy $$\pi_i$$
• Recall that PI guarantees monotonic improvement $$Q^{\pi_{i+1}}(s,a)\geq Q^{\pi_i}(s,a)$$
• We assume that supervised learning succeeds $$\mathbb E_{s,a\sim d_{\mu_0}^\pi}[(\hat Q^\pi(s,a)-Q^\pi(s,a))^2]\leq \epsilon$$
• Does Approx PI monotonically improve (approximately)?
• Not necessarily, because $$d_{\mu_0}^{\pi_{i+1}}$$ might be different from $$d_{\mu_0}^{\pi_i}$$
• Greedy improvement when Q function is approximate could lead to oscillation!

## Approximate Policy Iteration

[Figure: gridworld with states numbered 0-15]

## Example

• Suppose $$\pi_1$$ is a policy which goes down first and then sometimes right
• $$\hat Q^{\pi_1}$$ is only reliable on the left and bottom parts of the grid
• $$\pi_2$$ ends up going right first and then down
• $$\pi_3$$ will oscillate back to $$\pi_1$$!

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right]$$

where we define the advantage function $$A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)$$

## Performance Difference

• "Advantage" of taking action $$a$$ in state $$s$$ rather than $$\pi(s)$$
• What can we say about $$A^{\pi^\star}$$? PollEV
• Recall the state distribution for a policy $$\pi$$ (Lecture 2)
• $$d^{\pi}_{\mu_0,t}(s) = \mathbb{P}\{s_t=s\mid s_0\sim \mu_0,s_{k+1}\sim P(s_k, \pi(s_k))\}$$
• We showed that it can be written as $$d^{\pi}_{\mu_0,t} = P_\pi^\top d^{\pi}_{\mu_0,t-1} = (P_\pi^t)^\top \mu_0$$
• The discounted distribution (PSet) $$d_{\mu_0}^{\pi} = (1-\gamma) \sum_{t=0}^{\infty} \gamma^t \underbrace{(P_\pi ^t)^\top \mu_0}_{d_{\mu_0,t}^{\pi}}$$
• When the initial state is fixed to a known $$s_0$$, i.e. $$\mu_0=e_{s_0}$$ we write $$d_{s_0,t}^{\pi}$$
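
A short sketch that computes the discounted state distribution above by truncating the series, assuming `P_pi` is the $$S\times S$$ matrix with rows $$P(\cdot\mid s,\pi(s))$$ (names are illustrative):

```python
import numpy as np

def discounted_state_distribution(P_pi, mu_0, gamma, tol=1e-10):
    """d = (1 - gamma) * sum_t gamma^t (P_pi^t)^T mu_0, truncated when gamma^t < tol."""
    d = np.zeros_like(mu_0, dtype=float)
    term = mu_0.astype(float)            # (P_pi^0)^T mu_0
    weight = 1.0                         # gamma^t
    while weight > tol:
        d += weight * term
        term = P_pi.T @ term
        weight *= gamma
    # Equivalently (closed form): (1 - gamma) * (I - gamma * P_pi.T)^{-1} mu_0
    return (1 - gamma) * d
```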

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right]$$

## Proof of PDL

• $$V^\pi(s_0) - V^{\pi'}(s_0) =\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1) ] - V^{\pi'}(s_0) \right]$$
• $$=\mathbb E_{a\sim \pi(s_0)}\left[ r(s_0,a) + \gamma \mathbb E_{s_1\sim P(s_0, a) }[V^\pi(s_1)- V^{\pi'}(s_1) + V^{\pi'}(s_1) ] - V^{\pi'}(s_0) \right]$$
• $$= \gamma \mathbb E_{\substack{a\sim \pi(s_0) \\ s_1\sim P(s_0, a)} }[V^\pi(s_1)- V^{\pi'}(s_1) ] + \mathbb E_{a\sim \pi(s_0)}\left[Q^{\pi'}(s_0, a) -V^{\pi'}(s_0) \right]$$
• Iterate $$k$$ times: $$V^\pi(s_0) - V^{\pi'}(s_0) =$$ $$\gamma^k \mathbb E_{s_k\sim d_{s_0,k}^\pi}[V^\pi(s_k)- V^{\pi'}(s_k) ] + \sum_{\ell=0}^{k-1}\gamma^\ell \mathbb E_{\substack{s_\ell \sim d_{s_0,\ell }^\pi \\ a\sim \pi(s_\ell )}}\left[Q^{\pi'}(s_\ell, a) -V^{\pi'}(s_\ell) \right]$$
• Statement follows by letting $$k\to\infty$$: the $$\gamma^k$$ term vanishes, and by the definition of the discounted distribution $$d_{s_0}^\pi$$ the remaining sum equals $$\frac{1}{1-\gamma}\mathbb E_{s\sim d_{s_0}^\pi}\left[\mathbb E_{a\sim\pi(s)}[A^{\pi'}(s,a)]\right]$$.

Performance Difference Lemma: For two policies, $$V^\pi(s_0) - V^{\pi'}(s_0) = \frac{1}{1-\gamma} \mathbb E_{s\sim d^\pi_{s_0}}\left[ \mathbb E_{a\sim \pi(s)}\left[A^{\pi'}(s,a) \right] \right]$$

where we define the advantage function $$A^{\pi'}(s,a) =Q^{\pi'}(s,a) - V^{\pi'}(s)$$
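
As a sanity check, the PDL can be verified numerically on a small random MDP; everything below (the MDP, the two policies) is illustrative, not from the slides:

```python
import numpy as np

# Random 3-state, 2-action MDP.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))         # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))

def value(pi):
    """Exact V^pi and Q^pi for a deterministic policy pi (array of actions)."""
    P_pi, r_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return V, r + gamma * P @ V

pi, pi_prime = np.array([0, 1, 0]), np.array([1, 0, 1])
V_pi, _ = value(pi)
V_pp, Q_pp = value(pi_prime)
A_pp = Q_pp - V_pp[:, None]                         # advantage A^{pi'}(s, a)

s0 = 0
# d^pi_{s0} = (1 - gamma) * (I - gamma * P_pi^T)^{-1} e_{s0}
P_pi = P[np.arange(S), pi]
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, np.eye(S)[s0])
lhs = V_pi[s0] - V_pp[s0]
rhs = np.sum(d * A_pp[np.arange(S), pi]) / (1 - gamma)
print(np.isclose(lhs, rhs))                         # True: both sides of the PDL agree
```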

## Performance Difference

• Notice that $$\arg\max_a A^\pi(s,a) = \arg\max_a Q^\pi(s,a)$$
• Can use PDL to show that PI has monotonic improvement
• Next time, we use insights from PDL to develop a better algorithm

## Recap

• PSet 4
• Prelim in class 3/15

• Supervision via Rollouts
• Performance Difference Lemma

• Next lecture: Conservative Policy Iteration
