## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 12

Prof. Sarah Dean

MW 2:45-4pm

110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Performance Difference Lemma

2. Supervision via Bellman Equation

3. Supervision via Bellman Optimality

4. Function Approximation

## Announcements

HW1 due tonight, HW2 released next Monday

5789 Paper Review Assignment (weekly pace *suggested*)

Prelim Tuesday 3/22 at 7:30pm in Phillips 101

OH cancelled Wednesday, instead Thursday 10:30-11:30am

**Learning Theory Mentorship Workshop**

with the Conference on Algorithmic Learning Theory (ALT)

Virtual, March 14-15, 2022

Application due March 10: https://let-all.com/alt22.html

## Recap

Meta-Algorithm for Policy Iteration in Unknown MDP

Approximate Policy Iteration

Greedy Improvement:

\(\pi^{t+1}(s) = \arg\max_a \widehat Q^{t}(s, a)\)

Could oscillate!

Conservative Policy Iteration

Incremental Improvement:

\(\pi'(s) = \arg\max_a \widehat Q^{t}(s, a)\)

\(\pi^{t+1}(a\mid s) = (1-\alpha)\pi^{t}(a\mid s) + \alpha \pi'(a\mid s)\)
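The incremental update above can be sketched in the tabular case. This is a minimal illustration, not the lecture's implementation: it assumes \(\widehat Q^t\) is an \(S\times A\) numpy array and represents policies as row-stochastic \(S\times A\) matrices.

```python
import numpy as np

def greedy_policy(Q):
    """Deterministic greedy policy pi'(s) = argmax_a Q[s, a],
    encoded as a one-hot row-stochastic S x A matrix."""
    S, A = Q.shape
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

def conservative_update(pi, Q, alpha):
    """CPI mixing step: pi_{t+1}(a|s) = (1-alpha) pi_t(a|s) + alpha pi'(a|s)."""
    return (1 - alpha) * pi + alpha * greedy_policy(Q)
```

For small \(\alpha\), the new policy stays close to \(\pi^t\), which is what prevents the oscillation that greedy improvement can suffer.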

## Recap

Meta-Algorithm for Policy Iteration in Unknown MDP

- Sample \(h_1=h\) w.p. \(\propto \gamma^h\): \((s_{h_1}, a_{h_1}) = (s_i,a_i) \sim d^\pi_{\mu_0}\)
- Sample \(h_2=h\) w.p. \(\propto \gamma^h\): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)

**Supervision with Rollout (MC):**

\(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\)

\(\widehat Q\) via ERM on \(\{(s_i, a_i, y_i)\}_{i=1}^N\)
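As a concrete instance of the ERM step, here is a sketch that fits \(\widehat Q\) by least squares over a linear function class. The feature map `phi` and the linear class are illustrative assumptions, not part of the lecture:

```python
import numpy as np

def fit_q_erm(phi, data):
    """Least-squares ERM: Qhat(s, a) = phi(s, a)^T w, fit to MC targets y_i.

    phi(s, a) -> 1-D feature vector (assumed); data = [(s_i, a_i, y_i), ...].
    """
    X = np.stack([phi(s, a) for s, a, _ in data])
    y = np.array([yi for _, _, yi in data])
    # Minimize sum_i (phi(s_i, a_i)^T w - y_i)^2
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda s, a: phi(s, a) @ w
```

Since \(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\), regressing on the noisy targets \(y_i\) yields an unbiased supervised-learning estimate of \(Q^\pi\) within the chosen function class.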

**Rollout:**

\(s_t\)

\(a_t\sim \pi(s_t)\)

\(r_t\sim r(s_t, a_t)\)

\(s_{t+1}\sim P(s_t, a_t)\)

\(a_{t+1}\sim \pi(s_{t+1})\)

...
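The rollout procedure above can be sketched end to end: sample \(h_1\) and \(h_2\) with probability proportional to \(\gamma^h\) (geometric distributions), roll the policy forward, and sum rewards. The helpers `reset_to`, `env_step`, and `pi` are hypothetical stand-ins for the MDP interface:

```python
import numpy as np

def mc_q_sample(env_step, reset_to, pi, gamma, rng):
    """One (s_i, a_i, y_i) sample for supervised estimation of Q^pi.

    Assumed interface (hypothetical):
      reset_to() -> initial state drawn from mu_0
      env_step(s, a) -> (reward, next_state)
      pi(s) -> action
    """
    # Sample h1 with P(h1 = h) proportional to gamma^h (geometric, support {0, 1, ...})
    h1 = rng.geometric(1 - gamma) - 1
    s = reset_to()
    for _ in range(h1):          # roll forward h1 steps so (s, a) ~ d^pi_{mu_0}
        a = pi(s)
        _, s = env_step(s, a)
    si, ai = s, pi(s)
    # Sample h2 the same way, then sum rewards from step h1 through h1 + h2
    h2 = rng.geometric(1 - gamma) - 1
    y, s, a = 0.0, si, ai
    for _ in range(h2 + 1):
        r, s = env_step(s, a)
        y += r
        a = pi(s)
    return si, ai, y
```

Repeating this \(N\) times produces the dataset \(\{(s_i, a_i, y_i)\}_{i=1}^N\) fed to ERM.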
