
Prof. Sarah Dean

MW 2:45-4pm

110 Hollister Hall

0. Announcements & Recap

1. Performance Difference Lemma

2. Supervision via Bellman Eq

3. Supervision via Bellman Opt

4. Function Approximation

HW1 due tonight, HW2 released next Monday

5789 Paper Review Assignment (weekly pace *suggested*)

Prelim Tuesday 3/22 at 7:30pm in Phillips 101

OH cancelled Wednesday; held instead Thursday 10:30-11:30am

**Learning Theory Mentorship Workshop**

with the Conference on Algorithmic Learning Theory (ALT)

Virtual, March 14-15, 2022

Application due March 10: https://let-all.com/alt22.html

Meta-Algorithm for Policy Iteration in Unknown MDP

Approximate Policy Iteration

Greedy Improvement:

\(\pi^{t+1}(s) = \arg\max_a \widehat Q^{t}(s, a)\)

Could oscillate!
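A minimal sketch of this loop in Python, assuming a tabular MDP and a caller-supplied estimator `q_hat_fn` (a hypothetical interface; the slides leave the estimation step abstract):

```python
import numpy as np

def greedy_policy_iteration(q_hat_fn, pi0, iters=10):
    """Approximate policy iteration with greedy improvement.

    q_hat_fn(pi) -> (S, A) array: an *estimate* of Q^pi (hypothetical
    interface; in practice it is fit from rollouts as described below).
    pi0: (S,) array of action indices (deterministic initial policy).
    Because q_hat_fn is only approximate, successive greedy policies
    need not improve monotonically -- they can oscillate.
    """
    pi = pi0.copy()
    for _ in range(iters):
        q_hat = q_hat_fn(pi)           # estimate Q^{pi_t}
        pi = np.argmax(q_hat, axis=1)  # pi_{t+1}(s) = argmax_a Q_hat^t(s, a)
    return pi
```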

Conservative Policy Iteration

Incremental Improvement:

\(\pi'(s) = \arg\max_a \widehat Q^{t}(s, a)\)

\(\pi^{t+1}(a\mid s) = (1-\alpha)\,\pi^{t}(a\mid s) + \alpha\,\mathbb{1}\{a = \pi'(s)\}\)
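The same update as a sketch, with \(\pi'\) written as a one-hot distribution so the mixture is an explicit convex combination (tabular representation assumed):

```python
import numpy as np

def cpi_update(pi_t, q_hat, alpha):
    """One conservative policy iteration step.

    pi_t:  (S, A) stochastic policy, rows sum to 1.
    q_hat: (S, A) estimate of Q^{pi_t}.
    alpha: step size in (0, 1]; small alpha keeps pi_{t+1} close to pi_t,
           which is what prevents the oscillation of fully greedy updates.
    """
    greedy = np.argmax(q_hat, axis=1)  # pi'(s) = argmax_a Q_hat(s, a)
    pi_prime = np.zeros_like(pi_t)
    pi_prime[np.arange(pi_t.shape[0]), greedy] = 1.0
    # pi_{t+1}(a|s) = (1 - alpha) pi_t(a|s) + alpha * 1{a = pi'(s)}
    return (1 - alpha) * pi_t + alpha * pi_prime
```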

Meta-Algorithm for Policy Iteration in Unknown MDP

- Sample \(h_1=h\) w.p. \(\propto \gamma^h\): \((s_{h_1}, a_{h_1}) = (s_i,a_i) \sim d^\pi_{\mu_0}\)
- Sample \(h_2=h\) w.p. \(\propto \gamma^h\): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)

**Supervision with Rollout (MC):**

\(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\)
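To see why the truncated return is unbiased: with \(\Pr(h_2 = h) = (1-\gamma)\gamma^{h}\) for \(h = 0, 1, 2, \ldots\), the random horizon recovers the discount factors in expectation,

\[
\mathbb{E}[y_i \mid s_i, a_i]
= \sum_{h\ge 0}(1-\gamma)\gamma^{h}\sum_{t=0}^{h}\mathbb{E}[r_{h_1+t}]
= \sum_{t\ge 0}\mathbb{E}[r_{h_1+t}]\sum_{h\ge t}(1-\gamma)\gamma^{h}
= \sum_{t\ge 0}\gamma^{t}\,\mathbb{E}[r_{h_1+t}]
= Q^\pi(s_i, a_i).
\]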

\(\widehat Q\) via ERM on \(\{(s_i, a_i, y_i)\}_{i=1}^N\)
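A sketch of the full supervision pipeline, assuming a hypothetical sampler interface `reset() -> s` and `step(s, a) -> (r, s_next)` (not specified in the slides) and a tabular function class, for which squared-loss ERM reduces to per-\((s,a)\) averaging:

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_q_dataset(reset, step, pi, gamma, n_samples):
    """Collect {(s_i, a_i, y_i)}_{i=1}^N with E[y_i | s_i, a_i] = Q^pi(s_i, a_i)."""
    data = []
    for _ in range(n_samples):
        # Geometric horizons: P(h) = (1 - gamma) * gamma^h for h = 0, 1, 2, ...
        h1 = rng.geometric(1 - gamma) - 1
        h2 = rng.geometric(1 - gamma) - 1
        s = reset()
        for _ in range(h1):                 # roll pi forward to time h1
            _, s = step(s, pi(s))
        a = pi(s)                           # (s_i, a_i) ~ d^pi_{mu_0}
        y, s_cur, a_cur = 0.0, s, a
        for _ in range(h2 + 1):             # y_i = sum_{t=h1}^{h1+h2} r_t
            r, s_cur = step(s_cur, a_cur)
            y += r
            a_cur = pi(s_cur)
        data.append((s, a, y))
    return data

def erm_tabular(data, n_states, n_actions):
    """Squared-loss ERM over tabular Q: the minimizer is the sample mean per (s, a)."""
    totals = np.zeros((n_states, n_actions))
    counts = np.zeros((n_states, n_actions))
    for s, a, y in data:
        totals[s, a] += y
        counts[s, a] += 1
    return totals / np.maximum(counts, 1)
```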

**Rollout:**

\(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\;\ldots\)
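The rollout itself, as a small self-contained sketch (the `reset`/`step` sampler interface is again an assumption, and `pi` may be stochastic internally):

```python
def rollout(reset, step, pi, horizon):
    """Sample a trajectory (s_0, a_0, r_0), (s_1, a_1, r_1), ... under pi."""
    s = reset()
    traj = []
    for _ in range(horizon):
        a = pi(s)               # a_t ~ pi(s_t)
        r, s_next = step(s, a)  # r_t ~ r(s_t, a_t), s_{t+1} ~ P(s_t, a_t)
        traj.append((s, a, r))
        s = s_next
    return traj
```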
