CS 4/5789: Introduction to Reinforcement Learning
Lecture 12
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Performance Difference Lemma
2. Supervision via Bellman Eq
3. Supervision via Bellman Opt
4. Function Approximation
Announcements
HW1 due tonight, HW2 released next Monday
5789 Paper Review Assignment (weekly pace suggested)
Prelim Tuesday 3/22 at 7:30pm in Phillips 101
OH cancelled on Wednesday; held instead Thursday 10:30-11:30am

Learning Theory Mentorship Workshop
with the Conference on Algorithmic Learning Theory (ALT)
Virtual, March 14-15, 2022
Application due March 10: https://let-all.com/alt22.html
Recap
Meta-Algorithm for Policy Iteration in Unknown MDP

Approximate Policy Iteration
Greedy Improvement:
\(\pi^{t+1}(s) = \arg\max_a \widehat Q^{t}(s, a)\)
Could oscillate!
Conservative Policy Iteration
Incremental Improvement:
\(\pi'(s) = \arg\max_a \widehat Q^{t}(s, a)\)
\(\pi^{t+1}(a\mid s) = (1-\alpha)\,\pi^{t}(a\mid s) + \alpha\,\pi'(a\mid s)\)
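A minimal sketch of the incremental (conservative) update, assuming a tabular stochastic policy stored as an |S|×|A| array and a fitted \(\widehat Q^t\) as an array of the same shape; the function name `cpi_update` and this representation are illustrative, not part of the lecture.

```python
import numpy as np

def cpi_update(pi, Q_hat, alpha):
    """Conservative policy iteration step (sketch).

    pi    : (S, A) array, current stochastic policy pi^t(a|s)
    Q_hat : (S, A) array, estimated Q^t(s, a)
    alpha : step size in [0, 1]; alpha = 1 recovers the greedy (API) update
    """
    S, A = pi.shape
    # Greedy improvement pi'(s) = argmax_a Q_hat(s, a), written as a one-hot policy
    pi_greedy = np.zeros_like(pi, dtype=float)
    pi_greedy[np.arange(S), Q_hat.argmax(axis=1)] = 1.0
    # Incremental mixture pi^{t+1}(a|s) = (1 - alpha) pi^t(a|s) + alpha pi'(a|s)
    return (1 - alpha) * pi + alpha * pi_greedy
```

With a small \(\alpha\), the new policy stays close to \(\pi^t\), which avoids the oscillation that pure greedy improvement can exhibit when \(\widehat Q^t\) is only approximate.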
Recap
Meta-Algorithm for Policy Iteration in Unknown MDP

- Sample \(h_1=h\) w.p. \(\propto \gamma^h\): \((s_{h_1}, a_{h_1}) = (s_i,a_i) \sim d^\pi_{\mu_0}\)
- Sample \(h_2=h\) w.p. \(\propto \gamma^h\): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
Supervision with Rollout (MC):
\(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\)
\(\widehat Q\) via ERM on \(\{(s_i, a_i, y_i)\}_{i=1}^N\) (see the sketch below the rollout)
Rollout: \(s_t,\; a_t\sim \pi(s_t),\; r_t\sim r(s_t, a_t),\; s_{t+1}\sim P(s_t, a_t),\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
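A minimal sketch of the rollout supervision above, assuming a Gym-style environment (`reset()` returning a state and `step(a)` returning `(s', r, done, info)`) and a function `policy(s)` that samples \(a\sim\pi(\cdot\mid s)\); the name `mc_q_sample` and this interface are assumptions for illustration. The geometric draws implement sampling \(h\) w.p. \(\propto\gamma^h\).

```python
import numpy as np

def mc_q_sample(env, policy, gamma, rng=None):
    """Draw one supervised pair (s_i, a_i, y_i) with E[y_i] = Q^pi(s_i, a_i) (sketch).

    env    : Gym-style environment, reset() -> s and step(a) -> (s', r, done, info),
             treated here as effectively infinite-horizon (discounted setting)
    policy : function s -> a, sampling a ~ pi(.|s)
    gamma  : discount factor in (0, 1)
    """
    rng = rng or np.random.default_rng()
    # Sample h ~ P(h) = (1 - gamma) * gamma^h  (geometric, support {0, 1, 2, ...})
    h1 = rng.geometric(1 - gamma) - 1
    h2 = rng.geometric(1 - gamma) - 1

    # Roll out h1 steps under pi so that (s_{h1}, a_{h1}) ~ d^pi_{mu_0}
    s = env.reset()
    for _ in range(h1):
        s, _, _, _ = env.step(policy(s))
    s_i, a_i = s, policy(s)

    # Label y_i = sum_{t = h1}^{h1 + h2} r_t  (h2 + 1 rewards, starting with a_i)
    y_i, a = 0.0, a_i
    for _ in range(h2 + 1):
        s, r, _, _ = env.step(a)
        y_i += r
        a = policy(s)
    return s_i, a_i, y_i
```

Collecting N such triples and running regression (ERM) on \(\{(s_i, a_i, y_i)\}_{i=1}^N\) yields the estimate \(\widehat Q\) used in the improvement step.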