CS 4/5789: Introduction to Reinforcement Learning
Lecture 12
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Performance Difference Lemma
2. Supervision via Bellman Eq
3. Supervision via Bellman Opt
4. Function Approximation
Announcements
HW1 due tonight, HW2 released next Monday
5789 Paper Review Assignment (weekly pace suggested)
Prelim Tuesday 3/22 at 7:30pm in Phillips 101
OH cancelled on Wednesday; held instead Thursday 10:30-11:30am

Learning Theory Mentorship Workshop
with the Conference on Algorithmic Learning Theory (ALT)
Virtual, March 14-15, 2022
Application due March 10: https://let-all.com/alt22.html
Recap
Meta-Algorithm for Policy Iteration in Unknown MDP

Approximate Policy Iteration
Greedy Improvement:
\(\pi^{t+1}(s) = \arg\max_a \widehat Q^{t}(s, a)\)
Could oscillate!
Conservative Policy Iteration
Incremental Improvement:
\(\pi'(s) = \arg\max_a \widehat Q^{t}(s, a)\)
\(\pi^{t+1}(a\mid s) = (1-\alpha)\,\pi^{t}(a\mid s) + \alpha\,\pi'(a\mid s)\)
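A minimal sketch of the incremental (conservative) update, assuming a tabular stochastic policy stored as an |S|×|A| array and a fitted \(\widehat Q^t\) as an array of the same shape; the function name `cpi_update` and this representation are illustrative, not part of the lecture.

```python
import numpy as np

def cpi_update(pi, Q_hat, alpha):
    """Conservative policy iteration step (sketch).

    pi    : (S, A) array, current stochastic policy pi^t(a|s)
    Q_hat : (S, A) array, estimated Q^t(s, a)
    alpha : step size in [0, 1]; alpha = 1 recovers the greedy (API) update
    """
    S, A = pi.shape
    # Greedy improvement pi'(s) = argmax_a Q_hat(s, a), written as a one-hot policy
    pi_greedy = np.zeros_like(pi, dtype=float)
    pi_greedy[np.arange(S), Q_hat.argmax(axis=1)] = 1.0
    # Incremental mixture pi^{t+1}(a|s) = (1 - alpha) pi^t(a|s) + alpha pi'(a|s)
    return (1 - alpha) * pi + alpha * pi_greedy
```

With a small \(\alpha\), the new policy stays close to \(\pi^t\), which avoids the oscillation that pure greedy improvement can exhibit when \(\widehat Q^t\) is only approximate.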
Recap
Meta-Algorithm for Policy Iteration in Unknown MDP

- Sample \(h_1=h\) w.p. \(\propto \gamma^h\): \((s_{h_1}, a_{h_1}) = (s_i,a_i) \sim d^\pi_{\mu_0}\)
- Sample \(h_2=h\) w.p. \(\propto \gamma^h\): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
Supervision with Rollout (MC):
\(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\)
\(\widehat Q\) via ERM on \(\{(s_i, a_i, y_i)\}_{i=1}^N\) (see the sketch below the rollout)
Rollout: \(s_t,\; a_t\sim \pi(s_t),\; r_t\sim r(s_t, a_t),\; s_{t+1}\sim P(s_t, a_t),\; a_{t+1}\sim \pi(s_{t+1}),\;\dots\)
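A minimal sketch of the rollout supervision above, assuming a Gym-style environment (`reset()` returning a state and `step(a)` returning `(s', r, done, info)`) and a function `policy(s)` that samples \(a\sim\pi(\cdot\mid s)\); the name `mc_q_sample` and this interface are assumptions for illustration. The geometric draws implement sampling \(h\) w.p. \(\propto\gamma^h\).

```python
import numpy as np

def mc_q_sample(env, policy, gamma, rng=None):
    """Draw one supervised pair (s_i, a_i, y_i) with E[y_i] = Q^pi(s_i, a_i) (sketch).

    env    : Gym-style environment, reset() -> s and step(a) -> (s', r, done, info),
             treated here as effectively infinite-horizon (discounted setting)
    policy : function s -> a, sampling a ~ pi(.|s)
    gamma  : discount factor in (0, 1)
    """
    rng = rng or np.random.default_rng()
    # Sample h ~ P(h) = (1 - gamma) * gamma^h  (geometric, support {0, 1, 2, ...})
    h1 = rng.geometric(1 - gamma) - 1
    h2 = rng.geometric(1 - gamma) - 1

    # Roll out h1 steps under pi so that (s_{h1}, a_{h1}) ~ d^pi_{mu_0}
    s = env.reset()
    for _ in range(h1):
        s, _, _, _ = env.step(policy(s))
    s_i, a_i = s, policy(s)

    # Label y_i = sum_{t = h1}^{h1 + h2} r_t  (h2 + 1 rewards, starting with a_i)
    y_i, a = 0.0, a_i
    for _ in range(h2 + 1):
        s, r, _, _ = env.step(a)
        y_i += r
        a = policy(s)
    return s_i, a_i, y_i
```

Collecting N such triples and running regression (ERM) on \(\{(s_i, a_i, y_i)\}_{i=1}^N\) yields the estimate \(\widehat Q\) used in the improvement step.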