CS 4/5789: Introduction to Reinforcement Learning

Lecture 13

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Q Function Approximation

2. Optimization & Gradient Descent

3. Stochastic Gradient Descent

4. Derivative-Free Optimization

Announcements

 

HW2 released next Monday

 

5789 Paper Review Assignment (weekly pace suggested)

 

OH cancelled today; held instead on Thursday 10:30-11:30am

Learning Theory Mentorship Workshop

Application due March 10: https://let-all.com/alt22.html

Prelim Exam

Prelim Tuesday 3/22 at 7:30-9pm in Phillips 101

Closed-book; a definition/equation sheet will be provided for reference

Focus: mainly Unit 1 (known models), but many lectures in Unit 2 revisit key concepts

Study Materials: Lecture Notes 1-15, HW0&1

Lecture on 3/21 will be a review

Recap

Meta-Algorithm for Policy Iteration in Unknown MDP

Supervision with Rollout (MC):

  • Sample \(h_1=h\) w.p. \(\propto \gamma^h\): \((s_{h_1}, a_{h_1}) = (s_i,a_i) \sim d^\pi_{\mu_0}\)
  • Sample \(h_2=h\) w.p. \(\propto \gamma^h\): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)

\(\mathbb{E}[y_i] = Q^\pi(s_i, a_i)\)

\(\widehat Q\) via ERM on \(\{(s_i, a_i, y_i)\}_{i=1}^N\)

Rollout:

\(s_t\)

\(a_t\sim \pi(s_t)\)

\(r_t\sim r(s_t, a_t)\)

\(s_{t+1}\sim P(s_t, a_t)\)

\(a_{t+1}\sim \pi(s_{t+1})\)

...
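The rollout supervision above can be sketched in code. This is a minimal illustration, not the course's implementation: the two-state MDP (`P`, `R`), the policy `pi`, the initial distribution `mu0`, and all function names are hypothetical. Rewards are taken to be deterministic for simplicity, and the function class is assumed to be tabular, in which case ERM with squared loss reduces to averaging the labels in each visited \((s, a)\) cell.

```python
import numpy as np

def sample_discounted_index(gamma, rng):
    """Draw h in {0, 1, 2, ...} with P(h) = (1 - gamma) * gamma**h."""
    # NumPy's geometric distribution is supported on {1, 2, ...}, so shift by one.
    return int(rng.geometric(1.0 - gamma)) - 1

def mc_sample(P, R, pi, mu0, gamma, rng):
    """Return one labeled example (s_i, a_i, y_i) from a single rollout of pi."""
    n_states, n_actions = R.shape
    h1 = sample_discounted_index(gamma, rng)
    h2 = sample_discounted_index(gamma, rng)
    # Roll forward h1 steps so that (s_i, a_i) follows the discounted distribution.
    s = int(rng.choice(n_states, p=mu0))
    a = int(rng.choice(n_actions, p=pi[s]))
    for _ in range(h1):
        s = int(rng.choice(n_states, p=P[s, a]))
        a = int(rng.choice(n_actions, p=pi[s]))
    s_i, a_i, y = s, a, 0.0
    # Sum the undiscounted rewards r_{h1}, ..., r_{h1 + h2}.
    for _ in range(h2 + 1):
        y += R[s, a]
        s = int(rng.choice(n_states, p=P[s, a]))
        a = int(rng.choice(n_actions, p=pi[s]))
    return s_i, a_i, y

def tabular_erm(samples, n_states, n_actions):
    """Least-squares fit over all tables: the mean label in each visited cell."""
    total = np.zeros((n_states, n_actions))
    count = np.zeros((n_states, n_actions))
    for s, a, y in samples:
        total[s, a] += y
        count[s, a] += 1
    # Unvisited cells are left at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)

# Hypothetical two-state, two-action MDP used only for illustration.
rng = np.random.default_rng(0)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
R = np.array([[0.0, 1.0], [0.5, 0.0]])     # R[s, a]
pi = np.full((2, 2), 0.5)                  # uniform random policy
mu0 = np.array([1.0, 0.0])
gamma = 0.8

samples = [mc_sample(P, R, pi, mu0, gamma, rng) for _ in range(1000)]
Q_hat = tabular_erm(samples, 2, 2)
print(Q_hat)
```

The label is unbiased because \(h_2 \geq k\) holds with probability \(\gamma^k\), so the reward \(k\) steps after \((s_i, a_i)\) enters the sum with exactly the weight it has in \(Q^\pi\).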

Recap

Supervision with Bellman Exp (TD):

  • \(y_t =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) \)

If \(\widehat Q = Q^\pi\) then \(\mathbb{E}[y_t] = Q^\pi(s_t, a_t)\)

One step:

\(s_t\)

\(a_t\sim \pi(s_t)\)

\(r_t\sim r(s_t, a_t)\)

\(s_{t+1}\sim P(s_t, a_t)\)

\(a_{t+1}\sim \pi(s_{t+1})\)

Supervision with Bellman Opt (TD):

  • \(y_t =r_t + \gamma \max_a \widehat Q(s_{t+1},a) \)

If \(\widehat Q = Q^*\) then \(\mathbb{E}[y_t] = Q^*(s_t, a_t)\)

SARSA and Q-learning are simple tabular algorithms
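As a rough sketch of how the two tabular algorithms differ, the snippet below computes both TD targets for a single transition \((s_t, a_t, r_t, s_{t+1}, a_{t+1})\): SARSA uses the Bellman expectation target \(r_t + \gamma \widehat Q(s_{t+1}, a_{t+1})\), and Q-learning uses the Bellman optimality target \(r_t + \gamma \max_a \widehat Q(s_{t+1}, a)\). The function names, the step size `alpha`, and the numbers are assumptions made for illustration; the update shown is the standard step-size form, in which the table entry moves a fraction `alpha` toward the target.

```python
import numpy as np

def sarsa_target(Q, r, s_next, a_next, gamma):
    """Bellman expectation target: r + gamma * Q(s', a')."""
    return r + gamma * Q[s_next, a_next]

def q_learning_target(Q, r, s_next, gamma):
    """Bellman optimality target: r + gamma * max_a Q(s', a)."""
    return r + gamma * np.max(Q[s_next])

def td_update(Q, s, a, y, alpha):
    """Move Q[s, a] a fraction alpha toward the target y (in place)."""
    Q[s, a] += alpha * (y - Q[s, a])
    return Q

# Hypothetical table and transition, chosen so the two targets differ.
Q = np.array([[0.0, 0.0],
              [1.0, 3.0]])
s, a, r, s_next, a_next = 0, 1, 0.5, 1, 0
gamma, alpha = 0.9, 0.5

y_sarsa = sarsa_target(Q, r, s_next, a_next, gamma)   # approximately 1.4
y_qlearn = q_learning_target(Q, r, s_next, gamma)     # approximately 3.2

Q_sarsa = td_update(Q.copy(), s, a, y_sarsa, alpha)
Q_qlearn = td_update(Q.copy(), s, a, y_qlearn, alpha)
print(Q_sarsa[s, a], Q_qlearn[s, a])
```

The only difference is which next-state value is bootstrapped: SARSA needs the action \(a_{t+1}\) actually drawn from \(\pi\), while Q-learning ignores it and maximizes over actions.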
