CS 4/5789: Introduction to Reinforcement Learning

Lecture 11

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

1. Supervision via Rollouts

2. Approximate Policy Iteration

3. Performance Difference Lemma

4. Conservative Policy Iteration

Announcements


HW0 grades and solutions released

HW1 due Monday 3/7


5789 Paper Review Assignment (weekly pace suggested)


Prelim Tuesday 3/22 at 7:30pm in Phillips 101


Office hours after lecture M (110 Hollister) and W (416A Gates)

Wednesday OH prioritizes 1-1 questions/concerns over HW

Recap

Model-based RL (Meta-Algorithm)

  1. Sample and record $s_i' \sim P(s_i, a_i)$
  2. Estimate $\widehat P$ from $\{(s_i', s_i, a_i)\}_{i=1}^N$
  3. Design $\widehat \pi$ from $\widehat P$

Tabular MBRL

  1. Sample: each $(s, a)$ evenly, $\frac{N}{SA}$ times
  2. Estimate: by averaging
  3. Design: policy iteration

Using the Simulation Lemma, we saw that an $\epsilon$-optimal policy requires $N \gtrsim \frac{S^2 A}{\epsilon^2}$
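
A minimal NumPy sketch of the tabular estimation step (averaging observed next states into $\widehat P$); the function name and data layout below are illustrative assumptions, not from the lecture.

import numpy as np

def estimate_transitions(transitions, S, A):
    # transitions: list of (s, a, s') tuples, e.g. N/(S*A) samples per (s, a) pair
    # returns an (S, A, S) array of empirical transition probabilities P_hat
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # unvisited (s, a) pairs fall back to a uniform distribution
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat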

MBRL for LQR

  1. Sample: $s_i \sim \mathcal N(0, \sigma^2 I)$ and $a_i \sim \mathcal N(0, \sigma^2 I)$
  2. Estimate: by least-squares linear regression
    $\displaystyle \widehat A, \widehat B = \arg\min_{A,B} \sum_{i=1}^N \|s_i' - A s_i - B a_i\|_2^2$
  3. Design: LQR dynamic programming
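
A minimal sketch of the least-squares estimation step above, assuming the transitions are stacked row-wise in NumPy arrays; the names and shapes are illustrative assumptions.

import numpy as np

def fit_linear_dynamics(states, actions, next_states):
    # states: (N, n_s), actions: (N, n_a), next_states: (N, n_s)
    # solves min_{A,B} sum_i ||s_i' - A s_i - B a_i||_2^2
    n_s = states.shape[1]
    X = np.hstack([states, actions])                          # regressors [s_i, a_i], shape (N, n_s + n_a)
    Theta, *_ = np.linalg.lstsq(X, next_states, rcond=None)   # (n_s + n_a, n_s)
    A_hat = Theta[:n_s].T                                     # (n_s, n_s)
    B_hat = Theta[n_s:].T                                     # (n_s, n_a)
    return A_hat, B_hat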

Recap

We will not cover the argument in detail, but an $\epsilon$-optimal policy requires $N \gtrsim \frac{n_s + n_a}{\epsilon^2}$

Recap: Policy Iteration

Policy Evaluation uses knowledge of the transition function $P$ and reward function $r$

initialize pi[0]
for t=0,1,2,...
    Q[t] = PolicyEvaluation(pi[t])      # Evaluation
    pi[t+1](s) = argmax_a Q[t](s, a)    # Improvement
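
For concreteness, a runnable NumPy sketch of tabular policy iteration matching the pseudocode above, assuming $P$ (shape $S \times A \times S$), $r$ (shape $S \times A$), and discount $\gamma$ are known; evaluation is exact via a linear solve, and the interface is an illustrative assumption.

import numpy as np

def policy_iteration(P, r, gamma, iters=100):
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                     # initialize pi[0]
    for _ in range(iters):
        # Evaluation: solve (I - gamma * P_pi) V = r_pi for V^pi
        P_pi = P[np.arange(S), pi]                  # (S, S)
        r_pi = r[np.arange(S), pi]                  # (S,)
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Improvement: greedy policy w.r.t. Q(s, a) = r(s, a) + gamma * E[V(s')]
        Q = r + gamma * P @ V                       # (S, A)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):              # converged
            return new_pi, Q
        pi = new_pi
    return pi, Q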

Can we estimate the Q function from sampled data?
