### Prof. Sarah Dean

Assistant Professor of Computer Science, Cornell University

MW 2:45-4pm

110 Hollister Hall

0. Announcements & Recap

1. Supervision via Rollouts

2. Approximate Policy Iteration

3. Performance Difference Lemma

4. Conservative Policy Iteration

HW0 grades and solutions released

HW1 due Monday 3/7

5789 Paper Review Assignment (weekly pace *suggested*)

Prelim Tuesday 3/22 at 7:30pm in Phillips 101

Office hours after lecture M (110 Hollister) and W (416A Gates)

Wednesday OH prioritizes 1-1 questions/concerns over HW

**Model-based RL (Meta-Algorithm)**

- Sample and record \(s_i'\sim P(s_i, a_i)\)
- Estimate \(\widehat P\) from \(\{(s_i',s_i, a_i)\}_{i=1}^N\)
- Design \(\widehat \pi\) from \(\widehat P\)

**Tabular MBRL**

- Sample: evenly each \(s,a\) \(\frac{N}{SA}\) times
- Estimate: by averaging
- Design: policy iteration
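The "estimate by averaging" step above can be sketched as follows — a minimal illustration (function name and signature are assumptions, not from the lecture) of building \(\widehat P\) from empirical transition frequencies:

```python
import numpy as np

def estimate_transitions(samples, S, A):
    """Estimate P_hat[s, a, s'] by averaging: the empirical frequency of
    landing in s' after taking a in s, over the recorded samples.

    samples: iterable of (s, a, s_next) triples, e.g. with each (s, a)
    pair sampled N/(S*A) times under the even-sampling scheme.
    """
    counts = np.zeros((S, A, S))
    for s, a, s_next in samples:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # Unvisited (s, a) pairs fall back to a uniform distribution.
    return np.divide(counts, totals,
                     out=np.full_like(counts, 1.0 / S),
                     where=totals > 0)
```

Each row \(\widehat P(\cdot \mid s, a)\) is a probability distribution, so the design step (policy iteration) can treat \(\widehat P\) exactly like a true transition function.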

Using the **Simulation Lemma**, we saw that an \(\epsilon\)-optimal policy requires \(N\gtrsim \frac{S^2 A}{\epsilon^2}\)

**MBRL for LQR**

- Sample: \(s_i\sim \mathcal N(0,\sigma^2 I)\) and \(a_i\sim \mathcal N(0,\sigma^2I)\)
- Estimate: by least-squares linear regression

\(\displaystyle \widehat A,\widehat B = \arg\min_{A,B} \sum_{i=1}^N\|s_i' - As_i-Ba_i\|_2^2\)

- Design: LQR dynamic programming

We will not cover the argument in detail, but an \(\epsilon\)-optimal policy requires \(N\gtrsim \frac{n_s+n_a}{\epsilon^2}\)
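The least-squares estimate of \((\widehat A, \widehat B)\) has a closed form: stacking \(z_i = [s_i; a_i]\) turns the objective into a single linear regression \(s_i' \approx [A\;B]\, z_i\). A minimal sketch (the function name is illustrative):

```python
import numpy as np

def estimate_dynamics(states, actions, next_states):
    """Least-squares fit of s' ~ A s + B a from N transition samples.

    states: (N, n_s), actions: (N, n_a), next_states: (N, n_s).
    Returns (A_hat, B_hat).
    """
    # Stack regressors so that s_i' ~ [A B] @ [s_i; a_i].
    Z = np.hstack([states, actions])               # (N, n_s + n_a)
    Theta, *_ = np.linalg.lstsq(Z, next_states, rcond=None)
    AB = Theta.T                                   # (n_s, n_s + n_a)
    n_s = states.shape[1]
    return AB[:, :n_s], AB[:, n_s:]
```

With Gaussian excitation as in the sampling step, \(Z^\top Z\) is well-conditioned with high probability once \(N\) exceeds roughly \(n_s + n_a\), which is where the sample-complexity scaling comes from.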

Policy Evaluation uses knowledge of the transition function \(P\) and the reward function \(r\)

```
initialize pi[0]
for t = 0, 1, 2, ...
    Q[t] = PolicyEvaluation(pi[t])   # Evaluation
    pi[t+1] = argmax_a Q[t](:, a)    # Improvement
```
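Since \(P\) and \(r\) are known here, the evaluation step can solve the Bellman equations exactly as a linear system. A minimal runnable sketch of the loop above in the tabular discounted setting (helper names are illustrative):

```python
import numpy as np

def policy_evaluation(P, r, pi, gamma):
    """Exact Q^pi: solve (I - gamma * P_pi) V = r_pi, then one backup.

    P: (S, A, S) transition tensor, r: (S, A) rewards,
    pi: (S,) deterministic policy, gamma: discount factor.
    """
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]                     # (S, S) chain under pi
    r_pi = r[np.arange(S), pi]                     # (S,) rewards under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    # Q(s, a) = r(s, a) + gamma * sum_{s'} P(s, a, s') V(s')
    return r + gamma * np.einsum('sap,p->sa', P, V)

def policy_iteration(P, r, gamma, iters=100):
    S = P.shape[0]
    pi = np.zeros(S, dtype=int)                    # initialize pi[0]
    Q = policy_evaluation(P, r, pi, gamma)
    for _ in range(iters):
        Q = policy_evaluation(P, r, pi, gamma)     # Evaluation
        pi_new = Q.argmax(axis=1)                  # Improvement
        if np.array_equal(pi_new, pi):             # converged: pi is greedy w.r.t. Q^pi
            break
        pi = pi_new
    return pi, Q
```

In the tabular setting this terminates in finitely many iterations, since each improvement step strictly increases the value until the greedy policy fixes itself.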

Can we estimate the Q function from sampled data?
