Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
0. Announcements & Recap
1. Supervision via Rollouts
2. Approximate Policy Iteration
3. Performance Difference Lemma
4. Conservative Policy Iteration
HW0 grades and solutions released
HW1 due Monday 3/7
5789 Paper Review Assignment (weekly pace suggested)
Prelim Tuesday 3/22 at 7:30pm in Phillips 101
Office hours after lecture M (110 Hollister) and W (416A Gates)
Wednesday OH prioritize 1-1 questions/concerns over HW
Model-based RL (Meta-Algorithm)
Tabular MBRL
Using the Simulation Lemma, we saw that an \(\epsilon\)-optimal policy requires \(N\gtrsim \frac{S^2 A}{\epsilon^2}\) samples
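A minimal sketch of the tabular model-estimation step behind this bound, assuming generative-model access; sample_next_state and sample_reward are hypothetical samplers drawing N i.i.d. samples per state-action pair:

import numpy as np

def estimate_model(sample_next_state, sample_reward, S, A, N):
    # Empirical transition and reward estimates from N samples per (s, a).
    P_hat = np.zeros((S, A, S))
    r_hat = np.zeros((S, A))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                P_hat[s, a, sample_next_state(s, a)] += 1.0 / N
                r_hat[s, a] += sample_reward(s, a) / N
    return P_hat, r_hat  # then plan exactly in the estimated MDP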
MBRL for LQR
We will not cover the argument in detail, but an \(\epsilon\)-optimal policy requires \(N\gtrsim \frac{n_s+n_a}{\epsilon^2}\) samples
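A sketch of the certainty-equivalence approach this refers to, assuming logged transition data; fit_dynamics and lqr_gain are illustrative names, and SciPy's Riccati solver stands in for the planning step:

import numpy as np
from scipy.linalg import solve_discrete_are

def fit_dynamics(states, actions, next_states):
    # Least-squares fit of x' ~ A x + B u from logged transitions.
    # states: (N, n_s), actions: (N, n_a), next_states: (N, n_s).
    Z = np.hstack([states, actions])              # (N, n_s + n_a)
    theta, *_ = np.linalg.lstsq(Z, next_states, rcond=None)
    n_s = states.shape[1]
    return theta[:n_s].T, theta[n_s:].T           # A_hat, B_hat

def lqr_gain(A, B, Q, R):
    # Certainty-equivalent LQR: solve the Riccati equation for the
    # estimated model, returning the gain of the policy u = K x.
    P = solve_discrete_are(A, B, Q, R)
    return -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)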
Policy Evaluation uses knowledge of transition function \(P\) and reward function \(r\)
initialize pi[0]
for t = 0, 1, 2, ...
    Q[t] = PolicyEvaluation(pi[t])       # Evaluation
    pi[t+1](s) = argmax_a Q[t](s, a)     # Improvement
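A runnable version of this loop for the tabular case, sketched under the assumption that P (shape S x A x S) and r (shape S x A) are known; evaluation solves the Bellman system exactly:

import numpy as np

def policy_evaluation(P, r, pi, gamma):
    # Exact Q^pi from the model: solve (I - gamma P_pi) V = r_pi.
    S, A = r.shape
    P_pi = P[np.arange(S), pi]                    # (S, S) transitions under pi
    r_pi = r[np.arange(S), pi]                    # (S,) rewards under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ V                      # Q(s, a), shape (S, A)

def policy_iteration(P, r, gamma, T=100):
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                   # initialize pi[0]
    for _ in range(T):
        Q = policy_evaluation(P, r, pi, gamma)    # Evaluation
        pi_next = Q.argmax(axis=1)                # Improvement
        if np.array_equal(pi_next, pi):           # policy stable: done
            break
        pi = pi_next
    return pi, Q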
Can we estimate the Q function from sampled data?
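One answer, previewing the rollout-based supervision from the outline: Monte Carlo rollouts. A sketch assuming a hypothetical sampler env_step(s, a) -> (reward, next_state) and a policy pi mapping states to actions:

def rollout_q_estimate(env_step, pi, s, a, gamma, horizon, n_rollouts):
    # Monte Carlo estimate of Q^pi(s, a): average truncated discounted returns.
    total = 0.0
    for _ in range(n_rollouts):
        state, action = s, a
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            reward, state = env_step(state, action)
            ret += disc * reward
            disc *= gamma
            action = pi(state)
        total += ret
    return total / n_rollouts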