Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Value Iteration
3. Fitted VI
4. Online Fitted VI
5. Finite Horizon
[Recap diagram: the agent-environment interaction loop. The policy \(\pi\) maps states \(s_t\) to actions \(a_t\); the environment returns rewards \(r_t\) and next states via the transitions \(P, f\), which are unknown in Unit 2. Experience is collected as data \((s_t, a_t, r_t)\).]
1. Recap
2. Value Iteration
3. Fitted VI
4. Online Fitted VI
5. Finite Horizon
Value Iteration
For an infinite-horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation; this is a fixed-point algorithm.
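For reference, the corresponding update on the value function, written to match the Q-value version on the next slide:

\[ V^{k+1}(s) = \max_{a\in\mathcal A}\left\{ r(s, a) + \gamma\,\mathbb{E}_{s' \sim P(s, a)}\left[ V^{k}(s') \right]\right\} \quad \text{for all } s\in\mathcal S \]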
Q-Value Iteration
For an infinite-horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation to the Q function; this is again a fixed-point algorithm.
Which step requires a model?
Q-Value Iteration: for all \(s,a\)
\(Q^{k+1}(s,a) = r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{k}(s',a') \right]\)
We can't evaluate the expectation without the model \(P\), or when \(\mathcal S\) and/or \(\mathcal A\) are too large.
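For contrast, a minimal tabular sketch of the update above when the model is available, assuming arrays `P[s, a, s']` and `r[s, a]` (names are illustrative, not from the slides); the line using `P` is exactly the step that requires the model:

```python
import numpy as np

def q_value_iteration(P, r, gamma, num_iters=100):
    """Tabular Q-value iteration with a known model: P[s, a, s'] holds
    transition probabilities and r[s, a] the expected rewards."""
    num_states, num_actions = r.shape
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        V = Q.max(axis=1)        # max_{a'} Q^k(s', a') for every s'
        Q = r + gamma * (P @ V)  # expectation over s' ~ P(s, a); needs the model
    return Q
```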
Instead we can collect data:
"Roll out" data collection policy \(\pi_{\text {data }}\)
Collect states and actions \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots \right\}\)
How can we use this?
\(s_t\)
\(a_t\sim \pi(s_t)\)
\(r_t\sim r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
\(a_{t+1}\sim \pi(s_{t+1})\)
...
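A sketch of this rollout, assuming a Gym-style environment interface (`env.reset()`, `env.step(a)` returning `(s_next, r, done, info)`) and a stochastic policy `pi_data(s)`; these names are illustrative assumptions:

```python
def rollout(env, pi_data, num_steps):
    """Roll out the data collection policy and record (s, a, r, s') tuples."""
    data = []
    s = env.reset()
    for t in range(num_steps):
        a = pi_data(s)                        # a_t ~ pi_data(s_t)
        s_next, r, done, info = env.step(a)   # r_t and s_{t+1} ~ P(s_t, a_t)
        data.append((s, a, r, s_next))
        s = env.reset() if done else s_next
    return data
```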
1. Recap
2. Value Iteration
3. Fitted VI
4. Online Fitted VI
5. Finite Horizon
Fitted Q-Value Iteration
Fixed point iteration using a fixed dataset and supervised learning
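A minimal sketch of fitted Q-iteration on a fixed dataset of \((s, a, r, s')\) tuples, assuming discrete scalar actions and a generic supervised regressor (scikit-learn style `fit`/`predict`; the function class here is only an example):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def fitted_q_iteration(data, actions, gamma, num_iters=50):
    """Fitted Q-iteration: repeatedly regress a Q estimate onto Bellman
    targets built from a fixed dataset of (s, a, r, s') tuples."""
    S = np.array([s for s, a, r, s_next in data]).reshape(len(data), -1)
    A = np.array([a for s, a, r, s_next in data]).reshape(len(data), 1)
    R = np.array([r for s, a, r, s_next in data])
    S_next = np.array([s_next for s, a, r, s_next in data]).reshape(len(data), -1)

    X = np.hstack([S, A])            # regression inputs: (state, action)
    targets = R                      # first iteration fits Q^1(s, a) ~ r
    for _ in range(num_iters):
        model = RandomForestRegressor(n_estimators=50).fit(X, targets)
        # new targets: r + gamma * max_{a'} Q^k(s', a')
        q_next = np.column_stack([
            model.predict(np.hstack([S_next, np.full((len(data), 1), a)]))
            for a in actions
        ])
        targets = R + gamma * q_next.max(axis=1)
    return model
```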
1. Recap
2. Value Iteration
3. Fitted VI
4. Online Fitted VI
5. Finite Horizon
[Diagram: the interaction loop (state, action, policy, data) with a policy \(\bar\pi(s)\) choosing actions from \(\mathcal A\).]
Two ways to represent \(Q(s,a)\):
1. Tabular: a table with one entry per pair in \(\mathcal S \times \mathcal A\)
2. Parametric, e.g. deep (PA 3): a function from \(\mathcal S \times \mathcal A\) to values
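To make the two representations concrete, a small illustrative sketch (sizes and features are placeholders, not from the slides):

```python
import numpy as np

# 1. Tabular: one entry per (state, action) pair, indexed by S x A;
#    feasible only when both sets are small enough to enumerate.
num_states, num_actions = 10, 4
Q_table = np.zeros((num_states, num_actions))
Q_table[3, 1] = 0.5                       # the entry Q(s=3, a=1)

# 2. Parametric, e.g. deep (PA 3): Q_theta(s, a) depends on learned
#    parameters; a linear-in-features example stands in for a network here.
def phi(s, a):
    return np.array([1.0, s, a, s * a])   # illustrative feature map

theta = np.zeros(4)

def Q_param(s, a):
    return theta @ phi(s, a)
```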
Online Fitted Q-Value Iteration (Q-learning)
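A minimal tabular sketch of online fitted Q-value iteration (Q-learning), assuming a Gym-style environment interface and \(\varepsilon\)-greedy data collection; the interface, step size, and exploration rate are illustrative assumptions:

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma,
               alpha=0.1, eps=0.1, num_steps=10_000):
    """Online tabular Q-learning: each collected transition nudges Q(s, a)
    toward its sampled Bellman target."""
    Q = np.zeros((num_states, num_actions))
    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy action from the current Q estimate
        if np.random.rand() < eps:
            a = np.random.randint(num_actions)
        else:
            a = int(Q[s].argmax())
        s_next, r, done, info = env.step(a)        # assumed Gym-style step
        target = r + gamma * Q[s_next].max()       # sampled Bellman target
        Q[s, a] += alpha * (target - Q[s, a])      # move toward the target
        s = env.reset() if done else s_next
    return Q
```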
1. Recap
2. Value Iteration
3. Fitted VI
4. Online Fitted VI
5. Finite Horizon
Fitted Q-Value Iteration (finite horizon)
Similar extension for online data collection
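As a sketch of the recursion that the fitted version approximates with regression, using one common convention (undiscounted, with \(Q_H \equiv 0\)), each step builds targets from the next step's Q function in a single backward pass:

\[ Q_h(s,a) = r(s,a) + \mathbb{E}_{s'\sim P(s,a)}\left[\max_{a'\in\mathcal A} Q_{h+1}(s',a')\right], \qquad h = H-1,\dots,0. \]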