Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Two concerns of Data Feedback
[Figure: the data feedback loop — the policy π selects action a_t; the environment, with transitions P, f unknown in Unit 2, returns state s_t and reward r_t; the resulting data (s_t, a_t, r_t), i.e. experience, feeds back to update the policy.]
A rollout generates experience: … , s_t, a_t ∼ π(s_t), r_t ∼ r(s_t, a_t), s_{t+1} ∼ P(s_t, a_t), a_{t+1} ∼ π(s_{t+1}), …
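This rollout is the basic data-collection primitive used throughout. A minimal sketch, assuming a hypothetical env object whose reset() returns a state and whose step(a) returns (next_state, reward) (names and interface are assumptions, not from the slides):

```python
def rollout(env, policy, horizon):
    """Collect one trajectory [(s_t, a_t, r_t), ...] by following `policy` for `horizon` steps."""
    trajectory = []
    s = env.reset()                  # initial state s_0
    for t in range(horizon):
        a = policy(s)                # a_t ~ pi(s_t)
        s_next, r = env.step(a)      # assumed interface: returns s_{t+1} ~ P(s_t, a_t) and r_t
        trajectory.append((s, a, r))
        s = s_next
    return trajectory
```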
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
[Figure: example MDP with states {0, 1} and actions {stay, switch}; edges labeled with probabilities 1, 1 − p, 1 − 2p, p, and 2p. A rollout … , s_t, a_t ∼ π(s_t), r_t ∼ r(s_t, a_t), s_{t+1} ∼ P(s_t, a_t), a_{t+1} ∼ π(s_{t+1}), … is generated by following π in this MDP.]
The label is biased
E[yᵢ ∣ sᵢ, aᵢ] − Q⋆(sᵢ, aᵢ) = γ E_{s′ ∼ P(sᵢ, aᵢ)}[ max_a Q̂(s′, a) − max_{a′} Q⋆(s′, a′) ]
Sources of variance: one step of P and π
Off-policy: roll out with π and estimate Q⋆
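The label in question is the Bellman-style target yᵢ = rᵢ + γ max_a Q̂(s′ᵢ, a). A minimal sketch of computing these labels from transition tuples, assuming a tabular Q̂ stored as an |S| × |A| NumPy array (the storage format is an assumption, not the slides' notation):

```python
import numpy as np

def bellman_labels(batch, Q_hat, gamma):
    """Labels y_i = r_i + gamma * max_a Q_hat[s'_i, a] for a batch of (s, a, r, s') tuples.

    The bias above comes from plugging in Q_hat instead of Q* at the next state;
    the variance comes from the single sampled next state (one step of P and pi).
    """
    ys = []
    for (s, a, r, s_next) in batch:
        ys.append(r + gamma * np.max(Q_hat[s_next]))   # max over actions at s'
    return np.array(ys)
```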
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
[Figure: agent–environment loop — the policy produces actions; the environment returns states and rewards; the resulting data/experience feeds back into the policy.]
Key components of a value-based RL algorithm:
Different choices for these components lead to different algorithms
1. Tabular
2. Parametric, e.g. deep (PA 3)
[Figure: Q(s, a) stored as a table indexed by states S and actions A.]
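A minimal sketch of the two representations, with a made-up state/action count and a placeholder feature map phi (both are illustrative assumptions):

```python
import numpy as np

num_states, num_actions = 10, 4

# 1. Tabular: Q is an |S| x |A| array with one independent entry per (s, a).
Q_table = np.zeros((num_states, num_actions))

def q_tabular(s, a):
    return Q_table[s, a]

# 2. Parametric: Q is a function of (s, a) through shared parameters theta,
#    e.g. a linear model over features phi(s, a); deep Q replaces this with a network.
feature_dim = 8
theta = np.zeros(feature_dim)

def phi(s, a):
    # placeholder features, deterministic in (s, a), purely for illustration
    rng = np.random.default_rng(s * num_actions + a)
    return rng.standard_normal(feature_dim)

def q_parametric(s, a):
    return phi(s, a) @ theta
```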
PI with MC ("Monte Carlo"): Initialize arbitrary π0, then for iterations i:
[Figure: policy improvement — the improved policy π̄(s) selected over the action set A.]
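A minimal sketch of this loop under standard assumptions (the slide's exact steps are not in the extracted text): estimate Q^{π_i} from Monte Carlo returns along rollouts of π_i, then improve greedily. It reuses the rollout helper sketched earlier and assumes small finite S and A with integer states:

```python
import numpy as np
from collections import defaultdict

def policy_iteration_mc(env, num_states, num_actions, gamma, horizon, iters, rollouts_per_iter):
    pi = np.zeros(num_states, dtype=int)                     # arbitrary pi_0
    Q_hat = np.zeros((num_states, num_actions))
    for i in range(iters):
        returns = defaultdict(list)
        for _ in range(rollouts_per_iter):
            traj = rollout(env, lambda s: pi[s], horizon)    # roll out pi_i (helper from earlier)
            G = 0.0
            for (s, a, r) in reversed(traj):                 # discounted return from each visit
                G = r + gamma * G
                returns[(s, a)].append(G)
        Q_hat = np.zeros((num_states, num_actions))
        for (s, a), gs in returns.items():                   # Monte Carlo estimate of Q^{pi_i}
            Q_hat[s, a] = np.mean(gs)
        pi = np.argmax(Q_hat, axis=1)                        # greedy policy improvement
    return pi, Q_hat
```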
PI with TD: Initialize arbitrary π0, Q0, then for iterations i:
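A sketch of the analogous loop with temporal-difference labels in place of full-trajectory returns, again filling in standard steps not present in the extracted text (the env interface and step size alpha are assumptions):

```python
import numpy as np

def policy_iteration_td(env, num_states, num_actions, gamma, alpha, horizon, iters):
    pi = np.zeros(num_states, dtype=int)               # arbitrary pi_0
    Q = np.zeros((num_states, num_actions))            # arbitrary Q_0
    for i in range(iters):
        s = env.reset()
        for t in range(horizon):
            a = pi[s]                                   # act with the current policy
            s_next, r = env.step(a)
            y = r + gamma * Q[s_next, pi[s_next]]       # TD label bootstraps Q at (s', pi(s'))
            Q[s, a] += alpha * (y - Q[s, a])            # move Q toward the label
            s = s_next
        pi = np.argmax(Q, axis=1)                       # greedy improvement
    return pi, Q
```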
Q-learning: Initialize arbitrary π0, Q0, then for iterations i:
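Q-learning swaps in the max-based label from earlier, which is why it can estimate Q⋆ off-policy. A sketch with an ε-greedy behavior policy (the exploration scheme is an assumption, not stated in the extracted slides):

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma, alpha, epsilon, horizon, iters, seed=0):
    Q = np.zeros((num_states, num_actions))             # arbitrary Q_0
    rng = np.random.default_rng(seed)
    for i in range(iters):
        s = env.reset()
        for t in range(horizon):
            if rng.random() < epsilon:                   # exploratory (off-policy) behavior
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r = env.step(a)
            y = r + gamma * np.max(Q[s_next])            # label y = r + gamma * max_a Q(s', a)
            Q[s, a] += alpha * (y - Q[s, a])
            s = s_next
    return np.argmax(Q, axis=1), Q
```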
1. PI with MC
2. PI with TD
3. Q-learning
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
[Figure: objective J(θ) plotted against θ, with the minimizer θ⋆ marked; code fragment shown: np.amin(J, axis=1).]
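The preview concerns finding the θ⋆ that minimizes J(θ); reading the minimum off a grid of evaluations (as the np.amin fragment suggests) does not scale, so the upcoming unit turns to iterative methods. A generic sketch, assuming access to a gradient of J (the specific algorithm covered next is not part of this extract):

```python
import numpy as np

def gradient_descent(grad_J, theta0, step_size, num_steps):
    """Iteratively move theta opposite the gradient of J."""
    theta = np.array(theta0, dtype=float)
    for _ in range(num_steps):
        theta = theta - step_size * grad_J(theta)
    return theta

# Toy usage: J(theta) = (theta - 2)^2 has minimizer theta* = 2.
theta_star = gradient_descent(lambda th: 2 * (th - 2.0), theta0=[0.0], step_size=0.1, num_steps=200)
```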