Sarah Dean
Assistant professor in CS at Cornell
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Two concerns of Data Feedback
[Figure: agent-environment loop. Labels: policy \(\pi\); action \(a_t\); state \(s_t\); reward \(r_t\); transitions \(P,f\) (unknown in Unit 2); experience; data \((s_t,a_t,r_t)\).]
Rollout: \(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\quad \dots\)
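The rollout above can be simulated in a few lines. This is a minimal sketch, assuming a small tabular MDP: the arrays `P`, `R`, `pi`, the function `rollout`, and all numbers are hypothetical and not from the lecture, and rewards are taken to be deterministic for simplicity.

```python
import numpy as np

def rollout(P, R, pi, s0, T, rng):
    """Sample (s_t, a_t, r_t) for t = 0..T-1.

    P[s, a] is a distribution over next states, R[s, a] a reward,
    pi[s] a distribution over actions (all hypothetical inputs).
    """
    n_states, n_actions = R.shape
    s, data = s0, []
    for _ in range(T):
        a = int(rng.choice(n_actions, p=pi[s]))    # a_t ~ pi(s_t)
        r = float(R[s, a])                         # r_t = r(s_t, a_t), deterministic here
        data.append((s, a, r))
        s = int(rng.choice(n_states, p=P[s, a]))   # s_{t+1} ~ P(s_t, a_t)
    return data

# Toy 2-state, 2-action MDP with made-up numbers.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.2, 0.8], [0.4, 0.6]]])
R = np.array([[1.0, 0.5], [0.0, -0.5]])
pi = np.full((2, 2), 0.5)                          # uniform random policy
traj = rollout(P, R, pi, s0=0, T=5, rng=np.random.default_rng(0))
```

Each entry of `traj` is one tuple \((s_t,a_t,r_t)\), the same form as the data collected from experience.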
2. Labels via Bellman
[Figure: two-state MDP on states \(0\) and \(1\) with actions stay and switch. Edge labels: stay: \(1\); switch: \(1\); stay: \(1-p\), switch: \(1-2p\); stay: \(p\), switch: \(2p\).]
The label is biased
With the label \(y_i = r_i + \gamma\max_a \hat Q(s'_i,a)\), where \(s'_i\sim P(s_i,a_i)\):
\(\mathbb E[y_i|s_i, a_i]-Q^\star(s_i,a_i) = \gamma \mathbb E_{s'\sim P(s_i,a_i)}\big[\max_a \hat Q(s',a)-\max_{a'}Q^\star(s',a')\big]\)
Sources of variance: one step of \(P\) and \(\pi\)
Off policy: rollout with \(\pi\) and estimate \(Q^\star\)
Rollout: \(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\quad \dots\)
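To make the off-policy labels concrete, here is a minimal sketch of how they could be computed from logged transitions. It assumes the label has the form \(y_i = r_i + \gamma\max_a \hat Q(s'_i,a)\), which is the form implied by the bias expression above. The function `q_learning_labels`, the table `Q_hat`, and the sample transitions are hypothetical.

```python
import numpy as np

def q_learning_labels(transitions, Q_hat, gamma):
    """Return y_i = r_i + gamma * max_a Q_hat[s'_i, a] for each (s, a, r, s')."""
    r = np.array([t[2] for t in transitions], dtype=float)
    s_next = np.array([t[3] for t in transitions], dtype=int)
    # The max over actions replaces the action the rollout policy would take,
    # so the label does not depend on pi beyond the logged (s, a).
    return r + gamma * Q_hat[s_next].max(axis=1)

# Made-up current estimate and two logged transitions (s, a, r, s').
Q_hat = np.array([[1.0, 2.0],
                  [0.0, -1.0]])
transitions = [(0, 1, 0.5, 1), (1, 0, 0.0, 0)]
y = q_learning_labels(transitions, Q_hat, gamma=0.9)  # approximately [0.5, 1.8]
```

The labels are only as good as `Q_hat`: wherever it differs from \(Q^\star\), the difference enters the label scaled by \(\gamma\), which is the bias described above.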
3. (Q) Value-based RL
[Figure: agent-environment loop. Labels: action; state, reward; policy; data; experience.]
Key components of a value-based RL algorithm:
Different choices for these components lead to different algorithms
1. Tabular
2. Parametric, e.g. deep (PA 3)
[Figure: \(Q(s,a)\) as a table indexed by \(\mathcal S\) and \(\mathcal A\), and as a function taking inputs from \(\mathcal S\) and \(\mathcal A\).]
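The two representations listed above can be sketched side by side. This is an illustration only: the sizes, the one-hot feature map `phi`, and the linear form \(Q_\theta(s,a)=\theta^\top\phi(s,a)\) are assumptions standing in for a deep network, not the setup used in PA 3.

```python
import numpy as np

n_states, n_actions = 4, 2

# 1. Tabular: one stored number per (s, a) pair.
Q_table = np.zeros((n_states, n_actions))
Q_table[2, 1] = 3.0

def q_tabular(s, a):
    return Q_table[s, a]

# 2. Parametric: Q_theta(s, a) = theta . phi(s, a).
def phi(s, a):
    """One-hot feature of the (s, a) pair (hypothetical choice)."""
    x = np.zeros(n_states * n_actions)
    x[s * n_actions + a] = 1.0
    return x

def q_parametric(s, a, theta):
    return float(theta @ phi(s, a))

# With one-hot features, the linear model reproduces the table exactly.
theta = Q_table.ravel()
```

With richer features or a neural network in place of `phi`, the parameter count no longer scales with \(|\mathcal S||\mathcal A|\), at the cost of updates to one \((s,a)\) pair also changing the values of others.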
Initialize arbitrary \(\pi_0\), then for iterations \(i\):
("Monte Carlo")
[Figure: \(\bar\pi(s)\) plotted over actions \(\mathcal A\).]
Initialize arbitrary \(\pi_0\), \(Q_0\), then for iterations \(i\):
1. PI with MC
2. PI with TD
3. Q-learning
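As an illustration of the third algorithm listed above, here is a minimal sketch of tabular Q-learning. The step list from the slide is not reproduced in this text, so the details are assumptions: an \(\epsilon\)-greedy rollout policy, a constant step size `alpha`, and a made-up deterministic two-state MDP. The function name `q_learning` and all parameter values are hypothetical.

```python
import numpy as np

def q_learning(P, R, gamma=0.9, alpha=0.1, eps=0.2, n_steps=20000, seed=0):
    """Tabular Q-learning with an epsilon-greedy rollout policy."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))       # arbitrary initial Q_0
    s = 0
    for _ in range(n_steps):
        # Rollout policy: uniform with prob. eps, else greedy w.r.t. Q.
        if rng.random() < eps:
            a = int(rng.integers(n_actions))
        else:
            a = int(np.argmax(Q[s]))
        s_next = int(rng.choice(n_states, p=P[s, a]))
        # Label from the Bellman optimality equation, then a step toward it.
        y = R[s, a] + gamma * Q[s_next].max()
        Q[s, a] += alpha * (y - Q[s, a])
        s = s_next
    return Q

# Made-up MDP: action 0 keeps the state, action 1 flips it;
# only staying in state 0 is rewarded.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[1.0, 0.0], [0.0, 0.0]])
Q = q_learning(P, R)
greedy = Q.argmax(axis=1)                     # greedy policy extracted from Q
```

Replacing the `max` in the label with the value of the action the rollout policy actually takes next would give a TD-style policy evaluation label instead, which is the on-policy counterpart.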
4. Preview: Optimization
[Figure: plot of \(J(\theta)\) against \(\theta\), with \(\theta^\star\) marked.]
np.amin(J, axis=1)