Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
[Diagram: agent-environment loop. The policy \(\pi\) selects action \(a_t\); the environment with transitions \(P,f\) returns state \(s_t\) and reward \(r_t\)]
[Diagram: the same loop when the transitions \(P,f\) are unknown; the policy is learned from data \((s_t,a_t,r_t)\) collected through experience]
Supervised learning for control: fit a policy \(\pi\) to a dataset of expert trajectories, treating each step as a labeled example \((x=s, y=a^*)\)
Approaches: imitation learning and inverse RL
Goal: understand/predict behaviors
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
...
...
Strategy:
\(V^*(s) = \max\{Q^*(s,a), Q^*(s,a')\}\)
\(Q^*(s,a) = V^*(f(s,a))\)
\(V^*(s') = \min\{Q^*(s',a), Q^*(s',a')\}\)
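As a concrete illustration, here is a minimal sketch of this alternating max/min recursion in Python, using a hypothetical toy game (take-1-or-2 Nim) in place of the real dynamics; the names `f`, `actions`, `is_terminal`, and `reward` are placeholders, not part of the lecture.

```python
# Toy game (hypothetical): Nim with 4 stones, take 1 or 2; whoever takes the last stone wins.
f = lambda s, a: s - a                              # deterministic dynamics
actions = lambda s: [a for a in (1, 2) if a <= s]   # legal moves
is_terminal = lambda s: s == 0
reward = lambda to_move_is_max: -1.0 if to_move_is_max else 1.0  # at s=0 the player to move has already lost

def value(s, to_move_is_max=True):
    """V*: the player to move maximizes (or minimizes) over Q*(s, a) = V*(f(s, a))."""
    if is_terminal(s):
        return reward(to_move_is_max)
    vals = [value(f(s, a), not to_move_is_max) for a in actions(s)]
    return max(vals) if to_move_is_max else min(vals)

print(value(4))   # 1.0: the maximizing player can force a win from 4 stones
```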
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
Deep network with convolutional layers
Warm-start policy network with expert data
How well does \(\pi_{\theta_{BC}}\) perform?
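A minimal behavior-cloning sketch of this warm start, assuming expert (state, action) pairs are available as tensors; the small MLP and the random data below are placeholders standing in for the convolutional network and the real expert dataset.

```python
import torch
import torch.nn as nn

state_dim, num_actions = 64, 361            # hypothetical board encoding size / number of moves
policy = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, num_actions))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()             # negative log-likelihood of the expert move

expert_states = torch.randn(1024, state_dim)              # placeholder for expert states s
expert_actions = torch.randint(0, num_actions, (1024,))   # placeholder for expert moves a*

for epoch in range(10):
    logits = policy(expert_states)
    loss = loss_fn(logits, expert_actions)  # supervised loss on pairs (x = s, y = a*)
    opt.zero_grad(); loss.backward(); opt.step()
```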
...
...
How well does \(\widehat \pi = \pi_{\theta_{PG}}\) perform?
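Continuing the sketch above, a REINFORCE-style update for fine-tuning the warm-started policy; the states, sampled actions, and game outcomes below are random placeholders for self-play data.

```python
# Reuses `policy`, `opt`, `state_dim`, `num_actions` from the behavior-cloning sketch above.
import torch

states = torch.randn(256, state_dim)                    # states visited during self-play (placeholder)
sampled = torch.randint(0, num_actions, (256,))         # actions the policy sampled (placeholder)
returns = torch.randint(0, 2, (256,)).float() * 2 - 1   # game outcomes z in {-1, +1} (placeholder)

log_probs = torch.log_softmax(policy(states), dim=-1)
chosen = log_probs[torch.arange(len(sampled)), sampled]
pg_loss = -(returns * chosen).mean()                    # ascend the policy-gradient objective E[z * log pi]
opt.zero_grad(); pg_loss.backward(); opt.step()
```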
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
Deep network with convolutional layers
IID sampling \(s\sim d^{\widehat \pi}\)
Optimize with SGD $$\beta_{t+1} = \beta_t - \eta \sum_{(s,z)\in\mathcal B} (V_{\beta_t}(s) - z) \nabla_\beta V_{\beta_t}(s)$$
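A minimal sketch of this update for a linear value function \(V_\beta(s) = \beta^\top s\) (so \(\nabla_\beta V_\beta(s) = s\)); the minibatch of states \(s \sim d^{\widehat\pi}\) and outcomes \(z\) are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, batch_size, eta = 64, 32, 1e-3
beta = np.zeros(dim)

for t in range(100):
    S = rng.normal(size=(batch_size, dim))         # minibatch of states s ~ d^pi-hat (placeholder)
    z = rng.choice([-1.0, 1.0], size=batch_size)   # observed outcomes z (placeholder)
    V = S @ beta                                   # V_beta(s) for the batch
    grad = ((V - z)[:, None] * S).sum(axis=0)      # sum over batch of (V_beta(s) - z) * grad_beta V_beta(s)
    beta = beta - eta * grad                       # beta_{t+1} = beta_t - eta * sum(...)
```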
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
\(a_t = \widehat \pi(s_t)\)
Both are only approximations!
\(\widehat V(f(s,a))\)
1. Low depth search: use knowledge of dynamics
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
\(=\widehat V(s')\)
\(s'=f(s,a)\)
1. Low depth search: use knowledge of dynamics
\(s''=f(s',a')\)
\(s'''=f(s'',a'')\)
\(\widehat V(s''')\)
\(a_t = \arg\max_a \min_{a'} \max_{a''} \widehat V(f(f(f(s_t,a),a'),a''))\)
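A minimal sketch of this depth-limited alternating search; the dynamics `f`, move generator `actions`, and value estimate `V_hat` below are toy placeholders, not the course's actual implementation.

```python
import random

# Toy placeholders (hypothetical): deterministic dynamics, legal moves, learned value estimate.
f = lambda s, a: s - a
actions = lambda s: [a for a in (1, 2) if a <= s]
V_hat = lambda s: random.uniform(-1.0, 1.0)

def lookahead(s, depth, maximize):
    """Alternate our max and the opponent's min down to `depth`, then evaluate with V_hat."""
    if depth == 0 or not actions(s):
        return V_hat(s)
    vals = [lookahead(f(s, a), depth - 1, not maximize) for a in actions(s)]
    return max(vals) if maximize else min(vals)

def choose_action(s_t, depth=3):
    # a_t = argmax_a min_{a'} max_{a''} V_hat(f(f(f(s_t, a), a'), a''))
    return max(actions(s_t), key=lambda a: lookahead(f(s_t, a), depth - 1, maximize=False))

print(choose_action(7))
```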
2. Improve value estimate with rollout
\(s'=f(s,a)\), \(s''=f(s',a')\), \(s'''=f(s'',a'')\)
From the leaf \(s'''\), a rollout produces an outcome \(r\); blend it with the learned value: \(\lambda \widehat V(s''') + (1-\lambda) r\)
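A minimal sketch of this blended leaf evaluation; `V_hat` and `rollout` are placeholders for the value network and a fast playout from the leaf.

```python
import random

V_hat = lambda s: 0.0                           # placeholder value-network estimate
rollout = lambda s: random.choice([-1.0, 1.0])  # placeholder for the outcome r of a fast playout from s

def leaf_value(s_leaf, lam=0.5):
    """Blend the value-network estimate with the rollout outcome r."""
    return lam * V_hat(s_leaf) + (1 - lam) * rollout(s_leaf)

print(leaf_value(0))
```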
3. Adaptive depth tree search
Monte-Carlo Tree Search (Classic AI)
expand promising or under-explored nodes
backprop node values from expansion
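A minimal MCTS sketch tying these pieces together: select children by UCB, expand a promising or under-explored node, evaluate it, and backpropagate the value to the root. The game interface and evaluator below are toy placeholders, and the two-player sign alternation is omitted for brevity.

```python
import math, random

# Toy placeholders (hypothetical) for the game interface and the leaf evaluator.
f = lambda s, a: s - a
actions = lambda s: [a for a in (1, 2) if a <= s]
is_terminal = lambda s: s == 0
evaluate = lambda s: random.uniform(-1.0, 1.0)   # stands in for a rollout or value network

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}                        # action -> child Node
        self.visits, self.value_sum = 0, 0.0

    def ucb_child(self, c=1.4):
        # high average value plus a bonus for rarely visited (under-explored) children
        def score(child):
            if child.visits == 0:
                return float("inf")
            return child.value_sum / child.visits + c * math.sqrt(math.log(self.visits) / child.visits)
        return max(self.children.values(), key=score)

def mcts(root_state, n_iters=200):
    root = Node(root_state)
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend through fully expanded nodes by UCB (promising or under-explored).
        while node.children and len(node.children) == len(actions(node.state)):
            node = node.ucb_child()
        # 2. Expansion: add one untried child, if the node is not terminal.
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried and not is_terminal(node.state):
            a = random.choice(untried)
            node.children[a] = Node(f(node.state, a), parent=node)
            node = node.children[a]
        # 3. Evaluate the expanded node.
        v = evaluate(node.state)
        # 4. Backprop: propagate the node value up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += v
            node = node.parent
    return max(root.children, key=lambda a: root.children[a].visits)   # act by root visit counts

print(mcts(10))
```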