Prof. Sarah Dean (Assistant Professor, Computer Science, Cornell University)
MW 2:45-4pm
255 Olin Hall
1. Recap: Units 1-3
2. Game Setting
3. Policy Learning Component
4. Value Learning Component
5. Online Planning Component
[Diagram: the interaction loop. The policy π maps state st to action at; the environment, with transitions P, f, returns reward rt and the next state. In the learning setting the transitions are unknown, and the agent must learn from experience, i.e. data (st,at,rt).]
Supervised learning for policies: a dataset of expert trajectories gives labeled pairs (x=s, y=a∗), and we fit a policy π(s) ≈ a∗.
Two perspectives: imitation (copy the expert's actions) and inverse RL (goal: understand/predict behaviors).
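The reduction to supervised learning can be sketched minimally: memorize expert (state, action) pairs and predict by nearest neighbor on unseen states. The grid states and action names below are invented for illustration, not the lecture's dataset.

```python
# Minimal behavior-cloning sketch: imitation as supervised learning
# on expert pairs (x = state s, y = expert action a*).

def fit_bc(dataset):
    """'Train' a policy by memorizing expert (state, action) pairs."""
    table = {s: a_star for s, a_star in dataset}
    def pi(s):
        if s in table:
            return table[s]
        # Unseen state: fall back to the nearest seen state (L1 distance).
        nearest = min(table, key=lambda t: abs(t[0] - s[0]) + abs(t[1] - s[1]))
        return table[nearest]
    return pi

expert_data = [((0, 0), "right"), ((1, 0), "right"), ((2, 0), "up")]
pi = fit_bc(expert_data)
print(pi((1, 0)))   # seen state: returns the expert's action "right"
print(pi((2, 1)))   # unseen state: action of nearest neighbor (2, 0), "up"
```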
2. Game Setting
Strategy: alternate max and min over the game tree. On our move,
V∗(s)=max{Q∗(s,a),Q∗(s,a′)}
Q∗(s,a)=V∗(f(s,a))
and on the opponent's move,
V∗(s′)=min{Q∗(s′,a),Q∗(s′,a′)}
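These alternating max/min relations can be sketched as a recursion on a toy deterministic game tree; the transition table and leaf values below are invented for illustration.

```python
# Alternating max/min value recursion: our states take a max over
# actions, the opponent's states take a min, and Q*(s,a) = V*(f(s,a)).

f = {  # f[(state, action)] -> next state (toy deterministic dynamics)
    ("s", "a"): "s1", ("s", "b"): "s2",
    ("s1", "a"): "L1", ("s1", "b"): "L2",
    ("s2", "a"): "L3", ("s2", "b"): "L4",
}
leaf_value = {"L1": 3, "L2": 5, "L3": 6, "L4": 1}
actions = ["a", "b"]

def V(s, our_turn=True):
    if s in leaf_value:
        return leaf_value[s]
    Q = [V(f[(s, a)], not our_turn) for a in actions]   # Q*(s,a) = V*(f(s,a))
    return max(Q) if our_turn else min(Q)

print(V("s"))  # max(min(3, 5), min(6, 1)) = max(3, 1) = 3
```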
3. Policy Learning Component
Policy network: deep network with convolutional layers.
Warm-start the policy network with expert data (behavior cloning): how well does πθBC perform?
Then fine-tune with policy gradient: how well does π=πθPG perform?
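The policy-gradient fine-tuning step can be sketched with a minimal REINFORCE update, assuming a softmax policy over two actions in a one-step toy game; the reward, step size, and iteration count are all hypothetical, not the lecture's setup.

```python
import math, random

def pi(theta):
    """Softmax action probabilities from logits theta."""
    m = max(theta)
    e = [math.exp(t - m) for t in theta]
    z = sum(e)
    return [x / z for x in e]

def reinforce_step(theta, eta=0.1):
    probs = pi(theta)
    a = random.choices([0, 1], weights=probs)[0]   # sample a ~ pi_theta
    r = 1.0 if a == 1 else 0.0                     # toy game: action 1 wins
    # gradient of log softmax at the sampled action: one-hot(a) - probs
    grad = [(1.0 if i == a else 0.0) - probs[i] for i in range(2)]
    return [t + eta * r * g for t, g in zip(theta, grad)]

random.seed(0)
theta = [0.0, 0.0]
for _ in range(500):
    theta = reinforce_step(theta)
print(pi(theta)[1])   # probability of the winning action grows toward 1
```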
4. Value Learning Component
Value network: deep network with convolutional layers.
IID sampling s∼dπ, with value targets z.
Optimize with SGD: βt+1 = βt − η ∑(s,z)∈B (Vβ(s)−z) ∇βVβ(s)
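The SGD update can be sketched for a linear value model Vβ(s) = β·φ(s); the features, batch, step size, and targets below are toy assumptions, not the lecture's network.

```python
def phi(s):
    return [1.0, float(s)]              # toy features: bias + raw state

def V(beta, s):
    return sum(b * x for b, x in zip(beta, phi(s)))

def sgd_step(beta, batch, eta=0.1):
    grad = [0.0] * len(beta)
    for s, z in batch:
        err = V(beta, s) - z            # (V_beta(s) - z)
        for i, x in enumerate(phi(s)):
            grad[i] += err * x          # (V_beta(s) - z) * grad_beta V_beta(s)
    return [b - eta * g for b, g in zip(beta, grad)]

beta = [0.0, 0.0]
batch = [(1.0, 2.0), (2.0, 4.0)]        # toy (state, target z) pairs: z = 2*s
for _ in range(500):
    beta = sgd_step(beta, batch)
print(round(V(beta, 1.5), 2))           # fit approaches z = 2*s, so 3.0
```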
5. Online Planning Component
Two options at state st: one-step lookahead, at=argmaxa V(f(st,a)), or the learned policy, at=π(st).
Both are only approximations!
1. Low depth search: use knowledge of dynamics. Unroll s′=f(s,a), s′′=f(s′,a′), s′′′=f(s′′,a′′) and evaluate V(s′′′) at the frontier:
at = argmaxa mina′ maxa′′ V(f(f(f(st,a),a′),a′′))
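This fixed-depth alternating search can be sketched recursively; the deterministic dynamics f and value estimate V below are made up for illustration.

```python
def f(s, a):
    return 2 * s + a                 # toy deterministic dynamics

def V_hat(s):
    return -abs(s - 10)              # toy learned value estimate

def search(s, depth, maximizing=True):
    """Depth-limited alternating (max/min) search over actions {0, 1}.

    Returns (value, best action); leaves are scored by V_hat.
    """
    if depth == 0:
        return V_hat(s), None
    vals = [(search(f(s, a), depth - 1, not maximizing)[0], a)
            for a in (0, 1)]
    return max(vals) if maximizing else min(vals)

value, a_t = search(1, depth=3)      # our move, opponent's reply, our reply
print(value, a_t)                    # prints "-1 0"
```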
2. Improve value estimate with rollout: from the frontier state s′′′ (via s′=f(s,a), s′′=f(s′,a′), s′′′=f(s′′,a′′)), play out a fast rollout policy to get return r, then blend λV(s′′′)+(1−λ)r.
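The blended leaf evaluation can be sketched as follows; the dynamics, rollout policy, reward, value estimate, and λ are toy assumptions for illustration.

```python
def rollout_return(s, pi, f, reward, horizon=5):
    """Play the fast rollout policy from s and sum the rewards."""
    total = 0.0
    for _ in range(horizon):
        s = f(s, pi(s))
        total += reward(s)
    return total

def blended_value(s, V_hat, r, lam=0.5):
    """lam * V(s''') + (1 - lam) * r."""
    return lam * V_hat(s) + (1 - lam) * r

f = lambda s, a: s + a
pi = lambda s: 1                          # toy rollout policy: always step +1
reward = lambda s: 1.0 if s == 10 else 0.0
V_hat = lambda s: 0.3                     # toy value-network estimate

s3 = 7                                    # frontier state s'''
r = rollout_return(s3, pi, f, reward)     # visits 8, 9, 10, 11, 12 -> r = 1.0
print(blended_value(s3, V_hat, r))        # 0.5 * 0.3 + 0.5 * 1.0 = 0.65
```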
3. Adaptive depth tree search
Monte-Carlo Tree Search (Classic AI)
expand promising or under-explored nodes
backprop node values from expansion
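The select/expand/evaluate/back-up loop above can be sketched compactly with UCB-based selection; the single-agent toy dynamics, reward, rollout depth, and exploration constant are assumptions for illustration, not the lecture's actual game.

```python
import math, random

def f(s, a):
    return s + 1 if a else s - 1      # toy dynamics on integer states

def reward(s):
    return 1.0 if s == 3 else 0.0     # toy reward: reach state 3

ACTIONS = (0, 1)

class Node:
    def __init__(self, s):
        self.s, self.n, self.w = s, 0, 0.0   # state, visit count, total value
        self.children = {}                    # action -> Node

def ucb(parent, child, c=1.4):
    if child.n == 0:
        return float("inf")           # under-explored node: always try once
    return child.w / child.n + c * math.sqrt(math.log(parent.n) / child.n)

def rollout(s, depth=4):
    total = 0.0
    for _ in range(depth):
        s = f(s, random.choice(ACTIONS))
        total += reward(s)
    return total

def mcts(root, n_sims=200):
    for _ in range(n_sims):
        node, path = root, [root]
        # 1. selection: descend through fully expanded nodes by UCB
        while len(node.children) == len(ACTIONS):
            node = max(node.children.values(), key=lambda ch: ucb(node, ch))
            path.append(node)
        # 2. expansion: add one unexpanded child
        for a in ACTIONS:
            if a not in node.children:
                child = Node(f(node.s, a))
                node.children[a] = child
                node = child
                path.append(child)
                break
        # 3. evaluation by rollout, 4. back up the value along the path
        value = rollout(node.s)
        for v in path:
            v.n += 1
            v.w += value
    # act greedily at the root by visit count
    return max(root.children, key=lambda a: root.children[a].n)

random.seed(0)
best = mcts(Node(0))
print(best)   # the action stepping toward the reward gets the most visits
```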