Prof. Sarah Dean
Assistant Professor, Computer Science, Cornell
MW 2:45-4pm
110 Hollister Hall
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9 -- don't plan on extensions
Final exam Monday 5/16 at 7pm in Statler Hall 196
Review session in lecture 5/9
Course evaluations open until next week
Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)
Figure: agent-environment loop. At each step the agent observes state \(s_t\) and reward \(r_t\) and selects action \(a_t\) via policy \(\pi\); the environment transitions according to \(P\), with discount \(\gamma\).
actions & states determine environment
discount & reward determine objective
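As a recap aid, the Bellman optimality recursion over this tuple can be sketched in a few lines of numpy. The 2-state, 2-action numbers below are invented for illustration, not from lecture:

```python
import numpy as np

# A tiny illustrative MDP: 2 states, 2 actions.
# P[s, a, s'] = probability of moving to s' from s under action a.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],   # r[s, a]
              [0.0, 2.0]])
gamma = 0.9

# Value iteration: V(s) = max_a [ r(s,a) + gamma * sum_s' P(s,a,s') V(s') ]
V = np.zeros(2)
for _ in range(500):
    Q = r + gamma * P @ V    # shape (n_states, n_actions)
    V = Q.max(axis=1)
print(V)  # fixed point of the Bellman optimality operator
```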
Figure: supervised learning setup. From a dataset \(\{x_i, y_i\}\), learn a predictor mapping input \(x\) to \(\widehat y\); evaluated on pairs \((x, y)\).
"Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable."
Ursula Franklin, 1989
control feedback
data feedback
external feedback
"...social, economic, and political context..."
"...neither foreseen nor foreseeable..."
\(V^*(s) = \max\{Q^*(s,a), Q^*(s,a')\}\)
\(Q^*(s,a) = V^*(f(s,a))\)
\(V^*(s') = \min\{Q^*(s',a), Q^*(s',a')\}\)
Strategy: compute \(V^*\) and \(Q^*\) by alternating max (our move) and min (opponent's move) down the game tree, as above.
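The alternating max/min recursion above can be sketched directly, assuming known deterministic dynamics \(f\) and two actions per turn. The toy game and leaf evaluation below are invented for illustration:

```python
# Minimax value recursion for a two-player zero-sum game with known
# dynamics f(s, a). Toy example: the "state" is an integer, actions
# shift it by +/-1, and the leaf value is the state itself.

def f(s, a):
    # hypothetical dynamics: action a in {-1, +1} shifts the state
    return s + a

def V_star(s, depth, maximizing=True):
    """V*(s) = max_a Q*(s,a) on our turn, min_a Q*(s,a) on the opponent's."""
    if depth == 0:
        return s  # leaf evaluation (stand-in for a learned value)
    values = [V_star(f(s, a), depth - 1, not maximizing) for a in (-1, +1)]
    return max(values) if maximizing else min(values)

print(V_star(0, 2))  # -> 0: our best move (+1) anticipating the opponent's min
```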
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
Deep network with convolutional layers
Warm-start policy network with expert data
How well does \(\pi_{\theta_{BC}}\) perform?
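One way to picture the warm start: behavior cloning fits the policy to expert state-action pairs by minimizing cross-entropy. A minimal sketch with a linear-softmax policy standing in for the convolutional network, on synthetic "expert" data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic expert data: states x_i in R^4, expert actions y_i in {0, 1, 2}
X = rng.normal(size=(200, 4))
y = (X[:, 0] > 0).astype(int) + (X[:, 1] > 0).astype(int)

theta = np.zeros((4, 3))   # linear-softmax policy parameters
lr = 0.1
for _ in range(500):
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    # gradient of mean cross-entropy between pi_theta(.|x) and expert label
    grad = X.T @ (probs - np.eye(3)[y]) / len(X)
    theta -= lr * grad

acc = ((X @ theta).argmax(axis=1) == y).mean()
print(acc)  # fraction of expert actions matched on the training set
```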
How well does \(\widehat \pi = \pi_{\theta_{PG}}\) perform?
Deep network with convolutional layers
IID sampling \(s\sim d^{\widehat \pi}\)
Optimize with SGD $$\beta_{t+1} = \beta_t - \eta \sum_{(s,y)\in\mathcal B} (V_{\beta_t}(s) - y) \nabla_\beta V_{\beta_t}(s)$$
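A minimal numpy sketch of that update, with a linear \(V_\beta(s) = \beta^\top s\) standing in for the deep network and synthetic regression data; the minibatch gradient is averaged rather than summed, which only rescales \(\eta\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic value-regression data: targets y approx beta_true . s
beta_true = np.array([1.0, -2.0, 0.5])
S = rng.normal(size=(1000, 3))                 # sampled states s
Y = S @ beta_true + 0.01 * rng.normal(size=1000)

beta = np.zeros(3)
eta = 0.05
for t in range(300):
    idx = rng.integers(0, len(S), size=32)     # minibatch B
    s_batch, y_batch = S[idx], Y[idx]
    # beta_{t+1} = beta_t - eta * sum_{(s,y) in B} (V_beta(s) - y) grad V_beta(s)
    resid = s_batch @ beta - y_batch           # V_beta(s) - y
    beta -= eta * (s_batch.T @ resid) / len(idx)

print(beta)  # should approach beta_true
```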
\(a_t = \arg\max \widehat V(f(s_t,a))\)
\(a_t = \widehat \pi(s_t)\)
Both are only approximations!
\(\widehat V(f(s,a))\)
1. Low depth search: use knowledge of dynamics
\(a_t = \arg\max \widehat V(f(s_t,a))\)
\(a_t = \arg\max_a \min_{a'} \max_{a''} \widehat V(f(f(f(s_t,a),a'),a''))\)
Tree expansion: \(s'=f(s,a)\), \(s''=f(s',a')\), \(s'''=f(s'',a'')\); evaluate the leaf with \(\widehat V(s''')\)
2. Improve value estimate with rollout: expand \(s'=f(s,a)\), \(s''=f(s',a')\), \(s'''=f(s'',a'')\) as before, then evaluate the leaf with \(\lambda \widehat V(s''') + (1-\lambda) r\), where \(r\) is the outcome of a simulated game (rollout) from \(s'''\)
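A sketch of the blended leaf evaluation, with toy stand-ins for the value network, dynamics, and rollout policy (all invented for illustration):

```python
import random

def V_hat(s):
    # stand-in for the learned value network
    return 0.1 * s

def f(s, a):
    # stand-in deterministic dynamics
    return s + a

def rollout(s, depth=10, seed=None):
    """Play random actions from s and return the terminal outcome r."""
    rng = random.Random(seed)
    for _ in range(depth):
        s = f(s, rng.choice((-1, +1)))
    return 1.0 if s > 0 else -1.0   # win/loss outcome of the simulated game

def leaf_value(s, lam=0.5, seed=0):
    # lambda * V_hat(s''') + (1 - lambda) * r
    return lam * V_hat(s) + (1 - lam) * rollout(s, seed=seed)

print(leaf_value(3))
```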
3. Adaptive depth tree search
Monte-Carlo Tree Search (Classic AI)
expand promising or under-explored nodes
backprop node values from expansion
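The select/expand/simulate/backprop loop can be sketched on a toy game. This is a hedged illustration, not AlphaGo's implementation: UCB1 selection, a single-player integer game, and all names are invented (two-player MCTS would alternate max and min):

```python
import math
import random

# Toy game: state is an integer, actions shift it by +/-1; a fixed-length
# random rollout ends with outcome +1 if the final state is positive, else -1.
ACTIONS = (-1, +1)

def step(s, a):
    return s + a

def rollout(s, length, rng):
    for _ in range(length):
        s = step(s, rng.choice(ACTIONS))
    return 1.0 if s > 0 else -1.0

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children = {}           # action -> Node
        self.visits, self.value = 0, 0.0

    def ucb_child(self, c=1.4):
        # prefer promising (high mean value) or under-explored children
        return max(self.children.values(),
                   key=lambda n: n.value / n.visits
                   + c * math.sqrt(math.log(self.visits) / n.visits))

def mcts(root_state, n_iter=200, seed=0):
    rng = random.Random(seed)
    root = Node(root_state)
    for _ in range(n_iter):
        node = root
        # 1. select: descend via UCB while fully expanded
        while len(node.children) == len(ACTIONS):
            node = node.ucb_child()
        # 2. expand: add one untried child (after the node's first visit)
        untried = [a for a in ACTIONS if a not in node.children]
        if untried and node.visits > 0:
            a = rng.choice(untried)
            node.children[a] = node = Node(step(node.state, a), node)
        # 3. simulate: random rollout from the selected/expanded node
        outcome = rollout(node.state, length=5, rng=rng)
        # 4. backprop: propagate the outcome up to the root
        while node is not None:
            node.visits += 1
            node.value += outcome
            node = node.parent
    # act greedily with respect to root visit counts
    return max(root.children, key=lambda a: root.children[a].visits)

print(mcts(0))  # expected to prefer the +1 action from state 0
```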