CS 4/5789: Introduction to Reinforcement Learning
Lecture 27
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
Announcements
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9  don't plan on extentions
Final exam Monday 5/16 at 7pm in Statler Hall 196
Review session in lecture 5/9
Course evaluations open until next week
Recap: RL Specification
Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)
Agent-environment loop: the policy \(\pi\) selects action \(a_t\); the environment with transitions \(P\) returns state \(s_t\) and reward \(r_t\); rewards are discounted by \(\gamma\).
 action space and discount known
 states and reward signals observed
 transition probabilities unknown
actions & states determine environment
discount & reward determine objective
All ML is RL once deployed
Deployment feedback diagram: a model trained on data \(\{x_i, y_i\}\) maps inputs \(x\) to predictions \(\widehat y\); newly observed pairs \((x, y)\) feed back into the training data.
"Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable."
Ursula Franklin, 1989
RL helps us reason about feedback
control feedback
data feedback
external feedback
"...social, economic, and political context..."
"...neither foreseen nor forseeable..."
Deliberation, governance, oversight are necessary
AlphaGo vs. Lee Sedol
Setting: Markov Game
 Two Player Markov Game: \(\{\mathcal S,\mathcal A, f, r, H, s_0\}\)
 Deterministic transitions: \(s' = f(s,a)\)
 Players alternate taking actions:
 Player 0 in even steps, player 1 in odd steps
 Sparse reward: \(r(s_H)=1\) when player 0 wins (else \(-1\))
Setting: Markov Game
 Minmax formulation $$ V^*(s) = \textcolor{red}{\max_{\pi_1} } \textcolor{yellow}{\min_{\pi_2} }\mathbb E[r(s_H)\mid s_0=s, \pi_1, \pi_2]$$
 Zero sum game => solvable with DP!
\(V^*(s) = \max\{Q^*(s,a), Q^*(s,a')\}\)
\(Q^*(s,a) = V^*(f(s,a))\)
\(V^*(s') = \min\{Q^*(s',a), Q^*(s',a')\}\)
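To make the recursion concrete, here is a minimal Python sketch of backward induction for a small deterministic, alternating-move, zero-sum game. The names `f`, `actions`, and `r` are hypothetical stand-ins for the transition function, legal-move set, and terminal reward (not from the lecture), and this brute-force version is only feasible for tiny games.

```python
def minimax_value(s, h, H, f, actions, r):
    """Exact V*(s) at step h by backward induction over the game tree.

    f(s, a)    -> next state (deterministic transition)
    actions(s) -> iterable of legal actions in state s
    r(s)       -> terminal reward: +1 if player 0 wins, -1 otherwise
    H          -> horizon (total number of moves)
    """
    if h == H:
        return r(s)
    child_values = [minimax_value(f(s, a), h + 1, H, f, actions, r)
                    for a in actions(s)]
    # Player 0 (maximizer) moves on even steps, player 1 (minimizer) on odd steps.
    return max(child_values) if h % 2 == 0 else min(child_values)
```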
Setting: Markov Game
 But \(H\approx 150\), \(A\approx 250\), so this tree will have \(\approx A^H\) nodes
 1 TB hard drive can store \(\approx 250^5\) 8-bit numbers
 Impossible to enumerate!
Setting: Markov Game
Strategy:
 Approximate \(\pi^*\), use \(\widehat \pi\) to approximate \(V^*\)
 Low depth tree search combines \(\widehat V\) with simulated play \(\widehat \pi\)
Agenda
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
Policy Learning
Deep network with convolutional layers
 input: 19×19 3-bit grid
 output: distribution over grid
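A minimal sketch of what such a policy network could look like, written with PyTorch for concreteness; the depth and channel counts here are illustrative assumptions, not the actual AlphaGo architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network: board features in, move distribution out."""
    def __init__(self, in_planes=3, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),   # one logit per intersection
        )

    def forward(self, x):                        # x: (batch, in_planes, 19, 19)
        logits = self.conv(x).flatten(1)         # (batch, 361)
        return torch.log_softmax(logits, dim=1)  # log-probabilities over the grid
```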
Imitation Learning
Warmstart policy network with expert data
 Sample data \((s,a)\) from human games, \(N=30\) million
 Log-likelihood loss function $$\min_\theta \; -\sum_{i=1}^N \log \pi_\theta(a_i\mid s_i)$$
 Optimize with stochastic gradient descent (sketch below) $$ \theta_{t+1} = \theta_t + \eta \frac{1}{|\mathcal B|} \sum_{(s,a)\in \mathcal B}\nabla_\theta\log \pi_\theta(a\mid s)$$
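A sketch of the behavior-cloning step, assuming the PolicyNet sketch above and a dataloader of (board, expert move) batches; the hyperparameters are placeholders.

```python
import torch

def behavior_cloning(policy, dataloader, lr=1e-2, epochs=1):
    """Minimize the negative log-likelihood of expert moves with SGD."""
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, actions in dataloader:   # states: (B,3,19,19), actions: (B,)
            log_probs = policy(states)       # (B, 361) entries log pi(a|s)
            loss = -log_probs.gather(1, actions.unsqueeze(1)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```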
Imitation Learning
How well does \(\pi_{\theta_{BC}}\) perform?
 57% accuracy on held out test
 random policy: 1/200
 Pachi: open source Go program
 11% win rate
Policy Gradient
 Warmstart \(\theta_0 = \theta_{BC}\)
 Iterate for \(t=0,\dots,T-1\)
 Randomly select previous \(\tau \in \{0,1,\dots, t\}\)
 Play \(\pi_{\theta_t}\) against \(\pi_{\theta_\tau}\) and observe \((s_0, a_0, s_1, a_1, \dots, s_H)\)
 Gradient update (sketch below): $$\theta_{t+1} = \theta_t + \eta \sum_{h=0}^{H/2}\nabla_\theta \log \pi_{\theta_t}(\textcolor{red}{a_{2h}}\mid s_{2h})\, r(s_H)$$
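A sketch of the self-play REINFORCE loop, assuming the PolicyNet sketch above; `play_game` is a hypothetical helper (not from the lecture) that returns the (state, action) pairs taken by the even-step player and the final reward \(r(s_H)\) from that player's perspective.

```python
import copy, random
import torch

def self_play_policy_gradient(policy, play_game, T=10000, lr=1e-3):
    """REINFORCE-style updates from self-play against randomly chosen past snapshots."""
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    snapshots = [copy.deepcopy(policy)]           # pool of previous iterates
    for t in range(T):
        opponent = random.choice(snapshots)       # pi_{theta_tau}, tau <= t
        trajectory, outcome = play_game(policy, opponent)  # outcome in {+1, -1}
        # Ascend the gradient of sum_h log pi_theta(a_{2h} | s_{2h}) * r(s_H)
        # by descending its negation.
        loss = 0.0
        for state, action in trajectory:          # only the even-step player's moves
            loss = loss - policy(state.unsqueeze(0))[0, action] * outcome
        opt.zero_grad()
        loss.backward()
        opt.step()
        snapshots.append(copy.deepcopy(policy))
    return policy
```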
Policy Gradient
How well does \(\widehat \pi = \pi_{\theta_{PG}}\) perform?
 Pachi: open source Go program
 85% win rate
Value Learning
Deep network with convolutional layers
 input: 19×19 3-bit grid
 output: scalar value
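The value network can reuse the same kind of convolutional trunk with a scalar head; a minimal sketch with illustrative sizes (not the actual architecture):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Convolutional value network: board features in, scalar value out."""
    def __init__(self, in_planes=3, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 1)

    def forward(self, x):                             # x: (batch, in_planes, 19, 19)
        h = self.conv(x).flatten(1)
        return torch.tanh(self.head(h)).squeeze(1)    # value in [-1, 1]
```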
Value Learning
 Ideally, approximate \(\widehat V \approx V^*\)
 easier to supervise \(\widehat V \approx V^{\widehat \pi}\) $$V^{\widehat \pi}(s) = \mathbb E[r(s_H)\mid s_0=s, \widehat \pi, \widehat \pi]$$
 Supervision via rollouts
 In each game \(i\), sample \(h\) and set \(s_i=s_h\) and \(y_i\) as the game's outcome (\(\pm 1\))
 Simulate \(N=30\) million games

IID sampling \(s\sim d^{\widehat \pi}\)
Value Learning
 Least-squares regression $$\min_\beta \sum_{i=1}^N (V_\beta(s_i) - y_i)^2$$

Optimize with SGD (sketch below) $$\beta_{t+1} = \beta_t - \eta \sum_{(s,y)\in\mathcal B} (V_{\beta}(s) - y)\, \nabla_\beta V_\beta(s)$$
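A sketch of the regression step, assuming the ValueNet sketch above and a dataloader of (sampled state, game outcome) pairs from the simulated games:

```python
import torch

def fit_value(value_net, dataloader, lr=1e-2, epochs=1):
    """Least-squares regression of game outcomes onto V_beta(s) with SGD."""
    opt = torch.optim.SGD(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        for states, outcomes in dataloader:       # outcomes in {+1, -1}
            pred = value_net(states)              # (B,)
            loss = ((pred - outcomes.float()) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return value_net
```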
Combination with Search
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
\(a_t = \widehat \pi(s_t)\)
Both are only approximations!
Combination with Search
1. Low depth search: use knowledge of dynamics
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
Combination with Search
1. Low depth search: use knowledge of dynamics
Depth-3 lookahead (sketch below): \(a_t = \arg\max_a \min_{a'} \max_{a''} \widehat V(f(f(f(s_t,a),a'),a''))\), where \(s'=f(s,a)\), \(s''=f(s',a')\), \(s'''=f(s'',a'')\), and \(\widehat V(s''')\) is evaluated at the leaves.
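A minimal sketch of this depth-3 lookahead; `f`, `actions`, and `V_hat` are hypothetical stand-ins for the known dynamics, the legal-move set, and the learned value network.

```python
def depth3_action(s, f, actions, V_hat):
    """Choose a_t = argmax_a min_{a'} max_{a''} V_hat(f(f(f(s,a),a'),a''))."""
    def max_level(s2):                # our second move a'': maximize V_hat at the leaves
        return max(V_hat(f(s2, a2)) for a2 in actions(s2))
    def min_level(s1):                # opponent's reply a': minimize
        return min(max_level(f(s1, a1)) for a1 in actions(s1))
    # our move now: pick the action with the best worst-case value
    return max(actions(s), key=lambda a: min_level(f(s, a)))
```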
Combination with Search
2. Improve value estimate with rollout: from a leaf \(s'''=f(f(f(s,a),a'),a'')\), play the game out with \(\widehat\pi\) to get an outcome \(r\), and blend \(\lambda \widehat V(s''') + (1-\lambda)\, r\) (sketch below).
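A sketch of the blended leaf evaluation; `rollout` is a hypothetical helper that plays the game out from the leaf with \(\widehat\pi\) and returns the final reward.

```python
def leaf_value(s_leaf, V_hat, rollout, lam=0.5):
    """Blend the value network with a fast rollout: lam*V_hat + (1-lam)*r."""
    r = rollout(s_leaf)               # +1 / -1 outcome of a game played out with pi_hat
    return lam * V_hat(s_leaf) + (1.0 - lam) * r
```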
Combination with Search
3. Adaptive depth tree search
Monte-Carlo Tree Search (classic AI)
expand promising or underexplored nodes
backprop node values from expansion
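A heavily simplified MCTS sketch for illustration; `f`, `actions`, and `leaf_value` are hypothetical stand-ins, and the selection rule here is plain UCB rather than the prior-weighted rule AlphaGo uses with \(\widehat\pi\).

```python
import math

class Node:
    """A search-tree node holding visit counts and an accumulated value."""
    def __init__(self, state):
        self.state = state
        self.children = {}            # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, f, actions, leaf_value, n_sim=1000, c=1.4):
    """Simplified MCTS: UCB selection, one-node expansion, leaf evaluation, backup."""
    root = Node(root_state)
    for _ in range(n_sim):
        node, path, maximize = root, [root], True
        # 1. Selection: descend while the current node is fully expanded.
        while node.children and len(node.children) == len(list(actions(node.state))):
            sign = 1.0 if maximize else -1.0
            parent = node
            node = max(
                node.children.values(),
                key=lambda ch: sign * ch.value()
                + c * math.sqrt(math.log(parent.visits + 1) / (ch.visits + 1)),
            )
            path.append(node)
            maximize = not maximize
        # 2. Expansion: add one untried child, if any actions remain.
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = untried[0]
            child = Node(f(node.state, a))
            node.children[a] = child
            path.append(child)
        # 3. Evaluation: blended value estimate at the (new) leaf.
        v = leaf_value(path[-1].state)
        # 4. Backup: propagate the leaf value along the visited path.
        for n in path:
            n.visits += 1
            n.value_sum += v
    # Play the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)
```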
Combination with Search

 Low/adaptive-depth tree search with \(\widehat V\)
 Improve value estimate with rollout of \(\widehat \pi\)
Summary
 Learning:
 Warm start policy with imitation learning
 Improve policy with policy gradient
 Approximate value of policy
 Planning:
 Adaptive tree search with \(\widehat V\) and \(\widehat \pi\)
Exofeedback
CS 4/5789: Lecture 27
By Sarah Dean