CS 4/5789: Introduction to Reinforcement Learning
Lecture 27
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
Announcements
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9  don't plan on extentions
Final exam Monday 5/16 at 7pm in Statler Hall 196
Review session in lecture 5/9
Course evaluations open until next week
Recap: RL Specification
Markov decision process \(\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}\)
Agent-environment loop: the policy \(\pi\) selects action \(a_t\); the environment with transitions \(P\) returns state \(s_t\) and reward \(r_t\); rewards are discounted by \(\gamma\).
 action space and discount known
 states and reward signals observed
 transition probabilities unknown
actions & states determine environment
discount & reward determine objective
All ML is RL once deployed
Deployment feedback diagram: a model trained on data \(\{x_i, y_i\}\) maps inputs \(x\) to predictions \(\widehat y\); newly observed pairs \((x, y)\) feed back into the training data.
"Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable."
Ursula Franklin, 1989
RL helps us reason about feedback
control feedback
data feedback
external feedback
"...social, economic, and political context..."
"...neither foreseen nor forseeable..."
Deliberation, governance, oversight are necessary
AlphaGo vs. Lee Sedol
Setting: Markov Game
 Two Player Markov Game: \(\{\mathcal S,\mathcal A, f, r, H, s_0\}\)
 Deterministic transitions: \(s' = f(s,a)\)
 Players alternate taking actions:
 Player 0 in even steps, player 1 in odd steps
 Sparse reward: \(r(s_H)=1\) when player 0 wins (else \(-1\))
Setting: Markov Game
 Minmax formulation $$ V^*(s) = \textcolor{red}{\max_{\pi_1} } \textcolor{yellow}{\min_{\pi_2} }\mathbb E[r(s_H)\mid s_0=s, \pi_1, \pi_2]$$
 Zero sum game => solvable with DP!
\(V^*(s) = \max\{Q^*(s,a), Q^*(s,a')\}\)
\(Q^*(s,a) = V^*(f(s,a))\)
\(V^*(s') = \min\{Q^*(s',a), Q^*(s',a')\}\)
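To make the recursion concrete, here is a minimal Python sketch of backward induction for a small deterministic, alternating-move, zero-sum game. The names `f`, `actions`, and `r` are hypothetical stand-ins for the transition function, legal-move set, and terminal reward (not from the lecture), and this brute-force version is only feasible for tiny games.

```python
def minimax_value(s, h, H, f, actions, r):
    """Exact V*(s) at step h by backward induction over the game tree.

    f(s, a)    -> next state (deterministic transition)
    actions(s) -> iterable of legal actions in state s
    r(s)       -> terminal reward: +1 if player 0 wins, -1 otherwise
    H          -> horizon (total number of moves)
    """
    if h == H:
        return r(s)
    child_values = [minimax_value(f(s, a), h + 1, H, f, actions, r)
                    for a in actions(s)]
    # Player 0 (maximizer) moves on even steps, player 1 (minimizer) on odd steps.
    return max(child_values) if h % 2 == 0 else min(child_values)
```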
Setting: Markov Game
 But \(H\approx 150\), \(A\approx 250\), so this tree will have \(\approx A^H\) nodes
 1 TB hard drive can store \(\approx 250^5\) 8-bit numbers
 Impossible to enumerate!
Setting: Markov Game
Strategy:
 Approximate \(\pi^*\), use \(\widehat \pi\) to approximate \(V^*\)
 Low depth tree search combines \(\widehat V\) with simulated play \(\widehat \pi\)
Agenda
0. Announcements & Recap
1. Game Setting
2. Policy Learning Component
3. Value Learning Component
4. Online Planning Component
Policy Learning
Deep network with convolutional layers
 input: 19×19 3-bit grid
 output: distribution over grid
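A minimal sketch of what such a policy network could look like, written with PyTorch for concreteness; the depth and channel counts here are illustrative assumptions, not the actual AlphaGo architecture.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Convolutional policy network: board features in, move distribution out."""
    def __init__(self, in_planes=3, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, 1, kernel_size=1),   # one logit per intersection
        )

    def forward(self, x):                        # x: (batch, in_planes, 19, 19)
        logits = self.conv(x).flatten(1)         # (batch, 361)
        return torch.log_softmax(logits, dim=1)  # log-probabilities over the grid
```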
Imitation Learning
Warmstart policy network with expert data
 Sample data \((s,a)\) from human games, \(N=30\) million
 Log-likelihood loss function $$\min_\theta \; -\sum_{i=1}^N \log \pi_\theta(a_i\mid s_i)$$
 Optimize with stochastic gradient descent (sketch below) $$ \theta_{t+1} = \theta_t + \eta \frac{1}{|\mathcal B|} \sum_{(s,a)\in \mathcal B}\nabla_\theta\log \pi_\theta(a\mid s)$$
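A sketch of the behavior-cloning step, assuming the PolicyNet sketch above and a dataloader of (board, expert move) batches; the hyperparameters are placeholders.

```python
import torch

def behavior_cloning(policy, dataloader, lr=1e-2, epochs=1):
    """Minimize the negative log-likelihood of expert moves with SGD."""
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, actions in dataloader:   # states: (B,3,19,19), actions: (B,)
            log_probs = policy(states)       # (B, 361) entries log pi(a|s)
            loss = -log_probs.gather(1, actions.unsqueeze(1)).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return policy
```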
Imitation Learning
How well does \(\pi_{\theta_{BC}}\) perform?
 57% accuracy on held out test
 random policy: 1/200
 Pachi: open source Go program
 11% win rate
Policy Gradient
 Warmstart \(\theta_0 = \theta_{BC}\)
 Iterate for \(t=0,\dots,T-1\)
 Randomly select previous \(\tau \in \{0,1,\dots, t\}\)
 Play \(\pi_{\theta_t}\) against \(\pi_{\theta_\tau}\) and observe \((s_0, a_0, s_1, a_1, \dots, s_H)\)
 Gradient update (sketch below): $$\theta_{t+1} = \theta_t + \eta \sum_{h=0}^{H/2}\nabla_\theta \log \pi_{\theta_t}(\textcolor{red}{a_{2h}}\mid s_{2h})\, r(s_H)$$
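A sketch of the self-play REINFORCE loop, assuming the PolicyNet sketch above; `play_game` is a hypothetical helper (not from the lecture) that returns the (state, action) pairs taken by the even-step player and the final reward \(r(s_H)\) from that player's perspective.

```python
import copy, random
import torch

def self_play_policy_gradient(policy, play_game, T=10000, lr=1e-3):
    """REINFORCE-style updates from self-play against randomly chosen past snapshots."""
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    snapshots = [copy.deepcopy(policy)]           # pool of previous iterates
    for t in range(T):
        opponent = random.choice(snapshots)       # pi_{theta_tau}, tau <= t
        trajectory, outcome = play_game(policy, opponent)  # outcome in {+1, -1}
        # Ascend the gradient of sum_h log pi_theta(a_{2h} | s_{2h}) * r(s_H)
        # by descending its negation.
        loss = 0.0
        for state, action in trajectory:          # only the even-step player's moves
            loss = loss - policy(state.unsqueeze(0))[0, action] * outcome
        opt.zero_grad()
        loss.backward()
        opt.step()
        snapshots.append(copy.deepcopy(policy))
    return policy
```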
Policy Gradient
How well does \(\widehat \pi = \pi_{\theta_{PG}}\) perform?
 Pachi: open source Go program
 85% win rate
Value Learning
Deep network with convolutional layers
 input: 19×19 3-bit grid
 output: scalar value
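The value network can reuse the same kind of convolutional trunk with a scalar head; a minimal sketch with illustrative sizes (not the actual architecture):

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Convolutional value network: board features in, scalar value out."""
    def __init__(self, in_planes=3, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(channels * 19 * 19, 1)

    def forward(self, x):                             # x: (batch, in_planes, 19, 19)
        h = self.conv(x).flatten(1)
        return torch.tanh(self.head(h)).squeeze(1)    # value in [-1, 1]
```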
Value Learning
 Ideally, approximate \(\widehat V \approx V^*\)
 easier to supervise \(\widehat V \approx V^{\widehat \pi}\) $$V^{\widehat \pi}(s) = \mathbb E[r(s_H)\mid s_0=s, \widehat \pi, \widehat \pi]$$
 Supervision via rollouts
 In each game \(i\), sample \(h\) and set \(s_i=s_h\) and \(y_i\) as the game's outcome (\(\pm 1\))
 Simulate \(N=30\) million games

IID sampling \(s\sim d^{\widehat \pi}\)
Value Learning
 Least-squares regression $$\min_\beta \sum_{i=1}^N (V_\beta(s_i) - y_i)^2$$

Optimize with SGD (sketch below) $$\beta_{t+1} = \beta_t - \eta \sum_{(s,y)\in\mathcal B} (V_{\beta}(s) - y)\, \nabla_\beta V_\beta(s)$$
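A sketch of the regression step, assuming the ValueNet sketch above and a dataloader of (sampled state, game outcome) pairs from the simulated games:

```python
import torch

def fit_value(value_net, dataloader, lr=1e-2, epochs=1):
    """Least-squares regression of game outcomes onto V_beta(s) with SGD."""
    opt = torch.optim.SGD(value_net.parameters(), lr=lr)
    for _ in range(epochs):
        for states, outcomes in dataloader:       # outcomes in {+1, -1}
            pred = value_net(states)              # (B,)
            loss = ((pred - outcomes.float()) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return value_net
```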
Combination with Search
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
\(a_t = \widehat \pi(s_t)\)
Both are only approximations!
Combination with Search
1. Low depth search: use knowledge of dynamics
\(a_t = \arg\max_a \widehat V(f(s_t,a))\)
Combination with Search
1. Low depth search: use knowledge of dynamics
Depth-3 lookahead (sketch below): \(a_t = \arg\max_a \min_{a'} \max_{a''} \widehat V(f(f(f(s_t,a),a'),a''))\), where \(s'=f(s,a)\), \(s''=f(s',a')\), \(s'''=f(s'',a'')\), and \(\widehat V(s''')\) is evaluated at the leaves.
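A minimal sketch of this depth-3 lookahead; `f`, `actions`, and `V_hat` are hypothetical stand-ins for the known dynamics, the legal-move set, and the learned value network.

```python
def depth3_action(s, f, actions, V_hat):
    """Choose a_t = argmax_a min_{a'} max_{a''} V_hat(f(f(f(s,a),a'),a''))."""
    def max_level(s2):                # our second move a'': maximize V_hat at the leaves
        return max(V_hat(f(s2, a2)) for a2 in actions(s2))
    def min_level(s1):                # opponent's reply a': minimize
        return min(max_level(f(s1, a1)) for a1 in actions(s1))
    # our move now: pick the action with the best worst-case value
    return max(actions(s), key=lambda a: min_level(f(s, a)))
```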
Combination with Search
2. Improve value estimate with rollout: from a leaf \(s'''=f(f(f(s,a),a'),a'')\), play the game out with \(\widehat\pi\) to get an outcome \(r\), and blend \(\lambda \widehat V(s''') + (1-\lambda)\, r\) (sketch below).
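A sketch of the blended leaf evaluation; `rollout` is a hypothetical helper that plays the game out from the leaf with \(\widehat\pi\) and returns the final reward.

```python
def leaf_value(s_leaf, V_hat, rollout, lam=0.5):
    """Blend the value network with a fast rollout: lam*V_hat + (1-lam)*r."""
    r = rollout(s_leaf)               # +1 / -1 outcome of a game played out with pi_hat
    return lam * V_hat(s_leaf) + (1.0 - lam) * r
```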
Combination with Search
3. Adaptive depth tree search
Monte-Carlo Tree Search (classic AI)
expand promising or underexplored nodes
backprop node values from expansion
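A heavily simplified MCTS sketch for illustration; `f`, `actions`, and `leaf_value` are hypothetical stand-ins, and the selection rule here is plain UCB rather than the prior-weighted rule AlphaGo uses with \(\widehat\pi\).

```python
import math

class Node:
    """A search-tree node holding visit counts and an accumulated value."""
    def __init__(self, state):
        self.state = state
        self.children = {}            # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts(root_state, f, actions, leaf_value, n_sim=1000, c=1.4):
    """Simplified MCTS: UCB selection, one-node expansion, leaf evaluation, backup."""
    root = Node(root_state)
    for _ in range(n_sim):
        node, path, maximize = root, [root], True
        # 1. Selection: descend while the current node is fully expanded.
        while node.children and len(node.children) == len(list(actions(node.state))):
            sign = 1.0 if maximize else -1.0
            parent = node
            node = max(
                node.children.values(),
                key=lambda ch: sign * ch.value()
                + c * math.sqrt(math.log(parent.visits + 1) / (ch.visits + 1)),
            )
            path.append(node)
            maximize = not maximize
        # 2. Expansion: add one untried child, if any actions remain.
        untried = [a for a in actions(node.state) if a not in node.children]
        if untried:
            a = untried[0]
            child = Node(f(node.state, a))
            node.children[a] = child
            path.append(child)
        # 3. Evaluation: blended value estimate at the (new) leaf.
        v = leaf_value(path[-1].state)
        # 4. Backup: propagate the leaf value along the visited path.
        for n in path:
            n.visits += 1
            n.value_sum += v
    # Play the most-visited root action.
    return max(root.children, key=lambda a: root.children[a].visits)
```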
Combination with Search

 Low/adaptive-depth tree search with \(\widehat V\)
 Improve value estimate with rollout of \(\widehat \pi\)
Summary
 Learning:
 Warm start policy with imitation learning
 Improve policy with policy gradient
 Approximate value of policy
 Planning:
 Adaptive tree search with \(\widehat V\) and \(\widehat \pi\)
Exofeedback
CS 4/5789: Lecture 27
By Sarah Dean