## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 27

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Game Setting

2. Policy Learning Component

3. Value Learning Component

4. Online Planning Component

## Announcements

5789 Paper Review Assignment (weekly pace suggested)

HW 4 due 5/9 -- don't plan on extensions

Final exam Monday 5/16 at 7pm in Statler Hall 196
Review session in lecture 5/9

Course evaluations open until next week

## Recap: RL Specification

Markov decision process $$\mathcal M = \{\mathcal S, ~\mathcal A, ~P, ~r, ~\gamma\}$$

[Diagram: agent-environment loop — the agent's policy $$\pi$$ selects actions $$a_t$$; the environment, with transition probabilities $$P$$, emits states $$s_t$$ and rewards $$r_t$$; future rewards are discounted by $$\gamma$$]

• action space and discount known
• states and reward signals observed
• transition probabilities unknown

actions & states determine environment

discount & reward determine objective

## All ML is RL once deployed

[Diagram: a model trained on data $$\{x_i, y_i\}$$ produces predictions $$\widehat y$$ from inputs $$x$$; once deployed, its decisions generate the next round of data $$(x, y)$$]

"Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable."

Ursula Franklin, 1989

## RL helps us reason about feedback

control feedback

data feedback

external feedback

"...social, economic, and political context..."

"...neither foreseen nor foreseeable..."


## Setting: Markov Game

• Two Player Markov Game: $$\{\mathcal S,\mathcal A, f, r, H, s_0\}$$
• Deterministic transitions: $$s' = f(s,a)$$
• Players alternate taking actions:
• Player 0 in even steps, player 1 in odd steps
• Sparse reward: $$r(s_H)=1$$ when player 0 wins (else $$-1$$)


## Setting: Markov Game

• Min-max formulation $$V^*(s) = \textcolor{red}{\max_{\pi_1} } \textcolor{yellow}{\min_{\pi_2} }\mathbb E[r(s_H)|s_0=s, \pi_1, \pi_2]$$
• Zero sum game => solvable with DP!

$$V^*(s) = \max\{Q^*(s,a), Q^*(s,a')\}$$

$$Q^*(s,a) = V^*(f(s,a))$$

$$V^*(s') = \min\{Q^*(s',a), Q^*(s',a')\}$$
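The alternating max/min recursion above can be computed exactly by dynamic programming when the tree is small. A minimal sketch on a hypothetical toy game (not Go): players alternate adding 1 or 2 to a running total for $H=4$ steps, and player 0 wins iff the final total is even. Since player 1 moves last and controls the parity, best play gives $V^* = -1$ at the start.

```python
# Exact V* for a toy two-player zero-sum game via the alternating
# max/min recursion: V*(s) = max_a Q*(s,a) on player 0's turn,
# min_a Q*(s,a) on player 1's, with Q*(s,a) = V*(f(s,a)).

H = 4  # horizon: player 0 moves on even steps, player 1 on odd steps

def f(s, a):
    """Deterministic transition: add the chosen increment to the total."""
    total, h = s
    return (total + a, h + 1)

def r(s):
    """Terminal reward from player 0's perspective: +1 iff total is even."""
    total, _ = s
    return 1 if total % 2 == 0 else -1

def V(s):
    """Backward induction over the full game tree."""
    total, h = s
    if h == H:
        return r(s)
    q = [V(f(s, a)) for a in (1, 2)]   # Q*(s,a) = V*(f(s,a))
    return max(q) if h % 2 == 0 else min(q)

print(V((0, 0)))   # -> -1: the last mover (player 1) forces an odd total
```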

## Setting: Markov Game

• But $$H\approx 150$$, $$A\approx 250$$, so this tree will have $$\approx A^H$$ nodes
• 1 TB hard-drive can store $$\approx 250^5$$ 8-bit numbers
• Impossible to enumerate!


## Setting: Markov Game

Strategy:

• Approximate $$\pi^*$$, use $$\widehat \pi$$ to approximate $$V^*$$
• Low-depth tree search combines $$\widehat V$$ with simulated play from $$\widehat \pi$$


## Agenda

0. Announcements & Recap

1. Game Setting

2. Policy Learning Component

3. Value Learning Component

4. Online Planning Component

## Policy Learning

Deep network with convolutional layers

• input: 19x19 3-bit grid
• output: distribution over grid

## Imitation Learning

Warm-start policy network with expert data

1. Sample data $$(s,a)$$ from human games, $$N=30$$ million
2. Log-likelihood loss function $$\min_\pi \sum_{i=1}^N -\log(\pi(a_i|s_i))$$
3. Optimize with Stochastic Gradient Descent $$\theta_{t+1} = \theta_t - \eta \frac{1}{|\mathcal B|} \sum_{(s,a)\in \mathcal B}-\nabla_\theta\log(\pi_\theta(a|s))$$
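Steps 2 and 3 above can be sketched end to end. The linear softmax "network", the synthetic expert data, and all sizes below are illustrative assumptions, not the AlphaGo architecture or dataset:

```python
import numpy as np

# Behavior-cloning sketch: minimize sum_i -log pi(a_i | s_i) by minibatch SGD
# for a linear softmax policy. Synthetic (s, a) pairs stand in for the 30M
# human positions; a hidden weight matrix plays the role of the expert.
rng = np.random.default_rng(0)
d, A, N = 8, 5, 2000                 # state dim, num actions, dataset size

W_true = rng.normal(size=(d, A))     # hypothetical "expert" scorer
S = rng.normal(size=(N, d))
a_expert = (S @ W_true).argmax(axis=1)

theta = np.zeros((d, A))
eta, B = 0.1, 64                     # step size, minibatch size
for t in range(500):
    idx = rng.integers(0, N, size=B)
    s, a = S[idx], a_expert[idx]
    logits = s @ theta
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
    # gradient of -log pi(a|s) for softmax: s^T (p - onehot(a)), batch-averaged
    p[np.arange(B), a] -= 1.0
    theta -= eta * s.T @ p / B

acc = ((S @ theta).argmax(axis=1) == a_expert).mean()
print(f"accuracy vs expert: {acc:.2f}")   # well above chance (1/5)
```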

## Imitation Learning

How well does $$\pi_{\theta_{BC}}$$ perform?

• 57% accuracy on held-out test set (random policy: 1/200)
• 11% win rate against Pachi, an open-source Go program

## Policy Gradient

Improve the policy with self-play policy gradient:

1. Warm-start $$\theta_0 = \theta_{BC}$$
2. Iterate for $$t=0,\ldots,T-1$$:
    1. Randomly select a previous iterate $$\tau \in \{0, 1, \ldots, t\}$$
    2. Play $$\pi_{\theta_t}$$ against $$\pi_{\theta_\tau}$$ and observe $$(s_0, a_0, s_1, a_1, \ldots, s_H)$$
    3. Gradient update: $$\theta_{t+1} = \theta_t + \eta \sum_{h=0 }^{H/2}\nabla_\theta \log \pi_{\theta_t}(\textcolor{red}{a_{2h}}|s_{2h})\, r(s_H)$$
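The self-play update can be sketched in a hypothetical toy game, not Go: players alternate adding 1 or 2 for three steps, and player 0 (who moves on even steps) wins iff the final total is even. A tabular softmax policy over (step, parity-so-far) stands in for the policy network, and a uniformly random opponent stands in for the sampled past policy $\pi_{\theta_\tau}$:

```python
import numpy as np

# REINFORCE-style self-play sketch: only player 0's (even-step) actions are
# reinforced, weighted by the terminal reward r(s_H), as in step 3 above.
rng = np.random.default_rng(1)
H, A, eta = 3, 2, 0.2

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros((H, 2, A))       # logits indexed by (step, parity, action)
for game in range(3000):
    total, moves = 0, []
    for h in range(H):
        ctx = total % 2
        p = softmax(theta[h, ctx])
        a = rng.choice(A, p=p)
        moves.append((h, ctx, a, p))
        total += a + 1            # action a means "add a+1"
    r_H = 1.0 if total % 2 == 0 else -1.0
    for h, ctx, a, p in moves:
        if h % 2 == 0:            # reinforce player 0's moves only
            g = -p
            g[a] += 1.0           # gradient of log softmax
            theta[h, ctx] += eta * r_H * g

# Evaluate the greedy learned policy against a random opponent: player 0
# moves last here, so the learned parity-matching move wins every game.
wins = 0
for _ in range(200):
    total = 0
    for h in range(H):
        a = theta[h, total % 2].argmax() if h % 2 == 0 else rng.integers(A)
        total += a + 1
    wins += total % 2 == 0
print(wins / 200)
```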

...

...

How well does $$\widehat \pi = \pi_{\theta_{PG}}$$ perform?

• 85% win rate against Pachi, an open-source Go program

## Value Learning

Deep network with convolutional layers

• input: 19x19 3-bit grid
• output: scalar value

## Value Learning

• Ideally, approximate $$\widehat V \approx V^*$$
• Easier to supervise $$\widehat V \approx V^{\widehat \pi}$$, where $$V^{\widehat \pi}(s) = \mathbb E[r(s_H)|s_0=s, \widehat \pi, \widehat \pi]$$
• Supervision via rollouts
• In each game $$i$$, sample $$h$$ and set $$s_i=s_h$$ and $$y_i$$ as the game's outcome ($$\pm 1$$)
• Simulate $$N=30$$ million games
• IID sampling $$s\sim d^{\widehat \pi}$$

## Value Learning

• Least-squares regression $$\min_\beta \sum_{i=1}^N (V_\beta(s_i) - y_i)^2$$
• Optimize with SGD $$\beta_{t+1} = \beta_t - \eta \frac{1}{|\mathcal B|}\sum_{(s,y)\in\mathcal B} (V_{\beta}(s) - y) \nabla_\beta V_\beta(s)$$
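The regression step can be sketched with a linear value model on synthetic data; the features, sizes, and $\pm 1$ labels below are illustrative assumptions, not the AlphaGo value network or its self-play dataset:

```python
import numpy as np

# Least-squares value regression by minibatch SGD: fit V_beta(s) = beta^T s
# to game outcomes y in {+1, -1}, using the update from the slide.
rng = np.random.default_rng(2)
d, N = 10, 5000
beta_true = rng.normal(size=d)       # hypothetical "true" value direction
S = rng.normal(size=(N, d))          # sampled positions s_i
y = np.sign(S @ beta_true)           # outcome label y_i for each position

beta, eta, B = np.zeros(d), 0.05, 32
for t in range(3000):
    idx = rng.integers(0, N, size=B)
    s, yy = S[idx], y[idx]
    resid = s @ beta - yy                  # (V_beta(s) - y)
    beta -= eta * (s.T @ resid) / B        # SGD step on the squared loss

# The fitted values should predict the winner on nearly every position.
acc = (np.sign(S @ beta) == y).mean()
print(f"sign agreement: {acc:.2f}")
```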

## Combination with Search

$$a_t = \arg\max_a \widehat V(f(s_t,a))$$

$$a_t = \widehat \pi(s_t)$$

Both are only approximations!

## Combination with Search

1. Low depth search: use knowledge of dynamics

$$a_t = \arg\max_a \widehat V(f(s_t,a))$$

## Combination with Search

1. Low depth search: use knowledge of dynamics

$$a_t = \arg\max_a \min_{a'} \max_{a''} \widehat V(f(f(f(s_t,a),a'),a''))$$

with $$s'=f(s,a)$$, $$s''=f(s',a')$$, $$s'''=f(s'',a'')$$, and $$\widehat V(s''')$$ evaluated at the leaves.
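The depth-3 alternating search can be sketched directly from this formula. The toy dynamics `f`, leaf evaluator `V_hat`, and action set below are illustrative stand-ins, not AlphaGo's:

```python
# Fixed-depth search sketch: alternate max (our moves) and min (opponent
# moves) over the deterministic dynamics, evaluating leaves with an
# approximate value function V_hat.

def f(s, a):
    """Toy deterministic transition: append the chosen action."""
    return s + (a,)

def V_hat(s):
    """Stand-in leaf evaluator: alternating sum favors our large moves."""
    return sum((-1) ** i * a for i, a in enumerate(s))

ACTIONS = (0, 1, 2)

def search(s, depth, maximizing):
    """Alternating max/min value of state s, depth plies ahead."""
    if depth == 0:
        return V_hat(s)
    vals = (search(f(s, a), depth - 1, not maximizing) for a in ACTIONS)
    return max(vals) if maximizing else min(vals)

# a_t = argmax_a min_{a'} max_{a''} V_hat(f(f(f(s_t,a),a'),a''))
s_t = ()
a_t = max(ACTIONS, key=lambda a: search(f(s_t, a), 2, False))
print(a_t)   # -> 2: the opponent's best reply cancels, so our move dominates
```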

## Combination with Search

2. Improve value estimate with rollout: from a leaf $$s'''$$ (reached via $$s'=f(s,a)$$, $$s''=f(s',a')$$, $$s'''=f(s'',a'')$$), roll out $$\widehat \pi$$ to the end of the game, observe reward $$r$$, and blend

$$\lambda \widehat V(s''') + (1-\lambda) r$$

## Combination with Search

Monte-Carlo Tree Search (classic AI):

• expand promising or under-explored nodes
• back-propagate node values from the expansion
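A minimal UCT-style sketch of these two ideas (select by upper confidence bound, expand, roll out, back up values) on a hypothetical toy game, not Go: players alternate adding 1 or 2 for three moves, and player 0 wins iff the total reaches 5. This omits AlphaGo's policy-prior-guided selection and learned leaf values:

```python
import math, random

# Minimal MCTS: UCB selection, one-node expansion, random rollout, backup.
H, ACTIONS = 3, (1, 2)

def terminal(s): return len(s) == H
def reward(s):   return 1.0 if sum(s) >= 5 else -1.0  # player-0 perspective

class Node:
    def __init__(self, s):
        self.s, self.n, self.w = s, 0, 0.0
        self.children = {}                 # action -> Node

def ucb(parent, child, maximizing, c=1.4):
    q = child.w / child.n
    if not maximizing:
        q = -q                             # opponent minimizes player 0's value
    return q + c * math.sqrt(math.log(parent.n) / child.n)

def simulate(root):
    path, node = [root], root
    # select down the tree by UCB, expanding one untried child if any
    while not terminal(node.s):
        untried = [a for a in ACTIONS if a not in node.children]
        if untried:
            node.children[untried[0]] = node = Node(node.s + (untried[0],))
            path.append(node)
            break
        parent = node
        node = max(parent.children.values(),
                   key=lambda ch: ucb(parent, ch, len(parent.s) % 2 == 0))
        path.append(node)
    # random rollout from the expanded node to the end of the game
    s = node.s
    while not terminal(s):
        s = s + (random.choice(ACTIONS),)
    z = reward(s)
    # back-propagate the outcome along the visited path
    for n in path:
        n.n += 1
        n.w += z

random.seed(0)
root = Node(())
for _ in range(2000):
    simulate(root)
best = max(root.children, key=lambda a: root.children[a].n)
print(best)   # opening "add 2" is the only move that can force a total of 5
```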

## Combination with Search

1. Low-depth, adaptive tree search with $$\widehat V$$
2. Improve value estimates with rollouts of $$\widehat \pi$$

## Summary

1. Learning:
    1. Warm-start policy with imitation learning
    2. Improve policy with policy gradient
    3. Approximate value of the policy
2. Planning:
    1. Adaptive tree search with $$\widehat V$$ and $$\widehat \pi$$
