CS 4/5789: Introduction to Reinforcement Learning

Lecture 23: Imitation Learning

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • My OH rescheduled to today after lecture
  • Homework
    • 5789 Paper Assignments
    • PSet 8 due tonight
    • Final PA due Friday
    • Prelim corrections - next slide
  • Final exam is Tuesday 5/14 at 2pm in Ives 305

Prelim Corrections

  • You may correct any part of a question that you didn't receive full credit on

    • Bonus on your exam grade proportional to the improvement: $$\text{initial score} + \alpha\times(\text{corrected score} - \text{initial score})_+$$

  • Treat corrections like a written homework

    • neat, with clear explanations of each step

    • MC answers must include a few sentences of justification

    • scored more strictly than the exam was

    • You can visit OH and discuss with others, but solutions must be written by yourself

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

  • Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
  • Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)

Recap: Exploration

[Figure: agent-environment loop. The policy \(\pi\) takes action \(a_t\) in state \(s_t\) and receives reward \(r_t\); experience data \((s_t,a_t,r_t)\) is collected, while the transitions \(P, f\) are unknown.]

Recap: UCB

UCB-type Algorithms

  • Multi-armed bandits $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$ (a minimal sketch of this rule follows this list)
  • Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
  • Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
  • Exploration becomes more difficult outside of the tabular setting, where a comprehensive search over states and actions is possible
  • Using expert data can sidestep the exploration problem
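A minimal sketch of the multi-armed bandit rule above; the Bernoulli reward model, the constant `C`, and the initialization scheme are illustrative assumptions, not part of the course's pseudocode:

```python
import numpy as np

def ucb_bandit(arm_means, T, C=2.0, seed=0):
    """Sketch of UCB for a K-armed Bernoulli bandit.

    Each round pulls argmax_a  mu_hat[a] + sqrt(C / N[a]), matching the rule above.
    `arm_means` are the true success probabilities, unknown to the learner.
    """
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    N = np.zeros(K)        # pull counts N_{a,t}
    mu_hat = np.zeros(K)   # empirical means mu_hat_{a,t}
    total_reward = 0.0
    for t in range(T):
        if t < K:
            a = t  # pull each arm once so every N[a] > 0
        else:
            a = int(np.argmax(mu_hat + np.sqrt(C / N)))  # optimistic bonus
        r = float(rng.random() < arm_means[a])           # Bernoulli reward
        N[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / N[a]              # running average
        total_reward += r
    return total_reward

# The bonus term drives exploration toward under-pulled arms.
print(ucb_bandit([0.2, 0.5, 0.7], T=2000))
```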

Motivation for Imitation

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

Helicopter Acrobatics (Stanford)

LittleDog Robot (LAIRLab at CMU)

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

Imitation Learning

[Figure: imitation learning pipeline. Expert demonstrations are fed to a supervised ML algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks), which outputs a policy \(\pi\) mapping states to actions.]

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

Behavioral Cloning

Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$

$$\max_{\pi} ~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right] \quad \text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

rather than optimize,

imitate!

$$\min_{\pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Behavioral Cloning

  1. Policy class \(\Pi\): usually parametrized by some \(w\in\mathbb R^d\), e.g. weights of deep network, SVM, etc
  2. Loss function \(\ell(\cdot,\cdot)\): quantify accuracy
  3. Optimization Algorithm: gradient descent, interior point methods, sklearn, torch

$$\min_{\pi\in\Pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)
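As a minimal illustration of these three ingredients, here is a sketch of BC as ERM; the toy dataset, the logistic-regression policy class, and sklearn's solver are illustrative assumptions, not the course's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy expert dataset {(s_i, a_i)}: states are feature vectors, actions are discrete labels.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 4))                               # states s_i in R^4
A = (S @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(int)   # expert labels a_i = pi_*(s_i)

# Policy class Pi: linear logistic policies; loss l: log loss; optimizer: sklearn's solver.
policy = LogisticRegression().fit(S, A)                     # minimize sum_i l(pi(s_i), a_i)

# Empirical 0-1 error, the analogue of E_{s~D_*}[1{pi(s) != pi_*(s)}] on the training data.
print("training 0-1 error:", np.mean(policy.predict(S) != A))
```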

Behavioral Cloning

In this class, we assume that supervised learning works!

$$\min_{\pi\in\Pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)

i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)

For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)

Ex: Learning to Drive

[Figure: the policy \(\pi\) maps an input camera image to an output steering angle. A dataset of expert trajectory pairs \((x, y)\) is passed to supervised learning, producing a policy \(\pi(\text{image}) = \text{steering angle}\).]


Ex: Learning? to Drive

[Figure: the learned policy drifts away from the expert trajectory. No training data of "recovery" behavior!]

Ex: Learning? to Drive

What about assumption  \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?

PollEv

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”

Extra Example

  • Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
    • Optimal policy \(\pi_\star(s) = U\)
  • Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix}  1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
  • Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$
  • PollEv What is the supervised learning error?
    • \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
  • Error in performance:
    • \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)

[Figure: three-state MDP with states \(0\), \(1\), \(2\) and transitions labeled by actions \(U\) and \(D\).]
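A quick numerical sanity check of the values quoted above; the transition structure (from state 0, \(U\) leads to state 1 and \(D\) leads to state 2, each absorbing under the respective policy) is my reading of the diagram, and the particular \(\gamma\) and \(\epsilon\) are arbitrary:

```python
gamma, eps = 0.9, 0.05

# Assumed dynamics: 0 --U--> 1, 0 --D--> 2, 1 --U--> 1 (absorbing), 2 --D--> 2 (absorbing);
# reward r(s, a) = 1{s = 1}, so reaching state 1 at t=1 is worth gamma/(1-gamma).
def value_from_state0(p_up):
    """Value of state 0 when U is taken w.p. p_up, then U at state 1 / D at state 2."""
    return p_up * gamma / (1 - gamma)

V_star = value_from_state0(1.0)                     # expert: always U
V_hat = value_from_state0(1 - eps / (1 - gamma))    # BC policy hat{pi}(0)

print("V^pi*(0) =", V_star, " V^hatpi(0) =", V_hat)
print("gap:", V_star - V_hat, " predicted eps*gamma/(1-gamma)^2:", eps * gamma / (1 - gamma) ** 2)
```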

Extra: BC Analysis

  • Assuming that SL works, how sub-optimal is \(\hat\pi\)?
    • Also assume \(r(s,a)\in[0,1]\)
  • Recall the Performance Difference Lemma applied to \(\pi^\star\) and \(\hat\pi\): $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
  • The advantage of \(\pi^\star\) over \(\hat\pi\) depends on the SL error
    • \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
  • Then the sub-optimality of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
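Written out as one chain (a restatement of the bullets above, taking the expert data distribution \(\mathcal D_\star\) to be \(d_{\mu_0}^{\pi^\star}\) as in the example):

$$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2}\, \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s) \neq \hat \pi(s)\}\right] \leq \frac{2\epsilon}{(1-\gamma)^2}$$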

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

[Figure: the learned policy drifts from the expert trajectory, with no training data of "recovery" behavior; query the expert along the learned policy's trajectory, append the labeled trajectory to the dataset, and retrain.]

Idea: interact with expert to ask what they would do

DAgger: Dataset Aggregation

[Figure: DAgger loop. Execute the current policy to collect states \(s_0, s_1, s_2, \dots\); query the expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\); aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\); run supervised learning to update the policy \(\pi\).]

Ex: Off-road driving

[Pan et al, RSS 18]

Goal: map image to command

Approach: Use Model Predictive Controller as the expert!

\(\pi(\text{image}) =\) steering, throttle

DAgger Setting

  • Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
  • \(P\) unknown; \(r\) unknown and possibly unobserved
  • Access to an expert who knows \(\pi_\star\) and whom we can query at any state \(s\) during training

DAgger Algorithm

DAgger

  • Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
  • for \(i=0,1,...,T-1\)
    • Generate a dataset with \(\pi^i\) and query the expert $$\mathcal D_i = \{(s_j, a_j^\star)\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi^i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
    • Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
    • Update the policy with supervised learning $$ \pi^{i+1} = \arg\min_{\pi\in \Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a)  $$
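A minimal sketch of this loop; the `env`/`expert_action` interfaces, the episodic rollout scheme, and the logistic-regression learner are hypothetical stand-ins, not the course's reference implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(env, expert_action, T=10, N=200, horizon=50, seed=0):
    """DAgger sketch: roll out the current policy, label visited states with the
    expert, aggregate, and retrain.

    Assumed (hypothetical) interfaces: env.reset() -> state, env.step(a) -> (state,
    reward, done), env.num_actions; expert_action(s) returns pi_*(s). States are
    1-D numpy feature vectors, and the expert uses at least two distinct actions.
    """
    rng = np.random.default_rng(seed)
    D_states, D_actions = [], []   # aggregated dataset D
    policy = None                  # pi^0: act randomly until the first fit
    for i in range(T):
        # Generate D_i: states drawn by rolling out pi^i, labels from the expert.
        for _ in range(N // horizon):
            s = env.reset()
            for _ in range(horizon):
                a = rng.integers(env.num_actions) if policy is None \
                    else int(policy.predict(s.reshape(1, -1))[0])
                D_states.append(s)
                D_actions.append(expert_action(s))   # query expert at visited state
                s, _, done = env.step(a)
                if done:
                    break
        # Supervised learning on the aggregated dataset D = D ∪ D_i.
        policy = LogisticRegression(max_iter=1000).fit(np.array(D_states),
                                                       np.array(D_actions))
    return policy
```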

DAgger Performance

  • Due to active data collection, we can understand DAgger as "online learning" rather than supervised learning
  • Details are out of scope, but the online learning guarantee is of the form:

    $$\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s)\neq \pi^*(s)\}]\leq \epsilon$$

  • Contrast this with the supervised learning guarantee:

    $$\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s)\neq \pi^*(s)\}]\leq \epsilon$$

  • Because of the active data collection, we need a framework beyond "supervised learning"
  • Online learning is a general setting which captures the idea of learning from data over time

Extra: Online Learning

Online learning

  • for \(i=1,2,...,T\)
    1. Learner chooses \(f_i\)
    2. Suffer the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
  • Measure performance of online learning with average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
  • Define regret as the incurred risk compared with the best function in hindsight
  • This baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)

Regret

  • How should the learner choose \(f_i\)?
  • A good option is to solve a sequence of (regularized) supervised learning problems
    • \(d(f)\) regularizes the predictions, e.g. \(d(f)=\|\theta\|_2^2\) when \(f\) is parametrized by \(\theta\)


  • Dataset aggregation: \( \sum_{k=1}^i \mathcal R_k(f) = i\,\displaystyle \mathbb E_{x, y \sim \bar {\mathcal D}_i}[ \ell(f(x), y)]\), where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)

Follow the Regularized Leader

Alg: FTRL

  • for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^i \mathcal R_k(f) + \lambda d(f) $$
  • Theorem: For convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. regret is bounded for any sequence of \(\mathcal D_i\).
  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • Proof Sketch
    • DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i = d_{\mu_0}^{\pi^i}\)
    • Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
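As a concrete (hypothetical) instantiation of the FTRL update above, with squared loss and \(d(f)=\|\theta\|_2^2\), each round reduces to ridge regression on the aggregated dataset; the batch-per-round input format is an assumption standing in for the risks \(\mathcal R_k\):

```python
import numpy as np

def ftrl_ridge(data_streams, lam=1.0):
    """FTRL sketch with squared loss and L2 regularizer.

    At round i, f_i = argmin_theta sum_{k<=i} sum_{(x,y) in D_k} (theta^T x - y)^2
                      + lam * ||theta||_2^2,
    i.e. ridge regression on the aggregated data D_1 ∪ ... ∪ D_i.
    `data_streams` is a list of (X_k, y_k) batches, one per round.
    """
    d = data_streams[0][0].shape[1]
    X_all = np.empty((0, d))
    y_all = np.empty(0)
    thetas = []
    for X_k, y_k in data_streams:
        X_all = np.vstack([X_all, X_k])              # dataset aggregation
        y_all = np.concatenate([y_all, y_k])
        # Closed-form regularized leader: (X^T X + lam I)^{-1} X^T y
        theta = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ y_all)
        thetas.append(theta)
    return thetas
```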

DAgger Analysis

  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = -\frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[A^{ \pi^\star}(s,\pi^i(s))\right]$$
  • The advantage of \(\pi^i\) over \(\pi^\star\) is bounded (with no \(\gamma\) dependence):
    • \(A^{ \pi^\star}(s,\pi^i(s)) \geq -\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} |A^{ \pi^\star}(s,a)| \)
  • Then the sub-optimality of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} |A^{ \pi^\star}(s,a)|  $$

DAgger Analysis

Summary: BC vs. DAgger

Behavioral Cloning

  • Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s)\neq \pi^*(s)\}]\leq \epsilon\)
  • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)

DAgger

  • Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s)\neq \pi^*(s)\}]\leq \epsilon\)
  • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

Recap

  • PSet due tonight

 

  • Pitfalls of BC
  • DAgger Algorithm
  • Online learning

 

  • Next lecture: Inverse RL