CS 4/5789: Introduction to Reinforcement Learning
Lecture 23: Interactive Imitation Learning
Prof. Sarah Dean
MW 2:45–4pm
255 Olin Hall
Reminders
 Homework
 5789 Paper Reviews due weekly on Mondays
 PSet 7 due tonight, PSet 8 (final one!) released tonight
 Next week: midterm corrections
 PA 4 due next Wednesday (May 3)
 Final exam is Saturday 5/13 at 2pm
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
 Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)
Recap: Exploration
[Diagram: agent-environment loop. The policy \(\pi\) selects action \(a_t\); the environment, with unknown transitions \(P, f\), returns state \(s_t\) and reward \(r_t\); experience data \((s_t, a_t, r_t)\) is collected.]
Recap: UCB
UCB-type Algorithms
 Multi-armed bandits $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
 Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
 Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
 Exploration becomes more difficult outside of the tabular setting, where a comprehensive search is possible
 Using expert data can sidestep exploration problem
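As a concrete illustration of the multi-armed bandit rule above, here is a minimal UCB sketch; the Gaussian arms, constant \(C\), and horizon are illustrative assumptions, not from the lecture.

```python
import numpy as np

def ucb_bandit(means, T=5000, C=2.0, seed=0):
    """Run the rule argmax_a mu_hat[a] + sqrt(C / N[a]) on a Gaussian bandit."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)              # N_{a,t}: number of pulls of each arm
    totals = np.zeros(K)              # running sum of observed rewards per arm
    for t in range(T):
        if t < K:                     # pull each arm once so every N_{a,t} > 0
            a = t
        else:
            mu_hat = totals / counts
            a = int(np.argmax(mu_hat + np.sqrt(C / counts)))
        r = rng.normal(means[a], 1.0)     # observe a noisy reward
        counts[a] += 1
        totals[a] += r
    return counts

print(ucb_bandit(means=[0.2, 0.5, 0.9]))  # pulls should concentrate on the best arm
```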
Motivation for Imitation
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
Imitation Learning
[Diagram: Expert Demonstrations → Supervised ML Algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks) → Policy \(\pi\), which maps states to actions.]
Helicopter Acrobatics (Stanford)
Behavioral Cloning
[Diagram: supervised learning on a dataset of expert trajectory pairs \((x, y)\) produces a policy \(\pi\) mapping observations to actions.]
Behavioral Cloning
 Dataset from expert policy \(\pi_\star\): \(\{(s_i, a_i)\}_{i=1}^N \sim d_{\mu_0}^{\pi_\star} \)
 Recall definition of discounted state distribution
 Supervised learning with empirical risk minimization (ERM) $$\min_{\pi\in\Pi} \sum_{i=1}^N \ell(\pi(s_i), a_i) $$
 In this class, we assume that supervised learning works!
 i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim d_{\mu_0}^{\pi_\star}}[\ell(\pi(s), a)]\leq \epsilon\)

We further assume that \(\ell(\pi(s), a) \geq \mathbb 1\{\pi(s)\neq a\}\)
$$\displaystyle \mathbb E_{s\sim d_{\mu_0}^{\pi_\star}}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$
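As a concrete instance of the ERM step above, here is a minimal behavioral cloning sketch; the toy expert, feature dimension, and logistic-regression policy class (whose surrogate loss stands in for \(\ell\)) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pi_star(s):
    """Toy deterministic expert: action depends on the sign of the feature sum."""
    return int(s.sum() > 0)

# Expert demonstrations: states drawn from the expert's state distribution,
# labeled with the expert's actions.
S = rng.normal(size=(1000, 4))
A = np.array([pi_star(s) for s in S])

# ERM over the policy class Pi (here logistic regression) with a surrogate loss
policy = LogisticRegression().fit(S, A)

# Estimate the population 0-1 error E[ 1{pi(s) != pi_star(s)} ] on fresh states
S_test = rng.normal(size=(1000, 4))
A_test = np.array([pi_star(s) for s in S_test])
print("estimated SL error:", np.mean(policy.predict(S_test) != A_test))
```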
Example
 Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
 Optimal policy \(\pi_\star(s) = U\)
 Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix} 1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
 Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$

PollEv: What is the supervised learning error?
 \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
 Error in performance:
 \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)
[Diagram: three-state MDP with states 0, 1, 2 and actions \(U\), \(D\) labeling the transitions.]
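A quick numeric check of the value computation above, under the transition structure the diagram suggests (assumed here: from state 0, \(U\) leads to state 1, which is absorbing under \(U\), while \(D\) leads to the zero-reward state 2):

```python
import numpy as np

gamma, eps = 0.9, 0.05

# Closed-form values from the slide
V_star_0 = gamma / (1 - gamma)
V_hat_0 = gamma / (1 - gamma) - eps * gamma / (1 - gamma) ** 2

# Direct check: with prob 1 - eps/(1-gamma) take U at s=0 and earn reward 1
# at every step t >= 1; with prob eps/(1-gamma) take D and earn 0 forever.
p_wrong = eps / (1 - gamma)
V_hat_0_check = (1 - p_wrong) * (gamma / (1 - gamma))

print(V_star_0, V_hat_0, V_hat_0_check)
assert np.isclose(V_hat_0, V_hat_0_check)
```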
 Assuming that SL works, how suboptimal is \(\hat\pi\)?
 Also assume \(r(s,a)\in[0,1]\)
 Recall Performance Difference Lemma on \(\pi^\star\) and \(\hat\pi\) $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
 The advantage of \(\pi^\star\) over \(\hat\pi\) depends on SL error
 \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
 Then the performance of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
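Chaining the lemma, the advantage bound, and the supervised learning guarantee gives the stated result: $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\hat\pi}(s)\right] = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat\pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s)\neq\hat\pi(s)\}\right] \leq \frac{2}{(1-\gamma)^2}\,\epsilon$$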
BC Analysis
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
[Diagram: the learned policy drifts away from the expert trajectory, and there is no training data of "recovery" behavior. Fix: execute the learned policy, query the expert along its trajectory, append the labeled trajectory, and retrain.]
Idea: interact with expert to ask what they would do
DAgger: Dataset Aggregation
[Diagram of the DAgger loop: supervised learning on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) produces policy \(\pi\); execute \(\pi\) to visit states \(s_0, s_1, s_2, \dots\); query the expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\); aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\).]
Ex: Off-road driving
[Pan et al, RSS 18]
Goal: map image to command
Approach: Use Model Predictive Controller as the expert!
\(\pi(\text{image}) =\) steering, throttle
DAgger Setting
 Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
 \(P\) and \(r\) are unknown (and \(r\) may be unobserved)
 Access to an expert who knows \(\pi_\star\) and who we can query at any state \(s\) during training
DAgger Algorithm
DAgger
 Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
 for \(i=0,1,...,T-1\)
 Generate dataset with \(\pi_i\) and query the expert $$\mathcal D_i = \{s_j, a_j^\star\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi_i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
 Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
 Update policy with supervised learning $$ \pi_{i+1} = \arg\min_{\pi\in \Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a) $$
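A minimal sketch of this loop; the toy dynamics, expert oracle, and logistic-regression policy class are stand-in assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert(s):
    """Oracle for pi_star, queryable at any visited state."""
    return int(s[0] > 0)

def rollout(policy, n_steps=200):
    """Collect states visited by the current policy under toy linear dynamics."""
    s = rng.normal(size=2)
    states = []
    for _ in range(n_steps):
        states.append(s)
        a = policy(s)
        s = 0.9 * s + rng.normal(size=2) + (0.2 if a == 1 else -0.2)
    return np.array(states)

policy_fn = lambda s: int(rng.integers(2))    # initialize pi^0 with random actions
D_states, D_actions = [], []                  # dataset D = empty

for i in range(10):                           # for i = 0, ..., T-1
    states = rollout(policy_fn)               # s_j ~ d^{pi_i}
    labels = np.array([expert(s) for s in states])   # query expert: a_j* = pi_star(s_j)
    D_states.append(states)                   # dataset aggregation: D = D U D_i
    D_actions.append(labels)
    X, y = np.vstack(D_states), np.concatenate(D_actions)
    clf = LogisticRegression().fit(X, y)      # supervised learning on aggregated D
    policy_fn = lambda s, clf=clf: int(clf.predict(s.reshape(1, -1))[0])
```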
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Due to the active data collection, need a new framework beyond "supervised learning"
 Online learning is a general setting which captures the idea of learning from data over time
Online Learning
Online learning
 for \(i=1,2,...,T\)
 Learner chooses \(f_i\)
 Environment picks a distribution \(\mathcal D_i\); the learner suffers the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
 Measure performance of online learning with average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
 Define regret as the incurred risk compared with the best function in hindsight
 This baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)
Regret
 How should the learner choose \(f_i\)?
 A good option is to solve a sequence of (regularized) supervised learning problems
 \(d(f)\) regularizes the predictions, e.g. \(\|\theta\|_2^2\)
 Dataset aggregation: \( \frac{1}{i}\sum_{k=1}^i \mathcal R_k(f) = \displaystyle \mathbb E_{x, y \sim \bar{\mathcal D}_i}[ \ell(f(x), y)]\) where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)
Follow the Regularized Leader
Alg: FTRL
 for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^{i-1} \mathcal R_k(f) + \lambda d(f) $$
 Theorem: For convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. regret is bounded for any sequence of \(\mathcal D_i\).
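A minimal sketch of FTRL for online least-squares, where the regularized leader has a closed form (ridge regression on the aggregated data); the drifting data stream, dimension, and \(\lambda\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, T, n = 5, 1.0, 200, 50

def sample_batch(i):
    """Round-i distribution D_i: linear data whose target parameter slowly drifts."""
    theta_true = np.array([np.cos(i / 50), np.sin(i / 50), 0.5, -0.5, 0.1])
    X = rng.normal(size=(n, dim))
    return X, X @ theta_true + 0.1 * rng.normal(size=n)

# FTRL with squared loss and d(theta) = ||theta||_2^2: the regularized leader over
# rounds 1..i-1 is ridge regression on all previously aggregated data.
A = lam * np.eye(dim)                # running sum of X_k^T X_k plus lam * I
b = np.zeros(dim)                    # running sum of X_k^T y_k
risks, batches = [], []

for i in range(1, T + 1):
    theta_i = np.linalg.solve(A, b)            # f_i = regularized leader
    X, y = sample_batch(i)                     # environment reveals D_i
    risks.append(np.mean((X @ theta_i - y) ** 2))   # suffer R_i(f_i)
    batches.append((X, y))
    A += X.T @ X                               # aggregate the new data
    b += X.T @ y

# Best fixed comparator in hindsight (approximated by ridge on the pooled data)
X_all = np.vstack([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
theta_best = np.linalg.solve(X_all.T @ X_all + lam * np.eye(dim), X_all.T @ y_all)
print("average regret:", np.mean(risks) - np.mean((X_all @ theta_best - y_all) ** 2))
```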
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 Proof Sketch
 DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i = d_{\mu_0}^{\pi^i}\)
 Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
DAgger Analysis
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[-A^{ \pi^\star}(s,\pi^i(s))\right]$$
 The advantage gap between \(\pi^\star\) and \(\pi^i\) carries no extra \(\gamma\) dependence:
 \(-A^{ \pi^\star}(s,\pi^i(s)) \leq \mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} \left|A^{ \pi^\star}(s,a)\right| \)
 Then the performance of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} \left|A^{ \pi^\star}(s,a)\right| $$
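Chaining these steps with the online learning guarantee (for the best iterate \(i\)): $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\pi^i}(s)\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[\mathbf 1\{\pi^i(s)\neq\pi^\star(s)\}\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\epsilon$$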
DAgger Analysis
Summary: BC vs. DAgger
Behavioral Cloning
 Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
DAgger
 Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}\left|A^{\pi^*}(s,a)\right|}{1-\gamma}\epsilon\)
Recap
 PSet due tonight
 Pitfalls of BC
 DAgger Algorithm
 Online learning
 Next lecture: Inverse RL
CS 4/5789: Lecture 23
By Sarah Dean