CS 4/5789: Introduction to Reinforcement Learning

Lecture 23: Imitation Learning

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • My OH rescheduled to today after lecture
  • Homework
    • 5789 Paper Assignments
    • PSet 8 due tonight
    • Final PA due Friday
    • Prelim corrections - next slide
  • Final exam is Tuesday 5/14 at 2pm in Ives 305

Prelim Corrections

  • You may correct any part of a question that you didn't receive full credit on

    • Bonus on your exam grade proportional to the improvement: $$\text{initial score} + \alpha\times(\text{corrected score} - \text{initial score})_+$$

  • Treat corrections like a written homework

    • neat, with clear explanations of each step

    • MC answers must include a few sentences of justification

    • scored more strictly than the exam was

    • You can visit OH and discuss with others, but solutions must be written by yourself

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

  • Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
  • Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)

Recap: Exploration

[Figure: agent-environment loop. The policy \(\pi\) takes action \(a_t\) in state \(s_t\) and receives reward \(r_t\); experience data \((s_t,a_t,r_t)\) is collected, while the transitions \(P, f\) are unknown.]

Recap: UCB

UCB-type Algorithms

  • Multi-armed bandits $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$ (a minimal sketch of this rule follows this list)
  • Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
  • Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
  • Exploration becomes more difficult outside of the tabular setting, where a comprehensive search over states and actions is possible
  • Using expert data can sidestep the exploration problem
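A minimal sketch of the multi-armed bandit rule above; the Bernoulli reward model, the constant `C`, and the initialization scheme are illustrative assumptions, not part of the course's pseudocode:

```python
import numpy as np

def ucb_bandit(arm_means, T, C=2.0, seed=0):
    """Sketch of UCB for a K-armed Bernoulli bandit.

    Each round pulls argmax_a  mu_hat[a] + sqrt(C / N[a]), matching the rule above.
    `arm_means` are the true success probabilities, unknown to the learner.
    """
    rng = np.random.default_rng(seed)
    K = len(arm_means)
    N = np.zeros(K)        # pull counts N_{a,t}
    mu_hat = np.zeros(K)   # empirical means mu_hat_{a,t}
    total_reward = 0.0
    for t in range(T):
        if t < K:
            a = t  # pull each arm once so every N[a] > 0
        else:
            a = int(np.argmax(mu_hat + np.sqrt(C / N)))  # optimistic bonus
        r = float(rng.random() < arm_means[a])           # Bernoulli reward
        N[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / N[a]              # running average
        total_reward += r
    return total_reward

# The bonus term drives exploration toward under-pulled arms.
print(ucb_bandit([0.2, 0.5, 0.7], T=2000))
```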

Motivation for Imitation

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

Helicopter Acrobatics (Stanford)

LittleDog Robot (LAIRLab at CMU)

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

Imitation Learning

[Figure: imitation learning pipeline. Expert demonstrations are fed to a supervised ML algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks), which outputs a policy \(\pi\) mapping states to actions.]

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

Behavioral Cloning

Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$

$$\max_{\pi} ~ \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right] \quad \text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

rather than optimize,

imitate!

$$\min_{\pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Behavioral Cloning

  1. Policy class \(\Pi\): usually parametrized by some \(w\in\mathbb R^d\), e.g. weights of deep network, SVM, etc
  2. Loss function \(\ell(\cdot,\cdot)\): quantify accuracy
  3. Optimization Algorithm: gradient descent, interior point methods, sklearn, torch

$$\min_{\pi\in\Pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)
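As a minimal illustration of these three ingredients, here is a sketch of BC as ERM; the toy dataset, the logistic-regression policy class, and sklearn's solver are illustrative assumptions, not the course's setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy expert dataset {(s_i, a_i)}: states are feature vectors, actions are discrete labels.
rng = np.random.default_rng(0)
S = rng.normal(size=(500, 4))                               # states s_i in R^4
A = (S @ np.array([1.0, -2.0, 0.5, 0.0]) > 0).astype(int)   # expert labels a_i = pi_*(s_i)

# Policy class Pi: linear logistic policies; loss l: log loss; optimizer: sklearn's solver.
policy = LogisticRegression().fit(S, A)                     # minimize sum_i l(pi(s_i), a_i)

# Empirical 0-1 error, the analogue of E_{s~D_*}[1{pi(s) != pi_*(s)}] on the training data.
print("training 0-1 error:", np.mean(policy.predict(S) != A))
```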

Behavioral Cloning

In this class, we assume that supervised learning works!

$$\min_{\pi\in\Pi} ~ \sum_{i=1}^N \ell(\pi(s_i), a_i)$$

Supervised learning with empirical risk minimization (ERM)

i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)

For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)

Ex: Learning to Drive

[Figure: the policy \(\pi\) maps an input camera image to an output steering angle. A dataset of expert trajectory pairs \((x, y)\) is passed to supervised learning, producing a policy \(\pi(\text{image}) = \text{steering angle}\).]


Ex: Learning? to Drive

[Figure: the learned policy drifts away from the expert trajectory. No training data of "recovery" behavior!]

Ex: Learning? to Drive

What about assumption  \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?

PollEv

An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]

“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”

Extra Example

  • Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
    • Optimal policy \(\pi_\star(s) = U\)
  • Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix}  1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
  • Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$
  • PollEv What is the supervised learning error?
    • \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
  • Error in performance:
    • \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)

[Figure: three-state MDP with states \(0\), \(1\), \(2\) and transitions labeled by actions \(U\) and \(D\).]
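A quick numerical sanity check of the values quoted above; the transition structure (from state 0, \(U\) leads to state 1 and \(D\) leads to state 2, each absorbing under the respective policy) is my reading of the diagram, and the particular \(\gamma\) and \(\epsilon\) are arbitrary:

```python
gamma, eps = 0.9, 0.05

# Assumed dynamics: 0 --U--> 1, 0 --D--> 2, 1 --U--> 1 (absorbing), 2 --D--> 2 (absorbing);
# reward r(s, a) = 1{s = 1}, so reaching state 1 at t=1 is worth gamma/(1-gamma).
def value_from_state0(p_up):
    """Value of state 0 when U is taken w.p. p_up, then U at state 1 / D at state 2."""
    return p_up * gamma / (1 - gamma)

V_star = value_from_state0(1.0)                     # expert: always U
V_hat = value_from_state0(1 - eps / (1 - gamma))    # BC policy hat{pi}(0)

print("V^pi*(0) =", V_star, " V^hatpi(0) =", V_hat)
print("gap:", V_star - V_hat, " predicted eps*gamma/(1-gamma)^2:", eps * gamma / (1 - gamma) ** 2)
```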

Extra: BC Analysis

  • Assuming that SL works, how sub-optimal is \(\hat\pi\)?
    • Also assume \(r(s,a)\in[0,1]\)
  • Recall the Performance Difference Lemma applied to \(\pi^\star\) and \(\hat\pi\): $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
  • The advantage of \(\pi^\star\) over \(\hat\pi\) depends on the SL error
    • \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
  • Then the sub-optimality of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
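Written out as one chain (a restatement of the bullets above, taking the expert data distribution \(\mathcal D_\star\) to be \(d_{\mu_0}^{\pi^\star}\) as in the example):

$$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2}\, \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s) \neq \hat \pi(s)\}\right] \leq \frac{2\epsilon}{(1-\gamma)^2}$$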

Agenda

1. Recap: Exploration

2. Imitation Learning

3. Behavioral Cloning

4. DAgger

[Figure: the learned policy drifts from the expert trajectory, with no training data of "recovery" behavior; query the expert along the learned policy's trajectory, append the labeled trajectory to the dataset, and retrain.]

Idea: interact with expert to ask what they would do

DAgger: Dataset Aggregation

[Figure: DAgger loop. Execute the current policy to collect states \(s_0, s_1, s_2, \dots\); query the expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\); aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\); run supervised learning to update the policy \(\pi\).]

Ex: Off-road driving

[Pan et al, RSS 18]

Goal: map image to command

Approach: Use Model Predictive Controller as the expert!

\(\pi(\text{image}) =\) steering, throttle

DAgger Setting

  • Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
  • \(P\) unknown; \(r\) unknown and possibly unobserved
  • Access to an expert who knows \(\pi_\star\) and whom we can query at any state \(s\) during training

DAgger Algorithm

DAgger

  • Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
  • for \(i=0,1,...,T-1\)
    • Generate a dataset with \(\pi^i\) and query the expert $$\mathcal D_i = \{(s_j, a_j^\star)\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi^i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
    • Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
    • Update the policy with supervised learning $$ \pi^{i+1} = \arg\min_{\pi\in \Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a)  $$
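A minimal sketch of this loop; the `env`/`expert_action` interfaces, the episodic rollout scheme, and the logistic-regression learner are hypothetical stand-ins, not the course's reference implementation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(env, expert_action, T=10, N=200, horizon=50, seed=0):
    """DAgger sketch: roll out the current policy, label visited states with the
    expert, aggregate, and retrain.

    Assumed (hypothetical) interfaces: env.reset() -> state, env.step(a) -> (state,
    reward, done), env.num_actions; expert_action(s) returns pi_*(s). States are
    1-D numpy feature vectors, and the expert uses at least two distinct actions.
    """
    rng = np.random.default_rng(seed)
    D_states, D_actions = [], []   # aggregated dataset D
    policy = None                  # pi^0: act randomly until the first fit
    for i in range(T):
        # Generate D_i: states drawn by rolling out pi^i, labels from the expert.
        for _ in range(N // horizon):
            s = env.reset()
            for _ in range(horizon):
                a = rng.integers(env.num_actions) if policy is None \
                    else int(policy.predict(s.reshape(1, -1))[0])
                D_states.append(s)
                D_actions.append(expert_action(s))   # query expert at visited state
                s, _, done = env.step(a)
                if done:
                    break
        # Supervised learning on the aggregated dataset D = D ∪ D_i.
        policy = LogisticRegression(max_iter=1000).fit(np.array(D_states),
                                                       np.array(D_actions))
    return policy
```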

DAgger Performance

  • Due to active data collection, we can understand DAgger as "online learning" rather than supervised learning
  • Details are out of scope, but the online learning guarantee is of the form:

    $$\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s)\neq \pi^*(s)\}]\leq \epsilon$$

  • Contrast this with the supervised learning guarantee:

    $$\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s)\neq \pi^*(s)\}]\leq \epsilon$$

  • Because of the active data collection, we need a framework beyond "supervised learning"
  • Online learning is a general setting which captures the idea of learning from data over time

Extra: Online Learning

Online learning

  • for \(i=1,2,...,T\)
    1. Learner chooses \(f_i\)
    2. Suffer the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
  • Measure performance of online learning with average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
  • Define regret as the incurred risk compared with the best function in hindsight
  • This baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)

Regret

  • How should the learner choose \(f_i\)?
  • A good option is to solve a sequence of (regularized) supervised learning problems
    • \(d(f)\) regularizes the predictions, e.g. \(d(f)=\|\theta\|_2^2\) when \(f\) is parametrized by \(\theta\)


  • Dataset aggregation: \( \sum_{k=1}^i \mathcal R_k(f) = i\,\displaystyle \mathbb E_{x, y \sim \bar {\mathcal D}_i}[ \ell(f(x), y)]\), where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)

Follow the Regularized Leader

Alg: FTRL

  • for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^i \mathcal R_k(f) + \lambda d(f) $$
  • Theorem: For convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. regret is bounded for any sequence of \(\mathcal D_i\).
  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • Proof Sketch
    • DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i = d_{\mu_0}^{\pi^i}\)
    • Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
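As a concrete (hypothetical) instantiation of the FTRL update above, with squared loss and \(d(f)=\|\theta\|_2^2\), each round reduces to ridge regression on the aggregated dataset; the batch-per-round input format is an assumption standing in for the risks \(\mathcal R_k\):

```python
import numpy as np

def ftrl_ridge(data_streams, lam=1.0):
    """FTRL sketch with squared loss and L2 regularizer.

    At round i, f_i = argmin_theta sum_{k<=i} sum_{(x,y) in D_k} (theta^T x - y)^2
                      + lam * ||theta||_2^2,
    i.e. ridge regression on the aggregated data D_1 ∪ ... ∪ D_i.
    `data_streams` is a list of (X_k, y_k) batches, one per round.
    """
    d = data_streams[0][0].shape[1]
    X_all = np.empty((0, d))
    y_all = np.empty(0)
    thetas = []
    for X_k, y_k in data_streams:
        X_all = np.vstack([X_all, X_k])              # dataset aggregation
        y_all = np.concatenate([y_all, y_k])
        # Closed-form regularized leader: (X^T X + lam I)^{-1} X^T y
        theta = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ y_all)
        thetas.append(theta)
    return thetas
```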

DAgger Analysis

  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = -\frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[A^{ \pi^\star}(s,\pi^i(s))\right]$$
  • The advantage of \(\pi^i\) over \(\pi^\star\) is bounded (with no \(\gamma\) dependence):
    • \(A^{ \pi^\star}(s,\pi^i(s)) \geq -\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} |A^{ \pi^\star}(s,a)| \)
  • Then the sub-optimality of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} |A^{ \pi^\star}(s,a)|  $$

DAgger Analysis

Summary: BC vs. DAgger

Behavioral Cloning

  • Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s)\neq \pi^*(s)\}]\leq \epsilon\)
  • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)

DAgger

  • Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s)\neq \pi^*(s)\}]\leq \epsilon\)
  • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

Recap

  • PSet due tonight

 

  • Pitfalls of BC
  • DAgger Algorithm
  • Online learning

 

  • Next lecture: Inverse RL