CS 4/5789: Introduction to Reinforcement Learning

Lecture 23: Interactive Imitation Learning

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 7 due tonight, PSet 8 (final one!) released tonight
      • Next week: midterm corrections
    • PA 4 due next Wednesday (May 3)
  • Final exam is Saturday 5/13 at 2pm

Agenda

1. Recap: Exploration

2. Imitation Learning

3. DAgger

4. Online Learning

  • Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
  • Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)

Recap: Exploration

Figure: the agent-environment interaction loop. The policy \(\pi\) selects action \(a_t\) in state \(s_t\); the environment returns reward \(r_t\); the experience data \((s_t,a_t,r_t)\) are collected, while the transitions \(P, f\) are unknown.

Recap: UCB

UCB-type Algorithms

  • Multi-armed bandits (see the code sketch after this list) $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
  • Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
  • Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
  • Exploration becomes more difficult outside of the tabular setting, in which a comprehensive search is possible
  • Using expert data can sidestep the exploration problem
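A minimal sketch of the multi-armed bandit UCB rule above; the Bernoulli arms, horizon, and constant \(C\) are illustrative choices, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.7])   # unknown to the learner
K, T, C = len(true_means), 1000, 2.0

counts = np.zeros(K)   # N_{a,t}: number of pulls of each arm
sums = np.zeros(K)     # running sum of rewards for each arm

for t in range(T):
    if t < K:
        a = t  # pull each arm once so N_{a,t} > 0
    else:
        means_hat = sums / counts                # \hat mu_{a,t}
        bonus = np.sqrt(C / counts)              # exploration bonus
        a = int(np.argmax(means_hat + bonus))    # UCB action
    r = rng.binomial(1, true_means[a])           # observe reward
    counts[a] += 1
    sums[a] += r
```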

Motivation for Imitation

Agenda

1. Recap: Exploration

2. Imitation Learning

3. DAgger

4. Online Learning

Imitation Learning

Figure: the imitation learning pipeline. Expert demonstrations are fed to a supervised ML algorithm (e.g. SVM, Gaussian process, kernel ridge regression, deep networks), which outputs a policy \(\pi\) that maps states to actions. Example application: helicopter acrobatics (Stanford).

Behavioral Cloning

Figure: behavioral cloning. A dataset of expert trajectories, treated as \((x, y)\) pairs, is fed to supervised learning to produce a policy \(\pi\) mapping observed states to actions.

Behavioral Cloning

  • Dataset from expert policy \(\pi_\star\): \(\{(s_i, a_i)\}_{i=1}^N \sim d_{\mu_0}^{\pi_\star} \)
    • Recall definition of discounted state distribution
  • Supervised learning with empirical risk minimization (ERM) $$\min_{\pi\in\Pi} \sum_{i=1}^N \ell(\pi(s_i), a_i) $$
  • In this class, we assume that supervised learning works!
    • i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim d_{\mu_0}^{\pi_\star}}[\ell(\pi(s), a)]\leq \epsilon\)
  • We further assume that \(\ell(\pi(s), a) \geq \mathbf 1\{\pi(s)\neq a\}\), so that the 0-1 error is also small:
    $$\mathbb E_{s\sim d_{\mu_0}^{\pi_\star}}[\mathbf 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$
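A minimal sketch of the behavioral cloning ERM step, assuming scikit-learn is available; the synthetic expert dataset and logistic-regression policy class are illustrative stand-ins for \(\{(s_i, a_i)\}_{i=1}^N\) and \(\Pi\):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a dataset drawn from d_{mu_0}^{pi_star}:
# feature-vector states and discrete expert actions.
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(500, 4))
expert_actions = (expert_states[:, 0] > 0).astype(int)  # pretend expert rule

# Supervised learning step: minimize a surrogate loss over the policy class.
clf = LogisticRegression().fit(expert_states, expert_actions)

def pi(s):
    """Learned policy: maps a state (feature vector) to an action."""
    return int(clf.predict(s.reshape(1, -1))[0])
```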

Example

  • Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
    • Optimal policy \(\pi_\star(s) = U\)
  • Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix}  1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
  • Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$
  • PollEv: What is the supervised learning error? (worked out below the figure)
    • \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
  • Error in performance:
    • \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)

Figure: a three-state MDP with states 0, 1, 2 and actions \(U\), \(D\); from state 0, action \(U\) leads to (and remains at) state 1, while action \(D\) leads to (and remains at) state 2.
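Worked computation for the poll and the value gap (assuming, as the diagram indicates, that \(U\) at state 0 leads to the absorbing, rewarding state 1 and \(D\) leads to the absorbing, zero-reward state 2). The supervised learning error is $$\mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat\pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right] = (1-\gamma)\cdot\frac{\epsilon}{1-\gamma} + \gamma\cdot 0 + 0 \cdot 1 = \epsilon,$$ while the values are $$V^{\pi_\star}(0) = \sum_{t\geq 1}\gamma^t = \frac{\gamma}{1-\gamma}, \qquad V^{\hat\pi}(0) = \left(1-\frac{\epsilon}{1-\gamma}\right)\frac{\gamma}{1-\gamma} = \frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}.$$ A supervised learning error of \(\epsilon\) thus translates into a value gap on the order of \(\epsilon/(1-\gamma)^2\).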

  • Assuming that SL works, how sub-optimal is \(\hat\pi\)?
    • Also assume \(r(s,a)\in[0,1]\)
  • Recall the Performance Difference Lemma applied to \(\pi^\star\) and \(\hat\pi\) $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
  • The advantage of \(\pi^\star\) over \(\hat\pi\) depends on SL error
    • \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
  • Then the performance of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
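Chaining the PDL, the advantage bound, and the supervised learning guarantee gives the stated bound: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s)\neq\hat\pi(s)\}\right] \leq \frac{2\epsilon}{(1-\gamma)^2}.$$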

BC Analysis

Agenda

1. Recap: Exploration

2. Imitation Learning

3. DAgger

4. Online Learning

Figure: the learned policy drifts away from the expert trajectory, and there is no training data of "recovery" behavior.

Idea: interact with the expert to ask what they would do: execute the learned policy, query the expert along the visited states, append the resulting trajectory to the dataset, and retrain.

DAgger: Dataset Aggregation

Figure: the DAgger loop. Supervised learning on the dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) produces a policy \(\pi\); executing \(\pi\) yields states \(s_0, s_1, s_2, \ldots\); querying the expert yields labels \(\pi^*(s_0), \pi^*(s_1), \ldots\); the pairs \((x_i = s_i,\ y_i = \pi^*(s_i))\) are aggregated back into \(\mathcal D\).

Ex: Off-road driving

[Pan et al, RSS 18]

Goal: map image to command

Approach: Use Model Predictive Controller as the expert!

\(\pi(\text{image})=\) steering, throttle

DAgger Setting

  • Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
  • \(P\) unknown; \(r\) unknown and possibly unobserved
  • Access to an expert who knows \(\pi_\star\) and whom we can query at any state \(s\) during training

DAgger Algorithm

DAgger

  • Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
  • for \(i=0,1,...,T-1\)
    • Generate dataset with \(\pi^i\) and query the expert $$\mathcal D_i = \{s_j, a_j^\star\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi^i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
    • Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
    • Update policy with supervised learning $$ \pi^{i+1} = \arg\min_{\pi\in\Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a)  $$
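A minimal sketch of the DAgger loop above, assuming scikit-learn; the 1-d toy environment, the hand-coded expert standing in for \(\pi_\star\), and the ridge-regression policy class (continuous actions with squared loss instead of the discrete 0-1 loss) are illustrative choices, not from the lecture:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def env_step(s, a):
    """Toy dynamics: the action nudges the scalar state."""
    return s + a + 0.1 * rng.normal()

def expert_action(s):
    """Stand-in expert pi_star: push the state back toward 0."""
    return -s

def dagger(T=10, N=100):
    D_states, D_actions = [], []        # aggregated dataset D
    policy = lambda s: 0.0              # arbitrary initial policy pi^0
    for i in range(T):
        # Roll out pi^i to sample states s ~ d_{mu_0}^{pi^i}, query expert labels
        s = rng.normal()
        for _ in range(N):
            D_states.append([s])
            D_actions.append(expert_action(s))   # a*_j = pi_star(s_j)
            s = env_step(s, policy(s))
        # Supervised learning on the aggregated dataset (dataset aggregation)
        reg = Ridge(alpha=1e-3).fit(np.array(D_states), np.array(D_actions))
        policy = lambda s, reg=reg: float(reg.predict([[s]])[0])
    return policy

learned = dagger()   # learned(s) should approximately return -s
```

Retraining on the aggregated dataset at each iteration is exactly the FTRL update discussed in the next part of the lecture.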

Agenda

1. Recap: Exploration

2. Imitation Learning

3. DAgger

4. Online Learning

  • Due to the active data collection, we need a new framework beyond standard supervised learning
  • Online learning is a general setting which captures the idea of learning from data over time

Online Learning

Online learning

  • for \(i=1,2,...,T\)
    1. Learner chooses \(f_i\)
    2. Learner suffers the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
  • Regret is the incurred risk compared with the best fixed function in hindsight; we measure performance with the average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
  • The baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)

Regret

  • How should the learner choose \(f_i\)?
  • A good option is to solve a sequence of (regularized) supervised learning problems
    • \(d(f)\) regularizes the predictions, e.g. \(d(f_\theta) = \|\theta\|_2^2\) for a parametric class
  • Dataset aggregation: \(\displaystyle \frac{1}{i}\sum_{k=1}^i \mathcal R_k(f) = \mathbb E_{x, y \sim \bar{\mathcal D}_i}[\ell(f(x), y)]\), where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)

Follow the Regularized Leader

Alg: FTRL

  • for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^{i-1} \mathcal R_k(f) + \lambda d(f) $$ (a code sketch follows this list)
  • Theorem: For a convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. the average regret is \(O(1/\sqrt T)\) for any sequence of \(\mathcal D_i\).
  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • Proof Sketch
    • DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i\) the distribution of \((s, \pi^\star(s))\) with \(s\sim d_{\mu_0}^{\pi^i}\)
    • Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
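A minimal FTRL sketch for linear prediction with squared loss and regularizer \(d(\theta)=\|\theta\|_2^2\), where each update has a closed form (ridge regression on all rounds seen so far); the synthetic data stream and the constants are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, lam = 3, 100, 1.0
theta_true = rng.normal(size=d)   # generates the stream; unknown to the learner

A = lam * np.eye(d)     # lam * regularizer Hessian + running sum of x x^T
b = np.zeros(d)         # running sum of y x
theta = np.zeros(d)     # current predictor f_i
total_loss = 0.0

for i in range(T):
    x = rng.normal(size=d)
    y = theta_true @ x + 0.1 * rng.normal()
    total_loss += (theta @ x - y) ** 2   # suffer the loss at round i with f_i
    # FTRL update: argmin over all past squared losses + lam * ||theta||^2
    A += np.outer(x, x)
    b += y * x
    theta = np.linalg.solve(A, b)
```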

DAgger Analysis

  • Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
  • We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = -\frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[A^{ \pi^\star}(s,\pi^i(s))\right]$$
  • The advantage of \(\pi^i\)'s action under \(\pi^\star\) has no \(1/(1-\gamma)\) dependence
    • \(A^{ \pi^\star}(s,\pi^i(s)) \geq -\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} |A^{ \pi^\star}(s,a)| \)
  • Then the performance of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} |A^{ \pi^\star}(s,a)|  $$
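Spelling out the chain for the best iterate \(i\) from the corollary: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = -\frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[A^{\pi^\star}(s,\pi^i(s))\right] \leq \frac{\max_{s,a}|A^{\pi^\star}(s,a)|}{1-\gamma}\, \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[\mathbf 1\{\pi^i(s)\neq\pi^\star(s)\}\right] \leq \frac{\epsilon}{1-\gamma}\max_{s,a}|A^{\pi^\star}(s,a)|.$$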

DAgger Analysis

Summary: BC vs. DAgger

  • Behavioral Cloning
    • Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
    • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
  • DAgger
    • Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
    • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

Recap

  • PSet due tonight

 

  • Pitfalls of BC
  • DAgger Algorithm
  • Online learning

 

  • Next lecture: Inverse RL