CS 4/5789: Introduction to Reinforcement Learning
Lecture 23: Interactive Imitation Learning
Prof. Sarah Dean
MW 2:45–4pm
255 Olin Hall
Reminders
 Homework
 5789 Paper Reviews due weekly on Mondays
 PSet 7 due tonight, PSet 8 (final one!) released tonight
 Next week: midterm corrections
 PA 4 due next Wednesday (May 3)
 Final exam is Saturday 5/13 at 2pm
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
 Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)
Recap: Exploration
[Diagram: agent-environment loop. The policy \(\pi\) selects action \(a_t\); the environment, with unknown transitions \(P, f\), returns state \(s_t\) and reward \(r_t\); experience data \((s_t, a_t, r_t)\) is collected.]
Recap: UCB
UCB-type Algorithms
 Multi-armed bandits $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
 Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
 Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
 Exploration becomes more difficult outside of the tabular setting, where a comprehensive search is possible
 Using expert data can sidestep exploration problem
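As a concrete illustration of the multi-armed bandit rule above, here is a minimal UCB sketch; the Gaussian arms, constant \(C\), and horizon are illustrative assumptions, not from the lecture.

```python
import numpy as np

def ucb_bandit(means, T=5000, C=2.0, seed=0):
    """Run the rule argmax_a mu_hat[a] + sqrt(C / N[a]) on a Gaussian bandit."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)              # N_{a,t}: number of pulls of each arm
    totals = np.zeros(K)              # running sum of observed rewards per arm
    for t in range(T):
        if t < K:                     # pull each arm once so every N_{a,t} > 0
            a = t
        else:
            mu_hat = totals / counts
            a = int(np.argmax(mu_hat + np.sqrt(C / counts)))
        r = rng.normal(means[a], 1.0)     # observe a noisy reward
        counts[a] += 1
        totals[a] += r
    return counts

print(ucb_bandit(means=[0.2, 0.5, 0.9]))  # pulls should concentrate on the best arm
```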
Motivation for Imitation
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
Imitation Learning
[Diagram: Expert Demonstrations → Supervised ML Algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks) → Policy \(\pi\), which maps states to actions.]
Helicopter Acrobatics (Stanford)
Behavioral Cloning
[Diagram: supervised learning on a dataset of expert trajectory pairs \((x, y)\) produces a policy \(\pi\) mapping observations to actions.]
Behavioral Cloning
 Dataset from expert policy \(\pi_\star\): \(\{(s_i, a_i)\}_{i=1}^N \sim d_{\mu_0}^{\pi_\star} \)
 Recall definition of discounted state distribution
 Supervised learning with empirical risk minimization (ERM) $$\min_{\pi\in\Pi} \sum_{i=1}^N \ell(\pi(s_i), a_i) $$
 In this class, we assume that supervised learning works!
 i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim d_{\mu_0}^{\pi_\star}}[\ell(\pi(s), a)]\leq \epsilon\)

We further assume that \(\ell(\pi(s), a) \geq \mathbb 1\{\pi(s)\neq a\}\)
$$\displaystyle \mathbb E_{s\sim d_{\mu_0}^{\pi_\star}}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$
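As a concrete instance of the ERM step above, here is a minimal behavioral cloning sketch; the toy expert, feature dimension, and logistic-regression policy class (whose surrogate loss stands in for \(\ell\)) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pi_star(s):
    """Toy deterministic expert: action depends on the sign of the feature sum."""
    return int(s.sum() > 0)

# Expert demonstrations: states drawn from the expert's state distribution,
# labeled with the expert's actions.
S = rng.normal(size=(1000, 4))
A = np.array([pi_star(s) for s in S])

# ERM over the policy class Pi (here logistic regression) with a surrogate loss
policy = LogisticRegression().fit(S, A)

# Estimate the population 0-1 error E[ 1{pi(s) != pi_star(s)} ] on fresh states
S_test = rng.normal(size=(1000, 4))
A_test = np.array([pi_star(s) for s in S_test])
print("estimated SL error:", np.mean(policy.predict(S_test) != A_test))
```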
Example
 Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
 Optimal policy \(\pi_\star(s) = U\)
 Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix} 1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
 Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$

PollEv: What is the supervised learning error?
 \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
 Error in performance:
 \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)
[Diagram: three-state MDP with states 0, 1, 2 and actions \(U\), \(D\) labeling the transitions.]
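A quick numeric check of the value computation above, under the transition structure the diagram suggests (assumed here: from state 0, \(U\) leads to state 1, which is absorbing under \(U\), while \(D\) leads to the zero-reward state 2):

```python
import numpy as np

gamma, eps = 0.9, 0.05

# Closed-form values from the slide
V_star_0 = gamma / (1 - gamma)
V_hat_0 = gamma / (1 - gamma) - eps * gamma / (1 - gamma) ** 2

# Direct check: with prob 1 - eps/(1-gamma) take U at s=0 and earn reward 1
# at every step t >= 1; with prob eps/(1-gamma) take D and earn 0 forever.
p_wrong = eps / (1 - gamma)
V_hat_0_check = (1 - p_wrong) * (gamma / (1 - gamma))

print(V_star_0, V_hat_0, V_hat_0_check)
assert np.isclose(V_hat_0, V_hat_0_check)
```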
 Assuming that SL works, how suboptimal is \(\hat\pi\)?
 Also assume \(r(s,a)\in[0,1]\)
 Recall Performance Difference Lemma on \(\pi^\star\) and \(\hat\pi\) $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
 The advantage of \(\pi^\star\) over \(\hat\pi\) depends on SL error
 \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
 Then the performance of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
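Chaining the lemma, the advantage bound, and the supervised learning guarantee gives the stated result: $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\hat\pi}(s)\right] = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat\pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s)\neq\hat\pi(s)\}\right] \leq \frac{2}{(1-\gamma)^2}\,\epsilon$$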
BC Analysis
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
[Diagram: the learned policy drifts away from the expert trajectory, and there is no training data of "recovery" behavior. Fix: execute the learned policy, query the expert along its trajectory, append the labeled trajectory, and retrain.]
Idea: interact with expert to ask what they would do
DAgger: Dataset Aggregation
[Diagram of the DAgger loop: supervised learning on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) produces policy \(\pi\); execute \(\pi\) to visit states \(s_0, s_1, s_2, \dots\); query the expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\); aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\).]
Ex: Off-road driving
[Pan et al, RSS 18]
Goal: map image to command
Approach: Use Model Predictive Controller as the expert!
\(\pi(\text{image}) =\) steering, throttle
DAgger Setting
 Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
 \(P\) and \(r\) are unknown (and \(r\) may be unobserved)
 Access to an expert who knows \(\pi_\star\) and who we can query at any state \(s\) during training
DAgger Algorithm
DAgger
 Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
 for \(i=0,1,...,T-1\)
 Generate dataset with \(\pi_i\) and query the expert $$\mathcal D_i = \{s_j, a_j^\star\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi_i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
 Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
 Update policy with supervised learning $$ \pi_{i+1} = \arg\min_{\pi\in \Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a) $$
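A minimal sketch of this loop; the toy dynamics, expert oracle, and logistic-regression policy class are stand-in assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert(s):
    """Oracle for pi_star, queryable at any visited state."""
    return int(s[0] > 0)

def rollout(policy, n_steps=200):
    """Collect states visited by the current policy under toy linear dynamics."""
    s = rng.normal(size=2)
    states = []
    for _ in range(n_steps):
        states.append(s)
        a = policy(s)
        s = 0.9 * s + rng.normal(size=2) + (0.2 if a == 1 else -0.2)
    return np.array(states)

policy_fn = lambda s: int(rng.integers(2))    # initialize pi^0 with random actions
D_states, D_actions = [], []                  # dataset D = empty

for i in range(10):                           # for i = 0, ..., T-1
    states = rollout(policy_fn)               # s_j ~ d^{pi_i}
    labels = np.array([expert(s) for s in states])   # query expert: a_j* = pi_star(s_j)
    D_states.append(states)                   # dataset aggregation: D = D U D_i
    D_actions.append(labels)
    X, y = np.vstack(D_states), np.concatenate(D_actions)
    clf = LogisticRegression().fit(X, y)      # supervised learning on aggregated D
    policy_fn = lambda s, clf=clf: int(clf.predict(s.reshape(1, -1))[0])
```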
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Due to the active data collection, need a new framework beyond "supervised learning"
 Online learning is a general setting which captures the idea of learning from data over time
Online Learning
Online learning
 for \(i=1,2,...,T\)
 Learner chooses \(f_i\)
 Environment picks a distribution \(\mathcal D_i\); the learner suffers the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
 Measure performance of online learning with average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
 Define regret as the incurred risk compared with the best function in hindsight
 This baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)
Regret
 How should the learner choose \(f_i\)?
 A good option is to solve a sequence of (regularized) supervised learning problems
 \(d(f)\) regularizes the predictions, e.g. \(\|\theta\|_2^2\)
 Dataset aggregation: \( \frac{1}{i}\sum_{k=1}^i \mathcal R_k(f) = \displaystyle \mathbb E_{x, y \sim \bar{\mathcal D}_i}[ \ell(f(x), y)]\) where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)
Follow the Regularized Leader
Alg: FTRL
 for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^{i-1} \mathcal R_k(f) + \lambda d(f) $$
 Theorem: For convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. regret is bounded for any sequence of \(\mathcal D_i\).
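A minimal sketch of FTRL for online least-squares, where the regularized leader has a closed form (ridge regression on the aggregated data); the drifting data stream, dimension, and \(\lambda\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, T, n = 5, 1.0, 200, 50

def sample_batch(i):
    """Round-i distribution D_i: linear data whose target parameter slowly drifts."""
    theta_true = np.array([np.cos(i / 50), np.sin(i / 50), 0.5, -0.5, 0.1])
    X = rng.normal(size=(n, dim))
    return X, X @ theta_true + 0.1 * rng.normal(size=n)

# FTRL with squared loss and d(theta) = ||theta||_2^2: the regularized leader over
# rounds 1..i-1 is ridge regression on all previously aggregated data.
A = lam * np.eye(dim)                # running sum of X_k^T X_k plus lam * I
b = np.zeros(dim)                    # running sum of X_k^T y_k
risks, batches = [], []

for i in range(1, T + 1):
    theta_i = np.linalg.solve(A, b)            # f_i = regularized leader
    X, y = sample_batch(i)                     # environment reveals D_i
    risks.append(np.mean((X @ theta_i - y) ** 2))   # suffer R_i(f_i)
    batches.append((X, y))
    A += X.T @ X                               # aggregate the new data
    b += X.T @ y

# Best fixed comparator in hindsight (approximated by ridge on the pooled data)
X_all = np.vstack([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
theta_best = np.linalg.solve(X_all.T @ X_all + lam * np.eye(dim), X_all.T @ y_all)
print("average regret:", np.mean(risks) - np.mean((X_all @ theta_best - y_all) ** 2))
```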
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 Proof Sketch
 DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i = d_{\mu_0}^{\pi^i}\)
 Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
DAgger Analysis
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[-A^{ \pi^\star}(s,\pi^i(s))\right]$$
 The advantage gap between \(\pi^\star\) and \(\pi^i\) carries no extra \(\gamma\) dependence:
 \(-A^{ \pi^\star}(s,\pi^i(s)) \leq \mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} \left|A^{ \pi^\star}(s,a)\right| \)
 Then the performance of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} \left|A^{ \pi^\star}(s,a)\right| $$
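Chaining these steps with the online learning guarantee (for the best iterate \(i\)): $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\pi^i}(s)\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[\mathbf 1\{\pi^i(s)\neq\pi^\star(s)\}\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\epsilon$$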
DAgger Analysis
Summary: BC vs. DAgger
Behavioral Cloning
 Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
DAgger
 Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}\left|A^{\pi^*}(s,a)\right|}{1-\gamma}\epsilon\)
Recap
 PSet due tonight
 Pitfalls of BC
 DAgger Algorithm
 Online learning
 Next lecture: Inverse RL
CS 4/5789: Lecture 23
By Sarah Dean