CS 4/5789: Introduction to Reinforcement Learning
Lecture 23: Interactive Imitation Learning
Prof. Sarah Dean
MW 2:45–4pm
255 Olin Hall
Reminders
 Homework
 5789 Paper Reviews due weekly on Mondays
 PSet 7 due tonight, PSet 8 (final one!) released tonight
 Next week: midterm corrections
 PA 4 due next Wednesday (May 3)
 Final exam is Saturday 5/13 at 2pm
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Unit 2: Constructing labels for supervised learning, updating policy using learned quantities
 Unit 3: Design policy to both explore (collect useful data) and exploit (high reward)
Recap: Exploration
[Diagram: agent-environment loop. The policy \(\pi\) selects action \(a_t\); the environment, with unknown transitions \(P, f\), returns state \(s_t\) and reward \(r_t\); experience data \((s_t, a_t, r_t)\) is collected.]
Recap: UCB
UCB-type Algorithms
 Multi-armed bandits $$\arg\max_a \widehat \mu_{a,t} + \sqrt{C/N_{a,t}}$$
 Linear contextual bandits $$\arg\max_{a} \hat \theta_{a,t}^\top x_t + \sqrt{x_t^\top A_{a,t}^{-1} x_t}$$
 Markov decision process (tabular) $$\arg\max_a \hat r_i(s,a)+H\sqrt{\frac{\alpha}{N_i(s,a)}}+\mathbb E_{s'\sim \hat P_i(s,a)}[\hat V^i_{t+1}(s')]$$
 Exploration becomes more difficult outside of the tabular setting, where a comprehensive search is possible
 Using expert data can sidestep exploration problem
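As a concrete illustration of the multi-armed bandit rule above, here is a minimal UCB sketch; the Gaussian arms, constant \(C\), and horizon are illustrative assumptions, not from the lecture.

```python
import numpy as np

def ucb_bandit(means, T=5000, C=2.0, seed=0):
    """Run the rule argmax_a mu_hat[a] + sqrt(C / N[a]) on a Gaussian bandit."""
    rng = np.random.default_rng(seed)
    K = len(means)
    counts = np.zeros(K)              # N_{a,t}: number of pulls of each arm
    totals = np.zeros(K)              # running sum of observed rewards per arm
    for t in range(T):
        if t < K:                     # pull each arm once so every N_{a,t} > 0
            a = t
        else:
            mu_hat = totals / counts
            a = int(np.argmax(mu_hat + np.sqrt(C / counts)))
        r = rng.normal(means[a], 1.0)     # observe a noisy reward
        counts[a] += 1
        totals[a] += r
    return counts

print(ucb_bandit(means=[0.2, 0.5, 0.9]))  # pulls should concentrate on the best arm
```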
Motivation for Imitation
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
Imitation Learning
[Diagram: Expert Demonstrations → Supervised ML Algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks) → Policy \(\pi\), which maps states to actions.]
Helicopter Acrobatics (Stanford)
Behavioral Cloning
[Diagram: supervised learning on a dataset of expert trajectory pairs \((x, y)\) produces a policy \(\pi\) mapping observations to actions.]
Behavioral Cloning
 Dataset from expert policy \(\pi_\star\): \(\{(s_i, a_i)\}_{i=1}^N \sim d_{\mu_0}^{\pi_\star} \)
 Recall definition of discounted state distribution
 Supervised learning with empirical risk minimization (ERM) $$\min_{\pi\in\Pi} \sum_{i=1}^N \ell(\pi(s_i), a_i) $$
 In this class, we assume that supervised learning works!
 i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim d_{\mu_0}^{\pi_\star}}[\ell(\pi(s), a)]\leq \epsilon\)

We further assume that \(\ell(\pi(s), a) \geq \mathbb 1\{\pi(s)\neq a\}\)
$$\displaystyle \mathbb E_{s\sim d_{\mu_0}^{\pi_\star}}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon$$
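As a concrete instance of the ERM step above, here is a minimal behavioral cloning sketch; the toy expert, feature dimension, and logistic-regression policy class (whose surrogate loss stands in for \(\ell\)) are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def pi_star(s):
    """Toy deterministic expert: action depends on the sign of the feature sum."""
    return int(s.sum() > 0)

# Expert demonstrations: states drawn from the expert's state distribution,
# labeled with the expert's actions.
S = rng.normal(size=(1000, 4))
A = np.array([pi_star(s) for s in S])

# ERM over the policy class Pi (here logistic regression) with a surrogate loss
policy = LogisticRegression().fit(S, A)

# Estimate the population 0-1 error E[ 1{pi(s) != pi_star(s)} ] on fresh states
S_test = rng.normal(size=(1000, 4))
A_test = np.array([pi_star(s) for s in S_test])
print("estimated SL error:", np.mean(policy.predict(S_test) != A_test))
```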
Example
 Initial state \(s_0=0\) and reward \(r(s,a) = \mathbf 1\{s=1\}\)
 Optimal policy \(\pi_\star(s) = U\)
 Discounted state distribution \(d_{0}^{\pi_\star} = \begin{bmatrix} 1-\gamma & \gamma & 0 \end{bmatrix}^\top \)
 Consider \(\hat\pi(1)=U\), \(\hat\pi(2)=D\), and $$\hat\pi(0) = \begin{cases}U & w.p. ~~1-\frac{\epsilon}{1-\gamma}\\ D & w.p. ~~\frac{\epsilon}{1-\gamma} \end{cases}$$

PollEv: What is the supervised learning error?
 \( \mathbb E_{s\sim d_0^{\pi_\star}}\left[\mathbb E_{a\sim \hat \pi(s)}[\mathbf 1\{a\neq \pi_\star(s)\}]\right]=\epsilon\)
 Error in performance:
 \(V^{\pi_\star}(0) = \frac{\gamma}{1-\gamma}\) vs. \(V^{\hat\pi}(0) =\frac{\gamma}{1-\gamma} - \frac{\epsilon\gamma}{(1-\gamma)^2}\)
[Diagram: three-state MDP with states 0, 1, 2 and actions \(U\), \(D\) labeling the transitions.]
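A quick numeric check of the value computation above, under the transition structure the diagram suggests (assumed here: from state 0, \(U\) leads to state 1, which is absorbing under \(U\), while \(D\) leads to the zero-reward state 2):

```python
import numpy as np

gamma, eps = 0.9, 0.05

# Closed-form values from the slide
V_star_0 = gamma / (1 - gamma)
V_hat_0 = gamma / (1 - gamma) - eps * gamma / (1 - gamma) ** 2

# Direct check: with prob 1 - eps/(1-gamma) take U at s=0 and earn reward 1
# at every step t >= 1; with prob eps/(1-gamma) take D and earn 0 forever.
p_wrong = eps / (1 - gamma)
V_hat_0_check = (1 - p_wrong) * (gamma / (1 - gamma))

print(V_star_0, V_hat_0, V_hat_0_check)
assert np.isclose(V_hat_0, V_hat_0_check)
```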
 Assuming that SL works, how suboptimal is \(\hat\pi\)?
 Also assume \(r(s,a)\in[0,1]\)
 Recall Performance Difference Lemma on \(\pi^\star\) and \(\hat\pi\) $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat \pi}(s,\pi^\star(s))\right]$$
 The advantage of \(\pi^\star\) over \(\hat\pi\) depends on SL error
 \(A^{\hat \pi}(s,\pi^\star(s)) \leq \frac{2}{1-\gamma}\mathbf 1 \{\pi^\star(s) \neq \hat \pi(s)\}\)
 Then the performance of BC is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\hat\pi}(s) \right]\leq \frac{2}{(1-\gamma)^2}\epsilon $$
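Chaining the lemma, the advantage bound, and the supervised learning guarantee gives the stated result: $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\hat\pi}(s)\right] = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[A^{\hat\pi}(s,\pi^\star(s))\right] \leq \frac{2}{(1-\gamma)^2}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^\star}}\left[\mathbf 1\{\pi^\star(s)\neq\hat\pi(s)\}\right] \leq \frac{2}{(1-\gamma)^2}\,\epsilon$$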
BC Analysis
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
[Diagram: the learned policy drifts away from the expert trajectory, and there is no training data of "recovery" behavior. Fix: execute the learned policy, query the expert along its trajectory, append the labeled trajectory, and retrain.]
Idea: interact with expert to ask what they would do
DAgger: Dataset Aggregation
[Diagram of the DAgger loop: supervised learning on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) produces policy \(\pi\); execute \(\pi\) to visit states \(s_0, s_1, s_2, \dots\); query the expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\); aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\).]
Ex: Off-road driving
[Pan et al, RSS 18]
Goal: map image to command
Approach: Use Model Predictive Controller as the expert!
\(\pi(\text{image}) =\) steering, throttle
DAgger Setting
 Discounted Infinite Horizon MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\} $$
 \(P\) and \(r\) are unknown (and \(r\) may be unobserved)
 Access to an expert who knows \(\pi_\star\) and who we can query at any state \(s\) during training
DAgger Algorithm
DAgger
 Initialize \(\pi^0\) and dataset \(\mathcal D = \emptyset\)
 for \(i=0,1,...,T-1\)
 Generate dataset with \(\pi_i\) and query the expert $$\mathcal D_i = \{s_j, a_j^\star\}_{j=1}^N \quad s_j\sim d_{\mu_0}^{\pi_i},\quad \mathbb E[a^\star_j] = \pi_\star(s_j) $$
 Dataset Aggregation: \(\mathcal D = \mathcal D \cup \mathcal D_i \)
 Update policy with supervised learning $$ \pi_{i+1} = \arg\min_{\pi\in \Pi} \sum_{(s,a)\in\mathcal D} \ell(\pi(s), a) $$
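A minimal sketch of this loop; the toy dynamics, expert oracle, and logistic-regression policy class are stand-in assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def expert(s):
    """Oracle for pi_star, queryable at any visited state."""
    return int(s[0] > 0)

def rollout(policy, n_steps=200):
    """Collect states visited by the current policy under toy linear dynamics."""
    s = rng.normal(size=2)
    states = []
    for _ in range(n_steps):
        states.append(s)
        a = policy(s)
        s = 0.9 * s + rng.normal(size=2) + (0.2 if a == 1 else -0.2)
    return np.array(states)

policy_fn = lambda s: int(rng.integers(2))    # initialize pi^0 with random actions
D_states, D_actions = [], []                  # dataset D = empty

for i in range(10):                           # for i = 0, ..., T-1
    states = rollout(policy_fn)               # s_j ~ d^{pi_i}
    labels = np.array([expert(s) for s in states])   # query expert: a_j* = pi_star(s_j)
    D_states.append(states)                   # dataset aggregation: D = D U D_i
    D_actions.append(labels)
    X, y = np.vstack(D_states), np.concatenate(D_actions)
    clf = LogisticRegression().fit(X, y)      # supervised learning on aggregated D
    policy_fn = lambda s, clf=clf: int(clf.predict(s.reshape(1, -1))[0])
```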
Agenda
1. Recap: Exploration
2. Imitation Learning
3. DAgger
4. Online Learning
 Due to the active data collection, need a new framework beyond "supervised learning"
 Online learning is a general setting which captures the idea of learning from data over time
Online Learning
Online learning
 for \(i=1,2,...,T\)
 Learner chooses \(f_i\)
 Environment picks a distribution \(\mathcal D_i\); the learner suffers the risk (i.e. expected loss) $$ \mathcal R_i(f_i) = \mathbb E_{x,y\sim \mathcal D_i}[\ell(f_i(x), y)]$$
 Measure performance of online learning with average regret $$\frac{1}{T} R(T) = \frac{1}{T} \left(\sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f)\right) $$
 Define regret as the incurred risk compared with the best function in hindsight
 This baseline represents the usual offline supervised learning approach $$\min_f \frac{1}{T}\sum_{i=1}^T \mathcal R_i(f) = \min_f \mathbb E_{x,y\sim \bar{\mathcal D}} [\ell(f(x), y)]$$ where we define \(\bar{\mathcal D} = \frac{1}{T}\sum_{i=1}^T \mathcal D_i\)
Regret
 How should the learner choose \(f_i\)?
 A good option is to solve a sequence of (regularized) supervised learning problems
 \(d(f)\) regularizes the predictions, e.g. \(\|\theta\|_2^2\)
 Dataset aggregation: \( \frac{1}{i}\sum_{k=1}^i \mathcal R_k(f) = \displaystyle \mathbb E_{x, y \sim \bar{\mathcal D}_i}[ \ell(f(x), y)]\) where \(\bar{\mathcal D}_i = \frac{1}{i}\sum_{k=1}^i \mathcal D_k\)
Follow the Regularized Leader
Alg: FTRL
 for \(i=1,2,...,T\) $$f_i = \arg\min_f \sum_{k=1}^{i-1} \mathcal R_k(f) + \lambda d(f) $$
 Theorem: For convex loss and strongly convex regularizer, $$ \max_{\mathcal R_1,...,\mathcal R_T} \frac{1}{T} \left( \sum_{i=1}^T \mathcal R_i(f_i) - \min_f \sum_{i=1}^T \mathcal R_i(f) \right) \leq O\left(1/\sqrt{T}\right)$$ i.e. regret is bounded for any sequence of \(\mathcal D_i\).
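A minimal sketch of FTRL for online least-squares, where the regularized leader has a closed form (ridge regression on the aggregated data); the drifting data stream, dimension, and \(\lambda\) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, lam, T, n = 5, 1.0, 200, 50

def sample_batch(i):
    """Round-i distribution D_i: linear data whose target parameter slowly drifts."""
    theta_true = np.array([np.cos(i / 50), np.sin(i / 50), 0.5, -0.5, 0.1])
    X = rng.normal(size=(n, dim))
    return X, X @ theta_true + 0.1 * rng.normal(size=n)

# FTRL with squared loss and d(theta) = ||theta||_2^2: the regularized leader over
# rounds 1..i-1 is ridge regression on all previously aggregated data.
A = lam * np.eye(dim)                # running sum of X_k^T X_k plus lam * I
b = np.zeros(dim)                    # running sum of X_k^T y_k
risks, batches = [], []

for i in range(1, T + 1):
    theta_i = np.linalg.solve(A, b)            # f_i = regularized leader
    X, y = sample_batch(i)                     # environment reveals D_i
    risks.append(np.mean((X @ theta_i - y) ** 2))   # suffer R_i(f_i)
    batches.append((X, y))
    A += X.T @ X                               # aggregate the new data
    b += X.T @ y

# Best fixed comparator in hindsight (approximated by ridge on the pooled data)
X_all = np.vstack([X for X, _ in batches])
y_all = np.concatenate([y for _, y in batches])
theta_best = np.linalg.solve(X_all.T @ X_all + lam * np.eye(dim), X_all.T @ y_all)
print("average regret:", np.mean(risks) - np.mean((X_all @ theta_best - y_all) ** 2))
```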
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 Proof Sketch
 DAgger as FTRL: \(f_i=\pi^i\), \((x,y)=(s, \pi^\star(s))\), and \(\mathcal D_i = d_{\mu_0}^{\pi^i}\)
 Minimum policy error is upper bounded by average regret (using that loss upper bounds indicator)
DAgger Analysis
 Corollary: For DAgger with a noiseless expert, $$ \min_{1\leq i\leq T} \mathbb E_{s\sim d^{\pi^i}_{\mu_0} } [\mathbf 1\{\pi^i(s)\neq \pi^\star(s)\}] \leq O(1/\sqrt{T}) =: \epsilon $$
 We can show an upper bound, starting with PDL $$ \mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right] = \frac{1}{1-\gamma} \mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[-A^{ \pi^\star}(s,\pi^i(s))\right]$$
 The advantage gap between \(\pi^\star\) and \(\pi^i\) carries no extra \(\gamma\) dependence:
 \(-A^{ \pi^\star}(s,\pi^i(s)) \leq \mathbf 1\{\pi^i(s)\neq \pi^\star(s)\} \cdot \max_{s,a} \left|A^{ \pi^\star}(s,a)\right| \)
 Then the performance of DAgger is upper bounded: $$\mathbb E_{s\sim\mu_0} \left[ V^{\pi_\star}(s) - V^{\pi^i}(s) \right]\leq \frac{\epsilon}{1-\gamma}\max_{s,a} \left|A^{ \pi^\star}(s,a)\right| $$
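Chaining these steps with the online learning guarantee (for the best iterate \(i\)): $$\mathbb E_{s\sim\mu_0}\left[V^{\pi_\star}(s) - V^{\pi^i}(s)\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\mathbb E_{s\sim d_{\mu_0}^{\pi^i}}\left[\mathbf 1\{\pi^i(s)\neq\pi^\star(s)\}\right] \leq \frac{\max_{s,a}\left|A^{\pi^\star}(s,a)\right|}{1-\gamma}\,\epsilon$$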
DAgger Analysis
Summary: BC vs. DAgger
Behavioral Cloning
 Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
DAgger
 Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
 Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}\left|A^{\pi^*}(s,a)\right|}{1-\gamma}\epsilon\)
Recap
 PSet due tonight
 Pitfalls of BC
 DAgger Algorithm
 Online learning
 Next lecture: Inverse RL
CS 4/5789: Lecture 23
By Sarah Dean