Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
You may correct any part of a question that you didn't receive full credit on
Bonus on your exam grade proportional to the difference: $$\text{initial score} + \alpha\times(\text{corrected score} - \text{initial score})_+$$
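A hypothetical worked example of the bonus formula (the numbers and the value of \(\alpha\) are illustrative only, not the actual course parameters): with an initial score of 70, a corrected score of 90, and \(\alpha = 0.5\),
$$70 + 0.5\times(90 - 70)_+ = 80$$
The \((\cdot)_+\) means corrections can only raise the grade, never lower it.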
Treat corrections like a written homework
neat, with clear explanations of each step
Multiple choice answers must include a few sentences of justification
scored more strictly than the exam was
You can visit OH and discuss with others, but solutions must be written by yourself
1. Recap: Exploration
2. Imitation Learning
3. Behavioral Cloning
4. DAgger
[Recap diagram: the policy \(\pi\) observes state \(s_t\), takes action \(a_t\), and receives reward \(r_t\); the environment's transitions \(P, f\) are unknown, and the agent learns from the data/experience \((s_t, a_t, r_t)\)]
UCB-type Algorithms
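As a refresher on the optimism principle behind UCB-type algorithms, here is a minimal bandit-style sketch (the environment callable `pull_arm`, the bonus constant, and the bandit setting itself are illustrative assumptions, not the exact algorithm from lecture):

```python
import numpy as np

def ucb(pull_arm, n_arms, horizon, c=2.0):
    """Minimal UCB sketch: after trying each arm once, pick the arm
    maximizing (empirical mean reward) + (optimism bonus)."""
    counts = np.zeros(n_arms)   # number of pulls per arm
    means = np.zeros(n_arms)    # empirical mean reward per arm
    for t in range(1, horizon + 1):
        if t <= n_arms:
            a = t - 1           # initialization: pull every arm once
        else:
            bonus = np.sqrt(c * np.log(t) / counts)
            a = int(np.argmax(means + bonus))
        r = pull_arm(a)         # assumed environment interface: returns a reward
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]  # running-mean update
    return means, counts
```

For example, `ucb(lambda a: np.random.binomial(1, [0.2, 0.8][a]), n_arms=2, horizon=1000)` runs the sketch on a two-armed Bernoulli bandit.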
1. Recap: Exploration
2. Imitation Learning
3. Behavioral Cloning
4. DAgger
Helicopter Acrobatics (Stanford)
LittleDog Robot (LAIRLab at CMU)
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
Expert Demonstrations
Supervised ML Algorithm
Policy \(\pi\)
ex - SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
maps states to actions
1. Recap: Exploration
2. Imitation Learning
3. Behavioral Cloning
4. DAgger
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\pi\)
rather than optimize,
imitate!
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\)
sklearn, torch
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\in\Pi\)
Supervised learning with empirical risk minimization (ERM)
In this class, we assume that supervised learning works!
minimize \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)
\(\pi\in\Pi\)
Supervised learning with empirical risk minimization (ERM)
i.e. we successfully optimize and generalize, so that the population loss is small: \(\displaystyle \mathbb E_{s,a\sim\mathcal D_\star}[\ell(\pi(s), a)]\leq \epsilon\)
For many loss functions, this means that
\(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)
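A minimal behavioral-cloning sketch of the ERM objective above, using sklearn as mentioned on the slide (the file names, array shapes, and choice of classifier are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Expert demonstrations {(s_i, a_i)}_{i=1}^N drawn from D_star
# (hypothetical files; shapes (N, state_dim) and (N,) assumed).
states = np.load("expert_states.npy")
actions = np.load("expert_actions.npy")

# ERM over the policy class Pi = logistic-regression classifiers,
# with the log loss standing in for ell(pi(s_i), a_i).
clf = LogisticRegression(max_iter=1000).fit(states, actions)

def pi(s):
    """Learned policy: maps a state to a (discrete) action."""
    return clf.predict(np.asarray(s).reshape(1, -1))[0]
```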
Policy \(\pi\)
Input: Camera Image
Output: Steering Angle
[Diagram: a dataset of \((x, y)\) pairs from an expert trajectory is fed to a supervised learning algorithm, producing a policy \(\pi(\text{camera image}) = \text{steering angle}\)]
[Figure: the learned policy drifts away from the expert trajectory]
No training data of "recovery" behavior!
What about the assumption \(\displaystyle \mathbb E_{s\sim\mathcal D_\star}[\mathbb 1\{\pi(s)\neq \pi_\star(s)\}]\leq \epsilon\)?
An Autonomous Land Vehicle In A Neural Network [Pomerleau, NIPS ‘88]
“If the network is not presented with sufficient variability in its training exemplars to cover the conditions it is likely to encounter...[it] will perform poorly”
[Poll example (figure): a small problem with actions \(U\) and \(D\)]
1. Recap: Exploration
2. Imitation Learning
3. Behavioral Cloning
4. DAgger
[Figure: the learned policy drifts away from the expert trajectory]
No training data of "recovery" behavior
[Figure: along the learned policy's trajectory, query the expert, append the labeled trajectory, and retrain]
Idea: interact with expert to ask what they would do
The DAgger loop (from the diagram; sketched in code below):
1. Supervised learning: fit a policy \(\pi\) on the dataset \(\mathcal D = \{(x_i, y_i)\}_{i=1}^M\)
2. Execute: roll out \(\pi\) to visit states \(s_0, s_1, s_2, \ldots\)
3. Query expert: collect labels \(\pi^*(s_0), \pi^*(s_1), \ldots\)
4. Aggregate: append \((x_i = s_i,\ y_i = \pi^*(s_i))\) to \(\mathcal D\) and repeat
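A minimal DAgger sketch following the loop above (the `rollout` and `expert_policy` interfaces, the initialization, and the choice of classifier are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(rollout, expert_policy, init_states, n_iters=10):
    """DAgger sketch.

    rollout(policy)   -> list of states visited when executing `policy` (assumed interface)
    expert_policy(s)  -> expert action pi*(s) for a state s (assumed interface)
    init_states       -> states from initial expert demonstrations
    """
    # Start the dataset with expert labels on the initial states.
    X = list(init_states)
    y = [expert_policy(s) for s in init_states]

    for _ in range(n_iters):
        # 1. Supervised learning on the aggregated dataset D.
        clf = LogisticRegression(max_iter=1000).fit(np.array(X), np.array(y))
        policy = lambda s, clf=clf: clf.predict(np.asarray(s).reshape(1, -1))[0]
        # 2. Execute the learned policy to visit states s_0, s_1, ...
        visited = rollout(policy)
        # 3. Query the expert for labels pi*(s_t) on the visited states.
        # 4. Aggregate: append (s_t, pi*(s_t)); retrain on the next pass.
        X.extend(visited)
        y.extend(expert_policy(s) for s in visited)

    return policy
```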
[Pan et al, RSS 18]
Goal: map image to command
Approach: Use Model Predictive Controller as the expert!
\(\pi(\text{image}) =\) steering, throttle
DAgger
$$\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon$$
Contrast this with the supervised learning guarantee:
$$\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon$$
Online learning
Alg: Follow the Regularized Leader (FTRL)
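A hedged sketch of the connection (the notation \(\tau^i\) for the states collected by rolling out \(\pi^i\), and the regularizer \(R\), are introduced here for illustration): retraining on the aggregated dataset at each DAgger round is a Follow-the-(Regularized)-Leader update, where the round-\(i\) loss is the imitation loss on the states visited by \(\pi^i\):
$$\pi^{t+1} \in \arg\min_{\pi\in\Pi}\; \sum_{i=1}^{t}\sum_{s\in\tau^i} \ell\big(\pi(s),\, \pi^*(s)\big) \;+\; R(\pi)$$
If this online learner is no-regret, then on average over rounds (hence for at least one round \(t\)) the loss is small on the learner's own state distribution, which is the online learning guarantee below.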
Supervised learning guarantee
\(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
Online learning guarantee
\(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)
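A sketch of where these two bounds come from, stated informally under the guarantees above. For DAgger, the performance difference lemma gives
$$V_\mu^{\pi^*} - V_\mu^{\pi^t} = \frac{1}{1-\gamma}\,\mathbb E_{s\sim d^{\pi^t}_\mu}\!\left[-A^{\pi^*}\big(s, \pi^t(s)\big)\right] \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\,\mathbb E_{s\sim d^{\pi^t}_\mu}\!\left[\mathbf 1\{\pi^t(s)\neq\pi^*(s)\}\right] \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\,\epsilon$$
using that \(A^{\pi^*}(s, \pi^*(s)) = 0\), so only states where \(\pi^t\) disagrees with \(\pi^*\) contribute. For behavioral cloning, the error guarantee holds under \(d^{\pi^*}_\mu\) while \(\widehat\pi\) is rolled out under its own state distribution; the induced distribution mismatch can grow like \(\epsilon/(1-\gamma)\), and each disagreement can cost on the order of \(1/(1-\gamma)\) in value, which is the source of the extra \(1/(1-\gamma)\) factor in the \(2\epsilon/(1-\gamma)^2\) bound.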