Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Imitation learning pipeline (figure): expert demonstrations → supervised ML algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, deep networks) → policy \(\pi\), which maps states to actions.
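A rough sketch of this pipeline, with placeholder data and a placeholder model choice (any of the supervised learners above would do):

```python
# Behavior cloning sketch: fit a supervised model to expert (state, action) pairs.
# The expert data arrays and the choice of classifier are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

# Hypothetical expert demonstrations: states as feature vectors, actions as labels.
expert_states = np.random.randn(500, 4)          # placeholder state features
expert_actions = np.random.randint(0, 4, 500)    # placeholder discrete actions

policy = SVC(kernel="rbf")                       # any supervised learner works here
policy.fit(expert_states, expert_actions)        # learn pi: states -> actions

action = policy.predict(expert_states[:1])       # the policy maps a state to an action
```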
Interactive imitation loop (figure): supervised learning on the dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) yields a policy \(\pi\); execute \(\pi\) to visit states \(s_0, s_1, s_2, \ldots\); query the expert for \(\pi^*(s_0), \pi^*(s_1), \ldots\); aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset, and repeat.
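A minimal sketch of this interactive (DAgger-style) loop; the environment interface (`env.reset`, `env.step`), the expert oracle, and the choice of learner are illustrative assumptions, not the course's implementation.

```python
# DAgger-style sketch: iterate supervised learning, rollout, expert queries, aggregation.
# env, expert_policy, and the classifier are hypothetical stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

def dagger(env, expert_policy, n_iters=10, horizon=100):
    states, labels = [], []
    policy = None
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(horizon):
            # Execute the current learner (or the expert on the very first iteration).
            a = expert_policy(s) if policy is None else policy.predict([s])[0]
            # Query the expert for the correct label at the visited state.
            states.append(s)
            labels.append(expert_policy(s))
            s, done = env.step(a)     # assumed interface: returns next state and done flag
            if done:
                break
        # Aggregate all data so far and refit the policy (supervised learning step).
        policy = LogisticRegression(max_iter=1000).fit(np.array(states), np.array(labels))
    return policy
```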
Supervised learning guarantee
\(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
Online learning guarantee
\(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
\(s=\) position in the scene
\(a\in \{\) north, south, east, west \(\}\)
\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix} \)
Detect location of important objects
Reason about current position
(Kitani et al., 2012)
\(\theta_\star\) encodes cost/benefit of outcomes
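To make the linear cost/benefit structure concrete, a toy sketch for a single fixed state, with made-up feature values and weights (not from Kitani et al., 2012):

```python
# Toy linear reward r(s, a) = theta_*^T phi(s, a); all numbers below are made up.
import numpy as np

ACTIONS = ["north", "south", "east", "west"]

# Hypothetical feature table for one fixed state s:
# for each direction a, [P(hit building | a), P(hit car | a)].
phi_table = {
    "north": np.array([0.10, 0.02]),
    "south": np.array([0.00, 0.30]),
    "east":  np.array([0.05, 0.05]),
    "west":  np.array([0.40, 0.01]),
}

theta_star = np.array([-5.0, -10.0])     # made-up weights: both outcomes are costly

def reward(a):
    # Linear reward for moving in direction a from the fixed state.
    return theta_star @ phi_table[a]

best_action = max(ACTIONS, key=reward)   # direction with the highest reward
```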
Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
Consistent policies imply consistent rewards: if the feature expectations match, then \(\theta^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] = \theta^\top \mathbb E_{d^{\pi}_\mu}[\varphi(s,a)]\) for every linear reward \(r_\theta = \theta^\top\varphi\)
Estimate the feature distribution from expert data: $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) $$
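A small sketch of the empirical estimate above; the feature map `phi` and the demonstration arrays are assumed inputs:

```python
# Empirical estimate of E_{d^{pi*}}[phi(s, a)] from N expert state-action pairs.
import numpy as np

def expert_feature_expectation(demo_states, demo_actions, phi):
    # demo_states[i], demo_actions[i]: the i-th expert pair (s_i*, a_i*).
    feats = np.array([phi(s, a) for s, a in zip(demo_states, demo_actions)])
    return feats.mean(axis=0)   # (1/N) * sum_i phi(s_i*, a_i*)
```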
Since the expert is optimal for the true reward \(\theta_*^\top\varphi\), \(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\) for any policy \(\pi\)
Depending on the expert policy and the definition of the features, more than one policy may satisfy this constraint
How should we choose between policies?
Occam's razor: choose the policy which encodes the least information
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
The mathematical concept of entropy allows us to quantify the "amount of information" in a distribution
Definition: The entropy of a distribution \(P\in\Delta(\mathcal X) \) is $$\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)$$
Fact: since \(P(x)\in[0,1]\), entropy is non-negative
Entropy is a measure of uncertainty
Distributions with higher entropy are more uncertain, while lower entropy means less uncertainty
Consider distributions over \(A\) actions \(\mathcal A = \{1,...,A\}\)
Uniform distribution \(U(a) = \frac{1}{A}\)
\(\mathsf{Ent}(U) = -\sum_{a\in\mathcal A} \frac{1}{A} \log \frac{1}{A} = \log A\)
Deterministic distribution \(P\): \(a=1\) with probability \(1\)
\(\mathsf{Ent}(P) = -1\cdot \log 1 =0\)
Uniform distribution over \(\{1,2\}\): \(U_1(a) = \frac{1}{2}\mathbf 1\{a\in\{1,2\}\}\)
\(\mathsf{Ent}(U_1) = -\sum_{a=1}^2 \frac{1}{2} \log\frac{1}{2}=\log 2\)
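A quick numerical check of these three examples (a minimal sketch; the helper just implements the definition of entropy):

```python
# Entropy of a discrete distribution: Ent(P) = -sum_x P(x) log P(x).
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                       # convention: 0 * log 0 = 0
    return -np.sum(nz * np.log(nz))

A = 4
print(entropy(np.ones(A) / A))          # uniform over A actions: log A ~ 1.386
print(entropy([1.0, 0.0, 0.0, 0.0]))    # deterministic: 0
print(entropy([0.5, 0.5, 0.0, 0.0]))    # uniform over {1, 2}: log 2 ~ 0.693
```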
Principle: among distributions consistent with constraints (e.g. observed data, mean, variance) choose the one with the most uncertainty, i.e. the highest entropy
Specifically, we define
\(\mathsf{Ent}(\pi) = \mathbb E_{s\sim d^{\pi}_\mu}[\mathsf{Ent}(\pi(s)) ] \)
\(= \mathbb E_{s\sim d^{\pi}_\mu}[\mathbb E_{a\sim\pi(s)}[-\log \pi(a|s)]] \)
\(= \mathbb E_{s,a\sim d^{\pi}_\mu}[-\log \pi(a|s)] \)
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
maximize \(\mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] = 0\)
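A brief sketch of where this program leads (heuristic, assuming strong duality; the multiplier sign is chosen so that \(r_w = w^\top\varphi\) plays the role of a reward): introduce a Lagrange multiplier \(w\) for the feature-matching constraint,
$$\mathcal L(\pi, w) = \mathbb E_{d^{\pi}_\mu}[-\log \pi(a|s)] + w^\top\Big(\mathbb E_{d^{\pi}_\mu}[\varphi(s,a)] - \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)]\Big).$$
For fixed \(w\), maximizing over \(\pi\) is entropy-regularized RL with reward \(r_w(s,a) = w^\top\varphi(s,a)\), whose solution is a softmax policy \(\pi_w(a|s)\propto\exp(Q^{\mathrm{soft}}_w(s,a))\); computing it is what motivates Soft VI next.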
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Algorithm: Max-Ent IRL
Soft-VI
Reward function: \(\widehat r(s,a) = \bar w^\top \varphi(s,a)\)
Log-likelihood of a trajectory \(\tau = (s_0, a_0, s_1, a_1, \ldots)\): $$P(\tau) \propto \exp\Big(\textstyle\sum_{t} \widehat r(s_t, a_t)\Big), \qquad \log P(\tau) = \bar w^\top \textstyle\sum_t \varphi(s_t, a_t) - \log Z$$
\(s\): position, \(a\): direction
most likely path
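A minimal tabular sketch of these two pieces, under assumed finite state/action spaces, known dynamics `P[s, a, s']`, and features `Phi[s, a]`: soft value iteration computes the soft-optimal policy for the current reward \(\widehat r = \bar w^\top\varphi\), and the outer Max-Ent IRL loop updates \(\bar w\) so the learner's feature expectation matches the expert's. Shapes, step size, and iteration counts are placeholder assumptions, not the lecture's implementation.

```python
# Soft value iteration + Max-Ent IRL gradient loop (tabular sketch).
import numpy as np

def soft_value_iteration(P, r, gamma=0.99, iters=500):
    """P: (S, A, S) transition probs, r: (S, A) rewards. Returns soft-optimal policy (S, A)."""
    S, A, _ = P.shape
    V = np.zeros(S)
    for _ in range(iters):
        Q = r + gamma * (P @ V)                               # Q(s,a) = r(s,a) + gamma E[V(s')]
        m = Q.max(axis=1)
        V = m + np.log(np.exp(Q - m[:, None]).sum(axis=1))    # V(s) = log sum_a exp Q(s,a)
    return np.exp(Q - V[:, None])                             # pi(a|s) = exp(Q(s,a) - V(s))

def maxent_irl(P, Phi, mu0, expert_feat, gamma=0.99, lr=0.1, outer_iters=100, horizon=200):
    """Phi: (S, A, d) features, mu0: (S,) initial distribution, expert_feat: (d,) expert estimate."""
    S, A, d = Phi.shape
    w = np.zeros(d)
    for _ in range(outer_iters):
        r = Phi @ w                                 # linear reward r_w(s,a) = w^T phi(s,a)
        pi = soft_value_iteration(P, r, gamma)
        # Roll the occupancy forward to estimate the learner's feature expectation.
        d_s = mu0.copy()
        feat = np.zeros(d)
        for t in range(horizon):
            d_sa = d_s[:, None] * pi                # state-action occupancy at time t
            feat += (gamma ** t) * (d_sa[:, :, None] * Phi).sum(axis=(0, 1))
            d_s = (d_sa[:, :, None] * P).sum(axis=(0, 1))
        feat *= (1 - gamma)                         # normalized discounted feature expectation
        w += lr * (expert_feat - feat)              # gradient step: match feature expectations
    return w, pi
```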
Summary (figure): imitation learning turns a dataset of expert trajectories, with pairs \((x=s, y=a^*)\), into a policy \(\pi\) via supervised learning; the goal of inverse RL is to understand/predict behaviors.