Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9 -- don't plan on extensions
My office hours are now Wednesdays 4:15-6pm in Gates 416A
Final exam Monday 5/16 at 7pm
Review session in lecture 5/9
Supervised Learning
Policy
Dataset of expert trajectories
...
\(\pi(s) = a\)
\((x=s, y=a^*)\)
imitation
inverse RL
Goal: back out a reward function that generates the expert's behavior
Consistent policy via distribution matching
Consistent rewards: find \(r\) such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
Assumption: \(r\) linear in known features, \(r(s,a) = \theta_*^\top \varphi(s,a)\).
\(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \)
Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
same expected features!
Estimate from expert data: \( \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) \)
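A minimal sketch of this empirical estimate, assuming the expert demonstrations are stored as paired sequences of states and actions and `phi` is a user-supplied feature map returning a length-\(d\) vector (names are illustrative, not from the lecture):

```python
import numpy as np

def expert_feature_expectation(expert_states, expert_actions, phi):
    """Empirical estimate of E_{d^{pi*}_mu}[phi(s,a)] from N expert samples."""
    feats = [phi(s, a) for s, a in zip(expert_states, expert_actions)]
    return np.mean(feats, axis=0)   # (1/N) * sum_i phi(s_i^*, a_i^*)
```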
\(s=\) position
\(a\in \{\) north, south, east, west \(\}\)
\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix} \)
Detect location of important objects
Reason about current position
Entropy measures uncertainty
\(\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)\)
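A quick numerical check of this definition (a sketch; `P` is any array of probabilities summing to 1, and terms with \(P(x)=0\) contribute 0):

```python
import numpy as np

def entropy(P):
    """Ent(P) = -sum_x P(x) log P(x), using the convention 0 log 0 = 0."""
    P = np.asarray(P, dtype=float)
    nz = P > 0
    return -np.sum(P[nz] * np.log(P[nz]))

# The uniform distribution has the most uncertainty over a finite set:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~= 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~= 0.940 < log(4)
```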
Principle of maximum entropy:
Among all options, choose the one with the most uncertainty
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_w ~~f(x)+w\cdot g(x)\)
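For example (a small worked instance, not from the slides): take \(f(x)=x^2\) and \(g(x)=x-1\).
$$\min_x \max_w~ x^2 + w(x-1):\qquad \frac{\partial}{\partial x} = 2x+w=0,\quad \frac{\partial}{\partial w} = x-1=0 \;\implies\; x^*=1,\; w=-2.$$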
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)\)
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w\in\mathbb R^d} ~~f(x)+w^\top g(x)\)
Iterative Procedure
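A common way to instantiate this min-max for MaxEnt IRL is an alternating, dual-ascent-style procedure: given the current weights \(w_k\), compute the corresponding max-entropy policy (e.g. with the soft value iteration described next), then move \(w\) along the gap between expert and policy feature expectations. A minimal sketch under those assumptions; `soft_value_iteration`, `feature_expectation`, and `phi` are hypothetical callables supplied by the user:

```python
import numpy as np

def maxent_irl(expert_feat, phi, soft_value_iteration, feature_expectation,
               d, K=100, eta=0.1):
    """Sketch of the iterative MaxEnt IRL procedure.

    expert_feat: empirical estimate of E_{d^{pi*}_mu}[phi(s,a)], shape (d,)
    phi(s, a): feature map returning a length-d vector
    soft_value_iteration(reward_fn): max-entropy policy for that reward
    feature_expectation(policy): E_{d^pi_mu}[phi(s,a)] under that policy
    """
    w = np.zeros(d)
    policy = None
    for _ in range(K):
        reward_fn = lambda s, a, w=w: w @ phi(s, a)   # r_hat(s,a) = w^T phi(s,a)
        policy = soft_value_iteration(reward_fn)      # inner step: max-entropy policy
        gap = expert_feat - feature_expectation(policy)
        w = w + eta * gap                             # dual ascent: match expected features
    return w, policy
```

At (approximate) convergence the feature-matching constraint holds, and the final weights play the role of \(w_K\) in the learned reward \(\widehat r(s,a) = w_K^\top \varphi(s,a)\) below.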
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
Soft-VI
Reward function: \(\widehat r(s,a) = w_K^\top \varphi(s,a)\)
Log-likelihood of trajectory \(\tau=(s_0,a_0,s_1,a_1,\dots)\) under the max-entropy model \(P(\tau)\propto \exp\big(\sum_h \widehat r(s_h,a_h)\big)\): \(\log P(\tau) = \sum_h \widehat r(s_h,a_h) - \log Z\)
(figure: grid map with \(s\): position, \(a\): direction, showing the most likely path under \(\widehat r\))
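A minimal tabular sketch of soft value iteration, assuming a finite MDP given by a transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (the array names and the discounted, infinite-horizon setting are assumptions, not from the slides). The hard max over actions in the Bellman backup is replaced by a log-sum-exp, which yields the max-entropy (softmax) policy:

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.95, n_iters=500):
    """Tabular soft VI: softmax Bellman backup, returning pi(a|s) = exp(Q(s,a) - V(s)).

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)                # Q[s,a] = R[s,a] + gamma * E[V(s')]
        Qmax = Q.max(axis=1, keepdims=True)    # shift for numerical stability
        V = (Qmax + np.log(np.exp(Q - Qmax).sum(axis=1, keepdims=True))).ravel()
    pi = np.exp(Q - V[:, None])                # rows of pi sum to 1
    return pi, V, Q
```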