CS 4/5789: Introduction to Reinforcement Learning
Lecture 25
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
Announcements
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9 -- don't plan on extensions
My office hours are now Wednesdays 4:15-6pm in Gates 416A
Final exam Monday 5/16 at 7pm
Review session in lecture 5/9
Recap: IL vs. IRL
Supervised Learning
Policy
Dataset of expert trajectories
[Figure: expert demonstration data; each state-action pair is a supervised example \((x=s, y=a^*)\) used to fit the policy \(\pi\)]
imitation
inverse RL
Goal: back out a reward function that generates the expert's behavior
Consistent policy via distribution matching
Recap: Consistent Policies
Consistent rewards: find \(r\) such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
Assumption: \(r\) linear in known features, \(r(s,a) = \theta_*^\top \varphi(s,a)\).
\(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \)
Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
same expected features!
Estimate from expert data: \( \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) \)
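A minimal sketch of this Monte Carlo estimate (the data layout and function name are assumptions for illustration, not from the lecture):

```python
# Estimate the expert feature expectation E_{d^{pi*}_mu}[phi(s,a)]
# by averaging features over sampled expert state-action pairs.
import numpy as np

def estimate_feature_expectation(expert_pairs, phi):
    """expert_pairs: list of (s, a) pairs sampled from the expert's occupancy.
    phi: function mapping (s, a) to a feature vector (np.ndarray)."""
    feats = np.array([phi(s, a) for s, a in expert_pairs])
    return feats.mean(axis=0)  # (1/N) * sum_i phi(s_i*, a_i*)
```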

Example: Driving
\(s=\) [image of the driving scene]
\(a\in \{\) north, south, east, west \(\}\)
\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix} \)
Detect location of important objects
Reason about current position
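As a purely hypothetical sketch of how such features might be assembled from perception outputs (the detections format and the collision_prob estimator below are illustrative assumptions, not from the lecture):

```python
# Hypothetical feature map for the driving example: phi stacks the estimated
# probabilities of hitting each object type if the car moves in direction a.
import numpy as np

ACTIONS = ["north", "south", "east", "west"]
OBJECT_TYPES = ["building", "car"]  # extend with more object types as needed

def phi(detections, a, collision_prob):
    """detections: assumed output of a perception module for the current state s.
    collision_prob(detections, obj, a): assumed estimate of
    P{hit obj | move in direction a}."""
    return np.array([collision_prob(detections, obj, a) for obj in OBJECT_TYPES])
```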
Recap: Entropy
Entropy measures uncertainty
\(\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)\)
Principle of maximum entropy:
Among all options, choose the one with the most uncertainty
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
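For intuition, a quick numerical check of the entropy formula (this example is mine, not from the slides): over four options, the uniform distribution attains the maximum entropy \(\log 4\), while a nearly deterministic distribution has entropy close to zero.

```python
import numpy as np

def entropy(p):
    """Ent(P) = -sum_x P(x) log P(x), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    return -np.sum(p[p > 0] * np.log(p[p > 0]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~ 1.386, the maximum
print(entropy([0.97, 0.01, 0.01, 0.01]))  # ~ 0.168, nearly deterministic
```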
Recap: Constrained Optimization
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_w ~~f(x)+w\cdot g(x)\)
Iterative Procedure
- For \(t=0,\dots,T-1\):
- Best response: \(x_t = \arg\min_x f(x) + w_t g(x)\)
- Gradient ascent: \(w_{t+1} = w_t + \eta g(x_t)\)
- Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
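A toy numerical instance of this procedure (the objective, constraint, and step size are my own, not from the lecture): minimize \(f(x)=x^2\) subject to \(g(x)=x-1=0\), whose solution is \(x^*=1\).

```python
# Best response / gradient ascent on the Lagrangian of
#   min x^2  s.t.  x - 1 = 0.
# Best response: x_t = argmin_x x^2 + w_t (x - 1)  =>  x_t = -w_t / 2.
import numpy as np

eta, T = 0.5, 200
w = 0.0
xs = []
for t in range(T):
    x_t = -w / 2.0            # best response to the current multiplier
    w = w + eta * (x_t - 1)   # gradient ascent on the multiplier, using g(x_t)
    xs.append(x_t)

print(np.mean(xs))  # the averaged iterate approaches the constrained optimum x* = 1
```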
Recap: Constrained Optimization
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)\)
Recap: Constrained Optimization
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w\in\mathbb R^d} ~~f(x)+w^\top g(x)\)
Iterative Procedure
- For \(t=0,\dots,T-1\):
- Best response: \(x_t = \arg\min_x f(x) + w_t^\top g(x)\)
- Gradient ascent: \(w_{t+1} = w_t + \eta g(x_t)\)
- Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
Agenda
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
Max-Ent IRL
- For \(k=0,\dots,K-1\):
- \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
- \(w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])\)
- Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
Soft-VI
- Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
- For \(h=H-1,\dots, 0\):
- \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
- \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
- \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
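A minimal tabular sketch of Soft-VI (the array layout, transitions P[s, a, s'] and rewards r[s, a], is my own assumption, not from the lecture):

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(r, P, H):
    """r: (S, A) reward table, P: (S, A, S) transition tensor, H: horizon."""
    S, A = r.shape
    V = np.zeros((H + 1, S))           # V_H^* = 0
    pi = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q = r + P @ V[h + 1]           # Q_h(s,a) = r(s,a) + E_{s'~P}[V_{h+1}(s')]
        pi[h] = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))  # softmax policy
        V[h] = logsumexp(Q, axis=1)    # V_h(s) = log sum_a exp(Q_h(s,a))
    return pi, V
```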
Inverse RL
Reward function: \(\widehat r(s,a) = w_K^\top \varphi(s,a)\)
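Combining the two pieces, here is a sketch of the Max-Ent IRL loop from earlier in the lecture. It assumes the soft_value_iteration sketch above, a feature tensor phi[s, a, :], an expert feature-expectation estimate mu_expert, the initial state distribution mu0, and known dynamics P; these names and shapes are illustrative assumptions.

```python
import numpy as np

def feature_expectation(pi, P, mu0, phi, H):
    """E_{d^pi_mu}[phi(s,a)] for a time-varying tabular policy pi[h, s, a]."""
    d_s = mu0.copy()                              # state distribution at step h
    total = np.zeros(phi.shape[-1])
    for h in range(H):
        d_sa = d_s[:, None] * pi[h]               # joint distribution over (s, a)
        total += np.einsum('sa,saf->f', d_sa, phi)
        d_s = np.einsum('sa,sap->p', d_sa, P)     # push forward through dynamics
    return total / H                              # average over the horizon

def max_ent_irl(mu_expert, phi, P, mu0, H, K=100, eta=0.1):
    w = np.zeros(phi.shape[-1])
    policies = []
    for k in range(K):
        pi_k, _ = soft_value_iteration(phi @ w, P, H)    # reward w_k^T phi(s,a)
        mu_k = feature_expectation(pi_k, P, mu0, phi, H)
        w = w + eta * (mu_expert - mu_k)                 # dual gradient ascent
        policies.append(pi_k)
    return policies, w    # pi_bar = uniform mixture over pi^0, ..., pi^{K-1}
```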
Log-Likelihood of trajectory:
- Given \(\tau = (s_0,a_0,\dots s_{H-1}, a_{H-1})\)
- How likely is the expert to take this trajectory?
- \(\log \rho^{\bar \pi}(\tau) = \sum_{h=0}^{H-1} \left[ \log P(s_{h+1}|s_h, a_h) + \log \bar\pi(a_h|s_h)\right]\)
[Figure: gridworld driving example with \(s\): position, \(a\): direction, highlighting the most likely path under the learned policy]
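A small sketch of this log-likelihood computation for a tabular, time-varying policy (the array conventions match the earlier sketches and are assumptions, not from the lecture):

```python
import numpy as np

def traj_log_likelihood(traj, pi, P):
    """traj: list of (s_h, a_h) pairs for h = 0, ..., H-1.
    pi: (H, S, A) policy table, P: (S, A, S) transition tensor."""
    ll = 0.0
    for h, (s, a) in enumerate(traj):
        ll += np.log(pi[h, s, a])             # log pi(a_h | s_h)
        if h + 1 < len(traj):
            s_next = traj[h + 1][0]
            ll += np.log(P[s, a, s_next])     # log P(s_{h+1} | s_h, a_h)
    return ll
```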