CS 4/5789: Introduction to Reinforcement Learning
Lecture 24: Inverse Reinforcement Learning
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due weekly on Mondays
- PSet 8 (final one!) due Monday
- Next week: midterm corrections
- PA 4 due next Wednesday (May 3)
- Final exam is Saturday 5/13 at 2pm
- Length: 2 hours
Agenda
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Recap: Imitation Learning
Expert Demonstrations → Supervised ML Algorithm (e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks) → Policy \(\pi\) (maps states to actions)

Recap: BC/DAgger
[Diagram] The DAgger loop: a supervised learning algorithm fits a policy \(\pi\) to the dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\); executing \(\pi\) produces states \(s_0, s_1, s_2,\dots\); the expert is queried for labels \(\pi^*(s_0), \pi^*(s_1),\dots\); the labeled pairs \((x_i = s_i, y_i = \pi^*(s_i))\) are aggregated into the dataset and the loop repeats.
Recap: BC vs. DAgger
Supervised learning guarantee
\(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
Online learning guarantee
\(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)
Agenda
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Inverse RL: Motivation
- Like previous lecture, learning from expert demonstrations
- Unlike previous lectures, focus on learning reward rather than learning policy
- Motivations:
- Scientific inquiry - modelling human or animal behavior
- Imitation via reward function
- reward may be more succinct & transferable
- Modelling other agents in multi-agent setting (adversarial or cooperative)
Inverse RL: Setting
- Finite horizon MDP $$\mathcal M = \{\mathcal S,\mathcal A, P, r, H, \mu\} $$
- Transition \(P\) known but reward \(r\) unknown and signal \(r_t\) unobserved
- State/state-action distributions:
- \(d_{\mu}^{\pi}(s) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s)\) average prob. of state
- \(d_{\mu}^{\pi}(s,a) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s,a)\) average prob. of state, action
- Observe traces of expert policy $$\mathcal D=\{(s_i^\star, a_i^\star)\}_{i=1}^N \sim d_{\mu}^{\pi_\star}$$
- Assumption: reward function is linear in known features: $$r(s,a) = \theta_\star^\top \varphi(s,a)$$
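As a concrete illustration (not from the lecture), here is a minimal sketch of how the average state-action distribution \(d^\pi_\mu\) defined above could be computed for a small tabular MDP; the array shapes and function name are assumptions for the example.

```python
import numpy as np

def avg_state_action_dist(P, pi, mu, H):
    """Average state-action distribution d^pi_mu(s,a) over horizon H.

    P  : (S, A, S) array, P[s, a, s'] = transition probability
    pi : (H, S, A) array, pi[h, s, a] = probability of action a in state s at step h
    mu : (S,) initial state distribution
    """
    S, A, _ = P.shape
    d = np.zeros((S, A))
    d_state = mu.copy()                          # d^pi_{mu,0}(s)
    for h in range(H):
        d_sa = d_state[:, None] * pi[h]          # d^pi_{mu,h}(s,a)
        d += d_sa / H                            # accumulate the average over steps
        d_state = np.einsum("sa,sap->p", d_sa, P)  # propagate to step h+1
    return d
```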

Example: Driving
\(s=\) [scene image]
\(a\in \{\) north, south, east, west \(\}\)
\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix} \)
Detect location of important objects
Reason about current position
(Kitani et al., 2012)
\(\theta_\star\) encodes cost/benefit of outcomes
Consistency principle
- Consistent rewards: find \(r\) such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
- Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
- Consistent policies imply consistent rewards: if the feature expectations match, then \(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \) holds (with equality) for the linear reward \(r(s,a) = \theta_*^\top \varphi(s,a)\)
- Estimate the feature distribution from expert data: $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) $$
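A minimal sketch of the empirical feature estimate above, assuming the expert data is given as a list of (state, action) pairs together with a feature map; the names are illustrative.

```python
import numpy as np

def expert_feature_expectation(demos, phi):
    """Empirical estimate of E_{d^{pi*}_mu}[phi(s,a)] from expert state-action pairs.

    demos : list of (s, a) pairs sampled from the expert's state-action distribution
    phi   : feature map, phi(s, a) -> np.ndarray of dimension d
    """
    feats = np.array([phi(s, a) for s, a in demos])
    return feats.mean(axis=0)        # average feature vector over the N samples
```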
Ambiguity problem
- Our goal is to find a policy \(\pi\) that matches the feature distribution $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
- Depending on the expert policy and the definition of the features, more than one policy may satisfy this
- How should we choose between policies?
- Occam's razor: choose the policy which encodes the least information
Agenda
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Entropy
- The mathematical concept of entropy allows us to quantify the "amount of information"
- Definition: The entropy of a distribution \(P\in\Delta(\mathcal X) \) is $$\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)$$
- Fact: since \(P(x)\in[0,1]\), entropy is non-negative
- Entropy is a measure of uncertainty: distributions with higher entropy are more uncertain, while lower entropy means less uncertainty
Entropy Examples
- Consider distributions over \(A\) actions, \(\mathcal A = \{1,\dots,A\}\)
- Uniform distribution \(U(a) = \frac{1}{A}\):
- \(\mathsf{Ent}(U) = -\sum_{a\in\mathcal A} \frac{1}{A} \log \frac{1}{A} = \log A\)
- Deterministic: action \(a=1\) with probability \(1\):
- \(\mathsf{Ent}(P) = -1\cdot \log 1 =0\)
- Uniform distribution over \(\{1,2\}\), \(U_1(a) = \frac{1}{2}\mathbf 1\{a\in\{1,2\}\}\):
- \(\mathsf{Ent}(U_1) = -\sum_{a=1}^2 \frac{1}{2} \log\frac{1}{2}=\log 2\)
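The examples above can be checked numerically; a small sketch, using the natural log as in the definition:

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution, with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

A = 4
print(entropy(np.full(A, 1 / A)))        # uniform over A actions: log(A) ≈ 1.386
print(entropy([1.0, 0.0, 0.0, 0.0]))     # deterministic action: 0
print(entropy([0.5, 0.5, 0.0, 0.0]))     # uniform over {1, 2}: log(2) ≈ 0.693
```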
Max Entropy Principle
- Principle: among distributions consistent with constraints (e.g. observed data, mean, variance) choose the one with the most uncertainty, i.e. the highest entropy
- Specifically, we define $$\mathsf{Ent}(\pi) = \mathbb E_{s\sim d^{\pi}_\mu}[\mathsf{Ent}(\pi(s))] = \mathbb E_{s\sim d^{\pi}_\mu}[\mathbb E_{a\sim\pi(s)}[-\log \pi(a|s)]] = \mathbb E_{s,a\sim d^{\pi}_\mu}[-\log \pi(a|s)]$$
- This gives the constrained program: maximize \(\mathsf{Ent}(\pi)\) s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
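For a stationary tabular policy, the policy entropy \(\mathsf{Ent}(\pi)\) defined above could be computed as follows; a sketch, where the array shapes are assumptions for the example.

```python
import numpy as np

def policy_entropy(d_sa, pi):
    """Ent(pi) = E_{(s,a) ~ d^pi_mu}[-log pi(a|s)] for a stationary policy.

    d_sa : (S, A) array, average state-action distribution d^pi_mu(s, a)
    pi   : (S, A) array, pi[s, a] = pi(a|s)
    """
    nz = d_sa > 0                 # convention: 0 * log(0) = 0
    return -np.sum(d_sa[nz] * np.log(pi[nz]))
```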
Max Entropy Principle
- Principle: among distributions consistent with constraints (e.g. observed data, mean, variance) choose the one with the most uncertainty, i.e. the highest entropy
maximize \(\mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] = 0\)
Constrained Optimization
- General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
- An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
- Optimality condition: $$\nabla_x[f(x)+w\cdot g(x) ] = \nabla_w[f(x)+w\cdot g(x) ] = 0 $$
- Exercise: \(f(x) = x_1+x_2\) and \(g(x) = x_1^2+x_2^2\)
Example: High Entropy Action
- Weighted objective of reward and entropy: $$\max_{P\in\Delta(\mathcal A)} \mathbb E_{a\sim P}[\mu_a] + \tau \mathsf{Ent}(P)$$
- The condition that \(P\in\Delta(\mathcal A)\) is actually a constraint: $$\max_{P(1),\dots, P(A)} \sum_{a\in\mathcal A} P(a)\mu_a + \tau\,\mathsf{Ent}(P)\quad\text{s.t.}\quad \sum_a P(a) -1=0$$
- Lagrange formulation and optimality conditions:
- \(\nabla_{P}\left[\sum_{a=1}^A P(a)\mu_a - \tau P(a)\log P(a) + w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\) gives \(P_\star(a) = \exp(\mu_a/\tau)\cdot \exp(w/\tau-1) \)
- \(\nabla_w\left[\sum_{a=1}^A P(a)\mu_a - \tau P(a)\log P(a) + w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\) gives \(\sum_{a=1}^A P(a)=1\), so \( \exp(w/\tau-1) = \frac{1}{\sum_{a=1}^A \exp(\mu_a /\tau)}\)
- Combining, \(P_\star(a) = \frac{\exp(\mu_a/\tau)}{\sum_{a'=1}^A \exp(\mu_{a'}/\tau)}\), i.e. a softmax over \(\mu\)
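A quick numerical check of the closed form derived above; a sketch, where the reward vector, temperature, and comparison distribution are arbitrary illustrative choices.

```python
import numpy as np

def max_ent_action_dist(mu, tau):
    """Closed-form solution P*(a) ∝ exp(mu_a / tau) from the Lagrangian conditions."""
    z = np.exp(mu / tau)
    return z / z.sum()

def objective(P, mu, tau):
    """Weighted objective: expected reward plus tau times entropy."""
    nz = P > 0
    return P @ mu - tau * np.sum(P[nz] * np.log(P[nz]))

mu = np.array([1.0, 0.5, -0.2])
tau = 0.5
P_star = max_ent_action_dist(mu, tau)
# Any other distribution on the simplex should score no higher:
P_alt = np.array([0.6, 0.3, 0.1])
print(objective(P_star, mu, tau) >= objective(P_alt, mu, tau))   # True
```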
Constrained Optimization
- General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
- An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
- Iterative algorithm for constrained optimization
- For \(t=0,\dots,T-1\):
- Best response: \(x_t = \arg\max_x~ f(x) + w_t\, g(x)\)
- Gradient descent: \(w_{t+1} = w_t - \eta\, g(x_t)\)
- Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
Constrained Optimization
- General form for constrained optimization $$x^* =\arg \max_x~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=\dots=g_d(x)=0$$
- An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)$$
Constrained Optimization
- General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
- An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w} ~~f(x)+w^\top g(x)$$
- Iterative algorithm for constrained optimization
- For \(t=0,\dots,T-1\):
- Best response: \(x_t = \arg\max_x~ f(x) + w_t^\top g(x)\)
- Gradient descent: \(w_{t+1} = w_t - \eta\, g(x_t)\)
- Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
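A sketch of this iterative algorithm on a toy instance where the best response has a closed form; the instance (\(f(x) = -\|x\|^2\), \(g(x) = x_1 + x_2 - 1\)), step size, and iteration count are illustrative assumptions, not from the lecture.

```python
import numpy as np

# Toy instance: maximize f(x) = -||x||^2 subject to g(x) = x1 + x2 - 1 = 0.
# Best response has the closed form x = (w/2) * [1, 1] (from grad_x [f(x) + w*g(x)] = 0).

def best_response(w):
    return (w / 2) * np.ones(2)

def g(x):
    return x[0] + x[1] - 1

w, eta, T = 0.0, 0.5, 200
xs = []
for t in range(T):
    x_t = best_response(w)       # maximize f(x) + w * g(x) over x
    w = w - eta * g(x_t)         # gradient descent step on the multiplier
    xs.append(x_t)

x_bar = np.mean(xs, axis=0)      # average of the iterates
print(x_bar)                     # ≈ [0.5, 0.5], the constrained optimum
```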
Agenda
1. Recap: Imitation Learning
2. Inverse RL Setting
3. Entropy & Constrained Optimization
4. MaxEnt IRL & Soft VI
Max-Ent IRL
Algorithm: Max-Ent IRL
- For \(k=0,\dots,K-1\):
- \(\pi^k =\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] \)
- \(w_{k+1} = w_k - \eta (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
- Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
- Use iterative constrained optimization algorithm on our maximum entropy IRL formulation: $$\max \mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\quad\text{s.t.}\quad \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]- \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = 0$$
Soft-VI
- Best response step resembles MDP with reward \(r(s,a) = w_k^\top \varphi (s,a)\) as well as a policy-dependent term $$\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] $$
- We can solve this with DP!
- Initialize \(V_H^*(s) = 0\), then for \(h=H-1,\dots,0\):
- \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
- \(\pi_h^*(\cdot|s) = \arg\max_{\pi(s)\in\Delta(\mathcal A)} \mathbb E_{a\sim \pi(s)}[Q_h^*(s,a)] + \mathsf{Ent}(\pi(s))\)
- \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
- \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
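A minimal tabular sketch mirroring the recursion above, assuming a reward array r of shape (S, A) and transition array P of shape (S, A, S); names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def soft_vi(r, P, H):
    """Soft value iteration for a tabular finite-horizon MDP.

    r : (S, A) reward array, e.g. r[s, a] = w_k @ phi[s, a]
    P : (S, A, S) transition array
    H : horizon
    Returns pi of shape (H, S, A) with pi[h, s, a] ∝ exp(Q*_h(s, a)).
    """
    S, A, _ = P.shape
    V = np.zeros(S)                       # V*_H(s) = 0
    pi = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q = r + P @ V                     # Q*_h(s,a) = r(s,a) + E_{s'~P}[V*_{h+1}(s')]
        pi[h] = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))  # softmax over actions
        V = logsumexp(Q, axis=1)          # V*_h(s) = log sum_a exp(Q*_h(s,a))
    return pi
```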
Max-Ent IRL
- For \(k=0,\dots,K-1\):
- \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
- \(w_{k+1} = w_k - \eta (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
- Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
- Subroutine Soft-VI:
- Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
- For \(h=H-1,\dots,0\):
- \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
- \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
- \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
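Putting the pieces together, a sketch of the Max-Ent IRL loop; it assumes the soft_vi and avg_state_action_dist sketches from earlier and a feature array phi of shape (S, A, d), all of which are illustrative names rather than a reference implementation.

```python
import numpy as np

def max_ent_irl(phi, P, mu, H, expert_feat, K=100, eta=0.1):
    """Max-Ent IRL: alternate Soft VI (best response) with a gradient step on w.

    phi         : (S, A, d) feature array
    expert_feat : (d,) empirical expert feature expectation
    Assumes soft_vi and avg_state_action_dist as sketched earlier.
    """
    d = phi.shape[-1]
    w = np.zeros(d)
    policies = []
    for k in range(K):
        r = phi @ w                                    # r(s,a) = w_k^T phi(s,a)
        pi_k = soft_vi(r, P, H)                        # entropy-regularized best response
        d_k = avg_state_action_dist(P, pi_k, mu, H)    # d^{pi_k}_mu(s,a)
        feat_k = np.einsum("sa,sad->d", d_k, phi)      # E_{d^{pi_k}_mu}[phi(s,a)]
        w = w - eta * (feat_k - expert_feat)           # gradient step on the multiplier
        policies.append(pi_k)
    return policies, w      # return the uniform mixture over pi^0, ..., pi^{K-1}
```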
Modelling Behavior with IRL
Reward function: \(\widehat r(s,a) = \bar w^\top \varphi(s,a)\)
Log-Likelihood of trajectory:
- Given \(\tau = (s_0,a_0,\dots s_{H-1}, a_{H-1})\)
- How likely is the expert to take this trajectory?
- \(\log(\rho^{\bar \pi}(\tau)) = \sum_{h=0}^{H-1} \big[\log P(s_{h+1}\mid s_h, a_h) + \log \bar\pi(a_h\mid s_h)\big]\)


[Figure: predicted most likely path; \(s\): position, \(a\): direction]
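A sketch of the trajectory log-likelihood computation, assuming index-valued states and actions and the tabular policy/transition arrays used in the earlier sketches.

```python
import numpy as np

def traj_log_likelihood(states, actions, pi, P):
    """Sum of log pi(a_h|s_h) and log P(s_{h+1}|s_h, a_h) along a trajectory.

    states  : [s_0, ..., s_{H-1}] state indices
    actions : [a_0, ..., a_{H-1}] action indices
    pi      : (H, S, A) policy array
    P       : (S, A, S) transition array
    """
    ll = 0.0
    for h in range(len(actions)):
        ll += np.log(pi[h, states[h], actions[h]])          # policy term
        if h + 1 < len(states):                              # transition terms within the trace
            ll += np.log(P[states[h], actions[h], states[h + 1]])
    return ll
```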
Recap: IL vs. IRL
Both approaches learn from a dataset of expert trajectories.
- Imitation learning: supervised learning on pairs \((x=s, y=a^*)\) produces a policy \(\pi\)
- Inverse RL: learn a reward; goal is to understand/predict behaviors
Recap
- PSet due Monday
- Inverse RL
- Maximum Entropy Principle
- Constrained Optimization
- Soft VI
- Next week: case study and societal implications