CS 4/5789: Introduction to Reinforcement Learning

Lecture 24: Inverse Reinforcement Learning

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 8 (final one!) due Monday
      • Next week: midterm corrections
    • PA 4 due next Wednesday (May 3)
  • Final exam is Saturday 5/13 at 2pm
    • Length: 2 hours

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Recap: Imitation Learning

Expert Demonstrations → Supervised ML Algorithm → Policy \(\pi\)

  • Supervised ML algorithm: e.g. SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks
  • Policy \(\pi\): maps states to actions

Recap: BC/DAgger

The BC/DAgger loop:

  • Dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) → Supervised Learning → Policy \(\pi\)
  • Execute \(\pi\) to collect states \(s_0, s_1, s_2,\dots\)
  • Query Expert: \(\pi^*(s_0), \pi^*(s_1),\dots\)
  • Aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\)

Recap: BC vs. DAgger

  • Behavior Cloning
    • Supervised learning guarantee: \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
    • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
  • DAgger
    • Online learning guarantee: \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
    • Performance guarantee: \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Inverse RL: Motivation

  • As in the previous lecture, we learn from expert demonstrations
  • Unlike previous lectures, the focus is on learning the reward rather than the policy
  • Motivations:
    • Scientific inquiry - modelling human or animal behavior
    • Imitation via a reward function
      • the reward may be more succinct & transferable
    • Modelling other agents in a multi-agent setting (adversarial or cooperative)

Inverse RL: Setting

  • Finite horizon MDP $$\mathcal M =  \{\mathcal S,\mathcal A, P, r, H, \mu\} $$
  • Transition \(P\) known but reward \(r\) unknown and signal \(r_t\) unobserved
  • State/state-action distributions:
    • \(d_{\mu}^{\pi}(s) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s)\) average prob. of state
    • \(d_{\mu}^{\pi}(s,a) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s,a)\) average prob. of state, action
  • Observe traces of expert policy $$\mathcal D=\{(s_i^\star, a_i^\star)\}_{i=1}^N \sim d_{\mu}^{\pi^\star}$$
  • Assumption: reward function is linear in known features: $$r(s,a) = \theta_\star^\top  \varphi(s,a)$$

Example: Driving

\(s =\) position in the scene, \(a\in \{\) north, south, east, west \(\}\)

\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix}  \)

  • Features are computed by detecting the locations of important objects and reasoning about the current position
  • \(\theta_\star\) encodes the cost/benefit of outcomes

(Kitani et al., 2012)

Consistency principle

  • Consistent rewards: find  \(r\)  such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
  • Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$

  • Consistent policies imply consistent rewards: for a linear reward, the consistency condition becomes \(\theta_\star^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_\star^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\), which holds (with equality) whenever the feature expectations match

  • Estimate the feature distribution from expert data: $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) $$
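
A minimal sketch of this empirical estimate (assuming tabular states and actions indexed by integers; the feature table `phi` and the demonstration indices below are synthetic stand-ins, not from the lecture):

    import numpy as np

    # Hypothetical tabular setup: features phi(s, a) stored as a (S, A, d) array
    rng = np.random.default_rng(0)
    S, A, d, N = 10, 4, 3, 500
    phi = rng.normal(size=(S, A, d))

    # Stand-in for expert demonstrations (s_i^*, a_i^*) drawn from d^{pi*}_mu
    expert_s = rng.integers(0, S, size=N)
    expert_a = rng.integers(0, A, size=N)

    # Empirical estimate of E_{d^{pi*}_mu}[phi(s, a)]
    phi_hat = phi[expert_s, expert_a].mean(axis=0)
    print(phi_hat.shape)  # (d,)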

Ambiguity problem

  • Our goal is to find a policy \(\pi\) that matches the feature distribution $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
  • Depending on expert policy and the definition of features, more than one policy may satisfy this

  • How should we choose between policies?

  • Occam's razor: choose the policy which encodes the least information

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Entropy

  • The mathematical concept of entropy allows us to quantify the "amount of information"

  • Definition: The entropy of a distribution \(P\in\Delta(\mathcal X) \) is $$\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)$$

  • Fact: since \(P(x)\in[0,1]\), entropy is non-negative

  • Entropy is a measure of uncertainty

    • Distributions with higher entropy are more uncertain, while lower entropy means less uncertainty

Entropy Examples

  • Consider distributions over \(A\) actions \(\mathcal A = \{1,...,A\}\)

  • Uniform distribution \(U(a) = \frac{1}{A}\)

    • \(\mathsf{Ent}(U) =  -\sum_{a\in\mathcal A} \frac{1}{A} \log \frac{1}{A} = \log A\)

  • Deterministic distribution: \(a=1\) with probability \(1\)

    • \(\mathsf{Ent}(P) = -1\cdot \log 1 =0\)

  • Uniform distribution over \(\{1,2\}\): \(U_1(a) = \frac{1}{2}\mathbf 1\{a\in\{1,2\}\}\)

    • \(\mathsf{Ent}(U_1) =  -\sum_{a=1}^2 \frac{1}{2} \log\frac{1}{2}=\log 2\)
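
A minimal numerical check of these three examples (the `entropy` helper below is just an illustration of the definition above, not part of the lecture):

    import numpy as np

    def entropy(p):
        """Ent(P) = -sum_x P(x) log P(x), skipping zero-probability entries."""
        p = np.asarray(p, dtype=float)
        nz = p > 0
        return -np.sum(p[nz] * np.log(p[nz]))

    A = 4
    print(np.isclose(entropy(np.full(A, 1 / A)), np.log(A)))   # uniform: log A
    print(np.isclose(entropy([1, 0, 0, 0]), 0.0))              # deterministic: 0
    print(np.isclose(entropy([0.5, 0.5, 0, 0]), np.log(2)))    # uniform on {1,2}: log 2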

Max Entropy Principle

  • Principle: among distributions consistent with constraints (e.g. observed data, mean, variance) choose the one with the most uncertainty, i.e. the highest entropy


     

  • Specifically, we define

    • \(\mathsf{Ent}(\pi) = \mathbb E_{s\sim d^{\pi}_\mu}[\mathsf{Ent}(\pi(s)) ] \)

      • \(= \mathbb E_{s\sim d^{\pi}_\mu}[\mathbb E_{a\sim\pi(s)}[-\log \pi(a|s)]] \)

      • \(= \mathbb E_{s,a\sim d^{\pi}_\mu}[-\log \pi(a|s)] \)

maximize    \(\mathsf{Ent}(\pi)\)

s.t.    \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)

Max Entropy Principle

  • Principle: among distributions consistent with constraints (e.g. observed data, mean, variance) choose the one with the most uncertainty, i.e. the highest entropy

     

maximize    \(\mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\)

s.t.    \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] = 0\)

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
  • Optimality condition: $$\nabla_x[f(x)+w\cdot g(x) ] = \nabla_w[f(x)+w\cdot g(x) ] = 0 $$
  • Exercise: \(f(x) = x_1+x_2\) and \(g(x) = x_1^2+x_2^2\)
Example: High Entropy Action

  • Weighted objective of reward and entropy: $$\max_{P\in\Delta(\mathcal A)} \mathbb E_{a\sim P}[\mu_a] + \tau \mathsf{Ent}(P)$$
  • The condition that \(P\in\Delta(\mathcal A)\) is actually a constraint $$\max_{P(1),\dots, P(A)} \sum_{a\in\mathcal A} P(a)\mu_a + \tau\,\mathsf{Ent}(P)\quad\text{s.t.}\quad \sum_a P(a) -1=0$$
  • Lagrange formulation and optimality condition:
    • \(\nabla_{P}\left[\sum_{a=1}^A P(a)\mu_a - \tau P(a)\log P(a) + w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\)
      • \(P_\star(a) = \exp(\mu_a/\tau)\cdot \exp(w/\tau-1) \)
    • \(\nabla_w\left[w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\)
      • \( \exp(w/\tau-1) = \frac{1}{\sum_{a=1}^A \exp(\mu_a /\tau)}\), i.e. \(P_\star(a) = \frac{\exp(\mu_a/\tau)}{\sum_{a'} \exp(\mu_{a'}/\tau)}\)
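
A quick numerical check of this closed form (a sketch; the values of `mu` and `tau` below are arbitrary): the softmax \(P_\star(a) \propto \exp(\mu_a/\tau)\) should score at least as well as any other feasible \(P\) on the weighted objective.

    import numpy as np

    mu = np.array([1.0, 0.5, -0.2])   # per-action values (illustrative)
    tau = 0.7                          # entropy weight (illustrative)

    def objective(p):
        # E_{a~P}[mu_a] + tau * Ent(P)
        nz = p > 0
        return p @ mu - tau * np.sum(p[nz] * np.log(p[nz]))

    # Closed-form maximizer from the Lagrangian: P*(a) = exp(mu_a/tau) / sum_a' exp(mu_a'/tau)
    p_star = np.exp(mu / tau)
    p_star /= p_star.sum()

    # Compare against random feasible distributions on the simplex
    rng = np.random.default_rng(0)
    for _ in range(1000):
        p = rng.dirichlet(np.ones(len(mu)))
        assert objective(p) <= objective(p_star) + 1e-9

    print(p_star)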

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
  • Iterative algorithm for constrained optimization
    • For \(t=0,\dots,T-1\):
      1. Best response: \(x_t = \arg\max_x f(x) + w_t g(x)\)
      2. Gradient descent: \(w_{t+1} = w_t - \eta g(x_t)\)
    • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)$$

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w} ~~f(x)+w^\top g(x)$$
  • Iterative algorithm for constrained optimization
    • For \(t=0,\dots,T-1\):
      1. Best response: \(x_t = \arg\max_x f(x) + w_t^\top g(x)\)
      2. Gradient descent: \(w_{t+1} = w_t - \eta g(x_t)\)
    • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
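
A sketch of this iteration on a toy instance (not from the lecture): maximize \(f(x) = -\|x\|^2\) subject to the vector constraint \(g(x) = Ax - b = 0\), where the best response has a closed form. `A`, `b`, and the step size are illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(2, 5))            # constraint matrix (illustrative)
    b = rng.normal(size=2)

    eta = 1.0 / np.linalg.norm(A, 2) ** 2  # conservative step size for this quadratic
    T = 5000
    w = np.zeros(2)
    xs = []
    for t in range(T):
        # 1. Best response: argmax_x -||x||^2 + w^T (A x - b)  =>  x = A^T w / 2
        x = A.T @ w / 2
        xs.append(x)
        # 2. Gradient descent on the dual variable w
        w = w - eta * (A @ x - b)

    x_bar = np.mean(xs, axis=0)                  # averaged iterate, as in the algorithm
    x_exact = A.T @ np.linalg.solve(A @ A.T, b)  # minimum-norm solution of A x = b
    print(np.linalg.norm(x_bar - x_exact))       # should be small
    print(np.linalg.norm(x - x_exact))           # last iterate, also small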

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Max-Ent IRL

  • Use the iterative constrained optimization algorithm on our maximum entropy IRL formulation: $$\max_\pi \mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\quad\text{s.t.}\quad \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]- \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = 0$$

Algorithm: Max-Ent IRL

  • For \(k=0,\dots,K-1\):
    1. \(\pi^k =\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] \)
    2. \(w_{k+1} = w_k - \eta  (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
  • Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots, \pi^{K-1})\)

Soft-VI

  • The best response step resembles an MDP with reward \(r(s,a) = w_k^\top \varphi (s,a)\), plus a policy-dependent entropy term: $$\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] $$
  • We can solve this with DP!
    • Initialize \(V_H^*(s) = 0\), then for \(h=H-1,\dots, 0\):
      • \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P(\cdot|s,a)}[V_{h+1}^*(s')]\)
      • \(\pi_h^*(\cdot|s)  = \arg\max_{p\in\Delta(\mathcal A)} \mathbb E_{a\sim p}[Q_h^*(s,a)] + \mathsf{Ent}(p)\)
        • \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\) (softmax, by the high-entropy action example with \(\tau=1\))
      • \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
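
A minimal tabular sketch of this recursion (the MDP below is randomly generated for illustration; the function name and sizes are my own):

    import numpy as np

    def soft_vi(P, r, H):
        """Soft value iteration for a tabular finite-horizon MDP.
        P: transitions, shape (S, A, S); r: rewards, shape (S, A).
        Returns per-step policies pi with pi[h][s, a] = pi_h*(a|s)."""
        S, A = r.shape
        V = np.zeros(S)                       # V_H* = 0
        pi = np.zeros((H, S, A))
        for h in reversed(range(H)):
            Q = r + P @ V                     # Q_h*(s,a) = r(s,a) + E_{s'}[V_{h+1}*(s')]
            m = Q.max(axis=1, keepdims=True)  # stabilize the log-sum-exp
            V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
            pi[h] = np.exp(Q - V[:, None])    # pi_h*(a|s) proportional to exp(Q_h*(s,a))
        return pi

    # Small random MDP for illustration
    rng = np.random.default_rng(0)
    S, A, H = 5, 3, 10
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] is a distribution over s'
    r = rng.normal(size=(S, A))
    pi = soft_vi(P, r, H)
    print(pi[0].sum(axis=1))                     # each row sums to 1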

Max-Ent IRL

  • For \(k=0,\dots,K-1\):
    1. \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
    2. \(w_{k+1} = w_k - \eta  (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
  • Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
Soft-VI

  • Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
  • For \(h=H-1,\dots, 0\):
    1. \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P(\cdot|s,a)}[V_{h+1}^*(s')]\)
    2. \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
    3. \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
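
Putting the pieces together, a self-contained sketch of the full loop (a synthetic illustration: the MDP, features, and \(\theta_\star\) are random, the "expert" feature expectations are computed exactly from the soft-optimal policy for \(\theta_\star\) rather than estimated from demonstrations, and `feature_expectation` is a helper of my own that uses the known transitions to compute \(\mathbb E_{d^\pi_\mu}[\varphi(s,a)]\) exactly):

    import numpy as np

    def soft_vi(P, r, H):
        # Soft value iteration (see the earlier sketch): pi_h*(a|s) proportional to exp(Q_h*(s,a))
        S, A = r.shape
        V = np.zeros(S)
        pi = np.zeros((H, S, A))
        for h in reversed(range(H)):
            Q = r + P @ V
            m = Q.max(axis=1, keepdims=True)
            V = (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()
            pi[h] = np.exp(Q - V[:, None])
        return pi

    def feature_expectation(P, mu0, pi, phi):
        """E_{d^pi_mu}[phi(s,a)], computed exactly by rolling the known dynamics forward."""
        H = pi.shape[0]
        d_s = mu0.copy()                           # state distribution at step h
        total = np.zeros(phi.shape[-1])
        for h in range(H):
            d_sa = d_s[:, None] * pi[h]            # joint distribution over (s, a) at step h
            total += (d_sa[:, :, None] * phi).sum(axis=(0, 1))
            d_s = np.einsum("sa,sat->t", d_sa, P)  # propagate one step through P
        return total / H                           # average over the horizon

    # Synthetic problem
    rng = np.random.default_rng(0)
    S, A, H, d = 5, 3, 10, 4
    P = rng.dirichlet(np.ones(S), size=(S, A))     # known transitions
    mu0 = rng.dirichlet(np.ones(S))                # initial state distribution
    phi = rng.normal(size=(S, A, d))               # known features
    theta_star = rng.normal(size=d)                # unknown "true" reward weights

    # Stand-in for the expert: feature expectations of the soft-optimal policy for theta_star
    phi_expert = feature_expectation(P, mu0, soft_vi(P, phi @ theta_star, H), phi)

    # Max-Ent IRL: best response via Soft-VI, then a gradient step on w
    w, eta, K = np.zeros(d), 0.01, 5000
    avg_phi = np.zeros(d)                  # feature expectation of the mixture pi_bar
    for k in range(K):
        pi_k = soft_vi(P, phi @ w, H)                    # step 1: pi^k = SoftVI(w_k^T phi)
        phi_k = feature_expectation(P, mu0, pi_k, phi)
        avg_phi += phi_k / K
        w -= eta * (phi_k - phi_expert)                  # step 2: dual gradient step

    # The algorithm returns pi_bar = Unif(pi^0, ..., pi^{K-1}); its feature
    # expectation equals the average of the per-iterate feature expectations.
    print(np.linalg.norm(phi_k - phi_expert))      # last iterate: feature-matching error
    print(np.linalg.norm(avg_phi - phi_expert))    # mixture pi_bar: decreases with K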

Modelling Behavior with IRL

Reward function: \(\widehat r(s,a) = \bar w^\top \varphi(s,a)\)

Log-Likelihood of trajectory:

  • Given \(\tau = (s_0,a_0,\dots, s_{H-1}, a_{H-1})\)
  • How likely is the expert to take this trajectory?
  • \(\log(\rho^{\bar \pi}(\tau)) = \sum_{h=0}^{H-1} \log \bar\pi(a_h|s_h) + \sum_{h=0}^{H-2} \log P(s_{h+1}|s_h, a_h)\) (see the sketch below)

(Figure: the most likely path in the driving example, where \(s\) is position and \(a\) is direction)
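
A small sketch of this computation (tabular; the random policy and transitions below stand in for \(\bar\pi\) and \(P\)):

    import numpy as np

    def trajectory_log_likelihood(traj, pi, P):
        """log rho(tau) for tau = [(s_0, a_0), ..., (s_{H-1}, a_{H-1})],
        with per-step policies pi[h][s, a] and transitions P[s, a, s']."""
        ll = 0.0
        for h, (s, a) in enumerate(traj):
            ll += np.log(pi[h][s, a])                  # log pi_bar(a_h | s_h)
            if h + 1 < len(traj):
                ll += np.log(P[s, a, traj[h + 1][0]])  # log P(s_{h+1} | s_h, a_h)
        return ll

    # Illustrative inputs
    rng = np.random.default_rng(0)
    S, A, H = 4, 2, 3
    P = rng.dirichlet(np.ones(S), size=(S, A))
    pi = rng.dirichlet(np.ones(A), size=(H, S))
    print(trajectory_log_likelihood([(0, 1), (2, 0), (3, 1)], pi, P))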

Recap: IL vs. IRL

  • Imitation learning: dataset of expert trajectories \((x=s, y=a^*)\) → supervised learning → policy \(\pi\)
  • Inverse RL: goal is to understand/predict behaviors

Recap

  • PSet due Monday

 

  • Inverse RL
  • Maximum Entropy Principle
  • Constrained Optimization
  • Soft VI

 

  • Next week: case study and societal implications
