CS 4/5789: Introduction to Reinforcement Learning

Lecture 24: Inverse Reinforcement Learning

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • 5789 Paper Reviews due weekly on Mondays
    • PSet 8 (final one!) due Monday
      • Next week: midterm corrections
    • PA 4 due next Wednesday (May 3)
  • Final exam is Saturday 5/13 at 2pm
    • Length: 2 hours

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Recap: Imitation Learning

Expert Demonstrations → Supervised ML Algorithm → Policy \(\pi\) (maps states to actions)

ex - SVM, Gaussian Process, Kernel Ridge Regression, Deep Networks

Recap: BC/DAgger

  • Dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) → Supervised Learning → Policy \(\pi\)
  • DAgger additionally loops: Execute \(\pi\) to collect states \(s_0, s_1, s_2,\dots\); Query Expert for labels \(\pi^*(s_0), \pi^*(s_1),\dots\); Aggregate the pairs \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset

Recap: BC vs. DAgger

BC: supervised learning guarantee

\(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)

Performance guarantee

\(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)

DAgger: online learning guarantee

\(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)

Performance guarantee

\(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Inverse RL: Motivation

  • Like previous lecture, learning from expert demonstrations
  • Unlike previous lectures, focus on learning reward rather than learning policy
  • Motivations:
    • Scientific inquiry - modelling human or animal behavior
    • Imitation via reward function
      • reward may be more succinct & transferable
    • Modelling other agents in multi-agent setting (adversarial or cooperative)

Inverse RL: Setting

  • Finite horizon MDP $$\mathcal M =  \{\mathcal S,\mathcal A, P, r, H, \mu\} $$
  • Transition \(P\) known but reward \(r\) unknown and signal \(r_t\) unobserved
  • State/state-action distributions:
    • \(d_{\mu}^{\pi}(s) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s)\) average prob. of state
    • \(d_{\mu}^{\pi}(s,a) = \frac{1}{H}\sum_{t=0}^{H-1} d_{\mu,t}^\pi(s,a)\) average prob. of state, action
  • Observe traces of expert policy $$\mathcal D=\{(s_i^\star, a_i^\star)\}_{i=1}^N \sim d_{\mu}^{\pi_\star}$$
  • Assumption: reward function is linear in known features: $$r(s,a) = \theta_\star^\top  \varphi(s,a)$$

Example: Driving

\(s=\) position in the scene (overhead image)

\(a\in \{\) north, south, east, west \(\}\)

\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix}  \)

Detect location of important objects

Reason about current position

(Kitani et al., 2012)

\(\theta_\star\) encodes cost/benefit of outcomes

Consistency principle

  • Consistent rewards: find  \(r\)  such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
  • Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$

  • Consistent policies imply consistent rewards: under the true reward, \(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \), and a policy that matches the expert's feature expectations attains this with equality

  • Estimate the feature distribution from expert data (a minimal sketch follows below): $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) $$
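As a concrete illustration of this estimation step, here is a minimal sketch (not part of the lecture) of averaging features over expert demonstrations; the feature map `phi` and the `demos` format are assumptions.

```python
import numpy as np

def expert_feature_expectation(demos, phi):
    """Estimate E_{d^{pi*}}[phi(s,a)] by averaging features over expert pairs.

    demos : list of (state, action) pairs sampled from the expert's
            average state-action distribution.
    phi   : callable mapping (state, action) -> feature vector (np.ndarray).
    """
    feats = np.array([phi(s, a) for s, a in demos])
    return feats.mean(axis=0)  # (1/N) * sum_i phi(s_i*, a_i*)
```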

Ambiguity problem

  • Our goal is to find a policy \(\pi\) that matches the feature distribution $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
  • Depending on expert policy and the definition of features, more than one policy may satisfy this

  • How should we choose between policies?

  • Occam's razor: choose the policy which encodes the least information

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Entropy

  • The mathematical concept of entropy allows us to quantify the "amount of information" in a distribution

  • Definition: The entropy of a distribution \(P\in\Delta(\mathcal X) \) is $$\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)$$

  • Fact: since \(P(x)\in[0,1]\), entropy is non-negative

  • Entropy is a measure of uncertainty

    • Distributions with higher entropy are more uncertain, while lower entropy means less uncertainty

Entropy Examples

  • Consider distributions over \(A\) actions \(\mathcal A = \{1,...,A\}\)

  • Uniform distribution \(U(a) = \frac{1}{A}\)

    • \(\mathsf{Ent}(U) =  -\sum_{a\in\mathcal A} \frac{1}{A} \log \frac{1}{A} = \log A\)

  • Deterministic action: \(a=1\) with probability \(1\)

    • \(\mathsf{Ent}(P) = -1\cdot \log 1 =0\)

  • Uniform distribution over \(\{1,2\}\): \(U_1(a) = \frac{1}{2}\mathbf 1\{a\in\{1,2\}\}\)

    • \(\mathsf{Ent}(U_1) =  -\sum_{a=1}^2 \frac{1}{2} \log\frac{1}{2}=\log 2\) (these values are checked numerically below)
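A minimal sketch (not from the lecture) checking the three examples numerically, using natural logarithms as in the slides:

```python
import numpy as np

def entropy(p):
    """Entropy of a discrete distribution p, with the convention 0*log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

A = 4
print(entropy(np.ones(A) / A), np.log(A))        # uniform over A actions: log A
print(entropy([1.0, 0.0, 0.0, 0.0]))             # deterministic action: 0
print(entropy([0.5, 0.5, 0.0, 0.0]), np.log(2))  # uniform over {1,2}: log 2
```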

Max Entropy Principle

  • Principle: among distributions consistent with constraints (e.g., observed data, mean, variance), choose the one with the most uncertainty, i.e. the highest entropy


     

  • Specifically, we define

    • \(\mathsf{Ent}(\pi) = \mathbb E_{s\sim d^{\pi}_\mu}[\mathsf{Ent}(\pi(s)) ] \)

      • \(= \mathbb E_{s\sim d^{\pi}_\mu}[\mathbb E_{a\sim\pi(s)}[-\log \pi(a|s)]] \)

      • \(= \mathbb E_{s,a\sim d^{\pi}_\mu}[-\log \pi(a|s)] \)

maximize    \(\mathsf{Ent}(\pi)\)

s.t.    \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)

Max Entropy Principle

  • Principle: among distributions consistent with constraints (e.g., observed data, mean, variance), choose the one with the most uncertainty, i.e. the highest entropy

     

maximize    \(\mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\)

s.t.    \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] = 0\)

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
  • Optimality condition: $$\nabla_x[f(x)+w\cdot g(x) ] = \nabla_w[f(x)+w\cdot g(x) ] = 0 $$
  • Exercise: \(f(x) = x_1+x_2\) and \(g(x) = x_1^2+x_2^2-1\)
Example: High Entropy Action

  • Weighted objective of reward and entropy (a numerical check follows below): $$\max_{P\in\Delta(\mathcal A)} \mathbb E_{a\sim P}[\mu_a] + \tau\, \mathsf{Ent}(P)$$
  • The condition that \(P\in\Delta(\mathcal A)\) is actually a constraint: $$\max_{P(1),\dots, P(A)} \sum_{a\in\mathcal A} P(a)\mu_a + \tau\,\mathsf{Ent}(P)\quad\text{s.t.}\quad \sum_a P(a) -1=0$$
  • Lagrange formulation and optimality conditions:
    • \(\nabla_{P}\left[\sum_{a=1}^A \big(P(a)\mu_a - \tau P(a)\log P(a)\big) + w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\)
      • \(P_\star(a) = \exp(\mu_a/\tau)\cdot \exp(w/\tau-1) \)
    • \(\nabla_w\left[w\left(\sum_{a=1}^A P(a)-1\right)\right]=0\)
      • \( \exp(w/\tau-1) = \frac{1}{\sum_{a=1}^A \exp(\mu_a /\tau)}\)
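A quick numerical check of this derivation (not from the lecture): the maximizer should be the softmax \(P_\star(a) \propto \exp(\mu_a/\tau)\). The values of \(\mu\) and \(\tau\) below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, tau = np.array([1.0, 0.5, -0.2]), 0.5   # hypothetical action rewards and temperature

def objective(p):
    # E_{a~P}[mu_a] + tau * Ent(P), with the convention 0*log(0) = 0
    nz = p > 0
    return p @ mu - tau * np.sum(p[nz] * np.log(p[nz]))

p_star = np.exp(mu / tau) / np.sum(np.exp(mu / tau))  # softmax candidate

# The softmax should score at least as high as any other distribution on the simplex.
for _ in range(1000):
    p = rng.dirichlet(np.ones(len(mu)))
    assert objective(p_star) >= objective(p) - 1e-9
print(p_star, objective(p_star))
```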

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_w ~~f(x)+w\cdot g(x)$$
  • Iterative algorithm for constrained optimization
    • For \(t=0,\dots,T-1\):
      1. Best response: \(x_t = \arg\max f(x) + w_t g(x)\)
      2. Gradient descent: \(w_{t+1} = w_t - \eta\, g(x_t)\)
    • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)$$

Constrained Optimization

  • General form for constrained optimization $$x^* =\arg \max~~f(x)~~\text{s.t.}~~g(x)=0$$
  • An equivalent formulation (Lagrange) $$ x^* =\arg \max_x \min_{w} ~~f(x)+w^\top g(x)$$
  • Iterative algorithm for constrained optimization
    • For \(t=0,\dots,T-1\):
      1. Best response: \(x_t = \arg\max f(x) + w_t^\top g(x)\)
      2. Gradient descent: \(w_{t+1} = w_t - \eta\, g(x_t)\)
    • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\) (a toy sketch of this scheme follows below)
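A minimal sketch of this primal-dual scheme on a made-up toy problem: maximize \(f(x) = -\|x\|^2\) subject to \(g(x) = x_1 + x_2 - 1 = 0\), for which the best-response step has a closed form.

```python
import numpy as np

def g(x):
    # toy constraint g(x) = x1 + x2 - 1 (hypothetical example)
    return x[0] + x[1] - 1.0

w, eta, T = 0.0, 0.1, 500
xs = []
for t in range(T):
    # 1. Best response: argmax_x -||x||^2 + w*(x1 + x2 - 1)  =>  x_i = w/2
    x_t = np.array([w / 2.0, w / 2.0])
    xs.append(x_t)
    # 2. Gradient descent on the multiplier, using the constraint violation
    w = w - eta * g(x_t)

x_bar = np.mean(xs, axis=0)
print(x_bar)   # approaches the constrained optimum (0.5, 0.5)
```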

Agenda

1. Recap: Imitation Learning

2. Inverse RL Setting

3. Entropy & Constrained Optimization

4. MaxEnt IRL & Soft VI

Max-Ent IRL

  • Use the iterative constrained optimization algorithm on our maximum entropy IRL formulation: $$\max_\pi~ \mathbb E_{ d^{\pi}_\mu}[-\log \pi(a|s)]\quad\text{s.t.}\quad \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]- \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = 0$$

Algorithm: Max-Ent IRL

  • For \(k=0,\dots,K-1\):
    1. \(\pi^k =\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] \)
    2. \(w_{k+1} = w_k - \eta  (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
  • Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots, \pi^{K-1})\)

Soft-VI

  • The best response step resembles an MDP with reward \(r(s,a) = w_k^\top \varphi (s,a)\), plus a policy-dependent entropy term: $$\arg\max_\pi \mathbb E_{ d^{\pi}_\mu}[w_k^\top \varphi (s,a) -\log \pi(a|s)] $$
  • We can solve this with dynamic programming (a tabular sketch follows below)!
    • Initialize \(V_H^*(s) = 0\), then for \(h=H-1,\dots, 0\):
      • \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
      • \(\pi_h^*(\cdot|s)  = \arg\max_{\pi(s)\in\Delta(\mathcal A)} \mathbb E_{a\sim \pi(s)}[Q_h^*(s,a)] + \mathsf{Ent}(\pi(s))\)
        • \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
      • \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
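A tabular sketch of this soft value iteration recursion, assuming a known transition array `P[s, a, s']` and reward table `r[s, a]` (hypothetical inputs, not from the lecture):

```python
import numpy as np
from scipy.special import logsumexp

def soft_vi(P, r, H):
    """Finite-horizon soft value iteration.

    P : (S, A, S) transition probabilities, r : (S, A) rewards, H : horizon.
    Returns time-indexed policies pis[h][s, a] with pi_h(a|s) proportional to exp(Q_h(s,a)).
    """
    S, A, _ = P.shape
    V = np.zeros(S)                      # V_H = 0
    pis = [None] * H
    for h in range(H - 1, -1, -1):
        Q = r + P @ V                    # Q_h(s,a) = r(s,a) + E_{s'~P}[V_{h+1}(s')]
        V = logsumexp(Q, axis=1)         # V_h(s) = log sum_a exp(Q_h(s,a))
        pis[h] = np.exp(Q - V[:, None])  # softmax over actions
    return pis
```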

Max-Ent IRL

  • For \(k=0,\dots,K-1\):
    1. \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
    2. \(w_{k+1} = w_k - \eta  (\mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)]-\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] )\)
  • Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
Soft-VI subroutine:

  • Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
  • For \(h=H-1,\dots, 0\):
    1. \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
    2. \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
    3. \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
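Putting the two pieces together, a minimal sketch of the Max-Ent IRL loop; the routines `soft_vi` and `feature_expectation`, and the expert estimate `phi_hat_expert`, are assumed helpers (computable since \(P\) and \(\mu\) are known), not code from the lecture.

```python
import numpy as np

def maxent_irl(phi_hat_expert, features, soft_vi, feature_expectation, K=100, eta=0.1):
    """Sketch of the Max-Ent IRL loop (helper routines are assumed, not defined here).

    phi_hat_expert      : estimated expert feature expectation, shape (d,)
    features            : known feature array phi[s, a, :], shape (S, A, d)
    soft_vi             : returns the soft-optimal policy for a reward table r[s, a]
    feature_expectation : returns E_{d^pi_mu}[phi(s,a)] for a given policy
    """
    w = np.zeros_like(phi_hat_expert)
    policies = []
    for k in range(K):
        r = features @ w                  # r(s,a) = w_k^T phi(s,a)
        pi_k = soft_vi(r)                 # 1. best response via Soft-VI
        policies.append(pi_k)
        # 2. dual gradient step on the feature-matching constraint
        w = w - eta * (feature_expectation(pi_k) - phi_hat_expert)
    return policies                       # bar{pi} = Unif(pi^0, ..., pi^{K-1})
```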

Modelling Behavior with IRL

Reward function: \(\widehat r(s,a) = \bar w^\top \varphi(s,a)\)

Log-Likelihood of trajectory:

  • Given \(\tau = (s_0,a_0,\dots s_{H-1}, a_{H-1})\)
  • How likely is the expert to take this trajectory?
  • \(\log(\rho^{\bar \pi}(\tau)) = \sum_{h=0}^{H-1} \log P(s_{h+1}|s_h, a_h) + \log \bar\pi(a_h|s_h)\)

(Figure: \(s\): position, \(a\): direction; the learned reward is used to predict the most likely path)
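A small sketch of this log-likelihood computation for a tabular model; the transition array `P[s, a, s']` and the time-indexed policies `pis[h][s, a]` are assumed inputs (e.g., as produced by a soft-VI routine).

```python
import numpy as np

def traj_log_likelihood(traj, P, pis):
    """log rho^{pi}(tau) = sum_h [ log P(s_{h+1}|s_h,a_h) + log pi_h(a_h|s_h) ].

    traj : list of (s_h, a_h) pairs; P : (S, A, S); pis : list of (S, A) policies.
    """
    ll = 0.0
    for h, (s, a) in enumerate(traj):
        ll += np.log(pis[h][s, a])
        if h + 1 < len(traj):             # transition term to the next observed state
            s_next = traj[h + 1][0]
            ll += np.log(P[s, a, s_next])
    return ll
```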

Recap: IL vs. IRL

  • Imitation: a dataset of expert trajectories is treated as labeled pairs \((x=s, y=a^*)\) and fed to supervised learning to produce a policy \(\pi\)
  • Inverse RL: the same expert demonstrations are used to recover a reward function; goal: understand/predict behaviors

Recap

  • PSet due Monday

 

  • Inverse RL
  • Maximum Entropy Principle
  • Constrained Optimization
  • Soft VI

 

  • Next week: case study and societal implications