Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
5789 Paper Review Assignment (weekly pace suggested)
HW 4 due 5/9 -- don't plan on extensions
My office hours are now Wednesdays 4:15-6pm in Gates 416A
Final exam Monday 5/16 at 7pm
Review session in lecture 5/9
Supervised Learning
Policy
Dataset of expert trajectories
...
\(\pi(s) = a\)
\((x=s, y=a^*)\)
imitation
inverse RL
Goal: back out a reward function that generates the expert's behavior
Consistent policy via distribution matching
Consistent rewards: find \(r\) such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$
Assumption: \(r\) linear in known features, \(r(s,a) = \theta_*^\top \varphi(s,a)\).
\(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \)
Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$
same expected features!
Estimate from expert data: \( \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) \)
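A minimal sketch of this empirical estimate, assuming the expert demonstrations are stored as paired sequences of states and actions and `phi` is a user-supplied feature map returning a length-\(d\) vector (names are illustrative, not from the lecture):

```python
import numpy as np

def expert_feature_expectation(expert_states, expert_actions, phi):
    """Empirical estimate of E_{d^{pi*}_mu}[phi(s,a)] from N expert samples."""
    feats = [phi(s, a) for s, a in zip(expert_states, expert_actions)]
    return np.mean(feats, axis=0)   # (1/N) * sum_i phi(s_i^*, a_i^*)
```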
\(s=\) position
\(a\in \{\) north, south, east, west \(\}\)
\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix} \)
Detect location of important objects
Reason about current position
Entropy measures uncertainty
\(\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)\)
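A quick numerical check of this definition (a sketch; `P` is any array of probabilities summing to 1, and terms with \(P(x)=0\) contribute 0):

```python
import numpy as np

def entropy(P):
    """Ent(P) = -sum_x P(x) log P(x), using the convention 0 log 0 = 0."""
    P = np.asarray(P, dtype=float)
    nz = P > 0
    return -np.sum(P[nz] * np.log(P[nz]))

# The uniform distribution has the most uncertainty over a finite set:
print(entropy([0.25, 0.25, 0.25, 0.25]))  # log(4) ~= 1.386
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ~= 0.940 < log(4)
```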
Principle of maximum entropy:
Among all options, choose the one with the most uncertainty
maximize \(\mathsf{Ent}(\pi)\)
s.t. \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_w ~~f(x)+w\cdot g(x)\)
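For example (a small worked instance, not from the slides): take \(f(x)=x^2\) and \(g(x)=x-1\).
$$\min_x \max_w~ x^2 + w(x-1):\qquad \frac{\partial}{\partial x} = 2x+w=0,\quad \frac{\partial}{\partial w} = x-1=0 \;\implies\; x^*=1,\; w=-2.$$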
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)\)
\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
Equivalent Lagrange Formulation
\(\displaystyle x^* =\arg \min_x \max_{w\in\mathbb R^d} ~~f(x)+w^\top g(x)\)
Iterative Procedure
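A common way to instantiate this min-max for MaxEnt IRL is an alternating, dual-ascent-style procedure: given the current weights \(w_k\), compute the corresponding max-entropy policy (e.g. with the soft value iteration described next), then move \(w\) along the gap between expert and policy feature expectations. A minimal sketch under those assumptions; `soft_value_iteration`, `feature_expectation`, and `phi` are hypothetical callables supplied by the user:

```python
import numpy as np

def maxent_irl(expert_feat, phi, soft_value_iteration, feature_expectation,
               d, K=100, eta=0.1):
    """Sketch of the iterative MaxEnt IRL procedure.

    expert_feat: empirical estimate of E_{d^{pi*}_mu}[phi(s,a)], shape (d,)
    phi(s, a): feature map returning a length-d vector
    soft_value_iteration(reward_fn): max-entropy policy for that reward
    feature_expectation(policy): E_{d^pi_mu}[phi(s,a)] under that policy
    """
    w = np.zeros(d)
    policy = None
    for _ in range(K):
        reward_fn = lambda s, a, w=w: w @ phi(s, a)   # r_hat(s,a) = w^T phi(s,a)
        policy = soft_value_iteration(reward_fn)      # inner step: max-entropy policy
        gap = expert_feat - feature_expectation(policy)
        w = w + eta * gap                             # dual ascent: match expected features
    return w, policy
```

At (approximate) convergence the feature-matching constraint holds, and the final weights play the role of \(w_K\) in the learned reward \(\widehat r(s,a) = w_K^\top \varphi(s,a)\) below.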
0. Announcements & Recap
1. Maximum Entropy IRL
2. Iterative Algorithm
3. Soft Value Iteration
Soft-VI
Reward function: \(\widehat r(s,a) = w_K^\top \varphi(s,a)\)
Log-likelihood of trajectory \(\tau=(s_0,a_0,s_1,a_1,\dots)\) under the max-entropy model \(P(\tau)\propto \exp\big(\sum_h \widehat r(s_h,a_h)\big)\): \(\log P(\tau) = \sum_h \widehat r(s_h,a_h) - \log Z\)
(figure: grid map with \(s\): position, \(a\): direction, showing the most likely path under \(\widehat r\))
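A minimal tabular sketch of soft value iteration, assuming a finite MDP given by a transition tensor `P[s, a, s']` and reward matrix `R[s, a]` (the array names and the discounted, infinite-horizon setting are assumptions, not from the slides). The hard max over actions in the Bellman backup is replaced by a log-sum-exp, which yields the max-entropy (softmax) policy:

```python
import numpy as np

def soft_value_iteration(P, R, gamma=0.95, n_iters=500):
    """Tabular soft VI: softmax Bellman backup, returning pi(a|s) = exp(Q(s,a) - V(s)).

    P: transition probabilities, shape (S, A, S)
    R: rewards, shape (S, A)
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        Q = R + gamma * (P @ V)                # Q[s,a] = R[s,a] + gamma * E[V(s')]
        Qmax = Q.max(axis=1, keepdims=True)    # shift for numerical stability
        V = (Qmax + np.log(np.exp(Q - Qmax).sum(axis=1, keepdims=True))).ravel()
    pi = np.exp(Q - V[:, None])                # rows of pi sum to 1
    return pi, V, Q
```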