## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 25

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Maximum Entropy IRL

2. Iterative Algorithm

3. Soft Value Iteration

## Announcements

5789 Paper Review Assignment (weekly pace suggested)

HW 4 due 5/9 -- don't plan on extensions

My office hours are now Wednesdays 4:15-6pm in Gates 416A

Final exam Monday 5/16 at 7pm
Review session in lecture 5/9

## Recap: IL vs. IRL

[Figure: a dataset of expert trajectories is fed to supervised learning, which outputs a policy]

Imitation: supervised learning on examples $$(x=s, y=a^*)$$ drawn from the expert's trajectories yields a policy $$\pi(s) = a$$.

Inverse RL

Goal: back out the reward function that generates the expert's behavior

Consistent policy via distribution matching

## Recap: Consistent Policies

Consistent rewards: find $$r$$ such that for all $$\pi$$, $$\mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$

Assumption: $$r$$ linear in known features, $$r(s,a) = \theta_*^\top \varphi(s,a)$$.

$$\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$


Consistent policies: find $$\pi$$ such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$

same expected features!

Estimate from expert data: $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*)$$
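A minimal sketch of this Monte Carlo estimate, assuming expert data comes as parallel lists of states and actions and a feature map `phi` (the one-hot map here is a made-up stand-in for the problem's features):

```python
import numpy as np

N_STATES, N_ACTIONS = 5, 4   # assumed sizes, for illustration only

def phi(s, a):
    # Hypothetical feature map: one-hot in the (state, action) pair.
    v = np.zeros(N_STATES * N_ACTIONS)
    v[s * N_ACTIONS + a] = 1.0
    return v

def expert_feature_expectation(expert_states, expert_actions):
    """Estimate E_{d^{pi*}_mu}[phi(s,a)] by averaging features over expert samples."""
    feats = [phi(s, a) for s, a in zip(expert_states, expert_actions)]
    return np.mean(feats, axis=0)

# e.g. expert_feature_expectation([0, 2, 4], [1, 3, 0]) averages three feature vectors
```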

## Example: Driving

$$s=$$ [image of the driving scene]

$$a\in \{$$ north, south, east, west $$\}$$

$$\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix}$$

Detect location of important objects

## Recap: Entropy

Entropy measures uncertainty

$$\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)$$
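For example, a minimal sketch of this computation for a discrete distribution given as a probability vector (zero-probability outcomes contribute nothing):

```python
import numpy as np

def entropy(p):
    """Ent(P) = -sum_x P(x) log P(x); terms with P(x) = 0 are skipped."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform: log 4 ≈ 1.386, maximal uncertainty
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic: 0, no uncertainty
```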

Principle of maximum entropy:

Among all options consistent with the constraints, choose the one with the most uncertainty

maximize    $$\mathsf{Ent}(\pi)$$

s.t.    $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$

## Recap: Constrained Optimization

$$x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0$$

Equivalent Lagrange Formulation

$$\displaystyle x^* =\arg \min_x \max_w ~~f(x)+w\cdot g(x)$$

Iterative Procedure

• For $$t=0,\dots,T-1$$:
1. Best response: $$x_t = \arg\min_x f(x) + w_t g(x)$$
2. Gradient ascent: $$w_{t+1} = w_t + \eta g(x_t)$$
• Return $$\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t$$
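As a toy sketch of this procedure on a made-up scalar example (not from the lecture): minimize $$f(x)=x^2$$ subject to $$g(x)=x-1=0$$, whose solution is $$x^*=1$$. The best response to a multiplier $$w$$ has the closed form $$x=-w/2$$:

```python
# Primal-dual iteration for: min x^2  s.t.  x - 1 = 0   (toy example)
eta, T = 0.5, 200
w, xs = 0.0, []
for t in range(T):
    x = -w / 2.0            # best response: argmin_x  x^2 + w * (x - 1)
    w = w + eta * (x - 1)   # gradient ascent on the multiplier
    xs.append(x)

x_bar = sum(xs) / T
print(x_bar)  # ≈ 1, the constrained optimum
```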

## Recap: Constrained Optimization

$$x^* =\arg \min~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0$$

Equivalent Lagrange Formulation

$$\displaystyle x^* =\arg \min_x \max_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)$$

## Recap: Constrained Optimization

$$x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0$$, with vector-valued constraint $$g(x)\in\mathbb R^d$$

Equivalent Lagrange Formulation

$$\displaystyle x^* =\arg \min_x \max_{w\in\mathbb R^d} ~~f(x)+w^\top g(x)$$

Iterative Procedure

• For $$t=0,\dots,T-1$$:
1. Best response: $$x_t = \arg\min_x f(x) + w_t^\top g(x)$$
2. Gradient ascent: $$w_{t+1} = w_t + \eta g(x_t)$$
• Return $$\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t$$
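A minimal generic sketch of this procedure (the best-response step is problem-specific, so it is passed in as a callable; all names here are assumptions, not from the lecture):

```python
import numpy as np

def primal_dual(best_response, g, d, eta=0.1, T=100):
    """Averaged primal-dual iteration for min f(x) s.t. g(x) = 0 with g(x) in R^d.

    best_response(w): returns argmin_x f(x) + w @ g(x)
    g(x):             returns the d-dimensional constraint violation at x
    """
    w = np.zeros(d)
    xs = []
    for t in range(T):
        x = best_response(w)    # 1. best response to current multipliers
        w = w + eta * g(x)      # 2. gradient ascent on the multipliers
        xs.append(x)
    return np.mean(xs, axis=0)  # average iterate (assumes x is a numeric array)
```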

## Agenda

0. Announcements & Recap

1. Maximum Entropy IRL

2. Iterative Algorithm

3. Soft Value Iteration

## Max-Ent IRL

• For $$k=0,\dots,K-1$$:
1. $$\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)$$
2. $$w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])$$
• Return $$\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})$$
Soft-VI

• Input: reward function $$r$$. Initialize $$V_H^*(s) = 0$$
• For $$h=H-1,\dots, 0$$:
1. $$Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V_{h+1}^*(s')]$$
2. $$\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))$$
3. $$V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)$$
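A minimal tabular sketch of both loops above, assuming a known transition tensor `P[s, a, s']`, a feature tensor `PHI[s, a, :]`, and a user-supplied routine that estimates a policy's expected features (e.g. by rollouts from $$\mu$$); all names and shapes are illustrative assumptions:

```python
import numpy as np

def soft_vi(r, P, H):
    """Soft value iteration: returns a list of per-step policies pis[h][s, a]."""
    S, A = r.shape
    V = np.zeros(S)                                   # V_H^*(s) = 0
    pis = [None] * H
    for h in reversed(range(H)):
        Q = r + P @ V                                 # Q_h^*(s,a) = r(s,a) + E_{s'}[V_{h+1}^*(s')]
        m = Q.max(axis=1, keepdims=True)
        pi = np.exp(Q - m)                            # stable exponentiation
        pis[h] = pi / pi.sum(axis=1, keepdims=True)   # pi_h^*(a|s) ∝ exp(Q_h^*(s,a))
        V = (m + np.log(pi.sum(axis=1, keepdims=True))).ravel()  # V_h^* = log Σ_a exp(Q_h^*)
    return pis

def maxent_irl(mu_expert, PHI, P, H, feature_expectation, eta=0.1, K=50):
    """Max-Ent IRL outer loop; feature_expectation(pis) estimates E_{d^pi_mu}[phi(s,a)]."""
    w = np.zeros(PHI.shape[-1])
    policies = []
    for k in range(K):
        pis = soft_vi(PHI @ w, P, H)                          # 1. Soft-VI under reward w_k^T phi
        w = w + eta * (mu_expert - feature_expectation(pis))  # 2. multiplier (weight) update
        policies.append(pis)
    return policies, w   # pi_bar: uniform mixture over pi^0, ..., pi^{K-1}
```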

## Inverse RL

Reward function: $$\widehat r(s,a) = w_K^\top \varphi(s,a)$$

Log-Likelihood of trajectory:

• Given $$\tau = (s_0,a_0,\dots s_{H-1}, a_{H-1})$$
• How likely is the expert to take this trajectory?
• $$\log(\rho^{\bar \pi}(\tau)) = \sum_{h=0}^{H-1} \log P(s_{h+1}|s_h, a_h) + \log \bar\pi(a_h|s_h)$$
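A minimal sketch of evaluating this log-likelihood in a tabular model, assuming a transition matrix `P[s, a, s']` and per-step policies `pis[h][s, a]` (e.g. from the Soft-VI sketch above); the final state is passed separately since the trajectory above stops at $$a_{H-1}$$:

```python
import numpy as np

def traj_log_likelihood(states, actions, s_final, P, pis):
    """Sum of log pi_h(a_h|s_h) + log P(s_{h+1}|s_h, a_h) along the trajectory."""
    ll = 0.0
    for h, (s, a) in enumerate(zip(states, actions)):
        s_next = states[h + 1] if h + 1 < len(states) else s_final
        ll += np.log(pis[h][s, a])     # how likely the policy is to pick a_h in s_h
        ll += np.log(P[s, a, s_next])  # how likely the environment moves to s_{h+1}
    return ll
```

Comparing this quantity across candidate trajectories identifies the most likely path under the learned policy.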

[Figure: gridworld example with $$s$$: position, $$a$$: direction, showing the most likely path]
