CS 4/5789: Introduction to Reinforcement Learning

Lecture 25

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements & Recap

 1. Maximum Entropy IRL

2. Iterative Algorithm

3. Soft Value Iteration

Announcements

 

5789 Paper Review Assignment (weekly pace suggested)

HW 4 due 5/9 -- don't plan on extensions

 

My office hours are now Wednesdays 4:15-6pm in Gates 416A

 

Final exam Monday 5/16 at 7pm
Review session in lecture 5/9

Recap: IL vs. IRL

Imitation learning: supervised learning of a policy from a dataset of expert trajectories, where each step gives a labeled example \((x=s, y=a^*)\).

Inverse RL: the goal is to back out the reward function that generates the expert's behavior, then recover a consistent policy via distribution matching.

Recap: Consistent Policies

Consistent rewards: find  \(r\)  such that for all \(\pi\) $$ \mathbb E_{d^{\pi^*}_\mu}[r(s,a)] \geq \mathbb E_{d^{\pi}_\mu}[r(s,a)]$$

Assumption: \(r\) linear in known features, \(r(s,a) = \theta_*^\top \varphi(s,a)\).

 \(\theta_*^\top \mathbb E_{d^{\pi^*}_\mu}[\varphi(s,a)] \geq \theta_*^\top \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)] \)


Consistent policies: find \(\pi\) such that $$\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]$$

same expected features!

Estimate from expert data: \( \mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] \approx \frac{1}{N} \sum_{i=1}^N \varphi(s_i^*, a_i^*) \)
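For concreteness, here is a minimal NumPy sketch of this empirical estimate, assuming the expert data comes as paired lists of states and actions together with a feature map phi (all names are illustrative, not from the lecture):

import numpy as np

def expert_feature_expectation(states, actions, phi):
    # Empirical estimate of E_{d^{pi*}_mu}[phi(s,a)] from N expert samples:
    # (1/N) * sum_i phi(s_i^*, a_i^*)
    feats = np.array([phi(s, a) for s, a in zip(states, actions)])
    return feats.mean(axis=0)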

Example: Driving

\(s=\) current position in the driving scene (image)

\(a\in \{\) north, south, east, west \(\}\)

\(\varphi(s,a) = \begin{bmatrix} \mathbb P \{\text{hit building}\mid \text{move in direction }a\} \\ \mathbb P \{\text{hit car}\mid \text{move in direction }a\}\\ \vdots \end{bmatrix}  \)

Computing these features requires detecting the locations of important objects and reasoning about the current position.
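As a toy illustration of such a feature map, a hedged sketch for a grid-style driving state (the object-probability maps, direction encoding, and names are assumptions made for this example, not the lecture's setup):

import numpy as np

# Unit steps for each action on a hypothetical grid.
DIRECTIONS = {"north": (0, 1), "south": (0, -1), "east": (1, 0), "west": (-1, 0)}

def driving_features(s, a, building_prob, car_prob):
    # s: (x, y) position; a: one of the four directions.
    # building_prob, car_prob: 2D arrays with the detected probability that
    # a building / car occupies each cell.
    dx, dy = DIRECTIONS[a]
    x, y = s[0] + dx, s[1] + dy
    return np.array([building_prob[x, y],   # P{hit building | move in direction a}
                     car_prob[x, y]])       # P{hit car | move in direction a}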

Recap: Entropy

Entropy measures uncertainty

\(\mathsf{Ent}(P) = \mathbb E_P[-\log P(x)] = -\sum_{x\in\mathcal X} P(x) \log P(x)\)

Principle of maximum entropy:

Among all options, choose the one with the most uncertainty

maximize    \(\mathsf{Ent}(\pi)\)

s.t.    \(\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] = \mathbb E_{d^{\pi}_\mu}[\varphi (s,a)]\)
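A quick numerical check of the entropy formula (a minimal sketch; the distributions compared are just examples):

import numpy as np

def entropy(p):
    # Ent(P) = -sum_x P(x) log P(x); zero-probability entries contribute 0.
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: log 4 ≈ 1.386 (most uncertainty)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic: 0 (least uncertainty)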

Recap: Constrained Optimization

\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)

Equivalent Lagrange Formulation

\(\displaystyle x^* =\arg \min_x \max_w ~~f(x)+w\cdot g(x)\)

Iterative Procedure

  • For \(t=0,\dots,T-1\):
    1. Best response: \(x_t = \arg\min_x f(x) + w_t g(x)\)
    2. Gradient ascent: \(w_{t+1} = w_t + \eta g(x_t)\)
  • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)
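A minimal sketch of this primal-dual procedure on a toy problem (the quadratic objective, linear constraint, closed-form best response, and step size are all assumptions chosen to keep the example short):

import numpy as np

# Toy problem: minimize f(x) = ||x - c||^2 subject to g(x) = a^T x - b = 0.
c = np.array([1.0, 2.0])
a = np.array([1.0, 1.0])
b = 1.0
g = lambda x: a @ x - b

w, eta, T = 0.0, 0.5, 200
xs = []
for t in range(T):
    # 1. Best response: x_t = argmin_x f(x) + w * g(x); here 2(x - c) + w a = 0.
    x_t = c - 0.5 * w * a
    xs.append(x_t)
    # 2. Gradient ascent on the multiplier: w <- w + eta * g(x_t)
    w += eta * g(x_t)

x_bar = np.mean(xs, axis=0)
print(x_bar, g(x_bar))  # x_bar ≈ [0, 1], the projection of c onto the constraint set; g(x_bar) ≈ 0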

Recap: Constrained Optimization

\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g_1(x)=g_2(x)=...=g_d(x)=0\)

Equivalent Lagrange Formulation

\(\displaystyle x^* =\arg \min_x \max_{w_1,\dots,w_d} ~~f(x)+\sum_{i=1}^d w_i g_i(x)\)

Recap: Constrained Optimization

\(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\), where \(g(x)\in\mathbb R^d\) collects the \(d\) constraints

Equivalent Lagrange Formulation

\(\displaystyle x^* =\arg \min_x \max_{w\in\mathbb R^d} ~~f(x)+w^\top g(x)\)

Iterative Procedure

  • For \(t=0,\dots,T-1\):
    1. Best response: \(x_t = \arg\min_x f(x) + w_t^\top g(x)\)
    2. Gradient ascent: \(w_{t+1} = w_t + \eta g(x_t)\)
  • Return \(\bar x = \frac{1}{T} \sum_{t=0}^{T-1} x_t\)

Agenda

 

0. Announcements & Recap

 1. Maximum Entropy IRL

2. Iterative Algorithm

3. Soft Value Iteration

Max-Ent IRL

  • For \(k=0,\dots,K-1\):
    1. \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
    2. \(w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])\)
  • Return \(\bar \pi = \mathsf{Unif}(\pi^0,\dots \pi^{K-1})\)
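A hedged sketch of this loop as code (SoftVI is sketched after the Soft-VI slide below; feature_expectation, which would compute \(\mathbb E_{d^{\pi}_\mu}[\varphi(s,a)]\) for a given policy by rollouts or dynamic programming, is a hypothetical helper, as is the feature dimension d):

import numpy as np

def max_ent_irl(phi, phi_expert, soft_vi, feature_expectation, d, K=100, eta=0.1):
    # phi: feature map phi(s, a); phi_expert: empirical expert feature expectation (length-d).
    # soft_vi: maps a reward function r(s, a) to a policy (see the Soft-VI sketch below).
    # feature_expectation: maps a policy to E_{d^pi_mu}[phi(s, a)].
    w = np.zeros(d)
    policies = []
    for k in range(K):
        # 1. Best response: run soft value iteration with reward w_k^T phi
        pi_k = soft_vi(lambda s, a, w=w: w @ phi(s, a))
        policies.append(pi_k)
        # 2. Gradient ascent on w, toward matching the expert's expected features
        w = w + eta * (phi_expert - feature_expectation(pi_k))
    # The returned policy is the uniform mixture Unif(pi^0, ..., pi^{K-1});
    # w also gives the recovered reward r_hat(s, a) = w_K^T phi(s, a).
    return policies, w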
Soft-VI

  • Input: reward function \(r\). Initialize \(V_H^*(s) = 0\)
  • For \(h=H-1,\dots, 0\):
    1. \(Q_h^*(s,a) = r(s,a) + \mathbb E_{s'\sim P}[V^*_{h+1}(s')]\)
    2. \(\pi_h^*(a|s) \propto \exp(Q^*_h(s,a))\)
    3. \(V_h^*(s) = \log\left(\sum_{a\in\mathcal A} \exp(Q^*_h(s,a)) \right)\)
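A minimal tabular sketch of Soft-VI, assuming the finite MDP is given as a reward array R[s, a] and transition array P[s, a, s'] (this array layout is an assumption for the sketch):

import numpy as np
from scipy.special import logsumexp  # numerically stable log-sum-exp

def soft_value_iteration(R, P, H):
    # R: (S, A) rewards, P: (S, A, S) transition probabilities, H: horizon.
    # Returns pi with pi[h, s, a] = pi_h^*(a | s).
    S, A = R.shape
    V = np.zeros(S)                                  # V_H^*(s) = 0
    pi = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q = R + P @ V                                # Q_h^*(s,a) = r(s,a) + E_{s'~P}[V_{h+1}^*(s')]
        pi[h] = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))  # pi_h^*(a|s) ∝ exp(Q_h^*(s,a))
        V = logsumexp(Q, axis=1)                     # V_h^*(s) = log sum_a exp(Q_h^*(s,a))
    return pi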

Inverse RL

Reward function: \(\widehat r(s,a) = w_K^\top \varphi(s,a)\)

Log-Likelihood of trajectory:

  • Given \(\tau = (s_0,a_0,\dots s_{H-1}, a_{H-1})\)
  • How likely is the expert to take this trajectory?
  • \(\log(\rho^{\bar \pi}(\tau)) = \sum_{h=0}^{H-1} \log P(s_{h+1}|s_h, a_h) + \log \bar\pi(a_h|s_h)\)
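A small sketch of evaluating this log-likelihood for a tabular MDP (the array layout and variable names are illustrative; pi is the (H, S, A) policy array from the Soft-VI sketch above):

import numpy as np

def traj_log_likelihood(traj, P, pi):
    # traj: list of (s_h, a_h) pairs; P: (S, A, S) transitions; pi: (H, S, A) policy.
    # Accumulates log pi(a_h | s_h) + log P(s_{h+1} | s_h, a_h) along the trajectory.
    ll = 0.0
    for h, (s, a) in enumerate(traj):
        ll += np.log(pi[h, s, a])
        if h + 1 < len(traj):                # transition term to the next observed state
            s_next = traj[h + 1][0]
            ll += np.log(P[s, a, s_next])
    return ll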

Driving example (\(s\): position, \(a\): direction), illustrating the most likely path under the learned reward.
