## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 24

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. Inverse Reinforcement Learning

2. Maximum Entropy Principle

3. Constrained Optimization

## Announcements

5789 Paper Review Assignment (weekly pace suggested)

HW 3 due, HW 4 released

PollEV Participation Count Updated on Canvas

My office hours are now Wednesdays 4:15-6pm in Gates 416A

Final exam Monday 5/16 at 7pm

## Recap: Behavioral Cloning

Dataset of expert trajectories $$(x, y)$$ → Supervised Learning → Policy $$\pi$$

[Diagram: the learned policy $$\pi$$ takes an observed state as input and outputs an action.]
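The behavioral cloning recipe above can be sketched in a few lines. This is a minimal tabular illustration, not the lecture's implementation: the function name `behavioral_cloning`, the toy `expert_data`, and the `default_action` fallback are all assumptions made for the example. Here "supervised learning" is simply a majority vote over the expert's actions in each state.

```python
from collections import Counter, defaultdict

def behavioral_cloning(dataset, default_action=0):
    """Fit a tabular policy by majority vote over expert actions per state.

    dataset: iterable of (x, y) pairs, x = state, y = expert action.
    default_action: hypothetical fallback for states absent from the data.
    """
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    # Pick the most frequent expert action in each observed state.
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}

    def policy(state):
        return table.get(state, default_action)

    return policy

# Toy expert data (made up for illustration): (state, action) pairs.
expert_data = [(0, 1), (1, 1), (2, 0), (2, 0), (2, 1)]
pi_hat = behavioral_cloning(expert_data)
```

Note that `pi_hat` has no expert labels for states the expert never visited and falls back to `default_action` there, which is the weakness that DAgger targets.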

## DAgger: Dataset Aggregation

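1. Supervised Learning: fit policy $$\pi$$ to dataset $$\mathcal D = \{(x_i, y_i)\}_{i=1}^M$$

2. Execute: run $$\pi$$ and record the visited states $$s_0, s_1, s_2, \dots$$

3. Query Expert: obtain $$\pi^*(s_0), \pi^*(s_1), \dots$$

4. Aggregate: add $$(x_i = s_i, y_i = \pi^*(s_i))$$ to $$\mathcal D$$ and repeat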
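The DAgger loop (supervised learning, execute, query expert, aggregate) can be sketched on a toy problem. Everything concrete here is an assumption for illustration: the six-state chain, the `HORIZON`, the always-move-right `expert`, and the majority-vote `fit` step stand in for whatever environment, expert, and learner are actually used.

```python
from collections import Counter, defaultdict

N_STATES = 6   # hypothetical chain: states 0..5
HORIZON = 10   # hypothetical rollout length

def step(state, action):
    """Deterministic chain dynamics: action 1 moves right, action 0 moves left."""
    if action == 1:
        return min(N_STATES - 1, state + 1)
    return max(0, state - 1)

def expert(state):
    """Stand-in for pi^*: always move right."""
    return 1

def fit(dataset, default_action=0):
    """Supervised learning step: majority vote over labels in each state."""
    votes = defaultdict(Counter)
    for s, a in dataset:
        votes[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in votes.items()}
    return lambda s: table.get(s, default_action)

def rollout(policy, start=0):
    """Execute the current policy and return the states it visits."""
    states, s = [], start
    for _ in range(HORIZON):
        states.append(s)
        s = step(s, policy(s))
    return states

def dagger(n_iters):
    dataset = []
    policy = fit(dataset)  # initial policy: default action everywhere
    for _ in range(n_iters):
        visited = rollout(policy)                      # Execute
        dataset += [(s, expert(s)) for s in visited]   # Query Expert + Aggregate
        policy = fit(dataset)                          # Supervised Learning
    return policy, dataset

pi_hat, data = dagger(n_iters=6)
```

In this toy setup each iteration labels the states the current policy actually reaches, so the dataset grows to cover one new state per iteration until the learned policy agrees with the expert on the whole chain.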

## BC vs. DAgger

BC: supervised learning guarantee

$$\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon$$

BC: performance guarantee

$$V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}$$

DAgger: online learning guarantee

$$\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon$$

DAgger: performance guarantee

$$V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon$$
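To get a feel for the two performance guarantees, the bounds can be evaluated numerically. The values of `eps`, `gamma`, and `max_abs_advantage` below are arbitrary assumptions chosen for illustration, and the function names are hypothetical.

```python
def bc_bound(eps, gamma):
    """Behavioral cloning bound: 2 * eps / (1 - gamma)^2."""
    return 2 * eps / (1 - gamma) ** 2

def dagger_bound(eps, gamma, max_abs_advantage):
    """DAgger bound: max_{s,a} |A^{pi*}(s,a)| * eps / (1 - gamma)."""
    return max_abs_advantage * eps / (1 - gamma)

# Illustrative values only (not from the lecture).
eps, gamma = 0.01, 0.9
bc_gap = bc_bound(eps, gamma)
dagger_gap = dagger_bound(eps, gamma, max_abs_advantage=1.0)
```

With these numbers the BC bound is about 2 and the DAgger bound is about 0.1. The gap between them depends on how large the expert's advantage can be: the DAgger bound is much tighter when a single mistake costs little relative to $$1/(1-\gamma)$$.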

## Agenda

0. Announcements & Recap

1. Inverse Reinforcement Learning

2. Maximum Entropy Principle

3. Constrained Optimization

(Kitani et al., 2012)
