## CS 4/5789: Introduction to Reinforcement Learning

### Lecture 15

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

## Agenda

0. Announcements & Recap

1. PG with Q & Advantage functions

2. Trust Regions & KL-Divergence

## Announcements

HW2 due Monday 3/28

5789 Paper Review Assignment (weekly pace suggested)

Monday 3/21 is the last day to drop

Prelim Tuesday 3/22 at 7:30-9pm in Phillips 101

Closed-book, definition/equation sheet provided

Focus: mainly Unit 1 (known models), but many lectures in Unit 2 revisit key concepts

Study Materials: Lecture Notes 1-15, HW0&1

Lecture on Monday 3/21 will be a review

## Recap

Derivative Free Optimization: Random Search

$$\nabla J(\theta) \approx \frac{1}{2\delta} \left(J(\theta+\delta v) - J(\theta-\delta v)\right) v$$

Example (figure): $J(\theta) = -\theta^2 - 1$ plotted against $\theta$.
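The two-point estimate above can be checked numerically on the slide's example objective $J(\theta) = -\theta^2 - 1$; a minimal sketch, where the step size and iteration count are illustrative choices rather than part of the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def J(theta):
    # Example objective from the slide: J(theta) = -theta^2 - 1.
    return -theta**2 - 1

def random_search_gradient(theta, delta=1e-2):
    # Two-point finite-difference estimate along a random direction v:
    # (1 / (2 delta)) * (J(theta + delta v) - J(theta - delta v)) * v
    v = rng.standard_normal()
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

# Stochastic gradient ascent with the estimate drives theta toward
# the maximizer theta = 0.
theta = 2.0
for _ in range(500):
    theta += 0.05 * random_search_gradient(theta)
```

For this quadratic the estimate equals $-2\theta v^2$ exactly, so it is an unbiased gradient estimate in expectation ($\mathbb E[v^2] = 1$) even though no derivative of $J$ is ever computed.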

## Recap

Derivative Free Optimization: Sampling

$$J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]$$

$$\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(x))\, h(x), \quad x \sim P_\theta$$

Example (figure): $P_\theta = \mathcal N(\theta, 1)$ and $h(x) = -x^2$, so $J(\theta) = \mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2]$ and the estimate is $(x-\theta)\, h(x)$.
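The score-function estimate for this Gaussian example can be verified by sampling: since $J(\theta) = -(\theta^2 + 1)$, the true gradient is $-2\theta$. A minimal sketch (the sample size is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Payoff from the slide: h(x) = -x^2.
    return -x**2

def score_function_gradient(theta, n):
    # For P_theta = N(theta, 1), grad_theta log P_theta(x) = (x - theta),
    # so the estimator averages (x - theta) * h(x) over x ~ P_theta.
    x = rng.normal(theta, 1.0, size=n)
    return np.mean((x - theta) * h(x))

# J(theta) = E[-x^2] = -(theta^2 + 1), so the true gradient at
# theta = 1 is -2; the sample average should land close to it.
est = score_function_gradient(1.0, n=200_000)
```

Note that the estimator only evaluates $h$ at sampled points; it never differentiates $h$, which is what makes this "derivative free."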

### RL Setting

• MDP $$\mathcal M = \{\mathcal S, \mathcal A, P, r, \gamma\}$$ with $$P, r$$ unknown
• policy $$\pi_\theta$$ with parameter $$\theta\in\mathbb R^d$$
• observe rollout of $$\pi_\theta$$: $$\tau = (s_0,a_0,s_1,...)$$ and $$(r_0, r_1,...)$$
• objective function
$$J(\theta) = \mathbb E_{s_0\sim\mu_0}[\sum_{t=0}^\infty\gamma^t r_t \mid P,r,\pi_\theta] = \mathbb E_{\tau\sim\rho_\theta}[R(\tau)]$$

Simple Random Search

1. with $$\theta_t \pm \delta v$$ observe $$\tau_+$$ and $$\tau_-$$
2. finite difference approx
$$g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v$$

REINFORCE

1. with $$\theta_t$$ observe $$\tau$$
2. trajectory-based approx
$$g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)$$

Meta-Algorithm: DF-SGA (derivative-free stochastic gradient ascent)

initialize $$\theta_0$$

for $$t=0,1,...$$

1. collect rollouts using $$\theta_t$$
2. estimate gradient with $$g_t$$
3. $$\theta_{t+1} = \theta_t + \alpha g_t$$
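The three steps of DF-SGA can be sketched with the REINFORCE estimator on a toy one-step problem: a Gaussian policy over a single scalar action with return $-a^2$. This toy MDP, step size, and iteration count are assumptions for illustration, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(theta, n=100):
    # Step 1: collect rollouts. One-step toy MDP (assumed, not from the
    # slides): action a ~ pi_theta = N(theta, 1), return R = -a^2.
    a = rng.normal(theta, 1.0, size=n)
    # Step 2: estimate the gradient. For a unit-variance Gaussian policy,
    # grad_theta log pi_theta(a) = (a - theta).
    return np.mean((a - theta) * (-a**2))

# Step 3: gradient ascent. The true gradient is -2*theta, so theta
# should approach the optimum theta = 0.
theta = 3.0
for t in range(300):
    g = reinforce_step(theta)
    theta += 0.05 * g
```

The same loop works unchanged if `reinforce_step` is swapped for the finite-difference estimator, which is the point of calling DF-SGA a meta-algorithm: only step 2 differs between the two methods.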
