Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PA 3 due Friday 3/31
• PSet 4 due Friday (moved from Wednesday)
• 5789 Paper Reviews due weekly on Mondays
• Hard deadline for 3 reviews by Friday
• My Tuesday office hours moved this week to 3-4pm

## Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

## Recap: SGA

• Rather than exact gradients, SGA uses unbiased estimates of the gradient $$g_i$$, i.e. $$\mathbb E[g_i|\theta_i] = \nabla J(\theta_i)$$

Algorithm: SGA

• Initialize $$\theta_0$$; For $$i=0,1,...$$:
• $$\theta_{i+1} = \theta_i + \alpha g_i$$


## Recap: DFO

• Random Search
• Based on finite difference approximation
• $$g_i=\frac{1}{2\delta} (J(\theta_i+\delta v) - J(\theta_i - \delta v))v$$ for $$v\sim \mathcal N(0,I)$$
• Monte Carlo / Importance Weighting
• Suppose that $$J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)]$$
• $$g_i=\left[\nabla_{\theta}\log P_\theta(z)\right]_{\theta=\theta_i} h(z)$$ for $$z\sim P_{\theta_i}$$

(Figure: annotation of the "score" term $$\nabla_\theta \log P_\theta(z)$$, and a plot of the example objective $$J(\theta) = -\theta^2 - 1$$ with its finite difference estimate $$\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))v$$.)

## Recap: Random Search Example

• start with $$\theta$$ positive
• suppose $$v$$ is positive
• then $$J(\theta+\delta v)<J(\theta-\delta v)$$
• therefore $$g$$ is negative
• indeed, $$\nabla J(\theta) = -2\theta<0$$ when $$\theta>0$$
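
The sign argument above can be reproduced numerically; a minimal sketch (the function `J` below is the example objective from the figure):

```python
# Finite-difference (random search) gradient estimate for J(theta) = -theta^2 - 1
def J(theta):
    return -theta**2 - 1

def fd_estimate(theta, v, delta=0.1):
    # g = (1/(2*delta)) * (J(theta + delta*v) - J(theta - delta*v)) * v
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

theta, v = 1.0, 0.5   # theta positive, v positive
g = fd_estimate(theta, v)
print(g)              # negative, matching grad J(theta) = -2*theta < 0
```

Because $$J$$ is quadratic, the finite difference is exact here: $$g = -2\theta v^2 = -0.5$$.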

## Recap: Monte Carlo Example

$$\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(z))\, h(z)$$

• Setup: $$P_\theta = \mathcal N(\theta, 1)$$ and $$h(z) = -z^2$$, so $$J(\theta) = \mathbb E_{z\sim P_\theta}[h(z)] = \mathbb E_{z\sim\mathcal N(\theta, 1)}[-z^2] = -\theta^2 - 1$$
• Score: $$\log P_\theta(z) \propto -\frac{1}{2}(\theta-z)^2$$, so $$\nabla_\theta \log P_\theta(z) = z-\theta$$

• start with $$\theta$$ positive
• suppose $$z>\theta$$
• then the score is positive
• therefore $$g$$ is negative (since $$h(z)<0$$)
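
As with random search, the sign reasoning can be checked numerically; a minimal sketch using the example $$P_\theta = \mathcal N(\theta, 1)$$ and $$h(z) = -z^2$$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 1.0                      # start with theta positive

def score_estimate(theta, z):
    # For P_theta = N(theta, 1): grad_theta log P_theta(z) = z - theta
    return (z - theta) * (-z**2)   # score * h(z)

z = theta + 0.7                  # suppose z > theta, so the score is positive
g = score_estimate(theta, z)
print(g)                         # negative, since h(z) < 0

# Averaging many samples recovers grad J(theta) = -2*theta = -2
g_avg = np.mean(score_estimate(theta, rng.normal(theta, 1.0, size=200_000)))
print(g_avg)
```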

## Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

## Policy Optimization Setting

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$$

• Goal: achieve high expected cumulative reward:

$$\max_\pi ~~\mathbb E \left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\mid s_0\sim \mu_0, s_{t+1}\sim P(s_t, a_t), a_t\sim \pi(s_t)\right ]$$

• Recall notation for a trajectory $$\tau = (s_0, a_0, s_1, a_1, \dots)$$ and probability of a trajectory $$\mathbb P^{\pi}_{\mu_0}$$
• Define cumulative reward $$R(\tau) = \sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$
• For parametric (e.g. deep) policy $$\pi_\theta$$, the objective is: $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

## Policy Optimization Setting

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma, \mu_0\}$$

• Goal: achieve high expected cumulative reward:

$$\max_\theta ~~J(\theta)= \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

• Assume that we can "rollout" policy $$\pi_\theta$$ to observe:

• a sample $$\tau$$ from $$\mathbb P^{\pi_\theta}_{\mu_0}$$

• the resulting cumulative reward $$R(\tau)$$

• Note: we do not need to know $$P$$ or $$r$$!

## Policy Optimization Setting

Meta-Algorithm: Policy Optimization

• Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Rollout policy
• Estimate $$\nabla J(\theta_i)$$ as $$g_i$$ using rollouts
• Update $$\theta_{i+1} = \theta_i + \alpha g_i$$
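
As a code skeleton, the meta-algorithm is plain SGA with a plug-in gradient estimator; a sketch where `estimate_gradient` stands in for any of the constructions below (the quadratic sanity check is a hypothetical stand-in, not an MDP):

```python
import numpy as np

def policy_optimization(theta0, estimate_gradient, alpha=0.01, iters=100):
    """Generic policy-optimization loop: theta_{i+1} = theta_i + alpha * g_i."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        g = estimate_gradient(theta)   # must satisfy E[g | theta] = grad J(theta)
        theta = theta + alpha * g
    return theta

# Sanity check with an exact gradient oracle for J(theta) = -||theta||^2:
theta_final = policy_optimization([2.0, -1.0], lambda th: -2 * th, alpha=0.1, iters=100)
print(theta_final)   # moves toward the maximizer [0, 0]
```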

## Policy Optimization

• In today's lecture, we review four ways to construct the estimates $$g_i$$ such that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$
• We will have
• two estimates that use trajectories $$\tau$$
• two estimates that also use Q/Value functions
• We consider infinite-length trajectories $$\tau$$ without worrying about computational feasibility
• note that we could use a similar trick as in Lectures 12-13 to ensure finite sampling size/time

## Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

## Random Policy Search

• Successful in continuous action/control settings
• Can improve accuracy of the gradient estimate by sampling more $$v$$, which is easily parallelizable in simulation

Algorithm: Random Policy Search

• Given $$\alpha, \delta$$. Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Sample $$v\sim \mathcal N(0, I)$$
• Rollout policies $$\pi_{\theta_i\pm\delta v}$$ and observe trajectories $$\tau_+$$ and $$\tau_-$$
• Estimate $$g_i = \frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v$$
• Update $$\theta_{i+1} = \theta_i + \alpha g_i$$
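
A minimal implementation sketch; `rollout_return` is a hypothetical stand-in for "rollout $$\pi_\theta$$ and observe $$R(\tau)$$", stubbed here with a deterministic quadratic so that $$J(\theta) = -\|\theta - \theta^\star\|^2$$:

```python
import numpy as np

rng = np.random.default_rng(0)
theta_star = np.array([1.0, -2.0])

def rollout_return(theta):
    # Stand-in for rolling out pi_theta and observing R(tau); deterministic here.
    return -np.sum((theta - theta_star) ** 2)

def random_policy_search(theta0, alpha=0.05, delta=0.1, iters=500):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iters):
        v = rng.standard_normal(theta.shape)
        r_plus = rollout_return(theta + delta * v)    # rollout pi_{theta + delta v}
        r_minus = rollout_return(theta - delta * v)   # rollout pi_{theta - delta v}
        g = (r_plus - r_minus) / (2 * delta) * v
        theta = theta + alpha * g
    return theta

theta_hat = random_policy_search([0.0, 0.0])
print(theta_hat)   # approaches theta_star = [1, -2]
```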

We have that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$ up to the accuracy of the finite difference approximation

• $$\mathbb E[g_i| \theta_i] = \mathbb E[\mathbb E[g_i|v, \theta_i]| \theta_i]$$ by the tower property
• $$\mathbb E[g_i|v, \theta_i]$$
• $$= \mathbb E_{\tau_+, \tau_-}[\frac{1}{2\delta}\left(R(\tau_+)-R(\tau_-)\right) v]$$
• $$=\frac{1}{2\delta}\left( \mathbb E_{\tau_+\sim \mathbb P^{\pi_{\theta_i+\delta v}}_{\mu_0}}[R(\tau_+)]-\mathbb E_{\tau_-\sim \mathbb P^{\pi_{\theta_i-\delta v}}_{\mu_0}}[R(\tau_-)]\right) v$$
• $$=\frac{1}{2\delta}\left( J(\theta_i+\delta v) - J(\theta_i-\delta v)\right) v$$
• $$=\nabla J(\theta_i)^\top v\, v$$ if the finite difference approximation is perfect
• $$\mathbb E[g_i| \theta_i] = \mathbb E_v[\nabla J(\theta_i)^\top v\, v] = \nabla J(\theta_i)$$ since $$\mathbb E[vv^\top]=I$$
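
The last step uses $$\mathbb E_v[vv^\top] = I$$ for $$v \sim \mathcal N(0, I)$$, which is easy to verify by sampling:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = np.array([3.0, -1.0])          # stand-in for grad J(theta_i)

# E_v[(grad^T v) v] = E[v v^T] grad = I grad = grad for v ~ N(0, I)
vs = rng.standard_normal((500_000, 2))
estimate = np.mean((vs @ grad)[:, None] * vs, axis=0)
print(estimate)                        # close to [3, -1]
```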

(Figure: two-state MDP with states $$0$$ and $$1$$; transition labels — from state $$0$$: stay $$1$$, switch $$1$$; from state $$1$$: stay $$p_1$$ / $$1-p_1$$, switch $$1-p_2$$ / $$p_2$$.)

## Example


• Parametrized policy: $$\pi_\theta(0)=$$stay, $$\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}$$ and $$\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}$$.
• Initialize $$\theta^{(1)}_0=\theta^{(2)}_0=1/2$$

• try perturbation in favor of "switch", then in favor of "stay"

• update in direction of policy which receives more cumulative reward

reward: $$+1$$ if $$s=0$$ and $$-\frac{1}{2}$$ if $$a=$$ switch


## REINFORCE

Claim: The gradient estimate is unbiased: $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$

• Recall the Monte Carlo gradient and that $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$

Algorithm: REINFORCE

• Given $$\alpha$$. Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Rollout policy $$\pi_{\theta_i}$$ and observe trajectory $$\tau$$
• Estimate $$g_i = \sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)$$
• Update $$\theta_{i+1} = \theta_i + \alpha g_i$$
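
A minimal sketch of REINFORCE on a hypothetical one-step problem (a two-armed bandit with a softmax policy, so each "trajectory" is a single action); action $$0$$ pays reward $$1$$ and action $$1$$ pays $$0$$, so the probability of action $$0$$ should grow:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def reinforce_step(theta, alpha=0.1):
    p = softmax(theta)
    a = rng.choice(2, p=p)                 # rollout: sample one action
    R = 1.0 if a == 0 else 0.0             # one-step "trajectory" reward R(tau)
    # score: grad_theta log softmax(theta)[a] = onehot(a) - p
    score = np.eye(2)[a] - p
    return theta + alpha * score * R

theta = np.zeros(2)
for _ in range(2000):
    theta = reinforce_step(theta)
p_final = softmax(theta)
print(p_final)   # probability of action 0 grows toward 1
```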


## Example

• Parametrized policy: $$\pi_\theta(0)=$$stay, $$\pi_\theta(\mathsf{stay}|1) = \theta^{(1)}$$ and $$\pi_\theta(\mathsf{switch}|1) = \theta^{(2)}$$.
• Compute the score (PollEV):
• $$\nabla_\theta \log \pi_\theta(a|s)=\begin{bmatrix} 1/\theta^{(1)} \cdot \mathbb 1\{a=\mathsf{stay}\} \\ 1/\theta^{(2)} \cdot \mathbb 1\{a=\mathsf{switch}\}\end{bmatrix}$$
• Initialize $$\theta^{(1)}_0=\theta^{(2)}_0=1/2$$

• rollout, then sum score over trajectory $$g_0 \propto \begin{bmatrix} \text{\# times } s=1,a=\mathsf{stay} \\ \text{\# times } s=1,a=\mathsf{switch} \end{bmatrix}$$

• Direction of update depends on empirical action frequency, size depends on $$R(\tau)$$

reward: $$+1$$ if $$s=0$$ and $$-\frac{1}{2}$$ if $$a=$$ switch

Claim: The gradient estimate $$g_i=\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau)$$  is unbiased

• Recall that $$J(\theta) = \mathop{\mathbb E}_{\tau\sim \mathbb P^{\pi_\theta}_{\mu_0}}\left[R(\tau)\right]$$
• by the Monte Carlo gradient, $$\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)]$$
• Since $$\tau \sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}$$, it suffices to show that $$\textstyle \nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)] = \sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]$$
• Key ideas:
• $$\mathbb P^{\pi_{\theta}}_{\mu_0}$$ factors into terms depending on $$P$$ and $$\pi_\theta$$
• the logarithm of a product is the sum of the logarithms
• only terms depending on $$\theta$$ affect the gradient

We have that $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$

• Using the Monte Carlo derivation from last lecture, $$\nabla J(\theta_i) = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i} R(\tau)]$$
• $$\log \mathbb P^{\pi_{\theta_i}}_{\mu_0}(\tau)$$
• $$=\log \left(\mu_0(s_0) \pi_{\theta_i} (a_0|s_0) P(s_1|a_0,s_0) \pi_{\theta_i} (a_1|s_1) P(s_2|a_1,s_1)...\right)$$
• $$=\log \mu_0(s_0) + \sum_{t=0}^\infty \left(\log \pi_{\theta_i} (a_t|s_t)+ \log P(s_{t+1}|a_t,s_t)\right)$$
• $$\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}$$
• $$=\cancel{\nabla_\theta \log \mu_0(s_0)} + \sum_{t=0}^\infty \left(\nabla_\theta [\log \pi_{\theta} (a_t|s_t)]_{\theta=\theta_i}+ \cancel{\nabla_\theta \log P(s_{t+1}|a_t,s_t)}\right)$$
• Thus $$\nabla_\theta \log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)$$ ends up having no dependence on unknown $$P$$!
• $$\mathbb E[g_i| \theta_i] = \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\sum_{t=0}^\infty \nabla_\theta[\log \pi_\theta(a_t|s_t)]_{\theta=\theta_i}R(\tau) ]$$
• $$= \mathbb E_{\tau\sim \mathbb P^{\pi_{\theta_i}}_{\mu_0}}[\nabla_\theta[\log \mathbb P^{\pi_{\theta}}_{\mu_0}(\tau)]_{\theta=\theta_i}R(\tau) ]$$ by above
• $$= \nabla J(\theta_i)$$

## Agenda

1. Recap

2. Policy Optimization

3. with Trajectories

4. with Value

• So far, methods depend on entire trajectory rollout
• This leads to high variance estimates
• Incorporating (Q) Value function can reduce variance
• In practice, can only use estimates of Q/Value
• results in bias (Lecture 15)
• we ignore this issue today

## Motivation: PG with Value


Algorithm: Idealized Actor Critic

• Given $$\alpha$$. Initialize $$\theta_0$$
• For $$i=0,1,...$$:
• Roll "in" policy $$\pi_{\theta_i}$$ to sample $$s,a\sim d_{\mu_0}^{\pi_{\theta_i}}$$
• Estimate $$g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}Q^{\pi_{\theta_i}}(s,a)$$
• Update $$\theta_{i+1} = \theta_i + \alpha g_i$$
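
A minimal sketch of the idealized actor critic on a hypothetical one-state MDP with $$\gamma = 0$$, so that $$\frac{1}{1-\gamma} = 1$$ and $$Q^{\pi}(s,a) = r(a)$$ exactly; in practice $$Q^{\pi_{\theta_i}}$$ must itself be estimated (Lecture 15):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

# One state, two actions, gamma = 0: Q^pi(s, a) = r(a) exactly.
r = np.array([1.0, 0.0])

def actor_critic_step(theta, alpha=0.1):
    p = softmax(theta)
    a = rng.choice(2, p=p)                 # s, a ~ d^pi (s is fixed here)
    score = np.eye(2)[a] - p               # grad log softmax
    g = score * r[a]                       # (1/(1-gamma)) * score * Q, gamma = 0
    return theta + alpha * g

theta = np.zeros(2)
for _ in range(2000):
    theta = actor_critic_step(theta)
p_final = softmax(theta)
print(p_final)   # concentrates on the higher-Q action 0
```

Compared to REINFORCE above, the sampled return $$R(\tau)$$ is replaced by the exact $$Q$$ value, which is the source of the variance reduction.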

## Policy Gradient with (Q) Value

Claim: The gradient estimate is unbiased $$\mathbb E[g_i| \theta_i] = \nabla J(\theta_i)$$

• Product rule on $$J(\theta) =\mathbb E_{\substack{s_0\sim \mu_0 \\ a_0\sim\pi_\theta(s_0)}}[ Q^{\pi_\theta}(s_0, a_0)]$$ to derive recursion $$\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] = \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$$
• Starting with a different decomposition of cumulative reward: $$\nabla J(\theta) = \nabla_{\theta} \mathbb E_{s_0\sim\mu_0}[V^{\pi_\theta}(s_0)] =\mathbb E_{s_0\sim\mu_0}[ \nabla_{\theta} V^{\pi_\theta}(s_0)]$$
• $$\nabla_{\theta} V^{\pi_\theta}(s_0) = \nabla_{\theta} \mathbb E_{a_0\sim\pi_\theta(s_0)}[ Q^{\pi_\theta}(s_0, a_0)]$$
• $$= \nabla_{\theta} \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0) Q^{\pi_\theta}(s_0, a_0)$$
• $$=\sum_{a_0\in\mathcal A} \left( \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) + \pi_\theta(a_0|s_0) \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)]\right)$$
• Considering each term:
• $$\sum_{a_0\in\mathcal A} \nabla_{\theta} [\pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0) = \sum_{a_0\in\mathcal A} \pi_\theta(a_0|s_0) \frac{\nabla_{\theta} [\pi_\theta(a_0|s_0) ]}{\pi_\theta(a_0|s_0) } Q^{\pi_\theta}(s_0, a_0)$$
• $$= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)]$$
• $$\sum_{a_0\in\mathcal A}\pi_\theta(a_0|s_0) \nabla_{\theta} [Q^{\pi_\theta}(s_0, a_0)] = \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} Q^{\pi_\theta}(s_0, a_0)]$$
• $$= \mathbb E_{a_0\sim\pi_\theta(s_0)}[ \nabla_{\theta} [r(s_0,a_0) + \gamma \mathbb E_{s_1\sim P(s_0, a_0)}V^{\pi_\theta}(s_1)]]$$
• $$=\gamma \mathbb E_{a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$$ since $$r(s_0,a_0)$$ does not depend on $$\theta$$
• Recursion $$\mathbb E_{s_0}[\nabla_{\theta} V^{\pi_\theta}(s_0)] = \mathbb E_{s_0,a_0}[ \nabla_{\theta} [\log \pi_\theta(a_0|s_0) ] Q^{\pi_\theta}(s_0, a_0)] + \gamma \mathbb E_{s_0,a_0,s_1}[\nabla_\theta V^{\pi_\theta}(s_1)]$$
• Iterating this recursion leads to $$\nabla J(\theta) = \sum_{t=0}^\infty \gamma^t \mathbb E_{s_t, a_t}[\nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t)]$$ $$= \sum_{t=0}^\infty \gamma^t \sum_{s_t, a_t} d_{\mu_0, t}^{\pi_\theta}(s_t, a_t) \nabla_{\theta} [\log \pi_\theta(a_t|s_t) ] Q^{\pi_\theta}(s_t, a_t) =\frac{1}{1-\gamma}\mathbb E_{s,a\sim d_{\mu_0}^{\pi_\theta}}[\nabla_{\theta} [\log \pi_\theta(a|s) ] Q^{\pi_\theta}(s, a)]$$

The Advantage function is $$A^{\pi_{\theta_i}}(s,a) = Q^{\pi_{\theta_i}}(s,a) - V^{\pi_{\theta_i}}(s)$$

• Claim: Same as previous slide in expectation over actions: $$\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]A^{\pi_{\theta}}(s,a)] = \mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]Q^{\pi_{\theta}}(s,a)]$$
• It suffices to show that $$\mathbb E_{a\sim \pi_{\theta}(s)}[\nabla_\theta[\log \pi_\theta(a|s)]V^{\pi_{\theta}}(s)]=0$$

Algorithm: Idealized Actor Critic with Advantage

• Same as previous slide, except estimation step
• Estimate $$g_i = \frac{1}{1-\gamma} \nabla_\theta[\log \pi_\theta(a|s)]_{\theta=\theta_i}A^{\pi_{\theta_i}}(s,a)$$
• Claim: $$\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right] = 0$$
• General principle: subtracting any action-independent "baseline" does not affect expected value
• Proof of claim:
• $$\mathbb E_{a\sim \pi_\theta(s)}\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]$$
• $$=\sum_{a\in\mathcal A} \pi_\theta(a|s)\left[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)\right]$$
• $$=\sum_{a\in\mathcal A} \pi_\theta(a|s) \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} \cdot b(s)$$
• $$=\nabla_\theta \sum_{a\in\mathcal A}\pi_\theta(a|s) \cdot b(s)$$
• $$=\nabla_\theta b(s) = 0$$
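
For a softmax policy, the claim can be verified exactly by summing over actions; a quick numeric check (the baseline value $$b$$ is arbitrary):

```python
import numpy as np

theta = np.array([0.3, -1.2, 0.7])
p = np.exp(theta) / np.exp(theta).sum()      # softmax policy pi_theta(a|s)
b = 5.0                                      # any action-independent baseline

# grad_theta log pi_theta(a|s) = onehot(a) - p for a softmax policy
scores = np.eye(3) - p                       # row a = score of action a
expectation = p @ scores * b                 # sum_a pi(a|s) * score(a) * b
print(expectation)                           # zero vector (up to float error)
```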

## Recap

• PSet 4 due Friday (moved from Wednesday)
• PA due Fri

• PG with rollouts: random search and REINFORCE
• PG with value: Actor-Critic and baselines

• Next lecture: Trust Regions
