Policy Optimization

ML in Feedback Sys #16

Fall 2025, Prof Sarah Dean

Policy Optimization

"What we do"

  • Given a parametrized policy class \(\Pi_\theta\) and initial policy
  • For \(i=0,1,2,...\)
    • "Rollout" the policy, i.e. collect trajectory \(\tau_i=\{x_t,a_t,c_t\}\)
    • Use trajectory to estimate the total cost \(\hat J_i = \sum_t c_t\)
    • Determine descent direction \(g_i=\hat J_i d_i\)
    • Update the policy according to \(\theta_{i+1} = \theta_i - \eta g_i\)

[Diagram: the policy \(\pi_\theta\) maps observations \(x_t\) to actions \(a_t\) in closed loop with the system.]

Policy Optimization

"What we do"

  • Given a parametrized policy class \(\Pi_\theta\) and initial policy
    • either deterministic \(\pi_{\theta_0}:\mathcal X\to\mathcal A\)
    • or stochastic \(\pi_{\theta_0}:\mathcal X\to\Delta(\mathcal A)\), notation \(\pi_{\theta}(a|x)\)
    • the input \(x\) could be the state \(s\) or an auto-regressive (AR) history of observations \(y\)
  • For \(i=0,1,2,...\)
    • "Rollout" the policy, i.e. collect trajectory \(\tau_i=\{x_t,a_t,c_t\}\)
      • if stochastic, select actions \(a_t\sim \pi_{\theta_i}(\cdot\,|\,x_t)\)
      • if deterministic, sample \(v_i\sim\mathcal N(0,I)\) and select \(a_t=\pi_{\theta_i+\delta v_i}(x_t)\)
    • Use trajectory to estimate the total cost \(\hat J_i = \sum_t c_t\)
    • Determine descent direction \(g_i=\hat J_i d_i\) where
      • \(d_i = \frac{1}{\delta} v_i\) or \(d_i = \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)
    • Update the policy according to \(\theta_{i+1} = \theta_i - \eta g_i\)
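
As a point of reference, here is a minimal Python sketch of one iteration of this loop in its deterministic-policy (parameter-perturbation) form; the `rollout` helper is a hypothetical stand-in for however trajectories are collected from the system.

```python
import numpy as np

def random_search_step(theta, rollout, delta=0.05, eta=1e-3):
    """One iteration of the loop above (parameter-perturbation variant).

    rollout(theta) is assumed to run the deterministic policy pi_theta for one
    episode and return the sequence of per-step costs c_t.
    """
    v = np.random.randn(*theta.shape)      # v_i ~ N(0, I)
    costs = rollout(theta + delta * v)     # collect trajectory with perturbed policy
    J_hat = float(np.sum(costs))           # hat J_i = sum_t c_t
    g = J_hat * v / delta                  # g_i = hat J_i * d_i,  with d_i = v_i / delta
    return theta - eta * g                 # theta_{i+1} = theta_i - eta * g_i
```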

Policy Optimization

"Why we do it"

  • Fact 1: This approach is flexible and general purpose.
  • Fact 2: In both cases, the descent direction \(g_i\) approximates the gradient \(\nabla_\theta J(\theta_i)\).
  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • Directly tackles the optimal control objective


     
  • Does not require knowledge of dynamics or cost functions
  • Does not require state estimation or separation principle
    • Sometimes combined with state estimator, so \(x_t=\hat s_t\) or even \(x_t=(\hat s_t, \Sigma_t)\)
  • Easy to leverage expert demonstrations \(\{x_i,a_i\}_{i=1}^N\) using supervised learning (behavior cloning) $$\theta_0 = \arg\min_\theta \sum_{i=1}^N \ell(a_i,\pi_\theta(x_i)) $$
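
For the linear policy class \(\pi_\theta(x)=\Theta x\) with squared loss, this initialization is an ordinary least-squares problem; a minimal sketch (array names and shapes are illustrative assumptions):

```python
import numpy as np

def init_from_demonstrations(X, A):
    """theta_0 = argmin_Theta sum_i || a_i - Theta x_i ||^2 for a linear policy.

    X: (N, n_x) array of expert observations x_i
    A: (N, n_a) array of expert actions a_i
    Returns Theta with shape (n_a, n_x), so that pi_theta(x) = Theta @ x.
    """
    Theta_T, *_ = np.linalg.lstsq(X, A, rcond=None)  # solves X @ Theta_T ≈ A
    return Theta_T.T
```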

Pros of Policy Optimization

$$ \min_{\theta}~~ \underbrace{\mathbb E_{w,v}\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]}_{=J(\theta)}\quad \text{s.t.}\quad  s_{k+1} = F(s_k, a_k,w_k),~~y_k=H(s_k,v_k),~~a_k=\pi^\theta_k(a_{0:k-1}, y_{0:k}) $$

  • Widely applicable, with examples spanning
    • Board Games: Chess, Go, Shogi
    • Video Games: Atari, Dota, Starcraft
    • Robotics: manipulation, locomotion, drones
    • LLMs: problem solving, "alignment"
    • Recommendation: targeted ads, social media, search
    • etc....

Policy Gradient

Fact 2: In both cases, the descent direction \(g_i\) approximates the gradient of the total cost, \(\nabla_\theta J(\theta_i)\).

Sampling parameters

  • deterministic policies \(\mathcal X\to\mathcal A\)
  • \(v_i\sim\mathcal N(0,I)\) and \(a_t=\pi_{\theta_i+\delta v_i}(x_t)\)
  • \(\hat J_i = \sum_t c_t\)
  • \(g_i = \frac{1}{\delta}\hat J_i  v_i\) 

Sampling actions

  • stochastic policies \(\mathcal X\to\Delta(\mathcal A)\)
  • \(a_t\sim \pi_{\theta_i}(x_t)\)
  • \(\hat J_i = \sum_t c_t\)
  • \(g_i = \hat J_i \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)

Sampling in Parameter Space

  • Fact 2a: In expectation, the descent direction is approximately equal to the gradient of the total cost $$\|\mathbb E[\hat J_i  v_i/\delta] -\nabla_\theta J(\theta_i)\| = O(\delta) $$
  • Proof:
    • By definition and rollout, \(\mathbb E [\hat J_i|v_i]  =\mathbb E[\sum_t c_t|v_i]= J(\theta_i+\delta v_i)\)
    • Taylor expansion: \(J(\theta_i+\delta v_i)=J(\theta_i) + \nabla J(\theta_i)^\top \delta v_i + O(\delta^2)\)
    • \( \mathbb E[g_i]=\mathbb E [\mathbb E [\hat J_i|v_i] \frac{1}{\delta}  v_i ]\) by the tower property of conditional expectation
      • \( =\mathbb E [ \frac{1}{\delta} J(\theta_i+\delta v_i)  v_i ] \)
      • \( = \frac{1}{\delta}  \mathbb E [ J(\theta_i)v_i + \delta v_i v_i^\top \nabla J(\theta_i)  + O(\delta^2) v_i ] \)
      • \( = \nabla J(\theta_i)  + O(\delta) \), using \(\mathbb E[v_i]=0\) and \(\mathbb E[v_iv_i^\top]=I\)
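
Fact 2a is easy to check numerically for a known smooth cost; the short Monte Carlo sketch below (with an arbitrary test function standing in for \(J\)) compares the average of \(\hat J_i v_i/\delta\) to the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, samples = 5, 0.1, 500_000

# smooth test cost J(theta) = sum_j cos(theta_j), with gradient -sin(theta)
theta = rng.standard_normal(n)
grad_true = -np.sin(theta)

V = rng.standard_normal((samples, n))                  # v_i ~ N(0, I)
J_vals = np.cos(theta + delta * V).sum(axis=1)         # J(theta_i + delta v_i)
g_mean = (J_vals[:, None] * V / delta).mean(axis=0)    # average of hat J_i v_i / delta

# difference is O(delta) plus Monte Carlo error, small relative to ||grad_true||
print(np.linalg.norm(g_mean - grad_true), np.linalg.norm(grad_true))
```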

Sampling in Parameter Space

  • Fact 2a: In expectation, the descent direction is equal to the gradient of the smoothed total cost \(J_\delta(\theta) = \mathbb E_{v\sim\mathcal N(0,I)}[J(\theta+\delta v)]\) $$\mathbb E[g_i] =  \nabla_\theta J_\delta(\theta)$$ and \(\|\nabla J_\delta(\theta)  -\nabla_\theta J(\theta)\| = O(\delta) \)
  • Proof:
    • By definition and rollout, \(\mathbb E [\hat J_i|v_i]  =\mathbb E[\sum_t c_t|v_i]= J(\theta_i+\delta v_i)\)
    • Stein's Lemma \(\nabla J_\delta(\theta) = \mathbb E [\nabla J(\theta+\delta v)] = \frac{1}{\delta^2}  \mathbb E [J(\theta+\delta v)\cdot \delta v]  \)
    • \( \mathbb E[g_i]=\mathbb E [\mathbb E [\hat J_i|v_i] \frac{1}{\delta}  v_i ]\) by the tower property of conditional expectation
      • \( =\mathbb E [ \frac{1}{\delta} J(\theta_i+\delta v_i)  v_i ] \)
      • \( = \nabla J_\delta(\theta_i)\)
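
For completeness, the Stein's Lemma step is Gaussian integration by parts: for \(v\sim\mathcal N(0,I)\) and smooth \(f\), \(\mathbb E[f(v)\,v]=\mathbb E[\nabla_v f(v)]\). Applying this with \(f(v)=J(\theta+\delta v)\) gives $$\mathbb E[J(\theta+\delta v)\, v] = \mathbb E[\nabla_v J(\theta+\delta v)] = \delta\, \mathbb E[\nabla J(\theta+\delta v)] = \delta\, \nabla J_\delta(\theta),$$ which is the identity used above after dividing by \(\delta\).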

Sampling in Action Space

  • Fact 2b: In expectation, the descent direction is exactly equal to the gradient of the total cost $$\mathbb E\big[\hat J_i   \textstyle{\sum_{t} \nabla_\theta } \log\pi_{\theta_i}(a_t|x_t) \big] =\nabla_\theta J(\theta_i) $$
  • Proof:
    • Recall \(\tau = \{x_t,a_t,c_t\}\) and define \(\hat J(\tau) = \sum_t c_t\)
    • Let \(P_\theta\) denote the joint distribution over \(x_t,a_t,c_t\) induced by policy \(\pi_\theta\), so that the trajectory \(\tau\sim P_\theta\)
    • Then by definition \(J(\theta) = \mathbb E_{\tau\sim P_\theta} [\hat J(\tau)]\)
    • Then \(\nabla_\theta J(\theta) = \nabla_\theta \int \hat J(\tau) P_\theta(\tau)\, d\tau\)
      • \(= \int \hat J(\tau) \nabla_\theta P_\theta(\tau)\, d\tau\)
      • \(= \int \hat J(\tau) P_\theta(\tau) \nabla_\theta \log  P_\theta(\tau)\, d\tau\)  (log-derivative trick)
      • \(=\mathbb E_{\tau\sim P_\theta}[\hat J(\tau) \nabla_\theta  \log  P_\theta(\tau)]\)
    • The trajectory distribution factorizes as \( P_\theta(\tau) = P(x_0)\,\pi_\theta(a_0|x_0)\, P(x_1|x_0,a_0)\, \pi_\theta(a_1|x_1)\, P(x_2|x_{0:1},a_{0:1})\, \pi_\theta(a_2|x_2) \cdots\)
    • so \( \log P_\theta(\tau) = \log P(x_0) + \sum_t \log P(x_{t+1}|x_{0:t},a_{0:t} ) + \sum_t \log \pi_\theta(a_t|x_t) \)
    • and \( \nabla_\theta \log P_\theta(\tau) = 0 +\sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t) \), since the initial-state and dynamics terms do not depend on \(\theta\)
    • Combining, \(\nabla_\theta J(\theta) =\mathbb E_{\tau\sim P_\theta}\big[\hat J(\tau)  \sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t)\big] \), which is the expectation of \(g_i\)
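
A minimal Python sketch of the resulting estimator for a linear-softmax policy over a discrete action set (this parametrization is an illustrative assumption, not the lecture's example):

```python
import numpy as np

def softmax_policy(theta, x):
    """pi_theta(a | x) with logits theta @ x; theta has shape (n_actions, n_features)."""
    logits = theta @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_direction(theta, trajectory):
    """g_i = hat J_i * sum_t grad_theta log pi_theta(a_t | x_t) for one trajectory.

    trajectory: list of (x_t, a_t, c_t) tuples collected by rolling out pi_theta,
    with a_t an integer action index.
    """
    J_hat = sum(c for _, _, c in trajectory)
    g = np.zeros_like(theta)
    for x, a, _ in trajectory:
        p = softmax_policy(theta, x)
        g[a] += x                 # gradient of the theta_a . x term
        g -= np.outer(p, x)       # gradient of the -logsumexp(theta @ x) term
    return J_hat * g
```

The update is then \(\theta_{i+1} = \theta_i - \eta\, g_i\), as in the loop at the start of the lecture.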

Gradient Approximation

Sampling parameters \(g_i = \frac{1}{\delta}\hat J_i  v_i\)

  • Deterministic policy \(\mathcal X\to\mathcal A\)
  • Approximately equal to gradient \(\nabla J(\theta)\) in expectation
  • High variance
  • Variance reduction tricks
    • sample \(\theta \pm \delta v_i\)
    • subtract baseline \(\approx J(\theta)\)
    • sample multiple trajectories
  • Requires rollouts ("online")
  • Parameter space is small for carefully designed control policies, but large for deep policies 

Sampling actions \(g_i = \hat J_i \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)

  • Stochastic policy \(\mathcal X\to\Delta(\mathcal A)\)
  • Exactly equal to gradient \(\nabla J(\theta)\) in expectation
  • High variance
  • Variance reduction tricks
    • truncate to the cost-to-go: \(\sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t)\sum_{k\geq t} c_k\) (see the sketch after this list)
    • subtract baseline \(\approx J(\theta)\)
    • sample multiple trajectories
  • Extensions beyond "online" formulation
  • Action space is small in discrete domains (games), but large for continuous domains (feedback control)
  • Both methods approximate the gradient (first-order information) with function evaluations (zeroth-order information)
    • Necessary when only black-box access is available, i.e. real world
  • This process is noisy and requires many samples/iterations
    • Quite difficult to actually deploy in the real world...
    • Instead, using simulation is ubiquitous
  • Automatic differentiation is a key insight of the deep learning revolution: computing derivatives is not fundamentally much harder than computing function evaluations
  • If you can compute gradients, you should* use them!
  • If you can write a differentiable simulator, you should!
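
The cost-to-go and baseline tricks listed above are small modifications of the action-sampling estimator; a sketch under the same assumptions as before, with `grad_log_pi` a hypothetical helper returning \(\nabla_\theta \log\pi_\theta(a|x)\):

```python
import numpy as np

def variance_reduced_direction(theta, trajectory, grad_log_pi, b=0.0):
    """Action-sampling descent direction with cost-to-go and a baseline.

    trajectory: list of (x_t, a_t, c_t) tuples from one rollout of pi_theta.
    grad_log_pi(theta, x, a): gradient of log pi_theta(a | x) w.r.t. theta.
    b: scalar baseline; subtracting any fixed b leaves the expectation unchanged
       because E[grad_theta log pi_theta(a_t | x_t)] = 0.
    """
    costs = np.array([c for _, _, c in trajectory], dtype=float)
    cost_to_go = np.cumsum(costs[::-1])[::-1]      # sum_{k >= t} c_k
    g = np.zeros_like(theta)
    for t, (x, a, _) in enumerate(trajectory):
        g += (cost_to_go[t] - b) * grad_log_pi(theta, x, a)
    return g
```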

LQ Optimization Landscape

  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • For linear policies, the optimization is non-convex in \(K\) $$ \min_{K_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top (Q + K_k^\top R K_k)s_k \Big]\quad \text{s.t.}\quad s_{k+1} = (F +GK_k)s_k+w_k $$
    • Example: for a 1D system with \(F=G=1\), \(\mathbb E[s_2] = (1+K_1)(1+K_0)s_0\), a product of the decision variables (see below)
  • However, a reparametrization recovers convexity
    • e.g. in terms of new variables \(P_t\) (cost matrices) and \(Y_t = P_t K_t\)
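
To see the non-convexity concretely (restricting attention to the contribution of \(\mathbb E[s_2]\) alone): along the line \(K_0 = s\), \(K_1 = -s\), $$\big(\mathbb E[s_2]\big)^2 = (1+K_0)^2(1+K_1)^2 s_0^2 = (1-s^2)^2 s_0^2,$$ whose second derivative at \(s=0\) is \(-4 s_0^2 < 0\); this product structure is what makes the total cost non-convex in the gains \((K_0, K_1)\).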

LQ Optimization Landscape

[Figure: surface plot of \(z=(1+x)(1+y)\), illustrating the non-convex landscape.]

  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • For linear policies, the optimization is non-convex in \(K\)
  • Sun & Fazel (2021) show that a bijective convex re-parametrization guarantees the gradient dominance property: $$\|\nabla J(\theta)\|^2 \geq 2 \mu (J(\theta)-J(\theta^\star))$$
  • Gradient dominance ensures that all critical points are global minima and that gradient descent (with an appropriate step size) converges.
  • Gradient dominance does not hold for LQG when the policy is parametrized by a feedback gain \(K\) and filter gain \(L\), but it does hold for linear auto-regressive policies (Fallah et al., 2025)
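
For reference, the standard consequence of gradient dominance (together with \(L\)-smoothness of \(J\), an assumption not stated on the slide): gradient descent with step size \(\eta = 1/L\) satisfies $$J(\theta_{i+1}) - J(\theta^\star) \leq \Big(1 - \frac{\mu}{L}\Big)\big(J(\theta_i) - J(\theta^\star)\big),$$ i.e. linear convergence of the cost to its global minimum.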

LQ Optimization Landscape

On the Gradient Domination of the LQG Problem. Fallah, Toso, Anderson, 2025.

Recap

  • Policy Optimization
  • Gradient Approximation
  • Optimization Landscape

Next time: Guest Lecture on Off-Policy Learning

Announcements

  • Paper presentations start next week!
