Policy Optimization

ML in Feedback Sys #16

Fall 2025, Prof Sarah Dean

Policy Optimization

"What we do"

  • Given a parametrized policy class \(\Pi_\theta\) and initial policy
  • For \(i=0,1,2,...\)
    • "Rollout" the policy, i.e. collect trajectory \(\tau_i=\{x_t,a_t,c_t\}\)
    • Use trajectory to estimate the total cost \(\hat J_i = \sum_t c_t\)
    • Determine descent direction \(g_i=\hat J_i d_i\)
    • Update the policy according to \(\theta_{i+1} = \theta_i - \eta g_i\)

[Diagram: the policy \(\pi_\theta\) maps observations \(x_t\) to actions \(a_t\) in closed loop with the system.]

Policy Optimization

"What we do"

  • Given a parametrized policy class \(\Pi_\theta\) and initial policy
    • either deterministic \(\pi_{\theta_0}:\mathcal X\to\mathcal A\)
    • or stochastic \(\pi_{\theta_0}:\mathcal X\to\Delta(\mathcal A)\), notation \(\pi_{\theta}(a|x)\)
    • the input \(x\) could be the state \(s\) or an auto-regressive (AR) history of observations \(y\)
  • For \(i=0,1,2,...\)
    • "Rollout" the policy, i.e. collect trajectory \(\tau_i=\{x_t,a_t,c_t\}\)
      • if stochastic, select actions \(a_t\sim \pi_{\theta_i}(\cdot\,|\,x_t)\)
      • if deterministic, sample \(v_i\sim\mathcal N(0,I)\) and select \(a_t=\pi_{\theta_i+\delta v_i}(x_t)\)
    • Use trajectory to estimate the total cost \(\hat J_i = \sum_t c_t\)
    • Determine descent direction \(g_i=\hat J_i d_i\) where
      • \(d_i = \frac{1}{\delta} v_i\) or \(d_i = \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)
    • Update the policy according to \(\theta_{i+1} = \theta_i - \eta g_i\)
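
As a point of reference, here is a minimal Python sketch of one iteration of this loop in its deterministic-policy (parameter-perturbation) form; the `rollout` helper is a hypothetical stand-in for however trajectories are collected from the system.

```python
import numpy as np

def random_search_step(theta, rollout, delta=0.05, eta=1e-3):
    """One iteration of the loop above (parameter-perturbation variant).

    rollout(theta) is assumed to run the deterministic policy pi_theta for one
    episode and return the sequence of per-step costs c_t.
    """
    v = np.random.randn(*theta.shape)      # v_i ~ N(0, I)
    costs = rollout(theta + delta * v)     # collect trajectory with perturbed policy
    J_hat = float(np.sum(costs))           # hat J_i = sum_t c_t
    g = J_hat * v / delta                  # g_i = hat J_i * d_i,  with d_i = v_i / delta
    return theta - eta * g                 # theta_{i+1} = theta_i - eta * g_i
```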

Policy Optimization

"Why we do it"

  • Fact 1: This approach is flexible and general purpose.
  • Fact 2: In both cases, the descent direction \(g_i\) approximates the gradient \(\nabla_\theta J(\theta_i)\).
  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • Directly tackles the optimal control objective


     
  • Does not require knowledge of dynamics or cost functions
  • Does not require state estimation or separation principle
    • Sometimes combined with state estimator, so \(x_t=\hat s_t\) or even \(x_t=(\hat s_t, \Sigma_t)\)
  • Easy to leverage expert demonstrations \(\{x_i,a_i\}_{i=1}^N\) using supervised learning (behavior cloning) $$\theta_0 = \arg\min_\theta \sum_{i=1}^N \ell(a_i,\pi_\theta(x_i)) $$
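
For the linear policy class \(\pi_\theta(x)=\Theta x\) with squared loss, this initialization is an ordinary least-squares problem; a minimal sketch (array names and shapes are illustrative assumptions):

```python
import numpy as np

def init_from_demonstrations(X, A):
    """theta_0 = argmin_Theta sum_i || a_i - Theta x_i ||^2 for a linear policy.

    X: (N, n_x) array of expert observations x_i
    A: (N, n_a) array of expert actions a_i
    Returns Theta with shape (n_a, n_x), so that pi_theta(x) = Theta @ x.
    """
    Theta_T, *_ = np.linalg.lstsq(X, A, rcond=None)  # solves X @ Theta_T ≈ A
    return Theta_T.T
```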

Pros of Policy Optimization

$$ \min_{\theta}~~ \underbrace{\mathbb E_{w,v}\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big ]}_{=J(\theta)}\quad \text{s.t.}\quad  s_{k+1} = F(s_k, a_k,w_k),~~y_k=H(s_k,v_k),~~a_k=\pi^\theta_k(a_{0:k-1}, y_{0:k}) $$

  • Widely applicable, with examples spanning
    • Board Games: Chess, Go, Shogi
    • Video Games: Atari, Dota, Starcraft
    • Robotics: manipulation, locomotion, drones
    • LLMs: problem solving, "alignment"
    • Recommendation: targeted ads, social media, search
    • etc....

Policy Gradient

Fact 2: In both cases, the descent direction \(g_i\) approximates the gradient of the total cost, \(\nabla_\theta J(\theta_i)\).

Sampling parameters

  • deterministic policies \(\mathcal X\to\mathcal A\)
  • \(v_i\sim\mathcal N(0,I)\) and \(a_t=\pi_{\theta_i+\delta v_i}(x_t)\)
  • \(\hat J_i = \sum_t c_t\)
  • \(g_i = \frac{1}{\delta}\hat J_i  v_i\) 

Sampling actions

  • stochastic policies \(\mathcal X\to\Delta(\mathcal A)\)
  • \(a_t\sim \pi_{\theta_i}(x_t)\)
  • \(\hat J_i = \sum_t c_t\)
  • \(g_i = \hat J_i \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)

Sampling in Parameter Space

  • Fact 2a: In expectation, the descent direction is approximately equal to the gradient of the total cost $$\|\mathbb E[\hat J_i  v_i/\delta] -\nabla_\theta J(\theta_i)\| = O(\delta) $$
  • Proof:
    • By definition and rollout, \(\mathbb E [\hat J_i|v_i]  =\mathbb E[\sum_t c_t|v_i]= J(\theta_i+\delta v_i)\)
    • Taylor expansion: \(J(\theta_i+\delta v_i)=J(\theta_i) + \nabla J(\theta_i)^\top \delta v_i + O(\delta^2)\)
    • \( \mathbb E[g_i]=\mathbb E [\mathbb E [\hat J_i|v_i] \frac{1}{\delta}  v_i ]\) by the tower property of conditional expectation
      • \( =\mathbb E [ \frac{1}{\delta} J(\theta_i+\delta v_i)  v_i ] \)
      • \( = \frac{1}{\delta}  \mathbb E [ J(\theta_i)v_i + \delta v_i v_i^\top \nabla J(\theta_i)  + O(\delta^2) v_i ] \)
      • \( = \nabla J(\theta_i)  + O(\delta) \), using \(\mathbb E[v_i]=0\) and \(\mathbb E[v_iv_i^\top]=I\)
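
Fact 2a is easy to check numerically for a known smooth cost; the short Monte Carlo sketch below (with an arbitrary test function standing in for \(J\)) compares the average of \(\hat J_i v_i/\delta\) to the true gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
n, delta, samples = 5, 0.1, 500_000

# smooth test cost J(theta) = sum_j cos(theta_j), with gradient -sin(theta)
theta = rng.standard_normal(n)
grad_true = -np.sin(theta)

V = rng.standard_normal((samples, n))                  # v_i ~ N(0, I)
J_vals = np.cos(theta + delta * V).sum(axis=1)         # J(theta_i + delta v_i)
g_mean = (J_vals[:, None] * V / delta).mean(axis=0)    # average of hat J_i v_i / delta

# difference is O(delta) plus Monte Carlo error, small relative to ||grad_true||
print(np.linalg.norm(g_mean - grad_true), np.linalg.norm(grad_true))
```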

Sampling in Parameter Space

  • Fact 2a: In expectation, the descent direction is equal to the gradient of the smoothed total cost \(J_\delta(\theta) = \mathbb E_{v\sim\mathcal N(0,I)}[J(\theta+\delta v)]\) $$\mathbb E[g_i] =  \nabla_\theta J_\delta(\theta)$$ and \(\|\nabla J_\delta(\theta)  -\nabla_\theta J(\theta)\| = O(\delta) \)
  • Proof:
    • By definition and rollout, \(\mathbb E [\hat J_i|v_i]  =\mathbb E[\sum_t c_t|v_i]= J(\theta_i+\delta v_i)\)
    • Stein's Lemma \(\nabla J_\delta(\theta) = \mathbb E [\nabla J(\theta+\delta v)] = \frac{1}{\delta^2}  \mathbb E [J(\theta+\delta v)\cdot \delta v]  \)
    • \( \mathbb E[g_i]=\mathbb E [\mathbb E [\hat J_i|v_i] \frac{1}{\delta}  v_i ]\) by the tower property of conditional expectation
      • \( =\mathbb E [ \frac{1}{\delta} J(\theta_i+\delta v_i)  v_i ] \)
      • \( = \nabla J_\delta(\theta_i)\)
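
For completeness, the Stein's Lemma step is Gaussian integration by parts: for \(v\sim\mathcal N(0,I)\) and smooth \(f\), \(\mathbb E[f(v)\,v]=\mathbb E[\nabla_v f(v)]\). Applying this with \(f(v)=J(\theta+\delta v)\) gives $$\mathbb E[J(\theta+\delta v)\, v] = \mathbb E[\nabla_v J(\theta+\delta v)] = \delta\, \mathbb E[\nabla J(\theta+\delta v)] = \delta\, \nabla J_\delta(\theta),$$ which is the identity used above after dividing by \(\delta\).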

Sampling in Action Space

  • Fact 2b: In expectation, the descent direction is exactly equal to the gradient of the total cost $$\mathbb E\big[\hat J_i   \textstyle{\sum_{t} \nabla_\theta } \log\pi_{\theta_i}(a_t|x_t) \big] =\nabla_\theta J(\theta_i) $$
  • Proof:
    • Recall \(\tau = \{x_t,a_t,c_t\}\) and define \(\hat J(\tau) = \sum_t c_t\)
    • Let \(P_\theta\) denote the joint distribution over \(x_t,a_t,c_t\) induced by policy \(\pi_\theta\), so that the trajectory \(\tau\sim P_\theta\)
    • Then by definition \(J(\theta) = \mathbb E_{\tau\sim P_\theta} [\hat J(\tau)]\)
    • Then \(\nabla_\theta J(\theta) = \nabla_\theta \int \hat J(\tau) P_\theta(\tau)\, d\tau\)
      • \(= \int \hat J(\tau) \nabla_\theta P_\theta(\tau)\, d\tau\)
      • \(= \int \hat J(\tau) P_\theta(\tau) \nabla_\theta \log  P_\theta(\tau)\, d\tau\)  (log-derivative trick)
      • \(=\mathbb E_{\tau\sim P_\theta}[\hat J(\tau) \nabla_\theta  \log  P_\theta(\tau)]\)
    • The trajectory distribution factorizes as \( P_\theta(\tau) = P(x_0)\,\pi_\theta(a_0|x_0)\, P(x_1|x_0,a_0)\, \pi_\theta(a_1|x_1)\, P(x_2|x_{0:1},a_{0:1})\, \pi_\theta(a_2|x_2) \cdots\)
    • so \( \log P_\theta(\tau) = \log P(x_0) + \sum_t \log P(x_{t+1}|x_{0:t},a_{0:t} ) + \sum_t \log \pi_\theta(a_t|x_t) \)
    • and \( \nabla_\theta \log P_\theta(\tau) = 0 +\sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t) \), since the initial-state and dynamics terms do not depend on \(\theta\)
    • Combining, \(\nabla_\theta J(\theta) =\mathbb E_{\tau\sim P_\theta}\big[\hat J(\tau)  \sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t)\big] \), which is the expectation of \(g_i\)
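
A minimal Python sketch of the resulting estimator for a linear-softmax policy over a discrete action set (this parametrization is an illustrative assumption, not the lecture's example):

```python
import numpy as np

def softmax_policy(theta, x):
    """pi_theta(a | x) with logits theta @ x; theta has shape (n_actions, n_features)."""
    logits = theta @ x
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_direction(theta, trajectory):
    """g_i = hat J_i * sum_t grad_theta log pi_theta(a_t | x_t) for one trajectory.

    trajectory: list of (x_t, a_t, c_t) tuples collected by rolling out pi_theta,
    with a_t an integer action index.
    """
    J_hat = sum(c for _, _, c in trajectory)
    g = np.zeros_like(theta)
    for x, a, _ in trajectory:
        p = softmax_policy(theta, x)
        g[a] += x                 # gradient of the theta_a . x term
        g -= np.outer(p, x)       # gradient of the -logsumexp(theta @ x) term
    return J_hat * g
```

The update is then \(\theta_{i+1} = \theta_i - \eta\, g_i\), as in the loop at the start of the lecture.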

Gradient Approximation

Sampling parameters \(g_i = \frac{1}{\delta}\hat J_i  v_i\)

  • Deterministic policy \(\mathcal X\to\mathcal A\)
  • Approximately equal to gradient \(\nabla J(\theta)\) in expectation
  • High variance
  • Variance reduction tricks
    • sample \(\theta \pm \delta v_i\)
    • subtract baseline \(\approx J(\theta)\)
    • sample multiple trajectories
  • Requires rollouts ("online")
  • Parameter space is small for carefully designed control policies, but large for deep policies 

Sampling actions \(g_i = \hat J_i \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t|x_t)\)

  • Stochastic policy \(\mathcal X\to\Delta(\mathcal A)\)
  • Exactly equal to gradient \(\nabla J(\theta)\) in expectation
  • High variance
  • Variance reduction tricks
    • truncate to the cost-to-go: \(\sum_t \nabla_\theta  \log \pi_\theta(a_t|x_t)\sum_{k\geq t} c_k\) (see the sketch after this list)
    • subtract baseline \(\approx J(\theta)\)
    • sample multiple trajectories
  • Extensions beyond "online" formulation
  • Action space is small in discrete domains (games), but large for continuous domains (feedback control)
  • Both methods approximate the gradient (first-order information) with function evaluations (zeroth-order information)
    • Necessary when only black-box access is available, i.e. real world
  • This process is noisy and requires many samples/iterations
    • Quite difficult to actually deploy in the real world...
    • Instead, using simulation is ubiquitous
  • Automatic differentiation is a key insight of the deep learning revolution: computing derivatives is not fundamentally much harder than computing function evaluations
  • If you can compute gradients, you should* use them!
  • If you can write a differentiable simulator, you should!
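
The cost-to-go and baseline tricks listed above are small modifications of the action-sampling estimator; a sketch under the same assumptions as before, with `grad_log_pi` a hypothetical helper returning \(\nabla_\theta \log\pi_\theta(a|x)\):

```python
import numpy as np

def variance_reduced_direction(theta, trajectory, grad_log_pi, b=0.0):
    """Action-sampling descent direction with cost-to-go and a baseline.

    trajectory: list of (x_t, a_t, c_t) tuples from one rollout of pi_theta.
    grad_log_pi(theta, x, a): gradient of log pi_theta(a | x) w.r.t. theta.
    b: scalar baseline; subtracting any fixed b leaves the expectation unchanged
       because E[grad_theta log pi_theta(a_t | x_t)] = 0.
    """
    costs = np.array([c for _, _, c in trajectory], dtype=float)
    cost_to_go = np.cumsum(costs[::-1])[::-1]      # sum_{k >= t} c_k
    g = np.zeros_like(theta)
    for t, (x, a, _) in enumerate(trajectory):
        g += (cost_to_go[t] - b) * grad_log_pi(theta, x, a)
    return g
```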

LQ Optimization Landscape

  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • For linear policies, the optimization is non-convex in \(K\) $$ \min_{K_{0:T}} ~~\mathbb E_w\Big[\sum_{k=0}^{T} s_k^\top (Q + K_k^\top R K_k)s_k \Big]\quad \text{s.t.}\quad s_{k+1} = (F +GK_k)s_k+w_k $$
    • Example: for a 1D system with \(F=G=1\), \(\mathbb E[s_2] = (1+K_1)(1+K_0)s_0\), a product of the decision variables (see below)
  • However, a reparametrization recovers convexity
    • e.g. in terms of new variables \(P_t\) (cost matrices) and \(Y_t = P_t K_t\)
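
To see the non-convexity concretely (restricting attention to the contribution of \(\mathbb E[s_2]\) alone): along the line \(K_0 = s\), \(K_1 = -s\), $$\big(\mathbb E[s_2]\big)^2 = (1+K_0)^2(1+K_1)^2 s_0^2 = (1-s^2)^2 s_0^2,$$ whose second derivative at \(s=0\) is \(-4 s_0^2 < 0\); this product structure is what makes the total cost non-convex in the gains \((K_0, K_1)\).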

LQ Optimization Landscape

[Figure: surface plot of \(z=(1+x)(1+y)\), illustrating the non-convex landscape.]

  • Fact 3: For linear quadratic control, stationary points of \(J(\theta)=J(K)\) correspond to the optimal policy, even though the total cost is non-convex.
  • For linear policies, the optimization is non-convex in \(K\)
  • Sun & Fazel (2021) show that a bijective convex re-parametrization guarantees the gradient dominance property: $$\|\nabla J(\theta)\|^2 \geq 2 \mu (J(\theta)-J(\theta^\star))$$
  • Gradient dominance ensures that all critical points are global minima and that gradient descent (with an appropriate step size) converges.
  • Gradient dominance does not hold for LQG when the policy is parametrized by a feedback gain \(K\) and filter gain \(L\), but it does hold for linear auto-regressive policies (Fallah et al., 2025)
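
For reference, the standard consequence of gradient dominance (together with \(L\)-smoothness of \(J\), an assumption not stated on the slide): gradient descent with step size \(\eta = 1/L\) satisfies $$J(\theta_{i+1}) - J(\theta^\star) \leq \Big(1 - \frac{\mu}{L}\Big)\big(J(\theta_i) - J(\theta^\star)\big),$$ i.e. linear convergence of the cost to its global minimum.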

LQ Optimization Landscape

On the Gradient Domination of the LQG Problem. Fallah, Toso, Anderson, 2025.

Recap

  • Policy Optimization
  • Gradient Approximation
  • Optimization Landscape

Next time: Guest Lecture on Off-Policy Learning

Announcements

  • Paper presentations start next week!
