Fall 2025, Prof Sarah Dean
"What we do"
[Diagram: the policy \(\pi_\theta\) maps each observation \(x_t\) to an action \(a_t\)]
"What we do"
"Why we do it"
$$ \min_{\theta}~~ \underbrace{\mathbb E_{w,v}\Big[\sum_{k=0}^{T} c(s_k, a_k) \Big]}_{=J(\theta)}\quad \text{s.t.}\quad s_{k+1} = F(s_k, a_k, w_k),~~y_k = H(s_k, v_k),~~a_k = \pi^\theta_k(a_{0:k-1}, y_{0:k}) $$
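In practice the expectation over \(w, v\) is approximated by simulating closed-loop rollouts. A minimal sketch of such a Monte Carlo estimate of \(J(\theta)\), assuming hypothetical callables `F`, `H`, `cost`, and `policy` with the signatures noted in the docstring (these names are illustrative, not code from the lecture):

```python
import numpy as np

def estimate_J(theta, F, H, cost, policy, s0, T, num_rollouts=100, rng=None):
    """Monte Carlo estimate of J(theta) = E_{w,v}[ sum_{k=0}^T c(s_k, a_k) ].

    Assumed (hypothetical) callables:
      F(s, a, w)                        -> next state s_{k+1}
      H(s, v)                           -> observation y_k
      cost(s, a)                        -> scalar stage cost c(s_k, a_k)
      policy(theta, a_hist, y_hist, k)  -> action a_k = pi^theta_k(a_{0:k-1}, y_{0:k})
    """
    rng = np.random.default_rng() if rng is None else rng
    totals = []
    for _ in range(num_rollouts):
        s = np.asarray(s0, dtype=float)
        a_hist, y_hist, total = [], [], 0.0
        for k in range(T + 1):
            v = rng.standard_normal(size=s.shape)   # observation noise v_k (Gaussian assumed)
            y = H(s, v)
            y_hist.append(y)
            a = policy(theta, a_hist, y_hist, k)    # action from the output-feedback policy
            total += cost(s, a)
            w = rng.standard_normal(size=s.shape)   # process noise w_k (Gaussian assumed)
            s = F(s, a, w)                          # s_{k+1} = F(s_k, a_k, w_k)
            a_hist.append(a)
        totals.append(total)
    return float(np.mean(totals))
```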
Fact 2: In both cases, the descent direction \(g_i\) approximates the gradient of the total cost \(\nabla_\theta J(\theta_i)\)
Sampling parameters: \(g_i = \frac{1}{\delta}\hat J_i v_i\)
Sampling actions: \(g_i = \hat J_i \sum_{t} \nabla_\theta \log\pi_{\theta_i}(a_t\mid x_t)\)
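A minimal sketch of the two descent directions, assuming a Gaussian parameter perturbation \(v_i\), a cost estimator `J_hat_fn` (e.g., a rollout estimate like `estimate_J` above), a trajectory sampler `rollout_fn`, and a score function `grad_log_pi`; all of these names are placeholders, not code from the lecture or the cited papers:

```python
import numpy as np

def zeroth_order_direction(theta, J_hat_fn, delta, rng):
    """Sampling parameters: g_i = (1/delta) * Jhat_i * v_i, where Jhat_i is the
    estimated cost at the perturbed parameters theta_i + delta * v_i."""
    v = rng.standard_normal(size=theta.shape)   # random perturbation direction v_i
    J_hat = J_hat_fn(theta + delta * v)         # rollout-cost estimate at perturbed parameters
    return (J_hat / delta) * v

def score_function_direction(theta, rollout_fn, grad_log_pi):
    """Sampling actions (score function / REINFORCE):
    g_i = Jhat_i * sum_t grad_theta log pi_{theta_i}(a_t | x_t)."""
    obs, acts, J_hat = rollout_fn(theta)        # one trajectory under pi_{theta_i} and its total cost
    score = sum(grad_log_pi(theta, a, x) for x, a in zip(obs, acts))
    return J_hat * score
```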
Do Differentiable Simulators Give Better Policy Gradients? Suh et al., 2022.
[Plot: \(z = (1+x)(1+y)\)]
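A small sketch that reproduces the surface \(z = (1+x)(1+y)\) referenced above; the grid range \([-2, 2]\) is an assumption:

```python
import numpy as np
import matplotlib.pyplot as plt

# Surface z = (1 + x)(1 + y) on an assumed grid.
x = np.linspace(-2, 2, 100)
y = np.linspace(-2, 2, 100)
X, Y = np.meshgrid(x, y)
Z = (1 + X) * (1 + Y)

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.plot_surface(X, Y, Z, cmap="viridis")
ax.set_xlabel("x"); ax.set_ylabel("y"); ax.set_zlabel("z")
plt.show()
```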
Learning Optimal Controllers by Policy Gradient. Sun & Fazel, 2021.
On the Gradient Domination of the LQG Problem. Fallah, Toso, Anderson, 2025.
Next time: Guest Lecture on Off-Policy Learning