Lecture 7: Reinforcement Learning (Actor-critic; variance reduction)
Shen Shen
April 23, 2025
2:30pm, Room 32-144
Modeling with Machine Learning for Computer Science

Outline
- Recap: Policy gradient
- RL challenges
- high variance
- sample complexity
- gradient update step-size issue
Recap: Policy Gradient Derivation
- We overload notation:
- Let \(\tau\) denote a state-action sequence: \(\tau=s_0, a_0, s_1, a_1, \ldots\)
- Let \(R(\tau)\) denote the sum of discounted rewards on \(\tau: R(\tau)=\sum_t \gamma^t R\left(s_t, a_t\right)\)
- W.l.o.g. assume \(R(\tau)\) is deterministic in \(\tau\)
- Let \(P(\tau ; \theta)\) denote the probability of trajectory \(\tau\) induced by \(\pi_\theta\)
- Let \(U(\theta)\) denote the objective: \(U(\theta)=\mathbb{E}\left[\sum_t \gamma^t R\left(s_t, a_t\right) \mid \pi_\theta\right]\)
- Our goal is to find \[\theta: \max _\theta U(\theta)=\max _\theta \sum_\tau P(\tau ; \theta) R(\tau)\]
Identity (quite useful in ML)
\(\begin{aligned} \nabla_\theta p_\theta(\tau) & =p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \\ & =p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\end{aligned}\)
\(\begin{aligned} \nabla_\theta U(\theta) & =\nabla_\theta \sum_\tau P(\tau ; \theta) R(\tau) \\ & =\sum_\tau \nabla_\theta P(\tau ; \theta) R(\tau) \\ & =\sum_\tau \frac{P(\tau ; \theta)}{P(\tau ; \theta)} \nabla_\theta P(\tau ; \theta) R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \frac{\nabla_\theta P(\tau ; \theta)}{P(\tau ; \theta)} R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\end{aligned}\)
\(\nabla_\theta U(\theta)=\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\)
where \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\)
Transition is unknown....
Stuck?
\(\nabla_\theta U(\theta)=\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\)
Approximate with the empirical (Monte-Carlo) estimate for \(m\) sample traj. under policy \(\pi_\theta\)
\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)
Valid even when:
- Reward function discontinuous and/or unknown
- Discrete state and/or action spaces
where \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\), so
\(\begin{aligned} \nabla_\theta \log P(\tau ; \theta) & =\nabla_\theta \log \left[\prod_{t=0} P\left(s_{t+1} \mid s_t, a_t\right) \cdot \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta\left[\sum_{t=0} \log P\left(s_{t+1} \mid s_t, a_t\right)+\sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta \sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right) \\ & =\sum_{t=0} \underbrace{\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)}_{\text {no transition model required}}\end{aligned}\)
The following expression provides us with an unbiased estimate of the gradient, and we can compute it without access to the transition model:
\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)
where \(\nabla_\theta \log P(\tau ; \theta)=\sum_{t=0} \nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)\)
Unbiased estimator: \(\mathbb{E}[\hat{g}]=\nabla_\theta U(\theta)\), but very noisy.
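To make the recap concrete, here is a minimal sketch (not lecture code) of computing \(\hat{g}\) for a small discrete-action policy in PyTorch. The environment is assumed to follow a Gymnasium-style `reset()`/`step()` interface, and the network sizes are made up for illustration.

```python
import torch
import torch.nn as nn

# A tiny softmax policy pi_theta(a|s); the 4-dim state / 2 actions are illustrative only.
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))

def reinforce_gradient(env, m=16, gamma=0.99, horizon=200):
    """Fill the .grad of policy's parameters with -g_hat, where
    g_hat = (1/m) sum_i [sum_t grad log pi(a_t|s_t)] R(tau^(i)),
    so that a standard optimizer (minimization) step ascends U(theta)."""
    loss = 0.0
    for _ in range(m):
        s, _ = env.reset()                      # assumed Gymnasium-style API
        log_probs, rewards = [], []
        for t in range(horizon):
            logits = policy(torch.as_tensor(s, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            a = dist.sample()
            log_probs.append(dist.log_prob(a))  # log pi_theta(a_t | s_t), differentiable in theta
            s, r, terminated, truncated, _ = env.step(a.item())
            rewards.append(r)
            if terminated or truncated:
                break
        R = sum(gamma**t * r for t, r in enumerate(rewards))   # R(tau): one noisy scalar return
        loss = loss - torch.stack(log_probs).sum() * R         # surrogate term for this trajectory
    (loss / m).backward()
```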
\(U(\theta)=\sum_\tau P(\tau ; \theta) R(\tau), \qquad \nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\right) R\left(\tau^{(i)}\right)\)
This policy gradient estimator typically has high variance, due to:
- Trajectory-Level Monte Carlo Sampling
- Single-trajectory randomness: The gradient estimator \(\hat{g}\) depends on averaging the returns from a finite set of \(m\) trajectories.
- Each trajectory \(\tau^{(i)}\) is sampled from the policy \(\pi_\theta\), making it inherently stochastic and subject to large fluctuations.
- High Variance of Return \(R(\tau^{(i)})\)
- Cumulative rewards: The return \(R(\tau^{(i)})\) is the sum of all rewards along the trajectory. A slight variation in early steps can produce large differences in total returns.
- Sparse or delayed rewards: When rewards are sparse or rare, returns can fluctuate drastically—some trajectories yield high rewards, others yield zero.
- Noisy Gradient via \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\)
- Log probability gradients: The term \(\nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\) can fluctuate significantly, as small parameter changes affect action probabilities greatly, especially with neural-network policies.
- Multiplicative variance amplification: Since the return \(R(\tau^{(i)})\) multiplies this gradient, any small variance in the log-policy gradients is amplified by large or variable returns.
- Credit Assignment Problem
- Long-horizon dependencies: Rewards received late in a trajectory affect all preceding time-step gradient updates equally, regardless of which actions truly contributed. This leads to noisy and uncertain gradient signals for early actions.
- Poor causality: The gradient estimator does not explicitly distinguish between actions genuinely contributing to a high return and those irrelevant to it.
- Limited Sample Size (\(m\) small)
- Finite-sample estimation: Typically, only a limited number \(m\) of trajectories are sampled. Small sample sizes mean high sampling variability, thus high estimator variance.
- Expensive sampling: Generating trajectories is computationally expensive, limiting practical trajectory counts and exacerbating variance.
Variance reduction:
- Getting more sample trajectories (sometimes infeasible)
- Temporal structure
- Subtracting a constant baseline \(b\)
\(\hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)\right)\)
\(=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)\left(\sum_{t=0}^{h-1} R\left(s_t^{(i)}, a_t^{(i)}\right)\right)\)
\(=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\left[\left(\sum_{k=0}^{t-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)+\left(\sum_{k=t}^{h-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)\right]\right)\)
[Policy Gradient Theorem: Sutton et al 1999; GPOMDP: Baxter & Bartlett, 2001; Survey: Peters & Schaal, 2006]
Exploiting the temporal structure, i.e., removing reward terms that don't depend on the current action, can lower the variance:
\(\frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\left(\sum_{k=t}^{h-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)\)
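A small numpy sketch of the resulting "reward-to-go" weights \(\sum_{k=t}^{h-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\) (an illustrative helper, not lecture code):

```python
import numpy as np

def reward_to_go(rewards):
    """rewards[t] = R(s_t, a_t); returns rtg with rtg[t] = sum_{k=t}^{h-1} rewards[k]."""
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]
        rtg[t] = running
    return rtg

# e.g. reward_to_go([1.0, 0.0, 2.0]) -> array([3., 2., 2.])
```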
Variance reduction:
- Subtracting a constant baseline \(b:\)
\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-b\right)\)
[Williams, REINFORCE paper, 1992]
✅ Still unbiased, despite the additional term
\[\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)(-b)\]
Why? The baseline term has zero expectation:
\(\begin{aligned} \mathbb{E}\left[\nabla_\theta \log P(\tau ; \theta) b\right] & =\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) b \\ & =\sum_\tau P(\tau ; \theta) \frac{\nabla_\theta P(\tau ; \theta)}{P(\tau ; \theta)} b \\ & =\sum_\tau \nabla_\theta P(\tau ; \theta) b \\ & =b \nabla_\theta\left(\sum_\tau P(\tau ; \theta)\right)=b \times 0\end{aligned}\)
❓ With a good choice of \(b\), can we reduce the variance of \(\hat{g}\)?
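As a numerical sanity check of the \(b \times 0\) step, here is a tiny numpy experiment with a Bernoulli "policy" standing in for \(P(\tau ; \theta)\) (illustrative only): the score function has zero mean, so multiplying it by any constant \(b\) adds no bias.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, b = 0.3, 5.0                        # toy parameter and an arbitrary constant baseline
a = rng.random(1_000_000) < theta          # samples a ~ Bernoulli(theta)
# score = d/dtheta log p(a; theta): +1/theta if a = 1, -1/(1 - theta) if a = 0
score = np.where(a, 1.0 / theta, -1.0 / (1.0 - theta))
print((score * b).mean())                  # ~ 0: E[grad_theta log p * b] = b * 0
```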
Control Variates
- The main idea is to reduce variance in the estimate of an expectation.
- Suppose we want to estimate: \[\mu = \mathbb{E}[X]\] where \(X\) is some random variable.
- We introduce another random variable \(Y\) (the control variate), with a known expectation \(\mathbb{E}[Y] = \nu\), and importantly, \(Y\) is correlated to \(X\).
- Estimator \(X-Y+\nu\) has variance \[\operatorname{Var}(X-Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)-2 \operatorname{Cov}(X, Y)\]
We can also define a new estimator: \(X' = X - \alpha(Y - \nu)\), where \(\alpha\) can be chosen optimally to minimize variance.
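A quick numpy illustration, using the same toy setup as the worked example at the end of these notes (\(\operatorname{Var}(X)=4\), \(Y=X+\varepsilon\), \(\nu=1\)):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)   # X: mean 1, Var(X) = 4
y = x + rng.normal(scale=1.0, size=x.size)         # Y = X + eps, with known E[Y] = nu = 1
nu = 1.0
alpha = np.cov(x, y)[0, 1] / np.var(y)             # optimal alpha = Cov(X, Y) / Var(Y), ~0.8 here
x_prime = x - alpha * (y - nu)                     # control-variate estimator X'
print(np.var(x), np.var(x_prime))                  # ~4.0 vs ~0.8
print(x.mean(), x_prime.mean())                    # both ~1.0: X' is still unbiased for E[X]
```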
Control variates in RL
\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-b\right)\)
- What's a good baseline \(b\) candidate?
- The average return over the sampled trajectories: \(b=\frac{1}{m} \sum_{i=1}^m R\left(\tau^{(i)}\right)\)
- An estimated state-dependent value function (the state-value function \(V^\pi(s)\) estimated from rewards): \(b\left(s_t\right)=\hat{V}^\pi\left(s_t\right)\)
\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-\hat{V}^\pi(s)\right)\)
[Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]
- The term \(R\left(\tau^{(i)}\right)-\hat{V}^\pi(s)\) plays the role of an (estimated) advantage function.
- Intuition: instead of increasing the likelihood of all "winning" games, increase the likelihood of "better than average score" games.
How to estimate \(V^\pi\)?
- Again, Monte Carlo estimate \(b\left(s_t\right)=\mathbb{E}\left[r_t+r_{t+1}+r_{t+2}+\ldots+r_{H-1}\right]=V^\pi\left(s_t\right)\)
- Or, collect \(\tau_1, \ldots, \tau_m\), and regress against the empirical return: \(\phi_{i+1} \leftarrow \underset{\phi}{\arg \min } \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1}\left(V_\phi^\pi\left(s_t^{(i)}\right)-\left(\sum_{k=t}^{H-1} R\left(s_k^{(i)}, u_k^{(i)}\right)\right)\right)^2\)
- Or, similar to fitted Q-learning, do fitted V-learning: \(\phi_{i+1} \leftarrow \underset{\phi}{\arg \min } \sum_{\left(s, u, s^{\prime}, r\right)}\left\|r+V_{\phi_i}^\pi\left(s^{\prime}\right)-V_\phi(s)\right\|_2^2\)
[Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]
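A minimal PyTorch sketch of the regression option above; the network size, optimizer, and the `states`/`returns` tensors are assumptions for illustration, not lecture code.

```python
import torch
import torch.nn as nn

# Hypothetical value network V_phi; the 4-dim state is illustrative only.
value_net = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 1))
optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)

def fit_value(states, returns, epochs=50):
    """states: (N, state_dim) float tensor of visited states s_t^(i);
    returns: (N,) tensor of matching empirical returns-to-go sum_{k>=t} R(s_k, u_k)."""
    for _ in range(epochs):
        optimizer.zero_grad()
        pred = value_net(states).squeeze(-1)
        loss = ((pred - returns) ** 2).mean()   # the squared-error regression objective above
        loss.backward()
        optimizer.step()
```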
How to estimate the advantage?
Generalized Advantage Estimation (GAE) [Schulman et al, ICLR 2016]
TD(\(\lambda\)) / eligibility traces [Sutton and Barto, 1990]
- \(\hat{Q}:\) a \(\lambda\)-exponentially-weighted average of all the above
- Advantage estimate then \(\hat{A}=\hat{Q}-\hat{V}\)
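For reference, a minimal numpy sketch of GAE under these definitions; `values` is assumed to hold \(\hat{V}(s_t)\) for every visited state plus a bootstrap value for the final state, and episode-termination masking is omitted.

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-T array of r_t; values: length-(T+1) array of V_hat(s_t).
    Returns A_hat[t] = sum_l (gamma * lam)^l * delta_{t+l},
    where delta_t = r_t + gamma * V_hat(s_{t+1}) - V_hat(s_t)."""
    T = len(rewards)
    advantages = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages          # Q_hat can be recovered as advantages + values[:-1]
```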

How to estimate the advantage?
Asynchronous Advantage Actor-Critic (A3C) [Mnih et al, 2016]

- \(\hat{Q}\) : one of the above choices (e.g. k=5 step lookahead)
- Advantage estimate then \(\hat{A}=\hat{Q}-\hat{V}\)

Actor-critic method
[Williams, REINFORCE paper, 1992]
- actor: the policy \(\pi_\theta(a \mid s)\)
- critic: the value estimate \(\hat{V}^\pi\) used as the baseline (to form the advantage)
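Schematically, one actor-critic update could look like the sketch below (an illustration under assumed rollout data, not the lecture's exact algorithm); `log_probs`, `values`, and `returns` are tensors gathered from rollouts, with separate actor and critic networks.

```python
import torch

def actor_critic_update(log_probs, values, returns, policy_opt, value_opt):
    """log_probs[t] = log pi_theta(a_t|s_t)  (differentiable through the actor);
    values[t]       = V_phi(s_t)             (differentiable through the critic);
    returns[t]      = empirical return-to-go from step t."""
    advantages = (returns - values).detach()        # critic as baseline; no gradient into the actor loss
    actor_loss = -(log_probs * advantages).mean()   # policy-gradient surrogate (ascent via minimization)
    critic_loss = ((values - returns) ** 2).mean()  # regress V_phi toward the empirical returns

    policy_opt.zero_grad()
    actor_loss.backward()
    policy_opt.step()

    value_opt.zero_grad()
    critic_loss.backward()
    value_opt.step()
```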
Vanilla policy gradient/REINFORCE: step-sizing issue
- Step sizes are worth tuning even in plain SGD for supervised learning.
- In policy gradient, a bad step size can have a worse "chain effect":
- Step too far, may lead to terrible policy
- Next mini-batch: terrible data collected (with e.g. no reward signal; so no "correction")
- Not clear how to recover
- Intuitively, too big of a distributional shift
- Coupled with the unknown dynamics/transition, choosing a good step size a priori is hard
Trust region policy optimization (TRPO)
- Interpretation of the objective via importance sampling (next slide)
- Interpretation of the constraint via distributional shift
- Can solve approximately via conjugate gradient method (using linear approx. of the objective and quadratic approx. of the KL)
\(\max _\pi L(\pi)=\mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]\)
Constraint: \(\quad \mathbb{E}_{\pi_{\text {old }}}\left[\mathrm{KL}\left(\pi_{\text {old }} \| \pi\right)\right] \leq \epsilon\)
importance sampling
\(\mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]=\mathbb{E}_{x \sim p}[f(x)]\)
\(U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text {old }}}\left[\frac{P(\tau \mid \theta)}{P\left(\tau \mid \theta_{\text {old }}\right)} R(\tau)\right]\)
\(\nabla_\theta U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text {old }}}\left[\frac{\nabla_\theta P(\tau \mid \theta)}{P\left(\tau \mid \theta_{\text {old }}\right)} R(\tau)\right]\)
- Estimate both the utilities (objective) and policy gradient under new policy
- by using trajectories under old policy
- helps sample complexity, but actually increases variance (another tradeoff)
[Tang and Abbeel, On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient, 2011]
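A quick numpy check of the importance-sampling identity above, with toy Gaussian choices for \(p\) and \(q\) (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2
pdf = lambda x, mu, sig: np.exp(-0.5 * ((x - mu) / sig) ** 2) / (sig * np.sqrt(2 * np.pi))

# target p = N(0, 1); sampling distribution q = N(0.5, 1.2)
x_q = rng.normal(0.5, 1.2, size=200_000)
w = pdf(x_q, 0.0, 1.0) / pdf(x_q, 0.5, 1.2)              # importance weights p(x)/q(x)
print(np.mean(w * f(x_q)))                               # reweighted estimate under q, ~ E_p[x^2] = 1
print(np.mean(f(rng.normal(0.0, 1.0, size=200_000))))    # direct Monte Carlo estimate under p, also ~ 1
```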
TRPO
\(\max _\pi L(\pi)=\mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]\)
Constraint: \(\quad \mathbb{E}_{\pi_{\text {old }}}\left[\mathrm{KL}\left(\pi_{\text {old }} \| \pi\right)\right] \leq \epsilon\)
Fold the constraint into the objective as a penalty:
\(\max _\pi \mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]-\beta\left(\mathbb{E}_t\left[\mathrm{KL}\left(\pi_{\text {old }} \| \pi\right)\right]-\delta\right)\)
Proximal Policy Optimization (PPO, v1)

Proximal Policy Optimization (PPO, v2)
Recall the objective:
\(\hat{\mathbb{E}}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} \hat{A}_t\right]=\hat{\mathbb{E}}_t\left[\rho_t(\theta) \hat{A}_t\right]\)
A lower bound of the above:
\(L^{C L I P}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(\rho_t(\theta) \hat{A}_t, \operatorname{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]\)



[table credit: Daniel Bick]

- The main loss: \(L_t(\theta)=\min \left(\rho_t(\theta) \hat{A}_t, \operatorname{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\)
- Advantage estimate based on truncated GAE
- Adding an entropy term (see the sketch below for how these pieces combine)
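A minimal PyTorch sketch of a combined PPO-style loss with these pieces; the coefficients and tensor names are illustrative assumptions, not the lecture's exact implementation.

```python
import torch

def ppo_loss(dist, actions, old_log_probs, advantages, values, returns,
             clip_eps=0.2, vf_coef=0.5, ent_coef=0.01):
    """dist: torch.distributions object of the *current* policy at the sampled states;
    old_log_probs, advantages (e.g. truncated GAE), values, returns come from rollouts under pi_old."""
    log_probs = dist.log_prob(actions)
    rho = torch.exp(log_probs - old_log_probs)                    # probability ratio rho_t(theta)
    clipped = torch.clamp(rho, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_loss = -torch.min(rho * advantages, clipped * advantages).mean()   # -L^CLIP
    value_loss = ((values - returns) ** 2).mean()                 # critic regression term
    entropy = dist.entropy().mean()                               # entropy bonus encourages exploration
    return policy_loss + vf_coef * value_loss - ent_coef * entropy
```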

Beyond what we covered
- POMDP
- Inverse RL
- Apprenticeship learning
- Behavioral cloning (Dataset Aggregation, a.k.a. DAgger)
- Transfer learning (sim to real)
- Domain randomization
- Multi-task learning
- Curriculum learning
- Hierarchical RL
- Safe/verifiable RL
- Multi-agent RL
- Offline RL
- Reward shaping
- Fairness, ethical, explainable AI (value alignment)
Thanks!
We'd love to hear your thoughts.
Variance Reduction Achieved
- With the optimal control variate estimator \(X'\): \[\mathrm{Var}(X') = \mathrm{Var}(X - 0.8(Y - 1)).\]
- Expanding explicitly (with \(Y=X+\varepsilon\), \(\operatorname{Var}(X)=4\), \(\operatorname{Var}(\varepsilon)=1\), and \(X\) independent of \(\varepsilon\)): \[\operatorname{Var}\left(X^{\prime}\right)=\operatorname{Var}(X-0.8(X+\varepsilon-1))=\operatorname{Var}(0.2 X-0.8 \varepsilon)=(0.2)^2 \operatorname{Var}(X)+(0.8)^2 \operatorname{Var}(\varepsilon)=(0.04)(4)+(0.64)(1)=0.16+0.64=0.8.\]
- Thus, we reduced the variance significantly: \( 4 \rightarrow 0.8 \) (5× improvement).