Lecture 7: Reinforcement Learning (Actor-critic; variance reduction)

 

Shen Shen

April 23, 2025

2:30pm, Room 32-144

 

Modeling with Machine Learning for Computer Science

Outline

  • Recap: Policy gradient 
  • RL challenges
    • high variance
    • sample complexity
    • gradient update step-size issue

Policy Gradient Derivation

  • We overload notation:
  • Let \(\tau\) denote a state-action sequence: \(\tau=s_0, a_0, s_1, a_1, \ldots\)
  • Let \(R(\tau)\) denote the sum of discounted rewards on \(\tau: R(\tau)=\sum_t \gamma^t R\left(s_t, a_t\right)\)
  • W.l.o.g. assume \(R(\tau)\) is deterministic in \(\tau\)
  • Let \(P(\tau ; \theta)\) denote the probability of trajectory \(\tau\) induced by \(\pi_\theta\)
  • Let \(U(\theta)\) denote the objective: \(U(\theta)=\mathbb{E}\left[\sum_t \gamma^t R\left(s_t, a_t\right) \mid \pi_\theta\right]\)
  • Our goal is to find \[\theta: \max _\theta U(\theta)=\max _\theta \sum_\tau P(\tau ; \theta) R(\tau)\]
Recap

Policy Gradient Derivation

Identity (quite useful in ML)

\(\begin{aligned} \nabla_\theta p_\theta(\tau) & =p_\theta(\tau) \frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)} \\ & =p_\theta(\tau) \nabla_\theta \log p_\theta(\tau)\end{aligned}\)

\(\begin{aligned} \nabla_\theta U(\theta) & =\nabla_\theta \sum_\tau P(\tau ; \theta) R(\tau) \\ & =\sum_\tau \nabla_\theta P(\tau ; \theta) R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \frac{\nabla_\theta P(\tau ; \theta)}{P(\tau ; \theta)} R(\tau) \\ & =\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\end{aligned}\)

Recap

Policy Gradient Derivation

where \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\)

\(\nabla_\theta U(\theta)=\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\)

Transition is unknown....

Stuck?

Recap

Policy Gradient Derivation

\(\nabla_\theta U(\theta)=\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta) R(\tau)\)

Approximate with the empirical (Monte Carlo) estimate from \(m\) sampled trajectories under policy \(\pi_\theta\):

 

\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)

Valid even when:

  • Reward function discontinuous and/or unknown
  • Discrete state and/or action spaces


Recap

Policy Gradient Derivation

where \(P(\tau ; \theta)=\prod_{t=0} \underbrace{P\left(s_{t+1} \mid s_t, a_t\right)}_{\text {transition }} \cdot \underbrace{\pi_\theta\left(a_t \mid s_t\right)}_{\text {policy }}\)

\(\begin{aligned} \nabla_\theta \log P(\tau ; \theta) & =\nabla_\theta \log \left[\prod_{t=0} P\left(s_{t+1} \mid s_t, a_t\right) \cdot \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta\left[\sum_{t=0} \log P\left(s_{t+1} \mid s_t, a_t\right)+\sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right)\right] \\ & =\nabla_\theta \sum_{t=0} \log \pi_\theta\left(a_t \mid s_t\right) \\ & =\sum_{t=0} \underbrace{\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)}_{\text {no transition model required}}\end{aligned}\)

\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)

Recap

Policy Gradient Derivation

\(\nabla_\theta \log P(\tau ; \theta)=\sum_{t=0} \underbrace{\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)}_{\text {no transition model required}}\)

\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right) R\left(\tau^{(i)}\right)\)

This expression provides an unbiased estimate of the gradient, and we can compute it without access to the transition model:

Unbiased estimator: \(\mathbb{E}[\hat{g}]=\nabla_\theta U(\theta)\), but very noisy.
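To make the estimator concrete, here is a minimal sketch (not the lecture's code) that computes \(\hat{g}\) by automatic differentiation, assuming a tabular softmax policy and a couple of illustrative pre-collected trajectories; the names `theta` and `trajectories` and all numbers are made up for illustration.

```python
import torch

n_states, n_actions = 4, 2
theta = torch.zeros(n_states, n_actions, requires_grad=True)  # tabular softmax policy parameters

def log_pi(s, a):
    # log pi_theta(a | s) under a softmax over the logits theta[s]
    return torch.log_softmax(theta[s], dim=-1)[a]

# Each trajectory is a list of (state, action, reward) tuples, assumed to have
# been collected by rolling out pi_theta in the (unknown) environment.
trajectories = [
    [(0, 1, 0.0), (2, 0, 1.0), (3, 1, 0.0)],
    [(0, 0, 0.0), (1, 1, 0.0), (3, 0, 1.0)],
]
gamma = 0.99

surrogate = torch.tensor(0.0)
for tau in trajectories:
    R_tau = sum(gamma ** t * r for t, (_, _, r) in enumerate(tau))   # R(tau)
    log_p_tau = sum(log_pi(s, a) for (s, a, _) in tau)               # sum_t log pi_theta(a_t | s_t)
    surrogate = surrogate + log_p_tau * R_tau
surrogate = surrogate / len(trajectories)

surrogate.backward()   # theta.grad now holds the Monte Carlo estimate g_hat
print(theta.grad)
```

Running this with a different set of sampled trajectories would give a noticeably different gradient, which is exactly the noise issue discussed next.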

Recap

\(U(\theta)=\sum_\tau P(\tau ; \theta) R(\tau)\)

\(\nabla_\theta U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\right) R\left(\tau^{(i)}\right)\)

This policy gradient estimator typically has high variance, due to:

  • Trajectory-Level Monte Carlo Sampling
    • Single-trajectory randomness:  The gradient estimator \(\hat{g}\) depends on averaging the returns from a finite set of \(m\) trajectories.
    • Each trajectory \(\tau^{(i)}\) is sampled from the policy \(\pi_\theta\), making it inherently stochastic and subject to large fluctuations.


  • High Variance of Return \(R(\tau^{(i)})\)
    • Cumulative rewards:   The return \(R(\tau^{(i)})\) is the sum of all rewards along the trajectory. A slight variation in early steps can produce large differences in total returns.
    • Sparse or delayed rewards:  When rewards are sparse or rare, returns can fluctuate drastically—some trajectories yield high rewards, others yield zero.


  • Noisy Gradient via \(\nabla_\theta \log \pi_\theta(a_t \mid s_t)\)
    • Log probability gradients:  The term \(\nabla_\theta \log \pi_\theta(a_t^{(i)} \mid s_t^{(i)})\) can fluctuate significantly, as small parameter changes affect action probabilities greatly, especially with neural-network policies.
    • Multiplicative variance amplification:  Since the return \(R(\tau^{(i)})\) multiplies this gradient, any small variance in the log-policy gradients is amplified by large or variable returns.


  •  Credit Assignment Problem
    • Long-horizon dependencies:  Rewards received late in a trajectory affect all preceding time-step gradient updates equally, regardless of which actions truly contributed. This leads to noisy and uncertain gradient signals for early actions.
    • Poor causality:  The gradient estimator does not explicitly distinguish between actions genuinely contributing to a high return and those irrelevant to it.


  • Limited Sample Size (\(m\) small)
    • Finite-sample estimation: Typically, only a limited number \(m\) of trajectories are sampled. Small sample sizes mean high sampling variability, thus high estimator variance.
    • Expensive sampling: Generating trajectories is computationally expensive, limiting practical trajectory counts and exacerbating variance.

Variance reduction:

  • Getting more sample trajectories (sometimes infeasible)
  • Temporal structure
  • Subtracting a constant baseline \(b\)

\(\hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)\right)\)

\(=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\right)\left(\sum_{t=0}^{h-1} R\left(s_t^{(i)}, a_t^{(i)}\right)\right)\)

\(=\frac{1}{m} \sum_{i=1}^m\left(\sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\left[\left(\sum_{k=0}^{t-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)+\left(\sum_{k=t}^{h-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)\right]\right)\)

[Policy Gradient Theorem: Sutton et al 1999; GPOMDP: Baxter & Bartlett, 2001; Survey: Peters & Schaal, 2006]

Removing terms that don't depend on current action can lower variance:

\(\frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{h-1} \nabla_\theta \log \pi_\theta\left(a_t^{(i)} \mid s_t^{(i)}\right)\left(\sum_{k=t}^{h-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)\)

Temporal structure
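As a concrete illustration of the reward-to-go weighting above, here is a minimal sketch (illustrative, not from the slides) that computes \(\sum_{k=t}^{h-1} R\left(s_k, a_k\right)\) for every step \(t\) of one trajectory.

```python
def rewards_to_go(rewards):
    """Return the list whose t-th entry is sum_{k=t}^{h-1} rewards[k]."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running += rewards[t]   # accumulate rewards from the end of the trajectory
        rtg[t] = running
    return rtg

print(rewards_to_go([0.0, 0.0, 1.0, 0.5]))  # [1.5, 1.5, 1.5, 0.5]
```

Each \(\nabla_\theta \log \pi_\theta\left(a_t \mid s_t\right)\) term is then weighted by `rtg[t]` instead of by the full return \(R(\tau)\).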

Variance reduction:

  • Subtracting a constant baseline \(b:\)

 

\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-b\right)\)

[Williams, REINFORCE paper, 1992]

  • Still unbiased, despite the additional term

\[\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)(-b)\]

  • With a good choice of \(b\), we can reduce the variance of the estimator \(\hat{g}\)

❓ Why is the estimator still unbiased, despite the additional term \(\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)(-b)\)?

✅ Because the baseline term has zero expectation:

\(\begin{aligned} \mathbb{E}\left[\nabla_\theta \log P(\tau ; \theta)\, b\right] & =\sum_\tau P(\tau ; \theta) \nabla_\theta \log P(\tau ; \theta)\, b \\ & =\sum_\tau P(\tau ; \theta) \frac{\nabla_\theta P(\tau ; \theta)}{P(\tau ; \theta)}\, b \\ & =\sum_\tau \nabla_\theta P(\tau ; \theta)\, b \\ & =b\, \nabla_\theta\left(\sum_\tau P(\tau ; \theta)\right)=b \times 0\end{aligned}\)

[Williams, REINFORCE paper, 1992]
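A quick numeric sanity check of this zero-expectation argument, as a minimal sketch that uses a Bernoulli(\(\theta\)) distribution as an illustrative stand-in for \(P(\tau ; \theta)\); the constants are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, b = 0.3, 5.0
x = rng.random(1_000_000) < theta                        # samples ~ Bernoulli(theta)
score = np.where(x, 1.0 / theta, -1.0 / (1.0 - theta))   # d/dtheta log p(x; theta)
print(np.mean(score * b))                                # approx 0, up to Monte Carlo noise
```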


❓ How does a good choice of \(b\) reduce the variance of \(\hat{g}\)?

Control Variates

  • The main idea is to reduce variance in the estimate of an expectation.
  • Suppose we want to estimate: \[\mu = \mathbb{E}[X]\] where \(X\) is some random variable.
  • We introduce another random variable \(Y\) (the control variate), with a known expectation \(\mathbb{E}[Y] = \nu\), and, importantly, \(Y\) is correlated with \(X\).
  • The estimator \(X-Y+\nu\) has the same mean as \(X\), with variance \[\operatorname{Var}(X-Y)=\operatorname{Var}(X)+\operatorname{Var}(Y)-2 \operatorname{Cov}(X, Y)\]

We can also define a more general estimator \(X' = X - \alpha(Y - \nu)\), where \(\alpha\) can be chosen to minimize the variance; the optimal choice is \(\alpha^{*}=\operatorname{Cov}(X, Y) / \operatorname{Var}(Y)\). (A small numeric sketch follows.)
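Here is a minimal numeric sketch of the control-variate idea, assuming \(Y=X+\varepsilon\) with known \(\mathbb{E}[Y]=\nu\); the distributions are illustrative, chosen to match the worked "Variance Reduction Achieved" example at the end of this deck (\(\operatorname{Var}(X)=4\), \(\operatorname{Var}(\varepsilon)=1\), \(\alpha^{*}=0.8\)).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
X = rng.normal(loc=1.0, scale=2.0, size=n)     # Var(X) = 4, E[X] = 1
eps = rng.normal(loc=0.0, scale=1.0, size=n)   # independent noise, Var(eps) = 1
Y = X + eps                                    # control variate, nu = E[Y] = 1

alpha = np.cov(X, Y)[0, 1] / Y.var()           # optimal alpha = Cov(X, Y) / Var(Y), approx 0.8
X_cv = X - alpha * (Y - 1.0)                   # same mean as X, lower variance

print(X.mean(), X_cv.mean())                   # both approx 1
print(X.var(), X_cv.var())                     # approx 4 vs approx 0.8
```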

Control variates in RL

  

\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-b\right)\)

  • What's a good candidate for the baseline \(b\)?
  • The average return: \(b=\frac{1}{m} \sum_{i=1}^m R\left(\tau^{(i)}\right)\)
  • An estimated state-dependent value function: \(b\left(s_t\right)=\hat{V}^\pi\left(s_t\right)\), i.e., the state-value function \(V^\pi(s)\) estimated from rewards

\(\nabla U(\theta) \approx \hat{g}=\frac{1}{m} \sum_{i=1}^m \nabla_\theta \log P\left(\tau^{(i)} ; \theta\right)\left(R\left(\tau^{(i)}\right)-\hat{V}^\pi(s)\right)\)

[Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]

The term \(R\left(\tau^{(i)}\right)-\hat{V}^\pi(s)\) plays the role of an advantage function: instead of increasing the likelihood of all "winning" games, we increase the likelihood of games with a "better than average" score.

How to estimate \(V^\pi\)?

  • Again, Monte Carlo estimate \(b\left(s_t\right)=\mathbb{E}\left[r_t+r_{t+1}+r_{t+2}+\ldots+r_{H-1}\right]=V^\pi\left(s_t\right)\)
  • Or, collect \(\tau_1, \ldots, \tau_m\) and regress against the empirical return (a small sketch follows this list): \(\phi_{i+1} \leftarrow \underset{\phi}{\arg \min } \frac{1}{m} \sum_{i=1}^m \sum_{t=0}^{H-1}\left(V_\phi^\pi\left(s_t^{(i)}\right)-\left(\sum_{k=t}^{H-1} R\left(s_k^{(i)}, a_k^{(i)}\right)\right)\right)^2\)

  • Or, similar to fitted Q-learning, do fitted V-learning: \(\phi_{i+1} \leftarrow \underset{\phi}{\arg \min } \sum_{\left(s, a, s^{\prime}, r\right)}\left\|r+V_{\phi_i}^\pi\left(s^{\prime}\right)-V_\phi(s)\right\|_2^2\)

[Greensmith, Bartlett, Baxter, JMLR 2004 for variance reduction techniques.]
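A minimal sketch of the regress-against-empirical-return option in the tabular case, reusing the illustrative (state, action, reward) trajectory format from the earlier sketch; for a tabular value function the least-squares fit is just the per-state average of the observed returns-to-go.

```python
import numpy as np

n_states = 4
trajectories = [
    [(0, 1, 0.0), (2, 0, 1.0), (3, 1, 0.0)],
    [(0, 0, 0.0), (1, 1, 0.0), (3, 0, 1.0)],
]

returns_by_state = {s: [] for s in range(n_states)}
for tau in trajectories:
    rtg = 0.0
    for (s, a, r) in reversed(tau):
        rtg += r                           # empirical return-to-go from state s
        returns_by_state[s].append(rtg)

# Tabular least squares = mean of the observed returns for each state.
V_hat = np.array([np.mean(returns_by_state[s]) if returns_by_state[s] else 0.0
                  for s in range(n_states)])
print(V_hat)
```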

How to estimate the advantage?

Generalized Advantage Estimation (GAE) [Schulman et al, ICLR 2016]
TD(λ) / eligibility traces [Sutton and Barto, 1990]

  • \(\hat{Q}:\) a \(\lambda\)-exponentially weighted average of the \(k\)-step lookahead return estimates
  • The advantage estimate is then \(\hat{A}=\hat{Q}-\hat{V}\) (a minimal sketch of the GAE recursion follows)
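Here is a minimal sketch of the GAE(\(\lambda\)) recursion, assuming per-step rewards and value estimates \(\hat{V}\) are already available; \(\gamma\), \(\lambda\), and the example arrays are illustrative.

```python
import numpy as np

def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """rewards: length-H array; values: length-(H+1) array of V_hat at each state,
    with values[H] the bootstrap value of the final state."""
    H = len(rewards)
    adv = np.zeros(H)
    running = 0.0
    for t in reversed(range(H)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD error at step t
        running = delta + gamma * lam * running                  # lambda-weighted sum of TD errors
        adv[t] = running
    return adv

print(gae_advantages(np.array([0.0, 0.0, 1.0]), np.array([0.2, 0.3, 0.6, 0.0])))
```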

How to estimate the advantage?

Asynchronous Advantage Actor-Critic (A3C) [Mnih et al, 2016]

  • \(\hat{Q}\): a \(k\)-step lookahead return estimate (e.g., \(k=5\))
  • The advantage estimate is then \(\hat{A}=\hat{Q}-\hat{V}\)

Actor-critic method

[Williams, REINFORCE paper, 1992]

(Diagram omitted: the policy \(\pi_\theta\) plays the role of the actor; the estimated value function \(\hat{V}^\pi\) plays the role of the critic.)
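Below is a minimal sketch of one actor-critic update, assuming a tabular softmax actor and a tabular TD(0) critic; the names, learning rates, and trajectory data are illustrative, not the lecture's implementation.

```python
import torch

n_states, n_actions, gamma, lr = 4, 2, 0.99, 0.1
theta = torch.zeros(n_states, n_actions, requires_grad=True)   # actor (policy) parameters
V = torch.zeros(n_states)                                      # critic's value estimates

def update(trajectory):
    """trajectory: list of (state, action, reward, next_state) tuples."""
    actor_loss = torch.tensor(0.0)
    for (s, a, r, s_next) in trajectory:
        td_target = r + gamma * V[s_next]
        advantage = (td_target - V[s]).detach()                # critic's advantage estimate
        log_pi = torch.log_softmax(theta[s], dim=-1)[a]
        actor_loss = actor_loss - log_pi * advantage           # minimize -log pi * A_hat
        V[s] += lr * (td_target - V[s])                        # TD(0) critic update
    actor_loss.backward()
    with torch.no_grad():                                      # manual gradient step on the actor
        theta -= lr * theta.grad
        theta.grad.zero_()

update([(0, 1, 0.0, 2), (2, 0, 1.0, 3)])
print(theta)
```

The critic supplies the (lower-variance) advantage estimate; the actor is updated with the same likelihood-ratio gradient as before, just weighted by that estimate.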

Vanilla policy gradient/REINFORCE: step-sizing issue

  • Step sizes are worth tuning even in plain SGD for supervised learning.
  • In policy gradient methods, a bad step size can have a worse "chain effect":
    • Step too far, and we may land on a terrible policy
    • The next mini-batch is then collected under that terrible policy (e.g., with no reward signal, so no "correction")
    • It is not clear how to recover
    • Intuitively, the distributional shift is too big
  • Coupled with the unknown dynamics/transition model, choosing a good step size a priori is hard

Trust region policy optimization (TRPO)

  • Interpretation of the objective via importance sampling (next slide)
  • Interpretation of the constraint via distributional shift
  • Can solve approximately via conjugate gradient method (using linear approx. of the objective and quadratic approx. of the KL)

\(\max_\pi L(\pi)=\mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]\)

Constraint: \(\quad \mathbb{E}_{\pi_{\text {old }}}\left[\mathrm{KL}\left(\pi_{\text {old }} \,\|\, \pi\right)\right] \leq \epsilon\)

importance sampling

\(\mathbb{E}_{x \sim q}\left[\frac{p(x)}{q(x)} f(x)\right]=\mathbb{E}_{x \sim p}[f(x)]\)

\(U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text {old }}}\left[\frac{P\left(\tau \mid \theta\right)}{P\left(\tau \mid \theta_{\text {old }}\right)} R(\tau)\right]=\mathbb{E}_{\tau \sim \theta_{\text {old }}}\left[\frac{\prod_t \pi_\theta\left(a_t \mid s_t\right)}{\prod_t \pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} R(\tau)\right]\) (the transition terms cancel in the ratio)

\(\nabla_\theta U(\theta)=\mathbb{E}_{\tau \sim \theta_{\text {old }}}\left[\frac{\nabla_\theta P\left(\tau \mid \theta\right)}{P\left(\tau \mid \theta_{\text {old }}\right)} R(\tau)\right]\)

[Tang and Abbeel, On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient, 2011]
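A minimal numeric check of the importance-sampling identity above, using simple 1-D Gaussians for \(p\) and \(q\) (illustrative choices, not from the slides).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
xq = rng.normal(0.0, 1.0, 500_000)                        # samples from q = N(0, 1)
w = normal_pdf(xq, 0.5, 1.0) / normal_pdf(xq, 0.0, 1.0)   # importance weights p(x) / q(x)
f = lambda x: x ** 2

print(np.mean(w * f(xq)))                                 # IS estimate of E_{x~p}[f(x)]
print(np.mean(f(rng.normal(0.5, 1.0, 500_000))))          # direct estimate; both approx 1.25
```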

  • Estimate both the utility (objective) and the policy gradient under the new policy, using trajectories collected under the old policy
  • This helps sample complexity, but actually increases variance (another tradeoff)

[Tang and Abbeel, On a Connection between Importance Sampling and the Likelihood Ratio Policy Gradient, 2011]

TRPO

\(\max_\pi L(\pi)=\mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]\)

Constraint: \(\quad \mathbb{E}_{\pi_{\text {old }}}\left[\mathrm{KL}\left(\pi_{\text {old }} \,\|\, \pi\right)\right] \leq \epsilon\)

Proximal Policy Optimization (PPO, v1)

Replace the hard constraint with a penalty:

\(\max_\pi \mathbb{E}_{\pi_{\text {old }}}\left[\frac{\pi(a \mid s)}{\pi_{\text {old }}(a \mid s)} A^{\pi_{\text {old }}}(s, a)\right]-\beta\left(\mathbb{E}_t\left[\mathrm{KL}\left(\pi_{\text {old }} \,\|\, \pi\right)\right]-\delta\right)\)

Proximal Policy Optimization (PPO, v2)

Recall the objective:

\(\hat{\mathbb{E}}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} \hat{A}_t\right]=\hat{\mathbb{E}}_t\left[\rho_t(\theta) \hat{A}_t\right]\)

A lower bound of the above:

\(L^{C L I P}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(\rho_t(\theta) \hat{A}_t, \operatorname{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]\)
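A minimal sketch of the clipped surrogate \(L^{CLIP}\), assuming the probability ratios and advantage estimates have already been computed as tensors; the example numbers are illustrative, not the lecture's code.

```python
import torch

def clipped_surrogate(ratio, advantages, eps=0.2):
    # ratio = pi_theta(a|s) / pi_theta_old(a|s), advantages = A_hat
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.min(unclipped, clipped).mean()   # maximize this (or minimize its negative)

ratio = torch.tensor([0.7, 1.0, 1.4])
advantages = torch.tensor([1.0, -0.5, 2.0])
print(clipped_surrogate(ratio, advantages))
```

The elementwise min with the clipped term makes this a pessimistic (lower-bound) objective: moving the ratio outside the \([1-\epsilon, 1+\epsilon]\) band is never rewarded, which keeps the update close to \(\pi_{\text{old}}\).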

[table credit: Daniel Bick]

- The main loss \(L_t(\theta)=\min \left(\rho_t(\theta) \hat{A}_t, \operatorname{clip}\left(\rho_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\)
- Advantage estimate based on truncated GAE
- Adding an entropy term

Beyond what we covered

- POMDP
- Inverse RL
- Apprenticeship learning
- Behavioral cloning (Dataset Aggregation, a.k.a. DAgger)
- Transfer learning (sim to real)
- Domain randomization
- Multi-task learning
- Curriculum learning
- Hierarchical RL
- Safe/verifiable RL
- Multi-agent RL
- Offline RL
- Reward shaping
- Fairness, ethics, and explainable AI (value alignment)

Thanks!

We'd love to hear your thoughts.

 

Variance Reduction Achieved

  • Worked example: suppose \(\operatorname{Var}(X)=4\), \(Y=X+\varepsilon\) with independent noise \(\varepsilon\), \(\operatorname{Var}(\varepsilon)=1\), and \(\nu=\mathbb{E}[Y]=1\); then \(\alpha^{*}=\operatorname{Cov}(X, Y) / \operatorname{Var}(Y)=4 / 5=0.8\).
  • With the optimal control variate estimator \(X'\): \[\mathrm{Var}(X') = \mathrm{Var}(X - 0.8(Y - 1)).\]
  • Expanding explicitly: \[\mathrm{Var}(X') = \mathrm{Var}(X - 0.8(X+\varepsilon - 1)) = \mathrm{Var}(0.2X - 0.8\varepsilon) = (0.2)^2\mathrm{Var}(X) + (0.8)^2\mathrm{Var}(\varepsilon) = (0.04)(4) + (0.64)(1)= 0.16 + 0.64 = 0.8.\]
  • Thus, we reduced the variance significantly: \( 4 \rightarrow 0.8 \) (a 5× improvement).

6.C011/C511 - ML for CS (Spring25) - Lecture 7 - Reinforcement Learning III (Actor-critic; variance reduction)

By Shen Shen
