CS 4/5789: Introduction to Reinforcement Learning
Lecture 15: Value-Based RL
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders/Announcements
- Homework
- PA 3 released Friday, due 3/31
- PSet 4 released Wednesday
- 5789 Paper Reviews due weekly on Mondays
- Prelim
- Median: 59/75, Mean: 58/75, Standard Deviation: 12/75
- Raw percentages \(\neq\) letter grades!
- Regrade requests open until Wednesday 11:59pm
- Corrections: assigned in late April, graded like a PSet
- for each problem, your final score will be calculated as
initial score \(+~ \alpha \times (\)corrected score \( - \) initial score\()_+\)
Agenda
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Two concerns of Data Feedback
- Constructing dataset for supervised learning
- Updating the policy based on learned quantities
Recap: Control/Data Feedback


[Diagram: control loop — the policy \(\pi\) selects action \(a_t\); the transitions \(P, f\) (unknown in Unit 2) produce state \(s_t\) and reward \(r_t\); experience is collected as data \((s_t,a_t,r_t)\)]
- Constructing dataset for supervised learning
- features \((s,a)\sim d^\pi_{\mu_0}\) ("roll in")
- labels \(y\) with \(\mathbb E[y|s,a]= Q^\pi(s,a)\) ("roll out")
- Incremental updates to control distribution shift
- mixture of current and greedy policy
- parameter \(\alpha\) controls the distribution shift
Recap: Conservative PI
Recall: Labels via Rollouts
- Method: \(y = \sum_{t=h_1}^{h_1+h_2} r_t \) for \(h_2\sim\)Geometric\((1-\gamma)\) (see the sketch below)
- Motivation: the definition \( Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
- On a future PSet, show this label is unbiased,
- i.e. \(\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})\)
- Sources of label variance:
- choice of \(h_2\)
- \(h_2\) steps of \(P\) and \(\pi\)
- On policy: rollout with \(\pi\) and estimate \(Q^\pi\)
[Diagram: rollout trajectory \(s_t\), \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\), ...]
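A minimal sketch of this label construction, assuming a hypothetical `env` whose current state is the roll-in state and whose `step(action)` returns the next state and reward, plus a `policy(state)` function; the names and interface are illustrative, not the course's starter code.

```python
import numpy as np

def mc_label(env, policy, action, gamma, rng=np.random.default_rng()):
    """Monte Carlo label for (env's current state, action):
    y = sum of the next h2 + 1 rewards with h2 ~ Geometric(1 - gamma),
    so that E[y | s, a] = Q^pi(s, a)."""
    h2 = rng.geometric(1 - gamma) - 1   # numpy's geometric has support {1, 2, ...}; shift to {0, 1, ...}
    y, a = 0.0, action
    for _ in range(h2 + 1):
        s, r = env.step(a)              # assumed interface: returns next state and reward
        y += r                          # undiscounted sum; the geometric horizon accounts for gamma
        a = policy(s)                   # continue rolling out with pi
    return y
```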
Agenda
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Bellman Expectation Equation
- Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
- Idea: rollout policy \(\pi\) and use current Q estimate!
- set features \((s_i, a_i) = (s_t, a_t)\)
- set label \(y_i = r_t + \gamma \hat Q(s_{t+1}, a_{t+1})\)
- Example:
- \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\), \(\gamma=0.5\)
- \(\pi(a|1)=\frac{1}{2}\), \(\pi(0)=\)stay
[Diagram: two-state MDP with states \(0\) and \(1\); transition probabilities labeled stay: \(1\), switch: \(1\), stay: \(1-p\), switch: \(1-2p\), stay: \(p\), switch: \(2p\)]
Bellman Expectation Equation
- Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
- Idea: rollout policy \(\pi\) and use current Q estimate!
- set features \((s_i, a_i) = (s_t, a_t)\)
- set label \(y_i = r_t + \gamma \hat Q(s_{t+1}, a_{t+1})\)
- Is the label unbiased? PollEv
- \(\mathbb E[y_i|s_i, a_i]-Q^\pi(s_i,a_i) = \gamma \mathbb E_{s'\sim P(s_i,a_i)}\big[\mathbb E_{a'\sim \pi(s')}[\hat Q(s',a')-Q^\pi(s',a')]\big]\)
- so the label is biased unless \(\hat Q\) agrees with \(Q^\pi\) on the states and actions reached in one step
- Sources of variance: one step of \(P\) and \(\pi\)
- On policy: rollout with \(\pi\) and estimate \(Q^\pi\)
[Diagram: rollout trajectory \(s_t\), \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\), ...]
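A minimal sketch of this one-step Bellman expectation label, assuming a tabular \(\hat Q\) stored as a NumPy array indexed by (state, action); the transition tuple and array layout are illustrative assumptions.

```python
import numpy as np

def bellman_expectation_label(transition, Q_hat, gamma):
    """One-step label y = r_t + gamma * Q_hat(s_{t+1}, a_{t+1}),
    built from an on-policy transition (s_t, a_t, r_t, s_{t+1}, a_{t+1})."""
    s, a, r, s_next, a_next = transition   # (s, a) are the features; the rest builds the label
    return r + gamma * Q_hat[s_next, a_next]

# Example usage on a 2-state, 2-action table with an arbitrary current estimate Q_hat:
Q_hat = np.zeros((2, 2))                   # rows: states {0, 1}, cols: actions {stay, switch}
y = bellman_expectation_label((1, 0, 0.0, 0, 0), Q_hat, gamma=0.5)
```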
Bellman Optimality Equation
- Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
- Idea: rollout policy \(\pi\) and use current \(Q^\star\) estimate!
- set features \((s_i, a_i) = (s_t, a_t)\)
- set label \(y_i = r_t + \gamma \max_a \hat Q(s_{t+1}, a)\)
- Example:
- \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}+\mathbf 1\{s=0\}\), \(\gamma=0.5\)
- \(\pi(a|1)=\frac{1}{2}\), \(\pi(0)=\)stay

[Diagram: two-state MDP with states \(0\) and \(1\); transition probabilities labeled stay: \(1\), switch: \(1\), stay: \(1-p\), switch: \(1-2p\), stay: \(p\), switch: \(2p\)]
Bellman Optimality Equation
- Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
- Idea: rollout policy \(\pi\) and use current \(Q^\star\) estimate!
- set features \((s_i, a_i) = (s_t, a_t)\)
- set label \(y_i = r_t + \gamma \max_a \hat Q(s_{t+1}, a)\)
- The label is biased
- \(\mathbb E[y_i|s_i, a_i]-Q^\star(s_i,a_i) = \gamma \mathbb E_{s'\sim P(s_i,a_i)}\big[\max_a \hat Q(s',a)-\max_{a'}Q^\star(s',a')\big]\)
- Sources of variance: one step of \(P\)
- Off policy: rollout with \(\pi\) and estimate \(Q^\star\)
[Diagram: rollout trajectory \(s_t\), \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\), ...]
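For contrast with the expectation label, a minimal sketch of the optimality-based label, again assuming a tabular \(\hat Q\) as a NumPy array; note it uses only \((s_t, a_t, r_t, s_{t+1})\), which is why any policy's data can be used.

```python
import numpy as np

def bellman_optimality_label(transition, Q_hat, gamma):
    """Label y = r_t + gamma * max_a Q_hat(s_{t+1}, a); the next action a_{t+1}
    chosen by the rollout policy is never needed, so the label is off policy."""
    s, a, r, s_next = transition
    return r + gamma * np.max(Q_hat[s_next])

# Example usage with a 2-state, 2-action table:
Q_hat = np.zeros((2, 2))
y = bellman_optimality_label((1, 0, 0.0, 0), Q_hat, gamma=0.5)
```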
Agenda
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Value-based RL


[Diagram: agent/environment loop — the policy selects actions; experience returns states and rewards, which are collected as data]
Key components of a value-based RL algorithm:
- Rollout policy
- Construct/update dataset
- Learn/update Q function
Value-based RL
Key components of a value-based RL algorithm:
- Rollout policy
- Construct/update dataset
- Learn/update Q function
Different choices for these components lead to different algorithms
- We first discuss learning Q
- Then we will cover 3 main algorithms
Q function Class
- Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
- In lecture, we often treat this as a black box. Examples:
1. Tabular: \(\hat Q\) stored as a table with one entry per \((s,a)\in\mathcal S\times\mathcal A\)
2. Parametric, e.g. deep (PA 3): \(\hat Q\) a function of \((s,a)\)
- Sometimes, the minimization is only incrementally solved before returning \(\hat Q\), e.g. \(\hat Q(s,a) \leftarrow (1-\alpha) \hat Q(s,a) + \alpha y\) or gradient steps
Q function Class
- Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
- For a parametric class \(\mathcal Q = \{Q_\theta\mid\theta\in\mathbb R^d\}\)
- gradient descent is $$\theta\leftarrow \theta - \alpha \nabla_\theta \left[\textstyle \sum_{i=1}^N (Q_\theta(s_i,a_i)-y_i)^2\right]$$
- The gradient:
- \(\nabla_\theta(Q_\theta(s_i,a_i)-y_i)^2 = 2(Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i)\)

Q function Class
- Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
- For a parametric class \(\mathcal Q = \{Q_\theta\mid\theta\in\mathbb R^d\}\)
- gradient descent is $$\theta\leftarrow \theta -2 \alpha \textstyle \sum_{i=1}^N (Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i) $$
- The gradient:
- \(\nabla_\theta(Q_\theta(s_i,a_i)-y_i)^2 = 2(Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i)\)
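A minimal sketch of one such gradient step, assuming a linear parametrization \(Q_\theta(s,a)=\theta^\top \phi(s,a)\) with a hypothetical feature map `phi`; for the deep case in PA 3, an autodiff library would compute \(\nabla_\theta Q_\theta\) instead.

```python
import numpy as np

def gradient_step(theta, phi, batch, y, alpha):
    """One gradient descent step on sum_i (Q_theta(s_i,a_i) - y_i)^2
    for a linear model Q_theta(s, a) = theta @ phi(s, a)."""
    grad = np.zeros_like(theta)
    for (s, a), y_i in zip(batch, y):
        features = phi(s, a)                  # assumed feature map, shape (d,)
        residual = theta @ features - y_i     # Q_theta(s_i, a_i) - y_i
        grad += 2.0 * residual * features     # chain rule: grad_theta Q_theta = phi(s, a)
    return theta - alpha * grad
```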

1. Approx/Conservative PI
Initialize arbitrary \(\pi_0\), then for iterations \(i\):
- Rollout \(\pi_i\)
- Construct dataset with features \((s_t,a_t)\) and labels $$y_t = \sum_{k=t}^{h} r_k,\quad h\sim\mathrm{Geometric}(1-\gamma)$$
- \(\hat Q_i\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
- \(\pi_{i+1}\) (incrementally) greedy w.r.t. \(\hat Q_i\)
("Monte Carlo")
Epsilon-greedy policy
- Alternative to incremental policy updates for encouraging exploration
- Let \(\bar \pi\) be a greedy policy (e.g. \(\bar\pi(s) = \arg\max_a \hat Q(s,a)\))
- Then the \(\epsilon\) greedy policy follows \(\bar\pi\) with probability \(1-\epsilon\)
- otherwise selects an action from \(\mathcal A\) uniformly at random
- Mathematically, this is $$\pi(a|s) = \begin{cases}1-\epsilon+\frac{\epsilon}{A} & a=\bar\pi(s)\\ \frac{\epsilon}{A} & \text{otherwise} \end{cases}$$ where \(A=|\mathcal A|\)
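A minimal sketch of \(\epsilon\)-greedy action selection, assuming a tabular \(\hat Q\) as a NumPy array; the names are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_hat, s, epsilon, rng=np.random.default_rng()):
    """Follow the greedy action argmax_a Q_hat(s, a) with probability 1 - epsilon,
    otherwise pick an action uniformly at random."""
    num_actions = Q_hat.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))  # explore: uniform over A
    return int(np.argmax(Q_hat[s]))            # exploit: greedy w.r.t. Q_hat
```

Note that the greedy action is chosen with total probability \(1-\epsilon+\frac{\epsilon}{A}\), matching the formula above, since the uniform branch can also select it.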
2. "Temporal difference" PI
Initialize arbitrary \(\pi_0\), \(Q_0\), then for iterations \(i\):
- Rollout \(\pi_i\)
- Construct dataset with features \((s_t,a_t)\) and labels $$y_t = r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})$$
- \(\hat Q_{i+1}\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
- \(\pi_{i+1}\) (incrementally or \(\epsilon\)) greedy w.r.t. \(\hat Q_{i+1}\)
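A minimal sketch of the incremental supervised-learning step inside this loop for a tabular \(\hat Q\), i.e. the update \(\hat Q(s,a)\leftarrow(1-\alpha)\hat Q(s,a)+\alpha y\) with the temporal-difference label; the array layout is an illustrative assumption.

```python
import numpy as np

def td_update(Q_hat, transition, gamma, alpha):
    """In-place TD update of a tabular Q estimate from one on-policy transition
    (s_t, a_t, r_t, s_{t+1}, a_{t+1}), using label y = r + gamma * Q_hat[s', a']."""
    s, a, r, s_next, a_next = transition
    y = r + gamma * Q_hat[s_next, a_next]
    Q_hat[s, a] = (1 - alpha) * Q_hat[s, a] + alpha * y
    return Q_hat
```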
3. "Q-learning"
Initialize arbitrary \(\pi_0\), \(Q_0\), then for iterations \(i\):
- Rollout \(\pi_i\)
- Append to dataset: features \((s_t,a_t)\) and labels $$y_t = r_t+\gamma \max_a \hat Q_{i}(s_{t+1}, a)$$
- \(\hat Q_{i+1}\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
- \(\pi_{i+1}\) (\(\epsilon\)) greedy w.r.t. \(\hat Q_{i+1}\)
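Putting the pieces together, a minimal sketch of tabular Q-learning with \(\epsilon\)-greedy rollouts and incremental updates; the `env.reset()`/`env.step()` interface, fixed rollout horizon, and hyperparameters are illustrative assumptions, not the PA 3 starter code.

```python
import numpy as np

def q_learning(env, num_states, num_actions, gamma, alpha=0.1, epsilon=0.1,
               episodes=1000, horizon=100, rng=np.random.default_rng()):
    """Tabular Q-learning: roll out an epsilon-greedy policy, label each transition
    with y = r + gamma * max_a Q_hat[s', a], and update Q_hat incrementally."""
    Q_hat = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s = env.reset()                          # assumed interface: returns an initial state
        for _ in range(horizon):
            if rng.random() < epsilon:           # epsilon-greedy rollout policy
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q_hat[s]))
            s_next, r = env.step(a)              # assumed interface: next state, reward
            y = r + gamma * np.max(Q_hat[s_next])               # Bellman optimality label
            Q_hat[s, a] = (1 - alpha) * Q_hat[s, a] + alpha * y  # incremental update
            s = s_next
    return Q_hat
```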
Comparison
| | 1. PI with MC | 2. PI with TD | 3. Q-learning |
|---|---|---|---|
| Approach | Data-driven PI | Data-driven PI | Data-driven VI |
| Label | \(\sum_{k=t}^{h} r_k\) | \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})\) | \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_\star)\) |
| Horizon | Multi-timestep | Single timestep | Single timestep |
| Bias | Unbiased | Biased depending on \(Q^\pi-\hat Q_{i}\) | Biased depending on \(Q^\star-\hat Q_{i}\) |
| Variance | High | Low | Low |
| Data | On policy | On policy | Off policy |
Agenda
1. Recap
2. Labels via Bellman
3. (Q) Value-based RL
4. Preview: Optimization
Preview: Policy Optimization
- Ultimate Goal: find (near) optimal policy
- Value-based RL estimates intermediate quantities
- \(Q^{\pi}\) or \(Q^{\star}\) are indirectly useful for finding optimal policy
- Imitation learning had no intermediaries, but required data from an expert policy
- Idea: optimize policy without relying on intermediaries:
- objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
- For parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$ (see the sketch below)
[Plot: objective \(J(\theta)\) versus policy parameter \(\theta\)]
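As a preview, a minimal sketch of estimating \(J(\theta)\) by rolling out \(\pi_\theta\), reusing the geometric-horizon trick from earlier in the lecture; the `env` and `policy_theta` interfaces are illustrative assumptions.

```python
import numpy as np

def estimate_J(env, policy_theta, gamma, num_rollouts=100, rng=np.random.default_rng()):
    """Monte Carlo estimate of J(theta) = E_{s ~ mu_0}[V^{pi_theta}(s)]:
    average the undiscounted return over horizons h ~ Geometric(1 - gamma)."""
    returns = []
    for _ in range(num_rollouts):
        s = env.reset()                      # assumed interface: s ~ mu_0
        h = rng.geometric(1 - gamma) - 1     # geometric horizon accounts for the discount
        total = 0.0
        for _ in range(h + 1):
            a = policy_theta(s)
            s, r = env.step(a)               # assumed interface: next state, reward
            total += r
        returns.append(total)
    return float(np.mean(returns))
```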
Preview: Optimization
- So far, we have discussed tabular and quadratic optimization (see the sketch below)
- tabular: optimize by enumerating table entries, e.g. np.amin(J, axis=1)
- quadratic: for \(J(\theta) = a\theta^2 + b\theta +c\) with \(a<0\), the maximum is at \(\theta^\star = -\frac{b}{2a}\)
- Next lecture, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)
[Plot: \(J(\theta)\) with maximizer \(\theta^\star\)]
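A minimal sketch contrasting the two cases, with made-up numbers for illustration.

```python
import numpy as np

# Tabular: optimize by enumerating a finite table of values.
J_table = np.array([[1.0, 0.5, 2.0],
                    [0.0, 3.0, 1.5]])
row_minima = np.amin(J_table, axis=1)   # minimum over each row (e.g. over actions)

# Quadratic: closed-form maximizer of J(theta) = a*theta^2 + b*theta + c with a < 0.
a, b, c = -2.0, 4.0, 1.0
theta_star = -b / (2 * a)               # theta* = -b / (2a) = 1.0
```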
Recap
- PSet will be released Wed
- PA was released Fri
- Rollout and Bellman Labels
- Value-based RL
- Next lecture: Optimization Overview
Sp23 CS 4/5789: Lecture 15
By Sarah Dean