CS 4/5789: Introduction to Reinforcement Learning

Lecture 15: Value-Based RL

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders/Announcements

  • Homework
    • PA 3 released Friday, due 3/31
    • PSet 4 released Wednesday
    • 5789 Paper Reviews due weekly on Mondays
  • Prelim
    • Median: 59/75, Mean: 58/75, Standard Deviation: 12/75
      • Raw percentages \(\neq\) letter grades!
    • Regrade requests open until Wednesday 11:59pm
    • Corrections: assigned in late April, graded like a PSet
      • for each problem, your final score will be calculated as
            \(\text{initial score} + \alpha \times (\text{corrected score} - \text{initial score})_+\)

Agenda

1. Recap

2. Labels via Bellman

3. (Q) Value-based RL

4. Preview: Optimization

Two concerns of Data Feedback

  1. Constructing dataset for supervised learning
  2. Updating the policy based on learned quantities

Recap: Control/Data Feedback

[Figure: control/data feedback loop. The policy \(\pi\) maps state \(s_t\) to action \(a_t\); the environment (transitions \(P, f\), unknown in Unit 2) returns reward \(r_t\) and the next state, and experience is collected as data \((s_t,a_t,r_t)\).]

  1. Constructing dataset for supervised learning
    • features \((s,a)\sim d^\pi_{\mu_0}\) ("roll in")
    • labels \(y\) with \(\mathbb E[y|s,a]= Q^\pi(s,a)\) ("roll out")
  2. Incremental updates to control distribution shift
    • mixture of current and greedy policy
    • parameter \(\alpha\) controls the distribution shift

Recap: Conservative PI

Recall: Labels via Rollouts

  • Method: \(y = \sum_{t=h_1}^{h_1+h_2} r_t \) for \(h_2\sim\)Geometric\((1-\gamma)\)
  • Motivation: definition of
    \( Q^\pi(s,a) = \mathbb E_{P,\pi}\left[\sum_{t=0}^\infty \gamma^t r_t\mid s_0=s, a_0=a \right] \)
  • On a future PSet, you will show this label is unbiased
    • i.e. \(\mathbb E[y|s_{h_1},a_{h_1}]= Q^\pi(s_{h_1},a_{h_1})\)
    • Sources of label variance:
      • choice of \(h_2\)
      • \(h_2\) steps of \(P\) and \(\pi\)
    • On policy: rollout with \(\pi\) and
      estimate \(Q^\pi\)

[Figure: rollout \(\dots,\ s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)]
Agenda

1. Recap

2. Labels via Bellman

3. (Q) Value-based RL

4. Preview: Optimization

Bellman Expectation Equation

  • Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
  • Idea: rollout policy \(\pi\) and use current Q estimate!
    • set features \((s_i, a_i) = (s_t, a_t)\)
    • set label \(y_i = r_t + \gamma \hat Q(s_{t+1}, a_{t+1})\)
  • Example:
    • \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}\)
      \(\qquad\qquad+\mathbf 1\{s=0\}\), \(\gamma=0.5\)
    • \(\pi(a\mid 1)=\frac{1}{2}\), \(\pi(0)=\mathsf{stay}\)

[Figure: two-state MDP (states \(0\) and \(1\)) with transition probabilities labeled stay: \(1\), switch: \(1\), stay: \(1-p\), switch: \(1-2p\), stay: \(p\), switch: \(2p\).]

Bellman Expectation Equation

  • Bellman Expectation Equation $$Q^\pi(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[\mathbb E_{a'\sim \pi(s')}\left[Q^\pi(s',a')\right]\right]$$
  • Idea: rollout policy \(\pi\) and use current Q estimate!
    • set features \((s_i, a_i) = (s_t, a_t)\)
    • set label \(y_i = r_t + \gamma \hat Q(s_{t+1}, a_{t+1})\)
  • Is the label unbiased? PollEv
    • \(\mathbb E[y_i|s_i, a_i]-Q^\pi(s_i,a_i) =\)
      \(\gamma \mathbb E_{s'\sim P(s_i,a_i)}\big[\mathbb E_{a'\sim \pi(s')}[\hat Q(s',a')-Q^\pi(s',a')]\big]\)
  • Sources of variance: one step of \(P\) and \(\pi\)
  • On policy: rollout with \(\pi\) and estimate \(Q^\pi\)

[Figure: rollout \(\dots,\ s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)]
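A minimal sketch of constructing the Bellman-expectation (TD) label from a single on-policy transition; the transition-tuple format and array-style `Q_hat` indexing are assumptions for illustration.

```python
def td_label(transition, Q_hat, gamma):
    """Bootstrapped label y_i = r_t + gamma * Q_hat(s_{t+1}, a_{t+1}).

    transition = (s_t, a_t, r_t, s_next, a_next) comes from rolling out pi;
    the label is biased whenever Q_hat differs from Q^pi at (s_next, a_next).
    """
    s, a, r, s_next, a_next = transition
    return r + gamma * Q_hat[s_next, a_next]   # assumes a tabular/array Q_hat
```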

Bellman Optimality Equation

  • Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
  • Idea: rollout policy \(\pi\) and use current \(Q^\star\) estimate!
    • set features \((s_i, a_i) = (s_t, a_t)\)
    • set label \(y_i = r_t + \gamma \max_a  \hat Q(s_{t+1}, a)\)
  • Example:
    • \(r(s,a) = -0.5\cdot \mathbf 1\{a=\mathsf{switch}\}\)
      \(\qquad\qquad+\mathbf 1\{s=0\}\), \(\gamma=0.5\)
    • \(\pi(a\mid 1)=\frac{1}{2}\), \(\pi(0)=\mathsf{stay}\)

[Figure: two-state MDP (states \(0\) and \(1\)) with transition probabilities labeled stay: \(1\), switch: \(1\), stay: \(1-p\), switch: \(1-2p\), stay: \(p\), switch: \(2p\).]

Bellman Optimality Equation

  • Bellman Optimality Equation $$Q^\star(s,a) = r(s,a) + \gamma \mathbb E_{s'\sim P(s,a)}\left[ \max_{a'}Q^\star(s',a')\right]$$
  • Idea: rollout policy \(\pi\) and use current \(Q^\star\) estimate!
    • set features \((s_i, a_i) = (s_t, a_t)\)
    • set label \(y_i = r_t + \gamma \max_a  \hat Q(s_{t+1}, a)\)
  • The label is biased
    • \(\mathbb E[y_i|s_i, a_i]-Q^\star(s_i,a_i) =\)
      \(\gamma \mathbb E_{s'\sim P(s_i,a_i)}\big[\max_a \hat Q(s',a)-\max_{a'}Q^\star(s',a')\big]\)
  • Sources of variance: one step of \(P\) (the label does not depend on \(\pi\), since it maximizes over next actions)
  • Off policy: rollout with any policy \(\pi\) and estimate \(Q^\star\)

[Figure: rollout \(\dots,\ s_t,\ a_t\sim \pi(s_t),\ r_t\sim r(s_t, a_t),\ s_{t+1}\sim P(s_t, a_t),\ a_{t+1}\sim \pi(s_{t+1}),\ \dots\)]
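For contrast, a sketch of the Bellman-optimality label; it needs only \((s_t, a_t, r_t, s_{t+1})\), which is why any rollout policy can be used (off policy). The array-style `Q_hat[s]` indexing is an assumption.

```python
import numpy as np

def q_learning_label(transition, Q_hat, gamma):
    """Label y_i = r_t + gamma * max_a Q_hat(s_{t+1}, a).

    transition = (s_t, a_t, r_t, s_next); no next action is needed, and
    the bias depends on max_a Q_hat - max_a Q^star at s_next.
    """
    s, a, r, s_next = transition
    return r + gamma * np.max(Q_hat[s_next])   # Q_hat[s_next]: vector over actions
```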

Agenda

1. Recap

2. Labels via Bellman

3. (Q) Value-based RL

4. Preview: Optimization

Value-based RL

[Figure: agent/environment feedback loop: the policy selects actions, the environment returns state and reward, and the resulting experience becomes data.]

Key components of a value-based RL algorithm:

  1. Rollout policy
  2. Construct/update dataset
  3. Learn/update Q function

Value-based RL

Key components of a value-based RL algorithm:

  1. Rollout policy
  2. Construct/update dataset
  3. Learn/update Q function

Different choices for these components lead to different algorithms

  • We first discuss learning Q
  • Then we will cover 3 main algorithms

Q function Class

  • Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
  • In lecture, we often treat this as a black box. Examples:
    1. Tabular: \(Q(s,a)\) stored as a table over \(\mathcal S\times\mathcal A\)
    2. Parametric, e.g. deep (PA 3)
  • Sometimes, the minimization is only incrementally solved before returning \(\hat Q\), e.g. \(\hat Q(s,a) \leftarrow (1-\alpha) \hat Q(s,a) + \alpha y\) (see the sketch below) or gradient steps
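A minimal sketch of the incremental tabular update \(\hat Q(s,a)\leftarrow(1-\alpha)\hat Q(s,a)+\alpha y\) referenced above; the table size and step size are placeholder assumptions.

```python
import numpy as np

Q_hat = np.zeros((2, 2))   # assumed tabular estimate: |S| x |A| array

def incremental_update(Q_hat, s, a, y, alpha=0.1):
    """Move Q_hat(s, a) a fraction alpha of the way toward the label y."""
    Q_hat[s, a] = (1 - alpha) * Q_hat[s, a] + alpha * y
    return Q_hat
```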

Q function Class

  • Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
  • For a parametric class \(\mathcal Q = \{Q_\theta\mid\theta\in\mathbb R^d\}\)
    • gradient descent is $$\theta\leftarrow \theta - \alpha \nabla_\theta \left[\textstyle \sum_{i=1}^N (Q_\theta(s_i,a_i)-y_i)^2\right]$$
    • The gradient:
      • \(\nabla_\theta(Q_\theta(s_i,a_i)-y_i)^2 = 2(Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i)\)

Q function Class

  • Supervised learning subroutine sets \(\hat Q\) to solve: $$\min_{Q\in\mathcal Q} \textstyle\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2 $$
  • For a parametric class \(\mathcal Q = \{Q_\theta\mid\theta\in\mathbb R^d\}\)
    • gradient descent is $$\theta\leftarrow \theta -2 \alpha \textstyle \sum_{i=1}^N (Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i) $$
    • The gradient:
      • \(\nabla_\theta(Q_\theta(s_i,a_i)-y_i)^2 = 2(Q_\theta(s_i,a_i)-y_i)\nabla_\theta Q_\theta(s_i,a_i)\)
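A sketch of this gradient step for a linear-in-features class \(Q_\theta(s,a)=\theta^\top\phi(s,a)\), where \(\nabla_\theta Q_\theta(s,a)=\phi(s,a)\); the feature representation and batch shapes are assumptions, not the PA's network.

```python
import numpy as np

def gradient_step(theta, phi_batch, y_batch, alpha):
    """theta <- theta - 2 * alpha * sum_i (Q_theta(s_i, a_i) - y_i) * grad Q_theta.

    phi_batch: (N, d) array whose rows are phi(s_i, a_i); y_batch: (N,) labels.
    For linear Q_theta, grad_theta Q_theta(s, a) = phi(s, a).
    """
    residuals = phi_batch @ theta - y_batch        # Q_theta(s_i, a_i) - y_i
    return theta - 2 * alpha * phi_batch.T @ residuals
```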

1. Approx/Conservative PI

Initialize arbitrary \(\pi_0\), then for iterations \(i\):

  1. Rollout \(\pi_i\)
  2. Construct dataset with features \((s_t,a_t)\) and labels $$y_t = \sum_{k=t}^{t+h} r_k,\quad h\sim\mathrm{Geometric}(1-\gamma)$$
  3. \(\hat Q_i\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
    • \(\pi_{i+1}\) (incrementally) greedy w.r.t. \(\hat Q_i\)

("Montecarlo")

Epsilon-greedy policy

  • Alternative to incremental policy updates for encouraging exploration
  • Let \(\bar \pi\) be a greedy policy (e.g. \(\bar\pi(s) = \arg\max_a \hat Q(s,a)\))
  • Then the \(\epsilon\)-greedy policy follows \(\bar\pi\) with probability \(1-\epsilon\)
    • otherwise it selects an action from \(\mathcal A\) uniformly at random
  • Mathematically, with \(A=|\mathcal A|\), this is $$\pi(a|s) = \begin{cases}1-\epsilon+\frac{\epsilon}{A} & a=\bar\pi(s)\\ \frac{\epsilon}{A} & \text{otherwise} \end{cases}$$

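A minimal sketch of sampling an \(\epsilon\)-greedy action, assuming a tabular `Q_hat` stored as a \(|\mathcal S|\times|\mathcal A|\) array.

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy_action(Q_hat, s, epsilon):
    """With probability 1 - epsilon act greedily w.r.t. Q_hat, else uniformly."""
    num_actions = Q_hat.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))   # uniform over A
    return int(np.argmax(Q_hat[s]))             # greedy action argmax_a Q_hat(s, a)
```

Each non-greedy action ends up with probability \(\epsilon/A\) and the greedy one with \(1-\epsilon+\epsilon/A\), matching the formula above.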

2. "Temporal difference" PI

Initialize arbitrary \(\pi_0\), \(Q_0\), then for iterations \(i\):

  1. Rollout \(\pi_i\)
  2. Construct dataset with features \((s_t,a_t)\) and labels $$y_t = r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})$$
  3. \(\hat Q_{i+1}\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
    • \(\pi_{i+1}\) (incrementally or \(\epsilon\)) greedy w.r.t. \(\hat Q_{i+1}\)

3. "Q-learning"

Initialize arbitrary \(\pi_0\), \(Q_0\), then for iterations \(i\):

  1. Rollout \(\pi_i\)
  2. Append to dataset: features \((s_t,a_t)\) and labels $$y_t = r_t+\gamma \max_a \hat Q_{i}(s_{t+1}, a)$$
  3. \(\hat Q_{i+1}\) with supervised learning on \(\{(s_t,a_t),y_t\}\)
    • \(\pi_{i+1}\) (\(\epsilon\)) greedy w.r.t. \(\hat Q_{i+1}\)
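Putting the three components together, a sketch of tabular Q-learning with \(\epsilon\)-greedy rollouts and incremental updates; the `step(s, a)` simulator interface, episode structure, and hyperparameters are assumptions for illustration.

```python
import numpy as np

def q_learning(step, num_states, num_actions, gamma,
               epsilon=0.1, alpha=0.1, num_episodes=500, horizon=100):
    """Tabular Q-learning: epsilon-greedy rollouts, Bellman-optimality labels,
    and incremental supervised updates toward each label."""
    rng = np.random.default_rng(0)
    Q_hat = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        s = int(rng.integers(num_states))            # assumed initial state distribution
        for _ in range(horizon):
            # 1. rollout policy: epsilon-greedy w.r.t. the current Q_hat
            if rng.random() < epsilon:
                a = int(rng.integers(num_actions))
            else:
                a = int(np.argmax(Q_hat[s]))
            s_next, r = step(s, a)                   # assumed simulator call
            # 2. label via the Bellman optimality equation
            y = r + gamma * np.max(Q_hat[s_next])
            # 3. incremental update toward the label
            Q_hat[s, a] = (1 - alpha) * Q_hat[s, a] + alpha * y
            s = s_next
    return Q_hat
```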

Comparison

1. PI with MC ("Monte Carlo")

  • Data-driven PI
  • Label: \(\sum_{k=t}^{t+h} r_k\)
  • Multi-timestep
  • Unbiased
  • High variance
  • On policy

2. PI with TD

  • Data-driven PI
  • Label: \(r_t+\gamma \hat Q_{i}(s_{t+1}, a_{t+1})\)
  • Single timestep
  • Biased, depending on \(Q^\pi-\hat Q_{i}\)
  • Low variance
  • On policy

3. Q-learning

  • Data-driven VI
  • Label: \(r_t+\gamma \max_a \hat Q_{i}(s_{t+1}, a)\)
  • Single timestep
  • Biased, depending on \(Q^\star-\hat Q_{i}\)
  • Low variance
  • Off policy

Agenda

1. Recap

2. Labels via Bellman

3. (Q) Value-based RL

4. Preview: Optimization

Preview: Policy Optimization

  • Ultimate Goal: find (near) optimal policy
  • Value-based RL estimates intermediate quantities
    • \(Q^{\pi}\) or \(Q^{\star}\) are indirectly useful for finding optimal policy
  • Imitation learning had no intermediaries, but required data from an expert policy
  • Idea: optimize policy without relying on intermediaries:
    • objective as a function of policy: \(J(\pi) = \mathbb E_{s\sim \mu_0}[V^\pi(s)]\)
    • For parametric (e.g. deep) policy \(\pi_\theta\): $$J(\theta) = \mathbb E_{s\sim \mu_0}[V^{\pi_\theta}(s)]$$

[Figure: parabola illustrating the objective \(J(\theta)\) as a function of \(\theta\).]
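As a concrete (hypothetical) illustration, \(J(\theta)\) can be estimated by averaging truncated discounted returns from rollouts of \(\pi_\theta\); the `step`, `policy_theta`, and `sample_s0` interfaces below are assumptions.

```python
import numpy as np

def estimate_J(step, policy_theta, sample_s0, gamma, num_rollouts=100, horizon=200):
    """Monte Carlo estimate of J(theta) = E_{s ~ mu_0}[V^{pi_theta}(s)].

    sample_s0() -> s draws from mu_0; policy_theta(s) -> a; step(s, a) -> (s_next, r).
    The horizon truncation introduces a small bias of order gamma**horizon.
    """
    returns = []
    for _ in range(num_rollouts):
        s, G, discount = sample_s0(), 0.0, 1.0
        for _ in range(horizon):
            a = policy_theta(s)
            s, r = step(s, a)
            G += discount * r
            discount *= gamma
        returns.append(G)
    return float(np.mean(returns))
```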

Preview: Optimization

  • So far, we have discussed tabular and quadratic optimization
    • np.amin(J, axis=1)
    • for \(J(\theta) = a\theta^2 + b\theta + c\) with \(a<0\), the maximizer is \(\theta^\star = -\frac{b}{2a}\)
  • Next lecture, we discuss strategies for arbitrary differentiable functions (even ones that are unknown!)

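A tiny worked example of the quadratic case, with hypothetical coefficients and \(a<0\) so the parabola has a maximum:

```python
a, b, c = -2.0, 4.0, 1.0               # J(theta) = a*theta^2 + b*theta + c
theta_star = -b / (2 * a)              # closed-form maximizer
J_star = a * theta_star**2 + b * theta_star + c
print(theta_star, J_star)              # 1.0 3.0
```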

Recap

  • PSet 4 will be released Wednesday
  • PA 3 was released Friday, due 3/31

 

  • Rollout and Bellman Labels
  • Value-based RL

 

  • Next lecture: Optimization Overview