CS 4/5789: Introduction to Reinforcement Learning

Lecture 13: Fitted Value Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PA 2 due tonight
    • PSet 4 due Friday
    • PA 3 released later this week
  • Prelim grade released Wednesday

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Feedback in RL

[Diagram: control feedback loop — the policy chooses action \(a_t\), the environment returns state \(s_t\) and reward \(r_t\)]

  1. Control feedback: between states and actions
  2. Data feedback: between data and policy

[Diagram: data feedback loop — rolling out the policy \(\pi\) produces experience/data \((s_t,a_t,r_t)\), which is used to update the policy; the transitions \(P, f\) are unknown in Unit 2]

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • Important functions in MDPs
    • Transitions \(P(s'|s,a)\)
    • Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
    • Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
    • Optimal policy \(\pi^\star(s)\)

Supervised Learning for MDPs

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Recall: Value Iteration

Value Iteration

  • Initialize \(V^0\)
  • For \(k=0,\dots,K-1\):
    • \(V^{k+1}(s) = \max_{a\in\mathcal A}\left[ r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^{k}(s') \right]\right]\) for all \(s\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^K(s')]\right]\)

For an infinite horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation; this is a fixed-point iteration
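As a concrete reference point, here is a minimal NumPy sketch of this loop; the reward array `R` (shape \(S\times A\)) and transition tensor `P` (shape \(S\times A\times S\)) are assumed to be known, which is exactly the model access that the fitted methods later in the lecture avoid.

```python
import numpy as np

def value_iteration(R, P, gamma, K):
    """Tabular VI: R[s, a] = r(s, a), P[s, a, s'] = P(s' | s, a)."""
    num_states, num_actions = R.shape
    V = np.zeros(num_states)                 # initialize V^0 = 0
    for _ in range(K):
        # Bellman optimality backup: V^{k+1}(s) = max_a [ r(s,a) + gamma E_{s'}[V^k(s')] ]
        V = (R + gamma * P @ V).max(axis=1)
    # greedy policy with respect to the final value estimate V^K
    pi_hat = (R + gamma * P @ V).argmax(axis=1)
    return V, pi_hat
```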

Q-Value Iteration

Q-Value Iteration

  • Initialize \(Q^0\) (\(S\times A\) array)
  • For \(k=0,\dots,K-1\):
    • \(Q^{k+1}(s,a) =  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{k}(s',a') \right]\)
      for all \(s,a\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^{K}(s,a)\)

For an infinite horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation; this is a fixed-point iteration
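The same sketch in Q-form (again assuming known arrays `R` and `P`); the only change is that we store the full \(S\times A\) table and take the max inside the expectation.

```python
import numpy as np

def q_value_iteration(R, P, gamma, K):
    """Tabular Q-VI: R[s, a] = r(s, a), P[s, a, s'] = P(s' | s, a)."""
    Q = np.zeros(R.shape)                    # initialize Q^0 as an S x A array
    for _ in range(K):
        # Q^{k+1}(s,a) = r(s,a) + gamma * E_{s'~P(s,a)}[ max_{a'} Q^k(s', a') ]
        Q = R + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)                  # greedy policy: pi_hat(s) = argmax_a Q^K(s, a)
```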

Which step requires a model?

Data from trajectories

Q-Value Iteration: for all \(s,a\)

\(Q^{k+1}(s,a) =  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{k}(s',a') \right]\)

  • We can't evaluate the expectation without the model \(P\), or when \(\mathcal S\) and/or \(\mathcal A\) are too large

  • Instead we can collect data:

    • "Roll out" data collection policy \(\pi_{\text {data }}\)

    • Collect states and actions \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots \right\}\)

  • How can we use this?

\(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
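A sketch of this data collection step, assuming a Gym-style environment with `reset()`/`step(a)` and a stochastic policy callable `pi_data(s)` (both hypothetical names).

```python
def rollout(env, pi_data, N):
    """Roll out pi_data for N steps; return tau as (s_t, a_t, r_t, s_{t+1}) tuples."""
    traj = []
    s = env.reset()                    # assumes reset() returns the initial state
    for _ in range(N):
        a = pi_data(s)                 # a_t ~ pi_data(s_t)
        s_next, r, *_ = env.step(a)    # observe r_t and s_{t+1} ~ P(s_t, a_t)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj
```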

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

  • Ideally, we want to have $$Q^{k+1}(s, a) \approx r(s, a)+\gamma\mathbb{E}_{s^{\prime} \sim P(s, a)}\left[\max _{a^{\prime} \in A} Q^k\left(s^{\prime}, a^{\prime}\right)\right] \quad \forall s, a$$
  • Note that the RHS can also be written as $$\mathbb{E}\left[r\left(s, a\right)+\gamma\max _{a^{\prime}} Q^k\left(s', a^{\prime}\right) \mid s, a\right]$$
  • How to choose \(x\) and \(y\) for supervised learning?
    • \(x=\left(s, a\right)\)
    • \(y=r\left(s, a\right)+\gamma\max_{a'} Q^k\left(s', a'\right)\) where \(s'\sim P(s,a)\)
  • Then we have \(Q^{k+1}(s, a)=f(x)=\mathbb{E}[y \mid x]\)

To Supervised Learning

  • Convert \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots ,s_N,a_N\right\}\) into \(x,y\) pairs:
    • \(x_i=(s_i,a_i)\) and \(y_i=r(s_i,a_i)+\gamma \max_a Q^{k}(s_{i+1}, a)\)
  • Minimize the least-squares objective:$$\hat{f}=\arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N -1}\left(y_{i}-f\left(x_{i}\right)\right)^{2}$$
  • Then if we do supervised learning right
    • i.e. have enough data, choose a good \(\mathscr{F}\), and optimize well,$$Q^{k+1}\left(s, a\right):=\hat{f}(x) \approx \mathbb{E}[y \mid x]=\mathbb{E}\left[r\left(s, a\right)+\gamma\max _{a^{\prime}} Q^k\left(s', a^{\prime}\right) \mid s, a\right]$$
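A sketch of the conversion above, assuming the trajectory is stored as `(s_i, a_i, r_i, s_{i+1})` tuples and `Q_k` is any callable returning the current estimate \(Q^k(s,a)\).

```python
import numpy as np

def make_regression_data(traj, Q_k, gamma, num_actions):
    """Build (x_i, y_i) pairs: x_i = (s_i, a_i), y_i = r_i + gamma * max_a Q^k(s_{i+1}, a)."""
    xs, ys = [], []
    for (s, a, r, s_next) in traj:
        xs.append((s, a))
        ys.append(r + gamma * max(Q_k(s_next, ap) for ap in range(num_actions)))
    return xs, np.array(ys)
```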

To ERM

Fitted Q-Value Iteration

Fitted Q-Value Iteration

  • Input: dataset \(\tau \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (mapping \(\mathcal S\times \mathcal A\to \mathbb R\))
  • For \(k=0,\dots,K-1\): $$Q^{k+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a))\right)^{2}$$
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^K(s,a)\) for all \(s\)

Fixed point iteration using a fixed dataset and supervised learning
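Putting the pieces together, a minimal sketch of the algorithm above; `fit(xs, ys)` stands in for whatever least-squares regression routine plays the role of the supervised learning step, and the dataset is the list of transition tuples from the data-collection policy.

```python
import numpy as np

def fitted_q_iteration(traj, fit, gamma, num_actions, K):
    """Offline fitted Q-VI on a fixed dataset of (s, a, r, s') tuples.

    fit(xs, ys) should return a callable f(s, a) approximating E[y | x]."""
    Q = lambda s, a: 0.0                                   # initialize Q^0 = 0
    for _ in range(K):
        xs = [(s, a) for (s, a, r, sn) in traj]
        ys = np.array([r + gamma * max(Q(sn, ap) for ap in range(num_actions))
                       for (s, a, r, sn) in traj])         # fitted-VI regression targets
        Q = fit(xs, ys)                                    # supervised learning step
    return lambda s: max(range(num_actions), key=lambda a: Q(s, a))  # greedy pi_hat
```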

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

  • Instead of using fixed dataset, we can sample new trajectories at each iteration using a policy \(\pi^k\) defined by \(Q^k\)
  • Two key questions:
    1. How to define \(\pi^k\)?
      • incremental or encouraging exploration
    2. How to update \(Q^{k+1}\)?
      • full ERM or incremental with GD

Online Fitted VI (Q-learning)

[Diagram: feedback loop between policy and data via states and actions]
Data-collecting policy

  • The greedy policy is \(\bar \pi(s)=\arg\max_{a\in\mathcal A}  Q^k(s,a)\)
  • Option 1: Incremental update: a policy which follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi^{k-1}\))$$\pi^{k}(a|s) = (1-\alpha) \pi^{k-1}(a|s) + \alpha \bar \pi^{}(a|s)  $$
  • Option 2: The \(\epsilon\) greedy policy follows \(\bar\pi\) with probability \(1-\epsilon\) and otherwise selects an action from \(\mathcal A\) uniformly at random
    • Mathematically, this is $$\pi^{k}(a|s) = \begin{cases}1-\epsilon+\frac{\epsilon}{A} & a=\bar\pi(s)\\ \frac{\epsilon}{A} & \text{otherwise} \end{cases}$$
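A sketch of Option 2, with `Q_k` a callable and `num_actions` the size of \(\mathcal A\); the greedy action ends up with total probability \(1-\epsilon+\epsilon/A\) because the uniform branch can also select it, matching the formula above.

```python
import numpy as np

def epsilon_greedy_action(Q_k, s, num_actions, eps, rng=None):
    """Sample a ~ pi^k(.|s) for the epsilon-greedy policy defined by Q^k."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(num_actions))                       # explore: uniform over A
    return int(np.argmax([Q_k(s, a) for a in range(num_actions)]))  # exploit: greedy action
```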


Updating Q function

  • For a parametric class \(\mathscr F = \{f_\theta\mid\theta\in\mathbb R^d\}\), e.g.
    1. Tabular: one entry \(Q(s,a)\) per state-action pair (an \(\mathcal S\times\mathcal A\) array)
    2. Parametric, e.g. deep network (PA 3): \(f_\theta:\mathcal S\times\mathcal A\to\mathbb R\)
  • Option 1: Full ERM $$\theta_{k+1}=\arg\min_{\theta} \textstyle\sum_{i=0}^{N-1} (f_\theta(s_i,a_i)-y_i)^2 $$
  • Option 2: Incremental with gradient steps
    • a gradient descent step is $$\theta_{k+1}=\theta_k -\alpha \nabla_\theta \textstyle\sum_{i=0}^{N-1} (f_{\theta}(s_i,a_i)-y_i)^2\Big|_{\theta=\theta_k} $$
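A sketch of one such gradient step for the special case of a linear parameterization \(f_\theta(x)=\phi(x)^\top\theta\), with `phi` a hypothetical feature map over \(x=(s,a)\); for a deep network (PA 3), the same update is computed by automatic differentiation instead.

```python
import numpy as np

def gradient_step(theta, phi, xs, ys, alpha):
    """One GD step on sum_i (f_theta(x_i) - y_i)^2 for f_theta(x) = phi(x) @ theta."""
    Phi = np.stack([phi(x) for x in xs])       # (N, d) feature matrix
    residual = Phi @ theta - ys                # f_theta(x_i) - y_i
    grad = 2.0 * Phi.T @ residual              # gradient of the squared loss at theta
    return theta - alpha * grad
```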


Online Fitted VI

Online Fitted Q-Value Iteration (Q-learning)

  • Initialize function \(Q^0 \in\mathscr F\)
  • For \(k=0,\dots,K-1\):
    • define \(\pi^k\) from \(Q^k\) (e.g. epsilon greedy)
    • sample dataset \(\tau \sim \rho_{\pi^{k}}\)
    • update \(Q^{k+1}\) (e.g. incrementally with GD)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^{K}(s,a)\)
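A skeleton of this loop, with `define_policy`, `collect`, and `update` as placeholders for the choices discussed on the previous slides (e.g. epsilon-greedy, rollouts, and gradient steps on the regression targets).

```python
def online_fitted_qvi(env, Q0, define_policy, collect, update, K):
    """Online fitted Q-VI (Q-learning) skeleton."""
    Q = Q0
    for _ in range(K):
        pi_k = define_policy(Q)        # e.g. epsilon-greedy with respect to Q^k
        data = collect(env, pi_k)      # sample dataset tau ~ rho_{pi^k}
        Q = update(Q, data)            # e.g. incremental GD on the fitted-VI targets
    return Q                           # act greedily: pi_hat(s) = argmax_a Q(s, a)
```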

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Finite Horizon: DP

  • For finite horizon MDPs, DP gives the exact solution, so we didn't need a fixed-point algorithm
    • (when we have tractable models)
  • The dynamic programming step: $$ Q_{t}(s,a) = r(s,a)+ \mathbb{E}_{s' \sim P(s, a)}\left[\max _{a'\in\mathcal A} Q_{t+1} (s', a')\right]$$
  • When we don't have tractable models, we can use fixed point iteration and supervised learning!
  • We now need multiple trajectories: $$\tau_1,\dots,\tau_N\sim\pi_{\text {data}}$$

Finite Horizon: DP

Fitted Q-Value Iteration (finite horizon)

  • Input: offline dataset \(\tau_1,\dots,\tau_N \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (maps \(\mathcal S\times\mathcal A\times \{0,\dots,H-1\} \to \mathbb R \))
  • For \(k=0,\dots,K-1\): $$\displaystyle Q^{k+1} =  \arg \min _{f \in \mathscr{F}} \sum_{i=1}^{N} \sum_{t=0}^{H-1}\left(f_t(s_t^i,a_t^i)-(r(s_t^i,a_t^i)+ \max _{a} Q^k_{t+1} (s_{t+1}^i, a))\right)^{2}$$ (with the convention \(Q^k_H\equiv 0\))
  • Return \(\displaystyle \hat\pi_t(s) = \arg\max_{a\in\mathcal A}Q_t^{K}(s,a)\) for all \(s,t\)

Similar extension for online data collection
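A sketch for the finite-horizon case under the convention \(Q_H\equiv 0\); it fits one regression per time step, sweeping backward so that \(Q_{t+1}\) is already fitted when forming the targets for \(Q_t\) (a single backward sweep, rather than the \(K\) joint iterations written above). `fit(xs, ys)` is again an assumed regression routine.

```python
import numpy as np

def finite_horizon_fitted_qvi(trajs, fit, H, num_actions):
    """Fitted Q-VI with a backward sweep over t = H-1, ..., 0.

    trajs: N trajectories, each a list of (s_t, a_t, r_t, s_{t+1}) tuples for t = 0..H-1.
    Returns per-step Q functions [Q_0, ..., Q_{H-1}]."""
    Q = [None] * H
    Q_next = lambda s, a: 0.0                                # convention: Q_H = 0
    for t in reversed(range(H)):
        xs = [(tr[t][0], tr[t][1]) for tr in trajs]          # x_i = (s_t^i, a_t^i)
        ys = np.array([tr[t][2]
                       + max(Q_next(tr[t][3], a) for a in range(num_actions))
                       for tr in trajs])                     # y_i = r_t^i + max_a Q_{t+1}(s_{t+1}^i, a)
        Q[t] = fit(xs, ys)                                   # supervised learning at step t
        Q_next = Q[t]
    return Q                                                 # pi_t(s) = argmax_a Q[t](s, a)
```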

Recap

  • PA 2 due tonight
  • PSet 4 due Friday

 

  • Fitted VI
  • Online Fitted VI
  • Finite Horizon Fitted VI

 

  • Next lecture: Fitted Policy Iteration