CS 4/5789: Introduction to Reinforcement Learning

Lecture 13: Fitted Value Iteration

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PA 2 due tonight
    • PSet 4 due Friday
    • PA 3 released later this week
  • Prelim grade released Wednesday

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Feedback in RL

[Diagram: control feedback loop — the policy chooses action \(a_t\), the environment returns state \(s_t\) and reward \(r_t\)]

  1. Control feedback: between states and actions
  2. Data feedback: between data and policy

[Diagram: data feedback loop — rolling out the policy \(\pi\) produces experience/data \((s_t,a_t,r_t)\), which is used to update the policy; the transitions \(P, f\) are unknown in Unit 2]

  • Supervised learning: features \(x\) and labels \(y\)
    • Goal: predict labels with \(\hat f(x)\approx \mathbb E[y|x]\)
    • Requirements: dataset \(\{x_i,y_i\}_{i=1}^N\)
    • Method: \(\hat f = \arg\min_{f\in\mathcal F} \sum_{i=1}^N (f(x_i)-y_i)^2\)
  • Important functions in MDPs
    • Transitions \(P(s'|s,a)\)
    • Value/Q of a policy \(V^\pi(s)\) and \(Q^\pi(s,a)\)
    • Optimal Value/Q \(V^\star(s)\) and \(Q^\star(s,a)\)
    • Optimal policy \(\pi^\star(s)\)

Supervised Learning for MDPs

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Recall: Value Iteration

Value Iteration

  • Initialize \(V^0\)
  • For \(k=0,\dots,K-1\):
    • \(V^{k+1}(s) = \max_{a\in\mathcal A}\left[ r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^{k}(s') \right]\right]\) for all \(s\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}\left[r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^K(s')]\right]\)

For an infinite horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation; this is a fixed-point iteration
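As a concrete reference point, here is a minimal NumPy sketch of this loop; the reward array `R` (shape \(S\times A\)) and transition tensor `P` (shape \(S\times A\times S\)) are assumed to be known, which is exactly the model access that the fitted methods later in the lecture avoid.

```python
import numpy as np

def value_iteration(R, P, gamma, K):
    """Tabular VI: R[s, a] = r(s, a), P[s, a, s'] = P(s' | s, a)."""
    num_states, num_actions = R.shape
    V = np.zeros(num_states)                 # initialize V^0 = 0
    for _ in range(K):
        # Bellman optimality backup: V^{k+1}(s) = max_a [ r(s,a) + gamma E_{s'}[V^k(s')] ]
        V = (R + gamma * P @ V).max(axis=1)
    # greedy policy with respect to the final value estimate V^K
    pi_hat = (R + gamma * P @ V).argmax(axis=1)
    return V, pi_hat
```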

Q-Value Iteration

Q-Value Iteration

  • Initialize \(Q^0\) (\(S\times A\) array)
  • For \(k=0,\dots,K-1\):
    • \(Q^{k+1}(s,a) =  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{k}(s',a') \right]\)
      for all \(s,a\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^{K}(s,a)\)

For an infinite horizon discounted MDP, we repeatedly apply the Bellman Optimality Equation; this is a fixed-point iteration
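The same sketch in Q-form (again assuming known arrays `R` and `P`); the only change is that we store the full \(S\times A\) table and take the max inside the expectation.

```python
import numpy as np

def q_value_iteration(R, P, gamma, K):
    """Tabular Q-VI: R[s, a] = r(s, a), P[s, a, s'] = P(s' | s, a)."""
    Q = np.zeros(R.shape)                    # initialize Q^0 as an S x A array
    for _ in range(K):
        # Q^{k+1}(s,a) = r(s,a) + gamma * E_{s'~P(s,a)}[ max_{a'} Q^k(s', a') ]
        Q = R + gamma * P @ Q.max(axis=1)
    return Q.argmax(axis=1)                  # greedy policy: pi_hat(s) = argmax_a Q^K(s, a)
```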

Which step requires a model?

Data from trajectories

Q-Value Iteration: for all \(s,a\)

\(Q^{k+1}(s,a) =  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{k}(s',a') \right]\)

  • We can't evaluate the expectation without the model \(P\), or when \(\mathcal S\) and/or \(\mathcal A\) are too large

  • Instead we can collect data:

    • "Roll out" data collection policy \(\pi_{\text {data }}\)

    • Collect states and actions \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots \right\}\)

  • How can we use this?

\(s_t,\quad a_t\sim \pi(s_t),\quad r_t\sim r(s_t, a_t),\quad s_{t+1}\sim P(s_t, a_t),\quad a_{t+1}\sim \pi(s_{t+1}),\ \dots\)
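A sketch of this data collection step, assuming a Gym-style environment with `reset()`/`step(a)` and a stochastic policy callable `pi_data(s)` (both hypothetical names).

```python
def rollout(env, pi_data, N):
    """Roll out pi_data for N steps; return tau as (s_t, a_t, r_t, s_{t+1}) tuples."""
    traj = []
    s = env.reset()                    # assumes reset() returns the initial state
    for _ in range(N):
        a = pi_data(s)                 # a_t ~ pi_data(s_t)
        s_next, r, *_ = env.step(a)    # observe r_t and s_{t+1} ~ P(s_t, a_t)
        traj.append((s, a, r, s_next))
        s = s_next
    return traj
```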

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

  • Ideally, we want to have $$Q^{k+1}(s, a) \approx r(s, a)+\gamma\mathbb{E}_{s^{\prime} \sim P(s, a)}\left[\max _{a^{\prime} \in A} Q^k\left(s^{\prime}, a^{\prime}\right)\right] \quad \forall s, a$$
  • Note that the RHS can also be written as $$\mathbb{E}\left[r\left(s, a\right)+\gamma\max _{a^{\prime}} Q^k\left(s', a^{\prime}\right) \mid s, a\right]$$
  • How to choose \(x\) and \(y\) for supervised learning?
    • \(x=\left(s, a\right)\)
    • \(y=r\left(s, a\right)+\gamma\max_{a'} Q^k\left(s', a'\right)\) where \(s'\sim P(s,a)\)
  • Then we have \(Q^{k+1}(s, a)=f(x)=\mathbb{E}[y \mid x]\)

To Supervised Learning

  • Convert \(\tau=\left\{s_{0}, a_{0}, s_{1}, a_{1}, \ldots ,s_N,a_N\right\}\) into \(x,y\) pairs:
    • \(x_i=(s_i,a_i)\) and \(y_i=r(s_i,a_i)+\gamma \max_a Q^{k}(s_{i+1}, a)\)
  • Minimize the least-squares objective:$$\hat{f}=\arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N -1}\left(y_{i}-f\left(x_{i}\right)\right)^{2}$$
  • Then if we do supervised learning right
    • i.e. have enough data, choose a good \(\mathscr{F}\), and optimize well,$$Q^{k+1}\left(s, a\right):=\hat{f}(x) \approx \mathbb{E}[y \mid x]=\mathbb{E}\left[r\left(s, a\right)+\gamma\max _{a^{\prime}} Q^k\left(s', a^{\prime}\right) \mid s, a\right]$$
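A sketch of the conversion above, assuming the trajectory is stored as `(s_i, a_i, r_i, s_{i+1})` tuples and `Q_k` is any callable returning the current estimate \(Q^k(s,a)\).

```python
import numpy as np

def make_regression_data(traj, Q_k, gamma, num_actions):
    """Build (x_i, y_i) pairs: x_i = (s_i, a_i), y_i = r_i + gamma * max_a Q^k(s_{i+1}, a)."""
    xs, ys = [], []
    for (s, a, r, s_next) in traj:
        xs.append((s, a))
        ys.append(r + gamma * max(Q_k(s_next, ap) for ap in range(num_actions)))
    return xs, np.array(ys)
```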

To ERM

Fitted Q-Value Iteration

Fitted Q-Value Iteration

  • Input: dataset \(\tau \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (mapping \(\mathcal S\times \mathcal A\to \mathbb R\))
  • For \(k=0,\dots,K-1\): $$Q^{k+1}=  \arg \min _{f \in \mathscr{F}} \sum_{i=0}^{N-1} \left(f(s_i,a_i)-(r(s_i,a_i)+\gamma \max _{a} Q^k (s_{i+1}, a))\right)^{2}$$
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^K(s,a)\) for all \(s\)

Fixed point iteration using a fixed dataset and supervised learning
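Putting the pieces together, a minimal sketch of the algorithm above; `fit(xs, ys)` stands in for whatever least-squares regression routine plays the role of the supervised learning step, and the dataset is the list of transition tuples from the data-collection policy.

```python
import numpy as np

def fitted_q_iteration(traj, fit, gamma, num_actions, K):
    """Offline fitted Q-VI on a fixed dataset of (s, a, r, s') tuples.

    fit(xs, ys) should return a callable f(s, a) approximating E[y | x]."""
    Q = lambda s, a: 0.0                                   # initialize Q^0 = 0
    for _ in range(K):
        xs = [(s, a) for (s, a, r, sn) in traj]
        ys = np.array([r + gamma * max(Q(sn, ap) for ap in range(num_actions))
                       for (s, a, r, sn) in traj])         # fitted-VI regression targets
        Q = fit(xs, ys)                                    # supervised learning step
    return lambda s: max(range(num_actions), key=lambda a: Q(s, a))  # greedy pi_hat
```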

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

  • Instead of using fixed dataset, we can sample new trajectories at each iteration using a policy \(\pi^k\) defined by \(Q^k\)
  • Two key questions:
    1. How to define \(\pi^k\)?
      • incremental or encouraging exploration
    2. How to update \(Q^{k+1}\)?
      • full ERM or incremental with GD

Online Fitted VI (Q-learning)

[Diagram: feedback loop between policy and data via states and actions]
Data-collecting policy

  • The greedy policy is \(\bar \pi(s)=\arg\max_{a\in\mathcal A}  Q^k(s,a)\)
  • Option 1: Incremental update: a policy which follows \(\bar \pi\) with probability \(\alpha\) (and otherwise \(\pi^{k-1}\))$$\pi^{k}(a|s) = (1-\alpha) \pi^{k-1}(a|s) + \alpha \bar \pi^{}(a|s)  $$
  • Option 2: The \(\epsilon\) greedy policy follows \(\bar\pi\) with probability \(1-\epsilon\) and otherwise selects an action from \(\mathcal A\) uniformly at random
    • Mathematically, this is $$\pi^{k}(a|s) = \begin{cases}1-\epsilon+\frac{\epsilon}{A} & a=\bar\pi(s)\\ \frac{\epsilon}{A} & \text{otherwise} \end{cases}$$
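A sketch of Option 2, with `Q_k` a callable and `num_actions` the size of \(\mathcal A\); the greedy action ends up with total probability \(1-\epsilon+\epsilon/A\) because the uniform branch can also select it, matching the formula above.

```python
import numpy as np

def epsilon_greedy_action(Q_k, s, num_actions, eps, rng=None):
    """Sample a ~ pi^k(.|s) for the epsilon-greedy policy defined by Q^k."""
    rng = rng or np.random.default_rng()
    if rng.random() < eps:
        return int(rng.integers(num_actions))                       # explore: uniform over A
    return int(np.argmax([Q_k(s, a) for a in range(num_actions)]))  # exploit: greedy action
```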


Updating Q function

  • For a parametric class \(\mathscr F = \{f_\theta\mid\theta\in\mathbb R^d\}\), e.g.
    1. Tabular: one entry \(Q(s,a)\) per state-action pair (an \(\mathcal S\times\mathcal A\) array)
    2. Parametric, e.g. deep network (PA 3): \(f_\theta:\mathcal S\times\mathcal A\to\mathbb R\)
  • Option 1: Full ERM $$\theta_{k+1}=\arg\min_{\theta} \textstyle\sum_{i=0}^{N-1} (f_\theta(s_i,a_i)-y_i)^2 $$
  • Option 2: Incremental with gradient steps
    • a gradient descent step is $$\theta_{k+1}=\theta_k -\alpha \nabla_\theta \textstyle\sum_{i=0}^{N-1} (f_{\theta}(s_i,a_i)-y_i)^2\Big|_{\theta=\theta_k} $$
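A sketch of one such gradient step for the special case of a linear parameterization \(f_\theta(x)=\phi(x)^\top\theta\), with `phi` a hypothetical feature map over \(x=(s,a)\); for a deep network (PA 3), the same update is computed by automatic differentiation instead.

```python
import numpy as np

def gradient_step(theta, phi, xs, ys, alpha):
    """One GD step on sum_i (f_theta(x_i) - y_i)^2 for f_theta(x) = phi(x) @ theta."""
    Phi = np.stack([phi(x) for x in xs])       # (N, d) feature matrix
    residual = Phi @ theta - ys                # f_theta(x_i) - y_i
    grad = 2.0 * Phi.T @ residual              # gradient of the squared loss at theta
    return theta - alpha * grad
```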


Online Fitted VI

Online Fitted Q-Value Iteration (Q-learning)

  • Initialize function \(Q^0 \in\mathscr F\)
  • For \(k=0,\dots,K-1\):
    • define \(\pi^k\) from \(Q^k\) (e.g. epsilon greedy)
    • sample dataset \(\tau \sim \rho_{\pi^{k}}\)
    • update \(Q^{k+1}\) (e.g. incrementally with GD)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A}Q^{K}(s,a)\)
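A skeleton of this loop, with `define_policy`, `collect`, and `update` as placeholders for the choices discussed on the previous slides (e.g. epsilon-greedy, rollouts, and gradient steps on the regression targets).

```python
def online_fitted_qvi(env, Q0, define_policy, collect, update, K):
    """Online fitted Q-VI (Q-learning) skeleton."""
    Q = Q0
    for _ in range(K):
        pi_k = define_policy(Q)        # e.g. epsilon-greedy with respect to Q^k
        data = collect(env, pi_k)      # sample dataset tau ~ rho_{pi^k}
        Q = update(Q, data)            # e.g. incremental GD on the fitted-VI targets
    return Q                           # act greedily: pi_hat(s) = argmax_a Q(s, a)
```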

Agenda

1. Recap

2. Value Iteration

3. Fitted VI

4. Online Fitted VI

5. Finite Horizon

Finite Horizon: DP

  • For finite horizon MDPs, DP gives the exact solution, so we didn't need a fixed-point algorithm
    • (when we have tractable models)
  • The dynamic programming step: $$ Q_{t}(s,a) = r(s,a)+ \mathbb{E}_{s' \sim P(s, a)}\left[\max _{a'\in\mathcal A} Q_{t+1} (s', a')\right]$$
  • When we don't have tractable models, we can use fixed point iteration and supervised learning!
  • We now need multiple trajectories: $$\tau_1,\dots,\tau_N\sim\pi_{\text {data}}$$

Finite Horizon: DP

Fitted Q-Value Iteration (finite horizon)

  • Input: offline dataset \(\tau_1,\dots,\tau_N \sim \rho_{\pi_{\text {data }}}\)
  • Initialize function \(Q^0 \in\mathscr F\) (maps \(\mathcal S\times\mathcal A\times \{0,\dots,H-1\} \to \mathbb R \))
  • For \(k=0,\dots,K-1\): $$\displaystyle Q^{k+1} =  \arg \min _{f \in \mathscr{F}} \sum_{i=1}^{N} \sum_{t=0}^{H-1}\left(f_t(s_t^i,a_t^i)-(r(s_t^i,a_t^i)+ \max _{a} Q^k_{t+1} (s_{t+1}^i, a))\right)^{2}$$ (with the convention \(Q^k_H\equiv 0\))
  • Return \(\displaystyle \hat\pi_t(s) = \arg\max_{a\in\mathcal A}Q_t^{K}(s,a)\) for all \(s,t\)

Similar extension for online data collection
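A sketch for the finite-horizon case under the convention \(Q_H\equiv 0\); it fits one regression per time step, sweeping backward so that \(Q_{t+1}\) is already fitted when forming the targets for \(Q_t\) (a single backward sweep, rather than the \(K\) joint iterations written above). `fit(xs, ys)` is again an assumed regression routine.

```python
import numpy as np

def finite_horizon_fitted_qvi(trajs, fit, H, num_actions):
    """Fitted Q-VI with a backward sweep over t = H-1, ..., 0.

    trajs: N trajectories, each a list of (s_t, a_t, r_t, s_{t+1}) tuples for t = 0..H-1.
    Returns per-step Q functions [Q_0, ..., Q_{H-1}]."""
    Q = [None] * H
    Q_next = lambda s, a: 0.0                                # convention: Q_H = 0
    for t in reversed(range(H)):
        xs = [(tr[t][0], tr[t][1]) for tr in trajs]          # x_i = (s_t^i, a_t^i)
        ys = np.array([tr[t][2]
                       + max(Q_next(tr[t][3], a) for a in range(num_actions))
                       for tr in trajs])                     # y_i = r_t^i + max_a Q_{t+1}(s_{t+1}^i, a)
        Q[t] = fit(xs, ys)                                   # supervised learning at step t
        Q_next = Q[t]
    return Q                                                 # pi_t(s) = argmax_a Q[t](s, a)
```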

Recap

  • PA 2 due tonight
  • PSet 4 due Friday

 

  • Fitted VI
  • Online Fitted VI
  • Finite Horizon Fitted VI

 

  • Next lecture: Fitted Policy Iteration