CS 4/5789: Introduction to Reinforcement Learning

Lecture 3: Bellman Equations

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Announcements

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Recap: Infinite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor

[Agent-environment loop: the agent takes action \(a_t\in\mathcal A\) in state \(s_t\in\mathcal S\), receives reward \(r_t\sim r(s_t, a_t)\), and the state transitions to \(s_{t+1}\sim P(s_t, a_t)\).]

Recap: Infinite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor

Goal: find a policy \(\pi\) to

maximize   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

Recap: Behavioral Cloning

Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$

minimize over \(\pi\)   \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

[Figure: expert trajectory vs. learned policy]

No training data of "recovery" behavior!

Recap: Trajectory and State Distributions

  • Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) under policy \(\pi\) starting from initial distribution \(\mu_0\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
  • State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$

[Trajectory: \(s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \cdots\)]
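The trajectory probability above can be evaluated directly in the tabular setting. Below is a minimal sketch; the function name and the array conventions \(\mu_0[s]\), \(\pi[s,a]\), \(P[s,a,s']\) are illustrative assumptions, not from the slides.

```python
import numpy as np

def trajectory_probability(tau, mu0, pi, P):
    """Probability of trajectory tau = [(s0, a0), (s1, a1), ...] under policy pi.

    Assumed tabular conventions (numpy arrays):
      mu0[s]      -- initial state distribution
      pi[s, a]    -- probability of action a in state s
      P[s, a, s2] -- probability of transitioning from s to s2 under action a
    """
    s0, a0 = tau[0]
    prob = mu0[s0] * pi[s0, a0]
    # Multiply in P(s_i | s_{i-1}, a_{i-1}) * pi(a_i | s_i) for each later step.
    for (s_prev, a_prev), (s, a) in zip(tau[:-1], tau[1:]):
        prob *= P[s_prev, a_prev, s] * pi[s, a]
    return prob
```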

Recap: Trajectory and State Distributions

  • Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
  • State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$

Proposition: The state distribution evolves according to \( d_t = (P_\pi^t)^\top d_0\)

[Figure: \(P_\pi^\top\) is the \(S\times S\) matrix whose \((s', s)\) entry is \(P(s'\mid s,\pi(s))\).]
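A quick sketch of the proposition in code; the array names are illustrative assumptions, with \(P_\pi[s,s']\) denoting the transition probability from \(s\) to \(s'\) under the policy.

```python
import numpy as np

def state_distribution(P_pi, d0, t):
    """d_t = (P_pi^t)^T d_0, with P_pi[s, s2] = P(s2 | s, pi(s)) and d0 the initial distribution."""
    return np.linalg.matrix_power(P_pi, t).T @ d0
```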

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state

Value Function

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t), r_t\sim r(s_t, a_t)\right]$$

Simplification for the rest of the lecture: \(r(s,a)\) is deterministic.

Example

[Two-state MDP: from state 0, stay remains in 0 w.p. 1 and switch moves to 1 w.p. 1; from state 1, stay remains in 1 w.p. \(p_1\) and moves to 0 w.p. \(1-p_1\), while switch moves to 0 w.p. \(p_2\) and remains in 1 w.p. \(1-p_2\).]

  • Recall simple MDP example
  • Suppose the reward is:
    • \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
  • Consider the policy
    • \(\pi(s)=\)stay for all \(s\)
  • Simulate reward sequences

Rollout vs. Expectation


The cumulative reward of a given trajectory $$\sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$

The expected cumulative reward averages over all possible trajectories

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s,P,\pi\right]$$

  • If \(s_0=0\) then \(s_t=0\) for all \(t\)
  • PollEV: \(V^\pi(0) = \sum_{t=0}^\infty \gamma^t r(0,\pi(0))\)
    • \(=\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
  • If \(s_0=1\) then at some time \(T\) state transitions and then \(s_t=0\) for all \(t\geq T\)
  • \(V^\pi(1) = \mathbb E_{T}[ \sum_{t=0}^{T-1} \gamma^t r(1,\pi(1)) + \sum_{t=T}^\infty \gamma^t r(0,\pi(0))]\)
    • \(\displaystyle =\sum_{T=0}^\infty \mathbb P\{\text{switch at time } T\} \sum_{t=T}^\infty \gamma^t \)
    • \(\displaystyle=\sum_{T=0}^\infty p_1^T(1-p_1) \frac{\gamma^T}{1-\gamma}=\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\)
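This expectation over the random switch time can also be checked by averaging rollouts, as on the previous slide. A minimal Monte Carlo sketch; the values \(p_1=0.8\), \(\gamma=0.9\) and the truncation horizon are hypothetical choices.

```python
import numpy as np

p1, gamma = 0.8, 0.9                 # hypothetical parameters
rng = np.random.default_rng(0)

def rollout_return(s, horizon=500):
    """Discounted return of one sampled trajectory under pi = stay (truncated at `horizon`)."""
    total = 0.0
    for t in range(horizon):
        total += gamma**t * (1.0 if s == 0 else 0.0)   # r(0, a) = 1, r(1, a) = 0
        if s == 1 and rng.random() < 1 - p1:           # from 1, stay keeps s = 1 w.p. p1
            s = 0                                      # state 0 is absorbing under stay
    return total

mc_estimate = np.mean([rollout_return(1) for _ in range(5000)])
closed_form = (1 - p1) / ((1 - gamma * p1) * (1 - gamma))
print(mc_estimate, closed_form)      # the two should be close
```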

Example

[Markov chain under \(\pi=\textsf{stay}\): state 0 self-loops w.p. 1; state 1 remains in 1 w.p. \(p_1\) and moves to 0 w.p. \(1-p_1\).]

Example

[Two-state MDP diagram (as above).]

  • Recall the reward is:
    • \(r(0,a)=1\) and \(r(1,a)=0\)
      for all \(a\)
  • Consider the policy
    • \(\pi(1)=\)stay
    • \(\pi(0)=\begin{cases}\textsf{stay} & \text{w.p. } p_1\\ \textsf{switch}& \text{w.p. } 1-p_1\end{cases}\)

Example

[Markov chain induced by this policy: each state remains w.p. \(p_1\) and switches to the other state w.p. \(1-p_1\).]

  • Recall the reward is:
    • \(r(0,a)=1\) and \(r(1,a)=0\)
      for all \(a\)
  • What happens?
  • Is \(V^\pi(0)\) or \(V^\pi(1)\) larger?

Food for thought: what distribution determines the value function?

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi\right]$$

  • Discounted "steady-state" distribution $$ d_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_t $$

Steady-state?
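A sketch of how \(d_\gamma\) can be computed in the tabular setting. The closed form \(d_\gamma = (1-\gamma)(I-\gamma P_\pi^\top)^{-1} d_0\) follows from \(d_t = (P_\pi^t)^\top d_0\) and the geometric series; it is an added step, not stated on the slide.

```python
import numpy as np

def discounted_state_distribution(P_pi, d0, gamma):
    """d_gamma = (1 - gamma) * sum_t gamma^t d_t = (1 - gamma) * (I - gamma * P_pi^T)^{-1} d0."""
    S = P_pi.shape[0]
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)
```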

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Exercise: review proof (below)

Bellman Expectation Equation:

 \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)


The cumulative reward expression is almost recursive:

$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$

Bellman Expectation Equation

Proof

  • \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
  • \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\)
    (linearity of expectation)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\)
    (simplifying conditional expectation, re-indexing sum)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
  • \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\)
    (definition of value function and linearity of expectation)

Example

[Markov chain induced by the policy (as above): each state remains w.p. \(p_1\) and switches w.p. \(1-p_1\).]

Recall \(r(0,a)=1\) and \(r(1,a)=0\)

  • \(V^\pi(s)=\mathbb E_a[r(s,a)+\gamma \mathbb E_{s'}[V^\pi(s')]]\)
  • \(V^\pi(0)=p_1(1+\gamma V^\pi(0)) \)
    \(\qquad \qquad + (1-p_1)(1+\gamma V^\pi(1))\)
    • \(V^\pi(0)=\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
  • \(V^\pi(1)=0+\gamma(p_1V^\pi(1) + (1-p_1)V^\pi(0))\)
    • \(V^\pi(1)=\frac{\gamma( 1-p_1)}{1-\gamma p_1}V^\pi(0)\)
    • \(V^\pi(1)=\frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
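A quick numerical check of these closed forms against the two Bellman equations above; the values \(p_1=0.8\), \(\gamma=0.9\) are hypothetical.

```python
import numpy as np

p1, gamma = 0.8, 0.9   # hypothetical parameters

denom = (1 - gamma * p1)**2 - gamma**2 * (1 - p1)**2
V0 = (1 - gamma * p1) / denom          # V^pi(0)
V1 = gamma * (1 - p1) / denom          # V^pi(1)

# Both Bellman equations from the slide should hold.
assert np.isclose(V0, p1 * (1 + gamma * V0) + (1 - p1) * (1 + gamma * V1))
assert np.isclose(V1, gamma * (p1 * V1 + (1 - p1) * V0))
```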

Agenda

 

1. Recap

2. Value Function

3. Bellman Equation

4. Policy Evaluation

Policy Evaluation

  • Consider deterministic policy. The Bellman equation:
    •  \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P( s, \pi(s))} [V^\pi(s')] \)
    •  \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S}  P(s'\mid s, \pi(s)) V^\pi(s') \)
  • As with the state distribution, consider \(V^\pi\) as an \(S=|\mathcal S|\) dimensional vector, and likewise \(R^\pi\) with entries \(R^\pi(s) = r(s,\pi(s))\)
    •  \(V^{\pi}(s) = R^\pi(s) + \gamma \langle P_\pi(s) , V^\pi \rangle \)

[Figure: row \(s\) of \(P_\pi\) is the distribution \(P(\cdot\mid s,\pi(s))\).]
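In the tabular setting, \(R^\pi\) and \(P_\pi\) can be assembled directly from the MDP arrays. A minimal sketch for a deterministic policy; the array conventions are illustrative assumptions.

```python
import numpy as np

def policy_matrices(P, r, pi):
    """Build R_pi and P_pi for a deterministic policy.

    Assumed conventions: P[s, a, s2] transition probabilities, r[s, a] rewards,
    pi[s] the action index chosen in state s.
    """
    S = P.shape[0]
    R_pi = r[np.arange(S), pi]        # R_pi[s]     = r(s, pi(s))
    P_pi = P[np.arange(S), pi, :]     # P_pi[s, s2] = P(s2 | s, pi(s))
    return R_pi, P_pi
```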

Policy Evaluation

  • \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S}  P(s'\mid s, \pi(s)) V^\pi(s') \)
  • The matrix vector form of the Bellman Equation is

\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)

[Figure: the vector \(V^\pi\) with entries \(V^\pi(s)\) equals the vector \(R^\pi\) with entries \(r(s,\pi(s))\) plus \(\gamma\) times the matrix \(P_\pi\) with entries \(P(s'\mid s,\pi(s))\) applied to \(V^\pi\).]

Exact Policy Evaluation

\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)

To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations:

  • \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
  • (PSet 1)

Matrix inversion is slow! \(\mathcal O(S^3)\)
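A sketch of exact policy evaluation; solving the linear system directly (rather than forming the inverse) is a standard numerical choice, but either way the cost is \(\mathcal O(S^3)\).

```python
import numpy as np

def exact_policy_evaluation(R_pi, P_pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi exactly."""
    S = R_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```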

Approximate Policy Evaluation

To trade off accuracy for computation time, we can use a fixed-point iteration algorithm.

Approximate Policy Evaluation:

  • Initialize \(V_0\)
  • For \(t=0,1,\dots, T\):
    • \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)

Complexity of each iteration?

  • \(\mathcal O(S^2)\)
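A minimal sketch of the fixed-point iteration above; the function name and defaults are illustrative.

```python
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, T, V0=None):
    """Iterate V_{t+1} = R_pi + gamma * P_pi @ V_t for T steps; each step is O(S^2)."""
    V = np.zeros_like(R_pi, dtype=float) if V0 is None else np.array(V0, dtype=float)
    for _ in range(T):
        V = R_pi + gamma * P_pi @ V
    return V
```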

Example

[Markov chain induced by the policy (as above): each state remains w.p. \(p_1\) and switches w.p. \(1-p_1\).]

Recall \(r(0,a)=1\) and \(r(1,a)=0\)

  • \(R^\pi = \begin{bmatrix} 1\\ 0\end{bmatrix} \)
  • \(P_\pi = \begin{bmatrix} p_1 & 1-p_1 \\ 1-p_1 &p_1 \end{bmatrix}\)
  • Exercises:
    • Approx PE with \(V_0 = [1,1]\)
    • Confirm \(V^\pi=\begin{bmatrix}\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \\ \frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \end{bmatrix}\) is a fixed point
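A sketch of the first exercise, with hypothetical values \(p_1=0.8\), \(\gamma=0.9\); the iterates should approach the claimed fixed point.

```python
import numpy as np

p1, gamma = 0.8, 0.9                       # hypothetical parameters
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1],
                 [1 - p1, p1]])

V = np.array([1.0, 1.0])                   # V_0 = [1, 1]
for _ in range(200):                       # Approx PE iterations
    V = R_pi + gamma * P_pi @ V

denom = (1 - gamma * p1)**2 - gamma**2 * (1 - p1)**2
V_star = np.array([1 - gamma * p1, gamma * (1 - p1)]) / denom
print(V, V_star)                           # V should be close to V_star
```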

To show that Approx PE works, we first prove a contraction lemma

Convergence of Approx PE

Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$

Proof

  • \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
  • \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
  • \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) norm definition
  • \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) expectation definition
  • \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) basic inequality (PSet 1)
  • \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) basic inequality (PSet 1)
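The contraction can also be observed numerically on the running example (hypothetical values \(p_1=0.8\), \(\gamma=0.9\)): each iteration shrinks the \(\infty\)-norm error by at least a factor \(\gamma\).

```python
import numpy as np

p1, gamma = 0.8, 0.9                                      # hypothetical parameters
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1], [1 - p1, p1]])
V_star = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)  # exact V^pi

V = np.array([1.0, 1.0])
for t in range(10):
    V_next = R_pi + gamma * P_pi @ V
    ratio = np.linalg.norm(V_next - V_star, np.inf) / np.linalg.norm(V - V_star, np.inf)
    assert ratio <= gamma + 1e-12                         # contraction factor at most gamma
    V = V_next
```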

Convergence of Approx PE

Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$

so an \(\epsilon\)-correct solution requires

\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)

Proof

  • First statement follows by induction using the Lemma
  • For the second statement, we require
    • \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
    • Taking \(\log\) of both sides,
    • \(T\log \gamma + \log  \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
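For a concrete sense of the bound, with hypothetical numbers \(\gamma=0.9\), \(\|V_0-V^\pi\|_\infty=10\), and \(\epsilon=0.01\):

```python
import numpy as np

gamma, gap, eps = 0.9, 10.0, 0.01      # hypothetical numbers
T = np.log(gap / eps) / np.log(1 / gamma)
print(int(np.ceil(T)))                 # about 66 iterations suffice
```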

Recap

  • PSet 1 released today
  • Office Hours on Ed

 

  • Value Function
  • Bellman Equation
  • Policy Evaluation

 

  • Next lecture: Optimality