CS 4/5789: Introduction to Reinforcement Learning
Lecture 3: Bellman Equations
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Recap: Infinite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor


[Diagram: agent-environment loop. The agent observes state \(s_t\in\mathcal S\) and takes action \(a_t\in\mathcal A\); it receives reward \(r_t\sim r(s_t, a_t)\) and the environment transitions to \(s_{t+1}\sim P(s_t, a_t)\).]
Recap: Infinite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor
Goal: find a policy \(\pi\) to
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
Recap: Behavioral Cloning
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
Learn \(\pi\) by minimizing \(\sum_{i=1}^N \ell(\pi(s_i), a_i)\)

[Figure: expert trajectory vs. learned policy]
No training data of "recovery" behavior!
Recap: Trajectory and State Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) under policy \(\pi\) starting from initial distribution \(\mu_0\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
- State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$
[Diagram: trajectory unrolling \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)]
Recap: Trajectory and State Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\): $$\mathbb{P}_{\mu_0}^\pi (\tau)=\mu_0(s_0)\pi(a_0 \mid s_0)\displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
- State distribution \(d_t\) as \({|\mathcal S|}\) dimensional vector with $$ d_t[s] = \mathbb{P}\{s_t=s\mid \mu_0,\pi\} $$
Proposition: The state distribution evolves according to \( d_t = (P_\pi^t)^\top d_0\)
[Diagram: the matrix \(P_\pi^\top\), whose \((s', s)\) entry is \(P(s'\mid s,\pi(s))\)]
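As a quick sanity check of the proposition, here is a minimal numerical sketch (my own, with an arbitrary two-state chain standing in for \(P_\pi\)) comparing step-by-step propagation with the closed form:

```python
# Minimal numerical check of d_t = (P_pi^t)^T d_0 for an arbitrary two-state chain.
import numpy as np

P_pi = np.array([[0.9, 0.1],      # row s, column s': P(s' | s, pi(s))
                 [0.3, 0.7]])
d0 = np.array([1.0, 0.0])         # start deterministically in state 0

# propagate one step at a time: d_{t+1} = P_pi^T d_t
d = d0
for _ in range(5):
    d = P_pi.T @ d

# compare against the closed form d_5 = (P_pi^5)^T d_0
print(d, np.linalg.matrix_power(P_pi, 5).T @ d0)
```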
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
Value Function
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t), r_t\sim r(s_t, a_t)\right]$$
simplification for the rest of lecture: \(r(s,a)\) is deterministic
Example

[Diagram: two-state MDP. From state \(0\): stay remains at \(0\) w.p. \(1\), switch moves to \(1\) w.p. \(1\). From state \(1\): stay remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\); switch moves to \(0\) w.p. \(p_2\) and remains at \(1\) w.p. \(1-p_2\).]
- Recall simple MDP example
- Suppose the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- Consider the policy
- \(\pi(s)=\)stay for all \(s\)
- Simulate reward sequences
Rollout vs. Expectation
The cumulative reward of a given trajectory $$\sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$
The expected cumulative reward averages over all possible trajectories
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s,P,\pi\right]$$
- If \(s_0=0\) then \(s_t=0\) for all \(t\)
- PollEV: \(V^\pi(0) = \sum_{t=0}^\infty \gamma^t r(0,\pi(0))\)
- \(=\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
- If \(s_0=1\) then at some time \(T\) state transitions and then \(s_t=0\) for all \(t\geq T\)
- \(V^\pi(1) = \mathbb E_{T}\left[ \sum_{t=0}^{T-1} \gamma^t r(1,\pi(1)) + \sum_{t=T}^\infty \gamma^t r(0,\pi(0))\right]\), where \(T\geq 1\) is the (random) time of the switch to state \(0\)
- \(\displaystyle =\sum_{k=1}^\infty \mathbb P\{T=k\} \sum_{t=k}^\infty \gamma^t \)
- \(\displaystyle=\sum_{k=1}^\infty p_1^{k-1}(1-p_1) \frac{\gamma^k}{1-\gamma}=\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
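The calculation above can be sanity-checked with a Monte Carlo rollout; the sketch below (my own, with illustrative values \(p_1=0.8\), \(\gamma=0.9\) and a truncated horizon) should land close to the closed-form value of \(V^\pi(1)\).

```python
# Monte Carlo estimate of V^pi(1) under pi = stay for the two-state example.
import random

p1, gamma = 0.8, 0.9
H, N = 200, 20_000        # truncation horizon (gamma^H is negligible) and number of rollouts

def discounted_return(s0):
    """Simulate one trajectory under pi = stay and return its discounted reward."""
    s, total = s0, 0.0
    for t in range(H):
        total += gamma**t * (1.0 if s == 0 else 0.0)   # r(0,a) = 1, r(1,a) = 0
        if s == 1 and random.random() < 1 - p1:        # leave state 1 w.p. 1 - p_1
            s = 0
    return total

estimate = sum(discounted_return(1) for _ in range(N)) / N
closed_form = gamma * (1 - p1) / ((1 - gamma * p1) * (1 - gamma))
print(estimate, closed_form)   # the two should be close
```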
Example

[Diagram: Markov chain induced by \(\pi=\)stay: state \(0\) self-loops w.p. \(1\); state \(1\) self-loops w.p. \(p_1\) and transitions to \(0\) w.p. \(1-p_1\).]
Example

[Diagram: two-state MDP. From state \(0\): stay remains at \(0\) w.p. \(1\), switch moves to \(1\) w.p. \(1\). From state \(1\): stay remains at \(1\) w.p. \(p_1\) and moves to \(0\) w.p. \(1-p_1\); switch moves to \(0\) w.p. \(p_2\) and remains at \(1\) w.p. \(1-p_2\).]
- Recall the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- Consider the policy
- \(\pi(1)=\)stay
- \(\pi(0)=\begin{cases}\textsf{stay} & \text{w.p. } p_1\\ \textsf{switch}& \text{w.p. } 1-p_1\end{cases}\)
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
- Recall the reward is:
- \(r(0,a)=1\) and \(r(1,a)=0\) for all \(a\)
- What happens?
- Is \(V^\pi(0)\) or \(V^\pi(1)\) larger?
Food for thought: what distribution determines the value function?
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi\right]$$
- Discounted "steady-state" distribution $$ d_\gamma = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_t $$
Steady-state?
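For concreteness, a rough illustration (my own, with an arbitrary two-state chain) of computing \(d_\gamma\) by truncating the infinite sum:

```python
# Discounted "steady-state" distribution d_gamma = (1 - gamma) * sum_t gamma^t d_t,
# approximated by truncating the sum.
import numpy as np

gamma = 0.9
P_pi = np.array([[0.9, 0.1],
                 [0.3, 0.7]])
d_t = np.array([1.0, 0.0])        # d_0 = mu_0

d_gamma = np.zeros(2)
for t in range(1000):             # gamma^1000 is negligible
    d_gamma += (1 - gamma) * gamma**t * d_t
    d_t = P_pi.T @ d_t

print(d_gamma, d_gamma.sum())     # entries sum to (approximately) 1
```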
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Exercise: review proof (below)
Bellman Expectation Equation:
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The cumulative reward expression is almost recursive:
$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$
Bellman Expectation Equation
Proof
- \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
- \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\) (linearity of expectation)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\) (simplifying conditional expectation, re-indexing the sum)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
- \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\) (definition of value function and linearity of expectation)
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
- \(V^\pi(s)=\mathbb E_a[r(s,a)+\gamma \mathbb E_{s'}[V^\pi(s')]]\)
- \(V^\pi(0)=p_1(1+\gamma V^\pi(0)) + (1-p_1)(1+\gamma V^\pi(1))\)
- \(V^\pi(0)=\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
- \(V^\pi(1)=0+\gamma(p_1V^\pi(1) + (1-p_1)V^\pi(0))\)
- \(V^\pi(1)=\frac{\gamma( 1-p_1)}{1-\gamma p_1}V^\pi(0)\)
- \(V^\pi(1)=\frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2}\)
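A symbolic check of this calculation (my own sketch using sympy, not part of the lecture): solve the two Bellman equations and confirm they match the closed forms above.

```python
# Verify the closed-form values for the two-state example symbolically.
import sympy as sp

gamma, p1 = sp.symbols("gamma p_1", positive=True)
V0, V1 = sp.symbols("V0 V1")

eqs = [
    sp.Eq(V0, p1 * (1 + gamma * V0) + (1 - p1) * (1 + gamma * V1)),  # Bellman at s = 0
    sp.Eq(V1, gamma * (p1 * V1 + (1 - p1) * V0)),                    # Bellman at s = 1
]
sol = sp.solve(eqs, [V0, V1], dict=True)[0]

denom = (1 - gamma * p1) ** 2 - gamma**2 * (1 - p1) ** 2
print(sp.simplify(sol[V0] - (1 - gamma * p1) / denom))   # expect 0
print(sp.simplify(sol[V1] - gamma * (1 - p1) / denom))   # expect 0
```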
Agenda
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Policy Evaluation
- Consider deterministic policy. The Bellman equation:
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \mathbb{E}_{s' \sim P( s, \pi(s))} [V^\pi(s')] \)
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- As with the state distribution, consider \(V^\pi\) as an \(S=|\mathcal S|\) dimensional vector, and likewise \(R^\pi\) with entries \(R^\pi(s) = r(s,\pi(s))\)
- \(V^{\pi}(s) = R^\pi(s) + \gamma \langle P_\pi(s) , V^\pi \rangle \)
[Diagram: the matrix \(P_\pi\), whose row \(s\) is the distribution \(P(\cdot\mid s,\pi(s))\)]
Policy Evaluation
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- The matrix vector form of the Bellman Equation is
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
[Diagram: row \(s\) of the matrix-vector equation, \(V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s)) V^\pi(s')\)]
Matrix inversion is slow! \(\mathcal O(S^3)\)
To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations:
- \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
- (PSet 1)
Exact Policy Evaluation
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
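A minimal code sketch of exact policy evaluation (function and variable names are my own, not from the course): solve the linear system rather than forming the inverse explicitly.

```python
# Exact policy evaluation: solve (I - gamma P_pi) V = R_pi.
import numpy as np

def exact_policy_evaluation(R_pi, P_pi, gamma):
    """Return V^pi satisfying V = R_pi + gamma P_pi V."""
    S = R_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

Using `np.linalg.solve` avoids explicitly forming the inverse, but the cost is still cubic in \(S\).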
Approximate Policy Evaluation:
- Initialize \(V_0\)
- For \(t=0,1,\dots, T\):
- \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
Complexity of each iteration?
- \(\mathcal O(S^2)\)
Approximate Policy Evaluation
To trade off accuracy for computation time, we can use a fixed point iteration algorithm
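A matching sketch of the fixed-point iteration (again, names are my own):

```python
# Approximate policy evaluation: iterate V_{t+1} = R_pi + gamma P_pi V_t.
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, V0, T):
    """Run T iterations of the Bellman update starting from V0 and return V_T."""
    V = V0.copy()
    for _ in range(T):
        V = R_pi + gamma * P_pi @ V   # each iteration costs O(S^2)
    return V
```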
Example

[Diagram: Markov chain induced by this policy: each state self-loops w.p. \(p_1\) and transitions to the other state w.p. \(1-p_1\).]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
- \(R^\pi = \begin{bmatrix} 1\\ 0\end{bmatrix} \)
- \(P_\pi = \begin{bmatrix} p_1 & 1-p_1 \\ 1-p_1 &p_1 \end{bmatrix}\)
- Exercises:
- Approx PE with \(V_0 = [1,1]\)
- Confirm \(V^\pi=\begin{bmatrix}\frac{1-\gamma p_1}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \\ \frac{\gamma( 1-p_1)}{(1-p_1\gamma)^2 - \gamma^2( 1-p_1)^2} \end{bmatrix}\) is a fixed point
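One way to check both exercises numerically (my own sketch, with illustrative values \(p_1=0.8\), \(\gamma=0.9\)):

```python
# Numerical check of the two exercises for the two-state example.
import numpy as np

p1, gamma = 0.8, 0.9
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1],
                 [1 - p1, p1]])

# candidate fixed point from the Bellman-equation example
denom = (1 - gamma * p1) ** 2 - gamma**2 * (1 - p1) ** 2
V_star = np.array([(1 - gamma * p1) / denom, gamma * (1 - p1) / denom])

# (1) V_star is a fixed point of the Bellman update
print(np.allclose(V_star, R_pi + gamma * P_pi @ V_star))   # expect True

# (2) Approx PE from V_0 = [1, 1] converges toward V_star
V = np.array([1.0, 1.0])
for t in range(200):
    V = R_pi + gamma * P_pi @ V
print(np.max(np.abs(V - V_star)))   # small after enough iterations
```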
To show that Approx PE works, we first prove a contraction lemma
Convergence of Approx PE
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
- \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
- \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
- \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) norm definition
- \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) expectation definition
- \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) basic inequality (PSet 1)
- \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) basic inequality (PSet 1)
Proof
- First statement follows by induction using the Lemma
- For the second statement,
- \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
- Taking \(\log\) of both sides,
- \(T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
Convergence of Approx PE
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\) correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
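A quick numerical check of this bound (my own sketch, with \(p_1=0.8\), \(\gamma=0.9\), \(\epsilon=10^{-6}\)): run exactly the prescribed number of iterations and confirm the error falls below \(\epsilon\).

```python
# Check the iteration bound T >= log(||V_0 - V^pi||_inf / eps) / log(1/gamma).
import numpy as np

p1, gamma, eps = 0.8, 0.9, 1e-6
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[p1, 1 - p1],
                 [1 - p1, p1]])
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)   # exact value for reference

V = np.zeros(2)                                          # V_0 = 0
T = int(np.ceil(np.log(np.max(np.abs(V - V_pi)) / eps) / np.log(1 / gamma)))
for t in range(T):
    V = R_pi + gamma * P_pi @ V
print(T, np.max(np.abs(V - V_pi)) <= eps)                # expect True
```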
Recap
- PSet 1 released today
- Office Hours on Ed
- Value Function
- Bellman Equation
- Policy Evaluation
- Next lecture: Optimality