CS 4/5789: Introduction to Reinforcement Learning
Lecture 4: Optimal Policies
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
- Questions about waitlist/enrollment?
- Homework released this week
- Problem Set 1 due Monday 2/6
- Programming Assignment 1 released tonight, due in 2 weeks
- CIS Partner Finding Social
- Come to Duffield Atrium to find a partner or study buddy for any CIS classes you are taking this semester! February 2nd from 4:30 to 6:30pm
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Example

[Diagram: two-state MDP with states \(0\) and \(1\). From \(0\), "stay" remains at \(0\) with probability \(1\) and "switch" moves to \(1\) with probability \(1\). From \(1\), "stay" remains at \(1\) with probability \(p_1\) (moves to \(0\) with probability \(1-p_1\)) and "switch" moves to \(0\) with probability \(p_2\) (remains at \(1\) with probability \(1-p_2\)).]
- Recall ongoing example
- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Notation review: what is \(\{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) for this example?

Notation Review

- \(\mathcal S = \{0,1\}\) and \(\mathcal A=\{\)stay,switch\(\}\)
- \(r(0,\)stay\()=1\), \(r(0,\)switch\()=\frac{1}{2}\)
- \(r(1,\)stay\()=0\), \(r(1,\)switch\()=-\frac{1}{2}\)
- \(P(0,\)stay\()=\mathbf{1}_{0}=\mathsf{Bernoulli}(0)\): \(P(0\mid 0,\)stay\()=1\), \(P(1\mid 0,\)stay\()=0\)
- \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\): \(P(0\mid 1,\)stay\()=1-p_1\), \(P(1\mid 1,\)stay\()=p_1\)
- \(P(0,\)switch\()=\mathbf{1}_{1}=\mathsf{Bernoulli}(1)\): \(P(0\mid 0,\)switch\()=0\), \(P(1\mid 0,\)switch\()=1\)
- \(P(1,\)switch\()=\mathsf{Bernoulli}(1-p_2)\): \(P(0\mid 1,\)switch\()=p_2\), \(P(1\mid 1,\)switch\()=1-p_2\)
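To make the notation concrete, here is a minimal sketch in Python/NumPy (with assumed example values for \(p_1\), \(p_2\), \(\gamma\)) encoding this MDP as arrays; `P[s, a, s2]` stores \(P(s_2\mid s,a)\) and `r[s, a]` the reward.

```python
import numpy as np

# Hypothetical encoding of the two-state example MDP.
# States: 0, 1. Actions: 0 = stay, 1 = switch.
p1, p2, gamma = 0.8, 0.6, 0.9  # example parameter values (assumed)

# r[s, a] = 1{s = 0} - 0.5 * 1{a = switch}
r = np.array([[1.0, 0.5],
              [0.0, -0.5]])

# P[s, a, s'] = probability of next state s' given (s, a)
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]          # (s=0, stay):   remain in 0
P[0, 1] = [0.0, 1.0]          # (s=0, switch): move to 1
P[1, 0] = [1 - p1, p1]        # (s=1, stay):   reach 0 w.p. 1 - p1
P[1, 1] = [p2, 1 - p2]        # (s=1, switch): reach 0 w.p. p2

assert np.allclose(P.sum(axis=-1), 1.0)  # each row is a distribution
```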
Value Function
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t)\right]$$
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
Q function: \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \)
Proof of BE
- \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
- \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\) (linearity of expectation)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\) (simplifying the conditional expectation, re-indexing the sum)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
- \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\) (definition of value function and linearity of expectation)
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
- \(V^\pi(1) =\sum_{T=0}^\infty p_1^T(1-p_1) \sum_{t=T+1}^\infty \gamma^t =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
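As a quick numerical sanity check (a sketch reusing the arrays and assumed parameter values from the earlier snippet), the closed forms above can be plugged into the Bellman Expectation Equation:

```python
# Sketch: check the closed-form values for pi(s) = stay against the
# Bellman expectation equation, using the assumed parameters above.
V0 = 1 / (1 - gamma)
V1 = gamma * (1 - p1) / ((1 - gamma * p1) * (1 - gamma))

assert np.isclose(V0, r[0, 0] + gamma * V0)                         # s = 0, stay
assert np.isclose(V1, r[1, 0] + gamma * (p1 * V1 + (1 - p1) * V0))  # s = 1, stay
```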
Policy Evaluation (PE)
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- The matrix vector form of the Bellman Equation is
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
where \(V^\pi\) and \(R^\pi\) are \(S\)-dimensional vectors with entries \(V^\pi(s)\) and \(r(s,\pi(s))\), and \(P_\pi\) is the \(S\times S\) matrix with entries \([P_\pi]_{s,s'}=P(s'\mid s,\pi(s))\)
Approximate Policy Evaluation:
- Initialize \(V_0\)
- For \(t=0,1,\dots, T\):
- \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
Complexity of each iteration is \(\mathcal O(S^2)\)
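A minimal sketch of Approximate Policy Evaluation in NumPy (reusing the arrays from the earlier snippet; the zero initialization and iteration count are arbitrary choices):

```python
def approx_policy_evaluation(P, r, gamma, pi, num_iters=100):
    """Fixed-point iteration V_{t+1} = R_pi + gamma * P_pi V_t."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]        # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = r[np.arange(S), pi]        # R_pi[s] = r(s, pi(s))
    V = np.zeros(S)                   # arbitrary initialization V_0
    for _ in range(num_iters):
        V = R_pi + gamma * P_pi @ V   # O(S^2) per iteration
    return V

# e.g. evaluate the all-"stay" policy from the running example
V_stay = approx_policy_evaluation(P, r, gamma, pi=np.array([0, 0]))
```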
Approximate Policy Evaluation
To reduce computation at the cost of an approximate answer, we can use a fixed point iteration algorithm
To show the Approx PE works, we first prove a contraction lemma
Convergence of Approx PE
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
- \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
- \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
- \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) (norm definition)
- \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) (expectation definition)
- \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) (basic inequality, PSet 1)
- \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) (basic inequality, PSet 1)
Convergence of Approx PE
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
Proof
- The first statement follows by induction using the Lemma
- For the second statement,
- \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
- Taking \(\log\) of both sides,
- \(T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
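For example, with assumed values \(\gamma = 0.9\), \(\|V_0 - V^\pi\|_\infty = 1\), and \(\epsilon = 0.01\), the bound gives \(T \geq \log(100)/\log(1/0.9) \approx 43.7\), so \(T = 44\) iterations suffice.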
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- Is this optimal? PollEV
Optimal Policy
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
- An optimal policy \(\pi_\star\) is one where \(V^{\pi_\star}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- i.e. the policy dominates other policies for all states
- vector notation: \(V^{\pi_\star}(s) \geq V^{\pi}(s)~\forall~s\iff V^{\pi_\star} \geq V^{\pi}\)
- All optimal policies achieve the same value \(V^\star\), i.e. at every state \(s\), \(V^\star(s) = V^{\pi_\star}(s)\)
Finding and Verifying Optimal Policies
- How can we find an optimal policy? How can we verify whether a policy is optimal?
- Naive approach: enumeration
Enumeration:
- Initialize \(V^\star=-\infty\), \(\pi_\star\)
- For all \(\pi:\mathcal S\to\mathcal A\):
- compute \(V^\pi\) with PE
- if \(V^\pi\geq V^\star \): set \(V^\star =V^\pi\) and \(\pi_\star=\pi\)
- return \(\pi_\star\)
- For \(S=|\mathcal S|\) states and \(A=|\mathcal A|\) actions, the complexity is \(\mathcal O(A^S S^3)\)!
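A minimal sketch of the enumeration baseline (reusing the NumPy setup above; the helper names are hypothetical, and exact PE here uses a linear solve):

```python
from itertools import product

def exact_policy_evaluation(P, r, gamma, pi):
    """Solve V = R_pi + gamma * P_pi V exactly (O(S^3) linear solve)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]
    R_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def enumerate_optimal_policy(P, r, gamma):
    """Brute force over all A^S deterministic policies."""
    S, A = P.shape[0], P.shape[1]
    best_V, best_pi = np.full(S, -np.inf), None
    for pi in product(range(A), repeat=S):   # A^S candidate policies
        V = exact_policy_evaluation(P, r, gamma, np.array(pi))
        if np.all(V >= best_V):              # keep a dominating policy
            best_V, best_pi = V, np.array(pi)
    return best_pi, best_V
```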
Bellman Optimality Equation
- Just like the Bellman Expectation Equation made it easier to compute the Value for a given policy,
- the Bellman Optimality Equation will make it easier to verify and compute the optimal policy/value function
Bellman Optimality Equation (BOE): $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Theorem (Bellman Optimality):
- If \(\pi_\star\) is an optimal policy, then \(V^{\pi_\star}\) satisfies the BOE
- If \(V^\pi\) satisfies the BOE, then \(\pi\) is an optimal policy
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi\) satisfies \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- Is this optimal?
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(\max_{a\in\mathcal A} \left[ r(0, a) + \gamma \mathbb{E}_{s' \sim P( 0, a)} [V(s')] \right]\)
- for \(a=\)stay, \(\frac{1}{1-\gamma}\)
- for \(a=\)switch,
- \(\frac{1}{2} + \gamma V(1) = \frac{\gamma^2 (1-p_1)}{(1-\gamma p_1)(1-\gamma)} +\frac{1}{2} \leq \frac{\gamma}{1-\gamma} + 1 = \frac{1}{1-\gamma}\)
- Thus BOE satisfied for \(s=0\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- \(\max_{a\in\mathcal A} \left[ r(1, a) + \gamma \mathbb{E}_{s' \sim P( 1, a)} [V(s')] \right]\)
- for \(a=\)stay, \(\gamma\left(p_1 V(1)+(1-p_1)V(0)\right)=\frac{\gamma (1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- for \(a=\)switch,
- \(-\frac{1}{2} + \gamma ((1-p_2)V(1)+p_2V(0)) = \frac{\gamma^2 (1-p_2)(1-p_1)}{(1-\gamma p_1)(1-\gamma)} + \frac{\gamma p_2}{1-\gamma} -\frac{1}{2} \)
- Thus BOE satisfied for \(s=1\) if \(p_2\leq (1-p_1)+\frac{1-\gamma p_1}{2\gamma}\)
- [Heatmap over discount factor \(\gamma\) (horizontal axis) and \(p_1\), the probability of staying in \(1\) under "stay" (vertical axis)]
- Color: maximum value that \(p_2\) can have for "stay" to be optimal, ranging from 0 (dark) to 1.5 (light)
- If \(\gamma\) is small, cost of "switch" action is not worth it
- If \(p_1\) is small, likely to transition without "switch" action
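The BOE check above can also be done numerically; this sketch reuses the NumPy setup and the hypothetical `exact_policy_evaluation` helper from the enumeration snippet:

```python
def satisfies_boe(P, r, gamma, V, tol=1e-8):
    """Check V(s) == max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ] for all s."""
    boe_rhs = (r + gamma * P @ V).max(axis=1)   # max over actions
    return np.allclose(V, boe_rhs, atol=tol)

V_stay = exact_policy_evaluation(P, r, gamma, np.array([0, 0]))
print(satisfies_boe(P, r, gamma, V_stay))  # whether all-"stay" is optimal for these parameters
```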

- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \)
- \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (Bellman Expectation Equation)
- \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\)
- \(\leq r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\)
- Writing the above expression in vector form:
- \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract from both sides)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (Bellman Expectation Eq)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat \pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat \pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat \pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat \pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat \pi})^k (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
- \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\))
- Therefore, \(V^{\pi_\star} = V^{\hat\pi}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- We showed that \(V^{\pi_\star} = V^{\hat\pi}\)
- this means \(\hat \pi\) is an optimal policy!
- By definition of \(\hat\pi\) and the Bellman Expectation Equation, \(V^{\hat \pi}\) satisfies the Bellman Optimality Equation
- Therefore, \(V^{\pi_\star}\) must also satisfy it.
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- If we know the optimal value \(V^\star\) then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
- Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] $$
- \(\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)\)
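A minimal sketch of extracting a greedy policy from \(Q^\star\) (reusing the NumPy setup above; `greedy_policy` is a hypothetical helper name):

```python
def greedy_policy(P, r, gamma, V):
    """pi(s) in argmax_a Q(s,a), with Q(s,a) = r(s,a) + gamma * E_{s'~P(s,a)}[V(s')]."""
    Q = r + gamma * P @ V      # Q has shape (S, A)
    return Q.argmax(axis=1)    # ties broken arbitrarily
```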
Bellman Optimality
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
- \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
- basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\)
- \(\leq \gamma^k \|V^\pi-V^{\pi_\star}\|_\infty \to 0\) (letting \(k\to\infty\))
- Therefore, \(V^\pi = V^{\pi_\star}\) so \(\pi\) must be optimal
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Value Iteration
- The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- If \(V^\star\) satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
Value Iteration
Value Iteration
- Initialize \(V_1\)
- For \(t=1,\dots,T\):
- \(V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\)
- Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\)
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
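A minimal sketch of Value Iteration (reusing the NumPy setup and the hypothetical `greedy_policy` helper above; the zero initialization and iteration count are arbitrary choices):

```python
def value_iteration(P, r, gamma, num_iters=100):
    """Repeatedly apply V <- max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]."""
    S = P.shape[0]
    V = np.zeros(S)                              # arbitrary initialization V_1
    for _ in range(num_iters):
        V = (r + gamma * P @ V).max(axis=1)      # Bellman update, max over actions
    return greedy_policy(P, r, gamma, V), V      # approximately optimal policy and value
```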
Bellman Operator
- Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- Nonlinear map
- Value Iteration is repeated application of the Bellman Operator
- Compare with the Bellman Expectation Equation used in Approximate Policy Evaluation
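For comparison, a sketch of the two operators side by side (assuming the NumPy setup above): the Bellman operator \(\mathcal T\) applied by Value Iteration, and the policy-specific affine map applied by Approximate Policy Evaluation.

```python
def bellman_operator(P, r, gamma, V):
    """(T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]  (nonlinear in V)."""
    return (r + gamma * P @ V).max(axis=1)

def bellman_expectation_operator(P, r, gamma, pi, V):
    """(T_pi V)(s) = r(s,pi(s)) + gamma * E_{s'~P(s,pi(s))}[V(s')]  (affine in V)."""
    S = P.shape[0]
    return r[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
```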
Recap
- PSet 1 due Monday
- PA 1 released today
- Policy Evaluation
- Optimal Policies
- Value Iteration
- Next lecture: Value Iteration, Policy Iteration
Sp23 CS 4/5789: Lecture 4
By Sarah Dean