CS 4/5789: Introduction to Reinforcement Learning
Lecture 4: Optimal Policies
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Announcements
- Questions about waitlist/enrollment?
- Homework released this week
- Problem Set 1 due Monday 2/6
- Programming Assignment 1 released tonight, due in 2 weeks
- CIS Partner Finding Social
- Come to Duffield Atrium to find a partner or study buddy for any CIS classes you are taking this semester! February 2nd from 4:30 to 6:30pm
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Example

[Diagram: two-state MDP with states \(0\) and \(1\). From \(0\), "stay" remains at \(0\) with probability \(1\) and "switch" moves to \(1\) with probability \(1\). From \(1\), "stay" remains at \(1\) with probability \(p_1\) (moves to \(0\) with probability \(1-p_1\)) and "switch" moves to \(0\) with probability \(p_2\) (remains at \(1\) with probability \(1-p_2\)).]
- Recall ongoing example
- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Notation review: what is \(\{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) for this example?

Notation Review

- \(\mathcal S = \{0,1\}\) and \(\mathcal A=\{\)stay,switch\(\}\)
- \(r(0,\)stay\()=1\), \(r(0,\)switch\()=\frac{1}{2}\)
- \(r(1,\)stay\()=0\), \(r(1,\)switch\()=-\frac{1}{2}\)
- \(P(0,\)stay\()=\mathbf{1}_{0}=\mathsf{Bernoulli}(0)\): \(P(0\mid 0,\)stay\()=1\), \(P(1\mid 0,\)stay\()=0\)
- \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\): \(P(0\mid 1,\)stay\()=1-p_1\), \(P(1\mid 1,\)stay\()=p_1\)
- \(P(0,\)switch\()=\mathbf{1}_{1}=\mathsf{Bernoulli}(1)\): \(P(0\mid 0,\)switch\()=0\), \(P(1\mid 0,\)switch\()=1\)
- \(P(1,\)switch\()=\mathsf{Bernoulli}(1-p_2)\): \(P(0\mid 1,\)switch\()=p_2\), \(P(1\mid 1,\)switch\()=1-p_2\)
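To make the notation concrete, here is a minimal sketch in Python/NumPy (with assumed example values for \(p_1\), \(p_2\), \(\gamma\)) encoding this MDP as arrays; `P[s, a, s2]` stores \(P(s_2\mid s,a)\) and `r[s, a]` the reward.

```python
import numpy as np

# Hypothetical encoding of the two-state example MDP.
# States: 0, 1. Actions: 0 = stay, 1 = switch.
p1, p2, gamma = 0.8, 0.6, 0.9  # example parameter values (assumed)

# r[s, a] = 1{s = 0} - 0.5 * 1{a = switch}
r = np.array([[1.0, 0.5],
              [0.0, -0.5]])

# P[s, a, s'] = probability of next state s' given (s, a)
P = np.zeros((2, 2, 2))
P[0, 0] = [1.0, 0.0]          # (s=0, stay):   remain in 0
P[0, 1] = [0.0, 1.0]          # (s=0, switch): move to 1
P[1, 0] = [1 - p1, p1]        # (s=1, stay):   reach 0 w.p. 1 - p1
P[1, 1] = [p2, 1 - p2]        # (s=1, switch): reach 0 w.p. p2

assert np.allclose(P.sum(axis=-1), 1.0)  # each row is a distribution
```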
Value Function
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t)\right]$$
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
Q function: \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \)
Proof of BE
- \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
- \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\) (linearity of expectation)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\) (simplifying the conditional expectation, re-indexing the sum)
- \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
- \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\) (definition of value function and linearity of expectation)
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
- \(V^\pi(1) =\sum_{T=0}^\infty p_1^T(1-p_1) \sum_{t=T+1}^\infty \gamma^t =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
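As a quick numerical sanity check (a sketch reusing the arrays and assumed parameter values from the earlier snippet), the closed forms above can be plugged into the Bellman Expectation Equation:

```python
# Sketch: check the closed-form values for pi(s) = stay against the
# Bellman expectation equation, using the assumed parameters above.
V0 = 1 / (1 - gamma)
V1 = gamma * (1 - p1) / ((1 - gamma * p1) * (1 - gamma))

assert np.isclose(V0, r[0, 0] + gamma * V0)                         # s = 0, stay
assert np.isclose(V1, r[1, 0] + gamma * (p1 * V1 + (1 - p1) * V0))  # s = 1, stay
```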
Policy Evaluation (PE)
- \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S} P(s'\mid s, \pi(s)) V^\pi(s') \)
- The matrix vector form of the Bellman Equation is
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
where \(V^\pi\) and \(R^\pi\) are \(S\)-dimensional vectors with entries \(V^\pi(s)\) and \(r(s,\pi(s))\), and \(P_\pi\) is the \(S\times S\) matrix with entries \([P_\pi]_{s,s'}=P(s'\mid s,\pi(s))\)
Approximate Policy Evaluation:
- Initialize \(V_0\)
- For \(t=0,1,\dots, T\):
- \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
Complexity of each iteration is \(\mathcal O(S^2)\)
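A minimal sketch of Approximate Policy Evaluation in NumPy (reusing the arrays from the earlier snippet; the zero initialization and iteration count are arbitrary choices):

```python
def approx_policy_evaluation(P, r, gamma, pi, num_iters=100):
    """Fixed-point iteration V_{t+1} = R_pi + gamma * P_pi V_t."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]        # P_pi[s, s'] = P(s' | s, pi(s))
    R_pi = r[np.arange(S), pi]        # R_pi[s] = r(s, pi(s))
    V = np.zeros(S)                   # arbitrary initialization V_0
    for _ in range(num_iters):
        V = R_pi + gamma * P_pi @ V   # O(S^2) per iteration
    return V

# e.g. evaluate the all-"stay" policy from the running example
V_stay = approx_policy_evaluation(P, r, gamma, pi=np.array([0, 0]))
```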
Approximate Policy Evaluation
To reduce computation at the cost of an approximate answer, we can use a fixed point iteration algorithm
To show the Approx PE works, we first prove a contraction lemma
Convergence of Approx PE
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
- \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
- \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
- \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) (norm definition)
- \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) (expectation definition)
- \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) (basic inequality, PSet 1)
- \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) (basic inequality, PSet 1)
Convergence of Approx PE
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
Proof
- The first statement follows by induction using the Lemma
- For the second statement,
- \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
- Taking \(\log\) of both sides,
- \(T\log \gamma + \log \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
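For example, with assumed values \(\gamma = 0.9\), \(\|V_0 - V^\pi\|_\infty = 1\), and \(\epsilon = 0.01\), the bound gives \(T \geq \log(100)/\log(1/0.9) \approx 43.7\), so \(T = 44\) iterations suffice.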
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- Is this optimal? PollEV
Optimal Policy
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
- An optimal policy \(\pi_\star\) is one where \(V^{\pi_\star}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- i.e. the policy dominates other policies for all states
- vector notation: \(V^{\pi_\star}(s) \geq V^{\pi}(s)~\forall~s\iff V^{\pi_\star} \geq V^{\pi}\)
- All optimal policies achieve the same value \(V^\star\), i.e. at every state \(s\), \(V^\star(s) = V^{\pi_\star}(s)\)
Finding and Verifying Optimal Policies
- How can we find an optimal policy? How can we verify whether a policy is optimal?
- Naive approach: enumeration
Enumeration:
- Initialize \(V^\star=-\infty\), \(\pi_\star\)
- For all \(\pi:\mathcal S\to\mathcal A\):
- compute \(V^\pi\) with PE
- if \(V^\pi\geq V^\star \): set \(V^\star =V^\pi\) and \(\pi_\star=\pi\)
- return \(\pi_\star\)
- For \(S=|\mathcal S|\) states and \(A=|\mathcal A|\) actions, the complexity is \(\mathcal O(A^S S^3)\)!
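A minimal sketch of the enumeration baseline (reusing the NumPy setup above; the helper names are hypothetical, and exact PE here uses a linear solve):

```python
from itertools import product

def exact_policy_evaluation(P, r, gamma, pi):
    """Solve V = R_pi + gamma * P_pi V exactly (O(S^3) linear solve)."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]
    R_pi = r[np.arange(S), pi]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def enumerate_optimal_policy(P, r, gamma):
    """Brute force over all A^S deterministic policies."""
    S, A = P.shape[0], P.shape[1]
    best_V, best_pi = np.full(S, -np.inf), None
    for pi in product(range(A), repeat=S):   # A^S candidate policies
        V = exact_policy_evaluation(P, r, gamma, np.array(pi))
        if np.all(V >= best_V):              # keep a dominating policy
            best_V, best_pi = V, np.array(pi)
    return best_pi, best_V
```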
Bellman Optimality Equation
- Just like the Bellman Expectation Equation made it easier to compute the Value for a given policy,
- the Bellman Optimality Equation will make it easier to verify and compute the optimal policy/value function
Bellman Optimality Equation (BOE): $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Theorem (Bellman Optimality):
- If \(\pi_\star\) is an optimal policy, then \(V^{\pi_\star}\) satisfies the BOE
- If \(V^\pi\) satisfies the BOE, then \(\pi\) is an optimal policy
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi\) satisfies \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
Example

- Suppose the reward is: \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for \(a=\) switch
- Consider the policy \(\pi(s)=\)stay for all \(s\)
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- Is this optimal?
- \(V^\pi(0) =\frac{1}{1-\gamma}\)
- \(\max_{a\in\mathcal A} \left[ r(0, a) + \gamma \mathbb{E}_{s' \sim P( 0, a)} [V(s')] \right]\)
- for \(a=\)stay, \(\frac{1}{1-\gamma}\)
- for \(a=\)switch,
- \(\frac{1}{2} + \gamma V(1) = \frac{\gamma^2 (1-p_1)}{(1-\gamma p_1)(1-\gamma)} +\frac{1}{2} \leq \frac{\gamma}{1-\gamma} + 1 = \frac{1}{1-\gamma}\)
- Thus BOE satisfied for \(s=0\)
- \(V^\pi(1) =\frac{\gamma(1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- \(\max_{a\in\mathcal A} \left[ r(1, a) + \gamma \mathbb{E}_{s' \sim P( 1, a)} [V(s')] \right]\)
- for \(a=\)stay, \(\gamma\left(p_1 V(1)+(1-p_1)V(0)\right)=\frac{\gamma (1-p_1)}{(1-\gamma p_1)(1-\gamma)}\)
- for \(a=\)switch,
- \(-\frac{1}{2} + \gamma ((1-p_2)V(1)+p_2V(0)) = \frac{\gamma^2 (1-p_2)(1-p_1)}{(1-\gamma p_1)(1-\gamma)} + \frac{\gamma p_2}{1-\gamma} -\frac{1}{2} \)
- Thus BOE satisfied for \(s=1\) if \(p_2\leq (1-p_1)+\frac{1-\gamma p_1}{2\gamma}\)
- [Heatmap over discount factor \(\gamma\) (horizontal axis) and \(p_1\), the probability of staying in \(1\) under "stay" (vertical axis)]
- Color: maximum value that \(p_2\) can have for "stay" to be optimal, ranging from 0 (dark) to 1.5 (light)
- If \(\gamma\) is small, cost of "switch" action is not worth it
- If \(p_1\) is small, likely to transition without "switch" action
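The BOE check above can also be done numerically; this sketch reuses the NumPy setup and the hypothetical `exact_policy_evaluation` helper from the enumeration snippet:

```python
def satisfies_boe(P, r, gamma, V, tol=1e-8):
    """Check V(s) == max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ] for all s."""
    boe_rhs = (r + gamma * P @ V).max(axis=1)   # max over actions
    return np.allclose(V, boe_rhs, atol=tol)

V_stay = exact_policy_evaluation(P, r, gamma, np.array([0, 0]))
print(satisfies_boe(P, r, gamma, V_stay))  # whether all-"stay" is optimal for these parameters
```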

- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \)
- \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \) (Bellman Expectation Equation)
- \(\leq \max_{a\in\mathcal A} r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\)
- \(\leq r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\)
- Writing the above expression in vector form:
- \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract from both sides)
- \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (Bellman Expectation Eq)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat \pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat \pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
- We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat \pi} (V^{\pi_\star} -V^{\hat \pi})\) (\(\star\))
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat \pi})^2 (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) to RHS)
- \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat \pi})^k (V^{\pi_\star}- V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
- \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\))
- Therefore, \(V^{\pi_\star} = V^{\hat\pi}\)
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
- We showed that \(V^{\pi_\star} = V^{\hat\pi}\)
- this means \(\hat \pi\) is an optimal policy!
- By definition of \(\hat\pi\) and the Bellman Expectation Equation, \(V^{\hat \pi}\) satisfies the Bellman Optimality Equation
- Therefore, \(V^{\pi_\star}\) must also satisfy it.
Bellman Optimality Proof
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
- If we know the optimal value \(V^\star\) then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
- Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] $$
- \(\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)\)
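A minimal sketch of extracting a greedy policy from \(Q^\star\) (reusing the NumPy setup above; `greedy_policy` is a hypothetical helper name):

```python
def greedy_policy(P, r, gamma, V):
    """pi(s) in argmax_a Q(s,a), with Q(s,a) = r(s,a) + gamma * E_{s'~P(s,a)}[V(s')]."""
    Q = r + gamma * P @ V      # Q has shape (S, A)
    return Q.argmax(axis=1)    # ties broken arbitrarily
```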
Bellman Optimality
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
- \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
- basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
Bellman Optimality Proof
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
- We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
- \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
- \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
- \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\)
- \(\leq \gamma^k \|V^\pi-V^{\pi_\star}\|_\infty \to 0\) (letting \(k\to\infty\))
- Therefore, \(V^\pi = V^{\pi_\star}\) so \(\pi\) must be optimal
Agenda
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Value Iteration
- The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- If \(V^\star\) satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
Value Iteration
Value Iteration
- Initialize \(V_1\)
- For \(t=1,\dots,T\):
- \(V_{t+1}(s) = \max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\)
- Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\)
- Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
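A minimal sketch of Value Iteration (reusing the NumPy setup and the hypothetical `greedy_policy` helper above; the zero initialization and iteration count are arbitrary choices):

```python
def value_iteration(P, r, gamma, num_iters=100):
    """Repeatedly apply V <- max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]."""
    S = P.shape[0]
    V = np.zeros(S)                              # arbitrary initialization V_1
    for _ in range(num_iters):
        V = (r + gamma * P @ V).max(axis=1)      # Bellman update, max over actions
    return greedy_policy(P, r, gamma, V), V      # approximately optimal policy and value
```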
Bellman Operator
- Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
- Nonlinear map
- Value Iteration is repeated application of the Bellman Operator
- Compare with the Bellman Expectation Equation used in Approximate Policy Evaluation
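For comparison, a sketch of the two operators side by side (assuming the NumPy setup above): the Bellman operator \(\mathcal T\) applied by Value Iteration, and the policy-specific affine map applied by Approximate Policy Evaluation.

```python
def bellman_operator(P, r, gamma, V):
    """(T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]  (nonlinear in V)."""
    return (r + gamma * P @ V).max(axis=1)

def bellman_expectation_operator(P, r, gamma, pi, V):
    """(T_pi V)(s) = r(s,pi(s)) + gamma * E_{s'~P(s,pi(s))}[V(s')]  (affine in V)."""
    S = P.shape[0]
    return r[np.arange(S), pi] + gamma * P[np.arange(S), pi] @ V
```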
Recap
- PSet 1 due Monday
- PA 1 released today
- Policy Evaluation
- Optimal Policies
- Value Iteration
- Next lecture: Value Iteration, Policy Iteration
Sp23 CS 4/5789: Lecture 4
By Sarah Dean