CS 4/5789: Introduction to Reinforcement Learning

Lecture 4: Optimal Policies

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Announcements

  • Questions about waitlist/enrollment?
  • Homework released this week
    • Problem Set 1 due Monday 2/6
    • Programming Assignment 1 released tonight, due 2 weeks later
  • CIS Partner Finding Social
  • Come to Duffield Atrium to find a partner or study buddy for any CIS classes you are taking this semester! February 2nd from 4:30 to 6:30pm

Agenda

 

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

Example

[Diagram: two-state MDP with states \(0\) and \(1\). From state \(0\), "stay" remains in \(0\) with probability \(1\) and "switch" moves to \(1\) with probability \(1\). From state \(1\), "stay" remains in \(1\) with probability \(p_1\) and moves to \(0\) with probability \(1-p_1\); "switch" remains in \(1\) with probability \(1-p_2\) and moves to \(0\) with probability \(p_2\).]

  • Recall ongoing example
  • Suppose the reward is:
    • \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for
      \(a=\) switch
  • Notation review: what is \(\{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) for this example?


Notation Review


  • \(\mathcal S = \{0,1\}\) and \(\mathcal A=\{\)stay,switch\(\}\)
  • \(r(0,\)stay\()=1\), \(r(0,\)switch\()=\frac{1}{2}\)
  • \(r(1,\)stay\()=0\), \(r(1,\)switch\()=-\frac{1}{2}\)
  • \(P(0,\)stay\()=\mathbf{1}_{0}=\mathsf{Bernoulli}(0)\) and \(P(1,\)stay\()=\mathsf{Bernoulli}(p_1)\)
  • \(P(0,\)switch\()=\mathbf{1}_{1}=\mathsf{Bernoulli}(1)\) and \(P(1,\)switch\()=\mathsf{Bernoulli}(1-p_2)\)
  • \(P(0\mid 0,\)stay\()=1\), \(P(1\mid 0,\)stay\()=0\)
  • \(P(0\mid 1,\)stay\()=1-p_1\), \(P(1\mid 1,\)stay\()=p_1\)
  • \(P(0\mid 0,\)switch\()=0\), \(P(1\mid 0,\)switch\()=1\)
  • \(P(0\mid 1,\)switch\()=p_2\), \(P(1\mid 1,\)switch\()=1-p_2\)

Value Function

The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state

$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t)\right]$$

Bellman Expectation Equation: \(\forall s\),

 \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)


Q function: \(Q^{\pi}(s, a) =  r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \)
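As a concrete (unofficial) sketch of these definitions: assuming hypothetical numpy arrays `R` of shape (S, A) holding \(r(s,a)\), `P` of shape (S, A, S) holding \(P(s'\mid s,a)\), and a deterministic policy `pi` stored as an integer array of actions, the Q function and the Bellman Expectation backup are one-liners. These array names are illustrative, not from the course code.

```python
import numpy as np

def q_from_v(R, P, gamma, V):
    """Q^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')], for all (s, a) at once."""
    # P has shape (S, A, S); P @ V contracts the last axis, giving an (S, A) array.
    return R + gamma * (P @ V)

def bellman_expectation_backup(R, P, gamma, pi, V):
    """Right-hand side of the Bellman Expectation Equation for a deterministic policy pi."""
    S = R.shape[0]
    Q = q_from_v(R, P, gamma, V)
    return Q[np.arange(S), pi]  # select the action pi(s) in each state s
```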

Proof of BE

  • \(V^\pi(s) = \mathbb{E}[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s, P, \pi ]\)
  • \(= \mathbb{E}[r(s_0,a_0)\mid s_0=s, P, \pi ] + \mathbb{E}[\sum_{t=1}^\infty \gamma^{t} r(s_{t},a_{t}) \mid s_0=s, P, \pi ]\)
    (linearity of expectation)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_0=s, P, \pi ]\)
    (simplifying conditional expectation, re-indexing sum)
  • \(= \mathbb{E}[r(s,a_0) \mid \pi ] + \gamma\mathbb{E}[\mathbb{E}[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \mid s_1=s', P, \pi ]\mid s'\sim P(s, a), a\sim \pi(s)]\) (tower property of conditional expectation)
  • \(= \mathbb{E}[r(s,a)+ \gamma\mathbb{E}[V^\pi(s')\mid s'\sim P(s, a)] \mid a\sim \pi(s)]\)
    (definition of value function and linearity of expectation)

Example

[Diagram: the two-state MDP from the running example.]

  • Suppose the reward is:
    • \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for
      \(a=\) switch
  • Consider the policy \(\pi(s)=\)stay for all \(s\)
  • \(V^\pi(0) =\sum_{t=0}^\infty \gamma^t = \frac{1}{1-\gamma}\)
  • \(V^\pi(1) =\sum_{T=0}^\infty p_1^T(1-p_1)  \sum_{t=T}^\infty \gamma^t =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\)

Policy Evaluation (PE)

  • \(V^{\pi}(s) = r(s, \pi(s)) + \gamma \sum_{s'\in\mathcal S}  P(s'\mid s, \pi(s)) V^\pi(s') \)
  • The matrix vector form of the Bellman Equation is

\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)

[Diagram: the vector \(V^\pi\) (indexed by \(s\)) equals the reward vector with entries \(r(s,\pi(s))\) plus \(\gamma\) times the matrix \(P_\pi\), whose entry in row \(s\) and column \(s'\) is \(P(s'\mid s,\pi(s))\), multiplied by \(V^\pi\).]
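Since the matrix-vector form is linear in \(V^\pi\), exact policy evaluation is a single linear solve, \(V^\pi = (I - \gamma P_\pi)^{-1} R^\pi\). A minimal sketch using the same hypothetical `R`, `P`, `pi` arrays as above:

```python
import numpy as np

def exact_policy_evaluation(R, P, gamma, pi):
    """Solve (I - gamma * P_pi) V = R^pi exactly; the solve costs O(S^3)."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]        # R^pi(s) = r(s, pi(s))
    P_pi = P[np.arange(S), pi, :]     # P_pi[s, s'] = P(s' | s, pi(s))
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```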

Approximate Policy Evaluation:

  • Initialize \(V_0\)
  • For \(t=0,1,\dots, T\):
    • \(V_{t+1} = R^{\pi} + \gamma P^{\pi} V_t\)

Complexity of each iteration is \(\mathcal O(S^2)\)
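A sketch of the Approx PE loop above with the same hypothetical arrays; each iteration is one dense matrix-vector product, matching the \(\mathcal O(S^2)\) count:

```python
import numpy as np

def approx_policy_evaluation(R, P, gamma, pi, T, V0=None):
    """Fixed-point iteration V_{t+1} = R^pi + gamma * P_pi V_t for T steps."""
    S = R.shape[0]
    R_pi = R[np.arange(S), pi]
    P_pi = P[np.arange(S), pi, :]
    V = np.zeros(S) if V0 is None else np.array(V0, dtype=float)
    for _ in range(T):
        V = R_pi + gamma * P_pi @ V   # one O(S^2) matrix-vector product per iteration
    return V
```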

Approximate Policy Evaluation

To trade off solution accuracy against computation time, we can use a fixed point iteration algorithm

To show that Approx PE works, we first prove a contraction lemma

Convergence of Approx PE

Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$

Proof

  • \(\|V_{t+1} - V^\pi\|_\infty = \|R^\pi + \gamma P_\pi V_t-V^\pi\|_\infty\) by algorithm definition
  • \(= \|R^\pi + \gamma P_\pi V_t-(R^\pi + \gamma P_\pi V^\pi)\|_\infty\) by Bellman eq
  • \(= \| \gamma P_\pi (V_t - V^\pi)\|_\infty=\gamma\max_s |\langle P_\pi(s), V_t-V^\pi\rangle| \) norm definition
  • \(=\gamma\max_s |\mathbb E_{s'\sim P(s,\pi(s))}[V_t(s')-V^\pi(s')]|\) expectation definition
  • \(\leq \gamma \max_s \mathbb E_{s'\sim P(s,\pi(s))}[|V_t(s')-V^\pi(s')|]\) basic inequality (PSet 1)
  • \(\leq \gamma \max_{s'}|V_t(s')-V^\pi(s')|=\gamma\|V_t-V^\pi\|_\infty\) basic inequality (PSet 1)
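The lemma can also be checked numerically. The snippet below builds a small random MDP (the random `P`, `R`, and `pi` are made up for illustration) and asserts that each Approx PE step shrinks the \(\infty\)-norm error by at least a factor of \(\gamma\):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9

# Random MDP: each row P[s, a, :] is a distribution over next states.
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((S, A))
pi = rng.integers(A, size=S)                              # an arbitrary deterministic policy

R_pi = R[np.arange(S), pi]
P_pi = P[np.arange(S), pi, :]
V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)    # exact V^pi for reference

V = np.zeros(S)
for t in range(20):
    err = np.max(np.abs(V - V_pi))
    V = R_pi + gamma * P_pi @ V                           # one Approx PE step
    assert np.max(np.abs(V - V_pi)) <= gamma * err + 1e-12  # contraction lemma holds
```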

Convergence of Approx PE

Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$

so an \(\epsilon\)-correct solution requires

\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)

Proof

  • The first statement follows by induction using the Lemma
  • For the second statement,
    • \(\|V_{T} - V^\pi\|_\infty\leq \gamma^T \|V_0-V^\pi\|_\infty\leq \epsilon\)
    • Taking \(\log\) of both sides,
    • \(T\log \gamma + \log  \|V_0-V^\pi\|_\infty \leq \log \epsilon \), then rearrange
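The iteration count from the theorem can be computed directly; a small sketch (the function name is illustrative):

```python
import math

def iterations_needed(initial_error, epsilon, gamma):
    """Smallest integer T with gamma**T * initial_error <= epsilon."""
    return math.ceil(math.log(initial_error / epsilon) / math.log(1.0 / gamma))

print(iterations_needed(10.0, 1e-3, 0.9))   # about 88 iterations for this choice
```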

Agenda

 

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

Example

[Diagram: the two-state MDP from the running example.]

  • Suppose the reward is:
    • \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for
      \(a=\) switch
  • Consider the policy \(\pi(s)=\)stay for all \(s\)
  • \(V^\pi(0) =\frac{1}{1-\gamma}\)
  • \(V^\pi(1) =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\)
  • Is this optimal? PollEV

Optimal Policy

maximize over \(\pi\):   \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

  • An optimal policy \(\pi_\star\) is one where \(V^{\pi_\star}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
    • i.e. the policy dominates other policies for all states
    • vector notation: \(V^{\pi_\star}(s) \geq V^{\pi}(s)~\forall~s\iff V^{\pi_\star} \geq V^{\pi}\)
  • All optimal policies achieve the same value \(V^\star\), i.e. at every state \(s\), \(V^\star(s) = V^{\pi_\star}(s)\)

Finding and Verifying Optimal Policies

  • How can we find an optimal policy? How can we verify whether a policy is optimal?
  • Naive approach: enumeration (sketched in code below)
    • For \(S=|\mathcal S|\) states and \(A=|\mathcal A|\) actions, the complexity is \(\mathcal O(A^S S^3)\)!

Enumeration:

  • Initialize \(V^\star=-\infty\), \(\pi_\star\) arbitrary
  • For all \(\pi:\mathcal S\to\mathcal A\):
    • compute \(V^\pi\) with PE
    • if \(V^\pi\geq V^\star \): set \(V^\star =V^\pi\) and \(\pi_\star=\pi\)
  • return \(\pi_\star\)
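A direct sketch of the enumeration procedure with the same hypothetical `R` and `P` arrays, looping over all \(A^S\) deterministic policies and solving each PE exactly:

```python
import itertools
import numpy as np

def enumerate_optimal_policy(R, P, gamma):
    """Brute force over all A^S deterministic policies; each exact PE solve is O(S^3)."""
    S, A = R.shape
    best_V = np.full(S, -np.inf)
    best_pi = None
    for pi in itertools.product(range(A), repeat=S):      # A^S candidate policies
        pi = np.array(pi)
        R_pi = R[np.arange(S), pi]
        P_pi = P[np.arange(S), pi, :]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
        if np.all(V >= best_V):                           # keep policies that dominate so far
            best_V, best_pi = V, pi
    return best_pi, best_V
```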

Bellman Optimality Equation

  • Just like the Bellman Expectation Equation made it easier to compute the Value for a given policy,
    • the Bellman Optimality Equation will make it easier to verify and compute the optimal policy/value function

Bellman Optimality Equation (BOE): $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$

Theorem (Bellman Optimality):

  1. If \(\pi_\star\) is an optimal policy, then \(V^{\pi_\star}\) satisfies the BOE
  2. If \(V^\pi\) satisfies the BOE, then \(\pi\) is an optimal policy

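Part 2 of the theorem suggests a numerical certificate of optimality: compute \(V^\pi\) (e.g. by exact PE) and measure how far it is from satisfying the BOE. A sketch with the same hypothetical arrays:

```python
import numpy as np

def boe_residual(R, P, gamma, V):
    """max_s | V(s) - max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ] |."""
    TV = np.max(R + gamma * (P @ V), axis=1)   # Bellman optimality backup of V
    return np.max(np.abs(V - TV))

# By the theorem, a policy pi is optimal exactly when boe_residual(R, P, gamma, V_pi)
# is zero (up to numerical tolerance) for its value function V_pi.
```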

Example

[Diagram: the two-state MDP from the running example.]

  • Suppose the reward is:
    • \(+1\) for \(s=0\) and \(-\frac{1}{2}\) for
      \(a=\) switch
  • Consider the policy \(\pi(s)=\)stay for all \(s\)
  • \(V^\pi(0) =\frac{1}{1-\gamma}\)
  • \(V^\pi(1)  =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\)
  • Is this optimal?
  • \(V^\pi(0) =\frac{1}{1-\gamma}\)
  • \(\max_{a\in\mathcal A} \left[ r(0, a) + \gamma \mathbb{E}_{s' \sim P( 0, a)} [V(s')] \right]\)
    • for \(a=\)stay, \(\frac{1}{1-\gamma}\)
    • for \(a=\)switch,
      • \(\frac{1}{2} + \gamma V(1) = \frac{\gamma (1-p_1)}{(1-\gamma p_1)(1-\gamma)} +\frac{1}{2} \leq 1 + \frac{\gamma}{1-\gamma} = \frac{1}{1-\gamma}\)
    • Thus BOE satisfied for \(s=0\)
  • \(V^\pi(1)  =\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\) [warning: possible algebra mistakes below]
  • \(\max_{a\in\mathcal A} \left[ r(1, a) + \gamma \mathbb{E}_{s' \sim P( 1, a)} [V(s')] \right]\)
    • for \(a=\)stay, \(\frac{1-p_1}{(1-\gamma p_1)(1-\gamma)}\)
    • for \(a=\)switch,
      • \(-\frac{1}{2} + \gamma ((1-p_2)V(1)+p_2V(0)) = \frac{\gamma (1-p_2)(1-p_1)}{(1-\gamma p_1)(1-\gamma)} + \frac{\gamma p_2}{1-\gamma} -\frac{1}{2} \)
    • Thus BOE satisfied if \(p_2\leq \frac{p_1}{1-\gamma p_1}+\frac{1-\gamma}{2}\)

[Heat map over discount factor \(\gamma\) (horizontal axis) and \(p_1\), the probability of staying in state \(1\) under "stay" (vertical axis).]

  • Color: maximum value that \(p_2\) can have for "stay" to be optimal
    • ranging from 0 (dark) to 1.5 (light)
  • If \(\gamma\) is small, cost of "switch" action is not worth it
  • If \(p_1\) is small, likely to transition without "switch" action


Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star}(s) =\mathbb E_{a\sim \pi_\star(s)}\left[ r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right] \)
      • \(\leq \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb E_{s'\sim P(s, a)}[V^{\pi_\star}(s')]\right]\)
      • \(= r(s, \hat \pi(s)) + \gamma \mathbb E_{s'\sim P(s, \hat \pi(s))}[V^{\pi_\star}(s')]\) (since \(\hat\pi(s)\) attains the max)
    • Writing the above expression in vector form:
    • \(V^{\pi_\star} \leq R^{\hat\pi} + \gamma P^{\hat\pi} V^{\pi_\star}\)

Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star}\)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - V^{\hat \pi}\) (subtract \(V^{\hat\pi}\) from both sides)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq R^{\hat \pi} + \gamma P^{\hat \pi} V^{\pi_\star} - R^{\hat \pi} - \gamma P^{\hat \pi} V^{\hat \pi}\) (Bellman Expectation Eq)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star}  -V^{\hat \pi})\)       (\(\star\))
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) to RHS)

Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
  • Then by definition of optimality, \(V^{\pi_\star}\geq V^{\hat \pi}\)
  • We now show that \( V^{\pi_\star}\leq V^{\hat \pi}\)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma P^{\hat\pi} (V^{\pi_\star}  -V^{\hat \pi})\)       (\(\star\))
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^2 (P^{\hat\pi})^2 (V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) to RHS)
    • \(V^{\pi_\star} - V^{\hat \pi} \leq \gamma^k (P^{\hat\pi})^k (V^{\pi_\star}-  V^{\hat \pi})\) (apply (\(\star\)) \(k\) times)
    • \(V^{\pi_\star} - V^{\hat\pi} \leq 0\) (limit \(k\to\infty\))
  • Therefore, \(V^{\pi_\star} = V^{\hat\pi}\)

Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

  • Consider the following policy $$\hat \pi(s) = \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]$$
  • We showed that \(V^{\pi_\star} = V^{\hat\pi}\)
    • this means \(\hat \pi(s)\) is an optimal policy!
  • By definition of \(\hat\pi\) and the Bellman Expectation Equation, \(V^{\hat \pi}\) satisfies the Bellman Optimality Equation
  • Therefore, \(V^{\pi_\star}\) must also satisfy it.

Bellman Optimality Proof

Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$

  • If we know the optimal value \(V^\star\) then we can write down optimal policies! $$\pi^\star(s) \in \arg\max_{a\in\mathcal A}\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\star}(s')] \right]$$
  • Recall the definition of the Q function: $$Q^\star(s,a)= r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] $$
  • \(\pi^\star(s) \in \arg\max_{a\in\mathcal A} Q^\star(s,a)\)
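In code, reading off an optimal policy from \(V^\star\) (equivalently, from \(Q^\star\)) is one argmax per state; a sketch with the same hypothetical arrays:

```python
import numpy as np

def greedy_policy(R, P, gamma, V):
    """pi(s) in argmax_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    Q = R + gamma * (P @ V)        # Q(s, a) for every state-action pair
    return np.argmax(Q, axis=1)    # ties broken by taking the first maximizer
```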

Bellman Optimality

Bellman Optimality Proof

Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • By part 1, we know that \(V^{\pi_\star}\) satisfies BOE
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(=|\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right] - \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\) (BOE by assumption and part 1)
    • \(\leq \max_{a\in\mathcal A} |r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] -\left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi_\star}(s')] \right]|\)
      • basic inequality from PSet 1: $$|\max_x f_1(x) - \max_x f_2(x)| \leq \max_x|f_1(x)-f_2(x)|$$
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)

Bellman Optimality Proof

Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)

  • Consider an optimal policy \(\pi_\star\) and the value \(V^{\pi_\star}\)
  • We bound \(|V^{\pi}(s)-V^{\pi_\star}(s)|\)
    • \(\leq \max_{a\in\mathcal A} \gamma |\mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')-V^{\pi_\star}(s')]|\) (linearity of expectation)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} [|V^\pi(s')-V^{\pi_\star}(s')|]\) (basic inequality PSet 1)
    • \(\leq \max_{a\in\mathcal A} \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ \max_{a'\in\mathcal A} \gamma \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
    • \(\leq \gamma^2 \max_{a,a'\in\mathcal A} \mathbb{E}_{s' \sim P( s, a)} \left[ \mathbb{E}_{s'' \sim P( s', a')} [|V^\pi(s'')-V^{\pi_\star}(s'')|]\right]\)
    • \(\leq \gamma^k \max_{a_1,\dots,a_k} \mathbb{E}_{s_1,\dots, s_k} [|V^\pi(s_k)-V^{\pi_\star}(s_k)|]\)
    • \(\leq 0\) (letting \(k\to\infty\))
  • Therefore, \(V^\pi = V^{\pi_\star}\) so \(\pi\) must be optimal

Agenda

 

1. Policy Evaluation

2. Optimal Policies

3. Value Iteration

Value Iteration

  • The Bellman Optimality Equation is a fixed point equation! $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
  • If \(V^\star\) satisfies the BOE then $$\pi_\star(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V^\star(s')]$$ is an optimal policy
  • Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).

Value Iteration

Value Iteration

  • Initialize \(V_1\)
  • For \(t=1,\dots,T\):
    • \(V_{t+1}(s) = \max_{a\in\mathcal A}  r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V_{t}(s') \right]\)
  • Return \(\displaystyle \hat\pi(s) = \arg\max_{a\in\mathcal A} r(s,a)+\gamma\mathbb E_{s'\sim P(s,a)}[V_T(s')]\)
  • Idea: find \(\hat V\) with fixed point iteration, then get approximately optimal policy \(\hat\pi\).
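A sketch of the Value Iteration pseudocode above with the same hypothetical `R` and `P` arrays; the fixed iteration count \(T\) follows the pseudocode, though in practice one might also stop once successive iterates are close:

```python
import numpy as np

def value_iteration(R, P, gamma, T, V0=None):
    """Repeat V_{t+1}(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V_t(s')] ], then act greedily."""
    S, A = R.shape
    V = np.zeros(S) if V0 is None else np.array(V0, dtype=float)
    for _ in range(T):
        V = np.max(R + gamma * (P @ V), axis=1)        # Bellman optimality backup
    pi_hat = np.argmax(R + gamma * (P @ V), axis=1)    # greedy policy w.r.t. V_T
    return pi_hat, V
```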

Bellman Operator

  • Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
  • Nonlinear map
  • Value Iteration is repeated application of the Bellman Operator
  • Compare with the Bellman Expectation update \(V\mapsto R^{\pi} + \gamma P^{\pi} V\) used in Approximate Policy Evaluation
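In code, the two operators are maps from \(\mathbb R^S\) to \(\mathbb R^S\): the optimality operator takes a max over actions (nonlinear), while the expectation operator for a fixed policy is affine. A sketch with the same hypothetical arrays:

```python
import numpy as np

def bellman_optimality_operator(R, P, gamma, V):
    """(T V)(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)}[V(s')] ]  -- nonlinear in V."""
    return np.max(R + gamma * (P @ V), axis=1)

def bellman_expectation_operator(R, P, gamma, pi, V):
    """(T^pi V)(s) = r(s,pi(s)) + gamma * E_{s'~P(s,pi(s))}[V(s')]  -- affine in V."""
    S = R.shape[0]
    return (R + gamma * (P @ V))[np.arange(S), pi]
```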

Recap

  • PSet 1 due Monday
  • PA 1 released today

 

  • Policy Evaluation
  • Optimal Policies
  • Value Iteration

 

  • Next lecture: Value Iteration, Policy Iteration
