Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Policy Iteration
3. Finite Horizon MDP
4. Dynamic Programming
Bellman Optimality Equation (BOE): The optimal value satisfies, \(\forall s\), $$V^\star(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\star(s')] \right]$$
Bellman Expectation Equation: For a given policy \(\pi\), the value is, \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
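Equivalently (a standard consequence of the expectation equation), stacking the values into a vector \(V^\pi\in\mathbb R^S\) and writing \(r^\pi(s) = \mathbb E_{a\sim\pi(s)}[r(s,a)]\) and \(P^\pi(s,s') = \mathbb E_{a\sim\pi(s)}[P(s'\mid s,a)]\), the value of a policy solves a linear system:
$$V^\pi = r^\pi + \gamma P^\pi V^\pi \quad\Longleftrightarrow\quad V^\pi = (I - \gamma P^\pi)^{-1} r^\pi,$$
where the inverse exists since \(P^\pi\) is a stochastic matrix and \(\gamma < 1\).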
Value Iteration
Q Value Iteration
We can think of the Q function as an \(S\times A\) array or an \(SA\) vector
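As an illustration, here is a minimal NumPy sketch of Q-value iteration with \(Q\) stored as an \(S\times A\) array; the array names P, r and the tolerance are assumptions for illustration, not from the lecture.

```python
import numpy as np

def q_value_iteration(P, r, gamma, num_iters=1000, tol=1e-8):
    """Tabular Q-value iteration.

    P: (S, A, S) array of transition probabilities P[s, a, s'].
    r: (S, A) array of rewards r[s, a].
    Returns the (S, A) array of Q values after convergence.
    """
    S, A = r.shape
    Q = np.zeros((S, A))           # Q as an S x A array
    for _ in range(num_iters):
        V = Q.max(axis=1)          # V(s) = max_a Q(s, a)
        Q_new = r + gamma * P @ V  # backup: r(s,a) + gamma * E_{s'~P(s,a)}[V(s')]
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q

# Greedy policy extraction: pi(s) = argmax_a Q(s, a)
# pi = q_value_iteration(P, r, gamma=0.9).argmax(axis=1)
```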
Define the Bellman Operator \(\mathcal T:\mathbb R^S\to \mathbb R^S\) as $$(\mathcal TV)(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Lemma (Contraction): For any \(V, V'\), $$\|\mathcal T V - \mathcal T V'\|_\infty \leq \gamma \|V-V'\|_\infty$$
Lemma (Convergence): For iterates \(V_t\) of VI, $$\|V_t - V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
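A minimal sketch of the operator \(\mathcal T\) and of value iteration \(V_{t+1} = \mathcal T V_t\), illustrating the contraction on a small random MDP (the random MDP and the variable names here are assumptions for illustration only):

```python
import numpy as np

def bellman_operator(V, P, r, gamma):
    """(T V)(s) = max_a [ r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')] ]."""
    return np.max(r + gamma * P @ V, axis=1)

# Small random tabular MDP, for demonstration only.
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # each P[s, a, :] sums to 1
r = rng.random((S, A))

# Value iteration: V_{t+1} = T V_t.
V = np.zeros(S)
gaps = []
for t in range(50):
    V_new = bellman_operator(V, P, r, gamma)
    gaps.append(np.max(np.abs(V_new - V)))   # ||V_{t+1} - V_t||_inf
    V = V_new

# Successive gaps shrink by at least a factor of gamma, as the contraction lemma predicts.
assert all(g2 <= gamma * g1 + 1e-12 for g1, g2 in zip(gaps, gaps[1:]))
```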
Theorem (Suboptimality): For policy \(\pi_T\) from VI, \(\forall s\) $$ V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$
Proof
Proof of Claim:
\(V^{\pi_t}(s) - V^\star(s) =\)
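A sketch of the standard argument, assuming \(\pi_T\) is the greedy policy with respect to \(V_T\): the claim is that for such a greedy policy, \(V^{\pi_T}(s) - V^\star(s) \geq -\frac{2\gamma}{1-\gamma}\|V_T - V^\star\|_\infty\) for all \(s\). Combining the claim with the convergence lemma,
$$V^\star(s) - V^{\pi_T}(s) \;\leq\; \frac{2\gamma}{1-\gamma}\|V_T - V^\star\|_\infty \;\leq\; \frac{2\gamma}{1-\gamma}\,\gamma^{T}\|V_0 - V^\star\|_\infty \;=\; \frac{2\gamma^{T+1}}{1-\gamma}\|V_0 - V^\star\|_\infty.$$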
1. Recap
2. Policy Iteration
3. Finite Horizon MDP
4. Dynamic Programming
Policy Iteration
[Gridworld diagram: a \(4\times 4\) grid of states numbered 0–15, with an additional state 16.]
Policy Iteration
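A minimal NumPy sketch of policy iteration, alternating exact policy evaluation (a linear solve of the Bellman expectation equation) with greedy improvement; the tabular arrays P, r and the function name are assumptions for illustration.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iters=100):
    """Tabular policy iteration.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    Returns a deterministic policy pi (length-S array of actions) and its value V^pi.
    """
    S, A = r.shape
    pi = np.zeros(S, dtype=int)                    # arbitrary initial policy
    for _ in range(max_iters):
        # Policy evaluation: solve V = r^pi + gamma * P^pi V exactly.
        P_pi = P[np.arange(S), pi]                 # (S, S) matrix P^pi
        r_pi = r[np.arange(S), pi]                 # (S,) vector r^pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        Q = r + gamma * P @ V                      # (S, A) array Q^pi
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):             # pi is greedy w.r.t. its own value: optimal
            break
        pi = pi_new
    return pi, V
```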
Lemma (Monotonic Improvement): For iterates \(\pi_t\) of PI, the value monotonically improves, i.e. $$ V^{\pi_{t+1}} \geq V^{\pi_{t}}$$
Proof:
Consider vectors \(V,V'\) and a matrix \(P\) with nonnegative entries.
In homework, you will show that if \(V\leq V'\) then \(PV\leq PV'\) (inequalities hold entrywise).
You will also show that \((P^\pi)^k\) is bounded when \(P^\pi\) is a stochastic matrix.
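These two facts sketch the rest of the argument: writing \(\mathcal T^\pi V = r^\pi + \gamma P^\pi V\) for the policy-specific Bellman operator and using that \(\pi_{t+1}\) is greedy with respect to \(Q^{\pi_t}\),
$$V^{\pi_t} = \mathcal T^{\pi_t} V^{\pi_t} \leq \mathcal T^{\pi_{t+1}} V^{\pi_t} \leq (\mathcal T^{\pi_{t+1}})^2 V^{\pi_t} \leq \dots \leq \lim_{k\to\infty} (\mathcal T^{\pi_{t+1}})^k V^{\pi_t} = V^{\pi_{t+1}},$$
where the first inequality holds because \(\pi_{t+1}\) maximizes over actions, the later ones follow from the entrywise monotonicity of \(P^{\pi_{t+1}}\), and the limit equals \(V^{\pi_{t+1}}\) because it is the fixed point of \(\mathcal T^{\pi_{t+1}}\).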
Theorem (PI Convergence): For \(\pi_t\) from PI, $$ \|V^{\pi_{t}}-V^\star\|_\infty \leq \gamma^t \|V^{\pi_{0}}-V^\star\|_\infty$$
Proof:
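A sketch of the standard argument: the improvement step and monotonic improvement give \(V^{\pi_{t+1}} \geq \mathcal T V^{\pi_t}\) entrywise, so since \(V^\star = \mathcal T V^\star\) and \(V^\star \geq V^{\pi_{t+1}}\),
$$0 \leq V^\star - V^{\pi_{t+1}} \leq \mathcal T V^\star - \mathcal T V^{\pi_t} \leq \gamma \|V^\star - V^{\pi_t}\|_\infty \mathbf 1,$$
hence \(\|V^{\pi_{t+1}} - V^\star\|_\infty \leq \gamma \|V^{\pi_t} - V^\star\|_\infty\), and iterating from \(\pi_0\) gives the bound.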
Policy Iteration vs. Value Iteration
Goal: achieve high cumulative reward:
$$\sum_{t=0}^{H-1} r_t$$
$$\underset{\pi}{\text{maximize}}\ \ \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right] \quad \text{s.t.}\quad s_{t+1}\sim P(s_t, a_t), ~~ a_t\sim \pi(s_t)$$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\)
Lasts exactly \(H\) steps, no discounting
[Diagram: two-state MDP (states 0 and 1) with actions stay and switch; labels: stay: 1, switch: 1, stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]
We consider time-varying policies $$\pi = (\pi_0,\dots,\pi_{H-1})$$
The value of a state also depends on time
$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k, a_k) \mid s_t=s,\ s_{k+1}\sim P(s_k, a_k),\ a_k\sim \pi_k(s_k)\right]$$
[Diagram: the same two-state MDP example as above.]
Bellman Expectation Equation: \(\forall s\),
\(V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]\)
Q function
\(Q_t^{\pi}(s, a) = \ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]\)
Rather than a recursive (fixed-point) equation, in the finite-horizon setting we have an iterative equation: \(V_t^\pi\) is defined in terms of \(V_{t+1}^\pi\).
Bellman optimality is also an iterative rather than a recursive equation: \(V^\star_t(s)=\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)
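This iterative structure is exactly what backward-induction dynamic programming exploits; a minimal NumPy sketch, where the tabular arrays P, r and the function name are assumptions for illustration:

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Backward induction for a finite-horizon MDP (no discounting).

    P: (S, A, S) transition probabilities, r: (S, A) rewards, H: horizon.
    Returns optimal values V[t, s] and a time-varying greedy policy pi[t, s].
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))          # V[H] = 0: no reward after the horizon
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):      # t = H-1, ..., 0
        Q_t = r + P @ V[t + 1]        # Q_t(s,a) = r(s,a) + E_{s'~P(s,a)}[V_{t+1}(s')]
        V[t] = Q_t.max(axis=1)        # V*_t(s) = max_a Q_t(s, a)
        pi[t] = Q_t.argmax(axis=1)    # pi_t(s) = argmax_a Q_t(s, a)
    return V, pi
```

The returned policy is time-varying, \(\pi = (\pi_0,\dots,\pi_{H-1})\), matching the definition above.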
[Diagram: two-state MDP (states 0 and 1) with actions stay and switch; labels: stay: 1, switch: 1, stay: \(1-p\), switch: \(1-2p\), stay: \(p\), switch: \(2p\).]