Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/switch; transition labels: stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).]
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t)\right]$$
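As a concrete illustration of this definition, here is a minimal Monte Carlo sketch on a hypothetical two-state MDP in the spirit of the running example; the transition probabilities, rewards, and policy below are assumptions for illustration, not parameters from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
p1, p2 = 0.8, 0.7   # assumed values for the example's parameters

# P[a][s, s']: probability of landing in s' from s under action a (0 = stay, 1 = switch);
# the rows for state 1 are assumptions, not taken from the slides.
P = {0: np.array([[p1, 1 - p1],
                  [0.0, 1.0]]),
     1: np.array([[1 - p2, p2],
                  [0.0, 1.0]])}
# r[s, a]: assumed reward of 1 in state 1 and 0 in state 0, for every action
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])

def pi(s):
    """Example policy: always choose action 0 ("stay")."""
    return 0

def monte_carlo_value(s0, n_rollouts=2000, horizon=200):
    """Estimate V^pi(s0) = E[sum_t gamma^t r(s_t, a_t)] by averaging sampled returns."""
    returns = []
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):   # truncate the infinite sum; gamma^200 is negligible
            a = pi(s)
            ret += discount * r[s, a]
            s = rng.choice(2, p=P[a][s])
            discount *= gamma
        returns.append(ret)
    return float(np.mean(returns))

print(monte_carlo_value(0), monte_carlo_value(1))
```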
Bellman Expectation Equation: \(\forall s\),
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
...
...
...
Q function: \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \)
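Continuing the sketch above (same assumed P, r, gamma), the Bellman expectation backup and the Q function can be written directly from these two equations:

```python
def bellman_expectation_op(V, P, r, pi, gamma):
    """(T^pi V)(s) = r(s, pi(s)) + gamma * sum_s' P(s' | s, pi(s)) * V(s')."""
    return np.array([r[s, pi(s)] + gamma * P[pi(s)][s] @ V for s in range(len(V))])

def q_from_v(V, P, r, gamma):
    """Q^pi(s, a) = r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')]."""
    S, A = r.shape
    return np.array([[r[s, a] + gamma * P[a][s] @ V for a in range(A)] for s in range(S)])

# V^pi is the fixed point of T^pi, and for a deterministic policy V^pi(s) = Q^pi(s, pi(s)).
```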
Proof of BE
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
[Figure: backup diagram for \(V^\pi(s)\): from state \(s\), collect reward \(r(s,\pi(s))\) and transition to \(s'\) with probability \(P(s'\mid s,\pi(s))\).]
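Rearranging the matrix form gives the linear system \((I - \gamma P_\pi) V^\pi = R^\pi\). A minimal sketch of exact policy evaluation, reusing the assumed two-state MDP and policy from the earlier sketch:

```python
# Exact policy evaluation: solve (I - gamma * P_pi) V = R_pi.
S = 2
P_pi = np.array([P[pi(s)][s] for s in range(S)])   # S x S state-transition matrix under pi
R_pi = np.array([r[s, pi(s)] for s in range(S)])   # reward collected in each state under pi
V_exact = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
print(V_exact)
```

Solving this linear system directly costs \(\mathcal O(S^3)\) in general, which motivates the iterative scheme below.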
To trade off exactness for computation time, we can use a fixed point iteration algorithm
Approximate Policy Evaluation: initialize \(V_0\) and iterate \(V_{t+1} = R^\pi + \gamma P_\pi V_t\)
Complexity of each iteration is \(\mathcal O(S^2)\)
To show that Approx PE works, we first prove a contraction lemma
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
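One way to fill in the argument, using the Approx PE update \(V_{t+1} = R^\pi + \gamma P_\pi V_t\) and the fixed-point relation \(V^\pi = R^\pi + \gamma P_\pi V^\pi\):
$$V_{t+1} - V^\pi = (R^\pi + \gamma P_\pi V_t) - (R^\pi + \gamma P_\pi V^\pi) = \gamma P_\pi (V_t - V^\pi)$$
Each row of \(P_\pi\) is a probability distribution, so \(\|P_\pi x\|_\infty \leq \|x\|_\infty\) for any \(x\); applying this with \(x = V_t - V^\pi\) gives the claimed bound.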
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires \(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
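A minimal sketch of Approx PE with the iteration count from this bound, reusing P_pi, R_pi, V_exact, and gamma from the sketches above; the tolerance eps is an assumption:

```python
# Fixed-point iteration V_{t+1} = R_pi + gamma * P_pi V_t, run for the T given by the bound.
eps = 1e-6
V = np.zeros(S)                                   # V_0 = 0
gap0 = np.linalg.norm(V - V_exact, np.inf)        # ||V_0 - V^pi||_inf (known here only for illustration)
T = int(np.ceil(np.log(gap0 / eps) / np.log(1 / gamma)))
for _ in range(T):
    V = R_pi + gamma * P_pi @ V
print(T, np.linalg.norm(V - V_exact, np.inf))     # final error is at most eps
```

In practice \(\|V_0 - V^\pi\|_\infty\) is unknown; with \(V_0 = 0\) it can be upper bounded by \(\max_{s,a}|r(s,a)| / (1-\gamma)\).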
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
Enumeration: one approach is to evaluate each of the \(|\mathcal A|^{|\mathcal S|}\) deterministic policies and pick the best, as in the sketch below
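A brute-force sketch of enumeration on the assumed two-state MDP from the earlier sketches (here \(|\mathcal A|^{|\mathcal S|} = 2^2 = 4\) deterministic policies):

```python
from itertools import product

def evaluate(policy, P, r, gamma):
    """Exact V^pi for a deterministic policy given as a tuple (a_0, ..., a_{S-1})."""
    S = len(policy)
    P_pi = np.array([P[policy[s]][s] for s in range(S)])
    R_pi = np.array([r[s, policy[s]] for s in range(S)])
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# An optimal policy maximizes the value at every state simultaneously,
# so it also maximizes the sum of values used as the ranking key here.
best = max(product(range(2), repeat=2), key=lambda pol: evaluate(pol, P, r, gamma).sum())
print(best, evaluate(best, P, r, gamma))
```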
Bellman Optimality Equation (BOE): $$V(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V(s')] \right]$$
Theorem (Bellman Optimality):
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi\) satisfies \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
[Figure: two-state MDP example as above, annotated with the discount factor \(\gamma\) and \(p_1\), the probability of staying given action "stay".]
Theorem (Bellman Optimality) 1: If \(\pi^\star\) is an optimal policy, $$V^{\pi^\star}(s) = \max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^{\pi^\star}(s')] \right]$$
Theorem (Bellman Optimality) 2: \(\pi\) is an optimal policy if \(V^\pi(s)=\max_{a\in\mathcal A} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
1. Policy Evaluation
2. Optimal Policies
3. Value Iteration
Value Iteration
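As a preview of this agenda item, here is a hedged sketch of value iteration, which repeatedly applies the Bellman optimality backup from the BOE above (same assumed two-state MDP; the stopping tolerance and iteration cap are assumptions):

```python
# Value iteration: V_{t+1}(s) = max_a [ r(s, a) + gamma * sum_s' P(s' | s, a) V_t(s') ].
def value_iteration(P, r, gamma, tol=1e-8, max_iters=10_000):
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(max_iters):
        Q = np.array([[r[s, a] + gamma * P[a][s] @ V for a in range(A)] for s in range(S)])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    greedy = Q.argmax(axis=1)   # greedy policy with respect to the final value estimate
    return V, greedy

V_star, pi_star = value_iteration(P, r, gamma)
print(V_star, pi_star)
```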