Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
The value function \(V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k,a_k) \mid s_t=s \right]\)
Bellman Consistency Equation enables efficient policy evaluation
Optimal policies have optimal (highest) value \(V^\star_t(s)\) for all \(t,s\)
Bellman Optimality Equation (BOE) enables efficient policy optimization
The optimal policy is greedy with respect to the optimal value $$\pi_t^\star(s) = \arg\max_a r(s,a) + \mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
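As an illustration of the recap (a minimal sketch, not from the slides), backward induction computes \(V_t^\star\) and the greedy \(\pi_t^\star\) for a tabular MDP; the array names `r[s, a]` and `P[s, a, s']` are assumptions:

```python
import numpy as np

def backward_induction(r, P, H):
    """Optimal values V_t^*(s) and greedy policies pi_t^*(s) for a
    tabular MDP with rewards r[s, a] and transitions P[s, a, s']."""
    S, A = r.shape
    V = np.zeros((H + 1, S))        # V_H = 0: no reward after the horizon
    pi = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):
        Q = r + P @ V[t + 1]        # Q[s, a] = r(s, a) + E_{s' ~ P(s, a)}[V_{t+1}(s')]
        V[t] = Q.max(axis=1)        # optimal value at time t
        pi[t] = Q.argmax(axis=1)    # greedy w.r.t. the optimal value
    return V, pi
```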
1. Recap
2. Infinite Horizon & Value Function
3. Bellman Equation
4. Policy Evaluation
Goal: achieve high cumulative reward:
$$\sum_{t=0}^\infty \gamma^t r_t$$
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
action \(a_t\in\mathcal A\)
state \(s_t\in\mathcal S\)
reward
\(r_t= r(s_t, a_t)\)
\(s_{t+1}\sim P(s_t, a_t)\)
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t) \mid s_0=s,\ s_{t+1}\sim P(s_t, a_t),\ a_t\sim \pi(s_t) \right]$$
In this lecture, we take \(r(s,a)\) to be deterministic and \(\pi\) to be stationary and state-dependent.
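One way to make this definition concrete (a sketch assuming helper functions `policy(s)`, `reward(s, a)`, and `sample_next(s, a)`, which are not part of the slides) is to estimate \(V^\pi(s)\) by averaging truncated discounted returns over sampled rollouts:

```python
import numpy as np

def mc_value_estimate(s0, policy, reward, sample_next, gamma,
                      n_rollouts=1000, horizon=500):
    """Monte Carlo estimate of V^pi(s0): average of truncated discounted returns.
    Truncation is justified because gamma^horizon is negligible for large horizon."""
    returns = []
    for _ in range(n_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)                     # a_t ~ pi(s_t)
            total += discount * reward(s, a)  # accumulate gamma^t r(s_t, a_t)
            discount *= gamma
            s = sample_next(s, a)             # s_{t+1} ~ P(s_t, a_t)
        returns.append(total)
    return float(np.mean(returns))
```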
[Figure: two-state MDP with states \(0\) and \(1\), actions stay and move, and transition probabilities \(p_1,\ 1-p_1,\ p_2,\ 1-p_2\).]
PSet preview: what distribution determines the value function?
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0=s, P, \pi\right]$$
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Exercise: review proof (below)
Bellman Consistency Equation:
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
...
...
...
The cumulative reward expression is almost recursive:
$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$
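A sketch of the key step (under the assumptions above: deterministic rewards and stationary \(\pi\)): taking expectations of both sides of this recursion and using the Markov property gives the consistency equation,
$$\begin{aligned} V^\pi(s) &= \mathbb E\Big[ r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \,\Big|\, s_0 = s\Big] \\ &= \mathbb E_{a\sim\pi(s)}\Big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\Big[\mathbb E\Big[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \,\Big|\, s_0=s'\Big]\Big]\Big] \\ &= \mathbb E_{a\sim\pi(s)}\big[ r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}[V^\pi(s')]\big]. \end{aligned}$$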
Proof
[Figure: two-state MDP example under the policy \(\pi(s)=\) stay.]
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
In vector notation, \(P_\pi\) is the \(S\times S\) matrix whose row \(s\) is \(P(\cdot\mid s,\pi(s))\), i.e. \([P_\pi]_{s,s'} = P(s'\mid s,\pi(s))\), and \(V^\pi, R^\pi\) are the vectors with entries \(V^\pi(s)\) and \(r(s,\pi(s))\).
Matrix inversion is slow! \(\mathcal O(S^3)\)
To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations:
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
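A minimal sketch of exact policy evaluation in NumPy, solving \((I - \gamma P_\pi)V^\pi = R^\pi\); the array names `r`, `P`, `pi` are assumptions, not from the slides:

```python
import numpy as np

def exact_policy_evaluation(r, P, pi, gamma):
    """Solve V = R_pi + gamma P_pi V for a deterministic policy pi[s],
    rewards r[s, a], and transitions P[s, a, s']."""
    S = r.shape[0]
    R_pi = r[np.arange(S), pi]   # R_pi[s] = r(s, pi(s))
    P_pi = P[np.arange(S), pi]   # P_pi[s, s'] = P(s' | s, pi(s))
    # Solving the linear system costs O(S^3), the same order as explicit inversion
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```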
Approximate Policy Evaluation:
Complexity of each iteration?
To trade off exactness for computation time, we can use a fixed point iteration: initialize \(V_0\) and iterate \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
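A sketch of this iteration under the same assumed array representation (hypothetical names `R_pi`, `P_pi`):

```python
import numpy as np

def approx_policy_evaluation(R_pi, P_pi, gamma, tol=1e-6, max_iters=10_000):
    """Fixed point iteration V_{t+1} = R_pi + gamma P_pi V_t, starting from V_0 = 0."""
    V = np.zeros_like(R_pi)
    for _ in range(max_iters):
        V_next = R_pi + gamma * P_pi @ V       # one matrix-vector multiply: O(S^2)
        if np.max(np.abs(V_next - V)) < tol:   # stop when successive iterates are close
            return V_next
        V = V_next
    return V
```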
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
[Figure: two-state MDP example under the policy \(\pi(s)=\) stay.]
To show that Approx PE works, we first prove a contraction lemma
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
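A sketch of the argument (using the update \(V_{t+1} = R^\pi + \gamma P_\pi V_t\), the equation \(V^\pi = R^\pi + \gamma P_\pi V^\pi\), and the fact that the rows of \(P_\pi\) sum to one):
$$\|V_{t+1} - V^\pi\|_\infty = \|\gamma P_\pi (V_t - V^\pi)\|_\infty = \gamma \max_s \Big|\sum_{s'} P(s'\mid s,\pi(s))\,(V_t - V^\pi)(s')\Big| \leq \gamma \|V_t - V^\pi\|_\infty$$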
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
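since requiring the theorem's error bound to be at most \(\epsilon\) gives
$$\gamma^T \|V_0 - V^\pi\|_\infty \leq \epsilon \iff T \log\tfrac{1}{\gamma} \geq \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \iff T \geq \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \Big/ \log\tfrac{1}{\gamma}$$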