Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\) defined by states, actions, reward, transition, discount factor
Agent-environment loop: action \(a_t\in\mathcal A\), state \(s_t\in\mathcal S\), reward \(r_t\sim r(s_t, a_t)\), next state \(s_{t+1}\sim P(s_t, a_t)\)
Goal: find a policy \(\pi\) to
maximize \(\displaystyle \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
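A minimal sketch (not from the slides) of how such a finite MDP might be stored, assuming integer-indexed states and actions with the convention \(P[s,a,s'] = P(s'\mid s,a)\); the transition numbers are illustrative placeholders, and only \(r(0,a)=1,\ r(1,a)=0\) comes from the example used later in the lecture.

```python
import numpy as np

# Tabular MDP M = (S, A, r, P, gamma) with integer states and actions.
# Convention: P[s, a, s'] = P(s' | s, a), R[s, a] = r(s, a).
num_states, num_actions, gamma = 2, 2, 0.9

# Placeholder transition probabilities (each row P[s, a, :] sums to 1).
P = np.array([[[0.8, 0.2],    # s=0, a=0
               [0.3, 0.7]],   # s=0, a=1
              [[0.0, 1.0],    # s=1, a=0
               [0.5, 0.5]]])  # s=1, a=1

# Deterministic reward, as in the lecture's example: r(0, a) = 1, r(1, a) = 0.
R = np.array([[1.0, 1.0],
              [0.0, 0.0]])

# A deterministic policy is just a map from states to actions.
pi = np.array([0, 0])
```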
Dataset from expert policy \(\pi_\star\): $$ \{(s_i, a_i)\}_{i=1}^N \sim \mathcal D_\star $$
Learn \(\pi\) by minimizing the supervised loss \(\displaystyle \sum_{i=1}^N \ell(\pi(s_i), a_i)\)
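This objective is ordinary supervised learning on the expert data. A minimal tabular sketch (not from the slides), assuming the 0-1 loss, where the empirical minimizer simply picks the most common expert action in each state; the function name and default action are illustrative.

```python
from collections import Counter, defaultdict

def behavior_cloning(dataset, default_action=0):
    """dataset: list of (state, action) pairs sampled from the expert policy.
    Returns a deterministic policy: the most frequent expert action per state,
    which minimizes the empirical 0-1 loss sum_i 1[pi(s_i) != a_i]."""
    counts = defaultdict(Counter)
    for s, a in dataset:
        counts[s][a] += 1
    table = {s: c.most_common(1)[0][0] for s, c in counts.items()}
    # States the expert never visited fall back to an arbitrary default action --
    # exactly the "no recovery data" issue noted below.
    return lambda s: table.get(s, default_action)
```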
(Diagram: expert trajectory vs. learned policy, which drifts away from the states the expert visited.)
No training data of "recovery" behavior!
Trajectory: \(s_0, a_0, s_1, a_1, s_2, a_2, \dots\)
Proposition: The state distribution evolves according to \( d_t = (P_\pi^t)^\top d_0\)
where \(P_\pi^\top\) has entries \([P_\pi^\top]_{s',s} = P(s'\mid s,\pi(s))\), i.e. row \(s\) of \(P_\pi\) is \(P(\cdot\mid s,\pi(s))\).
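A quick numerical check of the proposition (not from the slides), assuming \(P_\pi\) is stored as an array whose rows are \(P(\cdot\mid s,\pi(s))\); the entries below are placeholders.

```python
import numpy as np

# P_pi[s, s'] = P(s' | s, pi(s)); illustrative placeholder entries.
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
d0 = np.array([1.0, 0.0])      # start in state 0 with probability 1

def state_distribution(d0, P_pi, t):
    """d_t = (P_pi^t)^T d_0: push the state distribution forward t steps."""
    return np.linalg.matrix_power(P_pi, t).T @ d0

# Equivalently, apply d_{k+1} = P_pi^T d_k one step at a time.
d = d0
for _ in range(3):
    d = P_pi.T @ d
assert np.allclose(d, state_distribution(d0, P_pi, 3))
```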
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
The value of a state \(s\) under a policy \(\pi\) is the expected cumulative discounted reward starting from that state
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r_t \mid s_0=s,s_{t+1}\sim P(s_t, a_t),a_t\sim \pi(s_t), r_t\sim r(s_t, a_t)\right]$$
Simplification for the rest of the lecture: \(r(s,a)\) is deterministic
(Diagram: example MDP with states \(\{0,1\}\) and actions \(\{\text{stay},\text{switch}\}\); transition edges labeled stay: \(1\), switch: \(1\), stay: \(p_1\), switch: \(1-p_2\), stay: \(1-p_1\), switch: \(p_2\).)
The cumulative reward of a given trajectory $$\sum_{t=0}^\infty \gamma^t r(s_t, a_t)$$
The expected cumulative reward averages over all possible trajectories
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) \mid s_0=s,P,\pi\right]$$
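This average over trajectories can be approximated by sampling. A minimal Monte Carlo sketch (not from the slides), assuming the tabular `P[s, a, s']`, `R[s, a]` arrays from the earlier sketch and a finite horizon as a stand-in for the infinite sum.

```python
import numpy as np

def mc_value(s0, policy, P, R, gamma, n_rollouts=1000, horizon=200, seed=0):
    """Monte Carlo estimate of V^pi(s0): average discounted return over rollouts.
    Truncating at `horizon` adds error at most gamma^horizon / (1 - gamma)
    when rewards lie in [0, 1]."""
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            ret += disc * R[s, a]
            s = rng.choice(S, p=P[s, a])
            disc *= gamma
        total += ret
    return total / n_rollouts
```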
(Diagram: possible trajectories of the example MDP over states \(0\) and \(1\), with branch probabilities \(p_1\) and \(1-p_1\); transition edges labeled as above.)
Food for thought: what distribution determines the value function?
$$V^\pi(s) = \mathbb E\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0=s, P, \pi\right]$$
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
Exercise: review proof (below)
Bellman Expectation Equation:
\(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
The cumulative reward expression is almost recursive:
$$\sum_{t=0}^\infty \gamma^t r(s_t, a_t) = r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) $$
Proof
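A sketch of the argument (worked out in lecture, not in the slide text): take expectations of the recursive expression above and use the Markov property.

$$\begin{aligned}
V^\pi(s) &= \mathbb E\left[r(s_0,a_0) + \gamma \sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \,\Big|\, s_0=s, P, \pi\right]\\
&= \mathbb E_{a\sim\pi(s)}\left[r(s,a)\right] + \gamma\, \mathbb E_{a\sim\pi(s)}\, \mathbb E_{s'\sim P(s,a)}\left[\mathbb E\left[\sum_{t'=0}^\infty \gamma^{t'} r(s_{t'+1},a_{t'+1}) \,\Big|\, s_1=s', P, \pi\right]\right]\\
&= \mathbb E_{a\sim\pi(s)}\left[r(s,a) + \gamma\, \mathbb E_{s'\sim P(s,a)}\left[V^\pi(s')\right]\right],
\end{aligned}$$

where the last step uses the Markov property: conditioned on \(s_1=s'\), the tail sum is distributed as the return starting from \(s'\), so its expectation is \(V^\pi(s')\).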
(Example: computing \(V^\pi\) for the two-state MDP with branch probabilities \(p_1\) and \(1-p_1\).)
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
1. Recap
2. Value Function
3. Bellman Equation
4. Policy Evaluation
\(P_\pi\) is the matrix whose row \(s\) is \(P(\cdot\mid s,\pi(s))\)
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
Entrywise: \(V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'\mid s,\pi(s))\, V^\pi(s')\)
To exactly compute the value function, we just need to solve the \(S\times S\) system of linear equations
\(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\), i.e. \(V^\pi = (I - \gamma P_\pi)^{-1} R^\pi\).
Matrix inversion is slow! \(\mathcal O(S^3)\)
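A minimal sketch (not from the slides) of the exact computation, assuming the tabular `P[s, a, s']`, `R[s, a]` arrays and deterministic policy `pi[s]` from the earlier sketch.

```python
import numpy as np

def exact_policy_evaluation(P, R, pi, gamma):
    """Solve (I - gamma * P_pi) V = R_pi for a deterministic tabular policy pi.
    P[s, a, s'] = P(s' | s, a), R[s, a] = r(s, a), pi[s] = action in state s."""
    S = P.shape[0]
    P_pi = P[np.arange(S), pi]          # row s is P(. | s, pi(s)), shape (S, S)
    R_pi = R[np.arange(S), pi]          # entry s is r(s, pi(s)), shape (S,)
    # Solving the dense linear system is O(S^3) in general.
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
```

Since \(P_\pi\) is row-stochastic, \(\gamma P_\pi\) has spectral radius at most \(\gamma < 1\), so \(I - \gamma P_\pi\) is invertible and the solve is well posed.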
To trade off accuracy for computation time, we can use a fixed point iteration algorithm.
Approximate Policy Evaluation: initialize \(V_0\) and iterate \(V_{t+1} = R^{\pi} + \gamma P_{\pi} V_t\)
Complexity of each iteration?
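A sketch (not from the slides) of the fixed point iteration, assuming the \(P_\pi\), \(R^\pi\) representation built above; each iteration is one matrix-vector product, i.e. \(\mathcal O(S^2)\).

```python
import numpy as np

def approx_policy_evaluation(P_pi, R_pi, gamma, tol=1e-8, max_iter=10_000):
    """Fixed point iteration V_{t+1} = R_pi + gamma * P_pi @ V_t.
    Each iteration costs one O(S^2) matrix-vector product."""
    V = np.zeros_like(R_pi, dtype=float)
    for _ in range(max_iter):
        V_next = R_pi + gamma * P_pi @ V
        if np.max(np.abs(V_next - V)) < tol:
            return V_next
        V = V_next
    return V
```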
(Example: policy evaluation on the two-state MDP with branch probabilities \(p_1\) and \(1-p_1\).)
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
To show that Approx PE works, we first prove a contraction lemma
Lemma: For iterates of Approx PE, $$\|V_{t+1} - V^\pi\|_\infty \leq \gamma \|V_t-V^\pi\|_\infty$$
Proof
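A sketch of the contraction argument (not spelled out in the slide text), using \(V_{t+1} = R^\pi + \gamma P_\pi V_t\), \(V^\pi = R^\pi + \gamma P_\pi V^\pi\), and the fact that \(P_\pi\) is row-stochastic:

$$\|V_{t+1}-V^\pi\|_\infty = \gamma\,\|P_\pi(V_t - V^\pi)\|_\infty = \gamma \max_s \Big|\sum_{s'} P(s'\mid s,\pi(s))\,(V_t(s')-V^\pi(s'))\Big| \le \gamma\,\|V_t-V^\pi\|_\infty,$$

since each row of \(P_\pi\) is a probability distribution, so each entry of \(P_\pi x\) is a convex combination of the entries of \(x\).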
Theorem: For iterates of Approx PE, $$\|V_{t} - V^\pi\|_\infty \leq \gamma^t \|V_0-V^\pi\|_\infty$$
so an \(\epsilon\)-correct solution requires
\(T\geq \log\frac{\|V_0-V^\pi\|_\infty}{\epsilon} / \log\frac{1}{\gamma}\)
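Unrolling the theorem, accuracy \(\epsilon\) is guaranteed once \(\gamma^T \|V_0-V^\pi\|_\infty \leq \epsilon\):

$$\gamma^T\|V_0-V^\pi\|_\infty \le \epsilon \iff T\log\tfrac1\gamma \ge \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \iff T \ge \log\tfrac{\|V_0-V^\pi\|_\infty}{\epsilon} \Big/ \log\tfrac1\gamma.$$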