Prof. Sarah Dean
MW 2:55-4:10pm
255 Olin Hall
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H\}\) defined by states, actions, reward, transition, horizon
[Figure: agent-environment interaction loop, with action \(a_t\in\mathcal A\), state \(s_t\in\mathcal S\), reward \(r_t= r(s_t, a_t)\), and next state \(s_{t+1}\sim P(s_t, a_t)\)]
in today's lecture, \(r(s,a)\) is deterministic
Goal: maximize expected cumulative reward
$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right]$$
probability of trajectory \(\tau=(s_0,a_0,\dots,s_{H-1},a_{H-1})\) under transitions \(P\), policy \(\pi\), and initial distribution \(\mu_0\), where \(s_0\sim\mu_0\), \(a_t=\pi_t(s_t)\), \(r_t= r(s_t, a_t)\), and \(s_{t}\sim P(s_{t-1}, a_{t-1})\)
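Concretely, for a deterministic policy this trajectory distribution factors as
$$\mathbb{P}_{\mu_0}^\pi(\tau) = \mu_0(s_0)\prod_{t=1}^{H-1} P(s_t \mid s_{t-1}, a_{t-1}), \qquad a_t = \pi_t(s_t)$$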
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
The value of a state \(s\) under a policy \(\pi\) at time \(t\) is the expected cumulative reward-to-go
$$V_t^\pi(s) = \mathbb E\left[\sum_{k=t}^{H-1} r(s_k, a_k) \mid s_t=s,s_{k+1}\sim P(s_k, a_k),a_k\sim \pi_k(s_k)\right]$$
[Diagram: trajectory from time \(t\): \(s_t, a_t, s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots, s_{H-1}, a_{H-1}\)]
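To make the expectation concrete, \(V_t^\pi(s)\) can be approximated by averaging the reward-to-go over simulated trajectories. A minimal sketch, assuming a tabular representation (`P[s, a, s']` transitions, `r[s, a]` rewards, deterministic policy `pi[t, s]`); these names are illustrative, not the lecture's notation.

```python
import numpy as np

def monte_carlo_value(P, r, pi, H, t, s, n_rollouts=10_000, seed=0):
    """Estimate V_t^pi(s) as the average reward-to-go over sampled trajectories."""
    rng = np.random.default_rng(seed)
    S = r.shape[0]
    total = 0.0
    for _ in range(n_rollouts):
        state, reward_to_go = s, 0.0
        for k in range(t, H):                      # steps t, t+1, ..., H-1
            a = pi[k, state]                       # a_k = pi_k(s_k)
            reward_to_go += r[state, a]            # accumulate r(s_k, a_k)
            state = rng.choice(S, p=P[state, a])   # s_{k+1} ~ P(s_k, a_k)
        total += reward_to_go
    return total / n_rollouts
```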
[Figure: two-state MDP with states \(0\) and \(1\) and actions stay/move; transition arrows labeled stay: \(1\), move: \(1\); stay: \(p_1\), move: \(1-p_2\); stay: \(1-p_1\), move: \(p_2\). Below, the Markov chains induced by \(\pi(s)=\)stay (arrows labeled \(1\), \(p_1\), \(1-p_1\)) and by \(\pi(s)=\)move (arrows labeled \(1\), \(1-p_2\), \(p_2\))]
Bellman Consistency Equation: \(\forall s\), $$V_t^{\pi}(s) = \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]$$
Exercise: review the proof below
Enables policy evaluation (i.e. computing \(V_t^\pi\)) by backwards iteration
Initialize \(V_H^\pi(s) =0\) for all \(s\in\mathcal S\)
For \(t=H-1,H-2,...,0\): $$V_t^{\pi}(s)=\mathbb{E}_{a \sim \pi_t(s)} \left[ Q_t^\pi(s,a) \right] ~~\forall ~s\in\mathcal S$$
where \(Q_t^\pi(s,a) = r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')]\) denotes the state-action value function, i.e. the bracketed term in the Bellman Consistency Equation
Total complexity to compute \(V_t^\pi\) for all \(t\): \(O(S^2AH)\), where \(S=|\mathcal S|\) and \(A=|\mathcal A|\)
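As a concrete illustration, here is a minimal sketch of this backwards iteration for a tabular MDP; the array layout (`P[s, a, s']` for transitions, `r[s, a]` for rewards, deterministic time-varying policy `pi[t, s]`) is an assumption made for this sketch.

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Backwards-iteration policy evaluation for a finite-horizon tabular MDP.

    P  : (S, A, S) array, P[s, a, s'] = probability of s' given (s, a)
    r  : (S, A) array, deterministic reward r(s, a)
    pi : (H, S) integer array, pi[t, s] = action taken in state s at time t
    Returns V of shape (H + 1, S), with V[H] = 0 by initialization.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V_H^pi(s) = 0 for all s
    for t in range(H - 1, -1, -1):      # t = H-1, H-2, ..., 0
        for s in range(S):
            a = pi[t, s]
            # Q_t^pi(s, a) = r(s, a) + E_{s' ~ P(s, a)}[ V_{t+1}^pi(s') ]
            V[t, s] = r[s, a] + P[s, a] @ V[t + 1]
    return V
```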
Proof
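A sketch of the standard argument, splitting off the first reward and applying the tower property of conditional expectation together with the Markov structure of the trajectory:
$$\begin{aligned}
V_t^{\pi}(s) &= \mathbb E\left[ r(s_t,a_t) + \sum_{k=t+1}^{H-1} r(s_k, a_k) \,\Big|\, s_t=s\right]\\
&= \mathbb E_{a\sim \pi_t(s)}\left[ r(s, a) + \mathbb E_{s'\sim P(s,a)}\Big[ \mathbb E\Big[\textstyle\sum_{k=t+1}^{H-1} r(s_k, a_k) \,\Big|\, s_{t+1}=s'\Big]\Big]\right]\\
&= \mathbb{E}_{a \sim \pi_t(s)} \left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\pi(s')] \right]
\end{aligned}$$
where the second equality conditions on \(a_t\) and \(s_{t+1}\), and the last uses the definition of \(V_{t+1}^\pi\).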
[Figure: Markov chain induced by \(\pi(s)=\)stay on the two-state example, with arrows labeled \(1\), \(p_1\), \(1-p_1\)]
Recall \(r(0,a)=1\) and \(r(1,a)=0\)
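As a worked instance, assuming (per the diagram) that stay keeps state \(0\) at \(0\) with probability \(1\) and moves state \(1\) to \(0\) with probability \(p_1\), the Bellman Consistency Equation gives
$$V_t^{\pi}(0) = 1 + V_{t+1}^{\pi}(0), \qquad V_t^{\pi}(1) = p_1\, V_{t+1}^{\pi}(0) + (1-p_1)\, V_{t+1}^{\pi}(1)$$
so with \(V_H^{\pi}\equiv 0\), backwards iteration yields \(V_t^{\pi}(0) = H-t\).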
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
Goal: maximize expected cumulative reward
$$\max_\pi ~\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right]$$
$$=\max_\pi \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$
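The equality holds by the tower property of expectation, conditioning on the initial state:
$$\mathbb E_{\tau\sim \mathbb{P}_{\mu_0}^\pi }\left[\sum_{k=0}^{H-1} r(s_k, a_k) \right] = \mathbb E_{s\sim\mu_0}\left[\mathbb E\left[\sum_{k=0}^{H-1} r(s_k, a_k) \,\Big|\, s_0=s\right]\right] = \mathbb E_{s\sim\mu_0}\left[V_0^\pi(s)\right]$$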
Theorem (Bellman Optimality): \(\forall s\), the value function of an optimal policy satisfies $$V_t^{\star}(s) = \max_{a\in\mathcal A} \underbrace{\left[ r(s, a) + \mathbb{E}_{s' \sim P( s, a)} [V_{t+1}^\star(s')] \right]}_{Q_t^\star(s,a)}$$
with the shorthand \(Q_t^\star(s,a)\) for the bracketed state-action value
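A minimal sketch of the corresponding dynamic program (backwards induction on the Bellman Optimality equation), using the same assumed tabular array layout as the earlier sketches; the greedy argmax recovers an optimal policy.

```python
import numpy as np

def value_iteration(P, r, H):
    """Finite-horizon dynamic programming: compute V_t^* and a greedy policy.

    P : (S, A, S) array of transition probabilities
    r : (S, A) array of deterministic rewards
    Returns V of shape (H + 1, S) and pi of shape (H, S).
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V_H^*(s) = 0 for all s
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        # Q[s, a] = r(s, a) + E_{s' ~ P(s, a)}[ V_{t+1}^*(s') ]
        Q = r + P @ V[t + 1]            # shape (S, A)
        V[t] = Q.max(axis=1)            # V_t^*(s) = max_a Q_t^*(s, a)
        pi[t] = Q.argmax(axis=1)        # greedy (optimal) action at each state
    return V, pi
```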
1. Recap
2. Value Function
3. Optimal Policy
4. Dynamic Programming
[Figure: two-state example with states \(0\) and \(1\) and actions stay/switch; transition arrows labeled stay: \(1\), switch: \(1\); stay: \(1-p\), switch: \(1-2p\); stay: \(p\), switch: \(2p\)]