Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PSet 4 due today (Friday)
• 5789 Paper Reviews due weekly on Mondays
• PA 3/PSet 5 released next week
• My office hours cancelled on Wednesday 3/15 due to Prelim

## Prelim on 3/15 in Lecture

• Prelim Wednesday 3/15
• During lecture (2:45-4pm in 255 Olin)
• 1 hour exam, closed-book, equation sheet provided
• Materials:
• slides (Lectures 1-10, some of 11-13)
• PSets 1-4 (1-3 solutions on Canvas)
• Last-minute conflicts/accommodations? (EdStem)
• Monitoring Prelim tag on EdStem for questions

Outline:

1. MDP Definitions
2. Policies and Distributions
3. Value and Q function
4. Optimal Policies
5. Linear Optimal Control

## Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

## 1. MDP Definitions

• $$\mathcal{S}$$ states, $$\mathcal{A}$$ actions
• $$r$$ map from state, action to scalar reward
• $$P$$ transition probability to next state given current state and action (Markov assumption)
• $$\gamma$$ discount factor

Finite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$$

• $$\mathcal{S},\mathcal{A},r,P$$ same
• $$H$$ horizon
• $$\mu_0$$ initial state distribution

ex - Pac-Man as MDP

## 1. MDP Definitions

Optimal Control Problem

• continuous states/actions $$\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$$
• transitions are deterministic, described by a dynamics function
$$s'= f(s, a)$$

ex - UAV as OCP

## 2. Policies and Distributions

• Policy $$\pi$$ chooses an action based on the current state so $$a_t=a$$ with probability $$\pi(a|s_t)$$
• Shorthand for deterministic policy: $$a_t=\pi(s_t)$$

examples:

Policy results in a trajectory $$\tau = (s_0, a_0, s_1, a_1, ... )$$


## 2. Policies and Distributions


• Probability of trajectory $$\tau =(s_0, a_0, s_1, \dots, s_t, a_t)$$: $$\mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• Probability of $$s$$ at $$t$$ (marginalize over histories ending at $$s_t=s$$): $$\mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s)$$
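To make the trajectory distribution concrete, here is a minimal Python/numpy sketch (the two-state MDP, its numbers, and the function names are illustrative assumptions, not from the lecture) that samples a trajectory and evaluates the product formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action tabular MDP (illustrative numbers only).
P = np.array([[[1.0, 0.0], [0.5, 0.5]],   # P[s, a, s'] = P(s' | s, a)
              [[0.3, 0.7], [0.0, 1.0]]])
pi = np.array([[0.9, 0.1],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
mu0 = np.array([0.5, 0.5])                # initial state distribution mu_0

def sample_trajectory(T):
    """Sample tau = (s_0, a_0, ..., s_T, a_T)."""
    s = rng.choice(2, p=mu0)
    tau = []
    for _ in range(T + 1):
        a = rng.choice(2, p=pi[s])
        tau.append((s, a))
        s = rng.choice(2, p=P[s, a])
    return tau

def trajectory_prob(tau):
    """mu0(s0) pi(a0|s0) * prod_{i>=1} P(s_i | s_{i-1}, a_{i-1}) pi(a_i | s_i)."""
    s0, a0 = tau[0]
    p = mu0[s0] * pi[s0, a0]
    for (sp, ap), (s, a) in zip(tau, tau[1:]):
        p *= P[sp, ap, s] * pi[s, a]
    return p

tau = sample_trajectory(T=3)
print(tau, trajectory_prob(tau))
```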

## 2. Policies and Distributions


• Probability vector of $$s$$ at $$t$$: $$d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0)$$ evolves as $$d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi$$ where $$P_\pi$$ at row $$s$$ and column $$s'$$ is $$\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]$$
• Discounted "steady-state" distribution (PSet 2) $$d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$

$$d^\pi_{\mu_0} = (1-\gamma)\left(d_{\mu_0,0}^\pi + \gamma\, d_{\mu_0,1}^\pi + \gamma^2 d_{\mu_0,2}^\pi + \dots\right)$$

## State Evolution Example

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.

[Figure: two-state chain; state $$0$$ self-loops with probability $$1$$, state $$1$$ stays with probability $$p_1$$ and moves to state $$0$$ with probability $$1-p_1$$.]

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

• $$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}$$
• $$d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}$$
• $$d_2 = P_\pi^\top d_1 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

## State Distribution Transition

• How does state distribution change over time?
• Recall, $$s_{t+1}\sim P(s_t,\pi(s_t))$$
• i.e. $$s_{t+1} = s'$$ with probability $$P(s'|s_t, \pi(s_t))$$
• Write as a summation over possible $$s_t$$:
• $$\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}$$
• In vector notation:
• $$d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]$$
• $$d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle$$
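As a sanity check, here is a short numpy sketch (the values of $$p_1$$, $$\gamma$$, and the variable names are my choices) that iterates $$d_{t+1}=P_\pi^\top d_t$$ for the two-state example above and computes the discounted steady-state distribution in closed form:

```python
import numpy as np

p1, gamma = 0.9, 0.95
P_pi = np.array([[1.0,    0.0],   # row s, column s': P(s' | s, pi(s))
                 [1 - p1, p1 ]])
d = np.array([0.5, 0.5])          # mu_0: each state with probability 1/2

# Evolve d_{t+1} = P_pi^T d_t; matches the closed form [1 - p1^t/2, p1^t/2].
for t in range(1, 4):
    d = P_pi.T @ d
    assert np.allclose(d, [1 - p1**t / 2, p1**t / 2])

# Discounted "steady-state" distribution (1-gamma) sum_t gamma^t d_t,
# via the Neumann series: (1-gamma) (I - gamma P_pi^T)^{-1} d_0.
d0 = np.array([0.5, 0.5])
d_disc = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, d0)
print(d_disc, d_disc.sum())       # a valid distribution: sums to 1
```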

## 2. Policies and Distributions

Food for thought:

• How are these distributions different when:
• Transitions are different (Simulation Lemma)
• Policies are different (Performance Difference Lemma)
• Initial states are different


## 3. Value and Q function

• Evaluate policy by cumulative reward
• $$V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]$$
• $$Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]$$
• For finite horizon, for $$t=0,...H-1$$,
• $$V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]$$
• $$Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]$$

examples:

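For instance, a hedged Monte Carlo sketch of the discounted value definition above (the random tabular MDP, the rollout count, and the truncation horizon `T` are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]
pi = rng.dirichlet(np.ones(A), size=S)        # pi[s, a] = pi(a | s)

def mc_value(s0, n_rollouts=2000, T=100):
    """Estimate V^pi(s0) = E[sum_t gamma^t r(s_t,a_t) | s_0=s0] by rollouts,
    truncating the infinite sum at T (error <= gamma^T max|r| / (1-gamma))."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(T):
            a = rng.choice(A, p=pi[s])
            ret += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_rollouts

print(mc_value(0))
```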

## 3. Value and Q function

Recursive Bellman Expectation Equation:

• Discounted Infinite Horizon
•  $$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$
• $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$$
• Finite Horizon,  for $$t=0,\dots H-1$$,
• $$V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$$
• $$Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')]$$


Recall: Icy navigation (PSet 2, lecture example)

## 3. Value and Q function

• Recursive computation: $$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$
• Exact Policy Evaluation: $$V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}$$
• Iterative Policy Evaluation: $$V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i$$
• Converges: fixed point contraction
• Backwards-Iterative computation in finite horizon:
• Initialize $$V^{\pi}_H = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}$$
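The three procedures above as a minimal numpy sketch (the 3-state chain, reward vector, and iteration counts are made-up assumptions):

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain under a fixed policy pi.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])   # R^pi[s] = E_{a~pi(s)}[r(s, a)]

# Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative policy evaluation: V_{i+1} = R_pi + gamma P_pi V_i (a contraction).
V = np.zeros(3)
for _ in range(500):
    V = R_pi + gamma * P_pi @ V
assert np.allclose(V, V_exact, atol=1e-6)

# Finite-horizon, backwards-iterative (undiscounted): V_H = 0, V_t = R_pi + P_pi V_{t+1}.
H = 10
V_t = np.zeros(3)
for t in reversed(range(H)):
    V_t = R_pi + P_pi @ V_t
print(V_exact, V_t)
```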

## 4. Optimal Policies

• An optimal policy $$\pi^*$$ is one where $$V^{\pi^*}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• Equivalent condition: Bellman Optimality
• $$V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$$
• $$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$$
• Optimal policy $$\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$$

Recall: Verifying optimality in Icy Street example

Food for thought: What does Bellman Optimality imply about advantage function $$A^{\pi^*}(s,a)$$?

## 4. Optimal Policies

• Finite horizon: for $$t=0,\dots H-1$$,
• $$V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$$
• $$Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$$
• Optimal policy $$\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$$
• Solve exactly with Dynamic Programming
• Iterate backwards in time from $$V^*_{H}=0$$
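A sketch of this backwards dynamic program on a random hypothetical tabular MDP (sizes and names are my choices, not from the slides):

```python
import numpy as np

# Hypothetical tabular finite-horizon MDP: S states, A actions, horizon H.
S, A, H = 3, 2, 10
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]

V = np.zeros((H + 1, S))                      # V[H] = 0
pi_star = np.zeros((H, S), dtype=int)
for t in reversed(range(H)):
    Q = r + P @ V[t + 1]                      # Q[s, a] = r(s,a) + E_{s'}[V_{t+1}(s')]
    V[t] = Q.max(axis=1)
    pi_star[t] = Q.argmax(axis=1)             # optimal time-varying policy

print(V[0], pi_star[0])
```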

## 4. Optimal Policies

• Infinite horizon: algorithms for recursion in the Bellman Optimality equation
• Value Iteration
• Initialize $$V^0$$. For $$i=0,1,\dots$$,
• $$V^{i+1}(s) =\max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
• Policy Iteration
• Initialize $$\pi^0$$. For $$i=0,1,\dots$$,
• $$V^{i}=$$ PolicyEval($$\pi^i$$)
• $$\pi^{i+1}(s) = \argmax_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
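Both loops as a numpy sketch (same hypothetical tabular setup as earlier sketches; stopping rules simplified):

```python
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
r = rng.uniform(size=(S, A))

# Value Iteration: V <- max_a [ r + gamma E[V] ]; a gamma-contraction in sup norm.
V = np.zeros(S)
for _ in range(1000):
    V = (r + gamma * P @ V).max(axis=1)

# Policy Iteration: exact evaluation, then greedy improvement.
pi = np.zeros(S, dtype=int)
while True:
    P_pi, R_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # PolicyEval(pi)
    pi_new = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                                 # exact convergence in finite time
    pi = pi_new

assert np.allclose(V, V_pi, atol=1e-4)        # both reach V*
```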

## 4. Optimal Policies

• Value Iteration
• Converges: fixed point contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i - V^*\|_\infty$$
• Policy Iteration
• Monotone Improvement: $$V^{i+1}(s) \geq V^{i}(s)$$
• Contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i- V^*\|_\infty$$
• Converges to exactly optimal policy in finite time (PSet 3)

## 5. Linear Optimal Control

• Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
• Unrolled dynamics (PSet 3) $$s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
• Stability of uncontrolled $$s_{t+1}=As_t$$:
• stable if $$\max_i |\lambda_i(A)|< 1$$
• unstable if $$\max_i |\lambda_i(A)| > 1$$
• marginally unstable if $$\max_i |\lambda_i(A)|= 1$$

ex - UAV

Food for thought: relationship between stability and cumulative cost? (PSet 4)
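A quick numerical check of the stability conditions above (a sketch; the matrices, tolerance, and function name are illustrative):

```python
import numpy as np

def stability(A, tol=1e-9):
    """Classify s_{t+1} = A s_t by the spectral radius max_i |lambda_i(A)|."""
    rho = np.abs(np.linalg.eigvals(A)).max()
    if rho < 1 - tol:
        return "stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally unstable"

print(stability(np.array([[0.9, 0.1], [0.0, 0.5]])))   # stable
print(stability(np.array([[1.0, 1.0], [0.0, 1.0]])))   # marginally unstable
print(stability(np.array([[1.1, 0.0], [0.0, 0.2]])))   # unstable
```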

## 5. Linear Optimal Control

Finite Horizon LQR: Application of Dynamic Programming

• Initialize $$V^{*}_H(s) = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$Q^{*}_t(s,a) = c(s,a) + V^*_{t+1}(f(s,a))$$
• $$\pi_t^*(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)$$
• $$V^*_t(s) = Q^{*}_t(s,\pi_t^*(s))$$
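Specializing the recursion above to quadratic cost $$c(s,a) = s^\top Q s + a^\top R a$$ and linear dynamics $$f(s,a)=As+Ba$$ gives the backwards Riccati recursion; a numpy sketch under those assumptions (the particular $$A, B, Q, R, H$$ are mine):

```python
import numpy as np

# Hypothetical double-integrator-style system.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q, R, H = np.eye(2), np.eye(1), 50

# Backwards Riccati recursion: V*_t(s) = s^T P_t s, starting from P_H = 0.
P = np.zeros((2, 2))
gains = []
for t in reversed(range(H)):
    # Minimizing Q*_t(s,a) = s^T Q s + a^T R a + (As+Ba)^T P (As+Ba) over a
    # gives the linear policy a = K_t s.
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    gains.append(K)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K
gains = gains[::-1]                          # gains[t] is K_t

# Roll out the optimal policy: s_{t+1} = A s_t + B K_t s_t.
s = np.array([1.0, 0.0])
for t in range(H):
    s = A @ s + B @ (gains[t] @ s)
print(s)                                     # state regulated toward the origin
```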

Basis for approximation-based algorithms (local linearization and iLQR)

## Proof Strategies

1. Add and subtract: $$\|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\|$$
2. Contractions (induction) $$\|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
3. Additive induction $$\|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|$$
4. Basic Inequalities (PSet 1):
• $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|]$$
• $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)|$$
• $$\mathbb E[f(x)] \leq \max f(x)$$

## Test-taking Strategies

1. Move on if stuck!
2. Write explanations and show steps for partial credit
3. Multipart questions: can be done mostly independently
• ex: 1) show $$\|x_{t+1}\|\leq \gamma \|x_t\|$$; 2) give a bound on $$\|x_t\|$$ in terms of $$\|x_0\|$$
