CS 4/5789: Introduction to Reinforcement Learning
Lecture 16
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements
1. Review
2. Questions
Announcements
HW2 due Monday 3/28
5789 Paper Review Assignment (weekly pace suggested)
Today is the last day to drop
Prelim TOMORROW 3/22 at 7:30-9pm in Phillips 101
Closed-book, definition/equation sheet provided
Focus: mainly Unit 1 (known models), but many lectures in Unit 2 revisit key concepts
Study Materials: Lecture Notes 1-15, HW0&1
Prelim Exam
Outline:
- MDP Definitions
- Policies and Distributions
- Value and Q function
- Optimal Policies
- Linear Optimal Control
Review
Participation point: PollEV.com/sarahdean011
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
1. MDP Definitions
- \(\mathcal{S}\) states, \(\mathcal{A}\) actions
- \(r\) map from state, action to scalar reward
- \(P\) transition probability to next state given current state and action (Markov assumption)
- \(\gamma\) discount factor
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
- \(\mathcal{S},\mathcal{A},r,P\) same
- \(H\) horizon
- \(\mu_0\) initial distribution

ex - Pac-Man as MDP
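As a concrete (hypothetical) counterpart to these definitions, a small tabular MDP can be stored as NumPy arrays. The names below (P, r, mu0, etc.) are illustrative conventions, not notation from the lecture, and the later review sketches reuse them.

```python
import numpy as np

# Hypothetical tabular MDP with |S| = 3 states and |A| = 2 actions.
# P[s, a, s'] = probability of landing in s' after taking a in s (rows sum to 1).
# r[s, a]    = reward for taking action a in state s.
S, A, gamma, H = 3, 2, 0.9, 10
rng = np.random.default_rng(0)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)   # normalize so each P[s, a, :] is a distribution
r = rng.random((S, A))
mu0 = np.ones(S) / S                # uniform initial distribution
```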
1. MDP Definitions
Optimal Control Problem
- continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
- Cost instead of reward
- transitions \(P\) described in terms of dynamics function and disturbance \(w\sim \mathcal D\)
\(s'= f(s, a, w)\)
ex - UAV as OCP
2. Policies and Distributions
- A policy \(\pi\) chooses an action based on the current state, so \(a_t=a\) with probability \(\pi(a|s_t)\)
- Shorthand for deterministic policy: \(a_t=\pi(s_t)\)

examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
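A minimal rollout sketch, assuming the arrays from the earlier MDP sketch and a stochastic policy stored as pi[s, a] = \(\pi(a\mid s)\):

```python
import numpy as np

def rollout(P, pi, mu0, T, seed=0):
    """Sample a trajectory tau = (s_0, a_0, ..., s_T, a_T) under policy pi.

    P[s, a, s'] are transition probabilities, pi[s, a] = pi(a | s),
    and mu0[s] is the initial state distribution.
    """
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    s = rng.choice(S, p=mu0)                # s_0 ~ mu_0
    traj = []
    for _ in range(T + 1):
        a = rng.choice(A, p=pi[s])          # a_t ~ pi(. | s_t)
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])        # s_{t+1} ~ P(. | s_t, a_t)
    return traj
```

For example, `rollout(P, np.ones((S, A)) / A, mu0, T=5)` samples a length-6 trajectory under the uniformly random policy.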
2. Policies and Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
- Probability of \((s, a)\) at \(t\) $$ \mathbb{P}^\pi_t(s, a ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s, a_t = a) $$
- Discounted "steady-state" distribution $$ d^\pi_{\mu_0}(s, a) = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi_t(s, a; \mu_0) $$
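These distributions can also be computed exactly by a forward recursion over time, using the Markov structure rather than summing over trajectories. A sketch, with array conventions as above; truncating the discounted sum at T is an arbitrary choice:

```python
import numpy as np

def state_action_dist(P, pi, mu0, t):
    """P^pi_t(s, a): probability that (s_t, a_t) = (s, a) under policy pi[s, a]."""
    p_s = mu0.copy()                          # distribution of s_0
    for _ in range(t):
        p_sa = p_s[:, None] * pi              # joint over (s_k, a_k)
        p_s = np.einsum("sa,san->n", p_sa, P) # marginal of s_{k+1}
    return p_s[:, None] * pi

def discounted_occupancy(P, pi, mu0, gamma, T=1000):
    """Truncated sum for d^pi_{mu0}(s, a) = (1 - gamma) sum_t gamma^t P^pi_t(s, a)."""
    d = np.zeros_like(pi, dtype=float)
    p_s = mu0.copy()
    for t in range(T):
        p_sa = p_s[:, None] * pi
        d += (1 - gamma) * gamma**t * p_sa
        p_s = np.einsum("sa,san->n", p_sa, P)
    return d
```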
2. Policies and Distributions
Food for thought:
- How do these distributions change under two different policies \(\pi\) and \(\pi'\)? (HW2)
- How to write the distributions \(\mathbb{P}^\pi_t\) and \(d^\pi_{\mu_0}\) over the state only?
3. Value and Q function
- Evaluate policy by cumulative reward
- \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s]\)
- \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s, a_0=a]\)
- For finite horizon, for \(t=0,...H-1\),
- \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s]\)
- \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s, a_t=a]\)
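One way to connect these definitions to code is a Monte Carlo estimate: average truncated discounted returns over sampled rollouts. A sketch (the truncation horizon and rollout count below are arbitrary choices):

```python
import numpy as np

def mc_value(P, r, pi, gamma, s0, n_rollouts=1000, T=200, seed=0):
    """Estimate V^pi(s0) = E[sum_t gamma^t r_t | s_0 = s0] by truncated rollouts."""
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(T):
            a = rng.choice(A, p=pi[s])      # a_t ~ pi(. | s_t)
            ret += disc * r[s, a]           # accumulate gamma^t r_t
            disc *= gamma
            s = rng.choice(S, p=P[s, a])    # s_{t+1} ~ P(. | s_t, a_t)
        total += ret
    return total / n_rollouts
```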

examples:
3. Value and Q function
Recursive Bellman Expectation Equation:
- Discounted Infinite Horizon
- \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
- Finite Horizon, for \(t=0,\dots H-1\),
- \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
- \(Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)
Recall: Gardening MDP HW problem
3. Value and Q function
- Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi\)
- Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}\)
- Iterative Policy Evaluation: \(V^{\pi}_{t+1} = R^{\pi} + \gamma P^{\pi} V^\pi_t\)
- Backwards-Iterative computation in finite horizon:
- Initialize \(V^{\pi}_H = 0\)
- For \(t=H-1, H-2, ... 0\)
- \(V^{\pi}_t = R^{\pi} +P^{\pi} V^\pi_{t+1}\)
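A sketch of all three computations on a tabular MDP (array conventions as in the earlier sketches; the finite-horizon version assumes a stationary policy and reward for simplicity):

```python
import numpy as np

def policy_matrices(P, r, pi):
    """R^pi[s] = E_{a~pi(s)} r(s, a); P^pi[s, s'] = E_{a~pi(s)} P(s' | s, a)."""
    R_pi = (pi * r).sum(axis=1)
    P_pi = np.einsum("sa,san->sn", pi, P)
    return R_pi, P_pi

def exact_policy_eval(P, r, pi, gamma):
    """V^pi = (I - gamma P^pi)^{-1} R^pi via a linear solve."""
    R_pi, P_pi = policy_matrices(P, r, pi)
    S = R_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def iterative_policy_eval(P, r, pi, gamma, n_iter=500):
    """Fixed-point iteration V_{t+1} = R^pi + gamma P^pi V_t."""
    R_pi, P_pi = policy_matrices(P, r, pi)
    V = np.zeros_like(R_pi)
    for _ in range(n_iter):
        V = R_pi + gamma * P_pi @ V
    return V

def finite_horizon_eval(P, r, pi, H):
    """Backward recursion V^pi_t = R^pi + P^pi V^pi_{t+1}, starting from V^pi_H = 0."""
    R_pi, P_pi = policy_matrices(P, r, pi)
    V = np.zeros_like(R_pi)
    values = [V]
    for _ in range(H):
        V = R_pi + P_pi @ V
        values.append(V)
    return values[::-1]   # values[t] = V^pi_t
```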
4. Optimal Policies
- An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- Equivalent condition: Bellman Optimality
- \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
- \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
- Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)
Recall: Gardening MDP HW problem (verifying optimality)
Food for thought: What does Bellman Optimality imply about advantage function \(A^{\pi^*}(s,a)\)?
4. Optimal Policies
- Finite horizon, for \(t=0,\dots H-1\),
- \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
- \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
- Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
- Can directly solve with Dynamic Programming
- Iterate backwards in time from \(V^*_{H}=0\)
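A sketch of this backward dynamic program for a tabular MDP (a stationary reward r[s, a] is assumed; indexing conventions as in the earlier sketches):

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Compute Q*_t, V*_t, pi*_t by iterating backwards from V*_H = 0."""
    S, A, _ = P.shape
    V = np.zeros(S)                            # V*_H
    Qs, pis = [], []
    for t in range(H - 1, -1, -1):
        Q = r + np.einsum("san,n->sa", P, V)   # Q*_t(s,a) = r(s,a) + E[V*_{t+1}(s')]
        pi = Q.argmax(axis=1)                  # pi*_t(s) = argmax_a Q*_t(s, a)
        V = Q.max(axis=1)                      # V*_t(s)
        Qs.append(Q); pis.append(pi)
    return Qs[::-1], pis[::-1]                 # indexed by t = 0, ..., H-1
```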
4. Optimal Policies
- Infinite horizon: algorithms for recursion in the Bellman Optimality equation
- Value Iteration
- Initialize \(Q^0\). For \(t=0,1,\dots\),
- \(Q^{t+1}(s,a) =r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{t}(s', a') \right]\)
- Policy Iteration
- Initialize \(\pi^0\). For \(t=0,1,\dots\),
- \(Q^{t}= \) PolicyEval(\(\pi^t\))
- \(\pi^{t+1}(s) = \argmax_{a\in\mathcal A} Q^{t}(s,a)\)
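A minimal sketch of both loops on a tabular MDP, with PolicyEval instantiated as the exact linear solve from the policy evaluation slide (iteration counts are arbitrary choices):

```python
import numpy as np

def value_iteration(P, r, gamma, n_iter=500):
    """Q^{t+1}(s,a) = r(s,a) + gamma * E_{s'}[max_{a'} Q^t(s', a')]."""
    Q = np.zeros_like(r)
    for _ in range(n_iter):
        Q = r + gamma * np.einsum("san,n->sa", P, Q.max(axis=1))
    return Q

def policy_iteration(P, r, gamma, n_iter=50):
    """Alternate exact PolicyEval(pi^t) with the greedy update pi^{t+1}(s) = argmax_a Q^t(s, a)."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)                      # deterministic policy as action indices
    for _ in range(n_iter):
        pi_mat = np.eye(A)[pi]                       # one-hot pi(a | s)
        R_pi = (pi_mat * r).sum(axis=1)              # R^pi
        P_pi = np.einsum("sa,san->sn", pi_mat, P)    # P^pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # exact policy evaluation
        Q = r + gamma * np.einsum("san,n->sa", P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):               # greedy policy unchanged: converged
            break
        pi = new_pi
    return pi, Q
```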
4. Optimal Policies
- Value Iteration
- Fixed point iteration (like Iterative Policy Evaluation) from Bellman Q Optimality
- Contraction in Q: \(\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty\)
- Policy Iteration
- Monotone Improvement: \(Q^{t+1}(s,a) \geq Q^{t}(s,a)\)
- Contraction in V: \(\|V^{t+1} - V^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty\)
5. Linear Optimal Control
- Linear Dynamics: $$s_{t+1} = A s_t + Ba_t + w_t,\quad w_t\sim \mathcal N(0,\sigma^2 I)$$
- Unrolled dynamics $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k (Ba_{t-k-1} + w_{t-k-1})$$
- Stability of uncontrolled \(s_{t+1}=As_t\):
- stable if \(\rho(A)< 1\)
- unstable if \(\rho(A) > 1\)
- marginally stable or unstable if \(\rho(A) = 1\)
ex - UAV
Food for thought: What are dynamics, stability, value under linear policy \(a_t = K s_t\)?
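A small numerical sketch of these notions (the matrix \(A\) below is made up for illustration):

```python
import numpy as np

A = np.array([[1.05, 0.1],
              [0.0,  0.9]])                     # example dynamics matrix, not from the lecture
rho = np.abs(np.linalg.eigvals(A)).max()        # spectral radius rho(A)
print(f"rho(A) = {rho:.3f}")                    # > 1 here, so s_{t+1} = A s_t is unstable

# Simulate the disturbed, uncontrolled dynamics s_{t+1} = A s_t + w_t
rng = np.random.default_rng(0)
s = np.ones(2)
for t in range(50):
    s = A @ s + 0.01 * rng.standard_normal(2)
print(f"||s_50|| = {np.linalg.norm(s):.3f}")    # grows roughly like rho(A)^t when rho(A) > 1
```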
5. Linear Optimal Control
Finite Horizon LQR: Application of Dynamic Programming
- Initialize \(V^{*}_H(s) = 0\)
- For \(t=H-1, H-2, ... 0\)
- \(Q^{*}_t(s,a) = c(s,a) +\mathbb E_{s'\sim P(s,a)}[ V^*_{t+1}(s')]\)
- \(\pi^*_t(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)\)
- \(V^*_t(s) = Q^{*}_t(s,\pi^*_t(s)) \)
Basis for approximation-based algorithms (local linearization and iLQR)
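For quadratic cost \(c(s,a) = s^\top Q s + a^\top R a\) (here \(Q, R\) denote cost matrices, an assumed cost form, not the Q-function), this dynamic program reduces to the backward Riccati recursion; a sketch:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, H):
    """Backward Riccati recursion for cost sum_t s_t^T Q s_t + a_t^T R a_t.

    A is n x n, B is n x m. Returns gains K_t so the optimal policy is pi*_t(s) = K_t s.
    (Quadratic cost and V*_H = 0 are assumptions matching the DP above.)
    """
    n = A.shape[0]
    P = np.zeros((n, n))                            # V*_H(s) = s^T P s with P = 0
    gains = []
    for t in range(H - 1, -1, -1):
        BtP = B.T @ P
        K = -np.linalg.solve(R + BtP @ B, BtP @ A)  # K_t = -(R + B^T P B)^{-1} B^T P A
        P = Q + A.T @ P @ A + A.T @ P @ B @ K       # cost-to-go matrix for V*_t
        gains.append(K)
    return gains[::-1]                              # gains[t] = K_t
```

The optimal action at time \(t\) is then \(a_t = K_t s_t\); the Gaussian disturbance adds a constant to the optimal cost but does not change the gains.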
Proof Strategies
- Add and subtract: $$ \|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
- Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
- Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\| $$
- Basic Inequalities (HW0) $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$
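As a worked example, the three basic inequalities combine to give the value iteration contraction stated earlier: $$\begin{aligned} |Q^{t+1}(s,a) - Q^*(s,a)| &= \gamma \left|\mathbb E_{s'\sim P(s,a)}\left[\max_{a'\in\mathcal A} Q^{t}(s',a') - \max_{a'\in\mathcal A} Q^{*}(s',a')\right]\right| \\ &\leq \gamma\, \mathbb E_{s'\sim P(s,a)}\left[\left|\max_{a'\in\mathcal A} Q^{t}(s',a') - \max_{a'\in\mathcal A} Q^{*}(s',a')\right|\right] \\ &\leq \gamma\, \mathbb E_{s'\sim P(s,a)}\left[\max_{a'\in\mathcal A} \left|Q^{t}(s',a') - Q^{*}(s',a')\right|\right] \leq \gamma\, \|Q^{t} - Q^{*}\|_\infty, \end{aligned}$$ and taking the maximum over \((s,a)\) yields \(\|Q^{t+1}-Q^*\|_\infty \leq \gamma\|Q^t - Q^*\|_\infty\).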
Test-taking Strategies
- Move on if stuck!
- Write explanations and show steps for partial credit
- Multipart questions: can be done mostly independently
- ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\)
2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)
CS 4/5789: Lecture 16
By Sarah Dean