CS 4/5789: Introduction to Reinforcement Learning
Lecture 27: Final Review
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due
- Midterm corrections due
- Both accepted without penalty until "late deadline" on Gradescope
- Course evaluations open! Participation credit
- Office hours
- Mine 9:30-10:30am Tues and 11-noon Thurs
- TA office hours cancelled Wednesday onward
- Stay tuned for TA review/question session
Final Exam
- Final exam Saturday 5/13
- Location: 155 Olin
- Time: 2pm (until \(\approx\) 4pm)
- Length: 2 hours
- Cumulative and closed book
- equation sheet provided (posted by Wed)
- Materials: slides, PSets (solutions on Canvas)
- I will monitor Final tag on EdStem for questions
- also last-minute conflicts/accommodations
Outline:
- MDP Definitions
- Policies and Distributions
- Value and Q function
- Optimal Policies
- Linear Optimal Control
- Learned Models, Values, Policies
- Exploration
- Learning from Experts
Review
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
1. MDP Definitions
- \(\mathcal{S}\) states, \(\mathcal{A}\) actions
- \(r\) map from state, action to scalar reward
- \(P\) transition probability to next state given current state and action (Markov assumption)
- \(\gamma\) discount factor
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
- \(\mathcal{S},\mathcal{A},r,P\) same
- \(H\) horizon
- \(\mu_0\) initial state distribution

ex - Pac-Man as MDP
1. MDP Definitions
Optimal Control Problem
- continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
- Cost instead of reward
- transitions are deterministic and described in terms of dynamics function
\(s'= f(s, a)\)
ex - UAV as OCP
2. Policies and Distributions
- Policy \(\pi\) chooses an action based on the current state so \(a_t=a\) with probability \(\pi(a|s_t)\)
- Shorthand for deterministic policy: \(a_t=\pi(s_t)\)

examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
2. Policies and Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
- Probability of \(s\) at \(t\) $$ \mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s) $$
2. Policies and Distributions
- Probability vector of \(s\) at \(t\): \(d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0) \) evolves as $$ d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi $$ where \(P_\pi\) at row \(s\) and column \(s'\) is \(\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]\)
- Discounted "steady-state" distribution (PSet 2) $$ d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$
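As a sanity check, here is a minimal numpy sketch (with a made-up 3-state \(P_\pi\) and \(\mu_0\), not course code) computing the discounted state distribution both in closed form and by truncating the series above.

```python
import numpy as np

gamma = 0.9
mu0 = np.array([1.0, 0.0, 0.0])             # initial state distribution mu_0
P_pi = np.array([[0.5, 0.5, 0.0],           # P_pi[s, s'] = E_{a~pi(s)}[P(s'|s,a)]
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])

# Closed form: d = (1-gamma) sum_t gamma^t (P_pi^T)^t mu0 = (1-gamma)(I - gamma P_pi^T)^{-1} mu0
d_closed = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, mu0)

# Truncated series, using the evolution d_{t+1} = P_pi^T d_t
d_t, d_sum = mu0.copy(), np.zeros(3)
for t in range(1000):
    d_sum += (1 - gamma) * gamma**t * d_t
    d_t = P_pi.T @ d_t

print(d_closed, d_sum)                      # the two should agree, and each sums to 1
```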
2. Policies and Distributions
Food for thought:
- How are these distributions different when
- Initial states are different
- Policies are different (Performance Difference Lemma)
- Transitions are different (Simulation Lemma)
3. Value and Q function
- Evaluate policy by cumulative reward
- \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]\)
- \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]\)
- Finite horizon: for \(t=0,\dots,H-1\),
- \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]\)
- \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]\)

3. Value and Q function
Recursive Bellman Expectation Equation:
- Discounted Infinite Horizon
- \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
- Finite Horizon, for \(t=0,\dots H-1\),
- \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
- \(Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)
Recall: Icy navigation (PSet 2, lecture example), Prelim question
3. Value and Q function
- Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
- Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
- Iterative Policy Evaluation: \(V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i\)
- Converges: fixed point contraction
- Backwards-Iterative computation in finite horizon:
- Initialize \(V^{\pi}_H = 0\)
- For \(t=H-1, H-2, ... 0\)
- \(V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}\)
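A minimal numpy sketch of exact and iterative policy evaluation on an assumed toy 3-state chain (the numbers are illustrative, not from the course):

```python
import numpy as np

gamma = 0.9
R_pi = np.array([1.0, 0.0, 0.5])            # R_pi[s] = E_{a~pi(s)}[r(s,a)]
P_pi = np.array([[0.9, 0.1, 0.0],           # P_pi[s, s'] = E_{a~pi(s)}[P(s'|s,a)]
                 [0.2, 0.7, 0.1],
                 [0.0, 0.3, 0.7]])

# Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative policy evaluation: V_{i+1} = R_pi + gamma P_pi V_i (a contraction)
V = np.zeros(3)
for _ in range(500):
    V = R_pi + gamma * P_pi @ V

print(V_exact, V)                           # iterative estimate converges to the exact solution
```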
4. Optimal Policies
- An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- Equivalent condition: Bellman Optimality
- \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
- \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
- Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)
Recall: Verifying optimality in Icy Street example, Prelim
4. Optimal Policies
- \(0\) cost for \(a_0\)
- \(2\epsilon\) cost for \(a_1\)
- \(\epsilon\) reward in \(s_0\)
- \(1\) reward in \(s_1\)
- \(\gamma\) discount
Food for thought: rigorous argument for optimal policy?
4. Optimal Policies
- Finite horizon: for \(t=0,\dots H-1\),
- \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
- \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
- Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
- Solve exactly with Dynamic Programming
- Iterate backwards in time from \(V^*_{H}=0\)
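A minimal sketch of the backward dynamic program on a made-up 2-state, 2-action MDP (illustrative numbers only):

```python
import numpy as np

H = 5
r = np.array([[0.0, 1.0],      # r[s, a]
              [0.5, 0.0]])
P = np.array([[[0.8, 0.2],     # P[s, a, s'] = P(s' | s, a)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.9, 0.1]]])

V = np.zeros(2)                          # V*_H = 0
pi_star = np.zeros((H, 2), dtype=int)    # pi*_t(s)
for t in range(H - 1, -1, -1):
    Q = r + P @ V                        # Q*_t[s,a] = r(s,a) + E_{s'~P(s,a)}[V*_{t+1}(s')]
    pi_star[t] = Q.argmax(axis=1)        # greedy policy pi*_t
    V = Q.max(axis=1)                    # V*_t

print(V, pi_star)
```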
4. Optimal Policies
- Infinite horizon: iterative algorithms based on the recursion in the Bellman Optimality equation
- Value Iteration
- Initialize \(V^0\). For \(i=0,1,\dots\),
- \(V^{i+1}(s) =\max_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right]\)
- Policy Iteration
- Initialize \(\pi^0\). For \(i=0,1,\dots\),
- \(V^{i}= \) PolicyEval(\(\pi^i\))
- \(\pi^{i+1}(s) = \argmax_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right] \)
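A minimal sketch of both algorithms on the same made-up 2-state MDP as above; the two value estimates should agree up to iteration error:

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0], [0.5, 0.0]])                  # r[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])                # P[s, a, s']

# Value Iteration: V^{i+1}(s) = max_a [r(s,a) + gamma E_{s'}[V^i(s')]]
V = np.zeros(2)
for _ in range(1000):
    V = (r + gamma * P @ V).max(axis=1)

# Policy Iteration: exact evaluation, then greedy improvement
pi = np.zeros(2, dtype=int)
for _ in range(20):
    R_pi = r[np.arange(2), pi]                          # reward under pi
    P_pi = P[np.arange(2), pi]                          # transition matrix under pi
    V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)   # PolicyEval(pi)
    pi = (r + gamma * P @ V_pi).argmax(axis=1)          # greedy improvement

print(V, V_pi, pi)
```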
5. Linear Optimal Control
- Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
- Unrolled dynamics (PSet 3) $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
- Stability of \(s_{t+1}=As_t\):
- stable if \(\max_i |\lambda_i(A)|< 1\)
- unstable if \(\max_i |\lambda_i(A)| > 1\)
- marginally unstable if \(\max_i |\lambda_i(A)|= 1\)
ex - UAV
Recall: PSet 4 and Prelim question about cumulative cost and stability
5. Linear Optimal Control
- Finite Horizon LQR: Application of Dynamic Programming $$c(s,a) = s^\top Qs + a^\top Ra \implies \pi^\star_t(s) = K_t s,~~V^\star_t(s) = s^\top P_t s$$
- Basis for approximation-based algorithms (local linearization and iLQR)
- Food for thought:
- What is \(V^\pi\) for a non-optimal linear policy?
- What are policy/value for infinite horizon discounted LQR? (recursive use of Bellman equations)
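A minimal sketch of the finite-horizon LQR backward (Riccati) recursion with assumed \(A, B, Q, R\); the terminal condition \(P_H = Q\) is an illustrative choice:

```python
import numpy as np

H = 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)            # state cost s^T Q s
R = 0.1 * np.eye(1)      # action cost a^T R a

P_t = Q.copy()           # terminal cost-to-go matrix P_H (assumed)
gains = []
for t in range(H - 1, -1, -1):
    # pi*_t(s) = K_t s with K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
    K_t = -np.linalg.solve(R + B.T @ P_t @ B, B.T @ P_t @ A)
    # Riccati update for V*_t(s) = s^T P_t s
    P_t = Q + A.T @ P_t @ (A + B @ K_t)
    gains.append(K_t)
gains.reverse()

print(gains[0], P_t)     # time-varying gain at t=0 and cost-to-go matrix P_0
```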
6. Learning from data
- What do we want to learn? \(\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}\)
- Unknown transitions \(P(s'|s,a)\) or reward function \(r(s,a)\)
- Value/Q function
- of policy \(V^\pi(s)\) or \(Q^\pi(s,a)\)
- optimal \(V^*(s)\) or \(Q^*(s,a)\)
- Optimal Policy \(\pi^*(s)\)
- Given a dataset with features \(x_i\) and labels \(y_i\), model:
- Via counting: \(\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\} \)
- Function approx: \(\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2 \)
- e.g. closed-form linear regression solution
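A minimal sketch contrasting the two estimators on synthetic data (the data-generating process is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Counting estimate: average the labels seen at each discrete feature value
x_discrete = rng.integers(0, 3, size=100)
y_discrete = x_discrete + rng.normal(0, 0.1, size=100)
f_hat = {x: y_discrete[x_discrete == x].mean() for x in range(3)}

# Function approximation: closed-form linear regression f(x) = w^T x
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(0, 0.1, size=100)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # argmin_w (1/N) sum (w^T x_i - y_i)^2

print(f_hat, w_hat)
```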
6. Learning Models
Model-Based RL
- Features are \((s,a)\) and label is \(s'\)
- Data collection
- Query model: uniform exploration
- Active exploration with reward bonus
- Tabular setting: \(\widehat P\) via counting
- Simulation Lemma
- Translate error in \(\widehat P\) vs \(P\) into difference in performance \(\widehat V\) vs \(V\)
6. Learning Value/Q
- Features are \((s_i,a_i)\)
- \((s_i,a_i) = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}\)
- Labels constructed as:
- Rollout based (MC): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
- Bellman Exp based (TD): \(y_i =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) \)
- Bellman Opt based (TD): \(y_i =r_t + \gamma \max_a \widehat Q(s_{t+1},a) \)
- On vs. off policy with importance weighting (Recall PSet)
- \(\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2\)
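A minimal sketch of fitted Q evaluation with Bellman-expectation (TD) targets on synthetic transitions; the one-hot feature map and the data-generating process are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_S, n_A = 0.9, 4, 2

def phi(s, a):
    """Hypothetical one-hot feature map for tabular (s, a) pairs."""
    x = np.zeros(n_S * n_A)
    x[s * n_A + a] = 1.0
    return x

# Synthetic batch of transitions (s_i, a_i, r_i, s'_i, a'_i); in practice these
# would come from rollouts of pi, here they are randomly generated for illustration
data = []
for _ in range(200):
    s, a = rng.integers(n_S), rng.integers(n_A)
    data.append((s, a, float(s == 0), rng.integers(n_S), rng.integers(n_A)))

Q_hat = np.zeros(n_S * n_A)                      # linear Q: Q(s,a) = phi(s,a)^T Q_hat
for _ in range(50):                              # refit with refreshed TD targets
    X = np.stack([phi(s, a) for s, a, r, sn, an in data])
    y = np.array([r + gamma * phi(sn, an) @ Q_hat for s, a, r, sn, an in data])
    Q_hat = np.linalg.lstsq(X, y, rcond=None)[0] # argmin (1/N) sum (Q(s_i,a_i) - y_i)^2

print(Q_hat.reshape(n_S, n_A))
```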
6. Learning Value/Q
(Rollout diagram: sample start index \(h_1=h\) w.p. \(\propto \gamma^h\); then \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\), ...)
Value-based RL
For \(i=0,1,\dots\):
- Rollout \(\pi^i\) and construct dataset
- Learn \(\widehat Q^{i+1}\) from data
- Update \(\pi^{i+1}\) greedy, incremental, or \(\epsilon\) greedy
- Approximate vs. Conservative Policy Iteration
- Greedy improvement \(\rightarrow\) oscillations, incremental is more stable
- Performance Difference Lemma
6. Policy Optimization
- \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
- Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
- Random Search: \(\theta \pm \delta v\) , \(g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v\)
- REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
- Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\), \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s)) \)
Food for thought: how to compute off-policy gradient estimate?
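A minimal sketch of the REINFORCE estimator on a toy single-state problem with two actions and assumed per-action rewards:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # softmax policy over 2 actions
H, n_rollouts = 10, 500
reward = np.array([0.0, 1.0])            # assumed per-action reward

def grad_log_pi(theta, a):
    p = np.exp(theta) / np.exp(theta).sum()
    g = -p
    g[a] += 1.0                          # grad_theta log softmax(theta)[a]
    return g

g_hat = np.zeros(2)
for _ in range(n_rollouts):
    p = np.exp(theta) / np.exp(theta).sum()
    actions = rng.choice(2, size=H, p=p)                 # rollout tau under pi_theta
    R_tau = reward[actions].sum()                        # observed reward R(tau)
    g_hat += sum(grad_log_pi(theta, a) for a in actions) * R_tau
g_hat /= n_rollouts

print(g_hat)    # points towards increasing the probability of the better action
```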
6. Policy Optimization
- Policy Gradient Meta-Algorithm
For \(i=0,1,\dots\):
- collect rollouts using \(\theta_i\)
- estimate gradient with \(g_i\)
- \(\theta_{i+1} = \theta_i+ \alpha g_i\)
- Trust regions and Natural PG
- Trust region problem: \( \max_\theta ~J(\theta) ~~\text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta \)
- Local approximation: \( \max_\theta ~\nabla J(\theta_0)^\top(\theta-\theta_0) ~~\text{s.t.}~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta\)
- Natural PG update: \(\theta_{i+1} = \theta_i + \alpha F^{-1}_{i} g_i\)
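A minimal sketch of one natural PG step with an assumed Fisher estimate \(F_i\) and gradient estimate \(g_i\), computing \(F_i^{-1}g_i\) via a linear solve rather than an explicit inverse:

```python
import numpy as np

F_i = np.array([[2.0, 0.3],
                [0.3, 1.0]])        # assumed Fisher information estimate F_{theta_i}
g_i = np.array([0.5, -1.2])         # assumed policy gradient estimate
theta_i = np.zeros(2)
alpha = 0.1

# natural policy gradient step: theta <- theta + alpha * F^{-1} g
theta_next = theta_i + alpha * np.linalg.solve(F_i, g_i)
print(theta_next)
```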
7. Exploration
- Multi-Arm and Contextual Bandits: MDP with no transitions!
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
- Linear regret: pure random, pure greedy, \(\epsilon\) greedy (PSet)
- Sublinear regret: Explore-then-commit, UCB, LinUCB
- Greedy \( \arg\max_a \widehat \mu^a_t\) vs UCB \( \arg\max_a \widehat \mu^a_t + \sqrt{C/N^a_t}\)
Food for thought: performance/regret of softmax policy?
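A minimal sketch of UCB arm selection on assumed Bernoulli arms; the bonus follows the \(\sqrt{C/N^a_t}\) form above, with \(C\) chosen to absorb the usual log factor:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])     # assumed Bernoulli arm means
T = 5000
C = 2.0 * np.log(T)                        # confidence constant (absorbs the log factor)

counts = np.ones(3)                        # pull each arm once to initialize (simplification)
sums = np.array([float(rng.random() < m) for m in true_means])

for t in range(T):
    mu_hat = sums / counts                           # empirical means
    a = np.argmax(mu_hat + np.sqrt(C / counts))      # UCB: argmax mu_hat^a + sqrt(C/N^a)
    # a = np.argmax(mu_hat)                          # greedy alternative (can get stuck)
    sums[a] += rng.random() < true_means[a]
    counts[a] += 1

print(counts)   # most pulls go to the best arm, while every arm keeps being explored
```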

8. Learning From Experts
Imitation Learning with BC
Food for thought: Expert in LQR setting? (Linear regression)
(Diagram: dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) of expert state-action pairs \(\to\) supervised learning \(\to\) policy \(\pi\))
8. Learning From Experts
Imitation Learning with DAgger
(Diagram: DAgger loop)
- Supervised learning on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) yields policy \(\pi\)
- Execute \(\pi\) to collect states \(s_0, s_1, s_2, \dots\)
- Query expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\)
- Aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\) and repeat
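A minimal sketch of the DAgger loop on a toy 1-D system with a hypothetical linear expert; "supervised learning" here is a one-parameter least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):
    """Hypothetical expert policy pi*(s) for the toy problem."""
    return -0.5 * s

def rollout(k, s0=5.0, H=20):
    """Execute the current linear policy a = k*s on toy dynamics s' = s + a + noise."""
    states, s = [], s0
    for _ in range(H):
        states.append(s)
        s = s + k * s + rng.normal(0, 0.1)
    return np.array(states)

k_hat = 0.0                                      # initial (poor) linear policy
X, Y = np.array([]), np.array([])                # aggregated dataset
for _ in range(10):                              # DAgger iterations
    states = rollout(k_hat)                      # execute learned policy
    X = np.concatenate([X, states])              # aggregate visited states ...
    Y = np.concatenate([Y, expert(states)])      # ... with queried expert labels
    k_hat = np.dot(X, Y) / np.dot(X, X)          # supervised learning: least-squares fit

print(k_hat)                                     # approaches the expert gain -0.5
```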
8. Learning From Experts
- Inverse RL: Principle of maximum entropy
- Max-Ent IRL with Soft-VI (entropy regularized)
- replace \(\max\) with softmax
- Lagrange formulation of constrained optimization
- \(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
- \(\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w\cdot g(x)\) equivalent
- Iterative algorithm or solve \(\nabla [f(x) + w^\top g(x)] = 0\)
- Max-Ent principle: maximize \(\mathsf{Ent}(\pi)\) s.t. \(\pi\) consistent with expert data
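A minimal sketch of soft value iteration on the same made-up 2-state MDP used earlier, replacing the hard max with a log-sum-exp (softmax) backup:

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0], [0.5, 0.0]])                  # r[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])                # P[s, a, s']

V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V                               # soft Bellman backup
    V = np.log(np.exp(Q).sum(axis=1))                   # softmax (log-sum-exp) replaces max
pi_soft = np.exp(Q - V[:, None])                        # resulting stochastic policy

print(V, pi_soft)     # rows of pi_soft are valid probability distributions over actions
```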
Proof Strategies
- Add and subtract: $$ \|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
- Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
- Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\| $$
- Basic Inequalities (PSet 1): $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$
Test-taking Strategies
- Move on if stuck!
- Write explanations and show steps for partial credit
- Multipart questions: can be done mostly independently
- ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\)
2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)
Questions?