CS 4/5789: Introduction to Reinforcement Learning
Lecture 27: Final Review
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework
- 5789 Paper Reviews due
- Midterm corrections due
- Both accepted without penalty until "late deadline" on Gradescope
- Course evaluations open! Participation credit
- Office hours
- Mine 9:30-10:30am Tues and 11-noon Thurs
- TA office hours cancelled Wednesday onward
- Stay tuned for TA review/question session
Final Exam
- Final exam Saturday 5/13
- Location: 155 Olin
- Time: 2pm (until \(\approx\) 4pm)
- Length: 2 hours
- Cumulative and closed book
- equation sheet provided (posted by Wed)
- Materials: slides, PSets (solutions on Canvas)
- I will monitor Final tag on EdStem for questions
- also last-minute conflicts/accommodations
Outline:
- MDP Definitions
- Policies and Distributions
- Value and Q function
- Optimal Policies
- Linear Optimal Control
- Learned Models, Values, Policies
- Exploration
- Learning from Experts
Review
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
1. MDP Definitions
- \(\mathcal{S}\) states, \(\mathcal{A}\) actions
- \(r\) map from state, action to scalar reward
- \(P\) transition probability to next state given current state and action (Markov assumption)
- \(\gamma\) discount factor
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
- \(\mathcal{S},\mathcal{A},r,P\) same
- \(H\) horizon
- \(\mu_0\) initial state distribution

ex - Pac-Man as MDP
1. MDP Definitions
Optimal Control Problem
- continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
- Cost instead of reward
- transitions are deterministic and described in terms of dynamics function
\(s'= f(s, a)\)
ex - UAV as OCP
2. Policies and Distributions
- Policy \(\pi\) chooses an action based on the current state so \(a_t=a\) with probability \(\pi(a|s_t)\)
- Shorthand for deterministic policy: \(a_t=\pi(s_t)\)

examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
2. Policies and Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
- Probability of \(s\) at \(t\) $$ \mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s) $$
2. Policies and Distributions
- Probability vector of \(s\) at \(t\): \(d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0) \) evolves as $$ d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi $$ where \(P_\pi\) at row \(s\) and column \(s'\) is \(\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]\)
- Discounted "steady-state" distribution (PSet 2) $$ d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$
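As a sanity check, here is a minimal numpy sketch (with a made-up 3-state \(P_\pi\) and \(\mu_0\), not course code) computing the discounted state distribution both in closed form and by truncating the series above.

```python
import numpy as np

gamma = 0.9
mu0 = np.array([1.0, 0.0, 0.0])             # initial state distribution mu_0
P_pi = np.array([[0.5, 0.5, 0.0],           # P_pi[s, s'] = E_{a~pi(s)}[P(s'|s,a)]
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])

# Closed form: d = (1-gamma) sum_t gamma^t (P_pi^T)^t mu0 = (1-gamma)(I - gamma P_pi^T)^{-1} mu0
d_closed = (1 - gamma) * np.linalg.solve(np.eye(3) - gamma * P_pi.T, mu0)

# Truncated series, using the evolution d_{t+1} = P_pi^T d_t
d_t, d_sum = mu0.copy(), np.zeros(3)
for t in range(1000):
    d_sum += (1 - gamma) * gamma**t * d_t
    d_t = P_pi.T @ d_t

print(d_closed, d_sum)                      # the two should agree, and each sums to 1
```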
2. Policies and Distributions
Food for thought:
- How are these distributions different when
- Initial states are different
- Policies are different (Performance Difference Lemma)
- Transitions are different (Simulation Lemma)
3. Value and Q function
- Evaluate policy by cumulative reward
- \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]\)
- \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]\)
- Finite horizon: for \(t=0,\dots,H-1\),
- \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]\)
- \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]\)

3. Value and Q function
Recursive Bellman Expectation Equation:
- Discounted Infinite Horizon
- \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
- Finite Horizon, for \(t=0,\dots H-1\),
- \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
- \(Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)
Recall: Icy navigation (PSet 2, lecture example), Prelim question
3. Value and Q function
- Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
- Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
- Iterative Policy Evaluation: \(V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i\)
- Converges: fixed point contraction
- Backwards-Iterative computation in finite horizon:
- Initialize \(V^{\pi}_H = 0\)
- For \(t=H-1, H-2, ... 0\)
- \(V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}\)
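A minimal numpy sketch of exact and iterative policy evaluation on an assumed toy 3-state chain (the numbers are illustrative, not from the course):

```python
import numpy as np

gamma = 0.9
R_pi = np.array([1.0, 0.0, 0.5])            # R_pi[s] = E_{a~pi(s)}[r(s,a)]
P_pi = np.array([[0.9, 0.1, 0.0],           # P_pi[s, s'] = E_{a~pi(s)}[P(s'|s,a)]
                 [0.2, 0.7, 0.1],
                 [0.0, 0.3, 0.7]])

# Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative policy evaluation: V_{i+1} = R_pi + gamma P_pi V_i (a contraction)
V = np.zeros(3)
for _ in range(500):
    V = R_pi + gamma * P_pi @ V

print(V_exact, V)                           # iterative estimate converges to the exact solution
```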
4. Optimal Policies
- An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- Equivalent condition: Bellman Optimality
- \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
- \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
- Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)
Recall: Verifying optimality in Icy Street example, Prelim
4. Optimal Policies
- \(0\) cost for \(a_0\)
- \(2\epsilon\) cost for \(a_1\)
- \(\epsilon\) reward in \(s_0\)
- \(1\) reward in \(s_1\)
- \(\gamma\) discount
Food for thought: rigorous argument for optimal policy?
4. Optimal Policies
- Finite horizon: for \(t=0,\dots H-1\),
- \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
- \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
- Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
- Solve exactly with Dynamic Programming
- Iterate backwards in time from \(V^*_{H}=0\)
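A minimal sketch of the backward dynamic program on a made-up 2-state, 2-action MDP (illustrative numbers only):

```python
import numpy as np

H = 5
r = np.array([[0.0, 1.0],      # r[s, a]
              [0.5, 0.0]])
P = np.array([[[0.8, 0.2],     # P[s, a, s'] = P(s' | s, a)
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.9, 0.1]]])

V = np.zeros(2)                          # V*_H = 0
pi_star = np.zeros((H, 2), dtype=int)    # pi*_t(s)
for t in range(H - 1, -1, -1):
    Q = r + P @ V                        # Q*_t[s,a] = r(s,a) + E_{s'~P(s,a)}[V*_{t+1}(s')]
    pi_star[t] = Q.argmax(axis=1)        # greedy policy pi*_t
    V = Q.max(axis=1)                    # V*_t

print(V, pi_star)
```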
4. Optimal Policies
- Infinite horizon: iterative algorithms based on the recursion in the Bellman Optimality equation
- Value Iteration
- Initialize \(V^0\). For \(i=0,1,\dots\),
- \(V^{i+1}(s) =\max_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right]\)
- Policy Iteration
- Initialize \(\pi^0\). For \(i=0,1,\dots\),
- \(V^{i}= \) PolicyEval(\(\pi^i\))
- \(\pi^{i+1}(s) = \argmax_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right] \)
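A minimal sketch of both algorithms on the same made-up 2-state MDP as above; the two value estimates should agree up to iteration error:

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0], [0.5, 0.0]])                  # r[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])                # P[s, a, s']

# Value Iteration: V^{i+1}(s) = max_a [r(s,a) + gamma E_{s'}[V^i(s')]]
V = np.zeros(2)
for _ in range(1000):
    V = (r + gamma * P @ V).max(axis=1)

# Policy Iteration: exact evaluation, then greedy improvement
pi = np.zeros(2, dtype=int)
for _ in range(20):
    R_pi = r[np.arange(2), pi]                          # reward under pi
    P_pi = P[np.arange(2), pi]                          # transition matrix under pi
    V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)   # PolicyEval(pi)
    pi = (r + gamma * P @ V_pi).argmax(axis=1)          # greedy improvement

print(V, V_pi, pi)
```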
5. Linear Optimal Control
- Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
- Unrolled dynamics (PSet 3) $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
- Stability of \(s_{t+1}=As_t\):
- stable if \(\max_i |\lambda_i(A)|< 1\)
- unstable if \(\max_i |\lambda_i(A)| > 1\)
- marginally unstable if \(\max_i |\lambda_i(A)|= 1\)
ex - UAV
Recall: PSet 4 and Prelim question about cumulative cost and stability
5. Linear Optimal Control
- Finite Horizon LQR: Application of Dynamic Programming $$c(s,a) = s^\top Qs + a^\top Ra \implies \pi^\star_t(s) = K_t s,~~V^\star_t(s) = s^\top P_t s$$
- Basis for approximation-based algorithms (local linearization and iLQR)
- Food for thought:
- What is \(V^\pi\) for a non-optimal linear policy?
- What are policy/value for infinite horizon discounted LQR? (recursive use of Bellman equations)
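A minimal sketch of the finite-horizon LQR backward (Riccati) recursion with assumed \(A, B, Q, R\); the terminal condition \(P_H = Q\) is an illustrative choice:

```python
import numpy as np

H = 20
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)            # state cost s^T Q s
R = 0.1 * np.eye(1)      # action cost a^T R a

P_t = Q.copy()           # terminal cost-to-go matrix P_H (assumed)
gains = []
for t in range(H - 1, -1, -1):
    # pi*_t(s) = K_t s with K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
    K_t = -np.linalg.solve(R + B.T @ P_t @ B, B.T @ P_t @ A)
    # Riccati update for V*_t(s) = s^T P_t s
    P_t = Q + A.T @ P_t @ (A + B @ K_t)
    gains.append(K_t)
gains.reverse()

print(gains[0], P_t)     # time-varying gain at t=0 and cost-to-go matrix P_0
```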
6. Learning from data
- What do we want to learn? \(\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}\)
- Unknown transitions \(P(s'|s,a)\) or reward function \(r(s,a)\)
- Value/Q function
- of policy \(V^\pi(s)\) or \(Q^\pi(s,a)\)
- optimal \(V^*(s)\) or \(Q^*(s,a)\)
- Optimal Policy \(\pi^*(s)\)
- Given a dataset with features \(x_i\) and labels \(y_i\), model:
- Via counting: \(\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\} \)
- Function approx: \(\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2 \)
- e.g. closed-form linear regression solution
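A minimal sketch contrasting the two estimators on synthetic data (the data-generating process is assumed for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Counting estimate: average the labels seen at each discrete feature value
x_discrete = rng.integers(0, 3, size=100)
y_discrete = x_discrete + rng.normal(0, 0.1, size=100)
f_hat = {x: y_discrete[x_discrete == x].mean() for x in range(3)}

# Function approximation: closed-form linear regression f(x) = w^T x
X = rng.normal(size=(100, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true + rng.normal(0, 0.1, size=100)
w_hat = np.linalg.lstsq(X, y, rcond=None)[0]   # argmin_w (1/N) sum (w^T x_i - y_i)^2

print(f_hat, w_hat)
```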
6. Learning Models
Model-Based RL
- Features are \((s,a)\) and label is \(s'\)
- Data collection
- Query model: uniform exploration
- Active exploration with reward bonus
- Tabular setting: \(\widehat P\) via counting
- Simulation Lemma
- Translate error in \(\widehat P\) vs \(P\) into difference in performance \(\widehat V\) vs \(V\)
6. Learning Value/Q
- Features are \((s_i,a_i)\)
- \((s_i,a_i) = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}\)
- Labels constructed as:
- Rollout based (MC): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
- Bellman Exp based (TD): \(y_i =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) \)
- Bellman Opt based (TD): \(y_i =r_t + \gamma \max_a \widehat Q(s_{t+1},a) \)
- On vs. off policy with importance weighting (Recall PSet)
- \(\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2\)
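A minimal sketch of fitted Q evaluation with Bellman-expectation (TD) targets on synthetic transitions; the one-hot feature map and the data-generating process are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, n_S, n_A = 0.9, 4, 2

def phi(s, a):
    """Hypothetical one-hot feature map for tabular (s, a) pairs."""
    x = np.zeros(n_S * n_A)
    x[s * n_A + a] = 1.0
    return x

# Synthetic batch of transitions (s_i, a_i, r_i, s'_i, a'_i); in practice these
# would come from rollouts of pi, here they are randomly generated for illustration
data = []
for _ in range(200):
    s, a = rng.integers(n_S), rng.integers(n_A)
    data.append((s, a, float(s == 0), rng.integers(n_S), rng.integers(n_A)))

Q_hat = np.zeros(n_S * n_A)                      # linear Q: Q(s,a) = phi(s,a)^T Q_hat
for _ in range(50):                              # refit with refreshed TD targets
    X = np.stack([phi(s, a) for s, a, r, sn, an in data])
    y = np.array([r + gamma * phi(sn, an) @ Q_hat for s, a, r, sn, an in data])
    Q_hat = np.linalg.lstsq(X, y, rcond=None)[0] # argmin (1/N) sum (Q(s_i,a_i) - y_i)^2

print(Q_hat.reshape(n_S, n_A))
```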
6. Learning Value/Q
(Rollout diagram: sample start index \(h_1=h\) w.p. \(\propto \gamma^h\); then \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\), ...)
Value-based RL
For \(i=0,1,\dots\):
- Rollout \(\pi^i\) and construct dataset
- Learn \(\widehat Q^{i+1}\) from data
- Update \(\pi^{i+1}\) greedy, incremental, or \(\epsilon\) greedy
- Approximate vs. Conservative Policy Iteration
- Greedy improvement \(\rightarrow\) oscillations, incremental is more stable
- Performance Difference Lemma
6. Policy Optimization
- \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
- Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
- Random Search: \(\theta \pm \delta v\) , \(g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v\)
- REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
- Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\), \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s)) \)
Food for thought: how to compute off-policy gradient estimate?
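A minimal sketch of the REINFORCE estimator on a toy single-state problem with two actions and assumed per-action rewards:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # softmax policy over 2 actions
H, n_rollouts = 10, 500
reward = np.array([0.0, 1.0])            # assumed per-action reward

def grad_log_pi(theta, a):
    p = np.exp(theta) / np.exp(theta).sum()
    g = -p
    g[a] += 1.0                          # grad_theta log softmax(theta)[a]
    return g

g_hat = np.zeros(2)
for _ in range(n_rollouts):
    p = np.exp(theta) / np.exp(theta).sum()
    actions = rng.choice(2, size=H, p=p)                 # rollout tau under pi_theta
    R_tau = reward[actions].sum()                        # observed reward R(tau)
    g_hat += sum(grad_log_pi(theta, a) for a in actions) * R_tau
g_hat /= n_rollouts

print(g_hat)    # points towards increasing the probability of the better action
```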
6. Policy Optimization
- Policy Gradient Meta-Algorithm
For \(i=0,1,\dots\):
- collect rollouts using \(\theta_i\)
- estimate gradient with \(g_i\)
- \(\theta_{i+1} = \theta_i+ \alpha g_i\)
- Trust regions and Natural PG
- Trust region problem: \( \max_\theta ~J(\theta) ~~\text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta \)
- Local approximation: \( \max_\theta ~\nabla J(\theta_0)^\top(\theta-\theta_0) ~~\text{s.t.}~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta\)
- Natural PG update: \(\theta_{i+1} = \theta_i + \alpha F^{-1}_{i} g_i\)
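A minimal sketch of one natural PG step with an assumed Fisher estimate \(F_i\) and gradient estimate \(g_i\), computing \(F_i^{-1}g_i\) via a linear solve rather than an explicit inverse:

```python
import numpy as np

F_i = np.array([[2.0, 0.3],
                [0.3, 1.0]])        # assumed Fisher information estimate F_{theta_i}
g_i = np.array([0.5, -1.2])         # assumed policy gradient estimate
theta_i = np.zeros(2)
alpha = 0.1

# natural policy gradient step: theta <- theta + alpha * F^{-1} g
theta_next = theta_i + alpha * np.linalg.solve(F_i, g_i)
print(theta_next)
```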
7. Exploration
- Multi-Arm and Contextual Bandits: MDP with no transitions!
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
- Linear regret: pure random, pure greedy, \(\epsilon\) greedy (PSet)
- Sublinear regret: Explore-then-commit, UCB, LinUCB
- Greedy \( \arg\max_a \widehat \mu^a_t\) vs UCB \( \arg\max_a \widehat \mu^a_t + \sqrt{C/N^a_t}\)
Food for thought: performance/regret of softmax policy?
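A minimal sketch of UCB arm selection on assumed Bernoulli arms; the bonus follows the \(\sqrt{C/N^a_t}\) form above, with \(C\) chosen to absorb the usual log factor:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])     # assumed Bernoulli arm means
T = 5000
C = 2.0 * np.log(T)                        # confidence constant (absorbs the log factor)

counts = np.ones(3)                        # pull each arm once to initialize (simplification)
sums = np.array([float(rng.random() < m) for m in true_means])

for t in range(T):
    mu_hat = sums / counts                           # empirical means
    a = np.argmax(mu_hat + np.sqrt(C / counts))      # UCB: argmax mu_hat^a + sqrt(C/N^a)
    # a = np.argmax(mu_hat)                          # greedy alternative (can get stuck)
    sums[a] += rng.random() < true_means[a]
    counts[a] += 1

print(counts)   # most pulls go to the best arm, while every arm keeps being explored
```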

8. Learning From Experts
Imitation Learning with BC
Food for thought: Expert in LQR setting? (Linear regression)
(Diagram: dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) of expert state-action pairs \(\to\) supervised learning \(\to\) policy \(\pi\))
8. Learning From Experts
Imitation Learning with DAgger
(Diagram: DAgger loop)
- Supervised learning on dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) yields policy \(\pi\)
- Execute \(\pi\) to collect states \(s_0, s_1, s_2, \dots\)
- Query expert for labels \(\pi^*(s_0), \pi^*(s_1), \dots\)
- Aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into \(\mathcal D\) and repeat
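A minimal sketch of the DAgger loop on a toy 1-D system with a hypothetical linear expert; "supervised learning" here is a one-parameter least-squares fit:

```python
import numpy as np

rng = np.random.default_rng(0)

def expert(s):
    """Hypothetical expert policy pi*(s) for the toy problem."""
    return -0.5 * s

def rollout(k, s0=5.0, H=20):
    """Execute the current linear policy a = k*s on toy dynamics s' = s + a + noise."""
    states, s = [], s0
    for _ in range(H):
        states.append(s)
        s = s + k * s + rng.normal(0, 0.1)
    return np.array(states)

k_hat = 0.0                                      # initial (poor) linear policy
X, Y = np.array([]), np.array([])                # aggregated dataset
for _ in range(10):                              # DAgger iterations
    states = rollout(k_hat)                      # execute learned policy
    X = np.concatenate([X, states])              # aggregate visited states ...
    Y = np.concatenate([Y, expert(states)])      # ... with queried expert labels
    k_hat = np.dot(X, Y) / np.dot(X, X)          # supervised learning: least-squares fit

print(k_hat)                                     # approaches the expert gain -0.5
```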
8. Learning From Experts
- Inverse RL: Principle of maximum entropy
- Max-Ent IRL with Soft-VI (entropy regularized)
- replace \(\max\) with softmax
- Lagrange formulation of constrained optimization
- \(x^* =\arg \min~~f(x)~~\text{s.t.}~~g(x)=0\)
- \(\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w\cdot g(x)\) equivalent
- Iterative algorithm or solve \(\nabla [f(x) + w^\top g(x)] = 0\)
- Max-Ent principle: maximize \(\mathsf{Ent}(\pi)\) s.t. \(\pi\) consistent with expert data
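A minimal sketch of soft value iteration on the same made-up 2-state MDP used earlier, replacing the hard max with a log-sum-exp (softmax) backup:

```python
import numpy as np

gamma = 0.9
r = np.array([[0.0, 1.0], [0.5, 0.0]])                  # r[s, a]
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.9, 0.1]]])                # P[s, a, s']

V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ V                               # soft Bellman backup
    V = np.log(np.exp(Q).sum(axis=1))                   # softmax (log-sum-exp) replaces max
pi_soft = np.exp(Q - V[:, None])                        # resulting stochastic policy

print(V, pi_soft)     # rows of pi_soft are valid probability distributions over actions
```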
Proof Strategies
- Add and subtract: $$ \|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
- Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
- Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\| $$
- Basic Inequalities (PSet 1): $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$
Test-taking Strategies
- Move on if stuck!
- Write explanations and show steps for partial credit
- Multipart questions: can be done mostly independently
- ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\)
2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)
Questions?