CS 4/5789: Introduction to Reinforcement Learning
Lecture 28
Prof. Sarah Dean
MW 2:45-4pm
110 Hollister Hall
Agenda
0. Announcements
1. Review
2. Questions
Announcements
HW4 due tonight
5789 Paper Review Assignment due Friday
Course evaluations
Final Monday 5/16 at 7pm in Statler Hall 196
Closed-book, definition/equation sheet provided
Focus: Units 1-4
Study Materials: Lecture Notes, HWs, Prelim, Review Slides
Final Exam
Outline:
- MDP Definitions
- Policies and Distributions
- Value and Q function
- Optimal Policies
- Linear Optimal Control
- Learned Models, Values, Policies
- Exploration
- Learning from Experts
Review
Participation point: PollEV.com/sarahdean011
Infinite Horizon Discounted MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)
1. MDP Definitions
- \(\mathcal{S}\) states, \(\mathcal{A}\) actions
- \(r\) map from state, action to scalar reward
- \(P\) transition probability to next state given current state and action (Markov assumption)
- \(\gamma\) discount factor
Finite Horizon MDP
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)
- \(\mathcal{S},\mathcal{A},r,P\) same
- \(H\) horizon
- \(\mu_0\) initial distribution
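Below is a minimal tabular sketch of how these ingredients might be laid out in code; the names (`n_states`, `P`, `r`, `mu0`) are illustrative, not from any course codebase.

```python
import numpy as np

# Tabular MDP: states and actions are integers 0..n-1.
n_states, n_actions = 3, 2
r = np.random.rand(n_states, n_actions)            # r[s, a] = reward
P = np.random.rand(n_states, n_actions, n_states)  # unnormalized
P /= P.sum(axis=2, keepdims=True)                  # P[s, a, s'] = transition prob
gamma = 0.9                                        # discount (infinite horizon)
H = 10                                             # horizon (finite horizon)
mu0 = np.ones(n_states) / n_states                 # initial state distribution
```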

ex - Pac-Man as MDP
1. MDP Definitions
Optimal Control Problem
- continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
- Cost instead of reward
- transitions \(P\) described in terms of dynamics function and disturbance \(w\sim \mathcal D\)
\(s'= f(s, a, w)\)
ex - UAV as OCP
2. Policies and Distributions
- Policy \(\pi\) chooses an action based on the current state so \(a_t=a\) with probability \(\pi(a|s_t)\)
- Shorthand for deterministic policy: \(a_t=\pi(s_t)\)

examples:
Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)
2. Policies and Distributions
- Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
- Probability of \((s, a)\) at \(t\) $$ \mathbb{P}^\pi_t(s, a ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s, a_t = a) $$
- Discounted "steady-state" distribution $$ d^\pi_{\mu_0}(s, a) = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi_t(s, a; \mu_0) $$
- Finite horizon: \(d^\pi_{\mu_0}(s, a) =\frac{1}{H}\sum_{t=0}^{H-1} \mathbb{P}^\pi_t(s, a; \mu_0) \)
2. Policies and Distributions
Food for thought:
- How do these distributions change under two different transition models \(P\) and \(\widehat P\) (Simulation Lemma) or two different policies (PDL, Prelim, HW2)?
- How to write the distribution \(\mathbb{P}^\pi_t\) in terms of \(\mathbb{P}^\pi_{t-1}\)?
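For the second question, one possible answer in tabular form: push \(\mathbb{P}^\pi_{t-1}\) forward through \(P\), then through \(\pi\). A sketch under the array conventions above, with `pi[s, a]` \(=\pi(a\mid s)\):

```python
import numpy as np

def next_dist(P_prev, P, pi):
    """P_t(s', a') from P_{t-1}(s, a): marginalize over (s, a), then apply pi."""
    # State marginal: P_t(s') = sum_{s,a} P_{t-1}(s, a) P(s' | s, a)
    state_marginal = np.einsum('sa,sap->p', P_prev, P)
    # Joint with the next action: P_t(s', a') = P_t(s') pi(a' | s')
    return state_marginal[:, None] * pi

# Base case: P_0(s, a) = mu0(s) pi(a | s), i.e. P0 = mu0[:, None] * pi
```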
3. Value and Q function
- Evaluate policy by cumulative reward
- \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s]\)
- \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s, a_0=a]\)
- For finite horizon, for \(t=0,...H-1\),
- \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s]\)
- \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s, a_t=a]\)

3. Value and Q function
Recursive Bellman Expectation Equation:
- Discounted Infinite Horizon
- \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
- \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
- Finite Horizon, for \(t=0,\dots H-1\),
- \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
- \(Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)
Recall: Gardening MDP HW problem, Prelim
3. Value and Q function
- Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi\)
- Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}\)
- Iterative Policy Evaluation: \(V^{\pi}_{t+1} = R^{\pi} + \gamma P^{\pi} V^\pi_t\)
- Backwards-Iterative computation in finite horizon:
- Initialize \(V^{\pi}_H = 0\)
- For \(t=H-1, H-2, ... 0\)
- \(V^{\pi}_t = R^{\pi} +P^{\pi} V^\pi_{t+1}\)
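Both routines are a few lines in the tabular setting; a sketch, assuming \(R^\pi\) (vector) and \(P^\pi\) (matrix) have already been formed from \(r\), \(P\), and \(\pi\):

```python
import numpy as np

def exact_policy_eval(R_pi, P_pi, gamma):
    """Solve (I - gamma P^pi) V = R^pi directly."""
    return np.linalg.solve(np.eye(len(R_pi)) - gamma * P_pi, R_pi)

def iterative_policy_eval(R_pi, P_pi, gamma, iters=1000):
    """Fixed-point iteration V <- R^pi + gamma P^pi V (a gamma-contraction)."""
    V = np.zeros_like(R_pi)
    for _ in range(iters):
        V = R_pi + gamma * P_pi @ V
    return V
```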
4. Optimal Policies
- An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
- Equivalent condition: Bellman Optimality
- \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
- \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
- Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)
Recall: Gardening MDP, Prelim (verifying optimality)
4. Optimal Policies
- \(0\) cost for \(a_0\)
- \(2\epsilon\) cost for \(a_1\)
- \(\epsilon\) reward in \(s_0\)
- \(1\) reward in \(s_1\)
- \(\gamma\) discount
Food for thought: rigorous argument for optimal policy?
4. Optimal Policies
- Finite horizon, for \(t=0,\dots H-1\),
- \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
- \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
- Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
- Can directly solve with Dynamic Programming
- Iterate backwards in time from \(V^*_{H}=0\)
4. Optimal Policies
- Infinite horizon: algorithms for recursion in the Bellman Optimality equation
- Value Iteration
- Initialize \(Q^0\). For \(t=0,1,\dots\),
- \(Q^{t+1}(s,a) =r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{t}(s', a') \right]\)
- Policy Iteration
- Initialize \(\pi^0\). For \(t=0,1,\dots\),
- \(Q^{t}= \) PolicyEval(\(\pi^t\))
- \(\pi^{t+1}(s) = \argmax_{a\in\mathcal A} Q^{t}(s,a)\)
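A tabular sketch of both loops, using the exact solve from policy evaluation as PolicyEval (array conventions as in the earlier snippets):

```python
import numpy as np

def value_iteration(r, P, gamma, iters=500):
    """Q <- r + gamma E_{s'}[max_{a'} Q(s', a')], a gamma-contraction."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)    # P[s,a,:] dotted with max_a' Q
    return Q

def policy_iteration(r, P, gamma, iters=50):
    """Alternate exact evaluation of pi^t with greedy improvement."""
    n_states = r.shape[0]
    pi = np.zeros(n_states, dtype=int)       # deterministic policy pi[s] = a
    for _ in range(iters):
        P_pi = P[np.arange(n_states), pi]    # P^pi[s, s']
        R_pi = r[np.arange(n_states), pi]    # R^pi[s]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
        Q = r + gamma * P @ V
        pi = Q.argmax(axis=1)                # greedy improvement
    return pi, Q
```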
4. Optimal Policies
- Value Iteration
- Fixed point iteration (like Iterative Policy Evaluation) from Bellman Q Optimality
- Contraction in Q: \(\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty\)
- Policy Iteration
- Monotone Improvement: \(Q^{t+1}(s,a) \geq Q^{t}(s,a)\)
- Contraction in V: \(\|V^{t+1} - V^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty\)
5. Linear Optimal Control
- Linear Dynamics: $$s_{t+1} = A s_t + Ba_t + w_t,\quad w_t\sim \mathcal N(0,\sigma^2 I)$$
- Unrolled dynamics $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k (Ba_{t-k-1} + w_{t-k-1})$$
- Stability of uncontrolled \(s_{t+1}=As_t\): determined by whether \(\rho(A)<1\)
- Finite Horizon LQR: Application of Dynamic Programming
- Basis for approximation-based algorithms (local linearization and iLQR)
- Recall: Prelim question on linear policy \(a_t = K s_t\)
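The dynamic program for finite-horizon LQR reduces to a backwards Riccati recursion; a sketch, assuming per-step cost \(s^\top Q_c s + a^\top R_c a\) and terminal cost \(s^\top Q_c s\) (the cost matrices and terminal choice are assumptions, not from the lecture):

```python
import numpy as np

def lqr_gains(A, B, Q_c, R_c, H):
    """Finite-horizon LQR via backwards Riccati recursion: a_t = K_t s_t."""
    P = Q_c.copy()                          # P_H: terminal cost-to-go matrix
    gains = []
    for _ in range(H):
        # K = -(R_c + B^T P B)^{-1} B^T P A
        K = -np.linalg.solve(R_c + B.T @ P @ B, B.T @ P @ A)
        # Riccati step: P <- Q_c + A^T P (A + B K)
        P = Q_c + A.T @ P @ (A + B @ K)
        gains.append(K)
    return gains[::-1]                      # gains[t] applies at time t
```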
6. Learning from data
- What do we want to learn? \(\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}\)
- Unknown transitions \(P(s'|s,a)\) or reward function \(r(s,a)\)
- Value/Q function
- of policy \(V^\pi(s)\) or \(Q^\pi(s,a)\)
- optimal \(V^*(s)\) or \(Q^*(s,a)\)
- Optimal Policy \(\pi^*(s)\)
- Given a dataset with features \(x_i\) and labels \(y_i\)
- Fitting a model:
- Via counting: \(\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\} \)
- Function approx: \(\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2 \)
6. Learning Models
Model-Based RL
- Features are \((s,a)\) and label is \(s'\)
- Tabular setting: \(\widehat P\) via counting
- Simulation Lemma
- Translate error in \(\widehat P\) vs \(P\) into difference in performance \(\widehat V\) vs \(V\)
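A sketch of the counting estimator for \(\widehat P\) from a dataset of integer `(s, a, s_next)` triples; the uniform fallback for unvisited pairs is an arbitrary choice:

```python
import numpy as np

def fit_model_counting(transitions, n_states, n_actions):
    """P_hat(s'|s,a) = count(s,a,s') / count(s,a)."""
    counts = np.zeros((n_states, n_actions, n_states))
    for s, a, s_next in transitions:
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    uniform = np.full_like(counts, 1.0 / n_states)   # for unseen (s, a)
    return np.divide(counts, totals, out=uniform, where=totals > 0)
```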
6. Learning Value/Q
- Features are \((s_i,a_i)\)
- \((s_i,a_i) = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}\)
- Labels constructed as:
- Rollout based (MC): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
- Bellman Exp based (TD): \(y_t =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) \)
- Bellman Opt based (TD): \(y_t =r_t + \gamma \max_a \widehat Q(s_{t+1},a) \)
- On vs. off policy (Recall HW)
- \(\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2\)
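A sketch of the three label constructions on a single stored trajectory (`rewards`, `states`, `actions` lists; `Q_hat` a tabular array):

```python
def mc_label(rewards, h1, h2):
    """Rollout (MC) label: y = sum of rewards from h1 through h1 + h2."""
    return sum(rewards[h1 : h1 + h2 + 1])

def td_label(rewards, states, actions, t, Q_hat, gamma):
    """Bellman expectation (TD) label: r_t + gamma Q_hat(s_{t+1}, a_{t+1})."""
    return rewards[t] + gamma * Q_hat[states[t + 1], actions[t + 1]]

def td_opt_label(rewards, states, t, Q_hat, gamma):
    """Bellman optimality (TD) label: r_t + gamma max_a Q_hat(s_{t+1}, a)."""
    return rewards[t] + gamma * Q_hat[states[t + 1]].max()
```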
6. Learning Value/Q
[Diagram: sampling for evaluation. Draw \(h_1=h\) w.p. \(\propto \gamma^h\), then roll out \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\)]
- Approximate Dynamic Programming: for \(t=0,1,...\)
- \(\widehat Q^t = \mathsf{SampleEval}(\pi^t)\)
- \(\pi^{t+1} = \mathsf{Improvement}(\widehat Q^t)\)
- Approximate Policy Iteration
- Greedy improvement, could oscillate
- Conservative Policy Iteration
- Incremental improvement
- Performance Difference Lemma
6. Policy Optimization
- \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
- Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
- Random Search: \(\theta \pm \delta v\) , \(g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v\)
- REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
- Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\), \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s)) \)
Food for thought: how to compute off-policy gradient estimate?
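A sketch of the REINFORCE estimator for a tabular softmax policy; the parameterization `theta[s, a]` and the single-rollout input are illustrative assumptions:

```python
import numpy as np

def softmax_policy(theta, s):
    """pi_theta(a|s) for logits theta[s, :]."""
    z = np.exp(theta[s] - theta[s].max())
    return z / z.sum()

def reinforce_gradient(theta, trajectory, R_tau):
    """g = sum_t grad_theta log pi_theta(a_t|s_t) * R(tau), one rollout."""
    g = np.zeros_like(theta)
    for s, a in trajectory:        # trajectory: list of (s_t, a_t) pairs
        g[s] -= softmax_policy(theta, s)   # gradient of the log-normalizer
        g[s, a] += 1.0                     # gradient of the chosen logit
    return g * R_tau
```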
Recap
Derivative Free Optimization: Random Search
\(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))v\)
Example objective (plotted): \(J(\theta) = -\theta^2 - 1\)
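The same recap in code, for scalar \(\theta\) and the plotted objective; a sketch only:

```python
import numpy as np

def random_search_gradient(J, theta, delta=0.1):
    """Two-point estimate (J(theta + delta v) - J(theta - delta v)) v / (2 delta)."""
    v = np.random.randn()          # random direction (scalar case)
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v

J = lambda theta: -theta**2 - 1    # the plotted example objective
theta = 1.0
for _ in range(200):
    theta += 0.05 * random_search_gradient(J, theta)
# theta drifts toward 0, the maximizer of J
```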
Recap
Derivative Free Optimization: Sampling
\(J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]\)
\(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(x))\, h(x)\) for a sample \(x\sim P_\theta\)
Example (plotted): \(h(x) = -x^2\) and \(P_\theta = \mathcal N(\theta, 1)\), so \(J(\theta) = \mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2]\) and \(\nabla_\theta \log(P_\theta(x))\, h(x) = (x-\theta)h(x)\)
6. Policy Optimization
- Policy Gradient Meta-Algorithm: for \(t=0,1,...\)
- collect rollouts using \(\theta_t\)
- estimate gradient with \(g_t\)
- \(\theta_{t+1} = \theta_t + \alpha g_t\)
- Trust regions and Natural PG: linearize the objective and quadratically approximate the KL constraint, so $$ \max ~J(\theta) \quad \text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta $$ is approximated by $$ \max ~\nabla J(\theta_0)^\top(\theta-\theta_0) \quad \text{s.t.} ~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta $$
- Resulting update: \(\theta_{t+1} = \theta_t + \alpha F^{-1}_{t} g_t\)
7. Exploration
- Multi-Arm and Contextual Bandits: MDP with no transitions!
- Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
- Explore-then-commit, UCB, LinUCB
- \( \arg\max_a \widehat \mu_a\) vs \( \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)
Food for thought: performance/regret of softmax policy?

Recap
Explore-then-Commit
- Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
- For \(t=NK+1,...,T\): pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)
- Set exploration \(N \approx T^{2/3}\), then \(R(T) \lesssim T^{2/3}\)
Upper Confidence Bound
- For \(t=1,...,T\):
- Pull \( a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)
- Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)
- \(R(T) \lesssim \sqrt{T}\)
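A sketch of UCB where `pull(a)` returns a stochastic reward; each arm is pulled once for initialization, and `C` is the confidence constant:

```python
import numpy as np

def ucb(pull, K, T, C=2.0):
    """Pull argmax_a mu_hat_a + sqrt(C / N_a); keep running means and counts."""
    means = np.array([pull(a) for a in range(K)], dtype=float)
    counts = np.ones(K)
    for _ in range(K, T):
        a = int(np.argmax(means + np.sqrt(C / counts)))
        r = pull(a)
        counts[a] += 1
        means[a] += (r - means[a]) / counts[a]   # running mean update
    return means, counts
```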
8. Learning From Experts
Imitation Learning with BC
Food for thought: Expert in LQR setting? (Linear regression)
[Diagram: Behavior Cloning. Supervised learning maps a dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) of observed states and expert actions to a policy \(\pi\)]
8. Learning From Experts
Imitation Learning with DAgger
Food for thought: Expert in LQR setting? (Linear regression)
[Diagram: DAgger loop. Supervised learning on \(\mathcal D = (x_i, y_i)_{i=1}^M\) produces a policy \(\pi\); execute \(\pi\) to visit states \(s_0, s_1, s_2,...\); query the expert for \(\pi^*(s_0), \pi^*(s_1),...\); aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset and repeat]
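The loop in the diagram, as a sketch; `fit`, `rollout`, and `expert` are stand-in callables for supervised learning, executing the policy, and querying \(\pi^*\):

```python
def dagger(fit, rollout, expert, n_iters, init_data):
    """DAgger: execute, query the expert on visited states, aggregate, refit."""
    data = list(init_data)              # list of (state, expert_action) pairs
    policy = fit(data)                  # behavior cloning on the initial data
    for _ in range(n_iters):
        states = rollout(policy)        # s_0, s_1, s_2, ... under the policy
        data += [(s, expert(s)) for s in states]   # query expert, aggregate
        policy = fit(data)              # supervised learning on the aggregate
    return policy
```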
BC vs. DAgger
Supervised learning guarantee
\(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\)
Online learning guarantee
\(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
Performance Guarantee
\(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)
8. Learning From Experts
- Inverse RL: Principle of maximum entropy
- Soft-VI (entropy weighted) replaces \(\max\) with softmax
- Max-Ent IRL: For \(k=0,\dots,K-1\):
- \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
- \(w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])\)
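A tabular sketch of Soft-VI and the feature-matching loop; `phi[s, a]` is a feature vector, `d_expert` an estimate of \(d^{\pi^*}_\mu\), and `occupancy` a stand-in that computes \(d^{\pi}_\mu\) (e.g. via the distribution recursion from Section 2):

```python
import numpy as np
from scipy.special import logsumexp

def soft_VI(r, P, gamma, iters=500):
    """Entropy-weighted VI: the max over actions becomes a log-sum-exp."""
    Q = np.zeros_like(r)
    for _ in range(iters):
        Q = r + gamma * P @ logsumexp(Q, axis=1)     # soft Bellman backup
    return np.exp(Q - logsumexp(Q, axis=1, keepdims=True))  # pi(a|s)

def max_ent_irl(phi, P, gamma, d_expert, occupancy, K=100, eta=0.1):
    """Ascend on w to match expert feature expectations."""
    w = np.zeros(phi.shape[-1])
    for _ in range(K):
        pi = soft_VI(phi @ w, P, gamma)      # reward r(s,a) = w^T phi(s,a)
        d_pi = occupancy(pi)                 # d^pi_mu(s, a) under current pi
        w += eta * np.einsum('sa,saf->f', d_expert - d_pi, phi)
    return w, soft_VI(phi @ w, P, gamma)
```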
- Lagrange Formulation to constrained optimization: maximize \(\mathsf{Ent}(\pi)\) s.t. \(\pi\) consistent with expert data
- In general, \(x^* =\arg \min_x~f(x)~~\text{s.t.}~~g(x)=0\) becomes \(\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w\cdot g(x)\)
- Solve iteratively or via \(\nabla \mathcal L(x,w) = 0\)
Proof Strategies
- Add and subtract: $$ \|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
- Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
- Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\| $$
- Basic Inequalities (HW0) $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$
Test-taking Strategies
- Move on if stuck!
- Write explanations and show steps for partial credit
- Multipart questions: can be done mostly independently
- ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\); 2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)
Prelim Summary
- Problem 1: Approximate Policy Evaluation
- Similar to PE proof from lecture with \(V\)
- Problem 2: Optimal Machine Repair
- Similar to Gardening HW problem
- Problem 3: State distributions
- Use proof techniques from review lecture
- Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
- Problem 4: Value of Linear Policy
- Use the finite-horizon Bellman Expectation Equation (not the Bellman Optimality Equation), or the unrolled expression for linear dynamics
CS 4/5789: Lecture 28
By Sarah Dean