CS 4/5789: Lecture 28

CS 4/5789: Introduction to Reinforcement Learning

Lecture 28

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

0. Announcements

1. Review

2. Questions

Announcements

HW4 due tonight

5789 Paper Review Assignment due Friday

Course evaluations

Final Monday 5/16 at 7pm in Statler Hall 196

Closed-book, definition/equation sheet provided

Focus: Units 1-4

Study Materials: Lecture Notes, HWs, Prelim, Review Slides

Final Exam

Outline:

MDP Definitions
Policies and Distributions
Value and Q function
Optimal Policies
Linear Optimal Control
Learned Models, Values, Policies
Exploration
Learning from Experts

Review

Participation point: PollEV.com/sarahdean011

1. MDP Definitions

Infinite Horizon Discounted MDP

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$

$\mathcal{S}$ states, $\mathcal{A}$ actions
$r$ map from state, action to scalar reward
$P$ transition probability to next state given current state and action (Markov assumption)
$\gamma$ discount factor

Finite Horizon MDP

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$

$\mathcal{S},\mathcal{A},r,P$ same
$H$ horizon
$\mu_0$ initial distribution

ex - Pac-Man as MDP

1. MDP Definitions

Optimal Control Problem

continuous states/actions $\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$
Cost instead of reward
transitions $P$ described in terms of dynamics function and disturbance $w\sim \mathcal D$
$s'= f(s, a, w)$

ex - UAV as OCP

2. Policies and Distributions

Policy $\pi$ chooses an action based on the current state so $a_t=a$ with probability $\pi(a|s_t)$
- Shorthand for deterministic policy: $a_t=\pi(s_t)$

examples:

Policy results in a trajectory $\tau = (s_0, a_0, s_1, a_1, ... )$

$s_0 \to a_0 \to s_1 \to a_1 \to s_2 \to a_2 \to \dots$

2. Policies and Distributions

Probability of trajectory $\tau =(s_0, a_0, s_1, ... s_t, a_t)$ $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
Probability of $(s, a)$ at $t$ $$ \mathbb{P}^\pi_t(s, a ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s, a_t = a) $$
Discounted "steady-state" distribution $$ d^\pi_{\mu_0}(s, a) = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi_t(s, a; \mu_0) $$
- Finite horizon: $d^\pi_{\mu_0}(s, a) =\frac{1}{H}\sum_{t=0}^{H-1} \mathbb{P}^\pi_t(s, a; \mu_0) $
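A minimal sketch of how these distributions can be computed in a small tabular MDP. The transition table `P`, policy `pi`, and initial distribution `mu0` are made-up numbers for illustration, and the infinite sum is truncated at a hypothetical cutoff `T`. The state distribution is propagated one step at a time, and the discounted distribution accumulates $(1-\gamma)\gamma^t$-weighted terms.

```python
# Hypothetical 2-state, 2-action MDP (numbers chosen only for illustration).
# P[s][a][s2] = probability of moving to s2 from s under action a.
P = [
    [[0.9, 0.1], [0.2, 0.8]],   # from state 0: action 0, action 1
    [[0.5, 0.5], [0.0, 1.0]],   # from state 1: action 0, action 1
]
pi = [[0.5, 0.5], [0.1, 0.9]]   # pi[s][a] = pi(a | s)
mu0 = [1.0, 0.0]                # initial state distribution
gamma = 0.9
S, A = 2, 2


def step_state_dist(p_state):
    """One step forward: P_{t+1}(s') = sum_{s,a} P_t(s) pi(a|s) P(s'|s,a)."""
    nxt = [0.0] * S
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                nxt[s2] += p_state[s] * pi[s][a] * P[s][a][s2]
    return nxt


def discounted_state_action_dist(T=2000):
    """Truncated version of (1 - gamma) * sum_t gamma^t P_t(s, a)."""
    d = [[0.0] * A for _ in range(S)]
    p_state, weight = mu0[:], 1.0 - gamma
    for _ in range(T):
        for s in range(S):
            for a in range(A):
                d[s][a] += weight * p_state[s] * pi[s][a]
        p_state = step_state_dist(p_state)
        weight *= gamma
    return d


d = discounted_state_action_dist()
total = sum(sum(row) for row in d)   # should be (numerically) 1
```

Because $\gamma^T$ is negligible for the chosen cutoff, `d` sums to one up to floating-point error.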

2. Policies and Distributions

Food for thought:

How do these distributions change under two different transition models $P$ and $\widehat P$ (Simulation Lemma) or two different policies (PDL, Prelim, HW2)?
How to write the distribution $\mathbb{P}^\pi_t$ in terms of $\mathbb{P}^\pi_{t-1}$?

3. Value and Q function

Evaluate policy by cumulative reward
- $V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s]$
- $Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s, a_0=a]$
For finite horizon, for $t=0,...H-1$,
- $V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s]$
- $Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s, a_t=a]$

examples:

...

3. Value and Q function

Recursive Bellman Expectation Equation:

Discounted Infinite Horizon
- $V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$
- $Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$
Finite Horizon, for $t=0,\dots H-1$,
- $V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$
- $Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] $

...

Recall: Gardening MDP HW problem, Prelim

3. Value and Q function

Recursive computation: $V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi$
- Exact Policy Evaluation: $V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}$
- Iterative Policy Evaluation: $V^{\pi}_{t+1} = R^{\pi} + \gamma P^{\pi} V^\pi_t$
Backwards-Iterative computation in finite horizon:
- Initialize $V^{\pi}_H = 0$
- For $t=H-1, H-2, ... 0$
  - $V^{\pi}_t = R^{\pi} +P^{\pi} V^\pi_{t+1}$
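The two recursions above can be sketched in a few lines. The reward vector `R_pi` and transition matrix `P_pi` below are hypothetical values for a two-state example, and the iteration counts are arbitrary choices.

```python
# Hypothetical 2-state example: R_pi[s] is the expected reward under pi,
# P_pi[s][s2] is the state-to-state transition matrix induced by pi.
R_pi = [1.0, 0.0]
P_pi = [[0.5, 0.5], [0.2, 0.8]]
gamma = 0.9


def bellman_backup(V, discount):
    """Return R_pi + discount * P_pi V."""
    n = len(V)
    return [R_pi[s] + discount * sum(P_pi[s][s2] * V[s2] for s2 in range(n))
            for s in range(n)]


def iterative_policy_evaluation(num_iters=500):
    """Discounted infinite horizon: V_{t+1} = R_pi + gamma * P_pi V_t."""
    V = [0.0] * len(R_pi)
    for _ in range(num_iters):
        V = bellman_backup(V, gamma)
    return V


def finite_horizon_evaluation(H):
    """Backward recursion from V_H = 0; returns [V_0, ..., V_H]."""
    values = [[0.0] * len(R_pi)]
    for _ in range(H):
        values.insert(0, bellman_backup(values[0], 1.0))
    return values


V = iterative_policy_evaluation()
V_fh = finite_horizon_evaluation(H=3)
```

For this example the iterates approach the exact solution $(I-\gamma P^\pi)^{-1}R^\pi$, which can be checked by hand for a $2\times 2$ system.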

...

4. Optimal Policies

An optimal policy $\pi^*$ is one where $V^{\pi^*}(s) \geq V^{\pi}(s)$ for all $s$ and policies $\pi$
Equivalent condition: Bellman Optimality
- $V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$
- $ Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$
Optimal policy $\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$

Recall: Gardening MDP, Prelim (verifying optimality)

4. Optimal Policies

$0$ cost for $a_0$
$2\epsilon$ cost for $a_1$
$\epsilon$ reward in $s_0$
$1$ reward in $s_1$
$\gamma$ discount

Food for thought: rigorous argument for optimal policy?

4. Optimal Policies

Finite horizon, for $t=0,\dots H-1$,
- $V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$
- $Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$
Optimal policy $\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$
Can directly solve with Dynamic Programming
- Iterate backwards in time from $V^*_{H}=0$
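A minimal sketch of the backward pass, assuming a small tabular MDP. The reward table `r`, transition table `P`, and horizon `H` are hypothetical numbers chosen so that the optimal policy depends on the time step.

```python
# Hypothetical tabular MDP: r[s][a] rewards, P[s][a][s2] transitions.
r = [[0.0, -1.0], [1.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],   # state 0: action 0 stays, action 1 moves to state 1
    [[0.5, 0.5], [0.0, 1.0]],   # state 1: action 0 is random, action 1 stays
]
H = 3
S, A = 2, 2


def finite_horizon_dp():
    """Backward induction; returns V*_0 and the time-indexed greedy policy."""
    V = [0.0] * S                       # V*_H = 0
    policy = [None] * H
    for t in reversed(range(H)):
        # Q*_t(s, a) = r(s, a) + E_{s'}[V*_{t+1}(s')]
        Q = [[r[s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
        policy[t] = [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]
        V = [max(Q[s]) for s in range(S)]
    return V, policy


V0, policy = finite_horizon_dp()
```

In this example the greedy action in state 0 differs between the first and last step, which illustrates why the finite-horizon optimal policy is indexed by $t$.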

4. Optimal Policies

Infinite horizon: algorithms for recursion in the Bellman Optimality equation
Value Iteration
- Initialize $Q^0$. For $t=0,1,\dots$,
  - $Q^{t+1}(s,a) =r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{t}(s', a') \right]$
Policy Iteration
- Initialize $\pi^0$. For $t=0,1,\dots$,
  - $Q^{t}= $ PolicyEval($\pi^t$)
  - $\pi^{t+1}(s) = \argmax_{a\in\mathcal A} Q^{t}(s,a)$
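Both algorithms can be sketched on a small tabular MDP. The tables `r` and `P` are hypothetical, the policy evaluation step is done iteratively for a fixed number of sweeps (an arbitrary choice), and policy iteration stops when the greedy policy no longer changes.

```python
# Hypothetical tabular MDP: r[s][a] rewards, P[s][a][s2] transitions.
r = [[0.0, -1.0], [1.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]
gamma = 0.9
S, A = 2, 2


def backup(s, a, V):
    """r(s, a) + gamma * E_{s' ~ P(s, a)}[V(s')]."""
    return r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S))


def value_iteration(num_iters=1000):
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(num_iters):
        V = [max(Q[s]) for s in range(S)]            # max_a' Q^t(s', a')
        Q = [[backup(s, a, V) for a in range(A)] for s in range(S)]
    return Q


def policy_eval(pi, num_iters=1000):
    """Iterative evaluation of a deterministic policy; returns Q^pi."""
    V = [0.0] * S
    for _ in range(num_iters):
        V = [backup(s, pi[s], V) for s in range(S)]
    return [[backup(s, a, V) for a in range(A)] for s in range(S)]


def greedy(Q):
    return [max(range(A), key=lambda a: Q[s][a]) for s in range(S)]


def policy_iteration():
    pi = [0] * S
    while True:
        Q = policy_eval(pi)
        new_pi = greedy(Q)
        if new_pi == pi:
            return pi, Q
        pi = new_pi


Q_vi = value_iteration()
pi_pi, Q_pi = policy_iteration()
```

On this example both methods arrive at the same greedy policy, and their Q tables agree up to numerical error.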

4. Optimal Policies

Value Iteration
- Fixed point iteration (like Iterative Policy Evaluation) from Bellman Q Optimality
- Contraction in Q: $\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty$
Policy Iteration
- Monotone Improvement: $Q^{t+1}(s,a) \geq Q^{t}(s,a)$
- Contraction in V: $\|V^{t+1} - V^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty$

5. Linear Optimal Control

Linear Dynamics: $$s_{t+1} = A s_t + Ba_t + w_t,\quad w_t\sim \mathcal N(0,\sigma^2 I)$$
Unrolled dynamics $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k (Ba_{t-k-1} + w_{t-k-1})$$
Stability of uncontrolled $s_{t+1}=As_t$: determined by whether $\rho(A)<1$
Finite Horizon LQR: Application of Dynamic Programming
- Basis for approximation-based algorithms (local linearization and iLQR)
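A small numerical check of the unrolled expression and the stability condition, assuming a hypothetical $2\times 2$ system (the matrices `A`, `B`, the noise scale, and the action sequence are made up for illustration). The spectral radius helper only handles the $2\times 2$ case.

```python
import cmath
import random

A = [[0.9, 0.2], [0.0, 0.5]]   # hypothetical dynamics matrix
B = [[0.0], [1.0]]             # hypothetical input matrix


def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


def add(x, y):
    return [xi + yi for xi, yi in zip(x, y)]


def spectral_radius_2x2(M):
    """Largest eigenvalue magnitude of a 2x2 matrix via the quadratic formula."""
    tr = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    disc = cmath.sqrt(tr * tr - 4 * det)
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))


def simulate(s0, actions, noises):
    """Step-by-step: s_{t+1} = A s_t + B a_t + w_t."""
    s = s0
    for a, w in zip(actions, noises):
        s = add(add(matvec(A, s), matvec(B, a)), w)
    return s


def unrolled(s0, actions, noises):
    """s_t = A^t s_0 + sum_{k=0}^{t-1} A^k (B a_{t-k-1} + w_{t-k-1})."""
    t = len(actions)
    total = s0
    for _ in range(t):
        total = matvec(A, total)
    for k in range(t):
        term = add(matvec(B, actions[t - k - 1]), noises[t - k - 1])
        for _ in range(k):
            term = matvec(A, term)
        total = add(total, term)
    return total


rng = random.Random(0)
s0 = [1.0, -1.0]
actions = [[rng.gauss(0.0, 1.0)] for _ in range(6)]
noises = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(6)]
s_sim = simulate(s0, actions, noises)
s_unrolled = unrolled(s0, actions, noises)
rho = spectral_radius_2x2(A)
```

Here `rho` is below one, so the uncontrolled system $s_{t+1}=As_t$ is stable for this choice of `A`.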

Recall: Prelim question on linear policy $a_t = K s_t$

6. Learning from data

What do we want to learn? $\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}$
- Unknown transitions $P(s'|s,a)$ or reward function $r(s,a)$
- Value/Q function
  - of policy $V^\pi(s)$ or $Q^\pi(s,a)$
  - optimal $V^*(s)$ or $Q^*(s,a)$
- Optimal Policy $\pi^*(s)$
Given a dataset with features $x_i$ and labels $y_i$
Fitting a model:
- Via counting: $\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\} $
- Function approx: $\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2 $
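A minimal sketch of the two fitting approaches. The function names are hypothetical, and the function class for least squares is assumed to be scalar lines $f(x) = wx + b$ so that the minimizer has a closed form.

```python
from collections import defaultdict


def fit_by_counting(xs, ys):
    """Average the labels observed for each distinct (hashable) feature value."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in counts}


def fit_linear_least_squares(xs, ys):
    """Minimize mean squared error over f(x) = w * x + b (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return w, mean_y - w * mean_x


f_count = fit_by_counting([0, 0, 1, 1, 1], [1.0, 3.0, 2.0, 2.0, 5.0])
w, b = fit_linear_least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

The counting estimator is only defined at feature values that appear in the dataset, which is one reason function approximation is used when the feature space is large.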

6. Learning Models

Model-Based RL

Features are $(s,a)$ and label is $s'$
Tabular setting: $\widehat P$ via counting
Simulation Lemma
- Translate error in $\widehat P$ vs $P$ into difference in performance $\widehat V$ vs $V$

6. Learning Value/Q

Features are $(s_i,a_i)$
- $(s_i,a_i) = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}$, where $h_1=h$ w.p. $\propto \gamma^h$
Labels constructed as:
- Rollout based (MC): $y_i = \sum_{t=h_1}^{h_1+h_2} r_t$
- Bellman Exp based (TD): $y_t =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) $
- Bellman Opt based (TD): $y_t =r_t + \gamma \max_a \widehat Q(s_{t+1},a) $
On vs. off policy (Recall HW)
$\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2$

Sampled transition: $s_t$, $a_t\sim \pi(s_t)$, $r_t\sim r(s_t, a_t)$, $s_{t+1}\sim P(s_t, a_t)$, $a_{t+1}\sim \pi(s_{t+1})$
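A sketch of how the three kinds of labels could be built from sampled data. The tuple layout `(s, a, r, s_next, a_next)`, the dictionary representation of $\widehat Q$, and the function names are assumptions made for illustration.

```python
def td_labels(transitions, Q_hat, gamma, actions=None):
    """Build (feature, label) pairs from (s, a, r, s_next, a_next) tuples.

    With actions=None the label uses the Bellman expectation target
    r + gamma * Q_hat(s_next, a_next); otherwise it uses the Bellman
    optimality target r + gamma * max_a Q_hat(s_next, a).
    """
    pairs = []
    for s, a, r, s_next, a_next in transitions:
        if actions is None:
            y = r + gamma * Q_hat[(s_next, a_next)]
        else:
            y = r + gamma * max(Q_hat[(s_next, b)] for b in actions)
        pairs.append(((s, a), y))
    return pairs


def mc_label(rewards, h1, h2):
    """Rollout-based label: sum of rewards r_t for t = h1, ..., h1 + h2."""
    return sum(rewards[h1:h1 + h2 + 1])


# Hypothetical data: two transitions and a tabular Q estimate.
Q_hat = {(0, 0): 0.0, (0, 1): 0.5, (1, 0): 2.0, (1, 1): 1.0}
transitions = [(0, 1, -1.0, 1, 0), (1, 0, 1.0, 0, 0)]
exp_pairs = td_labels(transitions, Q_hat, gamma=0.9)
opt_pairs = td_labels(transitions, Q_hat, gamma=0.9, actions=[0, 1])
y_mc = mc_label([1.0, 2.0, 3.0, 4.0, 5.0], h1=1, h2=2)
```

The expectation-based label depends on the next action actually taken, while the optimality-based label does not, which is the distinction behind on-policy versus off-policy learning.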

6. Learning Value/Q

Approximate Dynamic Programming
For $t=0,1,\dots$:
1. $\widehat Q^t = \mathsf{SampleEval}(\pi^t)$
2. $\pi^{t+1} = \mathsf{Improvement}(\widehat Q^t)$
Approximate Policy Iteration
- Greedy improvement, could oscillate
Conservative Policy Iteration
- Incremental improvement
Performance Difference Lemma

6. Policy Optimization

$J(\theta)=$ expected cumulative reward under policy $\pi_\theta$
Estimate $\nabla_\theta J(\theta)$ via rollouts $\tau$, observed reward $R(\tau)$
- Random Search: $\theta \pm \delta v$ , $g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v$
- REINFORCE: $g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)$
- Actor-Critic: $s,a\sim d^{\pi_\theta}_{\mu_0}$ ,
  $g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s)) $

Food for thought: how to compute off-policy gradient estimate?

Recap

Derivative Free Optimization: Random Search

$\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))v$

Example: $J(\theta) = -\theta^2 - 1$ (plotted against $\theta$)
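A minimal sketch of the two-point random search estimate on the example $J(\theta) = -\theta^2 - 1$. Here $J$ is evaluated exactly, whereas in RL it would be replaced by observed rollout rewards; the step size `alpha`, perturbation size `delta`, and step count are arbitrary choices.

```python
import random


def J(theta):
    return -theta ** 2 - 1.0


def random_search_gradient(theta, delta, v):
    """Two-point estimate (J(theta + delta v) - J(theta - delta v)) / (2 delta) * v."""
    return (J(theta + delta * v) - J(theta - delta * v)) / (2 * delta) * v


def random_search_ascent(theta0, alpha=0.1, delta=0.1, num_steps=200, seed=0):
    rng = random.Random(seed)
    theta = theta0
    for _ in range(num_steps):
        v = rng.gauss(0.0, 1.0)          # random search direction
        theta += alpha * random_search_gradient(theta, delta, v)
    return theta


g = random_search_gradient(theta=1.5, delta=0.1, v=1.0)
theta_final = random_search_ascent(theta0=2.0)
```

For this quadratic the estimate equals $-2\theta v^2$, so with $v=1$ it matches the true derivative $-2\theta$, and the ascent iterates shrink toward the maximizer $\theta=0$.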

Recap

Derivative Free Optimization: Sampling

$J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]$

$\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(x)) h(x)$ for a sample $x\sim P_\theta$

Example: $P_\theta = \mathcal N(\theta, 1)$, $h(x) = -x^2$
- $J(\theta) =\mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2] = -\theta^2 - 1$
- $\nabla_\theta \log(P_\theta(x)) h(x) = (x-\theta) h(x)$
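A Monte Carlo sketch of the sampling-based estimate for this example. It uses the fact that for $P_\theta = \mathcal N(\theta, 1)$ the score is $\nabla_\theta \log P_\theta(x) = x - \theta$; the sample count and seed are arbitrary choices.

```python
import random


def h(x):
    return -x ** 2


def score_function_gradient(theta, num_samples=200000, seed=0):
    """Average of (x - theta) * h(x) over samples x ~ N(theta, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        x = rng.gauss(theta, 1.0)
        total += (x - theta) * h(x)
    return total / num_samples


grad_estimate = score_function_gradient(theta=1.0)
```

Since $J(\theta) = -\theta^2 - 1$, the average should be close to $-2\theta$. A single-sample estimate is much noisier, which is why gradient estimates are typically averaged over many rollouts.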

6. Policy Optimization

Policy Gradient Meta-Algorithm
for $t=0,1,...$
1. collect rollouts using $\theta_t$
2. estimate gradient with $g_t$
3. $\theta_{t+1} = \theta_t + \alpha g_t$
Trust regions and Natural PG
- Trust region problem: $\max_\theta ~J(\theta) ~~\text{s.t.} ~~d_{KL}(\theta, \theta_0)\leq \delta$
- Approximation: $\max_\theta ~\nabla J(\theta_0)^\top(\theta-\theta_0) ~~\text{s.t.} ~~(\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta$
- Natural PG update: $\theta_{t+1} = \theta_t + \alpha F^{-1}_{t} g_t$
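As a worked step (a sketch, assuming $F_{\theta_0}$ is positive definite and writing $g = \nabla J(\theta_0) \neq 0$), the linear objective with quadratic constraint can be solved in closed form using a Lagrange multiplier $\lambda > 0$, a symbol introduced here:

$$ \mathcal L(\theta,\lambda) = g^\top(\theta-\theta_0) - \lambda\left((\theta-\theta_0)^\top F_{\theta_0}(\theta-\theta_0) - \delta\right) $$
$$ \nabla_\theta \mathcal L = 0 \implies \theta - \theta_0 = \frac{1}{2\lambda}F_{\theta_0}^{-1} g $$
$$ (\theta-\theta_0)^\top F_{\theta_0}(\theta-\theta_0) = \delta \implies \frac{1}{2\lambda} = \sqrt{\frac{\delta}{g^\top F_{\theta_0}^{-1} g}} $$

The step therefore points along $F_{\theta_0}^{-1} g$, and the scalar factor plays the role of the step size $\alpha$.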

7. Exploration

Multi-Armed and Contextual Bandits: MDP with no transitions!
Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t) \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
Explore-then-commit, UCB, LinUCB
- $ \arg\max_a \widehat \mu_a$ vs $ \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$

Food for thought: performance/regret of softmax policy?

Recap

Explore-then-Commit

Pull each arm $N$ times and compute empirical mean $\widehat \mu_a$
For $t=NK+1,...,T$:
- Pull $\widehat a^* = \arg\max_a \widehat \mu_a$

Set exploration $N \approx T^{2/3}$, then $R(T) \lesssim T^{2/3}$

Upper Confidence Bound

For $t=1,...,T$:
- Pull $ a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}$
- Update empirical means $\widehat \mu_t^a$ and counts $N_t^a$

$R(T) \lesssim \sqrt{T}$
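A sketch of both algorithms on a Bernoulli bandit. The arm means, the choice $C = 2\log T$, and pulling each arm once at the start of UCB (so that $N_t^a > 0$) are assumptions made for illustration.

```python
import math
import random


def pull(means, a, rng):
    """Bernoulli reward with mean means[a] (hypothetical reward model)."""
    return 1.0 if rng.random() < means[a] else 0.0


def explore_then_commit(means, T, N, rng):
    K = len(means)
    sums, counts = [0.0] * K, [0] * K
    for a in range(K):                       # exploration: N pulls per arm
        for _ in range(N):
            sums[a] += pull(means, a, rng)
            counts[a] += 1
    best = max(range(K), key=lambda a: sums[a] / counts[a])
    for _ in range(T - N * K):               # commit to the empirical best arm
        sums[best] += pull(means, best, rng)
        counts[best] += 1
    return counts


def ucb(means, T, C, rng):
    K = len(means)
    sums, counts = [0.0] * K, [0] * K
    for t in range(T):
        if t < K:
            a = t                            # pull each arm once first
        else:
            a = max(range(K),
                    key=lambda a: sums[a] / counts[a] + math.sqrt(C / counts[a]))
        sums[a] += pull(means, a, rng)
        counts[a] += 1
    return counts


means = [0.2, 0.8]
T = 2000
etc_counts = explore_then_commit(means, T, N=int(T ** (2 / 3)), rng=random.Random(0))
ucb_counts = ucb(means, T, C=2 * math.log(T), rng=random.Random(1))
```

The returned pull counts show the difference in behavior: explore-then-commit spends a fixed budget on every arm, while UCB shifts pulls toward the better arm as its confidence intervals shrink.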

8. Learning From Experts

Imitation Learning with BC

Food for thought: Expert in LQR setting? (Linear regression)

Pipeline: Dataset $\mathcal D = (x_i, y_i)_{i=1}^M$ $\to$ Supervised Learning $\to$ Policy $\pi$

8. Learning From Experts

Imitation Learning with DAgger

Food for thought: Expert in LQR setting? (Linear regression)

Loop: Dataset $\mathcal D = (x_i, y_i)_{i=1}^M$ $\to$ Supervised Learning $\to$ Policy $\pi$ $\to$ Execute (visit $s_0, s_1, s_2, \dots$) $\to$ Query Expert ($\pi^*(s_0), \pi^*(s_1), \dots$) $\to$ Aggregate $(x_i = s_i, y_i = \pi^*(s_i))$ into $\mathcal D$
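A toy sketch of the execute, query, aggregate, and refit loop. Everything concrete here is hypothetical: the five-state chain, the slip probability, the expert that always moves right, and the per-state majority-vote learner that defaults to moving left in unseen states.

```python
import random

N_STATES = 5


def expert(s):
    """Hypothetical expert policy: always move right."""
    return +1


def step(s, a, rng, slip=0.3):
    """Hypothetical dynamics: the move is reversed with probability `slip`."""
    if rng.random() < slip:
        a = -a
    return min(max(s + a, 0), N_STATES - 1)


def fit(dataset):
    """Supervised learning by per-state majority vote; unseen states default to -1."""
    votes = {}
    for s, a in dataset:
        votes[s] = votes.get(s, 0) + a
    return lambda s: +1 if votes.get(s, -1) > 0 else -1


def rollout(policy, rng, horizon=20, s0=2):
    states, s = [], s0
    for _ in range(horizon):
        states.append(s)
        s = step(s, policy(s), rng)
    return states


def dagger(num_iters=10, seed=0):
    rng = random.Random(seed)
    # Initial dataset from expert demonstrations (this alone is behavior cloning).
    dataset = [(s, expert(s)) for s in rollout(expert, rng)]
    policy = fit(dataset)
    for _ in range(num_iters):
        visited = rollout(policy, rng)                   # execute current policy
        dataset += [(s, expert(s)) for s in visited]     # query expert, aggregate
        policy = fit(dataset)                            # supervised learning
    return policy, dataset


policy, dataset = dagger()
```

The key difference from behavior cloning is that labels are collected on states the learner itself visits, so mistakes in states the expert rarely reaches can be corrected.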

BC vs. DAgger

BC: Supervised learning guarantee

$\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon$

Performance Guarantee

$V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}$

DAgger: Online learning guarantee

$\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon$

Performance Guarantee

$V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon$

8. Learning From Experts

Inverse RL: Principle of maximum entropy
Soft-VI (entropy weighted) replaces $\max$ with softmax
Max-Ent IRL: For $k=0,\dots,K-1$:
1. $\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)$
2. $w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])$
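A sketch of the soft value iteration step, where the $\max$ over actions is replaced by a log-sum-exp and the policy is the corresponding softmax. This is a discounted infinite-horizon variant assumed here for illustration (the version used in lecture may differ, for example in horizon or entropy weighting), and the reward and transition tables are hypothetical.

```python
import math

# Hypothetical tabular MDP: reward[s][a] stands in for w^T phi(s, a).
reward = [[0.0, -1.0], [1.0, 0.0]]
P = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [0.0, 1.0]],
]
gamma = 0.9
S, A = 2, 2


def soft_max_value(q_row):
    """log sum_a exp(Q(s, a)), the soft replacement for max_a Q(s, a)."""
    return math.log(sum(math.exp(q) for q in q_row))


def soft_value_iteration(reward, num_iters=500):
    Q = [[0.0] * A for _ in range(S)]
    for _ in range(num_iters):
        V = [soft_max_value(Q[s]) for s in range(S)]
        Q = [[reward[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in range(S))
              for a in range(A)] for s in range(S)]
    V = [soft_max_value(Q[s]) for s in range(S)]
    # Softmax policy: pi(a | s) = exp(Q(s, a) - V(s))
    pi = [[math.exp(Q[s][a] - V[s]) for a in range(A)] for s in range(S)]
    return Q, V, pi


Q_soft, V_soft, pi_soft = soft_value_iteration(reward)
```

The resulting policy is stochastic and puts positive probability on every action, with higher probability on actions with larger soft Q values.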
Lagrange Formulation to constrained optimization
- Max-Ent problem: maximize $\mathsf{Ent}(\pi)$ s.t. $\pi$ consistent with expert data
- Constrained form: $x^* =\arg \min_x~~f(x)~~\text{s.t.}~~g(x)=0$
- Lagrangian form: $\displaystyle x^* =\arg \min_x \max_{w} ~~f(x)+w\cdot g(x)$
- Solve iteratively or via $\nabla \mathcal L(x,w) = 0$, where $\mathcal L(x,w) = f(x)+w\cdot g(x)$

Proof Strategies

Add and subtract: $$ \|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\| $$
Basic Inequalities (HW0) $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$

Test-taking Strategies

Move on if stuck!
Write explanations and show steps for partial credit
Multipart questions: can be done mostly independently
- ex: 1) show $\|x_{t+1}\|\leq \gamma \|x_t\|$
  2) give a bound on $\|x_t\|$ in terms of $\|x_0\|$

Prelim Summary

Problem 1: Approximate Policy Evaluation
- Similar to PE proof from lecture with $V$
Problem 2: Optimal Machine Repair
- Similar to Gardening HW problem
Problem 3: State distributions
- Use proof techniques from review lecture
- Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
Problem 4: Value of Linear Policy
- Use the finite-horizon Bellman Expectation Equation (not the Bellman Optimality Equation), or the unrolled expression for linear dynamics