CS 4/5789: Introduction to Reinforcement Learning

Lecture 28

Prof. Sarah Dean

MW 2:45-4pm
110 Hollister Hall

Agenda

 

0. Announcements

1. Review

2. Questions

Announcements

 

HW4 due tonight

 

5789 Paper Review Assignment due Friday

 

Course evaluations

Final Monday 5/16 at 7pm in Statler Hall 196

 

Closed-book, definition/equation sheet provided

 

Focus: Units 1-4

Study Materials: Lecture Notes, HWs, Prelim, Review Slides

Final Exam

Outline:

  1. MDP Definitions
  2. Policies and Distributions
  3. Value and Q function
  4. Optimal Policies
  5. Linear Optimal Control
  6. Learned Models, Values, Policies
  7. Exploration
  8. Learning from Experts

Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

1. MDP Definitions

  • \(\mathcal{S}\) states, \(\mathcal{A}\) actions
  • \(r\) map from state, action to scalar reward
  • \(P\) transition probability to next state given current state and action (Markov assumption)
  • \(\gamma\) discount factor

Finite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)

  • \(\mathcal{S},\mathcal{A},r,P\) same
  • \(H\) horizon
  • \(\mu_0\) initial distribution

ex - Pac-Man as MDP

1. MDP Definitions

Optimal Control Problem

  • continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
  • Cost instead of reward
  • transitions \(P\) described in terms of dynamics function and disturbance \(w\sim \mathcal D\)
                                 \(s'= f(s, a, w)\)

ex - UAV as OCP


2. Policies and Distributions

  • Policy \(\pi\) chooses an action based on the current state so \(a_t=a\) with probability \(\pi(a|s_t)\)
    • Shorthand for deterministic policy: \(a_t=\pi(s_t)\)


Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)


2. Policies and Distributions


  • Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
  • Probability of \((s, a)\) at \(t\) (marginalize over earlier states and actions) $$ \mathbb{P}^\pi_t(s, a ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s, a_t = a) $$
  • Discounted "steady-state" distribution $$ d^\pi_{\mu_0}(s, a) = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t \mathbb{P}^\pi_t(s, a; \mu_0) $$
    • Finite horizon: \(d^\pi_{\mu_0}(s, a) =\frac{1}{H}\sum_{t=0}^{H-1} \mathbb{P}^\pi_t(s, a; \mu_0) \)
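As a concrete illustration, here is a minimal tabular sketch of these distributions; the array names `P[s,a,s']`, `pi[s,a]`, and `mu0` are hypothetical, and the infinite discounted sum is truncated at `T` steps.

```python
import numpy as np

def discounted_sa_distribution(P, pi, mu0, gamma, T=1000):
    """Approximate d^pi_{mu0}(s,a) = (1-gamma) sum_t gamma^t P^pi_t(s,a), truncated at T.

    P:   (S, A, S) array, P[s, a, s'] = transition probability
    pi:  (S, A) array, pi[s, a] = probability of action a in state s
    mu0: (S,) initial state distribution
    """
    S, A, _ = P.shape
    d = np.zeros((S, A))
    state_dist = mu0.copy()                          # marginal over states at time t
    for t in range(T):
        sa_dist = state_dist[:, None] * pi           # P^pi_t(s, a)
        d += (gamma ** t) * sa_dist
        state_dist = np.einsum("sa,sap->p", sa_dist, P)   # propagate one step
    return (1 - gamma) * d
```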

2. Policies and Distributions


Food for thought:

  • How do these distributions change under two different transition models \(P\) and \(\widehat P\) (Simulation Lemma) or two different policies (PDL, Prelim, HW2)?
  • How to write the distribution \(\mathbb{P}^\pi_t\) in terms of \(\mathbb{P}^\pi_{t-1}\)?

3. Value and Q function

  • Evaluate policy by cumulative reward
    • \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s]\)
    • \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r_t | s_0=s, a_0=a]\)
  • For finite horizon, for \(t=0,...H-1\),
    • \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s]\)
    • \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r_k | s_t=s, a_t=a]\)

3. Value and Q function

Recursive Bellman Expectation Equation:

  • Discounted Infinite Horizon
    •  \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
    • \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
  • Finite Horizon,  for \(t=0,\dots H-1\),
    • \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
    • \(Q^{\pi}_t(s, a) =  r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)


Recall: Gardening MDP HW problem, Prelim

3. Value and Q function

  • Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P^{\pi} V^\pi\)
    • Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P^{\pi} )^{-1}R^{\pi}\)
    • Iterative Policy Evaluation: \(V^{\pi}_{t+1} = R^{\pi} + \gamma P^{\pi} V^\pi_t\)
  • Backwards-Iterative computation in finite horizon:
    • Initialize \(V^{\pi}_H = 0\)
    • For \(t=H-1, H-2, ... 0\)
      • \(V^{\pi}_t = R^{\pi} +P^{\pi} V^\pi_{t+1}\)
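A minimal sketch of the two infinite-horizon evaluation routines above, assuming tabular arrays `R_pi` (expected reward per state under \(\pi\)) and `P_pi` (state-to-state transition matrix under \(\pi\)); the names are placeholders.

```python
import numpy as np

def exact_policy_eval(R_pi, P_pi, gamma):
    """Solve V^pi = R^pi + gamma P^pi V^pi directly: V^pi = (I - gamma P^pi)^{-1} R^pi."""
    S = R_pi.shape[0]
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def iterative_policy_eval(R_pi, P_pi, gamma, iters=500):
    """Fixed-point iteration V_{t+1} = R^pi + gamma P^pi V_t (a gamma-contraction)."""
    V = np.zeros_like(R_pi)
    for _ in range(iters):
        V = R_pi + gamma * P_pi @ V
    return V
```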


4. Optimal Policies

  • An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
  • Equivalent condition: Bellman Optimality
    • \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
    • \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
  • Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)

Recall: Gardening MDP, Prelim (verifying optimality)

4. Optimal Policies

  • \(0\) cost for \(a_0\)
  • \(2\epsilon\) cost for \(a_1\)
  • \(\epsilon\) reward in \(s_0\)
  • \(1\) reward in \(s_1\)
  • \(\gamma\) discount

Food for thought: rigorous argument for optimal policy?

4. Optimal Policies

  • Finite horizon, for \(t=0,\dots H-1\),
    • \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
    • \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
  • Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
  • Can directly solve with Dynamic Programming
    • Iterate backwards in time from \(V^*_{H}=0\)

4. Optimal Policies

  • Infinite horizon: algorithms for recursion in the Bellman Optimality equation
  • Value Iteration
    • Initialize \(Q^0\). For \(t=0,1,\dots\),
      • \(Q^{t+1}(s,a) =r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^{t}(s', a') \right]\)
  • Policy Iteration
    • Initialize \(\pi^0\). For \(t=0,1,\dots\),
      • \(Q^{t}= \) PolicyEval(\(\pi^t\))
      • \(\pi^{t+1}(s) = \argmax_{a\in\mathcal A} Q^{t}(s,a)\)
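A minimal tabular sketch of both algorithms, assuming a reward array `r[s,a]` and transition array `P[s,a,s']` (hypothetical names); Policy Iteration uses exact policy evaluation as its PolicyEval step.

```python
import numpy as np

def value_iteration(r, P, gamma, iters=1000):
    """Q^{t+1}(s,a) = r(s,a) + gamma * E_{s'}[max_{a'} Q^t(s',a')]."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

def policy_iteration(r, P, gamma, iters=50):
    """Alternate exact policy evaluation with greedy improvement."""
    S, A, _ = P.shape
    pi = np.zeros(S, dtype=int)
    for _ in range(iters):
        P_pi = P[np.arange(S), pi]               # (S, S) transitions under pi
        R_pi = r[np.arange(S), pi]               # (S,) rewards under pi
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # evaluate
        Q = r + gamma * P @ V                    # Q^pi(s,a)
        pi = Q.argmax(axis=1)                    # greedy improvement
    return pi, Q
```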

4. Optimal Policies

  • Value Iteration
    • Fixed point iteration (like Iterative Policy Evaluation) from Bellman Q Optimality
    • Contraction in Q: \(\|Q^{t+1} - Q^*\|_\infty \leq \gamma \|Q^t - Q^*\|_\infty\)
  • Policy Iteration
    • Monotone Improvement: \(Q^{t+1}(s,a) \geq Q^{t}(s,a)\)
    • Contraction in V: \(\|V^{t+1} - V^*\|_\infty \leq \gamma \|V^t - V^*\|_\infty\)

5. Linear Optimal Control

  • Linear Dynamics: $$s_{t+1} = A s_t + Ba_t + w_t,\quad w_t\sim \mathcal N(0,\sigma^2 I)$$
  • Unrolled dynamics $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k (Ba_{t-k-1} + w_{t-k-1})$$
  • Stability of uncontrolled \(s_{t+1}=As_t\): determined by whether \(\rho(A)<1\)
  • Finite Horizon LQR: Application of Dynamic Programming
    • Basis for approximation-based algorithms (local linearization and iLQR)
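A sketch of the finite-horizon LQR dynamic program via the backward Riccati recursion, assuming a quadratic cost \(\sum_t s_t^\top Q s_t + a_t^\top R a_t\); the matrix names `A, B, Q, R` are placeholders, and the additive Gaussian noise does not change the optimal gains.

```python
import numpy as np

def finite_horizon_lqr(A, B, Q, R, H):
    """Backward Riccati recursion for cost sum_t s_t^T Q s_t + a_t^T R a_t.

    Returns gains such that the optimal policy is a_t = K_t s_t.
    """
    P = Q.copy()                                  # cost-to-go matrix, assuming terminal cost s^T Q s
    gains = []
    for _ in range(H):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A + A.T @ P @ B @ K     # Riccati update
        gains.append(K)
    return gains[::-1]                            # gains[t] is K_t
```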

Recall: Prelim question on linear policy \(a_t = K s_t\)

6. Learning from data

  • What do we want to learn? \(\mathcal M = \{\mathcal S,\mathcal A,P,r,\gamma\}\)
    • Unknown transitions \(P(s'|s,a)\) or reward function \(r(s,a)\)
    • Value/Q function
      • of policy \(V^\pi(s)\) or \(Q^\pi(s,a)\)
      • optimal \(V^*(s)\) or \(Q^*(s,a)\)
    • Optimal Policy \(\pi^*(s)\)
  • Given a dataset with features \(x_i\) and labels \(y_i\)
  • Fitting a model:
    • Via counting: \(\widehat f(x) = \sum_{i=1}^N y_i \mathbf 1\{x=x_i\} / \sum_{i=1}^N\mathbf 1\{x=x_i\}  \)
    • Function approx: \(\widehat f = \arg\min_{f\in\mathcal F} \frac{1}{N} \sum_{i=1}^N (f(x_i)-y_i)^2   \)

Model-Based RL

  • Features are \((s,a)\) and label is \(s'\)
  • Tabular setting: \(\widehat P\) via counting
  • Simulation Lemma
    • Translate error in \(\widehat P\) vs \(P\) into difference in performance \(\widehat V\) vs \(V\)

6. Learning Models
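A minimal sketch of the counting-based estimate of \(\widehat P\) in the tabular setting; the dataset format of `(s, a, s')` tuples and the uniform fallback for unvisited pairs are assumptions.

```python
import numpy as np

def estimate_transitions(transitions, S, A):
    """Empirical model: P_hat(s'|s,a) = count(s,a,s') / count(s,a)."""
    counts = np.zeros((S, A, S))
    for s, a, s_next in transitions:              # transitions: list of (s, a, s') tuples
        counts[s, a, s_next] += 1
    totals = counts.sum(axis=2, keepdims=True)
    # leave unvisited (s,a) pairs uniform to avoid dividing by zero
    P_hat = np.where(totals > 0, counts / np.maximum(totals, 1), 1.0 / S)
    return P_hat
```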

  • Features are \((s_i,a_i)\)
    • \((s_i,a_i)  = (s_{h_1}, a_{h_1}) \sim d^\pi_{\mu_0}\)
  • Labels constructed as:
    • Rollout based (MC): \(y_i = \sum_{t=h_1}^{h_1+h_2} r_t\)
    • Bellman Exp based (TD): \(y_t =r_t + \gamma \widehat Q(s_{t+1},a_{t+1}) \)
    • Bellman Opt based (TD): \(y_t =r_t + \gamma \max_a \widehat Q(s_{t+1},a) \)
  • On vs. off policy (Recall HW)
  • \(\widehat Q =\arg\min \frac{1}{N}\sum_{i=1}^N (Q(s_i,a_i)-y_i)^2\)
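A minimal sketch of the three label constructions above for a tabular \(\widehat Q\); the function and array names are hypothetical.

```python
import numpy as np

def mc_label(rewards):
    """Rollout-based (MC) label: sum of rewards over the sampled segment."""
    return float(np.sum(rewards))

def td_expectation_label(r, s_next, a_next, Q_hat, gamma):
    """Bellman-expectation (TD) label: y = r + gamma * Q_hat(s', a'), with a' ~ pi(s')."""
    return r + gamma * Q_hat[s_next, a_next]

def td_optimality_label(r, s_next, Q_hat, gamma):
    """Bellman-optimality (TD) label: y = r + gamma * max_a Q_hat(s', a)."""
    return r + gamma * Q_hat[s_next].max()
```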

6. Learning Value/Q

Diagram: rollout sampling with \(h_1=h\) w.p. \(\propto \gamma^h\); \(a_t\sim \pi(s_t)\), \(r_t\sim r(s_t, a_t)\), \(s_{t+1}\sim P(s_t, a_t)\), \(a_{t+1}\sim \pi(s_{t+1})\)

  • Approximate Dynamic Programming
    For \(t=0,1...\):
    1. \(\widehat Q^t = \mathsf{SampleEval}(\pi^t)\)
    2. \(\pi^{t+1} = \mathsf{Improvement}(\widehat Q^t)\)
  • Approximate Policy Iteration
    • Greedy improvement, could oscillate
  • Conservative Policy Iteration
    • Incremental improvement
  • Performance Difference Lemma

6. Learning Value/Q

6. Policy Optimization

  • \(J(\theta)=\) expected cumulative reward under policy \(\pi_\theta\)
  • Estimate \(\nabla_\theta J(\theta)\) via rollouts \(\tau\), observed reward \(R(\tau)\)
    • Random Search: \(\theta \pm \delta v\) , \(g=\frac{1}{2\delta}(R(\tau_+) - R(\tau_-))v\)
    • REINFORCE: \(g=\sum_{t=0}^\infty \nabla_\theta \log \pi_\theta(a_t|s_t) R(\tau)\)
    • Actor-Critic: \(s,a\sim d^{\pi_\theta}_{\mu_0}\) ,
      \(g=\frac{1}{1-\gamma} \nabla_\theta \log \pi_\theta(a|s) (Q^{\pi_\theta}(s,a)-b(s)) \)
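A minimal REINFORCE-style sketch of the estimator above for a tabular softmax policy; the logit parameterization and `(s, a, r)` trajectory format are assumptions, not the course's exact setup.

```python
import numpy as np

def softmax_policy(theta, s):
    """theta: (S, A) logits; returns action probabilities pi_theta(.|s)."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_gradient(theta, trajectory):
    """g = sum_t grad_theta log pi_theta(a_t|s_t) * R(tau) for one rollout.

    trajectory: list of (s_t, a_t, r_t) tuples.
    """
    g = np.zeros_like(theta)
    R_tau = sum(r for _, _, r in trajectory)      # total reward of the rollout
    for s, a, _ in trajectory:
        p = softmax_policy(theta, s)
        grad_log = -p                             # gradient of log-softmax w.r.t. logits
        grad_log[a] += 1.0
        g[s] += grad_log * R_tau
    return g
```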

Food for thought: how to compute off-policy gradient estimate?

Recap

Derivative Free Optimization: Random Search

\(\nabla J(\theta) \approx \frac{1}{2\delta} (J(\theta+\delta v) - J(\theta-\delta v))\,v\)

Example: \(J(\theta) = -\theta^2 - 1\)

Recap

Derivative Free Optimization: Sampling

\(J(\theta) = \mathbb E_{x\sim P_\theta}[h(x)]\), so \(\nabla J(\theta) \approx \nabla_\theta \log(P_\theta(x))\, h(x)\)

Example: \(P_\theta = \mathcal N(\theta, 1)\) and \(h(x) = -x^2\), so \(J(\theta) = \mathbb E_{x\sim\mathcal N(\theta, 1)}[-x^2]\) and \(\nabla_\theta \log P_\theta(x)\, h(x) = (x-\theta)h(x)\)

6. Policy Optimization

  • Policy Gradient Meta-Algorithm
    for \(t=0,1,...\)
    1. collect rollouts using \(\theta_t\)
    2. estimate gradient with \(g_t\)
    3. \(\theta_{t+1} = \theta_t + \alpha g_t\)
  • Trust regions and Natural PG (update sketched below)
    • Trust region: \( \max ~J(\theta)\) s.t. \(d_{KL}(\theta, \theta_0)\leq \delta \)
    • Local approximation: \( \max ~\nabla J(\theta_0)^\top(\theta-\theta_0)\) s.t. \((\theta-\theta_0)^\top F_{\theta_0} (\theta-\theta_0) \leq \delta\)
    • Natural PG update: \(\theta_{t+1} = \theta_t + \alpha  F^{-1}_{t} g_t\)
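A sketch of the natural PG update, estimating the Fisher matrix from sampled score vectors \(\nabla_\theta \log \pi_\theta(a|s)\); the regularizer and flattening convention are implementation choices, not from the slides.

```python
import numpy as np

def natural_pg_step(theta, g, score_vectors, alpha, reg=1e-3):
    """theta <- theta + alpha * F^{-1} g, with F estimated as the empirical
    second moment of sampled score vectors grad_theta log pi_theta(a|s)."""
    scores = np.stack([s.ravel() for s in score_vectors])    # (N, d)
    F = scores.T @ scores / len(score_vectors)                # empirical Fisher matrix
    step = np.linalg.solve(F + reg * np.eye(F.shape[0]), g.ravel())
    return theta + alpha * step.reshape(theta.shape)
```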

7. Exploration

  • Multi-Armed and Contextual Bandits: MDP with no transitions!
  • Regret: $$ R(T) = \mathbb E\left[\sum_{t=1}^T r(x_t, \pi^*(x_t))-r(x_t, a_t)  \right] = \sum_{t=1}^T \mathbb E[\mu^*(x_t) - \mu_{a_t}(x_t)]$$
  • Explore-then-commit, UCB, LinUCB
    • \( \arg\max_a \widehat \mu_a\)   vs   \( \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)

Food for thought: performance/regret of softmax policy?

Recap

Explore-then-Commit

  1. Pull each arm \(N\) times and compute empirical mean \(\widehat \mu_a\)
  2. For \(t=NK+1,...,T\):
        Pull \(\widehat a^* = \arg\max_a \widehat \mu_a\)

Upper Confidence Bound

For \(t=1,...,T\):

  • Pull \( a_t = \arg\max_a \widehat \mu_t^a + \sqrt{C/N_t^a}\)
  • Update empirical means \(\widehat \mu_t^a\) and counts \(N_t^a\)

Setting exploration \(N \approx T^{2/3}\), Explore-then-Commit achieves \(R(T) \lesssim T^{2/3}\); UCB achieves \(R(T) \lesssim \sqrt{T}\)
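A minimal sketch of the UCB rule above; the `pull` callable and constant `C` are placeholders.

```python
import numpy as np

def ucb(pull, K, T, C=2.0):
    """At each round pull argmax_a mu_hat[a] + sqrt(C / N[a]), then update counts."""
    mu_hat = np.zeros(K)
    N = np.zeros(K)
    for t in range(T):
        if t < K:
            a = t                                  # pull each arm once to initialize
        else:
            a = np.argmax(mu_hat + np.sqrt(C / N))
        r = pull(a)                                # sample a reward from arm a
        N[a] += 1
        mu_hat[a] += (r - mu_hat[a]) / N[a]        # running mean update
    return mu_hat, N
```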

8. Learning From Experts

Imitation Learning with BC

Food for thought: Expert in LQR setting? (Linear regression)

Diagram: Dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) → Supervised Learning → Policy \(\pi\)

8. Learning From Experts

Imitation Learning with DAgger

Food for thought: Expert in LQR setting? (Linear regression)

Diagram: Dataset \(\mathcal D = (x_i, y_i)_{i=1}^M\) → Supervised Learning → Policy \(\pi\) → Execute to collect \(s_0, s_1, s_2,\dots\) → Query Expert \(\pi^*(s_0), \pi^*(s_1),\dots\) → Aggregate \((x_i = s_i, y_i = \pi^*(s_i))\) into the dataset
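A schematic sketch of the DAgger loop above; `train`, `rollout`, and `expert` are hypothetical stand-ins for supervised learning, executing a policy to collect states, and querying \(\pi^*\).

```python
def dagger(train, rollout, expert, n_iters):
    """Iteratively aggregate expert labels on states visited by the learner's policy."""
    dataset = []
    policy = None
    for _ in range(n_iters):
        if policy is None:
            states = rollout(expert)                 # first iteration: behavior cloning data
        else:
            states = rollout(policy)                 # execute current policy
        dataset += [(s, expert(s)) for s in states]  # query expert, aggregate
        policy = train(dataset)                      # supervised learning on aggregate
    return policy
```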

BC vs. DAgger

  • BC: supervised learning guarantee \(\mathbb E_{s\sim d^{\pi^*}_\mu}[\mathbf 1\{\widehat \pi(s) \neq \pi^*(s)\}]\leq \epsilon\) implies the performance guarantee \(V_\mu^{\pi^*} - V_\mu^{\widehat \pi} \leq \frac{2\epsilon}{(1-\gamma)^2}\)
  • DAgger: online learning guarantee \(\mathbb E_{s\sim d^{\pi^t}_\mu}[\mathbf 1\{ \pi^t(s) \neq \pi^*(s)\}]\leq \epsilon\) implies the performance guarantee \(V_\mu^{\pi^*} - V_\mu^{\pi^t} \leq \frac{\max_{s,a}|A^{\pi^*}(s,a)|}{1-\gamma}\epsilon\)

8. Learning From Experts

  • Inverse RL: Principle of maximum entropy
    • maximize \(\mathsf{Ent}(\pi)\) s.t. \(\pi\) consistent with expert data
  • Soft-VI (entropy weighted) replaces \(\max\) with softmax (sketch below)
  • Max-Ent IRL: For \(k=0,\dots,K-1\):
    1. \(\pi^k = \mathsf{SoftVI}(w_k^\top \varphi)\)
    2. \(w_{k+1} = w_k + \eta (\mathbb E_{d^{\pi^*}_\mu}[\varphi (s,a)] - \mathbb E_{d^{\pi^k}_\mu}[\varphi (s,a)])\)
  • Lagrange formulation of constrained optimization:
    • \(x^* =\arg \min~f(x)~\text{s.t.}~g(x)=0\) becomes \(\displaystyle x^* =\arg \min_x \max_{w} ~f(x)+w\cdot g(x)\)
    • Solve iteratively or via \(\nabla \mathcal L(x,w) = 0\)
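A tabular sketch of Soft-VI, replacing the hard \(\max\) over actions with a log-sum-exp (softmax) backup; the array shapes and unit temperature are assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(r, P, gamma, iters=500):
    """Entropy-regularized backup: Q(s,a) = r(s,a) + gamma E[V(s')], V(s) = logsumexp_a Q(s,a)."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = logsumexp(Q, axis=1)                  # soft max over actions
        Q = r + gamma * P @ V
    # the soft-optimal policy is a softmax over Q
    pi = np.exp(Q - logsumexp(Q, axis=1, keepdims=True))
    return Q, pi
```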

Proof Strategies

  1. Add and subtract: $$ \|f(x) - g(y)\| \leq  \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
  2. Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
  3. Additive induction $$ \|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|  $$
  4. Basic Inequalities (HW0) $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$

Test-taking Strategies

  1. Move on if stuck!
  2. Write explanations and show steps for partial credit
  3. Multipart questions: can be done mostly independently
    • ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\); 2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)

Prelim Summary

  1. Problem 1: Approximate Policy Evaluation
    • Similar to PE proof from lecture with \(V\)
  2. Problem 2: Optimal Machine Repair
    • Similar to Gardening HW problem
  3. Problem 3: State distributions
    • Use proof techniques from review lecture
    • Induction does not prove 3.2 (use 3.1, 3.2, & induction for 3.3)
  4. Problem 4: Value of Linear Policy
    • Finite horizon Bellman Expectation Equation not Bellman Optimality Equation, or unrolled expression for linear dynamics

CS 4/5789: Lecture 28

By Sarah Dean
