CS 4/5789: Introduction to Reinforcement Learning

Prelim 1 Review Session

 

MW 2:45-4pm
255 Olin Hall

Prelim on 3/4 in Lecture

  • Prelim Wednesday 3/4
  • During lecture (2:55-4:10pm in 255 Olin)
  • 1 hour exam, closed-book, equation sheet provided
  • Materials:
    • slides (Lectures 1-10)
    • lecture notes (MDP and LQR chapter)
    • PSets 1-3 (solutions posted on Canvas by Friday)
  • Prelim tag on EdStem for questions

Outline:

  1. MDP Definitions
  2. Policies and Distributions
  3. Value and Q function
  4. Optimal Policies
  5. Linear Optimal Control

Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}\)

1. MDP Definitions

  • \(\mathcal{S}\) states, \(\mathcal{A}\) actions
  • \(r\) map from state, action to scalar reward
  • \(P\) transition probability to next state given current state and action (Markov assumption)
  • \(\gamma\) discount factor
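The tuple above can be encoded directly with arrays; here is a minimal sketch using a hypothetical 2-state, 2-action MDP (all numbers are made up for illustration):

```python
import numpy as np

# Hypothetical 2-state, 2-action discounted MDP encoded as arrays.
S, A = 2, 2
gamma = 0.9

# r[s, a]: scalar reward for taking action a in state s
r = np.array([[1.0, 0.0],
              [0.0, 1.0]])

# P[s, a, s']: probability of next state s' given (s, a) (Markov assumption)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])

# each row P[s, a, :] must be a valid distribution over next states
assert np.allclose(P.sum(axis=2), 1.0)
```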

Finite Horizon MDP

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}\)

  • \(\mathcal{S},\mathcal{A},r,P\) same
  • \(H\) horizon

ex - Pac-Man as MDP

1. MDP Definitions

Optimal Control Problem

  • continuous states/actions \(\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}\)
  • Cost instead of reward
  • transitions are deterministic and described in terms of dynamics function
                                 \(s'= f(s, a)\)

ex - UAV as OCP


2. Policies and Distributions

  • A policy \(\pi\) determines how actions are taken
  • Policies can be:
    • deterministic or stochastic $$a_t = \pi(s_t)\quad\text{vs}\quad a_t\sim \pi(s_t)$$
    • state-dependent or history-dependent $$a_t = \pi(s_t)\quad\text{vs}\quad a_t = \pi(s_t, s_{t-1},\dots, s_0) $$
    • stationary or time-dependent $$a_t = \pi(s_t)\quad\text{vs}\quad a_t = \pi_t(s_t) $$
  • We focus on state-dependent policies

2. Policies and Distributions

  • Policy \(\pi\) chooses an action based on the current state so \(a_t=a\) with probability \(\pi(a|s_t)\)
    • Shorthand for deterministic policy: \(a_t=\pi(s_t)\)

examples:

Policy results in a trajectory \(\tau = (s_0, a_0, s_1, a_1, ... )\)


2. Policies and Distributions


  • Probability of trajectory \(\tau =(s_0, a_0, s_1, ... s_t, a_t)\) $$ \mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i) $$
  • Probability of \(s\) at \(t\) (marginalize over all prefixes ending at \(s_t=s\)) $$ \mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s) $$
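The trajectory probability is just the product of the initial distribution, policy, and transition terms; a minimal sketch (the function name and example arrays are made up):

```python
import numpy as np

def traj_prob(mu0, pi, P, states, actions):
    """P_{mu0}^pi(tau) for tau = (s_0, a_0, ..., s_t, a_t).

    mu0[s]: initial state distribution; pi[s, a]: stochastic policy;
    P[s, a, s']: transition probabilities.
    """
    # first factor: mu0(s_0) * pi(a_0 | s_0)
    p = mu0[states[0]] * pi[states[0], actions[0]]
    # remaining factors: P(s_i | s_{i-1}, a_{i-1}) * pi(a_i | s_i)
    for i in range(1, len(states)):
        p *= P[states[i - 1], actions[i - 1], states[i]] * pi[states[i], actions[i]]
    return p

# tiny hand-checkable example: deterministic transition 0 -> 1 under action 0
mu0 = np.array([1.0, 0.0])
pi = np.array([[0.5, 0.5],
               [0.5, 0.5]])            # uniform policy
P = np.array([[[0.0, 1.0], [1.0, 0.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
assert abs(traj_prob(mu0, pi, P, [0, 1], [0, 0]) - 0.25) < 1e-12
```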

2. Policies and Distributions


  • Probability vector of \(s\) at \(t\): \(d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0)  \) evolves as $$ d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi $$ where \(P_\pi\) at row \(s\) and column \(s'\) is \(\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]\)
  • Discounted "steady-state" distribution (PSet 2) $$ d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$

Example: \(\pi(s)=\)stay and \(\mu_0\) is each state with probability \(1/2\).

  • \(d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} \)
  • \(d_1 = \begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}   = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}\)
  • \(d_2 =\begin{bmatrix}1& 1-p_1\\0  & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}\)

State Evolution Example

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

  • How does state distribution change over time?
    • Recall, \(s_{t+1}\sim P(s_t,\pi(s_t))\)
    • i.e. \(s_{t+1} = s'\) with probability \(P(s'|s_t, \pi(s_t))\)
  • Write as a summation over possible \(s_t\):
    • \(\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\}  =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}\)
  • In vector notation:
    • \(d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]\)
    • \(d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle \)
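The update \(d_{t+1} = P_\pi^\top d_t\) is a one-line matrix-vector product; here is a sketch on the two-state "stay" example above, checking the closed form \(d_t = [1 - p_1^t/2,\; p_1^t/2]\) (the value \(p_1 = 0.6\) is an arbitrary choice):

```python
import numpy as np

p1 = 0.6  # "stay" probability in state 1 (arbitrary example value)
P_pi = np.array([[1.0, 0.0],
                 [1.0 - p1, p1]])   # row s, column s'
d = np.array([0.5, 0.5])            # mu_0: uniform over the two states

for t in range(1, 4):
    d = P_pi.T @ d                  # d_{t+1} = P_pi^T d_t
    # closed form from the example: d_t = [1 - p1^t / 2, p1^t / 2]
    assert np.allclose(d, [1 - p1**t / 2, p1**t / 2])
```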

2. Policies and Distributions

Food for thought:

  • Relationship between state distribution and value (PSet 2)


3. Value and Q function

  • Evaluate policy by cumulative reward
    • \(V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]\)
    • \(Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]\)
  • For finite horizon, for \(t=0,\dots,H-1\),
    • \(V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]\)
    • \(Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]\)


3. Value and Q function

Recursive Bellman Expectation Equation:

  • Discounted Infinite Horizon
    •  \(V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]\)
    • \(Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]\)
  • Finite Horizon,  for \(t=0,\dots H-1\),
    • \(V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]\)
    • \(Q^{\pi}_t(s, a) =  r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \)


Recall: Icy navigation (PSet 2, lecture example)

3. Value and Q function

  • Recursive computation: \(V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi\)
    • Exact Policy Evaluation: \(V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}\)
    • Iterative Policy Evaluation: \(V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i\)
      • Converges: fixed point contraction
  • Backwards-Iterative computation in finite horizon:
    • Initialize \(V^{\pi}_H = 0\)
    • For \(t=H-1, H-2, \dots, 0\)
      • \(V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}\)
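Both policy-evaluation routes above fit in a few lines of numpy; a minimal sketch on a hypothetical 2-state chain (function names and numbers are made up), confirming that the iterative scheme converges to the exact solve:

```python
import numpy as np

def exact_policy_eval(R_pi, P_pi, gamma):
    """V^pi = (I - gamma * P_pi)^{-1} R_pi (solve, rather than invert)."""
    S = len(R_pi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

def iterative_policy_eval(R_pi, P_pi, gamma, iters=500):
    """V_{i+1} = R_pi + gamma * P_pi V_i; a contraction with rate gamma."""
    V = np.zeros_like(R_pi)
    for _ in range(iters):
        V = R_pi + gamma * P_pi @ V
    return V

# hypothetical 2-state chain under some fixed policy
R_pi = np.array([1.0, 0.0])
P_pi = np.array([[0.9, 0.1],
                 [0.5, 0.5]])
gamma = 0.9
assert np.allclose(exact_policy_eval(R_pi, P_pi, gamma),
                   iterative_policy_eval(R_pi, P_pi, gamma))
```

After 500 iterations the error is at most \(\gamma^{500}\|V^0 - V^\pi\|_\infty\), which is negligible here.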

4. Optimal Policies

  • An optimal policy \(\pi^*\) is one where \(V^{\pi^*}(s) \geq V^{\pi}(s)\) for all \(s\) and policies \(\pi\)
  • Equivalent condition: Bellman Optimality
    • \(V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]\)
    • \( Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]\)
  • Optimal policy \(\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)\)

Recall: Verifying optimality in Icy Street example

4. Optimal Policies

  • Finite horizon: for \(t=0,\dots H-1\),
    • \(V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]\)
    • \(Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]\)
  • Optimal policy \(\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)\)
  • Solve exactly with Dynamic Programming
    • Iterate backwards in time from \(V^*_{H}=0\)
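The backward dynamic program above can be sketched directly (the tabular layout and function name are my own; `P @ V[t+1]` computes \(\mathbb{E}_{s'}[V^*_{t+1}(s')]\) for every \((s,a)\) at once):

```python
import numpy as np

def finite_horizon_dp(r, P, H):
    """Backward DP: optimal values V[t, s] and greedy policy pi[t, s].

    r[s, a] rewards, P[s, a, s'] transitions, horizon H, no discounting.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))           # V[H] = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = r + P @ V[t + 1]           # Q[s, a] = r(s,a) + E_{s'}[V_{t+1}(s')]
        V[t] = Q.max(axis=1)
        pi[t] = Q.argmax(axis=1)
    return V, pi

# sanity check: single state, reward 1 per step, horizon 3 => V*_0 = 3
V, pi = finite_horizon_dp(np.array([[1.0]]), np.array([[[1.0]]]), H=3)
assert np.isclose(V[0, 0], 3.0)
```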

4. Optimal Policies

  • Infinite horizon: algorithms for recursion in the Bellman Optimality equation
  • Value Iteration
    • Initialize \(V^0\). For \(i=0,1,\dots\),
      • \(V^{i+1}(s) =\max_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right]\)
  • Policy Iteration
    • Initialize \(\pi^0\). For \(i=0,1,\dots\),
      • \(V^{i}= \) PolicyEval(\(\pi^i\))
      • \(\pi^{i+1}(s) = \argmax_{a\in\mathcal A} \left[r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]\right] \)
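Value iteration is one vectorized line per sweep; a sketch on a hypothetical 2-state MDP with deterministic self-loops, where \(V^*(0)=0\) and \(V^*(1)=1/(1-\gamma)\) by inspection:

```python
import numpy as np

def value_iteration(r, P, gamma, iters=1000):
    """V^{i+1}(s) = max_a [ r(s,a) + gamma * E_{s'~P(s,a)} V^i(s') ]."""
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(iters):
        V = (r + gamma * P @ V).max(axis=1)
    return V

# hypothetical MDP: both states absorb; state 1 pays reward 1 every step
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])
P = np.zeros((2, 2, 2))
P[0, :, 0] = 1.0   # state 0 self-loops under every action
P[1, :, 1] = 1.0   # state 1 self-loops under every action
gamma = 0.9
assert np.allclose(value_iteration(r, P, gamma), [0.0, 1.0 / (1.0 - gamma)])
```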

4. Optimal Policies

  • Value Iteration
    • Converges: fixed point contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i - V^*\|_\infty$$
  • Policy Iteration
    • Monotone Improvement: \(V^{i+1}(s) \geq V^{i}(s)\)
    • Contraction: \(\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i- V^*\|_\infty\)
    • Converges to exactly optimal policy in finite time (PSet 3)

Food for thought: For a fixed point contraction by \(\gamma\), how many iterations are necessary to guarantee \(\epsilon\) error?

5. Linear Optimal Control

  • Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
  • Unrolled dynamics (PSet 3) $$ s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
  • Stability of uncontrolled \(s_{t+1}=As_t\):
    • stable if \(\max_i |\lambda_i(A)|< 1\)
    • unstable if \(\max_i |\lambda_i(A)| > 1\)
    • marginally stable if \(\max_i |\lambda_i(A)|= 1\)
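The stability test is a spectral-radius computation; a sketch with hypothetical matrices (triangular, so the eigenvalues can be read off the diagonal):

```python
import numpy as np

def spectral_radius(A):
    """max_i |lambda_i(A)|: < 1 stable, > 1 unstable, = 1 marginal."""
    return np.abs(np.linalg.eigvals(A)).max()

A_stable = np.array([[0.5, 0.1],
                     [0.0, 0.9]])    # eigenvalues 0.5, 0.9
A_unstable = np.array([[1.1, 0.0],
                       [0.0, 0.2]])  # eigenvalues 1.1, 0.2

assert spectral_radius(A_stable) < 1
assert spectral_radius(A_unstable) > 1
```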

ex - UAV


5. Linear Optimal Control

Finite Horizon LQR: Application of Dynamic Programming

  • Initialize \(V^{*}_H(s) = 0\)
  • For \(t=H-1, H-2, \dots, 0\)
    • \(Q^{*}_t(s,a) = c(s,a) + V^*_{t+1}(f(s,a))\)
    • \(\pi^*_t(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)\)
    • \(V^*_t(s) =  Q^{*}_t(s,\pi^*_t(s)) \)

Basis for approximation-based algorithms (local linearization and iLQR)
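Assuming the quadratic cost \(c(s,a)=s^\top Q s + a^\top R a\) with dynamics \(s'=As+Ba\), the DP above reduces to the Riccati recursion, with \(V^*_t(s)=s^\top P_t s\) and \(\pi^*_t(s) = -K_t s\). A minimal sketch (the function name is mine):

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, H):
    """Backward Riccati recursion for cost sum_t s_t^T Q s_t + a_t^T R a_t.

    Returns gains K[t] (optimal a_t = -K[t] @ s_t) and cost-to-go
    matrices P[t] (V*_t(s) = s^T P[t] s), for t = 0, ..., H.
    """
    n = A.shape[0]
    P = np.zeros((n, n))          # V*_H = 0
    Ks, Ps = [], [P]
    for _ in range(H):
        # K_t = (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B K_t
        P = Q + A.T @ P @ A - A.T @ P @ B @ K
        Ks.append(K)
        Ps.append(P)
    Ks.reverse(); Ps.reverse()    # index by time t = 0, ..., H
    return Ks, Ps

# scalar sanity check: A = B = Q = R = 1, H = 2
one = np.array([[1.0]])
Ks, Ps = lqr_finite_horizon(one, one, one, one, H=2)
assert np.allclose(Ps[0], [[1.5]]) and np.allclose(Ks[0], [[0.5]])
```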

Proof Strategies

  1. Add and subtract: $$ \|f(x) - g(y)\| \leq  \|f(x)-f(y)\| +\|f(y)-g(y)\| $$
  2. Contractions (induction) $$ \|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
  3. Basic Inequalities (PSet 1): $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|] $$ $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)| $$ $$ \mathbb E[f(x)] \leq \max f(x) $$
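These inequalities are easy to spot-check numerically; a quick sketch on arbitrary random samples (treating the empirical mean over samples as the expectation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
f, g = np.sin(x), np.cos(x)

# |E[f] - E[g]| <= E[|f - g|]
assert abs(f.mean() - g.mean()) <= np.abs(f - g).mean()
# |max f - max g| <= max |f - g|
assert abs(f.max() - g.max()) <= np.abs(f - g).max()
# E[f] <= max f
assert f.mean() <= f.max()
```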

Test-taking Strategies

  1. Move on if stuck!
  2. Write explanations and show steps for partial credit
  3. Multipart questions: can be done mostly independently
    • ex: 1) show \(\|x_{t+1}\|\leq \gamma \|x_t\|\); 2) give a bound on \(\|x_t\|\) in terms of \(\|x_0\|\)

Sp24 CS 4/5789: Lecture 11

By Sarah Dean
