Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PSet 4 due today (Friday)
• 5789 Paper Reviews due weekly on Mondays
• PA 3/PSet 5 released next week
• My office hours cancelled on Wednesday 3/15 due to Prelim

## Prelim on 3/15 in Lecture

• Prelim Wednesday 3/15
• During lecture (2:45-4pm in 255 Olin)
• 1 hour exam, closed-book, equation sheet provided
• Materials:
• slides (Lectures 1-10, some of 11-13)
• PSets 1-4 (1-3 solutions on Canvas)
• Last-minute conflicts/accommodations? (EdStem)
• Monitoring Prelim tag on EdStem for questions

Outline:

1. MDP Definitions
2. Policies and Distributions
3. Value and Q function
4. Optimal Policies
5. Linear Optimal Control

## Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

## 1. MDP Definitions

• $$\mathcal{S}$$ states, $$\mathcal{A}$$ actions
• $$r$$ map from state, action to scalar reward
• $$P$$ transition probability to next state given current state and action (Markov assumption)
• $$\gamma$$ discount factor

Finite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$$

• $$\mathcal{S},\mathcal{A},r,P$$ same
• $$H$$ horizon
• $$\mu_0$$ initial state distribution

ex - Pac-Man as MDP

## 1. MDP Definitions

Optimal Control Problem

• continuous states/actions $$\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$$
• transitions are deterministic, described by a dynamics function
$$s'= f(s, a)$$

ex - UAV as OCP

## 2. Policies and Distributions

• Policy $$\pi$$ chooses an action based on the current state so $$a_t=a$$ with probability $$\pi(a|s_t)$$
• Shorthand for deterministic policy: $$a_t=\pi(s_t)$$

examples:

Policy results in a trajectory $$\tau = (s_0, a_0, s_1, a_1, ... )$$


## 2. Policies and Distributions


• Probability of trajectory $$\tau =(s_0, a_0, s_1, \dots, s_t, a_t)$$: $$\mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• Probability of $$s$$ at $$t$$ (marginalize over histories ending at $$s_t=s$$): $$\mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s)$$
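To make the trajectory distribution concrete, here is a minimal Python/numpy sketch (the two-state MDP, its numbers, and the function names are illustrative assumptions, not from the lecture) that samples a trajectory and evaluates the product formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action tabular MDP (illustrative numbers only).
P = np.array([[[1.0, 0.0], [0.5, 0.5]],   # P[s, a, s'] = P(s' | s, a)
              [[0.3, 0.7], [0.0, 1.0]]])
pi = np.array([[0.9, 0.1],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
mu0 = np.array([0.5, 0.5])                # initial state distribution mu_0

def sample_trajectory(T):
    """Sample tau = (s_0, a_0, ..., s_T, a_T)."""
    s = rng.choice(2, p=mu0)
    tau = []
    for _ in range(T + 1):
        a = rng.choice(2, p=pi[s])
        tau.append((s, a))
        s = rng.choice(2, p=P[s, a])
    return tau

def trajectory_prob(tau):
    """mu0(s0) pi(a0|s0) * prod_{i>=1} P(s_i | s_{i-1}, a_{i-1}) pi(a_i | s_i)."""
    s0, a0 = tau[0]
    p = mu0[s0] * pi[s0, a0]
    for (sp, ap), (s, a) in zip(tau, tau[1:]):
        p *= P[sp, ap, s] * pi[s, a]
    return p

tau = sample_trajectory(T=3)
print(tau, trajectory_prob(tau))
```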

## 2. Policies and Distributions


• Probability vector of $$s$$ at $$t$$: $$d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0)$$ evolves as $$d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi$$ where $$P_\pi$$ at row $$s$$ and column $$s'$$ is $$\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]$$
• Discounted "steady-state" distribution (PSet 2) $$d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$

$$d^\pi_{\mu_0} = (1-\gamma)\left(d_{\mu_0,0}^\pi + \gamma\, d_{\mu_0,1}^\pi + \gamma^2 d_{\mu_0,2}^\pi + \dots\right)$$

## State Evolution Example

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.

[Figure: two-state chain; state $$0$$ self-loops with probability $$1$$, state $$1$$ stays with probability $$p_1$$ and moves to state $$0$$ with probability $$1-p_1$$.]

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

• $$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}$$
• $$d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}$$
• $$d_2 = P_\pi^\top d_1 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

## State Distribution Transition

• How does state distribution change over time?
• Recall, $$s_{t+1}\sim P(s_t,\pi(s_t))$$
• i.e. $$s_{t+1} = s'$$ with probability $$P(s'|s_t, \pi(s_t))$$
• Write as a summation over possible $$s_t$$:
• $$\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}$$
• In vector notation:
• $$d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]$$
• $$d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle$$
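As a sanity check, here is a short numpy sketch (the values of $$p_1$$, $$\gamma$$, and the variable names are my choices) that iterates $$d_{t+1}=P_\pi^\top d_t$$ for the two-state example above and computes the discounted steady-state distribution in closed form:

```python
import numpy as np

p1, gamma = 0.9, 0.95
P_pi = np.array([[1.0,    0.0],   # row s, column s': P(s' | s, pi(s))
                 [1 - p1, p1 ]])
d = np.array([0.5, 0.5])          # mu_0: each state with probability 1/2

# Evolve d_{t+1} = P_pi^T d_t; matches the closed form [1 - p1^t/2, p1^t/2].
for t in range(1, 4):
    d = P_pi.T @ d
    assert np.allclose(d, [1 - p1**t / 2, p1**t / 2])

# Discounted "steady-state" distribution (1-gamma) sum_t gamma^t d_t,
# via the Neumann series: (1-gamma) (I - gamma P_pi^T)^{-1} d_0.
d0 = np.array([0.5, 0.5])
d_disc = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, d0)
print(d_disc, d_disc.sum())       # a valid distribution: sums to 1
```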

## 2. Policies and Distributions

Food for thought:

• How are these distributions different when:
• Transitions are different (Simulation Lemma)
• Policies are different (Performance Difference Lemma)
• Initial states are different


## 3. Value and Q function

• Evaluate policy by cumulative reward
• $$V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]$$
• $$Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]$$
• For finite horizon, for $$t=0,...H-1$$,
• $$V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]$$
• $$Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]$$

examples:

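For instance, a hedged Monte Carlo sketch of the discounted value definition above (the random tabular MDP, the rollout count, and the truncation horizon `T` are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]
pi = rng.dirichlet(np.ones(A), size=S)        # pi[s, a] = pi(a | s)

def mc_value(s0, n_rollouts=2000, T=100):
    """Estimate V^pi(s0) = E[sum_t gamma^t r(s_t,a_t) | s_0=s0] by rollouts,
    truncating the infinite sum at T (error <= gamma^T max|r| / (1-gamma))."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(T):
            a = rng.choice(A, p=pi[s])
            ret += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_rollouts

print(mc_value(0))
```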

## 3. Value and Q function

Recursive Bellman Expectation Equation:

• Discounted Infinite Horizon
•  $$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$
• $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$$
• Finite Horizon,  for $$t=0,\dots H-1$$,
• $$V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$$
• $$Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')]$$


Recall: Icy navigation (PSet 2, lecture example)

## 3. Value and Q function

• Recursive computation: $$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$
• Exact Policy Evaluation: $$V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}$$
• Iterative Policy Evaluation: $$V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i$$
• Converges: fixed point contraction
• Backwards-Iterative computation in finite horizon:
• Initialize $$V^{\pi}_H = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}$$
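The three procedures above as a minimal numpy sketch (the 3-state chain, reward vector, and iteration counts are made-up assumptions):

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain under a fixed policy pi.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])   # R^pi[s] = E_{a~pi(s)}[r(s, a)]

# Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative policy evaluation: V_{i+1} = R_pi + gamma P_pi V_i (a contraction).
V = np.zeros(3)
for _ in range(500):
    V = R_pi + gamma * P_pi @ V
assert np.allclose(V, V_exact, atol=1e-6)

# Finite-horizon, backwards-iterative (undiscounted): V_H = 0, V_t = R_pi + P_pi V_{t+1}.
H = 10
V_t = np.zeros(3)
for t in reversed(range(H)):
    V_t = R_pi + P_pi @ V_t
print(V_exact, V_t)
```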

## 4. Optimal Policies

• An optimal policy $$\pi^*$$ is one where $$V^{\pi^*}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• Equivalent condition: Bellman Optimality
• $$V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$$
• $$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$$
• Optimal policy $$\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$$

Recall: Verifying optimality in Icy Street example

Food for thought: What does Bellman Optimality imply about advantage function $$A^{\pi^*}(s,a)$$?

## 4. Optimal Policies

• Finite horizon: for $$t=0,\dots H-1$$,
• $$V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$$
• $$Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$$
• Optimal policy $$\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$$
• Solve exactly with Dynamic Programming
• Iterate backwards in time from $$V^*_{H}=0$$
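A sketch of this backwards dynamic program on a random hypothetical tabular MDP (sizes and names are my choices, not from the slides):

```python
import numpy as np

# Hypothetical tabular finite-horizon MDP: S states, A actions, horizon H.
S, A, H = 3, 2, 10
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]

V = np.zeros((H + 1, S))                      # V[H] = 0
pi_star = np.zeros((H, S), dtype=int)
for t in reversed(range(H)):
    Q = r + P @ V[t + 1]                      # Q[s, a] = r(s,a) + E_{s'}[V_{t+1}(s')]
    V[t] = Q.max(axis=1)
    pi_star[t] = Q.argmax(axis=1)             # optimal time-varying policy

print(V[0], pi_star[0])
```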

## 4. Optimal Policies

• Infinite horizon: algorithms for recursion in the Bellman Optimality equation
• Value Iteration
• Initialize $$V^0$$. For $$i=0,1,\dots$$,
• $$V^{i+1}(s) =\max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
• Policy Iteration
• Initialize $$\pi^0$$. For $$i=0,1,\dots$$,
• $$V^{i}=$$ PolicyEval($$\pi^i$$)
• $$\pi^{i+1}(s) = \argmax_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
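Both loops as a numpy sketch (same hypothetical tabular setup as earlier sketches; stopping rules simplified):

```python
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
r = rng.uniform(size=(S, A))

# Value Iteration: V <- max_a [ r + gamma E[V] ]; a gamma-contraction in sup norm.
V = np.zeros(S)
for _ in range(1000):
    V = (r + gamma * P @ V).max(axis=1)

# Policy Iteration: exact evaluation, then greedy improvement.
pi = np.zeros(S, dtype=int)
while True:
    P_pi, R_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # PolicyEval(pi)
    pi_new = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                                 # exact convergence in finite time
    pi = pi_new

assert np.allclose(V, V_pi, atol=1e-4)        # both reach V*
```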

## 4. Optimal Policies

• Value Iteration
• Converges: fixed point contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i - V^*\|_\infty$$
• Policy Iteration
• Monotone Improvement: $$V^{i+1}(s) \geq V^{i}(s)$$
• Contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i- V^*\|_\infty$$
• Converges to exactly optimal policy in finite time (PSet 3)

## 5. Linear Optimal Control

• Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
• Unrolled dynamics (PSet 3) $$s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
• Stability of uncontrolled $$s_{t+1}=As_t$$:
• stable if $$\max_i |\lambda_i(A)|< 1$$
• unstable if $$\max_i |\lambda_i(A)| > 1$$
• marginally unstable if $$\max_i |\lambda_i(A)|= 1$$

ex - UAV

Food for thought: relationship between stability and cumulative cost? (PSet 4)
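A quick numerical check of the stability conditions above (a sketch; the matrices, tolerance, and function name are illustrative):

```python
import numpy as np

def stability(A, tol=1e-9):
    """Classify s_{t+1} = A s_t by the spectral radius max_i |lambda_i(A)|."""
    rho = np.abs(np.linalg.eigvals(A)).max()
    if rho < 1 - tol:
        return "stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally unstable"

print(stability(np.array([[0.9, 0.1], [0.0, 0.5]])))   # stable
print(stability(np.array([[1.0, 1.0], [0.0, 1.0]])))   # marginally unstable
print(stability(np.array([[1.1, 0.0], [0.0, 0.2]])))   # unstable
```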

## 5. Linear Optimal Control

Finite Horizon LQR: Application of Dynamic Programming

• Initialize $$V^{*}_H(s) = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$Q^{*}_t(s,a) = c(s,a) + V^*_{t+1}(f(s,a))$$
• $$\pi_t^*(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)$$
• $$V^*_t(s) = Q^{*}_t(s,\pi_t^*(s))$$
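Specializing the recursion above to quadratic cost $$c(s,a) = s^\top Q s + a^\top R a$$ and linear dynamics $$f(s,a)=As+Ba$$ gives the backwards Riccati recursion; a numpy sketch under those assumptions (the particular $$A, B, Q, R, H$$ are mine):

```python
import numpy as np

# Hypothetical double-integrator-style system.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q, R, H = np.eye(2), np.eye(1), 50

# Backwards Riccati recursion: V*_t(s) = s^T P_t s, starting from P_H = 0.
P = np.zeros((2, 2))
gains = []
for t in reversed(range(H)):
    # Minimizing Q*_t(s,a) = s^T Q s + a^T R a + (As+Ba)^T P (As+Ba) over a
    # gives the linear policy a = K_t s.
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    gains.append(K)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K
gains = gains[::-1]                          # gains[t] is K_t

# Roll out the optimal policy: s_{t+1} = A s_t + B K_t s_t.
s = np.array([1.0, 0.0])
for t in range(H):
    s = A @ s + B @ (gains[t] @ s)
print(s)                                     # state regulated toward the origin
```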

Basis for approximation-based algorithms (local linearization and iLQR)

## Proof Strategies

1. Add and subtract: $$\|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\|$$
2. Contractions (induction) $$\|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
3. Additive induction $$\|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|$$
4. Basic Inequalities (PSet 1):
• $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|]$$
• $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)|$$
• $$\mathbb E[f(x)] \leq \max f(x)$$

## Test-taking Strategies

1. Move on if stuck!
2. Write explanations and show steps for partial credit
3. Multipart questions: can be done mostly independently
• ex: 1) show $$\|x_{t+1}\|\leq \gamma \|x_t\|$$; 2) give a bound on $$\|x_t\|$$ in terms of $$\|x_0\|$$
