Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
• PSet 4 due today (Friday)
• 5789 Paper Reviews due weekly on Mondays
• PA 3/PSet 5 released next week
• My office hours cancelled on Wednesday 3/15 due to Prelim

## Prelim on 3/15 in Lecture

• Prelim Wednesday 3/15
• During lecture (2:45-4pm in 255 Olin)
• 1 hour exam, closed-book, equation sheet provided
• Materials:
• slides (Lectures 1-10, some of 11-13)
• PSets 1-4 (1-3 solutions on Canvas)
• Last-minute conflicts/accommodations? (EdStem)
• Monitoring Prelim tag on EdStem for questions

Outline:

1. MDP Definitions
2. Policies and Distributions
3. Value and Q function
4. Optimal Policies
5. Linear Optimal Control

## Review

Participation point: PollEV.com/sarahdean011

Infinite Horizon Discounted MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, \gamma\}$$

## 1. MDP Definitions

• $$\mathcal{S}$$ states, $$\mathcal{A}$$ actions
• $$r$$ map from state, action to scalar reward
• $$P$$ transition probability to next state given current state and action (Markov assumption)
• $$\gamma$$ discount factor

Finite Horizon MDP

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, H, \mu_0\}$$

• $$\mathcal{S},\mathcal{A},r,P$$ same
• $$H$$ horizon
• $$\mu_0$$ initial state distribution

ex - Pac-Man as MDP

## 1. MDP Definitions

Optimal Control Problem

• continuous states/actions $$\mathcal{S}=\mathbb R^{n_s},\mathcal{A}=\mathbb R^{n_a}$$
• transitions are deterministic, described by a dynamics function
$$s'= f(s, a)$$

ex - UAV as OCP

## 2. Policies and Distributions

• Policy $$\pi$$ chooses an action based on the current state so $$a_t=a$$ with probability $$\pi(a|s_t)$$
• Shorthand for deterministic policy: $$a_t=\pi(s_t)$$

examples:

Policy results in a trajectory $$\tau = (s_0, a_0, s_1, a_1, ... )$$


## 2. Policies and Distributions


• Probability of trajectory $$\tau =(s_0, a_0, s_1, \dots, s_t, a_t)$$: $$\mathbb{P}_{\mu_0}^\pi (\tau) = \mu_0(s_0)\pi(a_0 \mid s_0) \cdot \displaystyle\prod_{i=1}^t {P}(s_i \mid s_{i-1}, a_{i-1}) \pi(a_i \mid s_i)$$
• Probability of $$s$$ at $$t$$ (marginalize over histories ending at $$s_t=s$$): $$\mathbb{P}^\pi_t(s ; \mu_0) = \displaystyle\sum_{\substack{s_{0:t-1}\\ a_{0:t-1}}} \mathbb{P}^\pi_{\mu_0} (s_{0:t-1}, a_{0:t-1}, s_t = s)$$
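To make the trajectory distribution concrete, here is a minimal Python/numpy sketch (the two-state MDP, its numbers, and the function names are illustrative assumptions, not from the lecture) that samples a trajectory and evaluates the product formula above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action tabular MDP (illustrative numbers only).
P = np.array([[[1.0, 0.0], [0.5, 0.5]],   # P[s, a, s'] = P(s' | s, a)
              [[0.3, 0.7], [0.0, 1.0]]])
pi = np.array([[0.9, 0.1],                # pi[s, a] = pi(a | s)
               [0.2, 0.8]])
mu0 = np.array([0.5, 0.5])                # initial state distribution mu_0

def sample_trajectory(T):
    """Sample tau = (s_0, a_0, ..., s_T, a_T)."""
    s = rng.choice(2, p=mu0)
    tau = []
    for _ in range(T + 1):
        a = rng.choice(2, p=pi[s])
        tau.append((s, a))
        s = rng.choice(2, p=P[s, a])
    return tau

def trajectory_prob(tau):
    """mu0(s0) pi(a0|s0) * prod_{i>=1} P(s_i | s_{i-1}, a_{i-1}) pi(a_i | s_i)."""
    s0, a0 = tau[0]
    p = mu0[s0] * pi[s0, a0]
    for (sp, ap), (s, a) in zip(tau, tau[1:]):
        p *= P[sp, ap, s] * pi[s, a]
    return p

tau = sample_trajectory(T=3)
print(tau, trajectory_prob(tau))
```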

## 2. Policies and Distributions


• Probability vector of $$s$$ at $$t$$: $$d_{\mu_0,t}^\pi(s) = \mathbb{P}^\pi_t(s ; \mu_0)$$ evolves as $$d_{\mu_0,t+1}^\pi=P_\pi^\top d_{\mu_0,t}^\pi$$ where $$P_\pi$$ at row $$s$$ and column $$s'$$ is $$\mathbb E_{a\sim \pi(s)}[P(s'\mid s,a)]$$
• Discounted "steady-state" distribution (PSet 2) $$d^\pi_{\mu_0} = (1 - \gamma) \displaystyle\sum_{t=0}^\infty \gamma^t d_{\mu_0,t}^\pi$$

$$d^\pi_{\mu_0} = (1-\gamma)\left(d_{\mu_0,0}^\pi + \gamma\, d_{\mu_0,1}^\pi + \gamma^2 d_{\mu_0,2}^\pi + \dots\right)$$

## State Evolution Example

Example: $$\pi(s)=$$stay and $$\mu_0$$ is each state with probability $$1/2$$.

[Figure: two-state chain; state $$0$$ self-loops with probability $$1$$, state $$1$$ stays with probability $$p_1$$ and moves to state $$0$$ with probability $$1-p_1$$.]

$$P_\pi = \begin{bmatrix}1& 0\\ 1-p_1 & p_1\end{bmatrix}$$

• $$d_0 = \begin{bmatrix} 1/2\\ 1/2\end{bmatrix}$$
• $$d_1 = P_\pi^\top d_0 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix} \begin{bmatrix} 1/2\\ 1/2\end{bmatrix} = \begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix}$$
• $$d_2 = P_\pi^\top d_1 = \begin{bmatrix}1& 1-p_1\\0 & p_1\end{bmatrix}\begin{bmatrix} 1-p_1/2\\ p_1/2\end{bmatrix} = \begin{bmatrix} 1-p_1^2/2\\ p_1^2/2\end{bmatrix}$$

## State Distribution Transition

• How does state distribution change over time?
• Recall, $$s_{t+1}\sim P(s_t,\pi(s_t))$$
• i.e. $$s_{t+1} = s'$$ with probability $$P(s'|s_t, \pi(s_t))$$
• Write as a summation over possible $$s_t$$:
• $$\mathbb{P}\{s_{t+1}=s'\mid \mu_0,\pi\} =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))\mathbb{P}\{s_{t}=s\mid \mu_0,\pi\}$$
• In vector notation:
• $$d_{t+1}[s'] =\sum_{s\in\mathcal S} P(s'\mid s, \pi(s))d_t[s]$$
• $$d_{t+1}[s'] =\langle\begin{bmatrix} P(s'\mid 1, \pi(1)) & \dots & P(s'\mid S, \pi(S))\end{bmatrix},d_t\rangle$$
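As a sanity check, here is a short numpy sketch (the values of $$p_1$$, $$\gamma$$, and the variable names are my choices) that iterates $$d_{t+1}=P_\pi^\top d_t$$ for the two-state example above and computes the discounted steady-state distribution in closed form:

```python
import numpy as np

p1, gamma = 0.9, 0.95
P_pi = np.array([[1.0,    0.0],   # row s, column s': P(s' | s, pi(s))
                 [1 - p1, p1 ]])
d = np.array([0.5, 0.5])          # mu_0: each state with probability 1/2

# Evolve d_{t+1} = P_pi^T d_t; matches the closed form [1 - p1^t/2, p1^t/2].
for t in range(1, 4):
    d = P_pi.T @ d
    assert np.allclose(d, [1 - p1**t / 2, p1**t / 2])

# Discounted "steady-state" distribution (1-gamma) sum_t gamma^t d_t,
# via the Neumann series: (1-gamma) (I - gamma P_pi^T)^{-1} d_0.
d0 = np.array([0.5, 0.5])
d_disc = (1 - gamma) * np.linalg.solve(np.eye(2) - gamma * P_pi.T, d0)
print(d_disc, d_disc.sum())       # a valid distribution: sums to 1
```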

## 2. Policies and Distributions

Food for thought:

• How are these distributions different when:
• Transitions are different (Simulation Lemma)
• Policies are different (Performance Difference Lemma)
• Initial states are different


## 3. Value and Q function

• Evaluate policy by cumulative reward
• $$V^\pi(s) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s,P,\pi]$$
• $$Q^\pi(s, a) = \mathbb E[\sum_{t=0}^\infty \gamma^t r(s_t,a_t) | s_0=s, a_0=a,P,\pi]$$
• For finite horizon, for $$t=0,...H-1$$,
• $$V_t^\pi(s) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s,P,\pi]$$
• $$Q_t^\pi(s, a) = \mathbb E[\sum_{k=t}^{H-1} r(s_k,a_k) | s_t=s, a_t=a,P,\pi]$$

examples:

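For instance, a hedged Monte Carlo sketch of the discounted value definition above (the random tabular MDP, the rollout count, and the truncation horizon `T` are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]
pi = rng.dirichlet(np.ones(A), size=S)        # pi[s, a] = pi(a | s)

def mc_value(s0, n_rollouts=2000, T=100):
    """Estimate V^pi(s0) = E[sum_t gamma^t r(s_t,a_t) | s_0=s0] by rollouts,
    truncating the infinite sum at T (error <= gamma^T max|r| / (1-gamma))."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, disc = s0, 0.0, 1.0
        for _ in range(T):
            a = rng.choice(A, p=pi[s])
            ret += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])
        total += ret
    return total / n_rollouts

print(mc_value(0))
```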

## 3. Value and Q function

Recursive Bellman Expectation Equation:

• Discounted Infinite Horizon
•  $$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} [V^\pi(s')] \right]$$
• $$Q^{\pi}(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P( s, a)} \left[ V^\pi(s') \right]$$
• Finite Horizon,  for $$t=0,\dots H-1$$,
• $$V^{\pi}_t(s) = \mathbb{E}_{a \sim\pi_t(s) } \left[ r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')] \right]$$
• $$Q^{\pi}_t(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} [V^\pi_{t+1}(s')]$$


Recall: Icy navigation (PSet 2, lecture example)

## 3. Value and Q function

• Recursive computation: $$V^{\pi} = R^{\pi} + \gamma P_{\pi} V^\pi$$
• Exact Policy Evaluation: $$V^{\pi} = (I- \gamma P_{\pi} )^{-1}R^{\pi}$$
• Iterative Policy Evaluation: $$V^{\pi}_{i+1} = R^{\pi} + \gamma P_{\pi} V^\pi_i$$
• Converges: fixed point contraction
• Backwards-Iterative computation in finite horizon:
• Initialize $$V^{\pi}_H = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$V^{\pi}_t = R^{\pi} +P_{\pi} V^\pi_{t+1}$$
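The three procedures above as a minimal numpy sketch (the 3-state chain, reward vector, and iteration counts are made-up assumptions):

```python
import numpy as np

gamma = 0.9
# Hypothetical 3-state chain under a fixed policy pi.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])   # R^pi[s] = E_{a~pi(s)}[r(s, a)]

# Exact policy evaluation: V = (I - gamma P_pi)^{-1} R_pi.
V_exact = np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

# Iterative policy evaluation: V_{i+1} = R_pi + gamma P_pi V_i (a contraction).
V = np.zeros(3)
for _ in range(500):
    V = R_pi + gamma * P_pi @ V
assert np.allclose(V, V_exact, atol=1e-6)

# Finite-horizon, backwards-iterative (undiscounted): V_H = 0, V_t = R_pi + P_pi V_{t+1}.
H = 10
V_t = np.zeros(3)
for t in reversed(range(H)):
    V_t = R_pi + P_pi @ V_t
print(V_exact, V_t)
```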

## 4. Optimal Policies

• An optimal policy $$\pi^*$$ is one where $$V^{\pi^*}(s) \geq V^{\pi}(s)$$ for all $$s$$ and policies $$\pi$$
• Equivalent condition: Bellman Optimality
• $$V^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[V^*(s') \right]\right]$$
• $$Q^*(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q^*(s', a') \right]$$
• Optimal policy $$\pi^*(s) = \argmax_{a\in \mathcal A} Q^*(s, a)$$

Recall: Verifying optimality in Icy Street example

Food for thought: What does Bellman Optimality imply about advantage function $$A^{\pi^*}(s,a)$$?

## 4. Optimal Policies

• Finite horizon: for $$t=0,\dots H-1$$,
• $$V_t^*(s) = \max_{a\in\mathcal A} \left[r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[V_{t+1}^*(s') \right]\right]$$
• $$Q_t^*(s, a) = r(s, a) + \mathbb{E}_{s' \sim P(s, a)} \left[ \max_{a'\in\mathcal A} Q_{t+1}^*(s', a') \right]$$
• Optimal policy $$\pi_t^*(s) = \argmax_{a\in \mathcal A} Q_t^*(s, a)$$
• Solve exactly with Dynamic Programming
• Iterate backwards in time from $$V^*_{H}=0$$
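A sketch of this backwards dynamic program on a random hypothetical tabular MDP (sizes and names are my choices, not from the slides):

```python
import numpy as np

# Hypothetical tabular finite-horizon MDP: S states, A actions, horizon H.
S, A, H = 3, 2, 10
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
r = rng.uniform(size=(S, A))                  # r[s, a]

V = np.zeros((H + 1, S))                      # V[H] = 0
pi_star = np.zeros((H, S), dtype=int)
for t in reversed(range(H)):
    Q = r + P @ V[t + 1]                      # Q[s, a] = r(s,a) + E_{s'}[V_{t+1}(s')]
    V[t] = Q.max(axis=1)
    pi_star[t] = Q.argmax(axis=1)             # optimal time-varying policy

print(V[0], pi_star[0])
```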

## 4. Optimal Policies

• Infinite horizon: algorithms for recursion in the Bellman Optimality equation
• Value Iteration
• Initialize $$V^0$$. For $$i=0,1,\dots$$,
• $$V^{i+1}(s) =\max_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
• Policy Iteration
• Initialize $$\pi^0$$. For $$i=0,1,\dots$$,
• $$V^{i}=$$ PolicyEval($$\pi^i$$)
• $$\pi^{i+1}(s) = \argmax_{a\in\mathcal A} r(s, a) + \gamma\mathbb{E}_{s' \sim P(s, a)} \left[ V^i(s') \right]$$
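Both loops as a numpy sketch (same hypothetical tabular setup as earlier sketches; stopping rules simplified):

```python
import numpy as np

S, A, gamma = 3, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']
r = rng.uniform(size=(S, A))

# Value Iteration: V <- max_a [ r + gamma E[V] ]; a gamma-contraction in sup norm.
V = np.zeros(S)
for _ in range(1000):
    V = (r + gamma * P @ V).max(axis=1)

# Policy Iteration: exact evaluation, then greedy improvement.
pi = np.zeros(S, dtype=int)
while True:
    P_pi, R_pi = P[np.arange(S), pi], r[np.arange(S), pi]
    V_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)   # PolicyEval(pi)
    pi_new = (r + gamma * P @ V_pi).argmax(axis=1)
    if np.array_equal(pi_new, pi):
        break                                 # exact convergence in finite time
    pi = pi_new

assert np.allclose(V, V_pi, atol=1e-4)        # both reach V*
```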

## 4. Optimal Policies

• Value Iteration
• Converges: fixed point contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i - V^*\|_\infty$$
• Policy Iteration
• Monotone Improvement: $$V^{i+1}(s) \geq V^{i}(s)$$
• Contraction: $$\|V^{i+1} - V^*\|_\infty \leq \gamma \|V^i- V^*\|_\infty$$
• Converges to exactly optimal policy in finite time (PSet 3)

## 5. Linear Optimal Control

• Linear Dynamics: $$s_{t+1} = A s_t + Ba_t$$
• Unrolled dynamics (PSet 3) $$s_{t} = A^ts_0 + \sum_{k=0}^{t-1} A^k Ba_{t-k-1}$$
• Stability of uncontrolled $$s_{t+1}=As_t$$:
• stable if $$\max_i |\lambda_i(A)|< 1$$
• unstable if $$\max_i |\lambda_i(A)| > 1$$
• marginally unstable if $$\max_i |\lambda_i(A)|= 1$$

ex - UAV

Food for thought: relationship between stability and cumulative cost? (PSet 4)
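A quick numerical check of the stability conditions above (a sketch; the matrices, tolerance, and function name are illustrative):

```python
import numpy as np

def stability(A, tol=1e-9):
    """Classify s_{t+1} = A s_t by the spectral radius max_i |lambda_i(A)|."""
    rho = np.abs(np.linalg.eigvals(A)).max()
    if rho < 1 - tol:
        return "stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally unstable"

print(stability(np.array([[0.9, 0.1], [0.0, 0.5]])))   # stable
print(stability(np.array([[1.0, 1.0], [0.0, 1.0]])))   # marginally unstable
print(stability(np.array([[1.1, 0.0], [0.0, 0.2]])))   # unstable
```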

## 5. Linear Optimal Control

Finite Horizon LQR: Application of Dynamic Programming

• Initialize $$V^{*}_H(s) = 0$$
• For $$t=H-1, H-2, ... 0$$
• $$Q^{*}_t(s,a) = c(s,a) + V^*_{t+1}(f(s,a))$$
• $$\pi_t^*(s) = \argmin_{a\in\mathcal A} Q^{*}_t(s,a)$$
• $$V^*_t(s) = Q^{*}_t(s,\pi_t^*(s))$$
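Specializing the recursion above to quadratic cost $$c(s,a) = s^\top Q s + a^\top R a$$ and linear dynamics $$f(s,a)=As+Ba$$ gives the backwards Riccati recursion; a numpy sketch under those assumptions (the particular $$A, B, Q, R, H$$ are mine):

```python
import numpy as np

# Hypothetical double-integrator-style system.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q, R, H = np.eye(2), np.eye(1), 50

# Backwards Riccati recursion: V*_t(s) = s^T P_t s, starting from P_H = 0.
P = np.zeros((2, 2))
gains = []
for t in reversed(range(H)):
    # Minimizing Q*_t(s,a) = s^T Q s + a^T R a + (As+Ba)^T P (As+Ba) over a
    # gives the linear policy a = K_t s.
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    gains.append(K)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K
gains = gains[::-1]                          # gains[t] is K_t

# Roll out the optimal policy: s_{t+1} = A s_t + B K_t s_t.
s = np.array([1.0, 0.0])
for t in range(H):
    s = A @ s + B @ (gains[t] @ s)
print(s)                                     # state regulated toward the origin
```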

Basis for approximation-based algorithms (local linearization and iLQR)

## Proof Strategies

1. Add and subtract: $$\|f(x) - g(y)\| \leq \|f(x)-f(y)\| +\|f(y)-g(y)\|$$
2. Contractions (induction) $$\|x_{t+1}\|\leq \gamma \|x_t\| \implies \|x_t\|\leq \gamma^t\|x_0\|$$
3. Additive induction $$\|x_{t+1}\| \leq \delta_t + \|x_t\| \implies \|x_t\|\leq \sum_{k=0}^{t-1} \delta_k + \|x_0\|$$
4. Basic Inequalities (PSet 1):
• $$|\mathbb E[f(x)] - \mathbb E[g(x)]| \leq \mathbb E[|f(x)-g(x)|]$$
• $$|\max f(x) - \max g(x)| \leq \max |f(x)-g(x)|$$
• $$\mathbb E[f(x)] \leq \max f(x)$$

## Test-taking Strategies

1. Move on if stuck!
2. Write explanations and show steps for partial credit
3. Multipart questions: can be done mostly independently
• ex: 1) show $$\|x_{t+1}\|\leq \gamma \|x_t\|$$; 2) give a bound on $$\|x_t\|$$ in terms of $$\|x_0\|$$
