Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework this week
  • Problem Set 2 due TONIGHT
  • Programming Assignment 1 due Wednesday 2/15
  • Next PSet and PA released on Wednesday
• My office hours:
  • Tuesdays 10:30-11:30am in Gates 416A
  • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

## Markov Decision Process

• $$\mathcal{S}, \mathcal{A}$$ state and action spaces
  • finite size $$S$$ and $$A$$
• $$r$$ reward function, $$P$$ transition function (tabular representations of size $$SA$$ and $$S^2A$$)
• discount factor $$0<\gamma<1$$ or horizon $$H>0$$

Goal: achieve high cumulative reward

maximize over $$\pi$$:   $$\displaystyle \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$$ or $$\displaystyle \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]$$

s.t.   $$s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, [H~\text{or}~\gamma]\}$$

## Infinite Horizon: VI and PI

Policy Iteration

• Initialize $$\pi_0:\mathcal S\to\mathcal A$$
• For $$t=0,\dots,T-1$$:
  • Policy Evaluation: compute $$V^{\pi_t}$$
  • Policy Improvement: set $$\pi_{t+1}$$ greedily with respect to $$V^{\pi_t}$$

Guarantees:

1. Monotonic Improvement:
$$V^{\pi_{t+1}} \geq V^{\pi_t}$$
2. Convergence:
$$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$$

Value Iteration

• Initialize $$V_0$$
• For $$t=0,\dots,T-1$$:
  • Bellman Operator: compute $$V_{t+1}$$ from $$V_t$$
• Return $$\pi_T$$, the greedy policy with respect to $$V_T$$

Guarantees:

1. Iterate convergence:
$$\| V_{t}- V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$$
2. Suboptimality:
$$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$$
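As a concrete sketch, value iteration for a tabular MDP takes only a few lines of numpy. The two-state MDP below is a toy example assumed for illustration, not one from lecture.

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Run T iterations of the Bellman optimality operator.

    P: (S, A, S) transition probabilities, r: (S, A) rewards.
    Returns V_T and the greedy policy with respect to it.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(T):
        Q = r + gamma * (P @ V)   # Q[s,a] = r(s,a) + gamma * E_{s'~P(s,a)}[V(s')]
        V = Q.max(axis=1)         # Bellman optimality operator
    return V, Q.argmax(axis=1)    # greedy policy extracted from the final Q

# Toy MDP: action 0 stays in place, action 1 swaps states;
# reward 1 only for staying in state 0.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)            # stay
P[:, 1, :] = np.eye(2)[::-1]      # swap
r = np.array([[1.0, 0.0], [0.0, 0.0]])

V, pi = value_iteration(P, r, gamma=0.9, T=200)
```

With $$\gamma=0.9$$, the optimal values are $$V^\star = (10, 9)$$: stay in state 0, and move there from state 1.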


## Finite Horizon: DP

Exactly compute the optimal policy

• Initialize $$V^\star_H = 0$$
• For $$t=H-1, H-2, ..., 0$$:
• $$Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$$
• $$\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)$$
• $$V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )$$
• Return $$\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})$$
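The backward induction above translates directly to code. This is a minimal sketch for a tabular, reward-maximizing MDP with zero terminal value, reusing the same assumed toy MDP.

```python
import numpy as np

def backward_induction(P, r, H):
    """Exactly compute the optimal time-varying policy by DP.

    P: (S, A, S) transitions, r: (S, A) rewards, horizon H.
    Returns pi[t, s] (optimal action) and V[t, s] (optimal value-to-go).
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))          # V[H] = 0 by initialization
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):    # t = H-1, H-2, ..., 0
        Q = r + P @ V[t + 1]          # Q_t(s,a) = r(s,a) + E[V_{t+1}(s')]
        pi[t] = Q.argmax(axis=1)
        V[t] = Q.max(axis=1)
    return pi, V

# Toy MDP: action 0 stays, action 1 swaps; reward 1 for staying in state 0.
P = np.zeros((2, 2, 2))
P[:, 0, :] = np.eye(2)
P[:, 1, :] = np.eye(2)[::-1]
r = np.array([[1.0, 0.0], [0.0, 0.0]])

pi, V = backward_induction(P, r, H=3)
```

Note the returned policy is time-varying: one action per state per timestep, as in the pseudocode.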

## Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

## Continuous MDP

• So far, we have considered finitely many states and actions: $$|\mathcal S| = S$$ and $$|\mathcal A| = A$$
  • Tabular representation of functions
• In applications like robotics, states and actions can take continuous values
  • e.g. position, velocity, force
  • $$\mathcal S = \mathbb R^{n_s}$$ and $$\mathcal A = \mathbb R^{n_a}$$
• Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes

## Finite Horizon Optimal Control

• Continuous $$\mathcal S = \mathbb R^{n_s}$$ and $$\mathcal A = \mathbb R^{n_a}$$
  • alternate terminology/notation (we won't use): "states" $$x$$ and "inputs" $$u$$
• Cost to be minimized (rather than reward to be maximized)
  • think of cost as "negative reward", or reward as "negative cost"
  • potentially time-varying: $$c=(c_0,\dots, c_{H-1}, c_H)$$
  • $$c_t:\mathcal S\times\mathcal A\to \mathbb R$$ for $$t=0,\dots,H-1$$
  • final state cost $$c_H:\mathcal S\to \mathbb R$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$$

## Finite Horizon Optimal Control

• Continuous $$\mathcal S = \mathbb R^{n_s}$$ and $$\mathcal A = \mathbb R^{n_a}$$
• Cost to be minimized $$c=(c_0,\dots, c_{H-1}, c_H)$$
• Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
• Finite horizon $$H$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$$

minimize over $$\pi$$:   $$\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)$$

s.t.   $$s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

## Not in Scope: Stochastic & Infinite Horizon

• Non-deterministic dynamics are out of our scope (requiring a background in continuous random variables)
• Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
• Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
• Though we won't study them, these settings are routine for LQR (the topic of next lecture)

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}$$

## Example


• Setting: hovering UAV over a target
  • cost: distance from target
• Action: thrust right/left
• Newton's second law, with mass $$m$$ and timestep $$\Delta$$:
  • $$a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})$$
  • equivalently, $$\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$$
• Effect on position:
  • $$\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta\, \mathsf{velocity}_{t}$$
• State is $$s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$$

## Example

• Setting: hovering UAV over a target
• Action: thrust right/left
• State is $$s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$$
• $$\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$$
• $$\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}$$


• $$\mathcal S = \mathbb R^2$$, $$\mathcal A = \mathbb R$$
• $$c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target}_t)^2+\lambda a_t^2$$
• $$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$$

$$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$$
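The UAV dynamics can be simulated directly; this is a small illustrative sketch with $$\Delta=1$$ and $$m=1$$ assumed for simplicity.

```python
import numpy as np

# UAV dynamics with Delta = 1 and m = 1 (assumed for illustration)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])    # position += velocity; velocity unchanged
B = np.array([0.0, 1.0])      # thrust acts on velocity only

def rollout(s0, actions):
    """Simulate s_{t+1} = A s_t + B a_t, returning states s_0, ..., s_T."""
    states = [np.asarray(s0, dtype=float)]
    for a in actions:
        states.append(A @ states[-1] + B * a)
    return np.array(states)

# start at rest at the origin; thrust right once, coast, thrust left once
traj = rollout([0.0, 0.0], [1.0, 0.0, -1.0])
```

The final state is displaced to the right and back at rest, matching the thrust-then-counter-thrust intuition.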

## Discretization?

• Could approximate continuous states/actions by discretizing to some resolution $$\varepsilon$$
• How many states/actions does this require?
  • Let $$B_s$$ bound* the size of the maximum state and $$B_a$$ bound the size of the maximum action
  • $$(B_s/\varepsilon)^{n_s}$$ states and $$(B_a/\varepsilon)^{n_a}$$ actions
  • *bounds depend on dynamics, horizon, initial state, etc. (nontrivial!)
• The exponential dependence on dimension makes this infeasible in many cases!
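To see the scale, a quick back-of-the-envelope count with assumed (illustrative) bounds and dimensions:

```python
# Counting grid points for an eps-resolution discretization.
# B_s, B_a, n_s, n_a, eps are assumed values for illustration, not derived.
B_s, B_a = 10.0, 1.0     # bounds on state / action magnitude
eps = 0.01               # grid resolution
n_s, n_a = 4, 2          # state and action dimensions

n_states = (B_s / eps) ** n_s     # (10 / 0.01)^4 = 1000^4: about 1e12 grid states
n_actions = (B_a / eps) ** n_a    # (1 / 0.01)^2 = 100^2: about 1e4 grid actions
```

Even a modest 4-dimensional state space at centimeter-scale resolution produces a tabular MDP far too large to enumerate.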

## Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

## Example

• Setting: hovering UAV over a target
• Action: thrust right/left
• State is $$s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$$


$$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$$

## Linear Dynamics

• The dynamics function $$f$$ has a linear form $$s_{t+1} = As_t + Ba_t$$
• $$A\in\mathbb R^{n_s\times n_s}$$ and $$B\in\mathbb R^{n_s\times n_a}$$ are dynamics matrices
• $$A$$ describes the evolution of the state when there is no action (internal dynamics)
• $$B$$ describes the effects of actions

## Example: investing

You have investments in two companies.

Setting 1:  Each dollar of investment in company $$i$$ leads to $$\lambda_i$$ returns. The companies are independent.

• $$\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t$$

Three regimes for the trajectories $$s_t = \begin{bmatrix} \lambda_1^t & \\ & \lambda_2^t \end{bmatrix} s_0$$ (for nonzero initial investments):

• $$0<\lambda_2<\lambda_1<1$$: both investments decay to zero
• $$0<\lambda_2<1<\lambda_1$$: investment 1 grows while investment 2 decays
• $$1<\lambda_2<\lambda_1$$: both investments grow without bound

## Autonomous trajectories

• Trajectories $$s_t=A^t s_0$$ are determined by the eigen-decomposition of $$A$$
• Ex: if $$s_0=v$$ is an eigenvector of $$A$$ (i.e. $$Av =\lambda v$$)
  • $$s_{1} = As_0 = \lambda v$$, and by induction $$s_t = \lambda^t v$$
• If $$A$$ is diagonalizable, then any $$s_0$$ can be written as a linear combination of eigenvectors $$s_0 = \sum_{i=1}^{n_s} \alpha_i v_i$$
  • by linearity, $$s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i$$

The effect of internal dynamics $$s_{t+1} = As_t$$
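The eigen-decomposition view of trajectories can be checked numerically; the matrix below is an arbitrary diagonalizable example assumed for the demo.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])              # diagonalizable, eigenvalues 0.9 and 0.5
lams, V = np.linalg.eig(A)              # columns of V are eigenvectors v_i
s0 = np.array([1.0, 2.0])
alpha = np.linalg.solve(V, s0)          # coefficients in s0 = sum_i alpha_i v_i

t = 10
direct = np.linalg.matrix_power(A, t) @ s0    # s_t = A^t s_0
via_eig = V @ (alpha * lams**t)               # s_t = sum_i alpha_i lambda_i^t v_i
```

Both routes give the same $$s_t$$, and the eigenvalue form makes the long-run behavior (here, decay governed by the largest eigenvalue 0.9) explicit.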

## Example: investing

Setting 2:  The companies are interdependent: each dollar of investment in company $$i$$ leads to $$\alpha$$ return for company $$i$$, but it also leads to $$\beta$$ return ($$i=1$$) or loss ($$i=2$$) to the other company.

• $$\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix} s_t$$

• The matrix maps $$\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix}$$: a rotation by $$\arctan(\beta/\alpha)$$ composed with a scaling by $$\sqrt{\alpha^2+\beta^2}$$
• The eigenvalues are the complex pair $$\lambda = \alpha \pm i \beta$$
• Two regimes: trajectories spiral inward when $$0<\alpha^2+\beta^2<1$$ and spiral outward when $$\alpha^2+\beta^2>1$$

## Example: investing

Setting 3:  Each dollar of investment in company $$i$$ leads to $$\lambda$$ return for company $$i$$, and company $$2$$ is a subsidiary of company $$1$$, which thus accumulates company $$2$$'s returns as well.

• $$\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} s_t$$

• Powers of this matrix follow from splitting it into diagonal and nilpotent parts: $$\left(\begin{bmatrix} \lambda & \\ & \lambda\end{bmatrix} + \begin{bmatrix} & 1\\ & \end{bmatrix} \right)^t =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\ & \lambda^t\end{bmatrix}$$
• Two regimes: trajectories eventually decay when $$0<\lambda<1$$ (despite possible transient growth from the $$t\lambda^{t-1}$$ term) and grow when $$\lambda>1$$
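The closed form for powers of this non-diagonalizable matrix is easy to verify numerically ($$\lambda = 0.8$$ assumed for the check):

```python
import numpy as np

lam, t = 0.8, 5
J = np.array([[lam, 1.0],
              [0.0, lam]])                 # lambda * I plus a nilpotent part
power = np.linalg.matrix_power(J, t)
closed_form = np.array([[lam**t, t * lam**(t - 1)],
                        [0.0,    lam**t]])

# transient growth: the off-diagonal t * lam^(t-1) term exceeds 1 at t = 5
# even though lam < 1, before eventually decaying to 0
```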

## Summary of 2D Examples

Example 1:  $$\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t$$

General case: diagonalizable, real eigenvalues (geometric $$=$$ algebraic multiplicity)

Example 2:  $$\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta & \alpha\end{bmatrix} s_t$$

General case: pair of complex eigenvalues $$\lambda = \alpha \pm i \beta$$

Example 3:  $$\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\ & \lambda\end{bmatrix} s_t$$

General case: non-diagonalizable (geometric $$<$$ algebraic multiplicity)

## Equilibria and Stability

• An equilibrium state satisfies $$s_{eq} = As_{eq}$$
  • $$s_{eq}=0$$ is always an equilibrium
  • if there is an eigenvalue equal to 1, then for the associated eigenvector, $$Av=v$$. Thus $$cv$$ is an equilibrium for any scalar $$c$$.
• Classifying the equilibrium $$s_{eq}=0$$:
  1. Asymptotically stable: $$s_t\to 0$$ for all initial states
  2. Unstable: $$\|s_t\|\to\infty$$ for some initial state
• There are examples which are neither (e.g. $$A=I$$)

## Stability Theorem

Theorem: Let $$\{\lambda_i\}_{i=1}^n\subset \mathbb C$$ be the eigenvalues of $$A$$.
Then for $$s_{t+1}=As_t$$, the equilibrium $$s_{eq}=0$$ is

• asymptotically stable $$\iff \max_{i\in[n]}|\lambda_i|<1$$
• unstable if $$\max_{i\in[n]}|\lambda_i|> 1$$
• call $$\max_{i\in[n]}|\lambda_i|=1$$ "marginally (un)stable"


## Stability Theorem

Proof

• If $$A$$ is diagonalizable, then any $$s_0$$ can be written as a linear combination of eigenvectors $$s_0 = \sum_{i=1}^{n_s} \alpha_i v_i$$

• By definition, $$Av_i = \lambda_i v_i$$

• Therefore, $$s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i$$

• Thus $$s_t\to 0$$ for every $$s_0$$ if and only if all $$|\lambda_i|<1$$, and if any $$|\lambda_i|>1$$, then $$\|s_t\|\to\infty$$ for some initial states

• Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form
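The theorem suggests a simple numerical check: compute the spectral radius and compare against 1. A sketch, with two arbitrary example matrices:

```python
import numpy as np

def spectral_radius(A):
    """max_i |lambda_i(A)|, the quantity appearing in the stability theorem."""
    return np.abs(np.linalg.eigvals(A)).max()

stable = np.array([[0.5, 0.3],
                   [0.0, 0.9]])     # eigenvalues 0.5, 0.9: asymptotically stable
unstable = np.array([[1.2, 0.0],
                     [0.4, 0.3]])   # eigenvalues 1.2, 0.3: unstable

# simulate 50 steps from the same initial state to see the theorem in action
s0 = np.array([1.0, 1.0])
s50_stable = np.linalg.matrix_power(stable, 50) @ s0
s50_unstable = np.linalg.matrix_power(unstable, 50) @ s0
```

The trajectory under the stable matrix shrinks toward the origin while the unstable one blows up, consistent with the spectral-radius test.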

## Marginally (un)stable

• We call $$\max_i|\lambda_i|=1$$ "marginally (un)stable"

• Consider the independent investing example with $$\lambda_1=1$$ and $$\lambda_2<1$$ (not unstable): $$s_{t} = \begin{bmatrix} 1 &0 \\0 & \lambda_2 \end{bmatrix}^t s_0$$ stays bounded
• Consider the UAV example (unstable): $$s_{t} = \begin{bmatrix} 1 & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\ 0 & 1\end{bmatrix} s_0$$ grows whenever the initial velocity is nonzero
• Whether the trajectory is bounded depends on the eigenvectors, not just the eigenvalues!

## Controlled Trajectories

• Full dynamics depend on actions $$s_{t+1} = As_t+Ba_t$$

• The trajectories can be written as (PSet 3) $$s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1}$$
• The internal dynamics $$A$$ determines the long term effects of actions
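The closed form above can be sanity-checked against a step-by-step rollout; the matrices and action sequence below are arbitrary values assumed for the check.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([0.0, 1.0])
s0 = np.array([3.0, 0.0])
actions = [1.0, -0.5, 0.25, 0.0, 2.0]
t = len(actions)

# step-by-step rollout of s_{k+1} = A s_k + B a_k
s = s0.copy()
for a in actions:
    s = A @ s + B * a

# closed form: s_t = A^t s_0 + sum_{k=0}^{t-1} A^k B a_{t-k-1}
closed = np.linalg.matrix_power(A, t) @ s0
for k in range(t):
    closed = closed + (np.linalg.matrix_power(A, k) @ B) * actions[t - k - 1]
```

The two agree exactly: the closed form just unrolls the recursion, with $$A^k$$ propagating the action applied $$k+1$$ steps ago.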

## Example

• Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\}$$


• $$s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}$$
• for $$t\geq t_0+1$$: $$s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}$$
• so for $$1\leq t\leq t_0$$, $$s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}$$ and for $$t\geq t_0+1$$, $$s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}$$

## Example

• Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• Thrust according to distance from target $$a_t = -(\mathsf{pos}_t- x)$$


## Linear Policy

• Linear policy defined by $$a_t=Ks_t$$: $$s_{t+1} = As_t+BKs_t = (A+BK)s_t$$

• The trajectories can be written as $$s_{t} = (A+BK)^t s_0$$
• The internal dynamics $$A$$ are modified depending on $$B$$ and $$K$$

## Example

• Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• Thrust according to distance from target $$a_t = -(\mathsf{pos}_t- x)$$


• $$s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• $$\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$$
• $$\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$$

## Example

• Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• Thrust according to distance from target $$a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)$$


• $$\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$$
• $$\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$$
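The two feedback policies above can be compared by the spectral radius of their closed-loop matrices; a quick numerical check:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])

K1 = np.array([[-1.0, 0.0]])    # thrust on position error only
K2 = np.array([[-1.0, -1.0]])   # thrust on position and velocity error

rho1 = np.abs(np.linalg.eigvals(A + B @ K1)).max()   # eigenvalues 1 +/- i
rho2 = np.abs(np.linalg.eigvals(A + B @ K2)).max()   # eigenvalues (1 +/- i*sqrt(3))/2
```

Position-only feedback gives spectral radius $$\sqrt{2}>1$$ (oscillations that grow), while adding velocity feedback brings it down to exactly $$1$$ (marginally stable oscillations).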

## Recap

• PSet 2 due TONIGHT
• PA 1 due Wednesday

• Continuous Control
• Linear Dynamics

• Next lecture: Linear Quadratic Regulator
