Sp23 CS 4/5789: Lecture 7

CS 4/5789: Introduction to Reinforcement Learning

Lecture 7: Continuous Control

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

Homework this week
- Problem Set 2 due TONIGHT
- Programming Assignment 1 due Wednesday 2/15
- Next PSet and PA released on Wednesday
My office hours:
- Tuesdays 10:30-11:30am in Gates 416A
- Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

Markov Decision Process

$\mathcal{S}, \mathcal{A}$ state and action spaces
- finite size $S$ and $A$
$r$ reward function, $P$ transition function (tabular representations with $SA$ and $S^2A$ entries, respectively)
discount factor $0<\gamma<1$ or horizon $H>0$

Goal: achieve high cumulative reward

maximize over $\pi$: $\displaystyle \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]$ or $\displaystyle \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]$

s.t. $s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, [H~\text{or}~\gamma]\}$

Infinite Horizon: VI and PI

Policy Iteration

Initialize $\pi_0:\mathcal S\to\mathcal A$
For $t=0,\dots,T-1$:
- Policy Evaluation: $V^{\pi_t}$
- Policy Improvement: $\pi_{t+1}$

Monotonic Improvement:
$V^{\pi_{t+1}} \geq V^{\pi_t}$
Convergence:
$\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty$

Value Iteration

Initialize $V_0$
For $t=0,\dots,T-1$:
- Bellman Operator: $V_{t+1}$
Return $\displaystyle \pi_T$

Iterate convergence:
$\| V_{t}- V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty$
Suboptimality:
$V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty$
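
As a rough illustration of the Value Iteration loop, here is a minimal sketch. It assumes the transition function is stored as an array `P` of shape `(S, A, S)` and the reward as an array `r` of shape `(S, A)`; these names, the zero initialization, and the toy two-state MDP are illustrative choices, not part of the lecture.

```python
import numpy as np

def value_iteration(P, r, gamma, T):
    """Tabular VI sketch. P: (S, A, S) transition probabilities, r: (S, A) rewards."""
    S, A = r.shape
    V = np.zeros(S)                    # V_0 = 0 (any initialization works)
    for _ in range(T):
        Q = r + gamma * (P @ V)        # Q[s, a] = r(s, a) + gamma * E[V(s')]
        V = Q.max(axis=1)              # apply the Bellman operator
    Q = r + gamma * (P @ V)
    return V, Q.argmax(axis=1)         # V_T and the greedy policy pi_T

# Hypothetical MDP: action a moves deterministically to state a;
# the only reward is for choosing action 1 while in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [0.0, 1.0]])
V, pi = value_iteration(P, r, gamma=0.9, T=200)
```

For this toy problem the optimal values are $V^\star = (9, 10)$, and the iterates approach them at the geometric rate $\gamma^t$.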

Finite Horizon: DP

Exactly compute the optimal policy

Initialize $V^\star_H = 0$
For $t=H-1, H-2, ..., 0$:
- $Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]$
- $\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)$
- $V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )$
Return $\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})$
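
The backward recursion above can be sketched for a tabular MDP as follows. The array layout (`P` of shape `(S, A, S)`, `r` of shape `(S, A)`) and the toy MDP are assumptions made for illustration.

```python
import numpy as np

def finite_horizon_dp(P, r, H):
    """Backward DP sketch. Returns V*_0 and a time-indexed policy of shape (H, S)."""
    S, A = r.shape
    V = np.zeros(S)                          # V*_H = 0
    policy = np.zeros((H, S), dtype=int)
    for t in reversed(range(H)):             # t = H-1, ..., 0
        Q = r + P @ V                        # Q*_t(s, a)
        policy[t] = Q.argmax(axis=1)         # pi*_t(s)
        V = Q.max(axis=1)                    # V*_t(s)
    return V, policy

# Hypothetical MDP: action a moves to state a; reward 1 for action 1 in state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 0.0],
              [0.0, 1.0]])
V0, policy = finite_horizon_dp(P, r, H=3)
```

Note that the returned policy is time-varying: `policy[t]` may differ across `t`, unlike the stationary policies of the infinite-horizon setting.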

Continuous MDP

So far, we have considered finitely many states and actions: $|\mathcal S| = S$ and $|\mathcal A| = A$
- Tabular representation of functions
In applications like robotics, states and actions can take continuous values
- e.g. position, velocity, force
- $\mathcal S = \mathbb R^{n_s}$ and $\mathcal A = \mathbb R^{n_a}$
Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes

Finite Horizon Optimal Control

Continuous $\mathcal S = \mathbb R^{n_s}$ and $\mathcal A = \mathbb R^{n_a}$
- alternate terminology/notation (we won't use): states $x$ and "inputs" $u$
Cost to be minimized (rather than reward to be maximized)
- think of as "negative reward", or think of reward as "negative cost"
- potentially time-varying $c=(c_0,\dots, c_{H-1}, c_H)$
  - $c_t:\mathcal S\times\mathcal A\to \mathbb R$ for $t=0,\dots,H-1$
  - final state cost $c_H:\mathcal S\to \mathbb R$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$

Finite Horizon Optimal Control

Continuous $\mathcal S = \mathbb R^{n_s}$ and $\mathcal A = \mathbb R^{n_a}$
Cost to be minimized $c=(c_0,\dots, c_{H-1}, c_H)$
Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
Finite horizon $H$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$

minimize over $\pi$: $\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)$

s.t. $s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$

Not in Scope: Stochastic & Infinite Horizon

Non-deterministic dynamics are out of our scope (requiring a background in continuous random variables)
Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
Though we won't study them, these settings are routine for LQR (topic of next lecture)

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}$

Example

Setting: hovering UAV over a target
- cost: distance from target
Action: thrust right/left
Newton's second law
- $a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})$
- $\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$
Effect on position
- $\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}$
State is $s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$

Example

Setting: hovering UAV over a target
Action: thrust right/left
State is $s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$
- $\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$
- $\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}$

$\mathcal S = \mathbb R^2$, $\mathcal A = \mathbb R$
$c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target}_t)^2+\lambda a_t^2$
$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$
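
A minimal sketch of this example's dynamics and cost in code. The numerical values of $\Delta$, $m$, and $\lambda$, and the fixed target at 0, are hypothetical choices for illustration.

```python
import numpy as np

delta, m, lam = 0.1, 1.0, 0.5          # hypothetical step size, mass, action penalty

A = np.array([[1.0, delta],
              [0.0, 1.0]])
B = np.array([0.0, delta / m])

def f(s, a):
    """Dynamics: s = [position, velocity], a = scalar thrust."""
    return A @ s + B * a

def cost(s, a, target=0.0):
    """Squared distance from the target plus a penalty on thrust."""
    return (s[0] - target) ** 2 + lam * a ** 2

s = np.array([1.0, 0.0])               # one unit right of the target, at rest
s_next = f(s, a=2.0)                   # thrust changes velocity first, not position
c = cost(s, a=2.0)
```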

Discretization?

We could approximate continuous states/actions by discretizing them on a grid with resolution $\varepsilon$
How many states/actions does this require?
- Let $B_s$ bound* the size of the maximum state and $B_a$ bound the size of the maximum action
- On the order of $(B_s/\varepsilon)^{n_s}$ grid points for states and $(B_a/\varepsilon)^{n_a}$ for actions
*bounds depend on dynamics, horizon, initial state, etc. (nontrivial!)
The number of grid points grows exponentially in the dimension, so this is not a feasible approach in many cases!
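
To get a feel for the numbers, here is a small sketch that evaluates the count $(B/\varepsilon)^{n}$ for a few dimensions. The bound, resolution, and dimensions are hypothetical.

```python
import math

def num_grid_points(bound, eps, dim):
    """Grid points needed at resolution eps, following the (B / eps)^n count."""
    return math.ceil(bound / eps) ** dim

# Hypothetical values: B = 10, eps = 0.5, so 20 cells per dimension
counts = {dim: num_grid_points(10, 0.5, dim) for dim in (2, 6, 12)}
```

Already at 12 dimensions (e.g. position and velocity of a rigid body in 3D) this coarse grid has $20^{12} \approx 4\times 10^{15}$ points.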

Example

Setting: hovering UAV over a target
Action: thrust right/left
State is $s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$

$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$

Linear Dynamics

The dynamics function $f$ has a linear form $$ s_{t+1} = As_t + Ba_t $$
$A\in\mathbb R^{n_s\times n_s}$ and $B\in\mathbb R^{n_s\times n_a}$ are dynamics matrices
$A$ describes the evolution of the state when there is no action (internal dynamics)
$B$ describes the effects of actions

Example: investing

You have investments in two companies.

Setting 1: Each dollar of investment in company $i$ leads to $\lambda_i$ returns. The companies are independent.

$\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t $

Three cases:
- $0<\lambda_2<\lambda_1<1$: both investments decay
- $0<\lambda_2<1<\lambda_1$: one decays, one grows
- $1<\lambda_2<\lambda_1$: both investments grow

Autonomous trajectories

Trajectories $s_t=A^t s_0$ are determined by the eigen-decomposition of $A$
Ex: if $s_0=v$ is an eigenvector of $A$ (i.e. $Av =\lambda v$)
- $s_{1} = As_0 = \lambda s_0$
- $s_t = \lambda^t v$
If $A$ is diagonalizable, then any $s_0$ can be written as a linear combination of eigenvectors $s_0 = \sum_{i=1}^{n_s} \alpha_i v_i$
- $s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i$

The effect of internal dynamics $$ s_{t+1} = As_t$$
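
The eigen-decomposition formula can be checked numerically against direct matrix powers. This is a sketch with a hypothetical diagonalizable $A$ and initial state.

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])             # hypothetical matrix with eigenvalues 0.9 and 0.5
s0 = np.array([1.0, 1.0])
t = 10

lams, V = np.linalg.eig(A)             # columns of V are the eigenvectors v_i
alpha = np.linalg.solve(V, s0)         # coefficients with s0 = sum_i alpha_i v_i

s_t_eig = (V * lams**t) @ alpha        # sum_i alpha_i * lam_i^t * v_i
s_t_pow = np.linalg.matrix_power(A, t) @ s0   # A^t s0 computed directly
```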

Example: investing

Setting 2: The companies are interdependent: each dollar of investment in company $i$ leads to $\alpha$ return for company $i$, but it also leads to $\beta$ return ($i=1$) or loss ($i=2$) to the other company.

$\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix} s_t $

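Two cases: $0<\alpha^2+\beta^2<1$ (trajectories spiral in toward the origin) and $1<\alpha^2+\beta^2$ (trajectories spiral outward)

$$\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix} $$

Each step rotates the state by $\arctan(\beta/\alpha)$ and scales it by $\sqrt{\alpha^2+\beta^2}$

Eigenvalues: $\lambda = \alpha \pm i \beta$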
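
A short numerical check of the rotate-and-scale picture, with hypothetical values of $\alpha$ and $\beta$. It uses `arctan2`, which agrees with $\arctan(\beta/\alpha)$ when $\alpha>0$.

```python
import numpy as np

alpha, beta = 0.6, 0.7                 # hypothetical values
A = np.array([[alpha, -beta],
              [beta,  alpha]])

eigvals = np.linalg.eigvals(A)         # expected: alpha +/- i*beta
scale = np.hypot(alpha, beta)          # sqrt(alpha^2 + beta^2)
angle = np.arctan2(beta, alpha)        # rotation per step

t = 5
s0 = np.array([1.0, 0.0])
s_t = np.linalg.matrix_power(A, t) @ s0
# After t steps: scaled by scale^t and rotated by t * angle
expected = scale**t * np.array([np.cos(t * angle), np.sin(t * angle)])
```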

Example: investing

Setting 3: Each dollar of investment in company $i$ leads to $\lambda$ return for company $i$, and company $2$ is a subsidiary of company $1$, which therefore accumulates company $2$'s investment as well.

$\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} s_t $

$0<\lambda<1$

$1<\lambda$

$$ \left(\begin{bmatrix} \lambda & \\ & \lambda\end{bmatrix} + \begin{bmatrix} & 1\\ & \end{bmatrix} \right)^t =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\ & \lambda^t\end{bmatrix} $$
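
One way to see where the $t\lambda^{t-1}$ entry comes from, writing $N$ (a name introduced here for illustration) for the off-diagonal part: since $\lambda I$ commutes with $N$ and $N^2=0$, the binomial expansion keeps only its first two terms.

$$ N = \begin{bmatrix} 0 & 1\\ 0 & 0\end{bmatrix},\quad N^2 = 0 \quad\implies\quad (\lambda I + N)^t = \sum_{k=0}^{t}\binom{t}{k}\lambda^{t-k}N^k = \lambda^t I + t\lambda^{t-1}N $$

For $0<\lambda<1$ the factor $t\lambda^{t-1}$ can grow at first but eventually decays to zero; for $\lambda>1$ every nonzero entry grows.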

Summary of 2D Examples

Example 1: $\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t $
General case: diagonalizable, real eigenvalues (geometric $=$ algebraic multiplicity)

Example 2: $\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta & \alpha\end{bmatrix} s_t $
General case: pair of complex eigenvalues $\lambda = \alpha \pm i \beta$

Example 3: $\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\ & \lambda\end{bmatrix} s_t $
General case: non-diagonalizable (geometric $<$ algebraic multiplicity)

Equilibria and Stability

An equilibrium state satisfies $$ s_{eq} = As_{eq} $$
- $s_{eq}=0$ is always an equilibrium
- if there is an eigenvalue equal to 1, then for the associated eigenvector, $Av=v$. Thus $cv$ is an equilibrium for any scalar $c$.
Broadly categorize the equilibrium $s_{eq}=0$ as
1. Asymptotically stable: $s_t\to 0$ for every initial state $s_0$
2. Unstable: $\|s_t\|\to\infty$ for some initial state $s_0$
There are examples which are neither (e.g. $A=I$)

Stability Theorem

Theorem: Let $\{\lambda_i\}_{i=1}^{n_s}\subset \mathbb C$ be the eigenvalues of $A$.
Then for $s_{t+1}=As_t$, the equilibrium $s_{eq}=0$ is

- asymptotically stable $\iff \max_{i\in[n_s]}|\lambda_i|<1$
- unstable if $\max_{i\in[n_s]}|\lambda_i|> 1$
- called "marginally (un)stable" if $\max_{i\in[n_s]}|\lambda_i|=1$
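
The theorem suggests a simple numerical test based on the largest eigenvalue magnitude. The sketch below is one way to write it; the function name and the tolerance `tol` (used to decide when a magnitude counts as equal to 1 in floating point) are assumptions for illustration.

```python
import numpy as np

def classify_stability(A, tol=1e-9):
    """Classify s_{t+1} = A s_t by the largest eigenvalue magnitude of A."""
    rho = max(abs(np.linalg.eigvals(A)))
    if rho < 1 - tol:
        return "asymptotically stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally (un)stable"      # eigenvalues alone do not decide this case

label_decay = classify_stability(np.diag([0.5, 0.9]))
label_grow = classify_stability(np.diag([1.2, 0.5]))
label_uav = classify_stability(np.array([[1.0, 1.0], [0.0, 1.0]]))
```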

Stability Theorem

Proof

If $A$ is diagonalizable, then any $s_0$ can be written as a linear combination of eigenvectors $s_0 = \sum_{i=1}^{n_s} \alpha_i v_i$
- By definition, $Av_i = \lambda_i v_i$
- Therefore, $s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i$
- Thus $s_t\to 0$ for every $s_0$ if and only if all $|\lambda_i|<1$, and if some $|\lambda_i|>1$, then $\|s_t\|\to\infty$ for any $s_0$ with $\alpha_i\neq 0$
Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form

Marginally (un)stable

We call $\max_i|\lambda_i|=1$ "marginally (un)stable"
Consider the independent investing example with $\lambda_1=1$ and $\lambda_2<1$ (not unstable): $$ s_{t} = \begin{bmatrix} 1 &0 \\0 & \lambda_2 \end{bmatrix}^t s_0 $$
Consider the UAV example (unstable): $$s_{t} = \begin{bmatrix} 1 & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\ & 1\end{bmatrix} s_0 $$
Depends on eigenvectors not just eigenvalues!
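
The contrast between the two examples shows up directly in simulation. In this sketch the value $\lambda_2=0.5$, the initial state, and the horizon are hypothetical.

```python
import numpy as np

t = 50
s0 = np.array([1.0, 1.0])

A_invest = np.diag([1.0, 0.5])                  # eigenvalues 1 and 0.5
A_uav = np.array([[1.0, 1.0],
                  [0.0, 1.0]])                  # repeated eigenvalue 1, one eigenvector

s_invest = np.linalg.matrix_power(A_invest, t) @ s0   # stays bounded
s_uav = np.linalg.matrix_power(A_uav, t) @ s0         # position grows linearly in t
```

Both matrices have largest eigenvalue magnitude 1, yet only the UAV state grows without bound (whenever the initial velocity is nonzero).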

Controlled Trajectories

Full dynamics depend on actions $$ s_{t+1} = As_t+Ba_t $$
The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
The internal dynamics $A$ determine the long term effects of actions
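
As a sanity check (not a proof), the closed-form expression can be compared against a step-by-step rollout. The random matrices, dimensions, and function names below are illustrative assumptions.

```python
import numpy as np

def rollout(A, B, s0, actions):
    """Apply s_{t+1} = A s_t + B a_t for each action in order."""
    s = s0
    for a in actions:
        s = A @ s + B @ a
    return s

def closed_form(A, B, s0, actions):
    """s_t = A^t s_0 + sum_{k=0}^{t-1} A^k B a_{t-k-1}."""
    t = len(actions)
    s = np.linalg.matrix_power(A, t) @ s0
    for k in range(t):
        s = s + np.linalg.matrix_power(A, k) @ B @ actions[t - k - 1]
    return s

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 3))
B = rng.normal(size=(3, 2))
s0 = rng.normal(size=3)
actions = rng.normal(size=(6, 2))       # a_0, ..., a_5

s_roll = rollout(A, B, s0, actions)
s_formula = closed_form(A, B, s0, actions)
```

The index $a_{t-k-1}$ reflects that an action taken $k+1$ steps ago has been propagated through $A$ exactly $k$ times.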

Example

Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\} $$

$s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}$
$s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}$
for $1\leq t\leq t_0$ (only the first thrust has taken effect), $s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}$ and for $t> t_0$ (both terms above apply), $s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}$
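
Simulating the two-thrust sequence step by step gives the same picture: the UAV drifts at unit velocity between the thrusts and then comes to rest. The values of `pos0`, `t0`, and the number of steps `T` are hypothetical.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([0.0, 1.0])

pos0, t0, T = 2.0, 4, 8                 # hypothetical start, second-thrust time, steps

s = np.array([pos0, 0.0])               # initially at rest
traj = [s]
for t in range(T):
    a = 1.0 if t == 0 else (-1.0 if t == t0 else 0.0)
    s = A @ s + B * a
    traj.append(s)
traj = np.array(traj)                   # traj[t] is s_t = [position, velocity]
```

Here `traj[t0]` still has velocity 1, because the action $a_{t_0}$ only affects $s_{t_0+1}$.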

Example

Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
Thrust according to distance from target $a_t = -(\mathsf{pos}_t- x)$

Linear Policy

Linear policy defined by $a_t=Ks_t$: $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$
The trajectories can be written as $$ s_{t} = (A+BK)^t s_0 $$
The internal dynamics $A$ are modified depending on $B$ and $K$
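
A sketch of how a linear policy changes the dynamics, using the UAV matrices and a hypothetical gain $K$ chosen for illustration (it is not from the lecture).

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-0.5, -1.0]])            # hypothetical gain: a_t = K s_t

A_cl = A + B @ K                        # closed-loop matrix A + BK
rho = max(abs(np.linalg.eigvals(A_cl))) # largest eigenvalue magnitude

# Roll out the policy and compare with (A + BK)^t s_0
s0 = np.array([3.0, 0.0])
s, T = s0, 20
for _ in range(T):
    s = A @ s + B @ (K @ s)
s_formula = np.linalg.matrix_power(A_cl, T) @ s0
```

With this gain the closed-loop eigenvalues are $\tfrac12 \pm \tfrac12 i$, so the largest magnitude is about $0.71$ and the state decays to zero, even though $A$ alone is marginally unstable.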

Example

Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
Thrust according to distance from target $a_t = -(\mathsf{pos}_t- x)$

$s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$
$\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$
$\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$
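
Checking the eigenvalues of this closed-loop matrix shows what the position-only policy does. The sketch below tracks the offset from the target, $e_t = s_t - [x, 0]^\top$ (a name introduced here), from a hypothetical initial offset.

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-1.0, 0.0]])             # a_t = -(pos_t - x)

A_cl = A + B @ K                        # [[1, 1], [-1, 1]]
eig_mags = abs(np.linalg.eigvals(A_cl)) # eigenvalues 1 +/- i

e0 = np.array([1.0, 0.0])               # hypothetical initial offset from the target
norms = [np.linalg.norm(np.linalg.matrix_power(A_cl, t) @ e0) for t in range(6)]
```

Both eigenvalues have magnitude $\sqrt 2 > 1$, so the offset grows by a factor of $\sqrt 2$ each step: reacting to position alone makes the UAV oscillate around the target with increasing amplitude.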

Example

Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
Thrust according to distance from target and velocity $a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)$

$\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$
$\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)$
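
The same eigenvalue check for this policy, again as a sketch with a hypothetical initial offset from the target:

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
K = np.array([[-1.0, -1.0]])            # a_t = -(pos_t + vel_t - x)

A_cl = A + B @ K                        # [[1, 1], [-1, 0]]
eig_mags = abs(np.linalg.eigvals(A_cl)) # eigenvalues (1 +/- i*sqrt(3)) / 2

e0 = np.array([1.0, 0.0])               # hypothetical initial offset from the target
offsets = np.array([np.linalg.matrix_power(A_cl, t) @ e0 for t in range(7)])
A_cl_6 = np.linalg.matrix_power(A_cl, 6)
```

Here both eigenvalues have magnitude exactly 1 and $(A+BK)^6 = I$, so the offset stays bounded but repeats every six steps rather than converging. Adding velocity feedback removes the growth seen with the position-only policy, but this particular gain is only marginally stable.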

Recap

PSet 2 due TONIGHT
PA 1 due Wednesday

Continuous Control
Linear Dynamics

Next lecture: Linear Quadratic Regulator