CS 4/5789: Introduction to Reinforcement Learning

Lecture 7: Continuous Control and LQR

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

Auditing (unofficial)
- Fill out form: https://forms.gle/xCRP9uXC8irZYKQW9
Homework
- Problem Set 2 due tonight
- Programming Assignment 1 due Wednesday
- PSet 3, PA 2 released Wednesday
First exam is Monday 3/4 during lecture
- If you have a conflict, post on Ed ASAP!

Agenda

1. Continuous Control

2. UAV Example

3. Linear Quadratic Regulator

Continuous MDP

So far, we have considered finitely many states and actions, $|\mathcal S| = S$ and $|\mathcal A| = A$
- Tabular representation of functions
In applications like robotics, states and actions can take continuous values
- e.g. position, velocity, force
- $\mathcal S = \mathbb R^{n_s}$ and $\mathcal A = \mathbb R^{n_a}$
Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes

Finite Horizon Optimal Control

Continuous states $\mathcal S = \mathbb R^{n_s}$ and actions $\mathcal A = \mathbb R^{n_a}$
- alternate terminology/notation (we won't use): states $x$ and "inputs" $u$
Cost to be minimized (rather than reward to be maximized)
- think of as "negative reward", or think of reward as "negative cost"
- potentially time-varying $c=(c_0,\dots, c_{H-1}, c_H)$
  - $c_t:\mathcal S\times\mathcal A\to \mathbb R$ for $t=0,\dots,H-1$
  - final state cost $c_H:\mathcal S\to \mathbb R$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$

Finite Horizon Optimal Control

Continuous $\mathcal S = \mathbb R^{n_s}$ and $\mathcal A = \mathbb R^{n_a}$
Cost to be minimized $c=(c_0,\dots, c_{H-1}, c_H)$
Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
Finite horizon $H$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$

$$\underset{\pi}{\text{minimize}}~~ \sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H) \quad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$
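The objective can be evaluated for any fixed policy by rolling the dynamics forward. Below is a minimal Python sketch of that evaluation; the function name `rollout_cost`, its argument names, and the toy scalar system are assumptions introduced here for illustration and do not come from the lecture.

```python
def rollout_cost(f, policy, cost, final_cost, s0, H):
    """Total cost of following `policy` for H steps from s0 under dynamics f.

    policy(t, s) plays the role of pi_t(s); cost(t, s, a) plays the role of c_t(s, a).
    """
    s, total = s0, 0.0
    for t in range(H):
        a = policy(t, s)          # a_t = pi_t(s_t)
        total += cost(t, s, a)    # accumulate c_t(s_t, a_t)
        s = f(s, a)               # s_{t+1} = f(s_t, a_t)
    return total + final_cost(s)  # add the final state cost c_H(s_H)


# Toy scalar system (illustrative values only).
total = rollout_cost(
    f=lambda s, a: s + a,
    policy=lambda t, s: -0.5 * s,
    cost=lambda t, s, a: s**2 + a**2,
    final_cost=lambda s: s**2,
    s0=1.0,
    H=2,
)
```

Optimal control then amounts to searching over policies for the one that minimizes this quantity.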

Not in Scope: Stochastic & Infinite Horizon

Non-deterministic dynamics are out of our scope (requiring a background in continuous random variables)
Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
Though we won't study them, these settings are routine for LQR

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}$

Example

Setting: hovering UAV over a target
- cost: distance from target
Action: thrust right/left
Newton's second law
- $a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})$
- $\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$
Effect on position
- $\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}$
State is $s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$

Example

Setting: hovering UAV over a target
Action: thrust right/left
State is $s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}$
- $\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t$
- $\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}$

$\mathcal S = \mathbb R^2$, $\mathcal A = \mathbb R$
$c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2$
$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$

$\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}$
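As a concrete reference, the UAV model can be written down directly in NumPy. This is a sketch under assumed values: the numbers chosen for `delta`, `m`, `lam`, and `target` are illustrative, and the function names `f` and `c` simply mirror the symbols on the slide.

```python
import numpy as np

delta, m = 1.0, 1.0       # time step and mass (illustrative values)
lam, target = 0.5, 0.0    # action penalty and target position (illustrative values)

A = np.array([[1.0, delta],
              [0.0, 1.0]])
B = np.array([[0.0],
              [delta / m]])


def f(s, a):
    """One step of the dynamics; s = [position, velocity], a is a scalar thrust."""
    return A @ s + B[:, 0] * a


def c(s, a):
    """Stage cost: squared distance from the target plus an action penalty."""
    return (s[0] - target) ** 2 + lam * a ** 2


s_next = f(np.array([1.0, 0.0]), -1.0)   # thrust left from position 1 at rest
```

Note that the thrust changes the velocity immediately but only reaches the position one step later, which is what makes the choice of actions nontrivial.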

Q: How would you pick actions?

Simple policy?

Setting: hovering UAV over a target
$c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2$
$f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t$
Guess: push back against the position error $$\pi(s_t) = -(\mathsf{position}_t-\mathsf{target}) $$

Starting at $s_0=[1, 0]$, let $\Delta=m=1$, let $\mathsf{target}=0$:
- $s_1 = [1, -1]$, $s_2=[0, -2]$, $s_3=[-2, -2]$, $s_4=[-4, 0]$, $s_5=[-4, 4]$
Unstable! The oscillations grow even with different gains $\pi(s)=-\gamma(\mathsf{position}-\mathsf{target})$ (as seen in simulations).
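The trajectory above can be reproduced with a short simulation. The sketch below assumes $\Delta=m=1$ and $\mathsf{target}=0$ as on the slide; the helper names `simulate` and `closed_loop_radius` are introduced here for illustration. The eigenvalue check is an addition not taken from the slides: under this policy the state evolves as $s_{t+1} = (A+BK)s_t$ with $K=\begin{bmatrix}-\gamma & 0\end{bmatrix}$, and the eigenvalues of $A+BK$ are $1\pm i\sqrt{\gamma}$, which have magnitude $\sqrt{1+\gamma}>1$ for every $\gamma>0$.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])   # Delta = 1
B = np.array([0.0, 1.0])                 # Delta / m = 1


def simulate(gain, s0, steps):
    """Roll out a_t = -gain * position_t (target = 0) and return all visited states."""
    states = [np.array(s0, dtype=float)]
    for _ in range(steps):
        s = states[-1]
        a = -gain * s[0]
        states.append(A @ s + B * a)
    return np.array(states)


def closed_loop_radius(gain):
    """Largest eigenvalue magnitude of A + B K for K = [-gain, 0]."""
    K = np.array([[-gain, 0.0]])
    return np.max(np.abs(np.linalg.eigvals(A + B[:, None] @ K)))


traj = simulate(gain=1.0, s0=[1.0, 0.0], steps=5)
radius = closed_loop_radius(1.0)
```

A magnitude larger than 1 is consistent with the growing oscillations, whichever positive gain is chosen.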

Discretization?

Could approximate continuous states/actions by discretizing on a grid with resolution $\varepsilon$
How many states/actions does this require?
- Let $B_s$ bound* the size of the maximum state and $B_a$ bound the size of the maximum action
- On the order of $(B_s/\varepsilon)^{n_s}$ grid points for states and $(B_a/\varepsilon)^{n_a}$ for actions
*bounds depend on dynamics, horizon, initial state, etc (nontrivial!)
The counts grow exponentially in the dimensions $n_s$ and $n_a$, so this is not a feasible approach in many cases!
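To get a feel for the numbers, the count can be computed for a hypothetical system. The dimensions, bounds, and resolution below are made-up values chosen only to illustrate the exponential growth, and `grid_size` is a helper name introduced here.

```python
def grid_size(bound, eps, dim):
    """Approximate number of grid points: (bound / eps) bins in each of `dim` coordinates."""
    return round(bound / eps) ** dim


n_states = grid_size(bound=1.0, eps=0.01, dim=6)    # 100 bins per coordinate, 6-dim state
n_actions = grid_size(bound=1.0, eps=0.01, dim=2)   # 100 bins per coordinate, 2-dim action
table_entries = n_states * n_actions                # size of a tabular Q function
```

Even at this modest resolution, a tabular Q function would need $10^{16}$ entries.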

Linear Dynamics

The dynamics function $f$ has a linear form $$ s_{t+1} = As_t + Ba_t $$
$A\in\mathbb R^{n_s\times n_s}$ and $B\in\mathbb R^{n_s\times n_a}$ are dynamics matrices
$A$ describes the evolution of the state when there is no action (internal dynamics)
$B$ describes the effects of actions

Quadratic Costs

Cost function $c(s,a) = s^\top Q s + a^\top R a$
$Q$ and $R$ are cost matrices, usually positive semi-definite
$Q$ describes penalty on states, $R$ is cost of actions

Important background on matrices:

A matrix is symmetric if $M=M^\top$
A symmetric matrix is positive semi-definite (PSD) if all its eigenvalues are greater than or equal to 0
A symmetric matrix is positive definite if all its eigenvalues are strictly greater than 0
All positive definite matrices are invertible
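These definitions translate directly into numerical checks. The following is a minimal sketch using NumPy's symmetric eigenvalue routine; the helper names and the tolerance `tol` are assumptions, and the example matrices are chosen only for illustration.

```python
import numpy as np


def is_symmetric(M, tol=1e-10):
    return np.allclose(M, M.T, atol=tol)


def is_psd(M, tol=1e-10):
    """Symmetric with all eigenvalues >= 0 (up to a numerical tolerance)."""
    return is_symmetric(M, tol) and np.all(np.linalg.eigvalsh(M) >= -tol)


def is_pd(M, tol=1e-10):
    """Symmetric with all eigenvalues strictly > 0 (up to a numerical tolerance)."""
    return is_symmetric(M, tol) and np.all(np.linalg.eigvalsh(M) > tol)


Q = np.array([[1.0, 0.0], [0.0, 0.0]])   # eigenvalues 1 and 0: PSD but not positive definite
R = np.array([[0.5]])                    # eigenvalue 0.5: positive definite, hence invertible
```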

Example: Quadratic Costs

Recall setting: hovering UAV over a target $$f(s_t, a_t) = \underbrace{\begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}}_{A}s_t + \underbrace{\begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}}_{B}a_t,\quad c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2$$
To write quadratic cost, redefine the state as $\tilde s_t = \begin{bmatrix}\mathsf{position}_t-\mathsf{target}\\ \mathsf{velocity}_t\end{bmatrix}$
- Exercise: verify that we still have $\tilde s_{t+1}=f(\tilde s_t, a_t)$
Then we have that $$c_t(\tilde s_t, a_t) = (\tilde s_t[1])^2+\lambda a_t^2 = \tilde s_t^\top \underbrace{\begin{bmatrix}1&0\\ 0&0\end{bmatrix}}_{Q} \tilde s_t + \underbrace{\lambda}_{R}a_t^2$$
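The equality between the original cost and its quadratic form can be spot-checked numerically. This sketch uses arbitrary illustrative values for `lam`, `target`, and the state; the function names are introduced here and are not from the lecture.

```python
import numpy as np

lam, target = 0.5, 3.0                   # illustrative values
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = lam


def cost_original(position, a):
    return (position - target) ** 2 + lam * a ** 2


def cost_quadratic(s_tilde, a):
    return s_tilde @ Q @ s_tilde + R * a ** 2


position, velocity, a = 5.0, -1.0, 2.0
s_tilde = np.array([position - target, velocity])   # shifted state
```

Both functions return the same value at this point; the zero in the lower-right corner of $Q$ reflects that velocity is not penalized.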

Linear Quadratic Regulator

Special case of optimal control problem with

Quadratic cost $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H(s) = s^\top Qs$$
Linear dynamics $$s_{t+1} = As_t+ Ba_t$$

$$\underset{\pi}{\text{minimize}}~~ \sum_{t=0}^{H-1} \left(s_t^\top Qs_t +a_t^\top Ra_t\right)+s_H^\top Q s_H \quad \text{s.t.}\quad s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)$$

DP for Optimal Control

Reformulating for optimal control (max vs min), our general purpose dynamic programming algorithm is:

Initialize $V^\star_H(s) = c_H(s)$
For $t=H-1, H-2, ..., 0$:
- $Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))$ (transitions are deterministic, so the expectation over $s'$ reduces to evaluating at $s'=f(s,a)$)
- $\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)$
- $V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )$
Return $\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})$

LQR via DP

$V_H^\star(s) = s^\top Q s$
$t=H-1$: $\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)$
- $\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba$
General minimization (for symmetric positive definite $M$): $\arg\min_a c + a^\top M a + 2m^\top a$
- $2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m$
  - $ \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs$
- minimum is $c-m^\top M^{-1} m$
  - $V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s$
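The closed form for the last step can be sanity-checked numerically. The sketch below plugs in the UAV matrices with $\Delta=m=1$ and $\lambda=\frac{1}{2}$ (values assumed here for illustration); the names `K_last`, `P_last`, and `q_last` are introduced for this example only.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[0.5]])

M = R + B.T @ Q @ B                           # quadratic term in a
K_last = -np.linalg.solve(M, B.T @ Q @ A)     # pi_{H-1}(s) = K_last @ s
P_last = Q + A.T @ Q @ A - A.T @ Q @ B @ np.linalg.solve(M, B.T @ Q @ A)


def q_last(s, a):
    """Objective at t = H-1 for state s and action a (a 1-d array)."""
    s_next = A @ s + B @ a
    return s @ Q @ s + a @ R @ a + s_next @ Q @ s_next


s = np.array([1.0, -2.0])
a_star = K_last @ s
```

For these particular matrices the last gain comes out as zero: thrust applied at $t=H-1$ only changes the velocity at time $H$, which the final cost does not penalize, so spending any action is wasteful.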

DP: $V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))$

Theorem: For $t=0,\dots ,H-1$, the optimal value function is quadratic and the optimal policy is linear: $$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as $P_{H} = Q$ and

$P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$
$K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$

Proof by induction:
- $t=H$ is the base case
- inductive step very similar to previous derivation
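The inductive step can be spelled out as follows. This is a sketch assuming the inductive hypothesis $V^\star_{t+1}(s)=s^\top P_{t+1}s$ with $P_{t+1}$ symmetric, and assuming $R+B^\top P_{t+1}B$ is positive definite (for instance, $R$ positive definite and $P_{t+1}$ PSD):

$$\begin{aligned}
V_t^\star(s) &= \min_a~ s^\top Q s + a^\top R a + (As+Ba)^\top P_{t+1}(As+Ba)\\
&= \min_a~ s^\top (Q + A^\top P_{t+1} A)s + a^\top (R+B^\top P_{t+1}B)a + 2 s^\top A^\top P_{t+1} B a
\end{aligned}$$

Applying the general minimization with $M = R+B^\top P_{t+1}B$, $m = B^\top P_{t+1}As$, and $c = s^\top (Q + A^\top P_{t+1} A)s$ gives

$$\pi_t^\star(s) = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\,s = K_t s$$

$$V_t^\star(s) = s^\top \left(Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\right) s = s^\top P_t s$$

This is the $t=H-1$ derivation with $P_{t+1}$ in place of $Q$ in the next-state term.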

$\pi^\star = (K_0,\dots,K_{H-1}) = \mathsf{LQR}(A,B,Q,R)$
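The recursion in the theorem can be sketched in a few lines of NumPy. The function name `lqr` and the explicit horizon argument `H` are assumptions made here (the slide writes $\mathsf{LQR}(A,B,Q,R)$ with the horizon implicit), and the example inputs are the UAV matrices with an arbitrarily chosen horizon.

```python
import numpy as np


def lqr(A, B, Q, R, H):
    """Backward recursion for finite-horizon LQR.

    Returns the gains [K_0, ..., K_{H-1}] and cost matrices [P_0, ..., P_H].
    """
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q                                   # base case: V_H(s) = s^T Q s
    for t in range(H - 1, -1, -1):
        BtP = B.T @ P[t + 1]
        M = R + BtP @ B
        K[t] = -np.linalg.solve(M, BtP @ A)    # K_t = -(R + B^T P B)^{-1} B^T P A
        # P_t = Q + A^T P A - A^T P B (R + B^T P B)^{-1} B^T P A, written using K_t
        P[t] = Q + A.T @ P[t + 1] @ A + A.T @ P[t + 1] @ B @ K[t]
    return K, P


A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[0.5]])
H = 10                                         # illustrative horizon
K, P = lqr(A, B, Q, R, H)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a common numerical choice; either matches the formula in the theorem.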

Example

Setting: hovering UAV over a target
Action: thrust right/left
State: distance from target, velocity
LQR$\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)$

$\pi_t^\star(s) = K^\star_t s= \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s$

[Plot: the gains $\gamma^\mathsf{pos}_t$ and $\gamma^\mathsf{vel}_t$ as functions of time $t$, up to the horizon $H$]
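To see the LQR policy in action on the UAV, the gains can be computed and rolled out from $s_0=[1,0]$. This is a sketch under assumptions: the horizon `H = 20` is chosen here because the slides do not state one, and the variable names are illustrative. The recursion is repeated inline so the snippet runs on its own.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[0.5]])
H = 20   # assumed horizon

# Backward recursion: the first pass computes K_{H-1}, the last pass computes K_0.
P = Q
gains = []
for _ in range(H):
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K
    gains.append(K)
gains.reverse()   # now gains[t] is K_t, and P holds P_0

# Roll out a_t = K_t s_t and accumulate the cost along the way.
s = np.array([1.0, 0.0])
positions, total = [s[0]], 0.0
for K in gains:
    a = K @ s
    total += s @ Q @ s + a @ R @ a
    s = A @ s + B @ a
    positions.append(s[0])
total += s @ Q @ s   # final state cost
```

The accumulated cost `total` should match $s_0^\top P_0 s_0$, as the theorem predicts, and plotting `positions` or the entries of `gains` against $t$ gives a picture comparable to the one on the slide.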

Recap

PSet 2 due TONIGHT
PA 1 due Wednesday

Continuous Control
Linear Quadratic Regulator

Next lecture: Linear Dynamics & Stability

Sp24 CS 4/5789: Lecture 7

By Sarah Dean
