CS 4/5789: Introduction to Reinforcement Learning
Lecture 9: LQR & Nonlinear Control
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework this week
- PSet due Wednesday
- PA due 3/1
- My office hours:
- Tuesdays 10:30-11:30am in Gates 416A
- cancelled 2/28 (February break)
- Wednesdays 4-4:50pm in Olin 255 (right after lecture)
Agenda
1. Recap: Control & LQR
2. Optimal LQR Policy
3. Nonlinear Approximation
4. Local Linear Control
Recap: Optimal Control
- Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
- Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
- Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
- Finite horizon \(H\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
Recap: DP for OC
The general-purpose dynamic programming algorithm, in the context of optimal control:
- Initialize \(V^\star_H(s) = c_H(s)\)
- For \(t=H-1, H-2, ..., 0\):
- \(Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))\)
- \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
- \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
- Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
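The recursion above is over continuous states and actions, so it cannot be run exactly; below is a minimal sketch of the same backward pass on a discretized grid. The toy dynamics, cost, grids, and horizon are illustrative assumptions, not the lecture's example:

```python
import numpy as np

# Illustrative 1-D problem: drive the state toward 0 (not the lecture's UAV example).
H = 10
states = np.linspace(-2, 2, 81)          # discretized state grid
actions = np.linspace(-1, 1, 21)         # discretized action grid

f = lambda s, a: np.clip(s + a, -2, 2)   # toy dynamics, clipped to stay on the grid
c = lambda s, a: s**2 + 0.1 * a**2       # toy stage cost
c_H = lambda s: s**2                     # terminal cost

nearest = lambda s: np.abs(states - s).argmin()  # project next state onto the grid

V = [None] * (H + 1)
pi = [None] * H
V[H] = np.array([c_H(s) for s in states])        # V*_H(s) = c_H(s)

for t in range(H - 1, -1, -1):                   # backward in time
    V[t] = np.empty_like(states)
    pi[t] = np.empty_like(states)
    for i, s in enumerate(states):
        # Q*_t(s, a) = c_t(s, a) + V*_{t+1}(f(s, a)) over the action grid
        q = np.array([c(s, a) + V[t + 1][nearest(f(s, a))] for a in actions])
        j = q.argmin()
        pi[t][i] = actions[j]                    # greedy (argmin) policy
        V[t][i] = q[j]

print(pi[0][nearest(1.0)])                       # action taken at s = 1.0, t = 0
```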
Recap: LQR
Special case of optimal control problem with
- Quadratic cost $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H(s) = s^\top Qs$$ where \(Q\) is symmetric and positive semi-definite and \(R\) is symmetric and positive definite
- Linear dynamics $$s_{t+1} = As_t+ Ba_t$$
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} s_t^\top Qs_t +a_t^\top Ra_t+s_H^\top Q s_H\)
s.t. \(s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)\)
Important background:
- A matrix is symmetric if \(M=M^\top\)
- A matrix is positive semi-definite (PSD) if all its eigenvalues are greater than or equal to 0
- A matrix is positive definite if all its eigenvalues are strictly greater than 0
- All positive definite matrices are invertible
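A quick numerical way to check these properties on a concrete matrix (a small illustrative snippet; the matrix is arbitrary):

```python
import numpy as np

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

symmetric = np.allclose(M, M.T)
eigvals = np.linalg.eigvalsh(M)              # eigenvalues of a symmetric matrix
psd = symmetric and np.all(eigvals >= 0)     # positive semi-definite
pd = symmetric and np.all(eigvals > 0)       # positive definite
print(symmetric, psd, pd, eigvals)
```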
Resources
Linear algebra and probability background*
- Interactive Linear Algebra
- especially Ch 6
- Linear Algebra Review and Reference
- Review of Probability Theory
*these references are not necessarily an exact match to the course and they are not required
Recall: Example
$$\min_{a_0, a_1}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + s_1^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_1 + s_2^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_2+\lambda a_{0}^2+\lambda a_1^2 $$
$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad \quad s_{2} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{1} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{1} $$
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\lambda,H=2\right)\)
- Minimizing over \(a_1\) first: $$\min_{a_1}\quad (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 + (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 + \lambda a_1^2$$
- The action \(a_1\) only enters through \(\lambda a_1^2\), so \(a_1^\star=0\)
Recall: Example
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 + (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 +\lambda a_{0}^2 $$
$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad \qquad\qquad\qquad\qquad$$
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\lambda,H=2\right)\)
- Recall \(a_1^\star=0\)
- Substituting the dynamics for \(s_1\): $$\min_{a_0}\quad s_0^\top \begin{bmatrix}3&3\\ 3&5\end{bmatrix}s_0 + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + (1 +\lambda)a_0^2 $$
- which is minimized at \( a_0^\star = -\frac{\begin{bmatrix}1&2\end{bmatrix}s_0}{1+\lambda} = -\frac{1}{1+\lambda}(\mathsf{pos}_0-x+2\mathsf{vel}_0)\)
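A small numerical sanity check of this closed form, comparing it against a brute-force grid search over \((a_0, a_1)\) (the values of \(\lambda\) and \(s_0\) below are arbitrary illustrative choices):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
lam = 0.5                       # illustrative lambda
s0 = np.array([1.0, 0.5])       # illustrative initial state [pos, vel]

def cost(a0, a1):
    s1 = A @ s0 + B.flatten() * a0
    s2 = A @ s1 + B.flatten() * a1
    return s0 @ Q @ s0 + s1 @ Q @ s1 + s2 @ Q @ s2 + lam * (a0**2 + a1**2)

# brute-force search over a grid of (a0, a1) pairs
grid = np.linspace(-3, 3, 241)
best = min((cost(a0, a1), a0, a1) for a0 in grid for a1 in grid)
print("brute force:", best[1], best[2])   # matches up to the 0.025 grid resolution

# closed form from the slides: a0* = -[1 2] s0 / (1 + lambda), a1* = 0
print("closed form:", -(s0[0] + 2 * s0[1]) / (1 + lam), 0.0)
```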
Agenda
1. Recap: Control & LQR
2. Optimal LQR Policy
3. Nonlinear Approximation
4. Local Linear Control
LQR via DP
- \(V_H^\star(s) = s^\top Q s\)
- \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
- \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba\)
- General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\) for symmetric \(M \succ 0\)
- Setting the gradient to zero: \(2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m\), and the minimum is \(c-m^\top M^{-1} m\)
- Here \(M = R+B^\top QB\) is positive definite (hence invertible) and \(m = B^\top QAs\), so \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
- \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)
DP: \(V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))\)
Important background:
- The gradient of a function \(f:\mathbb R^d \to\mathbb R\) is the vector $$\nabla f(x) = \begin{bmatrix}\frac{\partial f}{\partial x_1} \\ \vdots \\ \frac{\partial f}{\partial x_d}\end{bmatrix}$$
- If \(f\) has a minimum at \(x_\star\) then $$\nabla f(x_\star) = 0$$
- The gradients of quadratic and linear functions are $$\nabla \left[x^\top Mx\right]=Mx+M^\top x,\quad \nabla \left[m^\top x\right] = m $$
LQR via DP
- \(V_H^\star(s) = s^\top Q s\)
- \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
- \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
- \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)
Theorem: For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$
where the matrices are defined as \(P_{H} = Q\) and
- \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
- \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
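A minimal numpy sketch of this backward recursion; the function name and interface are my own, not a library API:

```python
import numpy as np

def lqr_finite_horizon(A, B, Q, R, H):
    """Backward Riccati recursion from the theorem: returns gains K_0..K_{H-1} and
    cost-to-go matrices P_0..P_H, so that pi*_t(s) = K_t s and V*_t(s) = s^T P_t s."""
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q                                              # P_H = Q
    for t in range(H - 1, -1, -1):
        S = R + B.T @ P[t + 1] @ B                        # positive definite, so invertible
        K[t] = -np.linalg.solve(S, B.T @ P[t + 1] @ A)    # K_t
        P[t] = Q + A.T @ P[t + 1] @ A \
               - A.T @ P[t + 1] @ B @ np.linalg.solve(S, B.T @ P[t + 1] @ A)  # P_t
    return K, P

# Sanity check on the UAV example from the recap (lambda = 1/2, H = 2):
A = np.array([[1., 1.], [0., 1.]]); B = np.array([[0.], [1.]])
Q = np.array([[1., 0.], [0., 0.]]); R = np.array([[0.5]])
K, P = lqr_finite_horizon(A, B, Q, R, H=2)
print(K[0])   # expect -[1, 2] / (1 + 0.5) = [-0.667, -1.333]
print(K[1])   # expect [0, 0], matching a_1* = 0
```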
LQR Proof
- Base case: \(V_H^\star(s) = s^\top Q s\)
- Inductive step: Assume that \(V^\star_{t+1} (s) = s^\top P_{t+1} s\).
- DP at \(t\): \(V_t^\star(s)= \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top P_{t+1} (As+Ba)\)
- \(\quad \min_{a} s^\top (Q+A^\top P_{t+1}A) s+a^\top (R+B^\top P_{t+1} B) a+2s^\top A^\top P_{t+1} Ba\)
- General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\) gives \(a_\star = -M^{-1} m\) and minimum is \(c-m^\top M^{-1} m\)
- \( \pi_{t}^\star(s)=-(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}As\)
- \(V_{t}^\star(s) = s^\top (Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A) s\)
Theorem: \(V^\star_t (s) = s^\top P_t s\) and \(\pi_t^\star(s) = K_t s\) where \(P_{H} = Q\),
\(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
\(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)
- Optimal policy: \(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s = \gamma^\mathsf{pos}_t (\mathsf{pos} - x) + \gamma^\mathsf{vel}_t \mathsf{vel} \)
- [Plot: the gains \(\gamma^\mathsf{pos}_t\) and \(\gamma^\mathsf{vel}_t\) plotted against \(t\) from \(0\) to \(H\).]
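To reproduce what the plot shows, the same backward recursion can be run for this system over a longer horizon (the choice \(H=20\) below is an arbitrary illustrative value):

```python
import numpy as np

A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
Q = np.array([[1., 0.], [0., 0.]])
R = np.array([[0.5]])              # lambda = 1/2
H = 20                             # illustrative horizon

P = Q                              # P_H = Q
gains = [None] * H
for t in range(H - 1, -1, -1):
    S = R + B.T @ P @ B
    K = -np.linalg.solve(S, B.T @ P @ A)       # K_t = [gamma^pos_t, gamma^vel_t]
    gains[t] = K.ravel()
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(S, B.T @ P @ A)

for t in (0, 1, H - 2, H - 1):
    # gains approach constant values far from the horizon; K_{H-1} = [0, 0]
    print(t, gains[t])
```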
LQR Extensions
- The same dynamic programming method extends in a straightforward manner when:
- Dynamics and costs are time varying
- Affine term in the dynamics, cross terms in the costs
- General form: \( f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t\) and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t $$
- General solution: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}) $$
- Many applications can be reformulated this way:
- e.g. trajectory tracking \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\) (expanded just after this list)
- Nonlinear dynamics and costs (Programming Assignment 2)
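For instance, the trajectory-tracking cost mentioned above fits the general quadratic form by expanding the square and identifying terms with \(Q_t, R_t, M_t, q_t, r_t, v_t\):
$$\|s-\bar s_t\|_2^2 + \|a\|_2^2 = s^\top I s + a^\top I a - 2\bar s_t^\top s + \bar s_t^\top \bar s_t$$
so \(Q_t = I\), \(R_t = I\), \(M_t = 0\), \(q_t = -2\bar s_t\), \(r_t = 0\), and \(v_t = \bar s_t^\top \bar s_t\).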
Agenda
1. Recap: Control & LQR
2. Optimal LQR Policy
3. Nonlinear Approximation
4. Local Linear Control
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- imperfect: attenuated at high thrusts and velocities
- The dynamics:
- \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+ \mathsf{velocity}_{t}\)
- \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + e^{- (\mathsf{velocity}_t^2+a_t^2)} a_t\)
- When velocity/thrust is:
- small, then \(\mathsf{velocity}_{t+1}\approx \mathsf{velocity}_{t} +a_t \)
- large, then \(\mathsf{velocity}_{t+1}\approx \mathsf{velocity}_{t} \)
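A tiny numeric illustration of this attenuation, as a sketch (the specific state and thrust values below are arbitrary):

```python
import numpy as np

def f(s, a):
    """Nonlinear UAV dynamics from the slide: s = [pos, vel], scalar thrust a."""
    pos, vel = s
    return np.array([pos + vel, vel + np.exp(-(vel**2 + a**2)) * a])

s = np.array([0.0, 0.0])
print(f(s, 0.1))                      # small thrust at rest: velocity changes by ~0.1
print(f(s, 3.0))                      # large thrust: exp(-9) * 3 ~ 0.0004, almost no effect
print(f(np.array([0.0, 3.0]), 0.1))   # high velocity also attenuates the thrust
```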
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- imperfect: attenuated at high thrusts and velocities
- Goal: stay near target position \(0\)
- Field of view is limited
- Thus cost is $$c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2$$
Low-Order Approximation
- How to find simpler (e.g. linear or quadratic) approximations?
- For a nonlinear differentiable function \(g:\mathbb R\to\mathbb R\)
- Recall Taylor Expansion $$ g(x) = g(x_0) +g'(x_0)(x-x_0)+\frac{1}{2}g''(x_0)(x-x_0)^2 + ... $$
- When \(x\) is close to \(x_0\), the higher order terms become vanishingly small: \(\epsilon^p\to 0\) as \(p\to\infty\) for \( |\epsilon|<1\)

Linear Approximation
- Linear, also called first-order, approximation $$ g(x) \approx g(x_0) + g'(x_0)(x-x_0) $$
- For vector-valued multi-variate function \(f:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R^{n_s}\) $$ f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) $$
- Jacobians \( \nabla_s f(s, a) \in\mathbb R^{n_s\times n_s}\) and \( \nabla_a f(s, a) \in\mathbb R^{n_a\times n_s}\) contain the partial derivatives \( \frac{\partial f_j (s,a)}{\partial s_i}\) and \( \frac{\partial f_j (s,a)}{\partial a_i}\) in entry \((i,j)\):
- row \(i\) represents effects of the \(i\)th dimension of the current state/action; column \(j\) represents effects on \(f_j\), i.e. the \(j\)th dimension of the next state
Example
- Setting: hovering UAV over a target
- state \(s = [\mathsf{pos}, \mathsf{vel}]\)
- The dynamics: $$ f(s_t, a_t) = \begin{bmatrix} \mathsf{pos}_{t}+ \mathsf{vel}_{t}\\ \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t \end{bmatrix} = \begin{bmatrix} f_1(s_t,a_t)\\ f_2(s_t,a_t)\end{bmatrix} $$
- \(\nabla_s f(s,a) = \begin{bmatrix} \frac{\partial f_1 (s,a)}{\partial \mathsf{pos}} & \frac{\partial f_2 (s,a)}{\partial \mathsf{pos}} \\ \frac{\partial f_1 (s,a)}{\partial \mathsf{vel}} & \frac{\partial f_2 (s,a)}{\partial \mathsf{vel}} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 1 & 1-2a\,\mathsf{vel}\, e^{-(\mathsf{vel}^2+a^2)} \end{bmatrix} \)
- \(\nabla_a f(s,a) = \begin{bmatrix} \frac{\partial f_1 (s,a)}{\partial a} & \frac{\partial f_2 (s,a)}{\partial a} \end{bmatrix} =\begin{bmatrix} 0 & (1-2a^2) e^{-(\mathsf{vel}^2+a^2)} \end{bmatrix}\)
Quadratic Approximation
- Second-order approximation $$ g(x) \approx g(x_0) + g'(x_0)(x-x_0) + \frac{1}{2} g''(x_0)(x-x_0)^2$$
- For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0) + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
- Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\)
- Hessians \( \nabla_s^2 c(s, a) \in\mathbb R^{n_s\times n_s}\), \( \nabla_a^2 c(s, a) \in\mathbb R^{n_a \times n_a}\), and \( \nabla_{as}^2 c(s, a) \in\mathbb R^{n_a\times n_s}\) contain second derivatives
Quadratic Approximation
- For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\)
- Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\) contain the partial derivatives \( \frac{\partial c (s,a)}{\partial s_i}\) and \( \frac{\partial c (s,a)}{\partial a_i}\) in entry \(i\)
- entry \(i\) represents the effect of the \(i\)th dimension of the current state/action
Quadratic Approximation
- For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\)
- Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\)
- Hessians \( \nabla_s^2 c(s, a) \in\mathbb R^{n_s\times n_s}\), \( \nabla_a^2 c(s, a) \in\mathbb R^{n_a \times n_a}\), \( \nabla_{as}^2 c(s, a) \in\mathbb R^{n_a\times n_s}\) contain the second derivatives \( \frac{\partial^2 c (s,a)}{\partial s_i\partial s_j}\), \( \frac{\partial^2c(s,a)}{\partial a_i\partial a_j}\), and \( \frac{\partial^2 c (s,a)}{\partial a_i \partial s_j}\) in entry \((i,j)\)
- \( \nabla_s^2 c\) and \( \nabla_a^2 c\) are symmetric
Example
- Setting: hovering UAV over a target
- state \(s = [\mathsf{pos}, \mathsf{vel}]\)
- The cost: $$c(s,a) = (1-e^{-\mathsf{pos}^2}) +\lambda a^2$$
- \(\nabla_s c(s,a)= \begin{bmatrix} 2\mathsf{pos}\cdot e^{-\mathsf{pos}^2} \\ 0 \end{bmatrix} \)
- \(\nabla_s^2 c(s,a)= \begin{bmatrix} 2(1-2\mathsf{pos}^2) e^{-\mathsf{pos}^2} & 0\\ 0& 0 \end{bmatrix} \)
- \(\nabla_a c(s,a)= 2\lambda a\) and \(\nabla_a^2 c(s,a)= 2\lambda\)
- \(\nabla_{as}^2 c(s,a)=0\)
Finite Difference Approximation
- For scalar function $$g'(x) \approx \frac{g(x+\delta)-g(x-\delta)}{2\delta}$$
- For multivariate $$ \frac{\partial f_j (s,a)}{\partial s_i} \approx \frac{f_j(s+\delta e_i,a)-f_j(s-\delta e_i,a)}{2\delta}$$ where \(e_i\) is a standard basis vector
- For second derivatives, repeat the rule: $$\frac{\partial^2 c (s,a)}{\partial a_i \partial s_j} \approx \frac{1}{2\delta}\Big[ \frac{\partial c (s,a +\delta e_i)}{\partial s_j} - \frac{\partial c (s,a -\delta e_i)}{\partial s_j} \Big]$$ which expands to $$\frac{\partial^2 c (s,a)}{\partial a_i \partial s_j} \approx \frac{1}{2\delta}\Big[ \frac{c(s+\delta e_j,a +\delta e_i)- c(s-\delta e_j,a +\delta e_i)}{2\delta} - \frac{c(s+\delta e_j,a -\delta e_i)-c(s-\delta e_j,a -\delta e_i)}{2\delta} \Big]$$
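Below is a sketch of these central-difference formulas in numpy; the helper names (`fd_jacobian_s`, `fd_hessian_as`) and the step size \(\delta\) are my own illustrative choices, not a library API:

```python
import numpy as np

def fd_jacobian_s(f, s, a, delta=1e-5):
    """Approximate grad_s f(s, a): entry (i, j) ~ d f_j / d s_i, via central differences."""
    n_s, n_out = len(s), len(f(s, a))
    J = np.zeros((n_s, n_out))
    for i in range(n_s):
        e = np.zeros(n_s); e[i] = delta
        J[i] = (f(s + e, a) - f(s - e, a)) / (2 * delta)
    return J

def fd_hessian_as(c, s, a, delta=1e-4):
    """Approximate grad^2_{as} c(s, a): entry (i, j) ~ d^2 c / (d a_i d s_j), by repeating the rule."""
    def grad_s(a_shift):
        # inner central difference in s, at the shifted action
        return np.array([(c(s + delta * ej, a_shift) - c(s - delta * ej, a_shift)) / (2 * delta)
                         for ej in np.eye(len(s))])
    H = np.zeros((len(a), len(s)))
    for i in range(len(a)):
        ei = np.zeros(len(a)); ei[i] = delta
        H[i] = (grad_s(a + ei) - grad_s(a - ei)) / (2 * delta)   # outer difference in a
    return H

# Check against the analytic Jacobian of the UAV dynamics at s = [0, 0.2], a = [0.3]:
f_uav = lambda s, a: np.array([s[0] + s[1], s[1] + np.exp(-(s[1]**2 + a[0]**2)) * a[0]])
print(fd_jacobian_s(f_uav, np.array([0.0, 0.2]), np.array([0.3])))
# compare to [[1, 0], [1, 1 - 2 a vel e^{-(vel^2+a^2)}]] ~ [[1, 0], [1, 0.895]]
```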
Agenda
1. Recap: Control & LQR
2. Optimal LQR Policy
3. Nonlinear Approximation
4. Local Linear Control
Local Control
- Local control around \((s_\star,a_\star)\)
- e.g. Cartpole (PA2)
- \(s = \begin{bmatrix} \theta\\ \omega \\ x \\ v \end{bmatrix}\) and \(a = f\)
- goal: balance \(s_\star = 0\) and \(a_\star = 0\)
- Applicable when costs \(c\) are smallest at \((s_\star,a_\star)\) and initial state is close to \(s_\star\)
[Cartpole diagram: pole angle \(\theta\), angular velocity \(\omega\), cart position \(x\), cart velocity \(v\), applied force \(f\), gravity.]
- Assumptions:
- Black-box access to \(f\) and \(c\)
- i.e. can query at any \((s,a)\) and observe outputs \(s'\) and \(c\) where \(s'=f(s,a)\) and \(c=c(s,a)\)
- \(f\) is differentiable and \(c\) is twice differentiable
- i.e. Jacobians and Hessians are well defined
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
Local Control
- Procedure
- Approximate dynamics & costs
- First/second order approximation
- Finite differencing
- Policy via LQR
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
Local Control
Linearized Dynamics
- Linearization of dynamics around \((s_0,a_0)\)
- \( f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) \)
- \( =A_0s+B_0a+c_0 \)
- where the matrices depend on \((s_0,a_0)\):
- \(A_0 = \nabla_s f(s_0, a_0)^\top \)
- \(B_0 = \nabla_a f(s_0, a_0)^\top \)
- \(c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0 \)
- Black box access: use finite differencing to compute
Example
- Setting: hovering UAV over a target
- state \(s = [\mathsf{pos}, \mathsf{vel}]\)
- Linearizing around \((0,0)\)
- \(f(0,0) = 0\)
- \(\nabla_s f(0,0) = \begin{bmatrix} 1 & 0 \\ 1 & 1-2\cdot 0\cdot e^{-0} \end{bmatrix} \)
- \(\nabla_a f(0,0) =\begin{bmatrix} 0 & (1-0) e^{-0} \end{bmatrix}\)
- \(s_{t+1}=f(s_t, a_t) \approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
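A short numeric check of this linearization by finite differencing the dynamics at \((0,0)\) (a sketch; the perturbation size \(\delta\) is an arbitrary choice):

```python
import numpy as np

def f(s, a):
    """Nonlinear UAV dynamics: s = [pos, vel], scalar thrust a."""
    return np.array([s[0] + s[1], s[1] + np.exp(-(s[1]**2 + a**2)) * a])

s0, a0 = np.zeros(2), 0.0
delta = 1e-5

# A_0 = grad_s f(0,0)^T and B_0 = grad_a f(0,0)^T by central differences
A0 = np.column_stack([(f(s0 + delta * e, a0) - f(s0 - delta * e, a0)) / (2 * delta)
                      for e in np.eye(2)])
B0 = ((f(s0, a0 + delta) - f(s0, a0 - delta)) / (2 * delta)).reshape(2, 1)
c0 = f(s0, a0) - A0 @ s0 - B0.flatten() * a0

print(np.round(A0, 6))   # expect [[1, 1], [0, 1]]
print(np.round(B0, 6))   # expect [[0], [1]]
print(c0)                # expect [0, 0]
```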
Second-Order Approx. Costs
- Approximate costs around \((s_0,a_0)\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0) + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
- \( =s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0\)
- Practical consideration:
- Force \(Q_0,R_0\) to be positive definite by setting negative eigenvalues to 0 and adding regularization \(\lambda I\)
- Black box access: use finite differencing to compute
Practical Consideration
For a symmetric matrix \(Q\in\mathbb R^{n\times n}\) the eigen-decomposition is $$Q = \sum_{i=1}^n v_iv_i^\top \sigma_i $$
To make this positive definite, we replace $$Q\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
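A minimal numpy sketch of this projection (the function name and the regularization value \(\lambda\) are illustrative choices):

```python
import numpy as np

def make_positive_definite(Q, lam=1e-3):
    """Clip negative eigenvalues to zero and add lam to every eigenvalue, per the slide."""
    Q = (Q + Q.T) / 2                        # symmetrize first
    sigma, V = np.linalg.eigh(Q)             # eigendecomposition Q = sum_i sigma_i v_i v_i^T
    sigma = np.maximum(sigma, 0.0) + lam     # max{0, sigma_i} + lambda
    return V @ np.diag(sigma) @ V.T

Q0 = np.array([[2.0, 0.0], [0.0, -0.5]])     # indefinite Hessian estimate (illustrative)
print(make_positive_definite(Q0))            # eigenvalues are now strictly positive
```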
Example
- Setting: hovering UAV over a target
- state \(s = [\mathsf{pos}, \mathsf{vel}]\)
- Approximating around \((0,0)\)
- \(\nabla_s c(0,0)= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \)
- \(\nabla_s^2 c(0,0)= \begin{bmatrix} 2 & 0\\ 0& 0 \end{bmatrix} \)
- \(\nabla_a c(0,0)= 0\) and \(\nabla_a^2 c(0,0)= 2\lambda\)
- \(\nabla_{as}^2 c(0,0)=0\)
- \(c(s,a)\approx \mathsf{pos}^2 + \lambda a^2\)
Local Control
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
- Approximate dynamics & costs
- Linearize \(f\) as \(A_0,B_0,c_0\)
- Approx \(c\) as \(Q_0,R_0,M_0,q_0,r_0,v_0\)
- LQR policy: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(A_0,B_0,c_0, Q_0, R_0, M_0, q_0, r_0, v_0) $$
- works as long as states and actions remain close to \(s_\star\) and \(a_\star\)
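Putting the pieces together, here is a rough end-to-end sketch of local control for the UAV example around \((s_\star, a_\star) = (0,0)\). To stay short it assumes the gradient and cross terms of the cost vanish at \((s_\star,a_\star)\) (true here, since the cost is minimized there and \(f(0,0)=0\)), so only the pure-quadratic LQR recursion is needed; the horizon, step sizes, and initial state are arbitrary illustrative choices, and PA2 treats the general affine case.

```python
import numpy as np

def f(s, a):   # nonlinear UAV dynamics (treated as a black box)
    return np.array([s[0] + s[1], s[1] + np.exp(-(s[1]**2 + a**2)) * a])

def c(s, a):   # nonlinear cost (treated as a black box); lambda = 0.5
    return (1 - np.exp(-s[0]**2)) + 0.5 * a**2

delta, H = 1e-4, 20
s_star, a_star = np.zeros(2), 0.0

# 1. Linearize dynamics by finite differences: A = grad_s f^T, B = grad_a f^T
A = np.column_stack([(f(s_star + delta * e, a_star) - f(s_star - delta * e, a_star)) / (2 * delta)
                     for e in np.eye(2)])
B = ((f(s_star, a_star + delta) - f(s_star, a_star - delta)) / (2 * delta)).reshape(2, 1)

# 2. Quadratic cost from finite-difference Hessians: Q = 0.5 * grad_s^2 c, R = 0.5 * grad_a^2 c
def hess_s(c, s, a):
    n = len(s)
    Hm = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = delta * np.eye(n)[i], delta * np.eye(n)[j]
            Hm[i, j] = (c(s + ei + ej, a) - c(s + ei - ej, a)
                        - c(s - ei + ej, a) + c(s - ei - ej, a)) / (4 * delta**2)
    return Hm

Q = 0.5 * hess_s(c, s_star, a_star)
R = 0.5 * np.array([[(c(s_star, a_star + delta) - 2 * c(s_star, a_star)
                      + c(s_star, a_star - delta)) / delta**2]])

# 3. Project onto positive definite matrices
def make_pd(M, lam=1e-6):
    sig, V = np.linalg.eigh((M + M.T) / 2)
    return V @ np.diag(np.maximum(sig, 0) + lam) @ V.T
Q, R = make_pd(Q), make_pd(R)

# 4. LQR backward recursion for the gains K_t
P, K = Q, [None] * H
for t in range(H - 1, -1, -1):
    S = R + B.T @ P @ B
    K[t] = -np.linalg.solve(S, B.T @ P @ A)
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(S, B.T @ P @ A)

# 5. Roll out the local policy a_t = K_t s_t on the true nonlinear system
s = np.array([0.5, 0.0])            # start near the target
for t in range(H):
    a = (K[t] @ s).item()
    s = f(s, a)
print(s)                            # state after H steps; should be near [0, 0]
```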
Local Control as Approx DP
- Initialize \(V^\star_H(s) = c_H(s)\)
- For \(t=H-1, H-2, ..., 0\):
- \(Q_t^\star(s,a) = c(s,a)+V^\star_{t+1}(f(s,a))\)
- \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
- \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
- Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
- Local control approximates this recursion: the linear/quadratic approximations of \(f\) and \(c\) stand in for the true dynamics and cost, so each backward step has a closed-form LQR solution.
Recap
- PSet due Wednesday
- Optimal LQR Policy
- Nonlinear Approximation
- Locally Linear Control
- Next lecture: Iterative Nonlinear Control
Sp23 CS 4/5789: Lecture 9
By Sarah Dean