CS 4/5789: Introduction to Reinforcement Learning

Lecture 10: Nonlinear Control

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PSet 3 due tonight
    • PA 2 due 3/6
    • Next PSet released 3/6
  • February break Mon/Tues 2/26-27
    • no lecture or office hours

Prelim on 3/4 in Lecture

  • Prelim Monday 3/4
  • During lecture (2:55-4:10pm in 255 Olin)
  • 1 hour exam, closed-book, equation sheet provided
  • Materials:
    • slides (Lectures 1-10)
    • lecture notes (MDP and LQR chapter)
    • PSets 1-3 (solutions to be posted on Canvas)
  • TA-led review session during lecture on 2/28

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Recap: Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

$$\min_{\pi}\; \sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H) \quad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

 

DP Algorithm: \(V^\star_{H}(s)=c_H(s)\) and, for \(t=H-1,\dots,0\), $$V_{t}^\star(s) =\min_a\; c(s,a)+V^\star_{t+1}(f(s,a))$$

 

Recap: LQR

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear

  • \(\displaystyle V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s\)
  • where \(P_{H} = Q\) and \(P_t\), \(K_t\) are defined recursively in terms of \(A,B,Q,R\), and \(P_{t+1}\) (the Riccati recursion; a sketch follows below)

Special case of linear dynamics & quadratic costs $$f(s,a) = As+Ba,\quad c(s,a) = s^\top Q s + a^\top R a$$

\(\pi^\star = (K_0,\dots,K_{H-1}) = \mathsf{LQR}(A,B,Q,R)\)
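The recursion referenced above is the standard backward Riccati pass. Below is a minimal Python sketch of \(\mathsf{LQR}(A,B,Q,R)\) for this special case, assuming the terminal cost is \(s^\top Q s\) (so \(P_H = Q\) as above); the function name and the convention that the gain absorbs the minus sign (\(\pi_t(s)=K_t s\)) are illustrative choices.

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Backward Riccati recursion for finite-horizon LQR (minimal sketch).

    Returns gains [K_0, ..., K_{H-1}] with pi_t(s) = K_t @ s, and cost
    matrices [P_0, ..., P_H] with V_t(s) = s.T @ P_t @ s.
    """
    P = Q                      # terminal condition P_H = Q
    Ks, Ps = [], [P]
    for _ in range(H):
        # K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_t = Q + K^T R K + (A + B K)^T P_{t+1} (A + B K)
        P = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
        Ks.append(K)
        Ps.append(P)
    return Ks[::-1], Ps[::-1]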

LQR Extensions

  • The same DP derivation extends straightforwardly when:
    1. Dynamics and costs are time varying
    2. Affine term in the dynamics, cross terms in the costs
  • General form (PA 2):  \( f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t\) and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t $$
  • General solution: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}) $$
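One way to see where the affine term \(k_t\) comes from (a brief sketch; the symbols \(P_t, p_t, \rho_t\) are notation introduced here, not from the slides): with an affine term in the dynamics and linear/constant terms in the cost, the DP recursion preserves a quadratic-plus-affine value function,

$$V^\star_t(s) = s^\top P_t s + p_t^\top s + \rho_t,$$

and minimizing \(c_t(s,a) + V^\star_{t+1}(f_t(s,a))\), which is quadratic in \(a\) with coefficients linear in \(s\), yields a minimizer that is affine in \(s\): \(\pi^\star_t(s) = K_t s + k_t\).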

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Local Control

  • Local control around \((s_\star,a_\star)\)
    • e.g. Cartpole (PA2)
      • \(s = \begin{bmatrix} \theta\\ \omega \\ x \\ v \end{bmatrix}\) and \(a = f\)
      • goal: balance \(s_\star = 0\) and \(a_\star = 0\)

 

  • Applicable when costs \(c\) are smallest at \((s_\star,a_\star)\) and initial state is close to \(s_\star\)

[Cartpole diagram: pole angle \(\theta\), angular velocity \(\omega\), cart position \(x\), velocity \(v\), applied force \(f\), gravity]

  • Assumptions:
    1. Black-box access to \(f\) and \(c\)
      • i.e. can query at any \((s,a)\) and observe outputs \(s'\) and \(c\) where \(s'=f(s,a)\) and \(c=c(s,a)\)
    2. \(f\) is differentiable and \(c\) is twice differentiable
      • i.e. Jacobians and Hessians are well defined

  • Procedure
    1. Approximate dynamics & costs around \((s_\star,a_\star)\)
      • Finite differencing for first/second order approximation
    2. Policy via general (time-varying, cross terms) LQR


Linearized Dynamics

  • Linearization of dynamics around \((s_0,a_0)\)
    • \( f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) \)
    • \( =A_0s+B_0a+c_0 \)
  • where the matrices depend on \((s_0,a_0)\):
    • \(A_0 = \nabla_s f(s_0, a_0)^\top \)
    • \(B_0 = \nabla_a f(s_0, a_0)^\top \)
    • \(c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0 \)
  • Black box access: use finite differencing to compute
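For instance, \(A_0\), \(B_0\), and \(c_0\) can be estimated with central finite differences using only black-box queries of \(f\). Below is a minimal sketch; the function name `linearize_dynamics` and step size `eps` are illustrative choices.

```python
import numpy as np

def linearize_dynamics(f, s0, a0, eps=1e-5):
    """Finite-difference linearization of black-box dynamics around (s0, a0).

    Returns A0, B0, c0 with f(s, a) ~= A0 @ s + B0 @ a + c0 near (s0, a0),
    matching A0 = Jacobian in s, B0 = Jacobian in a,
    c0 = f(s0, a0) - A0 @ s0 - B0 @ a0.
    """
    f0 = f(s0, a0)
    n_s, n_a = len(s0), len(a0)
    A0 = np.zeros((len(f0), n_s))
    B0 = np.zeros((len(f0), n_a))
    for i in range(n_s):
        e = np.zeros(n_s); e[i] = eps
        A0[:, i] = (f(s0 + e, a0) - f(s0 - e, a0)) / (2 * eps)
    for j in range(n_a):
        e = np.zeros(n_a); e[j] = eps
        B0[:, j] = (f(s0, a0 + e) - f(s0, a0 - e)) / (2 * eps)
    c0 = f0 - A0 @ s0 - B0 @ a0
    return A0, B0, c0
```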

Second-Order Approx. Costs

  • Approximate costs around \((s_0,a_0)\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0)  + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
    • \( =s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0\)
  • Black box access: use finite differencing to compute
  • Practical consideration:
    • Force quadratic to be positive definite by setting negative eigenvalues to 0 and adding regularization \(\lambda I\)
  • For a symmetric matrix \(H\in\mathbb R^{n\times n}\), the eigendecomposition is $$H = \sum_{i=1}^n v_iv_i^\top \sigma_i $$
  • To make this positive definite, we replace $$H\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
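A minimal Python sketch of this fix (the function name and default \(\lambda\) are illustrative):

```python
import numpy as np

def make_positive_definite(H, lam=1e-3):
    """Clip negative eigenvalues of a symmetric matrix and add lam * I.

    Implements H <- sum_i v_i v_i^T (max{0, sigma_i} + lam) from the slide.
    """
    H = (H + H.T) / 2                 # symmetrize against numerical error
    sigma, V = np.linalg.eigh(H)      # eigenvalues/eigenvectors of symmetric H
    sigma = np.maximum(sigma, 0.0) + lam
    return V @ np.diag(sigma) @ V.T
```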


Recall: Example

  • Setting: hovering UAV over a target
    • \(s = [\mathsf{pos},\mathsf{vel}]\)
  • Action: imperfect thrust right/left
  • \(s_{t+1}=\begin{bmatrix}\mathsf{pos}_{t}+ \mathsf{vel}_{t} \\  \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t\end{bmatrix}\)
    • \(\approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\) near \((0,0)\)
  • \(c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2\)
    • \(\approx \mathsf{pos}^2 + \lambda a^2\) near \((0,0)\) (both approximations are derived below)
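Filling in the Jacobian computation behind these approximations (not spelled out above): the first coordinate \(\mathsf{pos}_t+\mathsf{vel}_t\) is already linear, and for the second coordinate, at \((s,a)=(0,0)\),

$$\frac{\partial}{\partial \mathsf{vel}}\left[\mathsf{vel} + e^{-(\mathsf{vel}^2+a^2)}a\right] = 1 - 2\,\mathsf{vel}\,a\,e^{-(\mathsf{vel}^2+a^2)} = 1, \qquad \frac{\partial}{\partial a}\left[\mathsf{vel} + e^{-(\mathsf{vel}^2+a^2)}a\right] = (1-2a^2)\,e^{-(\mathsf{vel}^2+a^2)} = 1,$$

which gives \(A=\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix}\) and \(B=\begin{bmatrix}0\\ 1\end{bmatrix}\). For the cost, \(1-e^{-\mathsf{pos}^2}\approx \mathsf{pos}^2\) follows from \(e^{-x}\approx 1-x\) for small \(x\).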


  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)
  • Local control \(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s \)

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Motivation for iLQR

  • Rather than approximating around a single point \((s_0,a_0)\), build local approximations along a trajectory \(\tau=(s_t,a_t)_{t=0}^{H-1}\)
  • Leads to time-varying approximation of dynamics & costs
    • For each \(t\), linearize \(f\) around \((s_t,a_t)\): \(\{A_t,B_t,c_t\}_{t=0}^{H-1}\)
    • For each \(t\), approx \(c\) as quadratic: \(\{Q_t,R_t,M_t,q_t,r_t,v_t\}_{t=0}^{H-1}\)
  • But what trajectory should we use?
    • Let's iterate!

Iterative LQR

  • Initialize policy \(\pi^0\) and state \(\bar s_0^0\sim \mu_0\)
  • For \(i=0,1,\dots\):
    1. Forward:
      • Generate trajectory \(\tau_i = \{(\bar s_t^i, \bar a_t^i)\}_{t=0}^{H-1}\) by $$\bar a_t^i = \pi^i_t(\bar s_t^i),\quad \bar s^i_{t+1} =f(\bar s_t^i, \bar a_t^i),\quad t=0,\dots,H-1$$
      • Approximate dynamics and cost around \(\tau_i\) $$\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}=\mathsf{Approx}(f, c, \tau_i)$$
    2. Backward:
      • \(\pi^{i+1} = \{K^\star_t, k^\star_t\}_{t=0}^{H-1}=\)LQR\((\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1})\)
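A minimal Python skeleton of this forward/backward loop. The callables `approx_along_traj` and `affine_lqr` are hypothetical stand-ins for the \(\mathsf{Approx}\) and general (time-varying, affine) \(\mathsf{LQR}\) routines above and are passed in rather than implemented here; this is a sketch, not the course implementation.

```python
def ilqr(f, c, approx_along_traj, affine_lqr, pi0, s0, H, n_iters=10):
    """Sketch of iLQR: alternate forward rollouts with backward LQR re-fits.

    f(s, a)                       -> next state (black-box dynamics)
    c(s, a)                       -> stage cost (black-box)
    approx_along_traj(f, c, traj) -> time-varying LQ parameters around traj
    affine_lqr(params)            -> affine policy [(K_0, k_0), ..., (K_{H-1}, k_{H-1})]
    pi0                           -> initial policy in the same (K_t, k_t) format
    """
    policy = pi0
    for _ in range(n_iters):
        # Forward pass: roll out the current policy to get the nominal trajectory.
        traj, s = [], s0
        for t in range(H):
            K_t, k_t = policy[t]
            a = K_t @ s + k_t
            traj.append((s, a))
            s = f(s, a)
        # Backward pass: approximate dynamics/costs around traj, then re-solve LQR.
        policy = affine_lqr(approx_along_traj(f, c, traj))
    return policy
```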

Approximate around a trajectory. What trajectory? Iterate!

Figure: black lines show \(\tau_{i-1}\), red arrows the trajectory predicted by the linearized approximation, blue dashed lines \(\tau_i\)

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Motivation for DDP

  • We can approximate the dynamic programming minimization step directly (rather than use LQR)
  • Recall that: \(V^\star_H(s) = c_H(s)\) and for \(t=H-1,..., 0\):
    • \(\displaystyle V^\star_{t}(s)=\min_a \underbrace{c(s,a)+V^\star_{t+1}(f(s,a))}_{Q_t^\star(s,a)}\) and \(\pi_t^\star(s) \) is argmin
  • Quadratic approximation of \(Q_t^\star(s,a)\) around \((s_t, a_t)\)
    • ✅ approx. cost as quadratic
    • approx. composition of value and dynamics
      • involves first & second order approximations of \(f\)
      • details are out of scope

DDP Sketch

  • Initialize policy \(\pi^0\) and state \(\bar s_0^0\sim \mu_0\)
  • For \(i=0,1,\dots\):
    1. Forward:
      • Generate trajectory \(\tau_i = \{(\bar s_t^i, \bar a_t^i)\}_{t=0}^{H-1}\) by $$\bar a_t^i = \pi^i_t(\bar s_t^i),\quad \bar s^i_{t+1} =f(\bar s_t^i, \bar a_t^i),\quad t=0,\dots,H-1$$
    2. Backward: for \(t=H-1, H-2,\dots,0\) (with \(V^\star_H = c_H\))
      • Approximate \(Q_t^\star(s,a)\) around \((\bar s_t^i, \bar a_t^i)\) by quadratic \(\hat Q_t(s,a)\)
        • details out of scope
      • \(\pi_t^{i+1}(s) = K_ts+ k_t =\arg \min_a \hat Q_t(s,a)\)

Summary

  • Local nonlinear control
    • approximate around single point \((s_0,a_0)\)
    • Local LQR: LQ approximation of \(f,c\), then use LQR
  • Iterative nonlinear control
    • approximate around a trajectory \(\tau=\{(s_t,a_t)\}_{t=0}^{H-1}\)
    • iterate forward/backward to determine trajectory
    • iLQR: LQ approx of \(f,c\), then use LQR
    • DDP: quadratic approx of Q function, then use DP directly

Recap

  • PSet due tonight
  • PA due after exam
  • Prelim 1 on 3/4

 

  • Local Nonlinear Control
  • Iterative Nonlinear Control

 

  • Happy Feb break!
  • Next lecture: Review Session