CS 4/5789: Introduction to Reinforcement Learning

Lecture 10: Nonlinear Control

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • PSet 3 due tonight
    • PA 2 due 3/6
    • Next PSet released 3/6
  • February break Mon/Tues 2/26-27
    • no lecture or office hours

Prelim on 3/4 in Lecture

  • Prelim Monday 3/4
  • During lecture (2:55-4:10pm in 255 Olin)
  • 1 hour exam, closed-book, equation sheet provided
  • Materials:
    • slides (Lectures 1-10)
    • lecture notes (MDP and LQR chapter)
    • PSets 1-3 (solutions to be posted on Canvas)
  • TA-led review session during lecture on 2/28

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Recap: Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

$$\min_{\pi}\; \sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H) \quad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

 

DP Algorithm: \(V^\star_{H}(s)=c_H(s)\) and, for \(t=H-1,\dots,0\), $$V_{t}^\star(s) =\min_a\; c(s,a)+V^\star_{t+1}(f(s,a))$$

 

Recap: LQR

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear

  • \(\displaystyle V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s\)
  • where \(P_{H} = Q\) and \(P_t\), \(K_t\) are defined recursively in terms of \(A,B,Q,R\), and \(P_{t+1}\) (the Riccati recursion; a sketch follows below)

Special case of linear dynamics & quadratic costs $$f(s,a) = As+Ba,\quad c(s,a) = s^\top Q s + a^\top R a$$

\(\pi^\star = (K_0,\dots,K_{H-1}) = \mathsf{LQR}(A,B,Q,R)\)
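The recursion referenced above is the standard backward Riccati pass. Below is a minimal Python sketch of \(\mathsf{LQR}(A,B,Q,R)\) for this special case, assuming the terminal cost is \(s^\top Q s\) (so \(P_H = Q\) as above); the function name and the convention that the gain absorbs the minus sign (\(\pi_t(s)=K_t s\)) are illustrative choices.

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Backward Riccati recursion for finite-horizon LQR (minimal sketch).

    Returns gains [K_0, ..., K_{H-1}] with pi_t(s) = K_t @ s, and cost
    matrices [P_0, ..., P_H] with V_t(s) = s.T @ P_t @ s.
    """
    P = Q                      # terminal condition P_H = Q
    Ks, Ps = [], [P]
    for _ in range(H):
        # K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        # P_t = Q + K^T R K + (A + B K)^T P_{t+1} (A + B K)
        P = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
        Ks.append(K)
        Ps.append(P)
    return Ks[::-1], Ps[::-1]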

LQR Extensions

  • The same DP derivation extends straightforwardly when:
    1. Dynamics and costs are time varying
    2. Affine term in the dynamics, cross terms in the costs
  • General form (PA 2):  \( f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t\) and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t $$
  • General solution: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}) $$
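One way to see where the affine term \(k_t\) comes from (a brief sketch; the symbols \(P_t, p_t, \rho_t\) are notation introduced here, not from the slides): with an affine term in the dynamics and linear/constant terms in the cost, the DP recursion preserves a quadratic-plus-affine value function,

$$V^\star_t(s) = s^\top P_t s + p_t^\top s + \rho_t,$$

and minimizing \(c_t(s,a) + V^\star_{t+1}(f_t(s,a))\), which is quadratic in \(a\) with coefficients linear in \(s\), yields a minimizer that is affine in \(s\): \(\pi^\star_t(s) = K_t s + k_t\).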

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Local Control

  • Local control around \((s_\star,a_\star)\)
    • e.g. Cartpole (PA2)
      • \(s = \begin{bmatrix} \theta\\ \omega \\ x \\ v \end{bmatrix}\) and \(a = f\)
      • goal: balance \(s_\star = 0\) and \(a_\star = 0\)

 

  • Applicable when costs \(c\) are smallest at \((s_\star,a_\star)\) and initial state is close to \(s_\star\)

[Cartpole diagram: pole angle \(\theta\), angular velocity \(\omega\), cart position \(x\), velocity \(v\), applied force \(f\), gravity]

  • Assumptions:
    1. Black-box access to \(f\) and \(c\)
      • i.e. can query at any \((s,a)\) and observe outputs \(s'\) and \(c\) where \(s'=f(s,a)\) and \(c=c(s,a)\)
    2. \(f\) is differentiable and \(c\) is twice differentiable
      • i.e. Jacobians and Hessians are well defined

  • Procedure
    1. Approximate dynamics & costs around \((s_\star,a_\star)\)
      • Finite differencing for first/second order approximation
    2. Policy via general (time-varying, cross terms) LQR


Linearized Dynamics

  • Linearization of dynamics around \((s_0,a_0)\)
    • \( f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) \)
    • \( =A_0s+B_0a+c_0 \)
  • where the matrices depend on \((s_0,a_0)\):
    • \(A_0 = \nabla_s f(s_0, a_0)^\top \)
    • \(B_0 = \nabla_a f(s_0, a_0)^\top \)
    • \(c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0 \)
  • Black box access: use finite differencing to compute
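For instance, \(A_0\), \(B_0\), and \(c_0\) can be estimated with central finite differences using only black-box queries of \(f\). Below is a minimal sketch; the function name `linearize_dynamics` and step size `eps` are illustrative choices.

```python
import numpy as np

def linearize_dynamics(f, s0, a0, eps=1e-5):
    """Finite-difference linearization of black-box dynamics around (s0, a0).

    Returns A0, B0, c0 with f(s, a) ~= A0 @ s + B0 @ a + c0 near (s0, a0),
    matching A0 = Jacobian in s, B0 = Jacobian in a,
    c0 = f(s0, a0) - A0 @ s0 - B0 @ a0.
    """
    f0 = f(s0, a0)
    n_s, n_a = len(s0), len(a0)
    A0 = np.zeros((len(f0), n_s))
    B0 = np.zeros((len(f0), n_a))
    for i in range(n_s):
        e = np.zeros(n_s); e[i] = eps
        A0[:, i] = (f(s0 + e, a0) - f(s0 - e, a0)) / (2 * eps)
    for j in range(n_a):
        e = np.zeros(n_a); e[j] = eps
        B0[:, j] = (f(s0, a0 + e) - f(s0, a0 - e)) / (2 * eps)
    c0 = f0 - A0 @ s0 - B0 @ a0
    return A0, B0, c0
```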

Second-Order Approx. Costs

  • Approximate costs around \((s_0,a_0)\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0)  + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
    • \( =s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0\)
  • Black box access: use finite differencing to compute
  • Practical consideration:
    • Force quadratic to be positive definite by setting negative eigenvalues to 0 and adding regularization \(\lambda I\)
  • For a symmetric matrix \(H\in\mathbb R^{n\times n}\), the eigendecomposition is $$H = \sum_{i=1}^n v_iv_i^\top \sigma_i $$
  • To make this positive definite, we replace $$H\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
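A minimal Python sketch of this fix (the function name and default \(\lambda\) are illustrative):

```python
import numpy as np

def make_positive_definite(H, lam=1e-3):
    """Clip negative eigenvalues of a symmetric matrix and add lam * I.

    Implements H <- sum_i v_i v_i^T (max{0, sigma_i} + lam) from the slide.
    """
    H = (H + H.T) / 2                 # symmetrize against numerical error
    sigma, V = np.linalg.eigh(H)      # eigenvalues/eigenvectors of symmetric H
    sigma = np.maximum(sigma, 0.0) + lam
    return V @ np.diag(sigma) @ V.T
```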


Recall: Example

  • Setting: hovering UAV over a target
    • \(s = [\mathsf{pos},\mathsf{vel}]\)
  • Action: imperfect thrust right/left
  • \(s_{t+1}=\begin{bmatrix}\mathsf{pos}_{t}+ \mathsf{vel}_{t} \\  \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t\end{bmatrix}\)
    • \(\approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\) near \((0,0)\)
  • \(c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2\)
    • \(\approx \mathsf{pos}^2 + \lambda a^2\) near \((0,0)\) (both approximations are derived below)
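Filling in the Jacobian computation behind these approximations (not spelled out above): the first coordinate \(\mathsf{pos}_t+\mathsf{vel}_t\) is already linear, and for the second coordinate, at \((s,a)=(0,0)\),

$$\frac{\partial}{\partial \mathsf{vel}}\left[\mathsf{vel} + e^{-(\mathsf{vel}^2+a^2)}a\right] = 1 - 2\,\mathsf{vel}\,a\,e^{-(\mathsf{vel}^2+a^2)} = 1, \qquad \frac{\partial}{\partial a}\left[\mathsf{vel} + e^{-(\mathsf{vel}^2+a^2)}a\right] = (1-2a^2)\,e^{-(\mathsf{vel}^2+a^2)} = 1,$$

which gives \(A=\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix}\) and \(B=\begin{bmatrix}0\\ 1\end{bmatrix}\). For the cost, \(1-e^{-\mathsf{pos}^2}\approx \mathsf{pos}^2\) follows from \(e^{-x}\approx 1-x\) for small \(x\).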


  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)
  • Local control \(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s \)

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Motivation for iLQR

  • Rather than approximating around a single point \((s_0,a_0)\), build local approximations along a trajectory \(\tau=(s_t,a_t)_{t=0}^{H-1}\)
  • Leads to time-varying approximation of dynamics & costs
    • For each \(t\), linearize \(f\) around \((s_t,a_t)\): \(\{A_t,B_t,c_t\}_{t=0}^{H-1}\)
    • For each \(t\), approx \(c\) as quadratic: \(\{Q_t,R_t,M_t,q_t,r_t,v_t\}_{t=0}^{H-1}\)
  • But what trajectory should we use?
    • Let's iterate!

Iterative LQR

  • Initialize policy \(\pi^0\) and state \(\bar s_0^0\sim \mu_0\)
  • For \(i=0,1,\dots\):
    1. Forward:
      • Generate trajectory \(\tau_i = \{(\bar s_t^i, \bar a_t^i)\}_{t=0}^{H-1}\) by $$\bar a_t^i = \pi^i_t(\bar s_t^i),\quad \bar s^i_{t+1} =f(\bar s_t^i, \bar a_t^i),\quad t=0,\dots,H-1$$
      • Approximate dynamics and cost around \(\tau_i\) $$\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}=\mathsf{Approx}(f, c, \tau_i)$$
    2. Backward:
      • \(\pi^{i+1} = \{K^\star_t, k^\star_t\}_{t=0}^{H-1}=\)LQR\((\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1})\)
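A minimal Python skeleton of this forward/backward loop. The callables `approx_along_traj` and `affine_lqr` are hypothetical stand-ins for the \(\mathsf{Approx}\) and general (time-varying, affine) \(\mathsf{LQR}\) routines above and are passed in rather than implemented here; this is a sketch, not the course implementation.

```python
def ilqr(f, c, approx_along_traj, affine_lqr, pi0, s0, H, n_iters=10):
    """Sketch of iLQR: alternate forward rollouts with backward LQR re-fits.

    f(s, a)                       -> next state (black-box dynamics)
    c(s, a)                       -> stage cost (black-box)
    approx_along_traj(f, c, traj) -> time-varying LQ parameters around traj
    affine_lqr(params)            -> affine policy [(K_0, k_0), ..., (K_{H-1}, k_{H-1})]
    pi0                           -> initial policy in the same (K_t, k_t) format
    """
    policy = pi0
    for _ in range(n_iters):
        # Forward pass: roll out the current policy to get the nominal trajectory.
        traj, s = [], s0
        for t in range(H):
            K_t, k_t = policy[t]
            a = K_t @ s + k_t
            traj.append((s, a))
            s = f(s, a)
        # Backward pass: approximate dynamics/costs around traj, then re-solve LQR.
        policy = affine_lqr(approx_along_traj(f, c, traj))
    return policy
```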

Approximate around a trajectory. What trajectory? Iterate!

Figure: black lines show \(\tau_{i-1}\), red arrows the trajectory predicted by the linearized approximation, blue dashed lines \(\tau_i\)

Agenda

1. Recap: LQR

2. Local LQR

3. Iterative LQR

4. Differential DP

Motivation for DDP

  • We can approximate the dynamic programming minimization step directly (rather than use LQR)
  • Recall that: \(V^\star_H(s) = c_H(s)\) and for \(t=H-1,..., 0\):
    • \(\displaystyle V^\star_{t}(s)=\min_a \underbrace{c(s,a)+V^\star_{t+1}(f(s,a))}_{Q_t^\star(s,a)}\) and \(\pi_t^\star(s) \) is argmin
  • Quadratic approximation of \(Q_t^\star(s,a)\) around \((s_t, a_t)\)
    • ✅ approx. cost as quadratic
    • approx. composition of value and dynamics
      • involves first & second order approximations of \(f\)
      • details are out of scope

DDP Sketch

  • Initialize policy \(\pi^0\) and state \(\bar s_0^0\sim \mu_0\)
  • For \(i=0,1,\dots\):
    1. Forward:
      • Generate trajectory \(\tau_i = \{(\bar s_t^i, \bar a_t^i)\}_{t=0}^{H-1}\) by $$\bar a_t^i = \pi^i_t(\bar s_t^i),\quad \bar s^i_{t+1} =f(\bar s_t^i, \bar a_t^i),\quad t=0,\dots,H-1$$
    2. Backward: for \(t=H-1, H-2,\dots,0\) (with \(V^\star_H = c_H\))
      • Approximate \(Q_t^\star(s,a)\) around \((\bar s_t^i, \bar a_t^i)\) by quadratic \(\hat Q_t(s,a)\)
        • details out of scope
      • \(\pi_t^{i+1}(s) = K_ts+ k_t =\arg \min_a \hat Q_t(s,a)\)

Summary

  • Local nonlinear control
    • approximate around single point \((s_0,a_0)\)
    • Local LQR: LQ approximation of \(f,c\), then use LQR
  • Iterative nonlinear control
    • approximate around a trajectory \(\tau=\{(s_t,a_t)\}_{t=0}^{H-1}\)
    • iterate forward/backward to determine trajectory
    • iLQR: LQ approx of \(f,c\), then use LQR
    • DDP: quadratic approx of Q function, then use DP directly

Recap

  • PSet due tonight
  • PA due after exam
  • Prelim 1 on 3/4

 

  • Local Nonlinear Control
  • Iterative Nonlinear Control

 

  • Happy Feb break!
  • Next lecture: Review Session