CS 4/5789: Introduction to Reinforcement Learning

Lecture 9: LQR & Nonlinear Control

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework this week
    • PSet due Wednesday
    • PA due 3/1
  • My office hours:
    • Tuesdays 10:30-11:30am in Gates 416A
      • cancelled 2/28 (February break)
    • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap: Control & LQR

2. Optimal LQR Policy

3. Nonlinear Approximation

4. Local Linear Control

Recap: Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

$$\min_{\pi}\quad \sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H) \qquad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

Recap: DP for OC

The general-purpose dynamic programming algorithm, specialized to optimal control (a code sketch follows the list):

  • Initialize \(V^\star_H(s) = c_H(s)\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))\)
    • \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
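
In code, the recursion might be sketched as follows. This is a minimal illustration, not from the slides: the function names (`dynamics`, `cost`, `terminal_cost`) and the finite `action_grid` are assumptions, since the continuous minimization generally has no closed form and this brute-force recursion is far too expensive to use in practice.

```python
import numpy as np

# Illustrative backward DP for deterministic optimal control. Cost is
# O(|action_grid|^(H-t)) per query of policy(t, s) -- a sketch, not a solver.
def make_dp_policy(dynamics, cost, terminal_cost, action_grid, H):
    def value(t, s):
        # V*_H(s) = c_H(s)
        if t == H:
            return terminal_cost(s)
        # V*_t(s) = min_a c_t(s,a) + V*_{t+1}(f(s,a))
        return min(cost(s, a) + value(t + 1, dynamics(s, a)) for a in action_grid)

    def policy(t, s):
        # pi*_t(s) = argmin_a c_t(s,a) + V*_{t+1}(f(s,a))
        q_values = [cost(s, a) + value(t + 1, dynamics(s, a)) for a in action_grid]
        return action_grid[int(np.argmin(q_values))]

    return policy
```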

Recap: LQR

Special case of optimal control problem with

  • Quadratic cost $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H = s^\top Qs$$ where \(Q\) is symmetric and positive semi-definite and \(R\) is symmetric and positive definite
  • Linear dynamics $$s_{t+1} = As_t+ Ba_t$$

$$\min_{\pi}\quad \sum_{t=0}^{H-1} \left(s_t^\top Qs_t +a_t^\top Ra_t\right)+s_H^\top Q s_H \qquad \text{s.t.}\quad s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)$$

Important background:

  1. A matrix is symmetric if \(M=M^\top\)
  2. A symmetric matrix is positive semi-definite (PSD) if all its eigenvalues are greater than or equal to 0
  3. A symmetric matrix is positive definite if all its eigenvalues are strictly greater than 0
  4. All positive definite matrices are invertible

Resources

Linear algebra and probability background*

*these references are not an exact match to the course material and are not required

Recall: Example

$$\min_{a_0, a_1}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + s_1^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_1  + s_2^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_2+\lambda a_{0}^2+\lambda a_1^2 $$

$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad \quad s_{2} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{1} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{1} $$

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\lambda,H=2\right)\)

At \(t=1\), dynamic programming solves the subproblem

$$\min_{a_1}\quad  (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 +  (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 + \lambda a_1^2$$

Since \(a_1\) only affects the (unpenalized) velocity at time 2, the minimizer is \(a_1^\star=0\).

Recall: Example

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 +  (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 +\lambda a_{0}^2 $$

$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad  \qquad\qquad\qquad\qquad$$

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\lambda,H=2\right)\)

Here \((\begin{bmatrix}1&0\end{bmatrix}s_1)^2 + (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 = V_1^\star(s_1)\) from the previous step. Substituting the dynamics turns this into an unconstrained quadratic in \(a_0\):

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}3&3\\ 3&5\end{bmatrix}s_0  + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + (1 +\lambda)a_0^2 $$

Setting the derivative with respect to \(a_0\) to zero gives

$$ a_0^\star = -\frac{\begin{bmatrix}1&2\end{bmatrix}s_0}{1+\lambda} = -\frac{1}{1+\lambda}\big((\mathsf{pos}_0 - x) + 2\,\mathsf{vel}_0\big) $$

where \(x\) is the target position. A numeric check of this closed form follows.
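
As a sanity check (not part of the slides), we can brute-force the two-step objective numerically and compare against the closed form; the values of \(\lambda\) and \(s_0\) below are arbitrary choices.

```python
import numpy as np

# Sanity check: brute-force the two-step objective over a grid of a_0
# (with a_1 = 0, which is optimal) and compare to a_0* = -[1 2] s_0 / (1 + lam).
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
Q = np.array([[1., 0.], [0., 0.]])
lam = 0.5                       # arbitrary action penalty
s0 = np.array([1.0, 0.5])       # arbitrary initial state

def objective(a0, a1):
    s1 = A @ s0 + B.flatten() * a0
    s2 = A @ s1 + B.flatten() * a1
    return s0 @ Q @ s0 + s1 @ Q @ s1 + s2 @ Q @ s2 + lam * (a0**2 + a1**2)

grid = np.linspace(-5, 5, 2001)
best = min((objective(a0, 0.0), a0) for a0 in grid)
closed_form = -np.array([1., 2.]) @ s0 / (1 + lam)
print(best[1], closed_form)     # both approximately -1.33
```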

Agenda

1. Recap: Control & LQR

2. Optimal LQR Policy

3. Nonlinear Approximation

4. Local Linear Control

LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba\)
  • General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\)
    • \(2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m\)
      • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • minimum is \(c-m^\top M^{-1} m\)
      • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

DP: \(V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))\)


Important background:

  1. The gradient of a function \(f:\mathbb R^d \to\mathbb R\)  is the vector $$\nabla f(x) = \begin{bmatrix}\frac{\partial  f}{\partial x_1} \\ \vdots \\ \frac{\partial  f}{\partial x_d}\end{bmatrix}$$
  2. If \(f\) has a minimum at \(x_\star\) then $$\nabla f(x_\star) = 0$$
  3. The gradients of quadratic and linear functions are $$\nabla \left[x^\top Mx\right]=Mx+M^\top x,\quad \nabla \left[m^\top x\right] = m $$

LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)

LQR Proof

  • Base case: \(V_H^\star(s) = s^\top Q s\)
  • Inductive step: Assume that \(V^\star_{t+1} (s) = s^\top P_{t+1} s\).
  • DP at \(t\): \(V_t^\star(s)= \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top P_{t+1} (As+Ba)\)
    • \(\quad \min_{a} s^\top (Q+A^\top P_{t+1}A) s+a^\top (R+B^\top P_{t+1} B) a+2s^\top A^\top P_{t+1} Ba\)
  • General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\) gives \(a_\star = -M^{-1} m\) and minimum is \(c-m^\top M^{-1} m\)
    • \( \pi_{t}^\star(s)=-(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}As\)
    • \(V_{t}^\star(s) = s^\top (Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A) s\)

Theorem:  \(V^\star_t (s) = s^\top P_t s\) and \(\pi_t^\star(s) = K_t s\) where \(P_{H} = Q\),
\(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
\(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
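
A minimal NumPy sketch of this backward recursion (standard LQR only, no affine or cross terms). The usage example plugs in the UAV matrices with an assumed \(\lambda=\tfrac{1}{2}\) and horizon \(H=10\):

```python
import numpy as np

# Finite-horizon LQR backward recursion from the theorem:
# P_H = Q;  K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
#           P_t = Q + A^T P_{t+1} A - A^T P_{t+1} B (R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
def lqr_finite_horizon(A, B, Q, R, H):
    P = Q
    gains = [None] * H
    for t in range(H - 1, -1, -1):
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # K_t
        P = Q + A.T @ P @ A + A.T @ P @ B @ K                # P_t (equivalent Riccati form)
        gains[t] = K
    return gains, P   # gains[t] = K_t, and P is P_0

# Usage on the UAV example with an assumed R = lambda = 1/2:
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
Q = np.array([[1., 0.], [0., 0.]])
R = np.array([[0.5]])
gains, P0 = lqr_finite_horizon(A, B, Q, R, H=10)
print(gains[0])   # optimal linear gain K_0, so pi_0*(s) = K_0 s
```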

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\lambda=\tfrac{1}{2}\right)\)


\(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s = \gamma^\mathsf{pos}_t (\mathsf{pos} - x) + \gamma^\mathsf{vel}_t \mathsf{vel} \)

[Plot: the optimal gain components \(\gamma^\mathsf{pos}_t\) and \(\gamma^\mathsf{vel}_t\) plotted against \(t\) from \(0\) to the horizon \(H\).]

LQR Extensions

  • The same dynamic programming method extends in a straightforward manner when:
    1. Dynamics and costs are time varying
    2. Affine term in the dynamics, cross terms in the costs
  • General form:  \( f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t\) and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t $$
  • General solution: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}) $$
  • Many applications can be reformulated this way:
    • e.g. trajectory tracking \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\) (expanded below)
    • Nonlinear dynamics and costs (Programming Assignment 2)
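
As a worked instance of the trajectory-tracking bullet above, expanding the tracking cost shows how it fits the general form:

$$\|s-\bar s_t\|_2^2 + \|a\|_2^2 = s^\top I s + a^\top I a - 2\bar s_t^\top s + \|\bar s_t\|_2^2$$

so it is the general cost with \(Q_t = I\), \(R_t = I\), \(M_t = 0\), \(q_t = -2\bar s_t\), \(r_t = 0\), and \(v_t = \|\bar s_t\|_2^2\).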

Agenda

1. Recap: Control & LQR

2. Optimal LQR Policy

3. Nonlinear Approximation

4. Local Linear Control

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
    • imperfect: attenuated at high thrusts and velocities
  • The dynamics (simulated in the short sketch below):
    • \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+ \mathsf{velocity}_{t}\)
    • \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + e^{- (\mathsf{velocity}_t^2+a_t^2)} a_t\)
  • When velocity/thrust is:
    • small, then \(\mathsf{velocity}_{t+1}\approx \mathsf{velocity}_{t} +a_t \)
    • large, then \(\mathsf{velocity}_{t+1}\approx \mathsf{velocity}_{t} \)
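
A few lines of Python (illustrative only) make the attenuation concrete:

```python
import numpy as np

# The nonlinear UAV dynamics from this slide. The effective thrust
# a * exp(-(vel^2 + a^2)) peaks near |a| = 1/sqrt(2) and then decays.
def f(s, a):
    pos, vel = s
    return np.array([pos + vel, vel + np.exp(-(vel**2 + a**2)) * a])

s = np.array([1.0, 0.0])
for a in [0.5, 1.0, 2.0, 5.0]:
    print(a, f(s, a))   # velocity change: ~0.39, ~0.37, ~0.04, ~0
```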


Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
    • imperfect: attenuated at high thrusts and velocities
  • Goal: stay near target position \(0\)
    • Field of view is limited
    • Thus cost is $$c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2$$


Low-Order Approximation

  • How to find simpler (e.g. linear or quadratic) approximations?
  • For a nonlinear differentiable function \(g:\mathbb R\to\mathbb R\)
    • Recall Taylor Expansion $$ g(x) = g(x_0) +g'(x_0)(x-x_0)+\frac{1}{2}g''(x_0)(x-x_0)^2 + ... $$
    • When \(x\) is close to \(x_0\), the higher order terms become vanishingly small: \(\epsilon^p\to 0\) as \(p\to\infty\) for \( |\epsilon|<1\)

Linear Approximation

  • Linear, also called first-order, approximation $$ g(x) \approx g(x_0) + g'(x_0)(x-x_0) $$
  • For vector-valued multi-variate function \(f:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R^{n_s}\) $$ f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) $$
  • Jacobians \( \nabla_s f(s, a) \in\mathbb R^{n_s\times n_s}\) and \( \nabla_a f(s, a) \in\mathbb R^{n_a\times n_s}\) have entries \( \frac{\partial f_j (s,a)}{\partial s_i}\) and \( \frac{\partial f_j (s,a)}{\partial a_i}\), respectively:
  • row \(i\) represents the effect of the \(i\)th dimension of the current state/action, and column \(j\) represents the effect on \(f_j\), the \(j\)th dimension of the next state

Example

  • Setting: hovering UAV over a target
    • state \(s = [\mathsf{pos}, \mathsf{vel}]\)
  • The dynamics: $$ f(s_t, a_t) = \begin{bmatrix} \mathsf{pos}_{t}+ \mathsf{vel}_{t}\\ \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t \end{bmatrix} = \begin{bmatrix} f_1(s_t,a_t)\\ f_2(s_t,a_t)\end{bmatrix} $$
  • \(\nabla_s f(s,a) = \begin{bmatrix} \frac{\partial f_1 (s,a)}{\partial \mathsf{pos}} & \frac{\partial f_2 (s,a)}{\partial \mathsf{pos}} \\  \frac{\partial f_1 (s,a)}{\partial \mathsf{vel}} & \frac{\partial f_2 (s,a)}{\partial \mathsf{vel}} \end{bmatrix} = \begin{bmatrix} 1 & 0 \\  1 & 1-2a\,\mathsf{vel}\, e^{-(\mathsf{vel}^2+a^2)} \end{bmatrix} \)
  • \(\nabla_a f(s,a) = \begin{bmatrix} \frac{\partial f_1 (s,a)}{\partial a} & \frac{\partial f_2 (s,a)}{\partial a} \end{bmatrix} = \begin{bmatrix} 0 & (1-2a^2) e^{-(\mathsf{vel}^2+a^2)} \end{bmatrix}\)

These analytic Jacobians are transcribed into code below.
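
Written as code (a direct transcription of the formulas above; the convention is rows indexed by the differentiation variable, columns by \(f_j\)):

```python
import numpy as np

# Analytic Jacobians of the UAV dynamics from this slide.
def f(s, a):
    pos, vel = s
    return np.array([pos + vel, vel + np.exp(-(vel**2 + a**2)) * a])

def grad_s_f(s, a):
    _, vel = s
    e = np.exp(-(vel**2 + a**2))
    return np.array([[1.0, 0.0],
                     [1.0, 1.0 - 2.0 * a * vel * e]])

def grad_a_f(s, a):
    _, vel = s
    e = np.exp(-(vel**2 + a**2))
    return np.array([[0.0, (1.0 - 2.0 * a**2) * e]])

print(grad_s_f(np.zeros(2), 0.0))   # [[1, 0], [1, 1]]
print(grad_a_f(np.zeros(2), 0.0))   # [[0, 1]]
```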

Quadratic Approximation

  • Second-order approximation $$ g(x) \approx g(x_0) + g'(x_0)(x-x_0) + \frac{1}{2} g''(x_0)(x-x_0)^2$$
  • For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0)  + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
  • Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\)
  • Hessians \( \nabla_s^2 c(s, a) \in\mathbb R^{n_s\times n_s}\), \( \nabla_a^2 c(s, a) \in\mathbb R^{n_a \times n_a}\), and \( \nabla_{as}^2 c(s, a) \in\mathbb R^{n_a\times n_s}\) contain second derivatives

Quadratic Approximation

  • For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\)
    • Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\)
      • entry \(i\) is \( \frac{\partial c (s,a)}{\partial s_i}\) (resp. \( \frac{\partial c (s,a)}{\partial a_i}\)), the effect of the \(i\)th dimension of the current state (resp. action)

Quadratic Approximation

  • For multi-variate function \(c:\mathbb R^{n_s}\times \mathbb R^{n_a} \to \mathbb R\)
    • Gradients \( \nabla_s c(s, a) \in\mathbb R^{n_s}\) and \( \nabla_a c(s, a) \in\mathbb R^{n_a}\)
    • Hessians \( \nabla_s^2 c(s, a) \in\mathbb R^{n_s\times n_s}\), \( \nabla_a^2 c(s, a) \in\mathbb R^{n_a \times n_a}\), \( \nabla_{as}^2 c(s, a) \in\mathbb R^{n_a\times n_s}\)
      • entry \((i,j)\) is \( \frac{\partial^2 c (s,a)}{\partial s_i\partial s_j}\), \( \frac{\partial^2c(s,a)}{\partial a_i\partial a_j}\), and \( \frac{\partial^2 c (s,a)}{\partial a_i \partial s_j}\), respectively
      • \(\nabla_s^2 c\) and \(\nabla_a^2 c\) are symmetric

Example

  • Setting: hovering UAV over a target
    • state \(s = [\mathsf{pos}, \mathsf{vel}]\)
  • The cost: $$c(s,a) = (1-e^{-\mathsf{pos}^2}) +\lambda a^2$$
  • \(\nabla_s c(s,a)= \begin{bmatrix} 2\mathsf{pos}\cdot e^{-\mathsf{pos}^2} \\ 0 \end{bmatrix} \)
  • \(\nabla_s^2 c(s,a)= \begin{bmatrix} 2(1-2\mathsf{pos}^2) e^{-\mathsf{pos}^2} & 0\\ 0& 0 \end{bmatrix} \)
  • \(\nabla_a c(s,a)= 2\lambda a\) and \(\nabla_a^2 c(s,a)= 2\lambda\)
  • \(\nabla_{as}^2 c(s,a)=0\)


Finite Difference Approximation

  • For scalar function $$g'(x) \approx \frac{g(x+\delta)-g(x-\delta)}{2\delta}$$
  • For multivariate $$  \frac{\partial f_j (s,a)}{\partial s_i} \approx \frac{f_j(s+\delta e_i,a)-f_j(s-\delta e_i,a)}{2\delta}$$ where \(e_i\) is a standard basis vector
  • For second derivatives, apply the central difference twice (a code sketch follows below):

$$\frac{\partial^2 c (s,a)}{\partial a_i \partial s_j} \approx  \frac{1}{2\delta}\Big[ \frac{\partial c (s,a +\delta e_i)}{\partial s_j} - \frac{\partial c (s,a -\delta e_i)}{\partial s_j} \Big]$$

$$\frac{\partial^2 c (s,a)}{\partial a_i \partial s_j} \approx  \frac{1}{2\delta}\Big[ \frac{c(s+\delta e_j,a +\delta e_i)- c(s-\delta e_j,a +\delta e_i)}{2\delta} \\-  \frac{c(s+\delta e_j,a -\delta e_i)-c(s-\delta e_j,a -\delta e_i)}{2\delta} \Big]$$
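
A minimal sketch of these central differences in code (the helper names and step sizes are arbitrary choices, not from the slides):

```python
import numpy as np

# Central-difference sketches matching the formulas above; `fun` and `cost`
# are black-box functions. Illustrative, not tuned for numerical accuracy.
def jacobian_fd(fun, s, a, delta=1e-5):
    """Approximate grad_s f(s,a): rows indexed by s_i, columns by f_j."""
    n_s = len(s)
    rows = []
    for i in range(n_s):
        e_i = np.zeros(n_s); e_i[i] = 1.0
        rows.append((fun(s + delta * e_i, a) - fun(s - delta * e_i, a)) / (2 * delta))
    return np.array(rows)

def hessian_as_fd(cost, s, a, delta=1e-4):
    """Approximate grad^2_{as} c(s,a) (shape n_a x n_s) by nested differencing."""
    n_s, n_a = len(s), len(a)
    H = np.zeros((n_a, n_s))
    for i in range(n_a):
        e_i = np.zeros(n_a); e_i[i] = 1.0
        for j in range(n_s):
            e_j = np.zeros(n_s); e_j[j] = 1.0
            H[i, j] = (cost(s + delta * e_j, a + delta * e_i)
                       - cost(s - delta * e_j, a + delta * e_i)
                       - cost(s + delta * e_j, a - delta * e_i)
                       + cost(s - delta * e_j, a - delta * e_i)) / (4 * delta**2)
    return H
```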

Agenda

1. Recap: Control & LQR

2. Optimal LQR Policy

3. Nonlinear Approximation

4. Local Linear Control

Local Control

  • Local control around \((s_\star,a_\star)\)
    • e.g. Cartpole (PA2)
      • \(s = \begin{bmatrix} \theta\\ \omega \\ x \\ v \end{bmatrix}\) and \(a = f\)
      • goal: balance \(s_\star = 0\) and \(a_\star = 0\)

 

  • Applicable when costs \(c\) are smallest at \((s_\star,a_\star)\) and initial state is close to \(s_\star\)

[Figure: cartpole with pole angle \(\theta\), angular velocity \(\omega\), cart position \(x\), cart velocity \(v\), applied force \(f\), and gravity.]

  • Assumptions:
    1. Black-box access to \(f\) and \(c\)
      • i.e. can query at any \((s,a)\) and observe outputs \(s'\) and \(c\) where \(s'=f(s,a)\) and \(c=c(s,a)\)
    2. \(f\) is differentiable and \(c\) is twice differentiable
      • i.e. Jacobians and Hessians are well defined

$$\min_{\pi}\quad \sum_{t=0}^{H-1} c(s_t, a_t) \qquad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

Local Control

  • Procedure
    1. Approximate dynamics & costs
      • First/second order approximation
      • Finite differencing
    2. Policy via LQR

$$\min_{\pi}\quad \sum_{t=0}^{H-1} c(s_t, a_t) \qquad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$

Local Control

Linearized Dynamics

  • Linearization of dynamics around \((s_0,a_0)\)
    • \( f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) \)
    • \( =A_0s+B_0a+c_0 \)
  • where the matrices depend on \((s_0,a_0)\):
    • \(A_0 = \nabla_s f(s_0, a_0)^\top \)
    • \(B_0 = \nabla_a f(s_0, a_0)^\top \)
    • \(c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0 \)
  • Black box access: use finite differencing to compute these (sketched below)
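
A sketch of this construction from black-box access (a hypothetical helper using central differences; `f` is assumed to return the next state as a NumPy array):

```python
import numpy as np

# Form the local linear model s' ~ A0 s + B0 a + c0 around (s0, a0)
# from black-box access to f, via central differences.
def linearize_dynamics(f, s0, a0, delta=1e-5):
    n_s, n_a = len(s0), len(a0)
    A0 = np.zeros((n_s, n_s))
    B0 = np.zeros((n_s, n_a))
    for i in range(n_s):
        e = np.zeros(n_s); e[i] = 1.0
        A0[:, i] = (f(s0 + delta * e, a0) - f(s0 - delta * e, a0)) / (2 * delta)
    for i in range(n_a):
        e = np.zeros(n_a); e[i] = 1.0
        B0[:, i] = (f(s0, a0 + delta * e) - f(s0, a0 - delta * e)) / (2 * delta)
    # In the slides' notation, A0 = grad_s f(s0,a0)^T and B0 = grad_a f(s0,a0)^T
    c0 = f(s0, a0) - A0 @ s0 - B0 @ a0
    return A0, B0, c0
```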

Example

  • Setting: hovering UAV over a target
    • state \(s = [\mathsf{pos}, \mathsf{vel}]\)
  • Linearizing around \((0,0)\)
  • \(f(0,0) = 0\)
  • \(\nabla_s f(0,0) = \begin{bmatrix} 1 & 0 \\  1 & 1-2\cdot 0\cdot e^{-0} \end{bmatrix} \)
  • \(\nabla_a f(0,0) =\begin{bmatrix} 0 & (1-0) e^{-0} \end{bmatrix}\)
  • \(s_{t+1}=f(s_t, a_t) \approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)


Second-Order Approx. Costs

  • Approximate costs around \((s_0,a_0)\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0)  + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
    • \( =s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0\)
  • Practical consideration:
    • Force \(Q_0,R_0\) to be positive definite by setting negative eigenvalues to 0 and adding regularization \(\lambda I\)
  • Black box access: use finite differencing to compute

For a symmetric matrix \(Q\in\mathbb R^{n\times n}\) the eigen-decomposition is $$Q = \sum_{i=1}^n v_iv_i^\top \sigma_i $$

To make this PSD, we replace $$Q\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
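
In code, this eigenvalue fix-up might look like the following sketch (the regularization value is an arbitrary choice):

```python
import numpy as np

# Clip negative eigenvalues of a symmetric matrix to zero and add a small
# regularizer so the result is positive definite.
def make_positive_definite(Q, reg=1e-3):
    Q = 0.5 * (Q + Q.T)                     # symmetrize against numerical noise
    eigvals, eigvecs = np.linalg.eigh(Q)
    eigvals = np.maximum(eigvals, 0.0) + reg
    return eigvecs @ np.diag(eigvals) @ eigvecs.T

print(make_positive_definite(np.array([[1.0, 0.0], [0.0, -2.0]])))
# -> approximately [[1.001, 0], [0, 0.001]]
```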


Example

  • Setting: hovering UAV over a target
    • state \(s = [\mathsf{pos}, \mathsf{vel}]\)
  • Second-order approximation around \((0,0)\)
    • \(\nabla_s c(0,0)= \begin{bmatrix} 0 \\ 0 \end{bmatrix} \)
      • \(\nabla_s^2 c(0,0)= \begin{bmatrix} 2 & 0\\ 0& 0 \end{bmatrix} \)
    • \(\nabla_a c(0,0)= 0\) and \(\nabla_a^2 c(0,0)= 2\lambda\)
    • \(\nabla_{as}^2 c(0,0)=0\)
  • \(c(s,a)\approx \mathsf{pos}^2 + \lambda a^2\)

Local Control

  1. Approximate dynamics & costs
    • Linearize \(f\) as \(A_0,B_0,c_0\)
    • Approximate \(c\) as \(Q_0,R_0,M_0,q_0,r_0,v_0\)
  2. LQR policy: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(A_0,B_0,c_0, Q_0, R_0, M_0, q_0, r_0, v_0) $$
    • works as long as states and actions remain close to \(s_\star\) and \(a_\star\) (an end-to-end sketch in code follows the optimization problem below)

$$\min_{\pi}\quad \sum_{t=0}^{H-1} c(s_t, a_t) \qquad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$
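
Putting the pieces together for the UAV example, a hypothetical end-to-end sketch: it uses the expansion point \((0,0)\), an assumed \(\lambda=\tfrac{1}{2}\) and horizon, and the local approximations computed analytically earlier, then rolls the resulting linear policy out on the true nonlinear system.

```python
import numpy as np

# Local linear control for the UAV example (a sketch, not a general solver).
lam, H = 0.5, 20

def f(s, a):                                    # true nonlinear dynamics
    pos, vel = s
    return np.array([pos + vel, vel + np.exp(-(vel**2 + a**2)) * a])

def cost(s, a):                                 # true nonlinear cost
    return (1 - np.exp(-s[0]**2)) + lam * a**2

# Step 1: local approximation around (s*, a*) = (0, 0), computed analytically above
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
Q = np.array([[1., 0.], [0., 0.]])
R = np.array([[lam]])

# Step 2: finite-horizon LQR backward recursion for the gains K_t
P, gains = Q, [None] * H
for t in range(H - 1, -1, -1):
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K
    gains[t] = K

# Roll out pi_t(s) = K_t s on the *nonlinear* system from a state near s* = 0
s, total = np.array([0.5, 0.0]), 0.0
for t in range(H):
    a = (gains[t] @ s).item()
    total += cost(s, a)
    s = f(s, a)
print(total, s)   # small total cost; the state should stay near the target
```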

Local Control as Approx DP

  • Initialize \(V^\star_H(s) = c_H(s)\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = c(s,a)+V^\star_{t+1}(f(s,a))\)
    • \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)

Recap

  • PSet due Wednesday

 

  • Optimal LQR Policy
  • Nonlinear Approximation
  • Locally Linear Control

 

  • Next lecture: Iterative Nonlinear Control
