CS 4/5789: Introduction to Reinforcement Learning

Lecture 8: Linear Quadratic Regulator

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework this week
    • Programming Assignment 1 due TONIGHT
    • Next PSet and PA released tonight
      • PSet due next Wednesday
      • PA due in 2 weeks
  • My office hours:
    • Tuesdays 10:30-11:30am in Gates 416A
    • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap

2. Linear Control

3. Linear Quadratic Regulator

Recap: Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

minimize over \(\pi\):   \(\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)


Recap: Linear Dynamics

  • The dynamics function \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
  • \(A\) describes the evolution of the state when there is no action (internal dynamics) $$ s_{t+1}=As_t$$

Recap: Trajectories and Stability

Trajectory is determined by the eigenstructure of \(A\)

[Figure: real eigenvalues shown in the complex plane \((\mathcal R(\lambda), \mathcal I(\lambda))\), with trajectories in the \((s_1, s_2)\) plane for the cases \(0<\lambda_2<\lambda_1<1\), \(0<\lambda_2<1<\lambda_1\), and \(1<\lambda_2<\lambda_1\)]

Recap: Trajectories and Stability

Trajectory is determined by the eigenstructure of \(A\)

[Figure: complex eigenvalues \(\lambda = \alpha \pm i\beta\) shown in the complex plane \((\mathcal R(\lambda), \mathcal I(\lambda))\), with spiral trajectories in the \((s_1, s_2)\) plane for the cases \(0<\alpha^2+\beta^2<1\) (inward spiral) and \(1<\alpha^2+\beta^2\) (outward spiral)]


Recap: Trajectories and Stability

\(\lambda_1 = \lambda_2=\lambda\)

\(\mathbb C\)

\(\mathcal R(\lambda)\)

\(\mathcal I(\lambda)\)

Trajectory is determined by the eigenstructure of \(A\)

  • depends on if \(A\) is diagonalizable

\(s_1\)

\(s_2\)

\(0<\lambda<1\)

\(\lambda>1\)


Recap: Stability Theorem

Theorem: Let \(\{\lambda_i\}_{i=1}^n\subset \mathbb C\) be the eigenvalues of \(A\).
Then for \(s_{t+1}=As_t\), the equilibrium \(s_{eq}=0\) is

  • asymptotically stable \(\iff \max_{i\in[n]}|\lambda_i|<1\)
  • unstable if \(\max_{i\in[n]}|\lambda_i|> 1\)
  • call \(\max_{i\in[n]}|\lambda_i|=1\) "marginally (un)stable"
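This eigenvalue test is easy to run numerically; a minimal numpy sketch (the function names are my own):

```python
import numpy as np

def spectral_radius(A):
    """Largest magnitude among the eigenvalues of A."""
    return max(abs(np.linalg.eigvals(A)))

def classify(A, tol=1e-9):
    """Apply the stability theorem to s_{t+1} = A s_t at s_eq = 0."""
    rho = spectral_radius(A)
    if rho < 1 - tol:
        return "asymptotically stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally (un)stable"

print(classify(np.diag([0.5, 0.9])))                  # asymptotically stable
print(classify(np.array([[1.0, 1.0], [0.0, 1.0]])))   # marginally (un)stable
```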


Stability Theorem

Proof

  • If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)

    • By definition, \(Av_i = \lambda_i v_i\)

    • Therefore, \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)

    • Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\); and if some \(|\lambda_i|>1\), then \(\|s_t\|\to\infty\) for generic \(s_0\) (any with \(\alpha_i\neq 0\))

  • Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form

Marginally (un)stable

  • We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"

  • Consider the independent investing example (trajectories stay bounded since \(\lambda_2<1\)): $$ s_{t} = \begin{bmatrix} 1  &0 \\0 & \lambda_2 \end{bmatrix}^t s_0 $$
  • Consider the UAV example (unstable: \(\|s_t\|\) grows linearly): $$s_{t} = \begin{bmatrix} 1  & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\ 0 & 1\end{bmatrix} s_0 $$
  • Depends on eigenvectors, not just eigenvalues!
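The two marginal cases behave very differently, which a quick simulation makes concrete (a sketch with \(\lambda_2 = 0.5\) and an arbitrary \(s_0\), both my own choices):

```python
import numpy as np

A_invest = np.array([[1.0, 0.0], [0.0, 0.5]])   # lambda_1 = 1, lambda_2 = 0.5
A_uav    = np.array([[1.0, 1.0], [0.0, 1.0]])   # repeated eigenvalue 1, not diagonalizable

s0 = np.array([1.0, 1.0])
t = 100
print(np.linalg.norm(np.linalg.matrix_power(A_invest, t) @ s0))  # ~1: bounded
print(np.linalg.norm(np.linalg.matrix_power(A_uav, t) @ s0))     # ~101: grows like t
```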

Agenda

1. Recap

2. Linear Control

3. Linear Quadratic Regulator

Controlled Trajectories

  • Full dynamics depend on actions $$ s_{t+1} = As_t+Ba_t $$

  • The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
  • The internal dynamics \(A\) determines the long term effects of actions
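The closed-form trajectory can be checked against a direct rollout; a sketch with random \(A\), \(B\), and actions (all arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, T = 3, 2, 10
A = 0.5 * rng.normal(size=(n_s, n_s))
B = rng.normal(size=(n_s, n_a))
s0 = rng.normal(size=n_s)
actions = rng.normal(size=(T, n_a))

# rollout: iterate s_{t+1} = A s_t + B a_t
s = s0.copy()
for t in range(T):
    s = A @ s + B @ actions[t]

# closed form: s_T = A^T s_0 + sum_{k=0}^{T-1} A^k B a_{T-k-1}
s_closed = np.linalg.matrix_power(A, T) @ s0 + sum(
    np.linalg.matrix_power(A, k) @ B @ actions[T - k - 1] for k in range(T)
)
print(np.allclose(s, s_closed))  # True
```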

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\} $$


  • \(s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0  \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}\)
  • \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0  \\ 0 \end{bmatrix}+  \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}\)
  • for \(1\leq t\leq t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}\) and for \(t> t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}\)
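These closed forms can be verified by rolling out the dynamics directly; a sketch with hypothetical values \(\mathsf{pos}_0 = 2\) and \(t_0 = 4\):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
pos0, t0, T = 2.0, 4, 10

s = np.array([pos0, 0.0])          # initially at rest
traj = [s]
for t in range(T):
    a = 1.0 if t == 0 else (-1.0 if t == t0 else 0.0)
    s = A @ s + B * a
    traj.append(s)

print([st[0] for st in traj])
# position: 2, 2, 3, 4, 5, 6, 6, 6, ... -> parked at pos0 + t0
```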

Linear Policy

  • Linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$

  • The trajectories can be written as $$ s_{t} = (A+BK)^t s_0 $$
  • The internal dynamics \(A\) are modified depending on \(B\) and \(K\)
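In particular, a well-chosen \(K\) can stabilize an internally unstable \(A\). A sketch using the UAV dynamics and a hand-picked gain (my own choice, for illustration only):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])  # marginally (un)stable internal dynamics
B = np.array([[0.0], [1.0]])
K = np.array([[-0.25, -1.0]])           # hypothetical feedback gain

rho = max(abs(np.linalg.eigvals(A + B @ K)))
print(rho)  # 0.5 < 1: the closed loop (A + BK) is asymptotically stable
```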

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)


  • \(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
  • \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
  • \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)

PollEV

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)\)


 

  • \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
  • \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
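Neither of these two feedback laws actually settles the UAV onto the target, which the stability theorem makes precise; a quick check of the two closed-loop spectral radii:

```python
import numpy as np

# closed-loop matrices from the two feedback laws above
M1 = np.array([[1.0, 1.0], [-1.0, 1.0]])  # a_t = -(pos_t - x)
M2 = np.array([[1.0, 1.0], [-1.0, 0.0]])  # a_t = -(pos_t + vel_t - x)

r1 = max(abs(np.linalg.eigvals(M1)))
r2 = max(abs(np.linalg.eigvals(M2)))
print(r1, r2)  # sqrt(2) ~ 1.414 (unstable) and 1.0 (marginally (un)stable)
```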

Agenda

1. Recap

2. Linear Control

3. Linear Quadratic Regulator

Linear Quadratic Regulator

Special case of optimal control problem with

  • Quadratic cost $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H(s) = s^\top Qs$$
  • Linear dynamics $$s_{t+1} = As_t+ Ba_t$$

minimize over \(\pi\):   \(\displaystyle\sum_{t=0}^{H-1} s_t^\top Qs_t +a_t^\top Ra_t+s_H^\top Q s_H\)

s.t.   \(s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)\)


Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t - x\\ \mathsf{velocity}_t\end{bmatrix}\)
  • \(c_t(s_t, a_t) = (\mathsf{position}_t-x)^2+\lambda a_t^2\)
  • \(f(s_t, a_t) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)


\(Q = \begin{bmatrix}1&0\\ 0&0\end{bmatrix},\quad R=\lambda\)

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • Consider \(H=1\)

$$\min_{a}\quad s^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s +  (s')^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s' +\lambda a^2 \quad \text{s.t.} \quad s' = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s + \begin{bmatrix}0\\ 1\end{bmatrix}a $$

$$\min_{a}\quad  (\begin{bmatrix}1&0\end{bmatrix}s)^2 +  (\begin{bmatrix}1&1\end{bmatrix}s)^2 + \lambda a^2 \quad \implies a^\star = 0 $$
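The conclusion \(a^\star=0\) can be confirmed by brute force over a grid of actions (the state \(s\) below is an arbitrary choice of mine):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
lam, s = 0.5, np.array([2.0, -1.0])

def cost(a):
    s1 = A @ s + B * a                 # next state
    return s @ Q @ s + s1 @ Q @ s1 + lam * a**2

grid = np.linspace(-3, 3, 601)
best = grid[np.argmin([cost(a) for a in grid])]
print(best)  # 0: the thrust only changes next-step velocity, which Q ignores
```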


Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • Consider \(H=2\)

$$\min_{a_0, a_1}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + s_1^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_1  + s_2^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_2+\lambda a_{0}^2+\lambda a_1^2 $$

$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad \quad s_{2} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{1} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{1} $$

  • Since \(a_1\) only changes the velocity entry of \(s_2\), which the cost does not penalize, \(a_1^\star=0\), and the problem reduces to

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 +  (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 +\lambda a_{0}^2 \quad \text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}$$

  • Collecting the two quadratics in \(s_1\):

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + s_1^\top \begin{bmatrix}2&1\\ 1 & 1\end{bmatrix}s_1 +\lambda a_{0}^2 \quad \text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}$$

  • Substituting the dynamics and expanding:

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + \left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}\right)^\top \begin{bmatrix}2&1\\ 1 & 1\end{bmatrix}\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}\right) +\lambda a_{0} ^2$$

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0  + s_0^\top \begin{bmatrix}2&3\\ 3&5\end{bmatrix}s_0  + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + a_0^2 +\lambda a_{0}^2 $$

$$\min_{a_0}\quad  s_0^\top \begin{bmatrix}3&3\\ 3&5\end{bmatrix}s_0  + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + (1 +\lambda)a_0^2 \implies a_0^\star = -\frac{\begin{bmatrix}1&2\end{bmatrix}s_0}{1+\lambda} $$
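This closed form can be sanity-checked numerically by gridding over both actions (the state \(s_0\) below is an arbitrary choice of mine):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
lam, s0 = 0.5, np.array([1.5, -0.5])

def J(a0, a1):
    """Two-step cost: stage costs on s_0, s_1, terminal cost on s_2."""
    s1 = A @ s0 + B * a0
    s2 = A @ s1 + B * a1
    return s0 @ Q @ s0 + s1 @ Q @ s1 + s2 @ Q @ s2 + lam * (a0**2 + a1**2)

grid = np.linspace(-3, 3, 301)
costs = [[J(a0, a1) for a1 in grid] for a0 in grid]
i, j = np.unravel_index(np.argmin(costs), (301, 301))
a0_star = -np.array([1.0, 2.0]) @ s0 / (1 + lam)   # the slide's closed form
print(grid[i], grid[j], a0_star)  # grid minimizer near (-1/3, 0)
```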

DP for Optimal Control

Reformulating for optimal control, our general purpose dynamic programming algorithm is:

  • Initialize \(V^\star_H(s) = c_H(s)\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))\) (transitions are deterministic)
    • \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
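For intuition, here is a tabular sketch of this recursion under the simplifying assumption of finite \(\mathcal S\) and \(\mathcal A\) (the lecture's spaces are continuous; the function and toy problem are my own):

```python
def dp_optimal_control(S, A_set, c, c_H, f, H):
    """Backward DP for deterministic optimal control over finite spaces."""
    V = {s: c_H(s) for s in S}                    # V*_H
    policy = []
    for t in range(H - 1, -1, -1):
        Q_t = {(s, a): c(s, a) + V[f(s, a)] for s in S for a in A_set}
        pi_t = {s: min(A_set, key=lambda a, s=s: Q_t[(s, a)]) for s in S}
        V = {s: Q_t[(s, pi_t[s])] for s in S}
        policy.insert(0, pi_t)                    # build (pi*_0, ..., pi*_{H-1})
    return policy

# toy check: drive an integer state toward 0 with actions {-1, 0, +1}
S = range(-3, 4)
A_set = (-1, 0, 1)
f = lambda s, a: max(-3, min(3, s + a))
pi = dp_optimal_control(S, A_set, lambda s, a: abs(s), abs, f, H=4)
print(pi[0][3], pi[0][-2])  # -1 1: always step toward the origin
```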


LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba\)
  • General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\)
    • \(2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m\)
      • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • minimum is \(c-m^\top M^{-1} m\)
      • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

DP: \(V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))\)

LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
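The theorem's recursion translates directly into code. A sketch (the function name is my own), checked against the \(H=2\) worked example with \(\lambda = 1/2\):

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Backward Riccati recursion: returns gains K_0..K_{H-1}
    and value matrices P_0..P_H with V*_t(s) = s^T P_t s."""
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q
    for t in range(H - 1, -1, -1):
        M = R + B.T @ P[t + 1] @ B                      # R + B^T P_{t+1} B
        K[t] = -np.linalg.solve(M, B.T @ P[t + 1] @ A)  # K_t
        P[t] = Q + A.T @ P[t + 1] @ (A + B @ K[t])      # P_t
    return K, P

# UAV example: K_0 should match a_0* = -[1 2] s_0 / (1 + lambda)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
lam = 0.5
K, P = lqr(A, B, Q, np.array([[lam]]), H=2)
print(K[0])  # [[-2/3, -4/3]]
```

The update \(P_t = Q + A^\top P_{t+1}(A + BK_t)\) is algebraically identical to the theorem's expression once \(K_t\) is substituted; it just avoids forming the inverse twice.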

LQR Proof

  • Base case: \(V_H^\star(s) = s^\top Q s\)
  • Inductive step: Assume true at \(t+1\).
  • DP at \(t\): \(V_t^\star(s)= \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top P_{t+1} (As+Ba)\)
    • \(\quad \min_{a} s^\top (Q+A^\top P_{t+1}A) s+a^\top (R+B^\top P_{t+1} B) a+2s^\top A^\top P_{t+1} Ba\)
  • General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\) gives \(a_\star = -M^{-1} m\) and minimum is \(c-m^\top M^{-1} m\)
    • \( \pi_{t}^\star(s)=-(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}As\)
    • \(V_{t}^\star(s) = s^\top (Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A) s\)

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)

\(\pi_t^\star(s) = \begin{bmatrix}\gamma_t^{\mathsf{pos}} & \gamma_t^{\mathsf{vel}} \end{bmatrix}s\)

[Figure: the optimal gains \(\gamma_t^{\mathsf{pos}}\) and \(\gamma_t^{\mathsf{vel}}\) plotted against \(t\) up to the horizon \(H\)]

LQR Extensions

  • The same dynamic programming method extends in a straightforward manner when:
    1. Dynamics and costs are time varying
    2. Affine term in the dynamics, cross terms in the costs
  • General form: $$c_t(s,a) = s^\top Q_t s+a^\top R_t a + s^\top M_ta + m_t$$ $$ f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t$$
  • Many applications can be reformulated this way:
    • e.g. trajectory tracking \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\)
    • Next lecture: general (nonlinear) dynamics and costs
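The tracking cost above introduces a term linear in \(s\), which the quadratic template handles via a standard trick (an assumption, not spelled out on the slide): augment the state with a constant coordinate,

$$\tilde s_t = \begin{bmatrix} s_t \\ 1\end{bmatrix},\qquad \|s_t-\bar s_t\|_2^2 = \tilde s_t^\top \begin{bmatrix} I & -\bar s_t\\ -\bar s_t^\top & \bar s_t^\top \bar s_t\end{bmatrix}\tilde s_t,\qquad \tilde s_{t+1} = \begin{bmatrix} A_t & c_t \\ 0 & 1 \end{bmatrix}\tilde s_t + \begin{bmatrix} B_t \\ 0\end{bmatrix} a_t$$

The augmented problem is again LQR with time-varying quadratic costs, so the same recursion applies.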

Recap

  • PA 1 due TONIGHT

 

  • Linear Control
  • LQR

 

  • Next lecture: Nonlinear Control

Sp23 CS 4/5789: Lecture 8

By Sarah Dean

