CS 4/5789: Introduction to Reinforcement Learning
Lecture 8: Linear Quadratic Regulator
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework this week
- Programming Assignment 1 due TONIGHT
- Next PSet and PA released tonight
- PSet due next Wednesday
- PA due in 2 weeks
- My office hours:
- Tuesdays 10:30-11:30am in Gates 416A
- Wednesdays 4-4:50pm in Olin 255 (right after lecture)
Agenda
1. Recap
2. Linear Control
3. Linear Quadratic Regulator
Recap: Optimal Control
- Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
- Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
- Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
- Finite horizon \(H\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
$$\min_{\pi}\quad \sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H) \quad \text{s.t.}\quad s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)$$
Recap: Linear Dynamics
- The dynamics function \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
- \(A\) describes the evolution of the state when there is no action (internal dynamics) $$ s_{t+1}=As_t$$
Recap: Trajectories and Stability
Trajectory is determined by the eigenstructure of \(A\)
[Figure: trajectories in the \((s_1, s_2)\) plane alongside eigenvalues in the complex plane \(\mathbb C\) (axes \(\mathcal R(\lambda)\), \(\mathcal I(\lambda)\)), for the real-eigenvalue cases \(0<\lambda_2<\lambda_1<1\), \(0<\lambda_2<1<\lambda_1\), and \(1<\lambda_2<\lambda_1\)]
Recap: Trajectories and Stability
\(\lambda = \alpha \pm i \beta\)
Trajectory is determined by the eigenstructure of \(A\)
[Figure: for complex eigenvalues \(\lambda=\alpha\pm i\beta\), trajectories in the \((s_1,s_2)\) plane spiral, decaying when \(0<\alpha^2+\beta^2<1\) and growing when \(1<\alpha^2+\beta^2\); eigenvalues shown in \(\mathbb C\) with axes \(\mathcal R(\lambda)\), \(\mathcal I(\lambda)\)]
Recap: Trajectories and Stability
\(\lambda_1 = \lambda_2=\lambda\)
Trajectory is determined by the eigenstructure of \(A\)
- depends on whether \(A\) is diagonalizable
[Figure: trajectories in the \((s_1,s_2)\) plane for a repeated eigenvalue, in the cases \(0<\lambda<1\) and \(\lambda>1\)]
Recap: Stability Theorem
Theorem: Let \(\{\lambda_i\}_{i=1}^n\subset \mathbb C\) be the eigenvalues of \(A\).
Then for \(s_{t+1}=As_t\), the equilibrium \(s_{eq}=0\) is
- asymptotically stable \(\iff \max_{i\in[n]}|\lambda_i|<1\)
- unstable if \(\max_{i\in[n]}|\lambda_i|> 1\)
- call \(\max_{i\in[n]}|\lambda_i|=1\) "marginally (un)stable"
Stability Theorem
Proof
- If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors: \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)
- By definition, \(Av_i = \lambda_i v_i\)
- Therefore, \(s_t = A^t s_0 = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)
- Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\); and if any \(|\lambda_i|>1\), then \(\|s_t\|\to\infty\) whenever the corresponding \(\alpha_i\neq 0\)
- The proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form
Marginally (un)stable
- We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"
- Consider the investing example (not unstable, since \(\lambda_2<1\)): $$ s_{t} = \begin{bmatrix} 1 &0 \\0 & \lambda_2 \end{bmatrix}^t s_0 $$
- Consider the UAV example (unstable): $$s_{t} = \begin{bmatrix} 1 & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\ 0 & 1\end{bmatrix} s_0 $$
- Stability depends on the eigenvectors, not just the eigenvalues! (See the sketch below.)
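To make the contrast concrete, here is a minimal numpy sketch; the specific values (\(\lambda_2=0.5\), the initial state, and the horizon) are illustrative choices, not from the course materials.

```python
import numpy as np

# Both matrices have max |eigenvalue| = 1, but only the Jordan block blows up.
A_invest = np.array([[1.0, 0.0],
                     [0.0, 0.5]])   # diagonalizable, eigenvalues {1, 0.5}
A_uav = np.array([[1.0, 1.0],
                  [0.0, 1.0]])      # Jordan block, repeated eigenvalue 1

s0 = np.array([1.0, 1.0])
for name, A in [("investing", A_invest), ("UAV", A_uav)]:
    s = s0.copy()
    for _ in range(50):
        s = A @ s                   # roll out s_{t+1} = A s_t
    print(name, np.abs(np.linalg.eigvals(A)).max(), np.linalg.norm(s))
# investing: ||s_50|| stays bounded (about 1); UAV: ||s_50|| grows linearly (about 51)
```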
Agenda
1. Recap
2. Linear Control
3. Linear Quadratic Regulator
Controlled Trajectories
- Full dynamics depend on actions $$ s_{t+1} = As_t+Ba_t $$
- The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
- The internal dynamics \(A\) determine the long-term effects of actions (illustrated in the sketch below)
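A quick check of this closed form against a direct rollout, on a random instance (dimensions and horizon are arbitrary):

```python
import numpy as np

# Verify  s_t = A^t s_0 + sum_{k=0}^{t-1} A^k B a_{t-k-1}
# against rolling out s_{t+1} = A s_t + B a_t directly.
rng = np.random.default_rng(0)
A = rng.normal(size=(2, 2))
B = rng.normal(size=(2, 1))
s0 = rng.normal(size=2)
actions = rng.normal(size=(10, 1))          # a_0, ..., a_9

s = s0.copy()
for a in actions:                           # direct rollout
    s = A @ s + B @ a

t = len(actions)                            # closed form at t = 10
s_closed = np.linalg.matrix_power(A, t) @ s0 + sum(
    np.linalg.matrix_power(A, k) @ B @ actions[t - k - 1] for k in range(t)
)
assert np.allclose(s, s_closed)
```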
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\} $$
- \(s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}\)
- \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}\)
- for \(1\le t\leq t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}\) and for \(t> t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}\) (checked numerically below)
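A sanity check of this trajectory, taking \(\mathsf{pos}_0=0\) and \(t_0=5\) for illustration:

```python
import numpy as np

# One rightward thrust at t = 0 and one leftward thrust at t = t0:
# the UAV should drift t0 units and come to rest.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
b = np.array([0.0, 1.0])
t0 = 5
s = np.array([0.0, 0.0])                    # pos_0 = 0, at rest
for t in range(2 * t0):
    a = 1.0 if t == 0 else (-1.0 if t == t0 else 0.0)
    s = A @ s + b * a
print(s)                                    # [5. 0.] = [pos_0 + t0, 0]
```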
Linear Policy
- Linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$
- The trajectories can be written as $$ s_{t} = (A+BK)^t s_0 $$
- The internal dynamics \(A\) are modified depending on \(B\) and \(K\) (see the sketch below)
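A minimal simulation sketch; the gain \(K\) here is an arbitrary stabilizing choice for illustration:

```python
import numpy as np

# Under a_t = K s_t the rollout is just powers of the closed-loop matrix A + BK.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-0.5, -1.0]])                # illustrative gain

Acl = A + B @ K
print(np.abs(np.linalg.eigvals(Acl)))       # both < 1, so this loop is stable
s = np.array([1.0, 0.0])
for t in range(5):
    print(t, s)
    s = Acl @ s
```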
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)
- \(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
- \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
- \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Thrust according to distance from target and velocity: \(a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)\) (both gains are compared numerically below)
- \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
- \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
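A quick numerical comparison of the two feedback gains from these examples, reading stability off the closed-loop eigenvalues:

```python
import numpy as np

# Position-only feedback vs. position + velocity feedback for the UAV.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
for K in (np.array([[-1.0, 0.0]]),          # a_t = -(pos_t - x)
          np.array([[-1.0, -1.0]])):        # a_t = -(pos_t + vel_t - x)
    eigs = np.linalg.eigvals(A + B @ K)
    print(K, np.abs(eigs))
# position-only: |lambda| = sqrt(2) > 1, spirals away from the target;
# position + velocity: |lambda| = 1, marginally (un)stable
```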
Agenda
1. Recap
2. Linear Control
3. Linear Quadratic Regulator
Linear Quadratic Regulator
Special case of optimal control problem with
- Quadratic cost, with \(Q\succeq 0\) and \(R\succ 0\): $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H(s) = s^\top Qs$$
- Linear dynamics $$s_{t+1} = As_t+ Ba_t$$
$$\min_{\pi}\quad \sum_{t=0}^{H-1} \left(s_t^\top Qs_t +a_t^\top Ra_t\right)+s_H^\top Q s_H \quad \text{s.t.}\quad s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)$$
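For reference, a short sketch (the function name and interface are my own) that evaluates this objective for a fixed linear policy \(\pi_t(s)=Ks\):

```python
import numpy as np

# Total LQR cost of a fixed linear policy a_t = K s_t:
# roll out the dynamics and sum the quadratic stage costs.
def lqr_cost(A, B, Q, R, K, s0, H):
    R = np.atleast_2d(R)
    s, total = np.asarray(s0, dtype=float), 0.0
    for _ in range(H):
        a = K @ s
        total += s @ Q @ s + a @ R @ a      # stage cost s^T Q s + a^T R a
        s = A @ s + B @ a
    return total + s @ Q @ s                # terminal cost s_H^T Q s_H
```

For instance, `lqr_cost(A, B, Q, 0.5, K, s0, 20)` scores any candidate gain \(K\) on the UAV instance below.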
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State is \(s_t = \begin{bmatrix}\mathsf{position}_t - x\\ \mathsf{velocity}_t\end{bmatrix}\)
- \(c_t(s_t, a_t) = (\mathsf{position}_t-x)^2+\lambda a_t^2\)
- \(f(s_t, a_t) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
\(Q = \begin{bmatrix}1&0\\ 0&0\end{bmatrix},\quad R=\lambda\)
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- Consider \(H=1\)
$$\min_{a}\quad s^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s + (s')^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s' +\lambda a^2 \quad \text{s.t.} \quad s' = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s + \begin{bmatrix}0\\ 1\end{bmatrix}a $$
$$\min_{a}\quad (\begin{bmatrix}1&0\end{bmatrix}s)^2 + (\begin{bmatrix}1&1\end{bmatrix}s)^2 + \lambda a^2 \quad \implies a^\star = 0 $$
since the thrust does not affect the position after a single step, \(a\) appears only in the \(\lambda a^2\) term.
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- Consider \(H=2\)
$$\min_{a_0, a_1}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + s_1^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_1 + s_2^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_2+\lambda a_{0}^2+\lambda a_1^2 $$
$$\text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} , \quad \quad s_{2} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{1} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{1} $$
By the \(H=1\) argument, the optimal final action is \(a_1^\star=0\), so the problem reduces to
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + (\begin{bmatrix}1&0\end{bmatrix}s_1)^2 + (\begin{bmatrix}1&1\end{bmatrix}s_1)^2 +\lambda a_{0}^2 \quad \text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} $$
Collecting the \(s_1\) terms into a single quadratic,
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + s_1^\top \begin{bmatrix}2&1\\ 1 & 1\end{bmatrix}s_1 +\lambda a_{0}^2 \quad \text{s.t.} \quad s_{1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0} $$
and substituting the constraint,
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + \left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}\right)^\top \begin{bmatrix}2&1\\ 1 & 1\end{bmatrix}\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_{0} + \begin{bmatrix}0\\ 1\end{bmatrix}a_{0}\right) +\lambda a_{0} ^2$$
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}1&0\\ 0&0\end{bmatrix}s_0 + s_0^\top \begin{bmatrix}2&3\\ 3&5\end{bmatrix}s_0 + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + a_0^2 +\lambda a_{0}^2 $$
$$\min_{a_0}\quad s_0^\top \begin{bmatrix}3&3\\ 3&5\end{bmatrix}s_0 + 2 s_0^\top \begin{bmatrix}1\\2\end{bmatrix}a_0 + (1 +\lambda)a_0^2 \implies a_0^\star = -\frac{\begin{bmatrix}1&2\end{bmatrix}s_0}{1+\lambda} $$
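A numerical spot-check of this closed form, taking \(\lambda=1/2\) and \(s_0=\begin{bmatrix}1&0\end{bmatrix}^\top\) for illustration:

```python
import numpy as np

# Spot-check a_0* = -[1 2]s_0 / (1 + lambda), using a_1* = 0 from the H=1 step.
lam = 0.5
A = np.array([[1.0, 1.0], [0.0, 1.0]])
b = np.array([0.0, 1.0])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
s0 = np.array([1.0, 0.0])

def cost(a0):
    s1 = A @ s0 + b * a0
    s2 = A @ s1                              # a_1 = 0
    return s0 @ Q @ s0 + s1 @ Q @ s1 + s2 @ Q @ s2 + lam * a0**2

grid = np.linspace(-2.0, 2.0, 40001)
print(grid[np.argmin([cost(a) for a in grid])])  # grid search: about -0.6667
print(-(np.array([1.0, 2.0]) @ s0) / (1 + lam))  # closed form: -2/3
```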
DP for Optimal Control
Reformulated for optimal control, our general-purpose dynamic programming algorithm is:
- Initialize \(V^\star_H(s) = c_H(s)\)
- For \(t=H-1, H-2, ..., 0\):
- \(Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))\) (transitions are deterministic, so the expectation over \(s'\) reduces to evaluating at \(s'=f(s,a)\))
- \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
- \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
- Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
LQR via DP
- \(V_H^\star(s) = s^\top Q s\)
- \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
- \(=\min_{a}~ s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba\)
- General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\)
- \(2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m\)
- \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
- minimum is \(c-m^\top M^{-1} m\)
- \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)
DP: \(V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))\)
Theorem: For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$
where the matrices are defined as \(P_{H} = Q\) and
- \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
- \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
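A direct implementation sketch of this backward recursion (the function name and interface are my own; the updates are exactly the \(P_t\), \(K_t\) above):

```python
import numpy as np

# Backward (Riccati) recursion for finite-horizon deterministic LQR.
def lqr_gains(A, B, Q, R, H):
    R = np.atleast_2d(R)
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q                                            # V_H(s) = s^T Q s
    for t in range(H - 1, -1, -1):
        S = R + B.T @ P[t + 1] @ B                      # R + B^T P_{t+1} B
        K[t] = -np.linalg.solve(S, B.T @ P[t + 1] @ A)  # K_t from the theorem
        P[t] = Q + A.T @ P[t + 1] @ A + A.T @ P[t + 1] @ B @ K[t]  # P_t
    return K, P
```

Using `np.linalg.solve` rather than forming the inverse \((R+B^\top P_{t+1}B)^{-1}\) explicitly is the standard numerically preferable choice.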
LQR Proof
- Base case: \(V_H^\star(s) = s^\top Q s\)
- Inductive step: assume the claim holds at \(t+1\), i.e. \(V^\star_{t+1}(s)=s^\top P_{t+1}s\).
- DP at \(t\): \(V_t^\star(s)= \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top P_{t+1} (As+Ba)\)
- \(=\min_{a}~ s^\top (Q+A^\top P_{t+1}A) s+a^\top (R+B^\top P_{t+1} B) a+2s^\top A^\top P_{t+1} Ba\)
- General minimization: \(\arg\min_a c + a^\top M a + 2m^\top a\) gives \(a_\star = -M^{-1} m\) and minimum is \(c-m^\top M^{-1} m\)
- \( \pi_{t}^\star(s)=-(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}As\)
- \(V_{t}^\star(s) = s^\top (Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A) s\)
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State: distance from target, velocity
- LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)
\(\pi_t^\star(s) = \begin{bmatrix}\gamma_t^\mathsf{pos} & \gamma_t^\mathsf{vel} \end{bmatrix}s\)
[Plot: the gain entries \(\gamma_t^\mathsf{pos}\) and \(\gamma_t^\mathsf{vel}\) as functions of \(t\) from \(0\) to \(H\)]
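Running the recursion on this instance (a self-contained sketch; the horizon \(H=30\) is an arbitrary choice) reproduces the gains from the worked \(H=2\) example:

```python
import numpy as np

# Riccati recursion for LQR([[1,1],[0,1]], [[0],[1]], [[1,0],[0,0]], 1/2).
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[0.5]])
H = 30

P = Q                                       # P_H = Q
K = [None] * H
for t in range(H - 1, -1, -1):
    K[t] = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P = Q + A.T @ P @ A + A.T @ P @ B @ K[t]

print(K[H - 1])   # [[0, 0]]: thrust cannot reduce the final position cost
print(K[H - 2])   # [[-2/3, -4/3]] = -[1, 2]/(1 + lambda), matching the worked example
print(K[0])       # far from the horizon, the gains level off toward constants
```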
LQR Extensions
- The same dynamic programming method extends in a straightforward manner when:
- Dynamics and costs are time varying
- Affine term in the dynamics, cross terms in the costs
- General form: $$c_t(s,a) = s^\top Q_t s+a^\top R_t a + s^\top M_ta + m_t$$ $$ f_t(s_t,a_t) = A_ts_t + B_t a_t +w_t$$
- Many applications can be reformulated this way:
- e.g. trajectory tracking \(c_t(s,a) = \|s-\bar s_t\|_2^2 + \|a\|_2^2\) for given \(\bar s_t\)
- Next lecture: general (nonlinear) dynamics and costs
Recap
- PA 1 due TONIGHT
- Linear Control
- LQR
- Next lecture: Nonlinear Control