CS 4/5789: Introduction to Reinforcement Learning

Lecture 8: Linear Dynamics and Stability

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Homework
    • Programming Assignment 1 due tonight
    • PSet 3 released tonight
    • PA 2 released later this week
  • First exam is Monday 3/4 during lecture
    • If you have a conflict, post on Ed ASAP!

Agenda

1. Recap

2. Linear Dynamics

3. Stability & Examples

4. Stability Theorem

Recap: Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

\(\displaystyle\min_{\pi}~~\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)

Recap: LQR

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear: $$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined by \(P_{H} = Q\) and a backward recursion giving

  • \(P_t\) and \(K_t\) in terms of \(A,B,Q,R\) and \(P_{t+1}\)

Special case of linear dynamics & quadratic costs $$f(s,a) = As+Ba,\quad c(s,a) = s^\top Q s + a^\top R a$$

\(\pi^\star = (K_0,\dots,K_{H-1}) = \mathsf{LQR}(A,B,Q,R)\)
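The recursion for \(P_t\) and \(K_t\) is not written out on this slide. As a minimal sketch, assuming the standard backward Riccati recursion for the convention \(a_t = K_t s_t\) and terminal cost \(s^\top Q s\), it could look like the following. The function name `lqr` and the horizon `H=50` are illustrative choices, not part of the course code.

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Finite-horizon LQR sketch (assumed standard Riccati recursion).

    Returns gains [K_0, ..., K_{H-1}] and matrices [P_0, ..., P_H],
    with the convention a_t = K_t s_t and P_H = Q.
    """
    P = [None] * (H + 1)
    K = [None] * H
    P[H] = Q  # terminal condition from the theorem
    for t in range(H - 1, -1, -1):
        BtP = B.T @ P[t + 1]
        # K_t = -(R + B^T P_{t+1} B)^{-1} B^T P_{t+1} A
        K[t] = -np.linalg.solve(R + BtP @ B, BtP @ A)
        # P_t = Q + A^T P_{t+1} (A + B K_t)
        P[t] = Q + A.T @ P[t + 1] @ (A + B @ K[t])
    return K, P

# Illustrative inputs: a double-integrator-style system with scalar action.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
Q = np.array([[1.0, 0.0], [0.0, 0.0]])
R = np.array([[0.5]])
K, P = lqr(A, B, Q, R, H=50)
print("K_0 =", K[0])
print("K_{H-1} =", K[-1])
```

With this \(Q\), the last gain is zero because the terminal cost does not depend on anything the final action can still influence.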

Linear Dynamics

  • Special case when dynamics \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
  • \(A\in\mathbb R^{n_s\times n_s}\) and \(B\in\mathbb R^{n_s\times n_a}\) are dynamics matrices respectively describing the "internal" dynamics and the effects of actions
  • The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
  • Powers of \(A\) determine the long-term effects of initial states and actions
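A quick numerical check of the trajectory formula, as a sketch: roll the dynamics forward step by step and compare with the closed-form sum. The matrices, initial state, and random actions below are arbitrary illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])   # illustrative n_s x n_s matrix
B = np.array([[0.0], [1.0]])             # illustrative n_s x n_a matrix
s0 = np.array([1.0, 0.0])
T = 6
actions = rng.normal(size=(T, 1))        # a_0, ..., a_{T-1}

# Step-by-step rollout of s_{t+1} = A s_t + B a_t
s = s0.copy()
for t in range(T):
    s = A @ s + B @ actions[t]

# Closed form: s_T = A^T s_0 + sum_k A^k B a_{T-k-1}
mp = np.linalg.matrix_power
closed = mp(A, T) @ s0 + sum(mp(A, k) @ B @ actions[T - k - 1] for k in range(T))

print(np.allclose(s, closed))
```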

Linear Dynamics

  • Special case when dynamics \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
  • Consider linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$

  • The trajectories can be written as (PSet 3) $$ s_{t} = (A+BK)^t s_0 $$
  • The "internal" dynamics are modified according to \(B\) and \(K\)

Example: naive policy

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -\begin{bmatrix} 1 & 0\end{bmatrix} s_t\)

  • \(s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
  • \(s_{t+1}  = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix}  s_t \)
  • \(  = \begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix} s_t \)
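The closed-loop matrix above can be reproduced numerically, and its eigenvalues inspected, with a short sketch. The variable names are illustrative.

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])
K = np.array([[-1.0, 0.0]])      # naive policy a_t = -[1 0] s_t

A_cl = A + B @ K                 # closed-loop matrix A + BK
eigvals = np.linalg.eigvals(A_cl)
moduli = np.abs(eigvals)

print(A_cl)
print("eigenvalues:", eigvals)
print("moduli:", moduli)
```

The eigenvalues come out as a complex pair \(1 \pm i\), each with modulus \(\sqrt 2\).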

Example: optimal policy

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)

\(\pi_t^\star(s) = K^\star_t s= \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s\)

[Figure: the gains \(\gamma^\mathsf{pos}_t\) and \(\gamma^\mathsf{vel}_t\) plotted against \(t\) up to \(H\)]

Example: approx. optimal policy

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • \(\approx\) LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)

Consider \(\pi(s) = \begin{bmatrix} -\frac{1}{2} &-1 \end{bmatrix}s\)

  • \(s_{t+1}  = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-\frac{1}{2}& -1\end{bmatrix}  s_t \)
  • \(s_{t+1}  = \begin{bmatrix}1 & 1 \\ -\frac{1}{2} & 0\end{bmatrix} s_t \)

Example: comparison

Simulations demonstrate the difference between

$$ s_{t+1} = \begin{bmatrix}1 & 1 \\ -1 & 1\end{bmatrix} s_t \quad \text{vs.} \quad s_{t+1}  = \begin{bmatrix}1 & 1 \\ -\frac{1}{2} & 0\end{bmatrix} s_t$$

  • What is the difference?
  • What causes this difference?
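A minimal simulation sketch of the two closed-loop systems, tracking \(\|s_t\|\) over 20 steps. The helper `rollout_norms`, the initial state, and the number of steps are illustrative assumptions, not the lecture's own simulation.

```python
import numpy as np

def rollout_norms(M, s0, T):
    """Return [||s_0||, ..., ||s_T||] for s_{t+1} = M s_t."""
    s = np.asarray(s0, dtype=float)
    norms = [np.linalg.norm(s)]
    for _ in range(T):
        s = M @ s
        norms.append(np.linalg.norm(s))
    return np.array(norms)

M_naive = np.array([[1.0, 1.0], [-1.0, 1.0]])
M_approx = np.array([[1.0, 1.0], [-0.5, 0.0]])
s0 = [1.0, 0.0]

naive = rollout_norms(M_naive, s0, 20)
approx = rollout_norms(M_approx, s0, 20)
print("naive  ||s_20|| =", naive[-1])
print("approx ||s_20|| =", approx[-1])
```

Under the first matrix the state norm grows by a factor of \(\sqrt 2\) every step, while under the second it shrinks toward zero (not necessarily monotonically).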

Stability of Linear Dynamics

  • For the dynamics $$ s_{t+1} = As_{t} ,\quad s_0\neq 0, $$ the trajectory can behave in one of three ways:
    1. \(s_t\to 0\), which is called asymptotically stable
    2. \(\|s_t\|\to\infty\), which is called unstable
    3. something else (e.g. \(A=I\))
  • Since we know \(s_t = A^t s_0\), the stability is determined by the matrix \(A\)

Diagonalizable dynamics

  • Our goal is to understand what happens when we raise \(A\) to the \(t^{th}\) power like in \(s_t = A^t s_0\)
  • If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors

    • \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)

    • \(s_1 = A\sum_{i=1}^{n_s} \alpha_i v_i = \sum_{i=1}^{n_s} \alpha_i A v_i = \sum_{i=1}^{n_s} \alpha_i \lambda_i v_i\)

    • Claim: \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)

      • Exercise: write the proof by induction

  • Another perspective on why eigenvalues matter: \(A^t = (VDV^{-1})^t = VD^tV^{-1}\)
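Both perspectives can be checked numerically. The sketch below uses an arbitrary diagonalizable matrix with real, distinct eigenvalues. The matrix, \(s_0\), and \(t\) are illustrative values.

```python
import numpy as np

A = np.array([[0.9, 0.2], [0.1, 0.5]])   # illustrative diagonalizable matrix
t = 8

# A = V D V^{-1}, with the columns of V the eigenvectors v_i
lam, V = np.linalg.eig(A)
A_t_eig = V @ np.diag(lam ** t) @ np.linalg.inv(V)
A_t_direct = np.linalg.matrix_power(A, t)

# Expand s_0 in the eigenbasis: s_0 = sum_i alpha_i v_i
s0 = np.array([1.0, -2.0])
alpha = np.linalg.solve(V, s0)
s_t = V @ (alpha * lam ** t)              # sum_i alpha_i lambda_i^t v_i

print(np.allclose(A_t_eig, A_t_direct))
print(np.allclose(s_t, A_t_direct @ s0))
```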

Example: investing

You have investments in two companies.

Setting 1:  Each dollar of investment in company \(i\) leads to \(\lambda_i\) returns. The companies are independent.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)

  • Cases: \(0<\lambda_2<\lambda_1<1\), \(0<\lambda_2<1<\lambda_1\), and \(1<\lambda_2<\lambda_1\)

Example: investing

Setting 2:  The companies are interdependent: each dollar of investment in company \(i\) leads to \(\alpha\) return for company \(i\), but it also leads to \(\beta\) return (\(i=1\)) or loss (\(i=2\)) to the other company.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix} s_t \)

  • Cases: \(0<\alpha^2+\beta^2<1\) and \(1<\alpha^2+\beta^2\)
  • The matrix maps \(\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix}\): rotation by \(\arctan(\beta/\alpha)\) and scale by \(\sqrt{\alpha^2+\beta^2}\)
  • Eigenvalues: \(\lambda = \alpha \pm i \beta\)
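The rotate-and-scale reading of this matrix can be checked with a short sketch. The values of `alpha` and `beta` are illustrative, and `arctan2` is used so the angle agrees with \(\arctan(\beta/\alpha)\) when \(\alpha>0\).

```python
import numpy as np

alpha, beta = 0.6, 0.5                     # illustrative values
A = np.array([[alpha, -beta], [beta, alpha]])

r = np.hypot(alpha, beta)                  # scale sqrt(alpha^2 + beta^2)
theta = np.arctan2(beta, alpha)            # rotation angle
R_theta = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])

s0 = np.array([1.0, 0.0])
t = 10
s_t = np.linalg.matrix_power(A, t) @ s0

print(np.allclose(A, r * R_theta))                  # A = scale * rotation
print(np.isclose(np.linalg.norm(s_t), r ** t))      # norm scales by r each step
print(np.sort_complex(np.linalg.eigvals(A)))        # alpha -/+ i beta
```

Each step rotates the state by \(\theta\) and scales its norm by \(r\), so the trajectory spirals inward when \(r<1\) and outward when \(r>1\).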

Example: investing

Setting 3:  Each dollar of investment in company \(i\) leads to \(\lambda\) return for company \(i\), and \(2\) is a subsidiary of \(1\) who thus accumulates its returns as well.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} s_t \)

  • Cases: \(0<\lambda<1\) and \(1<\lambda\)

$$ \left(\begin{bmatrix} \lambda & \\  & \lambda\end{bmatrix} + \begin{bmatrix}  & 1\\  & \end{bmatrix} \right)^t =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\  & \lambda^t\end{bmatrix} $$
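One way to see where the \(t\lambda^{t-1}\) entry comes from, sketched with the notation \(N\) (introduced here, not in the slides) for the off-diagonal part: since \(N^2 = 0\) and \(\lambda I\) commutes with \(N\), the binomial expansion stops after two terms.

$$ (\lambda I + N)^t = \sum_{k=0}^{t}\binom{t}{k}\lambda^{t-k}N^k = \lambda^t I + t\lambda^{t-1}N, \qquad N = \begin{bmatrix} 0 & 1\\ 0 & 0\end{bmatrix} $$

For \(0<\lambda<1\) the factor \(t\lambda^{t-1}\) may grow at first but still tends to \(0\), while for \(\lambda>1\) every nonzero entry grows.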

Summary of 2D Examples

  • Example 1:  \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)

    • General case: diagonalizable, real eigenvalues

  • Example 2:  \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta  & \alpha\end{bmatrix} s_t  \)

    • General case: pair of complex eigenvalues \(\lambda = \alpha \pm i \beta\)

  • Example 3:  \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\  & \lambda\end{bmatrix} s_t  \)

    • General case: non-diagonalizable

Stability Theorem

Theorem: Let \(\{\lambda_i\}_{i=1}^{n_s}\subset \mathbb C\) be the eigenvalues of \(A\).
Then \(s_{t+1}=As_t\) is

  • asymptotically stable \(\iff \max_{i\in[n_s]}|\lambda_i|<1\)
  • unstable if \(\max_{i\in[n_s]}|\lambda_i|> 1\)
  • call \(\max_{i\in[n_s]}|\lambda_i|=1\) "marginally (un)stable"
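The theorem translates into a check on the largest eigenvalue modulus. The sketch below is one possible implementation. The function name `classify_stability` and the tolerance `tol` are assumptions, added because floating-point eigenvalues are only approximate.

```python
import numpy as np

def classify_stability(A, tol=1e-6):
    """Classify s_{t+1} = A s_t by the largest eigenvalue modulus of A."""
    rho = np.max(np.abs(np.linalg.eigvals(A)))
    if rho < 1 - tol:
        return "asymptotically stable"
    if rho > 1 + tol:
        return "unstable"
    # Largest modulus is (numerically) 1: the theorem alone does not decide.
    return "marginally (un)stable"

print(classify_stability(np.array([[1.0, 1.0], [-1.0, 1.0]])))
print(classify_stability(np.array([[1.0, 1.0], [-0.5, 0.0]])))
print(classify_stability(np.eye(2)))
```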

Marginally (un)stable

  • We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"

  • Consider independent investing example:  $$ s_{t} = \begin{bmatrix} 1  &0 \\0 & 1\end{bmatrix}^t s_0 $$
  • Consider UAV example (unstable): $$s_{t} = \begin{bmatrix} 1  & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\  & 1\end{bmatrix} s_0 $$
  • Depends on eigenvectors not just eigenvalues!
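The two marginal cases can be compared numerically. In this sketch the initial states and the step count are illustrative. Both matrices have every eigenvalue equal to \(1\), yet they behave differently.

```python
import numpy as np

mp = np.linalg.matrix_power
I2 = np.eye(2)                             # independent investing example
J = np.array([[1.0, 1.0], [0.0, 1.0]])     # UAV example with no thrust

s0 = np.array([1.0, 1.0])
t = 100

s_identity = mp(I2, t) @ s0                # stays at s_0
s_uav = mp(J, t) @ s0                      # first coordinate grows like t

# Starting exactly on the eigenvector [1, 0] of J, the state stays put.
s_on_eigvec = mp(J, t) @ np.array([1.0, 0.0])

print(s_identity, s_uav, s_on_eigvec)
```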

Recall: 2D Examples

Trajectory is determined by the eigenstructure of \(A\)

  • [Figures: eigenvalues plotted in \(\mathbb C\) with axes \(\mathcal R(\lambda)\) and \(\mathcal I(\lambda)\), next to trajectories in the \((s_1, s_2)\) plane]
  • Real eigenvalues: \(0<\lambda_2<\lambda_1<1\), \(0<\lambda_2<1<\lambda_1\), \(1<\lambda_2<\lambda_1\)
  • Complex pair \(\lambda = \alpha \pm i \beta\): \(0<\alpha^2+\beta^2<1\), \(1<\alpha^2+\beta^2\)
  • Repeated eigenvalue \(\lambda_1 = \lambda_2=\lambda\): \(0<\lambda<1\), \(\lambda>1\); depends on whether \(A\) is diagonalizable

Stability Theorem

Proof

  • If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)

    • We previously argued that \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)

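    • By the triangle inequality, \(\|s_t\| \leq \sum_{i=1}^{n_s}|\alpha_i| |\lambda_i|^t \|v_i\|\), so if all \(|\lambda_i|<1\) then \(s_t\to 0\) for every \(s_0\)

    • Conversely, taking \(s_0\) along an eigenvector \(v_i\) (or its real part when \(\lambda_i\) is complex) gives a trajectory whose norm scales like \(|\lambda_i|^t\): it does not converge to \(0\) if \(|\lambda_i|\geq 1\), and \(\|s_t\|\to\infty\) if \(|\lambda_i|>1\)

    • Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\), and if any \(|\lambda_i|>1\), there are initial states with \(\|s_t\|\to\infty\)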

  • Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form

Recap

  • PA 1 due tonight

  • Linear Dynamics
  • Stability

  • Next lecture: Locally Linear Control