CS 4/5789: Introduction to Reinforcement Learning

Lecture 10: Iterative LQR & Fundamental Limitations

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework
    • Grades released: regrades today-Friday
    • PSet due TONIGHT
    • PA due on 3/1 -- extension to 3/3
  • My office hours:
    • Tuesdays 10:30-11:30am in Gates 416A
      • cancelled 2/28 (February break)
    • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

Recap: LQR

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t\) and \(K_t\) in terms of \(A,B,Q,R\) and \(P_{t+1}\): $$K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A, \qquad P_t = Q + A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A$$
  • General form:  \( f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t\) and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t $$
  • General solution: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}) $$
  1. Approximate dynamics & costs
    • Linearize \(f\) as \(A_0,B_0,c_0\)
    • Approx \(c\) as quadratic with \(Q_0,R_0,M_0,q_0,r_0,v_0\)
  2. LQR policy: \(\pi^\star_t(s) = K_t s+ k_t\) where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(A_0,B_0,c_0, Q_0, R_0, M_0, q_0, r_0, v_0) $$
    • works as long as states and actions remain close to \(s_\star\) and \(a_\star\)

Recap: Local Control

minimize over \(\pi\)   \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)

Linearized Dynamics

  • Linearization of dynamics around \((s_0,a_0)\)
    • \( f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0) \)
    • \( =A_0s+B_0a+c_0 \)
  • where the matrices depend on \((s_0,a_0)\):
    • \(A_0 = \nabla_s f(s_0, a_0)^\top \)
    • \(B_0 = \nabla_a f(s_0, a_0)^\top \)
    • \(c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0 \)
  • Black box access: use finite differencing to compute
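As a concrete sketch, black-box linearization with central differences might look like the following in numpy (the helper name `linearize` and the step size `eps` are illustrative, not from the lecture):

```python
import numpy as np

def linearize(f, s0, a0, eps=1e-5):
    """Finite-difference linearization of black-box dynamics f.

    Returns (A0, B0, c0) with f(s, a) ~= A0 @ s + B0 @ a + c0 near (s0, a0).
    """
    s0, a0 = np.asarray(s0, float), np.asarray(a0, float)
    f0 = np.asarray(f(s0, a0), float)
    A0 = np.zeros((f0.size, s0.size))
    B0 = np.zeros((f0.size, a0.size))
    for i in range(s0.size):              # column i of A0 = df/ds_i
        d = np.zeros(s0.size); d[i] = eps
        A0[:, i] = (f(s0 + d, a0) - f(s0 - d, a0)) / (2 * eps)
    for j in range(a0.size):              # column j of B0 = df/da_j
        d = np.zeros(a0.size); d[j] = eps
        B0[:, j] = (f(s0, a0 + d) - f(s0, a0 - d)) / (2 * eps)
    c0 = f0 - A0 @ s0 - B0 @ a0           # offset so the model is exact at (s0, a0)
    return A0, B0, c0

# On the running UAV example, this recovers A0 ~= [[1,1],[0,1]], B0 ~= [[0],[1]] at the origin:
f = lambda s, a: np.array([s[0] + s[1], s[1] + np.exp(-(s[1]**2 + a[0]**2)) * a[0]])
A0, B0, c0 = linearize(f, np.zeros(2), np.zeros(1))
```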

Second-Order Approx. Costs

  • Approximate costs around \((s_0,a_0)\) $$ c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0)  + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0) $$
    • \( =s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0\)
  • Practical consideration:
    • Force \(Q_0,R_0\) to be positive definite by setting negative eigenvalues to 0 and adding regularization \(\lambda I\)
  • Black box access: use finite differencing to compute

Practical Consideration

[Figure: parabola]

For a symmetric matrix \(Q\in\mathbb R^{n\times n}\) the eigen-decomposition is $$Q = \sum_{i=1}^n v_iv_i^\top \sigma_i $$

To make this positive definite (for \(\lambda>0\)), we replace $$Q\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
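In numpy this fix is a few lines (the helper name is illustrative):

```python
import numpy as np

def make_positive_definite(Q, lam=1e-3):
    """Clip negative eigenvalues to zero and add lam * I, as on the slide."""
    Q = (Q + Q.T) / 2                      # symmetrize first
    sigma, V = np.linalg.eigh(Q)           # Q = V @ diag(sigma) @ V.T
    sigma = np.maximum(sigma, 0.0) + lam   # max{0, sigma_i} + lambda
    return V @ np.diag(sigma) @ V.T
```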


Recap: Example

  • Setting: hovering UAV over a target
    • \(s = [\mathsf{pos},\mathsf{vel}]\)
  • Action: imperfect thrust right/left
  • \(s_{t+1}=\begin{bmatrix}\mathsf{pos}_{t}+ \mathsf{vel}_{t} \\  \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t\end{bmatrix}\)
    • \(\approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\) near \((0,0)\)
  • \(c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2\)
    • \(\approx \mathsf{pos}^2 + \lambda a^2\) near \((0,0)\)


Recap: Example

  • Setting: hovering UAV over a target
  • Action: imperfect thrust right/left
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)

\(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s = \gamma^\mathsf{pos}_t (\mathsf{pos} - x) + \gamma^\mathsf{vel}_t \mathsf{vel} \)

[Figure: gains \(\gamma^\mathsf{pos}_t\) and \(\gamma^\mathsf{vel}_t\) plotted against \(t\) over the horizon \(H\); vertical axis marked at \(-1\)]

Recap: Example

  • Setting: hovering UAV over a target
  • Action: imperfect thrust right/left
  • Local control \(\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s \)

Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

Approximate with Trajectory

  • Rather than approximate around single point \((s_0,a_0)\)
    • local approximations for trajectory \(\tau=(s_t,a_t)_{t=0}^{H-1}\)
  • Leads to time-varying approximation of dynamics & costs
    • For each \(t\), linearize \(f\) around \((s_t,a_t)\): \(\{A_t,B_t,c_t\}_{t=0}^{H-1}\)
    • For each \(t\), approx \(c\) as quadratic: \(\{Q_t,R_t,M_t,q_t,r_t,v_t\}_{t=0}^{H-1}\)
  • But what trajectory should we use?

minimize over \(\pi\)   \(\displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t), ~~s_0\sim\mu_0\)

iLQR

  • Initialize \(\bar a_0^0,\dots, \bar a_{H-1}^0\) and \(\bar s_0^0\sim \mu_0\)
  • Generate initial trajectory \(\tau_0 = \{(\bar s_t^0, \bar a_t^0)\}_{t=0}^{H-1}\)
    • by \(\bar s^0_{t+1} =f(\bar s_t^0, \bar a_t^0)\) for \(t=0,\dots,H-1\)
  • For \(i=0,1,\dots\):
    • \(\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}=\)Approx\((f, c, \tau_i)\)
    • \(\{K^\star_t, k^\star_t\}_{t=0}^{H-1}=\)LQR\((\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1})\)
    • generate \(\tau_{i+1} = \{(\bar s_t^{i+1}, \bar a_t^{i+1})\}_{t=0}^{H-1}\)
      • by \(\bar s_{t+1}^{i+1} = f(\bar s_{t}^{i+1},\underbrace{ K^\star_t\bar s_{t}^{i+1} + k^\star_t}_{\bar a_t^{i+1}})\) for \(t=0,\dots,H-1\)

Linearize around a trajectory. What trajectory? Iterate!

[Figure: black lines: \(\tau_{i-1}\); red arrows: the trajectory if the linearization were exact; blue dashed lines: \(\tau_i\)]
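To make the loop concrete, here is a minimal numpy sketch of iLQR following the scheme above. It works in deviation coordinates \(\delta s = s - \bar s_t\), \(\delta a = a - \bar a_t\) (equivalent to the absolute-coordinate form on the slides, since the nominal trajectory satisfies the dynamics exactly), gets all derivatives by crude finite differencing, applies the eigenvalue fix from earlier, and omits the line search and trust-region safeguards a practical implementation would add. All names are illustrative:

```python
import numpy as np

def fd_grad(g, x, eps=1e-5):
    """Gradient of a scalar function g at x via central differences."""
    out = np.zeros(x.size)
    for i in range(x.size):
        d = np.zeros(x.size); d[i] = eps
        out[i] = (g(x + d) - g(x - d)) / (2 * eps)
    return out

def psd(X, lam=1e-6):
    """The fix from earlier: zero out negative eigenvalues, then add lam * I."""
    sig, V = np.linalg.eigh((X + X.T) / 2)
    return V @ np.diag(np.maximum(sig, 0.0) + lam) @ V.T

def ilqr(f, c, s0, a_init, iters=10):
    """iLQR sketch for black-box dynamics f(s, a) -> s' and cost c(s, a) -> float.

    Policies are affine in the deviation from the nominal trajectory:
    a = a_bar[t] + K[t] @ (s - s_bar[t]) + k[t].
    """
    n, m, H = s0.size, a_init[0].size, len(a_init)
    K = [np.zeros((m, n)) for _ in range(H)]
    k = [np.zeros(m) for _ in range(H)]
    s_bar, a_bar = [s0] * (H + 1), [np.asarray(a, float) for a in a_init]

    for _ in range(iters):
        # Forward pass: roll out the TRUE dynamics under the current policy
        s, states, actions = s0, [s0], []
        for t in range(H):
            a = a_bar[t] + K[t] @ (s - s_bar[t]) + k[t]
            actions.append(a)
            s = f(s, a)
            states.append(s)
        s_bar, a_bar = states, actions

        # Backward pass: approximate around the trajectory, then LQR recursion
        P, p = np.zeros((n, n)), np.zeros(n)          # terminal value is zero
        for t in reversed(range(H)):
            z = np.concatenate([s_bar[t], a_bar[t]])  # joint (state, action) point
            cz = lambda w: c(w[:n], w[n:])
            # Linearize f: Jacobian rows in the joint variable, split into A, B
            J = np.stack([fd_grad(lambda w, i=i: f(w[:n], w[n:])[i], z)
                          for i in range(n)])
            A, B = J[:, :n], J[:, n:]
            # Quadratize c: gradient and Hessian in the joint variable
            g = fd_grad(cz, z)
            Hz = np.stack([fd_grad(lambda w, i=i: fd_grad(cz, w)[i], z)
                           for i in range(n + m)])
            Qm, Rm = psd(Hz[:n, :n] / 2), psd(Hz[n:, n:] / 2)
            M, q, r = Hz[n:, :n], g[:n], g[n:]
            # One step of the affine-quadratic LQR backward recursion
            tQ, tR = Qm + A.T @ P @ A, Rm + B.T @ P @ B
            tM = M + 2 * B.T @ P @ A
            tq, tr = q + A.T @ p, r + B.T @ p
            Rinv = np.linalg.inv(tR)
            K[t], k[t] = -0.5 * Rinv @ tM, -0.5 * Rinv @ tr
            P = tQ - 0.25 * tM.T @ Rinv @ tM
            P, p = (P + P.T) / 2, tq - 0.5 * tM.T @ Rinv @ tr

    return s_bar, a_bar, K, k

# Hovering UAV from the recap, starting slightly off target:
f = lambda s, a: np.array([s[0] + s[1], s[1] + np.exp(-(s[1]**2 + a[0]**2)) * a[0]])
c = lambda s, a: (1 - np.exp(-s[0]**2)) + 0.5 * a[0]**2
s_traj, a_traj, K, k = ilqr(f, c, np.array([0.5, 0.0]), [np.zeros(1)] * 15)
```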

Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

PID Control

  • A type of policy which may not be optimal, but is often used in practice (especially for low-level stabilization)
  • Applicable when:
    • There is an observation \(o_t\in\mathbb R\) and desired setpoint \(o^\star_t\)
    • The action \(a_t\in\mathbb R\) is "correlated" with \(o_t\), i.e. positive actions tend to increase \(o_t\)
  • Actions are determined by errors \(e_t = o^\star_t - o_t\) $$a_t = K_P e_t + K_I \sum_{k=0}^t e_k + K_D (e_t-e_{t-1})$$

PID Control

  • Actions are determined by errors \(e_t = o^\star_t - o_t\) $$a_t = K_P e_t + K_I \sum_{k=0}^t e_k + K_D (e_t-e_{t-1})$$
  • Policy depends on history of errors \(e_{t},e_{t-1},\dots e_{0}\)
  • Tuning parameters is a heuristic process

[Figure: error \(e_t\) plotted against \(t\)]
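A minimal stateful implementation of this policy (the class name and gains are illustrative; practical details like integral anti-windup are omitted):

```python
class PIDController:
    """a_t = K_P * e_t + K_I * sum_{k<=t} e_k + K_D * (e_t - e_{t-1})."""

    def __init__(self, k_p, k_i, k_d):
        self.k_p, self.k_i, self.k_d = k_p, k_i, k_d
        self.err_sum = 0.0      # running sum of errors (integral term)
        self.prev_err = None    # previous error (derivative term)

    def act(self, obs, setpoint):
        err = setpoint - obs                                  # e_t = o*_t - o_t
        self.err_sum += err
        diff = 0.0 if self.prev_err is None else err - self.prev_err
        self.prev_err = err
        return self.k_p * err + self.k_i * self.err_sum + self.k_d * diff

pid = PIDController(k_p=1.0, k_i=0.1, k_d=0.5)   # gains found by heuristic tuning
a_t = pid.act(obs=0.8, setpoint=1.0)
```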

Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

Limitations to Control

  • How good can an optimal policy be?
  • Are there inherent properties of a system that limit performance?

Motivating examples

  1. Finite and Deterministic MDP with \(r(s,a) = 1\) if \(s=0\) and \(0\) otherwise
  2. Linear Dynamics with cost \(\|s_t\|_2^2\) $$s_{t+1} = \begin{bmatrix} 2 & 0 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t $$

[Figure: two-state MDP with states \(0\) and \(1\) and actions \(a\in\{\text{stay},\text{switch}\}\)]

Reachability

Definition:

  • A state \(s'\) is reachable from a state \(s\) if there exists a sequence of actions \(a_0,\dots,a_{T-1}\) for a finite \(T\) such that $$\mathbb P\{s_T=s'\mid s_0=s,a_0,\dots,a_{T-1}\}>0$$
  • An MDP is reachable (also called controllable) if every state is reachable from every other state

Discrete Reachability

Theorem: Given finite \(\mathcal S,\mathcal A\) and transition function \(P\), construct a directed graph with vertices \(\mathcal V=\mathcal S\) and an edge from \(s\) to \(s'\) if \(P(s'|s,a)>0\) for some \(a\in\mathcal A\).

  • Then the MDP is reachable if the graph is strongly connected, i.e. if there is a path from every vertex to every other vertex

Proof:

  • Since the graph is strongly connected, there exists a directed path from \(s\) to \(s'\) for any \(s\) and \(s'\)
    • Let \(T\) be its length
  • By construction, each edge along this path corresponds to at least one action \(a_i\) and some nonzero transition probability \(p_i\)
  • Taking those actions, \(\mathbb P\{s_T=s'\mid s_0=s,a_0,\dots,a_{T-1}\}\geq\prod_{i=0}^{T-1} p_i >0\)
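A small Python sketch of this graph test (the tensor layout `P[s][a][s2]` and the deterministic stay/switch transitions below are illustrative assumptions, not from the lecture):

```python
from collections import deque

def is_reachable(P):
    """Reachability check for a finite MDP with P[s][a][s2] = Pr(s2 | s, a).

    Builds the directed graph from the theorem and tests strong connectivity:
    strongly connected iff vertex 0 reaches every vertex in the graph and
    in its reverse.
    """
    n = len(P)
    adj = [set() for _ in range(n)]
    radj = [set() for _ in range(n)]
    for s in range(n):
        for a in range(len(P[s])):
            for s2 in range(n):
                if P[s][a][s2] > 0:
                    adj[s].add(s2)    # edge s -> s2 for some action
                    radj[s2].add(s)   # reversed edge

    def reaches_all(nbrs):
        seen, queue = {0}, deque([0])
        while queue:
            u = queue.popleft()
            for v in nbrs[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        return len(seen) == n

    return reaches_all(adj) and reaches_all(radj)

# One reading of the stay/switch example (deterministic transitions):
P = [[[1, 0], [0, 1]],   # state 0: stay -> 0, switch -> 1
     [[0, 1], [1, 0]]]   # state 1: stay -> 1, switch -> 0
print(is_reachable(P))   # True: every state reaches every other
```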

Linear Deterministic Reachability

Theorem: The linear dynamics \(s_{t+1}=As_t+Ba_t\) are controllable if the controllability matrix \(\mathcal C\) is full rank: $$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B & AB & A^2 B & \dots & A^{n_s-1}B\end{bmatrix}}_{\mathcal C}\Big) = n_s $$

For the example \(s_{t+1} = \begin{bmatrix} 2 & 0 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t\)

  • \(\mathcal C = \begin{bmatrix} 0 &0 \\ 1 & 1\end{bmatrix} \) is not full rank: the first coordinate evolves as \(2^t\) times its initial value no matter what actions are taken

Proof:

  • Recall that \(s_t = A^t s_0 + \sum_{k=0}^{t-1}A^{k}Ba_{t-k-1}\) (PSet)
  • Therefore, setting \(s_{n_s}=s'\) and \(s_0=s\), $$s' - A^{n_s} s = \begin{bmatrix}B & AB & \dots & A^{n_s-1}B \end{bmatrix} \begin{bmatrix}a_{n_s-1}\\ \vdots \\ a_0\end{bmatrix}$$
  • If \(\mathcal C\) is full rank, this system of linear equations has at least one solution \(a_0,\dots a_{n_s-1}\)
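In numpy, checking the theorem on motivating example 2 is immediate (the helper name `controllability_matrix` is illustrative):

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, ..., A^(n_s - 1) B] side by side."""
    n = A.shape[0]
    blocks, AkB = [], B
    for _ in range(n):
        blocks.append(AkB)
        AkB = A @ AkB
    return np.hstack(blocks)

# Motivating example 2: the first coordinate cannot be influenced by actions
A = np.array([[2., 0.], [0., 1.]])
B = np.array([[0.], [1.]])
C = controllability_matrix(A, B)     # [[0, 0], [1, 1]]
print(np.linalg.matrix_rank(C))      # 1 < n_s = 2, so not controllable
```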

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • \(s_{t+1}=\begin{bmatrix}1 & 1\\  0 & 1\end{bmatrix} s_t +\begin{bmatrix}0\\  1\end{bmatrix}a_t \)
  • Controllability matrix $$ \mathcal C = \begin{bmatrix}0&1 \\1& 1\end{bmatrix}$$


To get from \(s\) to \(s'\) we can simply take the actions:

  • \(\begin{bmatrix}a_1\\a_0\end{bmatrix} = \mathcal C^{-1}(s' - A^2 s)=\begin{bmatrix}-1&1 \\1& 0\end{bmatrix}\left(s' - \begin{bmatrix}1&2 \\0& 1\end{bmatrix} s\right)\) (stacked as \([a_1; a_0]\), matching the proof)
  • \(=\begin{bmatrix}-1&1 \\1& 0\end{bmatrix}\begin{bmatrix}\mathsf{pos}'-\mathsf{pos}-2\mathsf{vel} \\\mathsf{vel}'-\mathsf{vel}\end{bmatrix} = \begin{bmatrix}-\mathsf{pos}'+\mathsf{pos}+\mathsf{vel}+\mathsf{vel}' \\\mathsf{pos}'-\mathsf{pos}-2\mathsf{vel}\end{bmatrix} \)
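As a sanity check of this calculation, a short numpy sketch that solves for the two actions and verifies the two-step rollout (the start and goal states are arbitrary illustrative values):

```python
import numpy as np

# UAV example: reach s' from s in n_s = 2 steps
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
C = np.hstack([B, A @ B])                  # [[0, 1], [1, 1]], invertible

s = np.array([2., -1.])                    # illustrative start (pos, vel)
s_prime = np.array([0., 0.])               # goal: hover at the target
a1, a0 = np.linalg.solve(C, s_prime - A @ A @ s)   # stacked as [a_1; a_0]

# Verify: two steps of s_{t+1} = A s_t + B a_t land exactly on s'
s1 = A @ s + B[:, 0] * a0
s2 = A @ s1 + B[:, 0] * a1
assert np.allclose(s2, s_prime)
```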
Optimization vs Design

  • Our focus is mostly optimization rather than design
    • Design includes building a system and then modelling it as an MDP
  • In real applications, design is just as (if not more) important
    • If reachability is an issue, maybe we can add another actuator
    • If robustness is an issue, maybe we can tweak the cost or reward function


Recap

  • PSet due TONIGHT
  • PA due after Feb break

 

  • Iterative Nonlinear Control
  • PID Control
  • Reachability

 

  • Happy Feb break!
  • Next lecture: Model-Based RL
