CS 4/5789: Introduction to Reinforcement Learning

Lecture 7: Continuous Control and LQR

Prof. Sarah Dean

MW 2:55-4:10pm
255 Olin Hall

Reminders

  • Auditing (unofficial)
  • Homework
    • Problem Set 2 due tonight
    • Programming Assignment 1 due Wednesday
    • PSet 3, PA 2 released Wednesday
  • First exam is Monday 3/4 during lecture
    • If you have a conflict, post on Ed ASAP!

Agenda

1. Continuous Control

2. UAV Example

3. Linear Quadratic Regulator

Continuous MDP

  • So far, we have considered finitely many states and actions \(|\mathcal S| = S\) and \(|\mathcal A| = A\)
    • Tabular representation of functions
  • In applications like robotics, states and actions can take continuous values
    • e.g. position, velocity, force
    • \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes

Finite Horizon Optimal Control

  • Continuous states \(\mathcal S = \mathbb R^{n_s}\) and actions \(\mathcal A = \mathbb R^{n_a}\)
    • alternate terminology/notation (we won't use): states \(x\) and "inputs" \(u\)
  • Cost to be minimized (rather than reward to be maximized)
    • think of cost as "negative reward", or of reward as "negative cost"
    • potentially time-varying \(c=(c_0,\dots, c_{H-1}, c_H)\)
      • \(c_t:\mathcal S\times\mathcal A\to \mathbb R\) for \(t=0,\dots,H-1\)
      • final state cost \(c_H:\mathcal S\to \mathbb R\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

Finite Horizon Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

\(\displaystyle\min_{\pi}~~\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)

Not in Scope: Stochastic & Infinite Horizon

  • Non-deterministic dynamics are out of scope for this course, since they require background in continuous random variables
  • Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
  • Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad  \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
  • Though we won't study them, these settings are routine for LQR

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}\)

Agenda

1. Continuous Control

2. UAV Example

3. Linear Quadratic Regulator

Example

[Figure: UAV hovering over a target, with thrust action \(a_t\)]

  • Setting: hovering UAV over a target
    • cost: distance from target
  • Action: thrust right/left
  • Newton's second law
    • \(a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})\)
    • \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m}  a_t\)
  • Effect on position
    • \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
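A minimal numpy sketch of these discretized dynamics (assuming numpy; the names dt, m, and step and their values are illustrative, not from the lecture):

```python
import numpy as np

dt, m = 0.1, 1.0  # time step Delta and mass (example values)

def step(s, a):
    """One step of s_{t+1} = f(s_t, a_t) for s = [position, velocity]."""
    position, velocity = s
    new_position = position + dt * velocity     # position update
    new_velocity = velocity + (dt / m) * a      # Newton's second law, discretized
    return np.array([new_position, new_velocity])

s = np.array([1.0, 0.0])   # position 1, velocity 0
s = step(s, a=-1.0)        # one step of leftward thrust
```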

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
    • \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m}  a_t\)
    • \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)


  • \(\mathcal S = \mathbb R^2\), \(\mathcal A = \mathbb R\)
  • \(c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2\)
  • \(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

Q: How would you pick actions?

Simple policy?

  • Setting: hovering UAV over a target
  • \(c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2\)
  • \(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)
  • Guess: negate the position error (proportional feedback) $$\pi_t(s_t) = -(\mathsf{position}_t-\mathsf{target}) $$


  • Starting at \(s_0=[1, 0]\), let \(\Delta=m=1\), let \(\mathsf{target}=0\):
    • \(s_1 = [1, -1]\), \(s_2=[0, -2]\), \(s_3=[-2, -2]\), \(s_4=[-4, 0]\), \(s_5=[-4, 4]\), ...
  • Unstable!
  • The system remains unstable even with different gains \(\pi(s)=-\gamma(\mathsf{pos}-\mathsf{tar})\) (see the simulation sketch below)
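A minimal simulation sketch of the proportional policy above (assuming numpy, with the example values \(\Delta=m=1\) and \(\mathsf{target}=0\)):

```python
import numpy as np

A = np.array([[1.0, 1.0],   # position += velocity
              [0.0, 1.0]])  # velocity unchanged by A
B = np.array([0.0, 1.0])    # thrust only changes velocity

s = np.array([1.0, 0.0])    # s_0 = [position, velocity]
for t in range(6):
    a = -s[0]               # pi(s) = -(position - target), with target = 0
    s = A @ s + B * a       # s_{t+1} = A s_t + B a_t
    print(t + 1, s)         # magnitudes grow over time: unstable
```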

Discretization?

  • Could approximate continuous states/actions by discretizing with resolution \(\varepsilon\)
  • How many states/actions does this require?
    • Let \(B_s\) bound* the magnitude of the largest state and \(B_a\) the magnitude of the largest action
    • \((B_s/\varepsilon)^{n_s}\) states and \((B_a/\varepsilon)^{n_a}\) actions
  • *bounds depend on dynamics, horizon, initial state, etc (nontrivial!)
  • This is not a feasible approach in many cases!
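  • For a sense of scale (illustrative numbers): with \(B_s = 10\), \(\varepsilon = 0.01\), and \(n_s = 4\), discretizing the state space alone requires \((10/0.01)^4 = 10^{12}\) grid points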


Agenda

1. Continuous Control

2. UAV Example

3. Linear Quadratic Regulator

Linear Dynamics

  • The dynamics function \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
  • \(A\in\mathbb R^{n_s\times n_s}\) and \(B\in\mathbb R^{n_s\times n_a}\) are dynamics matrices
  • \(A\) describes the evolution of the state when there is no action (internal dynamics)
  • \(B\) describes the effects of actions

Quadratic Costs

  • Cost function \(c(s,a) = s^\top Q s + a^\top R a\)
  • \(Q\) and \(R\) are cost matrices, usually positive semi-definite
  • \(Q\) describes the penalty on states, \(R\) the penalty on actions

Important background on matrices:

  1. A matrix is symmetric if \(M=M^\top\)
  2. A symmetric matrix is positive semi-definite (PSD) if all its eigenvalues are greater than or equal to 0
  3. A symmetric matrix is positive definite if all its eigenvalues are strictly greater than 0
  4. All positive definite matrices are invertible
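A minimal sketch of such an eigenvalue check (assuming numpy; the helper name is_psd is illustrative):

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """True if the symmetric matrix M has all eigenvalues >= -tol."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

Q = np.array([[1.0, 0.0],
              [0.0, 0.0]])   # state-cost matrix from the UAV example below
print(is_psd(Q))             # True: eigenvalues are 1 and 0
```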

Example: Quadratic Costs

  • Recall setting: hovering UAV over a target $$f(s_t, a_t) = \underbrace{\begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}}_{A}s_t + \underbrace{\begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}}_{B}a_t,\quad c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target})^2+\lambda a_t^2$$
     
  • To write quadratic cost, redefine the state as \(\tilde s_t = \begin{bmatrix}\mathsf{position}_t-\mathsf{target}\\ \mathsf{velocity}_t\end{bmatrix}\)
    • Exercise: verify that we still have \(\tilde s_{t+1}=f(\tilde s_t, a_t)\) (a solution sketch follows this list)
  • Then we have that $$c_t(\tilde s_t, a_t) = (\tilde s_t[1])^2+\lambda a_t^2 = \tilde s_t^\top \underbrace{\begin{bmatrix}1&0\\ 0&0\end{bmatrix}}_{Q} \tilde s_t + \underbrace{\lambda}_{R}a_t^2$$
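Solution sketch for the exercise above: because the target is constant, \(\begin{bmatrix}\mathsf{target}\\ 0\end{bmatrix}\) is a fixed point of \(A\), so $$\tilde s_{t+1} = s_{t+1} - \begin{bmatrix}\mathsf{target}\\ 0\end{bmatrix} = As_t + Ba_t - A\begin{bmatrix}\mathsf{target}\\ 0\end{bmatrix} = A\tilde s_t + Ba_t = f(\tilde s_t, a_t)$$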


Linear Quadratic Regulator

Special case of optimal control problem with

  • Quadratic cost $$c_t(s,a) = s^\top Qs+ a^\top Ra,\quad c_H(s) = s^\top Qs$$
  • Linear dynamics $$s_{t+1} = As_t+ Ba_t$$

\(\displaystyle\min_{\pi}~~\sum_{t=0}^{H-1} \left(s_t^\top Qs_t +a_t^\top Ra_t\right)+s_H^\top Q s_H\)

s.t.   \(s_{t+1}=As_t+B a_t, ~~a_t=\pi_t(s_t)\)

DP for Optimal Control

Reformulating for optimal control (max vs min), our general purpose dynamic programming algorithm is:

  • Initialize \(V^\star_H(s) = c_H(s)\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = c_t(s,a)+V^\star_{t+1}(f(s,a))\) (dynamics are deterministic, so the expectation \(\mathbb E_{s'=f(s,a)}[V^\star_{t+1}(s')]\) is just an evaluation)
    • \(\pi_t^\star(s) = \arg\min_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)


LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \(\quad \min_{a} s^\top (Q+A^\top QA) s+a^\top (R+B^\top QB) a+2s^\top A^\top Q Ba\)
  • General minimization (for positive definite \(M\)): \(\arg\min_a c + a^\top M a + 2m^\top a\); here \(M = R+B^\top QB\), \(m = B^\top QAs\), \(c = s^\top (Q+A^\top QA)s\)
    • \(2Ma_\star + 2m = 0 \implies a_\star = -M^{-1} m\)
      • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • minimum is \(c-m^\top M^{-1} m\)
      • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

DP: \(V_t^\star (s) = \min_{a} c(s, a)+V_{t+1}^\star (f(s,a))\)


LQR via DP

  • \(V_H^\star(s) = s^\top Q s\)
  • \(t=H-1\): \(\quad \min_{a} s^\top Q s+a^\top Ra+ (As+Ba)^\top Q (As+Ba)\)
    • \( \pi_{H-1}^\star(s)=-(R+B^\top QB)^{-1}B^\top QAs\)
    • \(V_{H-1}^\star(s) = s^\top (Q+A^\top QA - A^\top QB(R+B^\top QB)^{-1}B^\top QA) s\)

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)

LQR via DP

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • Proof by induction:
    • \(t=H\) is the base case
    • the inductive step is very similar to the derivation above for \(t=H-1\)

LQR via DP

Theorem:  For \(t=0,\dots ,H-1\), the optimal value function is quadratic and the optimal policy is linear$$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined as \(P_{H} = Q\) and

  • \(P_t = Q+A^\top P_{t+1}A - A^\top P_{t+1}B(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)
  • \(K_t = -(R+B^\top P_{t+1}B)^{-1}B^\top P_{t+1}A\)

\(\pi^\star = (K_0,\dots,K_{H-1}) = \mathsf{LQR}(A,B,Q,R)\)
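A minimal numpy sketch of this backward recursion (the function name lqr and its signature are illustrative, not a library API):

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Return gains [K_0, ..., K_{H-1}] and cost matrices [P_0, ..., P_H]."""
    P = Q.copy()                   # P_H = Q
    Ks, Ps = [], [P]
    for _ in range(H):             # t = H-1, H-2, ..., 0
        M = R + B.T @ P @ B
        K = -np.linalg.solve(M, B.T @ P @ A)      # K_t = -(R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ A + A.T @ P @ B @ K     # P_t (Riccati update)
        Ks.append(K)
        Ps.append(P)
    Ks.reverse()                   # Ks[t] = K_t
    Ps.reverse()                   # Ps[t] = P_t
    return Ks, Ps
```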

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State: distance from target, velocity
  • LQR\(\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)\)


\(\pi_t^\star(s) = K^\star_t s= \begin{bmatrix}\gamma_t^\mathsf{pos} & \gamma_t^\mathsf{vel} \end{bmatrix}s\)
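As a usage sketch of the hypothetical lqr helper from the earlier slide (the horizon H = 20 is an assumed value for illustration):

```python
import numpy as np

A = np.array([[1.0, 1.0],
              [0.0, 1.0]])
B = np.array([[0.0],
              [1.0]])
Q = np.array([[1.0, 0.0],
              [0.0, 0.0]])
R = np.array([[0.5]])

Ks, Ps = lqr(A, B, Q, R, H=20)   # lqr as sketched after the theorem; H = 20 is assumed
print(Ks[0])                      # K_0, a 1x2 row of gains [gamma_0^pos, gamma_0^vel]
```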

[Plot: entries \(\gamma_t^\mathsf{pos}\) and \(\gamma_t^\mathsf{vel}\) of the optimal gain \(K^\star_t\) plotted against \(t\) from \(0\) to \(H\)]

Recap

  • PSet 2 due TONIGHT
  • PA 1 due Wednesday

 

  • Continuous Control
  • Linear Quadratic Regulator

 

  • Next lecture: Linear Dynamics & Stability
