CS 4/5789: Introduction to Reinforcement Learning

Lecture 7: Continuous Control

Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

Reminders

  • Homework this week
    • Problem Set 2 due TONIGHT
    • Programming Assignment 1 due Wednesday 2/15
    • Next PSet and PA released on Wednesday
  • My office hours:
    • Tuesdays 10:30-11:30am in Gates 416A
    • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

Markov Decision Process

  • \(\mathcal{S}, \mathcal{A}\) state and action spaces
    • finite size \(S\) and \(A\)
  • \(r\) reward function, \(P\) transition function (tabular representation \(SA\) and \(S^2A\))
  • discount factor \(0<\gamma<1\) or horizon \(H>0\)

Goal: achieve high cumulative reward

maximize over \(\pi\):   \(\displaystyle \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]\) or \(\displaystyle \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]\)

s.t.   \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, [H~\text{or}~\gamma]\}\)

Infinite Horizon: VI and PI

Policy Iteration

  • Initialize \(\pi_0:\mathcal S\to\mathcal A\)
  • For \(t=0,\dots,T-1\):
    • Policy Evaluation: compute \(V^{\pi_t}\)
    • Policy Improvement: compute \(\pi_{t+1}\)
  • Guarantees:
    1. Monotonic Improvement:
      \(V^{\pi_{t+1}} \geq V^{\pi_t}\)
    2. Convergence:
      \(\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty\)

Value Iteration

  • Initialize \(V_0\)
  • For \(t=0,\dots,T-1\):
    • Bellman Operator: \(V_{t+1}\)
  • Return \(\pi_T\), the greedy policy with respect to \(V_T\)
  • Guarantees:
    1. Iterate convergence:
      \(\| V_{t}- V^\star\|_\infty  \leq \gamma^t \|V_0-V^\star\|_\infty\)
    2. Suboptimality:
      \(V^\star(s) - V^{\pi_T}(s)  \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty\)


Finite Horizon: DP

Exactly compute the optimal policy

  • Initialize \(V^\star_H = 0\)
  • For \(t=H-1, H-2, ..., 0\):
    • \(Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)
    • \(\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)\)
    • \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
  • Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
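To make the backward recursion concrete, here is a minimal tabular sketch in Python/numpy (not from the lecture); the reward array `R`, transition array `P`, and horizon `H` are hypothetical inputs.

```python
import numpy as np

def finite_horizon_dp(R, P, H):
    """Backward DP for a finite-horizon tabular MDP.

    R: (S, A) rewards, P: (S, A, S) transition probabilities, H: horizon.
    Returns the optimal time-varying policy and the optimal value V*_0.
    """
    S, A = R.shape
    V = np.zeros(S)                    # V*_H = 0
    policy = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):
        Q = R + P @ V                  # Q*_t(s,a) = r(s,a) + E_{s'~P(s,a)}[V*_{t+1}(s')]
        policy[t] = Q.argmax(axis=1)   # pi*_t(s) = argmax_a Q*_t(s,a)
        V = Q.max(axis=1)              # V*_t(s) = Q*_t(s, pi*_t(s))
    return policy, V
```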

Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

Continuous MDP

  • So far, we have considered finitely many states and actions: \(|\mathcal S| = S\) and \(|\mathcal A| = A\)
    • Tabular representation of functions
  • In applications like robotics, states and actions can take continuous values
    • e.g. position, velocity, force
    • \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes

Finite Horizon Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
    • alternate terminology/notation (we won't use): states \(x\) and "inputs" \(u\)
  • Cost to be minimized (rather than reward to be maximized)
    • think of as "negative reward", or think of reward as "negative cost"
    • potentially time-varying \(c=(c_0,\dots, c_{H-1}, c_H)\)
      • \(c_t:\mathcal S\times\mathcal A\to \mathbb R\) for \(t=0,\dots,H-1\)
      • final state cost \(c_H:\mathcal S\to \mathbb R\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

Finite Horizon Optimal Control

  • Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
  • Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
  • Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
  • Finite horizon \(H\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)

minimize over \(\pi\):   \(\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)

s.t.   \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
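As a sanity check on the notation, here is a short sketch (under assumed inputs, not part of the slides) that evaluates a time-varying policy on a deterministic control problem; `f`, `costs`, and `policy` are user-supplied callables.

```python
def rollout_cost(s0, f, costs, policy, H):
    """Cumulative cost of policy = (pi_0, ..., pi_{H-1}) under s_{t+1} = f(s_t, a_t).

    costs = (c_0, ..., c_{H-1}, c_H), where c_H takes only the state.
    """
    s, total = s0, 0.0
    for t in range(H):
        a = policy[t](s)
        total += costs[t](s, a)    # stage cost c_t(s_t, a_t)
        s = f(s, a)                # deterministic transition
    return total + costs[H](s)     # final state cost c_H(s_H)
```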

Not in Scope: Stochastic & Infinite Horizon

  • Non-deterministic dynamics are out of scope for this course, since they require background in continuous random variables
  • Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
  • Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad  \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
  • Though we won't study them, these settings are routine for LQR (the topic of the next lecture)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}\)

Example

[Figure: UAV hovering above a target, with horizontal thrust \(a_t\)]

  • Setting: hovering UAV over a target
    • cost: distance from target
  • Action: thrust right/left
  • Newton's second law
    • \(a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})\)
    • \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m}  a_t\)
  • Effect on position
    • \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
    • \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m}  a_t\)
    • \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)


  • \(\mathcal S = \mathbb R^2\), \(\mathcal A = \mathbb R\)
  • \(c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target}_t)^2+\lambda a_t^2\)
  • \(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)

\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
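For concreteness, a minimal encoding of this example; the constants \(\Delta\), \(m\), \(\lambda\), and the target are placeholder values chosen only for illustration.

```python
import numpy as np

Delta, m, lam, target = 0.1, 1.0, 0.01, 2.0   # illustrative constants

A = np.array([[1.0, Delta],
              [0.0, 1.0]])
B = np.array([[0.0],
              [Delta / m]])

def f(s, a):
    """Dynamics: s = [position, velocity], a = scalar thrust."""
    return A @ s + B[:, 0] * a

def c(s, a):
    """Cost: squared distance from the target plus a control penalty."""
    return (s[0] - target) ** 2 + lam * a ** 2
```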

Discretization?

  • Could approximate continuous states/action by discretizing
  • How many states/actions does this require?
    • Let \(B_s\) bound* the magnitude of the largest state and \(B_a\) bound the magnitude of the largest action
    • With grid resolution \(\varepsilon\), this requires roughly \((B_s/\varepsilon)^{n_s}\) states and \((B_a/\varepsilon)^{n_a}\) actions
  • *bounds depend on dynamics, horizon, initial state, etc (nontrivial!)
  • This is not a feasible approach in many cases!
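A back-of-the-envelope computation of the grid sizes, with illustrative bounds and resolution (these numbers are assumptions, not from the slides):

```python
# e.g. n_s = 10 state dimensions bounded by B_s = 10, n_a = 3 action
# dimensions bounded by B_a = 1, and grid resolution eps = 0.01
B_s, n_s, B_a, n_a, eps = 10.0, 10, 1.0, 3, 0.01

num_states = (B_s / eps) ** n_s    # (B_s / eps)^{n_s} = 1e30 grid points
num_actions = (B_a / eps) ** n_a   # (B_a / eps)^{n_a} = 1e6 grid points
print(f"{num_states:.0e} states, {num_actions:.0e} actions")
```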


Agenda

1. Recap

2. Continuous Control

3. Linear Dynamics

Example

  • Setting: hovering UAV over a target
  • Action: thrust right/left
  • State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)


\(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)

Linear Dynamics

  • The dynamics function \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
  • \(A\in\mathbb R^{n_s\times n_s}\) and \(B\in\mathbb R^{n_s\times n_a}\) are dynamics matrices
  • \(A\) describes the evolution of the state when there is no action (internal dynamics)
  • \(B\) describes the effects of actions
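A generic open-loop rollout of these dynamics, used implicitly by the examples that follow (a sketch; `A`, `B`, and the action sequence are whatever the example specifies):

```python
import numpy as np

def rollout_linear(A, B, s0, actions):
    """Iterate s_{t+1} = A s_t + B a_t and return the trajectory s_0, ..., s_T."""
    states = [np.asarray(s0, dtype=float)]
    for a in actions:
        states.append(A @ states[-1] + B @ np.atleast_1d(a))
    return np.array(states)
```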

Example: investing

You have investments in two companies.

Setting 1:  Each dollar of investment in company \(i\) yields a return of \(\lambda_i\) in the next period. The companies are independent.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)

  • \(0<\lambda_2<\lambda_1<1\): both investments shrink to zero
  • \(0<\lambda_2<1<\lambda_1\): investment 1 grows while investment 2 shrinks
  • \(1<\lambda_2<\lambda_1\): both investments grow

Autonomous trajectories

  • Trajectories \(s_t=A^t s_0\) are determined by the eigen-decomposition of \(A\)
  • Ex: if \(s_0=v\) is an eigenvector of \(A\) (i.e. \(Av =\lambda v\))
    • \(s_{1} = As_0 = \lambda s_0\)

    • \(s_t = \lambda^t v\)

  • If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)

    • \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)

The effect of internal dynamics $$ s_{t+1} = As_t$$
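A quick numerical check of this formula, using an arbitrary diagonalizable matrix as the example:

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.1, 0.8]])              # arbitrary diagonalizable example
s0 = np.array([1.0, -1.0])
t = 20

direct = np.linalg.matrix_power(A, t) @ s0   # s_t = A^t s_0

lam, V = np.linalg.eig(A)                # columns of V are eigenvectors v_i
alpha = np.linalg.solve(V, s0)           # coefficients with s_0 = sum_i alpha_i v_i
via_eig = V @ (alpha * lam ** t)         # s_t = sum_i alpha_i lambda_i^t v_i

print(np.allclose(direct, via_eig))      # True
```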

Example: investing

Setting 2:  The companies are interdependent: each dollar of investment in company \(i\) yields an \(\alpha\) return for company \(i\); in addition, a dollar in company 1 yields a \(\beta\) return for company 2, while a dollar in company 2 causes a \(\beta\) loss for company 1.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix} s_t \)

  • The matrix maps \(\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix}\): a rotation by \(\arctan(\beta/\alpha)\) combined with a scaling by \(\sqrt{\alpha^2+\beta^2}\)
  • The eigenvalues are the complex pair \(\lambda = \alpha \pm i \beta\), with \(|\lambda| = \sqrt{\alpha^2+\beta^2}\)
  • \(0<\alpha^2+\beta^2<1\): trajectories spiral in toward the origin; \(1<\alpha^2+\beta^2\): trajectories spiral outward

Example: investing

Setting 3:  Each dollar of investment in company \(i\) yields a \(\lambda\) return for company \(i\); in addition, company 2 is a subsidiary of company 1, so company 1 also accumulates company 2's value.

  • \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} s_t \)

  • Regimes: \(0<\lambda<1\) (eventual decay) vs. \(1<\lambda\) (growth); in either case the off-diagonal term contributes a factor of \(t\):

$$ \left(\begin{bmatrix} \lambda & \\  & \lambda\end{bmatrix} + \begin{bmatrix}  & 1\\  & \end{bmatrix} \right)^t =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\  & \lambda^t\end{bmatrix} $$
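A numerical check of this closed form for the matrix power, with \(\lambda = 0.9\) and \(t = 15\) chosen arbitrarily:

```python
import numpy as np

lam, t = 0.9, 15
J = np.array([[lam, 1.0],
              [0.0, lam]])

closed_form = np.array([[lam ** t, t * lam ** (t - 1)],
                        [0.0,      lam ** t]])

print(np.allclose(np.linalg.matrix_power(J, t), closed_form))   # True
```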

Summary of 2D Examples

Example 1:  \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)
General case: diagonalizable, real eigenvalues (geometric \(=\) algebraic multiplicity)

Example 2:  \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta  & \alpha\end{bmatrix} s_t  \)
General case: pair of complex eigenvalues \(\lambda = \alpha \pm i \beta\)

Example 3:  \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\  & \lambda\end{bmatrix} s_t  \)
General case: non-diagonalizable (geometric \(<\) algebraic multiplicity)

Equilibria and Stability

  • An equilibrium state satisfies $$ s_{eq} = As_{eq} $$
    • \(s_{eq}=0\) is always an equilibrium
    • if there is an eigenvalue equal to 1, then for the associated eigenvector, \(Av=v\). Thus \(cv\) is an equilibrium for any scalar \(c\).
  • Broadly categorize the equilibrium \(s_{eq}=0\) as
    1. Asymptotically stable: \(s_t\to 0\) for every initial state \(s_0\)
    2. Unstable: \(\|s_t\|\to\infty\) for some initial state \(s_0\)
  • There are examples which are neither (e.g. \(A=I\))

Stability Theorem

Theorem: Let \(\{\lambda_i\}_{i=1}^n\subset \mathbb C\) be the eigenvalues of \(A\).
Then for \(s_{t+1}=As_t\), the equilibrium \(s_{eq}=0\) is

  • asymptotically stable \(\iff \max_{i\in[n]}|\lambda_i|<1\)
  • unstable if \(\max_{i\in[n]}|\lambda_i|> 1\)
  • call \(\max_{i\in[n]}|\lambda_i|=1\) "marginally (un)stable"
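A small helper (not from the lecture) that classifies the equilibrium by spectral radius; in the boundary case \(\max_i|\lambda_i|=1\) the eigenvalues alone do not determine the behavior, as the next slides show.

```python
import numpy as np

def classify_stability(A, tol=1e-9):
    """Classify s_eq = 0 for s_{t+1} = A s_t using the spectral radius of A."""
    rho = max(abs(np.linalg.eigvals(A)))
    if rho < 1 - tol:
        return "asymptotically stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally (un)stable"

print(classify_stability(np.array([[0.5, 0.2], [0.0, 0.9]])))   # asymptotically stable
print(classify_stability(np.array([[1.1, 0.0], [0.0, 0.5]])))   # unstable
```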


Stability Theorem

Proof

  • If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)

    • By definition, \(Av_i = \lambda_i v_i\)

    • Therefore, \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)

    • Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\); and if any \(|\lambda_i|>1\), then \(\|s_t\|\to\infty\) for every \(s_0\) with a nonzero component along the corresponding eigenvector

  • Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form

Marginally (un)stable

  • We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"

  • Consider the independent investing example (not unstable since \(\lambda_2<1\): trajectories remain bounded) $$ s_{t} = \begin{bmatrix} 1  &0 \\0 & \lambda_2 \end{bmatrix}^t s_0 $$
  • Consider the UAV example (unstable: position grows linearly whenever the initial velocity is nonzero; see the numerical comparison after this list) $$s_{t} = \begin{bmatrix} 1  & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\  & 1\end{bmatrix} s_0 $$
  • Depends on eigenvectors not just eigenvalues!
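A numerical comparison of the two examples (a sketch with \(\lambda_2 = 0.5\) and an arbitrary initial state): both matrices have spectral radius 1, but only the second diverges.

```python
import numpy as np

s0 = np.array([1.0, 1.0])
A_invest = np.array([[1.0, 0.0], [0.0, 0.5]])   # independent investing, lambda_2 = 0.5
A_uav = np.array([[1.0, 1.0], [0.0, 1.0]])      # UAV with no thrust

for t in (1, 10, 100):
    x = np.linalg.matrix_power(A_invest, t) @ s0
    y = np.linalg.matrix_power(A_uav, t) @ s0
    print(t, np.linalg.norm(x), np.linalg.norm(y))
# The first norm stays bounded (approaches 1); the second grows linearly in t.
```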

Controlled Trajectories

  • Full dynamics depend on actions $$ s_{t+1} = As_t+Ba_t $$

  • The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
  • The internal dynamics \(A\) determines the long term effects of actions
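A quick check that the closed form matches direct iteration, using the UAV matrices and a random action sequence (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
s0 = np.array([1.0, 0.0])
actions = rng.standard_normal(5)       # arbitrary a_0, ..., a_4

s = s0.copy()                          # direct iteration of s_{t+1} = A s_t + B a_t
for a in actions:
    s = A @ s + B * a

T = len(actions)                       # closed form: A^T s_0 + sum_k A^k B a_{T-k-1}
closed = np.linalg.matrix_power(A, T) @ s0
for k in range(T):
    closed = closed + np.linalg.matrix_power(A, k) @ (B * actions[T - k - 1])

print(np.allclose(s, closed))          # True
```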

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\} $$


  • \(s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0  \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}\)
  • for \(t> t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0  \\ 0 \end{bmatrix}+  \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}\)
  • so for \(1\leq t\leq t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}\), and for \(t> t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}\) (verified numerically in the sketch below)
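Simulating the maneuver confirms this (a sketch with illustrative values \(\mathsf{pos}_0 = 0\) and \(t_0 = 3\)):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([0.0, 1.0])
pos0, t0, T = 0.0, 3, 8               # illustrative initial position and thrust time

s = np.array([pos0, 0.0])
for t in range(T):
    a = 1.0 if t == 0 else (-1.0 if t == t0 else 0.0)
    s = A @ s + B * a
    print(t + 1, s)                   # position settles at pos0 + t0, velocity at 0
```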

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)


Linear Policy

  • Linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$

  • The trajectories can be written as $$ s_{t} = (A+BK)^t s_0 $$
  • The internal dynamics \(A\) are modified depending on \(B\) and \(K\)
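A closed-loop rollout under a linear policy (a sketch; the gain matrix `K` is whatever the example prescribes):

```python
import numpy as np

def closed_loop_trajectory(A, B, K, s0, T):
    """States of s_{t+1} = (A + B K) s_t, i.e. the linear policy a_t = K s_t."""
    M = A + B @ K
    return np.array([np.linalg.matrix_power(M, t) @ s0 for t in range(T + 1)])

# e.g. UAV with gain K = [[-1, 0]] (the policy a_t = -pos_t, target x = 0):
# closed_loop_trajectory(np.array([[1., 1.], [0., 1.]]), np.array([[0.], [1.]]),
#                        np.array([[-1., 0.]]), np.array([1., 0.]), 10)
```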

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)


  • \(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
  • \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
  • \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)

Example

  • Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
  • Thrust according to distance from target \(a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)\)


 

  • \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
  • \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
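Comparing the two feedback choices through their closed-loop eigenvalues (a quick check, not on the slides): position-only feedback gives \(|\lambda| = \sqrt 2\), so the error spirals outward, while position-plus-velocity feedback gives \(|\lambda| = 1\).

```python
import numpy as np

M_pos = np.array([[1.0, 1.0], [-1.0, 1.0]])     # a_t = -(pos_t - x)
M_posvel = np.array([[1.0, 1.0], [-1.0, 0.0]])  # a_t = -(pos_t + vel_t - x)

print(np.abs(np.linalg.eigvals(M_pos)))         # [1.414..., 1.414...] -> unstable
print(np.abs(np.linalg.eigvals(M_posvel)))      # [1., 1.]             -> marginal
```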

Recap

  • PSet 2 due TONIGHT
  • PA 1 due Wednesday

 

  • Continuous Control
  • Linear Dynamics

 

  • Next lecture: Linear Quadratic Regulator

Sp23 CS 4/5789: Lecture 7

By Sarah Dean
