CS 4/5789: Introduction to Reinforcement Learning
Lecture 7: Continuous Control
Prof. Sarah Dean
MW 2:45-4pm
255 Olin Hall
Reminders
- Homework this week
- Problem Set 2 due TONIGHT
- Programming Assignment 1 due Wednesday 2/15
- Next PSet and PA released on Wednesday
- My office hours:
- Tuesdays 10:30-11:30am in Gates 416A
- Wednesdays 4-4:50pm in Olin 255 (right after lecture)
Agenda
1. Recap
2. Continuous Control
3. Linear Dynamics
Markov Decision Process
- \(\mathcal{S}, \mathcal{A}\) state and action spaces
- finite size \(S\) and \(A\)
- \(r\) reward function, \(P\) transition function (tabular representations of size \(SA\) and \(S^2A\))
- discount factor \(0<\gamma<1\) or horizon \(H>0\)
Goal: achieve high cumulative reward
maximize over \(\pi\): \(\displaystyle \mathbb E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t)\right]\) or \(\displaystyle \mathbb E\left[\sum_{t=0}^{H-1} r(s_t, a_t)\right]\)
s.t. \(s_{t+1}\sim P(s_t, a_t), ~~a_t\sim \pi(s_t)\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, r, P, [H~\text{or}~\gamma]\}\)
Infinite Horizon: VI and PI
Policy Iteration
- Initialize \(\pi_0:\mathcal S\to\mathcal A\)
- For \(t=0,\dots,T-1\):
- Policy Evaluation: \(V^{\pi_t}\)
- Policy Improvement: \(\pi_{t+1}\)
Value Iteration
- Initialize \(V_0\)
- For \(t=0,\dots,T-1\):
- Bellman Operator: \(V_{t+1}\)
- Return greedy policy \(\displaystyle \pi_T(s) = \arg\max_a r(s,a)+\gamma\,\mathbb E_{s'\sim P(s,a)}[V_T(s')]\)
- Monotonic Improvement (PI): \(V^{\pi_{t+1}} \geq V^{\pi_t}\)
- Convergence (PI): \(\|V^{\pi_t} - V^\star\|_\infty \leq\gamma^t \|V^{\pi_0}-V^\star\|_\infty\)
- Iterate convergence (VI): \(\| V_{t}- V^\star\|_\infty \leq \gamma^t \|V_0-V^\star\|_\infty\)
- Suboptimality (VI): \(V^\star(s) - V^{\pi_T}(s) \leq \frac{2\gamma^{T+1}}{1-\gamma} \|V_0-V^\star\|_\infty\)
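For reference, a minimal sketch of tabular value iteration in Python; the array-based representation (rewards of shape \((S,A)\), transitions of shape \((S,A,S)\)) is an illustrative assumption, not the course's starter code:

```python
import numpy as np

def value_iteration(r, P, gamma, T):
    """Discounted value iteration on a tabular MDP.

    r: (S, A) reward array, P: (S, A, S) transition array,
    gamma: discount factor, T: number of iterations.
    """
    S, A = r.shape
    V = np.zeros(S)
    for _ in range(T):
        V = (r + gamma * P @ V).max(axis=1)    # apply the Bellman optimality operator
    pi = (r + gamma * P @ V).argmax(axis=1)    # greedy policy with respect to V_T
    return V, pi
```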
Finite Horizon: DP
Exactly compute the optimal policy
- Initialize \(V^\star_H = 0\)
- For \(t=H-1, H-2, ..., 0\):
- \(Q_t^\star(s,a) = r(s,a)+\mathbb E_{s'\sim P(s,a)}[V^\star_{t+1}(s')]\)
- \(\pi_t^\star(s) = \arg\max_a Q_t^\star(s,a)\)
- \(V^\star_{t}(s)=Q_t^\star(s,\pi_t^\star(s) )\)
- Return \(\pi^\star = (\pi^\star_0,\dots ,\pi^\star_{H-1})\)
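The backward recursion above translates directly into code. A minimal sketch, again assuming a tabular MDP stored as NumPy arrays (a hypothetical representation for illustration):

```python
import numpy as np

def finite_horizon_dp(r, P, H):
    """Exact finite-horizon DP: optimal values and policies for t = 0, ..., H-1.

    r: (S, A) reward array, P: (S, A, S) transition array, H: horizon.
    """
    S, A = r.shape
    V = np.zeros((H + 1, S))            # V*_H = 0
    pi = np.zeros((H, S), dtype=int)
    for t in range(H - 1, -1, -1):      # t = H-1, H-2, ..., 0
        Q = r + P @ V[t + 1]            # Q*_t(s,a) = r(s,a) + E_{s'~P(s,a)}[V*_{t+1}(s')]
        pi[t] = Q.argmax(axis=1)        # pi*_t(s) = argmax_a Q*_t(s,a)
        V[t] = Q.max(axis=1)            # V*_t(s) = Q*_t(s, pi*_t(s))
    return V, pi
```

Each backward step costs \(O(S^2A)\) for the expectations over next states, so the full recursion is \(O(HS^2A)\).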
Agenda
1. Recap
2. Continuous Control
3. Linear Dynamics
Continuous MDP
- So far, we have considered finitely many states and actions: \(|\mathcal S| = S\) and \(|\mathcal A| = A\)
- Tabular representation of functions
- In applications like robotics, states and actions can take continuous values
- e.g. position, velocity, force
- \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
- Historical terminology: "optimal control problem" originates from the use of these techniques to design control laws for regulating physical processes
Finite Horizon Optimal Control
- Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
- alternate terminology/notation (we won't use): states \(x\) and "inputs" \(u\)
- Cost to be minimized (rather than reward to be maximized)
- think of as "negative reward", or think of reward as "negative cost"
- potentially time-varying \(c=(c_0,\dots, c_{H-1}, c_H)\)
- \(c_t:\mathcal S\times\mathcal A\to \mathbb R\) for \(t=0,\dots,H-1\)
- final state cost \(c_H:\mathcal S\to \mathbb R\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
Finite Horizon Optimal Control
- Continuous \(\mathcal S = \mathbb R^{n_s}\) and \(\mathcal A = \mathbb R^{n_a}\)
- Cost to be minimized \(c=(c_0,\dots, c_{H-1}, c_H)\)
- Deterministic transitions described by dynamics function $$s_{t+1} = f(s_t, a_t)$$
- Finite horizon \(H\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
minimize over \(\pi\): \(\displaystyle\sum_{t=0}^{H-1} c_t(s_t, a_t)+c_H(s_H)\)
s.t. \(s_{t+1}=f(s_t, a_t), ~~a_t=\pi_t(s_t)\)
Not in Scope: Stochastic & Infinite Horizon
- Non-deterministic dynamics are out of scope for this course (they require a background in continuous random variables)
- Stochastic transitions described by dynamics function and independent "process noise" $$s_{t+1} = f(s_t, a_t, w_t), \quad w_t\overset{i.i.d.}{\sim} \mathcal D_w$$
- Infinite Horizon as either "discounted" or "average" $$\sum_{t=0}^\infty \gamma^t c_t\quad \text{or}\quad \lim_{T\to\infty}\frac{1}{T}\sum_{t=0}^{T-1} c_t$$
- Though we won't study them, these settings are routine for LQR (the topic of the next lecture)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c,( f,\mathcal D_w), [H,\gamma,\mathsf{avg}]\}\)
Example
- Setting: hovering UAV over a target
- cost: distance from target
- Action: thrust right/left
- Newton's second law
- \(a_t = \frac{m}{\Delta} (\mathsf{velocity}_{t+1}- \mathsf{velocity}_{t})\)
- \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t\)
- Effect on position
- \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)
- State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
- \(\mathsf{velocity}_{t+1}=\mathsf{velocity}_{t} + \frac{\Delta}{m} a_t\)
- \(\mathsf{position}_{t+1} = \mathsf{position}_{t}+\Delta \mathsf{velocity}_{t}\)
- \(\mathcal S = \mathbb R^2\), \(\mathcal A = \mathbb R\)
- \(c_t(s_t, a_t) = (\mathsf{position}_t-\mathsf{target}_t)^2+\lambda a_t^2\)
- \(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)
\(\mathcal M = \{\mathcal{S}, \mathcal{A}, c, f, H\}\)
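A minimal sketch of this UAV model in code; the constants (\(\Delta = 0.1\), \(m = 1\), \(\lambda = 0.01\), a fixed target at 1) are illustrative assumptions rather than values from the lecture:

```python
import numpy as np

dt, m, lam, target = 0.1, 1.0, 0.01, 1.0        # assumed illustrative constants

A = np.array([[1.0, dt],
              [0.0, 1.0]])                      # position += dt * velocity
B = np.array([0.0, dt / m])                     # thrust enters through velocity

def f(s, a):
    # dynamics: s_{t+1} = A s_t + B a_t
    return A @ s + B * a

def cost(s, a):
    # c_t(s, a) = (position - target)^2 + lambda * a^2
    return (s[0] - target) ** 2 + lam * a ** 2

s, total = np.array([0.0, 0.0]), 0.0            # start at rest at position 0
for t in range(20):
    a = 0.5                                     # arbitrary constant thrust
    total += cost(s, a)
    s = f(s, a)
print(s, total)
```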
Discretization?
- Could approximate continuous states/action by discretizing
- How many states/actions does this require?
- Let \(B_s\) bound* the magnitude of the largest state and \(B_a\) bound the magnitude of the largest action
- \((B_s/\varepsilon)^{n_s}\) for states and \((B_a/\varepsilon)^{n_a}\) for actions
- *bounds depend on dynamics, horizon, initial state, etc (nontrivial!)
- This is not a feasible approach in many cases!
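A back-of-the-envelope count shows why this blows up; the bounds, resolution, and dimensions below are made-up numbers for illustration:

```python
# Hypothetical numbers: state bound B_s = 10, action bound B_a = 1,
# resolution eps = 0.01, state dimension n_s = 6, action dimension n_a = 2.
B_s, B_a, eps, n_s, n_a = 10.0, 1.0, 0.01, 6, 2

n_states = (B_s / eps) ** n_s    # (10 / 0.01)^6 = 1e18 grid cells
n_actions = (B_a / eps) ** n_a   # (1 / 0.01)^2  = 1e4 grid cells
print(f"{n_states:.1e} states, {n_actions:.1e} actions")
```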
Agenda
1. Recap
2. Continuous Control
3. Linear Dynamics
Example
- Setting: hovering UAV over a target
- Action: thrust right/left
- State is \(s_t = \begin{bmatrix}\mathsf{position}_t\\ \mathsf{velocity}_t\end{bmatrix}\)
\(f(s_t, a_t) = \begin{bmatrix}1 & \Delta \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ \frac{\Delta}{m}\end{bmatrix}a_t\)
Linear Dynamics
- The dynamics function \(f\) has a linear form $$ s_{t+1} = As_t + Ba_t $$
- \(A\in\mathbb R^{n_s\times n_s}\) and \(B\in\mathbb R^{n_s\times n_a}\) are dynamics matrices
- \(A\) describes the evolution of the state when there is no action (internal dynamics)
- \(B\) describes the effects of actions
Example: investing
You have investments in two companies.
Setting 1: Each dollar of investment in company \(i\) returns \(\lambda_i\) dollars at the next time step. The companies are independent.
- \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \)
- Cases: \(0<\lambda_2<\lambda_1<1\) (both investments decay), \(0<\lambda_2<1<\lambda_1\) (one decays, one grows), or \(1<\lambda_2<\lambda_1\) (both grow)
Autonomous trajectories
- The effect of internal dynamics: $$ s_{t+1} = As_t$$
- Trajectories \(s_t=A^t s_0\) are determined by the eigen-decomposition of \(A\)
- Ex: if \(s_0=v\) is an eigenvector of \(A\) (i.e. \(Av =\lambda v\)), then
  - \(s_{1} = As_0 = \lambda s_0\)
  - \(s_t = \lambda^t v\)
- If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\), so
  - \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)
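A quick numerical check of this decomposition, assuming a diagonalizable \(A\) (the matrix and initial state below are arbitrary examples):

```python
import numpy as np

A = np.array([[0.9, 0.2],
              [0.0, 0.5]])                      # arbitrary diagonalizable example
s0 = np.array([1.0, -1.0])
t = 10

# Direct rollout: s_t = A^t s_0
s_direct = np.linalg.matrix_power(A, t) @ s0

# Eigen view: write s_0 = sum_i alpha_i v_i, then s_t = sum_i alpha_i lambda_i^t v_i
lams, eigvecs = np.linalg.eig(A)                # columns of eigvecs are the v_i
alpha = np.linalg.solve(eigvecs, s0)            # coefficients alpha_i
s_eig = eigvecs @ (alpha * lams ** t)

print(np.allclose(s_direct, s_eig))             # True
```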
Example: investing
Setting 2: The companies are interdependent: each dollar of investment in company \(i\) yields an \(\alpha\) return for company \(i\), and also yields a \(\beta\) gain (when \(i=1\)) or a \(\beta\) loss (when \(i=2\)) for the other company.
- \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta \\ \beta & \alpha \end{bmatrix} s_t \)
- Cases: \(0<\alpha^2+\beta^2<1\) or \(1<\alpha^2+\beta^2\)
- Geometrically, \(\begin{bmatrix}1\\0\end{bmatrix} \to \begin{bmatrix}\alpha\\ \beta\end{bmatrix}\): rotation by \(\arctan(\beta/\alpha)\) and scaling by \(\sqrt{\alpha^2+\beta^2}\)
- Eigenvalues: \(\lambda = \alpha \pm i \beta\)
Example: investing
Setting 3: Each dollar of investment in company \(i\) yields a \(\lambda\) return for company \(i\); company 2 is a subsidiary of company 1, so company 1 also accumulates company 2's returns.
- \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1 \\ 0 & \lambda \end{bmatrix} s_t \)
- Cases: \(0<\lambda<1\) or \(1<\lambda\)
- Matrix powers: $$ \left(\begin{bmatrix} \lambda & \\ & \lambda\end{bmatrix} + \begin{bmatrix} & 1\\ & \end{bmatrix} \right)^t =\begin{bmatrix} \lambda^t & t\lambda^{t-1}\\ & \lambda^t\end{bmatrix} $$
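The matrix-power formula can be verified numerically; \(\lambda = 0.9\) and \(t = 20\) below are arbitrary choices:

```python
import numpy as np

lam, t = 0.9, 20                              # arbitrary example values
A = np.array([[lam, 1.0],
              [0.0, lam]])                    # Jordan block

direct = np.linalg.matrix_power(A, t)
formula = np.array([[lam**t, t * lam**(t - 1)],
                    [0.0,    lam**t]])
print(np.allclose(direct, formula))           # True
```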
Summary of 2D Examples
- Example 1: \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda_1 & \\ & \lambda_2 \end{bmatrix} s_t \) (general case: diagonalizable with real eigenvalues, geometric \(=\) algebraic multiplicity)
- Example 2: \(\displaystyle s_{t+1} = \begin{bmatrix} \alpha & -\beta\\\beta & \alpha\end{bmatrix} s_t \) (general case: pair of complex eigenvalues \(\lambda = \alpha \pm i \beta\))
- Example 3: \(\displaystyle s_{t+1} = \begin{bmatrix} \lambda & 1\\ & \lambda\end{bmatrix} s_t \) (general case: non-diagonalizable, geometric \(<\) algebraic multiplicity)
Equilibria and Stability
- An equilibrium state satisfies $$ s_{eq} = As_{eq} $$
- \(s_{eq}=0\) is always an equilibrium
- if there is an eigenvalue equal to 1, then for the associated eigenvector, \(Av=v\). Thus \(cv\) is an equilibrium for any scalar \(c\).
- Broadly categorize as
- Asymptotically stable: \(s_t\to 0\)
- Unstable: \(\|s_t\|\to\infty\)
- There are examples which are neither (e.g. \(A=I\))
Stability Theorem
Theorem: Let \(\{\lambda_i\}_{i=1}^n\subset \mathbb C\) be the eigenvalues of \(A\).
Then for \(s_{t+1}=As_t\), the equilibrium \(s_{eq}=0\) is
- asymptotically stable \(\iff \max_{i\in[n]}|\lambda_i|<1\)
- unstable if \(\max_{i\in[n]}|\lambda_i|> 1\)
- call \(\max_{i\in[n]}|\lambda_i|=1\) "marginally (un)stable"
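In practice the condition is easy to check numerically by computing the spectral radius \(\max_i|\lambda_i|\); a minimal sketch (the example matrices are arbitrary):

```python
import numpy as np

def stability(A, tol=1e-9):
    """Classify the equilibrium s_eq = 0 of s_{t+1} = A s_t by spectral radius."""
    rho = np.abs(np.linalg.eigvals(A)).max()
    if rho < 1 - tol:
        return "asymptotically stable"
    if rho > 1 + tol:
        return "unstable"
    return "marginally (un)stable"

print(stability(np.array([[0.5, 0.0], [0.0, 0.9]])))   # asymptotically stable
print(stability(np.array([[1.0, 1.0], [0.0, 1.0]])))   # marginally (un)stable (but s_t grows!)
print(stability(np.array([[1.1, 0.0], [0.0, 0.2]])))   # unstable
```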
Stability Theorem
Proof
- If \(A\) is diagonalizable, then any \(s_0\) can be written as a linear combination of eigenvectors \(s_0 = \sum_{i=1}^{n_s} \alpha_i v_i\)
- By definition, \(Av_i = \lambda_i v_i\)
- Therefore, \(s_t = \sum_{i=1}^{n_s}\alpha_i \lambda_i^t v_i\)
- Thus \(s_t\to 0\) for every \(s_0\) if and only if all \(|\lambda_i|<1\); and if any \(|\lambda_i|>1\), then \(\|s_t\|\to\infty\) for some \(s_0\)
- Proof in the non-diagonalizable case is out of scope, but it follows using the Jordan Normal Form
Marginally (un)stable
- We call \(\max_i|\lambda_i|=1\) "marginally (un)stable"
- Consider the independent investing example with \(\lambda_1=1\) (not unstable, since \(\lambda_2<1\)): $$ s_{t} = \begin{bmatrix} 1 &0 \\0 & \lambda_2 \end{bmatrix}^t s_0 $$
- Consider UAV example: (unstable)$$s_{t} = \begin{bmatrix} 1 & 1 \\0 & 1 \end{bmatrix}^t s_0 =\begin{bmatrix} 1 & t\\ & 1\end{bmatrix} s_0 $$
- Depends on eigenvectors not just eigenvalues!
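A tiny numerical illustration of this point (\(\lambda_2 = 0.5\) and the initial state below are arbitrary): both matrices have spectral radius 1, yet one trajectory stays bounded while the other grows linearly.

```python
import numpy as np

A_invest = np.array([[1.0, 0.0], [0.0, 0.5]])   # marginal: trajectory stays bounded
A_uav    = np.array([[1.0, 1.0], [0.0, 1.0]])   # marginal: ||s_t|| grows linearly
s0 = np.array([1.0, 1.0])

for A in (A_invest, A_uav):
    s50 = np.linalg.matrix_power(A, 50) @ s0
    rho = np.abs(np.linalg.eigvals(A)).max()
    print(f"spectral radius {rho:.1f}, ||s_50|| = {np.linalg.norm(s50):.1f}")
```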
Controlled Trajectories
- Full dynamics depend on actions: $$ s_{t+1} = As_t+Ba_t $$
- The trajectories can be written as (PSet 3) $$ s_{t} = A^t s_0 + \sum_{k=0}^{t-1}A^k Ba_{t-k-1} $$
- The internal dynamics \(A\) determine the long-term effects of actions (see the sketch below)
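As a sanity check (separate from the PSet 3 derivation), the closed-form expression can be compared against a step-by-step rollout; the matrices and action sequence below are arbitrary examples:

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])        # UAV-style dynamics as an example
B = np.array([0.0, 1.0])
s0 = np.array([2.0, 0.0])
actions = [1.0, 0.0, -0.5, 0.3, 0.0]          # arbitrary action sequence
t = len(actions)

# Step-by-step rollout of s_{k+1} = A s_k + B a_k
s = s0.copy()
for a in actions:
    s = A @ s + B * a

# Closed form: s_t = A^t s_0 + sum_{k=0}^{t-1} A^k B a_{t-k-1}
s_closed = np.linalg.matrix_power(A, t) @ s0
for k in range(t):
    s_closed += np.linalg.matrix_power(A, k) @ B * actions[t - k - 1]

print(np.allclose(s, s_closed))               # True
```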
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Initially at rest, then one rightward thrust followed by one leftward thrust $$a_0=1,\quad a_{t_0}=-1,\quad a_k=0~~k\notin\{0,t_0\} $$
- \(s_{t} = \displaystyle \begin{bmatrix}1 & t \\ 0 & 1\end{bmatrix}\begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \sum_{k=0}^{t-1} \begin{bmatrix}1 & k\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}a_{t-k-1}\)
- for \(t>t_0\): \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0 \\ 0 \end{bmatrix}+ \begin{bmatrix}1 & t-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}- \begin{bmatrix}1 & t-t_0-1\\ 0 & 1\end{bmatrix} \begin{bmatrix}0\\ 1\end{bmatrix}\)
- overall: for \(1\le t\leq t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t-1 \\ 1 \end{bmatrix}\) and for \(t> t_0\), \(s_{t} = \displaystyle \begin{bmatrix}\mathsf{pos}_0+ t_0 \\ 0 \end{bmatrix}\)
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)
Linear Policy
- Linear policy defined by \(a_t=Ks_t\): $$ s_{t+1} = As_t+BKs_t = (A+BK)s_t$$
- The trajectories can be written as $$ s_{t} = (A+BK)^t s_0 $$
- The internal dynamics \(A\) are modified depending on \(B\) and \(K\)
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Thrust according to distance from target \(a_t = -(\mathsf{pos}_t- x)\)
- \(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}a_t\)
- \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& 0\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
- \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1& 1\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
Example
- Setting: hovering UAV over a target $$s_{t+1} = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
- Thrust according to distance from target \(a_t = -(\mathsf{pos}_t+\mathsf{vel}_t- x)\)
- \(\left(s_{t+1} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}\left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right) + \begin{bmatrix}0\\ 1\end{bmatrix}\begin{bmatrix}-1& -1\end{bmatrix} \left(s_t -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
- \(\left(s_{t} - \begin{bmatrix}x\\ 0\end{bmatrix}\right) = \begin{bmatrix}1 & 1 \\ -1 & 0\end{bmatrix}^t\left(s_0 -\begin{bmatrix}x\\ 0\end{bmatrix}\right)\)
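The difference between these two feedback laws can be summarized by the spectral radius of the closed-loop matrix \(A+BK\); a short sketch comparing them (the printed radii, \(\sqrt 2\) and \(1\), explain why the first law diverges from the target while the second stays bounded):

```python
import numpy as np

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.array([[0.0], [1.0]])

K1 = np.array([[-1.0, 0.0]])    # a_t = -(pos_t - x)
K2 = np.array([[-1.0, -1.0]])   # a_t = -(pos_t + vel_t - x)

for name, K in [("K1", K1), ("K2", K2)]:
    rho = np.abs(np.linalg.eigvals(A + B @ K)).max()
    print(name, "spectral radius of A+BK:", rho)
# K1: sqrt(2) > 1, so the error from the target diverges;
# K2: exactly 1 with distinct complex eigenvalues, so the error stays bounded.
```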
Recap
- PSet 2 due TONIGHT
- PA 1 due Wednesday
- Continuous Control
- Linear Dynamics
- Next lecture: Linear Quadratic Regulator
Sp23 CS 4/5789: Lecture 7
By Sarah Dean