Prof. Sarah Dean

MW 2:45-4pm
255 Olin Hall

## Reminders

• Homework
  • PSet due TONIGHT
  • PA due on 3/1 -- extension to 3/3
• My office hours:
  • Tuesdays 10:30-11:30am in Gates 416A
    • cancelled 2/28 (February break)
  • Wednesdays 4-4:50pm in Olin 255 (right after lecture)

## Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

## Recap: LQR

Theorem: For $$t=0,\dots,H-1$$, the optimal value function is quadratic and the optimal policy is linear: $$V^\star_t (s) = s^\top P_t s \quad\text{ and }\quad \pi_t^\star(s) = K_t s$$

where the matrices are defined recursively, starting from $$P_{H} = Q$$:

• $$P_t$$ and $$K_t$$ are given in terms of $$A,B,Q,R$$ and $$P_{t+1}$$
• General form:  $$f_t(s_t,a_t) = A_ts_t + B_t a_t +c_t$$ and $$c_t(s,a) = s^\top Q_ts+a^\top R_ta+a^\top M_ts + q_t^\top s + r_t^\top a+ v_t$$
• General solution: $$\pi^\star_t(s) = K_t s+ k_t$$ where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(\{A_t,B_t,c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1})$$
1. Approximate dynamics & costs
• Linearize $$f$$ as $$A_0,B_0,c_0$$
• Approx $$c$$ as quadratic with $$Q_0,R_0,M_0,q_0,r_0,v_0$$
2. LQR policy: $$\pi^\star_t(s) = K_t s+ k_t$$ where $$\{K_t,k_t\}_{t=0}^{H-1} = \mathsf{LQR}(A_0,B_0,c_0, Q_0, R_0, M_0, q_0, r_0, v_0)$$
• works as long as states and actions remain close to $$s_\star$$ and $$a_\star$$
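The backward recursion behind $$\mathsf{LQR}$$ can be sketched for the basic case (cost $$s^\top Qs + a^\top Ra$$, no affine terms, terminal $$P_H=Q$$); the function name and interface here are illustrative, not the course's reference implementation:

```python
import numpy as np

def lqr(A, B, Q, R, H):
    """Backward recursion for finite-horizon LQR with cost sum_t s'Qs + a'Ra.

    Returns gains K_0, ..., K_{H-1} (so pi_t(s) = K_t s) and P_0, using P_H = Q.
    """
    P = Q
    gains = []
    for _ in range(H):
        # Minimizing a'(R + B'PB)a + 2a'B'P A s over a gives
        # K = -(R + B'PB)^{-1} B'P A
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        gains.append(-G)
        # Riccati update: P <- Q + A'PA - A'PB (R + B'PB)^{-1} B'PA
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return gains[::-1], P
```

For instance, the hovering-UAV recap example corresponds to `lqr` with $$A=\begin{bmatrix}1&1\\0&1\end{bmatrix}$$, $$B=\begin{bmatrix}0\\1\end{bmatrix}$$, $$Q=\begin{bmatrix}1&0\\0&0\end{bmatrix}$$, $$R=\tfrac12$$.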

## Recap: Local Control

$$\min_{\pi}\quad \displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\qquad\text{s.t.}\quad s_{t+1}=f(s_t, a_t),\ \ a_t=\pi_t(s_t)$$

## Linearized Dynamics

• Linearization of dynamics around $$(s_0,a_0)$$
• $$f(s,a) \approx f(s_0, a_0) + \nabla_s f(s_0, a_0)^\top (s-s_0) + \nabla_a f(s_0, a_0)^\top (a-a_0)$$
• $$=A_0s+B_0a+c_0$$
• where the matrices depend on $$(s_0,a_0)$$:
• $$A_0 = \nabla_s f(s_0, a_0)^\top$$
• $$B_0 = \nabla_a f(s_0, a_0)^\top$$
• $$c_0 = f(s_0, a_0) - \nabla_s f(s_0, a_0)^\top s_0 - \nabla_a f(s_0, a_0)^\top a_0$$
• Black box access: use finite differencing to compute
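With only black-box access to $$f$$, the matrices above can be estimated by central finite differences. A minimal sketch for the scalar-action case (the step size `eps` is an illustrative choice), using the hovering-UAV dynamics from the recap example as the test function:

```python
import numpy as np

def uav_f(s, a):
    # Hovering-UAV dynamics from the recap example: s = [pos, vel], scalar thrust a
    pos, vel = s
    return np.array([pos + vel, vel + np.exp(-(vel**2 + a**2)) * a])

def linearize_dynamics(f, s0, a0, eps=1e-5):
    """Central finite differences: f(s, a) ~ A s + B a + c near (s0, a0),
    with A = grad_s f(s0, a0)^T and B = grad_a f(s0, a0)^T."""
    n = s0.shape[0]
    A = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n)
        d[i] = eps
        A[:, i] = (f(s0 + d, a0) - f(s0 - d, a0)) / (2 * eps)
    B = ((f(s0, a0 + eps) - f(s0, a0 - eps)) / (2 * eps)).reshape(n, 1)
    c = f(s0, a0) - A @ s0 - B[:, 0] * a0
    return A, B, c
```

At $$(s_0,a_0)=(0,0)$$ this recovers the linearization $$\begin{bmatrix}1&1\\0&1\end{bmatrix}s+\begin{bmatrix}0\\1\end{bmatrix}a$$ from the example.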

## Second-Order Approx. Costs

• Approximate costs around $$(s_0,a_0)$$ $$c(s,a) \approx c(s_0, a_0) + \nabla_s c(s_0, a_0)^\top (s-s_0) + \nabla_a c(s_0, a_0)^\top (a-a_0) + \\ \frac{1}{2} (s-s_0) ^\top \nabla^2_s c(s_0, a_0)(s-s_0) + \frac{1}{2} (a-a_0) ^\top \nabla^2_a c(s_0, a_0)(a-a_0) \\+ (a-a_0) ^\top \nabla_{as}^2 c(s_0, a_0)(s-s_0)$$
• $$=s^\top Q_0s+a^\top R_0a+a^\top M_0s + q_0^\top s + r_0^\top a+ v_0$$
• Practical consideration:
• Force $$Q_0,R_0$$ to be positive definite by setting negative eigenvalues to 0 and adding regularization $$\lambda I$$
• Black box access: use finite differencing to compute

For a symmetric matrix $$Q\in\mathbb R^{n\times n}$$ the eigen-decomposition is $$Q = \sum_{i=1}^n v_iv_i^\top \sigma_i$$

To make this positive definite (for $$\lambda>0$$), we replace $$Q\leftarrow \sum_{i=1}^n v_iv_i^\top (\max\{0,\sigma_i\} +\lambda)$$
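The eigenvalue clipping can be sketched as (the function name and default $$\lambda$$ are illustrative):

```python
import numpy as np

def project_pd(Q, lam=1e-3):
    """Replace Q's negative eigenvalues with 0 and add lam in the eigenbasis,
    i.e. Q <- sum_i (max(0, sigma_i) + lam) v_i v_i^T."""
    sigma, V = np.linalg.eigh(Q)          # eigendecomposition of symmetric Q
    return (V * (np.maximum(sigma, 0.0) + lam)) @ V.T
```

Since $$\max\{0,\sigma_i\}+\lambda \geq \lambda > 0$$, the result is positive definite, not merely positive semidefinite.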

## Recap: Example

• Setting: hovering UAV over a target
• $$s = [\mathsf{pos},\mathsf{vel}]$$
• Action: imperfect thrust right/left
• $$s_{t+1}=\begin{bmatrix}\mathsf{pos}_{t}+ \mathsf{vel}_{t} \\ \mathsf{vel}_{t} + e^{- (\mathsf{vel}_t^2+a_t^2)} a_t\end{bmatrix}$$
• $$\approx \begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix}s_t + \begin{bmatrix}0\\ 1\end{bmatrix}a_t$$ near $$(0,0)$$
• $$c(s,a) =(1-e^{-\mathsf{pos}^2}) +\lambda a^2$$
• $$\approx \mathsf{pos}^2 + \lambda a^2$$ near $$(0,0)$$


## Recap: Example

• Setting: hovering UAV over a target
• Action: imperfect thrust right/left
• LQR$$\left(\begin{bmatrix}1 & 1 \\ 0 & 1\end{bmatrix},\begin{bmatrix}0\\ 1\end{bmatrix},\begin{bmatrix}1&0\\ 0&0\end{bmatrix},\frac{1}{2}\right)$$

$$\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s = \gamma^\mathsf{pos}_t (\mathsf{pos} - x) + \gamma^\mathsf{vel}_t \mathsf{vel}$$

*(Figure: gains $$\gamma^\mathsf{pos}_t$$ and $$\gamma^\mathsf{vel}_t$$ plotted over $$t$$ from $$0$$ to $$H$$.)*

## Recap: Example

• Setting: hovering UAV over a target
• Action: imperfect thrust right/left
• Local control $$\pi_t^\star(s) = \begin{bmatrix}{ \gamma^\mathsf{pos}_t }& {\gamma_t^\mathsf{vel}} \end{bmatrix}s$$


## Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

## Approximate with Trajectory

• Rather than approximating around a single point $$(s_0,a_0)$$,
• use local approximations along a trajectory $$\tau=(s_t,a_t)_{t=0}^{H-1}$$
• Leads to time-varying approximation of dynamics & costs
• For each $$t$$, linearize $$f$$ around $$(s_t,a_t)$$: $$\{A_t,B_t,c_t\}_{t=0}^{H-1}$$
• For each $$t$$, approx $$c$$ as quadratic: $$\{Q_t,R_t,M_t,q_t,r_t,v_t\}_{t=0}^{H-1}$$
• But what trajectory should we use?

$$\min_{\pi}\quad \displaystyle\sum_{t=0}^{H-1} c(s_t, a_t)\qquad\text{s.t.}\quad s_{t+1}=f(s_t, a_t),\ \ a_t=\pi_t(s_t),\ \ s_0\sim\mu_0$$

## iLQR

• Initialize $$\bar a_0^0,\dots,\bar a_{H-1}^0$$ and $$\bar s_0^0\sim \mu_0$$
• Generate initial trajectory $$\tau_0 = \{(\bar s_t^0, \bar a_t^0)\}_{t=0}^{H-1}$$
  • by $$\bar s^0_{t+1} =f(\bar s_t^0, \bar a_t^0)$$ for $$t=0,\dots,H-1$$
• For $$i=0,1,\dots$$:
  • $$\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1}=$$Approx$$(f, c, \tau_i)$$
  • $$\{K^\star_t, k^\star_t\}_{t=0}^{H-1}=$$LQR$$(\{A_t, B_t, c_t, Q_t, R_t, M_t, q_t, r_t, v_t\}_{t=0}^{H-1})$$
  • generate $$\tau_{i+1} = \{(\bar s_t^{i+1}, \bar a_t^{i+1})\}_{t=0}^{H-1}$$
    • by $$\bar s_{t+1}^{i+1} = f(\bar s_{t}^{i+1},\underbrace{ K^\star_t\bar s_{t}^{i+1} + k^\star_t}_{\bar a_t^{i+1}})$$ for $$t=0,\dots,H-1$$

Linearize around a trajectory. What trajectory? Iterate!

Black lines: $$\tau_{i-1}$$; red arrows: the trajectory that would result if the linearization were exact; blue dashed lines: $$\tau_i$$
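The forward pass that generates $$\tau_{i+1}$$ from the new gains can be sketched as follows (names are illustrative; `Ks`, `ks` are the time-varying gains returned by the LQR step):

```python
import numpy as np

def forward_pass(f, Ks, ks, s0):
    """Roll out the time-varying affine policy a_t = K_t s_t + k_t through
    dynamics f, returning [(s_0, a_0), ..., (s_{H-1}, a_{H-1})]."""
    traj, s = [], s0
    for K, k in zip(Ks, ks):
        a = K @ s + k          # affine policy from the LQR backward pass
        traj.append((s, a))
        s = f(s, a)            # step the true (nonlinear) dynamics
    return traj
```

Note that the rollout uses the true dynamics $$f$$, not the linearization — this is what makes the next approximation point $$\tau_{i+1}$$ consistent with the real system.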

## Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

## PID Control

• A type of policy which may not be optimal, but is often used in practice (especially for low-level stabilization)
• Applicable when:
• There is an observation $$o_t\in\mathbb R$$ and desired setpoint $$o^\star_t$$
• The action $$a_t\in\mathbb R$$ is "correlated" with $$o_t$$, i.e. positive actions tend to increase $$o_t$$
• Actions are determined by errors $$e_t = o^\star_t - o_t$$ $$a_t = K_P e_t + K_I \sum_{k=0}^t e_k + K_D (e_t-e_{t-1})$$

## PID Control

• Actions are determined by errors $$e_t = o^\star_t - o_t$$ $$a_t = K_P e_t + K_I \sum_{k=0}^t e_k + K_D (e_t-e_{t-1})$$
• Policy depends on history of errors $$e_{t},e_{t-1},\dots e_{0}$$
• Tuning parameters is a heuristic process
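A direct implementation of this update can be sketched as follows (the convention $$e_{-1}=0$$ for the first derivative term is an assumption, as is the class interface):

```python
class PID:
    """Discrete PID controller: a_t = Kp*e_t + Ki*sum_{k<=t} e_k + Kd*(e_t - e_{t-1})."""

    def __init__(self, Kp, Ki, Kd):
        self.Kp, self.Ki, self.Kd = Kp, Ki, Kd
        self.e_sum = 0.0     # running sum of errors (integral term)
        self.e_prev = 0.0    # previous error (assumed 0 before t = 0)

    def act(self, o, o_star):
        e = o_star - o       # error e_t = o*_t - o_t
        self.e_sum += e
        a = self.Kp * e + self.Ki * self.e_sum + self.Kd * (e - self.e_prev)
        self.e_prev = e
        return a
```

With $$K_I=K_D=0$$ this reduces to pure proportional feedback $$a_t = K_P e_t$$; the state kept between calls (the error sum and previous error) is exactly the history dependence noted above.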

*(Figure: error over time $$t$$.)*

## Agenda

1. Recap: Local LQR

2. Iterative LQR

3. PID Control

4. Limitations to Control

• How good can an optimal policy be?
• Are there inherent properties of a system that limit performance?

## Limitations to Control

1. Finite and Deterministic MDP with $$r(s,a) = 1$$ if $$s=0$$ and $$0$$ otherwise

2. Linear Dynamics with cost $$\|s_t\|_2^2$$ $$s_{t+1} = \begin{bmatrix} 2 & 0 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t$$

## Motivating examples

*(Figure: two-state MDP with states $$0$$ and $$1$$ and actions $$a\in\{\mathsf{stay},\mathsf{switch}\}$$.)*

Definition:

• A state $$s'$$ is reachable from a state $$s$$ if there exists a sequence of actions $$a_0,\dots,a_{T-1}$$ for a finite $$T$$ such that $$\mathbb P\{s_T=s'\mid s_0=s,a_0,\dots,a_{T-1}\}>0$$
• An MDP is reachable (also called controllable) if every state is reachable from every other state

## Reachability

Theorem: Given finite $$\mathcal S,\mathcal A$$ and transition function $$P$$, construct a directed graph with vertices $$\mathcal V=\mathcal S$$ and an edge from $$s$$ to $$s'$$ if $$P(s'|s,a)>0$$ for some $$a\in\mathcal A$$.

• Then the MDP is reachable if the graph is strongly connected, i.e. if there is a path from every vertex to every other vertex
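This check can be implemented directly with graph search; a sketch, assuming the graph is given as a dict mapping each state $$s$$ to the set of states $$s'$$ with $$P(s'|s,a)>0$$ for some $$a$$:

```python
from collections import deque

def reachable_states(adj, s):
    """BFS: all states reachable from s in the directed graph adj[u] = {v : edge u -> v}."""
    seen, queue = {s}, deque([s])
    while queue:
        u = queue.popleft()
        for v in adj[u] - seen:
            seen.add(v)
            queue.append(v)
    return seen

def is_reachable_mdp(adj):
    """Reachable iff strongly connected: every state reaches every other state."""
    states = set(adj)
    return all(reachable_states(adj, s) == states for s in states)
```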

## Discrete Reachability


Proof:

• Since the graph is strongly connected, for any $$s$$ and $$s'$$ there exists a directed path from $$s$$ to $$s'$$
• Let $$T$$ be its length
• By construction, each edge along this path corresponds to at least one action $$a_i$$ and some nonzero transition probability $$p_i$$
• Then $$\mathbb P\{s_T=s'\mid s_0=s,a_0,\dots,a_{T-1}\}\geq\prod_{i=0}^{T-1} p_i >0$$

## Linear Deterministic Reachability

Theorem: The linear dynamics $$s_{t+1}=As_t+Ba_t$$ are controllable if the controllability matrix $$\mathcal C$$ is full rank: $$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B & AB & A^2 B & \dots & A^{n_s-1}B\end{bmatrix}}_{\mathcal C}\Big) = n_s$$

## Linear Deterministic Reachability

For the example $$s_{t+1} = \begin{bmatrix} 2 & 0 \\ 0 & 1\end{bmatrix} s_t + \begin{bmatrix}0\\1\end{bmatrix}a_t$$

• $$\mathcal C = \begin{bmatrix} 0 &0 \\ 1 & 1\end{bmatrix}$$ is not full rank
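The rank condition is easy to check numerically; a sketch (function names are illustrative):

```python
import numpy as np

def controllability_matrix(A, B):
    """Stack [B, AB, A^2 B, ..., A^{n_s - 1} B] column-wise."""
    blocks, M = [B], B
    for _ in range(A.shape[0] - 1):
        M = A @ M
        blocks.append(M)
    return np.hstack(blocks)

def is_controllable(A, B):
    C = controllability_matrix(A, B)
    return np.linalg.matrix_rank(C) == A.shape[0]
```

On the example above ($$A=\mathrm{diag}(2,1)$$, $$B=[0,1]^\top$$) this reports rank 1, i.e. not controllable, while the UAV example later is controllable.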

Theorem: The linear dynamics $$s_{t+1}=As_t+Ba_t$$ are controllable if the controllability matrix $$\mathcal C$$ is full rank: $$\mathrm{rank}\Big(\underbrace{\begin{bmatrix}B & AB & A^2 B & \dots & A^{n_s-1}B\end{bmatrix}}_{\mathcal C}\Big) = n_s$$

## Linear Deterministic Reachability

Proof:

• Recall that $$s_t = A^t s_0 + \sum_{k=0}^{t-1}A^{k}Ba_{t-k-1}$$ (PSet)
• Therefore, setting $$s_{n_s}=s'$$ and $$s_0=s$$, $$s' - A^{n_s} s = \begin{bmatrix}B & AB & \dots & A^{n_s-1}B \end{bmatrix} \begin{bmatrix}a_{n_s-1}\\ \vdots \\ a_0\end{bmatrix}$$
• If $$\mathcal C$$ is full rank, this system of linear equations has at least one solution $$a_0,\dots a_{n_s-1}$$

## Example

• Setting: hovering UAV over a target
• Action: thrust right/left
• $$s_{t+1}=\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix} s_t +\begin{bmatrix}0\\ 1\end{bmatrix}a_t$$
• Controllability grammian $$\mathcal C = \begin{bmatrix}0&1 \\1& 1\end{bmatrix}$$


To get from $$s$$ to $$s'$$ in $$n_s=2$$ steps we can simply take the actions:

• $$\begin{bmatrix}a_1\\a_0\end{bmatrix} = \mathcal C^{-1}(s' - A^2 s)=\begin{bmatrix}-1&1 \\1& 0\end{bmatrix}\left(s' - \begin{bmatrix}1&2 \\0& 1\end{bmatrix} s\right)$$
• $$=\begin{bmatrix}-1&1 \\1& 0\end{bmatrix}\begin{bmatrix}\mathsf{pos}'-\mathsf{pos}-2\mathsf{vel} \\\mathsf{vel}'-\mathsf{vel}\end{bmatrix} = \begin{bmatrix}-\mathsf{pos}'+\mathsf{pos}+\mathsf{vel}+\mathsf{vel}' \\\mathsf{pos}'-\mathsf{pos}-2\mathsf{vel}\end{bmatrix}$$
• i.e. $$a_0 = \mathsf{pos}'-\mathsf{pos}-2\mathsf{vel}$$ and $$a_1 = -\mathsf{pos}'+\mathsf{pos}+\mathsf{vel}+\mathsf{vel}'$$ (the stacking in the proof puts $$a_{n_s-1}$$ on top)
• Our focus is mostly optimization rather than design
• Design includes building a system and then modelling it as an MDP
• In real applications, design is just as (if not more) important
• If reachability is an issue, maybe we can add another actuator
• If robustness is an issue, maybe we can tweak the cost or reward function
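The two-step construction for the UAV example can be checked numerically; a minimal sketch (the start and target states are chosen arbitrarily for illustration, and recall that the proof's stacking puts $$a_{n_s-1}$$ on top of $$\mathcal C^{-1}(s'-A^2s)$$):

```python
import numpy as np

# UAV example dynamics from the slides
A = np.array([[1., 1.], [0., 1.]])
B = np.array([[0.], [1.]])
C = np.hstack([B, A @ B])   # controllability matrix [[0, 1], [1, 1]]

def two_step_actions(s, s_target):
    """Solve C [a_1; a_0] = s' - A^2 s for the actions reaching s' in two steps."""
    a1, a0 = np.linalg.solve(C, s_target - A @ A @ s)
    return a0, a1

# Illustrative states: start at rest at the origin, target pos' = 3, vel' = 1
s, s_target = np.array([0., 0.]), np.array([3., 1.])
a0, a1 = two_step_actions(s, s_target)
s1 = A @ s + B[:, 0] * a0
s2 = A @ s1 + B[:, 0] * a1   # reaches s_target
```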

## Recap

• PSet due TONIGHT
• PA due after Feb break

• Iterative Nonlinear Control
• PID Control
• Reachability

• Happy Feb break!
• Next lecture: Model-Based RL

By Sarah Dean
