We aim to optimize the total reward over a finite horizon.
For now, assume the dynamics are deterministic.
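Spelled out (my formalization of the setup; the horizon convention \(t = 0, \dots, T\) matches the rest of the notes):
\[ \max_{a_0, \dots, a_T} \; \sum_{t=0}^{T} r(s_t, a_t) \quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t). \]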
What if \(r\) and \(f\) are known and "nice"?
Then we can express the optimal last action \(a_T\) as a function of \(s_T\):
Now we can write the value function at the last time step:
Going one step backward:
Apply recursively until \(t = 0 \) where the state \(s_0\) is known!
Then we can go forward:
Using the fact that compositions of "nice" functions stay "nice":
Dynamics are linear: \( f(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t \)
Rewards are quadratic: \( r(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t \)
The last Q-function is just the final-step reward: \( Q(s_T, a_T) = r(s_T, a_T) = \frac{1}{2} \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T R_T \begin{bmatrix} s_T \\ a_T \end{bmatrix} + \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T r_T \)
Equate the gradient to zero:
Get the optimal last time-step behavior
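Concretely, partitioning \(R_T\) and \(r_T\) into state and action blocks (the block notation here is mine):
\[ R_T = \begin{bmatrix} R_{s_T, s_T} & R_{s_T, a_T} \\ R_{a_T, s_T} & R_{a_T, a_T} \end{bmatrix}, \qquad r_T = \begin{bmatrix} r_{s_T} \\ r_{a_T} \end{bmatrix}, \]
\[ \nabla_{a_T} Q(s_T, a_T) = R_{a_T, s_T} s_T + R_{a_T, a_T} a_T + r_{a_T} = 0 \;\; \Longrightarrow \;\; a_T = K_T s_T + k_T, \]
\[ K_T = -R_{a_T, a_T}^{-1} R_{a_T, s_T}, \qquad k_T = -R_{a_T, a_T}^{-1} r_{a_T}, \]
assuming \(R_{a_T, a_T}\) is invertible (negative definite, so the stationary point is a maximum).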
The last time-step value is also quadratic in \(s_T\)
The Q function at \(T - 1\) is again quadratic in both \(s_{T-1}\) and \(a_{T-1}\):
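Written out (the \(V_T, v_T, Q_{T-1}, q_{T-1}\) names are my shorthand): substituting \(a_T = K_T s_T + k_T\) back into the last Q gives
\[ V(s_T) = \text{const} + \frac{1}{2} s_T^T V_T s_T + s_T^T v_T, \]
\[ V_T = R_{s_T, s_T} + R_{s_T, a_T} K_T + K_T^T R_{a_T, s_T} + K_T^T R_{a_T, a_T} K_T, \qquad v_T = r_{s_T} + R_{s_T, a_T} k_T + K_T^T r_{a_T} + K_T^T R_{a_T, a_T} k_T, \]
and, since \(Q(s_{T-1}, a_{T-1}) = r(s_{T-1}, a_{T-1}) + V(f(s_{T-1}, a_{T-1}))\), substituting the linear dynamics \(s_T = F_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + f_{T-1}\) gives
\[ Q(s_{T-1}, a_{T-1}) = \text{const} + \frac{1}{2} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T Q_{T-1} \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix} + \begin{bmatrix} s_{T-1} \\ a_{T-1} \end{bmatrix}^T q_{T-1}, \]
\[ Q_{T-1} = R_{T-1} + F_{T-1}^T V_T F_{T-1}, \qquad q_{T-1} = r_{T-1} + F_{T-1}^T V_T f_{T-1} + F_{T-1}^T v_T. \]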
Thus you get an analytical formula for the policy at each time step: \( a_t = K_t s_t + k_t \)
Given: \(s_0, F_t, f_t, R_t, r_t\) for all \(t\)
The result is a closed-loop policy: \( a_t = K_t s_t + k_t \) uses whatever state is actually reached at time \(t\).
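As a concrete sketch, the backward and forward recursions above can be written in a few lines of NumPy (the function names, input shapes, and \(t = 0 \dots T\) indexing are my own choices, not the notes'):

```python
import numpy as np

def lqr_backward(F, f, R, r, n_s):
    """Backward recursion for the LQR problem above.

    F[t] (n_s x (n_s+n_a)), f[t] (n_s,): dynamics  s_{t+1} = F_t [s_t; a_t] + f_t, for t = 0..T-1
    R[t] ((n_s+n_a) x (n_s+n_a)), r[t] (n_s+n_a,): reward terms, for t = 0..T
    Returns the gains of the per-step policy a_t = K_t s_t + k_t.
    """
    T = len(R) - 1
    Ks, ks = [None] * (T + 1), [None] * (T + 1)
    V, v = np.zeros((n_s, n_s)), np.zeros(n_s)   # no value beyond the last step
    for t in range(T, -1, -1):
        # Q at step t: reward now plus the (quadratic) value of the next state.
        if t == T:
            Q, q = R[t], r[t]
        else:
            Q = R[t] + F[t].T @ V @ F[t]
            q = r[t] + F[t].T @ (V @ f[t] + v)
        Q_ss, Q_sa = Q[:n_s, :n_s], Q[:n_s, n_s:]
        Q_as, Q_aa = Q[n_s:, :n_s], Q[n_s:, n_s:]
        q_s, q_a = q[:n_s], q[n_s:]
        # Set the gradient w.r.t. a_t to zero (Q_aa assumed invertible,
        # negative definite for a maximum): a_t = K_t s_t + k_t.
        Q_aa_inv = np.linalg.inv(Q_aa)
        K, k = -Q_aa_inv @ Q_as, -Q_aa_inv @ q_a
        # Plug the optimal action back in: the value stays quadratic in s_t.
        V = Q_ss + Q_sa @ K + K.T @ Q_as + K.T @ Q_aa @ K
        v = q_s + Q_sa @ k + K.T @ q_a + K.T @ Q_aa @ k
        Ks[t], ks[t] = K, k
    return Ks, ks

def lqr_forward(s0, F, f, Ks, ks):
    """Forward pass: run the closed-loop policy through the (linear) dynamics."""
    states, actions = [np.asarray(s0)], []
    for t in range(len(Ks)):
        actions.append(Ks[t] @ states[t] + ks[t])
        if t < len(Ks) - 1:
            states.append(F[t] @ np.concatenate([states[t], actions[t]]) + f[t])
    return states, actions
```

Note that the backward pass only needs the given matrices; no states have to be simulated until the forward pass.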
What if the dynamics \(f\) and the reward \(r\) are not linear-quadratic? Approximate them locally around a nominal trajectory \((\hat s_t, \hat a_t)\) and run LQR on the approximation.
Taylor expansion:
\( f(s_t, a_t) \approx f(\hat s_t, \hat a_t) + \nabla f (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}\)
\( r(s_t, a_t) \approx r(\hat s_t, \hat a_t) + \nabla r (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}^T \nabla^2 r(\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} \)
Simplify a bit by collecting the expansion into the LQR form, with \(F_t = \nabla f(\hat s_t, \hat a_t)\) and \(R_t = \nabla^2 r(\hat s_t, \hat a_t)\), and with \(f_t\) and \(r_t\) absorbing the leftover constant and linear terms:
\( \tilde f_t(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t \)
\( \tilde r_t (s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t \)
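One way to see where \(F_t, f_t, R_t, r_t\) come from is to compute them numerically. A rough finite-difference sketch (in practice automatic differentiation would be used; `f_fn`, `r_fn`, and the step size are my assumptions):

```python
import numpy as np

def linearize(f_fn, r_fn, s_hat, a_hat, eps=1e-4):
    """Taylor-expand the dynamics (1st order) and reward (2nd order) around
    (s_hat, a_hat), folded into the tilde-f / tilde-r form written in [s_t; a_t]."""
    x0 = np.concatenate([s_hat, a_hat])
    n_s, n = len(s_hat), len(x0)
    f_joint = lambda x: np.asarray(f_fn(x[:n_s], x[n_s:]))  # dynamics of stacked [s; a]
    r_joint = lambda x: float(r_fn(x[:n_s], x[n_s:]))       # reward of stacked [s; a]

    # Jacobian of the dynamics, gradient/Hessian of the reward (central differences).
    F = np.zeros((n_s, n))
    g = np.zeros(n)
    H = np.zeros((n, n))
    for i in range(n):
        d = np.zeros(n); d[i] = eps
        F[:, i] = (f_joint(x0 + d) - f_joint(x0 - d)) / (2 * eps)
        g[i] = (r_joint(x0 + d) - r_joint(x0 - d)) / (2 * eps)
        for j in range(n):
            e = np.zeros(n); e[j] = eps
            H[i, j] = (r_joint(x0 + d + e) - r_joint(x0 + d - e)
                       - r_joint(x0 - d + e) + r_joint(x0 - d - e)) / (4 * eps ** 2)

    # Rewrite the expansions in terms of [s_t; a_t], constants absorbed into the offsets:
    #   f(s, a) ~ F_t [s; a] + f_t              with f_t = f(s_hat, a_hat) - F_t [s_hat; a_hat]
    #   r(s, a) ~ 1/2 [s; a]^T R_t [s; a] + [s; a]^T r_t + const,
    #             with R_t = Hessian of r, r_t = grad r - R_t [s_hat; a_hat]
    F_t, f_t = F, f_joint(x0) - F @ x0
    R_t, r_t = H, g - H @ x0
    return F_t, f_t, R_t, r_t
```

With these ingredients, the LQR recursion from the previous section applies unchanged to the local approximation.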
Algorithm:
Initialize \(( \hat s_0, \hat a_0, \hat s_1, \dots) \) somehow
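For completeness, one plausible shape of the rest of the loop, reusing the `linearize` and `lqr_backward` sketches above (the update rule and fixed iteration count are simplifications; practical implementations add a line search and regularization):

```python
import numpy as np

def ilqr(f_fn, r_fn, s0, a_init, n_iters=20):
    """Repeatedly: Taylor-expand around the nominal trajectory, solve the
    resulting LQR problem, and roll the new policy through the true dynamics."""
    n_s = len(s0)
    a_hat = [np.asarray(a) for a in a_init]                 # nominal actions a_0 .. a_T
    s_hat = [np.asarray(s0)]
    for t in range(len(a_hat) - 1):                         # nominal states s_0 .. s_T
        s_hat.append(np.asarray(f_fn(s_hat[t], a_hat[t])))

    for _ in range(n_iters):
        # Local LQR ingredients at every point of the nominal trajectory.
        F, f, R, r = map(list, zip(*[linearize(f_fn, r_fn, s, a)
                                     for s, a in zip(s_hat, a_hat)]))
        Ks, ks = lqr_backward(F, f, R, r, n_s)
        # Forward pass with the *true* dynamics gives the next nominal trajectory.
        s_new, a_new = [np.asarray(s0)], []
        for t in range(len(a_hat)):
            a_new.append(Ks[t] @ s_new[t] + ks[t])
            if t < len(a_hat) - 1:
                s_new.append(np.asarray(f_fn(s_new[t], a_new[t])))
        s_hat, a_hat = s_new, a_new
    return s_hat, a_hat
```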