Model-Based RL:
LQR, iLQR
Curriculum
- Linear Quadratic Regulator
- iterative LQR (iLQR)
- The case of unknown dynamics
LQR - motivation
We aim to optimize the total reward.
For now, assume deterministic dynamics.
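One way to write this objective out, assuming a finite horizon \(T\) (my notation, kept consistent with the rest of the deck):
\[
\max_{a_0, \dots, a_T} \; \sum_{t=0}^{T} r(s_t, a_t)
\quad \text{s.t.} \quad s_{t+1} = f(s_t, a_t)
\]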
What if \(r\) and \(f\) are known and are "nice":
- we can find \( \max_a r(s, a) \) analytically
- the composition of \(r\) and \(f\) is also "nice"
Then we can express the optimal action \(a_T\) as a function of \(s_T\):
Now we can write the value of the last time-step:
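In symbols (a reconstruction; nothing comes after the last time-step, so its Q-value is just the reward):
\[
a_T = \pi_T(s_T) = \arg\max_{a_T} r(s_T, a_T),
\qquad
V(s_T) = \max_{a_T} r(s_T, a_T) = r\big(s_T, \pi_T(s_T)\big)
\]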
LQR - motivation
Going one step backward:
Apply recursively until \(t = 0 \) where the state \(s_0\) is known!
Then we can go forward:
Using the rule of "nice" compositions:
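Written out, this is finite-horizon dynamic programming (my reconstruction); the "nice" composition is exactly what keeps \(V_{t+1}(f(s_t, a_t))\) tractable:
\[
\pi_t(s_t) = \arg\max_{a_t} \big[ r(s_t, a_t) + V_{t+1}(f(s_t, a_t)) \big],
\qquad
V_t(s_t) = \max_{a_t} \big[ r(s_t, a_t) + V_{t+1}(f(s_t, a_t)) \big],
\]
with \(V_{T+1} \equiv 0\), and the forward rollout is
\[
a_t = \pi_t(s_t), \qquad s_{t+1} = f(s_t, a_t), \qquad t = 0, 1, \dots, T.
\]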
Linear dynamics, Quadratic rewards
Dynamics
Rewards
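Presumably the forms intended here; they match the \(\tilde f_t, \tilde r_t\) definitions used later for iLQR:
\[
f(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t,
\qquad
r(s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t
\]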
LQR - backward pass
The Q-function at the last time-step:
Equate its gradient with respect to \(a_T\) to zero:
Get the optimal last time-step behavior:
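The corresponding formulas, reconstructed in block notation (at the last step \(Q_T = R_T\) and \(q_T = r_T\), since nothing follows):
\[
Q(s_T, a_T) = \frac{1}{2} \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T Q_T \begin{bmatrix} s_T \\ a_T \end{bmatrix} + \begin{bmatrix} s_T \\ a_T \end{bmatrix}^T q_T
\]
\[
\nabla_{a_T} Q(s_T, a_T) = Q_{T,as}\, s_T + Q_{T,aa}\, a_T + q_{T,a} = 0
\;\;\Longrightarrow\;\;
a_T = - Q_{T,aa}^{-1} \big( Q_{T,as}\, s_T + q_{T,a} \big)
\]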
LQR - backward pass
(check-it-yourself slide)
The last time-step value is also quadratic in \(s_T\)
The Q function at \(T - 1\) is again quadratic in both \(s_{T-1}\) and \(a_{T-1}\):
Thus you can get an analytical formula for the policy at each time-step:
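A sketch of the algebra (standard LQR recursions, written in the same notation). Writing the optimal action as \(a_T = K_T s_T + k_T\) with
\[
K_T = -Q_{T,aa}^{-1} Q_{T,as}, \qquad k_T = -Q_{T,aa}^{-1} q_{T,a},
\]
and substituting it back gives a quadratic value \(V(s_T) = \mathrm{const} + \frac{1}{2} s_T^T V_T s_T + s_T^T v_T\) with
\[
V_T = Q_{T,ss} + Q_{T,sa} K_T + K_T^T Q_{T,as} + K_T^T Q_{T,aa} K_T,
\qquad
v_T = q_{T,s} + Q_{T,sa} k_T + K_T^T q_{T,a} + K_T^T Q_{T,aa} k_T.
\]
Pushing this value through the linear dynamics yields the previous Q-function:
\[
Q_{T-1} = R_{T-1} + F_{T-1}^T V_T F_{T-1},
\qquad
q_{T-1} = r_{T-1} + F_{T-1}^T V_T f_{T-1} + F_{T-1}^T v_T,
\]
and the same \(a_t = -Q_{t,aa}^{-1}(Q_{t,as}\, s_t + q_{t,a})\) formula applies at every \(t\).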
LQR - algorithm
Given: \(s_0, F_t, f_t, R_t, r_t\) for all \(t\)
- Calculate \(Q_t, q_t\) for all \(t\) going backward
- Calculate \(a_t = - Q_{t,aa}^{-1} ( Q_{t,as}\, s_t + q_{t,a}) \) for all \(t\) going forward
- Calculate \(s_{t+1} = f_t(s_t, a_t) \)
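A minimal NumPy sketch of these backward and forward passes, under assumptions of my own (blocks ordered state-then-action, rewards maximized; names like `lqr_backward` are not from the deck):

```python
import numpy as np

def lqr_backward(F, f, R, r):
    """Backward pass of LQR (reward maximization).

    F[t]: (n, n+m), f[t]: (n,)      -- dynamics s_{t+1} = F[t] @ [s; a] + f[t], t = 0..T-1
    R[t]: (n+m, n+m), r[t]: (n+m,)  -- reward 0.5*[s;a]' R[t] [s;a] + [s;a]' r[t], t = 0..T
    Returns K, k with the optimal policy a_t = K[t] @ s_t + k[t].
    """
    T = len(R) - 1
    n = F[0].shape[0]                       # state dimension
    K, k = [None] * (T + 1), [None] * (T + 1)
    V, v = np.zeros((n, n)), np.zeros(n)    # value is zero after the last step
    for t in range(T, -1, -1):
        if t == T:
            Q, q = R[t], r[t]               # last step: Q_T = R_T, q_T = r_T
        else:
            Q = R[t] + F[t].T @ V @ F[t]
            q = r[t] + F[t].T @ (V @ f[t] + v)
        Qss, Qsa, Qas, Qaa = Q[:n, :n], Q[:n, n:], Q[n:, :n], Q[n:, n:]
        qs, qa = q[:n], q[n:]
        K[t] = -np.linalg.solve(Qaa, Qas)   # a_t = K[t] s_t + k[t]
        k[t] = -np.linalg.solve(Qaa, qa)
        V = Qss + Qsa @ K[t] + K[t].T @ Qas + K[t].T @ Qaa @ K[t]
        v = qs + Qsa @ k[t] + K[t].T @ qa + K[t].T @ Qaa @ k[t]
    return K, k

def lqr_forward(s0, F, f, K, k):
    """Forward pass: roll out the linear dynamics under the computed policy."""
    s, states, actions = s0, [s0], []
    for t in range(len(K)):
        a = K[t] @ s + k[t]
        actions.append(a)
        if t < len(F):                      # the last action has no transition after it
            s = F[t] @ np.concatenate([s, a]) + f[t]
            states.append(s)
    return states, actions
```

This assumes each \(Q_{t,aa}\) block is invertible (negative definite for a well-posed maximization).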
LQR - stochastic dynamics
Closed-loop planning
Tutorial
LQR - algorithm (stochastic dynamics)
Given: \(s_0, F_t, f_t, R_t, r_t\) for all \(t\)
- Calculate \(Q_t, q_t\) for all \(t\) going backward
- Calculate \(a_0 = - Q_{0,aa}^{-1} ( Q_{0,as}\, s_0 + q_{0,a}) \) only for \(t=0\)
- Apply \(a_0\) in the real environment
- Observe \(s_1 \sim p(s_1|s_0, a_0) \)
- Start from the beginning, now from \(s_1\)!
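A sketch of this replanning loop; it reuses the `lqr_backward` helper from the sketch above and assumes a Gym-style `reset`/`step` environment interface (both are my assumptions, not part of the deck):

```python
def run_closed_loop(env, F, f, R, r):
    """Closed-loop LQR for stochastic dynamics: replan from the observed state
    at every step and execute only the first planned action."""
    T = len(R) - 1
    s = env.reset()
    for t in range(T):                       # the final action a_T is omitted for brevity
        # Plan over the remaining horizon t..T, starting from the observed state.
        K, k = lqr_backward(F[t:], f[t:], R[t:], r[t:])
        a = K[0] @ s + k[0]                  # apply only the first planned action
        s, _, done, _ = env.step(a)          # observe s_{t+1} ~ p(. | s_t, a_t)
        if done:
            break
    return s
```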
Curriculum
- Linear Quadratic Regulator
- iterative LQR (iLQR)
- The case of unknown dynamics
The dynamics are not Linear,
the rewards are not Quadratic :(
Taylor expansion:
\( f(s_t, a_t) \approx f(\hat s_t, \hat a_t) + \nabla f (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}\)
\( r(s_t, a_t) \approx r(\hat s_t, \hat a_t) + \nabla r (\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} + \frac{1}{2} \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix}^T \nabla^2 r(\hat s_t, \hat a_t) \begin{bmatrix} s_t - \hat s_t \\ a_t - \hat a_t \end{bmatrix} \)
Simplify a bit...
\( \tilde f_t(s_t, a_t) = F_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + f_t \)
\( \tilde r_t (s_t, a_t) = \frac{1}{2} \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T R_t \begin{bmatrix} s_t \\ a_t \end{bmatrix} + \begin{bmatrix} s_t \\ a_t \end{bmatrix}^T r_t \)
Iterative LQR (iLQR)
Algorithm:
Initialize \(( \hat s_0, \hat a_0, \hat s_1, \dots) \) somehow
- \(F_t = \nabla f(\hat s_t, \hat a_t) \)
- \(f_t = \dots \)
- \(R_t = \nabla^2 r(\hat s_t, \hat a_t) \)
- \(r_t = \dots \)
- \( (a_0, s_1, a_1, \dots) = \mathrm{LQR}(F_t, f_t, R_t, r_t) \)
- \( (\hat s_0, \hat a_0, \hat s_1, \hat a_1, \dots) \leftarrow (s_0, a_0, s_1, a_1, \dots) \)
- Go to step 1: re-linearize around the new nominal trajectory (see the sketch below).
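To make the first steps concrete, here is a small self-contained numerical sketch of obtaining \(F_t\) and \(R_t\) around the nominal trajectory by finite differences (the helper names and the finite-difference choice are mine; the offset terms \(f_t, r_t\) are left out here, as on the slide):

```python
import numpy as np

def finite_diff_jacobian(fn, x, eps=1e-5):
    """Numerical Jacobian of fn: R^d -> R^k at x (central differences)."""
    y = np.atleast_1d(np.asarray(fn(x), dtype=float))
    J = np.zeros((y.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (np.atleast_1d(fn(x + dx)) - np.atleast_1d(fn(x - dx))) / (2 * eps)
    return J

def finite_diff_hessian(fn, x, eps=1e-4):
    """Numerical Hessian of a scalar fn at x (Jacobian of the numerical gradient)."""
    grad = lambda z: finite_diff_jacobian(fn, z, eps).ravel()
    return finite_diff_jacobian(grad, x, eps)

def linearize_trajectory(f, r, s_hat, a_hat):
    """Steps 1-4 of the loop above: F_t = grad f, R_t = Hessian of r,
    evaluated at the nominal points (s_hat[t], a_hat[t])."""
    n = len(s_hat[0])                                  # state dimension
    F, R = [], []
    for s, a in zip(s_hat, a_hat):
        x = np.concatenate([s, a]).astype(float)       # stacked [state; action]
        F.append(finite_diff_jacobian(lambda z: f(z[:n], z[n:]), x))
        R.append(finite_diff_hessian(lambda z: r(z[:n], z[n:]), x))
    return F, R
```

Together with the offset terms from the Taylor expansion, one pass of the LQR routine on \(\tilde f_t, \tilde r_t\) then produces the new nominal trajectory for the next iteration.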
Curriculum
- Linear Quadratic Regulator
- iterative LQR (iLQR)
- Differential Dynamic Programming
- The case of unknown dynamics
Curriculum
- Linear Quadratic Regulator
- iterative LQR (iLQR)
- The case of unknown dynamics
MB-RL with Model-Free fine-tuning