Chaining Local Policies from Offline Samples

Motivation


1. We have highly efficient local controllers when queried for the right direction. How do we leverage this for more global planning?

2. How do we leverage access to resets more efficiently in simulation?

3. Multi-query planning: reuse offline computation across many start/goal queries.

Algorithm

Setup:

1. Assume some access to a simple local controller (MPC, MPPI, collision-free motion planning, etc.),

u = \pi(x;x_{goal})

Importantly, we want to quantify the performance of this local controller with some form of a local value function.

 

We want this to preserve some notion of goal-reaching behavior. Letting F_\pi(x) denote the state reached by rolling out \pi from x, some examples include

\begin{aligned} V^\pi(x;x_{goal}) & = \|x_{goal} - F_\pi(x)\|^2 && \text{(distance to goal)} \\ V^\pi(x;x_{goal}) & = \mathbf{1}[F_\pi(x) = x_{goal}] && \text{(reachability)} \end{aligned}
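A minimal Python sketch of how such a local value function could be estimated by rolling out the local controller; the helper names (local_controller, step) and the fixed horizon are illustrative assumptions, not part of these notes:

import numpy as np

def rollout_final_state(x0, x_goal, local_controller, step, horizon=100):
    # Roll out u = pi(x; x_goal) for a fixed horizon and return the final state F_pi(x0).
    x = np.asarray(x0, dtype=float)
    for _ in range(horizon):
        u = local_controller(x, x_goal)   # e.g. one MPC / MPPI solve
        x = step(x, u)                    # simulator or dynamics step
    return x

def distance_to_goal_value(x0, x_goal, local_controller, step, horizon=100):
    # V^pi(x; x_goal) = ||x_goal - F_pi(x)||^2
    x_final = rollout_final_state(x0, x_goal, local_controller, step, horizon)
    return float(np.sum((np.asarray(x_goal, dtype=float) - x_final) ** 2))

def reachability_value(x0, x_goal, local_controller, step, horizon=100, tol=1e-2):
    # V^pi(x; x_goal) = 1[F_pi(x) = x_goal], relaxed to a small tolerance ball
    x_final = rollout_final_state(x0, x_goal, local_controller, step, horizon)
    return float(np.linalg.norm(np.asarray(x_goal, dtype=float) - x_final) <= tol)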


2. Sample many "good" states offline to build a library of local value functions, 

V^\pi(x_i;x_{goal})

3. Online, do graph search to find the shortest path of subgoals: build a graph whose nodes are the sampled states (together with the initial state x_0 and the goal x_{goal}) and whose edge costs are

c(x_i,x_j) = V^\pi(x_i;x_j)

then take the minimum-cost path from x_0 to x_{goal},

\min_{\text{paths } x_0 \to x_{goal}} \sum_{(i,j)} c(x_i,x_j)
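A minimal sketch of this subgoal graph search, assuming edge costs come from a distance-to-goal style value (smaller is better) and a pair_value(x_i, x_j) helper like the one sketched above; Dijkstra is written out by hand to keep it self-contained:

import heapq

def build_edge_costs(states, pair_value):
    # Offline: c(x_i, x_j) = V^pi(x_i; x_j) for every ordered pair of sampled states.
    n = len(states)
    return {(i, j): pair_value(states[i], states[j])
            for i in range(n) for j in range(n) if i != j}

def shortest_subgoal_path(costs, num_nodes, start, goal):
    # Online: Dijkstra over the subgoal graph; returns the node index sequence start -> goal.
    dist, prev = {start: 0.0}, {}
    queue = [(0.0, start)]
    while queue:
        d, i = heapq.heappop(queue)
        if i == goal:
            break
        if d > dist.get(i, float("inf")):
            continue
        for j in range(num_nodes):
            if j == i or (i, j) not in costs:
                continue
            d_new = d + costs[(i, j)]
            if d_new < dist.get(j, float("inf")):
                dist[j], prev[j] = d_new, i
                heapq.heappush(queue, (d_new, j))
    path, node = [goal], goal
    while node != start:          # assumes the goal node is reachable in the graph
        node = prev[node]
        path.append(node)
    return list(reversed(path))

Online, the current state x_0 and the query goal are appended as extra nodes, their edges are scored the same way, and the resulting chain of subgoals is handed to the local controller one at a time.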

Questions

1. Why is chaining together local value functions better than learning a global goal-conditioned value function?

2. How do we make this more "continuous" over the sampled states?

3. What are the benefits of using MPC for the lower-level control?

4. Why not learn a terminal value function for MPC?

Toy Problem

A long-horizon, sparse-reward problem with difficult exploration, where on-policy / finite-horizon planning struggles. Approaches to compare:

  • MPC with a terminal value function
  • MPC with graph search on subgoals
  • Off-policy RL (DDPG / SAC)
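For concreteness, one way to write the two MPC variants being compared (the horizon H, stage cost c, and learned terminal value \hat{V} are notation assumed here, not fixed in the notes):

\begin{aligned} \text{MPC + terminal value:} \quad & \min_{u_{0:H-1}} \sum_{t=0}^{H-1} c(x_t, u_t) + \hat{V}(x_H; x_{goal}) \\ \text{MPC + subgoal graph search:} \quad & \text{graph search returns } x_{g_1}, \dots, x_{g_K}, \text{ each tracked with } u = \pi(x; x_{g_k}) \end{aligned}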

Toy Problem

  • This configuration gives high value.
  • Iterate backwards from this configuration to figure out how to get to it from other states.
  • Value iteration methods work similarly.

Dynamic Programming Argument
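A toy illustration of this backwards-propagation argument, using an assumed 10-state chain with a sparse reward only at the goal (none of these numbers come from the notes):

import numpy as np

# 10-state chain, actions move left or right, reward only for reaching the goal state.
N, goal, gamma = 10, 9, 0.95
V = np.zeros(N)
for sweep in range(N):
    V_new = np.zeros(N)
    for s in range(N):
        if s == goal:
            continue                      # absorbing goal state, value stays 0
        candidates = []
        for a in (-1, +1):
            s_next = min(max(s + a, 0), N - 1)
            reward = 1.0 if s_next == goal else 0.0
            candidates.append(reward + gamma * V[s_next])
        V_new[s] = max(candidates)
    V = V_new
    # After k sweeps, only states within k steps of the goal have nonzero value.
    print(sweep + 1, np.round(V, 2))

Each sweep pushes value one more state away from the goal, which is exactly the backwards iteration in the bullets above; states far from the goal see no signal until the information has propagated out to them.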

Toy Problem

  • The states nearby have a strong reward signal, so the policy improves quickly there.
  • These nearby states converge to the optimal value function.
  • Do further states then propagate this information using the optimal value functions of the nearby states?

How does this work in policy iteration?

Questions

  • Is DDPG / SAC policy iteration or value iteration?
  • How would one use access to such informative resets in typical RL formulations?

Summary

  • Get a low-level controller.
  • Construct a value function for the low-level controller.
  • Construct a learned model over the higher level using the learned value functions.
  • Plan over the learned model.
  • Resets help reduce the complexity of these steps (a sketch follows below).
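Putting these bullets together, a compact sketch of the full pipeline; sample_good_state, reset_to, pair_value, and execute_to are hypothetical hooks into the simulator and the helpers sketched earlier, not an existing API:

def offline_phase(sample_good_state, reset_to, pair_value, num_samples=200):
    # Use resets to place the simulator at many "good" states and score every ordered pair.
    states = [sample_good_state() for _ in range(num_samples)]
    costs = {}
    for i, x_i in enumerate(states):
        for j, x_j in enumerate(states):
            if i == j:
                continue
            reset_to(x_i)                          # resets make each local evaluation cheap
            costs[(i, j)] = pair_value(x_i, x_j)   # c(x_i, x_j) = V^pi(x_i; x_j)
    return states, costs

def online_phase(x0, x_goal, states, costs, pair_value, local_controller, execute_to):
    # Add the query start/goal as nodes, run graph search, then chain the local controller.
    nodes = states + [x0, x_goal]
    s, g = len(states), len(states) + 1
    costs = dict(costs)
    for i, x_i in enumerate(states):
        costs[(s, i)] = pair_value(x0, x_i)        # start -> sampled state
        costs[(i, g)] = pair_value(x_i, x_goal)    # sampled state -> goal
    costs[(s, g)] = pair_value(x0, x_goal)         # direct attempt
    path = shortest_subgoal_path(costs, len(nodes), start=s, goal=g)
    for k in path[1:]:
        execute_to(nodes[k], local_controller)     # track each subgoal with u = pi(x; x_subgoal)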

Lessons between RL / Model-Based Control

RL has an exploration problem in the long-horizon, sparse-reward setting.

Project 1. RL with Dynamic Smoothing

Is RL on force-from-a-distance (smoothed) dynamics easier than RL on non-smooth dynamics?

Why? 

 

- RL might suffer from poor exploration when driven only by random noise.

- Smoothing provides "free exploration".
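As one assumed instance of what "force-from-a-distance" could mean here (softplus smoothing of a hard contact force with signed distance \phi(x), stiffness k, and temperature \tau; not necessarily the formulation used in the project):

\begin{aligned} \lambda_{\text{hard}}(x) & = k \max(0, -\phi(x)) \\ \lambda_{\tau}(x) & = k\,\tau \log\left(1 + e^{-\phi(x)/\tau}\right) \end{aligned}

The smoothed force is nonzero before contact is made, so reward and gradient signal exist at a distance; that is the "free exploration" intuition, as opposed to waiting for random noise to stumble into contact.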

Project 2. Differentiable RL

Can we bootstrap value gradients dV/dx instead of value V in policy iteration if the dynamics are differentiable?

 

Show that this leads to better performance by leveraging gradients
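One way to make the bootstrapping target precise, assuming deterministic differentiable dynamics x' = f(x, \pi(x)), stage cost c, and discount \gamma (standard notation, assumed here): differentiate the Bellman equation for V^\pi and bootstrap the gradient term at x',

\begin{aligned} V^\pi(x) & = c(x, \pi(x)) + \gamma\, V^\pi(x') \\ \nabla_x V^\pi(x) & = \nabla_x c + \left(\frac{\partial \pi}{\partial x}\right)^{\top} \nabla_u c + \gamma \left(\frac{\partial f}{\partial x} + \frac{\partial f}{\partial u} \frac{\partial \pi}{\partial x}\right)^{\top} \nabla_{x'} V^\pi(x') \end{aligned}

so a learned gradient network can regress onto the right-hand side using \nabla_{x'} V^\pi(x') from the previous iterate, in the same way ordinary policy evaluation bootstraps V.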

 

 

Hand task: 100x / parameter tuning

Complex tasks?

 

 



By Terry Suh
