Motivation
1. We have highly efficient local controllers when queried for the right direction. How do we leverage them for more global planning?
2. How do we leverage access to resets more efficiently in simulation?
3. Multi-query: how do we amortize offline computation across many start-goal queries?
Algorithm
Setup:
1. Assume some access to a simple local controller (MPC, MPPI, collision-free motion planning, etc.).
Importantly, we want to quantify the performance of this local controller with some form of a local value function. It should preserve some notion of goal-reaching behavior; examples include reachability (does the local controller reach the goal at all?) and distance to goal (how close does it get?). Both are sketched in code after this list.
2. Sample many "good" states offline to build a library of local value functions.
3. Online, run graph search to find the shortest path of subgoals: build a graph whose nodes are the sampled states and whose edges are weighted by the local value functions (e.g., the local cost-to-go between pairs of sampled states). A sketch of this construction follows below.
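
A minimal sketch of the two candidate local value functions, in Python. The rollout function stands in for the local controller (MPC, MPPI, etc.); it and all other names here are hypothetical placeholders, not from the slides:

import numpy as np

def distance_to_goal_value(x0, goal, rollout, horizon=50):
    # Local value = negative distance to the goal at the end of a
    # local-controller rollout; closer to zero is better.
    traj = rollout(x0, goal, horizon)   # (horizon, dim) array of states
    return -np.linalg.norm(traj[-1] - goal)

def reachability_value(x0, goal, rollout, horizon=50, eps=0.1):
    # Local value = 1 if the rollout ever enters an eps-ball around
    # the goal, else 0 (a binary notion of goal-reaching).
    traj = rollout(x0, goal, horizon)
    return float(np.min(np.linalg.norm(traj - goal, axis=-1)) < eps)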
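
And a sketch of the offline graph construction plus the online subgoal search of step 3, assuming v_local(s, t) returns a negative local cost-to-go so that -v_local is a nonnegative edge cost; networkx and all names are my choices, not from the slides:

import networkx as nx

def build_subgoal_graph(states, v_local, threshold=-10.0):
    # Offline: nodes are the sampled "good" states; an edge i -> j
    # exists when the local controller is judged competent to drive
    # state i to state j (local value above a threshold).
    G = nx.DiGraph()
    G.add_nodes_from(range(len(states)))
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            if i == j:
                continue
            v = v_local(s, t)            # e.g., negative local cost-to-go
            if v > threshold:
                G.add_edge(i, j, weight=-v)  # -v >= 0, so Dijkstra applies
    return G

def plan_subgoals(G, start_idx, goal_idx):
    # Online: shortest path over subgoals; each edge is then tracked
    # by the local controller.
    return nx.shortest_path(G, start_idx, goal_idx, weight="weight")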
Questions
1. Why is chaining together local value functions better than learning a global goal-conditioned value function?
2. How do we make this more "continuous" over the sampled states?
3. What are the benefits of using MPC for the lower-level control?
4. Why not learn a terminal value function for MPC?
Toy Problem
Long-Horizon Sparse-Reward Problem (Difficult Exploration)
Dynamic Programming Argument
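
One way to make the dynamic programming argument concrete (my notation, not from the slides): let $V_{\mathrm{loc}}(x, g)$ be the local controller's cost-to-go from state $x$ to subgoal $g$. Chaining local value functions then amounts to solving a Bellman equation over the subgoal library $\mathcal{G}$:

$$
V(x) = \min_{g \in \mathcal{G}} \left[ V_{\mathrm{loc}}(x, g) + V(g) \right],
\qquad V(x_{\mathrm{goal}}) = 0,
$$

so the online graph search of step 3 is shortest-path dynamic programming on the coarse problem whose states are the sampled subgoals.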
How does this work in policy iteration?
Lessons between RL and Model-Based Control
RL has an exploration problem, most acute in long-horizon, sparse-reward tasks.
Project 1. RL with Dynamic Smoothing
Is RL easier on smoothed, force-from-a-distance dynamics than on the original non-smooth contact dynamics?
Why?
- RL might suffer from poor exploration when it relies only on random action noise
- Smoothing gives "free exploration": smoothed dynamics respond to actions even before contact (see the sketch below)
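
A toy 1-D illustration of the contrast (a hypothetical example; the stiffness k, temperature T, and softplus smoothing are my choices): the hard contact force is zero until penetration, while the smoothed force acts at a distance and recovers the hard model as T goes to 0.

import numpy as np

def hard_contact_force(gap, k=1e3):
    # Non-smooth: force only when penetrating (gap < 0).
    return k * np.maximum(-gap, 0.0)

def smoothed_contact_force(gap, k=1e3, T=1e-2):
    # Softplus smoothing: a small "force from a distance" even at
    # gap > 0; logaddexp keeps it numerically stable. As T -> 0 this
    # converges to the hard contact model.
    return k * T * np.logaddexp(0.0, -gap / T)

Because the smoothed force is nonzero at positive gaps, even a zero-mean random policy receives an informative signal before making contact, which is one way to read the "free exploration" claim.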
Project 2. Differentiable RL
Can we bootstrap value gradients dV/dx instead of value V in policy iteration if the dynamics are differentiable?
Show that this leads to better performance by leveraging gradients (see the sketch after this list).
Hand task: 100x / parameter tuning
Complex tasks?
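
A minimal JAX sketch of the bootstrapped value-gradient target (all functions are hypothetical stand-ins). Differentiating the Bellman backup V(x) = r(x) + gamma * V(f(x)) gives the target dV/dx = dr/dx + gamma * (df/dx)^T grad V(f(x)), computable with a vector-Jacobian product when the dynamics are differentiable:

import jax
import jax.numpy as jnp

GAMMA = 0.99

def f(x):
    # Hypothetical differentiable dynamics.
    return x + 0.1 * jnp.sin(x)

def r(x):
    # Hypothetical differentiable reward.
    return -jnp.sum(x ** 2)

def value_grad_target(x, v_grad):
    # Target for dV/dx from differentiating the Bellman backup; v_grad
    # is the current value-gradient estimate, evaluated at the next
    # state (this is the bootstrap).
    x_next = f(x)
    _, f_vjp = jax.vjp(f, x)              # gives v -> (df/dx)^T v
    (pullback,) = f_vjp(v_grad(x_next))   # (df/dx)^T grad V(x')
    return jax.grad(r)(x) + GAMMA * pullback

A learned gradient network (or the gradient of a learned value network) can then be regressed toward this target at sampled states, in place of the usual scalar TD target.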