Chaining Local Policies from Offline Samples
Motivation:
1. We have highly efficient local controllers when they are queried in the right direction. How do we leverage them for more global planning?
2. How do we leverage access to resets more efficiently in simulation?
3. Multi-query planning: reuse the offline computation across many start/goal queries.
Algorithm
Setup:
1. Assume some access to a simple local controller
(MPC, MPPI, collision-free motion planning, etc.)
Importantly, we want to quantify the performance of this local controller with some form of a local value function.
We want this to preserve some notion of goal-reaching behavior; some examples include:
(Reachability)
(Distance to goal)
2. Sample many "good" states offline to build a library of local value functions.
3. Online, do graph search to find the shortest path of subgoals:
build a graph where the nodes are the sampled states and
the edges are defined by the local value function between pairs of sampled states (see the sketch below).
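Below is a minimal Python sketch of this setup under some assumptions: it takes a list `states` of sampled "good" states and a hypothetical helper `local_value(s, g)` (the local value function scoring how well the local controller drives `s` toward subgoal `g`); neither interface is defined in these slides.

```python
import heapq
import numpy as np

# Assumed (hypothetical) interface, not defined in these slides:
#   local_value(s, g): cost the local controller incurs driving state s toward
#                      subgoal g; np.inf (or a large value) if g is unreachable.

def build_subgoal_graph(states, local_value, max_cost):
    """Nodes are the sampled 'good' states; an edge (i, j) exists when the
    local value function says j is cheaply reachable from i."""
    n = len(states)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            c = local_value(states[i], states[j])
            if c < max_cost:            # prune edges the local controller cannot realize
                edges[i].append((j, c))
    return edges

def shortest_subgoal_path(edges, start, goal):
    """Plain Dijkstra over the subgoal graph; returns a list of node indices."""
    dist, parent = {start: 0.0}, {}
    pq = [(0.0, start)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == goal:
            break
        if d > dist.get(u, np.inf):
            continue                    # stale queue entry
        for v, w in edges[u]:
            nd = d + w
            if nd < dist.get(v, np.inf):
                dist[v], parent[v] = nd, u
                heapq.heappush(pq, (nd, v))
    if goal not in dist:
        return None                     # goal not reachable through the sampled subgoals
    path, node = [goal], goal
    while node != start:
        node = parent[node]
        path.append(node)
    return path[::-1]
```

Online, the current state and the goal are connected to the graph the same way (via `local_value`), Dijkstra returns a subgoal sequence, and the local controller is run toward each subgoal in turn.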
Questions
1. Why is chaining together local value functions better than learning a global goal-conditioned value function?
2. How do we make this more "continuous" over the sampled states?
3. What are the benefits of using MPC for the lower-level control?
4. Why not learn a terminal value function for MPC?
Toy Problem
- On-policy / Finite-Horizon Planning Struggles
- MPC with Terminal Value Function
- MPC with Graph Search on Subgoals
- Off-Policy RL (DDPG / SAC)
Long-Horizon Sparse-Reward Problem
(Difficult Exploration)
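For reference, the "MPC with Terminal Value Function" baseline above is usually the finite-horizon problem below; the horizon $H$, running cost $c$, dynamics $f$, and learned terminal estimate $\hat{V}$ are my notation, not from the slides:

```latex
\min_{u_0,\dots,u_{H-1}} \;\sum_{t=0}^{H-1} c(x_t, u_t) \;+\; \hat{V}(x_H)
\qquad \text{s.t.} \quad x_{t+1} = f(x_t, u_t), \quad x_0 = x_{\text{current}}
```

The graph-search variant sidesteps learning $\hat{V}$ globally by handing MPC a sequence of nearby subgoals instead.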
Toy Problem
- This configuration gives high value.
- Backward iteration propagates value outward from this configuration.
- Value iteration methods work similarly.
Dynamic Programming Argument
Toy Problem
- The nearby states have a strong reward signal, so the policy improves quickly there.
- Nearby states converge to the optimal value function.
- Farther states then propagate this information using the optimal value functions of nearby states?
How does this work in policy iteration?
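A minimal numpy sketch of this dynamic-programming argument on a hypothetical toy chain (my example, not from the slides): the reward is sparse at one end, and value iteration reaches the states nearest the reward first, then spreads backward one state per sweep.

```python
import numpy as np

# Hypothetical toy MDP: N states on a line, actions {left, right},
# reward 1 only at the rightmost state, discount gamma.
N, gamma = 20, 0.95
R = np.zeros(N)
R[-1] = 1.0

def step(s, a):
    # Deterministic chain dynamics, clipped at the boundaries.
    return int(np.clip(s + a, 0, N - 1))

V = np.zeros(N)
for sweep in range(1, 31):
    # Synchronous Bellman optimality backup over the two actions.
    V = np.array([max(R[s] + gamma * V[step(s, a)] for a in (-1, +1))
                  for s in range(N)])
    if sweep % 5 == 0:
        # Value "arrives" near the sparse reward first, then spreads left.
        print(f"sweep {sweep:2d}:", np.round(V, 2))
```

Running the same backups as policy evaluation plus greedy improvement (policy iteration) shows the analogous behavior: the greedy policy becomes correct first in the states adjacent to the reward and the correctness spreads outward, which is one way to read the question above.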
Questions
- Is DDPG / SAC closer to policy iteration or to value iteration?
- How would one use access to such informative resets in typical RL formulations?
Questions
- Get a low-level controller
- Construct a value function for the low-level controller
- Construct a learned model over the higher level using the learned value functions
- Plan over the learned model
- Resets help reduce complexity
Lessons between RL / Model-Based Control
RL has an exploration problem:
Long-Horizon Sparse-Reward Problem
Project 1. RL with Dynamic Smoothing
Is RL on force-from-a-distance (smoothed) dynamics easier than RL on non-smooth dynamics?
Why?
- RL might suffer from poor exploration when it relies on random noise
- Smoothing has "free exploration"
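"Free exploration" can be illustrated with a tiny numpy sketch (my toy example, not from the slides): a hard, contact-like reward has zero gradient almost everywhere, while its noise-smoothed version, estimated by averaging over injected noise, has a nonzero slope that points toward the contact even before any rollout touches it.

```python
import numpy as np

rng = np.random.default_rng(0)

def hard_reward(x):
    # Non-smooth, contact-like reward: 1 only once "in contact" (x >= 1).
    return (x >= 1.0).astype(float)

def smoothed_reward_and_grad(x, sigma=0.5, n=20000):
    """Monte-Carlo estimate of E_w[r(x + w)] and of its x-gradient
    (the standard zeroth-order estimator for Gaussian noise w)."""
    w = rng.normal(0.0, sigma, size=n)
    r = hard_reward(x + w)
    return r.mean(), np.mean(r * w) / sigma**2

for x in (0.0, 0.5, 0.9):
    v, g = smoothed_reward_and_grad(x)
    print(f"x={x:.1f}  hard gradient = 0  smoothed value = {v:.3f}  smoothed gradient = {g:.3f}")
```

Away from contact the hard reward is flat, so the policy gets no learning signal until random exploration happens to touch; the smoothed reward supplies a useful direction for free.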
Project 2. Differentiable RL
Can we bootstrap value gradients dV/dx instead of value V in policy iteration if the dynamics are differentiable?
Show that this leads to better performance by leveraging gradients
Hand task: 100x / parameter tuning
Complex tasks?
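A minimal numpy sketch of the value-gradient idea on a scalar linear system with a fixed linear policy (all constants and the quadratic parameterization are my illustrative choices, not the project's implementation): instead of a TD backup on V, run the analogous fixed-point backup on g(x) = dV/dx using the dynamics Jacobian, and check that it converges to the analytic gradient.

```python
import numpy as np

# Illustrative scalar setup: dynamics x' = a*x + b*u, fixed policy u = -k*x,
# cost c(x, u) = x^2 + u^2.  Along the policy, the closed-loop Jacobian is
# df/dx = a - b*k and the total cost derivative is dc/dx = 2*(1 + k^2)*x.
a, b, k, gamma = 1.2, 0.5, 1.0, 0.9
A_cl = a - b * k

# Parameterize the value *gradient* as g(x) = 2 * p_hat * x (V quadratic in x).
p_hat = 0.0
for _ in range(200):
    # Gradient Bellman backup: g(x) <- dc/dx + gamma * (df/dx)^T * g(x').
    # With the quadratic parameterization this collapses to a scalar update.
    p_hat = (1.0 + k**2) + gamma * A_cl**2 * p_hat

# The policy's value is V(x) = p*x^2 with p solving the same fixed point.
p_true = (1.0 + k**2) / (1.0 - gamma * A_cl**2)
print("bootstrapped dV/dx at x = 1:", 2.0 * p_hat)
print("analytic     dV/dx at x = 1:", 2.0 * p_true)
```

With nonlinear differentiable dynamics, the same backup would use the autodiff Jacobian of f and a learned gradient network regressed toward dc/dx + gamma * (df/dx)^T g(x'); whether that beats bootstrapping V itself is the question the project asks.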
By Terry Suh