Motivation
1. We have highly efficient local controllers when queried for the right direction. How do we leverage them for more global planning?
2. How do we leverage access to resets more efficiently in simulation?
3. Multi-query: how do we amortize offline computation across many start-goal queries?
Algorithm
Setup:
1. Assume some access to a simple local controller (MPC, MPPI, collision-free motion planning, etc.).
Importantly, we want to quantify the performance of this local controller with some form of a local value function. It should preserve some notion of goal-reaching behavior; examples include reachability (does the local controller reach the goal at all?) and distance to goal (how close does it get?). Both are sketched in code after this list.
2. Sample many "good" states offline to build a library of local value functions.
3. Online, run graph search to find the shortest path of subgoals: build a graph whose nodes are the sampled states and whose edges are weighted by the local value functions (e.g., the local cost-to-go between pairs of sampled states). A sketch of this construction follows below.
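
A minimal sketch of the two candidate local value functions, in Python. The rollout function stands in for the local controller (MPC, MPPI, etc.); it and all other names here are hypothetical placeholders, not from the slides:

import numpy as np

def distance_to_goal_value(x0, goal, rollout, horizon=50):
    # Local value = negative distance to the goal at the end of a
    # local-controller rollout; closer to zero is better.
    traj = rollout(x0, goal, horizon)   # (horizon, dim) array of states
    return -np.linalg.norm(traj[-1] - goal)

def reachability_value(x0, goal, rollout, horizon=50, eps=0.1):
    # Local value = 1 if the rollout ever enters an eps-ball around
    # the goal, else 0 (a binary notion of goal-reaching).
    traj = rollout(x0, goal, horizon)
    return float(np.min(np.linalg.norm(traj - goal, axis=-1)) < eps)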
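
And a sketch of the offline graph construction plus the online subgoal search of step 3, assuming v_local(s, t) returns a negative local cost-to-go so that -v_local is a nonnegative edge cost; networkx and all names are my choices, not from the slides:

import networkx as nx

def build_subgoal_graph(states, v_local, threshold=-10.0):
    # Offline: nodes are the sampled "good" states; an edge i -> j
    # exists when the local controller is judged competent to drive
    # state i to state j (local value above a threshold).
    G = nx.DiGraph()
    G.add_nodes_from(range(len(states)))
    for i, s in enumerate(states):
        for j, t in enumerate(states):
            if i == j:
                continue
            v = v_local(s, t)            # e.g., negative local cost-to-go
            if v > threshold:
                G.add_edge(i, j, weight=-v)  # -v >= 0, so Dijkstra applies
    return G

def plan_subgoals(G, start_idx, goal_idx):
    # Online: shortest path over subgoals; each edge is then tracked
    # by the local controller.
    return nx.shortest_path(G, start_idx, goal_idx, weight="weight")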
Questions
1. Why is chaining together local value functions better than learning a global goal-conditioned value function?
2. How do we make this more "continuous" over the sampled states?
3. What are the benefits of using MPC for the lower-level control?
4. Why not learn a terminal value function for MPC?
Toy Problem
Long-Horizon Sparse-Reward Problem (Difficult Exploration)
Dynamic Programming Argument
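
One way to make the dynamic programming argument concrete (my notation, not from the slides): let $V_{\mathrm{loc}}(x, g)$ be the local controller's cost-to-go from state $x$ to subgoal $g$. Chaining local value functions then amounts to solving a Bellman equation over the subgoal library $\mathcal{G}$:

$$
V(x) = \min_{g \in \mathcal{G}} \left[ V_{\mathrm{loc}}(x, g) + V(g) \right],
\qquad V(x_{\mathrm{goal}}) = 0,
$$

so the online graph search of step 3 is shortest-path dynamic programming on the coarse problem whose states are the sampled subgoals.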
How does this work in policy iteration?
Lessons between RL and Model-Based Control
RL has an exploration problem, most acute in long-horizon, sparse-reward tasks.
Project 1. RL with Dynamic Smoothing
Is RL easier on smoothed, force-from-a-distance dynamics than on the original non-smooth contact dynamics?
Why?
- RL might suffer from poor exploration when it relies only on random action noise
- Smoothing gives "free exploration": smoothed dynamics respond to actions even before contact (see the sketch below)
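
A toy 1-D illustration of the contrast (a hypothetical example; the stiffness k, temperature T, and softplus smoothing are my choices): the hard contact force is zero until penetration, while the smoothed force acts at a distance and recovers the hard model as T goes to 0.

import numpy as np

def hard_contact_force(gap, k=1e3):
    # Non-smooth: force only when penetrating (gap < 0).
    return k * np.maximum(-gap, 0.0)

def smoothed_contact_force(gap, k=1e3, T=1e-2):
    # Softplus smoothing: a small "force from a distance" even at
    # gap > 0; logaddexp keeps it numerically stable. As T -> 0 this
    # converges to the hard contact model.
    return k * T * np.logaddexp(0.0, -gap / T)

Because the smoothed force is nonzero at positive gaps, even a zero-mean random policy receives an informative signal before making contact, which is one way to read the "free exploration" claim.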
Project 2. Differentiable RL
Can we bootstrap value gradients dV/dx instead of value V in policy iteration if the dynamics are differentiable?
Show that this leads to better performance by leveraging gradients (see the sketch after this list).
Hand task: 100x / parameter tuning
Complex tasks?
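
A minimal JAX sketch of the bootstrapped value-gradient target (all functions are hypothetical stand-ins). Differentiating the Bellman backup V(x) = r(x) + gamma * V(f(x)) gives the target dV/dx = dr/dx + gamma * (df/dx)^T grad V(f(x)), computable with a vector-Jacobian product when the dynamics are differentiable:

import jax
import jax.numpy as jnp

GAMMA = 0.99

def f(x):
    # Hypothetical differentiable dynamics.
    return x + 0.1 * jnp.sin(x)

def r(x):
    # Hypothetical differentiable reward.
    return -jnp.sum(x ** 2)

def value_grad_target(x, v_grad):
    # Target for dV/dx from differentiating the Bellman backup; v_grad
    # is the current value-gradient estimate, evaluated at the next
    # state (this is the bootstrap).
    x_next = f(x)
    _, f_vjp = jax.vjp(f, x)              # gives v -> (df/dx)^T v
    (pullback,) = f_vjp(v_grad(x_next))   # (df/dx)^T grad V(x')
    return jax.grad(r)(x) + GAMMA * pullback

A learned gradient network (or the gradient of a learned value network) can then be regressed toward this target at sampled states, in place of the usual scalar TD target.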