Towards Large Behavior Models for Dexterous Manipulation

Russ Tedrake

December 15, 2023

Goal: Foundation Models for Manipulation

"Dexterous Manipulation" Team

(founded in 2016)

For the next challenge:

Good control when we don't have useful models?

Rules out:
- (Multibody) Simulation
- Simulation-based reinforcement learning (RL)
- State estimation / model-based control
My top choices:
- Learn a dynamics model
- Behavior cloning (imitation learning)

Key advance: visuomotor policies

The value of using RGB (at control rates) as a sensor is undeniable.
Policies with (implicit) learned state representations
I don't love imitation learning (decision making $\gg$ mimicry), but it's an awfully clever way to explore the space of policy representations:
- Don't need a model
- Don't need an explicit state representation
  - (Not even to specify the objective!)

We've been exploring, and found something good in...

Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M-150M Parameters
Training Time: 3-6 GPU Days ($150-$300)

(Often) Reactive

Discrete/branching logic

Long horizon

Limited "Generalization"

(when training a single skill)

But there are definitely limits to the single-task models

Why (Denoising) Diffusion Models?

High capacity + great performance
Small number of demonstrations (typically ~50-100)
Multi-modal (non-expert) demonstrations

Learns a distribution (score function) over actions

e.g. to deal with "multi-modal demonstrations"
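A minimal sketch of why learning a distribution helps with multi-modal demonstrations. Everything here is an illustrative assumption rather than the setup from the talk: a 1-D action, two demonstrated modes at exactly +1 and -1, a closed-form denoiser in place of a trained network, and a hand-picked geometric noise schedule. Least-squares behavior cloning averages the two modes, while deterministic denoising steps commit each sample to one mode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demonstrations (assumption): half the demonstrators act with a = +1,
# the other half with a = -1.
demos = rng.choice([-1.0, 1.0], size=1000)

# Least-squares behavior cloning predicts the mean action, which lies between
# the two modes and matches neither demonstrated behavior.
mse_action = demos.mean()

def denoiser(u, sigma):
    """Exact noise prediction E[eps | u] for u = a + sigma * eps, a = +/-1 equally likely.

    For this toy distribution E[a | u] = tanh(u / sigma^2), so no training is needed.
    """
    return (u - np.tanh(u / sigma**2)) / sigma

# Deterministic (DDIM-style) sampling: start from wide noise and step the
# noise level down along an assumed geometric schedule.
sigmas = np.geomspace(10.0, 1e-3, 200)
u = sigmas[0] * rng.standard_normal(1000)
for s_k, s_km1 in zip(sigmas[:-1], sigmas[1:]):
    u = u + (s_km1 - s_k) * denoiser(u, s_k)

print("least-squares action:", mse_action)
print("fraction of samples near +1:", np.mean(u > 0))
print("largest distance to a mode:", np.max(np.abs(np.abs(u) - 1.0)))
```

In this toy case every sample ends next to one of the two demonstrated actions, and the two modes are visited about equally often.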

Why (Denoising) Diffusion Models?

High capacity + great performance
Small number of demonstrations (typically ~50)
Multi-modal (non-expert) demonstrations
Training stability and consistency
- no hyper-parameter tuning
Generates high-dimension continuous outputs
- vs categorical distributions (e.g. RT-1, RT-2)
- CVAE in "action-chunking transformers" (ACT)
Solid mathematical foundations (score functions)
Reduces nicely to the simple cases (e.g. LQG / Youla)
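To make the contrast with categorical action distributions concrete, here is a hedged sketch of uniform action binning. The bin count (256), the action range [-1, 1], and the 7-D action shape are illustrative assumptions, and `discretize` / `undiscretize` are hypothetical helper names, not an API from any of the cited systems. A categorical policy predicts a bin index per dimension, so its output resolution is capped at half a bin width, whereas a continuous-output model has no such floor.

```python
import numpy as np

def discretize(a, low, high, n_bins=256):
    """Map continuous actions in [low, high] to integer bin indices."""
    idx = np.floor((a - low) / (high - low) * n_bins).astype(int)
    return np.clip(idx, 0, n_bins - 1)

def undiscretize(idx, low, high, n_bins=256):
    """Map bin indices back to the continuous action at each bin center."""
    return low + (idx + 0.5) * (high - low) / n_bins

rng = np.random.default_rng(0)
a = rng.uniform(-1.0, 1.0, size=(1000, 7))   # hypothetical 7-D actions

# Round trip through the categorical representation.
a_hat = undiscretize(discretize(a, -1.0, 1.0), -1.0, 1.0)

# The reconstruction error is bounded by half a bin width: (2 / 256) / 2.
max_err = np.abs(a - a_hat).max()
print("max quantization error:", max_err, "bound:", 1.0 / 256)
```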

Denoising LQR ($\mathcal{H}_2$)

Standard LQR:

\begin{gather*} x[n+1] = A x[n] + B u[n] + w[n], \\ w[n] \sim \mathcal{N}(0, \Sigma_w). \end{gather*}

Optimal actor:

u[n] = -Kx[n]

Training loss:

\ell(\theta) = \mathbb{E}_{x, \epsilon, \sigma} || f_\theta(-Kx + \sigma \epsilon, \sigma, x) - \epsilon ||^2

x \sim \text{stationary distribution of optimal policy}

\begin{align*} f_\theta&:&\text{denoiser}\\ \sigma&:&\text{noise level}\\ \epsilon&:&\text{noise} \end{align*}

Optimal denoiser:

f_{\theta^*}(u, \sigma, x) = \frac{1}{\sigma}\left[u + K x\right].

(deterministic) DDIM sampler:

u_{k-1} = u_k + \frac{\sigma_{k-1} - \sigma_k}{\sigma_k}\left[u_k + K x\right],

converges to LQR solution

Straight-forward extension to LQG:
Diffusion Policy learns (truncated) unrolled Kalman filter.
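The denoising-LQR claims on this slide can be checked numerically. The following is a sketch under stated assumptions, not the talk's implementation: the system is a hypothetical double integrator, the disturbance has standard deviation 0.1, and the denoiser is restricted to the form f_theta(u, sigma, x) = (theta_u u + theta_x^T x) / sigma so that minimizing the training loss becomes ordinary least squares. Under those assumptions the fit recovers theta = [1, K], and the deterministic DDIM update, run down to sigma = 0, returns the LQR action -Kx.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical system (assumption): discrete-time double integrator.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)

# LQR gain K from fixed-point iteration of the discrete Riccati equation.
P = Q.copy()
for _ in range(2000):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # shape (1, 2)
    P = Q + A.T @ P @ (A - B @ K)

# Sample x from the closed-loop stationary distribution by simulation,
# discarding a burn-in period. The noise scale 0.1 is an assumption.
x, xs = np.zeros(2), []
for n in range(5000):
    x = (A - B @ K) @ x + 0.1 * rng.standard_normal(2)
    if n >= 500:
        xs.append(x)
X = np.array(xs)
N = len(X)

# Training data for the loss: noised optimal actions u = -Kx + sigma * eps.
sigma = np.exp(rng.uniform(np.log(1e-2), np.log(1.0), size=N))
eps = rng.standard_normal(N)
U = -(X @ K.T)[:, 0] + sigma * eps

# Minimize E || f_theta(u, sigma, x) - eps ||^2 for the assumed linear form.
Phi = np.column_stack([U, X]) / sigma[:, None]
theta = np.linalg.lstsq(Phi, eps, rcond=None)[0]        # expect [1, K]

def denoiser(u, sig, x):
    """Fitted noise prediction f_theta(u, sigma, x)."""
    return (theta[0] * u + theta[1:] @ x) / sig

def ddim_sample(x, sigmas):
    """Deterministic DDIM: u_{k-1} = u_k + (sigma_{k-1} - sigma_k) * f_theta."""
    u = sigmas[0] * rng.standard_normal()
    for s_k, s_km1 in zip(sigmas[:-1], sigmas[1:]):
        u = u + (s_km1 - s_k) * denoiser(u, s_k, x)
    return u

x_test = np.array([1.0, -0.5])
u0 = ddim_sample(x_test, np.linspace(1.0, 0.0, 11))
print("fitted theta:", theta, "LQR gain K:", K[0])
print("DDIM action:", u0, "LQR action:", (-K @ x_test)[0])
```

Each DDIM step scales the residual u_k + Kx by sigma_{k-1} / sigma_k, so the last step (to sigma = 0) removes it entirely; the denoiser is never evaluated at sigma = 0.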

Enabling technologies

Haptic Teleop Interface

Excellent system identification / robot control

Visuotactile sensing

with TRI's Soft Bubble Gripper

Open source:

https://punyo.tech/

Scaling Up

Cumulative Number of Skills Collected Over Time

Earlier this week...

Today's focus: training one skill
Multitask, language-conditioned policies
- multitask training can improve robustness of original skills
  - partly by consuming more data
- few shot generalization to new skills
Big Questions:
- How do we feed the data flywheel? What's the right data?
- What are the scaling laws?
- Benchmarking/evaluation

TRI's role in this push

High-quality datasets
of dynamically rich demonstrations
leveraging our robotics+control expertise + ML expertise
high-fidelity simulation for evaluation / scaling / benchmarking

Discussion

I do think there is something deep happening here...

Manipulation should be easy (from a controls perspective)
probably low dimensional?? (manifold hypothesis)
memorization can go a long way

Model-based control and structured optimization

Graphs of Convex Sets
for trajectory optimization and RL

Online classes (videos + lecture notes + code)

http://manipulation.mit.edu

http://underactuated.mit.edu

Dexterous Manipulation at TRI

https://www.tri.global/careers

drake.mit.edu

Foundation Models for Decision Making @NeurIPS 2023

Roboticist at MIT and TRI

people.csail.mit.edu/russt