Large Behavior Models

(Foundation models for dexterous manipulation)

Russ Tedrake

CSAIL Alliances Meeting

April 4, 2024

Goal: Foundation Models for Manipulation

LLMs \(\Rightarrow\) VLMs \(\Rightarrow\) LBMs

large language models

visually-conditioned language models

large behavior models

\(\sim\) VLA (vision-language-action)

\(\sim\) EFM (embodied foundation model)

Q: Is predicting actions fundamentally different?

Why actions (for dexterous manipulation) could be different:

- Actions are continuous (language tokens are discrete)
- Have to obey physics, deal with stochasticity
- Feedback / stability
...
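To make the continuous-versus-discrete point concrete, here is a minimal sketch of one way continuous actions can be forced into a token vocabulary: uniform binning per action dimension. The function names, the action range of [-1, 1], and the 256-bin vocabulary are illustrative assumptions, not something specified in this talk.

```python
# Sketch only: uniform binning of continuous actions into discrete tokens.
# The range [-1, 1] and num_bins=256 are assumed values for illustration.

def tokenize_action(action, low=-1.0, high=1.0, num_bins=256):
    """Map each continuous action dimension to an integer bin index."""
    tokens = []
    for a in action:
        a = min(max(a, low), high)           # clip to the assumed range
        frac = (a - low) / (high - low)      # normalize to [0, 1]
        tokens.append(min(int(frac * num_bins), num_bins - 1))
    return tokens


def detokenize_action(tokens, low=-1.0, high=1.0, num_bins=256):
    """Map bin indices back to the center of each bin."""
    width = (high - low) / num_bins
    return [low + (t + 0.5) * width for t in tokens]


# Hypothetical 3-dimensional action (e.g. a small end-effector displacement).
action = [0.1234, -0.5678, 0.9]
tokens = tokenize_action(action)
recovered = detokenize_action(tokens)

# Round-tripping loses at most half a bin width per dimension.
max_error = max(abs(a - r) for a, r in zip(action, recovered))
```

The round trip is lossy: each dimension can be off by up to half a bin width, which is one reason continuous action heads are an alternative to tokenization.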

Should we expect similar generalization and scaling laws?

Recent success in (single-task) behavior cloning suggests that these are not blockers, but we don't have internet-scale action data yet.

The Robot Data Diet

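[Figure: spectrum of robot data sources, with "big data, big transfer" at one end and "small data, no transfer" at the other. Labels on the figure: Ego-Exo, robot teleop, (the "transfer learning bet"), Open-X, simulation rollouts.]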

World-class simulator for robotics

drake.mit.edu

NVIDIA selected Drake and MuJoCo

(for potential inclusion in Omniverse)

"Hydroelastic contact" as implemented in Drake

The AlphaGo Playbook

Step 1: Behavior Cloning
- from human expert games
Step 2: Self-play
- Policy network
- Value network
- Monte Carlo tree search (MCTS)
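Step 1 above can be illustrated with a toy, tabular stand-in for behavior cloning: estimate the expert's action distribution in each state by counting demonstrations. This is only a sketch; AlphaGo trained a deep policy network on expert games, and the states, actions, and function name below are hypothetical.

```python
# Sketch only: tabular behavior cloning by counting expert (state, action) pairs.
# A real system would fit a neural network policy instead of a lookup table.
from collections import Counter, defaultdict


def behavior_clone(demonstrations):
    """Return policy[state][action] = empirical probability under the expert."""
    counts = defaultdict(Counter)
    for state, action in demonstrations:
        counts[state][action] += 1
    policy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        policy[state] = {a: c / total for a, c in action_counts.items()}
    return policy


# Hypothetical expert demonstrations as (state, action) pairs.
demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "right")]
policy = behavior_clone(demos)

# Greedy action in state "s0": the one the expert chose most often.
best_s0 = max(policy["s0"], key=policy["s0"].get)
```

The cloned policy has nothing to say about states that never appear in the demonstrations, which is the gap that Step 2 (self-play with search) is meant to close.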

Scaling Monte-Carlo Tree Search

"Graphs of Convex Sets" (GCS)

Online classes (videos + lecture notes + code)

http://manipulation.mit.edu