Diffusion Policies

Towards Foundations Models for Control(?)

Russ Tedrake

Workshop on Control and Machine Learning

October 11, 2023

​"What's still hard for AI" by Kai-Fu Lee:

  • Manual dexterity

  • Social intelligence (empathy/compassion)

"Dexterous Manipulation" Team

(founded in 2016)

For the next challenge:

Good control when we don't have useful models?

For the next challenge:

Good control when we don't have useful models?

  • Rules out:
    • Simulation
    • Reinforcement learning (in practice)
    • State Estimation / Model-based control
  • Two natural choices:
    • Learn a dynamics model
    • Behavior cloning (imitation learning)

Levine*, Finn*, Darrel, Abbeel, JMLR 2016 

Visuomotor policies

perception network

(often pre-trained)

policy network

other robot sensors

learned state representation

actions

x history

I was forced to reflect on my core beliefs...

  • The value of using RGB (at control rates) as a sensor is undeniable.  I must not ignore this going forward.
     
  • I don't love imitation learning (decision making \(\gg\) mimcry), but it's an awfully CLEVR way to explore the space of policy representations
    • Don't need a model
    • Don't need an explicit state representation
      • (Not even to specify the objective!)

We've been exploring, and seem to have found something...

Earlier this week...

Denoising diffusion models (generative AI)

Image source: Ho et al. 2020 

Denoiser can be conditioned on additional inputs, \(u\): \(p_\theta(x_{t-1} | x_t, u) \)

A derministic interpretation (manifold hypothesis)

Denoising approximates the projection onto the data manifold;

approximating the gradient of the distance to the manifold

Representing dynamic output feedback

..., u_{-1}, u_0, u_1, ...
..., y_{-1}, y_0, y_1, ...

input

output

Control Policy
(as a dynamical system)

"Diffusion Policy" is an auto-regressive (ARX) model with forecasting

\begin{aligned} [y_{n+1}, ..., y_{n+P}] = f_\theta(&u_n, ..., u_{n-H} \\ &y_n, ..., y_{n-H} )\end{aligned}

\(H\) is the length of the history,

\(P\) is the length of the prediction

Conditional denoiser produces the forecast, conditional on the history

Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M-150M Parameters
Training Time: 3-6 GPU Days ($150-$300)

Learns a distribution (score function) over actions

e.g. to deal with "multi-modal demonstrations"

Why (Denoising) Diffusion Models?

  • High capacity + great performance
  • Small number of demonstrations (typically ~50)
  • Multi-modal (non-expert) demonstrations
  • Training stability and consistency
    • no hyper-parameter tuning
  • Generates high-dimension continuous outputs
    • vs categorical distributions (e.g. RT-1, RT-2)
    • Action-chunking transformers (ACT)
  • Solid mathematical foundations (score functions)
  • Reduces nicely to the simple cases (e.g. LQG / Youla)

Enabling technologies

Haptic Teleop Interface

Excellent system identification / robot control

Visuotactile sensing

with TRI's Soft Bubble Gripper

Open source:

https://punyo.tech/

Scaling Up

  • I've discussed training one skill
  • Wanted: few shot generalization to new skills
    • multitask, language-conditioned policies
    • connects beautifully to internet-scale data

 

  • Big Questions:
    • How do we feed the data flywheel?
    • What are the scaling laws?

 

  • I don't see any immediate ceiling

Discussion

I do think there is something deep happening here...

  • Manipulation should be easy (from a controls perspective)
  • probably low dimensional?? (manifold hypothesis)
  • memorization can go a long way

If we really understand this, can we do the same via principles from a model?  Or will control go the way of computer vision and language?

Summary

  • Dexterous manipulation is still unsolved, but progress is fast
  • Visuomotor diffusion policies
    • currently via imitation learning from humans
  • We need a deeper understanding (e.g. more theory)

 

  • Much of our code is open-source:

 

pip install drake
sudo apt install drake