# Diffusion Policies

Towards Foundation Models for Control(?)

Russ Tedrake

Workshop on Control and Machine Learning

October 11, 2023

​"What's still hard for AI" by Kai-Fu Lee:

• Manual dexterity

• Social intelligence (empathy/compassion)

"Dexterous Manipulation" Team

(founded in 2016)

For the next challenge:

# Good control when we don't have useful models?

• Rules out:
  • Simulation
  • Reinforcement learning (in practice)
  • State estimation / model-based control
• Two natural choices:
  • Learn a dynamics model
  • Behavior cloning (imitation learning)

Levine*, Finn*, Darrell, Abbeel, JMLR 2016

## Visuomotor policies

[Figure: visuomotor policy architecture. A perception network (often pre-trained) maps camera images to a learned state representation; this, together with other robot sensors and a history of past states, feeds a policy network that outputs actions.]
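A minimal sketch of this architecture, assuming PyTorch; the module names, dimensions, and the MLP head are illustrative placeholders, not the implementation from the paper:

```python
import torch
import torch.nn as nn

class VisuomotorPolicy(nn.Module):
    """Illustrative structure only: a (often pre-trained) perception network
    maps images to a learned state representation, which is concatenated with
    other robot sensors and passed to a policy network that outputs actions."""

    def __init__(self, perception: nn.Module, state_dim: int,
                 sensor_dim: int, action_dim: int):
        super().__init__()
        self.perception = perception          # e.g. a pre-trained CNN encoder
        self.policy = nn.Sequential(          # hypothetical MLP policy head
            nn.Linear(state_dim + sensor_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, image: torch.Tensor, sensors: torch.Tensor):
        z = self.perception(image)            # learned state representation
        return self.policy(torch.cat([z, sensors], dim=-1))  # actions
```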

## I was forced to reflect on my core beliefs...

• The value of using RGB (at control rates) as a sensor is undeniable.  I must not ignore this going forward.

• I don't love imitation learning (decision making $$\gg$$ mimicry), but it's an awfully clever way to explore the space of policy representations
  • Don't need a model
  • Don't need an explicit state representation
  • (Not even to specify the objective!)

We've been exploring, and seem to have found something...

## Denoising diffusion models (generative AI)

Image source: Ho et al. 2020

The denoiser can be conditioned on additional inputs $$u$$: $$p_\theta(x_{t-1} \mid x_t, u)$$
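As a concrete sketch, here is the standard DDPM reverse (sampling) loop of Ho et al. 2020 with the extra conditioning input $$u$$ threaded through. The `denoiser(x, t, u)` signature and the `alphas` / `alpha_bars` noise schedules are assumptions, and the variance is the simple $$\sigma_t^2 = \beta_t$$ choice:

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, u, shape, alphas, alpha_bars, device="cpu"):
    """Minimal conditional DDPM reverse process: start from Gaussian noise
    and iteratively denoise, conditioning the noise-prediction network on u."""
    x = torch.randn(shape, device=device)          # x_T ~ N(0, I)
    T = len(alphas)
    for t in reversed(range(T)):
        # assumed signature: denoiser predicts the noise eps added at step t
        eps = denoiser(x, torch.tensor([t], device=device), u)
        # posterior mean of x_{t-1} given x_t, using the predicted noise
        mean = (x - (1 - alphas[t]) / (1 - alpha_bars[t]).sqrt() * eps) \
               / alphas[t].sqrt()
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + (1 - alphas[t]).sqrt() * noise  # sigma_t^2 = beta_t choice
    return x
```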

## A deterministic interpretation (manifold hypothesis)

Denoising approximates the projection onto the data manifold; equivalently, the denoiser approximates the gradient of the distance to the manifold.
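One standard way to make this precise (not on the slide; it is Tweedie's formula for a density smoothed with Gaussian noise of scale $$\sigma$$):

$$\nabla_x \log p_\sigma(x) \;=\; \frac{\mathbb{E}[x_0 \mid x] - x}{\sigma^2}$$

So when the optimal denoiser $$\mathbb{E}[x_0 \mid x]$$ is well approximated by the projection of $$x$$ onto the data manifold $$\mathcal{M}$$, the score is the negative gradient of the scaled squared distance $$d(x, \mathcal{M})^2 / (2\sigma^2)$$.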

## Representing dynamic output feedback

[Figure: a control policy as a dynamical system, mapping an input stream $$\ldots, u_{-1}, u_0, u_1, \ldots$$ to an output stream $$\ldots, y_{-1}, y_0, y_1, \ldots$$]

"Diffusion Policy" is an auto-regressive (ARX) model with forecasting

$$[y_{n+1}, \ldots, y_{n+P}] = f_\theta(u_n, \ldots, u_{n-H},\; y_n, \ldots, y_{n-H})$$

$$H$$ is the length of the history,

$$P$$ is the length of the prediction

The conditional denoiser produces the forecast, conditioned on the history.
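In practice the forecast is executed receding-horizon style: sample a $$P$$-step plan, execute a prefix of it, then re-condition on the new history and re-plan. A sketch, where `policy.sample`, the `env` interface, and the horizon lengths are all hypothetical placeholders:

```python
from collections import deque

def run_diffusion_policy(policy, env, H=2, P=16, n_execute=8):
    """Receding-horizon execution sketch: condition on the last H observations
    and actions, sample a P-step action forecast from the conditional
    denoiser, execute the first n_execute actions, then re-plan."""
    obs = env.reset()
    obs_hist = deque([obs] * H, maxlen=H)               # y_n, ..., y_{n-H}
    act_hist = deque([env.zero_action()] * H, maxlen=H) # u_n, ..., u_{n-H}
    while not env.done():
        # assumed wrapper around the conditional denoiser; returns P actions
        forecast = policy.sample(list(obs_hist), list(act_hist))
        for a in forecast[:n_execute]:   # execute a prefix, then re-plan
            obs = env.step(a)
            obs_hist.append(obs)
            act_hist.append(a)
```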

• Image backbone: ResNet-18 (pretrained on ImageNet)
• Total: 110M-150M parameters
• Training time: 3-6 GPU-days ($150-$300)
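Loading the backbone named above is a one-liner with torchvision; stripping the classification head to expose the embedding is an assumption about how the feature is extracted:

```python
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights

# Load an ImageNet-pretrained ResNet-18 and drop the classification head,
# leaving a 512-dimensional image embedding to serve as the perception network.
backbone = resnet18(weights=ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()   # output: (batch, 512) feature vector
```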

## Learns a distribution (score function) over actions

e.g. to deal with "multi-modal demonstrations"
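A toy numeric illustration of why this matters (the obstacle scenario is hypothetical): with multi-modal demonstrations, a unimodal regressor averages the modes into an invalid action, while sampling from a learned distribution returns a valid one.

```python
import numpy as np

# Two "demonstration modes": pass an obstacle on the left (-1) or right (+1).
demos = np.array([-1.0, -1.0, +1.0, +1.0])

# An MSE regressor collapses to the mean action 0.0, i.e. straight into the
# obstacle; a generative policy instead samples one of the demonstrated modes.
mse_action = demos.mean()                 # 0.0  (invalid "average" behavior)
sampled_action = np.random.choice(demos)  # -1.0 or +1.0 (a valid mode)
print(mse_action, sampled_action)
```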

## Why (Denoising) Diffusion Models?

• High capacity + great performance
• Small number of demonstrations (typically ~50)
• Multi-modal (non-expert) demonstrations
• Training stability and consistency
  • no hyper-parameter tuning
• Generates high-dimensional continuous outputs
  • vs. categorical distributions (e.g. RT-1, RT-2)
  • vs. action-chunking transformers (ACT)
• Solid mathematical foundations (score functions)
• Reduces nicely to the simple cases (e.g. LQG / Youla)

# Enabling technologies

## Visuotactile sensing

with TRI's Soft Bubble Gripper

Open source:

https://punyo.tech/

# Scaling Up

• I've discussed training one skill
• Wanted: few-shot generalization to new skills
  • connects beautifully to internet-scale data

• Big questions:
  • How do we feed the data flywheel?
  • What are the scaling laws?

• I don't see any immediate ceiling

## Discussion

I do think there is something deep happening here...

• Manipulation should be easy (from a controls perspective)
  • probably low dimensional?? (manifold hypothesis)
  • memorization can go a long way

If we really understand this, can we do the same via principles from a model?  Or will control go the way of computer vision and language?

## Summary

• Dexterous manipulation is still unsolved, but progress is fast
• Visuomotor diffusion policies
  • currently via imitation learning from humans
• We need a deeper understanding (e.g. more theory)

• Much of our code is open-source:

`pip install drake`
`sudo apt install drake`
