DeepMind Robotics (formerly known as Brain)

Learning Discontinuities for Contact-Rich Manipulation

Andy Zeng

Embracing Contacts

ICRA 2023 Workshop

Manipulation

Learning to interact with the physical world (through contact) with machine learning from pixels

TossingBot

PaLM-SayCan

RoboPianist

Discontinuities in contact-rich manipulation

and how we might go about modeling them

Continuous-Time Representations

Perception & State Estimation

Imitation Learning

Occlusions

Actions

Sensor Fusion

Implicit Behavioral Cloning

Real human teleop trajectories

are full of discontinuities

Imitation Learning

BC policy learning as:

\hat{\mathbf{a}} = \underset{\mathbf{a} \in \mathcal{A}}{\arg\min} \ E_{\theta}(\mathbf{o},\mathbf{a})

instead of:

\hat{\mathbf{a}} = F_{\theta}(\mathbf{o})

"Implicit Behavioral Cloning"

Pete Florence et al., CoRL 2021

Learn a probability distribution over actions:

\mathcal{L}_{\text{InfoNCE}} = \sum_{i=1}^N -\log \big( \tilde{p}_{\theta}( \mathbf{y}_i \ | \ \mathbf{x}_i, \ {\color{red}\{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} } ) \big)

\tilde{p}_{\theta}( \mathbf{y}_i \ | \ \mathbf{x}_i, \ {\color{red}\{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} } ) = \frac{e^{-E_{\theta}(\mathbf{x}_i, \mathbf{y}_i)}} {e^{-E_{\theta}(\mathbf{x}_i, \mathbf{y}_i)} + {\color{red} \sum_{j=1}^{N_{\text{neg}}}} e^{-E_{\theta}(\mathbf{x}_i, {\color{red} \tilde{\mathbf{y}}^j_i} )} }

  • Conditioned on observation (raw images)
  • Uniformly sampled negatives
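
The InfoNCE objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the training code from the paper: `toy_energy`, the 1-D observations and actions, and the action bounds of [-1, 1] are all assumptions made for the example, and the observations here are scalars rather than raw images.

```python
import numpy as np

def info_nce_loss(energy_pos, energy_neg):
    """InfoNCE loss summed over the batch.

    energy_pos: shape (N,), energies E(x_i, y_i) of the expert actions.
    energy_neg: shape (N, N_neg), energies E(x_i, y~_i^j) of the negatives.
    """
    # Column 0 holds the positive; a softmax over -E gives p~(y_i | x_i, negatives).
    logits = -np.concatenate([energy_pos[:, None], energy_neg], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_p_pos.sum()

def toy_energy(x, y):
    """Hypothetical stand-in for a learned E_theta: low when y is close to x."""
    return (y - x) ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=8)             # toy observations
y = x.copy()                                   # toy expert actions
y_neg = rng.uniform(-1.0, 1.0, size=(8, 16))   # uniformly sampled negatives
loss = info_nce_loss(toy_energy(x, y), toy_energy(x[:, None], y_neg))
```

When every energy is equal, the model cannot tell the expert action from the negatives and the loss is N log(1 + N_neg); training lowers the energy of expert actions relative to the sampled ones.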

f(x)=\text{sign}(x)

Semi-Algebraic Approximation using Christoffel-Darboux Kernel

Marx et al., Springer 2021

4-3{x_1}{x_2}-4{x_2}^2+{x_1}{x_2}^3+2{x_2}^4
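
A short numerical check, assuming the polynomial above is meant as an energy p(x_1, x_2) whose argmin over x_2 recovers sign(x_1). The derivative with respect to x_2 factors as (3 x_1 + 8 x_2)(x_2^2 - 1), so the candidate minimizers are x_2 = ±1, with p(x_1, 1) = 2 - 2 x_1 and p(x_1, -1) = 2 + 2 x_1. The grid search and the names `energy` and `implicit_sign` are illustrative choices, not taken from the cited work.

```python
import numpy as np

def energy(x1, x2):
    """The smooth polynomial from the slide, read as an energy over (x1, x2)."""
    return 4 - 3 * x1 * x2 - 4 * x2**2 + x1 * x2**3 + 2 * x2**4

# Candidate values of x2 searched by brute force (an assumption for this sketch).
X2_GRID = np.linspace(-2.0, 2.0, 4001)

def implicit_sign(x1):
    """argmin over x2 of a continuous energy; the result jumps at x1 = 0."""
    return X2_GRID[np.argmin(energy(x1, X2_GRID))]

x1_values = np.array([-1.0, -0.5, -0.1, 0.1, 0.5, 1.0])
recovered = np.array([implicit_sign(v) for v in x1_values])
```

The energy is continuous in both arguments, yet the minimizer switches from -1 to +1 as x_1 crosses zero, which is the sense in which an implicit model can represent a discontinuity exactly.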

+ Can represent multi-modal actions

+ Learns discontinuous trajectories more sample-efficiently
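
A toy illustration of the multi-modality point, under stated assumptions: demonstrations go left (-1) or right (+1) with equal frequency, the explicit policy is reduced to its MSE-optimal constant prediction, and the "energy" is a negative log kernel density estimate standing in for a learned E_theta. None of this is the paper's model; it only shows why an argmin can commit to one mode while a regressor averages them.

```python
import numpy as np

# Hypothetical 1-D demonstrations: half steer left (-1), half steer right (+1).
demo_actions = np.array([-1.0, 1.0] * 50)

# Explicit policy trained with MSE converges to the mean of the demonstrations.
explicit_action = demo_actions.mean()

def energy(candidates, demos, bandwidth=0.2):
    """Negative log kernel density of the demos, a stand-in for a learned energy."""
    diffs = (candidates[:, None] - demos[None, :]) / bandwidth
    density = np.exp(-0.5 * diffs**2).mean(axis=1)
    return -np.log(density + 1e-12)

# Implicit policy: pick the candidate action with the lowest energy.
candidates = np.linspace(-2.0, 2.0, 401)
implicit_action = candidates[np.argmin(energy(candidates, demo_actions))]
```

Here the explicit prediction is 0, an action no demonstration ever took, while the implicit argmin lands on one of the two demonstrated modes.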

Fly left or right around the tree?

The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies

Ronen Basri et al., NeurIPS 2019

subtle but decisive maneuvers

Occlusions in Perception & State Estimation

Occlusions appear as

discontinuities in image space

3D data: self-occlusions

Contact points are often occluded

Partial observability

Deformable Object State Estimation with Implicit SDFs

"VIRDO: Visio-Tactile Implicit Representations of Deformable Objects"

"VIRDO++: Real-World, Visuo-Tactile Dynamics and Perception of Deformable Objects"

Youngsun Wi, Pete Florence, Andy Zeng, Nima Fazeli. ICRA & CoRL 2022
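
For readers unfamiliar with signed distance functions (SDFs), here is a minimal sketch of what an implicit SDF represents. The papers above learn the function with a neural network; the analytic sphere, its radius, and the query points below are stand-ins chosen for illustration only.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - center, axis=-1) - radius

center = np.zeros(3)
radius = 0.5
queries = np.array([[0.0, 0.0, 0.0],    # inside
                    [0.5, 0.0, 0.0],    # on the surface
                    [1.0, 0.0, 0.0]])   # outside
distances = sphere_sdf(queries, center, radius)

# The object's surface is the zero level set, so any 3D point can be tested
# for contact, including points that are occluded in the camera image.
on_surface = np.abs(distances) < 1e-6
```

Because the surface is defined by a function over all of space rather than by the visible pixels, the representation remains queryable at occluded contact points.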

Observation data may appear discontinuous, but the underlying process might not be

"Multiscale Sensor Fusion and Continuous Control with Neural CDEs"

Sumeet Singh, Francis McCann Ramirez, Jacob Varley, Andy Zeng, Vikas Sindhwani. ICRA 2022

input: 30 Hz camera images

input: 100 Hz F/T readings...

output: 50 Hz actions?

Time-continuous policies?

observation t=0

observation t=1

action t=0.5?
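
One way to picture a time-continuous policy over multi-rate sensors is a controlled differential equation dz = f(z) dX, driven by an interpolated observation path X(t). This is a rough sketch under stated assumptions, not the architecture from the cited paper: the 1-D sensor features, the random weights `W`, `b`, and `readout`, linear interpolation, and the Euler discretization are all made up for illustration.

```python
import numpy as np

# Hypothetical multi-rate streams over one second (1-D features for brevity).
t_cam = np.arange(30) / 30.0      # 30 Hz camera timestamps
t_ft = np.arange(100) / 100.0     # 100 Hz force/torque timestamps
t_act = np.arange(50) / 50.0      # 50 Hz action timestamps
cam_feat = np.sin(2 * np.pi * t_cam)
ft_read = np.cos(2 * np.pi * t_ft)

def control_path(t):
    """Continuous-time observation path X(t) via linear interpolation.

    np.interp holds the last sample when t lies past a stream's final timestamp.
    """
    return np.stack([np.interp(t, t_cam, cam_feat),
                     np.interp(t, t_ft, ft_read)], axis=-1)

rng = np.random.default_rng(0)
hidden, channels = 4, 2
W = 0.1 * rng.standard_normal((hidden * channels, hidden))  # made-up weights
b = 0.1 * rng.standard_normal(hidden * channels)
readout = rng.standard_normal(hidden)

def vector_field(z):
    """f(z): maps the hidden state to a (hidden, channels) matrix."""
    return np.tanh(W @ z + b).reshape(hidden, channels)

X = control_path(t_act)           # observations resampled at action times
z = np.zeros(hidden)
actions = []
for k in range(len(t_act)):
    if k > 0:
        # Euler step of dz = f(z) dX between consecutive action timestamps.
        z = z + vector_field(z) @ (X[k] - X[k - 1])
    actions.append(readout @ z)
actions = np.array(actions)
```

Because the hidden state is driven by the path X(t) rather than by a fixed-rate sequence, the same model can be queried at any timestamp, such as t=0.5 between two observations.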

[Figure: task success & completion; rate of image frame dropout; T = time between readings]

Manipulation without contact?

Our objective is to effect change in the world

"Learning Pneumatic Non-Prehensile Manipulation with a Mobile Blower"

Jimmy Wu, Xingyuan Sun, Andy Zeng, Shuran Song, Szymon Rusinkiewicz, Thomas Funkhouser. IROS & RA-L 2022

Thank you!

Pete Florence

Youngsun Wi

Johnny Lee

Vikas Sindhwani

Jimmy Wu

Vincent Vanhoucke

Kevin Zakka

Michael Ryoo

Maria Attarian

Brian Ichter

Krzysztof Choromanski

Federico Tombari

Jacky Liang

Sumeet Singh

Wenlong Huang

Fei Xia

Peng Xu

Karol Hausman

and many others!