Robotics at Google

Language as Robot Middleware

Andy Zeng

Samsung Reading Group

Robot Learning (bottom up commonsense)

Self-Supervised Learning

TossingBot

Imitation Learning

Implicit Behavior Cloning

Reinforcement Learning

QT-Opt

The End-to-End Robot Learning Recipe

1. Collect a big dataset

2. Train end-to-end deep networks

Transformers

or ConvNets

Actions

Pixels

- But robot data is expensive

The End-to-End Robot Learning Recipe

50 expert demos

500 expert demos

5000 expert demos

when are we going to see task-level generalization?

Deep Learning is a Box

Interpolation

Extrapolation

Roboticist

Vision

NLP

Deep Learning is a Box

Interpolation

Extrapolation

Internet

Meanwhile in NLP...

Large Language Models

Internet

Books

Recipes

Code

News

Articles

Dialogue

Demo

Quick Primer on Language Models

Tokens (inputs & outputs)

Transformers (models)

Self-Attention

Pieces of words (BPE encoding)

per word: big, bigger, biggest, small, smaller, smallest

per token: big, small, est
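
As a rough illustration of the word-versus-token split above, the sketch below segments words by greedy longest-match against a tiny hand-written vocabulary. This is a simplification, not the BPE merge-learning procedure itself. The vocabulary (including the `er` piece) and the single-character fallback are assumptions made for this example.

```python
def segment(word, vocab):
    """Greedy longest-match segmentation into subword pieces.

    Spans that match nothing in the vocabulary fall back to single characters.
    """
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest candidate starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character and move on.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical toy vocabulary, loosely based on the slide's tokens.
VOCAB = {"big", "small", "er", "est"}

for w in ["big", "smaller", "smallest", "biggest"]:
    print(w, "->", segment(w, VOCAB))
```

Six surface words share a handful of pieces, which is why a token vocabulary can stay much smaller than a word vocabulary.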

x_1

x_2

x_3

Attention Is All You Need, NeurIPS 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

\text{softmax}(\frac{QK^\intercal}{\sqrt{d_k}})V
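
A minimal NumPy sketch of the scaled dot-product formula above, applied as single-head self-attention over three token embeddings (standing in for x_1, x_2, x_3). The projection matrices `W_q`, `W_k`, `W_v`, the dimensions, and the random inputs are placeholders chosen for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one head, no masking."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tokens, tokens)
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # three tokens, 8-dim embeddings (arbitrary sizes)
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)            # one output vector per token
print(weights.sum(axis=1))  # each row of attention weights sums to 1
```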

Bigger is Better

Neural Language Models: Bigger is Better, WeCNLP 2018

Noam Shazeer

Somewhere in the space of interpolation lives robot planning

Example?

Can LLMs give us top down commonsense?

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Visual Language Model

CLIP, ALIGN, LiT, SimVLM, ViLD, MDETR

Human input (task)

Large Language Model for Planning (e.g. SayCan)

Language-conditioned Policies

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

https://say-can.github.io
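
A rough sketch of the SayCan-style selection step, under the assumption that each candidate skill gets two numbers: how useful the language model thinks the skill is for the instruction ("say"), and how likely the skill is to succeed from the current state ("can"). The skill names, the scores, and the `select_skill` helper are all made up for illustration.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill with the highest (LLM score x affordance) product."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical numbers for the instruction "bring me a sponge".
llm_scores = {
    "find a sponge": 0.5,
    "pick up the sponge": 0.4,
    "go to the table": 0.1,
}
# Hypothetical affordances: no sponge is in view yet, so picking is unlikely to work.
affordances = {
    "find a sponge": 0.9,
    "pick up the sponge": 0.1,
    "go to the table": 0.8,
}

best, combined = select_skill(llm_scores, affordances)
print(best)
```

The product lets the affordance term veto steps the language model likes but the robot cannot currently execute.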

PaLM-SayCan & Socratic Models

LLMs on robots! Open research problem, but here's one way to do it...

Live Demo

Language as a "state" representation

semantic?

compact?

compositional?

general?

interpretable?

Perception

Actions

Planning

3D Reconstruction

Language as a "state" representation

3D Reconstruction

Perception

Actions

Planning

semantic? ✓

compact? ✓

compositional? ✓

general? ✓

interpretable? ✓
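
One way to picture language as a "state": perception outputs are written out as text, and that text becomes the input a language model plans over. The sketch below is only an illustration. The detection format and the `describe_scene` and `build_prompt` helpers are hypothetical, not taken from Socratic Models or SayCan.

```python
def describe_scene(detections):
    """Turn a list of detections into a one-sentence scene description."""
    if not detections:
        return "I see nothing."
    parts = [f"a {d['name']} on the {d['where']}" for d in detections]
    return "I see " + ", ".join(parts) + "."


def build_prompt(detections, task):
    """Compose the text state and the human task into a planning prompt."""
    return f"{describe_scene(detections)}\nTask: {task}\nPlan:"


# Hypothetical output of an open-vocabulary detector.
detections = [
    {"name": "red block", "where": "table"},
    {"name": "blue bowl", "where": "shelf"},
]
print(build_prompt(detections, "put the red block in the blue bowl"))
```

The resulting state is compact and readable, and it also shows the limits listed later: exact positions are gone.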

What about language?

Ego4D

Limits of language as a "state" representation

- Loses spatial precision

- Highly multimodal (lots of different ways to say the same thing)

- Not as information-rich as in-domain representations (e.g. images)

- Only for high level? what about control?

Perception

Planning

Control

Socratic Models

Inner Monologue

ALM + LLM + VLM

SayCan

Wenlong Huang et al, 2022

LLM

Imitation? RL?

Engineered?

Intuition and commonsense is not just a high-level thing

Applies to low-level behaviors too

Is the "dark matter" of robotics

Spatial: "move a little bit to the left"
Temporal: "move faster"
Functional: "balance yourself"

Demo

Seems to be stored in the depths of language models... how to extract it?

Language models can write code

Code as a medium to express low-level commonsense

Live Demo

Language models can write code

Code as a medium to express more complex plans

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io

Code as Policies: Language Model Programs for Embodied Control

Live Demo

Language models can write code

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io

Code as Policies: Language Model Programs for Embodied Control

use NumPy, SciPy code...
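
To make this concrete, here is the kind of short NumPy program a language model might write for spatial instructions such as "move a little bit to the left". The coordinate frame (left is negative y), the 5 cm step, and the function names are assumptions for illustration. They are not the Code as Policies API.

```python
import numpy as np


def move_a_little_left(ee_pos, step=0.05):
    """Return a target `step` meters to the left of the current position.

    Assumes a frame where +y points to the robot's right (hypothetical).
    """
    return np.asarray(ee_pos, dtype=float) + np.array([0.0, -step, 0.0])


def point_between(a, b):
    """Return the midpoint of two 3D points, e.g. for 'put it between the bowls'."""
    return (np.asarray(a, dtype=float) + np.asarray(b, dtype=float)) / 2.0


current = [0.30, 0.00, 0.20]
print(move_a_little_left(current))
print(point_between([0.2, -0.1, 0.0], [0.4, 0.3, 0.0]))
```

"A little bit" has to become a number somewhere, and code gives the model a place to commit to one.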

Language models can write code

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io

Code as Policies: Language Model Programs for Embodied Control

PD controllers
impedance controllers
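
For reference, a PD controller is only a few lines, which is part of why it is plausible output for a code-writing model. The sketch below simulates a unit point mass in one dimension. The gains, time step, and plant are arbitrary choices for illustration, not values from the paper.

```python
def pd_control(pos, vel, target, kp=20.0, kd=9.0):
    """Proportional-derivative law: pull toward the target, damp the velocity."""
    return kp * (target - pos) - kd * vel


# Simulate a 1D unit mass (so force equals acceleration) for 10 seconds.
pos, vel, target, dt = 0.0, 0.0, 1.0, 0.01
for _ in range(1000):
    acc = pd_control(pos, vel, target)
    vel += acc * dt  # semi-implicit Euler: update velocity first,
    pos += vel * dt  # then position with the new velocity

print(abs(pos - target))
```

An instruction like "move faster" would then map to a small edit, such as raising `kp`, instead of a new policy.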

What are the foundation models for robotics?

How much data do we need?

Robot Learning

Language Models

Finding other sources of data (sim, YouTube)
Improve data efficiency with prior knowledge

Not a lot of robot data

Lots of Internet data

50 expert demos

500 expert demos

5000 expert demos

Scale alone might not be enough

Robot Learning

Language Models

Not a lot of robot data

Lots of Internet data

adapted from Tomás Lozano-Pérez

Machine learning is a box

... but robotics is a line

Finding other sources of data (sim, YouTube)
Improve data efficiency with prior knowledge

Different embodiments etc....

A possible middle ground

Robot Learning

Language Models

Not a lot of robot data

Lots of Internet data

Embrace Compositionality

adapted from Tomás Lozano-Pérez

Machine learning is a box

... but robotics is a line

1. build boxes

2. compose them autonomously

Towards grounding everything in language

Language

Control

Vision

Tactile

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

"Language" as the glue for intelligent machines

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

We have some reason to believe that

"the structure of language is the structure of generalization"

To understand language is to understand generalization

https://evjang.com/2021/12/17/lang-generalization.html

Sapir–Whorf hypothesis

Towards grounding everything in language

Language

Perception

Planning

Control

Humans

Not just for general robots,
but for human-centered intelligent machines!

Thank you!

Pete Florence

Adrian Wong

Johnny Lee

Vikas Sindhwani

Stefan Welker

Vincent Vanhoucke

Kevin Zakka

Michael Ryoo

Maria Attarian

Brian Ichter

Krzysztof Choromanski

Federico Tombari

Jacky Liang

Aveek Purohit

Wenlong Huang

Fei Xia

Peng Xu

Karol Hausman

and many others!

Code as Policies

Chad Boodoo (onsite!)
Andy Zeng (remote)

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io

Language Model Programs for Embodied Control

Robotics at Google

Manipulation

amazing skill! teppanyaki steak master

https://youtu.be/5qVMIKCn_Cs

Problem difficulty = #DoFs (robot) + #DoFs (environment)

Robotics as a platform to study intelligent machines

SHRDLU, Terry Winograd, MIT 1968

person: "will you please stack up both of the red blocks and either a green cube or a pyramid"

robot: "ok."

person: "what does the box contain?"

robot: "the blue pyramid and blue block."

person: "can the table pick up blocks?"

robot: "no."

person: "why did you do that?"

robot: "because you asked me to."

Simple things are surprisingly complex

"At an analytical level, pushing is a well understood problem... These are usually based on Coulomb's friction law..."

"The reality, however, is bitter... the sensitivity of the task to small changes in contact geometry, along with the variability of friction, hinders accurate predictions."

More than a Million Ways to Be Pushed. A High-Fidelity Experimental Dataset of Planar Pushing

Kuan-Ting Yu, Maria Bauza, Nima Fazeli, Alberto Rodriguez, IROS 2016

Imitation Learning with Behavior Cloning

Supervised Learning

1. Collect dataset with human teleop

2. Train end-to-end deep networks

Data hungry
Policy often gets "stuck"

High-rate (velocity) control
Explicit (e.g. MSE) losses

In Practice

Transformers

or ConvNets

Actions

Pixels
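
A toy version of the two-step recipe above with an explicit MSE loss. A linear policy stands in for the deep network, and synthetic feature vectors stand in for pixels from teleop demos. Everything here (data, sizes, learning rate) is made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. "Collect" a dataset: 500 observation/action pairs from a made-up expert.
obs = rng.normal(size=(500, 4))    # stand-in for image features
expert_W = rng.normal(size=(4, 2)) # hypothetical expert mapping
actions = obs @ expert_W + 0.01 * rng.normal(size=(500, 2))

# 2. Train the policy by gradient descent on squared action error.
W = np.zeros((4, 2))
lr = 0.1
for _ in range(200):
    pred = obs @ W
    # Gradient of the per-sample squared error, averaged over the dataset.
    grad = 2.0 * obs.T @ (pred - actions) / len(obs)
    W -= lr * grad

mse = float(np.mean((obs @ W - actions) ** 2))
print(mse)
```

This works because the toy expert is unimodal. When demonstrations contain several valid actions for the same observation, an MSE-trained policy tends to average them, which is one way a policy gets "stuck".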

Implicit Behavior Cloning

Implicit Behavioral Cloning, CoRL 2021

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, Jonathan Tompson

Implicit Behavior Cloning

\mathcal{L}_{\text{InfoNCE}} = \sum_{i=1}^{N} -\log \Big( \tilde{p}_{\theta}\big( \mathbf{y}_i \mid \mathbf{x}_i, \{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} \big) \Big)

\tilde{p}_{\theta}\big( \mathbf{y}_i \mid \mathbf{x}_i, \{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} \big) = \frac{e^{-E_{\theta}(\mathbf{x}_i, \mathbf{y}_i)}}{e^{-E_{\theta}(\mathbf{x}_i, \mathbf{y}_i)} + \sum_{j=1}^{N_{\text{neg}}} e^{-E_{\theta}(\mathbf{x}_i, \tilde{\mathbf{y}}^j_i)}}
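
A small NumPy sketch of the loss above, assuming the energies have already been computed by some model: one energy per positive (expert) action and N_neg energies per sampled negative. The function name and array layout are choices made for this example.

```python
import numpy as np


def info_nce(energy_pos, energy_neg):
    """InfoNCE loss summed over a batch.

    energy_pos: shape (N,), energies E(x_i, y_i) of the expert actions.
    energy_neg: shape (N, N_neg), energies of the sampled negative actions.
    """
    # Column 0 holds the positive; logits are negative energies.
    logits = -np.concatenate([energy_pos[:, None], energy_neg], axis=1)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_p = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return float(-log_p.sum())


# With all energies equal, the positive is one of 4 equally likely candidates.
uniform = info_nce(np.zeros(2), np.zeros((2, 3)))
# Lowering the positive's energy makes it more likely, so the loss drops.
better = info_nce(np.full(2, -5.0), np.zeros((2, 3)))
print(uniform, better)
```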

Implicit Behavior Cloning & Transporter Nets

\mathcal{L}_{\text{InfoNCE}} = \sum_{i=1}^{N} -\log \Big( \tilde{p}_{\theta}\big( \mathbf{y}_i \mid \mathbf{x}_i, \{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} \big) \Big)

Transporter Networks: Rearranging the Visual World for Robotic Manipulation, CoRL 2020

Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, Johnny Lee

A generalization of loss functions from Transporter Nets (spatial action maps)
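
One way to read this claim is sketched below. If a dense action map assigns a logit to every pixel and is trained with softmax cross-entropy on the expert's pixel, that matches InfoNCE where the negatives are all the other pixels and the energy is the negative logit. Treating this as the exact Transporter Nets training loss is an assumption, and the map size and target pixel are arbitrary.

```python
import numpy as np


def spatial_cross_entropy(logit_map, target_rc):
    """Softmax cross-entropy over every pixel of a dense action map."""
    flat = logit_map.ravel()
    flat = flat - flat.max()  # numerical stability
    log_probs = flat - np.log(np.exp(flat).sum())
    idx = np.ravel_multi_index(target_rc, logit_map.shape)
    return float(-log_probs[idx])


def info_nce_single(e_pos, e_negs):
    """InfoNCE for one sample: one positive energy plus negative energies."""
    logits = -np.concatenate([[e_pos], e_negs])
    logits = logits - logits.max()
    return float(-(logits[0] - np.log(np.exp(logits).sum())))


rng = np.random.default_rng(0)
logit_map = rng.normal(size=(4, 5))  # toy 4x5 action map
target = (2, 3)                      # hypothetical expert pick location

energy = -logit_map                  # treat negative logits as energies
idx = np.ravel_multi_index(target, logit_map.shape)
negatives = np.delete(energy.ravel(), idx)  # every pixel except the target

ce = spatial_cross_entropy(logit_map, target)
nce = info_nce_single(energy[target], negatives)
print(ce, nce)  # the two values agree
```

With sampled rather than exhaustive negatives, the same loss no longer needs the action space to be a pixel grid.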