Robotics at Google

Transformers & Robotics

Language as Robot Middleware

Andy Zeng

AAAI Tutorial on

Everything You Need to Know about Transformers: Architectures, Optimization, Applications, and Interpretation

What to expect from this session:

Overview: Transformer applications in robotics

Background: how we think about robot learning (10 mins)

How we can ride on the success of Transformers in NLP and vision (5 mins)

Run it yourself: demo of robots powered by LLMs and VLMs (10 mins)

Several other interesting ways robots can use LLMs and VLMs (10 mins)

Future: Where we think robot learning is headed (5 mins)

Manipulation

Interact with the physical world to learn bottom-up commonsense w/ machine learning, i.e. "how the world works"

TossingBot, Transporter Nets, Implicit Behavior Cloning

On the quest for shared priors

Interact with the physical world to learn bottom-up commonsense w/ machine learning, i.e. "how the world works"

[Plot: # tasks vs. data, expectation vs. reality]

Complexity in environment, embodiment, contact, etc.

Transformers in Robotics

RT-1: Robotics Transformer for Real-World Control at Scale

robotics-transformer.github.io

Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation

peract.github.io

Transformers (Deep Learning) is a Box

Interpolation

Extrapolation

adapted from Tomás Lozano-Pérez

Roboticist

Vision

NLP

Internet

Meanwhile in NLP...

Large Language Models

Large Language Models?

Internet: books, recipes, code, news, articles, dialogue

Demo

Quick Primer on Language Models

Tokens (inputs & outputs)

Transformers (models)

Attention Is All You Need, NeurIPS 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

Tokens: pieces of words (BPE encoding)

per word: big, bigger, biggest, small, smaller, smallest

per token: big, est, small
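The per-word vs. per-token contrast can be sketched in a few lines of Python. This is not real BPE (which learns its merges from corpus statistics); it is a greedy longest-match split over a small hand-written vocabulary, just to show how a few shared pieces can cover many words. The vocabulary, including the "er" piece, and the single-character fallback are assumptions made for illustration.

```python
def segment(word, vocab):
    """Greedy longest-match split of `word` into pieces from `vocab`."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: fall back to one character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical subword vocabulary ("er" is assumed, not taken from the slide).
VOCAB = {"big", "small", "er", "est"}
WORDS = ["big", "bigger", "biggest", "small", "smaller", "smallest"]

segmented = {w: segment(w, VOCAB) for w in WORDS}
unique_pieces = {p for pieces in segmented.values() for p in pieces}

for word, pieces in segmented.items():
    print(f"{word:9s} -> {pieces}")
print(f"{len(WORDS)} words covered by {len(unique_pieces)} pieces")
```

With this toy vocabulary, "smallest" splits into "small" + "est", while "bigger" needs the fallback character "g" between "big" and "er".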

Self-Attention

Inputs: x_1, x_2, x_3

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
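A minimal NumPy sketch of the scaled dot-product formula, for a single head over three input tokens. The dimensions, the random projection matrices, and the function names are assumptions chosen for illustration; a real Transformer adds multiple heads, masking, and learned weights.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along `axis`."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # (tokens, tokens)
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(3, 8))  # three tokens x_1..x_3, model dim 8 (assumed)
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))  # d_k = 4 (assumed)

out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape)             # one output vector per input token
print(weights.sum(axis=1))   # each row of attention weights sums to 1
```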

Bigger is Better

Neural Language Models: Bigger is Better, WeCNLP 2018

Noam Shazeer

Robot Planning

Visual Commonsense

Robot Programming

Socratic Models

Code as Policies

PaLM-SayCan

Demo

Somewhere in the space of interpolation lives

Socratic Models & PaLM-SayCan

Open research problem, but here's one way to do it

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Visual Language Model (CLIP, ALIGN, LiT, SimVLM, ViLD, MDETR)

Human input (task)

Large Language Models for High-Level Planning

Language-conditioned Policies

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (say-can.github.io)

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents (wenlong.page/language-planner)
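One way to picture "grounding language in robotic affordances" is to weight how useful the language model thinks a skill is by how feasible that skill is in the current scene. The sketch below multiplies the two and picks the best skill. The skill names, all scores, and the function name are made up for illustration; they are not outputs of any real model.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill that maximizes (LLM usefulness score) x (affordance)."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {            # how relevant the LLM rates each skill (assumed)
    "pick up the sponge": 0.6,
    "go to the table": 0.3,
    "pick up the apple": 0.1,
}
affordances = {           # how likely each skill is to succeed right now (assumed)
    "pick up the sponge": 0.1,   # the sponge is not within reach yet
    "go to the table": 0.9,
    "pick up the apple": 0.2,
}

best, combined = select_skill(llm_scores, affordances)
print(best)
```

Here the language model prefers picking up the sponge, but the low affordance pushes the robot to go to the table first.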

Socratic Models: Robot Pick-and-Place Demo

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

For each step, predict pick & place:
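The flow of such a demo can be sketched as a chain where each stage talks to the next in plain text: a vision model describes the scene, a language model turns the task plus description into steps, and each step is reduced to a pick target and a place target. Every function below is a hand-written stand-in (no real VLM, LLM, or policy is called), and the step template is an assumption.

```python
import re


def describe_scene(image):
    """Stand-in for a visual language model: names of visible objects."""
    return ["red block", "blue block", "green bowl"]


def plan_steps(task, objects):
    """Stand-in for an LLM planner prompted with the task and the scene."""
    blocks = [o for o in objects if "block" in o]
    bowl = next(o for o in objects if "bowl" in o)
    return [f"Pick the {b} and place it on the {bowl}." for b in blocks]


# Assumed step template; a real system would hand the text to a
# language-conditioned policy instead of parsing it.
STEP = re.compile(r"Pick the (.+) and place it on the (.+)\.")


def to_pick_place(step):
    """Extract (pick target, place target) from one text step."""
    match = STEP.fullmatch(step)
    if match is None:
        raise ValueError(f"unrecognized step: {step!r}")
    return match.group(1), match.group(2)


objects = describe_scene(image=None)
steps = plan_steps("put all the blocks in the green bowl", objects)
actions = [to_pick_place(s) for s in steps]

for pick, place in actions:
    print(f"pick: {pick:10s} place: {place}")
```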

Describing the visual world with language

(and using for feedback too!)

Limits of language as an information bottleneck?

- Loses spatial precision

- Highly multimodal (in the distributional sense)

- Not as information-rich as e.g. images

- Only for high level? What about control?

Perception

Planning

Control

Socratic Models

Inner Monologue

SayCan

Wenlong Huang et al., 2022

Imitation? RL?

Engineered?

Behavioral commonsense is the "dark matter" of robotics:

Intuition and commonsense are not just a high-level thing

They apply to low-level behaviors too

spatial: "move a little bit to the left"
temporal: "move faster"
functional: "balance yourself"

Seems to be stored in the depths of language models... how to extract it?

Language models can write code

Code as a medium to express low-level commonsense

Live Demo
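As a rough illustration of low-level commonsense expressed in code, phrases like "move a little bit to the left" or "move faster" can be read as small numeric choices. The constants below (5 cm, a 1.5x speed-up, +y as the robot's left) are assumptions picked for the example, not values from the talk; the point is only that code makes such choices explicit and editable.

```python
import numpy as np

SMALL_STEP_M = 0.05                 # "a little bit": assumed to mean 5 cm
SPEEDUP = 1.5                       # "faster": assumed to mean 50% faster
LEFT = np.array([0.0, 1.0, 0.0])    # assumed convention: +y is the robot's left


def move_a_little_left(position):
    """Spatial: shift a 3D position by a small step to the left."""
    return np.asarray(position, dtype=float) + SMALL_STEP_M * LEFT


def move_faster(speed, max_speed=1.0):
    """Temporal: scale the speed up, without exceeding a safety cap."""
    return min(speed * SPEEDUP, max_speed)


new_position = move_a_little_left([0.30, 0.00, 0.20])
new_speed = move_faster(0.4)
print(new_position, new_speed)
```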

Language models can write code

Code as a medium to express more complex plans

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io

Code as Policies: Language Model Programs for Embodied Control

Live Demo

use NumPy, SciPy code...
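A sketch of the kind of NumPy code a language model might write for an instruction such as "pick the block closest to the bowl". The object names and coordinates are invented; in a real system they would come from a perception module.

```python
import numpy as np

# Hypothetical perception output: 2D positions in meters.
block_names = ["red block", "blue block", "yellow block"]
block_xy = np.array([
    [0.10, 0.40],
    [0.35, 0.15],
    [0.60, 0.55],
])
bowl_xy = np.array([0.30, 0.20])

# Euclidean distance from every block to the bowl, then take the smallest.
dists = np.linalg.norm(block_xy - bowl_xy, axis=1)
closest = block_names[int(np.argmin(dists))]
print(closest)
```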

PD controllers
impedance controllers
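For reference, a PD controller is short enough to fit in a few lines, which is part of why it is a plausible target for generated code. Below is a minimal sketch on a 1D point mass; the gains, time step, mass, and integration scheme are assumptions for illustration, not the controllers shown in the talk.

```python
def pd_step(x, v, target, kp=20.0, kd=8.0, dt=0.01, mass=1.0):
    """One control step: PD force on a 1D point mass, semi-implicit Euler."""
    force = kp * (target - x) - kd * v   # proportional + derivative terms
    a = force / mass
    v = v + a * dt
    x = x + v * dt
    return x, v


x, v = 0.0, 0.0
for _ in range(1000):                    # simulate 10 seconds
    x, v = pd_step(x, v, target=1.0)

print(round(x, 3), round(v, 3))
```

With these gains the mass settles at the target without sustained oscillation; raising `kp` or lowering `kd` makes the response stiffer and more oscillatory.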

What are the foundation models for robotics?

Closing Thoughts

On the road to robot commonsense

Robot Learning: not a lot of robot data (50, 500, 5000 expert demos)

Language Models: lots of Internet data

256K token vocab w/ word embedding dim = 18,432

PaLM-sized robot dataset = 100 robots for 24 yrs

collecting (mostly) diverse data
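A back-of-the-envelope check of the scale behind these numbers. The slide does not say how the "100 robots for 24 yrs" figure was derived, so everything below is a guess at the arithmetic: it assumes 256K means 256,000, that the robots run around the clock, and that the target is the roughly 780 billion training tokens reported for PaLM.

```python
# Size of the token embedding table alone.
VOCAB_SIZE = 256_000          # assumed reading of "256K"
EMBED_DIM = 18_432
embedding_params = VOCAB_SIZE * EMBED_DIM

# Total robot time in the "100 robots for 24 yrs" scenario.
ROBOTS = 100
YEARS = 24
SECONDS_PER_YEAR = 365 * 24 * 3600    # assumes nonstop operation
robot_seconds = ROBOTS * YEARS * SECONDS_PER_YEAR

# Data rate each robot would need to match a PaLM-sized corpus.
PALM_TRAINING_TOKENS = 780e9          # assumption: reported PaLM token count
tokens_per_robot_second = PALM_TRAINING_TOKENS / robot_seconds

print(f"embedding parameters: {embedding_params:,}")
print(f"robot-seconds:        {robot_seconds:,}")
print(f"tokens per robot-second needed: {tokens_per_robot_second:.1f}")
```

Under these assumptions the embedding table alone holds about 4.7 billion parameters, and each robot would need to produce on the order of ten tokens' worth of data every second for 24 years.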

Finding other sources of data (sim, YouTube)
Improve data efficiency with prior knowledge

Embrace language to help close the gap!

Towards grounding everything in language

Language

Control

Vision

Tactile

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Lots of data

Less data

Revisiting modularity is a step towards the endgame

End-to-end is the endgame, but we need useful robots everywhere first

[Timeline figure: 1960s, 2007 (ROS), 2015, Today, Tomorrow. Labels: Modular Systems, End-to-End, out of necessity, thx deep learning, advent of LLMs & Transformers, compositional generality, 10^6+ robots]

"Language" as the glue for intelligent machines

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

We have some reason to believe that "the structure of language is the structure of generalization"

To understand language is to understand generalization

https://evjang.com/2021/12/17/lang-generalization.html

Sapir–Whorf hypothesis

Towards grounding everything in language

Language

Perception

Planning

Control

Humans

Not just for general robots,
but for human-centered intelligent machines!

Thank you!

Pete Florence

Adrian Wong

Johnny Lee

Vikas Sindhwani

Stefan Welker

Vincent Vanhoucke

Kevin Zakka

Michael Ryoo

Maria Attarian

Brian Ichter

Krzysztof Choromanski

Federico Tombari

Jacky Liang

Aveek Purohit

Wenlong Huang

Fei Xia

Peng Xu

Karol Hausman

and many others!