Robotics at Google
Transformers & Robotics
Language as Robot Middleware
Andy Zeng
AAAI Tutorial on
Everything You Need to Know about Transformers: Architectures, Optimization, Applications, and Interpretation
What to expect from this session:
- Overview: Transformer applications in robotics
- Background: how we think about robot learning (10 mins)
- How we can ride on the success of Transformers in NLP and vision (5 mins)
- Run it yourself: demo of robots powered by LLMs and VLMs (10 mins)
- Several other interesting ways robots can use LLMs and VLMs (10 mins)
- Future: Where we think robot learning is headed (5 mins)
Robotics at Google
Manipulation
TossingBot
Interact with the physical world to learn bottom-up commonsense
Transporter Nets
Implicit Behavior Cloning
w/ machine learning
i.e. "how the world works"
On the quest for shared priors
[Figure: "Expectation" vs. "Reality"; axes labeled "# Tasks" and "Data"]
Complexity in environment, embodiment, contact, etc.
Transformers in Robotics
RT-1: Robotics Transformer for Real-World Control at Scale
robotics-transformer.github.io
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation
peract.github.io
Transformers (Deep Learning) is a Box
Interpolation
Extrapolation
adapted from Tomás Lozano-Pérez
Roboticist
Vision
NLP
Internet
Meanwhile in NLP...
Large Language Models
Large Language Models?
Internet: Books, Recipes, Code, News, Articles, Dialogue
Demo
Quick Primer on Language Models
Tokens (inputs & outputs)
Transformers (models)
Pieces of words (BPE encoding)
per word: big, bigger, biggest, small, smaller, smallest
per token: big, er, est, small
Attention Is All You Need, NeurIPS 2017
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin
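The word-versus-token split above can be sketched in a few lines. This is a toy greedy longest-match splitter over a hand-picked vocabulary, not the real BPE procedure (which learns its merges from corpus statistics); the function name and the vocabulary are assumptions made for illustration.

```python
def tokenize(word, vocab):
    """Greedy longest-match split of `word` into pieces found in `vocab`.

    Characters not covered by any vocabulary piece are emitted on their own.
    """
    tokens = []
    i = 0
    while i < len(word):
        # Try the longest candidate piece starting at position i first.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            tokens.append(word[i])  # no piece matched: fall back to one character
            i += 1
    return tokens


# Hypothetical four-piece vocabulary mirroring the slide.
TOY_VOCAB = {"big", "small", "er", "est"}

smaller_tokens = tokenize("smaller", TOY_VOCAB)
smallest_tokens = tokenize("smallest", TOY_VOCAB)
bigger_tokens = tokenize("bigger", TOY_VOCAB)
```

Six words are covered by four pieces. Note that "bigger" also yields a stray "g" under this toy scheme, a detail the slide simplifies away.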
Quick Primer on Language Models
Tokens (inputs & outputs)
Transformers (models)
Self-Attention
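As a rough illustration of self-attention, here is a single-head scaled dot-product sketch in NumPy. The shapes, random weights and function name are assumptions for the example; it omits multi-head projection, masking and positional encodings.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a token sequence.

    x: (num_tokens, dim) token embeddings; w_q, w_k, w_v: (dim, dim) projections.
    Returns the attended outputs and the attention weights.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])               # pairwise token similarity
    scores = scores - scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ v, weights


rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, embedding dim 8 (arbitrary sizes)
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the tokens that a given token attends to.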
Bigger is Better
Neural Language Models: Bigger is Better, WeCNLP 2018
Noam Shazeer
Robot Planning: PaLM-SayCan
Visual Commonsense: Socratic Models
Robot Programming: Code as Policies
Demo
Somewhere in the space of interpolation
Lives
Socratic Models & PaLM-SayCan
Open research problem, but here's one way to do it
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
Visual Language Model: CLIP, ALIGN, LiT, SimVLM, ViLD, MDETR
Human input (task)
Large Language Models for High-Level Planning
Language-conditioned Policies
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
say-can.github.io
Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
wenlong.page/language-planner
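One way to picture grounding a language model's plan in robot affordances: weight how useful the language model thinks each skill is by how likely the robot is to succeed at it from the current state, then pick the best product. The sketch below assumes both scores are already available as numbers; the skill names and values are invented for illustration.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing (language model score) x (affordance score).

    llm_scores: skill -> how relevant the language model rates the skill for the task.
    affordances: skill -> estimated chance the skill succeeds in the current state.
    """
    combined = {
        skill: score * affordances.get(skill, 0.0)
        for skill, score in llm_scores.items()
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "clean up the spill".
llm_scores = {
    "pick up the sponge": 0.5,
    "go to the table": 0.3,
    "pick up the apple": 0.2,
}
affordances = {
    "pick up the sponge": 0.1,  # sponge is out of reach right now
    "go to the table": 0.9,
    "pick up the apple": 0.8,
}
best_skill, combined = select_skill(llm_scores, affordances)
```

Here the language model alone would pick up the sponge, but the combined score sends the robot to the table first because the sponge is not yet reachable.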
Socratic Models: Robot Pick-and-Place Demo
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
For each step, predict pick & place:
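A minimal sketch of turning language plan steps into pick and place targets. It assumes each step follows a fixed sentence template and that object locations come from a detector as a name-to-position table; the template, names and coordinates are hypothetical, not taken from the demo.

```python
import re

# Assumed step template: "Pick the <object> and place it on/in the <target>."
STEP_PATTERN = re.compile(
    r"pick (?:up )?the (.+?) and place it (?:on|in) the (.+?)\.?$",
    re.IGNORECASE,
)


def parse_step(step):
    """Extract (pick_object, place_target) names from one plan step."""
    match = STEP_PATTERN.match(step.strip())
    if match is None:
        raise ValueError(f"step does not follow the expected template: {step!r}")
    return match.group(1), match.group(2)


def ground_plan(steps, detections):
    """Map each step to a (pick_xy, place_xy) pair using detected positions."""
    actions = []
    for step in steps:
        pick_name, place_name = parse_step(step)
        actions.append((detections[pick_name], detections[place_name]))
    return actions


# Hypothetical detector output: object name -> (x, y) in the workspace.
detections = {"blue block": (0.2, 0.1), "red bowl": (0.5, -0.3)}
plan = ["Pick the blue block and place it in the red bowl."]
actions = ground_plan(plan, detections)
```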
Describing the visual world with language
(and using it for feedback too!)
Limits of language as information bottleneck?
- Loses spatial precision
- Highly multimodal (in the distributional sense)
- Not as information-rich as e.g. images
- Only for high-level? What about control?
Perception
Planning
Control
Socratic Models
Inner Monologue
SayCan
Wenlong Huang et al., 2022
Imitation? RL?
Engineered?
Intuition and commonsense is not just a high-level thing
Applies to low-level behaviors too
- spatial: "move a little bit to the left"
- temporal: "move faster"
- functional: "balance yourself"
Behavioral commonsense is the "dark matter" of robotics:
Seems to be stored in the depths of language models... how to extract it?
Language models can write code
Code as a medium to express low-level commonsense
Live Demo
Language models can write code
Code as a medium to express more complex plans
Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng
code-as-policies.github.io
Code as Policies: Language Model Programs for Embodied Control
Live Demo
use NumPy, SciPy code...
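To make this concrete, here is the flavor of short NumPy program a language model might write for spatial instructions such as "a bit to the left of the blue bowl". The perception stub, the object table, the 10 cm offset and the convention that left means negative x are all assumptions for this sketch, not part of the Code as Policies API.

```python
import numpy as np

# Hypothetical perception output: object name -> (x, y) position in metres.
OBJECT_POSITIONS = {
    "red block": np.array([0.30, 0.10]),
    "blue bowl": np.array([0.50, -0.20]),
}


def get_obj_pos(name):
    """Stand-in for a perception API call (hypothetical)."""
    return OBJECT_POSITIONS[name].copy()


def left_of(name, offset=0.10):
    """Point `offset` metres to the left of an object (assumes left = -x)."""
    return get_obj_pos(name) + np.array([-offset, 0.0])


def closest_object(point):
    """Name of the object nearest to `point`, by Euclidean distance."""
    names = list(OBJECT_POSITIONS)
    dists = [np.linalg.norm(get_obj_pos(n) - point) for n in names]
    return names[int(np.argmin(dists))]


# "put it a bit to the left of the blue bowl"
target = left_of("blue bowl")
```

The commonsense lives in small choices like the offset size and the distance metric, which the language model fills in from context.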
- PD controllers
- impedance controllers
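Controllers like these are short enough to be written out as code directly. A minimal sketch of a PD controller, assuming a 1-D unit point mass, hand-picked gains and simple Euler integration; none of these values come from the slides.

```python
def pd_control(target, position, velocity, kp=20.0, kd=9.0):
    """PD law: push toward the target and damp the current velocity."""
    return kp * (target - position) - kd * velocity


def simulate(target=1.0, steps=1000, dt=0.01, mass=1.0):
    """Integrate a point mass under PD control with semi-implicit Euler."""
    position, velocity = 0.0, 0.0
    for _ in range(steps):
        force = pd_control(target, position, velocity)
        velocity += (force / mass) * dt  # update velocity from the applied force
        position += velocity * dt        # then position from the new velocity
    return position


final_position = simulate()
```

With these gains the system is slightly overdamped, so the mass settles at the target without overshoot within the simulated ten seconds.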
What are the foundation models for robotics?
Closing Thoughts
On the road to robot commonsense
Robot Learning
Language Models
Not a lot of robot data
Lots of Internet data
500 expert demos
5000 expert demos
50 expert demos
256K token vocab w/ word embedding dim = 18,432
PaLM-sized robot dataset = 100 robots for 24 yrs collecting (mostly) diverse data
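A back-of-the-envelope reading of that figure. The slides do not state how it was derived; this sketch assumes it equates PaLM's reported 780B training tokens with robot data samples and that the robots run around the clock, then solves for the implied collection rate.

```python
# Assumption (not on the slide): PaLM's reported training set size in tokens.
PALM_TRAINING_TOKENS = 780e9

ROBOTS = 100
YEARS = 24
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumes continuous, round-the-clock operation

robot_seconds = ROBOTS * YEARS * SECONDS_PER_YEAR
tokens_per_robot_second = PALM_TRAINING_TOKENS / robot_seconds

print(f"total robot-seconds: {robot_seconds:.3e}")
print(f"implied samples per robot per second: {tokens_per_robot_second:.1f}")
```

Under these assumptions each robot would need to contribute roughly ten samples per second, nonstop, for the full 24 years.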
- Finding other sources of data (sim, YouTube)
- Improve data efficiency with prior knowledge
Embrace language to help close the gap!
Towards grounding everything in language
Language
Control
Vision
Tactile
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
Lots of data
Less data
Less data
Revisiting modularity is a step towards the endgame
End-to-end is the endgame, but we need useful robots everywhere first
[Timeline figure: 1960s, 2007 (ROS), 2015, Today, Tomorrow; eras labeled "Modular Systems" and "End-to-End"; annotations: "out of necessity", "thx deep learning", "advent of LLMs & Transformers", "compositional generality", "10^6+ robots"]
"Language" as the glue for intelligent machines
Language
Perception
Planning
Control
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
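The "glue" idea can be sketched as modules that only exchange text: perception emits a description, planning reads it and emits steps, control consumes one step at a time. All three modules below are trivial stand-ins written for this example; a real system would put a visual language model, a large language model and a language-conditioned policy behind the same string interfaces.

```python
def run_pipeline(task, observation, perceive, plan, control):
    """Chain perception, planning and control using plain text as the interface."""
    scene = perceive(observation)              # observation -> text description
    steps = plan(task, scene)                  # task + description -> text steps
    return [control(step) for step in steps]   # each text step -> execution report


# Hypothetical stand-ins for the three modules.
def fake_perceive(observation):
    return "I see: " + ", ".join(observation)


def fake_plan(task, scene):
    objects = scene[len("I see: "):].split(", ")
    return [f"pick the {obj}" for obj in objects if obj in task]


def fake_control(step):
    return f"done: {step}"


reports = run_pipeline(
    "put away the red block",
    ["red block", "blue bowl"],
    fake_perceive,
    fake_plan,
    fake_control,
)
```

Because every interface is a string, any one module can be swapped out without retraining the others, which is the modularity the slides argue for.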
We have some reason to believe that
"the structure of language is the structure of generalization"
To understand language is to understand generalization
https://evjang.com/2021/12/17/lang-generalization.html
Sapir–Whorf hypothesis
Towards grounding everything in language
Language
Perception
Planning
Control
Humans
Not just for general robots,
but for human-centered intelligent machines!
Thank you!
Pete Florence
Adrian Wong
Johnny Lee
Vikas Sindhwani
Stefan Welker
Vincent Vanhoucke
Kevin Zakka
Michael Ryoo
Maria Attarian
Brian Ichter
Krzysztof Choromanski
Federico Tombari
Jacky Liang
Aveek Purohit
Wenlong Huang
Fei Xia
Peng Xu
Karol Hausman
and many others!
2023-AAAI-Tutorial
By Andy Zeng