Language as Robot Middleware
Andy Zeng
CCI Seminar
Robot Manipulation
Amazon Picking Challenge
arc.cs.princeton.edu
Team MIT-Princeton
excel at simple things, adapt to hard things
How to endow robots with "intuition" and "commonsense"?
Robotics as a platform to build intelligent machines
SHRDLU, Terry Winograd, MIT 1968
person: "will you please stack up both of the red blocks and either a green cube or a pyramid"
robot: "ok."
person: "what does the box contain?"
robot: "the blue pyramid and the blue block."
person: "can the table pick up blocks?"
robot: "no."
person: "why did you do that?"
robot: "because you asked me to."
Robot butlers?
Representation is a key part of the puzzle
semantic?
compact?
compositional?
general?
interpretable?
Perception
Actions
what is the right representation?
On the hunt for the "best" state representation
Haochen Shi and Huazhe Xu et al., RSS 2022
Learned Visual Representations
NeRF Representations
3D Reconstructions
how to represent:
Dynamics Representations
Danny Driess and Ingmar Schubert et al., arxiv 2022
Ben Mildenhall, Pratul Srinivasan, Matthew Tancik et al., ECCV 2020
Self-supervised Representations
Richard Newcombe et al., ISMAR 2011
Misha Laskin and Aravind Srinivas et al., ICML 2020
Continuous-Time Representations
Sumeet Singh et al., IROS 2022
Pretrained Representations
Lin Yen-Chen et al., ICRA 2020
Cross-embodied Representations
Kevin Zakka et al., CoRL 2021
semantic?
compact?
compositional?
general?
interpretable?
Object-centric Representations
Andy Zeng, Peter Yu, et al., ICRA 2017
On the hunt for the "best" state representation
how to represent:
semantic? ✓
compact? ✓
compositional? ✓
general? ✓
interpretable? ✓
what about language?
advent of large language models
maybe this was the multi-task representation we've been looking for all along?
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin et al., "PaLM", 2022
Mohit Shridhar et al., "CLIPort", CoRL 2021
Does multi-task learning result in positive transfer of representations?
Past couple of years of research suggest: it's complicated
In computer vision...
Amir Zamir, Alexander Sax, William Shen, et al., "Taskonomy", CVPR 2018
In robot learning...
Sam Toyer, et al., "MAGICAL", NeurIPS 2020
Scott Reed, Konrad Zolna, Emilio Parisotto, et al., "A Generalist Agent", 2022
Recent work in multi-task learning...
Xiang Li et al., "Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?", 2022
CLIPort
Multi-task learning + grounding in language seems more likely to lead to positive transfer
Mohit Shridhar, Lucas Manuelli, Dieter Fox, "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL 2021
How do we use "language" as a state representation?
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
Open research problem! but here's one way to do it...
Visual Language Model
CLIP, ALIGN, LiT, SimVLM, ViLD, MDETR
Human input (task)
Large Language Model for Planning (e.g. SayCan)
Language-conditioned Policies
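The composition above can be sketched as a chain of modules that only exchange plain text. This is a minimal sketch under stated assumptions, not the Socratic Models implementation: `describe_scene`, `build_prompt`, `toy_llm_plan`, and `execute_step` are hypothetical stand-ins for a visual language model, a prompt template, a large language model planner, and language-conditioned policies. They return canned outputs so the example runs without any model, and the "bin" target is hard-coded for illustration.

```python
from typing import List


def describe_scene(detected_objects: List[str]) -> str:
    """Hypothetical VLM stand-in: turn detections into a sentence."""
    return "I see " + ", ".join(detected_objects) + "."


def build_prompt(task: str, scene_text: str) -> str:
    """Language is the only interface between perception and planning."""
    return f"Scene: {scene_text}\nTask: {task}\nPlan:"


def toy_llm_plan(prompt: str, known_objects: List[str]) -> List[str]:
    """Hypothetical LLM stand-in: emit steps for objects named in the task.

    A real system would send `prompt` to a language model instead.
    """
    task_line = next(l for l in prompt.splitlines() if l.startswith("Task:"))
    steps = []
    for obj in known_objects:
        if obj in task_line:
            steps.append(f"pick up the {obj}")
            steps.append(f"put the {obj} in the bin")  # toy, hard-coded target
    steps.append("done")
    return steps


def execute_step(step: str) -> str:
    """Hypothetical language-conditioned policy stand-in: just logs the step."""
    return f"executed: {step}"


if __name__ == "__main__":
    objects = ["red block", "coke can"]
    scene = describe_scene(objects)
    prompt = build_prompt("tidy up the coke can", scene)
    for step in toy_llm_plan(prompt, objects):
        print(execute_step(step))
```

Every hand-off here is a string, so any module could be swapped for a different model, or for a person, without changing the others.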
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances
https://say-can.github.io/
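One way to read the title: the language model says which skill would be useful ("say"), and a learned affordance estimate says which skill is likely to succeed from the current state ("can"). The sketch below combines the two by multiplying them and picking the best skill. It is a simplified illustration of that scoring idea, not the SayCan code; the function name `select_skill` and all numbers are made up for the example.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float],
                 affordances: Dict[str, float]) -> str:
    """Pick the skill with the highest (usefulness x feasibility) score.

    llm_scores:  how relevant the language model thinks each skill is.
    affordances: how likely each skill is to succeed right now.
    Skills missing from `affordances` are treated as infeasible (0.0).
    """
    combined = {
        skill: score * affordances.get(skill, 0.0)
        for skill, score in llm_scores.items()
    }
    return max(combined, key=combined.get)


if __name__ == "__main__":
    # Illustrative numbers only: the robot is far from the sponge,
    # so "pick up the sponge" is useful but not yet feasible.
    llm_scores = {
        "pick up the sponge": 0.6,
        "go to the table": 0.3,
        "pick up the apple": 0.1,
    }
    affordances = {
        "pick up the sponge": 0.1,
        "go to the table": 0.9,
        "pick up the apple": 0.2,
    }
    print(select_skill(llm_scores, affordances))
```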
Live demo
Describing the visual world with language
(and using for feedback too!)
Some limits of "language" as intermediate representation?
- loses spatial precision
- highly multimodal (lots of different ways to say the same thing)
- not as information-rich as in-domain representations (e.g. images)
Perception wishlist
item #1: hierarchical high-res image description? w/ spatial info?
Office space
Conference room
Desk, Chairs
Coke bottle
Bottle label
Nutrition facts
Nutrition values
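A hierarchical description with spatial information could be stored as a tree in which every node carries a label and, optionally, a bounding box. The sketch below is one possible data structure, not an existing model's output format: the `Region` class, the `describe` helper, and the pixel coordinates are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels


@dataclass
class Region:
    """One node of the description: a label, an optional box, sub-regions."""
    label: str
    box: Optional[Box] = None
    children: List["Region"] = field(default_factory=list)


def describe(region: Region, depth: int = 0) -> List[str]:
    """Flatten the tree into indented text lines, coarse to fine."""
    where = f" at {region.box}" if region.box else ""
    lines = ["  " * depth + region.label + where]
    for child in region.children:
        lines.extend(describe(child, depth + 1))
    return lines


if __name__ == "__main__":
    # Made-up coordinates, following the example hierarchy on the slide.
    facts = Region("nutrition facts", (1210, 655, 1240, 700))
    label = Region("bottle label", (1200, 640, 1260, 720), [facts])
    bottle = Region("coke bottle", (1190, 600, 1270, 800), [label])
    desk = Region("desk", (900, 700, 1800, 1100), [bottle])
    room = Region("conference room", (0, 0, 4000, 3000), [desk])
    scene = Region("office space", None, [room])
    print("\n".join(describe(scene)))
```

Keeping the box next to each label is one way to recover some of the spatial precision that a flat caption loses.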
Some limits of "language" as intermediate representation?
- Only for high level? what about control?
Perception
Planning
Control
Socratic Models
Inner Monologue
ALM + LLM + VLM
SayCan
Wenlong Huang et al., 2022
LLM
Imitation? RL?
Intuition and commonsense is not just a high-level thing
applies to low-level behaviors too
- spatial: "move a little bit to the left"
- temporal: "move faster"
- functional: "balance yourself"
Seems to be stored in the depths of language models... how to extract it?
Can language models do control too?
Jacky Liang, "Code as Policies"
Turns out they've read lots of robot Python code and robotics textbooks too
LLMs can write robot code!
write a PD controller
write an impedance controller
use NumPy, SciPy code...
more examples...
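For reference, this is roughly the kind of code a prompt like "write a PD controller" asks for. It is a hand-written sketch, not output from Code as Policies or any language model. The single-joint, unit-mass plant in `simulate` and the gains `kp=20.0`, `kd=8.0` are assumptions chosen only so the example settles quickly.

```python
def pd_control(q: float, q_dot: float, q_target: float,
               kp: float, kd: float) -> float:
    """PD law: push toward the target, damp the velocity."""
    return kp * (q_target - q) - kd * q_dot


def simulate(q0: float, q_target: float, kp: float = 20.0, kd: float = 8.0,
             dt: float = 0.01, steps: int = 1000) -> float:
    """Run the controller on an assumed unit-mass joint (q_ddot = u).

    Uses semi-implicit Euler integration and returns the final position.
    """
    q, q_dot = float(q0), 0.0
    for _ in range(steps):
        u = pd_control(q, q_dot, q_target, kp, kd)
        q_dot += dt * u   # acceleration equals the command for unit mass
        q += dt * q_dot
    return q


if __name__ == "__main__":
    print(simulate(q0=0.0, q_target=1.0))
```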
Can language models do control too?
Jacky Liang, "Code as Policies"
LLMs can generate (and adjust) continuous control trajectories...
... but no autonomous feedback loop...
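Generating and then adjusting a trajectory from phrases such as "move a little bit to the left" or "move faster" can be sketched as plain functions over timed waypoints. This is an illustrative sketch, not code from Code as Policies: the function names, the convention that +y is the robot's left, and the mapping of "a little bit" to 5 cm and "faster" to 2x are all assumptions. The trajectory is open loop, so nothing here reacts to what the robot actually does.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float, float]  # (time in s, x in m, y in m)


def line_trajectory(start: Tuple[float, float], goal: Tuple[float, float],
                    duration: float, n: int) -> List[Waypoint]:
    """Evenly spaced waypoints on a straight line from start to goal."""
    points = []
    for i in range(n):
        a = i / (n - 1)  # interpolation factor from 0.0 to 1.0
        x = start[0] + a * (goal[0] - start[0])
        y = start[1] + a * (goal[1] - start[1])
        points.append((a * duration, x, y))
    return points


def move_left(traj: List[Waypoint], amount: float = 0.05) -> List[Waypoint]:
    """'move a little bit to the left': shift every waypoint along +y."""
    return [(t, x, y + amount) for t, x, y in traj]


def move_faster(traj: List[Waypoint], factor: float = 2.0) -> List[Waypoint]:
    """'move faster': same path, timestamps compressed by `factor`."""
    return [(t / factor, x, y) for t, x, y in traj]


if __name__ == "__main__":
    traj = line_trajectory((0.0, 0.0), (1.0, 0.0), duration=4.0, n=5)
    for waypoint in move_faster(move_left(traj)):
        print(waypoint)
```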
Perception wishlist item #2:
a perception model that can describe robot trajectories
Perception wishlist:
- hierarchical super high-res image description?
- open-vocab is great, but can we get generative?
- a visual-language model that is spatially grounded in 3D
- a perception model that can describe robot trajectories
- a foundation model for sounds
- not just speech, but also "robot noises"
As shared protocol for collaboration
Robot Operating System (ROS)
Perception
Planning
Control
Since 2007: A common protocol for individual modules to "talk to each other"
"Language" as the glue for robots & AI
Language
Perception
Planning
Control
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
We have some reason to believe that
"the structure of language is the structure of generalization"
To understand language is to understand generalization
https://evjang.com/2021/12/17/lang-generalization.html
Sapir–Whorf hypothesis
Discover new modes of collaboration
Towards grounding everything in language
Language
Perception
Planning
Control
Humans
A path not just for general robots,
but for human-centered robots!
go/languagein-actionsout
Thank you!
Pete Florence
Tom Funkhouser
Adrian Wong
Kaylee Burns
Jake Varley
Erwin Coumans
Alberto Rodriguez
Johnny Lee
Vikas Sindhwani
Ken Goldberg
Stefan Welker
Corey Lynch
Laura Downs
Jonathan Tompson
Shuran Song
Vincent Vanhoucke
Kevin Zakka
Michael Ryoo
Travis Armstrong
Maria Attarian
Jonathan Chien
Brian Ichter
Krzysztof Choromanski
Phillip Isola
Tsung-Yi Lin
Ayzaan Wahid
Igor Mordatch
Oscar Ramirez
Federico Tombari
Daniel Seita
Lin Yen-Chen
Adi Ganapathi
2022-CCI-seminar
By Andy Zeng