Language as Robot Middleware

Andy Zeng

CCI Seminar

 

Robot Manipulation

Amazon Picking Challenge

arc.cs.princeton.edu

Team MIT-Princeton

excel at simple things, adapt to hard things

How to endow robots with "intuition" and "commonsense"?

Robotics as a platform to build intelligent machines

SHRDLU, Terry Winograd, MIT 1968

"will you please stack up both of the red blocks and either a green cube or a pyramid"

person:

"what does the box contain?"

person:

"the blue pyramid and the blue block."

robot:

"can the table pick up blocks?"

person:

"no."

robot:

"why did you do that?"

person:

"ok."

robot:

"because you asked me to."

robot:


Robot butlers?

Representation is a key part of the puzzle

Perception → z → Actions

what is the right representation?

semantic? compact? compositional? general? interpretable?

On the hunt for the "best" state representation

z

how to represent:

semantic?

compact?

compositional?

general?

interpretable?

On the hunt for the "best" state representation

Haochen Shi and Huazhe Xu et al., RSS 2022

Learned Visual Representations

z

how to represent:

Dynamics Representations

Self-supervised Representations

Misha Laskin and Aravind Srinivas et al., ICML 2020

semantic?

compact?

compositional?

general?

interpretable?

On the hunt for the "best" state representation

Haochen Shi and Huazhe Xu et al., RSS 2022

Learned Visual Representations

NeRF Representations

3D Reconstructions

z

how to represent:

Dynamics Representations

Object-centric Representations

Danny Driess and Ingmar Schubert et al., arxiv 2022

Ben Mildenhall, Pratul Srinivasan, Matthew Tancik et al., ECCV 2020

Self-supervised Representations

Richard Newcombe et al., ISMAR 2011

Andy Zeng, Peter Yu, et al., ICRA 2017

Misha Laskin and Aravind Srinivas et al., ICML 2020

semantic?

compact?

compositional?

general?

interpretable?

On the hunt for the "best" state representation

Haochen Shi and Huazhe Xu et al., RSS 2022

Learned Visual Representations

NeRF Representations

3D Reconstructions

z

how to represent:

Dynamics Representations

Danny Driess and Ingmar Schubert et al., arxiv 2022

Ben Mildenhall, Pratul Srinivasan, Matthew Tancik et al., ECCV 2020

Self-supervised Representations

Richard Newcombe et al., ISMAR 2011

Misha Laskin and Aravind Srinivas et al., ICML 2020

Continuous-Time

Representations

Sumeet Singh et al., IROS 2022

Pretrained Representations

Lin Yen-Chen et al., ICRA 2020

Cross-embodied Representations

Kevin Zakka et al., CoRL 2021

semantic?

compact?

compositional?

general?

interpretable?

Object-centric Representations

Andy Zeng, Peter Yu, et al., ICRA 2017

On the hunt for the "best" state representation

z

how to represent:

semantic? 

compact?

compositional?

general?

interpretable?

what about

language?

On the hunt for the "best" state representation

z

how to represent:

semantic?

compact?

compositional?

general?

interpretable?

what about

language?

On the hunt for the "best" state representation

z

how to represent:

semantic?

compact?

compositional?

general?

interpretable?

what about

language?

advent of large language models

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin et al., "PaLM", 2022

On the hunt for the "best" state representation

z

how to represent:

semantic?

compact?

compositional?

general?

interpretable?

what about

language?

advent of large language models

maybe this was the multi-task representation we've been looking for all along?

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin et al., "PaLM", 2022

Mohit Shridhar et al., "CLIPort", CoRL 2021

Recent work in multi-task learning...

Does multi-task learning result in positive transfer of representations?

The past couple of years of research suggest: it's complicated.

In computer vision...
Amir Zamir, Alexander Sax, William Shen, et al., "Taskonomy", CVPR 2018

In robot learning...
Sam Toyer, et al., "MAGICAL", NeurIPS 2020
Scott Reed, Konrad Zolna, Emilio Parisotto, et al., "A Generalist Agent", 2022
Xiang Li et al., "Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?", 2022

CLIPort

Multi-task learning + grounding in language seems more likely to lead to positive transfer

Mohit Shridhar, Lucas Manuelli, Dieter Fox, "CLIPort: What and Where Pathways for Robotic Manipulation", CoRL 2021

On the hunt for the "best" state representation

z

how to represent:

semantic?

compact?

compositional?

general?

interpretable?

what about

language?

advent of large language models

maybe this was the multi-task representation we've been looking for all along?

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin et al., "PaLM", 2022

Mohit Shridhar et al., "CLIPort", CoRL 2021

How do we use "language" as a state representation?

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Open research problem! but here's one way to do it...

How do we use "language" as a state representation?

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Open research problem! but here's one way to do it...

Visual Language Model

CLIP, ALIGN, LiT,

SimVLM, ViLD, MDETR

Human input (task)

How do we use "language" as a state representation?

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Open research problem! but here's one way to do it...

Visual Language Model

CLIP, ALIGN, LiT,

SimVLM, ViLD, MDETR

Human input (task)

Large Language Model for Planning (e.g. SayCan)

Language-conditioned Policies

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

https://say-can.github.io/
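One way to picture this composition in code, a minimal sketch with stub functions standing in for the VLM, the LLM planner, and the language-conditioned policy (not the actual Socratic Models or SayCan implementation):

```python
# Minimal, illustrative sketch of a Socratic Models-style pipeline: perception,
# planning, and control exchange information purely as text. Every model call
# below is a stub; a real system plugs in a VLM, an LLM, and a policy.

def describe_scene(image):
    """Stub for a visual-language model: image -> list of object names."""
    return ["coke can", "sponge", "drawer"]

def plan_steps(task, scene_objects):
    """Stub for an LLM planner: task + scene description -> step instructions."""
    # A real system would prompt an LLM with something like:
    #   "Objects: coke can, sponge, drawer. Task: tidy up. Plan:"
    return [f"pick up the {obj}" for obj in scene_objects]

def execute_step(step, image):
    """Stub for a language-conditioned policy: step text + image -> actions."""
    print(f"executing: {step}")

def run(task, get_image):
    scene = describe_scene(get_image())      # perception, summarized as language
    for step in plan_steps(task, scene):     # planning over the language state
        execute_step(step, get_image())      # language-conditioned control

if __name__ == "__main__":
    run("tidy up the table", get_image=lambda: None)
```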

Live demo

Describing the visual world with language

(and using for feedback too!)

Some limits of "language" as intermediate representation?

-  loses spatial precision

-  highly multimodal (lots of different ways to say the same thing)

-  not as information-rich as in-domain representations (e.g. images)

Some limits of "language" as intermediate representation?

-  loses spatial precision

-  highly multimodal (lots of different ways to say the same thing)

-  not as information-rich as in-domain representations (e.g. images)

Perception wishlist, item #1: hierarchical high-res image description? (w/ spatial info?)

Office space → Conference room → Desk, Chairs → Coke bottle → Bottle label → Nutrition facts → Nutrition values
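One hypothetical way such a hierarchical, spatially grounded description could be laid out as data (illustrative only, not the output of any existing model):

```python
# Hypothetical data layout for wishlist item #1: each region has a caption,
# a bounding box (spatial info), and finer-grained children.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    caption: str
    box_xyxy: Tuple[int, int, int, int]      # pixel coordinates in the image
    children: List["Region"] = field(default_factory=list)

scene = Region("office space", (0, 0, 4000, 3000), [
    Region("conference room", (500, 200, 2500, 1800), [
        Region("desk and chairs", (800, 900, 2200, 1700), [
            Region("coke bottle", (1450, 1000, 1520, 1180), [
                Region("bottle label: nutrition facts and values", (1455, 1060, 1515, 1120)),
            ]),
        ]),
    ]),
])
```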

Some limits of "language" as intermediate representation?

-  Only for high level? what about control?

Perception

Planning

Control

Some limits of "language" as intermediate representation?

-  Only for high level? what about control?

Perception

Planning

Control

Socratic Models

Inner Monologue

ALM + LLM + VLM

SayCan

Wenlong Huang et al, 2022

LLM

Imitation? RL?
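A rough sketch of what closing that loop through language might look like, in the spirit of Inner Monologue (stub functions throughout; not the actual system):

```python
# Illustrative sketch of an Inner Monologue-style loop: after each step, the
# scene is re-described in language and fed back to the planner, so feedback
# flows through text. All calls are stubs.

def describe_scene(image):
    """Stub VLM: returns a language description of the current scene."""
    return "a coke can is on the table; the drawer is closed"

def detect_success(step, image):
    """Stub success detector: did the last step succeed?"""
    return True

def next_step(task, history):
    """Stub LLM planner: given the task and the dialogue so far, propose a step."""
    return None if len(history) > 6 else "pick up the coke can"

def run(task, get_image, act):
    history = [f"task: {task}"]
    while True:
        history.append(f"scene: {describe_scene(get_image())}")   # perception -> text
        step = next_step(task, history)                           # text -> plan
        if step is None:
            break
        act(step)                                                 # plan -> control
        outcome = "success" if detect_success(step, get_image()) else "failure"
        history.append(f"{step}: {outcome}")                      # feedback as text

if __name__ == "__main__":
    run("put the can in the drawer", get_image=lambda: None, act=print)
```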

Intuition and commonsense is not just a high-level thing

applies to low-level behaviors too

  • spatial: "move a little bit to the left"
  • temporal: "move faster"
  • functional: "balance yourself"

Seems to be stored in the depths of language models... how to extract it?

Can language models do control too?

Turns out they've read lots of robot Python code and robotics textbooks too

LLMs can write robot code!
Jacky Liang, "Code as Policies"

  • write a PD controller
  • write impedance controller
  • use NumPy SciPy code...
  • more examples...
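For a sense of scale, here is the kind of snippet such a prompt aims for, written by hand as an illustration (not actual Code as Policies output; the gains and point-mass rollout are arbitrary):

```python
# Illustration of the kind of code an LLM might produce for "write a PD
# controller": a simple proportional-derivative controller using NumPy.
import numpy as np

def pd_control(x, x_target, x_dot, kp=10.0, kd=2.0):
    """Return a control command driving state x toward x_target.

    x, x_target, x_dot are arrays (e.g. end-effector position, goal, velocity);
    kp and kd are the proportional and derivative gains.
    """
    error = np.asarray(x_target) - np.asarray(x)
    return kp * error - kd * np.asarray(x_dot)

# Tiny usage example: a point mass stepping toward a goal.
x, x_dot = np.zeros(2), np.zeros(2)
goal, dt = np.array([0.5, -0.2]), 0.01
for _ in range(500):
    u = pd_control(x, goal, x_dot)   # acceleration command
    x_dot += u * dt
    x += x_dot * dt
print(np.round(x, 3))                # ends close to the goal
```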

Can language models do control too?

Jacky Liang, "Code as Policies"

LLMs can generate (and adjust) continuous control trajectories...

... but no autonomous feedback loop...


Perception wishlist item #2:

a perception model that can describe robot trajectories
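As a strawman, even a hand-coded describer hints at the interface such a model could expose (purely illustrative; a learned perception model would be far richer than this rule-based sketch, and the axis naming here is an arbitrary convention):

```python
# Hand-coded stand-in for wishlist item #2: turn a sequence of end-effector
# positions into a short language summary of the motion.
import numpy as np

AXIS_NAMES = {0: ("right", "left"), 1: ("forward", "backward"), 2: ("up", "down")}

def describe_trajectory(positions, min_move=0.01):
    """positions: (T, 3) array of end-effector xyz in meters -> short description."""
    delta = np.asarray(positions[-1]) - np.asarray(positions[0])
    phrases = []
    for axis, d in enumerate(delta):
        if abs(d) >= min_move:
            direction = AXIS_NAMES[axis][0 if d > 0 else 1]
            phrases.append(f"moved {abs(d) * 100:.0f} cm {direction}")
    return ", then ".join(phrases) if phrases else "stayed roughly in place"

traj = np.array([[0.0, 0.0, 0.0], [0.05, 0.0, 0.0], [0.05, 0.0, 0.12]])
print(describe_trajectory(traj))   # "moved 5 cm right, then moved 12 cm up"
```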

Perception wishlist:

  • hierarchical super high-res image description?
    • open-vocab is great, but can we get generative?
  • a visual-language model that is spatially grounded in 3D
    • a perception model that can describe robot trajectories
  • a foundation model for sounds
    • not just speech, but also "robot noises"

As shared protocol for collaboration

Robot Operating System (ROS)

Perception, Planning, Control

Since 2007: A common protocol for individual modules to "talk to each other"
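For comparison, the ROS version of "talking to each other" looks like this, a minimal ROS 1 (rospy) sketch where the node and topic names are made up:

```python
# Two modules communicating over a ROS topic: the message type is the shared
# contract. Each function represents a separate node/process.
import rospy
from std_msgs.msg import String

def perception_node():
    rospy.init_node("perception")
    pub = rospy.Publisher("/scene_description", String, queue_size=10)
    rate = rospy.Rate(1)  # publish once per second
    while not rospy.is_shutdown():
        pub.publish(String(data="coke can on the table"))
        rate.sleep()

def planning_node():
    rospy.init_node("planner")
    rospy.Subscriber("/scene_description", String,
                     lambda msg: rospy.loginfo("planner received: %s", msg.data))
    rospy.spin()
```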

"Language" as the glue for robots & AI

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

"Language" as the glue for robots & AI

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

We have some reason to believe that

"the structure of language is the structure of generalization"

To understand language is to understand generalization

https://evjang.com/2021/12/17/lang-generalization.html

Sapir–Whorf hypothesis

"Language" as the glue for robots & AI

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

We have some reason to believe that

"the structure of language is the structure of generalization"

To understand language is to understand generalization

https://evjang.com/2021/12/17/lang-generalization.html

Sapir–Whorf hypothesis

Discover new modes of collaboration

Towards grounding everything in language

Language as the common interface between Perception, Planning, Control, and Humans

A path not just for general robots, but for human-centered robots!

go/languagein-actionsout

Thank you!

Pete Florence

Tom Funkhouser

Adrian Wong

Kaylee Burns

Jake Varley

Erwin Coumans

Alberto Rodriguez

Johnny Lee

Vikas Sindhwani

Ken Goldberg

Stefan Welker

Corey Lynch

Laura Downs

Jonathan Tompson

Shuran Song

Vincent Vanhoucke

Kevin Zakka

Michael Ryoo

Travis Armstrong

Maria Attarian

Jonathan Chien

Brian Ichter

Krzysztof Choromanski

Phillip Isola

Tsung-Yi Lin

Ayzaan Wahid

Igor Mordatch

Oscar Ramirez

Federico Tombari

Daniel Seita

Lin Yen-Chen

Adi Ganapathi