Language as Robot Middleware

Andy Zeng

Princeton Robotics Seminar

 

Manipulation

amazing skill! teppanyaki steak master

https://youtu.be/5qVMIKCn_Cs

Problem difficulty = #DoFs (robot) + #DoFs (environment)

Robotics as a platform to study intelligent machines

SHRDLU, Terry Winograd, MIT 1968

"will you please stack up both of the red blocks and either a green cube or a pyramid"

person:

"what does the box contain?"

person:

"the blue pyramid and blue block."

robot:

"can the table pick up blocks?"

person:

"no."

robot:

"why did you do that?"

person:

"ok."

robot:

"because you asked me to."

robot:

Simple things are surprisingly complex

"At an analytical level, pushing is a well understood problem... These are usually based on Coulomb's friction law..."

"The reality, however, is bitter... the sensitivity of the task to small changes in contact geometry, along with the variability of friction, hinders accurate predictions."

More than a Million Ways to Be Pushed: A High-Fidelity Experimental Dataset of Planar Pushing

Kuan-Ting Yu, Maria Bauza, Nima Fazeli, Alberto Rodriguez, IROS 2016

Imitation Learning with Behavior Cloning

Supervised Learning

1. Collect dataset with human teleop

2. Train end-to-end deep networks

(Diagram: Pixels → Transformers or ConvNets → Actions)

  • High-rate (velocity) control
  • Explicit (e.g. MSE) losses

In Practice

  • Data hungry
  • Policy often gets "stuck"

Implicit Behavior Cloning

\mathcal{L}_{\text{InfoNCE}} = \sum_{i=1}^N -\log \big( \tilde{p}_{\theta}( \mathbf{y}_i \mid \mathbf{x}_i, \ \{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} ) \big)
\tilde{p}_{\theta}( \mathbf{y}_i \mid \mathbf{x}_i, \ \{\tilde{\mathbf{y}}^j_i\}_{j=1}^{N_{\text{neg}}} ) = \frac{e^{-E_{\theta}(\mathbf{x}_i, \mathbf{y}_i )}} {e^{-E_{\theta}( \mathbf{x}_i, \mathbf{y}_i)} + \sum_{j=1}^{N_{\text{neg}}} e^{-E_{\theta}(\mathbf{x}_i, \tilde{\mathbf{y}}^j_i )} }

A generalization of loss functions from Transporter Nets (spatial action maps)

Implicit Behavioral Cloning, CoRL 2021

Pete Florence, Corey Lynch, Andy Zeng, Oscar Ramirez, Ayzaan Wahid, Laura Downs, Adrian Wong, Johnny Lee, Igor Mordatch, Jonathan Tompson

Transporter Networks: Rearranging the Visual World for Robotic Manipulation, CoRL 2020

Andy Zeng, Pete Florence, Jonathan Tompson, Stefan Welker, Jonathan Chien, Maria Attarian, Travis Armstrong, Ivan Krasin, Dan Duong, Ayzaan Wahid, Vikas Sindhwani, Johnny Lee
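
A minimal NumPy sketch of the InfoNCE objective above, assuming the energies E_θ have already been computed by some network; array names are illustrative, not from the paper's codebase:

```python
import numpy as np
from scipy.special import logsumexp

def infonce_loss(e_pos, e_neg):
    """InfoNCE loss over energies (minimal sketch, not the paper's code).

    e_pos: (N,) energies E_theta(x_i, y_i) for the demonstrated actions.
    e_neg: (N, M) energies E_theta(x_i, y~_i^j) for M counter-examples each.
    """
    # Column 0 holds the positive action's energy; the rest are negatives.
    e_all = np.concatenate([e_pos[:, None], e_neg], axis=1)  # (N, M+1)
    # log p~(y_i | x_i, negatives) = -E(x_i, y_i) - logsumexp(-E over all)
    log_p = -e_all[:, 0] - logsumexp(-e_all, axis=1)
    return -log_p.mean()

# Toy usage: loss is low when positives have the lowest energies.
rng = np.random.default_rng(0)
print(infonce_loss(rng.normal(size=8), rng.normal(size=(8, 16))))
```

At test time, implicit BC selects actions by (approximately) minimizing the learned energy over y, e.g. via sampling or gradient-based refinement, rather than regressing an action directly.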

Interpolation vs Extrapolation*

  • Sample efficiency improvements

Deep Learning is a Box

Interpolation (inside the box) vs. extrapolation (outside)

(Diagram labels: Roboticist, Vision, NLP, Internet)

Meanwhile in NLP...

Large Language Models

Trained on Internet data: books, recipes, code, news, articles, dialogue

Demo

Quick Primer on Language Models

Tokens (inputs & outputs)

Pieces of words (BPE encoding)

per word: big, bigger, biggest, small, smaller, smallest

per token: big, small, er, est

Transformers (models)

Self-Attention

(Diagram: output y attending over inputs x_1, x_2, x_3)

Attention Is All You Need, NeurIPS 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin

\text{softmax}(\frac{QK^\intercal}{\sqrt{d_k}})V
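
The self-attention formula above, as a minimal single-head NumPy sketch (no masking or learned projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head (minimal sketch)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) similarity logits
    # Numerically stable softmax over the key dimension.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # (n_q, d_v) weighted values

# Toy usage: 3 tokens (x_1, x_2, x_3) attending to each other.
X = np.random.default_rng(0).normal(size=(3, 4))
print(scaled_dot_product_attention(X, X, X).shape)  # (3, 4)
```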

Bigger is Better

Neural Language Models: Bigger is Better, WeCNLP 2018

Noam Shazeer

Somewhere in the space of interpolation lives robot planning

Example?

Large Language Models on Robots

Open research problem! but here's one way to do it...

Human input (task)

Visual Language Model (CLIP, ALIGN, LiT, SimVLM, ViLD, MDETR)

Large Language Model for Planning (e.g. SayCan)

Language-conditioned Policies

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

https://say-can.github.io
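
One way to read the SayCan recipe: pick the next skill by combining (in log space) the LLM's usefulness score with a learned affordance (value function) feasibility score. A hedged sketch; `llm_log_prob` and `affordance` are hypothetical stand-ins for the real models:

```python
import math

def plan_next_skill(instruction, history, skills, llm_log_prob, affordance):
    """SayCan-style skill selection (sketch; llm_log_prob and affordance
    are hypothetical callables standing in for the real models)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # (a) usefulness: LLM log-likelihood that this skill's description
        #     is a good continuation of the instruction + execution history.
        usefulness = llm_log_prob(prompt=f"{instruction}\n{history}", text=skill)
        # (b) feasibility: value-function estimate, in (0, 1], that the
        #     skill can succeed from the current state.
        feasibility = affordance(skill)
        score = usefulness + math.log(feasibility)  # product, in log space
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```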

Live Demo

Language as a "state" representation

(Diagram: Perception → z → Planning → Actions; e.g. z = 3D Reconstruction)

What do we want from z?

  • semantic?
  • compact?
  • compositional?
  • general?
  • interpretable?
  • scope of guarantees
  • safety & bias mitigation

What about language?

Ego4D

Limits of language as a "state" representation

  • Loses spatial precision
  • Highly multimodal (lots of different ways to say the same thing)
  • Not as information-rich as in-domain representations (e.g. images)
  • Only for high level? what about control?

(Diagram: Perception → Planning → Control; LLM-based planning: Socratic Models (ALM + LLM + VLM), SayCan, Inner Monologue (Wenlong Huang et al., 2022); control: Imitation? RL? Engineered?)

Intuition and commonsense is not just a high-level thing

  • Applies to low-level behaviors too
    • Spatial: "move a little bit to the left"
    • Temporal: "move faster"
    • Functional: "balance yourself"
  • Is the "dark matter" of robotics
  • Seems to be stored in the depths of language models... how to extract it?

Demo

Language models can write code

Code as a medium to express low-level commonsense

  • use NumPy, SciPy code...
  • PD controllers
  • impedance controllers

Live Demo

Code as Policies: Language Model Programs for Embodied Control

Jacky Liang, Wenlong Huang, Fei Xia, Peng Xu, Karol Hausman, Brian Ichter, Pete Florence, Andy Zeng

code-as-policies.github.io
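
For the low-level side (e.g. the PD controllers above), the kind of snippet a language model might be prompted to write could look like the following; the gains and the robot API helpers are hypothetical:

```python
import numpy as np

def pd_step(q, q_dot, q_target, kp=10.0, kd=2.0):
    """One step of a PD controller: torque from position error minus
    a damping term on velocity (illustrative gains, not tuned)."""
    return kp * (q_target - q) - kd * q_dot

# Hypothetical control loop (get_state/apply_torque stand in for a robot API):
# q, q_dot = get_state()
# apply_torque(pd_step(q, q_dot, q_target=np.zeros_like(q)))
```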

What is the foundation model for robotics?

How much data do we need?

Robot Learning: not a lot of robot data (50 → 500 → 5000 expert demos...)

Language Models: lots of Internet data

  • Finding other sources of data (sim, YouTube)
  • Improve data efficiency with prior knowledge

Scale alone might not be enough

Machine learning is a box... but robotics is a line (adapted from Tomás Lozano-Pérez)

Different embodiments, etc....

A possible middle ground

Embrace Compositionality

1. Build boxes

2. Compose them autonomously

Towards grounding everything in language

(Diagram: Language connecting Vision, Control, Tactile)

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

"Language" as the glue for robots & AI

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io

"Language" as the glue for robots & AI

Language

Perception

Planning

Control

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

https://socraticmodels.github.io
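
To make the compositional pattern concrete, a toy sketch in the Socratic Models spirit: independently trained modules ("boxes") exchange language strings, and a short program composes them. All functions are illustrative stubs, not real APIs:

```python
def describe_scene(image) -> str:
    """Box 1 (e.g. a VLM): image -> language description. Stub."""
    return "a red block is on the table, next to a green bowl"

def plan(task: str, scene: str) -> list[str]:
    """Box 2 (e.g. an LLM): task + scene -> step-by-step plan. Stub."""
    return ["pick up the red block", "place it in the green bowl"]

def execute(step: str) -> None:
    """Box 3 (e.g. a language-conditioned policy). Stub."""
    print("executing:", step)

# Composition: language is the interface between the boxes.
scene = describe_scene(image=None)
for step in plan("put the block in the bowl", scene):
    execute(step)
```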

We have some reason to believe that "the structure of language is the structure of generalization"

To understand language is to understand generalization

https://evjang.com/2021/12/17/lang-generalization.html

Sapir–Whorf hypothesis

Towards grounding everything in language

(Diagram: Language connecting Perception, Planning, Control, and Humans)

Not just for general robots, but for human-centered intelligent machines!

Things I wish my robot had...

  • Grounding safety and risk in language
  • Compositional guarantees on safety and generalization?
  • Embed more LLM prior knowledge into control?
  • PAC-Bayes on LLM planners?

Thank you!

Pete Florence

Adrian Wong

Johnny Lee

Vikas Sindhwani

Stefan Welker

Vincent Vanhoucke

Kevin Zakka

Michael Ryoo

Maria Attarian

Brian Ichter

Krzysztof Choromanski

Federico Tombari

Jacky Liang

Aveek Purohit

Wenlong Huang

Fei Xia

Peng Xu

Karol Hausman

and many others!

Amazon Picking Challenge

arc.cs.princeton.edu

Team MIT-Princeton

excel at simple things, adapt to hard things

How to endow robots with "intuition" and "commonsense"?
