MIT 6.4210/6.4212 Lec 22
Boyuan Chen
Slide: https://slides.com/d/EyfmqBY/live
Connecting the unstructured world with structured algorithms
Segmentation, Point Cloud -> Grasp
Finite State Machine
....
These are all very structured!
It turns out human activity on the internet produces a massive amount of knowledge in the form of text, and that knowledge is really useful!
GPT-3 writing an MIT 24.09 essay
Problem:
Our robots can only execute a fixed set of commands and need the problem broken down into actionable steps. This is not what LLMs have seen.
We need to get LLMs to speak "robot language"!
LLMs can copy the logic and extrapolate it!
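One way to see this: give the model a couple of worked examples in the prompt and let it continue the pattern. A minimal few-shot sketch in Python; the task names, step format, and the llm_complete helper are made up for illustration:

FEW_SHOT_PROMPT = """\
Task: bring me a snack
Steps: 1. go to the pantry 2. pick up the chips 3. bring them to the user

Task: throw away the can
Steps: 1. pick up the can 2. go to the trash bin 3. drop the can

Task: put the apple in the fridge
Steps:"""

# Hypothetical text-completion call; any LLM API works here.
# The model continues the pattern: "1. pick up the apple 2. open the fridge ..."
plan = llm_complete(FEW_SHOT_PROMPT)

Because the completions follow the format of the examples, each step can be matched against the robot's fixed command set.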
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
Reinforcement learning already provides task-based affordances.
They are encoded in the value function!
[Value Function Spaces, Shah, Xu, Lu, Xiao, Toshev, Levine, Ichter, ICLR 2022]
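In SayCan the two signals are multiplied: the LLM scores how useful each skill is for the instruction, and the skill's value function scores whether it can succeed from the current state. A minimal sketch, where llm_score and value_fn are hypothetical stand-ins for the paper's trained models:

import numpy as np

def saycan_select(skills, instruction, state, llm_score, value_fn):
    # llm_score(instruction, skill) -> p_LLM(skill | instruction): usefulness
    # value_fn(state, skill)        -> estimated success probability: affordance
    combined = [llm_score(instruction, s) * value_fn(state, s) for s in skills]
    return skills[int(np.argmax(combined))]

The robot executes the selected skill, appends it to the instruction context, and repeats until a termination skill is chosen.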
Open-vocabulary Queryable Scene Representations for Real World Planning, Chen et al., 2022
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation, Gu et al., 2021
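The querying idea can be sketched with CLIP-style embeddings: a class-agnostic detector proposes object crops, and a free-form text query is matched against them in a shared embedding space. Here embed_image and embed_text are hypothetical encoder stand-ins:

import numpy as np

def query_scene(scene_objects, text_query, embed_image, embed_text):
    # scene_objects: list of (image_crop, world_position) pairs from a detector
    q = embed_text(text_query)
    q = q / np.linalg.norm(q)
    best_score, best_pos = -np.inf, None
    for crop, pos in scene_objects:
        v = embed_image(crop)
        score = float(q @ (v / np.linalg.norm(v)))  # cosine similarity
        if score > best_score:
            best_score, best_pos = score, pos
    return best_pos  # e.g., where to find "something to wipe the table with"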
Toolformer: Language Models Can Teach Themselves to Use Tools, Schick et al., 2023
Prompt the LLM with the robot's APIs and ask it to write code as a response
Human: help me put the apple on the book
Robot:

import numpy as np

img = get_image()                         # capture an image from the robot's camera
book_pos = detect(img, "book")[0]         # position of the first detected book
apple_pos = detect(img, "apple")[0]       # position of the first detected apple
pick(apple_pos)                           # grasp the apple
place(book_pos + np.array([0, 0, 0.1]))   # place it 10 cm above the book
Code as Policies: Language Model Programs for Embodied Control, Liang et al., 2022
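A sketch of how such a prompt might be assembled; the API list and wording below are illustrative, not the exact prompt from the paper:

API_DOCS = """\
get_image() -> img                # capture an image from the camera
detect(img, name) -> list of pos  # open-vocabulary object detector
pick(pos)                         # grasp at a position
place(pos)                        # place the held object at a position
"""

def build_prompt(instruction):
    # Show the LLM the robot's API, then ask it to respond with code.
    return ("You control a robot through this Python API:\n" + API_DOCS +
            "\nWrite Python code that accomplishes the instruction.\n"
            f"Instruction: {instruction}\nCode:\n")

The returned code is then executed against the real API, so the LLM's output is grounded in actions the robot can actually perform.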
Voyager: An Open-Ended Embodied Agent with Large Language Models, Wang et al., 2023
Is scaling all you need?
Is this approach sufficient?
[Diagram: the ideal robot foundation model, with Vision, Text, and Action as inputs and Vision, Text, and Action as outputs]
[Diagram: a vision-language model (VLM), with Vision and Text as inputs and Text as output]
GPT-4V came almost a year after ChatGPT....
VLMs are pretrained on image-text pairs: a photo paired with a caption like
"The campus of the Massachusetts Institute of Technology in Cambridge will soon be home to a new college of computer science, which will get its own building."
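One common pretraining recipe for such pairs is CLIP-style contrastive matching: within a batch, each image should score highest against its own caption. A minimal numpy sketch of that objective (one direction only; CLIP uses both image-to-text and text-to-image, and closed models like GPT-4V may combine this with captioning losses):

import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Normalize embeddings so similarity is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature       # pairwise similarities
    # Matched pairs sit on the diagonal; maximize their log-probability.
    log_norm = np.log(np.exp(logits).sum(axis=1))
    return float(np.mean(log_norm - np.diag(logits)))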
[Diagram: VLM with Vision and Text as inputs and Text as output]
Pre-trained, so we have some generalization in input behavior; and because the output is text, we also have generalization in output behavior.
[Diagram: inputs Vision and Text; outputs Vision, Text, and Action]
[Diagram: a vision-language-action model, with Vision and Text as inputs and Action as output]
Pre-trained, so we have some generalization in input behavior.
Output behavior is unclear: it is hard to get enough data to train a foundation model that generates actions.
RT-1: Robotics Transformer for Real-World Control at Scale, Brohan et al., 2022
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control, Brohan et al., 2023
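RT-2's trick for making a VLM output actions is to represent actions as text: each action dimension is discretized into 256 bins and emitted as tokens. A rough sketch of that encoding (the bounds and dimensionality are placeholders):

import numpy as np

def action_to_tokens(action, low, high, n_bins=256):
    # Map each continuous action dimension to an integer bin,
    # then emit the bins as a space-separated token string.
    a = np.clip((action - low) / (high - low), 0.0, 1.0)
    bins = np.round(a * (n_bins - 1)).astype(int)
    return " ".join(str(b) for b in bins)   # e.g. "132 87 255 0 64 12 200"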
Diffusion Policy: Visuomotor Policy Learning via Action Diffusion, Chi et al., 2023
Teaching Robots New Behaviors, TRI 2023
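Diffusion Policy generates an action sequence by iterative denoising: start from Gaussian noise and repeatedly apply a trained noise-prediction network conditioned on recent observations. A heavily simplified sketch; denoiser is a hypothetical trained network, and the real method uses a proper DDPM/DDIM noise schedule rather than this crude update:

import numpy as np

def sample_action_sequence(denoiser, obs, horizon=16, act_dim=7, n_steps=50):
    actions = np.random.randn(horizon, act_dim)   # start from pure noise
    for t in reversed(range(n_steps)):
        eps = denoiser(actions, t, obs)           # predicted noise at step t
        actions = actions - eps / n_steps         # simplified denoising step
    return actions   # executed receding-horizon style on the robot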
Thursday (Dec 7th) 4 pm at 32-D463
Ben Burchfiel & Siyuan Feng from TRI will be giving a talk, "Towards Large Behavior Models: Versatile and Dexterous Robots via Supervised Learning."
GenSim: Generating Robotic Simulation Tasks via Large Language Models, Wang et al., 2023
RoboGen: Towards Unleashing Infinite Data for Automated Robot Learning via Generative Simulation, Wang et al., 2023
One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization, Liu et al., 2023
Language to Rewards for Robotic Skill Synthesis, Yu et al., 2023
Eureka: Human-Level Reward Design via Coding Large Language Models, Ma et al., 2023
Vision-Language Models as Success Detectors, Du et al., 2023
VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models, Huang et al., 2023
Boyuan Chen on scaling robotics:
Models scale; data collection doesn't.
Video gives you action data.
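One way video can yield action data is through an inverse-dynamics model: train a model on a small amount of action-labeled data to infer the action between two frames, then use it to pseudo-label internet-scale video. A sketch, with inverse_dynamics as a hypothetical trained model:

def label_video(frames, inverse_dynamics):
    # Turn a raw video into (observation, action) training pairs by
    # inferring the action that maps each frame to the next one.
    return [(frames[t], inverse_dynamics(frames[t], frames[t + 1]))
            for t in range(len(frames) - 1)]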
UniSim: Learning Interactive Real-World Simulators, Du et al., 2023
Instruction: Make a Line