SCM.256 - Spring 2024
Guest Lecture:
(Some) Recent ML trends/applications
Shen Shen
May 13, 2024
Notable trends lately
(in CV and NLP)
- Self-supervision
- Scaling up
- Multi-modality
- Transformer-based architecture stack
- Diffusion-based generative algorithms
Self-supervision (masking)
It seems
It seems
Multi-modality
Diffusion/score-based
[image credit: Lilian Weng]
Dall-E 2 (UnCLIP): CLIP + GLIDE
[https://arxiv.org/pdf/2204.06125.pdf]
Outline Today
Part 1: Some echoing trends in Robotics
Part 3: Some more domain-specific applications
- Engineering
- Natural sciences (e.g. life sciences, health care)
- Math, algorithms
- Social sciences (e.g. political, ethical impact)
Part 2: Some future directions in CV/NLP/Robotics
Part 1: Robotics
Lots of slides adapted from
2004 - Uses vanilla policy gradient (actor-critic)
uses first-principle (modeling, control, optimization stack)
For the next challenge:
Good control when we don't have useful models?
For the next challenge:
Good control when we don't have useful models?
- Rules out:
- (Multibody) Simulation
- Simulation-based reinforcement learning (RL)
- State estimation / model-based control
- Some top choices:
- Learn a dynamics model
- Behavior cloning (imitation learning)
Levine*, Finn*, Darrel, Abbeel, JMLR 2016
Visuomotor policies
perception network
(often pre-trained)
policy network
other robot sensors
learned state representation
actions
x history
manipulation (and general control) is hard because?
partially because the data is scarce
Denoising diffusion models
(for actions)
Image source: Ho et al. 2020
Denoiser can be conditioned on additional inputs, \(u\): \(p_\theta(x_{t-1} | x_t, u) \)
Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M-150M Parameters
Training Time: 3-6 GPU Days ($150-$300)
Why (Denoising) Diffusion Models?
- High capacity + great performance
- Small number of demonstrations (typically ~50)
- Multi-modal (non-expert) demonstrations
- Training stability and consistency
- no hyper-parameter tuning
- Generates high-dimension continuous outputs
- vs categorical distributions (e.g. RT-1, RT-2)
- Action-chunking transformers (ACT)
- Solid mathematical foundations (score functions)
- Reduces nicely to the simple cases (e.g. LQG / Youla)
Scaling Up
- We've discussed training one skill
-
Wanted: few shot generalization to new skills
- multitask, language-conditioned policies
- connects beautifully to internet-scale data
-
Big Questions:
- How do we feed the data flywheel?
- What are the scaling laws?
Discussion
I do think there is something deep happening here...
- Manipulation should be easy (from a controls perspective)
- probably low dimensional?? (manifold hypothesis)
- memorization can go a long way
If we really understand this, can we do the same via principles from a model? Or will control go the way of computer vision and language?
Discussion
What if we did have a good model? (and well-specified objective)
- Core challenges:
- Control from pixels
- Control through contact
- Optimizing rich robustness objective
- The most effective approach today:
- RL on privileged information + teacher-student
Deep RL + Teacher-Student
Lee et al., Learning quadrupedal locomotion over challenging terrain, Science Robotics, 2020
Deep RL + Teacher-student
Magic of Modality
- Modality = image, video, 3d mesh, text etc.
- One recipe (motivated by data):
- Use discriminative model on data-rich domain to guide training in data-scarce domains
- Use generative model on data-rich domains to synthesize data for data-scarce domains
Task Planning with LLM
Connect unstructured world with structured algorithms
What humans would want:
Task: clean up the spilled coke
- Set the coke can into an upright position
- Find some napkins
- Pick up napkins
- Wipe the spilled coke with napkins
- Wipe the coke can
- Throw away the used napkins
Humans: Language as tasks
Language as plans!
Can we use human priors & knowledges
It turns out humans activities on the internet produces a massive amount of knowledge in the form of text that are really useful!
Highlight
- Given a fixed list of options, can evaluate likelihood with LM
- Given all vocabularies, can sample with likelihood to generate
Ingredient 1
- Bind each executable skill to some text options
- Have a list of text options for LM to choose from
- Given instruction, choose the most likely one
Ingredient 2
- Prompt LLM to output in a more structured way
- Parse the structure output
Few-shot prompting of Large Language Models
LLMs can copy the logic and extrapolate it!
Prompt Large Language Models to do structured planning
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al. , 2022
LLMs for robotics
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al. , 2022
What task-based affordances reminds us of in MDP/RL?
Value functions!
[Value Function Spaces, Shah, Xu, Lu, Xiao, Toshev, Levine, Ichter, ICLR 2022]
Robotic affordances
Combine LLM and Affordance
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al. , 2022
LLM x Affordance
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al. , 2022
- Language Models as Zero-Shot Planners:
Extracting Actionable Knowledge for Embodied Agents - Inner Monologue: Embodied Reasoning through Planning with Language Models
- PaLM-E: An Embodied Multimodal Language Model
- Chain-of-thought prompting elicits reasoning in large language models
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Extended readings in
LLM + Planning
Part 2: Current/future directions
- NLP -- credits: Andrej Karpathy (nov 2023)
- CV -- credits: Kaiming He (oct 23)
- Robotics -- credits: CoRL debate (nov 23)
Interpretability
Why video
- Video is how human perceive the world (physics, 3D)
- Video is widely available on internet
- Internet videos contain human actions and tutorials
- Pre-train on entire youtube, first image + text -> video
- Finetune on some robot video
- Inference time, given observation image + text prompt -> video of robot doing the task -> back out actions
A lot of actions/tutorials
(The demo won't embed in PDF. But the direct link below works.)
Magic of Modality
Text, Image, Video -> Text
Video -> 3D shape
Magic of Modality
Boom of 3D data
Universal dynamics model
UniSim: Learning Interactive Real-World Simulators, Du et al., 2023
Video Prediction for Robots
Learning Universal Policies via Text-Guided Video Generation, Du et al. 2023
Video Prediction for Robots
Learning Universal Policies via Text-Guided Video Generation, Du et al. 2023
Video + Language
Video Language Planning, Du et al. 2023
Video + Language
Video Language Planning, Du et al. 2023
Instruction: Make a Line
Video + RL
Mastering Diverse Domains through World Models, Hafner et al. 2023
Part 3: Other domain-specific applications
Open challenges:
two main-stream methods:
- predict amino-acid pair-wise distance. or
- predict 3d coordinates
tradeoff between representation compactness and structure.
credit: manoli/regina class
Efficient Hardware Co-design
Hardware section slides credit and link:
Mathematics/Algorithmetic
Computer-aided proofs have a long history
Suppose we "trivialize" theorem proofs into exam T/F question, what are some common strategies?
- If we really understand the materials, jump to invoking lemmas punchlines, etc and derive
- If less sure, make a guess about the T/F, conjecture, then prove
- If less sure still, try out a few data (if possible) and hope for counter-examples or intuition
Finding things similar to our problem.
For proofs that "look like" this in the past, induction techniques are "often" used, so an assistant may suggest "try induction"
So we need "good" characterizations of facts/statement (a lot of research there). As in what are these "theorems" about??
Nuanced/semantic characterization: what are the assumptions?
First such ML-aided system created about 15 years ago
Previously computers help:
- find counter-example
- accelerating calculations
- symbolic reasoning
Algorithmic Discovery
Divide-and-conquer. Try to propose inter-mediate lemmas.
Can try to do without concrete Proof Blueprint.
Take a further step: explore the rich "proof library" bits like LEAN.
Find an open-ended goal in a statistical way.
Square matrices => system suggests that it's true for arbitrary matrices =>
GPTs can read a book and reference a true statement
Societal impact
“The AI Index 2023 Annual Report,” HAI, Stanford University, April 2023.
“The AI Index 2024 Annual Report,” HAI, Stanford University, April 2024.
Thanks!
Guest Lecture - Some recent ML trends/applications
By Shen Shen
Guest Lecture - Some recent ML trends/applications
- 69