SCM.256 - Spring 2024
Guest Lecture:
(Some) Recent ML trends/applications
Shen Shen
May 13, 2024

Notable trends lately
(in CV and NLP)
- Self-supervision
- Scaling up
- Multi-modality
- Transformer-based architecture stack
- Diffusion-based generative algorithms
Self-supervision (masking)



Multi-modality

Diffusion/score-based

[image credit: Lilian Weng]
DALL-E 2 (unCLIP): CLIP + GLIDE

[https://arxiv.org/pdf/2204.06125.pdf]
Outline Today
Part 1: Some echoing trends in Robotics
Part 2: Some future directions in CV/NLP/Robotics
Part 3: Some more domain-specific applications
- Engineering
- Natural sciences (e.g. life sciences, health care)
- Math, algorithms
- Social sciences (e.g. political, ethical impact)
Part 1: Robotics
Lots of slides adapted from
2004 - Uses vanilla policy gradient (actor-critic)
uses a first-principles stack (modeling, control, optimization)
For the next challenge:
Good control when we don't have useful models?
- Rules out:
- (Multibody) Simulation
- Simulation-based reinforcement learning (RL)
- State estimation / model-based control
- Some top choices:
- Learn a dynamics model
- Behavior cloning (imitation learning)

Levine*, Finn*, Darrell, Abbeel, JMLR 2016
Visuomotor policies

[Diagram: camera image -> perception network (often pre-trained) -> learned state representation; combined with other robot sensors and the x history, this feeds a policy network that outputs actions.]
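A minimal sketch of that dataflow, where perception_net and policy_net are hypothetical stand-ins for the two networks (not any specific implementation):

def visuomotor_policy(image, sensors, x_history, perception_net, policy_net):
    z = perception_net(image)                  # learned state representation
    return policy_net(z, sensors, x_history)   # -> actions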

Why is manipulation (and control in general) hard?
Partially because data is scarce




Denoising diffusion models
(for actions)
Image source: Ho et al. 2020
Denoiser can be conditioned on additional inputs u: $p_\theta(x_{t-1} \mid x_t, u)$
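As a concrete sketch, here is one conditional reverse step following the Ho et al. 2020 update rule; eps_model is a hypothetical noise-prediction network eps_theta(x_t, t, u), and betas is the noise schedule:

import numpy as np

def reverse_step(x_t, t, u, eps_model, betas):
    # sample x_{t-1} ~ p_theta(x_{t-1} | x_t, u); sigma_t^2 = beta_t variant
    alphas = 1.0 - betas
    alpha_bar_t = np.prod(alphas[: t + 1])   # cumulative product up to step t
    eps = eps_model(x_t, t, u)               # conditioning on u enters here
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alphas[t])
    noise = np.random.randn(*x_t.shape) if t > 0 else 0.0
    return mean + np.sqrt(betas[t]) * noise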


Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M-150M Parameters
Training Time: 3-6 GPU Days ($150-$300)
Why (Denoising) Diffusion Models?
- High capacity + great performance
- Small number of demonstrations (typically ~50)
- Multi-modal (non-expert) demonstrations
- Training stability and consistency
- no hyper-parameter tuning
- Generates high-dimensional continuous outputs
- vs categorical distributions (e.g. RT-1, RT-2)
- Action-chunking transformers (ACT)
- Solid mathematical foundations (score functions)
- Reduces nicely to the simple cases (e.g. LQG / Youla)
Scaling Up
- We've discussed training one skill
-
Wanted: few shot generalization to new skills
- multitask, language-conditioned policies
- connects beautifully to internet-scale data
-
Big Questions:
- How do we feed the data flywheel?
- What are the scaling laws?
Discussion
What if we did have a good model? (and well-specified objective)
- Core challenges:
- Control from pixels
- Control through contact
- Optimizing rich robustness objectives
- The most effective approach today:
- RL on privileged information + teacher-student distillation (a minimal sketch follows)
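A minimal sketch of the distillation step, assuming hypothetical teacher and student policies: the teacher was trained with RL on privileged simulator state, and the student must match it from onboard observations alone:

import numpy as np

def imitation_loss(batch, teacher, student):
    # batch: (privileged_state, onboard_obs) pairs; both policies return actions
    losses = [np.mean((student(obs) - teacher(priv)) ** 2)
              for priv, obs in batch]
    return float(np.mean(losses))   # minimized w.r.t. student parameters only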
Deep RL + Teacher-Student


Lee et al., Learning quadrupedal locomotion over challenging terrain, Science Robotics, 2020

Deep RL + Teacher-Student
Magic of Modality
- Modality = image, video, 3D mesh, text, etc.
- One recipe (motivated by data):
- Use a discriminative model from a data-rich domain to guide training in data-scarce domains (one concrete instance sketched below)
- Use a generative model from data-rich domains to synthesize data for data-scarce domains
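For instance, the first (discriminative) recipe could look like this, assuming a hypothetical clip_score(image, text) similarity function from a web-scale model:

def pseudo_label(images, candidate_labels, clip_score):
    # label scarce-domain images with whichever text the rich-domain model prefers
    return [max(candidate_labels, key=lambda text: clip_score(img, text))
            for img in images]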
Task Planning with LLM
Connect the unstructured world with structured algorithms

What humans would want:

Task: clean up the spilled coke
- Set the coke can into an upright position
- Find some napkins
- Pick up napkins
- Wipe the spilled coke with napkins
- Wipe the coke can
- Throw away the used napkins
Humans: Language as tasks
Language as plans!

Can we use human priors & knowledge?
It turns out human activity on the internet produces a massive amount of knowledge, in the form of text, that is really useful!

Highlight

- Given a fixed list of options, can evaluate their likelihoods with an LM
- Given the full vocabulary, can sample by likelihood to generate
Ingredient 1
- Bind each executable skill to some text option
- Have a list of text options for the LM to choose from
- Given an instruction, choose the most likely one (a minimal sketch follows)
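A minimal sketch of Ingredient 1, assuming a hypothetical lm_loglik(prompt, continuation) that returns the LM's log-likelihood of the continuation; the skill names and texts are illustrative:

SKILLS = {                         # each executable skill bound to a text option
    "pick_napkin": "pick up the napkin",
    "wipe_spill":  "wipe the spilled coke with the napkin",
    "throw_trash": "throw away the used napkin",
}

def choose_skill(instruction, lm_loglik):
    prompt = f"Task: {instruction}\nNext step:"
    scores = {name: lm_loglik(prompt, text) for name, text in SKILLS.items()}
    return max(scores, key=scores.get)   # most likely option under the LM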

Ingredient 2
- Prompt the LLM to output in a more structured way
- Parse the structured output (sketched below)
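A minimal sketch of Ingredient 2, assuming a hypothetical llm(prompt) completion function:

import re

def plan_with_llm(task, llm):
    # ask for a numbered plan, one step per line, then parse it back out
    prompt = f"Task: {task}\nPlan, one numbered step per line:\n"
    completion = llm(prompt)
    return re.findall(r"^\s*\d+\.\s*(.+?)\s*$", completion, flags=re.M)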
Few-shot prompting of Large Language Models

LLMs can copy the logic and extrapolate it!
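For example, a hypothetical few-shot prompt: the two worked examples establish the pattern, and the model continues it for the new task:

FEW_SHOT_PROMPT = """\
Task: bring me a snack
Plan: 1. find chips  2. pick up chips  3. bring chips to user  4. done

Task: throw away the bottle
Plan: 1. find bottle  2. pick up bottle  3. go to trash can  4. put down bottle  5. done

Task: clean up the spilled coke
Plan:"""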
Prompt Large Language Models to do structured planning


Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
LLMs for robotics

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022

What do task-based affordances remind us of in MDP/RL?
Value functions!
[Value Function Spaces, Shah, Xu, Lu, Xiao, Toshev, Levine, Ichter, ICLR 2022]
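One hedged way to make the connection precise: treat each skill $\ell$ as a policy $\pi_\ell$ with a sparse success reward; the affordance of $\ell$ in state $s$ is then (approximately) that policy's value:

$$\text{affordance}(s, \ell) \;\approx\; V^{\pi_\ell}(s) \;=\; \mathbb{E}_{\pi_\ell}\Big[\sum\nolimits_t r_t \,\Big|\, s_0 = s\Big], \qquad r_t \in \{0, 1\}\ \text{(skill success)}$$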
Robotic affordances
Combine LLM and Affordance

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
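A minimal sketch of the combination, assuming hypothetical lm_prob(instruction, skill_text) and affordance(skill_name, state) functions, both returning values in [0, 1]:

def choose_grounded_skill(instruction, skills, state, lm_prob, affordance):
    # SayCan-style score: "can the LM say it" times "can the robot do it"
    scores = {name: lm_prob(instruction, text) * affordance(name, state)
              for name, text in skills.items()}
    return max(scores, key=scores.get)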
LLM x Affordance

Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
Extended readings in LLM + Planning
- Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents
- Inner Monologue: Embodied Reasoning through Planning with Language Models
- PaLM-E: An Embodied Multimodal Language Model
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Part 2: Current/future directions
- NLP -- credits: Andrej Karpathy (Nov 2023)
- CV -- credits: Kaiming He (Oct 2023)
- Robotics -- credits: CoRL debate (Nov 2023)

Interpretability


Why video
- Video is how humans perceive the world (physics, 3D)
- Video is widely available on the internet
- Internet videos contain human actions and tutorials
- Pre-train on (essentially) all of YouTube: first image + text -> video
- Fine-tune on some robot video
- At inference time: observation image + text prompt -> video of the robot doing the task -> back out the actions (sketched below)
A lot of actions/tutorials
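A minimal sketch of that inference-time pipeline, assuming hypothetical video_model and inverse_dynamics components:

def act_from_video(obs_image, text, video_model, inverse_dynamics):
    # generate a video of the robot doing the task, then back out the actions
    # between consecutive predicted frames with an inverse-dynamics model
    frames = video_model.generate(first_frame=obs_image, prompt=text)
    return [inverse_dynamics(f0, f1) for f0, f1 in zip(frames, frames[1:])]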
Magic of Modality

Text, Image, Video -> Text
Video -> 3D shape
Magic of Modality
Boom of 3D data
Universal dynamics model
UniSim: Learning Interactive Real-World Simulators, Du et al., 2023

Video Prediction for Robots
Learning Universal Policies via Text-Guided Video Generation, Du et al. 2023

Video Prediction for Robots
Learning Universal Policies via Text-Guided Video Generation, Du et al. 2023

Video + Language
Video Language Planning, Du et al. 2023

Video + Language
Video Language Planning, Du et al. 2023
Instruction: Make a Line

Video + RL
Mastering Diverse Domains through World Models, Hafner et al. 2023


Part 3: Other domain-specific applications







Efficient Hardware Co-design



Hardware section slides credit and link:



Mathematics / Algorithms










Algorithmic Discovery






Societal impact





“The AI Index 2023 Annual Report,” HAI, Stanford University, April 2023.

“The AI Index 2024 Annual Report,” HAI, Stanford University, April 2024.
Guest Lecture - Some recent ML trends/applications
By Shen Shen