Shen Shen
May 13, 2024
(in CV and NLP)
It seems
[image credit: Lilian Weng]
[https://arxiv.org/pdf/2204.06125.pdf]
Part 1: Some echoing trends in Robotics
Part 2: Some future directions in CV/NLP/Robotics
Part 3: Some more domain-specific applications
Lots of slides adapted from
2004 - Uses vanilla policy gradient (actor-critic)
uses a first-principles stack (modeling, control, optimization)
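As a refresher on the learning-based approach above, here is a minimal sketch of the vanilla policy-gradient surrogate loss; the function name and the way the baseline enters are illustrative assumptions, not the 2004 system's code.

```python
import torch

def policy_gradient_loss(log_probs: torch.Tensor,
                         returns: torch.Tensor,
                         baseline: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient is the vanilla policy-gradient estimate:
    grad J ~ E[(G_t - b(s_t)) * grad log pi(a_t | s_t)].
    In an actor-critic setup, `baseline` comes from a learned critic."""
    advantages = (returns - baseline).detach()  # no gradient through the advantage
    return -(log_probs * advantages).mean()
```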
For the next challenge:
Levine*, Finn*, Darrell, Abbeel, JMLR 2016
[Architecture diagram: camera image -> perception network (often pre-trained) -> learned state representation; combined with other robot sensors and the state (x) history, it feeds a policy network that outputs actions]
Why is manipulation (and control in general) hard?
Partly because data is scarce.
Image source: Ho et al. 2020
Denoiser can be conditioned on additional inputs, \(u\): \(p_\theta(x_{t-1} | x_t, u) \)
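To make the conditioning concrete, here is a minimal sketch of one conditional reverse (denoising) step in the DDPM style of Ho et al. 2020; the tiny MLP denoiser and all names are illustrative assumptions, not any paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in x_t, conditioned on an extra input u."""
    def __init__(self, x_dim: int, u_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + u_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, x_dim),
        )

    def forward(self, x_t, u, t):
        # Real models embed t (e.g. sinusoidally); a raw scalar keeps this short.
        t_feat = t.float().view(-1, 1)
        return self.net(torch.cat([x_t, u, t_feat], dim=-1))

def reverse_step(model, x_t, u, t, betas):
    """Sample x_{t-1} ~ p_theta(x_{t-1} | x_t, u), the DDPM reverse update."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    eps = model(x_t, u, torch.full((x_t.shape[0],), t))
    mean = (x_t - betas[t] / torch.sqrt(1.0 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return mean + torch.sqrt(betas[t]) * noise
```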
Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M–150M parameters
Training time: 3–6 GPU-days (~$150–$300)
I do think there is something deep happening here...
If we really understand this, can we do the same via principles from a model? Or will control go the way of computer vision and language?
What if we did have a good model? (and well-specified objective)
Lee et al., Learning quadrupedal locomotion over challenging terrain, Science Robotics, 2020
Connect unstructured world with structured algorithms
It turns out that human activity on the internet produces a massive amount of knowledge, in the form of text, that is really useful!
LLMs can copy the logic and extrapolate it!
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
What do task-based affordances remind us of in MDP/RL?
Value functions!
[Value Function Spaces, Shah, Xu, Lu, Xiao, Toshev, Levine, Ichter, ICLR 2022]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
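A minimal sketch of this SayCan-style combination (the interfaces `llm_score` and `value_fn` are assumed for illustration, not the authors' code): the LLM says which skill is useful for the instruction, and the value function says which skill is feasible from the current state.

```python
def select_skill(instruction, state, skills, llm_score, value_fn):
    """Pick the skill maximizing p_LLM(skill | instruction) * V_skill(state).

    llm_score(instruction, skill): LLM's task-grounding score ("say").
    value_fn(skill, state): learned affordance / value function ("can").
    """
    scores = {s: llm_score(instruction, s) * value_fn(s, state) for s in skills}
    return max(scores, key=scores.get)
```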
(The demo won't embed in the PDF, but the direct link below works.)
Text, Image, Video -> Text
Video -> 3D shape
UniSim: Learning Interactive Real-World Simulators, Du et al., 2023
Instruction: Make a Line
Open challenges:
Two mainstream methods:
- predict pairwise amino-acid distances, or
- predict 3D coordinates directly
Tradeoff between representation compactness and structure.
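A toy illustration of the two representations (the coordinates are random, purely for shape): a pairwise-distance map can be computed from 3D coordinates and is invariant to rotation and translation, which is one side of the tradeoff above.

```python
import numpy as np

coords = np.random.randn(10, 3)                  # 10 residues, (x, y, z) each
diffs = coords[:, None, :] - coords[None, :, :]  # (10, 10, 3) displacements
dist_map = np.linalg.norm(diffs, axis=-1)        # (10, 10) symmetric distance map
```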
credit: Manolis/Regina's class
Hardware section slides credit and link:
Suppose we "trivialize" theorem proving into exam-style T/F questions. What are some common strategies?
- If we really understand the material, jump straight to invoking the key lemmas/punchlines and derive the answer
- If less sure, guess T or F, form a conjecture, then try to prove it
- If less sure still, try out a few examples (if possible) and hope for counterexamples or intuition (a toy sketch follows)
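A toy version of that last strategy; the conjecture tested below (trace(AB) = trace(BA)) is my illustrative choice, not from the slides.

```python
import numpy as np

def search_counterexample(conjecture, trials=1000, n=4):
    """Randomly sample matrix pairs; return a counterexample if one exists."""
    rng = np.random.default_rng(0)
    for _ in range(trials):
        A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
        if not conjecture(A, B):
            return A, B            # found a counterexample: conjecture is False
    return None                    # evidence (not proof) the conjecture is True

holds = lambda A, B: np.isclose(np.trace(A @ B), np.trace(B @ A))
print(search_counterexample(holds))  # None, since trace(AB) = trace(BA) holds
```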
Finding things similar to our problem:
For past proofs that "look like" this one, induction techniques were "often" used, so an assistant may suggest "try induction".
So we need "good" characterizations of facts/statements (a lot of research there): what are these "theorems" about?
Nuanced/semantic characterization: what are the assumptions?
First such ML-aided system created about 15 years ago
Previously, computers helped to:
- find counterexamples
- accelerate calculations
- do symbolic reasoning
Divide and conquer: try to propose intermediate lemmas.
Can try to do this without a concrete proof blueprint.
Take a further step: explore rich "proof library" ecosystems like Lean's mathlib.
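To make the Lean reference concrete, a minimal Lean 4 sketch of the "try induction" suggestion from earlier; the theorem is a standard toy example, chosen for illustration.

```lean
-- Illustrative Lean 4 proof by induction (the assistant's "try induction" step).
theorem zero_add' (n : Nat) : 0 + n = n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.add_succ, ih]
```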
Find an open-ended goal in a statistical way.
Square matrices => system suggests that it's true for arbitrary matrices =>
GPTs can read a book and reference a true statement
“The AI Index 2023 Annual Report,” HAI, Stanford University, April 2023.
“The AI Index 2024 Annual Report,” HAI, Stanford University, April 2024.