Some Slides forked from Russ Tedrake
Image credit: Boston Dynamics
Speaker: Shen Shen
May 8, 2025
MIT Undergraduate Math Association
DARPA Robotics Challenge
2015
1. First-principle model
2. Lyapunov analysis
Search for a \(V\) such that:
(can be generalized for synthesis)
3. Optimization
Monomial basis \(m(x)\)
Gram matrix \(Q\)
[Shen and Tedrake, "Sampling Quotient-Ring Sum-of-Squares Verification for Scalable Verification", CDC, 2020
Robots are dancing and starting to do parkour, but...
what about something more useful, like loading the dishwasher?
(for robotics; in a few slides)
Input
Neural Network
Released in 2009
[Diagram: a neural network with input, hidden, and output layers; each neuron forms a linear combo of its inputs with learnable weights, followed by an activation]
compositions of ReLUs can be quite expressive
in fact, asymptotically, they can approximate any continuous function!
[image credit: Phillip Isola]
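As a minimal illustration (a hypothetical numpy sketch, not the network on the slides), a one-hidden-layer ReLU network is just a linear combo of ReLU activations, and each hidden unit contributes one kink to a piecewise-linear output:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp(x, W1, b1, W2, b2):
    # one hidden layer: linear combo -> ReLU activations -> linear combo
    h = relu(x @ W1 + b1)   # hidden activations
    return h @ W2 + b2      # output layer

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 200).reshape(-1, 1)            # inputs
W1, b1 = rng.normal(size=(1, 64)), rng.normal(size=64)    # learnable weights
W2, b2 = rng.normal(size=(64, 1)), np.zeros(1)
y = mlp(x, W1, b1, W2, b2)  # piecewise-linear output: each hidden ReLU adds one "kink"
```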
Training data
maps from complex data space to simple embedding space
[images credit: visionbook.mit.edu]
[video edited from 3b1b]
embedding
Large Language Models (LLMs) are trained in a self-supervised way
"To date, the cleverest thinker of all time was Issac. "
feature → label
To date, the → cleverest
To date, the cleverest → thinker
To date, the cleverest thinker of all time → was
To date, the cleverest thinker of all time was → Isaac
e.g., train to predict the next word
Auto-regressive
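A minimal sketch (hypothetical Python, assuming crude whitespace tokenization) of how the (feature, label) pairs above are constructed from raw text:

```python
sentence = "To date, the cleverest thinker of all time was Isaac."
tokens = sentence.split()   # crude whitespace tokenizer

# every prefix is a feature; the token that follows it is the label
pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

for feature, label in pairs[:3]:
    print(" ".join(feature), "->", label)
# To -> date,
# To date, -> the
# To date, the -> cleverest
```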
How to train? The same recipe:
[video edited from 3b1b]
[image edited from 3b1b]
Cross-entropy loss encourages the internal weights to update so as to make this probability higher
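Concretely (a hypothetical numpy sketch, not the actual model internals): the cross-entropy loss on one prediction is the negative log of the probability assigned to the correct next token, so lowering the loss means raising that probability:

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 0.5, -1.0])   # model scores over a toy 3-word vocabulary
target = 0                            # index of the correct next word

probs = softmax(logits)
loss = -np.log(probs[target])         # cross-entropy: shrinks as probs[target] grows
```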
image credit: Nicholas Pfaff
Generative Boba by Boyuan Chen in Bldg 45
😉
😉
Image credit: Adding Conditional Control to Text-to-Image Diffusion Models https://arxiv.org/pdf/2302.05543
ControlNet: refined control
Text-to-audio generation
Key Idea: denoising in many small steps is easier than attempting to remove all noise in a single step
Forward process is easy: for fixed \( \{\beta_t\}_{t \in [T]} \in (0, 1) \), let \[ q(x_t \mid x_{t-1}) := \mathcal{N}(x_t \mid \sqrt{1 - \beta_t} \, x_{t-1}, \beta_t I) \]
Equivalently, \[ x_t = \sqrt{1 - \beta_t} \, x_{t-1} + \sqrt{\beta_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
\[ \Rightarrow \quad x_t = \sqrt{\bar{\alpha}_t} \, x_0 + \sqrt{1 - \bar{\alpha}_t} \, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) \]
\(\alpha_t=1-\beta_t, \quad \bar{\alpha}_t=\prod_{s=1}^t \alpha_s\)
can think of \(\sqrt{\beta_t} \approx \sigma_t-\sigma_{t-1}\), the increment in the noise schedule
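A minimal numpy sketch of this closed-form forward process (assuming a hypothetical linear \(\beta_t\) schedule):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)       # hypothetical noise schedule {beta_t}
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0, t, rng):
    # draw x_t ~ q(x_t | x_0) in one shot: x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.normal(size=(8,))               # stand-in data point
xt, eps = q_sample(x0, t=500, rng=rng)
```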
Fact: for large \(T\), \(q(x_T) \approx \mathcal{N}(0, I)\)
Fact: \(q(x_{t-1} \mid x_t, x_0) \approx \mathcal{N}\left(x_{t-1};\, \mu(x_0, x_t),\, \beta_t I\right)\)
\[
q(x_{0:T}) = q(x_0)\, q(x_{1:T} \mid x_0) = q(x_T) \prod_{t=1}^{T} q(x_{t-1} \mid x_t) \quad \text{(by Markov)}
\]
\[
\approx \mathcal{N}(0, I) \prod_{t=1}^{T} \mathcal{N}\left(x_{t-1};\, \mu(x_0, x_t),\, \beta_t I\right) \quad \text{(by the two facts)}
\]
Choose to parameterize \(p_\theta\left(x_{t-1} \mid x_t\right)=\mathcal{N}\left(x_{t-1} ; \mu\left(x_0, x_t\right), \beta_t I\right)\), learn \(\hat{x}_0\left(x_t, t\right)\)
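For reference, the Gaussian posterior behind this \(\mu(x_0, x_t)\) (standard DDPM algebra, not spelled out on the slide) is
\[
q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\; \mu(x_0, x_t),\; \tilde{\beta}_t I\right), \qquad
\mu(x_0, x_t) = \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\, x_0 \;+\; \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\, x_t,
\]
with \(\tilde{\beta}_t = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t\) (often approximated by \(\beta_t\), as above); at sampling time the learned \(\hat{x}_0(x_t, t)\) is plugged in for \(x_0\).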
Reverse process key:
There are two important variations to this training procedure:
Re-parameterize:
\(x_t=x_0+\sigma_t \epsilon, \epsilon \sim \mathcal{N}(0, I)\)
\(x_0=z_0, \quad x_t=z_t / \sqrt{\bar{\alpha}_t}, \quad \sigma_t=\sqrt{\frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}}\)
The re-parameterized variation tends to perform better in practice and keeps the model input at constant norm
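To see the equivalence (the one algebra step the slide skips): write the earlier forward process in the original variables, \(z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon\), and divide by \(\sqrt{\bar{\alpha}_t}\):
\[
\frac{z_t}{\sqrt{\bar{\alpha}_t}} = z_0 + \sqrt{\frac{1-\bar{\alpha}_t}{\bar{\alpha}_t}}\,\epsilon
\;\;\Longrightarrow\;\;
x_t = x_0 + \sigma_t\,\epsilon .
\]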
Denoising diffusion models estimate a noise vector \(\epsilon\) \(\in \mathbb{R}^n\) from a given noise level \(\sigma > 0\) and noisy input \(x_\sigma \in \mathbb{R}^n\) such that for some \(x_0\) in the data manifold \(\mathcal{K}\),
\[ {x_\sigma} \;\approx\; \textcolor{green}{x_0} \;+\; \textcolor{orange}{\sigma}\, \textcolor{red}{\epsilon}. \]
A denoiser \(\textcolor{red}{\epsilon_\theta} : \mathbb{R}^n \times \mathbb{R}_+ \to \mathbb{R}^n\) is learned by minimizing
\[ L(\theta) := \mathbb{E}_{\textcolor{green}{x_0},\textcolor{orange}{\sigma},\textcolor{red}{\epsilon}} \Biggl[\Biggl\|\textcolor{red}{\epsilon_\theta}\Biggl(\textcolor{green}{x_0} + \textcolor{orange}{\sigma}\,\textcolor{red}{\epsilon}, \textcolor{orange}{\sigma}\Biggr) - \textcolor{red}{\epsilon}\Biggr\|^2\Biggr]. \]
Mathematically equivalent for fixed \(\sigma\), but reweights the loss by a function of \(\sigma\)!
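A hypothetical sketch of this training objective in numpy (eps_model here is a placeholder denoiser, not the architecture on the slides):

```python
import numpy as np

def training_loss(eps_model, x0_batch, sigmas, rng):
    # sample one noise level and one noise vector per example
    sigma = rng.choice(sigmas, size=(x0_batch.shape[0], 1))
    eps = rng.normal(size=x0_batch.shape)
    x_sigma = x0_batch + sigma * eps              # noisy input: x_sigma = x0 + sigma * eps
    eps_hat = eps_model(x_sigma, sigma)           # predicted noise
    return np.mean(np.sum((eps_hat - eps) ** 2, axis=1))   # ||eps_hat - eps||^2, averaged

# toy usage with a (useless) zero predictor standing in for the learned denoiser
rng = np.random.default_rng(0)
x0_batch = rng.normal(size=(16, 8))
sigmas = np.linspace(0.01, 1.0, 100)
loss = training_loss(lambda x, s: np.zeros_like(x), x0_batch, sigmas, rng)
```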
Image source: Ho et al. 2020
Denoiser can be conditioned on additional inputs, \(u\): \(p_\theta(x_{t-1} | x_t, u) \)
Image backbone: ResNet-18 (pretrained on ImageNet)
Total: 110M-150M Parameters
Training Time: 3-6 GPU Days ($150-$300)
LLMs can copy the logic and extrapolate it!
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
What do task-based affordances remind us of in MDPs/RL?
Value functions!
[Value Function Spaces, Shah, Xu, Lu, Xiao, Toshev, Levine, Ichter, ICLR 2022]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances, Ahn et al., 2022
Towards grounding everything in language
Language
Control
Vision
Tactile
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
https://socraticmodels.github.io
Lots of data
Less data
Less data
Roboticist
Vision
NLP
adapted from Tomás Lozano-Pérez
Instruction: Make a Line
What can you do right now?
https://introml.mit.edu/notes
https://slides.com/shensquared
http://manipulation.mit.edu
http://underactuated.mit.edu