Apr 3, 2026
Adam Wei
1. Methods for representation learning
2. JEPAs
3. I-JEPA
4. V-JEPA 2 and Robotics
Today's Goal!
Every figure in this talk is taken from either the JEPA or the V-JEPA 2 paper
"Invariance Methods"
"Generative Methods"
e.g. CLIP, contrastive losses, BYOL, MoCo, etc.
Limitation: encodings of \(x\) and \(y\) are invariant under some transform
... but what transform?
\(f_{\theta}(x) \approx f_{\theta}(T(x))\)
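The invariance objective \(f_{\theta}(x) \approx f_{\theta}(T(x))\) can be sketched in a few lines of numpy; the linear encoder `f_theta`, the noise augmentation `T`, and all shapes are illustrative stand-ins, not details from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))                 # stand-in linear encoder f_theta

def f_theta(x):
    return W @ x

def T(x):
    # a mild augmentation; the encoder should be invariant to it
    return x + 0.01 * rng.normal(size=x.shape)

x = rng.normal(size=16)
# invariance objective: pull f_theta(x) and f_theta(T(x)) together
loss = np.sum((f_theta(x) - f_theta(T(x))) ** 2)
```

The open question on the previous line is exactly the choice of `T`: whatever transform we pick, the encoder is trained to discard it.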
e.g. VAE, BERT, MAE
Limitation: embeddings are less semantic
not pixel-level!
\(\implies\) semantic
\(B_1\)
\(\xrightarrow{f_{\bar \theta}} s_y(1)\)
\(B_2\)
\(\xrightarrow{f_{\bar \theta}} s_y(2)\)
\(B_3\)
\(\xrightarrow{f_{\bar \theta}} s_y(3)\)
\(B_4\)
\(\xrightarrow{f_{\bar \theta}} s_y(4)\)
\(s_y(i)\) are the targets for prediction!
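How the targets are produced can be sketched as follows; modeling the target encoder \(f_{\bar\theta}\) as a frozen linear map is a simplifying assumption (in practice it is a transformer), and the dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
M, patch_dim, embed_dim = 4, 16, 8
W_bar = rng.normal(size=(embed_dim, patch_dim))          # target encoder f_theta_bar (EMA copy)

blocks = [rng.normal(size=patch_dim) for _ in range(M)]  # target blocks B_1..B_M
# encode each block with the frozen target encoder; no gradients flow here
s_y = [W_bar @ B for B in blocks]                        # targets s_y(i)
```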
\(B_x = \) crop \(\setminus \{B_i\}_{i\in [M]}\)
\(B_x\xrightarrow{f_{\theta}} s_x\)
For each of the \(M\) targets \(s_y(i)\):
\(\hat s_y(i) = g_\phi(s_x, z_i)\)
\(z_i\) is a learnable mask corresponding to \(B_i\)
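A sketch of the prediction step \(\hat s_y(i) = g_\phi(s_x, z_i)\); treating \(g_\phi\) as a linear map over the concatenated inputs is an assumption for illustration (the real predictor is a transformer that attends over context tokens).

```python
import numpy as np

rng = np.random.default_rng(1)
embed_dim, M = 8, 4
s_x = rng.normal(size=embed_dim)                    # context representation
z = rng.normal(size=(M, embed_dim))                 # learnable mask tokens, one per B_i
W_g = rng.normal(size=(embed_dim, 2 * embed_dim))   # stand-in linear predictor g_phi

def g_phi(s_x, z_i):
    # predict the target representation from context plus the mask token for B_i
    return W_g @ np.concatenate([s_x, z_i])

s_y_hat = [g_phi(s_x, z[i]) for i in range(M)]      # \hat s_y(i)
```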
Minimize:
\(\frac{1}{M} \sum_{i=1}^M D(\hat s_y(i), s_y(i))\)
\(D(\cdot, \cdot)\) is essentially \(\ell_2\)*
*\(\ell_1\) for V-JEPA
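The loss above, averaged over the \(M\) targets, looks like this; the random vectors stand in for the targets and predictions, and using a mean-squared distance for \(D\) reflects the "essentially \(\ell_2\)" note.

```python
import numpy as np

def D(a, b):
    # essentially l2 (I-JEPA); V-JEPA swaps in an l1-style distance
    return np.mean((a - b) ** 2)

M = 4
rng = np.random.default_rng(2)
s_y = [rng.normal(size=8) for _ in range(M)]        # targets
s_y_hat = [rng.normal(size=8) for _ in range(M)]    # predictions
loss = sum(D(s_y_hat[i], s_y[i]) for i in range(M)) / M
```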
\(\xrightarrow{f_{\theta}} s_x\)
\(\xrightarrow{f_{\bar \theta}} s_y(i)\)
\(\xrightarrow{g_\phi(s_x, z_i)} \hat s_y(i)\)
\(D(\hat s_y(i), s_y(i))\)
"Target"
"Context"
\(g_\phi\)
\(f_\theta\)
\(f_{\bar \theta}\)
Learn \(\theta\) and \(\phi\) with gradient descent
\(\bar \theta = \mathrm{EMA}(\theta)\)
Use \(f_{\bar \theta}\) for downstream tasks
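The \(\bar \theta = \mathrm{EMA}(\theta)\) update can be sketched directly; the momentum value 0.996 is a typical choice, not a number from the talk, and the "gradient step" is a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(8, 16))     # online encoder weights, trained by gradient descent
theta_bar = theta.copy()             # target encoder weights

def ema_update(theta_bar, theta, tau):
    # \bar\theta <- tau * \bar\theta + (1 - tau) * \theta, applied after each optimizer step
    return tau * theta_bar + (1.0 - tau) * theta

theta = theta - 0.01 * rng.normal(size=theta.shape)   # stand-in gradient step on theta
theta_bar = ema_update(theta_bar, theta, tau=0.996)   # 0.996 is a typical momentum (assumption)
```

Because \(\bar\theta\) only tracks \(\theta\) and is never updated by the loss, the targets cannot collapse as easily as when both sides are trained jointly.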
V-JEPA 2
"world model"
Predictor
Inputs:
Output:
Predictor is our world model!
Planning:
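Planning with the predictor-as-world-model can be sketched as simple random shooting in latent space; the linear `predictor`, all dimensions, and the shooting scheme itself are illustrative stand-ins for the sampling-based planner used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, act_dim, horizon, n_cand = 8, 2, 5, 64
A = 0.1 * rng.normal(size=(embed_dim, embed_dim))
B = 0.1 * rng.normal(size=(embed_dim, act_dim))

def predictor(s, a):
    # stand-in linear world model: next latent state from (state, action)
    return s + A @ s + B @ a

s0 = rng.normal(size=embed_dim)       # encoded current observation
s_goal = rng.normal(size=embed_dim)   # encoded goal image

# sample candidate action sequences, roll each out in latent space with the
# predictor, and keep the sequence whose final state lands closest to the goal
cands = rng.normal(size=(n_cand, horizon, act_dim))
costs = []
for seq in cands:
    s = s0
    for a in seq:
        s = predictor(s, a)
    costs.append(float(np.sum((s - s_goal) ** 2)))
best = cands[int(np.argmin(costs))]
```

Note that the cost is measured entirely in representation space: no pixels are ever generated during planning.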
Results:
Some of these limitations seem surmountable?