Introduction to Foundation Models

Francois Lanusse

SOS 2026 Summer School, Aussois, June 2026

https://slides.com/eiffl/Aussois2026

The Deep Learning Boom in Astrophysics

astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks

The vast majority of these results has relied on supervised learning and networks trained from scratch.

The Limits of Traditional Deep Learning

Limited Supervised Training Data
- Rare or novel objects have by definition few labeled examples
- In Simulation Based Inference (SBI), training a neural compression model requires many simulations
Limited Reusability
- Existing models are trained supervised on a specific task, and specific data.

Huang et al. (2019)

Zhang, Bloom, Gaudi, Lanusse, Lam, Lu (2021)

=> Limits in practice the ease of using deep learning for analysis and discovery

Can we make use of all the unlabeled data we have access to?

The Rise of The Foundation Model Paradigm

Foundation Model approach
- Pretrain models on pretext tasks, without supervision, on very large scale datasets.
- Adapt pretrained models to downstream tasks.
- Combine pretrained modules in more complex systems.

Bommasani et al. 2021

The Advantage of Scale of Data and Compute

Liu et al. 2022

Linearly Accessible Information

Zhai et al. 2022

Backbone of modern architectures embed input images as vectors in where d can typically be between 512 to 2048.
Linear probing refers to training a single matrix to adapt this vector representation to the desired downstream task.

\mathbb{R}^{d}

y = \mathbf{M} \ h_\theta(x)

Rethinking the way we use Deep Learning

Conventional scientific workflow with deep learning

Build a large training set of realistic data
Design a neural network architecture for your data
Deal with data preprocessing/normalization issues
Train your network on some GPUs for a day or so
Apply your network to your problem
Throw the network away...
=> Because it's completely specific to your data, and to the one task it's trained for.

Conventional researchers @ CMU
Circa 2016

CMU DeepLens (Lanusse et al 2017)

Rethinking the way we use Deep Learning

Foundation Model-based Scientific Workflow

Build a small training set of realistic data
Design a neural network architecture for your data
Deal with data preprocessing/normalization issues
Adapt a model in a matter of minutes
Apply your model to your problem
Throw the network away...
=> Because it's completely specific to your data, and to the one task it's trained for.

Bommasani et al. 2021

Already taken care of

y = \mathbf{M} \ h_\theta(x)

What This New Paradigm Could Mean for Us

Never have to retrain my own neural networks from scratch
- Existing pre-trained models would already be near optimal, no matter the task at hand
- Saves a lot of time and energy
Practical large scale Deep Learning even in very few example regime
- Searching for very rare objects in large surveys like Euclid or LSST becomes possible
- Pretraining on data itself ensures that all sorts of image artifacts are already folded in the training.
If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
- In the future, survey pipelines may add vector embedding of detected objects into catalogs, these would be enough for most tasks, without the need to go back to pixels

... but how does it work?

Before we dive in...
A short primer on Transformers

Attention is all you need

Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser, Polosukhin (2017)

\text{Attention}(Q,K,V) = \text{softmax}\!\left( \frac{Q K^\top}{\sqrt{d_k}} \right) V

https://iclr-blogposts.github.io/2025/blog/positional-embedding/

A Minimal Code Example

https://github.com/EiffL/Tutorials/TimeSeries/QuasarLightcurvesTransformer.ipynb


import jax
import jax.numpy as jnp
import optax
import flax.linen as nn

class MLP(nn.Module):
    features: int

    @nn.compact
    def __call__(self, x):
        x = nn.Dense(self.features)(x)
        x = nn.relu(x)
        x = nn.Dense(self.features)(x)
        return x

class TransformerBlock(nn.Module):
    d_model: int
    num_heads: int

    @nn.compact
    def __call__(self, x, mask):
        z = nn.LayerNorm()(x)
        z = nn.MultiHeadDotProductAttention(num_heads=self.num_heads)(z, mask=mask)
        x = x + z
        z = nn.LayerNorm()(x)
        x += MLP(self.d_model)(z)
        return x

class LightCurveTransformer(nn.Module):
    d_model: int
    num_heads: int
    num_layers: int

    @nn.compact
    def __call__(self, x, mask):
      # Building attention mask
      attention_mask = nn.make_attention_mask(mask, mask)

      # Positional embedding
      pos = get_positional_encoding(self.d_model)(x)

      # Embedding layer
      x = nn.Dense(self.d_model)(x[:,:,1:5]) # Extracting the flux light curves
      x += pos

      # Transformer blocks
      for i in range(self.num_layers):
        x = TransformerBlock(self.d_model, self.num_heads)(x, attention_mask)

      # Global average pooling
      x = jnp.mean(x, axis=1)

      # Output layer
      x = nn.Dense(1)(x)

      return x.squeeze()

More about positional embedding:

Vision Transformer (ViT) (Dosovitskiy et al. 2020)

Text

Masked Modeling

Sometimes, the simplest things just work...

An idea that comes from language models

p( x_m | x_0, \ldots, x_{m-1}, x_{m+1}, \ldots, x_{N})

p(x_{i+1}| x_{0},\ldots, x_{i})

i.e. BERT (Devlin et al. 2018)

Masked Autoencoders Are Scalable Vision Learners (He et al. 2021)

Masked Auto Encoding (MAE)

\mathcal{L} = \parallel M \odot ( x - f_\theta(x)) \parallel_2^2

How to use such a model for classification

Credit: (Liang et al. 2022)

Application to Dense Prediction Tasks

Kirillov et al. 2023

Example of Application on Astronomical Data: Specformer

Pre-Training Galaxy Spectra Representation by Masked Modeling

\mathcal{L}_{\textrm{MM}} = \frac{1}{NK} \sum_{j=1}^K \sum_{i=1}^N \textbf{m}_i \cdot (\textbf{x}_{i} - \hat{\textbf{x}}_{i})^2,

Multi-View Self-Supervised Contrastive Learning

MultiView Contrastive Learning e.g. SimCLR (Chen et al. 2000)

Contrastive Learning in Astrophysics

Self-Supervised similarity search for large scientific datasets (Stein et al. 2021)

https://georgestein-galaxy-search-streamlit-app-o7accj.streamlit.app/

Detecting Galaxy Tidal Features Using Self-Supervised Representation Learning

Project led by Alice Desmons, Francois Lanusse, Sarah Brough

Credit: Piotr bojanowski

DiNOv2 (Oquab et al. 2023)

PCA of patch features

Dense Semantic Segmentation

Dense Depth Estimation

Dinov3 (Siméoni et al. 2025)

https://ai.meta.com/research/dinov3/

Weakly Supervised Contrastive Learning

Or what you can do when you do have independent views of an object...

Contrastive Language Image Pretraining

Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)

The Information Point of View

The InfoNCE loss is a lower bound on the Mutual Information between modalities

Shared information

van den Oord et al. (2018)

(Oquab et al. 2023)

One model, many downstream applications!

Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)

Hierarchical Text-Conditional Image Generation with CLIP Latents (Ramesh et al. 2022)

The AstroCLIP approach

We use spectra and multi-band images as our two different views for the same underlying object.
DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.

Cosine similarity search

Evaluation of the model

Cross-Modal similarity search

Image Similarity

Spectral Similarity

Image-Spectral Similarity

Redshift Estimation From Images

Supervised baseline

z_{true}

Zero-shot prediction
- k-NN regression

Few-shot prediction
- MLP head trained on top of frozen backbone

Now you try it!

https://tinyurl.com/astroclip-tuto

https://tinyurl.com/astroclip-tuto-solution

Multimodal Masked Modeling

Or let's do one massive model of everything!

AION-1
AstronomIcal Omnimodal Network

with extensive support from the rest of the team.

Project led by:

Francois
Lanusse

Liam
Parker

Jeff
Shen

Tom
Hehir

Ollie
Liu

Lucas
Meyer

Leopoldo
Sarra

Sebastian Wagner-Carena

Helen
Qu

Micah
Bowles

Diverse data modalities for diverse science cases

(Blanco Telescope and Dark Energy Camera.
Credit: Reidar Hahn/Fermi National Accelerator Laboratory)

(Subaru Telescope and Hyper Suprime Cam. Credit: NAOJ)

(Dark Energy Spectroscopic Instrument)

(Sloan Digital Sky Survey. Credit: SDSS)

(Gaia Satellite. Credit: ESA/ATG)

Galaxy formation
Cosmology
Stellar physics
Galaxy archaeology
...

Standardizing all modalities through tokenization

For each modality class (e.g. image, spectrum) we build dedicated metadata-aware tokenizers
For Aion-1, we integrate 39 different modalities (different instruments, different measurements, etc.)

\mathcal{L} = \parallel \Sigma^{- \frac{1}{2}} \left( x - d_\theta( \lfloor e_\theta(x) \rfloor_{\text{FSQ}} \right) \parallel_2^2

Field Embedding Strategy Developed for
Multiple Physics Pretraining (McCabe et al. 2023)

DES g

DES r

DES i

DES z

HSC g

HSC r

HSC i

HSC z

HSC y

(Mentzer et al. 2023)

Any-to-Any Modeling with Generative Masked Modeling

Training is done by pairing observations of the same objects from different instruments.
Each input token is tagged with a modality embedding that specifies provenance metadata.
Model is trained by cross-modal generative masked modeling (Mizrahi et al. 2023)
=> Learns the joint and all conditional distributions of provided modalities:

\forall m,n \quad p(x_m | x_n)