Towards Scientific Foundation Models and
How They Could Change ML In Survey Astronomy

2025 IDIES Annual Symposium

François Lanusse

CNRS Researcher @ AIM, CEA Paris-Saclay
Polymathic AI

The Deep Learning Boom in Astrophysics

astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks

The vast majority of these results have relied on supervised learning and networks trained from scratch.

The Limits of Traditional Deep Learning

  • Limited Supervised Training Data
    • Rare or novel objects have, by definition, few labeled examples
       
    • In Simulation Based Inference (SBI), training a neural compression model requires many simulations
       
  • Limited Reusability
    • Existing models are trained supervised on a specific task, and specific data.

=> In practice, this limits how easily deep learning can be used for analysis and discovery

Meanwhile, in Computer Science...

The Rise of The Foundation Model Paradigm

  • Foundation Model approach
    • Pretrain models on pretext tasks, without supervision, on very large scale datasets.
       
    • Adapt pretrained models to downstream tasks. 

The Advantage of Scale of Data and Compute

Can we translate these innovations into a paradigm shift in machine learning for scientific applications?

Polymathic

Advancing Science through Multi‑Disciplinary AI

Why a Dedicated Effort?

Transposing these methodologies to scientific data and problems brings
unique challenges

 

  • Scientific Data is Complex and Diverse
    • Impacts data collection
    • Requires dedicated architectures
       
  • Adoption by Scientists Requires Strategies for Rigorous Adaptation to Downstream Use Cases.

Data
Challenge

The Challenges of Scientific Data

  • The success of recent foundation models is driven by large corpora of uniform data (e.g. LAION-5B).
  • Scientific data comes with many additional challenges:
    • Metadata matters
    • Wide variety of measurements/observations
    • Accessing and formatting data requires very specific expertise

Credit:DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE

The Multimodal Universe 

Enabling Large-Scale Machine Learning with 100 TB of Astronomical Scientific Data

Collaborative project with about 30 contributors
Presented at the NeurIPS 2024 Datasets & Benchmarks Track

The MultiModal Universe Project

  • Goal: Assemble the first large-scale multimodal dataset for machine learning in astrophysics.
  • Strategy:
    • Engage with a broad community of AI+Astro experts.
    • Target large astronomical surveys, varied types of instruments, many different astrophysics sub-fields.
    • Adopt standardized conventions for storing and accessing data and metadata through mainstream tools (e.g. Hugging Face Datasets).
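For example, once a survey has been ingested, a subset can be streamed with the standard Hugging Face datasets API. The sketch below is illustrative only: the dataset path and field layout are assumptions, and the actual subset names are listed in the repository.

# Minimal sketch, assuming the Hugging Face `datasets` library;
# the dataset path below is a hypothetical placeholder.
from itertools import islice
from datasets import load_dataset

dset = load_dataset(
    "MultimodalUniverse/legacysurvey",  # hypothetical subset name
    split="train",
    streaming=True,                     # avoid downloading everything at once
)

for example in islice(dset, 4):
    print(example.keys())               # e.g. image cutout, photometry, metadata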

Ground-based imaging from Legacy Survey

Space-based imaging from JWST

MultiModal Universe Infrastructure

Presented at the NeurIPS 2024 Datasets & Benchmarks Track

https://github.com/MultimodalUniverse/MultimodalUniverse

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

  • 55B tokens from 3M frames
    => The first ImageNet-scale dataset for fluids
     
  • 18 subsets spanning problems in astro, bio, aerospace, chemistry, atmospheric science, and more.
     
  • Simple, self-documented HDF5 files, with PyTorch readers provided.
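Since the files are plain HDF5, they can also be inspected directly. A minimal sketch with h5py, where the file name and internal layout are assumptions; the provided PyTorch readers handle the real schema:

# Minimal sketch using plain h5py; the file name and key layout are
# assumptions, not the documented schema of The Well.
import h5py

with h5py.File("turbulent_radiative_layer_2D.hdf5", "r") as f:
    # The files are self-documenting: walk the hierarchy to list
    # the stored fields and their dimensions.
    f.visititems(lambda name, obj: print(name, getattr(obj, "shape", "")))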

Presented at the NeurIPS 2024 Datasets & Benchmarks Track

https://polymathic-ai.org/the_well

Polymathic is among the top 5 AI4Science organizations on Hugging Face

Architecture
Challenge

The Universal Neural Architecture Challenge

Possible architectures span a spectrum, from most specific to most general:

  • Independent models for all types of data
  • AstroCLIP (Parker et al. 2024)
  • Early-fusion multimodal models
  • A single model capable of processing all types of data, e.g. Bytes Are All You Need (Horton et al. 2023)


AstroCLIP


Early-Fusion Multimodal Models

Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)

Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)

AION-1

Omnimodal Foundation Model for
Astronomical Surveys

Accepted at NeurIPS 2025, spotlight presentation at NeurIPS 2025 AI4Science Workshop

Project led by:

François Lanusse, Liam Parker, Jeff Shen, Tom Hehir, Ollie Liu, Lucas Meyer, Sebastian Wagner-Carena, Helen Qu, Micah Bowles

Diverse data modalities for diverse science cases

(Blanco Telescope and Dark Energy Camera.
Credit: Reidar Hahn/Fermi National Accelerator Laboratory)

(Subaru Telescope and Hyper Suprime Cam. Credit: NAOJ)

(Dark Energy Spectroscopic Instrument)

(Sloan Digital Sky Survey. Credit: SDSS)

(Gaia Satellite. Credit: ESA/ATG)

  • Galaxy formation
  • Cosmology
  • Stellar physics
  • Galaxy archaeology
  • ...

 

Standardizing all modalities through tokenization 

  • For each modality class (e.g. image, spectrum), we build dedicated metadata-aware tokenizers
     
  • For AION-1, we integrate 39 different modalities (different instruments, different measurements, etc.)

Each tokenizer is trained with a noise-weighted reconstruction loss through a finite scalar quantization (FSQ) bottleneck:

\mathcal{L} = \left\| \Sigma^{-\frac{1}{2}} \left( x - d_\theta\!\left( \lfloor e_\theta(x) \rfloor_{\text{FSQ}} \right) \right) \right\|_2^2
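A minimal PyTorch sketch of this objective, assuming an encoder, a decoder, and a per-element noise variance are already defined; the simple rounding step below stands in for the FSQ quantizer and uses a straight-through estimator:

import torch

def tokenizer_loss(x, sigma2, encoder, decoder):
    """Noise-weighted reconstruction loss through a quantized bottleneck.

    x:       batch of observations (e.g. image cutouts or spectra)
    sigma2:  per-element noise variance, standing in for Sigma
    encoder, decoder: e_theta and d_theta in the equation above
    """
    z = encoder(x)
    # Stand-in for FSQ: round to the nearest code, with a straight-through
    # estimator so gradients still flow back into the encoder.
    z_q = z + (torch.round(z) - z).detach()
    x_hat = decoder(z_q)
    # || Sigma^{-1/2} (x - x_hat) ||_2^2, averaged over the batch
    per_example = ((x - x_hat) ** 2 / sigma2).flatten(start_dim=1).sum(dim=1)
    return per_example.mean()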

Universal Spectral Tokenizer

Accepted at the NeurIPS 2025 Machine Learning and the Physical Sciences Workshop

Jeff Shen

Any-to-Any Modeling with Generative Masked Modeling

  • Training is done by pairing observations of the same objects from different instruments.
  • Each input token is tagged with a modality embedding that specifies provenance metadata.
     
  • The model is trained by cross-modal generative masked modeling (Mizrahi et al. 2023)
    => It learns the joint and all conditional distributions of the provided modalities:
\forall m, n \quad p(x_m \mid x_n)
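A heavily simplified sketch of one such training step, assuming discrete token ids for two paired modalities and a transformer `model` that returns per-position logits; the actual recipe samples input/target budgets across all modalities rather than masking a single one:

import torch
import torch.nn.functional as F

MASK_ID = 0  # reserved token id for masked positions (an assumption)

def masked_modeling_step(model, tokens_a, tokens_b, mask_frac=0.5):
    """One cross-modal masked-modeling step on a pair of modalities.

    tokens_a, tokens_b: (batch, length) integer token ids for two
    observations of the same objects (e.g. an image and a spectrum).
    """
    targets = tokens_b.clone()
    mask = torch.rand(tokens_b.shape, device=tokens_b.device) < mask_frac
    inputs_b = torch.where(mask, torch.full_like(tokens_b, MASK_ID), tokens_b)

    # Concatenate the modalities; in AION-1 each token also carries a
    # modality/provenance embedding added inside the model.
    logits = model(torch.cat([tokens_a, inputs_b], dim=1))

    # Predict only the masked positions of the target modality.
    logits_b = logits[:, tokens_a.shape[1]:]
    return F.cross_entropy(logits_b[mask], targets[mask])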

AION-1 family of models

  • Models were trained as part of the 2024 Jean Zay Grand Challenge, following the machine's extension with a new partition of 1,400 H100s
  • AION-1 Base: 300 M parameters
    • 64 H100s - 1.5 days
  • AION-1 Large: 800 M parameters
    • 100 H100s - 2.5 days
  • AION-1 XLarge: 3B parameters 
    • 288 H100s - 3.5 days

Technical details

  • Training based on pure PyTorch Fully Sharded Data Parallel (FSDP), ZeRO Stage 2 for the main models
    • Stage 3 for the 13B model
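A rough sketch of the corresponding PyTorch setup (the model constructor is a placeholder); in FSDP terms, ZeRO Stage 2 roughly corresponds to SHARD_GRAD_OP and Stage 3 to FULL_SHARD:

import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

dist.init_process_group("nccl")

model = build_aion_model()  # placeholder for the actual model constructor

# ZeRO Stage 2: shard gradients and optimizer state, keep full parameters on each rank.
model = FSDP(model, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)

# For the largest run, ShardingStrategy.FULL_SHARD (~ ZeRO Stage 3) also shards parameters.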

Examples of out-of-the-box capabilities

Survey translation

p(\bm{x}_{HSC} | \bm{x}_{DES} )

Spectrum super-resolution

p(\bm{x}_{DESI} | \bm{x}_{GAIA} )

Example of emergent multimodal understanding

p(\bm{x}_{DESI} | \bm{x}_{HSC} )
  • Direct association between DESI and HSC was excluded during pretraining
    => This task is out of distribution!

Accelerating Downstream Science

Rethinking the way we use Deep Learning 

Conventional scientific workflow with deep learning

  • Build a large training set of realistic data
  • Design a neural network architecture for your data 
  • Deal with data preprocessing/normalization issues
  • Train your network on some GPUs for a day or so
  • Apply your network to your problem
  • Throw the network away...
    => Because it's completely specific to your data, and to the one task it's trained for.

Conventional researchers @ CMU, circa 2016

CMU DeepLens (Lanusse et al 2017)

Rethinking the way we use Deep Learning 

Foundation Model-based Scientific Workflow

  • Build a small training set of realistic data
  • Design a neural network architecture for your data
  • Deal with data preprocessing/normalization issues
  • Adapt a model in a matter of minutes
  • Apply your model to your problem
  • No more throwing the network away
    => The pretrained model is not specific to one dataset or to a single task.

Several of these steps are already taken care of by the pretrained model.

=> Let's discuss embedding-based adaptation

\mathbf{z} = f_\theta(\mathbf{x})

Adaptation of AION-1 embeddings

Adaptation at low cost 
with simple strategies:

 

  • Mean pooling + linear probing:
y = \mathbf{M} \sum_i z_i
  • Attentive pooling:
y = \operatorname{softmax} \left(\frac{\mathbf{Q} \mathbf{K}^\top(z)}{\sqrt{d}} \right) \mathbf{V}(z)
  • Can be used trivially on any input data
  • Flexible to varying number/types of inputs
    => Allows for trivial data fusion
# Illustrative pseudocode for the embedding-based adaptation workflow:
x_train = Tokenize(hsc_images, modality='HSC')

model = FineTunedModel(base='Aion-B',
                       adaptation='AttentivePooling')
model.fit(x_train, y_train)

y_test = model.predict(x_test)
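A minimal sketch of what the attentive-pooling head above could look like in PyTorch, operating on a frozen sequence of AION-1 token embeddings; all names and dimensions here are assumptions, not the released API:

import torch
import torch.nn as nn

class AttentiveProbe(nn.Module):
    """Learned-query attention pooling over frozen token embeddings,
    followed by a linear head; a sketch of the equation above."""

    def __init__(self, dim, n_outputs):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))  # learned Q
        self.key = nn.Linear(dim, dim)    # K(z)
        self.value = nn.Linear(dim, dim)  # V(z)
        self.head = nn.Linear(dim, n_outputs)

    def forward(self, z):                 # z: (batch, n_tokens, dim)
        attn = torch.softmax(
            (self.query @ self.key(z).transpose(1, 2)) / z.shape[-1] ** 0.5,
            dim=-1,
        )                                  # (batch, 1, n_tokens)
        pooled = (attn @ self.value(z)).squeeze(1)
        return self.head(pooled)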

Physical parameter estimation and data fusion for galaxies

[Plots: predicted physical parameters, comparing inputs of measured fluxes only vs. measured fluxes + image]

Morphology classification by Linear Probing

[Table: linear-probe morphology classification scores across train/eval survey combinations, compared against DINOv2]

Physical parameter estimation for stars

Semantic segmentation

Segmenting central bar and spiral arms in galaxy images based on Galaxy Zoo 3D

Example-based retrieval

[Plot: retrieval quality measured by the nDCG@10 score]

AION-Search: Natural Language Semantic Retrieval

Spotlight at 2025 NeurIPS AI4Science Workshop

Nolan Koblischke

[Plot: retrieval quality measured by the nDCG@10 score]

Why are such Foundation Models useful for Scientists?

  • Never have to retrain my own neural networks from scratch
    • Existing pre-trained models would already be near optimal, no matter the task at hand
    • Saves a lot of time and energy


       
  • Practical large-scale deep learning even in the very-few-examples regime
    • Searching for very rare objects in large astronomical surveys becomes possible


       
  • If the information is embedded in a space where it is linearly accessible, very simple analysis tools are enough for downstream analysis
    • Embeddings can be included as part of the data processing of future surveys

Polymathic's recipe for developing Multimodal Scientific Models

Takeaways 

Engagement with Scientific Communities

Data Curation And Aggregation

Dedicated ML R&D

Follow us online!

The AION-1 papers will be on arXiv next week, and the models will be available for download!

Thank you for listening!

Talk at the 2025 IDIES Annual Symposium at Johns Hopkins University
