Polymathic

Colm-Cille Caulfield University of Cambridge	Leslie Greengard Flatiron Institute New York University	David Ha Sakana AI	Yann LeCun Meta AI New York University
Stephane Mallat École Normale Supérieure Collège de France Flatiron Institute	David Spergel Simons Foundation	Olga Troyanskaya Flatiron Institute Princeton University	Laure Zanna New York University

SCIENTIFIC ADVISORY GROUP

Alberto Bietti	Kyunghyun Cho	Miles Cranmer	Michael Eickenberg	Siavash Golkar	Keiya Hirashima
Shirley Ho	Geraud Krawezik	Francois Lanusse	Nick Lourie	Michael McCabe	Ruben Ohana
Liam Parker	Mariel Pettee	Bruno Regaldo	Lucas Meyer	Rudy Morel

The Foundation Model Spectrum

Language-like/less structured

Structured-data

Scientific Reasoning

Multi-Modality

Generalization to Data-Limited Domains

How can we build foundation models that jump across scientific disciplines?

Should we treat scientific data as if we treat language?
Should we treat scientific data the way we have been treating them in ML?
(structured as in grids, images, videos, graphs, etc...)
What is a common basis across multiple disciplines and modalities?

The Foundation Model Spectrum

Language-like/less structured

Structured-data

AstroCLIP
Cross-Modal Pretraining for Astronomical data

MPP
Multiple Physics Pretraining for Physical Surrogate Models

Scientific Reasoning

Multi-Modality

Generalization to Data-Limited Domains

xVal
A Continuous Number Encoding for LLMs

MPP

Multiple Physics Pretraining for Physical Surrogate Models

Project led by Michael McCabe, Bruno Régaldo, Liam Parker, Ruben Ohana, Miles Cranmer
Accepted at NeurIPS 2024, Best paper award at the NeurIPS 2023 AI4Science Workshop

Physical Systems from PDEBench

Navier-Stokes

Incompressible

Compressible

Shallow Water

Diffusion-Reaction

Takamoto et al. 2022

Can we improve performance of surrogate models by pretraining on large quantities of easily simulatable systems?

Compositionality and Pretraining

Experiment 1: Performance on Pretraining Tasks

Context size: 16 frames

Experiment 2: Transfer

Compressible Navier-Stokes

M = 0.1

M = 1.0

Going further

Methodology improvements for long roll out predictions.
Larger and more diverse datasets

PDEBench

The Foundation Model Spectrum

Language-like/less structured

Structured-data

AstroCLIP
Cross-Modal Pretraining for Astronomical data

MPP
Multiple Physics Pretraining for Physical Surrogate Models

Scientific Reasoning

Multi-Modality

Generalization to Data-Limited Domains

xVal
A Continuous Number Encoding for LLMs

The Foundation Model Spectrum

Language-like/less structured

Structured-data

AstroCLIP
Cross-Modal Pretraining for Astronomical data

MPP
Multiple Physics Pretraining for Physical Surrogate Models

Scientific Reasoning

Multi-Modality

Generalization to Data-Limited Domains

xVal
A Continuous Number Encoding for LLMs

The Data Diversity Challenge

Success of recent foundation models is driven by large corpora of uniform data (e.g LAION 5B).
Scientific data comes with many additional challenges:
- Metadata matters
- Wide variety of measurements/observations

Credit: Melchior et al. 2021

Credit:DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE

The Multimodal Universe

Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Collaborative project with about 30 contributors
Accepted at NeurIPS 2024 Datasets & Benchmark track

Going Further: Collection and Curation of Scientific Data

Development of large models requires access to "web scale" datasets
Astrophysics generates large amounts of publicly available data,
- BUT, data is usually not stored or structured in an ML friendly way.
Accessing and using scientific data requires significant expertise, for each dataset.
=> Implies engaging with domain experts.

Schuhmann et al. (2022)

Credit: Melchior et al. 2021

Presented at NeurIPS 2024

https://github.com/MultimodalUniverse/MultimodalUniverse

Towards Large Multi-Modal Data Models

Most General

Most Specific

Independent models for every type of observation

Single model capable of processing all types of observations

Lanusse et al. 2020

Liang et al. 2023

Towards Large Multi-Modal Data Models

Most General

Most Specific

Independent models for every type of observation

Single model capable of processing all types of observations

Lanusse et al. 2020

Liang et al. 2023

Bytes Are All You Need (Horton et al. 2023)

Towards Large Multi-Modal Data Models

Most General

Most Specific

Independent models for every type of observation

Single model capable of processing all types of observations

Lanusse et al. 2020

Liang et al. 2023

Bytes Are All You Need (Horton et al. 2023)

AstroCLIP

Cross-Modal Pre-Training for Astronomical Foundation Models

Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in Monthly Notices of Royal Astronomical Society

What is CLIP?

Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)

The AstroCLIP approach

We use spectra and multi-band images as our two different views for the same underlying object.
DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.

Cosine similarity search

The Information Point of View

The InfoNCE loss is a lower bound on the Mutual Information between modalities

Shared physical information about galaxies between images and spectra

van den Oord et al. (2018)

=> We are building summary statistics for the physical parameters describing an object in a completely data driven way

Daunhawer et al. (2023)

What This New Paradigm Could Mean for Astrophysicists

Never have to retrain my own neural networks from scratch
- Existing pre-trained models would already be near optimal, no matter the task at hand
- Saves a lot of time and energy
Practical large scale Deep Learning even in very few example regime
- Searching for very rare objects in large astronomical surveys becomes possible
- Pretraining on data itself ensures that all sorts of image artifacts are already folded in the training.
If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
- In the future, survey pipelines may add vector embedding of detected objects into catalogs, these would be enough for most tasks, without the need to go back to pixels