The Polymathic AI Initiative
Towards Multidisciplinary Scientific Foundation Models

Francois Lanusse

Simons Foundation/CNRS

 

on behalf of Shirley Ho and the Polymathic AI team

The Rise of The Foundation Model Paradigm

  • Foundation Model approach
    • Pretrain models on pretext tasks, without supervision, on very large scale datasets.
    • Adapt pretrained models to downstream tasks. 
    • Combine pretrained modules in more complex systems.

Can we translate these innovations into a paradigm shift in machine learning for scientific applications?

Polymathic

 

Advancing Science through Multi‑Disciplinary AI

Our mission: to usher in a new class of machine learning for scientific data, building models that can leverage shared concepts across disciplines.

Meet the Polymathic AI Team

  • Colm-Cille Caulfield (University of Cambridge)
  • Leslie Greengard (Flatiron Institute, New York University)
  • David Ha (Sakana AI)
  • Yann LeCun (Meta AI, New York University)
  • Stephane Mallat (École Normale Supérieure, Collège de France, Flatiron Institute)
  • David Spergel (Simons Foundation)
  • Olga Troyanskaya (Flatiron Institute, Princeton University)
  • Laure Zanna (New York University)

Our Resources

SCIENTIFIC ADVISORY GROUP

COMPUTING RESOURCES

  • Internal resources at the Flatiron Institute: H100 GPUs (24 nodes equivalent to NVIDIA DGX-H100)
  • External GPU grants (A100, H100)

The Foundation Model Spectrum

Language-like/less structured

Structured-data

Scientific Reasoning 

Multi-Modality

Generalization to Data-Limited Domains

How can we build foundation models that jump across scientific disciplines?

  • Should we treat scientific data the way we treat language?
  • Should we treat scientific data the way it has been treated in ML so far?
    (structured as grids, images, videos, graphs, etc.)
  • What is a common basis across multiple disciplines and modalities?

The Foundation Model Spectrum

Language-like/less structured

Structured-data

AstroCLIP
Cross-Modal Pretraining for Astronomical data

MPP
Multiple Physics Pretraining for Physical Surrogate Models

Scientific Reasoning 

Multi-Modality

Generalization to Data-Limited Domains

xVal
 A Continuous Number Encoding for LLMs

MPP

Multiple Physics Pretraining for Physical Surrogate Models

Project led by Michael McCabe, Bruno Régaldo, Liam Parker, Ruben Ohana, Miles Cranmer
Best paper award at the NeurIPS 2023 AI4Science Workshop

Physical Systems from PDEBench

Navier-Stokes

Incompressible

Compressible

Shallow Water

Diffusion-Reaction

Takamoto et al. 2022

Can we improve performance of surrogate models by pretraining on large quantities of easily simulatable systems? 

Compositionality and Pretraining

MPP (Multiple Physics Pretraining): a single model for varied systems

Balancing objectives during training

Normalized MSE:
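The formula on this slide did not survive extraction. As a rough sketch only, assuming the common definition in which the mean squared error of a field is divided by the mean squared magnitude of its target, the idea can be written as follows. The function name `normalized_mse` and the `eps` stabilizer are illustrative choices, not taken from the talk.

```python
def normalized_mse(pred, target, eps=1e-12):
    """Mean squared error divided by the mean squared magnitude of the target.

    Assumed definition, for illustration; `eps` guards against division by zero.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    n = len(target)
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    scale = sum(t ** 2 for t in target) / n
    return mse / (scale + eps)


# Two fields with very different magnitudes but the same 10% relative error
small = normalized_mse([0.011, 0.022], [0.010, 0.020])
large = normalized_mse([1100.0, 2200.0], [1000.0, 2000.0])
```

Under this definition both fields contribute about the same loss, so a system with large field values does not dominate the training objective.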

Experiment 1: Performance on Pretraining Tasks

Context size: 16 frames

Experiment 2: Transfer

Compressible Navier-Stokes

M = 0.1

M = 1.0

Going further

  • Methodology improvements for long rollout predictions.
  • Larger and more diverse datasets

PDEBench

The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning

  • 55B tokens from 3M frames
    => First ImageNet-scale dataset for fluids
     
  • 18 subsets spanning problems in astro, bio, aerospace, chemistry, atmospheric science, and more.
     
  • Simple self-documented HDF5 files, with PyTorch readers provided.

=> Available early September

The Data Diversity Challenge

  • Success of recent foundation models is driven by large corpora of uniform data (e.g. LAION-5B).
  • Scientific data comes with many additional challenges:
    • Metadata matters
    • Wide variety of measurements/observations

Credit: DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE

Towards Large Multi-Modal Data Models

Most General

Most Specific

Independent models for every type of observation

Single model capable of processing all types of observations 

Bytes Are All You Need (Horton et al. 2023)

AstroCLIP

AstroCLIP

Cross-Modal Pre-Training for Astronomical Foundation Models

Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in Monthly Notices of the Royal Astronomical Society

What is CLIP?

Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)

The AstroCLIP approach

  • We use spectra and multi-band images as two different views of the same underlying object.
     
  • DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.

Cosine similarity search
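A minimal sketch of what a cosine similarity search over embeddings can look like. The toy two-dimensional vectors and the helper names `cosine_similarity` and `search` are made up for illustration; real embeddings would come from the trained encoders and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query, database, top_k=2):
    """Return indices of the top_k database embeddings, most similar first."""
    scores = [cosine_similarity(query, emb) for emb in database]
    ranked = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


# Toy embeddings standing in for a catalog of objects (hypothetical values)
database = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
nearest = search([2.0, 0.1], database, top_k=2)
```

Because cosine similarity ignores vector length, the query matches entries pointing in the same direction regardless of their norm. If images and spectra are embedded in a shared space, the query and the database can in principle come from different modalities.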

The Information Point of View

  • Minimizing the InfoNCE loss maximizes a lower bound on the mutual information between modalities
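For a batch of N pairs, the bound usually quoted for InfoNCE is I(X;Y) >= log N - L, so a lower loss implies a tighter guarantee on the shared information. The sketch below is one possible symmetric InfoNCE loss over paired image and spectrum embeddings. The temperature value, the function names, and the toy vectors are assumptions for illustration, not the settings used in AstroCLIP.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def info_nce(image_emb, spectrum_emb, temperature=0.1):
    """Symmetric InfoNCE loss; row i of each input describes the same object."""
    imgs = [_normalize(v) for v in image_emb]
    specs = [_normalize(v) for v in spectrum_emb]
    logits = [
        [sum(a * b for a, b in zip(i, s)) / temperature for s in specs]
        for i in imgs
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


# Matching pairs give a low loss; mismatched pairs give a high one
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.1], [0.1, 2.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 2.0], [2.0, 0.1]])
```

The loss rewards embeddings in which each image is closer to its own spectrum than to any other spectrum in the batch, and vice versa.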

Shared physical information about galaxies between images and spectra

=> We are building summary statistics for the physical parameters describing an object in a completely data-driven way

  • Redshift Estimation From Images

Supervised baseline

[Figure: redshift estimation results; axis label z_{true}]
  • Zero-shot prediction                    
    • k-NN regression

 

 

  • Few-shot prediction
    • MLP head trained on top of frozen backbone
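The zero-shot option above can be illustrated with a small k-NN regressor over frozen embeddings. This is a sketch under stated assumptions: Euclidean distance and unweighted averaging are choices made here, and the embeddings and redshifts are invented toy values.

```python
import math


def knn_regress(query, train_embeddings, train_redshifts, k=3):
    """Predict a redshift as the mean over the k nearest training embeddings."""
    distances = [math.dist(query, emb) for emb in train_embeddings]
    nearest = sorted(range(len(train_embeddings)), key=lambda i: distances[i])[:k]
    return sum(train_redshifts[i] for i in nearest) / len(nearest)


# Hypothetical embeddings with known redshifts
train_embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [1.1, 1.0]]
train_redshifts = [0.10, 0.12, 0.11, 0.50, 0.52]
z_pred = knn_regress([0.05, 0.05], train_embeddings, train_redshifts, k=3)
```

No network weights are updated in this scheme; the labeled examples only serve as a lookup table in embedding space. The few-shot variant would instead train a small MLP on the same frozen embeddings.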

Evaluation of the model: Parameter Inference

  • Galaxy Physical Property Estimation from Images and Spectra

We use estimates of galaxy properties from the PROVABGS catalog (Hahn et al. 2023), which is based on Bayesian spectral energy distribution (SED) modeling of DESI spectroscopy and photometry.

R^2 of regression
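For reference, the R^2 score reported here is the standard coefficient of determination. A minimal sketch, with a hypothetical helper name and toy numbers:

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


perfect = r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only = r2_score([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A value of 1 means the predictions match the catalog values exactly, while a value of 0 means the model does no better than always predicting the mean.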

Negative Log Likelihood of Neural Posterior Inference

Example-based retrieval

Example of Science Application: Identifying Galaxy Tidal Features

What This New Paradigm Could Mean for Astrophysicists

  • Never have to retrain my own neural networks from scratch
    • Existing pre-trained models would already be near optimal, no matter the task at hand
    • Saves a lot of time and energy
       
  • Practical large-scale deep learning even in the few-example regime
    • Searching for very rare objects in large astronomical surveys becomes possible
    • Pretraining on the data itself ensures that all sorts of image artifacts are already folded into the training.

  • If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
    • In the future, survey pipelines may add vector embeddings of detected objects to catalogs; these would be enough for most tasks, without the need to go back to pixels

Going Further: Collection and Curation of Scientific Data

  • Development of large models requires access to "web scale" datasets
     
  • Astrophysics generates large amounts of publicly available data,
    • BUT, data is usually not stored or structured in an ML friendly way.
       
  • Accessing and using scientific data requires significant expertise, for each dataset.
    => Implies engaging with domain experts.

The MultiModal Universe Project

  • Goal: Assemble the first large-scale multi-modal dataset for machine learning in astrophysics.
  • Main pillars:
    • Engage with a broad community of AI+Astro experts.
    • Adopt standardized conventions for storing and accessing data and metadata through mainstream tools (e.g. Hugging Face Datasets).
    • Target large astronomical surveys, varied types of instruments, many different astrophysics sub-fields.

Multiband images from Legacy Survey

=> Official release early September

Early Fusion Multi-modal Data Models

New Generation of Token-Based Multimodal Models

Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)

Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)

All-to-All Foundation Models

  • Learns the joint and all conditional distributions of the provided modalities: \forall m,n \quad p(x_m | x_n)
  • Can be further fine-tuned to build specialist models for new tasks.

Scientific Data Tokenization

Input

Reconstructed

Our strategy: 

  • Develop modality-specific but universal tokenizers, i.e. a single model to embed all types of astronomical images
     
  • This requires specific innovations to take into account the metadata of observations.

Example of strategy to embed different bands

Field Embedding Strategy Developed for Multiple Physics Pretraining (McCabe et al. 2023)

Looking Forward at Polymathic

  • Next year we are focusing on scaling up (more domains, more data, larger models) and developing the next generation of our models.
     
  • We are hiring!
    • Postdoctoral positions
    • Research engineer positions

Follow us online!

Thank you for listening!