Towards Multidisciplinary Scientific
Foundation Models
From spatiotemporal surrogate modeling to large multimodal data models
Francois Lanusse
CNRS / Flatiron Institute
The Rise of The Foundation Model Paradigm
Foundation Model approach
- Pretrain models on pretext tasks, without supervision, on very large scale datasets.
- Adapt pretrained models to downstream tasks.
- Combine pretrained modules in more complex systems.
The Advantage of Scale of Data and Compute
Linearly Accessible Information
- Backbones of modern architectures embed input images as vectors in $\mathbb{R}^d$, where $d$ is typically between 512 and 2048.
- Linear probing refers to training a single matrix to adapt this vector representation to the desired downstream task.
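A minimal sketch of linear probing, assuming the backbone is frozen and only a linear map on top of its embeddings is trained. This is the single-output case, where the matrix reduces to a weight vector plus a bias. The embeddings, targets, and function names below are hypothetical stand-ins, not taken from the talk.

```python
import random

def train_linear_probe(embeddings, targets, lr=0.1, epochs=500):
    """Fit y ≈ w·z + b by gradient descent on the mean squared error.

    Only w and b are trained; the embeddings z come from a frozen backbone.
    """
    d = len(embeddings[0])
    n = len(embeddings)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for z, y in zip(embeddings, targets):
            err = sum(wi * zi for wi, zi in zip(w, z)) + b - y
            for i in range(d):
                grad_w[i] += 2 * err * z[i] / n
            grad_b += 2 * err / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, z):
    """Apply the trained probe to one embedding."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b

# Toy stand-in for frozen backbone outputs (hypothetical data, d = 4).
rng = random.Random(0)
true_w, true_b = [0.5, -1.0, 2.0, 0.0], 0.3
Z = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
Y = [predict(true_w, true_b, z) for z in Z]

w, b = train_linear_probe(Z, Y)
```

If the target is linearly accessible in the embedding, as in this toy case, the probe recovers it with very few trainable parameters.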
Can we translate these innovations into a paradigm shift in machine learning for scientific applications?
Polymathic
SCIENTIFIC ADVISORY GROUP
- Colm-Cille Caulfield (University of Cambridge)
- Leslie Greengard (Flatiron Institute, New York University)
- David Ha (Sakana AI)
- Yann LeCun (Meta AI, New York University)
- Stephane Mallat (École Normale Supérieure, Collège de France, Flatiron Institute)
- David Spergel (Simons Foundation)
- Olga Troyanskaya (Flatiron Institute, Princeton University)
- Laure Zanna (New York University)
Polymathic
Advancing Science through Multi‑Disciplinary AI
Our mission: to usher in a new class of machine learning for scientific data, building models that can leverage shared concepts across disciplines.
The Foundation Model Spectrum
Language-like/less structured
Structured-data
Scientific Reasoning
Multi-Modality
Generalization to Data-Limited Domains
How can we build foundation models that jump across scientific disciplines?
- Should we treat scientific data the way we treat language?
- Should we treat scientific data the way we have been treating them in ML (structured as grids, images, videos, graphs, etc.)?
- What is a common basis across multiple disciplines and modalities?
The Foundation Model Spectrum
Language-like/less structured
Structured-data
AstroCLIP
Cross-Modal Pretraining for Astronomical data
MPP
Multiple Physics Pretraining for Physical Surrogate Models
Scientific Reasoning
Multi-Modality
Generalization to Data-Limited Domains
xVal
A Continuous Number Encoding for LLMs
MPP
Multiple Physics Pretraining for Physical Surrogate Models
Project led by Michael McCabe, Bruno Régaldo, Liam Parker, Ruben Ohana, Miles Cranmer
Accepted at NeurIPS 2024, Best paper award at the NeurIPS 2023 AI4Science Workshop
Physical Systems from PDEBench
Navier-Stokes
Incompressible
Compressible
Shallow Water
Diffusion-Reaction
Takamoto et al. 2022
Can we improve performance of surrogate models by pretraining on large quantities of easily simulatable systems?
Compositionality and Pretraining
MPP (Multiple Physics Pretraining): a single model for varied systems
Balancing objectives during training
Normalized MSE:
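A minimal sketch of one common form of normalized MSE, assuming the squared error of each physical field is divided by the squared norm of its target so that fields with large magnitudes do not dominate the objective. The exact normalization used in MPP may differ, and the field names and values below are hypothetical.

```python
def normalized_mse(pred, target, eps=1e-8):
    """Squared error divided by the squared norm of the target field."""
    num = sum((p - t) ** 2 for p, t in zip(pred, target))
    den = sum(t ** 2 for t in target)
    return num / (den + eps)

def balanced_loss(preds, targets):
    """Average the per-field normalized errors so no field dominates by scale."""
    per_field = [normalized_mse(p, t) for p, t in zip(preds, targets)]
    return sum(per_field) / len(per_field)

# Two hypothetical fields with very different magnitudes
# but the same 10% relative error.
pressure = [1000.0, 2000.0, 3000.0]
velocity = [0.1, 0.2, 0.3]
pred_pressure = [1.1 * x for x in pressure]
pred_velocity = [1.1 * x for x in velocity]

loss = balanced_loss([pred_pressure, pred_velocity], [pressure, velocity])
```

With a plain MSE the pressure term would be about eight orders of magnitude larger than the velocity term; after normalization both contribute equally.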
Experiment 1: Performance on Pretraining Tasks
Context size: 16 frames
Experiment 2: Transfer
Compressible Navier-Stokes
M = 0.1
M = 1.0
Going further
- Methodology improvements for long rollout predictions.
- Larger and more diverse datasets
PDEBench
The Well: a Large-Scale Collection of Diverse Physics Simulations for Machine Learning
- 55B tokens from 3M frames
=> First ImageNet-scale dataset for fluids
- 18 subsets spanning problems in astro, bio, aerospace, chemistry, atmospheric science, and more.
- Simple self-documented HDF5 files, with PyTorch readers provided.
Accepted at NeurIPS 2024 Datasets & Benchmark Track
The Data Diversity Challenge
- Success of recent foundation models is driven by large corpora of uniform data (e.g. LAION-5B).
- Scientific data comes with many additional challenges:
- Metadata matters
- Wide variety of measurements/observations
Credit: Melchior et al. 2021
Credit: DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE
The Multimodal Universe
Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data
Collaborative project with about 30 contributors
Accepted at NeurIPS 2024 Datasets & Benchmark track
Going Further: Collection and Curation of Scientific Data
- Development of large models requires access to "web scale" datasets.
- Astrophysics generates large amounts of publicly available data,
BUT data is usually not stored or structured in an ML-friendly way.
- Accessing and using scientific data requires significant expertise, for each dataset.
=> Implies engaging with domain experts.
Credit: Melchior et al. 2021
The MultiModal Universe Project
- Goal: Assemble the first large-scale multi-modal dataset for machine learning in astrophysics.
- Main pillars:
- Engage with a broad community of AI+Astro experts.
- Adopt standardized conventions for storing and accessing data and metadata through mainstream tools (e.g. Hugging Face Datasets).
- Target large astronomical surveys, varied types of instruments, many different astrophysics sub-fields.
Multiband images from Legacy Survey
MMU Infrastructure
Presented at NeurIPS 2024
Towards Large Multi-Modal Data Models
Most General
Most Specific
Independent models for every type of observation
Single model capable of processing all types of observations
Bytes Are All You Need (Horton et al. 2023)
AstroCLIP
AstroCLIP
Cross-Modal Pre-Training for Astronomical Foundation Models
Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in Monthly Notices of the Royal Astronomical Society
What is CLIP?
Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)
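A minimal sketch of the symmetric contrastive objective used in CLIP-style training, assuming paired embeddings from two encoders, cosine similarity as the score, and a fixed temperature. The toy embeddings and function names are hypothetical; a real implementation would operate on batched tensors with a learned temperature.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def clip_loss(emb_a, emb_b, temperature=0.1):
    """Symmetric contrastive loss: matching pairs share the same index."""
    a = [l2_normalize(v) for v in emb_a]
    b = [l2_normalize(v) for v in emb_b]
    logits = [[sum(x * y for x, y in zip(u, v)) / temperature for v in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings: 'aligned' pairs each row with its true partner,
# 'shuffled' swaps the partners.
view_a = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
shuffled = [aligned[1], aligned[0]]

loss_aligned = clip_loss(view_a, aligned)
loss_shuffled = clip_loss(view_a, shuffled)
```

The loss is low when each embedding is closest to its own partner and high when partners are mismatched, which is what pulls the two modalities into a shared space.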
The AstroCLIP approach
- We use spectra and multi-band images as our two different views for the same underlying object.
- DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.
Cosine similarity search
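A minimal sketch of cosine similarity search over a set of stored embeddings, assuming the embeddings have already been computed by a pretrained encoder. The three-dimensional vectors are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query, database, k=2):
    """Return indices of the k database embeddings most similar to the query."""
    scores = [cosine_similarity(query, e) for e in database]
    return sorted(range(len(database)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical embedding database and query.
database = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.05, 0.0]
neighbours = top_k(query, database, k=2)
```

Because images and spectra share one embedding space, the query and the database may come from different modalities.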
- Redshift Estimation From Images
Supervised baseline
- Zero-shot prediction
- k-NN regression
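A minimal sketch of k-NN regression on embeddings, assuming Euclidean distance and an unweighted mean over the k nearest neighbours. No network weights are updated, which is why it can serve as a zero-shot predictor. The embeddings and redshift values are hypothetical.

```python
import math

def knn_regress(query, embeddings, labels, k=3):
    """Predict a label as the mean over the k nearest embeddings."""
    dists = [math.dist(query, e) for e in embeddings]
    nearest = sorted(range(len(embeddings)), key=lambda i: dists[i])[:k]
    return sum(labels[i] for i in nearest) / k

# Hypothetical 2-D embeddings of galaxies with known redshifts.
embeddings = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0], [5.1, 5.0]]
redshifts = [0.10, 0.12, 0.11, 0.80, 0.82]

z_pred = knn_regress([0.05, 0.05], embeddings, redshifts, k=3)
```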
Evaluation of the model: Parameter Inference
- Galaxy Physical Property Estimation from Images and Spectra
We use estimates of galaxy properties from the PROVABGS catalog (Hahn et al. 2023), obtained by Bayesian spectral energy distribution (SED) modeling of DESI spectroscopy and photometry.
$R^2$ of regression
Negative Log Likelihood of Neural Posterior Inference
The Information Point of View
- The InfoNCE loss is a lower bound on the Mutual Information between modalities
Shared physical information about galaxies between images and spectra
=> We are building summary statistics for the physical parameters describing an object in a completely data driven way
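The bound can be written explicitly. The following is a sketch of the standard InfoNCE result (van den Oord et al. 2018), using notation introduced here rather than taken from the slides: $x_i$ and $y_i$ are the image and spectrum embeddings of galaxy $i$, $s$ is a similarity score such as cosine similarity, $\tau$ is a temperature, and $N$ is the number of galaxies in a batch.

```latex
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[ \frac{1}{N} \sum_{i=1}^{N} \log
      \frac{\exp\left(s(x_i, y_i)/\tau\right)}
           {\sum_{j=1}^{N} \exp\left(s(x_i, y_j)/\tau\right)} \right],
\qquad
I(X; Y) \;\ge\; \log N - \mathcal{L}_{\mathrm{InfoNCE}} .
```

Minimizing the loss therefore raises a lower bound on the mutual information $I(X; Y)$, and the bound itself cannot exceed $\log N$.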
What This New Paradigm Could Mean for Astrophysicists
- Never have to retrain my own neural networks from scratch
- Existing pre-trained models would already be near optimal, no matter the task at hand
- Saves a lot of time and energy
- Practical large-scale Deep Learning even in the very-few-example regime
- Searching for very rare objects in large astronomical surveys becomes possible
- Pretraining on the data itself ensures that all sorts of image artifacts are already folded into the training.
- If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
- In the future, survey pipelines may add vector embeddings of detected objects to catalogs; these would be enough for most tasks, without the need to go back to pixels
Towards Large Multi-Modal Data Models
Most General
Most Specific
Independent models for every type of observation
Single model capable of processing all types of observations
Bytes Are All You Need (Horton et al. 2023)
AstroCLIP
Early Fusion Multi-modal Data Models
Multimodal Large Data Models for Astrophysics
New Generation of Token-Based Multimodal Models
Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)
Why Is It Interesting to Us?
Galaxy Image Segmentation
Walmsley & Spindler (2023)
Galaxy Image Deblending
=> Foundation Models that build a deep understanding of the data at the pixel level.
Standardizing data modalities through Tokenization
Input
Reconstructed
Any-to-Any Modeling with Generative Masked Modeling
- Each input token is tagged with a modality embedding that specifies its type and provides metadata (e.g. HSC image, DESI spectrum).
- Learns the joint and all conditional distributions of provided modalities:
- Can be further fine-tuned to build specialist models for new tasks.
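A minimal sketch of the data side of generative masked modeling, assuming every modality has already been converted to discrete tokens. Each token is tagged with its modality, then the tagged tokens are randomly split into an observed subset and a subset to predict. The modality names, token ids, and function names are hypothetical, and the model that consumes these sets is not shown.

```python
import random

def tag_tokens(modalities):
    """Flatten {modality name: token ids} into (modality, position, token) triples."""
    return [
        (name, pos, tok)
        for name, toks in modalities.items()
        for pos, tok in enumerate(toks)
    ]

def sample_mask(tagged, n_input, rng):
    """Randomly split tagged tokens into an observed set and a set to predict."""
    shuffled = tagged[:]
    rng.shuffle(shuffled)
    return shuffled[:n_input], shuffled[n_input:]

# Hypothetical tokenized observations of one object.
example = {"image": [12, 7, 93, 4], "spectrum": [501, 502, 77]}
tagged = tag_tokens(example)
observed, targets = sample_mask(tagged, n_input=3, rng=random.Random(0))
```

Because the split is resampled at every training step and ignores modality boundaries, the same model sees many different conditioning patterns, which is what allows any-to-any prediction at inference time.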
Preview of model capabilities
Conditional Generation
Similarity search
Survey translation
Redshift estimation
Early results: Scaling and Transfer
Follow us online!
Thank you for listening!
Towards Multidisciplinary Scientific Foundation Models
By eiffl
Overview talk of the Polymathic AI Initiative