Francois Lanusse
Simons Foundation/CNRS
on behalf of Shirley Ho and the Polymathic AI team
Our mission: "to usher in a new class of machine learning for scientific data, building models that can leverage shared concepts across disciplines."
SCIENTIFIC ADVISORY GROUP
Colm-Cille Caulfield (University of Cambridge)
Leslie Greengard (Flatiron Institute, New York University)
David Ha (Sakana AI)
Yann LeCun (Meta AI, New York University)
Stephane Mallat (École Normale Supérieure, Collège de France, Flatiron Institute)
David Spergel (Simons Foundation)
Olga Troyanskaya (Flatiron Institute, Princeton University)
Laure Zanna (New York University)
COMPUTING RESOURCES
How can we build foundation models that jump across scientific disciplines?
[Diagram: key challenges - language-like/less structured data vs. structured data, scientific reasoning, multi-modality, generalization to data-limited domains]
AstroCLIP: Cross-Modal Pretraining for Astronomical Data
MPP: Multiple Physics Pretraining for Physical Surrogate Models
xVal: A Continuous Number Encoding for LLMs (see the sketch below)
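To make the xVal idea concrete, here is a minimal sketch of the continuous number encoding: each number in the input is replaced by a single [NUM] token whose embedding is scaled by the (rescaled) numeric value, and a separate scalar head reads numbers back out at those positions. The class and variable names below are illustrative, not the official implementation.

```python
# Hedged sketch of the xVal continuous number encoding (illustrative, not the
# official code): numbers are replaced by a [NUM] token whose embedding is
# multiplied by the rescaled numeric value.
import torch
import torch.nn as nn

class XValEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, num_token_id):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.num_token_id = num_token_id  # id of the [NUM] placeholder token

    def forward(self, token_ids, numeric_values):
        # token_ids: (batch, seq) ints; numeric_values: (batch, seq) floats,
        # holding the rescaled number at [NUM] positions (ignored elsewhere).
        x = self.embed(token_ids)
        scale = torch.where(token_ids == self.num_token_id,
                            numeric_values, torch.ones_like(numeric_values))
        return x * scale.unsqueeze(-1)
```

At the output side, a small scalar "number head" on top of the transformer would predict the value wherever a [NUM] token is generated.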
MPP: Multiple Physics Pretraining for Physical Surrogate Models
Project led by Michael McCabe, Bruno Régaldo, Liam Parker, Ruben Ohana, Miles Cranmer
Best paper award at the NeurIPS 2023 AI4Science Workshop
[PDEBench systems (Takamoto et al. 2022): incompressible and compressible Navier-Stokes, shallow water, diffusion-reaction]
Can we improve the performance of surrogate models by pretraining on large quantities of easily simulatable systems?
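As a rough illustration of what such pretraining can look like, the sketch below trains one shared surrogate to predict the next frame from a 16-frame context, with batches mixed across several simulated systems and a normalized MSE objective. The function and variable names are illustrative assumptions, not the MPP codebase.

```python
# Hedged sketch of multiple-physics pretraining: a single model sees batches
# drawn from different PDE systems and is trained on next-frame prediction
# with a normalized MSE loss.
import torch

def nmse(pred, target, eps=1e-8):
    # normalized MSE, averaged over the batch
    err = ((pred - target) ** 2).flatten(1).mean(1)
    norm = (target ** 2).flatten(1).mean(1)
    return (err / (norm + eps)).mean()

def pretrain_step(model, optimizer, batches):
    # `batches`: {system_name: (context, next_frame)} drawn from different
    # simulated systems (e.g. shallow water, diffusion-reaction, ...).
    optimizer.zero_grad()
    loss = 0.0
    for name, (context, target) in batches.items():
        pred = model(context)   # context: (B, T=16, C, H, W) -> (B, C, H, W)
        loss = loss + nmse(pred, target)
    loss = loss / len(batches)
    loss.backward()
    optimizer.step()
    return loss.item()
```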
[Results: normalized MSE on PDEBench compressible Navier-Stokes at M = 0.1 and M = 1.0, with a context size of 16 frames]
=> Available early September
Credit: Melchior et al. 2021
Credit: DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE
Most specific: independent models for every type of observation
Most general: a single model capable of processing all types of observations (Bytes Are All You Need, Horton et al. 2023)
AstroCLIP
Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in Monthly Notices of the Royal Astronomical Society
Contrastive Language-Image Pretraining (CLIP, Radford et al. 2021)
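For concreteness, here is a minimal sketch of a CLIP-style contrastive objective applied to matched galaxy image/spectrum pairs; the exact encoders, temperature, and training details of AstroCLIP are not reproduced here.

```python
# Hedged sketch of a CLIP-style contrastive objective between galaxy images
# and spectra (illustrative, not the exact AstroCLIP training code).
import torch
import torch.nn.functional as F

def clip_loss(image_emb, spectrum_emb, temperature=0.07):
    # image_emb, spectrum_emb: (B, D) embeddings of matched image/spectrum pairs
    img = F.normalize(image_emb, dim=-1)
    spec = F.normalize(spectrum_emb, dim=-1)
    logits = img @ spec.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(img.shape[0], device=img.device)
    # symmetric cross-entropy: match each image to its spectrum and vice versa
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```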
Cosine similarity search (sketched below)
Images and spectra share physical information about the same galaxies
=> We are building summary statistics for the physical parameters describing an object in a completely data-driven way
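The retrieval step itself can be sketched as follows: embeddings are L2-normalized and a query object is compared to a gallery by cosine similarity. Function names and the choice of k are illustrative.

```python
# Hedged sketch of cosine-similarity retrieval in the shared embedding space:
# given a query galaxy (image or spectrum), find the most similar objects.
import torch
import torch.nn.functional as F

def retrieve(query_emb, gallery_emb, k=5):
    # query_emb: (D,), gallery_emb: (N, D); returns indices of top-k matches
    q = F.normalize(query_emb, dim=-1)
    g = F.normalize(gallery_emb, dim=-1)
    sims = g @ q                                   # (N,) cosine similarities
    return torch.topk(sims, k).indices
```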
We use estimates of galaxy properties from the PROVABGS catalog (Hahn et al. 2023), obtained from Bayesian spectral energy distribution (SED) modeling of DESI spectroscopy and photometry
[Results: regression of galaxy properties and negative log-likelihood of neural posterior inference, compared against a supervised baseline]
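A hedged sketch of this kind of downstream evaluation: freeze the embeddings and fit a simple regressor (a k-NN regressor here, as an illustrative choice) on PROVABGS property estimates, then score it on held-out galaxies.

```python
# Hedged sketch of evaluating frozen embeddings by regressing galaxy
# properties (e.g. from the PROVABGS catalog); the regressor and metric are
# illustrative choices, not necessarily those used in the paper.
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import r2_score

def evaluate_embeddings(train_emb, train_prop, test_emb, test_prop, k=16):
    reg = KNeighborsRegressor(n_neighbors=k).fit(train_emb, train_prop)
    pred = reg.predict(test_emb)
    return r2_score(test_prop, pred)
```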
Credit: Melchior et al. 2021
Multiband images from Legacy Survey
=> Official release early September
[Generality spectrum revisited: AstroCLIP now placed between independent per-observation models and a single fully general model (Bytes Are All You Need, Horton et al. 2023)]
Early Fusion Multi-modal Data Models
Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)
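As a rough sketch of what early fusion means architecturally: tokens from each modality are projected into a shared embedding space and concatenated into a single sequence processed by one transformer. The dimensions and module names below are illustrative assumptions, not the architecture of any of the cited models.

```python
# Hedged sketch of early fusion: one shared transformer over a single sequence
# that mixes tokens from several modalities.
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.image_proj = nn.Linear(1024, d_model)   # image patch features -> tokens
        self.text_embed = nn.Embedding(32000, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)

    def forward(self, image_patches, text_ids):
        # image_patches: (B, N_img, 1024); text_ids: (B, N_txt)
        tokens = torch.cat([self.image_proj(image_patches),
                            self.text_embed(text_ids)], dim=1)
        return self.backbone(tokens)   # one sequence, one shared transformer
```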
[Figure: input vs. reconstructed fields]
Our strategy: the field embedding strategy developed for Multiple Physics Pretraining (McCabe et al. 2023)
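A minimal sketch of a per-field patch embedding in this spirit: each physical field gets its own learned patch projection, and the per-field tokens are summed so that systems with different sets of fields can share one backbone. Patch size and the summation are assumptions for illustration, not the exact MPP implementation.

```python
# Hedged sketch of a per-field patch embedding inspired by the MPP field
# embedding strategy (illustrative, not the official implementation).
import torch
import torch.nn as nn

class FieldPatchEmbedding(nn.Module):
    def __init__(self, n_known_fields, d_model, patch=16):
        super().__init__()
        # one patch projection per known field type (e.g. density, pressure, vx, vy)
        self.projs = nn.ModuleList(
            [nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
             for _ in range(n_known_fields)])

    def forward(self, fields, field_ids):
        # fields: (B, C, H, W); field_ids: list of C indices into the known fields
        tokens = 0
        for c, fid in enumerate(field_ids):
            x = self.projs[fid](fields[:, c:c+1])    # (B, d_model, H/p, W/p)
            tokens = tokens + x
        return tokens.flatten(2).transpose(1, 2)     # (B, n_patches, d_model)
```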
Thank you for listening!