Siavash Golkar
Large models pre-trained on massive datasets:
They can perform a variety of downstream tasks (zero-shot generalization).
They are good starting points for fine-tuning on data-poor domains (they carry a useful inductive bias).
Our mission: to usher in a new class of machine learning for scientific data, building models that can leverage shared concepts across disciplines.
SCIENTIFIC ADVISORY GROUP
Colm-Cille Caulfield, University of Cambridge
Leslie Greengard, Flatiron Institute / New York University
David Ha, Sakana AI
Yann LeCun, Meta AI / New York University
Stéphane Mallat, École Normale Supérieure / Collège de France / Flatiron Institute
David Spergel, Simons Foundation
Olga Troyanskaya, Flatiron Institute / Princeton University
Laure Zanna, New York University
COMPUTING RESOURCES
Thanks Ian!
Scientific Reasoning
Multi-Modality
Generalization to Data-Limited Domains
How can we build foundation models that have these properties?
more efficient, less general
more general, less efficient
Language-like/less structured
Structured-data
xVal
A Continuous Number Encoding for LLMs
AstroCLIP
Cross-Modal Pretraining for Astronomical data
MPP
Multiple Physics Pretraining for Physical Surrogate Models
Project led by Michael McCabe, Bruno Régaldo, Liam Parker, Ruben Ohana, Miles Cranmer
Oral presentation at the NeurIPS 2023 AI4Science Workshop
Example: an N-body simulation evolving over time (Springel et al. 2005)
MPP: Multiple Physics Pretraining for Physical Surrogate Models
Natural choice for physical surrogate modeling
Navier-Stokes (Incompressible and Compressible)
Shallow Water
Diffusion-Reaction
(Takamoto et al. 2022)
Normalized MSE results on compressible Navier-Stokes (context size: 16 frames), at Mach numbers M = 0.1 and M = 1.0.
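For concreteness, here is a minimal sketch of how a normalized MSE can be computed for a next-frame surrogate that conditions on a 16-frame context. The normalization by the mean squared target and the `model`/`frames` interfaces are illustrative assumptions, not the exact MPP implementation.

```python
import numpy as np

def normalized_mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Normalized MSE: MSE divided by the mean squared target value.
    Note: this normalization is an assumption; the exact convention may differ."""
    return float(np.mean((pred - target) ** 2) / np.mean(target ** 2))

CONTEXT = 16  # context size in frames, as quoted on the slide

def evaluate_rollout(model, frames: np.ndarray) -> float:
    """Hypothetical evaluation loop: predict frame t from the previous 16 frames.
    frames: array of shape (T, H, W, C); `model` maps a (CONTEXT, H, W, C)
    window to the next frame. Both are placeholders for illustration."""
    errors = []
    for t in range(CONTEXT, frames.shape[0]):
        pred = model(frames[t - CONTEXT:t])
        errors.append(normalized_mse(pred, frames[t]))
    return float(np.mean(errors))
```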
Tube masking: a ViT trained to reconstruct masked pixels on natural videos, SSV2 and K400 (Tong et al. 2022)
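A rough sketch of tube masking, assuming the same randomly selected spatial patches are hidden in every frame so the masked regions form "tubes" through time; the patch grid and mask ratio below are illustrative, not the settings of Tong et al. 2022.

```python
import numpy as np

def tube_mask(num_frames: int, h_patches: int, w_patches: int,
              mask_ratio: float = 0.9, rng=None) -> np.ndarray:
    """Return a boolean mask of shape (num_frames, h_patches, w_patches).
    The same spatial patches are masked in every frame ("tubes" in time)."""
    if rng is None:
        rng = np.random.default_rng()
    num_patches = h_patches * w_patches
    num_masked = int(round(mask_ratio * num_patches))
    # Choose which spatial patches to hide, once for the whole clip.
    masked_idx = rng.choice(num_patches, size=num_masked, replace=False)
    spatial_mask = np.zeros(num_patches, dtype=bool)
    spatial_mask[masked_idx] = True
    spatial_mask = spatial_mask.reshape(h_patches, w_patches)
    # Broadcast the same spatial mask across all frames.
    return np.broadcast_to(spatial_mask, (num_frames, h_patches, w_patches)).copy()

# Illustrative usage: a 16-frame clip on a 14x14 patch grid.
mask = tube_mask(num_frames=16, h_patches=14, w_patches=14, mask_ratio=0.9)
```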
Regression Problems on Incompressible Navier-Stokes
Mixed results
Project led by Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
xVal: A Continuous Number Encoding for LLMs
arXiv:2305.18654 [cs.CL]
arXiv:2109.03137 [cs.CL]
LLMs make erratic, discontinuous predictions when handling numbers.
Even exhaustive fine-tuning does not grant out-of-distribution generalization abilities.
This encoding strategy has 3 main benefits:
Continuity
The model is now end-to-end continuous by construction.
(Standard LLMs are discontinuous at both the input and output stages.)
Interpolation
It makes better out-of-distribution predictions than other numerical encodings.
Efficiency
By using just a single token to represent any number, it requires less memory, compute resources, and training time to achieve strong results.
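A minimal sketch of the encoding idea, assuming each number is replaced by a single [NUM] placeholder token whose embedding is multiplied by the numeric value (the actual xVal additionally normalizes values and decodes numbers with a dedicated output head); the token name and helper functions here are illustrative.

```python
import re
import numpy as np

NUM_TOKEN = "[NUM]"  # single placeholder token for every number (assumed name)

def encode_numbers(text: str):
    """Replace each literal number with [NUM] and keep its value on the side.
    Punctuation handling is ignored in this sketch."""
    values = [float(m) for m in re.findall(r"-?\d+\.?\d*", text)]
    tokens = re.sub(r"-?\d+\.?\d*", NUM_TOKEN, text).split()
    return tokens, values

def embed(tokens, values, vocab_emb: dict) -> np.ndarray:
    """Scale the [NUM] embedding by the corresponding value, so the input is a
    continuous function of the number (illustrative; xVal normalizes values)."""
    out, it = [], iter(values)
    for tok in tokens:
        e = vocab_emb[tok].copy()
        if tok == NUM_TOKEN:
            e *= next(it)  # multiplicative encoding of the numeric value
        out.append(e)
    return np.stack(out)
```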
xVal shows improved predictions for out-of-distribution values.
When evaluated on multi-digit multiplication tasks, xVal performs comparably to the other encodings and is less prone to large outliers.
And when evaluated on compound operations of basic arithmetic, xVal shows the strongest performance.
Future directions: improving the dynamic range of the embedded values.
Project led by Francois Lanusse, Liam Parker, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Credit: Melchior et al. 2021
Credit: DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE
AstroCLIP: Cross-Modal Pretraining for Astronomical data
Most General
Most Specific
Independent models for every type of observation
Single model capable of processing all types of observations
Bytes Are All You Need (Horton et al. 2023)
AstroCLIP
Self-Supervised similarity search for large scientific datasets (Stein et al. 2021)
How do we add in other modalities (e.g. spectral information)?
Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)
Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)
Hierarchical Text-Conditional Image Generation with CLIP Latents (Ramesh et al. 2022)
We take a two-step approach:
Image Similarity
Spectral Similarity
Image-Spectral Similarity
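The cross-modal alignment step can be sketched as a CLIP-style symmetric contrastive (InfoNCE) loss between image and spectrum embeddings; this is a generic sketch under standard CLIP assumptions, not the exact AstroCLIP training code.

```python
import torch
import torch.nn.functional as F

def clip_loss(img_emb: torch.Tensor, spec_emb: torch.Tensor,
              temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss: matching image/spectrum pairs share a row index.
    img_emb, spec_emb: (batch, dim) embeddings from the two encoders."""
    img = F.normalize(img_emb, dim=-1)
    spec = F.normalize(spec_emb, dim=-1)
    logits = img @ spec.t() / temperature          # cosine similarities
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))
```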
UMAP representation of spectra embeddings
Shared physical information about galaxies between images and spectra
=> We are building summary statistics for the physical parameters describing an object in a completely data-driven way
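One hedged illustration of "embeddings as summary statistics": fit a simple regressor for a physical property on top of frozen embeddings. The k-NN regressor and the redshift-like target below are placeholder choices for the sketch, not the evaluation protocol of the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def fit_property_from_embeddings(embeddings: np.ndarray, z: np.ndarray):
    """embeddings: (N, D) frozen cross-modal embeddings; z: (N,) physical
    property (e.g. a redshift-like quantity). Both arrays are placeholders."""
    reg = KNeighborsRegressor(n_neighbors=16)
    reg.fit(embeddings, z)
    return reg  # reg.predict(new_embeddings) estimates the property
```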
Thank you for listening!