Towards A New Era of Multi-Modal Self-Supervised Learning for Astrophysics
Francois Lanusse
The Deep Learning Boom in Astrophysics
astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks
The vast majority of these results have relied on supervised learning and on networks trained from scratch.
The Limits of Traditional Deep Learning
Limited Supervised Training Data
- Rare or novel objects have, by definition, few labeled examples
- In Simulation-Based Inference (SBI), training a neural compression model requires many simulations
Limited Reusability
- Existing models are trained with supervision on a specific task and on specific data
=> In practice, this limits the ease of using deep learning for analysis and discovery
Meanwhile, in Computer Science...
The Rise of The Foundation Model Paradigm
Foundation Model approach:
- Pretrain models on pretext tasks, without supervision, on very large scale datasets.
- Adapt pretrained models to downstream tasks.
- Combine pretrained modules in more complex systems.
The Advantage of Scale of Data and Compute
Linearly Accessible Information
- Backbones of modern architectures embed input images as vectors in ℝ^d, where d is typically between 512 and 2048.
- Linear probing refers to training a single matrix to adapt this vector representation to the desired downstream task.
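To make the idea concrete, here is a minimal linear-probing sketch in Python; the embeddings and labels are random placeholders standing in for the output of a frozen pretrained backbone and a downstream label (e.g. a binary morphology flag).

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder embeddings from a frozen backbone (N objects, d = 1024) and labels
rng = np.random.default_rng(0)
z_train = rng.normal(size=(5000, 1024))
z_test = rng.normal(size=(1000, 1024))
y_train = rng.integers(0, 2, size=5000)

# Linear probe: a single linear map trained on top of the frozen embeddings
probe = LogisticRegression(max_iter=1000).fit(z_train, y_train)
y_pred = probe.predict(z_test)

Everything upstream of the probe stays frozen; only this single linear layer is fit on the downstream labels.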
What This New Paradigm Could Mean for Us Astrophysicists
- Never have to retrain my own neural networks from scratch
  - Existing pre-trained models would already be near optimal, no matter the task at hand
- Practical large-scale deep learning even in the very-few-examples regime
  - Searching for very rare objects in large surveys like Euclid or LSST becomes possible
- If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
- In the future, survey pipelines may add vector embeddings of detected objects to catalogs; these would be enough for most tasks, without the need to go back to pixels
Polymathic
SCIENTIFIC ADVISORY GROUP
- Colm-Cille Caulfield (University of Cambridge)
- Leslie Greengard (Flatiron Institute, New York University)
- David Ha (Sakana AI)
- Yann LeCun (Meta AI, New York University)
- Stéphane Mallat (École Normale Supérieure, Collège de France, Flatiron Institute)
- David Spergel (Simons Foundation)
- Olga Troyanskaya (Flatiron Institute, Princeton University)
- Laure Zanna (New York University)
The Data Diversity Challenge
- The success of recent foundation models is driven by large corpora of uniform data (e.g. LAION-5B).
- Scientific data comes with many additional challenges:
- Metadata matters
- Wide variety of measurements/observations
Credit: Melchior et al. 2021
Credit: DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE
The Multimodal Universe
Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data
Collaborative project with about 30 contributors
Accepted at NeurIPS 2024 Datasets & Benchmark track
The barriers to universal datasets
- Development of large models requires access to "web-scale" datasets
- Astrophysics generates large amounts of publicly available data, BUT:
  - data is usually not stored or structured in an ML-friendly way (e.g. as postage stamps)
  - data access varies significantly between surveys
  - accessing and using scientific data requires significant expertise, for each dataset
The MultiModal Universe Project
- Goal: Assemble the first large-scale multi-modal dataset for machine learning in astrophysics.
- Main pillars:
- Engage with a broad community of AI+Astro experts.
- Adopt standardized conventions for storing and accessing data and metadata through mainstream tools (e.g. Hugging Face Datasets).
- Target large astronomical surveys, varied types of instruments, many different astrophysics sub-fields.
Multiband images from Legacy Survey
MMU Infrastructure
Data schema and storage
- For each example, MMU expects a few mandatory fields:
  - object_id, ra, dec
- For each modality, MMU expects the data to be formatted according to a fixed schema which contains necessary metadata.
- Data is stored in HDF5 files, split according to HEALPix regions for efficient cross-matching and easy access
hsc
├── hsc.py
├── pdr3_dud_22.5
│ ├── healpix=1104
│ │ └── 001-of-001.hdf5
│ ├── healpix=1105
│ │ └── 001-of-001.hdf5
│ ├── healpix=1106
│ │ └── 001-of-001.hdf5
│ ├── healpix=1107
│ │ └── 001-of-001.hdf5
│ ├── healpix=1171
│ │ └── 001-of-001.hdf5
│ ├── healpix=1172
│ │ └── 001-of-001.hdf5
│ ├── healpix=1174
│ │ └── 001-of-001.hdf5
│ ├── healpix=1175
│ │ └── 001-of-001.hdf5
│ ├── healpix=1702
│ │ └── 001-of-001.hdf5
...
Content of v1
Usage example
import matplotlib.pyplot as plt
from datasets import load_dataset

# Open the Hugging Face dataset in streaming mode
dset_ls = load_dataset("MultimodalUniverse/legacysurvey",
                       streaming=True,
                       split='train')
dset_ls = dset_ls.with_format("numpy")
dset_iterator = iter(dset_ls)

# Draw one example from the dataset iterator
example = next(dset_iterator)

# Let's inspect what is contained in an example
print(example.keys())

# Plot the postage stamp in each band
plt.figure(figsize=(12, 5))
for i, b in enumerate(example['image']['band']):
    plt.subplot(1, len(example['image']['band']), i + 1)
    plt.title(f'{b}')
    plt.imshow(example['image']['flux'][i], cmap='gray_r')
    plt.axis('off')
plt.show()
dict_keys(['image', 'blobmodel', 'rgb', 'object_mask', 'catalog', 'EBV', 'FLUX_G', 'FLUX_R', 'FLUX_I', 'FLUX_Z', 'FLUX_W1', 'FLUX_W2', 'FLUX_W3', 'FLUX_W4', 'SHAPE_R', 'SHAPE_E1', 'SHAPE_E2', 'object_id'])
Takeaways
- The Multimodal Universe makes it possible to
- access in one place a large amount of ML-ready data
- easily cross-match between different surveys and data modalities (see the sketch below)
- This is only the first initiative, probably not the last.
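Since ra and dec are mandatory fields for every MMU example, cross-matching two surveys reduces to a standard sky match. Below is a minimal sketch with astropy; the coordinate arrays are placeholders standing in for the ra/dec columns of two MMU datasets.

import numpy as np
import astropy.units as u
from astropy.coordinates import SkyCoord

# Placeholder ra/dec (degrees) standing in for two MMU datasets
ra1, dec1 = np.array([150.10, 150.20]), np.array([2.10, 2.20])
ra2, dec2 = np.array([150.1001, 150.30]), np.array([2.1001, 2.40])

cat1 = SkyCoord(ra=ra1 * u.deg, dec=dec1 * u.deg)
cat2 = SkyCoord(ra=ra2 * u.deg, dec=dec2 * u.deg)

# Nearest-neighbour match of cat1 against cat2, keeping pairs closer than 1 arcsec
idx, sep2d, _ = cat1.match_to_catalog_sky(cat2)
matched = sep2d < 1 * u.arcsec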
How can we work as a community towards universal data repositories usable for ML training?
The Neural Architecture Challenge
- Most specific: independent models for every type of observation
- Most general: a single model capable of processing all types of observations
- Reference points along this spectrum: Bytes Are All You Need (Horton et al. 2023), AstroCLIP
AstroCLIP
Cross-Modal Pre-Training for Astronomical Foundation Models
Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in the Monthly Notices of the Royal Astronomical Society
What is CLIP?
Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)
One model, many downstream applications!
Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)
Hierarchical Text-Conditional Image Generation with CLIP Latents (Ramesh et al. 2022)
The AstroCLIP approach
- We use spectra and multi-band images as our two different views for the same underlying object.
- DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.
Cosine similarity search
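A minimal sketch of what this search looks like on top of the learned embeddings; the query and database arrays are random placeholders for AstroCLIP image and spectrum embeddings.

import numpy as np

# Placeholder embeddings: one query object and a database of candidates
rng = np.random.default_rng(0)
query = rng.normal(size=512)
database = rng.normal(size=(10000, 512))

# Normalize so that a dot product is a cosine similarity
query = query / np.linalg.norm(query)
database = database / np.linalg.norm(database, axis=1, keepdims=True)

# Retrieve the 5 most similar objects
similarity = database @ query
top5 = np.argsort(similarity)[::-1][:5]

Because images and spectra live in the same embedding space, the query and the database can come from different modalities.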
- Redshift estimation from images, compared against a supervised baseline:
  - Zero-shot prediction: k-NN regression in the frozen embedding space (sketched below)
  - Few-shot prediction: MLP head trained on top of the frozen backbone
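A sketch of the zero-shot k-NN regression baseline; the embeddings and redshifts are placeholders for frozen AstroCLIP image embeddings and spectroscopic redshifts.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Placeholder frozen image embeddings and spectroscopic redshifts
rng = np.random.default_rng(0)
emb_train = rng.normal(size=(5000, 512))
emb_test = rng.normal(size=(1000, 512))
z_spec_train = rng.uniform(0.0, 0.6, size=5000)

# Zero-shot prediction: average the redshifts of the nearest neighbours
# in embedding space, with no training of the backbone at all
knn = KNeighborsRegressor(n_neighbors=16).fit(emb_train, z_spec_train)
z_pred = knn.predict(emb_test)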
Evaluation of the model: Parameter Inference
- Galaxy Physical Property Estimation from Images and Spectra
We use estimates of galaxy properties from the PROVABGS catalog (Hahn et al. 2023), obtained through Bayesian spectral energy distribution (SED) modeling of DESI spectroscopy and photometry.
R² of regression (higher is better)
Negative Log Likelihood of Neural Posterior Inference (lower is better)
- Galaxy Morphology Classification
Classification Accuracy
We test a galaxy morphology classification task, using labels from the GZ-5 dataset (Walmsley et al. 2021).
The AstroCLIP Model
- For images, we use a ViT-L Transformer, pre-trained on 70M images using DINOv2 prior to the CLIP alignment.
- For spectra, we use a decoder-only Transformer working at the level of spectral patches.
DINOv2 (Oquab et al. 2023) Image Pretraining
- Common practice for SOTA CLIP models is to first pretrain the image encoder before CLIP alignment.
- We adopt the state-of-the-art DINOv2 self-supervised learning model for this initial large-scale training.
- We pretrain the DINOv2 model on ~70 million postage stamps from DECaLS.
Figure: PCA of patch features; dense semantic segmentation; dense depth estimation
The Information Point of View
- The InfoNCE loss is a lower bound on the Mutual Information between modalities (see the bound written out below)
- Shared physical information about galaxies between images and spectra
=> We are building summary statistics for the physical parameters describing an object in a completely data-driven way
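For reference, the bound in question (van den Oord et al. 2018), written with f denoting the temperature-scaled similarity between an image embedding x_i and a spectrum embedding y_j, and N the batch size:

\[
\mathcal{L}_{\mathrm{InfoNCE}}
  = -\,\mathbb{E}\left[\log
      \frac{\exp\big(f(x_i, y_i)\big)}
           {\sum_{j=1}^{N} \exp\big(f(x_i, y_j)\big)}\right],
\qquad
I(X; Y) \;\geq\; \log N - \mathcal{L}_{\mathrm{InfoNCE}}.
\]

Minimizing the contrastive loss therefore tightens a lower bound on the information shared between the two modalities, which is exactly the physical information common to images and spectra of the same object.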
A Surprising Observation
Redshift information in image embedding
Redshift information in spectra embedding
=> We find in practice that our contrastive alignment behaves similarly to Canonical Correlation Analysis
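To illustrate the kind of check behind this statement, here is a hedged sketch that fits a linear CCA between the two sets of embeddings and inspects the correlation of the leading canonical directions; the arrays are synthetic placeholders, not AstroCLIP outputs.

import numpy as np
from sklearn.cross_decomposition import CCA

# Synthetic paired embeddings sharing a low-dimensional latent signal
rng = np.random.default_rng(0)
shared = rng.normal(size=(5000, 8))
z_img = shared @ rng.normal(size=(8, 512)) + 0.1 * rng.normal(size=(5000, 512))
z_spec = shared @ rng.normal(size=(8, 256)) + 0.1 * rng.normal(size=(5000, 256))

# Linear CCA: directions in each embedding space that are maximally correlated
cca = CCA(n_components=8).fit(z_img, z_spec)
u, v = cca.transform(z_img, z_spec)
canonical_corr = [np.corrcoef(u[:, k], v[:, k])[0, 1] for k in range(8)]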
Detecting Galaxy Tidal Features Using Self-Supervised Representation Learning
Project led by Alice Desmons, Francois Lanusse, Sarah Brough
The Neural Architecture Challenge
- Most specific: independent models for every type of observation
- Most general: a single model capable of processing all types of observations
- Reference points along this spectrum: Bytes Are All You Need (Horton et al. 2023), AstroCLIP, and next:
"Multimodal Generative Pretraining for Large Data Models"
Multimodal Large Data Models for Astrophysics
What will make Francois happy?
- A single pre-trained model which can operate on any input data type
  - I no longer need to worry about what network to use on some data
- Emergent deep understanding of the data, informed by cross-modal information
- A downstream task could be specified with just a few examples
New Generation of Token-Based Multimodal Models
Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)
Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)
Why Is It Interesting to Us?
Galaxy Image Segmentation
Walmsley & Spindler (2023)
Galaxy Image Deblending
=> Foundation Models that build a deep understanding of the data at the pixel level.
Standardizing data modalities through Tokenization
Figure: input and reconstructed examples; example of a strategy to embed different bands (sketched below)
Field Embedding Strategy Developed for Multiple Physics Pretraining (McCabe et al. 2023)
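A rough sketch of this kind of per-band field embedding, under my own simplifying assumptions (names and shapes are illustrative, not the actual MPP code): each band gets its own learned linear embedding of its patches, and the embedded bands are summed into one shared token sequence.

import numpy as np

# Placeholder patchified multi-band image: (n_bands, n_patches, pixels_per_patch)
rng = np.random.default_rng(0)
bands = ['g', 'r', 'i', 'z']
n_patches, patch_pixels, d_model = 64, 256, 768
patches = rng.normal(size=(len(bands), n_patches, patch_pixels))

# One learned linear embedding per band (random placeholder weights here)
band_weights = {b: 0.01 * rng.normal(size=(patch_pixels, d_model)) for b in bands}

# Embed each band separately, then sum into a single sequence of shared tokens
tokens = sum(patches[k] @ band_weights[b] for k, b in enumerate(bands))
# tokens has shape (n_patches, d_model), regardless of which bands are present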
Technical Aspects of Code Quantization
- Finite Scalar Quantization (FSQ)
- Lookup-Free Quantization (LFQ)
Figure: original image compared with its VQ and LFQ reconstructions (both schemes sketched below)
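For intuition, a minimal numpy sketch of the two quantization schemes named above, ignoring the straight-through gradient trick used during training; the latent vector and the number of levels are arbitrary assumptions.

import numpy as np

# A placeholder low-dimensional latent vector produced by an encoder
z = np.array([0.7, -1.3, 0.2, 2.1, -0.4])

# Finite Scalar Quantization: bound each channel, then round it onto a small
# grid of levels; the codebook is the implicit product of all per-channel grids.
levels = 7
z_fsq = np.round(np.tanh(z) * (levels // 2))      # integers in [-3, 3]

# Lookup-Free Quantization: quantize each channel to +/-1 by its sign; the
# token index is simply the integer encoded by the resulting binary code.
z_lfq = np.where(z > 0, 1, -1)
token_index = int(''.join('1' if s > 0 else '0' for s in z_lfq), 2)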
Any-to-Any Modeling with Generative Masked Modeling
- Each input token is tagged with a modality embedding that specifies its type and provides metadata (e.g. HSC image, DESI spectrum).
- The model learns the joint and all conditional distributions of the provided modalities (see the note below).
- It can be further fine-tuned to build specialist models for new tasks.
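One way to read that claim (my own gloss, not a formal statement from the project): for a random split of the token sequence x into a masked subset x_M and an observed subset x_O, generative masked modeling trains the model to approximate

\[
p_\theta(x_M \mid x_O), \qquad M \cap O = \emptyset, \quad M \cup O = \{1, \dots, T\}.
\]

Since the mask is resampled at every training step, the model is exposed to all such conditionals; conditioning on one modality and masking another gives cross-modal generation, and masking everything recovers the joint distribution.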
Preview of model capabilities:
- Conditional generation
- Similarity search
- Survey translation
- Redshift estimation
Early results: Scaling and Transfer
What does such a framework give us?
- Tokenization provides a very convenient interface to the raw data.
- Data fusion (e.g. images and time series) becomes trivial.
- With data ingestion and neural architecture taken care of, deep learning finally boils down to providing a training set and loss function.
x_train = Tokenize(hsc_images, modality='HSC')
y_train = Tokenize(redshift, modality='z')
model = FineTunedModel(base='LSSTGPT_y1').fit(x_train, y_train)
y_test = model.predict(x_test)
Follow us online!
Thank you for listening!
Astroinformatics 2024, Puerto Natales