Towards A New Era of Multi-Modal Self-Supervised Learning for Astrophysics

Francois Lanusse

The Deep Learning Boom in Astrophysics

Figure: astro-ph abstracts mentioning Deep Learning, CNN, or Neural Networks

The vast majority of these results have relied on supervised learning and on networks trained from scratch.

The Limits of Traditional Deep Learning

  • Limited Supervised Training Data
    • Rare or novel objects have, by definition, few labeled examples
       
    • In Simulation Based Inference (SBI), training a neural compression model requires many simulations
       
  • Limited Reusability
    • Existing models are trained in a supervised way on a specific task and specific data.

=> In practice, this limits the ease of using deep learning for analysis and discovery

Meanwhile, in Computer Science...

The Rise of The Foundation Model Paradigm

  • Foundation Model approach
    • Pretrain models on pretext tasks, without supervision, on very large scale datasets.
    • Adapt pretrained models to downstream tasks. 
    • Combine pretrained modules in more complex systems.

The Advantage of Scale of Data and Compute

Linearly Accessible Information

  • Backbones of modern architectures embed input images as vectors in \mathbb{R}^{d}, where d is typically between 512 and 2048.

  • Linear probing refers to training a single matrix to adapt this vector representation to the desired downstream task.
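As a concrete illustration, here is a minimal linear-probing sketch on top of precomputed embeddings; the embedding matrix, labels, and dimensions below are random placeholders, not taken from a real survey.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder frozen embeddings: N objects represented as d-dimensional vectors
# produced by a pretrained backbone (d typically between 512 and 2048).
embeddings = np.random.randn(1000, 1024)
labels = np.random.randint(0, 2, size=1000)   # e.g. a binary morphology label

# Linear probe: a single linear map trained on top of the frozen embeddings.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("training accuracy:", probe.score(embeddings, labels))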

What This New Paradigm Could Mean for Us Astrophysicists

  • Never have to retrain my own neural networks from scratch
    • Existing pre-trained models would already be near optimal, no matter the task at hand
       
  • Practical large-scale deep learning even in the very-few-example regime
    • Searching for very rare objects in large surveys like Euclid or LSST becomes possible
       
  • If the information is embedded in a space where it becomes linearly accessible, very simple analysis tools are enough for downstream analysis
    • In the future, survey pipelines may add vector embeddings of detected objects to catalogs; for most tasks these would be enough, without the need to go back to pixels

Polymathic

SCIENTIFIC ADVISORY GROUP
  • Colm-Cille Caulfield (University of Cambridge)
  • Leslie Greengard (Flatiron Institute, New York University)
  • David Ha (Sakana AI)
  • Yann LeCun (Meta AI, New York University)
  • Stephane Mallat (École Normale Supérieure, Collège de France, Flatiron Institute)
  • David Spergel (Simons Foundation)
  • Olga Troyanskaya (Flatiron Institute, Princeton University)
  • Laure Zanna (New York University)

The Data Diversity Challenge

  • Success of recent foundation models is driven by large corpora of uniform data (e.g. LAION-5B).
  • Scientific data comes with many additional challenges:
    • Metadata matters
    • Wide variety of measurements/observations

Credit:DESI collaboration/DESI Legacy Imaging Surveys/LBNL/DOE & KPNO/CTIO/NOIRLab/NSF/AURA/unWISE

The Multimodal Universe 

Enabling Large-Scale Machine Learning with 100TBs of Astronomical Scientific Data

Collaborative project with about 30 contributors
Accepted at NeurIPS 2024 Datasets & Benchmark track

The barrier to universal datasets

  • Development of large models requires access to "web scale" datasets
     
  • Astrophysics generates large amounts of publicly available data
    BUT:
    • data is usually not stored or structured in an ML friendly way (e.g. postage stamps).
    • data access varies significantly between surveys
       
  • Accessing and using scientific data requires significant expertise, for each dataset.

The MultiModal Universe Project

  • Goal: Assemble the first large-scale multi-modal dataset for machine learning in astrophysics.
  • Main pillars:
    • Engage with a broad community of AI+Astro experts.
    • Adopt standardized conventions for storing and accessing data and metadata through mainstream tools (e.g. Hugging Face Datasets).
    • Target large astronomical surveys, varied types of instruments, many different astrophysics sub-fields.

Multiband images from Legacy Survey

MMU Infrastructure

Data schema and storage

  • For each example MMU expects a few mandatory fields:
    • object_id, ra, dec

       
  • For each modality, MMU expects the data to be formatted according to a fixed schema that contains the necessary metadata.

  • Data is stored in HDF5 files, split according to HEALPix regions for efficient cross-matching and easy access (see the sketch after the listing below).
hsc
├── hsc.py
├── pdr3_dud_22.5
│   ├── healpix=1104
│   │   └── 001-of-001.hdf5
│   ├── healpix=1105
│   │   └── 001-of-001.hdf5
│   ├── healpix=1106
│   │   └── 001-of-001.hdf5
│   ├── healpix=1107
│   │   └── 001-of-001.hdf5
│   ├── healpix=1171
│   │   └── 001-of-001.hdf5
│   ├── healpix=1172
│   │   └── 001-of-001.hdf5
│   ├── healpix=1174
│   │   └── 001-of-001.hdf5
│   ├── healpix=1175
│   │   └── 001-of-001.hdf5
│   ├── healpix=1702
│   │   └── 001-of-001.hdf5
...
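For illustration, a minimal sketch of how the HEALPix partition holding a given object could be located from its coordinates, assuming healpy; the nside value, nested ordering, and coordinates here are illustrative assumptions rather than the project's exact convention.

import healpy as hp

# Illustrative: find the HEALPix pixel (nested ordering, nside=16 assumed)
# containing an object at (ra, dec) in degrees, then build the matching path.
ra, dec = 150.1, 2.2
pix = hp.ang2pix(16, ra, dec, nest=True, lonlat=True)
print(f"hsc/pdr3_dud_22.5/healpix={pix}/001-of-001.hdf5")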

Content of v1

Usage example

from datasets import load_dataset

# Open Hugging Face dataset
dset_ls = load_dataset("MultimodalUniverse/legacysurvey",
                       streaming=True,
                       split='train')
dset_ls = dset_ls.with_format("numpy")
dset_iterator = iter(dset_ls)

# Draw one example from the dataset iterator
example = next(dset_iterator)
     
# Let's inspect what is contained in an example
print(example.keys())

# Plot the image cutout in each band
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 5))
for i, b in enumerate(example['image']['band']):
    plt.subplot(1, 4, i + 1)
    plt.title(f'{b}')
    plt.imshow(example['image']['flux'][i], cmap='gray_r')
    plt.axis('off')

# Output of print(example.keys()):
# dict_keys(['image', 'blobmodel', 'rgb', 'object_mask', 'catalog', 'EBV', 'FLUX_G', 'FLUX_R', 'FLUX_I', 'FLUX_Z', 'FLUX_W1', 'FLUX_W2', 'FLUX_W3', 'FLUX_W4', 'SHAPE_R', 'SHAPE_E1', 'SHAPE_E2', 'object_id'])

Takeaways

  • The Multimodal Universe makes it possible to
    • access in one place a large amount of ML-ready data
    • easily cross-match between different surveys and data modalities

 

  • This is only the first initiative, probably not the last.
    How can we work as a community towards universal data repositories usable for ML training?

The Neural Architecture Challenge

  • Most Specific: independent models for every type of observation
  • Most General: a single model capable of processing all types of observations

Bytes Are All You Need (Horton et al. 2023)

AstroCLIP

Cross-Modal Pre-Training for Astronomical Foundation Models

Project led by Francois Lanusse, Liam Parker, Leopoldo Sarra, Siavash Golkar, Miles Cranmer
Accepted contribution at the NeurIPS 2023 AI4Science Workshop
Published in Monthly Notices of the Royal Astronomical Society

What is CLIP?

Contrastive Language Image Pretraining (CLIP)
(Radford et al. 2021)

One model, many downstream applications!

Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)

Hierarchical Text-Conditional Image Generation with CLIP Latents (Ramesh et al. 2022)

The AstroCLIP approach

  • We use spectra and multi-band images as our two different views for the same underlying object.
     
  • DESI Legacy Surveys (g,r,z) images, and DESI EDR galaxy spectra.

Cosine similarity search

  • Redshift Estimation From Images

Figure: redshift estimation from image embeddings, predictions shown against z_{true}; a supervised baseline is included for comparison.
  • Zero-shot prediction
    • k-NN regression in the embedding space (see the sketch below)

  • Few-shot prediction
    • MLP head trained on top of the frozen backbone
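A minimal sketch of the zero-shot k-NN setup on frozen embeddings; the arrays below are random placeholders standing in for AstroCLIP image embeddings and reference spectroscopic redshifts.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Placeholder image embeddings from a frozen backbone, with spectroscopic
# redshifts for the reference sample.
ref_emb = np.random.randn(5000, 512)
ref_z = np.random.uniform(0.0, 0.6, size=5000)
query_emb = np.random.randn(100, 512)

# Zero-shot prediction: k-NN regression in embedding space
# (with L2-normalised embeddings, cosine similarity and Euclidean
# distance rank neighbours identically).
knn = KNeighborsRegressor(n_neighbors=16, weights='distance')
z_pred = knn.fit(ref_emb, ref_z).predict(query_emb)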

Evaluation of the model: Parameter Inference

  • Galaxy Physical Property Estimation from Images and Spectra

We use estimates of galaxy properties from the PROVABGS catalog (Hahn et al. 2023), obtained from Bayesian spectral energy distribution (SED) modeling of DESI spectroscopy and photometry.

Metrics: R^2 of regression, and negative log-likelihood of neural posterior inference (lower is better).

  • Galaxy Morphology Classification

We test a galaxy morphology classification task using labels from the GZ-5 dataset (Walmsley et al. 2021); the metric is classification accuracy.

The AstroCLIP Model

  • For images, we use a ViT-L transformer, pretrained on 70M images with DiNOv2 before CLIP alignment.

  • For spectra, we use a decoder-only transformer working at the level of spectral patches.

DiNOv2 (Oquab et al. 2023) Image Pretraining

  • Common practice for SOTA CLIP models is to initially pretrain the image encoder before CLIP alignment.
  • We adopt the state-of-the-art DiNOv2 self-supervised learning model for this initial large-scale training.

  • We pretrain the DiNOv2 model on ~70 million postage stamps from DECaLS.

Figure (DiNOv2): PCA of patch features, dense semantic segmentation, and dense depth estimation.

The Information Point of View

  • The InfoNCE loss is a lower bound on the mutual information between modalities (written out below).

Shared physical information about galaxies between images and spectra

=> We are building summary statistics for the physical parameters describing an object in a completely data-driven way.
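Written out, the standard InfoNCE objective for a batch of N paired image/spectrum embeddings (x_i, y_i), with similarity sim and temperature \tau, together with the mutual-information bound it implies (van den Oord et al. 2018):

\mathcal{L}_{\mathrm{InfoNCE}} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(x_i, y_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(x_i, y_j)/\tau\right)}

I(X; Y) \ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}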

A Surprising Observation

Redshift information in image embedding

Redshift information in spectra embedding

=> We find in practice that our contrastive alignment behaves similarly to Canonical Correlation Analysis

Detecting Galaxy Tidal Features Using Self-Supervised Representation Learning

Project led by Alice Desmons, Francois Lanusse, Sarah Brough

The Neural Architecture Challenge

  • Most Specific: independent models for every type of observation
  • Most General: a single model capable of processing all types of observations

Bytes Are All You Need (Horton et al. 2023)

AstroCLIP

"Multimodal Generative Pretraining for Large Data Models"

Multimodal Large Data Models for Astrophysics

What will make Francois happy?

  • Single pre-trained model which can operate on any input data type
    • I no longer need to worry about what network to use on some data

       
  • Emergent deep understanding of the data, informed by cross-modal information
    • A downstream task could be specified with just a few examples

New Generation of Token-Based Multimodal Models

Flamingo: a Visual Language Model for Few-Shot Learning (Alayrac et al. 2022)

Chameleon: Mixed-Modal Early-Fusion Foundation Models (Chameleon team, 2024)

Why Is It Interesting to Us?

Galaxy Image Segmentation
Walmsley & Spindler (2023)

Galaxy Image Deblending

Bosch et al. (2017), Sampson et al. (2024)

=> Foundation Models that build a deep understanding of the data at the pixel level.

Standardizing data modalities through Tokenization 

Figure: input vs. reconstructed cutouts, illustrating an example strategy to embed different bands.

Field Embedding Strategy Developed for Multiple Physics Pretraining (McCabe et al. 2023)

Technical Aspects of Code Quantization

  • Finite Scalar Quantization (FSQ), sketched below
  • Lookup-Free Quantization (LFQ)

Figure: reconstructions of the original image under VQ and LFQ tokenization.
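A minimal numpy sketch of the finite scalar quantization idea, offered as an illustration only; in practice a straight-through gradient estimator is used during training, and the number of levels below is arbitrary.

import numpy as np

def fsq(z, levels=7):
    # Bound each latent dimension with tanh, then round to one of `levels`
    # equally spaced integer values (odd `levels` assumed for simplicity).
    # The codebook is implicit, so no lookup table has to be learned.
    half = levels // 2
    return np.round(half * np.tanh(z))

z = np.random.randn(4, 6)   # a few 6-dimensional continuous latents
print(fsq(z))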

Any-to-Any Modeling with Generative Masked Modeling

  • Each input token is tagged with a modality embedding that specifies its type and provides metadata (e.g. HSC image, DESI spectrum).
  • Learns the joint and all conditional distributions of the provided modalities: \forall m,n \quad p(x_m | x_n)
  • Can be further fine-tuned to build specialist models for new tasks.
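A toy PyTorch sketch of this generative masked modelling setup; the vocabulary size, sequence lengths, and tiny transformer below are illustrative placeholders, not the actual architecture.

import torch
import torch.nn as nn

# Hypothetical discrete token vocabulary shared by the tokenizers, plus a [MASK] id.
VOCAB, MASK_ID, D = 1024, 1024, 256

# Toy mixed-modal sequence for one object: image tokens followed by spectrum
# tokens, each tagged with a modality id (0 = image, 1 = spectrum).
tokens = torch.randint(0, VOCAB, (1, 80))
modality = torch.cat([torch.zeros(1, 64, dtype=torch.long),
                      torch.ones(1, 16, dtype=torch.long)], dim=1)

# Tiny transformer standing in for the large data model.
tok_emb, mod_emb = nn.Embedding(VOCAB + 1, D), nn.Embedding(2, D)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True), num_layers=2)
head = nn.Linear(D, VOCAB)

# Generative masked modelling: hide a random subset of tokens and train the
# model to reconstruct them from the visible tokens of all modalities.
mask = torch.rand(tokens.shape) < 0.5
inputs = tokens.masked_fill(mask, MASK_ID)
logits = head(encoder(tok_emb(inputs) + mod_emb(modality)))
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()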

Preview of model capabilities

  • Conditional generation
  • Similarity search
  • Survey translation: p(\bm{x}_{HSC} | \bm{x}_{DES})
  • Redshift estimation: p(z | \mathbf{x})

Early results: Scaling and Transfer

What does such a framework give us?

  • Tokenization provides a very convenient interface to the raw data.

  • Data fusion (e.g. images and time series) becomes trivial.

  • With data ingestion and neural architecture taken care of, deep learning finally boils down to providing a training set and a loss function, as in the illustrative snippet below.
# Illustrative pseudocode of the envisioned workflow; Tokenize,
# FineTunedModel, and the 'LSSTGPT_y1' base model are hypothetical names.
x_train = Tokenize(hsc_images, modality='HSC')
y_train = Tokenize(redshift, modality='z')

model = FineTunedModel(base='LSSTGPT_y1').fit(x_train, y_train)

y_test = model.predict(x_test)

Follow us online!

Thank you for listening!

Astroinformatics 2024, Puerto Natales