Federica B. Bianco

University of Delaware

Physics and Astronomy 

Biden School of Public Policy and Administration

Data Science Institute

 

AI and ML in astronomy and astrophysics

Machine Learning for Astronomers and Physicists

Historical perspective

1/6

Galileo Galilei 1610

Experiment driven

what drives

cosmic discovery

Einstein 1916

Theory driven | Falsifiability

Experiment driven

what drives

cosmic discovery

Theory driven | Falsifiability

Experiment driven

Simulations | Probabilistic inference | Computation

1947 - today

what drives

cosmic discovery

2000s - today

Theory driven | Falsifiability

Experiment driven

Simulations | Probabilistic inference | Computation

Data | Survey astronomy | Computation | Pattern Discovery

what drives

cosmic discovery

Astronomy by the numbers

from commissioning observations

to scanning the sky and giving away the data (open science model!)

https://app.sli.do/event/qxbWnfzkJyT3SvbeKi3rwd

when did the first review of neural networks in astronomy come out?

Smith+Geach May 2022 Astronomia ex machina

number of arXiv:astro-ph submissions with abstracts containing one or more of the strings: ‘machine learning’, ‘ML’, ‘artificial intelligence’, ‘AI’, ‘deep learning’ or ‘neural network’.

Artificial Intelligence:

enable machines to make decisions without being explicitly programmed

Machine Learning:

machines learn directly from data and examples

Data Science: the field of studies that deals with the extraction of information from data within a domain context to enable interpretation and prediction of phenomena. 

 

This includes development and application of statistical tools and machine learning and AI methods

Deep Learning
(Neural Networks)

DATA

MODEL

PRACTICE

Complex Large Data

whitening

cross validation

 

Flexible non-linear models

Gartner report 2001



4-V of Big Data


V1: Volume
Number of bytes

 

Number of pixels

 

Number of astrophysical objects in a dataset x number of features measured


 

 

V2: Variety
Diverse science return from the same dataset

e.g. cosmology + stellar physics

Multiwavelength

Multimessenger

Images and spectra

V4: Veracity
This V refers to both data quality and availability (added in 2012)

 

Inclusion of uncertainty in inference and simulations
 

V3: Velocity

Real time analysis, edge computing, data transfer

 

IceCube edge computing


Exquisite image quality

all over the sky

over and over again 

SDSS image circa 2000
HSC image circa 2018

when you look at the sky at this resolution and this depth...

everything is blended and everything is changing

[Figure: log number of Megapixels vs. Etendue (area x FoV) for imaging surveys]

4-V of Big Data

The IceCube collaboration

Nature 591, 220–224 (2021)

DATA

MODEL

PRACTICE

Complex Large Data

whitening

cross validation

 

Flexible non-linear models

what is machine learning?

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

 

 

a model is a low-dimensional representation of a higher-dimensional dataset

What is a model in ML

ML: any model with parameters learnt from the data

dimensionality of the model: number of parameters

What is a model in ML

Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets (the sample) of realizations of that reality (the population)

(note: learning by instance, e.g. nearest neighbours, may not comply with this definition)

Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Key Concept

what is machine learning?

 

DATA

MODEL

PRACTICE

Complex Large Data

whitening

cross validation

 

Flexible non-linear models

DATA

MODEL

PRACTICE

Hubble in 1929

Emphasis on transferability of the model to unseen data

AI "Learning"

 

2/6

Input

x

y

output

data

prediction

physics

Input

x

y

output

function

f(x)

Input

x

y

output

f(x)
f(x) = mx + b

b

m

m: slope 

b: intercept

Input

x

y

output

f(x)
f(x) = mx + b

b

m

m: slope 

b: intercept

parameters

Input

x

y

output

f(x)
f(x) = mx + b

b

m

m: slope 

b: intercept

parameters

x

y

goal: find the right m and b that turn x into y


Input

x

y

output

f(x)
f(x) = mx + b

b

m

m: slope 

b: intercept

parameters

x

y

learn

goal: find the right m and b that turn x into y


Input

x

y

output

f(x) = mx + b

m = 0.4 and b=0

m: slope 

b: intercept

parameters

x

L2 = (y_{1,p} - y_{1,t})^2 + (y_{2,p} - y_{2,t})^2 + (y_{3,p} - y_{3,t})^2

(subscript p: predicted, t: true)

let's try 

goal: learn the right m and b that turn x into y
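This "learn m and b" step can be sketched in a few lines. The data here are synthetic, generated with the slide's example values m = 0.4 and b = 0, and the fit minimizes the L2 loss defined above:

```python
import numpy as np

# Synthetic data generated with the slide's example slope m=0.4 and b=0
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 0.4 * x + rng.normal(0, 0.05, size=x.size)

# Least-squares fit of f(x) = m*x + b: minimizes sum((y_pred - y_true)^2)
m, b = np.polyfit(x, y, deg=1)
print(round(m, 2), round(b, 2))  # recovered parameters close to 0.4 and 0
```

`np.polyfit` solves the least-squares problem in closed form; an iterative optimizer (gradient descent on the L2 loss) would converge to the same m and b.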

f(x)

unsupervised vs supervised learning

Clustering

 

understand the structure of a feature space

All features are observed for all datapoints

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  

All features are observed for all datapoints

unsupervised vs supervised learning

understand the structure of a feature space

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  
  • anomaly detection

All features are observed for all datapoints

unsupervised vs supervised learning

understand the structure of a feature space

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  
  • anomaly detection

All features are observed for all datapoints

unsupervised vs supervised learning

understand the structure of a feature space

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Classifying & regression

 

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints

unsupervised vs supervised learning

prediction and classification based on examples

Some features not observable &  we want to predict them.

Classifying & regression

 

finding functions of the variables that allow us to predict unobserved properties of new observations

prediction and classification based on examples

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints

Some features not observable &  we want to predict them.

unsupervised vs supervised learning

Classifying & regression

 

Supervised learning

  • classification (prediction)
  • regression (prediction)

prediction and classification based on examples

Clustering

partitioning the feature space so that the existing data is grouped (according to some target function!)

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction

All features are observed for all datapoints

finding functions of the variables that allow us to predict unobserved properties of new observations

Some features not observable &  we want to predict them.

unsupervised vs supervised learning

The Loss function

Supervised learning

  • classification (prediction)
  • regression (prediction)
  • feature selection

Unsupervised learning

  • understanding structure  
  • anomaly detection
  • dimensionality reduction
L = \sum_{i,c} |\vec{x}_{i \in c} - \vec{\mu}_{c}|^2
L1 = \sum_{i} |y_{i} - f(\vec{x}_{i})| \\ L2 = \sum_{i} |y_{i} - f(\vec{x}_{i})|^2 \\

model parameter

Loss function
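The three losses above can be computed directly; the toy targets, predictions, points, and cluster labels below are made-up illustrative values:

```python
import numpy as np

# Hypothetical regression targets y_i and predictions f(x_i)
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.1, 1.8, 3.3])

L1 = np.abs(y_true - y_pred).sum()    # sum_i |y_i - f(x_i)|
L2 = ((y_true - y_pred) ** 2).sum()   # sum_i |y_i - f(x_i)|^2

# Clustering loss: squared distance of each point to its cluster centroid mu_c
X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
L_cluster = sum(np.sum((X[labels == c] - X[labels == c].mean(axis=0)) ** 2)
                for c in np.unique(labels))
print(L1, L2, L_cluster)
```

Minimizing L_cluster over the label assignment is exactly what k-means does; minimizing L1 or L2 over the parameters of f is regression.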

Physics informed AI

Application regime:

PiNN

-infinity - 1950's

theory driven: little data, mostly theory, falsifiability and all that...

-1980's - today

data driven: lots of data, drop theory and use associations, black-box models

Application regime:

PiNN

-infinity - 1950's

theory driven: little data, mostly theory, falsifiability and all that...

-1980's - today

data driven: lots of data, drop theory and use associations, black-box models

lots of data yet not enough for entirely automated decision making

complex theory that cannot be solved analytically

 

combine it with some theory

PiNN

Non Linear PDEs are hard to solve!

  • Provide training points at the boundary with calculated solution (trivial because we have boundary conditions)

 

  • Provide the physical constraint: make sure the solution satisfies the PDE

via a modified loss function that includes residuals of the prediction and residual of the PDE

\mathrm{loss} = L2 + PDE =\\ \sum(u_\theta - u)^2 + \\ (\partial_t u_\theta + u_\theta \, \partial_x u_\theta - (0.01/\pi) \, \partial_{xx} u_\theta)^2\\

PiNN

Non Linear PDEs are hard to solve!

\mathrm{loss} = L2 + PDE =\\ \sum(u_\theta - u)^2 + \\ (\partial_t u_\theta + u_\theta \, \partial_x u_\theta - (0.01/\pi) \, \partial_{xx} u_\theta)^2\\

Raissi, Perdikaris, Karniadakis 2017
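The modified loss can be sketched numerically. This is not a full PINN (no network, no autograd): `u_theta` below is a made-up stand-in for the network's prediction, and the derivatives of Burgers' equation u_t + u u_x = (0.01/pi) u_xx are taken by finite differences rather than automatic differentiation:

```python
import numpy as np

nu = 0.01 / np.pi                    # viscosity from the slide's PDE
x = np.linspace(-1, 1, 201)
t = 0.5

def u_theta(x, t):
    # Hypothetical "network" prediction: any smooth candidate solution
    return -np.sin(np.pi * x) * np.exp(-t)

dx, dt = x[1] - x[0], 1e-4
u = u_theta(x, t)
u_t = (u_theta(x, t + dt) - u_theta(x, t - dt)) / (2 * dt)
u_x = np.gradient(u, dx)
u_xx = np.gradient(u_x, dx)

# Data term (L2 against training targets; here trivially the same values)
# plus the PDE residual term from the slide's loss
u_data = u_theta(x, t)
loss = np.mean((u - u_data) ** 2) + \
       np.mean((u_t + u * u_x - nu * u_xx) ** 2)
print(loss)
```

Training a PINN means adjusting the network parameters theta so that both terms, data misfit and PDE residual, shrink together.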


Unsupervised learning

Supervised learning

All features are observed for all datapoints

and we are looking for structure in the feature space

Some features are not observed for some data points we want to predict them.

The datapoints for which the target feature is observed are said to be "labeled"

Semi-supervised learning

Active learning

A small amount of labeled data is available. Data is clustered and clusters inherit labels

The code can interact with the user to update labels and update the model.

also...

different flavors of learning

Reinforcement Learning

reward vs loss

delayed feedback from the changes in the environment

different flavors of learning

Reinforcement Learning

E.g. Selection of follow-up targets in Multi Messenger Astronomy

different flavors of learning

reward vs loss

delayed feedback from the changes in the environment

federica bianco - fbianco@udel.edu

Rubin ToO program

Andreoni+ 2022b

+80 authors!

PROS:

Large FoV (10 sq deg - easily cover 100 sq deg in full)

6 filters (5 available on any given night)

deep observations (r~24 in 30 sec, up to 180 sec)

public data

 

 

Registration for online participation open through March 15th!

https://lssttooworkshop.github.io/

This is a workshop: the goal of the workshop is to produce a report to be delivered to the SCOC containing recommendations for how to implement ToO responses with Rubin. There are no talks. Time is dedicated to collaboratively working toward the workshop report.

Rubin ToO program

 

 

SOC: I. Andreoni (KN), F. Bianco, A. Franckowiak (ν), T. Lister (Solar System),

R. Margutti (KN, GRB), G. Smith (Lensed KN)

Deep Learning

3/6

Input

x

y

output

Tree models

(at the basis of Random Forest

Gradient Boosted Trees)

nodes

(make a decision)

root node

branches

(split off of a node)

leaves (last groups)
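The node / branch / leaf vocabulary can be made concrete with a minimal hand-rolled tree (a single decision stump); the feature index, threshold, and class names are made-up illustrative values, not from the slides:

```python
# Minimal decision "tree": one root node that makes a decision,
# two branches, and two leaves that return p(class)-style answers.
def predict(features):
    # root node: test feature 0 against an illustrative threshold
    if features[0] < 2.5:      # branch splitting off the root
        return "class A"       # leaf (last group)
    else:                      # the other branch
        return "class B"       # leaf

print(predict([1.0]), predict([5.0]))
```

Random Forests and Gradient Boosted Trees combine many such trees, each built from nodes like this one, and average or sum their leaf outputs.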

Tree models

(at the basis of Random Forest

Gradient Boosted Trees)

p(class)

extracted

features vector

p(class)

 pixel values tensor

1943

M-P Neuron McCulloch & Pitts 1943

M-P Neuron

1943

M-P Neuron McCulloch & Pitts 1943

M-P Neuron

M-P Neuron

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0


The perceptron algorithm : 1958, Frank Rosenblatt

1958

Perceptron

The perceptron algorithm : 1958, Frank Rosenblatt


x_1
x_2
x_N
+b

output

weights

w_i

bias

b

linear regression:

w_2
w_1
w_N
1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

1958

Perceptron

y = \begin{cases} 1 & \mathrm{if}~ \sum_i x_i w_i + b \geq Z \\ 0 & \mathrm{if}~ \sum_i x_i w_i + b < Z \end{cases}


x_1
x_2
x_N
+b
f
w_2
w_1
w_N

output

f

activation function

weights

w_i

bias

b
y ~= f(~\sum_i w_ix_i ~+~ b)

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptrons are linear classifiers: they make predictions based on a linear predictor function

combining a set of weights (= parameters) with the feature vector.
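The perceptron rule above is a one-liner in code; the weights, bias, and threshold Z below are made-up illustrative values:

```python
import numpy as np

# Perceptron: output 1 if sum_i w_i*x_i + b >= Z, else 0
def perceptron(x, w, b, Z=0.0):
    return 1 if np.dot(w, x) + b >= Z else 0

w = np.array([0.5, -0.2])   # illustrative weights
b = 0.1                     # illustrative bias
print(perceptron(np.array([1.0, 1.0]), w, b))  # 0.5 - 0.2 + 0.1 = 0.4 >= 0 -> 1
print(perceptron(np.array([0.0, 1.0]), w, b))  # -0.2 + 0.1 = -0.1 < 0 -> 0
```

The decision boundary w·x + b = Z is a hyperplane, which is why the perceptron is a linear classifier.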

Perceptron

y = \begin{cases} 1 & \mathrm{if}~ \sum_i x_i w_i + b \geq Z \\ 0 & \mathrm{if}~ \sum_i x_i w_i + b < Z \end{cases}


x_1
x_2
x_N
+b
f
w_2
w_1
w_N

output

f

activation function

weights

w_i

bias

b
y ~= f(~\sum_i w_ix_i ~+~ b)

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptrons are linear classifiers: they make predictions based on a linear predictor function

combining a set of weights (= parameters) with the feature vector.

Perceptron

The perceptron algorithm : 1958, Frank Rosenblatt

+b
f
w_2
w_1
w_N

output

f

activation function

weights

bias

b

sigmoid

f
\sigma = \frac{1}{1 + e^{-z}}


x_1
x_2
x_N
y ~= f(~\sum_i w_ix_i ~+~ b)

Perceptrons are linear classifiers: they make predictions based on a linear predictor function

combining a set of weights (= parameters) with the feature vector.

Perceptron

w_i
b

ANN examples of activation function

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptron

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration

NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

July 8, 1958

multilayer perceptron

x_2
x_3

output

input layer

hidden layer

output layer

1970: multilayer perceptron architecture

x_1

Fully connected: all nodes go to all nodes of the next layer.

b_1
b_2
b_3
b_4

Input

x

y

output

f(x)

x

y

A Neural Network is a kind of function that maps input to output

Input

output

hidden layers

multilayer perceptron

x_2
x_3

output

x_1

layer of perceptrons

b_1
b_2
b_3
b_4
b
w_{11}
w_{12}
w_{13}
w_{14}

multilayer perceptron

x_2
x_3

output

x_1

layer of perceptrons

b_1
b_2
b_3
b_4
b
w_{21}
w_{22}
w_{23}
w_{24}

layer of perceptrons

multilayer perceptron

layer of perceptrons

x_2
x_3

output

x_1

layer of perceptrons

b_1
b_2
b_3
b_4
b
w_{31}
w_{32}
w_{33}
w_{34}

layer of perceptrons

multilayer perceptron

x_2
x_3

output

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons

f(w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1)
f(w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2)
f(w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3)
f(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4)
x_1

w: weight

sets the sensitivity of a neuron

 

b: bias:

shifts a neuron's activation up or down

 

 

f: activation function:

turns neurons on-off
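The roles of w, b, and f can be seen in a forward pass through the slide's 3-input, 4-neuron fully connected layer; the weight and bias values below are random made-up numbers:

```python
import numpy as np

def sigmoid(z):
    # activation function f: turns neurons on-off smoothly
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.2, -1.0, 0.5])               # inputs x_1, x_2, x_3

# Hypothetical weights/biases: 3 inputs -> 4 hidden neurons -> 1 output
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

h = sigmoid(W1 @ x + b1)   # each h_i = f(w_{i1}x_1 + w_{i2}x_2 + w_{i3}x_3 + b_i)
y = sigmoid(W2 @ h + b2)   # output layer
print(h.shape, y.shape)
```

Each row of W1 holds one neuron's weights (its sensitivity to each input), each entry of b1 shifts that neuron, and the activation squashes the result.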

 

EXERCISE

output

how many parameters?

input layer

hidden layer

output layer

hidden layer

output


output

input layer

hidden layer

output layer

hidden layer

35

(3x4)+4

(4x3)+3

how many parameters?

EXERCISE

(3)+1
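The counting rule behind the answer, (inputs x neurons) weights plus one bias per neuron for every layer, can be checked in code:

```python
# Parameters of a fully connected network: for each consecutive pair of
# layers, (n_in x n_out) weights + n_out biases.
def n_params(layer_sizes):
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

# The exercise's network: 3 inputs -> 4 hidden -> 3 hidden -> 1 output
print(n_params([3, 4, 3, 1]))  # (3x4)+4 + (4x3)+3 + (3x1)+1 = 35
```

The same function scales to any depth, which is how parameter counts like GPT-3's 175 billion are tallied.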


output

output

input layer

hidden layer

output layer

hidden layer

  1. number of layers - 1
  2. number of neurons/layer - N_l
  3. activation function/layer - N_l
  4. layer connectivity - N_l^{~??}
  5. optimization metric - 1
  6. optimization method - 1
  7. parameters in optimization - M

how many hyperparameters?

EXERCISE

GREEN: architecture hyperparameters

RED: training hyperparameters

 


GPT-3

175 Billion Parameters

3,640 PetaFLOPs days

Kaplan+ 2020

GPT-4

??? Billion Parameters

200,000 PetaFLOPs days

$100M

Kaplan+ 2020

Kaplan+ 2020

National Public Radio

Generative

AI

4/6

Applications

 

  1. Image Generation (and 3D Shape Generation)

  2. Semantic Image-to-Photo Translation

  3. Image Resolution Increase

  4. Text-to-Speech Generator

  5. Speech-to-Speech Conversion

  6. Text Generation (GPT-3)

  7. Music Generation

  8. Image-to-Image Conversion

Generative AI


What do NNs do? Approximate complex functions with a series of linear functions

To do that they extract information from the data

Each layer of the DNN produces a representation of the data, a "latent representation".

The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now)

 

 

.... so if my layers are smaller, what I have is a compact representation of the data

 

Generative AI

Generative AI

Generative AI

Generative AI

Generative AI

Autoencoders

Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

p(9)

inferential AI output:

Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

encoder

Autoencoder Architecture

encoder

decoder

Autoencoder Architecture

encoder

decoder

latent space: internal representation of the input data
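The encoder/decoder/latent-space structure can be sketched as two forward passes; the 5-to-2-to-5 sizes follow the slide's example, while the (untrained) random weights and tanh activation are made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5)                                       # input, size 5

# Hypothetical encoder (5 -> 2) and decoder (2 -> 5) weights/biases
W_enc, b_enc = rng.normal(size=(2, 5)), rng.normal(size=2)
W_dec, b_dec = rng.normal(size=(5, 2)), rng.normal(size=5)

z = np.tanh(W_enc @ x + b_enc)       # latent representation, size 2
x_hat = W_dec @ z + b_dec            # reconstruction, size 5

# training would minimize this L2 reconstruction loss
recon_loss = np.mean((x - x_hat) ** 2)
print(z.shape, x_hat.shape)
```

Once trained, the 2-dimensional z is the compact internal representation of the 5-dimensional input, and sampling or perturbing z is what makes the architecture generative.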

Generative AI

Autoencoders

Autoencoder Architecture

Feed Forward DNN:

the size of the input is <N,

the size of the last layer is N

remember the time when simulations drove astronomy...

Theory driven | Falsifiability

Experiment driven

Simulations | Probabilistic inference | Computation

The Millennium Run used more than 10^10 particles to trace the evolution of the matter distribution in a cubic region of the Universe 500/h Mpc on a side (~2 billion light-years on a side), and has a spatial resolution of 5/h kpc.  ~20M galaxies.

350 000 processor hours of CPU time, or 28 days of wall-clock time.  Springel+2005

https://wwwmpa.mpa-garching.mpg.de/galform/virgo/millennium

AI-assisted superresolution cosmological simulations

Yin Li+2021

LOW RES SIM

HIGH RES SIM

AI-assisted superresolution cosmological simulations

Yin Li+2021

LOW RES SIM

HIGH RES SIM

AI-AIDED HIGH RES

AI-assisted superresolution cosmological simulations

Yin Li+2021

LOW RES SIM

HIGH RES SIM

AI-AIDED HIGH RES

INPUT

OUTPUT

TARGET

loss = D(OUTPUT-TARGET)

November 30, 2022

will be made available to developers through Google Cloud’s API from December 13, 2023

teaching AI

5/6

teaching AI

- project based learning

immediate practice enhances theoretical understanding

 

- incremental learning

compartmentalized topics with shared foundations

 

- intuitive learning

can be taught with a light mathematical approach


Syllabus Machine Learning for Physical Scientists


ethics of AI

6/6

the butterfly effect


NGC 4565 is an edge-on spiral galaxy about 30 to 50 million light-years away. The faculty at the University of Delaware used an AI model (emulator) to predict the hidden physical parameters of the galaxy, wrongly estimating the DM content of NGC 4565, and claimed that a novel process for galaxy formation should be taken into consideration.

Unfortunately, this was the result of a model hallucination.

The galaxy was featured in many social media posts, gaining rapid notoriety, but upon retraction it was canceled. The galaxy is suing the University of Delaware claiming emotional damage and loss of revenue


We use astrophysics as a neutral and safe sandbox in which to learn how to develop and apply powerful tools.

Deploying these tools in the real world can do harm.

Ethics of AI is essential training that all data scientists should receive.

The main skill that is missing in the portfolio of our new hires is data ethics

Why does this AI model whiten Obama's face?

Simple answer: the data is biased. The algorithm is fed more images of white people

But really, would the opposite have been acceptable? The bias is in society

models are neutral, the bias is in the data (or is it?)


Why does this AI model whitens Obama face?

Simple answer: the data is biased. The algorithm is fed more images of white people

Joy Boulamwini

models are neutral, the bias is in the data (or is it?)

November 30, 2022

will be made available to developers through Google Cloud’s API from December 13, 2023

 

Vinay Prabhu exposes racist bias in GPT-3

unexpected consequences of NLP models


A global initiative led by Cohere For AI involving over 3,000 independent researchers across 119 countries. Aya is a state-of-the-art model and dataset, pushing the boundaries of multilingual AI for 101 languages through open science.

RAISE ALL VOICES

There is a different way!


Thank you!

Federica B. Bianco

University of Delaware

Physics and Astronomy 

Biden School of Public Policy and Administration

Data Science Institute

 

Vera C. Rubin Observatory

Deputy Project Scientist - Construction

Interim Head of Science - Operations
