DSU AI workshop
2023
University of Delaware
Department of Physics and Astronomy
federica bianco
Biden School of Public Policy and Administration
Data Science Institute
@fedhere
Li et al. 2022
AILE: the first AI-based platform for the detection and study of Light Echoes
NSF Award #2108841
Pessimal AI problem:
 small training data
 inaccurate labels
 imbalanced classes
 diverse morphology
 low SNR
Xiaolong Li
LSSTC Catalyst Fellow 2023
UDelaware → Johns Hopkins
AILE: the first AI-based platform for the detection and study of Light Echoes
YOLOv3 + "attention" mechanism
precision 80% at 70% recall with a training set of 19 light echo examples!
Time →
Language models for time-resolved image processing
Shar Daniels
UDel 1st year
ZTF time-resolved continuous-readout images (w/ Igor Andreoni and Ashish Mahabal)
Transformer architecture
NN for language processing
who needs to learn?
Educate policy makers
without understanding how ML works, policy makers do not have the instruments to regulate it
Education for the people
but does this put the burden on the victims?
Educating DS practitioners in communicating DS concepts
but this puts the burden back on the practitioners
Data Science Education to Help and Protect Us
Jack Dorsey (Twitter CEO) at TED 2019
boring the TED audience with details
Zuckerberg (Facebook CEO) deflecting questions at senate hearing
#UDCSS2020
@fedhere
Data Science is a black box
Models are neutral, data is biased
two dangerous data-ethics myths
used to:
 understand structure of feature space
 classify based on examples
 predict a continuous variable (regression)
 understand which features are important in prediction (to get close to causality)
General ML concepts
Inferential AI
Generative AI
https://www.instagram.com/p/CtO_80PM6BD/
[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.
Arthur Samuel, 1959
what is ML?
a model is a low-dimensional representation of a higher-dimensionality dataset
what is a "model" in ML?
Any mathematical model with parameters that are
learned from the data
what is a ML "model"?
mathematical formula: y = ax + b
model parameters: slope a, intercept b
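A minimal sketch of this idea (illustrative, not from the slides): the model is y = ax + b, and the parameters a and b are learned from the data via closed-form least squares.

```python
# Learning the parameters a, b of the model y = a*x + b from data.

def fit_line(xs, ys):
    """Closed-form least-squares estimates of slope a and intercept b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # a = cov(x, y) / var(x); b = mean_y - a * mean_x
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Data generated from y = 2x + 1: the learned parameters recover a = 2, b = 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
a, b = fit_line(xs, ys)
```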
what is machine learning?
ML: study, development, and application of any model with parameters learnt from the data
which is the "best fit" line? A, B, C, or D?
to select the best fit parameters we define a function of the parameters to minimize or maximize:
the Objective Function (Loss Function)
Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality
(note: learning by instance, e.g. nearest neighbours, may not comply with this definition)
Machine Learning is the discipline that conceptualizes, studies, and applies those models.
Key Concept
what is machine learning?
model parameters are learned by calculating a loss function for different parameter sets and trying to minimize loss (or a target function and trying to maximize)
e.g.
L1 = |target - prediction|
Learning relies on the definition of a loss function
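The loss-minimization step above can be sketched as follows (an assumed toy example: a one-parameter model y = a·x, with the best slope picked by minimizing the L1 loss over a grid of candidates).

```python
# Selecting best-fit parameters by minimizing a loss function.

def l1_loss(a, xs, ys):
    """L1 loss: sum of |target - prediction| over the data."""
    return sum(abs(y - a * x) for x, y in zip(xs, ys))

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]          # generated from y = 2x

# try different parameter sets and keep the one with the smallest loss
candidates = [0.0, 1.0, 2.0, 3.0]
best_a = min(candidates, key=lambda a: l1_loss(a, xs, ys))
```

In practice the minimum is found by an optimizer (e.g. gradient descent) rather than a grid, but the principle is the same.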
Machine Learning
Data driven models for exploration of structure
set up: All features known for all observations
Goal: explore structure in the data
 data compression
 understanding structure
Algorithms: Clustering, (...)
Unsupervised Learning
Data driven models for exploration of structure
Unsupervised Learning
learning type | loss / target
unsupervised | intra-cluster variance / inter-cluster distance
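A sketch of unsupervised learning in this sense (assumption: k-means as the clustering algorithm, in 1D for brevity): intra-cluster variance is minimized by alternately assigning points to their nearest center and moving each center to its cluster mean.

```python
# Toy k-means: clustering without labels, driven only by structure in the data.

def kmeans_1d(points, centers, n_iter=10):
    for _ in range(n_iter):
        # assignment step: each point joins the cluster of its nearest center
        clusters = {c: [] for c in range(len(centers))}
        for p in points:
            nearest = min(range(len(centers)), key=lambda c: abs(p - centers[c]))
            clusters[nearest].append(p)
        # update step: each center moves to the mean of its cluster
        centers = [sum(cl) / len(cl) if cl else centers[c]
                   for c, cl in clusters.items()]
    return centers

points = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]   # two obvious groups
centers = kmeans_1d(points, centers=[0.0, 5.0])
```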
Data driven models for prediction
set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data
Goal: predicting missing feature
 classification
 regression
Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)
Supervised Learning
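A sketch of supervised learning in this setup (assumption: 1-nearest-neighbor as the algorithm): the missing feature — here a class label — is predicted for new points from the labeled subset.

```python
# Toy 1-nearest-neighbor classifier: predict a missing feature (the label)
# from examples where all features are known.

def predict_1nn(x_new, examples):
    """examples: list of (feature, label); return the label of the nearest example."""
    nearest = min(examples, key=lambda ex: abs(ex[0] - x_new))
    return nearest[1]

# the labeled subset (training set)...
train = [(0.5, "A"), (1.0, "A"), (9.0, "B"), (10.0, "B")]
# ...used to predict the unobserved label of a new point
label = predict_1nn(8.0, train)
```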
Learning relies on the definition of a loss function
learning type | loss / target
unsupervised | intra-cluster variance / inter-cluster distance
supervised | distance between prediction and truth
Machine Learning
Some FREE references!
Michael Nielsen: better pedagogical approach, more basic, more clear
Ian Goodfellow: mathematical approach, more advanced, unfinished
Galileo Galilei 1610
Following: Djorgovski
https://events.asiaa.sinica.edu.tw/school/20170904/talk/djorgovski1.pdf
Experiment driven
what drives
inference
Einstein 1916
Theory driven | Falsifiability
Experiment driven
Ulam 1947
Theory driven | Falsifiability
Experiment driven
Simulations | Probabilistic inference | Computation
http://www-star.st-and.ac.uk/~kw25/teaching/mcrt/MC_history_3.pdf
what drives
astronomy
the 2000s
Theory driven | Falsifiability
Experiment driven
Simulations | Probabilistic inference | Computation
Big Data + Computation | pattern discovery | predict by association
data driven: lots of data, drop theory and use associations
algorithmic transparency
strictly policy issues:
proprietary algorithms + auditability
technical + policy issues:
data access and redress + data provenance
algorithmic transparency
https://www.darpa.mil/attachments/XAIProgramUpdate.pdf
[Figure: model families ordered by decreasing transparency along an Accuracy axis: univariate linear regression ("trivially intuitive"), generalized additive models, decision trees, SVM, Random Forest, Deep Learning.]
algorithmic transparency
we're still trying to figure it out
https://www.darpa.mil/attachments/XAIProgramUpdate.pdf
[Figure: the same model families against "Accuracy in solving complex problems": transparency decreases as accuracy increases.]
algorithmic transparency
we're still trying to figure it out
algorithmic transparency
[Figure: the same model families against the number of features that can be effectively included in the model, from 1 (univariate linear regression) to thousands (Deep Learning).]
https://www.darpa.mil/attachments/XAIProgramUpdate.pdf
we're still trying to figure it out
algorithmic transparency
[Figure: the same model families ordered in time, from univariate linear regression to Deep Learning, with "Accuracy in solving complex problems" increasing.]
https://www.darpa.mil/attachments/XAIProgramUpdate.pdf
we're still trying to figure it out
1. Machine learning: any method that learns parameters from the data
2. The transparency of an algorithm is proportional to its complexity and the complexity of the data space
3. The transparency of an algorithm is limited by our own ability and preparedness to interpret it
Toward Interpretable Machine Learning, Samek+2003
algorithmic transparency
1. NN: Neural Networks
1.1 origins
1943: the MP Neuron (McCulloch & Pitts 1943)
the MP Neuron is a classifier: it outputs 1 if the sum of its inputs reaches a threshold θ, and 0 otherwise
if the inputs xᵢ are Boolean (True/False), what value of θ corresponds to logical AND?
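The MP neuron can be sketched in a few lines (standard textbook formulation: fire when the count of active Boolean inputs reaches the threshold θ); with two inputs, θ = 2 answers the question above, implementing logical AND.

```python
# MP neuron (McCulloch & Pitts 1943): a threshold unit on Boolean inputs.

def mp_neuron(inputs, theta):
    """Fire (return 1) if the number of active Boolean inputs reaches theta."""
    return 1 if sum(inputs) >= theta else 0

# theta = 2 on two inputs reproduces the AND truth table
and_table = {(x1, x2): mp_neuron([x1, x2], theta=2)
             for x1 in (0, 1) for x2 in (0, 1)}
```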
The perceptron algorithm: 1958, Frank Rosenblatt
[Diagram: inputs, weights, bias, output; compare the linear predictor with linear regression.]
Perceptrons are linear classifiers: a perceptron makes its predictions based on a linear predictor function combining a set of weights (= parameters) with the feature vector.
[Diagram: the perceptron with a step activation function: the weighted sum plus bias is mapped to 1 or 0.]
[Diagram: the same perceptron with a sigmoid activation function in place of the step: weights, bias, activation, output.]
The perceptron algorithm: 1958, Frank Rosenblatt
"The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence."
"The embryo, the Weather Bureau's $2,000,000 '704' computer, learned to differentiate between left and right after 50 attempts in the Navy demonstration."
NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser (July 8, 1958)
multilayer perceptron
[Diagram: inputs x1, x2; weights w11...w23; biases b1, b2, b3, b.]
w (weight): sets the sensitivity of a neuron
b (bias): up-/down-weights a neuron
EXERCISE: how many parameters?
[Diagram: input layer, hidden layer, hidden layer, output layer, output.]
2. DNN: Deep Learning
multilayer perceptron
output
layer of perceptrons
multilayer perceptron
output
input layer
hidden layer
output layer
1970: multilayer perceptron architecture
Fully connected: all nodes go to all nodes of the next layer.
multilayer perceptron
output
layer of perceptrons
multilayer perceptron
output
layer of perceptrons
multilayer perceptron
layer of perceptrons
output
layer of perceptrons
multilayer perceptron
output
Fully connected: all nodes go to all nodes of the next layer.
layer of perceptrons
multilayer perceptron
output
Fully connected: all nodes go to all nodes of the next layer.
layer of perceptrons
learned parameters:
w (weight): sets the sensitivity of a neuron
b (bias): up-/down-weights a neuron
f (activation function): turns neurons on/off
BINARY CLASSIFICATION
[Diagram: input layer, hidden layer, output layer with two nodes: P(0), P(1).]
MULTICLASS CLASSIFICATION
[Diagram: input layer, hidden layer, output layer with one node per class: P(A), P(B), P(C), P(D).]
REGRESSION
[Diagram: input layer, hidden layer, output layer with a single node: a continuous-value variable.]
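The three output layers differ mainly in their final activation; a common convention (an assumption here, since the slides do not name the functions) is sigmoid for binary classification, softmax for multiclass probabilities, and the raw (identity) value for regression.

```python
import math

def sigmoid(z):
    """Binary classification: squashes a score into a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    """Multiclass: turns a vector of scores into class probabilities summing to 1."""
    exps = [math.exp(z - max(zs)) for z in zs]   # shift by max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# regression output is just the raw continuous value: no activation needed
probs = softmax([2.0, 1.0, 0.1])   # e.g. P(A), P(B), P(C)
```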
3. DNN: parameters of DNN
EXERCISE: how many parameters?
[Diagram: input layer (3 nodes), hidden layer (4 nodes), hidden layer (3 nodes), output layer (1 node).]
3 x 4 (w) + 4 (b) = 16
4 x 3 (w) + 3 (b) = 15
3 x 1 (w) + 1 (b) = 4
total: 35
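The counting rule above generalizes to any fully connected network: each layer contributes n_in × n_out weights plus n_out biases. A one-line sketch:

```python
# Count learned parameters (weights + biases) of a fully connected network.

def count_parameters(layer_sizes):
    """Each layer transition has n_in * n_out weights plus n_out biases."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

# the exercise's network: 3 -> 4 -> 3 -> 1 gives 16 + 15 + 4 = 35
total = count_parameters([3, 4, 3, 1])
```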
4. DNN: hyperparameters of DNN
hyperparameters: other things that change from model to model, but that are not decided based on the data, simply things we decide "a priori"
EXERCISE: how many hyperparameters?
[Diagram: input layer, hidden layer, hidden layer, output layer, output.]
GREEN: architecture hyperparameters
RED: training hyperparameters
- number of layers: 1
- number of neurons/layer
- activation function/layer
- layer connectivity
- optimization metric: 1
- optimization method: 1
- parameters in optimization: M
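The split above can be sketched as configuration that we fix before any learning happens (the dictionary keys below are illustrative names, not a real library's API); only the weights and biases are then learned from data.

```python
# Hyperparameters are chosen "a priori", before training.

architecture_hyperparameters = {
    "n_hidden_layers": 2,               # number of layers
    "neurons_per_layer": [4, 3],        # number of neurons per layer
    "activation_per_layer": ["relu", "relu"],
    "connectivity": "fully_connected",  # layer connectivity
}
training_hyperparameters = {
    "optimization_metric": "L2 loss",   # what to minimize
    "optimization_method": "SGD",       # how to minimize it
    "learning_rate": 0.01,              # a parameter of the optimization
}

# only w and b are learned from the data; everything above is our choice
n_inputs, n_outputs = 3, 1
layer_sizes = ([n_inputs]
               + architecture_hyperparameters["neurons_per_layer"]
               + [n_outputs])
n_learned = sum(i * o + o for i, o in zip(layer_sizes, layer_sizes[1:]))
```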
principle of parsimony
or Ockham's razor
Pluralitas non est ponenda sine necessitate
William of Ockham (logician and Franciscan friar) ca. 1300
but probably to be attributed to John Duns Scotus (1265–1308)
"Complexity need not be postulated without a need for it"
principle of parsimony
Peter Apian, Cosmographia, Antwerp, 1524 from Edward Grant,
"Celestial Orbs in the Latin Middle Ages", Isis, Vol. 78, No. 2. (Jun., 1987).
Geocentric models are intuitive:
from our perspective we see the Sun moving, while we stay still
the earth is round,
and it orbits around the sun
principle of parsimony
As observations improve
this model can no longer fit the data!
not easily anyway...
Encyclopaedia Brittanica 1st Edition
Dr Long's copy of Cassini, 1777
principle of parsimony
A new model that is much simpler fits the data just as well
(perhaps though only until better data comes...)
Heliocentric model from Nicolaus Copernicus' De revolutionibus orbium coelestium.
“Between 2 theories that perform similarly choose the simpler one”
the principle of parsimony
or Ockham's razor
Between 2 theories that perform similarly choose the simpler one
In the context of model selection simpler means "with fewer parameters"
Key Concept
DNNs need a lot of data to train
To optimize a lot of parameters we need... lots of data!
DNNs are justified if:
- there are a lot of variables
- the relationships between input variables and output are nonlinear
4.1 proper care of your DNN:
how to make informed choices in the architectural design (TL;DR: I will offer some guidance, but really you've got to try a bunch of things...)
NNs are a vast topic and we only have 2 weeks!
Some FREE references!
Michael Nielsen: better pedagogical approach, more basic, more clear
Ian Goodfellow: mathematical approach, more advanced, unfinished
Lots of parameters and lots of hyperparameters! What to choose?
cheatsheet
architecture | wide networks tend to overfit, deep networks are hard to train
number of epochs | the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
loss function | needs to be appropriate to the task, e.g. classification vs regression
activation functions | need to be consistent with the loss function
optimization scheme | needs to be appropriate to the task and data
learning rate in optimization | balance speed and accuracy
batch size | smaller batch size is faster but leads to overtraining
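To see where the training hyperparameters from the cheatsheet enter a training loop, here is a deliberately tiny sketch (an assumed example: mini-batch SGD on the one-parameter model y = a·x with a squared loss).

```python
# Epochs, learning rate, and batch size in a minimal SGD training loop.

def train(xs, ys, epochs=50, learning_rate=0.05, batch_size=2):
    a = 0.0                                        # initial parameter guess
    for _ in range(epochs):                        # number of epochs
        for i in range(0, len(xs), batch_size):    # split data into mini-batches
            xb, yb = xs[i:i + batch_size], ys[i:i + batch_size]
            # gradient of the mean squared loss wrt a on this batch
            grad = sum(2 * (a * x - y) * x for x, y in zip(xb, yb)) / len(xb)
            a -= learning_rate * grad              # learning rate scales the step
    return a

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]                          # true slope a = 2
a = train(xs, ys)
```

Too large a learning rate makes the loss jump around; too small a batch or rate slows convergence — exactly the trade-offs the cheatsheet describes.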
An article that compares various DNNs
accuracy comparison