DSU AI workshop

2023

 

University of Delaware

Department of Physics and Astronomy

 

federica bianco

 

Biden School of Public Policy and Administration

Data  Science Institute

 

 

@fedhere

Li et al. 2022

AILE: the first AI-based platform for the detection and study of Light Echoes

NSF Award #2108841

Pessimal AI problem:

  • small training data
  • inaccurate labels
  • imbalanced classes
  • diverse morphology
  • low SNR

Xiaolong Li

LSSTC Catalyst Fellow 2023

UDelaware->Johns Hopkins

AILE: the first AI-based platform for the detection and study of Light Echoes

YOLO3 + "attention" mechanism

precision 80% at 70% recall with a training set of 19 light echo examples! 

Xiaolong Li

LSSTC Catalyst Fellow 2023

UDelaware->Johns Hopkins

Time ->

Language models for time-resolved image processing

Shar Daniels

UDel 1st year

ZTF time-resolved continuous readout images (w Igor Andreoni and Ashish Mahabal)

Transformer architecture

NN for language processing

who needs to learn

Educate Policy makers

without understanding how ML works policy makers do not have the instruments to regulate it

 

Education for the people

but does this put the burden on the victims?

 

Educating DS practitioners in communicating DS concepts

this puts the burden back on the practitioners

Datascience Education to Help and Protect us

Jack Dorsey (Twitter CEO) at TED 2019

boring the TED audience with details

Zuckerberg (Facebook CEO) deflecting questions at senate hearing

#UDCSS2020

@fedhere

Data Science is a black box

Models are neutral, data is biased

two dangerous data-ethics myths

#UDCSS2020

@fedhere

used to:

  • understand structure of feature space
  • classify based on examples
  • predict a continuous variable (regression)
    • understand which features are important in prediction (to get close to causality)

General ML concepts

Inferential AI

Generative AI

https://www.instagram.com/p/CtO_80PM6BD/

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959


what is ML?

 

a model is a low dimensional representation of a higher dimensionality dataset

what is a "model" in ML?

 

Any mathematical model with parameters that are

learned from the data

what is a ML "model"?

mathematical formula: y = ax + b

model parameters: slope a, intercept b

what is machine learning?

ML: study, development, and application of any model with parameters learnt from the data

[Figure: data plotted against time, with four candidate lines labelled A, B, C, D]

which is the "best fit" line? A, B, C, D?

to select the best fit parameters we define a function of the parameters to minimize or maximize

Objective Function

Loss Function

L_1 = \sum_{i=1}^N|f(x_i) - y_i|
L_2 = \sum_{i=1}^N(f(x_i) - y_i)^2
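
To make the two loss functions concrete, here is a minimal sketch that scores four candidate lines against a small dataset. The data points and the candidate slopes and intercepts are made up for illustration (they are not the ones in the slide figure), and the function names are hypothetical.

```python
def l1_loss(predictions, targets):
    """L_1: sum of absolute differences between prediction and target."""
    return sum(abs(p - t) for p, t in zip(predictions, targets))


def l2_loss(predictions, targets):
    """L_2: sum of squared differences between prediction and target."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets))


# Made-up data that roughly follow y = 2x + 1 (an assumption for this sketch)
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

# Hypothetical candidate lines: name -> (slope a, intercept b)
candidates = {"A": (2.0, 1.0), "B": (1.0, 3.0), "C": (3.0, -1.0), "D": (0.0, 5.0)}

losses = {}
for name, (a, b) in candidates.items():
    predictions = [a * xi + b for xi in x]
    losses[name] = (l1_loss(predictions, y), l2_loss(predictions, y))

# The "best fit" among the candidates is the one with the smallest loss
best_by_l2 = min(losses, key=lambda name: losses[name][1])
best_by_l1 = min(losses, key=lambda name: losses[name][0])
```

Here both losses pick the same line, but that is not guaranteed in general: L_2 penalizes large residuals more heavily than L_1, so outliers can change which candidate wins.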

Machine Learning models are parametrized representations of "reality" where the parameters are learned from finite sets of realizations of that reality

(note: learning by instance, e.g. nearest neighbours, may not comply with this definition)

Machine Learning is the discipline that conceptualizes, studies, and applies those models.

Key Concept

what is  machine learning?

 

model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or by calculating a target function and trying to maximize it)

e.g.

L1  = |target - prediction|

Learning relies on the definition of a loss function

Machine Learning

Data driven models for exploration of structure

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

Algorithms: Clustering, (...)

x

y

Unsupervised Learning

Data driven models for exploration of structure

Unsupervised Learning

learning type | loss / target
unsupervised | intra-cluster variance / inter-cluster distance

Data driven models for prediction

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting missing feature

-  classification

- regression

Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)

x

y

Supervised Learning

Learning relies on the definition of a loss function

learning type | loss / target
unsupervised | intra-cluster variance / inter-cluster distance
supervised | distance between prediction and truth

Machine Learning

 

Some FREE references!

 

Michael Nielsen

better pedagogical approach, more basic, more clear

Ian Goodfellow

mathematical approach, more advanced, unfinished

Galileo Galilei 1610

Experiment driven

what drives

inference

@fedhere

Einstein 1916

what drives

inference

Theory driven | Falsifiability

Experiment driven

@fedhere

Ulam 1947

Theory driven | Falsifiability

Experiment driven

Simulations | Probabilistic inference | Computation

http://www-star.st-and.ac.uk/~kw25/teaching/mcrt/MC_history_3.pdf

@fedhere

what drives

inference

what drives

astronomy

the 2000s

Theory driven | Falsifiability

Experiment driven

Simulations | Probabilistic inference | Computation

Big Data + Computation | pattern discovery | predict by association

@fedhere

data driven: lots of data, drop theory and use associations

algorithmic transparency

strictly policy issues:

proprietary algorithms + auditability

#UDCSS2020

@fedhere

technical + policy issues:

data access and redress + data provenance

algorithmic transparency

https://www.darpa.mil/attachments/XAIProgramUpdate.pdf

[Figure, shown over several slide builds. Labels: "trivially intuitive"; univariate linear regression; generalized additive models; decision trees; SVM; Random Forest; Deep Learning. Axes: accuracy in solving complex problems; number of features that can be effectively included in the model (from 1 to thousands); time]

we're still trying to figure it out

#UDCSS2020

@fedhere

algorithmic transparency

1

Machine learning: any method that learns parameters from the data

2

The transparency of an algorithm decreases with its complexity and with the complexity of the data space

3

The transparency of an algorithm is limited by our own ability and preparedness to interpret it

algorithmic transparency

#UDCSS2020

@fedhere

NN:

 

Neural Networks

1

NN:

 

Neural Networks

1.1

origins

1943

M-P Neuron McCulloch & Pitts 1943

it's a classifier

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0

if x_i is Bool (True/False), what value of \theta corresponds to logical AND?
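
A minimal sketch of the M-P neuron as a thresholded sum, checked against the full truth table for three Boolean inputs. The function name `mp_neuron` is hypothetical. With three inputs, the sum reaches 3 only when all inputs are True, so a threshold of 3 reproduces logical AND, while a threshold of 1 reproduces logical OR.

```python
from itertools import product


def mp_neuron(inputs, theta):
    """McCulloch-Pitts neuron: output 1 if the sum of the inputs reaches theta, else 0."""
    return 1 if sum(inputs) >= theta else 0


# Evaluate the neuron on all 2**3 combinations of Boolean inputs
all_inputs = list(product((0, 1), repeat=3))

and_table = {bits: mp_neuron(bits, theta=3) for bits in all_inputs}  # fires only for (1, 1, 1)
or_table = {bits: mp_neuron(bits, theta=1) for bits in all_inputs}   # fires if any input is 1
```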

The perceptron algorithm: 1958, Frank Rosenblatt

Perceptron

[Diagram: inputs x_1, x_2, ..., x_N are multiplied by weights w_1, w_2, ..., w_N and summed with a bias b to give the output]

weights w_i, bias b

linear regression:

y ~= ~\sum_i w_ix_i ~+~ b

Perceptrons are linear classifiers: they make their predictions based on a linear predictor function combining a set of weights (=parameters) with the feature vector.

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

with a bias b and a threshold Z:

y= \begin{cases} 1~ \mathrm{if}~ \sum_i(x_i w_i) + b ~\geq~Z\\ 0 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~<~Z \end{cases}

[Figure: x-y plane with the linear predictor; braces mark the regions labelled 1 and 0]

[Diagram: perceptron with inputs x_1, x_2, ..., x_N, weights w_1, w_2, ..., w_N, bias b, and an activation function f applied before the output]

weights w_i, bias b, activation function f

perceptron:

y ~= f(~\sum_i w_ix_i ~+~ b)

sigmoid activation function:

\sigma(z) = \frac{1}{1 + e^{-z}} ~~~~\mathrm{with}~~ z = \sum_i w_ix_i + b
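
A minimal sketch of a single perceptron, y = f(sum_i w_i x_i + b), evaluated with two different activation functions: a hard step and the sigmoid. The weights, bias, and input values are made up for illustration, and the function names are hypothetical.

```python
import math


def step(z):
    """Hard threshold at zero: the neuron is either on (1) or off (0)."""
    return 1.0 if z >= 0 else 0.0


def sigmoid(z):
    """Smooth activation: 1 / (1 + exp(-z)), always between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def perceptron(x, w, b, f):
    """Weighted sum of the inputs plus the bias, passed through the activation f."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return f(z)


# Made-up parameters and input (assumptions for this sketch)
w = [0.5, -1.0, 2.0]
b = -0.5
x = [1.0, 0.0, 1.0]

# Here z = 0.5*1 + (-1)*0 + 2*1 - 0.5 = 2.0
hard_output = perceptron(x, w, b, step)
soft_output = perceptron(x, w, b, sigmoid)
```

The step output only says on or off, while the sigmoid output changes smoothly with the weights, which is what makes gradient-based training possible.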

Perceptron

NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

July 8, 1958

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration

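multilayer perceptron

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + \vec{b}_1) W_2 + \vec{b}_2) \ldots W_N + \vec{b}_N)

[Diagram: inputs x1, x2 connected through weights w11, w12, w13, w21, w22, w23 to three neurons with biases b1, b2, b3, followed by an output neuron with bias b]

w: weight, sets the sensitivity of a neuron

b: bias, up-down weights a neuron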

 

 

EXERCISE

output

how many parameters?

input layer

hidden layer

output layer

hidden layer

Deep Learning

2

DNN:

 

multilayer perceptron

[Diagram: input layer x_1, x_2, x_3; hidden layer of four perceptrons with biases b_1, b_2, b_3, b_4; output neuron with bias b]

input layer, hidden layer, output layer

1970: multilayer perceptron architecture

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3
w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

learned parameters:

w: weight, sets the sensitivity of a neuron

b: bias, up-down weights a neuron

f: activation function, turns neurons on-off
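
A minimal sketch of one fully connected layer with 3 inputs and 4 neurons, computing f(w_{j1}x_1 + w_{j2}x_2 + w_{j3}x_3 + b_j) for each neuron j. The numerical values are made up, the sigmoid is only one possible choice for f, and the names `dense_layer`, `W`, and `b` are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense_layer(x, W, b, f):
    """Fully connected layer: every input feeds every neuron.

    W[j][i] is the weight from input i to neuron j, b[j] is the bias of neuron j.
    """
    return [
        f(sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j)
        for row, b_j in zip(W, b)
    ]


# Made-up input and parameters (assumptions for this sketch)
x = [1.0, 2.0, 3.0]
W = [
    [0.1, 0.2, 0.3],    # weights into neuron 1
    [0.0, -0.1, 0.1],   # weights into neuron 2
    [0.5, 0.5, -0.5],   # weights into neuron 3
    [-0.2, 0.1, 0.0],   # weights into neuron 4
]
b = [0.0, 0.1, -0.1, 0.2]

hidden = dense_layer(x, W, b, sigmoid)

# 3 x 4 weights plus 4 biases are the learned parameters of this layer
n_parameters = sum(len(row) for row in W) + len(b)
```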

BINARY CLASSIFICATION

[Diagram: input layer x_1, x_2, x_3; hidden layer with biases b_1, b_2, b_3, b_4; output layer: P(0), P(1)]

MULTICLASS CLASSIFICATION

[Diagram: input layer x_1, x_2, x_3; hidden layer with biases b_1, b_2, b_3, b_4; output layer: P(A), P(B), P(C), P(D)]

REGRESSION

[Diagram: input layer x_1, x_2, x_3; hidden layer with biases b_1, b_2, b_3, b_4; output layer: a continuous value variable]

DNN:

 

parameters of DNN

3

EXERCISE

how many parameters?

[Diagram: input layer, hidden layer, hidden layer, output layer]

3 x 4 (w) + 4 (b) = 16

4 x 3 (w) + 3 (b) = 15

3 x 1 (w) + 1 (b) = 4

total: 16 + 15 + 4 = 35
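
The count in this exercise can be sketched as a short function: for each pair of consecutive fully connected layers there are n_in x n_out weights plus n_out biases. The layer sizes 3, 4, 3, 1 are read off the arithmetic on the slide; the function name is hypothetical.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of nodes per layer, input layer first.
    Returns the per-layer counts and their total.
    """
    per_layer = [
        n_in * n_out + n_out  # weights + biases feeding the next layer
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    ]
    return per_layer, sum(per_layer)


per_layer, total = count_parameters([3, 4, 3, 1])
```

The same function makes it easy to see how quickly the count grows: widening the two hidden layers to 100 nodes each already gives more than ten thousand parameters.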

DNN:

 

hyperparameters of DNN

4

There are other things that change from model to model, but that are not decided based on the data, simply things we decide "a priori"

hyperparameters

[Diagram: input layer, hidden layer, hidden layer, output layer]

EXERCISE

how many hyperparameters?

GREEN: architecture hyperparameters (items 1-4)

RED: training hyperparameters (items 5-7)

  1. number of layers - 1
  2. number of neurons/layer - N_l
  3. activation function/layer - N_l
  4. layer connectivity - N_l ^ {~??}
  5. optimization metric - 1
  6. optimization method - 1
  7. parameters in optimization - M

principle of parsimony

or Ockham's razor

Pluralitas non est ponenda sine necessitate

William of Ockham (logician and Franciscan friar) ca. 1300

but probably to be attributed to John Duns Scotus (1265–1308)

“Complexity need not be postulated without a need for it”


 
 
 

principle of parsimony

 

Peter Apian, Cosmographia, Antwerp, 1524 from Edward Grant, "Celestial Orbs in the Latin Middle Ages", Isis, Vol. 78, No. 2. (Jun., 1987).

Geocentric models are intuitive:

from our perspective we see the Sun moving, while we stay still

the earth is round,

and it orbits around the sun

As observations improve, this model can no longer fit the data!

not easily anyways...

Encyclopaedia Britannica 1st Edition

Dr Long's copy of Cassini, 1777

A new model that is much simpler fits the data just as well

(perhaps though only until better data comes...)

Heliocentric model from Nicolaus Copernicus' De revolutionibus orbium coelestium.

the principle of parsimony

or Ockham's razor

Between 2 theories that perform similarly choose the simpler one

In the context of model selection simpler means "with fewer parameters"

 
 
 
 

Key Concept

DNNs need a lot of data to train

To optimize a lot of parameters we need..... lots of data!

DNNs are justified if

- there are a lot of variables

- the relationships between input variables and output are non-linear

 

proper care of your DNN:

 


4.1

how to make informed choices in the architectural design (TL;DR:... I will offer some guidance, but really you've got to try a bunch of things...)

NNs are a vast topic and we only have 2 weeks!

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

 
  1. architecture - wide networks tend to overfit, deep networks are hard to train
  2. number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
  3. loss function - needs to be appropriate to the task, e.g. classification vs regression
  4. activation functions - need to be consistent with the loss function
  5. optimization scheme - needs to be appropriate to the task and data
  6. learning rate in optimization - balance speed and accuracy
  7. batch size - smaller batch size is faster but leads to overtraining

An article that compares various DNNs

[Figures: accuracy comparison; batch size]

cheatsheet

 

always check your loss function! it should go down smoothly and flatten out at the end of the training.

not flat? you are still learning!

too flat? you are overfitting...

loss  (gallery of horrors)

jumps are not unlikely (and not necessarily a problem) if the derivatives of your activations are discontinuous (e.g. relu)

regularization (e.g. dropout) is applied during training but not when evaluating on the validation set, so the validation loss can be smaller than the training loss

loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss | good for | activation last layer | size last layer
mean_squared_error | regression | linear | one node
mean_absolute_error | regression | linear | one node
mean_squared_logarithmic_error | regression | linear | one node
binary_crossentropy | binary classification | sigmoid | one node
categorical_crossentropy | multiclass classification | softmax | N nodes
kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes
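
A minimal sketch, in plain Python rather than keras, of how the last-layer activation and the loss fit together for classification. The function names mirror the loss names above but are hypothetical re-implementations, not the keras functions. Softmax is used here as one common choice for an N-node output layer whose values should sum to 1.

```python
import math


def sigmoid(z):
    """One-node output for binary classification: probability of class 1."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """N-node output for multiclass classification: probabilities that sum to 1."""
    shift = max(zs)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(z - shift) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, p):
    """y_true is 0 or 1, p is the predicted probability of class 1."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(one_hot, probs):
    """one_hot marks the true class, probs are the predicted class probabilities."""
    return -sum(t * math.log(p) for t, p in zip(one_hot, probs) if t > 0)


# An undecided network (all raw outputs equal) pays log(N) for N classes
uniform = softmax([0.0, 0.0, 0.0])
loss_undecided = categorical_crossentropy([0, 0, 1], uniform)

# A confident and correct network pays much less
confident = softmax([0.0, 0.0, 5.0])
loss_confident = categorical_crossentropy([0, 0, 1], confident)
```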

GROKKING: GENERALIZATION BEYOND OVERFITTING ON SMALL ALGORITHMIC DATASETS

For small NNs, it is observed that extending training **well past** the beginning of overfitting can trigger a sudden rapid improvement of performance on the test set.

This happens when the latent representation of the data reorganizes itself suddenly in an actually meaningful way. Prior to grokking, the NN is just learning similarities. After grokking, it learns the fundamental relations that govern a phenomenon

On the interpretability of DNNs

EXERCISE

DNN:

 

training DNN

5

Any linear model:

\vec{y} = \vec{x}W + b

[Diagram: inputs x_1, x_2, ..., x_N with weights w_1, w_2, ..., w_N and bias b producing y]

y : prediction

ytrue : target

Error: e.g.

L_2~=~(y - y_\mathrm{true})^2

[Figure: L2 as a surface over the slope and intercept parameters]

Find the best parameters by finding the minimum of the L2 surface

at every step look around and choose the best direction
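
"Look around and choose the best direction" can be sketched as gradient descent on the L2 loss of a line y = ax + b. This is a minimal illustration under stated assumptions: the data are noise-free and made up, the learning rate and number of steps are arbitrary choices that happen to converge for this dataset, and all names are hypothetical.

```python
def l2_loss(a, b, xs, ys):
    """Sum of squared differences between the line a*x + b and the targets."""
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))


def l2_gradient(a, b, xs, ys):
    """Partial derivatives of the L2 loss with respect to slope and intercept."""
    d_slope = sum(2 * (a * x + b - y) * x for x, y in zip(xs, ys))
    d_intercept = sum(2 * (a * x + b - y) for x, y in zip(xs, ys))
    return d_slope, d_intercept


def gradient_descent(xs, ys, learning_rate=0.01, n_steps=5000):
    a, b = 0.0, 0.0  # arbitrary starting point on the loss surface
    history = []
    for _ in range(n_steps):
        d_slope, d_intercept = l2_gradient(a, b, xs, ys)
        # step downhill: the negative gradient is the locally best direction
        a -= learning_rate * d_slope
        b -= learning_rate * d_intercept
        history.append(l2_loss(a, b, xs, ys))
    return a, b, history


# Made-up, noise-free data generated from slope 2 and intercept 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

slope, intercept, history = gradient_descent(xs, ys)
```

A learning rate that is too large for the data makes the steps overshoot the minimum and the loss diverge, which is one reason the loss curve is worth plotting.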

 back-propagation

deep neural net

Fully connected: all nodes go to all nodes of the next layer.

1986: Deep Neural Nets

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + \vec{b}_1) W_2 + \vec{b}_2) \ldots W_N + \vec{b}_N)

f: activation function:

turns neurons on-off

 

w: weight

sets the sensitivity of a neuron

 

b: bias:

up-down weights a neuron

 

 

In a CNN these layers would not be fully connected except the last one

 

Seminal paper 

Y. LeCun 1998

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

[Diagram: inputs x_1, ..., x_N passing through weights, biases b, and activation functions f to the output]

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + \vec{b}_1) W_2 + \vec{b}_2) \ldots W_N + \vec{b}_N)

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, we can rely on the fact that small changes in the parameters lead to small changes in the output, for the right activation functions.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

Training a DNN

feed data forward through network and calculate cost metric

for each layer, calculate effect of small changes on next layer

\vec{y} = f_N(\ldots f_2(f_1(\vec{x} W_1 + \vec{b}_1) W_2 + \vec{b}_2) \ldots W_N + \vec{b}_N)

back-propagation

think of applying the gradient to a function of a function of a function... use:

1)  partial derivatives, 2)  chain rule

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2
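
A minimal sketch of back-propagation as "partial derivatives plus chain rule" on a tiny network: 2 inputs, one hidden layer of 2 sigmoid neurons, and 1 sigmoid output, with the cost C = 1/2 (y - a^L)^2. The network size, parameter values, and function names are all assumptions made for this illustration. The analytic gradients are compared with finite differences, which measure directly the effect of a small change in one weight.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, W1, b1, w2, b2):
    """Feed the input forward: hidden activations h, then the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + bj) for row, bj in zip(W1, b1)]
    out = sigmoid(sum(w * hj for w, hj in zip(w2, h)) + b2)
    return h, out


def cost(out, y):
    return 0.5 * (y - out) ** 2


def backprop(x, y, W1, b1, w2, b2):
    """Gradients of the cost with respect to every weight and bias."""
    h, out = forward(x, W1, b1, w2, b2)
    # output layer: dC/dout = (out - y), dout/dz = out * (1 - out)
    delta_out = (out - y) * out * (1 - out)
    grad_w2 = [delta_out * hj for hj in h]
    grad_b2 = delta_out
    # chain rule back through each hidden neuron j: times w2[j], times h_j * (1 - h_j)
    delta_hidden = [delta_out * w2j * hj * (1 - hj) for w2j, hj in zip(w2, h)]
    grad_W1 = [[d * xi for xi in x] for d in delta_hidden]
    grad_b1 = delta_hidden
    return grad_W1, grad_b1, grad_w2, grad_b2


def numerical_grad_W1(x, y, W1, b1, w2, b2, j, i, eps=1e-6):
    """Finite-difference estimate of dC/dW1[j][i]."""
    plus = [row[:] for row in W1]
    minus = [row[:] for row in W1]
    plus[j][i] += eps
    minus[j][i] -= eps
    c_plus = cost(forward(x, plus, b1, w2, b2)[1], y)
    c_minus = cost(forward(x, minus, b1, w2, b2)[1], y)
    return (c_plus - c_minus) / (2 * eps)


# Made-up input, target, and parameters (assumptions for this sketch)
x, y = [0.5, -1.0], 1.0
W1, b1 = [[0.1, -0.2], [0.4, 0.3]], [0.0, 0.1]
w2, b2 = [0.7, -0.5], 0.2

grad_W1, grad_b1, grad_w2, grad_b2 = backprop(x, y, W1, b1, w2, b2)
max_difference = max(
    abs(grad_W1[j][i] - numerical_grad_W1(x, y, W1, b1, w2, b2, j, i))
    for j in range(2)
    for i in range(2)
)
```

The finite-difference check needs two forward passes per parameter, while back-propagation obtains all the gradients from one forward and one backward pass.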

Punch Line

Deep Neural Nets are not some fancy-pants methods, they are just linear models (chained through activation functions) with a bunch of parameters

Black Box?

Because they have many parameters they are difficult to "interpret" (no easy feature extraction)

 

that may be ok because they are prediction machines

deep dreams

what is happening in DeepDream?

Deep Dream (DD) is a Google software, a pre-trained NN (originally created on the Caffe framework, now ported to many other platforms including tensorflow).

The high level idea relies on training a convolutional NN to recognize common objects, e.g. dogs, cats, cars, in images. As the network learns to recognize those objects it develops its layers to pick out "features" of the images, like lines at certain orientations, circles, etc.

The DD software runs this NN on an image you give it, and it loops on some layers, thus "manifesting" the things it knows how to recognize in the image.

 

 

CNN

1

Convolutional Neural Nets

@akumadog

Brain Programming and the Random Search in Object Categorization

 

Stack multiple convolution layers

The visual cortex learns hierarchically: first detects simple features, then more complex features and ensembles of features

Some shapes are characteristic of the appearance of specific objects:

a dog face is composed of

  • circle eyes,
  • triangle ears,
  • heart nose
  • ...

a Convolutional NN inspects the images by looking for where the image is maximally similar to the specific form

Every piece of the image will have a value of similarity with a specified form

The parameters being learned by the CNN are the template shapes which we call "convolution kernels"

CNN

1a

Convolution

Convolution

convolution is a mathematical operator on two functions f and g that produces a third function f * g expressing how the shape of one is modified by the other.

Convolution Theorem

f * g= \mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\mathcal{F}\{g\}\big\}

where \mathcal{F} is the Fourier transform:

F(\nu) = \int_{\mathbb{R}^n} f(x) e^{-2\pi i x\cdot\nu}\,dx, \qquad G(\nu) = \int_{\mathbb{R}^n} g(x) e^{-2\pi i x\cdot\nu}\,dx
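
A minimal numerical check of the convolution theorem. The slide states the continuous version; this sketch uses its discrete, circular counterpart, with a naive discrete Fourier transform written out explicitly so that nothing beyond the standard library is needed. The signals and function names are made up for illustration.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), or its inverse."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [
        sum(signal[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [value / n for value in out] if inverse else out


def circular_convolution(f, g):
    """Direct circular convolution: (f * g)[m] = sum_k f[k] g[(m - k) mod n]."""
    n = len(f)
    return [sum(f[k] * g[(m - k) % n] for k in range(n)) for m in range(n)]


# Made-up signals of equal length (assumptions for this sketch)
f = [1.0, 2.0, 3.0, 4.0]
g = [0.5, 0.25, 0.0, 0.0]

direct = circular_convolution(f, g)

# Convolution theorem: transform both, multiply element-wise, transform back
product_of_transforms = [F * G for F, G in zip(dft(f), dft(g))]
via_fourier = [value.real for value in dft(product_of_transforms, inverse=True)]
```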

convolution

input layer (5x5 image):

-1 -1 -1 -1 -1
-1  1 -1  1 -1
-1 -1  1 -1 -1
-1  1 -1  1 -1
-1 -1 -1 -1 -1

convolution layer (3x3 kernels, one per feature):

 1 -1 -1        -1 -1  1
-1  1 -1        -1  1 -1
-1 -1  1         1 -1 -1

first element of the feature map (top-left 3x3 patch of the image times the first kernel):

(-1*1) + (-1*-1) + (-1*-1) + \\ (-1*-1)+(1*1)+(-1*-1) + \\ (-1*-1)+(-1*-1)+(1*1)\\ = 7

second element (patch shifted one pixel to the right):

(-1*1) + (-1*-1) + (-1*-1) + \\ (1*-1)+(-1*1)+(1*-1) + \\ (-1*-1)+(1*-1)+(-1*1)\\ = -3

feature map:

 7 -3  3
-3  5 -3
 3 -3  7

the feature map is "richer": we went from binary to R

and it is reminiscent of the original layer

Convolve with different feature: each neuron is 1 feature
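
The sliding-window computation can be sketched in a few lines. The 5x5 image below is an assumption: it is the X-shaped pattern of 1s on a background of -1s that reproduces the feature map values 7, -3, 3 / -3, 5, -3 / 3, -3, 7 shown on the slides with the diagonal 3x3 kernel. The function name is hypothetical, and, as is usual in CNNs, the kernel is applied without flipping.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image and sum the element-wise products.

    Only positions where the kernel fits entirely inside the image are used,
    so a 5x5 image and a 3x3 kernel give a 3x3 feature map.
    """
    k = len(kernel)
    n_rows = len(image) - k + 1
    n_cols = len(image[0]) - k + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(k) for j in range(k))
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


# Assumed input: an X-shaped pattern of 1s on a background of -1s
image = [
    [-1, -1, -1, -1, -1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [-1, -1, -1, -1, -1],
]

# Diagonal kernel: responds most strongly where the image has a diagonal stroke
kernel = [
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
]

feature_map = convolve2d_valid(image, kernel)
```

The largest values sit along the main diagonal of the feature map, where the image patch matches the kernel best.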

CNN

1b

ReLu

ReLu: activation function that replaces negative values with 0's

 7 -3  3
-3  5 -3
 3 -3  7

becomes

7 0 3
0 5 0
3 0 7

1c

Max-Pool

CNN

MaxPooling: reduce image size & generalizes result

7 0 3
0 5 0
3 0 7

2x2 Max Pool

7 5
5 7

By reducing the size and picking the maximum of a sub-region we make the network less sensitive to specific details
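
A minimal sketch of the two steps that follow the convolution: ReLU, then 2x2 max pooling. The input is the 3x3 feature map from the worked example. The stride of 1 is an assumption inferred from the 2x2 result on the slide (a 3x3 map pooled with a 2x2 window gives a 2x2 map only if the window moves one pixel at a time), and the function names are hypothetical.

```python
def relu(feature_map):
    """Replace every negative value with 0."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool(feature_map, size=2, stride=1):
    """Keep only the maximum of each size x size window."""
    n_rows = (len(feature_map) - size) // stride + 1
    n_cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[r * stride + i][c * stride + j]
                for i in range(size)
                for j in range(size)
            )
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


feature_map = [
    [ 7, -3,  3],
    [-3,  5, -3],
    [ 3, -3,  7],
]

activated = relu(feature_map)
pooled = max_pool(activated, size=2, stride=1)
```

The pooled map keeps the strong diagonal response while discarding exactly where inside each window it occurred, which is what makes the network less sensitive to specific details.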

CNN

final layer:

the final layer is fully connected

x

O

last hidden layer

output layer

resources

 

homework

 

dsu23_1

By federica bianco
