ML for physical and natural scientists 2023 8

dr.federica bianco | fbb.space |    fedhere |    fedhere

NNs and Deep Learning

this slide deck:

https://slides.com/federicabianco/mlpns23_8

Recap

0

Data-driven models, used to explore structure and to make predictions, that learn their parameters from data.

Machine Learning

used to:

- classify based on examples
- understand structure of feature space
- regression (classification with infinitely small classes)
- understand which features are important in prediction (to get close to causality)

General ML usage

unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

Algorithms: Clustering, (...)

[Figure: data shown on x and y axes]

Machine Learning

unsupervised ------ supervised

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting missing feature

- classification

- regression

Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)

[Figure: data shown on x and y axes]

Machine Learning


unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

Algorithms: k-means clustering, agglomerative clustering, density based clustering, (...)

Machine Learning

model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or by calculating a target function and trying to maximize it)

e.g.

L1 = |target - prediction|

Learning relies on the definition of a loss function

Machine Learning

Learning relies on the definition of a loss function

learning type	loss / target
unsupervised	intra-cluster variance / inter cluster distance
supervised	distance between prediction and truth
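A minimal sketch of the two kinds of loss in the table, assuming 1D data stored in plain Python lists. The function names and the toy numbers are illustrative assumptions, not part of the course material.

```python
def l1_loss(targets, predictions):
    """Supervised: mean of |target - prediction| over the data."""
    return sum(abs(t - p) for t, p in zip(targets, predictions)) / len(targets)


def l2_loss(targets, predictions):
    """Supervised: mean of (target - prediction)^2 over the data."""
    return sum((t - p) ** 2 for t, p in zip(targets, predictions)) / len(targets)


def intra_cluster_variance(points, labels):
    """Unsupervised: sum of squared distances of 1D points to their cluster mean."""
    clusters = {}
    for x, label in zip(points, labels):
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for members in clusters.values():
        mean = sum(members) / len(members)
        total += sum((x - mean) ** 2 for x in members)
    return total


# toy values, made up for illustration
targets = [1.0, 2.0, 3.0]
predictions = [1.5, 2.0, 2.0]
print(l1_loss(targets, predictions))
print(l2_loss(targets, predictions))
print(intra_cluster_variance([0.0, 2.0, 10.0, 12.0], [0, 0, 1, 1]))
```

Learning then amounts to searching for the parameters (or cluster assignments) that make these numbers as small as possible.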

Machine Learning

The definition of a loss function requires the definition of distance or similarity

Machine Learning

Minkowski distance

Jaccard similarity

Great circle distance
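The three measures named above can be sketched as follows. This is an illustrative implementation under stated assumptions: the great circle distance uses the haversine formula with angles in degrees, the default radius of 1 is arbitrary, and returning 1.0 for two empty sets is a convention chosen here.

```python
import math


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def jaccard_similarity(set_a, set_b):
    """Size of the intersection divided by size of the union."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    if not union:
        return 1.0  # convention chosen here for two empty sets
    return len(set_a & set_b) / len(union)


def great_circle_distance(lat1, lon1, lat2, lon2, radius=1.0):
    """Distance along a sphere between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


print(minkowski_distance([0, 0], [3, 4], p=2))
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
print(great_circle_distance(0, 0, 0, 90))
```

Which measure is appropriate depends on the feature space: continuous vectors, sets, or positions on a sphere.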

[Figure: Venn diagram of two sets A and B and their intersection {A\cap B}, illustrating Jaccard similarity]

Machine Learning

NN:

Neural Networks

1

1.1

origins

1943

M-P Neuron, McCulloch & Pitts 1943

it's a classifier:

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0

if x_i is Bool (True/False), what value of \theta corresponds to logical AND?
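One way to explore the threshold question is to tabulate the neuron output for every combination of three Boolean inputs. This is a small sketch; the function name `mp_neuron` is an assumption made for illustration.

```python
from itertools import product


def mp_neuron(inputs, theta):
    """M-P neuron: fire (1) if the sum of the Boolean inputs reaches theta, else 0."""
    return 1 if sum(inputs) >= theta else 0


# truth table for three Boolean inputs and two candidate thresholds
for x in product([0, 1], repeat=3):
    print(x, "theta=3:", mp_neuron(x, theta=3), "theta=1:", mp_neuron(x, theta=1))
```

With three inputs, theta=3 fires only when all inputs are 1 (logical AND), while theta=1 fires when at least one input is 1 (logical OR).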

The perceptron algorithm : 1958, Frank Rosenblatt

1958

Perceptron

[Figure: perceptron diagram. The inputs x_1, x_2, ..., x_N are multiplied by the weights w_1, w_2, ..., w_N and summed, and the bias b is added to give the output, as in linear regression.]

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

Perceptrons are linear classifiers: they make their predictions based on a linear predictor function that combines a set of weights (= parameters) with the feature vector.

y ~= ~\sum_i w_ix_i ~+~ b

y= \begin{cases} 1 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~\geq~Z\\ 0 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~<~Z \end{cases}
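The thresholded output above can be sketched in a few lines. The weights, bias, and threshold below are hand-picked for illustration (they are not learned), chosen so that the perceptron reproduces a two-input AND.

```python
def perceptron(x, w, b, Z=0.0):
    """Return 1 if the weighted sum of the inputs plus the bias reaches Z, else 0."""
    s = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1 if s >= Z else 0


# hand-picked (not learned) parameters: the line x_1 + x_2 = 1.5 separates (1, 1) from the rest
w = [1.0, 1.0]
b = -1.5
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron(x, w, b))
```

The decision boundary is the set of points where the weighted sum plus the bias equals Z, which is a straight line (or hyperplane): this is what makes the perceptron a linear classifier.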

[Figure: perceptron diagram. The inputs x_1, x_2, ..., x_N are combined with the weights w_i and the bias b, and the result is passed through the activation function f to give the output.]

y ~= f(~\sum_i w_ix_i ~+~ b)

sigmoid activation function:

f(z) = \sigma(z) = \frac{1}{1 + e^{-z}}
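A short sketch of a single neuron with a sigmoid activation, assuming z is the weighted sum of the inputs plus the bias. The two-branch form of the sigmoid is a numerical-stability choice made here, and the example weights are arbitrary.

```python
import math


def sigmoid(z):
    """Sigmoid 1 / (1 + exp(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def neuron(x, w, b, f=sigmoid):
    """y = f(sum_i w_i x_i + b) for a single neuron with activation f."""
    return f(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)


print(sigmoid(0.0))  # 0.5
print(neuron([1.0, 2.0], [0.5, -0.25], 0.0))  # weighted sum is 0, so also 0.5
```

Unlike the hard threshold, the sigmoid changes smoothly with its input, so small changes in w and b give small changes in the output.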


Perceptron

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration

NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

July 8, 1958


Deep Learning

2

DNN:

multilayer perceptron

[Figure: multilayer perceptron. The inputs x_1, x_2, x_3 (input layer) feed a layer of four perceptrons with biases b_1, b_2, b_3, b_4 (hidden layer), which feeds the output node with bias b (output layer).]

1970: multilayer perceptron architecture

Fully connected: all nodes go to all nodes of the next layer.

[Figure: multilayer perceptron with the connection weights w_{11} ... w_{34} labeled]

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3

w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4
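The four sums above are one matrix-vector product plus a bias vector. A minimal sketch, with made-up numbers for the weights and biases and with row j of W holding the weights w_{j1}, w_{j2}, w_{j3} of neuron j:

```python
def layer_outputs(x, W, b):
    """For each neuron j return sum_i w_ji * x_i + b_j (before any activation)."""
    return [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]


# 3 inputs, 4 neurons; all numbers are arbitrary illustration values
x = [1.0, 2.0, 3.0]
W = [[0.1, 0.2, 0.3],
     [0.0, 1.0, 0.0],
     [1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
b = [0.0, 1.0, 2.0, -3.0]
print(layer_outputs(x, W, b))
```

Each neuron sees all three inputs, which is what "fully connected" means for this layer.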

w: weight

sets the sensitivity of a neuron

b: bias

up-down weights a neuron

f: activation function

turns neurons on-off

w and b are the learned parameters

DNN:

hyperparameters of DNN

3

EXERCISE

how many parameters?

[Figure: fully connected network with an input layer, two hidden layers, and an output layer]

EXERCISE

how many hyperparameters?

hyperparameter	how many	type
number of layers	1	architecture
number of neurons/layer	N_l	architecture
activation function/layer	N_l	architecture
layer connectivity	N_l ^ {~??}	architecture
optimization metric	1	training
optimization method	1	training
parameters in optimization	M	training
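For the "how many parameters?" question, a fully connected network has one weight per connection and one bias per non-input neuron. A small counting sketch follows; the layer sizes used are assumptions for illustration, since the sizes in the exercise figure are not given in the text.

```python
def count_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# 3 inputs, 4 hidden neurons, 1 output (the small network drawn earlier in the deck)
print(count_parameters([3, 4, 1]))     # 3*4 + 4 + 4*1 + 1 = 21
# assumed sizes for a network with two hidden layers
print(count_parameters([3, 4, 4, 1]))  # 16 + 20 + 5 = 41
```

The count grows with the product of adjacent layer sizes, so wide layers dominate the total.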

EXERCISE

http://playground.tensorflow.org/

DNN:

training DNN

4

https://colab.research.google.com/drive/13c9uJ_fPGjszgsyEuYWafR2F4_n-IXeZ

deep neural net

Fully connected: all nodes go to all nodes of the next layer.

1986: Deep Neural Nets

\vec{y} = f_N(...f_2(f_1(\vec{x} W_1 + b_1) W_2 + b_2)... W_N + b_N)

f: activation function:

turns neurons on-off

w: weight

sets the sensitivity of a neuron

b: bias:

up-down weights a neuron
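The nested expression for a deep neural net is a loop over layers: multiply by the weights, add the bias, apply the activation, and pass the result on. A sketch under the assumption that each W has shape (inputs, outputs), matching the row-vector form \vec{x}W + b; the tiny weights are arbitrary.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, W, b):
    """Compute x W + b for a row vector x and a matrix W of shape (n_in, n_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def forward(x, layers):
    """Apply each (W, b, f) layer in turn: x <- f(x W + b)."""
    for W, b, f in layers:
        x = [f(z) for z in dense(x, W, b)]
    return x


# 2 inputs -> 2 hidden neurons -> 1 output, arbitrary illustration weights
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1], sigmoid),
    ([[1.0], [-1.0]], [0.0], sigmoid),
]
print(forward([1.0, 2.0], layers))
```

Swapping `sigmoid` for another function per layer changes the activation without touching the rest of the loop.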

In a CNN these layers would not be fully connected except the last one

http://yann.lecun.com/exdb/publis/pdf/lecun-01a.pdf

Seminal paper

Y. LeCun 1998

Any linear model:

\vec{y} = \vec{x}W + b

y : prediction

y_\mathrm{true} : target

Error: e.g.

L_2~=~(y - y_\mathrm{true})^2

[Figure: L2 as a function of the model parameters, slope and intercept]

Find the best parameters by finding the minimum of the L2 surface

at every step look around and choose the best direction: the one in which L2 decreases fastest (gradient descent)
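A minimal gradient descent sketch for a straight-line model, assuming a mean L2 loss and a fixed step size. The learning rate, number of steps, and toy data are choices made here for illustration.

```python
def fit_line(xs, ys, learning_rate=0.05, n_steps=2000):
    """Fit y = slope * x + intercept by gradient descent on the mean L2 loss."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        residuals = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # partial derivatives of mean((prediction - target)^2)
        grad_slope = 2.0 / n * sum(r * x for r, x in zip(residuals, xs))
        grad_intercept = 2.0 / n * sum(residuals)
        # step downhill along the negative gradient
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept


# noiseless toy data generated from y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
print(fit_line(xs, ys))
```

If the learning rate is too large the steps overshoot and the loss grows instead of shrinking; if it is too small, convergence is slow.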

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf

\vec{y} = f_N(...f_2(f_1(\vec{x} W_1 + b_1) W_2 + b_2)... W_N + b_N)

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, we can rely on the fact that, for the right activation functions, small changes in the parameters lead to small changes in the output.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

Training a DNN

feed data forward through network and calculate cost metric

for each layer, calculate effect of small changes on next layer

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

think of simply applying the gradient to a function of a function of a function... use:

1) partial derivatives, 2) chain rule

http://neuralnetworksanddeeplearning.com/chap2.html
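To make the chain rule concrete, here is a sketch of back-propagation on the smallest possible network: one input, one hidden sigmoid neuron, one sigmoid output, with the cost C = 1/2 (y - a)^2. The network size and parameter values are assumptions for illustration; the finite-difference check is included only to confirm the derivatives.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, params):
    """Feed x forward; return the hidden activation a1 and the output a2."""
    w1, b1, w2, b2 = params
    a1 = sigmoid(w1 * x + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return a1, a2


def cost(x, y, params):
    _, a2 = forward(x, params)
    return 0.5 * (y - a2) ** 2


def backprop(x, y, params):
    """Gradient [dC/dw1, dC/db1, dC/dw2, dC/db2] from the chain rule."""
    w1, b1, w2, b2 = params
    a1, a2 = forward(x, params)
    # output layer: dC/dz2 = dC/da2 * da2/dz2
    delta2 = (a2 - y) * a2 * (1.0 - a2)
    # hidden layer: dC/dz1 = dC/dz2 * dz2/da1 * da1/dz1
    delta1 = delta2 * w2 * a1 * (1.0 - a1)
    return [delta1 * x, delta1, delta2 * a1, delta2]


def numerical_gradient(x, y, params, eps=1e-6):
    """Central finite differences, used only as a check on backprop."""
    grads = []
    for k in range(len(params)):
        up, down = list(params), list(params)
        up[k] += eps
        down[k] -= eps
        grads.append((cost(x, y, up) - cost(x, y, down)) / (2 * eps))
    return grads


params = [0.5, -0.3, 0.8, 0.1]  # arbitrary starting values
print(backprop(1.5, 1.0, params))
print(numerical_gradient(1.5, 1.0, params))
```

The error term delta is computed at the output and passed backward one layer at a time, which is why one backward pass gives the gradient for every weight and bias.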

Punch Line

Deep Neural Nets are not some fancy-pants method: they are just linear models with a bunch of parameters

Black Box?

Because they have many parameters they are difficult to "interpret" (no easy feature extraction)

that may be ok because they are prediction machines


Epistemic transparency

Right to explanation: the scope of a general "right to explanation" is a matter of ongoing debate

Illustration by Hanne Morstad

Democratised AI — The Black Box Problem

Accountability: who is responsible if an algorithm does harm

algorithmic transparency

strictly policy issues:

proprietary algorithms + auditability

#UDCSS2020

@fedhere

https://www.americanscientist.org/article/a-peek-at-proprietary-algorithms

technical + policy issues:

data access and redress + data provenance

algorithmic transparency

https://www.darpa.mil/attachments/XAIProgramUpdate.pdf

[Figure: model families ordered from trivially intuitive to hard to interpret (univariate linear regression, generalized additive models, decision trees, SVM, Random Forest, Deep Learning); axis: Accuracy]

we're still trying to figure it out

[Figure: model families from univariate linear regression (trivially intuitive) through generalized additive models, decision trees, SVM, and Random Forest to Deep Learning, shown against three axes in turn: number of features that can be effectively included in the model (from 1 to thousands), accuracy in solving complex problems, and time]

https://www.darpa.mil/attachments/XAIProgramUpdate.pdf

algorithmic transparency

1

Machine learning: any method that learns parameters from the data

http://www.statsguy.co.uk/brexit-voting-and-education/

2

The transparency of an algorithm decreases with its complexity and with the complexity of the data space

3

The transparency of an algorithm is limited by our own ability and preparedness to interpret it

Toward Interpretable Machine Learning, Samek+2003

algorithmic transparency


A single tree model

algorithmic transparency

accountability

can scientists be held responsible?
should whoever commissions the algorithm be responsible?
is nobody responsible, under the premise that decisions are objective? -> are they objective? what does objective mean? how can we objectively measure objectivity?

https://www.scientificamerican.com/article/italian-scientists-get/

accountability

In a press release, the ACLU wrote, “Mr. Williams’ experience was the first case of wrongful arrest due to facial recognition technology to come to light in the United States.”


Who is responsible for setting the threshold?

FR returns a probabilistic result

a threshold is chosen to turn it into a T/F match for decision making
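A toy sketch of how the threshold choice changes the outcome. The scores and labels below are invented for illustration and do not come from any real system; the function names are hypothetical.

```python
def match_decisions(scores, threshold):
    """Turn probabilistic match scores into True/False decisions."""
    return [s >= threshold for s in scores]


def false_match_rate(scores, is_same_person, threshold):
    """Fraction of different-person pairs that are declared a match."""
    decisions = match_decisions(scores, threshold)
    false_matches = sum(d and not truth for d, truth in zip(decisions, is_same_person))
    n_different = sum(not truth for truth in is_same_person)
    return false_matches / n_different if n_different else 0.0


# invented scores and ground truth, for illustration only
scores = [0.95, 0.80, 0.65, 0.40]
is_same_person = [True, False, True, False]
for threshold in (0.6, 0.9):
    print(threshold, match_decisions(scores, threshold),
          false_match_rate(scores, is_same_person, threshold))
```

A lower threshold produces more matches and more false matches; a higher one misses true matches. The number itself is a human decision, not an output of the model.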

unethical applications of FR

https://modelviewculture.com/pieces/the-hidden-dangers-of-ai-for-queer-and-trans-people


https://www.vice.com/en/article/g5gxg3/proctorio-is-using-racist-algorithms-to-detect-faces

deep dreams

what is happening in DeepDream?

Deep Dream (DD) is a Google software, a pre-trained NN (originally created in the Caffe framework, now ported to many other platforms including TensorFlow).

The high level idea relies on training a convolutional NN to recognize common objects, e.g. dogs, cats, cars, in images. As the network learns to recognize those objects, it develops its layers to pick out "features" of the images, like lines at certain orientations, circles, etc.

The DD software runs this NN on an image you give it, and it loops on some layers, thus "manifesting" the things it knows how to recognize in the image.
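The core loop can be imitated with a toy example: instead of changing the weights to fit the data, change the input so that a chosen neuron responds more strongly (gradient ascent on the input). This is a one-neuron sketch under strong simplifying assumptions, not the DeepDream code; the "image" is a list of three numbers and the weights are arbitrary.

```python
import math


def activation(x, w, b):
    """Response of a single sigmoid neuron to the input x."""
    z = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def dream(x, w, b, step=0.5, n_steps=20):
    """Repeatedly nudge the input in the direction that increases the activation."""
    x = list(x)
    for _ in range(n_steps):
        a = activation(x, w, b)
        # d(activation)/dx_i = a * (1 - a) * w_i
        grad = [a * (1.0 - a) * w_i for w_i in w]
        x = [x_i + step * g_i for x_i, g_i in zip(x, grad)]
    return x


w = [1.0, -2.0, 0.5]  # the "feature" this neuron has learned (arbitrary values)
b = 0.0
x0 = [0.0, 0.0, 0.0]  # a blank toy "image"
x1 = dream(x0, w, b)
print(activation(x0, w, b), activation(x1, w, b))
print(x1)
```

The modified input drifts toward the pattern stored in the weights, which is the sense in which the network "manifests" what it knows how to recognize.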

Olague et al 2017