DSU AI workshop

2023

University of Delaware

Department of Physics and Astronomy

federica bianco

Biden School of Public Policy and Administration

Data Science Institute

@fedhere

[Machine Learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

what is ML?

learning type	loss / target
unsupervised	intra-cluster variance / inter-cluster distance
supervised	distance between prediction and truth

The perceptron algorithm : 1958, Frank Rosenblatt

linear regression:

[Diagram: inputs x_1, x_2, ..., x_N are multiplied by the weights w_1, w_2, ..., w_N and summed with the bias b to give the output]

1958: Perceptron

1 ~\mathrm{if} ~\sum_{i=1}^Nw_ix_i \geq\theta ~\mathrm{else}~ 0

y= \begin{cases} 1 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~\geq~Z\\ 0 ~\mathrm{if}~ \sum_i(x_i w_i) + b ~<~Z \end{cases}

perceptron

[Diagram: inputs x_1, x_2, ..., x_N, weights w_1, w_2, ..., w_N, and bias b are combined, and the activation function f is applied to give the output]

y ~= f(~\sum_i w_ix_i ~+~ b)

Perceptrons are linear classifiers: they make their predictions based on a linear predictor function combining a set of weights (=parameters) with the feature vector.
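A minimal sketch of the perceptron rule in Python with NumPy. The function name `perceptron`, the weights, and the bias are made-up values for illustration (chosen so the unit behaves like a logical AND), and the threshold Z is assumed to be 0.

```python
import numpy as np

def perceptron(x, w, b):
    """Step-activation perceptron: 1 if sum_i(w_i x_i) + b >= 0, else 0."""
    z = np.dot(w, x) + b        # linear predictor: weights combined with features
    return 1 if z >= 0 else 0   # threshold Z assumed to be 0

# hypothetical weights and bias that implement a logical AND of two inputs
w = np.array([1.0, 1.0])
b = -1.5

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
outputs = [perceptron(np.array(x, dtype=float), w, b) for x in inputs]
print(outputs)
```

Only the input (1, 1) pushes the weighted sum above the threshold, so the decision boundary is the straight line x_1 + x_2 = 1.5.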

sigmoid

[Diagram: the same perceptron, with a sigmoid as the activation function f]

\sigma = \frac{1}{1 + e^{-z}}
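As a sketch of how the sigmoid replaces the hard threshold, the snippet below evaluates y = f(sum_i w_i x_i + b) with f set to the logistic function. The names `sigmoid` and `neuron`, and the weights and bias, are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    """y = f(sum_i w_i x_i + b) with f = sigmoid."""
    return sigmoid(np.dot(w, x) + b)

w = np.array([1.0, 1.0])   # hypothetical weights
b = -1.5                   # hypothetical bias

y_low = neuron(np.array([0.0, 0.0]), w, b)    # weighted sum below zero
y_high = neuron(np.array([1.0, 1.0]), w, b)   # weighted sum above zero
print(y_low, y_high)
```

The output now varies smoothly between 0 and 1 instead of jumping, which is what makes gradient-based training possible.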

\vec{y} = f_N(\dots f_2(f_1(\vec{x}\,W_1 + b_1)\,W_2 + b_2)\dots W_N + b_N)

[Diagram: a network with inputs x1, x2, weights w11, w12, w13, w21, w22, w23, and biases b1, b2, b3]

multilayer perceptron

w: weight: sets the sensitivity of a neuron

b: bias: up-down weights a neuron

[Diagram: inputs x_1, x_2, x_3 feed a layer of four perceptrons, which feed the output]

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3

w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

learned parameters: w, b

f: activation function: turns neurons on-off

[Diagram: the same fully connected network with the activation function f applied at each node; layers labeled input layer, hidden layer, output layer]
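The layer of perceptrons can be sketched as one matrix product per layer. This is an illustrative forward pass only, assuming three inputs, four hidden nodes, one output node, sigmoid activations, and random (untrained) parameters; none of these values come from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# hidden layer: 4 perceptrons, each with 3 weights and 1 bias
W1 = rng.normal(size=(4, 3))    # W1[j, i] plays the role of w_{ji}
b1 = rng.normal(size=4)

# output layer: 1 node fed by all 4 hidden nodes (fully connected)
W2 = rng.normal(size=(1, 4))
b2 = rng.normal(size=1)

x = np.array([0.5, -1.0, 2.0])  # made-up input (x_1, x_2, x_3)

# each row of W1 @ x + b1 is w_{j1}x_1 + w_{j2}x_2 + w_{j3}x_3 + b_j
hidden = sigmoid(W1 @ x + b1)
output = sigmoid(W2 @ hidden + b2)

# learned parameters: every weight and every bias
n_parameters = W1.size + b1.size + W2.size + b2.size
print(hidden.shape, output.shape, n_parameters)
```

Writing the layer as `W1 @ x` is equivalent to the row-vector form x W used elsewhere in the slides, with W transposed.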

EXERCISE

how many hyperparameters?

number of layers - 1
number of neurons/layer - N_l
activation function/layer - N_l
layer connectivity - N_l ^ {~??}
optimization metric - 1
optimization method - 1
parameters in optimization - M

GREEN: architecture hyperparameters

RED: training hyperparameters

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kullback_leibler_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes
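To make a few of the losses in the table concrete, here is a NumPy sketch of three of them. The function names mirror the table entries but are defined here by hand, not imported from any library, and the data are made up.

```python
import numpy as np

def mean_squared_error(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def binary_crossentropy(y_true, p_pred, eps=1e-12):
    # clip the probabilities so that log(0) never occurs
    p = np.clip(p_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

# regression: one linear output node
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([1.5, 2.0, 2.0])
mse = mean_squared_error(y_true, y_pred)
mae = mean_absolute_error(y_true, y_pred)

# binary classification: one sigmoid output node giving a probability
labels = np.array([1.0, 0.0])
probs = np.array([0.9, 0.2])
bce = binary_crossentropy(labels, probs)
print(mse, mae, bce)
```

The squared error penalizes the single large miss more heavily than the absolute error does, which is the usual reason to choose between the two.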

Any linear model:

\vec{y} = \vec{x}W + b

[Diagram: inputs x_1, x_2, ..., x_N, weights w_1, w_2, ..., w_N, and bias b combine to give the prediction y]

y : prediction

y_\mathrm{true} : target

Error: e.g.

L_2~=~(y - y_\mathrm{true})^2

[Plot labels: intercept, slope, L2, x]

Find the best parameters by finding the minimum of the L2 surface

at every step look around and choose the best direction (the direction of steepest descent)
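A minimal gradient-descent sketch for the slope and intercept of a line, minimizing the mean L2 error. The synthetic data (true slope 2, true intercept 1, small Gaussian noise), the learning rate, and the number of steps are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 1, 50)
y_true = 2.0 * x + 1.0 + rng.normal(scale=0.05, size=x.size)  # synthetic targets

def l2(slope, intercept):
    """Mean squared distance between prediction and target."""
    return np.mean((slope * x + intercept - y_true) ** 2)

slope, intercept = 0.0, 0.0     # arbitrary starting point
learning_rate = 0.5             # assumed step size
initial_loss = l2(slope, intercept)

for step in range(2000):
    residual = slope * x + intercept - y_true
    # partial derivatives of the mean L2 with respect to each parameter
    grad_slope = 2 * np.mean(residual * x)
    grad_intercept = 2 * np.mean(residual)
    # step against the gradient: the locally best direction
    slope -= learning_rate * grad_slope
    intercept -= learning_rate * grad_intercept

final_loss = l2(slope, intercept)
print(slope, intercept, final_loss)
```

For this simple model the result can be compared with a closed-form least-squares fit such as `np.polyfit(x, y_true, 1)`; for a neural network no such closed form exists, which is why the iterative approach matters.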

back-propagation

\vec{y} = f_N(\dots f_2(f_1(\vec{x}\,W_1 + b_1)\,W_2 + b_2)\dots W_N + b_N)

f: activation function: turns neurons on-off

w: weight: sets the sensitivity of a neuron

b: bias: up-down weights a neuron

In a CNN these layers would not be fully connected except the last one

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize?

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}

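[Diagram: a node receives the inputs x_1, ..., x_N through its weights, adds the bias b, and applies the activation function f to give the output]

https://www.iro.umontreal.ca/~vincentp/ift3395/lectures/backprop_old.pdf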

\vec{y} = f_N(\dots f_2(f_1(\vec{x}\,W_1 + b_1)\,W_2 + b_2)\dots W_N + b_N)

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, small changes in the parameters lead to small changes in the output for the right activation functions.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2

Training a DNN

feed data forward through network and calculate cost metric

for each layer, calculate effect of small changes on next layer

back-propagation

think of applying the gradient to a function of a function of a function... use:

1) partial derivatives, 2) chain rule

http://neuralnetworksanddeeplearning.com/chap2.html
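The chain-rule idea can be sketched on a tiny network: a forward pass, then a backward pass that propagates the derivative of the quadratic cost C = 1/2 sum_j (y_j - a^L_j)^2 layer by layer. The network size (2 inputs, 3 hidden nodes, 1 output), the sigmoid activations, and all numbers are assumptions; one gradient entry is checked against a finite difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
x = np.array([0.2, -0.4])    # made-up input
y = np.array([1.0])          # made-up target
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2, b2 = rng.normal(size=(1, 3)), rng.normal(size=1)

def cost(W1, b1, W2, b2):
    a1 = sigmoid(W1 @ x + b1)
    aL = sigmoid(W2 @ a1 + b2)
    return 0.5 * np.sum((y - aL) ** 2)

# forward pass, keeping the intermediate activations
a1 = sigmoid(W1 @ x + b1)
aL = sigmoid(W2 @ a1 + b2)

# backward pass: chain rule, starting from the output layer
delta2 = (aL - y) * aL * (1 - aL)           # dC/dz at the output layer
grad_W2 = np.outer(delta2, a1)
grad_b2 = delta2
delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # dC/dz at the hidden layer
grad_W1 = np.outer(delta1, x)
grad_b1 = delta1

# central finite difference on a single weight as an independent check
eps = 1e-6
W1_plus, W1_minus = W1.copy(), W1.copy()
W1_plus[0, 0] += eps
W1_minus[0, 0] -= eps
numeric = (cost(W1_plus, b1, W2, b2) - cost(W1_minus, b1, W2, b2)) / (2 * eps)
print(grad_W1[0, 0], numeric)
```

The finite-difference check needs two forward passes per parameter, while the backward pass yields every partial derivative at once, which is the practical reason back-propagation is used.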

Convolution Theorem

f * g= \mathcal{F}^{-1}\big\{\mathcal{F}\{f\}\cdot\mathcal{F}\{g\}\big\}

\mathcal{F}: Fourier transform

{\begin{aligned}F(\nu )&=\int _{\mathbb {R} ^{n}}f(x)e^{-2\pi ix\cdot \nu }\,dx,\\ G(\nu )&=\int _{\mathbb {R} ^{n}}g(x)e^{-2\pi ix\cdot \nu }\,dx,\end{aligned}}
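A quick numerical check of the theorem, assuming its discrete, periodic (circular) counterpart rather than the continuous integrals shown on the slide. The random sequences `f` and `g` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
f = rng.normal(size=n)
g = rng.normal(size=n)

# direct circular convolution: (f * g)[k] = sum_m f[m] g[(k - m) mod n]
direct = np.array([sum(f[m] * g[(k - m) % n] for m in range(n))
                   for k in range(n)])

# convolution theorem: transform, multiply pointwise, transform back
via_fft = np.fft.ifft(np.fft.fft(f) * np.fft.fft(g)).real

print(np.max(np.abs(direct - via_fft)))
```

The two results agree to floating-point precision, but the FFT route scales as n log n instead of n^2.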

image (5x5):

-1	-1	-1	-1	-1
-1	1	-1	1	-1
-1	-1	1	-1	-1
-1	1	-1	1	-1
-1	-1	-1	-1	-1

filter (3x3):

1	-1	-1
-1	1	-1
-1	-1	1

filter applied at the top-left corner of the image:

(-1*1) + (-1*-1) + (-1*-1) + \\ (-1*-1)+(1*1)+(-1*-1) + \\ (-1*-1)+(-1*-1)+(1*1)\\ = 7

filter shifted one pixel to the right:

(-1*1) + (-1*-1) + (-1*-1) + \\ (-1*1)+(-1*1)+(-1*1) + \\ (-1*-1)+(-1*1)+(-1*1)\\ = -3

feature map so far:

7	-3
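The sliding-filter arithmetic can be reproduced with a short NumPy sketch. It assumes that the blank cells of the 5x5 image are pixels with value 1 (an X shape), which is consistent with the worked sums giving 7 and -3. The helper name `slide_filter` is hypothetical, and the filter is applied without flipping (cross-correlation), as is common in CNN layers.

```python
import numpy as np

# 5x5 image: blank cells in the slide are assumed to be 1
image = np.array([
    [-1, -1, -1, -1, -1],
    [-1,  1, -1,  1, -1],
    [-1, -1,  1, -1, -1],
    [-1,  1, -1,  1, -1],
    [-1, -1, -1, -1, -1],
])

# 3x3 diagonal filter from the slide
kernel = np.array([
    [ 1, -1, -1],
    [-1,  1, -1],
    [-1, -1,  1],
])

def slide_filter(image, kernel):
    """Multiply the filter with every image patch and sum (stride 1, no padding)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=int)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

feature_map = slide_filter(image, kernel)
print(feature_map)
```

The first two entries of the top row of `feature_map` correspond to the two filter positions worked out by hand.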
