foundations of data science for everyone XII

dr. federica bianco | fbb.space | fedhere | fedhere

Generative AI

this slide deck:

https://slides.com/federicabianco/fdsfe_13

Recap: Data Science

0

FDSFE

Data Science

The discipline that deals with extraction of information from data in a specific domain context, from data collection through inference

(Problem Identification and Planning)

Data Collection
Data Exploration
Data Preparation
Model Identification
Model Building
Model Evaluation
Model Deployment.

Data Science

remote sensing

survey science

instrumental design and development

data retrieval

...

Data Science

data types

identify correlation

missing variable

...

Data Science

Scaling and whitening

tokenizing

...

Data Science

what is the goal:

statistical analysis

anomaly detection

prediction

structure identification

....

what is the task:

regression

classification

Data Science

SciPy

Data Science

Data driven models for exploration of structure and prediction that learn parameters from data.

Machine Learning

Reinforcement Learning

Active Learning

unsupervised learning ------ supervised learning

Data driven models for exploration of structure and prediction that learn parameters from data.

unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

- anomaly detection

Algorithms: kMeans clustering, DBSCAN, Agglomerative clustering

Machine Learning

Data driven models for exploration of structure and prediction that learn parameters from data.

unsupervised ------ supervised

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting missing feature

- classification

- regression

Algorithms: regression, (SVM), Classification and Regression Tree methods, k-nearest neighbors, neural networks, (...)

Machine Learning

Learning relies on the definition of a loss function

learning type	loss / target
unsupervised	intra-cluster variance / inter cluster distance
supervised	distance between prediction and truth

Machine Learning

model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or by calculating a target function and trying to maximize it)

e.g. supervised

L1 = |target - prediction|
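A minimal sketch of "calculate the loss for different parameter sets and keep the one that minimizes it", using the L1 loss above. The toy data, the one-parameter model `y = slope * x`, and the list of candidate slopes are all assumptions made for illustration; real training uses an optimizer rather than a hand-written grid.

```python
def l1_loss(targets, predictions):
    """Sum of absolute differences |target - prediction| over all observations."""
    return sum(abs(t - p) for t, p in zip(targets, predictions))


# toy data generated by y = 2x (hypothetical, for illustration only)
x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 2.0, 4.0, 6.0]

# candidate parameter sets: here a single slope for the model y = slope * x
candidate_slopes = [0.0, 1.0, 2.0, 3.0]

# evaluate the loss for each candidate and keep the one with the smallest loss
losses = {s: l1_loss(y, [s * xi for xi in x]) for s in candidate_slopes}
best_slope = min(losses, key=losses.get)

print(losses)
print("best slope:", best_slope)
```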

Learning relies on the definition of a loss function

Machine Learning

supervised and unsupervised

e.g. unsupervised

Inertia =

\sum_j \sum_{i \in j} \left(x_i - \bar{x}_j\right)^2
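A short sketch of the inertia as an unsupervised loss: the sum of squared distances between each point and the mean of its cluster. The four points and the two candidate labelings are made up for illustration; the tighter clustering gives the smaller inertia.

```python
import numpy as np


def inertia(points, labels):
    """Sum over clusters j of the squared distances of members i to the cluster mean."""
    total = 0.0
    for j in np.unique(labels):
        members = points[labels == j]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)


# hypothetical data: two tight groups far apart along the first axis
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good_labels = np.array([0, 0, 1, 1])  # groups nearby points together
bad_labels = np.array([0, 1, 0, 1])   # groups distant points together

print(inertia(points, good_labels), inertia(points, bad_labels))
```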

Interaction with the environment builds a reward function

Machine Learning

reinforcement

The goal of the agent is to maximize a cumulative reward signal over time

The objective is not to predict a specific output but to learn a policy or strategy that maximizes the cumulative reward over time.

Supervised Learning tasks

regression ------ classification

Target Variable: CONTINUOUS

(age, income, temperature...)

Target Variable: Categorical

(color, shape, income class...)

The definition of a loss function requires the definition of distance or similarity

Machine Learning

Minkowski distance

Jaccard similarity

Great circle distance

[Figure: Venn diagram of sets A and B with their intersection A \cap B]
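The three measures named above can be sketched in a few lines. The function names, the default exponent `p=2`, and the Earth radius of 6371 km are assumptions for illustration; the great circle distance is computed here with the haversine formula, one common way to do it.

```python
import math


def minkowski_distance(a, b, p=2):
    """Minkowski distance of order p (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def jaccard_similarity(a, b):
    """Size of the intersection of two sets divided by the size of their union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty sets are identical
    return len(a & b) / len(union)


def great_circle_distance(lat1, lon1, lat2, lon2, radius=6371.0):
    """Distance along a sphere between two points given in degrees (haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, h)))


print(minkowski_distance((0, 0), (3, 4)))
print(jaccard_similarity({1, 2, 3}, {2, 3, 4}))
print(great_circle_distance(0.0, 0.0, 0.0, 180.0))
```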

Deep Learning

0

recap

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

[Figure: perceptron diagram with inputs x1, x2, weights w11, w12, w13, w21, w22, w23, and biases b1, b2, b3]

multilayer perceptron

w: weight

sets the sensitivity of a neuron

b: bias:

up- or down-weights a neuron

multilayer perceptron

layer of perceptrons

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3

w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

[Figure: inputs x_1, x_2, x_3 feeding a layer of four perceptrons with biases b_1, b_2, b_3, b_4, and one output]

w: weight

sets the sensitivity of a neuron

b: bias:

up- or down-weights a neuron

f: activation function:

turns neurons on-off

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})
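A minimal sketch of one layer of four perceptrons acting on three inputs, as in the four weighted sums above. The numerical values of the weights and biases are made up, and relu is used as the activation function only as an example; row i of `W` holds the weights w_{i1}, w_{i2}, w_{i3} of neuron i.

```python
import numpy as np


def relu(z):
    """Activation function: passes positive values and turns negative ones off."""
    return np.maximum(z, 0.0)


x = np.array([1.0, 2.0, 3.0])  # inputs x_1, x_2, x_3

# hypothetical weights: one row per neuron, one column per input
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [-1.0, -1.0, -1.0]])
b = np.array([0.5, -0.5, 0.0, 1.0])  # hypothetical biases b_1 ... b_4

z = W @ x + b  # the four weighted sums w_i1 x_1 + w_i2 x_2 + w_i3 x_3 + b_i
a = relu(z)    # the activation switches the fourth neuron off

print(z)
print(a)
```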

multilayer perceptron

http://neuralnetworksanddeeplearning.com/chap4.html

layer connectivity

[Figure: network with an input layer (x_1, x_2, x_3), a hidden layer (biases b_1 to b_4), and an output layer]

Fully connected: all nodes connect to all nodes of the next layer.

Sparsely connected: not all nodes connect to all nodes of the next layer.

The last layer is always connected

layer connectivity

how does it relate to matrix multiplication

each layer is a matrix

Except this is a very misleading representation

there are no biases or activation functions

each layer should be a different shape

[Figure: chain of matrices with shapes 1x3, 3x5, 5x2, 2x1]

what we are doing is just a series of matrix multiplications.

DeepNeuralNetwork

what we are doing is exactly a series of matrix multiplications.

[Figure: weight matrices with shapes 3x5, 5x2, 2x1]

(((\vec{x} \cdot W_1) \cdot W_2) \cdot W_3)~=~y

f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y

\phi(\vec{x}) ~\sim~f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y
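The chain of matrix multiplications can be sketched with random weights of the shapes used above (3x5, 5x2, 2x1) acting on a 1x3 input. The random values, the choice of relu for f^{(1)} and f^{(2)}, and an identity f^{(3)} are assumptions for illustration. The sketch also shows why biases and activation functions matter: without them the three matrices collapse into a single 3x1 matrix, so the network can only represent a linear function.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=(1, 3))                                  # one observation, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=(1, 5))
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=(1, 2))
W3, b3 = rng.normal(size=(2, 1)), rng.normal(size=(1, 1))


def relu(z):
    return np.maximum(z, 0.0)


# matrix multiplications only: equivalent to one single linear map
y_linear = ((x @ W1) @ W2) @ W3
W_collapsed = W1 @ W2 @ W3            # shape 3x1
y_collapsed = x @ W_collapsed

# with biases and activation functions (f3 taken to be the identity here)
y = relu(relu(x @ W1 + b1) @ W2 + b2) @ W3 + b3

print(y_linear.shape, W_collapsed.shape, y.shape)
```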

DeepNeuralNetwork

The purpose is to approximate a function φ

y = φ(x)

which (in general) is not linear, using linear operations

\phi(\vec{x}) ~\sim~f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y

http://neuralnetworksanddeeplearning.com/chap4.html

[Figure: network with an input layer, two hidden layers, and an output layer]

32 parameters and

?? hyperparameters

activation functions -

loss function - 1

optimization method - 1

architecture - M

how many hyperparameters?

Parameters and hyperparameters

\sum_{l=1}^N N_{n_l}

\vec{y} = f_N(\ldots f_1(\vec{x} \cdot W_1 + \vec{b_1}) \ldots \cdot W_N + \vec{b_N})

Training models with this many parameters requires a lot of care:

. defining the metric

. optimization schemes

. training/validation/testing sets

But just like in our simple linear regression case, learning works because, for the right activation functions, small changes in the parameters lead to small changes in the output.

define a cost function, e.g.

C=\frac{1}{2}|y-a^L|^2~=~\frac{1}{2}\sum_j(y_j-a^L_j)^2
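The quadratic cost C = 1/2 sum_j (y_j - a^L_j)^2 is a one-liner. In this sketch `y` is the target and `a` stands for the output of the last layer a^L; the numbers are made up.

```python
def quadratic_cost(y, a):
    """Half the sum of squared differences between targets y and outputs a."""
    return 0.5 * sum((yj - aj) ** 2 for yj, aj in zip(y, a))


target = [1.0, 0.0]   # hypothetical target vector y
output = [0.5, 0.5]   # hypothetical output of the last layer a^L

print(quadratic_cost(target, output))
```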

proper care of your DNN

0

NNs are a vast topic and we only have 2 weeks!

Some FREE references!

michael nielsen

better pedagogical approach, more basic, more clear

http://neuralnetworksanddeeplearning.com/index.html

ian goodfellow

mathematical approach, more advanced, unfinished

https://www.deeplearningbook.org/

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

architecture - wide networks tend to overfit, deep networks are hard to train
number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
loss function - needs to be appropriate to the task, e.g. classification vs regression
activation functions - needs to be consistent with the loss function
optimization scheme - needs to be appropriate to the task and data
learning rate in optimization - balance speed and accuracy
batch size - smaller batch size is faster but leads to overtraining
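The "sweet spot" for the number of epochs can be sketched as a simple rule on the validation loss: keep the epoch with the lowest validation loss and stop once it has not improved for a few epochs. The function name, the `patience` parameter, and the loss values below are assumptions for illustration; Keras offers the same idea through its `EarlyStopping` callback.

```python
def best_epoch(val_losses, patience=3):
    """Index of the best epoch, stopping after `patience` epochs without improvement."""
    best_loss, best_idx = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_idx = loss, epoch
        elif epoch - best_idx >= patience:
            break  # validation loss stopped improving: likely overfitting from here on
    return best_idx


# hypothetical validation losses: improve, flatten, then get worse
val_losses = [1.0, 0.6, 0.4, 0.35, 0.36, 0.38, 0.41, 0.45]

print("stop training and keep the weights from epoch", best_epoch(val_losses))
```

A small `patience` stops early but can be fooled by the jumps mentioned above; a large one wastes training time.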

An article that compares various DNNs

https://arxiv.org/pdf/1605.07678.pdf

accuracy comparison

batch size

1

What should I choose for the loss function and how does that relate to the activation function and optimization?

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

always check your loss function! it should go down smoothly and flatten out at the end of the training.

not flat? you are still learning!

too flat? you are overfitting...

loss (gallery of horrors)

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

jumps are not unlikely (and not necessarily a problem) if your activations are discontinuous (e.g. relu)

regularization (e.g. dropout) is applied during training but not during validation, so the validation loss can be smaller than the training loss

loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)

On the interpretability of DNNs

https://distill.pub/2020/circuits/zoom-in/

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

generative AI

2

Applications

Image Generation (and 3D Shape Generation)
Semantic Image-to-Photo Translation
Image Resolution Increase
Text-to-Speech Generator
Speech-to-Speech Conversion
Text Generation (ChatGPT, GPT-3)
Music Generation
Image-to-Image Conversion

GANs

VAE

Diffusion models

VAE

https://github.com/fedhere/MLPNS_FBianco/tree/main/generativeAI

Autoencoders

3

Unsupervised learning with

Neural Networks

What do NN do? approximate complex functions with a series of linear functions

To do that they extract information from the data

Each layer of the DNN produces a representation of the data: a "latent representation".

The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now)

.... so if my layers are smaller what I have is a compact representation of the data

Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

Autoencoder Architecture

Encoder: outputs a lower dimensional representation z of the data x (similar to PCA, tSNE...)
Decoder: Learns how to reconstruct x given z: learns p(x|z)
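Since the encoder is described as similar to PCA, the encode/decode idea can be sketched with PCA itself, without training a network. This is a linear stand-in, not an autoencoder: the synthetic data (5 features that really live on a 2-dimensional plane) and the function names `encode` and `decode` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical data: 200 observations, 5 features, generated from 2 hidden variables
hidden = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
x = hidden @ mixing

# PCA via singular value decomposition of the centered data
mean = x.mean(axis=0)
_, _, vt = np.linalg.svd(x - mean, full_matrices=False)
components = vt[:2]  # the 2 directions that carry the variance, shape 2x5


def encode(x):
    """Map each 5-dimensional observation to a 2-dimensional representation z."""
    return (x - mean) @ components.T


def decode(z):
    """Reconstruct the 5-dimensional observation from z."""
    return z @ components + mean


z = encode(x)
x_reconstructed = decode(z)
reconstruction_error = np.mean((x - x_reconstructed) ** 2)

print(z.shape, reconstruction_error)
```

An autoencoder replaces the two linear maps with neural networks, which lets it learn nonlinear compressions as well.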

Autoencoder Architecture

https://link.springer.com/chapter/10.1007/978-981-13-6661-1_3

Building a DNN

with keras and tensorflow

Trivial to build, but the devil is in the details!

import numpy as np
from keras.models import Sequential
# saved models can be loaded with keras.models.load_model
from keras.layers import Dense

# placeholder data so that the example runs: replace with your own arrays
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X.sum(axis=1)
x_train, y_train = X[:600], y[:600]      # training set
x_val, y_val = X[600:800], y[600:800]    # validation set
x_test, y_test = X[800:], y[800:]        # test set
n_cols = x_train.shape[1]

# create model
model = Sequential()

# create the model architecture by adding model layers
model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))

# need to choose the loss function, metric, optimization scheme
model.compile(optimizer='adam', loss='mean_squared_error')

# need to learn what to look for - always plot the loss function!
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=20, batch_size=100, verbose=1)
# note that the model accepts a validation set: together with a held-out
# test set this gives a three-way split: train - validate - test

# predict on the held-out test set
test_y_predictions = model.predict(x_test)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

[Figure: the model, with the encoder, the decoder, and the bottleneck highlighted in turn]

This autoencoder model has a 64-neuron bottleneck. This means that layer will generate a compressed representation of the data which is 64-dimensional (the original size is 784 pixels)

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

This simple model has 200K parameters!
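A quick way to see where a number like 200K comes from: a fully connected layer with n_in inputs and n_out neurons has n_in * n_out weights plus n_out biases. The layer sizes below (784, 128, 64, 128, 784) are an assumption chosen to match the 784-pixel input and the 64-neuron bottleneck; the actual architecture is in the linked notebook.

```python
def count_dense_parameters(layer_sizes):
    """Weights plus biases of a fully connected network with the given layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))


# hypothetical autoencoder: 784 pixels -> 128 -> 64 (bottleneck) -> 128 -> 784 pixels
autoencoder_layers = [784, 128, 64, 128, 784]
print(count_dense_parameters(autoencoder_layers))

# the small 3 -> 5 -> 2 -> 1 network, for comparison
print(count_dense_parameters([3, 5, 2, 1]))
```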

My original choice is to train it with "adadelta" and a mean squared error loss function. All hidden-layer activation functions are relu and the output layer is linear, which is appropriate for a regression

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss	good for	activation last layer	size last layer
mean_squared_error	regression	linear	one node
mean_absolute_error	regression	linear	one node
mean_squared_logarithmic_error	regression	linear	one node
binary_crossentropy	binary classification	sigmoid	one node
categorical_crossentropy	multiclass classification	softmax	N nodes
kl_divergence	multiclass classification, probabilistic interpretation	softmax	N nodes
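The pairing of loss and last-layer activation in the table can be made concrete with plain Python. This is a sketch of the formulas for a single observation, not the Keras implementation. It assumes the usual pairing of a sigmoid output with binary cross entropy (one node) and of a softmax output with categorical cross entropy (N nodes for N mutually exclusive classes).

```python
import math


def sigmoid(z):
    """Squashes one number into (0, 1): a probability for binary classification."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(zs):
    """Turns N numbers into N probabilities that sum to 1."""
    m = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def binary_crossentropy(y_true, p):
    """Loss for one observation with true label 0 or 1 and predicted probability p."""
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))


def categorical_crossentropy(y_onehot, probs):
    """Loss for one observation with a one-hot true label and N predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(y_onehot, probs))


print(binary_crossentropy(1, sigmoid(0.0)))
print(categorical_crossentropy([0, 1, 0], softmax([1.0, 2.0, 3.0])))
```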

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

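autoencoder for image reconstruction

# the three models are built in the notebook linked below;
# only the last layer and the compile step differ

# 1. linear output layer, mean squared error loss
model_digits64.add(Dense(ndim, 
                         activation='linear'))
model_digits64.compile(optimizer="adadelta", 
                       loss="mean_squared_error")

# 2. sigmoid output layer, mean squared error loss
model_digits64_sig.add(Dense(ndim, 
                             activation='sigmoid'))
model_digits64_sig.compile(optimizer="adadelta", 
                           loss="mean_squared_error")

# 3. sigmoid output layer, binary cross entropy loss
model_digits64_bce.add(Dense(ndim, 
                             activation='sigmoid'))
model_digits64_bce.compile(optimizer="adadelta", 
                           loss="binary_crossentropy")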

loss function: did not finish learning, it is still decreasing rapidly

The predictions are far too detailed. While the input is not binary, it does not have a lot of detail. Maybe approaching it as a binary problem (with a sigmoid and a binary cross entropy loss) will give better results

loss function: also did not finish learning, it is still decreasing rapidly

A sigmoid activation gives a much better result!

Binary cross entropy loss function: it is more appropriate when the output layer is sigmoid

Even better results!

https://github.com/fedhere/MLTSA_FBianco/blob/master/autoencode_digits.ipynb

[Figure: three pairs of original and predicted images]

autoencoder for image reconstruction

A more ambitious model has a 16-neuron bottleneck: we are trying to extract 16 numbers to reconstruct the entire image! It's pretty remarkable! Those 16 numbers are features extracted from the data