federica bianco
astro | data science | data for good
Neural Networks
Fall 2025
dr. federica bianco
@fedhere
this slide deck:
0
Data driven models for exploration of structure, prediction that learn parameters from data.
unsupervised ------ supervised
set up: All features known for all observations
Goal: explore structure in the data
- data compression
- understanding structure
Algorithms: Clustering, (...)
Data driven models for exploration of structure, prediction that learn parameters from data.
unsupervised ------ supervised
set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data
Goal: predicting missing feature
- classification
- regression
Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)
unsupervised ------ supervised
set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data
Goal: predicting missing feature
- classification
- regression
Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)
unsupervised ------ supervised
set up: All features known for all observations
Goal: explore structure in the data
- data compression
- understanding structure
Algorithms: k-means clustering, agglomerative clustering, density based clustering, (...)
model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or a target function and trying to maximize it)
e.g.
L1 = |target - prediction|
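For concreteness, a minimal numpy sketch (toy arrays, my own choice of numbers) of the L1 loss above and its L2 counterpart:
import numpy as np

target = np.array([1.0, 2.0, 3.0])          # ground truth (toy values)
prediction = np.array([0.8, 2.5, 2.9])      # model output (toy values)

L1 = np.sum(np.abs(target - prediction))    # L1 = |target - prediction|, summed over data points
L2 = np.sum((target - prediction) ** 2)     # L2 = (target - prediction)^2, summed over data points
print(L1, L2)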
Learning relies on the definition of a loss function
learning type | loss / target |
---|---|
unsupervised | intra-cluster variance / inter-cluster distance |
supervised | distance between prediction and truth |
The definition of a loss function requires the definition of distance or similarity
Minkowski distance
Jaccard similarity
Dynamic Time Warping
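A minimal sketch (my own toy implementations) of two of these definitions; Dynamic Time Warping usually comes from a dedicated time-series package (e.g. tslearn), so it is only mentioned in a comment.
import numpy as np

def minkowski(u, v, p=2):
    # Minkowski distance: p=2 is the Euclidean distance, p=1 the Manhattan distance
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

def jaccard_similarity(a, b):
    # Jaccard similarity between two sets: |intersection| / |union|
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(minkowski(np.array([0.0, 0.0]), np.array([3.0, 4.0]), p=2))   # 5.0
print(jaccard_similarity({"a", "b", "c"}, {"b", "c", "d"}))         # 0.5
# DTW aligns two time series before measuring distance; use a dedicated package for it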
Don't confuse the distance definition with the model!!
Distances: Minkowski, Euclidean, DTW, ...
Models: k-means clustering, SVM, kNN, RF, GBT, NN
Feature extraction methods == dimensionality reduction:
Don't confuse feature extraction methods with the model, though sometimes models can be used for feature extraction...
PCA, ICA, clustering, autoencoders...
Finding a lower dimensional representation of your data that still preserves similarity and "distance".
In time series the dimensionality is generally the number of data points, which is generally very large.
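A minimal scikit-learn sketch (toy random data; scikit-learn assumed) of feature extraction by PCA: it finds a lower dimensional representation of the data that preserves as much variance as possible.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))            # 100 observations, 50 features (toy data)

pca = PCA(n_components=2)                 # keep the 2 directions of largest variance
X_2d = pca.fit_transform(X)               # lower dimensional representation of the data
print(X_2d.shape)                         # (100, 2)
print(pca.explained_variance_ratio_)      # fraction of the variance captured by each component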
Dimensionality reduction:
Distance definitions
Ding et al. 2008
1
dimensionality reduction techniques are useful both for data preparation (to fight the Curse of Dimensionality) and to enable visualizations of large datasets.
In the latter case we project a dataset that exists in a high (N)-dimensional space into a lower dimensional (typically 2D) projection, with the goal of preserving distances between objects... but that is really hard!
Proper dimensionality reduction
- LDA ( Linear Discriminant Analysis )
- PCA ( Principal Component Analysis )
- ICA ( Independent Component Analysis )
- Clustering (any method; represent data by their cluster)
- Using the latent space of an Autoencoder
Visualization techniques
- SNE ( Stochastic Neighbor Embedding )
- t-SNE ( t-distributed SNE )
- UMAP ( Uniform Manifold Approximation and Projection )
I would say that they are considered primarily visualization techniques, rather than clustering methods or preprocessing tools, because they are very sensitive to the choice of hyperparameters.
The t-distributed Stochastic Neighbor Embedding (SNE) method, or t-SNE, was introduced in van der Maaten & Hinton 2008. SNE works by encoding multidimensional Euclidean distances as conditional probabilities, which represent the similarities between datapoints. In other words, suppose we have a data point x_i in the high dimensional space. Then consider a normal distribution of distances from x_i, wherein points near x_i have a higher probability density under the distribution and farther points have a lower probability density. The similarity between x_i and another data point x_i' is then the conditional probability P(x_i' | x_i) that x_i would choose x_i' as its neighbor under the normal distribution just described.
We then replicate the process for the lower dimensional space, which gives another set of conditional probabilities. SNE then minimizes the Kullback-Leibler (KL) divergence, or relative entropy (Kullback & Leibler 1951), between the two probability distributions using gradient descent.
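A minimal sketch (assuming the scikit-learn implementation; toy random data) of running t-SNE; note the perplexity hyperparameter, whose choice strongly affects the embedding - part of why these methods are best treated as visualization tools.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))            # toy high dimensional data

# perplexity is the key hyperparameter: the embedding can change a lot with it
X_emb = TSNE(n_components=2, perplexity=30.0, init="pca",
             random_state=0).fit_transform(X)
print(X_emb.shape)                        # (200, 2)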
Neural Networks
2
NNs are a vast topic and we only have 2 weeks!
Some FREE references!
michael nielsen
better pedagogical approach: more basic, clearer
ian goodfellow
mathematical approach: more advanced, unfinished
Neural Networks
2.1
origins
deep
time-domain NN
1943
M-P Neuron
M-P Neuron McCulloch & Pitts 1943
it's a classifier
(diagram: binary inputs x1 ... xn are summed; the neuron outputs 1 if the sum reaches a threshold, 0 otherwise)
what value of the threshold corresponds to logical AND?
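A minimal sketch (my own toy implementation) of the M-P neuron as a threshold unit: with three binary inputs, a threshold equal to the number of inputs implements AND, while a threshold of 1 implements OR.
import numpy as np

def mp_neuron(x, theta):
    # McCulloch-Pitts neuron: fire (output 1) if the sum of the binary inputs reaches the threshold
    return int(np.sum(x) >= theta)

x = np.array([1, 1, 1])
print(mp_neuron(x, theta=3))    # AND of three inputs: threshold = number of inputs
print(mp_neuron(x, theta=1))    # OR of three inputs: threshold = 1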
1958
Perceptron
The perceptron algorithm: 1958, Frank Rosenblatt
(diagram: inputs x1 ... xn, weights, bias, output - the weighted sum is analogous to a linear regression; the error drives the weight updates)
Perceptrons are linear classifiers: they make predictions based on a linear predictor function
combining a set of weights (= parameters) with the feature vector.
The perceptron algorithm: 1958, Frank Rosenblatt
(diagram: the perceptron's weighted sum of the features defines a linear decision boundary in the x-y feature plane)
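A minimal sketch of a perceptron as a linear classifier (the toy data, learning rate, and number of epochs are my own choices): a weighted sum of the features plus a bias, thresholded at zero, trained with Rosenblatt's error-driven update.
import numpy as np

def predict(x, w, b):
    # linear predictor function: weighted sum of the features plus a bias, then a threshold
    return 1 if np.dot(w, x) + b > 0 else 0

# toy training set: logical OR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 1])

w, b, eta = np.zeros(2), 0.0, 0.1
for epoch in range(10):
    for xi, yi in zip(X, y):
        error = yi - predict(xi, w, b)    # 0 if correct, +1 or -1 if wrong
        w = w + eta * error * xi          # Rosenblatt's update: move the boundary toward the mistake
        b = b + eta * error

print([predict(xi, w, b) for xi in X])    # should reproduce [0, 1, 1, 1]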
The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.
The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration
July 8, 1958
1960
ADALINE: 1960, Bernard Widrow and Ted Hoff
ADALINE introduces a continuous function before the binary output - this generates a probabilistic classifier and provides an opportunity for refining the learning process.
(diagram: inputs, weights, bias, a continuous activation function - e.g. a sigmoid - and the output; the error is computed on the continuous activation rather than on the thresholded output; compare with the perceptron)
Weight change = (pre-weight line value) x (error / number of inputs), i.e. each weight changes in proportion to the input on its line and to the output error.
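A minimal sketch of the Widrow-Hoff (delta) rule on toy data (learning rate, data, and number of epochs are my own choices): the error is computed on the continuous linear output, and each weight changes in proportion to the value on its input line and to the error.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))             # toy inputs
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3  # toy linear target (true bias = 0.3)

w, b, eta = np.zeros(3), 0.0, 0.01
for epoch in range(50):
    for xi, yi in zip(X, y):
        output = np.dot(w, xi) + b        # continuous (linear) output, no threshold while learning
        error = yi - output
        w = w + eta * error * xi          # delta rule: weight change proportional to input x error
        b = b + eta * error

print(np.round(w, 2), round(b, 2))        # should end up close to [1.0, -2.0, 0.5] and 0.3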
3
1943
AND
OR
XOR
M-P Neuron McCulloch & Pitts 1943
if x1 and x2 and x3
(diagram: a layer of perceptrons combining into an output; input layer, hidden layer, output layer)
1970: multilayer perceptron architecture
Fully connected: all nodes go to all nodes of the next layer.
(diagram: layers of perceptrons feeding into the output)
Fully connected: all nodes go to all nodes of the next layer.
w: weight - sets the sensitivity of a neuron
b: bias - up-down weights a neuron
each perceptron is a multilinear regression: what we are doing is exactly a series of matrix multiplications (see the sketch below).
ADALINE and MADALINE 1962 - B. Widrow & M. Hoff
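A minimal numpy sketch of that statement (layer sizes and the sigmoid activation are toy choices): a forward pass through a fully connected network is just, for each layer, a matrix multiplication plus a bias vector, wrapped in an activation function.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                           # one input vector with 3 features

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)    # layer 1: 3 features -> 4 neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)    # layer 2: 4 neurons -> 1 output

h = sigmoid(W1 @ x + b1)                         # each layer: weights @ input + bias, then activation
output = sigmoid(W2 @ h + b2)
print(output)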
1986: Deep Neural Nets
w: weight - sets the sensitivity of a neuron
b: bias - up-down weights a neuron
f: activation function - turns neurons on-off
In a CNN these layers would not be fully connected, except the last one.
hyperparameters of DNN
4
(diagram: input layer with 3 nodes, two hidden layers with 4 and 3 nodes, output layer with 1 node)
how many hyperparameters?
Weights and Biases, layer by layer:
3 x 4 + 4
4 x 3 + 3
3 x 1 + 1
(35 trainable parameters in total)
GREEN: architecture hyperparameters
RED: training parameters (the weights and biases)
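As a quick check, a minimal keras sketch (same Sequential API used later in this deck) of the 3 -> 4 -> 3 -> 1 network counted above; model.summary() should report 16 + 15 + 4 = 35 trainable parameters.
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(4, activation='relu', input_shape=(3,)))   # 3 x 4 weights + 4 biases = 16
model.add(Dense(3, activation='relu'))                     # 4 x 3 weights + 3 biases = 15
model.add(Dense(1))                                        # 3 x 1 weights + 1 bias   = 4
model.summary()                                            # 35 trainable parameters in total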
Seminal paper
Y. LeCun 1998
5
Training models with this many parameters requires a lot of care:
- defining the metric
- optimization schemes
- training/validation/testing sets
But just like in our simple linear regression case, for the right activation functions small changes in the parameters lead to small changes in the output - and that is what makes the training possible.
define a cost function, e.g. the L2 error between prediction and target
(diagram: a small network with inputs x1, x2, weights w11 ... w23, and biases b1, b2, b3, b)
Lots of parameters and lots of hyperparameters! What to choose?
cheatsheet
Pretrained DNNs performance comparison
accuracy comparison: an article that compares various DNNs
batch size
always check your loss function! it should go down smoothly and flatten out at the end of the training.
not flat? you are still learning!
too flat? you are overfitting...
loss (gallery of horrors)
jumps are not unlikely (and not necessarily a problem) if your activations are not smooth (e.g. relu, whose derivative is discontinuous)
if you use regularization (e.g. dropout), it is applied during training but not during validation, so the validation loss can be smaller than the training loss
loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)
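A minimal sketch (toy data and a toy model, my own choices) of how to plot the training and validation loss per epoch from the History object that keras' fit returns:
import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import Dense

# toy regression data, just to have something to train on
rng = np.random.default_rng(0)
x_train, y_train = rng.normal(size=(500, 3)), rng.normal(size=500)
x_val, y_val = rng.normal(size=(100, 3)), rng.normal(size=100)

model = Sequential()
model.add(Dense(10, activation='relu', input_shape=(3,)))
model.add(Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=20, batch_size=100, verbose=0)

# the History object stores the loss per epoch: always plot it!
plt.plot(history.history['loss'], label='training loss')
plt.plot(history.history['val_loss'], label='validation loss')
plt.xlabel('epoch')
plt.ylabel('loss')
plt.legend()
plt.show()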
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
What should I choose for the loss function and how does that relate to the activation function and optimization?
loss | good for | activation last layer | size last layer |
---|---|---|---|
mean_squared_error | regression | linear | one node |
mean_absolute_error | regression | linear | one node |
mean_squared_logarithmic_error | regression | linear | one node |
binary_crossentropy | binary classification | sigmoid | one node |
categorical_crossentropy | multiclass classification | softmax | N nodes |
kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes |
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
in this notebook above I experiment with combinations of these choices
Lots of parameters and lots of hyperparameters! What to choose?
training DNN
6
Any linear model: y = intercept + slope * x
y : prediction
ytrue : target
Error: e.g. L2 = (ytrue - y)^2, summed over the data points
(diagram: the L2 error surface as a function of the two parameters, intercept and slope)
Find the best parameters by finding the minimum of the L2 hypersurface:
at every step look around and choose the best direction - this is gradient descent (a minimal sketch for the simple linear case follows below).
how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??
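Before going to a full network, a minimal numpy sketch of gradient descent for the linear model above (toy data, learning rate, and number of steps are my own choices): compute the partial derivatives of the L2 error with respect to slope and intercept and step downhill.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
ytrue = 2.0 * x + 1.0 + 0.1 * rng.normal(size=200)    # toy data: true slope 2, true intercept 1

slope, intercept, eta = 0.0, 0.0, 0.1
for step in range(500):
    y = slope * x + intercept                         # prediction
    grad_slope = -2 * np.mean((ytrue - y) * x)        # partial derivative of L2 w.r.t. the slope
    grad_intercept = -2 * np.mean(ytrue - y)          # partial derivative of L2 w.r.t. the intercept
    slope -= eta * grad_slope                         # step in the direction that decreases L2
    intercept -= eta * grad_intercept

print(round(slope, 2), round(intercept, 2))           # should end up close to 2.0 and 1.0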
Training models with this many parameters requires a lot of care:
- defining the metric
- optimization schemes
- training/validation/testing sets
But just like in our simple linear regression case, for the right activation functions small changes in the parameters lead to small changes in the output - and that is what makes the training possible.
define a cost function, e.g. the L2 error between prediction and target
Training a DNN
feed data forward through network and calculate cost metric
for each layer, calculate effect of small changes on next layer
how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??
think of applying the gradient to a function of a function of a function... use:
1) partial derivatives, 2) the chain rule
define a cost function, e.g. the L2 error between prediction and target
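A minimal numpy sketch (toy data, sigmoid activations, L2 cost - all my own choices) of this chain-rule bookkeeping, i.e. backpropagation, for a network with one hidden layer: the forward pass computes the cost, the backward pass applies partial derivatives and the chain rule layer by layer to get the gradient of every weight and bias.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x, ytrue = rng.normal(size=3), 1.0                 # one toy data point and its target

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)      # 3 inputs -> 4 hidden neurons
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)      # 4 hidden neurons -> 1 output

# forward pass: feed the data through the network and compute the cost
z1 = W1 @ x + b1
a1 = sigmoid(z1)
z2 = W2 @ a1 + b2
y = sigmoid(z2)
cost = (ytrue - y) ** 2

# backward pass: chain rule, layer by layer
dcost_dy = -2 * (ytrue - y)                        # derivative of the cost w.r.t. the output
delta2 = dcost_dy * y * (1 - y)                    # ... w.r.t. z2 (sigmoid' = y * (1 - y))
dW2, db2 = np.outer(delta2, a1), delta2            # gradients of the last layer's weights and bias
delta1 = (W2.T @ delta2) * a1 * (1 - a1)           # propagate back through W2 and the sigmoid
dW1, db1 = np.outer(delta1, x), delta1             # gradients of the first layer's weights and bias
# a gradient descent step would then be W2 -= eta * dW2, b2 -= eta * db2, and so on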
Autoencoders
7
Unsupervised learning with
Neural Networks
What do NNs do? approximate complex functions with a series of linear functions (combined through nonlinear activations).
To do that they extract information from the data.
Each layer of the DNN produces a representation of the data: a "latent representation".
The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now).
.... so if my layers are smaller, what I have is a compact representation of the data.
Autoencoder Architecture
Feed Forward DNN:
the size of the input is 5,
the size of the last layer is 2
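A minimal keras sketch of this cartoon (input of size 5, a 2-node bottleneck; the intermediate layer sizes and activations are my own toy choices), mirroring the encoder to build the decoder:
from keras.models import Sequential
from keras.layers import Dense

autoencoder = Sequential()
autoencoder.add(Dense(4, activation='relu', input_shape=(5,)))   # encoder
autoencoder.add(Dense(2, activation='relu'))                     # bottleneck: 2-dimensional latent space
autoencoder.add(Dense(4, activation='relu'))                     # decoder
autoencoder.add(Dense(5, activation='linear'))                   # reconstruct the 5 input features
autoencoder.compile(optimizer='adam', loss='mean_squared_error')
autoencoder.summary()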
Building a DNN
with keras and tensorflow
Trivial to build, but the devil is in the details!
from keras.models import Sequential
# saved or pretrained models can be loaded with keras.models.load_model
from keras.layers import Dense, Conv2D, MaxPooling2D
# create the model
model = Sequential()
# create the model architecture by adding model layers (n_cols = number of input features)
model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))
# choose the loss function, the metric, and the optimization scheme
model.compile(optimizer='adam', loss='mean_squared_error')
# always plot the loss function to learn what to look for!
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=20, batch_size=100, verbose=1)
# fit accepts a validation set: this allows a 3-way split (train-validate-test)
# predict on the held-out test set
test_y_predictions = model.predict(x_test)
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels).
(diagram: the encoder compresses the 784-pixel input down to the 64-neuron bottleneck; the decoder reconstructs the image from the bottleneck)
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
This simple model has ~200,000 parameters!
My original choice is to train it with "adadelta" and a mean squared error loss function; all activation functions are relu, appropriate for a regression.
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
What should I choose for the loss function and how does that relate to the activation function and optimization?
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
What should I choose for the loss function and how does that relate to the activation function and optimization?
loss | good for | activation last layer | size last layer |
---|---|---|---|
mean_squared_error | regression | linear | one node |
mean_absolute_error | regression | linear | one node |
mean_squared_logarithmic_error | regression | linear | one node |
binary_crossentropy | binary classification | sigmoid | one node |
categorical_crossentropy | multiclass classification | softmax | N nodes |
kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes |
autoencoder for image reconstruction
# variant 1: linear activation in the output layer + mean squared error loss
model_digits64.add(Dense(ndim, activation='linear'))
model_digits64.compile(optimizer="adadelta", loss="mean_squared_error")
# variant 2: sigmoid activation in the output layer + mean squared error loss
model_digits64_sig.add(Dense(ndim, activation='sigmoid'))
model_digits64_sig.compile(optimizer="adadelta", loss="mean_squared_error")
# variant 3: sigmoid activation in the output layer + binary cross-entropy loss
model_digits64_bce.add(Dense(ndim, activation='sigmoid'))
model_digits64_bce.compile(optimizer="adadelta", loss="binary_crossentropy")
loss function: did not finish learning, it is still decreasing rapidly
The predictions are far too detailed. While the input is not binary, it does not have a lot of details. Maybe approaching it as a binary problem (with a sigmoid and a binary cross entropy loss) will give better results
A sigmoid activation gives a much better result!
Binary cross entropy loss function: it is more appropriate when the output layer is sigmoid
Even better results!
(figure: original vs predicted digits for each choice of activation and loss)
autoencoder for image reconstruction
A more ambitious model has a 16-neuron bottleneck: we are trying to extract 16 numbers to reconstruct the entire image! it's pretty remarkable! those 16 numbers are extracted features from the data
(figure: original and predicted digits, and the 16-dimensional latent representation)
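A sketch of how those 16 features can be read out (the model and layer names are placeholders, and the keras functional API is assumed): build an encoder model that maps the input to the bottleneck layer and call predict on it.
from keras.models import Model

# 'autoencoder' is a trained model and 'bottleneck' is the name given to its 16-neuron layer
# (both names are placeholders for whatever your own model uses)
encoder = Model(inputs=autoencoder.input,
                outputs=autoencoder.get_layer('bottleneck').output)
latent = encoder.predict(x_test)     # shape: (n_images, 16) - the extracted features
print(latent.shape)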
Neural Networks and Deep Learning
an excellent and free book on NN and DL
http://neuralnetworksanddeeplearning.com/index.html
History of NN
https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html
Rosenblatt’s perceptron
https://towardsdatascience.com/rosenblatts-perceptron-the-very-first-neural-network-37a3ec09038a
By federica bianco
neural networks