Machine Learning for

Time Series Analysis IX

Neural Networks

Fall 2025
dr. federica bianco 

 

@fedhere

Recap

0

MLTSA:

 

MLTSA:

Machine Learning

Data-driven models that learn their parameters from the data, for exploring structure and for prediction.

unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

Algorithms: Clustering, (...)


MLTSA:

Machine Learning

Data-driven models that learn their parameters from the data, for exploring structure and for prediction.

unsupervised ------ supervised

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting the missing feature

-  classification

- regression

Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)


MLTSA:

Machine Learning

unsupervised ------ supervised

set up: All features known for a subset of the data; one feature cannot be observed for the rest of the data

Goal: predicting the missing feature

-  classification

- regression

Algorithms: regression, SVM, tree methods, k-nearest neighbors, neural networks, (...)

unsupervised ------ supervised

set up: All features known for all observations

Goal: explore structure in the data

- data compression

- understanding structure

Algorithms: k-means clustering, agglomerative clustering, density-based clustering, (...)

MLTSA:

Machine Learning

model parameters are learned by calculating a loss function for different parameter sets and trying to minimize the loss (or a target function and trying to maximize it)

e.g.

L1  = |target - prediction|

Learning relies on the definition of a loss function

MLTSA:

Machine Learning

Learning relies on the definition of a loss function

learning type | loss / target
unsupervised | intra-cluster variance / inter-cluster distance
supervised | distance between prediction and truth

MLTSA:

Machine Learning

The definition of a loss function requires the definition of distance or similarity

MLTSA:

Machine Learning

 

Minkowski distance

 

 

Jaccard similarity

Dynamic Time Warping

[Venn diagram: Jaccard similarity = |A ∩ B| / |A ∪ B|]
The definition of a loss function requires the definition of distance or similarity
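Not in the slides: a minimal Python sketch of these three notions of distance/similarity (scipy's Minkowski distance, a hand-rolled Jaccard similarity, and a naive DTW dynamic program); the arrays and sets are illustrative.

import numpy as np
from scipy.spatial.distance import minkowski

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

# Minkowski distance of order p (p=2 recovers the Euclidean distance)
d_mink = minkowski(a, b, p=2)

# Jaccard similarity between two sets: |A ∩ B| / |A ∪ B|
A, B = {1, 2, 3}, {2, 3, 4}
jaccard = len(A & B) / len(A | B)

# Dynamic Time Warping: naive O(n*m) dynamic program with squared-error cost
def dtw(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

print(d_mink, jaccard, dtw(a, b))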

MLTSA:

Machine Learning

The definition of a loss function requires the definition of distance or similarity

Don't confuse the distance definition with the model!!

Distances:

Minkowski

Euclidean

DTW

....

Models:

k-means clustering

SVM

kNN

RF

GBT

NN

MLTSA:

Machine Learning

Feature extraction methods == dimensionality reduction:

Don't confuse feature extraction methods with models, though sometimes models can be used for feature extraction...

PCA, ICA, clustering, autoencoders...

Finding a lower dimensional representation of your data that still preserves similarity and "distance"

In time series, the dimensionality is generally the number of data points, which is generally very large
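Not in the slides: a minimal sketch of dimensionality reduction with PCA, treating each time series as one high-dimensional observation; the shapes (100 series of length 500) and the number of components are illustrative.

import numpy as np
from sklearn.decomposition import PCA

# toy dataset: 100 time series, each with 500 time stamps (features)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 500))

# project each 500-dimensional time series onto its first 10 principal components
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (100, 10)
print(pca.explained_variance_ratio_.sum())   # fraction of variance preserved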

MLTSA:

Machine Learning

Feature extraction methods == dimensionality reduction:

Don't confuse feature extraction methods with models, though sometimes models can be used for feature extraction...

Dimensionality reduction:

Distance definitions

Ding et al. 2008

Vis of the week: t-SNE

1

MLTSA:

 

dimensionality reduction techniques are useful both for data preparation (to fight the Curse of Dimensionality) and to enable visualizations of large datasets.

In the latter case we project a dataset that exists in a high (N)-dimensional space onto a lower dimensional (typically 2D) projection with the goal of preserving distances between objects... but that is really hard!

Proper dimensionality reduction

- LDA (Linear Discriminant Analysis)

- PCA (Principal Component Analysis)

- ICA (Independent Component Analysis)

- Clustering (any method, represent data by their cluster)

- Using the latent space of an Autoencoder

 

Visualization techniques

- SNE (Stochastic Neighbor Embedding)

- t-SNE (t-distributed SNE)

- UMAP (Uniform Manifold Approximation and Projection)

 

 

I would say that they are considered primarily visualization techniques, rather than clustering methods or preprocessing tools, because they are very sensitive to the choice of hyperparameters.

The t-distributed Stochastic Neighbor Embedding method, or t-SNE, was introduced in van der Maaten & Hinton 2008. SNE works by converting multidimensional Euclidean distances into conditional probabilities, which represent the similarities between data points. In other words, suppose we have a data point x_i in the high dimensional space. Then consider a normal distribution of distances from x_i, wherein points near x_i have a higher probability density under the distribution and further points have a lower probability density. The similarity between x_i and another data point x_i' is then the conditional probability P(x_i' | x_i) that x_i would choose x_i' as a neighbor under the normal distribution just described.

Then we replicate the process for the lower dimensional space, for which we get another set of conditional probabilities. SNE then attempts to minimize the Kullback-Leibler (KL) divergence, or relative entropy (Kullback 1951), between the two probability distributions using gradient descent.
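Not in the slides: a minimal sketch of a 2D t-SNE projection with scikit-learn; the data and the perplexity value are illustrative, and (as stressed above) the result is sensitive to that choice.

import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 50))      # 300 objects in a 50-dimensional space

# embed into 2D; perplexity roughly sets the size of the neighborhood
# whose pairwise similarities t-SNE tries to preserve
X_2d = TSNE(n_components=2, perplexity=30, init="pca",
            random_state=0).fit_transform(X)

print(X_2d.shape)                   # (300, 2)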

MLTSA:

 

Neural Networks

2

NNs are a vast topic and we only have 2 weeks!

Some FREE references!

 

michael nielsen

better pedagogical approach, more basic, more clear

ian goodfellow

mathematical approach,  more advanced, unfinished


MLTSA:

 

Neural Networks

2.1

origins

deep


time-domain NN

1943

M-P Neuron McCulloch & Pitts 1943


M-P Neuron

1943

it's a classifier

M-P Neuron McCulloch & Pitts 1943

[diagram: three Boolean inputs x_1, x_2, x_3 are summed and compared to a threshold \theta]

\sum_{i=1}^3 x_i

1 ~\mathrm{if} ~\sum_{i=1}^3 x_i \geq \theta ~\mathrm{else}~ 0

if x_i \in \mathrm{Bool}, what value of \theta corresponds to logical AND?
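A minimal code sketch (mine, not from the slides) of the M-P neuron: sum the Boolean inputs and fire if the sum reaches the threshold θ; the θ that gives logical AND is left as the exercise above.

def mp_neuron(x, theta):
    """McCulloch-Pitts neuron: fires (returns 1) if the sum of the
    Boolean inputs reaches the threshold theta, otherwise 0."""
    return 1 if sum(x) >= theta else 0

# with theta = 1 the neuron computes logical OR of three Boolean inputs
print(mp_neuron([0, 1, 0], theta=1))   # 1
print(mp_neuron([0, 0, 0], theta=1))   # 0
# exercise: which theta makes it compute logical AND?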

M-P Neuron McCulloch & Pitts 1943

1943

[diagram: M-P neuron generalized to N inputs x_1, x_2, …, x_N, with a binary output]

M-P Neuron

The perceptron algorithm : 1958, Frank Rosenblatt

1958

Perceptron

[diagram: inputs x_1, x_2, …, x_N with weights w_1, w_2, …, w_N and bias b, thresholded to produce the output]

weights: w_i

bias: b

linear regression: \sum_i w_i x_i + b

output: 1 ~\mathrm{if} ~\sum_{i=1}^N w_i x_i \geq \theta ~\mathrm{else}~ 0


error

Perceptrons are linear classifiers: they make their predictions based on a linear predictor function

combining a set of weights (= parameters) with the feature vector.
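A minimal sketch (mine, not Rosenblatt's algorithm verbatim) of the perceptron decision rule and its error-driven weight update; the toy data and learning rate are illustrative.

import numpy as np

def perceptron_predict(x, w, b):
    # linear predictor: weighted sum of the features plus a bias,
    # thresholded at zero to give a binary class
    return 1 if np.dot(w, x) + b >= 0 else 0

def perceptron_update(x, y_true, w, b, lr=0.1):
    # classic perceptron rule: shift the weights by the prediction error
    error = y_true - perceptron_predict(x, w, b)
    return w + lr * error * x, b + lr * error

# toy usage: learn a linearly separable rule on 2D points
rng = np.random.default_rng(0)
w, b = np.zeros(2), 0.0
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
for epoch in range(10):
    for xi, yi in zip(X, y):
        w, b = perceptron_update(xi, yi, w, b)
print(w, b)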

The perceptron algorithm : 1958, Frank Rosenblatt


1958

y ~= ~\sum_i w_i x_i ~+~ b

[diagram: the same perceptron, read as a linear regression \sum_i w_i x_i + b]

The perceptron algorithm : 1958, Frank Rosenblatt

Perceptron

The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.

The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration

NEW NAVY DEVICE LEARNS BY DOING; Psychologist Shows Embryo of Computer Designed to Read and Grow Wiser

July 8, 1958

ADALINE : 1960 Bernard Widrow and Ted Hoff

1960

y ~= f(~\sum_i w_ix_i ~+~ b)

ADALINE introduces a continuous function before the binary output - this generates a probabilistic classifier and provides an opportunity for refining the learning process

[diagram: ADALINE - inputs x_1, …, x_N, weights w_i, bias b, activation function f, output]

f: activation function

y ~= f(~\sum_i w_i x_i ~+~ b)

error

ADALINE introduces a continuous function before the binary output - this generates a probabilistic classifier and provides an opportunity for refining the learning process

ADALINE : 1960 Bernard Widrow and Ted Hoff

y = \begin{cases} 1 & \mathrm{if}~ \sum_i x_i w_i + b \geq Z \\ 0 & \mathrm{if}~ \sum_i x_i w_i + b < Z \end{cases}

perceptron

[diagram: same architecture as the perceptron, but with an activation function f applied to \sum_i w_i x_i + b]

y ~= f(~\sum_i w_i x_i ~+~ b)

ADALINE introduces a continuous function before the binary output - this generates a probabilistic classifier and provides an opportunity for refining the learning process

ADALINE : 1960 Bernard Widrow and Ted Hoff

sigmoid activation function:

\sigma(z) = \frac{1}{1 + e^{-z}}

y ~= f(~\sum_i w_i x_i ~+~ b)


ADALINE introduces a continuous function before the binary output - this generates a probabilistic classifier and provides an opportunity for refining the learning process

ADALINE : 1960 Bernard Widrow and Ted Hoff

Widrow-Hoff rule

Weight Change = (Pre-Weight line value) x (Error / Number of Inputs).
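A minimal sketch (mine) of an ADALINE-style neuron: a sigmoid applied to the weighted sum, with a Widrow-Hoff / delta-rule style update proportional to the error; the learning rate and toy data are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adaline_step(x, y_true, w, b, lr=0.05):
    # forward pass: a continuous output instead of a hard 0/1
    y = sigmoid(np.dot(w, x) + b)
    # delta rule: weight change proportional to the error times the input
    error = y_true - y
    return w + lr * error * x, b + lr * error

# toy usage on a 3-feature example
rng = np.random.default_rng(1)
w, b = np.zeros(3), 0.0
X = rng.normal(size=(500, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)
for xi, yi in zip(X, y):
    w, b = adaline_step(xi, yi, w, b)
print(w, b)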

Deep Learning

3

MLTSA:

 

1943

AND

OR

XOR

M-P Neuron McCulloch & Pitts 1943

1 ~\mathrm{if} ~\sum_{i=1}^3x_i \geq\theta ~\mathrm{else}~ 0

if x1 and x2 and x3

multilayer perceptron

[diagram: inputs x_1, x_2, x_3 feed a layer of perceptrons with biases b_1, b_2, b_3, b_4, whose outputs feed a single output node with bias b]

layer of perceptrons

multilayer perceptron

[diagram: input layer → hidden layer → output layer]

1970: multilayer perceptron architecture

Fully connected: all nodes go to all nodes of the next layer.

multilayer perceptron

layer of perceptrons

[diagram: each hidden perceptron j receives inputs x_1, x_2, x_3 with weights w_{j1}, w_{j2}, w_{j3} and bias b_j; the hidden outputs feed the output node with bias b]

multilayer perceptron

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1
w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2
w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3
w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4

w: weight - sets the sensitivity of a neuron

b: bias - up-down weights a neuron

multilayer perceptron

Fully connected: all nodes go to all nodes of the next layer.

each perceptron is a multilinear regression:

w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2

what we are doing is exactly a series of matrix multiplications.
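A minimal numpy illustration (not from the slides) of that statement: a fully connected layer of four perceptrons acting on three inputs is one matrix multiplication plus a bias vector.

import numpy as np

x = np.array([0.2, -1.0, 0.5])                       # inputs x_1, x_2, x_3
W = np.random.default_rng(0).normal(size=(4, 3))     # weights w_{ji}
b = np.zeros(4)                                      # biases b_1 ... b_4

# each entry of W @ x + b is w_{j1}x_1 + w_{j2}x_2 + w_{j3}x_3 + b_j
hidden = W @ x + b
print(hidden.shape)                                  # (4,) - one value per perceptron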

ADALINE and MADALINE 1962 - B. Widrow & M. Hoff

MADALINE

[diagram: fully connected layer; each perceptron is followed by an activation function f]

f(w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2)

multilayer perceptron

Fully connected: all nodes go to all nodes of the next layer.

layer of perceptrons:

f(w_{11}x_1 + w_{12}x_2 + w_{13}x_3 + b_1)
f(w_{21}x_1 + w_{22}x_2 + w_{23}x_3 + b_2)
f(w_{31}x_1 + w_{32}x_2 + w_{33}x_3 + b_3)
f(w_{41}x_1 + w_{42}x_2 + w_{43}x_3 + b_4)

w: weight - sets the sensitivity of a neuron

b: bias - up-down weights a neuron

f: activation function - turns neurons on-off

EXERCISE

  • try at least the concentric input (top left) and the spiral input (bottom right)
  • change the activation function and see what happens - what activation goes best with which input?
  • change the input features (but keep it simple)
  • summarize what key differences arise with each choice you make (we will discuss in class)

deep neural net

Fully connected: all nodes go to all nodes of the next layer.

1986: Deep Neural Nets

\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

f: activation function - turns neurons on-off

w: weight - sets the sensitivity of a neuron

b: bias - up-down weights a neuron

In a CNN these layers would not be fully connected except the last one

 

MLTSA:

 

hyperparameters of DNN

4

[diagram: input layer → hidden layer → hidden layer → output layer]

how many hyperparameters?

EXERCISE

Weights + Biases:

3 x 4 + 4

4 x 3 + 3

3 x 1 + 1
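A quick check of these counts (assuming tensorflow.keras): a 3-4-3-1 fully connected network should report 16, 15, and 4 parameters per layer in model.summary().

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

model = Sequential([
    Input(shape=(3,)),        # 3 input features
    Dense(4),                 # 3 x 4 weights + 4 biases = 16 parameters
    Dense(3),                 # 4 x 3 weights + 3 biases = 15 parameters
    Dense(1),                 # 3 x 1 weights + 1 bias   = 4 parameters
])
model.summary()               # total: 35 trainable parameters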

EXERCISE

[diagram: input layer → hidden layer → hidden layer → output layer]

how many hyperparameters?

  1. number of layers - 1
  2. number of neurons/layer - N_l
  3. activation function/layer - N_l
  4. layer connectivity - N_l^{~??}
  5. optimization metric - 1
  6. optimization method - 1
  7. parameters in optimization - M

GREEN: architecture hyperparameters

RED: training parameters

 

Seminal paper 

Y. LeCun 1998

MLTSA:

 

proper care of your DNN


5

\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

Training models with this many parameters requires a lot of care:

 

. defining the metric

. optimization schemes

. training/validation/testing sets

 

But just like in our simple linear regression case, small changes in the parameters lead to small changes in the output (for the right activation functions), which is what makes gradient-based learning possible.

C = \frac{1}{2}|y - a^L|^2 ~=~ \frac{1}{2}\sum_j (y_j - a^L_j)^2

define a cost function, e.g.

[diagram: small network with inputs x1, x2, weights w11, w12, w13, w21, w22, w23, and biases b1, b2, b3, b]

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

 
  1. architecture - wide networks tend to overfit, deep networks are hard to train

     
  2. number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices (like in all gradient descent)
     
  3. loss function - needs to be appropriate to the task, e.g. classification vs regression
     
  4. activation functions - need to be consistent with the loss function
     
  5. optimization scheme - needs to be appropriate to the task and data
     
  6. learning rate in optimization - balance speed and accuracy
     
  7. batch size - smaller batch size is faster but leads to overtraining

Pretrained DNNs performance comparison

 

accuracy comparison

An article that compares various DNNs

 

batch size

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

 
  • architecture - wide networks tend to overfit, deep networks are hard to train
     
  • number of epochs - the sweet spot is when learning slows down, but before you start overfitting... it may take DAYS! jumps may indicate bad initial choices or too large learning rate
  • loss function - needs to be appropriate to the task, e.g. classification vs regression
     
  • activation functions - need to be consistent with the loss function
     
  • optimization scheme - needs to be appropriate to the task and data
     
  • learning rate in optimization - balance speed and accuracy
     
  • batch size - smaller batch size is faster but leads to overtraining

Lots of parameters and lots of hyperparameters! What to choose?

cheatsheet

 

always check your loss function! it should go down smoothly and flatten out at the end of the training.

not flat? you are still learning!

too flat? you are overfitting...

loss  (gallery of horrors)

jumps are not unlikely (and not necessarily a problem) if your activations are not smooth (e.g. ReLU)

when you train with regularization (e.g. dropout) the regularization is switched off at validation time, so the validation loss can be smaller than the training loss

loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss | good for | activation last layer | size last layer
mean_squared_error | regression | linear | one node
mean_absolute_error | regression | linear | one node
mean_squared_logarithmic_error | regression | linear | one node
binary_crossentropy | binary classification | sigmoid | one node
categorical_crossentropy | multiclass classification | sigmoid | N nodes
Kullback-Leibler divergence | multiclass classification, probabilistic interpretation | sigmoid | N nodes
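A minimal illustration (assuming tensorflow.keras; layer sizes are arbitrary) of pairing the last-layer activation with the loss, following the first and fourth rows of the table.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

# binary classification: one sigmoid output node + binary cross-entropy loss
clf = Sequential([Input(shape=(20,)),
                  Dense(16, activation='relu'),
                  Dense(1, activation='sigmoid')])
clf.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# regression: one linear output node + mean squared error loss
reg = Sequential([Input(shape=(20,)),
                  Dense(16, activation='relu'),
                  Dense(1, activation='linear')])
reg.compile(optimizer='adam', loss='mean_squared_error')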


Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

loss | good for | activation last layer | size last layer
mean_squared_error | regression | linear | one node
mean_absolute_error | regression | linear | one node
mean_squared_logarithmic_error | regression | linear | one node
binary_crossentropy | binary classification | sigmoid | one node
categorical_crossentropy | multiclass classification | sigmoid | N nodes
Kullback-Leibler divergence | multiclass classification, probabilistic interpretation | sigmoid | N nodes

in this notebook above I experiment with combinations of these choices

What should I choose for the loss function and how does that relate to the activation function and optimization?

Lots of parameters and lots of hyperparameters! What to choose?

 

MLTSA:

 

training DNN

6

Any linear model:

\vec{y} = \vec{x}W + b

[diagram: inputs x_1, …, x_N, weights w_1, …, w_N, bias b, output y]

y : prediction

ytrue : target

Error: e.g.

 

L_2~=~(y - y_\mathrm{true})^2

[figure: L2 loss surface as a function of slope and intercept]

Find the best parameters by finding the minimum of the L2 loss surface

 

at every step look around and choose the best direction
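A minimal sketch (not from the slides) of exactly that: gradient descent on the L2 loss of a one-feature linear model, stepping slope and intercept along the negative gradient; the data and learning rate are illustrative.

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=200)
y_true = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=200)   # target data

slope, intercept, lr = 0.0, 0.0, 0.1
for step in range(500):
    y = slope * x + intercept
    # gradients of the mean L2 loss with respect to slope and intercept
    grad_slope = np.mean(2 * (y - y_true) * x)
    grad_intercept = np.mean(2 * (y - y_true))
    slope -= lr * grad_slope
    intercept -= lr * grad_intercept

print(slope, intercept)        # should approach 2.0 and 0.5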

back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??

x_{j}~=~\sum_i y_{i}w_{ji} ~~~~~~ y_j~=\frac{1}{1+e^{-x_j}}


\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

Training models with this many parameters requires a lot of care:

 

. defining the metric

. optimization schemes

. training/validation/testing sets

 

But just like in our simple linear regression case, small changes in the parameters lead to small changes in the output (for the right activation functions), which is what makes gradient-based learning possible.

C = \frac{1}{2}|y - a^L|^2 ~=~ \frac{1}{2}\sum_j (y_j - a^L_j)^2

define a cost function, e.g.


Training a DNN

feed data forward through network and calculate cost metric

for each layer, calculate effect of small changes on next layer

\vec{y} = f_N(\dots f_1(\vec{x} W_1 + b_1)\dots W_N + b_N)

 back-propagation

how does gradient descent look when you have a whole network structure with hundreds of weights and biases to optimize??

think of applying gradient descent to a function of a function of a function... use:

1)  partial derivatives, 2)  chain rule

C = \frac{1}{2}|y - a^L|^2 ~=~ \frac{1}{2}\sum_j (y_j - a^L_j)^2

define a cost function, e.g.

Training a DNN
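A minimal numpy sketch (mine) of back-propagation for one hidden layer with sigmoid activations and the quadratic cost above: the chain rule carries the output error back through the layers; shapes and the learning rate are illustrative.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=3)                 # one input vector
y_true = np.array([1.0])               # its target

W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)

for step in range(100):
    # forward pass
    a1 = sigmoid(W1 @ x + b1)          # hidden activations
    a2 = sigmoid(W2 @ a1 + b2)         # output activations a^L
    # backward pass: chain rule, layer by layer
    delta2 = (a2 - y_true) * a2 * (1 - a2)      # dC/dz at the output
    delta1 = (W2.T @ delta2) * a1 * (1 - a1)    # propagated to the hidden layer
    # gradient-descent updates of weights and biases
    W2 -= 0.5 * np.outer(delta2, a1); b2 -= 0.5 * delta2
    W1 -= 0.5 * np.outer(delta1, x);  b1 -= 0.5 * delta1

print(a2)                              # prediction approaches the target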

MLTSA:

 

Autoencoders

7

Unsupervised learning with

Neural Networks

What do NN do? approximate complex functions with series of linear functions

 

 

 

 

 

 

 

Unsupervised learning with

Neural Networks

What do NN do? approximate complex functions with series of linear functions

To do that they extract information from the data

Each layer of the DNN produces a representation of the data, a "latent representation".

The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now)

 

 

 

Unsupervised learning with

Neural Networks

What do NN do? approximate complex functions with series of linear functions

To do that they extract information from the data

Each layer of the DNN produces a representation of the data, a "latent representation".

The dimensionality of that latent representation is determined by the size of the layer (and its connectivity, but we will ignore this bit for now)



.... so if my layers are smaller than the input, what I have is a compact representation of the data


Autoencoder Architecture

Feed Forward DNN:

the size of the input is 5,

the size of the last layer is 2

Autoencoder Architecture

  • Encoder: outputs a lower dimensional representation z of the data x (similar to PCA, tSNE...)
  • Decoder: Learns how to reconstruct x given z: learns p(x|z)

Autoencoder Architecture

Building a DNN

with keras and tensorflow

Trivial to build, but the devil is in the details!


from keras.models import Sequential
#pretrained models can be loaded from keras.applications
from keras.layers import Dense, Conv2D, MaxPooling2D
#create model
model = Sequential()


#create the model architecture by adding model layers
#n_cols = number of input features, e.g. n_cols = x_train.shape[1]
model.add(Dense(10, activation='relu', input_shape=(n_cols,)))
model.add(Dense(10, activation='relu'))
model.add(Dense(1))

#need to choose the loss function, metric, optimization scheme
model.compile(optimizer='adam', loss='mean_squared_error')

#need to learn what to look for - always plot the loss function!
model.fit(x_train, y_train, validation_data=(x_test, y_test),
                     epochs=20, batch_size=100, verbose=1)
#note that the model accepts a validation set:
#this gives a train-validate-test split
#predict
test_y_predictions = model.predict(validate_X)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels)
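A sketch of what such a model could look like (assuming tensorflow.keras, 28x28 images flattened to 784 pixels, and illustrative layer sizes; the notebook's actual architecture may differ): a dense autoencoder with a 64-neuron bottleneck.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input

autoencoder = Sequential([
    Input(shape=(784,)),                  # flattened 28x28 image
    Dense(128, activation='relu'),        # encoder
    Dense(64, activation='relu'),         # bottleneck: 64-dimensional representation
    Dense(128, activation='relu'),        # decoder
    Dense(784, activation='sigmoid'),     # reconstruct the 784 pixels
])
autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256,
#                 validation_data=(x_test, x_test))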

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

encoder

This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

decoder

This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels)

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

This autoencoder model has a 64-neuron bottleneck. This means it will generate a compressed representation of the data out of that layer which is 64-dimensional (the original size is 784 pixels)

bottleneck

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

This simple model has ~200,000 parameters!

My original choice is to train it with "adadelta" and a mean squared error loss function; all activation functions are ReLU, appropriate for a regression

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss | good for | activation last layer | size last layer
mean_squared_error | regression | linear | one node
mean_absolute_error | regression | linear | one node
mean_squared_logarithmic_error | regression | linear | one node
binary_crossentropy | binary classification | sigmoid | one node
categorical_crossentropy | multiclass classification | sigmoid | N nodes
Kullback-Leibler divergence | multiclass classification, probabilistic interpretation | sigmoid | N nodes

autoencoder for image reconstruction

#last layer: linear activation + mean squared error loss
model_digits64.add(Dense(ndim,
                         activation='linear'))
model_digits64.compile(optimizer="adadelta",
                       loss="mean_squared_error")

#last layer: sigmoid activation + mean squared error loss
model_digits64_sig.add(Dense(ndim,
                             activation='sigmoid'))
model_digits64_sig.compile(optimizer="adadelta",
                           loss="mean_squared_error")

#last layer: sigmoid activation + binary cross-entropy loss
model_digits64_bce.add(Dense(ndim,
                             activation='sigmoid'))
model_digits64_bce.compile(optimizer="adadelta",
                           loss="binary_crossentropy")

loss function: did not finish learning, it is still decreasing rapidly

The predictions are far too detailed. While the input is not binary, it does not have a lot of details. Maybe approaching it as a binary problem (with a sigmoid and a binary cross entropy loss) will give better results

A sigmoid activation gives a much better result!

Binary cross entropy loss function: it is more appropriate when the output layer is a sigmoid

Even better results!

[figure: original digits vs predicted reconstructions]

autoencoder for image recontstruction

A more ambitious model has a 16-neuron bottleneck: we are trying to extract 16 numbers to reconstruct the entire image! It's pretty remarkable! Those 16 numbers are extracted features from the data

[figure: original, latent representation, and predicted reconstruction]

resources

 


Reading

 

MLTSA 09 2025

By federica bianco

neural networks