federica bianco PRO
astro | data science | data for good
Fall 2025 - UDel PHYS 641
dr. federica bianco
@fedhere
this slide deck:
Opportunity
the era of AI
experiment driven science -∞:1900
theory driven science 1900-1950
computationally driven science 1950-1990
data driven science 1990-2010
the fourth paradigm - Jim Gray, 2009
AI driven science? 2010...
[diagram: input x → function → output y]
the function has parameters: m (slope) and b (intercept)
goal: learn (find) the right m and b that turn x into y
what is machine learning?
1
ML: any model with parameters learnt from the data
let's try: m = 0.4 and b = 0
m: slope, b: intercept (parameters)
goal: learn the right m and b that turn x into y
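A minimal sketch of this step (the data below are synthetic, made up only for illustration; any set of (x, y) pairs would do):

```python
# a minimal sketch: learn m (slope) and b (intercept) from (x, y) data
# synthetic data generated with m = 0.4, b = 0, plus a little noise
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)
y = 0.4 * x + 0.0 + rng.normal(scale=0.1, size=x.size)

m_fit, b_fit = np.polyfit(x, y, deg=1)  # least-squares fit of a straight line
print(m_fit, b_fit)                     # should come out close to 0.4 and 0
```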
Tree models
(the basis of Random Forests and Gradient Boosted Trees)
Machine Learning
extracted features vector → p(class)
pixel values tensor → p(class)
1958: Perceptron
The perceptron algorithm: 1958, Frank Rosenblatt
[diagram: inputs, each multiplied by a weight, summed together with a bias, passed through an activation function → output]
without the activation function, this is linear regression
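Writing the diagram as a formula (standard notation, not taken verbatim from the slide): with inputs $x_i$, weights $w_i$, and bias $b$,

$$\text{linear regression: } \hat{y} = \sum_{i=1}^{N} w_i x_i + b, \qquad \text{perceptron: } \hat{y} = f\!\left(\sum_{i=1}^{N} w_i x_i + b\right),$$

where $f$ is the activation function.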
The perceptron algorithm: 1958, Frank Rosenblatt
Perceptrons are linear classifiers: a perceptron makes its predictions based on a linear predictor function combining a set of weights (= parameters) with the feature vector.
[diagram: weights, bias, activation function (here a sigmoid) → output]
Perceptron
ANN: examples of activation functions
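A minimal numpy sketch of a few common activation functions (sigmoid and ReLU appear elsewhere in this deck; tanh is included only as another typical example, not necessarily the one shown on the slide):

```python
# a minimal numpy sketch of common ANN activation functions
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes values into (-1, 1)

def relu(z):
    return np.maximum(0.0, z)        # 0 for z < 0, identity for z >= 0

z = np.linspace(-5, 5, 11)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```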
The perceptron algorithm : 1958, Frank Rosenblatt
Perceptron
The Navy revealed the embryo of an electronic computer today that it expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.
The embryo - the Weather Bureau's $2,000,000 "704" computer - learned to differentiate between left and right after 50 attempts in the Navy demonstration
July 8, 1958
A Neural Network is a kind of function that maps input x to output y
1970: multilayer perceptron architecture
[diagram: input layer → hidden layer (a layer of perceptrons) → output layer]
Fully connected: all nodes go to all nodes of the next layer.
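A minimal numpy sketch of one fully connected layer, assuming 3 input nodes and 4 output nodes (sizes chosen only for illustration):

```python
# a minimal numpy sketch of one fully connected layer:
# every input node is connected to every node of the next layer
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)        # 3 input nodes
W = rng.normal(size=(4, 3))   # w: weights (learned), 4 output nodes x 3 inputs
b = np.zeros(4)               # b: biases (learned), one per output node

def f(z):                     # f: activation function (sigmoid, as an example)
    return 1.0 / (1.0 + np.exp(-z))

h = f(W @ x + b)              # layer output: 4 values
print(h)
```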
Perceptrons by Marvin Minsky and Seymour Papert 1969
learned parameters:
w: weight - sets the sensitivity of a neuron
b: bias - up-down weights a neuron
f: activation function - turns neurons on-off
hyperparameters of DNN
3
how many parameters?
[diagram: input layer (3 nodes) → hidden layer (4 nodes) → hidden layer (3 nodes) → output layer (1 node)]
(3x4)+4 = 16
(4x3)+3 = 15
(3x1)+1 = 4
total: 35 parameters
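A sketch that checks this count with Keras, assuming the 3-4-3-1 fully connected architecture read off the diagram (the activations are arbitrary here; they do not change the parameter count):

```python
# count the parameters of a 3-4-3-1 fully connected network with Keras
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(3,)),                     # input layer: 3 nodes
    keras.layers.Dense(4, activation="relu"),    # (3x4)+4 = 16 parameters
    keras.layers.Dense(3, activation="relu"),    # (4x3)+3 = 15 parameters
    keras.layers.Dense(1, activation="linear"),  # (3x1)+1 = 4 parameters
])
model.summary()  # reports 35 trainable parameters in total
```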
how many hyperparameters?
GREEN: architecture hyperparameters
RED: training hyperparameters
GPT-3
175 billion parameters
3,640 PetaFLOP/s-days
Kaplan+ 2020
A Neural Network is a kind of function that maps input x to output y
[diagram: input → hidden layers → output; the hidden layers define a latent space]
visualization and concept credit: Alex Razim
Kaicheng Zhang et al 2016 ApJ 820 67
deSoto+2024
Boone 2017
7% of LSST data
The rest
original data → early layers learn simple, generalized features (like lines for a CNN) → late layers learn complex, aggregate, specialized features → prediction "head"
Foundational AI models: trained extensively on large amounts of data to solve generic problems
"We use the ILSVRC-2012 ImageNet dataset with 1k classes and 1.3M images, its superset ImageNet-21k with 21k classes and 14M images and JFT with 18k classes and 303M high-resolution images."
"Typically, we pre-train ViT on large datasets, and fine-tune to (smaller) downstream tasks. For this, we remove the pre-trained prediction head and attach a zero-initialized D × K feedforward layer, where K is the number of downstream classes."
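A hedged sketch of the fine-tuning recipe quoted above, using a generic Keras pre-trained backbone (MobileNetV2 stands in for ViT; K = 5 and the input size are assumptions made only for illustration):

```python
# transfer-learning sketch: swap the prediction head of a pre-trained model
from tensorflow import keras

K = 5  # hypothetical number of downstream classes

backbone = keras.applications.MobileNetV2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg",
)
backbone.trainable = False  # keep the pre-trained generic features frozen

head = keras.layers.Dense(K, activation="softmax",
                          kernel_initializer="zeros")  # zero-initialized new head
model = keras.Sequential([backbone, head])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
```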
Lots of parameters and lots of hyperparameters! What to choose?
cheatsheet
An article that compares various DNNs: accuracy comparison, batch size
always check your loss function! it should go down smoothly and flatten out at the end of the training.
not flat? you are still learning!
too flat? you are overfitting...
loss (gallery of horrors)
jumps are not unlikely (and not necessarily a problem) if your activation functions are not smooth (e.g. relu, whose derivative is discontinuous)
regularization (e.g. dropout) is active during training but not during validation, so the validation loss can be smaller than the training loss
loss and learning rate (note that the appropriate learning rate depends on the chosen optimization scheme!)
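A minimal sketch of how to inspect the loss curves with Keras (it assumes a compiled model `model` and numpy arrays `X`, `y`; these names are placeholders):

```python
# always look at the training and validation loss curves after training
import matplotlib.pyplot as plt

history = model.fit(X, y, validation_split=0.2, epochs=50, verbose=0)
plt.plot(history.history["loss"], label="training loss")
plt.plot(history.history["val_loss"], label="validation loss")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```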
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
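A minimal sketch of such an autoencoder, assuming flattened 28x28 grayscale images scaled to [0, 1] (the architecture and layer sizes are illustrative, not the ones used in class):

```python
# a minimal autoencoder sketch for image reconstruction with keras/tensorflow
from tensorflow import keras

encoder = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(32, activation="relu"),      # latent space: 32 dimensions
])
decoder = keras.Sequential([
    keras.Input(shape=(32,)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(784, activation="sigmoid"),  # reconstructed pixel values
])
autoencoder = keras.Sequential([encoder, decoder])
autoencoder.compile(optimizer="adam", loss="mean_squared_error")
# autoencoder.fit(x_train, x_train, epochs=20, batch_size=256)  # input = target
```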
What should I choose for the loss function, and how does that relate to the activation function and optimization?

| loss | good for | activation (last layer) | size (last layer) |
|---|---|---|---|
| mean_squared_error | regression | linear | one node |
| mean_absolute_error | regression | linear | one node |
| mean_squared_logarithmic_error | regression | linear | one node |
| binary_crossentropy | binary classification | sigmoid | one node |
| categorical_crossentropy | multiclass classification | softmax | N nodes |
| kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes |
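A sketch of the multiclass row of the table in practice, assuming a hypothetical 10-class problem with 784-dimensional inputs and one-hot targets:

```python
# pair a softmax last layer (N nodes, one per class) with categorical_crossentropy
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(784,)),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),  # N nodes, one per class
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",     # expects one-hot labels
              metrics=["accuracy"])
```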
On the interpretability of DNNs
Neural Networks and Deep Learning
an excellent and free book on NN and DL
http://neuralnetworksanddeeplearning.com/index.html
History of NN
https://cs.stanford.edu/people/eroberts/courses/soco/projects/neural-networks/History/history2.html
NNs are a vast topic and we only have 2 weeks!
Some FREE references!
michael nielsen: better pedagogical approach, more basic, more clear
ian goodfellow: mathematical approach, more advanced, unfinished
An article that compares various DNNs
accuracy comparison
By federica bianco