Machine Learning for

Time Series Analysis XI

Neural Networks: Transformers

Fall 2025 - UDel PHYS 664
dr. federica bianco 

 

@fedhere

ML model performance

Accuracy, Recall, Precision

 | H0 is True | H0 is False
H0 is falsified | Type I Error: False Positive | True Positive
H0 is not falsified | True Negative | Type II Error: False Negative

False Positive: important message spammed

False Negative: spam in your inbox

ML model performance

Accuracy, Recall, Precision

Precision = \frac{TP}{TP~+~FP}

Recall = \frac{TP}{TP~+~FN}

Accuracy = \frac{TP~+~TN}{TP~+~TN~+~FP~+~FN}

TP=True Positive

FP=False Positive

TN=True Negative

FN=False Negative
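A minimal sketch of the three definitions above, written as one plain Python function. The spam-filter counts are made up for illustration (they are not from the lecture), with "positive" meaning "spam".

```python
def classification_metrics(tp, fp, tn, fn):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)                   # of everything flagged, how much was right
    recall = tp / (tp + fn)                      # of all true positives, how many were found
    accuracy = (tp + tn) / (tp + tn + fp + fn)   # fraction of all objects labeled correctly
    return precision, recall, accuracy


# Hypothetical spam filter: 30 spam caught (TP), 10 important messages spammed (FP),
# 50 important messages delivered (TN), 10 spam left in the inbox (FN).
precision, recall, accuracy = classification_metrics(tp=30, fp=10, tn=50, fn=10)
print(precision, recall, accuracy)
```

The function assumes that at least one object is predicted positive and at least one is truly positive; otherwise a denominator is zero.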

ML model performance

Sensitivity Specificity

 

True Positive Rate = TP / All Positive Labels

 

Sensitivity = TPR = Recall

 

False Positive Rate = FP / All Negative Labels

 

Specificity = TN / All Negative Labels = 1 - FPR

 

 

ML model performance

F score

β: a factor indicating how much more important recall is than precision. For example, if we consider recall to be twice as important as precision, we can set β to 2. The standard F-score is equivalent to setting β to one.
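For reference, the usual definition of the F score with weight β is

F_\beta = (1 + \beta^2) \frac{{\rm Precision} \cdot {\rm Recall}}{\beta^2 \cdot {\rm Precision} + {\rm Recall}}

The formula is not in the slide text, so take it as the standard textbook form. A short sketch with made-up precision and recall values:

```python
def f_beta(precision, recall, beta=1.0):
    """F score: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: a classifier with good precision and poor recall.
f1 = f_beta(0.8, 0.5)            # standard F score (beta = 1)
f2 = f_beta(0.8, 0.5, beta=2.0)  # recall counts more, so the score drops
print(f1, f2)
```

With β = 1 this reduces to the harmonic mean of precision and recall.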

Class Imbalance

Current classifier accuracy: 50%

 

Precision?

 

Recall?

 

Specificity?

 

Sensitivity?

Class Imbalance

Current classifier accuracy: 50%

 

Precision: 0.8

 

Recall: 0.5

 

Specificity: 0.5

 

Sensitivity: 0.5

Class Imbalance

Current classifier accuracy: 

 

Precision: 

 

Recall: 

 

Specificity: 

 

Sensitivity: 

Class Imbalance

Current classifier accuracy: 80%

 

Precision: 0.8

 

Recall: 1

 

Specificity: 0 

 

Sensitivity: 1
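The two sets of numbers above can be reproduced with a small sketch. The figures behind the slides are not in the text, so the sample below (80 positives, 20 negatives) and the two classifiers (a coin flip, and one that labels everything positive) are assumptions chosen to match the quoted values.

```python
def rates(tp, fp, tn, fn):
    """Summary metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),        # = sensitivity = TPR
        "specificity": tn / (tn + fp),   # = 1 - FPR
    }


# Assumed imbalanced sample: 80 positives, 20 negatives.
coin_flip = rates(tp=40, fp=10, tn=10, fn=40)        # half of each class labeled positive
always_positive = rates(tp=80, fp=20, tn=0, fn=0)    # every object labeled positive

print(coin_flip)
print(always_positive)
```

Under these assumptions the "always positive" classifier has higher accuracy (80%) than the coin flip (50%), but its specificity is 0: it never identifies a negative. Accuracy alone hides this when classes are imbalanced.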

Receiver operating characteristic

 

along the curve, the classifier probability threshold t is what changes

 

 

 

{\rm class} = i~{\rm if} ~p_i > t
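A rough sketch of how the curve is traced for a binary classifier: each threshold t gives one (FPR, TPR) point. The labels and probabilities below are invented for illustration, and `roc_points` is a hypothetical helper, not a library function.

```python
import numpy as np


def roc_points(y_true, p, thresholds):
    """Return one (FPR, TPR) pair per threshold, using class = positive if p > t."""
    y_true = np.asarray(y_true, dtype=bool)
    p = np.asarray(p, dtype=float)
    points = []
    for t in thresholds:
        predicted = p > t
        tp = np.sum(predicted & y_true)    # positives correctly flagged
        fp = np.sum(predicted & ~y_true)   # negatives wrongly flagged
        tpr = tp / y_true.sum()
        fpr = fp / (~y_true).sum()
        points.append((float(fpr), float(tpr)))
    return points


# Toy data: true labels and the classifier's probability for the positive class.
y_true = [0, 0, 1, 1, 0, 1]
p = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]
points = roc_points(y_true, p, thresholds=[1.0, 0.5, 0.3, 0.0])
print(points)
```

Lowering t moves along the curve from (0, 0), where nothing is labeled positive, to (1, 1), where everything is.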

Receiver operating characteristic

[figure: ROC curves, labeled GOOD and BAD]

tuning by changing hyperparameters

Neural Networks for Time Series Analysis

1

MLTSA:

 

DeepNeuralNetwork

Promising solution to Time Series Analysis problems because they can learn highly nonlinear, varied relations between data

Recurrent Neural Networks: take as input for the next state prediction the past/present state as well as their hidden NN representation

Issue: training through gradient descent (derivatives) causes the gradient to vanish or explode after a few time steps: the model loses memory of the past rapidly (~few steps) (cause math sometimes is... just hard)

Partial Solution: LSTM: forget gates can extend memory by dropping irrelevant time steps
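A toy illustration of why the gradient vanishes or explodes. It assumes a scalar, linear recurrence h_t = w h_{t-1}, which is a drastic simplification of a real RNN: the derivative of the last state with respect to the first is then just w multiplied by itself once per time step.

```python
def gradient_through_time(w, n_steps):
    """d h_T / d h_0 for the toy recurrence h_t = w * h_(t-1): the product of n_steps factors w."""
    grad = 1.0
    for _ in range(n_steps):
        grad *= w
    return grad


vanishing = gradient_through_time(0.5, 20)   # |w| < 1: the gradient shrinks geometrically
exploding = gradient_through_time(1.5, 20)   # |w| > 1: the gradient grows geometrically
print(vanishing, exploding)
```

In a real network w is replaced by a matrix times the derivative of the activation, but the repeated product behaves in the same way.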

Convolutional Neural Networks: learn relationships between pixels

 

Issue: training is expensive (will discuss next week)

 

\phi(\vec{x}) ~\sim~f^{(3)}(f^{(2)}(f^{(1)}(\vec{x} \cdot W_1 + \vec{b_1}) \cdot W_2 + \vec{b_2}) \cdot W_3 + \vec{b_3})~=~y

DeepNeuralNetwork

The purpose is to approximate a function φ

y = φ(x)

which (in general) is not linear, with linear operations:

what we are doing, except for the activation function, is exactly a series of matrix multiplications.
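A minimal NumPy sketch of the nested expression f^{(3)}(f^{(2)}(f^{(1)}(x W_1 + b_1) W_2 + b_2) W_3 + b_3). The layer sizes, the random weights, and the choice of ReLU are assumptions for illustration; this is not the Keras model built later in the lecture.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def identity(z):
    return z


def forward(x, params, activations):
    """Apply h <- f(h @ W + b) once per layer."""
    h = x
    for (W, b), f in zip(params, activations):
        h = f(h @ W + b)
    return h


rng = np.random.default_rng(0)
sizes = [4, 8, 8, 1]   # input, two hidden layers, one output node (arbitrary choice)
params = [(rng.normal(size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

x = rng.normal(size=(5, 4))   # 5 objects with 4 features each
y = forward(x, params, [relu, relu, identity])

# Without activation functions the network collapses to one matrix product.
y_linear = forward(x, params, [identity, identity, identity])
W1, W2, W3 = (W for W, _ in params)
print(y.shape, np.allclose(y_linear, x @ W1 @ W2 @ W3))
```

The last check shows why the activation functions matter: with zero biases and no activations, three layers are equivalent to a single linear map.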

Building a DNN

with keras and tensorflow

autoencoder for image reconstruction

What should I choose for the loss function and how does that relate to the activation function and optimization?

loss | good for | activation last layer | size last layer
mean_squared_error | regression | linear | one node
mean_absolute_error | regression | linear | one node
mean_squared_logarithmic_error | regression | linear | one node
binary_crossentropy | binary classification | sigmoid | one node
categorical_crossentropy | multiclass classification | softmax | N nodes
kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes

DeepNeuralNetwork - loss functions

Binary Cross Entropy

-(y \log{(p)} + (1-y) \log{(1-p)})

(Multiclass) Cross Entropy

-\sum_{c=1}^M y_{o,c} \log{(p_{o,c})}

c = class

o = object

p = probability

y = label | truth

\hat{y} = prediction

Kullback-Leibler

\sum(y \log {\frac{y}{\hat{y}}})

Mean Squared Error (L2)

Mean Absolute Error (L1)

Mean Squared Logarithmic Error

L(y, \hat{y}) = \frac{1}{N} \sum_{i=1}^{N}(\log(y_i + 1) - \log({\hat{y}}_i + 1))^2
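A hedged NumPy sketch of three of these losses, to make the symbols concrete. Averaging over objects and clipping probabilities away from 0 and 1 are implementation choices made here; they are not taken from the slides or from the Keras source.

```python
import numpy as np


def binary_crossentropy(y, p, eps=1e-12):
    """Mean over objects of -(y log p + (1 - y) log(1 - p))."""
    p = np.clip(p, eps, 1 - eps)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))


def categorical_crossentropy(y, p, eps=1e-12):
    """Mean over objects o of -sum_c y_{o,c} log p_{o,c}, with y one-hot."""
    p = np.clip(p, eps, 1.0)
    return float(np.mean(-np.sum(y * np.log(p), axis=1)))


def mean_squared_log_error(y, y_hat):
    """Mean of (log(y + 1) - log(y_hat + 1))^2."""
    return float(np.mean((np.log(y + 1) - np.log(y_hat + 1)) ** 2))


# A maximally uncertain binary classifier pays log(2) per object.
bce = binary_crossentropy(np.array([1.0, 0.0]), np.array([0.5, 0.5]))

# A uniform guess over 4 classes pays log(4) per object.
y_onehot = np.eye(4)[[0, 2]]
cce = categorical_crossentropy(y_onehot, np.full((2, 4), 0.25))

msle = mean_squared_log_error(np.array([1.0, 10.0]), np.array([1.0, 10.0]))
print(bce, cce, msle)
```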

On the interpretability of DNNs

MLTSA:

 

GPT3 and society

2

unexpected consequences of NLP models

 

Vinay Prabhu exposes racist bias in GPT-3

On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜

Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell 

The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.

Last week, Gebru said she was fired by Google after objecting to a manager’s request to retract or remove her name from the paper. Google’s head of AI said the work “didn’t meet our bar for publication.” Since then, more than 2,200 Google employees have signed a letter demanding more transparency into the company’s handling of the draft. Saturday, Gebru’s manager, Google AI researcher Samy Bengio, wrote on Facebook that he was “stunned,” declaring “I stand by you, Timnit.” AI researchers outside Google have publicly castigated the company’s treatment of Gebru.

We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including:

environmental costs (borne typically by those not benefiting from the resulting technology);

financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques;

opportunity cost, as researchers pour effort away from directions requiring less resources; and the

risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.

When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first [1, 27].

Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100 [6]) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?

While the average human is responsible for an estimated 5t CO2 per year, the authors trained a Transformer (big) model [136] with neural architecture search and estimated that the training procedure emitted 284t of CO2.

[...]

4.1 Size Doesn’t Guarantee Diversity The Internet is a large and diverse virtual space, and accordingly, it is easy to imagine that very large datasets, such as Common Crawl (“petabytes of data collected over 8 years of web crawling”, a filtered version of which is included in the GPT-3 training data) must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find that there are several factors which narrow Internet participation [...]

Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries [100, 143]. However, it’s not just the Internet as a whole that is in question, but rather specific subsamples of it. For instance, GPT-2’s training data is sourced by scraping outbound links from Reddit, and Pew Internet Research’s 2016 survey reveals 67% of Reddit users in the United States are men, and 64% between ages 18 and 29. Similarly, recent surveys of Wikipedians find that only 8.8–15% are women or girls [9].

4.3 Encoding Bias It is well established by now that large LMs exhibit various kinds of bias, including stereotypical associations [11, 12, 69, 119, 156, 157], or negative sentiment towards specific groups [61]. Furthermore, we see the effects of intersectionality [34], where BERT, ELMo, GPT and GPT-2 encode more bias against identities marginalized along more than one dimension than would be expected based on just the combination of the bias along each of the axes [54, 132].

The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.

Encoder + Decoder architecture

 

Attention mechanism

 

Multi-headed attention

Attention is all you need: transformer model

transformer generalized architecture elements

MLTSA:

 

attention

3

v1 v2 v3 v4
k1 0.1 0.1 0.1 0.1
k2 0.9 0.3 0.1 0.1
k3 0.2 0.1 0.2 0.1
k4 0.6 0.9 0.1 0.

attention mechanism:

a way to relate elements of the time series with each other

The cat that ate

 

was full and happy

MLTSA:

 

embedding

4

Word tokenization

Word tokenization and embedding

lemmatization/stemming: reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.

 

am, are, is --> be

dog, dogs, dog's --> dog

Word tokenization and embedding and contextualizing

vector embedding (768)

<by> o <river> -> small

near orthogonal embedding, low similarity vectors, no strong relation between tokens (by, river)

<river> o <bank> -> large

near parallel embedding, high similarity vectors, strong relation between tokens (bank, river)
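A toy sketch of the "small" versus "large" similarity, reading the `o` above as a dot product between embeddings and normalizing it into a cosine similarity. The vectors are random stand-ins built by hand so that `bank` lies close to `river`; they are not real learned embeddings.

```python
import numpy as np


def cosine_similarity(a, b):
    """Dot product normalized by the vector lengths: 1 = parallel, 0 = orthogonal."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


rng = np.random.default_rng(42)
d = 768                                      # embedding size quoted on the slide
river = rng.normal(size=d)
by = rng.normal(size=d)                      # independent direction: nearly orthogonal in 768-d
bank = river + 0.3 * rng.normal(size=d)      # constructed to point almost the same way as river

sim_by_river = cosine_similarity(by, river)
sim_bank_river = cosine_similarity(bank, river)
print(sim_by_river, sim_bank_river)
```

In high dimensions two unrelated random vectors are almost always close to orthogonal, which is why the first similarity comes out near zero.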

v1 v2 v3 v4 v5 v6 v7 v8
k1 1 0.1 0.1 0.1 0.1 0.1 0.1 0.7
k2 0.2 1 0.1 0.6 0.8 0.2 0.1 0.4
k3 0.1 0.1 1 0.2 0.1 0.2 0.1 0.1
k4 0.6 0.7 0.1 1 0.5 0.9 0.1 0.5
k5 0.1 0.9 0.1 0.3 1 0.1 0.1 0.3
k6 0.1 0.5 0.3 0.7 0.3 1 0.1 0.9

The cat that ate

was

full

 

The cat that ate was full and happy

fully autoregressive model

Attention is all you need (2017)

Encoder + Decoder architecture

attention: 

v1 v2 v3
k1 0.1 0.1 0.1
k2 0.9 0.3 0.1
k3 0.2 0.1 0.2

pairs of inputs (queries) and outputs (values - keys) are paired by weights (the "attention" W)

[figure: example vectors (1238, 913, 12) and (39, 5, 903) related by W]

The key/value/query concept is analogous to retrieval systems.

project embedding into Key - Value - Query

lower dimensional representations
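A minimal sketch of one attention head, assuming the scaled dot-product form softmax(Q K^T / sqrt(d_k)) V from "Attention is all you need". The projection matrices `W_q`, `W_k`, `W_v` are random here (in a trained model they are the learned parameters), and the sizes are arbitrary.

```python
import numpy as np


def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(X, W_q, W_k, W_v):
    """Project embeddings to queries, keys, values; return the output and the weight table."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v        # lower dimensional representations
    scores = Q @ K.T / np.sqrt(K.shape[-1])    # query-key similarity, scaled by sqrt(d_k)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 16, 8
X = rng.normal(size=(n_tokens, d_model))       # one embedding per token
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(X, W_q, W_k, W_v)
print(out.shape, weights.round(2))
```

The `weights` array plays the role of the k/v tables shown on the slides, with one row per query token. Here each row is normalized to sum to 1, which the illustrative numbers on the slides are not.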

Attention is all you need (2017)

Encoder + Decoder architecture

different elements of the sentence relate to input elements in multiple ways

Multi-headed attention:

[figure: three attention heads with weights W1, W2, W3, each with its own attention table of the form below]

v1 v2 v3
k1 0.1 0.1 0.1
k2 0.9 0.3 0.1
k3 0.2 0.1 0.2
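A sketch of the multi-head idea under the same assumptions as the original transformer paper: each head has its own query, key, and value projections, the head outputs are concatenated, and a final matrix (called `W_o` here) mixes them. All weights are random placeholders and the sizes are arbitrary.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # this head's attention table
        outputs.append(A @ V)
    return np.concatenate(outputs, axis=-1) @ W_o     # merge the heads


rng = np.random.default_rng(1)
n_tokens, d_model, n_heads = 3, 12, 3
d_k = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))

Y = multi_head_attention(X, heads, W_o)
print(Y.shape)
```

Because each head has different projections, each can learn a different way in which the tokens relate to each other.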

Attention is all you need (2017)

Encoder + Decoder architecture

The key/value/query concept is analogous to retrieval systems.

Multi-headed Self attention:

4.1

MLTSA:

 

query value key

1.1

MLTSA:

 

with help from Dr. Pavlos Protopapas

What do we want from a language model?

- keep track of the entire text unit context

- understand word in context

"I took a walk on the river bank"

"I went to the bank to deposit a check"

tokenization: 0.2 | 0.6 | 0.5 | ... | 0.1

MOTIVATION FOR ATTENTION

We want to respect the sequential nature of language -> MLP cannot do this

Need long context -> MLP / RNN / LSTM cannot do this

Capture context-dependent semantics -> tokenization (word2vec) only captures the word, not its context

We need an architecture that can be trained in parallel (non-Markovian property) -> RNN / LSTM cannot do this

What do we want from a time series analysis model?

- recognize patterns at any time lag

- recognize that patterns can relate to each other differently (seasonality, trends, stochastic events)

"on the river bank"

\alpha_{14} \quad \alpha_{24} \quad \alpha_{34}

If these coefficients represent the relative importance of the words in the meaning of the sentence, the coefficient relating river and bank should be high!

..... linear algebra is not machine learning unless I am learning some parameters!

But context ALSO has to do with POSITION!

 

Willow talked to Federica

vs

Federica talked to Willow

P = f2(W2 f1(W1 X + B1) + B2)

X = [0.0,1.3,2.1,1.0,5.0]

W1 = [0.2,0.3,0.1,0.0,6.0]

P = [0.0,1.1,2.3,0.9,5.0]

X = [5.0,1.3,2.1,1.0,0.0]

W1 = [0.2,0.3,0.1,0.0,6.0]

P = [5.0,1.1,2.3,0.9,0.0]

MLTSA:

 

positional encoding

5

https://arxiv.org/pdf/2308.06404

Word tokenization and embedding and contextualizing

Willow talked to Fed

vector embedding (768): Willow | talked | to | Fed

near parallel embedding, high "similarity" vectors, strong relation between tokens (Willow, talked)

position embedding (768): added to the vector embedding of each token

\hat{A} = P+A\\ \hat{B} = P+B

P+Willow, P+talked

Fed talked to Willow

same vector embeddings, different position embeddings: P+talked, P+Willow

"on the river bank"

POSITIONAL ENCODING

\pi_1 \quad \pi_2 \quad \pi_3 \quad \pi_4
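One common choice for the vectors \pi is the sinusoidal encoding of the original transformer paper: sines and cosines of the position at geometrically spaced frequencies. The sketch below assumes that choice and an even embedding size; the token embeddings are random placeholders.

```python
import numpy as np


def sinusoidal_positional_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(same angle)."""
    positions = np.arange(n_positions)[:, None]    # column of positions 0..n-1
    i = np.arange(d_model // 2)[None, :]           # index of each sin/cos pair
    angles = positions / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe


tokens = ["on", "the", "river", "bank"]
d_model = 768
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), d_model))   # stand-in for learned embeddings

pi = sinusoidal_positional_encoding(len(tokens), d_model)
encoded = embeddings + pi                              # what the attention layers receive
print(pi.shape, encoded.shape)
```

Every position gets a distinct vector, so the same word at two different positions enters the model with two different inputs.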

Attention is all you need

Encoder + Decoder architecture

Encodes the past

MLTSA:

 

transformer model

6

Encoder + Decoder architecture

Encodes the past

Turns out attention is not really _all_ you need...

so far we are working with a "bag of words": the order of words is not known to the model
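The "bag of words" statement can be checked numerically. In this sketch (random weights, scaled dot-product self-attention assumed), shuffling the input tokens just shuffles the output rows in the same way: without positional information the model cannot tell the two orderings apart.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V


rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                    # 4 tokens, no positional encoding added
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))

perm = [3, 1, 2, 0]                            # swap the first and last token
out = self_attention(X, W_q, W_k, W_v)
out_shuffled = self_attention(X[perm], W_q, W_k, W_v)

# The output for each token is unchanged; only the row order follows the shuffle.
same_up_to_order = np.allclose(out_shuffled, out[perm])
print(same_up_to_order)
```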

Attention is all you need (2017)

Encoder + Decoder architecture

positional encoding

Encoder + Decoder architecture

decodes the past and predicts the future

Attention is all you need (2017)

a stack of N = 6 identical layers each with

(1) a masked multi-head self-attention mechanism acting on the previous decoder output,

(2) a multi-head attention mechanism acting on the encoder output (encoder-decoder attention),

(3) a positionwise fully connected feed-forward NN

MHA acting on the previous decoder output (1): masking dependence on the future

v1 v2 v3 v4 v5 v6 v7 v8
k1 1 0.1 0.1 0.1 0.1 0.1 0.1 0.7
k2 0.2 1 0.1 0.6 0.8 0.2 0.1 0.4
k3 0.1 0.1 1 0.2 0.1 0.2 0.1 0.1
k4 0.6 0.7 0.1 1 0.5 0.9 0.1 0.5
k5 0.1 0.9 0.1 0.3 1 0.1 0.1 0.3
k6 0.1 0.5 0.3 0.7 0.3 1 0.1 0.9

The cat that ate was full and happy

MHA acting on the encoder output (2)
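Masking the dependence on the future can be sketched as follows: before the softmax, every score that links a position to a later position is set to minus infinity, so its attention weight becomes exactly zero. The scores are random placeholders and the function name is hypothetical.

```python
import numpy as np


def causal_attention_weights(scores):
    """Softmax over each row, with entries above the diagonal (the future) masked out."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)     # True where column > row
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)   # stable softmax; the diagonal is never masked
    e = np.exp(masked)                                     # exp(-inf) = 0
    return e / e.sum(axis=-1, keepdims=True)


rng = np.random.default_rng(0)
scores = rng.normal(size=(5, 5))        # stand-in for Q K^T / sqrt(d_k)
weights = causal_attention_weights(scores)
print(weights.round(2))
```

The result is lower triangular: the first token can only attend to itself, the second to the first two, and so on. This is what lets the decoder be trained on whole sequences without seeing the values it has to predict.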

Attention is all you need (2017)

Encoder + Decoder architecture

Input

Embedding

Positional encoding

Encoder attention

Output

Embedding

Positional encoding

Decoder attention

Encoder-Decoder attention

Feed Forward

Linear

Softmax

ML model performance

MLTSA:

 

5

Attention is all you need

Encoder + Decoder architecture

Encodes the past

Decodes the past and predicts the future

v1 v2 v3 v4 v5 v6
k1 1 0.1 0.1 0.1 0.1 0.1
k2 0.2 1 0.1 0.6 0.8 0.2
k3 0.1 0.1 1 0.2 0.1 0.2
k4 0.6 0.7 0.1 1 0.5 0.9
k5 0.1 0.9 0.1 0.3 1 0.1
k6 0.1 0.5 0.3 0.7 0.3 1

The cat that ate

was

full

 

The cat that ate was full

encodes the past

a stack of N = 6 identical layers each with

(1) a multi-head self-attention mechanism,

(2) a positionwise fully connected feed-forward NN

reading 

 

resources

 

A video on transformer which I think is really good!

https://www.youtube.com/watch?v=4Bdc55j80l8

 

A video on attention (with a different accent than the one I have subjected you to all this time!)

https://www.youtube.com/watch?v=-9vVhYEXeyQ

 

resources

 

Tutorial

MLTSA11 2025

By federica bianco
