Neural Networks: Transformers
Fall 2025 - UDel PHYS 664
dr. federica bianco
@fedhere
this slide deck:
LR = _____________________________
True Negative
False Negative
| | H0 is True | H0 is False |
|---|---|---|
| H0 is falsified | Type I Error: False Positive | True Positive |
| H0 is not falsified | True Negative | Type II Error: False Negative |
important message spammed
spam in
your inbox
Precision
Recall
Accuracy
TP=True Positive
FP=False Positive
TN=True Negative
FN=False Negative
True Positive Rate = TP / All Positive Labels
Sensitivity = TPR = Recall
False Positive Rate = FP / All Negative Labels
Specificity = TN / All Negative Labels = 1 - FPR
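A minimal sketch of the definitions above in Python, assuming the four confusion-matrix counts are already known and no denominator is zero. The function name `classification_metrics` and the counts are hypothetical, chosen only for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (assumes nonzero denominators)."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),            # fraction of predicted positives that are correct
        "recall": tp / (tp + fn),               # TPR, also called sensitivity
        "specificity": tn / (tn + fp),          # TNR = 1 - FPR
        "false_positive_rate": fp / (fp + tn),
    }

# Hypothetical counts, for illustration only
metrics = classification_metrics(tp=4, fp=1, tn=1, fn=4)
print(metrics)
```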
A factor indicating how much more important recall is than precision. For example, if we consider recall to be twice as important as precision, we can set β to 2. The standard F-score is equivalent to setting β to one.
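As a sketch of how β enters, the standard F-beta score is F_β = (1 + β²) · precision · recall / (β² · precision + recall). The function name `f_beta` and the input values are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """F-beta score: beta > 1 weights recall more heavily, beta = 1 is the usual F1."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical classifier with precision 0.8 and recall 0.5
f1 = f_beta(0.8, 0.5)            # standard F-score
f2 = f_beta(0.8, 0.5, beta=2.0)  # recall counted as twice as important
print(f1, f2)
```

Because recall is the weaker metric in this example, giving it more weight (β = 2) lowers the score.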
Current classifier accuracy: 50%
Precision: 0.8
Recall: 0.5
Specificity: 0.5
Sensitivity: 0.5
Current classifier accuracy:
Precision:
Recall:
Specificity:
Sensitivity:
Current classifier accuracy: 80%
Precision: 0.8
Recall: 1
Specificity: 0
Sensitivity: 1
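The numbers above are what one would get from a classifier that always answers "positive" on a sample that is 80% positive. A small check, assuming a hypothetical sample of 8 positives and 2 negatives (the counts are not from the slides):

```python
import numpy as np

# Hypothetical labels: 8 positives and 2 negatives
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
# A classifier that labels everything as positive
y_pred = np.ones_like(y_true)

tp = np.sum((y_pred == 1) & (y_true == 1))
fp = np.sum((y_pred == 1) & (y_true == 0))
tn = np.sum((y_pred == 0) & (y_true == 0))
fn = np.sum((y_pred == 0) & (y_true == 1))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)        # same as sensitivity
specificity = tn / (tn + fp)
print(accuracy, precision, recall, specificity)
```

High accuracy and perfect recall here say nothing about the negatives: the specificity is zero.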
along the curve, the classifier probability threshold t is what changes
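A sketch of how the curve is traced: sweep the threshold t and record the true and false positive rates at each value. The scores below are synthetic (drawn from two Gaussians and clipped to [0, 1]); they are an assumption for illustration, not data from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic classifier scores: positives tend to score higher than negatives
y_true = np.concatenate([np.ones(500), np.zeros(500)])
scores = np.clip(
    np.concatenate([rng.normal(0.65, 0.15, 500), rng.normal(0.35, 0.15, 500)]),
    0.0, 1.0,
)

thresholds = np.linspace(0, 1, 21)
tpr, fpr = [], []
for t in thresholds:
    y_pred = scores >= t  # objects with score above threshold are called positive
    tpr.append(np.sum(y_pred & (y_true == 1)) / np.sum(y_true == 1))
    fpr.append(np.sum(y_pred & (y_true == 0)) / np.sum(y_true == 0))
tpr, fpr = np.array(tpr), np.array(fpr)

# Each (fpr, tpr) pair is one point on the curve
for t, f, s in zip(thresholds[::5], fpr[::5], tpr[::5]):
    print(f"t={t:.2f}  FPR={f:.2f}  TPR={s:.2f}")
```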
GOOD
BAD
GOOD
BAD
tuning by changing hyperparameters
1
Promising solution to Time Series Analysis problems because they can learn highly non linear varied relations between data
Recurrent Neural Networks: to predict the next state, they take as input the past/present state as well as its hidden NN representation
Issue: training through gradient descent (derivatives) causes the gradient to vanish or explode after a few time steps: the model loses memory of the past rapidly (~few steps) (cause math sometimes is... just hard)
Partial Solution: LSTM: forget gates can extend memory by dropping irrelevant time stamps
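A toy illustration of the vanishing/exploding gradient, assuming the simplest possible recurrence h_t = w · h_{t-1} (a scalar, linear stand-in for an RNN, not a real network). The gradient of the state after n steps with respect to the initial state is then wⁿ.

```python
import numpy as np

def gradient_through_time(w, n_steps):
    """|d h_n / d h_0| for the toy recurrence h_t = w * h_{t-1}, for n = 1..n_steps."""
    return np.abs(w) ** np.arange(1, n_steps + 1)

vanishing = gradient_through_time(0.5, 20)  # |w| < 1: gradient shrinks geometrically
exploding = gradient_through_time(1.5, 20)  # |w| > 1: gradient grows geometrically
print(vanishing[-1], exploding[-1])
```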
Promising solution to Time Series Analysis problems because they can learn highly non linear varied relations between data
Convolutional Neural Networks: learn relationships between pixels
Issue: training is expensive (will discuss next week)
The purpose is to approximate a function φ
y = φ(x)
which (in general) is not linear, using linear operations:
what we are doing, except for the activation function,
is exactly a series of matrix multiplications.
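A minimal sketch of this idea with NumPy: a two-layer network is two matrix multiplications with an activation in between. The layer sizes, the random weights, and the choice of ReLU are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Hypothetical layer sizes: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = rng.normal(size=(3, 4)), np.zeros(4)
W2, b2 = rng.normal(size=(4, 1)), np.zeros(1)

def forward(x):
    hidden = relu(x @ W1 + b1)  # linear operation followed by a nonlinear activation
    return hidden @ W2 + b2     # linear output layer

x = rng.normal(size=(5, 3))     # batch of 5 input vectors
y = forward(x)

# Without the activation the two layers collapse into a single matrix
collapses = np.allclose((x @ W1) @ W2, x @ (W1 @ W2))
print(y.shape, collapses)
```

The last check is the reason the activation function matters: without it, any stack of layers is equivalent to one linear map.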
Building a DNN
with keras and tensorflow
autoencoder for image reconstruction
What should I choose for the loss function and how does that relate to the activation function and optimization?
loss | good for | activation last layer | size last layer |
---|---|---|---|
mean_squared_error | regression | linear | one node |
mean_absolute_error | regression | linear | one node |
mean_squared_logarithmic_error | regression | linear | one node |
binary_crossentropy | binary classification | sigmoid | one node |
categorical_crossentropy | multiclass classification | softmax | N nodes |
kullback_leibler_divergence | multiclass classification, probabilistic interpretation | softmax | N nodes |
Binary Cross Entropy
(Multiclass) Cross Entropy
c = class
o = object
p = probability
y = label | truth
ŷ = prediction
Kullback-Leibler
(Multiclass) Cross Entropy
Mean Squared Error
Mean Absolute Error
Mean Squared Logarithmic Error
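The losses named above can be sketched in NumPy as follows. These are the textbook definitions, written as an illustration rather than as the Keras implementations; the clipping constant `EPS` and the log(1 + y) convention for the logarithmic error are assumptions, and the example arrays are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0); an implementation detail, not part of the definitions

def mean_squared_error(y, y_hat):
    return np.mean((y - y_hat) ** 2)

def mean_absolute_error(y, y_hat):
    return np.mean(np.abs(y - y_hat))

def mean_squared_log_error(y, y_hat):
    # assumes non-negative targets and the log(1 + y) convention
    return np.mean((np.log1p(y) - np.log1p(y_hat)) ** 2)

def binary_cross_entropy(y, p):
    p = np.clip(p, EPS, 1 - EPS)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def categorical_cross_entropy(y_onehot, p):
    # sum over classes c, mean over objects o
    return -np.mean(np.sum(y_onehot * np.log(np.clip(p, EPS, 1.0)), axis=1))

def kl_divergence(p_true, p_pred):
    p_true = np.clip(p_true, EPS, 1.0)
    p_pred = np.clip(p_pred, EPS, 1.0)
    return np.mean(np.sum(p_true * np.log(p_true / p_pred), axis=1))

# Hypothetical regression targets and predictions
y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.0, 5.0])
# Hypothetical one-hot labels and predicted class probabilities
labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])

mse, mae = mean_squared_error(y, y_hat), mean_absolute_error(y, y_hat)
bce = binary_cross_entropy(np.array([1.0, 0.0]), np.array([0.9, 0.1]))
cce = categorical_cross_entropy(labels, probs)
kl = kl_divergence(labels, probs)
print(mse, mae, bce, cce, kl)
```

For one-hot labels the Kullback-Leibler divergence and the cross entropy coincide (up to the clipping), because the entropy of a one-hot distribution is zero.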
On the interpretability of DNNs
GPT3 and society
2
Vinay Prabhu exposes racist bias in GPT-3
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
The past 3 years of work in NLP have been characterized by the development and deployment of ever larger language models, especially for English. BERT, its variants, GPT-2/3, and others, most recently Switch-C, have pushed the boundaries of the possible both through architectural innovations and through sheer size. Using these pretrained models and the methodology of fine-tuning them for specific tasks, researchers have extended the state of the art on a wide array of tasks as measured by leaderboards on specific benchmarks for English. In this paper, we take a step back and ask: How big is too big? What are the possible risks associated with this technology and what paths are available for mitigating those risks? We provide recommendations including weighing the environmental and financial costs first, investing resources into curating and carefully documenting datasets rather than ingesting everything on the web, carrying out pre-development exercises evaluating how the planned approach fits into research and development goals and supports stakeholder values, and encouraging research directions beyond ever larger language models.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
Last week, Gebru said she was fired by Google after objecting to a manager’s request to retract or remove her name from the paper. Google’s head of AI said the work “didn’t meet our bar for publication.” Since then, more than 2,200 Google employees have signed a letter demanding more transparency into the company’s handling of the draft. Saturday, Gebru’s manager, Google AI researcher Samy Bengio, wrote on Facebook that he was “stunned,” declaring “I stand by you, Timnit.” AI researchers outside Google have publicly castigated the company’s treatment of Gebru.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
We have identified a wide variety of costs and risks associated with the rush for ever larger LMs, including:
environmental costs (borne typically by those not benefiting from the resulting technology);
financial costs, which in turn erect barriers to entry, limiting who can contribute to this research area and which languages can benefit from the most advanced techniques;
opportunity cost, as researchers pour effort away from directions requiring less resources; and the
risk of substantial harms, including stereotyping, denigration, increases in extremist ideology, and wrongful arrest, should humans encounter seemingly coherent LM output and take it for the words of some person or organization who has accountability for what is said.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
When we perform risk/benefit analyses of language technology, we must keep in mind how the risks and benefits are distributed, because they do not accrue to the same people. On the one hand, it is well documented in the literature on environmental racism that the negative effects of climate change are reaching and impacting the world’s most marginalized communities first [1, 27].
Is it fair or just to ask, for example, that the residents of the Maldives (likely to be underwater by 2100 [6]) or the 800,000 people in Sudan affected by drastic floods pay the environmental price of training and deploying ever larger English LMs, when similar large-scale models aren’t being produced for Dhivehi or Sudanese Arabic?
While the average human is responsible for an estimated 5t CO2 per year, the authors trained a Transformer (big) model [136] with neural architecture search and estimated that the training procedure emitted 284t of CO2.
[...]
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
4.1 Size Doesn’t Guarantee Diversity The Internet is a large and diverse virtual space, and accordingly, it is easy to imagine that very large datasets, such as Common Crawl (“petabytes of data collected over 8 years of web crawling”, a filtered version of which is included in the GPT-3 training data) must therefore be broadly representative of the ways in which different people view the world. However, on closer examination, we find that there are several factors which narrow Internet participation [...]
Starting with who is contributing to these Internet text collections, we see that Internet access itself is not evenly distributed, resulting in Internet data overrepresenting younger users and those from developed countries [100, 143]. However, it’s not just the Internet as a whole that is in question, but rather specific subsamples of it. For instance, GPT-2’s training data is sourced by scraping outbound links from Reddit, and Pew Internet Research’s 2016 survey reveals 67% of Reddit users in the United States are men, and 64% between ages 18 and 29. Similarly, recent surveys of Wikipedians find that only 8.8–15% are women or girls [9].
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
4.3 Encoding Bias It is well established by now that large LMs exhibit various kinds of bias, including stereotypical associations [11, 12, 69, 119, 156, 157], or negative sentiment towards specific groups [61]. Furthermore, we see the effects of intersectionality [34], where BERT, ELMo, GPT and GPT-2 encode more bias against identities marginalized along more than one dimension than would be expected based on just the combination of the bias along each of the axes [54, 132].
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, Shmargaret Shmitchell
The ersatz fluency and coherence of LMs raises several risks, precisely because humans are prepared to interpret strings belonging to languages they speak as meaningful and corresponding to the communicative intent of some individual or group of individuals who have accountability for what is said.
Encoder + Decoder architecture
Attention mechanism
Multi-headed attention
Attention is all you need: transformer model
transformer generalized architecture elements
attention
3
v1 | v2 | v3 | v4 | |
---|---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 | 0.1 |
k4 | 0.6 | 0.9 | 0.1 | 0. |
attention mechanism:
a way to relate elements of the time series with each other
The cat that ate
was full and happy
embedding
4
Word tokenization
Word tokenization and embedding
lemmatization/stemming : reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
am, are, is --> be
dog, dogs, dog's --> dog
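A toy sketch of tokenization followed by lemmatization. The lookup table `LEMMAS` is hypothetical and covers only the examples above; real lemmatizers rely on full vocabularies and part-of-speech information.

```python
import re

# Hypothetical lookup table covering only the examples above
LEMMAS = {"am": "be", "are": "be", "is": "be", "dogs": "dog", "dog's": "dog"}

def tokenize(text):
    """Lowercase the text and split it into word tokens (apostrophes are kept)."""
    return re.findall(r"[a-z']+", text.lower())

def lemmatize(tokens):
    """Map each token to its base form when the table knows it."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

tokens = tokenize("The dogs are happy, the dog's bowl is full")
lemmas = lemmatize(tokens)
print(tokens)
print(lemmas)
```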
Word tokenization and embedding and contextualizing
vector embedding (768)
<by> · <river> -> small
<river> · <bank> -> large
near orthogonal embedding, low similarity vectors,
no strong relation between tokens
by
river
near parallel embedding, high similarity vectors,
strong relation between tokens
bank
river
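A sketch of the similarity idea with made-up vectors. The 4-dimensional embeddings below are hypothetical (the slides refer to 768-dimensional ones); they are chosen so that "by" and "river" are orthogonal while "river" and "bank" are nearly parallel.

```python
import numpy as np

def cosine_similarity(a, b):
    """Dot product normalized by the vector lengths: 1 = parallel, 0 = orthogonal."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical 4-dimensional embeddings, for illustration only
embedding = {
    "by":    np.array([1.0, 0.0, 0.1, 0.0]),
    "river": np.array([0.0, 1.0, 0.0, 0.8]),
    "bank":  np.array([0.1, 0.9, 0.0, 1.0]),
}

sim_by_river = cosine_similarity(embedding["by"], embedding["river"])
sim_river_bank = cosine_similarity(embedding["river"], embedding["bank"])
print(sim_by_river, sim_river_bank)
```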
v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
fully autoregressive model
Attention is all you need (2017)
Encoder + Decoder architecture
attention:
v1 | v2 | v3 | |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
pairs of inputs (queries) and outputs (values - keys) are linked by weights (the "attention" W)
1238 913 12
W
39
5
903
The key/value/query concept is analogous to retrieval systems.
project embedding into Key - Value - Query
lower dimensional representations
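A minimal sketch of scaled dot-product attention as defined in "Attention is all you need": the embeddings X are projected into queries, keys, and values, and the output is softmax(Q Kᵀ / √d_k) V. The sizes and the random projection matrices are assumptions for illustration; in a real model the projections are learned.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_q, W_k, W_v):
    """Project X into Q, K, V and return the attended values and the weights."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity of every query with every key
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 4, 8, 3             # hypothetical sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

output, weights = attention(X, W_q, W_k, W_v)
print(weights.round(2))
```

The `weights` matrix plays the role of the key/value tables shown on the slides: one row per query, one column per key.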
Attention is all you need (2017)
Encoder + Decoder architecture
different elements of the sentence relate to input elements in multiple ways
Multi-headed attention:
v1 | v2 | v3 | |
---|---|---|---|
k1 | 0.1 | 0.1 | 0.1 |
k2 | 0.9 | 0.3 | 0.1 |
k3 | 0.2 | 0.1 | 0.2 |
[Figure: three attention heads with weights W1, W2, W3; each head is drawn with the input 1238 913 12, an attention table like the one above, and the output 39 5 903]
Attention is all you need (2017)
Encoder + Decoder architecture
The key/value/query concept is analogous to retrieval systems.
Multi-headed Self attention:
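A sketch of multi-head self-attention, assuming random (untrained) projection matrices and hypothetical sizes: each head has its own query, key, and value projections, the heads run independently on the same input, and their outputs are concatenated and mixed by an output matrix `W_o`.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                    # each head attends in its own way
    return np.concatenate(outputs, axis=-1) @ W_o      # merge the heads

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 12, 3                  # hypothetical sizes
d_head = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [
    tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
    for _ in range(n_heads)
]
W_o = rng.normal(size=(n_heads * d_head, d_model))

Y = multi_head_self_attention(X, heads, W_o)
print(Y.shape)
```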
with help from Dr. Pavlos Protopapas
- keep track of the entire text unit context
- understand word in context
What do we want from a language model?
"I took a walk on the river bank"
"I went to the bank to deposit a check"
0.2 | 0.6 | 0.5|...| 0.1
tokenization
- keep track of the entire text unit context
- understand word in context
What do we want from a language model?
We want to respect the sequential nature of language -> MLP cannot do this
Need long context -> MLP / RNN / LSTM cannot do this
Capture context-dependent semantics -> tokenization (word2vec) only captures the word itself
We need an architecture that can be trained in parallel (non-Markovian property) -> MLP / RNN / LSTM cannot do this
MOTIVATION FOR ATTENTION
What do we want from a time series analysis model?
- recognize patterns at any time lag
- recognize that patterns can relate to each other differently (seasonality, trends, stochastic events)
"on the river bank"
"on the river bank"
If these coefficients represent the relative importance of the words in the meaning of the sentence, the coefficient relating river and bank should be high!
Word tokenization and embedding and contextualizing
vector embedding (768)
<by> · <river> -> small
near orthogonal embedding, low "similarity" vectors,
no strong relation between tokens
by
river
Word tokenization and embedding and contextualizing
vector embedding (768)
<bank> · <river> -> <high>
near parallel embedding, high "similarity" vectors,
strong relation between tokens
bank
river
..... linear algebra is not machine learning unless I am learning some parameters!
But context ALSO has to do with POSITION!
Willow talked to Federica
vs
Federica talked to Willow
X = [0.0,1.3,2.1,1.0,5.0]
P = [0.0,1.1,2.3,0.9,5.0]
W1 = [0.2,0.3,0.1,0.0,6.0]
X = [5.0,1.3,2.1,1.0,0.0]
P = [5.0,1.1,2.3,0.9,0.0]
W1 = [0.2,0.3,0.1,0.0,6.0]
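One reading of the example above is that, with fixed weights, reversing the input simply reverses the output: the model has no notion of position. The sketch below checks this property for self-attention without positional encoding, using random (untrained) weights and hypothetical sizes.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))                       # 5 tokens, hypothetical 4-dim embeddings
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))

out = self_attention(X, W_q, W_k, W_v)
out_reversed = self_attention(X[::-1], W_q, W_k, W_v)

# Reversing the token order only reverses the rows of the output
order_is_ignored = np.allclose(out_reversed, out[::-1])
print(order_is_ignored)
```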
positional encoding
5
https://arxiv.org/pdf/2308.06404
Word tokenization and embedding and contextualizing
vector embedding (768)
Willow talked to Fed
near parallel embedding, high "similarity" vectors,
strong relation between tokens
talked
Willow
Willow
talked
to
Fed
Word tokenization and embedding and contextualizing
vector embedding (768)
Willow talked to Fed
near parallel embedding, high "similarity" vectors,
strong relation between tokens
P+talked
position embedding (768)
P+Willow
Willow
talked
to
Fed
Word tokenization and embedding and contextualizing
vector embedding (768)
Fed talked to Willow
near parallel embedding, high "similarity" vectors,
strong relation between tokens
P+ talked
position embedding (768)
P+Willow
vector embedding (768)
position embedding (768)
talked
to
Fed
Willow
"on the river bank"
POSITIONAL ENCODING
Attention is all you need
Encoder + Decoder architecture
Encodes the past
transformer model
6
Encoder + Decoder architecture
Encodes the past
Turns out attention is not really _all_ you need...
so far we are working with a "bag of words": the order of words is not known to the model
Attention is all you need (2017)
Encoder + Decoder architecture
positional encoding
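A sketch of the sinusoidal positional encoding proposed in "Attention is all you need": PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The function name is hypothetical and the code assumes an even embedding size; the encoding is added to the token embedding so that the same word at different positions gets a different vector.

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal position vectors, one row per position (assumes even d_model)."""
    positions = np.arange(n_positions)[:, None]
    even_dims = np.arange(0, d_model, 2)[None, :]        # the indices 2i
    angles = positions / 10000 ** (even_dims / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                         # even dimensions
    pe[:, 1::2] = np.cos(angles)                         # odd dimensions
    return pe

# Four positions, e.g. for "Willow talked to Fed", with 768-dimensional embeddings
pe = positional_encoding(n_positions=4, d_model=768)
print(pe.shape)
```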
Attention is all you need (2017)
Attention is all you need
Encoder + Decoder architecture
Encodes the past
Encoder + Decoder architecture
decodes the past and predicts the future
MHA acting on encoder (1)
Attention is all you need (2017)
v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
Encoder + Decoder architecture
decodes the past and predicts the future
a stack of N = 6 identical layers each with
(1) a masked multi-head self-attention mechanism acting on the previous decoder output,
(2) a multi-head attention mechanism acting on the encoder output,
(3) a positionwise fully connected feed-forward NN
MHA acting on encoder (1)
Attention is all you need (2017)
v1 | v2 | v3 | v4 | v5 | v6 | v7 | v8 | |
---|---|---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.7 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 | 0.1 | 0.4 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 | 0.1 | 0.1 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 | 0.1 | 0.5 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 | 0.1 | 0.3 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 | 0.1 | 0.9 |
The cat that ate
was
full
The cat that ate was full and happy
masking dependence on the future
Encoder + Decoder architecture
decodes the past and predicts the future
a stack of N = 6 identical layers each with
(1) a masked multi-head self-attention mechanism acting on the previous decoder output,
(2) a multi-head attention mechanism acting on the encoder output,
(3) a positionwise fully connected feed-forward NN
MHA acting on decoder (2)
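A sketch of masking the dependence on the future: before the softmax, every score that links a position to a later position is set to minus infinity, so its attention weight becomes exactly zero. The random scores and the function name are assumptions for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention_weights(scores):
    """Attention weights where position i can only attend to positions <= i."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
    masked = np.where(future, -np.inf, scores)           # forbid looking ahead
    return softmax(masked, axis=-1)

rng = np.random.default_rng(0)
scores = rng.normal(size=(6, 6))    # hypothetical query-key scores for 6 tokens
weights = causal_attention_weights(scores)
print(weights.round(2))
```

The result is lower triangular: the first token attends only to itself, the last token to the whole sequence.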
Attention is all you need (2017)
Attention is all you need (2017)
Encoder + Decoder architecture
Input
Embedding
Positional encoding
Encoder attention
Output
Embedding
Positional encoding
Decoder attention
Encoder-Decoder attention
Feed Forward
Linear
Softmax
Attention is all you need (2017)
Attention is all you need
Encoder + Decoder architecture
Encodes the past
Encodes the past
Decodes the past predicts the future
Encoder + Decoder architecture
v1 | v2 | v3 | v4 | v5 | v6 | |
---|---|---|---|---|---|---|
k1 | 1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
k2 | 0.2 | 1 | 0.1 | 0.6 | 0.8 | 0.2 |
k3 | 0.1 | 0.1 | 1 | 0.2 | 0.1 | 0.2 |
k4 | 0.6 | 0.7 | 0.1 | 1 | 0.5 | 0.9 |
k5 | 0.1 | 0.9 | 0.1 | 0.3 | 1 | 0.1 |
k6 | 0.1 | 0.5 | 0.3 | 0.7 | 0.3 | 1 |
The cat that ate
was
full
The cat that ate was full
encodes the past
a stack of N = 6 identical layers each with
(1) a multi-head self-attention mechanism,
(2) a positionwise fully connected feed-forward NN
A video on transformers that I think is really good!
https://www.youtube.com/watch?v=4Bdc55j80l8
A video on attention (with a different accent than the one I have subjected you to all this time!)
https://www.youtube.com/watch?v=-9vVhYEXeyQ
Tutorial