Benjamin Akera
Learning how machines learn, and learning along the way
Who Am I?
McGill University | Mila - Quebec AI Inst.
Sunbird AI
IBM Research Africa - Kenya
DSA Since 2017
What is it?
Who has used it?
A few examples?
Why I find it exciting
Not a complete introduction to NLP: it does not cover all models, aspects, papers, or architectures
Not preaching; this should be interactive: ask questions
An overview of recent, simple architectural tricks that have shaped NLP as we know it today, with deep learning
Interactive
Ask as many questions as you like
Jina yako ni nani? (What is your name?)
Unaitwa nani? (What are you called?)
How???
Attention
Attention: What is it?
Relates elements in a source sentence to a target sentence
Source Sentence
Target Sentence
[Diagram: attention links between the characters of the source "Habari" and the target "Hello".]
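A minimal sketch of this idea in Python/NumPy (a toy illustration with made-up array names, not any particular library's API): each target element scores every source element, the scores become weights via a softmax, and the output is a weighted mix of the source representations.

    import numpy as np

    def attention(queries, keys, values):
        """Scaled dot-product attention: each query (target element) mixes the
        values (source elements), weighted by how well it matches each key."""
        d_k = queries.shape[-1]
        scores = queries @ keys.T / np.sqrt(d_k)                 # (T_target, T_source)
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over the source
        return weights @ values                                  # (T_target, d_v)

    # Toy example: 5 target positions ("Hello") attending over 6 source positions ("Habari").
    rng = np.random.default_rng(0)
    target = rng.normal(size=(5, 8))
    source = rng.normal(size=(6, 8))
    print(attention(target, source, source).shape)  # (5, 8)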
BERT
T5
ELMo
GPT-3
Neural MT
DALL·E
Sunbird Translate
FB Translate
Multimodal learning
Speech Recognition
Self Attention
Multi-Head attention
Self-attention is when your source and target sentences are the same
[Diagram: self-attention over the characters of "Habari gani?", each character attending to the others in the same sentence.]
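A minimal sketch of self-attention, assuming toy character embeddings (the shapes and names are illustrative): the same sequence supplies the queries, keys, and values.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def self_attention(x):
        """Self-attention: the sequence plays both the source and target roles."""
        scores = x @ x.T / np.sqrt(x.shape[-1])  # every position scores every other
        return softmax(scores) @ x

    # e.g. toy embeddings for the 12 characters of "Habari gani?"
    x = np.random.default_rng(1).normal(size=(12, 8))
    print(self_attention(x).shape)  # (12, 8)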
Compute K attention heads in parallel
This allows the model to capture more than one relation
Attention Head 1
Attention Head 2
Attention Head 3
Attention Head 4
1. Habari gani? 2. Jina yako ni nani? 3. Una toka wapi? 4. Unaitwa nani?
1. How are you? 2. What is your name? 3. Where do you come from? 4. What is your name?
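A minimal sketch of K attention heads computed in parallel (real implementations add learned projection matrices per head, which are omitted here): the feature dimension is split into K chunks, each chunk attends independently, and the results are concatenated.

    import numpy as np

    def softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def multi_head_self_attention(x, num_heads):
        """Run num_heads self-attentions in parallel on feature chunks, then
        concatenate; each head is free to capture a different relation."""
        heads = []
        for h in np.split(x, num_heads, axis=-1):  # (T, d/num_heads) per head
            scores = h @ h.T / np.sqrt(h.shape[-1])
            heads.append(softmax(scores) @ h)
        return np.concatenate(heads, axis=-1)      # back to (T, d)

    x = np.random.default_rng(2).normal(size=(12, 16))
    print(multi_head_self_attention(x, num_heads=4).shape)  # (12, 16)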
Transformers
What are they?
Sequence
Attention
Positional Encoding
Decoders
1. Self-Attention + Multi-Head Attention
Positional Encoding
When we compare two elements of a sequence together (like in attention), we don't have a notion of how far apart they are or where one is relative to the other.
\( PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) \)
\( PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}}) \)
Using a linear combination of these signals, we can "pan" forwards or backwards in the sequence.
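A small sketch of these sinusoidal signals, following the usual convention that even feature indices get the sine and odd indices the cosine (the function name is illustrative):

    import numpy as np

    def positional_encoding(max_len, d_model):
        """PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
           PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
        pos = np.arange(max_len)[:, None]          # (max_len, 1)
        two_i = np.arange(0, d_model, 2)[None, :]  # even feature indices 2i
        angles = pos / np.power(10000.0, two_i / d_model)
        pe = np.zeros((max_len, d_model))
        pe[:, 0::2] = np.sin(angles)
        pe[:, 1::2] = np.cos(angles)
        return pe

    pe = positional_encoding(max_len=50, d_model=16)
    print(pe.shape)  # (50, 16), added to the token embeddings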
Decoder
Sequence generation
Learn a map from input sequence to output sequence:
\( y_0, \ldots, y_T = f(x_0, \ldots, x_N) \)
In reality, this tends to look like:
\( \hat y_0, \ldots, \hat y_T = decoder(encoder(x_0, \ldots, x_N)) \)
Autoregressive Decoding
Condition each output on all previously generated outputs
\( \hat y_0 = decoder(encoder(x_0, \ldots, x_N)) \)
\( \hat y_1 = decoder(\hat y_0, encoder(x_0, \ldots, x_N)) \)
\( \hat y_2 = decoder(\hat y_0, \hat y_1, encoder(x_0, \ldots, x_N)) \)
\( \vdots \)
\( \hat y_{t+1} = decoder(\hat y_0, \hat y_1, \ldots, \hat y_t, encoder(x_0, \ldots, x_N)) \)
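A sketch of what this loop looks like at inference time; the encoder, decoder, and token values below are placeholders rather than a specific library API.

    def autoregressive_decode(encoder, decoder, source_tokens,
                              start_token, end_token, max_len=50):
        """Greedy autoregressive decoding: each new output is conditioned on
        the encoded source and on everything generated so far."""
        memory = encoder(source_tokens)
        outputs = [start_token]
        for _ in range(max_len):
            next_token = decoder(outputs, memory)  # predicts y_hat at the next step
            outputs.append(next_token)
            if next_token == end_token:
                break
        return outputs[1:]

    # Toy usage with stand-in encoder/decoder functions:
    toy_encoder = lambda src: sum(src)
    toy_decoder = lambda outs, mem: min(len(outs), 3)
    print(autoregressive_decode(toy_encoder, toy_decoder, [1, 2, 3],
                                start_token=0, end_token=3))  # [1, 2, 3]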
Autoregressive Decoding
At train time, we have access to all of the true target outputs:
\( \hat y_0 = decoder(encoder(x_0, \ldots, x_N)) \)
\( \hat y_1 = decoder(y_0, encoder(x_0, \ldots, x_N)) \)
\( \vdots \)
\( \hat y_{t+1} = decoder(y_0, y_1, \ldots, y_t, encoder(x_0, \ldots, x_N)) \)
Question: How do we squeeze all of these into one call of our decoder?
If we feed all of our inputs and targets at once, it is easy to cheat using attention
<pad> What is your name?
<pad> Jina lako ni nani?
[Diagram: attention matrix with the tokens "Jina lako ni nani ?" on one axis and "what is your name ? <PAD>" on the other.]
[Diagram: the same attention matrix with a triangular mask blocking attention to future positions.]
When we apply this triangular mask, the attention can only look into the past instead of being able to cheat and look into the future.
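A sketch of that triangular (causal) mask, assuming a square matrix of attention scores: positions above the diagonal, i.e. the future, are set to -inf before the softmax so they receive zero weight.

    import numpy as np

    def apply_causal_mask(scores):
        """Position t may only attend to positions <= t."""
        T = scores.shape[-1]
        future = np.triu(np.ones((T, T), dtype=bool), k=1)  # strictly above the diagonal
        return np.where(future, -np.inf, scores)

    scores = np.random.default_rng(3).normal(size=(5, 5))
    print(apply_causal_mask(scores))  # -inf above the diagonal -> zero weight after softmax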
With this mask in place, a single call to the decoder produces all of the outputs \( \hat y_0, \ldots, \hat y_T \) at once, instead of one call per step.
A related trick: instead of masking the future, we can mask random tokens in the input and train the model to predict them.
Jina yako ni [MASK]?
[MASK] is your name?
We call these Masked Language Models.
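A rough sketch of this kind of input masking (the 15% masking rate and the [MASK] string follow the common BERT-style convention; everything else here is illustrative):

    import random

    def mask_tokens(tokens, mask_token="[MASK]", mask_prob=0.15, seed=0):
        """Randomly hide tokens; the model is trained to predict the originals
        at the masked positions."""
        rng = random.Random(seed)
        masked, targets = [], []
        for tok in tokens:
            if rng.random() < mask_prob:
                masked.append(mask_token)
                targets.append(tok)    # predict this token here
            else:
                masked.append(tok)
                targets.append(None)   # no loss at unmasked positions
        return masked, targets

    print(mask_tokens("jina yako ni nani ?".split(), mask_prob=0.3))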
Applications, Labs, Questions
Neural Machine Translation
Text Generation
Image Classification
Multi-modal learning
....
Attention
Transformers
Encoder Block
Decoder Block
Positional Encoding
End to End Neural Machine Translation with Attention
Take-home: reproduce a model using the SALT dataset
By Benjamin Akera
A brief introduction to attention mechanisms in deep learning