Who am I?
McGill University | Mila - Quebec AI Inst.
Sunbird AI
IBM Research Africa - Kenya
What is it?
Who has used it?
A few examples?
Why I find it exciting
Not a complete introduction to NLP
Does not cover all models, aspects, papers, architectures
Not preaching; this should be interactive: ask questions
An overview of recent simple architectural tricks that have influenced NLP as we know it now, with Deep Learning
Interactive
Ask as many questions as you like
Erinya lyo ggw'ani? (What is your name?)
Bakuyita batya? (What do they call you?)
How???
Attention
Attention: What is it?
Relates elements in a source sentence to elements in a target sentence
[Diagram: attention links between the source sentence "Hello" and the target sentence "Gyebaleko"]
BERT
T5
ELMo
GPT-3
Neural MT
DALL·E
Sunbird Translate
FB Translate
Multimodal learning
Speech Recognition
Self Attention
Multi-Head Attention
Vaswani et al. (2017) describe attention functions as "mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors".
\( Attention(Q, K, V) = softmax(\frac{QK^T}{\sqrt{d_k}})V\)
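To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the shapes and the toy inputs are purely illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns an (n_queries, d_v) matrix of attention-weighted values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted sum of the values

# Toy example: 3 query vectors attending over 4 key/value vectors.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 8)
```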
Self-attention is when your source and target sentences are the same
[Diagram: self-attention links among the characters of the sentence "Gyebaleko nyabo"]
In Vaswani et al., the authors first apply a linear transformation to the input matrices Q, K and V, and then perform attention, i.e. they compute:
\( Attention(QW^Q, KW^K, VW^V) \)
where \( W^Q \), \( W^K \) and \( W^V \) are learnt parameters.
In multi-head attention, say with #heads = 4, the authors apply a different linear transformation for each head and perform attention 4 times, once per head.
Compute the K attention heads in parallel
Allows the model to capture more than one relation (see the code sketch after the example below)
[Diagram: four attention heads, one per sentence pair below, each capturing a different relation]
1. Oli otya? → How are you?
2. Erinya lyo lyanni? → What is your name?
3. Ova wa? → Where do you come from?
4. Bakuyita batya? → What do they call you?
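A rough NumPy sketch of multi-head attention as described above, reusing scaled_dot_product_attention from the earlier sketch; the head count, dimensions and weight values are illustrative assumptions, not the exact Vaswani et al. implementation.

```python
import numpy as np
# Reuses scaled_dot_product_attention from the sketch above.

def multi_head_attention(x_q, x_k, x_v, W_q, W_k, W_v, W_o, num_heads):
    """Split the learnt projections across `num_heads` heads, attend once per head,
    then concatenate the heads and mix them with the output projection W_o."""
    d_model = x_q.shape[-1]
    d_head = d_model // num_heads
    heads = []
    for h in range(num_heads):
        cols = slice(h * d_head, (h + 1) * d_head)
        # Each head gets its own slice of W_q, W_k, W_v, so it can learn
        # a different relation between the two sequences.
        q = x_q @ W_q[:, cols]
        k = x_k @ W_k[:, cols]
        v = x_v @ W_v[:, cols]
        heads.append(scaled_dot_product_attention(q, k, v))
    return np.concatenate(heads, axis=-1) @ W_o

# Self-attention over 5 tokens with d_model = 16 and 4 heads.
rng = np.random.default_rng(1)
x = rng.normal(size=(5, 16))
W_q, W_k, W_v, W_o = (rng.normal(size=(16, 16)) * 0.1 for _ in range(4))
print(multi_head_attention(x, x, x, W_q, W_k, W_v, W_o, num_heads=4).shape)  # (5, 16)
```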
Transformers
What are they?
Sequence
Attention
Positional Encoding
Decoders
1. Self Attention + Multi-Head attention
Positional Encoding
When we compare two elements of a sequence together (like in attention), we don't have a notion of how far apart they are or where one is relative to the other.
\( PE_{(pos, 2i)} = \sin(pos/10000^{2i/d_{model}}) \)
\( PE_{(pos, 2i+1)} = \cos(pos/10000^{2i/d_{model}}) \)
Using a linear combination of these signals we can “pan” forwards or backwards in the sequence.
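A small NumPy sketch of these sinusoidal positional encodings; max_len and d_model are just example values.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]            # (max_len, 1) positions
    two_i = np.arange(0, d_model, 2)[None, :]    # even dimension indices 0, 2, 4, ...
    angle = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                  # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                  # odd dimensions: cosine
    return pe

# Example: encodings for 50 positions with a model dimension of 16.
pe = positional_encoding(max_len=50, d_model=16)
print(pe.shape)   # (50, 16) -- added to the token embeddings before the first layer
```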
Decoder
Sequence generation
Learn a map from input sequence to output sequence:
\( y_0, \ldots, y_T = f(x_0, \ldots, x_N) \)
In reality, it tends to look like this:
\( \hat y_0, \ldots, \hat y_T = decoder(encoder(x_0, \ldots, x_N)) \)
Autoregressive Decoding
Condition each output on all previously generated outputs
\( \hat y_0 = decoder(encoder(x_0, \ldots, x_N)) \)
\( \hat y_1 = decoder(\hat y_0, encoder(x_0, \ldots, x_N)) \)
\( \hat y_2 = decoder(\hat y_0, \hat y_1, encoder(x_0, \ldots, x_N)) \)
\( \vdots \)
\( \hat y_{t+1} = decoder(\hat y_0, \hat y_1, \ldots, \hat y_t, encoder(x_0, \ldots, x_N)) \)
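A minimal sketch of this decoding loop; `encoder`, `decoder`, `bos_id` and `eos_id` are placeholders for whatever trained model and vocabulary you use, and the toy stand-ins below exist only so the snippet runs.

```python
import numpy as np

def greedy_decode(encoder, decoder, x, bos_id, eos_id, max_len=20):
    """Autoregressive decoding: each new token is conditioned on all previously
    generated tokens plus the encoded source, as in the equations above.

    encoder(x)              -> memory
    decoder(prefix, memory) -> scores over the vocabulary for the next token
    """
    memory = encoder(x)          # encode the source once
    y_hat = [bos_id]             # start-of-sequence token
    for _ in range(max_len):
        scores = decoder(y_hat, memory)
        next_token = int(np.argmax(scores))   # greedy: take the most likely token
        y_hat.append(next_token)
        if next_token == eos_id:              # stop at end-of-sequence
            break
    return y_hat[1:]

# Toy stand-ins (a random "model") just so the sketch runs end to end.
rng = np.random.default_rng(2)
toy_encoder = lambda x: np.asarray(x)
toy_decoder = lambda prefix, memory: rng.normal(size=10)   # 10-word toy vocabulary
print(greedy_decode(toy_encoder, toy_decoder, x=[3, 1, 4], bos_id=0, eos_id=9))
```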
Autoregressive Decoding
At train time, we have access to all the true target outputs, so we can condition on those instead:
\( \hat y_0 = decoder(encoder(x_0, \ldots, x_N)) \)
\( \hat y_1 = decoder(y_0, encoder(x_0, \ldots, x_N)) \)
\( \vdots \)
\( \hat y_{t+1} = decoder(y_0, y_1, \ldots, y_t, encoder(x_0, \ldots, x_N)) \)
Question: How do we squeeze all of these into one call of our decoder?
If we feed in all our inputs and targets at once, it is easy for the model to cheat using attention
[Diagram: attention matrix over the paired sequences "<pad> What is your name?" and "<pad> Erinya lyo ggwe ani?"]
[Diagram: the same attention matrix with a triangular (causal) mask applied]
When we apply this triangular mask, the attention can only look into the past instead of being able to cheat and look into the future.
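A small NumPy sketch of such a triangular (causal) mask and how it is applied before the softmax; the scores here are random and purely illustrative.

```python
import numpy as np

def causal_mask(T):
    """Lower-triangular mask: position t may only attend to positions 0..t."""
    return np.tril(np.ones((T, T), dtype=bool))

def masked_softmax(scores, mask):
    """Set masked (future) positions to a large negative number before the softmax,
    so they receive effectively zero attention weight."""
    scores = np.where(mask, scores, -1e9)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

T = 5
scores = np.random.default_rng(3).normal(size=(T, T))
weights = masked_softmax(scores, causal_mask(T))
print(np.round(weights, 2))   # upper-triangular entries are zero: no peeking at the future
```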
If we feed in all our inputs and targets, it is easy to cheat using attention, so another option is to mask out some of the tokens and train the model to predict only those:
Erinya lyo gwe [MASK]?
[MASK] is your name?
We call these masked language models (as used in BERT).
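A minimal sketch of the masking step, assuming a toy vocabulary of integer token ids and a made-up id for [MASK]; real masked language models such as BERT use a more elaborate corruption scheme.

```python
import numpy as np

def mask_tokens(token_ids, mask_id, mask_prob=0.15, seed=0):
    """Randomly replace a fraction of the tokens with [MASK]; the model is then
    trained to recover the original tokens at exactly those positions."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    to_mask = rng.random(token_ids.shape) < mask_prob   # which positions to hide
    masked_input = np.where(to_mask, mask_id, token_ids)
    return masked_input, to_mask                        # model input, prediction targets

# Toy token ids for a short sentence; 103 is a made-up id for [MASK].
tokens = [12, 7, 55, 3, 28]
masked_input, targets = mask_tokens(tokens, mask_id=103, mask_prob=0.3)
print(masked_input, targets)
```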
Applications, Labs, Questions
Neural Machine Translation
Text Generation
Image Classification
Multi-modal learning
....
Attention
Transformers
Encoder Block
Decoder Block
Positional Encoding
End-to-End Neural Machine Translation with Attention
Take Home: Reproduce a model using the SALT dataset
https://t.ly/O43KV