Part 1: Background and The Transformer
Hamish Dickson - Oct 2024
Some code
Basically theory
Prompts and so on
What's the German for cake?
Was heißt auf Deutsch Kuchen?
???
First attempts were n-gram-to-n-gram translations
What's the German for cake?
Was heißt auf Deutsch Kuchen?
This doesn't work well for lots of reasons: even if you could build these mappings, you lack context
What's the German for cake?
Was heißt auf Deutsch Kuchen?
???
Google Translate famously used seq2seq in 2014
The idea (i.e. GenAI):
inputs → Encoder → Decoder → outputs
This encoder-decoder setup turns out to be really useful
Question: how do we go from words to something an ML model can use?
Our models operate on tensors, so how do we turn words into tensors?
Until 2017 we mostly used word2vec and GloVe; the idea is:
YES! Very well
This is how Google Translate worked for about 4 years (albeit with tricks: bidirectional models and stacked layers)
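To make the shape of this concrete, here is a minimal sketch of an RNN encoder-decoder in PyTorch. The GRUs and all the sizes are illustrative, not what Google Translate actually ran: the encoder squeezes the input sentence into a single context vector and the decoder generates the output conditioned on it.

import torch
import torch.nn as nn

emb_dim, hidden, src_vocab, tgt_vocab = 128, 256, 10_000, 10_000  # illustrative sizes

src_embed = nn.Embedding(src_vocab, emb_dim)    # input word vectors
tgt_embed = nn.Embedding(tgt_vocab, emb_dim)    # output word vectors
encoder = nn.GRU(emb_dim, hidden, batch_first=True)
decoder = nn.GRU(emb_dim, hidden, batch_first=True)
to_vocab = nn.Linear(hidden, tgt_vocab)

src_ids = torch.randint(0, src_vocab, (1, 7))   # "What's the German for cake?" as token ids
_, context = encoder(src_embed(src_ids))        # whole sentence squeezed into one context vector

tgt_ids = torch.randint(0, tgt_vocab, (1, 3))   # the output generated so far
out, _ = decoder(tgt_embed(tgt_ids), context)   # decode conditioned on that context
next_token_logits = to_vocab(out[:, -1])        # distribution over the next output word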
But there are some issues
important modelling trick
Note, if you have a problem where:
An RNN can still easily beat a transformer
There is a huge problem with this design: it doesn't scale
This isn't completely true; there are tricks around it
Question: why would we want "to scale"?
Pretraining for fine tuning
Introduced lots of ideas
Part 1: Embeddings
import transformers
# load BERT's subword tokenizer and see how it splits a sentence into tokens
tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
tokenizer.convert_ids_to_tokens(tokenizer("let's talk about bpe")['input_ids'])
['[CLS]', 'let', "'", 's', 'talk', 'about', 'bp', '##e', '[SEP]']
Notice we split words out into sub-parts
This gives us huge coverage with fewer tokens
Token | Vector
---|---
"let" | [0.1, -0.8, ...]
"the" | [0.9, 0.85, ...]
The model then learns these vectors in a lookup table during training
These are called "embeddings"
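Concretely, the lookup table is just a big matrix with one learnable row per token id. A minimal sketch in PyTorch; the vocab size and width below match bert-base-uncased, the token ids are only illustrative.

import torch
import torch.nn as nn

vocab_size, hidden_dim = 30522, 768                  # bert-base-uncased vocab size and embedding width
embedding = nn.Embedding(vocab_size, hidden_dim)     # the learnable lookup table

token_ids = torch.tensor([[101, 2292, 1005, 1055]])  # illustrative token ids
vectors = embedding(token_ids)                       # one 768-dim vector per token, shape (1, 4, 768)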
Part 1: Embeddings
Some tips:
Part 2: Attention
The idea behind attention is that each token keeps a mapping of the "important" tokens around it
So in the classic example "The animal didn't cross the street because it was too tired", the attention mechanism learns that "it" has a strong connection with "The animal"
Part 2: Attention
For each token in the sequence we create three vectors:
Query (Q): a learnable representation of the token asking "what other tokens are important to me?"
Key (K): a learnable way of describing the token for other tokens to use
Value (V): the content the token passes along, a sort of scaling factor for our task, e.g. is the token a noun or not
Part 2: Attention
Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k) · V
Q·Kᵀ: dot product between Q and K to score how relevant each token is to every other token
√d_k: a normalisation factor, mostly to do with keeping things stable when we start training the model
V: our values, scaled by those attention weights
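A minimal sketch of a single attention head in PyTorch, assuming the Q/K/V projections from the previous slide are plain linear layers (which is how the original paper does it); the sizes are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64                                   # illustrative head size
w_q, w_k, w_v = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)

x = torch.randn(5, d_model)                    # embeddings for 5 tokens
q, k, v = w_q(x), w_k(x), w_v(x)               # per-token query, key and value vectors

scores = q @ k.T / d_model ** 0.5              # dot product between Q and K, plus the norm factor
weights = F.softmax(scores, dim=-1)            # each row: how much that token attends to every other token
out = weights @ v                              # weighted sum of the values, one new vector per token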
Part 2.5: The transformer block
We then do this several times in parallel (the "heads") and follow it with a fully connected (FC) layer
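Putting that together, a rough sketch of one transformer block: multi-head attention followed by a small FC network, each with a residual connection and layer norm. The dimensions are the bert-base ones, and the exact norm placement varies between models.

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # several heads of self-attention, run in parallel
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # the FC layer, again with residual + norm
        return x

block = TransformerBlock()
x = torch.randn(1, 9, 768)                 # one sequence of 9 token embeddings
print(block(x).shape)                      # torch.Size([1, 9, 768])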
Part 3: Encoder (BERT et al)
We're nearly done, don't worry
What's the German for cake?
tokenize
embed
N layers of attention + FC networks
Get an embedding for the sequence
time is always up
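The whole encoder pipeline above in a few lines of transformers code; taking the [CLS] vector as the sequence embedding is just one common (and fairly crude) choice.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("What's the German for cake?", return_tensors="pt")   # tokenize
with torch.no_grad():
    outputs = model(**inputs)                                            # embed + N layers of attention + FC

token_vectors = outputs.last_hidden_state      # one 768-dim vector per token
sequence_embedding = token_vectors[:, 0]       # the [CLS] token as an embedding for the whole sequence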
Part 4: Decoder (GPT et al)
[start token]
embedded context
tokenize
embed
attention block
do this for N layers
FC layer the size of the vocab
Pick the most likely next token from here
NB: GPT drops this encoder context entirely (it's decoder-only)
Part 4: Decoder (GPT et al)
[start token]Was
embedded context
tokenize
embed
attention block
do this for N layers
FC layer the size of the vocab
Pick the most likely next token from here
we literally loop: "Was" was the model's pick, so we append it and run again
Part 4: Decoder (GPT et al)
[start token]Was heißt
embedded context
tokenize
embed
attention block
do this for N layers
FC layer the size of the vocab
Pick the most likely next token from here
same again: "heißt" was picked, so we append it and repeat
Note: this is trained using causal masks rather than the loop, so every next-token prediction is made in parallel
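That loop, written out with a decoder-only model. This is a greedy-decoding sketch with GPT-2 standing in (so no encoder context), not how the original encoder-decoder transformer was wired up.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("What's the German for cake?", return_tensors="pt").input_ids

for _ in range(10):                                # we literally loop
    with torch.no_grad():
        logits = model(input_ids).logits           # final FC layer: one score per vocab entry, per position
    next_id = logits[0, -1].argmax()               # pick the most likely next token
    input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)   # append it and go again

print(tokenizer.decode(input_ids[0]))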
Idea behind the original paper was to:
Next time: