Transformers Lunch & Learn

Part 1: Background and The Transformer

Hamish Dickson - Oct 2024

Over the next 3 weeks

Some code

Basically theory

Prompts n shit

Question: How would you approach translation?

What's the German for cake?

Was heißt auf Deutsch Kuchen?

???


First attempts were ngram-to-ngram translations

What's the German for cake?

Was heißt auf Deutsch Kuchen?

This doesn't work well for lots of reasons: even if you could build these mappings, you lack context

Question: How would you approach translation?


Google Translate famously moved to seq2seq (the architecture came out of Google in 2014; Translate switched over in production in 2016)

The idea is to:

  • have a model "roll over" the input text in order
  • build up a representation of the text as we go
  • "unroll" that representation using a second model into the new language
  • train a new output model for each target language

i.e. GenAI: generate the output one token at a time

Question: How would you approach translation?

[Diagram: the encoder "rolls over" the input tokens x_1, ..., x_N one at a time, updating a hidden state h_i at each step. The final hidden state h_N (this thing is really useful) is handed to the decoder, whose states d_0, ..., d_3 "unroll" it into the output tokens y_0, ..., y_3.]
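As a rough illustration (not the exact Google Translate setup), here is a minimal encoder-decoder sketch in PyTorch: a GRU encoder compresses the input into a final hidden state, and a GRU decoder unrolls that state one output step at a time. All sizes and names are made up for the example.

import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab=1000, tgt_vocab=1000, d=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, d)
        self.tgt_embed = nn.Embedding(tgt_vocab, d)
        self.encoder = nn.GRU(d, d, batch_first=True)
        self.decoder = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # "roll over" the input, building up a representation as we go
        _, h = self.encoder(self.src_embed(src_ids))
        # "unroll" that representation into the target language
        dec_states, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_states)   # one distribution over the target vocab per step

model = TinySeq2Seq()
src = torch.randint(0, 1000, (1, 6))   # a fake tokenised English sentence
tgt = torch.randint(0, 1000, (1, 5))   # the fake German tokens generated so far
print(model(src, tgt).shape)           # torch.Size([1, 5, 1000])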

Question: How would you approach translation?

Missing piece: encoding/decoding text

Question: how do we go from words to something an ML model can use?

 

Our models use tensors, so how do we turn words into tensors?

Until 2017 we mostly used word2vec (w2v) and GloVe; the idea is:

  • train a model so that similar words end up close together in a vector space, and dissimilar words end up roughly perpendicular or pointing in the opposite direction from the origin
  • use this as a lookup table at the start of your own model
  • do lots of preprocessing to crowbar our sentence into the pretrained model (see the sketch below)
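A toy sketch of what that lookup buys us (the vectors here are made up; real w2v/GloVe vectors have 100-300 dimensions):

import numpy as np

# hypothetical pretrained word vectors - a real table would have millions of rows
vectors = {
    "cake":   np.array([0.8, 0.1, 0.3]),
    "pie":    np.array([0.7, 0.2, 0.4]),    # similar meaning -> nearby vector
    "tensor": np.array([-0.5, 0.9, -0.2]),  # unrelated -> points somewhere else
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors["cake"], vectors["pie"]))     # high: the words are "close"
print(cosine(vectors["cake"], vectors["tensor"]))  # low/negative: the words are unrelated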

Missing piece: encoding/decoding text

So does it work?

YES! Very well

This is how Google Translate worked for about four years (albeit with tricks: bidirectional models and stacking)

But there are some issues

  • note that we have baked this sequential bias into how we teach the model
  • the bias helps us learn quickly (we have told the model part of the answer), but it's not quite how language works
  • it is very, very hard to scale up
  • w2v has a similar problem, but worse: it's a whole separate network

important modelling trick

So does it work?

Note, if you have a problem where:

  • You have some sequential problem
  • You don't have BUCKETS of similar data to pretrain with
  • You want something fast (RNNs run in O(N) in the sequence length)

 

An RNN can still easily beat a transformer

Why do we care about all of this?

There is a huge problem with this design: it doesn't scale

  • Remember we roll over each word one at a time, so each step depends on the one before
  • that makes it hard to imagine scaling this over multiple machines
  • and one machine means it takes a very long time to train a model

(This isn't completely true; there are tricks around it)

Question: why would we want "to scale"?

Pretraining for fine-tuning

Attention Is All You Need

  • Google Brain paper from 2017
  • Honestly, the paper is underwhelming
  • On the face of it, just another obscure translation paper (English to German)
  • Cool fact: they made the famous architecture diagram in Google Slides

Attention Is All You Need

Introduced lots of ideas:

  • Ditch w2v; instead "tokenise" the input text (basically an ordinal encoding)
  • Train the token -> vector encoding at the same time as the rest of the model
  • Ditch the sequential encoding (i.e. the "rolling over/out")
  • Instead, provide the whole sequence at once; this is super important

So how does this work?

Part 1: Embeddings

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.convert_ids_to_tokens(tokenizer("let's talk about bpe")['input_ids'])

['[CLS]', 'let', "'", 's', 'talk', 'about', 'bp', '##e', '[SEP]']

Notice we split words out into sub-parts

This lets us have huge coverage with fewer tokens

Token    Vector
"let"    [0.1, -0.8, ...]
"the"    [0.9, 0.85, ...]

The model then learns the vectors in this lookup table during training

 

These are called "embeddings"
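A quick sketch of that lookup in code, using the same bert-base-uncased model as above (the embedding layer really is just a big learned table):

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

ids = tokenizer("let's talk about bpe", return_tensors="pt")["input_ids"]
emb_table = model.get_input_embeddings()   # an nn.Embedding of shape vocab_size x hidden_size
print(emb_table.weight.shape)              # torch.Size([30522, 768])
print(emb_table(ids).shape)                # torch.Size([1, 9, 768]): one vector per token above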

So how does this work?

Part 1: Embeddings

Some tips:

  • On open-source models you can normally add or remove tokens (see the sketch below)
  • This encoding scheme can cause real issues with numbers: the tokenisation gives the transformer no hint that "123" has anything to do with "12" or "4"
  • (maybe in Part 2: you can force some tokens to be generated)
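For the first tip, a minimal sketch of what adding a token looks like with the transformers library (the token itself, "[cake]", is made up for the example):

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(["[cake]"])    # register a new token with the tokenizer
model.resize_token_embeddings(len(tokenizer))   # give it a fresh row in the embedding table
print(num_added, len(tokenizer))                # 1 30523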

So how does this work?

Part 2: Attention

The idea behind attention is that each token keeps a (learned) weighting of the "important" tokens around it

So in the classic example "The animal didn't cross the street because it was too tired", the attention mechanism learns that "it" has a strong connection with "The animal"

So how does this work?

Part 2: Attention

For each token in the sequence we create:

  1. A Query vector - a learnable representation of asking "what other tokens are important?"
  2. A Key vector - a learnable way of describing the token for other tokens to use
  3. A Value vector - a sort of scaling factor for our task (e.g. is the token a noun or not)

(See the sketch below for where these come from.)
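A minimal sketch of where Q, K and V come from: each is just a learned linear projection of the same token embeddings (the sizes here are illustrative).

import torch
import torch.nn as nn

d_model, d_k = 768, 64
x = torch.randn(9, d_model)        # one embedding per token (e.g. the 9 BERT tokens earlier)

W_q = nn.Linear(d_model, d_k)      # three separate learned projections...
W_k = nn.Linear(d_model, d_k)
W_v = nn.Linear(d_model, d_k)

Q, K, V = W_q(x), W_k(x), W_v(x)   # ...applied to the same embeddings
print(Q.shape, K.shape, V.shape)   # each torch.Size([9, 64])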

So how does this work?

Part 2: Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

  • Q K^T: the dot product between Q and K measures how well each query matches each key
  • \sqrt{d_k}: a norm factor, mostly to do with how we start training the model
  • V: our values, scaled by the resulting attention weights
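The formula in code (one head, no masking or batching, purely illustrative):

import torch
import torch.nn.functional as F

def attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # how well each query matches each key
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # a weighted mix of the values

x = torch.randn(5, 64)      # 5 tokens, d_k = 64
out = attention(x, x, x)    # self-attention: Q, K and V all come from the same tokens
print(out.shape)            # torch.Size([5, 64])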

So how does this work?

Part 2.5: The transformer block

We then do this several times in parallel ("heads") and stack it with a fully connected (FC) layer; together this makes up one transformer block

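A rough sketch of one such block, under the usual assumptions (multi-head self-attention followed by an FC layer, each wrapped in a residual connection and layer norm; sizes roughly match bert-base):

import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)    # several heads of self-attention in parallel
        x = self.norm1(x + attn_out)        # residual connection + layer norm
        return self.norm2(x + self.ff(x))   # same again around the FC layer

x = torch.randn(1, 9, 768)                  # (batch, tokens, hidden)
print(Block()(x).shape)                     # torch.Size([1, 9, 768])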

So how does this work?

Part 3: Encoder (BERT et al)

We're nearly done, don't worry

"What's the German for cake?" -> tokenize -> embed -> N layers of attention + FC networks -> an embedding for the sequence
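In code, using bert-base-uncased again (taking the [CLS] vector as the sequence embedding is a common convention, not the only option):

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("What's the German for cake?", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

print(out.last_hidden_state.shape)                  # (1, num_tokens, 768): one vector per token
sequence_embedding = out.last_hidden_state[:, 0]    # the [CLS] vector, often used for the whole sequence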

So how does this work?

Part 4: Decoder (GPT et al)

[start token] -> tokenize -> embed -> attention block (do this for N layers, each also attending to the embedded context from the encoder) -> FC layer the size of the vocab -> pick the most likely next token from here

(nb: the encoder context is the bit we drop in GPT; it's decoder-only)

So how does this work?

Part 4: Decoder (GPT et al)

[start token] Was -> tokenize -> embed -> attention block (N layers, with the embedded context) -> FC layer the size of the vocab -> pick the most likely next token from here

We literally loop: the "Was" we just generated is appended to the input and everything runs again


So how does this work?

Part 4: Decoder (GPT et al)

[start token] Was heißt -> tokenize -> embed -> attention block (N layers, with the embedded context) -> FC layer the size of the vocab -> pick the most likely next token from here

"heißt" is the token we just picked and appended; we keep looping like this

Note this is trained using masks, without the loop: causal masking lets every position be trained in parallel
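That loop written out with a real decoder-only model (GPT-2 here, greedy decoding, purely as a sketch - don't expect a good translation from base GPT-2):

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("What's the German for cake?", return_tensors="pt")["input_ids"]
for _ in range(10):
    with torch.no_grad():
        logits = model(ids).logits                        # (1, seq_len, vocab_size): the FC layer over the vocab
    next_id = logits[0, -1].argmax()                      # greedily pick the most likely next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)    # literally loop: append it and run again

print(tokenizer.decode(ids[0]))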

I guess... downsides of this architecture?

  1. Each token attends to all the others, so attention is O(N^2); RNNs are O(N)
  2. You need a lot of compute and data to get this to train well - realistically not possible before GPUs + AlexNet
  3. Very difficult to train from scratch; we know much more now about how to do this, but jeepers it's still hard
  4. Before this, SOTA models could be trained on commodity desktops; now you need tens of thousands of H100s (Meta has a couple of clusters on that scale)
  5. Tokenization and positional embeddings are a weak point

Final thoughts

The idea behind the original paper was to:

  • build a seq2seq model without the sequential bias
  • and, in doing so, get scaling up for free

 

Next time:

  • Show you how to use the encoder and decoder on their own
  • Some code
  • Talk about pretraining and fine-tuning