Transformers Lunch & Learn

Part 1: Background and The Transformer

Hamish Dickson - Oct 2024

Over the next 3 weeks

  • Some code
  • Mostly theory
  • Prompts n shit

Question: How would you approach translation?

What's the German for cake?

Was heißt auf Deutsch Kuchen?

???


First attempts were n-gram-to-n-gram translations

What's the German for cake?

Was heißt auf Deutsch Kuchen?

This doesn't work well for lots of reasons: even if you could build these mappings, you still lack context

Question: How would you approach translation?

What's the German for cake?

Was heißt auf Deutsch Kuchen?

???

Google Translate famously used seq2seq in 2014

The idea is (there's a code sketch below):

  • have a model "roll over" the input text in order
  • build up a representation of the text as we go
  • "unroll" using a second model to a new language
  • new output model for each language

i.e. GenAI
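To make the "roll over / unroll" idea concrete, here's a minimal seq2seq sketch in PyTorch. The GRU, the sizes and the fake data are my own assumptions for illustration, not the actual Google Translate model:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # "rolls over" the source tokens and squashes them into one hidden state
    def __init__(self, vocab_size, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)

    def forward(self, src_ids):                 # (batch, src_len)
        _, h = self.rnn(self.embed(src_ids))    # h: (1, batch, d) -- the whole sentence in one vector
        return h

class Decoder(nn.Module):
    # "unrolls" target tokens from the encoder's final state; one of these per target language
    def __init__(self, vocab_size, d=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        self.rnn = nn.GRU(d, d, batch_first=True)
        self.out = nn.Linear(d, vocab_size)

    def forward(self, tgt_ids, h):              # teacher-forced unrolling during training
        y, _ = self.rnn(self.embed(tgt_ids), h)
        return self.out(y)                      # (batch, tgt_len, vocab_size)

enc, dec = Encoder(1000), Decoder(1200)         # made-up vocab sizes
src = torch.randint(0, 1000, (2, 7))            # fake "English" token ids
tgt = torch.randint(0, 1200, (2, 5))            # fake "German" token ids
print(dec(tgt, enc(src)).shape)                 # torch.Size([2, 5, 1200])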

Question: How would you approach translation?

[Diagram: the Encoder rolls over the inputs x_{N-2}, x_{N-1}, x_N ..., building up hidden states; the final state h_N (this thing is really useful) is handed to the Decoder, whose states d_0 ... d_3 unroll into the outputs y_0 ... y_3.]

Question: How would you approach translation?

Missing piece: encoding/decoding text

Question: how do we go from words to something an ML model can use?

 

Our models work on tensors, so how do we turn words into tensors?

Until 2017 we mostly used word2vec (w2v) and GloVe. The idea is (there's a toy sketch after this list):

  • train a model so that similar words end up close together in a vector space, and dissimilar words end up roughly perpendicular or pointing in the opposite direction from the origin
  • use this as a lookup at the beginning of the model
  • lots of preprocessing to crowbar our sentence into a pretrained model
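A toy picture of that vector-space idea; the words and numbers below are made up, real w2v/GloVe vectors are learned from huge corpora and have hundreds of dimensions:

import numpy as np

# pretend lookup table: word -> vector (values invented for illustration)
vectors = {
    "cake":   np.array([0.9, 0.1, 0.3]),
    "kuchen": np.array([0.8, 0.2, 0.3]),
    "tensor": np.array([-0.7, 0.6, 0.1]),
}

def cosine(a, b):
    # 1.0 means pointing the same way, 0 perpendicular, negative opposite
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(vectors["cake"], vectors["kuchen"]))  # high: similar meaning
print(cosine(vectors["cake"], vectors["tensor"]))  # low/negative: unrelated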

Missing piece: encoding/decoding text

So does it work?

YES! Very well

This is how Google Translate worked for about 4 years (although with tricks, bidirectional models and stacking)

But there are some issues

  • note we have encoded this sequential bias into how we teach the model
  • this bias helps us learn fast (we have told the model part of the answer), but it's also not quite how language works
  • this is very, very hard to scale up
  • w2v has a similar problem, but worse: it's a whole other network

important modelling trick

So does it work?

Note, if you have a problem where:

  • You have a genuinely sequential problem
  • You don't have BUCKETS of similar data to pretrain with
  • You want something fast (RNNs go as O(N))

 

An RNN can still easily beat a transformer

Why do we care about all of this?

There is a huge problem with this design: it doesn't scale

  • Remember we roll over each word, one at a time
  • it's hard to imagine scaling this over multiple machines
  • one machine means it takes a very long time to train a model

This isn't completely true; there are tricks around this

Question: why would we want "to scale"?

Pretraining for fine tuning

Attention Is All You Need

  • Google Brain paper from 2017
  • Honestly, the paper is underwhelming
  • Nominally it was an obscure translation paper (English to German)
  • Cool fact: they did the architecture diagram in Google Slides

Attention Is All You Need

Introduced lots of ideas

  • Ditch w2v; instead "tokenise" the input text (basically an ordinal encoding)
  • Train the token -> vector encoding at the same time as the rest of the model
  • Ditch the sequential encoding (i.e. the "rolling over/out")
  • Instead we provide the whole sequence at once - this is super important

So how does this work?

Part 1: Embeddings

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")

tokenizer.convert_ids_to_tokens(tokenizer("let's talk about bpe")["input_ids"])
# ['[CLS]', 'let', "'", 's', 'talk', 'about', 'bp', '##e', '[SEP]']

Notice we split words out into sub-parts

This lets us have huge coverage with fewer tokens

Token   Vector
"let"   [0.1, -0.8, ...]
"the"   [0.9, 0.85, ...]

The model then learns these vectors in a lookup table during training

 

These are called "embeddings"
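Continuing the bert-base-uncased example above, a sketch of what that lookup table looks like in the transformers library (bert-base-uncased happens to use 768-dimensional embeddings):

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

# token ids are just indices into a big learned matrix of shape (vocab_size, hidden_size)
ids = tokenizer("let's talk about bpe", return_tensors="pt")["input_ids"]
embedding_table = model.get_input_embeddings()   # an nn.Embedding learned during pretraining
vectors = embedding_table(ids)

print(embedding_table.weight.shape)              # torch.Size([30522, 768])
print(vectors.shape)                             # torch.Size([1, 9, 768]) -- one vector per token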

So how does this work?

Part 1: Embeddings

Some tips:

  • On open-source models you can normally add or remove tokens
  • This encoding scheme can cause real issues with numbers; the transformer probably won't understand that "123" is close to "12" and "4"
  • (maybe in part 2: you can force some tokens to be generated)
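On the first tip, adding a token looks roughly like this (a sketch; "defragulator" is a made-up word, and its new embedding starts out untrained, so you'd need to fine-tune before it's useful):

import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

num_added = tokenizer.add_tokens(["defragulator"])   # made-up word for illustration
model.resize_token_embeddings(len(tokenizer))        # grow the lookup table to match

print(num_added)  # 1
print(tokenizer.convert_ids_to_tokens(
    tokenizer("the defragulator is broken")["input_ids"]))
# expect something like: ['[CLS]', 'the', 'defragulator', 'is', 'broken', '[SEP]']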

So how does this work?

Part 2: Attention

The idea behind attention is that each token keeps a mapping of the "important" tokens around it

So in an example like "The animal didn't cross the street because it was too tired", the attention mechanism learns that "it" has a strong connection with "The animal"
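You can peek at what a trained model has learned here. A sketch with bert-base-uncased; which layers and heads pick up this connection varies, so treat the printout as illustrative:

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The animal didn't cross the street because it was too tired",
                   return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions          # one (1, heads, tokens, tokens) tensor per layer

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
it_pos = tokens.index("it")
weights = attentions[-1][0].mean(dim=0)[it_pos]      # last layer, averaged over heads, row for "it"
for tok, w in zip(tokens, weights):
    print(f"{tok:>10} {w.item():.3f}")               # how much "it" attends to each token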

So how does this work?

Part 2: Attention

For each token in the sequence we create:

  1. A Query vector - a learnable representation of asking "what other tokens are important?"
  2. A Key vector - a learnable way of describing the token for other tokens to use
  3. A Value vector - a sort of scaling factor for our task, e.g. is the token a noun or not

So how does this work?

Part 2: Attention

\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V

In the formula:

  • QK^T: the dot product between Q and K, scoring how strongly each token should attend to every other token
  • \sqrt{d_k}: a normalisation factor, mostly to do with how we start training the model
  • V: our value scaling
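Here's the formula as code, a toy single-head version in PyTorch (the shapes and the self-attention usage below are just for illustration):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (num_tokens, d_k)
    d_k = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # (num_tokens, num_tokens)
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # weighted sum of the values

x = torch.randn(4, 8)                               # 4 tokens, d_k = 8
out = scaled_dot_product_attention(x, x, x)         # self-attention
print(out.shape)                                    # torch.Size([4, 8])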

So how does this work?

Part 2.5: The transformer block

We then do this several times in parallel ("heads") and layer it with an FC layer - see the sketch below

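A minimal sketch of one such block in PyTorch (the sizes, the GELU and the exact residual/layer-norm placement are my assumptions; real implementations differ in these details):

import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    # multi-head self-attention + a small FC network, each with a residual connection and layer norm
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.fc = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, num_tokens, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)            # residual + norm
        return self.norm2(x + self.fc(x))       # residual + norm

x = torch.randn(2, 5, 64)
print(TransformerBlock()(x).shape)              # torch.Size([2, 5, 64])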

So how does this work?

Part 3: Encoder (BERT et al)

We're nearly done, don't worry

"What's the German for cake?" → tokenize → embed → N layers of attention + FC networks → an embedding for the whole sequence
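A rough sketch of that pipeline with the transformers library; taking the [CLS] token's vector is one common (if crude) way to get a single embedding for the sequence:

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("bert-base-uncased")
model = transformers.AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("What's the German for cake?", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(outputs.last_hidden_state.shape)          # (1, num_tokens, 768) -- one vector per token
sentence_embedding = outputs.last_hidden_state[:, 0]   # the [CLS] token's vector
print(sentence_embedding.shape)                 # (1, 768)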

So how does this work?

Part 4: Decoder (GPT et al)

"[start token]" → tokenize → embed → attention block (plus the embedded context from the encoder - nb we drop this in GPT) → repeat for N layers → FC layer the size of the vocab → pick the most likely next token from here

So how does this work?

Part 4: Decoder (GPT et al)

"[start token] Was" → tokenize → embed → attention block (plus the embedded context) → repeat for N layers → FC layer the size of the vocab → pick the most likely next token from here

We literally loop: the "Was" we just generated gets appended to the input and the whole thing runs again

So how does this work?

Part 4: Decoder (GPT et al)

"[start token] Was heißt" → tokenize → embed → attention block (plus the embedded context) → repeat for N layers → FC layer the size of the vocab → pick the most likely next token from here

"heißt" was the token picked on the previous pass; we keep looping like this, one token per pass

Note: this is trained using masks, without the loop - during training every position predicts its next token in parallel
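At inference time the loop is real, though. Here it is written out by hand with GPT-2 standing in as the decoder (an assumption for illustration: gpt2 is decoder-only and not a translation model, and greedy argmax is just the simplest way to "pick the most likely next token"):

import torch
import transformers

tokenizer = transformers.AutoTokenizer.from_pretrained("gpt2")
model = transformers.AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("What's the German for cake?", return_tensors="pt")["input_ids"]
for _ in range(10):                                   # generate 10 tokens
    with torch.no_grad():
        logits = model(ids).logits                    # (1, num_tokens, vocab_size)
    next_id = logits[:, -1].argmax(dim=-1)            # most likely next token
    ids = torch.cat([ids, next_id[:, None]], dim=-1)  # the loop: append it and go again

print(tokenizer.decode(ids[0]))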

I guess downsides of this architecture?

  1. Each token attends to all others - that's O(N^2), while RNNs are O(N)
  2. You need a lot of compute and data to get this to train well. Realistically not possible before GPUs + AlexNet
  3. Very difficult to train from scratch - we know much more now about how to do this, but jeepers it's still hard
  4. Before this, SOTA models could be trained on commercial desktops; now you need 60k H100s (meta-ai has two of these clusters)
  5. Tokenization + positional embeddings are a weak point

Final thoughts

The idea behind the original paper was to:

  • build a seq2seq model without the sequential bias
  • get scaling up for free

 

Next time:

  • Show you how to use the encoder and decoder on their own
  • Some code
  • Talk about pretraining and fine tuning
