LLM Basics Workshop

Agenda

Theory

Hands-On

1. What is an LLM, I can't even 😳

2. Running models

a. Local models

b. Deployed models

3. Understanding System Prompts

4. Prompt Engineering

What is an LLM?

How many 'r' letters are in the word 'strawberry'?

Uhh, like 2?

Transformer Architecture

Transformer Architecture Overview

[Figure: the Transformer encoder-decoder diagram (Vaswani et al., 2017): Input Embedding and Output Embedding with Positional Encoding added, stacks of Multi-Head Attention, Masked Multi-Head Attention, Feed Forward and Add & Norm blocks, and a final Linear layer plus Softmax producing the output probabilities.]

Neural Networks

[Figure: the same Transformer diagram, with the Feed-Forward blocks highlighted. Each encoder and decoder layer contains a position-wise feed-forward neural network.]

Tokenization

This workshop is high key slay

Token       Position   Token ID
This        0          356
workshop    1          53782
is          2          52
high        3          333
key         4          672
slay        5          1
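As a rough sketch of what a tokenizer does in practice, the snippet below uses the tiktoken library; the IDs it prints come from its own vocabulary and will not match the illustrative IDs on the slide.

# Tokenization sketch using tiktoken (pip install tiktoken).
# The printed IDs come from the cl100k_base vocabulary, not the slide's toy IDs.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("This workshop is high key slay")
print(ids)                                  # a list of integer token IDs
print([enc.decode([i]) for i in ids])       # the text piece behind each ID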

Token Embeddings

[Figure: the Transformer diagram with the Input Embedding and Output Embedding blocks highlighted.]

Each token ID is mapped to a dense vector of numbers, e.g. for "This workshop is high key slay":

This        0.51 0.12 1    0    0    0.45 0.50 0.29 ... 0.77
workshop    0.98 0.32 0.63 0.92 0.17 0    0.07 0.83 ... 1
is          0.43 1    0.95 1    0.54 0.31 0.19 0    ... 0
high        0.53 0.52 0.51 0.92 0.78 0.71 0.99 0.84 ... 0.91
key         0.82 0.91 0.56 0.59 0.99 0.42 1    1    ... 0.72
slay        0.60 0.15 0.75 0.59 0.01 0.07 0    0.27 ... 0.33

Token Embeddings

Words with similar meanings end up with similar embeddings:

cat         0.75 0.61 0.98 0.02 0    0.12 0.17 0.29 ... 0.57
dog         0.74 0.59 0.98 0.02 0.02 0.11 0.17 0.30 ... 0.56
cluster     0.59 0.95 0    0.85 1    0.67 0.15 0.72 ... 0.25
cult        0.60 0.95 0    0.85 0.97 0.68 0.17 0.72 ... 0.23

[Figure: words plotted in embedding space. Related words such as Cat/Dog, I/We/Us, Cluster/Cult and Run/Walk/Cycle/Swim form clusters, and translations of "cat" (Chat, Katze, Pisică, ネコ) land close together.]
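A minimal sketch of how "similar" can be measured: cosine similarity between the embedding rows above (the vectors are truncated to the values shown on the slide, purely for illustration).

# Cosine similarity between the (truncated, illustrative) embedding vectors above.
import numpy as np

cat     = np.array([0.75, 0.61, 0.98, 0.02, 0.00, 0.12, 0.17, 0.29])
dog     = np.array([0.74, 0.59, 0.98, 0.02, 0.02, 0.11, 0.17, 0.30])
cluster = np.array([0.59, 0.95, 0.00, 0.85, 1.00, 0.67, 0.15, 0.72])

def cosine(a, b):
    # 1.0 means the vectors point in the same direction, 0 means unrelated
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(cat, dog))      # close to 1.0: "cat" and "dog" are near-neighbours
print(cosine(cat, cluster))  # noticeably lower: unrelated words sit further apart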

Positional Encoding

[Figure: the Transformer diagram with the Positional Encoding blocks highlighted.]

Word order matters: "John eats pineapple" and "Pineapple eats John" contain the same tokens but mean very different things, so a position-dependent signal is added to each token embedding.

John ( i = 0 )        0.75 0.61 0.98 0.02 0    0.12 0.17 0.29 ... 0.57
eats ( i = 1 )        0.23 0.15 0.79 0.09 0    0.18 0.82 0.58 ... 0.73
pineapple ( i = 2 )   0.35 0.01 0.09 0.38 0.22 0    0    0.99 ... 1

Sinusoidal positional encodings, alternating sine and cosine at decreasing frequencies across the embedding dimensions:

i = 0:   sin(0) = 0       cos(0) = 1        sin(0) = 0        cos(0) = 1       ...
i = 1:   sin(1/1) = 0.84  cos(1/1) = 0.54   sin(1/10) = 0.1   cos(1/10) = 1    ...
i = 2:   sin(2/1) = 0.91  cos(2/1) = -0.41  sin(2/10) = 0.2   cos(2/10) = 0.98 ...
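A small sketch of the sinusoidal positional encoding from the original Transformer paper; with d_model = 8 it reproduces the sin/cos values in the table above.

# Sinusoidal positional encoding (Vaswani et al., 2017):
#   PE[pos, 2k]   = sin(pos / 10000**(2k / d_model))
#   PE[pos, 2k+1] = cos(pos / 10000**(2k / d_model))
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]            # positions 0..seq_len-1
    k   = np.arange(d_model // 2)[None, :]       # index of each sin/cos pair
    angles = pos / (10000 ** (2 * k / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                 # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                 # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=3, d_model=8)
print(np.round(pe, 2))   # row i is added to the embedding of the token at position i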

Self-Attention

Slay

Your pull request is kinda slay

I cometh h're to slay ev'ry peasant

The same word means very different things depending on the words around it, and attention is how the model takes that context into account.

Self Attention

[Figure: each token in "Your pull request is kinda slay" attends to the other tokens of the sentence. When the model processes "Your" or "request", it looks at the surrounding words to work out what that token means here.]

Self Attention

Every token embedding is projected into three vectors:

Query:   What are we trying to find?

Key:     What are the key features of this token?

Value:   What is being retrieved?

Self Attention

[Figure: the embedding of "Your" (0.74 0.59 0.98 0.02 0.02 0.11 0.17 0.30 ... 0.56) is multiplied by three learned weight matrices, giving that token its own Query, Key and Value vectors.]
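A minimal sketch of that projection step, with random weights standing in for the learned matrices just to show the shapes involved.

# One token embedding -> Query / Key / Value via three separate projections.
# The weights are random here purely for illustration; in a model they are learned.
import numpy as np

d_model = 256
rng = np.random.default_rng(0)

x   = rng.random(d_model)                    # embedding of "Your", shape (d_model,)
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

query = x @ W_q                              # "what am I looking for?"
key   = x @ W_k                              # "what do I contain?"
value = x @ W_v                              # "what do I hand over if someone attends to me?"
print(query.shape, key.shape, value.shape)   # (256,) (256,) (256,)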

Self Attention

Query x Key = Similarity Vector

Similarity Vector + Softmax = Normalized Similarity Vector

Normalized Similarity Vector x Value = Self-Attention

The query of the current token is dotted with the key of every token: multiply the two vectors element-wise (a0 x a1 + b0 x b1 + ... + z0 x z1) and sum the products to get one similarity score per token.

Similarity scores:   21    -3.2  14.2  15.7  -7.1  0.1  3.5  -2.9  ...  0.2
After softmax:       0.994 0     0.001 0.005 0     0    0    0     ...  0

Each token's value vector is then scaled by its softmax weight (the value vector of the token at index 0 by 0.994, the one at index 1 by 0, and so on), and the scaled vectors are summed to give the self-attention output S0 for the current token.
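Putting the three steps together for a whole sequence, here is a sketch of plain dot-product self-attention with NumPy; note that real Transformers also divide the scores by sqrt(d_k) before the softmax.

# Dot-product self-attention following the slide's three steps.
# (Real implementations also scale the scores by 1/sqrt(d_k) before the softmax.)
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Q, K, V):
    scores  = Q @ K.T            # Query x Key: similarity of every token with every token
    weights = softmax(scores)    # softmax: normalized similarities per row
    return weights @ V           # weighted sum of Values: contextualized token vectors

rng = np.random.default_rng(0)
seq_len, d = 6, 8                # 6 tokens ("Your pull request is kinda slay"), toy dimension
Q, K, V = (rng.normal(size=(seq_len, d)) for _ in range(3))
out = self_attention(Q, K, V)
print(out.shape)                 # (6, 8): one context-aware vector per token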

Multi-Head Attention

[Figure: the Transformer diagram with the Multi-Head Attention blocks highlighted.]

Multi Head Attention

[Figure: attention over "Your pull request is kinda slay"; different tokens end up attending to different parts of the sentence.]

Multi Head Attention

[Figure: several attention heads run in parallel over the same sentence "Your pull request is kinda slay", each head learning to focus on different relationships between the tokens; their outputs are combined.]
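A sketch of the multi-head idea, assuming 8 heads over a 256-dimensional model: split the model dimension into smaller heads, run attention in each, and concatenate the results. In a real model the projections are learned and a final linear layer follows the concatenation.

# Multi-head attention sketch: several smaller attention heads in parallel,
# outputs concatenated. Weights are random here; head count and sizes are arbitrary.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head_attention(x, n_heads, rng):
    d_model = x.shape[-1]
    d_head  = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        # each head has its own Q/K/V projections (learned in a real model)
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        heads.append(attention(x @ W_q, x @ W_k, x @ W_v))
    return np.concatenate(heads, axis=-1)    # usually followed by one more linear layer

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 256))                # 6 tokens, d_model = 256
print(multi_head_attention(x, n_heads=8, rng=rng).shape)   # (6, 256)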

Normalization

[Figure: the Transformer diagram with the Add & Norm blocks highlighted. After each attention and feed-forward block, the block's input is added back to its output (a residual connection) and the result is layer-normalized.]
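A minimal sketch of one Add & Norm step: add the block's input back to its output and normalize each token vector (a real LayerNorm also has learned scale and shift parameters).

# "Add & Norm" sketch: residual connection followed by layer normalization.
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    std  = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)          # real LayerNorm adds learned scale/shift

def add_and_norm(x, sublayer_output):
    return layer_norm(x + sublayer_output)   # keep the original signal, then normalize

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 256))                # 6 tokens, d_model = 256
print(add_and_norm(x, rng.normal(size=(6, 256))).shape)   # (6, 256)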

Processing the Output

[Figure: the Transformer diagram with the final Linear and Softmax blocks highlighted.]

This workshop is high key slay  ______

Final hidden state (1x256):
0.74 0.59 0.98 0.02 0.02 0.11 0.17 0.30 ... 0.56

Linear layer: logits over the whole vocabulary (1x50000):
-0.012 0.001 0.02 0 -0.1 -0.002 0.15 -0.05 ... 0.042

Softmax: probabilities over the whole vocabulary (1x50000):
0% 0% 0% 0% 0% 0% 35% 0% ... 1%

The highest probability (35%) sits at vocabulary index 6, so the model predicts token 6 as the next word:

This workshop is high key slay  __6__

This workshop is high key slay queen
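A sketch of this last step with random weights: a linear layer maps the 256-dimensional hidden state to one score per vocabulary entry, softmax turns the scores into probabilities, and picking the highest-probability index gives the next token.

# Final step sketch: hidden state -> logits over the vocabulary -> probabilities -> next token.
# The weight matrix is random here, purely to show the shapes involved.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d_model, vocab_size = 256, 50000

hidden = rng.normal(size=d_model)                  # final hidden state for the last token
W_out  = rng.normal(size=(d_model, vocab_size))    # the "Linear" layer

logits = hidden @ W_out                            # one raw score per vocabulary entry
probs  = softmax(logits)                           # "Softmax": probabilities summing to 1
next_token_id = int(np.argmax(probs))              # greedy pick; samplers draw from probs instead
print(next_token_id, probs[next_token_id])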

LLM Basics

By alexgrigi