We're going to break down the math, expose the quirks, and show you how to turn vectors into meaning using text embeddings.

Language is messy, chaotic, and beautiful. Mathematics is precise, cold, and rigid. Today we're going to find out how those two can shake hands.


LET'S TRAVEL INTO THE SPACE OF MEANINGS!!


If every word in the English language was a star in the night sky, text embedding is the telescope that reveals the constellations connecting them.

Tokens & Embeddings: Text's Secret Code

Contents

  1. I ❤ strawberries
  2. Glitches & Quirks
  3. Vector Playground
  4. Next-Gen Magic
  5. Takeaways

01

I ❤ strawberries

How Does Your Text Become Math?

Let's use some "magic" to reveal the first step of how AI understands language.

Original Text

I ❤ strawberries

Tokenization

Token ID Sequence

[40, 1037, 21020, 219, 23633, 1012]

Every word, symbol, and even typo must be converted to numbers before the model can start "thinking."
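The text-to-IDs step can be sketched in a few lines. This is a minimal toy, not a real tokenizer: the vocabulary and IDs below are made up to echo the example above, and real tokenizers learn their vocabularies from data and will split the sentence differently.

```python
# Toy sketch of the text -> token IDs step. Vocabulary and IDs are
# invented for illustration; real tokenizers (BPE, WordPiece) learn them.
toy_vocab = {"I": 40, "❤": 1037, "straw": 21020, "berries": 219}

def toy_tokenize(text):
    """Greedily match the longest known vocabulary entry at each position."""
    ids = []
    i = 0
    while i < len(text):
        if text[i].isspace():               # skip whitespace between tokens
            i += 1
            continue
        for j in range(len(text), i, -1):   # try the longest piece first
            if text[i:j] in toy_vocab:
                ids.append(toy_vocab[text[i:j]])
                i = j
                break
        else:
            i += 1                          # unknown character: skip (toy behavior)
    return ids

print(toy_tokenize("I ❤ strawberries"))     # [40, 1037, 21020, 219]
```

Note how "strawberries" is not one entry: the greedy match covers it with "straw" + "berries", a first taste of the subword idea below.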

Tokenization Showdown: Whole Words vs. Characters

Word-based Tokenization

Splits by spaces and punctuation—simple and intuitive.

Drawback: a huge vocabulary that can't handle new words or typos.

Encountering "Covidlicious" results in [UNK] (Unknown).

Character-based Tokenization

Breaks text into individual characters, extremely small vocabulary.

Drawback: sequences become very long, per-token meaning is lost, and processing is inefficient.

"unbelievable" is split into 12 characters, making it hard for the model to understand the overall meaning.
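Both failure modes are easy to reproduce. The word vocabulary below is a tiny made-up sample, not any real model's:

```python
# Contrast of the two naive strategies on the examples above.
word_vocab = {"I", "love", "strawberries", "unbelievable"}

def word_tokenize(text):
    # Split on whitespace; anything outside the vocabulary becomes [UNK].
    return [w if w in word_vocab else "[UNK]" for w in text.split()]

def char_tokenize(text):
    # Every character is its own token: tiny vocabulary, long sequences.
    return list(text)

print(word_tokenize("I love Covidlicious"))  # ['I', 'love', '[UNK]']
print(len(char_tokenize("unbelievable")))    # 12 tokens for one word
```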

Subword Tokenization: The key to intelligence

Balances efficiency and meaning by breaking words into meaningful "building blocks."

un + believ + able = unbelievable

common prefix + root + common suffix = complete word

This way the vocabulary stays small, unknown words become rare, and the model can see that "unbelievable" is composed of "un + believ + able."
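A minimal greedy longest-match subword tokenizer (WordPiece-style matching) shows the idea. The subword list here is hand-picked for this one example; real models learn thousands of pieces from a corpus:

```python
# Hand-picked subword inventory, just enough for "unbelievable".
subwords = {"un", "believ", "able"}

def subword_tokenize(word):
    """Cover the word left to right, always taking the longest known piece."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # longest piece first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return ["[UNK]"]                # no known piece covers this span
    return pieces

print(subword_tokenize("unbelievable"))     # ['un', 'believ', 'able']
```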

02

Glitches & Quirks

The Strawberry Mystery: Why Can't AI Count?

Ask AI: "How many 'r's are in 'strawberry'?"

Human Perspective:

s-t-r-a-w-b-e-r-r-y, at a glance, 3.

AI Perspective (Tokenized):

['str', 'aw', 'berry']: the individual letters are gone, so the model can only guess.
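The information loss is easy to see in code. The tokens below mirror the split shown above; to the model each one is an opaque ID, so the 'r's hide inside the pieces:

```python
# The model's view of "strawberry": three opaque pieces, no letters.
tokens = ["str", "aw", "berry"]

# Code with character-level access counts easily, like a human:
print("strawberry".count("r"))          # 3

# The 'r's are split across tokens; counting them requires peeking
# inside each piece, which a model operating on token IDs cannot do.
print([t.count("r") for t in tokens])   # [1, 0, 2]
```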

Tokenization Information Loss

The Challenge of Internet Slang "Tokenization"

How do tokenizers handle ever-evolving internet language?

😊 Emoji & Expressions

"🌟💀 slayyy 🖌️" gets split into ['🌟', '💀', 'slay', 'yy', '🖌️'].
If an emoji is a commonly used "word," it gets its own ID.

⌨️ Typos & New Words

"Covidlicious" might be split into ['Covid', 'licious'].
Subword models can "understand" and process newly coined words.

Platform Slang

"yyds", "amazing", etc.: if popular enough, they're treated as a complete token when newer models are trained.

03

Vector Playground

From Words to Coordinates: Entering Vector Space

Embedding encodes word meaning as an "address" in high-dimensional space—a vector.

"King" (royal, male) → [0.2, -0.5, 0.8, ..., 0.1]

In this space, words with similar meanings are closer together, and models understand relationships by calculating "distance."
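"Distance" here usually means cosine similarity. The 3-dimensional vectors below are hand-made toys (real embeddings have hundreds of dimensions), chosen so that "cat" and "dog" point in similar directions while "car" does not:

```python
import math

def cosine(u, v):
    """Cosine similarity: 1.0 for same direction, near 0 for unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy 3-d "embeddings", assigned by hand for illustration.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

print(cosine(cat, dog))   # close to 1: similar meanings
print(cosine(cat, car))   # much smaller: unrelated meanings
```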

The Magic of Embeddings: Word Vector Arithmetic

King - Man + Woman = Queen

Embeddings not only encode word meaning but also relationships.
Through vector addition and subtraction, we can explore analogies and logic in language.
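The famous analogy can be reproduced with toy vectors. Here dimension 0 loosely encodes "royalty" and dimension 1 "gender"; these assignments are invented for the demo, whereas real models learn such directions from data:

```python
import math

# Hand-made 2-d vectors: [royalty, gender]. Invented for illustration.
vecs = {
    "king":  [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.9],
    "woman": [0.1, 0.1],
    "apple": [0.2, 0.8],
}

def nearest(target, exclude):
    """Return the vocabulary word whose vector is most similar to target."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max((w for w in vecs if w not in exclude),
               key=lambda w: cos(vecs[w], target))

# king - man + woman, component by component:
target = [k - m + w for k, m, w in zip(vecs["king"], vecs["man"], vecs["woman"])]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

As in real word-vector toolkits, the query words themselves are excluded before taking the nearest neighbor.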

The Birth of Embeddings: Ants Moving House

Imagine a colony of ants (the model) crawling through Wikipedia.

  1. Collect neighbors: Each ant collects words around the target word.
  2. Mutual attraction: Words that often appear together have vectors that "attract each other," getting closer in space.
  3. Form a map: After billions of crawls, a "semantic map" reflecting real language relationships is formed.
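The three steps above can be sketched as a co-occurrence counter. The corpus is a toy sentence; real training crawls billions of words and learns dense vectors (word2vec, GloVe) rather than keeping raw counts:

```python
from collections import Counter, defaultdict

# Step 1: collect neighbors within a small window around each word.
corpus = "the cat sat on the mat the dog sat on the rug".split()
window = 2
counts = defaultdict(Counter)

for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if i != j:
            counts[word][corpus[j]] += 1   # step 2: co-occurring words "attract"

# Step 3: words with similar neighborhoods get similar count vectors.
# "cat" and "dog" both sit near "sat", "on", "the" -- the seed of a
# semantic map.
print(counts["cat"])
print(counts["dog"])
```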

04

Next-Gen Magic

Upgraded Embeddings: Context-Aware

Early embeddings were static—one word, one vector.
Modern models (like BERT) can dynamically adjust word meaning based on context.

"I need to open an account at the bank."
→ Vector points to "financial institution"

"We're having a picnic by the bank."
→ Vector points to "riverbank"

The same word "bank" gets different vector representations in different contexts, allowing the model to truly understand its meaning.

Embeddings Beyond Text: Going Cross-Domain

Embedding technology has expanded beyond text—any data can be encoded as vectors.

🖼️ Images

Through models like CLIP, images and text can be mapped to the same vector space, enabling "search images with text."

🎵 Audio

Spotify uses it for music recommendations.

🛒 Products

Amazon uses it for product recommendations.

05

Takeaways

Interactive Time: Create Your Own Word Vector Equation

Challenge: Try to find your own word vector relationship!

$$\text{Sushi} - \text{Japan} + \text{Italy} = \text{Pizza}$$

Principle: (Food - Origin Country) + New Origin Country = New Food

This demonstrates how embeddings encode complex relationships.

Summary: Tokenize, Embed, Understand

This is the secret trilogy of how AI understands language:

  1. Tokenize
    Break text into small chunks the model can process.

  2. Embed
    Convert tokens into vectors in high-dimensional space.

  3. Understand
    Understand semantics through vector operations and model reasoning.

Next time you see "strawberries," you'll know it's just a string of mysterious code in the AI world.

THANK YOU

KIEN

By Dan Ryan