Building a full-text search engine in TypeScript

Michele Riva

Senior Software Architect @NearForm

Google Developer Expert

Microsoft MVP

MicheleRivaCode

Why?

What I cannot create, I do not understand

Richard Feynman

A journey through algorithms and data structures

There's no slow programming language, just bad DSA design

What is "full-text" search?

sybase.com

Full-text search is a more advanced way to search a database.

Full-text search quickly finds all instances of a term (word) in a table without having to scan rows and without having to know which column a term is stored in.

Full-text search works by using text indexes.

A text index stores positional information for all terms found in the columns you create the text index on.
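To make "positional information" concrete, here is a minimal sketch of a text index that records, for every term, the row, column, and word position where it occurs. The names `Posting`, `TextIndex`, and `buildTextIndex` are assumptions made for this illustration, and splitting on whitespace is a simplification. This is not how Sybase implements its indexes.

```typescript
// Hypothetical shape: each term maps to every place it appears.
type Posting = { row: number; column: string; position: number };
type TextIndex = Map<string, Posting[]>;

function buildTextIndex(rows: Array<Record<string, string>>): TextIndex {
  const index: TextIndex = new Map();
  rows.forEach((row, rowNumber) => {
    for (const [column, text] of Object.entries(row)) {
      // Simplified tokenization: lowercase and split on whitespace.
      const terms = text.toLowerCase().split(/\s+/).filter(Boolean);
      terms.forEach((term, position) => {
        const postings = index.get(term) ?? [];
        postings.push({ row: rowNumber, column, position });
        index.set(term, postings);
      });
    }
  });
  return index;
}

const textIndex = buildTextIndex([
  { quote: "do i feel lucky", movie: "dirty harry" },
  { quote: "life was like a box of chocolates", movie: "forrest gump" },
]);

// One lookup, no row scan, no need to know the column in advance.
const luckyPostings = textIndex.get("lucky");
```

Looking up a term is a single `Map` access, which is what lets the search skip both the row scan and the column guess.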

Popular full-text search engines

"New generation" full-text search engines

Sonic

Meilisearch

JavaScript-based full-text search engines

Lunr.js

MiniSearch

Fuse.js

Where to start?

Understand what kind of data we want to store and retrieve

[
  {
    "id": 1,
    "quote": "It's alive! It's alive!",
    "movie": "Frankenstein",
    "year": 1931
  },
  {
    "id": 2,
    "quote": "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?",
    "movie": "Dirty Harry",
    "year": 1971
  },
  {
    "id": 3,
    "quote": "Mama always said life was like a box of chocolates. You never know what you're gonna get.",
    "movie": "Forrest Gump",
    "year": 1994
  }
]

Example documents

// "It's alive! It's alive!"
["Its", "alive", "Its", "alive"]

// "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?"
[
  "Youve", "got", "to", "ask", "yourself", "one", "question",
  "Do", "I", "feel", "lucky", "Well", "do", "ya", "punk"
]

// "Mama always said life was like a box of chocolates. You never know what you're gonna get."
[
  "Mama", "always", "said", "life", "was", "like", "a", "box", "of", 
  "chocolates", "You", "never", "know", "what", "youre", "gonna", "get"
]

Tokenizer

Break the sentences into individual tokens
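One possible way to get the tokens shown above is to strip punctuation and split on whitespace. The function name `tokenize` and the ASCII-only character class are assumptions made for this sketch. Text with accents or non-Latin scripts would need a Unicode-aware pattern.

```typescript
// Sketch of a tokenizer: remove punctuation, then split on whitespace.
function tokenize(sentence: string): string[] {
  return sentence
    .replace(/[^A-Za-z0-9\s]/g, "") // "It's" becomes "Its", "alive!" becomes "alive"
    .split(/\s+/)
    .filter((token) => token.length > 0);
}

const aliveTokens = tokenize("It's alive! It's alive!");
// ["Its", "alive", "Its", "alive"]
```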

// "It's alive! It's alive!"
["its", "alive", "its", "alive"]

// "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?"
[
  "youve", "got", "to", "ask", "yourself", "one", "question",
  "do", "i", "feel", "lucky", "well", "do", "ya", "punk"
]

// "Mama always said life was like a box of chocolates. You never know what you're gonna get."
[
  "mama", "always", "said", "life", "was", "like", "a", "box", "of", 
  "chocolates", "you", "never", "know", "what", "youre", "gonna", "get"
]

Tokenizer

Lowercase all tokens
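Lowercasing is a one-line map over the tokens. The helper name `lowercaseTokens` is an assumption for this sketch. It uses `toLowerCase`, which ignores the user's locale. `toLocaleLowerCase` is the alternative when locale-specific casing rules matter.

```typescript
// Sketch: normalize case so "Do" and "do" become the same token.
function lowercaseTokens(tokens: string[]): string[] {
  return tokens.map((token) => token.toLowerCase());
}

const lowered = lowercaseTokens(["Its", "alive", "Its", "alive"]);
// ["its", "alive", "its", "alive"]
```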

// "It's alive! It's alive!"
["its", "alive"]

// "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?"
[
  "youve", "got", "to", "ask", "yourself", "one", "question",
  "do", "i", "feel", "lucky", "well", "ya", "punk"
]

// "Mama always said life was like a box of chocolates. You never know what you're gonna get."
[
  "mama", "always", "said", "life", "was", "like", "a", "box", "of", 
  "chocolates", "you", "never", "know", "what", "youre", "gonna", "get"
]

Tokenizer

Remove duplicates
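A `Set` is one simple way to drop repeated tokens, since it keeps the first occurrence of each value in insertion order. The helper name `removeDuplicates` is an assumption for this sketch.

```typescript
// Sketch: keep only the first occurrence of each token, preserving order.
function removeDuplicates(tokens: string[]): string[] {
  return Array.from(new Set(tokens));
}

const unique = removeDuplicates(["its", "alive", "its", "alive"]);
// ["its", "alive"]
```

Note that this discards how often a term appears, which is acceptable for plain matching but would matter for any ranking based on term frequency.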

// "It's alive! It's alive!"
["alive"]

// "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?"
[
  "youve", /* "got", */ /* "to", */ "ask", "yourself", "one", "question",
  /* "do", */ /* "i", */ "feel", "lucky", "well", "ya", "punk"
]

// "Mama always said life was like a box of chocolates. You never know what you're gonna get."
[
  "mama", "always", "said", "life", /* "was", */ "like", /* "a", */ "box", /* "of", */
  "chocolates", "you", "never", "know", /* "what", */ "youre", /* "gonna", */ "get"
]

Tokenizer

Remove stop-words*

What is a stop word?

Stop words are a set of commonly used words in a language. Examples of stop words in English are “a”, “the”, “is”, “are” and etc. Stop words are commonly used in Text Mining and Natural Language Processing (NLP) to eliminate words that are so commonly used that they carry very little useful information.

https://www.opinosis-analytics.com/knowledge-base/stop-words-explained/
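Stop-word removal can be sketched as a filter against a set of known words. The `STOP_WORDS` list below is an assumption: it contains only the words dropped in the example quotes, not a complete English stop-word list, and `removeStopWords` is a name chosen for this illustration.

```typescript
// Hypothetical, deliberately tiny stop-word list covering the example quotes.
const STOP_WORDS: ReadonlySet<string> = new Set([
  "a", "do", "got", "gonna", "i", "its", "of", "to", "was", "what",
]);

// Sketch: drop every token that appears in the stop-word set.
function removeStopWords(tokens: string[]): string[] {
  return tokens.filter((token) => !STOP_WORDS.has(token));
}

const withoutStopWords = removeStopWords(["its", "alive"]);
// ["alive"]
```

A `Set` keeps each membership check cheap, which matters once the list grows to the few hundred entries a real language needs.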


// "It's alive! It's alive!"
["alive"]

// "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?"
[
  "you" /* was "youve" */, "ask", "yourself", "one", "question",
  "feel", "luck" /* was "lucky" */, "well", /* "ya" becomes "you", duplicate */ "punk"
]

// "Mama always said life was like a box of chocolates. You never know what you're gonna get."
[
  "mom" /* was "mama" */, "always", "say" /* was "said" */, "life", "like", "box",
  "chocolate" /* was "chocolates" */, "you", "never", "know", /* "youre" becomes "you", duplicate */ "get"
]

Tokenizer

Stemming*
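The sketch below reproduces the reductions shown above with a hand-written lookup table, then removes the duplicates that stemming creates. The table `STEMS` and the function `stemTokens` are assumptions made for illustration only. Real stemmers, such as the Snowball algorithms, derive stems from suffix rules rather than a dictionary and may produce different stems than the ones shown here.

```typescript
// Hypothetical lookup table copied from the example quotes; not a real stemmer.
const STEMS: ReadonlyMap<string, string> = new Map([
  ["youve", "you"],
  ["youre", "you"],
  ["ya", "you"],
  ["lucky", "luck"],
  ["mama", "mom"],
  ["said", "say"],
  ["chocolates", "chocolate"],
]);

// Sketch: replace each token with its stem, then drop duplicates created by stemming.
function stemTokens(tokens: string[]): string[] {
  const stemmed = tokens.map((token) => STEMS.get(token) ?? token);
  return Array.from(new Set(stemmed));
}

const stemmedHarry = stemTokens([
  "youve", "ask", "yourself", "one", "question", "feel", "lucky", "well", "ya", "punk",
]);
// ["you", "ask", "yourself", "one", "question", "feel", "luck", "well", "punk"]
```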

Snowball

https://snowballstem.org

English 🇺🇸🇬🇧🇦🇺

http://snowball.tartarus.org/algorithms/english/stemmer.html

German 🇩🇪

http://snowball.tartarus.org/algorithms/german/stemmer.html

Italian 🇮🇹

http://snowball.tartarus.org/algorithms/italian/stemmer.html

Finnish 🇫🇮

http://snowball.tartarus.org/algorithms/finnish/stemmer.html

[
  {
    "id": 1,
    "quote": ["alive"],
    ...
  },
  {
    "id": 2,
    "quote": ["you", "ask", "yourself", "one", "question", "feel", "luck", "well", "punk"],
    ...
  },
  {
    "id": 3,
    "quote": ["mom", "always", "say", "life", "like", "box", "chocolate", "you", "never", "know", "get"],
    ...
  }
]

Final Result

Remaining tokens
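Putting the steps together, a single `analyze` function could turn each document's quote into the remaining tokens shown above. Everything here is a sketch under assumptions: the `MovieQuote` and `AnalyzedQuote` types, the tiny `STOP_WORDS` set, and the `STEMS` lookup table are hypothetical stand-ins sized for the three example documents, not a general-purpose analyzer.

```typescript
interface MovieQuote { id: number; quote: string; movie: string; year: number }
interface AnalyzedQuote { id: number; quote: string[]; movie: string; year: number }

// Hypothetical stand-ins that cover only the example quotes.
const STOP_WORDS = new Set(["a", "do", "got", "gonna", "i", "its", "of", "to", "was", "what"]);
const STEMS = new Map([
  ["youve", "you"], ["youre", "you"], ["ya", "you"], ["lucky", "luck"],
  ["mama", "mom"], ["said", "say"], ["chocolates", "chocolate"],
]);

// Tokenize, lowercase, drop stop words, stem, then de-duplicate once at the end.
function analyze(text: string): string[] {
  const tokens = text
    .replace(/[^A-Za-z0-9\s]/g, "")
    .toLowerCase()
    .split(/\s+/)
    .filter((token) => token.length > 0 && !STOP_WORDS.has(token))
    .map((token) => STEMS.get(token) ?? token);
  return Array.from(new Set(tokens));
}

function analyzeDocuments(docs: MovieQuote[]): AnalyzedQuote[] {
  return docs.map((doc) => ({ ...doc, quote: analyze(doc.quote) }));
}

const analyzed = analyzeDocuments([
  { id: 1, quote: "It's alive! It's alive!", movie: "Frankenstein", year: 1931 },
  {
    id: 2,
    quote: "You've got to ask yourself one question: 'Do I feel lucky?' Well, do ya, punk?",
    movie: "Dirty Harry",
    year: 1971,
  },
  {
    id: 3,
    quote: "Mama always said life was like a box of chocolates. You never know what you're gonna get.",
    movie: "Forrest Gump",
    year: 1994,
  },
]);
```

De-duplicating once after stemming gives the same result as de-duplicating both before and after, because a `Set` keeps the first occurrence of each token.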

How do we want to store this data?
