Elasticsearch

Why do we need it?

Index columns in RDBMS

Works well for exact-match and starts-with (prefix) queries

Usually implemented using a B-tree
Visualisation Demo

SELECT * 
FROM user 
WHERE name = 'John Doe'
AND user_id = 21
AND birth_date > '2007-08-02'

How about?

Search web pages with content on "blue sky"
Search for "7th Sector, HSR layout, Bangalore" in an unstructured address registry

Apache Lucene

A high-performance, full-featured text search engine library

Inverted Index

Break a document into tokens

Index the sorted set of tokens

Map each token to the documents that contain it and to its positions within each document
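
The three steps above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's on-disk format: `tokenize` and `build_inverted_index` are hypothetical helper names, and the tokenizer simply lowercases and splits on whitespace.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for a real tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each token to {doc_id: [positions of the token in that document]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    return index

docs = {
    1: "the blue sky",
    2: "blue whale in the blue sea",
}
index = build_inverted_index(docs)
sorted_terms = sorted(index)  # the sorted term dictionary

print(sorted_terms)    # ['blue', 'in', 'sea', 'sky', 'the', 'whale']
print(index["blue"])   # {1: [1], 2: [0, 4]}
```

Looking up "blue" is now a dictionary lookup rather than a scan of every document, and the stored positions are what make phrase and proximity queries possible.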

What if we need

A search for UK to match United Kingdom.
A search for jump to match jumped, jumps and perhaps even leap.
A search for johnny walker to match Johnnie Walker.
A search for fox news hunting should return stories about hunting on Fox News, while fox hunting news should return news stories about fox hunting.

Analysis

Pre-tokenisation filter: Convert "&" to "and", strip HTML tags, etc.
Tokenisation
Post-tokenisation filters
- Stemming: Convert "bikes" => "bike"
- Text Normalisation: Stripping accents etc
- Stop Words Filtering: Remove words like "the", "and" and "a"
- Synonym Expansion: Convert "UK" => "United Kingdom"
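
A minimal sketch of such an analysis chain, assuming a toy stemmer that only drops a trailing "s" and a one-entry synonym table. All names here (`char_filter`, `analyze`, `STOP_WORDS`, `SYNONYMS`) are hypothetical, and the filter order is one possible choice rather than a fixed rule.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and", "a"}
SYNONYMS = {"uk": ["united", "kingdom"]}

def char_filter(text):
    """Pre-tokenisation: strip HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")

def tokenize(text):
    """Tokenisation: lowercase, then split into runs of word characters."""
    return re.findall(r"\w+", text.lower())

def strip_accents(token):
    """Normalisation: decompose characters and drop the combining accents."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def stem(token):
    """Toy stemmer: drop one trailing 's' (real stemmers do far more)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    tokens = []
    for token in tokenize(char_filter(text)):
        token = strip_accents(token)
        if token in STOP_WORDS:
            continue  # stop-word filtering
        for expanded in SYNONYMS.get(token, [token]):  # synonym expansion
            tokens.append(stem(expanded))
    return tokens

print(analyze("<b>Bikes</b> & caf\u00e9s in the UK"))
# ['bike', 'cafe', 'in', 'united', 'kingdom']
```

The same `analyze` function has to run on both the documents at index time and the query at search time, otherwise the tokens will not line up.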

Scoring / Ranking Results

Term Frequency (TF): The more often a term appears in a document, the higher that document ranks
Inverse Document Frequency (IDF): The fewer documents a term appears in, the more weight it carries, so documents containing rare terms rank higher
Boost: A parameter provided in the query to give a term or field more weight
Other factors...
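
The factors above can be combined into a simplified TF-IDF score. This is an illustrative sketch only: the formula below (raw count times log(1 + N/df) times boost) is an assumption chosen for readability and is not Lucene's actual similarity function. `tf_idf_scores` and its parameters are hypothetical names.

```python
import math
from collections import Counter

def tf_idf_scores(query_terms, docs, boosts=None):
    """Score every document against the query terms with a simplified TF-IDF."""
    boosts = boosts or {}
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = counts[term]  # term frequency in this document
            if tf == 0:
                continue
            idf = math.log(1 + n_docs / doc_freq[term])  # rarer term, larger idf
            score += tf * idf * boosts.get(term, 1.0)
        scores[doc_id] = score
    return scores

docs = {1: "blue sky blue sea", 2: "grey sky", 3: "green grass"}
scores = tf_idf_scores(["blue", "sky"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # [1, 2, 3]
```

Document 1 wins because it contains "blue" twice and "blue" is rarer than "sky"; passing `boosts={"sky": 2.0}` would double the contribution of "sky" in every document that contains it.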

Features

Fuzzy search : Handle typos
Phrase queries / proximity queries
Highlighting : Mark the matched terms in the results
Facets / Aggregations : Drill down results further (e.g. e-commerce sites)
Fielded search (Blog title, author, tags)
Datatypes - Text, Numbers, Dates
Dynamic Range : Geo location and distance filters
More..
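
As an illustration of the fuzzy search feature, typo tolerance can be modelled with edit distance. The sketch below uses plain Levenshtein distance and a brute-force scan over the terms; `edit_distance` and `fuzzy_match` are hypothetical names, and a real engine avoids comparing the query against every term.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]

def fuzzy_match(query, terms, max_edits=2):
    """Return the indexed terms within max_edits of the query term."""
    return [term for term in terms if edit_distance(query, term) <= max_edits]

print(edit_distance("johnny", "johnnie"))                       # 2
print(fuzzy_match("johnny", ["johnnie", "walker", "jumped"]))   # ['johnnie']
```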

What does Elasticsearch do?

Distributed search: Near real-time results on large datasets
Document Oriented Data Store : JSON documents
RESTful API, Query DSL
High Availability : Replication, Automatic Fail Over
More ..
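
To give a feel for the RESTful API and Query DSL, the sketch below only builds the path and JSON body of a simple `match` query and does not send it anywhere. The index name `blog` and the field `title` are made-up examples, and `build_search_request` is a hypothetical helper.

```python
import json

def build_search_request(index, field, text, size=10):
    """Return the path and JSON body for a basic full-text match query."""
    path = f"/{index}/_search"
    body = {
        "query": {"match": {field: text}},  # analysed full-text match on one field
        "size": size,                       # maximum number of hits to return
    }
    return path, json.dumps(body)

path, body = build_search_request("blog", "title", "blue sky", size=5)
print("GET", path)
print(body)
```

Sending this body to a running cluster with any HTTP client would return a JSON response containing the scored hits.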

Cluster
- Nodes
  - Shards : Primary(P) or Replica(R)
Each shard is a Lucene index
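
A rough sketch of how a document could be assigned to a primary shard: hash a routing value (here the document id) and take it modulo the number of primary shards. `zlib.crc32` is only a stand-in hash for illustration, and `route_to_shard` is a hypothetical name; the real routing logic has more detail than this.

```python
import zlib

def route_to_shard(doc_id, num_primary_shards):
    """Pick a primary shard for a document: hash(routing value) % shard count."""
    routing_hash = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash function
    return routing_hash % num_primary_shards

num_primary_shards = 3
placement = {}
for doc_id in ["user-1", "user-2", "user-3", "user-4", "user-5"]:
    placement[doc_id] = route_to_shard(doc_id, num_primary_shards)
    print(doc_id, "-> shard", placement[doc_id])
```

Because the shard number depends on the primary shard count, changing that count would send existing ids to different shards, which is why the number of primary shards is normally chosen when the index is created.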

Elasticsearch Architecture

API Demo

The End
