Elasticsearch

Why do we need it?

Index columns in RDBMS

  • Works well for exact-match and starts-with (prefix) queries

SELECT * 
FROM user 
WHERE name = 'John Doe'
AND user_id = 21
AND birth_date > '2007-08-02';

How about?

  • Searching web pages for content on "blue sky"
  • Searching for "7th Sector, HSR layout, Bangalore" in an unstructured address registry

Apache Lucene

A high-performance, full-featured text search engine library

Inverted Index

  • Break each document into tokens
  • Index the sorted set of tokens
  • Map each token to the documents that contain it and to its position within each document
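
The three steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores its index: the `tokenize` and `build_inverted_index` helpers and the sample documents are made up for this slide.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase word tokens (a deliberately naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each token to {doc_id: [positions]} across all documents."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            index[token].setdefault(doc_id, []).append(position)
    # Return the terms in sorted order, like the sorted token set on the slide.
    return dict(sorted(index.items()))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "The sky is blue",
    2: "Blue sky over Bangalore",
}
index = build_inverted_index(docs)
print(index["blue"])  # documents and positions where "blue" occurs
```

Looking up a term is now a dictionary access instead of a scan over every document, which is what makes full-text queries cheap.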

What if we need

  • A search for "UK" to match "United Kingdom"
  • A search for "jump" to match "jumped", "jumps" and perhaps even "leap"
  • A search for "johnny walker" to match "Johnnie Walker"
  • A search for "fox news hunting" to return stories about hunting on Fox News, while "fox hunting news" returns news stories about fox hunting

Analysis

  • Pre-tokenisation (character) filter: convert "&" to "and", strip HTML tags, etc.
  • Tokenisation
  • Post-tokenisation (token) filters
    • Stemming: reduce "bikes" => "bike"
    • Text normalisation: lowercasing, stripping accents, etc.
    • Stop-word filtering: remove words like "the", "and" and "a"
    • Synonym expansion: convert "UK" => "United Kingdom"
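
A rough sketch of such an analysis chain in Python, assuming a toy stemmer that only strips a trailing "s" and a hard-coded stop-word and synonym list. Real analyzers are far more careful; every name here is illustrative only.

```python
import re
import unicodedata

STOP_WORDS = {"the", "and", "a"}                 # assumed stop-word list
SYNONYMS = {"uk": ["united", "kingdom"]}         # assumed synonym table


def char_filter(text):
    """Pre-tokenisation: drop HTML tags and spell out '&'."""
    text = re.sub(r"<[^>]+>", " ", text)
    return text.replace("&", " and ")


def tokenize(text):
    """Tokenisation: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def strip_accents(token):
    """Text normalisation: remove combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def stem(token):
    """Toy stemmer: strip one trailing 's' (not a real stemming algorithm)."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def analyze(text):
    """Run the character filter, the tokenizer, then the token filters."""
    tokens = []
    for token in tokenize(char_filter(text)):
        token = strip_accents(token)
        if token in STOP_WORDS:
            continue
        for expanded in SYNONYMS.get(token, [token]):
            tokens.append(stem(expanded))
    return tokens


print(analyze("<b>Bikes</b> & caf\u00e9s in the UK"))
```

The same chain has to run at index time and at query time, otherwise the query tokens would not line up with the indexed tokens.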

Scoring / Ranking Results

  • Term Frequency (TF): the more often a term appears in a document, the higher that document ranks
  • Inverse Document Frequency (IDF): the fewer documents a term appears in, the more weight it carries, so documents containing rare terms rank higher
  • Boost: a weight supplied in the query to make a term or field count for more
  • Other factors...
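
A minimal sketch of how term frequency, inverse document frequency and boost could combine into a score. It uses a simplified textbook TF-IDF, which is an assumption made for this slide and not Lucene's exact scoring formula; the function name and sample data are made up.

```python
import math


def tf_idf_scores(query_terms, docs, boosts=None):
    """Score every document as the sum over query terms of tf * idf * boost."""
    boosts = boosts or {}
    n_docs = len(docs)
    scores = {}
    for doc_id, tokens in docs.items():
        score = 0.0
        for term in query_terms:
            tf = tokens.count(term)                             # term frequency
            df = sum(1 for other in docs.values() if term in other)
            if tf == 0 or df == 0:
                continue
            idf = math.log(n_docs / df) + 1.0                   # rarer term, larger idf
            score += tf * idf * boosts.get(term, 1.0)           # boost defaults to 1
        scores[doc_id] = score
    return scores


# Hypothetical, already-analysed documents.
docs = {
    "a": ["blue", "sky", "blue", "sea"],
    "b": ["blue", "car"],
    "c": ["grey", "sky"],
}
scores = tf_idf_scores(["blue", "sea"], docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Document "a" ranks first because it contains "blue" twice and is the only one with the rare term "sea".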

Features

  • Fuzzy search: handles typos
  • Phrase queries / proximity queries
  • Highlighting of matched terms in results
  • Facets / aggregations: drill down into results further (e.g. e-commerce sites)
  • Fielded search (blog title, author, tags)
  • Data types: text, numbers, dates
  • Dynamic ranges: geolocation and distance filters
  • More...
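
To make the fuzzy search bullet concrete, here is a small sketch based on plain Levenshtein edit distance. It is an assumption-laden illustration: production engines use optimised automata rather than comparing against every term, and `fuzzy_match` and the vocabulary are invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum inserts, deletes and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, vocabulary, max_edits=2):
    """Return indexed terms within max_edits of the (possibly misspelt) term."""
    return sorted(w for w in vocabulary if edit_distance(term, w) <= max_edits)


vocabulary = {"johnnie", "walker", "whisky"}   # hypothetical indexed terms
print(fuzzy_match("johnny", vocabulary))
```

With `max_edits=2`, the typo "johnny" still reaches the indexed term "johnnie".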

What does Elasticsearch do?

  • Distributed search: near-real-time results on large datasets
  • Document-oriented data store: JSON documents
  • RESTful API, Query DSL
  • High availability: replication, automatic failover
  • More...

Elasticsearch Architecture

  • Cluster
    • Nodes
      • Shards: Primary (P) or Replica (R)
  • Each shard is a Lucene index
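
The cluster, node and shard hierarchy can be pictured with a toy allocator. This is a simple round-robin sketch and not Elasticsearch's real allocation algorithm; the only property it tries to mirror is that a replica never sits on the same node as its primary. The function and node names are hypothetical.

```python
def allocate_shards(nodes, num_primaries, num_replicas=1):
    """Spread primary (P) and replica (R) shard copies over nodes, round-robin."""
    if num_replicas >= len(nodes):
        raise ValueError("need more nodes than replicas to keep copies apart")
    layout = {node: [] for node in nodes}
    for shard in range(num_primaries):
        for copy in range(num_replicas + 1):
            # Each copy of a shard is shifted to the next node in the ring.
            node = nodes[(shard + copy) % len(nodes)]
            kind = "P" if copy == 0 else "R"
            layout[node].append(f"{kind}{shard}")
    return layout


layout = allocate_shards(["node-1", "node-2", "node-3"], num_primaries=3)
for node, shards in layout.items():
    print(node, shards)
```

If one node goes down, every shard it held still has a copy on another node, which is what makes automatic failover possible.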

API Demo
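
As a stand-in for the live demo, here is a sketch of what a search request against the RESTful API might look like. It only builds the request and does not send it. The host `http://localhost:9200`, the index name `blogs` and the field `title` are assumptions for illustration; the body is a `match` query in the Query DSL with fuzziness enabled.

```python
import json
from urllib.request import Request


def build_search_request(base_url, index, field, text, fuzziness="AUTO"):
    """Build (but do not send) a POST <index>/_search request with a match query."""
    body = {
        "query": {
            "match": {
                field: {"query": text, "fuzziness": fuzziness}
            }
        }
    }
    return Request(
        url=f"{base_url}/{index}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical cluster address, index and field.
request = build_search_request("http://localhost:9200", "blogs", "title", "blue sky")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing `request` to `urllib.request.urlopen` would send it to a running cluster and return the hits as JSON.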

The End
