Intro to ElasticSearch

by @vincent_lcy

WHAT ARE WE DOING HERE TODAY


For Developers

Front end / Back End

Assume you know about
JSON/REST
You don't need to be an expert in
ElasticSearch / Search / NoSQL / HTML5 / Java

I will talk about
Basics, concepts from different perspectives
Show you a demo 
Add search feature to your website 

Vincent LAU


JavaScript, Java

@vincent_lcy
kleineblase.wordpress.com/

Slides and Notes


http://bit.ly/hkosc2013_es

Powering...



GitHub: repos / every line of code / users

IS ES A SEARCH ENGINE?


http://www.elasticsearch.org/overview/


flexible and powerful open source, distributed real-time search and analytics engine for the cloud

IDEA ABOUT SEARCH IN 1 MINUTE


Don't look for a number in a phone book sorted by names



The original data structure is inefficient for lookup by value
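The phone-book problem can be sketched in a few lines: build a second structure keyed by the value you want to look up, which is what an index is. The names and numbers below are made up, and this is only an illustration of the idea, not how Lucene stores its index.

```python
from collections import defaultdict

# Hypothetical phone book: keyed by name, so finding a number means scanning.
phone_book = {
    "Alice": "2345-6789",
    "Bob": "2987-6543",
    "Carol": "2345-6789",  # shares a line with Alice
}


def build_index(book):
    """Invert name -> number into number -> sorted list of names."""
    index = defaultdict(list)
    for name, number in book.items():
        index[number].append(name)
    return {number: sorted(names) for number, names in index.items()}


number_index = build_index(phone_book)
print(number_index["2345-6789"])
```

The scan happens once, when the index is built; each lookup afterwards is a single dictionary access.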


"INDEX"

GOOGLE AS AN EXAMPLE

A crawler indexes the whole Web

Photo Source: http://ianieba.com/how-to-optimize-your-site-architecture/

ELASTICSEARCH...



Levenshtein Automata

Finite State Transducers





Search is Hard


APACHE SEARCH STACK

Lucene = Core Indexing/Search Libraries (Doug Cutting)
Hadoop - MapReduce (Doug Cutting)
Solr = Search Server w/ Lucene
Tika - Parser
Nutch (Doug Cutting) = Web Crawler / Search Engine = Lucene + (Hadoop) + (Solr)
 






Elastic


SO WHY SEARCH



AngularJS said: 
90% of applications are CRUD

I would say
Most apps are a good fit for
Search-based Navigation

Most others need search features anyway

Examples

Hotels.com, TripAdvisor

"Faceted Search"

Look Closer
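What a faceted sidebar does can be sketched without any search engine: count documents per field value, then narrow the result set when a value is clicked and recount. The hotel data and field names here are hypothetical, and this plain-Python version only illustrates the idea rather than any ES API.

```python
from collections import Counter

# Hypothetical hotel documents; field names are made up for illustration.
hotels = [
    {"name": "Harbour View", "district": "Central", "stars": 5},
    {"name": "Night Market Inn", "district": "Mong Kok", "stars": 3},
    {"name": "Peak Lodge", "district": "Central", "stars": 4},
    {"name": "Ferry Hostel", "district": "Tsim Sha Tsui", "stars": 3},
]


def facet_counts(docs, field):
    """Count how many documents fall under each value of `field`."""
    return Counter(doc[field] for doc in docs)


def drill_down(docs, field, value):
    """Keep only the documents whose `field` equals the chosen facet value."""
    return [doc for doc in docs if doc[field] == value]


district_facet = facet_counts(hotels, "district")
central = drill_down(hotels, "district", "Central")
star_facet = facet_counts(central, "stars")
print(dict(district_facet))
print(dict(star_facet))
```

Each click is just another filter plus a recount, which is why search and navigation blur into each other.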

IS ES A CRAWLER?

RIVER PLUGIN


Many storage services provide a feed of recent changes

/_changes, /_delta


ES will poll for changes and index them automatically
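The polling idea can be sketched as a loop that remembers the last change it saw and applies only what is newer. Everything below is hypothetical: `FakeStore` stands in for a storage service with a change feed, and a plain dict stands in for the index; it is not the river plugin's actual code.

```python
class FakeStore:
    """Hypothetical storage service that records every change in order."""

    def __init__(self):
        self.changes = []  # (seq, doc_id, doc), with doc=None for deletions

    def save(self, doc_id, doc):
        self.changes.append((len(self.changes) + 1, doc_id, doc))

    def delete(self, doc_id):
        self.changes.append((len(self.changes) + 1, doc_id, None))

    def changes_since(self, seq):
        """Return every change with a sequence number greater than `seq`."""
        return [change for change in self.changes if change[0] > seq]


def poll_once(store, index, last_seq):
    """Apply new changes to `index` and return the new checkpoint."""
    for seq, doc_id, doc in store.changes_since(last_seq):
        if doc is None:
            index.pop(doc_id, None)  # deletion: drop it from the index
        else:
            index[doc_id] = doc      # create or update
        last_seq = seq
    return last_seq


store, index, checkpoint = FakeStore(), {}, 0
store.save("a", {"title": "first"})
store.save("b", {"title": "second"})
checkpoint = poll_once(store, index, checkpoint)
store.delete("a")
checkpoint = poll_once(store, index, checkpoint)
print(checkpoint, index)
```

Keeping the checkpoint is what lets the poller resume after a restart without re-reading the whole store.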

 

http://www.elasticsearch.org/guide/en/elasticsearch/rivers/current/river.html
http://guide.couchdb.org/draft/notifications.html 

IS ES A DB?

Better question


Is ES good enough as a DB for my app?

Security concerns

Comparison


No eventual consistency:
 Each ElasticSearch operation is atomic, durable, and isolated. An operation is hashed to a specific shard, performed on it, and then replicated to all its replicas. When the operation returns, it has already been replicated to all the replicas and it is "safely" there
CAP:   
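"Hashed to a specific shard" can be pictured as a stable hash of the document id taken modulo the number of shards. This is a sketch under assumptions: `crc32` is a stand-in hash picked for the example and the shard count is made up; neither is claimed to match what ElasticSearch does internally.

```python
import zlib

NUM_PRIMARY_SHARDS = 5  # assumed value for illustration


def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from a stable hash of the document id.

    crc32 is a stand-in hash chosen for this sketch; it is not claimed to be
    the hash function ElasticSearch uses.
    """
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


for doc_id in ["user-1", "user-2", "user-3"]:
    print(doc_id, "-> shard", shard_for(doc_id))
```

The same id always maps to the same shard, so reads and writes for one document meet in one place; in this sketch, changing the shard count would move documents around.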

MOST DB FEATURES



Transaction
High Availability
Sort
Query
Data Types
Storage
In-memory Cache

ELASTICSEARCH MAY NOT BE THE ANSWER FOR EVERY QUERY



Report generation? Archiving?


Graph-based query -> Graph DB
Neo4j

IS IT NoSQL?


Good for small/big Data!

Key Value vs Document-based
 
Document: Lucene Document
The DB sees its structure
-> field-based query, retrieval & indexing





by Pramod J Sadalage & Martin Fowler




WHAT CAN I INDEX?


ANYTHING!

Transform it into input for ES: anything JSON-compatible



Natural fit for a document-oriented database



IS IT A WEB SERVER?



RESTful HTTP API

Index / Query 
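As a rough picture of "index" and "query" over HTTP, the sketch below only builds the two requests and prints them, without sending anything. The host, index name `talks`, type name `talk` and document are assumptions for illustration; the index/type/id URL layout follows the ES releases of this talk's era, and newer versions have dropped types.

```python
import json
from urllib.parse import quote
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # assumed local node on the default port


def index_request(index, doc_type, doc_id, doc):
    """Build (but do not send) a PUT that stores `doc` under the given id."""
    url = f"{BASE_URL}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(doc).encode("utf-8")
    return Request(url, data=body, method="PUT",
                   headers={"Content-Type": "application/json"})


def search_request(index, text):
    """Build (but do not send) a search using the `q` query-string parameter."""
    return Request(f"{BASE_URL}/{index}/_search?q={quote(text)}", method="GET")


put = index_request("talks", "talk", "1", {"title": "Intro to ElasticSearch"})
get = search_request("talks", "title:intro")
print(put.get_method(), put.full_url)
print(get.get_method(), get.full_url)
```

To try it against a running node, pass either request to `urllib.request.urlopen`.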




Even hosting static files - site plugin

Models




Security Concern!

Nginx to route

MODELS




ES as Search Service


Pure ES powered app 

(ES as web server & DB)

MY TRY


HK Light Pollution Map
lightpollution.hk    /   www.facebook.com/lightpollutionmap
v1:    CouchDB -> Auto River Feed to ES for indexing
v2:    Use ES as the main DB
+AngularJS for Search-based Navigation
LeafletJS/ExpressJS

Master's Project: Search Files
Use ES + Attachment Plugin to
index filesystems and cloud services
+ AngularJS for web-based, faceted file search

SOME DEMOS


  • Basic Search
  • Mapping - how the indexed document should be created
  • Faceted Search
  • Search Engine for files
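For the mapping demo, a mapping is a JSON description of how each field should be indexed. The sketch below is an assumption-heavy illustration: the field names are invented for a light-pollution measurement, and the type names (`text`, `keyword`, `geo_point`, `date`, `float`) come from later Elasticsearch releases than this talk, so check them against the version you run.

```python
import json

# Hypothetical mapping for a light-pollution measurement document.
# Field names are made up; type names are assumed from recent releases.
mapping = {
    "properties": {
        "description": {"type": "text"},    # analyzed for full-text search
        "district": {"type": "keyword"},    # kept whole, suits facets
        "location": {"type": "geo_point"},  # latitude/longitude pair
        "measured_at": {"type": "date"},
        "brightness": {"type": "float"},
    }
}

mapping_body = json.dumps(mapping, indent=2)
print(mapping_body)
```

The split between an analyzed field and a whole-value field is the part that matters for the faceted demo: facets count whole values, full-text search matches analyzed terms.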

Conclusion


Try IT!

Change your thoughts on the boundary between Search / Navigation  

From Quick Prototype to Boss Level


VERY EASY TO START WITH


20min to add search box to my app
ES x Bootstrap UI 
Takes time to understand everything
Takes knowledge to optimize
Takes a real expert to optimize the core (Lucene!)

Building nice faceted search w/ AngularJS and ES
Lots of libraries out there

WHAT MAKES ES REALLY POWERFUL


High Availability

Scalability


The single most important video you should watch

http://www.elasticsearch.org/videos/distributed-diagram/

THANKS



Things not covered


Real-time monitoring

Search all your logs in a cluster (logstash, Hadoop etc)



IDEA ABOUT SEARCH IN 3 MINUTES


Term

Document

Corpus






Tokenize

e.g. CJK: 我們是快樂的好兒童 -> 我們, 是, 快樂, 的, 好, 兒童 ("We are happy, good children")

Stop Words - and, or, is

Stemming - "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu"

Analyzer -> combines the above to turn text into index terms
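The analyzer pipeline can be sketched as three small functions chained together. This is a toy under stated assumptions: the stop-word list is tiny, and the suffix rules are a naive stand-in for a real stemmer such as Porter's, chosen only so that the "argu" example works.

```python
import re

STOP_WORDS = {"and", "or", "is", "the", "a"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "es", "s", "e")      # naive rules, not Porter stemming


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Tokenize, drop stop words, then stem what is left."""
    return [stem(t) for t in tokenize(text) if t not in STOP_WORDS]


print(analyze("argue argued argues arguing argus"))
```

The same analysis has to run on the query text too, otherwise a search for "arguing" would never meet the stored term "argu".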


TF-IDF


a numerical statistic which reflects how important a word is to a document in a collection or corpus
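A small worked example may help: the textbook form of TF-IDF multiplies how often a term appears in a document by how rare it is across the corpus. The corpus below is made up, and this is the plain textbook variant, not claimed to match Lucene's scoring formula.

```python
import math

# Tiny hypothetical corpus, already analyzed into terms.
corpus = {
    "doc1": ["search", "engine", "search"],
    "doc2": ["database", "engine"],
    "doc3": ["web", "server"],
}


def tf(term, terms):
    """Term frequency: share of the document's terms that equal `term`."""
    return terms.count(term) / len(terms)


def idf(term, docs):
    """Inverse document frequency: log(N / number of docs containing term)."""
    containing = sum(1 for terms in docs.values() if term in terms)
    return math.log(len(docs) / containing) if containing else 0.0


def tf_idf(term, doc_id, docs):
    """Weight of `term` in one document of the corpus."""
    return tf(term, docs[doc_id]) * idf(term, docs)


for term in ("search", "engine"):
    print(term, round(tf_idf(term, "doc1", corpus), 3))
```

In doc1, "search" outweighs "engine" on both counts: it appears twice rather than once, and only one document contains it.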


More Like This VS Fuzzy Like This VS Fuzzy Query

More Like This -> Find a similar Document

Fuzzy Like This -> comparing criteria with multiple fields

Fuzzy Query -> matches terms within a Levenshtein edit distance limit of the query term
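The edit distance behind a fuzzy query can be sketched with the classic dynamic-programming algorithm. This brute-force version compares the query against every term, which is exactly the scan that the Levenshtein automata mentioned earlier are meant to avoid; the term list is made up.

```python
def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def fuzzy_match(query, terms, max_edits=2):
    """Return the indexed terms within `max_edits` of the query term."""
    return [t for t in terms if levenshtein(query, t) <= max_edits]


terms = ["elastic", "elastics", "plastic", "search", "lucene"]
print(fuzzy_match("elastik", terms))
```

Lowering `max_edits` to 1 keeps only the closest spelling, which is the usual trade-off between recall and noise.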

Lucene




Photo source:  http://www.ibm.com/developerworks/library/wa-lucene/

Search Filter

- search within search
- efficient: instead of discarding results, optimized query?
- vs Query: no scoring

Examples

Term Range Filter = Term Range Query without scoring

Span Query


- takes the positions of terms into account



SpanFirstQuery
matches spans within the first specified number of positions of a field

SpanNearQuery
matches spans within a certain number of positions of each other
scores higher if the two terms are closer
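Position-aware matching can be sketched on plain token lists. The functions below are toys written for this note: `within_first` mimics the idea of a span-first match, `near_score` mimics a span-near match with a `slop` limit, and the scoring rule is invented for illustration, not Lucene's formula.

```python
def positions(tokens, term):
    """All positions at which `term` occurs in the token list."""
    return [i for i, t in enumerate(tokens) if t == term]


def within_first(tokens, term, end):
    """True if `term` occurs within the first `end` positions."""
    return any(p < end for p in positions(tokens, term))


def min_gap(tokens, first, second):
    """Smallest number of tokens between `first` and `second`, or None."""
    gaps = [abs(p - q) - 1
            for p in positions(tokens, first)
            for q in positions(tokens, second)
            if p != q]
    return min(gaps) if gaps else None


def near_score(tokens, first, second, slop):
    """Toy score: 0 when outside `slop`, higher when the terms are closer."""
    gap = min_gap(tokens, first, second)
    if gap is None or gap > slop:
        return 0.0
    return 1.0 / (1 + gap)


doc_a = "elastic search is a distributed search engine".split()
doc_b = "search engines can be made elastic".split()
print(near_score(doc_a, "elastic", "search", slop=3))
print(near_score(doc_b, "elastic", "search", slop=3))
```

Both documents contain both terms, so a plain term query would match each; only the position-aware version can tell that the terms sit side by side in one and far apart in the other.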

REAL TIME? NEAR REAL TIME?



delay:

request w/ heavy load:
faceting, sorting

by:
indexing
disk IO


optimized:
warm up
