Scaling the Document Repository with Elasticsearch

Some Context

What we Do and What Problems We Try to Solve

Nuxeo

  • Nuxeo
    • we provide a Platform that developers can use to build highly customized Content Applications
    • we provide components, and the tools to assemble them
    • everything we do is open source (for real)
       
  • various customers - various use cases
    • Track game builds
    • Electronic Flight Bags
    • Central repository for Models
    • Food industry PLM

  • me: developer & CTO - joined the Nuxeo project 10+ years ago

https://github.com/nuxeo

Document Repository

  • Store Documents / Assets / Objects
    • Blob objects
    • Complex data Structures
    • Hierarchy, references and links
  • Audit trail & Versioning
  • Data level security & encryption
  • Lifecycle, workflows ...

  • API (REST, CMIS, Java, JS...)
    • CRUD
    • Search
    • Service API

Heavily configurable : all data structures are flexible / customizable

Used by developers to build Content Applications on top of the Nuxeo Repository

Our Challenges

  • CRUD on large repository works
    • inject at 6,000 docs/s up to 1 Billion
    • not so many companies have that many documents anyway
       
  • Queries are the main scalability issue
    • impact of create / update / delete (C_UD) vs. search
    • multi-criteria queries + full-text
    • security filtering
    • configurable data structures
    • user defined queries 
    • UI heavily depends on search

Search API is the most used: search is the main scalability challenge

History : Nuxeo & Lucene

  • 2006: Nuxeo CPS 3.6 (Python / Zope based)
    • Replace built-in index with Lucene + XML-RPC server
    • PyLucene (GCJ build + Python bindings!)
    • Complex setup

  • 2007: Nuxeo Platform 5.1
    • JCR: queries (and backup) issues
    • Integrate Compass Core: transactional & storage abstraction
    • Missing sync & concurrency issues

  • 2009: Nuxeo 5.2
    • VCS: home-grown SQL-based repository
    • Search in database, but with some real limitations

  • 2013 / 2014: Nuxeo 5.9.3
    • Reintroduce Lucene in the stack via elasticsearch
      • Learn from our past mistakes
      • Leverage elasticsearch architecture
        • easy deployment
        • safe indexing
        • powerful search

...  we are now happy with Elasticsearch 

Lucene and Nuxeo have a long history ...

Repository & Search

Understanding the Issue

Complex SQL Queries

  • Configurable Data Structure + user-defined multi-criteria searches
    => multiple & complex SQL queries

-- Multi-criteria search: type filter + full-text match on 'sydney',
-- excluding versions and deleted documents, newest first
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
   JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
   LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
   LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
 WHERE
  ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
  AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
  AND ("hierarchy"."isversion" IS NULL)
  AND ("_F1"."lifecyclestate" <> 'deleted')
  AND ("_F2"."created" IS NOT NULL)
 ORDER BY "_F2"."created" DESC
 LIMIT 201 OFFSET 0;

About SQL Limitations

  • Scaling queries is complex
    • depends on indexes, I/O speed and available memory
      • cannot satisfy all types of queries
    • poor performance on unselective multi-criteria queries
      • some types of queries simply cannot be fast in SQL
  • Scalability
    • Scale up is expensive
    • Scale out is complex at best (XA & MVCC)
    • Sharding requires a global index
  • Fulltext support is usually poor
    • limitations on features & impact on performance

SQL technology is not the solution

Is NoSQL the solution?!

Using NoSQL for the repository

About the NoSQL option

  • (sadly) NoSQL is no magic
    • it does work very well for CRUD and it scales easily, but
      • query options are limited and performance is not that good
      • multi-document transactions are usually not safe
    • better suited to DBs with billions of entries and simple queries

  • SQL has some real advantages
    • ACID (and MVCC) is good
      • Workflows and bulk updates are a typical use case
      • (even transient) lack of consistency is complex to explain to users
    • lots of existing tools (BI & reporting), lots of existing skills (DBA)
    • PostgreSQL (or AWS RDS) can be very cost effective

Neither a SQL nor a NoSQL repository is the solution on its own

Keep the repository (SQL or NoSQL), but find a super fast index engine

Repository & ElasticSearch

Toward a Hybrid Storage

Hybrid Storage

  • Use each storage solution for what it does best
    • SQL DB
      • store content in an ACID way
        • store & retrieve
        • queries that need ACID and MVCC
    • elasticsearch
      • provide powerful and scalable queries
      • do the heavy lifting that the RDBMS cannot do
        • scoring, native full-text, aggregates
        • distributed search

Route the query to the correct index depending on requirements
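
The routing rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the Nuxeo implementation: the `Query` class, its flags and the `choose_backend` function are hypothetical names, and the default backend is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical description of what a query needs."""
    text: str
    needs_transactional_view: bool = False  # must see the current transaction's changes
    uses_fulltext: bool = False
    uses_aggregates: bool = False


def choose_backend(query: Query, default: str = "elasticsearch") -> str:
    """Return 'repository' or 'elasticsearch' for the given query."""
    # Queries that rely on ACID / MVCC semantics stay on the repository.
    if query.needs_transactional_view:
        return "repository"
    # Scoring, native full-text and aggregates are what the index is good at.
    if query.uses_fulltext or query.uses_aggregates:
        return "elasticsearch"
    # Anything else goes to the configured default (an assumption here).
    return default
```

In this sketch the transactional requirement wins over everything else, which matches the idea that code needing consistency queries the repository directly.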

Elasticsearch & Repository

One query, several possible backends

Performance results

  • Fast indexing
    • No ACID constraints / No impedance issue
    • 3,500 documents/s when using SQL backend
    • 10,000 documents/s when using MongoDB

  • Super query performance
    • query on term using inverted index
    • very efficient caching
    • native full text support & distributed architecture
    • 3,000 queries/s with 1 elasticsearch node
    • 6,000 queries/s with 2 elasticsearch nodes

some real life feedback

Customer: We are now testing the Nuxeo 6 stack in AWS. The DB is a PostgreSQL db.r3.8xlarge, which has 32 CPUs. Between 350 and 400 tps the DB CPU is maxed out.

Nuxeo support: Please activate nuxeo-elasticsearch!

Customer: We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?

Nuxeo support: It looks like you have some network congestion between your client and the servers.

Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...

SQL vs ElasticSearch

Scalability is simply of another order of magnitude

Scale out

Unified Index on Sharded Repository

  • Tested with 10 PostgreSQL databases
    • 10 x 100 Million documents => 1 Billion documents
    • 1 elasticsearch cluster

Is this magic?

  • For users 

    • it really looks like magic
       

  • For sales guys & solution architects

    • it is magic: it unleashes a lot of possibilities

      • performance is just one aspect
         

  • For Nuxeo Core Dev team

    • it was almost magic: some integration work was needed

Integrating ElasticSearch

Inside nuxeo-elasticsearch Plugin

Challenges to address

  • Keep index in sync with the repository
    • No transaction management
    • Do not lose anything
    • Without support for update

  • Mitigate the effects of eventual consistency
    • Avoid displaying transient inconsistent state

  • Handle security filtering
    • Without join
    • Without post-filtering

Security Filtering

  • Constraints
    • Filtering must be done at index level: no post-filtering
    • Join is not an option
      • cannot join with the DB or within Lucene (previously tested without success)
  • Solution
    • index the ReadACL as part of the JSON Document
      • list of groups / users who can read the resource
    • automatically add a filter clause on ACL

  • Consequences
    • Recursive indexing is needed
    • More pressure on the re-indexing process
      • as a last resort: the Document security is checked by the repository anyway
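
As a rough sketch of the "filter clause on ACL" idea, the function below wraps any user query in an Elasticsearch `bool` query with a `terms` filter on the indexed read ACL. The function name and the `ecm:acl` field name are assumptions made for the example, and the `bool` / `filter` form shown is the one from current Elasticsearch query DSL, which may differ from what the plugin generated at the time.

```python
def with_read_acl_filter(user_query: dict, principals: list[str],
                         acl_field: str = "ecm:acl") -> dict:
    """Wrap a query so that only documents readable by `principals` match.

    `principals` is the user name plus the groups the user belongs to.
    `acl_field` is the (assumed) name of the indexed ReadACL field.
    """
    return {
        "bool": {
            # The original query still drives matching and scoring.
            "must": [user_query],
            # The ACL clause only filters: a document passes if its ReadACL
            # contains at least one of the user's principals.
            "filter": [{"terms": {acl_field: principals}}],
        }
    }


# Usage: restrict a title search to what "alice" (member of "staff") can read.
secured = with_read_acl_filter({"match": {"dc:title": "sydney"}},
                               ["alice", "staff"])
```

Because the clause is part of the query itself, no join and no post-filtering of results are needed.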

Safe Indexing Flow

  • Do not try to make it transactional
    • Collect and de-duplicate Repository Events during Transaction
    • Wait for commit to be done at the repository level
      • then call elasticsearch

  • Do not lose any update
    • run Indexing Tasks in a distributed Job infrastructure
      • Jobs should be persisted
      • Jobs should be retried
      • Jobs should be monitored
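
The collect / de-duplicate / wait-for-commit steps can be sketched as follows. This is a simplified illustration, not the plugin's code: `IndexingCollector`, the event names and the `submit_job` callable (standing in for a persisted, retryable job queue) are all hypothetical.

```python
from collections import OrderedDict


class IndexingCollector:
    """Collects indexing commands during one repository transaction."""

    def __init__(self, submit_job):
        # submit_job(command, doc_id) is assumed to enqueue a persisted,
        # retryable job in the job infrastructure.
        self._submit_job = submit_job
        self._pending = OrderedDict()  # doc_id -> command, de-duplicated

    def on_repository_event(self, doc_id, event):
        # Several events on the same document collapse into one command.
        # Simplification: a delete always wins over earlier events.
        if event == "deleted":
            self._pending[doc_id] = "delete"
        elif self._pending.get(doc_id) != "delete":
            self._pending[doc_id] = "index"

    def after_commit(self):
        # Called only once the repository commit has succeeded.
        for doc_id, command in self._pending.items():
            self._submit_job(command, doc_id)
        self._pending.clear()

    def after_rollback(self):
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()
```

Nothing is sent to elasticsearch while the transaction is open, so a rollback never leaves stale entries in the index.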

Async Indexing Flow

Mitigating Eventual Consistency

  • In the code
    • use case: need to see results from within the transaction
    • query directly on the repository
      • leverage ACID and MVCC of the SQL repository
      • full-text search and facets are usually not needed by the code

  • For the users
    • use case: see changes in listings in "real time"
    • use pseudo-real time indexing
      • indexing actions triggered by UI threads are flagged
      • run as afterCompletion listener
      • refresh elasticsearch index
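
A minimal sketch of the pseudo-real time path, under the assumption that each indexing command is a dict carrying a hypothetical `sync` flag set for actions triggered by UI threads. The three callables are placeholders for the real indexing, refresh and job-queue operations.

```python
def dispatch_after_completion(commands, index_now, refresh_index, enqueue_job):
    """Split committed indexing commands into a sync and an async path.

    Returns True when at least one command was indexed synchronously.
    """
    sync_done = False
    for command in commands:
        if command.get("sync"):
            index_now(command)      # indexed in the calling (UI) thread
            sync_done = True
        else:
            enqueue_job(command)    # indexed later by the job infrastructure
    if sync_done:
        # A refresh makes the freshly indexed documents visible to search,
        # so the user sees the change in the next listing.
        refresh_index()
    return sync_done
```

The refresh is only issued when something was indexed synchronously, since refreshing on every transaction would cost indexing throughput.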

Pseudo-Sync Indexing Flow

Does this work ?

  • Live for about 18 months now 
     
  • No missing sync issue
    • some customers asked for verification tools
    • but no problem was found
    • re-index in bulk mode is very fast anyway
       
  • No consistency issues
    • good usage of hybrid query engines   
       
  • elasticsearch helped address several scaling challenges 

but elasticsearch brings us much more than just scalability

Bonus from ElasticSearch

More than Raw Speed

Leverage Aggregates

  • Leverage elasticsearch aggregates
    • integrate with the Query system (PageProvider)
    • integrate with the Listing / UI model (ContentView)
  • Makes it easy to build and configure faceted search
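
To make the link between aggregates and facets concrete, here is a small sketch that builds a `terms` aggregation request and turns the response buckets into facet counts. The helper names are hypothetical and the PageProvider / ContentView integration is not shown; only the request and response shapes follow the Elasticsearch aggregations API.

```python
def facet_request(field: str, size: int = 10) -> dict:
    """Build a search body that returns only bucket counts for `field`."""
    return {
        "size": 0,  # no hits needed, only the aggregation result
        "aggs": {field: {"terms": {"field": field, "size": size}}},
    }


def facet_counts(response: dict, field: str) -> dict:
    """Turn the aggregation buckets of a search response into {value: count}."""
    buckets = response["aggregations"][field]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


# Usage with a hand-written response of the shape elasticsearch returns.
sample_response = {
    "aggregations": {
        "dc:creator": {
            "buckets": [
                {"key": "alice", "doc_count": 12},
                {"key": "bob", "doc_count": 3},
            ]
        }
    }
}
facets = facet_counts(sample_response, "dc:creator")
```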

Advanced indexing

  • Fine tuning of elasticsearch indexing
    • multi language support using multiple analyzers and copy_to
    • compound fields created using Groovy scripts
  • Introduce elasticsearch hints into NXQL
    • select a specific elasticsearch index / analyzer
    • leverage elasticsearch operators
    • do geolocation search

-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: geo_hash_cell matches locations near a point, using a geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')

leverage what comes for free with elasticsearch

Index Audit Trail with Elasticsearch

  • Use elasticsearch to store & index Audit trail
    • all events are serialized in JSON and stored inside elasticsearch

  • Unleash Audit system power
    • can store a lot of events
    • can store and query arbitrary JSON structure
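
As an illustration of "serialized in JSON", the sketch below turns one audit event into a JSON string ready to be indexed. The field names and the `audit_entry` function are assumptions made for the example, not the actual audit schema. The `extended` part shows where an arbitrary JSON structure could go.

```python
import json
from datetime import datetime, timezone


def audit_entry(event_id, doc_id, principal, extended=None):
    """Serialize one audit event as JSON (field names are illustrative)."""
    entry = {
        "eventId": event_id,
        "docUUID": doc_id,
        "principalName": principal,
        "eventDate": datetime.now(timezone.utc).isoformat(),
        # Free-form, event-specific data: any JSON-serializable structure.
        "extended": extended or {},
    }
    return json.dumps(entry, sort_keys=True)


# Usage: a workflow event carrying nested, event-specific information.
payload = audit_entry("documentModified", "doc-42", "alice",
                      extended={"workflow": {"step": "review"}})
```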

Elasticsearch Pass-Through

  • Expose an HTTP pass-through API on top of Nuxeo integration
    • Integrate Authentication & Authorization
      • not all users can access workflow index
    • Integrate Security Filtering
      • activate data level security filtering
    • Expose "virtual index" via HTTP
      • index + filter

  • Use elasticsearch API related components on Nuxeo data
    • Documents + Audit log
    • With embedded security

Easy real time data analytics on business data

Data Analytics with Elasticsearch

Queries on Documents + Audit: flexible reporting on workflows

Read Documents from Elasticsearch

  • Full JSON Document is stored in elasticsearch (in the _source field)
    • required to be able to do fast re-indexing

  • We can retrieve Documents from elasticsearch
    • execute full search & retrieve without touching the DB

  • By controlling indexing we can use the elasticsearch index
    • as a persistent cache on top of the repository
    • as a staging area for queries
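
Retrieving documents straight from the index amounts to reading the `_source` of each search hit. A minimal sketch, assuming a standard Elasticsearch search response. The `documents_from_hits` helper is a hypothetical name, and turning the JSON back into a Nuxeo document object is not shown.

```python
def documents_from_hits(response: dict) -> list[dict]:
    """Extract the stored JSON documents from a search response.

    Each hit carries the full document under `_source`, so no round trip
    to the database is needed to display the results.
    """
    return [
        {**hit["_source"], "_id": hit["_id"]}
        for hit in response["hits"]["hits"]
    ]


# Usage with a hand-written response of the shape elasticsearch returns.
sample = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_source": {"dc:title": "Sydney Opera House"}},
            {"_id": "doc-2", "_source": {"dc:title": "Sydney Harbour"}},
        ]
    }
}
docs = documents_from_hits(sample)
```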

Next steps

Leveraging Even More elasticsearch

Next steps

  • Leverage elasticsearch percolator
    • push update on the nuxeo-drive clients
    • notify users about saved search
    • automatic categorization

  • Search result highlighting  
    • not sure why it is still not there ...

  • Plug automatic denormalization

Any Questions ?

Thank You !

https://github.com/nuxeo

http://www.nuxeo.com/careers/

Scaling the Document Repository with elasticsearch

By Thierry Delprat
