Scaling the Document Repository
with Elasticsearch

Some Context

What we Do and What Problems We Try to Solve


  • Nuxeo
    • we provide a Platform that developers can use to build highly customized Content Applications
    • we provide components, and the tools to assemble them
    • everything we do is open source (for real)
  • various customers - various use cases 

  • me: developer & CTO - joined the Nuxeo project 10+ years ago

Track game builds

Electronic Flight Bags

Central repository for Models

Food industry PLM

Document Repository

  • Store Documents / Assets / Objects
    • Blob objects
    • Complex data Structures
    • Hierarchy, references and links
  • Audit trail & Versioning
  • Data level security & encryption
  • Lifecycle, workflows ...

  • API (REST, CMIS, Java, JS...)
    • CRUD
    • Search
    • Service API

Heavily configurable : all data structures are flexible / customizable

Used by developers to build Content Applications on top of the Nuxeo Repository

Our Challenges

  • CRUD on large repository works
    • inject at 6,000 docs/s up to 1 Billion
    • not so many companies have that many documents anyway
  • Queries are the main scalability issue
    • impact of writes (create / update / delete) vs search
    • multi-criteria queries + full-text
    • security filtering
    • configurable data structures
    • user defined queries 
    • UI heavily depends on search

Search API is the most used:

search is the main scalability challenge

History : Nuxeo & Lucene

  • 2006: Nuxeo CPS 3.6 (Python / Zope based)

    • Replace built-in index with Lucene + XML-RPC server

    • PyLucene (GCJ build + Python bindings!)

    • Complex setup

  • 2007: Nuxeo Platform 5.1

    • JCR: query (and backup) issues

    • Integrate Compass Core: transactional & storage abstraction

    • Missing sync & concurrency issues

  • 2009: Nuxeo 5.2

    • VCS : Homebrew SQL based repository

    • Search in database but some real limitations

  • 2013 / 2014: Nuxeo 5.9.3

    • Reintroduce Lucene in the stack via elasticsearch

      • Learn from our past mistakes

      • Leverage elasticsearch architecture
        • easy deployment
        • safe indexing
        • powerful search

...  we are now happy with Elasticsearch 

Lucene and Nuxeo have a long history ...

Repository & Search

Understanding the Issue


Complex SQL Queries

  • Configurable data structure
    + user-defined multi-criteria searches
    => multiple & complex SQL queries


SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
   JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
   LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
   LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
  AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
  AND ("hierarchy"."isversion" IS NULL)
  AND ("_F1"."lifecyclestate" <> 'deleted')
  AND ("_F2"."created" IS NOT NULL)
ORDER BY "_F2"."created" DESC


About SQL Limitations

  • Scaling queries is complex
    • depends on indexes, I/O speed and available memory
      • cannot satisfy all types of queries
    • poor performance on unselective multi-criteria queries
      • some types of queries simply cannot be fast in SQL
  • Scalability
    • Scale up is expensive
    • Scale out is complex at best (XA & MVCC)
    • Sharding requires a global index
  • Fulltext support is usually poor
    • limited features & impact on performance

SQL technology is not the solution

Is NoSQL the solution?

Using NoSQL for the repository

About the NoSQL option

  • (sadly) NoSQL is no magic
    • it does work very well for CRUD and it scales easily, but
      • query options are limited and performance is not that good
      • multi-document transactions are usually not safe
    • better suited to DBs with billions of entries and simple queries
  • SQL has some real advantages
    • ACID (and MVCC) is good
      • Workflows and bulk updates are a typical use case
      • (even transient) lack of consistency is complex to explain to users
    • lots of existing tools (BI & reporting), lots of existing skills (DBA)
    • PGSQL (or AWS RDS) can be very cost effective

Neither a SQL nor a NoSQL repository is the solution

Keep the repository
find a super fast index engine

Repository & ElasticSearch

Toward a Hybrid Storage

Hybrid Storage

  • Use each storage solution for what it does the best

    • SQL DB

      • store content in an ACID way

        • store & retrieve

        • queries that need ACID and MVCC

    • elasticsearch

      • provide powerful and scalable queries

      • do the heavy lifting that the RDBMS can not do

        • scoring, native full-text, aggregates

        • distributed search

Route the query to the correct index depending on requirements

Elasticsearch & Repository

One query
Several possible backends
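
The routing idea above can be sketched in a few lines. This is only an illustration, not the actual Nuxeo implementation: `QueryRequest`, `needs_transactional_view` and `choose_backend` are hypothetical names introduced here, and the rule is reduced to a single criterion (does the caller need the ACID / MVCC view of the repository?).

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    """Hypothetical description of a query and what it requires."""
    nxql: str
    # True when the caller must see changes made inside the current,
    # not yet committed, transaction (ACID / MVCC requirement).
    needs_transactional_view: bool = False


def choose_backend(request: QueryRequest) -> str:
    """Return the name of the backend that should run this query."""
    if request.needs_transactional_view:
        # Only the SQL repository can see uncommitted changes.
        return "repository"
    # Full-text, scoring, aggregates and plain listings go to the index.
    return "elasticsearch"


listing = QueryRequest("SELECT * FROM Document WHERE ecm:fulltext = 'sydney'")
in_tx = QueryRequest("SELECT * FROM Document WHERE dc:title = 'draft'",
                     needs_transactional_view=True)
print(choose_backend(listing))  # elasticsearch
print(choose_backend(in_tx))    # repository
```

The caller writes one query; only the routing decision changes which engine executes it.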

Performance results

  • Fast indexing

    • No ACID constraints / No impedance issue

    • 3,500 documents/s when using SQL backend

    • 10,000 documents/s when using MongoDB

  • Super query performance

    • query on term using inverted index

    • very efficient caching  

    • native full text support & distributed architecture

    • 3,000 queries/s with 1 elasticsearch node

    • 6,000 queries/s with 2 elasticsearch nodes 

some real life feedback

Customer: We are now testing the Nuxeo 6 stack in AWS.
The DB is a PostgreSQL db.r3.8xlarge, which has 32 CPUs.
Between 350 and 400 tps the DB CPU is maxed out.

Nuxeo support: Please activate nuxeo-elasticsearch!

Customer: We are now able to do about 1200 tps with almost 0 DB activity.
One question though: Nuxeo and ES do not seem to be maxed out?

Nuxeo support: It looks like you have some network congestion between your client and the servers.

Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...

SQL vs ElasticSearch

Scalability is simply of another order of magnitude

Scale out


  • Tested with 10 PgSQL databases 
    • 10 x 100 Million documents => 1 Billion documents 
    • 1 elasticsearch cluster

Is this magic?

  • For users 

    • it really looks like magic

  • For sales guys & solution architects

    • it is magic: it unleashes a lot of possibilities

      • performance is just one aspect

  • For Nuxeo Core Dev team

    • it was almost magic: some integration work was needed

Integrating ElasticSearch

Inside nuxeo-elasticsearch Plugin

Challenges to address

  • Keep index in sync with the repository
    • No transaction management
    • Do not lose anything
    • Without support for update

  • Mitigate the effects of eventual consistency
    • Avoid displaying transient inconsistent state

  • Handle security filtering
    • Without join
    • Without post-filtering 

Security Filtering

  • Constraints
    • Filtering must be done at index level: no post-filtering
    • Join is not an option
      • cannot join with the DB or within Lucene (previously tested without success)
  • Solution
    • index the ReadACL as part of the JSON Document
      • list of groups / users who can read the resource
    • automatically add a filter clause on the ACL
  • Consequences
    • Recursive indexing is needed
    • More pressure to maintain re-indexing processing
      • as a last resort, the Document security is checked by the repository anyway
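
A minimal sketch of the solution described above, using plain Python dictionaries and no Elasticsearch client. The field names `ecm:uuid` and `ecm:acl` and both helper functions are assumptions made for this example; the query shape is the standard Elasticsearch `bool` query with a `terms` clause in `filter`, which may differ from what the plugin generated at the time.

```python
def to_indexed_json(doc_id, title, read_acl):
    """Build the JSON document sent to the index, ReadACL included.

    `ecm:acl` is an assumed field name holding the users and groups
    allowed to read the document.
    """
    return {"ecm:uuid": doc_id, "dc:title": title, "ecm:acl": sorted(read_acl)}


def with_acl_filter(query, principals):
    """Wrap any query so that only documents readable by at least one
    of `principals` (the user name plus the user's groups) can match."""
    return {
        "bool": {
            "must": [query],
            # Filter context: no scoring, applied inside the index itself,
            # so no join and no post-filtering is needed.
            "filter": [{"terms": {"ecm:acl": sorted(principals)}}],
        }
    }


doc = to_indexed_json("doc-1", "Sydney opera", {"members", "alice"})
user_query = {"match": {"dc:title": "sydney"}}
secured = with_acl_filter(user_query, {"alice", "members"})
```

Because the ACL is copied into each indexed document, a permission change on a folder means every descendant has to be re-indexed, which is the recursive indexing consequence listed above.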

Safe Indexing Flow

  • Do not try to make it transactional
    • Collect and de-duplicate Repository Events during Transaction
    • Wait for commit to be done at the repository level
      • then call elasticsearch
  • Do not lose any update
    • run Indexing Tasks in a distributed Job infrastructure
      • Jobs should be persisted
      • Jobs should be retried
      • Jobs should be monitored
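
The flow above can be sketched as follows. This is a simplified, in-memory illustration: `IndexingCollector` and `run_with_retry` are hypothetical names, and a real job infrastructure would also persist and monitor the jobs, which this sketch does not do.

```python
class IndexingCollector:
    """Collects repository events during a transaction and submits
    indexing jobs only once the repository commit has succeeded."""

    def __init__(self, submit_job):
        self._submit_job = submit_job  # callable(doc_id, command)
        self._pending = {}             # doc_id -> last command seen

    def on_repository_event(self, doc_id, command):
        # De-duplicate: a later event on the same document replaces
        # the earlier one, so each document is indexed at most once.
        self._pending[doc_id] = command

    def after_commit(self):
        # The repository is committed: now it is safe to call the index.
        for doc_id, command in self._pending.items():
            self._submit_job(doc_id, command)
        self._pending.clear()

    def after_rollback(self):
        # Nothing was persisted in the repository, so nothing is indexed.
        self._pending.clear()


def run_with_retry(task, max_attempts=3):
    """Run an indexing task, retrying on failure (max_attempts >= 1)."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as error:  # a real worker would log and back off
            last_error = error
    raise last_error


jobs = []
collector = IndexingCollector(lambda doc_id, command: jobs.append((doc_id, command)))
collector.on_repository_event("doc-1", "index")
collector.on_repository_event("doc-1", "index")   # duplicate, collapsed
collector.on_repository_event("doc-2", "delete")
collector.after_commit()
print(jobs)  # [('doc-1', 'index'), ('doc-2', 'delete')]
```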

Async Indexing Flow

Mitigate Eventual Consistency

  • In the code
    • use case: need to see results from within the transaction
    • query directly on the repository
      • leverage ACID and MVCC of the SQL repository
      • full-text search and facets are usually not needed by the code
  • For the users
    • use case: see changes in listings in "real time"
    • use pseudo-real-time indexing
      • indexing actions triggered by UI threads are flagged
      • run as an afterCompletion listener
      • refresh the elasticsearch index
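
To illustrate why the refresh matters, here is a small sketch with an in-memory stand-in for the index. `FakeIndex` and `after_completion` are hypothetical; the only behavior borrowed from elasticsearch is that newly indexed documents become searchable after a refresh.

```python
class FakeIndex:
    """In-memory stand-in for a near-real-time search index."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet searchable
        self._searchable = {}  # visible to searches

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        # Make everything indexed so far visible to searches.
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search_ids(self):
        return sorted(self._searchable)


def after_completion(index, commands):
    """Run once the transaction has completed.

    `commands` is a list of (doc_id, doc, from_ui_thread) tuples.
    Flagged commands are indexed right away and followed by a refresh;
    the others are returned so they can go to the async job queue.
    """
    flagged = [(doc_id, doc) for doc_id, doc, from_ui in commands if from_ui]
    deferred = [(doc_id, doc) for doc_id, doc, from_ui in commands if not from_ui]
    for doc_id, doc in flagged:
        index.index(doc_id, doc)
    if flagged:
        index.refresh()  # the user's listing now shows the change
    return deferred


index = FakeIndex()
deferred = after_completion(index, [("a", {"dc:title": "A"}, True),
                                    ("b", {"dc:title": "B"}, False)])
print(index.search_ids())  # ['a']
```

Only UI-triggered changes pay the cost of the synchronous call and refresh; bulk or background changes stay on the asynchronous path.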

Pseudo-Sync Indexing Flow

Does this work ?

  • Live for about 18 months now 
  • No missing sync issue
    • some customers asked for verification tools
    • but no problem was found
    • re-index in bulk mode is very fast anyway
  • No consistency issues
    • good usage of hybrid query engines   
  • elasticsearch helped address several scaling challenges 

but elasticsearch brings us much more than just scalability

Bonus from ElasticSearch

More than Raw Speed

Leverage Aggregates

  • Leverage elasticsearch aggregates
    • integrate with the Query system (PageProvider)
    • integrate with the Listing / UI model (ContentView)
  • Make it easy to build and configure faceted search
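
As a rough idea of what a faceted search looks like at the elasticsearch level, the sketch below builds a search body with one `terms` aggregation per facet and reads the bucket counts back from a response. The helper names and the `ecm:fulltext` field are assumptions for this example; PageProvider and ContentView are not modeled.

```python
def faceted_search_body(text, facet_fields, fulltext_field="ecm:fulltext"):
    """Build a search body: one full-text query plus one terms
    aggregation per facet field."""
    return {
        "query": {"match": {fulltext_field: text}},
        "aggs": {
            # The aggregation name is derived from the field name.
            field.replace(":", "_"): {"terms": {"field": field}}
            for field in facet_fields
        },
    }


def facet_counts(response, field):
    """Extract (value, count) pairs for one facet from a search response."""
    buckets = response["aggregations"][field.replace(":", "_")]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


body = faceted_search_body("sydney", ["dc:creator", "dc:nature"])

# Hand-written response fragment in the shape elasticsearch returns
# for terms aggregations (the values are made up for the example).
sample_response = {
    "aggregations": {
        "dc_creator": {"buckets": [{"key": "alice", "doc_count": 12},
                                   {"key": "bob", "doc_count": 3}]}
    }
}
print(facet_counts(sample_response, "dc:creator"))  # [('alice', 12), ('bob', 3)]
```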

Advanced indexing

  • Fine tuning of elasticsearch indexing
    • multi language support using multiple analyzers and copy_to
    • compound fields created using groovy scripts
  • Introduce elasticsearch hints into NXQL
    • select a specific elasticsearch index / analyzer
    • leverage elasticsearch operators
    • do geolocation search

-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: find documents located near a point using a geohash cell
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell) */ osm:location IN ('40','-74','5')

leverage what comes for free with elasticsearch

Index Audit Trail with Elasticsearch

  • Use elasticsearch to store & index Audit trail
    • all events are serialized in JSON and stored inside elasticsearch

  • Unleash Audit system power
    • can store a lot of events
    • can store and query arbitrary JSON structure
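
A small sketch of what serializing an audit event to JSON could look like. The field names (`eventId`, `docUUID`, `principalName`, `eventDate`, `extended`) are illustrative assumptions, not a statement of the actual audit schema; `extended` stands for the arbitrary JSON structure mentioned above.

```python
import json
from datetime import datetime, timezone


def audit_entry_json(event_id, doc_id, principal, when, extended=None):
    """Serialize one audit event as a JSON string ready to be indexed."""
    entry = {
        "eventId": event_id,
        "docUUID": doc_id,
        "principalName": principal,
        "eventDate": when.isoformat(),
        # Free-form payload: any JSON-serializable structure.
        "extended": extended or {},
    }
    return json.dumps(entry, sort_keys=True)


payload = audit_entry_json(
    "workflowTaskCompleted",
    "doc-1",
    "alice",
    datetime(2015, 3, 1, 12, 0, tzinfo=timezone.utc),
    extended={"workflow": "validation", "durationSeconds": 420},
)
```

Since the payload is plain JSON, the extra fields can be queried and aggregated like any other indexed data.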

Elasticsearch Pass-Through

  • Expose an HTTP pass-through API on top of Nuxeo integration
    • Integrate Authentication & Authorization
      • not all users can access workflow index
    • Integrate Security Filtering
      • activate data level security filtering
    • Expose "virtual index" via HTTP
      • index + filter
  • Use elasticsearch API related components on Nuxeo data
    • Documents + Audit log
    • With embedded security

Easy real time data analytics on business data

Data Analytics with Elasticsearch

Queries on Documents + Audit: flexible reporting on workflows

Read Documents from Elasticsearch

  • Full JSONDocument is stored in elasticsearch
    • required to be able to do fast re-indexing
  • We can retrieve Documents from elasticsearch
    • execute full search & retrieve without touching the DB
  • By controlling indexing we can use the elasticsearch index
    • as a persistent cache on top of the repository
    • as a staging area for queries

Next steps

Leveraging Even More elasticsearch

Next steps

  • Leverage elasticsearch percolator
    • push updates to the nuxeo-drive clients
    • notify users about saved searches
    • automatic categorization

  • Search result highlighting
    • not sure why it is still not there ...

  • Plug automatic denormalization

Any Questions ?

Thank You !

Scaling the Document Repository with elasticsearch

By Thierry Delprat
