Nuxeo & Elasticsearch



NUXEO AND LUCENE

About Nuxeo


  • Since 2001
  • A team of 25 developers
  • Making open source software since the very beginning
  • Providing a software platform to build Content Applications
    • Document Repository, Document Management
    • Digital Asset Management, Media Library
    • Case Management, Document-centric business processes
    • Any business application that needs to manage Content
      (i.e. structured business objects, not just files)
  • Trusted by Electronic Arts, JCDecaux, Boeing, Sharp ...

Nuxeo & Content

  • Store Content
  • Manage versions
  • Secure access to Content
  • Manage relations between Content objects
  • Render content
  • Convert content
  • Manage workflows
  • Forms, Layouts and Tasks
  • ...
  • Index and Query

Nuxeo & Lucene



It's a long story:

Nuxeo has integrated Lucene
with various technologies over the years

Is Lucene the superhero we are looking for?

Nuxeo / Lucene History

  • 2006 : Nuxeo CPS 3.6 (Python / Zope based)
    • Built-in index (Z-Catalog) was too weak
    • Built a Lucene-based index server accessed via XML-RPC
      • PyLucene (Lucene re-built with GCJ + Python bindings)
      • Twisted as server framework
    • Complex setup
  • 2007 : Nuxeo Platform 5.1 (Java + JCR)
    • JCR is not good at queries
    • Integrated a Lucene-based index via Compass Core
      • provides a transactional wrapper
      • provides a storage abstraction
    • Concurrency issues

Nuxeo / Lucene now

  • Since 2009 :
    • VCS : home-grown SQL-based repository
    • Everything is indexed inside the database (ACID)
  • But :
    • Full text capabilities are not so good
    • Scaling for complex queries is a challenge
    • Handling very big repositories is a challenge

  • 2014 : Nuxeo 5.9.3 + Elasticsearch
    • Distribute indexes and queries on multiple nodes
    • Relieve the database
    • Learn from our past mistakes!



RDBMS Limits

Static schemas

  • RDBMS
    • Schemas are defined beforehand
      • not always possible
    • Schema choice impacts performance
    • Schema migrations are painful

Query performance

  • SQL DB issues
    • normalized data
      • generates complex joins and filtering
      • several round trips to fetch a Document
    • poor performance on unselective multi-criteria queries
      • tweaking is possible but painful
    • medium or poor full text support
      • varies depending on the DB vendor

  • Workarounds
    • Stored procedures and triggers
      (maintenance issues)

Query example

  • Some queries are nearly impossible to optimize
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy" 
   JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id" 
   LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id" 
   LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id" 
 WHERE 
  ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio')) 
  AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext"))) 
  AND ("hierarchy"."isversion" IS NULL) 
  AND ("_F1"."lifecyclestate" <> 'deleted') 
  AND ("_F2"."created" IS NOT NULL )

ORDER BY "_F2"."created" DESC 

LIMIT 201 OFFSET 0; 

Write overhead

  • RDBMS engine constraints create overhead
    • buffer management
    • locking & latching
      • transaction isolation
    • transaction log

  • Application and model level constraints
    • constraints checking 
      • type, unique, fk
    • Triggers and associated stored procedures

Nuxeo example


  • Triggers and constraints impact performance
    • ex: compute ACLs, compute path ...

  • Most databases don't handle cascades correctly
    • ex: ON DELETE CASCADE

Scale out

  • RDBMS don't scale out easily

    • multi-master is very expensive and not very efficient
      • ensuring ACID across the network is a real problem!

    • single master + multi-read is easy but not always applicable
      • need to flag read only transactions at startup

    • data partitioning is a solution
      • but it impacts user experience (no global index)

Why Elasticsearch wins

  • No static schemas
    • essentially schema-less
    • mapping can be adapted as needed
  • Great query performance
    • queries on terms use an inverted index
      • returning documents and scores
    • native full text support 
    • one query to fetch them all 
  • Fast indexing
    • No ACID constraints
  • Easy scale out
    • Native distributed architecture
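The inverted index mentioned above can be illustrated with a minimal sketch. This is a toy in-memory structure, not Lucene's implementation: the names `build_inverted_index` and `search`, the sample documents, and the simple TF-IDF style score are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return index


def search(index, n_docs, term):
    """Return (doc_id, score) pairs for one term, best score first."""
    postings = index.get(term.lower(), {})
    if not postings:
        return []
    # Toy scoring: term frequency weighted by how rare the term is.
    idf = math.log(1 + n_docs / len(postings))
    hits = [(doc_id, tf * idf) for doc_id, tf in postings.items()]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical sample content.
docs = {
    "doc1": "Sydney opera house picture",
    "doc2": "video of Sydney harbour and Sydney bridge",
    "doc3": "quarterly report",
}
index = build_inverted_index(docs)
results = search(index, len(docs), "sydney")
```

A term lookup is a single dictionary access followed by a sort of the matching postings, which is why it needs no joins and no table scan.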



Why keep the RDBMS?

ACID can be useful


  • Ensuring immediate consistency may be required
    • some users don't understand anything else anyway

  • The ACID model ensures that data
    • is always consistent
    • is auditable

  • ACID RDBMS are safe and reliable
    • people trust RDBMS to store their data

RDBMS are here to stay


  • Customers have Databases and DBAs

  • Customers have BI tools that use RDBMS

  • RDBMS are used for data interoperability

Hybrid model

  • Use each storage for what it does best

    • RDBMS
      • store content in an ACID way
        (store & retrieve)

    • Elasticsearch
      • provide powerful and scalable queries
      • do the heavy lifting that the RDBMS cannot do
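The split of responsibilities can be sketched as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the RDBMS, a plain dictionary stands in for the Elasticsearch index, and the table name `document`, the `search_index` map and the `query` function are hypothetical.

```python
import sqlite3

# The "repository": an ACID SQL store (in-memory SQLite stands in for the RDBMS).
repo = sqlite3.connect(":memory:")
repo.execute("CREATE TABLE document (id TEXT PRIMARY KEY, title TEXT)")
with repo:  # commits on success, rolls back on error
    repo.executemany(
        "INSERT INTO document VALUES (?, ?)",
        [("d1", "Sydney trip"), ("d2", "Budget 2014"), ("d3", "Sydney office")],
    )

# The "index": a toy term -> ids map standing in for the search engine.
search_index = {"sydney": ["d1", "d3"], "budget": ["d2"]}


def query(term):
    """Ask the index for matching ids, then load the documents from the repository."""
    ids = search_index.get(term.lower(), [])
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = repo.execute(
        f"SELECT id, title FROM document WHERE id IN ({placeholders})", ids
    ).fetchall()
    by_id = dict(rows)
    # Preserve the order returned by the index (its ranking).
    return [(doc_id, by_id[doc_id]) for doc_id in ids if doc_id in by_id]
```

The query itself never touches the SQL engine; the database only serves a primary-key lookup, which is the cheap operation it handles best.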



Nuxeo & Elasticsearch

Nuxeo Challenges

  • Security filtering
    • filtering may be complex and must be fast

  • Hierarchy 
    • recursive re-indexing needed for some operations

  • Ensuring consistency
    • keep repository and index in sync

  • User operations should be indexed in real time
    • UI must be rendered using the ES index
    • ES index must be updated in near real time
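To make the hierarchy challenge concrete, here is a minimal sketch of one operation that forces recursive re-indexing. It assumes, for illustration only, that each index entry stores the document path, so moving a folder changes the entry of every descendant. The dictionaries and the `move_folder` function are hypothetical stand-ins, not Nuxeo APIs.

```python
def move_folder(repository, index, old_path, new_path):
    """Move a subtree in the repository, then re-index every affected document."""
    old_path = old_path.rstrip("/")
    prefix = old_path + "/"
    moved = {}
    for path in list(repository):
        if path == old_path or path.startswith(prefix):
            target = new_path + path[len(old_path):]
            moved[target] = repository.pop(path)
    repository.update(moved)
    # Each affected entry is rebuilt and written again in full.
    for target, doc in moved.items():
        index[doc["id"]] = {"id": doc["id"], "path": target, "title": doc["title"]}
    return len(moved)


# Hypothetical content: path -> document.
repository = {
    "/ws": {"id": "1", "title": "Workspaces"},
    "/ws/a": {"id": "2", "title": "Folder A"},
    "/ws/a/f": {"id": "3", "title": "File"},
    "/ws/ab": {"id": "4", "title": "Folder AB"},
}
index = {
    doc["id"]: {"id": doc["id"], "path": path, "title": doc["title"]}
    for path, doc in repository.items()
}
count = move_folder(repository, index, "/ws/a", "/ws/b")
```

A single user action on one folder turns into as many index writes as there are documents below it, which is why this kind of operation is better handled asynchronously.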

ES/Lucene challenges

  • No support for in-place updates
    • Documents must be re-indexed completely

  • No joins
    • cannot split hierarchy, security and data
      into separate indexes

  • No transaction management

Nuxeo Cluster


Nuxeo and Sharding


ES to Lighten Database Load


ES and Sharding


Nuxeo 5.9.3 integration

  • Export Nuxeo Document as JSON
    • JSON is already the format of our REST API
  • Index mapping is configurable

  • Asynchronous persistent tasks for indexing
    • survive server restart and can be retried as needed

  • Manage near-real time for UI operations
    • indexing tasks triggered by the UI run at post-commit

  • NXQL query language and PageProvider model
    • mapped to ES query language
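The asynchronous, retryable, post-commit indexing described above can be sketched as follows. This is a simplified model under stated assumptions, not the Nuxeo implementation: `IndexingQueue` and `Transaction` are hypothetical classes, and the queue lives in memory, whereas a persistent queue would be needed to survive a server restart.

```python
from collections import deque


class IndexingQueue:
    """Toy stand-in for a persistent work queue with retries."""

    def __init__(self, indexer, max_attempts=3):
        self.tasks = deque()
        self.indexer = indexer
        self.max_attempts = max_attempts
        self.failed = []

    def submit(self, doc_id):
        self.tasks.append((doc_id, 1))

    def run(self):
        while self.tasks:
            doc_id, attempt = self.tasks.popleft()
            try:
                self.indexer(doc_id)
            except Exception:
                if attempt < self.max_attempts:
                    self.tasks.append((doc_id, attempt + 1))  # retry later
                else:
                    self.failed.append(doc_id)


class Transaction:
    """Collects touched documents and schedules indexing only after commit."""

    def __init__(self, queue):
        self.queue = queue
        self.touched = []

    def save(self, doc_id):
        if doc_id not in self.touched:
            self.touched.append(doc_id)

    def commit(self):
        # Post-commit hook: data is stored, so indexing can now be scheduled.
        for doc_id in self.touched:
            self.queue.submit(doc_id)
        self.touched.clear()

    def rollback(self):
        self.touched.clear()  # nothing was committed, so nothing is indexed


# Usage: an indexer that fails on its first call, then recovers.
indexed = []
calls = {"n": 0}


def flaky_indexer(doc_id):
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("index temporarily unavailable")
    indexed.append(doc_id)


queue = IndexingQueue(flaky_indexer)
tx = Transaction(queue)
tx.save("doc-1")
tx.save("doc-2")
tx.save("doc-1")  # saved twice, indexed once
tx.commit()
queue.run()
```

Scheduling at post-commit means a rolled-back transaction never reaches the index, and a failed indexing attempt is retried without blocking the user.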

Nuxeo 5.9.3 integration

  • Security
    • Previously tested approaches
      • use a custom Lucene Matcher
      • use index joins inside Solr
    • simplified ACLs are part of the indexed document

  • Document fetching
    • for now, only the entry is taken from ES
    • for now, the Document is still fetched from the repository

  • Integration
    • run with embedded ES (test) or external server
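Storing a simplified ACL inside each indexed document turns security filtering into an ordinary term match. The sketch below illustrates the idea with plain Python structures. The field name `acl`, the sample entries and both functions are assumptions for illustration; the `terms` clause follows the general shape of an Elasticsearch terms filter but is not taken from the Nuxeo code.

```python
def acl_filter(principals):
    """Build a terms clause matching entries readable by any of the principals."""
    return {"terms": {"acl": list(principals)}}


def visible_hits(entries, principals):
    """In-memory equivalent: keep entries whose ACL shares a principal with the user."""
    allowed = set(principals)
    return [entry["id"] for entry in entries if allowed.intersection(entry["acl"])]


# Hypothetical index entries, each carrying the principals allowed to read it.
entries = [
    {"id": "d1", "title": "Public note", "acl": ["members"]},
    {"id": "d2", "title": "HR file", "acl": ["hr", "Administrator"]},
    {"id": "d3", "title": "Alice draft", "acl": ["alice"]},
]

# A user's principals: the user name plus the groups the user belongs to.
alice = ["alice", "members"]
readable = visible_hits(entries, alice)
clause = acl_filter(alice)
```

Because the ACL travels with the document, no join between a security index and a data index is needed. The trade-off is that a permission change on a folder requires re-indexing its descendants.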

Async indexing


Sync indexing




First results

Does it work?

Yes, it does work!




Perspectives

Improving ES integration

  • Improve integration
    • load Documents from ES rather than repository
    • UpdateByQuery for recursive updates

  • Leverage Elasticsearch features
    • integrate the aggregation system with Nuxeo PageProviders
    • move Audit Log to elasticsearch
    • use Kibana for analytics