Nuxeo & Elasticsearch

NUXEO AND LUCENE

About Nuxeo

Since 2001
A team of 25 developers
Making Opensource software since the very beginning
Providing a software platform to build Content Applications

Document Repository, Document Management
Digital Asset Management, Media Library
Case Management, Document centric business process
Any business application that needs to manage Content
(i.e., structured business objects and not just files)

Trusted by Electronic Arts, JCDecaux, Boeing, Sharp ...

Nuxeo & Content

Store Content
Manage versions
Secure access to Content
Manage relations between Content objects
Render content
Convert content
Manage workflows
Forms, Layouts and Tasks
...
Index and Query

Nuxeo & Lucene

It's a long story:

Nuxeo has integrated Lucene with various technologies over the years

Is Lucene the superhero we are looking for?

Nuxeo / Lucene History

2006 : Nuxeo CPS 3.6 (Python / Zope based)

Built-in index (Z-Catalog) was too weak
Built a Lucene-based index server accessed via XML-RPC

pyLucene (Lucene re-built with GCJ + python bindings)
Twisted as server framework

Complex setup

2007 : Nuxeo Platform 5.1 (Java + JCR)

JCR is not good for making queries
Integrate a Lucene based index via Compass Core

provides transactional wrapper
provides storage abstraction

Concurrency issues

Nuxeo / Lucene now

Since 2009 :

VCS : Homebrew SQL based repository
Everything is indexed inside the database (ACID)

But :

Full text capabilities are not so good
Scaling for complex queries is a challenge
Handling very big repositories is a challenge

2014 : Nuxeo 5.9.3 + elasticsearch

Distribute indexes and queries on multiple nodes
Relieve the database
Learn from our past mistakes !

RDBMS Limits

Static schemas

RDBMS

Schemas are defined beforehand

not always possible

Schema choice impacts performance
Schema migrations are painful

Query performances

SQL DB issues

normalized data

generates complex joins and filtering
several round trips to fetch a Document

poor performance on unselective multi-criteria queries

tweaking is possible but painful

mediocre or poor full-text support

varies depending on the DB vendor

Workarounds

Stored procedures and triggers
(maintenance issues)

Query example

Some queries are nearly impossible to optimize

SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy" 
   JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id" 
   LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id" 
   LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id" 
 WHERE 
  ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio')) 
  AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext"))) 
  AND ("hierarchy"."isversion" IS NULL) 
  AND ("_F1"."lifecyclestate" <> 'deleted') 
  AND ("_F2"."created" IS NOT NULL )

ORDER BY "_F2"."created" DESC 

LIMIT 201 OFFSET 0;

Write overhead

RDBMS engine constraints create overhead

buffer management
locking & latching

transaction isolation

transaction log

Application and model level constraints

constraints checking

type, unique, fk

Triggers and associated stored procedures

Nuxeo example

Triggers and constraints impact performance

ex: compute ACLs, compute path ...

Most databases don't handle cascades correctly

ex: ON DELETE CASCADE

Scale out

RDBMS don't scale out easily

multi-master is very expensive and not very efficient

ensuring ACID across the network is a real problem!

single master + multi-read is easy but not always applicable

need to flag read only transactions at startup

data partitioning is a solution

but it impacts user experience (no global index)

Why does Elasticsearch win?

No Static Schemas

essentially schema-less
mapping can be adapted as needed

Great query performance

query on terms using an inverted index

returning document and score

native full text support
one query to fetch them all

Fast indexing

No ACID constraints

Easy scale out

Native distributed architecture
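
To make the "query on terms using an inverted index, returning document and score" point concrete, here is a minimal sketch of a toy inverted index. It is only an illustration: the class name, the whitespace tokenizer and the simple TF-IDF score are assumptions, and Lucene's real data structures and scoring are far more elaborate.

```python
from collections import defaultdict
import math


class InvertedIndex:
    """Toy inverted index: term -> {doc_id: term frequency}."""

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_count = 0

    def add(self, doc_id, text):
        """Tokenize on whitespace and record each term occurrence."""
        self.doc_count += 1
        for term in text.lower().split():
            self.postings[term][doc_id] = self.postings[term].get(doc_id, 0) + 1

    def search(self, term):
        """Return (doc_id, score) pairs, best match first (simple TF-IDF)."""
        matches = self.postings.get(term.lower(), {})
        if not matches:
            return []
        idf = math.log(1 + self.doc_count / len(matches))
        scored = [(doc_id, tf * idf) for doc_id, tf in matches.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)


index = InvertedIndex()
index.add("doc1", "Sydney opera house")
index.add("doc2", "Sydney harbour bridge in Sydney")
index.add("doc3", "Paris by night")

# A term lookup is a single dictionary access, whatever the number of documents.
results = index.search("sydney")
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

The lookup cost depends on the number of matching documents rather than on the size of the whole collection, which is why term queries stay fast on large indexes.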

Why keep RDBMS ?

ACID can be useful

Ensuring immediate consistency may be required

some users would not understand anything else anyway

The ACID model ensures that data

is always good
is auditable

ACID RDBMS are safe and reliable

people trust RDBMS to store their data

RDBMS are here to stay

Customers have Databases and DBAs
Customers have BI tools that use RDBMS
RDBMS are used for data interoperability

Hybrid model

Use each storage for what it does best

RDBMS

store content in an ACID way
(store & retrieve)

elasticsearch

provide powerful and scalable queries
do the heavy lifting that the RDBMS cannot do

Nuxeo & Elasticsearch

Nuxeo Challenges

Security filtering

filtering may be complex and must be fast

Hierarchy

recursive re-indexing needed for some operations

Ensuring consistency

keep repository and index in sync

User operations should be indexed in real time

UI must be rendered using ES index
ES index must be updated in pseudo real time

ES/Lucene challenges

No support for update

Document must be re-indexed completely

No Join

cannot split hierarchy, security and data into separate indexes

No transaction management
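
The consequences of "no update" and "no join" can be sketched as follows: each index entry has to carry hierarchy (path), security (ACL) and data together, and renaming or moving a folder means rebuilding the complete entry of every descendant. This is a hedged illustration only; the in-memory `repository` and `index` dictionaries and all function names are hypothetical stand-ins, not Nuxeo code.

```python
import json

# Hypothetical in-memory repository: id -> document record.
repository = {
    "1": {"name": "workspace", "parent": None, "acl": ["members"], "title": "Workspace"},
    "2": {"name": "reports", "parent": "1", "acl": None, "title": "Reports"},
    "3": {"name": "q1.pdf", "parent": "2", "acl": None, "title": "Q1 report"},
}
index = {}  # stands in for the search index: id -> JSON string


def path_of(doc_id):
    """Compute the full path by walking up the parent chain."""
    parts = []
    while doc_id is not None:
        doc = repository[doc_id]
        parts.append(doc["name"])
        doc_id = doc["parent"]
    return "/" + "/".join(reversed(parts))


def effective_acl(doc_id):
    """Inherit the ACL from the closest ancestor that defines one."""
    while doc_id is not None:
        doc = repository[doc_id]
        if doc["acl"] is not None:
            return doc["acl"]
        doc_id = doc["parent"]
    return []


def index_document(doc_id):
    """Rebuild the complete denormalized entry; there is no partial update."""
    doc = repository[doc_id]
    index[doc_id] = json.dumps({
        "id": doc_id,
        "title": doc["title"],
        "path": path_of(doc_id),
        "acl": effective_acl(doc_id),
    })


def reindex_subtree(root_id):
    """Re-index a document and, recursively, all of its descendants."""
    index_document(root_id)
    for child_id, child in repository.items():
        if child["parent"] == root_id:
            reindex_subtree(child_id)


reindex_subtree("1")
repository["2"]["name"] = "archive"  # rename a folder
reindex_subtree("2")                 # the path of every descendant changed
print(index["3"])
```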

Nuxeo Cluster

Nuxeo and Sharding

ES to Lighten Database Load

ES and Sharding

Nuxeo 5.9.3 integration

Export Nuxeo Document as JSON

JSON is already the format of our REST API

Index mapping is configurable

Asynchronous persistent tasks for indexing

survive server restart and can be retried as needed

Manage near-real time for UI operations

indexing tasks triggered by UI in Post Commit

NXQL query language and PageProvider model

mapped to ES query language
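
As an illustration of what such a mapping can produce, the sketch below builds an Elasticsearch request body carrying roughly the same criteria as the SQL query example shown earlier. It is not the actual Nuxeo translator: the function name and the field names (`ecm:fulltext`, `ecm:primaryType`, `ecm:isVersion`, `ecm:currentLifeCycleState`, `dc:created`) are assumptions, and the `bool` query with `filter` clauses follows the syntax of recent Elasticsearch versions.

```python
import json


def build_es_query(fulltext, types, limit=201, offset=0):
    """Build an Elasticsearch request body (bool query) as a plain dict.

    Field names are assumptions made for this illustration.
    """
    return {
        "query": {
            "bool": {
                # Scored full-text clause
                "must": [{"match": {"ecm:fulltext": fulltext}}],
                # Non-scored filters
                "filter": [
                    {"terms": {"ecm:primaryType": types}},
                    {"exists": {"field": "dc:created"}},
                ],
                # Exclusions: versions and deleted documents
                "must_not": [
                    {"term": {"ecm:isVersion": True}},
                    {"term": {"ecm:currentLifeCycleState": "deleted"}},
                ],
            }
        },
        "sort": [{"dc:created": {"order": "desc"}}],
        "from": offset,
        "size": limit,
    }


body = build_es_query("sydney", ["Video", "Picture", "File", "Audio"])
print(json.dumps(body, indent=2))
```

All criteria live in one denormalized document, so no join is needed and the whole request is answered by a single index.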

Nuxeo 5.9.3 integration

Security

Previously tested approaches

use a custom Lucene Matcher
use index joins inside Solr

a simplified ACL is part of the indexed document

Document fetching

for now only the entry is taken from ES
for now the Document is fetched from the repository

Integration

run with embedded ES (test) or external server
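
The security and document fetching points above can be sketched together: the index entry carries a simplified ACL, the query is filtered on the user's principals, and only the matching ids are then used to load full Documents from the repository. Everything here is a hypothetical stand-in (the data, the function names, and the assumption that a simplified ACL is a flat list of principals allowed to read).

```python
# Hypothetical indexed entries: each carries a simplified ACL (who may read it).
index_entries = [
    {"id": "doc1", "title": "Budget 2014", "acl": ["finance", "alice"]},
    {"id": "doc2", "title": "Public roadmap", "acl": ["Everyone"]},
    {"id": "doc3", "title": "Budget draft", "acl": ["bob"]},
]

# Hypothetical repository holding the full documents.
repository = {
    "doc1": {"id": "doc1", "title": "Budget 2014", "content": "..."},
    "doc2": {"id": "doc2", "title": "Public roadmap", "content": "..."},
    "doc3": {"id": "doc3", "title": "Budget draft", "content": "..."},
}


def search_ids(term, principals):
    """Query the index: match on title, then filter on the embedded ACL."""
    allowed = set(principals)
    return [
        entry["id"]
        for entry in index_entries
        if term.lower() in entry["title"].lower() and allowed & set(entry["acl"])
    ]


def search_documents(term, principals):
    """Take only the ids from the index, fetch the Documents from the repository."""
    return [repository[doc_id] for doc_id in search_ids(term, principals)]


# alice belongs to no group here, apart from the implicit "Everyone".
for doc in search_documents("budget", ["alice", "Everyone"]):
    print(doc["id"], doc["title"])
```

Because the ACL is part of the entry, the security check is an ordinary filter clause and needs neither a join nor a custom matcher.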

Async indexing

Sync indexing
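
As a rough illustration of asynchronous persistent indexing tasks that survive a restart and can be retried, here is a minimal sketch of a task queue backed by SQLite. It is not the Nuxeo implementation: the table layout, the function names and the retry policy are all assumptions.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the task table; a file path makes the queue persistent."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS indexing_tasks ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "doc_id TEXT NOT NULL, "
        "attempts INTEGER NOT NULL DEFAULT 0, "
        "done INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def enqueue(conn, doc_id):
    """Record an indexing task, e.g. right after the repository commit."""
    with conn:
        conn.execute("INSERT INTO indexing_tasks (doc_id) VALUES (?)", (doc_id,))


def run_pending(conn, index_fn, max_attempts=3):
    """Run every pending task once; failed tasks stay queued for a later retry."""
    rows = conn.execute(
        "SELECT id, doc_id FROM indexing_tasks "
        "WHERE done = 0 AND attempts < ? ORDER BY id",
        (max_attempts,),
    ).fetchall()
    completed = 0
    for task_id, doc_id in rows:
        try:
            index_fn(doc_id)
        except Exception:
            with conn:
                conn.execute(
                    "UPDATE indexing_tasks SET attempts = attempts + 1 WHERE id = ?",
                    (task_id,),
                )
        else:
            with conn:
                conn.execute("UPDATE indexing_tasks SET done = 1 WHERE id = ?", (task_id,))
            completed += 1
    return completed


conn = open_queue(":memory:")  # use a file path instead to survive restarts
enqueue(conn, "doc1")
enqueue(conn, "doc2")

indexed, failed_once = [], set()


def flaky_index(doc_id):
    """Stand-in for the indexing call; fails once for doc2."""
    if doc_id == "doc2" and doc_id not in failed_once:
        failed_once.add(doc_id)
        raise RuntimeError("index temporarily unavailable")
    indexed.append(doc_id)


first = run_pending(conn, flaky_index)   # doc1 succeeds, doc2 fails
second = run_pending(conn, flaky_index)  # doc2 is retried and succeeds
print(first, second, indexed)
```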

First results

Does it work?

Yes, it does work!

Perspectives

Improving ES integration

Improve integration

load Documents from ES rather than repository
UpdateByQuery for recursive updates

Leverage elasticsearch features

integrate Aggregates system with Nuxeo PageProviders
move Audit Log to elasticsearch
use Kibana for analytics

Nuxeo and Lucene

By Thierry Delprat
