Nuxeo & Elasticsearch
NUXEO AND LUCENE
About Nuxeo
- Since 2001
- A team of 25 developers
- Making open source software since the very beginning
- Providing a software platform to build Content Applications
- Document Repository, Document Management
- Digital Asset Management, Media Library
- Case Management, Document centric business process
- Any business application that needs to manage Content (i.e. structured business objects, not just files)
- Trusted by Electronic Arts, JCDecaux, Boeing, Sharp ...
Nuxeo & Content
- Store Content
- Manage versions
- Secure access to Content
- Manage relations between Content objects
- Render content
- Convert content
- Manage workflows
- Forms, Layouts and Tasks
- ...
- Index and Query
Nuxeo & Lucene
That's a long story: Nuxeo has integrated Lucene with various technologies over the years.
Is Lucene the superhero we are looking for?
Nuxeo / Lucene History
- 2006 : Nuxeo CPS 3.6 (Python / Zope based)
- Built-in index (Z-Catalog) was too weak
- Built a Lucene-based index server accessed via XML-RPC
- PyLucene (Lucene rebuilt with GCJ + Python bindings)
- Twisted as server framework
- Complex setup
- 2007 : Nuxeo Platform 5.1 (Java + JCR)
- JCR is not good for making queries
- Integrated a Lucene-based index via Compass Core
- provides transactional wrapper
- provides storage abstraction
- Concurrency issues
Nuxeo / Lucene now
- Since 2009 :
- VCS: home-grown SQL-based repository
- Everything is indexed inside the database (ACID)
- But :
- Full text capabilities are not so good
- Scaling for complex queries is a challenge
- Handling very big repositories is a challenge
- 2014: Nuxeo 5.9.3 + Elasticsearch
- Distribute indexes and queries on multiple nodes
- Relieve the database
- Learn from our past mistakes!
RDBMS Limits
Static schemas
- RDBMS
- Schemas are defined beforehand
- not always possible
- Schema choice impacts performance
- Schema migrations are painful
Query performance
- SQL DB issues
- normalized data
- generates complex joins and filtering
- several round trips to fetch a Document
- poor performance on unselective multi-criteria queries
- tweaking is possible but painful
- medium or poor full text support
- varies depending on DB vendor
- Workarounds
- Stored procedures and triggers (maintenance issues)
Query Example
- Some queries are nearly impossible to optimize
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
Write overhead
- RDBMS engine constraints create overhead
- buffer management
- locking & latching
- transaction isolation
- transaction log
- Application and model level constraints
- constraint checking
- type, unique, fk
- Triggers and associated stored procedures
Nuxeo Example
- Triggers and constraints impact performance
- e.g. compute ACLs, compute path ...
- Most databases don't handle cascades correctly
- e.g. ON DELETE CASCADE
Scale out
- RDBMS don't scale out easily
- multi-master is very expensive and not very efficient
- ensuring ACID across the network is a real problem!
- single master + multi-read is easy but not always applicable
- need to flag read-only transactions at startup
- data partitioning is a solution
- but it impacts user experience (no global index)
Why Elasticsearch wins
- No Static Schemas
- essentially schema-less
- mapping can be adapted as needed
- Super query performance
- query on term using an inverted index
- returning document and score
- native full text support
- one query to fetch them all
- Fast indexing
- No ACID constraints
- Easy scale out
- Native distributed architecture
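To illustrate the "query on term using an inverted index, returning document and score" point, here is a minimal sketch in Python. It is not how Lucene or Elasticsearch is implemented: the scoring is a plain sum of term frequencies, and the function names and sample documents are made up for the example.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to the documents containing it, with its term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return index


def search(index, query):
    """Return (doc_id, score) pairs, best score first, ties broken by doc_id."""
    scores = Counter()
    for term in query.lower().split():
        # One dictionary lookup per term: no table scan, no join.
        for doc_id, freq in index.get(term, {}).items():
            scores[doc_id] += freq
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical sample documents.
docs = {
    "doc1": "Sydney opera house picture",
    "doc2": "Video of Sydney harbour and Sydney bridge",
    "doc3": "Annual report",
}
index = build_inverted_index(docs)
results = search(index, "sydney picture")
```

Each query term costs one lookup in the index, which is why term queries stay fast as the number of documents grows.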
Why keep RDBMS ?
ACID can be useful
- Ensuring immediate consistency may be required
- some users won't understand anything else anyway
- The ACID model ensures that data
- is always good
- is auditable
- ACID RDBMS are safe and reliable
- people trust RDBMS to store their data
RDBMS are here to stay
- Customers have Databases and DBAs
- Customers have BI tools that use RDBMS
- RDBMS are used for data interoperability
Hybrid model
- Use each storage for what it does best
- RDBMS
- store content in an ACID way
(store & retrieve)
- Elasticsearch
- provide powerful and scalable queries
- do the heavy lifting that the RDBMS cannot do
Nuxeo & Elasticsearch
Nuxeo Challenges
- Security filtering
- filtering may be complex and must be fast
- Hierarchy
- recursive re-indexing needed for some operations
- Ensuring consistency
- keep repository and index in sync
- User operations should be indexed in real time
- UI must be rendered using ES index
- ES index must be updated in pseudo real time
ES/Lucene challenges
- No support for update
- Document must be re-indexed completely
- No Join
- cannot split hierarchy, security and data into separate indexes
- No transaction management
Nuxeo Cluster
Nuxeo and Sharding
ES to Lighten Database Load
ES and Sharding
Nuxeo 5.9.3 integration
- Export Nuxeo Document as JSON
- JSON is already the format of our REST API
- Index mapping is configurable
- Asynchronous persistent tasks for indexing
- survive server restart and can be retried as needed
- Manage near-real time for UI operations
- indexing tasks triggered by the UI run post-commit
- NXQL query language and PageProvider model
- mapped to ES query language
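As an illustration of mapping a repository query to the ES query language, the sketch below builds an Elasticsearch query DSL body roughly equivalent to the SQL query shown earlier (full text on 'sydney', a type filter, no versions, not deleted, newest first). It is not the actual Nuxeo NXQL converter: the field names (`fulltext`, `primaryType`, `isVersion`, `lifecycleState`, `created`) and the function name are assumptions, and the `bool` query layout follows current Elasticsearch releases.

```python
import json


def build_es_query(primary_types, fulltext, limit=201, offset=0):
    """Build an Elasticsearch query DSL body as a plain dict (field names are hypothetical)."""
    return {
        "query": {
            "bool": {
                # Scored full text clause.
                "must": [{"match": {"fulltext": fulltext}}],
                # Non-scored filters: document type and presence of a creation date.
                "filter": [
                    {"terms": {"primaryType": primary_types}},
                    {"exists": {"field": "created"}},
                ],
                # Exclusions: versions and deleted documents.
                "must_not": [
                    {"exists": {"field": "isVersion"}},
                    {"term": {"lifecycleState": "deleted"}},
                ],
            }
        },
        "sort": [{"created": {"order": "desc"}}],
        "from": offset,
        "size": limit,
    }


query = build_es_query(["Video", "Picture", "File", "Audio"], "sydney")
body = json.dumps(query, indent=2)  # JSON body that would be sent to the search endpoint
```

All criteria go into one request against one denormalized document, so no join is needed.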
Nuxeo 5.9.3 integration
- Security
- Previously tested approaches
- use a custom Lucene Matcher
- use index joins inside SOLR
- simplified ACLs are part of the indexed document
- Document fetching
- for now only the entry is taken from ES
- for now the Document is fetched from repository
- Integration
- run with embedded ES (test) or external server
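The "simplified ACLs are part of the indexed document" approach can be sketched as follows. This is an assumption-laden illustration, not Nuxeo code: the `acl` field name and the three helper functions are hypothetical. The idea is that each indexed document carries the list of principals allowed to read it, and every search adds a filter on the current user's principals.

```python
def index_document(doc_id, title, readers):
    """Build the JSON-like document to index, with a flattened read ACL."""
    return {"id": doc_id, "title": title, "acl": sorted(readers)}


def security_filter(user, groups):
    """Filter clause to add to every query: the document ACL must contain one of the user's principals."""
    return {"terms": {"acl": [user, *groups]}}


def is_visible(doc, user, groups):
    """Pure-Python equivalent of the filter, useful to check the logic."""
    principals = {user, *groups}
    return any(principal in principals for principal in doc["acl"])


doc = index_document("doc1", "Budget 2014", {"finance", "alice"})
clause = security_filter("bob", ["finance", "members"])
```

Because the ACL is a plain field of the document, security filtering needs no join and no custom Lucene code. The trade-off is that changing permissions on a folder requires re-indexing its descendants.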
Async indexing
Sync indexing
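The two indexing modes can be sketched as below. This is one possible reading of the integration bullets (post-commit indexing for UI operations, queued tasks for everything else), not the actual Nuxeo implementation: the class and function names are hypothetical, a dict stands in for the ES index, and an in-memory deque stands in for the persistent task queue.

```python
from collections import deque


class IndexingSession:
    """Collects indexing commands during a transaction (hypothetical API)."""

    def __init__(self, index, work_queue):
        self.index = index            # dict standing in for the ES index
        self.work_queue = work_queue  # deque standing in for a persistent task queue
        self.pending = []

    def schedule(self, doc_id, doc, sync=False):
        """Record a command; nothing reaches the index before commit."""
        self.pending.append((doc_id, doc, sync))

    def commit(self):
        """Post-commit hook: index sync commands now, queue the others."""
        for doc_id, doc, sync in self.pending:
            if sync:
                self.index[doc_id] = doc
            else:
                self.work_queue.append((doc_id, doc))
        self.pending.clear()

    def rollback(self):
        """A rolled-back transaction must leave the index untouched."""
        self.pending.clear()


def run_worker(index, work_queue):
    """Drain the queue; a real worker would persist tasks and retry failures."""
    while work_queue:
        doc_id, doc = work_queue.popleft()
        index[doc_id] = doc


index, queue = {}, deque()
session = IndexingSession(index, queue)
session.schedule("doc1", {"title": "Edited in the UI"}, sync=True)
session.schedule("doc2", {"title": "Bulk import"})
session.commit()
visible_after_commit = sorted(index)
run_worker(index, queue)
visible_after_worker = sorted(index)
```

Indexing only after the repository commit keeps the index from ever showing data that was rolled back.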
First results
Does it work?
Yes, it does work!
Perspectives
Improving ES integration
- Improve integration
- load Documents from ES rather than repository
- UpdateByQuery for recursive updates
- Leverage Elasticsearch features
- integrate Aggregates system with Nuxeo PageProviders
- move Audit Log to elasticsearch
- use Kibana for analytics
Nuxeo and Lucene
By Thierry Delprat