Some Context
What we Do and What Problems We Try to Solve

- we provide a Platform that developers can use to build highly customized Content Applications
- we provide components, and the tools to assemble them
- everything we do is open source (for real)
- various customers - various use cases
- me: developer & CTO - joined the Nuxeo project 10+ years ago

Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM

Document Repository
Store Documents / Assets / Objects
- Blob objects
- Complex data Structures
- Hierarchy, references and links
- Audit trail & Versioning
- Data level security & encryption
- Lifecycle, workflows ...
API (REST, CMIS, Java, JS...)
- Search
- Service API
Heavily configurable : all data structures are flexible / customizable

Used by developers to build Content Applications on top of the Nuxeo Repository

Our Challenges
- CRUD on large repository works
- inject at 6,000 docs/s up to 1 Billion
not so many companies have that many documents anyway
Queries are the main scalability issue
- impact of writes (create / update / delete) vs search
- multi-criteria queries + full-text
- security filtering
- configurable data structures
- user defined queries
- UI heavily depends on search
Search API is the most used:
search is the main scalability challenge

History : Nuxeo & Lucene
2006: Nuxeo CPS 3.6 (Python / Zope based)
- Replace built-in index with Lucene + XML-RPC server
- (GCJ build + Python bindings!)
- Complex setup
2007: Nuxeo Platform 5.1
- JCR: queries (and backup) issues
- Integrate Compass Core
- transactional & storage abstraction
- Missing sync & concurrency issues

2009: Nuxeo 5.2
VCS : Homebrew SQL based repository
Search in database but some real limitations
2013 / 2014: Nuxeo 5.9.3
Reintroduce Lucene in the stack via elasticsearch
Learn from our past mistakes
Leverage elasticsearch architecture
- easy deployment
- safe indexing
- powerful search
... we are now happy with Elasticsearch
Lucene and Nuxeo have a long story ...
Repository & Search
Understanding the Issue

Repository & Search
Complex SQL Queries
Configurable Data Structure
+ User defined multi-criteria searches
=> multiple & complex SQL queries

SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
About SQL Limitations
Scaling queries is complex
- depends on indexes, I/O speed and available memory
- cannot satisfy all types of queries
- poor performance on unselective multi-criteria queries
- some types of queries can simply not be fast in SQL
- Scalability
- Scale up is expensive
- Scale out is complex at best (XA & MVCC)
- Sharding requires a global index
- Full-text support is usually poor
- limitations on features & impact on performance
SQL technology is not the solution

Is NoSQL the solution?!

Using NoSQL for the repository

About the NoSQL option
(sadly) NoSQL is no magic
it does work very well for CRUD and it scales easily, but
- query options are limited and performance is not that good
- multi-document transactions are usually not safe
- better suited to DBs with billions of entries and simple queries
SQL has some real advantages
ACID (and MVCC) is good
- Workflows and bulk updates are a typical use case
- (even transient) lack of consistency is complex to explain to users
- lot of existing tools (BI & reporting), lot of existing skills (DBA)
- PGSQL (or AWS RDS) can be very cost effective

SQL or NoSQL repository are not the solution
Keep the repository
find a super fast index engine

Repository & ElasticSearch
Toward a Hybrid Storage

Hybrid Storage
Use each storage solution for what it does best
store content in an ACID way
store & retrieve
queries needing ACID and MVCC
provide powerful and scalable queries
do the heavy lifting that the RDBMS can not do
scoring, native full-text, aggregates
distributed search

Route the query to the correct index depending on requirements
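
A minimal sketch of that routing rule, assuming a single criterion: whether the caller needs a transactional (ACID / MVCC) view of the data. The `Query` class and the `route` function are hypothetical names used for illustration, not part of the Nuxeo API.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical description of a query and what it requires."""
    text: str
    needs_transactional_view: bool = False  # must see changes made inside the current transaction


def route(query: Query) -> str:
    """Return the backend that should run the query: 'repository' or 'elasticsearch'."""
    if query.needs_transactional_view:
        # Only the ACID / MVCC repository can answer consistently from within a transaction.
        return "repository"
    # Scoring, full-text, aggregates and plain listings go to the index.
    return "elasticsearch"
```

A real router would likely look at more than one flag (for example, whether the query uses operators that only one backend supports); the single flag keeps the idea visible.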

Elasticsearch & Repository

One query
Several possible backends
Performance results
Fast indexing
No ACID constraints / No impedance issue
3,500 documents/s when using SQL backend
10,000 documents/s when using MongoDB
Super query performance
query on term using inverted index
very efficient caching
native full text support & distributed architecture
3,000 queries/s with 1 elasticsearch node
6,000 queries/s with 2 elasticsearch nodes

some real life feedback

We are now testing the Nuxeo 6 stack in AWS.
The DB is PostgreSQL on a db.r3.8xlarge, which has 32 CPUs.
Between 350 and 400 tps the DB CPU is maxed out.
Nuxeo support: Please activate nuxeo-elasticsearch!
We are now able to do about 1200 tps with almost 0 DB activity.
One question though: Nuxeo and ES do not seem to be maxed out?
Nuxeo support: It looks like you have some network congestion between your client and the servers.
...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...
SQL vs ElasticSearch

Scalability is simply of another order of magnitude
Scale out

Tested with 10 PgSQL databases
- 10 x 100 Million documents => 1 Billion documents
- 1 elasticsearch cluster

Is this magic?
For users
it really looks like magic
For sales guys & solution architects
it is magic: it unleashes a lot of possibilities
performance is just one aspect
For Nuxeo Core Dev team
it was almost magic: some integration work was needed

Integrating ElasticSearch
Inside nuxeo-elasticsearch Plugin

Challenges to address
- Keep index in sync with the repository
- No transaction management
- Do not lose anything
Without support for update
Mitigate the effects of eventual consistency
Avoid displaying transient inconsistent state
Handle security filtering
- Without join
- Without post-filtering

Security Filtering
- Constraints
- Filtering must be done at index level : no post filtering
Join is not an option
- cannot join with the DB or within Lucene (previously tested without success)
- Solution
- index the ReadACL as part of the JSON Document
- list of groups / users who can read the resource
automatically add a filter clause on ACL
- Recursive indexing is needed
More pressure to maintain re-indexing processing
- in last resort: the Document security is checked by the repository anyway
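
The filter clause on the indexed ReadACL could look like the sketch below. It assumes the ACL is indexed in a field named `ecm:acl` (an assumption made here for illustration) and uses the `bool` / `filter` query form of current Elasticsearch versions; the function name is hypothetical.

```python
def with_read_acl_filter(query: dict, principals: list, acl_field: str = "ecm:acl") -> dict:
    """Wrap an Elasticsearch query so that only documents whose indexed ReadACL
    contains one of the caller's principals (user name or groups) can match."""
    return {
        "bool": {
            "must": query,
            # A filter clause does not affect scoring and can be cached.
            "filter": {"terms": {acl_field: list(principals)}},
        }
    }
```

Because the filter is part of the query itself, paging and aggregate counts already reflect what the user is allowed to see, which is why no post-filtering is needed.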

SAFE Indexing Flow
Do not try to make it transactional
- Collect and de-duplicate Repository Events during Transaction
Wait for commit to be done at the repository level
then call elasticsearch
Do not lose any update
run Indexing Tasks in a distributed Job infrastructure
- Jobs should be persisted
- Jobs should be retried
- Jobs should be monitored
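
The flow above can be sketched as follows: events are collected and de-duplicated while the transaction runs, jobs are submitted only after the repository commit, and each job is retried on failure. `IndexingCollector`, `run_with_retry` and the `submit_job` callback are hypothetical names; job persistence and monitoring are left out of this sketch.

```python
class IndexingCollector:
    """Collects repository events during one transaction (hypothetical helper)."""

    def __init__(self, submit_job):
        self._pending = {}             # doc_id -> last operation seen
        self._submit_job = submit_job  # callable that enqueues an indexing job

    def on_event(self, doc_id, operation):
        # De-duplicate: several events on one document become a single task
        # that carries the most recent operation.
        self._pending[doc_id] = operation

    def after_commit(self):
        # Called only once the repository commit has succeeded.
        for doc_id, operation in self._pending.items():
            self._submit_job(doc_id, operation)
        self._pending.clear()

    def after_rollback(self):
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()


def run_with_retry(task, max_attempts=3):
    """Run an indexing task, retrying on failure; re-raise after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception:
            if attempt == max_attempts:
                raise
```

In a real deployment the queue behind `submit_job` would be persistent, so that a node crash between commit and indexing does not lose the update.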

Async Indexing Flow

Mitigate Eventual Consistency
In the code :
- use case : need to see results from within the transaction
query directly on the repository
- leverage ACID and MVCC of SQL repository
full-text search and facets are usually not needed by the code
For the users :
- use case : see changes in listings in "real time"
use pseudo-real time indexing
- indexing actions triggered by UI threads are flagged
- run as afterCompletion listener
- refresh elasticsearch index
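
A minimal sketch of the pseudo-real-time path, assuming the three callbacks (`index_now`, `refresh_index`, `enqueue_job`) are supplied by the caller. They are hypothetical hooks standing in for the Elasticsearch client and the job queue, not Nuxeo APIs.

```python
def index_after_completion(doc_ids, triggered_by_ui, index_now, refresh_index, enqueue_job):
    """Choose between pseudo-real-time and asynchronous indexing once the
    transaction has completed. Returns 'sync' or 'async'."""
    if triggered_by_ui:
        # Index inline, then refresh so that the next listing sees the change.
        for doc_id in doc_ids:
            index_now(doc_id)
        refresh_index()
        return "sync"
    # Background work: let the job infrastructure index it later.
    for doc_id in doc_ids:
        enqueue_job(doc_id)
    return "async"
```

The refresh is issued once per transaction rather than once per document, since refreshing an index is comparatively expensive.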

Pseudo-Sync Indexing Flow

Does this work ?
- Live for about 18 months now
No missing sync issue
- some customers asked for verification tools
- but no problem was found
re-index in bulk mode is very fast anyway
No consistency issues
good usage of hybrid query engines
- elasticsearch helped address several scaling challenges

but elasticsearch brings us much more than just scalability
Bonus from ElasticSearch
More than Raw Speed

Leverage Aggregates
- Leverage elasticsearch aggregates
- integrate with the Query system (PageProvider)
- integrate with the Listing / UI model (ContentView)
- Makes it easy to build and configure faceted search
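
As an illustration of what a faceted search request looks like on the Elasticsearch side, here is a sketch that builds a search body with one `terms` aggregation per facet. The function name and the field names in the usage note are assumptions for illustration; this is not the PageProvider code.

```python
def faceted_search_body(query: dict, facets: dict, size: int = 10) -> dict:
    """Build a search body that returns hits plus one terms aggregation per facet.

    `facets` maps a facet name to the indexed field it buckets on.
    """
    return {
        "query": query,
        "size": size,
        "aggs": {
            name: {"terms": {"field": field}}
            for name, field in facets.items()
        },
    }
```

For example, `faceted_search_body({"match_all": {}}, {"by_type": "ecm:primaryType"})` asks for the first ten hits together with a document count per primary type, which is what a facet widget needs to render its choices.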

Advanced indexing
- Fine tuning of elasticsearch indexing
- multi language support using multiple analyzers and copy_to
- compound fields created using groovy scripts
- Introduce elasticsearch hints into NXQL
- select a specific elasticsearch index / analyzer
- leverage elasticsearch operators
- do geolocation search
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for GeoQuery based on geo_hash_cell location near a point using geohash;
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
Index Audit Trail with Elasticsearch
- Use elasticsearch to store & index Audit trail
all events are serialized in JSON and stored inside elasticsearch
Unleash Audit system power
- can store a lot of events
- can store and query arbitrary JSON structure

Elasticsearch Pass-Through
- Expose an HTTP pass-through API on top of Nuxeo integration
Integrate Authentication & Authorization
- not all users can access workflow index
Integrate Security Filtering
- activate data level security filtering
Expose "virtual index" via http
index + filter
Use elasticsearch API related components on Nuxeo data
- Documents + Audit log
- With embedded security
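
The "virtual index = index + filter" idea can be sketched as below: the pass-through first checks that the caller may query the requested index, then wraps whatever query was sent inside the security filter before forwarding it. All names (`virtual_index_search`, `allowed_indexes`, `security_filter`) are hypothetical, and the `bool` / `filter` form assumes a current Elasticsearch version.

```python
def virtual_index_search(index, body, allowed_indexes, security_filter):
    """Prepare a pass-through search request.

    Raises PermissionError if the caller may not query `index`; otherwise
    returns the (index, body) pair to forward, with the security filter applied.
    """
    if index not in allowed_indexes:
        raise PermissionError(f"access to index {index!r} is not allowed")
    secured = dict(body)  # shallow copy: keep aggs, size, sort untouched
    secured["query"] = {
        "bool": {
            # A request without a query means "match everything".
            "must": body.get("query", {"match_all": {}}),
            "filter": security_filter,
        }
    }
    return index, secured
```

Since the client's own query ends up inside `must`, it can narrow the results but cannot widen them beyond what the filter allows.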

Easy real time data analytics on business data
Data Analytics with Elasticsearch

Queries on Documents + Audit: flexible reporting on workflows

Read Documents from Elasticsearch
- Full JSONDocument is stored in elasticsearch
- required to be able to do fast re-indexing
- We can retrieve Documents from elasticsearch
execute full search & retrieve without touching the DB
By controlling indexing we can use the elasticsearch index
- as a persistent cache on top of the repository
- as a staging area for queries

Next steps
Leveraging Even More elasticsearch

Next steps
- Leverage elasticsearch percolator
- push update on the nuxeo-drive clients
- notify users about saved search
automatic categorization
Search result highlighting
not sure why it is still not there ...
- Plug automatic denormalization

