Some Context

What We Do and What Problems We Try to Solve

Nuxeo
- we provide a Platform that developers can use to build highly customized Content Applications
- we provide components, and the tools to assemble them
- everything we do is open source (for real)
various customers - various use cases
me: developer & CTO - joined the Nuxeo project 10+ years ago

Track game builds

Electronic Flight Bags

Central repository for Models

Food industry PLM

https://github.com/nuxeo

Document Repository

Store Documents / Assets / Objects
- Blob objects
- Complex data Structures
- Hierarchy, references and links
Audit trail & Versioning
Data level security & encryption
Lifecycle, workflows ...
API (REST, CMIS, Java, JS...)
- CRUD
- Search
- Service API

Heavily configurable: all data structures are flexible / customizable

Used by developers to build Content Applications on top of the Nuxeo Repository

Our Challenges

CRUD on a large repository works
- injection at 6,000 docs/s, up to 1 billion documents
- not so many companies have that many documents anyway
Queries are the main scalability issue
- impact of create / update / delete (c_ud) vs. search
- multi-criteria queries + full-text
- security filtering
- configurable data structures
- user defined queries
- UI heavily depends on search

Search API is the most used:

search is the main scalability challenge

History: Nuxeo & Lucene

2006: Nuxeo CPS 3.6
(Python / Zope based)
- Replace built-in index with
  Lucene + XML-RPC server
- PyLucene
  (GCJ build + Python bindings!)
- Complex setup
2007: Nuxeo Platform 5.1
- JCR: queries (and backup) issues
- Integrate Compass Core:
  transactional & storage abstraction
- Missing sync & concurrency issues

2009: Nuxeo 5.2
- VCS: homegrown SQL-based repository
- Search in the database, but with some real limitations
2013 / 2014: Nuxeo 5.9.3
- Reintroduce Lucene in the stack via elasticsearch
  - Learn from our past mistakes
  - Leverage elasticsearch architecture
    - easy deployment
    - safe indexing
    - powerful search

... we are now happy with Elasticsearch

Lucene and Nuxeo have a long history ...

Repository & Search

Understanding the Issue

Complex SQL Queries

Configurable Data Structure
+ User defined multi-criteria searches
=> multiple & complex SQL queries

SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
   JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
   LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
   LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
 WHERE
  ("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
  AND ((TO_TSQUERY('english', 'sydney') @@ NX_TO_TSVECTOR("fulltext"."fulltext")))
  AND ("hierarchy"."isversion" IS NULL)
  AND ("_F1"."lifecyclestate" <> 'deleted')
  AND ("_F2"."created" IS NOT NULL)
 ORDER BY "_F2"."created" DESC
 LIMIT 201 OFFSET 0;

About SQL Limitations

Scaling queries is complex
- depends on indexes, I/O speed and available memory
  - cannot satisfy all types of queries
- poor performance on unselective multi-criteria queries
  - some types of queries simply cannot be fast in SQL
Scalability
- Scaling up is expensive
- Scaling out is complex at best (XA & MVCC)
- Sharding requires a global index
Full-text support is usually poor
- limited features & impact on performance

SQL technology is not the solution

Is NoSQL the solution?!

Using NoSQL for the repository

About the NoSQL option

(sadly) NoSQL is no magic
- it works very well for CRUD and scales easily, but
  - query options are limited and query performance is not that good
  - multi-document transactions are usually not safe
- better suited to DBs with billions of entries and simple queries
SQL has some real advantages
- ACID (and MVCC) is good
  - Workflows and bulk updates are a typical use case
  - (even transient) lack of consistency is complex to explain to users
- lots of existing tools (BI & reporting), lots of existing skills (DBA)
- PostgreSQL (or AWS RDS) can be very cost effective

A SQL or NoSQL repository alone is not the solution

Keep the repository
SQL or NoSQL
but
find a super fast index engine

Repository & Elasticsearch

Toward a Hybrid Storage

Hybrid Storage

Use each storage solution for what it does best
- SQL DB
  - store content in an ACID way
    - store & retrieve
    - queries that need ACID and MVCC
- elasticsearch
  - provide powerful and scalable queries
  - do the heavy lifting that the RDBMS cannot do
    - scoring, native full-text, aggregates
    - distributed search

Route the query to the correct index depending on requirements
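
The routing rule can be sketched as a small function. This is only an illustration of the idea, not the Nuxeo implementation: the `Query` class, its `needs_transactional_view` flag and the backend names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Query:
    """Hypothetical description of a query and of what it requires."""
    text: str
    # True when the caller must see changes made inside the current,
    # not yet committed transaction (ACID / MVCC semantics).
    needs_transactional_view: bool = False


def choose_backend(query: Query) -> str:
    """Return the name of the backend that should serve the query."""
    if query.needs_transactional_view:
        # Only the repository can see uncommitted changes.
        return "repository"
    # Everything else (listings, full-text, aggregates) goes to the index.
    return "elasticsearch"
```

The point of the design is that the same query language is exposed to callers, and only the requirement (transactional view or not) decides where the query runs.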

Elasticsearch & Repository

One query
Several possible backends

Performance results

Fast indexing
- No ACID constraints / no impedance mismatch
- 3,500 documents/s when using SQL backend
- 10,000 documents/s when using MongoDB
Super query performance
- query on term using inverted index
- very efficient caching
- native full text support & distributed architecture
- 3,000 queries/s with 1 elasticsearch node
- 6,000 queries/s with 2 elasticsearch nodes

Some real-life feedback

Customer: We are now testing the Nuxeo 6 stack in AWS. The DB is a PostgreSQL db.r3.8xlarge, which has 32 CPUs. Between 350 and 400 tps the DB CPU is maxed out.

Nuxeo support: Please activate nuxeo-elasticsearch!

Customer: We are now able to do about 1200 tps with almost 0 DB activity. Question though: Nuxeo and ES do not seem to be maxed out?

Nuxeo support: It looks like you have some network congestion between your client and the servers.

Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...

SQL vs Elasticsearch

Scalability is simply of another order of magnitude

Scale out

Unified Index on Sharded Repository

Tested with 10 PostgreSQL databases
- 10 x 100 Million documents => 1 Billion documents
- 1 elasticsearch cluster

Is this magic?

For users
- it really looks like magic
For sales guys & solution architects
- it is magic: it unleashes a lot of possibilities
  - performance is just one aspect
For Nuxeo Core Dev team
- it was almost magic: some integration work was needed

Integrating Elasticsearch

Inside the nuxeo-elasticsearch Plugin

Challenges to address

Keep index in sync with the repository
- No transaction management
- Do not lose anything
- Without support for update
Mitigate the effects of eventual consistency
- Avoid displaying transient inconsistent state
Handle security filtering
- Without join
- Without post-filtering

Security Filtering

Constraints
- Filtering must be done at index level: no post-filtering
- Join is not an option
  - cannot join with the DB or within Lucene (previously tested without success)
Solution
- index the ReadACL as part of the JSON Document
  - list of groups / users who can read the resource
- automatically add a filter clause on ACL
Consequences
- Recursive indexing is needed
- More pressure on re-indexing processing
  - as a last resort, Document security is checked by the repository anyway
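
A minimal sketch of the "filter clause on ACL" idea, written as plain Elasticsearch query DSL built in Python. The field name `ecm:acl`, the function name and the sample document are assumptions made for illustration, and the `bool` / `filter` form shown here depends on the Elasticsearch version in use.

```python
def with_read_acl_filter(query, principals, acl_field="ecm:acl"):
    """Wrap a query so that only documents readable by one of the
    principals (user name and groups) can match.

    `acl_field` is assumed to hold the indexed ReadACL: the list of
    users / groups allowed to read the document.
    """
    return {
        "bool": {
            "must": [query],
            # A terms filter matches when the indexed ACL list contains
            # at least one of the caller's principals.
            "filter": [{"terms": {acl_field: list(principals)}}],
        }
    }


# Hypothetical indexed document: the ReadACL travels with the JSON.
indexed_document = {"dc:title": "Sydney harbour", "ecm:acl": ["members", "alice"]}

secured = with_read_acl_filter({"match": {"dc:title": "sydney"}}, ["bob", "members"])
```

Because the ACL is part of each indexed document, a permission change on a folder has to be propagated to its descendants, which is where the recursive indexing comes from.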

Safe Indexing Flow

Do not try to make it transactional
- Collect and de-duplicate Repository Events during Transaction
- Wait for commit to be done at the repository level
  - then call elasticsearch
Do not lose any update
- run Indexing Tasks in a distributed Job infrastructure
  - Jobs should be persisted
  - Jobs should be retried
  - Jobs should be monitored
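
The flow above can be sketched as follows. This is an in-memory illustration under stated assumptions: the class and method names are hypothetical, and a real deployment would persist the jobs in a distributed job infrastructure rather than retry inline.

```python
class IndexingCollector:
    """Collects indexing commands during a transaction and only submits
    them once the repository commit is done."""

    def __init__(self, submit_job, max_retries=3):
        self._submit_job = submit_job    # callable(doc_id, command); may raise
        self._max_retries = max_retries
        self._pending = {}               # doc_id -> latest command (de-duplicated)

    def on_repository_event(self, doc_id, command):
        # Several events on the same document collapse into one command.
        self._pending[doc_id] = command

    def after_commit(self):
        """Submit all pending commands; return the ids that still failed."""
        failed = []
        for doc_id, command in self._pending.items():
            for _attempt in range(self._max_retries):
                try:
                    self._submit_job(doc_id, command)
                    break
                except Exception:
                    continue
            else:
                # All attempts failed: report it so it can be monitored.
                failed.append(doc_id)
        self._pending.clear()
        return failed

    def after_rollback(self):
        # Nothing was committed, so nothing must reach the index.
        self._pending.clear()
```

Nothing is sent to elasticsearch before the commit, so a rollback never leaves phantom entries in the index.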

Async Indexing Flow

Mitigate Eventual Consistency

In the code:
- use case: need to see results from within the transaction
- query directly on the repository
  - leverage ACID and MVCC of SQL repository
  - full-text search and facets are usually not needed by the code
For the users:
- use case: see changes in listings in "real time"
- use pseudo-real-time indexing
  - indexing actions triggered by UI threads are flagged
  - run as afterCompletion listener
  - refresh elasticsearch index
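
A small sketch of the pseudo-real-time path. `FakeIndex` is a hypothetical in-memory stand-in that mimics one Elasticsearch behaviour: indexed documents only become searchable after a refresh. The `after_completion` function and the shape of the commands are assumptions for illustration.

```python
class FakeIndex:
    """In-memory stand-in for an index: writes are buffered and only
    become searchable once refresh() is called."""

    def __init__(self):
        self._buffer = {}
        self._searchable = {}

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search_ids(self):
        return sorted(self._searchable)


def after_completion(index, commands, async_queue):
    """Run after the transaction completes.

    commands: list of (doc_id, doc, sync) tuples; sync is True when the
    change was triggered by a UI thread.
    """
    indexed_inline = False
    for doc_id, doc, sync in commands:
        if sync:
            index.index(doc_id, doc)            # index right away
            indexed_inline = True
        else:
            async_queue.append((doc_id, doc))   # leave it to the async jobs
    if indexed_inline:
        index.refresh()                         # make the change visible to the next listing
```

Only the flagged commands pay the cost of inline indexing and refresh; everything else keeps going through the asynchronous flow.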

Pseudo-Sync Indexing Flow

Does this work?

Live for about 18 months now
No missing sync issue
- some customers asked for verification tools
- but no problem was found
- re-index in bulk mode is very fast anyway
No consistency issues
- good usage of hybrid query engines
elasticsearch helped address several scaling challenges

but elasticsearch brings us much more than just scalability

Bonus from Elasticsearch

More than Raw Speed

Leverage Aggregates

Leverage elasticsearch aggregates
- integrate with the Query system (PageProvider)
- integrate with the Listing / UI model (ContentView)
Makes it easy to build and configure faceted search
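
As an illustration of what a facet boils down to on the elasticsearch side, here is a sketch that builds a terms aggregation request and turns the response buckets into facet counts. The function names, the aggregation name `facet` and the field `ecm:primaryType` are assumptions; the PageProvider / ContentView integration itself is not shown.

```python
def terms_aggregation_request(query, field, size=10):
    """Build a search body that returns only aggregation buckets."""
    return {
        "query": query,
        "size": 0,  # no hits needed, only the buckets
        "aggs": {"facet": {"terms": {"field": field, "size": size}}},
    }


def facet_counts(response, name="facet"):
    """Turn the buckets of a terms aggregation into {value: count}."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


request_body = terms_aggregation_request({"match_all": {}}, "ecm:primaryType")

# Shape of the relevant part of a search response (sample values).
sample_response = {
    "aggregations": {
        "facet": {"buckets": [{"key": "File", "doc_count": 12},
                              {"key": "Picture", "doc_count": 3}]}
    }
}
counts = facet_counts(sample_response)
```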

Advanced indexing

Fine tuning of elasticsearch indexing
- multi language support using multiple analyzers and copy_to
- compound fields created using groovy scripts
Introduce elasticsearch hints into NXQL
- select a specific elasticsearch index / analyzer
- leverage elasticsearch operators
- do geolocation search

-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'

-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'

-- Use ES for a geo query: locations near a point, using the geo_hash_cell operator (geohash)
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
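
To make the hint syntax concrete, here is a rough sketch that extracts the hints from queries like the ones above with a regular expression. It is not how Nuxeo parses NXQL: the function name is hypothetical, and the pattern only handles a single `NAME(argument)` per hint comment.

```python
import re

# /*+ES: NAME(argument) */ followed by the field the hint applies to.
HINT = re.compile(r"/\*\+ES:\s*(\w+)\(([^)]*)\)\s*\*/\s*([\w:.]+)")


def extract_es_hints(nxql):
    """Return a list of (hint, argument, field) tuples found in the query."""
    return [(m.group(1), m.group(2).strip(), m.group(3))
            for m in HINT.finditer(nxql)]


hints = extract_es_hints(
    "SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'"
)
```

Since the hints live inside a comment, a backend that does not know about them can simply ignore them.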

leverage what comes for free with elasticsearch

Index Audit Trail with Elasticsearch

Use elasticsearch to store & index Audit trail
- all events are serialized in JSON and stored inside elasticsearch
Unleash Audit system power
- can store a lot of events
- can store and query arbitrary JSON structure
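
A sketch of what "serialized in JSON" can look like for one audit event. The function name and the field names are illustrative assumptions, not a statement of the actual index mapping; the `extended` part stands for the arbitrary JSON structure mentioned above.

```python
import json
from datetime import datetime, timezone


def audit_entry(event_id, doc_id, principal, extended=None, when=None):
    """Serialize one audit event as a JSON string ready to be indexed."""
    when = when or datetime.now(timezone.utc)
    entry = {
        "eventId": event_id,
        "docUUID": doc_id,
        "principalName": principal,
        "eventDate": when.isoformat(),
        # Free-form, application-specific data attached to the event.
        "extended": extended or {},
    }
    return json.dumps(entry, sort_keys=True)


serialized = audit_entry(
    "documentModified", "doc-1", "alice",
    extended={"workflow": {"task": "review", "durationMs": 5300}},
    when=datetime(2015, 1, 1, tzinfo=timezone.utc),
)
```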

Elasticsearch Pass-Through

Expose an HTTP pass-through API on top of Nuxeo integration
- Integrate Authentication & Authorization
  - not all users can access the workflow index
- Integrate Security Filtering
  - activate data level security filtering
- Expose "virtual index" via HTTP
  - index + filter
Use elasticsearch API related components on Nuxeo data
- Documents + Audit log
- With embedded security
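
The "virtual index = index + filter" idea can be sketched like this. Everything named here is hypothetical: the `VIRTUAL_INDEXES` table, the index names, the group names and the `ecm:acl` field are assumptions made for the example.

```python
# virtual name -> (real index, groups allowed to query it or None, apply ACL filter?)
VIRTUAL_INDEXES = {
    "documents": ("nuxeo", None, True),
    "workflow": ("nuxeo-audit", {"administrators"}, False),
}


def route_pass_through(virtual_index, body, user_name, user_groups):
    """Check authorization, then return (real_index, body_to_forward)."""
    real_index, allowed_groups, filter_acl = VIRTUAL_INDEXES[virtual_index]
    if allowed_groups is not None and not allowed_groups & set(user_groups):
        raise PermissionError(f"{user_name} may not query {virtual_index}")
    if not filter_acl:
        return real_index, body
    secured = dict(body)
    secured["query"] = {
        "bool": {
            # Keep the caller's query, or match everything if none was given.
            "must": [body.get("query", {"match_all": {}})],
            # Data level security: only documents the caller can read.
            "filter": [{"terms": {"ecm:acl": [user_name, *user_groups]}}],
        }
    }
    return real_index, secured
```

The caller keeps using the native search API, while the security filter is added on the server side and cannot be bypassed from the client.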

Easy real time data analytics on business data

Data Analytics with Elasticsearch

Queries on Documents + Audit: flexible reporting on workflows

Read Documents from Elasticsearch

Full JSONDocument is stored in elasticsearch
- required to be able to do fast re-indexing
We can retrieve Documents from elasticsearch
- execute full search & retrieve without touching the DB
By controlling indexing we can use the elasticsearch index
- as a persistent cache on top of the repository
- as a staging area for queries

_source
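
Since the full JSON document is stored with each index entry, search hits already carry the document. A minimal sketch of reading documents back from a search response, with no call to the database; the function name and the sample values are illustrative.

```python
def documents_from_hits(response):
    """Return {document id: stored JSON document} from a search response."""
    return {hit["_id"]: hit["_source"] for hit in response["hits"]["hits"]}


# Shape of the relevant part of a search response (sample values).
sample_response = {
    "hits": {
        "hits": [
            {"_id": "doc-1", "_source": {"dc:title": "Sydney harbour"}},
            {"_id": "doc-2", "_source": {"dc:title": "Opera house"}},
        ]
    }
}
documents = documents_from_hits(sample_response)
```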

Next steps

Leveraging Even More elasticsearch

Leverage elasticsearch percolator
- push updates to the nuxeo-drive clients
- notify users about saved searches
- automatic categorization
Search result highlighting
- not sure why it is still not there ...
Plug automatic denormalization

Scaling the Document Repository with Elasticsearch

Any questions?

More from Thierry Delprat
