Scaling the Document Repository with Elasticsearch
Some Context
What we Do and What Problems We Try to Solve
Nuxeo
- Nuxeo
- we provide a Platform that developers can use to build highly customized Content Applications
- we provide components, and the tools to assemble them
- everything we do is open source (for real)
- various customers, various use cases
- me: developer & CTO, joined the Nuxeo project 10+ years ago
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
https://github.com/nuxeo
Document Repository
- Store Documents / Assets / Objects
- Blob objects
- Complex data Structures
- Hierarchy, references and links
- Audit trail & Versioning
- Data level security & encryption
- Lifecycle, workflows ...
- API (REST, CMIS, Java, JS...)
- CRUD
- Search
- Service API
Heavily configurable: all data structures are flexible / customizable
Used by developers to build Content Applications on top of the Nuxeo Repository
Our Challenges
- CRUD on large repository works
- inject at 6,000 docs/s up to 1 billion documents
- not so many companies have that many documents anyway
- Queries are the main scalability issue
- impact of C_UD (create, update, delete) vs search
- multi-criteria queries + full-text
- security filtering
- configurable data structures
- user-defined queries
- UI heavily depends on search
Search API is the most used:
search is the main scalability challenge
History: Nuxeo & Lucene
- 2006: Nuxeo CPS 3.6 (Python / Zope based)
- Replace built-in index with Lucene + XML-RPC server
- PyLucene (GCJ build + Python bindings!)
- Complex setup
- 2007: Nuxeo Platform 5.1
- JCR: queries (and backup) issues
- Integrate Compass Core: transactional & storage abstraction
- Missing sync & concurrency issues
- 2009: Nuxeo 5.2
- VCS: homebrew SQL-based repository
- Search in database, but some real limitations
- 2013 / 2014: Nuxeo 5.9.3
- Reintroduce Lucene in the stack via elasticsearch
- Learn from our past mistakes
- Leverage elasticsearch architecture
- easy deployment
- safe indexing
- powerful search
- ... we are now happy with Elasticsearch
Lucene and Nuxeo have a long story ...
Repository & Search
Understanding the Issue
Complex SQL Queries
- Configurable data structure + user-defined multi-criteria searches => multiple & complex SQL queries
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
About SQL Limitations
- Scaling queries is complex
- depends on indexes, I/O speed and available memory
- cannot satisfy all types of queries
- poor performance on unselective multi-criteria queries
- some types of queries simply cannot be fast in SQL
- Scalability
- Scale up is expensive
- Scale out is complex at best (XA & MVCC)
- Sharding requires a global index
- Full-text support is usually poor
- limitations on features & impact on performance
SQL technology is not the solution
Is NoSQL the solution!?
Using NoSQL for the repository
About the NoSQL option
- (sadly) NoSQL is no magic
- it does work very well for CRUD and it scales easily, but
- query options are limited and performance is not that good
- multi-document transactions are usually not safe
- better suited to DBs with billions of entries and simple queries
- SQL has some real advantages
- ACID (and MVCC) is good
- Workflows and bulk updates are a typical use case
- (even transient) lack of consistency is complex to explain to users
- lots of existing tools (BI & reporting), lots of existing skills (DBA)
- PGSQL (or AWS RDS) can be very cost effective
Neither a SQL nor a NoSQL repository is the solution
Keep the repository (SQL or NoSQL) but find a super fast index engine
Repository & ElasticSearch
Toward a Hybrid Storage
Hybrid Storage
- Use each storage solution for what it does best
- SQL DB: store content in an ACID way
- store & retrieve
- queries that need ACID and MVCC
- elasticsearch: provide powerful and scalable queries
- do the heavy lifting that the RDBMS cannot do
- scoring, native full-text, aggregates
- distributed search
- Route the query to the correct index depending on requirements
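The routing rule above can be sketched in a few lines. This is only an illustration of the idea, not the Nuxeo implementation: `QueryRequest`, its flags and `choose_backend` are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class QueryRequest:
    """Hypothetical description of what a caller needs from a query."""
    nxql: str
    needs_transactional_view: bool = False  # must see uncommitted changes
    uses_fulltext: bool = False
    uses_aggregates: bool = False


def choose_backend(request: QueryRequest, default: str = "elasticsearch") -> str:
    """Return the name of the backend that should serve the query."""
    # Only the repository can see changes made inside the current transaction.
    if request.needs_transactional_view:
        return "repository"
    # Scoring, full-text and aggregates are what the index engine is good at.
    if request.uses_fulltext or request.uses_aggregates:
        return "elasticsearch"
    # Plain queries can go either way; the default is an assumption.
    return default
```

A caller that needs to read its own uncommitted writes sets `needs_transactional_view=True`; listings and searches leave it unset and go to the index.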
Elasticsearch & Repository
One query
Several possible backends
Performance results
- Fast indexing
- No ACID constraints / No impedance issue
- 3,500 documents/s when using SQL backend
- 10,000 documents/s when using MongoDB
- Super query performance
- query on term using inverted index
- very efficient caching
- native full text support & distributed architecture
- 3,000 queries/s with 1 elasticsearch node
- 6,000 queries/s with 2 elasticsearch nodes
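To see why a "query on term using inverted index" is cheap, here is a toy inverted index. It is a teaching sketch only: Lucene's real structures (analyzers, segments, compressed postings) are far more elaborate, and the sample documents are made up.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index, terms):
    """Return ids of documents containing every term (set intersection)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    1: "Sydney harbour at night",
    2: "Flight plan to Sydney",
    3: "Harbour bridge model",
}
index = build_inverted_index(docs)
print(search_all_terms(index, ["sydney", "harbour"]))  # {1}
```

A multi-criteria query becomes a lookup per term plus a set intersection, with no table scan or join.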
Some real-life feedback
Customer: We are now testing the Nuxeo 6 stack in AWS. DB is PostgreSQL db.r3.8xlarge, which has 32 CPUs. Between 350 and 400 tps the DB CPU is maxed out.
Nuxeo support: Please activate nuxeo-elasticsearch!
Customer: We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?
Nuxeo support: It looks like you have some network congestion between your client and the servers.
Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...
SQL vs ElasticSearch
Scalability is simply of another order of magnitude
Scale out
Unified Index on Sharded Repository
- Tested with 10 PgSQL databases
- 10 x 100 Million documents => 1 Billion documents
- 1 elasticsearch cluster
Is this magic?
- For users
- it really looks like magic
- For sales guys & solution architects
- it is magic: it unleashes a lot of possibilities
- performance is just one aspect
- For Nuxeo Core Dev team
- it was almost magic: some integration work was needed
Integrating ElasticSearch
Inside nuxeo-elasticsearch Plugin
Challenges to address
- Keep index in sync with the repository
- No transaction management
- Do not lose anything
- Without support for update
- Mitigate eventually consistent effect
- Avoid displaying transient inconsistent state
- Handle security filtering
- Without join
- Without post-filtering
Security Filtering
- Constraints
- Filtering must be done at index level: no post-filtering
- Join is not an option
- cannot join with the DB or within Lucene (previously tested without success)
- Solution
- index the ReadACL as part of the JSON Document
- list of groups / users who can read the resource
- automatically add a filter clause on ACL
- Consequences
- Recursive indexing is needed
- More pressure to maintain re-indexing processing
- as a last resort: the Document security is checked by the repository anyway
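A minimal sketch of the "automatically add a filter clause on ACL" step, written as a plain Elasticsearch query body. The field name `ecm:acl` and the helper name are assumptions made for illustration, and the `bool`/`filter` form shown is the current query DSL rather than necessarily what the plugin emitted at the time.

```python
import copy


def with_read_acl_filter(query, principals, acl_field="ecm:acl"):
    """Wrap a query so only documents readable by `principals` can match.

    `acl_field` is an assumed name for the field holding the indexed ReadACL.
    """
    return {
        "bool": {
            "must": copy.deepcopy(query),
            # Filter context: no scoring; a document matches when its ACL
            # list contains at least one of the user's principals.
            "filter": {"terms": {acl_field: sorted(principals)}},
        }
    }


user_query = {"match": {"dc:title": "sydney"}}
secured = with_read_acl_filter(user_query, {"alice", "members"})
```

Because the ACL lives in the indexed document, the security check is a term filter inside the index: no join and no post-filtering of the result page.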
Safe Indexing Flow
- Do not try to make it transactional
- Collect and de-duplicate Repository Events during Transaction
- Wait for commit to be done at the repository level
- then call elasticsearch
- Do not lose any update
- run Indexing Tasks in a distributed Job infrastructure
- Jobs should be persisted
- Jobs should be retried
- Jobs should be monitored
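The flow above could look roughly like this. It is a sketch under stated assumptions: `IndexingCollector`, `run_job` and the in-memory list standing in for a persisted job queue are all hypothetical, and a real job infrastructure would also add monitoring.

```python
class IndexingCollector:
    """Collects indexing commands during a transaction (hypothetical sketch)."""

    def __init__(self, job_queue):
        self.job_queue = job_queue  # stands in for a persisted job queue
        self.pending = {}           # doc_id -> last action seen

    def on_event(self, doc_id, action):
        # De-duplicate: keep only the latest action per document.
        self.pending[doc_id] = action

    def after_commit(self):
        # Called only once the repository commit has succeeded.
        for doc_id, action in self.pending.items():
            self.job_queue.append((doc_id, action))
        self.pending.clear()

    def after_rollback(self):
        # Nothing was committed, so nothing must reach the index.
        self.pending.clear()


def run_job(job, send_to_index, max_attempts=3):
    """Run one indexing job, retrying on failure; return True on success."""
    for _ in range(max_attempts):
        try:
            send_to_index(*job)
            return True
        except Exception:  # a sketch: real code would log and back off
            continue
    return False
```

The index is never called inside the transaction, so a rollback costs nothing, and a failed call to the index is retried instead of being lost.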
Async Indexing Flow
Mitigate Eventual Consistency
- In the code:
- use case: need to see results from within the transaction
- query directly on the repository
- leverage ACID and MVCC of SQL repository
- full-text search and facets are usually not needed by the code
- For the users:
- use case: see changes in listings in "real time"
- use pseudo-real time indexing
- indexing actions triggered by UI threads are flagged
- run as afterCompletion listener
- refresh elasticsearch index
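One way to picture the pseudo-real-time path: commands flagged by a UI thread are indexed inline once the transaction has completed, then the index is refreshed, while everything else stays asynchronous. The function and its callable parameters below are hypothetical placeholders, not the plugin's API.

```python
def after_completion(commands, index_now, enqueue, refresh):
    """Dispatch indexing commands once the transaction has completed.

    `commands` is a list of (doc_id, is_sync) pairs; `index_now`, `enqueue`
    and `refresh` are callables supplied by the caller (all hypothetical).
    """
    indexed_sync = False
    for doc_id, is_sync in commands:
        if is_sync:
            index_now(doc_id)   # flagged by a UI thread: index inline
            indexed_sync = True
        else:
            enqueue(doc_id)     # everything else stays asynchronous
    if indexed_sync:
        refresh()               # make the new documents searchable at once
```

The refresh is only triggered when something was indexed inline, since refreshing on every transaction would cost indexing throughput.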
Pseudo-Sync Indexing Flow
Does this work?
- Live for about 18 months now
- No missing sync issue
- some customers asked for verification tools
- but no problem was found
- re-index in bulk mode is very fast anyway
- No consistency issues
- good usage of hybrid query engines
- elasticsearch helped address several scaling challenges
but elasticsearch brings us much more than just scalability
Bonus from ElasticSearch
More than Raw Speed
Leverage Aggregates
- Leverage elasticsearch aggregates
- integrate with the Query system (PageProvider)
- integrate with the Listing / UI model (ContentView)
- Makes it easy to build and configure faceted search
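As an illustration of what a faceted search asks of elasticsearch, the sketch below builds a `terms` aggregation request and turns the response buckets into facet counts. The helper names are made up for this example and the field names are only illustrative; the PageProvider / ContentView integration itself is not shown.

```python
def terms_aggregation_request(query, facets, size=10):
    """Build a search body that returns bucket counts instead of hits.

    `facets` maps a facet name to the indexed field to aggregate on.
    """
    return {
        "query": query,
        "size": 0,  # only the aggregations are needed, not the documents
        "aggs": {
            name: {"terms": {"field": field, "size": size}}
            for name, field in facets.items()
        },
    }


def facets_from_response(response):
    """Turn the aggregations of a search response into {facet: [(key, count)]}."""
    return {
        name: [(bucket["key"], bucket["doc_count"]) for bucket in agg["buckets"]]
        for name, agg in response.get("aggregations", {}).items()
    }
```

For example, `terms_aggregation_request({"match_all": {}}, {"by_creator": "dc:creator"})` asks for the top creators, and the counts come back in the same round trip as the search.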
Advanced indexing
- Fine tuning of elasticsearch indexing
- multi language support using multiple analyzers and copy_to
- compound fields created using groovy scripts
- Introduce elasticsearch hints into NXQL
- select a specific elasticsearch index / analyzer
- leverage elasticsearch operators
- do geolocation search
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: geo_hash_cell matches locations near a point, using a geohash
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
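For the "multiple analyzers and copy_to" point, a mapping fragment might look like the following, written here as a Python dict. This is an assumed example, not the mapping shipped with the plugin: the sub-field names and the `all_text` target are invented, and the `text` type is the current spelling (older elasticsearch versions used `string`).

```python
import json

# Hypothetical mapping fragment: one source field, analyzed several ways.
title_mapping = {
    "properties": {
        "dc:title": {
            "type": "text",
            "copy_to": "all_text",  # also feed a catch-all field
            "fields": {
                # Sub-fields index the same value with language analyzers.
                "en": {"type": "text", "analyzer": "english"},
                "fr": {"type": "text", "analyzer": "french"},
            },
        },
        "all_text": {"type": "text"},
    }
}

print(json.dumps(title_mapping, indent=2))
```

A sub-field such as `dc:title.fr` is then the kind of target an `INDEX(...)` hint can select.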
Index Audit Trail with Elasticsearch
- Use elasticsearch to store & index Audit trail
- all events are serialized in JSON and stored inside elasticsearch
- Unleash Audit system power
- can store a lot of events
- can store and query arbitrary JSON structure
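A small sketch of serializing an audit event to JSON before it is sent to the index. The `AuditEvent` class and its field names are hypothetical; the point is only that an arbitrary `extended` payload rides along without any schema change.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AuditEvent:
    """Hypothetical audit entry; field names are illustrative only."""
    event_id: str
    doc_id: str
    principal: str
    event_date: str  # ISO 8601 timestamp
    extended: dict = field(default_factory=dict)  # arbitrary extra JSON


def to_index_source(event):
    """Serialize an event to the JSON text that would be indexed."""
    return json.dumps(asdict(event), sort_keys=True)


entry = AuditEvent(
    event_id="documentModified",
    doc_id="1234",
    principal="alice",
    event_date="2015-03-01T10:00:00Z",
    extended={"workflow": {"task": "review", "outcome": "approved"}},
)
source = to_index_source(entry)
```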
Elasticsearch Pass-Through
- Expose an HTTP pass-through API on top of Nuxeo integration
- Integrate Authentication & Authorization
- not all users can access workflow index
- Integrate Security Filtering
- activate data level security filtering
- Expose "virtual index" via HTTP
- index + filter
- Use elasticsearch API related components on Nuxeo data
- Documents + Audit log
- With embedded security
Easy real time data analytics on business data
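The "virtual index = index + filter" idea can be sketched as a function that first checks whether the user may query the index at all, then wraps the incoming body with a security filter before forwarding it. Everything here is a simplified assumption: the function name, the shape of `user` and `allowed_indexes`, and the `ecm:acl` field.

```python
def virtual_index_search(user, index, body, allowed_indexes, acl_field="ecm:acl"):
    """Return the request to forward to elasticsearch, or raise PermissionError.

    `user` is a dict with "name" and "groups"; `allowed_indexes` maps an index
    name to the set of groups allowed to query it. All names are hypothetical.
    """
    groups = set(user["groups"])
    # Authorization: not every user may query every index.
    if not groups & allowed_indexes.get(index, set()):
        raise PermissionError(f"{user['name']} may not query index {index!r}")
    # Security filtering: restrict results to what the user can read.
    principals = sorted(groups | {user["name"]})
    filtered = dict(body)
    filtered["query"] = {
        "bool": {
            "must": body.get("query", {"match_all": {}}),
            "filter": {"terms": {acl_field: principals}},
        }
    }
    return {"index": index, "body": filtered}
```

The client still speaks the plain elasticsearch search API; it just never sees the unfiltered index.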
Data Analytics with Elasticsearch
Queries on Documents + Audit: flexible reporting on workflows
Read Documents from Elasticsearch
- Full JSONDocument is stored in elasticsearch (_source)
- required to be able to do fast re-indexing
- We can retrieve Documents from elasticsearch
- execute full search & retrieve without touching the DB
- By controlling indexing we can use the elasticsearch index
- as a persistent cache on top of the repository
- as a staging area for queries
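Since the full JSON document is kept in `_source`, a result page can be built from the search response alone. A minimal sketch, using the standard shape of an elasticsearch search response with made-up sample data:

```python
def documents_from_hits(response):
    """Return the stored JSON documents from a search response, no DB access."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


sample_response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"dc:title": "Sydney harbour"}},
            {"_id": "2", "_source": {"dc:title": "Flight plan"}},
        ]
    }
}
titles = [doc["dc:title"] for doc in documents_from_hits(sample_response)]
```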
Next steps
Leveraging Even More elasticsearch
Next steps
- Leverage elasticsearch percolator
- push update on the nuxeo-drive clients
- notify users about saved search
- automatic categorization
- Search result highlighting
- not sure why it is still not there ...
- Plug automatic denormalization
Any Questions?
Thank You!
https://github.com/nuxeo
http://www.nuxeo.com/careers/
Scaling the Document Repository with elasticsearch
By Thierry Delprat