Scaling the Document Repository with Elasticsearch
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725072/nx-font.png)
Some Context
What we Do and What Problems We Try to Solve
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Nuxeo
- Nuxeo
  - we provide a Platform that developers can use to build highly customized Content Applications
  - we provide components, and the tools to assemble them
  - everything we do is open source (for real)
- various customers - various use cases
- me: developer & CTO - joined the Nuxeo project 10+ years ago
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1733851/EALogoBlack1-300x145.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1733854/170px-FICO_logo.svg.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1733858/customer-jeppesen-300x101.jpg)
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
https://github.com/nuxeo
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1750082/Keendo.png)
Document Repository
- Store Documents / Assets / Objects
  - Blob objects
  - Complex data structures
  - Hierarchy, references and links
  - Audit trail & Versioning
  - Data level security & encryption
  - Lifecycle, workflows ...
- API (REST, CMIS, Java, JS...)
  - CRUD
  - Search
  - Service API

Heavily configurable: all data structures are flexible / customizable
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Used by developers to build Content Applications on top of the Nuxeo Repository
![](https://s3.amazonaws.com/media-p.slid.es/uploads/thierrydelprat/images/702446/icons_slides_nuxeo_features.009.jpg)
Our Challenges
- CRUD on large repository works
  - inject at 6,000 docs/s up to 1 Billion
  - not so many companies have that many documents anyway
- Queries are the main scalability issue
  - impact of c_ud vs search
  - multi-criteria queries + full-text
  - security filtering
  - configurable data structures
  - user defined queries
  - UI heavily depends on search
Search API is the most used:
search is the main scalability challenge
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1728154/Screenshot_from_2015-09-16_09_19_58.png)
History : Nuxeo & Lucene
- 2006: Nuxeo CPS 3.6 (Python / Zope based)
  - Replace built-in index with Lucene + XML-RPC server
  - PyLucene (GCJ build + Python bindings!)
  - Complex setup
- 2007: Nuxeo Platform 5.1
  - JCR: queries (and backup) issues
  - Integrate Compass Core: transactional & storage abstraction
  - Missing sync & concurrency issues
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
- 2009: Nuxeo 5.2
  - VCS: homebrew SQL-based repository
  - Search in database but some real limitations
- 2013 / 2014: Nuxeo 5.9.3
  - Reintroduce Lucene in the stack via elasticsearch
  - Learn from our past mistakes
  - Leverage elasticsearch architecture
    - easy deployment
    - safe indexing
    - powerful search
- ... we are now happy with Elasticsearch

Lucene and Nuxeo have a long history ...
Repository & Search
Understanding the Issue
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Repository & Search
Search API is the most used :
search is the main scalability challenge
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/65e29184-0f1d-43cc-bbc3-7f209eff2a1e/image.png)
Complex SQL Queries
- Configurable Data Structure + User-defined multi-criteria searches => multiple & complex SQL queries
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File', 'Audio'))
AND ((TO_TSQUERY('english', 'sydney') @@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
About SQL Limitations
- Scaling queries is complex
  - depends on indexes, I/O speed and available memory
  - cannot satisfy all types of queries
  - poor performance on unselective multi-criteria queries
  - some types of queries simply cannot be fast in SQL
- Scalability
  - Scale up is expensive
  - Scale out is complex at best (XA & MVCC)
  - Sharding requires a global index
- Full-text support is usually poor
  - limitations on features & impact on performance
SQL technology is not the solution
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Is NoSQL the solution?!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Using NoSQL for the repository
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/9380a81d-6283-4475-a309-b0262bfc212c/image.png)
About the NoSQL option
- (sadly) NoSQL is no magic
  - it does work very well for CRUD and it scales easily, but
    - query options are limited and performance is not that good
    - multi-document transactions are usually not safe
    - better suited to DBs with billions of entries and simple queries
- SQL has some real advantages
  - ACID (and MVCC) is good
  - Workflows and bulk updates are a typical use case
  - (even transient) lack of consistency is complex to explain to users
  - lots of existing tools (BI & reporting), lots of existing skills (DBA)
  - PGSQL (or AWS RDS) can be very cost effective
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Neither a SQL nor a NoSQL repository is the solution
Keep the repository (SQL or NoSQL), but find a super fast index engine
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1729751/postgresql-9.3-free-download.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1735743/mongodb-100275964-orig.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1735750/Epic_Flash.png)
Repository & ElasticSearch
Toward a Hybrid Storage
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Hybrid Storage
- Use each storage solution for what it does best
- SQL DB
  - store content in an ACID way
  - store & retrieve
  - queries needing ACID and MVCC
- elasticsearch
  - provide powerful and scalable queries
  - do the heavy lifting that the RDBMS cannot do
  - scoring, native full-text, aggregates
  - distributed search
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Route the query to the correct index depending on requirements
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1729747/logo-elastic.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1729751/postgresql-9.3-free-download.png)
Elasticsearch & Repository
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/91705f33-28d9-4d56-8c26-21444e87b447/image.png)
One query
Several possible backends
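The routing idea ("one query, several possible backends") can be sketched roughly as follows. This is only an illustration: the function name, its parameters and the backend labels are hypothetical and are not taken from the nuxeo-elasticsearch code.

```python
def choose_backend(needs_transactional_view: bool, index_available: bool = True) -> str:
    """Pick the backend for a query (hypothetical helper, for illustration only)."""
    # Queries that must see changes made inside the current transaction go to
    # the repository, because the index is only updated after commit.
    if needs_transactional_view:
        return "repository"
    # If no index is configured, the repository is the only option.
    if not index_available:
        return "repository"
    # Everything else (listings, full-text, facets) goes to the index.
    return "elasticsearch"


listing_backend = choose_backend(needs_transactional_view=False)
workflow_backend = choose_backend(needs_transactional_view=True)
```

The point of the sketch is that the caller states a requirement (consistency versus search power) and does not hard-code a storage engine.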
Performance results
- Fast indexing
  - No ACID constraints / No impedance issue
  - 3,500 documents/s when using SQL backend
  - 10,000 documents/s when using MongoDB
- Super query performance
  - query on term using inverted index
  - very efficient caching
  - native full text support & distributed architecture
  - 3,000 queries/s with 1 elasticsearch node
  - 6,000 queries/s with 2 elasticsearch nodes
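To give an intuition for "query on term using inverted index", here is a toy in-memory inverted index. It is a teaching sketch only (whitespace tokenization, no scoring, no analyzers) and not how Lucene is implemented; all names and sample documents are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased term to the set of ids of documents containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all_terms(index: dict[str, set[str]], terms: list[str]) -> set[str]:
    """Return ids of documents containing every term (intersection of postings)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


docs = {
    "doc1": "Sydney opera house",
    "doc2": "Sydney harbour bridge video",
    "doc3": "Paris opera",
}
index = build_inverted_index(docs)
result = search_all_terms(index, ["sydney", "opera"])  # {"doc1"}
```

A multi-criteria query becomes a set intersection over term lookups, so its cost depends on the size of the posting lists and not on scanning rows.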
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Some real-life feedback
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
- Customer: We are now testing the Nuxeo 6 stack in AWS. DB is PostgreSQL on a db.r3.8xlarge, which has 32 CPUs. Between 350 and 400 tps the DB CPU is maxed out.
- Nuxeo support: Please activate nuxeo-elasticsearch!
- Customer: We are now able to do about 1200 tps with almost 0 DB activity. Question though, Nuxeo and ES do not seem to be maxed out?
- Nuxeo support: It looks like you have some network congestion between your client and the servers.
- Customer: ...right... we have pushed past 1900 tps ... I think we are close to declaring success for this configuration ...
SQL vs ElasticSearch
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/73b7b66e-6975-4aa5-a9d7-2f677875f0d8/image.png)
![](https://www.lucidchart.com/publicSegments/view/1dedc6f0-4ae4-4201-b8ef-e81d0cb6f5da/image.png)
Scalability is simply of another order of magnitude
Scale out
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/3e753e46-9c2d-4afb-8bc2-c5c8c2b63b0d/image.png)
Unified Index on Sharded Repository
- Tested with 10 PgSQL databases
- 10 x 100 Million documents => 1 Billion documents
- 1 elasticsearch cluster
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/b6d19fbf-0e52-49c2-b218-ea6e2c570540/image.png)
Is this magic?
- For users
  - it really looks like magic
- For sales guys & solution architects
  - it is magic: it unleashes a lot of possibilities
  - performance is just one aspect
- For Nuxeo Core Dev team
  - it was almost magic: some integration work was needed
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Integrating ElasticSearch
Inside nuxeo-elasticsearch Plugin
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Challenges to address
- Keep index in sync with the repository
  - No transaction management
  - Do not lose anything
  - Without support for update
- Mitigate eventually consistent effect
  - Avoid displaying transient inconsistent state
- Handle security filtering
  - Without join
  - Without post-filtering
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/d4d40d26-4b50-485a-8457-552e3ad66b21/image.png)
![](http://www.stoimen.com/blog/wp-content/uploads/2010/07/access.jpg)
![](https://www.lucidchart.com/publicSegments/view/740c3f2a-32b9-4069-a031-93de203aec58/image.png)
Security Filtering
- Constraints
  - Filtering must be done at index level: no post-filtering
  - Join is not an option
    - cannot join with the DB or within Lucene (previously tested without success)
- Solution
  - index the ReadACL as part of the JSON Document
    - list of groups / users who can read the resource
  - automatically add a filter clause on ACL
- Consequences
  - Recursive indexing is needed
  - More pressure to maintain re-indexing processing
  - as a last resort: the Document security is checked by the repository anyway
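The "automatically add a filter clause on ACL" step could look roughly like the sketch below, which wraps any user query in an Elasticsearch `bool` query with a `terms` filter. The helper name is hypothetical, and the field name `ecm:acl` is an assumption about how the ReadACL is stored in the indexed JSON document; adapt it to the actual mapping.

```python
def with_read_acl_filter(query: dict, principals: list[str],
                         acl_field: str = "ecm:acl") -> dict:
    """Wrap `query` so that only documents readable by one of `principals` match.

    `acl_field` is assumed to hold the list of users / groups with read access.
    """
    return {
        "bool": {
            "must": [query],
            # A filter clause does not affect scoring and can be cached.
            "filter": [{"terms": {acl_field: principals}}],
        }
    }


user_query = {"match": {"dc:title": "sydney"}}
secured_query = with_read_acl_filter(user_query, ["alice", "members"])
```

Because the ACL travels with each indexed document, no join and no post-filtering is needed at query time; the price is the recursive re-indexing mentioned above whenever an ACL changes on a folder.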
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Safe Indexing Flow
- Do not try to make it transactional
  - Collect and de-duplicate Repository Events during Transaction
  - Wait for commit to be done at the repository level
    - then call elasticsearch
- Do not lose any update
  - run Indexing Tasks in a distributed Job infrastructure
    - Jobs should be persisted
    - Jobs should be retried
    - Jobs should be monitored
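The collect / de-duplicate / wait-for-commit flow can be sketched as below. The class and method names are hypothetical and the job queue is replaced by a plain callable; a real setup would hand the commands to a persistent, retried and monitored job infrastructure.

```python
class IndexingCollector:
    """Collects indexing commands during a transaction (illustrative sketch)."""

    def __init__(self, submit_job):
        self._submit_job = submit_job  # callable(doc_id, action) that enqueues a job
        self._pending = {}             # doc_id -> last action seen in this transaction

    def on_event(self, doc_id: str, action: str) -> None:
        # De-duplicate: many events on the same document collapse into one
        # command, and the last action wins.
        self._pending[doc_id] = action

    def after_commit(self) -> None:
        # Only once the repository has committed do we schedule indexing.
        for doc_id, action in self._pending.items():
            self._submit_job(doc_id, action)
        self._pending.clear()

    def after_rollback(self) -> None:
        # Nothing was persisted, so nothing must reach the index.
        self._pending.clear()


jobs = []
collector = IndexingCollector(lambda doc_id, action: jobs.append((doc_id, action)))
collector.on_event("doc1", "index")
collector.on_event("doc1", "index")  # duplicate event, collapsed
collector.on_event("doc2", "index")
collector.after_commit()
```

After `after_commit()`, `jobs` holds one command per document; before it, nothing has been sent to the index.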
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/d4d40d26-4b50-485a-8457-552e3ad66b21/image.png)
Async Indexing Flow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/c505c515-158d-4a3b-a7f5-a9eebafb23c9/image.png)
Mitigate Eventual Consistency
- In the code:
  - use case: need to see results from within the transaction
  - query directly on the repository
    - leverage ACID and MVCC of SQL repository
  - full-text search and facets are usually not needed by the code
- For the users:
  - use case: see changes in listings in "real time"
  - use pseudo-real time indexing
    - indexing actions triggered by UI threads are flagged
    - run as afterCompletion listener
    - refresh elasticsearch index
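The effect of the pseudo-real time path can be illustrated with a small simulation. `FakeIndex` is an in-memory stand-in that mimics the "indexed but not yet searchable until refresh" behaviour; it is not the Elasticsearch client, and the function and flag names are hypothetical.

```python
class FakeIndex:
    """In-memory stand-in for a near-real-time index (not the real client)."""

    def __init__(self):
        self._buffer = {}      # indexed but not yet visible to searches
        self._searchable = {}  # visible to searches

    def index(self, doc_id, doc):
        self._buffer[doc_id] = doc

    def refresh(self):
        self._searchable.update(self._buffer)
        self._buffer.clear()

    def search_ids(self):
        return sorted(self._searchable)


def index_after_completion(index, docs, triggered_by_ui):
    """Index committed documents; refresh right away only for UI-flagged work."""
    for doc_id, doc in docs.items():
        index.index(doc_id, doc)
    if triggered_by_ui:
        index.refresh()


idx = FakeIndex()
index_after_completion(idx, {"doc1": {"dc:title": "foo"}}, triggered_by_ui=False)
background_view = idx.search_ids()  # [] : not visible until the next refresh
index_after_completion(idx, {"doc2": {"dc:title": "bar"}}, triggered_by_ui=True)
ui_view = idx.search_ids()          # ["doc1", "doc2"] : refresh made both visible
```

Forcing a refresh on every write would hurt indexing throughput, which is presumably why only UI-triggered actions are flagged for it.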
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Pseudo-Sync Indexing Flow
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/617ccf3a-40ab-4414-a694-b18c95733028/image3.png)
Does this work ?
- Live for about 18 months now
- No missing sync issue
  - some customers asked for verification tools
  - but no problem was found
  - re-index in bulk mode is very fast anyway
- No consistency issues
  - good usage of hybrid query engines
- elasticsearch helped address several scaling challenges
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
but elasticsearch brings us much more than just scalability
Bonus from ElasticSearch
More than Raw Speed
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Leverage Aggregates
- Leverage elasticsearch aggregates
- integrate with the Query system (PageProvider)
- integrate with the Listing / UI model (ContentView)
- Makes it easy to build and configure faceted search
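As a rough illustration of what a facet boils down to at the Elasticsearch level, the sketch below builds a `terms` aggregation request body and reads the bucket counts from a response. The helper names, the field `ecm:primaryType` and the sample response are assumptions made for the example, not output from a real cluster or the PageProvider implementation.

```python
def terms_facet_request(query: dict, field: str, name: str = "facet",
                        size: int = 10) -> dict:
    """Build a facet-only search body: bucket counts for `field`, no hits."""
    return {
        "size": 0,  # skip the hits, only the aggregation is wanted here
        "query": query,
        "aggs": {name: {"terms": {"field": field, "size": size}}},
    }


def facet_counts(response: dict, name: str = "facet") -> dict:
    """Extract {bucket key: document count} from a search response."""
    buckets = response["aggregations"][name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


body = terms_facet_request({"match": {"dc:title": "sydney"}}, "ecm:primaryType")

# Hand-written response with the shape Elasticsearch returns for a terms aggregation.
sample_response = {
    "aggregations": {
        "facet": {"buckets": [{"key": "Picture", "doc_count": 12},
                              {"key": "Video", "doc_count": 3}]}
    }
}
counts = facet_counts(sample_response)
```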
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1728154/Screenshot_from_2015-09-16_09_19_58.png)
Advanced indexing
- Fine tuning of elasticsearch indexing
  - multi-language support using multiple analyzers and copy_to
  - compound fields created using groovy scripts
- Introduce elasticsearch hints into NXQL
  - select a specific elasticsearch index / analyzer
  - leverage elasticsearch operators
  - do geolocation search
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
-- Use an explicit Elasticsearch field
SELECT * FROM Document WHERE /*+ES: INDEX(dc:title.ngram) */ dc:title = 'foo'
-- Use ES operators not present in NXQL
SELECT * FROM Document WHERE /*+ES: OPERATOR(regex) */ dc:title = 's.*y'
SELECT * FROM Document WHERE /*+ES: OPERATOR(fuzzy) */ dc:title = 'zorkspaces'
-- Use ES for a geo query: match documents located near a point, using the geo_hash_cell operator
SELECT * FROM Document WHERE /*+ES: OPERATOR(geo_hash_cell)*/ osm:location IN ('40','-74','5')
leverage what comes for free with elasticsearch
Index Audit Trail with Elasticsearch
- Use elasticsearch to store & index Audit trail
  - all events are serialized in JSON and stored inside elasticsearch
- Unleash Audit system power
  - can store a lot of events
  - can store and query arbitrary JSON structure
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://www.lucidchart.com/publicSegments/view/e56e4b62-7ca3-4dab-a139-7a64b4424e29/image.png)
Elasticsearch Pass-Through
- Expose an HTTP pass-through API on top of Nuxeo integration
  - Integrate Authentication & Authorization
    - not all users can access workflow index
  - Integrate Security Filtering
    - activate data level security filtering
  - Expose "virtual index" via http
    - index + filter
- Use elasticsearch API related components on Nuxeo data
  - Documents + Audit log
  - With embedded security
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Easy real time data analytics on business data
Data Analytics with Elasticsearch
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1729212/DataViz1.png)
Queries on Documents + Audit: flexible reporting on workflows
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1729233/DataViz2.png)
Read Documents from Elasticsearch
- Full JSONDocument is stored in elasticsearch
  - required to be able to do fast re-indexing
- We can retrieve Documents from elasticsearch
  - execute full search & retrieve without touching the DB
- By controlling indexing we can use the elasticsearch index
  - as a persistent cache on top of the repository
  - as a staging area for queries
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
_source
Next steps
Leveraging Even More elasticsearch
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
Next steps
- Leverage elasticsearch percolator
  - push update on the nuxeo-drive clients
  - notify users about saved search
  - automatic categorization
- Search result highlighting
  - not sure why it is still not there ...
- Plug automatic denormalization
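The percolator idea is a reverse search: queries are stored, and each new document is matched against them. The sketch below shows that idea with plain Python predicates, for instance to find which saved searches should trigger a notification. It is an in-memory illustration only and does not use the Elasticsearch percolator API; the registry class, search names and sample document are all hypothetical.

```python
from typing import Callable

SavedSearch = Callable[[dict], bool]


class SavedSearchRegistry:
    """Stores saved searches and matches incoming documents against them."""

    def __init__(self):
        self._searches: dict[str, SavedSearch] = {}

    def register(self, name: str, predicate: SavedSearch) -> None:
        self._searches[name] = predicate

    def percolate(self, doc: dict) -> list[str]:
        """Return the names of all saved searches that match `doc`."""
        return sorted(name for name, matches in self._searches.items() if matches(doc))


registry = SavedSearchRegistry()
registry.register("alice:videos", lambda d: d.get("ecm:primaryType") == "Video")
registry.register("bob:sydney", lambda d: "sydney" in d.get("dc:title", "").lower())

new_doc = {"ecm:primaryType": "Video", "dc:title": "Sydney at night"}
to_notify = registry.percolate(new_doc)  # ["alice:videos", "bob:sydney"]
```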
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725193/nx-logo.png)
Any Questions ?
Thank You !
![](https://s3.amazonaws.com/media-p.slid.es/uploads/101047/images/1725189/nx-background.002.png)
https://github.com/nuxeo
http://www.nuxeo.com/careers/
Scaling the Document Repository with elasticsearch
By Thierry Delprat