Thierry Delprat
What We Do and What Problems We Try to Solve
Track game builds
Electronic Flight Bags
Central repository for Models
Food industry PLM
https://github.com/nuxeo
2006: Nuxeo Repository is based on ZODB (Python / Zope based)
2007: Nuxeo Platform 5.1 - Apache Jackrabbit (JCR based)
2009: Nuxeo 5.2 - Nuxeo VCS - pure SQL
2013/2014: Nuxeo 5.9 - Nuxeo DBS + Elasticsearch
Object DB
Document DB
SQL DB
Not only SQL
Understanding the motivations
ACID
XA
SELECT "hierarchy"."id" AS "_C1" FROM "hierarchy"
JOIN "fulltext" ON "fulltext"."id" = "hierarchy"."id"
LEFT JOIN "misc" "_F1" ON "hierarchy"."id" = "_F1"."id"
LEFT JOIN "dublincore" "_F2" ON "hierarchy"."id" = "_F2"."id"
WHERE
("hierarchy"."primarytype" IN ('Video', 'Picture', 'File'))
AND ((TO_TSQUERY('english', 'sydney')
@@NX_TO_TSVECTOR("fulltext"."fulltext")))
AND ("hierarchy"."isversion" IS NULL)
AND ("_F1"."lifecyclestate" <> 'deleted')
AND ("_F2"."created" IS NOT NULL )
ORDER BY "_F2"."created" DESC
LIMIT 201 OFFSET 0;
Some types of queries simply cannot be fast in SQL
Not Only SQL
Good candidate for the Repository & Audit Trail
Fast indexing
No ACID constraints / No impedance issue
Append-only index
Super query performance
queries on terms use the inverted index
very efficient caching
native full text support & distributed architecture
Good candidate for the Search & Audit Trail
... argh!
let's see the technical details
MongoDB and Elasticsearch at work
{
"ecm:id":"52a7352b-041e-49ed-8676-328ce90cc103",
"ecm:primaryType":"MyFile",
"ecm:majorVersion":NumberLong(2),
"ecm:minorVersion":NumberLong(0),
"dc:title":"My Document",
"dc:contributors":[ "bob", "pete", "mary" ],
"dc:created": ISODate("2014-07-03T12:15:07+0200"),
...
"cust:primaryAddress":{
"street":"1 rue René Clair", "zip":"75018", "city":"Paris", "country":"France"},
"files:files":[
{ "name":"doc.txt", "length":1234, "mime-type":"plain/text",
"data":"0111fefdc8b14738067e54f30e568115"
},
{
"name":"doc.pdf", "length":29344, "mime-type":"application/pdf",
"data":"20f42df3221d61cb3e6ab8916b248216"
}
],
"ecm:acp":[
{
name:"local",
acl:[ { "grant":false, "perm":"Write", "user":"bob"},
{ "grant":true, "perm":"Read", "user":"members" } ]
}]
...
}
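Because the whole document, including multi-valued and complex properties, lives in a single MongoDB document, querying them needs no join. A minimal sketch in the MongoDB shell, assuming the same default collection and field names as in the example above:

// Documents where "bob" is one of the contributors: the array is queried in place,
// where a SQL mapping typically needs a join against a separate list table.
db.default.find({ "dc:contributors": "bob" })

// Dot notation reaches inside a complex property.
db.default.find({ "cust:primaryAddress.city": "Paris" })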
ecm:parentId
ecm:ancestorIds
{ ... "ecm:parentId" : "3d7efffe-e36b-44bd-8d2e-d8a70c233e9d",
"ecm:ancestorIds" : [ "00000000-0000-0000-0000-000000000000",
"4f5c0e28-86cf-47b3-8269-2db2d8055848",
"3d7efffe-e36b-44bd-8d2e-d8a70c233e9d" ] ...}
ecm:racl: ["Management", "Supervisors", "bob"]
{... "ecm:acp":[ {
name:"local",
acl:[ { "grant":false, "perm":"Write", "user":"bob"},
{ "grant":true, "perm":"Read", "user":"members" } ]}] ...}
db.default.find({
  $and: [
    { "dc:title": { $in: ["Workspaces", "Sections"] } },
    { "ecm:racl": { $in: ["bob", "members", "Everyone"] } }
  ]
})
SELECT * FROM Document WHERE dc:title = 'Sections' OR dc:title = 'Workspaces'
Find a way to mitigate consistency issues
Transactions cannot span multiple documents
routing
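The saving grace is that one Nuxeo document maps to exactly one MongoDB document, so a unit of change never spans documents and every update is atomic on its own. A hedged sketch of a conditional single-document update, reusing the field names from the example above; the version check shown is only illustrative, not Nuxeo's actual concurrency mechanism:

// Update the title only if the document is still at the version we read,
// relying on MongoDB's per-document atomicity instead of a transaction.
db.default.update(
  { "ecm:id": "52a7352b-041e-49ed-8676-328ce90cc103",
    "ecm:majorVersion": NumberLong(2) },
  { $set: { "dc:title": "My Document (updated)" } }
)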
now what?
Store according to use cases
Does not impact application code: this can be a deployment choice!
SQL DB
store content in an ACID way
strong schema
scalability issues for queries & storage
MongoDB
scale CRUD operations
store content in a BASE way, schema-less
queries are not really more scalable
elasticsearch
powerful and scalable queries
flexible schema
asynchronous storage
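For comparison with the generated SQL shown earlier, the same kind of search (full text on "sydney", a security filter, newest first) maps to a single Elasticsearch query against the inverted index. A minimal sketch of the query DSL, reusing the field names from the document examples above as an assumption; the exact Nuxeo index mapping and DSL flavor depend on the Elasticsearch version:

{
  "query": {
    "bool": {
      "must":   { "match": { "ecm:fulltext": "sydney" } },
      "filter": { "terms": { "ecm:racl": ["bob", "members", "Everyone"] } }
    }
  },
  "sort": [ { "dc:created": { "order": "desc" } } ],
  "from": 0,
  "size": 201
}

The index is fed asynchronously from the repository, which stays the source of truth.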
Fast indexing
3,500 documents/s when using SQL backend
10,000 documents/s when using MongoDB
Scalability of queries
Scale out
3,000 queries/s with 1 elasticsearch node
6,000 queries/s with 2 elasticsearch nodes
Customer: We are now testing the Nuxeo 6 stack in AWS. The DB is PostgreSQL on a db.r3.8xlarge, which has 32 vCPUs. Between 350 and 400 tps, the DB CPU is maxed out.
Nuxeo support: Please activate nuxeo-elasticsearch!
Customer: We are now able to do about 1,200 tps with almost no DB activity. One question though: Nuxeo and ES do not seem to be maxed out?
Nuxeo support: It looks like you have some network congestion between your client and the servers.
Customer: ...right... we have pushed past 1,900 tps... I think we are close to declaring success for this configuration...
we keep sync indexing "inside the repository backend"
SQL DB collapses (on commodity hardware)
MongoDB handles the volume
Read & Write Operations
are competing
Write Operations
are not blocked
c4.xlarge (Nuxeo)
c4.2xlarge (DB)
[Benchmark chart: SQL, SQL with tuning, SQL on commodity hardware: 7x faster]
Processing large document sets is an issue with SQL
Side effects of the impedance mismatch
Ex: Process 100,000 documents
9,500 documents/s with MongoDB (mmapv1): 13x
11,500 documents/s with MongoDB (WiredTiger): 15x
lazy loading
cache thrashing
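With the SQL mapping, a pass over 100,000 documents keeps reassembling each document from several tables (lazy loading) and keeps evicting the caches (cache thrashing). With MongoDB the same pass is one cursor scan returning complete documents; a minimal sketch in the MongoDB shell, where the filter and the processing are placeholders:

// Single scan: each returned document is already complete,
// so there is no per-document lazy fetch.
db.default.find({ "ecm:primaryType": "MyFile" }).forEach(function (doc) {
  // ... process doc ...
});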
Real life project choosing Nuxeo + MongoDB
good use case for MongoDB
want to use MongoDB
lots of data + lots of updates
Sample use case:
Press Agency
production system
mixed requirements
Going further with NoSQL
Thank You !
https://github.com/nuxeo
http://www.nuxeo.com/careers/