The Design and Implementation of

XML Prague
2019-02-08

Adam Retter

adam@evolvedbinary.com

@adamretter

Why did we start in 2014?

Personal Concerns
- Open Source NXDs problems/limitations are not being addresed
- Commercial NXDs are Expensive and not Open Source
- New NoSQL (JSON) document db are out-innovating us
- 10 Years invested in Open Source NXD, unhappy with progress
Commercial Concerns from Customers
- Help! Our Open Source NXD sometimes:
  - Crashes and Corrupts the database
  - Stops responding
  - Won't Scale with new servers/users

Reported by Users
- Stability - Responsiveness / Deadlocks
- Operational Reliability - Backup / Corruption / Fail-over
- Missing Feature - Key/Value Metadata for Documents
- Missing Feature - Triple/Graph linking for inter/cross-document references
Recognised by Developers
- Correctness - Crash Recovery / Deadlock Avoidance / Deadlock Resolution / ACID
- Performance - Reducing Contention / Avoiding System Maintenance mutex
- Missing Features - Multi Model / Clustering

OS NXD - Known Issues ~2014

Hold My Beer...

I Gotta
Fix This!

Project Health?
- Issues - Rate of decay, i.e. Open vs Closed over time
- Attracts new contributors?
- Attracts new and varied users?
- How do contributors pay their bills?
Contributor Constraints?
- How long to get PRs reviewed?
- Open to radical changes? Incremental vs Big-bang?
- Other developers with time/knowledge to review PRs
License
- Business friendly? CLA?
Reputation - Perceived or otherwise

Can we fix an existing NXD?

Project "Granite"
- Research and Development
- Primary Focus on Correctness and Stability
  - Never become unresponsive
  - Never crash
  - Never lose data or corrupt the database
Should become Open Source
- Should be appealing to Commercial enterprises
- Open Source license choice(s) vs Revenue opportunities
Don't reinvent wheels!
- Reuse - Faster time to market
- Developers know eXist-db... Fork it!

Time to build something new

Why?
- We don't trust it's correctness
- Old and Creaky? - (dbXML ~2001)
  - Improved with caching and journaling
- Not Scalable - single-threaded read/write
- Classic B+ Tree
- Why not fix it?
  - Newer/better algorthms exist - B-link Tree, Bw Tree, etc.
  - We want a giant-leap, not an incremental improvement

First... Replace eXist-db's Storage Engine

How does a NXD Store an XML Document Anyway?

...Shredding!

<fruits>
    <fruit>
        <name>Apple</name>
        <colour>Green</colour>
    </fruit>
    <fruit>
        <name>Banana</name>
        <colour>Yellow</colour>
    </fruit>
</fruits>

Given some very simple XML
- fruits.xml

1. Number the tree (DLN)

fruits.xml: docId=6

2. Extract Symbols

3. Store DOM

4. Store Symbols
+ Collection Entry

We won't develop our own!
- It's hard (to get correct)!
- Other well-resourced projects available for reuse
- We would rather focus on the larger DBMS
Choosing a suitable engine
- Other Java Database B+ Trees are unsuitable
  - Examined - Apache Derby, H2, HSQLDB, and Neo4j
- Few Open Source pure Storage Engines in Java
  - Discounted MapDB - known issues
- Identified 3 possibilities:
  - LMDB - B-Tree written in C
  - ForestDB - HB+-Trie written in C++11
  - RocksDB - LSM written in C++14

New Storage Engine

Fork of Google's LevelDB
- Performance Improvements for concurrency and I/O
- Optimised for SSDs
Large Open Source community with commercial interests
- e.g. Facebook, AirBnb, LinkedIn, NetFlix, Uber, Yahoo, etc.
Rich feature set
- LSM-tree (Log Structured Merge Tree)
- MVCC (Multi-Version Concurrency Control)
- ACID
  - Built-in Atomicity and Durability
  - Offers primitives for building Consistency and Isolation
- Column Families

Why we Adopted RocksDB

How do we store XML into RocksDB?

...Mooor Shredding!

We still:
- Number the tree with DLN
- Extract Symbols
- Shred each node into a key/value pair
We do NOT use eXist's B+Tree or Variable Record Store
Instead, we use RocksDB Column Families
- Each is an LSM-tree. Share a WAL
- For each component
  - Persisent DOM Column Family - XML_DOM_STORE
  - Symbol Table Column Families - SYMBOL_STORE, NAME_INDEX, NAME_ID_INDEX
  - Collection Store Column Family - COLLECTION_STORE

Storing XML in RocksDB

RocksDB's LSM Tree

XML in RocksDB's SSTable File

eXist-db lacks strong transaction semantics
- Txn - Log commit/abort just for Crash Recovery
- Mostly just the Durability of ACID
- Isolation level: ~Read Uncommitted
Our Transactions
- RocksDB ensures Atomicity and Durability
- We add Consistency and Isolation
- Begin Transaction creates a db Snapshot
- Each Transaction has an in-memory Write Batch
  - Write - only to in-memory Write Batch
  - Read - try in-memory write batch, fallback to Snapshot
- Isolation level: >= Snapshot Isolation

ACID Transactions

Each public API call is a transaction
- e.g. REST / WebDAV / RESTXQ / XML-RPC / XML:DB
- auto-abort on exception
- auto-commit when the call returns data
Each XQuery is a transaction
- auto-abort if the query raises an error
- auto-commit when the query completes
- Begin Transaction creates a db Snapshot
- XQuery 3.0 try/catch
  - try - begins a new sub-transaction
  - catch - auto-abort, the operations in the try body are undone
  - Sub-transactions can be nested, just like try/catch

Transactions for Users

Key/Value Model
- Metadata for Documents and Collections
- Searchable from XQuery
Online Backup
- Lock free
- Checkpoint Backup
- Full Document Export
UUID Locators
- Persist across backups and nodes
BLOB Store
- Deduplication

Other new features include...

Many Changes Upstreamed
- eXist-db - Locking, BLOB Store, Concurrency, etc.
- RocksDB - further Java APIs and improved JNI performance
With hindsight...
- Wouldn't fork eXist-db...
  - Too much time spent discovering and fixing bugs
  - Start with a green field, add eXist-db compatible APIs
- Much more work than anticipated
Today, new storage engines
- FoundationDB / BadgerDB / FASTER

Reflections

Benchmarking and Performance
JSON Native
Virtualised Collections
Distributed Cluster
Graph Model
XQuery compilation to native
CPU/GPU code

The Design and Implementation of

Adam Retter

adam@evolvedbinary.com

@adamretter

Why did we start in 2014?

OS NXD - Known Issues ~2014

Hold My Beer...

I Gotta Fix This!

Can we fix an existing NXD?

Time to build something new

First... Replace eXist-db's Storage Engine

How does a NXD Store an XML Document Anyway?

...Shredding!

Given some very simple XML - fruits.xml

1. Number the tree (DLN)

2. Extract Symbols

3. Store DOM

4. Store Symbols + Collection Entry

New Storage Engine

Why we Adopted RocksDB

How do we store XML into RocksDB?

...Mooor Shredding!

Storing XML in RocksDB

RocksDB's LSM Tree

XML in RocksDB's SSTable File

ACID Transactions

Transactions for Users

Other new features include...

Reflections

On the roadmap...

I Gotta
Fix This!

Given some very simple XML
- fruits.xml

4. Store Symbols
+ Collection Entry