An Adventure in Distributed Programming
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
About Me
Marten (Wiebe-Marten Wijnja)
- ~12 years of software development
- ~6 years of decentralized and distributed systems
- ~3 years of Elixir
- ~1.5 years of Resilia
- ~9 months of project Planga
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5497910/wm_deepti_smudged.png)
Online: Qqwy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5509756/logo_inverse_slogan.png)
Planga: How this adventure started
- Seamless Instant Chat Integration
- "handles your chat so you don't have to"
- SaaS & FOSS
- Design Goals:
- Simple to integrate
- Chat should never break!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5993198/drake.png)
Planga: Distributed
Soon™
Talk Rationale:
Presentation Goals:
- High-Level Overview
- Lessons learned at Planga
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5914149/pasted-from-clipboard.png)
Contents
- Crash Course Distributed Systems
- Tools for Distribution in Elixir
- Comparing Distributed Databases
- Planga: Choices + Future
1. Distributed Systems Crash Course
What is a Distributed System?
- Software running on multiple computers at once
- Reasons: Scalability, Fault Tolerance
- Need to communicate to agree about state
- This is hard!
These Things Are False:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
The Byzantine Generals Problem
Why Communication is Hard
Situations Look the Same!
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5913647/pasted-from-clipboard.png)
No Reply!
What should the General do?
- cancel the attack but miss attack opportunity, or
- proceed, but risk uncoordinated attack?
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5913647/pasted-from-clipboard.png)
Network Partition!
What should the Node do?
- cancel the operation and thus decrease availability, or
- proceed with the operation and thus risk inconsistency?
CAP Theorem
![](https://s3.amazonaws.com/media-p.slid.es/uploads/727274/images/5914223/pasted-from-clipboard.png)
CP vs AP?
- CP: cancel the operation and thus decrease availability, or
- AP: proceed with the operation and thus risk inconsistency?
CP: Critical data like bank account balances.
✔ No need to 'fix' state: Easier to work with!
✘ Needs lots of communication: Hard to scale
AP: Chat/Social Media feeds, Sensor data, etc.
✘ Needs to fix inconsistent states: Tricky!
✔ Little communication needed: Scalable
CP: Consensus
- 2-Phase-Commit/Paxos/Raft: Basically, 'Voting'
- Have to wait until more than half of the nodes are available
- Example: Distributed ACID Transactions
- Distributed Postgres / CockroachDB / Citus
- Distributed MongoDB
- FaunaDB
- BigTable
- VoltDB
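The "more than half" rule for voting can be sketched as a simple majority check. This is only an illustration of the idea, not how Paxos or Raft are actually implemented; `has_quorum` and `cp_write` are hypothetical names, and the "cluster" is just a local dict.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True when strictly more than half of the cluster is reachable."""
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return reachable_nodes > cluster_size // 2


def cp_write(store: dict, key: str, value, reachable_nodes: int, cluster_size: int) -> bool:
    """CP-style write: refuse the operation when no majority can be reached.

    `store` stands in for the replicated state; a real system would also
    have to replicate the value to the other nodes.
    """
    if not has_quorum(reachable_nodes, cluster_size):
        return False  # give up availability to stay consistent
    store[key] = value
    return True
```

With 3 nodes, a write still succeeds when 2 are reachable. With 4 nodes, 2 reachable nodes are not a majority, so the write is refused.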
AP: Eventual Consistency
- Split Brain: Application needs to decide how to combine states again
- This can be painful/error-prone!
AP: Eventual Consistency with CRDTs!
Conflict-free Replicated Data Types
Only these data types are supported:
- counters
- sets
- (nested) maps
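To make the counter case concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. The class and method names are illustrative and not taken from any particular library. Each node only increments its own slot, and merging takes the per-node maximum, so merges can happen in any order and any number of times.

```python
class GCounter:
    """Grow-only counter CRDT: one slot per node, merged by per-node maximum."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> number of increments seen from that node

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("a G-Counter can only grow")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the maximum is commutative, associative and idempotent,
        # which is what lets replicas converge without coordination.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

Two replicas that were incremented independently during a split brain converge to the same value once they have merged each other's state.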
In Practice: CP vs AP choice is Non-Binary
- We'd like to decide per datatype (or even per field!)
- Most tools don't currently support this :-(
- Also Consider:
- How does the system behave under normal operation? (Latency vs Consistency, PACELC)
Side Note: Sharding
- No communication between nodes necessary
- Great for scaling
- No fault tolerance
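A minimal sketch of hash-based sharding, assuming a fixed number of shards. `shard_for` is a hypothetical helper, not part of any library. Every key maps to exactly one shard, so nodes never need to talk to each other, but if that one shard is down its keys are unavailable.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to one shard index in the range 0..shard_count-1."""
    if shard_count <= 0:
        raise ValueError("shard_count must be positive")
    # Use a stable hash: Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count
```

Note that with plain modulo hashing, changing `shard_count` moves most keys to a different shard. Consistent hashing is the usual way to reduce that, and is left out of this sketch.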
2. Tools for Distribution in Elixir/Erlang
- Multi-node clusters
- Transparent Message-passing!
- libcluster
- Partisan
- Phoenix.Presence / Phoenix.Tracker
- CRDTs! :-)
- GenServer.multi_call / GenServer.abcast
- Hot-Code upgrades
Your Application is Not Your Database
- Multiple ways of scaling:
- More data?
- More active users?
3. Distributed Databases Comparison
- AP:
- Mnesia
- Cassandra
- CouchDB
- Riak
Mnesia
- Erlang's built-in database
- Do It Yourself:
- Split-Brain
- Clustering
- Your DB is not your application
Cassandra
- Java-based
- Column-based structured DB
- SQL-like querying
- Unconfigurable, per-column Last-Write-Wins, based on timestamps
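Per-column Last-Write-Wins can be sketched as follows. This is an illustration of the idea only, not Cassandra's actual implementation. The `Row` layout and `merge_lww` are assumptions made for this example, and keeping the first replica's value on a timestamp tie is an arbitrary choice of the sketch.

```python
from typing import Any, Dict, Tuple

# Hypothetical row layout: column name -> (write timestamp, value)
Row = Dict[str, Tuple[float, Any]]


def merge_lww(a: Row, b: Row) -> Row:
    """Merge two replicas of a row, keeping the newest write per column."""
    merged = dict(a)
    for column, (timestamp, value) in b.items():
        if column not in merged or timestamp > merged[column][0]:
            merged[column] = (timestamp, value)
    return merged
```

Because each column is resolved separately, the merged row can mix columns from both replicas. Any write with an older timestamp is silently dropped, which is why clock skew between nodes matters for this strategy.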
CouchDB
- Erlang-based
- Document Store
- JSON-based querying
- Document-based Vector-Clocks for synchronization
- Conflicts have to be checked/fixed manually
Riak
- Erlang-based
- K/V-store + CRDTs!
- Limited querying capabilities:
- key-based range queries using '2i'
- Solr, which lags ~1 second behind.
- Vector-Clocks for synchronization
- Conflicts can be resolved automatically
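Since vector clocks come up for synchronization, here is a minimal sketch of how two of them can be compared. It shows the general technique only and is not the code of any specific database. A clock is assumed to be a dict from node id to a counter, and `compare` is a hypothetical helper.

```python
from typing import Dict


def compare(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Return how clock `a` relates to clock `b`.

    "before" / "after": one version descends from the other, so the newer wins.
    "concurrent": neither has seen the other's writes, which is a conflict.
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

Only the "concurrent" case needs conflict resolution, either manually by the application or automatically, for example by merging CRDT values.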
General Challenges with Distributed Databases
- By their nature, NoSQL
- Makes adoption more difficult
- In general, more difficult to query
- Currently, no mature Elixir adapters
Choices/Solutions for Planga
- AP over CP
- Go with Riak...
- ... and build a Riak Ecto3 Adapter while we're at it
- Snowflakes
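Snowflakes are roughly time-ordered unique IDs that every node can generate without coordination. The sketch below assumes the commonly used Twitter-style layout (41 bits of milliseconds since a custom epoch, 10 bits of node id, 12 bits of sequence number). The exact layout Planga uses is not given here, and `EPOCH_MS` and `SnowflakeGenerator` are hypothetical names.

```python
import time

# Assumed custom epoch: 2019-01-01T00:00:00Z, in milliseconds.
EPOCH_MS = 1_546_300_800_000


class SnowflakeGenerator:
    """Generates 64-bit IDs: timestamp (41 bits) | node id (10) | sequence (12)."""

    def __init__(self, node_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < 1024:
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self.clock = clock  # injectable so the generator can be tested
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                # 4096 IDs used up in this millisecond: wait for the next one.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - EPOCH_MS) << 22) | (self.node_id << 12) | self.sequence
```

IDs from one node are strictly increasing. IDs from different nodes never collide as long as each node has its own `node_id`, and they sort by creation time up to clock differences between nodes.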
Planga until recently
- One node
- Mnesia as DB
- Phoenix.PubSub to connect users
Planga Soon™
- 3 App nodes
- 3 Riak nodes
- Riak as DB
- Phoenix Presence for ephemeral state
- Frontend pings all app nodes and connects to the one that is currently fastest
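The selection step of "ping all nodes, connect to the fastest" can be sketched like this. The real frontend would run in the browser; Python is used here only to show the logic. The node names and `fastest_node` are hypothetical, and a latency of `None` is assumed to mean the node did not answer.

```python
from typing import Dict, Optional


def fastest_node(latencies_ms: Dict[str, Optional[float]]) -> str:
    """Pick the reachable node with the lowest measured round-trip time."""
    reachable = {node: ms for node, ms in latencies_ms.items() if ms is not None}
    if not reachable:
        raise ConnectionError("no app node reachable")
    return min(reachable, key=reachable.get)
```

For example, `fastest_node({"app1": 80.0, "app2": 35.5, "app3": None})` picks `"app2"`. Unreachable nodes are skipped, so the same check doubles as a simple failover.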
Potential future plans
- Adding Nebulex as in-app distributed cache in front of Riak?
- More nodes in multiple regions?
- Frontend as Progressive Web App to deal with spotty internet connections?
Summary; Closing Remarks
- Distributed Applications are Hard
- Elixir makes it reasonably bearable
- Tooling can be (and is being!) improved
- Projects to be aware of:
- LASP/Partisan
- Phoenix.PubSub (Firenest)
- libcluster
- RiakEcto3 :-)
- Shoutouts
- Martin Sumner
- Ecto Team
- the amazing Elixir Community
Thank You!
- Try Planga
- Read the code (and criticize it)
- Questions?
An Adventure in Distributed Programming
By qqwy
Talk given at ElixirConf.EU 2019 (http://www.elixirconf.eu/elixirconfeu2019/wiebe-marten-wijnja).