An Adventure in Distributed Programming
About Me
Marten (Wiebe-Marten Wijnja)
- ~12 years of software development
- ~6 years of decentralized and distributed systems
- ~3 years of Elixir
- ~1.5 years of Resilia
- ~9 months of project Planga
Online: Qqwy
Planga: How this adventure started
- Seamless Instant Chat Integration
- "handles your chat so you don't have to"
- SaaS & FOSS
- Design Goals:
  - Simple to integrate
  - Chat should never break!
Planga: Distributed
Soon™
Talk Rationale:
Presentation Goals:
- High-Level Overview
- Lessons learned at Planga
Contents
- Crash Course Distributed Systems
- Tools for Distribution in Elixir
- Comparing Distributed Databases
- Planga: Choices + Future
1. Distributed Systems Crash Course
What is a Distributed System?
- Software running on multiple computers at once
- Reasons: Scalability, Fault Tolerance
- Need to communicate to agree about state
- This is hard!
These Things Are False:
- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn't change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
The Byzantine Generals Problem
Why Communication is Hard
Situations look the same!
No Reply!
What should the General do?
- cancel the attack but miss attack opportunity, or
- proceed, but risk uncoordinated attack?
Network Partition!
What should the Node do?
- cancel the operation and thus decrease availability, or
- proceed with the operation and thus risk inconsistency?
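Why the two situations look the same to the sender can be sketched in a few lines. This is a toy model in Python, not code from Planga; the `Outcome` enum and the `send` function are hypothetical names introduced only for illustration.

```python
from enum import Enum


class Outcome(Enum):
    DELIVERED = "delivered"
    REQUEST_LOST = "request lost"
    REPLY_LOST = "reply lost"


def send(outcome):
    """Simulate one message exchange.

    Returns (what the sender observes, whether the receiver got the message).
    """
    if outcome is Outcome.REQUEST_LOST:
        return None, False  # the message never arrived
    if outcome is Outcome.REPLY_LOST:
        return None, True   # the message arrived, the acknowledgement did not
    return "ack", True


lost_request = send(Outcome.REQUEST_LOST)
lost_reply = send(Outcome.REPLY_LOST)

# The sender sees "no reply" in both cases...
print(lost_request[0] == lost_reply[0])
# ...although the receiver's state differs.
print(lost_request[1] != lost_reply[1])
```

From the sender's side both failures are just a timeout, which is why it has to pick between cancelling and proceeding without knowing which one happened.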
CAP Theorem
CP vs AP?
- CP: cancel the operation and thus decrease availability, or
- AP: proceed with the operation and thus risk inconsistency?
CP: Critical data like bank account balances.
✔ No need to 'fix' state: Easier to work with!
✘ Needs lots of communication: Hard to scale
AP: Chat/Social Media feeds, Sensor data, etc.
✘ Needs to fix inconsistent states: Tricky!
✔ Little communication needed: Scalable
CP: Consensus
- 2-Phase-Commit/Paxos/Raft: Basically, 'Voting'
- Have to wait until more than half of the nodes are available
- Example: Distributed ACID Transactions
- Distributed Postgres / CockroachDB / Citus
- Distributed MongoDB
- FaunaDB
- BigTable
- VoltDB
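The "more than half" rule can be written down as a small check. This is a minimal sketch in Python that captures only the majority condition; real protocols such as Paxos and Raft add leader election, logs and retries on top. The function names `majority` and `can_commit` are made up for this example.

```python
def majority(cluster_size):
    """Smallest number of nodes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def can_commit(reachable_nodes, cluster_size):
    """A vote can only succeed if a majority of the nodes is reachable."""
    return reachable_nodes >= majority(cluster_size)


# In a 5-node cluster, 3 nodes can make progress but 2 cannot.
print(can_commit(3, 5), can_commit(2, 5))
```

Because two disjoint groups can never both contain a majority, at most one side of a network partition keeps accepting writes. The other side has to wait, which is the availability cost of CP.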
AP: Eventual Consistency
- Split Brain: Application needs to decide how to combine states again
- This can be painful/error-prone!
AP: Eventual Consistency: CRDTs!
Conflict-Free Replicated Data Types
Only these are supported:
- counters
- sets
- (nested) maps
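To make the counter case concrete, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs. It is written in Python for illustration and is not taken from any particular library; the node ids and function names are assumptions. Each node only increments its own slot, and merging takes the per-node maximum.

```python
def increment(counter, node_id, amount=1):
    """Return a new counter in which `node_id` has counted `amount` more."""
    updated = dict(counter)
    updated[node_id] = updated.get(node_id, 0) + amount
    return updated


def merge(a, b):
    """Combine two replicas by taking the per-node maximum."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


def value(counter):
    """The counter's total is the sum over all nodes."""
    return sum(counter.values())


# Two replicas diverge during a partition...
replica_a = increment(increment({}, "node_a"), "node_a")
replica_b = increment({}, "node_b", 3)

# ...and converge again, whichever side merges first.
print(merge(replica_a, replica_b) == merge(replica_b, replica_a))
print(value(merge(replica_a, replica_b)))
```

The merge is commutative, associative and idempotent, so replicas can exchange state in any order, any number of times, and still end up with the same value. That property is what removes the manual conflict resolution step.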
CAP Theorem
In Practice: CP vs AP choice is Non-Binary
- We'd like to decide per datatype (or even per field!)
- Most tools don't currently support this :-(
- Also Consider:
  - How does the system respond under normal operation? (Latency vs Consistency, PACELC)
Side Note: Sharding
- No communication between nodes necessary
- Great for scaling
- No fault tolerance
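A minimal sketch of hash-based sharding, in Python for illustration: each key is owned by exactly one shard, so nodes never need to talk to each other. The `shard_for` helper and the key format are hypothetical.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to one shard index in the range [0, shard_count)."""
    # A stable hash (unlike Python's built-in hash()) gives the same
    # answer on every node and after every restart.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Every node computes the same owner for a key without any coordination.
print(shard_for("conversation:42", 3) == shard_for("conversation:42", 3))
```

Since every key lives on a single shard, losing that shard means losing access to its keys. This is the missing fault tolerance from the list above, and it is why sharding is usually combined with replication.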
2. Tools for Distribution in Elixir/Erlang
- Multi-node clusters
- Transparent Message-passing!
- libcluster
- Partisan
- Phoenix.Presence / Phoenix.Tracker
- CRDTs! :-)
- GenServer.multi_call / GenServer.abcast
- Hot-Code upgrades
Your Application is Not Your Database
- Multiple ways of scaling:
  - More data?
  - More active users?
3. Distributed Databases Comparison
- AP:
  - Mnesia
  - Cassandra
  - CouchDB
  - Riak
Mnesia
- Erlang's built-in database
- Do It Yourself:
  - Split-Brain
  - Clustering
- Your DB is not your application
Cassandra
- Java-based
- Column-based structured DB
- SQL-like querying
- Unconfigurable, per-column Last-Write-Wins, based on timestamps
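The per-column Last-Write-Wins rule listed for Cassandra can be illustrated with a small merge function. This is a sketch of the idea in Python, not Cassandra's actual implementation. The row representation (column name mapped to a `(timestamp, value)` pair) and the tie-break on the value are assumptions made for this example.

```python
def merge_rows(a, b):
    """Merge two versions of a row, column by column.

    Each row maps a column name to a (timestamp, value) pair. The cell with
    the higher timestamp wins; on equal timestamps the larger value wins, so
    the result does not depend on the merge order.
    """
    merged = dict(a)
    for column, cell in b.items():
        if column not in merged or cell > merged[column]:
            merged[column] = cell
    return merged


row_on_node_1 = {"name": (1, "Ann"), "email": (5, "ann@example.com")}
row_on_node_2 = {"name": (3, "Anna"), "email": (2, "old@example.com")}

# The newest name comes from node 2, the newest email from node 1.
print(merge_rows(row_on_node_1, row_on_node_2))
```

The older write to each column is dropped silently, and which write counts as "older" depends on how well the nodes' clocks agree. That is the trade-off of a merge rule that cannot be configured.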
CouchDB
- Erlang-based
- Document Store
- JSON-based querying
- Document-based Vector-Clocks for synchronization
- Conflicts have to be checked/fixed manually
Riak
- Erlang-based
- K/V-store + CRDTs!
- Limited querying capabilities:
  - key-based range queries using '2i'
  - Solr, which lags ~1 second behind.
- Vector-Clocks for synchronization
- Conflicts can be resolved automatically
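How vector clocks tell "newer" apart from "conflicting" can be sketched as follows. This is a generic textbook-style comparison in Python, not Riak's internal code; the clock representation (node id mapped to an update count) and the `compare` function are assumptions for this example.

```python
def compare(a, b):
    """Compare two vector clocks, each mapping a node id to an update count.

    Returns "before", "after", "equal" or "concurrent".
    """
    nodes = a.keys() | b.keys()
    a_ahead = any(a.get(node, 0) > b.get(node, 0) for node in nodes)
    b_ahead = any(b.get(node, 0) > a.get(node, 0) for node in nodes)
    if a_ahead and b_ahead:
        return "concurrent"  # neither version has seen the other: a conflict
    if a_ahead:
        return "after"       # a has seen everything b has, and more
    if b_ahead:
        return "before"
    return "equal"


# One version descends from the other: the newer one can simply replace it.
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))
# Both sides wrote independently: the versions have to be merged.
print(compare({"n1": 2}, {"n2": 1}))
```

Only the "concurrent" case needs conflict resolution. When the stored value is a CRDT, that merge can happen automatically.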
General Challenges with Distributed Databases
- By their nature, NoSQL
- Makes adoption more difficult
- In general, more difficult to query
- Currently, no mature Elixir adapters
Choices/Solutions for Planga
- AP over CP
- Go with Riak...
- ... and build a Riak Ecto3 Adapter while we're at it
- Snowflakes
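Snowflake IDs let every node generate unique, roughly time-ordered identifiers without coordinating. The sketch below, in Python, assumes the widely used layout of 41 timestamp bits, 10 node bits and 12 sequence bits. The talk does not say which layout Planga uses, and the function names and the `epoch_ms` parameter are hypothetical.

```python
NODE_BITS = 10      # assumed layout: up to 1024 nodes
SEQUENCE_BITS = 12  # assumed layout: up to 4096 ids per node per millisecond


def make_id(timestamp_ms, node_id, sequence, epoch_ms=0):
    """Pack a timestamp, node id and sequence number into one integer."""
    if timestamp_ms < epoch_ms:
        raise ValueError("timestamp lies before the epoch")
    if not 0 <= node_id < (1 << NODE_BITS):
        raise ValueError("node_id out of range")
    if not 0 <= sequence < (1 << SEQUENCE_BITS):
        raise ValueError("sequence out of range")
    elapsed = timestamp_ms - epoch_ms
    return (elapsed << (NODE_BITS + SEQUENCE_BITS)) | (node_id << SEQUENCE_BITS) | sequence


def split_id(snowflake, epoch_ms=0):
    """Unpack an id into (timestamp_ms, node_id, sequence)."""
    sequence = snowflake & ((1 << SEQUENCE_BITS) - 1)
    node_id = (snowflake >> SEQUENCE_BITS) & ((1 << NODE_BITS) - 1)
    timestamp_ms = (snowflake >> (NODE_BITS + SEQUENCE_BITS)) + epoch_ms
    return timestamp_ms, node_id, sequence


# Ids created later sort higher, regardless of which node made them.
print(make_id(1000, 1, 0) < make_id(1001, 0, 0))
```

Because the timestamp occupies the highest bits, sorting by id approximates sorting by creation time. This fits an AP store, where no single node can hand out sequential ids.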
Planga until recently
- One node
- Mnesia as DB
- Phoenix.PubSub to connect users
Planga Soon™
- 3 App nodes
- 3 Riak nodes
- Riak as DB
- Phoenix Presence for ephemeral state
- Frontend pings all app nodes and connects to whichever is currently fastest
Potential future plans
- Adding Nebulex as in-app distributed cache in front of Riak?
- More nodes in multiple regions?
- Frontend as Progressive Web App to deal with spotty internet connections?
Summary; Closing Remarks
- Distributed Applications are Hard
- Elixir makes it reasonably bearable
- Tooling can be (and is being!) improved
- Projects to be aware of:
  - LASP/Partisan
  - Phoenix.PubSub (Firenest)
  - libcluster
  - RiakEcto3 :-)
- Shoutouts
  - Martin Sumner
  - Ecto Team
  - the amazing Elixir Community
Thank You!
- Try Planga
- Read the code (and criticize it)
- Questions?
An Adventure in Distributed Programming
By qqwy
Talk given at ElixirConf.EU 2019 (http://www.elixirconf.eu/elixirconfeu2019/wiebe-marten-wijnja).