About Me

Marten (Wiebe-Marten Wijnja)

~12 years of software development
~6 years of decentralized and distributed systems
~3 years of Elixir
~1.5 years of Resilia
~9 months of project Planga

Online: Qqwy

Planga: How this adventure started

Seamless Instant Chat Integration
"handles your chat so you don't have to"
SaaS & FOSS
Design Goals:
- Simple to integrate
- Chat should never break!

Planga: Distributed

Soon™

Michal Muskala - Getting distributed with Firenest - Code BEAM Lite Berlin 18

Talk Rationale:

Presentation Goals:

High-Level Overview
Lessons learned at Planga

1. Distributed Systems Crash Course

What is a Distributed System?

Software running on multiple computers at once
- Reason: Scalability, Fault Tolerance
- Need to communicate to agree about state
  - This is hard!

These Things Are False:

The network is reliable.
Latency is zero.
Bandwidth is infinite.
The network is secure.
Topology doesn't change.
There is one administrator.
Transport cost is zero.
The network is homogeneous.

The Byzantine Generals Problem

Why Communication is Hard

Situations look the same!

No Reply!

What should the General do?

cancel the attack but miss the opportunity to attack, or

proceed, but risk uncoordinated attack?

Network Partition!

What should the Node do?

cancel the operation and thus decrease availability, or

proceed with the operation and thus risk inconsistency?

CAP Theorem

http://blog.thislongrun.com/2015/04/the-unclear-cp-vs-ca-case-in-cap.html

https://codahale.com/you-cant-sacrifice-partition-tolerance/

CP vs AP?

CP: cancel the operation and thus decrease availability, or
AP: proceed with the operation and thus risk inconsistency?
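
The choice above can be sketched as a toy replica that has been cut off from some of its peers. This is a minimal illustration under stated assumptions, not how any particular database behaves: the `Replica` class, its `mode` switch, and the `reachable_peers` count are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised by a CP replica that refuses to act without a majority."""


@dataclass
class Replica:
    mode: str             # "CP" or "AP" (hypothetical switch for this sketch)
    cluster_size: int     # total number of replicas, including this one
    reachable_peers: int  # peers this replica can currently talk to
    data: dict = field(default_factory=dict)

    def has_majority(self) -> bool:
        # Count ourselves plus the peers we can still reach.
        return (self.reachable_peers + 1) > self.cluster_size // 2

    def write(self, key, value) -> str:
        if self.mode == "CP" and not self.has_majority():
            # CP: cancel the operation and give up availability.
            raise Unavailable("partitioned from the majority")
        # AP (or a healthy CP replica): accept the write locally.
        # In AP mode this may diverge from writes accepted elsewhere.
        self.data[key] = value
        return "ok"
```

An isolated replica in a three-node cluster raises `Unavailable` in CP mode, while in AP mode it accepts the write and leaves reconciliation for later.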

CP: Critical data like bank account balances.

✔ No need to 'fix' state: Easier to work with!

✘ Needs lots of communication: Hard to scale

AP: Chat/Social Media feeds, Sensor data, etc.

✘ Needs to fix inconsistent states: Tricky!

✔ Little communication needed: Scalable

CP: Consensus

2-Phase-Commit/Paxos/Raft: Basically, 'Voting'
Have to wait until more than half of the nodes are available
Example: Distributed ACID Transactions
- Distributed Postgres / CockroachDB / Citus
- Distributed MongoDB
- FaunaDB
- BigTable
- VoltDB
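
The "more than half" rule can be sketched as a simple vote count. This is only the majority arithmetic, assuming a fixed cluster size; real Paxos and Raft implementations add leaders, terms, and replicated logs on top of it, and two-phase commit needs every participant to agree rather than a majority. The function names `majority` and `decide` are made up for this sketch.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of nodes that is more than half of the cluster."""
    return cluster_size // 2 + 1


def decide(votes: dict, cluster_size: int) -> str:
    """Decide on a proposal from the votes received so far.

    `votes` maps node name -> True (accept) or False (reject).
    Nodes that are unreachable simply do not appear in `votes`.
    """
    accepted = sum(1 for vote in votes.values() if vote)
    if accepted >= majority(cluster_size):
        return "commit"
    # Not enough acceptances yet: the operation has to wait.
    return "wait"
```

With three nodes, two acceptances are enough to commit. With four nodes, two are not, which is why clusters are usually given an odd number of nodes.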

AP: Eventual Consistency

Split Brain: Application needs to decide how to combine states again
This can be painful/error-prone!

AP: Eventual Consistency with CRDTs!

Conflict-free Replicated Data Types

Only these are supported:

counters
sets
(nested) maps
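
As an illustration of the counter case, here is a minimal grow-only counter (G-Counter), one of the simplest CRDTs. It is a sketch, not the implementation used by Phoenix or Riak; the class and method names are chosen for this example. Each node only increments its own slot, and merging takes the per-node maximum, so replicas converge regardless of the order or number of merges.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged by per-node maximum."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts = {}  # node id -> increments seen from that node

    def increment(self, amount: int = 1) -> None:
        # A node only ever writes to its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Taking the maximum makes merge commutative, associative and
        # idempotent, which is what lets replicas converge.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)
```

A grow-only set works the same way with set union as the merge function. Counters that can decrease and sets that allow removal need extra bookkeeping.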

In Practice: CP vs AP choice is Non-Binary

We'd like to decide per datatype (or even per field!)
- Most tools don't currently support this :-(
Also Consider:
- How does the system behave during normal operation? (Latency vs Consistency, PACELC)

Side Note: Sharding

No communication between nodes necessary
Great for scaling
No fault tolerance
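
A minimal sketch of hash-based sharding, assuming a fixed list of shard names (the names and the `shard_for` function are hypothetical). Every key maps to exactly one shard, so nodes never need to coordinate, but a key is unreachable whenever its shard is down.

```python
import hashlib


def shard_for(key: str, shards: list) -> str:
    """Pick the single shard responsible for `key`."""
    # Use a stable hash: Python's built-in hash() is randomized per process.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(shards)
    return shards[index]
```

Note that this simple modulo scheme moves most keys to a different shard when the number of shards changes. Consistent hashing is the usual way to limit that.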

2. Tools for Distribution in Elixir/Erlang

Multi-node clusters
- Transparent Message-passing!
- libcluster
- Partisan
Phoenix.Presence / Phoenix.Tracker
- CRDTs! :-)
GenServer.multi_call / GenServer.abcast
Hot-Code upgrades

Your Application is Not Your Database

Multiple ways of scaling:
- More data?
- More active users?

4. Distributed Databases Comparison

AP:
- Mnesia
- Cassandra
- CouchDB
- Riak

Mnesia

Erlang's built-in database
Do It Yourself:
- Split-Brain
- Clustering
Your DB is not your application

Cassandra

Java-based
Column-based structured DB
SQL-like querying
Non-configurable, per-column Last-Write-Wins, based on timestamps
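
Per-column Last-Write-Wins can be sketched as follows. This only illustrates the merge rule, not Cassandra's storage format or API: the `(timestamp, value)` cell layout and the `merge_lww` name are assumptions made for this example, and ties are resolved by simply keeping the first row's cell.

```python
def merge_lww(row_a: dict, row_b: dict) -> dict:
    """Merge two versions of a row, keeping the newest cell per column.

    Each row maps column name -> (timestamp, value).
    """
    merged = dict(row_a)
    for column, cell in row_b.items():
        # The cell with the higher timestamp wins; the other is discarded.
        if column not in merged or cell[0] > merged[column][0]:
            merged[column] = cell
    return merged
```

The losing write is dropped silently, and the outcome depends on the clocks of the writers being reasonably in sync.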

CouchDB

Erlang-based
Document Store
JSON-based querying
Document-based Vector-Clocks for synchronization
Conflicts have to be checked/fixed manually

Riak

Erlang-based
K/V-store + CRDTs!
Limited querying capabilities:
- key-based range queries using secondary indexes ('2i')
- Solr, which lags ~1 second behind.
Vector-Clocks for synchronization
Conflicts can be resolved automatically
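
To show how vector clocks detect conflicting writes, here is a minimal sketch. It is not Riak's implementation; representing a clock as a plain dict of node name -> counter, and the `compare` and `merge` names, are assumptions for this example.

```python
def compare(a: dict, b: dict) -> str:
    """Relate two vector clocks (dicts of node name -> counter)."""
    nodes = set(a) | set(b)
    a_ahead = any(a.get(n, 0) > b.get(n, 0) for n in nodes)
    b_ahead = any(b.get(n, 0) > a.get(n, 0) for n in nodes)
    if a_ahead and b_ahead:
        # Neither version has seen the other: a real conflict.
        return "concurrent"
    if a_ahead:
        return "after"
    if b_ahead:
        return "before"
    return "equal"


def merge(a: dict, b: dict) -> dict:
    """Clock for a value that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
```

Only the "concurrent" case needs conflict resolution. When the stored value is a CRDT, that resolution can be done automatically by its merge function.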

General Challenges with Distributed Databases

By their nature, NoSQL
- Makes adoption more difficult
In general, more difficult to query
Currently, no mature Elixir adapters

Choices/Solutions for Planga

AP over CP
Go with Riak...
- ... and build a Riak Ecto3 Adapter while we're at it
Snowflakes
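
Assuming "Snowflakes" refers to Snowflake-style IDs, the idea can be sketched as below: each node generates sortable, unique 64-bit IDs without coordinating with other nodes. The bit layout (41-bit millisecond timestamp, 10-bit node id, 12-bit sequence) follows the commonly described Snowflake scheme; the epoch value, the class name, and the injectable `clock` are choices made for this sketch and say nothing about how Planga does it.

```python
import time


class SnowflakeGenerator:
    EPOCH_MS = 1_500_000_000_000  # arbitrary custom epoch for this sketch
    NODE_BITS = 10
    SEQ_BITS = 12

    def __init__(self, node_id: int, clock=lambda: int(time.time() * 1000)):
        if not 0 <= node_id < (1 << self.NODE_BITS):
            raise ValueError("node_id does not fit in NODE_BITS")
        self.node_id = node_id
        self.clock = clock  # returns the current time in milliseconds
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            # Same millisecond: bump the per-node sequence number.
            self.sequence = (self.sequence + 1) & ((1 << self.SEQ_BITS) - 1)
            if self.sequence == 0:
                # Sequence exhausted: spin until the next millisecond.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        timestamp = now - self.EPOCH_MS
        return (
            (timestamp << (self.NODE_BITS + self.SEQ_BITS))
            | (self.node_id << self.SEQ_BITS)
            | self.sequence
        )
```

Because the timestamp occupies the highest bits, IDs sort roughly by creation time across nodes, and the node id keeps IDs from different nodes distinct even within the same millisecond.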

Planga until recently

One node
Mnesia as DB
Phoenix.PubSub to connect users

Planga Soon™

3 App nodes
3 Riak nodes
Riak as DB
Phoenix Presence for ephemeral state
Frontend pings all app nodes to connect to current fastest
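
The "connect to the current fastest node" step can be sketched as follows. This is an assumption-laden illustration, not Planga's frontend code: `fastest_node` is a made-up name, and `ping` is a caller-supplied function that returns the measured round-trip time in seconds for a node or raises if the node is unreachable. A browser frontend would typically issue these pings concurrently; the sketch does them one by one for simplicity.

```python
def fastest_node(nodes, ping):
    """Return the reachable node with the lowest measured latency."""
    latencies = {}
    for node in nodes:
        try:
            latencies[node] = ping(node)
        except Exception:
            # An unreachable node is skipped, so the client still
            # connects as long as at least one app node answers.
            continue
    if not latencies:
        raise ConnectionError("no app node reachable")
    return min(latencies, key=latencies.get)
```

Skipping failed nodes also gives a basic form of failover: if the chosen node later disappears, the client can run the same selection again.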

Potential future plans

Adding Nebulex as in-app distributed cache in front of Riak?
More nodes in multiple regions?
Frontend as Progressive Web App to deal with spotty internet connections?

Summary; Closing Remarks

Distributed Applications are Hard
Elixir makes it reasonably bearable
Tooling can be (and is being!) improved

Projects to be aware of:
- LASP/Partisan
- Phoenix.PubSub (Firenest)
- libcluster
- RiakEcto3 :-)
Shoutouts
- Martin Sumner
- Ecto Team
- the amazing Elixir Community

Thank You!

Try Planga
Read the code (and criticize it)
Questions?

An Adventure in Distributed Programming
