An Adventure in Distributed Programming

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

About Me

Marten (Wiebe-Marten Wijnja)

 

 

  • ~12 years of software development
  • ~6 years of decentralized and distributed systems
  • ~3 years of Elixir
  • ~1.5 years of Resilia
  • ~9 months of project Planga

Online: Qqwy

Planga: How this adventure started

  • Seamless Instant Chat Integration
  • "handles your chat so you don't have to"
  • SaaS & FOSS
  • Design Goals:
    • Simple to integrate
    • Chat should never break!

Planga: Distributed

Soon™

Talk Rationale:

Presentation Goals:

  • High-Level Overview
  • Lessons learned at Planga

Contents

  1. Crash Course Distributed Systems
  2. Tools for Distribution in Elixir
  3. Comparing Distributed Databases
  4. Planga: Choices + Future

1. Distributed Systems Crash Course

What is a Distributed System?

  • Software running on multiple computers at once
    • Reason: Scalability, Fault Tolerance
    • Need to communicate to agree about state
      • This is hard!

These Things Are False:

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn't change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

The Byzantine Generals' Problem

Why Communication is Hard

Situations Look the Same!

No Reply!

What should the General do?

  • cancel the attack but miss the attack opportunity, or
  • proceed, but risk an uncoordinated attack?

Network Partition!

What should the Node do?

  • cancel the operation and thus decrease availability, or
  • proceed with the operation and thus risk inconsistency?
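Why the situations look the same can be shown with a tiny simulation. This is a minimal sketch in Python, not code from the talk; the function `send_with_ack` and its parameters are hypothetical names chosen for illustration. It models one request/acknowledgement round trip and shows that a lost request and a lost acknowledgement are indistinguishable to the sender.

```python
from typing import Optional, Tuple


def send_with_ack(deliver_request: bool, deliver_ack: bool) -> Tuple[Optional[str], bool]:
    """Simulate one request/acknowledgement round trip (hypothetical helper).

    Returns (what the sender observes, whether the receiver acted).
    """
    receiver_acted = False
    ack = None
    if deliver_request:
        receiver_acted = True  # the receiver got the message and acted on it
        if deliver_ack:
            ack = "ack"  # the reply made it back to the sender
    return ack, receiver_acted


# Case 1: the request itself is lost.
seen_1, acted_1 = send_with_ack(deliver_request=False, deliver_ack=True)
# Case 2: the request arrives, but the acknowledgement is lost.
seen_2, acted_2 = send_with_ack(deliver_request=True, deliver_ack=False)

# The sender observes the same thing (no reply) in both cases...
assert seen_1 is None and seen_2 is None
# ...although the actual state of the receiver differs.
assert acted_1 is False and acted_2 is True
```

From the sender's side there is only "no reply"; whether to cancel or proceed cannot be decided from that observation alone.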

CAP Theorem

CP vs AP?

  • CP: cancel the operation and thus decrease availability, or
  • AP: proceed with the operation and thus risk inconsistency?

 


CP: Critical data like bank account balances.

  ✔ No need to 'fix' state: Easier to work with!
  ✘ Needs lots of communication: Hard to scale

AP: Chat/Social Media feeds, Sensor data, etc.

  ✘ Needs to fix inconsistent states: Tricky!
  ✔ Little communication needed: Scalable

CP: Consensus

  • 2-Phase-Commit/Paxos/Raft: Basically, 'Voting'
  • Have to wait until more than half of the nodes are available
  • Example: Distributed ACID Transactions
    • Distributed Postgres / CockroachDB / Citus
    • Distributed MongoDB
    • FaunaDB
    • BigTable
    • VoltDB
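The "more than half of the nodes" rule can be sketched as follows. This is only the majority check, written in Python for illustration; it is not an implementation of Paxos, Raft, or 2-Phase-Commit, and the names `Replica`, `QuorumUnavailable`, and `quorum_write` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Replica:
    """A toy replica; `reachable` stands in for network connectivity."""
    name: str
    reachable: bool = True
    data: Dict[str, str] = field(default_factory=dict)


class QuorumUnavailable(Exception):
    """Raised when fewer than a strict majority of replicas can be reached."""


def quorum_write(replicas: List[Replica], key: str, value: str) -> int:
    """Apply a write only if a strict majority is reachable (CP behaviour).

    Returns the number of replicas that accepted the write.
    """
    reachable = [r for r in replicas if r.reachable]
    needed = len(replicas) // 2 + 1  # strict majority
    if len(reachable) < needed:
        # Refuse the operation: availability is given up to keep consistency.
        raise QuorumUnavailable(
            f"{len(reachable)} of {len(replicas)} reachable, need {needed}"
        )
    for replica in reachable:
        replica.data[key] = value
    return len(reachable)


cluster = [Replica("a"), Replica("b"), Replica("c", reachable=False)]
print(quorum_write(cluster, "balance", "100"))  # 2 of 3 is a majority

cluster[1].reachable = False  # a second node becomes unreachable
try:
    quorum_write(cluster, "balance", "90")
except QuorumUnavailable as error:
    print("write rejected:", error)
```

With three replicas the cluster tolerates one unreachable node; with two unreachable it stops accepting writes rather than risk diverging state.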

AP: Eventual Consistency

  • Split Brain: Application needs to decide how to combine states again
  • This can be painful/error-prone!

 

AP: Eventual Consistency with CRDTs!

Conflict-free Replicated Data Types

Only a few data types are supported:

  • counters
  • sets
  • (nested) maps
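As an illustration of the counter case, here is a minimal sketch of a grow-only counter (G-Counter), one of the simplest CRDTs, written in Python. The class and method names are hypothetical and not taken from any particular library. Each node only increments its own slot, and merging takes the per-node maximum, so merges can happen in any order, any number of times, and all replicas converge.

```python
from typing import Dict


class GCounter:
    """Grow-only counter CRDT: one monotonically increasing slot per node."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counts: Dict[str, int] = {}

    def increment(self, amount: int = 1) -> None:
        # A node only ever touches its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Per-node maximum is commutative, associative and idempotent.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)


# Two replicas accept increments independently during a partition.
a, b = GCounter("node_a"), GCounter("node_b")
a.increment(2)
b.increment(3)

# After the partition heals they exchange state in either order.
a.merge(b)
b.merge(a)
b.merge(a)  # merging again changes nothing
assert a.value() == b.value() == 5
```

No application-level conflict resolution is needed here: the merge rule is part of the data type itself.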

In Practice: CP vs AP choice is Non-Binary

  • We'd like to decide per datatype (or even per field!)
    • Most tools don't currently support this :-(
  • Also Consider:
    • How does the system behave under normal operation? (Latency vs. Consistency: PACELC)

 

Side Note: Sharding

  • No communication between nodes necessary
  • Great for scaling
  • No fault tolerance
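A minimal sketch of hash-based shard routing in Python, assuming a fixed number of shards; the function name `shard_for` and the example keys are hypothetical. Every key maps to exactly one shard, so nodes never need to coordinate, but if that shard is down its keys are simply unavailable.

```python
import hashlib


def shard_for(key: str, shard_count: int) -> int:
    """Map a key to one shard index in the range [0, shard_count)."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # A stable hash is used so that every process computes the same mapping
    # (Python's built-in hash() is randomized per process for strings).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


for conversation in ["general", "support", "random"]:
    print(conversation, "-> shard", shard_for(conversation, 4))
```

Note that plain modulo routing reshuffles most keys when `shard_count` changes; schemes such as consistent hashing exist to reduce that, but are outside the scope of this sketch.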

2. Tools for Distribution in Elixir/Erlang

  • Multi-node clusters
    • Transparent Message-passing!
    • libcluster
    • Partisan
  • Phoenix.Presence / Phoenix.Tracker
    • CRDTs! :-)
  • GenServer.multi_call / GenServer.abcast
  • Hot-Code upgrades

Your Application is Not Your Database

  • Multiple ways of scaling:
    • More data?
    • More active users?

3. Distributed Databases Comparison

  • AP:
    • Mnesia
    • Cassandra
    • CouchDB
    • Riak

Mnesia

  • Erlang's built-in database
  • Do It Yourself:
    • Split-Brain
    • Clustering
  • Your DB is not your application

Cassandra

  • Java-based
  • Column-based structured DB
  • SQL-like querying
  • Non-configurable, per-column Last-Write-Wins, based on timestamps
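Per-column Last-Write-Wins can be sketched as below. This is an illustrative Python model, not Cassandra's actual implementation: the `Row` type and `lww_merge` function are hypothetical, and the tie-breaking rule (keep the left-hand value on equal timestamps) is an arbitrary simplification.

```python
from typing import Any, Dict, Tuple

# A row maps column name -> (write timestamp, value).
Row = Dict[str, Tuple[int, Any]]


def lww_merge(left: Row, right: Row) -> Row:
    """Merge two versions of a row, column by column, newest timestamp wins."""
    merged = dict(left)
    for column, (timestamp, value) in right.items():
        if column not in merged or timestamp > merged[column][0]:
            merged[column] = (timestamp, value)
    return merged


# Two replicas received different updates to the same row.
replica_1 = {"name": (10, "Alice"), "status": (12, "online")}
replica_2 = {"name": (11, "Alicia"), "status": (9, "away")}

print(lww_merge(replica_1, replica_2))
# name comes from replica_2 (timestamp 11), status from replica_1 (timestamp 12)
```

The merged row never existed on either replica, and the older write to each column is silently discarded. The result also depends entirely on the writers' clocks being comparable.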

CouchDB

  • Erlang-based
  • Document Store
  • JSON-based querying
  • Document-based Vector-Clocks for synchronization
  • Conflicts have to be checked/fixed manually

Riak

  • Erlang-based
  • K/V-store + CRDTs!
  • Limited querying capabilities:
    • key-based range queries using '2i'
    • Solr, which lags ~1 second behind.
  • Vector-Clocks for synchronization
  • Conflicts can be resolved automatically
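How vector clocks detect conflicts can be sketched in a few lines of Python. This is a generic illustration, not the data format or API of any of the databases above; `compare` is a hypothetical helper. A clock maps each node to the number of updates seen from it. One version supersedes another only if it is at least as new for every node; otherwise the two were written concurrently and form a conflict.

```python
from typing import Dict

VectorClock = Dict[str, int]


def compare(a: VectorClock, b: VectorClock) -> str:
    """Return 'equal', 'before', 'after' or 'concurrent' for clock a versus b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"  # b has seen everything a has: b supersedes a
    if b_le_a:
        return "after"   # a supersedes b
    return "concurrent"  # neither has seen the other's update: a conflict


print(compare({"n1": 1}, {"n1": 2}))                    # before
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 2}))  # concurrent
```

Unlike timestamp-based Last-Write-Wins, concurrent writes are detected rather than silently overwritten; what happens next (manual resolution, or an automatic merge such as a CRDT) is up to the database or the application.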

General Challenges with Distributed Databases

  • By their nature, NoSQL
    • Makes adoption more difficult
  • In general, more difficult to query
  • Currently, no mature Elixir adapters

Choices/Solutions for Planga

  1. AP over CP
  2. Go with Riak...
    • ... and build a Riak Ecto3 Adapter while we're at it
  3. Snowflakes
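For readers unfamiliar with Snowflakes: they are roughly time-ordered 64-bit IDs that nodes can generate without coordinating. The sketch below, in Python, follows the commonly described layout (41 bits of milliseconds since a custom epoch, 10 bits of node ID, 12 bits of per-millisecond sequence). It is not Planga's implementation; the class name, the epoch value, and the injectable `clock` parameter are assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class SnowflakeGenerator:
    """Generates 64-bit IDs: 41 bits timestamp | 10 bits node | 12 bits sequence."""

    def __init__(
        self,
        node_id: int,
        epoch_ms: int = 1_546_300_800_000,  # 2019-01-01 UTC, an arbitrary choice
        clock: Optional[Callable[[], int]] = None,
    ) -> None:
        if not 0 <= node_id < 1024:
            raise ValueError("node_id must fit in 10 bits")
        self.node_id = node_id
        self.epoch_ms = epoch_ms
        self.clock = clock or (lambda: int(time.time() * 1000))
        self.last_ms = -1
        self.sequence = 0

    def next_id(self) -> int:
        now = self.clock()
        if now < self.last_ms:
            raise RuntimeError("clock moved backwards")
        if now == self.last_ms:
            self.sequence = (self.sequence + 1) & 0xFFF
            if self.sequence == 0:
                # 4096 IDs were issued this millisecond: wait for the next one.
                while now <= self.last_ms:
                    now = self.clock()
        else:
            self.sequence = 0
        self.last_ms = now
        return ((now - self.epoch_ms) << 22) | (self.node_id << 12) | self.sequence


generator = SnowflakeGenerator(node_id=1)
first, second = generator.next_id(), generator.next_id()
assert second > first  # IDs from one node are strictly increasing
```

Because the node ID is part of the value, two nodes never produce the same ID, and because the timestamp occupies the high bits, sorting by ID approximates sorting by creation time, which suits an AP store with no global counter.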

Planga until recently

  • One node
  • Mnesia as DB
  • Phoenix.PubSub to connect users

Planga Soon™

  • 3 App nodes
  • 3 Riak nodes
  • Riak as DB
  • Phoenix Presence for ephemeral state
  • Frontend pings all app nodes to connect to current fastest

Potential future plans

  • Adding Nebulex as in-app distributed cache in front of Riak?
  • More nodes in multiple regions?
  • Frontend as Progressive Web App to deal with spotty internet connections?

Summary; Closing Remarks

  • Distributed Applications are Hard
  • Elixir makes it reasonably bearable
  • Tooling can be (and is being!) improved

 

Thank You!

  • Try Planga
  • Read the code (and criticize it)
  • Questions?

An Adventure in Distributed Programming

By qqwy

Talk given at ElixirConf.EU 2019 (http://www.elixirconf.eu/elixirconfeu2019/wiebe-marten-wijnja).
