The Highs and Lows of Early Adoption:

CoreOS in Production

@lukeb0nd

http://yld.io

Luke Bond

CoreOS London November 2014

The Plan

  • Project background
  • Our current stack
  • What's been great
  • What's been challenging
  • Some tips and recommendations


"Connected Boilers" project

British Gas: Connected Homes (makers of Hive)

 

  • Lots of data emitted by boilers in the home
  • We receive it all via a cloud intermediary
  • Currently focused on detecting errors
  • Extensible for other functionality in backend


 

"Connected Boilers" project

  • Large projected data volume and scale
  • JSON all the way
  • Data consumed by the API and also by data science/analytics

 

It has been an interesting project with interesting challenges, and a more or less greenfield one.

"Connected Boilers" project


Project Aims

 

  • Scalable
  • More-or-less self-managing:
    • Strong monitoring/alerting
    • Zero-downtime deployments
    • Service discovery


Project Aims

  • Small team of contractors, so:
    • Minimal human intervention
    • Easy for newcomers to pick up
    • In short: want to leave behind something easy to manage

 

Therefore we opted from the beginning for a rigorously tested continuous deployment approach.


Technologies Used

  • Node.js back-end & API (+ a bit of Java)
  • AWS: EC2, ELB, EBS
  • Couchbase
  • Angular web front-end
  • Mobile app
  • CoreOS, Fleet, Etcd, HAProxy, Confd
  • Continuous deployment pipeline:
    • Jenkins
    • Node.js + LevelDB deployment bot


What's been great: CoreOS

  • From a developer's POV it just works™
    • Minimal, largely read-only, no package manager, so fewer things can go wrong
  • We've been on the stable channel for a few weeks now
  • We're using "cfndsl", which lets you write CloudFormation templates in Ruby
  • A lean OS is perfect for Docker


CoreOS: updates & restarts

  • We began with one-machine-at-a-time restarts
  • Fleet got into a bad state once after a big update to it (alpha channel)
    • We never figured out what happened
  • Now we disable automatic reboots and do planned updates instead (see the sketch below)
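
A minimal sketch of how automatic reboots can be disabled in cloud-config (this reflects the documented reboot strategies rather than our exact configuration):

#cloud-config
coreos:
  update:
    # "off" means updates are still downloaded and applied,
    # but the machine never reboots itself; we reboot deliberately
    reboot-strategy: "off"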


CoreOS: how we're using units

Or "units are more than just runners of Docker containers"

 

  • Mount units
  • Timer units
    • We use these for scheduled backups (see the timer sketch below)
  • To attach/detach EBS volumes
  • One-shot units for administrative/maintenance tasks
    • Global one-shot units particularly useful


CoreOS: how we're using units

Example

$ fleetctl cat jq.service
[Unit]
Description=Install JQ

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/mkdir -p /opt/bin
ExecStart=/usr/bin/curl http://stedolan.github.io/jq/download/linux64/jq -o /opt/bin/jq
ExecStart=/usr/bin/chmod +x /opt/bin/jq

[X-Fleet]
Global=true
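
For the scheduled backups mentioned earlier, the same pattern works with timer units. A hedged sketch (the unit names, schedule and metadata are made up for illustration):

$ fleetctl cat backup.timer
[Unit]
Description=Nightly backup schedule

[Timer]
# systemd fires the matching backup.service every day at 03:00
OnCalendar=*-*-* 03:00:00
Persistent=true

[X-Fleet]
MachineMetadata=role=db

backup.service itself can be an ordinary one-shot unit that runs the backup container; giving it MachineOf=backup.timer in [X-Fleet] keeps the pair on the same host.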


CoreOS: how we're using units

  • Other examples:
    • A one-shot global unit to set the overcommit kernel flag (we wanted it for Redis); via [X-Fleet] metadata we run it only on the high-RAM hosts designated for Redis (see the sketch below)
    • One-time execution of database maintenance, e.g. adding views, fixtures, import/export, backups
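
A sketch of that overcommit unit, assuming the Redis hosts are tagged with something like role=redis in their fleet metadata (the metadata key and value are assumptions for illustration):

$ fleetctl cat overcommit.service
[Unit]
Description=Enable memory overcommit on Redis hosts

[Service]
Type=oneshot
RemainAfterExit=yes
# Redis recommends vm.overcommit_memory=1 so background saves don't fail
ExecStart=/bin/sh -c "echo 1 > /proc/sys/vm/overcommit_memory"

[X-Fleet]
Global=true
MachineMetadata=role=redis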

 

  • We use the `core` user and add our team's SSH keys


What's been great: Fleet

  • Cluster presented as a systemd abstraction

  • Think: systemd of multiple hosts "taped together" to appear as one, using Etcd

  • Docker makes "abstracting the host" possible, but Fleet delivers it

  • You now think about the cluster, not hosts

    • I can't imagine going back from this thinking now :)

  • Very powerful when combined with 12-factor principles
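
For example, deploying a (hypothetical) api@.service template to the cluster is just standard fleetctl usage:

$ fleetctl submit api@.service                                 # register the template with the cluster
$ fleetctl start api@1.service api@2.service api@3.service     # fleet picks a host for each instance
$ fleetctl list-units                                          # see where everything landed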


Fleet as CM solution?

  • Configuration-management type actions encapsulated as systemd unit files, run with Fleet
  • Can have tag-like functionality with [X-Fleet] conditionals
  • It's declarative: "cluster, make this available" rather than "run command X on box Y".
  • Those of you who use systemd heavily will be better able than I am to imagine the potential of a cluster-wide abstraction of it.


Fleet as CM solution?

  • Personally, I prefer this to using a traditional* CM solution
  • Include only the bare minimum in cloud-config (see the sketch after this list)
  • Write unit files for configuration management and administration
  • Check them into Git
  • Deploy them on your CoreOS cluster in the same way you do your services
  • Fleet will handle running them on new machines
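
A sketch of what that bare cloud-config can look like: just enough to join etcd, tag the machine, and start fleet, which then pulls in everything else (the discovery token and metadata are placeholders):

#cloud-config
coreos:
  etcd:
    # generate a fresh token per cluster at https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  fleet:
    metadata: role=services
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start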



What's been great: Docker

  • Great for development

    • Fig instead of Vagrant

  • Great for testing

    • Fig for functional/acceptance tests

    • `docker build/tag/push` on green

  • Great for deployment

    • `docker pull`, `docker run`
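
In sketch form, with a placeholder registry and image name:

# on a green CI build
$ docker build -t registry.example.com/api:1.2.3 .
$ docker tag registry.example.com/api:1.2.3 registry.example.com/api:latest
$ docker push registry.example.com/api:1.2.3

# on deployment (wrapped in a fleet unit in practice)
$ docker pull registry.example.com/api:1.2.3
$ docker run -d --name api -p 3000:3000 registry.example.com/api:1.2.3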


What's been great: Docker

  • I now naturally tend towards smaller, simpler, composable services

  • Docker and Node.js work really well together

    • Single-threaded* event-driven model maps well to the "one process per container" Docker model

      • (I don't like the supervisor model**, especially when you have systemd)
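
A typical fleet unit for one of these Node.js services might look like the following sketch (the service name, port and image are hypothetical):

$ fleetctl cat api@.service
[Unit]
Description=Node.js API instance %i
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
# clean up any stale container, then pull and run in the foreground so systemd supervises it
ExecStartPre=-/usr/bin/docker kill api-%i
ExecStartPre=-/usr/bin/docker rm api-%i
ExecStartPre=/usr/bin/docker pull registry.example.com/api:latest
ExecStart=/usr/bin/docker run --name api-%i -p 3000:3000 registry.example.com/api:latest
ExecStop=/usr/bin/docker stop api-%i

[X-Fleet]
Conflicts=api@*.service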


Challenges: CoreOS

  • CoreOS terminal is a PITA

  • Etcd cluster losing quorum

    • CoreOS don't seem to have a recommended way of replacing an Etcd cluster, or dealing with this issue

    • Separate Etcd cluster?

  • Complex ordering of systemd units for asynchronous tasks

    • e.g. not starting a unit until its EBS volume is attached (see the mount sketch below)

    • Likewise detaching cleanly
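
To illustrate the ordering problem, here is a hedged sketch of a mount unit (the device and mount point are assumptions; the awkward part is ensuring the EBS volume is actually attached before the mount runs):

$ fleetctl cat var-lib-couchbase.mount
[Unit]
Description=Couchbase data volume

[Mount]
What=/dev/xvdf
Where=/var/lib/couchbase
Type=ext4

The database unit then declares Requires=var-lib-couchbase.mount and After=var-lib-couchbase.mount, and a separate unit still has to attach the volume beforehand and detach it cleanly afterwards.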


Challenges: new technology

 

It's a dangerous business, Frodo, going out your door. You step onto the road, and if you don't keep your feet, there's no knowing where you might be swept off to.

– J.R.R. Tolkien, The Lord of the Rings

 

  • We're using a lot of new technology at once

  • That wasn't the plan, but it was like tugging at a thread!

    • Or "swallow the spider to catch the fly", at times


Challenges: Docker networking

 

May you live in interesting times

– Chinese Curse (apocryphal)

 

  • I look forward to when Docker networking is solved

    • @weavenetwork from Zettio looks very promising

  • But these early days of Docker are exciting times

  • Beware complex and gnarly service discovery solutions


Challenges: Docker networking

 

  • Tricky to balance keeping it simple with robustness

  • Complexity and number of moving parts can skyrocket if you're not careful

  • For us, startup-time service discovery was not enough

    • We needed dynamically configured internal load balancers (see the sidekick sketch below)

    • Beware potential DNS implications
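
What we converged on is roughly the usual CoreOS "sidekick" pattern: a small unit announces its service in etcd with a TTL, and confd renders those keys into HAProxy config. A sketch, with made-up key names and port:

$ fleetctl cat api-sidekick@.service
[Unit]
Description=Announce api instance %i in etcd
BindsTo=api@%i.service
After=api@%i.service

[Service]
# keep refreshing the key while the service is up; the TTL clears it when it is not
ExecStart=/bin/sh -c "while true; do etcdctl set /services/api/%H 3000 --ttl 60; sleep 45; done"
ExecStop=/usr/bin/etcdctl rm /services/api/%H

[X-Fleet]
MachineOf=api@%i.service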


Challenges: databases

  • Mount volumes for your data, of course (see the sketch after this list)

  • Databases don't always like moving hosts

    • e.g. Couchbase, depending how it's configured

  • When choosing a DB, consider how running it in Docker affects that choice.

    • i.e. how does it cluster/replicate?

    • Couchbase, Riak, Cassandra compared to MongoDB, CouchDB, MySQL?
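
For example, keeping the data on the host rather than inside the container (the image name and host path are placeholders; Couchbase keeps its data under /opt/couchbase/var):

$ docker run -d --name couchbase -v /var/lib/couchbase:/opt/couchbase/var registry.example.com/couchbase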


DOs

  • Keep your wits about you

  • Keep it simple

  • Keep services small*

  • Expect services to move hosts

  • Accept limitations in order to simplify

  • Define per-host services via global units rather than in user-data

DON'Ts

  • Get carried away

  • Neglect to consider persistent storage

  • Treat hosts like pets

  • Stovepiping

  • Force apps to do anything special in order to work


Questions?

@lukeb0nd

http://yld.io
