The Highs and Lows of Early Adoption:

CoreOS in Production

@lukeb0nd

http://yld.io

Luke Bond

CoreOS London November 2014

The Plan

  • Project background
  • Our current stack
  • What's been great
  • What's been challenging
  • Some tips and recommendations


"Connected Boilers" project

British Gas: Connected Homes (makers of Hive)

 

  • Lots of data emitted by boilers in the home
  • We receive it all via a cloud intermediary
  • Currently focused on detecting errors
  • Extensible for other functionality in backend


 

"Connected Boilers" project

  • Large projected data volume and scale
  • JSON all the way
  • Data consumed by the API and also by data science/analytics

 

It has been an interesting project with interesting challenges, and a more or less greenfield one.

"Connected Boilers" project


Project Aims

 

  • Scalable
  • More-or-less self-managing:
    • Strong monitoring/alerting
    • Zero-downtime deployments
    • Service discovery


Project Aims

  • Small team of contractors, so:
    • Minimal human intervention
    • Easy for newcomers to pick up
    • In short: want to leave behind something easy to manage

 

Therefore we opted from the beginning for a rigorously tested continuous deployment approach.


Technologies Used

  • Node.js back-end & API (+ a bit of Java)
  • AWS: EC2, ELB, EBS
  • Couchbase
  • Angular web front-end
  • Mobile app
  • CoreOS, Fleet, Etcd, HAProxy, Confd
  • Continuous deployment pipeline:
    • Jenkins
    • Node.js + LevelDB deployment bot


What's been great: CoreOS

  • From a developer's POV it just works™
    • Minimal, largely read-only, no package manager, so fewer things can go wrong
  • We've been on the stable channel for a few weeks now
  • We're using "cfndsl", which lets you write CloudFormation templates in Ruby
  • A lean OS is perfect for Docker


CoreOS: updates & restarts

  • We began with one-machine-at-a-time restarts
  • Fleet got into a bad state once after a big update to it (alpha channel)
    • We never figured out what happened
  • Now we disable automatic reboots and do planned updates instead (see the sketch below)
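
A minimal sketch of how automatic reboots can be disabled in cloud-config (this reflects the documented reboot strategies rather than our exact configuration):

#cloud-config
coreos:
  update:
    # "off" means updates are still downloaded and applied,
    # but the machine never reboots itself; we reboot deliberately
    reboot-strategy: "off"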


CoreOS: how we're using units

Or "units are more than just runners of Docker containers"

 

  • Mount units
  • Timer units
    • We use these for scheduled backups (see the timer sketch below)
  • To attach/detach EBS volumes
  • One-shot units for administrative/maintenance tasks
    • Global one-shot units particularly useful


CoreOS: how we're using units

Example

$ fleetctl cat jq.service
[Unit]
Description=Install JQ

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/usr/bin/mkdir -p /opt/bin
ExecStart=/usr/bin/curl http://stedolan.github.io/jq/download/linux64/jq -o /opt/bin/jq
ExecStart=/usr/bin/chmod +x /opt/bin/jq

[X-Fleet]
Global=true
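
For the scheduled backups mentioned earlier, the same pattern works with timer units. A hedged sketch (the unit names, schedule and metadata are made up for illustration):

$ fleetctl cat backup.timer
[Unit]
Description=Nightly backup schedule

[Timer]
# systemd fires the matching backup.service every day at 03:00
OnCalendar=*-*-* 03:00:00
Persistent=true

[X-Fleet]
MachineMetadata=role=db

backup.service itself can be an ordinary one-shot unit that runs the backup container; giving it MachineOf=backup.timer in [X-Fleet] keeps the pair on the same host.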


CoreOS: how we're using units

  • Other examples:
    • A one-shot global unit to set the overcommit kernel flag (we wanted it for Redis); via [X-Fleet] metadata we run it only on the high-RAM hosts designated for Redis (see the sketch below)
    • One-time execution of database maintenance, e.g. adding views, fixtures, import/export, backups
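
A sketch of that overcommit unit, assuming the Redis hosts are tagged with something like role=redis in their fleet metadata (the metadata key and value are assumptions for illustration):

$ fleetctl cat overcommit.service
[Unit]
Description=Enable memory overcommit on Redis hosts

[Service]
Type=oneshot
RemainAfterExit=yes
# Redis recommends vm.overcommit_memory=1 so background saves don't fail
ExecStart=/bin/sh -c "echo 1 > /proc/sys/vm/overcommit_memory"

[X-Fleet]
Global=true
MachineMetadata=role=redis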

 

  • We use the `core` user and add our team's SSH keys


What's been great: Fleet

  • Cluster presented as a systemd abstraction

  • Think: systemd of multiple hosts "taped together" to appear as one, using Etcd

  • Docker makes "abstracting the host" possible, but Fleet delivers it

  • You now think about the cluster, not hosts

    • I can't imagine going back from this thinking now :)

  • Very powerful when combined with 12-factor principles
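
For example, deploying a (hypothetical) api@.service template to the cluster is just standard fleetctl usage:

$ fleetctl submit api@.service                                 # register the template with the cluster
$ fleetctl start api@1.service api@2.service api@3.service     # fleet picks a host for each instance
$ fleetctl list-units                                          # see where everything landed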


Fleet as CM solution?

  • Configuration-management type actions encapsulated as systemd unit files, run with Fleet
  • Can have tag-like functionality with [X-Fleet] conditionals
  • It's declarative: "cluster, make this available" rather than "run command X on box Y".
  • Those of you who use systemd heavily will be better able than I am to imagine the potential of a cluster-wide abstraction of it.


Fleet as CM solution?

  • Personally, I prefer this to using a traditional* CM solution
  • Include only the bare minimum in cloud-config (see the sketch after this list)
  • Write unit files for configuration management and administration
  • Check them into Git
  • Deploy them on your CoreOS cluster in the same way you do your services
  • Fleet will handle running them on new machines
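
A sketch of what that bare cloud-config can look like: just enough to join etcd, tag the machine, and start fleet, which then pulls in everything else (the discovery token and metadata are placeholders):

#cloud-config
coreos:
  etcd:
    # generate a fresh token per cluster at https://discovery.etcd.io/new
    discovery: https://discovery.etcd.io/<token>
    addr: $private_ipv4:4001
    peer-addr: $private_ipv4:7001
  fleet:
    metadata: role=services
  units:
    - name: etcd.service
      command: start
    - name: fleet.service
      command: start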



What's been great: Docker

  • Great for development

    • Fig instead of Vagrant

  • Great for testing

    • Fig for functional/acceptance tests

    • `docker build/tag/push` on green

  • Great for deployment

    • `docker pull`, `docker run`
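
In sketch form, with a placeholder registry and image name:

# on a green CI build
$ docker build -t registry.example.com/api:1.2.3 .
$ docker tag registry.example.com/api:1.2.3 registry.example.com/api:latest
$ docker push registry.example.com/api:1.2.3

# on deployment (wrapped in a fleet unit in practice)
$ docker pull registry.example.com/api:1.2.3
$ docker run -d --name api -p 3000:3000 registry.example.com/api:1.2.3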


What's been great: Docker

  • I now naturally tend towards smaller, simpler, composable services

  • Docker and Node.js work really well together

    • Single-threaded* event-driven model maps well to the "one process per container" Docker model

      • (I don't like the supervisor model**, especially when you have systemd)
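
A typical fleet unit for one of these Node.js services might look like the following sketch (the service name, port and image are hypothetical):

$ fleetctl cat api@.service
[Unit]
Description=Node.js API instance %i
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
# clean up any stale container, then pull and run in the foreground so systemd supervises it
ExecStartPre=-/usr/bin/docker kill api-%i
ExecStartPre=-/usr/bin/docker rm api-%i
ExecStartPre=/usr/bin/docker pull registry.example.com/api:latest
ExecStart=/usr/bin/docker run --name api-%i -p 3000:3000 registry.example.com/api:latest
ExecStop=/usr/bin/docker stop api-%i

[X-Fleet]
Conflicts=api@*.service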


Challenges: CoreOS

  • CoreOS terminal is a PITA

  • Etcd cluster losing quorum

    • CoreOS don't seem to have a recommended way of replacing an Etcd cluster, or dealing with this issue

    • Separate Etcd cluster?

  • Complex ordering of systemd units for asynchronous tasks

    • e.g. not starting a unit until its EBS volume is attached (see the mount sketch below)

    • Likewise detaching cleanly
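
To illustrate the ordering problem, here is a hedged sketch of a mount unit (the device and mount point are assumptions; the awkward part is ensuring the EBS volume is actually attached before the mount runs):

$ fleetctl cat var-lib-couchbase.mount
[Unit]
Description=Couchbase data volume

[Mount]
What=/dev/xvdf
Where=/var/lib/couchbase
Type=ext4

The database unit then declares Requires=var-lib-couchbase.mount and After=var-lib-couchbase.mount, and a separate unit still has to attach the volume beforehand and detach it cleanly afterwards.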


Challenges: new technology

 

It's a dangerous business, Frodo, going out your door. You step onto the road, and if you don't keep your feet, there's no knowing where you might be swept off to.

– J.R.R. Tolkien, The Lord of the Rings

 

  • We're using a lot of new technology at once

  • That wasn't the plan, but it was like tugging at a thread!

    • Or "swallow the spider to catch the fly", at times


Challenges: Docker networking

 

May you live in interesting times

– Chinese Curse (apocryphal)

 

  • I look forward to when Docker networking is solved

    • @weavenetwork from Zettio looks very promising

  • But these early days of Docker are exciting times

  • Beware complex and gnarly service discovery solutions


Challenges: Docker networking

 

  • Tricky to balance keeping it simple with robustness

  • Complexity and number of moving parts can skyrocket if you're not careful

  • For us, startup-time service discovery was not enough

    • We needed dynamically configured internal load balancers (see the sidekick sketch below)

    • Beware potential DNS implications
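
What we converged on is roughly the usual CoreOS "sidekick" pattern: a small unit announces its service in etcd with a TTL, and confd renders those keys into HAProxy config. A sketch, with made-up key names and port:

$ fleetctl cat api-sidekick@.service
[Unit]
Description=Announce api instance %i in etcd
BindsTo=api@%i.service
After=api@%i.service

[Service]
# keep refreshing the key while the service is up; the TTL clears it when it is not
ExecStart=/bin/sh -c "while true; do etcdctl set /services/api/%H 3000 --ttl 60; sleep 45; done"
ExecStop=/usr/bin/etcdctl rm /services/api/%H

[X-Fleet]
MachineOf=api@%i.service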


Challenges: databases

  • Mount volumes for your data, of course (see the sketch after this list)

  • Databases don't always like moving hosts

    • e.g. Couchbase, depending how it's configured

  • When choosing a DB, consider how running it in Docker affects that choice.

    • i.e. how does it cluster/replicate?

    • Couchbase, Riak, Cassandra compared to MongoDB, CouchDB, MySQL?
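
For example, keeping the data on the host rather than inside the container (the image name and host path are placeholders; Couchbase keeps its data under /opt/couchbase/var):

$ docker run -d --name couchbase -v /var/lib/couchbase:/opt/couchbase/var registry.example.com/couchbase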


DOs

  • Keep your wits about you

  • Keep it simple

  • Keep services small*

  • Expect services to move hosts

  • Accept limitations in order to simplify

  • Define per-host services via global units rather than in user-data

DON'Ts

  • Get carried away

  • Neglect to consider persistent storage

  • Treat hosts like pets

  • Stovepiping

  • Force apps to do anything special in order to work


Questions?

@lukeb0nd

http://yld.io
