Docker - The fine print

Avishai Ish-Shalom (@nukemberg)

 

Agenda

  • Da faq is Docker
  • Let's build a system!
  • Docker clusters

What is this Docker thing

  • Chroot, namespaces
  • cgroups
  • Layered FS, layered images
  • Container API

What Docker can't provide

  • Configuration management
  • Scalability
  • Security *
  • Full process isolation

Docker for me*:

Uber omnibus package - with a local deployment daemon

Docker is not a VM

And you shouldn't use it like one

Let's build a production system!

  • RabbitMQ cluster
  • Postgres
  • Java app server instances
  • Nginx
  • ELK
  • Graphite
  • 3rd party containers

First, RabbitMQ

  1. Start a RabbitMQ container with a data volume
  2. Stop it, then run a new container with the same data volume
  3. Our data is gone!!

Many databases embed the hostname in their storage layout (mostly in directory names)
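
One way around this is to pin the container's hostname so the data directory name stays stable across restarts. A minimal sketch, assuming the official rabbitmq image (which keeps its data under /var/lib/rabbitmq); the hostname rabbit1 and the host path /srv/rabbit1 are placeholders. The function only prints the command so it can be reviewed before running:

```shell
# Print a docker run command that pins the hostname, so RabbitMQ's
# node name (rabbit@<hostname>) and its data directory name stay the
# same across container restarts.
# Assumption: official "rabbitmq" image, data under /var/lib/rabbitmq.
rabbit_run_cmd() {
  node_host=$1   # stable hostname (placeholder, e.g. rabbit1)
  data_dir=$2    # host directory used as the data volume (placeholder)
  printf 'docker run -d --hostname %s -v %s:/var/lib/rabbitmq rabbitmq\n' \
    "$node_host" "$data_dir"
}

rabbit_run_cmd rabbit1 /srv/rabbit1
```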

RabbitMQ Cluster (still in staging)

We link the containers and...

Shit, I can't get the nodes to see each other

  • EPMD (Erlang port mapper) requires consistent hostnames
  • Docker assigns autogenerated private hostnames
  • Docker link renders the link name as a hostname
  • icc=true anyone?

RabbitMQ Cluster (production)

  • Can't put nodes on same host
  • Docker networking doesn't span hosts yet
  • epmd hates NAT
  • Host mode networking -> port conflicts, bad isolation

Elasticsearch Cluster

  • Zen multicast discovery doesn't work with icc=false
  • Fallback to Zen unicast or icc=true
  • But nodes need to be on different hosts....
  • Oh shit, hostnames again!!!
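
The Zen unicast fallback can be sketched as a generated config fragment. This assumes Elasticsearch 1.x setting names; all addresses are placeholders. network.publish_host is set explicitly so the node advertises an address other hosts can reach rather than its container-private one:

```shell
# Print an elasticsearch.yml fragment that disables multicast discovery
# and lists the peers explicitly (Zen unicast).
# Arguments: the address this node publishes, then the peer addresses.
es_unicast_config() {
  publish_host=$1
  shift
  hosts=""
  for h in "$@"; do
    # Build a comma-separated list of quoted hosts.
    hosts="${hosts:+$hosts, }\"$h\""
  done
  printf 'network.publish_host: %s\n' "$publish_host"
  printf 'discovery.zen.ping.multicast.enabled: false\n'
  printf 'discovery.zen.ping.unicast.hosts: [%s]\n' "$hosts"
}

es_unicast_config 10.0.0.1 10.0.0.2 10.0.0.3
```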

Sigh

Surely, a simple Java app will be easier...

Java app server

java -Xmx?

Need to configure from the outside

The same problem applies to the external IP, external ports, etc.


Introspection

  • Get the container ID & read from the remote API
    • --cidfile + volume
    • cat /proc/self/cgroup | grep cpu: | sed -r 's#.*docker/(.{12}).*#\1#'
  • cat /sys/fs/cgroup/memory/memory.limit_in_bytes (1.8)
  • External daemon
  • Pass environment variables (no port info)
  • Wrap with script, docker inspect to file + volume

Postgres

Pretty much runs out of the box...

  • Minor issues with pgsql
  • No Unix domain sockets
  • docker exec to the rescue

Nginx

  • Top level LB cluster needs to use consistent ports
  • Nginx needs to find Java app servers
  • Needs to drain connections, so no immutable magic
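
Since Nginx has to stay up and drain connections, one approach is to regenerate its backend list and reload it in place. A minimal sketch; the upstream name and backend addresses are placeholders, and how the addresses are discovered is left open:

```shell
# Print an nginx "upstream" block for the given backend addresses.
# Arguments: upstream name, then one host:port per backend.
nginx_upstream() {
  upstream_name=$1
  shift
  printf 'upstream %s {\n' "$upstream_name"
  for backend in "$@"; do
    printf '    server %s;\n' "$backend"
  done
  printf '}\n'
}

nginx_upstream app_servers 10.0.0.5:32768 10.0.0.6:32771
```

Written to an included config file and followed by nginx -s reload, this lets the old worker processes finish their in-flight connections while new workers pick up the new backend list.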

Phew... Finally done deploying!

Now let's productionize

Production stuff

  • Debugging access
  • Monitoring & metrics
  • Isolation
  • Performance

Debugging our Java app servers

Shit, I can't connect to JMX

  • Connect via local or remote JMX
  • Local mode uses hsperf files
  • JMX uses RMI by default
  • Not NAT friendly

No easy way to run jconsole, jstack, etc

JVM tools

  • Use JMXMP instead of RMI
  • Java 7u4 adds -Dcom.sun.management.jmxremote.rmi.port (and NAT the port)
  • Use internal IP (icc=true)
  • SSH tricks (yuck)
  • docker exec

Mental exercise

What system metrics would you use for a container?

Metrics

  • docker stats
  • Collect from cgroups
  • iptables rules
  • veth counters (with ip netns exec)
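
Collecting from cgroups can be as simple as reading a few files. A rough sketch, assuming a cgroup v1 hierarchy with the cgroupfs driver (containers under .../docker/<full container ID>); other setups use different paths, and the metric names printed here are made up for the example:

```shell
# Print "name value" pairs for one container from the cgroup v1 hierarchy.
# Arguments: full container ID, optional cgroup mount point.
container_metrics() {
  cid=$1
  cg_root=${2:-/sys/fs/cgroup}
  # Current memory usage of the container, in bytes.
  printf 'memory.usage_bytes %s\n' \
    "$(cat "$cg_root/memory/docker/$cid/memory.usage_in_bytes")"
  # Total CPU time consumed by the container, in nanoseconds.
  printf 'cpu.usage_ns %s\n' \
    "$(cat "$cg_root/cpuacct/docker/$cid/cpuacct.usage")"
}
```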

Metrics

  • Old school metrics only for capacity
  • Container metrics depend on neighbors
  • Short tasks
  • Accounting and aggregation

Current monitoring tools incompatible

Example: graphite

  • Pre-allocate storage for every metric
  • Containers have random hostnames
  • Containers can be short-lived
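
One workaround is to name metrics after a stable role and slot instead of the container hostname, so the set of metric paths stays bounded. A sketch using Graphite's plaintext protocol (one "path value timestamp" line per metric); the docker.<role>.<slot> naming scheme is an assumption made for this example:

```shell
# Build one line of Graphite's plaintext protocol.
# Arguments: role, slot, metric name, value, optional Unix timestamp.
graphite_line() {
  role=$1
  slot=$2
  metric=$3
  value=$4
  timestamp=${5:-$(date +%s)}
  printf 'docker.%s.%s.%s %s %s\n' "$role" "$slot" "$metric" "$value" "$timestamp"
}

graphite_line appserver 1 memory.usage_bytes 1048576
```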

Beima shelcha

Liberal English translation: da faq

Write error: No space left on device

Isolation

  • No quotas with AUFS/OverlayFS, container can fill entire disk
  • Any container can fill host disk
  • Large logs, inodes
  • blkio limits (since 1.9)
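
Without quotas, the practical fallback is to watch the host disk. A small sketch that filters df -P output (or df -Pi for inodes) for mount points above a threshold; it assumes mount points without spaces:

```shell
# Read `df -P` (or `df -Pi`) output on stdin and print the mount point
# and usage of every filesystem at or above the threshold percentage.
over_threshold() {
  threshold=$1
  awk -v max="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # "95%" becomes "95"
      if (use + 0 >= max) print $6, $5
    }'
}
```

Typical use would be df -P /var/lib/docker | over_threshold 90 for space and df -Pi /var/lib/docker | over_threshold 90 for inodes.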

Isolation

  • Kernel is shared, resources (CPU, RAM, etc) are accounted and limited
  • Namespaces to limit visibility
  • Some resources are completely unprotected. E.g.:
    • (some) Kernel memory
    • /dev/random
  • User (UID) namespace support arrived in 1.10 but is not widely used (requires a recent kernel)

Isolation

:(){ :|: & };:

docker run --user=1100 --ulimit=nproc=1024

The nproc limit is per user (UID), not per container

Resolved in 1.11 by the pids cgroup
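
With the pids cgroup, the limit is set per container through the --pids-limit flag of docker run, and the kernel exposes the counters as files. A sketch for reading them; the cgroup directory is passed in because its location depends on the cgroup driver (on cgroup v1 with cgroupfs it is typically under /sys/fs/cgroup/pids/docker/, which is an assumption here):

```shell
# Report how close a cgroup is to its process limit as "current/max".
# pids.current and pids.max are files of the kernel's pids controller;
# pids.max holds the literal string "max" when no limit is set.
pids_usage() {
  cg_dir=$1
  current=$(cat "$cg_dir/pids.current")
  max=$(cat "$cg_dir/pids.max")
  printf '%s/%s\n' "$current" "$max"
}
```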

 
[26034.240339] Memory cgroup out of memory: Kill process 14760 (mem-hogger) score 997 or sacrifice child
[26034.240341] Killed process 14760 (mem-hogger) total-vm:137324kB, anon-rss:130432kB, file-rss:16kB

  • Memory limit enforced by swapping
  • If there is no swap, the OOM killer steps in
  • The OOM killer can be disabled, but then the process blocks when it hits the limit

Think about that for a moment
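
Cgroup OOM kills are easy to miss because they only show up in the kernel log. A small filter for the message format shown above (the wording differs between kernel versions, so the pattern is an assumption tied to this format):

```shell
# Read kernel log lines on stdin and print "PID name" for every
# process selected by the memory cgroup OOM killer.
oom_victims() {
  sed -n -E 's/.*Memory cgroup out of memory: Kill process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Typical use would be dmesg | oom_victims on the Docker host.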

Zombies!? WTF?????

  • PID namespaces
  • PID 1 in namespace is init
  • Needs to reap zombies
  • Also ignores signals from inside namespace

 

Since 1.11, docker-containerd-shim is init

Isolation

  • Use CFQ scheduler
  • Consider Device Mapper rather than AUFS/OverlayFS
  • Use updated kernel
  • Use ulimit
  • Map /dev/urandom to /dev/random
  • Plan container co-location
  • SELinux, AppArmor, seccomp
  • Use VMs when you want full isolation

Performance

Native performance & Isolation

Not exactly

Performance

  • By default, Docker uses NAT networking
  • Userland proxy for published ports
  • CoW storage (default: AUFS/OverlayFS/DM)
  • RAM accounting overhead

Performance

  • Use docker volumes for application data and logs
  • If you need top notch network performance, avoid NAT and port publishing

Docker clusters

Storage management

  • Worse than EC2 storage 7 years ago
  • Problematic for persistent apps
  • Can't "Run anywhere" without remote storage
  • Storage drivers!

Storage management

  • Use application level replication
  • Distributed file system
  • Daemon labels
  • Try out storage drivers

Network management (old)

  • NAT - no multicast, broadcast, back connections, multi port
  • iptables/userland proxy
  • Host mode uses host-local network (port collisions)
  • Shared container networks

Network management

  • Network drivers
  • Socketplane/OpenVSwitch
  • Weave
  • OpenStack integration (very young)
  • DIY
  • Consul DNS

Cluster/resource management

  • Provisioning manually is insane
  • Load balancing
  • Container restart (on another host)
  • Topology - failure domains, network
  • Local resources
  • Integration with storage and network services

Cluster/resource management

  • Mesos + Marathon/Aurora
  • Kubernetes
  • Docker Swarm
  • OpenStack, ECS, VIC
  • Flynn.io, Deis.io, Shipyard, Fleet, etc.

Summary

In production, Docker is hard to use on its own

Ecosystem

Applications should be container aware and friendly

Unmodified applications can run, but expect problems

Current status

  • 12 factor apps work best
  • With enough ecosystem and effort, some stateful apps too
  • Still need VMs

Docker is improving at a staggering rate

Practical tips

  • Use up-to-date kernel (at least 3.18)
  • Use up-to-date Docker
  • Use a cluster manager
  • Don't drink too much koolaid

Questions?

 

All the nasty stuff about docker that wasn't in the sales pitch and you had to learn the hard way
