Docker - The fine print
Avishai Ish-Shalom (@nukemberg)
Agenda
- Da faq is Docker
- Let's build a system!
- Docker clusters
What is this Docker thing
- Chroot, namespaces
- cgroups
- Layered FS, layered images
- Container API
What Docker can't provide
- Configuration management
- Scalability
- Security *
- Full process isolation
Docker for me*:
Uber omnibus package - with a local deployment daemon
Docker is not a VM
And you shouldn't use it like one
Let's build a production system!
- RabbitMQ cluster
- Postgres
- Java app server instances
- Nginx
- ELK
- Graphite
- 3rd party containers
First, RabbitMQ
- Start a RabbitMQ container with a data volume
- Stop it, then run a new container with the same data volume
- Our data is gone!!
Many databases use hostnames in storage format (directories mostly)
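RabbitMQ names its node rabbit@<hostname> and keeps its data in a directory named after the node, so a new container with a new autogenerated hostname looks at an empty directory. A minimal sketch of pinning the hostname follows. The hostname rabbit1, the host path /srv/rabbit1 and the image tag rabbitmq:3 are illustrative assumptions, and the run helper is a hypothetical dry-run wrapper, not part of Docker.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Start RabbitMQ with a fixed hostname so the node name and the data
# directory derived from it stay the same when the container is replaced.
# $1: hostname to pin (hypothetical, e.g. rabbit1)
# $2: host directory holding the data (hypothetical, e.g. /srv/rabbit1)
start_rabbit() {
  run docker run -d --name "$1" --hostname "$1" \
    -v "$2:/var/lib/rabbitmq" rabbitmq:3
}
```

Usage: DRY_RUN=1 start_rabbit rabbit1 /srv/rabbit1 prints the command without starting anything.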
RabbitMQ Cluster (still in staging)
We link the containers and...
Shit, I can't get the nodes to see each other
- EPMD (Erlang port mapper) requires consistent hostnames
- Docker assigns autogenerated private hostnames
- Docker link renders linkname as hostname
- icc=true anyone?
RabbitMQ Cluster (production)
- Can't put nodes on same host
- Docker networking isn't cross hosts yet
- epmd hates NAT
- Host mode networking -> port conflicts, bad isolation
Elasticsearch Cluster
- Zen multicast discovery doesn't work with icc=false
- Fallback to Zen unicast or icc=true
- But nodes need to be on different hosts....
- Oh shit, hostnames again!!!
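The unicast fallback can be sketched as a generated config fragment. The setting names below are the Elasticsearch 1.x ones as far as I recall them, so check them against your version; the function name and the addresses in the usage line are made up for illustration.

```shell
# Emit an elasticsearch.yml fragment that disables multicast discovery
# and lists the other nodes explicitly.
# $1: address this node should advertise (the host's address, not the
#     container's private one)
# $2...: addresses of the other cluster nodes
es_unicast_config() {
  publish_host=$1
  shift
  hosts=""
  for h in "$@"; do
    hosts="${hosts:+$hosts, }\"$h\""
  done
  printf 'discovery.zen.ping.multicast.enabled: false\n'
  printf 'discovery.zen.ping.unicast.hosts: [%s]\n' "$hosts"
  printf 'network.publish_host: %s\n' "$publish_host"
}
```

Usage: es_unicast_config 10.0.0.5 10.0.0.6 10.0.0.7 >> elasticsearch.yml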
Sigh
Surely, a simple Java app will be easier...
Java app server
java -Xmx?
Need to configure from the outside
Same problem for external IP, external ports, etc.
Introspection
- Get the container ID & read from the remote API
- --cidfile + volume
- cat /proc/self/cgroup |grep cpu:| sed -r 's#.*docker/(.{12}).*#\1#'
- cat /sys/fs/cgroup/memory/memory.limit_in_bytes (1.8)
- External daemon
- Pass environment variables (no port info)
- Wrap with script, docker inspect to file + volume
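Two of the introspection tricks above, wrapped as functions so they can be tried against a sample file. This is a sketch under assumptions: cgroup v1 paths with a /docker/ prefix, and a heap of 75% of the memory limit, which is an arbitrary illustrative ratio. When no limit is set the kernel reports a very large value, so cap the result yourself.

```shell
# Print the short (12 character) container ID found in a cgroup file.
# $1: cgroup file to read, defaults to this process's own
container_id() {
  sed -n -E 's#^[0-9]+:[^:]*:/docker/([0-9a-f]{12}).*#\1#p' \
    "${1:-/proc/self/cgroup}" | head -n 1
}

# Turn a memory limit in bytes into a heap size in megabytes.
# $1: limit in bytes (e.g. contents of memory.limit_in_bytes)
# $2: percent of the limit to give the heap, default 75 (an assumption)
heap_mb() {
  echo $(( $1 / 1024 / 1024 * ${2:-75} / 100 ))
}
```

Usage inside the container, with app.jar as a placeholder: java -Xmx"$(heap_mb "$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)")"m -jar app.jar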
Postgres
Pretty much runs out of the box...
- Minor issues with pgsql
- No Unix domain sockets
- docker exec to the rescue
Nginx
- Top level LB cluster needs to use consistent ports
- Nginx needs to find Java app servers
- Needs to drain connections, so no immutable magic
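One way to let Nginx find the app servers is to regenerate an upstream block whenever the set of containers changes and then reload, since a reload lets old workers finish their in-flight requests. A minimal sketch; the function name, the upstream name and the addresses are hypothetical, and where the backend list comes from (docker inspect, a registry, a cluster manager) is left open.

```shell
# Render an nginx upstream block.
# $1: upstream name
# $2...: backends as host:port
render_upstream() {
  name=$1
  shift
  printf 'upstream %s {\n' "$name"
  for backend in "$@"; do
    printf '    server %s;\n' "$backend"
  done
  printf '}\n'
}
```

Usage: render_upstream app 10.0.0.5:32768 10.0.0.6:32771 > /etc/nginx/conf.d/app.conf && nginx -s reload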
Phew... Finally done deploying!
Now let's productionize
Production stuff
- Debugging access
- Monitoring & metrics
- Isolation
- Performance
Debugging our Java app servers
Shit, I can't connect to JMX
- Connect via local or remote JMX
- Local mode uses hsperfdata files
- JMX uses RMI by default
- Not NAT friendly
No easy way to run jconsole, jstack, etc
JVM tools
- Use JMXMP instead of RMI
- Java 7u4 adds -Dcom.sun.management.jmxremote.rmi.port (and NAT the port)
- Use internal IP (icc=true)
- SSH tricks (yuck)
- docker exec
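The rmi.port option can be sketched like this: pin the JMX registry and the RMI server to the same port, publish that one port, and tell RMI to advertise an address clients can actually reach. The function name is hypothetical, and authentication and SSL are switched off only to keep the example short, which is not something to do in production.

```shell
# Print JVM flags that make remote JMX usable through a published port.
# $1: address clients connect to (the host's, not the container's)
# $2: port used for both the JMX registry and the RMI server
jmx_opts() {
  printf '%s ' \
    "-Dcom.sun.management.jmxremote" \
    "-Dcom.sun.management.jmxremote.port=$2" \
    "-Dcom.sun.management.jmxremote.rmi.port=$2" \
    "-Djava.rmi.server.hostname=$1" \
    "-Dcom.sun.management.jmxremote.authenticate=false" \
    "-Dcom.sun.management.jmxremote.ssl=false"
  printf '\n'
}
```

Usage: start the JVM with $(jmx_opts 192.0.2.10 9010), publish the port with -p 9010:9010, then point jconsole at 192.0.2.10:9010.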
Mental exercise
What system metrics would you use for a container?
Metrics
- docker stats
- Collect from cgroups
- iptables rules
- veth counters (with ip netns exec)
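Collecting from cgroups mostly means reading small files. The sketch below assumes the cgroup v1 layout with the cgroupfs driver, where each container gets a docker/<full id> directory under every controller; other cgroup drivers lay this out differently. The function name is made up.

```shell
# Print one raw cgroup statistic for a container.
# $1: controller, e.g. memory or cpuacct
# $2: full container ID
# $3: file to read, e.g. memory.usage_in_bytes or cpuacct.usage
# $4: cgroup root, defaults to /sys/fs/cgroup
cgroup_metric() {
  cat "${4:-/sys/fs/cgroup}/$1/docker/$2/$3"
}
```

Usage, with web as a placeholder container name: cgroup_metric memory "$(docker inspect -f '{{.Id}}' web)" memory.usage_in_bytes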
Metrics
- Old school metrics only for capacity
- Container metrics depend on neighbors
- Short tasks
- Accounting and aggregation
Current monitoring tools incompatible
Example: graphite
- Pre-allocate storage for every metric
- Containers have random hostnames
- Containers can be shortlived
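To see why this hurts: Graphite's plaintext protocol is one line of path, value and timestamp, and every new path gets its own pre-allocated file. If the path contains the container hostname, every short-lived container leaves a file behind. One possible workaround, sketched here as an assumption and not as the recommended fix, is to key metrics on a stable service name and give up per-container detail.

```shell
# Format one metric in Graphite's plaintext format.
# $1: stable prefix, e.g. environment and service name
# $2: metric name
# $3: value
# $4: Unix timestamp, defaults to now
graphite_line() {
  printf '%s.%s %s %s\n' "$1" "$2" "$3" "${4:-$(date +%s)}"
}
```

Usage, with a placeholder Graphite host and the usual plaintext port: graphite_line prod.appserver mem_bytes 1048576 | nc graphite.example.com 2003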
Beima shelcha
Liberal English translation: da faq
Write error: No space left on device
Isolation
- No quotas with AUFS/OverlayFS, container can fill entire disk
- Any container can fill host disk
- Large logs, Inodes
- blkio limits (since 1.9)
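Without quotas, the practical defense is to watch the host disk yourself, for both space and inodes. A small sketch that checks df output against a threshold; the function name and the 90% threshold in the usage line are arbitrary, and df -P -i assumes a df that supports the inode flag (GNU does).

```shell
# Read `df -P` (or `df -P -i`) output on stdin and fail when the
# filesystem on the first data line is at or above the threshold.
# $1: threshold in percent
check_usage() {
  awk -v limit="$1" 'NR == 2 {
    sub(/%/, "", $5)
    if ($5 + 0 >= limit + 0) {
      print "usage " $5 "% >= " limit "%"
      exit 1
    }
  }'
}
```

Usage: df -P /var/lib/docker | check_usage 90 for space, df -P -i /var/lib/docker | check_usage 90 for inodes.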
Isolation
- Kernel is shared, resources (CPU, RAM, etc) are accounted and limited
- Namespaces to limit visibility
- Some resources are completely unprotected. E.g.:
- (some) Kernel memory
- /dev/random
- UID namespace support in 1.10, not widely used (requires recent kernel)
Isolation
:(){ :|: & };:
docker run --user=1100 --ulimit=nproc=1024
NPROC Limit is not per container
Resolved in 1.11 - pids cgroup (--pids-limit)
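With Docker 1.11 and a kernel that has the pids cgroup controller (4.3 or later), the process count can be capped per container with --pids-limit instead of relying on the per-UID nproc limit. A sketch, reusing the user ID from the slide; the run helper is a hypothetical dry-run wrapper and the image name is a placeholder.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Start a container whose process count is capped by the pids cgroup.
# $1: maximum number of processes inside the container
# $2: image to run
start_limited() {
  run docker run -d --user 1100 --pids-limit "$1" "$2"
}
```

Usage: DRY_RUN=1 start_limited 100 busybox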
[26034.240339] Memory cgroup out of memory: Kill process 14760 (mem-hogger) score 997 or sacrifice child
[26034.240341] Killed process 14760 (mem-hogger) total-vm:137324kB, anon-rss:130432kB, file-rss:16kB
- Memory limit enforced by swapping
- If no swap memory, OOM killer
- Can disable OOM killer, but then process blocks on OOM
Think about that for a moment
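If you leave the OOM killer on, it helps to at least notice when it fires. A small sketch that pulls the victims out of kernel log lines shaped like the two above; the function name is made up, and the exact log wording varies between kernel versions.

```shell
# Read kernel log lines on stdin and print "<pid> <name>" for every
# process the OOM killer reports as killed.
oom_victims() {
  sed -n -E 's/.*Killed process ([0-9]+) \(([^)]*)\).*/\1 \2/p'
}
```

Usage: dmesg | oom_victims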
Zombies!? WTF?????
- PID namespaces
- PID 1 in namespace is init
- Needs to reap zombies
- Also ignores signals from inside namespace
Since 1.11 - docker-containerd-shim is init
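A quick way to find out whether your PID 1 is reaping anything is to count processes in state Z. The sketch below only parses ps output, so it can run on the host or through docker exec; it assumes a procps-style ps that understands -eo, which minimal images may not have. The function name is hypothetical.

```shell
# Read lines of "<stat> <pid> <command>" on stdin, as printed by
# `ps -eo stat=,pid=,comm=`, and print how many are zombies.
count_zombies() {
  awk '$1 ~ /^Z/ { n++ } END { print n + 0 }'
}
```

Usage: ps -eo stat=,pid=,comm= | count_zombies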
Isolation
- Use CFQ scheduler
- Consider Device Mapper rather than AUFS/OverlayFS
- Use updated kernel
- Use ulimit
- Map /dev/urandom to /dev/random
- Plan container co-location
- SELinux, AppArmor, seccomp
- Use VMs when you want full isolation
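A few of these tips in runnable form: check which I/O scheduler a disk uses, and build the docker run flags for a memory limit, an open-files ulimit and the /dev/urandom mapping. Function names, the device name and the limit values are illustrative assumptions.

```shell
# Print the active I/O scheduler, i.e. the name shown in [brackets].
# $1: scheduler file, e.g. /sys/block/sda/queue/scheduler
active_scheduler() {
  sed -n 's/.*\[\(.*\)].*/\1/p' "$1"
}

# Print docker run flags for a few of the isolation tips.
# $1: memory limit, e.g. 512m
# $2: open files ulimit as soft:hard, e.g. 1024:2048
isolation_flags() {
  printf '%s\n' "--memory $1 --ulimit nofile=$2 -v /dev/urandom:/dev/random"
}
```

Usage, with myimage as a placeholder: docker run -d $(isolation_flags 512m 1024:2048) myimage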
Performance
Native performance & Isolation
Not exactly
Performance
- By default, Docker uses NAT networking
- Userland proxy for published ports
- CoW storage (default: AUFS/OverlayFS/DM)
- RAM accounting overhead
Performance
- Use docker volumes for application data and logs
- If you need top notch network performance, avoid NAT and port publishing
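Both tips translate into docker run flags: -v keeps data and logs off the copy-on-write layer, and --net=host skips NAT and the userland proxy at the price of port conflicts and weaker isolation. A sketch; the container-side paths /data and /var/log/app are hypothetical, as is the dry-run helper.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# Start a container with volumes for data and logs and host networking.
# $1: host directory for application data
# $2: host directory for logs
# $3: image to run
start_fast() {
  run docker run -d --net=host -v "$1:/data" -v "$2:/var/log/app" "$3"
}
```

Usage: DRY_RUN=1 start_fast /srv/app/data /srv/app/logs myimage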
Docker clusters
Storage management
- Worse than EC2 storage 7 years ago
- Problematic for persistent apps
- Can't "run anywhere" without remote storage
- Storage drivers!
Storage management
- Use application level replication
- Distributed file system
- Daemon labels
- Try out storage drivers
Network management (old)
- NAT - no multicast, broadcast, back connections, multi port
- iptables/userland proxy
- Host mode uses host-local network (port collisions)
- Shared container networks
Network management
- Network drivers
- Socketplane/OpenVSwitch
- Weave
- OpenStack integration (very young)
- DIY
- Consul DNS
Cluster/resource management
- Provisioning manually is insane
- Load balancing
- Container restart (on another host)
- Topology - failure domains, network
- Local resources
- Integration with storage and network services
Cluster/resource management
- Mesos + Marathon/Aurora
- Kubernetes
- Docker Swarm
- OpenStack, ECS, VIC
- Flynn.io, Deis.io, Shipyard, Fleet, etc
Summary
In production, Docker is hard to use on its own
Eco-system
Applications should be container aware and friendly
Unmodified applications can run, but expect problems
Current status
- 12 factor apps work best
- With enough ecosystem and effort, some stateful apps too
- Still need VMs
Docker is improving at a staggering rate
Practical tips
- Use up-to-date kernel (at least 3.18)
- Use up-to-date Docker
- Use a cluster manager
- Don't drink too much koolaid
Questions?
All the nasty stuff about docker that wasn't in the sales pitch and you had to learn the hard way