Scheduling Docker Containers With Nomad

What

Nomad is a tool for managing a cluster of machines and running applications on them. Nomad abstracts away machines and the location of applications; users declare what they want to run, and Nomad handles where and how to run it.

  • Docker Support
  • Operationally Simple
  • Multi-Datacenter & Multi-Region Support
  • Flexible Workloads
  • Built for Scale

Why

A single, declarative tool to schedule multiple Docker containers into Vagrant, VMware and AWS environments.

 

  • way simpler to set up than Kubernetes or Mesos
  • might be more capable than Docker Swarm
  • abstracts away Vagrant, ESXi and AWS
  • part of the Hashicorp portfolio
  • not an AWS-only solution
  • intelligent scheduling makes efficient use of resources
  • text-based descriptor can work in a CI pipeline
  • REST API for customized control
  • can mix Docker and JVM workloads

How

  • each EC2 instance runs a client
  • 3 tiny EC2 instances run the Nomad server cluster
  • each Git push produces a new Nomad descriptor
  • descriptor is pushed to the cluster
  • Nomad schedules the work accordingly
  • rinse and repeat
  • EC2 instances can be scheduled and auto-scaled

Architecture

Job

A Job is a specification provided by users that declares a workload for Nomad. A Job is a form of desired state; the user is expressing that the job should be running, but not where it should be run. The responsibility of Nomad is to make sure the actual state matches the user's desired state. A Job is composed of one or more task groups.
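
For orientation, a bare skeleton showing the nesting of job, task group and task (a sketch only; complete descriptors appear later in this document):

    job "example" {
        datacenters = ["dc1"]

        # a job is composed of one or more task groups
        group "cache" {
            # each group contains one or more tasks
            task "redis" {
                driver = "docker"
                config {
                    image = "redis:latest"
                }
            }
        }
    }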

Task Group

A Task Group is a set of tasks that must be run together. For example, a web server may require that a log shipping co-process is always running as well. A task group is the unit of scheduling, meaning the entire group must run on the same client node and cannot be split.
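
A sketch of such a group, assuming hypothetical image names; because they belong to the same group, the web server and its log shipper are always placed on the same client node:

    group "web" {
        count = 2

        task "server" {
            driver = "docker"
            config {
                image = "nginx:latest"
            }
            resources {
                cpu = 250
                memory = 128
            }
        }

        # co-process that ships the web server's logs, scheduled alongside it
        task "log-shipper" {
            driver = "docker"
            config {
                image = "example/log-shipper:latest" # hypothetical image
            }
            resources {
                cpu = 100
                memory = 64
            }
        }
    }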

Driver

A Driver represents the basic means of executing your Tasks. Example Drivers include Docker, Qemu, Java, and static binaries.
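
For instance, a static binary could be run with the isolated fork/exec driver instead of Docker (a sketch; the command path is hypothetical):

    task "reporter" {
        driver = "exec"
        config {
            command = "/usr/local/bin/reporter" # hypothetical binary
        }
        resources {
            cpu = 100
            memory = 64
        }
    }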

Task 

A Task is the smallest unit of work in Nomad. Tasks are executed by drivers, which allow Nomad to be flexible in the types of tasks it supports. Tasks specify their driver, configuration for the driver, constraints, and resources required.

Client 

A Client of Nomad is a machine that tasks can be run on. All clients run the Nomad agent. The agent is responsible for registering with the servers, watching for any work to be assigned and executing tasks. The Nomad agent is a long lived process which interfaces with the servers.

Allocation 

An Allocation is a mapping between a task group in a job and a client node. A single job may have hundreds or thousands of task groups, meaning an equivalent number of allocations must exist to map the work to client machines. Allocations are created by the Nomad servers as part of scheduling decisions made during an evaluation.

Evaluation 

Evaluations are the mechanism by which Nomad makes scheduling decisions. When either the desired state (jobs) or actual state (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary.

Server

Nomad servers are the brains of the cluster. There is a cluster of servers per region and they manage all jobs and clients, run evaluations, and create task allocations. The servers replicate data between each other and perform leader election to ensure high availability. Servers federate across regions to make Nomad globally aware.

Regions and Datacenters

Nomad models infrastructure as regions and datacenters. Regions may contain multiple datacenters. Servers are assigned to regions and manage all state for the region and make scheduling decisions within that region. Requests that are made between regions are forwarded to the appropriate servers. As an example, you may have a US region with the us-east-1 and us-west-1 datacenters, connected to the EU region with the eu-fr-1 and eu-uk-1 datacenters.
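
In a job file this placement is expressed with the region and datacenters attributes (the names here are illustrative):

    job "geo-aware" {
        # servers in the US region manage this job
        region = "US"

        # eligible datacenters within that region
        datacenters = ["us-east-1", "us-west-1"]

        # group and task definitions go here
    }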

Bin Packing

Bin Packing is the process of filling bins with items in a way that maximizes the utilization of bins. This extends to Nomad, where the clients are "bins" and the items are task groups. Nomad optimizes resources by efficiently bin packing tasks onto client machines.

Consensus Protocol

Nomad uses a consensus protocol (Raft) to ensure that the nodes in the server cluster have the same view of the world; a minimal server configuration is sketched after the list below.

 

  • sensitive to latency
  • one leader per region
  • possible to federate multiple regions
  • possible to run in "stale" mode and avoid leader election
  • 3 to 5 servers is the recommended cluster size
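
A minimal server-agent configuration sketch, assuming an illustrative data directory and the three-server cluster described above:

    # server.hcl -- hypothetical file name
    data_dir = "/var/lib/nomad"
    region = "USA"

    server {
        enabled = true

        # hold the first leader election until three servers have joined,
        # matching the recommended 3-5 server cluster size
        bootstrap_expect = 3
    }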

Gossip Protocol

Nomad makes use of a single global WAN gossip pool that all servers participate in. Membership information provided by the gossip pool allows servers to perform cross region requests. The integrated failure detection allows Nomad to gracefully handle an entire region losing connectivity, or just a single server in a remote region. The gossip protocol is also used to detect servers in the same region to perform automatic clustering via the consensus protocol.

Scheduling

Scheduling is the process of assigning tasks from jobs to client machines, respecting the constraints declared in the job while optimizing for resource utilization.
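
For example (the image name and values are illustrative), the task below will only be placed on a Linux client with 500 MHz of CPU and 256 MB of memory to spare:

    task "api" {
        driver = "docker"

        # only Linux clients are eligible
        constraint {
            attribute = "$attr.kernel.name"
            value = "linux"
        }

        config {
            image = "example/api:latest" # hypothetical image
        }

        # the scheduler bin packs against these declared resources
        resources {
            cpu = 500
            memory = 256
        }
    }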

Task Drivers

Task drivers are used by Nomad clients to execute a task and provide resource isolation. By having extensible task drivers, Nomad has the flexibility to support a broad set of workloads across all major operating systems. 

  • Docker
  • Isolated Fork/Exec
  • Raw Fork/Exec
  • Java
  • Qemu
  • Rkt
  • Custom

Docker Task Driver

		task "redis" {
			# Use Docker to run the task.
			driver = "docker"

			# Configure Docker driver with the image
			config {
				image = "redis:latest"
				network_mode = "host"
				port_map {
					db = 6379
				}
			}

			service {
				name = "${TASKGROUP}-redis"
				tags = ["global", "cache"]
				port = "db"
				check {
					name = "alive"
					type = "tcp"
					interval = "10s"
					timeout = "2s"
				}
			}

			# We must specify the resources required for
			# this task to ensure it runs on a machine with
			# enough capacity.
			resources {
				cpu = 500 # 500 MHz
				memory = 256 # 256MB
				network {
					mbits = 10
					port "db" {
					}
				}
			}
		}

Nomad Commands (CLI)

Nomad is controlled via an easy-to-use command-line interface (CLI). Nomad is a single command-line application, nomad, which takes a subcommand such as "agent" or "status". A typical workflow is sketched after the list of subcommands below.

  • nomad agent
  • nomad init
  • nomad run
  • nomad status
  • nomad stop
  • nomad validate
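
A typical workflow, assuming a job file named example.nomad that defines a job called "example":

    # generate a skeleton job file, check it, then submit it to the cluster
    nomad init
    nomad validate example.nomad
    nomad run example.nomad

    # inspect the job and eventually stop it
    nomad status example
    nomad stop example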

HTTP API

The Nomad HTTP API is the primary interface for using Nomad, and is used to query the current state of the system as well as to modify it. The Nomad CLI itself invokes the HTTP API via the Go HTTP client. A sample query follows the list of endpoints below.

 

  • jobs
  • nodes
  • allocations
  • evaluations
  • agent
  • regions
  • status
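
For example, the list of registered jobs can be queried directly, assuming the default HTTP port of 4646:

    # list all registered jobs
    curl http://localhost:4646/v1/jobs

    # show a single job; the job name is illustrative
    curl http://localhost:4646/v1/job/grouped-service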

Job Descriptor

job "grouped-service" {
    // Controls if the entire set of tasks in the job must be placed atomically or if they can be scheduled incrementally.
    all_at_once = false

    // A list of datacenters in the region which are eligible for task placement. This must be provided, and does not have a default.
    datacenters = ["my-datacenter"]
    
    // Annotates the job with opaque metadata.
    meta {
        jobKey = "job value"
    }

    // Specifies the job priority which is used to prioritize scheduling and access to resources. Must be between 1 and 100 
    // inclusively, and defaults to 50.
    priority = 50

    // The region to run the job in, defaults to "global".
    region = "USA"

    // Specifies the job type and switches which scheduler is used. Nomad provides the service, system and batch schedulers, 
    // and defaults to service. 
    type = "service"

    // Restrict our job to only linux. We can specify multiple constraints as needed.
    constraint {
        attribute = "$attr.kernel.name"
        value = "linux"
    }

    // Specifies the task's update strategy. When omitted, rolling updates are disabled.
    update {
        // Specifies the number of tasks that can be updated at the same time.
        max_parallel = 1

        // Delay between sets of task updates, given as a time duration. If stagger is provided as an integer, seconds are
        // assumed. Otherwise the "s", "m", and "h" suffixes can be used, such as "30s".
        stagger = "30s"
    }

    // group definition goes here
}

Job Descriptor

    group "infrastructure-services" {
        // Specifies the number of instances of the task group that should be running. Must be positive, defaults to 1.
        count = 1

        // This can be provided multiple times to define additional constraints.
        constraint {
            attribute = "$attr.kernel.name"
            value = "linux"
        }

        // Specifies the restart policy to be applied to tasks in this group. If omitted, a default policy for
        // batch and non-batch jobs is used based on the job type.
        restart {
            interval = "5m"
            attempts = 10
            delay = "25s"
        }

        // Annotates the task group with opaque metadata.
        meta {
            taskKey = "group value"
        }

        // task definitions go here
    }

Job Descriptor

        task "mysql-server" {

            driver = "docker"

            constraint {
                distinct_hosts = true
            }

            config {
                image = "mysql:latest"
                labels {
                    realm = "Experiment"
                    managed-by = "Nomad"
                }
                privileged = false
                ipc_mode = "none"
                pid_mode = ""
                uts_mode = ""
                network_mode = "bridge"
                host_name = "mysql"
                dns_servers = ["8.8.8.8", "8.8.4.4"]
                dns_search_domains = ["kurron.org", "transparent.com"]
                port_map {
                    mysql = 3306
                }

                auth {
                    username = "dockerhub_user"
                    password = "dockerhub_password"
                    email = "bob@thebuilder.com"
                    server_address = "repository.kurron.org"
                }
            }

            service {
                name = "${JOB}-mysql"
                tags = ["experiment", "messaging"]
                port = "mysql"
                // TCP health check against the registered port.
                check {
                    type = "tcp"
                    interval = "30s"
                    timeout = "2s"
                }
            }
            env {
                MYSQL_ROOT_PASSWORD = "sa"
                MYSQL_USER = "mysql"
                MYSQL_PASSWORD = "mysql"
                MYSQL_DATABASE = "sushe"
            }
            resources {
                cpu = 500
                disk = 256
                iops = 10
                memory = 512
                network {
                    mbits = 100
                    port "mysql" {
                    }
                }
            }
            meta {
                taskKey = "task value"
            }
            kill_timeout = "30s"
        }

Deployment

  • 3-5 instances running Nomad servers
  • N number of instances running Nomad clients and Docker engines
  • N number of instances using the Nomad CLI to push jobs into the cluster
  • servers can be t2.nano instances
  • clients can be any size that can handle your workload (a client configuration sketch follows this list)
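
A minimal client-agent configuration sketch to pair with the server configuration shown earlier; the data directory and server addresses are illustrative:

    # client.hcl -- hypothetical file name
    data_dir = "/var/lib/nomad"
    region = "USA"

    client {
        enabled = true

        # Nomad servers to register with (default RPC port 4647); illustrative addresses
        servers = ["10.0.1.10:4647", "10.0.1.11:4647", "10.0.1.12:4647"]
    }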

Workload Sequence

  1. job descriptor is created and defines the workload to be scheduled
  2. using the CLI or HTTP API, the descriptor is pushed to one of the servers
  3. Nomad examines the constraints of the job, determines which clients are capable of running the workload, and routes the work to a suitable client
  4. The client adds/replaces containers as needed to satisfy the workload
  5. Containers use service discovery to locate required services (Consul)

Service Discovery

  • Nomad adds/removes containers to adjust to the new workload
  • Nomad expects you to consult Consul to determine the coordinates of any dependent services
  • AL expects things to be in a well-known location: localhost:<some stable port>
  • This mismatch might be a non-starter
  • It is possible to have Nomad bind containers to static ports, and this does not appear to interfere with updates (see the sketch after this list)
  • Port collisions will cause failures or relocation issues
  • You are expected to manage the Consul cluster but you can deploy Consul containers as a "system" workload
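
A sketch of binding a task to a static port rather than a dynamically assigned one, so consumers can rely on a well-known address (the port value is illustrative):

    resources {
        network {
            mbits = 10

            # reserve a fixed host port instead of letting Nomad pick one
            port "db" {
                static = 6379
            }
        }
    }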

Container Environment

The output of docker inspect for a Redis container scheduled by Nomad. Note the NOMAD_* environment variables (NOMAD_IP, NOMAD_PORT_db, NOMAD_ALLOC_DIR, NOMAD_TASK_DIR, NOMAD_MEMORY_LIMIT and NOMAD_CPU_LIMIT) injected into the container, and the /alloc and /local directories bind mounted from the Nomad client's allocation directory on the host.

[
{
    "Id": "f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee",
    "Created": "2016-01-26T21:40:17.969085813Z",
    "Path": "/entrypoint.sh",
    "Args": [
        "redis-server"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 23459,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2016-01-26T21:40:18.133205627Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "ce0116e4e7f549950db2e8ae2a306038153b3a2ad818de9c144323a751dd7922",
    "ResolvConfPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/resolv.conf",
    "HostnamePath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/hostname",
    "HostsPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/hosts",
    "LogPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee-json.log",
    "Name": "/redis-c72a26d6-dba9-5f97-120a-3355a0945335",
    "RestartCount": 0,
    "Driver": "overlay",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": [
            "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/alloc:/alloc:rw,z",
            "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/redis:/local:rw,Z"
        ],
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 268435456,
        "MemoryReservation": 0,
        "MemorySwap": -1,
        "KernelMemory": 0,
        "CpuShares": 500,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": {
            "6379/tcp": [
                {
                    "HostIp": "10.0.2.15",
                    "HostPort": "6379"
                }
            ],
            "6379/udp": [
                {
                    "HostIp": "10.0.2.15",
                    "HostPort": "6379"
                }
            ]
        },
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsOptions": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "host",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "overlay",
        "Data": {
            "LowerDir": "/opt/docker/overlay/ce0116e4e7f549950db2e8ae2a306038153b3a2ad818de9c144323a751dd7922/root",
            "MergedDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/merged",
            "UpperDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/upper",
            "WorkDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/work"
        }
    },
    "Mounts": [
        {
            "Source": "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/alloc",
            "Destination": "/alloc",
            "Mode": "rw,z",
            "RW": true
        },
        {
            "Source": "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/redis",
            "Destination": "/local",
            "Mode": "rw,Z",
            "RW": true
        },
        {
            "Name": "fa2c1cc6450407594864b91645908d4b4d1fb683ce9ad40c569aba87995f6421",
            "Source": "/opt/docker/volumes/fa2c1cc6450407594864b91645908d4b4d1fb683ce9ad40c569aba87995f6421/_data",
            "Destination": "/data",
            "Driver": "local",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "kal-el",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "6379/tcp": {},
            "6379/udp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "NOMAD_IP=10.0.2.15",
            "NOMAD_PORT_db=6379",
            "NOMAD_ALLOC_DIR=/alloc",
            "NOMAD_TASK_DIR=/local",
            "NOMAD_MEMORY_LIMIT=256",
            "NOMAD_CPU_LIMIT=500",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "REDIS_VERSION=2.8.23",
            "REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-2.8.23.tar.gz",
            "REDIS_DOWNLOAD_SHA1=828fc5d4011e6141fabb2ad6ebc193e8f0d08cfa"
        ],
        "Cmd": [
            "redis-server"
        ],
        "Image": "redis:2.8",
        "Volumes": {
            "/data": {}
        },
        "WorkingDir": "/data",
        "Entrypoint": [
            "/entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {}
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "7daaa3eacf3efea95f326319c17181667bd05b67074dcd11976ce1a20b27e202",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {},
        "SandboxKey": "/var/run/docker/netns/default",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 0,
        "IPv6Gateway": "",
        "MacAddress": "",
        "Networks": {
            "host": {
                "EndpointID": "7ae1ce8c6558894da9f5cfd4b65f96a040c944c83d357aecc06b46709d3160ce",
                "Gateway": "",
                "IPAddress": "",
                "IPPrefixLen": 0,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": ""
            }
        }
    }
}
]

Pros and Cons

On The Pro Side

  • deployment target agnostic
  • simpler to set up than some other solutions
  • developer friendly
  • Hashicorp can be trusted
  • declarative
  • flexible
  • can be controlled via CLI or HTTP
  • workloads from different projects can be mixed together -- you don't need an AL cluster and a TLO cluster
  • batch workloads are intriguing -- Mold-E event replay, for example
  • rolling updates appear to work correctly

Pros and Cons

On The Con Side

  • failure diagnosis is challenging
  • almost requires a service discovery solution
  • depending on context, descriptors can be verbose -- almost requires a templating solution
  • unclear how/if we should specify required resources -- CPU, RAM and IOPS (the same can be said of ECS)
  • still pre-1.0
  • unclear how proxy-friendly Nomad is -- can we obtain the proper externally visible host name and port when constructing location URLs?
  • no load balancer support -- we have to roll our own ELB equivalent

Docker Scheduling With Nomad

By Ronald Kurr

How Nomad can be used to schedule Docker containers in on-premises and cloud environments.