Scheduling Docker Containers With Nomad

What

Nomad is a tool for managing a cluster of machines and running applications on them. Nomad abstracts away machines and the location of applications; users declare what they want to run, and Nomad handles where and how to run it.

  • Docker Support
  • Operationally Simple
  • Multi-Datacenter & Multi-Region Support
  • Flexible Workloads
  • Built for Scale

Why

A single, declarative tool to schedule multiple Docker containers into Vagrant, VMware and AWS environments.

 

  • way simpler to set up than Kubernetes or Mesos
  • might be more capable than Docker Swarm
  • abstracts away Vagrant, ESXi and AWS
  • part of the Hashicorp portfolio
  • not an AWS-only solution
  • intelligent scheduling makes efficient use of resources
  • text-based descriptor can work in a CI pipeline
  • REST API for customized control
  • can mix Docker and JVM workloads

How

  • each EC2 instance runs a client
  • 3 tiny EC2 instances run the Nomad server cluster
  • each Git push produces a new Nomad descriptor
  • descriptor is pushed to the cluster
  • Nomad schedules the work accordingly
  • rinse and repeat
  • EC2 instances can be scheduled and auto-scaled

Architecture

Job

 A Job is a specification provided by users that declares a workload for Nomad. A Job is a form of desired state; the user is expressing that the job should be running, but not where it should be run. The responsibility of Nomad is to make sure the actual state matches the user desired state. A Job is composed of one or more task groups.

Task Group

A Task Group is a set of tasks that must be run together. For example, a web server may require that a log shipping co-process is always running as well. A task group is the unit of scheduling, meaning the entire group must run on the same client node and cannot be split.
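
A minimal sketch of that example in Nomad's HCL, assuming a hypothetical job name, datacenter, and images; because the two tasks share a task group, Nomad always places them on the same client node:

job "web-with-logging" {
    datacenters = ["my-datacenter"]

    group "web" {
        // The web server task.
        task "frontend" {
            driver = "docker"
            config {
                image = "nginx:latest"
            }
            resources {
                cpu = 250
                memory = 128
            }
        }

        // The log shipping co-process that always runs alongside the web server.
        task "log-shipper" {
            driver = "docker"
            config {
                image = "fluentd:latest"
            }
            resources {
                cpu = 100
                memory = 64
            }
        }
    }
}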

Driver

A Driver represents the basic means of executing your Tasks. Example Drivers include Docker, Qemu, Java, and static binaries.

Task 

A Task is the smallest unit of work in Nomad. Tasks are executed by drivers, which allow Nomad to be flexible in the types of tasks it supports. Tasks specify their driver, configuration for the driver, constraints, and resources required.

Client 

A Client of Nomad is a machine that tasks can be run on. All clients run the Nomad agent. The agent is responsible for registering with the servers, watching for any work to be assigned and executing tasks. The Nomad agent is a long lived process which interfaces with the servers.
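
A minimal sketch of a client agent configuration, with placeholder paths and server addresses; the agent is started with nomad agent -config pointing at a file like this and stays running as the long-lived process described above:

// Client agent configuration (placeholder values).
data_dir = "/var/lib/nomad"
datacenter = "my-datacenter"

client {
    enabled = true
    // Servers the agent registers with and watches for assigned work.
    servers = ["10.0.1.10:4647", "10.0.1.11:4647", "10.0.1.12:4647"]
}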

Allocation 

An Allocation is a mapping between a task group in a job and a client node. A single job may have hundreds or thousands of task groups, meaning an equivalent number of allocations must exist to map the work to client machines. Allocations are created by the Nomad servers as part of scheduling decisions made during an evaluation.

Evaluation 

Evaluations are the mechanism by which Nomad makes scheduling decisions. When either the desired state (jobs) or actual state (clients) changes, Nomad creates a new evaluation to determine if any actions must be taken. An evaluation may result in changes to allocations if necessary.

Server

Nomad servers are the brains of the cluster. There is a cluster of servers per region and they manage all jobs and clients, run evaluations, and create task allocations. The servers replicate data between each other and perform leader election to ensure high availability. Servers federate across regions to make Nomad globally aware.

Regions and Datacenters

Nomad models infrastructure as regions and datacenters. Regions may contain multiple datacenters. Servers are assigned to regions and manage all state for the region and make scheduling decisions within that region. Requests that are made between regions are forwarded to the appropriate servers. As an example, you may have a US region with the us-east-1 and us-west-1 datacenters, connected to the EU region with the eu-fr-1 and eu-uk-1 datacenters.
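
A sketch of how a job targets that example topology: it is managed by the US region's servers and may be placed in either of the eligible datacenters (names taken from the example above).

job "example" {
    region = "US"
    datacenters = ["us-east-1", "us-west-1"]

    // group and task definitions go here
}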

Bin Packing

Bin Packing is the process of filling bins with items in a way that maximizes the utilization of bins. This extends to Nomad, where the clients are "bins" and the items are task groups. Nomad optimizes resources by efficiently bin packing tasks onto client machines.

Consensus Protocol

Nomad uses a consensus protocol (Raft) to ensure that the nodes in the server cluster share the same view of the world.

 

  • sensitive to latency
  • one leader per region
  • possible to federate multiple regions
  • possible to run in "stale" mode and avoid leader election
  • 3 to 5 servers is the recommended cluster size
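
A sketch of the server-side agent configuration for the recommended three-node cluster, with placeholder values; the same file can be used on each server instance.

// Server agent configuration (placeholder values).
data_dir = "/var/lib/nomad"

server {
    enabled = true
    // Wait for three server agents to join before performing leader election.
    bootstrap_expect = 3
}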

Gossip Protocol

Nomad makes use of a single global WAN gossip pool that all servers participate in. Membership information provided by the gossip pool allows servers to perform cross region requests. The integrated failure detection allows Nomad to gracefully handle an entire region losing connectivity, or just a single server in a remote region. The gossip protocol is also used to detect servers in the same region to perform automatic clustering via the consensus protocol.

Scheduling

Scheduling is the process of assigning tasks from jobs to client machines. This process must respect the constraints declared in the job and optimize for resource utilization.

Task Drivers

Task drivers are used by Nomad clients to execute a task and provide resource isolation. By having extensible task drivers, Nomad has the flexibility to support a broad set of workloads across all major operating systems. 

  • Docker
  • Isolated Fork/Exec
  • Raw Fork/Exec
  • Java
  • Qemu
  • Rkt
  • Custom

Docker Task Driver

		task "redis" {
			# Use Docker to run the task.
			driver = "docker"

			# Configure Docker driver with the image
			config {
				image = "redis:latest"
				network_mode = "host"
				port_map {
					db = 6379
				}
			}

			service {
				name = "${TASKGROUP}-redis"
				tags = ["global", "cache"]
				port = "db"
				check {
					name = "alive"
					type = "tcp"
					interval = "10s"
					timeout = "2s"
				}
			}

			# We must specify the resources required for
			# this task to ensure it runs on a machine with
			# enough capacity.
			resources {
				cpu = 500 # 500 MHz
				memory = 256 # 256MB
				network {
					mbits = 10
					port "db" {
					}
				}
			}
		}

Nomad Commands (CLI)

Nomad is controlled via an easy-to-use command-line interface (CLI). Nomad ships as a single command-line application, nomad, which takes a subcommand such as "agent" or "status".

  • nomad agent
  • nomad init
  • nomad run
  • nomad status
  • nomad stop
  • nomad validate

HTTP API

The Nomad HTTP API is the primary interface for using Nomad; it is used to query the current state of the system as well as to modify it. The Nomad CLI itself invokes the HTTP API through a Go HTTP client.

 

  • jobs
  • nodes
  • allocations
  • evaluations
  • agent
  • regions
  • status

Job Descriptor (Job)

job "grouped-service" {
    // Controls if the entire set of tasks in the job must be placed atomically or if they can be scheduled incrementally.
    all_at_once = false

    // A list of datacenters in the region which are eligible for task placement. This must be provided, and does not have a default.
    datacenters = ["my-datacenter"]
    
    // Annotates the job with opaque metadata.
    meta {
        jobKey = "job value"
    }

    // Specifies the job priority which is used to prioritize scheduling and access to resources. Must be between 1 and 100 
    // inclusive, and defaults to 50.
    priority = 50

    // The region to run the job in, defaults to "global".
    region = "USA"

    // Specifies the job type and switches which scheduler is used. Nomad provides the service, system and batch schedulers, 
    // and defaults to service. 
    type = "service"

    // Restrict our job to only linux. We can specify multiple constraints as needed.
    constraint {
        attribute = "$attr.kernel.name"
        value = "linux"
    }

    // Specifies the task's update strategy. When omitted, rolling updates are disabled.
    update {
        // Specifies the number of tasks that can be updated at the same time.
        max_parallel = 1

        // Delay between sets of task updates, given as a time duration. If stagger is provided as an integer,
        // seconds are assumed. Otherwise the "s", "m", and "h" suffixes can be used, such as "30s".
        stagger = "30s"
    }

    // group definition goes here
}

Job Descriptor (Task Group)

    group "infrastructure-services" {
        // Specifies the number of instances of this task group that should be running. Must be positive; defaults to 1.
        count = 1

        // This can be provided multiple times to define additional constraints.
        constraint {
            attribute = "$attr.kernel.name"
            value = "linux"
        }

        // Specifies the restart policy to be applied to tasks in this group. If omitted, a default policy for
        // batch and non-batch jobs is used based on the job type.
        restart {
            interval = "5m"
            attempts = 10
            delay = "25s"
        }

        // Annotates the task group with opaque metadata.
        meta {
            taskKey = "group value"
        }

        // task definitions go here
    }

Job Descriptor (Task)

        task "mysql-server" {

            driver = "docker"

            constraint {
                distinct_hosts = true
            }

            config {
                image = "mysql:latest"
                labels {
                    realm = "Experiment"
                    managed-by = "Nomad"
                }
                privileged = false
                ipc_mode = "none"
                pid_mode = ""
                uts_mode = ""
                network_mode = "bridge"
                host_name = "mysql"
                dns_servers = ["8.8.8.8", "8.8.4.4"]
                dns_search_domains = ["kurron.org", "transparent.com"]
                port_map {
                    mysql = 3306
                }

                auth {
                    username = "dockerhub_user"
                    password = "dockerhub_password"
                    email = "bob@thebuilder.com"
                    server_address = "repository.kurron.org"
                }
            }

            service {
                name = "${JOB}-mysql"
                tags = ["experiment", "messaging"]
                port = "mysql"
                check {
                    type = "tcp"
                    interval = "30s"
                    timeout = "2s"
                }
            }
            env {
                MYSQL_ROOT_PASSWORD = "sa"
                MYSQL_USER = "mysql"
                MYSQL_PASSWORD = "mysql"
                MYSQL_DATABASE = "sushe"
            }
            resources {
                cpu = 500
                disk = 256
                iops = 10
                memory = 512
                network {
                    mbits = 100
                    port "mysql" {
                    }
                }
            }
            meta {
                taskKey = "task value"
            }
            kill_timeout = "30s"
        }

Deployment

  • 3-5 instances running Nomad servers
  • N number of instances running Nomad clients and Docker engines
  • N number of instances using the Nomad CLI to push jobs into the cluster
  • servers can be t2.nano instances
  • clients can be any size that can handle your workload

Workload Sequence

  1. job descriptor is created and defines the workload to be scheduled
  2. using the CLI or HTTP API, the descriptor is pushed to one of the servers
  3. Nomad examines the constraints of the job, determines which clients are capable of running the workload, and routes the work to a suitable client
  4. The client adds/replaces containers as needed to satisfy the workload
  5. Containers use service discovery to locate required services (Consul)

Service Discovery

  • Nomad adds/removes containers to adjust to the new workload
  • Nomad expects you to consult Consul to determine the coordinates of any dependent services
  • AL expects things to be in a well-known location: localhost:<some stable port>
  • This mismatch might be a non-starter
  • It is possible to have Nomad bind containers to static ports, which does not appear to affect updates
  • Port collisions will cause failures or relocation issues
  • You are expected to manage the Consul cluster, but you can deploy Consul agent containers as a "system" workload (sketched below)
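
A sketch of that last point, assuming a placeholder Consul image and join address; a job of type "system" runs one Consul agent container on every eligible client.

job "consul-agent" {
    datacenters = ["my-datacenter"]

    // The system scheduler places one instance on every eligible client node.
    type = "system"

    group "consul" {
        task "agent" {
            driver = "docker"
            config {
                image = "consul:latest"
                network_mode = "host"
                args = ["agent", "-retry-join=10.0.1.20"]
            }
            resources {
                cpu = 100
                memory = 128
            }
        }
    }
}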

Container Environment

The docker inspect output below shows a Redis container that was scheduled by Nomad; note the NOMAD_* environment variables and the port bindings injected by the driver.

[
{
    "Id": "f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee",
    "Created": "2016-01-26T21:40:17.969085813Z",
    "Path": "/entrypoint.sh",
    "Args": [
        "redis-server"
    ],
    "State": {
        "Status": "running",
        "Running": true,
        "Paused": false,
        "Restarting": false,
        "OOMKilled": false,
        "Dead": false,
        "Pid": 23459,
        "ExitCode": 0,
        "Error": "",
        "StartedAt": "2016-01-26T21:40:18.133205627Z",
        "FinishedAt": "0001-01-01T00:00:00Z"
    },
    "Image": "ce0116e4e7f549950db2e8ae2a306038153b3a2ad818de9c144323a751dd7922",
    "ResolvConfPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/resolv.conf",
    "HostnamePath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/hostname",
    "HostsPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/hosts",
    "LogPath": "/opt/docker/containers/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee-json.log",
    "Name": "/redis-c72a26d6-dba9-5f97-120a-3355a0945335",
    "RestartCount": 0,
    "Driver": "overlay",
    "ExecDriver": "native-0.2",
    "MountLabel": "",
    "ProcessLabel": "",
    "AppArmorProfile": "",
    "ExecIDs": null,
    "HostConfig": {
        "Binds": [
            "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/alloc:/alloc:rw,z",
            "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/redis:/local:rw,Z"
        ],
        "ContainerIDFile": "",
        "LxcConf": null,
        "Memory": 268435456,
        "MemoryReservation": 0,
        "MemorySwap": -1,
        "KernelMemory": 0,
        "CpuShares": 500,
        "CpuPeriod": 0,
        "CpusetCpus": "",
        "CpusetMems": "",
        "CpuQuota": 0,
        "BlkioWeight": 0,
        "OomKillDisable": false,
        "MemorySwappiness": null,
        "Privileged": false,
        "PortBindings": {
            "6379/tcp": [
                {
                    "HostIp": "10.0.2.15",
                    "HostPort": "6379"
                }
            ],
            "6379/udp": [
                {
                    "HostIp": "10.0.2.15",
                    "HostPort": "6379"
                }
            ]
        },
        "Links": null,
        "PublishAllPorts": false,
        "Dns": null,
        "DnsOptions": null,
        "DnsSearch": null,
        "ExtraHosts": null,
        "VolumesFrom": null,
        "Devices": null,
        "NetworkMode": "host",
        "IpcMode": "",
        "PidMode": "",
        "UTSMode": "",
        "CapAdd": null,
        "CapDrop": null,
        "GroupAdd": null,
        "RestartPolicy": {
            "Name": "",
            "MaximumRetryCount": 0
        },
        "SecurityOpt": null,
        "ReadonlyRootfs": false,
        "Ulimits": null,
        "LogConfig": {
            "Type": "json-file",
            "Config": {}
        },
        "CgroupParent": "",
        "ConsoleSize": [
            0,
            0
        ],
        "VolumeDriver": ""
    },
    "GraphDriver": {
        "Name": "overlay",
        "Data": {
            "LowerDir": "/opt/docker/overlay/ce0116e4e7f549950db2e8ae2a306038153b3a2ad818de9c144323a751dd7922/root",
            "MergedDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/merged",
            "UpperDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/upper",
            "WorkDir": "/opt/docker/overlay/f97f0dec72a8003e767624f36c11fbd1ca414a0c52b512187a601dddc8a432ee/work"
        }
    },
    "Mounts": [
        {
            "Source": "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/alloc",
            "Destination": "/alloc",
            "Mode": "rw,z",
            "RW": true
        },
        {
            "Source": "/tmp/NomadClient658170613/c72a26d6-dba9-5f97-120a-3355a0945335/redis",
            "Destination": "/local",
            "Mode": "rw,Z",
            "RW": true
        },
        {
            "Name": "fa2c1cc6450407594864b91645908d4b4d1fb683ce9ad40c569aba87995f6421",
            "Source": "/opt/docker/volumes/fa2c1cc6450407594864b91645908d4b4d1fb683ce9ad40c569aba87995f6421/_data",
            "Destination": "/data",
            "Driver": "local",
            "Mode": "",
            "RW": true
        }
    ],
    "Config": {
        "Hostname": "kal-el",
        "Domainname": "",
        "User": "",
        "AttachStdin": false,
        "AttachStdout": false,
        "AttachStderr": false,
        "ExposedPorts": {
            "6379/tcp": {},
            "6379/udp": {}
        },
        "Tty": false,
        "OpenStdin": false,
        "StdinOnce": false,
        "Env": [
            "NOMAD_IP=10.0.2.15",
            "NOMAD_PORT_db=6379",
            "NOMAD_ALLOC_DIR=/alloc",
            "NOMAD_TASK_DIR=/local",
            "NOMAD_MEMORY_LIMIT=256",
            "NOMAD_CPU_LIMIT=500",
            "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
            "REDIS_VERSION=2.8.23",
            "REDIS_DOWNLOAD_URL=http://download.redis.io/releases/redis-2.8.23.tar.gz",
            "REDIS_DOWNLOAD_SHA1=828fc5d4011e6141fabb2ad6ebc193e8f0d08cfa"
        ],
        "Cmd": [
            "redis-server"
        ],
        "Image": "redis:2.8",
        "Volumes": {
            "/data": {}
        },
        "WorkingDir": "/data",
        "Entrypoint": [
            "/entrypoint.sh"
        ],
        "OnBuild": null,
        "Labels": {}
    },
    "NetworkSettings": {
        "Bridge": "",
        "SandboxID": "7daaa3eacf3efea95f326319c17181667bd05b67074dcd11976ce1a20b27e202",
        "HairpinMode": false,
        "LinkLocalIPv6Address": "",
        "LinkLocalIPv6PrefixLen": 0,
        "Ports": {},
        "SandboxKey": "/var/run/docker/netns/default",
        "SecondaryIPAddresses": null,
        "SecondaryIPv6Addresses": null,
        "EndpointID": "",
        "Gateway": "",
        "GlobalIPv6Address": "",
        "GlobalIPv6PrefixLen": 0,
        "IPAddress": "",
        "IPPrefixLen": 0,
        "IPv6Gateway": "",
        "MacAddress": "",
        "Networks": {
            "host": {
                "EndpointID": "7ae1ce8c6558894da9f5cfd4b65f96a040c944c83d357aecc06b46709d3160ce",
                "Gateway": "",
                "IPAddress": "",
                "IPPrefixLen": 0,
                "IPv6Gateway": "",
                "GlobalIPv6Address": "",
                "GlobalIPv6PrefixLen": 0,
                "MacAddress": ""
            }
        }
    }
}
]

Pros and Cons

On The Pro Side

  • deployment target agnostic
  • simpler to set up than some other solutions
  • developer friendly
  • Hashicorp can be trusted
  • declarative
  • flexible
  • can be controlled via CLI or HTTP
  • workloads from different projects can be mixed together -- you don't need an AL cluster and a TLO cluster
  • batch workloads are intriguing -- Mold-E event replay, for example
  • rolling updates appear to work correctly

Pros and Cons

On The Con Side

  • failure diagnosis is challenging
  • almost requires a service discovery solution
  • depending on context, descriptors can be verbose -- almost requires a templating solution
  • unclear how/if we should specify required resources -- CPU, RAM and IOPS (the same can be said of ECS)
  • still pre-1.0
  • unclear how proxy-friendly Nomad is -- can we obtain the proper externally visible host name and port when constructing location URLs?
  • no load balancer support -- we have to roll our own ELB equivalent