Deploying Serverless Docker applications on AWS

Andrew Ang

Harvard IT Summit 2019
May 14, 2019

http://bit.ly/itsummit-fargate

Harvard VPAL Research Group

Learning about learning, at scale, using MOOCs ...

… and bringing research and technology innovations back to campus ...

… to create engaging learning experiences

Our use cases for dockerized applications in education research + technology

Itero: Writing history analytics for Google Docs

itero.vpal.io

Research and analytics with education data

Data pipelines

API scraper, web crawler, data ETL

Dashboards - Django admin panel for data pipeline diagnostics

Adaptive / Personalized learning

https://github.com/harvard-vpal/bridge-adaptivity/wiki

Adaptivity - service architecture

Typical dockerized web application architecture

Ideal service scaling

Deployment solution #1:

EC2 + docker-compose

Deployment solution #2a:

Elastic Beanstalk

Previous deployment solutions #2b:

Elastic Beanstalk

(Multicontainer Docker)

Previous deployment (attempt) #3:

ECS - Elastic Container Service

Fargate introduced in Nov 2017

“Fargate is like EC2 but instead of giving you a virtual machine you get a container.”

https://aws.amazon.com/blogs/aws/aws-fargate/

ECS launch types:

EC2 vs. Fargate

ECS launch types:

EC2 vs. Fargate

Workflow

Workflow:

Build containers and upload to registry

Workflow:

Pull containers from registry and run

Workflow:

Task definitions

Task: scalable unit of ECS

{
    "family": "web",
    "taskRoleArn": "arn:...",
    "networkMode": "awsvpc",
    "containerDefinitions": [
    {
      "name": "web",
      "image": "${web_image}",
      "essential": true,
      "memory": 256,
      "portMappings": [
        {
          "containerPort": 8000
        }
      ],
      "command": [
        "/usr/local/bin/gunicorn",
        "config.wsgi:application",
        "-w=2",
        "-b=:8000",
        "--log-file=-",
        "--access-logfile=-"
      ],
      "environment": [
        {"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
        {"name": "ENV_LABEL", "value": "${env_label}"},
        {"name": "HOST", "value": "${domain_name}"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "${log_group_name}",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
    {
      "name": "nginx",
      "image": "${nginx_image}",
      "essential": false,
      "memory": 256,
      "portMappings": [
        {
          "containerPort": 80
        }
      ],
      ...
    }
  ]
}

Tasks can have one or more containers.

Task instance: instantiation of a task

Default settings from task definition such as command, memory allocation, etc. can be overridden when instantiating a task.

Each task instantiation from a task definition has the same set of containers.

Service: an "auto scaling group" for tasks

Services are used for tasks that run indefinitely

(e.g. web service)

Cluster: a logical grouping of tasks/services

ECS abstractions

ECS term	description / analog
container definition	docker-compose
task definition	docker-compose + AWS config
task instance	instantiation of a task definition
service	auto-scaling group for task instances
cluster	grouping of tasks/services

Example application 1

web microservice

(adaptive learning recommendation engine)

Workflow

build with docker-compose
push image to ECR
define ECS tasks and services
setup additional infrastructure

Build images

e.g. `docker-compose build`

# docker-compose.yml
version: '2'
services:
  bridge:
    container_name: BFA
    build:
      context: .
      dockerfile: Dockerfile
    image: bridge_adaptivity
    command: bash -c "./prod_run.sh"
    volumes:
      - .:/bridge_adaptivity
      - static:/www/static
    ports:
      - "8000:8000"
    links:
      - postgres

  # Celery worker
  worker:
    image: bridge_adaptivity
    environment:
      DJANGO_SETTINGS_MODULE: config.settings.prod
    command: bash -c "sleep 5 && celery -A config worker -l info"
    volumes:
      - .:/bridge_adaptivity
    links:
      - rabbit
      - postgres
    depends_on:
      - bridge

  rabbit:
    container_name: rabbitmq
    image: rabbitmq
    env_file: ./envs/rabbit.env

  nginx:
    container_name: nginx_BFA
    build: ./nginx
    ports:
      - "80:80"
      - "443:443"
    volumes_from:
      - bridge
    volumes:
      - /etc/nginx/ssl/:/etc/nginx/ssl/
    links:
      - bridge

  postgres:
    container_name: postgresql_BFA
    image: postgres
    env_file: ./envs/pg.env
    volumes:
      - pgs:/var/lib/postgresql/data/
    ports:
      - "5432:5432"

Push images

e.g. `docker-compose push`

Task definition: web

{
    "family": "web",
    "taskRoleArn": "arn:...",
    "networkMode": "awsvpc",
    "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app",
      "essential": true,
      "memory": 256,
      "portMappings": [
        {
          "containerPort": 8000
        }
      ],
      "command": [
        "/usr/local/bin/gunicorn",
        "config.wsgi:application",
        "-w=2",
        "-b=:8000",
        "--log-file=-",
        "--access-logfile=-"
      ],
      "environment": [
        {"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
        {"name": "ENV_LABEL", "value": "${env_label}"},
        {"name": "HOST", "value": "${domain_name}"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "${log_group_name}",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      },
    {
      "name": "nginx",
      "image": "${nginx_image}",
      "essential": false,
      "memory": 256,
      "portMappings": [
        {
          "containerPort": 80
        }
      ],
      ...
    }
  ]
}

Task definition: queue

{
    "family": "web",
    "taskRoleArn": "arn:...",
    "networkMode": "awsvpc",
    "containerDefinitions": [
    [
      {
        "name": "rabbit",
        "image": "rabbitmq",
        "essential": true,
        "memory": 256,
        "environment": [
          {"name": "RABBITMQ_DEFAULT_PASS", "value": "${celery_password}"},
          {"name": "RABBITMQ_DEFAULT_USER", "value": "${celery_user}"}
        ],
        "logConfiguration": {
          "logDriver": "awslogs",
          "options": {
            "awslogs-group": "${log_group_name}",
            "awslogs-region": "us-east-1",
            "awslogs-stream-prefix": "ecs"
          }
        }
      }
    ]
  ]
}

Task definition: worker

{
  "family": "web",
  "taskRoleArn": "arn:...",
  "networkMode": "awsvpc",
  "containerDefinitions": [
    {
      "name": "worker",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app",
      "essential": true,
      "memory": 256,
      "command": ["celery","-A","config","worker","-l","info"],
      "environment": [
        {"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"}
      ],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "${log_group_name}",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Define services

Additional infrastructure

database, associated security group
task IAM role
ECS service discovery
application load balancer
route 53 zone

Example application #2:

Long-running (~ few days) web crawler that triggers on a scheduled basis, or in response to new records

Run scripts with finite execution time as tasks, instead of services

e.g. data processing jobs, api scraper, web crawler

Scheduling jobs with Airflow

Task dependencies expressed as a DAG (Directed Acyclic Graph)

Typical Airflow cluster setup

Scheduling ECS tasks with Airflow

DevOps considerations and streamlining deployment tasks

DevOps considerations

Secrets management
Access control
Application configuration
Versioning
Build/Deploy Automation
Cost

Secrets management:

Specify values from SSM Param Store in task definition to inject as environment variables (available Nov 2018)

{
    "family": "web",
    "taskRoleArn": "arn:...",
    "networkMode": "awsvpc",
    "containerDefinitions": [
      {
        "name": "web",
        "image": "${web_image}",
        ...
        "environment": [
          {"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
          {"name": "HOST", "value": "${domain_name}"}
        ],
        "secrets": [
          {
            "name": "SECRET_KEY",
            "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/app/${env_label}/SECRET_KEY"
          },
          {
            "name": "DATABASE_CONNECTION",
            "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/app/${env_label}/DATABASE_CONNECTION"
          }
        ]
      },
      ...
    ]
}

Secrets management:

Populate values in SSM Param Store; access control can be controlled via IAM

Access control:

Control access to AWS resources on startup with

Task Execution Role

Useful for controlling access to secrets, ECR repos

Security groups:

Define security group for task to be associated with

Useful for giving access to
ip-restricted databases

Access control:

Control access to AWS resources at task runtime with Task Role

Useful for controlling access to data sources / destinations

Versioning:

Tag ECR images with version at build/upload

Versioning:

Reference image tags in task definition

{
    "family": "app",
    "networkMode": "awsvpc",
    "containerDefinitions": [
    {
      "name": "web",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app:3.1.2",
      
      ...
    }
  ]
}

Build / deploy automation

Build
- build versioned docker image
- add additional config if applicable
Deploy
- Push version to image repo
- Create infrastructure
  - load balancer, security groups and policies, ECS task definitions and services
- Apply infrastructure changes
  - Deploy app (image) version
  - Scale up/down

docker-compose file for deploy builds

Alternate docker-compose file for versioned/custom builds -

Uses APP_TAG env variable and builds from github source

version: '3'
services:
  # base app image
  app_base:
    image: ${APP_IMAGE}:${APP_TAG}-base
    build:
      dockerfile: Dockerfile_opt
      # context: app_base/src/bridge_adaptivity  # if building from local version; ensure volume mount is configured in other docker-compose
      
      # build from github, using reference APP_TAG and bridge_adaptivity subdirectory
      context: https://github.com/harvard-vpal/bridge-adaptivity.git#${APP_TAG}:bridge_adaptivity
      
  # copy custom settings into base app image (see Dockerfile)
  app:
    build:
      context: app
      args:
        - APP_IMAGE=${APP_IMAGE}:${APP_TAG}-base
    image: ${APP_IMAGE}:${APP_TAG}
    environment:
      - DJANGO_SETTINGS_MODULE=config.settings.custom
  
  # custom nginx image build that collects static assets from app image and copies to nginx image
  nginx:
    build:
      context: nginx
      args:
        - APP_IMAGE=${APP_IMAGE}:${APP_TAG}
    image: ${NGINX_IMAGE}:${APP_TAG}

Extending an app image

Adding custom settings

# Dockerfile that derives from base app image and adds some custom settings

# Base app image:tag to use
ARG APP_IMAGE

FROM ${APP_IMAGE} as app

WORKDIR /bridge_adaptivity

# copy custom settings into desired location
COPY settings/custom.py config/settings/custom.py
COPY settings/collectstatic.py config/settings/collectstatic.py

# generate staticfiles.json even if app image is not serving static images directly
RUN python manage.py collectstatic -c --noinput --settings=config.settings.collectstatic

Custom image build

ECS-specific settings - a django example

# django custom settings (custom.py)

def get_ecs_task_ips():
    """
    Retrieve the internal ip address(es) for task, if running with AWS ECS and awsvpc networking mode
    """
    ip_addresses = []
    try:
        r = requests.get("http://169.254.170.2/v2/metadata", timeout=0.01)
    except requests.exceptions.RequestException:
        return []
    if r.ok:
        task_metadata = r.json()
        for container in task_metadata['Containers']:
            for network in container['Networks']:
                if network['NetworkMode'] == 'awsvpc':
                    ip_addresses.extend(network['IPv4Addresses'])
    return list(set(ip_addresses))

ecs_task_ips = get_ecs_task_ips()

if ecs_task_ips:
    # ALLOWED_HOSTS comes from config.settings.base
    ALLOWED_HOSTS.extend(ecs_task_ips)

http://bit.ly/django-ecs

Managing infrastructure state with Terraform

"Define infrastructure as code"

resource "aws_ecs_task_definition" "main" {
  family                = "${var.name}"
  container_definitions = "${var.container_definitions}"
  execution_role_arn = "${var.execution_role_arn}" # required for awslogs
  task_role_arn = "${var.role_arn}"
  network_mode = "awsvpc"
  memory = "${var.memory}"
  cpu = "${var.cpu}"
  requires_compatibilities = ["FARGATE"]
}

terraform workspace select dev

terraform apply -var-file="dev.tfvars"

Manage multiple environments (dev/stage/prod) with respective config values

ECS resources in Terraform

resource "aws_ecs_task_definition" "main" {
  family                = "${var.name}"
  container_definitions = "${var.container_definitions}"
  execution_role_arn = "${var.execution_role_arn}" # required for awslogs
  task_role_arn = "${var.role_arn}"
  network_mode = "awsvpc"
  memory = "${var.memory}"
  cpu = "${var.cpu}"
  requires_compatibilities = ["FARGATE"]
}

resource "aws_ecs_service" "main" {
  name            = "${var.name}"
  cluster         = "${var.cluster_name}"
  task_definition = "${aws_ecs_task_definition.main.arn}"
  desired_count   = "${var.count}"
  launch_type     = "FARGATE"

  load_balancer {
    target_group_arn = "${var.target_group_arn}"
    container_name   = "${var.load_balancer_container_name}"
    container_port   = "${var.load_balancer_container_port}"
  }

  network_configuration {
    subnets = ["${data.aws_subnet_ids.main.ids}"],
    security_groups = ["${var.security_group_id}"]
    assign_public_ip = true
  }
}

Other AWS resources in Terraform

route53 record, load balancer, target groups, ...

resource "aws_alb" "main" {
  name            = "${var.project}-${var.env_label}"
  subnets         = ["${data.aws_subnet_ids.main.ids}"]
  security_groups = ["${var.security_group_id}"]
}

resource "aws_alb_listener" "main" {
  load_balancer_arn = "${aws_alb.main.id}"
  port              = "443"
  protocol          = "HTTPS"
  ssl_policy        = "ELBSecurityPolicy-2016-08"
  certificate_arn   = "${var.ssl_certificate_arn}"

  default_action {
    type             = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "Service Temporarily Unavailable (ALB Default Action)"
      status_code  = "503"
    }
  }
}

resource "aws_route53_record" "main" {
  zone_id = "${data.aws_route53_zone.main.zone_id}"
  name    = "${var.domain_name}"
  type    = "A"

  alias {
    name                   = "${aws_alb.main.dns_name}"
    zone_id                = "${aws_alb.main.zone_id}"
    evaluate_target_health = false
  }
}

## Assumes only one service is being load balanced but may make sense to move these to service modules if not the case

resource "aws_alb_target_group" "main" {
  name_prefix = "${var.short_project_label}"  # using name_prefix instead of name used because of create_before_destroy option
  port        = "${var.container_port}"
  protocol    = "HTTP"
  target_type = "ip"  # required for use of awsvpc task networking mode
  vpc_id      = "${var.vpc_id}"

  health_check {
    path = "${var.health_check_path}"
  }

  # Resolves: (Error deleting Target Group: Target group is currently in use by a listener or a rule)
  lifecycle {
    create_before_destroy = true
  }

  # Resolves: The target group does not have an associated load balancer
  depends_on = ["aws_alb.main"]
}

resource "aws_alb_listener_rule" "main" {
  listener_arn = "${aws_alb_listener.main.arn}"

  action {
    target_group_arn = "${aws_alb_target_group.main.id}"
    type             = "forward"
  }
  condition {
    field = "path-pattern"
    values = ["*"]
  }
}

Terraform modules in ecs-app-utils repo

# creates load balancer, security group, route 53 records, and target groups

module "network" {
  source = "git::https://github.com/harvard-vpal/ecs-app-utils.git//terraform/network/public?ref=2.3.0"

  vpc_id = "${var.vpc_id}"
  ssl_certificate_arn = "${var.ssl_certificate_arn}"
  hosted_zone = "${var.hosted_zone}"
  domain_name = "${var.domain_name}"
  env_label = "${var.env_label}"
  project = "${var.project}"
  short_project_label = "${var.short_project_label}"
}

https://github.com/harvard-vpal/ecs-app-utils/tree/master/terraform

Available terraform modules in ecs-app-utils

https://github.com/harvard-vpal/ecs-app-utils/tree/master/terraform

execution role (IAM role)
network
- base
- public (base + open inbound security group)
service
- load balanced (e.g. web)
- discoverable (e.g. queue)
- generic (e.g. worker)

Container definitions are application-specific

Use of templating to pass in environment-specific variables or version tags

data "template_file" "container_definitions_web" {
  template = "${file("./container_definitions_web.tpl")}"

  vars {
    web_image = "${var.app_image}:${var.app_tag}"
    nginx_image = "${var.nginx_image}:${var.app_tag}"
    project = "${var.project}"
    env_label = "${var.env_label}"
    log_group_name = "${aws_cloudwatch_log_group.main.name}"
    DJANGO_SETTINGS_MODULE = "${var.DJANGO_SETTINGS_MODULE}"
    domain_name = "${var.domain_name}"
  }
}

module "web_service" {
  source = "git::https://github.com/harvard-vpal/ecs-app-utils.git//terraform/services/load_balanced?ref=3.2.0"

  vpc_id = "${var.vpc_id}"
  cluster_name = "${var.cluster_name}"
  role_arn = "${aws_iam_role.task.arn}"
  execution_role_arn = "${module.execution_role.arn}"
  security_group_id = "${aws_security_group.ecs_service.id}"
  name = "${var.project}-${var.env_label}-web"
  container_definitions = "${data.template_file.container_definitions_web.rendered}"
  target_group_arn = "${module.network.target_group_arn}"
  count = "${var.web_count}"
  cpu = 512
  memory = 1024
}

# container_definitions_web.tpl
[
  {
    "name": "web",
    "image": "${web_image}",
    "essential": true,
    "portMappings": [
      {
        "containerPort": 8000
      }
    ],
    "command": [
      "/usr/local/bin/gunicorn",
      "itero.wsgi:application",
      "-w=2",
      "-b=:8000",
      "--log-level=debug",
      "--log-file=-",
      "--access-logfile=-"
    ],
    "environment": [
      {"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
      {"name": "HOST", "value": "${domain_name}"}
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${log_group_name}",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    },
    "secrets": [
      {
        "name": "SECRET_KEY",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/SECRET_KEY"
      },
      {
        "name": "DATABASE_CONNECTION",
        "valueFrom": "arn:aws:ssm:us-east-1:361808764124:parameter/itero/${env_label}/DATABASE_CONNECTION"
      },
      {
        "name": "CELERY_BROKER_URL",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/CELERY_BROKER_URL"
      }
      {
        "name": "GOOGLE_PICKER_CLIENT_ID",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/GOOGLE_CLIENT_ID"
      },
      {
        "name": "GOOGLE_PICKER_APP_ID",
        "valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/common/GOOGLE_APP_ID"
      }
    ]
  },
  {
    "name": "nginx",
    "image": "${nginx_image}",
    "essential": false,
    "portMappings": [
      {
        "containerPort": 80
      }
    ],
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "${log_group_name}",
        "awslogs-region": "us-east-1",
        "awslogs-stream-prefix": "ecs"
      }
    }
  }
]

ecs-utils

CLI for common ecs build/deploy tasks

Supports image versioning and multiple environments (dev/stage/prod)

https://github.com/harvard-vpal/ecs-app-utils


# Checkout the app code with the specified version and tag image with that tag
deploy build --tag 1.0.0

# Push images with the specified tag to ECR repositories
deploy push --tag 1.0.0

# Run 'terraform apply' with specified image tag against 'dev' environment
deploy apply --tag 1.0.0 --env dev

# Build, push, and apply
deploy all --tag 1.0.0 --env dev

# Redeploy services (force restart of specific services, even if no config changes)
deploy redeploy --env dev web worker

Comparisions with other AWS services

Fargate vs EC2

Fargate	EC2
Easy scalability, automatic, built-in failure recovery, no need to think about instance provisioning More expensive Compute/memory capacity flexible within a limited range (max 4vCPU / 30 GB memory)	Build your own scaling, failure recovery, container orchestration (if using docker) If not using docker, dependency management/setup may be complex depending on library/system dependencies Cheaper More high-end options for compute/memory configurations

Fargate pricing

Fargate vs EC2 cost

Fargate can be a good option for use cases that are memory-limited (e.g. data processing/transformation in memory) vs upgrading to next ec2 tier

Fargate vs Lambda

Fargate	Lambda
High level of control of library and system dependencies with docker Slower startup time (30 sec - 1 min) Can use for long running applications	No docker support 50mb (zipped) deployment package size limit - barely enough for basic python data science stack (numpy / pandas / sklearn / statsmodel) 900 second execution time limit Fast startup time

Fargate vs Kubernetes

Fargate	Kubernetes
No need to consider instance provisioning Tight integration with other AWS resources - (task IAM roles, SSM secrets, Cloudwatch logs)	Abstractions are more complex (imo) EKS more expensive (need to run control plane - $0.2 / hour) Managing your own K8s cluster is complex Open-source; community plug-ins (e.g. canary deployments)

Thanks!

Questions

Slides:

https://bit.ly/itsummit-fargate

ecs-utils

https://github.com/harvard-vpal/ecs-app-utils

andrew_ang@harvard.edu