Harvard IT Summit 2019
May 14, 2019
Learning about learning, at scale, using MOOCs ...
… and bringing research and technology innovations back to campus ...
… to create engaging learning experiences
API scraper, web crawler, data ETL
{
"family": "web",
"taskRoleArn": "arn:...",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "web",
"image": "${web_image}",
"essential": true,
"memory": 256,
"portMappings": [
{
"containerPort": 8000
}
],
"command": [
"/usr/local/bin/gunicorn",
"config.wsgi:application",
"-w=2",
"-b=:8000",
"--log-file=-",
"--access-logfile=-"
],
"environment": [
{"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
{"name": "ENV_LABEL", "value": "${env_label}"},
{"name": "HOST", "value": "${domain_name}"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
},
{
"name": "nginx",
"image": "${nginx_image}",
"essential": false,
"memory": 256,
"portMappings": [
{
"containerPort": 80
}
],
...
}
]
}
Tasks can have one or more containers.
Default settings from the task definition (command, memory allocation, etc.) can be overridden when instantiating a task.
Each task instantiation from a task definition has the same set of containers.
Services are used for tasks that run indefinitely (e.g. a web service)
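The per-run overrides mentioned above can be sketched as the request passed to boto3's `ecs.run_task`. This is a minimal sketch, not the speaker's code: the helper name `build_run_task_request`, the cluster name, and the subnet and security group IDs are placeholders chosen for illustration.

```python
def build_run_task_request(cluster, task_definition, container_name,
                           command, subnets, security_groups, memory=None):
    """Build keyword arguments for boto3's ecs.run_task with per-run overrides."""
    override = {"name": container_name, "command": list(command)}
    if memory is not None:
        # Hard memory limit (MiB) for this run only; the task definition is unchanged.
        override["memory"] = memory
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "overrides": {"containerOverrides": [override]},
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "ENABLED",
            }
        },
    }


request = build_run_task_request(
    cluster="example-cluster",                   # hypothetical cluster name
    task_definition="web",                       # task definition family
    container_name="web",                        # container whose command is overridden
    command=["python", "manage.py", "migrate"],  # replaces the default gunicorn command
    subnets=["subnet-00000000"],                 # hypothetical subnet ID
    security_groups=["sg-00000000"],             # hypothetical security group ID
)

# To actually start the task (requires AWS credentials):
#   import boto3
#   boto3.client("ecs").run_task(**request)
```

Every task started this way still contains the same set of containers; only the listed settings differ for that run.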
ECS term | description / analog |
---|---|
container definition | docker-compose |
task definition | docker-compose + AWS config |
task instance | instantiation of a task definition |
service | auto-scaling group for task instances |
cluster | grouping of tasks/services |
web microservice
(adaptive learning recommendation engine)
Workflow
e.g. `docker-compose build`
# docker-compose.yml
version: '2'
services:
  bridge:
    container_name: BFA
    build:
      context: .
      dockerfile: Dockerfile
    image: bridge_adaptivity
    command: bash -c "./prod_run.sh"
    volumes:
      - .:/bridge_adaptivity
      - static:/www/static
    ports:
      - "8000:8000"
    links:
      - postgres
  # Celery worker
  worker:
    image: bridge_adaptivity
    environment:
      DJANGO_SETTINGS_MODULE: config.settings.prod
    command: bash -c "sleep 5 && celery -A config worker -l info"
    volumes:
      - .:/bridge_adaptivity
    links:
      - rabbit
      - postgres
    depends_on:
      - bridge
  rabbit:
    container_name: rabbitmq
    image: rabbitmq
    env_file: ./envs/rabbit.env
  nginx:
    container_name: nginx_BFA
    build: ./nginx
    ports:
      - "80:80"
      - "443:443"
    volumes_from:
      - bridge
    volumes:
      - /etc/nginx/ssl/:/etc/nginx/ssl/
    links:
      - bridge
  postgres:
    container_name: postgresql_BFA
    image: postgres
    env_file: ./envs/pg.env
    volumes:
      - pgs:/var/lib/postgresql/data/
    ports:
      - "5432:5432"
e.g. `docker-compose push`
{
"family": "web",
"taskRoleArn": "arn:...",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "web",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app",
"essential": true,
"memory": 256,
"portMappings": [
{
"containerPort": 8000
}
],
"command": [
"/usr/local/bin/gunicorn",
"config.wsgi:application",
"-w=2",
"-b=:8000",
"--log-file=-",
"--access-logfile=-"
],
"environment": [
{"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
{"name": "ENV_LABEL", "value": "${env_label}"},
{"name": "HOST", "value": "${domain_name}"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
},
{
"name": "nginx",
"image": "${nginx_image}",
"essential": false,
"memory": 256,
"portMappings": [
{
"containerPort": 80
}
],
...
}
]
}
{
"family": "web",
"taskRoleArn": "arn:...",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "rabbit",
"image": "rabbitmq",
"essential": true,
"memory": 256,
"environment": [
{"name": "RABBITMQ_DEFAULT_PASS", "value": "${celery_password}"},
{"name": "RABBITMQ_DEFAULT_USER", "value": "${celery_user}"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
{
"family": "web",
"taskRoleArn": "arn:...",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "worker",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app",
"essential": true,
"memory": 256,
"command": ["celery","-A","config","worker","-l","info"],
"environment": [
{"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
}
database, associated security group
task IAM role
ECS service discovery
application load balancer
route 53 zone
Long-running (~ few days) web crawler that triggers on a scheduled basis, or in response to new records
Run scripts with finite execution time as tasks, instead of services
e.g. data processing jobs, api scraper, web crawler
Task dependencies expressed as a DAG (Directed Acyclic Graph)
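As a small illustration of expressing task dependencies as a DAG, the sketch below derives a valid run order with Python's standard-library `graphlib` (Python 3.9+). The task names are hypothetical, and the deck does not say which scheduler was actually used.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks that must finish first.
# Task names are hypothetical examples.
dependencies = {
    "scrape_api": set(),
    "crawl_web": set(),
    "transform": {"scrape_api", "crawl_web"},
    "load_warehouse": {"transform"},
}


def execution_order(deps):
    """Return one valid run order; raises graphlib.CycleError if deps contain a cycle."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
```

Each name in `order` could then be launched as a finite ECS task once its predecessors have completed.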
Specify values from SSM Param Store in task definition to inject as environment variables (available Nov 2018)
{
"family": "web",
"taskRoleArn": "arn:...",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "web",
"image": "${web_image}",
...
"environment": [
{"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
{"name": "HOST", "value": "${domain_name}"}
],
"secrets": [
{
"name": "SECRET_KEY",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/app/${env_label}/SECRET_KEY"
},
{
"name": "DATABASE_CONNECTION",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/app/${env_label}/DATABASE_CONNECTION"
}
]
},
...
]
}
Populate values in SSM Param Store; access can be controlled via IAM
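The `secrets` entries above follow a regular ARN pattern, so they can be generated rather than typed by hand. The following is a sketch under assumptions: the helper names are invented for illustration, and the region and account ID simply repeat the placeholder values shown in the task definition.

```python
def ssm_parameter_arn(region, account_id, path):
    """Build the ARN of an SSM Parameter Store parameter from its path."""
    return f"arn:aws:ssm:{region}:{account_id}:parameter/{path.lstrip('/')}"


def build_secrets(names, app, env_label, region="us-east-1", account_id="123456789"):
    """Return task-definition 'secrets' entries for parameters named <app>/<env_label>/<NAME>."""
    return [
        {"name": name,
         "valueFrom": ssm_parameter_arn(region, account_id, f"{app}/{env_label}/{name}")}
        for name in names
    ]


secrets = build_secrets(["SECRET_KEY", "DATABASE_CONNECTION"], app="app", env_label="dev")
```

Inside the container each entry appears as an ordinary environment variable, so application code can read it with `os.environ["SECRET_KEY"]`.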
Control access to AWS resources on startup with the Task Execution Role
Useful for controlling access to secrets, ECR repos
Define a security group for the task to be associated with
Useful for giving access to IP-restricted databases
Control access to AWS resources at task runtime with Task Role
Useful for controlling access to data sources / destinations
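To make the execution-role side concrete, here is a minimal sketch of an IAM policy document that lets the execution role read specific SSM parameters at startup. The function name is hypothetical. `ssm:GetParameters` is the action ECS needs for Parameter Store secrets; parameters encrypted with a customer-managed KMS key would additionally need `kms:Decrypt`, which is omitted here.

```python
import json


def execution_role_secrets_policy(parameter_arns):
    """Return an IAM policy document allowing reads of the given SSM parameters."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ssm:GetParameters"],
                "Resource": sorted(parameter_arns),
            }
        ],
    }


policy = execution_role_secrets_policy([
    "arn:aws:ssm:us-east-1:123456789:parameter/app/dev/SECRET_KEY",
    "arn:aws:ssm:us-east-1:123456789:parameter/app/dev/DATABASE_CONNECTION",
])
policy_json = json.dumps(policy, indent=2)  # JSON text suitable for attaching to a role
```

Permissions the application itself needs at runtime (for example reading from a data source) belong on the task role instead.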
Tag ECR images with version at build/upload
Reference image tags in task definition
{
"family": "app",
"networkMode": "awsvpc",
"containerDefinitions": [
{
"name": "web",
"image": "123456789.dkr.ecr.us-east-1.amazonaws.com/namespace/app:3.1.2",
...
}
]
}
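A small helper can keep image references consistent between the build step and the task definition. This is a sketch: the MAJOR.MINOR.PATCH tag format is an assumption based on the `3.1.2` example above, and the function name is hypothetical.

```python
import re

# Assumed tag format: MAJOR.MINOR.PATCH, as in the "3.1.2" example.
VERSION_TAG = re.compile(r"\d+\.\d+\.\d+")


def ecr_image_uri(account_id, region, repository, tag):
    """Return a fully qualified ECR image reference, rejecting unversioned tags."""
    if not VERSION_TAG.fullmatch(tag):
        raise ValueError(f"expected a MAJOR.MINOR.PATCH tag, got {tag!r}")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


image = ecr_image_uri("123456789", "us-east-1", "namespace/app", "3.1.2")
```

Rejecting tags such as `latest` means every deployed task definition points at an identifiable build.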
Alternate docker-compose file for versioned/custom builds:
uses the APP_TAG env variable and builds from GitHub source
version: '3'
services:
  # base app image
  app_base:
    image: ${APP_IMAGE}:${APP_TAG}-base
    build:
      dockerfile: Dockerfile_opt
      # context: app_base/src/bridge_adaptivity # if building from local version; ensure volume mount is configured in other docker-compose
      # build from github, using reference APP_TAG and bridge_adaptivity subdirectory
      context: https://github.com/harvard-vpal/bridge-adaptivity.git#${APP_TAG}:bridge_adaptivity
  # copy custom settings into base app image (see Dockerfile)
  app:
    build:
      context: app
      args:
        - APP_IMAGE=${APP_IMAGE}:${APP_TAG}-base
    image: ${APP_IMAGE}:${APP_TAG}
    environment:
      - DJANGO_SETTINGS_MODULE=config.settings.custom
  # custom nginx image build that collects static assets from app image and copies to nginx image
  nginx:
    build:
      context: nginx
      args:
        - APP_IMAGE=${APP_IMAGE}:${APP_TAG}
    image: ${NGINX_IMAGE}:${APP_TAG}
Adding custom settings
# Dockerfile that derives from base app image and adds some custom settings
# Base app image:tag to use
ARG APP_IMAGE
FROM ${APP_IMAGE} as app
WORKDIR /bridge_adaptivity
# copy custom settings into desired location
COPY settings/custom.py config/settings/custom.py
COPY settings/collectstatic.py config/settings/collectstatic.py
# generate staticfiles.json even if app image is not serving static images directly
RUN python manage.py collectstatic -c --noinput --settings=config.settings.collectstatic
ECS-specific settings - a django example
# django custom settings (custom.py)
import requests


def get_ecs_task_ips():
    """
    Retrieve the internal IP address(es) for the task, if running on AWS ECS
    with the awsvpc networking mode. Returns an empty list otherwise.
    """
    ip_addresses = []
    try:
        # ECS task metadata endpoint (v2); short timeout so non-ECS environments fail fast
        r = requests.get("http://169.254.170.2/v2/metadata", timeout=0.01)
    except requests.exceptions.RequestException:
        return []
    if r.ok:
        task_metadata = r.json()
        for container in task_metadata['Containers']:
            for network in container['Networks']:
                if network['NetworkMode'] == 'awsvpc':
                    ip_addresses.extend(network['IPv4Addresses'])
    # de-duplicate: containers in the same task share one network interface
    return list(set(ip_addresses))


ecs_task_ips = get_ecs_task_ips()
if ecs_task_ips:
    # ALLOWED_HOSTS comes from config.settings.base
    ALLOWED_HOSTS.extend(ecs_task_ips)
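The parsing step of the settings snippet can be exercised without a running task by separating it from the HTTP call. This is a sketch: the function name and the sample payload are illustrative, with the payload trimmed to the fields the parser reads.

```python
def extract_awsvpc_ips(task_metadata):
    """Collect the unique IPv4 addresses of awsvpc networks from task metadata."""
    ips = set()
    for container in task_metadata.get("Containers", []):
        for network in container.get("Networks", []):
            if network.get("NetworkMode") == "awsvpc":
                ips.update(network.get("IPv4Addresses", []))
    return sorted(ips)


# Illustrative payload: two containers in one task share the same private IP.
sample_metadata = {
    "Containers": [
        {"Name": "web",
         "Networks": [{"NetworkMode": "awsvpc", "IPv4Addresses": ["10.0.1.23"]}]},
        {"Name": "nginx",
         "Networks": [{"NetworkMode": "awsvpc", "IPv4Addresses": ["10.0.1.23"]}]},
    ]
}

task_ips = extract_awsvpc_ips(sample_metadata)
```

Adding the task's private IP to `ALLOWED_HOSTS` matters because load balancer health checks address the task by IP rather than by domain name, and Django otherwise rejects those requests.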
"Define infrastructure as code"
resource "aws_ecs_task_definition" "main" {
family = "${var.name}"
container_definitions = "${var.container_definitions}"
execution_role_arn = "${var.execution_role_arn}" # required for awslogs
task_role_arn = "${var.role_arn}"
network_mode = "awsvpc"
memory = "${var.memory}"
cpu = "${var.cpu}"
requires_compatibilities = ["FARGATE"]
}
terraform workspace select dev
terraform apply -var-file="dev.tfvars"
Manage multiple environments (dev/stage/prod) with respective config values
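The two commands above can be wrapped so that the workspace and the variable file always match. This is a minimal sketch; the function names and the `<env>.tfvars` naming convention are assumptions inferred from the `dev.tfvars` example.

```python
import subprocess


def terraform_commands(env):
    """Return the Terraform commands that apply configuration for one environment."""
    return [
        ["terraform", "workspace", "select", env],
        ["terraform", "apply", f"-var-file={env}.tfvars"],
    ]


def apply_environment(env, dry_run=True):
    """Run (or, with dry_run=True, only return) the commands for an environment."""
    commands = terraform_commands(env)
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # stop at the first failing command
    return commands


planned = apply_environment("dev")  # dry run: nothing is executed
```

Deriving both arguments from one `env` value avoids applying, say, `prod.tfvars` while the `dev` workspace is selected.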
resource "aws_ecs_service" "main" {
name = "${var.name}"
cluster = "${var.cluster_name}"
task_definition = "${aws_ecs_task_definition.main.arn}"
desired_count = "${var.count}"
launch_type = "FARGATE"
load_balancer {
target_group_arn = "${var.target_group_arn}"
container_name = "${var.load_balancer_container_name}"
container_port = "${var.load_balancer_container_port}"
}
network_configuration {
subnets = ["${data.aws_subnet_ids.main.ids}"]
security_groups = ["${var.security_group_id}"]
assign_public_ip = true
}
}
route53 record, load balancer, target groups, ...
resource "aws_alb" "main" {
name = "${var.project}-${var.env_label}"
subnets = ["${data.aws_subnet_ids.main.ids}"]
security_groups = ["${var.security_group_id}"]
}
resource "aws_alb_listener" "main" {
load_balancer_arn = "${aws_alb.main.id}"
port = "443"
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-2016-08"
certificate_arn = "${var.ssl_certificate_arn}"
default_action {
type = "fixed-response"
fixed_response {
content_type = "text/plain"
message_body = "Service Temporarily Unavailable (ALB Default Action)"
status_code = "503"
}
}
}
resource "aws_route53_record" "main" {
zone_id = "${data.aws_route53_zone.main.zone_id}"
name = "${var.domain_name}"
type = "A"
alias {
name = "${aws_alb.main.dns_name}"
zone_id = "${aws_alb.main.zone_id}"
evaluate_target_health = false
}
}
## Assumes only one service is being load balanced but may make sense to move these to service modules if not the case
resource "aws_alb_target_group" "main" {
name_prefix = "${var.short_project_label}" # name_prefix is used instead of name because of the create_before_destroy option
port = "${var.container_port}"
protocol = "HTTP"
target_type = "ip" # required for use of awsvpc task networking mode
vpc_id = "${var.vpc_id}"
health_check {
path = "${var.health_check_path}"
}
# Resolves: (Error deleting Target Group: Target group is currently in use by a listener or a rule)
lifecycle {
create_before_destroy = true
}
# Resolves: The target group does not have an associated load balancer
depends_on = ["aws_alb.main"]
}
resource "aws_alb_listener_rule" "main" {
listener_arn = "${aws_alb_listener.main.arn}"
action {
target_group_arn = "${aws_alb_target_group.main.id}"
type = "forward"
}
condition {
field = "path-pattern"
values = ["*"]
}
}
# creates load balancer, security group, route 53 records, and target groups
module "network" {
source = "git::https://github.com/harvard-vpal/ecs-app-utils.git//terraform/network/public?ref=2.3.0"
vpc_id = "${var.vpc_id}"
ssl_certificate_arn = "${var.ssl_certificate_arn}"
hosted_zone = "${var.hosted_zone}"
domain_name = "${var.domain_name}"
env_label = "${var.env_label}"
project = "${var.project}"
short_project_label = "${var.short_project_label}"
}
Use of templating to pass in environment-specific variables or version tags
data "template_file" "container_definitions_web" {
template = "${file("./container_definitions_web.tpl")}"
vars {
web_image = "${var.app_image}:${var.app_tag}"
nginx_image = "${var.nginx_image}:${var.app_tag}"
project = "${var.project}"
env_label = "${var.env_label}"
log_group_name = "${aws_cloudwatch_log_group.main.name}"
DJANGO_SETTINGS_MODULE = "${var.DJANGO_SETTINGS_MODULE}"
domain_name = "${var.domain_name}"
}
}
module "web_service" {
source = "git::https://github.com/harvard-vpal/ecs-app-utils.git//terraform/services/load_balanced?ref=3.2.0"
vpc_id = "${var.vpc_id}"
cluster_name = "${var.cluster_name}"
role_arn = "${aws_iam_role.task.arn}"
execution_role_arn = "${module.execution_role.arn}"
security_group_id = "${aws_security_group.ecs_service.id}"
name = "${var.project}-${var.env_label}-web"
container_definitions = "${data.template_file.container_definitions_web.rendered}"
target_group_arn = "${module.network.target_group_arn}"
count = "${var.web_count}"
cpu = 512
memory = 1024
}
# container_definitions_web.tpl
[
{
"name": "web",
"image": "${web_image}",
"essential": true,
"portMappings": [
{
"containerPort": 8000
}
],
"command": [
"/usr/local/bin/gunicorn",
"itero.wsgi:application",
"-w=2",
"-b=:8000",
"--log-level=debug",
"--log-file=-",
"--access-logfile=-"
],
"environment": [
{"name": "DJANGO_SETTINGS_MODULE", "value": "${DJANGO_SETTINGS_MODULE}"},
{"name": "HOST", "value": "${domain_name}"}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
},
"secrets": [
{
"name": "SECRET_KEY",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/SECRET_KEY"
},
{
"name": "DATABASE_CONNECTION",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/DATABASE_CONNECTION"
},
{
"name": "CELERY_BROKER_URL",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/CELERY_BROKER_URL"
},
{
"name": "GOOGLE_PICKER_CLIENT_ID",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/${env_label}/GOOGLE_CLIENT_ID"
},
{
"name": "GOOGLE_PICKER_APP_ID",
"valueFrom": "arn:aws:ssm:us-east-1:123456789:parameter/itero/common/GOOGLE_APP_ID"
}
]
},
{
"name": "nginx",
"image": "${nginx_image}",
"essential": false,
"portMappings": [
{
"containerPort": 80
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "${log_group_name}",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}
]
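The substitution that Terraform's `template_file` performs on the `.tpl` file can be approximated in Python with `string.Template`, which uses the same `${name}` placeholder syntax. This is only an analogy for checking a template locally, not how Terraform is implemented; the template here is a shortened stand-in and the variable values are invented.

```python
import json
from string import Template

# Shortened stand-in for container_definitions_web.tpl
TEMPLATE = """
[
  {
    "name": "web",
    "image": "${web_image}",
    "environment": [
      {"name": "HOST", "value": "${domain_name}"}
    ]
  }
]
"""


def render_container_definitions(template_text, variables):
    """Substitute ${...} placeholders, then parse the result to confirm it is valid JSON."""
    rendered = Template(template_text).substitute(variables)  # KeyError if a variable is missing
    return json.loads(rendered)                               # ValueError if the JSON is malformed


definitions = render_container_definitions(
    TEMPLATE,
    {"web_image": "example/app:1.0.0", "domain_name": "dev.example.com"},  # hypothetical values
)
```

Parsing the rendered text catches problems such as a missing comma between entries before the template reaches `terraform apply`.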
CLI for common ecs build/deploy tasks
Supports image versioning and multiple environments (dev/stage/prod)
# Checkout the app code with the specified version and tag image with that tag
deploy build --tag 1.0.0
# Push images with the specified tag to ECR repositories
deploy push --tag 1.0.0
# Run 'terraform apply' with specified image tag against 'dev' environment
deploy apply --tag 1.0.0 --env dev
# Build, push, and apply
deploy all --tag 1.0.0 --env dev
# Redeploy services (force restart of specific services, even if no config changes)
deploy redeploy --env dev web worker
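For readers who want to build a similar tool, the command shapes above map naturally onto `argparse` subcommands. This is a hypothetical sketch of the interface only; it is not the ecs-app-utils implementation and performs no build or deploy actions.

```python
import argparse

ENVIRONMENTS = ("dev", "stage", "prod")


def build_parser():
    """Create a parser mirroring the deploy subcommands shown above."""
    parser = argparse.ArgumentParser(prog="deploy")
    subcommands = parser.add_subparsers(dest="action", required=True)

    # build / push need only an image tag
    for name in ("build", "push"):
        sub = subcommands.add_parser(name)
        sub.add_argument("--tag", required=True)

    # apply / all need a tag and a target environment
    for name in ("apply", "all"):
        sub = subcommands.add_parser(name)
        sub.add_argument("--tag", required=True)
        sub.add_argument("--env", required=True, choices=ENVIRONMENTS)

    # redeploy restarts the named services in one environment
    sub = subcommands.add_parser("redeploy")
    sub.add_argument("--env", required=True, choices=ENVIRONMENTS)
    sub.add_argument("services", nargs="+")
    return parser


args = build_parser().parse_args(["redeploy", "--env", "dev", "web", "worker"])
```

Restricting `--env` to a fixed set of choices rejects typos before any infrastructure command runs.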
Fargate | EC2 |
---|---|
Easy scalability, automatic, built-in failure recovery, no need to think about instance provisioning. More expensive. Compute/memory capacity flexible within a limited range (max 4 vCPU / 30 GB memory). | Build your own scaling, failure recovery, container orchestration (if using Docker). If not using Docker, dependency management/setup may be complex depending on library/system dependencies. Cheaper. More high-end options for compute/memory configurations. |
Fargate can be a good option for use cases that are memory-limited (e.g. data processing/transformation in memory) vs upgrading to next ec2 tier
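Fargate only accepts specific CPU/memory pairings, so a sizing check can help when deciding whether a memory-bound job fits. The table below reflects the combinations documented around the time of this talk (up to 4 vCPU / 30 GB); treat it as an assumption and confirm against current AWS documentation, since limits may have changed.

```python
# CPU units (1024 = 1 vCPU) mapped to allowed memory values in MiB.
FARGATE_MEMORY_MIB = {
    256: (512, 1024, 2048),
    512: tuple(range(1024, 4096 + 1, 1024)),
    1024: tuple(range(2048, 8192 + 1, 1024)),
    2048: tuple(range(4096, 16384 + 1, 1024)),
    4096: tuple(range(8192, 30720 + 1, 1024)),
}


def is_valid_fargate_size(cpu, memory):
    """Return True if the cpu/memory pair is in the table above."""
    return memory in FARGATE_MEMORY_MIB.get(cpu, ())


def smallest_cpu_for_memory(memory):
    """Return the smallest CPU setting that allows the requested memory, or None."""
    for cpu in sorted(FARGATE_MEMORY_MIB):
        if memory in FARGATE_MEMORY_MIB[cpu]:
            return cpu
    return None
```

For example, the `cpu = 512`, `memory = 1024` pair used for the web service is valid, while a job needing 16 GB of memory must request at least 2 vCPU.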
Fargate | Lambda |
---|---|
High level of control of library and system dependencies with Docker. Slower startup time (30 sec - 1 min). Can be used for long-running applications. | No Docker support. 50 MB (zipped) deployment package size limit, barely enough for a basic Python data science stack (numpy / pandas / sklearn / statsmodels). 900-second execution time limit. Fast startup time. |
Fargate | Kubernetes |
---|---|
No need to consider instance provisioning. Tight integration with other AWS resources (task IAM roles, SSM secrets, CloudWatch logs). | Abstractions are more complex (imo). EKS is more expensive (need to run the control plane, $0.2 / hour). Managing your own K8s cluster is complex. Open-source; community plug-ins (e.g. canary deployments). |
Slides:
https://bit.ly/itsummit-fargate
ecs-utils
https://github.com/harvard-vpal/ecs-app-utils
andrew_ang@harvard.edu