Learn AWS The Hard Way

Valentino Volonghi, CTO

dialtone@adroll.com | @dialtone_

Resilience

Performance Marketing Platform

Site traffic, data, ML, actions

Customers Websites

Internet Exchanges

Tracking

Count

US-East

US-West

Eu-West

Ap-S

Ap-N

20 billion user interactions captured each day

70+ billion real time auctions per day

100TB new compressed data each day

100ms max latency per auction

5 AWS Regions

Have you ever committed your AWS Keys in your source code?

Hologram

Hologram exposes an imitation of the EC2 instance metadata service on developer workstations that supports the temporary credentials workflow. It is accessible via the same HTTP endpoint to calling SDKs, so your code can use the same process in both development and production. The keys that Hologram provisions are temporary, so EC2 access can be centrally controlled without direct administrative access to developer workstations.

https://github.com/AdRoll/Hologram

Holochrome

Holochrome is a Chrome extension that allows you to easily log in and switch between your AWS accounts using a single keystroke. It is built on top of the AWS instance metadata service and therefore encourages security best practices by completely removing the need for static, long-lived credentials. The AWS console session is granted the exact same permissions as the IAM role available via the metadata service.

https://github.com/Bridgewater/holochrome

$ hologram use user
Successfully got credentials for role 'user'

$ hologram use admin
User dialtone is not authorized to assume role arn:aws:iam::123:role/admin!

$ curl 169.254.169.254/latest/meta-data/iam/security-credentials/hologram-access
{"Code":"Success","LastUpdated":"2017-06-01T22:38:35Z","Type":"AWS-HMAC","AccessKeyId":"AWS_KEY","SecretAccessKey":"SECRET_KEY","Token":"TOKEN","Expiration":"2017-06-01T23:34:15Z"}

$ curl 169.254.169.254/latest/meta-data/instance-id
i-deadbeef

$ curl 169.254.169.254/latest/meta-data/placement/availability-zone
us-west-2x
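Because Hologram answers on the same endpoint as the real metadata service, client code only has to understand the JSON document shown above. A minimal Python sketch of that parsing step follows; the function names and the returned dictionary keys are illustrative assumptions, not part of Hologram or any AWS SDK (the SDKs already do this for you).

```python
import json
from datetime import datetime, timezone

# Sample body copied from the curl output above (placeholder values).
SAMPLE = (
    '{"Code":"Success","LastUpdated":"2017-06-01T22:38:35Z","Type":"AWS-HMAC",'
    '"AccessKeyId":"AWS_KEY","SecretAccessKey":"SECRET_KEY","Token":"TOKEN",'
    '"Expiration":"2017-06-01T23:34:15Z"}'
)


def parse_credentials(body: str) -> dict:
    """Parse a security-credentials document into a small dict (hypothetical helper)."""
    doc = json.loads(body)
    if doc.get("Code") != "Success":
        raise ValueError(f"metadata service returned Code={doc.get('Code')!r}")
    expiration = datetime.strptime(
        doc["Expiration"], "%Y-%m-%dT%H:%M:%SZ"
    ).replace(tzinfo=timezone.utc)
    return {
        "access_key_id": doc["AccessKeyId"],
        "secret_access_key": doc["SecretAccessKey"],
        "session_token": doc["Token"],
        "expiration": expiration,
    }


def is_expired(creds: dict, now: datetime) -> bool:
    """Temporary keys must be refreshed once 'now' reaches the expiration time."""
    return now >= creds["expiration"]
```

The expiry check is the part that matters: the keys are temporary, so a client has to be ready to fetch a fresh set from the same URL.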

Hologram

LDAP

AWS Security Token Service (STS)

How do you coordinate multiple regions?

Customer looking to generate conversions, by spending dollars on web traffic from anywhere in the world

Customer Budget

Potential Volume

Regions

Number of Boxes

$1000/d

100 QPS

4-700

Simple Split?

$1000

Us-East

Us-West

Eu-West

Ap-North

Ap-South

$500

c1.xl

$100/day

100 conns

1000 connections

Us-East

Us-West

Eu-West

Ap-North

Ap-South

500 conns

100 conns

Good Enough?

Enter c3.4xlarge

$1000

Us-East

Us-West

Eu-West

Ap-North

Ap-South

$500

c3.4xl

$100/day

0 conns

1000 connections

Us-East

Us-West

Eu-West

Ap-North

Ap-South

500 conns

0 conns

100 conns

400 conns

Simple Split doesn't work

Use TCP ELB with caution

Each connection is very long lived
One new c3.4xl was enough to handle thousands of connections
The first box to join the ELB would get most connections
Only that one box gets to spend its money
And a lot of other advertising performance considerations
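The imbalance described above can be reproduced with a toy simulation. This is only a sketch under stated assumptions: connections never close, new connections are handed out round-robin to whichever boxes are registered at that moment, and all timings are made up. It does not model how ELB actually routes traffic.

```python
def simulate_long_lived_connections(join_times, connection_times):
    """Count connections per box when connections never close.

    join_times[i] is the (made-up) time box i registers with the balancer.
    Each new connection goes round-robin to the boxes registered so far.
    """
    counts = [0] * len(join_times)
    rr = 0  # round-robin cursor
    for t in sorted(connection_times):
        live = [i for i, joined in enumerate(join_times) if joined <= t]
        if not live:
            continue  # nobody registered yet, connection is dropped
        counts[live[rr % len(live)]] += 1
        rr += 1
    return counts


# Box 0 joins first; boxes 1 and 2 join after most connections already exist.
counts = simulate_long_lived_connections(
    join_times=[0, 90, 90],
    connection_times=range(100),
)
print(counts)  # [94, 3, 3]
```

With an equal budget per box, only the first box has enough traffic to spend its share, which is the failure mode the slide describes.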

Global Pacing Coordinator: Banker

Global Eventually Consistent Counters
Let ML decide which box spends money
Details: https://goo.gl/UzrZBF
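One common way to build globally, eventually consistent counters is a grow-only counter (G-Counter), where each region only increments its own slot and merges take the per-region maximum. The sketch below illustrates that general technique; it is an assumption chosen for illustration, not a description of how Banker is implemented, and the class and region names are hypothetical.

```python
class GCounter:
    """Grow-only counter: one slot per node, merged with element-wise max."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node id -> amount spent by that node (e.g. in cents)

    def increment(self, amount=1):
        if amount < 0:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Max is commutative, associative and idempotent, so replicas converge
        # regardless of the order or number of times states are exchanged.
        for node, value in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), value)

    def value(self):
        return sum(self.counts.values())


us_east = GCounter("us-east")
eu_west = GCounter("eu-west")
us_east.increment(300)  # spend recorded locally, no cross-region call
eu_west.increment(200)
us_east.merge(eu_west)  # periodic sync
eu_west.merge(us_east)
print(us_east.value(), eu_west.value())  # 500 500
```

Between syncs each region sees a stale total, which is why a design like this needs some form of prediction or safety margin before the budget is exhausted.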

us-east

eu-west

...

Kinesis

us-west

Aggregator

Fetch every 10s

Sync every 10s

Quorum check

Spend Predictor

Design to not fail. Expect Failure.

Take inspiration from other fields
Adapt their solution to your problem
Automatic graceful degradation

How do you move data around?

20 Billion user interactions

100 TB logs compressed

4 Trillion log lines

Ever lost data because an instance died?

S3 as "data lake"

"Infinite" bandwidth
Common log line format
Common path format
Dedicated bucket for massive volume
Source of Truth for almost every database

Different data types

High Latency
Medium Latency
Low Latency

High Latency Data

S3 Replication with our helper for SLA
Moves the bulk of the files
Lifecycle rules to delete origin data
Small data loss in the system is ok
Massive volume
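A lifecycle rule that expires origin objects after replication can be expressed as a small configuration document. The sketch below builds one in the shape boto3's put_bucket_lifecycle_configuration expects; the prefix, retention period, rule ID and bucket name are all hypothetical, and the replication helper mentioned above is not shown.

```python
def expire_after_days(prefix: str, days: int, rule_id: str = "expire-origin-logs") -> dict:
    """Build an S3 lifecycle configuration that deletes objects under a prefix."""
    if days < 1:
        raise ValueError("days must be at least 1")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


config = expire_after_days(prefix="logs/", days=3)

# Applying it needs AWS credentials, so the call is left commented out:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-origin-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=config,
# )
```

The retention period should comfortably exceed the replication SLA, otherwise the origin copy can disappear before the main bucket has it.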

Mid Latency Data

Directly moved to main s3 bucket
<20 min latency is a requirement
Backup for low latency data
No data loss
Large volume

Low Latency Data

AWS Kinesis
1s latency is a requirement
OMFG for 1min latency needs
Small data loss in the system is ok
Medium volume (4B+ logs/day)
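For the Kinesis path, log lines are typically sent in batches, since PutRecords accepts at most 500 records per request. The sketch below only prepares the batches; the partition key choice and stream name are assumptions, and per-record or per-request size limits are not handled.

```python
def to_put_records_batches(lines, partition_key_of, max_records=500):
    """Group log lines into lists shaped like the Records argument of PutRecords."""
    batch = []
    for line in lines:
        batch.append(
            {"Data": line.encode("utf-8"), "PartitionKey": partition_key_of(line)}
        )
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:
        yield batch  # final, partially filled batch


# Sending needs AWS credentials, so the call is left commented out:
# import boto3
# kinesis = boto3.client("kinesis")
# for batch in to_put_records_batches(log_lines, partition_key_of=lambda l: l[:16]):
#     response = kinesis.put_records(StreamName="example-stream", Records=batch)
#     # PutRecords can partially fail; check response["FailedRecordCount"].
```

Since small data loss is acceptable on this path, a producer could choose to drop failed records rather than retry indefinitely; that is a policy decision the sketch leaves open.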

source region

main region

Kinesis

Instance

OMFG

Profiles

Batch

Presto

Reporting

High latency

Mid Latency

Low Latency

Offload data ASAP

Keeps failure profile low
Enables use of aggressive scaling

Moving is not the point

Data within reach
Multiple teams with different needs
They'll figure out how to move it

How do we process this data?

100 TB logs compressed

4 Trillion log lines

4 Trillion Events Per Day

http://TrailDB.io

Discrete Event DB

Polyglot

Before AWS Batch

AWS Batch is great but...

Still lacks storage scheduling

Waiting while downloading...

Download

Reads

Processing

Upload

Wait

Can we avoid the wait?

Could we use S3?

4Gbps bandwidth per instance, current limit

Almost infinite in aggregate

~10-100 ms latency
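Reading straight from S3 instead of downloading first usually means ranged GETs: fetch only the byte ranges a job needs, and fetch several in parallel to hide the per-request latency. A minimal sketch of the range arithmetic follows; the chunk size, bucket and key are hypothetical.

```python
def byte_ranges(object_size: int, chunk_size: int) -> list:
    """Split an object into HTTP Range header values (end offsets are inclusive)."""
    if object_size < 0 or chunk_size <= 0:
        raise ValueError("object_size must be >= 0 and chunk_size > 0")
    ranges = []
    start = 0
    while start < object_size:
        end = min(start + chunk_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
        start = end + 1
    return ranges


print(byte_ranges(10, 4))  # ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']

# Each value can be passed to a ranged read (needs AWS credentials):
# import boto3
# s3 = boto3.client("s3")
# part = s3.get_object(Bucket="example-bucket", Key="example-key", Range="bytes=0-3")
```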

https://goo.gl/Oyo1TM

Userfaults allow the implementation of on-demand paging from userland and more generally they allow userland to take control of various memory page faults, something otherwise only the kernel code could do.
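The idea of on-demand paging can be illustrated without the kernel. The class below is only a plain-Python analogy, not userfaultfd: a page is fetched the first time a read touches it and is served from memory afterwards. The class name, page size and fetch callback are hypothetical; in the setting above the callback would be something like a ranged S3 read.

```python
class LazyPages:
    """Fetch fixed-size pages on first access, then serve them from memory."""

    def __init__(self, size, page_size, fetch_page):
        self.size = size              # total length of the backing object
        self.page_size = page_size
        self.fetch_page = fetch_page  # callable: page index -> bytes
        self.pages = {}               # pages already resident
        self.faults = 0               # how many pages had to be fetched

    def _page(self, index):
        if index not in self.pages:   # the "page fault"
            self.faults += 1
            self.pages[index] = self.fetch_page(index)
        return self.pages[index]

    def read(self, offset, length):
        end = min(offset + length, self.size)
        out = bytearray()
        pos = offset
        while pos < end:
            index, within = divmod(pos, self.page_size)
            take = min(self.page_size - within, end - pos)
            out += self._page(index)[within:within + take]
            pos += take
        return bytes(out)


backing = bytes(range(256)) * 4  # stand-in for a remote object
lazy = LazyPages(len(backing), 256, lambda i: backing[i * 256:(i + 1) * 256])
data = lazy.read(250, 10)        # spans pages 0 and 1
print(lazy.faults)               # 2
```

Processing can start as soon as the first page arrives, which is the wait the preceding slides are trying to remove.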