Using observability to scale AWS Lambda

bene@theodo.co.uk
Ben Ellerby

@EllerbyBen

http://serverless-transformation.com/

https://www.theodo.co.uk/experts/serverless

Alex White

@agwhi_

Serverless

What is this Serverless thing?

  • Architectural movement
    • “allows you to build and run applications and services without thinking about servers” — AWS
    • Developers send application code which is run by the cloud provider in isolated containers abstracted from the developer.
    • Use of third-party services to manage backend logic and state (e.g. Firebase, Cognito)
  • A framework with the same name

 

Why Serverless?

💰 Cost reduction

👷‍♂️ #NoOps... well LessOps

💻 Developers focus on delivering business value

📈 More scalable

🌳 Greener

Not just Lambda (FaaS)

  • Lambda: Compute
  • S3: Storage
  • DynamoDB: Data
  • API Gateway: API Proxy
  • Cognito: Auth
  • SQS: Queue
  • Step Functions: Workflows
  • EventBridge: Bus

Power and Flexibility to build...

Optimising Lambda during Development

Nathan Malishev

https://levelup.gitconnected.com/aws-lambda-cold-start-language-comparisons-2019-edition-%EF%B8%8F-1946d32a0244

Improving performance

  • Reduce cold starts
  • Power tuning
  • Architecture/code

Cold Starts

  • Code hasn't been executed in a while
  • Scaling up
  • Rebalancing across availability zones
  • Updating code/config flushes warm containers

Improving Cold Starts

Frequency

Duration

Provisioned Concurrency

Duration

Measuring with X-Ray

Cold start times

  • Package size
  • Runtime
  • Amount of code
  • Amount of initialisation work

Reducing Duration

  • Avoid monolithic functions
  • Minify code
    • Webpack
  • Optimise imports
    • Only import the parts of the library you're using
    • Lazy-load dependencies that might not be used
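
One way to defer initialisation work is to wrap an expensive dependency in a lazy getter, so it is only built by the first invocation that needs it. A minimal TypeScript sketch: `lazy`, `getPdfRenderer` and `handler` are hypothetical names, and the renderer stands in for any heavy module or client.

```typescript
// Wrap a factory so it runs at most once, on first use.
function lazy<T>(factory: () => T): () => T {
  let cached: { value: T } | undefined;
  return () => {
    if (cached === undefined) {
      cached = { value: factory() };
    }
    return cached.value;
  };
}

// Hypothetical heavy dependency; the counter records how often it is built.
let initCount = 0;
const getPdfRenderer = lazy(() => {
  initCount += 1; // imagine an expensive import or client construction here
  return { render: (text: string): string => `<pdf>${text}</pdf>` };
});

// Hypothetical handler: only the "export" path pays the initialisation cost.
function handler(event: { action: string; body: string }): string {
  if (event.action === "export") {
    return getPdfRenderer().render(event.body);
  }
  return event.body;
}
```

Invocations that never take the "export" path skip the initialisation entirely, and warm invocations reuse the cached instance.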

HTTP Keep-Alive

  • Reuse TCP connections between requests
  • Can reduce a DynamoDB operation from 30ms to 10ms
  • Easy to set up

Power Tuning

Memory = Power

Power Tuning

https://github.com/alexcasalboni/aws-lambda-power-tuning

Power Tuning

  • Data-driven cost and performance optimisation
  • Available as an AWS Serverless Application Repository app
  • Can integrate with CI/CD
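
The cost side of the trade-off can be sketched with simple arithmetic: Lambda compute is billed on memory multiplied by duration. The TypeScript below is a rough illustration, not the tool's own logic. The price constant is an assumed on-demand x86 rate that should be checked against current pricing, the per-request charge is ignored, and `invocationCost` and `cheapest` are hypothetical helper names.

```typescript
// Assumed on-demand price per GB-second (x86); verify against current AWS pricing.
const PRICE_PER_GB_SECOND = 0.0000166667;

// Compute cost of one invocation, ignoring the per-request charge.
function invocationCost(memoryMb: number, durationMs: number): number {
  const gigabytes = memoryMb / 1024;
  const seconds = durationMs / 1000;
  return gigabytes * seconds * PRICE_PER_GB_SECOND;
}

interface Measurement {
  memoryMb: number;
  averageDurationMs: number;
}

// Pick the cheapest configuration from a non-empty list of measurements.
function cheapest(measurements: Measurement[]): Measurement {
  return measurements.reduce((best, candidate) =>
    invocationCost(candidate.memoryMb, candidate.averageDurationMs) <
    invocationCost(best.memoryMb, best.averageDurationMs)
      ? candidate
      : best
  );
}
```

If a CPU-bound function halves its duration when memory doubles, `invocationCost` returns the same value for both settings, so the extra speed comes at no extra compute cost.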

Input

Output

CPU-bound example

Architecture/Code

Distributed Tracing

Common mistakes

  • Fetching more data than you need
  • Not using related services well
    • Scans in DynamoDB
  • Defaulting to synchronous execution

Moving to Async

Sync

  • You pay while your lambda waits
  • Downstream slowdown affects the lambda
  • Needs custom code for error handling and retries

Async

  • Minimizes cost of waiting
  • Queueing separates fast and slow processes
  • Managed services provide reliability features 

Do you need a lambda?

  • Move orchestration out of your lambdas
    • Avoid paying for idle time
    • Use Step Functions
  • Move data transport out of functions
    • "Transform not transport"
    • Use VTL when possible
      • Access DynamoDB directly

Parallelise Code 

  • Make use of promises (in Node.js) to parallelise processing
  • AVX2 support announced at re:Invent 2020
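
As a small sketch of the promise-based approach, the TypeScript below contrasts awaiting lookups one at a time with starting them all and awaiting them together via `Promise.all`. `fetchItem` is a hypothetical stand-in for a database or HTTP call. Promises overlap time spent waiting on I/O rather than spreading CPU work across cores.

```typescript
// Hypothetical lookup standing in for a database or HTTP call.
async function fetchItem(id: string): Promise<string> {
  await Promise.resolve(); // yields to the event loop, as real I/O would
  return `item-${id}`;
}

// Sequential: each lookup waits for the previous one to finish.
async function fetchSequentially(ids: string[]): Promise<string[]> {
  const items: string[] = [];
  for (const id of ids) {
    items.push(await fetchItem(id));
  }
  return items;
}

// Parallel: start every lookup first, then wait for all of them together.
// Results keep the order of the input ids.
async function fetchInParallel(ids: string[]): Promise<string[]> {
  return Promise.all(ids.map((id) => fetchItem(id)));
}
```

With real I/O, the sequential version takes roughly the sum of the individual latencies, while the parallel version takes roughly the slowest one. `Promise.all` rejects as soon as any lookup fails, so partial-failure handling needs a separate decision.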

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

- Donald Knuth

Conclusion

Cold Starts

  • AWS optimises the first part for us
  • What we can change
  • Provisioned Concurrency

Power tuning

  • Memory = Power
  • AWS Lambda Power Tuning
  • Cost vs Speed

Architecture/Code

  • Move to async
  • You don't always need a lambda
  • Use parallelisation

Load Testing

Microservices

How does Serverless Scale Differently?

  • Pay-per-use can lead to denial of wallet
  • AWS Service Limits (both hard and soft)
  • Outpace non-serverless components / third parties
  • Cold start impacts for sudden spikes
  • Combination of multiple services in a distributed system makes bottlenecks harder to spot 
  • Regional distribution of traffic

What is Load Testing?

  • Different things to different people.

Simulating different concurrent traffic levels on an application to validate its scalability.

 

Protocol vs Browser

  • 2 Types:
    • Protocol Based: Simulating at the API level
    • Browser Based: Spinning up browsers and simulating interactions with browser elements to trigger realistic protocol-level requests.
  • Typically, Serverless architectures are best tested at the protocol level, as the scale of testing is usually high and browser-simulated testing at this level would be expensive and slow.
  • Protocol-level tests can be HTTP API requests, or more custom triggering through the AWS SDK directly.

Components of a good load test

  • Exact replica of production infrastructure
  • Observability tooling
  • Repeatable scenarios
  • Ability to simulate high load
  • Realistic user flows
  • Geographic distribution

Example Application: Gamercraft

The Gamercraft platform needed the ability to support a massive volume of users and accommodate traffic spikes during large-scale tournaments and low usage periods

Gamercraft

What do we want to test?

  • Validate our cost estimates as load increases
  • Identify AWS Service Limits that need raising
  • Identify AWS Service Limits that can't be raised
  • Verify 3rd party and non-serverless components are protected from spikes
  • Identify hidden bottlenecks
  • Verify impact of regional distribution

How do I start?

🤷‍♂️

Environment to Test Against

🌎

Isomorphic Ephemeral Load Testing Environments 

  • 100% serverless architectures can be deployed to short-lived environments.
  • In "Serverless Flow" we spin up an environment per PR to run integration testing.
  • There is zero mocking, and the architecture is isomorphic to production.
  • This same approach can be taken for isolated load testing.

* Non-serverless components and 3rd parties add complexity

Metrics

📊

Basic Metrics

  • Response times
  • Error rates
  • Throttles
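
These three metrics can be derived from raw request samples. The TypeScript below is a minimal sketch under stated assumptions: the `Sample` shape and `summarise` function are hypothetical, HTTP 5xx responses are counted as errors, and HTTP 429 responses are counted as throttles.

```typescript
interface Sample {
  durationMs: number;
  statusCode: number;
}

interface Summary {
  p50Ms: number;
  p95Ms: number;
  errorRate: number;
  throttleRate: number;
}

// Nearest-rank percentile over a non-empty, ascending-sorted array.
function percentile(sortedAscending: number[], p: number): number {
  const rank = Math.ceil((p / 100) * sortedAscending.length);
  return sortedAscending[Math.max(rank, 1) - 1];
}

function summarise(samples: Sample[]): Summary {
  if (samples.length === 0) {
    throw new Error("summarise needs at least one sample");
  }
  const durations = samples.map((s) => s.durationMs).sort((a, b) => a - b);
  // Assumption: 5xx responses are errors, 429 responses are throttles.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  const throttles = samples.filter((s) => s.statusCode === 429).length;
  return {
    p50Ms: percentile(durations, 50),
    p95Ms: percentile(durations, 95),
    errorRate: errors / samples.length,
    throttleRate: throttles / samples.length,
  };
}
```

Reporting percentiles rather than a mean keeps slow outliers, such as cold starts during a spike, visible in the results.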

Know what’s happening

  • The flexibility, distribution and granularity of Serverless architectures make logging hard.

  • CloudWatch and X-Ray are the minimum.

CloudWatch Lambda Insights

Dedicated Observability Service

Load Testing Toolkit

🛠

Artillery

Artillery is a load testing and smoke testing solution for SREs, developers and QA engineers

Artillery - Test Definition

config:
  target: "https://shopping.service.staging"
  phases:
    - duration: 60
      arrivalRate: 5
      name: Warm up
    - duration: 120
      arrivalRate: 5
      rampTo: 50
      name: Ramp up load
    - duration: 600
      arrivalRate: 50
      name: Sustained load
  payload:
    # Load search keywords from an external CSV file and make them available
    # to virtual user scenarios as variable "keywords":
    path: "keywords.csv"
    fields:
      - "keywords"
scenarios:
  # We define one scenario:
  - name: "Search and buy"
    flow:
      - post:
          url: "/search"
          body: "kw={{ keywords }}"
          # The endpoint responds with JSON, which we parse and extract a field from
          # to use in the next request:
          capture:
            json: "$.results[0].id"
            as: "id"
      # Get the details of the product:
      - get:
          url: "/product/{{ id }}/details"
      # Pause for 3 seconds:
      - think: 3
      # Add product to cart:
      - post:
          url: "/cart"
          json:
            productId: "{{ id }}"

To run it:

artillery run search-and-add-to-cart.yml
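
To get a feel for how much load the test definition above generates, the phases can be summed up. The TypeScript below is a back-of-the-envelope sketch, assuming `arrivalRate` means new virtual users per second and `rampTo` is a linear ramp. `Phase`, `arrivalsInPhase` and `totalArrivals` are hypothetical names, not part of Artillery.

```typescript
interface Phase {
  duration: number; // seconds
  arrivalRate: number; // new virtual users per second at the start of the phase
  rampTo?: number; // rate at the end of the phase, if it ramps
}

// Constant phases use the rate directly; ramps use the average of start and end.
function arrivalsInPhase(phase: Phase): number {
  const endRate = phase.rampTo ?? phase.arrivalRate;
  return (phase.duration * (phase.arrivalRate + endRate)) / 2;
}

function totalArrivals(phases: Phase[]): number {
  return phases.reduce((sum, phase) => sum + arrivalsInPhase(phase), 0);
}

// The three phases from the test definition above.
const phases: Phase[] = [
  { duration: 60, arrivalRate: 5 }, // Warm up
  { duration: 120, arrivalRate: 5, rampTo: 50 }, // Ramp up load
  { duration: 600, arrivalRate: 50 }, // Sustained load
];
```

`totalArrivals(phases)` comes to 33,600 virtual users over 13 minutes. Each one makes three requests in the "Search and buy" scenario, so the run sends roughly 100,800 requests if every step succeeds, which is a useful input when validating cost estimates.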

But where would we run this from?

A server... 🤮

What if there was another way?
 

A service that can run code (without us having to manage servers) with support for massive parallel scale?

Serverless-Artillery (slsart)

Combine serverless with artillery and you get serverless-artillery for instant, cheap, and easy performance testing at scale.

 

Running From Different AWS Account

  • We are running our load test using AWS services (i.e. Lambda).
  • We don't want the load-testing infra to impact limits on our infra under test
  • More realistic traffic paths

Committing Experiments

  • All tests should be repeatable experiments.
  • The context for the test, scenario templates and results should all be committed to the repo.
  • Allows future analysis and repeating of experiments.

Components of a good load test

  • Exact replica of production infrastructure
  • Observability tooling
  • Repeatable scenarios
  • Ability to simulate high load
  • Realistic user flows
  • Geographic distribution
  • Committed repeatable tests

Conclusion

🌎

📊

🛠

serverless-transformation

Serverless Optimisation Workshop
