Using observability to scale AWS Lambda

bene@theodo.co.uk
Ben Ellerby

@EllerbyBen

http://serverless-transformation.com/

https://www.theodo.co.uk/experts/serverless

Alex White

@agwhi_

Serverless

What is this Serverless thing?

  • Architectural movement
    • “allows you to build and run applications and services without thinking about servers” — AWS
    • Developers send application code, which the cloud provider runs in isolated containers abstracted away from the developer.
    • Third-party services manage backend logic and state (e.g. Firebase, Cognito)
  • A framework with the same name

 

Why Serverless?

💰 Cost reduction

👷‍♂️ #NoOps... well LessOps

💻 Developers focus on delivering business value

📈 More scalable

🌳 Greener

Not just Lambda (FaaS)

  • Lambda: Compute
  • S3: Storage
  • DynamoDB: Data
  • API Gateway: API Proxy
  • Cognito: Auth
  • SQS: Queue
  • Step Functions: Workflows
  • EventBridge: Bus

Power and Flexibility to build...

Optimising Lambda during Development

Nathan Malishev

https://levelup.gitconnected.com/aws-lambda-cold-start-language-comparisons-2019-edition-%EF%B8%8F-1946d32a0244

Improving performance

  • Reduce cold starts
  • Power tuning
  • Architecture/code

Cold Starts

  • Code hasn't been executed in a while
  • Scaling up
  • Rebalancing across availability zones
  • Updating code/config flushes warm instances
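
One way to see how often the causes above actually trigger a cold start is to flag it from inside the function. The sketch below relies on module scope running once per execution environment; the handler name and return shape are illustrative assumptions, not something taken from the talk.

```typescript
// Module scope runs once per execution environment, i.e. once per cold start.
let isColdStart = true;
const initialisedAt = Date.now();

// Hypothetical return shape, used here only to make the flag observable.
interface InvocationInfo {
  coldStart: boolean;
  environmentAgeMs: number;
}

// Hypothetical handler: in a real function this would wrap the business logic
// and log the flag so it can be counted in the metrics tooling.
export async function handler(): Promise<InvocationInfo> {
  const info: InvocationInfo = {
    coldStart: isColdStart,
    environmentAgeMs: Date.now() - initialisedAt,
  };
  // Every later invocation served by this environment is a warm start.
  isColdStart = false;
  return info;
}
```

Logging this flag alongside the request ID makes it possible to count cold starts per deployment or per traffic spike.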

Improving Cold Starts

Frequency

Duration

Provisioned Concurrency

Duration

Measuring with X-Ray

Cold Start Times

  • Package size
  • Runtime
  • Amount of code
  • Amount of initialisation work

Reducing Duration

  • Avoid monolithic functions
  • Minify code
    • Webpack
  • Optimise imports
    • Only import the parts of the library you're using
    • Lazy-load dependencies that might not be used
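
Lazy loading can be sketched with a cached dynamic `import()`. Here Node's built-in `node:zlib` stands in for a heavy dependency that only some requests need; the event shape and function names are assumptions made for illustration.

```typescript
// Cache for the lazily loaded module, so the import cost is paid at most once
// per execution environment and never during initialisation.
let zlibModule: typeof import("node:zlib") | undefined;

async function loadZlib(): Promise<typeof import("node:zlib")> {
  if (zlibModule === undefined) {
    zlibModule = await import("node:zlib");
  }
  return zlibModule;
}

// Hypothetical event shape, for illustration only.
interface ReportEvent {
  body: string;
  compress: boolean;
}

export async function handler(event: ReportEvent): Promise<Uint8Array> {
  if (!event.compress) {
    // Fast path: the compression library is never loaded.
    return new TextEncoder().encode(event.body);
  }
  const zlib = await loadZlib();
  return zlib.gzipSync(event.body);
}
```

The trade-off is that the first request needing the dependency pays the load time instead of the cold start, so this suits dependencies used on a minority of code paths.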

HTTP Keep-Alive

  • Reuse TCP connections between requests
  • Can reduce a DynamoDB operation from 30ms to 10ms
  • Easy to set up
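
As a minimal sketch of the idea, a keep-alive agent can be created once at module scope using Node's built-in `https` module. The `maxSockets` value is an assumed figure, not a recommendation from the talk.

```typescript
import { Agent } from "node:https";

// Created once at module scope so every invocation served by this execution
// environment shares the same pool of open TCP/TLS connections.
export const keepAliveAgent = new Agent({
  keepAlive: true, // keep sockets open between requests instead of closing them
  maxSockets: 50, // assumed upper bound on concurrent sockets per host
});
```

With version 2 of the AWS SDK for JavaScript, an agent like this can be passed through a client's `httpOptions`, or connection reuse can be switched on with the `AWS_NODEJS_CONNECTION_REUSE_ENABLED=1` environment variable. Version 3 of the SDK reuses connections by default.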

Power Tuning

Memory = Power

Power Tuning

https://github.com/alexcasalboni/aws-lambda-power-tuning

Power Tuning

  • Data-driven cost and performance optimisation
  • Available as an AWS Serverless Application Repository app
  • Can integrate with CI/CD
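
The cost side of this trade-off follows from Lambda billing by memory multiplied by duration. The sketch below is not the Power Tuning tool itself, only the arithmetic behind comparing its results; the price per GB-second and the measurements are assumptions, and the per-request charge is left out.

```typescript
// Assumed price per GB-second; check current pricing for your region and architecture.
const PRICE_PER_GB_SECOND = 0.0000166667;

interface Measurement {
  memoryMb: number; // configured memory size
  durationMs: number; // average billed duration observed at that size
}

export function costPerInvocation(m: Measurement): number {
  return (m.memoryMb / 1024) * (m.durationMs / 1000) * PRICE_PER_GB_SECOND;
}

// Expects a non-empty list of measurements.
export function cheapest(measurements: Measurement[]): Measurement {
  return measurements.reduce((best, m) =>
    costPerInvocation(m) < costPerInvocation(best) ? m : best
  );
}

// Hypothetical measurements for a CPU-bound function.
const samples: Measurement[] = [
  { memoryMb: 128, durationMs: 4000 },
  { memoryMb: 512, durationMs: 900 },
  { memoryMb: 1024, durationMs: 480 },
  { memoryMb: 2048, durationMs: 260 },
];
```

With these made-up numbers, `cheapest(samples)` picks the 512 MB configuration while 2048 MB is the fastest, which is the cost versus speed decision the tool is meant to inform.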

Input

Output

CPU-bound example

Architecture/Code

Distributed Tracing

Common mistakes

  • Fetching more data than you need
  • Not using related services well
    • Scans in DynamoDB
  • Defaulting to synchronous execution

Moving to Async

Sync

  • You pay while your lambda waits
  • Downstream slowdown affects the lambda
  • Needs custom code for error handling and retries

Async

  • Minimizes cost of waiting
  • Queueing separates fast and slow processes
  • Managed services provide reliability features 

Do you need a lambda?

  • Move orchestration out of your lambdas
    • Avoid paying for idle time
    • Use Step Functions
  • Move data transport out of functions
    • "Transform not transport"
    • Use VTL when possible
      • Access DynamoDB directly

Parallelise Code 

  • Make use of promises (in Node.js) to parallelise processing
  • AVX2 support for Lambda announced at re:Invent 2020
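
A minimal sketch of the promise-based approach: independent I/O calls are started together with `Promise.all` instead of being awaited one by one. `fetchItem` is a hypothetical stand-in that simulates a downstream call with a timer.

```typescript
// Stand-in for an I/O call such as a database read; the delay is simulated.
function fetchItem(id: string, delayMs: number): Promise<string> {
  return new Promise((resolve) => setTimeout(() => resolve(`item-${id}`), delayMs));
}

// Sequential: total time is roughly the sum of the individual delays.
export async function fetchSequentially(ids: string[], delayMs: number): Promise<string[]> {
  const results: string[] = [];
  for (const id of ids) {
    results.push(await fetchItem(id, delayMs));
  }
  return results;
}

// Parallel: all calls start immediately, so total time is roughly the slowest call.
export async function fetchInParallel(ids: string[], delayMs: number): Promise<string[]> {
  return Promise.all(ids.map((id) => fetchItem(id, delayMs)));
}
```

`Promise.all` keeps results in input order and rejects as soon as any call fails, so it fits calls that are independent and all required. Since Lambda bills for duration, the shorter wall-clock time also lowers cost.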

"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%."

- Donald Knuth

Conclusion

Cold Starts

  • AWS optimises the first part for us
  • What we can change
  • Provisioned Concurrency

Power tuning

  • Memory = Power
  • AWS Lambda Power Tuning
  • Cost vs Speed

Architecture/Code

  • Move to async
  • You don't always need a lambda
  • Use parallelisation

Load Testing

Microservices

How does Serverless Scale Differently?

  • Pay-per-use pricing, which can lead to "denial of wallet"
  • AWS Service Limits (both hard and soft)
  • Serverless components can outpace non-serverless components and third parties
  • Cold starts have more impact during sudden spikes
  • Combining multiple services in a distributed system makes bottlenecks harder to spot
  • Regional distribution of traffic
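
For the service-limit point, a rough back-of-the-envelope check is that the concurrency needed is about the request rate multiplied by the average duration. The sketch below assumes a regional concurrency quota of 1,000; that figure and the function names are assumptions to adjust for the account in question.

```typescript
// Assumed regional concurrency quota; check the actual value for your account.
const ASSUMED_CONCURRENCY_LIMIT = 1000;

// Concurrent executions needed to sustain a steady request rate.
export function requiredConcurrency(requestsPerSecond: number, avgDurationMs: number): number {
  return Math.ceil((requestsPerSecond * avgDurationMs) / 1000);
}

// Positive: spare capacity. Negative: requests would be throttled.
export function headroom(
  requestsPerSecond: number,
  avgDurationMs: number,
  limit: number = ASSUMED_CONCURRENCY_LIMIT
): number {
  return limit - requiredConcurrency(requestsPerSecond, avgDurationMs);
}
```

For example, 500 requests per second at 200 ms each needs about 100 concurrent executions, while 2,000 requests per second at 800 ms each needs about 1,600 and would exceed the assumed quota. This ignores burst limits and other functions sharing the same quota.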

What is Load Testing?

  • Different things to different people.

Simulating different concurrent traffic levels on an application to validate its scalability.

 

Protocol vs Browser

  • 2 types:
    • Protocol-based: simulating at the API level
    • Browser-based: spinning up browsers and simulating interactions with browser elements to trigger realistic protocol-level requests
  • Serverless architectures are typically best tested at the protocol level, as the scale of testing is usually high and browser-based testing at that scale would be expensive and slow.
  • The protocol could be HTTP API requests, or more custom triggering via the AWS SDK directly.

Components of a good load test

  • Exact replica of production infrastructure
  • Observability tooling
  • Repeatable scenarios
  • Ability to simulate high load
  • Realistic user flows
  • Geographic distribution

Example Application: Gamercraft

The Gamercraft platform needed to support a massive volume of users and accommodate both traffic spikes during large-scale tournaments and low-usage periods.

Gamercraft

What do we want to test?

  • Validate our cost estimates as load increases
  • Identify AWS Service Limits that need raising
  • Identify AWS Service Limits that can't be raised
  • Verify 3rd party and non-serverless components are protected from spikes
  • Identify hidden bottlenecks
  • Verify impact of regional distribution

How do I start?

🤷‍♂️

Environment to Test Against

🌎

Isomorphic Ephemeral Load Testing Environments 

  • 100% serverless architectures can be deployed to short-lived environments.
  • In "Serverless Flow" we spin up an environment per PR to run integration tests.
  • There is no mocking, and the architecture is isomorphic to production.
  • This same approach can be taken for isolated load testing.

* Non-serverless components and 3rd parties add complexity

Metrics

📊

Basic Metrics

  • Response times
  • Error rates
  • Throttles
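
These three metrics can be derived from raw request samples. The sketch below assumes each sample records a latency and an HTTP status code, and that 5xx responses count as errors and 429 responses as throttles; both conventions and all names are assumptions for illustration.

```typescript
interface RequestSample {
  latencyMs: number;
  statusCode: number;
}

interface LoadTestSummary {
  p50Ms: number;
  p95Ms: number;
  errorRate: number;
  throttleRate: number;
}

// Nearest-rank percentile over an ascending-sorted, non-empty array.
function percentile(sorted: number[], p: number): number {
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

export function summarise(samples: RequestSample[]): LoadTestSummary {
  if (samples.length === 0) {
    throw new Error("no samples to summarise");
  }
  const latencies = samples.map((s) => s.latencyMs).sort((a, b) => a - b);
  // Assumption: 5xx responses count as errors and 429 responses as throttles.
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  const throttles = samples.filter((s) => s.statusCode === 429).length;
  return {
    p50Ms: percentile(latencies, 50),
    p95Ms: percentile(latencies, 95),
    errorRate: errors / samples.length,
    throttleRate: throttles / samples.length,
  };
}
```

Percentiles are used rather than averages because a handful of cold starts can hide in a mean but show up clearly at p95.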

Know what’s happening

  • The flexibility, distribution and granularity of Serverless architectures make logging hard.

  • CloudWatch and X-Ray are the minimum.

CloudWatch Lambda Insights

Dedicated Observability Service

Load Testing Toolkit

🛠

Artillery

Artillery is a load testing and smoke testing solution for SREs, developers and QA engineers

Artillery - Test Definition

config:
  target: "https://shopping.service.staging"
  phases:
    - duration: 60
      arrivalRate: 5
      name: Warm up
    - duration: 120
      arrivalRate: 5
      rampTo: 50
      name: Ramp up load
    - duration: 600
      arrivalRate: 50
      name: Sustained load
  payload:
    # Load search keywords from an external CSV file and make them available
    # to virtual user scenarios as variable "keywords":
    path: "keywords.csv"
    fields:
      - "keywords"
scenarios:
  # We define one scenario:
  - name: "Search and buy"
    flow:
      - post:
          url: "/search"
          body: "kw={{ keywords }}"
          # The endpoint responds with JSON, which we parse and extract a field from
          # to use in the next request:
          capture:
            json: "$.results[0].id"
            as: "id"
      # Get the details of the product:
      - get:
          url: "/product/{{ id }}/details"
      # Pause for 3 seconds:
      - think: 3
      # Add product to cart:
      - post:
          url: "/cart"
          json:
            productId: "{{ id }}"
# Run the test from the command line:
artillery run search-and-add-to-cart.yml
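
Before running a definition like the one above, it can help to estimate how much traffic its phases imply, for example to sanity-check cost estimates. The sketch below copies the three phases and assumes a ramp phase averages its start and end rates; Artillery's actual ramp is stepped, so the figures are approximate.

```typescript
interface Phase {
  name: string;
  duration: number; // seconds
  arrivalRate: number; // new virtual users per second at the start of the phase
  rampTo?: number; // arrival rate at the end of the phase, if it ramps
}

// Phases copied from the test definition.
const phases: Phase[] = [
  { name: "Warm up", duration: 60, arrivalRate: 5 },
  { name: "Ramp up load", duration: 120, arrivalRate: 5, rampTo: 50 },
  { name: "Sustained load", duration: 600, arrivalRate: 50 },
];

// The "Search and buy" flow makes three HTTP requests per virtual user.
const REQUESTS_PER_SCENARIO = 3;

// Approximation: a ramp is treated as linear between its start and end rates.
export function estimatedVirtualUsers(phase: Phase): number {
  const endRate = phase.rampTo ?? phase.arrivalRate;
  return (phase.duration * (phase.arrivalRate + endRate)) / 2;
}

export function estimatedTotals(all: Phase[]): { virtualUsers: number; requests: number } {
  const virtualUsers = all.reduce((sum, p) => sum + estimatedVirtualUsers(p), 0);
  return { virtualUsers, requests: virtualUsers * REQUESTS_PER_SCENARIO };
}
```

Under these assumptions the run creates about 33,600 virtual users (300 + 3,300 + 30,000) and about 100,800 requests, provided every virtual user completes the whole flow.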

But where would we run this from?

A server... 🤮

What if there was another way?
 

A service that can run code (without us having to manage servers) with support for massive parallel scale?

Serverless-Artillery (slsart)

Combine serverless with artillery and you get serverless-artillery for instant, cheap, and easy performance testing at scale.

 

Running From Different AWS Account

  • We run our load tests using AWS services (i.e. Lambda).
  • We don't want the load-testing infrastructure to count against the limits of the infrastructure under test.
  • More realistic traffic paths

Committing Experiments

  • All tests should be repeatable experiments.
  • The context for the test, scenario templates and results should all be committed to the repo.
  • Allows future analysis and repeating of experiments.

Components of a good load test

  • Exact replica of production infrastructure
  • Observability tooling
  • Repeatable scenarios
  • Ability to simulate high load
  • Realistic user flows
  • Geographic distribution
  • Committed repeatable tests

Conclusion

🌎

📊

🛠

serverless-transformation

Serverless Optimisation Workshop

Alex White Joint Presentation: Using observability to scale AWS Lambda

By Ben Ellerby

Serverless architectures on AWS, involving services like AWS Lambda, DynamoDB, Cognito, Step Functions and API Gateway, bring instant scalability when built and configured correctly. We’ll look at how AWS Serverless architectures need to be treated differently to ensure optimal scalability, and how Serverless tools (like Serverless Artillery) can be used to verify it. Beyond achieving scalability, we’ll also look at the tools and techniques to predict and limit the cost of scaling. To bring these topics to life we’ll look at the architecture of two live Serverless applications built on AWS for scale and discuss how they were architected, how costs were monitored and kept in line, and how serverless load testing was used to verify scalability and catch edge cases.
