Serverless for HPC - A case study

Luciano Mammino (@loige)

Get these slides!

Let me introduce myself first...

👋 I'm Luciano (🇮🇹🍕🍝🤌)

👨‍💻 Senior Architect @ fourTheorem (Dublin 🇮🇪)

📔 Co-author of Node.js Design Patterns 👉

Let's connect!

loige.co (blog)

@loige (twitter)

loige (twitch)

lmammino (github)

We are business-focused technologists who deliver.


Accelerated Serverless | AI as a Service | Platform Modernisation

We host a weekly podcast about AWS

🤔

Is serverless a good option for HPC?

Agenda

  • HPC Case study
  • Our requirements & principles
  • The implementation
  • Achievements & Pain points
  • Questions (& bashing)


Based on a talk presented at the AWS Summit London 2022 (MA-03 with Matheus Guimaraes & Colum Thorne) - fth.link/jn5

2 types of Workflow

  • 1️⃣ Risk Rollup (nightly)
  • 2️⃣ Deal Analytics (near-real time)

1️⃣ Risk Rollup

  • Uses financial modelling to understand the state of the portfolio
  • Executed 2-3 times a day
  • Multiple terabytes of data being processed

2️⃣ Deal Analytics

  • Uses the same modelling code but focuses on a subset of deals
  • High frequency of execution (~1000 times a day)
  • Lower data volumes

Original on-prem implementation

Challenges & Limitations of this implementation

  • Scale!
  • Long execution times (constraining business agility)
  • Competing workloads
  • Limited ability to support portfolio growth
  • Hard to deliver new features

Let's re-imagine all of this!
... In the cloud ☁️

Design Principles

  • Think big
    Plan for future growth & more
  • Use managed services as much as possible
    Limit undifferentiated heavy lifting

High-level architecture

Execution Planner

Compute Strategy and Error Handling

Outcomes

  • Risk Rollup is fast (~1 hour)
  • Faster and more consistent Deal Analytics
  • Well positioned to support portfolio growth
  • Reduced original codebase by 70%
  • Lowered TCO

Technical challenges & pain points

  • S3 throughput
  • Job execution caching
  • Scaling Fargate Containers
  • Observability

S3 Throughput

  • We read/write thousands of files concurrently
  • Easy to bump into throughput exceptions if we don't use proper partitioning
  • Automatic partitioning did not work well for us
  • We needed to agree on an S3 partitioning schema with AWS (through support)

S3 Quotas

  • 3,500 write req/sec
  • 5,500 read req/sec
  • per prefix, e.g. (see the sketch below):
    /parts/123abc/...
    /parts/456efg/...
    /parts/ef12ab/...
    /parts/...
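A minimal sketch of the general idea, assuming a hypothetical partitioned_key helper and an illustrative 6-character hash (not our real schema): derive a well-distributed prefix per job so that reads and writes fan out across many S3 partitions.

import hashlib

def partitioned_key(job_id: str, filename: str) -> str:
    # Illustrative only: a short hash spreads objects across prefixes such as
    # /parts/123abc/..., keeping each prefix below the per-prefix request quotas
    prefix = hashlib.sha256(job_id.encode("utf-8")).hexdigest()[:6]
    return f"parts/{prefix}/{job_id}/{filename}"

# e.g. partitioned_key("deal-42", "losses.parquet") -> "parts/<hash>/deal-42/losses.parquet"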

Job execution caching

  • Most jobs are deterministic
  • If we already have the output in S3, we don't need to re-run them
  • But that means sending tens of thousands of HEAD requests to S3 in a short amount of time! 🙈
  • We had to build our own custom S3 caching solution!

Job execution caching

  • We store the keys of all the generated output files in a Redis SET
  • We use the Redis SET intersection operation to quickly figure out which jobs need to be scheduled and which ones we can skip (see the sketch below)
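A rough sketch of that check using redis-py; the set names and endpoint are illustrative, not our real configuration:

import redis

r = redis.Redis(host="cache.example.internal")  # illustrative endpoint

def split_jobs(candidate_outputs):
    # Load the candidate output keys into a temporary set, intersect it with
    # the set of outputs already present in S3, and derive what still needs to run
    r.sadd("tmp:candidates", *candidate_outputs)
    already_done = {k.decode() for k in r.sinter("generated-files", "tmp:candidates")}
    r.delete("tmp:candidates")
    to_schedule = set(candidate_outputs) - already_done
    return to_schedule, already_done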

Job execution caching

  • The process to update the cache is event-based
  • Whenever there is a new output object in S3, we trigger a Lambda that adds the new key to the Redis set (sketched below)
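A minimal sketch of that Lambda handler, assuming redis-py and an illustrative REDIS_HOST environment variable:

import os
import redis

r = redis.Redis(host=os.environ["REDIS_HOST"])  # illustrative configuration

def handler(event, context):
    # Invoked by S3 "ObjectCreated" notifications: mirror every new output key
    # into the Redis set used by the skip check
    for record in event["Records"]:
        r.sadd("generated-files", record["s3"]["object"]["key"])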

Job execution caching

  • Redis memory grows very quickly, so we need to regularly remove old keys from the set...
  • And this is where things get complicated!

Job execution caching

  • Unfortunately, Redis does not support expiring individual members of a set! 😥
  • We ended up implementing our own expiry flow using DynamoDB TTL (see the sketch below)!
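A rough sketch of how the two pieces could fit together; the table, attribute, and set names are made up for illustration:

import time
import boto3
import redis

dynamodb = boto3.client("dynamodb")
r = redis.Redis(host="cache.example.internal")  # illustrative endpoint

def register_expiry(s3_key, ttl_days=7):
    # One DynamoDB item per cached key; the TTL feature deletes it after expires_at
    dynamodb.put_item(
        TableName="cache-expiry",  # illustrative table name
        Item={
            "s3_key": {"S": s3_key},
            "expires_at": {"N": str(int(time.time()) + ttl_days * 86400)},
        },
    )

def on_ttl_delete(event, context):
    # DynamoDB Streams handler: when TTL removes an item, evict the matching
    # member from the Redis set so its memory stays bounded
    for record in event["Records"]:
        if record["eventName"] == "REMOVE":
            r.srem("generated-files", record["dynamodb"]["OldImage"]["s3_key"]["S"])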

Job execution caching

Job execution caching

  • Our actual implementation is even more complicated than this...
  • We try to batch where possible and we limit the rate of writes to DynamoDB using Kinesis and SQS.
  • All this complexity is undifferentiated heavy lifting, and we'd happily get rid of it if there were a managed solution...

Yup... I know what you are thinking!

Scaling Fargate containers

  • When we run a Risk Rollup job we need to spawn ~3k container tasks ASAP.
  • Using a Fargate service, it was taking ~1 hour to do that (this might have changed).
  • Not ideal for us, so we looked for solutions that did not require us to move away from Fargate.
  • We realised that by calling the runTask API directly we could spawn container tasks faster, so we built a custom Fargate task scaler (sketched below)!
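A simplified sketch of the scaler's core call via boto3, with placeholder cluster, task definition, and network settings; RunTask starts at most 10 tasks per call, so we loop in batches:

import boto3

ecs = boto3.client("ecs")

def start_worker_tasks(count, subnets, security_groups):
    started = 0
    while started < count:
        batch = min(10, count - started)  # RunTask caps count at 10 per call
        ecs.run_task(
            cluster="risk-rollup",       # placeholder cluster name
            taskDefinition="worker",     # placeholder task definition
            launchType="FARGATE",
            count=batch,
            networkConfiguration={
                "awsvpcConfiguration": {
                    "subnets": subnets,
                    "securityGroups": security_groups,
                }
            },
        )
        started += batch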

Scaling Fargate containers

We have a continuously running Step Function that scales up the number of containers as needed. In each iteration it:

  • Checks all the running jobs and how many containers they need
  • Decides whether to restart the step function, start new tasks, or simply wait
  • Uses the runTask API to start as many containers as needed
  • If the step function has done 500 iterations, starts a new execution and ends
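A sketch of what the decision step boils down to; the state shape here is invented for illustration:

MAX_ITERATIONS = 500  # restart before hitting Step Functions history limits

def decide(state):
    # state is assumed to carry the loop counter and how many extra containers
    # the currently running jobs need
    if state["iterations"] >= MAX_ITERATIONS:
        return {"action": "restart"}      # start a new execution, then end
    if state["containers_needed"] > 0:
        return {"action": "start_tasks"}  # hand off to the runTask step
    return {"action": "wait"}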

Scaling Fargate containers

How do we stop containers?

They stop automatically after 15 minutes if they can't get jobs from the queue.

import time
import boto3

sqs_client = boto3.client('sqs')
last_message_time = time.time()

while True:
    # QUEUE_URL stands in for the queue details elided on the original slide
    result = sqs_client.receive_message(QueueUrl=QUEUE_URL)
    if 'Messages' in result:
        process_job(result['Messages'])
        last_message_time = time.time()
    elif time.time() - last_message_time > 15 * 60:
        break  # no jobs for 15 minutes: let the container stop

Scaling Fargate containers

  • We tried to build a continuous scaling workflow using serverless technologies (Step Functions + Lambda).
  • This is another example of undifferentiated heavy lifting.
  • We should go back, re-test Fargate's scaling performance, and see if we can remove this custom component from our architecture.

Observability

  • We have built a complex workflow with multiple components: how do we make sure it works as expected in production?
  • Lots of effort put into consistent logging (using Lambda Powertools for Python and relying heavily on CloudWatch Logs Insights) - see the sketch below
  • Lots of effort in collecting metrics, visualising them through dashboards, and triggering alarms when something seems wrong.
  • It's a constant process of review and fine-tuning.
  • If only observability could be a little bit easier...
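For illustration, the structured-logging side looks roughly like this; the service name and log keys are made up:

from aws_lambda_powertools import Logger

logger = Logger(service="risk-rollup")  # illustrative service name

@logger.inject_lambda_context
def handler(event, context):
    # Structured JSON logs keep CloudWatch Logs Insights queries consistent
    logger.info("job scheduled", extra={"job_id": event.get("job_id")})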

Conclusion

  • We believe that Serverless is a viable option for HPC workloads
  • The performance is great and most of the components scale to 0
  • There are some rough edges but we believe the ecosystem around this use case is still in its infancy
  • Things will improve and the future of HPC will be a lot more serverless!

Cover Photo by israel palacio on Unsplash

Thanks to @eoins, @pelger, @guimathed, @cmthorne10 + the awesome tech team at RenRe!

THANKS! 🙌

* just a happy cloud