How Amazon Web Services Uses Formal Methods

Paper

Authors:

Chris Newcombe, Tim Rath, Fan Zhang, Bogdan ,...

Released:

Communications of the ACM, April 2015

Amazon Web Services (AWS) is a collection of cloud computing services, also called web services, that make up the cloud computing platform offered by Amazon.com.

PRODUCTS

S3

DynamoDB

EBS

Many Others...

Elastic Beanstalk, Auto Scaling, Virtual Private Cloud, Elastic Load Balancing...

FORMAL METHODS

Mathematically based techniques for the specification, development, and verification of software and hardware systems.

Formal methods have a reputation for requiring a huge amount of training and effort to verify a tiny piece of relatively straightforward code, so the return on investment is considered justified only in safety-critical domains (such as medical systems and aviation).

ALLOY

  • Developed at MIT
  • Taught at the University of Iceland
  • Considered by Amazon

TLA+

Temporal Logic of Actions

N Queens Problem

  • Fit N queens on an N×N board so that no two attack each other
  • Solvable for all N > 3 (and trivially for N = 1)
  • 8 queens on a standard chessboard
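
The puzzle is small enough to check by brute force. The following Python sketch is not taken from the paper or the slides; the function name `count_n_queens` is made up for illustration. It counts the solutions for a given N by backtracking, one row at a time.

```python
def count_n_queens(n: int) -> int:
    """Count placements of n mutually non-attacking queens on an n x n board."""

    def place(row: int, cols: frozenset, diag1: frozenset, diag2: frozenset) -> int:
        if row == n:
            return 1  # every row holds one queen: a complete solution
        total = 0
        for col in range(n):
            # A queen attacks along its column and along both diagonals.
            if col in cols or (row - col) in diag1 or (row + col) in diag2:
                continue
            total += place(
                row + 1,
                cols | {col},
                diag1 | {row - col},
                diag2 | {row + col},
            )
        return total

    return place(0, frozenset(), frozenset(), frozenset())


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, count_n_queens(n))
```

Running it for N from 1 to 8 shows that only N = 2 and N = 3 have no solution, and that the classic 8-queens board has 92.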

To a first approximation, we can say that accidents are almost always the result of incorrect estimates of the likelihood of one or more things

Human intuition is poor at estimating the true probability of supposedly "extremely rare" combinations of events in systems operating at a scale of millions of requests per second
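
A quick back-of-the-envelope calculation makes the point concrete. The numbers below (a one-in-a-billion event, one million requests per second) are hypothetical and chosen only for illustration; the helper name `expected_seconds_between` is likewise made up.

```python
def expected_seconds_between(probability_per_request: float,
                             requests_per_second: float) -> float:
    """Mean time in seconds between occurrences of an event that happens
    independently with the given probability on each request."""
    return 1.0 / (probability_per_request * requests_per_second)


if __name__ == "__main__":
    # Hypothetical figures: a "one in a billion" event at 1M requests/second.
    seconds = expected_seconds_between(1e-9, 1_000_000)
    print(f"about {seconds:.0f} seconds between occurrences")
```

Under these assumptions the "extremely rare" event happens roughly every 1,000 seconds, or about once every 17 minutes.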

The number of reachable states in the code is astronomical
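
One way to get a feel for this growth is to count execution orders rather than states. The sketch below is an illustration, not something from the paper. It assumes a simplified model in which each of k processes runs a fixed sequence of n atomic steps, and counts the ways those steps can interleave, which is (k·n)! / (n!)^k.

```python
from math import factorial


def interleavings(processes: int, steps_each: int) -> int:
    """Number of distinct orderings of atomic steps when `processes`
    processes each execute `steps_each` steps in program order."""
    total_steps = processes * steps_each
    return factorial(total_steps) // factorial(steps_each) ** processes


if __name__ == "__main__":
    for k, n in [(2, 2), (2, 10), (3, 3), (4, 10)]:
        print(f"{k} processes x {n} steps: {interleavings(k, n)} interleavings")
```

Two processes with ten steps each already allow 184,756 orderings, and the count grows factorially from there. This is why tests that exercise only a handful of orderings can miss rare ones.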

Why do they need formal methods??


C.N. initially chose Alloy. Wrote a specification for a non-trivial algo in Alloy. Later did the same in TLA+.

Why they chose TLA+ over Alloy and others

steps

  1. "What needs to go right"
    • Safety - what the system is allowed to do
    • Liveness - what the system must eventually do
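
The difference between the two kinds of property can be sketched on a toy system. The Python example below is a hypothetical illustration, not from the paper: it models two crossing traffic lights as a small state graph, checks a safety property (the lights are never both green) over all reachable states, and checks a simplified stand-in for liveness (from every reachable state, each direction can still reach green). Real liveness properties in TLA+ are temporal formulas that usually need fairness assumptions; this reachability check is weaker.

```python
from collections import deque

# Hypothetical controller state: (north_south_light, east_west_light).
TRANSITIONS = {
    ("green", "red"): [("yellow", "red")],
    ("yellow", "red"): [("red", "green")],
    ("red", "green"): [("red", "yellow")],
    ("red", "yellow"): [("green", "red")],
}
INITIAL = ("green", "red")


def reachable(start):
    """Return the set of states reachable from `start` (breadth-first)."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for nxt in TRANSITIONS.get(state, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def safety_holds(states):
    """Safety: something bad never happens (both lights green)."""
    return all(state != ("green", "green") for state in states)


def liveness_holds(states):
    """Simplified liveness: from every state, each direction can still get green."""
    return all(
        any(s[0] == "green" for s in reachable(state))
        and any(s[1] == "green" for s in reachable(state))
        for state in states
    )


if __name__ == "__main__":
    states = reachable(INITIAL)
    print("safety:", safety_holds(states), "liveness:", liveness_holds(states))
```

Removing the transition out of `("red", "yellow")` keeps the safety check passing but breaks the liveness check, which shows why both questions have to be asked.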


SIDE BENEFIT

Great documentation

What are formal methods not good for

Sustained emergent performance degradation

Java garbage collection causes timeouts to be breached on clients, causing clients to retry requests, thus adding load to the server and causing further slowdowns.

First Success

DynamoDB launched in January 2012 

T.R. wrote TLA+ specifications for several components

Verified that a complex part of the algo was correct

Found a bug that could lead to data loss if a particular series of failures and recovery steps were interleaved with other processing

TLC model checker ran on 10 EC2 instances, each with 8 cores, hyperthreading, and 23 GB of RAM
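
To give a rough idea of what an explicit-state model checker does, here is a toy sketch in Python. It is not TLC and has nothing to do with the actual DynamoDB specification. It assumes a hypothetical system of two processes that each read a shared counter and then write back the value plus one, explores every interleaving breadth-first, and returns the shortest trace that violates the invariant "when all processes are done, the counter equals the number of processes".

```python
from collections import deque

# State: (counter, ((pc, tmp), (pc, tmp))) where pc is "read", "write" or "done".
INITIAL = (0, (("read", 0), ("read", 0)))


def next_states(state):
    """Yield every state reachable in one atomic step by any process."""
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == "read":
            new_proc, new_counter = ("write", counter), counter
        elif pc == "write":
            new_proc, new_counter = ("done", tmp), tmp + 1
        else:
            continue  # this process has finished
        yield (new_counter, procs[:i] + (new_proc,) + procs[i + 1:])


def invariant(state):
    """Once every process is done, no increment may have been lost."""
    counter, procs = state
    all_done = all(pc == "done" for pc, _ in procs)
    return not all_done or counter == len(procs)


def check(initial):
    """Breadth-first search; return the shortest violating trace, or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


if __name__ == "__main__":
    for step in check(INITIAL) or []:
        print(step)
```

The checker reports a trace in which both processes read 0 before either writes, so one increment is lost. The bugs described in the paper are of this kind, but buried in far longer interleavings of failures and recovery steps.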

very subtle bug

Convince Other Engineers

Testable Pseudo-code

Avoided terms like "formal" and "proof" 

Don't mention formal methods

Incorrect impression of complexity

How did they trick others into using it?

More Success in S3

F.Z. found two bugs in an algo, verified fixes

Tweak specification to introduce optimizations

AWS management starts encouraging teams to adopt formal methods.

A TLA+ model finds, in seconds, a known and very subtle bug that had passed through multiple reviews.

Robustness

M.D. used PlusCal to find a critical bug in AWS's most important new distributed algorithm

C.N. wrote a spec for the same algo, quite different in style, found the same bug.

Suggests that TLA+ specifications are robust to variations among engineers.

Good For Data Modeling

Improve system scalability

 

How do we know that the executable code correctly implements the verified design?

We don't.

But formal methods help:

  • At least get the design right
  • Gain better understanding
  • Write better code

CONCLUSION


Review and reflection of the paper

AWS Formal Methods Dark

By Tryggvi Gylfason
