Building More Resilient Applications

Karim Alibhai

github.com/karimsa | npm install karimsa

DISCLAIMER: I will drastically oversimplify some things!

Delivering the art & science of retail execution

Pst! We're hiring @ fokoretail.com

Observers / Managers

Assignees

Highly async pipeline

Sub-par handling due to how fast the features were built

800

* 5 events per assignee = 40 000 events

Observers / Managers

Assignees

Worked beautifully

Still sub-par

800

* 5 events per assignee = 3 200 000 events

(80x)
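The two totals above can be checked with a little arithmetic. The sketch below assumes the event count scales as (some multiplier) x 800 x 5; the extracted slide text shows only the 800 and the 5, so the multiplier is derived from the quoted totals rather than taken from the slides. The names `COUNT_ON_SLIDE` and `implied_multiplier` are made up for this illustration.

```python
# Figures quoted on the slides.
COUNT_ON_SLIDE = 800          # shown on both slides
EVENTS_PER_ASSIGNEE = 5
TOTAL_BEFORE = 40_000
TOTAL_AFTER = 3_200_000


def implied_multiplier(total_events: int) -> int:
    """Factor that must multiply 800 * 5 to reach the quoted total (assumption)."""
    base = COUNT_ON_SLIDE * EVENTS_PER_ASSIGNEE  # 4 000
    return total_events // base


before = implied_multiplier(TOTAL_BEFORE)
after = implied_multiplier(TOTAL_AFTER)
growth = TOTAL_AFTER // TOTAL_BEFORE

print(before, after, growth)  # prints: 10 800 80
```

Under that assumption the second scenario multiplies the same 800 x 5 base by 800 instead of 10, which is where the 80x comes from.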

How we think of failures

Problem: Evil butterflies are rare. Stupid butterflies are common.

How do we aim for high availability?

Availability = Time to Failure / (Time to Failure + Time to Recovery)

* "Patterns for fault tolerant cloud software" by Robert Hanmer
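A quick way to build intuition for the formula is to plug in numbers. The sketch below uses made-up figures (a failure every 1000 hours, one hour to recover) purely for illustration; none of them come from the talk. It shows that halving the recovery time buys the same availability as doubling the time between failures.

```python
import math


def availability(time_to_failure: float, time_to_recovery: float) -> float:
    """Fraction of time the system is up, per the formula above."""
    return time_to_failure / (time_to_failure + time_to_recovery)


# Hypothetical baseline: fails every 1000 hours, takes 1 hour to recover.
baseline = availability(1000, 1)

# Option A: make failures half as frequent.
longer_uptime = availability(2000, 1)

# Option B: recover twice as fast.
faster_recovery = availability(1000, 0.5)

print(f"baseline:        {baseline:.6f}")
print(f"longer uptime:   {longer_uptime:.6f}")
print(f"faster recovery: {faster_recovery:.6f}")
print("equivalent:", math.isclose(longer_uptime, faster_recovery))
```

Both options improve on the baseline by the same amount, so the cheaper of the two is usually the better investment.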

The CAP Theorem

Consistency
Availability
Partition tolerance *

* Required for distributed systems.

[Diagram: a Client and a Server exchanging HTTP messages ("blah"), separated by an unavoidable network partition]

CP System

Prefers consistency over availability (where possible).
CAP defines consistency in terms of reads.
In reality, we design in terms of writes.
For example:
- Database transactions
- REST APIs (?)
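As a rough illustration of the CP preference on the write path, the sketch below refuses a write outright when the database cannot be reached, so callers never see a success that was not stored. `Store`, `DatabaseUnavailable`, and `cp_put` are hypothetical names invented for this example (an in-memory stand-in, not a real database client), and the status codes are just one plausible mapping.

```python
class DatabaseUnavailable(Exception):
    """Raised by the hypothetical store when it cannot be reached."""


class Store:
    """In-memory stand-in for a database, with a switch to simulate downtime."""

    def __init__(self) -> None:
        self.rows: dict[str, object] = {}
        self.online = True

    def write(self, key: str, value: object) -> None:
        if not self.online:
            raise DatabaseUnavailable("store is unreachable")
        self.rows[key] = value


def cp_put(store: Store, key: str, value: object) -> int:
    """Return an HTTP-like status: succeed only if the write was stored."""
    try:
        store.write(key, value)
    except DatabaseUnavailable:
        # Consistency over availability: fail loudly instead of pretending.
        return 503
    return 200


store = Store()
ok_status = cp_put(store, "task-1", "done")

store.online = False  # simulate database downtime
failed_status = cp_put(store, "task-2", "done")
```

Here `ok_status` is 200 and `failed_status` is 503, and `task-2` never appears in `store.rows`.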

AP System

Prefers availability over consistency (where possible).
Key phrase: "eventually consistent".
For example:
- A `POST` / `PUT` during database downtime is still successful.
- The `User` microservice is down, so the recommender system provides general recommendations.
- Your `ES` cluster is down, so your API falls back to MongoDB or a cache.
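The first two examples above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular system does it: `Store`, `ApWriter`, `recommendations`, and `GENERAL_RECOMMENDATIONS` are all hypothetical names, and the write buffer lives in memory, whereas a real service would likely want a durable queue so buffered writes survive a crash.

```python
from collections import deque


class DatabaseUnavailable(Exception):
    """Raised by the hypothetical store when it cannot be reached."""


class Store:
    """In-memory stand-in for a database, with a switch to simulate downtime."""

    def __init__(self) -> None:
        self.rows: dict[str, object] = {}
        self.online = True

    def write(self, key: str, value: object) -> None:
        if not self.online:
            raise DatabaseUnavailable("store is unreachable")
        self.rows[key] = value


class ApWriter:
    """Accepts writes during downtime and applies them later (eventual consistency)."""

    def __init__(self, store: Store) -> None:
        self.store = store
        self.pending: deque[tuple[str, object]] = deque()

    def replay(self) -> None:
        """Apply buffered writes in order; stop at the first failure."""
        while self.pending:
            key, value = self.pending[0]
            try:
                self.store.write(key, value)
            except DatabaseUnavailable:
                return
            self.pending.popleft()

    def put(self, key: str, value: object) -> int:
        self.replay()  # older buffered writes go first to preserve ordering
        if not self.pending:
            try:
                self.store.write(key, value)
                return 200
            except DatabaseUnavailable:
                pass
        self.pending.append((key, value))
        return 202  # accepted, will be stored once the database is back


GENERAL_RECOMMENDATIONS = ["bestsellers", "new-arrivals"]


def recommendations(fetch_user, user_id: int) -> list[str]:
    """Personalise when the user service answers; otherwise degrade gracefully."""
    try:
        user = fetch_user(user_id)
    except ConnectionError:
        return GENERAL_RECOMMENDATIONS
    return [f"more-{interest}" for interest in user["interests"]]
```

The trade-off is visible in the 202 status: the caller is told the write was accepted, not that it is already readable.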

Considerations

When mechanics are involved
When the user is orchestrating

Use AP where possible; otherwise fall back to CP.

Building More Resilient Applications (Part 1)

By Karim Alibhai

A story about when all hell breaks loose & lessons learned from it.
