Resiliency in Distributed Systems
Follow along,
- 18 Products
- 1m+ Drivers
- 300+ Microservices
- 15k+ Cores
- 2 Cloud Providers
- 6 Data centers
- 100+ million bookings per month
Transport, logistics, hyperlocal delivery and payments
Agenda
- Resiliency and Distributed Systems
- Why care about Resiliency?
- Faults vs Failures
- Patterns for Resiliency
Distributed Systems
Networked Components which communicate and coordinate their actions by passing messages
Troll Definition
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4659040/distributed-systems-i-have-no-idea-what-i-am-doing.jpg)
Resiliency
Capacity to Recover from difficulties
Why care about Resiliency?
- Financial Losses
- Losing Customers
- Affecting Customers
- Affecting Livelihood of Drivers
Faults vs Failures
Fault
Incorrect internal state in your system
Faults
- Database slowdown
- Memory leaks
- Blocked threads
- Dependency failure
- Bad Data
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664796/Screen_Shot_2018-03-04_at_1.48.36_PM.png)
Healthy
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664798/Screen_Shot_2018-03-04_at_1.50.30_PM.png)
Faults
Failure
Inability of the system to do its intended job
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664801/Screen_Shot_2018-03-04_at_1.58.01_PM.png)
Failures
Resiliency is about preventing faults from turning into failures
Resiliency in Distributed Systems is Hard
- Network is unreliable
- Dependencies can always fail
- Users are unpredictable
Patterns for Resiliency
Heimdall
https://github.com/gojektech/heimdall
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4876035/heimdall.png)
#NOCODE
Resiliency Pattern #0
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4663987/Screen_Shot_2018-03-03_at_9.59.39_PM.png)
#LessCode
Timeouts
Stop waiting for an answer
Resiliency Pattern #1
Required at Integration Points
The default http.Client waits forever (no timeout is set)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4675926/Screen_Shot_2018-03-07_at_10.55.06_AM.png)
httpClient := http.Client{}
_, err := httpClient.Get("https://gojek.com/drivers")
Goroutines
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4673032/Screen_Shot_2018-03-06_at_9.31.49_PM.png)
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4673035/Screen_Shot_2018-03-06_at_9.30.14_PM.png)
httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)
_, err := httpClient.Get("https://gojek.com/drivers", http.Header{})
Prevents Cascading Failures
Provides Failure Isolation
Timeouts must be based on the dependency's SLA
Retries
Try again on Failure
Resiliency Pattern #2
Reduces Recovery time
backoff := heimdall.NewConstantBackoff(500)
retrier := heimdall.NewRetrier(backoff)
httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)
httpClient.SetRetrier(retrier)
httpClient.SetRetryCount(3)
httpClient.Get("https://gojek.com/drivers", http.Header{})
Retrying immediately may not be useful
Queue and Retry wherever possible
Idempotency is important
Circuit Breakers
Stop making calls to save systems
Resiliency Pattern #3
State Transitions
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4659269/futureinternet-09-00058-g003.png)
Hystrix
config := heimdall.HystrixCommandConfig{
	MaxConcurrentRequests:  100,
	ErrorPercentThreshold:  25,
	SleepWindow:            10,
	RequestVolumeThreshold: 10,
}
hystrixConfig := heimdall.NewHystrixConfig("MyCommand", config)
timeout := 10 * time.Millisecond
httpClient := heimdall.NewHystrixHTTPClient(timeout, hystrixConfig)
_, err := httpClient.Get("https://gojek.com/drivers", http.Header{})
Circumvent calls when system is unhealthy
Guards Integration Points
Metrics/Monitoring
Hystrix Dashboards
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664751/Screen_Shot_2018-03-04_at_1.03.47_PM.png)
Fallbacks
Degrade Gracefully
Resiliency Pattern #4
Curious case of Maps Service
Route Distance
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664768/index1.jpg)
Fallback from Route Distance to Route Approximation
Route Approximation
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4664755/index.jpg)
Fallback to a Different Maps Provider
Helps Degrade gracefully
Protect critical flows from failure (e.g., the booking flow)
Think of fallbacks at Integration points
Resiliency Testing
Resiliency Pattern #5
Test and Break
Find Failure modes
Create a Test Harness to break callers
Inject Failures
Unknown Unknowns
Simian Army
- Chaos Monkey
- Janitor Monkey
- Conformity Monkey
- Latency Monkey
More patterns
- Rate-limit/Throttling
- Bulkheading
- Queuing
- Monitoring/alerting
- Canary releases
- Redundancies
In Conclusion ...
Patterns are no silver bullet
Systems Fail, Deal with it
Design Your Systems for Failure
Recap
- Faults vs Failures
- Timeouts
- Retries
- Circuit Breakers
- Fallbacks
- Resiliency Testing
War Stories
Come meet us ...
![](https://s3.amazonaws.com/media-p.slid.es/uploads/158589/images/4663983/releaseit.jpg)
References
Questions ?
Resiliency in Distributed Systems
By Rajeev Bharshetty