Resiliency in
Distributed Systems
- 18 Products
- 1m+ Drivers
- 300+ Microservices
- 15k+ Cores
- 2 Cloud Providers
- 6 Data centers
- 100+ million bookings per month
Transport, logistics, hyperlocal delivery and payments
Agenda
- Resiliency and Distributed Systems
- Why care for Resiliency ?
- Faults vs Failures
- Patterns for Resiliency
Distributed Systems
Networked Components which communicate and coordinate their actions by passing messages
Troll Definition
Resiliency
Capacity to Recover from difficulties
Why care about Resiliency ?
- Financial Losses
- Losing Customers
- Affecting Customers
- Affecting Livelihood of Drivers
Faults vs Failures
Fault
Incorrect internal state in your system
Faults
- Database slowdown
- Memory leaks
- Blocked threads
- Dependency failure
- Bad Data
Healthy
Faults
Failure
Inability of the system to do its intended job
Failures
Resiliency is about preventing faults from turning into failures
Resiliency in Distributed Systems is Hard
- Network is unreliable
- Dependencies can always fail
- Users are unpredictable
Patterns for Resiliency
Heimdall
https://github.com/gojektech/heimdall
#NOCODE
Resiliency Pattern #0
#LessCode
Timeouts
Stop waiting for an answer
Resiliency Pattern #1
Required at Integration Points
The default http.Client waits forever
httpClient := http.Client{}
_, err := httpClient.Get("https://gojek.com/drivers")
Goroutines
httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)
_, err := httpClient.Get("https://gojek.com/drivers", http.Header{})
Prevents Cascading Failures
Provides Failure Isolation
Timeouts must be based on dependency's SLA
Retries
Try again on Failure
Resiliency Pattern #2
Reduces Recovery time
backoff := heimdall.NewConstantBackoff(500)
retrier := heimdall.NewRetrier(backoff)
httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)
httpClient.SetRetrier(retrier)
httpClient.SetRetryCount(3)
httpClient.Get("https://gojek.com/drivers", http.Header{})
Retrying immediately may not be useful
Queue and Retry wherever possible
Idempotency is important
Circuit Breakers
Stop making calls to save systems
Resiliency Pattern #3
State Transitions
Hystrix
config := heimdall.HystrixCommandConfig{
	MaxConcurrentRequests:  100,
	ErrorPercentThreshold:  25,
	SleepWindow:            10,
	RequestVolumeThreshold: 10,
}
hystrixConfig := heimdall.NewHystrixConfig("MyCommand", config)
timeout := 10 * time.Millisecond
httpClient := heimdall.NewHystrixHTTPClient(timeout, hystrixConfig)
_, err := httpClient.Get("https://gojek.com/drivers", http.Header{})
Circumvent calls when system is unhealthy
Guards Integration Points
Metrics/Monitoring
Hystrix Dashboards
Fallbacks
Degrade Gracefully
Resiliency Pattern #4
Curious case of Maps Service
Route Distance
Fallback from Route Distance to Route Approximation
Route Approximation
Fallback to a Different Maps Provider
Helps Degrade gracefully
Protect Critical flows from Failure (Ex: Booking Flow)
Think of fallbacks at Integration points
Resiliency Testing
Resiliency Pattern #5
Test and Break
Find Failure modes
Create a Test Harness to break callers
Inject Failures
Unknown Unknowns
Simian Army
- Chaos Monkey
- Janitor Monkey
- Conformity Monkey
- Latency Monkey
More patterns
- Rate-limit/Throttling
- Bulk-heading
- Queuing
- Monitoring/alerting
- Canary releases
- Redundancies
In Conclusion ...
Patterns are no silver bullet
Systems Fail, Deal with it
Design Your Systems for Failure
Recap
- Faults vs Failures
- Timeouts
- Retries
- Circuit Breakers
- Fallbacks
- Resiliency Testing
War Stories
Come meet us ...
References
Questions ?
Resiliency in Distributed Systems
By Rajeev Bharshetty