Resiliency in

Distributed Systems

Rajeev N B

rajeevnb.com | @rBharshetty

Engineering at GO-JEK Tech

Follow along,

https://bit.ly/godistsys

 

  • 18 Products
  • 1m+ Drivers
  • 300+ Microservices
  • 15k+ Cores
  • 2 Cloud Providers
  • 6 Data centers
  • 100+ million bookings per month

Transport, logistics, hyperlocal delivery and payments

Agenda

  • Resiliency and Distributed Systems
  • Why care for Resiliency ?
  • Faults vs Failures
  • Patterns for Resiliency

Distributed Systems

Networked Components which communicate and coordinate their actions by passing messages

Troll Definition

Resiliency

Capacity to Recover from difficulties

Why care about Resiliency ?

  • Financial Losses
  • Losing Customers
  • Affecting Customers
  • Affecting Livelihood of Drivers

Faults vs Failures

Fault

Incorrect internal state in your system

Faults

  • Database slowdown
  • Memory leaks
  • Blocked threads
  • Dependency failure
  • Bad Data

Healthy

Faults

Failure

Inability of the system to do its intended job

Failures

Resiliency is about preventing faults turning into failures

Resiliency in Distributed Systems is Hard

  • Network is unreliable
  • Dependencies can always fail
  • Users are unpredicatable

Patterns for Resiliency

Heimdall

https://github.com/gojektech/heimdall

#NOCODE

Resiliency Pattern #0

#LessCode

Timeouts

Stop waiting for an answer

Resiliency Pattern #1

Required at Integration Points

DefaultHTTPClient Waits forever



  httpClient := http.Client{}
  _, err := httpClient.Get("https://gojek.com/drivers")

Goroutines



  httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)
  _, err := httpClient.Get("https://gojek.com/drivers", 
                           http.Header{})

Prevents Cascading Failures

Provides Failure Isolation

Timeouts must be based on dependency's SLA

Retries

Try again on Failure

Resiliency Pattern #2

Reduces Recovery time



  backoff := heimdall.NewConstantBackoff(500)
  retrier := heimdall.NewRetrier(backoff)
  httpClient := heimdall.NewHTTPClient(1 * time.Millisecond)

  httpClient.SetRetrier(retrier)
  httpClient.SetRetryCount(3)

  httpClient.Get("https://gojek.com/drivers", http.Header{})

Retrying immediately may not be useful

Queue and Retry wherever possible

Idempotency is important

Circuit Breakers

Stop making calls to save systems

Resiliency Pattern #3

State Transitions

Hystrix

  

  config := heimdall.HystrixCommandConfig{
      MaxConcurrentRequests:  100,
      ErrorPercentThreshold:  25,
      SleepWindow:            10,
      RequestVolumeThreshold: 10,
  }

  hystrixConfig := heimdall.NewHystrixConfig("MyCommand", 
                                              config)

  timeout := 10 * time.Millisecond
  httpClient := heimdall.NewHystrixHTTPClient(timeout,
                                              hystrixConfig)

  _, err := httpClient.Get("https://gojek.com/drivers", 
                            http.Header{})

Circumvent calls when system is unhealthy

Guards Integration Points

Metrics/Monitoring

Hystrix Dashboards

Fallbacks

Degrade Gracefully

Resiliency Pattern #4

Curious case of Maps Service

Route Distance

Fallback from Route Distance to Route Approximation

Route Approximation

Fallback to a Different Maps Provider

Helps Degrade gracefully

Protect Critical flows from Failure (Ex: Booking Flow)

Think of fallbacks at Integration points

Resiliency Testing

Resiliency Pattern #5

Test and Break

Find Failure modes

Create a Test Harness to break callers

Inject Failures

Unknown Unknowns

Simian Army

  • Chaos Monkey
  • Janitor Monkey
  • Conformity Monkey
  • Latency Monkey

More patterns

  • Rate-limit/Throttling
  • Bulk-heading
  • Queuing
  • Monitoring/alerting
  • Canary releases
  • Redundancies

In Conclusion ...

Patterns are no silver bullet

Systems Fail, Deal with it

Design Your Systems for Failure

Recap

  • Faults vs Failures
  • Timeouts
  • Retries
  • Circuit Breakers
  • Fallbacks
  • Resiliency Testing

War Stories

Come meet us ...

References

Thanks for listening

Yes, we're hiring!!

gojek.jobs

 

 

 

Questions ?

@rBharshetty

Resiliency in Distributed Systems

By Rajeev Bharshetty

Resiliency in Distributed Systems

  • 4,341