Fault tolerance in Ruby

 

 

Hubert Łepicki

Warsaw Ruby Users Group, 17.05.2017

@hubertlepicki

We'll talk fault tolerance...

...concurrency and scalability...

...and do some time travel!

Fault tolerance?

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.

 

random people from Wikipedia

(...) fault tolerance is the ability for software to detect and recover from a fault that is happening (...)

 

https://users.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/

Fault tolerance is hard

It may be okay not to care (yet)

Let's go back to year 1992

"Of course 5 years from now that will be different, but 5 years from now everyone will be running free GNU on their 200 MIPS, 64M SPARCstation-5." 

Implementing fault-tolerant systems costs (money, time, effort, complexity)

If you think you should care
...what now?

Defensive programming

Exceptions and timeouts

begin
  ...
rescue SomeExceptionClass => error
  logger.info "Exception caught"
  ...
end
begin
  ...
rescue SomeExceptionClass, SomeOther => error
  logger.info "Exception caught"
  ...
end
begin
  ...
rescue SomeException => e 
  ...
rescue SomeOther => e 
  ...
end

Simple retry

tries = 0

begin
  tries += 1
  ...
rescue
  retry if tries < 4
  ...
end

Bulkheads

begin
  AWSWrapper.some_external_operation
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end
require 'timeout'

begin
  Timeout::timeout(3) do
    AWSWrapper.some_external_operation
  end
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end

Cascading failures

  • failure in one module takes down the rest

Slow HTTP requests

Failure to send e-mails

Stripe API problems

  • registration / plan updating calls Stripe API
  • other pages do not call API
  • whole system got unresponsive when Stripe had failure / API calls were taking lots of time
unicorn master -c /app/unicorn.rb -E production -D
unicorn worker[0] -c /app/unicorn.rb -E production -D
unicorn worker[1] -c /app/unicorn.rb -E production -D
unicorn worker[2] -c /app/unicorn.rb -E production -D
unicorn worker[3] -c /app/unicorn.rb -E production -D

Simple bulkheads may not be enough

Prevent cascading system failures

Option 0:

Improve your web server configuration

(look for timeout option)

Option 1: go async

  • user visits payment page
  • we initialize API call in background job
  • user polls back-end for job status
  • or: use ActionCable to notify browser

Option 2: Fail fast

to prevent crashing whole system

Circuit breaker

  • circuit is closed == power can kill you
  • circuit open == you made an error but you're safe

Circuit Breaker Pattern

  • Allows requests when "closed" state
  • Detects failures and switches to "open" state
  • Fails all requests while in "open" state
  • Switches back to "closed" or "half-closed" state after some interval
require 'circuit_breaker'

class ApiWrapper
  include CircuitBreaker

  def call_remote_service
    ...
  end

  circuit_method :call_remote_service
end

Semian

  • circuit breaker
  • ready to use adapters (mysql, redis, net/http)
  • fail fast philosophy
  • works with Ruby / OS processes (but NOT threads)
  • uses Unix IPC to synchronize
  • modifies behaviors of third party libs
  • allows rate limiting for API calls/resources usage
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }

Semian::NetHTTP.semian_configuration = proc do |host, port|

  # Let's make it only active for www.wrug.eu

  if host == "www.wrug.eu"
    SEMIAN_PARAMETERS.merge(name: "wrug.eu")
  else
    nil
  end
end

Option 3: Overall architecture & design

Microservices/SOA/breaking up into N apps

Whie we're on RabbitMQ/Microservices/you name it...

Microservices are hard

More options:

Option 4:
Heavy caching

Option 5:

Event Sourcing + CQRS

Option 6: steal ideas from others Erlang

Let's go back to year 1986

Actor model concurrency

Let it fail / fail fast

Avoid defensive programming

Built-in mechanisms to detect crashes
(monitors, links)

Built-in mechanism to recover from errors
(supervisors)

Kill 2 birds with one stone

Actor model for Ruby?

class Counter
  # This is all you have to do to turn any Ruby
  # class into one which creates
  # Celluloid actors instead of normal objects
  include Celluloid

  # Now just define methods like you ordinarily would
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment(n = 1)
    @count += n
  end
end

actor = Counter.new

p actor.count
# Log *all* exceptions thrown by *all* actors in the system
Celluloid.exception_handler { |ex| MyNotifier.notify(ex) }

# Reference your actors by name
Celluloid::Actor[:itchy]    = Itchy.new
Actor[:itchy].scratch()


# Supervise your actors
class MyGroup < Celluloid::SupervisionGroup
  supervise Itchy,    as: :itchy
end

Celluloid

  • turns objects into actors
  • allows linking actors
  • allows supervision/restarts

Let's move forward to year 2020

Ruby 3.0.0

Koichi Sasada

proposed
Guilds

Elixir/Erlang-inspired

Elements of immutability and actor model concurrency

Guilds will simplify concurrency

Guilds will improve fault tolerance

fin.

Resources

Circuit Breaker

That's it for today!
Thanks!

Fault tolerance in Ruby - Hubert Łępicki - WRUG

By Hubert Łępicki

Fault tolerance in Ruby - Hubert Łępicki - WRUG

  • 1,711