Fault tolerance in Ruby

 

 

Hubert Łępicki

wroc_love.rb 2017

@hubertlepicki

We'll talk fault tolerance...

...concurrency and scalability...

...and do some time travel!

Fault tolerance?

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.

 

random people from Wikipedia

(...) fault tolerance is the ability for software to detect and recover from a fault that is happening (...)

 

https://users.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/

Fault tolerance is hard

Let's go back to year 1992

"Of course 5 years from now that will be different, but 5 years from now everyone will be running free GNU on their 200 MIPS, 64M SPARCstation-5." 

Implementing fault-tolerant systems costs (money, time, effort, complexity)

Basic defence techniques

Exceptions

begin
  ...
rescue SomeExceptionClass => error
  logger.info "Exception caught"
  ...
end

begin
  ...
rescue SomeExceptionClass, SomeOther => error
  logger.info "Exception caught"
  ...
end

begin
  ...
rescue SomeException => e 
  ...
rescue SomeOther => e 
  ...
end

tries = 0

begin
  tries += 1
  ...
rescue
  retry if tries < 4
  ...
end

f = open("file.txt") rescue nil

begin
  AWSWrapper.some_external_operation
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end

require 'timeout'

begin
  Timeout::timeout(3) do
    AWSWrapper.some_external_operation
  end
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end
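
The retry and timeout snippets above can be combined. The following is a minimal sketch, not code from the talk: `with_retries` and its parameters (`attempts`, `timeout`, `base_delay`, `sleeper`) are hypothetical names chosen for illustration. It applies a per-attempt timeout and waits twice as long after each failure before trying again.

```ruby
require 'timeout'

# Hypothetical helper (not part of any library): runs the block with a
# per-attempt timeout and retries a bounded number of times, doubling
# the delay after each failed attempt.
def with_retries(attempts: 3, timeout: 3, base_delay: 0.1,
                 sleeper: Kernel.method(:sleep))
  tries = 0
  begin
    tries += 1
    # Timeout::Error is a StandardError in current Ruby versions,
    # so it is handled by the rescue below as well.
    Timeout.timeout(timeout) { yield tries }
  rescue StandardError
    raise if tries >= attempts          # give up: re-raise the last error
    sleeper.call(base_delay * (2**(tries - 1)))
    retry
  end
end
```

The `sleeper` argument exists only so the waiting can be replaced in tests; in normal use the default `Kernel#sleep` is enough, for example `with_retries(attempts: 4) { AWSWrapper.some_external_operation }`.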

Prevent cascading system failures

unicorn master -c /app/unicorn.rb -E production -D
unicorn worker[0] -c /app/unicorn.rb -E production -D
unicorn worker[1] -c /app/unicorn.rb -E production -D
unicorn worker[2] -c /app/unicorn.rb -E production -D
unicorn worker[3] -c /app/unicorn.rb -E production -D

Simple bulkhead may not be enough

Fail fast

to prevent crashing whole system
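
One way to picture a fail-fast bulkhead inside a single process is a fixed number of tickets per protected resource. This is only a sketch under assumptions: the `Bulkhead` class and `FullError` are hypothetical names, not an existing gem. Callers that cannot get a ticket are rejected immediately instead of queueing up behind a slow dependency.

```ruby
# Hypothetical in-process bulkhead: at most `tickets` callers may use the
# protected resource at once; extra callers fail fast instead of waiting.
class Bulkhead
  class FullError < StandardError; end

  def initialize(tickets)
    @tickets = tickets
    @in_use = 0
    @lock = Mutex.new
  end

  # Runs the block if a ticket is free, and always returns the ticket.
  def run
    acquire
    begin
      yield
    ensure
      release
    end
  end

  private

  def acquire
    @lock.synchronize do
      raise FullError, "no free tickets" if @in_use >= @tickets
      @in_use += 1
    end
  end

  def release
    @lock.synchronize { @in_use -= 1 }
  end
end
```

The lock is held only while counting tickets, never while the block runs, so a slow call blocks at most its own ticket.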

Circuit Breaker Pattern

  • Allows requests while in the "closed" state
  • Detects failures and switches to the "open" state
  • Fails all requests while in the "open" state
  • Switches back to the "closed" or "half-open" state after some interval

require 'circuit_breaker'

class ApiWrapper
  include CircuitBreaker

  def call_remote_service
    ...
  end

  circuit_method :call_remote_service
end
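
To make the state transitions concrete, here is a minimal hand-rolled sketch. It is not how the `circuit_breaker` gem is implemented; `SimpleBreaker`, `OpenError`, and the `clock` parameter are hypothetical names, and the sketch is not thread-safe. After the interval elapses it lets a single trial request through (a state often called half-open) and closes again only if that request succeeds.

```ruby
# Hypothetical, single-threaded circuit breaker for illustration only.
class SimpleBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(error_threshold: 3, error_timeout: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @error_timeout = error_timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      # Fail fast until the timeout has elapsed.
      raise OpenError, "circuit open" if @clock.call - @opened_at < @error_timeout
      @state = :half_open # interval elapsed: allow one trial request
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    reset
    result
  end

  private

  def record_failure
    @failures += 1
    return unless @state == :half_open || @failures >= @error_threshold
    @state = :open
    @opened_at = @clock.call
  end

  def reset
    @state = :closed
    @failures = 0
  end
end
```

The injectable `clock` is a testing convenience; by default the breaker reads the monotonic clock, which is unaffected by wall-clock adjustments.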

Semian

  • circuit breaker
  • ready to use adapters (mysql, redis, net/http)
  • fail fast philosophy
  • works with Ruby / OS processes (but NOT threads)
  • uses Unix IPC to synchronize
  • modifies behaviors of third party libs

SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }

Semian::NetHTTP.semian_configuration = proc do |host, port|

  # Let's make it only active for www.wrocloverb.com

  if host == "www.wrocloverb.com"
    SEMIAN_PARAMETERS.merge(name: "wroc_love.rb")
  else
    nil
  end
end

Microservices/SOA/breaking up into N apps

Persistent connections

Systems always break when no one is looking

opts = {
  ...
  heartbeat: 0,
  ...
}

Reason: firewalls and other "smart" network gear

Solution: enable heartbeat

On Linux, it takes 11-25 minutes to detect a "dead" TCP connection
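
The idea behind heartbeats can be sketched without any messaging library. The `HeartbeatMonitor` class below is hypothetical, and the rule of treating two missed intervals as a dead peer is an assumption made for the example. Each side records when it last heard from the other, so a silent connection is noticed within seconds rather than at the TCP level.

```ruby
# Hypothetical application-level heartbeat tracker (not a library API).
class HeartbeatMonitor
  def initialize(interval:,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @interval = interval
    @clock = clock
    @last_seen = clock.call
  end

  # Call whenever any frame or heartbeat arrives from the peer.
  def beat!
    @last_seen = @clock.call
  end

  # Assumption for this sketch: the peer is considered dead once more
  # than two heartbeat intervals pass without hearing from it.
  def dead?
    @clock.call - @last_seen > 2 * @interval
  end
end
```

With an interval of 10 seconds, a connection silently dropped by a firewall would be flagged after a little over 20 seconds, at which point the client can reconnect.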

While we're on RabbitMQ/Microservices/you name it...

Let's go back to year 1986

Actor model concurrency

Let it fail / fail fast

Avoid defensive programming

Built-in mechanisms to detect crashes
(monitors, links)

Built-in mechanism to recover from errors
(supervisors)
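
As a rough illustration of the supervisor idea in plain Ruby, the sketch below runs a worker in a thread and restarts it when it crashes. The `Supervisor` class and its `max_restarts` parameter are hypothetical and far simpler than Erlang/OTP supervisors. The worker itself contains no defensive code; recovery lives entirely in the supervisor.

```ruby
# Hypothetical supervisor: restarts a crashing worker a bounded number
# of times, then gives up and propagates the error.
class Supervisor
  attr_reader :restarts

  def initialize(max_restarts: 3, &worker)
    @worker = worker
    @max_restarts = max_restarts
    @restarts = 0
  end

  def run
    begin
      thread = Thread.new do
        # Crashes are handled by the supervisor, so silence the
        # default "thread terminated with exception" report.
        Thread.current.report_on_exception = false
        @worker.call
      end
      thread.value # waits for the worker; re-raises its exception here
    rescue StandardError
      raise if @restarts >= @max_restarts
      @restarts += 1
      retry
    end
  end
end
```

Limiting the number of restarts matters: a worker that fails every time should eventually escalate the failure instead of looping forever.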

Kill 2 birds with one stone

Actor model for Ruby?

class Counter
  # This is all you have to do to turn any Ruby
  # class into one which creates
  # Celluloid actors instead of normal objects
  include Celluloid

  # Now just define methods like you ordinarily would
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment(n = 1)
    @count += n
  end
end

actor = Counter.new

p actor.count

# Log *all* exceptions thrown by *all* actors in the system
Celluloid.exception_handler { |ex| MyNotifier.notify(ex) }

# Reference your actors by name
Celluloid::Actor[:itchy] = Itchy.new
Celluloid::Actor[:itchy].scratch


# Supervise your actors
class MyGroup < Celluloid::SupervisionGroup
  supervise Itchy,    as: :itchy
end

Celluloid

  • turns objects into actors
  • allows linking actors
  • allows supervision/restarts
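
What "turning an object into an actor" means can be sketched with only a thread and a queue. This is not how Celluloid is implemented; `CounterActor` is a hypothetical class showing the core idea: one thread owns the state and processes messages from a mailbox one at a time, so no locks are needed around `@count`.

```ruby
# Hypothetical hand-rolled actor: state is touched only by the actor's
# own thread, and callers communicate through a mailbox queue.
class CounterActor
  def initialize
    @count = 0
    @mailbox = Queue.new
    @thread = Thread.new { process_messages }
  end

  # Asynchronous message: enqueue and return immediately.
  def increment(n = 1)
    @mailbox << [:increment, n, nil]
    nil
  end

  # Synchronous message: wait for the actor to reply.
  def count
    reply = Queue.new
    @mailbox << [:count, nil, reply]
    reply.pop
  end

  def stop
    @mailbox << [:stop, nil, nil]
    @thread.join
  end

  private

  def process_messages
    loop do
      type, arg, reply = @mailbox.pop
      case type
      when :increment then @count += arg
      when :count     then reply << @count
      when :stop      then break
      end
    end
  end
end
```

Because messages are handled strictly in order, many threads can call `increment` concurrently without corrupting the counter.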

Let's move forward to year 2020

Ruby 3.0.0

Koichi Sasada proposed Guilds

Elixir/Erlang-inspired

Elements of immutability and actor model concurrency

Guilds will simplify concurrency

Guilds will improve fault tolerance

Resources

Circuit Breaker

That's it for today!
Thanks!
