Fault tolerance in Ruby

Hubert Łępicki

Warsaw Ruby Users Group, 17.05.2017

@hubertlepicki

We'll talk fault tolerance...

...concurrency and scalability...

...and do some time travel!

Fault tolerance?

Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.

random people from Wikipedia

(...) fault tolerance is the ability for software to detect and recover from a fault that is happening (...)

https://users.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/

Fault tolerance is hard

It may be okay not to care (yet)

Let's go back to year 1992

"Of course 5 years from now that will be different, but 5 years from now everyone will be running free GNU on their 200 MIPS, 64M SPARCstation-5."

Implementing fault-tolerant systems costs (money, time, effort, complexity)

If you think you should care
...what now?

Defensive programming

Exceptions and timeouts

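begin
  # ... code that may raise
rescue SomeExceptionClass => error
  # rescue a single exception class
  logger.info "Exception caught: #{error.message}"
  # ...
end

begin
  # ... code that may raise
rescue SomeExceptionClass, SomeOther => error
  # one handler shared by several exception classes
  logger.info "Exception caught: #{error.message}"
  # ...
end

begin
  # ... code that may raise
rescue SomeException => e
  # ... handler specific to SomeException
rescue SomeOther => e
  # ... handler specific to SomeOther
end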

Simple retry

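tries = 0

begin
  tries += 1
  # ... operation that may fail
rescue StandardError
  retry if tries < 4 # up to 4 attempts in total
  # ... out of attempts: handle the failure
end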
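
A bare retry hammers a struggling service as fast as it can. Below is a minimal sketch of the same idea with a growing pause between attempts; `with_retries`, `max_tries`, `base_delay` and `on` are hypothetical names made up for this example, not part of any library.

```ruby
# Hypothetical helper: run the block up to max_tries times,
# sleeping a little longer after each failed attempt.
def with_retries(max_tries: 4, base_delay: 0.1, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= max_tries            # out of attempts: re-raise the last error
    sleep(base_delay * (2**(tries - 1)))   # base_delay, then 2x, then 4x, ...
    retry
  end
end

# Usage: a simulated flaky operation that succeeds on the third attempt.
attempts = 0
result = with_retries(base_delay: 0.01) do
  attempts += 1
  raise IOError, "temporary failure" if attempts < 3
  :ok
end
```

Once the attempts run out the last error is re-raised, so the caller still decides what a final failure means.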

Bulkheads

begin
  AWSWrapper.some_external_operation
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end

require 'timeout'

begin
  Timeout.timeout(3) do
    AWSWrapper.some_external_operation
  end
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end

Cascading failures

failure in one module takes down the rest

Slow HTTP requests

&

Failure to send e-mails

Stripe API problems

registration / plan updating calls Stripe API
other pages do not call API
the whole system became unresponsive when Stripe had a failure and API calls took a long time

unicorn master -c /app/unicorn.rb -E production -D
unicorn worker[0] -c /app/unicorn.rb -E production -D
unicorn worker[1] -c /app/unicorn.rb -E production -D
unicorn worker[2] -c /app/unicorn.rb -E production -D
unicorn worker[3] -c /app/unicorn.rb -E production -D

Simple bulkheads may not be enough

Prevent cascading system failures

Option 0:

Improve your web server configuration

(look for timeout option)

Option 1: go async

user visits the payment page
we start the API call in a background job
user polls the back-end for job status
or: use ActionCable to notify the browser
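
The flow above, sketched in plain Ruby with a thread standing in for the background job. `JobStore`, `enqueue` and `status` are hypothetical names; a real application would more likely use a job queue backed by a database or Redis rather than an in-memory hash.

```ruby
require 'securerandom'

# Hypothetical in-memory job store, for illustration only.
class JobStore
  def initialize
    @jobs = {}
    @lock = Mutex.new
  end

  # Start the work in a background thread and return a job id right away.
  def enqueue(&work)
    id = SecureRandom.uuid
    @lock.synchronize { @jobs[id] = { status: :pending } }
    Thread.new do
      outcome = begin
        { status: :done, result: work.call }
      rescue StandardError => error
        { status: :failed, error: error.message }
      end
      @lock.synchronize { @jobs[id] = outcome }
    end
    id
  end

  # What a polling endpoint would return to the browser.
  def status(id)
    @lock.synchronize { @jobs.fetch(id).dup }
  end
end

store = JobStore.new
# The sleep stands in for a slow external API call.
job_id = store.enqueue { sleep 0.05; "charge accepted" }

# The "browser" polls until the job leaves the pending state.
sleep 0.01 until store.status(job_id)[:status] != :pending
p store.status(job_id)
```

The request that starts the job returns immediately, so a slow API keeps one background thread busy instead of a web worker.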

Option 2: Fail fast

to prevent crashing the whole system

Circuit breaker

circuit is closed == power can kill you
circuit open == you made an error but you're safe

Circuit Breaker Pattern

Allows requests while in the "closed" state
Detects failures and switches to the "open" state
Fails all requests while in the "open" state
Switches back to the "closed" or "half-open" state after some interval
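
A minimal, single-threaded sketch of these four rules. `SimpleCircuitBreaker` and its parameters are hypothetical names invented for this example; this is not the API of the circuit_breaker gem or of Semian. The trial state after the interval is the one usually called "half-open".

```ruby
# Hypothetical class, not thread-safe: an illustration of the state machine only.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(error_threshold: 3, reset_timeout: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @failures = 0
    @state = :closed
    @opened_at = nil
  end

  def call
    if @state == :open
      # Fail fast without touching the remote service.
      raise OpenError, "circuit open, failing fast" if @clock.call - @opened_at < @reset_timeout
      @state = :half_open # interval elapsed: let one trial request through
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    reset
    result
  end

  private

  def record_failure
    @failures += 1
    trip if @state == :half_open || @failures >= @error_threshold
  end

  def trip
    @state = :open
    @opened_at = @clock.call
  end

  def reset
    @failures = 0
    @state = :closed
  end
end

# Usage with a fake clock so the example does not have to wait.
now = 0.0
breaker = SimpleCircuitBreaker.new(error_threshold: 2, reset_timeout: 10, clock: -> { now })
2.times do
  begin
    breaker.call { raise IOError, "remote service down" }
  rescue IOError
    # the caller still sees the original error while the circuit is closed
  end
end
p breaker.state # => :open
```

While the circuit is open, callers get `OpenError` at once instead of waiting on a dead service.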

require 'circuit_breaker'

class ApiWrapper
  include CircuitBreaker

  def call_remote_service
    ...
  end

  circuit_method :call_remote_service
end

Semian

circuit breaker
ready-to-use adapters (mysql, redis, net/http)
fail fast philosophy
works with Ruby / OS processes (but NOT threads)
uses Unix IPC to synchronize
modifies behavior of third-party libs
allows rate limiting for API calls/resource usage

require 'semian'
require 'semian/net_http'

SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }

Semian::NetHTTP.semian_configuration = proc do |host, port|

  # Let's make it only active for www.wrug.eu

  if host == "www.wrug.eu"
    SEMIAN_PARAMETERS.merge(name: "wrug.eu")
  else
    nil
  end
end

Option 3: Overall architecture & design

Microservices/SOA/breaking up into N apps

While we're on RabbitMQ/Microservices/you name it...

Microservices are hard

More options:

Option 4:
Heavy caching
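
One way caching helps fault tolerance is by serving a stale copy when the upstream call fails. A minimal sketch, assuming an in-memory cache; `StaleOnErrorCache`, `ttl` and `clock` are hypothetical names made up for this example.

```ruby
# Hypothetical cache: fresh entries are served directly, expired entries
# are refreshed, and a failed refresh falls back to the stale value.
class StaleOnErrorCache
  Entry = Struct.new(:value, :stored_at)

  def initialize(ttl: 60, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @ttl = ttl
    @clock = clock
    @entries = {}
  end

  def fetch(key)
    entry = @entries[key]
    return entry.value if entry && @clock.call - entry.stored_at < @ttl

    begin
      value = yield
      @entries[key] = Entry.new(value, @clock.call)
      value
    rescue StandardError
      raise unless entry # nothing cached yet: the failure has to surface
      entry.value        # serve the stale copy instead of failing
    end
  end
end

# Usage with a fake clock.
now = 0.0
cache = StaleOnErrorCache.new(ttl: 60, clock: -> { now })
cache.fetch(:plans) { ["basic", "pro"] }             # upstream healthy: value is cached
now = 120.0                                          # the entry is now past its TTL
p cache.fetch(:plans) { raise IOError, "API down" }  # => ["basic", "pro"]
```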

Option 5:

Event Sourcing + CQRS

Option 6: steal ideas from others (Erlang)

Let's go back to year 1986

Actor model concurrency
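
The core of the actor model fits in a few lines of plain Ruby: one thread, one mailbox, and state that only that thread touches. `CounterActor` is a hypothetical class written for this example, not how Erlang or any Ruby library implements actors.

```ruby
# Hypothetical actor built from a Thread and a Queue.
class CounterActor
  def initialize
    @count = 0
    @mailbox = Queue.new
    @thread = Thread.new { run }
  end

  # Asynchronous message: returns immediately.
  def increment(n = 1)
    @mailbox << [:increment, n]
    self
  end

  # Synchronous message: waits for the actor to reply on a private queue.
  def count
    reply = Queue.new
    @mailbox << [:count, reply]
    reply.pop
  end

  def stop
    @mailbox << [:stop]
    @thread.join
  end

  private

  # Messages are handled one at a time, so @count needs no lock.
  def run
    loop do
      message, arg = @mailbox.pop
      case message
      when :increment then @count += arg
      when :count     then arg << @count
      when :stop      then break
      end
    end
  end
end

# Usage: ten threads send messages concurrently, the actor serializes them.
counter = CounterActor.new
10.times.map { Thread.new { 100.times { counter.increment } } }.each(&:join)
p counter.count # => 1000
counter.stop
```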

Let it fail / fail fast

Avoid defensive programming

Built-in mechanisms to detect crashes
(monitors, links)

Built-in mechanism to recover from errors
(supervisors)
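
What a supervisor does can be sketched in plain Ruby: run a worker, notice when it crashes, and start a fresh one. `SimpleSupervisor` and `max_restarts` are hypothetical names for this example; Erlang supervisors offer far more (restart strategies, supervision trees).

```ruby
# Hypothetical supervisor: restarts a crashed worker, giving up after max_restarts.
class SimpleSupervisor
  attr_reader :restarts

  def initialize(max_restarts: 3, &worker)
    @max_restarts = max_restarts
    @worker = worker
    @restarts = 0
  end

  # Blocks until the worker finishes normally or the restart budget is used up.
  def run
    loop do
      thread = Thread.new do
        Thread.current.report_on_exception = false # the supervisor handles crashes
        @worker.call
      end
      begin
        return thread.value # waits for the worker; re-raises its exception here
      rescue StandardError => error
        raise if @restarts >= @max_restarts
        @restarts += 1
        warn "worker crashed (#{error.message}), restart #{@restarts}"
      end
    end
  end
end

# Usage: the worker crashes twice, then completes.
runs = 0
supervisor = SimpleSupervisor.new(max_restarts: 3) do
  runs += 1
  raise "crash" if runs < 3
  "finished on run #{runs}"
end
p supervisor.run      # => "finished on run 3"
p supervisor.restarts # => 2
```

The worker itself contains no rescue blocks: it is allowed to fail, and recovery lives in one place.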

Kill 2 birds with one stone

Actor model for Ruby?

class Counter
  # This is all you have to do to turn any Ruby
  # class into one which creates
  # Celluloid actors instead of normal objects
  include Celluloid

  # Now just define methods like you ordinarily would
  attr_reader :count

  def initialize
    @count = 0
  end

  def increment(n = 1)
    @count += n
  end
end

actor = Counter.new

p actor.count

# Log *all* exceptions thrown by *all* actors in the system
Celluloid.exception_handler { |ex| MyNotifier.notify(ex) }

# Reference your actors by name
Celluloid::Actor[:itchy] = Itchy.new
Celluloid::Actor[:itchy].scratch

# Supervise your actors
class MyGroup < Celluloid::SupervisionGroup
  supervise Itchy, as: :itchy
end

Celluloid

turns objects into actors
allows linking actors
allows supervision/restarts

Let's move forward to year 2020

Ruby 3.0.0

Koichi Sasada

proposed
Guilds

Elixir/Erlang-inspired

Elements of immutability and actor model concurrency

Guilds will simplify concurrency

Guilds will improve fault tolerance

fin.

Resources

Circuit Breaker

by Martin Fowler

https://martinfowler.com/bliki/CircuitBreaker.html

A proposal of new concurrency model for Ruby 3

Koichi Sasada

http://www.atdot.net/~ko1/activities/2016_rubykaigi.pdf

https://www.youtube.com/watch?v=WIrYh14H9kA&feature=youtu.be

That's it for today!
Thanks!
