Fault tolerance in Ruby
Hubert Łepicki
Warsaw Ruby Users Group, 17.05.2017
@hubertlepicki
We'll talk fault tolerance...
...concurrency and scalability...
...and do some time travel!
Fault tolerance?
Fault tolerance is the property that enables a system to continue operating properly in the event of the failure of (or one or more faults within) some of its components.
random people from Wikipedia
(...) fault tolerance is the ability for software to detect and recover from a fault that is happening (...)
https://users.ece.cmu.edu/~koopman/des_s99/sw_fault_tolerance/
Fault tolerance is hard
It may be okay not to care (yet)
Let's go back to year 1992
"Of course 5 years from now that will be different, but 5 years from now everyone will be running free GNU on their 200 MIPS, 64M SPARCstation-5."
Implementing fault-tolerant systems costs (money, time, effort, complexity)
If you think you should care
...what now?
Defensive programming
Exceptions and timeouts
begin
  ...
rescue SomeExceptionClass => error
  logger.info "Exception caught"
  ...
end
begin
  ...
rescue SomeExceptionClass, SomeOther => error
  logger.info "Exception caught"
  ...
end
begin
  ...
rescue SomeException => e
  ...
rescue SomeOther => e
  ...
end
Simple retry
tries = 0
begin
  tries += 1
  ...
rescue
  retry if tries < 4
  ...
end
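The retry above hammers the failing dependency immediately. A minimal sketch of the same idea with exponential backoff follows; `with_retries` and its parameters are hypothetical names chosen for illustration, not part of any library.

```ruby
# Hypothetical helper (an assumption, not from the talk): retries the block
# when one of the listed error classes is raised, sleeping
# base_delay * 2**(attempt - 1) seconds between attempts.
def with_retries(max_tries: 4, base_delay: 0.1, on: [StandardError])
  tries = 0
  begin
    tries += 1
    yield tries
  rescue *on
    raise if tries >= max_tries        # budget used up: re-raise the last error
    sleep(base_delay * 2**(tries - 1)) # 0.1s, 0.2s, 0.4s, ...
    retry
  end
end

# Usage: the block receives the attempt number.
# with_retries(max_tries: 3, on: [IOError]) { |attempt| call_flaky_service }
```

Listing the error classes explicitly keeps the helper from retrying bugs such as NoMethodError when a narrower list is passed.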
Bulkheads
begin
  AWSWrapper.some_external_operation
rescue StandardError => error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end
require 'timeout'
begin
  # Give up if the call takes longer than 3 seconds
  Timeout.timeout(3) do
    AWSWrapper.some_external_operation
  end
rescue StandardError => error # also catches Timeout::Error
  logger.error "Exception intercepted calling AWSWrapper"
  logger.error error.message
  logger.error error.backtrace.join("\n")
end
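The two bulkhead snippets repeat the same rescue-and-log code. One way to factor it out is sketched below; `bulkhead`, its `fallback:` argument and the injected `logger:` are hypothetical names for illustration, while `Timeout.timeout` and `Logger` come from the Ruby standard library.

```ruby
require 'logger'
require 'timeout'

# Hypothetical helper (an assumption, not from the talk): runs the block with
# a time limit and returns `fallback` instead of letting the error propagate.
def bulkhead(name, seconds: 3, fallback: nil, logger: Logger.new($stderr))
  Timeout.timeout(seconds) { yield }
rescue StandardError => error # Timeout::Error is a StandardError too
  logger.error "Exception intercepted calling #{name}: #{error.class}: #{error.message}"
  fallback
end

# Usage:
# avatar_url = bulkhead("AWSWrapper", seconds: 3, fallback: DEFAULT_AVATAR) do
#   AWSWrapper.some_external_operation
# end
```

The caller has to be able to live with the fallback value; if it cannot, the error should be allowed to propagate instead.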
Cascading failures
- failure in one module takes down the rest
Slow HTTP requests & failure to send e-mails
Stripe API problems
- registration / plan updating calls Stripe API
- other pages do not call API
- whole system got unresponsive when Stripe had failure / API calls were taking lots of time
unicorn master -c /app/unicorn.rb -E production -D
unicorn worker[0] -c /app/unicorn.rb -E production -D
unicorn worker[1] -c /app/unicorn.rb -E production -D
unicorn worker[2] -c /app/unicorn.rb -E production -D
unicorn worker[3] -c /app/unicorn.rb -E production -D
Simple bulkheads may not be enough
Prevent cascading system failures
Option 0: Improve your web server configuration
(look for timeout option)
Option 1: go async
- user visits payment page
- we initialize API call in background job
- user polls back-end for job status
- or: use ActionCable to notify browser
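The enqueue-then-poll flow above can be sketched with an in-memory stand-in for a job system. `PaymentJobs` is a hypothetical class for illustration only; a real application would use a persistent job queue, and the status lookup would sit behind an HTTP endpoint.

```ruby
require 'securerandom'

# Hypothetical in-memory stand-in for a background job system (an assumption,
# not from the talk). The slow call runs in a thread; callers poll by job id.
class PaymentJobs
  def initialize
    @statuses = {}
    @lock = Mutex.new
  end

  # Starts the slow call in the background and returns a job id at once.
  def enqueue(&slow_call)
    id = SecureRandom.hex(8)
    set(id, { status: :pending })
    Thread.new do
      begin
        set(id, { status: :done, result: slow_call.call })
      rescue StandardError => error
        set(id, { status: :failed, error: error.message })
      end
    end
    id
  end

  # What the polling endpoint would return to the browser.
  def status(id)
    @lock.synchronize { @statuses.fetch(id, { status: :unknown }) }
  end

  private

  def set(id, state)
    @lock.synchronize { @statuses[id] = state }
  end
end
```

The web request only pays for `enqueue`, so a slow payment API no longer ties up a web worker for the duration of the call.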
Option 2: Fail fast
to prevent crashing whole system
Circuit breaker
- circuit is closed == power can kill you
- circuit open == you made an error but you're safe
Circuit Breaker Pattern
- Allows requests when "closed" state
- Detects failures and switches to "open" state
- Fails all requests while in "open" state
- After some interval switches to "half-open" state: a trial request is allowed, and success switches back to "closed"
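The state machine described above fits in a few lines of plain Ruby. This is a minimal, single-threaded sketch; `SimpleCircuitBreaker`, its parameter names and the injectable `clock:` are assumptions made for illustration and do not mirror the API of any gem. The trial state after the interval is the one usually called half-open.

```ruby
# Hypothetical minimal circuit breaker (an assumption, not a library API).
# Not thread-safe; meant only to show the state transitions.
class SimpleCircuitBreaker
  class OpenError < StandardError; end

  attr_reader :state

  def initialize(error_threshold: 3, reset_timeout: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @error_threshold = error_threshold
    @reset_timeout = reset_timeout
    @clock = clock
    @state = :closed
    @failures = 0
    @opened_at = nil
  end

  def call
    if @state == :open
      # Fail fast without touching the remote service.
      raise OpenError, "circuit open" if @clock.call - @opened_at < @reset_timeout
      @state = :half_open # interval elapsed: let one trial request through
    end

    begin
      result = yield
    rescue StandardError
      record_failure
      raise
    end
    @failures = 0
    @state = :closed
    result
  end

  private

  def record_failure
    @failures += 1
    return unless @state == :half_open || @failures >= @error_threshold
    @state = :open
    @opened_at = @clock.call
  end
end
```

Passing the clock in makes the timeout behaviour testable without sleeping.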
require 'circuit_breaker'
class ApiWrapper
  include CircuitBreaker
  def call_remote_service
    ...
  end
  circuit_method :call_remote_service
end
Semian
- circuit breaker
- ready to use adapters (mysql, redis, net/http)
- fail fast philosophy
- works with Ruby / OS processes (but NOT threads)
- uses Unix IPC to synchronize
- modifies behaviors of third party libs
- allows rate limiting for API calls/resources usage
SEMIAN_PARAMETERS = { tickets: 1,
                      success_threshold: 1,
                      error_threshold: 3,
                      error_timeout: 10 }
Semian::NetHTTP.semian_configuration = proc do |host, port|
  # Let's make it only active for www.wrug.eu
  if host == "www.wrug.eu"
    SEMIAN_PARAMETERS.merge(name: "wrug.eu")
  else
    nil
  end
end
Option 3: Overall architecture & design
Microservices/SOA/breaking up into N apps
While we're on RabbitMQ/Microservices/you name it...
Microservices are hard
More options:
Option 4: Heavy caching
Option 5: Event Sourcing + CQRS
Option 6: steal ideas from others: Erlang
Let's go back to year 1986
Actor model concurrency
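An actor owns its state and changes it only in response to messages taken one at a time from a mailbox. A rough sketch in plain Ruby, using a Thread and a Queue, looks like this; `CounterActor` is a hypothetical class for illustration and is far simpler than an Erlang process.

```ruby
# Hypothetical minimal actor (an assumption, not from the talk): private
# state, a mailbox, and one thread that handles messages in order.
class CounterActor
  def initialize
    @count = 0
    @mailbox = Queue.new
    @thread = Thread.new { run }
  end

  # Asynchronous send: enqueue the message and return immediately.
  def increment(n = 1)
    @mailbox << [:increment, n]
    self
  end

  # Synchronous query: send a reply queue and wait for the answer.
  def count
    reply = Queue.new
    @mailbox << [:count, reply]
    reply.pop
  end

  def stop
    @mailbox << [:stop, nil]
    @thread.join
  end

  private

  # Only this thread ever touches @count, so no lock is needed.
  def run
    loop do
      type, payload = @mailbox.pop
      case type
      when :increment then @count += payload
      when :count     then payload << @count
      when :stop      then break
      end
    end
  end
end
```

Many threads can call `increment` concurrently; the mailbox serializes the updates.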
Let it fail / fail fast
Avoid defensive programming
Built-in mechanisms to detect crashes
(monitors, links)
Built-in mechanism to recover from errors
(supervisors)
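The supervisor idea, let the worker crash and have something else restart it, can be approximated with threads. `SimpleSupervisor` below is a hypothetical sketch under that assumption; Erlang supervisors are considerably richer (restart strategies, supervision trees), and this is not how Erlang or Celluloid implement them.

```ruby
# Hypothetical supervisor (an assumption, not from the talk): runs the worker
# block in a thread and restarts it after a crash, up to max_restarts times.
class SimpleSupervisor
  attr_reader :restarts

  def initialize(max_restarts: 5, &worker)
    @max_restarts = max_restarts
    @worker = worker
    @restarts = 0
  end

  # Blocks until the worker finishes normally or the restart budget runs out.
  def run
    loop do
      thread = Thread.new do
        Thread.current.report_on_exception = false # the supervisor reports crashes
        @worker.call
      end
      begin
        return thread.value # joins; re-raises the exception that killed the worker
      rescue StandardError => error
        raise if @restarts >= @max_restarts
        @restarts += 1
        warn "worker crashed (#{error.class}: #{error.message}), restart ##{@restarts}"
      end
    end
  end
end
```

The worker itself contains no rescue clauses: error handling lives in one place, which is the point of "let it fail".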
Kill 2 birds with one stone
Actor model for Ruby?
class Counter
  # This is all you have to do to turn any Ruby
  # class into one which creates
  # Celluloid actors instead of normal objects
  include Celluloid
  # Now just define methods like you ordinarily would
  attr_reader :count
  def initialize
    @count = 0
  end
  def increment(n = 1)
    @count += n
  end
end
actor = Counter.new
p actor.count
# Log *all* exceptions thrown by *all* actors in the system
Celluloid.exception_handler { |ex| MyNotifier.notify(ex) }
# Reference your actors by name
Celluloid::Actor[:itchy] = Itchy.new
Celluloid::Actor[:itchy].scratch
# Supervise your actors
class MyGroup < Celluloid::SupervisionGroup
  supervise Itchy, as: :itchy
end
Celluloid
- turns objects into actors
- allows linking actors
- allows supervision/restarts
Let's move forward to year 2020
Ruby 3.0.0
Koichi Sasada proposed Guilds
Elixir/Erlang-inspired
Elements of immutability and actor model concurrency
Guilds will simplify concurrency
Guilds will improve fault tolerance
fin.
Resources
Circuit Breaker
by Martin Fowler
A proposal of new concurrency model for Ruby 3
Koichi Sasada
http://www.atdot.net/~ko1/activities/2016_rubykaigi.pdf
https://www.youtube.com/watch?v=WIrYh14H9kA&feature=youtu.be
That's it for today!
Thanks!