Erlang for software fault tolerance
Geovane Fedrecheski
Universidade de São Paulo
Escola Politécnica da USP
geonnave@gmail.com | geovane@lsi.usp.br
Contents
- Historical perspective
- Technical details
- Fault tolerance analysis
- Conclusions
History
In the early 1980s, Ericsson Telecom engineers found no programming language suited to provide:
- Fault-tolerance
- High-availability
- Soft Real-Time
A research project was then conducted, whose outcome was Erlang.
Erlang can be divided into three levels:
- Programming language
- Framework (OTP)
- Runtime
Technical Details
High Level Architecture
High Level Taxonomy
Processes
pid_a = spawn(fn -> IO.puts("Hello World!") end)
Spawns a new process and returns its PID
An anonymous function is the argument
# The crash stays inside the new process; spawn still returns a PID.
pid_b = spawn(fn -> raise "an exception!" end)
Non-OS Processes
Erlang processes are not OS processes
Instead, they run inside the Erlang VM
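Because these processes are managed by the VM rather than the operating system, spawning thousands of them is routine. The following is a minimal sketch, assuming a plain Elixir script; the `{:done, i}` message shape is made up for illustration.

```elixir
parent = self()

# Spawn 10_000 lightweight processes; each reports its own number back.
pids =
  for i <- 1..10_000 do
    spawn(fn -> send(parent, {:done, i}) end)
  end

# Collect one reply per spawned process and add the numbers up.
total =
  Enum.reduce(pids, 0, fn _pid, acc ->
    receive do
      {:done, i} -> acc + i
    end
  end)

IO.puts("#{length(pids)} processes replied, total = #{total}")
```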
Message Passing
- No shared memory
- Processes must communicate via message passing
pid_b = spawn(fn ->        # Spawns a concurrent process.
  receive do               # Waits for an incoming message.
    msg -> IO.puts(msg)
  end
end)
pid_a = spawn(fn ->        # Spawns a second concurrent process.
  send(pid_b, "Hello")     # Sends a message to the process
                           # identified by `pid_b`.
end)
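Since nothing is shared, a reply also has to travel as a message. A possible request/reply round trip is sketched below; the `:echo` and `:reply` tags and the one-second timeout are assumptions made for this example.

```elixir
# Hypothetical echo process: replies to whoever sent the request.
echo =
  spawn(fn ->
    receive do
      {:echo, from, text} -> send(from, {:reply, text})
    end
  end)

# The request carries the sender's PID so the echo process knows where to answer.
send(echo, {:echo, self(), "Hello"})

reply =
  receive do
    {:reply, text} -> text
  after
    1_000 -> :timeout
  end

IO.inspect(reply)
```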
Fail-fast processes
- If a process encounters an error, it terminates
- This is to prevent errors from propagating
Process Links
- Two processes can be "linked"
- If one dies, the other dies too
- To prevent "orphan processes"
spawn_link(fn -> raise "I die, so does my father" end)
- The parent process may "trap exits"
- Instead of dying too, just receives a message
Process.flag(:trap_exit, true)
spawn_link(fn -> raise "I die, but not my father" end)
- However, if the parent dies, so do all its children!
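When exits are trapped, the death of a linked process arrives as an ordinary `{:EXIT, pid, reason}` message. A small sketch of reading it follows; the `:boom` reason and the `:child_died` tag are arbitrary choices for illustration.

```elixir
# Turn exit signals from linked processes into messages.
Process.flag(:trap_exit, true)

# The linked child terminates immediately with reason :boom.
child = spawn_link(fn -> exit(:boom) end)

result =
  receive do
    {:EXIT, ^child, reason} -> {:child_died, reason}
  after
    1_000 -> :no_exit_message
  end

IO.inspect(result)
```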
Supervisors
We have:
- A parent that receives a message upon a child's death
- Children that die upon the parent's death
The idea: the parent can take action when a child terminates
import Supervisor.Spec
children = [
worker(Cache, []),
worker(DatabaseWorker, []),
worker(TCP.Acceptor, [4040])
]
Supervisor.start_link(children, strategy: :one_for_one)
Restart Strategies:
- Predefined actions to be taken when a process dies
- e.g., should all workers be restarted, or only one?
Restart Frequency:
- If a process keeps failing, something else may be wrong
- Elixir's Supervisor defaults to 3 restarts within 5 seconds
- If exceeded, the supervisor itself dies
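To see a restart happen, one option is to supervise a trivial worker and kill it. The sketch below assumes an `Agent` registered under the made-up name `:counter` and sets the restart limits explicitly; the 100 ms sleep is a crude way to wait for the restart, not something to rely on in real code.

```elixir
# Child spec for a hypothetical counter, registered under the name :counter.
children = [
  %{
    id: :counter,
    start: {Agent, :start_link, [fn -> 0 end, [name: :counter]]}
  }
]

{:ok, _sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 3,
    max_seconds: 5
  )

old_pid = Process.whereis(:counter)
Agent.update(:counter, &(&1 + 1))

# Simulate a crash by killing the worker.
Process.exit(old_pid, :kill)

# Give the supervisor a moment to restart the child.
Process.sleep(100)
new_pid = Process.whereis(:counter)

IO.inspect(new_pid != old_pid)
IO.inspect(Agent.get(:counter, & &1))
```

The restarted worker starts from its initial state, so the counter reads 0 again: a supervisor restores the process, not whatever data it held.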
Supervision Trees
OTP
Open Telecom Platform
- A framework with generic behaviors
- Implements, e.g.:
- Generic Supervisors
- Generic Server pattern
- Also facilitates releases
- Makes Hot Code Swapping doable
Inspecting Tools
REPL (Read-Eval-Print Loop)
Observer
Erlang/Elixir
Elixir is
- A programming language built to run on the Erlang VM
- With a more approachable syntax
- Focus on better tooling
- Can use any Erlang library with no runtime cost
- Has macros that manipulate Elixir's own AST, which helps reduce boilerplate
-module(sum_server).
-behaviour(gen_server).
-export([
start/0, sum/3,
init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2,
code_change/3
]).
start() -> gen_server:start(?MODULE, [], []).
sum(Server, A, B) -> gen_server:call(Server, {sum, A, B}).
init(_) -> {ok, undefined}.
handle_call({sum, A, B}, _From, State) -> {reply, A + B, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
defmodule SumServer do
use GenServer
def start do
GenServer.start(__MODULE__, nil)
end
def sum(server, a, b) do
GenServer.call(server, {:sum, a, b})
end
def handle_call({:sum, a, b}, _from, state) do
{:reply, a + b, state}
end
end
Fault Tolerance Analysis
Analysis
Topics selected from Johnson, 1989, to achieve Fault Tolerance:
- Reliability
- Availability
- Safety
- Performability
- Maintainability
- Testability
- Dependability
A qualitative analysis was performed
Analysis
Reliability
- Can be improved by using OTP abstractions such as Supervisors
Availability
- Supervisors, too
- Hot code swapping
- Distribution
Safety
- No solution provided
Performability
- Isolated processes: fault containment
- Distribution
Maintainability
- Supervisors
- Inspecting tools
Testability
- Simple functions, clear interfaces
- No global state
Dependability
- Expected to be good, as the other goals are met
Conclusion
- Erlang has a number of interesting aspects for providing fault tolerance
- Isolated processes
- Supervisors
- Distribution
- A qualitative analysis of the requirements for achieving fault tolerance shows good prospects, except for safety
Thank you!