Erlang for software fault tolerance

 

Geovane Fedrecheski

Universidade de São Paulo

Escola Politécnica da USP

geonnave@gmail.com | geovane@lsi.usp.br    

 

Contents

  • Historical perspective
  • Technical details
  • Fault tolerance analysis
  • Conclusions

History

In the early 1980s, engineers at Ericsson Telecom found no programming language suited to providing:

  • Fault-tolerance
  • High-availability
  • Soft Real-Time

History

A research effort followed...

...whose outcome was Erlang

History

Erlang can be divided into three levels:

  • Programming language
  • Framework (OTP)
  • Runtime

Technical Details

High Level Architecture

High Level Taxonomy

Processes

pid_a = spawn(fn -> IO.puts("Hello World!") end)

Spawns a new process and returns its PID

An anonymous function is the argument

# spawn/1 still returns a PID here; the exception only terminates the new process.
pid_c = spawn(fn -> raise "an exception!" end)

Non-OS Processes

Erlang processes are not OS processes

Instead, they run inside the Erlang VM
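
Because these processes are scheduled by the VM rather than the OS, spawning thousands of them is routine. The following is a minimal sketch of that idea; the module name `ManyProcesses` and the count of 10,000 are illustrative assumptions, not part of the original talk.

```elixir
defmodule ManyProcesses do
  # Spawns `n` processes that each send their index back to the caller,
  # then counts how many replies arrived.
  def run(n) do
    parent = self()
    ref = make_ref()

    for i <- 1..n do
      spawn(fn -> send(parent, {ref, i}) end)
    end

    collect(ref, n, 0)
  end

  defp collect(_ref, 0, count), do: count

  defp collect(ref, remaining, count) do
    receive do
      {^ref, _i} -> collect(ref, remaining - 1, count + 1)
    after
      # Give up after 5 seconds and report what was received so far.
      5_000 -> count
    end
  end
end

IO.puts(ManyProcesses.run(10_000))
```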

Message Passing

No shared memory

Processes must communicate

Message passing

Message Passing

pid_b = spawn(fn ->            # Spawns a concurrent process.
  receive do                   # Waits for an incoming message.
    msg -> IO.puts(msg)
  end
end)

pid_a = spawn(fn ->            # Spawns a second concurrent process.
  send(pid_b, "Hello")         # Sends a message to the process
                               # identified by `pid_b`.
end)
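
Since a message is one-way, a reply requires the sender to include its own PID. The sketch below shows one way to build a request/reply exchange on top of `send` and `receive`; the `Echo` module and its message shapes are hypothetical names chosen for illustration.

```elixir
defmodule Echo do
  # Starts a process that waits for one request and replies to its sender.
  def start do
    spawn(fn ->
      receive do
        {:echo, from, text} -> send(from, {:reply, text})
      end
    end)
  end

  # Sends a request and blocks until the reply arrives (or 1 second passes).
  def call(pid, text) do
    send(pid, {:echo, self(), text})

    receive do
      {:reply, answer} -> {:ok, answer}
    after
      1_000 -> {:error, :timeout}
    end
  end
end

pid = Echo.start()
IO.inspect(Echo.call(pid, "Hello"))
```

The `after` clause matters: without a timeout, the caller would wait forever if the other process had already died.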

Fail-fast processes

  • If a process encounters an error, it terminates immediately
  • Terminating early prevents the error from propagating
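
A small sketch of the fail-fast style, assuming a hypothetical `FailFast` module: the parser has no defensive branches and simply crashes on bad input, while the caller runs it in a separate monitored process so the crash stays contained.

```elixir
defmodule FailFast do
  # Parses a positive integer. Anything else raises a MatchError;
  # there are deliberately no defensive branches.
  def parse!(text) do
    {n, ""} = Integer.parse(text)
    true = n > 0
    n
  end

  # Runs parse!/1 in a separate, monitored process and reports how it ended.
  def isolated(text) do
    caller = self()
    {pid, ref} = spawn_monitor(fn -> send(caller, {:parsed, parse!(text)}) end)

    receive do
      {:DOWN, ^ref, :process, ^pid, :normal} ->
        # The result was sent before the process exited, so it is already queued.
        receive do
          {:parsed, n} -> {:ok, n}
        end

      {:DOWN, ^ref, :process, ^pid, _reason} ->
        :crashed
    end
  end
end

IO.inspect(FailFast.isolated("42"))
IO.inspect(FailFast.isolated("oops"))
```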

Process Links

  • Two processes can be "linked"
  • If one dies, the other dies too
  • To prevent "orphan processes"
spawn_link(fn -> raise "I die, so does my father" end)

Process Links

  • The parent process may "trap exits"
  • Instead of dying too, it just receives an exit message
Process.flag(:trap_exit, true)
spawn_link(fn -> raise "I die, but not my father" end)
  • However, if the parent dies, so do all its children!
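
When exits are trapped, the exit signal arrives as an ordinary `{:EXIT, pid, reason}` message. The sketch below shows how a parent might read it; `TrapDemo` is a hypothetical module, and it traps exits inside a throwaway process so the caller's own flags stay untouched.

```elixir
defmodule TrapDemo do
  # Returns the exit reason of a linked child, as seen by a trapping parent.
  def run do
    caller = self()

    spawn(fn ->
      Process.flag(:trap_exit, true)
      child = spawn_link(fn -> exit(:boom) end)

      receive do
        # The exit signal is delivered as a plain message.
        {:EXIT, ^child, reason} -> send(caller, {:child_exited, reason})
      end
    end)

    receive do
      {:child_exited, reason} -> reason
    after
      1_000 -> :timeout
    end
  end
end

IO.inspect(TrapDemo.run())
```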

Supervisors

We have:

  • A parent that receives a message when a child dies
  • Children that die when the parent dies

The idea: the parent can take action when a child terminates
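
To make the idea concrete, here is a hand-rolled sketch of a supervisor built only from links and trapped exits. It is not how OTP implements `Supervisor`; `TinySupervisor` and its restart limit are illustrative assumptions.

```elixir
defmodule TinySupervisor do
  # Starts `fun` as a linked child and restarts it whenever it exits
  # abnormally, up to `max_restarts` times.
  def run(fun, max_restarts) do
    previous = Process.flag(:trap_exit, true)
    result = loop(fun, spawn_link(fun), 0, max_restarts)
    # Restore the caller's original trap_exit setting.
    Process.flag(:trap_exit, previous)
    result
  end

  defp loop(fun, child, restarts, max_restarts) do
    receive do
      {:EXIT, ^child, :normal} ->
        {:ok, restarts}

      {:EXIT, ^child, _reason} when restarts < max_restarts ->
        loop(fun, spawn_link(fun), restarts + 1, max_restarts)

      {:EXIT, ^child, reason} ->
        {:error, {:too_many_restarts, reason}}
    end
  end
end

# A child that crashes on its first two attempts and then finishes normally.
{:ok, attempts} = Agent.start_link(fn -> 0 end)

flaky = fn ->
  n = Agent.get_and_update(attempts, fn n -> {n + 1, n + 1} end)
  if n < 3, do: exit(:crashed)
end

IO.inspect(TinySupervisor.run(flaky, 5))
```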

Supervisors

import Supervisor.Spec

children = [
  worker(Cache, []),
  worker(DatabaseWorker, []),
  worker(TCP.Acceptor, [4040])
]

Supervisor.start_link(children, strategy: :one_for_one)

Supervisors

Restart Strategies:

  • Predefined actions to be taken when a process dies
    • e.g., should all workers be restarted, or only the one that died?

Restart Frequency:

  • If a process keeps failing, something else may be wrong
  • Elixir's default is 3 restarts within 5 seconds
    • if exceeded, the supervisor itself terminates
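
Both the strategy and the restart frequency are options to `Supervisor.start_link/2`. The sketch below sets them explicitly and then kills the worker to watch it come back. `DemoWorker` is a hypothetical module, and the child is given as a `{module, arg}` tuple rather than through `Supervisor.Spec`.

```elixir
defmodule DemoWorker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

{:ok, _sup} =
  Supervisor.start_link(
    [{DemoWorker, :some_state}],
    strategy: :one_for_one, # restart only the child that died
    max_restarts: 3,        # allow at most 3 restarts...
    max_seconds: 5          # ...within any 5-second window
  )

old_pid = Process.whereis(DemoWorker)
Process.exit(old_pid, :kill)

# Give the supervisor a moment to restart the worker.
Process.sleep(100)
new_pid = Process.whereis(DemoWorker)
IO.inspect(is_pid(new_pid) and new_pid != old_pid)
```

With `:one_for_all` in place of `:one_for_one`, the death of one child would cause every sibling to be restarted as well.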

Supervision Trees

OTP

Open Telecom Platform

  • A framework with generic behaviors
  • Implements, e.g.:
    • Generic Supervisors
    • Generic Server pattern
  • Also facilitates releases
  • Makes Hot Code Swapping doable

Inspecting Tools

REPL (Read-Eval-Print Loop)

Observer

Erlang/Elixir

Elixir is

  • A programming language built to run on the Erlang VM
  • With a more approachable syntax
  • Focused on better tooling
  • Can use any Erlang library at no runtime cost
  • Has macros that manipulate Elixir's own AST, which helps reduce boilerplate

Erlang/Elixir

-module(sum_server).

-behaviour(gen_server).
-export([
  start/0, sum/3,
  init/1, handle_call/3, handle_cast/2, handle_info/2, terminate/2,
  code_change/3
]).

start() -> gen_server:start(?MODULE, [], []).

sum(Server, A, B) -> gen_server:call(Server, {sum, A, B}).

init(_) -> {ok, undefined}.
handle_call({sum, A, B}, _From, State) -> {reply, A + B, State}.
handle_cast(_Msg, State) -> {noreply, State}.
handle_info(_Info, State) -> {noreply, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

defmodule SumServer do
  use GenServer
  
  def start do
    GenServer.start(__MODULE__, nil)
  end

  def sum(server, a, b) do
    GenServer.call(server, {:sum, a, b})
  end
  
  def handle_call({:sum, a, b}, _from, state) do
    {:reply, a + b, state}
  end
end

Fault Tolerance Analysis

Analysis

Attributes selected from Johnson (1989) for achieving fault tolerance:

  • Reliability
  • Availability
  • Safety
  • Performability
  • Maintainability
  • Testability
  • Dependability

A qualitative analysis was performed

Analysis

Reliability

  • Can be improved by using OTP abstractions such as Supervisors

Availability

  • Supervisors, too
  • Hot code swapping
  • Distribution

Safety

  • No solution provided

Performability

  • Isolated processes: fault containment
  • Distribution

Analysis

Maintainability

  • Supervisors
  • Inspecting tools

Testability

  • Simple functions, clear interfaces
  • No global state

Dependability

  • Expected to be good, as the other goals are met

Conclusion

  • Erlang has several features that support fault tolerance
    • Isolated processes
    • Supervisors
    • Distribution
  • A qualitative analysis of the requirements for achieving fault tolerance shows good prospects, except for safety

Thank you!
