Introduction To Systems Programming

Alex Bunardzic, 2017

How can we make a reliable system in the presence of errors?

System

Stand together
Composition of programs that offer services to other programs
No global supervision/managers
Loosely coupled

Everything that we build eventually exceeds our ability to understand it

Complexity (# of moving parts)

Randomness (degree of independence)

System

Formal Analysis

Statistical Analysis

GAP!

System Language?

Problem: no such thing exists!

Program talking to another program

How?

Over the network

Networks are inherently unreliable

Errors are at the system level

E.g., one of the programs comprising the system suddenly becomes unavailable

Nothing to do with programming errors

Programming errors are not really errors, they're bugs to be squashed ;)

A chain is only as strong as its weakest link

How to provide an abstraction boundary which stops the propagation of errors?

How do programs talk to each other?

Protocols and formats

Historic Epic Failures!

Remote Method Invocation (Java RMI)
Common Object Request Broker Architecture (CORBA)
Distributed Component Object Model (DCOM, Microsoft)

Happened during the object infatuation phase

(back in the 1990s)

Never Again!

We're finally over the object fetishizing

Values!

Values

No identity
Ephemeral
Nameless
On wire

Example Value

A service that returns monthly payment rate on a term loan
That value is ephemeral
It needs no name
It is meant to be sent on a wire
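A minimal sketch of such a value, assuming the standard amortization formula (the slides do not say how the payment is computed). The function names and the JSON field name are hypothetical. The result has no identity and no name; it is computed, serialized to bytes, and put on the wire.

```python
import json


def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Monthly payment on a term loan (standard amortization formula, assumed)."""
    r = annual_rate / 12.0  # periodic (monthly) interest rate
    if r == 0:
        return principal / months  # interest-free loan: equal instalments
    growth = (1 + r) ** months
    return principal * r * growth / (growth - 1)


def payment_response(principal: float, annual_rate: float, months: int) -> bytes:
    """Serialize the value so it can be sent on a wire; nothing is kept afterwards."""
    payment = round(monthly_payment(principal, annual_rate, months), 2)
    return json.dumps({"monthly_payment": payment}).encode("utf-8")
```

The caller receives plain bytes and decodes them into its own copy of the value; no reference to anything inside the service travels with it.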

Ephemeral nature of values implies Flow!

Systems are not place-oriented

Systems are flow-oriented

How do values flow in the system?

Transform
Move
Route
Record
Keep the above activities segregated

Transform

Move

Source => destination
Mover (producer) depends on the identity and availability of the destination (consumer)
Must decouple producers from consumers
Must remove dependency on identity
Must remove dependency on availability
Use queues
Pub/sub
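The decoupling described above can be sketched with an in-process queue. This is only an illustration using Python's standard `queue` and `threading` modules; in a real system the queue would be a separate service, and the function names here are hypothetical. Note that the producer runs to completion before the consumer even exists, so it depends on neither the consumer's identity nor its availability.

```python
import queue
import threading


def producer(out_q: queue.Queue, values) -> None:
    """Puts values on the queue; knows nothing about who will read them."""
    for v in values:
        out_q.put(v)
    out_q.put(None)  # sentinel: no more values


def consumer(in_q: queue.Queue, results: list) -> None:
    """Takes values off the queue; knows nothing about who wrote them."""
    while True:
        v = in_q.get()
        if v is None:
            break
        results.append(v * 2)


def run_pipeline(values) -> list:
    q = queue.Queue()  # unbounded, so the producer never blocks on put()
    results = []
    t_producer = threading.Thread(target=producer, args=(q, values))
    t_consumer = threading.Thread(target=consumer, args=(q, results))
    t_producer.start()
    t_producer.join()   # producer is finished before the consumer starts
    t_consumer.start()
    t_consumer.join()
    return results
```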

Design services primarily for machines

Avoid designing services to be consumed by humans

Machines should never be expected to access services via operational interfaces

Today, if a machine needs to access a service such as Git, good luck!

Build human operational interfaces only after you've built a machine-centric service

Strive to build only simple services

Simple services are easily composable

When designing simple services, there is no danger of premature abstraction

Not possible to over-abstract a simple service

Good practice is to consider a second implementation of your service interface

You may start your design by abstracting the service using HTTP

Consider also offering the same service via SMTP, etc.

That exercise will help you sort out your abstractions
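One way to picture the exercise, as a rough sketch: keep the service core free of any transport, then write two thin adapters. The service (`echo_upper`) and both adapter names are hypothetical, and no real network is involved; the adapters only parse an HTTP-style query string and a mail-style message using the standard library.

```python
from email import message_from_string
from urllib.parse import parse_qs


def echo_upper(request: dict) -> dict:
    """Hypothetical transport-agnostic service core: value in, value out."""
    return {"text": request["text"].upper()}


def handle_http_query(query_string: str) -> dict:
    """Adapter 1: request arrives as an HTTP-style query string."""
    params = parse_qs(query_string)
    return echo_upper({"text": params["text"][0]})


def handle_mail(raw_message: str) -> dict:
    """Adapter 2: request arrives as the body of a mail message."""
    msg = message_from_string(raw_message)
    return echo_upper({"text": msg.get_payload().strip()})
```

If the second adapter forces changes in the core, the core was probably leaking assumptions about the first transport.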

Challenge: avoid turning your service into a monolith

Abstain from adding functionality and features -- keep it super simple

Avoid at all costs turning your service into a stack

Allow users of your service to choose which commodities to use when consuming it

Let them decide which store to use, which queue, etc. Don't dictate your custom stack to your clients

Failures

System Failure model is the only failure model

System Failures are guaranteed to happen!

Not if, but when and how often

Exceptions occur when the run-time system doesn't know what to do

Errors occur when the programmer doesn’t know what to do

System failures are partial and uncoordinated

Extremely unlikely that the entire system fails at once

Minimum requirements for reliable systems:

Concurrency

Non-imperative

"Everything is a process"

Lightweight mechanism for creating parallel processes

Efficient context switching between processes and message passing

Fault detection primitives allow one process to observe another process

Error encapsulation

Errors occurring in one process must not be able to damage other processes in the system

"The process achieves fault containment by sharing no state with other processes; its only contact with other processes is via messages carried by a kernel message system." Jim Gray

"As with hardware, the key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module." Jim Gray

We should only write code for the normal case

Let it crash!

Don’t try to fix up the error and continue

The error should be handled in a different process

Clean separation of error recovery code and normal case code should greatly simplify the overall system design
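A small sketch of this separation, assuming operating-system processes stand in for the lightweight processes the talk has in mind. The worker below contains only normal-case code and simply crashes on a fault; a separate supervisor process observes the exit status and decides what to do. The worker script, its "crash until attempt 2" behaviour, and the restart limit are all hypothetical.

```python
import subprocess
import sys

# Hypothetical worker: normal-case code only. It crashes (non-zero exit)
# until its attempt number reaches 2, simulating a transient fault.
WORKER_CODE = """
import sys
attempt = int(sys.argv[1])
if attempt < 2:
    raise RuntimeError("simulated fault")
print("done on attempt", attempt)
"""


def supervise(max_restarts: int = 5) -> str:
    """Run the worker in its own process; restart it when it crashes."""
    for attempt in range(max_restarts + 1):
        proc = subprocess.run(
            [sys.executable, "-c", WORKER_CODE, str(attempt)],
            capture_output=True,
            text=True,
        )
        if proc.returncode == 0:
            return proc.stdout.strip()
        # The worker crashed. Recovery policy lives here, not in the worker.
    raise RuntimeError("worker kept crashing; giving up")
```

The worker never catches its own exception, and the supervisor never contains business logic; each stays simple because it does only one of the two jobs.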

Fault detection

Programming logic must be able to detect exceptions both locally (in the process where the exception occurred) and remotely (detecting that an exception has occurred in a non-local process)

A component is considered faulty once its behaviour is no longer consistent with its specification

Error detection is an essential component of fault tolerance

If we cannot do what we want to do, then try to do something simpler

The likelihood of success increases as the tasks become simpler

In the face of failure, we become more interested in protecting the system against damage than in offering full service

Our goal is to offer an acceptable level of service, though we become less ambitious when things start to fail

We need a stable error log that will survive a crash

1. Try to perform a task

2. If you cannot perform the task, then try to perform a simpler task
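The two steps above can be sketched as a list of tasks ordered from most to least ambitious. The helper and the two example tasks are hypothetical; the failures are returned alongside the result so they can be written to the error log.

```python
def perform_with_fallback(tasks):
    """Try each task in order; return the first result plus the failures seen."""
    failures = []
    for task in tasks:
        try:
            return task(), failures
        except Exception as exc:
            failures.append(f"{task.__name__}: {exc}")
    raise RuntimeError("all tasks failed: " + "; ".join(failures))


def full_report():
    """Hypothetical ambitious task that depends on a remote service."""
    raise ConnectionError("pricing service unavailable")


def cached_summary():
    """Hypothetical simpler task with fewer dependencies."""
    return "summary from cache"
```

Calling `perform_with_fallback([full_report, cached_summary])` yields the cached summary together with a record of why the full report could not be produced.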

Fault identification

We should be able to identify why an exception occurred

Code upgrade

The ability to change code as it is executing, and without stopping the system

Stable storage

Store data in a manner which survives a system crash
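As a minimal sketch of what "survives a crash" can mean on a single machine: append each record to a file and ask the operating system to flush it to the device before returning. The function names are hypothetical, and this only guards against a process or OS crash, not against loss of the disk itself.

```python
import os


def append_durably(path: str, record: str) -> None:
    """Append one record and force it out of the in-memory buffers."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(record + "\n")
        f.flush()              # Python buffer -> operating system
        os.fsync(f.fileno())   # operating system -> storage device


def read_records(path: str) -> list:
    """Read back everything that was stored, e.g. after a restart."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]
```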

Well Behaved Programs:

The program should be isomorphic to the specification

If the specification says something silly then the program should do something silly -- the program must faithfully reproduce any errors in the specification

If the specification doesn’t say what to do, raise an exception

Avoid guesswork -- this is not the time to be creative
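A sketch of what this could look like in code, with the specification written down as a lookup table. The table contents and names are hypothetical. Anything the table does not cover raises instead of falling back to a guessed default.

```python
class Unspecified(Exception):
    """Raised when the specification does not cover the requested case."""


# Hypothetical specification: shipping fee per region, exactly as written down.
SHIPPING_SPEC = {"domestic": 5, "international": 20}


def shipping_fee(region: str) -> int:
    try:
        return SHIPPING_SPEC[region]
    except KeyError:
        # No default, no guess: the specification is silent, so we fail.
        raise Unspecified(f"no fee specified for region {region!r}") from None
```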

Turn non-functional requirements into assertions

Be cognizant of latency budgets
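One possible way to turn a latency budget into an assertion is a decorator that times each call. This is a sketch; the decorator name is hypothetical, and Python drops `assert` statements when run with `-O`, so a production version might raise an explicit exception instead.

```python
import time
from functools import wraps


def within_budget(seconds: float):
    """Assert that each call to the wrapped function finishes within `seconds`."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = fn(*args, **kwargs)
            elapsed = time.perf_counter() - start
            assert elapsed <= seconds, (
                f"{fn.__name__} took {elapsed:.3f}s, budget is {seconds}s"
            )
            return result
        return wrapper
    return decorator
```

A function decorated with `@within_budget(0.2)` then fails fast the moment it blows its budget, rather than degrading silently.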

"It is essential for security to be able to isolate mistrusting programs from one another, and to protect the host platform from such programs. Isolation is difficult in object-oriented systems because objects can easily become aliased (i.e. at least two other objects hold a reference to an object)."—Ciaran Bryce

Tasks cannot directly share objects. The only way for tasks to communicate is to use standard, copying communication mechanisms.
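A rough illustration of a copying communication mechanism, using `pickle` as the stand-in serializer. The `Mailbox` class is hypothetical. Because every message is serialized on send and rebuilt on receive, sender and receiver never hold a reference to the same object, so no aliasing is possible.

```python
import pickle


class Mailbox:
    """Hypothetical mailbox that only ever stores serialized copies."""

    def __init__(self):
        self._messages = []

    def send(self, value) -> None:
        # Serialize immediately: later changes by the sender are not visible.
        self._messages.append(pickle.dumps(value))

    def receive(self):
        # Deserialize into a brand-new object owned by the receiver.
        return pickle.loads(self._messages.pop(0))
```

Mutating the original after `send` has no effect on what the receiver gets, which is the isolation property the quote above asks for.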

Conclusion

Processes are the units of error encapsulation

Strong isolation

Processes do what they are supposed to do or fail as soon as possible (fail fast)

Allowing components to crash and then restart leads to a simpler fault model and more reliable code