Introduction To Systems Programming
Alex Bunardzic, 2017
How can we make a reliable system in the presence of errors?
System
- Stand together
- Composition of programs that offer services to other programs
- No global supervision/managers
- Loosely coupled
Everything that we build eventually exceeds our ability to understand it
[Figure: chart with axes Complexity (# of moving parts) and Randomness (degree of independence); labelled regions: Formal Analysis, Statistical Analysis, and System, with a GAP! marked]
System Language?
Problem: no such thing exists!
Program talking to another program
How?
Over the network
Networks are inherently unreliable
Errors are at the system level
E.g. one of the programs comprising the system suddenly becomes unavailable
Nothing to do with programming errors
Programming errors are not really errors, they're bugs to be squashed ;)
A chain is only as strong as its weakest link
How to provide an abstraction boundary which stops the propagation of errors?
How do programs talk to each other?
Protocols and formats
Historic Epic Failures!
- Remote Method Invocation (Java RMI)
- Common Object Request Broker Architecture (CORBA)
- DCOM (Microsoft)
Happened during the object infatuation phase
(back in the 1990s)
Never Again!
We're finally over fetishizing objects
Values!
Values
- No identity
- Ephemeral
- Nameless
- On wire
Example Value
- A service that returns monthly payment rate on a term loan
- That value is ephemeral
- It needs no name
- It is meant to be sent on a wire
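A minimal sketch of such a value in Python. The `MonthlyPayment` type, the `monthly_payment` function, and the `USD` currency are assumptions made for illustration, not something from the talk; the calculation uses the standard amortization formula. The result is an immutable record that is compared by content rather than identity and is serialized to JSON so it can travel on a wire:

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)  # frozen: the value cannot be mutated after creation
class MonthlyPayment:
    """Hypothetical value type; the fields are illustrative assumptions."""
    amount: float
    currency: str


def monthly_payment(principal: float, annual_rate: float, months: int) -> MonthlyPayment:
    """Standard amortization formula: P * r / (1 - (1 + r) ** -n)."""
    r = annual_rate / 12  # monthly interest rate
    if r == 0:
        amount = principal / months  # interest-free loan: equal installments
    else:
        amount = principal * r / (1 - (1 + r) ** -months)
    return MonthlyPayment(amount=round(amount, 2), currency="USD")


def to_wire(value: MonthlyPayment) -> str:
    """Serialize the value; the receiver gets a copy, not a reference."""
    return json.dumps(asdict(value))
```

Two calls with the same inputs produce values that compare equal, which is the sense in which a value has no identity: nobody needs to know which instance they hold.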
Ephemeral nature of values implies Flow!
Systems are not place-oriented
Systems are flow-oriented
How do values flow in the system?
- Transform
- Move
- Route
- Record
- Keep above activities segregated
Transform
Move
- Source => destination
- Mover (producer) depends on the identity/availability of the destination (consumer)
- Must decouple producers from consumers
- Must remove dependency on identity
- Must remove dependency on availability
- Use queues
- Pub/sub
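The decoupling described above can be sketched with an in-memory broker. This is only an illustration: the `Broker` class and topic names are hypothetical, and a real system would use an external queue or message broker rather than Python's in-process `queue.Queue`. The producer publishes to a topic name and never learns who the consumers are or whether they are currently reading:

```python
import queue
from collections import defaultdict


class Broker:
    """Hypothetical in-memory pub/sub broker, for illustration only."""

    def __init__(self):
        self._topics = defaultdict(list)  # topic name -> subscriber queues

    def subscribe(self, topic: str) -> queue.Queue:
        """Register interest in a topic and get a private queue to read from."""
        q = queue.Queue()
        self._topics[topic].append(q)
        return q

    def publish(self, topic: str, message: dict) -> int:
        """Deliver a copy of the message to every subscriber of the topic."""
        subscribers = self._topics[topic]
        for q in subscribers:
            q.put(dict(message))  # each subscriber gets its own shallow copy
        return len(subscribers)
```

Each subscriber's queue holds messages until that consumer is ready, so the producer does not depend on consumer availability. In this sketch, messages published before anyone subscribes are simply dropped; a durable broker would make a different choice.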
Design services primarily for machines
Avoid designing services to be consumed by humans
Machines should never be expected to access services via operational interfaces
Today, if a machine needs to access a service such as Git, good luck!
Build human operational interfaces only after you've built a machine-centric service
Strive to build only simple services
Simple services are easily composable
When designing simple services, there is no danger of premature abstraction
Not possible to over-abstract a simple service
Good practice is to consider a second implementation of your service interface
You may start your design by exposing the service over HTTP
Consider also offering the same service over SMTP, etc.
That exercise will help you sort out your abstractions
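One way to picture the exercise, as a sketch under stated assumptions: a transport-agnostic core function plus one thin adapter per transport. The `quote_service` function and both adapters are hypothetical, and no real HTTP or SMTP server is involved; the adapters only parse the kind of payload each transport would deliver.

```python
from urllib.parse import parse_qs


def quote_service(request: dict) -> dict:
    """Transport-agnostic core (hypothetical): plain values in, plain values out."""
    principal = float(request["principal"])
    months = int(request["months"])
    return {"installment": round(principal / months, 2)}


def from_http_query(query_string: str) -> dict:
    """Adapter 1: an HTTP-style query string such as 'principal=1200&months=12'."""
    params = {key: values[0] for key, values in parse_qs(query_string).items()}
    return quote_service(params)


def from_mail_body(body: str) -> dict:
    """Adapter 2: an email-style body with one 'key: value' pair per line."""
    params = {}
    for line in body.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            params[key.strip()] = value.strip()
    return quote_service(params)
```

If something in the core only makes sense for one transport (status codes, headers, MIME parts), the second adapter exposes it as a leaked abstraction.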
Challenge: avoid turning your service into a monolith
Abstain from adding functionality and features -- keep it super simple
Avoid at all costs turning your service into a stack
Allow users of your service to choose which commodities to use when consuming it
Let them decide which store to use, which queue, etc. Don't dictate your custom stack to your clients
Failures
The system failure model is the only failure model
System Failures are guaranteed to happen!
Not if, but when and how often
Exceptions occur when the run-time system doesn't know what to do
Errors occur when the programmer doesn’t know what to do
System failures are partial and uncoordinated
Extremely unlikely that the entire system fails at once
Minimum requirements for reliable systems:
Concurrency
Non-imperative
"Everything is a process"
Lightweight mechanism for creating parallel processes
Efficient context switching between processes and message passing
Fault detection primitives allow one process to observe another process
Error encapsulation
Errors occurring in one process must not be able to damage other processes in the system
"The process achieves fault containment by sharing no state with other processes; its only contact with other processes is via messages carried by a kernel message system." Jim Gray
"As with hardware, the key to software fault-tolerance is to hierarchically decompose large systems into modules, each module being a unit of service and a unit of failure. A failure of a module does not propagate beyond the module."
Jim Gray
We should only write code for the normal case
Let it crash!
Don’t try to fix up the error and continue
The error should be handled in a different process
Clean separation of error recovery code and normal case code should greatly simplify the overall system design
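A rough sketch of that separation, using operating-system processes through Python's `subprocess` module as a stand-in for lightweight processes. The worker script, the `supervise` function, and the restart limit are all hypothetical. The worker contains only normal-case code and simply crashes on a fault; the supervisor, running in a different process, observes the crash, records the reason, and restarts it:

```python
import subprocess
import sys

# Worker: normal-case code only, with no error handling of its own.
# It fails on its first two runs to simulate a transient fault.
WORKER = """
import sys
attempt = int(sys.argv[1])
if attempt < 2:
    raise RuntimeError("simulated fault")
print("done")
"""


def supervise(max_restarts: int = 3):
    """Run the worker in a separate OS process and restart it when it crashes."""
    crash_reasons = []
    for attempt in range(max_restarts + 1):
        result = subprocess.run(
            [sys.executable, "-c", WORKER, str(attempt)],
            capture_output=True,
            text=True,
        )
        if result.returncode == 0:
            return result.stdout.strip(), crash_reasons
        # Recovery code lives here, outside the worker: record why it died.
        lines = result.stderr.strip().splitlines() or ["unknown"]
        crash_reasons.append(lines[-1])
    raise RuntimeError(f"worker kept crashing: {crash_reasons}")
```

The supervisor learns both that the worker failed (non-zero exit code) and why (the last line of its error output), without sharing any state with it.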
Fault detection
Programming logic must be able to detect exceptions both locally (in the process where the exception occurred) and remotely (detecting that an exception has occurred in a non-local process)
A component is considered faulty once its behaviour is no longer consistent with its specification
Error detection is an essential component of fault tolerance
If we cannot do what we want to do, then try to do something simpler
The likelihood of success increases as the tasks become simpler
In the face of failure, we become more interested in protecting the system against damage than in offering full service
Our goal is to offer an acceptable level of service, though we become less ambitious when things start to fail
We need a stable error log that will survive a crash
1. Try to perform a task
2. If you cannot perform the task, then try to perform a simpler task
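The two steps above can be sketched as a loop over tasks ordered from most to least ambitious. The `run_with_fallback` helper and the two rate functions are hypothetical names chosen for illustration, and the in-memory `failures` list stands in for the stable error log:

```python
def run_with_fallback(tasks):
    """Try each task in order, from most to least ambitious.

    `tasks` is a list of (name, zero-argument callable) pairs.
    Returns (name, result) for the first task that succeeds.
    """
    failures = []
    for name, task in tasks:
        try:
            return name, task()
        except Exception as exc:
            # A real system would append this to a crash-safe error log.
            failures.append((name, repr(exc)))
    raise RuntimeError(f"all tasks failed: {failures}")


def live_rate():
    """Ambitious task: ask the rate service (failure simulated here)."""
    raise ConnectionError("rate service unavailable")


def cached_rate():
    """Simpler task: return a hypothetical last known value."""
    return 0.06
```

Calling `run_with_fallback([("live", live_rate), ("cached", cached_rate)])` returns the cached result together with its name, so the caller knows it received degraded service.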
Fault identification
We should be able to identify why an exception occurred
Code upgrade
The ability to change code as it is executing, and without stopping the system
Stable storage
Store data in a manner which survives a system crash
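One common technique on a single machine, shown here as a sketch rather than a complete answer: write to a temporary file, flush it to the device, then rename it over the destination. The `write_durably` name is hypothetical. It relies on `os.replace`, which Python documents as an atomic rename on POSIX systems when it succeeds:

```python
import os
import tempfile


def write_durably(path: str, data: bytes) -> None:
    """Write data so a crash leaves either the old file or the new one, never a mix."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to the storage device
        os.replace(tmp_path, path)  # rename over the destination
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

This sketch does not sync the containing directory and does not protect against loss of the disk itself; surviving that requires replication to other machines.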
Well Behaved Programs:
The program should be isomorphic to the specification
If the specification says something silly then the program should do something silly -- the program must faithfully reproduce any errors in the specification
If the specification doesn’t say what to do, raise an exception
Avoid guesswork -- this is not the time to be creative
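A small sketch of what "no guesswork" can look like in code. The specification table, its rates, and the `UnspecifiedCase` exception are invented for illustration. The program mirrors the table exactly and refuses to interpolate when asked about a case the table does not cover:

```python
# Hypothetical specification: the only loan terms (in months) it defines.
SPECIFIED_ANNUAL_RATES = {12: 0.05, 24: 0.06, 36: 0.07}


class UnspecifiedCase(Exception):
    """Raised when the specification is silent about an input."""


def annual_rate_for(term_months: int) -> float:
    """Return the rate the specification gives for this term, or fail."""
    try:
        return SPECIFIED_ANNUAL_RATES[term_months]
    except KeyError:
        # No guessing, no interpolation: the spec does not cover this case.
        raise UnspecifiedCase(
            f"no rate specified for a {term_months}-month term"
        ) from None
```

An 18-month term raises `UnspecifiedCase` instead of quietly returning something between the 12-month and 24-month rates.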
Turn non-functional requirements into assertions
Be cognizant of latency budgets
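A latency budget is one non-functional requirement that translates directly into an assertion. The sketch below assumes a hypothetical `within_budget` wrapper and `BudgetExceeded` exception; it measures after the fact and does not interrupt a slow call:

```python
import time


class BudgetExceeded(AssertionError):
    """Raised when a call takes longer than its latency budget."""


def within_budget(budget_seconds: float, func, *args, **kwargs):
    """Call func and fail loudly if it overruns the budget."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    if elapsed > budget_seconds:
        raise BudgetExceeded(
            f"{func.__name__} took {elapsed:.3f}s, budget {budget_seconds:.3f}s"
        )
    return result
```

For example, `within_budget(0.2, handler, request)` turns "respond within 200 ms" from a line in a requirements document into something that fails fast when violated (`handler` and `request` are placeholders).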
"It is essential for security to be able to isolate mistrusting programs from one another, and to protect the host platform from such programs. Isolation is difficult in object-oriented systems because objects can easily become aliased (i.e. at least two other objects hold a reference to an object)."—Ciaran Bryce
Tasks cannot directly share objects. The only way for tasks to communicate is to use standard, copying communication mechanisms.
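A sketch of a copying communication mechanism, assuming a hypothetical `Mailbox` class that uses JSON serialization as the copy step. Because only the serialized form crosses the boundary, the sender cannot keep an alias to what the receiver gets:

```python
import json
import queue


class Mailbox:
    """Hypothetical mailbox that only accepts copies, never shared objects."""

    def __init__(self):
        self._messages = queue.Queue()

    def send(self, message) -> None:
        # Serializing makes a copy; the sender keeps no alias into the mailbox.
        self._messages.put(json.dumps(message))

    def receive(self):
        # Deserializing builds a fresh object owned by the receiver.
        return json.loads(self._messages.get_nowait())
```

If the sender mutates its own object after calling `send`, the message already in the mailbox is unaffected.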
Conclusion
Processes are the units of error encapsulation
Strong isolation
Processes do what they are supposed to do or fail as soon as possible (fail fast)
Allowing components to crash and then restart leads to a simpler fault model and more reliable code
Failure, and the reason for failure, must be detectable by remote processes
Processes share no state, but communicate by message passing
That's it!