Cost does not increase linearly as reliability increases. Two dimensions of the cost:
Strive to make a service reliable enough, but no more reliable than it needs to be.
Identify an objective metric that represents the property of the system we want to optimize. At Google, the metric used to measure service risk is unplanned downtime.
Unplanned downtime is captured by the desired level of service availability.
Given the formula above, a system with an availability target of 99.99% can be down for up to 52.56 minutes in a year.
Given the formula above, a system with a 99.99% availability target that serves 2.5M requests per day can have up to 250 failed requests per day.
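The two figures above can be reproduced with a short sketch. It assumes the usual definitions, which are not spelled out in these notes: time-based availability is uptime / (uptime + downtime), and aggregate availability is successful requests / total requests. The function names and the 365-day year are illustrative choices.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def allowed_downtime_minutes(availability_target: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Time-based view: the share of the period that may be spent down."""
    return (1 - availability_target) * period_minutes


def allowed_failed_requests(availability_target: float,
                            total_requests: int) -> float:
    """Aggregate view: the share of requests that may fail."""
    return (1 - availability_target) * total_requests


if __name__ == "__main__":
    # 99.99% target over one year
    print(round(allowed_downtime_minutes(0.9999), 2))
    # 99.99% target over 2.5M requests in a day
    print(round(allowed_failed_requests(0.9999, 2_500_000)))
```

Rounding is applied only when printing, because `1 - 0.9999` is not exactly representable in floating point.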
Factors to consider when assessing the risk tolerance of a service:
Factors to consider when assessing the target level of availability of a service:
Dev and Ops are measured by different metrics. Developers want to push changes as frequently as possible. Ops wants to keep the system stable, and frequent changes mean more instability.
To reconcile the two sides, Google uses what it calls an "error budget": an objective metric that determines how unreliable a service is allowed to be within a single quarter.
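A minimal sketch of how an error budget might be tracked, assuming availability is measured as the fraction of successful requests. The `ErrorBudget` class, its field names, and the rule of pausing releases once the budget is spent are illustrative assumptions, not something these notes specify.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests served so far in the quarter
    failed_requests: int  # requests that failed so far in the quarter

    @property
    def allowed_failures(self) -> float:
        """Failures the SLO tolerates for the traffic seen so far."""
        return (1 - self.slo_target) * self.total_requests

    @property
    def remaining(self) -> float:
        """Budget left; a negative value means the budget is overspent."""
        return self.allowed_failures - self.failed_requests

    def can_release(self) -> bool:
        """Assumed policy: new releases are allowed only while budget remains."""
        return self.remaining > 0


if __name__ == "__main__":
    healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
    print(healthy.can_release(), exhausted.can_release())
```

The point of the design is that both teams read the same number: developers can spend the remaining budget on launches, and Ops has an agreed signal for when to slow down.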
An SLI is a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
Example:
An SLO is a service level objective: a target value or range of values for a service level that is measured by an SLI.
SLOs usually take the form SLI ≤ target, or lower bound ≤ SLI ≤ upper bound.
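Both SLO forms can be expressed as one bounds check on a measured SLI value. This is a sketch only: the `meets_slo` helper and the sample numbers (a 100 ms latency ceiling, a 99.9% availability floor) are hypothetical and chosen for illustration.

```python
from typing import Optional


def meets_slo(sli_value: float,
              lower: Optional[float] = None,
              upper: Optional[float] = None) -> bool:
    """Return True if lower <= sli_value <= upper; a bound of None is not checked."""
    if lower is not None and sli_value < lower:
        return False
    if upper is not None and sli_value > upper:
        return False
    return True


if __name__ == "__main__":
    # "SLI <= target": measured latency of 87 ms against a 100 ms ceiling
    print(meets_slo(87.0, upper=100.0))
    # A one-sided lower bound: measured availability against a 99.9% floor
    print(meets_slo(0.9985, lower=0.999))
    # "lower bound <= SLI <= upper bound"
    print(meets_slo(0.5, lower=0.2, upper=0.8))
```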
Choosing SLOs is hard. Choosing and publishing SLOs to users sets expectations about how a service will perform.
SLAs are service level agreements: an explicit or implicit contract with your users that includes consequences of meeting (or missing) the SLOs they contain.
Site Reliability Engineers don't typically get involved in constructing SLAs, because SLAs are closely tied to business and product decisions.
SRE does, however, get involved in helping to avoid triggering the consequences of missed SLOs.
They can also help to define the SLIs: there obviously needs to be an objective way to measure the SLOs in the agreement, or disagreements will arise.
Services tend to fall into a few broad categories in terms of the SLIs they find relevant.
It is recommended to standardize indicators so that we don't have to argue about them from scratch every time we define an SLI for a service. Example:
Start by finding out what your users care about, not what you can measure.
Often, what your users care about is difficult or impossible to measure, so you’ll end up approximating users’ needs in some way.
For maximum clarity, SLOs should specify how they’re measured and the conditions under which they’re valid. For instance:
Choosing targets (SLOs) is not a purely technical activity. It should account for constraints such as staffing, time to market, hardware availability, and funding. Some guidelines for choosing SLOs:
Now that we understand what SLIs and SLOs are, we have to:
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automate-able, tactical, devoid of enduring value, and that scales linearly as a service grows.
Not every task deemed toil has all these attributes, but the more closely work matches one or more of the following descriptions, the more likely it is to be toil.
Manual
This includes work such as manually running a script that automates some task. Running a script may be quicker than manually executing each step in the script, but the hands-on time a human spends running that script (not the elapsed time) is still toil time.
Repetitive
If you’re performing a task for the first time ever, or even the second time, this work is not toil. Toil is work you do over and over. If you’re solving a novel problem or inventing a new solution, this work is not toil.
Automate-able
If a machine could accomplish the task just as well as a human, or the need for the task could be designed away, that task is toil. If human judgment is essential for the task, there’s a good chance it’s not toil.
Tactical
Toil is interrupt-driven and reactive, rather than strategy-driven and proactive. Handling pager alerts is toil. We may never be able to eliminate this type of work completely, but we have to continually work toward minimizing it.
No enduring value
If your service remains in the same state after you have finished a task, the task was probably toil. If the task produced a permanent improvement in your service, it probably wasn’t toil, even if some amount of grunt work—such as digging into legacy code and configurations and straightening them out—was involved.
O(n) with service growth
If the work involved in a task scales up linearly with service size, traffic volume, or user count, that task is probably toil. An ideally managed and designed service can grow by at least one order of magnitude with zero additional work, other than some one-time efforts to add resources.
An SRE organization should spend at most 50% of its time on toil. The remaining 50% should go to engineering projects that improve reliability, performance, and utilization, which often reduces toil as well.
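The 50% cap is simple to check if a team records hours spent on toil. The sketch below assumes such hour tracking exists; the function names and the 40-hour example week are hypothetical.

```python
TOIL_CAP = 0.5  # at most 50% of time on toil, as stated above


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def over_toil_cap(toil_hours: float, total_hours: float,
                  cap: float = TOIL_CAP) -> bool:
    """True when the toil share exceeds the cap."""
    return toil_fraction(toil_hours, total_hours) > cap


if __name__ == "__main__":
    # 18 toil hours in a 40-hour week is under the cap; 26 hours is over it
    print(over_toil_cap(18, 40), over_toil_cap(26, 40))
```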