Error Budgets

SRE has found that roughly 70% of outages are due to changes in a live system.

Problem

Knowing this, there is no need to look any further for the reasons why SRE teams (or production teams, or whichever team gets called by angry customers) are so reluctant to change. If that is not enough, remember that their objectives are almost certainly based on the reliability of the services they maintain.

On the other side, teams in charge of developing new products try to push their code into production as often as possible (a trend that agility encourages, for better or worse). They do so to deliver new features to customers or to fix their own bugs, and also because they are evaluated on their velocity.

So you end up with a dichotomy between two populations that work on the same product, but not at the same level, and that share neither the same vision nor the same objectives.

Solution

The obvious solution is to have a single team cover the whole scope, but that is not always possible. To solve this dilemma without hours of negotiation or the blind approval of some manager (who will only decide based on their own objectives), the Google SRE team has found a data-driven answer called the error budget.

Instead, our goal is to define an objective metric, agreed upon by both sides, that can be used to guide the negotiations in a reproducible way. The more data-based the decision can be, the better.

The answer is to base the decision on the SLO (Service Level Objective): how unreliable is the service allowed to be within a single quarter? That allowance is the error budget: how much margin do we have to deploy new versions that could break production and impact the SLO?
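To make the idea of a margin concrete, here is a minimal sketch of the arithmetic for a time-based availability SLO. The function name `error_budget`, the 99.9% target and the 90-day quarter are assumptions chosen for illustration; they do not come from the book.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta) -> timedelta:
    """Return the downtime an availability SLO tolerates over a window.

    `slo` is a fraction (0.999 means 99.9% availability).
    """
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1]")
    # The budget is simply the part of the window the SLO does not cover.
    return window * (1.0 - slo)


# Assumption for the example: a quarter lasts 90 days.
quarter = timedelta(days=90)
budget = error_budget(0.999, quarter)
print(f"Allowed downtime this quarter: {budget}")
```

With these example numbers the budget comes to a little over two hours of downtime per quarter. A stricter SLO shrinks it quickly, which is why the target itself has to be agreed upon by both sides.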

And it’s quite simple to implement:

This ends with a simple rule: as long as actual performance is above the SLO, new releases can be pushed to production.
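The rule above can be sketched as a small release gate. This is only an illustration under stated assumptions: performance is measured here as the ratio of successful requests to total requests, and the function name `can_release` and the sample numbers are hypothetical. In practice the counts would come from the monitoring system both teams trust.

```python
def can_release(slo: float, good_events: int, total_events: int) -> bool:
    """Return True while measured performance is above the SLO."""
    if total_events == 0:
        # Assumption: with no traffic, none of the budget has been spent.
        return True
    measured = good_events / total_events
    return measured > slo


# Hypothetical quarter so far: 1,000,000 requests, 500 of them failed.
print(can_release(slo=0.999, good_events=999_500, total_events=1_000_000))

# Same traffic with 2,000 failures: the budget is exhausted.
print(can_release(slo=0.999, good_events=998_000, total_events=1_000_000))
```

The point of the design is that the function takes no opinion as input: the decision depends only on the agreed SLO and the measured numbers.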

Benefits

There are many benefits to applying this strategy.

Disclaimer: Most of this article is a rephrasing (or a summary, or at least my understanding) of the chapter “Motivation for Error Budgets” from Google's amazing book Site Reliability Engineering [1].


  1. Collective work, Site Reliability Engineering (O’Reilly, 2016). This book is also available for free online.