A circuit breaker is a well-known piece of technology found in almost every house. According to Wikipedia, it is
designed to protect an electrical circuit from damage caused by excess current, typically resulting from an overload or short circuit. Its basic function is to interrupt current flow after a fault is detected. Unlike a fuse, which operates once and then must be replaced, a circuit breaker can be reset (either manually or automatically) to resume normal operation.
OK, but what does this have to do with software architecture? The same concept can be used to protect an entire system from a cascading failure triggered by the failure of a single weak dependency. Here is an example of this kind of failure, given by Lee Atchison in his book Architecting for Scale [1].
A classic example of the pitfalls of ignoring dependency failure occurred in a real-life application I worked on. The application provided a service to customers, and on the top of every page was a customizable icon representing the currently logged-in user. The icon was generated by a third-party system.
One day, the third-party system that generated the icon failed. Our application, which assumed that system would always work, didn’t know what to do. As a result, our application failed as well. Our entire application failed simply because the icon-generation system—a very minor “feature”—failed.
The circuit breaker pattern was defined to isolate a system from operations that are likely to fail. Remote calls typically fall into this category: they can fail outright, hang, or become very slow, and any of these can impact the whole system. Circuit breakers are a way to automatically degrade functionality when the system is under stress. They were popularized by Michael Nygard in his amazing book Release It! [2], where they are presented as one of the stability patterns.
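To make "degrading functionality" concrete, here is a minimal sketch of how the icon failure described above could have been contained. The names `user_icon`, `fetch_icon` and `DEFAULT_ICON` are hypothetical and do not come from the book. Note that this is only the fallback half of the story: a plain try/except still calls the failing service every time, which is exactly what a circuit breaker avoids.

```python
DEFAULT_ICON = "default-avatar.png"  # hypothetical placeholder asset


def user_icon(user_id, fetch_icon):
    """Return the user's icon, or a default one if the icon service fails.

    `fetch_icon` stands in for the call to the third-party system.
    """
    try:
        return fetch_icon(user_id)
    except Exception:
        # The icon is a minor feature: degrade instead of failing the page.
        return DEFAULT_ICON
```

With such a fallback, the page is rendered with a generic icon while the third-party system is down, instead of not being rendered at all.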
How does it work?
We can ask Martin Fowler for that [3].
The basic idea behind the circuit breaker is very simple. You wrap a protected function call in a circuit breaker object, which monitors for failures. Once the failures reach a certain threshold, the circuit breaker trips, and all further calls to the circuit breaker return with an error, without the protected call being made at all.
And Michael Nygard explains the concepts, borrowed from the electrical domain, that describe the states of the circuit breaker:
- closed: operations are executed as usual
- open: operations fail immediately
In the normal “closed” state, the circuit breaker executes operations as usual. These can be calls out to another system, or they can be internal operations that are subject to timeout or other execution failure. If the call succeeds, nothing extraordinary happens. If it fails, however, the circuit breaker makes a note of the failure. Once the number of failures (or frequency of failures, in more sophisticated cases) exceeds a threshold, the circuit breaker trips and “opens” the circuit […] When the circuit is “open,” calls to the circuit breaker fail immediately, without any attempt to execute the real operation.
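The two states described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation from the book: the class name, the `failure_threshold` parameter and the choice to count consecutive failures (a success resets the counter) are all assumptions made for the example.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.failure_count = 0
        self.state = "closed"

    def call(self, operation, *args, **kwargs):
        if self.state == "open":
            # Fail fast: the protected operation is not attempted at all.
            raise CircuitOpenError("circuit is open")
        try:
            result = operation(*args, **kwargs)
        except Exception:
            # Make a note of the failure and trip once the threshold is reached.
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.state = "open"
            raise
        # Assumption: only consecutive failures count, so a success resets.
        self.failure_count = 0
        return result
```

A caller wraps the remote call as `breaker.call(fetch_icon, user_id)` and handles `CircuitOpenError` like any other failure of the dependency, typically by falling back to a degraded result. In this simplified version the breaker never closes again once it has tripped.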
All about automation
An interesting property of the circuit breaker is its ability to move automatically from the open state back to the closed state, that is, to return to normal operation after a period in degraded mode. This is why it is called a circuit breaker and not a fuse, which operates once and must then be replaced.
After a suitable amount of time, the circuit breaker decides that the operation has a chance of succeeding, so it goes into the “half-open” state. In this state, the next call to the circuit breaker is allowed to execute the dangerous operation. Should the call succeed, the circuit breaker resets and returns to the “closed” state, ready for more routine operation. If this trial call fails, however, the circuit breaker returns to the “open” state until another timeout elapses.
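The recovery cycle described above can be sketched as follows. Again, this is a single-threaded illustration under assumptions: `reset_timeout` and the injectable `clock` parameter are names chosen for the example (the clock is injectable only so that the behavior can be tested without waiting).

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay open
        self.clock = clock
        self.failure_count = 0
        self.state = "closed"
        self.opened_at = None

    def call(self, operation):
        if self.state == "open":
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit is open")
            # Timeout elapsed: let this one call through as a trial.
            self.state = "half-open"
        try:
            result = operation()
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failure_count += 1
        # A failed trial call reopens the circuit immediately.
        if (self.state == "half-open"
                or self.failure_count >= self.failure_threshold):
            self.state = "open"
            self.opened_at = self.clock()

    def _on_success(self):
        # A success (including a successful trial) closes the circuit.
        self.failure_count = 0
        self.state = "closed"
```

A production implementation would also need to guard the state with a lock so that only one concurrent caller performs the trial call in the half-open state.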
Being informed of failure
The state of the circuit breaker has to be exposed so that it can be monitored and so that operations are informed of a failure in the system. Even though the breaker switches between states automatically, the operations team must be able to force it open or closed.
In fact, the frequency of state changes is a useful metric to chart over time; it is a leading indicator of problems elsewhere in the enterprise. Likewise, operations needs some way to directly trip or reset the circuit breaker.
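The operational side can be sketched separately from the failure counting. The example below shows only the monitoring and manual-override surface; the names `MonitoredBreaker`, `on_state_change`, `trip` and `reset` are assumptions made for illustration, and the callback is where a real system would publish a metric or a log entry.

```python
class MonitoredBreaker:
    def __init__(self, on_state_change=None):
        self.state = "closed"
        self.state_changes = 0  # a metric worth charting over time
        self.on_state_change = on_state_change

    def _set_state(self, new_state):
        if new_state == self.state:
            return  # not a transition, nothing to report
        old_state, self.state = self.state, new_state
        self.state_changes += 1
        if self.on_state_change is not None:
            self.on_state_change(old_state, new_state)

    def trip(self):
        """Manual override: force the circuit open."""
        self._set_state("open")

    def reset(self):
        """Manual override: force the circuit closed."""
        self._set_state("closed")
```

Routing every transition, automatic or manual, through a single `_set_state` method is a design choice that keeps the counter and the notifications consistent.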
Products out of the box
Netflix has done a significant amount of work in the Site Reliability Engineering domain. Many people know its famous Simian Army, but the company has also published Hystrix, a library implementing the circuit breaker pattern in Java.
Hystrix is a latency and fault tolerance library designed to isolate points of access to remote systems, services and 3rd party libraries, stop cascading failure and enable resilience in complex distributed systems where failure is inevitable.
It even provides very clear built-in dashboards.