Why Google Crashes Their Own Service
Yes, you read that correctly. I’m going to attempt to explain this with a minimum of jargon, so those well-versed in DevOps and those utterly unaware of it can both get something out of this.
I recently started reading the Google SRE book (I know, I know–way overdue). If you aren’t familiar, this book basically defined the nascent field of Site Reliability Engineering (or SRE). Without getting too far into specifics, SRE is the discipline of applying software engineering practices to infrastructure and operations. Rather than a traditional operations team responsible for manually spinning up infrastructure–servers, firewalls, virtual machines, etc.–Google created SRE, which handles those same tasks with automation and a strong emphasis on reliability and observability. It is similar to the field of DevOps, and the two can often look functionally identical.
DevOps is notoriously poorly defined (which makes it extremely fun to tell relatives what I do for work). Ask 100 DevOps engineers what DevOps is, and you’ll get 300 different answers. This is partly because companies want to jump on the latest trend, so they rename their IT/ops divisions to DevOps and change nothing else. At its core, though, DevOps is the merger of development and operations (hence the name).
A true DevOps engineer, according to theory, will build a product by writing code, just as a developer would. They will then deploy it themselves, as an infrastructure engineer would, but using some kind of infrastructure-as-code tool (a way to create infrastructure by writing code instead of doing things manually). After deployment, they own the management of the product. If it goes down, that’s their problem. They fix things on the code side and on the infrastructure side. They are the complete combination of development and operations. That’s the theory.
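If “infrastructure as code” sounds abstract, here’s a toy sketch of the idea in Python. Everything in it is made up for illustration (real tools like Terraform or Pulumi do this declaratively against actual cloud providers), but the shape is the same: describe the infrastructure you want in code, and let a program make reality match.

```python
# A toy illustration of infrastructure as code: infrastructure is declared
# as data in a program and "applied" by that program, instead of being
# clicked together by hand. All of the classes and names here are made up
# for illustration; they are not any real tool's API.
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    name: str
    cpu_cores: int
    memory_gb: int


@dataclass
class FirewallRule:
    name: str
    allow_port: int


def apply(resources) -> None:
    """Pretend to reconcile the declared resources with reality.

    A real tool (Terraform, Pulumi, etc.) would compare this declaration
    against what already exists and create, update, or delete as needed.
    """
    for resource in resources:
        print(f"ensuring {type(resource).__name__} {resource.name!r} exists")


desired_infrastructure = [
    VirtualMachine(name="web-1", cpu_cores=2, memory_gb=8),
    VirtualMachine(name="web-2", cpu_cores=2, memory_gb=8),
    FirewallRule(name="allow-https", allow_port=443),
]
apply(desired_infrastructure)
```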
Pretty much no one does all of that; in most large organizations, it’s just not realistic or cost-effective. Most DevOps engineers do at least some of those things, though, with a focus on automation (in particular, automated building, testing, and deployment), observability (did your application break? Someone should probably be informed), and reliability (things generally shouldn’t break. As we will see, there are exceptions).
There is a lot of debate over whether to embrace SRE vs DevOps vs Platform Engineering vs whatever; all have their merits. In many large organizations, I think, SRE tends to be the most accurate label for applying software engineering methods to infrastructure/ops; but I digress. You probably clicked on this because of the title. So, as promised:
Yes, Google really does intentionally crash their own service. They do this for essentially psychological reasons–something akin to an experiment on their coworkers.
Most DevOps engineers end up running at least some services that are designed for use by other developers within their organization. As I tell my relatives, I write code that makes it easier for other people to write code. When doing so, most DevOps engineers (or at least the good ones) define some standards and best practices for how their services should be used. If they are in charge of approving merge requests (allowing code to be added to an application), they are effectively in charge of defining “good code.” If they are in charge of approving the provisioning of infrastructure, they are in charge of defining what a well-designed system looks like.
Being in charge of those definitions is a precarious position. If developers feel like you are imposing certain rules for no reason, they’re not going to want to work with you. And honestly? That’s fair. We’ve all had to deal with nonsense rules that waste our time. No one should have to slog through more of them. The way to get someone to follow your rules for “good code” is to get their buy-in. If you explain how these standards make life easier for them–mainly by meaning less work in the end and fewer incidents to deal with–you get happy collaborators.
That’s psychology, at the end of the day, and it’s a key part of what this job looks like. More specifically, it’s behavioral psychology, close to nudge theory. If you’ll pardon the introduction of yet another technical term, a nudge, for my purposes, is a small change in how choices are presented that leads to improved outcomes without relying on any kind of proscriptive rules. As Thaler and Sunstein put it in their book popularizing the term, “Putting fruit at eye level counts as a nudge. Banning junk food does not.” The classic example of an effective nudge is the urinal target, designed to–ahem–passively improve one’s aim. It’s for that reason that Google breaks their own service.
The service, in this case, is the little-known Chubby. Chubby is a distributed lock manager (or DLM): a tool that solves the problem of multiple processes in a distributed system (which can be thought of as separate computers working together as one) trying to access a given resource at the same time. Imagine two people editing the same file in their word processors at once. If both of them save their changes at the same time, there could be problems. Which is the real version of the file? A DLM solves this issue by deciding which process is allowed to read from or write to a file at a given time. For all but one lucky process, the file is locked. That’s what Chubby does, at a much larger scale, across many separate systems. It’s used throughout Google, to the point where many services rely on it, and it is extremely reliable.
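To make the locking pattern concrete, here’s a minimal sketch of what a client of a lock manager does: acquire the lock, do the write, release the lock. The LockService class below is a made-up, single-process stand-in for illustration, not Chubby’s actual interface; a real DLM coordinates this over the network, across many machines.

```python
# A minimal sketch of the pattern a client of a lock manager follows:
# acquire the lock, do the work, release the lock. LockService is a
# made-up, in-process stand-in for illustration; a real DLM like Chubby
# coordinates this over the network, across many machines.
from contextlib import contextmanager


class LockService:
    """Hypothetical lock-service client (single process, for illustration)."""

    def __init__(self):
        self._held = set()

    def acquire(self, resource: str) -> bool:
        # In a real system this is a network call that can block, time out,
        # or fail outright.
        if resource in self._held:
            return False
        self._held.add(resource)
        return True

    def release(self, resource: str) -> None:
        self._held.discard(resource)


@contextmanager
def exclusive(locks: LockService, resource: str):
    """Hold the lock for the duration of the with-block, then release it."""
    if not locks.acquire(resource):
        raise RuntimeError(f"{resource} is locked by another process")
    try:
        yield
    finally:
        locks.release(resource)


locks = LockService()
with exclusive(locks, "/reports/q3.doc"):
    # Only the process holding the lock writes; everyone else waits or
    # is told the file is locked, instead of silently clobbering the save.
    print("saving the file safely")
```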
But extremely reliable is not 100% reliable, and in this case that extreme reliability became a liability. You see, internal clients (developers within Google who used Chubby) came to rely on it entirely, building dependencies that would not function at all if Chubby went down. They viewed Chubby as so reliable as to be effectively 100% available, all the time. This is poor design; you want to make your own service as reliable as possible, which means accounting for the possibility that the services you depend on will fail. The Chubby team was a victim of their own success: they had built a service so reliable that, when it did have an occasional issue, it completely took down some of the services that used (and relied on) it. The SRE team had a psychological problem to solve, rather than a code one. They couldn’t change people’s code themselves; they couldn’t make other people’s products more resilient directly; but they could give a nudge.
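What does accounting for a dependency’s failure look like in practice? Here’s one generic resilience pattern, sketched in Python: retry a few times, then degrade to cached data instead of crashing. The function and cache names are hypothetical, and this isn’t the specific fix Google’s internal teams made; it’s just an illustration of the design attitude the Chubby team wanted to nudge people toward.

```python
# A generic resilience pattern: assume the dependency can fail, and decide
# up front what happens when it does. The function and cache below are
# hypothetical names for illustration, not Google's actual fix.
import time


class DependencyUnavailable(Exception):
    """Raised when the downstream service (e.g. a lock service) is down."""


def fetch_config_via_lock_service() -> dict:
    # Stand-in for a call that depends on the lock service; here it always
    # fails so the fallback path is exercised.
    raise DependencyUnavailable("lock service is down")


_last_known_good = {"feature_flags": {"new_ui": False}}


def get_config(retries: int = 3, backoff_seconds: float = 0.5) -> dict:
    """Try the dependency a few times, then degrade to cached data
    instead of taking this whole service down with it."""
    global _last_known_good
    for attempt in range(retries):
        try:
            config = fetch_config_via_lock_service()
            _last_known_good = config  # refresh the fallback on success
            return config
        except DependencyUnavailable:
            time.sleep(backoff_seconds * (2 ** attempt))  # exponential backoff
    return _last_known_good  # degraded, but still serving


print(get_config())
```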
The nudge, in this case, was a planned outage: crashing their service themselves. Let’s say Chubby had a service level objective (or SLO–what customers can expect from the service) of 99.995% availability. If they were significantly above that number–let’s say 99.9999% uptime–they would make use of that extra wiggle room and trigger an intentional outage. This would have the benefit of making sure that developers were aware Chubby could fail–and that their own services could handle that failure gracefully rather than halting and catching fire. The SRE team could, in the words of Marc Alvidrez, a Google SRE, “flush out unreasonable dependencies on Chubby shortly after they are added. Doing so forces service owners to reckon with the reality of distributed systems sooner rather than later.”
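To put rough numbers on that wiggle room, here’s the back-of-the-envelope math with the hypothetical figures above (a 99.995% SLO and 99.9999% actual uptime) over a 90-day quarter. The gap between what the SLO permits and what real incidents consumed is the budget a planned outage can spend.

```python
# Back-of-the-envelope error-budget math, using the hypothetical figures
# above (99.995% SLO, 99.9999% actual uptime) over a 90-day quarter.
QUARTER_MINUTES = 90 * 24 * 60  # 129,600 minutes in a quarter

slo = 0.99995      # availability promised to users
actual = 0.999999  # availability actually delivered

budget = (1 - slo) * QUARTER_MINUTES   # downtime the SLO permits
used = (1 - actual) * QUARTER_MINUTES  # downtime real incidents consumed
surplus = budget - used                # room left for a planned outage

print(f"error budget:  {budget:.1f} minutes per quarter")   # ~6.5 minutes
print(f"actually used: {used:.2f} minutes per quarter")     # ~0.13 minutes
print(f"surplus:       {surplus:.1f} minutes per quarter")  # ~6.4 minutes
```

That works out to roughly six and a half minutes of permitted downtime per quarter, of which real incidents used only a handful of seconds; the surplus is what an intentional outage burns.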
The SRE team couldn’t force developers to design reliable systems, but they could demonstrate to those developers just how advantageous reliable system design can be. They judged this to be so worthwhile that they were willing to trigger intentional incidents to make it happen.
You can read more about this in Chapter 4 of the SRE book. I’d highly recommend reading the book from the beginning, though; it makes a compelling case for several counterintuitive ideas, including the at-first-baffling declaration that 100% reliability is actually a bad thing (Chapter 3). Give it a shot; you just might find yourself agreeing.