Ugh. Each of these points is a classic reliability precaution – yet all were missed simultaneously. As one analyst put it, Google had “written the book on Site Reliability Engineering” but still deployed code that could not handle null inputs. In hindsight, this outage looks like a string of simple errors aligning by unfortunate chance.
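To make the "could not handle null inputs" point concrete, this is roughly the shape of the missing check. The types and names here are hypothetical, not the actual Service Control code; just a sketch in Go of a defensive guard at the boundary where a policy record with blank fields arrives:

    package quota

    import "errors"

    // PolicyUpdate stands in for the kind of replicated policy record the
    // thread is about; QuotaFields may legitimately be nil/blank.
    type PolicyUpdate struct {
        QuotaFields *QuotaFields
    }

    type QuotaFields struct {
        Limit int64
    }

    // ApplyPolicy rejects incomplete input instead of dereferencing a nil
    // pointer and crash-looping the serving binary.
    func ApplyPolicy(p *PolicyUpdate) error {
        if p == nil || p.QuotaFields == nil {
            return errors.New("policy update missing quota fields; skipping")
        }
        // ... apply p.QuotaFields.Limit ...
        return nil
    }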
Yes, that's how major outages happen. At this stage of maturity, any single failure generally doesn't break things dramatically. When things go this wrong, it's ALWAYS a combination of failures: a failure in the recovery system, an omission in the detection systems, a gap in automated review, an oversight in ...
The vacuous gotcha language is indicative of the low quality of the whole article. As Metalnem says in comments here, see the official incident report for a better writeup and more insight. https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...
Very surprising that it was not feature flagged. When I worked at Google we went to enormous lengths to support partial feature rollouts; supporting this for complex codebases often added weeks to development, but it was crucial because an outage of this magnitude was considered sacrilege.
The problem, which I've only gotten enough of a taste of to see how untenable it becomes at that scale, is that with enough feature toggles, and enough partitions for rolling them out (%, region, or AZ cutoffs), you eventually spend most of your time shepherding rollouts, or coordinating with other people so you don't impinge on theirs, instead of writing the code behind the rollouts.
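For anyone who hasn't lived with these, the gate being described is roughly this shape. Names are made up; it's only a sketch of %/region partitioning in Go:

    package rollout

    import "hash/fnv"

    // Flag describes one toggle partitioned by region and by a percentage of
    // callers within those regions.
    type Flag struct {
        Enabled        bool
        Regions        map[string]bool // explicit region allowlist
        PercentEnabled uint32          // 0-100
    }

    // ShouldEnable buckets deterministically by project ID so a given caller
    // always lands on the same side of the rollout.
    func (f Flag) ShouldEnable(region, projectID string) bool {
        if !f.Enabled || !f.Regions[region] {
            return false
        }
        h := fnv.New32a()
        h.Write([]byte(projectID))
        return h.Sum32()%100 < f.PercentEnabled
    }

Multiply that by every in-flight change and every partition boundary, and the coordination overhead described above is exactly what you get.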
Rollout fatigue should be respected, even feared. It will insidiously tempt people to skip steps. And failing that, it will blur together in your mind the last twenty times you did this procedure, and you will forget if you ran step 5 before you were interrupted by someone. You will remember having done step 5, but you won't remember if that was ten minutes ago, or yesterday.
It's the reason I keep writing tools to force a checklist, or a prompted sequence. If I didn't check off step 5 I still need to do it. And I'm not even in operations.
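The tools in question are nothing fancier than this kind of thing, a toy sketch in Go with made-up steps: refuse to show step 5 until steps 1 through 4 are explicitly recorded, so memory never enters into it.

    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strings"
    )

    func main() {
        steps := []string{
            "Announce the rollout",
            "Enable the flag at 1% in one region",
            "Check the error-rate dashboard",
            "Enable the flag at 10%",
            "Enable the flag at 100%",
        }
        in := bufio.NewReader(os.Stdin)
        for i, step := range steps {
            fmt.Printf("Step %d: %s\nType 'done' to record it: ", i+1, step)
            line, _ := in.ReadString('\n')
            if strings.TrimSpace(line) != "done" {
                fmt.Printf("Step %d not confirmed; stopping so nothing is skipped.\n", i+1)
                return
            }
        }
        fmt.Println("All steps recorded.")
    }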