A Null Pointer Exception Brought Down Mighty Google; 7 Hours of Downtime (getpanto.ai)
9 points by pavan_panto 7 months ago | 9 comments


Ugh. Each of these points is a classic reliability precaution – yet all were missed simultaneously. As one analyst put it, Google had “written the book on Site Reliability Engineering” but still deployed code that could not handle null inputs. In hindsight, this outage looks like a string of simple errors aligning by unfortunate chance.

Yes, that's how major outages happen. By this stage of maturity, any single failure generally doesn't break things dramatically. When things go this wrong, it's ALWAYS a combination of failures: failure of a recovery system, omission in detection systems, gap in automated review, oversight in ...

The vacuous gotcha language is indicative of the low quality of the whole article. As Metalnem says in comments here, see the official incident report for a better writeup and more insight. https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...


I really love the "Swiss Cheese model" for showing this in a very explicit way: it's easy to see how the most improbable thing could happen.

https://en.wikipedia.org/wiki/Swiss_cheese_model
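
To put rough numbers on the model, here is a back-of-the-envelope sketch. It assumes the layers fail independently, which real outages often violate, since one root cause can punch through several layers at once. The layer names and miss rates are made up for illustration and are not taken from the incident report.

```python
import math

# Hypothetical per-layer probabilities that a defect slips through.
# These values are illustrative assumptions, not measured data.
layer_miss_rates = {
    "code review": 0.10,
    "automated tests": 0.10,
    "staged rollout": 0.10,
    "monitoring and rollback": 0.10,
}


def chance_all_layers_miss(miss_rates):
    """Probability that every layer misses, assuming independent layers."""
    return math.prod(miss_rates)


p_outage = chance_all_layers_miss(layer_miss_rates.values())
print(f"{p_outage:.4%} of defects get through all four layers")
```

Each layer on its own is fairly leaky here, yet the holes rarely line up. With enough changes shipped per year, "rarely" still happens.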


This is a terrible article. It doesn’t cite its sources and, even worse, invents quotes. It's basically just an ad for some Panto AI tool.

Save yourself some time and just read the official incident report: https://status.cloud.google.com/incidents/ow5i3PPK96RduMcb1S...

Or the previous Hacker News discussion: https://news.ycombinator.com/item?id=44274563


Very surprising that it was not feature flagged. When I worked at Google, we went to enormous lengths to support partial feature rollouts. Supporting this for complex codebases often added weeks to development, but it was crucial because an outage of this magnitude was considered sacrilege.
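
For anyone who hasn't worked with partial rollouts, a minimal sketch of the idea, assuming a simple percentage-based flag. This is not Google's system; the names `bucket`, `is_enabled`, and `new_quota_check` are hypothetical. Each unit (a user, project, or region) hashes to a stable bucket from 0 to 99, and the new code path is taken only for buckets below the current rollout percentage.

```python
import hashlib


def bucket(flag_name, unit_id):
    """Map a (flag, unit) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{unit_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, unit_id, rollout_percent):
    """True if this unit falls inside the current rollout percentage."""
    return bucket(flag_name, unit_id) < rollout_percent


# Hypothetical usage: take the new code path for roughly 5% of projects.
if is_enabled("new_quota_check", "project-1234", rollout_percent=5):
    print("new code path")
else:
    print("old code path")
```

Because the bucket is a hash rather than a random draw, a unit that is enabled at 5% stays enabled at 10%, and setting the percentage to 0 turns the new path off everywhere without a redeploy.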


fitting for the company that invented the nil pointer exception


[flagged]


Google’s scale is insane, but this shows how fragile even the biggest clouds can be. Hope they drop the full technical details soon.


The problem, which I've only gotten enough of a taste of to see how untenable it would be at that scale, is that with enough feature toggles, and enough partitions to roll them out across (%, region, or AZ cutoffs), you eventually spend most of your time shepherding rollouts, or coordinating with other people not to impinge on theirs, instead of writing code to roll out.

Rollout fatigue should be respected, even feared. It will insidiously tempt people to skip steps. And failing that, it will blur together in your mind the last twenty times you did this procedure, and you will forget if you ran step 5 before you were interrupted by someone. You will remember having done step 5, but you won't remember if that was ten minutes ago, or yesterday.

It's the reason I keep writing tools to force a checklist, or a prompted sequence. If I didn't check off step 5 I still need to do it. And I'm not even in operations.
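
A minimal sketch of the kind of tool I mean, assuming an in-memory checklist. The class name, method names, and step names are hypothetical, and a real tool would persist state to disk so it survives the interruption. It refuses to check off a step out of order and records when each step was done, so "ten minutes ago or yesterday" has an answer.

```python
from datetime import datetime, timezone


class Checklist:
    """Forces steps to be checked off strictly in order."""

    def __init__(self, steps):
        self.steps = list(steps)
        self.done = []  # list of (step, completion time) pairs

    def next_step(self):
        """Return the next unchecked step, or None if all are done."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def check_off(self, step):
        """Mark a step done; raise if it is not the next expected step."""
        expected = self.next_step()
        if expected is None:
            raise RuntimeError("all steps are already checked off")
        if step != expected:
            raise RuntimeError(f"next step is {expected!r}, not {step!r}")
        self.done.append((step, datetime.now(timezone.utc)))

    def complete(self):
        return len(self.done) == len(self.steps)


# Hypothetical rollout procedure.
rollout = Checklist(["deploy canary", "watch dashboards", "deploy region 1"])
rollout.check_off("deploy canary")
print("next:", rollout.next_step())
```

Trying `rollout.check_off("deploy region 1")` at this point raises, because "watch dashboards" has not been checked off yet.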


They did! It was already discussed here: https://news.ycombinator.com/item?id=44274563.


Thanks. This article appears to be no more than an AI-slop-summary of the official Google report (plus of course, some advertising tacked on).



