
> What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.

It is possible to have all of those mitigations in place and still experience a failure like this.

Post deployment validation is only as good as the validations executed. 99% coverage still leaves the door open to failure.
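To make the coverage point concrete, here is a hypothetical sketch of a minimal post-deployment validation harness (the check names are invented for illustration): every written check can pass while the one failure mode nobody thought to check for sails through.

```python
# Minimal post-deployment validation harness (hypothetical sketch).
# Each check is a named callable returning True on success.

def run_validations(checks):
    """Run every check; return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures

# The suite covers the API and the cache, but nobody ever wrote a check
# for the one table that mattered -- 99% coverage still misses it.
checks = {
    "api_responds": lambda: True,
    "cache_warm": lambda: True,
    # "critical_table_intact" is the check that was never written
}

print(run_validations(checks))  # -> [] : deployment looks "green"
```

The harness can only report on checks that exist; an empty failure list means "nothing we tested broke," not "nothing broke."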

A DR strategy is just that - a strategy.

A failure of this sort is not an automatic implication that those things do not exist, just that they failed in this particular case.

I would find it incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident if none of those things were in place.

They’d be either incredibly lucky, or incredibly competent, and if they are the latter, they would not operate without such mitigations in place.

It seems far more believable that an organization of the FAA’s age and complexity missed something along the way.



> incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident

I'm not surprised. The FAA does not fly each plane. Government organizational complexity helps ensure the government organization survives through the next round of Congressional appropriations.

Org complexity + opaque oversight + 'safety' + 'homeland security' + taxpayer funded = playing around and more budget.

The pilot is responsible for safety. Air travel has rules to avoid collisions (eastbound gets altitude levels different than westbound, pilots shall broadcast on known frequencies) and pilots have distributed intelligence to keep their flight safe.

Yes, somehow there needs to be coordination of runway use. Many ways to provide reservations and queuing.


We can make excuses all day long. A simple query of the database/table would have produced an error. Sure, the FAA does some complex stuff, but the tech I see in airplanes looks ancient. I'm willing to bet most of the FAA complexity comes from budget (lack thereof) and old computer systems.


This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.

This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.

Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.

Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change because the cost of an outage could be measured in the millions of $ per minute.

Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing that required quite a lot of involvement from all related teams.

So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.

Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was rebooted (I think a security group insisted it needed patches, despite some strong architectural reasons to avoid that), some of the services never came back up. Everyone quickly learned that a very old change, predating almost everyone there, had established this unknown dependency.

The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.

All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.


Those were the wrong procedures. If you were regularly rebooting systems left and right you'd learn quickly if things didn't come up.
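The reboot-and-verify practice being described amounts to diffing the services you expect against the services actually running after each restart. A toy sketch (service names are made up for illustration):

```python
# Hypothetical sketch: after a reboot, diff the expected service set
# against what is actually running to spot anything that never came back.

def missing_after_reboot(expected, running):
    """Return services that should be up but aren't, sorted for stable output."""
    return sorted(set(expected) - set(running))

expected = {"auth-gateway", "ledger", "legacy-rpc"}   # assumed inventory
running_after = {"auth-gateway", "ledger"}            # legacy-rpc never returned

print(missing_after_reboot(expected, running_after))  # -> ['legacy-rpc']
```

Of course, this only catches dependencies that appear in the expected inventory in the first place; an undocumented transitive dependency is invisible to exactly this kind of check.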


And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.


It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.

But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.

Modern dev teams in companies that build software have more checks and balances in place from the get go that help head off some categories of failure.

But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.

The problem is that the cost to replace some of these systems was inestimable.


We found the person who intimately knows how FAA’s system is engineered and who also builds perfect systems


A "simple query". The tech "looks ancient". You're "willing to bet" things. And yet you speak so confidently and derisively about this outage.

It must be nice to sit behind your keyboard and just have all of the answers all day long! Do you have any tips for how to be so omniscient?


We can make shit up and pretend to be experts and criticize things we know absolutely nothing about all day long, too.



