
> What's clear to me is that the FAA has no post deployment validation, hasn't tested its DR strategy, and that errors can go unseen for long periods of time.

It is possible to have all of those mitigations in place and still experience a failure like this.

Post deployment validation is only as good as the validations executed. 99% coverage still leaves the door open to failure.
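To make the coverage point concrete, here is a hypothetical sketch of a minimal post-deployment validation harness (the check names are invented for illustration): every written check can pass while the one failure mode nobody thought to check for sails through.

```python
# Minimal post-deployment validation harness (hypothetical sketch).
# Each check is a named callable returning True on success.

def run_validations(checks):
    """Run every check; return the names of the ones that failed."""
    failures = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failure
        if not ok:
            failures.append(name)
    return failures

# The suite covers the API and the cache, but nobody ever wrote a check
# for the one table that mattered -- 99% coverage still misses it.
checks = {
    "api_responds": lambda: True,
    "cache_warm": lambda: True,
    # "critical_table_intact" is the check that was never written
}

print(run_validations(checks))  # -> [] : deployment looks "green"
```

The harness can only report on checks that exist; an empty failure list means "nothing we tested broke," not "nothing broke."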

A DR strategy is just that - a strategy.

A failure of this sort is not an automatic implication that those things do not exist, just that they failed in this particular case.

I would find it incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident if none of those things were in place.

They’d be either incredibly lucky, or incredibly competent, and if they are the latter, they would not operate without such mitigations in place.

It seems far more believable that an organization of the FAA’s age and complexity missed something along the way.



> incredibly surprising that an organization of that complexity could have survived as long as they did without a major incident

I'm not surprised. The FAA does not fly each plane. Government organizational complexity helps ensure the government organization survives through the next round of Congressional appropriations.

Org complexity + opaque oversight + 'safety' + 'homeland security' + taxpayer funded = playing around and more budget.

The pilot is responsible for safety. Air travel has rules to avoid collisions (eastbound gets altitude levels different than westbound, pilots shall broadcast on known frequencies) and pilots have distributed intelligence to keep their flight safe.

Yes, somehow there needs to be coordination of runway use. Many ways to provide reservations and queuing.


We can make excuses all day long. A simple query of the database/table would have produced an error. Sure, the FAA does some complex stuff, but the tech I see in airplanes looks ancient. I'm willing to bet most of the FAA complexity comes from budget (lack thereof) and old computer systems.


This has nothing to do with excuses - I’m challenging the assertion that “because something bad happened, they must not have any mitigations in place at all”.

This seems like a bad case of binary thinking, and my point was that the occurrence of an incident like this is not sufficient to support that claim. It’s just as likely that an ancient process that wasn’t accounted for somewhere in the architecture broke down, and this is how it manifested.

Clearly improvements are needed, as is always the case after an outage. That doesn’t justify wild speculation.

Anecdote time: I once worked for a large financial institution that makes money when people swipe their credit cards. The system that authorizes purchases is ancient, battle tested, and undergoes minimal change because the cost of an outage could be measured in the millions of $ per minute.

Every change was scrutinized, reviewed by multiple groups, discussed with executives, and tested thoroughly. The same system underwent regular DR testing that required quite a lot of involvement from all related teams.

So the day it went down, it was obviously a big deal, and raised all of the natural questions about how such a thing could occur.

Turns out it had an unknown transitive dependency on an internal server - a server that had not been rebooted in literally a decade. When that server was rebooted (I think a security group insisted it needed patches, despite some strong architectural reasons to avoid that), some of the services never came back up. Everyone quickly learned that a very old change, predating almost everyone there, had established this unknown dependency.

The point of this story is really about the unknowability of sufficiently complex legacy enterprise systems.

All of the right processes and procedures won’t necessarily account for that seemingly inconsequential RPC call to an internal system implemented by a grizzled dev shortly before his retirement.


Those were the wrong procedures. If you were regularly rebooting systems left and right you'd learn quickly if things didn't come up.
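The reboot-and-verify practice being described amounts to diffing the services you expect against the services actually running after each restart. A toy sketch (service names are made up for illustration):

```python
# Hypothetical sketch: after a reboot, diff the expected service set
# against what is actually running to spot anything that never came back.

def missing_after_reboot(expected, running):
    """Return services that should be up but aren't, sorted for stable output."""
    return sorted(set(expected) - set(running))

expected = {"auth-gateway", "ledger", "legacy-rpc"}   # assumed inventory
running_after = {"auth-gateway", "ledger"}            # legacy-rpc never returned

print(missing_after_reboot(expected, running_after))  # -> ['legacy-rpc']
```

Of course, this only catches dependencies that appear in the expected inventory in the first place; an undocumented transitive dependency is invisible to exactly this kind of check.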


And then you find an obscure service doesn’t come back up on the 10,000th or 100,000th reboot because of <any number of reasons>. And now you have multiple states, because you have to handle failover. It’s turtles all the way down.


It’s always easy to say that in hindsight. But keep in mind this is an environment with many core components built in the 80s. Regular reboots on old AIX systems weren’t a common practice - the sheer uptime capability of these systems was a big selling point in an environment that looks nothing like a modern cloud architecture.

But none of that is really the point. The point is that even with every correct procedure in place, you’ll still encounter failures.

Modern dev teams in companies that build software have more checks and balances in place from the get go that help head off some categories of failure.

But when an organization is built on core tech born of the 80s/90s, there will always be dragons, regardless of the current active policies and procedures.

The problem is that the cost to replace some of these systems was inestimable.


We found the person who intimately knows how FAA’s system is engineered and who also builds perfect systems


A "simple query". The tech "looks ancient". You're "willing to bet" things. And yet you speak so confidently and derisively about this outage.

It must be nice to sit behind your keyboard and just have all of the answers all day long! Do you have any tips for how to be so omniscient?


We can make shit up and pretend to be experts and criticize things we know absolutely nothing about all day long, too.



