Announcement

**Evan** · 2016-09-08, 10:06

Originally posted by TeeVee View Post

The airline was upgrading its check-in system

Piecemeal. Without proper redundancy. Because there are no standards and no penalties.

**TeeVee** · 2016-09-08, 10:19

Originally posted by Evan View Post

Piecemeal. Without proper redundancy.

and you know this, how exactly?

**Schwartz** · 2016-09-11, 15:07

Originally posted by Evan View Post

At first, I just saw this as a logistical nightmare. First Southwest, then Delta experience debilitating fleetwide cancellations due to ground computer network meltdowns. People are stuck safely on the ground. Not a safety issue.

But what happens next? A great backlog of flights. Enormous pressure to get everything back on track as soon as possible. How does that affect pilot rostering? Turnaround times? MEL issues? How much stress and pressure are flight and ground crews under to get it all back to normal? What is the leading cause of accidents? Stress, pressure, fatigue, shortcuts, bad decisions based on get-there-itis, questionable dispatches, a lack of contingency for safety reasons, back office pressure...?

The way I see it, such events DO heighten the risk of something going wrong. So shouldn't these systems, a relatively new but essential component of aviation, have FAR certification standards as well? Shouldn't these systems be required to conform to "aircraft-grade" reliability and proven redundancy requirements? Both Southwest and Delta have stated that they have back-up systems in place, but in both cases these systems failed to work. Flimsy stuff.

Well, maybe simple economics will take care of it. I can well imagine the airlines not caring to invest in better back-up systems before such an event happens, but now Southwest is looking at "tens of millions" of dollars in damages. Perhaps better back-up systems are looking more appealing now...

Still, it's the 21st century. Everything depends on these networks. Why isn't this a safety issue? Shouldn't we have FARs for this?

As a multi-decade veteran of building and operating enterprise software, (for banks) I just had to weigh in here. First, reliability in anything is expensive, and the cost per added reliability is an exponential curve. In software and computer related hardware, reliability also comes at the cost of agility and velocity (i.e. the more reliable, the older and more antiquated it will be because you can't change or upgrade it quickly).

The cynical might say, public safety deserves money, but at the end of the day, if you make the software slow, antiquated, and super expensive, you are starving the rest of the organization from cash, which will no doubt put different safety stresses. Frankly, starving the company of money is far more likely to induce safety related stresses than the rare outage and timetable screwups that occur from other worldwide events like weather etc. Keep the high reliability and fault tolerance in the minimal areas where you require it -- plane maintenance, training, operating procedures etc -- and save your money in places where you don't need it.

**Evan** · 2016-09-11, 16:38

Originally posted by Schwartz View Post

As a multi-decade veteran of building and operating enterprise software, (for banks) I just had to weigh in here. First, reliability in anything is expensive, and the cost per added reliability is an exponential curve. In software and computer related hardware, reliability also comes at the cost of agility and velocity (i.e. the more reliable, the older and more antiquated it will be because you can't change or upgrade it quickly).

The cynical might say, public safety deserves money, but at the end of the day, if you make the software slow, antiquated, and super expensive, you are starving the rest of the organization from cash, which will no doubt put different safety stresses. Frankly, starving the company of money is far more likely to induce safety related stresses than the rare outage and timetable screwups that occur from other worldwide events like weather etc. Keep the high reliability and fault tolerance in the minimal areas where you require it -- plane maintenance, training, operating procedures etc -- and save your money in places where you don't need it.

Well, in deference to your experience, the multi-decade thing might be working against you. Current technology is far less costly and far more agile than it was even a decade ago. And it's not so much a matter of infrastructural complexity as it is a matter of sound architecture. Where a single component of a system might fail, another can be assigned to take over and any disruption of the entire network will be momentary. Large media organizations have mastered this (when is the last time Facebook experienced a prolonged delay across their network).

Certainly there is expense involved. But modern components are vastly cheaper than the ones currently in use by these airline networks. Many of those now in use date back to the 90's! These are fragile, neglected networks we are dealing with that are desperately in need of replacement. The airline industry has the funds to do this. It would have a somewhat deleterious effect on short-term earnings but that money isn't otherwise going to 'plane maintenance, training, operating procedures etc.', it is being used to inflate the bottom line.

We don't have to make a choice between safety in the air and ground network reliability. We can and should have both. And I still contend that the latter can affect the former.

**TeeVee** · 2016-09-12, 01:50

Originally posted by Evan View Post

Large media organizations have mastered this (when is the last time Facebook experienced a prolonged delay across their network).

yes, this is a wonderful analogy, especially since facebook's systems do so many different things, like accepting pictures of food, finding friends for you and the oh so important poke.

if you did know anything about networks and enterprise equipment you would realize that it has changed very little in the last decade. take HDD's for example. on the server level, they use the same technology as 16 years ago. RAID has not changed in god only knows how long. TCP/IP is the same, as are probably all other network protocols. sure, speeds have improved, but not much else of grave importance from a hardware perspective.

what is probably limiting airlines far more than hardware are the software systems they use. you do realize of course that they are all custom and proprietary and HUGELY expensive. BA didn't walk into CompUSA or its equivalent and buy a software solution in a box.

anyway, boring convo. suffice to say that reservation system failures are not a safety issue.

**Schwartz** · 2016-09-12, 03:51

Originally posted by Evan View Post

Well, in deference to your experience, the multi-decade thing might be working against you. Current technology is far less costly and far more agile than it was even a decade ago. And it's not so much a matter of infrastructural complexity as it is a matter of sound architecture. Where a single component of a system might fail, another can be assigned to take over and any disruption of the entire network will be momentary. Large media organizations have mastered this (when is the last time Facebook experienced a prolonged delay across their network).

Certainly there is expense involved. But modern components are vastly cheaper than the ones currently in use by these airline networks. Many of those now in use date back to the 90's! These are fragile, neglected networks we are dealing with that are desperately in need of replacement. The airline industry has the funds to do this. It would have a somewhat deleterious effect on short-term earnings but that money isn't otherwise going to 'plane maintenance, training, operating procedures etc.', it is being used to inflate the bottom line.

We don't have to make a choice between safety in the air and ground network reliability. We can and should have both. And I still contend that the latter can affect the former.

Evan, given that I currently run both the development and operations for a SaaS (software as a service) company, I assure you I am very familiar with current technologies and the methodologies used by the unicorns (i.e. facebook or google) and other companies that have more robust software. Facebook has errors all the time since they continually update their software. The reason they don't care, is that nothing happens if you see an out of date post about your long lost cousin for half a day. It doesn't matter. No one is even paying for that service. For a couple of years, their iOS application had the lowest ratings on the Apple store because they didn't do any testing, because they figured people and their partners would report their problems for them.

For you to make a comparison of their software with the totally different world of scheduling and routing for airlines, shows a lack of fundamental understanding. To suggest that the business software fall under some sort of regulatory umbrella also shows you don't really understand what you are asking for. New software techniques (like having your customers test your software like Facebook does) do not mesh with regulation which relies heavily on documented requirements and extensive testing.

At my last company we did some work in health care and we got ISO certified for medical devices. It didn't change anything about the software. It added a ton of overhead regarding the documentation of requirements and testing. Now, ask yourself, if you have to document every requirement and test pass, how is that going to speed up the creation, modification or adoption of newer software? How do you figure you're going to use all those third party open source components that the unicorns use? You're not, because they're out of your control, they change quickly, and they sure don't document their requirements or tests for you.

I'm not saying that Delta did a good job of disaster planning... clearly they didn't. Even with old systems, you should be able to prevent such an extended outage due to power failure. However, the longer a system has been in place, it is that much more expensive to replace. Think about it, they have 20 years invested in that software. How are they going to easily extract everything that it does and repeat it with newer stuff? That is thousands of person years of work invested and that will cost a lot of money to replace. I guarantee the cost to replace the software would easily eclipse the business they lost during the outage. I also know replacing these things is risky because it is impossible to run a comprehensive test due to the complexities. The safest approach is to assume failure and make sure you have things in place to mitigate appropriately.

I can say from experience working with hardened labs at IBM and big banks. Every single redundant system I have seen has failed at least once during an outage that the redundancy was designed to prevent. Every single one whether it be hardware or software. One of the top 5 banks I did work for a mere handful of years ago, had a power failure in a big datacenter when the electricians wiring the backup systems made a mistake. So the cooling system went down, and then the hardware all overheated and shut down. The alerting system shut down, so they didn't figure it out until someone physically walked over to the room and opened the door to a blast of hot air.
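The point about redundancy that fails exactly when it is needed can be put into rough numbers. The following is a minimal sketch, not anything from the thread: it assumes a simple primary/backup pair with independent units, and the function name `pair_availability` and the `failover_coverage` parameter (the chance that a failure is detected and the switchover actually works) are hypothetical names chosen for illustration.

```python
def pair_availability(unit: float, failover_coverage: float) -> float:
    """Availability of a primary/backup pair under a simplified model.

    unit: availability of each unit on its own (0..1), assumed independent.
    failover_coverage: probability that a primary failure is detected and
        the switch to the backup actually succeeds (0..1).
    """
    for name, value in (("unit", unit), ("failover_coverage", failover_coverage)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")
    # Up if the primary is up, or if it is down, failover works, and the backup is up.
    return unit + (1.0 - unit) * failover_coverage * unit


if __name__ == "__main__":
    # Illustrative inputs only: a 99% unit with varying failover coverage.
    for coverage in (1.0, 0.9, 0.5, 0.0):
        print(f"coverage {coverage:.1f}: availability {pair_availability(0.99, coverage):.6f}")
```

Under these assumptions, a backup only helps to the extent that the failure is detected and the switchover succeeds; with zero coverage the pair is no better than a single unit.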

**Evan** · 2016-09-12, 11:41

Originally posted by Schwartz View Post

How do you figure you're going to use all those third party open source components that the unicorns use? You're not, because they're out of your control, they change quickly, and they sure don't document their requirements or tests for you.

I didn't mean to make a direct comparison to the nature of entities like Facebook, just the robustness of their networks. I'm not talking about product design. I'm not talking about developing new, untested software. Obviously airlines are not focused on product design the way media is. But can't you build a robust and redundant network from existing, proven, MODERN platforms?

I'm not saying that Delta did a good job of disaster planning... clearly they didn't. Even with old systems, you should be able to prevent such an extended outage due to power failure. However, the longer a system has been in place, it is that much more expensive to replace. Think about it, they have 20 years invested in that software. How are they going to easily extract everything that it does and repeat it with newer stuff? That is thousands of person years of work invested and that will cost a lot of money to replace. I guarantee the cost to replace the software would easily eclipse the business they lost during the outage. I also know replacing these things is risky because it is impossible to run a comprehensive test due to the complexities. The safest approach is to assume failure and make sure you have things in place to mitigate appropriately.

That just seems like a defeatist, bureaucratic answer to me. Of course it will be expensive and no, it won't be easy. It might cost more than the losses suffered by the recent outage but add to that the cost of the next one, and the one after that, and the 'freak' accident that might occur as a result of these disruptions. I am talking about having a future vision instead of a short-term one. And I am talking about solving a complex problem by removing a great deal of the complexity, making it efficient and streamlined and providing a path for redundancy at every point of failure. Why is that unachievable? I think it's more a matter of will. And priorities. Placing customers over investors. Investing now to prevent problems rather than reacting to them as they happen (I realize this goes against modern-day business ethics).

I can say from experience working with hardened labs at IBM and big banks. Every single redundant system I have seen has failed at least once during an outage that the redundancy was designed to prevent. Every single one whether it be hardware or software. One of the top 5 banks I did work for a mere handful of years ago, had a power failure in a big datacenter when the electricians wiring the backup systems made a mistake. So the cooling system went down, and then the hardware all overheated and shut down. The alerting system shut down, so they didn't figure it out until someone physically walked over to the room and opened the door to a blast of hot air.

Yes, there is no infallible system. Human error will always enter into it. But compare this to what currently passes for an acceptable network in the airline industry. As with every regulated thing, I am talking about reducing the odds of failure as much as practicality (not just profitability) allows, not making them entirely failproof. By comparison, it is an absolute miracle the current airline networks aren't failing more often. You can't compare the hardened labs at IBM with a hodgepodge of third-party legacy systems twisted together by mergers and outsourcing with no architectural foundation and countless unprotected points of failure.

If I assigned you to design a clean-slate core network purpose-built for a 21st-century airline operation using modern components, would you tell me that is unachievable within a reasonable budget and a reasonable timeframe? Because I feel pretty confident I could throw that to a bunch of kids at Stanford and get it done.

**elaw** · 2016-09-12, 11:55

Originally posted by Evan View Post

But can't you build a robust and redundant network from existing, proven, MODERN platforms?

I've got just two questions:
1) How young does a system need to be to be called "modern"?

and 2) How old does a system need to be to be called "proven"?

My coworker in IT won't consider anything "modern" that was introduced over a year ago.

Some people would consider a system "proven" if it passes some tests in a lab based on a few hypothetical "most likely failure scenarios". I wouldn't... I want to see it work in the field, for a good length of time (years) in actual use.

I bet Comcast would tell you their DVRs that can't go a day without having to be rebooted are "proven". The batteries in the Samsung Galaxy Note 7 were "proven"... as were the batteries in the Boeing 787...

**TeeVee** · 2016-09-12, 12:44

here's another thing to consider evan, tech companies like fbook are being run by younger people that say f-you to wall street and do what they believe will best achieve the goals they set for their company's operation NOT PROFITS TO SHAREHOLDERS, while airlines kiss the balls of every wall street idiot and do what wall street says is best for wall street (or so they think).

i am now general counsel and a 5% shareholder in a startup tech company that has gross revenues of about $1 million per month, with net profit a tiny fraction thereof. for the past 18 months they have successfully operated on 100mb cable internet service for roughly $90 per month. however, the techies say they "need" fiber optic and are going to switch to $700 per month service. wall street takes a shit when a company like delta makes a similar move, albeit not to the tune of $600 per month.

one of your problems is that you actually believe that airlines are anything but publicly traded companies with only profit in mind.

**Evan** · 2016-09-12, 13:35

Originally posted by TeeVee View Post

one of your problems is that you actually believe that airlines are anything but publicly traded companies with only profit in mind.

No no no Teevee, I explicitly realize this! This is exactly why we need regulations to establish minimum standards. Publicly traded leviathans like Delta are unable to act altruistically or even responsibly in a competitive investor market. The idea that market forces will ensure progress does not apply to these companies. They are milk cows for the investor class who would milk them until they fall dead if not for regulators and regulations. But it is also a pay-to-play economy. If regulators required a network standard in place by, say, 2020, with the alternative being a suspension of the operating license, it would be considered a necessary expense and the funds would appear. Without such requirements, without a universal requirement for all operators, the investor class would consider it foolish to attempt this, and from their point-of-view alone... it would be.

**3WE** · 2016-09-12, 20:16

Russian Anthem by Russian Army

https://www.youtube.com/watch?v=sowpvuK-co8


**Schwartz** · 2016-09-13, 04:14

Originally posted by Evan View Post

I didn't mean to make a direct comparison to the nature of entities like Facebook, just the robustness of their networks. I'm not talking about product design. I'm not talking about developing new, untested software. Obviously airlines are not focused on product design the way media is. But can't you build a robust and redundant network from existing, proven, MODERN platforms?

I'm very curious why you're only focusing on the network given that Delta's failure was a power issue at a datacenter, not a networking failure. For all we know, their network was fine outside of where the servers were. Facebook's past failures were all related to software issues -- which can also cause network issues. Amazon's most famous outage was caused by a backup network being overloaded by their software. What you are missing is that the overall software architecture including the deployment dictates the nature of the networking requirements. Again, because facebook doesn't require instant synchronization between all of their nodes (i.e. your update from your long lost cousin isn't mission critical and can be inaccurate for quite a while), they can distribute their architecture, which leaves them far fewer centralized points of failure. Obviously, Delta's does not. Again, even by trying to limit the discussion to networks (which we don't know if they failed), you are still comparing apples to oranges.

Originally posted by Evan View Post

That just seems like a defeatist, bureaucratic answer to me. Of course it will be expensive and no, it won't be easy. It might cost more than the losses suffered by the recent outage but add to that the cost of the next one, and the one after that, and the 'freak' accident that might occur as a result of these disruptions. I am talking about having a future vision instead of a short-term one. And I am talking about solving a complex problem by removing a great deal of the complexity, making it efficient and streamlined and providing a path for redundancy at every point of failure. Why is that unachievable? I think it's more a matter of will. And priorities. Placing customers over investors. Investing now to prevent problems rather than reacting to them as they happen (I realize this goes against modern-day business ethics).

I am the farthest thing from a bureaucrat and the main reason I read these boards, is because my job involves a lot of incident assessment and risk analysis and the methodology of analysis is similar. I submit that my statement was realistic (vs defeatist) and it was intended to provide professional context into an area for which the commentary seems idealistic. My point was that spending increasing amounts of money to gain smaller and smaller amounts of uptime will not necessarily give you the best ROI. As long as safety is not on the line (which I argue is the case here) one needs to consider spending money on the ability to deal with problems rather than preventing them.
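To illustrate the "smaller and smaller amounts of uptime" point, here is a minimal sketch (not from the thread) that converts availability targets into allowed downtime per year. The function name `downtime_hours_per_year` and the list of targets are assumptions chosen for illustration, and no cost figures are implied.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability: float) -> float:
    """Hours of allowed downtime per year at the given availability (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Illustrative targets only: "two nines" through "five nines".
    previous = None
    for target in (0.99, 0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(target)
        gained = "" if previous is None else f" (gains {previous - hours:.2f} h over the previous tier)"
        print(f"{target:.5f}: {hours:8.2f} h/year{gained}")
        previous = hours
```

Each additional "nine" buys back roughly a tenth of the hours the previous one did, which is the diminishing return being described; what each tier costs is a separate question the sketch does not attempt to answer.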

Originally posted by Evan View Post

Yes, there is no infallible system. Human error will always enter into it. But compare this to what currently passes for an acceptable network in the airline industry. As with every regulated thing, I am talking about reducing the odds of failure as much as practicality (not just profitability) allows, not making them entirely failproof. By comparison, it is an absolute miracle the current airline networks aren't failing more often. You can't compare the hardened labs at IBM with a hodgepodge of third-party legacy systems twisted together by mergers and outsourcing with no architectural foundation and countless unprotected points of failure.

It doesn't require human error to screw things up. Nature and complexity are both more than capable of messing up the works.

The hardened IBM lab had a hardware failure that was a partial failure and the system didn't detect the failure and never failed over resulting in complete outage -- in that case it was a network outage. Almost every large worldwide bank in existence (especially in the US) has the same hodgepodge of third party legacy systems, their systems are not regulated and they can successfully manage very high up times -- though they are prone to occasional failure as well. My point is again that regulation won't make it better, it will slow it down. The banks have been doing it without regulating their systems for decades. It is so easy to sneer at legacy systems (heck, I do it too, when it suits my purpose), but they run almost every critical system you rely on, and maybe you should think about why that is.

Originally posted by Evan View Post

If I assigned you to design a clean-slate core network purpose-built for a 21st-century airline operation using modern components, would you tell me that is unachievable within a reasonable budget and a reasonable timeframe? Because I feel pretty confident I could throw that to a bunch of kids at Stanford and get it done.

Lots of people can build nice shiny new things from scratch but we live in the real world. The challenge is how to build a system to replace all the important aspects of the existing one, while not stopping the existing business, with a budget that includes keeping the existing system running, doesn't blow the expense line of the balance sheet, and making sure the brand new system doesn't barf when you load it up. I hire a lot of kids like the ones you talk about from Stanford, and I can tell you as smart as they are, they don't have the experience to get something like that anywhere near right the first time around. The first challenge is they wouldn't have any idea of all the little things the system was supposed to do.

**3WE** · 2016-09-13, 11:53

Originally posted by Schwartz View Post

...Lots of people can build nice shiny new things from scratch but we live in the real world...

No...

...we are in an unofficial, outsider aviation discussion forum...leans very hard towards a fantasy world.

**TeeVee** · 2016-09-13, 13:03

the other way to look at this is not by how much havoc a failure wreaks but on how much revenue the airline loses when there is a total system outage. it's not just that planes don't take off. they also cannot sell new tickets, which is what they are all about.

lastly, evan's original post was questioning whether the network failure created a SAFETY issue, likely as a result of having to make-up for the missed flights. that question was answered in the negative, by atlcrew, who apparently works for an airline, though i don't believe he/she has ever disclosed what exactly he/she does. i'm guessing flight crew but...

**Evan** · 2016-09-13, 13:35

Originally posted by Schwartz View Post

Lots of people can build nice shiny new things from scratch but we live in the real world. The challenge is how to build a system to replace all the important aspects of the existing one, while not stopping the existing business, with a budget that includes keeping the existing system running, doesn't blow the expense line of the balance sheet, and making sure the brand new system doesn't barf when you load it up. I hire a lot of kids like the ones you talk about from Stanford, and I can tell you as smart as they are, they don't have the experience to get something like that anywhere near right the first time around. The first challenge is they wouldn't have any idea of all the little things the system was supposed to do.

Schwartz, again, I defer to your experience. I'm not an IT guy as you can probably tell. But come on... You just stated the challenge there, so why can't we step up to it? Lack of funds? Are you kidding me? Delta enjoyed historic profits last year. They returned $2.6B USD to shareholders! How much of that $2.6B do we need to get a reliable network into shape over four years of development, testing and debugging? They could fund the entire thing with just one year's profit but of course they would spread it out over many years and it would begin to pay off almost immediately by preventing the next $10M meltdown. I'm talking about a fail-operative core system providing all the things necessary to keep planes dispatched with minimal delays. That doesn't need to cover point-of-sale networks, just confirmed booking records and operations (although, as TeeVee points out, POS is probably more important to them). No, of course you don't get it right the first time; that's what the four years are for.

The first challenge is of course to map out all the little things the system is supposed to do. That is how you begin. From there it's just the challenge you described. All it takes is money, which they have, and will, which they lack.


Is this an aviation safety issue?
