Is this an aviation safety issue?
-
Originally posted by Evan
At first, I just saw this as a logistical nightmare. First Southwest, then Delta experienced debilitating fleetwide cancellations due to ground computer network meltdowns. People are stuck safely on the ground. Not a safety issue.
But what happens next? A great backlog of flights. Enormous pressure to get everything back on track as soon as possible. How does that affect pilot rostering? Turnaround times? MEL issues? How much stress and pressure are flight and ground crews under to get it all back to normal? What is the leading cause of accidents? Stress, pressure, fatigue, shortcuts, bad decisions based on get-there-itis, questionable dispatches, a lack of contingency for safety reasons, back office pressure...?
The way I see it, such events DO heighten the risk of something going wrong. So shouldn't these systems, a relatively new but essential component of aviation, have FAR certification standards as well? Shouldn't these systems be required to conform to "aircraft-grade" reliability and proven redundancy requirements? Both Southwest and Delta have stated that they have back-up systems in place, but in both cases these systems failed to work. Flimsy stuff.
Well, maybe simple economics will take care of it. I can well imagine the airlines not caring to invest in better back-up systems before such an event happens, but now Southwest is looking at "tens of millions" of dollars in damages. Perhaps better back-up systems are looking more appealing now...
Still, it's the 21st century. Everything depends on these networks. Why isn't this a safety issue? Shouldn't we have FARs for this?
The cynical might say public safety deserves money, but at the end of the day, if you make the software slow, antiquated, and super expensive, you are starving the rest of the organization of cash, which will no doubt create different safety stresses. Frankly, starving the company of money is far more likely to induce safety-related stresses than the rare outage and timetable screwups that occur from other worldwide events like weather etc. Keep the high reliability and fault tolerance in the minimal areas where you require it -- plane maintenance, training, operating procedures etc -- and save your money in places where you don't need it.
-
Originally posted by Schwartz
As a multi-decade veteran of building and operating enterprise software (for banks), I just had to weigh in here. First, reliability in anything is expensive, and the cost per added reliability is an exponential curve. In software and computer-related hardware, reliability also comes at the cost of agility and velocity (i.e. the more reliable, the older and more antiquated it will be because you can't change or upgrade it quickly).
The cynical might say public safety deserves money, but at the end of the day, if you make the software slow, antiquated, and super expensive, you are starving the rest of the organization of cash, which will no doubt create different safety stresses. Frankly, starving the company of money is far more likely to induce safety-related stresses than the rare outage and timetable screwups that occur from other worldwide events like weather etc. Keep the high reliability and fault tolerance in the minimal areas where you require it -- plane maintenance, training, operating procedures etc -- and save your money in places where you don't need it.
Certainly there is expense involved. But modern components are vastly cheaper than the ones currently in use by these airline networks. Many of those now in use date back to the 90's! These are fragile, neglected networks we are dealing with that are desperately in need of replacement. The airline industry has the funds to do this. It would have a somewhat deleterious effect on short-term earnings but that money isn't otherwise going to 'plane maintenance, training, operating procedures etc.', it is being used to inflate the bottom line.
We don't have to make a choice between safety in the air and ground network reliability. We can and should have both. And I still contend that the latter can affect the former.
-
Originally posted by Evan
Large media organizations have mastered this (when was the last time Facebook experienced a prolonged delay across their network?).
if you did know anything about networks and enterprise equipment you would realize that it has changed very little in the last decade. take HDD's for example. on the server level, they use the same technology as 16 years ago. RAID has not changed in god only knows how long. TCP/IP is the same, as are probably all other network protocols. sure, speeds have improved, but not much else of grave importance from a hardware perspective.
what is probably limiting airlines far more than hardware are the software systems they use. you do realize of course that they are all custom and proprietary and HUGELY expensive. BA didn't walk into CompUSA or its equivalent and buy a software solution in a box.
anyway, boring convo. suffice to say that reservation system failures are not a safety issue.
-
Originally posted by Evan
Well, in deference to your experience, the multi-decade thing might be working against you. Current technology is far less costly and far more agile than it was even a decade ago. And it's not so much a matter of infrastructural complexity as it is a matter of sound architecture. Where a single component of a system might fail, another can be assigned to take over and any disruption of the entire network will be momentary. Large media organizations have mastered this (when was the last time Facebook experienced a prolonged delay across their network?).
Certainly there is expense involved. But modern components are vastly cheaper than the ones currently in use by these airline networks. Many of those now in use date back to the 90's! These are fragile, neglected networks we are dealing with that are desperately in need of replacement. The airline industry has the funds to do this. It would have a somewhat deleterious effect on short-term earnings but that money isn't otherwise going to 'plane maintenance, training, operating procedures etc.', it is being used to inflate the bottom line.
We don't have to make a choice between safety in the air and ground network reliability. We can and should have both. And I still contend that the latter can affect the former.
I'm not saying that Delta did a good job of disaster planning... clearly they didn't. Even with old systems, you should be able to prevent such an extended outage due to power failure. However, the longer a system has been in place, the more expensive it is to replace. Think about it, they have 20 years invested in that software. How are they going to easily extract everything that it does and repeat it with newer stuff? That is thousands of person-years of work invested and that will cost a lot of money to replace. I guarantee the cost to replace the software would easily eclipse the business they lost during the outage. I also know replacing these things is risky because it is impossible to run a comprehensive test due to the complexities. The safest approach is to assume failure and make sure you have things in place to mitigate appropriately.
I can say this from experience working with hardened labs at IBM and big banks: every single redundant system I have seen has failed at least once during an outage that the redundancy was designed to prevent. Every single one, whether it be hardware or software. One of the top 5 banks I did work for a mere handful of years ago had a power failure in a big datacenter when the electricians wiring the backup systems made a mistake. So the cooling system went down, and then the hardware all overheated and shut down. The alerting system shut down too, so they didn't figure it out until someone physically walked over to the room and opened the door to a blast of hot air.
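To make the alerting failure above concrete: an alarm that has to be actively sent dies along with the system sending it. One common mitigation is a "dead man's switch", where an independent watcher raises the alarm when heartbeats stop arriving. The sketch below is only an illustration under stated assumptions; the class name, the timeout, and the injectable clock are all hypothetical and do not describe what the bank or any airline actually runs.

```python
import time


class DeadMansSwitch:
    """Flags trouble when heartbeats STOP, instead of waiting for an alert to arrive.

    Hypothetical illustration: the monitored system calls beat() regularly,
    and a watcher on separate power/hardware polls is_silent().
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s      # how long silence is tolerated
        self.clock = clock              # injectable so the logic can be tested
        self.last_beat = clock()        # treat construction as the first heartbeat

    def beat(self):
        """Record that the monitored system is still alive."""
        self.last_beat = self.clock()

    def is_silent(self):
        """True when no heartbeat has arrived within the timeout."""
        return self.clock() - self.last_beat > self.timeout_s
```

The design choice is that silence itself is the signal, so the watcher still notices if the monitored room loses power, cooling, and its own alerting all at once. It only helps if the watcher does not share the failure domain it is watching.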
-
Originally posted by Schwartz
How do you figure you're going to use all those third party open source components that the unicorns use? You're not, because they're out of your control, they change quickly, and they sure don't document their requirements or tests for you.
I'm not saying that Delta did a good job of disaster planning... clearly they didn't. Even with old systems, you should be able to prevent such an extended outage due to power failure. However, the longer a system has been in place, the more expensive it is to replace. Think about it, they have 20 years invested in that software. How are they going to easily extract everything that it does and repeat it with newer stuff? That is thousands of person-years of work invested and that will cost a lot of money to replace. I guarantee the cost to replace the software would easily eclipse the business they lost during the outage. I also know replacing these things is risky because it is impossible to run a comprehensive test due to the complexities. The safest approach is to assume failure and make sure you have things in place to mitigate appropriately.
I can say this from experience working with hardened labs at IBM and big banks: every single redundant system I have seen has failed at least once during an outage that the redundancy was designed to prevent. Every single one, whether it be hardware or software. One of the top 5 banks I did work for a mere handful of years ago had a power failure in a big datacenter when the electricians wiring the backup systems made a mistake. So the cooling system went down, and then the hardware all overheated and shut down. The alerting system shut down too, so they didn't figure it out until someone physically walked over to the room and opened the door to a blast of hot air.
If I assigned you to design a clean-slate core network purpose-built for a 21st-century airline operation using modern components, would you tell me that is unachievable within a reasonable budget and a reasonable timeframe? Because I feel pretty confident I could throw that to a bunch of kids at Stanford and get it done.
-
Originally posted by Evan
But can't you build a robust and redundant network from existing, proven, MODERN platforms?
1) How young does a system need to be to be called "modern"?
and 2) How old does a system need to be to be called "proven"?
My coworker in IT won't consider anything "modern" that was introduced over a year ago.
Some people would consider a system "proven" if it passes some tests in a lab based on a few hypothetical "most likely failure scenarios". I wouldn't... I want to see it work in the field, for a good length of time (years) in actual use.
I bet Comcast would tell you their DVRs that can't go a day without having to be rebooted are "proven". The batteries in the Samsung Galaxy Note 7 were "proven"... as were the batteries in the Boeing 787...
Be alert! America needs more lerts.
Eric Law
-
here's another thing to consider evan, tech companies like fbook are being run by younger people that say f-you to wall street and do what they believe will best achieve the goals they set for their company's operation NOT PROFITS TO SHAREHOLDERS, while airlines kiss the balls of every wall street idiot and do what wall street says is best for wall street (or so they think).
i am now general counsel and a 5% shareholder in a startup tech company that has gross revenues of about $1 million per month, with net profit a tiny fraction thereof. for the past 18 months they have successfully operated on 100mb cable internet service for roughly $90 per month. however, the techies say they "need" fiber optic and are going to switch to $700 per month service. wall street takes a shit when a company like delta makes a similar move, albeit not to the tune of $600 per month.
one of your problems is that you actually believe that airlines are anything but publicly traded companies with only profit in mind.
-
Originally posted by TeeVee
one of your problems is that you actually believe that airlines are anything but publicly traded companies with only profit in mind.
-
Originally posted by Evan
I didn't mean to make a direct comparison to the nature of entities like Facebook, just the robustness of their networks. I'm not talking about product design. I'm not talking about developing new, untested software. Obviously airlines are not focused on product design the way media is. But can't you build a robust and redundant network from existing, proven, MODERN platforms?
Originally posted by Evan
That just seems like a defeatist, bureaucratic answer to me. Of course it will be expensive and no, it won't be easy. It might cost more than the losses suffered by the recent outage, but add to that the cost of the next one, and the one after that, and the 'freak' accident that might occur as a result of these disruptions. I am talking about having a future vision instead of a short-term one. And I am talking about solving a complex problem by removing a great deal of the complexity, making it efficient and streamlined and providing a path for redundancy at every point of failure. Why is that unachievable? I think it's more a matter of will. And priorities. Placing customers over investors. Investing now to prevent problems rather than reacting to them as they happen (I realize this goes against modern-day business ethics).
Originally posted by Evan
Yes, there is no infallible system. Human error will always enter into it. But compare this to what currently passes for an acceptable network in the airline industry. As with every regulated thing, I am talking about reducing the odds of failure as much as practicality (not just profitability) allows, not making them entirely failproof. By comparison, it is an absolute miracle the current airline networks aren't failing more often. You can't compare the hardened labs at IBM with a hodgepodge of third-party legacy systems twisted together by mergers and outsourcing with no architectural foundation and countless unprotected points of failure.
The hardened IBM lab had a partial hardware failure that the system didn't detect, so it never failed over, resulting in a complete outage -- in that case a network outage. Almost every large worldwide bank in existence (especially in the US) has the same hodgepodge of third-party legacy systems. Their systems are not regulated, and they still manage very high uptimes, though they are prone to occasional failure as well. My point is again that regulation won't make it better, it will slow it down. The banks have been doing it without regulating their systems for decades. It is so easy to sneer at legacy systems (heck, I do it too, when it suits my purpose), but they run almost every critical system you rely on, and maybe you should think about why that is.
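The undetected partial failure described above is why a standby is worth less than its spec sheet suggests. A back-of-the-envelope sketch, assuming one primary and one standby with identical, independent availability, and a "coverage" figure for how often a failure is actually detected and the switchover works. The function names and every number plugged into them are illustrative assumptions, not figures from any airline, bank, or IBM.

```python
HOURS_PER_YEAR = 8760  # 365 days, ignoring leap years


def availability_with_failover(unit_availability, coverage):
    """Steady-state availability of a primary plus one standby.

    unit_availability: fraction of time each unit is up on its own (0..1),
                       assumed identical and independent for both units.
    coverage:          probability that a primary failure is detected AND
                       the switchover to the standby succeeds (0..1).
    The pair is up if the primary is up, or if the primary is down, the
    failover works, and the standby is up.
    """
    a = unit_availability
    return a + (1 - a) * coverage * a


def downtime_hours_per_year(availability):
    """Expected hours of outage per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR
```

With 99% units, perfect coverage gives 99.99% (under an hour of downtime a year), while 90% coverage gives about 99.89% (roughly 9.5 hours a year). In this simplified model the detection and switchover path, not the spare hardware, ends up dominating the result.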
Originally posted by Evan
If I assigned you to design a clean-slate core network purpose-built for a 21st-century airline operation using modern components, would you tell me that is unachievable within a reasonable budget and a reasonable timeframe? Because I feel pretty confident I could throw that to a bunch of kids at Stanford and get it done.
-
Originally posted by Schwartz
...Lots of people can build nice shiny new things from scratch but we live in the real world...
...we are in an unofficial, outsider aviation discussion forum...leans very hard towards a fantasy world.
Basic aviation rules discourage long periods of hard pulling up.
-
the other way to look at this is not by how much havoc a failure wreaks but by how much revenue the airline loses when there is a total system outage. it's not just that planes don't take off. they also cannot sell new tickets, which is what they are all about.
lastly, evan's original post was questioning whether the network failure created a SAFETY issue, likely as a result of having to make up for the missed flights. that question was answered in the negative by atlcrew, who apparently works for an airline, though i don't believe he/she has ever disclosed what exactly he/she does. i'm guessing flight crew but...
-
Originally posted by Schwartz
Lots of people can build nice shiny new things from scratch but we live in the real world. The challenge is how to build a system to replace all the important aspects of the existing one, while not stopping the existing business, with a budget that includes keeping the existing system running, doesn't blow the expense line of the balance sheet, and making sure the brand new system doesn't barf when you load it up. I hire a lot of kids like the ones you talk about from Stanford, and I can tell you as smart as they are, they don't have the experience to get something like that anywhere near right the first time around. The first challenge is they wouldn't have any idea of all the little things the system was supposed to do.
The first challenge is of course to map out all the little things the system is supposed to do. That is how you begin. From there it's just the challenge you described. All it takes is money, which they have, and will, which they lack.