Delta’s computer outage suggests need for testing

Aug 14, 2016

Enhanced testing regimes could have helped prevent the types of system outages that caused both Delta and Southwest to cancel thousands of flights during separate incidents in recent weeks, experts say.

But the same experts also warn that the growing complexity and size of the computer networks that airlines must now maintain make it likely that large system meltdowns will continue to happen.

“Just the law of large numbers tells you that you are going to have more unfortunate outcomes, even if the rate of bad outcomes is vanishingly low,” said aviation analyst Robert Mann of RW Mann and Co.

‘We’re surprised these systems don’t go down more’

Computer systems risk-management expert Robert Charette spoke about the Delta computer outage and whether it or a recent Southwest computer outage could have been avoided.

The failure of a power control module at Delta’s Technology Command Center in Atlanta during the wee hours of Monday, Aug. 8, triggered a six-hour, systemwide computer outage that forced the carrier to halt departures worldwide.

Though Delta had its computer system mostly repaired by the end of that day, it took the carrier three days to get displaced flight crews situated so that operations were back to normal. By then, Delta had canceled more than 2,100 flights and had delayed thousands more. Tens of thousands of passengers were stranded around the world.

The disruptions tainted Delta’s well-earned reputation as the most reliable of the major airlines. Its cancellations last Monday and Tuesday alone amounted to nearly five times all its previous cancellations in 2016, CEO Ed Bastian told the Associated Press.

In explaining the system failure, Bastian said that when the control module (called a switchgear) failed, it caused Delta to lose the transformer that was supplying power to its data center.

Though Delta has backup systems, technicians discovered that some servers had not been protected against power outages. As a result, the entire network went down.

Robert Johnson, an executive vice president at Irvine, Calif.-based Vision Solutions, a provider of disaster recovery software, said Delta’s explanation suggested it was not doing adequate testing.

“Even though you invest in the right pieces, you can still face challenges if you don’t test regularly,” he said.

Johnson said that routers are inevitably going to break and servers are going to fail, and “You want people in your organization to know what to do when that happens.”

Computer systems risk management expert Robert Charette, the founder of the Virginia-based consultancy ITABHI Corp., agreed that a lack of testing was likely a problem for Delta. But he added that airlines might be loath to run a full system test, since if the system fails during testing the consequences could be disastrous.

“You have a half million passengers a day,” Charette said in an interview with Travel Weekly (see Q&A above). “How willing are you, in the dead of night, when you only have a couple hundred planes flying, to shut down your system?”

Insufficient testing was likely also a culprit in the 12-hour Southwest shutdown on July 20, which the airline attributed to a router failure, followed by the router’s backup failing to kick in, Charette said.

As with the Delta system failure, Southwest’s process of catching up with flight operations lasted several days beyond the original technical problem. By the time the airline’s schedule was back to normal on July 24, it had canceled 2,300 flights. An analysis by the Dallas Morning News estimated that the outage cost Southwest between $54 million and $82 million.

Southwest blamed the failure on older portions of its system, saying that new components responded to the router failure properly. The carrier is planning to replace much of its remaining legacy system in the next three to five years, CEO Gary Kelly told analysts at an earnings call that coincided with the failure. It will also roll out a new reservation system next year.

Delta, meanwhile, is spending more than $150 million on technology infrastructure just this year, Bastian told the AP.

Still, Johnson and Charette said airlines, like other companies, have to make difficult decisions about what they spend on technical replacement, monitoring and upkeep.

“If you have a finite budget, you are going to have to make some choices about where you are going to upgrade and where you are going to accept the risk,” Charette said.

Analysts said that even with robust investment, large system outages like those Delta and Southwest have recently experienced are likely to happen from time to time.

Airline operating systems are complex. They must support expanding global operational networks while constantly updating flight and staffing schedules and booking tickets electronically, for example. In addition, with each passing year an airline’s data center is communicating with more mobile devices around the world, Mann said.

“The complexity is growing exponentially as far as the number of the devices, so the possibility of bad outcomes due to a single event becomes larger over time,” he said.
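Mann’s point about large numbers can be illustrated with a short calculation. The sketch below assumes, purely for illustration, that every component or device fails independently and with the same small probability in a given period. The function names and the figures are hypothetical and are not drawn from either airline’s systems.

```python
def prob_at_least_one_failure(n_components: int, p_fail: float) -> float:
    """Chance that at least one of n independent components fails.

    Assumes each component fails with the same probability p_fail,
    which is a simplification made for illustration only.
    """
    return 1.0 - (1.0 - p_fail) ** n_components


def expected_failures(n_components: int, p_fail: float) -> float:
    """Average number of failures expected across n components."""
    return n_components * p_fail


if __name__ == "__main__":
    p = 0.0001  # hypothetical per-component failure rate, not a real figure
    for n in (1_000, 10_000, 100_000):
        print(
            f"{n:>7} components: "
            f"expected failures = {expected_failures(n, p):.2f}, "
            f"chance of at least one = {prob_at_least_one_failure(n, p):.1%}"
        )
```

Under these assumptions, holding the failure rate fixed while the number of components grows pushes the chance of at least one failure steadily toward certainty, which is the effect Mann describes.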
___

Correction: Delta canceled 2,100 flights, not 3,100.