Enhanced testing regimes could have helped prevent the types
of system outages that caused both Delta and Southwest to cancel thousands of
flights during separate incidents in recent weeks, experts say.
But the same experts also warn that the growing complexity
and size of the computer networks that airlines must now maintain make it
likely that large system meltdowns will continue to happen.
“Just the law of large numbers tells you that you are going
to have more unfortunate outcomes, even if the rate of bad outcomes is
vanishingly low,” said aviation analyst Robert Mann of RW Mann and Co.
‘We’re surprised these systems don’t go down more’
Risk-management expert Robert Charette spoke about the Delta computer outage and whether it or a recent Southwest
computer outage could have been avoided.
The failure of a power control module at Delta’s Technology
Command Center in Atlanta during the wee hours of Monday, Aug. 8, triggered a
six-hour, systemwide computer outage that forced the carrier to halt departures.
Though Delta had its computer system mostly repaired by the
end of that day, it took the carrier three days to get displaced flight crews
situated so that operations were back to normal. By then, Delta had canceled
more than 2,100 flights and had delayed thousands more. Tens of thousands of
passengers were stranded around the world.
The disruptions tainted Delta’s well-earned reputation as
the most reliable of the major airlines. Its cancellations last Monday and
Tuesday alone amounted to nearly five times all its previous cancellations in
2016, CEO Ed Bastian told the Associated Press.
In explaining the system failure, Bastian said that when the
control module (called a switchgear) failed, it caused Delta to lose the
transformer that was supplying power to its data center.
Though Delta has backup systems, technicians discovered that
some servers had not been protected against power outages. As a result, the
entire network went down.
Robert Johnson, an executive vice president at Irvine,
Calif.-based Vision Solutions, a provider of disaster recovery software, said
Delta’s explanation suggested it was not doing adequate testing.
“Even though you invest in the right pieces, you can still
face challenges if you don’t test regularly,” he said.
Johnson said that routers are inevitably going to break and
servers are going to fail, and “you want people in your organization to know
what to do when that happens.”
Computer systems risk management expert Robert Charette, the
founder of the Virginia-based consultancy ITABHI Corp., agreed that a lack of
testing was likely a problem for Delta. But he added that airlines might be
loath to run a full system test, since if the system fails during testing the
consequences could be disastrous.
“You have a half million passengers a day,” Charette said in an interview with Travel Weekly (see
Q&A above). “How willing are you, in the dead of night, when you only have
a couple hundred planes flying, to shut down your system?”
Insufficient testing was likely also a culprit in the
12-hour Southwest shutdown on July 20, which the airline attributed to a router
failure, followed by the router’s backup failing to kick in, Charette said.
As with the Delta system failure, Southwest’s process of
catching up with flight operations lasted several days beyond the original
technical problem. By the time the airline’s schedule was back to normal on
July 24, it had canceled 2,300 flights. An analysis by the Dallas Morning News
estimated that the outage cost Southwest between $54 million and $82 million.
Southwest blamed the failure on older portions of its
system, saying that new components responded to the router failure properly.
The carrier is planning to replace much of its remaining legacy system in the
next three to five years, CEO Gary Kelly told analysts at an earnings call that
coincided with the failure. It will also roll out a new reservation system next
Delta, meanwhile, is spending more than $150 million on
technology infrastructure just this year, Bastian told the AP.
Still, Johnson and Charette said airlines, like other
companies, have to make difficult decisions about what they spend on technical
replacement, monitoring and upkeep.
“If you have a finite budget, you are going to have to make
some choices about where you are going to upgrade and where you are going to
accept the risk,” Charette said.
Analysts said that even with robust investment, large system
outages like those Delta and Southwest have recently experienced are likely to
happen from time to time.
Airline operating systems are complex. They deal with
expanding global operational networks and are constantly updating flight and
staffing schedules and booking tickets electronically, for example. In
addition, with each passing year an airline’s data center is communicating with
more mobile devices around the world, Mann said.
“The complexity is growing exponentially as far as the
number of the devices, so the possibility of bad outcomes due to a single event
becomes larger over time,” he said.
Correction: Delta canceled 2,100 flights, not 3,100.