Dive Brief:
- American Airlines had the resources in place to recover rapidly from IT outages caused by a faulty CrowdStrike update last Friday, company executives said Thursday morning during the company’s Q2 2024 earnings call.
- “Just like other airlines and businesses worldwide, many of our operating systems were taken offline,” COO David Seymour said. “But within an hour of the outage, we assembled the right operating teams and IT experts to develop and execute a plan to get our systems back online and the aircraft moving again.”
- American was one of the fastest major domestic carriers impacted by the IT crisis to recover normal operations after the defective Falcon sensor update went live shortly after midnight on July 19. The company had to cancel more than 400 flights in the first 24 hours but grounded only 50 flights the following day, according to flight tracker FlightAware.
Dive Insight:
Response times to the software-induced global disruption varied across the airline industry. While United Airlines manually rebooted more than 26,000 computers at 365 locations globally over the busy summer travel weekend, Delta Air Lines struggled to regain operational footing through Tuesday, when it canceled more than 500 flights.
The CrowdStrike bug affected certain Windows-based systems, leaving companies running Linux and Mac operating systems largely unscathed. For airlines taken down by the bug, the challenge was to rapidly deploy IT teams to fix systems at hundreds of airports.
“What seems to hamper recovery the most is that this is for the most part a manual, people-driven recovery process,” Gartner Senior Director Analyst Jon Amato told CIO Dive in an email. “Someone, typically an IT support person or at least an end user working under the direct guidance of one, has to physically access every single affected computer to perform the recovery process.”
While airlines rely on multiple customer-facing endpoints to get passengers onto planes and to their destinations, crew-tracking systems were the recovery key for American.
“One of the things that we've learned is that in terms of any disruption, you better keep track of your aircraft, certainly also your crews, wherever they are, and you probably ought to take action as quickly as possible to make sure that you don't lose visibility for the purpose of recovery,” American Airlines CEO Robert Isom said Thursday.
“We've built technology and we've done the right things to ensure that we take early precautions, early steps, and that ultimately results in a better outcome,” he said. “We also benefited by making sure that we have devices and means of communicating with our team members out in the field.”
Chart: American Airlines rapidly recovered operations after the CrowdStrike disruption
Delta’s struggles highlighted the perils of crew-reassignment software failures. The company’s CEO, Ed Bastian, acknowledged in a Sunday customer update that a Delta crew-tracking tool was overwhelmed by the volume of changes triggered by the system shutdown.
American’s prior experience overcoming weather-related disruptions prepared the airline for this IT crisis, Isom said Thursday. Isom prioritized operations technology investments last year, after a December 2022 storm grounded Southwest Airlines’ fleet for over a week during the annual holiday travel blitz. A glitch in Southwest’s crew-reassignment system played a major role in the billion-dollar fiasco.
Delta’s final bill has yet to be tallied. But the U.S. Department of Transportation opened an investigation into the company’s response to the crisis, according to a Tuesday statement by Secretary of Transportation Pete Buttigieg.
From a technical perspective, CIOs will learn more about protecting systems from defective vendor updates in the coming weeks and months.
“More and more factors will come to light as the results of post-mortems are made public,” John Annand, Practice Lead at Info-Tech Research Group, said in an email.
“There is probably a correlation with the specific build of Windows and the affected machines but there is no doubt that automation played a huge role here,” Annand said. “Explainability is what’s going to allow us to learn lessons and perhaps return to some form of testing before applying patches.”