Dive Brief:
- The median annual downtime from high-impact IT outages is 77 hours, with an hourly cost of up to $1.9 million, according to a report New Relic published Tuesday. The observability company commissioned ETR to survey 1,700 technology professionals in April and May.
- IT teams spend an average of 30% of their time addressing interruptions, respondents said, the equivalent of 12 hours per 40-hour work week. The leading causes of unplanned outages reported over the last two years were network failure, third-party service issues and human error.
- Major outages like the global event triggered by a flawed CrowdStrike Windows systems update in July can bring operations to a standstill, according to Nic Benders, chief technical strategist at New Relic. But minor issues can snowball as well. “It doesn't have to be CrowdStrike in order for it to be a three-alarm fire,” he told CIO Dive. “You can knock out the business function of IT with a relatively small technical issue.”
Dive Insight:
All it took was an automated software update dispatched shortly after midnight EDT on July 19 to bring millions of Windows-based computers crashing down across the globe. The CrowdStrike update was live for just over an hour, but the impacts were felt for days, as several major airlines grounded thousands of flights while scrambling to reboot workstations and restore operations.
“The CrowdStrike incident is in its own class because it disproportionately impacted some of the largest companies in the world — it was a poison pill those companies had to remediate themselves,” Benders said.
As executives surveyed the losses, which mounted to $5.4 billion among Fortune 500 companies and cost Delta Air Lines $500 million in just five days, IT resilience and recovery planning took center stage.
“When something like a cloud provider outage hits, it’s rare that the problem is clear cut initially,” Benders said. “Your alarms are going off, support tickets are lighting up and you're in chaos, but in that first step you’re just trying to characterize the nature of the issue.”
While major vendor outages and cyber events tend to steal the headlines, death-by-a-thousand-cuts scenarios involving smaller interruptions are far more common. The median number of annual outages among respondents was 232, with more than half of companies experiencing low-impact disruptions on a weekly basis.
Costs can be hard to gauge, particularly for low-impact issues. But the minutes or hours it takes engineering teams to identify and defuse even minor IT outages add up. Over the course of a year, teams spend roughly 134 hours, the equivalent of nearly six full days, addressing IT outages across all business impact levels.
“It all comes down to dollars,” Benders said. “I would take 1,000 incidents a week if they had zero cost. That's not an incident at all.”