Dive Brief:
- At the end of March, a Microsoft Azure incident hit 6,136 companies across Europe, disrupting service for up to nine hours, as the cloud provider grapples with a significant surge in use driven by the coronavirus pandemic.
- A primary incident manager (or PIM, in Microsoft parlance) was not paged as the pipeline delay unfolded, said Chad Kimes, Azure Director of Engineering, in a blog post analyzing the aftermath of the incident. The manager was asleep even as a designated responsible individual (DRI) worked to assess the situation and was "looking for potential mitigations."
- Previously, PIMs were only paged when there were customer request failures or performance impacts. As part of the measures taken by Microsoft to address the issue, the internal protocol for pipeline delays will now be updated to ensure initial communication "happens on the same schedule as other incident types."
Dive Insight:
The pressure is on for cloud providers, the backbone of the internet, to support a global shift to digital that spans school, work and most of everyday life.
The influx of people online has put a strain on online learning, collaboration and travel companies, the three industry sectors seeing the biggest increases in digital service incidents during the pandemic, according to an analysis from PagerDuty.
For Microsoft, the response to Azure's recent hiccup extends beyond waking up the on-duty PIM. The company is speeding up architectural changes to its hosted agent pools, which will help reduce the potential for future issues of this type, Kimes said.
Microsoft is not alone in suffering service disruptions. Google Cloud suffered an hour-long outage in March caused by "significant router failure" in one of its data centers in Atlanta, which led to network congestion. Urs Hölzle, SVP of technical infrastructure at Google, took to Twitter to clarify the outage was unrelated to the pandemic.
But the strain of shifting most of life online — and, in many cases, until further notice — is beginning to take its toll on connectivity. Network speeds in most U.S. cities have dwindled, with median download speeds sitting below their 10-week average.