Cloud outages raise question on how to architect for resiliency

UPDATE: Dec. 22, 2021: AWS suffered a third consecutive week of service disruptions after console application upload errors in AWS Elastic Beanstalk began Tuesday and continued into Wednesday, according to an update on its Service Health Dashboard. The company also had issues with Amazon Elastic Compute Cloud (EC2) and AWS Single Sign-on.

Content security policy errors triggered issues with Elastic Beanstalk, which organizations use to deploy and scale web applications and services, AWS said. Until the issue was resolved Wednesday morning, customers experienced errors if they tried to upload a new version of an application to an existing environment or create a new environment in multiple regions.

EC2 had launch failures and networking connectivity issues in one availability zone within its US-EAST-1 region throughout the day Wednesday. Recovery of the remaining instances and volumes took longer than expected, Amazon said. The company believes the issue arose from the way in which the data center lost power, which led to hardware failures.

Single Sign-on also had issues in the US-EAST-1 region, raising error rates for Directory Services AD Connector and Managed AD, the company said.

AWS did not respond to a request for comment by publishing time.

Dive Brief:

Network congestion between AWS and a subset of internet service providers triggered connectivity issues for the infrastructure services provider Wednesday morning. The outage, though resolved within an hour, interrupted operations at companies including SiriusXM and DoorDash, Downdetector shows.
The disruption was caused by AWS traffic engineering that "incorrectly moved more traffic than expected to parts of the AWS backbone that affected connectivity to a subset of internet destinations," AWS said on its service health dashboard.
The outage is the second for AWS in December, following last week's network surge that affected the US-EAST-1 region in Northern Virginia. This week's issue was unrelated to the previous outage.

Dive Insight:

Despite the string of recent outages, the top cloud provider promises 99.99% uptime in its Amazon Compute service level agreement across top services, including Amazon EC2. If the company fails to meet that service level, it offers customers a service credit percentage.

If monthly uptime falls below 95%, AWS will credit customers 100% of service costs.
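To put those percentages in perspective, the arithmetic can be sketched in a few lines of Python. This is an illustration only: it assumes a 30-day billing month, the function names are hypothetical, and the 60-minute outage is an invented example rather than a measured figure from any AWS incident.

```python
# Rough downtime budget implied by an uptime percentage.
# Assumption: a 30-day month (43,200 minutes); real SLAs define the period themselves.
MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60


def allowed_downtime_minutes(uptime_percent, total_minutes=MINUTES_PER_30_DAY_MONTH):
    """Minutes of downtime that still satisfy the given uptime percentage."""
    return total_minutes * (100.0 - uptime_percent) / 100.0


def monthly_uptime_percent(downtime_minutes, total_minutes=MINUTES_PER_30_DAY_MONTH):
    """Uptime percentage for a month with the given minutes of downtime."""
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


for target in (99.999, 99.99, 95.0):
    budget = allowed_downtime_minutes(target)
    print(f"{target}% uptime allows about {budget:.2f} minutes of downtime per month")

# Hypothetical 60-minute outage, used only to show the calculation.
print(f"A 60-minute outage leaves {monthly_uptime_percent(60):.3f}% monthly uptime")
```

Under these assumptions, a 99.99% commitment leaves roughly four minutes of downtime a month, while the 95% threshold for a full credit corresponds to about 36 hours.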

December's outages aren't the first in the cloud provider ecosystem, nor will they be the last. Microsoft experienced a global outage of its virtual machine service on Azure in October, and Google Cloud had an outage last month that disrupted popular websites including Spotify and Snapchat.

Following last week's incident, AWS provided a thorough post-mortem that detailed why it suffered a network surge and illustrated how the issue caused cascading effects.

Modern technology is heavily integrated and one change or flaw can cause systemwide failures. While providers seek 99.999% uptime, it's not always possible.

For businesses, the main concern is what to do in the event a service provider goes down. Wednesday's AWS outage was a brief hiccup compared to last week's hours-long event, yet it still disrupted services at some organizations.

Though some organizations look toward redundancy to increase resiliency, it's not always immediately possible. "Right now, it is pretty hard to shift services if a cloud provider goes down," though there are some options to move between a provider's availability zones and regions, said Brent Ellis, senior analyst with Forrester, in an email.
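As a rough illustration of what moving between a provider's regions can look like at the application level, the Python sketch below picks the first healthy endpoint from a priority list. It is a simplified assumption, not a description of how AWS or any customer implements failover: the endpoint URLs, the function names and the stand-in health check are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_healthy_endpoint(
    endpoints: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first endpoint, in priority order, that passes the health check."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated the same as an unhealthy endpoint.
            continue
    return None  # Nothing healthy: the caller has to degrade gracefully.


# Hypothetical endpoints for the same application deployed in two regions.
ENDPOINTS = [
    "https://app.us-east-1.example.com",
    "https://app.us-west-2.example.com",
]


def simulated_health_check(endpoint: str) -> bool:
    """Stand-in check that pretends the us-east-1 deployment is down."""
    return "us-east-1" not in endpoint


print(pick_healthy_endpoint(ENDPOINTS, simulated_health_check))
```

A real setup would replace the stand-in check with actual health probes and would also need data replicated to the secondary region, which is part of why shifting services is hard to do on short notice.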

Long-term, a shift toward multicloud resiliency would have organizations architect workloads around a service such as VMware Cloud or Kubernetes, which allows an organization to shift assets between providers, according to Ellis.

More broadly, companies are working to build technology resilience into their overall production system architecture, he said. It's a move away from relying only on reactive measures, like backups and system restoration, to get back into production.