While AWS works out how to improve its cloud architecture following last week's outage, customers of the cloud service provider must consider which investments will protect their businesses from future disruptions.
A small addition of capacity to Amazon Kinesis, the platform responsible for real-time processing of streaming data, caused an outage across the AWS US-EAST-1 region on Wednesday, according to an announcement from the company. It left businesses offline and consumers unable to access some services.
The added capacity caused all servers in the Kinesis front-end fleet to exceed the maximum number of threads allowed by an operating system configuration. The error exposed internal AWS interdependencies and widespread reliance on the cloud platform for business continuity.
As AWS bounces back from the situation, the company is moving to larger servers to reduce the total number of servers in the fleet, moving some large services to a separate fleet and implementing cellularization, which partitions a fleet into isolated cells, to stop future failures from spreading, according to the announcement.
But for businesses, the outage means reconsidering investments in backup cloud and data storage products, according to analysts and cloud experts.
"Interdependency is a feature, not a bug, of modern IT and cloud computing," said Melanie Posey, research director for the Cloud & Managed Services Transformation at 451 Research. "Dealing with that interdependency is something that needs to be baked into the cake in terms of how you design your IT architecture, how you think about disaster recovery and business continuity."
Companies concerned with business continuity in the wake of an outage could host workloads in multiple clouds or regions. Spreading data across multiple platforms makes it far more likely that the data stays accessible if one cloud experiences downtime.
If a business "can standardize its applications such that they can deploy services and data across multiple cloud providers (i.e., cloud interoperability), the failure of one cloud provider would not lead to an impact to their service," said Joy Sim, a manager at West Monroe, in a statement to CIO Dive.
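The idea of deploying the same service across providers or regions can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any provider's API: the `fetch_with_failover` helper, the replica labels and the two stand-in reader functions are all hypothetical, and a real deployment would also have to keep the copies of the data in sync.

```python
from typing import Callable, Dict


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def fetch_with_failover(key: str, replicas: Dict[str, Callable[[str], str]]) -> str:
    """Try each replica in insertion order and return the first successful read."""
    errors = {}
    for name, read in replicas.items():
        try:
            return read(key)
        except Exception as exc:  # any failure moves us on to the next replica
            errors[name] = exc
    raise AllReplicasFailed(f"no replica could serve {key!r}: {errors}")


# Hypothetical stand-ins for reads against two independent clouds or regions.
def primary_down(key: str) -> str:
    raise ConnectionError("primary region unavailable")


def secondary_ok(key: str) -> str:
    return f"value-for-{key}"


replicas = {"primary": primary_down, "secondary": secondary_ok}
result = fetch_with_failover("order-42", replicas)
print(result)
```

Here the read succeeds even though the primary is down, because the secondary holds its own copy. The sketch also shows where the cost comes from: every replica in the dictionary is infrastructure that has to be paid for and maintained.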
Fully duplicating data from one region to another lets businesses retain access even amid an outage, but the cost may be out of reach and the approach impractical.
"The downside of it is cost because essentially what you're doing is replicating large parts of your infrastructure and your workloads in two different places," Posey said. "But one way to look at that added expense is that it's insurance."
This insurance provides coverage when it is needed, but the costly investment of duplicating cloud data may not be worth it to many companies. An outage such as the one in AWS US-EAST-1 causes inconvenience, but the disruption businesses suffered because of their reliance on interdependent services may not do serious damage to the bottom line.
"You can create multi-region, you can do multi-cloud … but a lot of those multis will cost you multi-millions as well," said Drew Firment, SVP, Cloud Transformation at A Cloud Guru. "Mileage varies in terms of really focusing on what is the impact of these outages — and realistically they do come few and far between."
Firment compared a public cloud outage to an electric outage: Is it worth running three different electric companies' power lines to one home just in case an outage occurs for a few hours every couple of years? Maybe not for a residential building, but for a hospital it would be.
Weighing the return on investment of precautionary measures case by case can help organizations understand how much of a risk modern interdependency poses to their business. For the AWS outage, Firment saw little reason to sow distrust in the public cloud provider's offerings.
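One rough way to frame that return-on-investment question is to compare the expected yearly loss from outages with the yearly cost of redundancy. The Python sketch below does that arithmetic. Every figure in it is a hypothetical placeholder chosen to mirror the home-versus-hospital comparison, not data from the article or from AWS, and the function names are illustrative.

```python
def expected_annual_outage_loss(outages_per_year: float,
                                hours_per_outage: float,
                                loss_per_hour: float) -> float:
    """Expected yearly loss = outage frequency x outage duration x hourly loss."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_pays_off(expected_loss: float, annual_redundancy_cost: float) -> bool:
    """Redundancy is worth it, in this simple model, if it costs less than the expected loss."""
    return expected_loss > annual_redundancy_cost


# Hypothetical inputs: one four-hour outage every two years.
OUTAGES_PER_YEAR = 0.5
HOURS_PER_OUTAGE = 4
ANNUAL_REDUNDANCY_COST = 1_000_000  # assumed cost of a second region or cloud

# Assumed hourly losses for a low-criticality and a high-criticality workload.
low_criticality = expected_annual_outage_loss(OUTAGES_PER_YEAR, HOURS_PER_OUTAGE, 10_000)
high_criticality = expected_annual_outage_loss(OUTAGES_PER_YEAR, HOURS_PER_OUTAGE, 2_000_000)

print(low_criticality, redundancy_pays_off(low_criticality, ANNUAL_REDUNDANCY_COST))
print(high_criticality, redundancy_pays_off(high_criticality, ANNUAL_REDUNDANCY_COST))
```

With these assumed inputs, duplication costs far more than the low-criticality workload stands to lose, while the high-criticality workload justifies it. The model ignores harder-to-price factors such as reputational damage or safety, which may dominate for some organizations.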
"Reading through the root cause analysis … there is nothing in there that would change my mind about using a cloud provider," Firment said. "I would just want to understand more on dependencies that exist among services so I can make the right architectural choices for my organization."
This starts with a businesswide understanding of cloud architecture and education around how the technology works. A common misconception about cloud service providers is that they will manage all of an organization's needs, but the reality is companies are still responsible for overcoming outages.
Cloud service providers can improve the availability of services and an organization's risk posture, but planning for application outages based on criticality will always fall on the enterprise because providers can't guarantee 100% availability, said Naveen Chhabra, senior analyst at Forrester.
"Things are going to fail," Chhabra said. "You better plan for it."
Correction: In a previous version of this article, Drew Firment was misidentified. He is the SVP, Cloud Transformation at A Cloud Guru.