Last year J. Crew, Walmart, Ulta and Lowe's let some online shoppers down.
While none had catastrophic outages, customers were derailed from making purchases because of "technical difficulties" during peak shopping hours of the Black Friday and Cyber Monday weekend.
Retailers expect traffic spikes the weekend after Thanksgiving, but capacity management can fail. Resilience is measured in the moment.
"As an online retailer, if customers can't access your online store during Black Friday and Cyber Monday, there is no business," Yaniv Valik, VP of Product, Cyber and IT Resilience at Continuity Software, told CIO Dive.
The majority of retailers, 81%, said they've ramped up infrastructure for Black Friday and Cyber Monday, and 66% said they increased cloud capacity in preparation, according to a recent Harris Poll.
However, "many people don't realize that during a service outage, the cloud vendor's role is minimal," said Valik. "As long as the cloud vendor's service is up and operational, it's the responsibility of the customer to ensure the proper configuration and resilient architecture are in place so that services scale or failover appropriately."
"As an online retailer, if customers can't access your online store during Black Friday and Cyber Monday, there is no business,"
Yaniv Valik
Continuity Software
Even with the optimism in capacity management, 24% of retailers said they lack a plan for an outage and 40% have experienced an outage in the last three years, according to the poll.
Uptime.com recommends several steps to build resilience:
- Define an escalation process so engineers are aware of who to contact and when to make contact during an outage.
- Make it obvious to customers that there's an issue and that a resolution is in progress.
- Be aware of third parties' roles in APIs, including payment processing, site search, login and shopping carts.
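As a rough illustration of the first step in the list above, an escalation process can be written down as data: who gets contacted, and how long after an outage is detected. The sketch below is only an assumption about what such a policy might look like; the role names, the time thresholds and the `contacts_due` helper are hypothetical and do not come from Uptime.com.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EscalationStep:
    after_minutes: int  # minutes since the outage was detected
    contact: str        # hypothetical role to notify at that point


# Hypothetical policy: the roles and thresholds are illustrative only.
POLICY: Sequence[EscalationStep] = (
    EscalationStep(0, "on-call engineer"),
    EscalationStep(15, "engineering lead"),
    EscalationStep(30, "third-party vendor support"),
)


def contacts_due(elapsed_minutes: int,
                 policy: Sequence[EscalationStep] = POLICY) -> List[str]:
    """Return everyone who should have been contacted by now, earliest first."""
    ordered = sorted(policy, key=lambda step: step.after_minutes)
    return [s.contact for s in ordered if s.after_minutes <= elapsed_minutes]


if __name__ == "__main__":
    for elapsed in (0, 20, 45):
        print(elapsed, contacts_due(elapsed))
```

Keeping the policy in one small structure means engineers do not have to decide whom to call while the outage is under way; they look up the elapsed time and act.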
In moments of distress
Outages get the best of the biggest players. Amazon's annual Prime Day shoppers were met with images of dogs and technical errors — for the second year in a row.
For every minute Amazon.com experiences an outage, it could lose more than $220,000, according to Gremlin's "Cost of Downtime" tracker for top e-commerce sites in the U.S. An hour of downtime could cost the retailer more than $13.2 million.
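The hourly figure follows directly from the per-minute one. A minimal sketch of that arithmetic, assuming a constant loss rate for the whole outage (a simplification; the function name and the constant are illustrative, not part of Gremlin's tracker):

```python
# Per-minute figure quoted above; the article gives it as a lower bound.
COST_PER_MINUTE_USD = 220_000


def downtime_cost(minutes: float,
                  cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Estimate lost revenue for an outage, assuming a flat per-minute rate."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    for minutes in (1, 15, 60):
        print(f"{minutes:>3} min -> ${downtime_cost(minutes):,.0f}")
```

At 60 minutes the estimate is $13,200,000, which matches the $13.2 million hourly figure.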
Manually identifying risks — in reliability, security, misconfigurations — in real time is a near impossible feat. "An automated validation process becomes a must, otherwise hidden risks of outages may go unnoticed," said Valik.
During peak traffic times, companies that rely on their own infrastructure could temporarily adopt load balancer as a service (LBaaS), which provides load balancing in OpenStack private clouds, according to Avi Networks.
Retailers should have configured "every mission-critical component in their IT environment" for redundancy, said Valik. "From redundant power supplies and network connections at the server-level, through redundant core networking and storage infrastructure, all the way to the use of clustering, load-balancing [and] elastic computing."
Real-time, continuous monitoring is a retailer's safest bet for maintaining the status quo. Preparing for an outage and recovery starts with a "domain health check," covering the DNS, web server and mail server, along with blacklist and malware checks, according to Uptime.com.
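A very small version of such a health check might look like the sketch below. It is not Uptime.com's tooling: it uses only the Python standard library and covers just the DNS and web server checks, since the mail server, blacklist and malware checks need external data sources. The function names and the `DEFAULT_CHECKS` table are assumptions made for illustration.

```python
import socket
import urllib.request
from typing import Callable, Dict, Optional

Check = Callable[[str], bool]


def dns_resolves(host: str) -> bool:
    """True if the hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(host, 443)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


def web_server_responds(host: str, timeout: float = 5.0) -> bool:
    """True if an HTTPS GET of the site root returns a 2xx or 3xx status."""
    try:
        with urllib.request.urlopen(f"https://{host}/", timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:  # URLError, HTTPError and timeouts are all OSError subclasses
        return False


# Illustrative default set; mail, blacklist and malware checks are omitted.
DEFAULT_CHECKS: Dict[str, Check] = {
    "dns": dns_resolves,
    "web_server": web_server_responds,
}


def run_health_check(host: str,
                     checks: Optional[Dict[str, Check]] = None) -> Dict[str, bool]:
    """Run each named check against the host and report pass/fail per check."""
    active = DEFAULT_CHECKS if checks is None else checks
    return {name: check(host) for name, check in active.items()}


if __name__ == "__main__":
    print(run_health_check("example.com"))
```

Passing the checks in as a dictionary keeps the runner independent of the network, so the same function can be exercised with stub checks before peak season and pointed at real ones in production.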
From there, retailers should ensure the navigation function and transactional checks on their site are efficient. Companies with real user monitoring (RUM) installed will have oversight of performance and potential issues.
Tools like RUM help companies facing an outage stemming from something other than a traffic influx, like constant database updates.
"I've seen a company with multiple instances of NoSQL databases where replication problems caused intermittent 50x errors," Alexei Mironov, senior developer at Uptime.com, told CIO Dive in an email. Database replication runs the risk of availability outages while data is multiplied to servers or nodes.
An increase in load, experienced by code-hosting site GitLab in 2017, can cause a service outage. Large companies have the potential of experiencing "issues with database replication across multiple clusters," Michael Esposito, chief strategy officer of Uptime.com, told CIO Dive.
The most common causes of e-commerce-related outages include overburdened APIs, slow third-party components, overload of graphic components, servers unequipped for high traffic, and a disregard for regional performance levels.
Layered management software in the development pipeline mitigates site issues and gives DevOps teams more control over systems.