Dive Brief:
- Wells Fargo customers experienced a large-scale outage Thursday after smoke was detected at one of the bank's data centers, triggering an automatic power shutdown, according to a company announcement Friday.
- The shutdown initiated a systematic rerouting to backup data centers the day of the outage. By the end of the day, the most critical systems were recovered, but Wells Fargo saw "high call volume and online and mobile traffic," according to the announcement.
- The bank extended Friday, Saturday and Sunday's business hours for all 5,500 locations. It also added support in contact centers for impacted customers.
Dive Insight:
The bank plans to invest in data management modernization, IT simplification and business support, and in light of an outage, even a controlled one, that focus is warranted.
In January, Wells Fargo named its first head of technology, an executive with a background in risk remediation, and the bank expects a 10% year-over-year increase in technology expenses. The larger technology budget gives the company room to handle IT crises quickly.
On Thursday, customers were left without access to their online bank accounts, unable to use ATMs and stuck with inactive debit cards, though the company restored most services late in the evening.
The outage continued into Friday, and so did the restoration process. Transactions, balances and payroll deposits were not "visible," but everything was processed "normally," according to the bank.
The smoke detection stemmed from routine on-site maintenance, and the data center responded as designed by rerouting to backups.
Still, outages cost companies money, even when restoration processes are in place. And unlike most incidents, Wells Fargo's outage was not the result of foul play or human error.
The majority of outages, 82%, are caused by human error, which brings chaos engineers into the equation. After an outage, chaos engineers have to ask what processes were missing, why there was no guidance, and why there were no checks and balances in place to prevent the incident.
In Wells Fargo's case, however, the data center executed an automatic response, which makes a chaos engineer's detective work more manageable.
From there, the engineering team has to coordinate, weaving through microservices and code to understand which services are critical and who should be assigned to their recovery.
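That kind of triage is often reasoned about as a dependency map. The sketch below is a rough illustration only, not anything Wells Fargo has described: it assumes a hypothetical list of microservices with declared dependencies and owning teams, then walks the graph from customer-facing entry points such as online banking and ATMs to flag which backend services are critical and which team should be assigned to restore each one.

```python
# Hypothetical illustration: map microservice dependencies to find which
# backend services are critical to customer-facing functions, and which
# team owns each one. Service names and owners are made up for this sketch.
from collections import deque

# service -> (owning team, services it depends on)
SERVICES = {
    "online-banking": ("digital-team",  ["auth", "accounts-api"]),
    "mobile-app":     ("digital-team",  ["auth", "accounts-api"]),
    "atm-network":    ("channels-team", ["card-auth", "accounts-api"]),
    "auth":           ("identity-team", ["customer-db"]),
    "card-auth":      ("payments-team", ["customer-db"]),
    "accounts-api":   ("core-team",     ["ledger-db"]),
    "customer-db":    ("data-team",     []),
    "ledger-db":      ("data-team",     []),
}

# Customer-facing entry points that went dark in this hypothetical scenario.
ENTRY_POINTS = ["online-banking", "mobile-app", "atm-network"]


def critical_services(entry_points):
    """Breadth-first walk of the dependency graph from each entry point;
    anything reachable is critical to restoring customer-facing service."""
    critical = set()
    queue = deque(entry_points)
    while queue:
        name = queue.popleft()
        if name in critical:
            continue
        critical.add(name)
        _, deps = SERVICES[name]
        queue.extend(deps)
    return critical


if __name__ == "__main__":
    for name in sorted(critical_services(ENTRY_POINTS)):
        owner, _ = SERVICES[name]
        print(f"{name:15s} -> recovery assigned to {owner}")
```

In practice, the same walk would run against a real service catalog or deployment manifest rather than a hard-coded dictionary, but the principle is identical: start from what the customer touches and work backward to the systems, and teams, that have to come back first.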