Dive Brief:
- Facebook regularly conducts stress tests on its operational networks to help ensure it can put data centers back together after a natural disaster or other disruption, according to an IEEE Spectrum report.
- The tests are conducted by the company’s disaster special weapons and tactics team, also known as SWAT. Facebook vice president of engineering Jay Parikh recently introduced the stress tests to an audience of invited engineers at the third annual @Scale conference.
- Parikh said the tests stem from Hurricane Sandy in 2012, when two Facebook data centers were threatened. Even though both were unharmed in the storm, the engineering team set out to test how the loss of any data center or computing region would affect the global availability of the social network.
Dive Insight:
Parikh said it’s all about shifting traffic. In 2014, the team started conducting live tests during regular work days. While the first few tests were challenging, the team has improved with time, said Parikh. Still, conducting live stress tests is not easy work.
"It’s easier to take a data center down than to put it back together," said Parikh.
Such tests can help mitigate the fallout from emergencies like the power outage that recently took Delta Air Lines’ computer systems down and led to widespread flight cancellations and delays.
For companies, staying online in the event of an emergency is all about preparation and a disciplined disaster response. Preparing for as many outcomes as possible makes recovering from a data center emergency smoother and helps prevent mistakes that can exacerbate an outage.