Dive Brief:
- Google’s Compute Engine cloud service went down unexpectedly for 18 minutes on Monday for nearly all of its customers.
- The outage lasted from 7:09 pm until 7:27 pm Pacific Time, but did not interrupt the company's consumer services, like Gmail or Maps.
- Google published a long explanation and an apology on Wednesday and offered to credit customers 10% to 25% of their monthly bill.
Dive Insight:
Google said it fully understands what caused the problem and that Google Compute Engine is not in danger of another shutdown.
In a nutshell, the outage began when a network update that was underway hit a bug. The automated failsafe software, which should have caught the problem and fixed it automatically, then hit a bug of its own. As a result, incorrect configuration information was sent throughout the network, taking the entire network down.
Google said it has now made 14 distinct engineering changes to ensure this type of incident won’t happen again.
"This incident report is both longer and more detailed than usual precisely because we consider the April 11th event so important, and we want you to understand why it happened and what we are doing about it," said Benjamin Treynor Sloss, in writing about the problem on Google’s blog. "It is our hope that, by being transparent and providing considerable detail, we both help you to build more reliable services, and we demonstrate our ongoing commitment to offering you a reliable Google Cloud platform."
The outage came at a bad time for Google. The company began a major push into cloud this year in hopes of catching Amazon Web Services (AWS). Google made a slew of announcements at its Cloud Platform Next conference in San Francisco in March, where Diane Greene, Google’s head of cloud computing, argued that Google's cloud platform is ready to compete with cloud industry leaders.