On Friday, Amazon finally released an official apology for last week's downtime of its cloud-based services. The outages struck on April 21 and affected popular services like Foursquare, Hootsuite, Reddit and Quora. The Amazon cloud was eventually re-inflated by Sunday, April 24, without any explanation of what actually happened... until now.
The apology/explanation is surprisingly lengthy: its 5,679 words range from an overview of the Elastic Block Store (EBS) system, to an account of what took place during the primary outage, to an offer of a 10-day service credit to affected customers. Overall, Amazon's Web Services unit explained that an incorrectly performed network change "as part of our normal AWS scaling activities" at a data center in northern Virginia was the central cause of the outage.
"The configuration change was to upgrade the capacity of the primary network," Amazon said in the letter. "During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."
For a portion of the EBS cluster in the affected Availability Zone, this meant that the nodes had neither a functioning primary network nor a functioning secondary network: traffic had been purposely shifted away from the primary network, and the secondary network couldn’t handle the traffic level it was receiving, the letter said.
"As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another."
At the end of the lengthy explanation, Amazon finally offered its apology to everyone affected.
"Last, but certainly not least, we want to apologize," Amazon's Web Services unit concluded. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."