
Amazon Finally Apologizes for Epic Cloud Failure

On Friday, Amazon finally released an official apology for last week's downtime of its cloud-based services. The outages struck on April 21 and affected popular services like Foursquare, Hootsuite, Reddit and Quora. The Amazon cloud was eventually re-inflated by Sunday, April 24, without any explanation of what actually happened... until now.

The apology/explanation is surprisingly lengthy: its 5,679 words range from an overview of the Elastic Block Store (EBS) system, to what took place during the primary outage, to an offer of a 10-day service credit to affected customers. Overall, Amazon's Web Services unit explained that an incorrectly performed network change "as part of our normal AWS scaling activities" at a data center in northern Virginia was the central cause of the outage.

"The configuration change was to upgrade the capacity of the primary network," Amazon said in the letter. "During the change, one of the standard steps is to shift traffic off of one of the redundant routers in the primary EBS network to allow the upgrade to happen. The traffic shift was executed incorrectly and rather than routing the traffic to the other router on the primary network, the traffic was routed onto the lower capacity redundant EBS network."

For a portion of the EBS cluster in the affected Availability Zone, this meant that they did not have a functioning primary or secondary network because traffic was purposely shifted away from the primary network and the secondary network couldn’t handle the traffic level it was receiving, the letter said.

"As a result, many EBS nodes in the affected Availability Zone were completely isolated from other EBS nodes in its cluster. Unlike a normal network interruption, this change disconnected both the primary and secondary network simultaneously, leaving the affected nodes completely isolated from one another."
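The failure mode Amazon describes can be illustrated with a minimal sketch. This is a toy model, not Amazon's actual architecture: the `Network` class, the capacity and load figures, and the `simulate` function are all hypothetical and chosen only to show how shifting traffic onto a lower-capacity network can leave nodes with no usable path at all.

```python
from dataclasses import dataclass


@dataclass
class Network:
    """A toy network with a fixed capacity (arbitrary traffic units)."""
    name: str
    capacity: float
    load: float = 0.0
    in_service: bool = True

    def usable(self) -> bool:
        # A network is usable only if it is in service and not overloaded.
        return self.in_service and self.load <= self.capacity


def shift_traffic(source: Network, target: Network) -> None:
    """Move all of source's load onto target and take source out of service."""
    target.load += source.load
    source.load = 0.0
    source.in_service = False


def simulate(misrouted: bool) -> bool:
    """Return True if a node ends up isolated (no usable network).

    All numbers below are made up for illustration.
    """
    primary = Network("primary", capacity=100.0, load=60.0)
    secondary = Network("secondary", capacity=10.0, load=1.0)

    if misrouted:
        # The mistake described in the letter: traffic lands on the
        # lower-capacity secondary network instead of staying on the primary.
        shift_traffic(primary, secondary)
    # In the intended case, traffic moves to the other router on the same
    # primary network, so from the node's point of view the primary network
    # keeps carrying the load and nothing changes in this model.

    return not any(net.usable() for net in (primary, secondary))


if __name__ == "__main__":
    print("intended shift isolates node:", simulate(misrouted=False))
    print("misrouted shift isolates node:", simulate(misrouted=True))
```

In this model the misrouted shift removes the primary network and overloads the secondary one in a single step, which is why the nodes lose both paths at once rather than falling back gracefully.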

By the end of the incredibly lengthy explanation, Amazon finally offered its apology to everyone involved.

"Last, but certainly not least, we want to apologize," Amazon's Web Services unit concluded. "We know how critical our services are to our customers' businesses and we will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes."

  • rad666
    And that is why I hate "the cloud"...
  • hellwig
    Didn't Google have some sort of cascading failure like this a year or two ago? Their redundant system detected a fault and switched the traffic over, which overwhelmed the secondary node, causing it to shut down, and so forth and so on.

    They seem to be thinking RAID 5/6 when they should be thinking RAID 1 (these are just analogies, I'm not talking actual storage configuration). That is, be 100% redundant, not 20% redundant. It's a lot more expensive, but it would take quite the disaster to bring your system down, not just an accidental "our secondary network can't handle the traffic from our primary network, oopsie".
  • hellwig
    rad666: And that is why I hate "the cloud"...
    And what do you propose to do otherwise? There would be no difference between Foursquare hosting their own services and Amazon hosting those services, except that it would cost Foursquare a whole lot more to do the former than the latter. Even if every single website was hosted on private servers, those servers would still have to be redundant, with some sort of scheme to keep them up and running. "The Cloud" is just a B.S. term people came up with to sell crap. The internet has always been "the cloud". When you read a news story online, that story is hosted on a non-local webserver (in the cloud). A non-cloud news-source would be a newspaper, where the news is printed on paper physically in your possession. But if that paper gets lost or destroyed, you have to buy a new one. In "the cloud", if that news webserver goes down, there's a second one somewhere to ramp up and replace it.

    Even if you download your email and store it locally on your PC, your email service is still "in the cloud". When someone sends you an email to douche-at-yahoo.com, it's the "cloudy" nature of Yahoo! that lets that email get to you, even if one set of mail servers is down. If you simply ran your own email server on your home network, you might miss an email if your connection went down for a decent length of time.

    The cloud didn't fail here, Amazon did.
  • f-14
    somebody did a rain dance and brought down the cloud!
    and yes hellwig it's a cloud failure, nobody could access jack shit, flavour it any way you like, it still tastes like shit for the simple fact you could not get access to your info from somewhere else like you seemed to think it works.
    i hope you remember that the next time a bridge fails and all traffic has to be routed to a side alley and then there is an accident in the side alley effectively stopping ALL traffic.
    it's no different than if every line that accessed the cloud was chopped by an axe, nothing got thru, not a single bit.
  • grieve
    hellwig: The cloud didn't fail here, Amazon did.
    I can’t agree more, Amazon failed!
    I bet a few job opportunities opened up over @ Amazon shortly after they realized what happened.
  • house70
    the ZONE...
    sounds like a S.T.A.L.K.E.R. issue...
  • mayne92
    hellwig: And what do you propose to do otherwise? There would be no difference between Foursquare hosting their own services and Amazon hosting those services, except that it would cost Foursquare a whole lot more to do the former than the latter. Even if every single website was hosted on private servers, those servers would still have to be redundant, with some sort of scheme to keep them up and running. "The Cloud" is just a B.S. term people came up with to sell crap. The internet has always been "the cloud". When you read a news story online, that story is hosted on a non-local webserver (in the cloud). A non-cloud news-source would be a newspaper, where the news is printed on paper physically in your possession. But if that paper gets lost or destroyed, you have to buy a new one. In "the cloud", if that news webserver goes down, there's a second one somewhere to ramp up and replace it.
    Even if you download your email and store it locally on your PC, your email service is still "in the cloud". When someone sends you an email to douche-at-yahoo.com, it's the "cloudy" nature of Yahoo! that lets that email get to you, even if one set of mail servers is down. If you simply ran your own email server on your home network, you might miss an email if your connection went down for a decent length of time.
    The cloud didn't fail here, Amazon did.
    Ah damn, you beat me to it! Awesome. :-P
  • sounds like humans failed the cloud to me
  • The cloud did fail. The failure is akin to centralized food distribution vs. local food distribution: the central point of failure didn't contaminate a single town's hamburger with E. coli; it contaminated millions of tonnes of hamburger being served all over the world. If Foursquare's own hosting failed, Foursquare would have an outage; when Amazon has an outage, thousands of companies have an outage.
  • Yuka
    The Cloud = Internet = Just a series of tubes...

    :trollface:

    Cheers! xD!