The $2.5 billion question: How many more AWS outages until the internet builds a real backup plan?

Chances are you were hit by internet troubles yesterday. The AWS outage impacted over 2,500 companies and services worldwide — estimated to cost everyone involved roughly $2.5 billion.

And it was all because of one server region in Northern Virginia — a single point of failure took down thousands of companies and essential public services across the globe.

When AWS sneezes, half the internet catches the flu.

Monica Eaton, Founder and CEO of Chargebacks911 and Fi911

And that happened even though AWS best practice is for companies to use the server region closest to the largest pool of their end users. So how did this happen? And did it just expose how fragile the internet actually is? Spoiler alert: Yes. Let me explain.

How did the AWS outage happen?

Amazon has issued a statement about the outage, but it's a nothingburger that’s probably been posted for legal reasons. Our AI Editor Amanda Caswell has provided much more detailed insight into how the AWS outage happened.

But to summarize real quick, the crisis began inside Amazon Web Services' busiest data hub in Northern Virginia (US-EAST-1), where a core networking failure caused a problem with the Domain Name System (DNS). Think of DNS as the internet’s central phone book, with DynamoDB (a critical database service) as its most important entry.

The metaphorical phone book started spontaneously deleting the address for the main warehouse, DynamoDB. All the internal systems for key services were suddenly trying to call the DynamoDB database, but the DNS could not provide the correct digital address. With no instructions on where to send the data, all those applications stalled, timed out and started to crash.
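To make the phone book idea concrete, here is a minimal Python sketch of what an application sees when a DNS lookup fails. It is an illustration only, not AWS's code: the hostname, the made-up address and the two "phone book" functions are all hypothetical stand-ins.

```python
import socket
from typing import Callable, Optional

# Hypothetical hostname for illustration; not a real AWS endpoint.
DB_HOST = "database.example.internal"


def system_resolver(hostname: str) -> str:
    """Ask the operating system's DNS resolver for an IPv4 address."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return infos[0][4][0]


def find_database(hostname: str,
                  resolver: Callable[[str], str] = system_resolver) -> Optional[str]:
    """Return the address for hostname, or None if the lookup fails."""
    try:
        return resolver(hostname)
    except OSError:
        # socket.gaierror (a failed DNS lookup) is a subclass of OSError:
        # the phone book has no usable entry for this name.
        return None


def broken_phone_book(hostname: str) -> str:
    """Simulates a DNS service whose record for the host has been deleted."""
    raise socket.gaierror("no address record found")


def working_phone_book(hostname: str) -> str:
    """Simulates a healthy DNS service (the address is made up)."""
    return "10.0.0.5"


print(find_database(DB_HOST, working_phone_book))  # an address to call
print(find_database(DB_HOST, broken_phone_book))   # None: nowhere to send data
```

An app that gets nothing back has no address to connect to, which is why everything downstream of the lookup stalls even though the database itself may be perfectly healthy.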

This initial failure then triggered a massive cascading failure across the entire cloud. Imagine a power grid: when one major substation goes offline, the sudden surge of traffic overwhelms the remaining infrastructure. US-EAST-1 is that major substation: it controls the flow of power to all the other stations, and it also holds that "phone book."

This caused services like EC2 (virtual computers) and Lambda (serverless code) to fail, creating massive backlogs of requests. Even after Amazon fixed the "phone book" entry, the grid was still overloaded, requiring hours of manual work and "rate limiting" (temporarily slowing down new traffic) to clear the congestion and fully restore stability.
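For the curious, rate limiting is often built with something called a token bucket. Amazon has not said which method it used, so treat the Python sketch below as one common approach under my own assumptions; the class name and the numbers are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow short bursts up to `capacity`, then at most `rate_per_sec` requests."""

    def __init__(self, rate_per_sec: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock            # injectable so the sketch is easy to test
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        """Return True if a request may proceed now, False if it must wait."""
        now = self.clock()
        # Refill tokens for the time elapsed, never exceeding the capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage: 2 requests per second, with a burst allowance of 2.
bucket = TokenBucket(rate_per_sec=2, capacity=2)
decisions = [bucket.allow() for _ in range(3)]
```

Requests that are turned away are not lost for good. They simply have to try again later, which gives an overloaded system room to work through its backlog.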

Who was affected?

Yes, we all lamented the big problems. Snapchat and Reddit went down, as did Fortnite, PlayStation Network, various streaming services and a whole lot of content-based sites. Duolingo and Wordle streaks were at risk, but there were more surprising victims given the location.

If you have smart home and personal security tech, chances are you couldn’t do a whole lot around your house. Because Ring doorbells and cameras and Amazon Alexa devices depend on the AWS cloud, automations and routines collapsed instantly. For those who use Life360 for family peace of mind, that went down, too.

Education also took a hit, as the major educational platform Canvas went down, leaving students unable to access coursework or submit assignments. Financial tech took a dive as well: several major U.K. banks experienced outages, as did Venmo and Coinbase in the U.S.

But most concerning of all were the hits to critical public services, transport and enterprise systems. The U.K.’s tax authority HMRC went down; the United Airlines and Delta websites were offline, which meant people couldn’t book flights; and Zoom, Slack and Xero were out. All because of one hub in Northern Virginia!?

Also, hilariously, AWS outage issues were felt in the world of sports, as the semi-automated offside technology used in Premier League soccer went down — making VAR in the West Ham match a more involved process.

What needs to happen now?

Here’s the $2.5 billion question for Amazon Web Services: why on Earth is so much of the world’s key infrastructure reliant on a single point of failure like this? Yes, I know US-EAST-1 is the “default” option, but that is based purely on historical context. And historical context shouldn’t make a single region the central nervous system for daily website traffic.

The digital world relies on a handful of massive tech companies for critical services like this, so is it time for regulators and companies to mandate a change?

The big actions

There’s precedent for government action here, too, and these questions need to be asked over and over. If any political figures stumble upon this article, please take these questions and put them to Amazon! And if I may suggest two solutions:

  • Make multi-region mandatory: The system architecture of key services is too critical to be based in just one place. There should be a live failover in a separate region, like Europe or Asia, to prevent a repeat of this in the future.
  • Governments need to get tougher: Rules for critical services like banking, education, transportation and government services should require a backup plan baked into their IT. That means tougher requirements like multi-cloud strategies.
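What would a live failover look like in practice? Here is a minimal Python sketch of the idea, assuming each region can be represented as a simple function call. The region handlers, the `AllRegionsFailed` error and the helper name are hypothetical; a real setup would also involve replicated data, health checks and traffic routing.

```python
from typing import Callable, List, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when every configured region has failed."""


def call_with_failover(regions: Sequence[Tuple[str, Callable[[], str]]]) -> Tuple[str, str]:
    """Try each region in order and return (region_name, response) from the first success."""
    errors: List[str] = []
    for name, handler in regions:
        try:
            return name, handler()
        except Exception as exc:  # any failure means "move on to the next region"
            errors.append(f"{name}: {exc}")
    raise AllRegionsFailed("; ".join(errors))


def primary() -> str:
    """Stand-in for the Northern Virginia region during the outage."""
    raise ConnectionError("region unavailable")


def backup() -> str:
    """Stand-in for a healthy region elsewhere."""
    return "ok"


# Region names are illustrative labels for the two stand-in handlers.
served_by, response = call_with_failover([("us-east-1", primary), ("eu-west-1", backup)])
print(served_by, response)  # eu-west-1 ok
```

The point of the sketch is the ordering: the backup region only takes traffic when the primary fails, and the service goes down only if every region on the list is out at once.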

What can you do?

But what about you? Because if history repeats itself, we could all go back to the status quo until the next time AWS sneezes and half the internet catches the flu.

Well, the first thing you can do is make your smart home outage-proof. Ring doorbells and Alexa devices are entirely cloud-dependent. You need to look for devices that run on local protocol systems like Matter, which makes local control a core requirement.

But the long game for you (and me, and everyone else) is to demand better redundancy from the tech you use every day. And the way to make companies listen is to hit them where it hurts: their wallets.

Jason England
Managing Editor — Computing

Jason brings a decade of tech and gaming journalism experience to his role as a Managing Editor of Computing at Tom's Guide. He has previously written for Laptop Mag, Tom's Hardware, Kotaku, Stuff and BBC Science Focus. In his spare time, you'll find Jason looking for good dogs to pet or thinking about eating pizza if he isn't already.
