Microsoft 365? More like Microsoft 364 — why yesterday’s outage proves these companies don’t yet have a backup plan for the internet
Microsoft 365 went down, and the ‘always-on internet’ myth went with it
I feel like I’m living in “Groundhog Day” these days. Huge server infrastructure goes down and knocks out essential internet services, the cause exposes the risk of concentrating too much of that infrastructure in one location, I rant about it, and then the problem happens again.
Well, the thing just happened again. Microsoft’s services went down, taking 365, Outlook, Teams and Azure with them. And would you look at that! The root cause is “a portion of service infrastructure in North America” that just so happened to knock out the planet.
So just like the AWS outage last October, I’m here again with a plea. For the love of everything, figure out a backup plan. The world is far too dependent on the internet for there not to be a Plan B in case of emergency.
Is this ‘US-EAST-1’ all over again?
So let’s compare this to last year’s AWS outage, because the two differ in ways that still lead back to the same epicenter. The differences are simple enough:
- The AWS outage was caused by a DNS issue in one server region, US-EAST-1 (the internet’s phonebook forgot all the phone numbers), which knocked out half the internet.
- The Microsoft outage was caused by a failure in North American service infrastructure that stopped processing traffic correctly. A busted toll booth made the digital traffic jam worse and brought services to a standstill.
So in that way, these are different problems, but they both highlight the same key issue: a massive centralized dependency on a specific region to run the world’s cloud computing infrastructure.
To paraphrase Monica Eaton, founder and CEO of Chargebacks911 and Fi911, from my earlier article: when one of these companies sneezes, “half the internet catches the flu.”
But it gets worse (and more complicated)
As you may have seen, Microsoft’s first attempt to fix these traffic imbalances actually made the problem worse. The highway patrol saw the huge digital traffic jam and set up a detour down a tiny street not built for millions of cars. It immediately bottlenecked, and the road cracked under the weight of it all.
My problem is not that Microsoft tried; it’s that we’re here in the first place. To explain why, I need to go into the difference between the “Data Plane” and the “Control Plane,” because this is what’s critical here.
- The “Data Plane” is protected by a multi-availability-zone safety net. Basically, if you have two different computers in two different rooms, and a pipe bursts and floods one, the other keeps working. This is what most people mean by “redundancy.”
- The “Control Plane” is the brain that tells those computers where to send traffic, and it’s a single point of failure. In both the AWS and Microsoft outages, the brain broke, and none of those redundancies on the “Data Plane” mattered.
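To make that split concrete, here is a minimal sketch in Python. Everything in it is hypothetical (the `ControlPlane` class, the zone names, the `handle_request` helper); it is a toy model of the idea, not how AWS or Microsoft actually build their systems. It shows that losing one zone is survivable, while losing the brain is not, no matter how many healthy zones remain.

```python
class ControlPlane:
    """Hypothetical 'brain' that decides which zone receives traffic."""

    def __init__(self):
        self.healthy = True

    def pick_zone(self, zones):
        if not self.healthy:
            raise RuntimeError("control plane unavailable")
        # Route to the first zone that is still up.
        for name, is_up in zones.items():
            if is_up:
                return name
        raise RuntimeError("no healthy zone")


def handle_request(control_plane, zones):
    """Serve a request, or report why it could not be served."""
    try:
        return "served by " + control_plane.pick_zone(zones)
    except RuntimeError as err:
        return "failed: " + str(err)


# Two redundant "rooms" on the data plane (names are made up).
zones = {"zone-a": True, "zone-b": True}
brain = ControlPlane()

# A pipe bursts in zone-a: the other zone picks up the work.
zones["zone-a"] = False
result_zone_loss = handle_request(brain, zones)

# The brain breaks: zone-b is still healthy, but nothing can reach it.
brain.healthy = False
result_brain_loss = handle_request(brain, zones)
```

In this toy model, the second request fails even though `zone-b` never went down, which is the situation the bullets above describe.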
That’s not to say there aren’t redundancies for the “Control Plane,” but both of these companies rely on too many of the wrong kind: internal redundancies. The brains are all in one place (multiple servers in one region) rather than spread out as external redundancies (separate brains across multiple regions).
The fix isn’t simple
In defense of these companies, that’s a very tough nut to crack. If you change your password on one brain, every brain in the world needs to know that immediately.
They have to constantly talk to each other, and if one starts to hallucinate (if a bad software update or database error happens), then all the other redundant brains get the same wrong information in perfect sync.
AWS and Microsoft use something called Static Stability to mitigate this, which means that if the brain (Control Plane) dies, the body (Data Plane) should keep going. You won’t be able to change your password, but users should still be able to send emails because the local servers remember the last good state.
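The “remember the last good state” idea can be sketched in a few lines of Python. This is an illustration under stated assumptions: `FakeControlPlane`, `StaticallyStableNode` and the `mail_server` setting are invented names for this example, not real AWS or Microsoft interfaces.

```python
class FakeControlPlane:
    """Hypothetical stand-in for the 'brain'. Not a real cloud API."""

    def __init__(self):
        self.up = True
        self.config = {"mail_server": "mail-1"}

    def fetch(self):
        if not self.up:
            raise ConnectionError("control plane unreachable")
        return dict(self.config)


class StaticallyStableNode:
    """Data-plane node that keeps serving from its last known-good config."""

    def __init__(self, fetch_config):
        self._fetch_config = fetch_config
        self._last_good = None

    def refresh(self):
        """Try to pull fresh config; keep the old copy if the brain is down."""
        try:
            self._last_good = self._fetch_config()
            return True
        except ConnectionError:
            return False

    def mail_server(self):
        if self._last_good is None:
            raise RuntimeError("no known-good state cached yet")
        return self._last_good["mail_server"]


brain = FakeControlPlane()
node = StaticallyStableNode(brain.fetch)
node.refresh()                             # brain is up: cache the good state

brain.up = False                           # the brain dies
refreshed = node.refresh()                 # the update attempt fails
server_during_outage = node.mail_server()  # the cached answer is still usable
```

The trade-off in this sketch is the one described above: while the brain is down the node cannot learn about changes (a new password, say), but it can keep doing what it already knew how to do.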
However, the Microsoft outage wasn’t just a failure in the brain; it was in the traffic layer. The body looked fine, but the central nervous system (the network) couldn’t carry the signal to the limbs.
But there is a fix: a cell-based architecture. AWS and Microsoft are moving aggressively towards this answer to the “one giant brain” problem. It breaks a huge server region down into hundreds of independent micro-neighborhoods, so if one cell is impacted, nobody else notices.
Sounds like a dream fix, right? Well, why doesn’t it happen now? There is an inconceivable amount of complexity and a massive amount of legacy to overcome:
- The cells need the right traffic directed to them, which requires a cell router. If the router breaks, none of this cell-based architecture matters.
- Microsoft 365 is a 15-year-old system that is massive. Turning this monolithic brain into 100 mini brains is like trying to perform a brain transplant on someone who is running a marathon.
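For a rough feel of what a cell router does, here is a small Python sketch. It assumes a simple hash-based mapping from customer to cell, which is my own simplification for illustration; the function names, the `tenant-N` IDs and the cell count of 100 are all hypothetical, and real providers may assign cells very differently.

```python
import hashlib


def cell_for(customer_id, cell_count):
    """Hypothetical cell router: pin each customer to one cell by hashing its ID."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


def affected_customers(customers, broken_cell, cell_count):
    """Return only the customers whose cell is the one that failed."""
    return [c for c in customers if cell_for(c, cell_count) == broken_cell]


# Made-up tenants spread across 100 made-up cells.
CELL_COUNT = 100
customers = ["tenant-%d" % i for i in range(1000)]

# If cell 0 breaks, only the customers pinned to cell 0 are hit.
hit = affected_customers(customers, broken_cell=0, cell_count=CELL_COUNT)
blast_radius = len(hit) / len(customers)
```

Notice that every request still has to pass through `cell_for` first. That is the catch from the first bullet: the cells limit the blast radius of a failure inside a cell, but the router in front of them remains a shared dependency.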
So the short answer is they’re working on it, but while it’s proving to be tricky, this needs to be resolved like yesterday for one key reason.
Wait, the world expects us to rent a PC from the cloud?
Computing Editor Darragh Murphy wrote a great piece exposing the moment when Jeff Bezos said the quiet part out loud: the idea of having a local PC is “not going to last.” AI and the RAM price crisis are accelerating us towards the idea that the only way computing makes sense is to rent one from the cloud.
While I massively disagree with Jeff’s take with every fiber of my being (many things have tried to kill owning your own computer in the past, and all have failed), let’s entertain it for a second.
For this to even remotely stand a chance of working, the infrastructure has to be perfect. Windows 11 updates may bring plenty of bugs to your PC, but you can at the very least turn it on. With cloud infrastructure issues like this, you can’t even do that.
And if the plan is to make the world rely solely on cloud computing, small failures like the ones we’ve seen in these outages can cause problems far worse than the frustration of not being able to reach your cloud PC to play games. They could severely impact small businesses, governments, healthcare and much more.
And this is what I mean when I end every single one of these rants by saying we need to demand better redundancy from the tech we use every day. There’s too much at stake for the cloud to be the only way we compute — there must always be a local element.

Jason brings a decade of tech and gaming journalism experience to his role as a Managing Editor of Computing at Tom's Guide. He has previously written for Laptop Mag, Tom's Hardware, Kotaku, Stuff and BBC Science Focus. In his spare time, you'll find Jason looking for good dogs to pet or thinking about eating pizza if he isn't already.
