Microsoft Outage: 4 Critical Learnings
How CrowdStrike's mishap exposed the fragility of our interconnected world, and what we can learn from it.
Delayed surgeries, grounded flights, canceled transactions, forced broadcasters off the air…
A seemingly innocuous CrowdStrike update resulted in the largest IT outage in history.
As the dust settles and systems crawl back to normalcy, let's understand:
What went wrong?
What can we learn from it, and what do the experts say?
The great digital derailment: What went wrong?
On July 19th, 2024, CrowdStrike found itself at the center of a perfect storm.
A small, unvetted update to its Falcon Sensor software snowballed into a catastrophe affecting 8.5 million Windows devices worldwide (less than one percent of all Windows machines).
"Despite the low percentage of computers affected, those computers were typically enterprise customer computers and those enterprise customers are the ones that affect much of our lives," says Mark Gregory, a senior telecommunications and network engineering academic from RMIT University.
The result: system crashes, the infamous Blue Screen of Death (BSOD), and countless machines caught in an endless boot loop.
Critical services like Microsoft 365 and Azure also failed.
As David Glance, director of the Centre for Software and Security Practice at the University of Western Australia, bluntly put it:
"This is where I am curious to see what litigation follows because it's just pure negligence on their part to actually release something without the appropriate testing."
Clare O'Neil, Australia's Minister for Home Affairs, also publicly expressed concern over the incident.
Important lessons and expert insights for a resilient future
1. Implement staged rollouts for safer deployments
Staged rollouts involve releasing updates to a small subset of users before a full-scale deployment.
This practice helps identify any issues in a controlled environment, allowing time to fix bugs or unforeseen problems without affecting the entire user base.
"That's a normal software development practice that we teach to undergraduate students: you don't go from fixing it to releasing it to all your customers.
You go through a number of test stages, but there's no way you can eliminate every possible problem with the software."
-Tom Worthington, an honorary lecturer in the School of Computing at the Australian National University
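To make this concrete, here is a minimal sketch of a percentage-based staged rollout. The device IDs, stage percentages, and hashing scheme below are illustrative assumptions, not CrowdStrike's actual release process.

```python
# Minimal sketch of a percentage-based staged rollout (illustrative only).
import hashlib

ROLLOUT_STAGES = [1, 5, 25, 100]  # percent of the fleet updated at each stage

def in_rollout(device_id: str, stage_percent: int) -> bool:
    """Deterministically bucket a device into [0, 100) and compare to the stage."""
    digest = hashlib.sha256(device_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < stage_percent

# Example: at stage 2 (5%), only devices hashed into the first 5 buckets update.
devices = ["host-001", "host-002", "host-003"]
stage = ROLLOUT_STAGES[1]
targets = [d for d in devices if in_rollout(d, stage)]
print(f"Stage {stage}%: updating {targets}")
```

Because the bucketing is deterministic, a device stays in the same cohort across stages, so a bad update surfaces in a small early cohort before any fleet-wide push.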
2. Prioritize disaster recovery and backups in cyber-resilience strategy
A cyber-resilience plan is no longer optional.
You need to have strategies for threat detection, response, recovery, and continuous improvement.
This means integrating comprehensive security measures, regular training, and robust incident response protocols.
“I’ve talked to several CISOs and CSOs who are considering triggering restore-from-backup protocols instead of manually booting each computer into safe mode, finding the offending CrowdStrike file, deleting it, and rebooting into normal Windows. Companies that haven’t invested in rapid backup solutions are stuck in a Catch-22.”
Eric O'Neill, a public speaker and cybersecurity expert
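Rapid recovery starts with knowing your backups are actually current. Below is a minimal sketch of an automated backup freshness check; the backup directory, file extension, and 24-hour threshold are hypothetical assumptions.

```python
# Minimal sketch of a backup "freshness" check, assuming backups land as image
# files in a local directory. Paths and thresholds are hypothetical.
from datetime import datetime, timedelta
from pathlib import Path

BACKUP_DIR = Path("/var/backups/workstations")  # hypothetical location
MAX_AGE = timedelta(hours=24)

def stale_backups(backup_dir: Path, max_age: timedelta) -> list[Path]:
    """Return backup images older than the allowed window."""
    if not backup_dir.exists():
        return []
    cutoff = datetime.now() - max_age
    return [
        p for p in backup_dir.glob("*.img")
        if datetime.fromtimestamp(p.stat().st_mtime) < cutoff
    ]

if __name__ == "__main__":
    stale = stale_backups(BACKUP_DIR, MAX_AGE)
    if stale:
        print(f"WARNING: {len(stale)} backups are older than {MAX_AGE}.")
```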
Jake Moore, global security advisor at ESET, says,
"Businesses must test their infrastructure and have multiple fail safes in place, regardless of company size."
Another great insight comes from Aleksandr Yampolskiy, CEO of SecurityScorecard:
"When I used to work at Goldman Sachs, the policy was to get tools from multiple vendors. This way, if one firewall goes down by one vendor, you have another vendor who may be more resilient.”
Yampolskiy further explains:
“Antifragility in these situations comes from not putting all your eggs in one basket. You need to have diverse systems, know where your single points of failure are, and proactively stress-test through tabletop exercises and simulations of outages.
Consider the “chaos monkey” concept, where you deliberately break your systems—e.g., shut down your database or make your firewall malfunction to see how your computers react."
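As a rough illustration of that chaos-testing idea, the sketch below randomly "breaks" one dependency in a test environment and checks whether the application survives. The service names and the disable/check_health helpers are hypothetical stand-ins for real fault injection and health probes.

```python
# Minimal sketch of a "chaos monkey"-style drill: knock out one dependency in a
# test environment and verify the application still responds.
import random

SERVICES = ["database", "firewall", "auth-service", "cache"]

def disable(service: str) -> None:
    # Stand-in for real fault injection (stopping a container, blocking a port, etc.).
    print(f"[chaos] simulating outage of {service}")

def check_health() -> bool:
    # Stand-in for probing the application's health endpoint after the fault.
    return random.random() > 0.3

victim = random.choice(SERVICES)
disable(victim)
if check_health():
    print(f"Survived loss of {victim}: failover worked.")
else:
    print(f"FAILED without {victim}: single point of failure found.")
```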
3. Conduct rigorous testing and validation to enhance cybersecurity
It is essential to perform comprehensive testing of Endpoint Detection and Response (EDR) and Early Launch Anti-Malware (ELAM) drivers.
“The severity of this incident serves as a stark wake-up call, highlighting the critical need for rigorous and dependable testing of EDR and ELAM drivers in cybersecurity systems. Now more than ever, it is crucial to reassess and overhaul current testing procedures, swiftly identifying and addressing any issues that arise."
Rob Reeves, principal cybersecurity engineer at Immersive Labs
Thorough testing and validation ensure that any updates or changes to the system are properly vetted, minimizing the risk of future incidents and maintaining robust protection against cyber threats.
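One way to operationalize this is a validation gate that every update artifact must pass before release. The sketch below assumes the update ships as a JSON content file with a simple schema; the field names and checks are hypothetical.

```python
# Minimal sketch of a pre-release validation gate for an update's content file.
# Usage: python validate_update.py update.json
import json
import sys

REQUIRED_FIELDS = {"version", "signatures", "checksum"}

def validate_update(path: str) -> list[str]:
    """Return a list of problems; an empty list means the file passed basic checks."""
    problems = []
    try:
        with open(path) as fh:
            data = json.load(fh)
    except (OSError, json.JSONDecodeError) as exc:
        return [f"file unreadable or malformed: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    if "signatures" in data and not data["signatures"]:
        problems.append("signature list is empty")
    return problems

if __name__ == "__main__":
    issues = validate_update(sys.argv[1])
    if issues:
        print("BLOCK RELEASE:", *issues, sep="\n- ")
        sys.exit(1)
    print("Basic validation passed; proceed to staged rollout.")
```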
4. Enhance observability and monitoring to mitigate outages
With proper supply chain visibility, you can better monitor and manage your software dependencies.
This helps promptly identify and mitigate any vulnerabilities or risks.
"It is essential to have visibility on your software supply chain, especially around critical practices such as cybersecurity, cryptography management, and, of course, testing and updates practices.
With this historical outage, along with other recent software supply-chain catastrophic events, such as SolarWinds and Log4j, we cannot accept with blind trust software updates nor blindly trust cybersecurity or cryptography practices.
Every company should implement observability in their software systems right away to monitor these high-impact platforms and prevent these catastrophes.”
Carlos Aguilar Melchor, chief scientist, cybersecurity at SandboxAQ
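A small example of what such visibility can look like in practice: inventory the third-party agent versions running across your fleet and alert on anything outside the approved set. The version strings and inventory data below are hypothetical.

```python
# Minimal sketch of supply-chain visibility: count which third-party agent
# versions are deployed and flag anything outside the approved set.
from collections import Counter

APPROVED_VERSIONS = {"sensor 7.15", "sensor 7.16"}

fleet_inventory = {
    "host-001": "sensor 7.16",
    "host-002": "sensor 7.16",
    "host-003": "sensor 7.17",  # version pushed outside change control
}

counts = Counter(fleet_inventory.values())
print("Version distribution:", dict(counts))

unapproved = {h: v for h, v in fleet_inventory.items() if v not in APPROVED_VERSIONS}
if unapproved:
    print("ALERT: unapproved agent versions detected:", unapproved)
```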
As we’ve seen from time to time, even tech giants aren't immune to stumbles.
However, it's not about pointing fingers but learning and growing stronger together.
As Day Johnson, a user on X, wisely said: “We respond, we learn, and we become better from them.”
Happy learning!