The CrowdStrike incident is a strong reminder that systems fragile by design are at the mercy of Murphy’s Law: if something can go wrong, it eventually will.
It was only a matter of time.
What's scary is that we don’t know how many other important systems are just as fragile and simply haven't failed yet. Very few people knew what CrowdStrike was three days ago.
As long as there is a possibility of failure, it will likely happen again. Our only hope is to minimize the impact.
To be honest, I was surprised to learn that a third-party software update could cause such a big problem for an OS. I’m not an expert in this domain, but I'm sure there are trade-offs in design choices.
General-purpose software, including OSes, can't be perfect for all situations. The CrowdStrike event makes us question the choice of such software for critical systems that can't afford to fail.
Fixing the process that led to this issue is the first step, but it won't solve everything.
The problem lies in the design. We must learn from this incident and be more cautious with software design, especially for high-impact, critical systems.