Learning from the CrowdStrike outage

The BCS Software Testing Specialist Group (SIGiST) examines what may have caused the recent global IT outage and explores what it believes should be the software industry's lasting lessons.

On Friday 19th July 2024, CrowdStrike, a prominent cybersecurity firm, experienced a significant software failure following the release of a content configuration update for its Falcon Windows sensor.

The story began in February 2024, when the security vendor introduced a ‘new sensor capability to enable visibility into possible novel attack techniques that may abuse certain Windows mechanisms.’ The feature was released in March and, as history shows, was updated on 19th July.

Called a Rapid Response Content update, the July release was intended as an evolution of February’s new capabilities. However, the update caused around 8.5 million Windows hosts to ‘blue screen’, disrupting IT services worldwide. For example, in the US, Delta Air Lines reported that 30% of its flights were cancelled or delayed.

Ironically, the global outage stemmed from software designed to prevent widespread cybersecurity failures, highlighting that protective software can itself be a significant point of vulnerability.

SIGiST believes these are the key points:

This situation underscores the necessity for quality and security governance, and for robust software development and maintenance practices.
Development teams must conduct a thorough root cause analysis, focusing on solutions rather than assigning blame. Finding and addressing these root causes is pivotal for implementing effective countermeasures to prevent future occurrences.
This incident highlights the importance of adequate threat modelling for end-user systems. For example, in high-impact systems, vulnerabilities must be identified and mitigated before they lead to significant disruptions.

SIGiST advocates for these proactive measures, emphasising that a systematic approach, including security testing approaches such as threat modelling and traditional methods such as root cause analysis, is essential for maintaining software integrity and reliability.

Analysis

From CrowdStrike's root cause analysis, we can all learn lessons about becoming over-reliant on tooling, whether in-house or third-party. These frameworks may be solid 99.9% of the time but, high though that number is, nothing is reliable 100% of the time.

Looking beyond and before the CrowdStrike incident, we see that the world’s digital systems have become more interconnected and interdependent. They’ve also grown in capability and capacity. What’s more, behind the scenes, digital systems are starting to rely increasingly on AI and machine learning for development, testing and deployment. AI is also playing an increasingly important role in keeping systems safe from cyber attacks.

The EU's new AI Act may help improve testing standards, especially when it comes to ensuring the entire system is tested. For high-risk systems, the known requirements for conformity assessments include checks that the data sets used to train and test AI systems are ‘relevant, representative, free of errors and complete’, and that the systems include human oversight.

Recommendations

From SIGiST’s perspective, it is important that we all promote continuous improvement within a quality-focused culture — a culture where everyone feels accountable for quality.

Here are some suggestions for enhancing the effectiveness of quality and software testing practices:

Companies should reinforce quality and security compliance, ensuring all engineers and team members responsible for quality understand what they are accountable for.
To reduce the risks of security breaches, consider regular evaluations and internal audits. Adhere to industry best practices for safeguarding data and enhancing efficiency.
Invest in professional training and certification for those with quality or compliance responsibilities.
Resilience and failover processes need to be robust, and their underpinning mechanisms must be validated constantly through automatic and manual audits.
Microsoft's incident response states that it provides ‘a technical overview of the root cause’. However, there is rarely a single root cause behind an incident.
Companies should aim to learn from their own issues by finding the root causes of failures. Methods such as Five Whys, Ishikawa diagrams and ‘blameless post-mortems’ can all help turn a problem into an opportunity.
Systems thinking can play a valuable role in development processes because it considers the whole system.
A company's suppliers, such as CrowdStrike, are part of the company's system. W. Edwards Deming wrote that 94% of troubles belong to the system.
Incidents often involve ‘soft issues’. CrowdStrike and Microsoft should seek to ‘understand people, the interaction between people and circumstances’ and how this relates to the incident's causes.

Conclusion

From a software testing perspective, it’s vital to recognise that, even as AI and machine learning technologies become increasingly sophisticated, input from trained quality professionals is more critical than ever.

The recent CrowdStrike outage highlights the importance of integrating a holistic approach to technology management. While automated tools and AI are hugely valuable assets, they should not replace human oversight and expertise.

Testing brings insights and contextual understanding that AI alone cannot provide. Testers are adept at anticipating edge cases, understanding user experience, spotting novel threats and adapting to unexpected scenarios — all in ways that automated systems might not.

Failures like the CrowdStrike outage present an opportunity to learn and improve. Companies need to develop a culture where issues are discussed openly, and in public where appropriate. Teams should be given the freedom to identify root causes, implement corrective actions, and continuously enhance processes. All this can help prevent further issues. This approach strengthens the quality of software, builds trust, and promotes a mindset of growth and resilience.

Organisations can better identify potential vulnerabilities and enhance system robustness by ensuring that testing remains integral to the process. This approach helps address and mitigate issues before they escalate, potentially preventing similar incidents in the future.

In summary, we all hope to create reliable, user-centred software. To achieve this, we need to strike a careful balance between human input into development and the use of advanced technologies. Developing a culture that fosters continuous development and encourages adoption is also essential. If we get all this right, we’ll be well placed to turn setbacks into valuable learning experiences — lessons that contribute to further improvement.

We’ll leave you with a thought: ‘The emphasis on productivity had a negative effect on quality’.