Lessons from the CrowdStrike outage
COMMENTARY: Agencies can mitigate impacts of future outages by putting IT risks front and center and by understanding the incident response requirements of their vendors.
This summer's CrowdStrike outage has been widely discussed across government agencies, boardrooms, CIO/CISO offices, media, professional organizations and in academic settings. The circumstances around the faulty software update provide a rare case study for those focused on continuity of operations.
The faulty update hit nearly 8.5 million Windows devices and affected a broad range of Microsoft users, but government agencies and businesses bore the brunt of the effects. Disruptions to both internal and external operations were so great that Delta Air Lines said the outage cost the company about $550 million and that it is pursuing damages against both Microsoft and CrowdStrike.
While the Delta case sits at the higher end of the damage caused by the outage, state, local, tribal, and territorial agencies and the federal government all felt the impact of the disruptions. Dissecting this case points to several practical steps that can minimize the effects should a similar event take place in the future.
- Integration of IT and business risks: The outage caused decreased productivity and potential delays in critical business processes, affecting overall operational efficiency, financial outcomes, reputation, and customer satisfaction. IT risks should be integrated with corporate business risks. Developing and continuously updating Business Resiliency Plans (BRPs) that include the supply chain is crucial.
- Mitigation strategies for software updates: Microsoft was unable to catch this bug because of CrowdStrike's direct access to the Windows OS kernel, leaving the system without a built-in defense. The historical “defense-in-depth” strategy has become increasingly outdated in the context of modern cyber resiliency. Nevertheless, in any defense strategy, critical routes such as access to the kernel should be subject to strict scrutiny to prevent events of this magnitude. Additionally, this event underscores Microsoft’s need to revert to its own popularized ring deployment model for patch management and software updates, in which an update is released to progressively larger groups of systems and halted if problems appear. Had this phased approach been used, along with a more evolved approach to cybersecurity, the CrowdStrike outage could have been less impactful. However, the technical complexities involved in CrowdStrike’s need for deep access to the Windows OS must be carefully evaluated along with any deployment strategy.
- Review of service level agreements: Government agencies and organizations should ensure SLAs with vendors include provisions for incident management, communication and compensation during outages. Regular vendor performance reviews and risk assessments are essential.
- Maintaining transparency and clear communication: Transparency and clear communication with stakeholders are crucial during incidents. Agencies and organizations must have a well-defined communication plan for both internal and external audiences to effectively manage public relations, uphold trust, and sustain strong relationships with customers and partners.
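For readers who want to see the mechanics, the ring deployment idea raised in the software-update item above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of how Microsoft or CrowdStrike actually ship updates: the `Ring` class, the `rollout` function, the host names and the failure thresholds are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """One deployment ring (hypothetical structure for illustration)."""
    name: str
    hosts: List[str]
    max_failure_rate: float  # halt the rollout if this rate is exceeded


def rollout(rings: List[Ring], apply_update: Callable[[str], bool]) -> List[str]:
    """Deploy ring by ring, stopping at the first ring that fails too often.

    apply_update(host) is a caller-supplied function that returns True when
    the host is healthy after the update. Returns the names of the rings
    that completed within their failure threshold.
    """
    completed: List[str] = []
    for ring in rings:
        failures = sum(1 for host in ring.hosts if not apply_update(host))
        failure_rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if failure_rate > ring.max_failure_rate:
            break  # later, larger rings never receive the update
        completed.append(ring.name)
    return completed


# Example: a small canary ring, then an early-adopter ring, then everyone else.
rings = [
    Ring("canary", ["c1", "c2"], max_failure_rate=0.0),
    Ring("early", ["e1", "bad-e2", "e3", "e4"], max_failure_rate=0.1),
    Ring("broad", ["b1", "b2", "b3", "b4", "b5"], max_failure_rate=0.05),
]

# Hypothetical health check: hosts whose names start with "bad" fail.
completed_rings = rollout(rings, lambda host: not host.startswith("bad"))
```

In this example the rollout stops at the early-adopter ring, so the broad ring is never touched. The point of the design is that a faulty update is contained to the smallest group in which it first shows up.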
Diversifying security solutions
Although the CrowdStrike event was not technically a security breach (some argue this point), the outage created a vulnerability by disrupting real-time threat detection and incident response capabilities. Put more simply, organizations could not effectively monitor endpoints or respond to security incidents, creating a window for undetected breaches and lateral movement of threats.
These issues compound the potential impact of the outage. While most of the financial losses being reported come from service disruptions, cyberattacks can also have severe financial impacts. Organizations should budget for potential disruptions, including costs related to downtime, incident response, legal fees, and regulatory fines. In such instances, cyber insurance can provide a safety net for organizations.
Beyond that, agencies and organizations should diversify their suite of security solutions. Implementing multiple layers of security controls and tools ensures redundancy and resilience. Because agencies need to respond swiftly and effectively to both outages and security incidents, continuous testing and updating of business continuity, disaster recovery, and incident response plans will play a big part in defending against cyber events. This includes having clear communication protocols and backup systems in place.
The CrowdStrike outage was an event we will not soon forget. The “blue screen of death” and the ensuing chaos initially led to speculation and panic. Even after the cause was determined and a cyberattack ruled out, disruptions continued to pile up. Those effects will surely be felt as we continue to survey the damage.
The silver lining lies in the lessons learned from this incident. This deeper understanding will better equip agencies and organizations to respond to future challenges. Although it is doubtful we can entirely prevent unexpected events like the CrowdStrike software update issue, we can implement proactive measures to mitigate the impact of similar occurrences in the future.