A “perfect storm of issues” and internal validation errors resulted in CrowdStrike's defective software update that caused a global IT network outage in July, an executive said during testimony Tuesday in a congressional hearing.
Adam Meyers, SVP of counter adversary operations at CrowdStrike, accepted complete responsibility and apologized on behalf of the company for causing one of the largest IT outages in history, in testimony before members of the House Subcommittee on Cybersecurity and Infrastructure Protection.
“Trust takes years to make and seconds to break, and we understand that we broke trust and that we need to work to earn it back,” Meyers said.
Members of the subcommittee were largely empathetic to the cybersecurity industry’s technical challenges, though in a few confrontational moments lawmakers criticized CrowdStrike’s processes that allowed the error to slip through testing.
“It seems like a very large miss. You guys touch a lot of things. You touch a lot of infrastructure in the United States, something that we count on you for, which you’ve been doing a great job until this one,” said Rep. Morgan Luttrell.
“You mention North Korea, China and Iran, our outside actors are trying to get us every day. We shot ourselves in the foot on the inside of the house.”
While the fault lies with CrowdStrike in this incident, the broad impact of its error on Windows users called into question cybersecurity vendors’ practices, specifically their tools' reliance on deep control of, and access to, the Windows kernel.
Microsoft’s involvement is significant and its response is being closely watched as it continues to overhaul its cybersecurity strategy. Last week, Microsoft outlined some internal testing and systems changes it's working on to prevent another widespread outage caused by a third-party vendor.
Here are five takeaways from Meyers’ testimony:
1. CrowdStrike's testing process failed.
CrowdStrike’s process for testing content updates prior to and during the faulty July 19 update relied on validators that tested content channel files individually, but not collectively. That has since changed.
“The new methodology is to test all of the content updates internally before they're released to the early adopters,” Meyers said.
During Meyers’ testimony, some lawmakers were keen to understand what went wrong with the faulty sensor content update. Processes were followed, but those internal processes failed to catch the error.
“We tested each of the channels, so each of the different rules that were inside that content file were tested individually,” Meyers said.
The validators ensured that the rules conformed and were compliant with the structure CrowdStrike built for the content update. “It tested as clean or good, and that's why it was allowed to roll out,” Meyers said.
Yet, the content file triggered an issue within the kernel. “It is almost like if you think about a chessboard trying to move a chess piece to someplace where there's no square. That's effectively what happened inside the sensor,” Meyers said.
The configuration update had a mismatch in fields that left one of the fields unlinked to a rule.
“It was not a lack of following the process. This was an issue with the content validator,” Meyers said. “The perfect storm was the content validator allowed the content configuration to go out to the sensor, and the sensor was not able to find the rule that it was looking for, causing the issue.”
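The gap Meyers describes, where each rule passes its own check but the set as a whole does not match what the sensor supplies, can be illustrated with a minimal sketch. This is not CrowdStrike's validator. The `Rule` type, the `field_index` attribute and both function names are hypothetical, chosen only to show how individual checks can pass while a collective check fails.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Rule:
    """Hypothetical rule that reads one input field by position."""
    name: str
    field_index: int


def validate_individually(rule: Rule) -> bool:
    # Checks only that the rule is well-formed on its own terms.
    return bool(rule.name) and rule.field_index >= 0


def validate_collectively(rules: List[Rule], supplied_field_count: int) -> List[str]:
    # Cross-checks every rule against the number of fields the consumer
    # actually supplies, returning the names of rules that point past the end.
    return [r.name for r in rules if r.field_index >= supplied_field_count]


# Assumed example data: the consumer supplies 3 fields (indexes 0 to 2).
rules = [Rule("rule_a", 0), Rule("rule_b", 3)]
individual_ok = all(validate_individually(r) for r in rules)
unreachable = validate_collectively(rules, supplied_field_count=3)
```

In this sketch every rule passes the individual check, yet `rule_b` points at a field that does not exist, which is the chessboard move to a square that isn't there.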
The novelty factor of what went wrong was not lost on Meyers, who said: “This is the first time that this issue has manifested to my knowledge.”
2. CrowdStrike explains why it needs kernel access in Windows systems.
The kernel is the core of the Windows operating system, responsible for interfacing with hardware. It is also critical, Meyers said, for cybersecurity tools to ensure performance, maintain systemwide visibility and provide threat prevention.
“The kernel driver is a key component of every security product that I could think of. Whether they would say that they do most of their work in the kernel or not varies from vendor to vendor, but to try to secure the operating system without kernel access would be very difficult,” Meyers said.
Kernel access also enables anti-tampering protection, which is essential to prevent cybercriminals from disabling security tools, Meyers said.
Visibility into the kernel is critical to ensure threat groups do not gain access to it and then disable or remove security products and features, Meyers said.
“We got it wrong in this case and we are learning from what happened and we’ve implemented changes to ensure that doesn’t happen again.”
Microsoft said it plans to boost security capabilities outside of the kernel, including anti-tampering protection and security sensor requirements.
3. CrowdStrike customers now control content update cadence.
CrowdStrike content updates are now operating under an opt-in model that Meyers described as a “system of concentric rings.” The phased approach gives customers the ability to choose when and how their systems receive content configuration updates.
Testing, the first step in this new process, occurs internally within CrowdStrike. From there customers can choose to be part of the early-adopter program to receive content updates as soon as CrowdStrike makes them available.
General availability occurs after early adoption, and customers can delay updates further at their discretion or choose not to receive them at all. CrowdStrike cannot unilaterally override these content update controls.
Early adoption is appropriate for enterprise testing purposes, “if an organization would like to receive those content updates in a timely manner and make sure that there’s no outcome or unexpected behavior,” Meyers said.
“For mission critical systems, or things that they would prefer to wait longer for, they can choose to do that, but that comes of course with the risk that they’re not getting the most up-to-date threat intelligence information provided to their system,” Meyers said.
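The ring model described above can be sketched as a simple delivery decision. This is an illustration under stated assumptions, not CrowdStrike's implementation: the `Ring` enum, `HostPolicy` fields and `should_deliver` function are hypothetical names, and the hour-based delay is an assumed way of modeling a customer choosing to wait longer.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Optional


class Ring(IntEnum):
    """Hypothetical rollout rings, innermost first."""
    INTERNAL = 0
    EARLY_ADOPTER = 1
    GENERAL_AVAILABILITY = 2


@dataclass
class HostPolicy:
    ring: Optional[Ring]        # None means the customer opted out entirely
    extra_delay_hours: int = 0  # customer-chosen wait after the ring opens


def should_deliver(policy: HostPolicy, reached_ring: Ring, hours_in_host_ring: int) -> bool:
    # The customer's choice is final: an opted-out host never receives the update.
    if policy.ring is None:
        return False
    # The update must have progressed at least as far as the host's ring.
    if reached_ring < policy.ring:
        return False
    # Honor any additional delay the customer configured.
    return hours_in_host_ring >= policy.extra_delay_hours


early = HostPolicy(Ring.EARLY_ADOPTER)
critical = HostPolicy(Ring.GENERAL_AVAILABILITY, extra_delay_hours=48)
opted_out = HostPolicy(None)
```

Here an early adopter receives an update as soon as it leaves internal testing, while the mission-critical host waits 48 hours past general availability, trading fresher threat intelligence for extra caution.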
4. CrowdStrike changed its policy to treat content updates as code.
The faulty content configuration update, which included a basic field input error that caused an out-of-bounds memory read, was not code, but rather threat information distributed to CrowdStrike customers’ sensors.
In the wake of the error, CrowdStrike is no longer making a distinction between software code changes and content configuration updates with respect to testing.
“The configuration now is being treated as code, whereas before it was treated purely as configuration information,” Meyers said. “So we're providing a lot more oversight and visibility into what that is and how it goes out to the system.”
CrowdStrike’s content updates for sensors now go through more rigorous internal testing before they’re distributed to early adopters or to customers who opted in for general availability updates.
“Prior to this, our sensor packages, all of our source code, had those established best practices already in place, and now we are applying this as well to the content updates,” Meyers said.
5. Rapid content configuration updates are here to stay.
CrowdStrike releases content configuration updates for Windows sensors 10 to 12 times a day, on average.
These updates “contain the latest threat intelligence information to instrument our sensor, our tool, to understand what new threats are evolving,” Meyers said.
“The threat landscape changes sometimes minute-by-minute,” Meyers said. “In order to keep ahead of those threats, to allow the CrowdStrike platform to detect and prevent those threats, it needs routine updates.”
Meyers gave no indication CrowdStrike plans to limit how often it updates sensors.
“We will continue to update our product with threat information as frequently as we need to in order to stay ahead of the threats that we’re facing,” Meyers said. “Speed does matter in this domain in order to stay ahead of these threat actors.”