The worldwide outage in July caused by a relatively routine threat intelligence update sent by CrowdStrike to Windows devices propelled the mundane topic of endpoint patching into the mainstream media. Suddenly thousands of companies find themselves weighing the tradeoffs of risk versus reward when performing rapid software updates and asking themselves the tough question: “Are we all patching wrong?”
Even before the recent headline-grabbing outage, patching was a constant daily headache for many IT operations teams that struggled to balance the need to patch urgently with the requirement to patch safely. Security teams have a mandate to reduce risk, and with 90% of cybersecurity attacks starting at the endpoint, unpatched devices remain one of the greatest risks to an organization. However, deploying a bad patch can have dire consequences for a business, as we saw when many banks, hospitals, airlines, and other global organizations suffered disruptions from a single software update this month.
Adaptiva, an autonomous endpoint management and patching company, supports hundreds of today’s largest enterprises as they work to strike that delicate balance: accelerating patching and software updates to reduce risk while maintaining employee and customer experiences. In the Q&A below, Adaptiva CEO and founder Dr. Deepak Kumar shares insights on how organizations can navigate the risk versus reward of faster patching and prevent service disruptions and widespread outages in the future.
We learned that a single bad patch can lead to global and widespread outages and IT interruptions. What happened with this patch that was different from the thousands of patches that came before it?
This patch triggered a Windows kernel crash, or more dramatically, a BSOD – Blue Screen of Death. Of course, when that happens, you can no longer use the computer. Unfortunately, it takes some skill to resolve this type of error. In particular, sensitive corporate computers that are protected with Microsoft’s BitLocker disk encryption software can be quite hard to recover. When this happens at scale, it quickly overwhelms a corporation’s IT resources and chaos follows, which is exactly what happened in many cases, leading to prolonged outages.
Is a BSOD unique to CrowdStrike, or can other patches cause this type of error?
Any kernel driver can cause a blue screen. Windows ships with tens of thousands of kernel drivers, and many times more than that have been written by other companies. Any of these kernel drivers could cause a blue screen, if it had a bug.
Why do companies write kernel drivers?
Kernel drivers have some unique advantages. Once they are loaded, they become part of the operating system’s inner sanctum, the kernel, and execute at Ring 0, the most privileged level of the CPU. In the case of CrowdStrike, their sensor is tasked with observing the actions of all processes executing on the machine, and in many circumstances, with interfering with processes that are attempting malicious activity. This can only be done using a kernel driver. All modern anti-virus software has to include one or more kernel drivers to deliver even basic functionality.
I think what happened with CrowdStrike was a freak accident. It took a lot of statistically improbable things coinciding for something like this to happen. It could have happened with any company’s product; it just happened to be them. That said, there are certainly many lessons to be learned here for all of us who are in this line of work.
The impacts of the outage from this bad patch have been far-reaching. Is the lesson here for how organizations should do their patching?
At its core, this incident is in fact a patching problem, and not a kernel driver problem.
Bad patches get released all the time. Microsoft’s KB5040427, released on 9th July, is repeatedly blue-screening and rebooting machines, though it hasn’t hit the news cycle as hard.
The knee-jerk reaction most people have is to slow down patching, hoping that it will delay a bad patch. In reality, a much more sophisticated and nuanced approach is needed.
Twenty years ago, when I was designing Microsoft’s first enterprise patching product, it was a simpler world. Windows, Office, and a few fragile Adobe products were all you needed to patch. The Adaptiva OneSite Patch product I designed last year already has more than 1600 products in its catalog.
Every patch released for each of these products presents a different tradeoff between risk and reward, for each organization, and even in different parts of the same organization.
Clearly, this task has vastly outgrown human capabilities, and AI-based autonomous patching systems are the only reasonable way forward. I have come to the fundamental belief that humans should define strategy and process, and software should do the rest.
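As a rough illustration of humans defining strategy while software applies it, the sketch below stores a patching policy as plain data and lets code look up how long to defer a patch for a given part of the organization. The class name `PatchPolicy`, the method `deferral_days`, the severity labels, the group names, and the day counts are all hypothetical assumptions made for this example; they do not describe Adaptiva's product or any real patching API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatchPolicy:
    """Human-authored strategy: how long each device group waits per severity.

    Hypothetical structure for illustration only.
    """
    rules: dict            # (severity, device_group) -> days to defer
    default_days: int = 7  # assumed fallback when no rule matches

    def deferral_days(self, severity: str, group: str) -> int:
        # Software applies the strategy; it does not decide it.
        return self.rules.get((severity, group), self.default_days)


# Illustrative values: the same patch carries a different risk/reward
# tradeoff in different parts of the same organization.
policy = PatchPolicy(rules={
    ("critical", "workstations"): 0,
    ("critical", "payment-servers"): 2,
    ("low", "workstations"): 7,
    ("low", "payment-servers"): 30,
})

workstation_wait = policy.deferral_days("critical", "workstations")
server_wait = policy.deferral_days("critical", "payment-servers")
```

Keeping the policy as data means the people who understand the business can change the numbers without touching the code that carries them out.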
In your view, what is the right long-term response to prevent these situations?
CrowdStrike has announced some measures, including increased testing, and staggered deployment of content in future. I think we can take this a step further.
Some cloud-first companies tend to club together all customer devices into a single mammoth entity, and try to manage them as a monolith. Instead, they could let customers manage their own endpoints. Ceding some control to customers would go a long way toward fragmenting the problem into smaller pieces. It would also place the management decisions where the knowledge of the environment lives, and reduce the likelihood that an outage will affect critical systems first.
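One way to picture staggered, customer-controlled deployment is a ring-based rollout that halts as soon as an early ring shows too many failures, so the most critical systems are updated last or not at all. The sketch below is a minimal illustration under stated assumptions: the function `staged_rollout`, the ring names, the device names, and the 1% failure budget are hypothetical and are not taken from CrowdStrike's or Adaptiva's tooling.

```python
from typing import Callable, Iterable


def staged_rollout(
    rings: Iterable[tuple[str, list[str]]],
    deploy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed failure budget per ring
) -> list[str]:
    """Deploy ring by ring and stop once a ring exceeds the failure budget.

    `deploy` returns True when a device is healthy after the update.
    Returns the names of the rings that completed within the budget.
    """
    completed = []
    for name, devices in rings:
        if not devices:
            continue  # nothing to deploy in an empty ring
        failures = sum(1 for device in devices if not deploy(device))
        if failures / len(devices) > max_failure_rate:
            break  # halt: later, more critical rings are never touched
        completed.append(name)
    return completed


# The customer, who knows the environment, orders the rings so that
# critical systems come last. All names here are made up.
rings = [
    ("canary", ["lab-01", "lab-02"]),
    ("general", ["pc-001", "pc-002", "pc-003"]),
    ("critical", ["er-terminal-1", "gate-kiosk-7"]),
]
```

With this ordering, a patch that crashes every device it touches stops at the canary ring, and the general and critical rings never receive it.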
Is there a fundamental problem in placing all our infrastructure bets on a small number of very successful companies? Will this reliance on a handful of companies come back to haunt us?
CrowdStrike makes one of the best cybersecurity products on the market. Their protection rates are extremely high, and that is why people want to use their product. The same facts apply to a lot of these larger successful companies.
In order to reduce our dependence on these successful infrastructure companies, a lot of people would either have to be willing to use the second best, third best, or fourth best product, or someone would have to coerce them to do so.
The alternative is that we let the free markets decide the winners, and over the long term, companies with the most reliable products will eventually win.