The following is a guest article from Dr. Sven Krasser, chief scientist at CrowdStrike.
Without a doubt, machine learning is one of the hottest topics in cybersecurity at the moment, and most vendors tout their newest machine learning additions as the panacea that liberates you from all security woes.
Machine learning allows security products to do vastly better in various areas. However, it is best understood as a set of techniques that dramatically improve existing detection methods. It does not sidestep inherent limitations, such as those of pre-execution malware detection.
Effective solutions work at many layers over a whole range of timeframes — from almost instant to detect commodity threats to very long durations to detect entrenched adversaries — and go beyond using machine learning as their only tool.
The hype aside, if properly used and managed, machine learning can make a significant positive impact on your security posture. Here are three critical questions you can ask while exploring new solutions:
Why should I care about machine learning?
The term "machine learning" refers to a broad set of algorithms that can solve a specific set of problems. These algorithms are well suited to many problems the cybersecurity industry is facing, but by no means can every problem in cybersecurity be solved using machine learning.
The main promise of machine learning is the ability to consistently detect new and unknown threats in the absence of traditional indicators of compromise. While integrating machine learning into a product is trivial, effectively training the algorithms to detect never-before-seen threats is not.
As a concrete example, several purely machine learning-based anti-malware engines are now integrated into VirusTotal, the largest online community of antivirus products and online scanning engines.
In contrast to signature-based engines, these next-generation engines can detect new and unknown zero-day malware even when they have not been updated for several months. To reach that level of performance, they dissect files into a large number of abstract features that describe properties of those files; these features are then fed into a machine learning classifier.
This allows these engines to detect malware based on its quality of being malicious as opposed to having to know how to recognize a specific family of malware.
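To make the idea of "features fed into a classifier" concrete, here is a deliberately tiny sketch in Python. It is not how any engine on VirusTotal works; the three features, the nearest-centroid classifier, the toy training samples, and the assumption that high-entropy content looks suspicious are all illustrative choices made for this example. Real engines use far more features and far more data.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def extract_features(data: bytes) -> List[float]:
    """Turn raw file bytes into a small, fixed-length feature vector."""
    if not data:
        return [0.0, 0.0, 0.0]
    n = len(data)
    counts = Counter(data)
    # Shannon entropy in bits per byte, scaled to the range 0..1.
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) / 8.0
    # Fraction of bytes that are printable ASCII.
    printable = sum(1 for b in data if 32 <= b < 127) / n
    # Fraction of the 256 possible byte values that occur at all.
    distinct = len(counts) / 256.0
    return [entropy, printable, distinct]

def train(samples: List[Tuple[bytes, str]]) -> Dict[str, List[float]]:
    """Compute one centroid (mean feature vector) per label."""
    by_label: Dict[str, List[List[float]]] = {}
    for data, label in samples:
        by_label.setdefault(label, []).append(extract_features(data))
    return {
        label: [sum(col) / len(col) for col in zip(*vectors)]
        for label, vectors in by_label.items()
    }

def classify(model: Dict[str, List[float]], data: bytes) -> str:
    """Return the label whose centroid is closest to the file's features."""
    feats = extract_features(data)
    return min(model, key=lambda label: math.dist(feats, model[label]))

# Toy, hand-made training set (assumption: packed-looking, high-entropy
# blobs are labeled "suspicious"; plain text is labeled "benign").
TOY_SAMPLES = [
    (b"hello world " * 50, "benign"),
    (b"config=value\n" * 40, "benign"),
    (bytes((i * 197 + 13) % 256 for i in range(1024)), "suspicious"),
    (bytes((i * 59 + 101) % 256 for i in range(1024)), "suspicious"),
]

model = train(TOY_SAMPLES)

# Neither of these inputs appears in the training set.
unseen_blob = bytes((i * 89 + 7) % 256 for i in range(512))
unseen_text = b"the quick brown fox " * 20
print(classify(model, unseen_blob))
print(classify(model, unseen_text))
```

The point of the sketch is that the classifier never sees a signature for either unseen input. It judges them only by properties they share with the training examples, which is also why the quality and breadth of the training data matter so much.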
Should I be cautious about machine learning?
Machine learning can result in detection capabilities that are less reactive, less dependent on constant updates, and more effective in detecting unknown threats. However, a machine learning sticker on the box of a new security product does not mean buyers can stop worrying about novel threats.
Let’s look again at the anti-malware engine example. First of all, to detect unknown threats, machine learning needs to be deployed such that it can evaluate the overall maliciousness of a file. Several vendors augment their signature-based approaches with machine learning-based heuristics, which target specific malware families.
Such systems can perform well on new variants of a known family, but they will fail when presented with an unknown malware family — case in point: the WannaCry ransomware.
Even a perfectly performing anti-malware engine has its limitations. An attacker has months' worth of time to craft malware, while an anti-malware engine generally needs to come to a decision in a sub-second timeframe. Without observing the execution, there will always be malware files that manage to sneak by undetected. That is a fact that machine learning does not change.
It is important not just to zero in on whether machine learning is used but to understand how it is deployed in a larger context. To overcome the limitation of judging a file without observing its execution, protection needs to continue once a file starts executing on a host. To spot entrenched adversaries, observing and analyzing data at the network level over even longer timeframes is critical.
How can I recognize solutions that effectively implement machine learning?
First and foremost, machine learning works well when large amounts of data are available, e.g., in the cloud. Case in point: Netflix (think cloud) gives better movie recommendations than a local Blockbuster clerk (think appliance).
Not just the sheer scale, but also the richness of the data matters. The more facets the data covers, the faster a cohesive broad picture emerges. Solutions working on small datasets with few facets can perform reasonably well on variations of known threats but are unlikely to generalize to new threats.
Next, effective solutions provide coverage in many areas, ranging from pre-execution file analysis through host execution behavior to macroscopic behaviors at the network level over long durations.
All hype aside, not every problem can be solved with machine learning. One can build a perfectly viable smoke detector without sticking machine learning on its circuit board.
In a similar fashion, there are only a limited number of ways an adversary can steal a password hash from a Windows system — something that can be effectively detected generically with an indicator of attack.
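As a rough illustration of what such a generic, non-machine-learning check could look like, consider the Python sketch below. The `Event` schema, the action names, the rule table, and the `TRUSTED_PROCESSES` allowlist are all hypothetical and invented for this example. The only facts assumed about Windows are that the LSASS process holds credential material in memory and that the SAM registry hive stores password hashes. A real indicator of attack would cover more techniques and draw on much richer telemetry.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    """Hypothetical host telemetry record (schema invented for this sketch)."""
    process: str  # image name of the acting process
    action: str   # e.g. "open_process_memory" or "read_registry"
    target: str   # object being acted on

# Illustrative, non-exhaustive list of behaviors that touch credential stores.
CREDENTIAL_ACCESS_BEHAVIORS = {
    ("open_process_memory", "lsass.exe"),
    ("read_registry", r"hklm\sam"),
}

# Hypothetical allowlist of processes expected to perform such access.
TRUSTED_PROCESSES = {"lsass.exe"}

def is_credential_theft_indicator(event: Event) -> bool:
    """Flag credential-store access by any process not on the allowlist."""
    behavior = (event.action, event.target.lower())
    if behavior not in CREDENTIAL_ACCESS_BEHAVIORS:
        return False
    return event.process.lower() not in TRUSTED_PROCESSES

events = [
    Event("unknown_tool.exe", "open_process_memory", "lsass.exe"),
    Event("unknown_tool.exe", "read_registry", r"HKLM\SAM"),
    Event("notepad.exe", "read_file", r"C:\notes.txt"),
]
flagged = [e for e in events if is_credential_theft_indicator(e)]
print(len(flagged), "of", len(events), "events flagged")
```

Note that the rule keys on what the process does, not on which binary does it. That is why it needs no training data and no knowledge of any particular tool.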