Machine learning (ML) in eight blogposts?! In truth, we’ve only just scratched the
surface when it comes to the potential of ML in cybersecurity. However, that
said, readers should now be better able to separate fact from fiction, and
marketing from actual function. So, last but not least, let’s take a peek under
the hood of ESET’s cybersecurity engine and its ML gears.
Our experts have been playing with machine learning
for more than 20 years – with neural networks making their first appearance in
our products in 1997. Since then there have been numerous internal projects
aimed at automating security analysis, helping us categorize the virtual world
into the good, the bad and the ugly (or even grey areas containing potentially
unwanted applications or PUAs, if you will).
One of our early efforts was an automated expert
system, designed for mass processing. In 2006, it was quite simple and helped
us process part of the growing number of samples and cutting the immense
workload of our detection engineers. Over the years, we have perfected its
abilities and made it a crucial part of the technology responsible for the
initial sorting and classification of the hundreds of thousands of items we
receive every day from sources such as our worldwide network ESET LiveGrid®,
security feeds and our ongoing exchange with other security vendors.
Another ML project has been running under ESET’s
hood since 2012, placing all the analyzed items on “the cybersecurity map” and
flagging those that require more attention. Interestingly, it was exactly this
system that did a great job in the recent WannaCryptor case, alerting us in the earliest phases about
the soon to be wildly spreading ransomware file. Despite already having a
network detection for the EternalBlue exploit, this system helped ESET make
additional detections that further improved the protection of our users.
However, machine learning is a tricky beast and not
all our efforts have gone according to plan. Older projects focused on
automating the creation of broader DNA detections from previously known
detections, determining URL reputations, or finding the “nearest neighbors” of
samples. Eventually, these were outperformed by other, more effective means, or
replaced thanks to further development.
However, all of this helped us gain experience,
and, piece-by-piece, has paved the way for what we have today – a mature,
real-world application of machine learning technology in the cloud, as well as
on client’s endpoints.
Meet Augur, our ML
beast
At ESET we love ancient history – the company is
named after an Egyptian goddess after all – so that was naturally the place we
looked for when it came to naming our machine learning engine. In Ancient Rome,
augur was a term used for religious officials who observed natural signs and
interpreted these as an indication of divine approval or disapproval of a
proposed action. The analogy with cybersecurity is not hard to draw, but in
contrast with the alchemy-natured augurs back then, our Augur bases its
decisions on science, mathematics and previous experience.
Now for the technical part. ESET’s Augur ML engine
couldn’t have materialized without three main factors:
1.
With the
arrival of big data and cheaper hardware, machine learning was made more
affordable – be it for medical purposes, autonomous cars or detections in
cybersecurity.
2.
Growing
popularity of ML algorithms and the science behind it led to their broader
technical application and availability to anyone who was willing to implement
them.
3.
After three
decades of fighting black-hats and their “products”, we have built a latter-day
“Library of Alexandria” equivalent – of malware. This vast and highly organized
database contains millions of extracted features and DNA genes of everything
we’ve analyzed in the past. A great foundation to create a carefully chosen mix
for Augur’s training set.
However, the boom of the above named factors have
also brought challenges. We have had to pick the best performing algorithms and
approaches, as not all machine learning is applicable to the highly specific
security universe.
After much testing, we have settled on combining
two methodologies that have proven effective so far:
1.
Neural
networks, specifically deep learning and long short-term memory.
2.
Consolidated
output of six precisely chosen classification algorithms.
Not clear enough? Imagine you have a suspicious
executable file. Augur will first emulate its behavior and run a basic DNA
analysis. Then it will use the gathered information to extract numeric features
from the file, look at which processes it wants to run and look at the DNA
mosaic in order to decide which category it fits best – clean, potentially
unwanted or malicious. At this point, it is important to state that unlike some
vendors who claim they do not need unpacking, behavioral analyzing or
emulation, we find this crucial to properly extract data for machine learning.
Otherwise – when data is compressed or encrypted – it’ just an attempt to
classify noise.
The group of classification algorithms has two
possible setups:
The more aggressive one will label a sample as
malicious if most of the six algorithms vote it as such. This is useful mainly
for IT staff using ESET Enterprise inspector, as it can flag anything
suspicious and leave the final evaluation of the outputs to a competent admin.
The milder or more conservative approach, declares
a sample clean, if at least one of the six algorithms comes to such conclusion.
This is useful for general purpose systems with a less expert overview.
Just to top it all off, we came across a
presentation by Facebook describing their machine learning solution and it very
much resembles Augur’s architecture – aiming to combine the best of the
classification algorithms and neural networks.
Okay, so let’s move away from theory and look at
the real world results of ESET’s machine learning approach as applied to the
recent malware attacks misusing the EternalBlue exploit and pushing both the
WannaCryptor ransomware and CoinMiner malware families. Apart from our network
detection and effective flagging by our other ML system, the Augur model also
immediately identified samples of both families as malicious.
What’s more interesting, we also ran this test with
a month old Augur model that couldn’t have encountered these malware families
anywhere before. This means that the detections were based solely on the
information learned from the training set. And guess what? They were both
correctly labeled as malicious.
30 years of progress and innovation in IT security
has taught us, that somethings do not have an easy solution, especially in
cyberspace, where change comes rapidly and the playing field can shift in a
matter of minutes. Machine learning, even when wrapped up in shiny marketing
speak, won’t change that anytime soon. Therefore, we believe that even the best
ML cannot replace skilled and experienced researchers, those who built its
foundations and those same researchers who will further innovate it. We’re
proud to say that many of these talented individuals work at ESET, helping
protect users from future threats.
The whole series: