Nowadays, even the smallest company can generate huge sets of data. Fortunately, technology has kept pace and, with the dawn of Big Data, we are now able to store and analyze huge sets of digital information (read our previous article on The Cybersecurity Hydra and its Big Data nemesis here). What we must remember here is that, while this may appear to be a “Big Answer”, there is an even Bigger Question at stake.
Big Data is not about exploring and finding new sources of information; rather, it is about collecting and unveiling what is already there, using newly found methods – much like a modern-day archaeologist. The purpose: to extract Small Data in the form of valuable insights based on the interpretation of these very data relics. Now, while all this sounds great in theory, we cannot help but ask ourselves: how do enterprises manage to transfer oodles of data, within and between networks, in a secure manner?
From where we stand, cybersecurity experts are having a tough time monitoring it all and, as such, stealthy attacks easily go unnoticed. What do IT execs do in this case? More often than not, they just hire more personnel. What’s one more person spending his or her time reviewing false positives? We’re not so sure about that approach. As threats become increasingly sophisticated and organizational environments continue to evolve, not to mention the looming cybersecurity talent gap, employing more staff may prove not only costly, but inefficient as well.
May the best robot win… or not
Having moved on from an “if/then” paradigm in the development of modern security solutions, machine learning (ML) provides algorithm-based judgment calls that allow a system to act as the referee in ‘similar to’ situations. It’s much like switching between programming paradigms – from functional to imperative, for instance. A functional approach involves composing the problem as a set of functions to be executed, carefully defining the input to each function (the value returned is therefore entirely dependent on the input). With an imperative approach (also referred to as algorithmic programming), a developer defines a sequence of steps/instructions that are executed in order to accomplish the goal.
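That contrast between the two paradigms can be sketched in a few lines of Python (the task and function names are ours, purely for illustration):

```python
# Task: sum the squares of the even numbers in a sequence.

# Imperative: a sequence of steps mutating an accumulator.
def sum_even_squares_imperative(numbers):
    total = 0
    for n in numbers:
        if n % 2 == 0:       # step 1: keep only evens
            total += n * n   # step 2: square and accumulate
    return total

# Functional: the problem composed as functions; the result
# depends only on the input, with no mutable state.
def sum_even_squares_functional(numbers):
    return sum(map(lambda n: n * n,
                   filter(lambda n: n % 2 == 0, numbers)))

print(sum_even_squares_imperative(range(10)))  # 120
print(sum_even_squares_functional(range(10)))  # 120
```

Both produce the same answer; the difference lies in whether you describe *how* to get there step by step, or *what* the result is as a composition of functions.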
By definition a subset of Artificial Intelligence, machine learning can be supervised, unsupervised or semi-supervised. As the names imply, each ML type involves a certain degree of involvement on the part of the operator and demands a specific set of algorithms. Many voices say that, given how scarce experienced cybersecurity professionals are becoming, the goal should be to replace them altogether with a sort of supreme Artificial Intelligence, omniscient and capable of rooting out all security threats – your typical Man versus Machine dystopian scenario, where the All-Powerful AI wins. Translating this from fiction to fact: the world is waiting for that perfect unsupervised machine learning system, one capable of knowing what we want to know before we even know it. And that’s where we tend to disagree.
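The difference in operator involvement can be illustrated with a deliberately tiny sketch (the session-duration feature, the data and the thresholds are all invented for the example, not a real detection rule):

```python
import statistics

# Hypothetical session durations in seconds; one suspicious outlier.
sessions = [30, 32, 28, 35, 31, 29, 400]

# Unsupervised: no labels from the operator; flag anything far
# from the mean of the data itself (a simple z-score cutoff).
def unsupervised_flags(data, z_cutoff=2.0):
    mean = statistics.mean(data)
    stdev = statistics.pstdev(data)
    return [abs(x - mean) / stdev > z_cutoff for x in data]

# Supervised: the operator supplies labeled examples, and the
# system learns a boundary from them (here, a crude midpoint).
labeled = [(30, "normal"), (35, "normal"),
           (380, "malicious"), (420, "malicious")]

def learn_threshold(examples):
    normals = [x for x, y in examples if y == "normal"]
    bad = [x for x, y in examples if y == "malicious"]
    return (max(normals) + min(bad)) / 2

threshold = learn_threshold(labeled)
print(unsupervised_flags(sessions))          # only the outlier is flagged
print([x > threshold for x in sessions])     # same verdict, via labels
```

Both approaches flag the 400-second session, but the supervised one needed a human to label examples first – exactly the trade-off between operator effort and autonomy the paragraph describes.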
While more and more robots and AIs are becoming better than humans at some jobs (find out which 21 jobs robots already do better than you here), cybersecurity is not your average occupation. Machine learning is awesome (there’s really no other word for it), and companies such as Facebook and Netflix have hit the jackpot with it, but IT security is a different matter. We’re not looking to tag our photos better or to receive more movie suggestions. In cybersecurity, we need to detect unknown threats despite weak signals and to reduce detection time to near real-time – all aspects in which unsupervised machine learning does not excel. Leaving all decisions up to an ML-powered system will inevitably lead to alert fatigue, generating an unmanageable number of potential threats – beyond the analysis capacity of even the best of us. Seeing how the average time to detect a breach can run to months, something needs to change.
Machine Learning: the Jarvis to your Iron Man
If neither the machine nor the man can fight alone against cyber-threats, why not combine forces? The goal should be neither to replace humans with AI nor to leave it all to the AI. If we look for inspiration elsewhere – say, the Marvel universe – the best superheroes are those whose powers have been enhanced by some not-so-realistic gadget. While machine learning is far from perfect, it has the potential to be a true sidekick for the expert analyst – the real-life (realistic) equivalent of JARVIS, Tony Stark’s artificially intelligent computer. JARVIS (Just A Rather Very Intelligent System), just like ML, warns of potential dangers and dismisses them once the call is made by its user, improving its distinction between normal and malicious behaviors over time. Integrated into the Iron Man armor and Stark’s home defenses, it is the perfect metaphor for the human/AI symbiosis we should aspire to.
So where do you start? Well, first, for dramatic effect, put your Iron Man suit on. Then, try pinpointing the issue. Do you just need to detect compromised users? Or do you suspect you have been, or will be, attacked? Either way, a specific use case needs to be developed. From there, the data required to solve the problem needs to be identified. If you’re after advanced persistent threats, look for information regarding the existing security and network infrastructure. Be sure to combine multiple sources (not necessarily more, just diverse) to get a 360° view of your user activity. If your machine learning analytics are multi-dimensional, you should be able to catch malware early in the kill chain, spotting anomalies such as privilege escalation, lateral movement and data exfiltration.
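A minimal sketch of what “multi-dimensional” means in practice: combining several diverse signals into one per-user anomaly score, so no single noisy dimension dominates the view. The feature names, baselines and weights below are illustrative assumptions, not any product’s actual schema:

```python
# Hypothetical per-user baselines gathered from diverse sources:
# auth logs, directory audit events, network egress counters.
baseline = {
    "failed_logins": 2.0,       # per day, from auth logs
    "privilege_changes": 0.1,   # per day, from audit events
    "mb_exfiltrated": 5.0,      # per day, from egress monitoring
}

def anomaly_score(user_events, baseline):
    # Sum how far each dimension sits above its baseline (as a
    # ratio), so only genuinely abnormal dimensions contribute.
    score = 0.0
    for feature, normal in baseline.items():
        observed = user_events.get(feature, 0.0)
        score += max(0.0, observed / normal - 1.0)
    return score

alice = {"failed_logins": 3, "privilege_changes": 0, "mb_exfiltrated": 4}
mallory = {"failed_logins": 14, "privilege_changes": 3, "mb_exfiltrated": 120}

print(round(anomaly_score(alice, baseline), 2))    # small: 0.5
print(round(anomaly_score(mallory, baseline), 2))  # large: 58.0
```

A user who trips several dimensions at once (failed logins, privilege escalation, data leaving the network) scores far higher than one who is slightly off on a single axis – which is precisely why diverse sources beat more of the same.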
Finally, be patient. Because the core task of machine learning is to replicate and predict, it takes time. The system needs to gather enough data and feed it to its behavior analysis engines in order to achieve an accurate classification between normal and abnormal behaviors. Starting with a training set – a sample of good code and one of bad code – ML filters them with the help of statistical algorithms and, through multiple iterations, slowly learns to distinguish between the two. We say “slowly”, but it’s actually incredibly fast compared to past technologies: known threats are identified almost instantly with the help of existing knowledge bases, while unknown threats take a matter of days (one week with Reveelium, read our article here). But remember – there are behaviors we don’t yet know and, as such, cannot teach to the system. Also, while malware can be predicted this way with a high degree of probability, it is still the human in the Iron Man suit who has the final say.
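That training loop – labeled good and bad samples, refined over multiple iterations – can be sketched with a toy perceptron. The traffic features and every number below are invented for illustration; real systems use far richer data and models:

```python
# Each sample: (requests_per_minute, distinct_ports), label 1 = malicious.
training = [
    ((2.0, 1.0), 0), ((3.0, 2.0), 0), ((1.5, 1.0), 0),
    ((40.0, 25.0), 1), ((55.0, 30.0), 1), ((38.0, 22.0), 1),
]

def train_perceptron(data, epochs=20, lr=0.01):
    w = [0.0, 0.0]  # one weight per feature
    b = 0.0
    for _ in range(epochs):               # the "multiple iterations"
        for (x1, x2), label in data:
            pred = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
            err = label - pred            # adjust only on mistakes
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

w, b = train_perceptron(training)

def classify(x1, x2):
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

print(classify(2.5, 1.0))    # quiet traffic -> 0 (normal)
print(classify(50.0, 28.0))  # noisy scanner -> 1 (malicious)
```

Each pass over the training set nudges the decision boundary a little further toward separating the two classes – the “slow” learning the paragraph describes, compressed here into milliseconds only because the toy data is trivially separable.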