We are a long way from the day when all of InfoSec can run autonomously using machine learning (ML). We humans are still the most advanced component in the InfoSec chain, and we will be for some time. We have the experience, analytic skills, and collaboration skills that ML just doesn't have yet. That said, ML is an extremely powerful tool. When used with other technologies and combined with human security analysts, responders, and researchers, machine learning goes a long way toward putting defenders ahead of the problem.
So where does machine learning fit in the grand scheme of things?
First, it's important to note that machine learning is not a monolith any more than math, science, or medicine is. There are many techniques, areas of study, and combinations thereof that fall under the umbrella of machine learning. When machine learning is applied to a problem, there may be a single algorithm in use, or hundreds or even thousands of algorithms working together.
Machine learning isn't appropriate for every use case. It is great for recognizing and predicting patterns in monstrous data sets and for identifying those nuggets that need deeper, human investigation. If the problem you're trying to solve falls into a well-known domain, say looking for multiple failed logins across your environment, that use case has been well covered for some time and applying ML doesn't make sense. But say that you want to analyze that login data for patterns that factor in users' geography, job functions, regular business hours, travel, etc. There, machine learning would certainly provide value. (Read the blog post about a similar case study where we used machine learning to help identify suspicious downloads for one of our customers.)
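To make the login example a little more concrete, here is a minimal sketch of the idea in Python. It is not the method we use in production, and it is a simple statistical baseline rather than a trained ML model: it learns each user's usual login hours and countries from history, then flags logins that fall outside that pattern. The function names (`build_profiles`, `score_login`), the tuple layout, and the threshold of 3.0 are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_profiles(history):
    """Build a per-user baseline from (user, hour_of_day, country) tuples.

    Hypothetical data layout, chosen for this sketch only.
    """
    hours = defaultdict(list)
    countries = defaultdict(set)
    for user, hour, country in history:
        hours[user].append(hour)
        countries[user].add(country)

    profiles = {}
    for user, seen_hours in hours.items():
        profiles[user] = {
            "mean_hour": mean(seen_hours),
            # Floor the spread at 1 hour so very regular users
            # do not trigger on tiny deviations.
            "std_hour": max(pstdev(seen_hours), 1.0),
            "countries": countries[user],
        }
    return profiles


def score_login(profiles, user, hour, country, z_threshold=3.0):
    """Return a list of reasons a login looks unusual (empty if none).

    Simplification: hours are treated as a straight line, so the
    wrap-around at midnight is ignored.
    """
    profile = profiles.get(user)
    if profile is None:
        return ["unknown user"]

    reasons = []
    z = abs(hour - profile["mean_hour"]) / profile["std_hour"]
    if z > z_threshold:
        reasons.append("unusual hour")
    if country not in profile["countries"]:
        reasons.append("new country")
    return reasons


history = [
    ("alice", 9, "US"), ("alice", 10, "US"), ("alice", 9, "US"),
    ("alice", 11, "US"), ("alice", 10, "US"),
]
profiles = build_profiles(history)
night_login = score_login(profiles, "alice", 3, "US")
travel_login = score_login(profiles, "alice", 10, "BR")
normal_login = score_login(profiles, "alice", 10, "US")
```

A real deployment would fold in many more signals (job function, travel, device) and would hand the flagged logins to a human analyst rather than act on them automatically.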
When machine learning is applied, you've got to wonder, how does the machine "know"? It takes the combination of a healthy initial data set ("ground truth") and skilled data scientists to develop and refine the algorithm(s) before ML will provide any valuable analysis. I can't emphasize the importance of this enough. If the ground truth is not varied enough, or if the data scientists are predisposed to a particular conclusion, the result will be false positives, false negatives, and/or biases in the output. In short, the results may cause more confusion than value.
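One way to see how far a model's output drifts from the ground truth is to count its false positives and false negatives against labeled samples. The sketch below assumes boolean labels where True means malicious; the function name `error_rates` and the sample values are made up for illustration and are not real measurements.

```python
def error_rates(labels, predictions):
    """Compare predictions to ground-truth labels (True = malicious).

    Returns (false_positive_rate, false_negative_rate).
    """
    tp = fp = tn = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            tp += 1          # malicious, correctly flagged
        elif actual and not predicted:
            fn += 1          # malicious, missed
        elif not actual and predicted:
            fp += 1          # benign, wrongly flagged
        else:
            tn += 1          # benign, correctly ignored

    # Guard against empty classes to avoid dividing by zero.
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


# Illustrative values only.
ground_truth = [True, True, False, False, False, True]
model_output = [True, False, True, False, False, True]
fpr, fnr = error_rates(ground_truth, model_output)
```

Keep in mind that these rates are only as trustworthy as the labels themselves: if the ground truth is narrow or skewed, a model can score well here and still miss what matters.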
Of course, machine learning plays an important part in our threat research and the development of content that we provide to our customers. Every day we analyze nearly two million malware samples, as well as telemetry from thousands of honeypots, millions of DNS requests, billions of emails, and more. This is our ground truth, and because of its depth and breadth, we're able to detect and block more threats, as well as continue to refine the algorithms we use. Furthermore, ML helps to reduce the noise so that our team of 250+ threat researchers and data scientists can focus on those advanced threats that require expert investigation.
Take, for example, how applying machine learning analysis to encrypted traffic enabled us to detect encrypted botnet and cryptomining activity at different conferences, work that became the basis for our Encrypted Traffic Analysis (ETA).
Machine learning is still in its early stages. And while it is already producing great results, no single technology will ever be a panacea. But when combined with other established technologies and skilled humans, it will help defenders get ahead of the bad guys.
For more on machine learning, check out our recently posted introductory white paper on machine learning, which explores some of the concepts and realities related to ML in information security.