Applying Machine Learning To Advance Cyber Security Analytics – Part 1


Technology moves swiftly, and nowhere is that more true than in machine learning. One need only look at the technology one uses each day to find a myriad of machine learning applications at its core. Take online shopping, for example. Almost every large online storefront will recommend items you may want to purchase. These recommendations draw on data points such as your previous shopping history, your recent searches, or even who your friends are. A machine learning system must weigh this vast amount of data in order to come up with a simple answer: What item might you like to purchase?

Shopping, of course, is not the only industry to leverage recent advances in machine learning. The list of companies and industries using it grows by the day, as does the range of applications. Common applications of machine learning in today’s technology include voice recognition, fraud detection, email spam filtering, text processing, search recommendations, and video analysis. These technologies are also being improved daily, fuelled by greater data analytics, reductions in the cost of computation, and advances in the state of the art of machine learning research.

With so many recent technologies embracing machine learning approaches, one may ask what exactly machine learning is, and how it is applied in these situations. In a broad sense, machine learning refers to a family of techniques in which one “trains” a machine to solve a problem. As a simple example, say I want to train a machine to determine whether a photo shows an apple or an orange. To train the machine, I provide it with 100 photos of apples and 100 photos of oranges, each labeled with the fruit it shows. Once the machine is trained, I can give it a new picture and it can tell me whether the photo shows an apple or an orange.
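The apples-and-oranges example can be sketched in a few lines of Python. This is only an illustration under loud assumptions: each "photo" is replaced by two made-up numbers (a redness score and an orangeness score), the data is randomly generated, and the nearest-centroid method shown is just one simple technique chosen for brevity. The names `fake_photo`, `train`, and `predict` are hypothetical and do not come from any real system.

```python
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible


def fake_photo(label):
    """Stand-in for a real photo: hypothetical (redness, orangeness) scores."""
    if label == "apple":
        return (random.gauss(0.8, 0.1), random.gauss(0.3, 0.1))
    return (random.gauss(0.4, 0.1), random.gauss(0.8, 0.1))


def train(examples):
    """Learn one average feature vector (centroid) per label."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc)
            for label, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid lies closest to the given features."""
    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# 100 labeled "photos" of each fruit, as in the example above.
training_set = [(fake_photo("apple"), "apple") for _ in range(100)]
training_set += [(fake_photo("orange"), "orange") for _ in range(100)]

model = train(training_set)
print(predict(model, (0.85, 0.25)))  # a very red, not very orange "photo"
```

A real image classifier would work from pixel data and a far more capable model, but the shape is the same: labeled examples go in, a trained model comes out, and the model is then asked about photos it has not seen before.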

However, not all machine learning solutions are created equal. One measure of the effectiveness of a machine learning model is its accuracy on future predictions. For example, I ask the apples-and-oranges model to tell me whether a photo shows an apple or an orange. Let’s say I provide it with 10 photos of apples, and of those 10 it says 8 are apples and 2 are oranges. We can then say the model is 80% accurate on this test. While this is reasonably accurate, the model can often be improved. One way to improve a machine learning system is to provide more data, essentially giving it broader experience to draw on. For example, instead of 100 photos, one might provide 1,000 or 1,000,000 photos to train the machine. Very often, this increase in volume provides large improvements in the accuracy of such models.
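The accuracy figure above is simply the number of correct predictions divided by the number of predictions made. A minimal sketch, where the `accuracy` helper and the two lists are hypothetical and written only to mirror the 10-photo example:

```python
def accuracy(predictions, truths):
    """Fraction of predictions that match the true labels."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must be the same length")
    correct = sum(p == t for p, t in zip(predictions, truths))
    return correct / len(truths)


# All 10 test photos are apples; the model calls 8 of them apples
# and mistakes 2 for oranges.
truths = ["apple"] * 10
predictions = ["apple"] * 8 + ["orange"] * 2

print(accuracy(predictions, truths))  # 0.8, i.e. 80% accurate
```

Note that the test photos should be ones the model did not see during training; otherwise the score says little about how it will do on future predictions.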

The incredible rate of growth of data over the last few years has led to the coinage of a new term, “big data”. As you can imagine, this means lots of data, enough that special consideration must be given to how it is stored, transported, managed, and analyzed. Big data has been one of the pillars of the rapid growth and improvement in machine learning in recent years. Another major force behind the machine learning movement has been the availability of cheap and plentiful computation.

Advances in cloud computing have been essential in providing massive computing power in a cost-effective manner, helping to solve computation-intensive problems. A famous example of computing at this scale is the “SETI@home” project, where volunteers donate their unused CPU cycles to help analyze radio telescope data. The ability to commit thousands upon thousands of machines to solving a single problem lends itself well to machine learning, particularly when dealing with extremely large data sets. As a practical example, we run one computation at Cylance that takes 1,000 machines approximately 30 days to complete. Even a few years ago, solving such a problem was not practical.

The ability to collect and handle big data and the ability to perform previously impossible calculations are both significant achievements. Combined, they are helping to fuel an explosion of growth in machine learning.

When I look at the cyber security industry, I see two trends that lead me to conclude that machine learning approaches are a good fit for it. One, the collection and storage of large amounts of useful data is already well underway in cyber security. I would have difficulty finding a security analyst who is not currently overwhelmed by the vast amount of raw data collected every day in mature environments. There even exists a plethora of tools designed to help sort, slice, and mine this data in a somewhat automated fashion to help analysts in their day-to-day activities.