Machine Learning in Cybersecurity
Arthur Samuel, a pioneer in artificial intelligence, described machine learning as a set of methods and technologies that “gives computers the ability to learn without being explicitly programmed.” In the particular case of supervised learning for anti-malware, the task could be formulated as follows: given a set of object features X and corresponding object labels Y as input, create a model that will produce the correct labels Y’ for previously unseen test objects X’. X could be features representing file content or behavior (file statistics, the list of used API functions, etc.), and the labels Y could simply be “malware” or “benign” (in more complex cases, we could be interested in a fine-grained classification such as Virus, Trojan-Downloader, Adware, etc.). In the case of unsupervised learning, we are more interested in revealing the hidden structure of the data, e.g., finding groups of similar objects or highly correlated features.
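As a minimal, hypothetical sketch of this supervised formulation (not Kaspersky Lab’s actual pipeline), the snippet below trains a classifier on synthetic feature vectors X with labels Y and predicts labels Y’ for unseen objects X’; the three features and the data are invented for illustration.

```python
# Toy sketch of the supervised anti-malware formulation above.
# Features and data are synthetic; a real system would extract
# file statistics, API-call usage, etc.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# X: object features (e.g., file size, section entropy, suspicious-API count)
X = rng.normal(size=(1000, 3))
# Y: labels; here "malware" is loosely tied to the third synthetic feature
y = (X[:, 2] + 0.5 * rng.normal(size=1000) > 0).astype(int)  # 1 = malware, 0 = benign

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Y': predicted labels for previously unseen objects X'
y_pred = model.predict(X_test)
print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
```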
Kaspersky Lab’s multi-layered, next-generation protection uses machine learning methods extensively at all stages of the detection pipeline, from scalable clustering methods used to preprocess the incoming file stream in our infrastructure to robust and compact deep neural network models for behavioral detection that run directly on users’ machines. These technologies are designed to address several important requirements for machine learning models in real-world information security applications: an extremely low false positive rate, interpretability of the model, and robustness to a potential adversary.
Let’s consider some of the most important machine learning-based technologies used in Kaspersky Lab endpoint products:
Decision tree ensemble
In this approach, the predictive model takes the form of an ensemble of decision trees (e.g., random forest or gradient boosted trees). Every non-leaf node of a tree contains a question about the features of a file, while the leaf nodes contain the tree’s final decision on the object. During the test phase, the model traverses the tree by answering the questions in the nodes with the corresponding features of the object under consideration. At the final stage, the decisions of multiple trees are averaged in an algorithm-specific way to produce the final verdict on the object.
This model benefits the Pre-Execution Proactive protection stage on the endpoint side. One application of this technology is Cloud ML for Android, used for mobile threat detection.
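To make the traversal and averaging concrete, here is a minimal hand-rolled sketch with invented tree structures and thresholds; production ensembles are learned from data and are far larger.

```python
# Minimal sketch of decision-tree ensemble inference as described above;
# the trees, features, and thresholds here are invented for illustration.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    # Non-leaf: "is feature[index] > threshold?"; leaf: a malware score.
    feature: Optional[int] = None
    threshold: float = 0.0
    left: Optional["Node"] = None   # answer "no"
    right: Optional["Node"] = None  # answer "yes"
    score: float = 0.0              # used only at leaves

def traverse(node: Node, features: list[float]) -> float:
    """Answer the question at each node until a leaf decision is reached."""
    while node.left is not None:
        node = node.right if features[node.feature] > node.threshold else node.left
    return node.score

# Two toy trees over features [file_size, num_suspicious_apis]
tree1 = Node(feature=1, threshold=5.0, left=Node(score=0.1), right=Node(score=0.9))
tree2 = Node(feature=0, threshold=1e6, left=Node(score=0.2), right=Node(score=0.7))
forest = [tree1, tree2]

sample = [2e6, 7.0]
# Averaging of the trees' decisions is algorithm-specific; a plain mean here
verdict = sum(traverse(t, sample) for t in forest) / len(forest)
print(f"malware score: {verdict:.2f}")
```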
Similarity hashing (Locality sensitive hashing)
The hashes used to create malware “footprints” in earlier times were sensitive to every small change in a file. This drawback was exploited by malware writers through obfuscation techniques like server-side polymorphism: minor changes to the malware took it off the radar. A similarity hash (or locality-sensitive hash) is a method for detecting similar malicious files. To do this, the system extracts file features and uses orthogonal projection learning to choose the most important ones. ML-based compression is then applied so that the value vectors of similar features are transformed into similar or identical patterns. This method provides good generalization and noticeably reduces the size of the detection record database, since one record can now detect a whole family of polymorphic malware.
This model benefits the Pre-Execution Proactive protection stage on the endpoint side. It is applied in our Similarity Hash Detection System.
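The sketch below illustrates the general locality-sensitive hashing idea with random projections (a SimHash-style scheme). It uses random rather than learned projections, so it only approximates the orthogonal projection learning described above.

```python
# Sketch of locality-sensitive hashing via random projections.
# Nearby feature vectors land on the same side of most hyperplanes,
# so they produce identical or near-identical hashes.
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_BITS = 64, 16
planes = rng.normal(size=(N_BITS, N_FEATURES))  # one hyperplane per hash bit

def similarity_hash(features: np.ndarray) -> str:
    """Each bit records which side of a hyperplane the vector falls on."""
    bits = (planes @ features) > 0
    return "".join("1" if b else "0" for b in bits)

original = rng.normal(size=N_FEATURES)
variant = original + 0.01 * rng.normal(size=N_FEATURES)  # "polymorphic" tweak
unrelated = rng.normal(size=N_FEATURES)

print(similarity_hash(original))
print(similarity_hash(variant))    # usually identical or one or two bits off
print(similarity_hash(unrelated))  # usually very different
```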
Behavioral model
A monitoring component provides a behavior log: the sequence of system events that occurred during process execution, together with the corresponding arguments. In order to detect malicious activity in the observed log data, our model compresses the obtained sequence of events into a set of binary vectors and trains a deep neural network to distinguish clean logs from malicious ones.
The object classification made by the Behavioral model is used by both static and dynamic detection modules in Kaspersky products on the endpoint side.
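As a toy illustration of this pipeline, the sketch below encodes event logs as multi-hot binary vectors and trains a small neural network on them; the event vocabulary, the logs, and the tiny network are invented stand-ins for the real monitoring data and model.

```python
# Toy sketch of the behavioral pipeline: compress an event log into a
# fixed-size binary (multi-hot) vector, then train a small neural network.
import numpy as np
from sklearn.neural_network import MLPClassifier

EVENTS = ["CreateFile", "WriteFile", "RegSetValue", "CreateRemoteThread", "Connect"]
EVENT_INDEX = {name: i for i, name in enumerate(EVENTS)}

def log_to_vector(log: list[str]) -> np.ndarray:
    """Multi-hot encoding: bit i is set if event i appears anywhere in the log."""
    vec = np.zeros(len(EVENTS), dtype=np.float32)
    for event in log:
        vec[EVENT_INDEX[event]] = 1.0
    return vec

logs = [
    (["CreateFile", "WriteFile"], 0),                       # clean
    (["CreateFile", "Connect"], 0),                         # clean
    (["CreateRemoteThread", "RegSetValue", "Connect"], 1),  # malicious
    (["WriteFile", "CreateRemoteThread", "Connect"], 1),    # malicious
]
X = np.stack([log_to_vector(log) for log, _ in logs])
y = np.array([label for _, label in logs])

net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0).fit(X, y)
print(net.predict([log_to_vector(["CreateRemoteThread", "Connect"])]))  # likely [1]
```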
Machine learning plays an equally important role when it comes to building a proper in-lab malware processing infrastructure. Kaspersky Lab uses it for the following infrastructure purposes:
Incoming stream clustering
ML-based clustering algorithms allow us to efficiently separate the large volumes of unknown files arriving at our infrastructure into a reasonable number of clusters, some of which can be processed automatically based on the presence of an already annotated object inside them.
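Here is a minimal sketch of the idea, using synthetic “file features” and k-means as a stand-in for the production clustering algorithm: any cluster that contains an already annotated object can be labeled automatically.

```python
# Sketch: cluster an unknown file stream, then auto-label any cluster
# that contains an already-annotated object. Data is synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic feature stream: three blobs standing in for malware families
stream = np.concatenate([rng.normal(loc=c, size=(100, 8)) for c in (-5, 0, 5)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(stream)

# Suppose analysts have already annotated a handful of the incoming objects
annotated = {0: "benign", 150: "Trojan-Downloader"}  # object index -> verdict

for index, verdict in annotated.items():
    members = np.flatnonzero(labels == labels[index])
    print(f"cluster of object {index}: {len(members)} files auto-labeled {verdict!r}")
```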
Large-scale classification models
Some of the most powerful classification models (such as a huge random decision forest) require a large amount of resources (processor time, memory) along with expensive feature extractors (e.g., processing via a sandbox may be required to obtain detailed behavior logs). It is therefore more effective to keep and run such models in the lab, and then distill the knowledge they gain by training a lightweight classification model on the output decisions of the bigger model.
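The sketch below illustrates this distillation step with invented data and stand-in models: a heavyweight “teacher” runs in the lab, and a lightweight “student” is trained on its output decisions rather than on the original labels.

```python
# Sketch of distillation: a heavyweight in-lab model labels a stream of
# data, and a lightweight model fit to those decisions is shipped instead.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Heavyweight "teacher": fine to run in the lab, too costly for the endpoint
teacher = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# Lightweight "student" is trained on the teacher's decisions over new data
X_stream = rng.normal(size=(10000, 16))
student = LogisticRegression().fit(X_stream, teacher.predict(X_stream))

# Check how faithfully the student reproduces the teacher on fresh samples
X_eval = rng.normal(size=(2000, 16))
agreement = (student.predict(X_eval) == teacher.predict(X_eval)).mean()
print(f"student agrees with teacher on {agreement:.0%} of fresh samples")
```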
Security of Machine Learning
ML algorithms, once released from the confines of the lab and introduced into the real world, can be vulnerable to many forms of attack designed to force ML systems into making deliberate errors. An attacker can poison a training dataset or reverse-engineer the model’s code. In addition, hackers can ‘brute-force’ ML models with specially developed ‘adversarial AI’ that automatically generates many attack samples until a weak point of the model is discovered. The impact of such attacks on ML-based anti-malware systems could be devastating: a misidentified Trojan means millions of devices infected and millions of dollars lost.
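As a toy illustration of such brute-force probing (not a real attack tool), the sketch below randomly perturbs the features of a detected sample until a simple model flips its verdict; a real attacker would also have to preserve the file’s malicious functionality.

```python
# Toy illustration of the 'brute-force' attack idea: randomly perturb a
# detected sample's features until a trained model flips its verdict.
# The model and data are synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = (X.sum(axis=1) > 0).astype(int)  # 1 = "malware"
model = LogisticRegression().fit(X, y)

malicious = X[model.predict(X) == 1]  # samples the model currently flags
sample = malicious[0]

for attempt in range(10_000):
    candidate = sample + 0.3 * rng.normal(size=sample.shape)
    if model.predict([candidate])[0] == 0:  # verdict flipped to "benign"
        print(f"evasion found after {attempt + 1} attempts")
        break
```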
For this reason, some key considerations should be applied to ML use in security systems:
- The security vendor should understand and carefully address the essential requirements for ML performance in the real, potentially hostile, world; these requirements include robustness to potential adversaries. ML/AI-specific security audits and ‘red-teaming’ should be a key component of ML/AI development.
- In assessing the security of an ML solution, questions should be asked about how much the solution depends on third-party data and architectures, since so many attacks are based on third-party input (threat intelligence feeds, public datasets, and pre-trained or outsourced ML models).
- ML methods should not be viewed as ‘the ultimate answer’. They need to be part of a multi-layered security approach, where complementary protection technologies and human expertise work together, watching each other’s backs.
For a more detailed overview of popular attacks on ML algorithms and the methods of protection from these threats, see our whitepaper "AI under Attack: How to Secure Machine Learning in Security System".