Cybercriminals are constantly looking for new strategies to defeat security solutions and improve the success of their attacks.

The growing adoption of polymorphism and packing has made traditional signature-based detection at the client side (endpoint) obsolete. Backend systems struggle to analyze modern malware, since both static and dynamic analysis are limited when heavily obfuscated code or anti-sandboxing techniques are employed. In addition, the number of newly discovered threats is increasing, and faster detection systems are required to protect users around the world.

To meet this need, we recently developed a system that overcomes the limitations of current static and dynamic techniques and is able to detect new threats in real time. We use a combination of machine learning and graph-based reasoning to classify software downloads in under a second.

Figure 1. Overview of detection method

Each protected endpoint runs a download identification agent (DIA) capable of identifying new software downloads. The agent transmits contextual information on the download to our classification system (which we call a malware download detection system, or MDD) and temporarily quarantines the file, preventing file access until a decision is made regarding its nature. Contextual information, such as the download client and the endpoint configuration, is transmitted, but the downloaded file itself is not.
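To make the "context, not content" idea concrete, here is a minimal Python sketch of what such a download report could look like. It is not the actual DIA message format: the `DownloadEvent` class, its field names, and the use of a SHA-256 digest as a file identifier are all assumptions made for illustration.

```python
from dataclasses import dataclass, asdict
import hashlib
import json


@dataclass(frozen=True)
class DownloadEvent:
    """Hypothetical report: contextual metadata only, never the file bytes."""
    file_id: str      # assumed identifier: a digest of the downloaded file
    source_url: str   # where the download came from
    client: str       # download client, e.g. a browser name
    endpoint_id: str  # assumed opaque identifier of the reporting endpoint
    endpoint_os: str  # one piece of endpoint configuration


def build_event(file_bytes, source_url, client, endpoint_id, endpoint_os):
    # Only a fixed-length digest is kept; the content itself is not stored.
    file_id = hashlib.sha256(file_bytes).hexdigest()
    return DownloadEvent(file_id, source_url, client, endpoint_id, endpoint_os)


def to_wire(event):
    # Serialize the metadata for transmission to the classification system.
    return json.dumps(asdict(event), sort_keys=True)


if __name__ == "__main__":
    event = build_event(b"example file content", "http://downloads.example/setup.exe",
                        "browser", "endpoint-001", "windows")
    print(to_wire(event))
```

The report has a fixed, small size regardless of how large the downloaded file is, which is one reason a metadata-only design can respond quickly.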

Figure 2. Example of annotated download graph

Our approach is content-agnostic. The system makes a decision without needing to inspect the downloaded file or access the actual URL. This is why the system is capable of processing large numbers of new threats in a very short amount of time. This also extends to threats discovered on unconventional devices like smartphones and IoT devices, as our system is OS-independent and works on any type of endpoint.

In simple terms, the system works by keeping a representation of the download events in a tripartite graph. It uses a graph-based probabilistic model to compute the probability that a new node (i.e., a downloaded file that has not been seen before) is benign or malicious, as shown in Figure 2.
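As a rough illustration of the data structure, the Python sketch below stores download events in a small tripartite graph. The three node types used here (URLs, files, and endpoints) and the `DownloadGraph` class are assumptions made for the example; the probabilistic model itself is not shown.

```python
from collections import defaultdict


class DownloadGraph:
    """Toy tripartite graph: URL nodes, file nodes, and endpoint nodes.

    Edges only ever connect a file to a URL or a file to an endpoint,
    so nodes of the same type are never directly linked.
    """

    def __init__(self):
        # Maps a (kind, name) node to the set of its adjacent nodes.
        self.neighbors = defaultdict(set)

    def add_download(self, url, file_id, endpoint_id):
        file_node = ("file", file_id)
        for other in (("url", url), ("endpoint", endpoint_id)):
            self.neighbors[file_node].add(other)
            self.neighbors[other].add(file_node)

    def neighbors_of(self, kind, name):
        # Sorted list so the result is deterministic.
        return sorted(self.neighbors.get((kind, name), ()))


if __name__ == "__main__":
    graph = DownloadGraph()
    graph.add_download("http://a.example/x.exe", "file-1", "endpoint-1")
    graph.add_download("http://a.example/x.exe", "file-2", "endpoint-2")
    print(graph.neighbors_of("url", "http://a.example/x.exe"))
```

A file that has never been seen before still arrives connected to a URL and an endpoint, and those neighbors may already carry history. That connection is what lets a graph-based model say something about the new node.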

Information on new downloads is collected from endpoints and included in the graph as new nodes. A probability score is computed from adjacent nodes and propagated for classification, as detailed in Figure 3.

Figure 3. Propagation of reputation scores

How does this translate to a classification system? For example, software downloads originating from endpoints known to have run malware in the past are considered low reputation. In the same way, URLs that have served downloads known to be malicious are considered low reputation. A system based on these reputation scores is more likely to block files associated with them.

What, then, is considered high reputation? Downloads originating from trusted domains, or from endpoints that have operated for years without infection alerts (i.e., fully patched and well-configured devices), are considered high reputation. These probabilities are then propagated iteratively to adjacent nodes so that new threats can be detected.
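The iterative propagation can be sketched in a few lines of Python. This example uses plain neighbor averaging as a stand-in for the probabilistic model, whose details are not given here. The `propagate` function, the node names, the neutral starting score of 0.5, and the convention that a score is the estimated probability of being malicious are all assumptions made for illustration.

```python
def propagate(edges, known, iterations=20):
    """Spread scores from nodes with a known reputation to unknown ones.

    edges: iterable of (node_a, node_b) pairs (undirected).
    known: dict mapping a node to a fixed score in [0, 1], where a higher
           score means "more likely malicious" (an assumed convention).
    """
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Unknown nodes start at a neutral 0.5.
    scores = {node: known.get(node, 0.5) for node in neighbors}

    for _ in range(iterations):
        updated = dict(scores)
        for node, adjacent in neighbors.items():
            if node in known:
                continue  # nodes with a known reputation stay fixed
            # New score is the mean of the adjacent nodes' previous scores.
            updated[node] = sum(scores[m] for m in adjacent) / len(adjacent)
        scores = updated
    return scores


if __name__ == "__main__":
    edges = [("url:bad.example", "file:new"), ("file:new", "endpoint:e1")]
    known = {"url:bad.example": 0.9, "endpoint:e1": 0.8}
    print(propagate(edges, known)["file:new"])
```

In this toy run, the unseen file is adjacent to a low-reputation URL and an endpoint with a history of infections, so its score is pulled toward theirs. A deployed system would then compare that score against a threshold to decide whether to release or block the quarantined file.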

Conclusions

Modern threats and malware authors are putting the security industry under pressure, requiring novel detection techniques to spot threats that would otherwise remain undetected for years. Our research team at Trend Micro is always looking for solutions and improvements to help face this challenge.

Our work represents a step in this direction, as it employs techniques that do not require either the collection or the inspection of the software downloads or the download URL. This is in contrast with traditional techniques that need to analyze these files or URLs individually.

In addition, the concept of a definitive yes/no signature is replaced with a probabilistic one: detection decisions are made by probabilistic models created by “artificially intelligent” algorithms that make autonomous decisions based on a global knowledge base sourced from millions of endpoints.

Our solution is also OS-independent, allowing it to protect both traditional endpoints (like personal PCs) and internet of things devices such as mobile phones, smart TVs, and HVAC systems, among others.

This work was discussed at the 11th ACM on Asia Conference on Computer and Communications Security and is currently filed as U.S. patent application number 14/988,430. You can read more about our other research in the field of machine learning as well.