MALCODE: Malicious Code Detection using Emulation

Final Report Summary - MALCODE (MALCODE: Malicious Code Detection using Emulation)

Despite considerable advances in system protection and mitigation technologies, the exploitation of memory corruption vulnerabilities persists as one of the most common methods for system compromise and malware infection. In contrast to system compromise methods such as computer viruses or social engineering, which rely on luring unsuspecting users to download and execute malicious files or to reveal usernames and passwords, these attacks exploit some software flaw in a program running on the victim's computer that allows an attacker to execute malicious code and unconditionally get full access to the system.

The aim of the MALCODE project is to design, develop, and evaluate new malicious code detection algorithms based on code emulation. Working at the lowest level, that of the actual instructions that get executed, dynamic analysis using emulation unveils the actual malicious code without being affected by evasion techniques like encryption, polymorphism, or code obfuscation. Focusing on the behavior and not the structure of the code, we aim to identify common functionality and actions that are inherent to different types of malicious code, and use them for the development of new malicious code detection heuristics.

The main outcomes of the project are two novel methods for the detection of network-level code injection attacks and malicious PDF documents, and two novel techniques for the detection and prevention of attacks based on Return-Oriented Programming. The project has also resulted in novel contributions in the fields of network-level traffic monitoring and analysis, the use of graphics processors for speeding up computationally intensive network traffic processing operations, exploiting graphics processors for designing stealthier malware, protecting online privacy, and studying the Android mobile application ecosystem.

We have designed a comprehensive and extensible shellcode detection technique that uses a set of runtime heuristics to identify the presence of shellcode in arbitrary data streams. Each heuristic matches inherent low-level operations that are always exhibited during the execution of a given type of shellcode, irrespective of its specific implementation. We have identified fundamental machine-level operations that are inescapably performed by different shellcode types, and based on them we have designed heuristics that enable the detection of different classes of shellcode, including plain, metamorphic, and memory scanning shellcode, regardless of the use of self-decryption. Collectively, these heuristics significantly increase detection coverage compared to existing emulation-based detectors. We have implemented our technique in Gene, a code injection attack detection system based on passive network monitoring. Our experimental evaluation and real-world deployment show that Gene can effectively detect a large and diverse set of shellcode samples that are currently missed by existing emulation-based detectors, while extensive testing with large data sets of real and generated benign inputs did not produce any false positives.
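As a rough illustration of how a set of runtime heuristics can be evaluated over an emulated execution, the following minimal sketch checks a recorded trace against several independent predicates and reports which ones fire. Everything in it is an assumption made for illustration: the `Event` trace format, the two toy heuristics, and the `HYPOTHETICAL_REGION` address range are invented here and are not claimed to be Gene's actual heuristics or data structures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence


@dataclass(frozen=True)
class Event:
    """One observation from an emulated execution (hypothetical trace format)."""
    kind: str      # "mem_read", "mem_write", or "exec"
    address: int


Trace = Sequence[Event]

# Arbitrary address range standing in for "memory that only shellcode touches".
HYPOTHETICAL_REGION = range(0x1000, 0x2000)


def writes_then_executes(trace: Trace) -> bool:
    """Toy heuristic: execution reaches an address that was written earlier."""
    written = set()
    for ev in trace:
        if ev.kind == "mem_write":
            written.add(ev.address)
        elif ev.kind == "exec" and ev.address in written:
            return True
    return False


def reads_sensitive_region(trace: Trace) -> bool:
    """Toy heuristic: any read that falls inside the hypothetical region."""
    return any(ev.kind == "mem_read" and ev.address in HYPOTHETICAL_REGION
               for ev in trace)


# The heuristic set is extensible: adding a class of shellcode means adding
# one more predicate here, without touching the others.
HEURISTICS: Dict[str, Callable[[Trace], bool]] = {
    "writes_then_executes": writes_then_executes,
    "reads_sensitive_region": reads_sensitive_region,
}


def detect(trace: Trace) -> List[str]:
    """Return the names of all heuristics that match the trace."""
    return [name for name, matches in HEURISTICS.items() if matches(trace)]
```

An input would be flagged when `detect` returns a non-empty list for any execution attempt started from it. Keeping each heuristic as a separate predicate mirrors the idea that each one targets the behavior of one shellcode type rather than the structure of its code.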

The widespread adoption of the PDF format for document exchange has given rise to the use of PDF files as a prime vector for malware propagation. As vulnerabilities in the major PDF viewers keep surfacing, effective detection of malicious PDF documents remains an important issue. MDScan, our second detection system, developed during the first year of the project, is a standalone malicious document scanner that combines static document analysis and dynamic code execution to detect previously unknown PDF threats. Our evaluation shows that MDScan can detect a broad range of malicious PDF documents, even when they have been extensively obfuscated.

The prevalence of code injection attacks has led to the wide adoption of exploit mitigations based on non-executable memory pages. In turn, attackers are increasingly relying on return-oriented programming (ROP) to achieve arbitrary code execution without injecting any code, and thus bypass these protections. At the same time, existing detection techniques based on shellcode identification are oblivious to this new breed of exploits, since attack vectors may no longer contain binary code.

In the second half of the project, we focused on the detection and prevention of ROP attacks. Specifically, we designed a novel detection method for the identification of ROP payloads in arbitrary data such as network traffic or process memory buffers. Our technique speculatively drives the execution of code that already exists in the address space of a targeted process according to the scanned input data, and identifies the execution of valid ROP code at runtime. Our experimental evaluation demonstrates that our prototype implementation can detect a broad range of ROP exploits against Windows applications without false positives, and that it can be easily integrated into existing defenses based on shellcode detection.
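The idea of letting the scanned data drive the execution of code that already exists in the target process can be sketched with a heavily simplified model. In the sketch below, the address space is reduced to a hypothetical table mapping gadget addresses to the number of stack words each gadget pops before its final `ret`; the table, the function names, and the chain-length threshold are all assumptions made for illustration, not the prototype's actual design, which emulates real instructions.

```python
import struct
from typing import Dict


def longest_gadget_chain(data: bytes, gadgets: Dict[int, int]) -> int:
    """Length of the longest run of gadgets that `data` would chain together.

    `gadgets` maps a gadget's address to how many 32-bit stack words it pops
    before returning (a toy stand-in for emulating the gadget's code).
    Only 4-byte-aligned offsets are tried, to keep the sketch short.
    """
    usable = len(data) - len(data) % 4
    words = [w for (w,) in struct.iter_unpack("<I", data[:usable])]
    best = 0
    for start in range(len(words)):
        sp, chain = start, 0
        # Treat words[sp] as a return address; follow it only if it lands
        # on known gadget code, as a real ROP payload would.
        while sp < len(words) and words[sp] in gadgets:
            chain += 1
            sp += 1 + gadgets[words[sp]]  # skip the address and popped operands
        best = max(best, chain)
    return best


def looks_like_rop(data: bytes, gadgets: Dict[int, int], threshold: int = 3) -> bool:
    """Flag input whose words chain at least `threshold` gadgets (assumed cutoff)."""
    return longest_gadget_chain(data, gadgets) >= threshold
```

For example, with `gadgets = {0x1000: 1, 0x2000: 0, 0x3000: 2}`, the seven little-endian words `0x1000, 0xAAAA, 0x2000, 0x3000, 1, 2, 0x2000` chain four gadgets and are flagged, while ordinary text such as `b"hello world!"` chains none. Benign data rarely contains several consecutive words that each land on a valid gadget and hand control to the next, which is the intuition behind requiring a minimum chain length.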

On the front of exploit mitigation and protection techniques, existing defenses against ROP exploits either require source code or symbolic debugging information, or impose a significant runtime overhead, which limits their applicability for the protection of third-party applications. In the second year of the project, we introduced in-place code randomization, a practical mitigation technique against ROP attacks that can be applied directly to third-party software. Our method uses various narrow-scope code transformations that can be applied statically, without changing the location of basic blocks, allowing the safe randomization of stripped binaries even with partial disassembly coverage. These transformations effectively eliminate about 10%, and probabilistically break about 80%, of the useful instruction sequences found in a large set of PE files. Since no additional code is inserted, in-place code randomization does not incur any measurable runtime overhead, enabling it to be easily used in tandem with existing exploit mitigations such as address space layout randomization. Our evaluation using publicly available ROP exploits and two ROP code generation toolkits demonstrates that our technique prevents the exploitation of the tested vulnerable Windows 7 applications, including Adobe Reader, as well as the automated construction of alternative ROP payloads that aim to circumvent in-place code randomization using only the remaining unaffected instruction sequences.
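To make the notion of a narrow-scope, size-preserving transformation concrete, the sketch below randomly swaps adjacent independent instructions inside a single basic block. Reordering within a block is used here only as one plausible example of such a transformation; the `Insn` representation, the hand-written read/write sets, and the function names are assumptions for illustration, and a real tool would derive dependencies (including flags and memory) from disassembly.

```python
import random
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Insn:
    """Toy instruction: text, encoded size in bytes, and resources touched."""
    text: str
    size: int
    reads: FrozenSet[str]
    writes: FrozenSet[str]


def insn(text: str, size: int, reads: Iterable[str] = (),
         writes: Iterable[str] = ()) -> Insn:
    return Insn(text, size, frozenset(reads), frozenset(writes))


def independent(a: Insn, b: Insn) -> bool:
    """True if swapping a and b cannot change the result (no data hazards)."""
    return not (a.writes & b.reads or a.reads & b.writes or a.writes & b.writes)


def randomize_block(block: List[Insn], rng: random.Random) -> List[Insn]:
    """Randomly swap adjacent independent instructions within one basic block.

    The last instruction (the block's control transfer) is never moved, and
    no instruction is added or removed, so the block keeps its size and its
    start address.
    """
    out = list(block)
    for i in range(len(out) - 2):
        if independent(out[i], out[i + 1]) and rng.random() < 0.5:
            out[i], out[i + 1] = out[i + 1], out[i]
    return out


BLOCK = [
    insn("mov eax, [ebp+8]", 3, reads={"ebp", "mem"}, writes={"eax"}),
    insn("mov ecx, 5", 5, writes={"ecx"}),
    insn("add eax, ecx", 2, reads={"eax", "ecx"}, writes={"eax", "flags"}),
    insn("ret", 1, reads={"esp", "mem"}, writes={"esp"}),
]
```

Calling `randomize_block(BLOCK, random.Random(seed))` with different seeds yields either the original order or one with the two `mov` instructions swapped; `add` and `ret` stay in place because they depend on what precedes them. The block's total size is unchanged, but the byte offsets at which instructions start inside it differ between variants, so an instruction sequence an attacker expected at a fixed address may no longer be there.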