Semantics-Based Obfuscation-Resilient Binary Code Similarity Comparison with Applications to Software and Algorithm Plagiarism Detection

Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. Identifying similar or identical code fragments among programs is very important in some applications. For example, one application is to detect illegal code reuse. In the code...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:IEEE transactions on software engineering Jg. 43; H. 12; S. 1157 - 1177
Hauptverfasser: Luo, Lannan, Ming, Jiang, Wu, Dinghao, Liu, Peng, Zhu, Sencun
Format: Journal Article
Sprache:Englisch
Veröffentlicht: New York IEEE 01.12.2017
IEEE Computer Society
Schlagworte:
ISSN:0098-5589, 1939-3520
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:Existing code similarity comparison methods, whether source or binary code based, are mostly not resilient to obfuscations. Identifying similar or identical code fragments among programs is very important in some applications. For example, one application is to detect illegal code reuse. In the code theft cases, emerging obfuscation techniques have made automated detection increasingly difficult. Another application is to identify cryptographic algorithms which are widely employed by modern malware to circumvent detection, hide network communications, and protect payloads among other purposes. Due to diverse coding styles and high programming flexibility, different implementation of the same algorithm may appear very distinct, causing automatic detection to be very hard, let alone code obfuscations are sometimes applied. In this paper, we propose a binary-oriented, obfuscation-resilient binary code similarity comparison method based on a new concept, longest common subsequence of semantically equivalent basic blocks , which combines rigorous program semantics with longest common subsequence based fuzzy matching. We model the semantics of a basic block by a set of symbolic formulas representing the input-output relations of the block. This way, the semantic equivalence (and similarity) of two blocks can be checked by a theorem prover. We then model the semantic similarity of two paths using the longest common subsequence with basic blocks as elements. This novel combination has resulted in strong resiliency to code obfuscation. We have developed a prototype. The experimental results show that our method can be applied to software plagiarism and algorithm detection, and is effective and practical to analyze real-world software.
Bibliographie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0098-5589
1939-3520
DOI:10.1109/TSE.2017.2655046