Filtering Malicious JavaScript Code with Doc2Vec on an Imbalanced Dataset

Bibliographic Details
Published in: 2019 14th Asia Joint Conference on Information Security (AsiaJCIS), pp. 24-31
Main authors: Mimura, Mamoru; Suga, Yuya
Format: Conference proceeding
Language: English
Published: IEEE, 1 August 2019
Description
Summary: Drive-by download attacks are one of the main threats on the Internet. Several detection methods build run-time environments that execute JavaScript code and track its behavior as it runs. Such dynamic analysis, however, requires too much time to examine every web page a client accesses. Hence, lightweight filtering methods that detect unseen malicious JavaScript snippets are required. Static analysis often extracts statistical and lexical features from the JavaScript code associated with each web page in order to build detection models. In general, static analysis imposes no runtime overhead. These methods are, however, vulnerable to code obfuscation techniques. Some researchers attempt to detect obfuscated VBA macros with Natural Language Processing (NLP) techniques. In these methods, neural networks extract the features automatically, in contrast to traditional approaches. In addition, since several methods have been evaluated with balanced datasets, their practical performance is still open to discussion. To assess practical performance, these methods have to be evaluated with imbalanced datasets. In this paper, we attempt to detect unseen malicious JavaScript snippets with Doc2Vec, an unsupervised algorithm that uses neural networks to generate vectors for documents. To mitigate the class imbalance problem, our method uses a clustering-based undersampling technique. Furthermore, we build a web crawler and generate an imbalanced dataset with over 20,000 samples. The evaluation results show that our method achieves an F-measure of 0.71.
DOI: 10.1109/AsiaJCIS.2019.000-9
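
Illustration: the summary mentions a clustering-based undersampling technique but does not describe it further. The following is a minimal sketch of one common variant, not the authors' implementation. It assumes the benign (majority) and malicious (minority) samples have already been turned into numeric vectors, for example by Doc2Vec. The majority vectors are clustered with a small k-means, where the number of clusters equals the minority class size, and only the majority sample nearest to each centroid is kept. All function names and parameters here are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids (assumes 1 <= k <= len(vectors))."""
    rng = random.Random(seed)
    # Initialise the centroids with k distinct samples.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda i: squared_distance(v, centroids[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_vector(members) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def cluster_undersample(majority, minority, seed=0):
    """Reduce the majority class to at most len(minority) representatives.

    Assumption for illustration: each cluster is represented by the real
    majority sample closest to its centroid.
    """
    k = min(len(majority), len(minority))
    if k == 0:
        return []
    kept = []
    for centroid in kmeans(majority, k, seed=seed):
        nearest = min(majority, key=lambda v: squared_distance(v, centroid))
        if nearest not in kept:  # two centroids may share a nearest sample
            kept.append(nearest)
    return kept
```

The reduced majority set and the full minority set would then be passed to a classifier for training. Keeping real samples rather than the centroids themselves is a design choice made for this sketch; the evaluation data should stay imbalanced so that the reported F-measure reflects practical conditions.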