Filtering Malicious JavaScript Code with Doc2Vec on an Imbalanced Dataset

Bibliographic details
Published in:	2019 14th Asia Joint Conference on Information Security (AsiaJCIS), pp. 24-31
Main authors:	Mimura, Mamoru; Suga, Yuya
Format:	Conference paper
Language:	English
Published:	IEEE, 1 August 2019
Subjects:	Class Imbalance Problem; Context modeling; DBSCAN; Doc2vec; Feature extraction; JavaScript; K-means; Malware; Natural language processing; Neural networks; Paragraph Vector; Static analysis; SVM; Web pages; Word Vector

Description
Summary:	Drive-by download attacks are one of the main threats on the Internet. Several detection methods build run-time environments that allow JavaScript code to run and track its behavior during execution. Such dynamic analysis requires too much time to examine all the web pages a client accesses. Hence, lightweight filtering methods to detect unseen malicious JavaScript snippets are required. Static analysis often extracts statistical and lexical features from the JavaScript code associated with each web page in order to build detection models. In general, static analysis imposes no runtime overhead. These methods are, however, vulnerable to code obfuscation techniques. Some researchers attempt to detect obfuscated VBA macros with Natural Language Processing (NLP) techniques. In these methods, neural networks extract the features automatically, in contrast with traditional approaches. In addition, since several methods are evaluated with a balanced dataset, their practical performance is still open to discussion. To evaluate practical performance, these methods have to be evaluated with imbalanced datasets. In this paper, we attempt to detect unseen malicious JavaScript snippets with Doc2Vec, an unsupervised algorithm that uses neural networks to generate vectors for documents. To mitigate the class imbalance problem, our method uses a clustering-based undersampling technique. Furthermore, we build a web crawler and generate an imbalanced dataset with over 20,000 samples. The evaluation result shows that our method achieves an F-measure of 0.71.
DOI:	10.1109/AsiaJCIS.2019.000-9
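
Illustration:	The summary mentions a clustering-based undersampling technique but does not spell out its steps. The following is a minimal sketch of one common variant, not the authors' implementation: the majority (benign) vectors are clustered with k-means, and only the sample nearest each centroid is kept. The function names, the two-dimensional toy vectors, and the choice of one cluster per minority sample are all assumptions made for illustration; in the paper the vectors would come from Doc2Vec.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means; returns k centroids (hypothetical helper)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_vector(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids


def undersample_majority(majority, minority_size, seed=0):
    """Keep at most `minority_size` majority samples, one per cluster."""
    k = min(minority_size, len(majority))
    kept = []
    for centroid in kmeans(majority, k, seed=seed):
        nearest = min(majority, key=lambda p: squared_distance(p, centroid))
        if nearest not in kept:  # two centroids may share a nearest sample
            kept.append(nearest)
    return kept


# Toy stand-ins for document vectors (assumed data, not from the paper).
benign = [
    [0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.1, 0.2],
    [5.0, 5.1], [5.1, 5.0], [5.2, 5.1], [5.1, 5.2],
    [9.0, 0.1], [9.1, 0.0], [9.2, 0.1], [9.1, 0.2],
]
malicious = [[3.0, 9.0], [3.1, 9.1], [2.9, 9.2]]

balanced_benign = undersample_majority(benign, len(malicious))
print(len(balanced_benign), "benign samples kept for", len(malicious), "malicious samples")
```

The reduced benign set and the full malicious set could then be passed to a classifier. Picking the real sample nearest each centroid, rather than the centroid itself, keeps the training data made up of actual observed documents.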