Filtering Malicious JavaScript Code with Doc2Vec on an Imbalanced Dataset
| Published in: | 2019 14th Asia Joint Conference on Information Security (AsiaJCIS), pp. 24-31 |
|---|---|
| Main authors: | , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 1 August 2019 |
| Summary: | Drive-by download attacks are one of the main threats on the Internet. Several detection methods build run-time environments that execute JavaScript code and track its behavior. Such dynamic analysis requires too much time to examine every web page a client accesses. Hence, lightweight filtering methods that detect unseen malicious JavaScript snippets are required. Static analysis often extracts statistical and lexical features from the JavaScript code associated with each web page in order to build detection models. In general, static analysis imposes no runtime overhead. These methods are, however, vulnerable to code obfuscation techniques. Some researchers attempt to detect obfuscated VBA macros with Natural Language Processing (NLP) techniques. In these methods, neural networks extract the features automatically, in contrast with traditional approaches. In addition, since several methods are evaluated with a balanced dataset, their practical performance is still open to discussion. To evaluate practical performance, these methods have to be evaluated with imbalanced datasets. In this paper, we attempt to detect unseen malicious JavaScript snippets with Doc2Vec, an unsupervised algorithm that uses neural networks to generate vectors for documents. To mitigate the class imbalance problem, our method uses a clustering-based undersampling technique. Furthermore, we build a web crawler and generate an imbalanced dataset with over 20,000 samples. The evaluation result shows that our method achieves an F-measure of 0.71. |
|---|---|
| DOI: | 10.1109/AsiaJCIS.2019.000-9 |
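
The summary mentions two ideas that a short sketch may make more concrete: why results are reported as an F-measure on an imbalanced dataset, and what clustering-based undersampling can look like. The Python below is an illustration only, not the authors' implementation. The record does not say which undersampling variant the paper uses, so the sketch assumes one common form: cluster the majority class with k-means and keep the sample nearest each centroid. The names `f_measure`, `kmeans`, and `cluster_undersample`, and the toy 2-D points standing in for document vectors, are all hypothetical.

```python
import math
import random


def f_measure(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of floats; returns k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid; an empty cluster keeps its old centroid.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def cluster_undersample(majority, n_keep, seed=0):
    """Assumed variant: keep the majority sample nearest each centroid.

    Returns at most n_keep samples (two centroids can share a nearest point).
    """
    centroids = kmeans(majority, n_keep, seed=seed)
    kept = {min(majority, key=lambda p: math.dist(p, c)) for c in centroids}
    return sorted(kept)


# Toy imbalanced labels: 95 benign (0) and 5 malicious (1).
y_true = [0] * 95 + [1] * 5
always_benign = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, always_benign)) / len(y_true)
f_score = f_measure(y_true, always_benign)  # no malicious sample is found

# Toy 2-D stand-ins for the benign (majority) document vectors.
rng = random.Random(42)
benign_vectors = [(rng.random(), rng.random()) for _ in range(95)]
reduced_benign = cluster_undersample(benign_vectors, n_keep=5)
```

On the toy labels, a classifier that marks everything as benign scores high accuracy but an F-measure of zero, which is why evaluation on a balanced dataset can overstate practical performance. In a real pipeline the 2-D points would be replaced by document vectors, and a library implementation of k-means would likely be preferable to this hand-written one.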