Bibliographic Details
| Title: |
Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification. |
| Authors: |
Beebe, Nicole L.; Maddox, Laurence A.; Liu, Lishu; Sun, Minghe |
| Source: |
IEEE Transactions on Information Forensics & Security; Sep. 2013, Vol. 8, Issue 9, pp. 1519-1530 (12 pp.) |
| Abstract: |
Over 20 studies have been published in the past decade involving file and data type classification for digital forensics and information security applications. Methods using n-grams as inputs have proven the most successful across a wide variety of types; however, there are mixed results regarding the utility of unigrams and bigrams as inputs independently. In this study, we use support vector machines (SVMs) consisting of unigrams and bigrams, as well as complexity and other byte frequency-based measures, as inputs. Using concatenated unigrams and bigrams as input and a linear kernel SVM, we achieve significantly improved results over those previously reported (73.4% classification rate across 38 file and data types). We are the first to use concatenated n-grams as the sole input, and we show their superiority over inputs used previously. We also found that too many different types of features as inputs result in overfitting and poor generalization properties. We include several types seldom or not studied in the past (Microsoft Office 2010 files, file system data, base64, base85, URL encoding, flash video, M4A, MP4, WMV, and JSON records). The “winning” approach is instantiated in an open source software tool called Sceadan—Systematic Classification Engine for Advanced Data ANalysis. [ABSTRACT FROM PUBLISHER] |
| Database: |
Complementary Index |
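
Illustrative note: the abstract describes concatenated unigram and bigram vectors as the sole classifier input. The sketch below shows one way such a vector could be built from a block of bytes. It is not taken from the paper or from the Sceadan tool: the function name `concat_ngram_vector` is hypothetical, and normalizing counts to relative frequencies is an assumption, since the abstract does not state how features are scaled. The resulting vectors would then be passed to a linear-kernel SVM, which is omitted here.

```python
from collections import Counter


def concat_ngram_vector(data: bytes) -> list[float]:
    """Return 256 unigram frequencies followed by 65,536 bigram frequencies.

    Hypothetical helper; relative-frequency scaling is an assumption.
    """
    unigrams = [0.0] * 256
    bigrams = [0.0] * (256 * 256)
    n = len(data)
    if n == 0:
        return unigrams + bigrams

    # Unigram features: share of each byte value in the block.
    for byte, count in Counter(data).items():
        unigrams[byte] = count / n

    # Bigram features: share of each ordered pair of adjacent bytes.
    pairs = n - 1
    if pairs > 0:
        for (first, second), count in Counter(zip(data, data[1:])).items():
            bigrams[first * 256 + second] = count / pairs

    # Concatenation: one feature vector of length 256 + 65,536.
    return unigrams + bigrams


vec = concat_ngram_vector(b"abab")
print(len(vec))
```

Each input block yields a fixed-length vector regardless of block size, which is what allows fragments of different file and data types to be compared by a single classifier.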