Sparse Coding for N-Gram Feature Extraction and Training for File Fragment Classification

Bibliographic details
Published in:	IEEE Transactions on Information Forensics and Security, vol. 13, no. 10, pp. 2553-2562
Main authors:	Wang, Felix; Quach, Tu-Thach; Wheeler, Jason; Aimone, James B.; James, Conrad D.
Format:	Journal Article
Language:	English
Published:	United States: IEEE, 1 October 2018
Subjects:	automated feature extraction; Data mining; Dictionaries; dictionary learning; Encoding; Feature extraction; file carving; File fragment classification; Machine learning; MATHEMATICS AND COMPUTING; n-gram; sparse coding; support vector machine; Support vector machines; Training; unsupervised learning
ISSN:	1556-6013, 1556-6021

Description
Abstract:	File fragment classification is an important step in the task of file carving in digital forensics. In file carving, files must be reconstructed based on their content as a result of their fragmented storage on disk or in memory. Existing methods for classification of file fragments typically use hand-engineered features, such as byte histograms or entropy measures. In this paper, we propose an approach using sparse coding that enables automated feature extraction. Sparse coding, or sparse dictionary learning, is an unsupervised learning algorithm, and is capable of extracting features based simply on how well those features can be used to reconstruct the original data. With respect to file fragments, we learn sparse dictionaries for n-grams, continuous sequences of bytes, of different sizes. These dictionaries may then be used to estimate n-gram frequencies for a given file fragment, but for significantly larger n-gram sizes than are typically found in existing methods, which suffer from combinatorial explosion. To demonstrate the capability of our sparse coding approach, we used the resulting features to train standard classifiers, such as support vector machines, over multiple file types. Experimentally, we achieved significantly better classification results with respect to existing methods, especially when the features were used as a supplement to existing hand-engineered features.
Bibliography:	SAND-2018-3201J; AC04-94AL85000; USDOE National Nuclear Security Administration (NNSA)
DOI:	10.1109/TIFS.2018.2823697
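
Illustrative sketch (not from the paper): the abstract describes learning a dictionary over n-grams and using it to estimate n-gram frequencies for a fragment. Exact counting is impractical at larger sizes, since for n = 8 there are 256^8 = 2^64 possible byte n-grams. The Python below is a minimal sketch of the general idea under strong simplifying assumptions: each n-gram is coded with a single atom (sparsity 1), and the dictionary is learned with a spherical k-means-style update rather than whatever solver the authors used. All function names, the synthetic fragments, and the parameter values (n = 8, 16 atoms) are hypothetical.

```python
import math
import random

def ngrams(fragment: bytes, n: int):
    """All contiguous n-byte windows of a fragment, scaled to floats in [0, 1]."""
    return [[b / 255.0 for b in fragment[i:i + n]]
            for i in range(len(fragment) - n + 1)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length; an all-zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else v

def best_atom(x, dictionary):
    """Sparsity-1 code: index of the atom most correlated with x."""
    return max(range(len(dictionary)),
               key=lambda k: abs(dot(x, dictionary[k])))

def learn_dictionary(samples, n_atoms, n_iter=10, seed=0):
    """Assumed stand-in for dictionary learning: alternate between assigning
    each sample to its best atom and re-estimating each atom from its samples."""
    rng = random.Random(seed)
    atoms = [normalize(list(s)) for s in rng.sample(samples, n_atoms)]
    dim = len(samples[0])
    for _ in range(n_iter):
        sums = [[0.0] * dim for _ in range(n_atoms)]
        for x in samples:
            k = best_atom(x, atoms)
            c = dot(x, atoms[k])  # coefficient of x on its chosen atom
            for j, xj in enumerate(x):
                sums[k][j] += c * xj
        # Atoms that received no usable samples keep their previous value.
        atoms = [normalize(s) if any(s) else a for s, a in zip(sums, atoms)]
    return atoms

def atom_histogram(fragment, dictionary, n):
    """Feature vector: fraction of the fragment's n-grams coded by each atom."""
    grams = ngrams(fragment, n)
    counts = [0.0] * len(dictionary)
    for x in grams:
        counts[best_atom(x, dictionary)] += 1.0
    return [c / len(grams) for c in counts] if grams else counts

# Two synthetic 512-byte "fragments" (hypothetical stand-ins for real data).
text_like = bytes(random.Random(1).choices(b"abcdefgh ", k=512))
random_like = bytes(random.Random(2).randrange(256) for _ in range(512))

n = 8
samples = ngrams(text_like, n) + ngrams(random_like, n)
D = learn_dictionary(samples, n_atoms=16)
features = atom_histogram(text_like, D, n)
```

In this sketch the histogram has one entry per dictionary atom regardless of n, which is what keeps the feature size manageable. Such a vector could then be passed to a standard classifier such as a support vector machine, optionally alongside hand-engineered features like byte histograms or entropy.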