A Content-based File Identification Dataset (machine learning-based dataset)

Bibliographic details
Title: A Content-based File Identification Dataset (machine learning-based dataset)
Authors: Khudhur, Saja; Jeiad, Hassan
Publisher information: Open Science Framework, 2022.
Publication year: 2022
Subjects: file type identification, FTI, digital forensic, file fragments classification
Description: A content-based dataset comprising 12 features for eight common file types (JPG, PNG, HTML, TXT, MP4, M4A, MOV, and MP3), intended for file type identification (FTI). The features were extracted from a pool of file fragments, each 512 bytes in size, drawn from all eight types. The dataset is designed for use with both supervised and unsupervised ML models. It supports classification and clustering of these file types at two levels: a fine-grained level (by exact file type: JPG, PNG, HTML, TXT, MP4, M4A, MOV, or MP3) and a coarse-grained level (by broad type: image, text, audio, or video).
Document type: Other literature type
DOI: 10.17605/osf.io/8bk3r
Accession number: edsair.doi...........89891ff5bf07389e133502f72027a86e
Database: OpenAIRE
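
The two labeling levels and the 512-byte fragment size in the description can be illustrated with a minimal Python sketch. It is not the authors' extraction code: the function names, the decision to drop a trailing fragment shorter than 512 bytes, and the grouping of the eight file types into the four broad types are all assumptions made for illustration. The record does not name the 12 features, so none are computed here.

```python
from typing import Dict, List

FRAGMENT_SIZE = 512  # bytes per fragment, as stated in the description

# Assumed grouping of fine-grained labels into coarse-grained labels;
# the record lists both label sets but does not spell out the mapping.
COARSE_LABEL: Dict[str, str] = {
    "JPG": "image", "PNG": "image",
    "HTML": "text", "TXT": "text",
    "MP4": "video", "MOV": "video",
    "M4A": "audio", "MP3": "audio",
}


def split_into_fragments(data: bytes, size: int = FRAGMENT_SIZE) -> List[bytes]:
    """Split data into consecutive fragments of exactly `size` bytes.

    A trailing piece shorter than `size` is dropped (an assumption).
    """
    full = len(data) // size
    return [data[i * size:(i + 1) * size] for i in range(full)]


def to_coarse(fine_label: str) -> str:
    """Map a fine-grained file type (e.g. "MP3") to its broad type."""
    return COARSE_LABEL[fine_label.upper()]
```

For example, split_into_fragments(bytes(1300)) yields two 512-byte fragments and discards the remaining 276 bytes, while to_coarse("mp3") returns "audio".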