Statistical learning for file-type identification

Bibliographic details
Title: Statistical learning for file-type identification
Authors: Siddharth Gopal, Yiming Yang, Konstantin Salomatin, Jaime Carbonell
Contributors: The Pennsylvania State University CiteSeerX Archives
Source: http://www.cs.cmu.edu/%7Esgopal1/papers/ICMLA12.pdf
Collection: CiteSeerX
Subject terms: File-type Identification, Classification, Comparative Evaluation
Description: File-type Identification (FTI) is an important problem in digital forensics, intrusion detection, and other related fields. Using state-of-the-art classification techniques to solve FTI problems has begun to receive research attention; however, general conclusions have not been reached due to the lack of thorough evaluations for method comparison. This paper presents a systematic investigation of the problem, algorithmic solutions, and an evaluation methodology. Our focus is on performance comparison of statistical classifiers (e.g. SVM and kNN) and knowledge-based approaches, especially COTS (Commercial Off-The-Shelf) solutions, which currently dominate FTI applications. We analyze the robustness of different methods in handling damaged files and file segments. We propose two alternative criteria for measuring performance: 1) treating filename extensions as the true labels, and 2) treating the predictions of knowledge-based approaches on intact files as the true labels; since these approaches rely on signature bytes, the signature bytes are removed before each method is tested. In our experiments with simulated damage to files, SVM and kNN substantially outperform all the COTS solutions we tested; some COTS methods cannot identify damaged files at all.
Document type: text
File description: application/pdf
Language: English
Relation: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.303.3596; http://www.cs.cmu.edu/%7Esgopal1/papers/ICMLA12.pdf
Availability: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.303.3596
http://www.cs.cmu.edu/%7Esgopal1/papers/ICMLA12.pdf
Rights: Metadata may be used without restrictions as long as the oai identifier remains attached to it.
Accession number: edsbas.8B80EED2
Database: BASE
Publication type: Academic Journal