Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification.

Bibliographic Details
Title: Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification.
Authors: Beebe, Nicole L.; Maddox, Laurence A.; Liu, Lishu; Sun, Minghe
Source: IEEE Transactions on Information Forensics & Security; Sep 2013, Vol. 8, Issue 9, pp. 1519-1530 (12 pages)
Abstract: Over 20 studies have been published in the past decade involving file and data type classification for digital forensics and information security applications. Methods using n-grams as inputs have proven the most successful across a wide variety of types; however, there are mixed results regarding the utility of unigrams and bigrams as inputs independently. In this study, we use support vector machines (SVMs) consisting of unigrams and bigrams, as well as complexity and other byte frequency-based measures, as inputs. Using concatenated unigrams and bigrams as input and a linear kernel SVM, we achieve significantly improved results over those previously reported (73.4% classification rate across 38 file and data types). We are the first to use concatenated n-grams as the sole input, and we show their superiority over inputs used previously. We also found that too many different types of features as inputs result in overfitting and poor generalization properties. We include several types seldom or not studied in the past (Microsoft Office 2010 files, file system data, base64, base85, URL encoding, flash video, M4A, MP4, WMV, and JSON records). The “winning” approach is instantiated in an open source software tool called Sceadan—Systematic Classification Engine for Advanced Data ANalysis. [ABSTRACT FROM PUBLISHER]
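Illustrative note (not part of the publisher's abstract): the idea of a concatenated unigram and bigram input vector can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the Sceadan implementation. The function name `concatenated_ngram_vector` is hypothetical, and normalizing counts to relative frequencies is an assumption, since the abstract does not say how the vectors are scaled. The SVM training step is omitted; the resulting 65,792-dimensional vector would be passed to a linear-kernel SVM of the reader's choice.

```python
def concatenated_ngram_vector(block: bytes) -> list:
    """Build one feature vector for a block of bytes.

    Layout (an assumption for illustration):
      positions 0..255       -> relative frequency of each byte value (unigrams)
      positions 256..65791   -> relative frequency of each byte pair (bigrams)
    """
    unigrams = [0.0] * 256
    bigrams = [0.0] * (256 * 256)

    # Count single byte values.
    for value in block:
        unigrams[value] += 1

    # Count overlapping pairs of adjacent bytes.
    for first, second in zip(block, block[1:]):
        bigrams[first * 256 + second] += 1

    # Normalize counts to relative frequencies (assumed scaling).
    n_unigrams = len(block)
    n_bigrams = max(len(block) - 1, 0)
    if n_unigrams:
        unigrams = [count / n_unigrams for count in unigrams]
    if n_bigrams:
        bigrams = [count / n_bigrams for count in bigrams]

    # Concatenate the two vectors into a single input vector.
    return unigrams + bigrams


if __name__ == "__main__":
    vector = concatenated_ngram_vector(b"abab")
    print(len(vector))        # total number of features
    print(vector[ord("a")])   # unigram frequency of byte 'a'
```

One such vector would be computed per file or data fragment, with the known type as the class label during training.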
Copyright of IEEE Transactions on Information Forensics & Security is the property of IEEE and its content may not be copied or emailed to multiple sites without the copyright holder's express written permission. Additionally, content may not be used with any artificial intelligence tools or machine learning technologies. However, users may print, download, or email articles for individual use. This abstract may be abridged. No warranty is given about the accuracy of the copy. Users should refer to the original published version of the material for the full abstract. (Copyright applies to all Abstracts.)
Database: Complementary Index
PubType: Academic Journal
PLink https://erproxy.cvtisr.sk/sfx/access?url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edb&AN=89773440
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1109/TIFS.2013.2274728
    Languages:
      – Code: eng
        Text: English
    PhysicalDescription:
      Pagination:
        PageCount: 12
        StartPage: 1519
    Titles:
      – TitleFull: Sceadan: Using Concatenated N-Gram Vectors for Improved File and Data Type Classification.
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Beebe, Nicole L.
      – PersonEntity:
          Name:
            NameFull: Maddox, Laurence A.
      – PersonEntity:
          Name:
            NameFull: Liu, Lishu
      – PersonEntity:
          Name:
            NameFull: Sun, Minghe
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 09
              Text: Sep2013
              Type: published
              Y: 2013
          Identifiers:
            – Type: issn-print
              Value: 15566013
          Numbering:
            – Type: volume
              Value: 8
            – Type: issue
              Value: 9
          Titles:
            – TitleFull: IEEE Transactions on Information Forensics & Security
              Type: main