Feature Extraction for Payload Classification: A Byte Pair Encoding Algorithm

Bibliographic Details
Published in: 2022 IEEE 8th International Conference on Computer and Communications (ICCC), pp. 1-5
Main Authors: Xu, Tianci; Zhou, Peng
Format: Conference Proceeding
Language: English
Published: IEEE, 09.12.2022
Abstract Payload classification is a form of deep packet inspection that has proved effective for many Internet applications, including intrusion detection and network diagnostics. Feature extraction is typically the first step of payload classification, and it strongly affects the quality and quantity of the classification results. At present, most payload feature extraction adopts the n-gram model. However, the n-gram model generates features of a fixed length n, which may cause information loss during feature extraction. In this paper, we propose a Byte Pair Encoding (BPE) algorithm for payload feature extraction. The algorithm expresses payload features as sub-words, so the feature length is no longer fixed. BPE first initializes a vocabulary of single bytes, then repeatedly updates the vocabulary by merging the most frequent byte pair in the payloads into a new sub-word, until all sub-word pairs reach (approximately) the same frequency, regardless of the lengths of the sub-words. The result is a flexible and scalable vocabulary for feature extraction and payload embedding. Finally, we conduct payload classification experiments on the CIC-IDS2017 dataset to verify the effectiveness of the algorithm. The results confirm that our BPE algorithm achieves better classification performance than traditional n-gram methods.
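The merge procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`count_pairs`, `merge_pair`, `learn_bpe`) and the parameters `max_merges` and `min_count` are hypothetical, and the stopping rule used here (a merge budget plus a minimum pair count) is only a simple stand-in for the paper's criterion that all sub-word pairs reach approximately the same frequency.

```python
from collections import Counter


def count_pairs(sequences):
    """Count adjacent sub-word pairs across all tokenized payloads."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs


def merge_pair(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with one merged sub-word."""
    merged = pair[0] + pair[1]
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(merged)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(payloads, max_merges=100, min_count=2):
    """Learn a sub-word vocabulary from raw payload bytes.

    Assumption: stopping on `max_merges` or `min_count` is a simplification,
    not the frequency-equalization criterion described in the abstract.
    """
    # Start from a single-byte vocabulary: each payload is a list of 1-byte tokens.
    seqs = [[bytes([b]) for b in payload] for payload in payloads]
    vocab = {tok for seq in seqs for tok in seq}
    merges = []
    for _ in range(max_merges):
        pairs = count_pairs(seqs)
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < min_count:
            break
        # Record the merge and add the new, longer sub-word to the vocabulary.
        merges.append(best)
        vocab.add(best[0] + best[1])
        seqs = [merge_pair(seq, best) for seq in seqs]
    return vocab, merges, seqs
```

Calling `learn_bpe` on a list of `bytes` payloads returns the learned vocabulary, the ordered merge list, and the payloads re-tokenized into variable-length sub-words; unlike n-grams, the resulting tokens need not share a common length.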
Author Xu, Tianci
Zhou, Peng
Author_xml – sequence: 1
  givenname: Tianci
  surname: Xu
  fullname: Xu, Tianci
  email: tianei_xu@shu.edu.cn
  organization: School of Mechatronical Engineering and Automation, Shanghai University, Shanghai, China
– sequence: 2
  givenname: Peng
  surname: Zhou
  fullname: Zhou, Peng
  email: pzhou@shu.edu.cn
  organization: School of Mechatronical Engineering and Automation, Shanghai University, Shanghai, China
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ICCC56324.2022.10065977
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 1665450517
9781665450515
EndPage 5
ExternalDocumentID 10065977
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 61972452
  funderid: 10.13039/501100001809
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i119t-c775d35751dee127ab92a2c5ecb8ef49b0eaa4a1ab5b98bd576a04e27b63ae243
IEDL.DBID RIE
IngestDate Thu Jan 18 11:14:59 EST 2024
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i119t-c775d35751dee127ab92a2c5ecb8ef49b0eaa4a1ab5b98bd576a04e27b63ae243
PageCount 5
ParticipantIDs ieee_primary_10065977
PublicationCentury 2000
PublicationDate 2022-Dec.-9
PublicationDateYYYYMMDD 2022-12-09
PublicationDate_xml – month: 12
  year: 2022
  text: 2022-Dec.-9
  day: 09
PublicationDecade 2020
PublicationTitle 2022 IEEE 8th International Conference on Computer and Communications (ICCC)
PublicationTitleAbbrev ICCC
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.8304089
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Byte Pair Encoding (BPE)
Classification algorithms
Encoding
Feature extraction
Inspection
Intrusion detection
Merging
payload classification
sub-word model
Vocabulary
word embedding
word segmentation
Title Feature Extraction for Payload Classification: A Byte Pair Encoding Algorithm
URI https://ieeexplore.ieee.org/document/10065977
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE