LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

Detailed bibliography
Title: LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
Authors: SUN, Tiezhu, PIAN, Weiguo, DAOUDI, Nadia, ALLIX, Kevin, F. Bissyandé, Tegawendé, KLEIN, Jacques
Source: urn:isbn:978-3-03-170238-9 ; Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings (2024-09-20); The 29th International Conference on Natural Language & Information Systems, Turin, Italy [IT], 25-06-2024 => 27-06-2024
Publisher Information: Springer Science and Business Media Deutschland GmbH
Publication Year: 2024
Collection: University of Luxembourg: ORBilu - Open Repository and Bibliography
Subject Terms: Large file classification, Multiple instance learning, Classification tasks, Computational costs, Input constraints, Language processing, Large files, Multiple-instance learning, Natural languages, Text classification, Engineering, computing & technology, Computer science, Ingénierie, informatique & technologie, Sciences informatiques
Description: peer reviewed ; Transformer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments using seven diverse and comprehensive benchmark datasets to assess LaFiCMIL’s effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL’s potential as a groundbreaking approach in the field of large file classification.
Document Type: conference object
report
Language: English
ISBN: 978-3-031-70238-9
3-031-70238-7
Relation: https://link.springer.com/content/pdf/10.1007/978-3-031-70239-6_5; https://orbilu.uni.lu/handle/10993/62891; info:hdl:10993/62891; https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf
DOI: 10.1007/978-3-031-70239-6_5
Availability: https://orbilu.uni.lu/handle/10993/62891
https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf
https://doi.org/10.1007/978-3-031-70239-6_5
Rights: open access ; http://purl.org/coar/access_right/c_abf2 ; info:eu-repo/semantics/openAccess
Accession Number: edsbas.DCDC8119
Database: BASE
FullText Text:
  Availability: 0
Header DbId: edsbas
DbLabel: BASE
An: edsbas.DCDC8119
RelevancyScore: 947
AccessLevel: 3
PubType: Conference
PubTypeId: conference
PreciseRelevancyScore: 947.306396484375
IllustrationInfo
Items – Name: Title
  Label: Title
  Group: Ti
  Data: LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
– Name: Author
  Label: Authors
  Group: Au
  Data: SUN, Tiezhu; PIAN, Weiguo; DAOUDI, Nadia; ALLIX, Kevin; F. Bissyandé, Tegawendé; KLEIN, Jacques
– Name: TitleSource
  Label: Source
  Group: Src
  Data: urn:isbn:978-3-03-170238-9 ; Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings (2024-09-20); The 29th International Conference on Natural Language & Information Systems, Turin, Italy [IT], 25-06-2024 => 27-06-2024
– Name: Publisher
  Label: Publisher Information
  Group: PubInfo
  Data: Springer Science and Business Media Deutschland GmbH
– Name: DatePubCY
  Label: Publication Year
  Group: Date
  Data: 2024
– Name: Subset
  Label: Collection
  Group: HoldingsInfo
  Data: University of Luxembourg: ORBilu - Open Repository and Bibliography
– Name: Subject
  Label: Subject Terms
  Group: Su
  Data: Large file classification; Multiple instance learning; Classification tasks; Computational costs; Input constraints; Language processing; Large files; Multiple-instance learning; Natural languages; Text classification; Engineering; computing & technology; Computer science; Ingénierie; informatique & technologie; Sciences informatiques
– Name: Abstract
  Label: Description
  Group: Ab
  Data: peer reviewed ; Transformer-based models have significantly advanced natural language processing, in particular the performance in text classification tasks. Nevertheless, these models face challenges in processing large files, primarily due to their input constraints, which are generally restricted to hundreds or thousands of tokens. Attempts to address this issue in existing models usually consist in extracting only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to their complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments using seven diverse and comprehensive benchmark datasets to assess LaFiCMIL’s effectiveness. By integrating BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new benchmarks across all datasets. A notable achievement of our approach is its ability to scale BERT to handle nearly 20000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL’s potential as a groundbreaking approach in the field of large file classification.
– Name: TypeDocument
  Label: Document Type
  Group: TypDoc
  Data: conference object; report
– Name: Language
  Label: Language
  Group: Lang
  Data: English
– Name: ISBN
  Label: ISBN
  Group: ISBN
  Data: 978-3-031-70238-9; 3-031-70238-7
– Name: NoteTitleSource
  Label: Relation
  Group: SrcInfo
  Data: https://link.springer.com/content/pdf/10.1007/978-3-031-70239-6_5; https://orbilu.uni.lu/handle/10993/62891; info:hdl:10993/62891; https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf
– Name: DOI
  Label: DOI
  Group: ID
  Data: 10.1007/978-3-031-70239-6_5
– Name: URL
  Label: Availability
  Group: URL
  Data: https://orbilu.uni.lu/handle/10993/62891; https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf; https://doi.org/10.1007/978-3-031-70239-6_5
– Name: Copyright
  Label: Rights
  Group: Cpyrght
  Data: open access ; http://purl.org/coar/access_right/c_abf2 ; info:eu-repo/semantics/openAccess
– Name: AN
  Label: Accession Number
  Group: ID
  Data: edsbas.DCDC8119
RecordInfo BibRecord:
  BibEntity:
    Identifiers:
      – Type: doi
        Value: 10.1007/978-3-031-70239-6_5
    Languages:
      – Text: English
    Subjects:
      – SubjectFull: Large file classification
        Type: general
      – SubjectFull: Multiple instance learning
        Type: general
      – SubjectFull: Classification tasks
        Type: general
      – SubjectFull: Computational costs
        Type: general
      – SubjectFull: Input constraints
        Type: general
      – SubjectFull: Language processing
        Type: general
      – SubjectFull: Large files
        Type: general
      – SubjectFull: Multiple-instance learning
        Type: general
      – SubjectFull: Natural languages
        Type: general
      – SubjectFull: Text classification
        Type: general
      – SubjectFull: Engineering
        Type: general
      – SubjectFull: computing & technology
        Type: general
      – SubjectFull: Computer science
        Type: general
      – SubjectFull: Ingénierie
        Type: general
      – SubjectFull: informatique & technologie
        Type: general
      – SubjectFull: Sciences informatiques
        Type: general
    Titles:
      – TitleFull: LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: SUN, Tiezhu
      – PersonEntity:
          Name:
            NameFull: PIAN, Weiguo
      – PersonEntity:
          Name:
            NameFull: DAOUDI, Nadia
      – PersonEntity:
          Name:
            NameFull: ALLIX, Kevin
      – PersonEntity:
          Name:
            NameFull: F. Bissyandé, Tegawendé
      – PersonEntity:
          Name:
            NameFull: KLEIN, Jacques
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 01
              M: 01
              Type: published
              Y: 2024
          Identifiers:
            – Type: isbn-print
              Value: 9783031702389
            – Type: isbn-print
              Value: 3031702387
            – Type: issn-locals
              Value: edsbas
            – Type: issn-locals
              Value: edsbas.oa
          Titles:
            – TitleFull: urn:isbn:978-3-03-170238-9 ; Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings (2024-09-20); The 29th International Conference on Natural Language & Information Systems, Turin, Italy [IT], 25-06-2024 => 27-06-2024
              Type: main