LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning

Bibliographic Details
Title: LaFiCMIL: Rethinking Large File Classification from the Perspective of Correlated Multiple Instance Learning
Authors: SUN, Tiezhu; PIAN, Weiguo; DAOUDI, Nadia; ALLIX, Kevin; BISSYANDÉ, Tegawendé F.; KLEIN, Jacques
Source: urn:isbn:978-3-031-70238-9; Natural Language Processing and Information Systems - 29th International Conference on Applications of Natural Language to Information Systems, NLDB 2024, Proceedings (2024-09-20); The 29th International Conference on Natural Language & Information Systems, Turin, Italy, 25–27 June 2024
Publisher Information: Springer Science and Business Media Deutschland GmbH
Publication Year: 2024
Collection: University of Luxembourg: ORBilu - Open Repository and Bibliography
Subject Terms: Large file classification, Multiple instance learning, Classification tasks, Computational costs, Input constraints, Language processing, Large files, Natural languages, Text classification, Engineering, computing & technology, Computer science
Description: Peer reviewed. Transformer-based models have significantly advanced natural language processing, particularly in text classification tasks. Nevertheless, these models struggle to process large files, primarily because their input is generally restricted to hundreds or thousands of tokens. Existing attempts to address this issue typically extract only a fraction of the essential information from lengthy inputs, while often incurring high computational costs due to complex architectures. In this work, we address the challenge of classifying large files from the perspective of correlated multiple instance learning. We introduce LaFiCMIL, a method specifically designed for large file classification. It is optimized for efficient training on a single GPU, making it a versatile solution for binary, multi-class, and multi-label classification tasks. We conducted extensive experiments on seven diverse and comprehensive benchmark datasets to assess LaFiCMIL's effectiveness. Using BERT for feature extraction, LaFiCMIL demonstrates exceptional performance, setting new state-of-the-art results across all datasets. Notably, it scales BERT to handle nearly 20,000 tokens while training on a single GPU with 32 GB of memory. This efficiency, coupled with its state-of-the-art performance, highlights LaFiCMIL's potential as a groundbreaking approach to large file classification.
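The abstract states what LaFiCMIL achieves but not how a correlated-MIL pipeline is wired. The following minimal PyTorch sketch illustrates the generic pattern under stated assumptions: it is not the paper's architecture, and every name in it (CorrelatedMILClassifier, the self-attention aggregator, the mean pooling) is a hypothetical choice made for clarity.

```python
# Illustrative sketch only -- NOT the authors' LaFiCMIL implementation.
# It shows the general pattern the abstract alludes to: split a large file
# into BERT-sized "instances", embed each chunk, let the instances attend
# to one another (the "correlated" part of correlated MIL), then pool and
# classify at the bag (file) level. All names here are hypothetical.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class CorrelatedMILClassifier(nn.Module):
    def __init__(self, num_classes: int, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Self-attention across instance embeddings captures inter-chunk
        # correlation, unlike standard MIL pooling, which treats instances
        # as independent.
        self.inter_instance_attn = nn.MultiheadAttention(
            hidden, num_heads=8, batch_first=True
        )
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor):
        # input_ids: (num_instances, seq_len) -- one file, chunked so that
        # each instance fits BERT's 512-token limit.
        cls_embeds = self.bert(
            input_ids=input_ids, attention_mask=attention_mask
        ).last_hidden_state[:, 0]            # (num_instances, hidden)
        bag = cls_embeds.unsqueeze(0)        # (1, num_instances, hidden)
        correlated, _ = self.inter_instance_attn(bag, bag, bag)
        pooled = correlated.mean(dim=1)      # bag-level representation
        return self.classifier(pooled)       # (1, num_classes) logits


# Usage: the fast tokenizer's overflow mechanism turns one long text into
# a batch of fixed-size chunks, i.e., the MIL "bag of instances".
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
long_text = "some very long file content " * 1000   # stand-in for a large file
enc = tokenizer(
    long_text, max_length=512, truncation=True, padding="max_length",
    return_overflowing_tokens=True, return_tensors="pt",
)
model = CorrelatedMILClassifier(num_classes=2).eval()
with torch.no_grad():
    logits = model(enc["input_ids"], enc["attention_mask"])  # shape (1, 2)
```

This chunk-then-aggregate design is also why such methods scale: full attention is computed only within each 512-token chunk, so memory grows roughly linearly with the number of chunks rather than quadratically with file length. The paper's specific single-GPU training optimizations are not reproduced in this sketch.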
Document Type: conference object; report
Language: English
ISBN: 978-3-031-70238-9 (ISBN-13); 3-031-70238-7 (ISBN-10)
Relation: https://link.springer.com/content/pdf/10.1007/978-3-031-70239-6_5; https://orbilu.uni.lu/handle/10993/62891; info:hdl:10993/62891; https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf
DOI: 10.1007/978-3-031-70239-6_5
Availability: https://orbilu.uni.lu/handle/10993/62891
https://orbilu.uni.lu/bitstream/10993/62891/1/LaFiCMIL.pdf
https://doi.org/10.1007/978-3-031-70239-6_5
Rights: open access ; http://purl.org/coar/access_right/c_abf2 ; info:eu-repo/semantics/openAccess
Accession Number: edsbas.DCDC8119
Database: BASE