Effective Hierarchical Text Classification with Large Language Models

Published in: SN Computer Science, Volume 6, Issue 7, p. 873
Main authors: Longo, Carmelo Fabio; Tuccari, Giusy Giulia; Bulla, Luana; Russo, Chiara Maria; Mongiovì, Misael
Format: Journal Article
Language: English
Publication details: Singapore: Springer Nature Singapore, 06.10.2025; Springer Nature B.V.
ISSN: 2662-995X, 2661-8907
Description
Abstract: Hierarchical Text Classification presents significant challenges, especially when dealing with intricate taxonomies with multi-level labels. The scarcity of annotated datasets exacerbates these challenges, limiting traditional approaches. Large Language Models (LLMs) alone struggle with the inherent complexity of hierarchical structures and require significant computational resources. This work presents HTC-GEN, an innovative framework that leverages synthetic data generation with LLMs, specifically LLaMa3, to create realistic, context-aware text samples across hierarchical levels. HTC-GEN reduces the reliance on manual annotation and addresses class imbalance by producing high-quality data for underrepresented labels. We evaluate our framework on the Web of Science dataset in a zero-shot setting, benchmarking it against the state-of-the-art HTC model (Z-STC) and LLaMa3. The results highlight the effectiveness of HTC-GEN, which achieves state-of-the-art performance in hierarchical text classification; our evaluation also shows that LLaMa3 alone is insufficient for this task. Furthermore, we perform a comprehensive analysis of model performance, examining individual components and assessing the impact of different hyperparameter configurations, with a particular focus on temperature and dataset size. The study underscores the potential of LLM-generated data for building robust, scalable classification systems without extensive human intervention.
DOI: 10.1007/s42979-025-04435-x
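
The abstract describes generating label-conditioned synthetic training texts with LLaMa3 to cover every level of the taxonomy. The following is a minimal sketch of that idea, assuming the Hugging Face transformers text-generation pipeline and the meta-llama/Meta-Llama-3-8B-Instruct checkpoint; the prompt wording, example label paths, and sampling settings are illustrative placeholders, not the HTC-GEN implementation described in the paper.

# Minimal sketch: LLM-driven synthetic data generation for hierarchical labels.
# Assumes the Hugging Face `transformers` text-generation pipeline and a gated
# LLaMA-3 instruct checkpoint (requires an access token). Prompts, label paths,
# and sampling settings are illustrative, not the paper's exact setup.
from transformers import pipeline

# Placeholder two-level label paths (parent -> child), in the spirit of the
# Web of Science taxonomy used in the paper's evaluation.
LABEL_PATHS = [
    ("Computer Science", "Machine Learning"),
    ("Medical Sciences", "Cancer Research"),
]

generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed checkpoint name
)

def generate_samples(parent, child, n=5, temperature=0.8, max_new_tokens=200):
    """Generate n synthetic abstracts labeled with the path parent -> child."""
    prompt = (
        f"Write a short scientific abstract about the topic '{child}', "
        f"a subfield of '{parent}'. Use realistic technical vocabulary."
    )
    samples = []
    for _ in range(n):
        out = generator(
            prompt,
            do_sample=True,
            temperature=temperature,  # temperature is one hyperparameter the paper analyzes
            max_new_tokens=max_new_tokens,
            return_full_text=False,
        )
        samples.append({"text": out[0]["generated_text"], "labels": [parent, child]})
    return samples

if __name__ == "__main__":
    synthetic_dataset = []
    for parent, child in LABEL_PATHS:
        synthetic_dataset.extend(generate_samples(parent, child))
    print(f"Generated {len(synthetic_dataset)} synthetic training examples")

The resulting records could then be used to train a conventional hierarchical classifier in place of scarce human-annotated data, which is the imbalance-mitigation role the abstract attributes to the generated samples.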