DACL+: domain-adapted contrastive learning for enhanced low-resource language representations in document clustering tasks

Published in: Neural Computing & Applications, Volume 37, Issue 17, pp. 10577–10590
Main authors: Zaikis, Dimitrios; Vlahavas, Ioannis
Format: Journal Article
Language: English
Publication details: London: Springer London, 01.06.2025; Springer Nature B.V.
ISSN: 0941-0643, 1433-3058
Summary: Low-resource languages in natural language processing present unique challenges, marked by limited linguistic resources and sparse data. These challenges extend to document clustering tasks, where the need for meaningful and semantically rich representations is crucial. Along with the emergence of transformer-based language models (LM), the need for vast amounts of training data has also increased significantly. To this end, we introduce a domain-adapted contrastive learning approach for low-resource Greek document clustering. We introduce manually annotated datasets, essential for LM pre-training and clustering tasks, and extend the investigations by combining Greek BERT and Longformer models. We explore the efficacy of various domain adaptation pre-training objectives and of further pre-training the LMs using contrastive learning with diverse loss functions on datasets generated from a classification corpus. By maximizing the similarity between positive examples and minimizing the similarity between negative examples, our proposed approach learns meaningful representations that capture the underlying structure of the documents. We demonstrate that our proposed approach significantly improves the accuracy of clustering tasks, with an average improvement of up to 50% compared to the base LM, leading to enhanced performance in unsupervised learning tasks. Furthermore, we show how combining language models optimized for different sequence lengths improves performance and compare this approach against an unsupervised graph-based summarization method. Our findings underscore the importance of effective document representations in enhancing the accuracy of clustering tasks in low-resource language settings.
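The abstract describes the contrastive objective only at a high level (pull positive document pairs together, push negatives apart) and does not name the specific loss functions evaluated. The sketch below illustrates one common in-batch contrastive objective of this kind, an NT-Xent / InfoNCE-style loss over document embeddings; the function name contrastive_loss, the PyTorch framing, and the temperature value are illustrative assumptions, not the authors' implementation.

    # Minimal sketch of a contrastive objective over document embeddings,
    # assuming an NT-Xent-style loss with in-batch negatives (illustrative only).
    import torch
    import torch.nn.functional as F

    def contrastive_loss(anchor, positive, temperature=0.1):
        """Pull each anchor towards its positive example and push it away
        from all other examples in the batch (in-batch negatives)."""
        # L2-normalise so the dot product equals cosine similarity
        anchor = F.normalize(anchor, dim=-1)      # (batch, dim)
        positive = F.normalize(positive, dim=-1)  # (batch, dim)

        # Pairwise cosine similarities between anchors and positives
        logits = anchor @ positive.t() / temperature  # (batch, batch)

        # The matching positive sits on the diagonal; off-diagonal entries
        # act as negatives
        targets = torch.arange(anchor.size(0), device=anchor.device)
        return F.cross_entropy(logits, targets)

In the setting the abstract describes, the anchor and positive embeddings could come from Greek BERT or Longformer encodings of documents paired via the classification corpus, but the exact pairing and encoder choice above are assumptions for illustration.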
DOI: 10.1007/s00521-024-10589-1