MATSFT: User query-based multilingual abstractive text summarization for low resource Indian languages by fine-tuning mT5

Bibliographic Details
Published in: Alexandria Engineering Journal, Vol. 127, pp. 129–142
Authors: Phani, Siginamsetty; Abdul, Ashu; Prasad, M. Krishna Siva; Reddy, V. Dinesh
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2025
ISSN: 1110-0168
Online Access: Full text
Description
Abstract: User query-based summarization is a challenging research area of natural language processing. However, existing approaches struggle to effectively manage the intricate long-distance semantic relationships between user queries and input documents. This paper introduces a user query-based multilingual abstractive text summarization approach for low-resource Indian languages by fine-tuning the multilingual pre-trained text-to-text transfer transformer (mT5) model (MATSFT). MATSFT employs a co-attention mechanism within a shared encoder–decoder architecture alongside the mT5 model to transfer knowledge across multiple low-resource languages (LRLs). The co-attention mechanism captures cross-lingual dependencies, which allows the model to understand the relationships and nuances between the different languages. Most multilingual summarization datasets focus on major global languages such as English, French, and Spanish. To address the challenges in LRLs, we created an Indian Language (IL) dataset, comprising seven LRLs and English, by extracting data from the BBC news website. We evaluate the performance of MATSFT using the ROUGE metric and a language-agnostic target summary evaluation metric (LaTSEM). Experimental results show that MATSFT outperforms the monolingual transformer model, the pre-trained multilingual transformer model (MTM), the mT5 model, the NLI model, IndicBART, mBART25, and mBART50 on the IL dataset. A statistical paired t-test indicates that MATSFT achieves a significant improvement over the other models, with a p-value of ≤ 0.05.

Highlights:
• Query-focused summarization enabling concise, relevant results across complex, long-distance contexts.
• A unified MATSFT framework for multilingual text summarization.
• An Indian Language (IL) dataset covering seven Indian languages and English.
• A language-agnostic target summary evaluation metric (LaTSEM) for assessing model performance.
• mT5 fine-tuned to produce concise, coherent summaries driven by user queries.
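
The fine-tuning step named in the abstract can be illustrated with a short sketch. The Python code below (Hugging Face transformers) conditions mT5 on the user query by prepending it to the source document before encoding; the "query:"/"document:" prompt format, the example texts, and all hyperparameters are illustrative assumptions rather than the paper's actual MATSFT configuration, and the co-attention mechanism MATSFT adds on top of mT5 is not shown.

```python
# Minimal query-conditioned fine-tuning sketch for mT5.
# The prompt format and hyperparameters are illustrative assumptions,
# not the paper's MATSFT setup.
import torch
from transformers import MT5ForConditionalGeneration, MT5TokenizerFast

model_name = "google/mt5-small"
tokenizer = MT5TokenizerFast.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def encode_example(query, document, summary):
    # Condition generation on the user query by prepending it to the document.
    source = f"query: {query} document: {document}"
    enc = tokenizer(source, max_length=512, truncation=True, return_tensors="pt")
    labels = tokenizer(text_target=summary, max_length=128, truncation=True,
                       return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # mask padding in the loss
    return enc.input_ids, enc.attention_mask, labels

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One illustrative training step on a single (query, document, summary) triple.
input_ids, attention_mask, labels = encode_example(
    "What does the report say about monsoon rainfall?",
    "The meteorological department released its seasonal outlook ...",
    "The outlook predicts above-average monsoon rainfall this year.")
model.train()
loss = model(input_ids=input_ids, attention_mask=attention_mask,
             labels=labels).loss
loss.backward()
optimizer.step()

# Inference: generate a query-focused summary with beam search.
model.eval()
with torch.no_grad():
    out = model.generate(input_ids, max_length=128, num_beams=4)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```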
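The evaluation protocol mentioned in the abstract, ROUGE scoring followed by a paired t-test at p ≤ 0.05, can likewise be sketched with the rouge-score and SciPy packages; the per-document score lists below are placeholder values, not results from the paper.

```python
# Sketch of the evaluation protocol named in the abstract: per-document
# ROUGE scoring plus a paired t-test between two systems.
from rouge_score import rouge_scorer
from scipy import stats

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"])
scores = scorer.score(
    "the budget allocates new funds for railway modernization",  # reference
    "new funds for railway modernization were allocated")        # prediction
print({name: round(s.fmeasure, 3) for name, s in scores.items()})

# Paired t-test over per-document ROUGE-L F1 of two systems; the paper
# reports significance at p <= 0.05 in favor of MATSFT. Placeholder data:
matsft_rl   = [0.42, 0.38, 0.45, 0.40, 0.37, 0.44]
baseline_rl = [0.35, 0.33, 0.41, 0.36, 0.31, 0.39]
t_stat, p_value = stats.ttest_rel(matsft_rl, baseline_rl)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```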
DOI: 10.1016/j.aej.2025.04.031