MATSFT: User query-based multilingual abstractive text summarization for low resource Indian languages by fine-tuning mT5

Detailed bibliography
Published in: Alexandria Engineering Journal, Vol. 127, pp. 129–142
Main authors: Phani, Siginamsetty; Abdul, Ashu; Prasad, M. Krishna Siva; Reddy, V. Dinesh
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2025
ISSN: 1110-0168
Description
Summary: User query-based summarization is a challenging research area of natural language processing, and existing approaches struggle to manage the intricate long-distance semantic relationships between user queries and input documents. This paper introduces MATSFT, a user query-based multilingual abstractive text summarization approach for Indian low-resource languages (LRLs) built by fine-tuning the multilingual pre-trained text-to-text transformer model (mT5). MATSFT employs a co-attention mechanism within a shared encoder–decoder architecture alongside the mT5 model to transfer knowledge across multiple LRLs. The co-attention mechanism captures cross-lingual dependencies, allowing the model to learn the relationships and nuances between the different languages. Most multilingual summarization datasets focus on major global languages such as English, French, and Spanish; to address the challenges in LRLs, we created an Indian language (IL) dataset comprising seven LRLs and English, extracted from the BBC news website. We evaluate the performance of MATSFT using the ROUGE metric and a language-agnostic target summary evaluation metric (LaTSEM). Experimental results show that MATSFT outperforms the monolingual transformer model, the pre-trained MTM, the mT5 model, the NLI model, IndicBART, mBART25, and mBART50 on the IL dataset. A statistical paired t-test indicates that MATSFT achieves a significant improvement (p ≤ 0.05) over the other models.

Highlights:
• Query-focused summarization enabling concise, relevant results across complex, distant contexts.
• A unified MATSFT framework for multilingual text summarization.
• An Indian Language (IL) dataset covering seven Indian languages and English.
• The LaTSEM metric for evaluating MATSFT model performance.
• Fine-tuning of mT5 for concise, coherent summaries based on user queries.
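As a rough illustration of the recipe the abstract describes, the sketch below fine-tunes an off-the-shelf mT5 checkpoint on a query-conditioned input using the Hugging Face transformers library. It is a minimal sketch under stated assumptions, not the authors' MATSFT implementation: the paper's co-attention layers, shared encoder–decoder wiring, and IL dataset are not reproduced, and the "query: ... document: ..." template, checkpoint name, and hyperparameters are illustrative choices only.

import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

# Assumption: any public mT5 checkpoint; the paper fine-tunes mT5 as well,
# but its exact configuration is not shown in the abstract.
model_name = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = MT5ForConditionalGeneration.from_pretrained(model_name)

def make_batch(query, document, target_summary):
    # Condition the encoder on the user query by prepending it to the
    # document; this "query: ... document: ..." template is an assumption.
    source = f"query: {query} document: {document}"
    inputs = tokenizer(source, max_length=512, truncation=True,
                       return_tensors="pt")
    labels = tokenizer(text_target=target_summary, max_length=128,
                       truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    inputs["labels"] = labels
    return inputs

# One illustrative training step on a made-up (query, document, summary) triple.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model.train()
batch = make_batch("What caused the flooding?",
                   "Heavy monsoon rains hit the region over the weekend...",
                   "Monsoon rains caused severe flooding in the region.")
optimizer.zero_grad()
loss = model(**batch).loss  # standard sequence-to-sequence cross-entropy
loss.backward()
optimizer.step()

In practice this would run inside a full training loop (or the transformers Seq2SeqTrainer) over the whole dataset; the single step above only shows the data flow.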
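The abstract also reports significance via a statistical paired t-test (p ≤ 0.05). The toy SciPy snippet below shows the general form of such a test, pairing per-document scores from two systems on the same test documents; the numbers are made-up placeholders, not results from the paper.

from scipy import stats

# Hypothetical per-document ROUGE scores for two systems on the SAME documents.
system_a = [0.42, 0.39, 0.45, 0.41, 0.44, 0.40]
system_b = [0.38, 0.37, 0.40, 0.39, 0.41, 0.36]

t_stat, p_value = stats.ttest_rel(system_a, system_b)  # paired t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # p <= 0.05 -> significant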
DOI: 10.1016/j.aej.2025.04.031