MATSFT: User query-based multilingual abstractive text summarization for low resource Indian languages by fine-tuning mT5

Bibliographic Details
Published in: Alexandria Engineering Journal, Vol. 127, pp. 129–142
Main Authors: Phani, Siginamsetty; Abdul, Ashu; Prasad, M. Krishna Siva; Reddy, V. Dinesh
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.08.2025
ISSN: 1110-0168
DOI: 10.1016/j.aej.2025.04.031
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary: User query-based summarization is a challenging research area of natural language processing, and existing approaches struggle to manage the intricate long-distance semantic relationships between user queries and input documents. This paper introduces MATSFT, a user query-based multilingual abstractive text summarization approach for low-resource Indian languages, built by fine-tuning the multilingual pre-trained text-to-text transformer (mT5) model. MATSFT employs a co-attention mechanism within a shared encoder–decoder architecture alongside the mT5 model to transfer knowledge across multiple low-resource languages (LRLs). The co-attention captures cross-lingual dependencies, allowing the model to learn the relationships and nuances between the different languages. Most multilingual summarization datasets focus on major global languages such as English, French, and Spanish. To address this gap for LRLs, we created an Indian Language (IL) dataset, comprising seven LRLs plus English, by extracting data from the BBC news website. We evaluate MATSFT using the ROUGE metric and a language-agnostic target summary evaluation metric (LaTSEM). Experimental results show that MATSFT outperforms the monolingual transformer model, the pre-trained multilingual transformer model (MTM), the mT5 model, the NLI model, IndicBART, mBART25, and mBART50 on the IL dataset. A statistical paired t-test indicates that MATSFT achieves a significant improvement over the other models (p ≤ 0.05).

Highlights:
• Query-focused summarization enabling concise, relevant results across complex, distant contexts.
• A unified MATSFT framework for multilingual text summarization.
• An Indian Language (IL) dataset comprising seven Indian languages plus English.
• A LaTSEM metric for evaluating MATSFT model performance.
• Fine-tuning of mT5 to generate concise, coherent summaries based on user queries.
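The record does not include the paper's training code, but the setup it describes (mT5 fine-tuned on query–document pairs) can be sketched with the Hugging Face transformers library. The checkpoint name, the "query: ... document: ..." prompt format, the sequence lengths, and the learning rate below are illustrative assumptions, not the authors' published configuration.

```python
# A minimal sketch of query-conditioned mT5 fine-tuning, assuming the
# Hugging Face "transformers" library. The checkpoint, prompt format,
# sequence lengths, and learning rate are stand-ins, not the paper's setup.
import torch
from transformers import AutoTokenizer, MT5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

def training_step(query: str, document: str, target_summary: str) -> float:
    # Prepend the user query so the encoder attends to it alongside the
    # document text in a single input sequence.
    inputs = tokenizer(
        f"query: {query} document: {document}",
        max_length=1024, truncation=True, return_tensors="pt",
    )
    labels = tokenizer(
        target_summary, max_length=128, truncation=True, return_tensors="pt",
    ).input_ids
    # When labels are supplied, the model returns the token-level
    # cross-entropy loss computed with teacher forcing.
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```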
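The co-attention mechanism itself is only named in the abstract. Below is a minimal sketch of one common formulation: a bilinear affinity matrix between query and document encoder states, attended in both directions. The bilinear scoring, shapes, and residual fusion are assumptions; the exact placement of the layer inside the paper's shared encoder–decoder is not specified in this record.

```python
# A minimal sketch of one common co-attention formulation between query and
# document encoder states; not the paper's exact layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        # Bilinear affinity between query tokens and document tokens.
        self.affinity = nn.Linear(d_model, d_model, bias=False)

    def forward(self, query_states, doc_states):
        # query_states: (batch, n_q, d); doc_states: (batch, n_d, d)
        # Affinity matrix over all token pairs: (batch, n_q, n_d)
        scores = self.affinity(query_states) @ doc_states.transpose(1, 2)
        # Attend to the document for each query token, and vice versa.
        doc_ctx = F.softmax(scores, dim=-1) @ doc_states               # (batch, n_q, d)
        query_ctx = F.softmax(scores.transpose(1, 2), dim=-1) @ query_states  # (batch, n_d, d)
        # Residual fusion: query-aware document states feed the decoder.
        return doc_states + query_ctx, query_states + doc_ctx

# Example: fuse a 4-token query with a 50-token document at d_model = 512.
layer = CoAttention(512)
doc_out, query_out = layer(torch.randn(2, 4, 512), torch.randn(2, 50, 512))
```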
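Finally, the reported comparison protocol (per-example ROUGE scores followed by a paired t-test at the 0.05 level) can be reproduced with standard tooling. This sketch assumes the rouge-score and SciPy packages; the paper's LaTSEM metric is not described in this record, so only ROUGE-L is shown, with stemming disabled since the default stemmer is English-only.

```python
# A sketch of the evaluation protocol implied by the abstract: per-example
# ROUGE-L F1 for two systems, then a paired t-test on the score pairs.
from rouge_score import rouge_scorer
from scipy.stats import ttest_rel

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=False)

def rouge_l_f1(references, predictions):
    # One ROUGE-L F1 score per (reference, prediction) pair.
    return [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for ref, pred in zip(references, predictions)
    ]

def compare_systems(references, summaries_a, summaries_b):
    scores_a = rouge_l_f1(references, summaries_a)
    scores_b = rouge_l_f1(references, summaries_b)
    # Paired test: the same test example underlies each pair of scores,
    # matching the paper's reported significance criterion of p <= 0.05.
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return t_stat, p_value
```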