A Statistical Language Modeling Framework for Extractive Summarization of Text Documents

Bibliographic details
Published in: SN Computer Science, Vol. 4, No. 6, p. 750
Main authors: Gupta, Pooja; Nigam, Swati; Singh, Rajiv
Format: Journal Article
Language: English
Published: Singapore: Springer Nature Singapore, 1 November 2023
Springer Nature B.V
ISSN: 2661-8907, 2662-995X
Description
Summary: A large collection of text documents on a variety of topics, such as tweets, web pages, news articles, and stories, is now available in different languages. Faced with so many electronic documents, users tire of reading entire documents and have difficulty finding relevant information. Text summarization is a solution for retrieving important information from multiple documents. The summarization process should be robust, language-independent, and able to generate summaries from multi-language corpora. With this aim, we propose a language-independent framework of extractive text summarization for text documents that are primarily in English; the BBC, CNN, and DUC 2004 news story corpora, translated from English to Hindi, are then used in this study. In the absence of a Hindi corpus for these datasets, we used the Google, Bing, and Systran machine translators to generate Hindi text, in order to show the effectiveness of the proposed approach for a low-resource language. The proposed method is based on an N-gram language model and statistical modeling using maximum likelihood estimation (MLE). Sentence ranking is used to generate summaries of English and Hindi documents. We evaluate our model on the BBC, CNN, and DUC 2004 news stories datasets in both English and Hindi, using the ROUGE metric at different levels to measure the performance of the generated summaries. We further compare the proposed method with existing baseline and state-of-the-art methods, such as LexRank, TextRank, Lead, Luhn, LSA, and SumBasic. The experimental results indicate that our approach is more effective than existing summarization methods.
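Illustration (not part of the catalog record): the summary names an N-gram language model, maximum likelihood estimation, and sentence ranking, but does not give the scoring formula. The sketch below is one possible reading under stated assumptions: a bigram model estimated from the document itself, whitespace tokenization, and a hypothetical score equal to the average log-probability of a sentence's bigrams. The function names `tokenize`, `train_bigram_mle`, `score`, and `summarize` are invented for this example and are not taken from the article.

```python
from collections import Counter
import math
import string


def tokenize(sentence):
    # Assumption: whitespace tokenization with ASCII punctuation stripped.
    words = (w.strip(string.punctuation).lower() for w in sentence.split())
    return [w for w in words if w]


def bigrams(tokens):
    # Pad with a start marker so the first word also has a context.
    padded = ["<s>"] + tokens
    return list(zip(padded, padded[1:]))


def train_bigram_mle(sentences):
    """Return MLE estimates P(word | prev) = count(prev, word) / count(prev)."""
    pair_counts = Counter()
    context_counts = Counter()
    for sentence in sentences:
        for prev, word in bigrams(tokenize(sentence)):
            pair_counts[(prev, word)] += 1
            context_counts[prev] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


def score(sentence, probs):
    """Average log-probability of the sentence's bigrams (hypothetical rule).

    Only valid for sentences that were part of the training text, because
    unsmoothed MLE assigns no probability to unseen bigrams.
    """
    pairs = bigrams(tokenize(sentence))
    if not pairs:
        return float("-inf")
    return sum(math.log(probs[p]) for p in pairs) / len(pairs)


def summarize(sentences, k=2):
    """Pick the k highest-scoring sentences and return them in document order."""
    probs = train_bigram_mle(sentences)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score(sentences[i], probs),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]
```

Calling `summarize(list_of_sentences, k=3)` returns three sentences in their original order. Because nothing in the sketch depends on a language-specific resource, the same code runs on translated text, although a real system would need a tokenizer suited to the script and some form of smoothing to score unseen sentences.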
DOI: 10.1007/s42979-023-02241-x