A Statistical Language Modeling Framework for Extractive Summarization of Text Documents
| Published in: | SN Computer Science, Vol. 4, No. 6, p. 750 |
|---|---|
| Main authors: | , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Singapore; Springer Nature Singapore; 1 November 2023; Springer Nature B.V |
| Subjects: | |
| ISSN: | 2661-8907, 2662-995X |
| Summary: | Large collections of text documents of many kinds, such as tweets, web pages, news articles, and stories, are now available in different languages. Faced with this volume of electronic documents, users tire of reading entire documents and have difficulty finding relevant information. Text summarization is a solution for retrieving important information from multiple documents. The summarization process should be robust, language-independent, and able to generate summaries from multi-language corpora. With this aim, we propose a language-independent framework for extractive text summarization. The framework is developed primarily on English text documents; the BBC, CNN, and DUC 2004 news story corpora, translated from English to Hindi, are then used in this study. In the absence of a Hindi corpus for these datasets, we used the Google, Bing, and Systran machine translators to generate Hindi text and show the effectiveness of the proposed approach for a low-resource language. The proposed method is based on an N-gram language model and statistical modeling using maximum likelihood estimation (MLE). Sentence ranking is used to generate summaries of English and Hindi documents. We evaluate our model on the BBC, CNN, and DUC 2004 news story datasets in both English and Hindi, using the ROUGE metric at different levels to measure the quality of the generated summaries. Further, we compare the proposed method with existing baseline and state-of-the-art methods, such as LexRank, TextRank, Lead, Luhn, LSA, and SumBasic. Experimental results indicate that our approach is more effective than these existing summarization methods. |
|---|---|
| DOI: | 10.1007/s42979-023-02241-x |
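
The abstract above names an N-gram language model estimated by maximum likelihood and a sentence-ranking step, but this record gives no formulas. The following is only a minimal sketch of how such a pipeline could look, assuming a bigram model trained on the document itself and a hypothetical score equal to the mean bigram probability of each sentence. The function names, the scoring rule, and the whitespace tokenizer are illustrative assumptions, not the authors' method.

```python
import string
from collections import Counter

# Punctuation to strip from tokens; the Devanagari danda is added so that
# Hindi sentences are handled too (an assumption made for this sketch).
PUNCT = string.punctuation + "\u0964"


def tokenize(sentence):
    """Split on whitespace, lowercase, and strip surrounding punctuation."""
    tokens = (w.strip(PUNCT) for w in sentence.lower().split())
    return [t for t in tokens if t]


def bigram_mle(sentences):
    """MLE bigram model: P(w2 | w1) = count(w1, w2) / count(w1 as context)."""
    context_counts = Counter()
    bigram_counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            context_counts[w1] += 1
            bigram_counts[(w1, w2)] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}


def score_sentence(sentence, model):
    """Hypothetical score: mean bigram probability of the sentence."""
    tokens = ["<s>"] + tokenize(sentence) + ["</s>"]
    bigrams = list(zip(tokens, tokens[1:]))
    return sum(model.get(bg, 0.0) for bg in bigrams) / len(bigrams)


def summarize(sentences, k):
    """Return the k highest-scoring sentences in their original order."""
    model = bigram_mle(sentences)
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: score_sentence(sentences[i], model),
        reverse=True,
    )
    return [sentences[i] for i in sorted(ranked[:k])]


if __name__ == "__main__":
    document = [
        "The cat sat on the mat.",
        "The cat ran to the door.",
        "Dogs bark loudly at night.",
    ]
    for line in summarize(document, 2):
        print(line)
```

Because the tokenizer only splits on whitespace and strips punctuation, the same code runs unchanged on English or Hindi text, which is in the spirit of the language-independent framing in the abstract. A real system would likely need smoothing for unseen N-grams and a more careful scoring rule.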