Dynamic topic modelling for exploring the scientific literature on coronavirus: an unsupervised labelling technique

Bibliographic Details
Title: Dynamic topic modelling for exploring the scientific literature on coronavirus: an unsupervised labelling technique
Authors: Guillén-Pacho, Ibai; Badenes-Olmedo, Carlos; Corcho, Oscar
Contributors: Universidad Politécnica de Madrid
Source: International Journal of Data Science and Analytics ; volume 20, issue 3, page 2551-2581 ; ISSN 2364-415X 2364-4168
Publisher Information: Springer Science and Business Media LLC
Publication Year: 2024
Description: The work presented in this article focusses on improving the interpretability of probabilistic topic models created from a large collection of scientific documents that evolve over time. Several time-dependent approaches based on topic models were compared to analyse the annual evolution of latent concepts in the CORD-19 corpus: Dynamic Topic Model, Dynamic Embedded Topic Model, and BERTopic. Then, the COVID-19 period (December 2019–present) was analysed in greater depth, month by month, to explore the evolution of what is written about the disease. The evaluations suggest that the Dynamic Topic Model is the best choice to analyse the CORD-19 corpus. A novel topic labelling strategy is proposed for dynamic topic models to analyse the evolution of latent concepts. It incorporates content changes in both the annual evolution of the corpus and the monthly evolution of the COVID-19 disease. The generated labels are manually validated using two approaches: through the most relevant documents on the topic and through the documents that share the most semantically similar label topics. The labelling enables the interpretation of topics. The novel method for dynamic topic labelling fits the content of each topic and supports the semantics of the topics.
Document Type: article in journal/newspaper
Language: English
DOI: 10.1007/s41060-024-00610-0
Availability: https://doi.org/10.1007/s41060-024-00610-0
https://link.springer.com/content/pdf/10.1007/s41060-024-00610-0.pdf
https://link.springer.com/article/10.1007/s41060-024-00610-0/fulltext.html
Rights: https://creativecommons.org/licenses/by/4.0
Accession Number: edsbas.2D9F0C24
Database: BASE