Optimal Number of Topics in Topic Modeling Using Deep Auto Encoder Graph Regularized Non-Negative Matrix Factorization Algorithm

Bibliographic details
Published in: Journal of Systems Science and Systems Engineering, Vol. 34, No. 3, pp. 257-283
Main authors: Kherwa, Pooja; Arora, Jyoti
Format: Journal Article
Language: English
Publication details: Berlin/Heidelberg: Springer Berlin Heidelberg, 01.06.2025
Springer Nature B.V.
ISSN: 1004-3756, 1861-9576
Description
Summary: Topic modeling is a well-explored, foundational problem in text mining. Traditional topic models based on word co-occurrences aim to expose the latent semantic structure embedded in a document corpus. However, the brevity of short texts introduces data sparsity, which hinders conventional topic models and yields suboptimal results on such text. Short texts typically cover a restricted number of topics, so understanding their semantic content requires relevant background knowledge. Motivated by these observations, this research introduces a novel Deep Auto Encoder Graph Regularized Non-negative Matrix Factorization (DAGR-NMF) algorithm to uncover significant and meaningful topics in short documents. The proposed work has three main phases: preprocessing, feature extraction, and topic modeling. First, the data are preprocessed using natural language preprocessing tasks such as stop word removal, stemming, and lemmatization. Then, feature extraction is performed using a hybrid Absolute Deviation Factors-Class Term Frequency (ADF-CTF) scheme to capture the most relevant information from the text. Finally, topic modeling is performed using the proposed DAGR-NMF approach. Experimental findings show that DAGR-NMF outperforms all other compared techniques, achieving NMI values of 0.852, 0.857, 0.793, and 0.831 on the Associated Press, political blog, 20NewsGroups, and News Category datasets, respectively.
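The summary reports results as NMI (normalized mutual information) values, which measure how well the topic assigned to each document agrees with its reference label. The sketch below shows one common way such a score can be computed, using only the Python standard library. It is not the authors' evaluation code: the function name `normalized_mutual_info` and the toy label lists are hypothetical, and the arithmetic-mean normalization, 2 I(U;V) / (H(U) + H(V)), is an assumption, since the record does not say which NMI variant the paper uses.

```python
import math
from collections import Counter


def normalized_mutual_info(labels_true, labels_pred):
    """NMI with arithmetic-mean normalization: 2 * I(U;V) / (H(U) + H(V)).

    Assumption: the normalization variant is chosen for illustration only.
    """
    n = len(labels_true)
    if n == 0 or n != len(labels_pred):
        raise ValueError("label lists must be non-empty and of equal length")

    # Cluster sizes for each labeling and for their joint assignment.
    count_true = Counter(labels_true)
    count_pred = Counter(labels_pred)
    count_joint = Counter(zip(labels_true, labels_pred))

    def entropy(counts):
        # Shannon entropy (natural log) of the empirical label distribution.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    h_true = entropy(count_true)
    h_pred = entropy(count_pred)

    # I(U;V) = sum over joint cells of p(t,p) * log(p(t,p) / (p(t) * p(p))).
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in count_joint.items()
    )

    denom = h_true + h_pred
    # If both labelings put everything in one cluster, treat them as agreeing.
    return 1.0 if denom == 0 else 2.0 * mutual_info / denom


# Hypothetical toy data: reference classes vs. topics assigned by a model.
reference = [0, 0, 1, 1]
same_up_to_renaming = [1, 1, 0, 0]
unrelated = [0, 1, 0, 1]
score_same = normalized_mutual_info(reference, same_up_to_renaming)
score_unrelated = normalized_mutual_info(reference, unrelated)
```

Because NMI ignores the names of the clusters, a topic assignment that matches the reference classes up to relabeling scores 1 (up to floating-point rounding), while an assignment that is statistically independent of the reference scores 0. The reported values between 0.79 and 0.86 would sit between those two extremes under this definition.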
DOI:10.1007/s11518-024-5639-3