A similarity-based semi-supervised algorithm for labeling unlabeled text data


Bibliographic details
Published in: Expert systems with applications, Volume 296, p. 128941
Main authors: Potshangbam, Kirankumar Singh; Singh, Kshetrimayum Nareshkumar
Format: Journal Article
Language: English
Published: Elsevier Ltd, 15 January 2026
ISSN:0957-4174
Description
Summary: This paper presents a novel, non-iterative semi-supervised learning algorithm that automatically labels unlabeled text data using the cosine similarity between document vectors and class mean vectors. The proposed method supports multiple vectorization techniques, including CountVectorizer, TF-IDF, and Doc2Vec, and is classifier-agnostic: it is compatible with both traditional and deep learning models such as KNN, Multinomial Naïve Bayes, SGDClassifier, Logistic Regression, Feedforward Neural Networks (FNN), and Convolutional Neural Networks (CNN). Extensive experiments on benchmark datasets (BBC, Inshorts, 20-newsgroups) show that (1) the method achieves 96.88% accuracy on BBC, 93.59% on Inshorts, and 92.49% on 20-newsgroups with only 30% labeled data, thereby reducing manual labeling effort by over 99%; (2) TF-IDF consistently outperforms CountVectorizer and Doc2Vec by 3–12 percentage points in accuracy across most experimental settings; and (3) Logistic Regression and FNN achieve the best performance among the classifiers. The method offers a practical, resource-efficient solution for real-world text classification by bridging the gap between labeled and unlabeled data without iterative retraining.
DOI:10.1016/j.eswa.2025.128941
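
The labeling idea described in the summary (compare each unlabeled document vector with the mean vector of each class and take the most similar class, in a single pass) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no details beyond the abstract, so the toy bag-of-words `vectorize` function (a stand-in for CountVectorizer), the helper names `class_means` and `pseudo_label`, and the sample sentences are all assumptions made for illustration. The paper may apply additional steps, such as different vectorizers or a confidence threshold, that are not shown here.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Toy bag-of-words term counts (an assumed stand-in for CountVectorizer)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def class_means(labeled):
    """Average the document vectors of each class into one mean vector."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def pseudo_label(unlabeled, means):
    """Single pass: give each document the label of its most similar class mean."""
    results = []
    for text in unlabeled:
        vec = vectorize(text)
        best = max(means, key=lambda label: cosine(vec, means[label]))
        results.append((text, best))
    return results


# Hypothetical sample data, invented for this sketch.
labeled = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("the central bank raised interest rates", "business"),
    ("shares fell as the market closed", "business"),
]
unlabeled = [
    "a late goal decided the match",
    "interest rates moved the market",
]

means = class_means(labeled)
for text, label in pseudo_label(unlabeled, means):
    print(label, "<-", text)
```

In this sketch the class means are computed once from the labeled documents and never updated, which mirrors the non-iterative property claimed in the summary. The pseudo-labeled documents could then be merged with the labeled set to train any downstream classifier.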