A similarity-based semi-supervised algorithm for labeling unlabeled text data
| Published in: | Expert systems with applications Vol. 296; p. 128941 |
|---|---|
| Main Authors: | , |
| Format: | Journal Article |
| Language: | English |
| Published: | Elsevier Ltd, 15.01.2026 |
| ISSN: | 0957-4174 |
| Summary: | This paper presents a novel, non-iterative semi-supervised learning algorithm that uses cosine similarity between document vectors and class mean vectors to label unlabeled text data automatically. The proposed method supports multiple vectorization techniques, including CountVectorizer, TF-IDF, and Doc2Vec, and is classifier-agnostic, making it compatible with both traditional and deep learning models such as KNN, Multinomial Naïve Bayes, SGDClassifier, Logistic Regression, Feedforward Neural Networks (FNN), and Convolutional Neural Networks (CNN). Extensive experiments on benchmark datasets (BBC, Inshorts, 20-newsgroups) show that (1) the method achieves 96.88% accuracy on BBC, 93.59% on Inshorts, and 92.49% on 20-newsgroups with only 30% labeled data, thereby reducing manual labeling effort by over 99%; (2) TF-IDF consistently outperforms CountVectorizer and Doc2Vec by 3–12 percentage points in accuracy across most experimental settings; and (3) Logistic Regression and FNN achieve the best performance among the classifiers. The method offers a practical, resource-efficient solution for real-world text classification by bridging the gap between labeled and unlabeled data without iterative retraining. |
|---|---|
| DOI: | 10.1016/j.eswa.2025.128941 |
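
The labeling idea described in the summary (compare each unlabeled document with per-class mean vectors and take the most similar class) can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no algorithmic details beyond the summary, so the whitespace tokenizer, the raw term counts (standing in for CountVectorizer or TF-IDF), the toy documents, and the helper names `vectorize`, `cosine`, `class_means`, and `pseudo_label` are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for CountVectorizer or TF-IDF)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def class_means(labeled_docs):
    """Average the vectors of the labeled documents in each class."""
    sums = defaultdict(Counter)
    counts = Counter()
    for text, label in labeled_docs:
        sums[label].update(vectorize(text))
        counts[label] += 1
    return {
        label: {term: total / counts[label] for term, total in vec.items()}
        for label, vec in sums.items()
    }


def pseudo_label(unlabeled_docs, means):
    """Single pass: give each document the label of its most similar class mean."""
    return [
        max(means, key=lambda label: cosine(vectorize(text), means[label]))
        for text in unlabeled_docs
    ]


# Hypothetical toy data, not drawn from BBC, Inshorts, or 20-newsgroups.
labeled = [
    ("the team won the football match", "sport"),
    ("the striker scored a late goal", "sport"),
    ("shares fell as the market closed", "business"),
    ("the bank raised interest rates", "business"),
]
unlabeled = ["the goal decided the match", "interest rates moved the market"]

means = class_means(labeled)
print(pseudo_label(unlabeled, means))  # ['sport', 'business']
```

Because the labels are assigned in one pass with no retraining loop, the pseudo-labeled documents could then be merged with the labeled set and passed to any downstream classifier, which is one plausible reading of "classifier-agnostic" in the summary.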