A similarity-based semi-supervised algorithm for labeling unlabeled text data

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 296, p. 128941
Main Authors: Potshangbam, Kirankumar Singh, Singh, Kshetrimayum Nareshkumar
Format: Journal Article
Language: English
Published: Elsevier Ltd 15.01.2026
ISSN: 0957-4174
Description
Summary: This paper presents a novel, non-iterative semi-supervised learning algorithm that leverages cosine similarity between document vectors and class mean vectors to label unlabeled text data automatically. The proposed method supports multiple vectorization techniques, including CountVectorizer, TF-IDF, and Doc2Vec, and is classifier-agnostic, working with both traditional and deep learning models such as KNN, Multinomial Naïve Bayes, SGDClassifier, Logistic Regression, Feedforward Neural Networks (FNN), and Convolutional Neural Networks (CNN). Extensive experiments on benchmark datasets (BBC, Inshorts, 20-newsgroups) demonstrate that (1) the method achieves 96.88% accuracy on BBC, 93.59% on Inshorts, and 92.49% on 20-newsgroups with only 30% labeled data, reducing manual labeling effort by over 99%; (2) TF-IDF consistently outperforms CountVectorizer and Doc2Vec by 3–12 percentage points in accuracy across most experimental settings; and (3) Logistic Regression and FNN achieve the best performance among the classifiers evaluated. The method offers a practical, resource-efficient solution for real-world text classification by bridging the gap between labeled and unlabeled data without iterative retraining.
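To make the core step concrete, the following is a minimal sketch of the labeling idea described in the abstract, not the authors' implementation. It assumes scikit-learn's TfidfVectorizer and cosine_similarity plus NumPy; the toy documents, labels, and variable names are invented for illustration. Each class is summarized by the mean of its TF-IDF vectors, and every unlabeled document receives the label of the most cosine-similar class mean.

    # Sketch only (not the paper's code): build TF-IDF vectors, compute a
    # mean vector per class from the labeled subset, then pseudo-label each
    # unlabeled document by cosine similarity. Toy data for illustration.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    labeled_docs   = ["stocks rallied after strong earnings",
                      "the team won the cup final"]
    labels         = ["business", "sport"]
    unlabeled_docs = ["stocks fell after weak earnings"]

    vectorizer  = TfidfVectorizer()
    X_labeled   = vectorizer.fit_transform(labeled_docs).toarray()
    X_unlabeled = vectorizer.transform(unlabeled_docs).toarray()

    # One mean TF-IDF vector per class, computed from the labeled documents.
    classes     = sorted(set(labels))
    class_means = np.vstack([
        X_labeled[[i for i, y in enumerate(labels) if y == c]].mean(axis=0)
        for c in classes
    ])

    # Pseudo-label each unlabeled document with its most similar class mean.
    similarities  = cosine_similarity(X_unlabeled, class_means)
    pseudo_labels = [classes[i] for i in similarities.argmax(axis=1)]
    print(pseudo_labels)  # ['business'] for this toy example

The pseudo-labeled documents can then be merged with the original labeled set and fed to any downstream classifier (e.g. Logistic Regression or an FNN), which is what the abstract means by the method being classifier-agnostic and free of iterative retraining.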
DOI: 10.1016/j.eswa.2025.128941