Features extraction based on Naive Bayes algorithm and TF-IDF for news classification

Bibliographic details
Published in: PloS one, Vol. 20, No. 7, p. e0327347
Main author: Zhang, Li
Format: Journal Article
Language: English
Publication details: United States, Public Library of Science (PLoS), 30 July 2025
ISSN: 1932-6203
Description
Summary: The rapid proliferation of online news demands robust automated classification systems to enhance information organization and personalized recommendation. Although traditional methods like TF-IDF with Naive Bayes provide foundational solutions, their limitations in capturing semantic nuances and handling real-time demands hinder practical applications. This study proposes a hybrid news classification framework that integrates classical machine learning with modern advances in NLP to address these challenges. Our methodology introduces three key innovations: (1) Domain-Specific Feature Engineering, combining tailored n-grams and entity-aware TF-IDF weighting to amplify discriminative terms; (2) BERT-Guided Feature Selection, leveraging distilled BERT to identify contextually important words and resolve rare-term ambiguities; and (3) Computationally Efficient Deployment, achieving 95.2% of the accuracy of BERT at 1/52.4th of the inference cost. Evaluated on a balanced corpus of Sina News articles in 11 categories, the system demonstrates a test precision of 95.12% (vs. 84.43% for the SVM+TF-IDF baseline), with statistically significant improvements confirmed by 5-fold cross-validation (p < 0.01). The critical findings reveal strong performance in distinguishing semantically distinct categories, while exposing challenges in fine-grained differentiation. The efficiency of the framework (2.1 inference latency) and its scalability (linear utilization of CPU resources) validate its practicality for real-world deployment. This work bridges the gap between traditional feature engineering and transformer-based models, offering a cost-effective solution for news platforms. Future research will explore hierarchical classification and the adaptation of dynamic topics to further refine semantic boundaries.
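
For readers unfamiliar with the traditional baseline the summary refers to, the following is a minimal sketch of TF-IDF weighting combined with a multinomial Naive Bayes classifier, written with the Python standard library only. It is not the hybrid framework proposed in the article: the function names (`fit_idf`, `tfidf`, `train_nb`, `predict`), the smoothed IDF formula, the Laplace smoothing constant `alpha`, and the four toy documents are all assumptions made for illustration, and the TF-IDF weights are simply treated as fractional term counts.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Illustrative tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(doc, idf):
    # Raw term count times IDF; terms unseen during training are dropped.
    counts = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train_nb(docs, labels, alpha=1.0):
    # Multinomial Naive Bayes with TF-IDF weights used as fractional counts.
    idf = fit_idf(docs)
    weight = defaultdict(lambda: defaultdict(float))  # weight[label][term]
    for doc, label in zip(docs, labels):
        for t, w in tfidf(doc, idf).items():
            weight[label][t] += w
    vocab_size = len(idf)
    model = {"idf": idf, "log_prior": {}, "log_like": {}, "log_unseen": {}}
    for label, n_docs in Counter(labels).items():
        total = sum(weight[label].values()) + alpha * vocab_size
        model["log_prior"][label] = math.log(n_docs / len(docs))
        model["log_like"][label] = {
            t: math.log((w + alpha) / total) for t, w in weight[label].items()
        }
        # Log-probability of a vocabulary term never seen with this label.
        model["log_unseen"][label] = math.log(alpha / total)
    return model


def predict(model, doc):
    # Pick the label with the highest log-posterior score.
    features = tfidf(doc, model["idf"])
    scores = {}
    for label, prior in model["log_prior"].items():
        like = model["log_like"][label]
        unseen = model["log_unseen"][label]
        scores[label] = prior + sum(
            w * like.get(t, unseen) for t, w in features.items()
        )
    return max(scores, key=scores.get)


# Toy corpus, invented for this example (not the Sina News data).
docs = [
    "stocks fall as markets slide",
    "central bank raises interest rates",
    "team wins the final match",
    "striker scores twice in the cup final",
]
labels = ["finance", "finance", "sports", "sports"]

model = train_nb(docs, labels)
print(predict(model, "bank cuts interest rates"))
print(predict(model, "the cup final match"))
```

A baseline of this kind scores each word independently of its context, which is the limitation in capturing semantic nuances that the summary attributes to traditional methods.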
Competing Interests: The authors have declared that no competing interests exist.
DOI: 10.1371/journal.pone.0327347