Semi-Supervised Learning in Large Scale Text Categorization

Bibliographic details
Published in: Shanghai jiao tong da xue xue bao, Volume 22; Issue 3; pp. 291 - 302
Main author: 许泽文 李建强 刘博 毕敬 李蓉 毛睿
Format: Journal Article
Language: English
Publication details: Shanghai: Shanghai Jiaotong University Press, 01.06.2017
Springer Nature B.V
ISSN: 1007-1172, 1995-8188
Description
Summary: The rapid development of the Internet has produced a wide variety of raw information, including text and audio. However, the sheer volume makes it difficult to find the most useful knowledge quickly and accurately. Automatic text classification based on machine learning can assign large numbers of natural-language documents to the corresponding subject categories according to their semantics, which helps readers grasp text information directly. A traditional supervised classifier for text categorization (TC) is obtained by learning from a set of hand-labeled documents. However, labeling all data by hand is labor intensive and time consuming. To address this, some scholars proposed semi-supervised learning to train the classifier, but this is infeasible for the variety and volume of Web data because it still requires some hand-labeled data. In 2012, Li et al. introduced a fully automatic categorization approach for text (FACT) based on supervised learning, which requires no manual labeling. However, automatically labeling all data can introduce noise, so the results may fail to meet the accuracy requirement. We propose instead to automatically tag only the portion of the data that can be labeled with high accuracy based on the semantics of the category name, and then to train the classifier in a semi-supervised way on both labeled and unlabeled data, ultimately achieving precise classification of massive text data. Empirical experiments show that the method outperforms the supervised support vector machine (SVM) in terms of both F1 and classification accuracy in most cases, which supports the effectiveness of the semi-supervised algorithm in automatic TC.
Bibliography: 31-1943/U
DOI: 10.1007/s12204-017-1835-3
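
The two-stage idea in the summary (automatically label only the documents that can be tagged confidently from the category name, then train on labeled and unlabeled data together) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the record does not say how category-name semantics are matched or which semi-supervised algorithm is used, so exact keyword matching and self-training over a small Naive Bayes classifier stand in for both. All function names, the toy documents, and the `threshold` value are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def seed_labels(docs, categories):
    """Assumed seeding rule: label a document only when exactly one
    category name occurs in it; everything else stays unlabeled."""
    labels = {}
    for i, doc in enumerate(docs):
        tokens = set(tokenize(doc))
        hits = [c for c in categories if c in tokens]
        if len(hits) == 1:
            labels[i] = hits[0]
    return labels


def train_nb(docs, labels, categories):
    """Collect the counts needed for multinomial Naive Bayes."""
    vocab = {w for d in docs for w in tokenize(d)}
    word_counts = {c: Counter() for c in categories}
    class_counts = Counter()
    for i, c in labels.items():
        class_counts[c] += 1
        word_counts[c].update(tokenize(docs[i]))
    return vocab, word_counts, class_counts


def predict_proba(model, doc, categories):
    """Posterior over categories, using add-one smoothing."""
    vocab, word_counts, class_counts = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for c in categories:
        total_words = sum(word_counts[c].values())
        score = math.log((class_counts[c] + 1) / (total_docs + len(categories)))
        for w in tokenize(doc):
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        log_scores[c] = score
    top = max(log_scores.values())
    exp = {c: math.exp(s - top) for c, s in log_scores.items()}
    norm = sum(exp.values())
    return {c: v / norm for c, v in exp.items()}


def self_train(docs, categories, threshold=0.6, max_rounds=10):
    """Self-training: repeatedly add confident predictions to the labeled set."""
    labels = seed_labels(docs, categories)
    for _ in range(max_rounds):
        model = train_nb(docs, labels, categories)
        newly_labeled = {}
        for i, doc in enumerate(docs):
            if i in labels:
                continue
            probs = predict_proba(model, doc, categories)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                newly_labeled[i] = best
        if not newly_labeled:
            break
        labels.update(newly_labeled)
    return labels, train_nb(docs, labels, categories)


# Toy corpus (hypothetical): only the first two documents name a category.
categories = ["sports", "finance"]
docs = [
    "sports news football match score",
    "finance report stock market shares",
    "football match score tonight",
    "stock market shares fall",
]
seeds = seed_labels(docs, categories)
labels, model = self_train(docs, categories)
```

Here the seeding step labels only the two documents that name a category, and the self-training loop then extends labels to the remaining documents through shared vocabulary. The confidence threshold controls the trade-off the summary points to: a lower value labels more data automatically but admits more noise.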