On the classification of text documents taking into account their structural features

Bibliographic details
Published in:	Journal of Computer & Systems Sciences International, Volume 55; Issue 3; pp. 394-403
Main authors:	Gulin, V. V., Frolov, A. B.
Format:	Journal Article
Language:	English
Published:	Moscow, Pleiades Publishing, 1 May 2016; Springer Nature B.V.
Subjects:	Artificial intelligence; Classification; Classifiers; Collection; Computer science; Computer simulation; Control; Dictionaries; Documents; Engineering; Machine learning; Mathematical analysis; Mathematical models; Mechatronics; Names; Pattern Recognition and Image Processing; Random variables; Robotics; Studies; Text categorization; Texts
ISSN:	1064-2307, 1555-6530

Description
Summary:	The paper studies a modification of the conventional bag-of-words model that takes the structural features of text documents into account when classifying (categorizing) them with machine learning techniques. These features are described by relations on a set of certain lexemes, and the relation names are used as features alongside the lexeme names. This differs from the conventional model, in which only unary relations are used. The effectiveness of the proposed machine learning techniques is analyzed in computer experiments on the class of the Reuters-21578 collection with eight known classifiers. The results show that it is reasonable to apply the proposed models when classifying documents with simple classifiers.
DOI:	10.1134/S1064230716030102
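
The summary describes using relation names alongside lexeme names as features, but it does not say which relations the authors actually use. The following Python sketch is therefore only an illustration under stated assumptions: it invents a hypothetical binary relation, FOLLOWS, that holds when one lexeme immediately follows another, and the helper names `tokenize` and `extract_features` are likewise made up for this example. It is not the authors' method.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase alphanumeric lexemes (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_features(text, span=1):
    """Count lexeme features plus hypothetical binary-relation features.

    Unary features are the plain bag-of-words counts. Binary features use an
    assumed relation FOLLOWS(a, b), meaning b occurs within `span` positions
    after a. The relation and its name are illustrative, not from the paper.
    """
    tokens = tokenize(text)
    features = Counter(tokens)  # conventional model: lexeme names only
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + span]:
            features[f"FOLLOWS({left},{right})"] += 1
    return features


features = extract_features("Oil prices rise as oil prices fall")
```

The resulting counter can be fed to any classifier that accepts sparse count features. Because the relation is ordered, FOLLOWS(oil,prices) and FOLLOWS(prices,oil) are distinct features, which is the kind of structural information a plain bag of words discards.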