A Survey on Text Classification Algorithms: From Text to Predictions

Bibliographic details
Published in:	Information (Basel) Volume 13; Issue 2; p. 83
Main authors:	Gasparetto, Andrea, Marcuzzo, Matteo, Zangari, Alessandro, Albarelli, Andrea
Format:	Journal Article
Language:	English
Published:	Basel MDPI AG 01.02.2022
Subjects:	Algorithms; Artificial intelligence; Classification; Datasets; Deep learning; Electronic documents; English language; Feature extraction; Labeling; Machine learning; Natural language; Neural networks; news classification; Sentiment analysis; shallow learning; Text categorization; text classification; tokenisation; topic labelling; transformer
ISSN:	2078-2489

Description
Summary:	In recent years, the exponential growth of digital documents has been met by rapid progress in text classification techniques. Newly proposed machine learning algorithms leverage the latest advancements in deep learning methods, allowing for the automatic extraction of expressive features. The swift development of these methods has led to a plethora of strategies to encode natural language into machine-interpretable data. The latest language modelling algorithms are used in conjunction with ad hoc preprocessing procedures, whose description is often omitted in favour of a more detailed explanation of the classification step. This paper offers a concise review of recent text classification models, with emphasis on the flow of data, from raw text to output labels. We highlight the differences between earlier methods and more recent, deep learning-based methods, both in how they function and in how they transform input data. To give a better perspective on the text classification landscape, we provide an overview of datasets for the English language and supply instructions for the synthesis of two new multilabel datasets, a type of dataset we found to be particularly scarce in this setting. Finally, we provide an outline of new experimental results and discuss the open research challenges posed by deep learning-based language models.
DOI:	10.3390/info13020083