Data augmentation in natural language processing: a novel text generation approach for long and short text classifiers

Bibliographic details
Published in:	International Journal of Machine Learning and Cybernetics, Volume 14; Issue 1; pp. 135-150
Main authors:	Bayer, Markus; Kaufhold, Marc-André; Buchhold, Björn; Keller, Marcel; Dallmeyer, Jörg; Reuter, Christian
Format:	Journal Article
Language:	English
Publication details:	Berlin/Heidelberg: Springer Berlin Heidelberg, 01.01.2023; Springer Nature B.V.
Subjects:	Artificial Intelligence; Classification; Classifiers; Complex Systems; Computational Intelligence; Control; Data analysis; Data augmentation; Datasets; Deep learning; Engineering; Linguistics; Machine learning; Mechatronics; Methods; Natural language processing; Original; Original Article; Pattern Recognition; Robotics; Sentiment analysis; Systems Biology; Textual data augmentation; Text generation; Small text data analytics; Long and short text classifier
ISSN:	1868-8071, 1868-808X

Description
Summary:	In many cases of machine learning, research suggests that the development of training data might have a higher relevance than the choice and modelling of classifiers themselves. Thus, data augmentation methods have been developed to improve classifiers by artificially created training data. In NLP, there is the challenge of establishing universal rules for text transformations which provide new linguistic patterns. In this paper, we present and evaluate a text generation method suitable to increase the performance of classifiers for long and short texts. We achieved promising improvements when evaluating short as well as long text tasks with the enhancement by our text generation method. Especially with regard to small data analytics, additive accuracy gains of up to 15.53% and 3.56% are achieved within a constructed low data regime, compared to the no augmentation baseline and another data augmentation technique. As the current track of these constructed regimes is not universally applicable, we also show major improvements in several real world low data tasks (up to +4.84 F1-score). Since we are evaluating the method from many perspectives (in total 11 datasets), we also observe situations where the method might not be suitable. We discuss implications and patterns for the successful application of our approach on different types of datasets.
DOI:	10.1007/s13042-022-01553-3