Multilingual Sentiment Analysis with Data Augmentation: A Cross-Language Evaluation in French, German, and Japanese.
Saved in:
| Title: | Multilingual Sentiment Analysis with Data Augmentation: A Cross-Language Evaluation in French, German, and Japanese. |
|---|---|
| Authors: | Alkhushayni, Suboh (AUTHOR) suboh.alkhushayni@yu.edu.jo; Lee, Hyesu (AUTHOR) |
| Source: | Information. Sep2025, Vol. 16 Issue 9, p806. 24p. |
| Subject Terms: | *Sentiment analysis, *Natural language processing, *Machine translating, *Translating & interpreting, Data augmentation, Multilingualism, Support vector machines |
| Abstract: | Machine learning in natural language processing (NLP) analyzes datasets to make future predictions, but developing accurate models requires large, high-quality, and balanced datasets. However, collecting such datasets, especially for low-resource languages, is time-consuming and costly. As a solution, data augmentation can be used to increase the dataset size by generating synthetic samples from existing data. This study examines the effect of translation-based data augmentation on sentiment analysis using small datasets in three diverse languages: French, German, and Japanese. We use two neural machine translation (NMT) services—Google Translate and DeepL—to generate augmented datasets through intermediate language translation. Sentiment analysis models based on Support Vector Machine (SVM) are trained on both original and augmented datasets and evaluated using accuracy, precision, recall, and F1 score. Our results demonstrate that translation augmentation significantly enhances model performance in both French and Japanese. For example, using Google Translate, model accuracy improved from 62.50% to 83.55% in Japanese (+21.05%) and from 87.66% to 90.26% in French (+2.6%). In contrast, the German dataset showed a minor improvement or decline, depending on the translator used. Google-based augmentation generally outperformed DeepL, which yielded smaller or negative gains. To evaluate cross-lingual generalization, models trained on one language were tested on datasets in the other two. Notably, a model trained on augmented German data improved its accuracy on French test data from 81.17% to 85.71% and on Japanese test data from 71.71% to 79.61%. Similarly, a model trained on augmented Japanese data improved accuracy on German test data by up to 3.4%. These findings highlight that translation-based augmentation can enhance sentiment classification and cross-language adaptability, particularly in low-resource and multilingual NLP settings. 
[ABSTRACT FROM AUTHOR] |
| Database: | Library, Information Science & Technology Abstracts |
| ISSN: | 2078-2489 |
| DOI: | 10.3390/info16090806 |
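The augmentation pipeline described in the abstract — back-translating each training sample through an intermediate language with an NMT service, then training an SVM classifier on the enlarged set — can be sketched as follows. This is a minimal illustration, not the authors' code: the corpus is hypothetical, and `back_translate` is a stand-in for a real round trip through Google Translate or DeepL.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

# Tiny hypothetical corpus; the paper used small French, German,
# and Japanese sentiment datasets (1 = positive, 0 = negative).
train_texts = [
    "great movie", "loved it", "wonderful film",
    "terrible plot", "awful acting", "boring and bad",
]
train_labels = [1, 1, 1, 0, 0, 0]

def back_translate(text: str) -> str:
    """Placeholder for NMT round-trip augmentation.

    A real implementation would translate `text` into an
    intermediate language and back (two API calls to Google
    Translate or DeepL), yielding a paraphrase with the same
    sentiment label. Here we return the text unchanged purely
    to show where that step fits in the pipeline.
    """
    return text

# Augment: append one back-translated copy of each sample,
# reusing the original label (sentiment is assumed preserved).
aug_texts = train_texts + [back_translate(t) for t in train_texts]
aug_labels = train_labels * 2

# SVM sentiment classifier, as in the paper, over TF-IDF features.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(aug_texts, aug_labels)

pred = model.predict(["loved it"])[0]
```

With real NMT calls in `back_translate`, the augmented set doubles (or more, with several intermediate languages), which is the mechanism the abstract credits for the accuracy gains on the French and Japanese datasets.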