Enhancing Hate Speech Detection in Low-Resource Code-Mixed Indonesian Tweets via GPT-Based Data Augmentation.

Detailed bibliography
Title: Enhancing Hate Speech Detection in Low-Resource Code-Mixed Indonesian Tweets via GPT-Based Data Augmentation.
Authors: Pamungkas, Endang Wahyu; Purworini, Dian; Widayat, Widi; Putri, Divi Galih Prasetyo; Amal, Ikhlasul
Source: Engineering, Technology & Applied Science Research; Dec 2025, Vol. 15, Issue 6, p30649-30656, 8p
Subjects: Data augmentation, Transformer models, Machine learning, Indonesian language, Paraphrase, Low-resource languages, Social media, Invective
Abstract: Automatic hate speech detection in low-resource, code-mixed languages, such as Indonesian social media environments, presents significant challenges due to the scarcity of annotated data and the linguistic variability introduced by code-mixing. However, due to the growing prevalence of hate speech on social media, there is a need for robust hate speech detection systems. This study investigates the effectiveness of data augmentation strategies, specifically Generative Pretrained Transformer (GPT)-based paraphrasing and aggressive text transformation, in enhancing the performance of hate speech detection models for Indonesian code-mixed tweets. To achieve that, we employed traditional machine learning models, Recurrent Neural Network (RNN)-based models, and transformer-based models to assess the impact of these augmentation strategies. Our findings reveal that GPT-generated data improve model performance, with transformer-based models, including Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) and the Cross-lingual Language Model Robustly Optimized BERT Pretraining approach (XLM-RoBERTa). [ABSTRACT FROM AUTHOR]
Database: Complementary Index
ISSN: 2241-4487
DOI: 10.48084/etasr.14342
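The augmentation strategy summarized in the abstract can be sketched as a label-preserving expansion of the training set: each tweet is paraphrased while its hate/non-hate label is kept. The sketch below is a minimal illustration, not the authors' pipeline; `toy_paraphrase` is a hypothetical stand-in for the GPT call, whose actual prompts and model the record does not specify.

```python
from typing import Callable, List, Tuple

def toy_paraphrase(text: str) -> str:
    # Trivial stand-in: a real system would ask a GPT model to rephrase
    # the tweet while preserving its (code-mixed) meaning and label.
    return " ".join(reversed(text.split()))

def augment(dataset: List[Tuple[str, int]],
            paraphrase: Callable[[str], str]) -> List[Tuple[str, int]]:
    """Return the original examples plus one label-preserving
    paraphrase per example, doubling the training set."""
    augmented = list(dataset)
    for text, label in dataset:
        augmented.append((paraphrase(text), label))
    return augmented

# Two fabricated code-mixed Indonesian/English examples (1 = hateful).
data = [("gue benci banget sama this guy", 1),
        ("selamat pagi have a nice day", 0)]
bigger = augment(data, toy_paraphrase)
```

The augmented set would then be fed to any of the classifiers the study compares (traditional ML, RNN-based, or transformer-based models such as IndoBERT and XLM-RoBERTa).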