Multi-label text classification on unbalanced Twitter with monolingual model and hyperparameter optimization for hate speech and abusive language detection

Bibliographic details
Published in: International Journal of Advanced and Applied Sciences, Vol. 11, No. 5, pp. 177-185
Main authors: Alzahrani, Ahmad A.; Bramantoro, Arif; Permana, Asep
Format: Journal Article
Language: English
Published: 1 May 2024
ISSN:2313-626X, 2313-3724
Description
Summary: The rise of hate speech and abusive language on social media leads to uncomfortable interactions among users. Many publicly available datasets that address hate speech and abusive language are imbalanced, particularly those from Indonesian Twitter. To develop a more effective classification model that also accounts for minority classes, we optimized the hyperparameters of a monolingual model, used four different data preprocessing scenarios, and improved the treatment of slang words. We assessed the model's effectiveness by its accuracy, which reached 81.38%. This result came from optimizing hyperparameters, processing data without stemming and removing stop words, and enhancing the slang word data. The optimal hyperparameters were a learning rate of 4e-5, a batch size of 16, and a dropout rate of 0.1. However, using too much dropout can decrease the model's performance and its ability to predict less common categories, such as physical- and gender-related hate speech.
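The hyperparameter search described in the summary can be illustrated with a minimal sketch. It is not the authors' code: the search space, the helper names (grid, exact_match_accuracy, select_best), and the choice of exact-match accuracy as the multi-label metric are all assumptions. Only the values 4e-5, 16, and 0.1 come from the summary.

```python
from itertools import product

# Hypothetical search space. Only the values marked "reported" appear in the
# summary; the other candidates are illustrative assumptions.
SEARCH_SPACE = {
    "learning_rate": [2e-5, 3e-5, 4e-5],  # 4e-5 reported as optimal
    "batch_size": [16, 32],               # 16 reported as optimal
    "dropout": [0.1, 0.3, 0.5],           # 0.1 reported as optimal
}


def grid(space):
    """Yield every combination of hyperparameters as a dict."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


def exact_match_accuracy(y_true, y_pred):
    """Fraction of samples whose whole label vector is predicted correctly.

    This is one common multi-label accuracy; whether the paper uses this
    definition is an assumption.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    matches = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return matches / len(y_true)


def select_best(space, evaluate):
    """Return the configuration with the highest score from evaluate(config)."""
    return max(grid(space), key=evaluate)
```

In practice, evaluate would fine-tune the model with the given configuration and return its validation accuracy; here it is left as a caller-supplied function so the sketch stays independent of any training library.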
DOI:10.21833/ijaas.2024.05.019