RU-OLD: A Comprehensive Analysis of Offensive Language Detection in Roman Urdu Using Hybrid Machine Learning, Deep Learning, and Transformer Models
| Published in: | Algorithms Vol. 18; no. 7; p. 396 |
|---|---|
| Format: | Journal Article |
| Language: | English |
| Published: | Basel: MDPI AG, 01.07.2025 |
| ISSN: | 1999-4893 |
| Summary: | The detection of abusive language in Roman Urdu is important for secure digital interaction. This work investigates machine learning (ML), deep learning (DL), and transformer-based methods for detecting offensive language in Roman Urdu comments collected from YouTube news channels. Features were extracted with TF-IDF and Count Vectorizer over unigrams, bigrams, and trigrams. Four ML models were evaluated: Random Forest (RF), Logistic Regression (LR), Support Vector Machine (SVM), and Naïve Bayes (NB); of these, SVM performed best. Of the two DL models evaluated, Bi-LSTM and CNN, the CNN performed better. Moreover, transformer variants such as LLaMA 2 and ModernBERT (MBERT) were instantiated and fine-tuned with LoRA (Low-Rank Adaptation) for better efficiency. LoRA is a parameter-efficient fine-tuning method for large language models (LLMs) that adapts a model at very low computational cost. According to the experimental results, LLaMA 2 with LoRA attained the highest F1-score of 96.58%, greatly exceeding the performance of the other approaches. These results suggest that LoRA-fine-tuned transformers capture fine linguistic nuances well, making them well suited to Roman Urdu offensive language detection. The study compares the performance of conventional and contemporary NLP methods, highlighting the relevance of effective fine-tuning methods. Our findings pave the way for scalable and accurate automated moderation systems for online platforms supporting multiple languages. |
| DOI: | 10.3390/a18070396 |
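
The summary mentions count-based features over unigrams, bigrams, and trigrams. As a rough illustration of what that feature space looks like, the following is a minimal sketch using only the Python standard library. It is not the authors' pipeline: the function names `word_ngrams` and `count_features`, the lowercasing and whitespace tokenization, and the sample comment are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return word n-grams of length n_min..n_max.

    Assumption: tokens are obtained by lowercasing and splitting on
    whitespace; the record does not describe the actual tokenization.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the comment.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def count_features(text):
    """Map each n-gram to its raw count, in the spirit of a count vectorizer."""
    return Counter(word_ngrams(text))


# Hypothetical Roman Urdu comment, not taken from the paper's dataset.
sample = "yeh khabar bilkul ghalat hai"
features = count_features(sample)
```

A TF-IDF variant would reweight these raw counts by how rare each n-gram is across the whole comment collection; the resulting sparse vectors are the kind of input a linear classifier such as an SVM typically consumes.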