DravidianCodeMix: sentiment analysis and offensive language identification dataset for Dravidian languages in code-mixed text


Bibliographic details
Published in: Language Resources and Evaluation, Vol. 56, No. 3, pp. 765-806
Main authors: Chakravarthi, Bharathi Raja; Priyadharshini, Ruba; Muralidaran, Vigneshwaran; Jose, Navya; Suryawanshi, Shardul; Sherly, Elizabeth; McCrae, John P.
Format: Journal Article
Language: English
Published: Dordrecht: Springer Netherlands, 1 September 2022
Springer Nature B.V.
ISSN: 1574-020X, 1574-0218
Description
Summary: This paper describes the development of a multilingual, manually annotated dataset, built from social media comments, for three under-resourced Dravidian languages. The dataset was annotated for sentiment analysis and offensive language identification and covers more than 60,000 YouTube comments in total. It consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and shows high inter-annotator agreement as measured by Krippendorff’s alpha. The dataset contains all types of code-mixing phenomena, since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning and deep learning methods. The dataset is available on GitHub and Zenodo.
DOI: 10.1007/s10579-022-09583-7