Automatic Dialect Detection for Low Resource Santali Language

Bibliographic details
Published in: 2021 19th OITS International Conference on Information Technology (OCIT), pp. 234-238
Main authors: Sahoo, Sunil Kumar; Mishra, Brojo Kishore; Parida, Shantipriya; Dash, Satya Ranjan; Besra, Jatindra Nath; Tello, Esau Villatoro
Format: Conference paper
Language: English
Published: IEEE, 1 December 2021
Description
Summary: Among natural language processing (NLP) tasks, language detection is considered difficult, especially for similar languages, varieties, and dialects. Building any NLP tool or technology requires a reliable and robust language detection tool. Although such tools are available for many high-resource languages, they are hard to find for low-resource languages because of the lack of training datasets. In this work, we compiled a language detection dataset for the low-resource language Santali. We considered Santali written in Odia script alongside standard Odia. We used a deep supervised autoencoder for language detection. The deep supervised autoencoder can distinguish Odia from Santali even though both are written in the same Odia script. We obtained an accuracy of 100% on the test set. Our proposed dataset will enrich the resources available for Santali and help researchers build tools and technologies for this low-resource language.
DOI: 10.1109/OCIT53463.2021.00055