Balinese text-to-speech dataset as digital cultural heritage

Bibliographic Details
Published in: Data in Brief, Vol. 60, p. 111528
Main Authors: Kadyanan, I Gusti Agung Gede Arya; ER, Ngurah Agus Sanjaya; Karyawati, Anak Agung Istri Ngurah Eka; Putra, I Gede Ngurah Arya Wira; Gunawan, I Made Suma; Budiantari, Ni Made Julia; Octavia, Hana Christine
Format: Journal Article
Language: English
Published: Netherlands Elsevier Inc 01.06.2025
Elsevier
ISSN: 2352-3409
Description
Summary: The Balinese language has a complex and distinctive system of speech levels, yet it remains underrepresented in speech technologies such as Text-to-Speech (TTS) and speech recognition. Although Balinese is a linguistically rich regional language, efforts to digitize it are still underdeveloped, which limits research in natural language processing (NLP) and the application of voice technologies for regional languages. The scarcity of Balinese speech datasets is a major obstacle to developing these technologies. This research therefore aims to build a dataset of audio recordings from native Balinese speakers covering the various language levels, to support TTS systems, speech recognition, and voice-to-text technology. The data were acquired by recording native Balinese speakers of the Badung dialect. The recordings were then denoised to improve audio quality and categorized by Balinese politeness level (Alus Singgih, Alus Sor, Alus Mider, Mider, and Andap); additional phrases and alphabet recordings were included to broaden the dataset's variety. The resulting dataset consists of 1187 recordings that reflect a wide range of social variation in Balinese. By providing this resource, the research contributes to the development of speech-based technologies, supports the preservation of Balinese in the digital age, and opens further research opportunities in NLP for low-resource languages.
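The categorization step described in the summary could be checked with a small tally like the one below. This is only a sketch under stated assumptions: the record does not describe the dataset's file layout, so the manifest of (filename, category) pairs, the file names, and the labels "Phrases" and "Alphabet" are all hypothetical. Only the five politeness level names come from the summary.

```python
from collections import Counter

# The five politeness levels are named in the summary; "Phrases" and
# "Alphabet" are hypothetical labels for the additional material.
CATEGORIES = {
    "Alus Singgih", "Alus Sor", "Alus Mider", "Mider", "Andap",
    "Phrases", "Alphabet",
}


def count_by_category(manifest):
    """Count recordings per category, rejecting unknown labels."""
    counts = Counter()
    for filename, category in manifest:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category {category!r} for {filename}")
        counts[category] += 1
    return counts


# Made-up entries for illustration; not actual file names from the dataset.
example_manifest = [
    ("rec_0001.wav", "Alus Singgih"),
    ("rec_0002.wav", "Andap"),
    ("rec_0003.wav", "Andap"),
]
counts = count_by_category(example_manifest)
```

Run over the full dataset, the per-category counts from such a tally would be expected to sum to the 1187 recordings reported in the summary.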
DOI:10.1016/j.dib.2025.111528