Synthetic Data Generation for Enhancing Text Classification Performance Using Conditional Variational Autoencoders

Detailed bibliography
Title: Synthetic Data Generation for Enhancing Text Classification Performance Using Conditional Variational Autoencoders
Authors: Ömer Faruk Cebeci, Mehmet Fatih Amasyali
Source: Orclever Proceedings of Research and Development. 5:498-514
Publisher information: Orclever Science and Research Group, 2024.
Publication year: 2024
Description: This study investigates the effect of generating synthetic data using a Conditional Variational Autoencoder (CVAE) model on classification performance in scenarios where the amount of available data is limited or the data sources are constrained. Experiments were conducted on datasets with varying numbers of classes, where synthetic data were produced through two different methods using CVAE models. The first method generated sentences from noise by sampling from a Gaussian distribution. The second method provided the first half of a real sentence to the model, which then completed the remaining half to produce synthetic data. The synthetic datasets generated by both methods were integrated into the original training sets at various ratios, and the resulting changes in classification performance were observed. Both synthetic data generation approaches significantly improved the classification performance. However, as the amount of data used to train the classifiers increased, the marginal benefit of incorporating synthetic data decreased. These findings suggest that producing and utilizing synthetic data can be an effective strategy in text classification tasks that suffer from data scarcity.
Document type: Article
ISSN: 2980-020X
DOI: 10.56038/oprd.v5i1.581
Rights: CC BY NC
Accession number: edsair.doi...........fb059d1560948c7e82a60878019c846d
Database: OpenAIRE
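
The abstract mentions integrating synthetic data into the original training sets at various ratios but does not say how a ratio is defined. The following is a minimal sketch under one assumed reading: the ratio is the number of synthetic examples added, expressed as a fraction of the real training set size. The function name `mix_synthetic`, the toy sentences, and the ratio values are hypothetical and are not taken from the paper.

```python
import random


def mix_synthetic(real, synthetic, ratio, seed=0):
    """Return the real examples plus a random subset of synthetic ones.

    Assumption (not from the paper): `ratio` is the number of synthetic
    examples to add, as a fraction of the real training set size.
    """
    if ratio < 0:
        raise ValueError("ratio must be non-negative")
    rng = random.Random(seed)
    # Never request more synthetic examples than are available.
    n_extra = min(len(synthetic), round(ratio * len(real)))
    mixed = list(real) + rng.sample(list(synthetic), n_extra)
    rng.shuffle(mixed)
    return mixed


# Toy (text, label) pairs, purely illustrative.
real = [("great movie", 1), ("terrible plot", 0), ("loved it", 1), ("boring", 0)]
synthetic = [("great film overall", 1), ("terrible acting", 0), ("loved the cast", 1)]

for ratio in (0.0, 0.5, 1.0):
    train_set = mix_synthetic(real, synthetic, ratio)
    print(ratio, len(train_set))
```

All real examples are always kept, so a ratio of 0.0 reproduces the baseline training set. Fixing the seed keeps the mixture reproducible, which allows a classifier trained at each ratio to be compared on the same data.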