Enhancing Semi-Supervised Learning in Educational Data Mining Through Synthetic Data Generation Using Tabular Variational Autoencoder

This paper presents TVAE-SSL, a novel semi-supervised learning (SSL) paradigm that involves Tabular Variational Autoencoder (TVAE)-sampled synthetic data injection into the training process to enhance model performance under low-label data conditions in Educational Data Mining tasks. The algorithm b...

Celý popis

Uloženo v:
Podrobná bibliografie
Vydáno v:Algorithms Ročník 18; číslo 10; s. 663
Hlavní autoři: Kostopoulos, Georgios, Fazakis, Nikos, Kotsiantis, Sotiris, Dimakopoulos, Yiannis
Médium: Journal Article
Jazyk:angličtina
Vydáno: Basel MDPI AG 01.10.2025
Témata:
ISSN:1999-4893, 1999-4893
On-line přístup:Získat plný text
Tagy: Přidat tag
Žádné tagy, Buďte první, kdo vytvoří štítek k tomuto záznamu!
Popis
Shrnutí:This paper presents TVAE-SSL, a novel semi-supervised learning (SSL) paradigm that involves Tabular Variational Autoencoder (TVAE)-sampled synthetic data injection into the training process to enhance model performance under low-label data conditions in Educational Data Mining tasks. The algorithm begins with training a TVAE on the given labeled data to generate imitative synthetic samples of the underlying data distribution. These synthesized samples are treated as additional unlabeled data and combined with the original unlabeled ones in order to form an augmented training pool. A standard SSL algorithm (e.g., Self-Training) is trained using a base classifier (e.g., Random Forest) on the combined dataset. By expanding the pool of unlabeled samples with realistic synthetic data, TVAE-SSL improves training sample quantity and diversity without introducing label noise. Large-scale experiments on a variety of datasets demonstrate that TVAE-SSL can outperform baseline supervised models in the full labeled dataset in terms of accuracy, F1-score and fairness metrics. Our results demonstrate the capacity of generative augmentation to enhance the effectiveness of semi-supervised learning for tabular data.
Bibliografie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1999-4893
1999-4893
DOI:10.3390/a18100663