Towards More Trustworthy Deep Code Models by Enabling Out-of-Distribution Detection

Bibliographic details
Published in:	Proceedings / International Conference on Software Engineering, pp. 769-781
Main authors:	Yan, Yanfu; Duong, Viet; Shao, Huajie; Poshyvanyk, Denys
Format:	Conference paper
Language:	English
Published:	IEEE, 26 April 2025
Keywords:	Code Models; Codes; Contrastive learning; Data models; Measurement; OOD detection; Predictive models; Reliability; Software engineering; Testing; Training; Training data; Trustworthy ML
ISSN:	1558-1225
Online access:	Full text

Description
Abstract:	Numerous machine learning (ML) models have been developed, including those for software engineering (SE) tasks, under the assumption that training and testing data come from the same distribution. However, training and testing distributions often differ, as training datasets rarely encompass the entire distribution, while the testing distribution tends to shift over time. Hence, when confronted with out-of-distribution (OOD) instances that differ from the training data, a reliable and trustworthy SE ML model must be capable of detecting them, either to abstain from making predictions or to forward these OOD instances to appropriate models handling other categories or tasks. In this paper, we develop two types of SE-specific OOD detection models for code: unsupervised and weakly-supervised. The unsupervised OOD detection approach is trained solely on in-distribution samples, while the weakly-supervised approach utilizes a tiny number of OOD samples to further enhance detection performance in various OOD scenarios. Extensive experimental results demonstrate that our proposed methods significantly outperform the baselines in detecting OOD samples from four different scenarios simultaneously and also positively impact a main code understanding task.
DOI:	10.1109/ICSE55347.2025.00177
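
Illustration:	The abstract describes a model that detects OOD inputs and abstains from predicting on them. The sketch below shows that general idea only and is not the authors' method: it swaps in a simple distance-to-centroid score with a threshold calibrated on in-distribution samples alone (matching the unsupervised setting in spirit). All names (`centroid`, `ood_score`, `fit_threshold`, `predict_or_abstain`, `toy_classifier`), the 2-D toy vectors, and the 0.95 quantile are assumptions made for illustration.

```python
import math
from statistics import mean


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [mean(column) for column in zip(*vectors)]


def ood_score(vec, center):
    """Euclidean distance to the in-distribution centroid.

    Higher means less similar to the training data. This is a stand-in
    score, not the scoring function used in the paper.
    """
    return math.dist(vec, center)


def fit_threshold(id_vectors, center, quantile=0.95):
    """Choose a threshold from in-distribution scores only (no OOD data)."""
    scores = sorted(ood_score(v, center) for v in id_vectors)
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


def predict_or_abstain(vec, center, threshold, classify):
    """Return classify(vec) for in-distribution inputs, None to abstain."""
    if ood_score(vec, center) > threshold:
        return None  # flagged as OOD: abstain, or forward to another model
    return classify(vec)


# Hypothetical 2-D "embeddings" of in-distribution training samples.
id_train = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.2, 0.0], [0.0, -0.2]]
center = centroid(id_train)
threshold = fit_threshold(id_train, center, quantile=0.95)


def toy_classifier(vec):
    """Hypothetical downstream task model; always returns one label."""
    return "label_a"


near = predict_or_abstain([0.05, 0.0], center, threshold, toy_classifier)
far = predict_or_abstain([3.0, 3.0], center, threshold, toy_classifier)
print(near, far)
```

In this toy setup the point near the training data is passed to the classifier, while the distant point is rejected. A weakly-supervised variant, as the abstract outlines, would additionally use a few labeled OOD samples, for example to tune the threshold; that step is not shown here.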