Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models.

Detailed Bibliography
Title: Learning in the Wild: Towards Leveraging Unlabeled Data for Effectively Tuning Pre-trained Code Models.
Authors: Gao, Shuzheng, Mao, Wenxin, Gao, Cuiyun, Li, Li, Hu, Xing, Xia, Xin, Lyu, Michael R.
Source: ICSE: International Conference on Software Engineering; 2024, p1-13, 13p
Subjects: SUPERVISED learning, SOURCE code, LOSS functions (Statistics), INFORMATION retrieval, TASK performance, DATA management
Abstract: Pre-trained code models have recently achieved substantial improvements on many code intelligence tasks. These models are first pre-trained on large-scale unlabeled datasets in a task-agnostic manner using self-supervised learning, and then fine-tuned on labeled datasets for downstream tasks. However, the labeled datasets are usually limited in size because labeling requires intensive human effort, which may hinder the performance of pre-trained code models on specific tasks. One possible mitigation is to leverage large-scale unlabeled data in the tuning stage through pseudo-labeling, i.e., generating pseudo labels for unlabeled data and further training the pre-trained code models on the pseudo-labeled data. However, directly employing the pseudo-labeled data introduces a large amount of noise, i.e., incorrect labels, leading to suboptimal performance. How to effectively leverage the noisy pseudo-labeled data is a challenging yet under-explored problem. In this paper, we propose a novel approach named HINT to improve pre-trained code models with large-scale unlabeled datasets by better utilizing the pseudo-labeled data. HINT includes two main modules: hybrid pseudo-labeled data selection and noise-tolerant training. In the hybrid pseudo-labeled data selection module, to improve robustness, we measure the quality of pseudo labels directly through training loss and additionally employ a retrieval-based method to filter out low-quality pseudo-labeled data. The noise-tolerant training module further mitigates the influence of errors in pseudo labels by training the model with a noise-tolerant loss function and by regularizing the consistency of model predictions. We evaluate the effectiveness of HINT on three popular code intelligence tasks: code summarization, defect detection, and assertion generation. We build our method on top of three popular open-source pre-trained code models. The experimental results show that HINT can better leverage unlabeled data in a task-specific way and provides complementary benefits for pre-trained models, e.g., improving the best baseline model by 15.33%, 16.50%, and 8.98% on code summarization, defect detection, and assertion generation, respectively. [ABSTRACT FROM AUTHOR]
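The abstract describes two mechanisms: filtering pseudo-labeled examples (by training loss, plus a retrieval-based check against labeled data) and tuning with a noise-tolerant loss alongside a prediction-consistency regularizer. The PyTorch sketch below shows how such a pipeline is commonly assembled for a classification task like defect detection; the function names, the loss threshold, the choice of generalized cross entropy (Zhang & Sabuncu, 2018) as the noise-tolerant loss, and the dropout-based consistency term are illustrative assumptions, not HINT's actual implementation.

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def select_pseudo_labels(model, unlabeled_loader, loss_threshold=0.3):
        # Generate pseudo labels and keep only low-loss (likely correct) ones.
        # Loss-based filtering approximates the abstract's "measuring the
        # quality of pseudo labels through training loss"; the retrieval-based
        # filter the paper also uses is omitted here for brevity.
        model.eval()
        kept_inputs, kept_labels = [], []
        for inputs in unlabeled_loader:  # assumes each batch is a tensor
            logits = model(inputs)
            pseudo = logits.argmax(dim=-1)
            # Cross entropy of the model against its own prediction:
            # low loss ~ high confidence in the pseudo label.
            per_sample_loss = F.cross_entropy(logits, pseudo, reduction="none")
            keep = per_sample_loss < loss_threshold
            kept_inputs.append(inputs[keep])
            kept_labels.append(pseudo[keep])
        return torch.cat(kept_inputs), torch.cat(kept_labels)

    def gce_loss(logits, targets, q=0.7):
        # Generalized cross entropy: a standard noise-tolerant loss that
        # interpolates between cross entropy (q -> 0) and MAE (q = 1).
        probs = F.softmax(logits, dim=-1)
        p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
        return ((1.0 - p_true.pow(q)) / q).mean()

    def consistency_loss(model, inputs):
        # Penalize divergence between two stochastic forward passes (dropout
        # active in train mode), one common way to regularize the consistency
        # of model predictions.
        log_p1 = F.log_softmax(model(inputs), dim=-1)
        p2 = F.softmax(model(inputs), dim=-1)
        return F.kl_div(log_p1, p2, reduction="batchmean")

A tuning step would then combine the two terms on the selected data, e.g. loss = gce_loss(model(x), y_pseudo) + lambda_c * consistency_loss(model, x), with the model in train mode so dropout makes the two passes differ; lambda_c and all hyperparameters above are placeholders to be tuned per task.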
Database: Complementary Index