Optimizing Pre-trained Code Embeddings with Triplet Loss for Code Smell Detection

Bibliographic Details
Published in: IEEE Access, Volume 13, p. 1
Main authors: Nizam, Ali; Islamoglu, Ertugrul; Adali, Omer Kerem; Aydin, Musa
Format: Journal Article
Language: English
Publication details: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2025
ISSN: 2169-3536
Description
Summary: Code embedding represents code semantics in vector form. While existing code embedding methods have been successfully employed for various source code analysis tasks, further studies are needed to match the performance and functionality of static code analysis tools and to improve code embedding for better code analysis capabilities. Additionally, there is a need to standardize augmentation methods for code embedding, as in the image processing domain, to facilitate the development of more effective embedding models. This study develops a contrastive learning-based system to explore the potential of a generic method for enhancing code embedding in code classification tasks. A triplet loss-based deep learning network is designed to optimize in-class similarity and increase inter-class distance for the code classification task of code smell detection. An experimental dataset containing code in the Java, Python, and PHP programming languages and four different code smells was created by collecting code from open-source repositories on GitHub. We evaluated the proposed system with the widely used BERT, CodeBERT, and GraphCodeBERT pretrained models for creating code embeddings. Our findings indicate that the proposed system may improve accuracy by an average of 8% and up to 13% across models. These results suggest that incorporating contrastive learning into code representation generation as a preprocessing step may provide opportunities for further performance improvements in code analysis.
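
For readers unfamiliar with it, triplet loss minimizes L(a, p, n) = max(d(a, p) - d(a, n) + margin, 0), pulling an anchor embedding a toward a same-class positive p and pushing it away from a different-class negative n. The sketch below is a minimal illustration of this idea on top of a pretrained encoder, not the authors' implementation; the microsoft/codebert-base checkpoint, mean pooling, and margin value are illustrative assumptions rather than the paper's exact configuration.

    import torch.nn as nn
    from transformers import AutoTokenizer, AutoModel

    # Illustrative sketch (assumed setup, not the paper's architecture):
    # fine-tune a pretrained code encoder with PyTorch's triplet margin loss.
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    encoder = AutoModel.from_pretrained("microsoft/codebert-base")
    triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

    def embed(snippets):
        # Mean-pool the final hidden states into one vector per snippet.
        batch = tokenizer(snippets, padding=True, truncation=True,
                          return_tensors="pt")
        hidden = encoder(**batch).last_hidden_state          # (batch, seq, dim)
        mask = batch["attention_mask"].unsqueeze(-1).float() # zero out padding
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

    # Anchor and positive share a code smell class; the negative does not.
    anchor = embed(["public void run() { /* long method example */ }"])
    positive = embed(["def process():\n    pass  # another long-method case"])
    negative = embed(["<?php function f() { return 1; } ?>"])

    loss = triplet_loss(anchor, positive, negative)
    loss.backward()  # pulls same-class embeddings together, pushes classes apart

In practice, training would iterate over many such triplets mined from the labeled multi-language dataset before the tuned embeddings are passed to a downstream code smell classifier.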
DOI: 10.1109/ACCESS.2025.3542566