Optimizing Pre-trained Code Embeddings with Triplet Loss for Code Smell Detection
| Published in: | IEEE Access, Volume 13, p. 1 |
|---|---|
| Main Authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Publication Details: | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2025 |
| ISSN: | 2169-3536 |
| Online Access: | Get full text |
| Abstract: | Code embedding represents code semantics in vector form. While existing code embedding methods have been successfully employed for various source code analysis tasks, further studies are needed to match the performance and functionality of static code analysis tools and to improve code embeddings for better code analysis capabilities. Additionally, there is a need to standardize augmentation methods for code embedding, as in the image processing domain, to facilitate the development of more effective embedding models. This study develops a contrastive learning-based system to explore the potential of a generic method for enhancing code embeddings for code classification tasks. A triplet-loss-based deep learning network is designed to increase intra-class similarity and inter-class distance for the code classification task of code smell detection. An experimental dataset containing code in the Java, Python, and PHP programming languages and covering four different code smells is created by collecting code from open-source repositories on GitHub. We evaluated the proposed system with the widely used BERT, CodeBERT, and GraphCodeBERT pre-trained models to create code embeddings. Our findings indicate that the proposed system may improve accuracy by an average of 8% and by up to 13% across these models. These results suggest that incorporating contrastive learning techniques into the generation of code representation vectors as a preprocessing step may provide opportunities for further performance improvements in code analysis. |
|---|---|
| DOI: | 10.1109/ACCESS.2025.3542566 |
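The triplet-loss refinement the abstract describes can be illustrated with a minimal sketch: a frozen pre-trained encoder produces snippet embeddings, and a small trainable head is optimized so that two snippets exhibiting the same code smell (anchor and positive) end up closer together than a snippet from a different class (negative). The sketch below assumes a PyTorch/Hugging Face setup with the public microsoft/codebert-base checkpoint; the mean pooling, 128-dimensional projection head, margin of 1.0, and toy snippets are illustrative assumptions, not the authors' implementation.

```python
# Minimal triplet-loss sketch over pre-trained code embeddings.
# Hyperparameters are assumptions; this is not the paper's released code.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
encoder = AutoModel.from_pretrained("microsoft/codebert-base")
encoder.eval()  # the pre-trained encoder stays frozen in this sketch

projection = nn.Linear(768, 128)                  # trainable head (size assumed)
triplet_loss = nn.TripletMarginLoss(margin=1.0)   # margin is an assumption

def embed(snippets):
    """Tokenize snippets, mean-pool the last hidden state into one vector
    per snippet, then project it into the refined embedding space."""
    batch = tokenizer(snippets, padding=True, truncation=True,
                      max_length=512, return_tensors="pt")
    with torch.no_grad():                          # no gradients through the encoder
        hidden = encoder(**batch).last_hidden_state        # (B, T, 768)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # masked mean pooling
    return projection(pooled)                              # (B, 128)

# Toy triplet: anchor and positive share a smell (e.g. Long Method),
# the negative is a clean snippet. Real triplets come from labeled data.
anchor   = embed(["public void report() { /* very long method ... */ }"])
positive = embed(["public void export() { /* another long method ... */ }"])
negative = embed(["public int size() { return this.count; }"])

optimizer = torch.optim.Adam(projection.parameters(), lr=1e-4)
loss = triplet_loss(anchor, positive, negative)
loss.backward()                                    # gradients reach only the head
optimizer.step()
print(f"triplet loss: {loss.item():.4f}")
```

In the study's setting, such a head would be trained over many triplets mined from the labeled GitHub dataset before the refined embeddings are passed to a downstream code smell classifier.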