A Multiple Representation Transformer with Optimized Abstract Syntax Tree for Efficient Code Clone Detection
Saved in:
| Published in: | Proceedings / International Conference on Software Engineering, pp. 281-293 |
|---|---|
| Main authors: | , , , |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 26.04.2025 |
| Subject: | |
| ISSN: | 1558-1225 |
| Online access: | Get full text |
| Summary: | Over the past decade, the application of deep learning to code clone detection has produced remarkable results. However, current approaches have two limitations: (a) code representation methods with low information utilization, such as the vanilla Abstract Syntax Tree (AST), introduce information redundancy that degrades performance; (b) clone detection is inefficient at evaluation time, incurring excessive time costs in practical use. In this paper, we propose a Multiple Representation Transformer with an Optimized Abstract Syntax Tree (MRT-OAST), which introduces an efficient code representation method while achieving competitive performance. Specifically, MRT-OAST strategically prunes and enhances the AST, then uses both pre-order and post-order traversals to produce two distinct representations. To speed up evaluation, MRT-OAST uses a pure Siamese Network and cosine similarity to compare code pairs. Our approach reduces AST sequences to 40% and 39% of their original length in Java and C/C++, respectively, while preserving structural information. On code clone detection tasks, our model surpasses state-of-the-art approaches on OJClone and Google Code Jam. On BigCloneBench, our model evaluates 5x faster than the state-of-the-art lightweight model and 563x faster than the BERT-based model, with only 0.3% and 0.9% decreases in F1-score, respectively. |
|---|---|
| DOI: | 10.1109/ICSE55347.2025.00050 |
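
The summary describes three mechanisms: pruning the AST, serializing it with both pre-order and post-order traversals, and comparing Siamese embeddings by cosine similarity. The sketch below illustrates those ideas only in broad strokes: Python's built-in `ast` module stands in for the Java/C++ parsers used in the paper, the `PRUNED_TYPES` set is a hypothetical stand-in for the paper's OAST pruning rules, and `bag_of_nodes` is a toy substitute for the trained Siamese Transformer encoder.

```python
# Illustrative sketch only: Python's `ast` module replaces the paper's
# Java/C++ parsers; PRUNED_TYPES and bag_of_nodes are hypothetical
# stand-ins for the OAST pruning rules and the Siamese encoder.
import ast

PRUNED_TYPES = {"Load", "Store", "Del"}  # assumed-trivial nodes to drop


def preorder(node):
    """Yield node-type names in pre-order (parent before children)."""
    name = type(node).__name__
    if name not in PRUNED_TYPES:
        yield name
    for child in ast.iter_child_nodes(node):
        yield from preorder(child)


def postorder(node):
    """Yield node-type names in post-order (children before parent)."""
    for child in ast.iter_child_nodes(node):
        yield from postorder(child)
    name = type(node).__name__
    if name not in PRUNED_TYPES:
        yield name


def bag_of_nodes(tokens, vocab):
    """Toy 'encoder': node-type counts instead of a learned embedding."""
    counts = dict.fromkeys(vocab, 0)
    for t in tokens:
        counts[t] += 1
    return [counts[k] for k in sorted(counts)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0


code_a = "def add(a, b):\n    return a + b"
code_b = "def plus(x, y):\n    return x + y"

# Two traversal orders give two representations of the same pruned tree.
seq_a = list(preorder(ast.parse(code_a))) + list(postorder(ast.parse(code_a)))
seq_b = list(preorder(ast.parse(code_b))) + list(postorder(ast.parse(code_b)))

vocab = set(seq_a) | set(seq_b)
print(cosine(bag_of_nodes(seq_a, vocab), bag_of_nodes(seq_b, vocab)))  # ~1.0: structural clone
```

The point of the Siamese-plus-cosine design, as the summary suggests, is that each code fragment is encoded once and every pairwise comparison then costs only a vector dot product, which is where the reported evaluation speedup comes from.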