Improving Cross-Language Code Clone Detection via Code Representation Learning and Graph Neural Networks

Code clone detection is an important aspect of software development and maintenance. The extensive research in this domain has helped reduce the complexity and increase the robustness of source code, thereby assisting bug detection tools. However, the majority of the clone detection literature is co...

Full description

Saved in:
Bibliographic Details
Published in:IEEE transactions on software engineering Vol. 49; no. 11; pp. 4846 - 4868
Main Authors: Mehrotra, Nikita, Sharma, Akash, Jindal, Anmol, Purandare, Rahul
Format: Journal Article
Language:English
Published: New York IEEE 01.11.2023
IEEE Computer Society
Subjects:
ISSN:0098-5589, 1939-3520
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Code clone detection is an important aspect of software development and maintenance. The extensive research in this domain has helped reduce the complexity and increase the robustness of source code, thereby assisting bug detection tools. However, the majority of the clone detection literature is confined to a single language. With the increasing prevalence of cross-platform applications, functionality replication across multiple languages is common, resulting in code fragments having similar functionality but belonging to different languages. Since such clones are syntactically unrelated, single language clone detection tools are not applicable in their case. In this article, we propose a semi-supervised deep learning-based tool Rubhus , capable of detecting clones across different programming languages. Rubhus uses the control and data flow enriched abstract syntax trees (ASTs) of code fragments to leverage their syntactic and structural information and then applies graph neural networks (GNNs) to extract this information for the task of clone detection. We demonstrate the effectiveness of our proposed system through experiments conducted over datasets consisting of Java, C, and Python programs and evaluate its performance in terms of precision, recall, and F1 score. Our results indicate that Rubhus outperforms the state-of-the-art cross-language clone detection tools.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0098-5589
1939-3520
DOI:10.1109/TSE.2023.3311796