OpCodeBERT: A Method for Python Code Representation Learning by BERT With Opcode
| Published in: | IEEE Transactions on Software Engineering, Vol. 51, No. 11, pp. 3103-3116 |
|---|---|
| Main Authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: IEEE, 01.11.2025 (IEEE Computer Society) |
| Subjects: | |
| ISSN: | 0098-5589, 1939-3520 |
| Summary: | Programming language pre-training models have made significant progress in code representation learning in recent years. Although various methods, such as data flow and the Abstract Syntax Tree (AST), have been widely applied to enhance code representation, no research literature to date has specifically explored the use of the intermediate code of source programs for code representation. For example, the intermediate code of Python, namely opcode, not only captures the stack-based data input and output processes during program execution, but also describes the concrete execution order and control-flow information. These features are absent from source code, data flow, the AST, and other structures, or are difficult to reflect directly. In this paper, we propose OpCodeBERT ( https://github.com/qcy321/OpCodeBERT ), the first approach to utilize Python opcode for code representation learning; it improves code representation by jointly encoding the underlying execution logic, comments, and source code. To support training on opcode, we filter the public datasets to exclude unparsable data and propose a novel opcode-to-sequence mapping method that converts opcodes into a form suitable for model input (see the illustrative sketch after this record). In addition, we pre-train OpCodeBERT with two-stage masked language modeling (MLM) and multi-modal contrastive learning. To evaluate the effectiveness of OpCodeBERT, we conduct experiments on multiple downstream tasks. The results show that OpCodeBERT performs excellently on these tasks, validating the effectiveness of incorporating opcode and further demonstrating the feasibility of this method for code representation learning. |
|---|---|
| DOI: | 10.1109/TSE.2025.3610244 |
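
The summary describes mapping Python opcode into a linear sequence suitable for model input. The snippet below is a minimal sketch of what such a conversion can look like, using Python's standard `dis` module to extract bytecode instructions and flatten them into opcode-name and argument tokens. The function name `opcode_to_sequence` and the exact token format are illustrative assumptions, not the mapping actually used by OpCodeBERT.

```python
import dis
import types


def opcode_to_sequence(source: str) -> list[str]:
    """Flatten the opcode stream of a Python snippet into a token list.

    Illustrative only: OpCodeBERT's real opcode-to-sequence mapping may
    tokenize instructions and arguments differently.
    """
    tokens: list[str] = []

    def walk(code: types.CodeType) -> None:
        for ins in dis.get_instructions(code):
            # Keep the opcode name, which reflects the stack operation
            # and control-flow behaviour of the instruction.
            tokens.append(ins.opname)
            # Keep a readable form of the argument (names, constants,
            # jump targets) when the instruction has one.
            if ins.argrepr:
                tokens.append(ins.argrepr)
        # Recurse into nested code objects such as function bodies.
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                walk(const)

    walk(compile(source, "<snippet>", "exec"))
    return tokens


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(opcode_to_sequence(snippet))
```

A sequence like this could then be paired with the corresponding source code and comment as the opcode modality in the multi-modal pre-training setup the summary outlines.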