OpCodeBERT: A Method for Python Code Representation Learning by BERT With Opcode
| Published in: | IEEE Transactions on Software Engineering, Vol. 51, No. 11, pp. 3103-3116 |
|---|---|
| Main Authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | New York: IEEE, 01.11.2025 (IEEE Computer Society) |
| Subjects: | |
| ISSN: | 0098-5589, 1939-3520 |
| Summary: | Programming language pre-training models have made significant progress in code representation learning in recent years. Although various methods, such as data flow and the Abstract Syntax Tree (AST), have been widely applied to enhance code representation, no research literature to date has specifically explored the use of the intermediate code of source programs for code representation. For example, the intermediate code of Python, namely opcode, not only captures the stack-based data input and output processes during program execution, but also describes the concrete execution order and control-flow information. These features are absent from source code, data flow, the AST, and other structures, or are difficult to reflect directly. In this paper, we propose OpCodeBERT ( https://github.com/qcy321/OpCodeBERT ), the first approach to utilize Python opcode for code representation learning; it improves code representation by jointly encoding the underlying execution logic, comments, and source code. To support training on opcode, we filter the public datasets to exclude unparsable data and propose a novel opcode-to-sequence mapping method that converts opcodes into a form suitable for model input (see the illustrative sketch after this record). In addition, we pre-train OpCodeBERT with two-stage masked language modeling (MLM) and multi-modal contrastive learning. To evaluate the effectiveness of OpCodeBERT, we conduct experiments on multiple downstream tasks. The results show that OpCodeBERT performs excellently on these tasks, validating the effectiveness of incorporating opcode and further demonstrating the feasibility of this method for code representation learning. |
|---|---|
| DOI: | 10.1109/TSE.2025.3610244 |
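
The summary describes mapping Python opcode into a linear sequence suitable for model input. The snippet below is a minimal sketch of what such a conversion can look like, using Python's standard `dis` module to extract bytecode instructions and flatten them into opcode-name and argument tokens. The function name `opcode_to_sequence` and the exact token format are illustrative assumptions, not the mapping actually used by OpCodeBERT.

```python
import dis
import types


def opcode_to_sequence(source: str) -> list[str]:
    """Flatten the opcode stream of a Python snippet into a token list.

    Illustrative only: OpCodeBERT's real opcode-to-sequence mapping may
    tokenize instructions and arguments differently.
    """
    tokens: list[str] = []

    def walk(code: types.CodeType) -> None:
        for ins in dis.get_instructions(code):
            # Keep the opcode name, which reflects the stack operation
            # and control-flow behaviour of the instruction.
            tokens.append(ins.opname)
            # Keep a readable form of the argument (names, constants,
            # jump targets) when the instruction has one.
            if ins.argrepr:
                tokens.append(ins.argrepr)
        # Recurse into nested code objects such as function bodies.
        for const in code.co_consts:
            if isinstance(const, types.CodeType):
                walk(const)

    walk(compile(source, "<snippet>", "exec"))
    return tokens


if __name__ == "__main__":
    snippet = "def add(a, b):\n    return a + b\n"
    print(opcode_to_sequence(snippet))
```

A sequence like this could then be paired with the corresponding source code and comment as the opcode modality in the multi-modal pre-training setup the summary outlines.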