SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations
| Published in: | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), pp. 01-13 |
|---|---|
| Main Authors: | Niu, Changan; Li, Chuanyi; Ng, Vincent; Ge, Jidong; Huang, Liguo; Luo, Bin |
| Format: | Conference Proceeding |
| Language: | English |
| Published: | ACM, 01.05.2022 |
| Subjects: | code representation learning; Codes; Computer architecture; Decoding; Natural languages; pre-training; Representation learning; sequence-to-sequence; Task analysis; Transformers |
| ISSN: | 1558-1225 |
| Online Access: | Get full text |
| Abstract | Recent years have seen the successful application of large pretrained models to code representation learning, resulting in substantial improvements on many code-related downstream tasks. But there are issues surrounding their application to SE tasks. First, the majority of the pre-trained models focus on pre-training only the encoder of the Transformer. For generation tasks that are addressed using models with the encoder-decoder architecture, however, there is no reason why the decoder should be left out during pre-training. Second, many existing pre-trained models, including state-of-the-art models such as T5-learning, simply reuse the pretraining tasks designed for natural languages. Moreover, to learn the natural language description of source code needed eventually for code-related tasks such as code summarization, existing pretraining tasks require a bilingual corpus composed of source code and the associated natural language description, which severely limits the amount of data for pre-training. To this end, we propose SPT-Code, a sequence-to-sequence pre-trained model for source code. In order to pre-train SPT-Code in a sequence-to-sequence manner and address the aforementioned weaknesses associated with existing pre-training tasks, we introduce three pre-training tasks that are specifically designed to enable SPT-Code to learn knowledge of source code, the corresponding code structure, as well as a natural language description of the code without relying on any bilingual corpus, and eventually exploit these three sources of information when it is applied to downstream tasks. Experimental results demonstrate that SPT-Code achieves state-of-the-art performance on five code-related downstream tasks after fine-tuning. |
|---|---|
| Author | Li, Chuanyi; Luo, Bin; Ng, Vincent; Ge, Jidong; Huang, Liguo; Niu, Changan |
| Author_xml | 1. Niu, Changan (nougatca@qq.com), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 2. Li, Chuanyi (lcy@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 3. Ng, Vincent (vince@hlt.utdallas.edu), Human Language Technology Research Institute, University of Texas at Dallas, Richardson, Texas, USA; 4. Ge, Jidong (gjd@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China; 5. Huang, Liguo (lghuang@lyle.smu.edu), Southern Methodist University, Dept. of Computer Science, Dallas, Texas, USA; 6. Luo, Bin (luobin@nju.edu.cn), Nanjing University, State Key Laboratory for Novel Software Technology, Nanjing, China |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1145/3510003.3510096 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan (POP) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP) 1998-present |
| Discipline | Computer Science |
| EISBN | 9781450392211; 1450392210 |
| EISSN | 1558-1225 |
| EndPage | 13 |
| ExternalDocumentID | 9793930 |
| Genre | orig-research |
| GrantInformation_xml | National Natural Science Foundation of China, grants 61802167 and 61802095 (funder ID 10.13039/501100001809); Natural Science Foundation of Jiangsu Province, China, grant BK20201250 (funder ID 10.13039/501100004608); NSF, grant 2034508 (funder ID 10.13039/100000001) |
| ISICitedReferencesCount | 72 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 13 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-May |
| PublicationDateYYYYMMDD | 2022-05-01 |
| PublicationDecade | 2020 |
| PublicationTitle | 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE) |
| PublicationTitleAbbrev | ICSE |
| PublicationYear | 2022 |
| Publisher | ACM |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 01 |
| SubjectTerms | code representation learning; Codes; Computer architecture; Decoding; Natural languages; pre-training; Representation learning; sequence-to-sequence; Task analysis; Transformers |
| Title | SPT-Code: Sequence-to-Sequence Pre-Training for Learning Source Code Representations |
| URI | https://ieeexplore.ieee.org/document/9793930 |